Scoring-and-Unfolding Trimmed Tree Assembler:

Algorithms for Assembling Genome Sequences

Accurately and Efficiently

by

Giuseppe Narzisi

A dissertation submitted in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of

Courant Institute of Mathematical Sciences

New York University

May 2011

Bud Mishra — Advisor c Giuseppe Narzisi

All Rights Reserved, 2011 To Valentina

&

My family

iii Acknowledgements

The work presented in this dissertation would not have been possible without the contribution of many great people.

My advisor, Bud Mishra, has definitely played the most important role during my studies and research at New York University. It has been a great privilege to work with him for so many years, since I first entered the

Ph.D. program at NYU. He has been a scientific mentor, a human guide and an enormous source of ideas thanks to his extraordinary and deep proficiency in many different fields. In a time when sequence assembly was assumed to be an almost solved puzzle, he encouraged me to rethink again about the problem without being biased by the currently developed models and helped me to overcome the many obstacles that I faced as I was making progress in my studies. He has certainly made the major contribution to shape the scientific research presented in this thesis.

I was fortunate to have a very diverse thesis committee which has provided many important insights during the preparation of the dissertation. In particular

I am grateful to Professor Ernest Davis for taking the time to read and review my thesis with so much care. By raising many fundamental questions, he has al-

iv lowed me to clarify several important aspects of the thesis for the general reader.

I thank Professor Michael Schatz for making sure that the theoretical framework presented in the dissertation was correctly presented and related to the relevant prior art extensively. I also express my thanks to Professor Alan Siegel for im- proving the presentation and stile of the thesis. He has been not just a scientific mentor but also a tutoring figure in virtue of his high commitment to the value of education. Finally, I would like to thank Professor Raul Rabadan for suggesting many areas of application of the tools developed in this thesis.

One of the contributions of the dissertation (TotalReCaller) is the result of the joint work with Fabian Menges. I am particularly grateful to him for such collaboration as well as for all the valuable and energetic discussions that we had on many of the topics presented in this thesis. I also would like to thank all the members of the NYU Group for creating a unique and exciting research environment during my work and study at NYU. In particular

I would like to thank: Salvatore Paxia, Fabian Menges, Matthias Heymann, An- dreas Witzel, Andrew Sundstrom, Samantha Kleinberg, Antonina Mitrofanova,

Iuliana Ionita, Venkatesh P. Mysore, Marco Antoniotti, Bing Sun, Ofer Gill, Josh

Mincer, Seongho Ryu, Shaila Musharoff, Fang Cheng, Raoul-Sam Daruwala, Yi

(Joey) Zhou, Peyman Faratin, Pierre Franquin, Athena Shilin, and Thomas S.

Anantharaman.

Five years of studies have been a long journey with many ups and downs but, even during the most difficult times, there was always a constant figure in my life: my love and wife Valentina. She has been an extraordinary source of love and

v stability in my life, thanks to her continuous support and trust. She has clearly played an important role for the successful completion of this thesis.

Finally, I thank my whole family, my parent and my younger brother, for being always very supportive even when I decided to embrace the American dream. Despite the big ocean dividing our lives, they have always been caring as

I have never left them.

vi Abstract

The recent advances in DNA sequencing technology and their many potential applications to and Medicine have rekindled enormous interest in sev- eral classical algorithmic problems at the core of Genomics and Computational

Biology: primarily, the whole-genome sequence assembly problem (WGSA). Two decades back, in the context of the , the problem had received unprecedented scientific prominence: its computational complexity and intractability were thought to have been well understood; various competitive heuristics, thoroughly explored and the necessary software, properly implemented and validated. However, several recent studies, focusing on the experimental val- idation of de novo assemblies, have highlighted several limitations of the current assemblers.

Intrigued by these negative results, this dissertation reinvestigates the algo- rithmic techniques required to correctly and efficiently assemble genomes. Mired by its connection to a well-known -complete combinatorial optimization prob- N P lem, historically, WGSA has been assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their priority, these methods opted to rely on local searches, and are thus inherently approximate, ambiguous or

vii error-prone. This dissertation presents a novel sequence assembler, SUTTA, that dispenses with the idea of limiting the solutions to just the approximated ones, and instead favors an approach that could potentially lead to an exhaustive

(exponential-time) search of all possible layouts but tames the complexity through constrained search (Branch-and-Bound) and quick identification and pruning of implausible solutions.

Complementary to this problem is the task of validating the generated as- semblies. Unfortunately, no commonly accepted method exists yet and widely used metrics to compare the assembled sequences emphasize only size, poorly capturing quality and accuracy. This dissertation also addresses these concerns by developing a more comprehensive metric, the Feature-Response Curve, that, using ideas from classical ROC (receiver-operating characteristic) curve, more faithfully captures the trade-off between contiguity and quality.

Finally, this dissertation demonstrates the advantages of a complete pipeline integrating base-calling (TotalReCaller) with assembly (SUTTA) in a Bayesian manner.

viii Contents

Dedication iii

Acknowledgements iv

Abstract vii

List of Figures xxi

List of Tables xxiii

List of Appendices xxiv

Introduction 1

1 Genome Sequencing and Assembly 9

1.1 Introduction...... 9

1.2 DNA:Deoxyribonucleicacid ...... 10

1.3 ShotgunSequencing...... 12

ix 1.4 Next Generation Sequencing and their

challenges ...... 14

1.5 Lander-Watermanstatistics ...... 15

1.6 Trade-off in sequencing technology ...... 20

1.7 AssemblyPipeline...... 21

1.8 History of the assembly of the

HumanGenome...... 27

2 Sequence Assembly: Problem and Complexity 32

2.1 Introduction...... 32

2.2 Thedovetail-pathframework...... 33

2.2.1 Basic definitions: reads, overlaps and layouts ...... 33

2.2.2 Min-length reconstruction theorem ...... 37

2.3 Shortest Superstring Problem (SSP )...... 39

2.4 Graph-Theoreticformulation...... 41

2.4.1 Strings,OverlapsandOverlapGraph ...... 41

2.4.2 StringGraph ...... 43

2.4.3 DeBruijngraph...... 47

2.5 Probability of unique reconstruction ...... 51

x 2.6 Sequence Assembly as a Constrained

OptimizationProblem ...... 54

2.6.1 Modeling sequencing errors ...... 55

2.6.2 AnewformulationofSAP ...... 55

2.6.3 Relationtothepriorart ...... 58

3 Sequence Assemblers and Assembly Paradigms 60

3.1 Introduction...... 60

3.2 A Historical Perspective on Sequence Assembly ...... 61

3.3 AssemblyParadigms ...... 64

3.3.1 Greedy...... 64

3.3.2 Graph-based...... 66

3.3.3 Seed-and-Extend ...... 69

4 SUTTA: Scoring-and-Unfolding Trimmed Tree Assembler 71

4.1 Introduction...... 71

4.2 HistoryandMotivation...... 72

4.3 SUTTAAlgorithm ...... 73

4.4 Overlap Score (Weighted transitivity) ...... 77

4.5 Nodeexpansion ...... 80

xi 4.6 SearchStrategy ...... 84

4.7 PruningtheTree ...... 85

4.8 Lookahead...... 87

4.9 Implementationdetails ...... 92

4.10 Short-ReadOverlapper ...... 93

5 Feature-Response Curve 95

5.1 Introduction...... 95

5.2 Assembly Comparison and Validation ...... 96

5.3 Feature-ResponseCurve ...... 98

5.4 Implementationdetails ...... 101

6 Experimental Comparison of De Novo Genome Assembly 102

6.1 Introduction...... 102

6.2 ExperimentalProtocol ...... 104

6.2.1 Benchmarks ...... 104

6.2.2 Assemblers ...... 109

6.3 Longreadsresults...... 110

6.4 Shortreadsresults ...... 120

6.5 Parametric complexity experiments ...... 124

xii 6.5.1 Overlapgraphcomplexity ...... 125

6.5.2 Trade-off between N50 and Overlap size k ...... 127

6.5.3 Feature-Response curve dynamics ...... 128

6.6 Computationalperformance ...... 128

7 Integrating Base-Calling, Error Correction and Assembly 131

7.1 Introduction...... 131

7.2 Base-CallingChallanges ...... 132

7.3 Source of Errors in Illumina Raw Sequencing Data ...... 135

7.4 TotalReCaller ...... 138

7.4.1 Linearerrormodelandfilter ...... 139

7.4.2 Base-by-base ...... 142

7.4.3 Beamsearchreadextension ...... 144

7.4.4 Scorefunctions ...... 146

7.5 Base-CallingResults ...... 147

7.5.1 Errorrates...... 147

7.5.2 Alignment rate and base-calling speed ...... 151

7.5.3 SNPs specificity and sensitivity ...... 151

7.6 Error Correction during Base-Calling ...... 153

xiii 7.6.1 Assemblyresults ...... 155

Conclusion 160

Appendices 166

Bibliography 193

xiv List of Figures

1.1 The double-helix structure of the DNA...... 11

1.2 Shotgunsequencing...... 13

1.3 Readcoverageillustration...... 16

1.4 Expected number of contigs and their average length as a function

of coverage for different values of the minimum detectable overlap θ. 18

1.5 Trade-off between read length and coverage...... 21

2.1 Two possible overlaps (illustration): left overlap is normal (both

reads pointing to the same forward direction) right overlap is innie

(the second read B is reverse complemented and is pointing in

the backward direction); The suffix predicate for the left (normal)

overlap is s.t. suffixπ(A)= true and suffixπ(B)= false...... 34 2.2 Overlaptaxonomy...... 35

2.3 Example of layout for a set of reads F = A,B,C,D,E,F,G { } N I N I N N with overlaps π(A,B), π(B,C), π(C,D), π(D,E), π(E,F ), π(F,G)...... 37 2.4 Example of compression due to a repeat...... 40

2.5 Example of mis-assembly using a string graph...... 46

xv 2.6 De Bruijn graph with parameter k = 2 for the list of strings L =

AAA, AAC, ACA, CAC, CAA, CGC, GCG ...... 49 { } 2.7 Plot of function f(λ)=(1 λ)e−λ in the range [0, 10]...... 53 −

3.1 Example of greedily merging three fragments (the numbers repre-

senttheoverlapsizes)...... 65

3.2 Example of layout representation and transformations in the OLC

framework...... 67

4.1 Example of transitivity relation: the overlap regions between reads

AB and BC shareanintersection...... 78

4.2 Contig construction: (i) the D-tree is constructed by generating

LEFT and RIGHT trees for the root node; (ii) best left and right

paths are selected and joined together; (iii) the reads layout is

computedforthesetofreadsinthefullpath...... 82

4.3 Example of transitivity pruning: expanding nodes B2,...,Bn can

be delayed because their overlap with read A is enforced by read B1. 86 4.4 Lookahead: the repeat boundary between reads B and C is re-

solved looking ahead in the subtree of B and C, and checking how

many and how well the mate-pair constrains are satisfied...... 88

4.5 Dead-end: short branches of overlaps that extend only for very few

steps. They typically associated with base errors located close to

thereadends...... 92

xvi 4.6 Bubble: false branches that reconnect to a single path after a small

number of steps. They are typically caused by single nucleotide

difference carried by a small subset of reads...... 92

4.7 Overlap computation using the trie data structure...... 94

6.1 Feature-Response curve comparison for S. epidermidis and Chro-

mosome Y (3Mbp of p11.2 region) genomes when no mate-pairs

information is used in the assembly...... 113

6.2 Feature-Response curve comparison for S. epidermidis and Chro-

mosome Y (3Mbp of p11.2 region) genomes when mate-pairs in-

formationisusedintheassembly...... 116

6.3 Dotplotsfor the Staphylococcus epidermidis. Assemblies produced

by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. . . . 118

6.4 Separate FR-curve comparison for each feature type for the S.

epidermidis genomeusingmate-pairs...... 119

6.5 Feature-Response curve comparison for the E. coli genome using

mate-pairshortreaddata...... 124

6.6 Overlap and read count distribution for E. coli...... 126

6.7 Relation between the min overlap parameter k and the N50 contig

size for S. aureus and E. coli...... 127

6.8 Feature-Response curve dynamics as a function of the minimum

overlap parameter k for E. coli usingmatepairdata...... 129

7.1 Block diagrams for re-sequencing pipelines: (a) open-loop (no feed-

back) and (b) closed-loop with feedback...... 135

xvii 7.2 for high and low intensity levels depicted with their

means and standard deviations for four channels...... 137

7.3 Filtered intensity channels and separation. Crosstalk and lagging

are corrected using a linear filter, developed here. The high and

low intensity levels are now cleanly separated for the first 60 cycles.141

7.4 Sequence read error rates per cycle for each of the three datasets

(seetable7.3)...... 149

7.5 SNP specificity (SPC) and sensitivity (SNS) for E. coli. Effect of

the alignment on base-calling, as the weights walign are varied. . . 153 7.6 Dot plots for the E.coli assemblies produced by SUTTA, Velvet,

SOAPdenovo and ABySS. The horizontal lines indicate the bound-

ary between assembled contigs represented on the y axis. . . . . 157

7.7 Feature-Response curve comparison for SUTTA and Velvet on the

100X E.coli dataset...... 158

8 Feature-Response curves by feature type for Brucella suis without

mate-pairconstraints...... 174

9 Dotplotsfor Brucella suis (no mate-pairs). Assemblies produced

by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The hor-

izontal lines indicate the boundary between assembled contigs rep-

resented on the y axis...... 175

10 Feature-Response curves by feature type for Staphylococcus epi-

dermidis withoutmate-pairconstraints...... 176

xviii 11 Dot plots for Staphylococcus epidermidis (no mate-pairs). Assem-

blies produced by Minimus, PHRAP, Euler, PCAP, SUTTA and

TIGR. The horizontal lines indicate the boundary between assem-

bled contigs represented on the y axis...... 177

12 Feature-Response curves by feature type for Wolbachia sp. without

mate-pairconstraints...... 178

13 Dotplotsfor Wolbachia sp. (no mate-pairs). Assemblies produced

by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The hor-

izontal lines indicate the boundary between assembled contigs rep-

resented on the y axis...... 179

14 Feature-Response curves by feature type for Chromosome Y with-

outmate-pairconstraints...... 180

15 Dot plots for Chromosome Y 3Mbs region (no mate-pairs). As-

semblies produced by Minimus, PHRAP, Euler, PCAP, SUTTA

and TIGR. The horizontal lines indicate the boundary between

assembled contigs represented on the y axis...... 181

16 Feature-Response curves by feature type for Brucella suis with

mate-pairconstraints...... 182

17 Dotplotsfor Brucella suis (with mate-pairs). Assemblies produced

by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The

horizontal lines indicate the boundary between assembled contigs

represented on the y axis...... 183

xix 18 Feature-Response curves by feature type for Staphylococcus epi-

dermidis withmate-pairconstraints...... 184

19 Dot plots for Staphylococcus epidermidis (with mate-pairs). As-

semblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA

and TIGR. The horizontal lines indicate the boundary between as-

sembled contigs represented on the y axis...... 185

20 Feature-Response curves by feature type for Wolbachia sp. with

mate-pairconstraints...... 186

21 Dot plots for Wolbachia sp (with mate-pairs). Assemblies pro-

duced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR.

The horizontal lines indicate the boundary between assembled con-

tigs represented on the y axis...... 187

22 Feature-Response curves by feature type for Chromosome Y with

mate-pairconstraints...... 188

23 Dotplotsfor Chromosome Y 3Mbps region (with mate-pairs). As-

semblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA

and TIGR. The horizontal lines indicate the boundary between as-

sembled contigs represented on the y axis...... 189

24 Feature-Response curves by feature type for Escherichia coli with

mate-pairconstraints...... 190

xx 25 Dot plots for E. coli (with mate-pairs). Assemblies produced by

Velvet, ABySS, Taipan, SOAPdenovo, SUTTA and Edena. The

horizontal lines indicate the boundary between assembled contigs

represented on the y axis...... 191

xxi List of Tables

3.1 List of sequence assemblers...... 62

6.1 Benchmarkdata...... 106

6.2 Long reads assembly comparison without mate-pair information

(clone sizes and forward-reverse constraints)...... 111

6.3 Long reads assembly comparison using mate-pair information. . . 115

6.4 Short reads assembly comparison without mate-pair information. . 121

6.5 Short reads assembly comparison using mate-pair information. . . 123

6.6 Assemblers computational performance for Staphylococcus aureus

strain MW2. Time and memory requirements reported here in-

clude both overlapping and assembly steps...... 130

7.1 Probabilities from FM search for each base preceded by “ACGAC”.143

7.2 List of available Base-callers for Illumina sequencing technology

includingTotalreCaller...... 147

7.3 Data sets used to evaluate and compare TotalReCaller’s perfor-

mance to its peers: phiX, E. coli and poplar...... 148

7.4 Basecalling speed and alignment comparison...... 150

xxii 7.5 Alignment rate for Bustard and TotalReCaller...... 155

7.6 Assembly results (contigs) for E. coli dataset (100X 125bp reads

fromonelaneofGenomeAnalyzerII)...... 156

7 Short Read Assemblers parameter setting...... 192

xxiii List of Appendices

Appendix A 166

Appendix B 172

xxiv Introduction

This dissertation addresses the problem of assembling the genome sequence of an organism using the available sequencing technologies and long-range information.

Producing a complete and high-quality assembly of a genome has fundamental implications in Biology and Medicine; however such a task is particularly chal- lenging for large, repeat-rich genomes such as those of mammals. For this reason new assembly tools, cautious experimental design, and novel metric for assembly validation must be developed if there is any hope to successfully and correctly assembly any genome. In this introductory chapter I will describe the major questions and grand challenges that must be faced in this field while explaining how the thesis contributes to address these problems.

Motivation

Probably there is no better way to describe my motivation for working on genome sequence assembly than considering how to answer the following general ques- tions. The answers also highlight the reasons why this problem has received so much consideration by both the computer science and

1 community in the last 30 years.

Why is genome sequencing so important? Knowing the full sequence of a genome (in particular Human) has many important and direct applications to

Biology and Medicine. Following is a (not exhaustive) list of such applications:

Understand evolution: compare genomes of near-by species. •

Understand traits and diseases: compare genomes of wild-types vs. mutants, • normals vs. patients, normals vs. tumor, etc.

Find genes, regulatory regions, exon-intron-boundaries, splicing sites, etc. • computationally (hence, more cheaply and quickly).

Understand how the genome as a whole works: how genes work together • to direct the growth, development and maintenance of an entire organism.

Especially, the motifs and patterns in intergenic regions (containing so-

called “junk” DNA).

If, by developing a successful technology for genome assembly, it will be possible to solve at least one of such challenges in Medicine and Biology, then we will have made an invaluable contribution to these fields.

Why is genome sequencing so difficult? In a more general framework, computer science researchers first formalized the shotgun sequence assembly prob- lem as an “approximation” to finding the shortest common superstring of a set of sequences. Because of its natural connection to a well known -complete N P

2 combinatorial optimization problem (Shortest Superstring Problem), in these so- lutions, accuracy is inevitably traded for computational efficiency with greedy and heuristic methods being the preferred choice. However these approaches have their foundations on the following argument: if the DNA was totally ran- dom then the overlap information between the sequences would be sufficient to reassemble the full sequence and greedy strategies would always have satisfac- tory performance. Unfortunately this assumption is not true for real biological problem. In fact genomes (especially Eukaryotic ones) are characterized by non- random structures (e.g., repeated regions, rearrangements, segmental duplications etc.) which complicate the assembly problem and make greedy strategies fail in- variably. In addition, the sequence reads generated by the sequencing machines are not error-free and the error profiles change quickly over time in response to the changes in sequencing technology. This makes the assembly problem also sequencing-technology dependent and the algorithms must change to accommo- date changes to read-length, errors profiles, and nature and availability of long range information.

How well have we done so far? The initial “draft” sequence of the human genome [38] has been revised several times, since its first publication, each revi- sion eliminating various classes of errors through successive heuristic advances; nevertheless, genome sequencing continues to be viewed as an inexact craft and inadequate in controlling the number of errors, which in the draft genomes are estimated to be up to hundreds or even thousands [86]. So ten years after the publication of the first draft of the human genome (2001), there is still need for

3 error-free and efficient methods to further improve the quality of the assemblies.

What is missing? Despite extensive usage of these methods and documenta- tion in the literature for almost 20 years, there still exists no rigorous characteri- zation of these methods, i.e., (i) no complete analysis of their performance in the context of varying genomic structures, (ii) no comparative study on the accuracy and efficiency of each of their heuristics, (iii) no guiding principles on how to choose and adapt the heuristics to changing DNA sequencing technologies, and

finally, (iv) no flexibility in exploiting the long range information to decipher structural variation or achieve haplotypic disambiguation. Last but not least, lacking such an extensive understanding, no prescriptive framework has resulted for designing the next generation of assemblers that aim to be more accurate, adaptively efficient and extensible to deal with possible future technologies.

The Sense of the Approximation

Although the process of assembling a genome is often described as solving a com- plex jigsaw puzzle, in reality there are even further complications. This is mostly due to the fact that we still do not completely understand the global structure of the human genome (as well as of many other species). How the genome is modeled has strong implications not only in the way the problem is formulated but also in the accuracy of the methods used to tackle it. To understand this point it is helpful to observe that the sequence assembly problem is unfortunately a “wicked problem” [83]. This is a term used to describe a problem that is difficult or im-

4 possible to solve because of incomplete, contradictory, and changing requirements that are often difficult to recognize. Moreover, because of complex interdependen- cies, the effort to solve one aspect of a wicked problem may reveal or create other problems. In the context of genome assembly, the incomplete, contradictory, and changing requirements correspond to the underline genome structure. Since its discovery in 1953 by James D. Watson and Francis Crick, scientists have made a lot of progress in understanding the DNA chemical structure, but only more recently, after the first draft of the human genome was assembled, we know more about its base-by-base structure. Real genomes (especially Eukaryotic ones) con- tain non-random structures (e.g., repeated regions, rearrangements, segmental duplications etc.) which complicate the assembly problem and any mathematical formulation will never be complete and correct until these structures are fully characterized. As a consequence, in the presence of various unmodeled genome structure, an assembly method must make a choice among quality, quantity and nature of the input data, computational complexity, false discovery rate (chimeric contigs) and false negative rates (gaps in the assembly). We have chosen to em- phasize a lower false discovery rate (fdr). Though it is difficult to have a sense of the approximation of the sequence to the true sequence while they are being assembled, it is possible to analyze the accuracy of an assembler using simulated genomes (of varying complexities) and feature-response-curves with features that are highly indicative of certain kinds of errors. Nonetheless, the approach pre- sented in this dissertation is not yet immune to these criticisms, but it contributes a new framework that has potential to be more faithful to biology.

5 Thesis Contributions

This dissertation contributes three ways to the field of Sequence Assembly and

Genomics in general:

1. A novel DNA sequence assembler, called Scoring-and-Unfolding Trimmed

Tree Assembler (SUTTA). Despite the negative theoretical results of the

complexity of the assembly problem ( -hard), SUTTA dispenses with N P the idea of using greedy and heuristic methods (standardly used by all the

current state-of-the-art sequence assemblers) in favor of a brute-force ap-

proach whose complexity is carefully tamed with constrained search method

(branch-and-bound). To achieve this goal SUTTA relies on flexible designed

score functions that can combine data from different technologies.

2. A new assembly metric, named Feature-Response Curve (FRC). Widely

used metrics to compare the assembled sequences up to now emphasize

only size, while scant attention has been paid to evaluate quality and accu-

racy. To address this problem the FRC has been designed using ideas from

the classical ROC (receiver-operating characteristic) curve, to capture the

trade-offs between quality and sequence size into a single metric.

3. A new Base-Caller, called TotalReCaller. By improving the quality of the

input DNA sequences, base-calling and error correction tools play a signif-

icant role to enable more accurate sequence assemblies. TotalReCaller is a

novel tool that combines base-calling and alignment into a unified frame-

work that concurrently performs base-calling, alignment, and error correc-

6 tion. Like SUTTA, TotalReCaller also relies on global optimization search

methods (branch-and-bound and beam-search) to optimize a Bayesian score

function that takes into accounts both intensity signals and alignments to

a reference genome.

Thesis Outline

The dissertation is organized as follows. Chapter 1 covers the fundamental back- ground in genome sequencing and sequence assembly. Chapter 2 investigates the major complexity results in sequence assembly, in particular highlighting the in- consistencies in the early formulations. Chapter 3 reviews the different assembly paradigms that have been designed over the years to adapt to various sequenc- ing projects and technologies. Chapter 4 presents a detailed description of the

first contribution of this dissertation: the Scoring-and-Unfolding Trimmed Tree

Assembler (SUTTA). SUTTA is based on a different formulation of the sequence assembly problem (as constrained optimization) and uses the branch-and-bound method to quickly identify and prune implausible overlays. Chapter 5 introduces and describes the second contribution of the dissertation: the Feature-Response curve (FRC). This new metric more satisfactorily captures the trade-offs between quality and size of the assembled sequences. Chapter 6 presents an extensive set of experimental results to compare SUTTAs performance relative to many state-of-the-art assembly algorithms in the literature. The analysis is performed under both standard metrics (N50, coverage, contig sizes, etc.) as well as the

Feature-Response Curves introduced in 5. Finally, chapter 7 presents the third

7 contribution of this dissertation: the TotalReCaller base-caller, which combines base-calling and alignment into a new re-sequencing framework. This final chap- ter also demonstrates the advantages of a complete de novo assembly pipeline integrating TotalReCaller (for base-calling) with SUTTA (for sequence assem- bly) in a Bayesian manner.

8 Chapter 1

Genome Sequencing and

Assembly

1.1 Introduction

A combination of tremendous advances in sequencing technologies, chemistry and computer science has revolutionized biological research by allowing scientists to decode the genomes of many organisms and in particular the human genome [38].

Moreover, the following advent of high-throughput next- and subsequent gener- ations of sequencing technologies (Gen-1, Gen-2 and Gen-3, respectively) now promise to considerably reduce the genome sequencing cost in what now seems to be a personal genomics revolution. This chapter covers the fundamental back- ground required to understand the research in genome sequencing and assembly which is at the core of this revolution. Specifically the chapter is organized as follows: a quick introduction to DNA is first given; next, the traditional shotgun

9 sequencing technology is presented; then next-generation sequencing technologies are discussed especially in light of various challenges that they introduce; next, the standard assembly pipeline for large genome projects is described; finally, the assembly of the first draft of the human genome is reviewed from a historical perspective.

1.2 DNA: Deoxyribonucleic acid

Deoxyribonucleic acid (DNA) is a double-stranded polymer that contains all the fundamental building blocks for an individual’s entire genetic makeup, and it is a component of virtually every cell in the human body. Specifically, it contains the genetic instructions used in the development and functioning of all known living organisms as well as the hereditary information that gets transmitted from organism to organism. DNA has a unique structure composed of two long poly- mers (Watson and Crick strands) of simple units called nucleotides (A,T,C,G) that run in opposite directions to each other (anti-parallel). The two polymers are held tightly together forming a double-helix shape (see figure 1.1). In the nu- cleus of each cell, the DNA molecule is packaged into thread-like structures called chromosomes. Each chromosome consists of a single piece of coiled DNA contain- ing many genes, regulatory elements and other nucleotide sequences. In humans, each cell normally contains 23 pairs of chromosomes, where each chromosome has two homologous copies, one from the mother and one from the father (except for the sex chromosomes X and Y: females have two copies of the X chromosome, while males have one X and one Y chromosome). Because of this dual represen-

10 Figure 1.1: The double-helix structure of the DNA. tation of each chromosomes, DNA is said to have an haplotypic structure. Note however that the current sequencing technologies do not distinguish between the two strands and the two homologous copies. So when DNA is assembled the two homologous copies are merged into a single sequence. Another important observation from the point of view of the assembly process is the following: al-

11 though the two strands are simply the complement of each other, because of the double-strand “antiparallel” structure of the DNA, when the fragments are sampled (using the available technology) from the DNA molecule two cases can happen: (1) if the fragment is sampled from Watson’s strand it is read in the forward direction; (2) if it is sampled form Crick’s strand it is read in reverse direction, and because of the complementarity rule (A-T,C-G) this fragment is in fact reverse-complemented. This properties has implications in the way the overlap between the fragments are computed (in section 2.2.1).

1.3 Shotgun Sequencing

For more than three decades, starting with the pioneering DNA sequencing work of Frederick Sanger in 1975 based on the Sanger chemistry, which is still univer- sally used [87], every large-scale sequencing project has been organized around one single goal of overcoming the following obstacle: How can one generate the sequence of gigabases of genome as one uninterrupted string, if from the genome of an organism, it is only possible to obtain short sequence reads, limited to about 1000bp, which carry no contextual (chromosomal location or haplotypic disambiguation) information? Since the short read-lengths place a critical barrier against direct reading of long genomes, one requires an algorithmic solution to indirectly derive the full-genome sequence from an overly redundant set of read data. Thus, most sequencing projects have adopted a shotgun sequencing strat- egy [49], which, as currently practiced in many genome projects, is organized in several steps: namely, first genomic DNA of multiple copies of a target DNA

12 Figure 1.2: Shotgun sequencing. molecule of an organism is sheared into a very large number of small fragments

(typically 8–10 coverage), each of whose ends are sequenced ( 500–1000 bp); × ≈ next the resulting sequence reads are fed to a computer program, called a se- quence assembler, whose purpose is to reconstruct a full genome sequence that can consistently and correctly explain the sequence read data (see figure 1.2).

Intuitively, the assumption is that if two sequence reads (two strings of letters produced by the sequencing machine) share a common overlapping substring of letters, then it is because they are likely to have originated from the same chromosomal location in the genome. The basic assumption can be made more precise by additionally taking into account the facts that the sequence reads could come from either Watson or Crick strand, and that if two strings are part of a mate-pair1 then the estimated distance between the reads and their orientation imposes additional constraints, not to be violated. Once such overlap structures between the sequence reads are detected, then the sequence assembler places the reads in a lay-out and combines the reads together to create a consensus sequence—not unlike how one solves a complex jigsaw puzzle. In addition this assembly process is complicated by the presence of non-random structures (e.g.,

1Reads sequenced from the ends of the same clone.

13 repeated regions, rearrangements, segmental duplications) in the genome that affect the correctness of the overlaps by producing many false-positives.

1.4 Next Generation Sequencing and their

challenges

Although quite reliable and considerably optimized, the Sanger process is quite expensive in cost and time. For example, the monetary cost necessary to assemble mammalian genomes was estimated to be about $12 million dollars in 2005 [63].

Since the final goal still remains personalized medicine, it remains fundamental to significantly reduce the sequencing cost (ultimately below $1000 dollars) in order to have an affordable technology for mass application.

In response to these requirements, recent advances in sequencing technology have produced a new class of massively parallel Next-Generation Sequencing

(NGS) platforms such as: Illumina2 Inc. Genome Analyzer, Applied Biosystems

SOLiD3 System, 4544 Life Sciences (Roche) GS FLX, and Helicos5 Heliscope

Sequencer. Although they have significantly reduced the production cost and have orders of magnitude higher throughput per single run (200x coverage and higher) than older Sanger technology, the reads produced by these machines are typically shorter (35 - 500 bps). As a result they have introduced a succession of new computational challenges, for instance, the need to assemble millions of

2http://www.illumina.com/ 3http://www.appliedbiosystems.com/ 4http://www.454.com/ 5http://www.helicosbio.com/

14 reads even for bacterial genomes [82]. The short read length complicates the assembly problem since repeat regions are now harder to disambiguate and so higher coverage depth is generally required. Furthermore, the assembly tools originally developed for Sanger sequencing data cannot be directly applied to

NGS technologies for different reasons:

1. because of specific algorithmic choices that rely on long read lengths avail-

able with older Sanger reads;

2. because of the specific error profiles of NGS data (e.g. pyrosequencing

technologies are characterized by high error rates in homopolymer regions);

3. because of the computational requirements of the vastly larger number of

reads generated by NGS projects (e.g., 8 times coverage of a 3 Gbp mam-

malian genome requires 30 million Sanger reads but 750 million Illumina

reads);

As a consequence new assembly tools have been designed to specifically deal with the new features of the data generated by NGS (see chapter 3 for a review of sequence assemblers). However developing a universal assembler that can handle multiple types of sequencing data still remains a noteworthy goal.

1.5 Lander-Waterman statistics

The statistical properties of the sequencing process play a critical role on the performance of an assembler. In fact, even in absence of repeats, the output

15 6 5 4

Coverage 3 2 1

Contig

Reads

Figure 1.3: Read coverage illustration (inspired by a lecture given by Michael Schatz in 2006 at the University of Hawaii). of a sequence assembler may consist of multiple contigs (contiguous sequence of the genome) if not enough depth of coverage is available. This phenomenon can be explained with an analogy to the water covering a sidewalk as it starts to rain: as droplets fall, the sidewalk becomes increasingly wet, though many spots remain dry for a while. Similarly, as the fragments are being sequenced, the randomness of the shearing process leads to cover successively more new sections of the original DNA not yet represented in the collection of reads (see figure 1.3).

If some region of the genome are not covered, the best possible assembly consists of the collection of contigs with gaps in between representing the location of the

DNA not covered by the reads.

This phenomenon was initially analyzed by Lander and Waterman [55]. Specif-

16 ically they approximate the arrival of N reads of equal length l along a genome of length G as a Poisson process whose parameters are defined as follows.

Definition 1 (Lander-Waterman parameters). Consider a genome of length G that has been uniformly randomly sampled to collect N fragments/reads each one of length l. We can define the following Lander-Waterman parameters:

c = lN , coverage. • G

k, size of the minimum detectable overlap: number of base pairs two frag- • ments must have in common to ensure their overlap (overlap parameter).

σ =1 k , fraction of a read not involved in the minimum detectable overlap; • − l k where θ = l .

For example 1 (1 times) coverage of the human genome requires N = cG = × l 9 1(3×10 ) = 6 million reads. However higher coverage ( 10 coverage) is typically 500 ∼ × used to assembled genomic data, and in that case N = 60 million reads would be required.

Considering the reads in order of their arrival along the genome (from left to right), the number of contigs is the same as the number of reads that do not overlap. The following theorem precisely formalize this intuition.

Theorem 1 (Contigs statistics). If we model the “arrival” of N fragments of length l along a genome of length G as a Poisson process then the expected number of non-trivial contigs6 and their sizes are:

6contig composed of two or more reads.

17 Number of contigs vs. coverage Contigs length vs. coverage 60000 θ = 30% θ = 30% 250 θ = 50% 50000 θ = 50% θ = 70% θ = 70% 200 40000

150 30000

100 20000

50 10000 Average contig length 0 0 0 2 4 6 8 10 0 2 4 6 8 10

Expected number of non-trivial contigs Coverage Coverage

Figure 1.4: Expected number of contigs and their average length as a function of coverage for different values of the minimum detectable overlap θ.

E[# non-trivial contigs]= Ne−(cσ) Ne−(2cσ) (1.1) − (e(cσ) 1) E[contig size]= l − + (1 σ) (1.2)  c − 

Figure 1.4 shows the relation between the expected number of contigs and their average length as a function of coverage. As the coverage increases more and more contigs can be created, however when the coverage reaches a specific threshold the number of contigs starts to decrease since now the probability that a new fragment connects two previously created contigs becomes close to 1. As this process continues the average contigs’ length simply increases monotonically.

Note that decreasing the minimum detectable overlap θ greatly reduces the ex- pected number of contigs and increases their lengths. However it is important to emphasize that this is only a mathematical model and in real applications smaller values of θ must be balanced with the possibility of making erroneous assemblies.

Proof. Presented below is the proof for the expected number of non-trivial contigs

18 (equation 1.1), while a proof for equation 1.2 can be found in [55]. The proof is based on the following simple observations: a contig stops at the last overlapping fragment with no following overlapping fragments. Thus computing the number of contigs is equivalent to counting the number of “stopper” fragments. A stopper fragment is defined as a read with no further overlaps at any of the first l(1 θ) − k positions, where θ = l models the notion of detectable overlap. Now note that the probability of a fragment starting at any position in the genome is:

N Pr[fragment start at position i]= (1.3) G so the probability of having a stopper fragment is:

1 l(1−θ) − lN (1−θ) − N [ G ] N N ( G ) Pr[“stopper” fragment] = 1 = 1+  − G   − G  (1.4) − lN (1−θ) −(cσ) e G = e ≈

The approximation is possible because N 0 for large G. Hence, the expected G ≈ number of contigs and singleton contigs are given respectively by the following equations:

E[# of contigs] = Ne−(cσ), (1.5)

E[# of singleton contigs] = Ne−(2cσ) (1.6)

19 Thereof the expected number of non-trivial contigs is:

E[# non-trivial contigs] = E[# of contigs] E[# of singleton contigs] − = Ne−(cσ) Ne−(2cσ) −

1.6 Trade-off in sequencing technology

There is a critical trade-off between sequence reads and the number of reads that can be generated by current sequencing machines (coverage). Ideally we would like to generate very long reads with high coverage for de novo assembly projects, but currently no technology generates reads longer then 1Kb in length. Some technologies, e.g., Life Technologies7 and Pacific BioSystems8, have announced new single molecule sequencing (SMS) technology that can combine virtually un- limited continuous long read lengths with unmatched accuracy to deliver targeted genomic sequence data in a matter of hours. However there is a lot of specula- tion whether these promises are too optimistic and whether we will see very long sequence reads (> 10kb) in the near future. On the other side, next-generation sequencing technologies have been able to significantly increase the throughput

(and therefore the coverage) while paying a high price by generating much shorter reads. Figure 1.5 depicts the trade-off scenario. However even very high coverage is not the final solution to the shorter read length and long-range data (mate-

7http://www.lifetechnologies.com/ 8http://www.pacificbiosciences.com/

20 Ideal point: impossible to achieve with current sequencing technologies Read length

1 Kbp

High throughput

Number of reads

Figure 1.5: Trade-off between read length and coverage. pairs, optical maps, etc) is necessary to correctly assemble the most complex genome structures. Although initially unavailable, paired-end sequencing has now become a common routine for almost all of the next-generation sequencing technologies.

1.7 Assembly Pipeline

The process of reconstructing the complete genome sequence given the input sequence reads and, if available, additional long-range information (mate-pairs, optical maps, etc.), is typically carried out in multiple steps (sub-problems).

Each sub-problem is tackled using a specific module and all the modules are

21 sequentially performed in a pipeline. Note that in the following we will focus on describing the modules that involve the use of computational approaches, while omitting the steps that involve laboratory experiments. All together this assembly pipeline is the general procedure for Whole-Genome Shotgun (WGS)

Sequencing, but we emphasize that the specific procedure used at the various genomic centers may differ in details.

1. Fragment Readout or Base-calling - The first step of the assembly

pipeline consists of determining each fragment sequence using automatic

base-calling software. The base-caller task is to interpret the signal in-

tensities generated by the sequencing machines and call a base for each

position in the sequence. In the case of longer Sanger reads, Phred [25]

was the standard base-caller used to analyze the fluorescence “trace” data

generated by the DNA sequencer. New base-callers have been designed for

next-generation sequencers that use sophisticated signal processing meth-

ods such us statistical learning (BayesCall [42]), supervised learning and

support vector machine (Alta-Cyclic [24], Ibis [51]), and model-based clus-

tering and information theory (Rolexa [85]). More recently, we have pro-

posed a novel unified re-sequencing framework, TotalRecaller [62], that has

the ability to concurrently perform base-calling, alignment and SNP detec-

tion. In the last chapter of the thesis, we will demonstrate the advantages of

a complete pipeline integrating Base-Calling (TotalReCaller) with assembly

(SUTTA) in a Bayesian manner.

2. Trimming low-quality sequences - The sequence reads often contain

22 poor quality regions which, if removed, can lead to more accurate sequence

assembly. However this step is optional, for example one may rely on the

overlapper (next step) to filter false overlaps using the quality values.

3. Overlap Detection/Pairing - The specific algorithmic approaches for

this task have evolved throughout the years, as increasingly more complex

sequencing projects were tackled through the shotgun method and next-

generation sequencing machines were developed. The earliest algorithms

involved either iteratively aligning each read to an already generated con-

sensus or comparing all the reads against each other [77]. Most recently

the detection of read overlaps involves sophisticated techniques meant to

reduce the number of pairs of reads being analyzed. For example the UMD

overlapper [84] keeps the number of repeat-induced spurious overlaps small

and it builds the initial overlapping-phase of the algorithm with a reason-

ably small number of k-mers, whose cardinality is optimized by an order of

magnitude through the use of minimizers. Because of the higher through-

put of NSG data the standard approach used for finding overlaps is instead

exact matching which makes use of suffix- and prefix- trees or arrays. These

are data-structures designed to efficiently store all the suffixes or prefixes of

a particular string. In order to identify all reads that overlap a particular

read r, it is sufficient to identify the reads whose suffixes match a prefix of r.

This approach allows one to perform the overlap computation for millions

of reads both in a time and memory efficient manner although sequencing

errors are not tolerated. The output of this module is a set of fragment

23 pairs with associated overlap score.

4. Fragment Assembly - Given the set of fragments/reads and their pair-

wise overlaps (and possible long range information), this step generates a

layout or arrangement of the reads that is consistent with the overlap in-

formation and satisfies the long range information constraints. This is the

most complicated step in the pipeline and requires sophisticated techniques

and heuristics which are the central problems addressed by this thesis. In

particular, this problem can be formulated in graph-theoretic terms and it

can be shown to belong to the class of -complete problems (see chapter N P 2). Several assembly paradigms have been developed to efficiently tackle

the fragment assembly problem (see chapter 3). The output of this step is

a set of assembled layouts (arrangements of the reads) for the contigs.

5. Consensus Generation - Once a layout (or set of layouts) is generated,

the sequence of base pairs nucleotides for each layout is computed using a

consensus program. If overlaps are detected using exact match, then the

consensus generation is a trivial task, however, if dynamic programming

is used for overlap detection, the consensus generation requires solving a

multiple sequence alignment problem [44].

6. Scaffolding - The scaffolding process groups the contigs together into sub-

sets with a known order and orientation. Researchers generally infer the

relationships between contigs from mate-pair information. Similarly to step

4, this problem can be also formulated in graph-theoretic terms where each

24 node in the graph is a contig and a directed link between two contigs ex-

ists if there are mate-pairs bridging the gap between them. The goal is

to find a consistent orientation for all nodes in the graph according to the

mate-pair information. Because of errors in the pairing data and possible

mis-assembly errors in the contigs, this problem can also be shown to be

-complete; several greedy and heuristic strategies have been proposed. N P Many sequence assemblers include a scaffolding step, however stand-alone

options are also available (e.g., Bambus [81], SSPACE [14]). The output of

this step is a set of ordered contigs and estimated distances between them.

7. Assembly Validation - In practice the assembly problem is never an

error-free process and even the most sophisticated assemblers are affected

by mis-assembly errors. So it is important to include in the pipeline a

validation step that checks the quality of the assembled contigs. Unfortu-

nately no standardized method yet exists and most reported measures of

assembly quality are aggregate measures, such as the number and sizes of

contigs, which do not account for the possibility of misassemblies. Thus

they are only marginally useful. If a reference genome is available, the con-

tigs can be aligned to the reference in order to identify region that have

been mis-assembled. However, if the true layout is unknown, which is com-

mon in de novo sequencing projects, then additional data must be used

to validate the contigs. For example, physical maps provide markers (with

approximate order and/or distance) that can be used to validate large con-

tigs. Similarly, the sequence of a closely related organism can be used to

25 confirm areas where we expect not to find significant discrepancy (or di-

vergence). In the absence of any other types of information, clone mates

(also known as mate-pairs or paired-ends) have been widely used to detect

assembly errors by checking areas of the genome that violate the orienta-

tion and distance constraints imposed by the clone mates. Some tools have

recently appeared that perform automatic validation of contigs [80]. This

thesis contributes many improvements to this step by introducing a novel

metric, which satisfactorily captures the trade-offs between contig’s length

and quality (see chapter 5). The typical output of this step is a quality

value for each contig together with pointers to locations in the contigs that

might be mis-assembled.

8. Finishing - In practice, imperfect coverage, repeats, and sequencing errors

cause the assembler to produce not one but hundreds or even thousands of

contigs. After scaffolding, the contigs are oriented and the sequence gaps

between them now need to be filled with their genomic sequence. The task

of closing the gaps between contigs and obtaining a complete molecule is

called finishing. Although they represent genuine gaps in the sequence,

researchers can retrieve the original clone inserts spanning the gap and use

a straightforward walking technique to fill in the sequence. However, filling

these gaps involves a large amount of manual labor and complex laboratory

techniques, so any improvement in assembly that could reduce the cost of

this step would have significant importance. The output from this final

step is the whole genome sequence.

26 1.8 History of the assembly of the

Human Genome

Among the thousand genomes that have been sequenced and assembled in the last 20 years, the human genome has a very intriguing history. For obvious reasons, we have been particularly interested in decoding our own DNA. The potential discoveries, applications, and the sheer glory of such an achievement has motivated scientists all over the world to compete fiercely against each other.

Here we briefly outline the major historical events that led to the first draft of the Human genome.

[1990] The Human Genome Project was launched through funding from the

US National Institutes of Health (NIH) and Department of Energy, whose labs joined with international collaborators and resolved to sequence 95% of the DNA in human cells in just 15 years. This joint effort was named the International Hu- man Genome Sequencing Consortium (IHGSC). Researchers at IHGSC proposed that a feasible assembly strategy should follow the BAC-by-BAC hierarchical method where the genome is first broken up into a collection of large overlapping in-vivo-clonable fragments (between 160 and 200 Kb) – called Bacterial Artifi- cial Chromosomes or BACs. Each BAC would be independently assembled by shotgun strategy and then mapped together using restriction-finger-print-based overlaps. There was also a YAC map but it had too many errors due to chimerisms in the clones.

27 [1998] A new private venture was launched to sequence the human genome: the enterprise - named Celera Genomics - aimed to create its own database of human genomic data, which users would be able to subscribe to for a fee. In contrast to the IHGSC and despite popular skepticism, the Celera Genomics opted for the

Fred Sanger’s “whole genome shotgun” method, skipping the mapping phase of the BAC-by-BAC hierarchical process.

[2000] On 26 June 2000 the public and private enterprises both announced that they had completed their respective draft genome sequences.

[2001] Each effort published an account of its draft human genome sequence:

Celera’s effort appeared in Science [100], and the International Human Genome

Sequencing Consortium’s effort was published in Nature [38].

[2003] Two years ahead of schedule, with contributions from countless scien- tists from 20 institutions across the globe, the International Human Genome Se- quencing Consortium announced that they had completed the gold-standard ref- erence human genome, according to the guidelines of the original Human Genome

Project, with 99.99% accuracy [39].

What about quality? Although it is widely believed that the Herculean task of the Human Genome project (HGP) was completed in 2003, 13 years after its initial project announcement, it is legitimate to ask: how well did we do? “Of particular interest are the relative rates of mis-assembly (sequence assembled in the wrong order and/or orientation) and the relative coverage achieved by the

28 three protocols” [91]. Unfortunately, prior to 2003, the IHGSC group were alone in having published assessments of the rate of misassembly in the contigs they produced (see [40] on subsequent assembly analysis and comparison). Using arti-

ficial data sets, they found that, on average 10 per sent of assembled fragments were assigned the wrong orientation and 15 per cent of fragments were placed in wrong order by their protocol [47]. A more recent article, entitled “Revolution

Postponed” in Scientific American [31] asserted, “The Human Genome Project has failed so far to produce the medical miracles that scientists promised. Biol- ogists are now divided over what, if anything, went wrong...”. Due to these and other related recent events, genome assembly is again receiving a lot of attention from the genomic community. In particular, the central problem is the develop- ment of new assembly and validation tools that can overcome the limitations of previous sequence assemblers thus leading to more accurate genome sequences: this thesis will focus on both aspects of the problem by contributing not just to a novel sequence assembler but also to a better metric for assembly validation.

Some recent advances. The history of the assembly of the human genome did not stop with the successful completion of the first draft in 2001. Several other teams of researchers have continued on a similar journey tackling these and other massive assembly problems. Below is a list of several such major efforts.

[2007] A team of researchers mostly from the Craig Venter Institute9 published an updated version of the human genome produced from 32 million random DNA

9http://www.jcvi.org/

29 fragments sequenced using Sanger technology [57]. Similar to the Celera effort, the whole-genome shotgun sequencing method was used also in this case. In par- ticular they developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within the individual diploid genome. Comparison of this genome and the National Center for Biotechnol- ogy Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb.

[2008] The first human genome sequenced by next-generation technologies was published, using massively parallel sequencing in picolitre-size reaction vessels

[102]. This genome belonged to James D. Watson, one of the co-discoverers of the structure of DNA (with Francis Crick) in 1953. The assembly results seem to agree well with the previous results published in 2007 by traditional sequencing methods [57].

[2009] First human genome assembled using next-generation short read data

[92]. Specifically 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. were assembled using a new parallel assembler based on a distributed representation of a De Bruijn graph. Approximately 2.76 million contigs 100 base pairs (bp) in length were created with an N50 size of ≥ 1499 bp, representing 68% of the reference human genome. Also in this case the analysis of the assembled sequences revealed polymorphic and novel sequences not present in the initial human reference assembly.

30 [2010] The second human assembly using next-generation short read data was published using the assembler ALLPATHS-LG [29]. Using data from the Illumina

GAII and HiSeq sequencers, a team from the Broad Institute10 of MIT created an assembly of comparable quality to the one previously obtained using capillary- based sequencing (base accuracy of 99.95%, N50 contig length of 24 kb, and ≥ N50 scaffold sizes of 11.5 Mbp) and it has been described as the current best human de novo assembly.

10http://www.broadinstitute.org/

31 Chapter 2

Sequence Assembly:

Problem and Complexity

2.1 Introduction

Because of its natural connection to a well known -complete combinatorial N P optimization problem (shortest superstring problem), for many years the sequence assembly problem (SAP ) has been investigated using rather simple string and graph-theoretic formulations. This chapter will address such problems through the following sequence of steps: (1) the standard formulations of the SAP and their complexity results; (2) solutions in terms of these formulations which are infeasible/intractable in the context of biology; (3) a newly proposed formulation

(more faithful to biology) of the problem as a constrained optimization problem and the complexity issues in this new framework. The chapter also contains a probabilistic analysis of unique reconstruction for random DNA sequences.

32 2.2 The dovetail-path framework

The first attempt to mathematically formalize the sequence assembly problem is due to Myers (1995) in his seminal paper [67] where the dovetail path framework was first introduced. An essential set of definitions from this notational framework is given below.

2.2.1 Basic definitions: reads, overlaps and layouts

The output of a sequencing project consists of a set of reads (or fragments) F =

r ,r ,...,r , where each read r is a string over the alphabet Σ = A,C,G,T . { 1 2 N } i { } To each read is associated a pair of integers (s , e ), i [1, F ] where s and e i i ∈ | | i i are respectively the starting and ending points of the read ri in the reconstructed string R (to be computed by the assembler), such that 1 s , e R . The ≤ i i ≤ | | order of si and ei encodes the orientation of the read (whether ri was sampled from Watson or Crick strand).

The overlaps (best local alignment) between each pair of reads may be com- puted using the Smith-Waterman algorithm [94] with match, mismatch and gap penalty scores dependent on the errors introduced by the sequencing technology.

Exact string matching is instead typically used for short read from next genera- tion sequencing, since they usually provide high coverage, thus allowing tolerance for increased false negatives. Note also that by restricting the problem to exact matches only, the time complexity of the overlap detection procedure is reduced from a quadratic to a linear function of the input size. The complete description of an overlap π is given by specifying:

33 A A Innie π π π π .sA .eA .sA .eA

A hang A hang B hang B hang π.s π.e π π.e B B .sB B Normal B B

Figure 2.1: Two possible overlaps (illustration): left overlap is normal (both reads pointing to the same forward direction) right overlap is innie (the second read B is reverse complemented and is pointing in the backward direction); The suffix predicate for the left (normal) overlap is s.t. suffixπ(A)= true and suffixπ(B)= false.

1. the substrings π.A[π.sA, π.eA] and π.B[π.sB, π.eB] of the two reads that are involved in the overlap;

2. the offsets from the left-most and right-most positions of the reads π.Ahang

and π.Bhang ;

3. the relative directions of the two reads: Normal (N), Innie (I);

4. a predicate suffixπ(r) on a read r s.t.:

true iff suffix of r participates in the overlap π suffix (r)=  (2.1) π   false iff prefix of r participates in the overlap π  Figure 2.1 illustrates two possible overlaps. Note that a right arrow represents a read in forward orientation, conversely a left arrow represents a read that is reverse-complemented. Because of the double-stranded nature of the DNA molecule, each read can be sampled from either the Watson or Crick strands and they have different orientation. This formulation gives rise to a taxonomy of

34 s e A A A

Containment A hang B hang s e B B B

s e A A A Regular dovetail A hang B hang B sB eB

s e A A A Suffix dovetail A hang B hang B sB eB

s e A A A

Prefix dovetail A hang B hang B sB eB

Figure 2.2: Overlap taxonomy. overlaps illustrated in Figure 2.2.

Definition 2 (Layout). A layout L induced by a set of reads F = r ,r ,...,r { 1 2 N } is defined as:

π1 π2 π3 πN−1 L = r ⇋ r ⇋ r ⇋ ⇋ r . (2.2) j1 j2 j3 ··· jN

Informally a layout is simply a sequence of reads connected by overlap rela- tions. Note that the order of the reads in L is a permutation of the reads in F .

The previous definition assumes that there are no containments1; without any loss of correctness in the base-pair sequences that can be generated, contained

1Reads that are proper subsequences of another read.

35 reads can be initially removed (in a preprocessing step) and then reintroduced later after the layout has been created. However, note that their reintroduction is important to provide additional support for mate-pair constraints (if available).

Out of all the possible layouts (possibly, exponential in the number of reads), it is imperative to efficiently identify the ones that are consistent according to the following definition:

Definition 3 (Consistency Property). A layout L is consistent if the following property holds for i =2,...,N 1: −

πi−1 πi ⇋ r ⇋ iff suffix (r ) = suffix (r ). (2.3) ji πi−1 ji 6 πi ji

The consistency property imposes a directionality to the sequence of reads in the layout. The directionality of each internal read in the layout must be preserved. Figure 2.3 shows an example of layout associated to 7 overlapping reads. The estimated start positions for each read are given by the formula:

sp =1, sp = sp + π .hang −1 if i> 1 (2.4) 1 i i−1 i−1 rji

It should be clear at this point why this framework is called the dovetail-path framework: layouts consist of a sequence of dovetail overlaps that satisfy the consistency property with containment overlaps hanging off the main dovetail- path.

36 A B C D E F G

spA sp BCDEFG sp sp sp sp sp

Figure 2.3: Example of layout for a set of reads F = A,B,C,D,E,F,G with N I N I N N { } overlaps π(A,B), π(B,C), π(C,D), π(D,E), π(E,F ), π(F,G).

2.2.2 Min-length reconstruction theorem

Appealing to parsimony, we are typically interested in a layout whose length is minimal (although we will see that this assumption may lead to biologically incorrect solutions). The following theorem shows the correlation between the length of a layout and the sizes of its overlaps. Let us define the length of an overlap to be the average length of the two overlapping substrings:

( π.s π.e + π.s π.e ) length(π)= | A − A| | B − B| . (2.5) 2

Note that π.s π.e and π.s π.e can have different values when the | A − A| | B − B| overlaps are computed using the Smith-Waterman alignment algorithm, though they are the same if exact match is used. Let us define the weight of a layout L | |

37 to be the sum of the lengths of its overlaps:

weight(L)= length(π) (2.6) Xπ∈L then the following theorem holds [97, 98]:

Theorem 2 (Min-length reconstruction). A layout of maximum weight results in a reconstruction of minimum length.

Proof. First note that:

N−1 L = sp + r = π .hang + r (2.7) | | n | N | i ri | N | Xi=1 using the facts that:

1. sp1 =1,spi = spi−1 + πi−1.hangri−1 , if i> 1,

2. π .hang r length(π ), i ri ≈| i| − i

3. length(π ) g when π is a containment edge and g is the contained i ≈ | | fragment, it follows that:

L = r length(π) . (2.8) | | | | − Xr∈F Xπ weight(L) | {z } But since the second sum is the weight of the layout, maximizing weight minimizes length.

38 2.3 Shortest Superstring Problem (SSP )

Researchers first approximated the shotgun sequence assembly problem as one of

finding the shortest common superstring of a set of sequences. This formulation was favored because of the results of the previous theorem and the availability of efficient algorithms to solve the SSP .

Definition 4 (Shortest Superstring Problem). Given a set of strings or sequences

S = r ,r ,...,r find the shortest string R (reconstruction) such that i, r is { 1 2 n} ∀ i a substring of R.

This shortest common superstring (SCS) formulation led to a simple theoret- ical abstraction, but by being oblivious to how biological sequences are organized by evolution, it often yielded biologically implausible and incorrect solutions.

Its inability to correctly model the assembly problem is owed to a multitude of reasons, but primarily because:

1. it does not account for possible errors arising during the process of sequenc-

ing the fragments,

2. it does not model fragment orientation (the sequence source can be one of

the two DNA strands, Watson or Crick), and

3. most importantly, it fails in the presence of repeats, as it encourages repeat-

induced compressions.

To emphasize the last point it is of interest to note Richard Karp’s statement from 2003 [43]: The shortest superstring problem [is an] an elegant but flawed

39 R1 R2 A l m r B l m r C

R1 R2 A l m r B l r C Mis−assembly Correct Assembly

Figure 2.4: Example of compression: the two copies of repeat (R1 and R2) are compressed into one leading to a shorter but misassembled sequence. abstraction: [since it defines assembly problem as finding] a shortest string con- taining a set of given strings as substrings. Figure 2.4 shows an example of the kind of errors that such formulation could lead to: since strings contained in- side a repeat regions cannot be disambiguated, multiple copies of a repeat are compressed into a single one.

Because of the theoretical computational intractability ( -completeness N P [27]) of the SSP , most of the approaches for genome sequence assembly have resorted to greedy and heuristic methods that, by definition, restrict themselves to near-optimal solutions, where the “nearness” may be guaranteed within a mul- tiplicative competitiveness factor. The best known greedy algorithm for the SSP

2 has an approximation factor of 2 3 [8].

40 2.4 Graph-Theoretic formulation

Since the SSP formulation was not able to correctly model the sequence assembly problem, researches opted for more sophisticated graph-theoretics ideas. Specifi- cally, graph-theoretic approaches convert sequence assembly into solving specific problems for general graphs constructed using the overlap information of the in- put set of reads. This mapping has the advantage of allowing us to apply the large collection of algorithms and heuristics that have been developed in graph theory for many decades. However, this formulation still suffer from the same problems and limitations of the SSP model, since it can produce mis-assembly errors (as shown later). In this section we introduce the two most used graphical models for the sequence assembly problem: String Graph and De Bruijn graph.

But before formally defining them, we need to give a minimal set of definitions.

2.4.1 Strings, Overlaps and Overlap Graph

Let x and y be two strings over the alphabet Σ. Let us denote the length of x by x . The ith character of x is denoted by x[i]. If 1 i j x , we use x[i, j] to | | ≤ ≤ ≤| | denote the substring of x starting at position i and ending at position j. Given two strings x and y over the alphabet Σ, we say that there is an overlap between x and y, and we denote it with x ⇋ y, if there exists a suffix of x that matches2 a prefix of y. Let us denote with o(x, y) the length of the longest such match.

Definition 5 (Overlap Graph). Given a set of strings S = r ,r ,...,r and { 1 2 n} 2The matching does not have to be perfect and it can be approximated allowing up to ǫ percent error on real data.

41 a minimum overlap threshold value k, the overlap-graph for S is a weighted bidi- rected graph OGk =(V, E) where:

V = S = r ,r ,...,r ; • { 1 2 n}

E = (r ,r ):(r ⇋ r ) o(r ,r ) k,r ,r V ; • { i j i j ∧ i j ≥ i j ∈ }

the weight of each edge (r ,r ) is w(r ,r )= r o(r ,r ). • i j i j | j| − i j

The overlap graph represents all the inferable relationships between the strings in the set S. Note that r o(r ,r ) is the length of the overhang3 for string r , | j| − i j j

Since each vertex/string ri has an orientation, every edge has two orientations, one with respect to each of its endpoints. Because the graph is bidirectional, we need to describe how to explore the nodes of the graph to generate the set of valid paths.

e1 e2 e3 em−1 Definition 6 (Path validity). A path P = r ⇋ r ⇋ r ⇋ ⇋ r in G is h 1 2 3 ··· mi valid if i, 2 i m 1, e and e have opposite directions at r . ∀ ≤ ≤ − i−1 i i Note that this definition is equivalent to the consistency property for a layout.

In order to traverse a node in the graph we need that the entry edge and the exit edge have opposite directions at the node. So we are allowed to enter a node x even if the edge ei is pointing out of the node as long as we use an edge ej with opposte direction to ei when we exit the node (see figure 2.5 for an example of overlap graph).

Given any path P in the overlap graph, we associate a path-string to P that consists of the concatenation of the strings according to the order in the path,

3The size of the read portion that is not involved in the overlap.

42 where only one copy of the overlap is kept. Clearly the weight of a path P is given by the sum of the weights of its edges:

w(P )= w(r ,r )= ( r o(r ,r )) (2.9) i j | j| − i j (riX,rj )∈P (riX,rj )∈P

Note that because of the weight function associated to the edges of the graph, a path of minimum weight defines a path-string of minimum length.

2.4.2 String Graph

The size of the overlap graph can be dramatically reduced by a sequence of transformations whose goal is to eliminate edges that can be transitively inferred.

e1 e2 e3 Definition 7 (transitively inferable edge). If x ⇋ y ⇋ z and x ⇋ z are “mutu- ally consistent” overlaps among nodes x, y and z then the edge e3 is said to be transitively inferable from the sequence of edges e1 and e2.

Informally the overlap between strings x and z is implied by the concatena- tion of the overlaps between x, y and z. It is important to note the edges must be mutually consistent: entry edge and the exit edge must have opposite direc- tions. The string graph is a particular graph where all the contained string and transitivity inferable edges are removed [68].

Definition 8 (String Graph). Given a set of strings S = (r1,r2,...,rn) and a minimum overlap threshold value k, the string graph SGk for S is obtained from the overlap graph OGk by removing contained strings (strings that are substrings of other strings) and transitively inferable edges.

43 Such transformation can be computed in polynomial time using the algorithm proposed by Myers in [68]. In order to correctly apply the transitivity reduction step to the graph, it is important to first mark all transitivity edges and then remove all marked edges. This is because this process is not Church-Rosser [20] and any arbitrary strategy would fail to remove some of the transitivity edges.

Equipped with the notion of string graph, the sequence assembly problem can be formulated as follows:

Definition 9 (Sequence Assembly Problem (SAP1)). Given a set of fragments or reads S = (r1,r2,...,rn) and a minimum overlap threshold k, the Sequence Assembly Problem (SAP ) is the problem of finding a Hamiltonian Path in the string graph SGk for S such that its weight is minimal.

Note that we seek a Hamiltonian Path because we would like all the reads to be part of the assembly without repetition. The problem is clearly a special case of the Traveling Salesman Problem (TSP ) with the following two differences: (1) instead of a Hamiltonian cycle we look for a Hamiltonian path; (2) we work with bi-directed graphs instead of undirected or directed graphs. However, for circular genomes (such as plasmids and bacterial genomes), the first difference does not apply anymore as we need to find a Hamiltonian cycle as well.

Note that this formulation differs from the one presented in [70]. Specifically

Nagarajan and Pop define the sequence assembly problem as one of finding a generalized Hamiltonian path (every node is visited at least once) of minimum weight in the string graph of the reads. This is in accordance with the solution proposed in [68] where they seek a cyclic tour. In such a model each edge has

44 assigned to it a selection constraint c that dictates how many times the edge should appear in the target solution. Specifically, there are three cases:

1. exact edge (c = 1),

2. required edge (c 1) and ≥

3. optional edge (c 0). ≥

The argument for allowing an edge to appear multiple times in the solution is to model the repeated regions of the genome. If an edge belongs to a repeat then it should be possible to reuse it in when the second (or more) copy of the repeat are assembled. However, notice that if one assumes uniform coverage of the genome, it is reasonable to presume that the other copies of the repeat are covered by a different set of homologous reads. The problem for the assembler is then how to avoid compressing these reads together into one single copy of the repeat.

Before analyzing the complexity of this formulation it is important to observe that this graph-theoretical framework suffers from a similar kind of problem as the shortest superstring approach. Figure 2.5 shows an example of a string graph where all the possible Hamiltonian paths create mis-assembly errors due to the presence of a repeat. The compression error is due to the fact that repeats can induce false positive transitivity edges. For example consider the reads 3, 7 and 8 in figure 2.5, we have that 7 ⇋ 3, 3 ⇋ 8 and 7 ⇋ 8, so the edge 7 ⇋ 8 is removed with the negative effect of merging together reads that belong to two different copies of the repeat R2. In particular, after removal of the transitivity edges, there

45 R 1 R 2 True layout AAB AAB

1 3 5 7 2 4 6 8

Graph construction

2 3 2 3 1 1

4 4

Transitivity reduction 8 5 8 5

7 6 7 6 Overlap−Graph String−Graph

Layout generation

R 1

AAB Mis−assembly 2 A Mis−assembly 1

1 3 5

2 4 6

7 8

R 1

A AAB

1 6 4 7

2 5 8

3

Figure 2.5: Example of mis-assembly using a string graph: the removal of the transitivity edges (in red) produces a string graph where every (Hamiltonian) paths through all nodes creates misassemblies. The layout for two of these paths are shown at the bottom: the first one with errors due to compression and the second with both compression and inversion errors.

46 is more then one path that traverses all the nodes and it always produces mis- assembled layouts. Note that edge 2 ⇋ 6 cannot be removed because, although there are edges 2 ⇋ 7 and 7 ⇋ 6, the edge directions at node 7 do not match thus obstructing it from being traversed (the edges are not mutually consistent).

This example also shows another problem inherent to this framework. Even if it would be possible to efficiently compute the Hamiltonian path, the string graph might have many different Hamiltonian paths (as in this example) of minimal length and all these paths represent a possible reconstruction of the genome.

Additional long-range information (e.g., mate-pais, optical maps, etc.) must be then used to resolve these ambiguities.

The problem of finding a minimal weight Hamiltonian path in a directed or undirected graph is known to be -complete. Since directed graphs are special N P types of bidirected graphs, it follows that the Sequence Assembly Problems is also -complete: N P

Theorem 3. The Sequence Assembly Problem is -complete. N P

2.4.3 De Bruijn graph

A different approach was first developed by Idury and Waterman in 1995 [37] and later expanded by Pevzner et al. in 2001 [78], which is based on the notion of De

Bruijn graph. In graph theory, an n-dimensional De Bruijn graph of m symbols is a directed graph representing overlaps between sequences of symbols. It has mn vertices, consisting of all possible length-n sequences of the given symbols.

For example, given a set of m symbols S = s ,s ,...,s , the set of nodes are: { 1 2 m}

47 V = Sn = (s ,...,s ,s ), (s ,...,s ,s ),..., (s ,...,s ,s ), { 1 1 1 1 1 2 1 1 m (s ,...,s ,s ),..., (s ,...,s ,s ) 1 2 1 m m m }

Any two nodes v and v have a direct edge between them if the n 1 suffix 1 2 − v is equal to the n 1 prefix of v . 1 − 2 Unlike the string graph formulation, in a De Bruijn graph the notions of nodes and edges are in some sense the dual of the overlap graph. In the context of the assembly of sequence reads, a De Bruijn graph is formally defined as follows:

Definition 10 (De Bruijn Graph). Given a set of strings S =(r1,r2,...,rn) and a minimum overlap threshold value k, the De Bruijn graph for S is a directed graph BGk =(V, E) where:

V = d Σk i s.t. d is a substring of r S ; • { ∈ | ∃ i ∈ }

E = (d ,d ) : if the k 1 prefix of d is a suffix of d ; • { i j − i j}

Informally, the set of vertices of BGk is the set of k-mers for the set of strings

S, and the edges correspond to their perfect k 1 overlap. Clearly every read − r S is translated into a path composed of ( r k) nodes. Let us call such i ∈ | i| − a path a walk and define it w(ri). Also note that the graph is directed (not bidirected as in the case of the string graph) and there is no weight associated to the edges (the overlap weight is k 1 for all the edges and it can be disre- − garded). Figure 2.6 shows an example of De Bruijn graph for the list of strings

L = AAA, AAC, ACA, CAC, CAA, CGC, GCG with parameter k = 2. We { }

48 AA

C C AA AC CA A

G A A

CG GC C

Figure 2.6: De Bruijn graph with parameter k = 2 for the list of strings L = AAA, AAC, ACA, CAC, CAA, CGC, GCG . { } create one node for each 2-mer in the set L and a directed edge from node x1 to node x if the k 1(= 1) suffix of x is a prefix of x and we label the edge 2 − 1 2 with the rightmost character in x2. Hence, in this graph each edge corresponds to just one of the k-mers and so the general problem consists of finding a path that visits all the edges exactly once, an Eulerian path. The string S correspond- ing to a path in this graph can be reconstructed in the following way: begin the string S with the label of the first node and then concatenate, in order, all the labels of the edges in the path. For example, in figure 2.6 one Euler path

A C G C A A A C is AC CA AC CG GC CA AA AA AC and the → → → → → → → → reconstructed string associated to the path is S = ACACGCAAAC.

Although Eulerian paths in the De Bruijn framework can be computed in polynomial time, in reality there are many complications: (i) The De Bruijn graph might have more than one Eulerian path and, as for the earlier string graph framework, choosing the correct one is a non trivial task (Kingsford et al.

49 in [50] give a precise formula for the number of possible genomes that can be constructed from a De Bruijn graph). For example, another possible Eulerian

G C A C A path for the De Bruijn graph in figure 2.6 is AC CG GC CA AC → → → → → A A C CA AA AA AC and the reconstructed string is S = ACGCACAAAC. → → → (ii) Errors in the data can introduce many erroneous edges which complicate the graph structure; (iii) Even when an Eulerian path can be computed and it represents a possible assembly of the k-mers, it still might not constitute a correct assembly of the input reads (due to the presence of repeats). However, as mentioned before, since each read corresponds to a particular walk in the De

Bruijn graph, and any walk that contains all the reads as subwalks represents a possible assembly of the reads.

Definition 11 (Superwalk). A walk is called a superwalk of BGk if i, w(s ) is ∀ i a subwalk of it.

In this framework a parsimonious solution corresponds to a superwalk of min- imum length:

Definition 12 (Superwalk Problem (SAP2)). Given a set of fragments/reads

S = (r1,r2,...,rn) find a minimum length superwalk in the De Bruijn graph BGk of S.

The sequence assembly problem in the De Bruijn graph framework corre- sponds to Superwalk Problem, and, unsurprisingly this problem is also - N P complete by reduction from the Shortest Superstring Problem [61]:

Theorem 4. The Superwalk Problem is -complete. N P

50 2.5 Probability of unique reconstruction

The difficulty of assembling any set of reads (even if error-free) both in the

String and De Bruijn graph frameworks is related to the overall structure of the genome and in particular to the presence/absence of repeats. Repeats can produce branches and cycles in the graph leading to multiple reconstructions. If we (incorrectly) assume that the genome has size n and its letters are uniformly random distributed (the nucleotide at each position is chosen independently and

1 uniformly with probability p = 4 ) , then we can ask the following question: What is the probability that a random genome can be uniquely reconstructed? Here we give a simple analysis based on the k-mer size in the De Bruijn framework. Since two k-mers overlap only if they have a prefix-suffix perfect match of size k 1, − this analysis can be applied also to the special case of the String graph method with minimum overlap threshold k. Note that although this analysis is quite far from reality (genomes are not random), it helps to compute an estimate of the minimum k-mer size that should be used in any real application.

Note that a repeat of size k will always lead to a non-unique reconstruction, ≥ so we need to calculate the probability that a random DNA sequence of length n has no repeats of size k. For very large n (n >> 0), we can define a random ≥ variable X whose values represent the number of times a k-mer occurs in the

DNA sequence. The random variable X follows a binomial distribution with parameters X B(n, pk). Therefore: ∼

n Pr(X = i)= (pk)i(1 pk)n−i (2.10)  i  −

51 which is the probability of having exactly i repeats of k-mer in the random DNA sequence. Note that here we assume all the events to be independent. This assumption is not generally true since strings of size k can overlap in the DNA sequence, however, for very large genomes (e.g., human), the length n is big enough to assume these events to be independent with good approximation. If there are no repeats then for any k-mer only the events for i = 0 and i = 1 can happen, therefore we wish to approach the following equality as closely as possible.

Pr(X =0)+ Pr(X =1)=(1 pk)n + npk(1 pk)n−1 =1. (2.11) − −

For large n (say, n> 105) and for a suitable ǫ, we may wish to find a k such that:

k k np k e−np + e−np > 1 ǫ. (2.12) (1 pk) − −

Which follows from the fact that:

λ n lim 1 = e−λ (2.13) n→∞  − n where λ = npk. Equation 2.12 can be easily rewritten as follows:

k k 1+ npk > enp (1 ǫ)(1 pk)+ pk > enp (1 ǫ∗ ) (2.14) − − − (k)

∗ ∗ for some ǫ(k) > ǫ. Note that ǫ(k) depends on the choice of k. If we substitute

52 1 λ (1-λ)e- 0.9 0.8 0.7 0.6 0.5

Probability 0.4 0.3 0.2 0.1 0 0 2 4 6 8 10 λ

Figure 2.7: Plot of function f(λ)=(1 λ)e−λ in the range [0, 10]. −

k np = λǫ∗ we obtain:

−λ ∗ ∗ (1 + λ ∗ )e ǫ > (1 ǫ ). (2.15) ǫ − (k)

The function f(λ)=(1 λ)e−λ is plotted in figure 2.7. When n = 0, probability − 6 1 can only be approached from below, which means it is never guaranteed that

k there are no repeats in the random DNA sequence. However from λǫ∗ = np it follows that: 1 k = log 1 (n)+ log 1 . (2.16) p p λǫ∗ 

For a random sequence of human genome size we have that:

k = log (3 109) +1.3=17, (2.17) 4 × ≈15.7 | {z }

1 ∗ (−1.3) where p = 4 . Then λǫ = 4 = .165 and the probability that a human-

53 sized random genome is 17-mer-repeat-free is bigger than (1 ǫ = .988). This − means that sequences of size k > 17 are already very unlikely to be repeated.

Unfortunately the human genome is not random and contains many longer and complicated repeat structures than simple repeats of length 17 bps, so it would be required to use much higher values for k.

2.6 Sequence Assembly as a Constrained

Optimization Problem

All the previously described formulations nicely convert the sequence assembly problem into well-defined combinatorial optimization problems; however, these formulations are inherently wrong since the best solution so obtained can be biologically incorrect (mis-assembled). Moreover, since all these problems are

-hard, many of the algorithmic solutions proposed over the last 20 years use N P greedy and heuristic methods which are inherently approximate. Thus, before discussing the various techniques used in sequence assemblers (chapter 3), it is important to give a new formulation of the sequence assembly problem whose solutions are more faithful in the context of biology. This formulation obviously will require the use of long-range information to concurrently validate and resolve the complex structures present in large genomes.

54 2.6.1 Modeling sequencing errors

So far, we have omitted from the discussion the possibility of errors in the reads.

However, all currently available sequencing technologies are error-prone and these errors must therefore be modeled in our study of assembly. If there are errors in the reads then consistent layouts must also satisfy the following property:

Definition 13 (ǫ-valid layout). Let 0 ǫ< 1 be the maximum error rate of the ≤ sequencing process. A layout L is ǫ-valid if each read r L can be aligned to i ∈ the reconstructed string R with no more than ǫ r differences. | i|

Note that in practice, the maximum error rate ǫ is used during the overlap computation to filter only detected overlaps between two reads r1 and r2 whose number of errors is no more than ǫ( r + r ). | 1| | 2|

2.6.2 A new formulation of SAP

The sequence assembly problem (SAP) can be now formulated as follows:

Definition 14 (Sequence Assembly Problem (SAP3)). Given a collection of frag- ment reads F = r N and a tolerance level (error rate) ǫ, find a reconstruction { i}i=1 π1 π2 πN−1 R whose layout L = r ⇋ r ⇋ ⇋ r is ǫ-valid, consistent and such h j1 j2 ··· jN i that the following set of properties (oracles) are satisfied :

(Overlap-Constraint (O)): The cumulative overlap score O of the layout L • is optimized:

O(L)= SO(ri,rj) (2.18)

(riX,rj )∈L π(ri,rj )

55 where SO is the score of the overlap π between the two reads ri and rj.

(Mate-Pair-Constraint (MP)): The cumulative mate-pair score S of the • MP distance between reads in the layout L is consistent with the mate-pair con-

straints:

MP (L)= SMP (ri,rj) (2.19)

(riX,rj )∈L (ri↔rj )

where r r indicates that the two reads r and r are oriented according i ↔ j i j to the mate library.

(Optical-Map-Constraint (OM)): The observed distribution of restriction • enzyme sites in the layout L, C = a , a ,...,a , is consistent with the obs h 1 2 ni distribution of experimental optical map data C = b , b ,...,b (ob- src h 1 2 ni tained by a restriction enzyme digestion process).

This thesis focuses on constraints O(L) and MP (L) for which detailed for- mulations are given in chapter 4. The details for the Optical-Map-Constraint

(OM) score will be addressed in the future work with suitable modification of the framework presented here. The general idea would be to use a scoring system based on the following scoring function:

n a b 2 χ2 = k − k (2.20)  σk  Xk=1 where σk is the standard deviation of the observed distribution of restriction enzyme sites (modeled as normally distributed random variables). In practice, optical maps are not error-free and the presence of sequencing errors complicates

56 the matching process by introducing false-cuts and missing-cuts that must be properly accounted for. A possible solution to reduce the effect of these errors is to employ multiple restriction fragments in order to increase the map resolution.

Another option is to design a Bayesian score function analogous to the validation approach presented in [6], but this method requires a dynamic programming implementation which could be too computationally expensive to use during the assembly process. A similar approach can be used in conjunction with long range data generated by the Bionanomatrix4 platforms. An efficient and accurate solution must be designed to somehow combine the best of these two approaches.

Each property in definition 14 plays an important role in resolving problems that arise when real genomic data is used (e.g., data containing repeat-regions, rearrangements, segmental duplications, etc.). Note that in absence of additional information, among all possible layouts, the minimum length layout is typically preferred (shortest superstring), although, as previously explained, this choice is difficult to justify. As the genomic sequence deviates further and further from a random sequence, minimum length layouts typically introduce various mis- assembly errors (e.g., compression, insertions, rearrangements, etc.). Note that, traditionally, assemblers have only optimized/approximated one of the properties

(i.e., (O)) listed above, while checking for the others in a post-processing step. If the formulation were only to contain the overlap-constraint O, then the problem would correspond to the sequence assembly problem as defined in 9 for the String- graph. The other two constraints do not reduce the complexity of the problem so this new formulation still belongs to the class of -complete problems. Finally, N P 4http://www.bionanomatrix.com/

57 note that this list of constraints is not exhaustive, and it will likely change from year to year as new sequencing technologies become available and new types of long range information become possible to produce. It is thus important to have an assembly framework that can dynamically and effortlessly adapt to the new technologies and constraints.

2.6.3 Relation to the prior art

At this point it is important to relate the new SAP3 formulation to the one originally presented in Myers’ paper in 1995 [67]. In his seminal paper, Myers defines the fragment assembly problem as follows:

Definition 15 (Fragment Assembly Problem (F AP )). Given a set of fragments

S =(r ,r ,...,r ) and a maximum error rate ǫ [0, 1], the Fragment Assembly 1 2 n ∈ Problem (F AP ) is the problem of finding a reconstruction R and ǫ-valid layout of the reads whose observed distribution of fragment read start points, Dobs, has the minimum relative deviation from Dsrc (the source distribution of the sampling process).

In particular, the goodness-of-fit between the observed distribution and the hypothesized source distribution is modeled using the density function of the

Kolmogorov-Smironv (KS) test statistic [52, 93], which gives a likelihood function to optimize (the details of this function can be found in the original paper [67]).

Although this formulation also defines the problem as an optimization problem

(under the KS statistic), it is not general enough to make use of additional long- range information (used as constraints in the SAP3 formulation). But most

58 importantly, although the problem is modeled as an optimization problem, the proposed algorithmic solution in [67] does not tackle the whole assembly problem

(with constraints) directly but it proposes a graph-theoretic strategy which is the core of the overlap-layout-consensus (OLC) paradigm widely used in many

first-generation assemblers (see chapter 3 for a review of assembly paradigms). In particular, lemma 4 of Myers’ paper proves that the shortest-common-superstring of the “chunk”5 graph is the shortest-common-superstring (SSP) of the reads themselves. This means that the set of graph transformation used to create the chunks is lossless under that model of sequence assembly. However, as we have shown in section 2.3, the SSP can yield biologically implausible and incorrect solutions because it is oblivious to how biological sequences are organized by evolution.

It is also relevant to mention that in [67] Myers proposes to design “algorithms that are capable of solving a ‘pure’ shotgun problem subject to a collection of overlap, orientation, and distance constraints that model the additional informa- tion provided by the directed components of the strategy.” However, he explains that such a shotgun-with-constraints problem is not addressed in his paper, but it should be explored “if there is to be any hope of solving these more difficult con- straint problems” [67]. The formulation of sequence assembly problem presented in this section covers such theoretical gap and, together with the algorithmic so- lution presented in chapter 4, it represents a further step toward algorithms that solve the pure sequence assembly problem.

5Chunks correspond to unitigs (uniquely assembled contigs) in the OLC framework.

59 Chapter 3

Sequence Assemblers and

Assembly Paradigms

3.1 Introduction

The history of sequence assembly can be seen as a sequence of responses to changes in sequencing technologies, and, as a result, several sequence assemblers and assembly paradigms have been designed over the last 30 year to accommodate these changes. This chapter first presents a historical perspective on sequence as- sembly, then it reviews the major assembly paradigms that have been successfully applied to large sequencing projects. All of the major sequence assemblers are categorized according to the assembly paradigm that they adopt.

60 3.2 A Historical Perspective on Sequence As-

sembly

Computational biologists first formalized the shotgun sequence assembly problem in terms of an approximation to finding the shortest common superstring (SCS) of a set of sequences [97]. Because of the theoretical computational intractability

( -completeness) of exact SCS-problem (SCSP), most of the approaches for N P genome sequence assembly have resorted to greedy and heuristic methods that by definition restrict themselves to near-optimal solutions, where the “nearness” may be guaranteed within a multiplicative competitiveness factor, e.g., four (see next section for a review of sequence assembly paradigms). In this context, the accuracy of the resulting reference genome sequences and their suitability for biomedical applications play a decisive role, as they additionally depend upon many parameters of the sequencing platforms: read lengths, base-calling errors, homo-polymer errors, etc. These parameters continue to change at a faster-and- faster pace as the platform chemistry and engineering continue to evolve.

To further emphasize this point, we note that there are now several efforts to develop a relatively cheap (e.g., $1000) genome sequencing technology of ac- ceptable accuracy (e.g., one base error in 10,000 bps) and high-speed (e.g., a turn-around time of less than a day). However, it would be more useful to design assembly algorithms that are independent of the particular technology. Such a strategy would allow scientists to accommodate changes in technology more rapidly in the future years.

61 Name Read Type Algorithm Reference SUTTA long & short B&B (Narzisi and Mishra [74], 2010) Arachne long OLC (Batzoglou et al. [11], 2002) CABOG long & short OLC (Miller et al. [64], 2008) Celera long OLC (Myers et al. [69], 2000) Edena short OLC (Hernandez et al. [32], 2008) Minimus (AMOS) long OLC (Sommer et al. [95], 2007) Newbler long OLC 454/Roche CAP3 long Greedy (Huang and Madan [34], 1999) PCAP long Greedy (Huang et al. [35], 2003) Phrap long Greedy (Green [30], 1996) Phusion long Greedy (Mullikin and Ning [66], 2003) TIGR long Greedy (Sutton et al. [96], 1995) ABySS short SBH (Simpson et al. [92], 2009) ALLPATHS short SBH (Butler et al. [18], 2008) ALLPATHS-LG short SBH (Gnerre et al. [29], 2010) Contrail short SBH (Schatz M. et al., 2010) Euler long SBH (Pevzner et al. [79], 2001) Euler-SR short SBH (Chaisson and Pevzner [19], 2008) Ray long & short SBH (Boisvert et al. [15], 2010) SOAPdenovo short SBH (Li et al. [60], 2010) Velvet long & short SBH (Zerbino and Birney [104], 2008) PE-Assembler short Seed-and-Extend (Nuwantha and Sung [75], 2010) QSRA short Seed-and-Extend (Bryant et al. [16], 2009) SHARCGS short Seed-and-Extend (Dohm et al. [22], 2007) SHORTY short Seed-and-Extend (Hossain et al. [33], 2009) SSAKE short Seed-and-Extend (Warren et al. [101], 2007) Taipan short Seed-and-Extend (Schmidt et al. [88], 2009) VCAKE short Seed-and-Extend (Jeck et al. [41], 2007) Table 3.1: List of sequence assemblers. Reads are defined as “long” if produced by Sanger technology and “short” if produced by Illumina technology . Note that Velvet was designed for micro-reads (e.g. Illumina) but long reads can be given in input as additional data to resolve repeats in a greedy fashion.

To a first approximation, the history of genome assembly could be read as a long series of rapidly adapting responses to changes in sequencing paradigms. At the start of the HGP, the proposed assembly strategy was based on the BAC-by-

BAC (recursive and hierarchical divide-and-conquer) approach. In this approach, the genome is first broken up into a collection of large overlapping in-vivo-clonable

62 fragments (between 160 and 200 Kb) – called Bacterial Artificial Chromosomes or BACs. Since the location of the BACs could be mapped through restriction-

finger-print-based overlaps and the resulting tiling-paths (using specialized lab- oratory protocols), this strategy only required assembly of sequence reads from each individual BAC, which could be accomplished independently and in par- allel using the standard shotgun assembly method, e.g., using the Phrap [30] assembler. In an effort to reduce the assembly cost and time, in the late-90’s it was proposed instead to tackle directly the full genome assembly problem by shotgun sequencing. Specialized assemblers, e.g., CELERA [69] and ARACHNE

[11], were then developed to deal with this type of data, often collected from long eukaryotic genomes.

However, as previously highlighted in chapter 1, when next generation short read technologies first began to appear, all of the conventional sequence assem- blers originally developed for longer Sanger reads failed to scale well on the new data. Consequently, new assemblers needed to be designed and developed ab ini- tio to handle the features of the new sequencers (short and high coverage reads).

Analogously, since mate-pairs are currently the most common auxiliary informa- tion used to resolve ambiguities due to repeats, these next-generation (Gen-1) assemblers have incorporated hard-wired mate-pair-centric heuristics that are practically useless in dealing with other new and higher quality auxiliary data

(e.g., dilution sequencing, optical maps, etc.).

Historically, all these assemblers – each representing many man-years of effort

– appear to require complete and costly overhauls, with each introduction of a new

63 short-read or long-range technology. Arguably, there is a need for novel assembly platforms that are flexible enough to handle future sequencing technologies with minimal changes to their infrastructure.

3.3 Assembly Paradigms

Based on the underlying search strategies, most of the assemblers belong to just two major categories: greedy and graph-based. Table 3.1 presents a nearly ex- haustive list of the major sequence assemblers organized by assembly paradigms.

It also contains the sequencing platform(s) that each assembler is capable of han- dling. Note that for clarity of exposition, SUTTA is also included in this table but its description is postponed to the next chapter.

3.3.1 Greedy

Because of its assumed computational intractability ( -completeness of the N P related “Shortest Common Superstring Problem”), most of the successful ap- proaches for genome sequence assembly have resorted to greedy methods. Greedy algorithms typically construct the solution incrementally using the following ba- sic steps: (i) pick the highest scoring overlap; (ii) merge the two overlapping fragments and add the resulting new sequence to the pool of sequences; (iii) repeat until no more merges can be carried out. Algorithm 1 shows the pseudo code of a generic greedy assembly strategy, while figure 3.1 illustrates with an example the greedy marge process.

64 Algorithm 1: GREEDY - pseudo code Input: Set of reads/fragments Output: Set of contigs 1 repeat 2 Calculate pairwise alignments of all fragments; 3 Choose two fragments with the largest overlap; 4 Merge chosen fragments and add the resulting new sequence to the pool of sequences; 5 heuristically correct errors (e.g., using mate-pairs); 6 until only one fragment is left ; 7 return Set of contigs;

A A 130 1) 130 B B

B 35 C 2) 75 C D C A 75 130 3) D 35 B C 75 D

Figure 3.1: Example of greedily merging three fragments (the numbers represent the overlap sizes).

In addition, after each merge operation, the region of the overlay is heuristi- cally corrected in some reasonable manner (whenever possible). Regions that fail to yield to these error-correction heuristics are relinquished as irrecoverable and shown as gaps. Mate-pairs information is used judiciously during the merging

65 process to further validate the connection between the two sequences. At the end of this process a single solution (consisting of the set of assembled contigs) is generated as output. Note that were the genome completely random, the overlap information could be computed unerringly, and would thus suffice for error-free reassembly of the target sequence. In this case, greedy algorithms would perform satisfactorily. As hinted earlier, the problem is complicated by the presence of non-random structure in genomes (e.g., repeated regions, rearrangements, seg- mental duplications — which are particularly pervasive in Eukaryotes), and it makes the greedy strategy invariably fail. Well known assemblers in this cat- egory include: TIGR [96], PHRAP [30], CAP3 [34], PCAP [35] and Phusion

[66].

3.3.2 Graph-based

Graph-based algorithms start by preprocessing the sequence-reads to determine the pair-wise overlap information and represent these binary relationships as edges in a string-graph. The problem of finding a consistent lay-out can then be formulated in terms of searching a collection of paths in the graph satisfying certain specific properties. The paths correspond to contigs (contiguous sequences of the genome, consistently interpreting disjoint subsets of sequence reads). Con- tingent upon how the overlap relation is represented in these graphs, two domi- nant assembly paradigms have emerged: Overlap-Layout-Consensus (OLC) and

Sequencing-by-Hybridization (SBH).

66 Overlap-Layout-Consensus. In the OLC approach, the underlying graph

(overlap graph) comprises nodes representing reads and edges representing over- laps. Ideally, the goal of the algorithm is to determine a simple path traversing all the nodes — that is, a Hamiltonian path. For a general graph, this problem is known to lead to an -hard optimization problem [61] though polynomial- N P time solutions exist for certain specialized graphs (e.g., interval graphs, etc.) [36].

In order to circumvent the theoretical intractability of this problem, a popular

7

1 6 1 4 7 2 5 3 6 Transitivity 2 Edges 5

3 4

7

1 6

2 5

3 4 Matep−Pairs

Figure 3.2: Example of layout representation and transformations in the OLC framework. heuristic strategy is typically employed as follows:

1. remove “contained” and “transitivity” edges;

67 2. collapse “unique connector” overlaps (chordal subgraph with no conflicting

edges) to compute the contigs;

3. use mate-pairs to connect and order the contigs.

Hence, the set of computed contigs corresponds to the set of nonintersecting simple paths in the reduced graph. Figure 3.2 shows an example of OLC graph and its sequence of transformations. Well known assemblers in this category include: CELERA [69], Arachne [11], Minimus [95] and Edena [32].

Sequencing-by-Hybridization. In the SBH approach, the underlying graph encodes overlaps by nodes and the reads containing a specific overlap by edges incident to the corresponding node (for that overlap). This dual representation may be described in terms of the following steps:

1. partition the reads into a collection of overlapping n-mers (an n-mer is a

substring of length n);

2. build a DeBruijn graph in which each edge is an n-mer from this collection

and the source and destination nodes are respectively the (n 1)-prefix and − (n 1)-suffix of the corresponding n-mer. −

In this new graph, instead of a Hamiltonian path, one seeks to find an Eulerian path containing every edge exactly once. Thus the computed genome sequence provides a consistent explanation for every consecutive n-mers on any sequence read. More importantly, such a graph is linear in the size of the input and allows

(in theory) the computation of an Eulerian path to be carried out in linear time.

68 Because of the succinct representation it generates, this approach has become the algorithm of choice for assembling short reads with very high coverage. However, in practice, many complications arise. First, sequencing errors in the read data introduce many spurious (false-positive) edges which mislead the algorithm. Sec- ond, for any reasonable choice of n, the size of the graph is dramatically bigger than the one in the overlap-layout-consensus strategy. Third, a DeBruijn graph may not have a unique Eulerian path, and an assembler must find a particu- lar Eulerian path subject to certain extraneous constraints; thus, it must solve a somewhat general problem, namely the Eulerian-superpath problem: given an

Eulerian graph and a sequence of paths, find an Eulerian path in the Eulerian graph that contains all these paths as sub-paths. It is known that finding the shortest Eulerian superpath is, unfortunately, also an -hard problem [61]. As N P earlier, the preferred work around is to use a heuristic method that computes such a superpath by applying a series of transformations to the original Eulerian graph. Prominent examples of the SBH approach include: Euler [79], Velvet

[104], ABySS [92] and SOAPdenovo [60].

3.3.3 Seed-and-Extend

Recently, there has appeared yet another smaller category of assemblers specif- ically designed for short reads. They are based on a contig extension heuristic scheme, which uses a prefix-tree to efficiently look up potential extensions. In this framework a contig is elongated at either of its ends so long as there exist reads with a prefix of minimal length, provided that it perfectly matches an end of the

69 contig. Because of their heuristic scheme, these approaches can be categorized as greedy methods, but for the sake of clear exposition they are presented here using a separate category. Example of assemblers belonging to this new category are: PE-Assembler [75], SHARCGS [22], SSAKE [101], QSRA [16] and Taipan

[88].

Finally, all the techniques described earlier need to separately incorporate mate-pairs information as they play an important role in resolving repeats as well as in generating longer contigs, thus dramatically reducing the cost of the assembly-finishing. Note that although mate-pairs are typically expensive and slow to obtain, prone to statistical errors and frequently incapable of spanning longer repeat regions, historically, they have played an important role in the assembly of large genomes as other cheaper and more informative mapping tech- niques have not been widely available. In addition, they have become essential to many emerging sequencing technologies (454, Illumina-Solexa, etc.), which can generate very high coverage short reads data (from 30 bp up to 500 bp).

70 Chapter 4

SUTTA: Scoring-and-Unfolding

Trimmed Tree Assembler

4.1 Introduction

Mired by its connection to a well known -complete combinatorial optimization N P problem — namely, the Shortest Common Superstring Problem (SCSP) — the whole-genome sequence assembly (WGSA) problem has been historically assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their first priority, these methods opted to rely only on local searches, and are thus inherently approximate, ambiguous or error-prone, especially, for genomes with complex structures. Furthermore, since choice of the best heuristics depended critically on the properties of (e.g., errors in) the input data and the available long range information, (i.e., base-calling errors in sequencing technology) and the genome structure, these approaches hindered designing an error free WGSA

71 pipeline.

We dispense with the idea of limiting the solutions to just the approximated ones, and instead favor an approach that could potentially lead to an exhaustive

(exponential-time) search of all possible layouts. Its computational complexity thus must be tamed through a constrained search (branch-and-bound) and quick identification and pruning of implausible overlays. For his purpose, such a method necessarily relies on a set of score-functions (oracles) that can combine different structural properties (e.g., transitivity, coverage, physical maps, etc.). In this chapter we give a detailed description of this novel assembly framework, referred to as SUTTA (Scoring-and-Unfolding Trimmed Tree Assembler) [74, 71, 72].

4.2 History and Motivation

There is no unanimous agreement, within the computer science community at least, that the sequence assembly problem has exhausted all reasonable methods of attack. For example, Richard M. Karp observed “The shortest superstring problem [is] an elegant but flawed abstraction: [since it defines assembly prob- lem as finding] a shortest string containing a set of given strings as substrings.

The SCSP problem is -hard, and theoretical results focus on constant-factor N P approximation algorithms... Should this approach be studied within theoretical computer science?” [43]. In contrast to the work in computational biology, there have now emerged examples within computer science, where impressive progress has been made to solve important -hard problems exactly, despite N P their worst-case exponential time complexity: e.g., Traveling Salesman Problem

72 (TSP), Satisfiability (SAT), Quadratic Assignment Problem (QAP), etc. For ex- ample, the work of Applegate et al. [7] demonstrated the feasibility of solving in- stances of TSP (as large as 85,900 cities) using branch-and-cut, whereas symbolic techniques in propositional satisfiability (e.g., DPLL SAT solver [21]), employing systematic backtracking search procedure (in combination with efficient conflict analysis, clause learning, non-chronological backtracking, “two-watched-literals” unit propagation, adaptive branching, and random restarts), have exhibited the capability to handle more than a million variables.

Inspired by these lessons from theoretical computer science, a novel approach, embodied in SUTTA algorithm, was developed. In the process, several related issues were addressed: developing better ways to dynamically evaluate and vali- date layouts, formulating the assembly problem more faithfully, devising superior and accurate algorithms, taming the complexity of the algorithms, and finally, developing a theoretical framework for further studies along with practical tools for future sequencing technologies.

4.3 SUTTA Algorithm

A common view of de novo genome assembly that has been quietly accepted among the genome sequence community is well expressed by the following quote

[82]: An assembler must either “guess” the correct genome from among a large number of alternatives (a number that grows exponentially with the number of repeats in the genome) or restrict itself to assembling only the non-repetitive seg- ments of the genome, thereby producing a fragmented assembly. Since exponential

73 growth of the time complexity would restrict the analyzable genomes to tiny sizes with very low coverage data, the preceding argument appears to give only a Hob- son’s choice: just find an approximated solution and trade off correctness against the price of exploring potentially exponentially many possible layouts of the reads.

We take exception to this pessimistic view, and develop a new framework that works by forcefully eliminating incorrect solutions (i.e., implausible layouts).

Traditional graph-based assembly algorithms use either the overlap-layout- consensus (OLC) or the sequencing-by-hybridization (SBH) paradigm (as de- scribed previously), in which first the overlap/DeBruijn graph is built and the contigs are extracted later. SUTTA instead assembles each contig independently and dynamically one after another using the Branch-and-Bound (B&B) strategy.

Originally developed for linear programming problems [54], B&B algorithms are well known searching techniques applied to intractable ( -hard) combinatorial N P optimization problems. The basic idea is to search the complete space of solu- tions. However the caveat is that explicit enumeration is practically impossible

(i.e. has exponential time complexity). The tactics honed by B&B are to limit the search to a smaller subspace that contains the optimum. This subspace is determined dynamically through the use of certain well chosen score functions.

B&B has been successfully employed to solve a large collection of complex prob- lems, whose most prominent members are TSP (traveling-salesman problem),

MAX-SAT (maximal satisfiability) and QAP (quadratic assignment problem).

Note that in the past B&B, was already suggested as a method to use in the context of the sequence assembly problem. However, those proposed strate-

74 gies are substantially different from the branch-and-bound framework adopted in

SUTTA. For example, Gene Myers, in his seminal paper in 1995 [67], proposed a general framework to tackle the sequence assembly problem, which consisted of

(1) assembling the fragments into “chunks” using a series of graph transforma- tions, and (2) use a branch-and-bound procedure to search the space of all chunk paths. However, this B&B procedure was only proposed in the paper and the details were not resolved at the time of publication. Another example of B&B in the context of sequence assembly can be found in the paper by Kececioglu and

Myers of 1995 [44]. Also in this case B&B method is not used to tackle the full problem but only proposed to solve some of its possible sub-problems, such as the fragment orientation problem.

At a high level, SUTTA’s framework views the assembly problem simply as that of constrained optimization (based on definition 14): it relies on a rather simple and easily verifiable definition of feasible solutions as “consistent layouts.”

It generates potentially all possible consistent layouts, organizing them as paths in a “double-tree” structure rooted at a randomly selected “seed” read. A path is progressively evaluated in terms of an optimality criteria, encoded by a set of score functions based on the set of overlaps along the lay-out. This strategy en- ables the algorithm to concurrently assemble and check the validity of the layouts

(with respect to various long-range information) through well-chosen constraint- related penalty functions. Complexity and scalability problems are addressed by pruning most of the implausible layouts, via a branch-and-bound scheme. Ambi- guities resulting from repeats or haplotypic dissimilarities may occasionally delay

75 immediate pruning and force the algorithm to lookahead, but in practice, the computational cost of these events has been low. Because of the generality and

flexibility of the scheme (it only depends on the underlying sequencing technolo- gies through the choice of score and penalty functions), SUTTA is extensible, at least in principle, to deal with possible future technologies. It also allows concurrent assembly and validation of multiple layouts, thus providing a flex- ible framework that combines short and long range information from different technologies.

The high level SUTTA pseudocode is shown in Algorithm 2. Here, two im- portant data structures are maintained: a forest of double-trees (D-tree) and a B set of contigs . At each step a new D-tree is initiated from one of the remaining C reads in . Once the construction of the D-tree is completed, the associated con- F tig is created and stored in the set of contigs . Next the layout for this contig C is computed and all its reads are removed from the set of all available reads . F This process continues as long as there are reads left in the set . Note that for F the sake of a clear exposition, both the forest of D-trees and the set of contigs B C are kept and updated in the pseudocode; however, after the layout is computed, there is no particular reason to keep the full D-tree in memory, especially, where memory requirements are of concern.

Finally, note that the proposed Algorithm 1 is input order dependent. SUTTA adopts the policy to always select the next unassembled read with highest oc- currence as the seed for the D-tree (also used by Taipan; [88]). This strategy minimizes the extension of reads containing sequencing errors. However, em-

76 Algorithm 2: SUTTA - pseudo code Input: Set of N reads Output: Set of contigs 1 := ; /* Forest of D-trees */ B ⊘ 2 := ; /* Set of contigs */ C ⊘ N 3 := r ; /* All the available reads/fragments */ F i { i} 4 whileS( = ) do F 6 ⊘ 5 r := .getNextRead(); F 6 if ( isUsed(r) isContained(r) ) then ¬ ∧ ¬ 7 := create double tree(r); DT 8 := ; B B∪{DT} 9 Contig := create contig( ); CT G DT 10 := ; C C ∪{CTG} 11 .layout(); /* Compute contig layout */ CT:= G .reads ; /* Remove used reads */ F F \{CTG } 12 end 13 end 14 return ; C pirical observations indicate that changing the order of the reads rarely affects structure of the solutions, as the relatively longer contigs are not affected. An explanation for this can be obtained through a probabilistic analysis of the data and a 0-1 law resulting from such an analysis.

4.4 Overlap Score (Weighted transitivity)

Like any B&B approach, a major component of SUTTA algorithm is the score function used to evaluate the quality of the candidate solutions that are dynam- ically constructed using the B&B strategy. SUTTA employs an overlap score based on the following argument. Large de novo sequencing projects typically

77 sA eA A

B sB eB

s C C eC

Figure 4.1: Example of transitivity relation: the overlap regions between reads AB and BC share an intersection. have coverage higher than three, this implies that frequently two overlapping re- gions of three consecutive reads in a region of correct layout share intersections.

Events of this type are “witness” to a transitivity relation between the three reads and they play an important role in identifying true positive1 overlaps with high probability. Figure 4.1 shows an example of transitivity relation between three reads A, B and C. During contig layout construction, the overlap score uses the following basic principle to dynamically compute the score value of a candidate solution: if read A overlaps read B, and read B overlaps read C, SUTTA will score those overlaps strongly if in addition A and C also overlap:

if(π(A, B) π(B,C))then S(π(A,B,C)) = ∧ { S(π(A, B)) + S(π(B,C))+(π(A, C)?S(π(A, C)) : 0) (4.1) }

The implicit assumption of a coverage of 3 can be further generalized to higher coverages in an obvious manner. Note that the score of a single overlap corresponds to the score computed by the Smith-Waterman alignment algorithm

(for long reads) or exact matching (for short reads). Clearly the total score of a

1The two reads correctly originate from the same place in the genome.

78 candidate solution is given by the sum of the scores along the overlaps that join the set of reads in the layout L plus the score of the transitivity edges (if any):

g(L)= rj ∈L S(π(rj−1,rj,rj+1)) Pj∈2,...,N−1 = S(π (r 1 ,r 2 )) + S(π (r 1 ,r 2 )), (4.2) πi∈O i j j πk∈T k j j P P where O and T are respectively the set of overlaps and transitivity edges (the set of reads is defined by the layout L) and S(π) is the score (Smith-Waterman, exact match, etc.) for the overlap.

This step in SUTTA resembles superficially to the Unitig (unique regions of the genome) construction step in overlap-layout-consensus assemblers. Specifi- cally, in these assemblers, one of the reduction steps applied to the overlap graph consists of removing all the transitivity edges, as it makes it simpler to find the unitigs directly from the simple paths in the graph (corresponding to se- quences of reads connected by transitivity relations). However, unlike SUTTA, in the overlap-layout-consensus approach the weights of the overlaps are ignored in meaningfully scoring the paths. Since Unitig construction can be computa- tionally expensive, large scale assemblers like Celera/CABOG [64] have adopted a strategy, where Unitigs are computed as chains of adjacent reads with best over- lap between each other. This technique takes time and space linear in the number of reads. Although Celera’s approach uses the overlap score during assembly, it is only applied locally for Unitigs — it neither scores the contigs globally nor generates multiple Unitigs solutions. Finally, it must be noted that the overlap scores are insufficient to resolve long repeats or haplotypic variations. The score

79 functions must be augmented with constraints (formulated as reward/penalty terms) arising from mate-pair distance information or optical map alignments.

4.5 Node expansion

The core component of SUTTA is the branch-and-bound procedure used to build and expand the D-tree (create double tree() procedure in Algorithm 2). The high-level description of this procedure is as follows:

1. Start with a random read (it will be the root of a tree; use only the read

that has not been used in a contig yet, or that is not contained).

2. Create RIGHT Tree: start with an unexplored leaf node (a read) with

the best score-value; choose all its non-contained right-overlapping reads

(Extensions() procedure in Algorithm 3); Filter out the set of overlap-

ping reads by pruning unpromising directions (T ransitivity(),DeadEnds(),

Bubbles() and MateP airs() procedures in Algorithm 3); expand the re-

maining nodes by making them its children; compute their scores. (Add

the “contained” nodes along the way, while including them in the computed

scores; check that no read occurs repeatedly along any path of the tree).

STOP when the tree cannot be expanded any further.

3. Create LEFT Tree: Symmetric to previous step.

Algorithm 3 presents the pseudocode of the expansion routine (details for each subroutine are available in the Appendix 7.6.1). In this framework each path

80 constructed using Algorithm 3 corresponds to a possible layout of the reads for the current contig. Unlike the graph-based approaches (OLC and SBH), multiple paths/layouts are concurrently expanded and validated. Based on the branching strategy, two versions of SUTTA are available: if at the end of the pruning process there are still multiple directions to follow the branching is either terminated

(conservative) or not (aggressive). In the aggressive case, the algorithm choses the direction to follow with the highest local overlap. Algorithm 3 is applied twice to generate LEFT and RIGHT trees from the start read. Next, to create a globally optimal contig, the best LEFT path, the root and the best RIGHT path are concatenated together. Figure 4.2 illustrates the steps involved in the construction of a contig.

The amount of exploration and resource consumption is controlled by the two parameters K and T : K is the max number of candidate solution allowed in the queue at each time step, while T is the percentage of top ranking solutions compared to the current optimum score. At each iteration the queue is pruned such that its size is always max(K, T Q ), where Q is the current size of ≤ | | | | the queue. Note that while K remains fixed at each iteration of Algorithm 3, the percentage of top ranking solutions dynamically changes over time. As a consequence, more exploration is performed when many solutions evaluate to nearly identical scores.

A few additional implementation notes must be mentioned: (i) Checking right- or left-overlapping properties between two reads is only required while expanding the root; checking just the consistency relation for the non-root node

81 iue42 otgcntuto:( construction: Contig 4.2: Figure n IH re o h otnd;( node; root the for trees RIGHT and rc of track ftersoe,ec oee,cretyAgrtm3ol re (function only score 3 maximal the Algorithm with currently solution However, single etc. scores, a their output of its in expans solutions multiple the generate during potentially kept can are tree the of branches intermediate r rknarbitrarily. broken are n ondtgte;( together; joined and ucs ( suffices. path. full ob nlddi n etpt.( left-path. any in included be to

Reads layout Best path Double tree used ii ato utb ae naodn ed rmtebs right- best the from reads avoiding in taken be must Caution ) , explored iii , overlapping h ed aoti optdfrtesto ed nthe in reads of set the for computed is layout reads the ) i iii h -rei osrce ygnrtn LEFT generating by constructed is D-tree the ) and , oebo-epn utb oet keep to done be must book-keeping Some ) 82 ii Start node etlf n ih ah r selected are paths right and left best ) contained g ecie nscin44.Ties 4.4). section in described eainhp.Snedifferent Since relationships. drn hmi terms in them rank nd o,tealgorithm the ion, un notu a output in turns path Algorithm 3: Node expansion

Input: Start read r0, max queue size K, percentage T of top ranking solutions, dead-end depth Wde, bubble depth Wbb, mate-pair depth Wmp Output: Best scoring leaf 1 := ; /* Set of leaves */ V ⊘ 2 := (r ,g(r )) ; /* Live nodes (priority queue) */ L { 0 0 } 3 while ( = ) do L 6 ⊘ 4 := P rune( ,K,T ); /* Prune the queue */ L L 5 r := .popNext(); /* Get the best scoring node */ i L 6 := Extensions(r ); /* Possible extensions */ E i 7 (1) := T ransitivity( ,r ); /* Transitivity pruning */ E E i 8 (2) := DeadEnds( (1),r , W ); /* Dead-end pruning */ E E 0 de 9 (3) := Bubbles( (2),r , W ); /* Bubble pruning */ E E 0 bb 10 (4) := MateP airs( (3),r , W ); /* Mate pruning */ E E 0 mp 11 if ( (4) == 0) then |E | 12 := r ; /* r is a leaf */ V V∪{ i} i 13 else 14 for (j=1 to (4) ) do |E | 15 := (r ,g(r )) ; L L∪{ j j } 16 end 17 end 18 end 19 return max g(r ) ; ri∈V { i }

83 4.6 Search Strategy

A critical component of any branch-and-bound approach is the choice of the search strategy used to explore the next sub-problem in the tree. There are several variations among strategies (with no single one being universally accepted as ideal), since these strategies’ computational performance varies with the problem type. The typical trade-off is between keeping the number of explored nodes in the search tree low and staying within the memory capacity. The two most common strategies are Best First Search (BeFS) and Depth First Search (DFS).

BeFS always selects among the live (i.e., yet to be explored) subproblems, the one with the best score. It has the advantage of being theoretically superior since whenever a node is chosen for expansion, a best-score path to that node has been found. However, it suffers from memory usage problems, since it behaves essentially like a Breadth First Search (BFS). Also checking repeated nodes in a branch of the tree is computationally expensive (linear time). DFS instead always selects among the live subproblems the one with largest level (deepest) in the tree. It does not have the same theoretical guarantees of BeFS but the memory requirements are now bounded by the product of the maximum depth of the tree and the branching factor. The other advantage is that checking if a read occurs repeatedly along a path can be done in constant time by using the depth-

first search interval schemes. For SUTTA we use a combined strategy: using

DFS as overall search strategy, but switching to BeFS, when choice needs to be made between nodes at the same level. This strategy can be easily implemented by ordering the set of live nodes of Algorithm 3 using the following precedence L

84 relation between two nodes x and y:

depth(x) > depth(y)   x y iff  or , (4.3) ≺  depth(x) == depth(y) score(x) > score(y)   ∧  where depth is the depth of the node in the tree and score is the current score of the node (defined in section “Overlap Score”). Because BeFS is applied locally at each level the score is optimized concurrently.

4.7 Pruning the Tree

Transitivity pruning. The potentially exponential size of the D-tree is con- trolled by exploiting certain specific structures of the assembly problem that permit a quick pruning of many redundant and uninformative branches of the tree — surprisingly, substantial pruning can be done only using local structures of the overlap relations among the reads. The core observation is that it is not prudent to spend time on expanding nodes that can create a suffix-path of a previously created path, as no information is lost by delaying the expansion of the last node/read involved in such a “transitivity” relation. This scenario can happen every time there is a transitivity edge between 3 consecutive reads (see

figure 4.1), and it is further illustrated in Figure 4.3 with an example. Suppose that A, B , B ,...,B are n + 1 reads with a layout shown in figure 4.3. The h 1 2 ni local structure of the D-tree will have node A with n children B1, B2,...,Bn.

However , since B1 also overlaps B2, B3,...,Bn these nodes will appear as chil-

85 A

B 1 A B 2

B 3

h B n 1 h 2 h 3 h n h1 h 2 h B B 3 1 BB2 3 n

h n

Figure 4.3: Example of transitivity pruning: expanding nodes B2,...,Bn can be delayed because their overlap with read A is enforced by read B1. dren of B1 at the next level in the tree. So the expansion of nodes B2, B3,...,Bn can be delayed because their overlap with read A is enforced by read B1. Simi- larly arguments hold for nodes B2, B3,...,Bn. In the best scenario, this kind of pruning can reduce a full tree structure into a linear chain of nodes. Additional optimization can be performed by evaluating the children according to the fol- lowing order (h h h ), where h is the size of the hang2 for read 1 ≤ 2 ≤ ··· ≤ n i

Bi. This ordering gives higher priority to reads with higher overlap score. This explains how the T ransitivity() procedure from Algorithm 3 is performed.

Zig-zag overlaps mapping. Although based on a simple principle, the time complexity of the transitivity pruning is a function of how quickly it is possible to check the existence of an overlap between two reads (corresponding to the red arrows of figure 4.3). The general problem is the following: given the set of overlaps O (computed in a preprocessing step) for a set of reads F , check the

2Size of the read portion that is not involved in the overlap.

86 existence of an overlap (or set of overlaps) for a pair of reads (r1,r2). The naive strategy that checks all the possible pairs takes time O(n2) where n = O . If | | a graph-theoretic approach is used, by building the overlap-graph information

(adjacency list), this operation takes time O(l) where l is the size of the longest adjacency list in the graph. However a much better (with O(1) expected time) approach uses hashing. The idea is to build a hash-table, in which a pair of reads is uniquely encoded to a single location of the table by using the following hash-function: (a + b)(a + b 1) H(a, b)= − + (1 b), (4.4) 2 − where a and b are the unique identification numbers of the two reads. This is the well known zig-zag function which is the bijection often used in countability proofs. The number of possible overlaps H(a, b) between two reads is always | | bounded by some constant c which is a function of the read length, genome structure (e.g., number of simple repeats) and the strategy adopted for the overlap computation (Smith-Waterman, exact match, etc.). In practice the constant c is never too large because even when multiple overlaps between two reads are available (typically 4), only a small subset with a reasonably good score (i.e. above a threshold) is examined by the algorithm.

4.8 Lookahead

Mate-pairs. Had the overlapping phase produced only true-positive overlaps, every overlapping pair of reads would have been correctly inferred to have orig-

87 R R

A B C

B

Start node

A

C

Figure 4.4: Lookahead: the repeat boundary between reads B and C is resolved looking ahead in the subtree of B and C, and checking how many and how well the mate-pair constrains are satisfied. inated in the same genomic neighborhood, thus turning the assembly process to an almost trivial task. However, this is not the case — the overlap detection is not error-free and produces false-positive or ambiguous overlaps abundantly, especially when repeat regions are encountered. A potential repeat boundary between reads A, B and C is shown in figure 4.4. Read A overlaps both reads

B and C, but B and C do not overlap each other. Thus, the missing overlap between B and C is the sign of a possible repeat-boundary location, making the pruning decisions impossible. However, SUTTA’s framework makes it possible to resolve this scenario by looking ahead into the possible layouts generated by the

88 two reads, and keeping the node that generates the layout with the least number of unsatisfied constraints (i.e., consistent with mate-pair distances or restriction fragment lengths from optical maps).

SUTTA’s implementation generates two subtrees: one for node B and the other for C (see figure 4.4). The size of each subtree is controlled by the parameter

Wmp, the maximum height allowed for each node in the tree. The choice of Wmp is both a function of the size of the mate-pair library, local genome coverage and the genome structure. For genomes with short repeats a small value for Wmp is sufficient to resolve most of the repeat boundaries, and can be estimated from a k-mer analysis of the reads. However some genomes have much higher complexity

(family of LINEs, SINEs and segmental duplications with varying homologies).

In this case, a higher value of Wmp is necessary, but can be estimated adaptively. Once the two (or occasionally more) subtrees are constructed, the best path is selected based on the overlap score and the quality of each path is evaluated by a reward/penalty function corresponding to mate-pair constraints. For each node in the path, its pairing mate (if any) is searched to collect only those mate-pairs that cross the connection point between the subtree and the full tree, which are then scored by the following rule:

1, iff (l [µ ασ,µ + ασ]) (r1 r2) ;  ∈ − ∧ ↔   1, iff (l / [µ ασ,µ + ασ]) (r1 r2) ; SMP (r ,r )=  − ∈ − ∧ ↔ (4.5) 1 2   1, iff (r r ) ; − ¬ 1 ↔ 2   0, otherwise.  

89 Here l is the distance between the two reads in the layout, µ and σ are the mean and standard deviation of the mate-pair library, α is a parameter that controls the relaxation of the mate-pair constrains (in the results, fixed at α = 6), and r r denotes that the two reads are oriented towards each other. Such a 1 ↔ 2 score can be easily shown to give higher value to layouts with as few unsatisfied constraints as possible. Note that the mate-pair score is also dependent on local coverage of the reads, so its value should be adjusted/normalized to compensate for the variation in coverage. By penalizing the score negatively and positively according to the constraints, the current formulation assumes uniform coverage.

However, more sophisticated score functions could be employed if it is necessary to precisely quantify the extent to which the score varies with coverage. The mate-pair score f of the full path P is given by the sum of the scores of each pair of reads with feasible constraints in P :

f(P )= SMP (ri,rj) (4.6)

riX,rj ∈P

Note that the current formulation of SMP models only mate-pairs libraries whose reads face against each other. However, most current assemblies use a mixture of paired-end and mate-pair data sets that differ in insert size and read pair orientation. SUTTA’s mate-pair score can be easily adapted to support any read pair orientation and insert size.

Memory management is very important during lookahead: the subtrees are dynamically constructed and their memory deallocated as soon as the repeat boundary is resolved. Also note that the lookahead procedure is performed ev-

90 ery time a repeat boundary is identified, so the extra work associated with the construction and scoring of the subtrees is performed only when repeated regions of the genome are assembled. Finally, note that the construction of each subtree follows the same strategy (from Algorithm 3) and uses the same overlap score

(defined in section “Overlap Score”). However, recursive lookahead is not permit- ted. The mate-pair score introduced in (4.5) is used only to prune one of the two original nodes under consideration (or both, in the rare but possible scenarios, where neither of the subtrees satisfies the mate-pair constraints). This explains how the MateP airs() procedure from Algorithm 3 is performed.

Dead-ends and Bubbles. Base pair errors in short reads from next generation sequencing produce an intuitively non-obvious set of erroneous paths in the graph and tree structures. Because perfect matching is used to compute the overlaps, these errors vary according to where the base error is located. Two possible ambiguities need to be resolved: dead-ends and bubbles. Dead-ends consist of short branches of overlaps that extend only for very few steps and they are typically associated with base errors located close to the read ends (see figure

4.5). Bubbles instead manifest themselves as false branches that reconnect to a single path after a small number of steps. They are typically caused by single nucleotide difference carried by a small subset of reads (see figure 4.6). Note that for human genomes bubbles might have been caused by either errors in the reads or haplotypic differences due to the structure of the human genome. In the second case both paths should be kept and given in output. The lookahead procedure is easily adapted to handle these kind of structures. Specifically for dead-ends, each

91 Figure 4.5: Dead-end: short branches of overlaps that extend only for very few steps. They typically associated with base errors located close to the read ends.

Figure 4.6: Bubble: false branches that reconnect to a single path after a small number of steps. They are typically caused by single nucleotide difference carried by a small subset of reads. branch is explored up to depth Wde and all the branches that have shorter depth are pruned. In the case of bubbles, both branches are expanded up to depth Wbb and, if they converge, only the branch with higher coverage is kept and the other one is pruned.

4.9 Implementation details

SUTTA assembler is prototyped around the AMOS3 assembly framework (A

Modular Open-Source assembler). AMOS supports a central data repository of various genomic objects (reads, inserts, maps, overlaps, contigs, scaffolds, etc.) to be easily collected and indexed. In addition, the framework provides several algorithms to perform some of the standard steps in the assembly pipeline (e.g.,

Trimming, Overlapping, Error Correction, Scaffolding, Validation). SUTTA’s

3http://amos.sourceforge.net

92 pipeline is composed of three modules: (1) overlapper, (2) contigger, and (3) multi-aligner. We developed our tools for the first two steps (described here).

However, we relied on the “make-consensus” module available in AMOS for the computation of the final consensus sequence. Specifically, for the assembly of long reads, the UMD overlapper [84] has been used to compute the set of overlaps for the input reads. This overlapper keeps the number of repeat-induced spurious overlaps small and it builds the initial overlapping-phase of the algorithm with a reasonably small number of k-mers, whose cardinality is optimized by an order of magnitude through the use of minimizers. For the assembly of short reads, we instead developed our own short read overlapper (see next section).

4.10 Short-Read Overlapper

Current overlappers designed to deal with data generated by next-generation se- quencing technologies (e.g., Illumina Inc. Genome Analyzer, Applied Biosystems

SOLiD System and 454 Life Sciences) need to efficiently deal with an impres- sive amount of data (200X coverage or more in a single run for the bacteria

E. coli). Allowing approximate matching (using dynamic programming) for the computation of the overlaps would not only be computationally intensive, but also significantly increases the number of nonspecific spurious overlaps due to sequencing errors. SUTTA’s framework supports both Smith-Waterman align- ments and perfect matching. However, for short read technologies, we have opted for an approach now popular among many short read assemblers: exact string matching. Because of the high coverage of next-generation sequencing data, this

93 A i j

B

K

i j Set of overlapping reads

Prefix−Tree

Figure 4.7: Overlap computation using the trie data structure. technique produces reliable overlaps, while being drastically faster than approx- imate matching that requires dynamic programming. We first filter the data by removing redundant reads and those containing ambiguous bases. Next we index the remaining reads by a prefix-tree (trie) both in the forward and reverse com- plement direction. Using this data structure we can compute overlaps by a fast and simple traversal of the tree as illustrated in figure 4.7. Note that using this strategy for short reads enforces the overlaps to be only of type prefix or suffix.

94 Chapter 5

Feature-Response Curve

5.1 Introduction

As noted earlier, recent advances in DNA sequencing technology and their fo- cal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby, in- undating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers quality and accuracy. No commonly accepted and standardized method for comparison exists as yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, while poorly capturing the contig quality and accuracy. This chapter addresses these concerns and introduces a novel metric that more satisfactorily captures the trade-offs between quality and contig size.

95 5.2 Assembly Comparison and Validation

Though validation and performance evaluation of an assembler are very important tasks, no commonly accepted and standardized method for this purpose exists as yet. The genome validation process appears to have remained a largely manual and expensive process, with most of the genomes simply accepted as draft assem- blies. For instance, the initial “draft” sequence of the human genome [38] has been revised several times since its first publication, with each revision eliminating various classes of errors through successive algorithmic advances. Nevertheless, genome sequencing continues to be viewed as an inexact craft and inadequate in controlling the number of errors, which in the draft genomes are estimated to be up to hundred or even thousands [86]. The errors in such draft assemblies fall into several categories: collapsed repeats, rearrangements, inversions, etc., with their incidents varying from genome to genome.

It should be noted that the most popular metrics for evaluating an assembly

(e.g., contig size and N50) only emphasize size and poorly capture the contig quality as they do not contain all the information needed to judge the correct- ness of the assembly. For example N50 is defined as the largest number L such that the combined length of all contigs of length L is at least 50% of the total ≥ length of all contigs. In these scenarios, an assembler that sacrifices assembly quality in exchange for contig sizes, appears to outperform others, despite gen- erating consensus sequences replete with rearrangement errors. For example, in the extreme case, an assembly consisting of one large contig of roughly the size of the genome is useless if mis-assembled. On the other extreme, an assembly

96 consisting of many short contigs covering only the inter-repeat regions of the genome could have very high accuracy although contigs might be too short to be used in, for example, gene-annotation efforts. Similarly, just a simple count of the number of mis-assembled contigs obtained by alignments to the reference genome (if available), a metric typically used to compare short-read assemblies, is also inadeguate, because it does not take into account the various structural properties of the contigs and of the reads contained in it. For example, one single mis-assembled contig could represent the longest contig in the set and it could include multiple types of errors (mate-pair orientation, depth of coverage, poly- morphism, etc) which should be weighted differently. Although the evaluation of the tradeoff between contig length and errors is an important problem, there is very little in the literature to address this topic and help evaluate the assembly quality of different assemblers.

Consequently, we have developed a new metric, Feature-Response curve (FRC)

[73], which captures the trade-offs between quality and contig size more accurately

(see section 5.3). The FRC shares many similarities with classical ROC (receiver- operating characteristic) curves, which are commonly employed to compare the performance of statistical inference procedures. Analogous to ROC, FRC empha- sizes how well an assembler exploits the relation between incorrectly-assembled contigs (false positives, contributing to “features”) against gaps in assembly (false negatives, contributing to fraction of genome-coverage or “response”), when all other parameters (read-length, sequencing error, depth, etc.) are held constant.

97 5.3 Feature-Response Curve

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-

Response curve characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features). The AMOS package provides an automated assembly validation pipeline called amosvali- date [80] that analyzes the output of an assembler using a variety of assembly quality metrics (or features). The features include:

(M) mate-pair orientations and separations, •

(K) repeat content by k-mer analysis, •

(C) depth-of-coverage, •

(P ) correlated polymorphism in the read alignments, and •

(B) read alignment breakpoints to identify structurally suspicious regions • of the assembly.

After running amosvalidate on the output of the assembler, each contig is assigned a number of features that correspond to doubtful regions of the sequence. For example, in the case of mate-pairs checking (M), the tool flags regions where multiple matepairs are mis-oriented or the insert coverage is low. Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. More specifically, for a fixed feature threshold φ, the contigs are sorted by size and, starting from the longest, only those contigs are tallied, if their sum

98 of features is φ. For this set of contigs, the corresponding approximate genome ≤ coverage is computed, leading to a single point of the Feature-Response curve.

Algorithm 4 shows the pseudo-code of the FRC computation.

Algorithm 4: Feature-Response Curve - pseudocode Input: Set of contigs tagged with features/errors computed using C F amosvalidate. Output: Feature-Response Curve. 1 sort( ); /* Sort contigs by length (longest to shortest) */ C 2 for (k=1 to 100) do 3 φ := k ; /* Feature threshold */ k |F|× 100.0 4 sum := 0; /* Sum of features */ 5 tot length := 0; /* Sum of contig lengths */ 6 for (j=1 to ) do |C| 7 fj := cj.getF eatures(); /* Num. of features for contig cj */ 8 sum := sum + fj; /* Update the sum of features */ 9 tot length := tot length + cj.getLength(); 10 if (sum φ ) then ≥ k /* Exit if the sum of features is more than φk */ 11 break; 12 end 13 end tot length 14 cov := 100; /* Approximate coverage for φ */ k genome size × k 15 end 16 return (φ ,cov ),k 1, 2,..., 100 ; k k ∈{ }

Note that no reference sequence is used to compute the FRC curve, which makes the FRC a useful tool in de novo sequencing project where a reference genome is not available to validate and guide the assembly process. In a scenario where the size of the genome is not available any reasonably good estimate of the reference genome size is adequate for the purpose of computing the FRC, since the genome size is simply used as a normalizing denominator across all

99 the assemblers to compare the contigs quality. For example, in the case of re- sequencing, a good estimate for the genome size can be obtained from genomes of the related species. In the case of the de novo sequencing projects the genome size can be judged from estimate of coverage (usually modeled as a dispersed

Poisson) from a subsample of contigs (with some care to eliminate the outliers coming from repeats or difficult to sequence regions). The procedure described in Algorithm 4 can be applied to generate FRCs for each feature separately as shown in chapter 6 and appendix 7.6.1. The inspection of these separate FRCs enables to quantify comparisons of the relative strengths and weaknesses of each assembler. Finally note that the definition of coverage computed by the FRC is only an approximation of the standard one because the contigs are not aligned to the genome. However, it has the property of identifying assemblies where the genome length has been overestimated.

The current formulation of the FRC is exceedingly simple and yet natural.

Thus, we hope that starting from here, more sophisticated versions of the FRC will be developed in the future. For example, the features could be weighted by contig length (density function); additional features may be included; features may be combined or transformed (e.g., eigen-features); the response, instead of coverage, could be another assembly quality metric of choice; etc. It must be emphasized that the features should not be interpreted directly as errors, since, as reported by the developers of amosvalidate [80], the method used to compute each feature may contain some false-positives. These false-positives frequently correspond to irresolvable inconsistencies in the assembly — and not mis-assembly

100 errors or incorrect consensus sequences. Consequently, the results could appear pessimistic for any one assembler, but are unlikely to be skewed in a comparative study, such as the one presented in chapter 6. The utility of Feature-Response curve is thus not diminished by the nature of the simple features, and it should be used in combination with other metrics and alignments to the reference genome

(if available). Note further that since the reported sensitivity of amosvalidate is

92%, almost all the mis-assemblies are captured by one or more features, which ≥ point to possible sources of errors in a particular assembler.

5.4 Implementation details

As with SUTTA, the Feature-Response curve has been also developed as part of the AMOS1 assembly framework (A Modular Open-Source assembler). Following the AMOS philosophy, the FRC is implemented as a pipeline that consists of two steps: 1) invocation to the amosvalidate tool to compute the features for the set of contigs; 2) invocation to the FRC module that implements Algorithm 4.

FRCurve module is part of the AMOS distribution and its documentation is available at the AMOS wiki page2.

1http://amos.sourceforge.net 2http://sourceforge.net/apps/mediawiki/amos/index.php?title=FRCurve

101 Chapter 6

Experimental Comparison of

De Novo Genome Assembly

6.1 Introduction

Since the completion of the Herculean task of the Human Genome project (HGP) in 2003, the genomics community has witnessed a deluge of sequencing projects:

They range from metagenomes, microbiomes, and genomes to transcriptomes; often, they focus on a multitude of organisms, populations and ecologies. In addition, the subsequent advent of high-throughput sequencing technologies – with their promise to considerably reduce the genome sequencing cost – now appear poised to usher in a personal genomics revolution [48, 90, 1].

However, in the ensuing euphoria, what seems to have been left neglected is a constructive and critical retrospection, namely:

1. to appraise the strengths and weaknesses of the schemes, protocols and

102 algorithms that now comprise a typical “sequencing pipeline”;

2. to scrutinize the accuracy and usefulness of the assembled sequences by any

standard pipeline;

3. and to build assembly algorithms that would easily adapt to the fast evolv-

ing biotechnologies.

This chapter addresses these issues by presenting a diverse set of experimental results to compare SUTTA’s performance relative to many assembly algorithms in the literature with a targeted focus on both older Sanger technology and next- generation sequencing technology data. The analyses are performed under both standard metrics (N50, coverage, contig sizes, etc.) as well as the new more com- prehensive metric (Feature-Response Curves, FRC) that has been introduced in chapter 5. Furthermore, visual inspection of the consistency of the assembled contigs is enabled by several graphic representations of the alignment against the reference genome, e.g., through dot-plots. Experimental analysis of the para- metric complexity is also reported here showing that overlap graph complexity, assembly contiguity and assembly quality all strongly depend on the choice of the minimum overlap parameter k.

Specifically, the chapter is organized as follows: the experimental protocol adopted is first described; assembly results and quality analysis are then pre- sented using standard paired and unpaired, low- and high-coverage, long and short reads from previously collected real and simulated data. Experimental analysis, showing the dependency of the assembly contiguity and quality on the choice of the minimum overlap size k, is presented next; and finally, the compu-

103 tational performance of SUTTA compared to several other assemblers concludes the chapter.

6.2 Experimental Protocol

In order to analyze the assembly quality of many different assemblers and to un- derstand the inconsistencies that plague many traditional metrics, it was decided to collect a significant volume of comparative performance statistics using a large benchmark of both bacterial and human genome data. For all the genomes used in this study, the finished sequences are available, thus enabling direct validation of the assemblies. Before discussing the results, we present the benchmark and assemblers that we have selected for comparison and explain the design of the experimental protocol adopted here.

6.2.1 Benchmarks

In evaluating the assembly quality of different assemblers, several criteria were used in choosing bench-mark data sets, assembly-pipelines and comparison met- rics: e.g., statistical significance, ease of reproducibility, accessibility in public domain etc. For example, by avoiding expensive studies with large sized genome- assembly and specialized (but not widely available) technologies, we wished to ensure that the reported results could be widely reproduced, revalidated and extended — even by moderate-sized biology laboratories or small teams of com- puter scientists. To the extent possible, we have favored the use of real data over synthetic data.

104 Consequently, we have not included large genomes (e.g., whole haplotypic hu- man genomes) or single-molecule technologies (PacBiosciences or Optical Map- ping), but ensured that all possible genome structures are modeled in the data

(from available data or through simulation) as are the variations in coverage, read lengths and error rates. The only long range information included has come from mate-pairs. Note, however, that our analysis is completely general, as is the software used in this study, and can be used for broader studies in the future.

For these reasons, following datasets were selected (see table 6.1):

1. Sanger reads data, although now considered archaic, remain an important

benchmark for the future. For instance, various technologies, promised by

PacBiosciences, Life Technologies, and others, seek to match and exceed the

read-lengths and accuracy of Sanger (hopefully, also inexpensively). Also

Sanger-approach remains a statistically reliable source of data and imple-

mentations, since there continue to exist an active community of Sanger

sequencers, a large amount of data and a variety of algorithmic frameworks

dealing with Sanger data. As a result, they provide much more reliable

statistics in the context of comparing so many different algorithmic frame-

works (e.g., greedy, OLC and SBH). Such richness is not yet available from

the current short-read assemblers, which have primarily focused on SBH

(and de Bruijn graph representation).

2. Most recent Illumina machines can now generate reads of about 100 bps

or more. However, our focus on 36 bps Illumina reads is based on the fact

the these datasets have been extensively analyzed by previously published

105 short read assemblers. Since longer reads can only make the assembly

process easier, these datasets still represent some of the hardest instances

of the sequence assembly problem. We have also discovered that longer

reads ( 100 bps) from recent Illumina machines have higher error rates ∼ towards the ends of the reads, thus, limiting the apparent advantage of

longer sequences.

3. By focusing on low-coverage long reads and short reads with high cover-

age, we stress-test all assemblers against the most extreme instances of the

sequence assembly problem, especially, where assembly quality is of essence.

Genome Length (bp) # reads Avg. read Std. Cov. length (bp) (bp) Long reads: Brucella suus 3, 315, 173 36, 276 895.8 44.1 9.8 Wolbachia sp. 1, 267, 782 26, 817 981.9 50.6 20.7 Staphylococcus epidermidis 2, 616, 530 60, 761 900.2 46.2 19.9 Chromosome Y∗ 3, 000, 000 37, 530 800 88 10 Short reads: Staphylococcus aureus 2, 820, 462 3, 857, 879 35 0 47.8 Helicobacter acinonychis 1, 553, 927 12, 288, 791 36 0 284.6 Escherichia coli 4, 639, 675 20, 816, 448 36 0 161.5

Table 6.1: Benchmark data. first and second columns report the genome name and length; columns 3 to 6 report the statistics of the shotgun projects: number of reads, average and standard deviation of the read length and genome coverage (∗[35,000,001 - 38,000,000]).

Long reads. Starting with the pioneering DNA sequencing work of Frederick

Sanger in 1975, every large-scale sequencing project has been organized around reads generated using the Sanger chemistry [87]. This technology could be typi-

106 cally characterized by reads of length up to 1000 bps and average coverage of 10 . × Additional mate-pair constraints are typically available in the form of estimated distance between a pair of reads.

The first data set of Sanger reads consists of three bacterial genomes: Brucella suis [76], Wolbachia sp. [103] and Staphylococcus epidermidis RP62A [28]. These bacteria have been sequenced and fully finished at TIGR, and all the sequencing reads generated for these projects are publicly available at two sites: the NCBI

Trace Archive1, and the CBCB website2. Also included in the benchmark are sequence data from the human genome. Specifically, we selected a region of 3Mb from human Chromosome Y’s p11.2 region. These euchromatin regions of Y

Chromosome are assumed to be particularly challenging for shotgun assembly as it is full of pathologically complex patterning of genome structures at multiple scales and resolutions — usually described as fractal-like motifs within motifs

(repeats, duplications, indels, head-to-head copies, etc.). For this region of the Y

Chromosome we generated simulated shotgun reads as described in Table 6.1. We created two mate-pair libraries of size (µ =2, 500, σ = 166) and (µ = 10, 000, σ =

1, 300) respectively; 90% of the reads have mates (45% from the first library and

45% from the second library), the rest of the reads are unmated; finally, we introduced errors in each read at a rate of 1%.

1http://www.ncbi.nlm.nih.gov/sra/ 2www.cbcb.umd.edu/research/benchmark.shtml

107 Short reads. More recent advances in sequencing technology have produced a new class of massively parallel next-generation sequencing platforms such as:

Illumina, Inc. Genome Analyzer, Applied Biosystems SOLiD System, and 454

Life Sciences (Roche) GS FLX. Although they have orders of magnitude higher throughput per single run (up to 200 coverage) than older Sanger technology, × the reads produced by these machines are typically shorter (35–500 bps). As a result they have introduced a succession of new computational challenges, for instance, the need to assemble millions of reads even for bacterial genomes.

For the short reads technology, we used three different data sets, which have been extensively analyzed by previously published short read assemblers. The

first data set consists of 3.86 million 35-bp unmated reads from the Staphylococcus aureus strain MW2 [10]. The set of reads for this genome are freely available from the Edena assembler website3. The second dataset consists of 12.3 million 36- bp unmated reads for a raw coverage of 284 . This second dataset is for the × Helicobacter acinonychis strain Sheeba genome [23], which was presented in the

SHARCGS [22] paper and is available for download at sharcgs.molgen.mpg.de.

The third data set instead is made up of 20.8 million paired-end 36 bp Illumina reads from a 200 bp insert Escherichia coli strain K12 MG1655 [13] library (NCBI

Short Read Archive, accession no. SRX000429).

3www.genomic.ch/edena.php

108 6.2.2 Assemblers

The following assemblers have been selected for comparison of long read pipelines:

ARACHNE [11], CABOG [64], Euler [79], Minimus [95], PCAP [35], Phrap [30],

SUTTA [74], and TIGR [96]. Simiarly, the following assembler were selected for comparison of short read pipelines: ABySS [92], Edena [32], Euler-SR [19],

SOAPdenovo [60], SSAKE [101], SUTTA [74], Taipan [88], and Velvet [104].

Note that, although the most recent release of CABOG supports short reads from Illumina technology, it cannot be run on reads shorter than 64bp.

This list is meant to be representative (see [49] for a survey), as it includes assemblers satisfying the following two criteria: (i) they have been used in large sequencing projects with some success, (ii) together they represent all the gen- erally accepted assembly paradigms (e.g., greedy, OLC, SBH and B&B; see the discussion in chapter 3) and (iii) the source code or binaries for these assemblers is publicly available on-line, thus enabling one to download and run each of them on the benchmark genomes. In order to interpret the variability in assembler per- formance under different scenarios, both paired and unpaired data were analyzed separately. All the long-read assemblers were run with their default parameters, while parameters for the short-read assemblers were optimized according to recent studies [74] (see table 7 in 7.6.1).

109 6.3 Long reads results

Analysis without mate-pair constraints. Table 6.2 presents the contig size analysis while excluding mate-pair data (thus, ignoring clone sizes and forward- reverse constraints). Since not all next-generation sequencing technologies are likely to produce mate-pair data, it is informative to calibrate to what extent an assembler’s performance is determined by such auxiliary information. Note that ARACHNE had to be omitted from this comparison (and the associated table 6.2), since its use of mate-pairs is tightly integrated into its assembly process and cannot be decoupled from it. Similarly CABOG does not support data that is totally lacking in paired ends.

A caveat with the preceding analysis needs to be addressed: if one wish- ing to select an assembler of the highest quality were to base one’s judgement solely on the standard and popular metrics (as in this table), the result would be somewhat uncanny and unsatisfying. For instance, Pharp would appear to be a particularly good choice, since it seems to typically produce an almost complete genome coverage with fewer contigs and each of sizable length (as confirmed by the N50 values). More specifically, except for Wolbachia, the N50 value of

Phrap is the highest, yielding a respectable genome coverage of the big contigs

(> 10 kbp). Unfortunately, a closer scrutiny of the Phrap-generated assembly

(e.g., the dot plots of the contigs’ alignment) reveals that Phrap’s apparent su- periority is without much substance — Phrap’s weaknesses, as evidenced by its mis-assemblies within long contigs (see alignments in the Appendix 7.6.1), are not captured by the N50-like performance parameters. Phrap’s greedy strategy

110 Genome Assembler # ctgs # big ctgs Max ctg Mean big ctg N50 Big ctg (>10 kbp) size (kbp) size (kbp) (kbp) cov. (%) Brucella Euler 280 118 82 22 19 78.4 suis Minimus 203 101 89 30 32 93.1 PCAP 88 62 198 53 80 100.7 PHRAP 54 23 434 126 199 103.2 SUTTA 73 53 268 62 79 99.2 TIGR 108 67 182 48 57 98.8 Staphylococcus Euler 192 75 78 29 32 85.6 epidermidis Minimus 425 86 119 10 19 80.7 PCAP 109 36 179 72 114 100.1 PHRAP 86 22 357 123 183 103.9 SUTTA 65 31 249 83 116 99.3 TIGR 94 38 230 68 100 99.8 Wolbachia sp. Euler 604 0 6 0 1 0 Minimus 1545 37 16 13 2 40.7 PCAP 1241 41 64 23 3 77.2 PHRAP 2253 55 64 22 1.8 98.5 SUTTA 1089 39 87 26 6 80.8 TIGR 1080 46 46 20 5 73.6 Human Euler 60 27 403 107 266 96.7 Chromosome Y Minimus 850 104 48 18 11 63.1 PCAP 140 38 239 77 112 98.2 PHRAP 4 4 1869 764 1869 101.9 SUTTA 15 10 1020 301 712 100.5 TIGR 1103 108 51 10 8 63.7 Table 6.2: Long reads assembly comparison without mate-pair information (clone sizes and forward-reverse constraints). First and second columns report the genome and assembler names; columns 3 to 7 report the contig size statistics, specifically: number of contigs, number of contigs with size 10kbp, max contig ≥ size, mean contig size, and N50 size (N50 is the largest number L such that the combined length of all contigs of length L is at least 50% of the total length of all contigs). Finally column 8 reports the≥ coverage achieved by the large contigs ( 10kbp). Coverage is computed by double-counting overlapping regions of the contigs,≥ when aligned to the genome. cannot always handle long-range genome structures and when a repeat boundary is found it can be fooled by false positive overlaps. In contrast TIGR, PCAP and

SUTTA have similar performance in terms of N50; however, SUTTA produces a smaller number of big contigs (>10 kbp) compared to TIGR and PCAP, and higher genome coverage (except for the S. epidermidis, where they have simi- lar coverage). All the assemblers encounter various difficulties in assembling the

Wolbachia sp. dataset into long contigs, which is probably due to a higher error

111 rate in the reads. These difficulties are especially noticeable for Euler assembler; in fact, its big contigs coverage comes very close to zero. Minimus instead uses a very conservative approach where, if a repeat boundary is encountered, it stops extending the contig. Such a strategy reduces the possible mis-assembly errors, but causes a considerable pdecrease in contig size.

The results for the 3Mb region of Chromosome Y (from p11.2, a euchromatin region) paint a somewhat different picture. TIGR’s and PCAP’s performances are now inferior, with lower coverage and a higher number of contigs gener- ated. In particular, PCAP performance was obtained by reducing the stringency in overlap detection to tolerate more overlaps (using parameter -d 500), this parameter setting was necessary in order to generate reasonably long contigs.

Phrap still has the best performances in terms of contig size and N50, followed by SUTTA, but now its alignment results do not show mis-assembled contigs

(see Appendix 7.6.1). Surprisingly, Euler now improves the genome coverage for simulated assembly. Note that for Chromosome Y, simulated reads were gener- ated using fairly realistic error distributions, but still raise questions about the simulation’s fidelity (e.g., ability to capture the non uniform coverage pattern, potential cloning bias, etc., that would be inevitable in any real large scale ge- nomic project). Various simplifying assumptions used by the simulators may explain why simulated data appear somewhat easier to assemble.

Since the contig size analysis gives only an incomplete and often misleading view of the real performance of the assemblers, a more principled and informative approach needs to be devised. As described in chapter 5, a new metric, called

112 Feature-Response curve comparison for S. epidermidis (no mate-pairs) Feature-Response curve comparison for Chr. Y (no mate-pairs) 120 160 140 100 120 80 100 60 80 60 40 SUTTA SUTTA Minimus 40 Minimus 20 PHRAP Phrap

Approximate coverage (%) TIGR Approximate coverage (%) 20 TIGR PCAP PCAP 0 0 0 500 1000 1500 2000 2500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Feature threshold Feature threshold

Figure 6.1: Feature-Response curve comparison for S. epidermidis and Chromo- some Y (3Mbp of p11.2 region) genomes when no mate-pairs information is used in the assembly.

“Feature-Response curve” (FRC), is proposed and evaluated to see how well it can check the quality of the contigs and validate the assembly output. Figure 6.1 shows the Feature-Response curves for the S. epidermidis and Chromosome Y

(p11.2 region) genomes when mate-pairs are not used in the assembly. The x- axis is the maximum number φ of errors/features allowed in the contigs and the y-axis reports the approximate genome coverage achieved by all the contigs

(sorted in decreasing order by size) such that the sum of their features is φ ≤ (see section 5.3 for more details). Note that the definition of coverage used in this plot is not the conventional one since we double-count overlapping regions of contigs, when aligned to the genome. We decided to employ such a definition because it highlights assemblies that over-estimate the genome size (coverage greater than 100%). Based on this analysis, SUTTA seems to be performing better than all the other assemblers in terms of assembly quality, however it is important to mention that the current version of the FRC includes several types

113 of assembly errors with a uniform weighting, chosen arbitrarily. For example, a mis-join is generally considered the most severe type of mis-assembly, but this is not currently captured by the FRC. In fact SUTTA clearly creates mis-joined contigs in the absence of paired reads (see dot plots in the Appendix 7.6.1).

However, this problem is alleviated by the addition of paired reads as shown next in the analysis with mate-pair constraints. Finally, note that Euler and Minimus go to extreme short lengths to avoid mis-joins in the absence of paired reads.

Analysis with mate-pair constraints. Table 6.3 presents the results with mate-pairs data, restricting the analysis only to assemblers (ARACHNE, CABOG,

Euler, PCAP, SUTTA and TIGR) that use mate-pair constraints effectively dur- ing the assembly process. Obviously, the use of mate-pair-constraints improves the performance and quality of all four assemblers; however, they do so to vary- ing degrees. For example TIGR’s N50 values are now typically twice as large as those without mate-pairs. In contrast, Euler’s results only improve marginally with mate-pair constraints, and it is still unable to produce contigs larger than 10 kbp for the Wolbachia sp. genome. Note that Euler shows weaker performance in comparison to the results reported on its home-page4 for the bacterial genomes.

Although the exact explanation of this discrepancy is not obvious, it could be due to an additional screening (preprocessing) of the reads that removes low quality regions (note that, here, the analysis of all assemblers assumes no preprocessing.).

ARACHNE and CABOG shows the highest N50 values for all datasets.

As earlier, whereas the contig size analysis indicates all of the following assem-

4http://nbcr.sdsc.edu/euler/benchmarking/bact.html

114 Genome Assembler # ctgs # big ctgs Max ctg Mean big ctg N50 Big ctg (>10 kbp) size (kbp) size (kbp) (kbp) cov. (%) Brucella ARACHNE 33 28 463 119 161 101.2 suis CABOG 30 19 775 175 268 101.5 Euler 258 118 82 22 20 80.1 PCAP 81 34 416 98 131 100.5 SUTTA 72 58 269 56 74 99.2 TIGR 69 43 361 77 112 99.9 Staphylococcus ARACHNE 27 17 565 156 294 101.7 epidermidis CABOG 41 8 655 330 483 103.1 Euler 131 60 123 38 49 89.1 PCAP 103 27 362 98 153 101.5 SUTTA 89 63 244 48 63 97.6 TIGR 51 12 545 220 389 101.1 Wolbachia sp. ARACHNE 100 41 71 26 27 84.7 CABOG 1035 26 181 47 26 178.7 Euler 604 0 6 0 1 0 PCAP 1263 54 42 18 3 79.1 SUTTA 1132 46 104 22 6 86.7 TIGR 1131 29 136 40 6 91.8 Human ARACHNE 5 4 1869 763 1869 101.7 Chromosome Y CABOG 5 4 1851 756 1851 100.9 Euler 37 21 585 139 268 97.9 PCAP 135 33 239 89 130 98.9 SUTTA 17 8 1020 377 737 100.7 TIGR 1030 116 48 18 10 72.6 Table 6.3: Long reads assembly comparison using mate-pair information. First and second columns report the genome and assembler names; columns 3 to 7 report the contig size statistics, specifically: number of contigs, number of contigs with size 10kbp, max contig size, mean contig size, and N50 size (N50 is the ≥ largest number L such that the combined length of all contigs of length L is at least 50% of the total length of all contigs). Finally column 8 reports≥ the coverage achieved by the large contigs ( 10kbp). Coverage is computed by ≥ double-counting overlapping regions of the contigs, when aligned to the genome. blers, ARACHNE, CABOG, PCAP and TIGR producing better performance, a cursory inspection of the Feature-Response curve points to a different conclusion.

Figure 6.2 shows the FRCs of the assemblers for S. epidermidis and Chromosome

Y (p11.2 region) genomes when mate-pairs are used in the assembly. Because Eu- ler assembly output could not be converted into an AMOS bank for validation, it is excluded from the plot.

An intuitive understanding of the different assembly quality can be gleaned

115 Feature-Response curve comparison for S. epidermidis (with mate-pairs)Feature-Response curve comparison for Chr. Y (with mate-pairs) 110 160 100 140 90 120 80 70 100 60 80 50 60 40 SUTTA SUTTA TIGR 40 TIGR 30 ARACHNE ARACHNE

Approximate coverage (%) 20 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 10 0 0 200 400 600 800 1000 1200 1400 1600 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Feature threshold Feature threshold

Figure 6.2: Feature-Response curve comparison for S. epidermidis and Chromo- some Y (3Mbp of p11.2 region) genomes when mate-pairs information is used in the assembly. from Figure 6.3, which shows the dot plots of comparison of assemblies produced by the various assemblers aligned to the completed S. epidermidis genome. The dot plot alignments were generated using the MUMmer5 package [53]. Assemblies generated by CABOG, Euler, PCAP and SUTTA are seen to match quite well with the reference sequence, as suggested by the fraction of matches lying along the main diagonal6. TIGR instead shows many large assembly errors, mostly due to chimeric joining of segments from two distinct non-adjacent regions of the genome. Further, note that two perfect dot plot alignments can still have different quality when analyzed with the FRC. An example is given by compar- ing the contigs generated by ARACHNE and SUTTA for the 3Mb segment of

Chromosome Y’s p11.2. Despite the dot plots showing high alignment quality for both, the FRC scores them very differently (see Appendix 7.6.1). Appendix 7.6.1

5http://mummer.sourceforge.net/ 6Note that since S. epidermidis has a circular genome, the small contigs aligned at the bottom right or top left are not mis-assembled.

116 contains the dot plots for the other genomes and the associated FRCs.

To further analyze the relative strengths and weaknesses of each assembler,

Figure 6.4 shows separate FRCs for each feature type when assembling the S. epidermidis genome using mate-pairs. By inspecting these plots it is clear that each assembler behaves differently according to each feature type. For example,

CABOG outperforms the other assemblers when mate-pair constraints are con- sidered. TIGR and SUTTA outperform the other assemblers in the number of correlated polymorphism in the read alignments. The FRC that analyzes the depth of coverage shows ARACHNE, CABOG and PCAP to be winners in the comparison. Moving to the FRC that analyzes the k-mer frequencies, which can be used to detect the presence of mis-assemblies due to repeats, SUTTA and

TIGR outperform ARACHNE and PCAP, while CABOG performs somewhere in between. The breakpoint-FRC examines the presence of multiple reads that share a common breakpoint, which often indicates assembly problems. PCAP and

ARACHNE seem to suffer more from this problems, while the other assemblers are not affected (the FRCs reduces to a single point). Finally the mis-assembly

FRC is computed using the mis-assembly feature which is obtained applying a feature combiner to collect a diverse set of evidence for a mis-assembly and out- put regions with multiple mis-assembly features present at the same region (see

[80] for more details). CABOG in this case again achieves a superior rank over the other assemblers.

117 262724105 261718301627252820142410311122132329121519763 23 *5*43332212

*2 39

7

16*4 *6 17 35 15 *25*22*21 *9*1368 1314

12 QRY *37

QRY 19 18

*3 20 *38

41

*1 119 3440 8 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF REF (a) ARACHNE (b) CABOG

Contig112Contig13 3743 *Contig34Contig115Contig113Contig114Contig116Contig111Contig27Contig81Contig85Contig63Contig64Contig86Contig14Contig22Contig87Contig53Contig75Contig48Contig25Contig56Contig7 1011009741817345861917883082252883404924794246239685369451473898342 Contig33 10210333329080269918721644847427959289313591784887932950 *Contig29Contig78Contig38 *55*6177 *Contig58*Contig42*Contig28Contig24 *7520 *Contig108Contig97Contig72 *Contig41Contig43Contig50 Contig57Contig73 *Contig59 *56 Contig31Contig70 *Contig36 *Contig67Contig65 *5 *63*62 *Contig79 *11 *Contig88Contig107Contig54Contig80

*Contig18*Contig11 *Contig66 4 Contig103Contig61 3 *Contig44*Contig2 Contig60 58 *13*1567 *Contig102*Contig6 *1470686921 *Contig110Contig101*Contig35*Contig94Contig30Contig10Contig39 *12 *Contig84Contig99Contig93Contig83Contig90 *Contig104*Contig95*Contig98 *Contig74*Contig32 57 Contig51*Contig1 64 *Contig16Contig96 QRY *10

QRY Contig52 *66*76*65 Contig68Contig26*Contig8 *Contig89*Contig71Contig69 *9 *Contig105 *225354 *Contig45Contig19Contig76Contig77 Contig12 52 Contig109*Contig55Contig82 *Contig92 59 *Contig91 *Contig100*Contig106 60 *Contig15Contig9 *6 Contig20 *Contig21*Contig40Contig47 *Contig46 Contig17

*Contig23Contig49 *39*71*1 *Contig37Contig3Contig4 *7 Contig5 8 *Contig62 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF REF (c) Euler (d) PCAP

1849534842505147523754 3221261830272520144924101142224613232950395141121538454319769845 16 *48 *2046 16 47 *21

6 *28 *26 12 *2731 44 *15 3 *4 24 *37*31*353440 253 *45*444140323413 *1728 *3522 7 36 QRY *1 QRY *33

2 *14 *19 *1143 1 9 *29*3930 8 *5 *2336 *3338 *17 *10 *2 *S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

REF S_epidermidis_RP62A_chromsome REF (e) SUTTA (f) TIGR

Figure 6.3: Dot plots for the Staphylococcus epidermidis. Assemblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis. Note that number of single dots are an artifact of the sensitivity of the MUMmer alignment tool; they can be reduced or removed by using a larger value for the minimum cluster length parameter –mincluster (default 65).

118 Matepair Feature-Response curve Polymorphism Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 40 SUTTA SUTTA TIGR 30 TIGR 30 Arachne 20 Arachne Approximate coverage (%) 20 PCAP Approximate coverage (%) 10 PCAP CABOG CABOG 10 0 0 100 200 300 400 500 600 700 800 900 0 100 200 300 400 500 600 700 Feature threshold Feature threshold (a) Mate-pairs (b) Polymorphism

Coverage Feature-Response curve Kmer Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 SUTTA 40 SUTTA 30 TIGR TIGR 20 Arachne 30 Arachne Approximate coverage (%) 10 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 0 10 0 2 4 6 8 10 12 14 0 5 10 15 20 25 Feature threshold Feature threshold (c) Coverage (d) Kmer

Breakpoint Feature-Response curve Misassembly Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 SUTTA 40 SUTTA 30 TIGR TIGR 20 Arachne 30 Arachne Approximate coverage (%) 10 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 0 10 0 10 20 30 40 50 60 0 20 40 60 80 100 120 140 160 Feature threshold Feature threshold (e) Breakpoint (f) Misassembly

Figure 6.4: Separate FR-curve comparison for each feature type for the S. epi- dermidis genome using mate-pairs.

119 6.4 Short reads results

In case of short reads, interpretations of the contigs data, e.g., ones based purely on contig sizes and N50, etc. are complicated by the following facts: (1) for short-

K reads, the required threshold ratio L is only slightly less than 1, where K is the required minimum overlap length and L is the length of the reads; therefore the effective coverage is significantly small, thus making all the statistics rather non- robust and highly sensitive to choice of the parameters; (2) there is no consensus definition of correctness of a contig – the required similarity varying from 98% down to 90% and the allowed end-trimming of each contig being idiosyncratic; and

(3) many algorithms have specific error-correction routines that are embedded in a pre-processing or post-processing steps and that cull or correct bad reads and contigs in a highly technology-specific manner.

Analysis without mate-pair constraints. Returning to an analysis based on contig size, in Table 6.4, we show a comparison of the assembly results for the S. aureus and H. acinonychis genomes. The values reported for all the assemblers, are based on the tables presented in the recent SUTTA paper [74]. Only contigs of size 100 are used in the statistics. Without mate-pair information, as in ≥ here, it is inevitable that all assembly approaches (especially, if they are not conservative enough) could produce some mis-assemblies. As described earlier, what constitutes a correct contig is defined idiosyncratically (align along the whole length with at least 98% base similarity [32]), making it very sensitive to small errors which typically occur in short distal regions of the contigs. For

120 Genome Assembler # correct # mis-assembled N50 Mean Max Coverage (kbp) (kbp) (kbp) (%) S. aureus ABySS 928 6 7.8 2.9 32.7 98 (strain MW2) Edena (strict) 1124 0 5.9 2.4 25.7 98 Edena (nonstrict) 740 16 9.0 3.7 51.8 97 EULER-SR 669 33 10.1 4.0 37.9 99 SOAPdenovo 867 25 8.1 3.1 30.8 97 SSAKE 2073 378 2.0 1.1 9.7 99 SUTTA 998 11 6.0 2.6 22.8 97 Taipan 692 16 11.1 3.9 44.6 98 Velvet 945 5 7.4 2.8 32.7 97 H. acininychis ABySS 270 8 13.9 5.4 54.7 98 (strain Sheeba) Edena (strict) 336 0 10.1 4.5 36.9 98 Edena (nonstrict) 302 1 13.2 4.9 35.0 97 EULER-SR 730 21 4.3 2.1 18.8 98 SOAPdenovo 479 21 7.3 3.3 29.8 98 SSAKE 675 156 3.2 1.8 14.6 99 SUTTA 313 9 9.6 4.5 41.3 98 Taipan 271 0 13.3 5.6 48.6 98 Velvet 278 2 12.8 5.4 49.5 98 Table 6.4: Short reads assembly comparison without mate-pair information. First and second columns report the genome and assembler names; columns 3 to 7 report the contig size statistics, specifically: number of contigs, number of contigs with size 10kbp, max contig size, mean contig size, and N50 size (N50 is the ≥ largest number L such that the combined length of all contigs of length L is at least 50% of the total length of all contigs). Finally column 8 reports≥ the coverage achieved by all the contigs. example, contigs ending in gaps accumulate errors, as coverage gets lower towards the ends. To overcome such errors some assemblers perform a few correction steps. For example, Edena exercises an option to trim a few bases from these ends until a minimum coverage is reached; Euler-SR performs a preprocessing error correction step where errors in reads are corrected based on k-mer coverage analysis. From the table 6.4 it is clear that SSAKE has the worst performance in terms of contig size and quality, while the rest of the assemblers have relatively small errors and they all achieve high genome coverage ( 97%). ≥

Analysis with mate-pair constraints. Table 6.5 shows the assembly com- parison using mate-pair information on the read set for the E. coli genome. The

121 comparison is based on the results from table 2 in [74]. In accordance with this analysis, statistics are computed only for contigs whose length is greater than 100 bps. A contig is defined to be correct if it aligns to the reference genome with fewer than five consecutive base mismatches at the termini and has at least 95% base similarity. By inspecting the column with the number of errors, one might conclude that the lower the number of errors the better the overall assembly qual- ity. As explained earlier, a simple count of the number of total mis-assembled contigs is not informative enough. For example, ABySS and SOAPdenovo have the highest N50 values and a low number of mis-assembled contigs. However, such misassembled contigs are on average longer than those from other assem- blers like SUTTA and Edena. This is evident in the table from the analysis of the mean length of the mis-assembled contigs. This analysis also shows that Edena and SUTTA behave more conservatively than Velvet, ABySS and SOAPdenovo, as they trade contig length in favor of shorter correctly assembled contigs. Inter- estingly Taipan’s number of errors for E .coli increases compared to the results in table 6.4. Instead SOAPdenovo’s performance improves for E.coli thanks to the availability of mate-pair information.

Note that the N50 statistic does not give any information about the reason why the contigs are mis-assembled: the contigs could contain an error due to accumulated errors close to the contigs’ ends or it could contain rearrangements due to repeated sequences. Of course, these two error types have very different importance in terms of quality. In this scenario, the FRC analysis can give a deeper understanding of the assembly quality, as shown in Figure 6.5 where the

122 Genome Assembler # correct # mis-assembled N50 Mean Max Coverage (mean kbp) (kbp) (kbp) (kbp) (%) E. coli ABySS 114 10 (49.5) 87.4 37.3 210.7 99 (K12 MG1655) Edena 674 6 (13.2) 16.4 6.6 67.1 99 EULER-SR 190 26 (37.8) 57.4 21.1 174.0 99 SOAPdenovo 200 9 (71.8) 76.6 21.7 173.9 98 SSAKE 407 66 (15.3) 31.2 9.6 105.9 98 SUTTA 423 7 (18.8) 22.7 10.2 84.5 98 Taipan 742 62 (5.2) 12.2 5.6 56.5 97 Velvet 275 9 (52.9) 54.3 15.9 166.0 98 Table 6.5: Short reads assembly comparison using mate-pair information. First and second columns report the genome and assembler names; columns 3 to 7 report the contig size statistics, specifically: number of contigs, number of contigs with size 10kbp, max contig size, mean contig size, and N50 size (N50 is the ≥ largest number L such that the combined length of all contigs of length L is at least 50% of the total length of all contigs). Finally column 8 reports≥ the coverage achieved by all the contigs. contigs produced by SUTTA and Velvet are compared. Although SUTTA has a higher number of mis-assembled contigs (see table 6.5), the FRC presents a different scenario. By inspecting the feature information of the contigs produced by SUTTA and Velvet, it is seen that SUTTA’s contigs have a lower number of unsatisfied mate-pair constraints, which leads to fewer large mis-assembly errors.

This result is primarily due to SUTTA’s optimization scheme, which allows it to concurrently optimize both overlap and mate-pair scores while searching for the best layout. Unfortunately we were unable to generate FRCs for each short- read assembler because their output could not be converted into an AMOS bank for validation. The ones that could be analyzed with FRC include SUTTA and

Velvet.

123 Feature-Response curve 100 90 80 70 60 50 40 30 20

Approximate coverage (%) 10 SUTTA Velvet 0 0 10000 20000 30000 40000 50000 60000 70000 Feature threshold

Figure 6.5: Feature-Response curve comparison for the E. coli genome using mate-pair short read data.

6.5 Parametric complexity experiments

In order to design better assembly algorithms and exploit the characteristics of sequence data from new technologies, it is important to have a deep understanding of the parametric complexity of the assembly problem. This is especially true for short reads from next-generation technologies since typically the required overlapping length represents a significant part of the read length. In fact, the min overlap length k is a determinant parameter, and its optimal setting strongly depends on the data (coverage). Therefore the effective coverage [55]:

N(l k) E = − (6.1) cov G

124 is very sensitive to the choice of the minimum overlap parameter k, where N is the number of reads, l is the read length and G is the genome size. For example, in the case of the real dataset for S. aureus from table 6.1 (l = 35, G =2.82 Mbp,

N =3.86 Millions) the raw coverage and effective coverage are respectively:

lN c = = 48X, c = N(l−k) = 14X(k =21) (6.2) G E G

This section illustrates the the strong dependency of assembly contiguity and quality on the choice of the minimum overlap parameter k.

6.5.1 Overlap graph complexity

As described in chapter 2, sequence assembly can be formulated as solving spe- cific problems for a general graph constructed from the overlap information of the reads. The size and complexity of this graph is clearly a function of the minimum overlap size k. Figure 6.6 shows the overlap and read count distribu- tions as a function of the minimum overlap parameter k for the E. coli dataset from table 6.1. The Y axis of the main plot shows how many reads have the number of overlaps in the X axis. The overlap graph seems to closely follow a power law distribution in accordance with random graph models and scale-free networks, where it is common to have vertices with a degree that greatly exceeds the average. In the context of genome sequences this is expected because of the presence of repeat regions: reads that have been sampled from such region are more likely to have a higher number of overlaps. These reads appear in the tail of the distribution in figure 6.6 (x 400). The majority of the sequences have ≥

125 Overlap and read count distributions for E. coli 1e+08 k=20 k=24 1e+08 1e+07 k=28 1e+07 k=32 1e+06 1e+06 100000 10000 1000 100000 100 10

10000 Occurrence frequency 1 0 20 40 60 80 100 1000 Number of occurrences Overlap frequency 100

10

1 0 100 200 300 400 500 Number of overlaps

Figure 6.6: Overlap and read count distribution for E. coli. Main plot: point (x, y) represents that y number of reads have x number of overlaps. Inset plot: point (x, y) represents that y number of reads occur x number of times in the dataset (duplicates). For both plots the Y axis is in logarithmic scale. a much lower number of overlaps (10 x 70) and they probably correspond ≤ ≤ to the inter repeat regions. Because the sequencing machines are not error free, there is another class of sequences that contains error and have very few number of overlaps (x 10). These reads are responsible for the presence of dead-ends ≤ and bubbles in the overlap graph, which represent a big portion of the graph size.

From this analysis it is clear that the extreme situations where sequences have very few or high number of overlaps make the overlap graph particularly complex to analyze and sequence assemblers must be carefully designed to explore and disambiguate these graph structures.

126 N50 vs overlap size 8000 S aureus E. coli (no mate-pairs) 30000 7000 E. coli (with mate-pairs) 25000 6000

5000 20000

4000 15000

3000 10000 2000 N50 contig size (E. coli) N50 contig size (S. aureus)

5000 1000

0 0 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 Min overlap size k

Figure 6.7: Relation between the min overlap parameter k and the N50 contig size for S. aureus and E. coli (contigs have been assembled with SUTTA).

6.5.2 Trade-off between N50 and Overlap size k

Figure 6.7 shows the relation between the N50 size versus the minimum overlap parameter k for two of the genomic data sets analyzed in this chapter. Clearly there is a trade-off between the number of spurious overlaps and lack of overlaps as the values for k move from small to larger numbers. Increasing the overlap size allows the resolution of more ambiguities, but in turn requires a higher coverage depth to achieve the same N50 value. It is important to emphasize that the optimal value for k depends on the genome structure and coverage (S. aureus and E. coli have different optimal values) and so it needs to be tuned accordingly.

Finally, the availability of mate pairs definitely improves the results and enables assembly of longer contigs for the E. coli genome.

127 6.5.3 Feature-Response curve dynamics

The choice of the minimum overlap parameter k not only affects the estimated length of the assembled contigs but also changes the overall quality of the as- sembled sequences. In order to show this phenomenon, we have examined the assembly quality as a function of k using the Feature-Response curve. Figure 6.8 shows the dynamics of the FR curve for E. coli as a function of the minimum overlap parameter k. Like the plots in figure 6.7, both small and large values of k produce more assembly errors, while the best value lies in the middle range

25-29. There seems to be a phase transition for k = 33 and k = 34; this is due to the fact that the probability to detect a perfect match overlap of higher size

(k > 32) becomes more unlikely without increasing the coverage. Both average contig length and N50 value decrease such that more contigs of size smaller than the insert size are created. All these contigs then violate the mate-pair constraints and result in a high number of features/errors.

6.6 Computational performance

Because of the theoretical intractability of the sequence assembly problem and because, in principle, SUTTA’s exploration scheme could make it generate an exponentially large number of layouts, SUTTA could be expected to suffer from long running times and high memory requirements. However, our empirical anal- ysis shows that SUTTA has a competitively good performance — thanks to the branch-and-bound strategy, well defined scoring and pruning schemes, and a care-

128 Feature-Response curve dynamics as function of k (min overlap) 100

90

80 k=18 70 k=19 k=20 k=21 60 k=22 k=23 50 k=24 k=25 40 k=26 k=27 30 k=28 k=29 20 k=30 k=31 Approximate genome coverage (%) k=32 10 k=33 k=34 0 0 10000 20000 30000 40000 50000 60000 Feature threshold

Figure 6.8: Feature-Response curve dynamics as a function of the minimum overlap parameter k for E. coli using mate pair data. ful implementation. SUTTA’s computational performance was compared to Vel- vet, ABySS, EULER-SR, Edena and SSAKE on the S. aureus genome using a four quad core processor machine, Opteron CPU 2.5GHz (see table 6.6). SUTTA has an assembly time complexity similar to Edena, SSAKE and EULER-SR. Vel- vet and ABySS have the best computational performance. Velvet, ABySS and

Edena consume less memory than SUTTA. However, SUTTA relies on AMOS to maintain various genomic objects (reads, inserts, maps, overlaps, contigs, etc.), which are not optimized for short reads. At the current stage of development,

SUTTA’s time complexity increases with mate-pairs constrain computation, but is expected to improve with reengineering planned for the next versions. Finally, we note that typically 2/3 of total SUTTA’s running time is dedicated to the

129 computation of overlaps, leaving only a 1/3 of the total time to assemble the contigs.

SUTTA Edena Velvet ABySS SSAKE EULER Time(min.) 18m 10m 2m 3m 30m 13m Memory 3.4Gb 800Mb 600Mb 600Mb 4.2Gb 800Mb

Table 6.6: Assemblers computational performance for Staphylococcus aureus strain MW2. Time and memory requirements reported here include both over- lapping and assembly steps.

130 Chapter 7

Integrating Base-Calling, Error

Correction and Assembly

7.1 Introduction

Although algorithmic improvements play an important role in sequence assembly, the complexity of the problem is strongly reduced if high quality (low error rate) sequences can be generated. For this purpose, Base-Calling and error correction tools play a significant role in generating highly accurate sequence assemblies.

This chapter starts by presenting a novel DNA base-caller, TotalReCaller [62], designed to interpret analog signals from sequencing machines in terms of a se- quence of bases ( A,C,G,T ), while simultaneously aligning the sequence reads { } to a source reference (draft) genome, whenever available as it can reduce the error rate. By merging genomic information with raw intensity data TotalReCaller pro- duces high quality sequence reads as well as an alignment at the possible location

131 in the reference genome. To achieve these objectives, TotalReCaller combines a linear error model for the raw intensity data and Burrows-Wheeler transform

(BWT) based alignment into a Bayesian score function, which is then globally optimized over all possible genomic locations using an efficient branch-and-bound

(specifically beam search) approach.

This chapter also demonstrates the advantages of a complete de novo pipeline integrating TotalReCaller (base-calling) with SUTTA (sequence assembly) in a

Bayesian manner. We have found that this pipeline significantly improves the assembly quality and performance compared to the standard SUTTA pipeline

(without error correction) described in chapter 4. Comparison results from the best state-of-the-art assembly algorithms (e.g., SOAPDenovo and ABySS) for short read next-generation technology demonstrate the competitive performance improvement with this new pipeline.

7.2 Base-Calling Challanges

Base-calling takes as input the vector analog time series of signals generated by the sequencing machines, and produces a base-by-base digitized estimate of the underlying DNA sequence, which is most likely to have given rise to that signal.

As explained in chapter 1, although next-generation sequencing technologies have reduced the cost and increased the throughput, they pose new challenges in base- calling. In fact, these platforms are error-prone, corrupt signals in the data by non-stationary errors, and generate much shorter reads than those needed for both proper alignment and sequence assembly as well as what used to be routinely

132 possible with the traditional Sanger sequencers.

Motivated by such challenges, novel base-calling frameworks have been pro- posed to deal with many unknown sources of noise in the data (see chapter 1 section 1.7). However, these methods have also exposed additional difficulties that must be resolved by better Base-Calling software:

Over-fitting: Parametric models (e.g., Alta-Cyclic, Ibis, BayesCall, and • Rolexa) seem to suffer from over-fitting to the in-sample data and thus are

unlikely to be very robust in dealing with varying out-of-sample datasets,

even from the same sequencing platforms.

Computational Cost: All base-callers require preprocessing to learn • the error model from training data in order to build a classifier that can

then correct the errors in the signal. This preprocessing can be very time

consuming (as in the case of BayesCall, Alta-Cyclic and Ibis) and may

require a cluster computer facility (as in the case of Alta-Cyclic), thus

preventing them from real time base-calling, which would be needed in

many clinical applications.

Training: Since a sub-class of these base-callers (e.g., Alta-Cyclic, Ibis) • needs a training library of correct reads, they require both a secondary base

caller as well as a sequence aligner to build the training library. However,

this concern has been somewhat alleviated in the newer base-callers (e.g.,

Bustard, BayesCall), which estimate their parameters solely from intensity

files.

133 Technology Dependence: Many base-callers (e.g., Alta-Cyclic, Ibis • and BayesCall) use a detailed parametric model to describe the signal dis-

tortion as a function of successive cycles. Such models require hard-wiring

specific knowledge of the underlying sequencing technology into the algo-

rithm, thus making it harder to customize the base-caller to support other

platforms.

Currently, the major application of the next-generation sequencing technolo- gies is thought to be in re-sequencing. Despite the obvious centrality of alignment in these applications, traditional base-callers have avoided performing alignment until the end of the base-calling process. In fact, the typical pipeline for a re- sequencing process traditionally consists of two sequential steps (see figure 7.1(a)):

(1) Base-Calling: each single base of the read is called according to the intensity signal and error profiles; (2) Alignment: sequence reads are aligned to a refer- ence genome. Because the base calling process is error-prone, and because correct alignment to the reference genome is non-trivial, high coverage is required in order to reduce the errors in re-sequencing and recover the true full DNA sequence.

Motivated by the limitations of current base-callers and the challenges of re- sequencing, we have designed a new base-caller, TotalReCaller, which substan- tially ameliorates the problems discussed above and significantly improves the quality of reads by injecting knowledge of the reference genome into the base- calling step. TotalReCaller replaces the typical sequential re-sequencing protocol into a combined pipeline (see figure 7.1(b)) that has the ability to concurrently perform base-calling, alignment, error correction and SNP detection. Although in

134 this chapter we address base-calling for the Illumina platform, the method is, in principle, applicable to any other sequencing technology. Adaptation to a differ- ent sequencing platform only requires re-designed score functions to encapsulate error correction and alignment – and, thus, accommodate the different features and error profiles of the variant system in a technologically extensible manner.

(a) Traditional re-sequencing pipeline

(b) Proposed re-sequencing pipeline

Figure 7.1: Block diagrams for re-sequencing pipelines: (a) open-loop (no feed- back) and (b) closed-loop with feedback.

7.3 Source of Errors in Illumina Raw Sequenc-

ing Data

As reported previously in [24], there are four dominant sources of noise affecting the intensities generated by Illumina:

1. Crosstalk: The intensity channels are not independent. This interdepen-

dence is due to the fact that the fluorescent markers for A, C and G, T emit

photons with similar wavelength, get excited by the same lasers and fluo-

rescent markers from one cycle can only be chemically partially removed

135 (“washed”) before the cycles for the next nucleotides (all performed in the

same flow cell). Figure 7.2 clearly shows this effect: calling the bases A

and G from their intensity channels does not pose a severe problem at least

for the first 40 cycles. In contrast, channels C and T appear hopelessly

disrupted soon after the start and within the first few cycles.

2. Fading: With successive cycles, the absolute intensity of light emitted from

the cluster of DNA strands diminishes because fluorescent markers are only

able to bind to fewer and fewer strands within the clusters. This produces

a low signal-to-noise ratio (SNR) as shown in figure 7.2. which worsens

precipitously with the cycle numbers – attributed primarily to polymerase

desynchronization and its low processivity.

3. Lagging (Phasing): Some strands in the clusters start to lag behind the

population, as in each cycle some of the polymerases fail to operate syn-

chronously, but then rejoin the other strands in subsequent cycles, whence

producing ambiguous intensity values. Eventually, the correct channel gets

obscured by the other wrong channel intensities, leading to wrong base-calls.

4. Leading (Pre-phasing): Some strands in the clusters start to lead ahead of

the population, which also causes ambiguous and incorrect channel intensity

values in a fashion analogous to lagging.

These noise factors dominate and affect the signal differently in different cy- cles. In the first few cycles, crosstalk is the major source of base-call errors.

However, in later cycles fading, lagging and leading prevail. We have observed

136 2500 1500 2000 1500 1000 1000 500 500 0 0 intensity channel A intensity channel C 20 40 60 20 40 60 cycle cycle 3000 1500

2000 1000

1000 500

0 0 intensity channel T intensity channel G 20 40 60 20 40 60 cycle cycle

Figure 7.2: Statistics for high and low intensity levels depicted with their means and standard deviations for four channels – one for each base B A,C,G,T . A high intensity level [blue] (with a value above a threshold) in one∈{ of the channels} indicates that this base should be called at a given cycle. A low intensity level [red] (below threshold) in a channel means that this base should not be called at a given cycle. In a “good” set of intensities, it is expected that one channel is higher than a threshold, while all other three are lower than it. The panels depict that in later cycles the low and high intensities become increasingly indistinguishable, which causes erroneous base-calling for distal positions. that lagging often causes many false-positive insertions in the distal extending- end of sequence reads. In later cycles, intensities measured in cycle k reflect more and more what would have been the value in cycle (k 1). This process − leads directly to “base-stuttering,” occurring much more frequently after some threshold value for k, the cycle number. This dynamic can be modeled by a step

137 function appearing randomly but more frequently in later cycles, thus making it extremely difficult to analyze. This effect has important implications for the succeeding alignment step, since many popular short-read sequence-aligners can- not align gapped sequence reads [56], [58]. We have also observed the effects of leading on signal to noise ratio to be negligible in comparison to the other three

(crosstalk, fading and lagging).

7.4 TotalReCaller

TotalReCaller combines the knowledge from sequencers’ raw intensity data with information from a reference genome (when available). In other words, it gener- ates the most plausible m-base string (out of 4m possibilities) that is most likely to have generated the channel intensity data, and also most likely to have orig- inated at some location of the reference genome. Like SUTTA, TotalReCaller tames the worst-case exponential complexity of the implementation by using a beam search strategy strategy (an adaptation of the branch-and-bound method).

Specifically, this strategy is used to concurrently extend multiple high quality reads that are immediately validated not only by the intensity signals but also by the likely alignments to a reference genome (thus the genome providing a weak prior to a Bayesian inference). This scheme builds on a rigorously defined

Bayesian score function that accounts for both — thereby, resulting in a single score to quantify the quality of a given sequence read. In order to execute this task, TotalReCaller implements four different components that are described in detail in the following sections: (1) linear error model; (2) base-by-base sequence

138 alignment; (3) beam search read extension; and, finally, (4) score function.

7.4.1 Linear error model and filter

A simple linear model has been devised to correct errors resulting from crosstalk, fading and cycle-dependent synchronous-lagging. The model is based on a cycle- dependent transition matrix (thus dynamic) in order to filter the raw intensity channels. The linear-algebraic model for crosstalk and fading is first derived and

I k k k k ⊤ then extended to include lagging. Let k = (IA IC IG IT ) be the vector of the four raw intensity channels. In order to model crosstalk in cycle k N, ∈ we introduce the crosstalk matrix A R4×4 and the crosstalk-free channels k ∈ X = (Xk Xk Xk Xk )⊤ R4. We model their relationship simply by the k A C G T ∈ following formula:

I = A X . (7.1) k k · k

Since a separate crosstalk matrix is computed for every cycle k, the intensities are implicitly normalized, thus additionally accounting for fading.

Lagging is then modeled by introducing a dependency between the current cycle and the previous cycle, resulting in:

Ik−1 Ak−1 0 Xk−1   =     (7.2) I Υ A · X  k   k k  k     −1 

Xk−1 Ak−1 0 Ik−1   =     , (7.3) ⇒ X Υ A · I  k   k k  k        Gk | {z }

139 where Υ R4×4 describes the coupling between I and I . A matrix inversion k ∈ k k−1 results in a simple transition matrix Gk, which is then used to filter the raw intensity channels. The elements of the matrices Υk and Ak are obtained by statistical analysis of the intensities, using a library of correct reads similar to the training sets used for the parameter estimation of the support vector machines in Alta-Cyclic and Ibis. However, notice that for TotalReCaller this is not a computationally expensive task since it only solves a simple linear system. Also, notice that, while training data set is used here to simplify the learning phase of

TotalReCaller, it is not absolutely necessary for its operation, as TotalReCaller can adaptively tune these parameters.

After applying the filter to the set of raw intensities (see figure 7.2), we were able to significantly improve the quality of the intensity channels, as shown in

figure 7.3. The error model and filter could easily be extended to include leading

(pre-phasing). It was decided to refrain from including it in the current implemen- tation, since it appeared to increase the computational costs without balancing it by a further proportional improvement in the quality of intensity information.

With the intensity channels suitably filtered, we needed a metric to compare the intensity channels among one another. For this purpose we focused on the conditional probabilities P (X B) and P (X B) with B A,C,G,T . k B | k B | ¬ ∈ { } P (X B) denotes the conditional probability of the filtered intensity X of k B | B channel B, given that the correct base to call is base B, whereas P (X B) k B | ¬ denotes the conditional probability of the filtered intensity XB, given that the correct base to call is not base B. Since the filtered channels XB are independent

140 2 2

1 1

0 0 intensity channel A −1 intensity channel C −1 20 40 60 20 40 60 cycle cycle 2 2

1 1

0 0 intensity channel T intensity channel G −1 −1 20 40 60 20 40 60 cycle cycle

Figure 7.3: Filtered intensity channels and separation. Crosstalk and lagging are corrected using a linear filter, developed here. The high and low intensity levels are now cleanly separated for the first 60 cycles. of each other, we can approximate these probabilities assuming that they are normally distributed (with subscript k suppressed to avoid clutter),

X B (µ , σ ) and X B (µ , σ ). (7.4) B | ∼N B B B | ¬ ∼N ¬B ¬B

That is:

2 1 (XB µB) Pk(XB B)= exp −2 , (7.5) | √2πσB − 2σB  2 1 (XB µ¬B) Pk(XB B)= exp −2 . (7.6) | ¬ √2πσ¬B − 2σ¬B 

141 The means together with their standard deviations have already been presented in

figure 7.3, in order to show the workings of the linear error model and filter. Note that although it is reasonable to assume X B to be normally distributed, it B | seems less justifiable to take X B to be normally distributed as well, because B | ¬ the condition in this case is the disjunction of the three other bases. However, we have experimentally computed the distribution for X B and found out B | ¬ that it follows a narrower distribution than the normal distribution. Because of that, forcing X B to be also normally distributed is actually less accurate. B | ¬ Nevertheless, we decided to keep that assumption to design a score function that would be simpler and faster to compute.

7.4.2 Base-by-base sequence alignment

The key idea of TotalReCaller is to perform alignment while the sequence is being base-called. The partially generated sequences, which are grown one base at a time, must be aligned back to the reference genome. To account for this com- putationally intensive task, we designed an efficient base-by-base aligner that is based on a suffix tree search algorithm. Inspired by the many Burrows-Wheeler based short read sequence aligners (Bowtie [56], SOAP2 [59], BWA [58]), we con- structed our base-by-base aligner essentially on the same principles, specifically the Ferrangina-Manzini search algorithm [26] and the Burrows-Wheeler transfor- mation [17]. Ferrangina and Manzini showed how the suffix tree of a reference genome can be accessed through its Burrows-Wheeler transformation (BWT), which does not require more memory than the reference genome itself. Thus

142 searching for a (partial) sequence read in a BWT reference can be performed very efficiently. In addition, not only information about the existence but also the number of occurrences in the given reference of the (partial) sequence read can be computed concurrently.

Sequence Frequency f P (B) P ( B) B ¬ ACGACA 100 0.10 0.90 ACGACC 20 0.02 0.98 ACGACG 500 0.50 0.50 ACGACT 380 0.38 0.62

Table 7.1: Probabilities from FM search for each base preceded by “ACGAC”.

Note that sequence aligners usually only estimate whether and where a se- quence read is located in a given reference. With alignment information for base-calling, we are also able to use the alignment frequency of a partial sequence read. For the example, suppose that the (partial) sequence “ACGAC” is con- tained in a reference 1000 times. We can use the FM search to count how often the sequences “ACGACB” with B A,C,G,T are contained in the reference, ∈{ } from which we can then compute the probability Pk(B), that the next base (at cycle k) in the sequence is B (similarly to the k-gram model used in natural language analysis):

fB Pk(B)= , Pk( B)=1 P (B) (7.7) fA + fC + fG + fT ¬ −

Table 7.1 shows a complete example how the base probabilities are computed using the FM search.

143 So far we have introduced the intensity filter and base-by-base alignment components, what remains to be shown (in the next section) is how to combine them in order to score and prune the candidate solutions.

7.4.3 Beam search read extension

Like SUTTA, TotalReCaller uses a Branch-and-Bound (specifically a variant called beam search) strategy to combine intensity and alignment information by sequentially constructing a tree of hypothetical sequences. In order to reduce the computational, complexity the tree is only partially constructed and repeatedly evaluated. At each cycle the tree grows in depth, with each node in the tree (not just the leaves) representing a possible sequence. In order to be able to recover the best sequence out of the Nk possible sequences, every node is scored according to a Bayesian score function immediately upon creation. This score function com- bines terms for intensity and alignment information and is described in the next section. The maximally likely estimate for the correct sequence read(s) is thus obtained by simply choosing the node with the (globally) highest score. Without any pruning, the tree could grow exponentially in the number of cycles: at cycle k, N = 4k sequence reads must be evaluated. Moreover, since asynchronous | k| lagging causes incorrect insertions into the sequence read, we need to consider deletions as a 5th child, which means that a tree N , with N = 5k sequence k | k| reads must be created and evaluated. Since TotalReCaller dynamically prunes unpromising sequences based on the evaluation of the score function in a beam search scheme [12], the worst-case complexity is rarely encountered in practice.

144 Note that the interesting special situations, where the exponential worst-case be- havior would be exhibited, occur when the sequencer is extremely noisy and/or when the reference is incorrect (or highly error-ridden), thus producing exponen- tially many plausible hypothetical sequence reads – a judicious solution in these cases would then involve terminating the sequence read at a smaller read-length or rejecting it outright. The algorithm is described as a sequence of three con- secutive steps that are repeated once for each cycle, as described in Algorithm 5.

Algorithm 5: TotalReCaller - beam-search pseudo code Input: Set of intensities Ik and reference genome G Output: Read sequence 1 repeat 2 Branching: For each sequence in the solution space Nk−1 all four possible successor sequences are generated, resulting in the solution space N (note that at this point N N ); k k−1 ⊂ k 3 Bounding: Each sequence in Nk is evaluated according to the score function g (combining intensity and alignment information); 4 Pruning: All but the best (highest score) l N sequences are pruned, thus reducing the size of N to N = l; ∈ k | k| 5 until no more cycles ; 6 return Best sequence read according to g;

Note that by reducing the computational complexity through bounding the solution space, we are no longer guaranteed to generate the optimal solution.

However, in practice, the accuracy of the algorithm’s outputs seem to be only slightly affected. Wherever necessary, the computational cost can be traded off for higher accuracy by setting a parameter that controls the width of the beam search.

145 7.4.4 Score functions

To complete the description of TotalReCaller, we need to define the score function used to evaluate the quality of the candidate sequences in the tree. From Bayes’ theorem, it is possible to estimate the probability P that a specific base B k ∈ A,C,G,T is indeed the correct base to call at cycle k, given the filtered intensity { } vector Xk:

Pk(Xk B)Pk(B) Pk(B Xk) = | with B A,C,G,T (7.8) | Pk(Xk) ∈{ } P (X B)P (B) = k k | k (7.9) P (X B)P (B)+ P (X B)P ( B) k k | k k k | ¬ k ¬ 1 = (7.10) Pk(Xk|¬B) Pk(¬B) 1+ X Pk( k|B) · Pk(B)

Since for our purposes it is sufficient to have a quantitative measurement (not a probability) to compare all different solutions to one another, we introduce a simplified score function f which is based on P (B X ): score k | k

P (X B) P (B) f = log k k | +w log k (7.11) score P (X B) align · P ( B) k k | ¬ k ¬ intensities (eqn. 7.5) alignment (eqn. 7.7) | {z } | {z } The score function consists therefore of two parts, both of which can be computed independently according to the sections discussing the intensity filter and the base-by-base alignment algorithm. The weight w [0, 1] permits a user- align ∈ defined control over the impact of the alignment on the overall score, thus enabling the user to adjust the Bayesian bias for a particular application.

146 7.5 Base-Calling Results

Several approaches have been developed to improve the read quality of Illumina reads. They use a variety of error models, statistical inference algorithms and supervised learning methods. Currently there are six major base-callers for Il- lumina sequencing machines (including TotalReCaller), which are presented in table 7.2. In the following we compare the performance of these base-callers1 according to the standard metrics that have been used in the past: namely, base- calling error rate, alignment rate and base-calling speed. Base-calling results for

TotalReCaller are presented with different weights on reference-alignment with respect to intensity information. Three different datasets have been used for comparison (listed in figure 7.3).

Name Institute Reference Technology TotalReCaller New York University [62] Beam Search Bustard Illumina n.a. Linear Model Alta-Cyclic CSHL [24] SVM BayesCall UC Berkley [42] Graphical model Ibis Max Planck [51] SVM Rolexa Universit de Lausanne [85] Probabilistic model

Table 7.2: List of available Base-callers for Illumina sequencing technology in- cluding TotalreCaller.

7.5.1 Error rates

The base-calling error rate measures the quality per cycle (bp position) of the sequence reads, produced by a given base-caller (see figure 7.4). In order to

1As we lacked the required hardware and software, we were unable to compare against Alta-Cyclic.

147 Genome Genome size #Clusters #Cycles #Aligned Aligner Description phiX 5.4kBp 11, 803, 171 78 87% BLAT Onelaneofa Genome ∼ Analyzer I run. E. coli 4.5MBp 35, 027, 442 125 65% BLAT The 2nd pair of a ∼ paired-end Genome Analyzer II run (one lane). poplar 420MBp 31, 445, 866 109 32% Bowtie The 1st pair of a ∼ paired-end Genome Analyzer II run (one lane). Table 7.3: Data sets used to evaluate and compare TotalReCaller’s performance to its peers: phiX, E. coli and poplar. generate error rates based on the same set of reads for each of the base-callers, we aligned all Bustard reads to the respective reference genome in order to create a set of “correct reads”. We then perform a base by base comparison between this set of “correct reads” and the sequence reads created by each of the base-callers, resulting in an error rate for each position (cycle) in a sequence read. Since we used the output of Bustard to create the set of correct reads we introduced a bias, favoring Bustard. For the small genomes, phiX and E. coli, we used the aligner BLAT [46], which allows accurate, gapped alignment to create the set of “correct reads” (see table 7.3). For the poplar dataset we used Bowtie [56] to create the set of “correct reads”. We had to use Bowtie instead of BLAT in order to properly handle the current draft of the poplar [99] genome2, which is of relatively lower quality in comparison to E. coli and phiX, e.g. poplar consists of many contigs (out of 2518) that have not yet been phased to a scaffold. The low quality, in conjunction with the length and complex structure, of the poplar genome results in an unusually large number of false positive alignments, which,

2Populus trichocarpa v2: http://www.phytozome.net/poplar Feb. 2011

148 when analyzed by a sensitive aligner, makes it impossible to create a valid set of “correct reads.” Since Bowtie and other suffix tree based algorithms are in general less sensitive than BLAST-like [2] alignment algorithms (e.g., they do not allow gapped alignment), they produce fewer but, especially in case of a

“bad” reference genome, many more accurate sets of “correct reads.” The base- calling error rates based on the reads produced by the base-callers and the set of

“correct reads” can be found in figure 7.4. Based on the previous discussion, it is safe to conclude that the error rates for poplar dataset may be used only for a qualitative (and not a quantitative) comparison.

0.5 0.2 0.09 Bustard (Illumina) Bustard (Illumina) Bustard (Illumina) BayesCall (Berkeley) Ibis (MPG) 0.08 Ibis (MPG) Ibis (MPG) TotalReCaller w=1 TotalReCaller w=1 0.4 0.07 Rolexa (EPFL) 0.15 TotalReCaller w=2 TotalReCaller w=3 TotalReCaller w=1 TotalReCaller w=3 TotalReCaller w=5 0.06 TotalReCaller w=3 TotalReCaller w=5 TotalReCaller w=7 0.3 TotalReCaller w=5 TotalReCaller w=9 0.05 TotalReCaller w=9 TotalReCaller w=9 0.1 0.04

Error rate 0.2 Error rate Error rate 0.03 0.05 0.1 0.02 0.01

0 0 0 10 20 30 40 50 60 70 20 40 60 80 100 120 20 40 60 80 100 cycle cycle cycle (a) PhiX (b) E. coli (c) Poplar

Figure 7.4: Sequence read error rates per cycle for each of the three datasets (see table 7.3). In order to compute the error rates, the sequence reads, gen- erated by each of the base-callers, were compared base-by-base (cycle-by-cycle) to a corresponding “correct read”. As can be seen, sequence reads generated by TotalReCaller have a significantly lower base calling error rate in comparison to all other base callers. Furthermore, the error rate can be controlled by choosing an alignment weight walign. Note also that reads produced by the GAIIx Illumina machines are in general of higher quality than those generated by the older GAI machines. The fluctuations in the poplar error rates are primarily due to the poor quality of the poplar reference genome which led to a limited set of “cor- rect reads”. It was not possible to present statistics for BayesCall and Rolex for E. coli and poplar, since these base-callers do not support the newer Illumina file formats (RTA pipeline).

149 Genome Base-caller ttraining tcalling alignment rate Bustard - - 15.29% Ibis 8h 2h 31.80% BayesCall ∼50h ∼32h 32.66% Rolexa ∼ 2h ∼ 90h 11.40% ∼ ∼ phiX TotalReCaller (w =0.1) 40.50% TotalReCaller (w =0.3) 1h 17h 45.45% ∼ ∼ TotalReCaller (w =0.5) 49.57% TotalReCaller (w =0.9) 57.94% Bustard - - 28.70% Ibis 20h 10h 36.31% TotalReCaller (w =0.1) ∼ ∼ 46.45% E. coli TotalReCaller (w =0.2) 55.77% TotalReCaller (w =0.3) 1h 28h 64.37% ∼ ∼ TotalReCaller (w =0.5) 77.19% TotalReCaller (w =0.9) 87.47% Bustard - - 25.55% Ibis 16h 9h 25.97% TotalReCaller (w =0.1) ∼ ∼ 25.60% poplar TotalReCaller (w =0.3) 27.74% TotalReCaller (w =0.5) 1.5h 23h 29.83% ∼ ∼ TotalReCaller (w =0.7) 31.84% TotalReCaller (w =0.9) 33.62%

Table 7.4: Speed and alignment comparison. In the third column, “ttraining” tabulates the approximate duration of the training phase, the parameter esti- mation, for each of the base-callers. In the fourth column, “tcalling” tabulates the duration of the actual base calling. In the last column, “alignment rate” shows the percentage of how many of the reads, called by a specific base-caller, could be aligned back to the reference genome using Bowtie [56]. As an exam- ple, for E. coli after 1.5h of training TotalReCaller, with an alignment weight of walign =0.5, calls 35, 027, 442 reads with a length of 125BP in 28h hours, which BP 7 corresponds to 43 ms . Out of these 3.5 10 reads, 77.19% could be aligned back to the E. coli reference genome. Base-calling· was performed on the datasets pre- sented in table 7.3 utilizing a single CPU thread. No precise times are given, since they vary depending on runtime parameters. In comparison to Ibis, Rolexa and BayesCall, TotalReCaller uses a faster training phase. The relatively long base- calling time for TotalReCaller can be accounted for by the time TotalReCaller implicitly spends on genome-alignment while base-calling. For all three datasets, it is shown that a bigger percentage of reads can be aligned to the reference, if TotalReCaller’s strategy is used. Note also that the higher the alignment weight walign, the more reads can be aligned.

150 7.5.2 Alignment rate and base-calling speed

The alignment rate (or mapping rate) describes how many reads produced by a specific base-caller can be aligned back to the source reference genome. This rate provides an important metric, because it quantifies how many of the estimated reads possess good enough quality to permit high level genome analysis (such as SNP detection and assembly). Of course, the alignment rate, similarly to the base calling error rate, depends to large extent on the specific sequence alignment tool that is used. In table 7.4 we present the alignment rates for the sequence reads produced by the different base-callers for each of the three datasets. The reads were aligned using Bowtie with conservative parameters (low sensitivity).

Table 7.4 also shows the approximate base-calling speed for each base-caller.

7.5.3 SNPs specificity and sensitivity

Notwithstanding TotalReCaller’s relative performance advantage in terms of error and alignment rates, it may be questioned whether TotalReCaller’s bias due to reference-based Bayesian prior is the source of this advantage, and could affect

(perhaps, adversely) its single nucleotide polymorphism (SNP) sensitivity and specificity. Specifically, since TotalReCaller injects knowledge from a reference genome into the base-calling process, it is possible that sequence reads at true

SNP-positions (containing information from positions where the reference genome differs from the genome that is sequenced) are called incorrectly. Thus, it is vital to examine what happens when a sequence read is called, if that sequence contains one or more SNPs with respect to the reference genome. In order to assess

151 TotalReCaller’s bias toward the reference genome, particularly with respect to reference-independent basecallers, we define two important statistics: Specificity

(SPCk) (also known as true negative rate, TNR) and Sensitivity (SNSk) (also known as true positive rate, TPR) for each cycle k:

true negativesk SPCk = , (7.12) true negativesk + false positivesk true positivesk SNSk = (7.13) true positivesk + false negativesk

These statistics are based on artificially SNP-inserted reference genomes. On average we inserted one SNP every n bases randomly into each of the reference genomes (see table 7.3), where n was chosen to be equal to the number of cycles available for the given intensity data. Then the SNP-inserted genome was used as a reference for TotalReCaller. Although all other base callers ignore side- information, e.g., information in a reference genome, these same statistics can be computed for all of them for comparison purposes. For those basecallers, sensitivity and specificity depend only on their raw error rates.

Figure 7.5 shows the effect of the alignment on base-calling, as the weights walign are varied for the E. Coli dataset. TotalReCaller’s specificity is higher in comparison to all other base-callers for each of the presented alignment weights.

TotalReCaller’s sensitivity for a low alignment weight is however surpassed by Ibis for the E. coli dataset. Increasing TotalReCaller’s alignment weight increases the specificity and reduces the sensitivity. Considering the significant higher align-

152 ment rate of TotalReCaller (see table 7.4), the loss of sensitivity with increasing waling is more than compensated.

1 1

0.9

0.8 0.95 0.7

0.6 0.9 Bustard (Illumina) 0.5 Bustard (Illumina) Ibis (MPG) 0.4 Ibis (MPG)

SNP specificity/TNR TotalReCaller w=1 TotalReCaller w=1 0.85 SNP sensitivity/TPR TotalReCaller w=2 0.3 TotalReCaller w=2 TotalReCaller w=3 TotalReCaller w=3 TotalReCaller w=5 0.2 TotalReCaller w=5 TotalReCaller w=9 TotalReCaller w=9 0.8 0.1 20 40 60 80 100 120 20 40 60 80 100 120 cycle cycle (a) E. coli-SPC (b) E. coli-SNS

Figure 7.5: SNP specificity (SPC) and sensitivity (SNS) for E. coli. These figures show the effect of the alignment on base-calling, as the weights walign are varied. SNP specificity (SPC) measures the rate at which a difference between a read and its reference represents a SNP and not a base calling error. SNP sensitivity (SNS) measures the rate of called SNPs with respect to all SNPs that should be called.

7.6 Error Correction during Base-Calling

The general technique adopted for error correction by next-generation sequence assemblers is based on k-mer frequency analysis. The basic idea is that, for deep sequencing, the correct k-mers appear multiple times in the reads set, while random sequencing errors produce k-mers with low frequency. For example, using such frequencies, SOAPdenovo analyzes each read in the dataset to infer potential erroneous sites of low-frequency k-mers. The impact of changing each erroneous site to the other three allele types is then tested and changes are allowed only

153 if the new k-mer results in higher frequencies. Other well-known assemblers, such as ALLPATH-LG and EULER-SR, use similar strategies. Since correcting errors can significantly reduce the complexity of assembling the reads, standalone software programs have been recently released that focus only on this task, e.g.,

Quake [45] and SHREC [89]. Although TotalReCaller was initially designed to perform base-calling and alignment in a combined re-sequencing framework, its range of applications goes beyond that. In this section we will show how it can be used to correct sequencing errors even in de novo sequencing projects.

In addition to the intensities, TotalReCaller requires an input reference genome to perform the base-calling. However, the reference genome does not need to be the correct one or complete (gap-free). Even with a low quality draft genome, To- talReCaller can significantly improve the quality of the reads during base-calling.

The draft genome plays the role of a prior knowledge in Bayesian inference. In the specific case of genome sequences, the draft genome represents a more reli- able vocabulary of longer words that can be used to correct the errors in the raw intensities. Table 7.5 shows the number of perfect reads (no errors in alignment) base-called by Bustard and TotalReCaller on the full E. coli dataset. Both the absolute number and percentage of perfect reads generated by TotalReCaller is higher then Bustard. Also note that this is true even when TotalReCaller uses the draft genome created by SUTTA. Although the alignment rate for the first mate improves only slightly, the number of perfect second mate is significantly higher. This has a strong implication during the assembly process, since more mates can be used to resolve repeat sequences in the genome.

154 Base-caller # perfect reads alignment rate Bustard(1stmate) 13443953 54.82% Bustard (2nd mate) 4855602 19.80% Draft genome: TotalReCaller (1st mate) 19412596 58.17% TotalReCaller (2nd mate) 11226130 33.64% Real genome: TotalReCaller (1st mate) 19749991 59.18% TotalReCaller (2nd mate) 11335611 33.97%

Table 7.5: Alignment rate for Bustard and TotalReCaller.

Unlike other sequence assemblers, SUTTA does not include any error cor- rection preprocessing step. So we have designed the following pipeline to take advantage of both SUTTA and TotalReCaller capabilities:

1. Draft Assembly: Using SUTTA (or any other sequence assembler) gen-

erate a draft assembly using the available reads.

2. Base-calling & Error correction: given the reads intensity files and

the draft assembly (generated in step 1), run TotalReCaller to generate a

new set of reads with higher accuracy.

3. Sequence Assembly: Run SUTTA on the new set of reads generated in

step 2 to create an improved assembly.

7.6.1 Assembly results

We have tested this pipeline on the Illumina E. coli dataset presented in table

7.3. Note that current Illumina software can filter the data by removing reads that do not pass the GA analysis software called Failed_Chastity. To stress

155 test the assemblers on harder datasets, in this study we use the full output of the machine, usually contained in the export file. This dataset consists of 49 million

125 bp long reads, for a total coverage 1320X. Since such an high coverage is not typically available for larger genomes, we have subsampled only 100X coverage for comparing the results.

Assembler #correct #errors #ctgs≥10K N50 Max Mean Cov. Cov. (µ kbp) (kbp) (kbp) (kbp) (kbp) all (%) correct (%) SUTTA (exp.) 339 49 (13.8) 147 (37.9%) 24.1 105.6 11.6 97.4 82.7 SUTTA (draft) 168 21 (20.9) 100 (52.9%) 54.6 221.5 24.1 98.2 88.6 SUTTA (ref.) 154 25 (31.4) 86 (48.0%) 71.7 141.6 25.4 98.2 81.3 SOAPdenovo (ctg) 245 80 (18.6) 52 (42.3%) 35.7 100.1 14.1 98.4 66.3 SOAPdenovo (scaf) 106 17 (99.6) 53 (45.3%) 117.6 312.5 37.1 99.3 61.9 ABySS 92 13 (80.9) 54 (49.5%) 134.4 312.5 40.7 102.9 79.7 Velvet 126 60 (32.1) 100 (53.8%) 54.8 148.8 24.5 98.5 56.9

Table 7.6: Assembly results (contigs) for E. coli dataset (100X 125bp reads from one lane of Genome Analyzer II). A contig is defined to be correct if it aligns to the reference genome along the whole length with at least 95% base similarity.

Table 7.6 shows a comparison of the assemblies obtained by SUTTA both on the original read set (created by Bustard) and the error corrected set (base- called by TotalReCaller). SUTTA’s performance significantly improves on the new reads generated by TotalReCaller. For comparison, we have tested some of the best assemblers for short read technology on the E. coli dataset, specifically

SOAPdenovo, ABySS and Velvet. The results are reported in table 7.6. Since the reads are already 125 bp long, only contigs with size 200 have been considered ≥ in the comparison. A contig is defined to be correct if it aligns to the reference genome along the whole length with at least 95% base similarity. Inspecting the results in the table it is clear that SOAPdenovo and ABySS are particularly successful in assembling long contigs, in fact their N50 statistic is the highest.

However the assembly quality is inferior to SUTTA: if only correct contigs are

156 18618418718118017918218918818518315882163 NODE_168_length_1195_cov_147.441010NODE_183_length_1285_cov_91.802338NODE_174_length_451_cov_249.895782NODE_142_length_732_cov_172.954926NODE_83_length_1130_cov_217.483185NODE_270_length_153_cov_62.725491NODE_209_length_419_cov_45.661098NODE_252_length_306_cov_47.385620NODE_162_length_338_cov_59.464497NODE_85_length_539_cov_245.983307NODE_97_length_458_cov_293.694336NODE_332_length_178_cov_22.477528NODE_179_length_200_cov_54.544998NODE_210_length_145_cov_67.558624NODE_295_length_274_cov_19.996351NODE_298_length_171_cov_66.327484 85 116 *NODE_34_length_85957_cov_29.646788NODE_78_length_28315_cov_29.900087 *107*17 NODE_116_length_1040_cov_103.960579*NODE_164_length_3341_cov_30.360970NODE_71_length_30582_cov_30.925087NODE_40_length_8005_cov_30.430481 *NODE_19_length_45338_cov_30.422096 1131088 *NODE_62_length_15014_cov_30.701078*NODE_2_length_58552_cov_30.185835 *32125 *NODE_56_length_20125_cov_26.846012NODE_82_length_42808_cov_31.940689 *71 *NODE_105_length_28416_cov_28.064894*NODE_64_length_12535_cov_27.934504 *19 NODE_253_length_165_cov_227.969696NODE_16_length_30537_cov_29.645905NODE_3_length_30803_cov_30.542381 *158*34*6935 *NODE_106_length_30793_cov_30.853636*NODE_178_length_140_cov_238.028564NODE_231_length_4377_cov_32.588531*NODE_237_length_971_cov_72.409889 *128 *105112*916 *NODE_181_length_11087_cov_30.440786NODE_242_length_114242_cov_30.130302NODE_205_length_193_cov_29.455959 164104 *NODE_128_length_11332_cov_29.145958*NODE_37_length_29550_cov_30.256481 *2111720 NODE_15_length_46946_cov_30.064861 75 *NODE_57_length_63514_cov_32.130444 16516876 *153*136171*26167*83 *NODE_130_length_11851_cov_26.462324*NODE_204_length_1884_cov_26.133759NODE_22_length_101905_cov_29.087248*NODE_52_length_3364_cov_26.690546NODE_79_length_8439_cov_30.059368 *NODE_30_length_77597_cov_30.607563 *80*90 *NODE_180_length_13226_cov_30.086723NODE_26_length_26837_cov_28.795469NODE_243_length_180_cov_47.266666 *141*161*4610*557 *NODE_152_length_282_cov_268.400696*NODE_157_length_3240_cov_28.379013NODE_147_length_312_cov_232.897430NODE_65_length_35133_cov_27.296103NODE_4_length_1336_cov_52.430389 *NODE_12_length_39125_cov_29.592945 38 *NODE_21_length_73131_cov_30.296043 *111218 *NODE_108_length_41229_cov_34.506271NODE_268_length_324_cov_93.462959 *137*65 *NODE_358_length_31070_cov_30.940456NODE_39_length_25803_cov_30.953726 61 NODE_32_length_71236_cov_31.788927 *114 NODE_67_length_30266_cov_30.585508NODE_28_length_24178_cov_29.565928 *92*73 *NODE_145_length_10488_cov_30.656654*NODE_44_length_40173_cov_29.395639NODE_221_length_1524_cov_24.700130 99 *176*3968 *NODE_103_length_1315_cov_27.599239NODE_86_length_24822_cov_30.794214NODE_20_length_54808_cov_30.917219 *63 *NODE_8_length_106154_cov_31.487028 79 *144*95 *NODE_25_length_148816_cov_30.142969*NODE_81_length_6692_cov_26.793934 119 NODE_27_length_61394_cov_28.075073 152*84*45106*4 *NODE_41_length_47675_cov_31.716393NODE_119_length_810_cov_211.151855 *131 *132*87 *NODE_149_length_10959_cov_29.348663NODE_5_length_106783_cov_29.105991 *121 *NODE_190_length_3290_cov_29.001520*NODE_1_length_60395_cov_28.764566 *103115 NODE_46_length_11115_cov_29.736483NODE_11_length_28847_cov_30.099283 24 *NODE_42_length_106573_cov_30.364679 *111*101 *NODE_29_length_38731_cov_29.176395 *77 *NODE_63_length_48361_cov_30.953764*NODE_246_length_191_cov_21.324608NODE_74_length_4061_cov_25.307806 QRY *147*93*56143 QRY *NODE_51_length_12597_cov_29.077320*NODE_132_length_1711_cov_24.959673 *47 *NODE_141_length_2376_cov_26.238216NODE_200_length_4391_cov_27.425415NODE_73_length_59616_cov_29.851332NODE_59_length_2932_cov_29.973738 *100*12060 NODE_102_length_18445_cov_30.086906NODE_107_length_24769_cov_27.556786NODE_87_length_4734_cov_30.160542 74 *NODE_47_length_4846_cov_33.341518NODE_76_length_67280_cov_28.669010 52422733 *NODE_31_length_1118_cov_389.014313*NODE_150_length_1646_cov_31.245443NODE_80_length_26625_cov_29.922480 *41 *NODE_167_length_50648_cov_30.499842 *82*66 *NODE_66_length_36160_cov_29.525055 55 *NODE_55_length_105530_cov_31.065033 NODE_24_length_60828_cov_28.901722 *NODE_50_length_103476_cov_30.629333 NODE_36_length_45361_cov_29.003704 *126*169118*8630 NODE_143_length_11752_cov_31.914654*NODE_90_length_6429_cov_31.194277*NODE_172_length_876_cov_33.234016NODE_89_length_14851_cov_28.617064 145*58133 *NODE_88_length_10105_cov_30.609203*NODE_9_length_54545_cov_31.040663 15512731 NODE_342_length_28115_cov_29.774107NODE_188_length_7983_cov_30.651510NODE_203_length_3055_cov_22.074305 256296 *NODE_280_length_33443_cov_29.491343*NODE_171_length_2192_cov_58.638229*NODE_96_length_49570_cov_30.582933 *124*51139*1464 NODE_115_length_35090_cov_30.236164 *NODE_70_length_19820_cov_29.741877*NODE_33_length_75508_cov_30.622517*NODE_61_length_4908_cov_31.490219 *156*178173134*4850 *NODE_114_length_11804_cov_30.604542*NODE_140_length_12628_cov_30.878445 70 *109 81 *NODE_113_length_148294_cov_30.179873*NODE_17_length_23766_cov_29.765675NODE_194_length_1517_cov_32.987476*NODE_279_length_181_cov_46.309391 148122149 NODE_138_length_26246_cov_27.924675 102 *NODE_123_length_18044_cov_31.190258*NODE_139_length_2488_cov_28.054260NODE_121_length_16729_cov_30.904537*NODE_58_length_17507_cov_29.273491 *NODE_120_length_44067_cov_30.454126 *9889 NODE_7_length_57371_cov_30.859144 *16672 *NODE_49_length_78636_cov_28.137138*NODE_43_length_2512_cov_25.182325 *40*28*97 *NODE_48_length_47778_cov_29.767591 *162*142138170177232254 *NODE_137_length_5342_cov_28.476414NODE_343_length_1336_cov_19.241766*NODE_163_length_355_cov_49.664787NODE_95_length_37943_cov_29.820335NODE_165_length_978_cov_25.139059NODE_54_length_2559_cov_24.372021NODE_247_length_340_cov_28.500000NODE_13_length_146_cov_83.863014 *123 *NODE_98_length_13307_cov_31.178402NODE_45_length_41087_cov_29.988926NODE_136_length_393_cov_27.801527 43 *NODE_38_length_65222_cov_29.377157 *175*491741103678 *NODE_53_length_26704_cov_27.908854NODE_104_length_6610_cov_29.448259NODE_309_length_5101_cov_30.173496*NODE_195_length_260_cov_52.430771 129151 *NODE_75_length_32255_cov_27.917439*NODE_23_length_41113_cov_28.441710 NODE_18_length_100458_cov_31.092905NODE_122_length_4874_cov_31.422445 130163*94*67 *NODE_159_length_23283_cov_27.682945*NODE_199_length_4285_cov_26.626137 140157*597 *NODE_146_length_14591_cov_27.590158*NODE_14_length_12313_cov_29.317389NODE_100_length_28510_cov_26.952719 *150*172154*912913 *NODE_184_length_3627_cov_29.062586*NODE_186_length_4072_cov_27.827848*NODE_153_length_5120_cov_33.173828NODE_101_length_8276_cov_30.344007NODE_173_length_2253_cov_31.166889NODE_127_length_3916_cov_22.029877NODE_126_length_6631_cov_18.998642*NODE_72_length_4128_cov_24.312742 *159*146135*44 *NODE_151_length_16626_cov_30.584387*NODE_111_length_1135_cov_244.862549*NODE_129_length_9843_cov_28.331200*NODE_160_length_8419_cov_28.094547*NODE_60_length_2587_cov_24.765366NODE_6_length_7950_cov_32.619370 *NODE_10_length_85735_cov_30.884130 *53 NODE_94_length_45893_cov_31.142092NODE_77_length_3236_cov_21.707355 *160*37 *NODE_131_length_3123_cov_28.848223*NODE_84_length_12854_cov_29.739614*NODE_35_length_38340_cov_28.911112 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 gi|49175990|ref|NC_000913.2| gi|49175990|ref|NC_000913.2| (a) SUTTA (b) Velvet

C1172C1164C1058C1012C1146C1076C1508C1038C1234C1018C1042C1422C1572C1486C1366C1134C1016C1546C1450C1258C1082C1078C1086C1066C1176C1106C1484C1142C1464C1216C1174C1454C1370C1022C1426C1524C1092C1368C1150C1180C1154C1098C1170C1358C1068C1122C1028C1036C1030C1294C1204C1408C1280C1056C1158C1090C1126C1314C1006C1124C1334C1130C1104C1162C1470C1072C1210C1026C1044C1344C1266C1120C1284C1372C1140C1178C1088C1502C1228C1020C1274C1224C1516C1304C1166C1014C1362C1430C1406C1064C1148C1080C1512C1152C1548C1138C795C976C994C904C835C972C817C879C891C789C831C747C819C958C731C829C865C765C705C759C982C803C926C912C823C839C733C751C869C922C863C853C974C950C964C805C775C970C753C903C717C777C741C737C873C885C769C914C761C986C887C735C877C996C992C745C948C739C709C908C825C793C807C843C859C763C928C783C837C990C723C946C962C899C849C956C801C815C811C867C861C757C749C932C920C727C851C797C767C910C871C755C895C938C966C934C940 4662521928483812457914694 C1168C1110C1402C1024C1468C1208C1114C1308C1032C1102C1456C1118C1046C1052C1270C719C779C809C847C791C916C881C715C936C799C711C875C743C771C707C901C841C960C988C980C942C781 31674260678922412474026639678413678845 *scaffold15scaffold49*C1542C1310C1428C1482 C1730 *761764 *767 *scaffold6C1060C1100 *scaffold63C1008 scaffold18scaffold12*scaffold4*C1400*C1200*C1562*C827 *795757524745794797840 *scaffold14*C883C857 scaffold25*C1074*C1010C1480*C845C1004C1302C1182 *841*782*777*821790774 *scaffold40*C1108*C1490*C1202*C1112*C1144C1374C1160C1116 *785*822813759776

*scaffold9C1504 *845*559*820779 scaffold28*C1804*C1346 744 *scaffold30scaffold20C1416*C721 *751*682*630*743812*48*14370249811 *scaffold60*C1328*C984C1196C1000 *780807849 *scaffold58 796 *C1230*C1952*C787C978 *331*408334525756823 *scaffold32*C954 834

scaffold21 *827 scaffold56 *766 *scaffold42 *838

*scaffold1*scaffold8*C1096*C1476*C1432C1232*C893C729 *749*591*758*853755

*scaffold17*scaffold7*C924 817

QRY *scaffold43scaffold11*C1226*C952C855 QRY 854 *scaffold47 *830 *scaffold57 *C1070*C1592*C1912C1034C1192 *770*831*799826 scaffold45 *832 scaffold54 *805

*scaffold26*scaffold19*C1556*C1414*C1818C1510C1436 *843*769772 *scaffold55*C1538*C1348 *778 scaffold5*C1626C1062 *850*517810 *scaffold13*scaffold35*C1584*C1324*C1394*C998 765762 *scaffold51 *846*771 scaffold29scaffold31scaffold62 *787*836*835 scaffold39*C1410C1984 *786 *scaffold23scaffold64C1128 *851 scaffold61C1050C1194C1332C1002 *801

*scaffold37*scaffold22scaffold33*scaffold2scaffold10scaffold24*C1094*C1156C1054*C813C889C944C785 *808*754844684809852275750 *scaffold16 *118*847837 *scaffold59 *828 *scaffold46scaffold27scaffold38*C1706 802833842 scaffold41 *800

*scaffold34scaffold53 *824*768*747773 *scaffold3 *825 scaffold44scaffold52*C1396C1608C1664C906C930 *829*804*798839 *scaffold50scaffold48*C1212*C1392*C1132C1282*C713C1040C1186*C773C968C833 *741*783*760*806*763781775

*scaffold36C1636 816803 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 gi|49175990|ref|NC_000913.2| gi|49175990|ref|NC_000913.2| (c) SOAPdenovo (d) ABySS

Figure 7.6: Dot plots for the E.coli assemblies produced by SUTTA, Velvet, SOAPdenovo and ABySS. The horizontal lines indicate the boundary between assembled contigs represented on the y axis. aligned to the reference genome, the total coverage of SOAPdenovo and ABySS are respectively 66.3% and 61.9%, while SUTTA achieve a coverage 80% in ≥ all instances. This might be due to the different assembly strategy adopted: both SOAPdenovo and ABySS first create a set of contig using solely the read sequences and only later, in a second step, extend and merge the contigs using the mate-pair information; SUTTA instead assembles the contigs by concurrently optimizing mate-pairs constraints and sequence quality. Another source for the different behavior could be found in the error correction technique: SOAPdenovo

157 uses the k-mer analysis to correct the reads but, since this process is not error- free, it might be introducing additional errors in the set of reads. Velvet’s contigs instead are similar in size to SUTTA’s but the coverage achieved with the correct contigs is only 56.9%.

Figure 7.6 shows the dotplot alignments of the contigs generated by the four assemblers using the MUMmer [53] software. All the contigs find a proper align- ment to the reference genome, however notice that MUMmer generates the best possible alignment for each contig (even if the alignment similarity is 95%). ≤ So in this case, when all the contigs have already a fairly high quality, this kind of alignment plots becomes less informative.

Feature-Response curve 100 90 80 70 60 50 40 30 20

Approximate coverage (%) 10 SUTTA Velvet 0 0 5000 10000 15000 20000 25000 Feature threshold

Figure 7.7: Feature-Response curve comparison for SUTTA and Velvet on the 100X E.coli data set.

More explanatory information can be gleaned from the Feature-Response

158 curve analysis presented in figure 7.7. SUTTA clearly outperforms Velvet as- sembly in quality. These results are in accordance with the coverage analysis presented in table 7.6. We were unable to compute the Feature-Response curve for the other two assemblers, SOAPdenovo and ABySS, because their output could not be converted into AMOS format. However, based on the previous cov- erage in table 7.6, it is fair to presume that the results would not have significantly changed.

159 Conclusion

Sequence assembly accuracy has become particularly important in: (1) genome- wide association studies, (2) detecting new polymorphisms, and (3) finding rare and de novo mutations. New sequencing technologies have reduced cost and in- creased the throughput. However, they have sacrificed read length and accuracy by allowing more single nucleotide (base-calling) and indel (e.g., due to homo- polymer) errors. Overcoming these difficulties without paying for high compu- tational cost requires (a) better algorithmic frameworks (not greedy), (b) being able to adapt to new and combined hybrid technologies (allowing significantly larger coverages and auxiliary long rage information) and (c) prudent experimen- tal designs.

This dissertation has presented a novel assembly algorithm, SUTTA, that has been designed to satisfy these goals as it exploits many new algorithmic ideas.

Challenging the popular intuition, SUTTA enables “fast” global optimization of the WGSA problem by taming the complexity using the branch-and-bound method. The resulting assembly paradigm gives the first priority to accuracy; secondly, it employs judicious experiment design concurrently (using long range information) to validate the results; and finally, it achieves computational ef-

160 ficiency by adaptive allocation of the resources. Because of the generality of the proposed approach, SUTTA has the potential to adapt to future sequencing technologies without major changes to its infrastructure: technology-dependent features can be encapsulated into the lookahead procedure and well-chosen score functions. Note that SUTTA also uses heuristics and has certain assumptions regarding consistent coverage levels, well behaved mates, reducing memory re- quirements (keeping the queue size small), resolving dead-ends and bubbles, etc.

However these tools are plugged into a more general and flexible branch-and- bound (B&B) framework (similar to the one successfully applied to TSP and

SAT problems) used to solve the full sequence assembly problem.

Nonetheless, SUTTA is still in its embryonic stage and has yet to achieve its full potential. In particular, it remains to be practically demonstrated that

SUTTA can assemble a human haplotypic genome of any individual using single- molecule (optical) maps of moderate coverage (single molecules from only a hand- ful of cells) rapidly and cheaply – a goal that has remained elusive to genomics research thus far. To achieve this goal we are planning several extensions of the

SUTTA framework: (1) additional targeted error correction and local optimiza- tion routines to improve the quality results for any specific short reads technology,

(2) software re-engineering to scale to large genomes, and (3) finally integration of long-range map data (e.g., optical maps) to concurrently validate and assemble in a dovetailing fashion.

A recent article entitled “Revolution Postponed” in Scientific American [31] commented “The Human Genome Project has failed so far to produce the med-

161 ical miracles that scientists promised. Biologists are now divided over what, if anything, went wrong...”. And yet, excitement over the rapid improvements in biochemistry (pyrosequencing, sequencing by synthesis, etc.) and sensing (zeroth- order waveguides, nanopores, etc.) has now pervaded the field, as newer and newer sequencing platforms have started mass-migration into laboratories. Thus, biologists stand at a cross-road, pondering over the question of how to tackle the challenges of large-scale genomics with the high-throughput next-gen sequencing platforms. One may ask: Do biologists possess correct reference sequence(s)? If not, how should they be improved? How important are haplotypes? Does it suf-

fice to impute the haplotype-phasing from population? How much information is captured by the known genetic variants (e.g., SNPs and CNVs)? How does one find the de novo mutations and their effects on various complex traits? Can exon-sequencing be sufficiently informative?

Central to all these challenges is the second problem we have addressed in this thesis: namely, how correct are the existing sequence assemblers for mak- ing reference genome sequences? How good are the assumptions they are built upon? Unfortunately, we have discovered that the quality and performance of the existing assemblers varies dramatically. The standard metrics used to compare assemblers for the last ten years emphasized contig size while poorly captur- ing the assembly quality. For these reasons, we have developed a new metric,

Feature-Response curve (FRC), to compare assemblers and assemblies that more satisfactorily captures the trade-off between contigs’ size and quality. This metric shares many similarities with the receiver operating characteristic curve widely

162 used in medicine, radiology, machine learning and other areas for many decades.

Moreover the FRC does not require any reference sequence (except an estimate of the genome size) to be used for validation, thus making it a very useful tool in de novo sequencing projects. Furthermore the inspection of separate FRCs for each feature type enables to scrutinize the relative strengths and weaknesses of each assembler.

Note that although in this thesis the aim has been to test and compare as many de novo sequence assemblers and covering known assembly paradigms as exhaustively as possible, in a fast evolving field such as this, this goal has not been completely met — some of the assemblers listed in table 3.1 were only released very recently, and not early enough to be included in the statistical comparison.

It is hoped that the community of researchers interested in sequence assembly algorithms will close this gap with the FRC software, which is now available as part of the AMOS open-source consortium.

The third and final topic of this dissertation has been the introduction of a new Base-Caller, TotalReCaller, that yields better quality sequence reads, SNP- calls, variant detection, etc., as well as an alignment at the best possible location in the reference genome. In the same spirit as SUTTA, TotalReCaller is based on global optimization of a Bayesian score function (combining a linear error model for the raw intensity and Burrows-Wheeler transform based alignment) using an efficient branch-and-bound approach. TotalReCaller and SUTTA have been integrated into a full pipeline (from base-calling to assembly) that achieves competitive performance compared to the state-of-the-art assemblers for next-

163 generation technology.

Returning to our earlier concerns, and the quandaries of computational biolo- gists, biotechnologists also need to reflect on related issues: Given the unavoidable computational complexity burden of assembly, how do we best design the se- quencing platforms? There are many parameters that characterize a sequencing platform: read-length, base-call-errors, homopolymer-length, throughput, cost, latency, augmentation with mate-pairs, scaffolding, long-range information, etc.

And not all can be addressed equally well in all the platforms. The history of sequencing technology has been a random-walk in this complex design-space. To speed up the classical Sanger sequencing, while increasing throughput, ten years ago, engineers focused on effects of increased electric field, Joule-heating, calibrat- ing with smaller amount of materials, etc., as was done with multi-lane capillary- sequencers. The next improvement came from pyro-sequencing or ligation-based sequencing carried out synchronously using small number of clonal copies of DNA fragments, following bridge or emulsion PCR. However, since it was difficult to keep the reactions synchronized (with confounding leading, lagging and fading reactions), the reads shortened and became more error-prone — somewhat, com- pensated by higher depth of coverage. To avoid the synchronization problem, it was necessary to go to a single (genomic DNA) molecule technology, in which either the molecules are kept fixed and sensing mobile (optical mapping, opti- cal sequencing, Heliscope sequencer, AFM-based mapping, SMASH, etc.) or the molecules mobile and sensing fixed (PacificBioscienes, nanopores, etc.). How- ever, the single-molecule technologies face the problem of mismatched speed of

164 the mobile molecules or mobile sensors and the resulting resolution. For the time being, low-resolution map technology (e.g., optical restriction maps) for very long immobile molecules points to the most profitable avenue. If one were to spec- ulate what the next step should be, as was done by Schwartz and Waterman

[90], one may “project that over the next two years, reference genomes will be constructed using new algorithms combining long-range physical maps with volu- minous Gen-2/3 datasets. In this regard, the Optical Mapping System constructs genome-wide ordered restriction maps from individual ( 500 kbp) genomic DNA ∼ molecules” [9, 65, 4, 5, 6, 3, 74]. Among all the assemblers examined in this thesis,

SUTTA appears to be best suited for such a strategy.

165 Appendix A

This appendix contains the pseudo-code for the subroutines used in Algorithm 3.

Note that for the sake of a clear exposition we give only a high level description of the algorithms while most of the implementation details and optimizations are omitted. The following subroutines are also used in the pseudo-code: isSuffix(x, y): returns true if the suffix of x overlaps with y

Consistent(x, y): checks the consistency property between reads x and y ac- cording to the definition 3 in chapter 2. checkT ransitivity(x, y, z): checks if there is a transitivity relation between reads x, y and z.

Lookaheadde(x, y, Wde): returns the maximum depth of the lookahead tree cre- ated starting from node/read y with ancestor x. The local tree is constructed using the Branch-and-Bound strategy similarly to the “Node expansion” routine described in Algorithm 3 in chapter 4, however to avoid recursion, dead-end, bubble and mate-pair pruning are not performed during the construction.

166 Converge(x, y): return true if the path constructed starting from nodes x and y converges after at most Wbb steps.

Lookaheadbb(x, y, Wbb): returns the depth of the tree constructed starting from node y with ancestor x.

Lookaheadmp(x, y, Wmp): computes the local tree starting from node y with ancestor x of maximum depth Wmp and returns the score of the best path (highest score). Path’s score is computed according to the mate-pair score defined in equations (4.5) and (4.6).

167 Algorithm 6: Extensions Input: Read/Node r to extend Output: Set of reads/nodes that can be extended E 1 := getReads(r); /* Set of reads overlapping with r */ R 2 for (j=1 to ) do |R| 3 if ( isUsed(r )) then ¬ j 4 if (isRoot(r)) then 5 if (leftTree isSuffix(r ,r) (rightTree isSuffix(r ,r) then ∧ ¬ j k ∧ j 6 := r ; E E∪{ j} 7 end 8 if (leftTree isSuffix(r ,r) (rightTree isSuffix(r ,r) then ∧ j k ∧ ¬ j 9 := r ; E E∪{ j} 10 end 11 else /* r is not the root node */ 12 ri := ParentNode(rj ); 13 if (Consistent(ri,rj)) then 14 := r ; /* r is a possible extension */ E E∪{ j} j 15 end 16 end 17 end 18 end 19 return ; E

168 Algorithm 7: Transitivity Input: Set of reads/nodes , Read/Node r to extend Output: Set of reads/nodesE (1) after removing transitivity between siblings E 1 for (i=1 to ) do |E| 2 for (j=i+1 to ) do |E| 3 if (checkTransitivity(r, ri,rj)) then 4 := r ; E E\{ j} 5 end 6 end 7 end 8 (1) := E E 9 return (1); E

Algorithm 8: DeadEnds Input: Set of reads/nodes (1), Read/Node r to extend, max depth W E de Output: Set of reads/nodes (2) after removing dead-ends E 1 for (j=1 to (1) ) do |E | 2 d := Lookaheadde(r, rj, Wde); 3 if (d W ) then ≤ de 4 (1) := (1) r ; E E \{ j} 5 end 6 end 7 (2) := (1) E E 8 return (2); E

169 Algorithm 9: Bubbles Input: Set of reads/nodes (2), Read/Node r to extend, max depth W E bb Output: Set of reads/nodes (4) after removing bubbles E 1 for (i=1 to (2) ) do |E | 2 for (j=i+1 to (2)) do |E| 3 if (Converge(ri,rj, Wbb)) then 4 d1 := Lookaheadbb(r, ri, Wbb); 5 d2 := Lookaheadbb(r, rj, Wbb); 6 if (d1

170 Algorithm 10: MatePairs (3) Input: Set of reads/nodes , Read/Node r to extend, max depth Wmp Output: Set of reads/nodesE (4) after removing low mate-pair scoring E extensions 1 for (i=1 to (3) ) do |E | 2 for (j=i+1 to (3)) do |E| 3 s1 := Lookaheadmp(r, ri, Wmp); 4 s2 := Lookaheadmp(r, rj, Wmp); 5 if (s1 >s2) then 6 (2) := (2) r ; E E \{ i} 7 end 8 if (s1

171 Appendix B

The results in Appendix B are organized in two main sections:

Long Reads

1. No Mate-Pairs constraints

(a) Brucella suis

(b) Staphylococcus epidermidis

(c) Wolbachia Sp.

(d) Chromosome Y

2. With Mate-Pairs constraints

(a) Brucella suis

(b) Staphylococcus epidermidis

(c) Wolbachia sp.

(d) Chromosome Y

Short Reads

172 1. With Mate-Pairs constraints

(a) Escherichia coli

For each genome we report the Feature-Response curves (FRC) cumulative over all the features, the FRCs for each feature type, and the dot plot alignments of the set of contigs generated by each assembler computed using the MUMmer package. For the short read E. coli data, only Velvet is used in the comparison since it is the only short read assembler whose output can be converted into an

AMOS bank for validation. For the same reason Euler is excluded from the dot plots for long reads. The mis-assembly FRC is computed using the mis-assembly feature which, according to the amosvalidate description, is obtained by applying a feature combiner that collects all of the evidence for a mis-assembly and outputs regions with multiple mis-assembly features when present at the same region.

Note that the mis-assembly feature is not used in computing the FRC curve.

Finally note that when the number of features of a specific type is 0 for each contig in the set, the FRC reduces to a single point. Although not all the curves are equally informative, for the sake of completeness of exposition, we present all the FRCs.

173 Long Reads

No Mate-Pairs constraints

Brucella suis

Feature-Response curve (no mate-pairs) 120

100

80

60

40 SUTTA Minimus 20 Phrap

Approximate coverage (%) TIGR PCAP 0 0 200 400 600 800 1000 1200 1400 1600 1800 Feature threshold

Matepair Feature-Response curve (no mate-pairs) Polymorphism Feature-Response curve (no mate-pairs) 120 110 100 100 90 80 80 70 60 60 50 40 40 SUTTA SUTTA Minimus 30 Minimus 20 Phrap 20 Phrap Approximate coverage (%) TIGR Approximate coverage (%) 10 TIGR PCAP PCAP 0 0 0 200 400 600 800 1000 1200 1400 1600 0 100 200 300 400 500 600 700 Feature threshold Feature threshold

Coverage Feature-Response curve (no mate-pairs) Kmer Feature-Response curve (no mate-pairs) 110 14 100 12 90 80 10 70 60 8 50 6 40 SUTTA SUTTA Minimus Minimus 30 Phrap 4 Phrap Approximate coverage (%) 20 TIGR Approximate coverage (%) TIGR PCAP PCAP 10 2 0 0.5 1 1.5 2 2.5 3 -1 -0.5 0 0.5 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve (no mate-pairs) Mis-assembly Feature-Response curve (no mate-pairs) 100 110 90 100 80 90 70 80 70 60 60 50 50 40 40 30 SUTTA SUTTA Minimus 30 Minimus 20 Phrap 20 Phrap Approximate coverage (%) 10 TIGR Approximate coverage (%) 10 TIGR PCAP PCAP 0 0 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 0 50 100 150 200 250 300 Feature threshold Feature threshold

Figure 8: Feature-Response curves by feature type for Brucella suis without mate-pair constraints. 174 141142147148149168175176177178179180181184185186187189190192193194197198200201202136 2021222324252610111213141516171819123456789 *102*17155 *11230 41 1321628 *9969 *43 1537666 *2352 44 *14553 *151*54 39 19 *84 105*22*898521 *177779 *6381 *115*158*150126 *109*10472 *57 48 53 *9 1632092 47 *113121144 *32 *129*15513 *169*62 *114*117 *42 36 *41 *116*1374727 3146 *15 *3040 *100101195*5 *154*59 *37 *3918 *2834 *44 33 *124*143*13938 1307498 34 *111*166174*31 2 QRY *4 QRY *50 1251316190 *183*172*43*687 2935 *14040 *49118*5610 *12820319112 51 188*88*24 *133*5893 *135134*7880 *51*3 7546 107*68170*657116 *29 *182*122*14 7 *54 13864 *11067 9137 *16460 48 106*1 *1182 12315625 *5026 *49 *160*19614619928 165 *2738 103*32 42 173*35 *45 *86 *159*9597 *157*167*7096 120 152161108 *127*119*33 *947383 52 *45 36

AE014291 REF AE014292 AE014291 REF AE014292 (a) Minimus (b) PHRAP

*Contig119Contig238Contig185Contig237Contig48Contig19Contig55 80448435873941868882834042853843 Contig140 283463 *Contig203*Contig162*Contig49Contig2 23 *Contig35*Contig34Contig33 *52 *Contig151Contig233Contig199Contig208 Contig76 *Contig169Contig222 *Contig165Contig111 1 Contig39 *7421 *Contig168*Contig235Contig234*Contig66*Contig25 *Contig148*Contig124 *Contig170Contig31 *Contig207Contig171Contig112 22*2 *Contig221Contig184 59 *Contig229Contig114*Contig71Contig56 17 *Contig236*Contig191 *Contig22Contig13Contig26 46 *Contig212 *Contig104*Contig30 *53 *Contig126*Contig3 *Contig115 7 *Contig177Contig225Contig105*Contig90Contig94Contig70 *71*29 Contig60Contig51 *Contig193*Contig159*Contig194Contig9 *9 *Contig202*Contig28 *Contig205Contig227*Contig84 *127569 *Contig189Contig158Contig95 *6430 *Contig125 *Contig174*Contig161Contig122Contig224*Contig96*Contig11*Contig4 *58 Contig5 *Contig143Contig186Contig150Contig106Contig149*Contig86Contig92 *33*25*3610 *Contig131Contig58 *Contig197*Contig37Contig72Contig52Contig38 *Contig200Contig160Contig110 3 *Contig120Contig69 65 *Contig213*Contig93 *Contig230Contig132 66*8 *Contig130Contig219*Contig47Contig164Contig45Contig46 *7211 *2079 *Contig101*Contig155Contig181 *7357 *Contig180Contig175 QRY QRY *Contig24Contig40 *Contig166*Contig64Contig214*Contig62 *4 *Contig142*Contig154Contig157*Contig74*Contig73Contig218*Contig7Contig29 *27*37*32*67 *Contig179Contig116 7662 *61 Contig98Contig89 *26 *Contig113*Contig153Contig79 *Contig139*Contig123Contig231Contig163 *Contig198Contig36 *77*5 Contig15Contig16 60 *Contig53Contig87Contig91 15 *Contig54Contig88 *Contig81 *Contig127 *5019 *Contig232*Contig100Contig102Contig85 *Contig195*Contig137Contig196*Contig82*Contig80*Contig20Contig121 *Contig211*Contig146*Contig108Contig152*Contig8 *Contig204Contig41 *Contig21Contig97 45 *Contig216Contig209*Contig14Contig228Contig42 *48 *Contig206Contig32Contig78 24 *Contig109Contig172Contig178Contig17Contig77Contig59 *55 *Contig156Contig63Contig99 *18 Contig182Contig107Contig103 *Contig223Contig133*Contig44 *13 *Contig210Contig145Contig118 Contig117 49 *Contig183Contig68 *Contig188*Contig23*Contig67Contig43 Contig187Contig134Contig1 70*6 *Contig135Contig128 *81*78*68 *Contig190*Contig129*Contig136 *Contig65 *47 Contig192 *Contig220*Contig141*Contig75 *54 *Contig83 14 Contig147Contig173*Contig57 *Contig201Contig138*Contig27Contig226Contig144Contig61 51 *Contig18Contig50 *5631 Contig167*Contig12*Contig6 *16 *Contig176*Contig215Contig217 Contig10 AE014291 AE014292

AE014291 REF AE014292 REF (c) Euler (d) PCAP

575856 105107384047481115171819222425603134367 *15 *71*3378 *4330 *95 *29 38 *56 33 45 *55*4 *68 8739 *378 11 *8626 *2239 44 *82 81 *32 *4276

7 *104 21 *3 31 *85*1 *3588 54*9 16 7497 *1 *27*52 32 29 *54 *729 *4923 52 *4842 *49*9070 *14*51*19 *93 1235 20 *100*1029979 *8428 *25 *57 26 *63 QRY *47 108 34 *61*2169 QRY *13 *5313 17 46 435345 405 *106*98916 *10 6 8341 *2314 7530 *16 *55 18 *66*65 4541 *89

*2 58 *46 *2050 94 *10373 *12*77*62 *10 101*80 *3 *92*96 *36 *595167 *2 3728 24 8 *4427 *50 64 AE014291 *AE014292

REF AE014291 REF AE014292 (e) SUTTA (f) TIGR

Figure 9: Dot plots for Brucella suis (no mate-pairs). Assemblies produced by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

175 Staphylococcus epidermidis

Feature-Response curve (no mate-pairs) 120

100

80

60

40 SUTTA Minimus 20 PHRAP

Approximate coverage (%) TIGR PCAP 0 0 500 1000 1500 2000 2500 Feature threshold

Matepairs Feature-Response curve (no mate-pairs) Polymorphism Feature-Response curve (no mate-pairs) 120 120

100 100

80 80

60 60

40 SUTTA 40 SUTTA Minimus Minimus 20 Phrap 20 Phrap

Approximate coverage (%) TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 500 1000 1500 2000 2500 0 100 200 300 400 500 600 700 Feature threshold Feature threshold

Coverage Feature-Response curve (no mate-pairs) Kmer Feature-Response curve (no mate-pairs) 120 110 100 100 90 80 80 70 60 60 50 40 40 SUTTA SUTTA Minimus 30 Minimus 20 Phrap 20 Phrap Approximate coverage (%) TIGR Approximate coverage (%) 10 TIGR PCAP PCAP 0 0 0 5 10 15 20 25 0 5 10 15 20 25 30 Feature threshold Feature threshold

Breakpoint Feature-Response curve (no mate-pairs) Mis-assembly Feature-Response curve (no mate-pairs) 110 120 100 90 100 80 70 80 60 60 50 40 SUTTA 40 SUTTA 30 Minimus Minimus 20 Phrap 20 Phrap Approximate coverage (%) 10 TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 20 40 60 80 100 120 0 50 100 150 200 250 300 350 400 450 Feature threshold Feature threshold

Figure 10: Feature-Response curves by feature type for Staphylococcus epider- midis without mate-pair constraints.

176 2913652922933692972993733743753763793803813823833843853873893913923933943963991081141181212012067389 545556202122232425262728296230313233343536 341415342416270417271418345272419346273347348275276349278420421422423350424351425280281282356283357285359286289360361363290 434445464748491011121314151617181952123456789 252254328255256258259188400401402403330404405332406333260407334408262409336263264265339266267195268197410411412413340414 *68*6366373839404142 146147148221222225300301302303230304305306307309236237238311313240314315316317244172319247249176320321324251325 *58*64 *377*200*203*165*94211212214216217218354657 *232248*96*88129228 *124*111173*4513844 *235*25 *101*11712217931 *191*338204220335109337 *116*398*1232395032 *84 *13512618053 *288*119*132298 *59*74 *279*158343*54*75 78 *168*250161*91127 241*6 *318*130*70113 15010035849 *14*30 *9552 *184*169*269*215*331*128170115294 *6586 *190*166*196*257*366846723 77 *323*226210308*97 424 *63*3731268 *106185*90 *224*41*246421 *167112*17 83 *388*131*329*159*16019935236261 *20735326 *7667 *157*367386104*66228*2799 *187*125202227233183242*3476 *61*7257 *327*229*390*2215383*789

QRY 56 QRY *120*234*344134194181 *7853 *198*17116360 *372*1101 69 *164*58 *2 33 *79 *152*205213231*19 *71 *155724820 *162370310143*7151 73 *98 75 *145*8629 *14965 13 107*11193*79 *151*243*1416955 *177*245*2231429339 *85 *103*253*28715643 *3552461742743541024782 *186*87296*1640 *136*18 *175*322364295*1085 *60*82 155 *137*7436881 *378*139*209*105*178208261140 *15436 *505181 *192*189*28413332639518280 *70 *1293713973 *277219*3859 *14477 62 80 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF S_epidermidis_RP62A_chromsome REF (a) Minimus (b) PHRAP

Contig113Contig114Contig116Contig111Contig112Contig86Contig14Contig22Contig87Contig53Contig75Contig48Contig25Contig56Contig13 514738983437 *Contig34Contig115Contig27Contig81Contig85Contig63Contig64Contig7 1071061011001041053591784893772950394181524586548830834049534246963694 Contig33 *701081091031633329026995527959289 *Contig29Contig78Contig38 *17 *Contig58*Contig42*Contig28Contig24 *Contig108Contig97Contig72 *Contig41Contig43Contig50 57 Contig57Contig73 697 *Contig59 *66 Contig31Contig70 *Contig36 *Contig67Contig65 1 *Contig79 8 *Contig88Contig107Contig54Contig80 3161 *Contig11 *Contig18 *58 *Contig66 14 Contig103Contig61 *Contig44*Contig2 *1015 Contig60 *7213 *85*75*76*23282224 *Contig102*Contig6 *Contig110Contig101*Contig35*Contig94Contig30Contig10Contig39 *Contig84Contig99Contig93Contig83Contig90 *Contig104*Contig95*Contig98 *796 *82*6584 *Contig74*Contig32 *68 Contig51*Contig1 *Contig16Contig96 12 QRY 9 QRY Contig52 19 Contig68Contig26*Contig8 *Contig89*Contig71Contig69 *Contig105 2 *2011 *Contig45Contig19Contig76Contig77 *62 Contig12 5 Contig109*Contig55Contig82 *Contig92 *44*59 *Contig91 *63 *Contig100*Contig106 *6418 *Contig15Contig9 *102*7471 Contig20 *Contig21*Contig40Contig47 *5687 *Contig46 Contig17 *73*4 Contig49 *97*80*25*21*6043 *Contig23 67 *Contig37Contig3Contig4 Contig5 *3 *Contig62 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF REF (c) Euler (d) PCAP

65 3436 *47*2522375455565758596061626364 71757677488385888910111416171851525354565790599293952325262860296162656832345679 *19*5038 16 *55*94*27 12 *18 69 13 *33*24 *4015 5 45 *23 *41 10 *47*42 *28*14 21 8 15

*52*53*51*494342343611 79 *45*17 43 *3819 87 *3550466 13 *27 *708022 26 *21*84*73*7886 91 82 QRY QRY 2 64 58 *12 *1

*411 *74 *2 *9 72 *46 *3966 8 *3032 *37 33 *3549 *317 6324 *30 44294 *20484039 *44*81*20 *67

*3 *31 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF S_epidermidis_RP62A_chromsome REF (e) SUTTA (f) TIGR

Figure 11: Dot plots for Staphylococcus epidermidis (no mate-pairs). Assemblies produced by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

177 Wolbachia sp.

Feature-Response curve (no mate-pairs) 400 350 300 250 200 150 SUTTA 100 Minimus Phrap

Approximate coverage (%) 50 TIGR PCAP 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Feature threshold

Mate-pairs Feature-Response (no mate-pairs) Polymorphism Feature-Response curve (no mate-pairs) 400 250 350 200 300

250 150 200 150 100 SUTTA SUTTA 100 Minimus Minimus Phrap 50 Phrap

Approximate coverage (%) 50 TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 500 1000 1500 2000 2500 3000 3500 4000 0 50 100 150 200 250 300 350 Feature threshold Feature threshold

Coverage Feature-Response curve (no mate-pairs) Kmer Feature-Response curve (no mate-pairs) 300 100 SUTTA Minimus 90 SUTTA 250 Phrap 80 Minimus TIGR 70 Phrap 200 PCAP TIGR 60 PCAP 150 50 40 100 30 20 50 Approximate coverage (%) Approximate coverage (%) 10 0 0 0 20 40 60 80 100 120 140 160 180 200 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve (no mate-pairs) Mis-assembly Feature-Response curve (no mate-pairs) 300 300

250 250

200 200

150 150

100 SUTTA 100 SUTTA Minimus Minimus 50 Phrap 50 Phrap

Approximate coverage (%) TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 5 10 15 20 25 30 35 40 45 50 0 100 200 300 400 500 600 Feature threshold Feature threshold

Figure 12: Feature-Response curves by feature type for Wolbachia sp. without mate-pair constraints.

178 1247713714 20482049 1097109810991240124112421243124412451246564709565566567568569710711712 1092109310941095109610971098109920402041204220432044204520462047560561562563564565566567568569 123710911238109212391093109410951096704705560706561707562708563 1086108710881089203020312032203320342035203620371090203810912039553554555556557558559 1230123112321233123412351236109055855970070170270399 1077107810792020202120222023202420252026202710802028108120291082108310841085544545546547548549550551552 12231224122512261080122710811228108212291083108410851086108710881089550551552553554555556557 106920102011201220132014201520162017107020181071201910721073107410751076536537538539540541542543 107410751076107710781079122012211222540541542543544545546547548549 200120022003200420052006200710602008106120091062106310641065106610671068530531532533534535 106912101211121212131214121512161070121710711218107212191073535536537538539394397398399 104610471048104910501051105210531054105510561057105810592000514515516517518519520521522523524525526527528529 120112021203120412051206106012071061120810621209106310641065106610671068530531532533534 10291030103110321033103410351036103710381039104010411042104310441045500501502503504505506507508509510511512513 105210531054105510561057105810591200520521522523524525526381527528383384529386389 100810091010101110121013101410151016101710181019102010211022102310241025102610271028190191192193194195196197198199 10441045104610471048104910501051511512513514515516371517518373519375377378 1792179317941795179617971798179910001001100210031004100510061007180181182183184185186187188189 1037103810391040104110421043504505506360507361508509363364367368369510 1779178017811782178317841785178617871788178917901791164165166167168169170171172173174175176177178179 10261027102810291030103110321033103410351036350351352353354358359500501502503 1763176417651766176717681769177017711772177317741775177617771778150151152153154155156157158159160161162163 1013101410151016101710181019102010211022102310241025340341345346347348349 17481749175017511752175317541755175617571758175917601762132133134135136137138139140141142143144145146147148149 1000100110021003100410051006100710081009101010111012180328183184332333335336190192339 173017311732173317341735173617371738173917401741174217431744174517461747120121122123124125126127128129130131 895896897898899103106109112125152304161308311312313314317173179320323326 171517161717171817191720172117221723172417251726172717281729100101102103104105106107108109110111112113114115116117118119 872873874875876877878879880881882883884885886887888889890891892893894 170017011702170317041705170617071708170917101711171217131714887888889890891892893894895896897898899 13921539139313941395139613971398139915401541154315441545860861862863864865866867868869870871 139713981399864865866867868869870871872873874875876877878879880881882883884885886 1529138313841385138613871388138915301531153215331534153515361390153713911538850851852853854855856857858859 13831384138513861387138813891390139113921393139413951396851852853854855856857858859860861862863 137413751376137713781379152015211522152315241525152613801527138115281382841842843844845846847848849 13691370137113721373137413751376137713781379138013811382838839840841842843844845846847848849850 15101511151215131514151515161370151713711518137215191373838693839694695696697698699840 1356135713581359136013611362136313641365136613671368823824825826827828829830831832833834835 150713611508136215091363136413651366136713681369830831832833834690835836691837692 1343134413451346134713481349135013511352135313541355810811812813814815816817818819820821822 135915001501150215031504150515061360680681826827682828683829684685686687688689 132513261327132813291330133113321333133413351336133713381339134013411342800801802803804805806807808809 135013511352135313541355135613571358674675676677678679820821822823824825 13051306130713081309131013111312131313141315131613171318131913201321132213231324490491492493494495496497498499 1340134113421343134413451346134713481349810811812813814815670816671672817818673819 13001301130213031304459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489 119411951196119711981199660806661807662663808809664665666667668669 420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458 13301331133213331334133513361190133711911338119213391193657658659800801802803804805 1687168816891690169116921693169416951696169716981699400401402403404405406407408409410411412413414415416417418419 132213231324132513261180132711811328118213291183118411851186118711881189650651652653654655656 1662166316641665166616671668166916701671167216731674167516761677167816791680168116821683168416851686 1175117611771178117913201321641642643644645646647648649 162916301631163216331634163516361637163816391640164116421643164416451646164716481649165016511652165316541655165616571658165916601661 131013111312131313141315131611701317117113181172131911731174495496497498499640 160516061607160816091610161116121613161416151616161716181619162016211622162316241625162616271628790791792793794795796797798799 116411651166116711681169630631632633634635636490491637492638639493494 129612971298129916001601160216031604763764765766767768769770771772773774775776777778779780781782783784785786787788789 13001301130213031304130513061160130711611308116213091163485486487488489 1280128112821283128412851286128712881289129012911292129312941295746747748749750751752753754755756757758759760761762 1156115711581159622623624625626480627481482628483629484 1264126512661267126812691270127112721273127412751276127712781279732733734735736737738739740741742743744745 115011511152115311541155471618472473619474475476477478479620621 1252125312541255125612571258125922001260126112621263720721722723724725726727728729730731 1140114111421143114411451146114711481149467468469610611612613614615616470617 123712381239124012411242124312441245124612471248124912501251703704705706707708709710711712713714715716717718719 1136113711381139602603604605606460607461608462609463464465466 122012211222122312241225122612271228122912301231123212331234123512367007017028687888990919293949596979899 1126112711281129113011311132113311341135450451452453455456457458459600601 121012111212121312141215121612171218121939239339439539639739839970717273747576777879808182838485 1110111111121113111411151116111711181119112011211122112311241125440441442445446447448 19981999120012011202120312041205120612071208120938138238338438538638738838939039160616263646566676869 1100110111021103110411051106110711081109427282289431432433435290436291437438293298 198819891990199119921993199419951996199737137237337437537637737837938050515253545556575859 254400401403260407408409264265268412414415416271418274275276277420421425426 19791980198119821983198419851986198736236336436536636736836937040414243444546474849 991992993994995996997998999200206208210226231235236237238239242246250 196919701971197219731974197519761977197835335435535635735835936036130313233343536373839 971972973974975976977978979980981982983984985986987988989990 196019611962196319641965196619671968343344345346347348349350351352212223242526272829 1490149114921493149414951496149714981499957958959960961962963964965966967968969970 19501951195219531954195519561957195819593353363373383393403413421314151617181920 1480148114821483148414851486148714881489945946947948949950951952953954955956 1937193819391940194119421943194419451946194719481949321322323324325326327328329330331332333334101112 1470147114721473147414751476147714781479791938792939793794795796797798799940941942943944 192219231924192519261927192819291930193119321933193419351936310311312313314315316317318319320 1460146114621463146414651466146714681469787788789930931932933934935936790937 190419051906190719081909191019111912191319141915191619171918191919201921300301302303304305306307308309 145414551456145714581459920921922923924925780926927781928782929783784785786 15781579158015811582158315841585158615871588158915901591159215931594159515961597159815991900190119021903 14491450145114521453915770916771917918772919773774775776777778779 155115521553155415551556155715581559156015611562156315641565156615671568156915701571157215731574157515761577 144014411442144314441445144614471448766767768769910911912913914 1534219115351536219315371538153915401541154215431544154515461547154815491550 1438129214391293129412951296129712981299905906760761907762908909763764765 2187218915301531153215332190 1287128812891430143114321433143414351436129014371291754755756757758759900901902903904 2178152015211522152315242181152515261527152815292186 14201421142214231424142514261280142712811428128214291283128412851286746747748749750751752753 15162173151721741518217515192176 14101411141214131414141514161270141712711418127214191273127412751276127712781279740741742743744745 2169151015111512151321701514217115152172690691692693694695696697698699 1266126712681269733734735590736591737738592739593594595596597598599 1500150115021503216015042161150515061507150815092166687688689 1400140114021403140414051406126014071261140812621409126312641265586587588589730731732 21502151215321542155671672673674675676677678679680681682683684685686 12551256125712581259721722723724725726580581727582728729583584585 214021442145214721482149670 12501251125212531254717571572718573719574575576577578579720 213811902139119111921193119411951196119711981199660661662663664665666667668669 12481249139715716570 11892130213121322133213421352136655656657658659 *31207*9467 21252126212721281180212911811182118311841185118611871188650651652653654 *429*175*1703932805258 1171117211731174117511761177117811792120212121222123640641642643644645646647648649 *1182611199 11671168116921102111211221132114211521162117211811702119634635636637638639 QRY QRY 1163116411651166630631632633 *107*120423 210021012102210321042105210621071160210911611162626627628629 11431144114511461147114811491150115111521153115411551156115711581159610611612613614615616617618619620621622623624625 362542 112211231124112511261127112811291130113111321133113411351136113711381139114011411142600601602603604605606607608609 *191*385*157279285196*12*19*2411 1100110111021103110411051106110711081109111011111112111311141115111611171118111911201121288289290291292293294295296297298299 *287*439*195221*71143140 188818891890189118921893189418951896189718981899272273274275276277278279280281282283284285286287 *202417*81*537*3 1872187318741875187618771878187918801881188218831884188518861887260261262263264265266267268269270271 *372*40537630119718 1859186018611862186318641865186618671868186918701871243244245246247248249250251252253254255256257258259 *243*232*124198 1846184718481849185018511852185318541855185618571858230231232233234235236237238239240241242 *24919366*4 1830183118321833183418351836183718381839184018411842184318441845219220221222223224225226227228229 *138*434*155*14402 181818191820182118221823182418251826182718281829201202203204205206207208209210211212213214215216217218 *169*388*257*21425138293 18041805180618071808180918101811181218131814181518161817990991992993994995996997998999200 *154*194*104*88316 1800180118021803968969970971972973974975976977978979980981982983984985986987988989 *168162188*41 1487148814891490149114921493149414951496149714981499952953954955956957958959960961962963964965966967 *150158370*6234374 14701471147214731474147514761477147814791480148114821483148414851486939940941942943944945946947948949950951 *227*327*216*60114220219241255 14561457145814591460146114621463146414651466146714681469922923924925926927928929930931932933934935936937938 *318*329*972881829865 20991440144114421443144414451446144714481449145014511452145314541455909910911912913914915916917918919920921 430413128*61 14332090143414351436209314372094143820951439209620972098900901902903904905906907908 *21216696 1425208214262083142720841428208514292086208720882089143014311432 101233*32 20731417207414182075141920762077207820791420142114221423208014242081 *263*1727028 20691410141114121413207014142071141520721416590591592593594595596597598599 *307*110*30132284297 14021403206014042061140520621406206314072064140820651409206620672068789 *23*358768 20512052205320542055205620572058205914001401580581582583584585586587588589123456 *56310*8629 *215622262050570571572573574575576577578579 *177*127*272240217108117*89 2244 *266*396*116225222 *2250 *278*201*338*134*337449*9020 *215921962237 *1542*38122319 *2213*2183*22332108 *344*303*411*365*410300259*921461459173 *443*390*136189*44147135 22072252 *205*357*153185218 *2216 *149*224*186228247342 *2152*2240 *281*267*2963791211427943 2249 *156*302362269126 *22252251 *133*252273*7076 *221222112218 *123*424*85*7 *2220*2245 *151*299100*34454 *2091*21672192218821802231 16 *2234*2168*21632165 *209*113*178229*2639510248 *2203*22462221 *422*374*575169 2242 *334321419174*84*2163 *2185*21942217*8362238 *148*244*1052623663872 *2146*2201*22282215221421792143 *305*17231542825646 *2092*2205*2184*221021242177*837 *245*22325377 *2162*2158*222917612199 *391*355*286*78199*72805447 2230 *29216027831395 *2222*2247 *283*330*30613114123453*1 *214122322241 *159*324*258404*1539336 *331*129*165309115 *2164*2142*2206218222232253 *203*164111406137295204230 *215722042224220821372197 *101634441818 *2248 *215356187248176325*503805964 *219522092202 *167*322*294*213171*45*75 22362239 *211*55392*8222 *22432198 *130*40144*49 *223522192227 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (a) Minimus (b) PHRAP

Contig610Contig605Contig101 248146356802528164807 Contig598Contig602Contig611Contig599Contig575 11241230116310601134110040346613461884845070390031897247318575 Contig597Contig590Contig581Contig578Contig607Contig604Contig582 1183113211511204104710551009929133123304547396258482949499306839166141909 Contig354Contig591Contig584Contig585Contig559 12271120110511871192109557725344821685217971367627787993863 Contig459Contig576Contig606Contig596Contig583Contig592Contig22 10431082106610711137223613721282669420121952989494487238873 Contig603Contig614Contig601Contig586Contig594Contig608Contig595Contig587 115810791041792371204435814924885884339228760863749268531829345172596 Contig600Contig588Contig577Contig593Contig612Contig609 1010121310931023119934375369798172685710670715796493520326132186 Contig580Contig613Contig615Contig589 1167123578559958760919029892360182223124377955114850495 *Contig313*Contig542*Contig91Contig451Contig284Contig579 1211497183518439362411169705478398384836942955407376912 *Contig503*Contig157Contig424Contig491Contig254*Contig87Contig29 1097112110851051110277018840289383861680487089696275577714346456283 *Contig353*Contig517Contig569Contig494 1219119411264229056686887503027748669106935939752 *Contig156*Contig490*Contig211Contig227Contig515Contig279 11171008102711391222334786986740103201871423719902827266174820 *Contig502*Contig13Contig462Contig125Contig226Contig180Contig79 10371017105711471241120655374755585527019553867983152116243388047 *Contig454*Contig376*Contig553Contig200 108011131237327944674732280957273471970104131502314355159 *Contig485*Contig173Contig556Contig69 1091899292291374199114429409241795812489585794937367 *Contig134*Contig492*Contig291Contig106Contig493 10451004106910131224565620850891937730921233368316994849492208511842 *Contig239*Contig315Contig183Contig316Contig140 1115103810641099121721569996081815024526754382635475733274560390 *Contig186*Contig188Contig185Contig355*Contig33Contig37 119810771136117512361088117610391005797293967322469575979709310192589 *Contig141*Contig431*Contig68Contig142 11221108105310141209533918414462843263394991514610725931513615476 *Contig412*Contig255Contig351*Contig63*Contig96 1160122112009962791769564834989472563729204288875008697909972 *Contig230Contig320Contig190 12101157100610345908956642246173521262514263696979 *Contig427*Contig133*Contig178Contig287Contig39 1067111111542759838038537516913469114276779076082 *Contig222*Contig263*Contig163*Contig298Contig541Contig317Contig4 1143123911401181386916445200329525787161582109557718441868769 *Contig481*Contig342*Contig221Contig202*Contig66Contig304 1228107311031062112981364475985139923543621381656826529679117194 *Contig344Contig385Contig123 117310751148370309567416100380914300965824120286800254392496605607 *Contig543*Contig296Contig460Contig148 1056121510025356835417658813128974055072377764 *Contig247*Contig473*Contig352Contig441Contig107Contig219Contig31 104011381232101210321035595488752835379113901152342452862295579341 *Contig375*Contig572Contig570Contig44*Contig8 10151202367976716940485685743778523720460324453244351 *Contig552*Contig165Contig270Contig362Contig310Contig231 115211311090102593316813534814578917812477299371536438895050949 *Contig432*Contig300*Contig507Contig482*Contig48Contig273Contig309Contig78 11451083111096822643152738283313912954892685973987641878362 *Contig208*Contig433Contig297*Contig18Contig75 10651020102144395933135873758076780543719422084693494857284 *Contig435*Contig399*Contig374Contig289Contig513 1042116979910592584059739011140819694524255911820632 *Contig261Contig394Contig250Contig305 120511591226107010241016117036576347079357025288315628342127149186421959 *Contig129*Contig100*Contig325*Contig144Contig523Contig294Contig155 1164108110941225103098055050595841999583729059214971221040617353066 *Contig547*Contig314*Contig295Contig64 110411331118121234456120935711745560033645782169638396329792 *Contig486*Contig70Contig439 11191046114211791052771808401845186147973915319735404516974702 *Contig207*Contig306*Contig326*Contig73Contig283Contig274Contig10 118811861058115010961191998714313529987397908459284823903247165289729 *Contig217*Contig422*Contig423*Contig445Contig331*Contig58 123410181125100731760871113289254653752212574144 *Contig334*Contig357*Contig429Contig400Contig201*Contig15Contig55 1114122321732823912287815826928136373399774847246 Contig404Contig290Contig213Contig262Contig77 109210781036119676867350331533876157622273848430592854468 *Contig224*Contig516*Contig563*Contig428Contig187Contig322Contig358 104410721240108692260284161293646747448186721456481756373122991 *Contig115*Contig533Contig392Contig511Contig225Contig235Contig560 102968161933569862152455455259878024089044023011529937774 *Contig419*Contig413*Contig442Contig237*Contig32Contig448*Contig1 11661101116510031054412954385307801985971830154754847479704326 *Contig319Contig236Contig272Contig406 118911771127727939320913811202249361465184819776894858140181 *Contig512*Contig192*Contig34Contig41 118412071116102810981059434865904930442649689886 *Contig176*Contig443*Contig266*Contig277Contig194 1218119311061146520532259424872163395175584988347555789 *Contig495*Contig249*Contig464Contig519Contig401 1112115694146158325795178280942559183273653978182827685 *Contig555Contig408Contig259Contig465*Contig57 11411128115535337598494678830325061450127849013015538753 *Contig339Contig421Contig527 11741076167360692663854969766510393449274549877 *Contig204*Contig215Contig191Contig218 10011182227977566526806264255359182108556717232477225330142207 *Contig271Contig206Contig205Contig566Contig417 1220114910871238112545191187588756446262212542961810574170660 *Contig366*Contig458*Contig514Contig364 1201107411951063138606137724666432519860708110333710323856 *Contig243*Contig196Contig407Contig130 1011116110191208110740074488815146870028747599041329493276434951578 *Contig330*Contig496Contig500Contig564Contig95*Contig3 10331026456874652975493796366844272775919746534 *Contig393*Contig119*Contig276Contig288Contig484Contig499Contig19Contig20 119787550613611614438158130890617767837377320592730151 *Contig152*Contig327Contig426Contig478Contig6 11301153102210891084114441518948043810744450895322155841778426057376273487 *Contig474*Contig128*Contig195*Contig475Contig324Contig268*Contig36 11681180118539194328819311958618041024699943044753661 *Contig172*Contig122*Contig418Contig292Contig43 11721216113512031068104810001061218834285234594486160569686969843 *Contig337*Contig536*Contig535Contig520*Contig72 121411901171112310134031198257182538991749596699275823670676 Contig382 QRY 211153861337 QRY Contig153 12331162540684 *Contig341*Contig241*Contig278*Contig240*Contig112Contig540Contig480 10311109512463517458451454722578695978728742378325889350 *Contig116*Contig160Contig285Contig466Contig538Contig169 10491050117812788256089879871 *Contig497*Contig189*Contig299*Contig415Contig220Contig154 *682631 *Contig471*Contig118*Contig522*Contig477Contig561Contig151Contig88 *Contig411*Contig449*Contig257*Contig328*Contig395*Contig332*Contig359Contig437 1 *Contig573*Contig248*Contig27Contig26 *655635 *Contig258*Contig347*Contig521*Contig105*Contig16 *636*12 *Contig198*Contig436Contig461Contig197Contig60 651*1338 *Contig265Contig132Contig380Contig335 *128*6801229*198*665*2341 *Contig389*Contig565Contig558Contig444Contig209Contig47 *Contig509*Contig104Contig323Contig525Contig302Contig177Contig28 622 *Contig402*Contig504*Contig52Contig568Contig456Contig51 *659*2422 *Contig371*Contig554*Contig338*Contig149Contig381Contig463 645 *Contig158*Contig370*Contig539Contig267Contig562Contig321Contig363Contig223 *256384039 *Contig280*Contig534*Contig420*Contig388*Contig110Contig238 *641639658 *Contig529Contig506Contig229*Contig5 *Contig545Contig457*Contig30Contig548Contig251Contig505 *Contig483*Contig532*Contig11Contig479*Contig49Contig343 *675*3436422 *Contig214*Contig329Contig242Contig61Contig93 *723*56*17 *Contig367Contig360Contig199Contig361Contig161 *8 *Contig378*Contig23*Contig50Contig438Contig98 *1231*625 Contig162*Contig84Contig282Contig476Contig487*Contig92Contig85Contig25 *80*21 *Contig81Contig526Contig232Contig12 102*81667*2065*9 *Contig145*Contig488Contig150Contig121Contig537Contig76 *31657 *Contig531Contig530Contig166*Contig74Contig17 *15 *Contig113*Contig518*Contig409*Contig175*Contig174Contig252Contig379 6901976288830 *Contig574Contig386Contig544Contig557Contig246Contig82 10 *Contig312Contig468Contig170Contig425*Contig65Contig71 *662611604*587014537 *Contig333*Contig203*Contig377*Contig168Contig136Contig501 *646694*35 *Contig550*Contig126Contig452*Contig89Contig430Contig137 *67062433 *Contig318*Contig307*Contig349Contig405Contig135Contig551 *64864728 *Contig434Contig264 637672*706 *Contig390*Contig398*Contig210*Contig489*Contig143Contig228 *654*29*27 *Contig303*Contig117Contig286*Contig97*Contig2Contig53 *640671 *Contig340*Contig453*Contig368*Contig345Contig120Contig348 *63065648 *Contig234*Contig469Contig336Contig450*Contig54 Contig396Contig397Contig124Contig138 *623 *Contig524*Contig508*Contig269Contig253 5 *Contig447Contig38Contig46Contig94 Contig281Contig127Contig167Contig373 3 *Contig275Contig470Contig301Contig416Contig42 *66168763450 *Contig311*Contig179*Contig80Contig293Contig384Contig90 643 *Contig446*Contig549*Contig147Contig114Contig308Contig86Contig7 4 *Contig467Contig546Contig369Contig139 632*26*5414 *Contig350Contig410Contig414Contig260Contig567Contig146Contig56 *65316 *Contig21Contig216Contig103Contig184 627 *Contig365*Contig244Contig245Contig356*Contig83*Contig62 *642 *Contig391*Contig102Contig108Contig171Contig498Contig99Contig24 6507 *Contig346*Contig472Contig159Contig109Contig59 815*11 *Contig510*Contig571*Contig383*Contig111*Contig14Contig403Contig67 629 Contig131Contig212Contig387*Contig40Contig35 *1819 *Contig528*Contig181Contig182Contig164Contig372 626 *Contig233*Contig440*Contig256Contig455Contig193*Contig45*Contig9 *633 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (c) Euler (d) PCAP

708563564709565566567568569710711712713714 565566567568710711712713714 10851086108710881089552553554555556557558559700701702703704705560706561707562 702703704705560561707562708563564709 10701071107210731074107510761077107810791080108110821083108454054154254354454554654754854955055187 55255355455555655755855970070192939495969799 10671068106953453539053639153739239353853939439539639739839975 10771078107910805435445455465475495518081828486878889 1060106110621063106410651066381527382528383384529385386387388389530531532533 10701071107210731074107510763953963973983995405417475767879 1050105110521053105410551056105710581059520521522523524525380526 106710681069534535390536391537392393538539394717273 104710481049513514515370516371517372518373519374375376377378379 1060106210631064106510665293863873883895305315325336364656668 1040104110421043104410451046363364365366367368369510511512 1059380526381527382528383384606162 10351036103710381039501502503504505506360507361508362509 10501051105210531054105610571058376377378520521522523525545557 102610271028102910301031103210331034350351352353354355356357358359500 1048104951537051637151737251837351937437550515253 101710181019102010211022102310241025340341342343344345346347348349 104110421044104610473633643663675105115125134243454647 1010101110121013101410151016194195196197198199 103210331034103510371038103950050150250350450550650750836250940 1006100710081009330331332333334335336190337191338192339193 1028102910301031350351353356357358359303132333435363839 100010011002100310041005329183184185186187188189 102010211022102410251026102734234334434534734834920212224252829 179320321322323324325326180327181328182 1010101110121013101410151016101710181019194195197198199340341171819 312313314315316170317171318172319173174175176177178 100933033133233333533619033719119233919310111213141516 161308162309163164165166167168169310311 10001001100210031004100510071008182329183184185186187188189 152153154155156157158159300301302303304305306160307 315170171318172319173174177178179320321323324325326327328 139140141142143144145146147148149150151 300301302303304305160307161308163164165167168169310311312313314 126127128129130131132133134135136137138 140141142144145147148149151152153154155156157158159 113114115116117118119120121122123124125 124125126127128129130131132133134135136138139 106107108109110111112 105109110111112113114115116117118119120121122123 884885886887888889890891892893894895896897898899103104105 881882883884886888889890891892893894895896897898899100101102104 866867868869870871872873874875876877878879880881882883 860861862863864865867868869870871872873874875876877878879880 840841842843844845846847848849850851852853854855856857858859860861862863864865 840841842843844845846847848849850851852853854855856857858 831832833834690835836691837692838693839694695696697698699 688689830831832833834690835836691837692838693839694695697698 678679820821822823824825680681826827682828683829684685686687688689830 676678679820821822823824825680681826827828683829684685686 666667668669810811812813814815670816671672817818673819674675676677 667668669810811812813815670816671672817818673819674675 655656657658659800801802803804805660806661807662663808809664665 656657658800801802803805660806661663808809664665666 495496497498499640641642643644645646647648649650651652653654 494495496497499640642643644645646647648649650651652653654655 487488489630631632633634635636490491637492638639493494 485486487488489630631632633634635636490491637492638493 478479620621622623624625626480627481482628483629484485486 475476477479620621622623624625626480627481482628483629 611612613614615616470617471618472473619474475476477 467468469610611612613614615470617471618473619474 459600601602603604605606460607461608462609463464465466467468469610 452453456457458459600601602603604605606460607461608462463465 299440441442443444445446447448449450451452453454455456457458 295296297298299440441442443444445446447448449450451 431432433434435290436291437292438293294439295296297298 283429284285286287288430431432433434435436291438293294439 421422423424425280426281427282428283429284285286287288289430 272418419274275276277278279420421422423425280426281427282428 416271417272418273419274275276277278279420 407262408263409264265266267268269410411412413414270416417 407262408263409264265266267268269410411412413414415270 249250252253254255256258259400401402403404405260406261 255256257258259400401402403404405260406261 223224225227228229230231232233234235236237238239240241243244245248 240241242243244245246247248249250251252253254 203204206207208209211212214215216217218219220221222 225226227228229230231232233234235236237238239 986987988989990991992993994995996997998999200201202 211212213214215216217218219220221222223224 962963964965967968970971972973974975976977978979980981982983984985 200201202203204205206207208209210 943944945946947948949950951952953954955956957958959960 979980981982983984985986987988989990991992993994995996997998999 790937791938792939793794795796797798799940941942 954955956957958959960961962963964965966967968969970971972973974975976977978 780926927781928782783784785786787788789930932933934935936 791938792939793794795796797798799940941942943944945946947948949950951952953 770916771917918772919773774775776777778779921923924925 925780926927781928782929783784785786787788789930931932933934935936790937 901902903904905906761907908909763764765766767768910911914915 914915770916771917918772919773774775776777778779920921922923924 741742744745747748749750751752753754755756757758759900 902903904905906760761907762908909763764765766767768769910911912913 7307317327337347355907367377385927395935945955975985997409 744745746747748749750751752753754755756757758759900901 72072172272372472658058172758272872958358458558658758845678 735590736591737738592739593594595596597598599740741742743 715716717571572718573719574575576577578579 725726580581727582728729583584585586587588589730731732733734 10613 715716570717571572718573719574575576577578579720721722723724 *3777 *360 *550591 2 *1043*352641 QRY 53 QRY *1055*213*807*814205137273589 *28 *466*424885 20 *3551023242*2 *62*84*737285*915 *368365 97448639 *3851040969 *1455 *74804 271 *36 68798 *4711 *542 45 *8669311 6427 *226*682569*59*58 *94*78*41*67 *464725*83 *5079 *290*44 *5485964885 *804*498 35*3 *162524966309912913180 *48 *150*51467 *22 *316696746 *58*8891498313 *437*317762*23 103 6919 361662 52 *484*455*570*108*106246107*37 102*1692 *478 *82*99*46*4093909624 *176*415175 *98*5929 *10361045334 7665 697091 *81*549571 *743*354*27306 *5666 *4338 85926 *896 289251 *1006*76949 *12 *181 *32 *257760 17 369 *659706 *63*5 *146*166210322 *100*1 *929*379922*56 *961*920*77887 *337 143454 *61*21*10 *346*677 *51*25 *292196609 101*70*6057 616 *34428 699247 *18*6830 *90338 *23 639*41 2631 472 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (e) SUTTA (f) TIGR

Figure 13: Dot plots for Wolbachia sp. (no mate-pairs). Assemblies produced by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

179 Chromosome Y

Feature-Response curve (no mate-pairs) 160 140 120 100 80 60 SUTTA 40 Minimus Phrap

Approximate coverage (%) 20 TIGR PCAP 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Feature threshold

Matepair Feature-Response curve (no mate-pairs) Polymorphism Feature-Response curve (no mate-pairs) 160 120 140 100 120 80 100 80 60 60 SUTTA 40 SUTTA 40 Minimus Minimus Phrap 20 Phrap

Approximate coverage (%) 20 TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 500 1000 1500 2000 2500 3000 3500 4000 0 100 200 300 400 500 600 Feature threshold Feature threshold

Coverage Feature-Response curve (no mate-pairs) Kmer Feature-Response curve (no mate-pairs) 110 70 100 60 90 80 50 70 60 40 50 30 40 SUTTA 20 SUTTA 30 Minimus Minimus 20 Phrap Phrap 10 Approximate coverage (%) 10 TIGR Approximate coverage (%) TIGR PCAP PCAP 0 0 0 0.5 1 1.5 2 2.5 3 -1 -0.5 0 0.5 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve (no mate-pairs) Misassembly Feature-Response curve (no mate-pairs) 120 160 140 100 120 80 100 60 80 60 40 SUTTA SUTTA Minimus 40 Minimus 20 Phrap Phrap

Approximate coverage (%) TIGR Approximate coverage (%) 20 TIGR PCAP PCAP 0 0 0 20 40 60 80 100 120 0 100 200 300 400 500 600 700 800 Feature threshold Feature threshold

Figure 14: Feature-Response curves by feature type for Chromosome Y without mate-pair constraints.

180 561708635636563490564491565639493567494569497498712640714 6194774796216226236246255526266275554826284835585594884897007027037056326336348793 529457600601602603530604605606533534390537539394468396398610611614615542616543544471545 43050536250843836851051151351444137044351944644744837737952152252352445145252745438345 263264265193194199412342271344272419275348276277420350425427282355429284357358500501 139212215143220224225153159230231311242315170316318248176321322252253400401403260407408 697840841843844772773847848777778779782784785786787788789792794795797798799121124203131207 823824825752753826681827755682683757684759689830831760761690836763837691838765693767694768695769696 809591737664738666593667668595596811813740814741742816670744817672818745819746675749677678820821 649576577721723650651725726580654581727728729656583584658586801802730804731732806660661735662 *223657715642645572573719646647574648 *487*177538*2877535326 *363306*47*46*70179 *609*310*515332372*277343929 *513894150 *453*496*495*343*612*299*294*557*283699509 *34741821141132068 *123*338*30326242296 *535*386*393*270562637*5756 *462433155 *261*685*195188 114 *237*592*367504151 *152128*3631 *404435*922169183 *829*607*832*671*469*546210456720*98536120*85 *293*163*1266917462 *566*3287761383264478 *756*415*686*141185351849 *3 *105570 *2261225 364298150599*94*889 *659*202*146*281*200103331304*52369 *385*754*704*540*405*384*582*145*352*205*104617766815 *172*428*324218 *297*340*450382186709461 *807*6748334025161842182 *111613273*2273 2 *507*307*213*676*337839701305598*74 436722112187 1 *144*251*445*118470*5965546710 *243*399*2872361 *736*713*590*780*845*458*808*43731428027966590 *416*474*313291190 *373*733*594*480*38083420 *641*551*679*106*196*406*481319710835680 *472*718417608432365325*71 *3864457157515 *201*771*421*556*423*475165631*54*53503288249301 *774*589*585*643*526235568130259175 10116724 *724*517*115739108*42168 QRY *476*716341638 QRY *241*317*10948422877 *449*553*137132431 *133588246554455*76 *63035411395 *409*692*459222653652550781309308286392302 *391*182*258*548181346 *173*8153254970739581099 *2503351486 *466*751*764*206762758587191267 *334*156*444221387375485333 *339*828*790*688*547*238*793*791*10052049951837843 *805*290*147336800219 *2928 *154*20812975 460127663803102374*1816 *21718357817837 *180*69*55502671 *13513449 *579*2519884 278359388*43 *296*512*229234465812295463*3433 *4104832 247107169 *6564 23335639789 *3454017 *560*366597743376110 *711*531*673*473*274*266*38112563 *43415778317130 *245*371*323*158244140 717822*80426792 414162 *312*440*629*620*706256*13360618541413 *750*257136 *240255239254300 *747142214 *439*748*506*478*492*796*269*209192525189 *204*34914914819 *126*770*424*289*197*117*35 *486*687*361*327*285*26884697 *236*698*528*850*4422273303297258 *11116 *16616416011946476 *8426066 161232 4 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (a) Minimus (b) PHRAP

Contig49Contig17Contig47Contig24Contig43Contig18Contig25Contig19Contig45Contig21Contig15Contig9 1281341301325953465134 1191031131071231361164737436399614839645256547068 12710213813710113912911013514013112144555731656067694996 Contig10 3015 *9380 Contig42 2588 *Contig27Contig28Contig29 115*723241

74 *Contig20Contig54 13389*6

*120*40*18*5

*Contig46*Contig41*Contig39*Contig36 *Contig58 Contig5 71 *62*20*85 *Contig51Contig30Contig12 *Contig13 8781 *10584 *Contig38*Contig37 118*38*762358 *Contig56Contig53*Contig2Contig35Contig1 *Contig26Contig52 *122*45192278 Contig50Contig11*Contig7 4214 *Contig14 11191366613 *Contig4Contig44Contig16 10983 2711 QRY QRY *125*90*3533*2 *Contig6 *17 *Contig34Contig33 *77 *126*108*50*16 *Contig40Contig23*Contig3 *2910 *Contig48 *106114*7521 *Contig57 92*8 *Contig59 *7328

*98941 79 *2486 Contig8 *82 *Contig31 *112*95*12

*Contig22Contig32Contig0 *100*104*3

*124117*26*7 *979

*Contig55 *4 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (c) Euler (d) PCAP

1514 1090109510961097109970170370470556070670756270856356456656771371498 10781080108210831084108754754854955155355455855981838485909192939495 1070107110731074107739053653739253853939439539739854054154370717374757679 105510571058106010611062106410651069524525527528529385388530531534606162636768 10451050105451151251451537051651751837337637752052150525455565759 103510381039104110431044501503504360507361362364365366367510414245464849 10211026102810291030103110331034348351353354355356357359500303132333537 1010101110131014193194195197199340341342343344346347171920232425 1001100210031006100932318032732832918318518833133333433533619133833911121516 158302304306160161308162309166167311312314315170171318172319173176178179321322 118119120122124125126127128129132135136137140142143147148149150153155 877880881882883884885888890891893895896897898899100102103106110112114115116117 697699841842843844845846847849850853854858861862864866867868871872873874876 820822824825681826682828683829684685687688689831832834690691692693839694695696 800801804660806661807663808809664667811812813814670671672817673676677678679 635491492638639495496497498640641642644645646648649651652654655657658659 619474475476477478620621622624626480481628485486487488489630632634 450451453454456458600601603604605607608462466467468610613614616470617618472 431432434435290291437292438293294439296297299440441443444445446447448 1100110111021103416271417418419274276277278420424425426281428283429285287288 *135 244245247248250253254255257258259400401405261262408263409266267410412413415270 209210212214215217219220222223224226227229230231232233235236237238241242 965966968969974977978984986989991992993995996998999200202203204205206207208 931933935790937791792939794798799940941942943944945947948949957958959961962963 918772919773774776777778779920921922925926927781928929784785786787788789930 751752755757759900901902903904905906760761909763767768769910911913915770916771917 73073173273459173859273959459559659874074174274374474574674774875089 71657057271857357557657772072172272472558172758272872958458558658758847 574414411 *326*239*875*997754 *1008*1086*499*459 *1811036386383 *793*134*457802855 1068298449461139 *2 *184069521073 *9 *240*892*404783894879 100083339 *1053*198*15410491051669766325612522 *282*473544*72407818 11*4 1076*502168797436 *12108 *602*970*369*382*78609569865368372 *1040*3871012*795*589923273272 1089*63696054238944297 *7 1081*40753 *535*852*234*851*374*552*34*4429 *1052*421*70971053267565 *1005*430427 *1093*1067*1072*526*523*643*371934737345 *177668268 *1020*848*653519108 *580*647*932*597593479698914 *483*583886264320 10041024*3966298232728 *10881063*550*815*631*89

QRY *228190 QRY 10921032164165 *1048*163*133*1301037 *908*378324983104954907363*2188 *946*953*936*380*275*6861066*859*556836*43*87860 1019924889284286 *144*141*878 *152*579*391*47*53269*5 *819*184 *830352301350555300 *611305303 *10251059*145*280265 *1075*702*810*84070048486951 *123*138*252*384*71925666696 *3 *782*973*764971964 78054553315986 *452*146735546 *6 1091950733460151 1056*571*313*956*508*310 111109182 *981*101*3899011387099 *38149035829582 *509*979*513*805*775*726 *627*656568243246 *955*723*711 *863*493*2 *10791098*211*665 *186*9801007*749*403*83540213 *838192987837332 *201*196279*1 *994*98882179626 *216*803*189 1018*423*422*213715712 1015102322161521836 *982633972951 9123075576629386614 1085482506316 *887*463*606330674169 *7651027131*10349 *469*857*856*736*64*69 *1016*1017*494*650*623*471399*6 *97650522577 *4331042*6801046*8046563722 *379*393*717 *1022*756*187*758337375 1047*5611094*590975967816625455 *317*289105251249 578260599 *565*157121156 *827*175*174*985 *1 464762*58 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (e) SUTTA (f) TIGR

Figure 15: Dot plots for Chromosome Y 3Mbs region (no mate-pairs). As- semblies produced by Minimus, PHRAP, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

181 With Mate-Pairs constraints

Brucella suis

Feature-Response curve 110 100 90 80 70 60 50 40 SUTTA 30 TIGR 20 Arachne Approximate coverage (%) 10 PCAP CABOG 0 0 200 400 600 800 1000 1200 1400 1600 1800 Feature threshold

Matepair Feature-Response curve Polymorphism Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 SUTTA 40 SUTTA TIGR TIGR 30 Arachne 30 Arachne

Approximate coverage (%) 20 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 10 10 0 100 200 300 400 500 600 700 800 900 1000 0 100 200 300 400 500 600 700 Feature threshold Feature threshold

Coverage Feature-Response curve Kmer Feature-Response curve 110 24

100 22 90 20 80 70 18 60 16 50 SUTTA 14 SUTTA 40 TIGR TIGR Arachne Arachne 12 Approximate coverage (%) 30 PCAP Approximate coverage (%) PCAP CABOG CABOG 20 10 0 0.5 1 1.5 2 2.5 3 -1 -0.5 0 0.5 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve Misassembly Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 40 SUTTA SUTTA TIGR 30 TIGR 30 Arachne 20 Arachne Approximate coverage (%) 20 PCAP Approximate coverage (%) 10 PCAP CABOG CABOG 10 0 0 5 10 15 20 25 0 20 40 60 80 100 120 140 Feature threshold Feature threshold

Figure 16: Feature-Response curves by feature type for Brucella suis with mate- pair constraints.

182 282026 10726985 *1 *13 *14

*4

*5

*7 *15 *16 *17 *3332 *18*4 31

*1630 *12 *18*17 30 *19 29 QRY

QRY 28 *21*3 27 *22

*326 *23 25 *24 24

*25 23

*26 *27

*29 22 15 11211 1314

20 12

10119 19 8 AE014291 AE014292

AE014291 REF AE014292 REF (a) ARACHNE (b) CABOG

Contig168Contig256Contig257Contig199Contig58Contig8 6830754069794246364738343743 *Contig123Contig150 3332718072447431357877653964418173664576626770 *Contig220*Contig171Contig2 *Contig22*Contig10 *61*60*59 *Contig62 *2*1 Contig253Contig215Contig226 *3 Contig77 *Contig180Contig241 *Contig174Contig110 Contig45 *Contig179*Contig254Contig120*Contig61*Contig29Contig67 *Contig159*Contig128 *Contig228*Contig185Contig35 *4 *Contig225Contig186Contig111 *Contig240Contig198 *Contig249Contig114*Contig69Contig50 *Contig255*Contig205 Contig135*Contig26Contig19Contig30 *Contig102*Contig136*Contig34 *Contig131*Contig3 *Contig116*Contig89 *6 *Contig193Contig244Contig103Contig113Contig48Contig93Contig68 *Contig141Contig54 *7 *Contig219*Contig208Contig209Contig138Contig12 *8 *Contig222Contig247*Contig85*Contig32 58 *Contig203Contig126Contig167Contig94 *Contig162*Contig129 57 *Contig189*Contig170Contig125Contig243*Contig95*Contig14*Contig4 *19 Contig5 *Contig155Contig200Contig161Contig104Contig160*Contig87Contig91 *Contig140*Contig137Contig139 *20 *Contig213*Contig39Contig70Contig49Contig40 *21 *Contig217Contig169Contig134Contig109 *Contig124Contig65 *Contig231*Contig92 *2225 *Contig250Contig142 2324 Contig237*Contig11Contig173 18 Contig183 *5 *Contig165*Contig182*Contig99 *Contig196Contig190 QRY

QRY Contig117Contig44Contig9 1617 *Contig178*Contig207*Contig175Contig177Contig232*Contig56 2915 *Contig154*Contig164Contig166*Contig71Contig236Contig23*Contig7Contig33 *Contig195Contig192Contig118 Contig115Contig88Contig60 *Contig210*Contig74Contig72Contig96 *Contig127*Contig112Contig130Contig80 1314 *Contig214*Contig149Contig251Contig172Contig37 1112 Contig21Contig24 10 Contig223Contig90 Contig108 9 *Contig82 *26 *Contig73Contig75 *27 *Contig252Contig100*Contig98Contig86 *Contig211*Contig146*Contig184Contig212*Contig83*Contig81 *Contig230*Contig157*Contig106Contig163Contig38 *Contig221Contig41 53 *Contig31 *Contig234Contig227*Contig20Contig248Contig42 *Contig224Contig78Contig53Contig36Contig79 *Contig107Contig151Contig187Contig194Contig25 52 *Contig152Contig105Contig101Contig57Contig97 *Contig46Contig197 *Contig245*Contig229*Contig242Contig122Contig143 Contig156Contig52 5051 Contig239Contig17 *Contig147*Contig27*Contig63Contig43Contig64 *Contig202Contig181 *Contig144Contig201*Contig1Contig0 49 *Contig145Contig132 *Contig204*Contig133*Contig59 *Contig216 *632848 *Contig238*Contig153*Contig76Contig206 *54 *Contig51*Contig84 Contig158Contig188*Contig66 *Contig218*Contig121Contig148*Contig28Contig246Contig119Contig18Contig55 *Contig16Contig47 *55 Contig176*Contig15*Contig6 *56 *Contig191*Contig233Contig235 Contig13 AE014291 AE014292

AE014291 REF AE014292 REF (c) Euler (d) PCAP

717270 41121416552021282962656869567 *15 *6719 *5332 *45 5146 15 41 *25 *69*4

6011 *35 *2347 5459 *5011 *40 *48 47 27 567 36 *59*64 22 *1 35 22 68*9 548 *128 *30*66 *17 31 *6333 *65*52 43 *32 *6324 3051 5242 *14*20 43 12 9 *1849 QRY *55*37 *64*6129 42 QRY *5736 50 *39533 *4 *6713 5818 5 3460*2 48 5624 *10 *37 6 *33 23 *3449 3857 *10 *61 *2 *21 *46 *1619 *442627

*3 *4462 58 *1745 38 25398 *26 *13*66 31 *40 AE014291 *AE014292

REF AE014291 REF AE014292 (e) SUTTA (f) TIGR

Figure 17: Dot plots for Brucella suis (with mate-pairs). Assemblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

183 Staphylococcus epidermidis

Feature-Response curve 110 100 90 80 70 60 50 40 SUTTA TIGR 30 ARACHNE

Approximate coverage (%) 20 PCAP CABOG 10 0 200 400 600 800 1000 1200 1400 1600 Feature threshold

Matepair Feature-Response curve Polymorphism Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 40 SUTTA SUTTA TIGR 30 TIGR 30 Arachne 20 Arachne Approximate coverage (%) 20 PCAP Approximate coverage (%) 10 PCAP CABOG CABOG 10 0 0 100 200 300 400 500 600 700 800 900 0 100 200 300 400 500 600 700 Feature threshold Feature threshold

Coverage Feature-Response curve Kmer Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 SUTTA 40 SUTTA 30 TIGR TIGR 20 Arachne 30 Arachne Approximate coverage (%) 10 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 0 10 0 2 4 6 8 10 12 14 0 5 10 15 20 25 Feature threshold Feature threshold

Breakpoint Feature-Response curve Misassembly Feature-Response curve 110 110 100 100 90 90 80 80 70 70 60 60 50 50 40 SUTTA 40 SUTTA 30 TIGR TIGR 20 Arachne 30 Arachne Approximate coverage (%) 10 PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 0 10 0 10 20 30 40 50 60 0 20 40 60 80 100 120 140 160 Feature threshold Feature threshold

Figure 18: Feature-Response curves by feature type for Staphylococcus epider- midis with mate-pair constraints.

184 262724105 261718301627252820142410311122132329121519763 23 *5*43332212

*2 39

7

16*4 *6 17 35 15 *25*22*21 *9*1368 1314

12 QRY *37

QRY 19 18

*3 20 *38

41

*1 119 3440 8 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF REF (a) ARACHNE (b) CABOG

Contig112Contig13 3743 *Contig34Contig115Contig113Contig114Contig116Contig111Contig27Contig81Contig85Contig63Contig64Contig86Contig14Contig22Contig87Contig53Contig75Contig48Contig25Contig56Contig7 1011009741817345861917883082252883404924794246239685369451473898342 Contig33 10210333329080269918721644847427959289313591784887932950 *Contig29Contig78Contig38 *55*6177 *Contig58*Contig42*Contig28Contig24 *7520 *Contig108Contig97Contig72 *Contig41Contig43Contig50 Contig57Contig73 *Contig59 *56 Contig31Contig70 *Contig36 *Contig67Contig65 *5 *63*62 *Contig79 *11 *Contig88Contig107Contig54Contig80

*Contig18*Contig11 *Contig66 4 Contig103Contig61 3 *Contig44*Contig2 Contig60 58 *13*1567 *Contig102*Contig6 *1470686921 *Contig110Contig101*Contig35*Contig94Contig30Contig10Contig39 *12 *Contig84Contig99Contig93Contig83Contig90 *Contig104*Contig95*Contig98 *Contig74*Contig32 57 Contig51*Contig1 64 *Contig16Contig96 QRY *10

QRY Contig52 *66*76*65 Contig68Contig26*Contig8 *Contig89*Contig71Contig69 *9 *Contig105 *225354 *Contig45Contig19Contig76Contig77 Contig12 52 Contig109*Contig55Contig82 *Contig92 59 *Contig91 *Contig100*Contig106 60 *Contig15Contig9 *6 Contig20 *Contig21*Contig40Contig47 *Contig46 Contig17

*Contig23Contig49 *39*71*1 *Contig37Contig3Contig4 *7 Contig5 8 *Contig62 S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

S_epidermidis_RP62A_chromsome REF REF (c) Euler (d) PCAP

1849534842505147523754 3839202122232441254226432745294649101112131430153250185119456789 16 *48 *2046 16 47 *21

6 *28 *26 12 *2731 44 *15 3 *4 24 *37*31*353440 253 *45*444140323413 *1728 *3522 7 36 QRY *1 QRY *33

2 *14 *19 *1143 1 9 *29*3930 8 *5 *2336 *3338 *17 *10 *2 *S_epidermidis_RP62A_plasmid S_epidermidis_RP62A_chromsome S_epidermidis_RP62A_plasmid

REF S_epidermidis_RP62A_chromsome REF (e) SUTTA (f) TIGR

Figure 19: Dot plots for Staphylococcus epidermidis (with mate-pairs). Assem- blies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

185 Wolbachia sp.

Feature-Response curve 250

200

150

100 SUTTA TIGR 50 Arachne

Approximate coverage (%) PCAP CABOG 0 0 500 1000 1500 2000 2500 Feature threshold

Matepair Feature-Response curve Polymorphism Feature-Response curve 250 200 180 200 160 140 150 120 100 100 80 SUTTA 60 SUTTA TIGR TIGR 50 Arachne 40 Arachne

Approximate coverage (%) PCAP Approximate coverage (%) 20 PCAP CABOG CABOG 0 0 0 500 1000 1500 2000 2500 0 50 100 150 200 250 300 350 Feature threshold Feature threshold

Coverage Feature-Response curve Kmer Feature-Response curve 200 80 180 70 160 60 140 120 50 100 40 80 30 60 SUTTA SUTTA TIGR 20 TIGR 40 Arachne Arachne

Approximate coverage (%) 20 PCAP Approximate coverage (%) 10 PCAP CABOG CABOG 0 0 0 20 40 60 80 100 120 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve Misassembly Feature-Response curve 250 250

200 200

150 150

100 100 SUTTA SUTTA TIGR TIGR 50 Arachne 50 Arachne

Approximate coverage (%) PCAP Approximate coverage (%) PCAP CABOG CABOG 0 0 0 10 20 30 40 50 60 70 80 90 100 0 50 100 150 200 250 300 Feature threshold Feature threshold

Figure 20: Feature-Response curves by feature type for Wolbachia sp. with mate- pair constraints.

186 8598 972473185248146356802528164807 1008375 1009141909403466134618848450703900318883075 817376 179713676277879938929133123304547631396258482949499306839166565470 957797 61372128266942012163995298949448723887357725344821685264363 728474 656204435814924885884339228760863749268531829345172596223 908099 10101023106707157964935203261321624792371658186 60919029892360182223124377955114850434375369798172685795 497183518439362411169705478398384836942955407376912785599587 693593770188402641893838616640804870896962755777143464562836 *3 103201871423719902827266174820422905668688750302774866910974152 10171008855270195538679680831521162433880647334786986740 327944674732280957273471622970104131502314355159553747555244737 *69*68*4 208511842899292291374199114429409241795812489585794937367 101382635475733274560356562085089193773092123336831661184949234901611 4767972939673226454696355759797093101925892156996879608181502452675432840 101453391841446284326363239451461072593151361550 *1 2514263692796511769564834986299472563729204288875008697909972 103475169134691142767790765059089566422461735212660826979 44520032952578716158210955771844186867276927598380319785327 64475985139923569443621381656826529679117138691694 56741610038091430096582412028680025439249660560781342 *2 579341535683541765881312897405507237370309776445 92 101210321035835379113901152342452862295 42 36797671694048568574377852372046032445372324463735159548875221 102510151457891781247726277153643886539505094923 57296867522643152738283313912954892685973987666241878393316813534862 102010211182064439593313587375807678054371942208469349488431 10164918642197991059258405973901114081969452425591338 *8291*9 10247122104061735303657634707936545702528831562836554212716659 *60 8216963839632979805505056709584198372906255921499210 *12 91531973540451697470234463656120935711745560033645780 66774171431352939790845928482390324716528972977180840184518614797344 10186281582692813637336307484723176087111328925465375221254636 *13 6572297686735033153387615762227384843059286235442173282391228786825 44023011529937792260284161293670146747448186721456481756373191 11 971830154754847479704326681619335698621524554552598780240890745 727939320913811202249361465184819776894858140181412954385307801985 8721633951755849883476654348659049304426496898866486421988935 10 941461583257951782809425591832736539781828276102520532259424855557 7950 65968235337598494678812830325061450127849013015538762653 41 23247781567122533014220716736069266385496976651039344965827454987748 40 26221254296181057417066022756652680626425535918210860455671733 932764349515138606137724666432519860708110333710323856112545191187588646756446 101110192727759197465344007448886331514687002874754132942078 10339061776783737732059273014568746529754937963668442251 44450895322155841778426057376263873487550613611614466138158130839 10229432881931195861804102466904304475364151894804381076187 5 8253899174959667582367062188342852345944861605696863919698438 *47 325889350540684211153861337101340311982571295876 27 12788256089879851246351745845145472263457869597872874237871 26 *4 *96*886465 *39*38 5763 1030 *94248767 *992*14*29 QRY *987

QRY *993*32 5423 *995 *33*3245 4344 997998 *61*70*59 *17*6278 *53*5289 996 46 *994 *48 991 *58*66 990 *6 *1026 555651 1007 100612 19 1005 *20 49 1004 *977*18*192615 14 1029 1003 3525 1002 34 18 *7 *8 16 1001*3*7 *36*9315 1000999 *21*3786 *1027 *22 *17 2829 *10281 30 7131 1031 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (a) ARACHNE (b) CABOG

Contig602Contig611Contig599Contig575Contig610Contig605Contig101 1163106011341100848450703900318972473185248146356802528164807 Contig578Contig607Contig604Contig582Contig598 1047105510091124123083916614190940346613461870883075 Contig584Contig585Contig559Contig597Contig590Contig581 109511831262113211511204938929133123304547631396258482949499306 Contig596Contig583Contig592Contig354Contig591Contig22 11371227112011051187119287357725344821685217971327787963 Contig594Contig608Contig595Contig587Contig459Contig576Contig606 12551043108210661071345172596223613721282420121639952989494487238 Contig593Contig612Contig609Contig603Contig614Contig601Contig586 115810791041792371204435814924885884339228760863749268531829 Contig613Contig615Contig589Contig600Contig588Contig577 101012131093102311991579649352032613216586 *Contig542Contig284Contig579Contig580 231243779551148504343753697981726857106707 *Contig157*Contig313Contig254*Contig87*Contig91Contig451Contig29 1167123547839838483694295540737691278559958760919029892360182295 *Contig517*Contig503Contig569Contig494Contig424Contig491 11211085105111021211616640804870896962755143464562497183518439362411169705 *Contig490*Contig211*Contig353Contig515Contig279 121911941126109782042290575030277486691059377018840264189383897 *Contig156Contig462Contig125Contig226Contig180Contig227Contig79 120611171008102711391222433880334786986740103201871423719902827174 *Contig376*Contig553*Contig502Contig200*Contig13 1237125210371017105711471241314355159553747555855270195538831521162 *Contig173*Contig454Contig69 10801113795812489585794327944732280957273471622970104131502 *Contig492*Contig291*Contig485Contig493Contig556 1091208511842899292291374199114429409241937367 *Contig239*Contig315*Contig134Contig316Contig140Contig106 1045100410691013122460356562085093773092123336831699461184949290 *Contig188Contig355*Contig33Contig183Contig37 1115103810641099121796081815024526754382635475733234 *Contig141*Contig431*Contig186Contig142Contig185 10771136117512361088117610391005293967322469635575979310192589215699 *Contig412*Contig255*Contig63*Contig96*Contig68 1053101412091198414462843263632394991514610725931513615476797 Contig320Contig190Contig351 116012211200112211084986299472563729204288875008697905339189972 *Contig427*Contig133*Contig178*Contig230Contig287Contig39 1154121011571006103459089522461735212625142636999627917695648369 *Contig222*Contig263*Contig163*Contig298Contig317Contig4 12391140118112451067111171844186876927598380319785375134691142790782 *Contig481*Contig342*Contig221Contig202*Contig66Contig304Contig541 11031062112911431261568265296791171386916445200329525787161582109557 *Contig344Contig123 11481228107325439249660560781375985139923543621381694 *Contig543*Contig296Contig460Contig148Contig385 1002117312581075237370309567416100380914300965824120286800 *Contig247*Contig473*Contig352Contig441Contig107Contig219Contig31 105612155355417658813128974055077764 *Contig375*Contig572Contig570Contig44*Contig8 104011381232101210321035351595488752835379113901152342452862295579341 *Contig552*Contig165Contig310Contig231 102510151202950509367976716940485743778523720460324453723244637 *Contig300*Contig507Contig482*Contig48Contig273Contig309Contig270Contig362Contig78 115211311090418783933168135348145789178124772993715364388 *Contig208*Contig433*Contig432*Contig18 114512461083111094857296822643152738283313912954892685973987662 *Contig435*Contig399*Contig374Contig289Contig513Contig297Contig75 1065102012571021126311820644395933135873758076780543719422084693484 *Contig261Contig394Contig250Contig305 101611701042116921979910592584059739011140819694524255932 *Contig129*Contig100*Contig325*Contig144Contig523Contig294Contig155 103012051159122612561070102417353036576347079357025288315628342149186466 *Contig547*Contig314*Contig295Contig64 1212116410811094122538396329798055050595841999529062559214921040692 *Contig486*Contig70Contig439 11041133111873540451697470234463656120935711745560033645782169680 *Contig207*Contig306*Contig326*Contig73Contig283Contig274Contig10 1231111912491046114211791052823903247165289771808401845186147973915319 *Contig217*Contig422*Contig423*Contig445Contig331*Contig58 101811251007118811861058115010961191125741998714313529987397908459284 *Contig334*Contig357*Contig429Contig400Contig201*Contig15Contig55 12541234158269281363733997630748472317608711132546537522 Contig404Contig290Contig213Contig262Contig77 1243107810361253119611141223738484305928623544217328239122878628 *Contig224*Contig516*Contig563*Contig428Contig560Contig187Contig322Contig358 10924818672145648175637312297685033153387615762226825 *Contig442*Contig115*Contig533Contig392Contig511Contig225Contig235 104410721240108611529937792260284161293670146747491 *Contig419*Contig413Contig237*Contig32Contig448*Contig1 10031054102961933569862152455455259878024089044023074 *Contig319Contig236Contig272Contig406 116611011165894858140181412954385307801985971830154754847479704326 *Contig512*Contig192*Contig34Contig41 1116102810981059118911771127198727939320913811202249361465184819776 *Contig176*Contig443*Contig266*Contig277Contig401Contig194 110611461184120725942487216339517558498834743486590493044288689 *Contig495*Contig249*Contig464Contig519 11561218119395178280942559183273653978182827610252053285 *Contig555Contig408Contig259Contig465*Contig57 11551112984946788128303250614501278490130155387626941461583257 *Contig339Contig421Contig527 1174107611411128330142207167360854969766510393449274549877353375 *Contig204*Contig215Contig191Contig218 122910011182977566526806264255359182108604556717232477815225 *Contig271Contig205Contig566Contig417 122011491087123819118758875644626221254296181057417022733 *Contig458*Contig514Contig364Contig206 11071201107411951063138606137724432519860708110333710323856112545 *Contig243*Contig196*Contig366Contig407Contig130 1011116110191242120840074488863315146870028747599041329493276434951578 *Contig330*Contig496Contig500Contig564Contig95*Contig3 124710331026373773205927301456874975493796366844272775919746534 *Contig393*Contig119*Contig276Contig484Contig499Contig19Contig20 1089108411441197558417784260573762638734875506136116144381581308906177 *Contig152*Contig327Contig426Contig478Contig288Contig6 1180118511301153102243044753641518948043810744450895322187 *Contig474*Contig128*Contig195*Contig475Contig324Contig268*Contig36 10481000106111686863919432881931195861804102469999861 *Contig172*Contig122*Contig418Contig292Contig43 11721259121611351203106899275823670621883428523459448616056996 *Contig337*Contig536*Contig535Contig520*Contig72 1162121411901171112354021115386133710134031198257182538991749596676 Contig382 QRY 1233378325889350 QRY *Contig241*Contig278*Contig240*Contig112Contig480Contig153 125011091244517458451454722634578695978728742 *Contig160*Contig341Contig466Contig538Contig169Contig540 104910501178103112788256089879851246371 *Contig497*Contig189*Contig299*Contig415*Contig116Contig154Contig285 *58*57*10 *Contig522*Contig477Contig561Contig151Contig220 *650 *Contig328*Contig395*Contig332*Contig359*Contig471*Contig118Contig437Contig88 *Contig248*Contig411*Contig449*Contig257Contig26 1 *Contig347*Contig521*Contig105*Contig573*Contig16*Contig27 *64724 *Contig198*Contig436*Contig258Contig461Contig197Contig60 *648 *Contig265Contig380Contig335 *745*837*712*892*649665666624 *Contig389*Contig565Contig444Contig209Contig132 66483 *Contig104Contig525Contig302Contig177Contig558Contig47 645 *Contig402*Contig504*Contig509Contig568Contig456Contig323Contig51Contig28 *66148 *Contig554*Contig338*Contig149Contig381Contig463*Contig52 67167247 *Contig158*Contig370*Contig539*Contig371Contig562Contig321Contig363Contig223 *1248662663674 *Contig280*Contig534*Contig420*Contig388*Contig110Contig238Contig267 *7949507 Contig229*Contig5 6 *Contig545*Contig529*Contig30Contig548Contig251Contig505Contig506 *67646 *Contig483*Contig532*Contig49Contig343Contig457 *777*660*659 *Contig329Contig242*Contig11Contig479Contig61Contig93 11 *Contig367*Contig214Contig199Contig361Contig161 5 *Contig378*Contig23*Contig50Contig438Contig360 *627*6582931 Contig282Contig476Contig487*Contig92Contig85Contig25Contig98 *23 *Contig81Contig526Contig232Contig162*Contig84Contig12 *45*44*43*42*41*39 *Contig145*Contig488Contig150Contig121Contig537Contig76 *20 *Contig531Contig530Contig166*Contig74Contig17 *6534 *Contig113*Contig518*Contig409*Contig175*Contig174Contig252Contig379 23 *Contig574Contig386Contig544Contig557Contig246Contig82 891*3727 *Contig312Contig468Contig170Contig425*Contig65Contig71 603626 *Contig333*Contig203*Contig377*Contig168Contig136Contig501 1251693694678679680 *Contig550*Contig126Contig452*Contig89Contig430Contig137 *669*729*668*667692 *Contig318*Contig307*Contig349Contig405Contig135Contig551 68535 *687*8168428 *Contig143*Contig434Contig228Contig264 *65468351 *Contig390*Contig398*Contig210*Contig489Contig53 *656*655*21 *Contig368*Contig345*Contig303*Contig117Contig286*Contig97*Contig2 *56*22 *Contig340*Contig453Contig450*Contig54Contig120Contig348 *688*19 *Contig234*Contig469Contig336 *59652 *Contig269Contig253Contig396Contig397Contig124Contig138 651 *Contig447*Contig524*Contig508Contig46Contig94 *18 Contig127Contig167Contig373Contig38 6908 *Contig275Contig470Contig301Contig416Contig281 1340 *Contig179Contig293Contig384Contig90Contig42 *68912 *Contig446*Contig549*Contig147*Contig311Contig308*Contig80Contig86 709646 *Contig467Contig546Contig369Contig139Contig114Contig7 *642*54271*53*52 *Contig350Contig410Contig414Contig260Contig567Contig146Contig56 *657*643 *Contig21Contig216Contig103Contig184 *673*38 *Contig365*Contig244Contig245Contig356*Contig83*Contig62 *677691*55 *Contig391*Contig102Contig108Contig171Contig498Contig99Contig24 681682 *Contig346*Contig472Contig159Contig109Contig59 *17266*16 *Contig510*Contig571*Contig383*Contig111*Contig14Contig403 *1260*15*14 Contig212Contig387*Contig40Contig35Contig67 *675670 *Contig528*Contig181Contig182Contig164Contig372Contig131 644 *Contig233*Contig440*Contig256Contig455Contig193*Contig45*Contig9 *9 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (c) Euler (d) PCAP

10981099564709565566567568569710711712713714 1099565566567568569710711712713714 10901091109210931094109510961097700701702703704705560706561707562708563 10901091109210931094109510961097700701703704705560706561707562708563564709 1081108210831084108510861087108810895505515525535545555565575585599699 10881089555556557558559909192939495979899 10741075107610771078107910805405415425435445455465475485498289 108110831084108510861087549551552553554848586878889 10691070107110721073390536391537392393538539394395396397398399 107110721073107510761077107810795405415425435445455465475488081 106010611062106310641065106610671068529385386387388389530531532533534535 537392393538539394395396397398399707172737476777879 10541055105610571058105952052152252352452538052638152738252838338460 1060106210631064106510661067106810695315325335345353905363916869 105010511052105337051637151737251837351937437537637737837959 5263815273825283833845293853863873883896061626364656667 1040104110421043104410451046104710481049363364365366367368369510511512513514515 105410551056105710581059520521522523524525380 10341035103610371038103950150250350450550636050736150836250940 105010511052105351937437537637737837952535455565758 10231024102510261027102810291030103110321033350351352353354355356357358359500 104510461047104810495125135145153705163715173725051 1010101110121013101410151016101710181019102010211022340341342343344345346347348349 1040104110421043104436536636736836951051143444546474849 335336190337191338192339193194195196197198199 1036103710381039502503504505506360507361508362509363364404142 1000100110021003100410051006100710081009188189330331332333334 10301031103210331034103535435535635735835950050130323335373839 323324325326180327181328182329183184185186187 1020102110221023102410251026102710281029347348349350351352242526272829 317171318172319173174175176177178179320321322 1010101110121013101410151016101710181019197198199340341342343344345346202223 310311312313314315316170 332333334335336190337191338192193194195196101112131415161819 302303304305306160307161308162309163164165166167168169 1000100110021003100410051006100710081009182329183184186187188189330331 154155156157158159300301 318172319173174175177178179320321322323325326180327181328 135140141144147148149150151152153 162309163164165166167169310311312313314316170317171 885886887888889890891892893894895896897898899104114119122124126127128129 154155156157158159300301302303304305306160161308 861862863864865866867868869870871872873874875876877878879880881882883884 134135136138139140141142143144145146147148149151152153 697698699840841842843844845846847848849850851852853854855856857858859860 116117118119120121122123124125126127128129130131132133 687688689830831832833834690835836691837692838693839694695696 101102103104105106107108109110111112113114115 679820821822823824825680681826827682828683829684685686 875876877878879880881882883884885886887888889890891892893894895896897898899100 668669810811812813814815670816671672817818673819674675676677678 851852853854855856857858859860861862863864865866867868869870871872873874 659800801802803804805660806661807662663808809664665666667 835836691837692838693839694695696697698699840841842843844845846847848849 498499640641642643644645646647648649650651652653654655656657658 825680681826827682828683829684685686687688689830831832833834690 632633634635636490491637492638639493494495496497 814815670816671672817818673819674675676677678820821822823824 621622623624625626480627481482628483629484485486487488489630631 802803804805660661662663808809664665666667668669810811812813 11471148114911501151115211531154613614615616470617471618472473619474475476477478479620 643644645646647648649650651652653654655656657658659800801 1140114111421143114411451146607461608462609463464465466467468469610611612 634635636490491637492638639493494495496497498499640641642 1130113111321133113411351136113711381139456457458459600601602603604605606460 621623624625626627481482628483629484485486487488489630631632 1120112111221123112411251126112711281129442443444445446447448449450451452453454455 617471618472473619474475476477478479620 1110111111121113111411151116111711181119294439295296297298299440441 606460607461608462609463464465466467468469610612613614615616470 110711081109430431432433434435290436291437292438293 11271128112911301131450451452453454455456457458459600601602603604605 1100110111021103110411051106283429284285286287288289 1111111211131114111511161117111811191120112111221123112411251126440441442443444445446448449 419274275276277278279420421422423424425280426281427282428 1110434435290436291437292438293294439295296297298299 268269410411412413414415270416271417272418273 1100110111021103110411051106110711081109282428283429285286289430431432433 404405260406261407262408263409264265266267 415270416271417272418273419274275276278279420421422423425280426427 251252253254255256257258259400401402403 261407262408263409264265266267268269410411412413414 239240241242243244245246247248249250 250251252253254255256257258400401402403404405260406 224225226227228229230231232233234235236237238 228229231232234235236237238239240241242243244245246247248249 209210211212213214215216217218219220221222223 206207208209211212213214215216217218219220221222223224225227 996997998999200201202203204205206207208 989990991992993994995996997998999200201202203204205 975976977978979980981982983984985986987988989990991992993994995 970971972973974975976977978979980981982983984985986987988 953954955956957958959960961962963964965966967968969970971972973974 950951952953954955956957958959960961962963964966967969 794795796797798799940941942943944945946947948949950951952 937791938792939793794795796797798799940941942943944945946947948949 929783784785786787788789930931932933934935936790937791938792939793 923924925780926927781928782929783784786787788789930931932933934935936790 771917918772919773774775776777778779920921922923924925780926927781928782 911912913914915770916771917918772919773774775776777778779920921922 760761907762908909763764765766767768769910911912913914915770916 902903904905906760761907762908909763764765766767768769910 747748749750751752753754755756757758759900901902903904905906 744745746747748749750751752753754755756757758759900901 736591737738592739593594595596597598599740741742743744745746 737738592739593594595597598599740741742743 584585586587588589730731732733734735590 5855865875885897307317327337347355907365913456789 574575576577578579720721722723724725726580581727582728729583 578579720721722723724725726580581727582728729583584 *75715716570717571572718573719 715716570571572718573719574575576577 *472595 *10981080 195 QRY *8 QRY 64 324 2127 *1074*185*176*71725951896 *106*112*76113*91*109017 529849 *2261082 87 *288287 1393 *3151061968 103*71*1670 *921054 *4546 233 622 *57*1162 *8144 *1 *134100*66133*97*5143 *168 *5822 36 *24 *447 *115*731201096515 *480*702965284806807150 424 *1284 68 596 *93130*18 *530230*82*83*3128175 *7729 *137*611*373 *145*118*108*142*74132117136*56*50 33 *2771070 *101*131138889480 *2117 *107110*69*32 *111*123*7248 *2 53 *116*79*611466 *14 143*3955 *679 *125*85*7863 633 *35137*2 *41*36 *339 *23*54 *785*59850 131 *353 *28*30*26 *307 *102*121*6783 550 *86*7 *42389 *2034 *210 3137 *34 0 200000 400000 600000 800000 1000000 1200000 0 200000 400000 600000 800000 1000000 1200000 Wolbachia_sp Wolbachia_sp (e) SUTTA (f) TIGR

Figure 21: Dot plots for Wolbachia sp (with mate-pairs). Assemblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

187 Chromosome Y

Feature-Response curve 160 140 120 100 80 60 SUTTA 40 TIGR ARACHNE

Approximate coverage (%) 20 PCAP CABOG 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Feature threshold

Matepair Feature-Response curve Polymorphism Feature-Response curve 160 120 140 100 120 80 100 80 60 60 SUTTA 40 SUTTA 40 TIGR TIGR Arachne 20 Arachne

Approximate coverage (%) 20 PCAP Approximate coverage (%) PCAP CABOG CABOG 0 0 0 500 1000 1500 2000 2500 3000 3500 4000 0 100 200 300 400 500 600 Feature threshold Feature threshold

Coverage Feature-Response curve Kmer Feature-Response curve 110 70 100 60 90 80 50 70 60 40 50 30 40 SUTTA 20 SUTTA 30 TIGR TIGR 20 Arachne Arachne 10 Approximate coverage (%) 10 PCAP Approximate coverage (%) PCAP CABOG CABOG 0 0 0 0.5 1 1.5 2 2.5 3 -1 -0.5 0 0.5 1 Feature threshold Feature threshold

Breakpoint Feature-Response curve Misassembly Feature-Response curve 120 140

100 120 100 80 80 60 60 40 SUTTA 40 SUTTA TIGR TIGR Arachne Arachne 20 20 Approximate coverage (%) PCAP Approximate coverage (%) PCAP CABOG CABOG 0 0 0 20 40 60 80 100 120 0 50 100 150 200 250 300 350 400 450 Feature threshold Feature threshold

Figure 22: Feature-Response curves by feature type for Chromosome Y with mate-pair constraints.

188 *5 1

4 *2

3 *3 2 *4 QRY QRY

1 *5 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (a) ARACHNE (b) CABOG

Contig11Contig18Contig30Contig15 116128134122132645266887583228551 12410413112112611710811511310712369799698807274619148877750 125101129110135715584955789786581607386626782 Contig33 94*1 *90*2 Contig28 45*3 *Contig29Contig35 59*4

Contig7 127*6*5

*114*58*7

*Contig26 *8 Contig22Contig16 *10*9 *Contig20 *12*11 *99*13 *Contig25Contig6 *15112*56*1476 Contig23*Contig3Contig24Contig4 *109*18*17*70*16 *Contig14*Contig13 *Contig9 105*20130133*1954 *Contig8Contig31 103*21 *24*23 QRY QRY *119*27*26*2597 *Contig10 *28 *Contig21Contig19 *120*102*68*29 *Contig27*Contig2 *93*30 *Contig12Contig17 *100*32*3163 *Contig36 *33 *Contig37 *3492

*47*36*35 *37 *38 *106*39

Contig32Contig34Contig1 *49*53*40

*118111*42*41 *46*43

*Contig5 *44 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (c) Euler (d) PCAP

11171215109 702703704705560706562563564709565566569710711712713714 55055155255355455555655755855970070187888990929394959699 39039139353853939639739854054254354454554854970717273767983848586 524380526381382528383384529386387388389530531532534535626364656768 51451537037251837351937437537637737837952052251525354565758 1030359500503504505506361508362509367368369511513353740414445464749 10201021102410251027102934634835235335535722232526272930313334 101110121014101610171019334335190337338192193195196197342344101213161819 10011005100610071009174175176177178321322323324325326180328182184185186188189330332 161308162309163165166167168169310311313314315316170171318172319173 125127128130132133136138139140141143144145149150153155158159300301302305306160 882883884885886888889893895896897898100102104105106107108110114115116119120124 847848849850851853854855856857858859860861862863864865866867872873874878880 829684685687688689832834690835836837838693694695696698699840841842843844845 667668669810811812813670671818819674675677679820821822823824825680682828683 642643644645646647648649652653655658800801802803805660661807662663809664666 480627481628483629485486488630631632633634636491637492493495496497498640641 608462464465467611612613615616470617471618472619475476477478620621622626 441442443445446447448450452453454455457458459600601602603605606607461 424425426281427282428283429285286287430431433436291437292294295296297298440 *166 258402403404405260407263409267268411413414270272418419274276277278279421422 220222224225226227228230232233234236239240242244247248250251252253256257 979980982983986987988990992993995999200203205207208209211212214216217219 796798940942943944945946947948950951954955956960964965967970971972974978 919773774775776777921922923927781782929783787931932933934790791793794795 745747748750752754756757759900901903906760762908763765767768769911912913915916 7307317337355905917377385927395935955965975997407427437446789 5705715727185745755765787207227237247255807275827287295835845855893 356360716 *202*804*917*512 *926*400*997*289 *650*333336930 *2 *111*399715635792 *385*439261977408109 *1022*8761015354 *708*201*806816351 *9207582 *963*290*961*479623969 *686*502*245*659*50 *4 *444*719*371137989 *547255546808320350 *717*953*231*527928839 *14*58 1000*341881577340 69199424 *962*474*199*772*487771*15*2111 *401*656*6106514733948 *366365 1004*981*985*692*304468846469 624229 *6044637515969 *525*870*536*826*537*654*833 *423*523275280221 *3475687539255 *998*48497376666

QRY *191*147 QRY *958*131*135148101952 1026*121877831317*9775 *976*78678528483090561 *868*890*871*339*235*625238241784*20 *117*817*936*827 *1013*122*521*349*113*28*32 *499154 *307*259*254*770 *223*265991237 *789*984*112*941432797975 *343*665*639*73463874 *721*894*736*215*98213 *3 489746129899 *46670748260 13*7 1003*392*118 *266*269879449875676 15196651081 *909*910*17*777998278 *331*312434249 *451*456*732588673 681507204 *672*755517 *1002788435891 *726*598*968194 *1023*7611008 *156*869*764293902924 1641 *406914918749198*4 *157181 657417363364179 *567939183*14187 1018*937573892 *852262490594 *561*438*996*460*271 *288*814*410949303609134 *678*815*6971010*42*43103 *415*416*779*780358 *904*935*586 *614959264 *345*3941028*55412581957 *152*329299938327 *741*494*395246579887587533 *206*210*243*80273 *778516218541 *501*12612391 *142*38146907 *1 42036 0 500000 1000000 1500000 2000000 2500000 3000000 0 500000 1000000 1500000 2000000 2500000 3000000 Y Y (e) SUTTA (f) TIGR

Figure 23: Dot plots for Chromosome Y 3Mbps region (with mate-pairs). As- semblies produced by ARACHNE, CABOG, Euler, PCAP, SUTTA and TIGR. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

189 Short Reads

With Mate-Pairs constraints

Escherichia coli

Feature-Response curve 100 90 80 70 60 50 40 30 20

Approximate coverage (%) 10 SUTTA Velvet 0 0 10000 20000 30000 40000 50000 60000 70000 Feature threshold

Matepair Feature-Response curve Polymorphism Feature-Response curve 100 100 90 90 80 80 70 70 60 60 50 50 40 40 30 30 20 20

Approximate coverage (%) 10 SUTTA Approximate coverage (%) 10 SUTTA Vlvet Velvet 0 0 0 10000 20000 30000 40000 50000 60000 70000 0 50 100 150 200 250 300 Feature threshold Feature threshold

Coverage Feature-Response curve Kmer Feature-Response curve 100 100 90 90 80 80 70 70 60 60 50 50 40 40 30 30 20 20

Approximate coverage (%) 10 SUTTA Approximate coverage (%) 10 SUTTA Velvet Velvet 0 0 0 5 10 15 20 25 30 35 40 45 0 50 100 150 200 250 300 Feature threshold Feature threshold

Breakpoint Feature-Response curve Misassembly Feature-Response curve 3.6 100 90 3.4 80 70 3.2 60 3 50 40 2.8 30 20 2.6 Approximate coverage (%) SUTTA Approximate coverage (%) 10 SUTTA Velvet Velvet 2.4 0 -1 -0.5 0 0.5 1 0 50 100 150 200 250 300 350 Feature threshold Feature threshold

Figure 24: Feature-Response curves by feature type for Escherichia coli with mate-pair constraints. 190 28023620218024420111321418825423928120523516920023126817222323821622727714128325013021994181985 1698100413871333130311621430131816601780131716701357150914161658155816711237191610151061120214411311104411301035135612221419174910841756147117261403167214181450179113931610150716041500151619951763169017151071133116131417157512381673179412211830126215011781112810091337182919181702130217771294195556080587245461184232596854044283396616810418159438865355316068046011916263774020153547456469376287533857676158143929653746844127585360628490343592444625382135918299169299585451098446615628325747352894590492398453 *14454 *19421893 *129193 *246*230*21123428462117047 *1930*1906 9777 *164 1971 *229*171*12458 *90*53 *19121886 *125*449684 1913 *159*118*225*265183186*50*563331 *190018851882756 42 1889 *71*2510287 19961960*6631881 *100 1892 *1101 *232173228*83*12617657 *193314381880 *252192*4632 *131*103 1978 *255*2612619929 *1194*1573*1963191918911951 *221*241*242203*93133138212279929 *19581887 *108*263*157 1991 *267*274*251*751 *19901926 *1514 *1905*1989 *253*26910180 *1820*1964 *147*198*158167*8545 *1980 *6772 *1937 *82207161 1938 *2 *1936 *1558617 *1480*1038*1888*9431568 156 1931*2 *275*266282137*60 *195182*413795 *1764*103012341901200 *160*190*78 1884 *204*127168194 *1878*1929 *152*7938 1908152 QRY *213154121 QRY *1949 *39 5588 1923 1950 17428*3 *19351940 *4 1910 *6 1909

*210 *1983 *209*184*107*276217237222148*52178 *1883*1941*19441902 20 1977 *153175 *197918941928 *206261271264122*40 *1897*1934*19701968 *119135*99220197166*49 *1948196619741932 *68*5963 *1953 *106151256*2275 *1911*1943*1952339437*46 *123*26043 19071947 136*98 *18961946 *231702474 *11981985 *36 1994198 *114 *258*259*243*91 *1992*19451903 *273*257*233*215*245*278*247*185*162208272262 *1967*1927*1973*199319591961191419721917195696789 128*4814613 *1976*1957 *112116 *1965 *139*117*189*181*163*179*109165187*89 *18981921195419751879 *142*15024927013435 19151895 *346530 *19201924 2181491021 *1988*19391987 14516 1922 *177*176*105224240120140*6973 *1969*1962*1984*1981*1925 *104*66*81196115*6422614327 *1904*19861089 *191 *1899 248132111 *1982*1890 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 gi|49175990|ref|NC_000913.2| gi|49175990|ref|NC_000913.2| (a) Velvet (b) ABySS

*contig1268|size14094|read11|cov0.00contig1708|size2967|read10|cov0.00*contig2468|size1388|read8|cov0.00contig1064|size161|read12|cov2.00contig1022|size153|read13|cov3.00contig1590|size128|read10|cov2.00contig1132|size113|read12|cov3.00contig1193|size398|read12|cov1.00contig1125|size162|read12|cov2.00contig1260|size296|read11|cov1.00contig1879|size104|read10|cov3.00contig1083|size478|read12|cov0.00contig1852|size134|read10|cov2.00contig1256|size106|read11|cov3.00contig20|size116|read38|cov11.00contig183|size114|read27|cov8.00contig600|size565|read17|cov1.00contig126|size165|read29|cov6.00contig328|size215|read22|cov3.00contig426|size717|read20|cov1.00contig804|size224|read14|cov2.00contig129|size110|read29|cov9.00contig976|size400|read13|cov1.00contig169|size108|read27|cov9.00contig159|size129|read28|cov7.00contig193|size378|read26|cov2.00contig325|size242|read22|cov3.00contig100|size172|read30|cov6.00contig975|size108|read13|cov4.00contig124|size129|read29|cov8.00contig235|size115|read25|cov7.00contig14|size142|read41|cov10.00contig132|size239|read28|cov4.00contig2089|size355|read9|cov0.00contig46|size117|read34|cov10.00contig2067|size100|read9|cov3.00contig187|size134|read26|cov6.00contig842|size193|read14|cov2.00contig110|size121|read30|cov8.00contig364|size117|read21|cov6.00contig205|size285|read26|cov3.00contig115|size154|read30|cov7.00contig165|size112|read27|cov8.00contig790|size294|read15|cov1.00contig147|size153|read28|cov6.00contig593|size118|read17|cov5.00contig250|size115|read24|cov7.00contig109|size112|read30|cov9.00contig83|size407|read31|cov2.00contig64|size163|read33|cov7.00contig95|size296|read31|cov3.00contig58|size125|read33|cov9.00contig71|size274|read32|cov4.00contig52|size140|read34|cov8.00contig86|size152|read31|cov7.00contig78|size211|read32|cov5.00contig84|size184|read31|cov6.00contig3|size132|read47|cov12.00contig39|size259|read35|cov4.00contig6|size107|read44|cov14.00contig5|size163|read44|cov9.00 scaffold43scaffold24scaffold4C4239C4271C4019new29C4277C4209new39C4149C4237C4077C4043C4201new59new77C4283C4231new27C4333new18C4229C4099C4069new21C4331C4061C4289C4143C4215C4125C4135C4455C4315new81new48C4353C4169new19new40new46C4145C4395C4167new76new68new3new4new2new5 *contig1669|size14206|read10|cov0.00*contig1847|size3937|read10|cov0.00*contig1228|size7653|read12|cov0.00contig1433|size4515|read11|cov0.00contig1238|size8076|read12|cov0.00*contig1659|size290|read10|cov1.00contig2772|size208|read7|cov1.00 scaffold25 *contig1737|size1060|read10|cov0.00*contig1784|size3675|read10|cov0.00contig1224|size14333|read12|cov0.00contig1114|size18305|read12|cov0.00contig1271|size2252|read11|cov0.00 *contig1322|size2539|read11|cov0.00contig1262|size4206|read11|cov0.00*contig1687|size978|read10|cov0.00*contig405|size9476|read20|cov0.00contig990|size27098|read13|cov0.00*contig413|size719|read20|cov1.00*contig2391|size394|read8|cov0.00*contig950|size125|read13|cov3.00*contig394|size158|read20|cov4.00contig2043|size3540|read9|cov0.00contig2432|size258|read8|cov1.00contig349|size175|read21|cov4.00contig625|size202|read17|cov3.00contig448|size127|read19|cov5.00contig231|size430|read25|cov2.00 *scaffold44scaffold8*C4073*C4265*new56C4381C4273 *contig1362|size3257|read11|cov0.00*contig1767|size4316|read10|cov0.00*contig1406|size4240|read11|cov0.00*contig1774|size1622|read10|cov0.00*contig1090|size5822|read12|cov0.00contig1660|size5193|read10|cov0.00*contig1594|size523|read10|cov0.00contig1577|size3278|read10|cov0.00contig1355|size1926|read11|cov0.00contig1793|size2754|read10|cov0.00contig1453|size3779|read11|cov0.00contig1818|size4252|read10|cov0.00*contig2450|size567|read8|cov0.00contig2701|size243|read8|cov1.00 *new57 *contig1537|size10557|read11|cov0.00contig1706|size16092|read10|cov0.00contig1416|size6278|read11|cov0.00contig1919|size1888|read10|cov0.00contig1629|size3809|read10|cov0.00 *contig1349|size16528|read11|cov0.00*contig1280|size7174|read11|cov0.00*contig1809|size1654|read10|cov0.00*contig1998|size5553|read9|cov0.00*contig1975|size4798|read9|cov0.00contig1046|size8947|read12|cov0.00 scaffold76*C4831 *contig1451|size12294|read11|cov0.00*contig1273|size10660|read11|cov0.00*contig1044|size6364|read12|cov0.00*contig1902|size572|read10|cov0.00contig1928|size6026|read10|cov0.00contig1868|size7589|read10|cov0.00*contig2087|size2253|read9|cov0.00contig2237|size1829|read9|cov0.00 new70 contig1667|size17894|read10|cov0.00*contig1265|size3691|read11|cov0.00*contig1718|size7342|read10|cov0.00*contig1588|size230|read10|cov1.00contig1747|size3088|read10|cov0.00contig2287|size1368|read9|cov0.00*contig2178|size746|read9|cov0.00contig2203|size5067|read9|cov0.00contig2205|size325|read9|cov0.00 *scaffold77scaffold66 *contig1252|size15531|read11|cov0.00*contig1683|size4253|read10|cov0.00contig1269|size20186|read11|cov0.00 contig1421|size10856|read11|cov0.00*contig1471|size2277|read11|cov0.00*contig1796|size1886|read10|cov0.00*contig2180|size3590|read9|cov0.00contig1513|size5322|read11|cov0.00contig1401|size5178|read11|cov0.00contig1041|size5604|read12|cov0.00contig1995|size130|read9|cov2.00contig2096|size900|read9|cov0.00 *scaffold74 *contig1018|size21908|read13|cov0.00*contig1694|size6946|read10|cov0.00*contig1724|size112|read10|cov3.00contig1812|size5736|read10|cov0.00*contig1482|size305|read11|cov1.00contig1299|size5303|read11|cov0.00contig1065|size156|read12|cov2.00contig984|size3601|read13|cov0.00*contig2304|size256|read9|cov1.00contig2168|size305|read9|cov1.00 *scaffold46scaffold30scaffold86*C4197 *contig1301|size10166|read11|cov0.00*contig1842|size6071|read10|cov0.00contig1370|size9801|read11|cov0.00contig1411|size5620|read11|cov0.00contig1571|size3740|read11|cov0.00contig993|size6184|read13|cov0.00 *scaffold87 *contig1645|size6353|read10|cov0.00*contig1869|size1408|read10|cov0.00*contig1499|size6809|read11|cov0.00*contig1192|size3132|read12|cov0.00*contig2038|size3240|read9|cov0.00contig1632|size1218|read10|cov0.00*contig2194|size3797|read9|cov0.00contig1817|size7176|read10|cov0.00contig2080|size1966|read9|cov0.00contig1997|size5346|read9|cov0.00contig2153|size679|read9|cov0.00 *scaffold70 *contig1145|size7825|read12|cov0.00*contig2767|size1138|read7|cov0.00contig1201|size3962|read12|cov0.00contig1905|size7611|read10|cov0.00contig1504|size4906|read11|cov0.00*contig973|size7602|read13|cov0.00contig1300|size6899|read11|cov0.00*contig418|size279|read20|cov2.00*contig2672|size106|read8|cov2.00contig598|size2488|read17|cov0.00 *scaffold63scaffold55 *contig1350|size13680|read11|cov0.00*contig1698|size7232|read10|cov0.00*contig2515|size1625|read8|cov0.00contig1536|size9117|read11|cov0.00contig1466|size9967|read11|cov0.00contig1861|size5371|read10|cov0.00 *contig1352|size28495|read11|cov0.00contig1578|size18467|read10|cov0.00contig1823|size7026|read10|cov0.00contig1410|size7517|read11|cov0.00*contig101|size196|read30|cov5.00 *scaffold39*new67 *contig1343|size14495|read11|cov0.00contig1263|size12211|read11|cov0.00contig1324|size16441|read11|cov0.00contig1770|size3947|read10|cov0.00contig2165|size5145|read9|cov0.00*contig2106|size861|read9|cov0.00contig2611|size473|read8|cov0.00 *contig1679|size5412|read10|cov0.00contig1474|size16223|read11|cov0.00contig1506|size5536|read11|cov0.00contig1654|size7659|read10|cov0.00contig2004|size7960|read9|cov0.00 *contig1496|size6640|read11|cov0.00*contig1063|size4295|read12|cov0.00contig1231|size33898|read12|cov0.00contig1617|size8421|read10|cov0.00contig1992|size8587|read9|cov0.00contig1580|size520|read10|cov0.00*contig2364|size585|read8|cov0.00 scaffold33C4323new49 *contig1307|size27615|read11|cov0.00*contig1815|size2354|read10|cov0.00contig1347|size4768|read11|cov0.00*contig2201|size3152|read9|cov0.00*contig1996|size993|read9|cov0.00contig3567|size112|read6|cov1.00 *contig1251|size6999|read11|cov0.00contig1709|size18849|read10|cov0.00*contig2288|size1768|read9|cov0.00contig1811|size4356|read10|cov0.00*contig1980|size5730|read9|cov0.00contig966|size898|read13|cov0.00 new26 contig1368|size13725|read11|cov0.00*contig1653|size1308|read10|cov0.00contig1960|size7539|read10|cov0.00contig1467|size1332|read11|cov0.00contig881|size13018|read14|cov0.00*contig2301|size118|read9|cov2.00contig2131|size2382|read9|cov0.00contig2302|size223|read9|cov1.00 *scaffold67*scaffold81scaffold13C4227 *contig1076|size11601|read12|cov0.00*contig1133|size10544|read12|cov0.00*contig1587|size2676|read10|cov0.00*contig1895|size2728|read10|cov0.00*contig1500|size7519|read11|cov0.00contig1723|size3163|read10|cov0.00contig1381|size1042|read11|cov0.00*contig3618|size129|read6|cov1.00 scaffold64scaffold40scaffold45 contig1152|size16724|read12|cov0.00contig1398|size4010|read11|cov0.00contig1749|size2155|read10|cov0.00*contig2105|size1795|read9|cov0.00contig360|size11798|read21|cov0.00 *scaffold52*scaffold54new37 *contig1371|size31333|read11|cov0.00contig1425|size3612|read11|cov0.00*contig1999|size3773|read9|cov0.00contig1072|size3899|read12|cov0.00contig2814|size125|read7|cov2.00 contig1431|size30027|read11|cov0.00*contig1738|size155|read10|cov2.00contig956|size569|read13|cov0.00 *C4423*new61 *contig1040|size41221|read12|cov0.00contig1623|size12753|read10|cov0.00contig1662|size3840|read10|cov0.00contig1982|size4223|read9|cov0.00*contig2750|size128|read7|cov1.00contig891|size130|read14|cov3.00 *scaffold19*new34C5305 *contig1375|size23500|read11|cov0.00*contig1304|size1942|read11|cov0.00*contig1477|size8947|read11|cov0.00contig1099|size2722|read12|cov0.00contig2055|size1253|read9|cov0.00 *new36*new35 *contig1042|size12197|read12|cov0.00*contig1318|size17507|read11|cov0.00*contig1628|size2709|read10|cov0.00contig1202|size12197|read12|cov0.00 contig1085|size15192|read12|cov0.00*contig2049|size2736|read9|cov0.00contig1591|size9170|read10|cov0.00contig1105|size9030|read12|cov0.00contig1586|size3109|read10|cov0.00 *contig1584|size21425|read10|cov0.00*contig1306|size8723|read11|cov0.00contig1414|size4978|read11|cov0.00*contig1990|size2603|read9|cov0.00contig1168|size6329|read12|cov0.00*contig3066|size140|read7|cov1.00 contig1001|size12041|read13|cov0.00contig1677|size13850|read10|cov0.00*contig2015|size4382|read9|cov0.00contig1486|size2141|read11|cov0.00contig1751|size3348|read10|cov0.00contig1670|size186|read10|cov1.00*contig2585|size167|read8|cov1.00contig2298|size1577|read9|cov0.00contig2291|size1041|read9|cov0.00*contig2478|size772|read8|cov0.00contig2368|size166|read8|cov1.00contig2789|size405|read7|cov0.00 *contig1264|size22059|read11|cov0.00*contig1726|size5797|read10|cov0.00*contig1399|size7164|read11|cov0.00contig1418|size7533|read11|cov0.00*contig3307|size128|read6|cov1.00 scaffold1new1 *contig1067|size15641|read12|cov0.00*contig1367|size8334|read11|cov0.00contig1725|size11029|read10|cov0.00contig1524|size4119|read11|cov0.00 scaffold83 *contig1008|size18973|read13|cov0.00contig1254|size11419|read11|cov0.00*contig1246|size3575|read11|cov0.00contig1605|size5049|read10|cov0.00contig1991|size1207|read9|cov0.00 *scaffold27*new42 contig1093|size17367|read12|cov0.00contig1160|size12582|read12|cov0.00*contig1795|size3091|read10|cov0.00contig1073|size23958|read12|cov0.00contig2466|size531|read8|cov0.00 *contig1017|size10482|read13|cov0.00*contig1520|size8270|read11|cov0.00*contig1247|size9095|read11|cov0.00contig1937|size1920|read10|cov0.00contig1649|size6268|read10|cov0.00*contig2064|size2486|read9|cov0.00contig1442|size5284|read11|cov0.00contig1021|size551|read13|cov0.00 *new43 contig1579|size15744|read10|cov0.00contig1626|size19420|read10|cov0.00*contig2313|size1890|read9|cov0.00contig1209|size1104|read12|cov0.00*contig2350|size1432|read8|cov0.00contig2103|size3368|read9|cov0.00contig2557|size795|read8|cov0.00 scaffold85 contig1376|size10837|read11|cov0.00contig1157|size18404|read12|cov0.00contig1906|size5244|read10|cov0.00contig2009|size4669|read9|cov0.00 *contig1223|size33234|read12|cov0.00*contig1419|size3027|read11|cov0.00*contig2070|size5383|read9|cov0.00 *contig1059|size16555|read12|cov0.00*contig1695|size9605|read10|cov0.00contig1771|size4854|read10|cov0.00contig1473|size2604|read11|cov0.00contig1542|size4733|read11|cov0.00*contig2435|size110|read8|cov2.00*contig2625|size111|read8|cov2.00contig2555|size107|read8|cov2.00 *scaffold15*C4047*new32*new31 *contig1445|size17567|read11|cov0.00*contig1153|size5375|read12|cov0.00*contig1484|size6202|read11|cov0.00*contig1550|size1396|read11|cov0.00*contig1802|size4380|read10|cov0.00contig1850|size2592|read10|cov0.00*contig934|size2746|read13|cov0.00*contig3109|size236|read7|cov1.00*contig2826|size437|read7|cov0.00*contig2234|size432|read9|cov0.00contig2404|size598|read8|cov0.00contig2279|size144|read9|cov2.00contig2110|size793|read9|cov0.00contig2883|size103|read7|cov2.00contig2337|size246|read8|cov1.00 new10new6new7 *contig1050|size9622|read12|cov0.00contig1508|size17825|read11|cov0.00contig1380|size10663|read11|cov0.00contig2025|size1801|read9|cov0.00*contig2593|size329|read8|cov0.00contig90|size265|read31|cov4.00 scaffold7C4287C4161C4011 *contig1014|size23431|read13|cov0.00*contig1116|size7346|read12|cov0.00*contig2012|size1355|read9|cov0.00*contig2030|size5378|read9|cov0.00contig1983|size1318|read9|cov0.00 contig1053|size8734|read12|cov0.00contig1428|size8845|read11|cov0.00contig1575|size1308|read10|cov0.00contig1581|size7332|read10|cov0.00*contig2239|size2917|read9|cov0.00*contig2400|size2109|read8|cov0.00contig2054|size6238|read9|cov0.00contig2151|size1660|read9|cov0.00 *scaffold50 *contig1576|size32655|read10|cov0.00contig1772|size8796|read10|cov0.00contig1875|size1769|read10|cov0.00 contig1243|size19106|read11|cov0.00contig1266|size7142|read11|cov0.00*contig2175|size5626|read9|cov0.00 *new58 *contig1434|size10841|read11|cov0.00*contig1798|size2095|read10|cov0.00contig1337|size17039|read11|cov0.00contig1204|size11512|read12|cov0.00contig1311|size2879|read11|cov0.00contig1383|size2467|read11|cov0.00*contig1697|size658|read10|cov0.00*contig1941|size140|read10|cov2.00 *new60new63 contig1479|size11996|read11|cov0.00*contig1582|size5318|read10|cov0.00contig1261|size22043|read11|cov0.00*contig2833|size105|read7|cov2.00*contig2628|size125|read8|cov2.00contig2782|size118|read7|cov2.00contig2134|size879|read9|cov0.00 scaffold56 contig1089|size27060|read12|cov0.00contig798|size11271|read14|cov0.00contig691|size1582|read16|cov0.00 *scaffold18 *contig1148|size38689|read12|cov0.00 *contig1611|size7096|read10|cov0.00contig1031|size27290|read12|cov0.00contig1671|size2932|read10|cov0.00contig1374|size7855|read11|cov0.00contig2230|size2578|read9|cov0.00contig2959|size205|read7|cov1.00 *contig1010|size11542|read13|cov0.00*contig784|size11876|read15|cov0.00*contig1390|size6199|read11|cov0.00contig1710|size1725|read10|cov0.00*contig2406|size125|read8|cov2.00contig1974|size5630|read9|cov0.00contig1984|size3907|read9|cov0.00 QRY *contig1674|size11408|read10|cov0.00*contig1572|size1120|read11|cov0.00contig1887|size6041|read10|cov0.00 QRY scaffold65*new33 *contig1837|size2090|read10|cov0.00*contig1585|size5268|read10|cov0.00contig1249|size16321|read11|cov0.00*contig1820|size3201|read10|cov0.00contig1573|size26751|read10|cov0.00*contig1994|size5150|read9|cov0.00*contig1989|size4392|read9|cov0.00contig916|size106|read13|cov4.00 new38 *contig1283|size22678|read11|cov0.00contig913|size13262|read13|cov0.00contig2090|size3655|read9|cov0.00*contig2098|size890|read9|cov0.00 *contig1313|size21888|read11|cov0.00contig1782|size6758|read10|cov0.00contig1294|size8150|read11|cov0.00contig1338|size7838|read11|cov0.00 scaffold21 *contig1642|size14560|read10|cov0.00*contig1637|size15563|read10|cov0.00*contig1658|size14880|read10|cov0.00 *scaffold49 *contig1641|size4952|read10|cov0.00*contig1315|size7182|read11|cov0.00*contig1614|size6287|read10|cov0.00*contig1690|size1787|read10|cov0.00*contig1727|size3887|read10|cov0.00*contig1624|size1636|read10|cov0.00contig1704|size15757|read10|cov0.00contig1369|size4509|read11|cov0.00contig1880|size2383|read10|cov0.00*contig2113|size1258|read9|cov0.00*contig18|size147|read40|cov9.00 *scaffold58*C4563*C5279C4683C4493 *contig1696|size31404|read10|cov0.00*contig1825|size1767|read10|cov0.00*contig2026|size5971|read9|cov0.00 *contig1647|size7264|read10|cov0.00contig1373|size10362|read11|cov0.00*contig1568|size2481|read11|cov0.00*contig1248|size3948|read11|cov0.00contig961|size15434|read13|cov0.00contig1973|size1441|read9|cov0.00 *new65*new64 *contig1598|size18889|read10|cov0.00contig1407|size30116|read11|cov0.00contig1838|size5535|read10|cov0.00 *contig1844|size14452|read10|cov0.00*contig1345|size3089|read11|cov0.00contig1417|size12097|read11|cov0.00*contig2071|size1127|read9|cov0.00contig1203|size9451|read12|cov0.00contig948|size4367|read13|cov0.00 scaffold84 contig1212|size30597|read12|cov0.00*contig2387|size2565|read8|cov0.00contig1616|size2573|read10|cov0.00contig2213|size3708|read9|cov0.00 *contig1461|size56510|read11|cov0.00contig1865|size13185|read10|cov0.00*contig1867|size4309|read10|cov0.00contig1450|size13168|read11|cov0.00*contig2118|size2522|read9|cov0.00*contig2474|size105|read8|cov2.00 *contig1816|size4071|read10|cov0.00*contig1881|size8496|read10|cov0.00contig1159|size8673|read12|cov0.00contig1557|size8915|read11|cov0.00*contig2184|size7082|read9|cov0.00contig2399|size1566|read8|cov0.00contig2770|size167|read7|cov1.00 *scaffold69 *contig1334|size8277|read11|cov0.00*contig1167|size6633|read12|cov0.00*contig2022|size17458|read9|cov0.00contig1245|size5152|read11|cov0.00contig1126|size6734|read12|cov0.00*contig2329|size355|read8|cov0.00contig2636|size714|read8|cov0.00 *scaffold78 *contig1333|size24260|read11|cov0.00*contig1213|size4143|read12|cov0.00*contig2001|size3488|read9|cov0.00contig1938|size6214|read10|cov0.00*contig1970|size105|read9|cov3.00*contig2860|size109|read7|cov2.00contig1214|size273|read12|cov1.00*contig880|size294|read14|cov1.00*contig2417|size444|read8|cov0.00contig1699|size128|read10|cov2.00contig3065|size323|read7|cov0.00contig2349|size103|read8|cov2.00contig3132|size110|read7|cov2.00contig2739|size534|read8|cov0.00 scaffold31scaffold51*C4263*C4053C4049new47 *contig1904|size3412|read10|cov0.00*contig1570|size5056|read11|cov0.00contig1327|size25743|read11|cov0.00*contig2058|size3189|read9|cov0.00*contig2719|size491|read8|cov0.00contig1979|size536|read9|cov0.00 *scaffold88C4433 *contig1839|size2244|read10|cov0.00*contig1386|size7528|read11|cov0.00*contig1565|size1473|read11|cov0.00*contig1923|size6934|read10|cov0.00contig1234|size5720|read12|cov0.00contig1305|size2415|read11|cov0.00contig1250|size4193|read11|cov0.00contig1764|size1915|read10|cov0.00contig2036|size2535|read9|cov0.00*contig3324|size204|read6|cov1.00contig1885|size117|read10|cov3.00contig2240|size2015|read9|cov0.00contig3365|size140|read6|cov1.00contig2604|size896|read8|cov0.00contig2362|size577|read8|cov0.00contig2378|size657|read8|cov0.00 *scaffold73*scaffold53*new62C4307 *contig1422|size23497|read11|cov0.00*contig1135|size12547|read12|cov0.00contig1743|size11092|read10|cov0.00*contig2157|size1103|read9|cov0.00*contig3026|size128|read7|cov1.00*contig2148|size104|read9|cov3.00contig2561|size138|read8|cov2.00 *contig1043|size31415|read12|cov0.00*contig153|size3477|read28|cov0.00*contig149|size2994|read28|cov0.00*contig138|size197|read28|cov5.00contig1039|size315|read12|cov1.00contig1156|size482|read12|cov0.00contig1070|size143|read12|cov3.00*contig957|size405|read13|cov1.00contig15|size124|read41|cov11.00contig2410|size322|read8|cov0.00contig2588|size405|read8|cov0.00contig2901|size109|read7|cov2.00 *scaffold41scaffold14*C4427*C4147*C4027new28new30 *contig1564|size15755|read11|cov0.00*contig1190|size3639|read12|cov0.00*contig1910|size2344|read10|cov0.00contig1106|size7666|read12|cov0.00contig1646|size8099|read10|cov0.00contig2501|size956|read8|cov0.00 *scaffold26scaffold79*C5255*new41*C4245C4275C4325 *contig1395|size24313|read11|cov0.00contig1778|size4448|read10|cov0.00contig1165|size5941|read12|cov0.00contig1296|size9258|read11|cov0.00 *contig1137|size21276|read12|cov0.00*contig1863|size3661|read10|cov0.00*contig2028|size3794|read9|cov0.00*contig2438|size1468|read8|cov0.00contig1652|size9014|read10|cov0.00 contig1640|size11921|read10|cov0.00*contig1848|size1111|read10|cov0.00contig1372|size9593|read11|cov0.00contig737|size16137|read15|cov0.00*contig1436|size108|read11|cov3.00*contig2784|size103|read7|cov2.00contig2779|size256|read7|cov0.00 *scaffold82scaffold10*C4247new12new13new11new14new17C5057 *contig1684|size13668|read10|cov0.00*contig1768|size2201|read10|cov0.00contig1155|size35848|read12|cov0.00contig1634|size6980|read10|cov0.00*contig2219|size813|read9|cov0.00*contig2423|size854|read8|cov0.00*contig2497|size244|read8|cov1.00*contig2181|size130|read9|cov2.00*contig3158|size101|read7|cov2.00contig3267|size127|read6|cov1.00contig3282|size202|read6|cov1.00 *scaffold71scaffold17*scaffold5C4101C5117 *contig1625|size18903|read10|cov0.00*contig1739|size8601|read10|cov0.00contig1257|size27294|read11|cov0.00contig1799|size4328|read10|cov0.00contig2528|size1246|read8|cov0.00contig1977|size3562|read9|cov0.00 *contig1516|size13284|read11|cov0.00*contig1630|size5363|read10|cov0.00contig1432|size10536|read11|cov0.00*contig1597|size7850|read10|cov0.00*contig2267|size1509|read9|cov0.00 *scaffold6*new71 *contig933|size10476|read13|cov0.00*contig1872|size5605|read10|cov0.00*contig1961|size6973|read10|cov0.00*contig1859|size3787|read10|cov0.00*contig1103|size9575|read12|cov0.00contig1510|size3441|read11|cov0.00contig1615|size6073|read10|cov0.00contig2382|size636|read8|cov0.00contig1986|size340|read9|cov0.00contig2793|size418|read7|cov0.00contig68|size210|read32|cov5.00 contig1358|size25356|read11|cov0.00contig1287|size13279|read11|cov0.00 *scaffold48scaffold20 *contig1806|size15990|read10|cov0.00contig1029|size12881|read13|cov0.00contig1789|size7947|read10|cov0.00*contig2480|size1016|read8|cov0.00contig2052|size1321|read9|cov0.00contig3125|size254|read7|cov0.00 *contig1163|size17048|read12|cov0.00*contig1753|size9543|read10|cov0.00*contig1786|size7653|read10|cov0.00contig1728|size5265|read10|cov0.00*contig2082|size2042|read9|cov0.00*contig2045|size836|read9|cov0.00contig4937|size138|read5|cov1.00 contig1253|size13509|read11|cov0.00contig1335|size15294|read11|cov0.00contig1364|size8496|read11|cov0.00*contig1971|size2201|read9|cov0.00 *scaffold12*C4189 *contig1655|size10884|read10|cov0.00*contig1608|size4755|read10|cov0.00contig1447|size34153|read11|cov0.00*contig2074|size5287|read9|cov0.00*contig2111|size3278|read9|cov0.00contig1303|size6376|read11|cov0.00*contig2358|size863|read8|cov0.00 *contig1722|size4925|read10|cov0.00contig906|size21530|read14|cov0.00contig1935|size9655|read10|cov0.00*contig929|size106|read13|cov4.00contig2449|size2394|read8|cov0.00*contig2869|size118|read7|cov2.00contig2046|size108|read9|cov3.00 *new24 *contig1455|size17763|read11|cov0.00*contig1840|size1227|read10|cov0.00*contig1618|size1404|read10|cov0.00*contig1729|size2607|read10|cov0.00contig1703|size14803|read10|cov0.00*contig1503|size4374|read11|cov0.00*contig1912|size111|read10|cov3.00*contig3712|size103|read6|cov2.00*contig2097|size134|read9|cov2.00*contig5125|size101|read5|cov1.00contig1310|size112|read11|cov3.00contig1688|size901|read10|cov0.00*contig3612|size139|read6|cov1.00*contig1968|size188|read9|cov1.00contig1058|size115|read12|cov3.00contig2549|size209|read8|cov1.00contig2694|size230|read8|cov1.00contig2172|size869|read9|cov0.00contig3186|size112|read7|cov2.00 *new25new74new72new73new75 *contig1601|size7427|read10|cov0.00*contig1701|size2331|read10|cov0.00*contig1834|size3999|read10|cov0.00*contig1717|size7369|read10|cov0.00contig1505|size4390|read11|cov0.00contig1807|size2235|read10|cov0.00contig1627|size8766|read10|cov0.00*contig2108|size135|read9|cov2.00contig2324|size3073|read8|cov0.00contig2282|size220|read9|cov1.00contig2895|size107|read7|cov2.00 *scaffold62scaffold23*scaffold3scaffold29scaffold38scaffold72scaffold2new52new53new55new45C4137 contig1218|size31798|read12|cov0.00contig1124|size8965|read12|cov0.00contig2117|size771|read9|cov0.00 scaffold37 *contig1184|size10437|read12|cov0.00*contig1596|size3010|read10|cov0.00*contig1592|size2790|read10|cov0.00*contig2495|size2408|read8|cov0.00contig285|size11546|read23|cov0.00contig1185|size2909|read12|cov0.00*contig1985|size4527|read9|cov0.00contig2041|size8483|read9|cov0.00contig2284|size1101|read9|cov0.00contig2540|size105|read8|cov2.00contig2109|size791|read9|cov0.00 *scaffold57 *contig1164|size6455|read12|cov0.00*contig1024|size8352|read13|cov0.00contig1955|size4808|read10|cov0.00contig1241|size7150|read11|cov0.00contig1804|size1333|read10|cov0.00contig1480|size1276|read11|cov0.00*contig1824|size774|read10|cov0.00contig1883|size1094|read10|cov0.00*contig1325|size139|read11|cov2.00*contig1302|size208|read11|cov1.00contig1255|size1056|read11|cov0.00contig1744|size3065|read10|cov0.00*contig3580|size145|read6|cov1.00contig2359|size1164|read8|cov0.00*contig2594|size215|read8|cov1.00contig2040|size1274|read9|cov0.00contig1186|size131|read12|cov3.00*contig2761|size113|read7|cov2.00*contig2962|size159|read7|cov1.00contig2094|size2597|read9|cov0.00contig3175|size169|read7|cov1.00contig1976|size725|read9|cov0.00contig2143|size167|read9|cov1.00contig2467|size104|read8|cov2.00contig2753|size351|read7|cov0.00 *scaffold35scaffold61*C4839*C4261*C4253*C4207*C4335*new51C4577C4055 *contig1714|size13449|read10|cov0.00contig1755|size14684|read10|cov0.00*contig667|size10938|read16|cov0.00contig2078|size2081|read9|cov0.00*contig3230|size204|read7|cov1.00contig3666|size130|read6|cov1.00 *scaffold22C4063 *contig1074|size19375|read12|cov0.00*contig1604|size11423|read10|cov0.00contig1523|size2422|read11|cov0.00*contig1988|size3393|read9|cov0.00contig1871|size7100|read10|cov0.00 contig1925|size13394|read10|cov0.00*contig1763|size3086|read10|cov0.00contig1366|size18443|read11|cov0.00*contig1794|size732|read10|cov0.00*contig2013|size7201|read9|cov0.00*contig3177|size107|read7|cov2.00 *contig1948|size3776|read10|cov0.00*contig1712|size2373|read10|cov0.00contig1276|size13008|read11|cov0.00contig1188|size9165|read12|cov0.00*contig2133|size2819|read9|cov0.00*contig2414|size1047|read8|cov0.00*contig2033|size4541|read9|cov0.00contig2222|size2179|read9|cov0.00contig4119|size125|read5|cov1.00 *scaffold42 *contig1183|size11915|read12|cov0.00*contig1884|size4343|read10|cov0.00contig1170|size5311|read12|cov0.00*contig2332|size1802|read8|cov0.00contig283|size4532|read23|cov0.00*contig2900|size119|read7|cov2.00contig2215|size3008|read9|cov0.00contig2144|size2752|read9|cov0.00*contig2445|size150|read8|cov1.00contig2099|size2765|read9|cov0.00contig2002|size2412|read9|cov0.00contig2952|size165|read7|cov1.00 *scaffold11*scaffold75*new23*new22 *contig1603|size5851|read10|cov0.00*contig1719|size2522|read10|cov0.00*contig1836|size4718|read10|cov0.00*contig1752|size3771|read10|cov0.00*contig2390|size2576|read8|cov0.00contig1633|size1037|read10|cov0.00contig1898|size6287|read10|cov0.00*contig2044|size1092|read9|cov0.00contig951|size14009|read13|cov0.00contig2083|size894|read9|cov0.00contig2383|size315|read8|cov0.00contig2307|size957|read9|cov0.00 scaffold80new79new80 *contig1468|size1602|read11|cov0.00*contig2051|size1170|read9|cov0.00contig1346|size3873|read11|cov0.00contig1735|size4532|read10|cov0.00contig1282|size4272|read11|cov0.00*contig1972|size4062|read9|cov0.00contig1926|size3917|read10|cov0.00contig1950|size5271|read10|cov0.00*contig2431|size765|read8|cov0.00contig2020|size1224|read9|cov0.00contig2018|size2312|read9|cov0.00*contig2123|size413|read9|cov0.00*contig2686|size447|read8|cov0.00*contig3000|size239|read7|cov1.00contig1517|size258|read11|cov1.00*contig2537|size191|read8|cov1.00contig2085|size2063|read9|cov0.00contig3068|size226|read7|cov1.00contig3353|size145|read6|cov1.00contig2516|size618|read8|cov0.00 *scaffold32*scaffold68*scaffold47*scaffold59scaffold28*C4127*C4195*new66new44 *contig1890|size7976|read10|cov0.00*contig1494|size8129|read11|cov0.00contig1441|size4833|read11|cov0.00contig1829|size1837|read10|cov0.00contig1813|size2413|read10|cov0.00contig1377|size7375|read11|cov0.00*contig2156|size2138|read9|cov0.00*contig2346|size987|read8|cov0.00contig2048|size8584|read9|cov0.00*contig128|size138|read29|cov7.00contig1012|size123|read13|cov3.00contig2232|size556|read9|cov0.00 *scaffold36*scaffold9*C4925 *contig1444|size23340|read11|cov0.00contig1665|size13623|read10|cov0.00*contig1691|size6252|read10|cov0.00contig1208|size4502|read12|cov0.00*contig2465|size2358|read8|cov0.00contig2722|size458|read8|cov0.00 *contig1911|size13643|read10|cov0.00*contig1538|size2643|read11|cov0.00contig1620|size14128|read10|cov0.00*contig2343|size1925|read8|cov0.00contig2059|size7687|read9|cov0.00 *contig1522|size17126|read11|cov0.00*contig1676|size5368|read10|cov0.00*contig1621|size5401|read10|cov0.00contig1569|size12661|read11|cov0.00contig1284|size9561|read11|cov0.00 *scaffold16 *contig1754|size5846|read10|cov0.00*contig1198|size3947|read12|cov0.00*contig1340|size5243|read11|cov0.00*contig1783|size6906|read10|cov0.00contig1465|size12069|read11|cov0.00contig1672|size15958|read10|cov0.00*contig1449|size140|read11|cov2.00contig2076|size2788|read9|cov0.00*contig2227|size628|read9|cov0.00contig2773|size217|read7|cov1.00 scaffold34scaffold60new50 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 gi|49175990|ref|NC_000913.2| gi|49175990|ref|NC_000913.2| (c) Taipan (d) SOAPdenovo

422117161115983761391241464417631019281385734 contig_611contig_624contig_619contig_589contig_608contig_652contig_330contig_504contig_446contig_552contig_575contig_543contig_221contig_662contig_648contig_564contig_587contig_471contig_644contig_516contig_521contig_294contig_655contig_628contig_565contig_633contig_672contig_517contig_498contig_341contig_144contig_541contig_523contig_65 *293*121*317*289 *contig_469*contig_163contig_306*contig_50contig_547contig_656contig_578contig_503contig_579contig_369contig_572contig_622contig_334contig_582contig_488contig_627contig_550contig_512contig_335contig_545contig_442contig_402 *304*136182 *contig_109*contig_146*contig_658*contig_610contig_278 *124*159*18 *contig_390*contig_349contig_603*contig_38contig_409contig_379contig_250 *292*346*53338*26361421 *contig_419*contig_653*contig_584*contig_467*contig_548*contig_506*contig_507*contig_665*contig_477contig_430contig_495contig_470contig_501contig_663contig_53contig_63 *342*322169301311*6 *contig_394*contig_154*contig_161*contig_274contig_303contig_318contig_342contig_180 *24781 *contig_115contig_203*contig_14 *62154 *contig_496contig_197contig_11contig_81 *326345219 *contig_499*contig_295contig_118contig_247contig_62 278*58 *contig_355*contig_245*contig_297contig_173contig_18 *281*331*163160259*66 *contig_445*contig_151contig_166contig_407 *190*18816740341764 *contig_175*contig_287*contig_242contig_233contig_56 228*9225 *contig_265*contig_449contig_476contig_480*contig_24contig_40contig_58 *334283245400191212 *contig_228contig_304contig_492contig_291contig_37 *108*145*256347101 *contig_385*contig_366contig_143contig_384contig_326contig_316contig_481contig_227 *325*74 *contig_320*contig_594contig_272contig_509*contig_54contig_177 168 *contig_378*contig_158contig_634contig_49 *180*172 *contig_189 *109 *contig_388*contig_104contig_188 *24915838779 contig_208*contig_21contig_88 *221*119*67199 *contig_215*contig_135*contig_466*contig_638*contig_271contig_395contig_532contig_574contig_69 *35737022915288 *contig_439contig_569contig_45contig_12 *178275*8940877 *contig_432*contig_283*contig_336*contig_678*contig_284contig_505contig_438*contig_98contig_280contig_159 *222*9127195 *contig_666*contig_234*contig_325*contig_676contig_251contig_474contig_41 131*57 *contig_105*contig_583*contig_478contig_308contig_46 27097 *contig_263*contig_57 *223*69 *contig_479*contig_515*contig_237*contig_156*contig_119*contig_77contig_647 *211*86*2942 *contig_199*contig_196contig_621*contig_34 *52258 *contig_224*contig_183 *337*40114 *contig_120*contig_423contig_239 *23122565 *contig_323*contig_403contig_132*contig_7 *71414220 *contig_410*contig_560*contig_223contig_114*contig_51 *369*284*321421234394253237197 *contig_612*contig_300contig_312contig_460contig_152contig_433contig_174contig_319 *277*173134 *contig_230*contig_563*contig_641*contig_380*contig_307*contig_93contig_78contig_22 *230*255*150239104 *contig_219*contig_256*contig_127contig_635 *315384262 *contig_317*contig_169*contig_491*contig_70contig_417contig_204contig_150*contig_47 *366*29913750 *contig_401*contig_273contig_436contig_364contig_359contig_10 *70 contig_187contig_646*contig_30 *240*113156 *contig_345*contig_75contig_44contig_16 *36713016510080 *contig_141contig_600contig_107contig_92 *372*332126244 contig_200*contig_20contig_122contig_87 *224298 *contig_184*contig_391*contig_160*contig_396*contig_170*contig_290contig_570contig_605contig_567contig_192 *362*208*241*103*43207420399 *contig_508*contig_588*contig_643*contig_615*contig_182*contig_546contig_522contig_554contig_585contig_350contig_424contig_609contig_558contig_338contig_538contig_168*contig_27 *128288*83*9 *contig_513*contig_680*contig_556*contig_145contig_455contig_659contig_604contig_411contig_639contig_371contig_577contig_29*contig_2 341218373263364144 *contig_299*contig_193*contig_167contig_258contig_3 *307*56377 *contig_458*contig_124contig_301contig_426contig_213contig_448contig_149 *59312 contig_157 *306*406*267*356129305 *contig_381*contig_266contig_255contig_133*contig_55 *3491752903891862422 *contig_101*contig_311*contig_453contig_654contig_425contig_586contig_437contig_465contig_324contig_142contig_89contig_5 *381*106 *contig_456*contig_620*contig_217*contig_310*contig_261contig_329contig_500 *54250 *contig_76contig_235 318*49 *215*393*286242 *contig_373*contig_530*contig_562*contig_671*contig_140*contig_209*contig_42*contig_17 *327*415323*172 *contig_383contig_194contig_413*contig_48contig_246 QRY *273*110386205 QRY *contig_328contig_353contig_185contig_31 *390*27940278 contig_322contig_363contig_434contig_392contig_236contig_249contig_636contig_315contig_257 *303265251 *contig_252*contig_178contig_386contig_32 *261147 *contig_268*contig_606contig_452contig_214contig_71 *375*185176313*2 *contig_128 *183*76 *contig_222*contig_374*contig_435*contig_225contig_358contig_412*contig_66 *16127 *contig_361contig_60 contig_343contig_96contig_25 339170 *contig_202contig_35 157 contig_28 118 contig_669*contig_79*contig_4 *31615120082 contig_357contig_372contig_309*contig_8 *410*418*149*297*233*48*35*873822964111115134 *contig_211*contig_264contig_387 264 *contig_616*contig_650*contig_494*contig_576*contig_418contig_487contig_116contig_623contig_540contig_630contig_590contig_340contig_631contig_542contig_617contig_398contig_660contig_362contig_573contig_210contig_94 236226164 *17420628073 *contig_370*contig_339*contig_677contig_241contig_451*contig_83contig_243contig_1 *202395148166 *contig_533*contig_454contig_393contig_59 *246*336379*4732 *contig_566*contig_493*contig_649contig_486contig_121 *116*140187 *contig_117*contig_443*contig_518*contig_664contig_216contig_276contig_457contig_645contig_637contig_238contig_415contig_602contig_80 *227*309 *contig_136contig_153*contig_39contig_314 *404*23221313530 *contig_171*contig_191*contig_113 41340955 *contig_651*contig_269*contig_613*contig_275contig_198contig_489 *363*28784 *contig_626*contig_536*contig_497*contig_607*contig_618*contig_559*contig_288*contig_112*contig_52contig_580contig_520*contig_85 *300*115238 *33521020 *contig_444*contig_139contig_134contig_148*contig_72 *189*378*254423 *contig_510*contig_597contig_399*contig_15contig_356contig_99contig_26contig_9 *388*112 *contig_130contig_348contig_165*contig_33 *19625775 *contig_285*contig_279contig_331contig_450contig_43 *146*340269141350 *contig_179contig_642contig_262contig_382contig_68 *285*12220335499 *contig_368*contig_231*contig_84contig_375contig_102 *133217243308 *contig_568*contig_571*contig_464*contig_670*contig_525contig_212contig_675contig_111contig_447contig_286contig_90 *302*291*374*407*328*1203833483521253331 *contig_581*contig_657*contig_270*contig_389*contig_544*contig_667*contig_596*contig_674*contig_110contig_601contig_524contig_549contig_598contig_483contig_679contig_468contig_555contig_473contig_253*contig_74contig_95 *330*184*324*351123314320 *contig_106*contig_346*contig_377*contig_129*contig_333contig_461contig_528contig_302contig_195 *177 contig_73 *235*385310194 *contig_421contig_475contig_414*contig_97contig_176contig_13 *138*192*20419814236193 *contig_360*contig_592*contig_462*contig_537*contig_629*contig_428*contig_420*contig_365contig_298contig_181contig_661contig_527contig_406contig_429contig_61 *427*426*401*107405*96425*94 *contig_367*contig_472*contig_440*contig_539*contig_502*contig_484*contig_126contig_599*contig_19contig_123contig_313 *193179272 *contig_463*contig_351contig_293contig_281contig_321*contig_36 *276266344162 *contig_229*contig_305*contig_64contig_162contig_67 *139*201155143 *contig_138*contig_186contig_267contig_344 *209*396*268*3763974192483583553885 *contig_282*contig_220*contig_405*contig_206*contig_534*contig_668*contig_535*contig_296contig_259contig_289contig_595contig_614contig_254contig_82 391195153 *contig_332contig_327*contig_86contig_6 *333*102*392*368*365*171*21429431942432938039841218123 *contig_137*contig_482*contig_529*contig_625*contig_553*contig_593*contig_557*contig_531*contig_490*contig_201*contig_416*contig_526*contig_459*contig_408*contig_404*contig_125*contig_354*contig_172contig_400contig_218contig_640contig_519contig_561contig_431contig_292contig_260contig_591 *371*359260274*45*60343 *contig_352*contig_673*contig_551*contig_485*contig_347contig_277contig_190contig_244contig_441contig_232contig_205contig_427 216 *contig_337*contig_248*contig_100 252416295127 *contig_514*contig_226*contig_632*contig_376*contig_131contig_511contig_397contig_147contig_23 132105 contig_240*contig_91 *282*360353*6890 *contig_207*contig_108*contig_164*contig_155contig_422contig_103 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 4000000 4500000 gi|49175990|ref|NC_000913.2| gi|49175990|ref|NC_000913.2| (e) SUTTA (f) Edena

Figure 25: Dot plots for E. coli (with mate-pairs). Assemblies produced by Velvet, ABySS, Taipan, SOAPdenovo, SUTTA and Edena. The horizontal lines indicate the boundary between assembled contigs represented on the y axis.

191 Table 7: Short Read Assemblers parameter setting. Assembler S. aureus H. acininychis E. coli ABySS k=23 k=27 k=31 n=5 Edena m=21 m=27 m=30 192 EULER-SR k=21 k=27 k=28 CloneLength=215 CloneVar=40 SOAPdenovo k =21 k=27 k=25 -R SSAKE default default m=17 o=4 r=0.7 t=1 SUTTA k=21 Wmp=150 Wde=10 Wbb=140 k=27 Wmp=150 Wde=30 Wbb=140 k=29 Wmp=150 Wde=20 Wbb=140 Taipan k=19 T=8 k=27 T=18 k=29 T=10 Velvet k =21 cov cutoff=7 k=27 cov cutoff=8 k=29 ins length=215 cov cutoff=12 -exp cov=24 NOTE: Long-read Assemblers have been ran with their default parameters. Bibliography

[1] Can Alkan, Saba Sajjadian, and Evan E. Eichler. Limitations of next-

generation genome sequence assembly. Nature Methods, 8(1):61–65, Jan-

uary 2011.

[2] S F Altschul, T L Madden, A A Sch¨affer, J Zhang, Z Zhang, et al. Gapped

BLAST and PSI-BLAST: a new generation of protein database search pro-

grams. Nucleic acids research, 25(17):3389–402, September 1997.

[3] T. S. Anantharaman, V. Mysore, and B. Mishra. Fast and cheap genome

wide haplotype construction via optical mapping. In Pacific Symposium on

Biocomputing, p. 385 396, 2005.

[4] Thomas S. Anantharaman, Bud Mishra, and David C. Schwartz. Genomics

via optical mapping iii: Contiging genomic dna and variations (extended

abstract). In Proceedings 7th Intl. Cnf. on Intelligent Systems for Molecular

Biology: ISMB ’99, volume 7, pp. 18–27. AAAI Press, 1997.

[5] TS Anantharaman, B Mishra, and DC Schwartz. Genomics via optical

mapping. ii: Ordered restriction maps. J Comput Biol., 4(2):91–118, 1997.

193 [6] Marco Antoniotti, Thomas Anantharaman, Salvatore Paxia, and Bud

Mishra. Genomics via optical mapping iv: Sequence validation via op-

tical map matching. Technical report, New York University, New York,

NY, USA, 2001.

[7] David Applegate, Robert E. Bixby, Vasek Chv´atal, and William Cook. Tsp

cuts which do not conform to the template paradigm. In Computational

Combinatorial Optimization, Optimal or Provably Near-Optimal Solutions

[based on a Spring School], pp. 261–304, London, UK, 2001. Springer-

Verlag.

[8] Chris Armen and Clifford Stein. A 2 2/3-approximation algorithm for the

shortest superstring problem. In CPM, pp. 87–101, 1996.

[9] Christopher Aston, Bud Mishra, and David C. Schwartz. Optical mapping

and its potential for large-scale sequencing projects. Trends in Biotechnol-

ogy, 17(7):297 – 302, 1999.

[10] Tadashi Baba, Fumihiko Takeuchi, Makoto Kuroda, Harumi Yuzawa, Ken

ichi Aoki, et al. Genome and virulence determinants of high virulence

community-acquired mrsa. The Lancet, 359(9320):1819 – 1827, 2002.

[11] Serafim Batzoglou, David B. Jaffe, Ken Stanley, Jonathan Butler, Sante

Gnerre, et al. ARACHNE: A Whole-Genome Shotgun Assembler. Genome

Research, 12(1):177–189, 2002.

194 [12] R. Bisiani. Beam search. In S. Shapiro, editor, Encyclopedia of Artificial

Intelligence, pp. 56–58. Wiley & Sons, 1987.

[13] Frederick R. Blattner, Guy Plunkett, Craig A. Bloch, Nicole T. Perna,

Valerie Burland, et al. The Complete Genome Sequence of Escherichia coli

K-12. Science, 277(5331):1453–1462, 1997.

[14] Marten Boetzer, Christiaan V. Henkel, Hans J. Jansen, Derek Butler, and

Walter Pirovano, et al. Scaffolding pre-assembled contigs using SSPACE.

Bioinformatics, 2010.

[15] Sbastien Boisvert, Franois Laviolette, and Jacques Corbeil. Ray: Simulta-

neous assembly of reads from a mix of high-throughput sequencing tech-

nologies. Journal of Computational Biology, 17(11):1519–1533, 2010.

[16] Douglas Bryant, Weng-Keen Wong, and Todd Mockler. Qsra - a quality-

value guided de novo short read assembler. BMC Bioinformatics, 10(1):69,

2009.

[17] M. Burrows and D.J. Wheeler. A block-sorting lossless data compression

algorithm, 1994.

[18] Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter,

Matthew K. Belmonte, et al. ALLPATHS: De novo assembly of whole-

genome shotgun microreads. Genome Research, 18(5):810–820, 2008.

[19] Mark J. Chaisson and Pavel A. Pevzner. Short read fragment assembly of

bacterial genomes. Genome Research, 18(2):324–330, 2008.

195 [20] Alonzo Church and J. B. Rosser. Some properties of conversion. Transac-

tions of the American Mathematical Society, 39(3):472–482, 1936.

[21] Martin Davis, George Logemann, and Donald Loveland. A machine pro-

gram for theorem-proving. Commun. ACM, 5(7):394–397, 1962.

[22] Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmel-

bauer. SHARCGS, a fast and highly accurate short-read assembly algorithm

for de novo genomic sequencing. Genome Research, 17(11):1697–1706, 2007.

[23] Mark Eppinger, Claudia Baar, Bodo Linz, Gnter Raddatz, Christa Lanz,

et al. Who ate whom? adaptive helicobacter genomic changes that ac-

companied a host jump from early humans to large felines. PLoS Genet,

2(7):e120, 07 2006.

[24] Y Erlich, PP Mitra, M delaBastide, WR McCombie, and GJ Hannon, et al.

Alta-cyclic: a self-optimizing base caller for next-generation sequencing.

Nat Methods, 5(5):679–82, 2008 Aug.

[25] Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. Base-

Calling of Automated Sequencer Traces UsingPhred.I. AccuracyAssess-

ment. Genome Research, 8(3):175–185, 1998.

[26] P. Ferragina and G. Manzini. Opportunistic data structures with applica-

tions. Annual Symposium on Foundations of Computer Science, 41:390–

398, 2000.

196 [27] John Gallant, David Maier, and James Astorer. On finding minimal length

superstrings. Journal of Computer and System Sciences, 20(1):50 – 58,

1980.

[28] Steven R. Gill, Derrick E. Fouts, Gordon L. Archer, Emmanuel F.

Mongodin, Robert T. DeBoy, et al. Insights on Evolution of Vir-

ulence and Resistance from the Complete Genome Analysis of an

Early Methicillin-Resistant Staphylococcus aureus Strain and a Biofilm-

Producing Methicillin-Resistant Staphylococcus epidermidis Strain. J. Bac-

teriol., 187(7):2426–2438, 2005.

[29] Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J. Ribeiro,

Joshua N. Burton, et al. High-quality draft assemblies of mammalian

genomes from massively parallel sequence data. Proceedings of the National

Academy of Sciences, 2010.

[30] Phil Green. Phrap documentation, 1996.

http://www.phrap.org/phredphrap/phrap.html.

[31] Stephen S. Hall. Revolution postponed. Scientific American, pp. 60–67,

2010. doi:10.1038/scientificamerican1010-60.

[32] David Hernandez, Patrice Franois, Laurent Farinelli, Magne sters, and

Jacques Schrenzel, et al. De novo bacterial genome sequencing: Millions

of very short reads assembled on a desktop computer. Genome Research,

18(5):802–809, 2008.

197 [33] Mohammad Hossain, Navid Azimi, and Steven Skiena. Crystallizing short-

read assemblies around seeds. BMC Bioinformatics, 10(Suppl 1):S16, 2009.

[34] Xiaoqiu Huang and Anup Madan. CAP3: A DNA Sequence Assembly

Program. Genome Research, 9(9):868–877, 1999.

[35] Xiaoqiu Huang, Jianmin Wang, Srinivas Aluru, Shiaw-Pyng Yang, and

LaDeana Hillier, et al. PCAP: A Whole-Genome Assembly Program.

Genome Research, 13(9):2164–2170, 2003.

[36] Ruo-Wei Hung and Maw-Shang Chang. Solving the path cover problem

on circular-arc graphs by using an approximation algorithmstar. Discrete

Applied , 154(1):76–105, 2006.

[37] Ramana M. Idury and Michael S. Waterman. A new algorithm for dna

sequence assembly. Journal of Computational Biology, 2(2):291–306, 1995.

[38] International. Initial sequencing and analysis of the human genome. Nature,

409(6822):860–921, February 2001.

[39] International Human Genome Sequencing Consortium. Finishing the eu-

chromatic sequence of the human genome. Nature, 431(7011):931–945, Oc-

tober 2004.

[40] Sorin Istrail, Granger G. Sutton, Liliana Florea, Aaron L. Halpern,

Clark M. Mobarry, et al. Whole-genome shotgun assembly and compar-

ison of human genome assemblies. Proceedings of the National Academy of

Sciences of the United States of America, 101(7):1916–1921, 2004.

198 [41] William R. Jeck, Josephine A. Reinhardt, David A. Baltrus, Matthew T.

Hickenbotham, Vincent Magrini, et al. Extending assembly of short DNA

sequences to handle error. Bioinformatics, 23(21):2942–2944, 2007.

[42] Wei-Chun Kao, Kristian Stevens, and Yun S. Song. BayesCall: A model-

based base-calling algorithm for high-throughput short-read sequencing.

Genome Research, 19(10):1884–1895, 2009.

[43] Richard M. Karp. The role of algorithmic research in computational ge-

nomics. Computational Systems Bioinformatics Conference, International

IEEE Computer Society, 0:10, 2003.

[44] J. Kececioglu and E. Myers. Combinatorial algorithms for dna sequence

assembly. Algorithmica, 13(1):7–51, February 1995.

[45] David Kelley, Michael Schatz, and . Quake: quality-

aware detection and correction of sequencing errors. Genome Biology,

11(11):R116, 2010.

[46] W. J. Kent. BLAT—The BLAST-Like Alignment Tool. Genome Research,

12(4):656–664, March 2002.

[47] W. James Kent and . Assembly of the Working Draft of the

Human Genome with GigAssembler. Genome Research, 11(9):1541–1548,

2001.

[48] Jeffrey M. Kidd, Nick Sampas, Francesca Antonacci, Tina Graves, Robert

Fulton, et al. Characterization of missing human genome sequences and

199 copy-number polymorphic insertions. Nature Methods, 7(5):365–371, April

2010.

[49] Sun Kim, Haixu Tang, and Elaine R. Mardis. Genome Sequencing Tech-

nology and Algorithms. Artech House, Inc., Norwood, MA, USA, 2007.

[50] Carl Kingsford, Michael Schatz, and Mihai Pop. Assembly complexity of

prokaryotic genomes using short reads. BMC Bioinformatics, 11(1):21,

2010.

[51] Martin Kircher, Udo Stenzel, and . Improved base calling for

the illumina genome analyzer using machine learning strategies. Genome

Biology, 10(8):R83, 2009.

[52] A. N. Kolmogorov. Sulla determinazione empirica di una legge di dis-

tribuzione. Giorn. 1st. Ital. Attuari, 4:83–91, 1933.

[53] Stefan Kurtz, Adam Phillippy, Arthur Delcher, Michael Smoot, Martin

Shumway, et al. Versatile and open software for comparing large genomes.

Genome Biology, 5(2):R12, 2004.

[54] A. H. Land and A. G Doig. An automatic method of solving discrete

programming problems. Econometrica, 28(3):497–520, 1960.

[55] Eric S. Lander and Michael S. Waterman. Genomic mapping by finger-

printing random clones: A mathematical analysis. Genomics, 2(3):231 –

239, 1988.

200 [56] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultra-

fast and memory-efficient alignment of short DNA sequences to the human

genome. Genome Biology, 10, 2009.

[57] Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern,

et al. The diploid genome sequence of an individual human. PLoS Biol,

5(10):e254, 09 2007.

[58] H. Li and R. Durbin. Fast and accurate short read alignment with burrows-

wheeler transform. Bioinformatics, 25(14):1754, 2009.

[59] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-wah Lam, Siu-ming Yiu, et al.

SOAP2: an improved ultrafast tool for short read alignment. Bioinformat-

ics, 25(15):1966–1967, 2009.

[60] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, et al.

De novo assembly of human genomes with massively parallel short read

sequencing. Genome Research, 20(2):265–272, 2010.

[61] Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno.

Computability of models for sequence assembly. In Algorithms in Bioin-

formatics, Lecture Notes in Computer Science, chapter 27, pp. 289–301.

Springer, 2007.

[62] Fabian Menges, Giuseppe Narzisi, and Bud Mishra. TotalReCaller: Im-

proved Accuracy and Performance via Integrated Alignment & Base-

Calling. Bioinformatics (under review), 2011.

201 [63] Michael L. Metzker. Emerging technologies in DNA sequencing. Genome

Research, 15(12):1767–1776, 2005.

[64] Jason R. Miller, Arthur L. Delcher, Sergey Koren, Eli Venter, Brian P.

Walenz, et al. Aggressive assembly of pyrosequencing reads with mates.

Bioinformatics, 24(24):2818–2824, 2008.

[65] Bud Mishra. Optical mapping. Encyclopedia of Life Sciences, 2005.

[66] James C. Mullikin and Zemin Ning. The Phusion Assembler. Genome

Research, 13(1):81–90, 2003.

[67] Eugene W. Myers. Toward simplifying and accurately formulating fragment

assembly. Journal of Computational Biology, 2:275–290, 1995.

[68] Eugene W. Myers. The fragment assembly string graph. Bioinformatics,

21(suppl 2):ii79–85, 2005.

[69] Eugene W. Myers, Granger G. Sutton, Art L. Delcher, Ian M. Dew,

Dan P. Fasulo, et al. A Whole-Genome Assembly of Drosophila. Science,

287(5461):2196–2204, 2000.

[70] Niranjan Nagarajan and Mihai Pop. Parametric complexity of sequence

assembly: theory and applications to next generation sequencing. Journal

of computational biology, 16(7):897–908, July 2009.

[71] Giuseppe Narzisi. Sutta: Scoring-and-unfolding trimmed tree assembler,

2010. 9th IEEE International Workshop on Genomic Signal Processing

and Statistics, Cold Spring Harbor Laboratory (poster).

202 [72] Giuseppe Narzisi and Bud Mishra. A novel technologically agnostic de novo

sequence assembler, 2010. Systems Biology and New Sequencing Technolo-

gies, Centre for Genomic Regulation (poster).

[73] Giuseppe Narzisi and Bud Mishra. Comparing de novo genome assembly:

The long and short of it. PLoS ONE, 6(4):e19175, 04 2011.

[74] Giuseppe Narzisi and Bud Mishra. Scoring-and-unfolding trimmed tree as-

sembler: concepts, constructs and comparisons. Bioinformatics, 27(2):153–

160, 2011.

[75] Pramila Nuwantha Ariyaratne and Wing-Kin Sung. PE-Assembler: De

novo assembler using short paired-end reads. Bioinformatics, 2010.

[76] Ian T. Paulsen, Rekha Seshadri, Karen E. Nelson, Jonathan A. Eisen,

John F. Heidelberg, et al. The Brucellasuis genome reveals fundamental

similarities between animal and plant pathogens and symbionts. Proceed-

ings of the National Academy of Sciences of the United States of America,

99(20):13148–13153, 2002.

[77] Hannu Peltola, Hans Sderlund, and Esko Ukkonen. SEQAID: a DNA se-

quence assembling program based on a mathematical model. Nucleic Acids

Research, 12(1Part1):307–321, 1984.

[78] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian

path approach to DNA fragment assembly. Proceedings of the National

203 Academy of Sciences of the United States of America, 98(17):9748–9753,

2001.

[79] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian

path approach to DNA fragment assembly. Proceedings of the National

Academy of Sciences of the United States of America, 98(17):9748–9753,

2001.

[80] Adam Phillippy, Michael Schatz, and Mihai Pop. Genome assembly foren-

sics: finding the elusive mis-assembly. Genome Biology, 9(3):R55, 2008.

[81] Mihai Pop, Daniel S. Kosack, and Steven L. Salzberg. Hierarchical Scaf-

folding With Bambus. Genome Research, 14(1):149–159, 2004.

[82] Mihai Pop and Steven L. Salzberg. Bioinformatics challenges of new se-

quencing technology. Trends in Genetics, 24(3):142 – 149, 2008.

[83] Horst W. J. Rittel and Melvin M. Webber. Dilemmas in a general theory

of planning. Policy Sciences, 4(2):155–169, June 1973.

[84] Michael Roberts, Brian R. Hunt, James A. Yorke, Randall A. Bolanos, and

Arthur L. Delcher, et al. A preprocessor for shotgun assembly of large

genomes. Journal of Computational Biology, 11(4):734–752, 2004.

[85] Jacques Rougemont, Arnaud Amzallag, Christian Iseli, Laurent Farinelli,

Ioannis Xenarios, et al. Probabilistic base calling of solexa sequencing data.

BMC Bioinformatics, 9(1):431, 2008.

204 [86] Steven L. Salzberg and James A. Yorke. Beware of mis-assembled genomes.

Bioinformatics, 21(24):4320–4321, 2005.

[87] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-

terminating inhibitors. Proceedings of the National Academy of Sciences of

the United States of America, 74(12):5463–5467, 1977.

[88] Bertil Schmidt, Ranjan Sinha, Bryan Beresford-Smith, and Simon J.

Puglisi. A fast hybrid short read fragment assembly algorithm. Bioin-

formatics, 25(17):2279–2280, 2009.

[89] Jan Schrder, Heiko Schrder, Simon J. Puglisi, Ranjan Sinha, and Bertil

Schmidt, et al. Shrec: a short-read error correction method. Bioinformatics,

25(17):2157–2163, 2009.

[90] David Schwartz and Michael Waterman. New generations: Sequencing

machines and their computational challenges. Journal of Computer Science

and Technology, 25:3–9, 2010. 10.1007/s11390-010-9300-x.

[91] Colin A. M. Semple. Assembling a view of the human genome. In Michael R.

Barnes and Ian C. Gray, editors, Bioinformatics for Geneticists, chapter 4,

pp. 93–117. John Wiley & Sons, Ltd, 2003.

[92] Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein,

Steven J.M. Jones, et al. ABySS: A parallel assembler for short read se-

quence data. Genome Research, 19(6):1117–1123, 2009.

205 [93] N. V. Smirnov. Approximate laws of distribution of random variables from

empirical data. Uspekhi Mat. Nauk, 10:179–206, 1944.

[94] T. F. Smith and M. S. Waterman. Identification of common molecular

subsequences. Journal of , 147(1):195–197, March 1981.

[95] Daniel Sommer, Arthur Delcher, Steven Salzberg, and Mihai Pop. Minimus:

a fast, lightweight genome assembler. BMC Bioinformatics, 8(1):64, 2007.

[96] G. G. Sutton, O. White, M. D. Adams, and A. R. Kerlavage. TIGR Assem-

bler: A new tool for assembling large shotgun sequencing projects. Genome

Science and Technology, 1(1):9–19, 1995.

[97] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for construct-

ing shortest common superstrings. Theor. Comput. Sci., 57(1):131–145,

1988.

[98] J. S. Turner. Approximation algorithms for the shortest common super-

string problem. Inf. Comput., 83(1):1–20, 1989.

[99] G a Tuskan, S Difazio, S Jansson, J Bohlmann, I Grigoriev, et al. The

genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science

(New York, N.Y.), 313(5793):1596–604, September 2006.

[100] J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li,

Richard J. Mural, et al. The Sequence of the Human Genome. Science,

291(5507):1304–1351, 2001.

206 [101] Rene L. Warren, Granger G. Sutton, Steven J. M. Jones, and Robert A.

Holt. Assembling millions of short DNA sequences using SSAKE. Bioin-

formatics, 23(4):500–501, 2007.

[102] David A. Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen,

Lei Chen, et al. The complete genome of an individual by massively parallel

DNA sequencing. Nature, 452(7189):872–876, April 2008.

[103] Martin Wu, Ling V Sun, Jessica Vamathevan, Markus Riegler, Robert De-

boy, et al. Phylogenomics of the reproductive parasite wolbachia pipientis

wmel: A streamlined genome overrun by mobile genetic elements. PLoS

Biol, 2(3):e69, 03 2004.

[104] Daniel R. Zerbino and . Velvet: Algorithms for de novo short

read assembly using de Bruijn graphs. Genome Research, 18(5):821–829,

2008.

207