THE UNIVERSITY OF CHICAGO

A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF

DEPARTMENT OF COMPUTER SCIENCE

BY CHRISTOPHER BUN

CHICAGO, ILLINOIS
NOVEMBER 8, 2016

Copyright © 2016 by Christopher Bun
All Rights Reserved

To My Family And Friends

Table of Contents

LIST OF FIGURES

LIST OF TABLES

ABSTRACT

1 INTRODUCTION
  1.1 Motivation
  1.2 Research Contributions

2 BACKGROUND
  2.1 Genome Sequencing: Techniques and Data Profiles
    2.1.1 Sequencing By Synthesis
    2.1.2 Oligo-ligation Detection
    2.1.3 Single Molecule and Nanopore Sequencing
    2.1.4 Sequencing Data
  2.2 Genome Assembly
    2.2.1 Challenges
    2.2.2 Algorithms for De Novo Genome Assembly
  2.3 Computing Systems
    2.3.1 Scientific Compute Services

3 ASSEMBLYRAST FRAMEWORK
  3.1 The WASP Language for Computational Workflows
    3.1.1 Specification
  3.2 Implementation
    3.2.1 Interface Generalization
    3.2.2 Rapid Pipeline Design
    3.2.3 Universal Hyperparameter Search Driver
    3.2.4 Logic-Driven Assembly
    3.2.5 Data Types
    3.2.6 Analysis Framework
  3.3 System Design and Infrastructure
    3.3.1 AssemblyRAST Control Plane
    3.3.2 Workers
    3.3.3 Availability

4 ASSEMBLER PROFILING AND OPTIMIZATION
    4.0.1 Data
    4.0.2 Evaluation Metrics
    4.0.3 Programs
    4.0.4 Comparison
    4.0.5 Methods
    4.0.6 Discussion
  4.1 Pipelines
    4.1.1 Preprocessing
    4.1.2 Postprocessing
    4.1.3 Results

5 INTEGRATIVE ASSEMBLY ALGORITHMS
    5.0.1 Integrative Pipelines
  5.1 Hyperparameter Optimization
  5.2 Block Construction and Merging
  5.3 A Self-Tuning Ensemble De Novo Assembly Pipeline
  5.4 Discussion

6 REFERENCE-INDEPENDENT ASSEMBLY ERROR CLASSIFICATION LEARNING
  6.1 Background and Motivation
    6.1.1 Hard and Soft Genomic Variation Types
    6.1.2 Statistical Approach to De Novo Assembly Evaluation
    6.1.3 Supervised Classification
  6.2 A Novel Implementation of Error Classification Using Gradient Boosting Trees
    6.2.1 Dataset Generation
    6.2.2 Assembly Setup
    6.2.3 Preprocessing
    6.2.4 Data Labeling
    6.2.5 Discussion and Feasibility
  6.3 Results
    6.3.1 Training Model
    6.3.2 Feature Engineering and Extraction
    6.3.3 Model Performance and Hyperparameter Tuning
    6.3.4 Detection of Major Errors Produced by Assemblers
  6.4 Discussion

7 APPLICATIONS OF AN ACCURATE DE NOVO ASSEMBLY EVALUATION PROFILE
  7.1 Error Removal and Contiguity Metric Correction
    7.1.1 Contig Splitting
    7.1.2 Corrected Statistical Measures
  7.2 Likelihood and Scoring Framework For Systematic Evaluation
  7.3 Error Prediction Strategy for Assembly Reconciliation Algorithms

8 CONCLUSION
  8.1 Future

References

List of Figures

2.1 NGAx plot of V. cholerae assembly by Velvet parameter sweep
2.2 3-mer De Bruijn Graph
2.3 4-mer De Bruijn Graph

3.1 Flowchart of a read-specific assembly workflow
3.2 Pipeline Branching
3.3 Parameter Sweeps With Pipeline Combinations
3.4 The AssemblyRAST Infrastructure
3.5 Relative ALE Scores of V. cholerae assembly
3.6 AssemblyRAST Web Interface
3.7 The AssemblyRAST Web Interface facilitates user-friendly pipeline design

4.1 NGAx plot of V. cholerae assembly

5.1 NGAx plot of V. cholerae assembly by Velvet parameter sweep
5.2 Velvet Hash Length vs. NGA50 Score on Rsp HiSeq and MiSeq
5.3 NGA50 scores of pairwise mergings of V. cholerae assemblies
5.4 Smart Pipeline

6.1 An example decision tree
6.2 Generation of training data by the AssemblyML workflow
6.3 An example of a non-smooth coverage pattern over a misassembly
6.4 Discrepancies in coverage between flanking regions
6.5 Misassemblies vs. Average Contig Coverage
6.6 Contig end regions predicted as misassemblies in the SPAdes assembly of Singulisphaera acidiphila, but not classified as a major misassembly by QUAST
6.7 Feature Importances
6.8 Prediction Outcomes for 300 Simulated Genome Readsets
6.9 AssemblyML Prediction Outcomes for the Velvet Assembly of Singulisphaera acidiphila
6.10 REAPR FCD Prediction Outcomes for the Velvet Assembly of Singulisphaera acidiphila
6.11 Correct Misassembly Classification in the Velvet Assembly of Singulisphaera acidiphila
6.12 QUAST defines misassemblies whose inconsistencies are shorter than 1000 basepairs as local misassemblies. The trained model predicts this misassembly (as shown by the black bar)
6.13 Velvet Scaffolding Technique Flagged as Misassembly
6.14 AssemblyML Prediction Outcomes for the SPAdes Assembly of Singulisphaera acidiphila
6.15 REAPR FCD Prediction Outcomes for the SPAdes Assembly of Singulisphaera acidiphila
6.16 Mapping Quality Anomalies in the SPAdes Assembly of Singulisphaera acidiphila

7.1 Regions with misassemblies from SPAdes and Velvet assemblies (A, C) compared with their contigs broken at predicted loci (B, D). Red represents contigs that contain true misassemblies.

7.2 The Error Response Curve captures the trade-off between contiguity and correctness.
7.3 An improved assembly pipeline using AssemblyML-guided contig breaking and ERC metrics as preceding steps to block merging.

List of Tables

2.1 Profiles of Major Next Generation Sequencing Platforms
2.2 FastA Base Pair Codes

3.1 Common Wasp/Lisp Supported Expressions
3.2 Wasp contains three types of specialized extensions: type conversion, data analysis, and framework-level functions.

4.1 Comparison of NGA50 assembly scores for various genomes. Best scores are bolded.
4.2 Comparison of misassemblies for various genomes.
4.3 Various statistics for the assembly of R. sphaeroides HiSeq data. IDBA-UD generated the most contiguous set while also producing the fewest misassemblies.
4.4 Effects of preprocessing on V. cholerae NGA50. The best scores per assembler are shown in bold. Fields with '-' indicate an error generated by the assembler.
4.5 Effects of preprocessing on V. cholerae misassemblies. The fewest misassemblies per assembler are shown in bold.

6.1 Features generated from assembly and sequence data
6.2 Genomes Used to Generate Training Set
6.3 Top 10 three-way feature interaction scores for the AssemblyML model. The weighted F-score represents the frequency with which the three features appear within the same tree, weighted by the probabilities that the nodes will be visited.
6.4 Prediction Statistics Across a Subsample of Microbe Assemblies
6.5 Top 20 Features of the XGBoost-Trained Model

7.1 Statistics of SPAdes and Velvet assemblies and contigs broken at predicted loci

ABSTRACT

High-throughput genetic sequencing technologies have driven a proliferation of new genomic data. From the advent of long-read Sanger sequencing to today's low-cost short-read generation and the upcoming era of single-molecule techniques, methods to address the complex genome assembly problem have evolved alongside and are introduced at an expeditious pace. These algorithms attempt to produce an accurate representation of a target genome from datasets filled with errors and ambiguities. Many of the challenges introduced, unfortunately, must be addressed through an algorithm's ad-hoc criteria and heuristics, which as a result can output assembly hypotheses that contain significant errors. Without an inexpensive computational approach to assess the quality of a given assembly hypothesis, researchers must make do with draft-level genome projects for downstream analysis. Solving three fundamental challenges will alleviate this issue: (i) automating the incorporation of algorithms from the dynamic landscape of genome assembly tools, (ii) developing optimal assembly algorithms best suited for various types, or mixtures, of sequencing data, and (iii) developing an approach to assess de novo genome assembly quality independently of a reference genome. We provide several contributions toward this effort. We first introduce AssemblyRAST, a general compute orchestration framework and accompanying domain-specific language that facilitates rapid workflow design for genome assembly, analysis, and method discovery. Next, we demonstrate the improvement of genome assemblies through novel integrative algorithm techniques. Finally, we devise a method for reference-independent assembly evaluation and error identification through supervised learning, along with several applications to further improve existing techniques.

Chapter 1

INTRODUCTION

Since the dawn of biological study, the questions of our provenance and resulting phenotypes have long been speculated upon; through Darwin to Mendel to Watson, Crick, and Franklin, it has become clear that many of the big answers lie in the smallest form: a nanoscale molecule of nucleic acids that makes up our genome. From an individual's propensity for disease to physical traits or psychological predispositions, the pursuit of gaining such knowledge from an individual's genome is a central vision that drives many scientists to study this highly enigmatic language of life on earth. While it is certainly quixotic to envision that a four-letter code holds such power, the relentless influx of new research that continues to validate this as true suggests no reason to quell the excitement. Indeed, it was this romantic optimism that underpinned much of the initial enthusiasm for the Human Genome Project (HGP).

Announced in 1990, the HGP was a 3 billion dollar global scientific effort to assemble the first human genome and build a map of all genes within; it was hailed as the presumable Rosetta Stone for human disease and medicine. In fact, Francis Collins, leader of the HGP, claimed that genetic diagnoses of most diseases, including cancer, diabetes, heart disease, and major mental illnesses, and their resultant treatments would be realized within 10 and 15 years of the genome's completion, respectively. The first draft of the human genome was released in 2000 and declared complete in 2003. Now, 13 years later, it is clear that we remain far from those idealistic assumptions. Thus far, the HGP has yielded many insights, including a better estimate of the number of human genes, structural differences with other organisms, and the distribution of repeat regions, yet it falls far short of the revolution promised. In reality, the completion of the human genome may have generated more questions than it answered.
Indeed, new sequencing projects are surfacing at an exponential rate in order to gain insight into genome evolution and alternate gene models; the panda genome was assembled using next-generation sequencing, and recently the Genome 10K Project was proposed, aimed at sequencing the genomes of 10,000 vertebrate species. The biological sciences community has already become inundated with a plethora of data, whose processing and analysis are hindered by an overall deficiency of resources, methodological consensus, and analytical strategies, among myriad other factors. The scope of the genome and metagenome assembly problem encompasses the fields of genetics, systems biology, graph theory, and computer architecture, among others; indeed, a very interesting problem. Work is being done to progress each respective sub-discipline, such as minimizing sequencing obfuscations, classifying genetic pattern profiles, automating gene annotation, accelerating algorithm performance, or applying computational genetics to specialized hardware. However, it is becoming clear that the problem as a whole demands a comprehensive approach across these numerous subfields, and approaching it from a higher level will undoubtedly reveal synergistic relationships in the diverse conglomeration that is the field of computational genomics.

1.1 Motivation

1.2 Research Contributions

In this thesis, we look at the broad landscape of computational genomics and genome assembly, and try to provide the following:

AssemblyRAST Software Framework and Assembly Service

AssemblyRAST is a compute execution framework designed to enable rapid workflow generation and hyperparameter optimization through an extensible plugin system and command-line abstractions. The workflow engine is driven by a custom Lisp dialect (Wasp), which abstracts high-level computational-biology concepts as language-level expressions and allows for the rapid, declarative creation of meta-algorithmic "personalities" designed to iterate, search, or optimize an overall assembly pipeline.

The assembly service is a cloud-hosted computation and storage service designed for high throughput of computationally expensive de novo assembly jobs, scalable to heterogeneous compute hardware for appropriate job delegation. The main goal is to provide users with a platform for genome assembly which, through available service interfaces, allows one to incorporate the assembly step into many custom workflows while bypassing any complex local infrastructural orchestration.

A Framework for Genome Assembly Benchmarking and its Application to the Current Landscape of Assembly Algorithms

We provide methods to design and run reproducible assembly pipeline experiments on genomic datasets and provide the user a comprehensive comparative analysis of performance among the specified assembly pipelines.

A Self-Tuning Ensemble De Novo Assembler Pipeline

We introduce a self-tuning genome assembly system that infers assembly parameters through analysis of read data and integrates cleaning, error correction, assembly, and merging steps based on evaluation of intermediate results. We demonstrate that the assemblies produced are of quality comparable to or surpassing the leading assemblers, without any user interaction.

Reference-Independent Error Prediction through Supervised Learning

We introduce a supervised learning-based method to identify and classify putative assembly errors, and to assess and score the quality of the assembly, using contig sequence properties and read alignments after training on real and simulated datasets.

Two Evaluation Metrics of De Novo Assembly for Informing Downstream Optimization

Finally, we provide two metrics to assess the quality of a genome assembly independently of a reference, which can also be integrated into sophisticated workflows.

Chapter 2

BACKGROUND

This chapter provides background necessary to understand the landscape and motivation of this thesis. Because the problem has its roots in multiple facets of biology, biotechnology, and computer science, understanding each sub-field intimately is crucial to our efforts of developing an optimal and integrative solution. In this section, we discuss current sequencing technologies and the type of data generated, algorithms and methods used to process and analyze the data, and current approaches in system technology and infrastructure that are capable of handling scientific computing’s new big data problem.

2.1 Genome Sequencing: Techniques and Data Profiles

While the new, efficient, and cost-effective next generation sequencing technologies are indeed a boon to the biology community, these benefits do not come without inherent imperfections. Ironically, biologists are now producing sequencing data in record amounts, and it is this "data boom" that is one of the problems faced by computational biologists today. To make matters more complex, these technologies produce reads and errors that are considerably less "assembly-friendly" than those of the older Sanger-based technologies. Since each platform relies on its own intricate combination of biochemistry and hardware mechanisms, distinct error profiles are associated with the generated data and should be considered when performing further processing. Here we survey the major next generation sequencing technologies. Overviews of the major platforms are presented in Table 2.1.

2.1.1 Sequencing By Synthesis

Sequencing-by-synthesis, a technique originally developed in the mid-1990s, has quickly become the most successful sequencing technique, owing its success to its support for massively parallel sequencing at relatively low cost. The Illumina/Solexa GA/HiSeq systems are currently the most widely used platforms for these reasons. The procedure is described as follows:

Procedure of Illumina HiSeq Sequencing Systems This sequencing technology uses sequencing by synthesis (SBS) and cyclic reversible termination and proceeds as described:

1. Single strands of the library are attached to the flowcell and bridge amplified to produce clonal clusters.

2. A DNA polymerase bound to a primed template incorporates a dye-labeled terminating nucleotide.

3. Remaining nucleotides are washed.

4. Incorporated nucleotides are detected by total internal reflection fluorescence (TIRF).

5. The terminating group and fluorescent dye are cleaved so that polymerase activity can continue.

6. Additional washing is performed and the process is repeated from step 2.

Read Profile of Illumina HiSeq Sequencing Systems Initially, the Solexa GenomeAnalyzer (GA) was able to output 1 Gigabase/run. Through incremental improvements in polymerase, buffer, flowcell, and software [61], the latest GAIIx series can produce 85 Gb/run. Introduced in 2010, Illumina's HiSeq 2000, which employs the same strategies as above, can currently produce 100bp reads at 600 Gb/run, with 1 Tb/run possible in the foreseeable future. The benefits of this platform are clearly throughput and cost efficacy. Error rates are relatively low as well; after filtering they have been shown to fall below 2% [61]. The most common error type is substitution, amplified specifically when the preceding nucleotide is guanine. Additionally, Dohm et al. have shown that AT- and GC-rich regions are underrepresented due to amplification bias [26]. Introduced in 2005, the Roche 454 sequencing system was the first commercially successful next generation platform [61] and ushered in a new era of genomic studies. Wheeler et al. were the first to apply next generation sequencing to personal human genomes, using 454 technology to sequence the genome of James D. Watson [115].

Procedure of Roche 454 Sequencing Systems The 454 system employs pyrosequencing, a bioluminescence method in which enzymatic reactions cause visible light emission. The procedure is outlined here:

1. The library is denatured into single strands, captured by amplification beads, and emulsion PCR is performed.

2. The amplified targets are incubated with DNA polymerase, ATP sulfurylase, luciferase, luciferin, and adenosine 5’ phosphosulfate.

3. One of the deoxynucleoside triphosphates (dNTPs: dATP, dTTP, dGTP, dCTP) is added, and will luminesce when a released pyrophosphate (PPi) reacts with ATP sulfurylase and luciferase.

4. Bioluminescence is detected via a charge-coupled device (CCD) camera and recorded as flowgrams. Homopolymers of up to six base pairs can be measured, as signal amplitude is directly proportional to homopolymer length.

5. dNTPs are degraded by apyrase and the procedure returns to step 2 using the next dNTP type in a predetermined sequence.

Profile of Roche 454 Sequencing Systems The 454 GS FLX Titanium system, launched in 2008, is able to produce 0.7 Gb/run with reads of 700bp and 99.9% accuracy, completing in under 10 hours. As suggested by the procedure, homopolymers longer than six bases produce a high error rate. Insertions are the most common error, followed by deletions. The cost of reagents also remains an issue.

2.1.2 Oligo-ligation Detection

Released commercially in 2006, Sequencing by Oligo Ligation Detection (SOLiD) by Applied Biosystems is based on a unique 2-base encoding technology that enables a drastically reduced cost per base as well as a substantial throughput rate.

Procedure of SOLiD Sequencing The Sequencing by Oligo Ligation Detection (SOLiD) by Applied Biosystems uses an accurate two-base sequencing by ligation technology:

1. A primer is hybridized to a template amplified via emPCR.

2. A "1,2-probe" is added. These consist of an interrogation dinucleotide, 16 of which are encoded by four fluorescent dyes, as well as additional degenerate and universal bases that aid in ligation.

3. The dinucleotide encoding is imaged, and the probe is partially cleaved.

4. Steps 2 and 3 are repeated ten times to yield ten dinucleotide color calls in 5-base intervals.

5. The extended primer created thus far is stripped.

6. An offset (n-1) primer is hybridized, and a second round (Steps 2-4) ensues. The DNA sequence can be decoded via knowledge of the color combinations in the base interrogations.
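As an illustrative sketch of the final decoding step (not a description of the vendor's software), the following assumes the standard published SOLiD dye-to-dinucleotide assignment: color 0 encodes AA/CC/GG/TT, color 1 encodes AC/CA/GT/TG, color 2 encodes AG/GA/CT/TC, and color 3 encodes AT/TA/CG/GC. Given a known first base from the primer, each color call then determines the next base:

```python
# Two-base (color-space) decoding sketch for SOLiD-style reads.
# COLOR[c][prev_base] gives the base implied by color call c following prev_base.
COLOR = {
    0: {"A": "A", "C": "C", "G": "G", "T": "T"},
    1: {"A": "C", "C": "A", "G": "T", "T": "G"},
    2: {"A": "G", "G": "A", "C": "T", "T": "C"},
    3: {"A": "T", "T": "A", "C": "G", "G": "C"},
}

def decode_colorspace(first_base, colors):
    """Translate a color-space read into nucleotides, given the known
    primer base that anchors the first color call."""
    seq = [first_base]
    for c in colors:
        seq.append(COLOR[c][seq[-1]])
    # the anchoring primer base itself is not part of the read
    return "".join(seq[1:])
```

Note that a single miscalled color corrupts all downstream bases of a naively decoded read, which is why SOLiD analysis is typically performed in color space and translated to nucleotides only after alignment.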

Profile of SOLiD Sequencing The SOLiD 5500xl system is capable of producing 50bp reads at 120 Gb/run. The highlight of the system is its accuracy, owing to its unique dinucleotide sequencing-by-ligation method, at 99.94% [124]. As with Illumina platforms, substitution remains the most common error on SOLiD platforms, and AT- and GC-rich regions may be underrepresented [40]. Palindromic sequences have also been reported as problematic for the platform.

2.1.3 Single Molecule and Nanopore Sequencing

The emergence of short-read sequencing technologies has launched the field into the next level of genome research, allowing for the cost-effective molecular analysis that was once the blurry vision underpinning the Human Genome Project effort. Albeit revolutionary, short-read technologies are inherently poorly suited for certain biological problems, such as repeat resolution, gene isoform detection, or methylation detection. Single-molecule sequencing holds clear advantages over the current generation of short-read technology, and recent developments in these technologies clear the path toward the next leap in genome research. The main advantages include:

• Small sample requirements: Mid-picogram sample sizes are sufficient for sequencing, opening a new dimension of possible sequencing targets [67]. Analysis of circulating DNA is but one fitting use-case.

• Non-PCR-based: A current challenge in PCR-based methods is that the variable GC content of fragments biases the overall read coverage distribution. For studies involving quantification of genomic content, such as chromosomal anomaly detection, such a non-biasing solution would be hugely impactful. The ability to map accurately and quantitatively is paramount for such studies.

• Long read lengths: The computational problem of resolving repeat regions is exacerbated by the limitations of short-read technology. While error rates in single-molecule sequencing technologies still need to improve, read lengths of 5kb to 10kb as produced by current platforms will be hugely beneficial for structural reconstruction of genomes.

Two major platforms are based on this single-molecule paradigm, and both are quickly becoming important tools as researchers discover new methods to mitigate high sequencing error. Pacific Biosciences has developed a sequencing platform it describes as "single molecule real-time sequencing" (SMRT), which finds its basis in leveraging zero-mode optical waveguides. We describe the procedure below.

Procedure for Pacific Biosciences RSII Sequencing Platform

1. A circular template called a SMRTbell is prepared by ligation of hairpin adapters to both ends of the target dsDNA.

2. The SMRTbell is loaded onto the chip (SMRTcell) and diffuses into a light detection sequencing unit (zero-mode waveguide), which contains a single bottom-immobilized polymerase.

3. Fluorescently labeled nucleotides are added to the cell.

4. Light emissions are detected and recorded.

Nanopore sequencing has recently matured and is quickly becoming a primary option in the fourth generation of sequencing technologies. Oxford Nanopore has released a set of platforms that offer an intriguing sequencing mechanism, read profile, portability (the MinION and SmidgION platforms), and most importantly, cost. This factor has and will continue to democratize and mobilize genome sequencing, and it has the potential to accelerate, if not revolutionize, the field. We describe the mechanisms and read properties below.

Procedure for Oxford Nanopore Sequencing Platforms The technology leverages an array of protein nanopores embedded in a polymer membrane that separates two chambers of electrolyte solution. Electrodes for detecting changes in current are immersed in each chamber, and voltage is applied to generate an ionic current signal. The detection procedure is as follows:

1. As the DNA molecule passes through the nanopore, ionic current is interrupted.

2. The overall changes in current amplitude and duration are recorded.

3. Software techniques map signal-change profiles to infer nucleotides.

Profile for Oxford Nanopore Sequencing Platforms Oxford Nanopore's MinION instrument offers some compelling features: a single-use device form factor the size of a USB stick and average read lengths of ~5kb, up to 10kb [29]. However, initial error rates were estimated to be as high as 90%, although some have developed computational techniques aimed at improving this figure.

2.1.4 Sequencing Data

Because the process of translating an organic molecule into digital information is inherently lossy, most platforms provide data ancillary to the main sequence data that may be useful to assembly algorithms. One major opportunity afforded by next generation sequencing's low cost and high throughput is the potential to generate high read redundancy, or coverage. While read errors may be produced at a rate of 0.5-2.5%, by leveraging redundancy via consensus algorithms, many error events can be mitigated. We describe here the major types of ancillary data generated by platforms.
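As a toy illustration of how redundancy mitigates random errors (not any particular assembler's consensus algorithm), a per-column majority vote over reads already aligned to the same coordinates suppresses isolated base-call errors once coverage is sufficient:

```python
from collections import Counter

def consensus(aligned_columns):
    """Majority-vote consensus: aligned_columns[i] is the list of base
    calls covering position i. Isolated random errors are outvoted by
    the concordant calls from redundant reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in aligned_columns)

# Three reads covering the same four positions, two with one random error each:
columns = [["A", "A", "A"], ["C", "C", "G"], ["T", "A", "T"], ["G", "G", "G"]]
# consensus(columns) recovers "ACTG" despite the per-read errors
```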

Quality/Confidence Scoring Each sequencing technology provides a quality score for each base call, traditionally calculated as a log-transformed probability that the base is incorrect (the phred score). Solexa scores are calculated differently but can be approximated to the same meaning. 454 scores indicate only that a homopolymer length has been called correctly. It has been suggested that Solexa software under-estimates and over-estimates true error rates at high and low quality scores, respectively [26].
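The two transformations just described can be stated explicitly. With p the estimated probability that a base call is incorrect, the standard phred score and the legacy Solexa (odds-based) score are:

```latex
Q_{\mathrm{phred}} = -10\,\log_{10} p
\qquad
Q_{\mathrm{solexa}} = 10\,\log_{10}\frac{p}{1-p}
```

For small p, p/(1-p) approaches p, so the two scales converge; this is what justifies treating Solexa scores as approximate phred scores.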

Mate Pair Sequencing Prior to attachment of clonal fragments to a flowcell, next-generation sequencers are able to ligate adapters to both ends, allowing for both forward and reverse reading of each strand. Furthermore, the pair of resulting reads contains positional information relative to one another. This information, the "insert size," is leveraged in most assembly software to overcome repeat-region ambiguities as well as to perform scaffolding and gap extension.

                     HiSeq2000   MiSeq      454       SOLiD4hq        PacBioRS
Sequencing Strategy  SBS         SBS        Pyro      Ligation/2base  -
Error Dominance      Sub         Sub        Indel     A-T Bias        -
Error Rate           0.26%       0.80%      0.01%     0.01%           12.86%
Read Length          2x100bp     2x<250bp   2x700bp   2x75bp          1500 mean
Insert Size          <700bp      <700bp     <20kbp    -               -
Yield/Run            600Gb       2Gb        700Mb     300Gb           100Mbp
Time/Run             11d         39h        23h       14d             2h
Cost($)/Gbp          40          502        7000      70              2000

Table 2.1: Profiles of Major Next Generation Sequencing Platforms

2.2 Genome Assembly

Genome assembly, or the act of combining many character sequences into, ideally, a single continuous string, is an intuitively simple problem and one that is fundamental to computational genomics. There are multiple classes of genome assembly problems, namely reference-based (or comparison-based) and de novo assembly. The former requires a pre-assembled reference genome, to which newly produced reads are mapped; such references are relatively scarce, especially for assembly projects of esoteric organisms. Reference-based assembly is a much easier problem computationally, and furthermore, de novo assembly must still be performed for larger regions where valid mappings do not exist, usually caused by genome insertions. We will concentrate on the de novo assembly problem, as it poses the greatest challenges yet offers immense opportunities for knowledge discovery. Fundamentally, the algorithmic problem of assembling shotgun reads can be cast as the Shortest Common Superstring (SCS) problem:

Given a set of input sequences [s1, s2, ..., sn], find the shortest superstring C such that every si is contained within C.
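The SCS problem is NP-hard in general; the classic greedy approximation, sketched below for illustration, repeatedly merges the pair of strings with the largest suffix-prefix overlap until a single superstring remains:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(reads):
    """Greedy approximation to the Shortest Common Superstring:
    repeatedly merge the most-overlapping pair of strings."""
    reads = list(reads)
    while len(reads) > 1:
        # find the ordered pair (i, j) with the maximum overlap
        k, i, j = max(
            (overlap(reads[i], reads[j]), i, j)
            for i in range(len(reads))
            for j in range(len(reads))
            if i != j
        )
        merged = reads[i] + reads[j][k:]
        reads = [r for t, r in enumerate(reads) if t not in (i, j)]
        reads.append(merged)
    return reads[0]
```

For example, greedy_scs(["ACGT", "CGTA", "GTAC"]) yields "ACGTAC", a six-character superstring containing all three reads. The greedy choice is not guaranteed optimal, which foreshadows the heuristic trade-offs discussed for the assembler classes below.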

The problem itself follows from basic assumptions of perfect read representation and genome substring uniqueness. However, information loss in the sequencing process and repeat regions within genomes prevent complete computability under parsimony-based models. In the next section, we outline the inherent challenges, those caused by imperfect sequencing technologies, and the attempts at solutions.

2.2.1 Challenges

2.2.1.1 Next Generation Sequencing

It can be said that the proliferation of new sequencing technologies, plummeting costs, and rising throughput have allowed for a drastic increase in productivity for the genome sciences. But the properties of these new technologies introduce novel challenges not seen with the traditional capillary sequencers. For example, the first microbial sequencing project, completed in 1995, used 24,304 reads of length ~460bp [30]. The human genome project employed Sanger sequencing, generating roughly 30 million reads of ~800bp [117] at an error rate as low as 10^-5 per base [95]. By contrast, the most widely used platform today, Illumina HiSeq 2000, generates billions of basepairs per run, at a much shorter 100bp read length and an error rate several orders of magnitude higher. The sheer volume of data, coupled with short read lengths, requires significant adaptations in assembly methods. It has been suggested that despite the reduced cost per basepair, the cost per unit of information in gene annotation studies remains comparable to traditional Sanger sequencing [118]. Alkan et al. analyzed the differences between NCBI's reference human genome and one assembled using next-generation sequencing [2]. They found that 420.2 Mbp were missing from the assembly, among various other discrepancies. A prevailing view suggests that algorithmic efficiency cannot overcome limitations introduced by sequencing technologies. Here, we describe some major limitations.

Substitution An erroneously identified nucleotide, possibly caused by convolution in the light-capture signal and downstream base calling. This error type is most common on Illumina and 454 platforms. Minoche et al. suggest that in Illumina platforms, 99.5

Insertions and Deletions Extra nucleotides are erroneously inserted into reads, or original nucleotides are omitted (deletions); collectively these are called indels. Indels are the dominant error type in 454 and IonTorrent technologies.

Ambiguous base Base-calling software is unable to confidently determine a base. Although these may be completely ambiguous (thus denoted by an "N" character), software may be able to resolve base information down to base classes (e.g., "C" or "T," represented by the letter "Y" for "pYrimidine"). A full mapping is shown in Table 2.2.

2.2.1.2 Genetic Properties

2.2.1.3 Metagenomes

Code  Representation  Etymology
A     A               Adenosine
T     T/U             Thymidine
G     G               Guanine
C     C               Cytidine
K     G/T             Keto
M     A/C             aMino
R     A/G             puRine
Y     C/T             pYrimidine
S     C/G             Strong
W     A/T             Weak
B     C/G/T           not A; B follows A
V     A/C/G           not T/U; V follows U
D     A/G/T           not C; D follows C
N     A/T/C/G         aNy
-     Gap

Table 2.2: FastA Base Pair Codes

Metagenomics began to gain traction in 2000, when Beja et al. discovered a new ATP-generating mechanism by studying environmental fragments from seawater (Beja 00). Venter et al. later garnered attention by identifying 1.2 million genes in a single metagenomic survey of the Sargasso Sea [109], and it became increasingly clear that metagenomics could be used to gain insight into many functional pathways of uncultured microbial communities. Recent studies have uncovered possible health implications of an organism's gut microbiome's effect on disease, obesity, and mental health; as we begin to explore such techniques as probiotic inoculation or fecal transplantation, metagenomic sequencing will become increasingly important.

Metagenomic assembly faces the same sequencing and computational challenges as single-organism assembly, but is further complicated by poor and uneven community coverage, genome variance within natural populations, and the risk of chimera generation [104]. For instance, many error-correction methods rely on read coverage and consensus for validation of an accurate read, and oftentimes low-coverage reads are discarded. An evenly high coverage of a whole metagenome is unrealistic, and furthermore, single nucleotide variants (SNVs) among a given population must not be classified as base-calling errors. Many metagenomic studies rely on complete or draft genomes for aid in interpretation,

but such reference genomes remain relatively scarce. Improvements in single-cell genome amplification will provide large benefits to this area of research.

2.2.2 Algorithms for De Novo Genome Assembly

The most intuitive approach, and accordingly the first methods developed for sequence assembly, are based on sequence overlaps between reads. In such overlap-based algorithms (greedy, overlap layout consensus), pairwise comparisons must be performed amongst the reads, requiring a worst case of n(n−1)/2, or O(n²), comparisons. Furthermore, the common approach for this comparison involves the Smith-Waterman algorithm [100], a dynamic-programming (and inherently slow) algorithm that generates an alignment or overlap score. We describe the algorithm below. Alternate approaches may use heuristic methods, such as k-mer similarity in TIGR, for faster performance at the expense of sensitivity. Regardless, the need to perform such comparisons and store the resulting overlap index makes this class of assemblers extremely computationally expensive. One advantage of overlap approaches is that they are easily parallelized and thus benefit greatly from multi-core computation architectures. However, with the advent of next-generation short-read, high-throughput sequencing technologies, the complexity problems inherent in pairwise methods are further exacerbated by the sheer volume of data.
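To illustrate why this scoring step is expensive, here is a minimal Smith-Waterman local alignment scorer in Python; it fills an O(nm) dynamic-programming table per read pair. The scoring constants are illustrative defaults, not values prescribed by any particular assembler.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    # H[i][j] holds the best score of an alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

smith_waterman("ACGTT", "ACGT")  # four matching bases -> score 8
```

The quadratic table per pair, multiplied by the quadratic number of read pairs, is what overlap-phase heuristics aim to avoid.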

2.2.2.1 Greedy Extension

As the name suggests, this class of assemblers uses a greedy approach, optimizing a local objective function: the overlap scores over a defined k basepairs, or $k$-mers. That is, the read with the highest scoring overlap is used for the extension. Contigs are extended with valid overlapping $k$-mers until the pool of potential overlaps is exhausted. While efficient on memory, genomes containing many repetitive regions are problematic, as repeats can induce local maxima that lead to structural misassembly.
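The extension loop can be sketched as follows, using exact suffix/prefix overlap length in place of a real overlap score; this is a toy illustration, and the function names are ours, not from any assembler.

```python
def best_overlap(contig, reads, min_olap=2):
    """Find the read whose prefix has the longest exact overlap with the contig's suffix."""
    best, best_len = None, 0
    for read in reads:
        for olap in range(min(len(contig), len(read)), min_olap - 1, -1):
            if contig.endswith(read[:olap]):
                if olap > best_len:
                    best, best_len = read, olap
                break  # longest overlap for this read found
    return best, best_len

def greedy_extend(seed, reads):
    """Greedily append the best-overlapping read until no candidate remains."""
    contig, pool = seed, set(reads)
    while pool:
        read, olap = best_overlap(contig, pool)
        if read is None:
            break
        contig += read[olap:]   # locally optimal extension
        pool.discard(read)
    return contig

greedy_extend("ATGC", ["GCAA", "AATT"])  # -> "ATGCAATT"
```

A repeat region would offer several equally good extensions here, and the greedy choice commits to one of them irrevocably, which is exactly how the local maxima described above arise.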

2.2.2.2 Overlap Layout Consensus (OLC)

Overlap layout consensus methods use the same first step as greedy extension, the pairwise or heuristic overlap scoring, but build out a layout graph before inferring the global structure of the assembly. In this graph, each read is represented by a node, and qualifying overlaps are the connecting edges. Once the graph is built, the optimal solution is the path that traverses every node, or the Hamiltonian path. The related computational problem of finding Hamiltonian cycles is NP-complete. Most OLC approaches attempt to alleviate the computational requirements by employing overlap computation heuristics and performing a variety of graph reduction steps. Popular methods include exact matching, transitive reduction, collapsing chordal subgraphs, and removing dead-end paths. By using exact matching in the overlap phase, the traditional and computationally expensive Smith-Waterman scoring algorithm is bypassed while also avoiding spurious overlaps due to sequencing error. Transitive reduction involves removing branch edges for which there exists a longer path to the same node and which are thus non-essential. This proves to be a highly effective optimization, shown to reduce the graph complexity by a factor of the oversampling rate [75]. Graph branching is often caused by repeat regions, substitution errors, and clonal polymorphisms [41]. While the first issue is unsolvable without additional information, the other problems can be addressed. Assemblers such as Edena employ a dead-end cleaning method in which branches are traversed up to a certain threshold, and branches that fall short are removed. Assemblers that use OLC include Edena, CELERA, ARACHNE, Kiki, and Minimus.
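Transitive reduction, for instance, removes an edge u → w whenever a two-step route u → v → w carries the same information. A naive sketch over an adjacency-set graph (illustrative only; real overlap-graph implementations such as the one analyzed in [75] are more careful about edge lengths):

```python
def transitive_reduce(graph):
    """Drop edge u->w when some two-step path u->v->w exists (naive sketch)."""
    reduced = {u: set(vs) for u, vs in graph.items()}
    for u, vs in graph.items():
        for v in vs:
            for w in graph.get(v, ()):
                reduced[u].discard(w)  # u->w is implied by u->v->w
    return reduced

transitive_reduce({"A": {"B", "C"}, "B": {"C"}, "C": set()})
# -> {"A": {"B"}, "B": {"C"}, "C": set()}
```

In an overlap graph with oversampled reads, most edges are implied by chains of shorter overlaps, which is why this step shrinks the graph roughly by the oversampling factor.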

2.2.2.3 De Bruijn Graph

As the sequencing technology landscape progressed, techniques such as sequencing-by-synthesis (SBS) became the method of choice for a variety of reasons. However, with this shift in methodology came a shift in data: read lengths shortened, read counts exploded, and insertion/deletion errors decreased as substitution rates increased. With the need to assemble billions of short reads from next-generation sequencers, a new approach came into favor, using de Bruijn graphs and Eulerian paths. The method works as follows:

1. For each k-mer present within the reads, form two nodes of length k−1 corresponding to its prefix and suffix, if not already present.

2. Form a directed edge from node x to y if there exists a k-mer where x and y are its prefix and suffix, respectively.

3. Attempt to find an Eulerian path, or a path that visits every edge of the graph exactly one time. The resulting path represents the genome. The algorithm is formalized in Algorithm 1.

Algorithm 1: De Bruijn Assembly
Data: Set R of reads ri
Result: Set C of directed paths representing contigs
initialization;
forall the reads ri ∈ R do
    Let Prefix(ri) ← first k − 1 nucleotides of ri;
    Let Suffix(ri) ← last k − 1 nucleotides of ri;
    Form directed edge ei, representing ri, from Prefix(ri) to Suffix(ri);
end
Attempt to find an Eulerian path, or a path that visits every edge of the graph exactly one time; the resulting path represents the genome.
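The graph-building loop in Algorithm 1 translates almost directly into Python. This is a minimal sketch, not code from the AssemblyRAST plugins:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds a prefix->suffix edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # directed edge Prefix -> Suffix
    return graph

de_bruijn(["ATGG", "TGGC"], 3)
# edges: AT->TG, TG->GG (once per read containing TGG), GG->GC
```

Note that edges are kept with multiplicity: a k-mer seen twice contributes two parallel edges, which is what lets coverage information survive into the graph.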

Figure 2.1: NGAx plot of a V. cholerae assembly produced by a Velvet parameter sweep (contig length in kbp against x, for runs P1_Vt_h29, P2_Vt_h41, P3_Vt_h53, P4_Vt_h65, P5_Vt_h77, and P6_Vt_h89)

Furthermore, Euler's theorem states that a balanced, connected, directed graph (which is how the constructed de Bruijn graph can be classified) must contain an Eulerian cycle. The advantages of this method are clear: there are no pairwise comparisons for building the graph, and finding an Eulerian path is tractable on modern compute systems. In a perfect world, this method corresponds fittingly to the problem. Of course, the previously mentioned problems inherent in sequencing and biology complicate matters, and certain strategies have been developed to deal with these limitations. Euler's theorem implies that if all k-mers in the genome are generated, then there exists an Eulerian path. With the most used sequencing technology, Illumina, 100-mers generated from a genome capture only a small fraction of its source [21], thereby violating one of the assumptions of Euler's theorem. De Bruijn assembly methods therefore break reads into k-mers, and given a small enough k, this allows for a near complete representation of the genome.
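Finding the Eulerian path itself is linear in the number of edges; Hierholzer's algorithm is the standard approach. A compact sketch, under the assumption that an Eulerian path actually exists in the input graph:

```python
from collections import defaultdict

def eulerian_path(graph):
    """Hierholzer's algorithm over a directed graph {node: [successors]}."""
    out = {u: list(vs) for u, vs in graph.items()}
    indeg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            indeg[v] += 1
    # Start where out-degree exceeds in-degree; fall back to any node.
    start = next((u for u in out if len(out[u]) - indeg[u] == 1), next(iter(out)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out.get(u):
            stack.append(out[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())     # node exhausted: emit it
    return path[::-1]

eulerian_path({"AT": ["TG"], "TG": ["GG"], "GG": []})  # -> ["AT", "TG", "GG"]
```

Contrast this with the Hamiltonian path required by OLC layouts: visiting every edge once is polynomial, while visiting every node once is NP-complete.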

Figure 2.2: 3-mer De Bruijn Graph

Figure 2.3: De Bruijn graphs. The circular genome CATTCATGTAAGTA is represented by nine reads, {TTCAT, TCATG, TGTAA, ATGTA, ACATT, GTAAG, CATGT, AGTAC, TAAGT}. In Figure 2.2, all 3-mers in the genome are represented, but the graph is tangled. In Figure 2.3, some 4-mers are not recovered from the reads, and thus the graph is fragmented.

Repeat regions intrinsic to genome structure cause problems in all assembly methods. In de Bruijn graphs, repeats longer than k cannot be resolved within the graph.

2.3 Computing Systems

The previous section described how next-generation sequencing has introduced numerous algorithmic challenges to the computational biology landscape. This is, however, only half of the issue. Beyond these conceptual problems of correctness and complexity lie the difficulties produced by real-world application. Here we explore problems such as data volume and computational constraints, along with available technologies that may provide solutions.

2.3.1 Scientific Compute Services

2.3.1.1 RAST

The Rapid Annotation using Subsystems Technology (RAST) server was made available in 2007 and is the foundation for a set of computational systems aimed at accelerating the progress of systems biology; it is also the eponym of our AssemblyRAST service. RAST provides a fully automated service to annotate assembled contigs through identification of protein-encoding genes, and to reconstruct a metabolic network of the organism [5]. The process of automatic annotation is based upon a knowledge base produced via manual curation:

1. An “expert curator” defines a subsystem as a set of abstract functional roles, to which specific genes are connected.

2. Proteins encoded by these genes are scrutinized against a set of rules and decision procedures to create and populate protein families called FIGfams.

3. tRNA and rRNA encoding genes are identified from submitted contigs using existing tools and from these, the system intelligently infers metabolic properties of the genome.

Users can submit genomes, view progress, receive notification upon completion, download results, and view a graphical analysis with the SEED Viewer.

2.3.1.2 MG-RAST

The Metagenomics-RAST (MG-RAST) server applies the same underlying subsystems technology as RAST to metagenomic datasets. In addition to automated functional annotation and metabolic reconstruction, MG-RAST offers additional comparison and visualization tools through summaries and subsystems heatmaps. Due to the challenges posed by metagenomics, many traditional genome analysis methods fail to provide sufficient results. Further work to discover new methods for binning, clustering, and prediction is underway. Moreover, performance is a concern when processing such large datasets [72].

2.3.1.3 KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive resource consisting of 15 main databases aimed at integrating genomic, chemical, and systemic functional information, along with a set of tools for interpretation of data sets [47]. Via the KEGG Mapper, a user is able to submit queries of genes, proteins, and other molecular objects and obtain pathway and ontology enrichments. KEGG offers several other tools, such as those for genome metabolic comparison, and drug and disease information. As of 2011, access to the KEGG FTP knowledge base has moved to a paid subscription model.

Chapter 3

ASSEMBLYRAST FRAMEWORK

Using bioinformatics tools or scripting workflows generally involves substantial manual interaction. By abstracting the bioinformatics tools and scripts into functional units within a flexible execution engine, custom workflows can be generated in an ad-hoc fashion, encouraging exploration of the algorithmic space. We present a framework and service, called AssemblyRAST, that allows users to rapidly design custom meta-algorithm pipelines, perform multi-pipeline comparisons, multi-module comparisons, and parameter sweeps, or build intelligent mini-scripts we call "recipes". The declarative style for recipe definitions enables one to impose logic within a given workflow; e.g., without manual curation or scripting, a given workflow may be designed to infer optimal parameters, toggle specific pipeline stages, and sort, merge, or post-process the most promising results for return. The AssemblyRAST framework is predominantly written in Python and Perl, and uses a variety of libraries such as Pika, CherryPy, Yapsy, and PyMongo. The service software stack consists of MongoDB, RabbitMQ, Shock, and OpenStack. The web frontend is written in JavaScript and builds upon the AngularJS and Polymer web technologies. The framework can be run in many different configurations, invoked from multiple client endpoints (command line, web, REST API, KBase IRIS, KBase Narrative), and here we classify and describe various modes in which an assembly study can be performed.

3.1 The WASP Language for Computational Workflows

Workflow declarations, or "recipes," written in a computational biology-specific declarative scripting language allow researchers to rapidly prototype simple or complex pipelines and quickly generate assembly hypotheses and analyses. Sophisticated workflows can include

inter-stage logic or computation, use heterogeneous data sources, and are not limited to one-dimensional pipelines nor to strictly passing the direct outputs from each module to the next. Algorithms incorporating multiple sources, such as reference-based assembly, merging, or alignments, can be executed using inputs generated from multiple pipelines. Auxiliary data computed or generated within each module plugin wrapper is available to the pipeline and can guide the overall route or configuration of subsequent workflow stages. For example, the histogram results of KmerGenie can be analyzed to guide the multi-k parameter setting of SPAdes. Several recipes are available that have been manually tuned and tested to offer workflows that fit certain dataset types or desired analysis criteria such as throughput or rigor. The flexible nature of the compute engine also enables the rapid design and emulation of other popular protocols.

3.1.1 Specification

The WASP language, whose name derives from "Workflows for ASsembly Processor," is a domain-specific language inspired by Common Lisp. All workflows within the AssemblyRAST system are based on WASP definitions. The abstraction of assembly tools into first-class functions was the first step towards a flexible workflow framework. Lisp is billed as a "programmable programming language," and as such was an appropriate basis for abstracted yet programmable workflow definitions. While conditional logic, concise flow graphs, and dynamic typing enable investigators to rapidly define sophisticated pipelines, strict lexical scoping eliminates overlooked side effects that could undermine reproducibility.

3.1.1.1 Wasp Objects

3.1.1.2 Basic Evaluation

Evaluation of WASP code follows the same notational convention as Lisp. Each expression is normally evaluated as a list expression and typically results in a value. Here, we use the symbol ⇒ to indicate evaluation. The WASP interpreter supports a subset of basic Lisp operators, as shown in Table 3.1. Below are examples of basic and nested expression evaluation.

(* 21 2) ⇒ 42

(* (+ 10 11) 2) ⇒ 42

As in Common Lisp, we call our standard unit of interaction a form: a data object to be evaluated in order to produce one or more values. Forms can be categorized into three classes: self-evaluating forms, symbols, and lists [101]. Self-evaluating forms, as their name suggests, evaluate directly to a value, such as a number. Lists can be further divided into three categories: special forms, macro calls, and function calls.
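This evaluation model (numbers self-evaluate; lists apply their head to the evaluated arguments) can be illustrated with a few lines of Python. This is a toy interpreter for exposition only, not the WASP implementation:

```python
import operator

# Toy environment mapping operator symbols to Python functions.
ENV = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate(form):
    """Self-evaluating forms return themselves; lists are evaluated as function calls."""
    if isinstance(form, (int, float)):
        return form
    op, *args = form
    # Evaluate arguments recursively, then apply the operator.
    return ENV[op](*[evaluate(a) for a in args])

evaluate(["*", ["+", 10, 11], 2])  # -> 42
```

A real interpreter adds the special forms and macros described above, which control *whether and when* their arguments are evaluated rather than always evaluating them first.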

Variables Wasp supports the creation of lexical variables, that is, variables that are lexically bound to the context in which they are defined. Without special expressions, it is difficult to define a Wasp expression that persists state outside of a computation's scope. This semantic purity and avoidance of side effects align well with reproducibility goals and the trial-based nature of scientific computations. Variable definition is performed via the function define, as shown in Listing 3.1.

(begin
  (define my_assembly (velvet READS))
  (upload my_assembly)
)

Listing 3.1: Wasp/Lisp Lexical Variable Definition

Special Forms and Functions As each expression is evaluated, those that are not self-evaluating or variables are list forms, which can be special forms, macros, or functions. Most often, special forms are environment and control constructs, whereas macros and functions transform lists or values.

Expression  Description
+           Addition
-           Subtraction
*           Multiplication
/           Division
not         Inverse operator
>           Greater than
<           Less than
>=          Greater than or equal to
<=          Less than or equal to
equal?      Equality
eq?         Identity
length      Length
cons        Construct object
car         First
cdr         Rest
append      List append
list        List
list?       Is list
null?       Is null
symbol?     Is symbol
slice       List slice
if          If conditional
prog        Sequential evaluation

Table 3.1: Common Wasp/Lisp Supported Expressions

3.1.1.3 Wasp Expressions

Wasp defines special expressions to interact with specific functionality in the underlying AssemblyRAST framework. A fundamental difference between a pure Lisp implementation and the Wasp system lies in Wasp's abstraction of the value returned from the interpreter's list evaluation. A Wasp "value" is in fact an object structure we call a WaspLink, which contains additional metadata pertaining to an experiment's parameters and datasets. As such, Wasp expressions are available that are capable of specializing, extracting, and transforming "values" in a domain-specific fashion. Listing 3.2 shows the usage of the get and setparam expressions, which evaluate to return the WaspLink's dictionary object, best_k, and initialize a lexically scoped parameter hash_length, while invoking a function on the Wasp-reserved variable READS. All Wasp expressions are shown in Table 3.2.

(begin
  (define kval (get best_k (kmergenie READS)))
  (begin (setparam hash_length kval) (velvet READS))
)

Listing 3.2: Wasp Expressions

3.1.1.4 Plugin Expressions

For every program available in the AssemblyRAST environment, a plugin class file as well as a configuration/specification file must be created and placed in the plugins directory. Upon startup, all plugins are loaded into the Wasp environment, defining a corresponding reserved plugin variable in the global expression scope. A majority of plugin expressions evaluate their trailing list as an unordered list of WaspLink objects, where the type is one of five types returned by a plugin as default output:

• contigs

• paired-reads

• single-reads

• reference-sequence

• alignment

As such, plugins are written to expect certain types in the execution environment and also specify default output types. As an example, a plugin for the BWA aligner expects objects of type contigs and paired-reads in its Wasp execution context, and the Velvet assembler function evaluates to return an object of type WaspLink:

(bwa READS (velvet READS))

would have proper objects available for the plugin specification:

class BwaAligner(BaseAligner, IPlugin):
    def run(self):
        contigs = self.data.contigfiles
        reads = self.data.readsets

        ## Invoke commandline
        ...

Listing 3.3: Plugin with contigs and reads objects available. A class that inherits from the BaseAligner class expects contig files and read files to be available in the execution environment through the class properties self.data.contigfiles and self.data.readsets, respectively

3.1.1.5 Wasp Types and Type Conversion Functions

The Wasp interpreter reserves a set of keywords that are references to certain aspects of the global scope. Reads and reference sequences uploaded as initial input can be referenced by the values READS and REFERENCE, respectively. Functions are available to coerce a Wasp object to another type. This may be useful in cases where default output types are incompatible with the expected input types of another plugin (e.g. using assembled contigs produced from a short-read de Bruijn graph assembler as long-read inputs for a string graph assembler). The general form is:

(wasptype waspobject)

27 Only WaspLinks that have default output values of type Contigs, Paired, Single, or Reference can be converted. An example of such a Wasp definition is as follows:

(hybrid_assembler (single (shortread_assembler READS)) READS)

Expression            Description
contigs               Convert type to Contigs
paired                Convert type to Paired
single                Convert type to Single
reference             Convert type to Reference
arast_score           Perform arast scoring function on assembly
has_paired            Check if datasets contain paired end reads
has_short_read_only   Check if datasets only contain single end reads
n50                   Perform N50 scoring function on assembly
get                   Get non-default return value from a WaspLink
all_files             Get all files contained in a plugin's output directory
tar                   Tar/compress return value (if files)
print                 Print to STDOUT for debugging
upload                Upload files to AssemblyRAST data store

Table 3.2: Wasp contains three types of specialized extensions: type conversion, data analysis, and framework-level functions.
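Of the analysis functions listed, n50 computes a standard assembly metric: the contig length L such that contigs of length at least L together cover at least half the assembly. A reference sketch of the metric itself (not the Wasp implementation):

```python
def n50(contig_lengths):
    """Largest L such that contigs of length >= L cover half the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

n50([100, 80, 50, 30, 20])  # total 280; 100 + 80 = 180 >= 140 -> 80
```

Because N50 rewards a few long contigs regardless of correctness, recipes typically combine it with reference-aware scores such as those from QUAST or ALE.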

In the next section, we provide several examples of Wasp workflow definitions.

3.1.1.6 Recipe Use Cases

Run particular assemblers based on read types We can design a recipe with a level of automation to handle various types of read profiles and choose the appropriate assembler(s) accordingly. The example shown in Listing 3.4 and Figure 3.1 conditionally runs either a pipeline for de Bruijn graph assembly (error correction of reads via BayesHammer and assembly with Velvet and SPAdes) or long-read assembly, depending on the read length profile of the input.

(begin
  (if (has_short_reads_only READS)
    (prog
      (define pp (bhammer READS))
      (define vt (velvet pp))
      (define sp (spades pp))
      (define assemblies (list sp vt)))
    (define assemblies (list (miniasm READS))))
  (quast assemblies)
)

Listing 3.4: Recipe definition of read-specific assembly workflow

Figure 3.1: Flowchart of a read-specific assembly workflow

Compare contigs via Quast Here we provide an example of a workflow in which assemblies are provided by the user for contig-specific analysis (Listing 3.5).

(begin
  (define ale_analysis (ale CONTIGS))
  (define quast_analysis (quast CONTIGS))
  (tar ale_analysis quast_analysis :name contig_analysis)
)

Listing 3.5: Recipe definition of contig comparison

Autotuning Ensemble Assembler Workflow Next, we describe a more sophisticated assembly workflow:

1. Preprocess reads with BayesHammer error correction algorithm

2. Build k-mer histograms and estimate optimal value for k

3. Assemble preprocessed reads with Velvet, SPAdes, and IDBA (if reads have mate pair information)

4. Score assemblies based upon AssemblyRAST’s criteria.

5. Define the top two assemblies as Master and Slave sequences, and perform block merging via GAM-NGS.

6. Generate statistics via QUAST and upload back to the AssemblyRAST server

(begin
  (define pp (bhammer READS))
  (if (not (symbol? pp))
    (define pp READS))
  (define kval (get best_k (kmergenie pp)))
  (define vt (begin (setparam hash_length kval) (velvet pp)))
  (define sp (spades pp))
  (if (has_paired READS)
    (prog
      (define id (idba pp))
      (define assemblies (list id sp vt)))
    (define assemblies (list sp vt)))
  (define toptwo (slice (sort assemblies > :key (lambda (c) (arast_score c))) 0 2))
  (define gam (gam_ngs toptwo))
  (define newsort (sort (cons gam assemblies) > :key (lambda (c) (arast_score c))))
  (tar (all_files (quast (upload newsort))) :name analysis)
)

Listing 3.6: Recipe definition of an Autotuning Ensemble Assembly Workflow

3.2 Implementation

The AssemblyRAST service can be used to perform a computation on a given dataset using a single tool, or batch assembly using multiple tools, without compilation, installation, or knowledge of particular command-line flags or parameters. Furthermore, various file formats, compressed archives, library types (e.g. paired end, single end, reference), and parameter data (e.g. insert length, standard deviation) are properly handled for each respective module. The resulting assembly is automatically processed via a collection of analysis tools developed both internally and by the community, such as QUAST and, optionally, REAPR and ALE; the raw assembly data, corresponding analysis, and any log files or standard output from the modules are made available for download as well as hosted on the AssemblyRAST web server to be viewed through the web interface. While raw sequence reads can be assembled solely by using individual de novo assembly tools, many research efforts have shown optimal results when a multi-stage processing pipeline is used [107]. Generally, this workflow involves filtering, error correction, assembly, scaffolding, and post-processing. Thus, one of the main design tenets of the system is the ability to easily invoke or generate intricate pipelines in a user-friendly and relatively simplified fashion. The intricacies of various assembly tool commands are shown in Listing 3.7.

3.2.1 Interface Generalization

Bioinformatics tools and analysis methods are becoming abundant, but at the cost of proliferating command-line inconsistencies and non-traditional UNIX command-line invocations. Generally, for a particular tool class (e.g. assembler, aligner, preprocessing), the same required inputs are needed for invocation. For example, an assembler binary usually accepts as input a pair of read files, and optionally parameters for use within the assembly algorithm.

# Produce a hashtable using hash length of 29
velveth outpath 29 -fastq -shortPaired -separate reads1.fq reads2.fq

# Build and assemble a de Bruijn graph assembly
velvetg outpath -cov_cutoff 5 -min_contig_lgth 1000

# Assemble a set of reads with Kiki
kiki -i reads.fa -o outputfile -k 29

# Merge and convert to FastA
fq2fa --merge --filter reads1.fq reads2.fq reads.fa

# Assemble a set of reads with IDBA
idba -r reads.fa -o ~/output --maxk 32

Listing 3.7: Various assemblers and the commands to assemble reads. AssemblyRAST aims to simplify this by implementing tools as plugins with flag defaults, where optional overrides are available. While there are multiple entry points to invoke the AssemblyRAST service, as discussed in Section 3.2.2, example commands are shown using the commandline client arast.

arast run -f reads.fa --assemblers kiki velvet idba

Listing 3.8: Performing multiple assemblies using arast

3.2.2 Rapid Pipeline Design

Traditionally, genome assemblies are completed in a precise but manual manner, invoking certain tools in a linear fashion from the commandline. Depending on the level of scripting, this may impose unnecessary latencies, as users must monitor a stage's progress and invoke subsequent steps by hand. Furthermore, this practice of invoking from the commandline is counterproductive to maintaining a reproducible procedure. There are, however, ongoing efforts to create more advanced and cohesive scripted pipelines, with good results. Currently available genome assembly pipelines, such as A5 or JigSaw, are predetermined scripts in which the user has little choice or flexibility over settings or pipeline modules. Tweaking parameters or pipeline stages involves manually editing the pipeline scripts. Furthermore, attempts to hypothesize and find optimal parameter settings or pipeline configurations would involve multiple alterations of these scripts. With the AssemblyRAST pipeline invocation language, the user can supply a simple yet flexible string to generate the desired pipeline or pipelines.

arast run -f reads.fa --pipeline trim_sort tagdust velvet

Listing 3.9: Invoke a 3-stage pipeline with arast

arast run -f reads.fa --pipeline trim_sort tagdust kiki ?k=34

Listing 3.10: Passing assembler-specific parameters on the arast commandline

Oftentimes, it is unclear how a certain parameter configuration or pipeline stage affects a resulting assembly. While various preprocessing and filtering algorithms attempt to aid a downstream assembler in avoiding mistakes, they may actually be removing critical information that could prove useful. ARAST supports the generation of pipelines in which modules can be easily substituted or run with different configurations. Listing 3.11 is an example of the AssemblyRAST pipeline branching behavior. Each group of strings enclosed by quotation marks represents a program set. For each set S1, S2, ..., Sn, arast generates a batch of pipelines as the cartesian product of all program sets, or S1 × S2 × ... × Sn.

arast run -f reads.fa --pipeline "none trim_sort" tagdust "kiki ray idba velvet spades" sspace

Listing 3.11: Launching branching pipelines in arast commandline
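The cartesian-product expansion is straightforward to express with itertools; this is a sketch of the semantics only, not AssemblyRAST's scheduler code:

```python
from itertools import product

def expand_pipelines(program_sets):
    """Expand program sets S1 x S2 x ... x Sn into concrete pipelines."""
    return [list(pipeline) for pipeline in product(*program_sets)]

sets = [["none", "trim_sort"], ["tagdust"],
        ["kiki", "ray", "idba", "velvet", "spades"], ["sspace"]]
expand_pipelines(sets)  # 2 * 1 * 5 * 1 = 10 pipelines
```

The command in Listing 3.11 thus launches ten distinct four-stage pipelines from a single invocation.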

Figure 3.2: Pipeline Branching

3.2.3 Universal Hyperparameter Search Driver

Certain assembly tools, such as IDBA or VelvetOptimiser, employ an iterative parameter approach with beneficial results. While these tools generally iterate over a single parameter such as k-mer length [80][6], arast is built on a framework that allows for more universal parameter sweeps and optimization. As shown below, a user can invoke a parameter sweep with a step interval for a program without needing to develop auxiliary helper programs (e.g. VelvetOptimiser). This is achieved through proper plugin definition in the AssemblyRAST framework.

arast run -f reads.fa --pipeline ’kiki ?k=29-81:4’

Listing 3.12: For any AssemblyRAST plugin that exposes an integer-based parameter, a user can invoke a parameter sweep from the arast commandline. Parameter sweeps generate first-class members within a pipeline stage and thus can be used in the same way to achieve cartesian product pipeline combinations.

Figure 3.3: Parameter Sweeps With Pipeline Combinations
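The sweep syntax shown in Listing 3.12 ('?k=29-81:4', i.e. start-end:step) can be expanded with a small parser. Here parse_sweep is a hypothetical helper written to match that syntax, not a function from the arast client:

```python
def parse_sweep(spec):
    """Parse '?key=start-end:step' into (key, [values]); step defaults to 1."""
    key, _, rng = spec.lstrip("?").partition("=")
    bounds, _, step = rng.partition(":")
    lo, _, hi = bounds.partition("-")
    step = int(step) if step else 1
    return key, list(range(int(lo), int(hi) + 1, step))

parse_sweep("?k=29-81:4")  # -> ("k", [29, 33, 37, ..., 77, 81])
```

Each generated value then becomes one member of the pipeline stage, exactly as if the user had listed fourteen separate kiki invocations by hand.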

For example, one may wish to study the relationship between error correction, read length trimming, and the k-mer size of a de Bruijn graph assembler. The command in Listing 3.13 will generate the pipelines represented in Figure 3.3.

arast run -f reads.fa --pipeline 'none trim_sort bhammer' 'kiki ?k=29-59:10'

Listing 3.13: Combining the concepts of pipeline branching and parameter sweeps.

3.2.4 Logic-Driven Assembly

Many approaches take a multistep optimization approach to assembly, employing parameter sweep iterations in order to capture the best features resulting from specific settings. IDBA iterates over a range of k-values, and VelvetOptimiser optimizes the Velvet assembler over k, expected coverage, and coverage cutoff settings. These methods have proven to be valuable preprocessing approaches that optimally tune data and parameters for downstream heuristics. However, given the complexity of the program interfaces, such optimization stages often do not generalize to any assembly pipeline. As described in Section 3.1.1, AssemblyRAST provides a declarative language to easily define logical workflows that are capable of inter-pipeline decision making and thus can support the aforementioned strategies of parameter tuning.

3.2.5 Data Types

Read data generated from sequencers is ordinarily stored in the FastA and FastQ formats, the latter containing quality information. Due to the plain-text ASCII format, data is often compressed with common compression algorithms. Automatic detection and decompression of these files is supported on the compute side, alleviating a fair amount of network load when used. Paired end data is essential to many genome processing tools. However, the syntax for conveying paired library data at invocation remains highly variable.
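Automatic decompression can be implemented by sniffing magic bytes rather than trusting file extensions. A sketch (the helper name open_reads is ours, not an AssemblyRAST API):

```python
import bz2
import gzip

# Leading magic bytes for the two common compressors.
MAGIC = {b"\x1f\x8b": gzip.open, b"BZh": bz2.open}

def open_reads(path):
    """Open a reads file, transparently decompressing gzip or bzip2 input."""
    with open(path, "rb") as fh:
        head = fh.read(3)
    for magic, opener in MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt")   # text-mode handle over decompressed bytes
    return open(path, "r")
```

Sniffing bytes keeps the behavior correct even when a gzipped file is uploaded with a bare .fastq name.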

# Running a Velvet assembly with read-specific parameters
velveth out 31 -shortPaired -fastq -separate reads_1.fastq reads_2.fastq
velvetg out -exp_cov auto -scaffolding no -ins_length 335 -ins_length_sd 35

Listing 3.14: Velvet using paired end data

DATA
PE= pe 180 20 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq
JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq
OTHER=/FULL_PATH/file.frg
END
PARAMETERS
GRAPH_KMER_SIZE=auto
USE_LINKING_MATES=1
LIMIT_JUMP_COVERAGE=60
CA_PARAMETERS=ovlMerSize=30 cgwErrorRate=0.25 ovlMemory=4GB
NUM_THREADS=64
JF_SIZE=100000000
END

Listing 3.15: Configuration file needed to invoke the MaSuRCA assembler

## A5 Library Config
[LIB]
id=raw1
p1=reads_1.trimmed.fastq
p2=reads_2.trimmed.fastq
rc=0
ins=199
err=0.95
nlibs=1

Listing 3.16: A5 using paired end data

AssemblyRAST is able to correctly handle and appropriately pass explicit paired end library instructions to receiving modules. The service also partially supports paired end library inference based on file names.

arast run --pair read1.fa read2.fa ?ins=335 ?stdev=35 -a velvet masurca a5

Listing 3.17: Assembler-specific parameters are unified into one interface. A user can specify information about a dataset, such as mate-pair insert size and standard deviation, once on the arast commandline, and it will be propagated to the assemblers.

3.2.6 Analysis Framework

Automatic and rapid analysis within the AssemblyRAST workflow is necessary on two fronts. First, intra-pipeline, concurrent analysis is necessary for the meta-algorithmic feedback, mentioned previously, that dictates the overall stages of the pipeline. Second, a thorough statistical analysis of pipeline results is returned to the user alongside visualizations generated from the data. This allows users quick access to informative metrics that give insight into the overall quality of the assembly.

3.3 System Design and Infrastructure

The AssemblyRAST infrastructure consists of five separate components, all of which can exist on one machine or many:

Control server: a RESTful frontend that listens for client calls, performs authentication checks, analyzes requests, populates the job queue, and hosts the web frameworks.

Data repository: uses Shock technology; permanently stores raw data, computed data, and userspace files.

Job queue: manages different work queues in which job distribution is determined by rules pertaining to worker and job classifications.

Compute worker system(s): various system types running the AssemblyRAST compute framework.

Client: various entry points into the AssemblyRAST system, including a command line interface (CLI), RESTful API, KBase, PATRIC, and web UI.

3.3.1 AssemblyRAST Control Plane

The control server acts as the main entry point for any actions or requests to the AssemblyRAST system. The server uses CherryPy, a simple web framework for Python, to implement a RESTful interface. The assembly, processing, and analysis of genomes is computationally expensive, requiring considerable resources and time; scalability as well as a robust queueing system are therefore necessary features. AssemblyRAST features a compute node monitor as well as a queuing system that employs RabbitMQ, a framework that implements the Advanced Message Queuing Protocol (AMQP). By monitoring the overall system load and queue size, an adequate number of

Figure 3.4: The AssemblyRAST Infrastructure

VMs can be ensured as new compute images can be launched and terminated within the local OpenStack cloud.
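The scaling policy itself is not spelled out above, so the sketch below shows one plausible decision rule; the function name, thresholds, and defaults are assumptions for illustration, not the production logic:

```python
def scale_decision(queue_depth, busy_workers, total_workers,
                   jobs_per_worker=2, min_workers=1, max_workers=16):
    """Return the number of worker VMs to launch (positive) or
    terminate (negative) given current queue depth and utilization."""
    # Workers wanted: enough to drain the queue, within the pool bounds.
    wanted = max(min_workers, -(-queue_depth // jobs_per_worker))  # ceil div
    wanted = min(wanted, max_workers)
    if wanted > total_workers:
        return wanted - total_workers          # launch new compute images
    idle = total_workers - busy_workers
    return -min(idle, total_workers - wanted)  # terminate only idle images
```

A real monitor would also damp oscillation (e.g., cooldown periods) before terminating instances.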

API Server

curl -X GET http://www.kbase.us/service/assembly/user/johndoe/job/42

Listing 3.18: The AssemblyRAST Service exposes a RESTful interface.

Job Scheduler When a job request is submitted, the server creates a database record and determines the queue in which the job will be placed based on details such as data size, data format, organism type, or explicit request parameters. This type of job delegation is important for the imminent future, when the need to handle thousands of microbial genomes as well as massive metagenomes becomes apparent.

User Management While the AssemblyRAST server is open to the public, we use Globus Online's Nexus OAuth2 API for authentication, as it offers a robust solution for user registration and enables the creation of unique userspaces for data and record keeping. Network speed is yet another bottleneck for such a service; a userspace allows data reuse if the user wishes to repeat an assembly job or attempt an alternate pipeline approach. Configuration of the control plane allows for user job limits as well as special queueing rules for specified users (Listing 3.22). Jobs can be terminated by the job owner at the job level or user-wide, the latter terminating all of the user's jobs and pruning the global queue. Status check commands allow a user to poll for job completion if an AssemblyRAST job is part of a larger script (Listing 3.20).

arast run --data 42 --assemblers kiki velvet

Listing 3.19: Previously uploaded datasets can be reused within the AssemblyRAST service.

while True:
    status = check_output('arast stat --job 123')
    if status == 'Complete':
        break
    sleep(60)

Listing 3.20: Polling Status for Completion

| Job ID | Data ID | Status             | Run time | Description |
|     83 |      40 | pipeline [success] | 0:22:29  | None        |
|     92 |      40 | Running: [4%]      | 0:00:22  | pch1        |
|     93 |      41 | pipeline [success] | 0:00:03  | pch1        |
|     94 |      44 | pipeline [success] | 0:22:08  | bruc        |

Listing 3.21: Users can observe job queue status.

[
  { "user": "superuser_1",    "job_limit": -1 },
  { "user": "limited_user_1", "job_limit": 3 }
]

Listing 3.22: AssemblyRAST Control Plane User Configuration
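A job-limit check against this configuration might look like the sketch below; the `default_limit` for unlisted users is an assumption for illustration, since the service's actual default behavior is not specified here:

```python
import json

def can_submit(user, active_jobs, config_json, default_limit=5):
    """Return True if the user may start another job; a job_limit of -1
    is treated as unlimited, matching the configuration format above."""
    limits = {entry["user"]: entry["job_limit"]
              for entry in json.loads(config_json)}
    limit = limits.get(user, default_limit)
    return limit < 0 or active_jobs < limit
```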

3.3.2 Workers

The compute runtime is designed to function as a redundant standalone module of the AssemblyRAST system, deployable on varying system types and architectures. As we continue to explore the performance suitability of different hardware profiles, the flexibility to assign specific workflows to particular compute systems is key to maximizing computational resource efficiency. Currently, numerous big-memory compute VMs are deployed to process standard assembly jobs, and implementations on specialized hardware, such as Convey FPGA-accelerated hybrid-core servers and NERSC supercomputing clusters, are under development for sequence alignment and metagenomic assembly computation, respectively. Volunteer computing is also a possible future target.

Plugin Framework The submodular functionality of the compute runtime depends on core bioinformatics tools designed to run directly in a CLI-to-Unix-process manner. A plugin framework is thus necessary to abstract away the variability in these component invocations and to function harmoniously with the meta-algorithmic workflows in our computations.

class KikiAssembler(BaseAssembler, IPlugin):
    def run(self, reads):
        cmd_args = [self.executable, '-k', self.k,
                    '-o', self.outpath + '/kiki', '-i']
        cmd_args += self.get_files(reads)
        self.arast_popen(cmd_args)
        return glob.glob(self.outpath + '/*.contig')

Listing 3.23: Example Assembler Plugin

The framework features a plugin class that is further extended into subclasses for assemblers, preprocessing tools, scaffolders, and possibly other categories for alternate behaviors. By inserting modules into the plugin interfaces, output files, logging, benchmarking, and statistics become available to the Wasp plugin engine. Module properties and default command line flags or parameters are set within the plugin configuration file, and tool-specific scripting logic is defined in the plugin subclass.

[Core]
Name = kiki
Module = kiki

[Settings]
executable = /usr/bin/ki
filetypes = fasta,fa,fastq,fq
k = 29
contig_threshold = 1000

Listing 3.24: Example Plugin Configuration File
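Because the file follows standard INI conventions, loading it reduces to a few lines with Python's configparser; a minimal sketch (the function name is illustrative):

```python
import configparser

def load_plugin_settings(path):
    """Return the [Settings] section of a plugin configuration file as a
    dict, so defaults such as k or contig_threshold reach the plugin."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser["Settings"])
```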

Data Handlers As with any scientific experiment, genome assembly procedures must be reproducible. Moreover, with the frailties of software designs and patchy script workflows prominent in assembly pipelines, careful bookkeeping is paramount for accurate genome assembly verification and further analysis. Finally, extrinsic analysis of hardware performance requires additional intra-computational benchmarking and data collection.

Sequencing data is commonly produced in a relatively inefficient plain text format. This, coupled with the high throughput, is cause for concern for storage space as well as for data I/O within the workflow itself. As noted earlier, users can reuse previously uploaded data, as it is permanently stored in the data repository. The compute runtime also checks whether the requested data is already present on the local node, in case the same worker handled that data set in prior jobs. If no cached copy exists on the node, the data is transferred accordingly.

A large fraction of the running time of bioinformatics programs is attributed to disk I/O. Furthermore, passing intermediate processing states between fundamentally separate binaries cannot be solved easily, especially when an analytical scoring of a particular state is needed to set the invocation parameters of the next stage. For example, a meta-algorithm's invocation of a gap-closing stage may depend on the N50 score of an assembler's contig output. An N50 score requires a completely finished and sorted list of contigs, and storing all data in memory may not be feasible; thus each intermediate state must involve disk I/O. Because of the iterative and sometimes exhaustive workflows generated by the service, data efficiency is paramount, and intermediate states are reused when possible. Currently, RAM drives are being tested and may offer a boost in throughput, though developing a more intraprocess approach is of interest for future work. Finally, workflow disk space requirements are predicted, and a garbage collection agent is employed to ensure sufficient space.
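The node-local cache check can be sketched as below. This is an illustrative stand-in (a simple file copy keyed by data ID) for the actual Shock transfer logic, which is not shown here:

```python
import os
import shutil

def fetch_data(data_id, repository_path, cache_dir):
    """Return a local path for a dataset, transferring it only when no
    cached copy from a prior job exists on this worker."""
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, str(data_id))
    if os.path.exists(cached):
        return cached                      # reuse the earlier transfer
    shutil.copy(repository_path, cached)   # stand-in for a network fetch
    return cached
```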

3.3.3 Availability

3.3.3.1 Local

The primary command line interface client, "arast", defaults to communicating with the main DOE-hosted AssemblyRAST Service. However, users are also able to build and deploy local AssemblyRAST servers. All source code is available at http://www.github.com/cbun/assembly.

3.3.3.2 KBase

The AssemblyRAST service has been fully developed, has undergone deployment and robustness testing by the KBase development team, and, as of February 2013, is released as a “KBase Labs” module, freely available for use by the general public. The service has been used to assemble over 100 USDA Brucella genomes to be annotated by the RAST system. KBase offers a unique "Narrative" interface, allowing a user to perform analyses within a notebook-like environment. It is publicly available at http://kbase.us. The scalable infrastructure allows for high throughput in assembly pipelines and gives users an ideal environment for studying assembly algorithms at a larger scale than conventional desktop machine experiments. Here we present a few examinations of de novo assembly pipelines and methods. Figures 4.1 and 3.5 offer examples of the types of visualizations generated by the AssemblyRAST server using third-party (QUAST, MUMmer) and built-in tools.

Figure 3.5: Relative ALE Scores of V. cholerae assembly

Figure 3.6: AssemblyRAST Web Interface

Figure 3.7: The AssemblyRAST Web Interface facilitates user-friendly pipeline design

Chapter 4

ASSEMBLER PROFILING AND OPTIMIZATION

The next-generation sequencing technologies described in Chapter 2 have enabled extraordinary throughput and, through multiplexing, the possibility of sequencing dozens of microbial genomes concurrently. However, because the cost of library preparation has not fallen with the cost of sequencing, it is becoming increasingly common to provide a single library for genome assembly. Given the prolific number of assembler tools currently available, it is necessary to evaluate and compare different approaches against varying datasets. While other representative studies compared the assemblies of various assemblers [66], we additionally tested various pipeline and parameter configurations to evaluate the efficacies of different processing stages, such as quality trimming, error correction, misassembly detection, and gap closing, in order to develop a better understanding of their relationships. In this chapter, we describe a set of experiments that attempt to provide a comprehensive analysis.

4.0.1 Data

The four data sets with corresponding reference genomes were the same used in the GAGE-B [66] study and are available on the study's website (http://ccb.jhu.edu/gage_b/):

• Bacillus cereus MiSeq, from the Illumina website. Reference: Bacillus cereus ATCC 10987 (NC_003909, NC_005707).

• Staphylococcus aureus HiSeq: SRR569301. Reference: Staphylococcus aureus USA300_TCH1516 (NC_010063, NC_010079, NC_012417).

• Vibrio cholerae: SRA037376. Reference: Vibrio cholerae O1 biovar eltor str. N16961 (NC_002505, NC_002506).

• Rhodobacter sphaeroides HiSeq: SRR522244. Reference: Rhodobacter sphaeroides 2.4.1 (NC_007488, NC_007489, NC_007490, NC_007493, NC_007494, NC_009007, NC_009008).

While the Bacillus cereus and Rhodobacter sphaeroides reference genomes were of a representative strain for the read libraries, the V. cholerae and S. aureus references were of different strains; however, Magoc et al. [66] were able to infer sufficient strain similarity by mapping library reads to the genome. The genome of B. cereus consists of one chromosome and one plasmid, S. aureus contains one chromosome and two plasmids, V. cholerae has two chromosomes, and R. sphaeroides consists of two chromosomes and five plasmids.

4.0.1.0.1 Preprocessing Typically, raw data from sequencers contains low quality sequences, contaminants, and adapter sequences that should be discarded. A variety of tools that filter and trim these types of reads are available, and we employed some of them on the data sets to improve assemblies. For the HiSeq data sets of V. cholerae and R. sphaeroides, we used the trimmed sets available from GAGE-B; Magoc et al. used the ea-utils package to remove adapters and perform q10 quality trimming. For the S. aureus data set, we used raw data. We found that MaSuRCA, IDBA-UD, and accordingly A5 became unstable on data sets with highly variable read lengths, characteristic of MiSeq data. Thus, for the B. cereus MiSeq data set, in addition to quality trimming and adapter removal, we filtered out any reads that were not within the range of 150-200 base pairs in length.

4.0.2 Evaluation Metrics

To determine accuracy of assemblies, the following metrics were measured:

Number of contigs : The number of assembled contigs longer than 500bp.

N50 : As described in section 6.1, the length of the shortest contig such that 50% of the total assembly length is contained in contigs of that length or longer.

Nx : Analogous to N50, but using a given x% in the calculation. This is useful for graph visualization across all values of x.
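The Nx family of metrics reduces to a short computation over the sorted contig lengths; a minimal reference implementation:

```python
def nx(contig_lengths, x=50):
    """Return the Nx value: the length of the shortest contig such that
    contigs of at least that length contain x% of the total assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    target = sum(lengths) * x / 100.0
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0
```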

For data sets with an available reference genome, structural variation metrics could also be measured:

Misassemblies : Using the reference genome for comparison, we sum the total number of relocations, translocations, and inversions detected. For a contig C, misassembly point m, and left and right flanking sequences L_m and R_m, a relocation misassembly breakpoint occurs at m when L_m and R_m align to the same reference chromosome but over 1000bp apart, or overlap by 1000bp. Inversions are defined as errors in which L_m and R_m do not qualify as a relocation and align to opposite strands of the chromosome. Finally, a translocation occurs when L_m and R_m align to different chromosomes.

Substitutions : Mismatched base pairs in all alignments.

Indels : Insertions or deletions in all alignments.

NGA50 : A “corrected” version of N50, where contigs are broken into blocks at alignment misassemblies, block length is used instead of contig length, and the reference genome length is used in place of the total assembly length.

NGAx : Analogous to NGA50, where x% is instead used in the calculation.

These metrics were calculated using QUAST 2.1 [39]. For data sets without an available reference genome, reference-free assessment methods were used:

ALE Likelihood : Likelihood of an assembly given the initial read library, as described in section 7.2

REAPR Misassemblies : An error calling method which measures the difference between observed and expected fragment coverage distributions at each base pair, as described in section 6.1.2.1.
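The two reference-based ideas above, breaking contigs into aligned blocks and classifying breakpoints, can be sketched in simplified form. These are toy illustrations, not QUAST's implementation; flanking alignments are reduced to (chromosome, strand, position) tuples:

```python
def nga50(block_lengths, reference_length):
    """NGA50: N50 computed over misassembly-broken aligned blocks and
    measured against the reference length rather than the assembly length."""
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running >= reference_length / 2.0:
            return length
    return 0  # aligned blocks cover less than half the reference

def classify_misassembly(left, right, threshold=1000):
    """Classify a breakpoint from the alignments of the left and right
    flanking sequences L_m and R_m, using the 1000 bp convention."""
    l_chrom, l_strand, l_pos = left
    r_chrom, r_strand, r_pos = right
    if l_chrom != r_chrom:
        return "translocation"
    if abs(r_pos - l_pos) > threshold:
        return "relocation"               # same chromosome, far apart
    if l_strand != r_strand:
        return "inversion"                # opposite strands, not a relocation
    return "consistent"
```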

4.0.3 Programs

The following assemblers were used in the comparison:

• Velvet [123]: de Bruijn graph assembler designed for short read sequencing technologies.

• Kiki (Xia et al. 2012)

• IDBA-UD [80] - Iterative de Bruijn graph Assembler: Iterates over k-mer sizes to capture the strengths of both small and large parameters.

• SPAdes [6]: Features a multi-sized de Bruijn graph approach. SPAdes typically combines a preprocessing step using the BayesHammer error correction tool and a misassembly detection postprocessing tool. For these assemblies, we isolated the SPAdes assembly step.

• MaSuRCA - Maryland Super Read Cabog Assembler: Uses a combination of both overlap-layout-consensus (OLC) and de Bruijn graph assembly algorithms.

• A5 [107]: a pipeline which uses a mixture of open source tools, their own algorithms, as well as dynamic parameter configuration. Namely, it uses preprocessing and error correction from SGA [98], the IDBA-UD assembler, and SSPACE to scaffold. For the purpose of these comparisons, we use contigs before scaffolding.

• *A6: We modified the A5 pipeline to filter read lengths and ensured proper quality encoding detection for stability.

4.0.4 Comparison

While the most often used metric for assembly quality is N50, it can be misleading, as an assembly containing many misassemblies will produce an inflated N50 score. For assemblies with a corresponding reference genome, a more accurate NGA50 score can be produced, which measures aligned blocks that are broken at misassemblies, rather than contig lengths, and is normalized over the length of the reference genome. We summarize the NGA50 scores and errors of all assemblies in Tables 4.1 and 4.2. For the B. cereus HiSeq data set, the reference was found to be too divergent by mapping, so the N50 scores of its assemblies are reported instead.

Organism              MaSuRCA  Kiki   Velvet  IDBA   SPAdes  A5
B. cereus HiSeq       52644    59995  42763   31347  78420   45935
S. aureus             22603    1854   11540   34957  50888   8188
V. cholerae HiSeq     59028    42804  47191   70796  177768  72282
V. cholerae MiSeq     50207    70738  19767   44178  198488  57376
R. sphaeroides HiSeq  66418    3893   33342   72357  71175   20356*
R. sphaeroides MiSeq  -        33589  62923   60228  126502  83693

Table 4.1: Comparison of NGA50 assembly scores for various genomes. Best scores are bolded.

Organism              MaSuRCA  Kiki  Velvet  IDBA  SPAdes  A5
S. aureus             7        18    11      4     15      15
V. cholerae HiSeq     1        0     10      3     2       0
V. cholerae MiSeq     35       66    10      4     7       11
R. sphaeroides HiSeq  7        32    6       3     9       8*
R. sphaeroides MiSeq  -        40    7       2     9       2

Table 4.2: Comparison of misassemblies for various genomes.

For all assemblers, default or “auto” settings were used, except for Velvet, for which we performed k-mer searches on most of the datasets. Where applicable, the best assembly score was chosen, and the hash length k is reported in Table 5.1. Three of the data sets, B. cereus MiSeq and R. sphaeroides HiSeq and MiSeq, matched the reference genomes precisely, and while the V. cholerae and S. aureus reference genomes were of a different strain, they were inferred to be similar enough to use for the data sets. Errors detected in the latter sets may represent true variation.

For the S. aureus HiSeq assemblies, SPAdes generated the largest NGA50 score, followed by IDBA-UD and MaSuRCA. IDBA-UD, however, contained the fewest misassemblies, and SPAdes contained one of the highest counts, mainly relocations as described in Section 4.0.2. For the V. cholerae HiSeq assemblies, SPAdes generated the largest NGA50 score at 177kb, more than doubling the next largest score of Velvet at 79kb. IDBA-UD, A5, and SPAdes contained few misassemblies at 4, 5, and 7, respectively. Kiki and MaSuRCA contained a

Figure 4.1: NGAx plot of V. cholerae assembly

NGAx 600

500

400

300

200 Contig length (kbp)

100

0 0 20 40 60 80 100 x

P1_Ma P3_Vt P5_Sp P6_A5 P2_Ki P4_Ia

relatively high number of misassemblies, at 59 and 60, respectively. All contigs generated covered at least 94% of the genome. Similarly, for the MiSeq dataset, SPAdes performed the best with 198kb, followed by Kiki at 70kb. For the R. sphaeroides HiSeq assemblies, IDBA-UD generated the largest NGA50 score at 72kbp, followed closely by SPAdes at 70kbp. Accordingly, IDBA-UD produced the fewest misassemblies.

4.0.5 Methods

This section lists the commands generated by AssemblyRAST.

Assembly                   MaSuRCA  Kiki     Velvet   IDBA-UD  SPAdes
Contigs (≥ 1000 bp)        125      1228     520      140      133
Total length (≥ 1000 bp)   4447475  4211242  4523850  4490529  4604439
Reference length           4603060  4603060  4603060  4603060  4603060
N50                        74831    4406     14135    73097    71177
NGA50                      66418    3893     12905    72357    71175
Misassemblies              7        32       14       3        9
Local misassemblies        12       5        1322     4        7
N's per 100 kbp            0.00     0.00     3249.32  0.00     8.76
Mismatches per 100 kbp     31.34    32.95    8.54     4.16     12.11
Indels per 100 kbp         5.54     6.99     23.91    3.58     5.75

Table 4.3: Various statistics for the assembly of R. sphaeroides HiSeq data. IDBA-UD generated the most contiguous set while also producing the fewest misassemblies.

4.0.5.0.1 S. aureus assemblies

# Command given to AssemblyRAST client:
ar_run --pair frag_1.fastq frag_2.fastq insert=180 stdev=20 \
    -a masurca kiki velvet idba spades a5

====== Pipeline1: [masurca] ======
runSRCA.pl config.txt
bash assemble.sh

# Contents of config.txt:
PATHS
CA_PATH = /home/ubuntu/assembly/bin/MaSuRCA-2.0.0/CA/Linux-amd64/bin/
JELLYFISH_PATH = /home/ubuntu/assembly/bin/MaSuRCA-2.0.0/bin/
SR_PATH = /home/ubuntu/assembly/bin/MaSuRCA-2.0.0/bin/
END

PARAMETERS
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 1
KMER_COUNT_THRESHOLD = 1
LIMIT_JUMP_COVERAGE = 60
NUM_THREADS = 8
JF_SIZE = 2000000000
EXTEND_JUMP_READS = 0
CA_PARAMETERS = ovlMerSize=30 cgwErrorRate=0.25 ovlMemory=4GB
DO_HOMOPOLYMER_TRIM = 0
END

DATA
PE: p1 180 20 frag_1.fastq frag_2.fastq
END

====== Pipeline2: [kiki] ======
ki -k 29 -i frag_1.fastq frag_2.fastq -o kiki

====== Pipeline3: [velvet] ======
velveth OUT 29 -shortPaired -fastq -separate frag_1.fastq frag_2.fastq
velvetg OUT -exp_cov auto -ins_length 180 -ins_length_sd 20

====== Pipeline4: [idba] ======
fq2fa --merge --filter frag_1.fastq frag_2.fastq frag_1.idba_merged.fa
idba_ud -r frag_1.idba_merged.fa -o run --maxk 50

====== Pipeline5: [spades] ======
spades.py -1 frag_1.fastq -2 frag_2.fastq --only-assembler -o OUT

====== Pipeline6: [a5] ======
a5_pipeline.pl frag_1.fastq frag_2.fastq a5

Listing 4.1: Assembly commands

4.0.6 Discussion

The data reported in Tables 4.1, 4.2, and 4.3 shed some light on the overall performance of the standalone assembly tools. We found that the SPAdes and IDBA assemblers tended to produce assemblies with the largest NGA50 scores. Both of these assemblers employ a multi-sized k-mer approach, so it would appear that such a strategy should be built upon in future iterations of assemblers. One point to note, however, is that SPAdes does incorporate ambiguous base pairs, or “N's”, into assembled contigs, where all the others will not concatenate such uncertainties. While these incidents are fairly rare in comparison to scaffolding steps, it is currently unclear how this may affect contiguity scores such as N50 and NGA50. MaSuRCA was shown to be a promising assembly method by Magoc et al., but in our experiments the software was unstable using default or auto settings. Further exploration must be performed, as this assembler's potential has not yet been fully realized.

4.1 Pipelines

In addition to the many assembly tools, the computational biology community has created a diverse set of standalone preprocessing and post-assembly tools in hopes of improving both the act of assembly as well as the correction, or “polishing”, of the assembly produced. While many assemblers come packaged or hard-coded with these pre- and post-processing steps, we considered various combinations of tools for better insight into how differing approaches to processing perform when coupled with alternative methods of assembly.

4.1.1 Preprocessing

4.1.1.1 Error Correction

In Illumina sequencers, substitution (base calling) errors occur at rates of 0.5-2.5% [48]. In other platforms, such as Ion Torrent and 454, insertions and deletions due to homopolymer and carry-forward errors are common [119]. These mistakes cause problems in all assembly methods, creating ambiguous or spurious paths and overlaps. Various methods to correct these errors have been developed. Because of the prominent usage of Illumina technology, most of these methods target only substitution errors; as new sequencing usage patterns emerge, the need to target other error profiles is becoming evident. Additionally, because the final genome assembly is generally the subject of scrutiny, comparative analysis of error correction accuracy is often overlooked; such an elemental step requires deeper examination. Yang et al. attempt to classify the different approaches and perform a thorough evaluation. Likewise, we classify them here.

4.1.1.1.1 k-spectrum In these error correction methods, reads are first broken down into their constituent k-mers. With ample coverage and an appropriate value for k, erroneous k-mers can be inferred and corrected by measuring Hamming distances to a consensus k-mer. To explain further, a k-mer is deemed trusted if it occurs more than a given number of times, and untrusted otherwise. An untrusted k-mer is matched to a trusted k-mer if it meets a desired Hamming distance threshold, and is thus corrected to conform to the consensus. Certain tools [48] incorporate available read quality scores into the weighting of trusted k-mers. k-spectrum-based correction works well for substitution errors and avoids any expensive multiple sequence alignment (MSA) computation; this class is consequently popular, with substitution error being most common in Illumina reads. Popular tools using this strategy are SOAPdenovo [59], Quake [48], SGA [98], and Euler-SR [13].
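A toy version of the k-spectrum idea makes the mechanics concrete. The sketch below is illustrative (not any of the cited tools): k-mers are trusted by count, and each untrusted k-mer is corrected to a trusted neighbor at Hamming distance one, if one exists:

```python
from itertools import product

def correct_read(read, kmer_counts, k=4, min_count=3):
    """Replace each untrusted k-mer in the read with a trusted k-mer
    (count >= min_count) at Hamming distance at most one, if any exists."""
    bases = list(read)
    for i in range(len(bases) - k + 1):
        kmer = "".join(bases[i:i + k])
        if kmer_counts.get(kmer, 0) >= min_count:
            continue  # already trusted
        for pos, sub in product(range(k), "ACGT"):
            candidate = kmer[:pos] + sub + kmer[pos + 1:]
            if kmer_counts.get(candidate, 0) >= min_count:
                bases[i:i + k] = candidate  # adopt the consensus k-mer
                break
    return "".join(bases)
```

Real tools additionally weight candidates by count and quality scores rather than accepting the first trusted match.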

4.1.1.2 Artifact Removal

One unfortunate side effect of high throughput sequencing is the presence of lingering non-essential artifacts due to linkers and adapters used in the initial library construction. Though removal of known library sequences from reads is a trivial computation, these sequences can undergo the same error generation as actual genome data. TagDust employs fuzzy string-matching to identify and remove true artifacts from reads [57].

4.1.2 Postprocessing

4.1.2.1 Scaffolding

Once contigs are built via assembly algorithms, the process of scaffolding attempts to place the contigs in the correct order. This is generally done using paired end read insert size information: when a minimum number of paired end reads can be mapped and matched across two separate contigs, and the insert size is large enough to span the total sequence distance, a connection can be inferred.

Popular scaffolding tools include GRASS [37], SOPRA [23], Opera [31], SSAKE [114], Bambus [?], and SSPACE [7]. Recently, Bambus2 [54] has been developed, aimed at scaffolding metagenome sequencing projects.
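The link inference described above can be sketched as a simple counting step. The sketch below is illustrative, not any of the cited tools; it calls a link when enough read pairs span the same two contigs:

```python
from collections import Counter

def infer_links(pair_mappings, min_pairs=3):
    """Given (contig_of_read1, contig_of_read2) mappings for each read
    pair, return contig pairs supported by at least min_pairs spanning
    pairs; insert-size filtering would happen before this step."""
    spanning = Counter()
    for c1, c2 in pair_mappings:
        if c1 != c2:                       # only pairs that bridge contigs
            spanning[tuple(sorted((c1, c2)))] += 1
    return sorted(pair for pair, n in spanning.items() if n >= min_pairs)
```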

4.1.2.2 Gap Closing

One gap closing method, seen in IMAGE and SOAPdenovo, uses paired end data to extend contigs and close gaps after scaffolding has generated a supercontig. By mapping the paired end reads back to the supercontig, the algorithm finds pairs in which one read maps to the contig and the other to a gap. This is done iteratively, building small islands of contiguous reads within gaps until the gaps are closed or the matched-read pool is exhausted [103]. Subsequent scaffolding may generate new gap data for additional iterations of gap closing.

4.1.3 Results

We investigated the following processing steps by invoking pipelines on the V. cholerae HiSeq dataset:

• Ea-Utils Q10 Trimming (Trim)

• SGA Preprocessor (SGAp)

• SGA Error Correction

• Bayes Hammer Error Correction (BH)

• Tagdust adapter removal (TD)

• Read length filtering

For assemblies using the Ea-Utils trimming step, we used previously processed data from GAGE-B. The NGA50 and misassembly performances for various combinations are shown in Table 4.4 and Table 4.5. For this dataset, SGA error correction had little to no effect on downstream assembly, so it was omitted from the tables. For the IDBA-UD assembly, the raw data set produced the largest NGA50 score at 72kbp as well as the lowest misassembly count at 3. Notably, IDBA-UD performed worst when using data corrected by BayesHammer, while other processing steps performed only slightly worse than raw data. MaSuRCA was unable to complete some of our pipeline configurations, but for those that completed, it seemed generally unaffected by the varying combinations. Kiki performed well using a combination of SGA preprocessing, BayesHammer, and Tagdust, with an NGA50 of 47kbp, more than 12 times larger than its assembly of raw data. Finally, SPAdes generated an NGA50 score nearly three times larger with q10 trimming and Tagdust than with raw data.

4.1.3.1 Discussion

Because many assemblers are written as miniature pipelines in which reads are processed, quality controlled, and/or error corrected, it remains difficult to form conclusions as to which processing pipelines produce optimal results. This is clear when considering the performance of IDBA-UD on raw data. The IDBA-UD program employs an internal error correction mechanism that cannot be isolated from the assembly steps. It would be worthwhile to investigate this iterative k-mer approach decoupled from error correction. In our list of tested assemblers, Kiki can be viewed as a “true” assembler in the sense that it performs only the graph assembly and correction steps, with no prior manipulation of the reads. This explains the massive improvement when quality control and error correction are applied before the assembly stage. By default, SPAdes is coupled with the BayesHammer error correction stage; we disabled this step in some of the trials. Interestingly, SPAdes performed better, albeit slightly, without the use of BayesHammer.

Processing Pipeline    MaSuRCA  Kiki   Velvet  IDBA   SPAdes
Raw                    -        3893   39597   72357  71175
Trim                   59028    42804  79621   70796  177768
TD                     -        6138   37246   62103  148112
Trim + TD              59028    44465  71336   70796  198613
SGAp + TD              57623    38649  71336   70796  197838
Trim + SGAp + TD       57623    39537  71336   70796  171404
SGAp + BH + TD         -        46803  76789   59350  198488
Trim + SGAp + BH + TD  -        44406  75433   59350  198276

Table 4.4: Effects of preprocessing on V. cholerae NGA50. The best scores per assembler are shown in bold. Fields with '-' indicate an error generated by the assembler.

Processing Pipeline    MaSuRCA  Kiki  Velvet  IDBA  SPAdes
Raw                    -        32    21      3     9
Trim                   59       60    15      4     7
TD                     -        45    28      4     13
Trim + TD              61       59    23      4     8
SGAp + TD              66       54    20      4     9
Trim + SGAp + TD       66       51    20      4     9
SGAp + BH + TD         -        57    12      4     8
Trim + SGAp + BH + TD  -        61    13      4     8

Table 4.5: Effects of preprocessing on V. cholerae misassembly. The fewest misassemblies per assembler are shown in bold.

Chapter 5

INTEGRATIVE ASSEMBLY ALGORITHMS

Many assemblers, such as SPAdes or MaSuRCA, incorporate preprocessing steps (e.g., trimming, error correction) leading up to the assembly graph building stage. While this may be beneficial for those seeking to automate genome assembly in order to proceed to downstream analysis, it can be restrictive and opaque to the user looking to investigate the assembly process itself. Conversely, genome assembly can be approached in an integrative manner, dynamically using a collection of suitable methods or parameters to best match a provided data set. In most genome sequencing projects, this approach is taken, albeit in a very manual fashion.

5.0.1 Integrative Pipelines

Developed by Tritt et al. [107], A5 is a dynamic pipeline which integrates different tools into a multistage pipeline and has been shown to work favorably in our testing on microbial data sets. The pipeline works as follows:

1. Ambiguous or low quality reads are removed, and the remaining are error corrected via modules in the SGA package [98]. The Tagdust [57] package is used to remove any adapter contamination.

2. The resulting reads are assembled using the IDBA assembler.

3. Contigs produced are scaffolded and extended using SSPACE.

4. Using BWA to first align reads back to resulting scaffolds, a custom method for detec- tion of misassemblies uses paired end data to infer “improper connections” and breaks contigs accordingly.

5. The broken scaffolds are scaffolded once more using SSPACE.

5.1 Hyperparameter Optimization

Most available assembly tools offer the ability to set various initial parameters, such as the k-mer length for de Bruijn graph algorithms. While many assembler comparison studies attempt to assess assembly quality across different assembler programs, the effect of different parameters on assemblies remains largely unexamined. Here, we used the AssemblyRAST system to perform parameter sweeps and attempt to discern their relationships with data set properties and assembly quality. We used six data sets of three organisms: B. cereus HiSeq and MiSeq, R. sphaeroides HiSeq and MiSeq, and V. cholerae HiSeq and MiSeq. By using two sequencing technology data sets per reference genome, where the data set pairs show slightly different read profiles, we attempt to find a relationship between initial raw data profiles and the optimal hash length size in Velvet. Various read data statistics, along with optimal hash lengths and largest NGA50 scores, are shown in Table 5.1. For each data set, we performed Velvet assemblies in which we set Velvet's hash_length parameter to a range of 29 to 65 with a step size of 4. A plot of hash_length versus the reference-based NGA50 alignment metric is shown in Figure 5.2. The B. cereus HiSeq data set was too divergent from the reference genome, so N50 was reported instead.
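Such a sweep reduces to a generic search driver. The sketch below (illustrative, not AssemblyRAST's actual driver) abstracts the Velvet run and the scoring step behind caller-supplied callables:

```python
def parameter_sweep(assemble, score, values):
    """Run `assemble` at each parameter value, score each result
    (e.g., by NGA50), and return the best value with its score."""
    best_value, best_score = None, float("-inf")
    for value in values:
        current = score(assemble(value))
        if current > best_score:
            best_value, best_score = value, current
    return best_value, best_score
```

In practice `assemble` would wrap a velveth/velvetg invocation and `score` a QUAST evaluation; the driver itself is agnostic to both.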

Table 5.1: Optimal Velvet hash length and maximum NGA50 score per data set

Organism               Optimal k    Max NGA50
B. cereus HiSeq*       57           42663*
B. cereus MiSeq        53           30981
R. sphaeroides HiSeq   29           14270
R. sphaeroides MiSeq   41           33342
V. cholerae HiSeq      45           47191
V. cholerae MiSeq      77           62923

* N50 reported in place of NGA50 (see text).

Figure 5.1: NGAx plot of V. cholerae assembly by Velvet parameter sweep. Axes: contig length (kbp) versus x; series P1_Vt_h29, P2_Vt_h41, P3_Vt_h53, P4_Vt_h65, P5_Vt_h77, and P6_Vt_h89.

5.1.0.0.1 Discussion

The data presented in Table 5.1 and Figure 5.2 offer some insight into the effect of Velvet's de Bruijn graph k-mer hash size on assembly quality. First, considering the k-mer sweep of each data set individually, all show either monotonic trends or local maxima. Larger k-mer sizes must therefore be explored for the monotonically increasing experiments; then, with respect to a chosen criterion, an optimizing function could be implemented to select the best assembly. With a reference genome available, the NGA50 score would be the ideal criterion; without a reference, however, the defining criterion remains equivocal. Next, examining the trends of each sequencing technology offers no conclusive relationship between k-mer sizes and the profiles of the data sets. The HiSeq data sets contained reads averaging from 78 to 98 basepairs, roughly half of the MiSeq data sets'

Figure 5.2: Velvet Hash Length vs. NGA50 Score on Rsp HiSeq and MiSeq

152 to 252 basepair average range. Furthermore, MiSeq data assembled better than the corresponding HiSeq data for only two of the organisms, B. cereus and R. sphaeroides, and vice versa for V. cholerae. Further studies are thus required to gain insight into how best to choose graph-building parameters.

5.2 Block Construction and Merging

As shown in prior sections, assembler performance varies across data sets: some assemblers capture portions of an organism's genome from the reads, while the same assemblers show weakness on other genomes. Here, we explore the idea of merging the final assemblies produced by multiple assemblers.

Figure 5.3: NGA50 scores of pairwise mergings of V. cholerae assemblies using GAM-NGS. The principal diagonal represents assemblies before merging.

We investigate merging using the Genomic Assemblies Merger for Next Generation Sequencing (GAM-NGS), which uses read alignments to identify similar “blocks” within each assembly [112]. We assembled the V. cholerae data set with Kiki, Velvet, IDBA-UD, and SPAdes, and performed pairwise mergings. NGA50 scores for each merger are shown in the Figure 5.3 heatmap. Except for two mergers in which SPAdes was the “master” assembly, all mergers showed improvement in NGA50 score when merged with another assembly. While merging the SPAdes assembly with Kiki's or Velvet's yielded no improvement, the SPAdes-IDBA merger had the largest overall NGA50 score at 198 kbp.

5.2.0.1 Discussion

As with the multi-de Bruijn graph approaches of IDBA and SPAdes, some assembly algorithms are able to capture contiguity information that others cannot. Moreover, certain classes of assemblers may not be ideal for all sequencing technologies and the resulting data profiles. It is thus worthwhile to look beyond single-assembler pipelines, as a meta-algorithmic approach through this type of merging can be advantageous.

5.3 A Self-Tuning Ensemble De Novo Assembly Pipeline

From the experiments described in this chapter, we begin to form some intuition to guide parameter tuning and ensemble assembly. We can describe an intuition for k-mer size as follows:

1. The "best" value of k is one that provides the most distinct genomic k-mers.

2. For k close to the read length l, it is unlikely that all k-mers in the reference are present in the reads, due to imperfect coverage.

3. A genome will contain all k-mers given small enough k, but these k-mers are likely to be repeated in the reference.

4. There exists an optimal k_0 at which nearly all k-mers in the reference are present in the reads for k <= k_0; decreasing k below k_0 thus only makes more k-mers repeated.
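This intuition can be made concrete with a toy k-mer census. The helper below is a hypothetical sketch (not part of any cited tool): it counts the distinct k-mers in a read set and how many of them occur more than once, the two quantities the intuition trades off as k varies.

```python
from collections import Counter


def kmer_census(reads, k):
    """Count distinct k-mers across a read set, and how many are repeated.

    As k shrinks, more k-mers repeat; as k approaches the read length,
    fewer distinct genomic k-mers are observed due to imperfect coverage.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    distinct = len(counts)
    repeated = sum(1 for c in counts.values() if c > 1)
    return distinct, repeated


# Example: one 8 bp read; with k=4 the k-mer ACGT occurs twice.
print(kmer_census(["ACGTACGT"], 4))  # (4, 1)
```

Sweeping k over such a census and choosing the k that maximizes distinct (non-repeated) k-mers is the spirit of the histogram-based selection discussed next.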

Chikhi et al. introduced a method to calculate k-mer abundance histograms and a corresponding heuristic for selecting a value of k, guided by the aforementioned intuition [15]. Here, we introduce a self-tuning genome assembly system, the AssemblyRAST Smart Assembler, that incorporates this method to infer assembly parameters, as well as

integrates cleaning, error correction, assembly, and merging steps based on evaluation of intermediate results. The pipeline is represented in Figure 5.4 and described as follows:

1. If reads are from PacBio RSII or Oxford Nanopore, assemble with MiniASM. Otherwise, proceed.

2. Preprocess reads with the BayesHammer error correction algorithm.

3. Build k-mer histograms and estimate the optimal value of k.

4. Assemble the preprocessed reads with Velvet, SPAdes, and IDBA (if reads have mate-pair information).

5. Score assemblies based upon ALE likelihood scoring.

6. Define the top two assemblies as the Master and Slave sequences, and perform block merging via GAM-NGS.

7. Generate statistics via QUAST and upload them back to the AssemblyRAST server.
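The routing logic of steps 1 and 4 above can be sketched as a small dispatch function. This is a hypothetical illustration; the platform labels and the returned assembler names are ours, not the Smart Assembler's actual API:

```python
def choose_assemblers(platform, has_mate_pairs):
    """Pick the assembler set for a read library, mirroring steps 1 and 4.

    Long-read platforms go straight to MiniASM; short reads proceed
    through error correction and the short-read ensemble.
    """
    long_read_platforms = {"pacbio_rsii", "oxford_nanopore"}
    if platform.lower() in long_read_platforms:
        return ["miniasm"]
    assemblers = ["velvet", "spades"]
    if has_mate_pairs:
        assemblers.append("idba")
    return assemblers


print(choose_assemblers("PacBio_RSII", False))   # ['miniasm']
print(choose_assemblers("illumina", True))       # ['velvet', 'spades', 'idba']
```

The resulting assemblies would then be ranked by ALE score and the top two handed to GAM-NGS, as in steps 5 and 6.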

5.4 Discussion

The AssemblyRAST Smart Assembler enables automatic and accurate genome assembly, allowing non-expert researchers to generate assemblies without domain knowledge. Furthermore, the workflow provides all assemblies created in intermediary steps, as well as a comparative analysis of each. Given the observed success of this ensemble assembly paradigm, it is clear that many opportunities for improvement remain.

Figure 5.4: Smart Pipeline

Chapter 6

REFERENCE-INDEPENDENT ASSEMBLY ERROR CLASSIFICATION LEARNING

6.1 Background and Motivation

While the number of sequencing projects continues to rise at an exponential rate, techniques to reliably and accurately evaluate de novo assemblies remain immature. Validation of draft genomes is still a largely manual and expensive process. In many cases, simple qualitative metrics such as N50, total genome size, and number of contigs are used as proxies for accuracy, yet they carry no information that actually represents correctness, nor any measure of certainty. An unsophisticated assembler that incorrectly concatenates reads could score well on these naive metrics; because assembly algorithms typically employ heuristics and optimize ad hoc criteria, end users must trust that these decisions lead to correct base calls rather than merely optimal contiguity. While a subset of assemblers provide intermediary data structures that capture alternate hypotheses in graph traversal, the lack of standardization in output often leads to assembly pipelines and downstream stages that fail to propagate this supporting information.

In the previous section we highlighted the extensive and diverse ecosystem of algorithms and approaches for resolving a genome assembly; dozens of new approaches continue to emerge seemingly every month. While many researchers assert that highly accurate assemblies can be produced by these approaches [63], many others have elucidated substantial inconsistencies and errors found in draft genome assemblies [24]. Furthermore, judges from the Assemblathon 2 competition suggest that no particular assembly method, parameter selection, or workflow generalizes to all genomic input data. These findings raise questions concerning publication bias and data dredging; because the selection of assembly quality is ostensibly ad hoc, one must be wary of how much trust is placed in this selection.

Virtually all sequencing platforms export measures of read hypothesis uncertainty via their respective file formats (e.g. the FASTQ format), which is invaluable for the previously described processing methods. Unfortunately, these methods, which make decisions by optimizing and estimating on ad hoc criteria, often fail to provide information regarding the basis of each decision, ultimately blunting the propagation of valuable uncertainty information. As assembly workflow protocols grow more complex, no measure of the compounded uncertainty of the final assembly is available.

This chapter introduces key concepts and current statistical methods used for identifying misassembled regions in genome assemblies in the de novo context, that is, without a reference genome for comparison. We then introduce basic concepts of supervised machine learning and deeper details of the gradient-boosted tree learning method developed by Chen et al. [14].

6.1.1 Hard and Soft Genomic Variation Types

One of the challenges of assessing multiple assembly hypotheses, whether aligning back to a reference genome or comparing alternate assemblies, lies in distinguishing soft from hard variants. Whereas a soft variant originates from noise injected at any stage of the sequencing process (e.g. amplification, library preparation, sequencing), hard variants are true variations in the source genome (e.g. structural variation, polymorphism). As illustrated by [24], genome projects even in the draft stage have been found to contain an extensive number of errors; thus, without a framework to distinguish hard from soft variants, using such a genome as a reference to assess the performance of an assembler cannot provide a completely accurate picture.

6.1.2 Statistical Approach to De Novo Assembly Evaluation

A draft genome assembly is, biological context aside, ostensibly an arbitrary sequence of letters. However, by using the read data from which the assembly in question was generated, it is possible to formulate statistical validations from the read-assembly relationships. Narzisi et al. proposed a compression-expansion (CE) metric for statistical hypothesis testing as follows: at a given position of a genome assembly, let M be the mean of the insert sizes currently mapped at that location,

M = \frac{1}{N} \sum_{i=1}^{N} l_i \qquad (6.1)

Z can be described as the distance between the mean M of the local alignments and the library mean \mu, in units of the expected sample standard deviation \theta/\sqrt{N},

Z = \frac{M - \mu}{\theta/\sqrt{N}} \qquad (6.2)
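As a worked sketch, the CE statistic of Equations 6.1 and 6.2 can be computed directly from the insert sizes mapped at a position; the function and argument names below are ours, for illustration only:

```python
def ce_statistic(insert_sizes, lib_mean, lib_sd):
    """Compression-expansion Z score at one assembly position.

    insert_sizes: observed insert lengths of pairs spanning the position.
    lib_mean, lib_sd: library-wide insert size mean and standard deviation.
    """
    n = len(insert_sizes)
    local_mean = sum(insert_sizes) / n              # M, Eq. 6.1
    return (local_mean - lib_mean) / (lib_sd / n ** 0.5)  # Z, Eq. 6.2


# Inserts matching the library mean give Z near 0; uniformly long
# inserts signal expansion (positive Z).
print(ce_statistic([220, 220, 220, 220], 200.0, 10.0))  # 4.0
```

A large |Z| at a position is then evidence of a compressed (negative) or expanded (positive) region.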

Near-zero CE values indicate agreement of the local alignment distribution with the library mean, whereas large negative or positive values represent a region of compression or expansion, respectively. [128] propose a method for identifying repeat collapses within an assembly by analyzing frequencies of k-length words (k-mers) within the set of reads and within the assembly output, via a normalized k-mer frequency,

K^{*} = K_R / K_C \qquad (6.3)

where K_R is the frequency of all k-mers that appear within the reads, and K_C is the frequency of all k-mers within the assembly. K^{*} should approximately equal the average coverage across the assembly, and a substantial deviation from this expected value likely represents a misassembly.

6.1.2.1 Fragment Coverage Distribution Analysis

Hunt et al. developed the “Recognition of Errors in Assemblies using Paired Reads” (REAPR) pipeline, which analyzes read-to-contig mapping relationships to call errors in assemblies. In particular, REAPR attempts to detect positions in the assembly at which the area between the observed and expected fragment coverage distribution (FCD) curves exceeds a calculated threshold [43], as described in Equation 6.4. In the pipeline's "stats" and "break" stages, REAPR flags such coverage errors as mistakes and allows the user to produce a version of the assembly in which contigs are broken at these errors. While the tool is useful for generating ancillary per-base statistics for contigs, experiments have shown its accuracy to be inconsistent across assembly experiments [127].

FCD(i) = \frac{1}{i} \sum_{n=[-3i/2]}^{[3i/2]} \left| t_n - \frac{o_n}{o_0} \right| \qquad (6.4)
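The sum in Equation 6.4 can be sketched directly. The helper below is illustrative only; REAPR's construction of the window and of the theoretical curve is omitted, and the function simply computes the mean absolute gap between the theoretical values t_n and the normalised observed values o_n / o_0:

```python
def fcd_error(theoretical, observed, o0, i):
    """Mean absolute gap between the theoretical FCD t_n and the
    normalised observed FCD o_n / o0 over a window (sketch of Eq. 6.4).

    theoretical, observed: per-position values across the window.
    o0: normalisation constant for the observed curve.
    i: the scale parameter dividing the sum.
    """
    assert len(theoretical) == len(observed)
    return sum(abs(t - o / o0) for t, o in zip(theoretical, observed)) / i


# Curves in perfect agreement give zero FCD error.
print(fcd_error([1.0, 1.0], [2.0, 2.0], 2.0, 2))  # 0.0
```

Positions where this value exceeds REAPR's calibrated threshold are the ones flagged as FCD errors.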

6.1.3 Supervised Classification

Over the last decade, advances in machine learning have produced effective classification tools for a broad range of complex scenarios. With adequate amounts of data, supervised learning models are able to improve recommendation engines, spam classifiers, fraud detection, and many other complex applications. Recently, an influx of experimental techniques as well as public classification challenges have successfully leveraged tree-based supervised learning techniques, namely random forests and gradient-boosted regression trees (e.g. XgBoost). The combination of scalable learning systems that can leverage the information available in large data sets with effective statistical models is the main factor driving success on such tasks [35]. The generalized problem of supervised learning for the classification task is as follows:

Definition 6.1.1. For a set of k categories, specify to which category an input data sample belongs. A learning algorithm should produce a model function f(x; \theta) : \mathbb{R}^n \rightarrow \{1, \ldots, k\}, given an i.i.d. training set \{(x_i, y_i)\}_{i=1}^{n} \subset X \times Y generated from a distribution P, where \theta is a set of parameters of f.

The central difference between an optimization problem and a machine learning task is that the latter must perform well on previously unseen data. Thus, while an algorithm must optimize to reduce its training error, that is, loss as a function of the training data, it must also reduce loss when predicting on test data, also known as the generalization error. A machine learning objective function (Equation 6.5) therefore must consist of two parts, a loss function and a regularization function.

Obj(Θ) = L(Θ) + Ω(Θ) (6.5)

Typical loss functions include mean squared error or logistic loss as described below.

L(\theta) = \sum_i \left[ y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right) \right] \qquad (6.6)
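The logistic loss of Equation 6.6 can be computed directly on raw (pre-sigmoid) margin predictions; a minimal sketch, with function naming ours:

```python
import math


def logistic_loss(y_true, y_margin):
    """Logistic loss summed over samples (Eq. 6.6).

    y_true: labels in {0, 1}; y_margin: raw margin predictions y-hat.
    """
    return sum(
        y * math.log(1.0 + math.exp(-m)) + (1.0 - y) * math.log(1.0 + math.exp(m))
        for y, m in zip(y_true, y_margin)
    )


# A zero margin is maximally uncertain: loss per positive sample is ln(2).
print(logistic_loss([1], [0.0]))
```

Confident correct margins (large positive for y = 1, large negative for y = 0) drive the loss toward zero, which is the behavior the boosting iterations below exploit.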

6.1.3.1 Gradient Boosted Trees

Gradient boosted trees is a method that builds a predictive model through the stepwise addition of classification and regression trees (CARTs), resulting in a tree ensemble. Here, the weak learner (CART) is a function f_k \in F for k = 1, \ldots, K, where K is the number of trees and the functional space F is the set of all possible trees. A single prediction \hat{y}_i is thus a summation over all learners,

\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \qquad (6.7)

An additive approach is taken: at each iteration t, the tree f_t that optimizes the second-order approximation of the objective function is greedily added,

Figure 6.1: An example decision tree

\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (6.8)

where g_i and h_i are the first- and second-order gradient statistics of the loss,

g_i = \partial_{\hat{y}^{(t-1)}} \, l\left(y_i, \hat{y}^{(t-1)}\right) \qquad (6.9)

h_i = \partial^2_{\hat{y}^{(t-1)}} \, l\left(y_i, \hat{y}^{(t-1)}\right) \qquad (6.10)

Now that this scoring function can be used to measure the quality of a tree structure, trees can be built in a stepwise fashion by greedily finding optimally scoring splits to add branches to leaves.
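The greedy split search can be illustrated with the structure-score gain from the XGBoost derivation: a split is worth taking when the children's scores G^2/(H + \lambda) beat the unsplit node's score by more than the complexity penalty \gamma. A sketch, with parameter names lam and gamma standing in for \lambda and \gamma:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split given summed gradients (g) and
    hessians (h) on each side, per the XGBoost structure score."""
    def score(g, h):
        return g * g / (h + lam)

    return 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    ) - gamma


# A split that separates opposing gradients has positive gain:
print(split_gain(-2.0, 1.0, 2.0, 1.0))  # 2.0
```

During tree growth, every candidate split on every feature is scored this way, and the branch is added only when the best gain is positive.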

6.1.3.2 Misassembly Detection via Supervised Learning

Choi et al. introduced a method for classifying assembly errors using a set of statistics generated from mapping coverage and paired-end placement. Specifically, they measured local read coverage (RC) and clone coverage (CC). Additional features were derived from the compression/expansion (CE) statistics of Zimin et al., as defined in Equation 6.2. Using these features, the following machine learning algorithms were applied:

• Decision Tree

• Random Forest

• Random Tree

• Naive Bayes

• Bayesian Network

Overall, decision tree and random forest classifiers were found to outperform all other techniques across experiments, albeit at relatively low precision rates of 0.3 to 0.4.

6.2 A Novel Implementation of Error Classification Using Gradient Boosting Trees

While current attempts to employ supervised learning for the task of error classification in de novo genome assembly have found limited success, we posit that, with properly engineered features and a robust data set, such techniques can perform well. Fortunately, the ability to generate large amounts of genome assembly data through the AssemblyRAST framework satisfies the data requirements; coupled with a robust machine learning system like XgBoost, we can develop a classifier for de novo assembly analysis.

6.2.1 Dataset Generation

In order to train a classifier capable of detecting misassemblies across a broad range of microbial assemblies, reference genomes of 300 bacteria with an assembly status of Complete were chosen. The data were downloaded from the Sequence Read Archive (SRA) at NIH's National Center for Biotechnology Information (NCBI), using the accession numbers found in Table 6.2. Paired-end reads were generated using the read simulator ART [42], which aims to mimic the error profiles of sequencing technologies through training with empirical error distributions. Coverage parameters ranged from 20x to 150x in order to simulate varying sequencing environments, and a generation profile for Illumina HiSeq was used, which captures the indel/substitution error modes that the sequencing-by-synthesis approach typically displays.

6.2.2 Assembly Setup

In order to test the ability to detect assembly errors, other investigators synthetically produce an assembly through manual rearrangements and alterations of the original genome. We posit that sophisticated assembly algorithms properly use coverage and mate-pair information in their decisions, and that the mistakes they produce are less obvious than such arbitrarily synthesized errors. Thus, while reads were generated through simulation, the introduction of assembly errors was left to the assemblers themselves. Each simulated read set was assembled using three assemblers: Velvet, IDBA, and SPAdes. These are among the most popular de Bruijn graph assemblers and are widely used for short reads such as the aforementioned Illumina-like reads. For Velvet, a value of 29 was used for the k-mer hash length; for IDBA, 50 was set as the max-k parameter; and default settings were used for the SPAdes assembler.

6.2.3 Preprocessing

Custom plugins were created for AssemblyRAST in order to run the pertinent processing tools on each assembly, driven by a custom recipe that instructs the compute engine how to package and upload the data. For feature engineering, we focused on read-mapping and coverage relationships, as the availability of existing tools could be leveraged. In particular, we found that features derived from the regions surrounding the position in question had the biggest impact on performance. For example, the difference in average coverage between the left and right flanking regions had high feature importance. We discuss the feature engineering process further in Section 6.3.2. All features are listed in Table 6.1. In addition to custom plugins, recipes, and parsing scripts, we used the tools REAPR, bedtools, FastQC, QUAST, SMALT, MUMmer, pysamstats, and BWA.
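The flanking-coverage feature described above can be sketched as follows; the window size and function name are illustrative, not the exact values used in our plugins:

```python
def flank_cov_diff(coverage, pos, window=100):
    """Absolute difference in mean read coverage between the left and
    right flanks of a position: one of the engineered flanking-region
    features. `coverage` is a per-base coverage array for one contig."""
    left = coverage[max(0, pos - window):pos]
    right = coverage[pos + 1:pos + 1 + window]
    if not left or not right:
        return 0.0  # too close to a contig end to form both flanks
    return abs(sum(left) / len(left) - sum(right) / len(right))


# A sharp coverage step at the position yields a large feature value:
print(flank_cov_diff([10] * 50 + [30] * 50, 50, window=10))  # 20.0
```

Normalised and multi-window variants of this quantity account for several of the flank_cov_diff_* rows in Table 6.1.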

Table 6.1: Features generated from assembly and sequence data

Feature Abbreviation          Description
pos                           Position in sequence (1-based)
perfect_cov                   Perfect cov
read_cov                      Read cov
prop_cov                      Prop cov
orphan_cov                    Orphan cov
bad_insert_cov                Bad insert cov
bad_orient_cov                Bad orient cov
read_cov_r                    Read cov r
prop_cov_r                    Prop cov r
orphan_cov_r                  Orphan cov r
bad_insert_cov_r              Bad insert cov r
bad_orient_cov_r              Bad orient cov r
frag_cov                      Frag cov
frag_cov_err                  Frag cov err
FCD_mean                      Fcd mean
clip_fl                       Clip fl
clip_rl                       Clip rl
clip_fr                       Clip fr
clip_rr                       Clip rr
FCD_err                       Fcd err
mean_frag_length              Mean frag length
ct_for_cov_in_ctc             Ct for cov in ctc
cov_frac_in_ctg               Cov frac in ctg
avg_contig_cov                Avg contig cov
contig_len                    Contig len
dist_from_end                 Dist from end
avg_cov_lflank                Avg cov lflank
avg_cov_rflank                Avg cov rflank
flank_cov_diff                Flank cov diff
flank_cov_diff_norm           Flank cov diff norm
flank_cov_diff_norm_max       Flank cov diff norm max
avg_cov_lflank_h              Avg cov lflank h
avg_cov_rflank_h              Avg cov rflank h
flank_cov_diff_h              Flank cov diff h
flank_cov_diff_norm_h         Flank cov diff norm h
flank_cov_diff_norm_max_h     Flank cov diff norm max h
avg_cov_lflank_10             Avg cov lflank 10
avg_cov_rflank_10             Avg cov rflank 10
flank_cov_diff_10             Flank cov diff 10
flank_cov_diff_norm_10        Flank cov diff norm 10
flank_cov_diff_norm_max_10    Flank cov diff norm max 10
gc_pos                        Gc pos
gc_flank_l                    Gc flank l
gc_flank_r                    Gc flank r
gc_flank_diff                 Gc flank diff
matches                       Matches
matches_pp                    Matches pp
mismatches                    Mismatches
mismatches_pp                 Mismatches pp

6.2.4 Data Labeling

By comparing the resulting assemblies against the genomes from which the reads were generated, misassembly events are detected and used as classification labels in the training data. Here, we give formal descriptions of the various types of errors in terms of the relationship between the assembled and reference genomes. We extracted these locations from assemblies using the QUAST program.

6.2.4.1 Relocation Misassembly Breakpoint

Definition 6.2.1. Let C be the set of all assembled contigs. Let A be the set of all valid alignments of c ∈ C to the reference genome.

A nucleotide n_{c,p}, where c ∈ C and p ∈ [0..Len(c)], is defined as a Relocation Point if:

(i) n_{c,p} appears in a unique alignment a_j ∈ A;

(ii) n_{c,p} is located at a terminal position in a_j;

(iii) the adjacent nucleotide n_{c,p+1} appears (i) in a different unique alignment a_k, j ≠ k, (ii) both alignments align to the same reference chromosome, and (iii) the position of the reference alignment Pos(a_k) is more than 1 kb away from Pos(a_j).

6.2.4.2 Translocation Misassembly Breakpoint

Definition 6.2.2. Let C be the set of all assembled contigs. Let A be the set of all valid alignments of c ∈ C to the reference genome.

A nucleotide n_{c,p}, where c ∈ C and p ∈ [0..Len(c)], is defined as a Translocation Point if:

(i) n_{c,p} appears in a unique alignment a_j ∈ A;

(ii) n_{c,p} is located at a terminal position in a_j;

(iii) the adjacent nucleotide n_{c,p+1} appears (i) in a different unique alignment a_k, j ≠ k, and (ii) each alignment aligns to a different chromosome.

Figure 6.2: Generation of training data by AssemblyML workflow

6.2.4.3 Inversion Misassembly Breakpoint

Definition 6.2.3. Let C be the set of all assembled contigs. Let A be the set of all valid alignments of c ∈ C to the reference genome.

A nucleotide n_{c,p}, where c ∈ C and p ∈ [0..Len(c)], is defined as an Inversion Point if:

(i) n_{c,p} appears in a unique alignment a_j ∈ A;

(ii) n_{c,p} is located at a terminal position in a_j;

(iii) the adjacent nucleotide n_{c,p+1} appears (i) in a different unique alignment a_k, j ≠ k, (ii) the respective alignments a_j and a_k align to different strands of the same chromosome, and (iii) the positions of the alignments are not more than 1 kb apart.
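The three breakpoint definitions can be summarised as a single labeling rule on the pair of unique alignments flanking an adjacency. The sketch below uses an illustrative dict representation of an alignment (keys chrom, pos, strand), not QUAST's actual output format:

```python
def classify_breakpoint(aln_j, aln_k, dist_threshold=1000):
    """Label the adjacency between alignments a_j and a_k of neighbouring
    nucleotides as a translocation, inversion, or relocation breakpoint,
    following the three definitions above."""
    if aln_j["chrom"] != aln_k["chrom"]:
        return "translocation"          # different chromosomes
    distance = abs(aln_j["pos"] - aln_k["pos"])
    if aln_j["strand"] != aln_k["strand"] and distance <= dist_threshold:
        return "inversion"              # opposite strands, nearby
    if distance > dist_threshold:
        return "relocation"             # same chromosome, > 1 kb apart
    return "consistent"                 # no breakpoint implied


print(classify_breakpoint({"chrom": "1", "pos": 100, "strand": "+"},
                          {"chrom": "1", "pos": 600, "strand": "-"}))
```

Applied at every terminal position of every unique alignment, this rule yields the per-position labels used in the training set.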

6.2.5 Discussion and Feasibility

It is worthwhile to note the computational challenges that were overcome to generate the appropriate training data sets for this experiment. By leveraging the AssemblyRAST framework, we orchestrated several hundred unique assemblies and workflow and parameter permutations across multiple compute and storage nodes, drastically reducing manual work and minimizing the necessary scripting. In the next section, we walk through the results of training and the techniques used to improve prediction performance.

Table 6.2: Genomes Used to Generate Training Set

Accession No. Organism Strain

GCA_000024105.1 Anaerococcus prevotii DSM 20548
GCA_000189495.1 Staphylococcus pseudintermedius ED99
GCA_001447115.1 Salmonella enterica subsp. e ... GT-01
GCA_000024245.1 Zymomonas mobilis subsp. mobilis NCIMB 11163
GCA_000963495.1 Pseudomonas fluorescens PICF7
GCA_001190925.1 Azoarcus sp. CIB
GCA_000242455.3 Singulisphaera acidiphila DSM 18658
GCA_001027065.1 Listeria monocytogenes L2074
GCA_001697785.1 Neisseria meningitidis M25472
GCA_000959405.1 Burkholderia mallei 11
GCA_000310105.2 Pseudoalteromonas sp. BSW20308
GCA_000934525.1 Alteromonas australica DE170
GCA_001509915.1 Bordetella pertussis I979
GCA_001551855.1 Listeria monocytogenes 2015TE19005-1355
GCA_000215325.1 Ralstonia solanacearum Po82
GCA_000203955.1 Burkholderia cenocepacia HI2424
GCA_000828035.1 Staphylococcus aureus subsp. ... DAR4145
GCA_001718775.1 Burkholderia vietnamiensis AU1233
GCA_001594245.1 Aggregatibacter actinomycete ... VT1169
GCA_001431145.1 Bacillus pumilus NJ-M2
GCA_001688645.1 Bifidobacterium animalis sub ... YL2
GCA_000972245.3 Bacillus endophyticus Hbe603
GCA_001687605.1 Planococcus plakortidis DSM 23997
GCA_000191565.1 Riemerella anatipestifer RA-GD
GCA_001644565.1 Deinococcus puniceus DY1
GCA_000376645.1 Mannheimia haemolytica M42548
GCA_000283815.1 Rickettsia rickettsii Hino
GCA_000272835.4 Salmonella enterica subsp. e ... CVM N1543
GCA_000988385.1 Escherichia coli SQ88
GCA_000309885.1 Thermus oshimai JL-2
GCA_001558895.1 Morganella morganii FDAARGOS_172
GCA_000283915.1 Rickettsia canadensis CA410
GCA_001750685.1 Geosporobacter ferrireducens IRF9
GCA_000590795.1 Brucella ceti TE10759-12
GCA_001039695.2 Streptococcus pyogenes H293
GCA_000834335.1 Yersinia pestis Shasta
GCA_000195435.4 Listeria monocytogenes J1816
GCA_000359525.1 Streptomyces albus J1074
GCA_000091985.1 Helicobacter mustelae 12198
GCA_001182765.1 Synechococcus sp. WH 810
GCA_001683155.1 Bacillus anthracis PR02
GCA_000803705.1 Escherichia coli O157:H7 SS52


GCA_000200475.1 Haemophilus influenzae F3047
GCA_000738445.1 Mycobacterium tuberculosis ZMC13-264
GCA_001038645.1 Pseudomonas stutzeri SLG510A3-8
GCA_000512775.1 Bacillus anthracis A16R
GCA_000284155.1 Rickettsia australis Cutlack
GCA_000988395.1 Pseudomonas syringae pv. syr ... HS191
GCA_000192335.1 Helicobacter pylori 2018
GCA_001606025.1 Psychrobacter alimentarius PAMC 27889
GCA_001188915.1 Staphylococcus schleiferi 2317-03
GCA_000344575.1 Lactococcus lactis subsp. lactis IO-1
GCA_000163615.3 Aggregatibacter actinomycete ... D7S-1
GCA_000231925.1 Streptococcus suis ST1
GCA_000340905.1 Candidatus Kinetoplastibacte ... TCC219
GCA_000270045.1 Helicobacter pylori F32
GCA_000993765.1 Streptococcus pyogenes AP1
GCA_001542795.1 Pseudomonas aeruginosa X78812
GCA_000236405.1 Blattabacterium sp. (Cryptoc ... Cpu
GCA_000832785.1 Bacillus anthracis V770-NP-1R
GCA_000169215.2 Shewanella putrefaciens 200
GCA_000183345.1 Escherichia coli O83:H1 NRG 857C
GCA_001548635.1 Listeria monocytogenes CFSAN010068
GCA_000177255.2 Rhodopseudomonas palustris DX-1
GCA_001402875.1 Blastochloris viridis ATCC 19567
GCA_000184325.1 Vibrio furnissii NCTC 11218
GCA_000988615.1 Moraxella bovoculi 22581
GCA_000196135.1 Wolinella succinogenes DSM 1740 DSMZ 1740
GCA_000364765.1 Chlamydia trachomatis L2/434/Bu(i)
GCA_001687405.1 Bordetella pertussis B202
GCA_001267395.1 Enterococcus durans KLDS6.0933
GCA_000956315.1 Borrelia hermsii CC1
GCA_000023925.1 Kytococcus sedentarius DSM 20547
GCA_000006925.2 Shigella flexneri 2a 301
GCA_001592385.1 Streptococcus agalactiae CU_GBS_08
GCA_000494875.1 Staphylococcus pasteuri SP1
GCA_001513675.1 Microbacterium sp. XT11
GCA_001653455.1 Helicobacter pylori K26A1
GCA_001559055.1 Bordetella bronchiseptica ATCC:BAA-588D-5
GCA_000597785.2 Hafnia alvei FB1
GCA_000235405.3 Fervidobacterium pennivorans DSM 9078
GCA_001506625.1 Campylobacter jejuni CJ677CC537
GCA_000253215.1 Neisseria meningitidis WUE 2594
GCA_000967325.1 Staphylococcus aureus subsp. ... ST228
GCA_000521745.1 Bibersteinia trehalosi USDA-ARS-USMARC-189


GCA_000203875.1 Ralstonia eutropha JMP134
GCA_001580405.1 Mycobacterium sp. NRRL B-3805
GCA_000012385.1 Rickettsia bellii RML369-C
GCA_000829155.1 Candidatus Sulcia muelleri PSPU
GCA_000212415.1 Treponema brennaborense DSM 12168
GCA_000214665.1 Methylomonas methanica MC09
GCA_001443605.1 bacterium L21-Spi-D4
GCA_000016665.1 Roseiflexus sp. RS-1
GCA_001547755.1 endosymbiont of Bathymodiolu ... Myojin Knoll
GCA_000237995.2 Pediococcus claussenii ATCC BAA-344
GCA_001704315.1 Lactobacillus plantarum KP
GCA_001638925.1 Sphingomonas sp. NIC1
GCA_001447075.1 Streptomyces hygroscopicus s ... KCTC 1717
GCA_000941035.1 Leptospira interrogans serov ... 56609
GCA_000283615.1 Tetragenococcus halophilus NBRC 12172
GCA_001652525.1 Leptospira borgpetersenii 4E
GCA_001587135.1 Ralstonia solanacearum UW163
GCA_001554155.1 Mycoplasma mycoides subsp. m ... Ben468
GCA_000685745.1 Helicobacter pylori BM013B
GCA_000742955.1 Ochrobactrum anthropi OAB
GCA_000959485.1 Burkholderia mallei 2002734306
GCA_000832525.1 Bacillus cereus FM1
GCA_000816885.1 Xanthomonas citri subsp. citri A306
GCA_000007025.1 Rickettsia conorii Malish 7
GCA_000215705.1 Ramlibacter tataouinensis TTB310
GCA_000814865.1 Corynebacterium pseudotuberc ... VD57
GCA_000023665.1 Escherichia coli ’BL21-Gold( ... BL21-Gold(DE3)pLysS AG
GCA_001042675.1 Parascardovia denticolens DS ... JCM 12538
GCA_001605485.1 Bordetella pertussis I483
GCA_001697725.1 Neisseria meningitidis M09261
GCA_000613085.1 Listeria monocytogenes R479a
GCA_001029715.1 candidate division TM6 bacte ...
GCA_000240055.1 Propionibacterium acnes TypeIA2 P.acn31
GCA_001735765.1 Clostridium taeniosporum 1/k
GCA_001027205.1 Listeria monocytogenes L2676
GCA_001695715.1 Burkholderia pseudomallei M1
GCA_000699525.1 Bacillus subtilis subsp. sub ... AG1839
GCA_000020065.1 Mycoplasma arthritidis 158L3-1
GCA_000717515.1 Klebsiella pneumoniae subsp. ... KPR0928
GCA_000025645.1 Thermoanaerobacter italicus Ab9
GCA_001697265.1 Bacillus subtilis subsp. sub ... KCTC 3135
GCA_001750785.1 Streptomyces rubrolavendulae MJM4426
GCA_001687545.1 Altererythrobacter namhicola JCM 16345


GCA_000968135.1 Magnetospira sp. QH-2
GCA_000192315.1 Helicobacter pylori 2017
GCA_000746505.1 Staphylococcus aureus 2395 USA500
GCA_000317575.1 Stanieria cyanosphaera PCC 7437
GCA_000025805.1 Bacillus megaterium DSM 319
GCA_001507045.1 Campylobacter jejuni CJ677CC522
GCA_001664445.1 Rhizobium phaseoli R611
GCA_000826965.3 Pandoraea apista TF81F4
GCA_001683095.1 Bacillus anthracis Parent1
GCA_000022705.1 Bifidobacterium animalis sub ... Bl-04; ATCC SD5219
GCA_001594325.1 Pseudomonas aeruginosa F63912
GCA_000733715.2 Pseudomonas mendocina S5.2
GCA_000016705.1 Dehalococcoides mccartyi BAV1
GCA_001729525.1 Brevibacterium linens SMQ-1335
GCA_001611135.1 Pediococcus damnosus TMW 2.1535
GCA_001647655.1 [Haemophilus] ducreyi VAN1
GCA_000698865.1 Pseudomonas chlororaphis PA23
GCA_000317125.1 Chroococcidiopsis thermalis PCC 7203
GCA_000009445.1 Mycobacterium bovis BCG str. ... BCG Pasteur 1173P2
GCA_000523045.1 Bacillus subtilis BEST7003
GCA_000247715.1 Gordonia polyisoprenivorans VH2
GCA_001746595.1 Xanthomonas oryzae pv. oryzae PXO71
GCA_000829355.1 Candidatus Liberibacter asia ... Ishi-1
GCA_000754305.1 Candidatus Sulcia muelleri BGSS
GCA_000389965.1 Actinoplanes sp. N902-109
GCA_000953315.1 Wolbachia endosymbiont of Dr ...
GCA_000213805.1 Pseudomonas fulva 12-X
GCA_000685665.1 Helicobacter pylori BM013A
GCA_001577385.1 Streptomyces albus SM254
GCA_000830945.1 Chlamydia muridarum Nigg3 CMUT3-5
GCA_001735805.1 Streptomyces puniciscabiei TW1S1
GCA_000296215.2 Bradyrhizobium sp. CCGE-LA001
GCA_000196455.1 [Clostridium] sticklandii DSM 519
GCA_001308145.2 Weissella cibaria CH2
GCA_000632925.1 Ehrlichia chaffeensis Saint Vincent
GCA_000006685.1 Chlamydia muridarum Nigg
GCA_000210915.2 Halobacteriovorax marinus SJ
GCA_001029795.1 candidate division Kazan bac ...
GCA_000959545.1 Burkholderia ambifaria AMMD
GCA_000569015.1 Bifidobacterium breve JCM 7019
GCA_000832765.1 Bacillus cereus 3a
GCA_000245535.1 Salmonella enterica subsp. e ... P-stx-12
GCA_000217615.1 Propionibacterium acnes 6609


GCA_000400855.1 Mycoplasma hyopneumoniae 168-L GCA_000277165.1 Rickettsia prowazekii Chernikova GCA_000695995.1 Serratia sp. FS14 GCA_001554075.1 Mycoplasma mycoides subsp. m ... Ben50 GCA_000270005.1 Helicobacter pylori F16 GCA_000292505.1 Mycoplasma genitalium M2288 GCA_000016365.1 Mycobacterium gilvum PYR-GCK GCA_000318905.1 Chlamydia trachomatis L2b/8200/07 GCA_000007705.1 Chromobacterium violaceum ATCC 12472 GCA_000508285.1 Achromobacter xylosoxidans N ... ATCC 27061 GCA_000143965.1 Desulfarculus baarsii DSM 2075 GCA_001573085.1 Acinetobacter baumannii XH859 GCA_000202635.1 Microbacterium testaceum StLB037 GCA_000009785.1 Geobacillus kaustophilus HTA426 GCA_000623475.2 Salmonella enterica subsp. e ... SA19994216 GCA_000466785.3 Lactobacillus fermentum 3872 GCA_000800335.1 Listeria monocytogenes NTSN GCA_001465755.1 Staphylococcus aureus RIVM3897 GCA_000010905.1 Acetobacter pasteurianus IFO ... IFO 3283 substr. IFO 3283-26 GCA_001559035.1 Bartonella bacilliformis ATCC:35685D-5 GCA_001272655.2 Paenibacillus peoriae HS311 GCA_000636115.1 Streptococcus agalactiae 138spar GCA_001577755.1 Rufibacter sp. DG15C GCA_000270285.1 Eggerthella sp. YY7918 GCA_000012245.1 Pseudomonas syringae pv. syr ... B728a GCA_001578105.1 Martelella sp. AD-3 GCA_000012405.1 Thermobifida fusca YX GCA_000283875.1 Pantoea ananatis LMG 5342 GCA_000007445.1 Escherichia coli CFT073 GCA_000147015.1 Candidatus Zinderia insecticola CARI GCA_000819665.1 Paenibacillus polymyxa Sb3-1 GCA_000196875.1 Halomonas elongata DSM 2581 type strain: DSM 2581 GCA_000463425.1 Streptococcus constellatus s ... C1050 GCA_000027065.2 Cronobacter turicensis z303 GCA_000969265.1 Vibrio cholerae 10432-62 GCA_001639105.1 Azospirillum humicireducens SgZ-5 GCA_000011025.1 Rothia mucilaginosa DY-18 GCA_000299965.1 Cycloclasticus sp. 
P1 P1; MCCC 1A01040 GCA_000144405.1 Prevotella melaninogenica ATCC 25845 GCA_000187205.4 Acinetobacter baumannii MDR-TJ GCA_001273795.1 Candidatus Rickettsia amblyommii Ac37 GCA_000819505.1 Aeromonas hydrophila J-1 GCA_000940915.1 Aeromonas hydrophila AL06-06



GCA_001017775.2 Pandoraea thiooxydans DSM 25325 GCA_000286775.1 Mycoplasma gallisepticum NC06_2006.080-5-2P GCA_000747315.1 Corynebacterium ureicelerivorans IMMIB RIV-2301 GCA_001483425.1 Listeria monocytogenes Lm 3136 GCA_001047635.1 Leptospira interrogans serov ... UP-MMC-NIID LP GCA_000184435.1 Mycobacterium gilvum Spyr1 GCA_000211075.1 Streptococcus pneumoniae SPN032672 GCA_001655595.1 Pseudomonas sp. DR 5-09 GCA_001578205.1 Bacillus pumilus SH-B9 GCA_000148855.1 Helicobacter pylori SJM180 GCA_000011465.1 Prochlorococcus marinus subs ... MED4 GCA_000831065.1 Bacillus bombysepticus str. Wan GCA_000017965.1 Lysinibacillus sphaericus C3-41 GCA_000194745.1 Burkholderia gladioli BSR3 GCA_001187845.1 Octadecabacter temperatus SB1 GCA_000349225.1 Xanthomonas citri subsp. citri Aw12879 GCA_000959185.1 Burkholderia pseudomallei MSHR840 GCA_001566635.1 Escherichia coli G749 GCA_000785555.1 Planococcus sp. PAMC 21323 GCA_001294625.1 Arthrobacter alpinus R3.8 GCA_000286435.2 Morganella morganii subsp. m ... GCA_000219725.1 Treponema caldarium DSM 7334 GCA_000302535.1 Acidovorax sp. KKS102 GCA_000292455.1 Bacillus thuringiensis HD-771 GCA_000224435.1 Mycobacterium tuberculosis CTRI-2 GCA_000259175.1 Providencia stuartii MRSN 2154 GCA_000961415.1 Xanthomonas citri subsp. citri 5208 GCA_000604065.3 Pandoraea pnomenusa RB38 GCA_000015505.1 Polaromonas naphthalenivorans CJ2 GCA_000734975.2 Halomonas sp. KO116 GCA_001580385.1 Mycobacterium bovis BCG str. ... Tokyo 172 substr. TRCS GCA_000816385.2 Campylobacter lari CCUG 22395 GCA_000833295.1 Francisella philomiragia 319-036 [FSC 153] GCA_001559075.1 Citrobacter amalonaticus FDAARGOS_122 GCA_000284475.1 Chlamydia trachomatis A2497 GCA_000317025.1 Pleurocapsa sp. PCC 7327 GCA_000190535.1 Odoribacter splanchnicus DSM ... 
DSM 220712 GCA_000015845.1 Shewanella baltica OS155 GCA_001697665.1 Neisseria meningitidis M23413 GCA_001686525.1 Bordetella pertussis VA-15 GCA_000178875.2 Shewanella baltica OS678 GCA_000828915.1 Comamonadaceae bacterium B1 GCA_001189495.1 Eubacterium sulci ATCC 35585



GCA_000008685.2 Borrelia burgdorferi B31 GCA_001045415.1 Vibrio cholerae TSY216 GCA_900078695.1 Bordetella trematum H044680328 GCA_000197755.2 Listeria monocytogenes SLCC2755 GCA_000019525.1 Shewanella woodyi ATCC 51908 GCA_001655575.1 Chlamydia trachomatis F-6068 GCA_001183805.1 Chlamydia trachomatis E/CS1025/11 GCA_001182745.2 Nocardia farcinica NCTC11134 GCA_001712875.1 Maize bushy stunt phytoplasma M3 GCA_001701025.1 Lentzea sp. DHS C013 GCA_000007605.1 Chlamydophila caviae GPI GCA_000214175.1 Hoyosella subflava DQS3-9A1 GCA_001718495.1 Burkholderia vietnamiensis FL-2-3-30-S1-D0 GCA_000025185.1 Pirellula staleyi DSM 6068 GCA_000988745.1 Moraxella bovoculi 58086 GCA_000725325.1 Bacillus anthracis HYU01 GCA_000968175.1 Xenorhabdus poinarii G6 GCA_000018825.1 Bacillus weihenstephanensis KBAB4 GCA_000148365.1 Escherichia coli ABU 83972 GCA_000513635.1 Listeria monocytogenes serot ... 08-6997 GCA_000192865.1 Marinomonas mediterranea MMB-1 GCA_000166135.1 Frankia sp. EuI1c GCA_001022115.1 Klebsiella oxytoca CAV1335 GCA_000832865.1 Bacillus cereus 03BB108 GCA_001677075.1 Legionella pneumophila OLDA GCA_001677135.1 Mycobacterium immunogenum FLAC016 GCA_000486445.2 Salmonella enterica subsp. e ... ATCC 15791 GCA_001647795.1 [Haemophilus] ducreyi VAN5 GCA_001315015.1 Azospirillum brasilense Sp 7 GCA_000283695.1 Bacillus velezensis CAU B946 GCA_000255935.1 Corynebacterium pseudotuberc ... P54B96 GCA_000304535.1 Chlamydia trachomatis F/SW5 GCA_000632805.1 Dyella jiangningensis SBZ 3-12 GCA_000763535.2 Vibrio coralliilyticus OCN014 GCA_001598095.1 Bacillus thuringiensis HD12 GCA_000803625.1 Candidatus Saccharibacteria ... TM7x

6.3 Results

6.3.1 Training Model

We chose gradient-boosted regression trees (XGBoost) for our model. Initial hyperparameter tuning was performed via an exhaustive grid search, first optimizing the max_depth and min_child_weight parameters, followed by gamma, and finally n_estimators. The final parameters, shown in Listing 6.1, were used to train the model with 5-fold stratified cross-validation.

grid_params = {
    'max_depth': [4, 8, 10, 12],
    'n_estimators': [500, 1000, 2000],
    'min_child_weight': [4, 5, 6]
}
num_folds = 5
g_search = GridSearchCV(xgb_model, grid_params, cv=num_folds)
g_search.fit(X_train, y_train)
model = g_search.best_estimator_

Listing 6.1: XGBoost Parameter Tuning
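Listing 6.1 passes a plain fold count to GridSearchCV; the "stratified" part of the 5-fold stratified cross-validation means each fold preserves the overall class proportions, which matters for a label as imbalanced as misassembly. The splitting logic can be sketched in pure Python; the function name and label vector below are illustrative, not part of the original pipeline.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds so that every fold
    preserves the overall class proportions (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal this class's samples round-robin across the k folds
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# a 10:1 imbalanced label vector, as with misassembly classes
labels = [0] * 100 + [1] * 10
folds = stratified_folds(labels, k=5)
# each fold holds 20 negatives and 2 positives
```

In scikit-learn the same effect is obtained by passing a StratifiedKFold object (or, for classifiers, relying on the stratified default) as the cv argument of GridSearchCV.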

6.3.1.1 Misassembly Classification

For each resultant assembly, misassembly breakpoints (relocation, translocation, inversion) were identified by aligning the assembly back to the reference genome using nucmer and QUAST. To account for the large overrepresentation of non-misassemblies, a balanced subsampling of contig positions was taken. Given the reads used to produce the assembly, the AssemblyML model outputs classifications and confidence scores for a given position on an assembly contig.
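The balanced subsampling described above can be sketched as follows; the function name and the synthetic positions/labels are illustrative stand-ins for contig positions and their misassembly flags, with the majority (non-misassembled) class downsampled to match the minority class.

```python
import random

def balanced_subsample(positions, labels, seed=0):
    """Downsample the majority class so that both classes contribute
    equally to training (illustrative sketch)."""
    rng = random.Random(seed)
    pos = [p for p, y in zip(positions, labels) if y == 1]
    neg = [p for p, y in zip(positions, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    sampled = rng.sample(majority, len(minority))
    return minority + sampled

positions = list(range(1000))
labels = [1 if p % 100 == 0 else 0 for p in positions]  # 10 positives
balanced = balanced_subsample(positions, labels)
# 10 misassembled positions plus 10 randomly drawn normal positions
```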

6.3.2 Feature Engineering and Extraction

The shape of such a learning problem, in which prediction must be robust to datasets that are not only unseen but also unique in sequence properties, requires great care to avoid overfitting. For example, sequencing projects may commonly use 40-100x base coverage, but as the landscape of sequencing cost and technology changes, those parameters could shift. Although we trained our model on a distribution similar to the former scenario, we would like our predictions to remain robust in the latter, while striking a balance between sensitivity and bias. We therefore convert all mapping-related statistics to fractional representations relative to the average coverage of the contig. By doing so, we were able to reduce our generalization error significantly.

One of the challenges introduced by most next-generation sequencing and library preparation techniques is that they can bias read coverage in GC-rich areas of the genome [1]. Thus, when considering genome-wide relative coverage, one must be careful not to automatically flag non-normal coverage statistics as misassembly. One way to address the multimodal distribution often seen in large fragment libraries is to calculate relative rates of coverage against observed GC ratios. We adopt the following method, also used in the REAPR error classification algorithm, to calculate relative error: for a defined subsample length s and window size w, calculate the GC ratio and average coverage for each position i from 1, ..., s. Fit a locally weighted scatterplot smoothing (LOWESS) line to the scatter plot of (GC ratio, average coverage). Finally, for each observed coverage at each position, define the relative error as the difference from the fitted line. By applying signal processing techniques to microbial genomes, Allen et al. show that the overall arrangement of genomes is typically nonrandom and in fact displays long-range structural patterns [3].
Even in the case of abrupt shifts in genomic properties that may accompany structural rearrangements, coverage numbers driven by sequencing bias are still expected to show a level of smoothness over a continuous series of genome positions. We

Algorithm 2: Relative GC-Coverage Error
    input : contigs and reads
    output: GC error
    for position ← 1 to subsamplelength do
        GC ← CalculateGC(position, window);
        coverage ← CalculateFragCov(position);
        GcStats.add(GC, coverage);
    return CalculateError(LOWESS(GcStats), position)
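Algorithm 2 can be sketched in Python as below. For illustration a binned mean of coverage per GC ratio stands in for the LOWESS fit used by REAPR; the function name and inputs are hypothetical, and a real implementation would substitute an actual LOWESS smoother (e.g. from statsmodels).

```python
def relative_gc_coverage_error(gc_ratios, coverages, bins=10):
    """For each position, compute coverage relative to the trend of
    coverage vs. GC ratio. A binned mean stands in for the LOWESS
    line (illustrative sketch)."""
    # group positions into GC bins and compute each bin's mean coverage
    bin_totals = [0.0] * bins
    bin_counts = [0] * bins
    bin_of = [min(int(gc * bins), bins - 1) for gc in gc_ratios]
    for b, cov in zip(bin_of, coverages):
        bin_totals[b] += cov
        bin_counts[b] += 1
    bin_means = [t / c if c else 0.0 for t, c in zip(bin_totals, bin_counts)]
    # relative error = observed coverage minus the fitted trend value
    return [cov - bin_means[b] for b, cov in zip(bin_of, coverages)]

gc = [0.35, 0.36, 0.65, 0.66]
cov = [40.0, 42.0, 20.0, 22.0]
errors = relative_gc_coverage_error(gc, cov)
# each position deviates from the mean of its own GC bin
```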

Figure 6.3: An example of a non-smooth coverage pattern over a misassembly

Figure 6.4: Discrepancies in coverages between flanking regions

hypothesize that the signature of a true misassembly may manifest as a jagged event at the site in question. We include several features that measure discrepancies between the left and right flanking regions of a position. We define the flank coverage discrepancy D_p of assembly position p as follows: for flank length l and average read coverage C_p of the contig containing p,

    D_p = ( Σ_{i=p−l}^{p} c_i − Σ_{i=p}^{p+l} c_i ) / ( l · C_p )    (6.11)

where c_i is the aligned read coverage at contig position i. Figure 6.4 illustrates the distribution of classes with regard to their respective flank coverage discrepancies.
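Equation 6.11 translates directly into code; the function below is an illustrative sketch with hypothetical inputs (a per-position coverage list and a precomputed average contig coverage).

```python
def flank_cov_discrepancy(cov, p, l, avg_cov):
    """Flank coverage discrepancy D_p (Equation 6.11): the normalized
    difference in summed read coverage between the left and right
    flanks of position p (illustrative sketch)."""
    left = sum(cov[p - l : p + 1])   # sum of c_i for i = p-l .. p
    right = sum(cov[p : p + l + 1])  # sum of c_i for i = p .. p+l
    return (left - right) / (l * avg_cov)

# a sharp coverage drop to the right of position 10 yields a large D_p,
# while perfectly flat coverage yields zero
cov = [30] * 11 + [10] * 10
d = flank_cov_discrepancy(cov, p=10, l=5, avg_cov=20.0)
```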

Figure 6.5: Misassemblies vs. Average Contig Coverage

6.3.3 Model Performance and Hyperparameter Tuning

6.3.3.1 Parameter Tuning

Using the observed set of feature interactions shown in Table 6.3, we found that a tree depth of 4 struck a good balance, capturing model complexity while avoiding overfitting. Furthermore, we chose a relatively low learning rate and performed early stopping, halting the addition of learners when the logistic loss failed to improve over a set number of rounds.
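The early-stopping rule described above can be sketched generically. XGBoost exposes this behavior through its early_stopping_rounds option, but the underlying logic is simple; the function and loss values below are illustrative only.

```python
def early_stopping_round(losses, patience=10):
    """Return the round whose model should be kept: the best round so
    far, reported as soon as the validation loss has failed to improve
    for `patience` consecutive rounds (illustrative sketch)."""
    best_loss = float("inf")
    best_round = 0
    for rnd, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            return best_round  # stop adding learners
    return best_round

# loss improves until round 4, then plateaus; training halts there
losses = [0.70, 0.55, 0.48, 0.45, 0.44] + [0.46] * 15
stop = early_stopping_round(losses, patience=10)
```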

6.3.3.2 Empirical Results

On the test set of 17,000 samples over 300 genomes, the model performed well in identifying misassemblies. As shown in Table 6.4, the model predicted errors with a precision of 80.6% and a recall rate of 47%. Overall accuracy was 99.9%, but this measure is ultimately uninformative due to the extreme class skew.

Feature 1                        Feature 2      Feature 3      Weighted F-Score
orphan_cov                       orphan_cov_r   rms_baseq_pp   94.068561182
N_norm                           orphan_cov     orphan_cov_r   45.5783855134
flank_cov_diff_norm_h_norm       orphan_cov     orphan_cov_r   18.7258578187
flank_cov_diff_norm_max_h_norm   orphan_cov     orphan_cov_r   15.5422038246
cov_frac_in_ctg_norm             orphan_cov     orphan_cov_r   13.5707453258
orphan_cov_r                     rms_baseq_pp   rms_tlen       3.9140536114
flank_cov_diff_h_norm            orphan_cov_r   rms_tlen       3.8911463195
dist_from_end                    orphan_cov_r   rms_baseq_pp   1.2754779594
N_norm                           dist_from_end  orphan_cov_r   0.6886012308
dist_from_end                    orphan_cov     rms_baseq_pp   0.4471139529

Table 6.3: Top 10 three-way feature interaction scores for the AssemblyML model. The weighted F-score represents the frequency with which the three features appear within the same tree, weighted by the probabilities that the nodes will be visited.

Upon analysis of the XGBoost model, the features with the most impact on detecting a misassembly breakpoint were (i) differences in average coverage between the left and right flanking regions (100 bp), and (ii) positional and length features (dist_from_end, pos, contig_len), followed by mapping quality (max_mapq) and error in pair insert lengths (rms_tlen, bad_insert_cov). Top features are shown in Table 6.5.

Type              Precision  Recall  F1 Score  Support
Non-misassembly   1.00       1.00    1.00      400000
Misassembly       0.88       0.47    0.61      4000

Table 6.4: Prediction Statistics Across a Subsample of Microbe Assemblies

Upon analysis, we find that a large subset of misassemblies occur near the ends of contigs. While one hypothesis suggests that contig ends display irregular read mapping profiles and will thus be flagged as misassemblies, an alternate viewpoint is that an assembler's heuristic decision to incorporate erroneous reads leads to an inability to extend the contig further, ending the sequence. Figure 6.6 exemplifies such irregularities in coverage that suggest a misassembled region rather than simply a contig end mistaken for a misassembly.

Figure 6.6: Contig end regions predicted as misassemblies in the Spades assembly of Singulisphaera acidiphila, but not classified as a major misassembly by QUAST


                predicted p   predicted n   total
actual p        134           42            P0
actual n        32            169962        N0
total           P             N

Figure 6.8: Prediction Outcomes for 300 Simulated Genome Read Sets

Rank  F-Score          Feature Name
1     0.0401781611145  dist_from_end
2     0.034989438951   flank_cov_diff_norm_h
3     0.0329231321812  flank_cov_diff_norm_10
4     0.0327853783965  flank_cov_diff_norm
5     0.0325098708272  flank_cov_diff_10
6     0.0320966094732  pos
7     0.0312241706997  contig_len
8     0.0301680602133  FCD_err
9     0.0290201120079  FCD_mean
10    0.0284690968692  avg_cov_lflank_10
11    0.0282854251564  avg_contig_cov
12    0.02722931467    flank_cov_diff_h
13    0.0271374788135  flank_cov_diff
14    0.0252089258283  avg_cov_lflank
15    0.0244742408395  max_mapq_pp
16    0.023739553988   avg_cov_rflank_10
17    0.0225916057825  avg_cov_rflank
18    0.0219946727157  cov_frac_in_ctg
19    0.0207089725882  orphan_cov
20    0.0199742857367  prop_cov_r

Table 6.5: Top 20 Features of the Trained XGBoost Model

                predicted p   predicted n   total
actual p        12            0             P0
actual n        32            94493         N0
total           P             N

Figure 6.9: AssemblyML Prediction Outcomes For Velvet Assembly of Singulisphaera acidiphila

6.3.4 Detection of Major Errors Produced by Assemblers

The previous experiment used a random subsample across a variety of genomes to both train and test for prediction accuracy. To test our prediction model in a more realistic scenario, we scanned a single genome assembly for putative misassemblies. We simulated reads from the organism Singulisphaera acidiphila at 40x coverage, with the ART simulator set to use the Illumina error distribution model. The reads were then assembled using Velvet with default parameters, aligned to the reference to label true misassemblies, and processed through our prediction pipeline. Due to computational limitations, 1 out of every 100 basepairs was sampled in the assembly, which corresponds to our training method of also classifying any base within 50 nucleotides of a misassembly breakpoint as a misassembled base. As shown in the confusion matrix in Figure 6.9, the AssemblyML prediction model achieved 100% recall, identifying all 12 of 12 misassemblies. The .gff file produced is shown in Listing 6.2.

NODE_385_length_58109_cov_25.251545 AssemblyML possible_assembly_error 24079 24179 0.6370434165 . . Note=AssemblyML Misassembly;colour=17
NODE_413_length_58995_cov_25.302229 AssemblyML possible_assembly_error 53852 53952 0.890655517578 . . Note=AssemblyML Misassembly;colour=17
NODE_448_length_1356_cov_18.029499 AssemblyML possible_assembly_error 88 188 0.765890240669 . . Note=AssemblyML Misassembly;colour=17
NODE_463_length_23919_cov_25.622519 AssemblyML possible_assembly_error 12386 12486 0.554635167122 . . Note=AssemblyML Misassembly;colour=17
NODE_463_length_23919_cov_25.622519 AssemblyML possible_assembly_error 12586 12686 0.998969197273 . . Note=AssemblyML Misassembly;colour=17
NODE_470_length_14430_cov_25.071449 AssemblyML possible_assembly_error 8983 9083 0.572376251221 . . Note=AssemblyML Misassembly;colour=17
NODE_546_length_5187_cov_18.528437 AssemblyML possible_assembly_error 1967 2067 0.523954868317 . . Note=AssemblyML Misassembly;colour=17
NODE_683_length_63408_cov_25.024729 AssemblyML possible_assembly_error 46981 47081 0.994942605495 . . Note=AssemblyML Misassembly;colour=17

                predicted p   predicted n   total
actual p        0             12            P0
actual n        55            94470         N0
total           P             N

Figure 6.10: REAPR FCD Prediction Outcomes For Velvet Assembly of Singulisphaera acidiphila

NODE_694_length_42958_cov_25.074352 AssemblyML possible_assembly_error 30184 30284 0.536048531532 . . Note=AssemblyML Misassembly;colour=17
NODE_856_length_1342_cov_45.863636 AssemblyML possible_assembly_error 41 141 0.999892830849 . . Note=AssemblyML Misassembly;colour=17
NODE_881_length_12912_cov_25.461277 AssemblyML possible_assembly_error 666 766 0.985140442848 . . Note=AssemblyML Misassembly;colour=17
NODE_911_length_24981_cov_24.948200 AssemblyML possible_assembly_error 11029 11129 0.81190341711 . . Note=AssemblyML Misassembly;colour=17
NODE_911_length_24981_cov_24.948200 AssemblyML possible_assembly_error 11129 11229 0.829632520676 . . Note=AssemblyML Misassembly;colour=17
NODE_1076_length_565_cov_21.033628 AssemblyML possible_assembly_error 61 161 0.825996100903 . . Note=AssemblyML Misassembly;colour=17
NODE_1079_length_621_cov_21.901772 AssemblyML possible_assembly_error 76 176 0.994867920876 . . Note=AssemblyML Misassembly;colour=17
NODE_1141_length_4182_cov_26.076519 AssemblyML possible_assembly_error 2813 2913 0.77964925766 . . Note=AssemblyML Misassembly;colour=17
NODE_1156_length_63926_cov_25.052982 AssemblyML possible_assembly_error 53035 53135 0.60688751936 . . Note=AssemblyML Misassembly;colour=17
NODE_1311_length_8587_cov_23.217772 AssemblyML possible_assembly_error 5565 5665 0.698850989342 . . Note=AssemblyML Misassembly;colour=17
NODE_1425_length_597_cov_16.546064 AssemblyML possible_assembly_error 274 374 0.619778573513 . . Note=AssemblyML Misassembly;colour=17
NODE_1477_length_20147_cov_24.891796 AssemblyML possible_assembly_error 4674 4774 0.927391052246 . . Note=AssemblyML Misassembly;colour=17
NODE_1490_length_81104_cov_25.396097 AssemblyML possible_assembly_error 20667 20767 0.56034040451 . . Note=AssemblyML Misassembly;colour=17
NODE_1573_length_48398_cov_25.259268 AssemblyML possible_assembly_error 20940 21040 0.993979811668 . . Note=AssemblyML Misassembly;colour=17
NODE_1573_length_48398_cov_25.259268 AssemblyML possible_assembly_error 21040 21140 0.99996972084 . . Note=AssemblyML Misassembly;colour=17
NODE_1621_length_105564_cov_25.599030 AssemblyML possible_assembly_error 20505 20605 0.62704205513 . . Note=AssemblyML Misassembly;colour=17
NODE_1745_length_3481_cov_21.532318 AssemblyML possible_assembly_error 1027 1127 0.98476010561 . . Note=AssemblyML Misassembly;colour=17
NODE_1923_length_8747_cov_24.201555 AssemblyML possible_assembly_error 5687 5787 0.960769474506 . . Note=AssemblyML Misassembly;colour=17
NODE_1923_length_8747_cov_24.201555 AssemblyML possible_assembly_error 5787 5887 0.999976754189 . . Note=AssemblyML Misassembly;colour=17
NODE_1970_length_69917_cov_25.241787 AssemblyML possible_assembly_error 50675 50775 0.968700110912 . . Note=AssemblyML Misassembly;colour=17
NODE_1970_length_69917_cov_25.241787 AssemblyML possible_assembly_error 50775 50875 0.999956130981 . . Note=AssemblyML Misassembly;colour=17

Listing 6.2: The generated .gff file specifies predicted errors. The fourth and fifth columns give the range of each predicted misassembly, and the sixth column is the prediction probability.

For comparison, we used REAPR to predict errors on the same assembly. Because REAPR flagged over 170,000 positions as errors (mostly of type "Frag_cov: Fragment coverage too low"), we chose to use only errors from REAPR's "FCD error" classification. This classification method was unable to correctly identify misassemblies, while falsely classifying 55 samples as positive. As shown in Figure 6.11, discrepancies between the left and right flanking regions of the misassembly breakpoint, along with the number of soft-clipping events, were clear indicators that the given region was misassembled.

Figure 6.11: Correct misassembly classification in the Velvet assembly of Singulisphaera acidiphila

Figure 6.12: Quast defines misassemblies whose inconsistencies are shorter than 1000 basepairs as local misassemblies. The trained model predicts this misassembly (shown by the black bar).

Upon deeper investigation of AssemblyML's putative misassemblies that were not represented in the true positives, most appeared to represent real anomalies. By default, Quast treats misassemblies as "extensive" if inconsistencies exceed 1 kilobasepair, and all others as "local". We found that most of these predictions were in fact local misassemblies. Furthermore, long stretches of N's inserted into the assembly by Velvet as a scaffolding mechanism were flagged as misassemblies. Figures 6.12 and 6.13 illustrate such cases. In addition, as shown in Table 7.1, a number of local misassemblies were also flagged. To test for robustness against assembler-specific errors, we also assembled the same dataset

Figure 6.13: Scaffolding mechanisms incorporate "N" nucleotides into the assembly to provide structural information. These are often labeled as misassemblies.

with the Spades assembler. The assembly contained 2 true misassemblies: a relocation breakpoint in which the subsequence 61050:103120 on contig NODE_27_length_103120_cov_11.021_ID_53 was mapped inconsistently to the reference genome, and another relocation of 7765:37888 on contig NODE_89_length_37888_cov_11.0422_ID_177. As shown in Listings 6.3 and 6.4, AssemblyML predicted both correctly, with confidence scores of 0.997 and 0.998, respectively. Once again, none of these true misassemblies were represented by REAPR's FCD error calls.

NODE_27_length_103120_cov_11.021_ID_53 AssemblyML possible_assembly_error 60972 61072 0.996984660625 . . Note=AssemblyML Misassembly;colour=1
NODE_19_length_127744_cov_10.6766_ID_37 AssemblyML possible_assembly_error 10942 11042 0.69387370348 . . Note=AssemblyML Misassembly;colour=1
NODE_1_length_316631_cov_10.8538_ID_1 AssemblyML possible_assembly_error 314431 314531 0.644537210464 . . Note=AssemblyML Misassembly;colour=1
NODE_1_length_316631_cov_10.8538_ID_1 AssemblyML possible_assembly_error 315131 315231 0.929858028889 . . Note=AssemblyML Misassembly;colour=1
NODE_89_length_37888_cov_11.0422_ID_177 AssemblyML possible_assembly_error 7678 7778 0.998008906841 . . Note=AssemblyML Misassembly;colour=1
NODE_89_length_37888_cov_11.0422_ID_177 AssemblyML possible_assembly_error 7778 7878 0.543280422688 . . Note=AssemblyML Misassembly;colour=1
NODE_11_length_139233_cov_11.4983_ID_21 AssemblyML possible_assembly_error 1041 1141 0.824176132679 . . Note=AssemblyML Misassembly;colour=1

Listing 6.3: Predicted misassembly positions for the Spades assembly

NODE_27_length_103120_cov_11.021_ID_53 Extensive misassembly (relocation, inconsistency = 59538) between 40424 61049 and 103120 61050
NODE_89_length_37888_cov_11.0422_ID_177 Extensive misassembly (relocation, inconsistency = 71485) between 77641 and 7765 37888

Listing 6.4: Real misassemblies defined by Quast/Mummer alignment

                predicted p   predicted n   total
actual p        2             0             P0
actual n        5             95025         N0
total           P             N

Figure 6.14: AssemblyML Prediction Outcomes For Spades Assembly of Singulisphaera acidiphila

                predicted p   predicted n   total
actual p        0             2             P0
actual n        59            94964         N0
total           P             N

Figure 6.15: REAPR FCD Prediction Outcomes For Spades Assembly of Singulisphaera acidiphila

Figure 6.16: A region correctly classified as a true misassembly. Reads with low or zero mapping quality (shown in lighter shades) are prevalent around the position. Coverage metrics are unaffected.

As we can see in Figure 6.16, coverage statistics at misassembly event (i) appear relatively normal, but anomalies in mapping quality are present. None of the other evaluation techniques previously mentioned accounts for mapping quality, and thus they are unable to detect such events. Further investigation of all novel positions not classified as real misassemblies reveals inconsistencies in coverage near the ends of the contigs (Figure 6.6). Assuming that these are in fact partially misassembled regions that Spades therefore cannot extend further, our model achieves a perfect prediction with 100% accuracy, precision, and recall.

6.4 Discussion

In this chapter we showed that a supervised-learning-based method provides a model for reference-independent misassembly detection, consistently outperforming REAPR and providing the basis for an informed assembly evaluation. Furthermore, given the large dataset available, we were able to guard against overfitting, as training and test data were generated from entirely different genomes. While we could not directly compare our method with the machine-learning-based approach of Choi et al. on the same dataset (source code unavailable), our approach showed significant improvement in misassembly prediction precision (0.89 versus ~0.2-0.3). We must emphasize, however, that in experiments using

separate genomes for training and testing, recall rates only reached 0.48, signifying that there is room for improvement in the model. However, a low false positive rate proves beneficial for avoiding unnecessary validations in finishing efforts.

Chapter 7

APPLICATIONS OF AN ACCURATE DE NOVO ASSEMBLY EVALUATION PROFILE

7.1 Error Removal and Contiguity Metric Correction

7.1.1 Contig Splitting

The accuracy of the AssemblyML prediction model allows putative misassembly points to be located, so that contigs can be broken at low-confidence positions. We implemented a method to break contigs at regions that AssemblyML classified as errors; as shown in Figure 7.1, the misassemblies are eliminated when the result is aligned back to the reference.
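The contig-breaking step can be sketched as follows. The function name and the (start, end) interval representation are illustrative stand-ins for the actual implementation; the flagged region itself is dropped from the output, mirroring the error ranges reported in the .gff output.

```python
def split_contig(seq, error_intervals):
    """Break a contig sequence at predicted misassembly intervals,
    discarding the flagged region itself (illustrative sketch)."""
    pieces = []
    prev_end = 0
    for start, end in sorted(error_intervals):
        if start > prev_end:
            pieces.append(seq[prev_end:start])
        prev_end = max(prev_end, end)
    if prev_end < len(seq):
        pieces.append(seq[prev_end:])
    return pieces

contig = "A" * 50 + "C" * 10 + "G" * 40
pieces = split_contig(contig, [(50, 60)])
# two clean pieces remain: 50 A's and 40 G's
```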

7.1.2 Corrected Statistical Measures

The most widely used assembly assessment metric, N50, became popular during the completion of the human genome. N50 is calculated by sorting contigs by length and summing them in descending order until the running total crosses 50% of the total assembly length; the length of the contig at which this happens is the resulting score. A perhaps more representative metric is NG50, in which the decision point is 50% of the estimated genome size. Accordingly, Nx and NGx scores can be calculated for any value of x. An NG graph, in which all NG values are calculated, is useful for visually comparing scaffold lengths across assemblers as well as against the estimated genome size [9]. The example code in Listing 7.1 implements the Broad Institute's definition of N50:

N50 is a statistical measure of average length of a set of sequences. It is used widely in genomics, especially in reference to contig or supercontig lengths within a draft assembly.

Figure 7.1: Regions with misassemblies from the Spades and Velvet assemblies (A, C) compared with their contigs broken at predicted loci (B, D). Red represents contigs that contain true misassemblies.

Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L < N. This can be found mathematically as follows: take a list L of positive integers. Create another list L', which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: if L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's; the N50 of L is the median of L', which is 6.

from statistics import median

def N50(seq_lengths):
    # expand each length n into n copies of itself, then take the median
    expanded = []
    for length in sorted(seq_lengths):
        expanded += [length] * length
    return median(expanded)

Listing 7.1: N50 Calculation

Though the N50 and L50 contiguity metrics are often used as a proxy for assembly quality, an overaggressive assembler will often incorrectly misjoin reads when optimizing for its heuristic model. Such an assembler will actually be rewarded, rather than penalized, by an N50 score. Once the assembly is processed by the breaking phase, contiguity statistics such as N50 can be recalculated, resulting in a more accurate measure, which we call NC50. When a reference genome is used to assess the quality of an assembly, it is possible to produce an NGA50 score. This score is similar to N50 but only accounts for contig blocks that align correctly to the reference, making it much closer to a "true" scoring metric than its analogue due to the availability of validating information. When comparing NGA50 and NC50 scores for the Velvet and Spades assemblies (Table 7.1), the scores show clear similarities: 57260 versus 57142 for the Velvet assembly, and 77378 versus 78658 for the Spades assembly. When applied to a de novo assembly, or to a set of assemblies generated from the same set of reads, this provides the user with an accurate way of assessing or comparing the resulting outputs.
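The Nx/NGx family described in Section 7.1.2 can be sketched with the running-sum definition (walk contigs in descending length order until the cumulative length crosses x percent of the target). The function below is illustrative; it uses the threshold formulation rather than the Broad median formulation quoted earlier, and the example lengths are hypothetical.

```python
def nx(lengths, x=50, genome_size=None):
    """Generic Nx / NGx: return the contig length at which the running
    sum of descending contig lengths crosses x percent of the total
    assembly length (Nx) or of an estimated genome size (NGx)."""
    target = (genome_size if genome_size else sum(lengths)) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly never reaches the target (possible for NGx)

lengths = [10, 9, 8, 7, 6]
n50 = nx(lengths)                    # threshold at 50% of assembly length
ng50 = nx(lengths, genome_size=60)   # threshold at 50% of a 60 bp genome
```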

7.2 Likelihood and Scoring Framework For Systematic Evaluation

Clark et al. have developed a reference-free metric called the ALE score, or Assembly Likelihood Evaluation, based on a Bayesian statistical model [19]. Using quality data from the library from which the assembly was produced, along with alignment data from tools such as BWA or Bowtie, the ALE program calculates the following subscores in order to infer the overall probability that an assembly S is generated from the set of reads R, or P(S|R):

1. Pplacement(R|S): Read quality scores and basepair alignment information, along with read orientation likelihood, are used to quantify how well the reads agree with the assembly.

2. Pinsert(R|S): The mean insert length and standard deviation are calculated from the read mapping, and the likelihood of each mate-pair mapping is inferred.

Table 7.1: Statistics of Spades and Velvet assemblies and contigs broken at predicted loci

Assembly                      spades       spades_corrected  velvet       velvet_corrected
# contigs (≥ 0 bp)            509          515               1166         1201
# contigs (≥ 1000 bp)         261          266               334          358
Total length (≥ 0 bp)         9504431      9503726           9466917      9462485
Total length (≥ 1000 bp)      9430248      9428943           9331995      9324485
# contigs                     304          310               393          418
Largest contig                316631       314431            342054       312539
Total length                  9460755      9460050           9373077      9366407
Reference length              9755686      9755686           9755686      9755686
GC (%)                        62.23        62.23             62.24        62.24
Reference GC (%)              62.23        62.23             62.23        62.23
N50                           79178        78658             62128        57142
L50                           35           36                44           47
# misassemblies               2            0                 12           1
# misassembled contigs        2            0                 11           1
Misassembled contigs length   141008       0                 656934       2183
# local misassemblies         2            2                 16           10
# unaligned contigs           0 + 0 part   0 + 0 part        1 + 13 part  3 + 16 part
Unaligned length              0            0                 21858        22523
Duplication ratio             1.001        1.001             1.001        1.001
# N's per 100 kbp             0.00         0.00              91.42        87.78
# mismatches per 100 kbp      1.70         1.69              7.11         7.24
# indels per 100 kbp          0.19         0.19              2.52         2.39
Largest alignment             316631       314431            312539       312539
Total aligned length          9460643      9459935           9343977      9338892
NA50                          78658        78658             60278        57025
NGA50                         77378        77378             57260        55817

3. Pdepth(R|S): Given the GC content at a particular location, this measures how well the observed depth agrees with the expected depth inferred from GC-bias models.

4. k-mer: The likelihood of the assembly S is calculated without read information by multiplying the frequency probabilities of all k-mers appearing in the assembly; this is used as the Bayesian prior probability.

This metric proves to be a useful way to gauge overall assembly confidence; however, it only supports comparison among assemblies produced by different assembler configurations from the same initial reads, and it cannot accurately infer assembly errors without setting an arbitrary threshold. Narzisi et al. present an alternative measurement, drawing on Receiver Operating Characteristic (ROC) curves, to produce their Feature-Response Curve (FRC) metric, described in Algorithm 3, which characterizes the sensitivity (coverage) of the assembly as a function of a discriminatory threshold (features). The method works in tandem with the amosvalidate package [51], which performs assembly validation based on various measures of assembly consistency, including mate-pair checking, depth of coverage, and suggested breakpoints at suspicious regions. The generation of such profiles allows for a visual comparison between assemblers and their relative strengths and weaknesses.

Algorithm 3: Algorithm to calculate the Feature-Response Curve based on output from the amosvalidate software
    input : a set of contigs C with tagged errors (a.k.a. "features") F
    output: Feature-Response Curve
    for k ← 1 to 100 do
        φk ← |F| ∗ k / 100.0;
        sum ← 0; totallength ← 0;
        for j ← 1 to |C| do
            sum ← sum + NumFeatures(cj);
            totallength ← totallength + Length(cj)
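One reading of the FRC idea can be sketched in Python as below: for each feature-count threshold, report the fraction of assembly length covered by the longest contigs whose cumulative feature count stays within the threshold. The function name and inputs are illustrative (contigs are assumed pre-sorted by decreasing length, with per-contig feature counts from a validator such as amosvalidate), not the published implementation.

```python
def feature_response_curve(contig_lengths, contig_features):
    """For each of 100 feature thresholds, compute the fraction of
    total assembly length reachable before the cumulative feature
    count exceeds the threshold (illustrative sketch)."""
    total = sum(contig_lengths)
    total_features = sum(contig_features)
    curve = []
    for k in range(1, 101):
        threshold = total_features * k / 100.0
        feat_sum = 0
        length_sum = 0
        for length, feats in zip(contig_lengths, contig_features):
            if feat_sum + feats > threshold:
                break  # this contig would push us past the threshold
            feat_sum += feats
            length_sum += length
        curve.append((threshold, length_sum / total))
    return curve

curve = feature_response_curve([500, 300, 200], [1, 3, 6])
# coverage rises toward 1.0 as the feature threshold is relaxed
```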

We present an approach, called the Error-Response Curve, that similarly captures the trade-off between an assembly's overall contiguity and its correctness. Errors are first predicted using the AssemblyML prediction model presented in the previous chapter. Contigs are sorted by length, longest to shortest, and for each error prediction event we compute the corresponding assembly length as a function of error coverage. We define error coverage as the sum of the error prediction probabilities up to the current error prediction event. The visual result, shown in Figure 7.2, allows an investigator to perceive the overall quality of the assembly relative to contiguity, and in particular where problematic regions occur. The procedure is described in Algorithm 4.

Figure 7.2: The Error Response Curve captures the trade-off between contiguity and correctness.

In addition to this visual representation of assembly performance, we derive a scoring metric called the Error-Response Curve Area Over the Curve (ERC-AOC). This metric represents the overall contiguity and confidence of misassembly predicted by the AssemblyML model.

Algorithm 4: Algorithm to calculate the Error-Response curve after prediction via AssemblyML model
  input : A set of contigs C with predicted errors E
  output: Error Response Curve
  totallength ← 0; errorcov ← 0
  for k ← 1 to |C| do
      for j ← 1 to |E| do
          errorcov ← errorcov + PredictionProbability(e_j)
          coverage ← (totallength + PositionOnContig(e_j)) / genomesize
      totallength ← totallength + Length(c_k)

Our AssemblyML model attempts to optimize logistic loss on prediction probabilities, from which it derives classifications through application of a sigmoid function. Given each error prediction's probability score, e, we can define ERC-AOC as follows:

Let $c_i$ be the total error coverage up to error $i$,

$$c_i = \sum_{j=1}^{i} e_j \qquad (7.1)$$

then the area over the curve is calculated:

$$\mathrm{AOC} = \sum_{i=1}^{|E|} e_i - \sum_{i=1}^{|E|} \frac{1}{2}\,(c_i - c_{i-1})(e_i + e_{i-1}) \qquad (7.2)$$

Both the Error-Response Curve and the ERC-AOC metric provide a valuable approach to assessing the results of reference-independent genome assemblies. The current prediction accuracy of the AssemblyML model is satisfactory, and thus we believe that these metrics are an important step towards improving assembly through the existence of this optimization function. Furthermore, as the prediction model improves through additional datasets and feature engineering, so will the value of the ERC and ERC-AOC.
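Equation 7.2 can be evaluated directly from the per-error probability scores. A minimal sketch, assuming the probabilities arrive in curve order and taking $e_0 = 0$ for the first trapezoid (an assumption; the text does not define the boundary term):

```python
def erc_aoc(error_probs):
    """ERC area-over-the-curve per Eq. 7.2.

    error_probs: prediction probabilities e_1..e_n in curve order.
    c_i is the cumulative error coverage (Eq. 7.1); the second sum in
    Eq. 7.2 is the trapezoidal area under the curve.
    """
    c = 0.0       # c_{i-1}, cumulative coverage so far
    e_prev = 0.0  # e_{i-1}, with e_0 taken as 0
    total = 0.0   # first sum: total error mass
    area = 0.0    # second sum: trapezoidal area under the curve
    for e in error_probs:
        c_new = c + e  # c_i = c_{i-1} + e_i
        area += 0.5 * (c_new - c) * (e + e_prev)
        total += e
        c, e_prev = c_new, e
    return total - area
```

Note that since $c_i - c_{i-1} = e_i$ by Eq. 7.1, the trapezoid width at each step is just the current error probability.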

7.3 Error Prediction Strategy for Assembly Reconciliation Algorithms

Zimin et al. first proposed a genome assembly improvement strategy known as assembly reconciliation, premised on the observation that different tools make different mistakes in their heuristic decisions, and thus it is possible to reconcile correct regions through merging while also identifying problematic regions.

Definition 7.3.1. For a given alphabet Σ, and a set of strings S, where s = s1, s2, ..., si, si

Definition 7.3.2. Given the set of reads used to produce the assemblies, R = {r1, r2, ..., rn}, we define Begin(ri) and End(ri) as the positions on contig C where the first and last base of ri are aligned to C, respectively.

Definition 7.3.3. Two reads r1 and r2 are adjacent if and only if Begin(r1) ≤ End(r2)+1 and Begin(r2) ≤ End(r1) + 1.

Definition 7.3.4. Given a genome assembly A, a frame FA is a monotonic sequence of reads r1, r2, ..., rn, ri ∈ R, where ri and ri+1 are adjacent for i = 1, ..., n − 1.
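Definitions 7.3.3 and 7.3.4 reduce to simple interval tests. A sketch, representing each read by its `(Begin, End)` alignment coordinates on the contig:

```python
def adjacent(r1, r2):
    """Definition 7.3.3: two reads are adjacent iff their aligned
    intervals overlap or abut. Each read is a (begin, end) pair,
    i.e. (Begin(r), End(r)) in the text's notation."""
    b1, e1 = r1
    b2, e2 = r2
    return b1 <= e2 + 1 and b2 <= e1 + 1

def is_frame(reads):
    """Definition 7.3.4 sketch: a frame is a sequence of reads in which
    consecutive reads are adjacent (interpreting 'monotonic sequence'
    as reads ordered by begin position with no coverage gaps)."""
    return all(adjacent(reads[i], reads[i + 1])
               for i in range(len(reads) - 1))
```

The `+ 1` in the adjacency test is what allows reads that merely abut (end at position p, begin at p + 1) to still form a contiguous frame.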

Vicedomini et al. use the concept of master, M, and slave, S, assemblies to identify concordant blocks, which, for two assemblies M and S, are maximal-length pairs of frames that consist of the same set of reads. To avoid erroneous merges in scenarios of highly repetitive sequences, the GAM-NGS algorithm performs block filtering before proceeding with its graph-building step. The block coverage of a frame FM on the master assembly is defined as

$$BC_{F_M} = \frac{\sum_{r \in R_{B_M}} |r|}{|F_M|} \qquad (7.3)$$

and global coverage as

$$GC_{F_M} = \frac{\sum_{r \in R} |r|}{|F_M|} \qquad (7.4)$$

The GAM-NGS program exposes a user-definable filtering threshold Tc, 0 < Tc < 1, that essentially controls for frames built from lower-than-expected read coverage. Blocks are removed unless

$$\max\left\{\frac{BC_{F_M}}{GC_{F_M}},\; \frac{BC_{F_S}}{GC_{F_S}}\right\} \geq T_c \qquad (7.5)$$
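Equation 7.5 amounts to a one-line predicate over the four coverage values; a sketch (the default threshold value here is purely illustrative, not GAM-NGS's default):

```python
def keep_block(bc_master, gc_master, bc_slave, gc_slave, t_c=0.75):
    """Block filter per Eq. 7.5: retain a block only if, on at least
    one of the two assemblies, the frame's block coverage reaches a
    fraction t_c of its global coverage."""
    return max(bc_master / gc_master, bc_slave / gc_slave) >= t_c
```

A block is thus kept if it explains most of the read coverage of its frame on either the master or the slave assembly, which screens out blocks assembled from spuriously low coverage.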

As with consensus strategies, GAM-NGS attempts to provide researchers with broader structural information through the merging and extension of contigs. One drawback, however, can manifest when the algorithm assumes that misassemblies do not occur in regions that provide novel links between a master and slave assembly. In these cases, misassembled regions can be incorporated into the consensus merger. We present an assembly algorithm that incorporates this block merging method, with two key stages preceding the process. First, to prevent the merging step from propagating misassemblies, we break the contigs at any region predicted by the AssemblyML model. Next, we use ERC-AOC, as described in the previous section, to assign appropriate master and slave labels to each contig set, traditionally a task left up to a user's ad hoc decision. The GAM-NGS algorithm uses these labels when traversing ambiguous paths in its assemblies graph.
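The first of the two preceding stages, contig breaking at predicted error sites, can be sketched as follows. The data layout is an assumption: contigs as an id-to-sequence mapping, and the model's output reduced to sorted break positions per contig.

```python
def break_contigs(contigs, predicted_breakpoints):
    """Split each contig at positions flagged by the error-prediction
    model, so the downstream block merger cannot propagate a
    misassembly across a suspect junction.

    contigs: dict mapping contig id -> sequence string.
    predicted_breakpoints: dict mapping contig id -> sorted break
    positions (hypothetical structure for this sketch).
    """
    pieces = []
    for cid, seq in contigs.items():
        start = 0
        for pos in predicted_breakpoints.get(cid, []):
            pieces.append(seq[start:pos])  # fragment up to the break
            start = pos
        pieces.append(seq[start:])         # trailing fragment
    return [p for p in pieces if p]        # drop empty fragments
```

The resulting fragments, rather than the original contigs, are then handed to the block-construction step, with master/slave roles chosen by comparing ERC-AOC scores.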

Figure 7.3: An improved assembly pipeline using AssemblyML-guided contig breaking and ERC metrics as preceding steps to block merging.

Chapter 8

CONCLUSION

This dissertation provides several key insights and novel methods that move us towards the goal of automated and accurate de novo genome assembly. The landscape of genetic sequencing information continues to shift, and as such, it is important to search the information space for possible features that may improve computability. We've described the challenges presented in each stage. Some are inherent in the chemistry of DNA sequencing; some are due to the genomic structure of an organism or the complexity of a microbial community. Navigating the diverse ecosystem of computational biology tools provides additional barriers, and layers of heuristic decisions without propagation of uncertainty narrow the opportunity to gain insight from – and to improve – integrative techniques for assembly. We began our studies by developing a framework (AssemblyRAST) that allows for high-level design of integrative workflows, parameter searches, data capture, and rapid analysis. This enabled comprehensive surveys of techniques used for read preprocessing, assembly, scaffolding, and analysis, and ultimately facilitated the development of a self-tuning microbial assembly pipeline that outperforms any automatic method used today. After optimizing our algorithms for true performance, we set our sights on measuring performance in the absence of a reference genome. The most advanced techniques attempted to provide statistical justification through the use of read-mapping and various coverage statistics. Unfortunately, no technique was robust enough to detect errors or measure accuracy across the diverse landscape of assemblies that results from the nearly unlimited permutations of sequencing and assembly parameters. Supervised learning is capable of producing accurate prediction models if datasets are robust and features are engineered properly.
Using the AssemblyRAST framework, we were able to generate a large quantity of assembly data, and by applying our domain knowledge of assembly methods, developed a trained model that outperforms any method we've tested. Finally, we derive a set of evaluation metrics from our prediction model that not only give researchers an immediate tool to compare the qualities of an assembly, but most importantly, unlock the potential improvement of the assembly through high-level optimization techniques. We describe these in the next section.

8.1 Future

The difficulty of the genome assembly problem is the result of layers of complexity that are compounded through interdisciplinary workflows. We've introduced a promising set of methods that, while currently useful, invite further improvement and application to new techniques. Here, we describe them.

• Emulated Hybrid Assembly via Alternate Assembly Methods: Several methods for hy- brid assembly have been studied, in which multiple read sets from different sequencing technologies were assembled together. Our experiments show that different assembly techniques or parameters can resolve unique sequence extensions. We’d like to explore a technique that assembles short reads via de Bruijn assembly, and a second step in which contigs are treated as long, Sanger-like reads, in overlap consensus assembly.

• SNP Classification: The detection of single nucleotide polymorphisms can be very impactful for many downstream analyses. Unfortunately, such events can often be introduced through the sequencing or assembly process. Using a similar data generation and training method as described in Chapter 6, we’d like to classify SNPs introduced in the assembly stage and furthermore, correct them.

• Eukaryote Assembly: Microbial and metagenomic assembly are currently the focus of the system. As we improve assembly accuracy and throughput, support for eukaryotic genomes will develop. This will be helpful to such efforts as the Genome 10k project.

• Metagenomic Communities: Metagenomic assembly remains a prominent challenge due to the complexities inherent in the data provided. To produce better methods of assembly, it will be necessary to better understand the properties of microbial communities, studying abundance profiles, clustering algorithms, phylogenetic analyses, and a myriad of other facets.

• Alternate Compute Architectures: Novel architectures may be capable of accelerating certain computations. For example, Convey's HC-1 machine uses an FPGA co-processor to accelerate specific tasks; we are currently testing a BWA read alignment implementation on this machine. The flexibility of the AssemblyRAST compute framework allows for a heterogeneous mixture of compute systems, and thus alternate systems can be used. As we continue the effort to incorporate machine learning into assembly workflows, the need for accelerated computation becomes more and more prevalent.

References

[1] Daniel Aird, Michael G. Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B. Jaffe, Chad Nusbaum, and Andreas Gnirke. Analyzing and minimizing pcr amplification bias in illumina sequencing libraries. Genome Biology, 12(2):R18, 2011.

[2] Can Alkan, Saba Sajjadian, and Evan E Eichler. Limitations of next-generation genome sequence assembly. Nature methods, 8(1):61–5, jan 2011.

[3] Timothy E Allen, Nathan D Price, Andrew R Joyce, and Bernhard A Palsson. Long- range periodic patterns in microbial genomes indicate significant multi-scale chromo- somal organization. PLoS Comput Biol, 2(1):1–9, 01 2006.

[4] David Alvarez-Ponce, Philippe Lopez, Eric Bapteste, and James O McInerney. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proceedings of the National Academy of Sciences of the United States of America, 110(17):E1594–603, apr 2013.

[5] Ramy K Aziz, Daniela Bartels, Aaron a Best, Matthew DeJongh, Terrence Disz, Robert a Edwards, Kevin Formsma, Svetlana Gerdes, Elizabeth M Glass, Michael Kubal, Folker Meyer, Gary J Olsen, Robert Olson, Andrei L Osterman, Ross a Over- beek, Leslie K McNeil, Daniel Paarmann, Tobias Paczian, Bruce Parrello, Gordon D Pusch, Claudia Reich, Rick Stevens, Olga Vassieva, Veronika Vonstein, Andreas Wilke, and Olga Zagnitko. The RAST Server: rapid annotations using subsystems technology. BMC genomics, 9:75, jan 2008.

[6] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, Alexey V Pyshkin, Alexander V Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A Alekseyev, and Pavel A Pevzner. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology, 19(5):455–77, may 2012.

[7] Marten Boetzer, Christiaan V Henkel, Hans J Jansen, Derek Butler, and Walter Pirovano. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics (Oxford, England), 27(4):578–9, feb 2011.

[8] Sébastien Boisvert, François Laviolette, and Jacques Corbeil. Ray: simultaneous as- sembly of reads from a mix of high-throughput sequencing technologies. Journal of computational biology : a journal of computational molecular cell biology, 17(11):1519– 33, nov 2010.

[9] Keith R Bradnam, Joseph N Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, Inanç Birol, Sébastien Boisvert, Jarrod A Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A Fonseca, Ganeshkumar Ganapathy, Richard A Gibbs, Sante Gnerre, Elénie Godzaridis, Steve Goldstein, Matthias Haimel, Giles Hall, Joseph B Hiatt, Isaac Y Ho, Jason Howard, Martin Hunt, Shaun D Jackman, David B Jaffe, Erich Jarvis, Huaiyang Jiang, Sergey Kazakov, Paul J Kersey, Jacob O Kitzman, James R Knight, Sergey Koren, Tak-Wah Lam, Dominique Lavenier, François Laviolette, Yingrui Li, Zhenyu Li, Binghang Liu, Yue Liu, Ruibang Luo, Iain Maccallum, Matthew D Macmanes, Nicolas Maillet, Sergey Melnikov, Delphine Naquin, Zemin Ning, Thomas D Otto, Benedict Paten, Octávio S Paulo, Adam M Phillippy, Francisco Pina-Martins, Michael Place, Dariusz Przybylski, Xiang Qin, Carson Qu, Filipe J Ribeiro, Stephen Richards, Daniel S Rokhsar, J Graham Ruby, Simone Scalabrin, Michael C Schatz, David C Schwartz, Alexey Sergushichev, Ted Sharpe, Timothy I Shaw, Jay Shendure, Yujian Shi, Jared T Simpson, Henry Song, Fedor Tsarev, Francesco Vezzi, Riccardo Vicedomini, Bruno M Vieira, Jun Wang, Kim C Worley, Shuangye Yin, Siu-Ming Yiu, Jianying Yuan, Guojie Zhang, Hao Zhang, Shiguo Zhou, and Ian F Korf. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2(1):10, jul 2013.

[10] Guy Bresler, Ma’ayan Bresler, and David Tse. Optimal Assembly for High Throughput Shotgun Sequencing. page 26, jan 2013.

[11] C Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B Pyrkosz, and Timothy H Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Se- quencing Data. pages 1–18.

[12] Louise Teixeira Cerdeira, Adriana Ribeiro Carneiro, Rommel Thiago Jucá Ramos, Sin- tia Silva de Almeida, Vivian D’Afonseca, Maria Paula Cruz Schneider, Jan Baumbach, Andreas Tauch, John Anthony McCulloch, Vasco Ariston Carvalho Azevedo, and Ar- tur Silva. Rapid hybrid de novo assembly of a microbial genome using only short reads: Corynebacterium pseudotuberculosis I19 as a case study. Journal of microbiological methods, 86(2):218–23, aug 2011.

[13] Mark Chaisson, Pavel Pevzner, and Haixu Tang. Fragment assembly with short reads. Bioinformatics (Oxford, England), 20(13):2067–74, sep 2004.

[14] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. CoRR, abs/1603.02754, 2016.

[15] Rayan Chikhi and Paul Medvedev. Informed and Automated k -Mer Size Selection for Genome Assembly. pages 1–7, 2013.

[16] Chen-Shan Chin, David H Alexander, Patrick Marks, Aaron A Klammer, James Drake, Cheryl Heiner, Alicia Clum, Alex Copeland, John Huddleston, Evan E Eichler, Stephen W Turner, and Jonas Korlach. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature methods, 10(6):563–9, jun 2013.

[17] Hamidreza Chitsaz, Joyclyn L Yee-Greenbaum, Glenn Tesler, Mary-Jane Lombardo, Christopher L Dupont, Jonathan H Badger, Mark Novotny, Douglas B Rusch, Louise J Fraser, Niall a Gormley, Ole Schulz-Trieglaff, Geoffrey P Smith, Dirk J Evers, Pavel a Pevzner, and Roger S Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nature biotechnology, 29(10):915–21, oct 2011.

[18] Jeong-Hyeon Choi, Sun Kim, Haixu Tang, Justen Andrews, Don G Gilbert, and John K Colbourne. A machine-learning approach to combined evidence validation of genome assemblies. Bioinformatics (Oxford, England), 24(6):744–50, mar 2008.

[19] Scott C Clark, Rob Egan, Peter I Frazier, and Zhong Wang. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics (Oxford, England), 29(4):435–43, feb 2013.

[20] Phillip E C Compeau and Pavel A Pevzner. Genome Reconstruction : A Puzzle with a Billion Pieces. pages 1–30, 2010.

[21] Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. How to apply de Bruijn graphs to genome assembly. 29(11):987–991, 2011.

[22] Aaron E Darling, Andrew Tritt, Jonathan a Eisen, and Marc T Facciotti. Mauve assembly metrics. Bioinformatics (Oxford, England), 27(19):2756–7, oct 2011.

[23] Adel Dayarian, Todd P Michael, and Anirvan M Sengupta. Sopra: Scaffolding al- gorithm for paired reads via statistical optimization. BMC bioinformatics, 11(1):1, 2010.

[24] James F Denton, Jose Lugo-Martinez, Abraham E Tucker, Daniel R Schrider, Wesley C Warren, and Matthew W Hahn. Extensive error in the number of genes inferred from draft genome assemblies. PLoS computational biology, 10(12):e1003998, dec 2014.

[25] Mark a DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony a Philippakis, Guillermo del Angel, Manuel a Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, and Mark J Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics, 43(5):491–8, may 2011.

[26] Juliane C Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Sub- stantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic acids research, 36(16):e105, sep 2008.

[27] John M Eppley, Gene W Tyson, Wayne M Getz, and Jillian F Banfield. Strainer: software for analysis of population variation in community genomic datasets. BMC bioinformatics, 8:398, jan 2007.

[28] M V Everett, E D Grau, and J E Seeb. Short reads and nonmodel species: exploring the complexities of next-generation sequence assembly and SNP discovery in the absence of a reference genome. Molecular ecology resources, 11 Suppl 1:93–108, mar 2011.

[29] Yanxiao Feng, Yuechuan Zhang, Cuifeng Ying, Deqiang Wang, and Chunlei Du. Nanopore-based Fourth-generation DNA Sequencing Technology. Genomics, Pro- teomics & Bioinformatics, 13(1):4–16, 2015.

[30] R D Fleischmann, M D Adams, O White, R A Clayton, E F Kirkness, A R Kerlavage, C J Bult, J F Tomb, B A Dougherty, and J M Merrick. Whole-genome random

sequencing and assembly of Haemophilus influenzae Rd. Science (New York, N.Y.), 269(5223):496–512, July 1995. PMID: 7542800.

[31] Song Gao, Wing-Kin Sung, and Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computa- tional Biology, 18(11):1681–1691, 2011.

[32] Mohammadreza Ghodsi, Christopher M Hill, Irina Astrovskaya, Henry Lin, Dan D Sommer, Sergey Koren, and Mihai Pop. De novo likelihood-based measures for com- paring genome assemblies. BMC research notes, 6(1):334, jan 2013.

[33] Belinda Giardine, Cathy Riemer, Ross C Hardison, Richard Burhans, Laura Elnitski, Prachi Shah, Yi Zhang, Daniel Blankenberg, Istvan Albert, James Taylor, Webb Miller, W James Kent, and Anton Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10):1451–5, oct 2005.

[34] Sante Gnerre, Iain Maccallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Bur- ton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, Aaron M Berlin, Daniel Aird, Maura Costello, Riza Daza, Louise Williams, Robert Nicol, An- dreas Gnirke, Chad Nusbaum, Eric S Lander, and David B Jaffe. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America, 108(4):1513–8, jan 2011.

[35] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in prepa- ration for MIT Press, 2016.

[36] Paul Greenfield, Konsta Duesing, Alexie Papanicolaou, and Denis C Bauer. Blue: correcting sequencing errors using consensus and context. Bioinformatics (Oxford, England), pages btu368–, jun 2014.

[37] Alexey A Gritsenko, Jurgen F Nijkamp, Marcel J T Reinders, and Dick De Ridder. GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. pages 1–9, 2012.

[38] Yan Guo, Jiang Li, Chung-I Li, Jirong Long, David C Samuels, and Yu Shyr. The effect of strand bias in Illumina short-read sequencing data. BMC genomics, 13(1):666, jan 2012.

[39] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST : Quality Assessment Tool for Genome Assemblies. pages 1–4, 2013.

[40] Olivier Harismendy, Pauline C Ng, Robert L Strausberg, Xiaoyun Wang, Timothy B Stockwell, Karen Y Beeson, Nicholas J Schork, Sarah S Murray, Eric J Topol, Samuel Levy, and Kelly a Frazer. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome biology, 10(3):R32, jan 2009.

[41] David Hernandez, Patrice François, Laurent Farinelli, Magne Osterås, and Jacques Schrenzel. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome research, 18(5):802–9, may 2008.

[42] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next- generation sequencing read simulator. Bioinformatics (Oxford, England), 28(4):593–4, feb 2012.

[43] Martin Hunt, Taisei Kikuchi, Mandy Sanders, Chris Newbold, Matthew Berriman, and Thomas D Otto. REAPR: a universal tool for genome assembly evaluation. Genome biology, 14(5):R47, may 2013.

[44] Susan M Huse, Julie A Huber, Hilary G Morrison, Mitchell L Sogin, and David Mark Welch. Accuracy and quality of massively parallel DNA pyrosequencing. Genome biology, 8(7):R143, jan 2007.

[45] Lucian Ilie, Farideh Fazayeli, and Silvana Ilie. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics (Oxford, England), 27(3):295–302, feb 2011.

[46] Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature genetics, 44(2):226–32, feb 2012.

[47] Minoru Kanehisa, Susumu Goto, Yoko Sato, Miho Furumichi, and Mao Tanabe. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic acids research, 40(Database issue):D109–14, jan 2012.

[48] David R Kelley, Michael C Schatz, and Steven L Salzberg. Quake : quality-aware detection and correction of sequencing errors. 2010.

[49] Jaebum Kim, Denis M Larkin, Qingle Cai, Asan, Yongfen Zhang, Ri-Li Ge, Loretta Auvil, Boris Capitanu, Guojie Zhang, Harris A Lewin, and Jian Ma. Reference-assisted chromosome assembly. Proceedings of the National Academy of Sciences of the United States of America, 110(5):1785–90, jan 2013.

[50] Carl Kingsford, Michael C Schatz, and Mihai Pop. Assembly complexity of prokaryotic genomes using short reads. BMC bioinformatics, 11(1):21, jan 2010.

[51] S. Koren, T. J. Treangen, C. M. Hill, M. Pop, and A. M. Phillippy. Automated ensemble assembly and validation of microbial genomes. Technical report, feb 2014.

[52] Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, and Adam M Phillippy. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotech, 30(7):693–700, jul 2012.

[53] Sergey Koren, Todd J Treangen, Christopher M Hill, Mihai Pop, and Adam M Phillippy. Automated ensemble assembly and validation of microbial genomes. BMC bioinformatics, 15(1):126, jan 2014.

[54] Sergey Koren, Todd J Treangen, and Mihai Pop. Bambus 2: scaffolding metagenomes. Bioinformatics (Oxford, England), 27(21):2964–71, nov 2011.

[55] Ka-Kit Lam, Asif Khalak, and David Tse. Near-optimal Assembly for Shotgun Se- quencing with Noisy Reads. feb 2014.

[56] Jonathan Laserson, Vladimir Jojic, and Daphne Koller. Genovo : De Novo Assembly for Metagenomes. pages 341–356, 2010.

[57] Timo Lassmann, Yoshihide Hayashizaki, and Carsten O Daub. TagDust–a program to eliminate artifacts from next generation sequencing data. Bioinformatics (Oxford, England), 25(21):2839–40, nov 2009.

[58] Christian Ledergerber and Christophe Dessimoz. Base-calling for next-generation se- quencing platforms. Briefings in bioinformatics, 12(5):489–97, sep 2011.

[59] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi, Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, Songgang Li, Huanming Yang, Jian Wang, and Jun Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome research, 20(2):265–72, feb 2010.

[60] Maxwell W. Libbrecht and William Stafford Noble. Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6):321–332, may 2015.

[61] Lin Liu, Yinhu Li, Siliang Li, Ni Hu, Yimin He, Ray Pong, Danni Lin, Lihua Lu, and Maggie Law. Comparison of next-generation sequencing systems. Journal of biomedicine & biotechnology, 2012:251364, jan 2012.

[62] Tsunglin Liu, Cheng-Hung Tsai, Wen-Bin Lee, and Jung-Hsien Chiang. Optimizing information in Next-Generation-Sequencing (NGS) reads for improving de novo genome assembly. PloS one, 8(7):e69503, jan 2013.

[63] Nicholas J Loman, Joshua Quick, and Jared T Simpson. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nat Meth, 12(8):733–735, aug 2015.

[64] Nicholas J Loman, Joshua Quick, and Jared T Simpson. A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods, 12(8):733– 735, jun 2015.

[65] Chengwei Luo, Despina Tsementzi, Nikos Kyrpides, Timothy Read, and Konstanti- nos T Konstantinidis. Direct comparisons of Illumina vs. Roche 454 sequencing tech- nologies on the same microbial community DNA sample. PloS one, 7(2):e30087, jan 2012.

[66] Tanja Magoc, Stephan Pabinger, Steven L Salzberg, Stefan Canzar, Xinyue Liu, Qi Su, Daniela Puiu, and Luke J Tal. GAGE-B : An Evaluation of Genome Assemblers for Bacterial Organisms. pages 1–9, 2013.

[67] R. D. Maitra, J. Kim, and W. B. Dunbar. Recent advances in nanopore sequencing. Electrophoresis, 33(23):3418–3428, Dec 2012.

[68] Páll Melsted and Bjarni V Halldórsson. KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics (Oxford, England), 30(24):3541–7, dec 2014.

[69] Daniel R Mende, Alison S Waller, Shinichi Sunagawa, Aino I Järvelin, Michelle M Chan, Manimozhiyan Arumugam, Jeroen Raes, and Peer Bork. Assessment of metagenomic assembly using simulated next generation sequencing data. PloS one, 7(2):e31386, jan 2012.

[70] Jill P Mesirov. Accessible Reproducible Research. Science, 327(January):415–416, 2010.

[71] Michael L Metzker. Sequencing technologies - the next generation. Nature reviews. Genetics, 11(1):31–46, jan 2010.

[72] F Meyer, D Paarmann, M D’Souza, R Olson, E M Glass, M Kubal, T Paczian, a Ro- driguez, R Stevens, a Wilke, J Wilkening, and R a Edwards. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC bioinformatics, 9:386, jan 2008.

[73] André E Minoche, Juliane C Dohm, and Heinz Himmelbauer. Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems. 2011.

[74] Ian Misner, Cédric Bicep, Philippe Lopez, Sébastien Halary, Eric Bapteste, and Christopher E Lane. Sequence comparative analysis using networks: software for evaluating de novo transcript assembly from next-generation sequencing. Molecular biology and evolution, 30(8):1975–86, aug 2013.

[75] Eugene W Myers. The fragment assembly string graph. Bioinformatics (Oxford, Eng- land), 21 Suppl 2:ii79–85, sep 2005.

[76] Kensuke Nakamura, Taku Oshima, Takuya Morimoto, Shun Ikeda, Hirofumi Yoshikawa, Yuh Shiwa, Shu Ishikawa, Margaret C Linak, Aki Hirai, Hiroki Takahashi, Md Altaf-Ul-Amin, Naotake Ogasawara, and Shigehiko Kanaya. Sequence-specific er- ror profile of Illumina sequencers. Nucleic acids research, 39(13):e90, jul 2011.

[77] Francesco Napolitano, Renato Mariani-Costantini, and Roberto Tagliaferri. Bioinfor- matic pipelines in Python with Leaf. BMC bioinformatics, 14(1):201, jan 2013.

[78] Giuseppe Narzisi and Bud Mishra. Comparing de novo genome assembly: the long and short of it. PloS one, 6(4):e19175, jan 2011.

[79] Anton Nekrutenko and James Taylor. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility.

[80] Yu Peng, Henry C M Leung, S M Yiu, and Francis Y L Chin. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics (Oxford, England), 28(11):1420–8, jun 2012.

[81] Erik Pettersson, Joakim Lundeberg, and Afshin Ahmadian. Generations of sequencing technologies. Genomics, 93(2):105–11, feb 2009.

[82] Pavel A Pevzner, Paul A Pevzner, Haixu Tang, and Glenn Tesler. De novo repeat classification and fragment assembly. Genome research, 14(9):1786–96, sep 2004.

[83] Son K Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel a Pevzner, and Max a Alekseyev. Pathset graphs: a novel approach for comprehensive utilization of paired reads in genome assembly. Journal of computational biology : a journal of computational molecular cell biology, 20(4):359–71, apr 2013.

[84] Adam M Phillippy, Michael C Schatz, and Mihai Pop. Genome assembly forensics: finding the elusive mis-assembly. Genome biology, 9(3):R55, jan 2008.

[85] Joseph K Pickrell and Jonathan K Pritchard. Inference of population splits and mix- tures from genome-wide allele frequency data. PLoS genetics, 8(11):e1002967, jan 2012.

[86] Mihai Pop. Genome assembly reborn: recent computational challenges. Briefings in bioinformatics, 10(4):354–66, jul 2009.

[87] Andrey D Prjibelski, Irina Vasilinetc, Anton Bankevich, Alexey Gurevich, Tatiana Krivosheeva, Sergey Nurk, Son Pham, Anton Korobeynikov, Alla Lapidus, and Pavel A Pevzner. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics (Oxford, England), 30(12):i293–i301, jun 2014.

[88] Michael A Quail, Miriam Smith, Paul Coupland, Thomas D Otto, Simon R Harris, Thomas R Connor, Anna Bertoni, Harold P Swerdlow, and Yong Gu. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC genomics, 13(1):341, jan 2012.

[89] Atif Rahman and Lior Pachter. CGAL: computing genome assembly likelihoods. Genome biology, 14(1):R8, jan 2013.

[90] Anthony Rhoads and Kin Fai Au. PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics, 13(5):278–289, 2015.

[91] Roy Ronen, Christina Boucher, Hamidreza Chitsaz, and Pavel Pevzner. SEQuel: improving the accuracy of genome assemblies. Bioinformatics (Oxford, England), 28(12):i188–96, jun 2012.

[92] Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James a Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3):557–67, mar 2012.

[93] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome research, 20(9):1165–73, sep 2010.

[94] Asheesh Shanker. Genome research in the cloud. Omics : a journal of integrative biology, 16(7-8):422–8, 2012.

[95] Yufeng Shen, Sumeet Sarin, Ye Liu, Oliver Hobert, and Itsik Pe’er. Comparing plat- forms for C. elegans mutant identification using high-throughput whole-genome se- quencing. PloS one, 3(12):e4012, jan 2008.

[96] Jared T. Simpson. Exploring Genome Characteristics and Sequence Quality Without a Reference. jul 2013.

[97] Jared T Simpson. Exploring genome characteristics and sequence quality without a reference. Bioinformatics (Oxford, England), 30(9):1228–35, may 2014.

[98] Jared T Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3):549–56, mar 2012.

[99] Julie A Sleep, Andreas W Schreiber, and Ute Baumann. Sequencing error correction without a reference genome. BMC Bioinformatics, 14(1):367, 2013.

[100] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195 – 197, 1981.

[101] Guy L. Steele. COMMON LISP: the language. 1984. With contributions by Scott E. Fahlman and Richard P. Gabriel and David A. Moon and Daniel L. Weinreb.

[102] Garret Suen, Jarrod J Scott, Frank O Aylward, Sandra M Adams, Susannah G Tringe, Adrián A Pinto-Tomás, Clifton E Foster, Markus Pauly, Paul J Weimer, Kerrie W Barry, Lynne A Goodwin, Pascal Bouffard, Lewyn Li, Jolene Osterberger, Timothy T Harkins, Steven C Slater, Timothy J Donohue, and Cameron R Currie. An insect herbivore microbiome with high plant biomass-degrading capacity. PLoS genetics, 6(9):e1001129, sep 2010.

[103] Martin T Swain, Isheng J Tsai, Samuel A Assefa, Chris Newbold, Matthew Berriman, and Thomas D Otto. A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs. Nature protocols, 7(7):1260–84, jul 2012.

[104] Ben Temperton and Stephen J Giovannoni. Metagenomics: microbial diversity through a scratched lens. 2000.

[105] Todd J Treangen, Sergey Koren, Daniel D Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron E Darling, Adam M Phillippy, and Mihai Pop. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome biology, 14(1):R2, jan 2013.

[106] Todd J Treangen and Steven L Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature reviews. Genetics, 13(1):36–46, jan 2012.

[107] Andrew Tritt, Jonathan A Eisen, Marc T Facciotti, and Aaron E Darling. An integrated pipeline for de novo assembly of microbial genomes. PloS one, 7(9):e42304, jan 2012.

[108] Isheng J Tsai, Thomas D Otto, and Matthew Berriman. Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. 2010.

[109] J Craig Venter, Karin Remington, John F Heidelberg, Aaron L Halpern, Doug Rusch, Jonathan A Eisen, Dongying Wu, Ian Paulsen, Karen E Nelson, William Nelson, Derrick E Fouts, Samuel Levy, Anthony H Knap, Michael W Lomas, Ken Nealson, Owen White, Jeremy Peterson, Jeff Hoffman, Rachel Parsons, Holly Baden-Tillson, Cynthia Pfannkoch, Yu-Hui Rogers, and Hamilton O Smith. Environmental genome shotgun sequencing of the Sargasso Sea. Science (New York, N.Y.), 304(5667):66–74, April 2004. PMID: 15001713.

[110] Francesco Vezzi, Giuseppe Narzisi, and Bud Mishra. Feature-by-feature–evaluating de novo sequence assembly. PloS one, 7(2):e31002, jan 2012.

[111] Francesco Vezzi, Giuseppe Narzisi, and Bud Mishra. Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PloS one, 7(12):e52210, jan 2012.

[112] Riccardo Vicedomini, Francesco Vezzi, Simone Scalabrin, Lars Arvestad, and Alberto Policriti. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC bioinformatics, 14 Suppl 7(Suppl 7):S6, jan 2013.

[113] Xin Victoria Wang, Natalie Blades, Jie Ding, Razvan Sultana, and Giovanni Parmigiani. Estimation of sequencing error rates in short reads. pages 1–12, 2012.

[114] René L Warren, Granger G Sutton, Steven JM Jones, and Robert A Holt. Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 23(4):500–501, 2007.

[115] David A Wheeler, Maithreyan Srinivasan, Michael Egholm, Yufeng Shen, Lei Chen, Amy McGuire, Wen He, Yi-Ju Chen, Vinod Makhijani, G Thomas Roth, Xavier Gomes, Karrie Tartaro, Faheem Niazi, Cynthia L Turcotte, Gerard P Irzyk, James R Lupski, Craig Chinault, Xing-zhi Song, Yue Liu, Ye Yuan, Lynne Nazareth, Xiang Qin, Donna M Muzny, Marcel Margulies, George M Weinstock, Richard A Gibbs, and Jonathan M Rothberg. The complete genome of an individual by massively parallel DNA sequencing. Nature, 452(7189):872–876, April 2008. PMID: 18421352.

[116] David Williams, William L Trimble, Meghan Shilts, Folker Meyer, and Howard Ochman. Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC genomics, 14(1):537, jan 2013.

[117] Jan Witkowski. Long view of the human genome project. Nature, 466(7309):921–922, August 2010.

[118] K Eric Wommack, Jaysheel Bhavsar, and Jacques Ravel. Metagenomics: read length matters. Applied and environmental microbiology, 74(5):1453–1463, March 2008. PMID: 18192407.

[119] Xiao Yang, Sriram P Chockalingam, and Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in bioinformatics, 14(1):56–66, jan 2013.

[120] Xiao Yang, Karin S Dorman, and Srinivas Aluru. Reptile: representative tiling for short read error correction. Bioinformatics, 26(20):2526–2533, 2010.

[121] Guohui Yao, Liang Ye, Hongyu Gao, Patrick Minx, Wesley C Warren, and George M Weinstock. Graph accordance of next-generation sequence assemblies. Bioinformatics (Oxford, England), 28(1):13–16, January 2012. PMID: 22025481.

[122] Kevin Y Yip, Chao Cheng, and Mark Gerstein. Machine learning and genome annotation: a match meant to be? Genome biology, 14(5):205, jan 2013.

[123] Daniel R Zerbino and Ewan Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research, 18(5):821–9, may 2008.

[124] Jun Zhang, Rod Chiodini, Ahmed Badr, and Genfa Zhang. The impact of next- generation sequencing on genomics. Journal of genetics and genomics = Yi chuan xue bao, 38(3):95–109, mar 2011.

[125] Peisen Zhang, Eric A Schon, Stuart G Fischer, Janie Weiss, Susan Kistler, and Philip E Bourne. An algorithm based on graph theory for the assembly of contigs in physical mapping of DNA. 10(3):309–317, 1994.

[126] Wenyu Zhang, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, and Bairong Shen. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PloS one, 6(3):e17915, jan 2011.

[127] Xiao Zhu, Henry C. M. Leung, Rongjie Wang, Francis Y. L. Chin, Siu Ming Yiu, Guangri Quan, Yajie Li, Rui Zhang, Qinghua Jiang, Bo Liu, Yucui Dong, Guohui Zhou, and Yadong Wang. misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads. BMC Bioinformatics, 16(1):386, nov 2015.

[128] Aleksey V Zimin, Douglas R Smith, Granger Sutton, and James A Yorke. Assembly reconciliation. Bioinformatics (Oxford, England), 24(1):42–5, jan 2008.
