The University of Chicago a Dissertation Submitted To

THE UNIVERSITY OF CHICAGO A DISSERTATION SUBMITTED TO THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES IN CANDIDACY FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF DEPARTMENT OF COMPUTER SCIENCE BY CHRISTOPHER BUN CHICAGO, ILLINOIS NOVEMBER 8, 2016 Copyright © 2016 by Christopher Bun All Rights Reserved To My Family And Friends Epigraph Table of Contents LIST OF FIGURES . vii LIST OF TABLES . ix ABSTRACT . x 1 INTRODUCTION . 1 1.1 Motivation . .2 1.2 Research Contributions . .2 2 BACKGROUND . 4 2.1 Genome Sequencing: Techniques and Data Profiles . .4 2.1.1 Sequencing By Synthesis . .5 2.1.2 Oligo-ligation Detection . .7 2.1.3 Single Molecule and Nanopore Sequencing . .8 2.1.4 Sequencing Data . 10 2.2 Genome Assembly . 11 2.2.1 Challenges . 12 2.2.2 Algorithms for De Novo Genome Assembly . 15 2.3 Computing Systems . 20 2.3.1 Scientific Compute Services . 20 3 ASSEMBLYRAST FRAMEWORK . 22 3.1 The WASP Language for Computational Workflows . 22 3.1.1 Specification . 23 3.2 Implementation . 31 3.2.1 Interface Generalization . 31 3.2.2 Rapid Pipeline Design . 32 3.2.3 Universal Hyperparameter Search Driver . 34 3.2.4 Logic-Driven Assembly . 35 3.2.5 Data Types . 36 3.2.6 Analysis Framework . 37 3.3 System Design and Infrastructure . 38 3.3.1 AssemblyRAST Control Plane . 38 3.3.2 Workers . 41 3.3.3 Availability . 44 4 ASSEMBLER PROFILING AND OPTIMIZATION . 47 4.0.1 Data . 47 4.0.2 Evaluation Metrics . 48 4.0.3 Programs . 49 4.0.4 Comparison . 50 v 4.0.5 Methods . 52 4.0.6 Discussion . 54 4.1 Pipelines . 55 4.1.1 Preprocessing . 55 4.1.2 Postprocessing . 56 4.1.3 Results . 57 5 INTEGRATIVE ASSEMBLY ALGORITHMS . 60 5.0.1 Integrative Pipelines . 60 5.1 Hyperparameter Optimization . 61 5.2 Block Construction and Merging . 63 5.3 A Self-Tuning Ensemble De Novo Assembly Pipeline . 65 5.4 Discussion . 66 6 REFERENCE-INDEPENDENT ASSEMBLY ERROR CLASSIFICATION LEARN- ING ........................................... 68 6.1 Background and Motivation . 68 6.1.1 Hard and Soft Genomic Variation Types . 69 6.1.2 Statistical Approach to De Novo Assembly Evaluation . 70 6.1.3 Supervised Classification . 71 6.2 A Novel Implementation of Error Classification Using Gradient Boosting Trees 74 6.2.1 Dataset Generation . 75 6.2.2 Assembly Setup . 75 6.2.3 Preprocessing . 76 6.2.4 Data Labeling . 78 6.2.5 Discussion and Feasibility . 80 6.3 Results . 88 6.3.1 Training Model . 88 6.3.2 Feature Engineering and Extraction . 89 6.3.3 Model Performance and Hyperparameter Tuning . 92 6.3.4 Detection of Major Errors Produced by Assemblers . 96 6.4 Discussion . 102 7 APPLICATIONS OF AN ACCURATE DE NOVO ASSEMBLY EVALUATION PRO- FILE . 104 7.1 Error Removal and Contiguity Metric Correction . 104 7.1.1 Contig Splitting . 104 7.1.2 Corrected Statistical Measures . 104 7.2 Likelihood and Scoring Framework For Systematic Evaluation . 106 7.3 Error Prediction Strategy for Assembly Reconcilliation Algorithms . 111 8 CONCLUSION . 114 8.1 Future . 115 References . 116 vi List of Figures 2.1 NGAx plot of V. cholera assembly by Velvet parameter sweep . 18 2.2 3-mer De Bruijn Graph . 19 2.3 4-mer De Bruijn Graph . 19 3.1 Flowchart of a read-specific assembly workflow . 29 3.2 Pipeline Branching . 34 3.3 Parameter Sweeps With Pipeline Combinations . 35 3.4 The Assembly RAST Infrastructure . 39 3.5 Relative ALE Scores of V. cholera assembly . 45 3.6 AssemblyRAST Web Interface . 46 3.7 AssemblyRAST Web Interface facillitates user-friendly pipeline design . 46 4.1 NGAx plot of V. cholera assembly . 52 5.1 NGAx plot of V. cholera assembly by Velvet parameter sweep . 62 5.2 Velvet Hash Length vs. NGA50 Score on Rsp HiSeq and MiSeq . 63 5.3 NGA50 scores of pairwise mergings of V. Cholerae assemblies . 64 5.4 Smart Pipeline . 67 6.1 An example decision tree . 73 6.2 Generation of training data by AssemblyML workflow . 79 6.3 An example of a non-smooth coverage pattern over a misassembly . 90 6.4 Discrepancies in coverages between flanking regions . 91 6.5 Misassemblies vs. Average Contig Coverage . 92 6.6 Contig end regions predicted as misassemblies in Spades Assembly of Singulisphaera acidiphila, but not classified as a major misassembly by QUAST . 94 6.7 Feature Importances . 94 6.8 Prediction Outcomes for 300 Genome Readsets Simulated . 95 6.9 AssemblyML Prediction Outcomes For Velvet Assembly of Singulisphaera acidiphila 96 6.10 REAPR FCD Prediction Outcomes For Velvet Assembly of Singulisphaera acidiphila 97 6.11 Correct Misassembly Classification Velvet Assembly of Singulisphaera acidiphila 98 6.12 Quast defines misassemblies in which inconsistencies are shorter than 1000 base- pairs as local misassemblies. The trained model predicts this missassembly (as shown by the black bar) . 99 6.13 Velvet Scaffolding Technique Flagged as Misassembly . 100 6.14 AssemblyML Prediction Outcomes For Spades Assembly of Singulisphaera acidiphila101 6.15 REAPR FCD Prediction Outcomes For Spades Assembly of Singulisphaera acidiphila101 6.16 Mapping Quality Anomalies in Spades Assembly of Singulisphaera acidiphila . 102 7.1 Regions with misassembly from Spades and Velvet assemblies (A, C) compared with their contigs broken at predicted loci (B, D). Red represents contigs that contain true misassemblies. 105 vii 7.2 The Error Response Curve captures the trade-off between contiguity and correct- ness. 109 7.3 An improved assembly pipeline using AssemblyML-guided contig breaking and ERC metrics as preceeding steps to block merging. 113 viii List of Tables 2.1 Profiles of Major Next Generation Sequencing Platforms . 11 2.2 FastA Base Pair Codes . 14 3.1 Common Wasp/Lisp Supported Expressions . 25 3.2 Wasp contains three types of specialized extensions: Type conversion, data analysis, and framework-level functions. 28 4.1 Comparison of NGA50 assembly scores for various genomes. Best scores are bolded. 51 4.2 Comparison of misassemblies for various genomes. 51 4.3 Various statistics for the assembly of R. spaeroides HiSeq data. IDBA-ID gen- erated the most contiguous set while also performing the least amount of misassemblies. 53 4.4 Effects of preprocessing on V. cholera NGA50. The best scores per assembler are shown in bold. Fields with ’-’ indicate.

Load more