Pseudoalignment for Metagenomic and Metatranscriptomic Read Assignment

Pseudoalignment for metagenomic and metatranscriptomic read assignment By Lorian Victoria Schaeffer A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Molecular and Cell Biology in the Graduate Division of the University of California, Berkeley Committee in Charge: Professor Lior Pachter, Chair Professor Gregory Barton Professor Nicholas Ingolia Professor Kimmen Sjolander Fall 2016 Abstract Pseudoalignment for metagenomic and metatranscriptomic read assignment By Lorian Victoria Schaeffer Doctor of Philosophy in Molecular and Cell Biology University of California, Berkeley Professor Lior Pachter, Chair The first step in many metagenomic and metatranscriptomic analysis workflows is assigning high-throughput sequencing reads to specific strains or transcripts, providing the basis for identification and later quantification. However, the high degree of similarity between the sequences of many strains and genes makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to more general taxonomic levels where they are unambiguous. Recent developments in RNA-Seq analysis have found direct-match k-mer based methods to be extremely accurate and fast when comparing sequenced RNA-Seq reads to transcriptomes. While similar methods have been used in metagenomics before now, none are highly accurate at distinguishing similar strains, and none have been applied to metatranscriptomic data. We explore connections between metagenomic and metatranscriptomic read assignment and the quantification of transcripts from RNA-Seq data to develop novel methods for rapid and accurate quantification of microbiome strains and transcripts. We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics and metatranscriptomics settings as well. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics and metatranscriptomics projects. 1 Table of Contents Table of Contents ............................................................................................................. i List of Figures ................................................................................................................ iii List of Tables .................................................................................................................. v Acknowledgements ........................................................................................................ vi Chapter 1: History of abundance estimation .................................................................. 1 1.1 Metagenomics ....................................................................................................... 1 1.2 k-mer based estimation ......................................................................................... 6 1.3 Metatranscriptomics .............................................................................................. 9 Chapter 2: Pseudoalignment for metagenomic read assignment .................................. 12 2.1 Introduction ......................................................................................................... 12 2.2 Results ................................................................................................................. 13 2.3 Methods............................................................................................................... 18 2.4 Conclusions ......................................................................................................... 19 Chapter 3: K-mer based metatranscriptome analysis.................................................... 21 3.1 Introduction ......................................................................................................... 21 3.2 Results ................................................................................................................. 22 3.3 Methods............................................................................................................... 32 3.4 Conclusions ......................................................................................................... 34 Chapter 4: Concluding remarks on low-memory k-mer indexing improvements ........ 36 References ..................................................................................................................... 39 Appendix A: Notes on collecting microbiome samples from D. melanogaster guts .... 47 A.1 Gut dissection ..................................................................................................... 47 A.2 DNA/RNA extraction ........................................................................................ 48 i A.3 Microbial mRNA enrichment ............................................................................ 50 ii List of Figures Figure 1.1. A phylogenetic tree of the V4 region of the 16S rRNA gene .......................... 3 Figure 1.2. Drop in sequencing cost per megabase since 2001 .......................................... 4 Figure 1.3. MEGAN’s use of lowest common ancestor during read assignment ............... 6 Figure 2.1. Results of kallisto on simulated reads from the Ensembl dataset at the exact genome level. ................................................................................................................................ 15 Figure 2.2. Comparison of species-level abundance estimation between metagenomic programs ....................................................................................................................................... 16 Figure 2.3. Results of kallisto on bacterial reads in human saliva samples at all taxonomic levels. ............................................................................................................................................ 17 Figure 3.1. Estimated counts at strain level aligned against present transcriptomes ........ 22 Figure 3.2. Estimated counts of transcripts in simulated data .......................................... 23 Figure 3.3. Estimated counts at species level aligned against representative transcriptomes. .............................................................................................................................. 24 Figure 3.4. Estimated counts at species level aligned against representative genomes .... 25 Figure 3.5. Estimated counts at strain level aligned against pre-filtered transcriptomes . 26 Figure 3.6. Estimated counts of human gut metatranscriptome at species level aligned against representative genomes..................................................................................................... 27 Figure 3.7. Estimated counts of human gut metatranscriptome at genus level aligned against pre-filtered transcriptomes ................................................................................................ 28 Figure 3.8. Estimated counts of human gut metagenome at genus level aligned against pre-filtered transcriptomes ............................................................................................................ 29 Figure 3.9. Estimated relative abundance of top genera in human gut metagenomes. ..... 30 Figure 3.10. Percentage of kallisto-estimated human gut microbiome transcripts assigned to listed KEGG functional pathways ............................................................................................ 31 Figure 3.11. Estimated percentage of genes present in KEGG functional pathways, as estimated by COMAN .................................................................................................................. 32 Figure A.1 Dissected third instar larva gut ....................................................................... 48 iii Figure A.2 Bioanalyzer trace of RNA sample extracted from gut. .................................. 49 Figure A.3 Bioanalyzer trace of DNA sample extracted from gut. .................................. 49 iv List of Tables Table 1.1. Performance of selected metagenomic read assignment tools.. ......................... 7 Table 2.1. Normalized count-based classification accuracy at four taxonomic ranks. ..... 17 v Acknowledgements Thanks to my friends and family, whose support has made the completion of this dissertation possible. Special thanks to my advisor, Lior Pachter, for his encouragement and boundless enthusiasm throughout my research, whether I had results or not. While everyone in the Pachter lab has been a delight, I am especially grateful for Shannon Hateley, who gave me considerable support during my wetlab trials, and Isaac Joseph, who helped keep me moving forward when nothing was working. I also enormously appreciate the help of Harold Pimentel, Nicolas Bray, and Páll Melsted, who were ever helpful in getting kallisto to do what I needed it to do. Outside the Pachter lab, thanks go to Carolyn Elya, whose experience with Drosophila gut microbes was invaluable to my own investigations. Thanks also to Andrew Toseland, without whose simulated metatranscriptome data I would have missed a lot of interesting results. Finally, thanks to Diya Das for her significant help with editing everything, ever. vi Chapter 1: History of abundance estimation in metagenomics and metatranscriptomics 1.1 Metagenomics Microbial ecology,

Pseudoalignment for Metagenomic and Metatranscriptomic Read Assignment

Functional Profiling of the Human Gut Microbiome Using Metatranscriptomic Approach

BIOGRAPHICAL SKETCH NAME: Berger

Metatranscriptomics to Characterize Respiratory Virome, Microbiome, and Host Response Directly from Clinical Samples

Microbes and Metagenomics in Human Health an Overview of Recent Publications Featuring Illumina® Technology TABLE of CONTENTS

Bioinformatics Tools for RNA-Seq Gene and Isoform Quantification

Lior Pachter Genome Informatics 2013 Keynote

Modeling and Analysis of RNA-Seq Data: a Review from a Statistical Perspective

The Chromosome-Centric Human Proteome Project for Cataloging Proteins Encoded in the Genome

Downloaded As a CSV Dump ﬁle

Metagenomics and Metatranscriptomics

Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction

Trees, Fungi and Bacteria: Tripartite Metatranscriptomics of a Root Microbiome Responding to Soil Contamination E