Host-Microbe Relations: a Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data
Total Page:16
File Type:pdf, Size:1020Kb
Host-Microbe Relations: A Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data Timothy Patrick Driscoll Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Genetics, Bioinformatics, and Computational Biology Joseph J Gillespie David R Bevan Madhav V Marathe T M Murali May 1st, 2013 Blacksburg, Virginia Keywords: phylogenomics, genome-mining, host-microbe interactions, genomics, bioinformatics, symbiosis, bacteria, lateral gene transfer Copyright 2013 Host-Microbe Relations: A Phylogenomics-Driven Bioinformatic Approach to the Characterization of Microbial DNA from Heterogeneous Sequence Data Timothy Patrick Driscoll ABSTRACT Plants and animals are characterized by intimate, enduring, often indispensable, and always complex associations with microbes. Therefore, it should come as no surprise that when the genome of a eukaryote is sequenced, a medley of bacterial sequences are produced as well. These sequences can be highly informative about the interactions between the eukaryote and its bacterial cohorts; unfortunately, they often comprise a vanishingly small constituent within a heterogeneous mixture of microbial and host sequences. Genomic analyses typically avoid the bacterial sequences in order to obtain a genome sequence for the host. Metagenomic analysis typically avoid the host sequences in order to analyze community composition and functional diversity of the bacterial component. This dissertation describes the development of a novel approach at the intersection of genomics and metagenomics, aimed at the extraction and characterization of bacterial sequences from heterogeneous sequence data using phylogenomic and bioinformatic tools. To achieve this objective, three interoperable workflows were constructed as modular computational pipelines, with built-in checkpoints for periodic interpretation and refinement. The MetaMiner workflow uses 16S small subunit rDNA analysis to enable the systematic discovery and classification of bacteria associated with a host genome sequencing project. Using this information, the ReadMiner workflow comprehensively extracts, assembles, and characterizes sequences that belong to a target microbe. Finally, AssemblySifter examines the genes and scaffolds of the eukaryotic genome for sequences associated with the target microbe. The combined information from these three workflows is used to systemically characterize a bacterial target of interest, including robust estimation of its phylogeny, assessment of its signature profile, and determination of its relationship to the associated eukaryote. This dissertation presents the development of the described methodology and its application to three eukaryotic genome projects. In the first study, the genomic sequences of a single, known endosymbiont was extracted from the genome sequencing data of its host. In the second study, a highly divergent endosymbiont was characterized from the assembled genome of its host. In the third study, genome sequences from a novel bacterium were extracted from both the raw sequencing data and assembled genome of a eukaryote that contained significant amounts of sequence from multiple competing bacteria. Taken together, these results demonstrate the usefulness of the described approach in singularly disparate situations, and strongly argue for a sophisticated, multifaceted, supervised approach to the characterization of host-associated microbes and their interactions. Dedication I dedicate this dissertation to my wife Charley, the most winsome and enchanting soul I've ever known, whose love, support, and unflagging encouragement never fail to humble and inspire me: Charley, may all of your days be sunny, may your ears be filled with love and laughter, and may your eyes some day alight upon a moose. To my daughter Samantha, who will always be the best of me, and who is, without exception or bias, the most beautiful girl in the world. To my as-yet-unborn daughter, who is the promise of tomorrow. To my mom and dad, a constant source of love and support, who never once in twenty-two years asked me when I was going to finish. To my brothers: Mark for lighting the way, Joe for exemplifying success through determination, and Jeff for showing me how to tread a little lighter. To my sisters: Lynn for being the big sister I always needed, and Sarah for jumping in with both feet. You are all an inspiration in your own way. I also thank the many people who have guided me along the path to this day. My high school biology teacher, Mr. Lane (a.k.a. Mr. Greenjeans), who made biology interesting and fun and without whom I would have been an English major. John Boyer, my master's thesis advisor, for teaching me that biochemistry isn't so scary after all, and for instilling in me a sense of enthusiasm for scientific exploration that remains as strong today as ever. Eric Martz, for helping me to see that science and creativity are necessarily intertwined. And the countless teachers, professors, collaborators, and colleagues who have contributed their time, knowledge, and guidance with no expectation of reward. They are a constant source of inspiration. Finally, I am wholeheartedly grateful for the many personal relationships that sustained me during my time in Blacksburg. The Reverend Doctor Bryan Lewis for his enduring friendship, the Beauty Loop (of course!), Stillers-Pats, and his utter Dudeness. Andrew Warren, Chris Lasher, Brian Gehrt, and Brian Yohn for their camaraderie and companionship through all manner of interesting adventures. To the Blue Ridge Mountains and the special places of solace therein, especially Pandapas Pond, Rock Castle Gorge, Douthat, Kelly's Knob, and Sinking Creek. Last but not least, I dedicate this dissertation to a famous man, a man whom I never met, but whose words and vision have been an inspiration nevertheless. "Somewhere, something incredible is waiting to be known." - Carl Sagan. iii Acknowledgements I acknowledge and appreciate the support of many people who helped enable the research presented in this dissertation, and to the GBCB program at Virginia Tech for the opportunity to carry out my graduate studies under their auspices. I am indebted to all of my committee members for their participation, advice, and patience throughout the development of this dissertation. Joe Gillespie, for his encouragement, mentorship, and collaborative spirit, for his infectious enthusiasm for biology in all its messy splendor, for always supporting my development as a scientist, and for his personal friendship. T. M. Murali and Madhav Marathe for their patience, helpful comments, and willingness to see this process through to its conclusion. David Bevan for his support and willingness to help carry this work across the final few months. Additionally, I acknowledge Bruno Sobral and the members of the Molecular Genetics Lab and Cyberinfrastructure Division (past and present), for their support, technical expertise, and many thought-provoking conversations. The IT department at VBI, for their prompt assistance on all things related to the high-performance computing aspects of this project. Eric Nordberg for contributing his phylogenomics expertise and software. Nicolas Lartillot, author of PhyloBayes, for his assistance in running the MPI version of his software. Finally, I am especially grateful to Dennie Munson, heart of the GBCB program, for her tireless dedication to helping her students meet the many administrative, academic, and personal challenges of being a graduate student. In recognition of her invaluable role in my graduate career, I will use a phrase that only Dennie could elicit from me: "Go Yankees!" iv Attribution Several colleagues aided in the writing and research behind one of the chapters presented as part of this dissertation. A brief description of their contributions is included here. Chapter 4: Bacterial DNA sifted from the Trichoplax adhaerens (Animalia: Placozoa) genome project reveals a putative rickettsial endosymbiont. Chapter 4 was published in the journal Genome Biology and Evolution with the following co- authors: Joseph J. Gillespie Ph.D. (Virginia Bioinformatics Institute, Virginia Tech) was an equal contributor to this manuscript, wrote much of the paper, and provided expert analysis of the rickettsial biology. Eric K. Nordberg (Virginia Bioinformatics Institute, Virginia Tech) was a co-author on this paper and provided the FastTree-based phylogenomic analyses. Abdu F. Azad (Department of Microbiology and Immunology, University of Maryland School of Medicine) was a co-author on this paper and principal investigator for one of the grants supporting this research. Bruno W. Sobral (Nestlé Institute of Health Sciences SA, Campus EPFL, Quartier de L'innovation, Batîment G, 1015 Lausanne, Switzerland) was a co-author on this paper and principal investigator for one of the grants supporting this research. v Table of Contents CHAPTER 1. Introduction. ......................................................................................................... 1 MOTIVATION ........................................................................................................................... 1 BACKGROUND ........................................................................................................................ 1 ADVANTAGES ........................................................................................................................