Integrated Assembly and Annotation of Fathead Minnow Genome Towards Prediction of Environmental Exposures

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Biomedical Engineering

by John Martinson
M.S. University of Cincinnati
March 2020

Committee Chair: Jaroslaw Meller, Ph.D.

Abstract

The fathead minnow (FHM, Pimephales promelas) is a temperate freshwater fish whose geographic range extends throughout much of North America and which is widely used as a model organism for aquatic toxicity testing. Our team at the Environmental Protection Agency produced a new FHM assembly, which served as the foundation for accomplishing the aims of this project. Because the aims depend on the quality of the underlying assembly, the generation of the new assembly is presented in this dissertation, though it was not a specific aim.

The first aim of this research project was to annotate the protein coding genes in a new FHM genome. A comprehensive set of 26,150 gene models was produced to facilitate the analysis of RNA-seq expression profiles derived from exposures of P. promelas subjects to chemicals and other stressors.

The second aim of the project was to demonstrate the application and utility of the new gene models by using RNA-seq data generated in controlled exposure experiments to identify differentially expressed genes (DEGs) as markers of exposure. FHM were exposed to two chemicals with different modes of toxicity: the pyrethroid pesticide bifenthrin and copper. The new gene models were used to quantify mRNA expression levels, and statistical and machine learning techniques were applied to develop lists of DEGs between treated and untreated samples.

The third aim of the study was to develop predictors of exposure from the data, using machine and statistical learning methods to combine the obtained markers into exposure signatures and optimize the predictive power of the resulting exposure classifiers. As part of these experiments, five different classifiers were evaluated using a cross-validation framework. Classifiers were able to distinguish treated samples from controls and were then applied to samples treated with the other chemical to evaluate how the classifiers performed when faced with an exposure scenario different from the one for which they were trained.

Assessment of the genome and gene models in terms of both BUSCO coverage and RNA-seq mapping rates shows that the new assembly and gene models not only represent a significant improvement over the previously published FHM assembly and gene annotations, but also compare very favorably with those of the highly studied and closely related zebrafish (Danio rerio). Given the mature state of the zebrafish genome, the FHM results presented here represent a significant success. Further validation was provided by the successful use of the new gene models in the bifenthrin/copper exposure study. For each of the two toxicants studied, successful classifiers of exposure could be developed from a variety of approaches based on mapping RNA-seq data to the new gene models. Functional analysis of the differentially expressed genes (DEGs) leveraged by the classifiers indicated that toxicant-specific responses at the gene level appeared to drive the ability to correctly classify samples. Glm elastic net (“glmnet”) and random forest showed the most promise of being able to avoid false positive classifications in the cross-chemical testing.

Acknowledgements

I would like to thank my academic advisor Dr.
Jaroslaw Meller for the guidance and support he has provided to me as I’ve been on this long and winding path, which is certainly the one less taken. Jarek saw something in me that made him decide to encourage me to follow through on the academic pursuits I chose to take up as a forty-something-year-old and has stuck with me, and gone to bat for me, for a long time. Thank you, Jarek. You’ve been a great mentor.

I’d also like to thank my former EPA colleague Dr. Mitchell Kostich. Mitch taught me so much while he was around at EPA that I can’t even begin to describe the full benefits I derived from working with him. Whether (or not) I wanted to know about ancient religions, oppressed Muslims in China, early Pink Floyd, biology, computing or statistics, Mitch was my go-to guy. We miss him a lot at the EPA, and I would not be at this point without Mitch having led our efforts to sequence the fathead minnow genome. Mitch was another great mentor and teacher.

There are a bunch of other people at the EPA and UC who I would like to thank for their support and counsel over the years. Given the duration of my journey, there have been a lot. I will start with the other members of my dissertation committee, Dr. Daria Narmoneva, Dr. Marepalli Rao, and my current EPA supervisor, Dr. Adam Biales. Drs. Narmoneva and Rao were great additions to my committee who helped “keep me honest” by asking good questions and providing useful insights and good advice that helped me move forward. The support Adam supplied to me, particularly as I came down the “home stretch,” was great and influenced my being able to finish up as much as any factor. Dr. Greg Toth at EPA, who ceded the protein coding gene annotation responsibility to me, and in so many other ways helped me along the way, also deserves special thanks. Weichun Huang’s tireless efforts to generate the best assembly possible also must be recognized. His work set the basis for my success.
Barbara Carter at the UC CEAS Graduate Office has also been incredibly helpful and supportive over the years and deserves special thanks too.

Many others helped and/or supported me along the way in some way or other. In no real order, others I’d like to thank include Pete Kauffman, Lora Johnson, Dr. David Bencic, Bob Flick, Dr. Rong-Lin Wang, Denise Gordon, Mary Jean See, Dr. David Lattier, Dr. Florence Fulk, Dr. Mark Bagley, Dr. Eric Waits, Dr. Roy Martin, Janine Fetke, Dr. Erik Pilgrim, Sara Okum, Dr. John Darling, Chris Bourk, Margie Vazquez and Dr. Jing-Huei Lee. There are just too many others to thank by name. So to all of you, thanks.

Almost last, and certainly not least, I’d like to dedicate this dissertation to my dearly departed parents, who I know would be proud of me, and to my siblings and extended family including my in-laws Nancy and Tom. I also want to say to my children, Alex, Ellen and Lily, who I love immensely, “Guess what kids? I finally did it! Whatever you might choose to do, try to do it sooner! But also, don’t give up!”

Finally, I’d like to thank my wife Beth for all her patience, understanding and love. Putting up with me isn’t always easy, especially when I’m stressed out about overdue academic pursuits and don’t have time to do other things. Thanks for everything Beth. Love you then, now and always.

Table of Contents

Abstract
Acknowledgements
Chapter 1. Introduction and Aims
    1.1. Motivation
    1.2. Background
    1.3. Specific Aims
        Specific Aim #1. Develop a comprehensive set of annotated protein-coding gene models for Pimephales promelas using the improved FHM genome assembly
        Specific Aim #2. Identify differentially expressed transcripts from the set of new gene models developed as aim 1 as markers of exposure in an RNA-seq exposure study
        Specific Aim #3. Develop and assess statistical and machine learning predictors of exposure from the RNA expression profiles developed as aim 2
Chapter 2. Generation of a New FHM Assembly
    2.1. Introduction and Background
        2.1a. DNA sequencing
        2.1b. Genome assembly
    2.2. Materials and methods
        2.2a. DNA sequencing
        2.2b. Assembly and scaffolding
        2.2c.