Maintaining Protein Localization, Structure, and Functional
Total Page:16
File Type:pdf, Size:1020Kb
University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 8-2020 Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function Alexander Cope University of Tennessee Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss Recommended Citation Cope, Alexander, "Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function. " PhD diss., University of Tennessee, 2020. https://trace.tennessee.edu/utk_graddiss/6796 This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Alexander Cope entitled "Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the equirr ements for the degree of Doctor of Philosophy, with a major in Life Sciences. Michael Gilchrist, Robert Hettich, Major Professor We have read this dissertation and recommend its acceptance: Albrecht von Arnim, Steven Abel Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School (Original signatures are on file with official studentecor r ds.) Maintaining Protein Localization, Structure, and Functional Interactions via Codon Usage and Coevolution of Gene Expression: Combining Evolutionary Bioinformatics with Omics-scale Data to Test Hypotheses Related to Protein Function A Dissertation Presented for the Doctor of Philosophy Degree The University of Tennessee, Knoxville Alexander Lloyd Cope August 2020 Copyright c by Alexander Lloyd Cope, 2020 All Rights Reserved. ii This is dedicated to my parents, William and Sharleen Cope, for supporting and loving me throughout my academic career. iii Acknowledgments First, I would like to thank my advisors, Dr. Michael Gilchrist and Dr. Robert Hettich. When I proposed this collaboration 5 years ago, I think all of us viewed it as a \high risk, high reward" scenario. I could not be more pleased with how it turned out. Both of you never failed to challenge me intellectually, especially when it came to contextualizing my thoughts in terms of biology, which was undoubtedly my weakest area coming into graduate school. You both also helped me strike a good balance of exploring new ideas, while also giving enough attention to projects that needed to be completed. I would also like to thank my committee, Dr. Albrecht von Arnim and Dr. Steven Abel for their support and feedback over the years. I would also like to extend a special thanks to Dr. Brian O'Meara for serving as a sounding board for many of my ideas over the years. I'm also grateful for the financial support provided by Genome Science and Technology, NIMBioS, the University of Tennessee, Oak Ridge National Laboratory. I would like to thank all of the friends I've made over the past few years. They have been a source of sanity during graduate school, especially when things were not going well with research. Grabbing dinner and drinks together are among my fondest memories of graduate school. I would like to thank my parents, William and Sharleen Cope, and my brother, Austen Cope. They have never failed to love and support me during the past 27 years. They have made many personal sacrifices in order to help put me in a position to succeed. I would likely not be pursuing a PhD if not for them. Finally, I would like to thank Elena for her constant love and support over the course of graduate school. Without her, I'm not sure I would have had the mental or emotional iv strength to make it through to the end. Although she always supported me in my work, she also made sure I would stop long enough to enjoy life, especially during times of stress. v Abstract A major challenge of the omics-era is identifying how a protein functions, both in terms of its specific function and within the context of the various biological processes necessary for the cell's survival. Key elements necessary for a protein to perform its function are efficient and accurate protein localization, protein folding, and interactions with other proteins. Previous work implicated codon usage as a means to modulate protein localization and folding. Using a mechanistic model rooted in population genetics, I examine potential selective differences in codon usage in signal peptides (localization) and protein secondary structures. Although previous work argued signal peptides were under selection for increased translation inefficiency, I find selection is generally consistent with the 5'-regions of non- secreted proteins. I also find that previous work was likely confounded by biases in signal peptide amino acid usage and gene expression. Although the direction of selection on codon usage is mostly consistent between protein secondary structures, the strength of this selection does vary for certain codons. After successful folding and localization of a protein, it must be able to function within the context of other proteins in the cell, often through protein-protein interactions of metabolic pathways. Previous work suggests proteins which are part of the same functional processes within a cell are co-expressed across time and environmental conditions. Using the concept of guilt-by-association, I combine empirical protein abundances (measured via mass spectrometry) with sequence homology based function prediction tools to identify potential functions of proteins of unknown function in C. thermocellum. Building upon the concept that functionally-related genes are co- expressed within a species, I demonstrate how phylogenetic comparative methods can be used to detect signals of gene expression coevolution across species while accounting for the shared ancestry of the species in question. vi Table of Contents 1 Introduction to the relationship between codon usage, gene expression, and protein function1 1.1 Codon usage bias.................................2 1.1.1 Evidence for selection for translation efficiency.............2 1.1.2 Evidence for selection for translation accuracy.............6 1.1.3 Quantifying codon usage bias......................8 1.1.4 Codon usage and protein biogenesis................... 10 1.2 Functional Omics: using transcriptomics and proteomics to detect functionally- related genes................................... 15 1.2.1 Detecting signals of functional-relatedness across species....... 17 1.3 Summary..................................... 20 2 Review of methods 21 3 Quantifying codon usage in signal peptides: Gene expression and amino acid usage explain apparent selection for inefficient codons 29 3.1 Introduction.................................... 29 3.2 Materials and Methods.............................. 33 3.2.1 Signal Peptide Prediction........................ 33 3.2.2 ROC-SEMPPR.............................. 33 3.2.3 CAI and tAI................................ 36 3.2.4 Generating Datasets........................... 36 3.2.5 Analysis of Codon Usage with CAI, tAI, and ROC-SEMPPR..... 39 vii 3.3 Results....................................... 40 3.4 Discussion..................................... 58 3.5 Conclusions.................................... 61 3.6 Supporting Information.............................. 61 3.6.1 Assessing ROC-SEMPPR Model Adequacy............... 61 3.6.2 Maximum-Likelihood Model-II Regression Approach......... 62 4 Quantifying shifts in natural selection on codon usage between protein regions: A population genetics approach 64 4.1 Introduction.................................... 64 4.2 Materials and Methods.............................. 66 4.2.1 Protein secondary structures....................... 66 4.2.2 Analysis with ROC-SEMPPR...................... 67 4.3 Results....................................... 68 4.3.1 Comparing selection on codon usage between secondary structures.. 72 4.3.2 Intrinsically-disordered regions have a minor impact on results.... 77 4.3.3 Simulations suggest differences in codon usage reflect real biological signals................................... 80 4.4 Discussion..................................... 81 4.5 Conclusions.................................... 84 5 An integrated guilt-by-association approach to characterize proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 as potential genetic engineering targets 85 5.1 Introduction.................................... 86 5.2 Materials and Methods.............................. 88 5.2.1 Bacterial Strains and Culture Conditions................ 88 5.2.2 Proteome analyses using LC-MS/MS.................. 88 5.2.3 MS database searching, data analysis and interpretation....... 89 5.2.4 Generation of co-expression networks.................. 89 5.2.5 Sequence-based functional predictions.................