Structure-Aware M. Tuberculosis Functional Annotation Uncloaks Resistance, Metabolic, and Virulence Genes
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/358986; this version posted June 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 Structure-aware M. tuberculosis functional annotation 2 uncloaks resistance, metabolic, and virulence genes 3 4 Running title: Structure-aware M.tb annotation uncloaks key functions 5 6 aSamuel J Modlin, aAfif Elghraoui, aDeepika Gunasekaran, aAlyssa M Zlotnicki, bNicholas A Dillon, aNermeeta 7 Dhillon, aNorman Kuo, aCassidy Robinhold, aCarmela K Chan, bAnthony D Baughn, aFaramarz Valafar* 8 9 a Laboratory for Pathogenesis of Clinical Drug Resistance and Persistence, San Diego State University, San Diego, CA 92182 10 b Department of Microbiology and Immunology, University of Minnesota Medical School, Minneapolis, MN, 55455, USA 11 * Corresponding Author 12 13 14 ABSTRACT Accurate and timely functional genome annotation is essential for translating basic 15 pathogen research into clinically impactful advances. Here, through literature curation and 16 structure-function inference, we systematically update the functional genome annotation of 17 Mycobacterium tuberculosis virulent type strain H37Rv. First, we systematically curated 18 annotations for 589 genes from 662 publications, including 282 gene products absent from 19 leading databases. Second, we modeled 1,711 under-annotated proteins and developed a semi- 20 automated pipeline that captured shared function between 400 protein models and structural 21 matches of known function on protein data bank, including drug efflux proteins, metabolic 22 enzymes, and virulence factors. In aggregate, these structure- and literature-derived annotations 23 update 940/1,725 under-annotated H37Rv genes and generate hundreds of functional 24 hypotheses. Retrospectively applying the annotation to a recent whole-genome transposon 25 mutant screen provided missing function for 48% (13/27) of under-annotated genes altering 26 antibiotic efficacy and 33% (23/69) required for persistence during mouse TB infection. 27 Prospective application of the protein models enabled us to functionally interpret novel 28 laboratory generated Pyrazinamide-resistant (PZA) mutants of unknown function, which bioRxiv preprint doi: https://doi.org/10.1101/358986; this version posted June 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 29 implicated the emerging Coenzyme A depletion model of PZA action in the mutants’ PZA 30 resistance. Our findings demonstrate the functional insight gained by integrating structural 31 modeling and systematic literature curation, even for widely studied microorganisms. Functional 32 annotations and protein structure models are available at https://tuberculosis.sdsu.edu/H37Rv 33 in human- and machine-readable formats. 34 35 IMPORTANCE Mycobacterium tuberculosis, the primary causative agent of tuberculosis, kills 36 more humans than any other infectious bacteria. Yet 40% of its genome is functionally 37 uncharacterized, leaving much about the genetic basis of its resistance to antibiotics, capacity to 38 withstand host immunity, and basic metabolism yet undiscovered. Irregular literature curation 39 for functional annotation contributes to this gap. We systematically curated functions from 40 literature and structural similarity for over half of poorly characterized genes, expanding the 41 functionally annotated Mycobacterium tuberculosis proteome. Applying this updated annotation 42 to recent in vivo functional screens added functional information to dozens of clinically pertinent 43 proteins described as having unknown function. Integrating the annotations with a prospective 44 functional screen identified new mutants resistant to a first-line TB drug supporting an emerging 45 hypothesis for its mode of action. These improvements in functional interpretation of clinically 46 informative studies underscores the translational value of this functional knowledge. Structure- 47 derived annotations identify hundreds of high-confidence candidates for mechanisms of 48 antibiotic resistance, virulence factors, and basic metabolism; other functions key in clinical and bioRxiv preprint doi: https://doi.org/10.1101/358986; this version posted June 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 49 basic tuberculosis research. More broadly, it provides a systematic framework for improving 50 prokaryotic reference annotations. 51 KEYWORDS Mycobacterium tuberculosis, annotation, structure, virulence factors, functional 52 genomics, pyrazinamide, resistance 53 54 INTRODUCTION 55 Manual curation remains the gold standard for annotating function from literature(1), yet 56 requires massive effort from highly specialized researchers. UniProt curators alone evaluate over 57 4,500 papers each year(1). Literature annotation is typically complemented with functional 58 inference by sequence homology, but this approach fails to identify distant relatives (remote 59 homologs) or convergently evolved proteins of shared function (structural analogs). 60 These challenges hinder the study of Mycobacterium tuberculosis, the etiological agent of 61 tuberculosis. The M. tuberculosis virulent type strain is H37Rv, a descendant of strain H37, was 62 isolated from a pulmonary TB patient in 1905 and kept viable through repeated subculturing(2). 63 Following sequencing of the H37Rv genome, function was assigned to 40% of its open reading 64 frames (ORFs)(3), then expanded to 52% in 2002 following re-annotation(4). New annotations 65 continued to be added by TubercuList (now part of Mycobrowser, https://mycobrowser.epfl.ch/) 66 until March 2013. To date, one-quarter of the H37Rv genome (1,057 genes) lacks annotation 67 entirely, listed in “conserved hypotheticals” or “unknown” functional categories and hundreds 68 more minimally describe product function (e.g. “possible membrane protein”). Though other 69 databases have emerged in recent years(5–9) Mycobrowser remains the primary resource for bioRxiv preprint doi: https://doi.org/10.1101/358986; this version posted June 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 70 gene annotation for TB researchers(10) yet lacks functional characterizations reported in the 71 literature. 72 Moreover, many proteins key to M. tuberculosis pathogenesis are challenging to ascribe 73 function by sequence similarity. For instance, transport proteins—many of which allow M. 74 tuberculosis to tolerate drug exposure by effluxing drug out of the cell(11)—have membrane- 75 embedded regions under relaxed constraint compared to globular proteins and diverge in 76 sequence more rapidly as a result(12). This rapid divergence challenges their characterization 77 through homology. Limitations of sequence-based approaches to detect and annotate M. 78 tuberculosis proteins motivates an alternative approach to annotating M. tuberculosis gene 79 function. 80 One alternative approach is identifying protein homologs and analogs through shared 81 structure, which offers considerable advantages. First, it removes bias toward a priori 82 assumptions by not limiting search space to evolutionarily close relatives, facilitating discovery 83 of functions unexpected by conventional predictions. Second, structure-based functional 84 annotation can infer analogy between proteins structures. This ability to detect analogy is 85 especially valuable for inferring function at the host-pathogen interface, which is challenging to 86 recapitulate in the laboratory. Moreover, analogous relationships between proteins of shared 87 structure/function cannot be resolved by sequence homology because they often evolve 88 convergently with low amino acid (AA) similarity rather than from common ancestry (13). 89 Iterative Threading ASSEmbly Refinement (I-TASSER)(14), builds three-dimensional protein 90 structure from sequence through multiple threading alignment of Protein Data Bank (PDB)(15) bioRxiv preprint doi: https://doi.org/10.1101/358986; this version posted June 30, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 91 templates, followed by iterative fragment assembly simulations. I-TASSER accurately predicts 92 structure (16–20), provides metrics for model quality(21) (C-score) and pairwise structural 93 similarity(22) (TM-score), and integrates function and structure prediction tools(23) (COACH and 94 COFACTOR) comprising Gene Ontology (GO) terms(24), Enzyme Commission (EC) numbers(25), 95 and Ligand Binding Sites (LBS)(26). 96 EC numbers and GO terms partially or completely define gene function and are widely 97 incorporated into mainstream databases. EC numbers describe catalytic function