Structure-Based Genome Scale Function Prediction and Reconstruction

of the Mycobacterium tuberculosis Metabolic Network

Mariam Konaté

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2014

© 2014

Mariam Konaté

All rights reserved

ABSTRACT

Structure-Based Genome Scale Function Prediction and Reconstruction

of the Mycobacterium tuberculosis Metabolic Network

Mariam Konaté

Due to vast improvements in sequencing methods over the past few decades, the availability of genomic data is rapidly increasing, thus bringing about the need for functional characterization tools. Considering the breadth of data involved, functional assays would be impractical and only a computational method could afford fast and cost-effective functional annotations. Therefore, homology-based computational methods are routinely used to assign putative molecular functions that can later be confirmed with targeted experiments. These methods are particularly well suited to predict the function of enzymes because most metabolic pathways are conserved across organisms. However, the current methods have limitations, especially when considering enzymes that have very low sequence and structure homology to well-annotated enzymes.

We hypothesized that two enzymes with the same molecular function share significant sequence homology in the region surrounding the active site, even if they appear diverged at the global sequence level. First, we have investigated the limits of sequence and structure conservation for enzymes with the same function during divergent evolution. The goal of this was to determine the sequence identity threshold beyond which functional annotations should not be transferred between two sequences; that is, the level of homology beyond which the pair of proteins would not be expected to have the same function. Our analysis, which compares several models of sequence evolution, shows that the sequences of orthologous proteins catalyzing the same reaction rarely diverge beyond 30 % identity, even after approximately 3.5 billion years of evolution. As for structure conservation, enzymes catalyzing the same reactions rarely diverge beyond 3 Å root-mean-square distance (RMSD). We have also explored sequence conservation constraints as a function of the distance to the active site. Although residues closer to the protein active site (within a radius of 10 Å around the catalytic residues) are mutating significantly slower, the requirement to preserve the molecular function also constrains residues at other parts of the protein.

From these results, we have developed a structure-based function prediction method where we employ active site conservation in addition to global sequence homology for functional characterization. We then integrated this method with a probabilistic whole-genome function prediction framework previously developed in the Vitkup group, GLOBUS. The original version of GLOBUS uses sampling of probability space to assign functions to all putative metabolic genes in an input genome by considering sequence homology to known enzymes, gene-gene context and EC co-occurrence. Applying this novel method to the whole-genome metabolic reconstruction of Mycobacterium tuberculosis, we made several novel predictions for genes with apparent links to pathogenesis. Notably, our predictions allowed us to reconstruct the cholesterol degradation pathway in M. tuberculosis, which has been implicated in bacterial persistence in the literature but remains to be fully characterized. This pathway is absent from previously published metabolic models of M. tuberculosis. Our new model can now be used to simulate different environments and conditions in order to gain a better understanding of the metabolic adaptability of M. tuberculosis during pathogenesis.

TABLE OF CONTENTS

LIST OF FIGURES...... v

LIST OF TABLES...... viii

ACKNOWLEDGMENTS...... ix

Chapter 1 – INTRODUCTION ...... 1

1.1 MOTIVATION ...... 1

1.2 ORTHOLOGY AND DEFINING FUNCTION ...... 3

1.2.1 Orthology ...... 3

1.2.2 The Gene Ontology Project ...... 5

1.2.3 The Enzyme Commission Number ...... 5

1.2.4 Other Classification Systems ...... 7

A. Clusters of orthologous groups ...... 7

B. Protein domain families: Pfam and PROSITE ...... 8

C. Structural classification: SCOP and CATH ...... 9

1.3 ENZYMES AND METABOLISM ...... 10

1.4 SEQUENCE DATABASES AND FUNCTIONAL ANNOTATIONS ...... 11

1.4.1 The Protein Data Bank ...... 11

1.4.2 The Universal Protein Resource Knowledgebase ...... 12

1.4.3 The Braunschweig Enzyme Database ...... 14

1.5 STATEMENT OF HYPOTHESIS ...... 16

Chapter 2 - EVOLUTION OF SEQUENCE AND STRUCTURE ...... 19

2.1 INTRODUCTION ...... 19

2.2 MATERIALS AND METHODS ...... 21

2.2.1 Homology between Swiss-Prot and BRENDA enzymes ...... 21

2.2.2 Sequence homology and alignments ...... 22

2.2.3 Models of divergent evolution ...... 23

2.2.4 Nonlinear regression ...... 24

2.2.5 Hubble-like derivation ...... 25

2.3 COMPARING MODELS OF SEQUENCE EVOLUTION ...... 26

2.4 HUBBLE LAW-LIKE DERIVATION ...... 34

2.5 FUNCTIONAL CONSTRAINTS ON THE EVOLUTION OF PROTEIN STRUCTURE ...... 37

2.6 ACTIVE SITE REGION CONSERVATION ...... 40

2.7 DIVERGENT VS. ...... 42

2.8 DISCUSSION ...... 44

Chapter 3 - PREDICTING ENZYME FUNCTION ...... 46

3.1 INTRODUCTION ...... 46

3.1.1 Sequence-Based Function Prediction Methods ...... 46

3.1.2 Structure-Based Function Prediction Methods ...... 47

3.2 MATERIALS AND METHODS ...... 48

3.2.1 Determination of the active site region ...... 48

3.2.2 Gibbs sampling ...... 49

3.2.3 Performance evaluation ...... 50

3.3 INTEGRATING LOCAL SEQUENCE CONSERVATION ...... 51

3.3.1 Structural Coverage ...... 51

3.3.2 Localizing the active site region ...... 55

3.3.3 Size of active site region ...... 57

3.4 GLOBAL vs. LOCAL SEQUENCE CONSERVATION ...... 59

3.5 GLOBAL BIOCHEMICAL RECONSTRUCTION USING SAMPLING (GLOBUS) ...... 63

3.5.1 Sources of evidence ...... 64

A. Sequence homology ...... 64

B. Genomic context ...... 64

3.5.2 The generic metabolic network ...... 66

3.5.3 The fitness function ...... 67

3.5.4 Gibbs sampling ...... 68

3.6 STRUCTURAL GLOBUS (3D-GLOBUS) ...... 69

3.7 COMPARISON TO OTHER METHODS ...... 73

3.7.1 Argot2 ...... 74

3.7.2 EFICAZ: Enzyme function inference by a combined approach ...... 74

3.7.3 Comparison of Argot2 and EFICAz2.5 to 3D-GLOBUS ...... 75

3.8 DISCUSSION ...... 76

Chapter 4 - RECONSTRUCTION OF THE MYCOBACTERIUM TUBERCULOSIS METABOLIC NETWORK ...... 78

4.1 MYCOBACTERIUM TUBERCULOSIS AND DISEASE ...... 78

4.2 MATERIALS AND METHODS ...... 83

4.2.1 Probabilistic annotation of the M. tuberculosis genome ...... 83

4.2.2 Structural representations ...... 84

4.2.3 Flux balance analysis ...... 85

4.2.4 Gene essentiality analysis ...... 86

4.2.5 Phenotype microarray analysis ...... 87

4.3 3D-GLOBUS NOVEL PREDICTIONS ...... 87

4.3.1 Candidates for functional characterization ...... 87

4.3.2 Cholesterol Degradation Pathway ...... 92

4.4 RECONSTRUCTION OF THE M. TUBERCULOSIS METABOLIC NETWORK ...... 94

4.4.1 iMK917 Metabolic model characteristics ...... 94

4.4.2 Reconciliation with experimental data ...... 96

4.4.3 Essential genes and candidate small molecule drugs ...... 100

4.5 DISCUSSION ...... 103

REFERENCES...... 105

APPENDIX...... 121

LIST OF FIGURES

Chapter 1

Figure 1.1. Number of fully sequenced bacterial genomes...... 1

Figure 1.2. Fraction of microbial proteins against maximal sequence homology to an enzyme in Swiss-Prot...... 2

Figure 1.3. Schematic representation of divergent evolution...... 4

Figure 1.4. Functional diversity of 572 Clusters of Orthologous Groups from the Last Universal Common Ancestor (LUCA)...... 8

Figure 1.5. Growth of sequence databases...... 13

Figure 1.6. Source of BRENDA sequences...... 14

Figure 1.7. Distribution of EC functional classes in Swiss-Prot and BRENDA...... 15

Chapter 2

Figure 2.1. Distribution of pairwise sequence identity between Swiss-Prot enzymes and best BLAST hit in BRENDA...... 22

Figure 2.2. Phylogenetic tree of the 24 organisms considered in this study...... 27

Figure 2.3. Fitting models of sequence divergence...... 30

Figure 2.4. Distribution of model 2 parameter y0 across 30 enzymatic functions...... 32

Figure 2.5. Analysis of 300 functional sets from Swiss-Prot...... 34

Figure 2.6. Hubble-like model of sequence divergence...... 36

Figure 2.7. Structural divergence of enzymes with the same molecular function...... 37

Figure 2.8. Average pairwise and triplet sequence identity for proteins in different domains of life; Prokarya, Archaea and Eukarya...... 39

Figure 2.9. Sequence conservation as a function of the distance to the active site...... 40

Figure 2.10. Density plot of RMSD vs. sequence identity...... 43

Chapter 3

Figure 3.1. Orthology and paralogy...... 47

Figure 3.2. Cumulative structural coverage for microbial and Swiss-Prot enzymes...... 53

Figure 3.3. Illustration of the transitive properties for structural homology...... 54

Figure 3.4. Schematic representation of the function prediction method...... 56

Figure 3.5. Sequence identity after 4 billion years of divergence as a function of the distance to the active site...... 58

Figure 3.6. Average number of residues within a given radius from the active site...... 59

Figure 3.7. Probability of correct prediction given global sequence identity for Escherichia coli and Saccharomyces cerevisiae...... 62

Figure 3.8. Illustration of GLOBUS...... 69

Figure 3.9. Precision and recall performance for 5 species...... 70

Figure 3.10. Precision and recall vs. probability for 5 species...... 72

Figure 3.11. Comparison of the performance of 3D-GLOBUS, Argot2 and EFICAz2.5...... 76

Chapter 4

Figure 4.1. Whole genome alignment of mycobacterial species...... 79

Figure 4.2. Estimated tuberculosis incidence rate in 2012...... 80

Figure 4.3. Illustration of flux balance analysis...... 86

Figure 4.4. Difference between GLOBUS and 3D-GLOBUS probabilities for Mycobacterium tuberculosis...... 88

Figure 4.5. Homology models of target genes with ligand...... 91

Figure 4.6. Cholesterol degradation pathway...... 93

Figure 4.7. Model predictions for growth on different carbon and nitrogen sources...... 99

Figure 4.8. Candidate target genes in a diverse set of metabolic processes...... 102

LIST OF TABLES

Chapter 1

Table 1.1. Enzyme Commission classification scheme...... 6

Chapter 2

Table 2.1. List of 30 enzymatic functions considered in this study...... 28

Table 2.2. Comparisons of the fits of models 1, 2 and 3 of sequence divergence...... 31

Chapter 3

Table 3.1. Structural coverage of metabolic gene products in 5 species...... 55

Table 3.2. Probability of correct function at different sequence identity levels...... 61

Table 3.3. Distribution of the enzymatic activities of the generic network...... 67

Chapter 4

Table 4.1. Candidate genes for structural and experimental validation...... 89

Table 4.2. Comparison of Mycobacterium tuberculosis H37Rv metabolic reconstructions...... 95

Table 4.3. Comparison of experimental and in silico gene essentiality predictions...... 97

ACKNOWLEDGMENTS

Firstly, I would like to thank my advisor Dr. Dennis Vitkup for providing guidance and support over the past 5 years. I am truly grateful for the opportunity to study and learn from him and from fellow labmates, past and present, along the way. I would like to thank the whole Vitkup lab, and particularly Dr. Germán Plata for his help and advice, and for many helpful discussions. I am also grateful to Dr. Markus Fischer for sharing his expertise on structural alignment methods.

I would also like to thank my first thesis advisor Dr. Dianna Murray for her kindness and for fostering my enthusiasm for Bioinformatics. I am grateful to Dr. Tonya Silkov and Dr. Nebosja Mirkovic from whom I learned a lot during my time in the Murray Lab.

I am grateful for my advisory and thesis committee members, for their time and for providing insightful discussions.

Next, I would like to thank the Pharmacology Department and fellow students, especially Karen Allis for her constant support in all matters, past and current course directors Dr. Dan Goldberg, Dr. Neil Harrison and Dr. Rich Robinson. I am particularly thankful for the support I received from the Pharmacology students when my father passed away. And thank you to my good friend, classmate and roommate Pelisa Charles-Horvath for her friendship.

Lastly, I would like to thank all those who supported and encouraged me along this journey, including my mentors at Virginia Tech, my friends and family. Thank you Mom and Dad, Ramata, Aïcha and Bryan for believing in me always. Thanks to the Monnard family for being so welcoming and well-wishing. And of course thank you Alex for inspiring and encouraging me whenever I needed cheering up, and for making the trip from Washington to New York so diligently over the past 6 years.

Chapter 1 – INTRODUCTION

1.1 MOTIVATION

Over the past twenty years the number of fully sequenced bacterial genomes has been increasing exponentially. This number is expected to exceed 10,000 by 2020 [1] (fig. 1.1).

[Figure 1.1 plot: cumulative number of sequenced bacterial genomes (y-axis, 0 to 3,000) by year (x-axis, 1995 to 2013).]

Figure 1.1. Number of fully sequenced bacterial genomes. Since 2000, the number of determined sequences has been increasing exponentially. This graph represents the exponential increase in the number of completed bacterial genome sequencing projects. (Data from NCBI Genome).

This is in part due to the prevalence of metagenomic studies that are shedding light on the wide diversity of organisms present on Earth. In order to better understand these organisms, their metabolism, and how they interact with each other and their environment, it is critical to fully functionally characterize their genomes. Because of the large quantity of data, experimental validation cannot be systematically conducted. Therefore, there is a need for a fast and reliable computational method for genomic functional annotation.

Classically, such methods involve annotation transfer from a homologous sequence. Specifically, to predict the function of an uncharacterized sequence, a BLAST [2] homology search is conducted to identify its closest homologues. The function of the top BLAST hit is then assigned to the uncharacterized sequence. However, this method has limitations when considering query sequences with low homology to functionally annotated sequences. It has been shown previously that functional annotation cannot be reliably transferred between proteins that share less than 60 % sequence identity, and this threshold may vary across protein families [3-7].
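The top-hit transfer rule described above can be sketched in a few lines. This is a minimal illustration rather than the procedure used in this thesis: the `Hit` record, its field names, and the `transfer_annotation` helper are hypothetical, and the 60 % default simply mirrors the threshold quoted in the text.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One row of a homology search result (hypothetical layout)."""
    subject_id: str
    percent_identity: float
    bit_score: float
    ec_number: Optional[str]  # functional annotation of the subject, if known


def transfer_annotation(hits: Sequence[Hit],
                        min_identity: float = 60.0) -> Optional[str]:
    """Return the EC number of the top-scoring hit, or None.

    None is returned when there are no hits, or when the top hit falls
    below the identity threshold and the transfer is deemed unreliable.
    """
    if not hits:
        return None
    top = max(hits, key=lambda h: h.bit_score)
    if top.percent_identity < min_identity:
        return None
    return top.ec_number
```

A query whose best hit shares, say, 45 % identity is left unannotated under this rule, which is the situation the following paragraphs address.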

However, for a typical newly sequenced bacterial genome, the majority of its gene products have less than 60 % homology to known proteins (fig. 1.2).

Figure 1.2. Fraction of microbial proteins against maximal sequence homology to an enzyme in Swiss-Prot. Most of the new genomes that are being added to UniProtKB/TrEMBL are not closely related to any well-characterized organism. This plot represents the distribution of bacterial proteins as a function of the maximum sequence identity to a sequence in the manually curated Swiss-Prot. These bacterial proteins were sequenced from metagenomic samples and cover 199 species according to small subunit ribosomal RNA [8].

This region of low sequence homology has been termed the “twilight zone of sequence space” by Rost [9]. The goal of this thesis was first to investigate this threshold and identify new ways to make functional predictions by studying the limits of sequence and structure divergence through divergent evolution. Additionally, we aimed to use these findings to develop a functional annotation method applicable to whole genomes for simultaneous characterization of all gene products. Namely, the goal of this project was to incorporate structural information into the previous version of the probabilistic whole-genome prediction algorithm GLOBUS, in the hope that it would further improve prediction, especially for those genes with low sequence homology to known enzymes. This way, we aimed to improve the ability to propose candidate functions for proteins in the “twilight zone” of sequence identity through the identification of orthologues.

1.2 ORTHOLOGY AND DEFINING FUNCTION

1.2.1 Orthology

Orthology is a case of vertical divergence. Orthologues are genes in different species whose evolutionary history can be traced back to the same gene in a common ancestor (fig. 1.3). Orthology between genes can be established with phylogenetic methods such as tree construction, or through whole-genome sequence comparison. This second technique consists of identifying mutually top-scoring BLAST hits from each genome. Oftentimes, orthologues have the same molecular function, as opposed to paralogues, which are the result of a gene duplication event within a genome. In that case, one copy of the gene maintains the original function, while the second “redundant” copy is free to diverge into a new function [10-12].

Figure 1.3. Schematic representation of divergent evolution. We assume that all species can be traced back to a universal common ancestor (LUCA) dating back to about 4 billion years. We simplify phylogeny from a tree form to a linear form where the enzyme sequence for each species considered in this study represents a “snapshot” of the ancestral enzyme throughout divergent evolution.

The function of a protein corresponds to the role that it has inside the cell and can range from transporting small molecules to ensuring proper folding of other proteins, or carrying out specific chemical reactions.

1.2.2 The Gene Ontology Project

Function can be defined in a number of different ways. For example, the Gene Ontology (GO) classification system uses a structured, “controlled vocabulary” (i.e. ontologies) to characterize three different functionally associated features of a protein. These three features are Biological Process, Cellular Component and Molecular Function [13]. For a protein, a biological process corresponds to a specific set of chemical reactions or a given pathway, for example “Carbohydrate Metabolic Process” (GO:0005975). The cellular component category tells where in the cell the protein is localized, for example in the cytosol (GO:0005829). The molecular function is the most specific functional category and it describes the biochemical activity of the protein; for example GO:0046872 corresponds to the activity “metal ion binding”. A protein can be associated with multiple GO terms, each term giving additional information on the function of the protein in the cell. However, the GO categories can be broad, and therefore yield non-specific information on the function of a protein.

1.2.3 The Enzyme Commission Number

Another way to define protein function is through the Enzyme Commission number [14,15]. As the name indicates, the EC hierarchical system classifies the function of enzymes only, as opposed to other classes of proteins. First devised in 1955, the EC system links a four-part numerical code termed EC number to a specific catalytic activity or reaction. That is, different enzymes from different organisms can have the same EC number annotation if they catalyze the same reaction, regardless of mechanism. Each part of the EC code gives more detailed information on the type of reaction catalyzed. The first number is the most general category (table 1.1) and each subsequent number adds to the specificity of the reaction.

Table 1.1. Enzyme Commission classification scheme.

EC Class    Class Name
EC 1        Oxidoreductases
EC 2        Transferases
EC 3        Hydrolases
EC 4        Lyases
EC 5        Isomerases
EC 6        Ligases

For example, an enzyme annotated with EC 3.2.1.24 is an α-mannosidase. The first digit of the code, EC 3, tells us that the enzyme is a hydrolase; that is, the enzyme catalyzes the hydrolysis of a bond. The second digit added to the first, EC 3.2, specifies that the enzyme is a glycosylase. The third digit, EC 3.2.1, tells us that the enzyme hydrolyzes O- or S-glycosyl compounds. Finally, the last number, EC 3.2.1.24, reveals the exact chemical specificity of the enzyme for the substrate α-D-mannose. The ENZYME data bank is updated periodically so that almost all enzymatic activities known to occur in nature have an associated EC number [16-19]. As of the release of March 19, 2014 there were 6,370 distinct enzymatic activities listed in the ENZYME data bank.
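Because the EC code is hierarchical, two annotations can be compared level by level. The sketch below is illustrative only; the helper names `parse_ec` and `shared_levels` are assumptions introduced here, and a dash is treated as an unspecified level, as in partial EC annotations.

```python
def parse_ec(ec: str) -> tuple:
    """Split an EC number such as 'EC 3.2.1.24' into its four levels."""
    parts = ec.replace("EC", "").strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four EC levels, got {ec!r}")
    return tuple(parts)


def shared_levels(ec_a: str, ec_b: str) -> int:
    """Count the leading EC levels that two annotations have in common.

    Counting stops at the first mismatch or at an unspecified level ('-').
    """
    count = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a != b or a == "-":
            break
        count += 1
    return count
```

Under this convention, EC 3.2.1.24 and EC 3.2.1.25 share three levels (both hydrolyze O- or S-glycosyl compounds) but differ in substrate specificity, whereas a full match returns four.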

1.2.4 Other Classification Systems

A. Clusters of orthologous groups

Unlike Gene Ontology and the EC nomenclature system, other protein classification approaches in use have less stringent definitions of function. One such system is the Cluster of Orthologous Groups (COG), a phylogenetic classification system where proteins from complete genomes are placed in a given COG if they have apparent orthology (i.e. are related by vertical evolution) to another protein already in the COG [20-22]. COGs are essentially families of homologous proteins, and by that logic they may include not only orthologues but also paralogues, that is, pairs of proteins that originate from a duplication event and therefore may not have the same molecular function [23]. In fact, a quick analysis of the 572 COGs thought to have originated in the last common ancestral species [20,24] (LUCA) reveals that only 30 % of these COGs are monofunctional. Another 20 % are populated with non-enzymes, while the remaining half of the 572 COGs contains a heterogeneous set of enzymes (fig. 1.4). Therefore, the COG classification system is not suitable for our purposes.

[Figure 1.4 pie chart. Categories: 0 ECs, 1 EC, 2 ECs, 3+ ECs. Slice values: 21.37%, 32.05%, 29.95%, 16.64%.]

Figure 1.4. Functional diversity of 572 Clusters of Orthologous Groups from the Last Universal Common Ancestor (LUCA). Several studies have postulated that 572 protein families emerged in the last universal common ancestor and were passed down to most of the organisms in existence today. This plot shows the percentage of these 572 COGs that are not associated with any EC number (non-enzymes), and the percentages that are associated with a single EC number (monofunctional) or with multiple EC numbers (multifunctional).

B. Protein domain families: Pfam and PROSITE

Proteins can also be organized into families based on the presence of specific domains or sequence motifs. For instance, the Pfam database is a collection of protein domain families [25]. A domain can be defined as a functional region with conserved sequence and structure. Proteins may have more than one domain so that in combination, multiple domains may yield a specific molecular function. PROSITE operates in the same way, with PROSITE families being derived from multiple sequence alignments, as well as HMM profiles to identify common functional modules [26].

C. Structural classification: SCOP and CATH

Not all protein classification systems are based on sequence similarity. Classification schemes based on secondary and tertiary structure similarity also exist. The secondary structure corresponds to the general pattern of hydrogen bonding of the backbone of a protein. The two most common kinds of secondary structure elements are α-helices and β-sheets [27,28]. The tertiary structure of a protein is its geometric shape in three dimensions.

Despite some differences in implementation, the SCOP [29] and CATH [30] classification systems are both hierarchical in nature. The four levels of SCOP are structural class (i.e. type of secondary structure elements), fold (configuration of secondary structure elements in 3D space), superfamily and family, while CATH (class, architecture, topology, homologous superfamily) sorts protein domains into class, architecture, topology and superfamily [31]. The rationale for using structural elements for functional classification is that, because structures are more robust to evolutionary divergence than sequences, functionally related proteins tend to retain highly similar three-dimensional folds [32,33]. Of course, these systems require the availability of a structure, which is rarely the case for the gene products of newly completed genomes.

In this project, we investigated the link between sequence, structure and function of enzymes; therefore, the EC system was well suited as a definition of function. We defined orthologues as bi-directional best hits between two genomes with the same EC number annotation.
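The working definition of orthologues given here (bi-directional best hits that also share an EC number) can be illustrated with a short sketch. The input layout is an assumption made for this example: each search result is a dictionary mapping a (query, subject) pair to a bit score, and the function names are hypothetical.

```python
def best_hits(scores: dict) -> dict:
    """Map each query gene to its top-scoring subject.

    `scores` maps (query, subject) pairs to a bit score (assumed layout).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def orthologues(a_vs_b: dict, b_vs_a: dict, ec_a: dict, ec_b: dict) -> list:
    """Return (gene_a, gene_b, ec) triples that are bi-directional best
    hits between genomes A and B and carry the same EC annotation."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    pairs = []
    for gene_a, gene_b in forward.items():
        if reverse.get(gene_b) != gene_a:
            continue  # not a mutual best hit
        ec = ec_a.get(gene_a)
        if ec is not None and ec == ec_b.get(gene_b):
            pairs.append((gene_a, gene_b, ec))
    return sorted(pairs)
```

Requiring a matching EC number on top of reciprocity discards mutual best hits whose annotated functions differ, which is the stricter criterion stated in the text.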

1.3 ENZYMES AND METABOLISM

Most enzymes are proteins whose function is to facilitate chemical reactions in the cell by accelerating the rate of the reactions, although some RNA molecules called ribozymes are also known to have a catalytic role [34-36]. Enzymatic proteins account for 20-40 % of all proteins in a genome [37,38]. Enzymes catalyze reactions involving a specific compound or substrate and release a product. In general, only a handful of amino acids out of the full sequence are catalytic and must be absolutely conserved across enzymes with the same EC number. However, the entire catalytic pocket, that is the set of all amino acids that form the substructure of the active site, must also be constrained to maintain the specificity of the enzymatic reaction. In an organism, reactions occur in a particular sequence where the product of one reaction is the substrate of the following, and a set of subsequent reactions form a pathway. The collection of all pathways in a cell is called a metabolic network. Some pathways have emerged in a given organism as a means of adaptation to a particular environment. Therefore, not all pathways are conserved across genomes. On the other hand, some pathways responsible for basic functions are critical to life and therefore will be found across all domains of life because of their essentiality for cellular growth. These pathways include all reactions associated with glycolysis or the TCA cycle and constitute the core carbon metabolism. Because these core metabolism reactions are present across organisms, we postulate that they have emerged in the last universal common ancestor.

1.4 SEQUENCE DATABASES AND FUNCTIONAL ANNOTATIONS

Several protein databases are available for research purposes, including the Universal Protein Resource KnowledgeBase [39] (UniProtKB), the Protein Data Bank [40] (PDB), the Pfam database [25], the Structural Classification of Proteins database [29] (SCOP) and the Braunschweig Enzyme Database [41] (BRENDA). Only the PDB and BRENDA contain experimentally-validated protein entries with literature references. The other databases are populated with proteins functionally annotated through both experimental evidence and computational data mining.

1.4.1 The Protein Data Bank

The Protein Data Bank (PDB) is a database of experimentally-determined three-dimensional structures [40]. There are currently 87,755 three-dimensional structures resulting from X-ray crystallography experiments. An additional 10,384 structures were elucidated by nuclear magnetic resonance (NMR). The structures may or may not contain bound ligands or co-factors. The PDB has seen an exponential increase in the number of deposited three-dimensional structures over the past few decades. However, not all the structures in the PDB have a known molecular function. Indeed, thanks to structural genomics projects such as the Protein Structure Initiative (PSI), over 10,000 previously uncharacterized protein targets were chosen to be crystallized, in the hope that a solved structure would shed light on their potential biochemical function. Currently, there are 57,360 three-dimensional structures in the PDB with at least partial EC number annotation.

1.4.2 The Universal Protein Resource Knowledgebase

Another well-known resource for protein sequences is the Universal Protein Resource KnowledgeBase [39] (UniProtKB). It contains translated gene products from all major nucleic acid databases (EMBL, GenBank, DDBJ). UniProtKB is divided into two sub-databases: Swiss-Prot and TrEMBL (translated EMBL nucleotide sequence data library). The latter contains sequences which may have been experimentally or computationally annotated, whereas the contents of Swiss-Prot are manually curated. Efforts are also made to reduce redundancy in Swiss-Prot by merging all sequences encoded by the same gene. Swiss-Prot now contains 544,996 entries (456,429 at 100 % redundancy), with 214,898 proteins with at least partial EC number annotation. In other words, 53 % of the proteins in Swiss-Prot are either non-enzymes or they lack molecular function annotation. In contrast to Swiss-Prot, TrEMBL is populated with any gene product whether it has a verified annotation or not. To date, TrEMBL is composed of over 54 million sequences (28.5 million at 100 % redundancy). Both components of UniProtKB have been expanding exponentially over the past 20 years [39] (fig. 1.5), due to the improved efficiency of sequencing methods. Notably, the growth of UniProtKB/TrEMBL has overwhelmingly surpassed the pace of experimental validation.

Figure 1.5. Growth of sequence databases. Since 2000, the number of determined sequences has been increasing exponentially. This graph represents the growth of 3 sequence databases; the Protein Data Bank (which contains proteins with experimentally determined 3D structure), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. The latter shows the most dramatic progression and it includes any determined protein sequence. Functional annotations of TrEMBL sequences are computationally-generated with no manual curation. Unsurprisingly, 74 % of TrEMBL sequences are bacterial gene products.

Computational methods may help bridge the gap between sequencing and characterization through utilization of homology-based approaches. In fact, functional annotations in TrEMBL result from such methods. However, homology-based methods can introduce mis-annotations that could later be propagated as the erroneous annotation gets transferred by homology to other uncharacterized proteins [42-44]. In order to avoid such errors, one may use databases such as the manually curated database UniProtKB/Swiss-Prot. However, as much as 70 % of Swiss-Prot entries were inferred from homology [39]. Therefore it would be useful to determine the fraction of Swiss-Prot which has significant homology to experimentally-validated proteins. This way we can get a better idea of the extent to which Swiss-Prot database annotations can be trusted. For this purpose, we turn to the largest enzyme information system available, the Braunschweig Enzyme Database.

1.4.3 The Braunschweig Enzyme Database

The BRaunschweig ENzyme DAtabase [41] (BRENDA) is a comprehensive database of experimental biochemical and molecular enzyme data. The functional information, which includes data on enzyme occurrence, reaction kinetics and other physicochemical properties, is directly extracted from the literature, and is manually verified by scientists. BRENDA contains all enzymes with known enzyme classification (E.C.) as well as some physicochemical properties as referenced in the literature. It contains close to 20,000 enzymes with experimental validation spanning 3,074 unique E.C. functions. Just over half of experimentally-validated BRENDA sequences are cross-referenced in Swiss-Prot (Fig. 1.6) and this makes up 1.9 % of the Swiss-Prot database. The other 47 % belong to TrEMBL (database of unreviewed, automatic annotations).

Figure 1.6. Source of BRENDA sequences. BRENDA is a database of proteins with experimentally determined physico-chemical properties and function (N = 19,415). Among the proteins listed in BRENDA, 9,033 are cross-listed in UniProt/TrEMBL and 10,382 are cross-listed in UniProt/Swiss-Prot.
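The proportions quoted in this section follow from the counts in the Figure 1.6 caption and the Swiss-Prot entry count given in section 1.4.2. The short check below recomputes them; the variable names are illustrative only.

```python
# Counts taken from the text and the Figure 1.6 caption.
brenda_total = 19_415
brenda_in_trembl = 9_033
brenda_in_swissprot = 10_382
swissprot_entries = 544_996

# The two cross-listing counts should account for every BRENDA sequence.
assert brenda_in_trembl + brenda_in_swissprot == brenda_total

# Share of BRENDA sequences found in each UniProtKB component (percent).
swissprot_share = 100 * brenda_in_swissprot / brenda_total
trembl_share = 100 * brenda_in_trembl / brenda_total

# Share of Swiss-Prot that is experimentally validated in BRENDA (percent).
swissprot_coverage = 100 * brenda_in_swissprot / swissprot_entries

print(f"BRENDA in Swiss-Prot: {swissprot_share:.1f} %")
print(f"BRENDA in TrEMBL:     {trembl_share:.1f} %")
print(f"Swiss-Prot covered:   {swissprot_coverage:.1f} %")
```

The first two values correspond to the "just over half" and "other 47 %" statements once rounded to whole percentages, and the third to the 1.9 % figure.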

The set of experimentally-verified BRENDA entries covers 3,074 unique E.C. numbers while Swiss-Prot has 3,971. The distribution of E.C. classes (1-6) represented in BRENDA is virtually the same as the distribution of E.C. classes in Swiss-Prot (Fig. 1.7). In both cases the first three classes (oxidoreductases, transferases, hydrolases) make up around 80 % of the dataset. This shows that experimental BRENDA is not biased in its composition and covers all classes of enzymatic functions.

Figure 1.7. Distribution of EC functional classes in Swiss-Prot and BRENDA. The distributions of proteins per EC functional class in Swiss-Prot and BRENDA are very similar, with classes EC1, EC2 and EC3 representing over 80 % of both sets.

Functional prediction can be made robust with a careful choice of reference sequence set. For our purposes the Swiss-Prot and BRENDA databases constitute reliable sources. However, as mentioned above, sequence homology alone is not enough to make reliable predictions. Therefore we propose to add a structure-derived component to sequence homology for function prediction.

1.5 STATEMENT OF HYPOTHESIS

Given that the amino acids contained in the region extending around the catalytic residues should be significantly more conserved than the overall sequence between enzymes with the same molecular function, active site region conservation can be used to complement global sequence identity to make functional predictions, especially when a BLAST search returns no high homology hits. More specifically, measuring the level of sequence conservation between two protein sequences at the active site region should allow us to confidently determine whether the two proteins have the same molecular function.

We know that global sequence is not enough for accurate predictions below a certain level of homology to known enzymes. Indeed purely computational methods of function prediction relying on annotation transfer from a homologue tend to fail below a certain homology threshold6,45. We expect residues surrounding the catalytic region of an enzyme to be more conserved than the global sequence when comparing two enzymes with the same function.

To investigate the limits of sequence homology between orthologous enzymes, as well as the size of the conserved region around the active site, we conducted the study described in Chapter 2. We found that enzymes diverging from a common ancestor must retain ~30 % sequence identity in order for function to be maintained. Similarly, the structures of diverging enzymes with the same function are no more distant than about 3.5 Å root mean square distance (RMSD). We also showed that the level of conservation at a given site along an enzyme sequence depends on how distant this site is from the catalytic site. In other words, the region surrounding catalytic residues undergoes fewer mutations over the course of divergent evolution.

Subsequently, we used the result of that study to devise a probabilistic functional annotation method integrating active site region conservation with global sequence homology and genomic context information, as shown in Chapter 3. We first compared the performance of 3D-GLOBUS to that of GLOBUS in order to quantify the improvements in prediction accuracy and precision afforded by the addition of structure-derived local sequence conservation information. We also demonstrated that the aforementioned method, 3D-GLOBUS, performs better than the high-performing methods Argot2 and EFICAz2.5 in predicting the function of low homology genes. The improvement in precision and recall of 3D-GLOBUS over GLOBUS is particularly observable for genes with less than 40 % sequence identity to the closest homologue in Swiss-Prot. For this subset, 3D-GLOBUS shows a 36 % increase in precision over GLOBUS (precision of 0.6 and 0.44 respectively). Similarly, we observe a 40 % increase in recall over GLOBUS (recall of 0.35 and 0.25 respectively).

Finally, we describe how 3D-GLOBUS was applied to the whole-genome reconstruction of the Mycobacterium tuberculosis H37Rv metabolic network in Chapter 4. Through 3D-GLOBUS, we make several novel functional predictions that were not in the published models of this organism. We propose for functional characterization 7 genes that were absent from published models of the bacterium and for which 3D-GLOBUS has made high probability predictions. This set of 7 genes constitutes a case-study of low global sequence conservation vs. high local sequence homology to enzymes with known function.

Interestingly, we were also able to identify a large number of genes involved in the degradation of cholesterol. This is an exciting finding because several studies published in the past few years have shown that cholesterol is an important nutrient for M. tuberculosis persistence during infection. However, at this time the exact mechanisms and genes involved have not been fully elucidated and no published model contained this pathway.

The findings from 3D-GLOBUS led us to build an updated metabolic model for M. tuberculosis which is better able to replicate real experimental data, as well as to simulate and predict phenotypic states of interest. We hope that our model will help further understand what happens during persistent infection and shed light on the metabolic adaptations that M. tuberculosis undergoes when faced with different environments and conditions.

Chapter 2 - EVOLUTION OF SEQUENCE AND STRUCTURE

2.1 INTRODUCTION

Although many previous studies investigated the main determinants, constraints, and mechanisms of protein molecular evolution46-49, relatively little attention has been given to understanding possible limits in the evolution of protein sequences and structures sharing the same molecular function. The molecular function13 is defined as the specific protein activity at the molecular level, such as the ability to catalyze a specific enzymatic reaction. Several examples of convergent protein evolution suggest that the same molecular functions could be shared by proteins with sequence identity as low as the sequence identity expected between two unrelated enzymes50-52. Nevertheless, for proteins that are diverging from the same ancestral sequence it is possible that crucial residues are strongly conserved in order to preserve common functions. The main goal of this study is to explore the extent to which enzyme sequences and structures can diverge while continuously maintaining the same molecular function. Understanding the long-term constraints of protein evolution is not only an important fundamental question, but it should also advance our understanding of the main determinants of protein molecular functions and thus improve computational algorithms for functional annotations53-55.

Recently, Povolotskaya and Kondrashov investigated whether the global divergence of protein sequences has reached a limit56. The analysis showed that distant homologues belonging to the same Cluster of Orthologous Groups (COG)20 are still diverging. However, the limit of protein evolution was not specifically investigated and it is not clear if proteins would be able to diverge beyond the limit of sequence recognition methods. Importantly, as COGs are constructed based on a network of best sequence hits across genomes20, proteins in the same COG do not necessarily share the same molecular function. Consequently, the impact of the requirement to continuously maintain the molecular function on protein divergence remains to be elucidated.

In the present study we focus on the evolution of enzymes given that their molecular function is relatively well defined. The Enzyme Commission (EC) classification organizes enzymatic functions using a hierarchical 4-digit code19. Specifically, sharing of all 4 digits in the EC classification indicates the same molecular function, i.e. catalysis of the same biochemical reaction. Sharing fewer than 4 digits indicates different degrees of similarity in molecular function; for example, sharing the first 3 digits of the EC classification usually indicates identity of the catalyzed reaction, but not of the reaction substrates. In this chapter we analyze sequence evolution of enzymes catalyzing thirty metabolic activities in a diverse collection of organisms from all domains of life. We compare several models of sequence evolution and investigate the effective limits on sequence and structural divergence imposed by the necessity to continuously preserve the same molecular function. We also investigate how the constraint to preserve molecular function helps in distinguishing between divergent and convergent evolution of enzymes.
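The hierarchical comparison of EC numbers described above can be sketched in a few lines. This is a minimal illustration, not code from the study; the function names are hypothetical, and the treatment of a dash as an unassigned level in partial EC numbers is an assumption.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC digits two annotations share (0 to 4)."""
    shared = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        # Assumption: "-" marks an unassigned level in a partial EC number.
        if a != b or a == "-":
            break
        shared += 1
    return shared


def same_molecular_function(ec_a: str, ec_b: str) -> bool:
    """True only when all four EC levels are assigned and identical."""
    return shared_ec_levels(ec_a, ec_b) == 4
```

For example, the annotations 2.7.7.9 and 2.7.7.23 from this study share three levels, so they would count as the same type of reaction but not as the same molecular function.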

2.2 MATERIALS AND METHODS

2.2.1 Homology between Swiss-Prot and BRENDA enzymes

In order to determine specifically the fraction of Swiss-Prot enzymes for which a high confidence annotation can be made, the list of all enzymes with experimentally validated function in BRENDA was retrieved with the Simple Object Access Protocol (SOAP) client. Then all of the Swiss-Prot entries were searched with BLAST against this subset of BRENDA. For each Swiss-Prot query, all self-hits were removed (because BRENDA contains both Swiss-Prot and TrEMBL entries, Fig. 1.6). The hit within BRENDA with the lowest E-value, if any, was retained as the best hit for the Swiss-Prot query.
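The best-hit selection just described can be sketched as follows. This is an illustrative outline only: the `Hit` record, its field names and the `best_hit` function are hypothetical, and applying the 0.05 significance level as an E-value cutoff at this step is an assumption.

```python
from typing import Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    query: str       # accession of the Swiss-Prot query
    subject: str     # accession of the BRENDA sequence that was hit
    evalue: float    # BLAST E-value of the alignment
    identity: float  # percent sequence identity of the alignment


def best_hit(hits: Iterable[Hit], max_evalue: float = 0.05) -> Optional[Hit]:
    """Drop self-hits, then keep the hit with the lowest E-value (if any)."""
    candidates = [
        h for h in hits
        if h.subject != h.query and h.evalue <= max_evalue
    ]
    return min(candidates, key=lambda h: h.evalue, default=None)
```

A query whose only hit is itself returns `None`, which corresponds to a Swiss-Prot entry with no usable experimentally validated homologue.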

About half of Swiss-Prot proteins (253,904) are annotated with a full or partial EC number, and therefore are likely to be enzymes. Fifty-two percent have a hit in BRENDA at the 0.05 significance level. The other 48 % are likely to be mostly non-enzymes. The distribution of pair-wise sequence identity between Swiss-Prot proteins and their best hit in BRENDA across sequence identity bins reveals that the majority of Swiss-Prot/BRENDA best hit pairs are in the 30 – 40 % sequence identity range (Fig. 2.1). Indeed, about 50 % of all pairs cluster in the 20 – 50 % sequence identity bin. The cumulative distribution plot shows that, in accordance with the aforementioned hypothesis, over 90 % of Swiss-Prot enzymes have at least 30 % sequence identity to an experimentally validated enzyme from BRENDA (Fig. 2.1).


Figure 2.1. Distribution of pairwise sequence identity between Swiss-Prot enzymes and best BLAST hit in BRENDA. This plot shows that on average Swiss-Prot enzymes have about 40 % sequence identity to an experimentally characterized protein in BRENDA. Therefore, a considerable number of Swiss-Prot enzymes that are either un-annotated, partially annotated or mis-annotated cannot be assigned function with confidence through sequence homology.

2.2.2 Sequence homology and alignments

All database searches in this work were done with the PSI-BLAST algorithm. PSI-BLAST is better suited than BLAST to identifying remote homologues because it generates a position-specific scoring matrix (PSSM), and uses it to make more sensitive sequence comparisons2.

To build the phylogenetic tree shown in Fig. 2.2 a multiple sequence alignment of the strongly conserved small subunit ribosomal RNA was performed with CLUSTALW57. Then a maximum-likelihood phylogenetic analysis was conducted with FastML58.

Orthology between two enzymes was defined as a full match in E.C. number and the enzymes being best bi-directional hits in a PSI-BLAST search2. The BLOSUM6259 matrix was used along with an E-value cutoff of 0.005. Only enzymes with experimentally validated EC number were used to avoid bias from mis-annotations. To fulfill this requirement we only considered Swiss-Prot sequences also listed in BRENDA as having functional annotation verified experimentally41. CLUSTALW was used to carry out multiple sequence alignments for each EC set. Times of divergence between pairs of species are from the TimeTree database60. In all cases time of divergence was determined with a global or local clock method on conserved molecules such as ribosomal RNA.
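The orthology criterion above (full EC match plus best bi-directional hits) can be sketched as below. The dictionaries and the function name are hypothetical; the sketch assumes the best hit of every protein in each direction has already been extracted from the PSI-BLAST output.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a, ec_a, ec_b):
    """Return (a, b) pairs that are mutual best hits with identical EC numbers.

    best_a_to_b / best_b_to_a map each protein to its best hit in the other
    genome; ec_a / ec_b map proteins to their full 4-digit EC annotation.
    """
    pairs = []
    for a, b in best_a_to_b.items():
        is_reciprocal = best_b_to_a.get(b) == a
        same_ec = ec_a.get(a) is not None and ec_a.get(a) == ec_b.get(b)
        if is_reciprocal and same_ec:
            pairs.append((a, b))
    return pairs
```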

2.2.3. Models of divergent evolution

The 3 regression models were derived from the Poisson model61-63 and Gamma correction64-66 methods routinely applied to sequence divergence problems. The fits of the models were evaluated with the coefficient of determination R2:

$$ R^2 = 1 - \frac{RSS}{TSS} \qquad (4) $$

where RSS is the residual sum of squares and TSS is the total sum of squares. Because model 1 is nested in model 2 we could compare their fits with the F-test (Eqn. 5).

$$ F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2} \qquad (5) $$

where RSS1 and RSS2 are the residual sums of squares for model 1 and model 2 respectively, and df1 and df2 are the numbers of degrees of freedom for model 1 and model 2.
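As a worked illustration of the coefficient of determination and the F statistic defined above, the following sketch computes both from lists of observed and predicted values. The function names are hypothetical, and the conversion of F into a P-value (which requires the F distribution) is deliberately left out.

```python
def residual_sum_of_squares(observed, predicted):
    """RSS: squared differences between the data and the model predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """R^2 = 1 - RSS/TSS, with TSS taken around the mean of the data."""
    mean = sum(observed) / len(observed)
    tss = sum((o - mean) ** 2 for o in observed)
    return 1.0 - residual_sum_of_squares(observed, predicted) / tss


def f_statistic(rss1, df1, rss2, df2):
    """F for nested models: model 1 is the simpler one (df1 > df2)."""
    return ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
```

With this definition R2 becomes negative whenever a model fits worse than a horizontal line through the mean of the data, which is why negative values can appear for model 1 in Table 2.2.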

Structures annotated with the EC numbers considered in this study and pertaining to the 24 organisms from Fig. 2.2 were retrieved from the PDB40. All pairs of structures with the same EC number annotation were then aligned using the Combinatorial Extension (CE)67 algorithm.

2.2.4 Nonlinear regression

The equation for a nonlinear model is the following:

$$ Y = f(X, p) + \varepsilon \qquad (1) $$

where Y is the observed data, X = (x1, x2, …, xk)’ are independent variables, p = (p1, p2, …, pk)’ are the model parameters, and ε are the residual errors.

The process of curve fitting corresponds to solving a least squares problem. The goal is to find the set of parameters that minimize the sum of the weighted squared residuals χ2:

$$ \chi^2(p) = \sum_{i=1}^{n} \frac{\left[ y_i - \hat{y}(x_i; p) \right]^2}{\sigma_i^2} \qquad (2) $$

where yi is the observed data, ŷ(xi;p) is the estimate given the set of parameters p, and σi2 is the variance of observation i.

To estimate the set of parameters p that minimize χ2, one must set the partial derivatives of χ2 with respect to p to 0. If the curve-fit function is non-linear, there is no explicit solution, so we reduce χ2 iteratively by introducing a perturbation to the parameters at each iteration. To achieve this, we use the Levenberg-Marquardt algorithm, which is a combination of the steepest descent method and the Gauss-Newton method. The parameter update equation contains a factor λ which is adjusted depending on which of these two methods one wants to use.

$$ x_{i+1} = x_i - (H + \lambda I)^{-1} \nabla f(x_i) \qquad (3) $$

where, for iterations 1 ≤ i < n, xi+1 is the updated parameter vector, xi is the parameter vector from the previous step, H, the Hessian, is the matrix of second derivatives (it accounts for curvature), λ is the “decision” factor, I is the identity matrix, and ∇f(xi) is the gradient of the objective function f = χ2.

At each step, if the parameters are far from optimal (i.e. χ2 is not reduced) we increase λ so that the steepest descent method is applied. When λ is large the update equation is approximately the following:

$$ x_{i+1} \approx x_i - (\lambda I)^{-1} \nabla f(x_i) \qquad (4) $$

Conversely, when the parameters are closer to optimal we decrease λ in order to use the Gauss-Newton method, and the update equation approximates:

$$ x_{i+1} \approx x_i - H^{-1} \nabla f(x_i) \qquad (5) $$
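To make the damping logic concrete, here is a minimal pure-Python sketch of a Levenberg-Marquardt loop applied to a two-parameter exponential decay with a floor, y = y0 + (100 − y0)·exp(−R0·t). It is not the implementation used in this work. It assumes unweighted residuals (all σi equal to 1), approximates the Hessian by JᵀJ as in Gauss-Newton, and uses hypothetical starting values and a factor-of-ten schedule for λ.

```python
import math


def model2(t, y0, r0):
    """Exponential decay of identity toward a floor y0 (percent)."""
    return y0 + (100.0 - y0) * math.exp(-r0 * t)


def chi2(ts, ys, y0, r0):
    """Unweighted sum of squared residuals (assumes every sigma_i = 1)."""
    return sum((y - model2(t, y0, r0)) ** 2 for t, y in zip(ts, ys))


def lm_fit(ts, ys, y0=20.0, r0=0.5, lam=1e-2, max_iter=500, tol=1e-12):
    """Levenberg-Marquardt sketch; start values and lam schedule are assumptions."""
    current = chi2(ts, ys, y0, r0)
    for _ in range(max_iter):
        # Build J^T J (Gauss-Newton approximation of the Hessian) and J^T r.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for t, y in zip(ts, ys):
            e = math.exp(-r0 * t)
            j1 = 1.0 - e                    # d(model)/d(y0)
            j2 = -(100.0 - y0) * t * e      # d(model)/d(r0)
            r = y - model2(t, y0, r0)
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        # Solve the damped 2x2 system (J^T J + lam * I) * step = J^T r.
        m11, m22 = a11 + lam, a22 + lam
        det = m11 * m22 - a12 * a12
        s1 = (m22 * g1 - a12 * g2) / det
        s2 = (m11 * g2 - a12 * g1) / det
        trial = chi2(ts, ys, y0 + s1, r0 + s2)
        if trial < current:
            # Step accepted: shrink lam to behave more like Gauss-Newton.
            gain = current - trial
            y0, r0, current = y0 + s1, r0 + s2, trial
            lam /= 10.0
            if gain < tol:
                break
        else:
            # Step rejected: grow lam to behave more like steepest descent.
            lam *= 10.0
            if lam > 1e10:
                break
    return y0, r0, current
```

A quick way to check such a routine is to generate noiseless data from known parameters and confirm that the fit recovers them; in practice a tested library routine would normally be preferred over a hand-written loop.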

2.2.5 Hubble-like derivation

In order to smooth out variations and increase coverage, for each data set of evolutionary distance as a function of divergence time, the data was binned by distance with bin size = 0.25.

For each distance bin, the corresponding time value was defined as the average time of all points within the bin. Then we shifted the window by a step size of 0.05 and repeated this process.

The rate of change of evolutionary distance as a function of time, dD/dt, was calculated as follows:

$$ D'(x_i) \approx \frac{D(x_{i+1}) - D(x_{i-1})}{t(x_{i+1}) - t(x_{i-1})} $$

where D’(xi) is the derivative as a function of time at point xi, D(xi+1) and t(xi+1) are the evolutionary distance and divergence time of the point following point xi, and D(xi-1) and t(xi-1) are the evolutionary distance and the divergence time of the point preceding xi respectively.

When a gap greater than 0.1 distance units was present between two points we did not take the derivative at these points to avoid outlier derivative values.
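The smoothing and differentiation steps of this section can be sketched as follows. The function names are hypothetical. Two points are assumptions rather than statements from the text: each window is represented by its centre on the distance axis, and a derivative is skipped whenever either neighbouring point lies more than 0.1 distance units away.

```python
def sliding_window_means(distances, times, width=0.25, step=0.05, d_max=1.0):
    """Average divergence time inside overlapping windows of distance.

    Assumption: each window is represented by its centre on the distance axis.
    """
    points = []
    n_windows = int(round((d_max - width) / step)) + 1
    for k in range(n_windows):
        lo = k * step
        hi = lo + width
        in_window = [t for d, t in zip(distances, times) if lo <= d < hi]
        if in_window:
            points.append((lo + width / 2.0, sum(in_window) / len(in_window)))
    return points


def central_differences(points, max_gap=0.1):
    """Estimate dD/dt at interior points from (distance, time) pairs."""
    derivatives = []
    for i in range(1, len(points) - 1):
        d_prev, t_prev = points[i - 1]
        d_cur, _ = points[i]
        d_next, t_next = points[i + 1]
        # Skip points next to a large gap to avoid outlier derivative values.
        if d_cur - d_prev > max_gap or d_next - d_cur > max_gap:
            continue
        if t_next == t_prev:
            continue  # the slope is undefined when both neighbours share a time
        derivatives.append((d_cur, (d_next - d_prev) / (t_next - t_prev)))
    return derivatives
```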

2.3 COMPARING MODELS OF SEQUENCE EVOLUTION

In order to investigate the long-term divergence of orthologous sequences we considered 30 enzymes from 24 diverse organisms (Fig. 2.2, table 2.1).


Figure 2.2. Phylogenetic tree of the 24 organisms considered in this study. The set contains 12 eukaryotic species (red) including 2 plant species (Arabidopsis thaliana and Oryza sativa), a fungus (Saccharomyces cerevisiae), and 2 protozoans (Plasmodium falciparum and Trypanosoma brucei). It also contains 3 archaeal species (green) and 9 bacteria (blue). The tree is based on the sequence alignment of the small subunit rRNAs.

Table 2.1. List of 30 enzymatic functions considered in this study.

| Enzymatic function | EC Number | Metabolism Pathway |
|---|---|---|
| IMP dehydrogenase | 1.1.1.205 | Nucleotide/Xenobiotics |
| Malate dehydrogenase | 1.1.1.37 | Carbohydrate/Energy |
| Dihydrolipoyl dehydrogenase | 1.8.1.4 | Carbohydrate/ |
| Thioredoxin-disulfide reductase | 1.8.1.9 | Nucleotide |
| Thymidylate synthase | 2.1.1.45 | Nucleotide/Cofactors & Vitamins |
| Acetyl-CoA C-acetyltransferase | 2.3.1.9 | Lipid/Amino Acid/Carbohydrate |
| Citrate (Si)-synthase | 2.3.3.1 | Carbohydrate |
| Amidophosphoribosyltransferase | 2.4.2.14 | Nucleotide/Amino Acid |
| Uracil phosphoribosyltransferase | 2.4.2.9 | Nucleotide |
| Glutamine-fructose-6-phosphate transaminase | 2.6.1.16 | Carbohydrate/Amino Acid |
| Phosphoglycerate kinase | 2.7.2.3 | Carbohydrate/Energy |
| Ribose-phosphate diphosphokinase | 2.7.6.1 | Carbohydrate/Nucleotide |
| UDP-N-acetylglucosamine diphosphorylase | 2.7.7.23 | Carbohydrate |
| UTP-glucose-1-phosphate uridylyltransferase | 2.7.7.9 | Carbohydrate |
| Inositol-phosphate phosphatase | 3.1.3.25 | Carbohydrate/Secondary Metabolite |
| Inorganic diphosphatase | 3.6.1.1 | Energy |
| Fructose-bisphosphate aldolase | 4.1.2.13 | Carbohydrate/Energy |
| Phosphopyruvate hydratase | 4.2.1.11 | Carbohydrate/Energy |
| Fumarate hydratase | 4.2.1.2 | Carbohydrate/Energy |
| UDP-glucose 4-epimerase | 5.1.3.2 | Carbohydrate |
| Triose-phosphate isomerase | 5.3.1.1 | Carbohydrate/Energy |
| Ribose-5-phosphate isomerase | 5.3.1.6 | Carbohydrate/Energy |
| Phosphoglycerate mutase | 5.4.2.1 | Carbohydrate/Energy |
| Phosphoglucomutase | 5.4.2.2 | Carbohydrate/Nucleotide/Secondary Metabolites |
| Phosphomannomutase | 5.4.2.8 | Carbohydrate |
| Acetate-CoA ligase | 6.2.1.1 | Carbohydrate/Energy |
| Succinate-CoA ligase | 6.2.1.5 | Carbohydrate/Energy |
| SAICAR synthetase | 6.3.2.6 | Nucleotide |
| Adenylsuccinate synthase | 6.3.4.4 | Nucleotide/Amino Acid |
| GMP synthase | 6.3.5.2 | Nucleotide/Xenobiotics |

The selected enzymatic activities represent important metabolic functions maintained in species from all domains of life covering all 6 classes of enzymes (oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases). For most of the 30 EC numbers a sequence could be found in all 24 organisms. Importantly, to prevent biases due to computational transfer of molecular functions based on sequence homology, we selected for our analysis only enzymes for which molecular function was experimentally validated in all species41 (see Methods).

We performed pairwise sequence alignments across all 24 organisms and compared sequence identity to divergence times between each pair of species for each enzyme. Divergence times, spanning ~4 billion years of evolution, were taken from the TimeTree database60 (Appendix 2.1).

First, we considered a simple model of sequence evolution that assumes that the rate of amino acid substitution at each site is independent and follows the Poisson distribution. This model is commonly used to estimate the number of substitutions that have occurred between two sequences61-63. Using this model (Eqn. 1) we fitted the pairwise sequence identity of each enzyme considered as a function of the time of divergence between the species. The fit corresponds to an exponential decay with a single parameter (R0) representing the average rate of amino acid substitution (see Methods). The obtained fits can be seen in Figure 2.3 and appendix 2.2 (green lines).

$$ y = 100 \, e^{-R_0 t} \qquad (1) $$

In the second model, we considered two distinct types of residues, A and B. Residues of type A are absolutely conserved in protein sequences, while residues of type B have constant substitution rates in accordance with the Poisson correction, as in model 1. Using this model we fitted an exponential decay (Eqn. 2) with the additional parameter y0 that corresponds to the percentage of un-mutable amino acids and represents the minimum sequence identity between any two proteins, i.e. a limit to sequence divergence.

$$ y = y_0 + (100 - y_0) \, e^{-R_0 t} \qquad (2) $$

From the R2 values (table 2.2) and a visual examination of the fitted curves (Fig. 2.3, appendix 2.2, red lines), this model is a better representation of divergent sequence evolution than model 1.

Figure 2.3. Fitting models of sequence divergence. For all 30 enzymatic activities we plotted pairwise sequence identity against time of divergence for all pairs of organisms and fitted the data with 3 models of amino acid substitution described in the text. Here we show 4 representative plots with the fits from model 1 (green lines) assuming that the rate of amino acid substitution is independent between sites and follows the Poisson distribution (Eqn. 1), model 2 (red) assuming that a fraction of residues remains unchanged (Eqn. 2), and model 3 (blue) that uses Gamma correction (Eqn. 3) to model substitution rate heterogeneity between sites. A) EC: 1.1.1.205, B) EC: 2.7.7.9, C) EC: 3.1.3.25 and D) EC: 6.2.1.5.

Table 2.2. Comparisons of the fits of models 1, 2 and 3 of sequence divergence. The fits of model 1 (Poisson correction), model 2 (effective limit to sequence divergence), and model 3 (Gamma correction) to the data were compared. The P-value for the F-test comparing nested models 1 and 2 is given in the F-test P-value column.

| Enzymatic function | Model 1 R2 | Model 2 R2 | F-test P-value | Model 3 R2 |
|---|---|---|---|---|
| 1.1.1.205 | 0.07 | 0.65 | 1.00x10^-20 | 0.60 |
| 1.1.1.37 | 0.55 | 0.68 | 1.96x10^-11 | 0.66 |
| 1.8.1.4 | 0.62 | 0.75 | 4.44x10^-16 | 0.75 |
| 1.8.1.9 | -0.25 | 0.56 | 1.00x10^-20 | 0.53 |
| 2.1.1.45 | 0.24 | 0.42 | 6.93x10^-9 | 0.48 |
| 2.3.1.9 | 0.28 | 0.61 | 2.22x10^-16 | 0.64 |
| 2.3.3.1 | 0.73 | 0.77 | 3.89x10^-6 | 0.77 |
| 2.4.2.14 | -0.15 | 0.74 | 1.00x10^-20 | 0.64 |
| 2.4.2.9 | 0.45 | 0.67 | 9.97x10^-14 | 0.67 |
| 2.6.1.16 | 0.08 | 0.71 | 1.00x10^-20 | 0.64 |
| 2.7.2.3 | 0.60 | 0.77 | 1.00x10^-20 | 0.77 |
| 2.7.6.1 | 0.58 | 0.68 | 7.75x10^-11 | 0.67 |
| 2.7.7.23 | -0.33 | 0.62 | 1.00x10^-20 | 0.58 |
| 2.7.7.9 | 0.57 | 0.76 | 1.00x10^-20 | 0.75 |
| 3.1.3.25 | -0.03 | 0.81 | 1.00x10^-20 | 0.77 |
| 3.6.1.1 | -0.14 | 0.61 | 1.00x10^-20 | 0.58 |
| 4.1.2.13 | 0.60 | 0.70 | 3.64x10^-5 | 0.71 |
| 4.2.1.11 | 0.64 | 0.77 | 1.11x10^-16 | 0.77 |
| 4.2.1.2 | 0.78 | 0.78 | 0.226 | 0.79 |
| 5.1.3.2 | 0.37 | 0.70 | 1.00x10^-20 | 0.74 |
| 5.3.1.1 | 0.44 | 0.69 | 1.11x10^-16 | 0.67 |
| 5.3.1.6 | -0.07 | 0.74 | 1.00x10^-20 | 0.64 |
| 5.4.2.1 | 0.15 | 0.61 | 9.75x10^-12 | 0.52 |
| 5.4.2.2 | 0.80 | 0.84 | 2.10x10^-5 | 0.84 |
| 5.4.2.8 | 0.35 | 0.57 | 1.12x10^-7 | 0.57 |
| 6.2.1.1 | -1.33 | 0.47 | 1.00x10^-20 | 0.45 |
| 6.2.1.5 | 0.29 | 0.72 | 1.00x10^-20 | 0.71 |
| 6.3.2.6 | 0.60 | 0.83 | 1.11x10^-16 | 0.82 |
| 6.3.4.4 | 0.64 | 0.79 | 1.00x10^-20 | 0.80 |
| 6.3.5.2 | -0.50 | 0.44 | 1.00x10^-20 | 0.33 |

The parameter y0 represents the proportion of amino acids in a sequence that cannot be mutated if that sequence has to retain its molecular function. If we plot its distribution for all 30 enzymatic activities the majority of the density is at 30-40 % sequence identity (Fig. 2.4). We can conclude that diverging enzymes require conservation of ~30 % of their amino acids to retain the same molecular function.

Figure 2.4. Distribution of model 2 parameter y0 across 30 enzymatic functions. A) all species, B) eukaryotes only. The parameter y0 corresponds to the fraction of residues that will remain unchanged throughout divergence. The analysis was repeated comparing only pairs of eukaryotes to discount any possible effect of horizontal gene transfer, which is known to occur between prokaryotes. It should be noted that consideration of all species allows us to survey long evolutionary times (up to 4 Byr), whereas the eukaryotic evolutionary history is much shorter, so it may be less accurate in simulating long-term divergence behavior.

The second model has one extra parameter compared to the first model, and it is therefore more complex. Consequently it comes as no surprise that it offers a better fit to the data. To determine whether the fit improvement is worth the cost of an additional parameter we used the F-test, which can be applied in this case because model 1 is nested in model 268.

We calculated the F statistic (Eqn. 5, see Methods) and determined the corresponding p-value for each enzymatic activity analyzed (Table 2.2). In all but one of the cases the F-test rejects the hypothesis that the additional parameter does not contribute any additional information to the fit (p < 0.05). Therefore we accept model 2 as a better representation of divergent protein evolution.

Models 1 and 2 assume that substitution rates are uniform at all sites, with residues of type A in model 2 having a substitution rate of zero. A more realistic method is to use Gamma correction to model rate heterogeneity at different sites64-66. The Gamma correction model assumes that the substitution rates across all residues in a sequence are Gamma-distributed. The shape parameter α (Eqn. 3) is a measure of substitution rate variability across sites. The parameter α was determined empirically along with the substitution rate R0. The equation relating sequence identity and time of divergence is as follows:

$$ y = 100 \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha} \qquad (3) $$

From figure 2.3 and the R2 values in table 2.2, we observe that models 2 and 3 are approximately equally good at fitting the data. Notably, according to model 2, diverging enzymes that retain the same function will never reach 5 % sequence identity, the homology level for a randomly chosen pair of proteins69. According to the more realistic Gamma-corrected model, it will take hundreds to thousands of billions of years (Byr) for said enzymes to diverge to 5 % sequence identity (appendix 2.3). Given that the lifetime of the Sun is about 9 Byr, orthologues will never diverge to the level of random enzymes if they are to continuously retain the same molecular function. Similar results were observed when we applied our models to computationally-annotated enzymes comprising 300 different EC numbers from Swiss-Prot70 (fig. 2.5).
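The time needed to reach a given identity under the Gamma-corrected model follows from inverting y = 100·(R0·t/α + 1)^(−α), which gives t = (α/R0)·((y/100)^(−1/α) − 1). The sketch below implements both directions; the function names are hypothetical, and any parameter values passed to them are illustrative rather than fitted values from this study.

```python
def gamma_identity(t, r0, alpha):
    """Percent identity after divergence time t under the Gamma-corrected model."""
    return 100.0 * (r0 * t / alpha + 1.0) ** (-alpha)


def time_to_identity(y, r0, alpha):
    """Divergence time at which the Gamma-corrected model reaches identity y (%)."""
    return (alpha / r0) * ((y / 100.0) ** (-1.0 / alpha) - 1.0)
```

Because the exponent −1/α grows in magnitude as α decreases, strong rate heterogeneity (small α) pushes the time required to reach low identities up very quickly, in line with the long time scales reported above. Under model 2, by contrast, no finite time reaches an identity below y0.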


Figure 2.5. Analysis of 300 functional sets from Swiss-Prot. A) Distribution of the goodness of fit for all 3 models across 300 enzymatic activities. Model 1, which postulates that sequence divergence has no limit, has the worst fit. Models 2-3 show equivalent performance. B) Distribution of y0 parameter of model 2 determined empirically for each EC dataset. The average value for y0, which corresponds to the percentage of amino acids that have to remain the same for function to be preserved, is 34.06 ± 0.69 %. C) Distribution across the 300 EC datasets of the time of divergence required to reach 5 % sequence identity (average identity of unrelated sequences) between pairs of orthologues based on fits to model 3.

From these observations we conclude that there is indeed a constraint placed on diverging sequences to retain their function. This constraint translates to an effective limit to sequence divergence.

2.4 HUBBLE LAW-LIKE DERIVATION

We can express the Gamma correction model (model 3) as evolutionary distance as a function of divergence time.

$$ y = \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha}, \qquad D = 1 - \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha} \qquad (1) $$

where y is the sequence identity between two orthologous enzymes (expressed here as a fraction), and D is the evolutionary distance defined as D = 1 − y. From this we can derive an equation for the rate of change of the evolutionary distance with respect to divergence time.

$$ \frac{dD}{dt} = R_0 \times (1 - D)^{\frac{\alpha + 1}{\alpha}} \qquad (2) $$

where R0 and α are the rate and shape parameters of model 3.
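The step from the distance expression to the rate expression can be spelled out as follows. This is a reconstruction from the definitions of D, R0 and α given above, offered as a sketch rather than as the derivation printed in the original.

```latex
\begin{align*}
D &= 1 - \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha} \\
\frac{dD}{dt} &= \alpha \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha - 1} \frac{R_0}{\alpha}
             = R_0 \left( \frac{R_0 t}{\alpha} + 1 \right)^{-(\alpha + 1)} \\
\text{Since } 1 - D &= \left( \frac{R_0 t}{\alpha} + 1 \right)^{-\alpha}
  \;\Rightarrow\; \frac{R_0 t}{\alpha} + 1 = (1 - D)^{-1/\alpha}, \\
\frac{dD}{dt} &= R_0 \, (1 - D)^{\frac{\alpha + 1}{\alpha}}
\end{align*}
```

At D = 0 the rate equals R0, and because the exponent (α + 1)/α is greater than 1 the rate falls toward zero as D approaches 1.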

Equation 2 bears similarity to the Hubble law, which relates the velocity v at which galaxies recede from the Earth to the distance D between the galaxies and the observer. The velocity corresponds to the derivative of D with respect to cosmological time t. H0 is the Hubble constant.

$$ \frac{dD}{dt} = v = H_0 D \qquad (3) $$

In contrast to the Hubble law, which depicts a linear relationship between dD/dt and D, equation 2 shows that dD/dt follows a power law. Therefore divergence slows down as the evolutionary distance increases.

We can show this empirically by plotting evolutionary distance as a function of divergence time for a given set, calculating dD/dt at different evolutionary distances, and comparing the fit of the points with equation 2 to the curve obtained from plugging in model 3 parameters. To show that dD/dt slows down with increasing evolutionary distance across different enzymatic functions, we compare the fit with equation 2 to a linear fit with the Akaike Information Criterion (appendix 2.4). In all cases the non-linear model outperforms the linear fit with likelihood ratio > 5.0 (figure 2.6, appendix 2.5). In all cases the empirical data fit the Hubble-like model, suggesting that model 3 is the best approximation of true sequence divergence, and that the velocity of divergence at the sequence level is increasingly dampened with increasing evolutionary distance. In other words, the more distant two enzymes are from each other, the slower they diverge from each other. This result is consistent with our hypothesis that divergent evolution of enzymes that must retain their molecular function has an effective limit.

Figure 2.6. Hubble-like model of sequence divergence. For all 30 enzymatic activities we plotted the pairwise rate of change of the evolutionary distance (velocity of divergence) between pairs of enzymes against evolutionary distance between the two corresponding species. The black points are empirically calculated dD/dt (see methods). For each enzymatic function, to obtain the green line we plugged in the corresponding α and R0 parameters obtained from the gamma-corrected model of sequence divergence (model 3) into equation (2). The dotted blue line represents an actual fit of the data points with the Hubble-like model. A) EC: 1.1.1.205, B) EC: 2.7.7.9, C) EC: 3.1.3.25 and D) EC: 6.2.1.5.


2.5 FUNCTIONAL CONSTRAINTS ON THE EVOLUTION OF PROTEIN STRUCTURE

The next step in the analysis was to study the conservation of enzyme structures. For this we performed structural alignments between Swiss-Prot enzymes annotated with the 30 EC numbers in table 2.2 and with corresponding atomic coordinates in the Protein Data Bank40.

Structures were aligned using the Combinatorial Extension (CE) algorithm67, which measures the root mean square distance (RMSD) between C-α atoms over the length of the aligned residues. The average RMSD for each pair of organisms in our 24 species set was plotted against their divergence time (Fig. 2.7, panel A).
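For reference, the RMSD over paired C-α atoms can be computed as in the sketch below. It assumes the two coordinate lists are already residue-paired and superposed, as a structural aligner such as CE would provide; it does not perform the alignment or the superposition itself, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```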

Figure 2.7. Structural divergence of enzymes with the same molecular function. A) To investigate the extent of structural divergence for enzymes during evolution we compared the RMSD between pairs of orthologous enzymes from the enzyme database BRENDA to the time of divergence between the corresponding species. The red line is the linear regression (p < 0.05). B) Distribution of RMSD values for pairs of orthologs used in this study (blue bars) vs. pairs of identical sequences crystallized by different groups (red bars). The mean RMSDs for both sets are represented as dashed lines.


Although one would expect structural similarity between orthologous enzymes to decay over time, it has been shown that structure is more robust to divergence than sequence71,72. As shown in the figure, we observe that the structures of pairs of enzymes sharing the same function do not significantly diverge as a function of time (Pearson correlation coefficient between time and RMSD is 0.042). Moreover, most structures are within 3.5 Å RMSD even after ~3.5 Byr of evolution, which seems to indicate that functional restraints prevent structural divergence beyond this limit. The variability in RMSD at each point is partly attributed to expected differences in crystal structures (fig. 2.7, panel B). Even for identical amino acid sequences different crystallization conditions (e.g. presence of ligand, conformational change) lead to average RMSD values of around 1 Å.

Despite this structural conservation it is apparent from the study of sequence divergence that the substitution rates vary between sites across a sequence. If this is the case, we would expect certain residues to be more constrained than others and therefore show a higher conservation across long evolutionary distances. We carried out triplet sequence alignments for the 30 enzymatic functions in this study using representative sequences from each domain of life (Bacteria, Archaea and Eukarya; Fig. 2.8).


Figure 2.8. Average pairwise and triplet sequence identity for proteins in different domains of life: Bacteria, Archaea and Eukarya. The 2-branch conservation bar (red) corresponds to the average conserved sequence identity between pairs of enzymes with the same molecular function but belonging to different domains of life (e.g. eukaryote vs. archaea or bacteria vs. archaea). The 3-branch conservation bar (blue) represents the average sequence identity between sets of 3 enzymes with the same molecular function where one enzyme is bacterial, another one is archaeal and the third one is from a eukaryotic genome. We observe significant conservation across all 3 domains of life despite long times of divergence, consistent with the notion that the same set of residues has been conserved throughout divergent evolution.

While the sequence identity shared by sequences from all 3 domains of life is lower than the identity between pairs of sequences from different domains (~30 % vs. 40-50 %), it is still significantly higher than the homology level of unrelated enzymes (~8 %). The fact that a significant subset of amino acids is conserved across the 3 domains of life shows that the mutations that do occur do not do so at random; otherwise we would observe very little 3-branch conservation. The same residues that have to be conserved for function in bacteria also have to be conserved for archaeal and eukaryotic species. This should ensure that no pair of diverging enzymes with the same molecular function reaches the sequence homology levels of unrelated proteins, no matter how long they have been diverging from their common ancestor.

2.6 ACTIVE SITE REGION CONSERVATION

Because the condition to maintain the same function imposes a limit on sequence divergence, we hypothesize that the rate of amino-acid substitution increases with distance from functionally important residues73. To test this idea, we focused on enzyme active sites, which are responsible for catalytic activity and reaction specificity. For our set of enzymes, the active site was defined as the centroid of the active site residues. Annotations of these residues were retrieved from UniprotKB or the PDB, and visually verified with the molecular graphics system PyMOL74. Five-angstrom layers starting at the active site centroid were determined in a representative 3D structure for each enzymatic activity (fig. 2.9).
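The layer assignment described above can be sketched as follows. This is a minimal illustration and not the code used in this work: it assumes one representative coordinate per residue (for example the C-alpha atom), and the function names `centroid` and `assign_layers` as well as the toy coordinates are hypothetical.

```python
import math

def centroid(points):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def assign_layers(coords, active_site_idx, width=5.0):
    """Return a 1-based layer index for every residue.

    coords: one (x, y, z) position per residue, in angstroms (assumption:
            a single representative atom per residue).
    active_site_idx: indices of the annotated active site residues.
    width: layer thickness in angstroms (5 A layers, as in the text).
    """
    center = centroid([coords[i] for i in active_site_idx])
    # Layer 1 covers [0, width), layer 2 covers [width, 2*width), and so on.
    return [int(math.dist(c, center) // width) + 1 for c in coords]

# Toy coordinates, made up for illustration only.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (7.0, 0.0, 0.0), (13.0, 0.0, 0.0)]
layers = assign_layers(coords, active_site_idx=[0, 1])
print(layers)
```

Sequence identity can then be computed separately over the alignment columns that fall in each layer.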

Figure 2.9. Sequence conservation as a function of the distance to the active site. A) For illustration purposes the surface representation of the T. maritima enzyme IMP dehydrogenase (PDB 1VRD) color-coded for distance to active site is shown. B) This graph represents 30 enzymes of the central metabolism from 24 model organisms. For each enzymatic function we plotted the sequence identity between all pairs of species against the divergence time between the species. C) We plotted sequence identity as a function of distance to the active site centroid at 0.25 Byr of divergence (solid line, squares), 1.75 Byr (dashed line, triangles) and 3.25 Byr (dotted line, circles).

Multiple sequence alignments were performed and the sequence identity in each layer was determined as a function of divergence times between species (Fig. 2.9). As expected, sequences are more conserved around the active site than at greater distances. Also, the estimated rate of divergence R0 was faster farther from the active site. Importantly, residues far away from the active site do not diverge to the level expected of random enzymes; they still maintain a significant level of sequence identity (34.8 % on average across enzymes for amino acids in region 5, Fig. 2.9).

Enzymes have only a handful of catalytic residues in their active sites. The residues directly around these amino acids form and probably help maintain the structural integrity of the catalytic pocket to prevent steric clashes and accommodate the substrate. Therefore, functional restrictions apply a greater pressure on sequence and structure conservation closer to the active site, consistent with previous findings75. The first two layers closest to the active site centroid contain fewer residues than the outer three; this explains the larger observed error in the sequence identities of layers 1 and 2 (Fig. 2.9, panel B, black and red lines).

2.7 DIVERGENT VS. CONVERGENT EVOLUTION

In 1998, Rost studied the conservation of protein structures with divergent sequences and determined that at low sequence identities (i.e. ≤25 %) it is impossible to differentiate between products of evolutionary divergence and convergence69. In other words, convergent and divergent evolution would have reached equilibrium. Proteins below 25 % identity are said to be in the “twilight zone” of protein space because their weak homology to other proteins makes it impossible to trace evolutionary relationships9,76. To gain a better understanding of “twilight zone” enzymes and to determine how stable protein structure is with respect to sequence changes, all-against-all structural alignments of all enzyme structures from the PDB cross-listed in Swiss-Prot with EC annotations were conducted. This set encompasses over 150 enzymatic functions across diverse species. A plot of RMSD vs. sequence identity was constructed to observe the density of structure pairs over structure and sequence space (Fig. 2.10, appendix 2.6).


Figure 2.10. Density plot of RMSD vs. sequence identity. The figure represents the density of enzyme pairs sharing all 4 EC digits. The inset on the left represents enzyme pairs sharing only the first 3 EC digits, and the inset on the right corresponds to random enzyme pairs.

As seen in figure 2.10, there are two distinct high-density areas that correspond to enzymes having converged to the same function (high RMSD, low sequence identity) and enzymes diverging from a common ancestor (low RMSD, high sequence identity). This suggests that diverging orthologues are limited in their sequence and structural divergence and are distinguishable from enzymes that converged to the same function. As a control, we only considered pairs of enzymes sharing the first 3 EC digits (A.B.C.x) (fig. 2.10, left inset). These enzyme pairs perform similar enzymatic reactions but have distinct specificities. When the molecular function definition is relaxed this way, enzyme pairs are free to diverge and we do not observe a clear limit in structure or sequence, similar to the case of randomly chosen enzyme pairs (Fig. 2.10, right inset). We conclude that conservation of the same function, defined by specific enzymatic activities in the EC nomenclature, restricts sequence and structure divergence to a limit of 20-30 % and 3.5 Å respectively.

2.8 DISCUSSION

We have shown that sequence divergence for enzymes sharing the same molecular function has a limit. Enzymes diverging from a common ancestor must retain ~30 % sequence identity in order for function to be maintained. Also, the difference in structures of same-function diverging enzymes is capped at ~3.5 Å. While horizontal gene transfer (HGT) could reduce the observed sequence divergence between orthologues, we do not expect a substantial effect of HGT on our results. All 30 enzymes considered carry out central metabolic functions and are therefore less likely to be transferred across widely diverged lineages77,78. Moreover, we observe a similar pattern of sequence divergence and similar limits for most enzymes studied.

Our structural analysis shows that sequence divergence is heterogeneous across the enzyme sequence, with residues closer to the catalytic site mutating at a slower rate. Notably, we were able to develop a Hubble Law-like model that is a good fit to the data, and that confirms that sequence divergence slows down as evolutionary distance increases.

Consequently, in cases where global sequence identity between enzymes is low, examination of residues close to the catalytic site may be useful for function prediction. These results are significant because the majority of sequenced enzymes have less than 60 % global sequence identity to their closest homologues6,9. Therefore, the identified limit of sequence evolution has important applications for understanding the conservation of catalytic functions and for reliably transferring functional annotations between extensively diverged sequences.

Chapter 3 - PREDICTING ENZYME FUNCTION

3.1 INTRODUCTION

It is estimated that 30-50 % of genes in a given genome are either unannotated or improperly or incompletely annotated42,43,79. A major challenge of the post-genomic era is the systematic functional annotation of every gene in a genome, which in turn will allow us to further our understanding of organisms as a whole.

3.1.1 Sequence-Based Function Prediction Methods

The most widely used method of protein functional prediction is through sequence homology to functionally characterized proteins7,80-82. In fact, about 50 % of all gene products of a newly sequenced genome may be annotated using sequence homology based annotation transfer45,83. The rationale behind this method is that homologous proteins should have related functions84. A query sequence of unknown function is compared to a database of functionally annotated proteins such as Swiss-Prot70,85 with BLAST2 so that the annotation of the top homologue is proposed as the true annotation of the query. However, this method requires close homologues with functional annotation. It has been demonstrated that at least 60 % sequence identity is required for an accurate transfer of molecular function4,6,7. A sizeable proportion of gene products fall in a region of low sequence homology to any functionally annotated enzymes, the so-called “twilight zone”9,86, thus rendering the sequence homology method ineffective in those cases. Additionally, high sequence homology is not always synonymous with identical function. For instance, gene duplication events that are common during evolution give rise to paralogues that may diverge and acquire new functions44 (fig. 3.1). Because the two paralogues come from the same ancestor protein, they are evolutionarily related and may share substantial sequence identity while having different functions12,87,88. For these reasons, predicting function based solely on global sequence homology may lead to errors in annotation3,5,7.

Figure 3.1. Orthology and paralogy. Two proteins with sequence homology may be either paralogues or orthologues. Paralogues emerge as a result of a gene duplication event within a single organism. One copy of the gene evolves into a new function so that genes A and B have similar sequences but different functions.

3.1.2 Structure-Based Function Prediction Methods

In such cases one can resort to structure to help shed light on potential functions. Indeed, structure is much more robust to changes throughout evolution than sequence. In fact, two proteins with low pairwise sequence identity may have the same fold and their active site may have the same spatial location, thus indicating some functional ties that sequence homology alone could not reveal32,45. Structure-based methods can be divided into two categories. The first type operates by comparing the overall structure of the unknown query to a database of structures with known functional annotation such as the PDB83,89,90. The second type employs geometric analysis to identify active sites, either with cavity detection91-93, or by local structure homology94-97. Subsequently, ligand docking approaches may be used to refine the predicted ligand specificity of the protein query83,89,98. These methods require that the uncharacterized protein query has an experimentally determined 3D structure or a high quality homology model. As a result of improvements in structure determination efforts in the past decade, there are now close to 80,000 experimentally determined 3D structures in the Protein Data Bank40. The structural coverage of sequenced gene products is still imperfect, but over time such methods will become increasingly applicable and useful.

3.2 MATERIALS AND METHODS

3.2.1 Determination of the active site region

The first step is to find candidate functions for the uncharacterized query sequence. This is done through a BLAST search against Swiss-Prot. The reason for this is that two sequences with the same molecular function are expected to share some level of sequence homology. But because global sequence homology might be low, we want to measure local sequence homology in the region surrounding the active site residues. To do this we need to know the structure of the query or of the hit; otherwise we cannot determine the active site region residues from the sequence alone. We find structural representatives by searching the PDB database of experimentally determined 3D structures. We get active site residue annotations from the PDB, CSA or Swiss-Prot. We define the active site region as the set of all residues contained within a radius of 8 Å around the centroid of the catalytic residues in space. Finally, we map the region onto the sequence and compute local conservation at all residues comprising the active site region. This serves as input for 3D-GLOBUS.

3.2.2 Gibbs sampling

GLOBUS is a method to assign probability to functional annotations; it gives a probability for each possible assignment for a given gene. GLOBUS uses Gibbs sampling to effectively sample from the network probability space, placing more emphasis on regions of high likelihood. The first step in performing whole-genome functional assignments with GLOBUS was to build a generic EC network corresponding to all known enzymatic activities involving small molecules. The nodes of the network are reactions, and the edges linking 2 nodes represent shared metabolites (substrates or products). We used both global and local sequence homology to initialize the network by randomly assigning every gene to one of its candidate locations.

Then, at each iteration of the Gibbs sampler, gene-gene context was used to evaluate the likelihood of the network assignment. We defined a fitness function to be minimized throughout the sampling process. The overall fitness function was evaluated on the basis of a given global assignment of metabolic genes. This function consists of several terms, each of which is the product of a functional descriptor (an f variable) and its associated weight (a b variable) (eqn. 1).

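\[
E(a_1, a_2, \ldots, a_n) = -b_{\text{homology}} f_{\text{homology}} - b_{\text{orthology}} f_{\text{orthology}} - b_{\text{context}} f_{\text{context}} - b_{\text{EC co-occurrence}} f_{\text{EC co-occurrence}} + b_{\text{orthology penal.}} f_{\text{orthology penal.}} \tag{1}
\]

From the fitness function we derived global probabilities of a network assignment (eqn. 2).

\[
P(a_1, a_2, \ldots, a_n) = \frac{1}{Z} \times e^{-E(a_1, a_2, \ldots, a_n)} \tag{2}
\]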

For a step in the simulation we take a random gene, determine the probability of correct functional prediction for the gene at all candidate locations, and assign it to the location that affords best overall probability for a given network. We do this until convergence of gene assignments, and the marginal probability for each gene corresponds to the number of times it was successfully placed in a given location over the simulation.
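The sampling procedure can be illustrated with a short sketch. It is not the GLOBUS implementation: it assumes a toy fitness function with only a homology term and a gene-gene context term, unit weights, and a standard Gibbs update in which a new location is drawn with probability proportional to exp(-E) rather than always taking the single best location. All names (`energy`, `gibbs`, `b_hom`, `b_ctx`) and all numbers are hypothetical.

```python
import math
import random

def energy(assign, homology, context, adjacent, b_hom=1.0, b_ctx=1.0):
    """Toy fitness E (lower is better) with two illustrative terms only."""
    e = -b_hom * sum(homology[g][loc] for g, loc in assign.items())
    genes = sorted(assign)
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            # Context is rewarded only for genes placed on neighboring reactions.
            if frozenset((assign[g], assign[h])) in adjacent:
                e -= b_ctx * context.get(frozenset((g, h)), 0.0)
    return e

def gibbs(homology, context, adjacent, steps=2000, burn_in=200, seed=0):
    """Return marginal probabilities of each candidate location per gene."""
    rng = random.Random(seed)
    genes = sorted(homology)
    # Random initial assignment of every gene to one of its candidate locations.
    assign = {g: rng.choice(sorted(homology[g])) for g in genes}
    counts = {g: {loc: 0 for loc in homology[g]} for g in genes}
    for step in range(steps):
        g = rng.choice(genes)
        locs = sorted(homology[g])
        weights = []
        for loc in locs:
            trial = dict(assign)
            trial[g] = loc
            weights.append(math.exp(-energy(trial, homology, context, adjacent)))
        # Draw the new location from the conditional distribution.
        assign[g] = rng.choices(locs, weights=weights)[0]
        if step >= burn_in:
            for h in genes:
                counts[h][assign[h]] += 1
    kept = steps - burn_in
    return {g: {loc: c / kept for loc, c in counts[g].items()} for g in genes}

# Toy inputs, made up for illustration only.
homology = {"geneA": {"R1": 0.9, "R2": 0.2}, "geneB": {"R2": 0.5, "R3": 0.5}}
context = {frozenset(("geneA", "geneB")): 2.0}
adjacent = {frozenset(("R1", "R2"))}  # reactions R1 and R2 share a metabolite
marginals = gibbs(homology, context, adjacent)
print(marginals)
```

In this toy case geneB has equal homology to R2 and R3, so only the context term, which rewards placing geneA and geneB on neighboring reactions, separates the two candidates.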

3.2.3 Performance evaluation

We evaluate the performance of the different methods by plotting precision-recall curves. To do this, we first filter all functional predictions to keep only the best one for each gene. The best prediction is the one with the highest relevant score (i.e. probability from GLOBUS and 3D-GLOBUS, global sequence identity for homology-based function prediction, Argot2 score, or fraction of EFICAZ components in agreement). We are left with a single functional prediction per gene. Then, for each method to be evaluated, we rank the genes according to their respective score. To build the precision-recall curve we calculate precision and recall at different cutoff values of the score.
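The evaluation steps above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script used here: recall is taken as the fraction of gold-standard genes that are correctly annotated at a given cutoff, and the function names and toy predictions are hypothetical.

```python
def best_per_gene(all_predictions):
    """Keep only the highest-scoring (gene, ec, score) prediction for each gene."""
    best = {}
    for gene, ec, score in all_predictions:
        if gene not in best or score > best[gene][2]:
            best[gene] = (gene, ec, score)
    return list(best.values())

def precision_recall(predictions, truth):
    """Return a list of (score_cutoff, precision, recall) tuples.

    predictions: one (gene, predicted_ec, score) tuple per gene.
    truth: dict mapping each gold-standard gene to its correct EC number.
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    curve = []
    correct = 0
    for k, (gene, ec, score) in enumerate(ranked, start=1):
        if truth.get(gene) == ec:
            correct += 1
        # Precision over the top k predictions, recall over all gold-standard genes.
        curve.append((score, correct / k, correct / len(truth)))
    return curve

# Toy data, made up for illustration only.
raw = [("g1", "1.1.1.1", 0.9), ("g1", "1.1.1.2", 0.3),
       ("g2", "2.7.1.1", 0.8), ("g3", "3.1.3.1", 0.4)]
truth = {"g1": "1.1.1.1", "g2": "2.7.1.2", "g3": "3.1.3.1", "g4": "4.1.1.1"}
curve = precision_recall(best_per_gene(raw), truth)
for cutoff, precision, recall in curve:
    print(cutoff, precision, recall)
```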

3.3 INTEGRATING LOCAL SEQUENCE CONSERVATION

3.3.1 Structural Coverage

Although several methods for structure-based functional predictions have been developed54,90,99,100, to our knowledge structural data had never been integrated before with genomic context data and phenotypic data under a unified probabilistic framework. The main advantage of using structural information to identify the molecular function of a protein is the high evolutionary conservation of the residues responsible for specificity and catalysis. It has been demonstrated that at least 60 % overall sequence identity is required for an accurate transfer of molecular function, for example all 4 digits of the EC classification, between homologous proteins4,6. It has also been shown that across enzymes with the same molecular function, protein residues located close to an active site remain highly conserved even after billions of years of evolution33,101-103. Our main idea for incorporating structural information into the GLOBUS framework was to identify residues of the query sequence that are responsible for ligand-binding, specificity, and catalysis. Because it is not possible to determine which residues are part of the active site region merely by looking at the sequence, this method relies on the availability of a structure or structural representative, in addition to knowledge of the position of the catalytic residues in the sequence.

For a query protein of unknown function we may observe two types of cases: either the 3D structure of the protein has been experimentally determined, or it is unknown. To determine whether the structure is known, we compare the query sequence to a database of sequences with experimentally determined structures, the Protein Data Bank (PDB)40. If the query sequence exactly matches a sequence in the PDB, then the true 3D structure of the query is known, and we can use it to determine which amino acids in the sequence are associated with the active site. Conversely, if the query sequence has no exact match in the PDB, the structure of the closest homologue in the PDB may act as a proxy for the true structure. The rationale behind this is that structure remains strongly conserved so that proteins with very diverged sequences may still share significant structural similarity. By the same token, the position of the active site in the tertiary structure should also be conserved, even in somewhat distant proteins. This is the case because a protein has to remain functional during divergent evolution, or be eliminated by natural selection. The catalytic region therefore has to be preserved at the structural level in order to continue accommodating a substrate, so it is unlikely for two evolutionarily related enzymes to have very different topologies. Consequently, to increase the structural coverage, it is acceptable to use the structure of the closest homologue to the query in the PDB to identify the position of the active site residues in the query sequence.

Because of the relative ease of purification and crystallization, current structural coverage is particularly good for bacterial enzymes. For example, our calculations demonstrate that currently in an average bacterial genome, over 60 % of metabolic genes have significant sequence identity to a protein with a known structure (fig. 3.2).


Figure 3.2. Cumulative structural coverage for microbial and Swiss-Prot enzymes. This plot shows the structural coverage when we allow the use of a structural representative with different sequence homology cutoffs. When we BLAST the sequence of Swiss-Prot enzymes against the sequences in the PDB to find a representative structure, we see that 87 % of Swiss-Prot enzymes have at least 30 % homology to an enzyme with known structure. In the case of newly sequenced bacteria, 67 % of enzymes have at least 30 % homology to an enzyme with known structure. The percentage goes up to 92 % and 86 % respectively when no homology cutoff is applied.

Moreover, we can further increase the structural coverage of bacterial genes through the use of the transitive property of structure conservation, and turn to remote homologues to find the location of the active site using structure104,105. Consider three proteins Q, I and T that originate from a common ancestor, so that they must have significant structural similarity. Q is the protein of interest; however, only T has an experimentally determined structure. Because of long times of divergence, the query Q and target T have very low sequence homology, and therefore we would not be able to identify them as functionally related with a simple BLAST search. To find a structure for Q we BLAST it against Swiss-Prot and find the closest homologue I, which also has no known structure. If we do a BLAST search for I against the PDB, we find T. Therefore, T and I have similar structures, and by the transitive property, T and Q must share structural similarity as well (fig. 3.3).
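The lookup order in this Q, I, T argument can be sketched with two dictionaries standing in for precomputed BLAST results. This is only an illustration: `swissprot_hits`, `pdb_hits`, and `find_structure` are hypothetical names, and no real BLAST call is made.

```python
def find_structure(query, swissprot_hits, pdb_hits):
    """Return (structure_id, route) for a query sequence identifier.

    swissprot_hits: dict mapping a query to its Swiss-Prot homologues,
                    ordered from closest to most distant (assumed precomputed).
    pdb_hits: dict mapping a sequence to the PDB entry it matches
              (assumed precomputed).
    """
    # Case 1: the query itself has a significant hit in the PDB.
    if query in pdb_hits:
        return pdb_hits[query], "direct"
    # Case 2: go through an intermediate Swiss-Prot homologue I to reach T.
    for intermediate in swissprot_hits.get(query, []):
        if intermediate in pdb_hits:
            return pdb_hits[intermediate], "via " + intermediate
    return None, "none"

# Toy data mirroring the Q, I, T example in the text.
swissprot_hits = {"Q": ["I"]}
pdb_hits = {"I": "T"}
result = find_structure("Q", swissprot_hits, pdb_hits)
print(result)
```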

Figure 3.3. Illustration of the transitive properties for structural homology. This figure illustrates the concept of the transitive property of structure conservation. Structure is robust to changes in sequence so that two proteins with low global sequence homology may still have similar structures. In order to uncover this relationship, an intermediate protein with significant homology to both query and target is necessary.

In our case, we blasted bacterial query sequences against Swiss-Prot to find functionally annotated intermediate sequences I, which we subsequently blasted against the PDB, so that the active site sequence identity could be determined between the unknown query and each functionally characterized Swiss-Prot hit. With this procedure, structural coverage of bacterial metabolic genes exceeds 90 % (figure 3.2).


Table 3.1. Structural coverage of metabolic gene products in 5 species.

Species            Genes with significant     Genes with structural         Genes with structural
                   BLAST hit in Swiss-Prot    representative through        representative through
                   Enzyme Set                 direct homology (Coverage)    Swiss-Prot hit (Coverage)

S. cerevisiae      1,124                      997 (88.70 %)                 1,077 (95.82 %)
B. subtilis        1,615                      1,452 (89.91 %)               1,488 (92.14 %)
E. coli            1,120                      1,025 (91.52 %)               1,076 (96.07 %)
M. tuberculosis    1,638                      1,419 (86.63 %)               1,543 (94.20 %)
S. aureus          1,084                      934 (86.16 %)                 1,008 (92.99 %)

3.3.2 Localizing the active site region

Once we identified a structural representative for the query sequence or for the Swiss-Prot hits, we first found the catalytic residues of this protein through annotations in Swiss-Prot, the PDB, or the Catalytic Site Atlas106,107. As mentioned previously, catalytic residues only account for three to five amino acids out of the whole enzyme sequence. However, we have shown in Chapter 2 that the amino acids contained in a region extending up to 10 Å around the catalytic residues are significantly more conserved than the overall sequence between enzymes with the same molecular function. Spatially close residues are not necessarily close in the sequence. Therefore we needed a structure to disambiguate the amino acids of the active site region from the rest of the sequence. We then mapped the active site region residues onto the sequence and aligned the sequence of the 3D structure with the sequence of the bacterial query and the Swiss-Prot hits. For each Swiss-Prot hit, we computed active site sequence identity as the conservation between bacterial query sequence and Swiss-Prot hit sequence only at the positions of residues of the active site region (fig. 3.4). The active site region sequence homology was then used, in addition to the overall sequence homology, in the GLOBUS fitness function during Gibbs sampling.
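The mapping from structure positions to alignment columns can be sketched as follows. This is a minimal illustration, not the code used here: it assumes a precomputed triple alignment given as three gapped strings of equal length, 0-based residue indices in the ungapped structure sequence, and identity normalized by the number of region columns. The function name and the toy sequences are hypothetical.

```python
def active_site_identity(aln_struct, aln_query, aln_hit, region_residues):
    """Fraction of active site region positions identical in query and hit.

    aln_struct, aln_query, aln_hit: gapped rows of one triple alignment.
    region_residues: 0-based indices, in the ungapped structure sequence,
                     of the residues that form the active site region.
    """
    region = set(region_residues)
    columns = []
    pos = -1
    for col, aa in enumerate(aln_struct):
        if aa != "-":
            pos += 1  # position in the ungapped structure sequence
            if pos in region:
                columns.append(col)
    if not columns:
        return 0.0
    matches = sum(1 for c in columns
                  if aln_query[c] == aln_hit[c] and aln_query[c] != "-")
    return matches / len(columns)

# Toy alignment, made up for illustration only.
aln_struct = "MK-TAYIAK"
aln_query = "MKQTAYLAR"
aln_hit = "MR-TAYLAK"
local_identity = active_site_identity(aln_struct, aln_query, aln_hit, [2, 3, 4, 5])
print(local_identity)
```

In this toy case the query and hit are identical at all four region positions even though they differ elsewhere in the alignment, which is the situation the local measure is meant to capture.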

Figure 3.4. Schematic representation of the function prediction method. To make a prediction on the function of an uncharacterized protein sequence, we BLAST it against the subset of Swiss-Prot sequences with known EC number annotation. Then we parse the BLAST hits so that for each candidate EC number we only retain the best hit. Subsequently, we BLAST each of the retained hits against the PDB to find structural representatives for which we know the location of the catalytic site. Then we determine the active site region residues and map them onto the sequence of the structural representative. A triple alignment of the query, the hit and its structural representative allows us to compute local sequence conservation at the active site region.

3.3.3 Size of active site region

To determine the size of the active site region to be considered, we first studied sequence conservation as a function of the distance to the active site. As described above, for a given enzyme with known structure, the active site was represented by the centroid of the annotated active site residues. For each of the 30 enzymatic functions introduced in Chapter 2, we found an experimentally determined representative 3D structure in the PDB. We then aligned all sequences in each of the 30 sets, and used the structure to label the positions in the alignment with their corresponding distance to the active site centroid. Pairwise sequence identity for all residues contained in 2 Å wide regions around the active site was determined as a function of divergence times between species. Subsequently, the average sequence conservation across the 30 enzymatic functions after 4 billion years of divergence was calculated for each 2 Å bin, from 0 to 18 Å around the active site centroid (fig. 3.5). The remaining sequence identity at 4 Byr represents the percentage of slowly mutating residues, whose rate of divergence is probably slowed because of the requirement to maintain function (i.e. these residues are related to the function of the protein either catalytically or through a structural role). This plot illustrates that the residues contained in a region up to 8 Å from the active site centroid are considerably more conserved (≥ 40 %) than residues farther removed in 3D space (≤ 30 % sequence identity) for pairs of enzymes with the same EC number annotation.


Figure 3.5. Sequence identity after 4 billion years of divergence as a function of the distance to the active site. The plot shows the bimodal behavior of enzyme residues. One type of site in an enzymatic sequence is strongly conserved and corresponds to residues that are within 8 Å of the active site. The other type corresponds to sites more distant from the active site, which have more freedom to diverge without affecting molecular function.

Subsequently, we investigated the size of the active site region in terms of the number of residues, for regions with a radius ranging from 3 Å to 8 Å around the centroid of the annotated catalytic residues, for all fully annotated enzymes in the PDB (N = 51,539). We found that regions with a radius of 5 Å are small, with an average number of residues that barely exceeds 7. This close region is not significantly larger than the set of catalytic residues. Therefore, to increase the statistical significance of the computed local sequence conservation, we define the active site region as the set of all residues within a radius of 8 Å from the active site centroid. Such a region contains about 22 amino acids on average (fig. 3.6).


Figure 3.6. Average number of residues within a given radius from the active site. Whether using a single residue of the catalytic site or the centroid of the active site residues, the average number of residues per region at a given distance from the active site remains essentially the same.

We also tried defining the active site region as any residue within a specified distance to any of the catalytic residues. Similarly to the centroid method, the average number of residues at a distance of 8 Å is about 20 (fig. 3.6).
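The two region definitions compared here can be sketched side by side. This is an illustration under stated assumptions, not the code used in this work: each residue is represented by a single coordinate, and the function names and toy coordinates are hypothetical.

```python
import math

def region_by_centroid(coords, catalytic_idx, radius=8.0):
    """Residues within `radius` angstroms of the centroid of the catalytic residues."""
    n = len(catalytic_idx)
    center = tuple(sum(coords[i][k] for i in catalytic_idx) / n for k in range(3))
    return {i for i, c in enumerate(coords) if math.dist(c, center) <= radius}

def region_by_any_residue(coords, catalytic_idx, radius=8.0):
    """Residues within `radius` angstroms of at least one catalytic residue."""
    return {i for i, c in enumerate(coords)
            if any(math.dist(c, coords[j]) <= radius for j in catalytic_idx)}

# Toy coordinates along one axis, made up for illustration only.
coords = [(1.0, 0.0, 0.0), (4.0, 0.0, 0.0), (12.0, 0.0, 0.0),
          (19.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
catalytic = [1, 2]
by_centroid = region_by_centroid(coords, catalytic)
by_any = region_by_any_residue(coords, catalytic)
print(sorted(by_centroid), sorted(by_any))
```

The second definition can include residues that lie near one catalytic residue but far from the centroid, so the two sets need not coincide for an individual enzyme even if their average sizes are similar.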

3.4 GLOBAL vs. LOCAL SEQUENCE CONSERVATION

Following the results shown above, we hypothesized that local sequence conservation would be more informative than global sequence identity to predict enzyme function through the identification of distant orthologs. We further determined that a region with radius 8 Å from the active site centroid is significantly more conserved than the global sequence, and is large enough to capture the substructure of the active site, which by definition should be conserved. A 3D structure representative of the query was necessary in order to localize the residues of interest. Then, it was possible to calculate local sequence conservation at those sites between the query sequence and its closest homologues as found by BLAST.

Next, we tested our hypothesis by applying our method to the genome-wide EC number annotation of a eukaryotic species, Saccharomyces cerevisiae, as well as four bacterial species, Bacillus subtilis, Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus. These organisms were chosen because they vary in size, complexity and extent of knowledge. For instance, S. cerevisiae, E. coli and B. subtilis are all model organisms: they have been extensively studied and their metabolism is rather well understood108-110. Conversely, M. tuberculosis and S. aureus are both dangerous pathogens that necessitate additional study in order to advance drug development efforts. For these 5 species, the structural coverage using our method can be found in table 3.1. Interestingly, even the less well characterized species M. tuberculosis and S. aureus have considerable structural coverage. Using the transitive property of structural conservation, we were able to assign a representative structure to upwards of 90 % of their predicted enzymes of unknown function, in accordance with the results shown in figure 3.2. Consequently, our method is applicable to the whole-genome functional characterization of a wide range of organisms.

Because the method that we propose is probabilistic in nature, the next step in this project was to determine the probabilities associated with different levels of sequence conservation. Furthermore, we had to compare the predictive power of local sequence conservation against that of global sequence conservation. To investigate the effect of taking into account active site region conservation for function prediction, we calculated the probabilities of correctly predicting the function of all the metabolic genes in S. cerevisiae and E. coli together in one set comprising 1,742 genes in total. For this, we used functional annotations from the well-curated models of E. coli (iAF1260) and S. cerevisiae (iLL672) as gold standards. We computed the probability of correctly predicting the EC number annotation against the global sequence identity between uncharacterized query and functionally annotated hits in Swiss-Prot (table 3.2).

Table 3.2. Probability of correct function at different sequence identity levels.

Global Sequence     Probability of
Identity (%)        Correct Function

0-5                 0.000
5-10                0.000
10-15               0.000
15-20               0.000
20-25               0.020
25-30               0.096
30-35               0.143
35-40               0.253
40-45               0.578
45-50               0.623
50-55               0.698
55-60               0.882
60-65               0.960
65-70               1.000
70-75               1.000
75-80               1.000
80-85               1.000
85-90               1.000
90-95               1.000
95-100              1.000
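A table of this kind can be derived from query-hit pairs with known outcomes, as in the sketch below. This is an illustration, not the script used here: it assumes half-open 5 % bins with 100 % identity placed in the last bin, and the function name and toy data are hypothetical and unrelated to the values in table 3.2.

```python
def probability_by_bin(pairs, bin_width=5):
    """Empirical probability of a correct annotation per identity bin.

    pairs: iterable of (global_identity_percent, is_correct) tuples, where
           is_correct says whether the hit's EC number matches the gold standard.
    Returns a dict mapping (bin_start, bin_end) to the fraction of correct pairs.
    """
    totals, hits = {}, {}
    last_bin = 100 // bin_width - 1
    for identity, is_correct in pairs:
        # Bins are [start, start + width); 100 % goes into the last bin.
        start = min(int(identity // bin_width), last_bin) * bin_width
        totals[start] = totals.get(start, 0) + 1
        hits[start] = hits.get(start, 0) + (1 if is_correct else 0)
    return {(b, b + bin_width): hits[b] / totals[b] for b in sorted(totals)}

# Toy data, made up for illustration only.
pairs = [(22, False), (24, True), (42, True), (43, False), (44, True), (100, True)]
table = probability_by_bin(pairs)
for (low, high), p in table.items():
    print(f"{low}-{high}: {p:.3f}")
```

Restricting `pairs` to those above or below a chosen active site identity threshold would give the corresponding conditional probabilities.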

In order to also consider local sequence conservation, we determined the probability of correctly predicting function at different levels of global sequence identity, given the level of local sequence conservation of the active site region (appendix 3.1). Plots of the probability of correct functional prediction against query-hit global sequence conservation for: i) global sequence identity only, ii) global sequence identity given low active site region identity (≤ 50 %), and iii) global sequence identity given high active site region conservation (> 50 %) were made (fig. 3.7).

Figure 3.7. Probability of correct prediction given global sequence identity for Escherichia coli and Saccharomyces cerevisiae. When active site conservation is taken into account, we obtain more robust predictions: if the active site region is highly conserved, the probability of correct prediction shifts to higher values compared to global sequence identity alone. Conversely, if the active site region is more diverged, the probability of correct prediction shifts downwards.

In this figure, we see that accounting for local sequence conservation shifts the probability of correct prediction higher if the active site region is very conserved, and lower if the active site region is more diverged. Consequently, we proposed to use this additional source of evidence to improve the robustness of function predictions made with GLOBUS.

3.5 GLOBAL BIOCHEMICAL RECONSTRUCTION USING SAMPLING (GLOBUS)

Purely homology-based methods are unreliable for making functional assignments for genes with only very distant homologues3,4,6. As noted above (fig. 2.1), for a typical bacterial genome, about 60 % of enzymes have less than 40 % sequence identity to a homologous enzyme in Swiss-Prot. As an alternative, context-based methods may be used to complement global sequence homology and improve functional predictions. In contrast to sequence homology methods, context-based approaches use information on the genomic context of a gene of interest, thus drawing on evolutionary relationships between neighbors in the genome. Context-based methods include gene fusion, gene clustering, mRNA co-expression and phylogenetic profiling. Such methods have successfully been used in the past to uncover functional associations, elucidate molecular pathways and determine macromolecular interaction networks111-114. Therefore, we postulated that combining sequence homology with genomic context information under a unified framework would markedly improve on homology-based predictions. To this end, Global Biochemical Reconstruction Using Sampling (GLOBUS) was previously developed in our lab55. This method was shown to provide some improvement over sequence-based functional prediction schemes.

The goal of GLOBUS is to make probabilistic genome-wide functional assignments for bacterial genomes on the basis of both sequence homology and genomic context. The probabilistic aspect of the method comes from the fact that functional predictions are made by computationally combining several sources of information; therefore there may be several possible functions for a given gene. Defining probabilities allows us to propose the most likely function as well as potential alternative functions for a gene of interest. Another advantage of the method is that functional assignments are made simultaneously for the whole metabolic genome, so that predictions are evaluated in the context of a network (i.e. genes assigned to neighboring locations in the generic metabolic network should have high correlation in terms of genomic context).

3.5.1 Sources of evidence

A. Sequence homology

The starting point of most functional annotation methods is a homology search against a database of functionally characterized sequences, with an algorithm such as FASTA115,116 or BLAST2,117,118. This procedure returns a number of hits which, depending on their degree of sequence identity to the query sequence, are more or less likely to share the function of the query. Therefore, sequence homology is useful to narrow the scope of candidate functions for a given query sequence. However, sequence homology alone may not be sufficient to reliably assign functional annotations to query sequences. For that reason, it is necessary to take into consideration another source of evidence that would help elucidate the true function of the query. Such a source should not be sequence-based, so that it may provide complementary information that sequence homology cannot reveal.

B. Genomic context

Gene context information is not based on sequence homology, and therefore it is a suitable choice of complementary evidence for function prediction. Gene context reveals functional relationships between genes through patterns of co-regulation and co-evolution. In GLOBUS, three types of genomic context methods are applied: gene clustering, phylogenetic correlation and mRNA co-expression.

Gene clustering corresponds to the tendency of genes with related functions to cluster together in the genome. This is very common in bacterial genomes, where sets of functionally related genes form operons. The spatial clustering can be explained by the fact that such functional clusters need to be simultaneously induced or repressed, and are likely to share a promoter119. Here we measured gene clustering by ranking the genes of a given genome according to their order in the genome, and we calculated the gene order difference between pairs of genes (see methods). The null hypothesis states that genes are randomly distributed across the chromosomes. The smaller the gene order distance, the higher the gene clustering correlation coefficient.

Similarly, genes that are functionally related (e.g. parts of the same pathway) will tend to have highly correlated phylogenetic profiles. Specifically, two genes with related functions will tend to either coexist or both be absent across different genomes. The phylogenetic profile of a given gene is a binary vector of length N, where N is the number of genomes that may or may not contain an orthologue of the query. The ith element (i ∈ [1,N]) of the vector is 1 if an orthologue of the query exists in genome Gi, and 0 otherwise. We determine the similarity of the phylogenetic profiles of two genes by calculating the probability of observing these profiles using the hypergeometric distribution.
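The comparison of two phylogenetic profiles can be sketched as follows. This is a minimal illustration, assuming that the similarity is scored as the hypergeometric upper-tail probability of seeing at least the observed number of shared genomes; the text above does not specify whether a tail or a point probability was used, and the function name is hypothetical.

```python
from math import comb


def profile_cooccurrence_pvalue(profile_a, profile_b):
    """Hypergeometric P(X >= k) for two binary phylogenetic profiles.

    k is the number of genomes containing orthologues of both genes.
    A small value means the two genes co-occur more often than expected
    if their presence across genomes were independent.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    n_genomes = len(profile_a)
    n_a = sum(profile_a)  # genomes containing gene A
    n_b = sum(profile_b)  # genomes containing gene B
    k = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    total = comb(n_genomes, n_b)
    # Sum the probabilities of observing k, k+1, ... shared genomes.
    tail = sum(comb(n_a, i) * comb(n_genomes - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return tail / total
```

With six genomes, two identical profiles that each contain three ones give a probability of 1/20, whereas two fully complementary profiles give 1.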

Finally, the third genomic context method used in GLOBUS, mRNA co-expression, corresponds to whether pairs of genes have similar expression levels across different genomes and different conditions. The rationale for this method is that if one gene is up- or down-regulated as a result of a specific stimulus, genes that are part of the same functional group should react in the same way. For each condition and genome, the genes are ranked according to their level of expression. We compute the mRNA expression correlation by comparing the ranks of gene pairs with Spearman’s rank correlation.
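As an illustration of the rank-based comparison, the sketch below computes Spearman's rank correlation for two expression vectors in plain Python. It assumes that ties receive their average rank and that each vector holds the expression values of one gene across the same series of conditions; both function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (var_x * var_y) ** 0.5
```

Two genes whose expression rises and falls together across conditions score close to 1, and genes with opposite behavior score close to -1.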

3.5.2 The generic metabolic network

We defined a generic metabolic network containing the enzymatic activities of the Enzyme Commission system that involve small molecule substrates. We could then use this generic network as a starting point to build organism-specific metabolic networks. All nodes of the network are enzymatic activities (i.e. EC numbers), and the edges connecting two nodes are shared metabolites between these two activities (i.e. the product of one reaction is the substrate of another). We disregarded any activity that did not involve a small molecule substrate; for example “RNA ” or “” were not made part of the generic network. We also removed all edges based on the top 40 most common ligands or cofactors, such as H2O and ATP, because they are likely to be common to many different reactions which do not necessarily have a similar function. The resulting generic network includes 4,049 enzymatic activities. The distribution of Enzyme Commission system classes represented in the generic network is shown in table 3.3, and it is similar to the distribution of the functional classes of Swiss-Prot and BRENDA (fig. 1.7).

Table 3.3. Distribution of the enzymatic activities of the generic network.

Class    Frequency
EC1      0.33
EC2      0.29
EC3      0.17
EC4      0.12
EC5      0.05
EC6      0.04
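The construction of the generic network described in this section can be sketched as follows. This is an illustrative reading of the text, not the code used in this work: the input layout, the function name and the rule that "most common" means "participating in the most activities" are assumptions, and the toy activities and metabolites are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_generic_network(reactions, n_common=40):
    """Link activities where a product of one is a substrate of the other.

    reactions: dict mapping an activity label to (substrates, products).
    Metabolites among the n_common most widely used are ignored when
    drawing edges. Returns a dict {(activity_a, activity_b): shared metabolites}.
    """
    usage = Counter()
    for substrates, products in reactions.values():
        usage.update(set(substrates) | set(products))
    # Rank by usage, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(usage, key=lambda m: (-usage[m], m))
    common = set(ranked[:n_common])
    edges = {}
    for act_a, act_b in combinations(sorted(reactions), 2):
        sub_a, prod_a = (set(x) for x in reactions[act_a])
        sub_b, prod_b = (set(x) for x in reactions[act_b])
        shared = ((prod_a & sub_b) | (prod_b & sub_a)) - common
        if shared:
            edges[(act_a, act_b)] = shared
    return edges


# Hypothetical toy network: cof_x and cof_y play the role of common cofactors.
TOY = {
    "act_1": (["A", "cof_x"], ["B", "cof_y"]),
    "act_2": (["B", "cof_x"], ["C", "cof_y"]),
    "act_3": (["cof_y", "D"], ["E", "cof_x"]),
}
```

In the toy input, keeping the cofactors would connect all three activities, whereas excluding the two most common metabolites leaves only the edge through the shared metabolite B.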

3.5.3 The fitness function

The next step was to reconstruct the metabolic network for a given genome by finding the highest probability location of every gene in the generic network. In order to do this, we used Gibbs sampling to sample from all candidate locations for every gene. We defined a Markov-like fitness function (eqn. 1) that we minimized in order to find the best solution to the network reconstruction problem.

$$E(g_1, g_2, \dots, g_n) = -b_{\text{homology}} f_{\text{homology}} - b_{\text{orthology}} f_{\text{orthology}} - b_{\text{context}} f_{\text{context}} - b_{\text{EC co-occurrence}} f_{\text{EC co-occurrence}} + b_{\text{non-metabolic}} f_{\text{non-metabolic}} \tag{1}$$

where $g_1, g_2, \dots, g_n$ are all the genes in a given network, f variables represent homology and context-based functional descriptors and b constants are the weights associated with these descriptors. We used two descriptors for homology. The first one corresponds to the global sequence identity to the best hit in Swiss-Prot with a given functional annotation. The second homology-based descriptor is binary, and its value is 1 if an orthologue in another species is annotated with that same enzymatic activity. We used three context-based descriptors: gene clustering, phylogenetic profile and mRNA co-expression of a given gene and genes assigned at neighboring positions in the network. We also accounted for the possibility that a gene with very low homology to any Swiss-Prot enzyme or low context correlation to any location may not be in the network with the $b_{\text{non-metabolic}} f_{\text{non-metabolic}}$ term. From the fitness function we derived an equation for the probability of a given whole-network functional assignment (eqn. 2).

$$P(g_1, g_2, \dots, g_n) = \frac{1}{Z} \times e^{-E(g_1, g_2, \dots, g_n)} \tag{2}$$

where Z is a normalizing partition factor to ensure that the probabilities sum to 1.

3.5.4 Gibbs sampling

We used Gibbs sampling, a version of Markov Chain Monte Carlo, to sample from all possible functional assignments for every gene proportionally to their likelihood. Our goal was to find an optimized set of parameters that maximizes the probability of the overall network (i.e. minimizes the fitness function). This is a vast combinatorial problem; however, we reduced the sampling space by restricting candidate locations for a given gene to the activities of its closest BLAST hits in Swiss-Prot. Also, the pre-computed probabilities of a gene having a certain function given each descriptor allowed us to concentrate the search on higher probability candidate functions.

The process of Gibbs sampling starts by assigning every gene to one of its candidate locations at random. Candidate locations correspond to the enzymatic activities of the homologues in Swiss-Prot. At each step a gene is placed at random at one of its candidate locations and we then evaluate the fitness function (fig 3.8). GLOBUS allows us to only sample the regions of highest likelihood with Gibbs sampling because only the function of enzymes that are homologous to a query sequence will be considered as candidate functions. That is, each gene can only be assigned to a limited number of sites in the network.
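The sampling procedure can be illustrated with the toy sketch below. It is a generic Gibbs sampler written for this illustration, not the GLOBUS implementation: it assumes that each gene is resampled in turn from its conditional distribution, with the probability of a location proportional to the exponential of minus the fitness function, and that assignment probabilities are estimated as the fraction of sweeps spent at each location. All names, weights and descriptor values are hypothetical.

```python
import math
import random


def gibbs_sample(candidates, energy, n_sweeps=200, burn_in=50, seed=0):
    """Estimate per-gene location probabilities by Gibbs sampling.

    candidates: dict gene -> list of candidate locations.
    energy: function taking the full assignment dict (lower is better).
    """
    rng = random.Random(seed)
    # Start from a random assignment of every gene to one of its candidates.
    assignment = {g: rng.choice(locs) for g, locs in candidates.items()}
    counts = {g: {loc: 0 for loc in locs} for g, locs in candidates.items()}
    for sweep in range(n_sweeps):
        for gene, locs in candidates.items():
            # Energy of the whole network for each candidate location of this gene.
            energies = []
            for loc in locs:
                assignment[gene] = loc
                energies.append(energy(assignment))
            e_min = min(energies)  # subtract the minimum for numerical stability
            weights = [math.exp(-(e - e_min)) for e in energies]
            assignment[gene] = rng.choices(locs, weights=weights, k=1)[0]
        if sweep >= burn_in:
            for gene, loc in assignment.items():
                counts[gene][loc] += 1
    kept = n_sweeps - burn_in
    return {g: {loc: c / kept for loc, c in locs.items()}
            for g, locs in counts.items()}


# Hypothetical toy problem: two genes, three network locations.
CANDIDATES = {"geneA": ["loc_1", "loc_2"], "geneB": ["loc_2", "loc_3"]}
HOMOLOGY = {("geneA", "loc_1"): 0.9, ("geneA", "loc_2"): 0.3,
            ("geneB", "loc_2"): 0.4, ("geneB", "loc_3"): 0.5}
NEIGHBORS = {frozenset(("loc_1", "loc_2"))}  # locations sharing a metabolite
CONTEXT = 0.8  # context correlation between geneA and geneB


def toy_energy(assignment):
    """Homology term plus a context term for genes at neighboring locations."""
    e = -5.0 * sum(HOMOLOGY[(g, loc)] for g, loc in assignment.items())
    if frozenset((assignment["geneA"], assignment["geneB"])) in NEIGHBORS:
        e -= 2.0 * CONTEXT
    return e


PROBS = gibbs_sample(CANDIDATES, toy_energy)
```

In this toy setting geneA is placed at loc_1 in most sweeps because of its high homology score, and the context term pulls geneB towards the neighboring loc_2 even though its homology to loc_3 is slightly higher.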


Figure 3.8. Illustration of GLOBUS. A) We define a generic metabolic network where the nodes represent metabolic activities (i.e. reactions), and the edges represent shared metabolites between two reactions. B) For a given genome, we assign gene products to locations in the generic network using Gibbs sampling. We use sequence homology to identify candidate locations in the network for all genes. Then, we use gene-gene context correlations to evaluate whole-genome metabolic assignments with a fitness function. The output of this process is a set of probabilities that a gene belongs in any of its possible locations, corresponding to the fraction of times the gene was successfully placed at each location during the simulation.

We learn the set of parameters (or weights) that should be assigned to each descriptor to optimize probabilities of correct functional assignments by training on a well-annotated metabolic network (e.g. S. cerevisiae iLL672, E. coli iAF1260, see appendix 3.2). This optimal set of parameters could then be applied to the reconstruction of any bacterial genome.

3.6 STRUCTURAL GLOBUS (3D-GLOBUS)

We could integrate the structural component into GLOBUS in one of two ways: either by adding a term to the objective function that corresponds to active site region conservation, or by combining global and local sequence conservation into a single term. Then we could re-train on a well-annotated genome in order to learn a new set of weights optimized for each source of evidence (i.e. global and local homology, genomic context). We opted to define a single sequence conservation term encompassing both global and local sequence homology. To that end, we computed the probability of correct functional prediction at a certain global sequence conservation level, given the local sequence identity score (appendix 3.1). We then proceeded to compare the performance of 3D-GLOBUS with the performance of GLOBUS, and of global sequence conservation alone.

We applied 3D-GLOBUS, GLOBUS and global sequence homology to assign functional annotations to the gene products of B. subtilis, E. coli, M. tuberculosis, S. cerevisiae and S. aureus. In order to compare the performance of these different methods, for any given gene we only retained the functional annotation with the highest score (probability for GLOBUS and 3D-GLOBUS; global sequence identity for the global sequence homology method). Subsequently, for each method we ranked all the genes from the five species listed above according to their score. Next, we made precision-recall curves for all three methods (fig. 3.9).
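The ranking procedure described above can be sketched as follows. This is a minimal illustration, assuming that precision is the fraction of correct predictions among the genes ranked at or above a given score and that recall is the fraction of known genes recovered; the function name and input layout are hypothetical, and tied scores are not given special treatment.

```python
def precision_recall_curve(scored_predictions, n_known):
    """Build a precision-recall curve from the top prediction of each gene.

    scored_predictions: list of (score, is_correct), one entry per gene.
    n_known: number of genes with a known annotation.
    Returns a list of (score_threshold, precision, recall) tuples.
    """
    ranked = sorted(scored_predictions, key=lambda p: p[0], reverse=True)
    curve = []
    true_positives = 0
    for n_called, (score, is_correct) in enumerate(ranked, start=1):
        true_positives += bool(is_correct)
        curve.append((score, true_positives / n_called, true_positives / n_known))
    return curve


# Hypothetical scores for four genes, three of which are predicted correctly.
EXAMPLE_CURVE = precision_recall_curve(
    [(0.9, True), (0.8, True), (0.6, False), (0.4, True)], n_known=4)
```

The same function applies to all three methods, since only the meaning of the score differs (probability for GLOBUS and 3D-GLOBUS, global sequence identity for the homology method).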

Figure 3.9. Precision and recall performance for 5 species. Using known annotations of metabolic genes for S. cerevisiae, B. subtilis, E. coli, M. tuberculosis and S. aureus, we compared the performance of sequence homology, GLOBUS and 3D-GLOBUS in predicting function. Precision and recall were computed at different thresholds of correct prediction probability. For gene products with less than 40 % sequence identity to known proteins: A) fractional increase in precision at set recall for 3D-GLOBUS relative to GLOBUS; B) fractional increase in recall at set precision for 3D-GLOBUS relative to GLOBUS.


We found that across all 5 species, the information added by local sequence conservation improves both precision and recall. This improvement is even more marked for the functional prediction of genes with low sequence homology to all characterized enzymes (fig. 3.9, D, E, F). Both GLOBUS (red line) and 3D-GLOBUS (green line) performed significantly better than the sequence homology method (blue line). 3D-GLOBUS also shows an increase in precision and recall compared to GLOBUS. The improvement in precision and recall of 3D-GLOBUS over GLOBUS is particularly noticeable for genes with less than 40 % sequence identity to the closest homologue in Swiss-Prot. For this subset, 3D-GLOBUS shows a 36 % increase in precision over GLOBUS (precision of 0.6 and 0.44 respectively). Similarly, we observe a 40 % increase in recall over GLOBUS (recall of 0.35 and 0.25 respectively). These results are encouraging because, as mentioned above, the majority of newly sequenced bacterial gene products fall in that region of low global sequence homology to any functionally characterized protein. Therefore, 3D-GLOBUS is a useful tool for the genome-wide functional annotation of bacterial species. Comparing the precision of predictions made with 3D-GLOBUS (red line) and GLOBUS (blue line) at different probability scores shows that 3D-GLOBUS has better precision, especially at lower values of correct functional assignment probability (fig. 3.9, A). Similarly, 3D-GLOBUS displays better recall than GLOBUS at a given probability score (fig. 3.9, B). We can infer from these results that the addition of a structural component to GLOBUS allows incorrect predictions to be shifted to lower probabilities (hence the improved precision). Likewise, accounting for local sequence conservation shifts correct predictions to higher probabilities, so that 3D-GLOBUS recall is increased when compared to GLOBUS recall (fig. 3.10).
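For reference, the percentage increases quoted above are consistent with relative differences computed from the reported precision and recall values. The short calculation below assumes that this is how the increases were defined:

```latex
\[
\frac{0.60 - 0.44}{0.44} \approx 0.36
\qquad\text{and}\qquad
\frac{0.35 - 0.25}{0.25} = 0.40
\]
```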


Figure 3.10. Precision and recall vs. probability for 5 species. For the known genes of S. cerevisiae, B. subtilis, E. coli, M. tuberculosis and S. aureus, we determined A) the precision of functional prediction as a function of probability, and B) the recall of known genes as a function of probability. The probabilities are from GLOBUS (blue line) and 3D-GLOBUS (red line) functional assignments.

We showed that taking into account local sequence conservation via the use of a structural template to identify active site location adds information that allows us to make robust functional predictions. 3D-GLOBUS performs remarkably better than sequence-based homology methods when no close homologue can be found in Swiss-Prot. 3D-GLOBUS also performs significantly better than GLOBUS for genes with less than 40 % sequence identity to a homologue in Swiss-Prot. These results are significant because they validate the use of 3D-GLOBUS for the genome-wide functional annotation of bacterial metabolic genes.

3.7 COMPARISON TO OTHER METHODS

After assessing the performance of 3D-GLOBUS in the prediction of enzyme function using global and local sequence conservation and gene context information, we were interested in comparing this performance to that of other automated protein annotation systems. We turned to the Critical Assessment of Functional Annotation (CAFA) Challenge that was first held from September 2010 to December 2011120 to select state of the art algorithms to compare to 3D-GLOBUS. In this experiment, the participants were tasked with predicting Gene Ontology (GO) term annotations for 48,298 uncharacterized sequences. A time lapse of 11 months was given to allow accumulation of experimental functional annotations for some of these proteins. At the end of this delay, 866 proteins from 11 species (A. thaliana, S. pneumoniae, S. cerevisiae, X. laevis, H. sapiens, M. musculus, R. norvegicus, D. discoideum, P. aeruginosa and E. coli K-12) had been functionally annotated. This was the basis for a target set on which all 54 submitted algorithms were evaluated. This set was analyzed to reduce bias in terms of taxonomic and functional coverage. The methods predicted both Molecular Function and Biological Process GO terms and they were evaluated on their ability to predict both functional categories. The submitted algorithms were ranked according to the F-measure (F-max), which corresponds to the maximum harmonic mean of precision and recall over the full range of sensitivity (eqn. 1).

$$F_{\max} = \max_{t} \left\{ \frac{2 \cdot pr(t) \cdot rc(t)}{pr(t) + rc(t)} \right\} \tag{1}$$

where pr(t) is the precision and rc(t) is the recall at a fixed threshold t. Molecular Function GO terms map relatively well to EC numbers, so we were interested in the methods that scored the best in predicting Molecular Function. The best performing method with a freely available algorithm is Argot2, with Fmax = 0.59. The specifics of this method are described in the following section.
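The F-max definition above translates directly into a few lines of Python. The sketch below assumes that precision and recall have already been computed at each threshold; the function name and input layout are hypothetical.

```python
def f_max(precision_recall_pairs):
    """Maximum harmonic mean of precision and recall over all thresholds.

    precision_recall_pairs: iterable of (precision, recall), one per threshold.
    """
    best = 0.0
    for precision, recall in precision_recall_pairs:
        if precision + recall > 0:  # the harmonic mean is undefined at (0, 0)
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For instance, for three thresholds with (precision, recall) of (1.0, 0.25), (0.75, 0.75) and (0.5, 1.0), the harmonic means are 0.4, 0.75 and about 0.67, so F-max is reached at the second threshold.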

3.7.1 Argot2

Argot2 is a web server for the functional annotation of single or multiple genes. It performs functional prediction by clustering the GO terms assigned to the best BLAST hits on the basis of semantic similarities121. Furthermore, larger weights are given to the GO terms associated with better hits (i.e. closer homologues to the query at the sequence level). The method outputs a total score for each candidate location based on information content (how specific the GO term is) and GO term weight (how often it occurs across BLAST hits). Because Argot2 maps to GO terms, and GO terms map to just under 3,700 distinct EC numbers, the coverage of enzymatic functions by Argot2 is lower than that of 3D-GLOBUS (4,049 EC numbers). Furthermore, in contrast to GLOBUS, functional assignments are made for each gene separately rather than genome-wide. That is, the functional assignment of a given gene is not affected by the functional assignment of other genes.

3.7.2 EFICAz: Enzyme function inference by a combined approach

EFICAz2.5 is another computational function prediction method that extracts sequence-related features to predict protein function. It combines predictions from 6 different sequence-related features: family-specific functionally discriminating residues (FDRs), family-specific global sequence identity threshold, PROSITE patterns, enzyme family classification by support vector machine (SVM), and PFAM classification by SVM54,122. EFICAz2.5 predicts a 3-digit or 4-digit EC number for a query sequence. The consensus prediction across the 6 components is the one that is accepted, since there is no integrative score. EFICAz has the lowest EC number coverage since it is currently able to recognize 2,757 distinct EC numbers.

3.7.3 Comparison of Argot2 and EFICAz2.5 to 3D-GLOBUS

The performance of Argot2, EFICAz2.5 and 3D-GLOBUS was evaluated and compared in the task of predicting the function of B. subtilis (curated genome) and M. tuberculosis (less well-known genome) enzymes. Then, the precision and recall for both E. coli and M. tuberculosis enzymatic genomes were determined. As can be seen in figure 3.11 (A and B), 3D-GLOBUS performs the best in the genome-wide functional assignment of both genomes. Looking at the precision at set recall (fig. 3.11, C and E) and recall at set precision (fig. 3.11, D and F), we find that 3D-GLOBUS outperforms the other methods more strikingly when considering enzymes with less than 40 % sequence identity to any functionally characterized sequence. This is significant because the majority of bacterial sequences uncovered by metagenomic studies fall in that category. Argot2 predicts GO terms instead of EC numbers, and even though it is a more sophisticated method than global sequence identity alone, it is based on BLAST sequence homology results, which may explain why Argot2 performs worse than 3D-GLOBUS. As for EFICAz2.5, it has lower coverage of EC numbers (30 % fewer EC numbers than GLOBUS). Also, neither Argot2 nor EFICAz2.5 performs whole-genome functional assignments. In addition to having more sources of evidence for functional prediction than the other methods, 3D-GLOBUS also has increased accuracy of prediction because all predictions are made in the context of a network, and the whole network is optimized in the final solution.


Figure 3.11. Comparison of the performance of 3D-GLOBUS, Argot2 and EFICAz2.5. The proteomes of E. coli and M. tuberculosis were submitted for genome-wide functional assignment using 3D-GLOBUS, global sequence homology, Argot2 and EFICAz2.5. Precision-recall curves were built for each method in order to compare their performance. A) Precision-recall curves for B. subtilis functional assignments. B) Precision-recall curves for M. tuberculosis functional assignments. C) Precision at 90 % recall for the B. subtilis genome as a function of global sequence identity to the best BLAST hit. D) Recall at 70 % precision for the B. subtilis genome as a function of global sequence identity to the best BLAST hit. E) Precision at 90 % recall for the M. tuberculosis genome as a function of global sequence identity to the best BLAST hit. F) Recall at 70 % precision for the M. tuberculosis genome as a function of global sequence identity to the best BLAST hit.

3.8 DISCUSSION

Current computational methods for functional prediction have limitations, especially when considering enzymes that have very low sequence and structure homology to well-annotated enzymes. We hypothesized that, between enzymes with the same function, the active site region is more conserved than the global sequence. From this premise we have developed a structure-based function prediction method where we employ active site conservation in addition to global sequence homology for functional characterization. We then integrated this method with a probabilistic whole-genome function prediction framework, GLOBUS. The original version of GLOBUS uses sampling of probability space to assign functions to all putative metabolic genes in an input genome by considering sequence homology to known enzymes, gene-gene context and EC co-occurrence.

We showed that taking into account local sequence conservation via the use of a structural template to identify active site location adds information that allows us to make robust functional predictions. 3D-GLOBUS performs remarkably better than sequence-based homology methods when no close homologue can be found in Swiss-Prot. 3D-GLOBUS also performs better than GLOBUS for genes with less than 40 % sequence identity to a homologue in Swiss-Prot. Namely, 3D-GLOBUS shows a 36 % increase in precision over GLOBUS (precision of 0.6 and 0.44 respectively). Similarly, we observe a 40 % increase in recall over GLOBUS (recall of 0.35 and 0.25 respectively). 3D-GLOBUS also outperforms Argot2 and EFICAz2.5, which were both considered top methods in the field54,120,119. These results are significant because they validate the use of 3D-GLOBUS for the genome-wide functional annotation of bacterial metabolic genes.

Because the sequencing of bacterial genomes drives the rapid growth of sequence databases, we propose 3D-GLOBUS as a fast and reliable first step for genome-wide functional annotation, to be followed by manual curation and experimental validation.

Chapter 4 - RECONSTRUCTION OF THE MYCOBACTERIUM TUBERCULOSIS METABOLIC NETWORK

4.1 MYCOBACTERIUM TUBERCULOSIS AND DISEASE

Following the validation of our method’s performance on various genomes ranging from well studied to partially characterized, we decided to deepen our analysis of the predictions for a very important human pathogen, Mycobacterium tuberculosis, which is responsible for 1.5 million deaths annually123. This bacterium is a major public health concern; however, studying it in the lab is difficult and requires Biosafety Level 3 facilities to minimize the danger of accidental exposure.

The genus Mycobacterium includes virulent pathogens responsible for infections such as tuberculosis124, leprosy125 and maybe even Crohn’s disease126. The bacteria responsible for tuberculosis in different eukaryotic hosts (M. tuberculosis, M. africanum, M. bovis, M. caprae, M. microti, M. pinnipedii) form what is called the Mycobacterium Tuberculosis Complex and their genomes exhibit more similarity than other members of the genus (fig. 4.1).

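[Figure 4.1 image. Genomes shown, top to bottom: M. tuberculosis H37Rv, M. bovis AF2122/97, M. bovis BCG str. Pasteur 1173P2, M. africanum GM041182, M. canettii CIPT 140010059, M. avium 104, M. smegmatis str. mc2 155. A bracket labeled MTBC marks the Mycobacterium Tuberculosis Complex genomes.]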

Figure 4.1. Whole genome alignment of mycobacterial species. The whole genomes of M. tuberculosis H37Rv, M. bovis AF2122/97, M. bovis BCG Pasteur 1173P2, M. africanum GM041182, M. canettii CIPT 140010059, M. avium 104 and M. smegmatis str. mc2 155 were aligned with the software MAUVE. The first five genomes, enclosed in brackets, are members of the Mycobacterium Tuberculosis Complex (MTBC), meaning that these species are known to cause tuberculosis in human and animal hosts. M. avium can cause non-pulmonary infections in birds and mammals. M. smegmatis str. mc2 155 is a non-pathogenic relative of the other six species. The colored boxes outline regions of local collinearity, and the colored lines inside the boxes represent levels of sequence similarity. The lines connect homologous regions. We observe genomic rearrangements in the M. avium genome relative to the MTBC genomes, where gene order is mostly conserved.

M. tuberculosis was identified by the German scientist Robert Koch in 1882, but has been around for thousands of years. It has been postulated that M. tuberculosis first started affecting humans as a result of cattle domestication about 10,000 years ago. However, some new evidence suggests that humans may have been subject to tuberculosis infections for the past 50,000 years124. Nowadays, the infection affects people worldwide, but the majority of the casualties are in developing countries, as can be seen in the WHO map from 2011 (fig. 4.2).

Figure 4.2. Estimated tuberculosis incidence rate in 2012. This map shows the worldwide incidence of tuberculosis in 2012. The heaviest burden is in developing countries; however, the disease occurs everywhere (reproduced from the World Health Organization).

A vaccine was developed by French doctors Camille Guérin and Albert Calmette in the early twentieth century from the attenuated bovine-targeting bacterium M. bovis, but it has had mixed results, especially in developing countries where the disease burden is the highest127-129.

Because of its unusual overall structure, a lot remains to be elucidated about M. tuberculosis. First, it has an atypical glycolipid-rich cell wall that makes the cell very impermeable. The mycobacterial cell wall contains unusual lipids such as mycolic acids and lipoarabinomannan (LAM) on top of a thin layer of peptidoglycan130. It has been said that up to 30 % of M. tuberculosis genes may be devoted to lipid production and metabolism131. The thick waxy cell wall constitutes a first line of defense of the bacterium against therapeutics. M. tuberculosis also boasts an array of efflux mechanisms carried out by ATP transporters for active efflux of drugs132. Additionally, M. tuberculosis has a slow doubling time when compared to other bacteria, which appears to be a hallmark of pathogenic Mycobacteria and makes it somewhat harder to study in vitro133. Also, the bacterium can exist in two distinct states: active or dormant. During persistent infection, M. tuberculosis exists in a non-replicating, slow-growing state. Little is known about the specifics of M. tuberculosis behavior in the dormant state. Some studies have attempted to shed light on metabolic processes during dormancy by simulating hypoxic, low nutrient and/or acid stressed environments134,135. One finding that emerged from these studies is that the bacterium strongly downregulates genes associated with energy metabolism while upregulating lipid degradation and fatty acid metabolism (i.e. β-oxidation)136.

When harboring the bacteria in the dormant state, an infected individual will not show symptoms, so the infection will go untreated. The WHO has estimated that about a third of the world population carries the bacteria in a latent state, so that latent infection acts as a ticking time bomb123. Also, due to a long period of co-evolution with humans, M. tuberculosis has developed the ability to hijack the immune system by being taken up by macrophages of the lungs and remaining inside in order to escape drug treatment and the immune response. It is able to survive within the macrophage by using fatty acids or cholesterol as a substrate for growth, and prevents macrophages from maturing and destroying it137,138.

A number of anti-tuberculosis drugs have been developed since the 1940s; notably, isoniazid, pyrazinamide, ethambutol and rifampicin which are still in use today as first line of treatment against non-resistant TB123. The majority of them target some aspect of metabolism, and some of these substances (e.g. streptomycin and rifampicin) are derived from soil-dwelling microbes so that non-pathogenic bacteria which share a niche with these are not susceptible to these antibacterial compounds139. Specifically, ethambutol and isoniazid affect cell wall biosynthesis, while pyrazinamide inhibits fatty acid metabolism140. As for rifampicin, it inhibits the transcription mechanism through inhibition of bacterial DNA-dependent RNA polymerase140.

Therefore, the first-line drugs are only effective when cells are actively dividing. To prevent the emergence of drug-resistant bacteria, first-line treatment involves administering a cocktail of drugs for an extended period of time. The increased prevalence of drug resistance is in part a consequence of the 5-fold decrease in the rate of development and approval of new antibiotics in the last twenty years141,142. Shortage of anti-tuberculosis drugs is also a common occurrence143,144. Another cause of drug resistance is that dormant bacterial cells escape drug therapy, and can activate after treatment is terminated. Moreover, many affected patients fail to comply with treatment for the recommended duration, so that more aggressive drug-resistant forms of M. tuberculosis have emerged145. Co-morbidity of tuberculosis with HIV is also a major problem146. Since the early 1990s, the World Health Organization (WHO) has declared tuberculosis a global emergency147.

Good and complete annotations of the M. tuberculosis genome are lacking. We can use 3D-GLOBUS probabilities to curate and update the previously published metabolic network. The new model could then be used to test hypotheses (such as gene essentiality, nutrient consumption, etc.) and to identify new drug targets.

4.2 MATERIALS AND METHODS

4.2.1 Probabilistic annotation of the M. tuberculosis genome

To re-annotate M. tuberculosis metabolic genes we obtained the protein sequences assigned to strain H37Rv from the KEGG database148. The protein sequences were then probabilistically annotated with Enzyme Commission15 (EC) numbers using the GLOBUS algorithm55. Briefly, GLOBUS uses Gibbs sampling to find the probability of a gene having one of a set of possible enzymatic functions given a combination of sequence identity and genomic context correlation data. Homology-based descriptors for a given gene included: sequence homology to the top BLAST hit against Swiss-Prot enzymes, sequence homology calculated only for residues predicted to be in the enzymes’ active sites, and the presence of annotated orthologs (bi-directional best hits) in the Swiss-Prot database. Context-based data included phylogenetic profile correlations, chromosomal gene clustering, gene co-expression, and EC number co-occurrence55. The GLOBUS algorithm is based on the idea that genes are more likely to be assigned to their correct metabolic function if they have high sequence similarity to enzymes known to have such function and also if they have strong functional correlations with genes annotated to neighboring enzymes in the metabolic network, i.e. genes annotated to enzymes sharing metabolites.
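The bi-directional best hit criterion mentioned above can be sketched in a few lines. This is a generic illustration that assumes the top-scoring hit of every gene has already been extracted from the two reciprocal searches; the function name, argument names and gene labels are hypothetical.

```python
def bidirectional_best_hits(best_hit_in_b, best_hit_in_a):
    """Return the (gene_a, gene_b) pairs that are each other's best hit.

    best_hit_in_b: dict mapping each genome A gene to its top hit in genome B.
    best_hit_in_a: dict mapping each genome B gene to its top hit in genome A.
    """
    return {(a, b) for a, b in best_hit_in_b.items()
            if best_hit_in_a.get(b) == a}


# Hypothetical gene labels: a2 also hits b1, but b1's best hit is a1.
EXAMPLE_PAIRS = bidirectional_best_hits(
    {"a1": "b1", "a2": "b1", "a3": "b3"},
    {"b1": "a1", "b3": "a2"})
```

Only the reciprocal pair (a1, b1) is retained; one-directional hits such as a2 to b1 are discarded.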

Context correlations were calculated as described55 using data from KEGG. Gene expression data used to calculate co-expression were obtained from the GEO database149, with accession numbers: GSE10391, GSE11096, GSE1642, GSE16811, GSE21114, GSE3201, GSE3999, GSE8786, GSE8839 and GSE9331.

4.2.2 Structural representations

In order to investigate the veracity of functional predictions made for 7 candidate genes of unknown function in Mycobacterium tuberculosis, homology models were built for each of them using a sequence homologue with the candidate enzymatic function and an experimentally determined 3D structure. The models were retrieved from ModBase or built with the homology modeling pipeline ModPipe, which uses PSI-BLAST for template selection and MODELLER for model building150. The quality of the models was evaluated through metrics such as sequence identity, E-value, GA341, MPQS and z-DOPE scores.

Then, in order to simulate molecular interactions between the model and its predicted ligand, we first needed to place the ligand in the homology model. To this end, we overlaid the structures of the template and of the model so as to obtain the coordinates of the ligand in the model structure. Subsequently, we used PyMol to visualize the relative position of the active site region and of the ligand, and to determine which residues of the active site region are in a position to form polar interactions with atoms of the ligand. This also allowed us to survey the level of conservation of active site region residues in the query sequence relative to the template.
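The overlay step described above can be illustrated with a minimal sketch. It assumes that matched C-alpha coordinates are available for equivalent residues of the template and of the model, and it uses a Kabsch least-squares superposition written with NumPy rather than the PyMol-based procedure used in this work. The function names and all coordinates are hypothetical and serve only as an illustration.

```python
import numpy as np


def kabsch(mobile, target):
    """Return rotation R and translation t such that mobile @ R.T + t best fits target."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    # 3 x 3 covariance matrix of the two centered coordinate sets
    H = (mobile - mobile_center).T @ (target - target_center)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that R is a proper rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = target_center - R @ mobile_center
    return R, t


def transfer_ligand(template_ca, model_ca, template_ligand):
    """Map ligand coordinates from the template frame into the model frame."""
    R, t = kabsch(template_ca, model_ca)
    return template_ligand @ R.T + t


# Hypothetical template C-alpha coordinates (Angstrom), not taken from any real structure
template_ca = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
    [0.0, 3.8, 3.8],
    [1.0, 2.0, 5.0],
])

# Build a "model" that is the template rotated by 30 degrees about z and shifted
angle = np.deg2rad(30.0)
rot = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
shift = np.array([10.0, -5.0, 2.0])
model_ca = template_ca @ rot.T + shift

# Hypothetical ligand atoms in the template frame
template_ligand = np.array([[1.0, 1.0, 1.0], [2.0, 1.5, 0.5]])
ligand_in_model = transfer_ligand(template_ca, model_ca, template_ligand)
```

In practice the matched residue pairs would come from the template-query sequence alignment, and the fit would not be exact as it is in this constructed example.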

4.2.3 Flux balance analysis

For a given organism, the metabolic network can be represented by the combination of a stoichiometry matrix S, which contains the stoichiometric coefficients of all metabolites involved in metabolism, and a flux vector v. Flux balance analysis (FBA) is a mathematical method to simulate metabolism in genome-scale reconstructed metabolic networks (fig. 4.3).

We simulated gene deletions with the COBRA Toolbox151 and determined how deleting each gene would affect growth (gene essentiality analysis). The current gold standard for gene essentiality is a Transposon Site Hybridization (TraSH) study by Sassetti et al.152. TraSH is the equivalent of a high-throughput gene knockout experiment in which a single deleterious mutation is introduced into each cell of a bacterial colony. After a period of time, the pool of cells is analyzed to determine which cells survived and grew, which did not, and the identity of the mutated gene in each case.

An FBA problem can be summarized by the following equation:

$$S \cdot v = b \tag{1}$$

where S is the stoichiometry matrix, v is the vector of fluxes through the network and b is the vector of rates of change in the concentrations of all metabolites. At steady state, when there is no accumulation of metabolites, $b = 0$. Because the number of reactions typically exceeds the number of metabolites, many flux vectors v satisfy equation (1); we therefore select among them by imposing the requirement to maximize an objective function (e.g. biomass production, ATP production). This considerably narrows the space of possible solutions.
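As a minimal sketch of the optimization described above, the following example solves a three-reaction toy network with SciPy's linear programming routine. The network, the bounds and the function name `run_fba` are hypothetical; this is not the COBRA Toolbox workflow used in this chapter, only an illustration of maximizing one flux subject to $S \cdot v = 0$ and flux bounds.

```python
import numpy as np
from scipy.optimize import linprog


def run_fba(S, bounds, objective_index):
    """Maximize one flux subject to S v = 0 and lower/upper bounds on each flux."""
    n_reactions = S.shape[1]
    c = np.zeros(n_reactions)
    c[objective_index] = -1.0  # linprog minimizes, so negate to maximize
    result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                     bounds=bounds, method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x, -result.fun


# Hypothetical toy network: uptake -> A, A -> B, B -> biomass
# Rows are metabolites (A, B); columns are reactions.
S = np.array([
    [1.0, -1.0,  0.0],   # metabolite A
    [0.0,  1.0, -1.0],   # metabolite B
])
# Uptake is capped at 10 flux units; the other reactions are effectively unbounded
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0)]

fluxes, growth = run_fba(S, bounds, objective_index=2)
```

In this linear chain the uptake bound is the only limiting constraint, so the optimal biomass flux equals the maximum uptake rate.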

Figure 4.3. Illustration of flux balance analysis. A metabolic network can be represented as the product of a stoichiometry matrix S (which contains the stoichiometric coefficients of all metabolites involved in all reactions) and a flux vector v (which describes the flux through all reactions). At steady state $S \cdot v = 0$.

4.2.4 Gene essentiality analysis

Flux balance analysis uses linear programming to find a vector of metabolic fluxes that maximizes an objective function (e.g. flux through the biomass reaction) while fulfilling steady-state, mass-balance and flux volume constraints. To simulate gene deletions using FBA in a way that was comparable to in vivo and in vitro experimental results, we defined two types of in silico growth media153. For the in vitro case, we allowed influx of metabolites corresponding to the composition of Middlebrook 7H9 medium154 with a maximum rate of 1 mMol/gDW. For the in vivo case, we only allowed uptake of metabolites identified as being most consistent with in vivo gene essentiality predictions by Fang et al.155. Single gene deletion analyses were performed under both media conditions using the COBRA toolbox151. Any gene whose deletion resulted in a maximum growth yield of zero in a given medium was considered to be essential in that medium. Predicted gene essentiality was compared to high-throughput in vitro experimental data by Sassetti et al.152 and the in vivo data by Sassetti et al. in a mouse model of infection156.
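The single gene deletion procedure can be sketched on a toy network as follows. This is an illustration under simplifying assumptions, not the COBRA toolbox implementation used here: each gene is mapped directly to the reactions it enables (no Boolean gene-protein-reaction rules), a deletion is simulated by forcing those fluxes to zero, and the network, gene names and function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def max_growth(S, bounds, biomass_index):
    """Maximum biomass flux subject to S v = 0 and the given flux bounds."""
    c = np.zeros(S.shape[1])
    c[biomass_index] = -1.0  # linprog minimizes, so negate to maximize
    result = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                     bounds=bounds, method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return -result.fun


def find_essential_genes(S, bounds, biomass_index, gene_to_reactions, tol=1e-9):
    """Return genes whose deletion reduces the maximum growth to zero."""
    essential = []
    for gene, reaction_indices in gene_to_reactions.items():
        knockout_bounds = list(bounds)
        for j in reaction_indices:
            knockout_bounds[j] = (0.0, 0.0)  # reaction can no longer carry flux
        if max_growth(S, knockout_bounds, biomass_index) <= tol:
            essential.append(gene)
    return essential


# Hypothetical network: uptake -> A; A -> B by two alternative enzymes (g1, g2);
# B -> C (g3); C -> biomass. Rows: A, B, C. Columns: the five reactions.
S = np.array([
    [1.0, -1.0, -1.0,  0.0,  0.0],
    [0.0,  1.0,  1.0, -1.0,  0.0],
    [0.0,  0.0,  0.0,  1.0, -1.0],
])
bounds = [(0.0, 10.0)] + [(0.0, 1000.0)] * 4
gene_to_reactions = {"g1": [1], "g2": [2], "g3": [3]}

wild_type_growth = max_growth(S, bounds, biomass_index=4)
essential = find_essential_genes(S, bounds, 4, gene_to_reactions)
```

Genes g1 and g2 catalyze parallel routes, so deleting either one leaves growth unchanged, whereas g3 controls the only route to the biomass precursor and is therefore predicted to be essential.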

4.2.5 Phenotype microarray analysis

Phenotype microarray data for carbon and nitrogen source utilization were obtained from the study by Khatri et al.157. Following the original experimental analysis, a threshold of three Omnilog units was used to identify a positive result. In order to simulate growth on different carbon sources we defined a minimal medium including Ca, Cl, Cu, Fe, H, H2O, K, Na, NH4, O2, PO4 and SO4 with a maximum uptake rate of 1 mMol/gDW, and glycerol (which was found to be essential) with a maximum uptake rate of 0.06 mMol/gDW. Then, we allowed the uptake of each carbon compound tested with a maximum rate of 1 mMol/gDW. If maximal biomass synthesis under these conditions was predicted to be at least twice as large as on glycerol alone, the corresponding compound was considered to be a viable carbon source for biomass production. A similar approach was used to test nitrogen sources, the only difference being that NH4 was removed from the minimal medium and glycerol uptake was allowed a maximum rate of 1 mMol/gDW.
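The decision rule for calling a compound a viable carbon source can be written down in a few lines. The sketch below assumes that the maximal biomass fluxes have already been computed by FBA; the function name, the compound names and all numbers are hypothetical.

```python
def viable_sources(biomass_with_compound, biomass_glycerol_only, fold=2.0):
    """Return compounds whose predicted biomass is at least `fold` times
    the biomass predicted on glycerol alone."""
    threshold = fold * biomass_glycerol_only
    return sorted(
        name
        for name, biomass in biomass_with_compound.items()
        if biomass >= threshold
    )


# Hypothetical FBA results (arbitrary units), for illustration only
baseline = 0.05  # maximal biomass flux on the minimal medium with glycerol alone
predicted = {
    "compound_A": 0.30,  # well above twice the baseline
    "compound_B": 0.06,  # barely above the baseline, so not a viable source
    "compound_C": 0.11,  # just above twice the baseline
}

carbon_sources = viable_sources(predicted, baseline)
```

Using a fold change relative to the glycerol-only baseline, rather than any non-zero growth, avoids counting compounds that add nothing beyond the glycerol already present in the medium.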

4.3 3D-GLOBUS NOVEL PREDICTIONS

4.3.1 Candidates for functional characterization

The utilization of structural data concurrently with global sequence homology and gene context information (3D-GLOBUS) allowed us to build a model of M. tuberculosis metabolism which significantly improves on previously published models (table 4.2). We compared probability predictions for all 1,399 predicted metabolic genes of M. tuberculosis with GLOBUS and 3D-GLOBUS using the set of optimized weights obtained from training on the genome of E. coli. We chose E. coli because it is also a bacterium, so that the set of weights learned by applying Gibbs sampling to its well-curated genome is more likely to apply to the genomes of other bacteria. The distribution of M. tuberculosis genes as a function of the difference in prediction probability obtained from 3D-GLOBUS vs. GLOBUS is shown in figure 4.4.

Figure 4.4. Difference between GLOBUS and 3D-GLOBUS probabilities for Mycobacterium tuberculosis. A large number of gene products have increased probability of being associated to their top predicted function. This trend is visible in the genome-wide set (blue bars) as well as in the “twilight-zone” set (green bars).

About 37% of genes have unchanged prediction probability between the two methods. However, 105 genes have a significant decrease in probability from GLOBUS to 3D-GLOBUS, probably because the addition of structure-based information helps us resolve inconclusive predictions by shifting the probability of a given function upwards or downwards. Conversely, 324 genes have a significant increase in prediction probability. If we focus on genes that share 40% or less sequence identity with their best hit in Swiss-Prot, we find that this set accounts for almost all of the genes for which we observed a downward shift in probability, and for about half of the genes for which the probability is shifted up significantly.

When selecting cases to further study and validate structurally and experimentally, we applied the following requirements: 1) Active site conservation had to be appreciably higher than global sequence identity. 2) The probability of the given functional assignment by 3D-GLOBUS had to be at least 50 %. 3) The proposed function of the genes had to be related to pathogenesis. An examination of the genes that matched these conditions yielded 7 interesting candidates (table 4.1). All the genes chosen for validation are involved in the metabolism of alternative carbon sources for growth.

Table 4.1. Candidate genes for structural and experimental validation.

| Gene | Function (EC number) | %Global | %ASite | P(3D-GLOBUS) | Pathway | Substrate |
|---|---|---|---|---|---|---|
| Rv3255c | Mannose-6-phosphate isomerase (5.3.1.8) | 37.44 | 70.37 | 79.8 | Sugar metabolism | D-mannose 6-phosphate |
| Rv2332 | Malate dehydrogenase (1.1.1.38) | 42.55 | 66.67 | 69.0 | Pyruvate metabolism | (S)-malate |
| Rv3811 | N-acetylmuramoyl-L-alanine amidase (3.5.1.28) | 31.05 | 68.42 | 66.3 | Cell wall metabolism | Peptidoglycan |
| Rv1592c | Triacylglycerol lipase (3.1.1.3) | 32.47 | 50.00 | 72.4 | Glycerolipid metabolism | Triacylglycerol |
| Rv0751c | 3-hydroxybutyrate dehydrogenase (1.1.1.31) | 45.79 | 58.06 | 86.8 | Val-Leu-Ile degradation | 3-hydroxy-2-methylpropanoate |
| Rv3406 | Taurine dioxygenase (1.14.11.17) | 38.08 | 47.06 | 53.9 | Taurine metabolism | Taurine |
| Rv0891 | Adenylate cyclase (4.6.1.1) | 27.85 | 41.18 | 0.111 | Purine metabolism | ATP |

The first step in investigating the veracity of the functional predictions made with 3D-GLOBUS was a structural examination. The goal was to visualize the possible interactions between the active site region and the ligand corresponding to the candidate function, and to assess the conservation of the active site region residues. In order to do this, we built homology models for 6 of the 7 genes, selecting 3D structural templates in the PDB with the same EC number annotation (see Methods). The 3D structure for the remaining gene, Rv3406, has been experimentally determined at 2.5 Å resolution and is available in the PDB (code 4FFA)158. Subsequently, for each gene of interest, we aligned the homology model (or experimental structure) to the experimentally determined structure of an enzyme known to have the candidate function and crystallized with a ligand. Because the structures were overlaid, the position of the ligand in the uncharacterized enzyme could be inferred from its position in the template. This method allowed us to test whether the ligand could fit in the catalytic region of the uncharacterized protein, as well as to identify putative catalytic residues. In all 7 cases the ligand sits in a pocket and is close enough to residues to form polar contacts (fig. 4.5, appendix 4.1).

Figure 4.5. Homology models of target genes with ligand. This figure, showing the homology models for A) Rv3255c, B) Rv2332, C) Rv3811, D) Rv1592c, E) Rv0751c, F) Rv3406 and G) Rv0891, was created with PyMol74. The active site region residues that are fully conserved between template and query are shown as blue spheres, and the ligands are shown in magenta (spheres and stick models). To show which residues may be involved in interactions with the ligand, the mesh surface representations to the right of each panel show the residues within 4 Å of the ligand. The residues which have polar interactions with the ligand are shown as green stick representations.

4.3.2 Cholesterol Degradation Pathway

Furthermore, we identified genes for a pathway that is absent from the previously published models by the Jamshidi and Palsson154, Beste159 and Fang155 research groups. These genes belong to the cholesterol degradation pathway, which allows the bacteria to grow on cholesterol if other nutrients are not available. During persistent infection, M. tuberculosis is most often found in the lungs, where it takes up residence inside the alveolar macrophages through phagocytosis. At that stage, the macrophages should be activated by T-cell dependent cytokines and transfer the bacteria to lysosomes for destruction by reactive oxygen and nitrogen species160-162. However, some bacteria are able to escape this process and remain sequestered inside the lipid-rich foamy macrophages in an aggregate of macrophages, dendritic cells and lymphocytes called a granuloma137,138,163. These macrophages have a high cholesterol content, and cholesterol is therefore believed to be the prevalent carbon source in this environment.

We were able to predict a gene for each step of the cholesterol degradation pathway with high probability (fig. 4.6). This pathway produces metabolites that we believe are then used in the TCA cycle for energy production and gluconeogenesis. Therefore, this pathway is of interest when considering potential targets for new drug development against tuberculosis.

Figure 4.6. Cholesterol degradation pathway. All steps of the cholesterol degradation pathway are shown here. The rings and side-chains of the cholesterol molecules are degraded, and the resulting molecules are acetyl-CoA, pyruvate and propionyl-CoA. M. tuberculosis may use these compounds as substrates for the TCA cycle for energy production, and for gluconeogenesis.

Interestingly, starting with the discovery of the mycobacterial cholesterol transport system mce4, several studies published in the past few years have shown that cholesterol is an important nutrient for M. tuberculosis persistence during infection164-166. However, the exact mechanisms and genes involved have not yet been fully elucidated. Our model of the metabolic network of M. tuberculosis, reconstructed with 3D-GLOBUS, may help to better understand what happens during persistent infection, as well as help find ways to affect bacterial proliferation and persistence.

4.4 RECONSTRUCTION OF THE M. TUBERCULOSIS METABOLIC NETWORK

4.4.1 iMK917 Metabolic model characteristics

In the past decade, several metabolic reconstructions have been published for M. tuberculosis. In 2007, Beste et al.159 and Jamshidi and Palsson154 independently built metabolic network models of M. tuberculosis, each accounting for ~900 reactions and ~700 genes (see table 4.2); these reconstructions have been the basis for an in vivo compatible metabolic reconstruction155,167 and a semi-automated merged model167. These models were used, among other applications, to find reactions metabolically coupled to known TB drug targets154, and to investigate in vivo growth conditions and metabolic adaptations to hypoxia and low pH155. Despite these significant contributions, there are substantial differences between reconstructions of the same TB strain, and the accuracy of growth and gene deletion phenotypes is still lower than that achieved by similar models in other species168. Additionally, metabolic processes such as the reactions involved in cholesterol degradation, which play an important role in TB infection164,165,169, have not been modeled in a genome-scale context. These shortcomings are, to a great extent, a consequence of the sparse knowledge of the parasite’s biology and of incomplete or incorrect gene annotations154.

Here, we use probabilistic gene annotations and careful manual curation to consolidate prior reconstructions and to improve both the coverage and accuracy of the M. tuberculosis metabolic model. The new model includes 177 genes and 208 reactions that were not present in previous reconstructions; it also displays improved accuracy in the prediction of essential genes and growth on different nutrient sources. Using a combination of constraints-based simulations, sequence homology analyses and the interrogation of drug databases, we produced a diverse catalog of candidate target genes paired with potential inhibitors which could be useful to control parasite growth. We also use the updated model as a tool to interpret recent data on the growth of TB cells at low pH in the presence of various carbon sources.

Table 4.2. Comparison of Mycobacterium tuberculosis H37Rv metabolic reconstructions.

| | iMK917 | GSMN-TB (Beste et al.) | iNJ661 (Jamshidi and Palsson) | iNJ661v (Fang et al.) |
|---|---|---|---|---|
| Number of genes | 917 | 726 | 661 | 663 |
| Number of reactions | 1240 | 849 | 939 | 1049 |
| Number of metabolites | 1012 | 739 | 828 | 838 |
| Number of exchange reactions | 111 | 40 | 86 | 96 |
| Fraction of orphan reactions | 0.12 | 0.21 | 0.23 | 0.24 |
| Fraction of no-flux reactions | 0.19 | 0.10 | 0.22 | 0.15 |

Flux variability analysis (FVA) showed that the iMK917 model allows the simulation of fluxes through reactions that were blocked in prior reconstructions, i.e. reactions that cannot carry flux under any condition. Overall, the fraction of orphan and blocked reactions in the iMK917 model is comparable to, if not lower than, those of previous reconstructions despite the increased number of reactions (table 4.2). A large fraction (~15%) of the additional reactions included in the iMK917 model corresponds to the steroid degradation pathway, which is used for the degradation of cholesterol into propanoyl-CoA, acetyl-CoA and pyruvate, which may subsequently act as substrates for central carbon metabolism and the TCA cycle. Cholesterol degradation has been found to be necessary for growth of M. tuberculosis in chronically infected mice164,166,170; therefore, addition of this catabolic route allows the study of this process in the context of the entire metabolic network. Besides sterol degradation, additional reactions were included in nucleotide, cell envelope, and metabolism. One notable example is 2-oxoglutarate ferredoxin oxidoreductase (KOR, EC: 1.2.7.3), which replaces the typical 2-oxoglutarate dehydrogenase of the TCA cycle (absent from M. tuberculosis) and provides an alternative route for the conversion of 2-oxoglutarate to succinyl-CoA. KOR has been shown to operate concurrently with β-oxidation during growth on lipid substrates171, suggesting that the activity of this enzyme is relevant for the predominant in vivo growth conditions of M. tuberculosis.
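The notion of a blocked reaction used in the flux variability analysis above can be made concrete with a small sketch. For every reaction, the flux is minimized and then maximized subject to the steady-state and bound constraints; a reaction whose minimum and maximum are both zero is blocked. The toy network and the function names are hypothetical, and no optimality constraint on the objective is imposed here, which is an assumption of this sketch rather than a description of the analysis performed in this work.

```python
import numpy as np
from scipy.optimize import linprog


def flux_ranges(S, bounds):
    """Minimum and maximum feasible flux of every reaction under S v = 0."""
    n_reactions = S.shape[1]
    b_eq = np.zeros(S.shape[0])
    ranges = []
    for j in range(n_reactions):
        c = np.zeros(n_reactions)
        c[j] = 1.0
        low = linprog(c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
        high = linprog(-c, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
        if not (low.success and high.success):
            raise RuntimeError("linear program could not be solved")
        ranges.append((low.fun, -high.fun))
    return ranges


def blocked_reactions(S, bounds, tol=1e-9):
    """Indices of reactions that cannot carry any flux at steady state."""
    return [
        j
        for j, (low, high) in enumerate(flux_ranges(S, bounds))
        if abs(low) <= tol and abs(high) <= tol
    ]


# Hypothetical network: uptake -> A; A -> B; B -> export; A -> D,
# where D has no consuming reaction (a dead end).
# Rows: A, B, D. Columns: the four reactions.
S = np.array([
    [1.0, -1.0,  0.0, -1.0],
    [0.0,  1.0, -1.0,  0.0],
    [0.0,  0.0,  0.0,  1.0],
])
bounds = [(0.0, 10.0)] * 4

ranges = flux_ranges(S, bounds)
blocked = blocked_reactions(S, bounds)
```

The reaction producing the dead-end metabolite D is reported as blocked because any flux through it would make D accumulate, which violates the steady-state constraint.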

4.4.2 Reconciliation with experimental data

Several high-throughput experimental studies have been performed to determine M. tuberculosis genes that are likely to be essential for growth or pathogenesis152,156,172. In order to identify potential errors in the reconstruction of the metabolic network, we compared predictions of gene essentiality made by FBA (see Methods) to these data. Initial accuracies of 74 % and 80 % for predictions of in vivo and in vitro essentiality, respectively, pointed to several false positive and false negative predictions that were studied in detail. We looked for missing gene associations that could explain why certain genes were incorrectly predicted to be essential, and also looked for missing reactions and potential changes to the biomass objective function that would explain these false predictions. For example, the presence of ribose in the biomass reaction explained the false prediction of Rv3393 (inosine hydrolase) as an essential gene. Ribose in the biomass composition of M. tuberculosis is accounted for entirely by ribonucleotides; for this reason, we did not consider this sugar as an essential biomass component in the iMK917 model. As shown in table 4.3, the final revised version of our model achieves an average accuracy of 81 % for the prediction of essential genes under in vitro and in vivo conditions. This represents an improvement relative to prior models which is particularly marked in terms of precision and specificity. This result is in part explained by a lower fraction of false positives associated with the extended gene annotation and inclusion of new reactions in the new model.

Table 4.3. Comparison of experimental and in silico gene essentiality predictions.

| | iMK917 | GSMN-TB (Beste et al.) | iNJ661 (Jamshidi and Palsson) | iNJ661v (Fang et al.) |
|---|---|---|---|---|
| In vitro | | | | |
| Accuracy | 0.78 | 0.76 | 0.71 | 0.71 |
| Precision | 0.81 | 0.73 | 0.68 | 0.68 |
| Recall | 0.53 | 0.65 | 0.65 | 0.64 |
| Specificity | 0.92 | 0.84 | 0.77 | 0.77 |
| In vivo | | | | |
| Accuracy | 0.86 | 0.75 | 0.69 | 0.78 |
| Precision | 0.40 | 0.12 | 0.14 | 0.29 |
| Recall | 0.57 | 0.23 | 0.44 | 0.86 |
| Specificity | 0.90 | 0.81 | 0.71 | 0.77 |

Accuracy: (TP + TN)/Total; Precision: TP/(TP + FP); Recall: TP/(TP + FN); Specificity: TN/(FP + TN).
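The four measures defined in the legend of table 4.3 can be computed from the confusion counts as in the following sketch. The function name and the example counts are hypothetical and are not the counts underlying the table.

```python
def essentiality_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and specificity from confusion counts.

    tp: essential genes predicted essential     fp: non-essential predicted essential
    tn: non-essential predicted non-essential   fn: essential predicted non-essential
    """
    def ratio(numerator, denominator):
        # Report an undefined measure as NaN instead of dividing by zero
        return numerator / denominator if denominator else float("nan")

    total = tp + fp + tn + fn
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
    }


# Hypothetical counts, for illustration only
metrics = essentiality_metrics(tp=8, fp=2, tn=85, fn=5)
```

When essential genes are rare, as in the in vivo data set, accuracy is dominated by the true negatives, which is why precision and recall are reported alongside it.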

Similar to the essentiality results, we compared in silico predictions of the ability to synthesize biomass from single carbon and nitrogen sources to published experimental results. For this we used phenotype microarray (PM)173 data produced in a recent study157. PMs allow parallel testing of multiple growth conditions for which cell respiration, a proxy for cellular growth, is measured by the reduction of a tetrazolium dye. An initial accuracy of 57% in reproducing the experimental phenotypes prompted us to investigate the corresponding C and N uptake and utilization pathways, which suggested several important modifications to the model. In particular, exchange and transport reactions for all positive C and N sources were initially added to the model to see if this simple change would explain the disagreement with experimental results. This was the case for 6 of the tested compounds. In more complex cases, we manually revised pathways associated with positive experimental results in order to identify simple changes to the model that would allow growth on said nutrients. For example, butyric acid was one of the compounds producing the largest dye reduction signals, but the original model could not simulate growth on this substrate. The addition of butanoyl-CoA:acetate CoA-transferase (EC: 2.8.3.8), which converts butyric acid to butanoyl-CoA, as an orphan reaction to the model allowed in silico growth on butyric acid as the sole carbon source.

We also found cases in which the model predicted growth on particular substrates that gave a negative result in the PM assay. This could be a result of the specific experimental conditions; for example, negative experimental results were obtained for glucose, nitrite and ammonia, which are known carbon and nitrogen sources for M. tuberculosis growth174,175. However, negative growth results were also helpful for improving parts of the model. Specifically, for the metabolites showing growth in silico but not in vitro, we tried to identify model reactions that were essential for growth on those metabolites but were not supported by gene annotations. For instance, the model predicted L-arginine as both a carbon and a nitrogen source, while no dye reduction was observed in the PM experiments. We found that in the original model, the enzymes arginine succinyltransferase (EC: 2.3.1.109), succinylarginine dihydrolase (EC: 3.5.3.23) and succinylglutamate desuccinylase (EC: 3.5.1.96) were essential for arginine utilization. The fact that all three reactions are orphan suggests that there is no metabolic route in M. tuberculosis from arginine to glutamate that uses these reactions. Removing these activities from the model correctly led to the prediction of no growth on L-arginine. Following model curation guided by experimental results, the iMK917 model was able to reproduce 81 % of the experimentally measured phenotypes (fig. 4.7).

Figure 4.7. Model predictions for growth on different carbon and nitrogen sources. The figure shows the prediction results of growth/no-growth phenotypes for sole carbon and nitrogen sources. Predictions by four metabolic models were compared to experimental results based on phenotype microarrays (PMs)157. The letters NT indicate that the corresponding model lacks the transporters and exchange reactions associated with a specific metabolite. Yellow cells indicate that although the PM produced a negative result, evidence exists for the utilization of the corresponding compound. *Evidence of cholesterol utilization is not based on PMs but has been suggested by prior studies164,166.

4.4.3 Essential genes and candidate small molecule drugs

In order to suggest novel therapeutic research targets, we used the final version of the iMK917 model to predict and analyze essential M. tuberculosis genes. First, we performed a single gene deletion analysis for every gene in the model using FBA under rich media conditions, i.e. making all metabolites available to all modeled transport reactions. This resulted in 56 genes that are predicted to be essential under any nutritional condition given the biomass composition modeled in the iMK917 model. Second, to identify a set of target genes that could be subject to drug inhibition, we performed a sequence homology search of these 56 genes against 4108 known protein drug targets in the DrugBank database176. DrugBank stores biochemical and pharmacological information about drugs, their mechanisms and their targets. The search produced a significant hit (E-value < 5 × 10^-3) for 37 of the FBA predicted essential genes. Finally, we complemented the information in DrugBank with a literature search of inhibitors tested against homologues of the in silico defined essential genes to produce a catalog of candidate drug targets and potential small-molecule inhibitors that, to our knowledge, were not discussed before in the context of M. tuberculosis infection (appendix 4.2).

Notably, the list of candidate drug targets includes mostly genes with no detectable sequence homology to human proteins, or with homology to enzymes annotated to different activities in the human genome (appendix 4.2). This is desirable in order to avoid the potential inhibition of host proteins by drugs targeting M. tuberculosis genes. Another important feature of the results is that the functional roles of the identified essential genes are distributed across a wide span of distinct metabolic processes including sugar, vitamin, amino acid, and co-factor metabolism, which have not been typically targeted by TB drugs. Finally, for most of the candidate targets listed (appendix 4.2), we could find additional experimental evidence supporting a critical role in M. tuberculosis growth and infection. Our analysis makes a substantial addition to the list of candidate TB drug targets (e.g. Mdlui et al.177) while suggesting several lead compounds that can readily be tested to inhibit a variety of metabolic processes. Below we describe several interesting examples (fig. 4.8).

Figure 4.8. Candidate target genes in a diverse set of metabolic processes. The figure shows the metabolic roles of proposed M. tuberculosis drug targets (blue), and candidate compounds previously used to inhibit the activities of homologous genes in different bacterial species. A. is essential for the synthesis of peptidoglycan and the mycobacterial cell wall; potential inhibitors include (1) pyrazolopyrimidinediones (minimum inhibitory concentration in Helicobacter pylori: 0.5-8 µg/mL178) and (2) 1H-benzimidazole-2-sulfonate (Ki in B. subtilis = 2.5 µM179). B. 8-amino-7-oxononanoate synthase is essential for the synthesis of biotin, a crucial cofactor for lipid biosynthesis; it was inhibited by (3) 8-amino-7-oxo-8-phosphonononanoate (Ki = 7 µM) and (4) 4-carboxybutyl(1-amino-1-carboxyethyl)phosphonate (Ki = 68 µM) in Bacillus sphaericus180. C. 3-hydroxybutyrate dehydrogenase was predicted as essential for siderophore biosynthesis and was inhibited by (5) L-3-hydroxybutyrate in Pseudomonas fragi (Ki = 800 µM181). D. 7,8-dihydroxymethylpterin-pyrophosphokinase was predicted as essential for folate metabolism and was inhibited (IC50 = 9.5 µM) in E. coli by 2-amino-7,7-dimethyl-4-oxo-3,4,7,8-tetrahydro-pteridine-6- (2-{2-[5-(6-amino-purin-9-yl)-3,4-dihydroxy-tetrahydro-furan-2-ylmethanesulfonyl]-ethylcarbamoyl}-ethyl)-amide182.

4.5 DISCUSSION

Here we use a combination of probabilistic gene annotations and manual curation to consolidate prior efforts and to expand the Mycobacterium tuberculosis H37Rv metabolic reconstruction. The new model includes 1240 reactions and 1012 metabolites and accounts for 208 reactions and 177 genes not present in previous models. These reactions involve critical pathways for TB pathogenesis such as the route of cholesterol degradation, and also allow for an increased accuracy in the prediction of essential genes (81 %) and growth on different substrates (80 %). We use the model to predict 20 potential drug targets across 8 different metabolic subsystems which not only have low or no similarity to human proteins, but are also homologous to proteins for which small-molecule inhibitors have been identified. These genes represent a rich set of targets and drug combinations to explore therapeutic interventions on multiple fronts of the parasite’s metabolism.

New anti-tuberculosis drugs are desperately needed to shorten the time of treatment and to attack non-traditional targets, since the traditional targets have already been weakened by the emergence of drug resistance.

Notably, our predictions also allowed us to reconstruct the cholesterol degradation pathway in M. tuberculosis, which has been implicated in bacterial persistence in the literature but remains to be fully characterized. This pathway is absent from previously published metabolic models of M. tuberculosis. Our new model can now be used to simulate different environments and conditions in order to gain a better understanding of the metabolism of M. tuberculosis and of its ability to adapt when confronted with stresses such as hypoxia or low pH.

REFERENCES

1 Pagani, I. et al. The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic acids research 40, D571-579, doi:10.1093/nar/gkr1100 (2012).

2 Altschul, S. F. & Koonin, E. V. Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends in biochemical sciences 23, 444-447 (1998).

3 Bork, P. & Koonin, E. V. Predicting functions from protein sequences--where are the bottlenecks? Nature genetics 18, 313-318, doi:10.1038/ng0498-313 (1998).

4 Rost, B. Enzyme function less conserved than anticipated. Journal of molecular biology 318, 595-608, doi:10.1016/S0022-2836(02)00016-5 (2002).

5 Ouzounis, C. A. & Karp, P. D. The past, present and future of genome-wide re-annotation. Genome biology 3, COMMENT2001 (2002).

6 Tian, W. & Skolnick, J. How well is enzyme function conserved as a function of pairwise sequence identity? Journal of molecular biology 333, 863-882 (2003).

7 Rost, B., Liu, J., Nair, R., Wrzeszczynski, K. O. & Ofran, Y. Automatic prediction of protein function. Cellular and molecular life sciences : CMLS 60, 2637-2650, doi:10.1007/s00018-003-3114-8 (2003).

8 Perkel, J. M. Single-cell genomics: defining microbiology's dark matter. BioTechniques 52, 301-303, doi:10.2144/000113848 (2012).

9 Rost, B. Twilight zone of protein sequence alignments. Protein engineering 12, 85-94, doi:10.1093/protein/12.2.85 (1999).

10 Remm, M., Storm, C. E. & Sonnhammer, E. L. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of molecular biology 314, 1041-1052, doi:10.1006/jmbi.2000.5197 (2001).

11 Fang, G., Bhardwaj, N., Robilotto, R. & Gerstein, M. B. Getting started in gene orthology and functional analysis. PLoS computational biology 6, e1000703, doi:10.1371/journal.pcbi.1000703 (2010).

12 Altenhoff, A. M., Studer, R. A., Robinson-Rechavi, M. & Dessimoz, C. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS computational biology 8, e1002514, doi:10.1371/journal.pcbi.1002514 (2012).

13 Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25, 25-29, doi:10.1038/75556 (2000).

14 Nomenclature committee of the international union of biochemistry and molecular biology (NC-IUBMB), Enzyme Supplement 5 (1999). European journal of biochemistry / FEBS 264, 610-650 (1999).

15 Bairoch, A. The ENZYME database in 2000. Nucleic acids research 28, 304-305 (2000).

16 Bairoch, A. The ENZYME data bank. Nucleic acids research 21, 3155-3156 (1993).

17 Bairoch, A. The ENZYME data bank. Nucleic acids research 22, 3626-3627 (1994).

18 Bairoch, A. The ENZYME data bank in 1995. Nucleic acids research 24, 221-222 (1996).

19 Bairoch, A. The ENZYME data bank in 1999. Nucleic acids research 27, 310-311 (1999).

20 Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278, 631-637 (1997).

21 Tatusov, R. L. et al. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic acids research 29, 22-28 (2001).

22 Tatusov, R. L. et al. The COG database: an updated version includes eukaryotes. BMC bioinformatics 4, 41, doi:10.1186/1471-2105-4-41 (2003).

23 Dessimoz, C., Boeckmann, B., Roth, A. C. & Gonnet, G. H. Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucleic acids research 34, 3309-3316, doi:10.1093/nar/gkl433 (2006).

24 Koonin, E. V. Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature reviews. Microbiology 1, 127-136, doi:10.1038/nrmicro751 (2003).

25 Punta, M. et al. The Pfam protein families database. Nucleic acids research 40, D290-301, doi:10.1093/nar/gkr1065 (2012).

26 Hulo, N. et al. The PROSITE database. Nucleic acids research 34, D227-230, doi:10.1093/nar/gkj063 (2006).

27 Pauling, L. & Corey, R. B. Configurations of Polypeptide Chains With Favored Orientations Around Single Bonds: Two New Pleated Sheets. Proceedings of the National Academy of Sciences of the United States of America 37, 729-740 (1951).

28 Pauling, L., Corey, R. B. & Branson, H. R. The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proceedings of the National Academy of Sciences of the United States of America 37, 205-211 (1951).

29 Lo Conte, L. et al. SCOP: a structural classification of proteins database. Nucleic acids research 28, 257-259 (2000).

30 Knudsen, M. & Wiuf, C. The CATH database. Human genomics 4, 207-212 (2010).

31 Csaba, G., Birzele, F. & Zimmer, R. Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC structural biology 9, 23, doi:10.1186/1472-6807-9-23 (2009).

32 Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. The EMBO journal 5, 823-826 (1986).

33 Mirny, L. A. & Shakhnovich, E. I. Universally conserved positions in protein folds: Reading evolutionary signals about stability, folding kinetics and function. Journal of molecular biology 291, 177-196, doi:10.1006/jmbi.1999.2911 (1999).

34 Cech, T. R. Biologic catalysis by RNA. Harvey lectures 82, 123-144 (1986).

35 Lamond, A. I. & Gibson, T. J. Catalytic RNA and the origin of genetic systems. Trends in genetics : TIG 6, 145-149 (1990).

36 Cech, T. R. Catalytic RNA: structure and mechanism. Biochemical Society transactions 21, 229-234 (1993).

37 Jimenez-Sanchez, G., Childs, B. & Valle, D. Human disease genes. Nature 409, 853-855, doi:10.1038/35057050 (2001).

38 Freilich, S. et al. The complement of enzymatic sets in different species. Journal of molecular biology 349, 745-763, doi:10.1016/j.jmb.2005.04.027 (2005).

39 UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic acids research 42, D191-198, doi:10.1093/nar/gkt1140 (2014).

40 Berman, H. M. et al. The Protein Data Bank. Nucleic acids research 28, 235-242 (2000).

41 Schomburg, I. et al. BRENDA: a resource for enzyme data and metabolic information. Trends in biochemical sciences 27, 54-56 (2002).

42 Brenner, S. E. Errors in genome annotation. Trends in genetics : TIG 15, 132-133 (1999).

43 Devos, D. & Valencia, A. Intrinsic errors in genome annotation. Trends in genetics : TIG 17, 429-431 (2001).

44 Richardson, E. J. & Watson, M. The automatic annotation of bacterial genomes. Briefings in bioinformatics 14, 1-12, doi:10.1093/bib/bbs007 (2013).

45 Loewenstein, Y. et al. Protein function annotation by homology-based inference. Genome biology 10, 207, doi:10.1186/gb-2009-10-2-207 (2009).

46 Pal, C., Papp, B. & Hurst, L. D. Highly expressed genes in yeast evolve slowly. Genetics 158, 927-931 (2001).

47 Bloom, J. D., Wilke, C. O., Arnold, F. H. & Adami, C. Stability and the evolvability of function in a model protein. Biophysical journal 86, 2758-2764, doi:10.1016/S0006-3495(04)74329-5 (2004).

48 Lobkovsky, A. E., Wolf, Y. I. & Koonin, E. V. Universal distribution of protein evolution rates as a consequence of protein folding physics. Proceedings of the National Academy of Sciences of the United States of America 107, 2983-2988, doi:10.1073/pnas.0910445107 (2010).

49 Slotte, T. et al. Genomic determinants of protein evolution and polymorphism in Arabidopsis. Genome biology and evolution 3, 1210-1219, doi:10.1093/gbe/evr094 (2011).

50 Frick, I. M. et al. Convergent evolution among immunoglobulin G-binding bacterial proteins. Proceedings of the National Academy of Sciences of the United States of America 89, 8532-8536 (1992).

51 Bork, P., Sander, C. & Valencia, A. Convergent evolution of similar enzymatic function on different protein folds: the hexokinase, ribokinase, and galactokinase families of sugar kinases. Protein science : a publication of the Protein Society 2, 31-40, doi:10.1002/pro.5560020104 (1993).

52 Otlewski, J., Jelen, F., Zakrzewska, M. & Oleksy, A. The many faces of protease-protein inhibitor interaction. The EMBO journal 24, 1303-1310, doi:10.1038/sj.emboj.7600611 (2005).

53 Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nature reviews. Molecular cell biology 8, 995-1005, doi:10.1038/nrm2281 (2007).

54 Arakaki, A. K., Huang, Y. & Skolnick, J. EFICAz2: enzyme function inference by a combined approach enhanced by machine learning. BMC bioinformatics 10, 107, doi:10.1186/1471-2105-10-107 (2009).

55 Plata, G., Fuhrer, T., Hsiao, T. L., Sauer, U. & Vitkup, D. Global probabilistic annotation of metabolic networks enables enzyme discovery. Nature chemical biology 8, 848-854, doi:10.1038/nchembio.1063 (2012).

56 Povolotskaya, I. S. & Kondrashov, F. A. Sequence space and the ongoing expansion of the protein universe. Nature 465, 922-926, doi:10.1038/nature09105 (2010).

57 Chenna, R. et al. Multiple sequence alignment with the Clustal series of programs. Nucleic acids research 31, 3497-3500 (2003).

58 Ashkenazy, H. et al. FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic acids research 40, W580-584, doi:10.1093/nar/gks498 (2012).

59 Eddy, S. R. Where did the BLOSUM62 alignment score matrix come from? Nature biotechnology 22, 1035-1036, doi:10.1038/nbt0804-1035 (2004).

60 Hedges, S. B., Dudley, J. & Kumar, S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971-2972, doi:10.1093/bioinformatics/btl505 (2006).

61 Goldstein, L. & Waterman, M. S. Poisson, compound Poisson and process approximations for testing statistical significance in sequence comparisons. Bulletin of mathematical biology 54, 785-812 (1992).

62 Kumar, S., Tamura, K. & Nei, M. MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Briefings in bioinformatics 5, 150-163 (2004).

63 Zheng, X., Qin, Y. & Wang, J. A Poisson model of sequence comparison and its application to coronavirus phylogeny. Mathematical biosciences 217, 159-166, doi:10.1016/j.mbs.2008.11.006 (2009).

64 Yang, Z., Nielsen, R., Goldman, N. & Pedersen, A. M. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155, 431-449 (2000).

65 Fares, M. A., Byrne, K. P. & Wolfe, K. H. Rate asymmetry after genome duplication causes substantial long-branch attraction artifacts in the phylogeny of Saccharomyces species. Molecular biology and evolution 23, 245-253, doi:10.1093/molbev/msj027 (2006).

66 Fletcher, W. & Yang, Z. INDELible: a flexible simulator of biological sequence evolution. Molecular biology and evolution 26, 1879-1888, doi:10.1093/molbev/msp098 (2009).

67 Shindyalov, I. N. & Bourne, P. E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein engineering 11, 739-747 (1998).

68 Snedecor, G. W. & Cochran, W. G. Statistical methods. 8th edn, (Iowa State University Press, 1989).

69 Rost, B., Schneider, R. & Sander, C. Protein fold recognition by prediction-based threading. Journal of molecular biology 270, 471-480, doi:10.1006/jmbi.1997.1101 (1997).

70 Bairoch, A. & Boeckmann, B. The SWISS-PROT protein sequence data bank. Nucleic acids research 19 Suppl, 2247-2249 (1991).

71 Xu, Y. O., Hall, R. W., Goldstein, R. A. & Pollock, D. D. Divergence, recombination and retention of functionality during protein evolution. Human genomics 2, 158-167 (2005).

72 Pascual-Garcia, A., Abia, D., Mendez, R., Nido, G. S. & Bastolla, U. Quantifying the evolutionary divergence of protein structures: the role of function change and function conservation. Proteins 78, 181-196, doi:10.1002/prot.22616 (2010).

73 Worth, C. L., Gong, S. & Blundell, T. L. Structural and functional constraints in the evolution of protein families. Nature reviews. Molecular cell biology 10, 709-720, doi:10.1038/nrm2762 (2009).

74 DeLano, W. L. The PyMOL molecular graphics system, (2002).

75 Halabi, N., Rivoire, O., Leibler, S. & Ranganathan, R. Protein sectors: evolutionary units of three-dimensional structure. Cell 138, 774-786, doi:10.1016/j.cell.2009.07.038 (2009).

76 Doolittle, R. F. Similar amino acid sequences: chance or common ancestry? Science 214, 149-159 (1981).

77 Peregrin-Alvarez, J. M., Sanford, C. & Parkinson, J. The conservation and evolutionary modularity of metabolism. Genome biology 10, R63, doi:10.1186/gb-2009-10-6-r63 (2009).

78 Goldman, A. D., Baross, J. A. & Samudrala, R. The enzymatic and metabolic capabilities of early life. PloS one 7, e39912, doi:10.1371/journal.pone.0039912 (2012).

79 Siew, N., Azaria, Y. & Fischer, D. The ORFanage: an ORFan database. Nucleic acids research 32, D281-283, doi:10.1093/nar/gkh116 (2004).

80 Andrade, M. A. et al. Automated genome sequence analysis and annotation. Bioinformatics 15, 391-412 (1999).

81 Mungall, C. J. et al. An integrated computational pipeline and database to support whole-genome sequence annotation. Genome biology 3, RESEARCH0081 (2002).

82 Calderon-Copete, S. P. et al. The Mycoplasma conjunctivae genome sequencing, annotation and analysis. BMC bioinformatics 10 Suppl 6, S7, doi:10.1186/1471-2105-10-S6-S7 (2009).

83 Brylinski, M. & Skolnick, J. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci U S A 105, 129-134, doi:10.1073/pnas.0707684105 (2008).

84 Berezin, C. et al. ConSeq: the identification of functionally and structurally important residues in protein sequences. Bioinformatics 20, 1322-1324, doi:10.1093/bioinformatics/bth070 (2004).

85 Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic acids research 24, 21-25 (1996).

86 Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431-437, doi:10.1038/nature12352 (2013).

87 Katju, V. To the beat of a different drum: determinants implicated in the asymmetric sequence divergence of Caenorhabditis elegans paralogs. BMC evolutionary biology 13, 73, doi:10.1186/1471-2148-13-73 (2013).

88 Petrey, D. & Honig, B. Is protein classification necessary? Toward alternative approaches to function annotation. Current opinion in structural biology 19, 363-368, doi:10.1016/j.sbi.2009.02.001 (2009).

89 Brylinski, M. & Skolnick, J. FINDSITE: a threading-based approach to ligand homology modeling. PLoS computational biology 5, e1000405, doi:10.1371/journal.pcbi.1000405 (2009).

90 Ward, R. M. et al. Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates. Bioinformatics 25, 1426-1427, doi:10.1093/bioinformatics/btp160 (2009).

91 Hendlich, M., Rippmann, F. & Barnickel, G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. Journal of molecular graphics & modelling 15, 359-363, 389 (1997).

92 Huang, B. & Schroeder, M. LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC structural biology 6, 19, doi:10.1186/1472-6807-6-19 (2006).

93 Glaser, F., Morris, R. J., Najmanovich, R. J., Laskowski, R. A. & Thornton, J. M. A method for localizing ligand binding pockets in protein structures. Proteins 62, 479-488, doi:10.1002/prot.20769 (2006).

94 Binkowski, T. A. & Joachimiak, A. Protein functional surfaces: global shape matching and local spatial alignments of ligand binding sites. BMC structural biology 8, 45, doi:10.1186/1472-6807-8-45 (2008).

95 Bianchi, V., Gherardini, P. F., Helmer-Citterich, M. & Ausiello, G. Identification of binding pockets in protein structures using a knowledge-based potential derived from local structural similarities. BMC bioinformatics 13 Suppl 4, S17, doi:10.1186/1471-2105-13-S4-S17 (2012).

96 Spitzer, R., Cleves, A. E., Varela, R. & Jain, A. N. Protein function annotation by local binding site surface similarity. Proteins, doi:10.1002/prot.24450 (2013).

97 Konc, J., Hodoscek, M., Ogrizek, M., Trykowska Konc, J. & Janezic, D. Structure-based function prediction of uncharacterized protein using binding sites comparison. PLoS computational biology 9, e1003341, doi:10.1371/journal.pcbi.1003341 (2013).

98 Pierri, C. L., Parisi, G. & Porcelli, V. Computational approaches for protein function prediction: a combined strategy from multiple sequence alignment to molecular docking-based virtual screening. Biochimica et biophysica acta 1804, 1695-1712, doi:10.1016/j.bbapap.2010.04.008 (2010).

99 Brylinski, M. & Skolnick, J. Comparison of structure-based and threading-based approaches to protein functional annotation. Proteins 78, 118-134, doi:10.1002/prot.22566 (2010).

100 Dey, F., Cliff Zhang, Q., Petrey, D. & Honig, B. Toward a "structural BLAST": using structural relationships to infer function. Protein Sci 22, 359-366, doi:10.1002/pro.2225 (2013).

101 Cygler, M. et al. Relationship between sequence conservation and three-dimensional structure in a large family of esterases, lipases, and related proteins. Protein science : a publication of the Protein Society 2, 366-382, doi:10.1002/pro.5560020309 (1993).

102 Russell, R. B., Sasieni, P. D. & Sternberg, M. J. Supersites within superfolds. Binding site similarity in the absence of homology. Journal of molecular biology 282, 903-918, doi:10.1006/jmbi.1998.2043 (1998).

103 Capra, J. A. & Singh, M. Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875-1882, doi:10.1093/bioinformatics/btm270 (2007).

104 Heger, A., Mallick, S., Wilton, C. & Holm, L. The global trace graph, a novel paradigm for searching protein sequence databases. Bioinformatics 23, 2361-2367, doi:10.1093/bioinformatics/btm358 (2007).

105 Soding, J. & Remmert, M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Current opinion in structural biology 21, 404-411, doi:10.1016/j.sbi.2011.03.005 (2011).

106 Porter, C. T., Bartlett, G. J. & Thornton, J. M. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic acids research 32, D129-133, doi:10.1093/nar/gkh028 (2004).

107 Furnham, N. et al. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic acids research 42, D485-489, doi:10.1093/nar/gkt1243 (2014).

108 Kuepfer, L., Sauer, U. & Blank, L. M. Metabolic functions of duplicate genes in Saccharomyces cerevisiae. Genome research 15, 1421-1430, doi:10.1101/gr.3992505 (2005).

109 Feist, A. M. et al. A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular systems biology 3, 121, doi:10.1038/msb4100155 (2007).

110 Henry, C. S., Zinner, J. F., Cohoon, M. P. & Stevens, R. L. iBsu1103: a new genome-scale metabolic model of Bacillus subtilis based on SEED annotations. Genome biology 10, R69, doi:10.1186/gb-2009-10-6-r69 (2009).

111 von Mering, C. et al. Genome evolution reveals biochemical networks and functional modules. Proceedings of the National Academy of Sciences of the United States of America 100, 15428-15433, doi:10.1073/pnas.2136809100 (2003).

112 Korbel, J. O., Jensen, L. J., von Mering, C. & Bork, P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nature biotechnology 22, 911-917, doi:10.1038/nbt988 (2004).

113 Doerks, T., von Mering, C. & Bork, P. Functional clues for hypothetical proteins based on genomic context analysis in prokaryotes. Nucleic acids research 32, 6321-6326, doi:10.1093/nar/gkh973 (2004).

114 Dimitrieva, S. & Bucher, P. Genomic context analysis reveals dense interaction network between vertebrate ultraconserved non-coding elements. Bioinformatics 28, i395-i401, doi:10.1093/bioinformatics/bts400 (2012).

115 Pearson, W. Finding protein and nucleotide similarities with FASTA. Current protocols in bioinformatics / editoral board, Andreas D. Baxevanis ... [et al.] Chapter 3, Unit3 9, doi:10.1002/0471250953.bi0309s04 (2004).

116 Pearson, W. R. BLAST and FASTA similarity searching for multiple sequence alignment. Methods in molecular biology 1079, 75-101, doi:10.1007/978-1-62703-646-7_5 (2014).

117 Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403-410, doi:10.1016/S0022-2836(05)80360-2 (1990).

118 Mount, D. W. Using the Basic Local Alignment Search Tool (BLAST). CSH protocols 2007, pdb top17, doi:10.1101/pdb.top17 (2007).

119 Kharchenko, P., Chen, L., Freund, Y., Vitkup, D. & Church, G. M. Identifying metabolic enzymes with multiple types of association evidence. BMC bioinformatics 7, 177, doi:10.1186/1471-2105-7-177 (2006).

120 Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nature methods 10, 221-227, doi:10.1038/nmeth.2340 (2013).

121 Falda, M. et al. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC bioinformatics 13 Suppl 4, S14, doi:10.1186/1471-2105-13-S4-S14 (2012).

122 Tian, W., Arakaki, A. K. & Skolnick, J. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference. Nucleic acids research 32, 6226-6239, doi:10.1093/nar/gkh956 (2004).

123 Zumla, A., George, A., Sharma, V., Herbert, N. & Baroness Masham of Ilton. WHO's 2013 global report on tuberculosis: successes, threats, and opportunities. Lancet 382, 1765-1767, doi:10.1016/S0140-6736(13)62078-4 (2013).

124 Gagneux, S. Host-pathogen coevolution in human tuberculosis. Philosophical transactions of the Royal Society of London. Series B, Biological sciences 367, 850-859, doi:10.1098/rstb.2011.0316 (2012).

125 Sasaki, S., Takeshita, F., Okuda, K. & Ishii, N. Mycobacterium leprae and leprosy: a compendium. Microbiology and immunology 45, 729-736 (2001).

126 Greenstein, R. J. Is Crohn's disease caused by a mycobacterium? Comparisons with leprosy, tuberculosis, and Johne's disease. The Lancet infectious diseases 3, 507-514 (2003).

127 Andersen, P. & Doherty, T. M. The success and failure of BCG - implications for a novel tuberculosis vaccine. Nature reviews. Microbiology 3, 656-662, doi:10.1038/nrmicro1211 (2005).

128 Luca, S. & Mihaescu, T. History of BCG Vaccine. Maedica 8, 53-58 (2013).

129 Mangtani, P. et al. Protection by BCG Vaccine Against Tuberculosis: A Systematic Review of Randomized Controlled Trials. Clinical infectious diseases : an official publication of the Infectious Diseases Society of America 58, 470-480, doi:10.1093/cid/cit790 (2014).

130 Wolfe, L. M., Mahaffey, S. B., Kruh, N. A. & Dobos, K. M. Proteomic definition of the cell wall of Mycobacterium tuberculosis. Journal of proteome research 9, 5816-5826, doi:10.1021/pr1005873 (2010).

131 Cole, S. T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537-544, doi:10.1038/31159 (1998).

132 Srivastava, S. et al. Efflux-pump-derived multiple drug resistance to ethambutol monotherapy in Mycobacterium tuberculosis and the pharmacokinetics and pharmacodynamics of ethambutol. The Journal of infectious diseases 201, 1225-1231, doi:10.1086/651377 (2010).

133 Beste, D. J. et al. The genetic requirements for fast and slow growth in mycobacteria. PloS one 4, e5349, doi:10.1371/journal.pone.0005349 (2009).

134 Voskuil, M. I. et al. Inhibition of respiration by nitric oxide induces a Mycobacterium tuberculosis dormancy program. The Journal of experimental medicine 198, 705-713, doi:10.1084/jem.20030205 (2003).

135 Trauner, A., Lougheed, K. E., Bennett, M. H., Hingley-Wilson, S. M. & Williams, H. D. The dormancy regulator DosR controls ribosome stability in hypoxic mycobacteria. The Journal of biological chemistry 287, 24053-24063, doi:10.1074/jbc.M112.364851 (2012).

136 Bacon, J. et al. Non-replicating Mycobacterium tuberculosis elicits a reduced infectivity profile with corresponding modifications to the cell wall and extracellular matrix. PloS one 9, e87329, doi:10.1371/journal.pone.0087329 (2014).

137 Pieters, J. Mycobacterium tuberculosis and the macrophage: maintaining a balance. Cell host & microbe 3, 399-407, doi:10.1016/j.chom.2008.05.006 (2008).

138 Russell, D. G., Cardona, P. J., Kim, M. J., Allain, S. & Altare, F. Foamy macrophages and the progression of the human tuberculosis granuloma. Nature immunology 10, 943-948, doi:10.1038/ni.1781 (2009).

139 Kenney, T. J. & Churchward, G. Cloning and sequence analysis of the rpsL and rpsG genes of Mycobacterium smegmatis and characterization of mutations causing resistance to streptomycin. Journal of bacteriology 176, 6153-6156 (1994).

140 Almeida Da Silva, P. E. & Palomino, J. C. Molecular basis and mechanisms of drug resistance in Mycobacterium tuberculosis: classical and new drugs. The Journal of antimicrobial chemotherapy 66, 1417-1430, doi:10.1093/jac/dkr173 (2011).

141 Boucher, H. W. et al. Bad bugs, no drugs: no ESKAPE! An update from the Infectious Diseases Society of America. Clinical infectious diseases : an official publication of the Infectious Diseases Society of America 48, 1-12, doi:10.1086/595011 (2009).

142 Devasahayam, G., Scheld, W. M. & Hoffman, P. S. Newer antibacterial drugs for a new century. Expert opinion on investigational drugs 19, 215-234, doi:10.1517/13543780903505092 (2010).

143 Centers for Disease Control and Prevention. Impact of a shortage of first-line antituberculosis medication on tuberculosis control - United States, 2012-2013. MMWR. Morbidity and mortality weekly report 62, 398-400 (2013).

144 Centers for Disease Control and Prevention. Interruptions in supplies of second-line antituberculosis drugs--United States, 2005-2012. MMWR. Morbidity and mortality weekly report 62, 23-26 (2013).

145 Sudre, P. & Cohn, D. L. Mycobacterium tuberculosis drug resistance: a call to action. The international journal of tuberculosis and lung disease : the official journal of the International Union against Tuberculosis and Lung Disease 2, 609-611 (1998).

146 Pawlowski, A., Jansson, M., Skold, M., Rottenberg, M. E. & Kallenius, G. Tuberculosis and HIV co-infection. PLoS pathogens 8, e1002464, doi:10.1371/journal.ppat.1002464 (2012).

147 Tuberculosis: a global emergency. World health forum 14, 438 (1993).

148 Ogata, H. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic acids research 27, 29-34 (1999).

149 Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research 30, 207-210 (2002).

150 Eswar, N., Eramian, D., Webb, B., Shen, M. Y. & Sali, A. Protein structure modeling with MODELLER. Methods in molecular biology 426, 145-159, doi:10.1007/978-1-60327-058-8_8 (2008).

151 Schellenberger, J. et al. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nature protocols 6, 1290-1307, doi:10.1038/nprot.2011.308 (2011).

152 Sassetti, C. M., Boyd, D. H. & Rubin, E. J. Genes required for mycobacterial growth defined by high density mutagenesis. Molecular microbiology 48, 77-84 (2003).

153 Orth, J. D., Thiele, I. & Palsson, B. O. What is flux balance analysis? Nature biotechnology 28, 245-248, doi:10.1038/nbt.1614 (2010).

154 Jamshidi, N. & Palsson, B. O. Investigating the metabolic capabilities of Mycobacterium tuberculosis H37Rv using the in silico strain iNJ661 and proposing alternative drug targets. BMC systems biology 1, 26, doi:10.1186/1752-0509-1-26 (2007).

155 Fang, X., Wallqvist, A. & Reifman, J. Development and analysis of an in vivo-compatible metabolic network of Mycobacterium tuberculosis. BMC systems biology 4, 160, doi:10.1186/1752-0509-4-160 (2010).

156 Sassetti, C. M. & Rubin, E. J. Genetic requirements for mycobacterial survival during infection. Proceedings of the National Academy of Sciences of the United States of America 100, 12989-12994, doi:10.1073/pnas.2134250100 (2003).

157 Khatri, B. et al. High throughput phenotypic analysis of Mycobacterium tuberculosis and Mycobacterium bovis strains' metabolism using biolog phenotype microarrays. PloS one 8, e52673, doi:10.1371/journal.pone.0052673 (2013).

158 Sogi, K. M. et al. Mycobacterium tuberculosis Rv3406 is a type II alkyl sulfatase capable of sulfate scavenging. PloS one 8, e65080, doi:10.1371/journal.pone.0065080 (2013).

159 Beste, D. J. et al. GSMN-TB: a web-based genome-scale network model of Mycobacterium tuberculosis metabolism. Genome biology 8, R89, doi:10.1186/gb-2007-8-5-r89 (2007).

160 Shin, D. M. et al. Mycobacterium tuberculosis lipoprotein-induced association of TLR2 with protein kinase C zeta in lipid rafts contributes to reactive oxygen species-dependent inflammatory signalling in macrophages. Cellular microbiology 10, 1893-1905, doi:10.1111/j.1462-5822.2008.01179.x (2008).

161 Yang, C. S., Yuk, J. M. & Jo, E. K. The role of nitric oxide in mycobacterial infections. Immune network 9, 46-52, doi:10.4110/in.2009.9.2.46 (2009).

162 Chen, K. & Kolls, J. K. T cell-mediated host immune defenses in the lung. Annual review of immunology 31, 605-633, doi:10.1146/annurev-immunol-032712-100019 (2013).

163 Peyron, P. et al. Foamy macrophages from tuberculous patients' granulomas constitute a nutrient-rich reservoir for M. tuberculosis persistence. PLoS pathogens 4, e1000204, doi:10.1371/journal.ppat.1000204 (2008).

164 Pandey, A. K. & Sassetti, C. M. Mycobacterial persistence requires the utilization of host cholesterol. Proceedings of the National Academy of Sciences of the United States of America 105, 4376-4380, doi:10.1073/pnas.0711159105 (2008).

165 Ouellet, H., Johnston, J. B. & de Montellano, P. R. Cholesterol catabolism as a therapeutic target in Mycobacterium tuberculosis. Trends in microbiology 19, 530-539, doi:10.1016/j.tim.2011.07.009 (2011).

166 Griffin, J. E. et al. Cholesterol catabolism by Mycobacterium tuberculosis requires transcriptional and metabolic adaptations. Chemistry & biology 19, 218-227, doi:10.1016/j.chembiol.2011.12.016 (2012).

167 Chindelevitch, L., Stanley, S., Hung, D., Regev, A. & Berger, B. MetaMerge: scaling up genome-scale metabolic reconstructions with application to Mycobacterium tuberculosis. Genome biology 13, r6, doi:10.1186/gb-2012-13-1-r6 (2012).

168 Henry, C. S. et al. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nature biotechnology 28, 977-982, doi:10.1038/nbt.1672 (2010).

169 Klink, M. et al. Cholesterol is indispensable in the pathogenesis of Mycobacterium tuberculosis. PloS one 8, e73333, doi:10.1371/journal.pone.0073333 (2013).

170 Nesbitt, N. M. et al. A thiolase of Mycobacterium tuberculosis is required for virulence and production of androstenedione and androstadienedione from cholesterol. Infection and immunity 78, 275-282, doi:10.1128/IAI.00893-09 (2010).

171 Baughn, A. D., Garforth, S. J., Vilcheze, C. & Jacobs, W. R., Jr. An anaerobic-type alpha-ketoglutarate ferredoxin oxidoreductase completes the oxidative tricarboxylic acid cycle of Mycobacterium tuberculosis. PLoS pathogens 5, e1000662, doi:10.1371/journal.ppat.1000662 (2009).

172 Griffin, J. E. et al. High-resolution phenotypic profiling defines genes essential for mycobacterial growth and cholesterol catabolism. PLoS pathogens 7, e1002251, doi:10.1371/journal.ppat.1002251 (2011).

173 Bochner, B. R. Global phenotypic characterization of bacteria. FEMS microbiology reviews 33, 191-205, doi:10.1111/j.1574-6976.2008.00149.x (2009).

174 Baker, J. J., Johnson, B. K. & Abramovitch, R. B. Slow growth of Mycobacterium tuberculosis at acidic pH is regulated by phoPR and host-associated carbon sources. Molecular microbiology, doi:10.1111/mmi.12688 (2014).

175 Akhtar, S., Khan, A., Sohaskey, C. D., Jagannath, C. & Sarkar, D. NirBD is induced and plays an important role during in vitro dormancy of Mycobacterium tuberculosis. Journal of bacteriology 195, 4592-4599, doi:10.1128/JB.00698-13 (2013).

176 Law, V. et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic acids research 42, D1091-1097, doi:10.1093/nar/gkt1068 (2014).

177 Mdluli, K. & Spigelman, M. Novel targets for tuberculosis drug discovery. Current opinion in pharmacology 6, 459-467, doi:10.1016/j.coph.2006.06.004 (2006).

178 Basarab, G. S. et al. Design of inhibitors of Helicobacter pylori glutamate racemase as selective antibacterial agents: incorporation of imidazoles onto a core pyrazolopyrimidinedione scaffold to improve bioavailabilty. Bioorganic & medicinal chemistry letters 22, 5600-5607, doi:10.1016/j.bmcl.2012.07.004 (2012).

179 Whalen, K. L., Chau, A. C. & Spies, M. A. In silico Optimization of a Fragment-Based Hit Yields Biologically Active, High-Efficiency Inhibitors for Glutamate Racemase. ChemMedChem, doi:10.1002/cmdc.201300271 (2013).

180 Ploux, O., Breyne, O., Carillon, S. & Marquet, A. Slow-binding and competitive inhibition of 8-amino-7-oxopelargonate synthase, a pyridoxal-5'-phosphate-dependent enzyme involved in biotin biosynthesis, by substrate and intermediate analogs. Kinetic and binding studies. European journal of biochemistry / FEBS 259, 63-70 (1999).

181 Nakashima, K. et al. Closed complex of the D-3-hydroxybutyrate dehydrogenase induced by an enantiomeric competitive inhibitor. Journal of biochemistry 145, 467-479, doi:10.1093/jb/mvn186 (2009).

182 Shi, G. et al. Bisubstrate analog inhibitors of 6-hydroxymethyl-7,8-dihydropterin pyrophosphokinase: new lead exhibits a distinct binding mode. Bioorganic & medicinal chemistry 20, 4303-4309, doi:10.1016/j.bmc.2012.05.060 (2012).

183 Heimli, H., Hollung, K. & Drevon, C. A. Eicosapentaenoic acid-induced apoptosis depends on acyl CoA-synthetase. Lipids 38, 263-268 (2003).

184 Askari, B. et al. Rosiglitazone inhibits acyl-CoA synthetase activity and fatty acid partitioning to diacylglycerol and triacylglycerol via a peroxisome proliferator-activated receptor-gamma-independent mechanism in human arterial smooth muscle cells and macrophages. Diabetes 56, 1143-1152, doi:10.2337/db06-0267 (2007).

185 Rengarajan, J., Bloom, B. R. & Rubin, E. J. Genome-wide requirements for Mycobacterium tuberculosis adaptation and survival in macrophages. Proceedings of the National Academy of Sciences of the United States of America 102, 8327-8332, doi:10.1073/pnas.0503272102 (2005).

186 Kalscheuer, R., Weinrick, B., Veeraraghavan, U., Besra, G. S. & Jacobs, W. R., Jr. Trehalose-recycling ABC transporter LpqY-SugA-SugB-SugC is essential for virulence of Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America 107, 21761-21766, doi:10.1073/pnas.1014642108 (2010).

187 Nieland, T. J. et al. Cross-inhibition of SR-BI- and ABCA1-mediated cholesterol transport by the small molecules BLT-4 and glyburide. Journal of lipid research 45, 1256-1265, doi:10.1194/jlr.M300358-JLR200 (2004).

188 Favari, E. et al. Probucol inhibits ABCA1-mediated cellular lipid efflux. Arteriosclerosis, thrombosis, and vascular biology 24, 2345-2350, doi:10.1161/01.ATV.0000148706.15947.8a (2004).

189 Basarab, G. S., Hill, P. J., Rastagar, A. & Webborn, P. J. Design of Helicobacter pylori glutamate racemase inhibitors as selective antibacterial agents: a novel pro-drug approach to increase exposure. Bioorganic & medicinal chemistry letters 18, 4716-4722, doi:10.1016/j.bmcl.2008.06.092 (2008).

190 Doublet, P., van Heijenoort, J., Bohin, J. P. & Mengin-Lecreulx, D. The murI gene of Escherichia coli is an essential gene that encodes a glutamate racemase activity. Journal of bacteriology 175, 2970-2979 (1993).

191 Fisher, S. L. Glutamate racemase as a target for drug discovery. Microbial biotechnology 1, 345-360, doi:10.1111/j.1751-7915.2008.00031.x (2008).

192 Farrar, C. E., Siu, K. K., Howell, P. L. & Jarrett, J. T. Biotin synthase exhibits burst kinetics and multiple turnovers in the absence of inhibition by products and product-related biomolecules. Biochemistry 49, 9985-9996, doi:10.1021/bi101023c (2010).

193 Woong Park, S. et al. Evaluating the sensitivity of Mycobacterium tuberculosis to biotin deprivation using regulated gene expression. PLoS pathogens 7, e1002264, doi:10.1371/journal.ppat.1002264 (2011).

194 Lamichhane, G. et al. A postgenomic method for predicting essential genes at subsaturation levels of mutagenesis: application to Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America 100, 7213-7218, doi:10.1073/pnas.1231432100 (2003).

195 Pappenberger, G., Schulz-Gasch, T., Kusznir, E., Muller, F. & Hennig, M. Structure-assisted discovery of an aminothiazole derivative as a lead molecule for inhibition of bacterial fatty-acid synthesis. Acta crystallographica. Section D, Biological crystallography 63, 1208-1216, doi:10.1107/S0907444907049852 (2007).

196 Zuo, Y. et al. Design and synthesis of 1-(benzothiazol-5-yl)-1H-1,2,4-triazol-5-ones as protoporphyrinogen oxidase inhibitors. Bioorganic & medicinal chemistry 21, 3245-3255, doi:10.1016/j.bmc.2013.03.056 (2013).

197 Hao, G. F., Zuo, Y., Yang, S. G. & Yang, G. F. Protoporphyrinogen oxidase inhibitor: an ideal target for herbicide discovery. Chimia 65, 961-969, doi:10.2533/chimia.2011.961 (2011).

198 Hao, G. F., Tan, Y., Yu, N. X. & Yang, G. F. Structure-activity relationships of diphenyl-ether as protoporphyrinogen oxidase inhibitors: insights from computational simulations. Journal of computer-aided molecular design 25, 213-222, doi:10.1007/s10822-011-9412-6 (2011).

199 Dailey, T. A. et al. Discovery and Characterization of HemQ: an essential heme biosynthetic pathway component. The Journal of biological chemistry 285, 25978-25986, doi:10.1074/jbc.M110.142604 (2010).

200 Yip, K. W. et al. A porphodimethene chemical inhibitor of uroporphyrinogen decarboxylase. PloS one 9, e89889, doi:10.1371/journal.pone.0089889 (2014).

201 Miller, B. H. & Shinnick, T. M. Evaluation of Mycobacterium tuberculosis genes involved in resistance to killing by human macrophages. Infection and immunity 68, 387-390 (2000).

202 Nixon, M. R. et al. Folate Pathway Disruption Leads to Critical Disruption of Methionine Derivatives in Mycobacterium tuberculosis. Chemistry & biology 21, 819-830, doi:10.1016/j.chembiol.2014.04.009 (2014).

APPENDIX

Chapter 2

Appendix 2.1. Divergence times between pairs of species from TimeTree.org by Hedges et al60.

Appendix 2.2. Fitting the three models of sequence divergence to 30 EC data sets.

Appendix 2.3. Time of divergence necessary for orthologous enzymes to reach 5 % sequence identity; determined from the Gamma-corrected model of evolution.

Appendix 2.4. Hubble-like derivation: rate of change of the evolutionary distance with respect to the divergence time for 30 enzymatic functions.

Appendix 2.5. Comparison, using the Akaike Information Criterion, of the non-linear and linear fits of the rate of change of the evolutionary distance (with respect to the divergence time) as a function of evolutionary distance.

Appendix 2.6. Two-dimensional projections of the sequence identity vs. RMSD density plot for EC4 Pairs.

Chapter 3

Appendix 3.1. Probability of correct function at different sequence identity levels given active site region conservation.

Appendix 3.2. Optimized set of parameters obtained from training GLOBUS (no structural component) on the E. coli metabolic network iAF1260 (red), 3D-GLOBUS on the E. coli metabolic network iAF1260 (dark blue), and 3D-GLOBUS on the S. cerevisiae metabolic network iLL672.

Chapter 4

Appendix 4.1. Residues that form polar contacts with the ligand.

Appendix 4.2. Catalogue of target genes in M. tuberculosis and homologous drug targets with candidate drugs.

Appendix 2.1. Divergence times between pairs of species from TimeTree.org by Hedges et al60.

Species Pair | Time (Byr) | Species Pair | Time (Byr)
H.sapiens-R.norvegicus | 0.10 | O.sativa-P.falciparum | 1.72
H.sapiens-G.gallus | 0.33 | O.sativa-E.coli | 2.60
H.sapiens-X.laevis | 0.59 | O.sativa-M.thermo | 3.97
H.sapiens-D.rerio | 0.46 | S.cerevisiae-T.brucei | 1.96
H.sapiens-D.melanogaster | 0.89 | S.cerevisiae-P.falciparum | 1.72
H.sapiens-C.elegans | 0.99 | S.cerevisiae-E.coli | 2.60
H.sapiens-A.thaliana | 1.10 | S.cerevisiae-M.thermo | 3.97
H.sapiens-O.sativa | 1.10 | T.brucei-P.falciparum | 0.94
H.sapiens-S.cerevisiae | 1.30 | T.brucei-E.coli | 2.60
H.sapiens-T.brucei | 1.10 | T.brucei-M.thermo | 3.97
H.sapiens-P.falciparum | 1.10 | P.falciparum-E.coli | 2.60
H.sapiens-E.coli | 2.60 | P.falciparum-M.thermo | 3.97
H.sapiens-M.thermo | 3.97 | E.coli-S.enterica | 0.10
R.norvegicus-G.gallus | 0.33 | E.coli-Y.pestis | 0.43
R.norvegicus-X.laevis | 0.36 | E.coli-H.influenzae | 0.68
R.norvegicus-D.rerio | 0.46 | E.coli-P.aeruginosa | 1.35
R.norvegicus-D.melanogaster | 0.83 | E.coli-A.tumefaciens | 2.36
R.norvegicus-C.elegans | 1.13 | E.coli-M.tuberculosis | 2.75
R.norvegicus-A.thaliana | 1.10 | E.coli-B.subtilis | 2.75
R.norvegicus-O.sativa | 1.10 | E.coli-T.maritima | 3.64
R.norvegicus-S.cerevisiae | 1.39 | E.coli-P.horikoshii | 3.78
R.norvegicus-T.brucei | 1.10 | E.coli-S.solfataricus | 3.78
R.norvegicus-P.falciparum | 1.10 | E.coli-M.thermo | 3.78
R.norvegicus-E.coli | 2.60 | S.enterica-Y.pestis | 0.43
R.norvegicus-M.thermo | 3.97 | S.enterica-H.influenzae | 0.68
G.gallus-X.laevis | 0.36 | S.enterica-P.aeruginosa | 1.35
G.gallus-D.rerio | 0.46 | S.enterica-A.tumefaciens | 2.36
G.gallus-D.melanogaster | 0.83 | S.enterica-M.tuberculosis | 2.75
G.gallus-C.elegans | 1.13 | S.enterica-B.subtilis | 2.75
G.gallus-A.thaliana | 1.10 | S.enterica-T.maritima | 3.64
G.gallus-O.sativa | 1.10 | S.enterica-P.horikoshii | 3.78
G.gallus-S.cerevisiae | 1.27 | S.enterica-S.solfataricus | 3.78
G.gallus-T.brucei | 1.10 | S.enterica-M.thermo | 3.78
G.gallus-P.falciparum | 1.10 | Y.pestis-H.influenzae | 0.68
G.gallus-E.coli | 2.60 | Y.pestis-P.aeruginosa | 1.35
G.gallus-M.thermo | 3.97 | Y.pestis-A.tumefaciens | 2.36
X.laevis-D.rerio | 0.46 | Y.pestis-M.tuberculosis | 2.75
X.laevis-D.melanogaster | 0.83 | Y.pestis-B.subtilis | 2.75
X.laevis-C.elegans | 1.13 | Y.pestis-T.maritima | 3.30
X.laevis-A.thaliana | 1.10 | Y.pestis-P.horikoshii | 3.78
X.laevis-O.sativa | 1.10 | Y.pestis-S.solfataricus | 3.78
X.laevis-S.cerevisiae | 1.27 | Y.pestis-M.thermo | 3.78
X.laevis-T.brucei | 1.10 | H.influenzae-P.aeruginosa | 1.35
X.laevis-P.falciparum | 1.10 | H.influenzae-A.tumefaciens | 2.36
X.laevis-E.coli | 2.60 | H.influenzae-M.tuberculosis | 2.75
X.laevis-M.thermo | 3.97 | H.influenzae-B.subtilis | 2.75
D.rerio-D.melanogaster | 0.46 | H.influenzae-T.maritima | 3.64
D.rerio-C.elegans | 1.13 | H.influenzae-P.horikoshii | 3.78
D.rerio-A.thaliana | 1.10 | H.influenzae-S.solfataricus | 3.78
D.rerio-O.sativa | 1.10 | H.influenzae-M.thermo | 3.78
D.rerio-S.cerevisiae | 0.99 | P.aeruginosa-A.tumefaciens | 2.36
D.rerio-T.brucei | 1.10 | P.aeruginosa-M.tuberculosis | 2.75
D.rerio-P.falciparum | 1.10 | P.aeruginosa-B.subtilis | 2.75
D.rerio-E.coli | 2.60 | P.aeruginosa-T.maritima | 3.64
D.rerio-M.thermo | 3.97 | P.aeruginosa-P.horikoshii | 3.78
D.melanogaster-C.elegans | 0.97 | P.aeruginosa-S.solfataricus | 3.78
D.melanogaster-A.thaliana | 1.67 | P.aeruginosa-M.thermo | 3.78
D.melanogaster-O.sativa | 1.67 | A.tumefaciens-M.tuberculosis | 2.75
D.melanogaster-S.cerevisiae | 1.39 | A.tumefaciens-B.subtilis | 2.75
D.melanogaster-T.brucei | 1.72 | A.tumefaciens-T.maritima | 3.64
D.melanogaster-P.falciparum | 1.72 | A.tumefaciens-P.horikoshii | 3.78
D.melanogaster-E.coli | 2.60 | A.tumefaciens-S.solfataricus | 3.78
D.melanogaster-M.thermo | 3.97 | A.tumefaciens-M.thermo | 3.78
C.elegans-A.thaliana | 1.53 | M.tuberculosis-B.subtilis | 2.75
C.elegans-O.sativa | 1.67 | M.tuberculosis-T.maritima | 3.64
C.elegans-S.cerevisiae | 1.54 | M.tuberculosis-P.horikoshii | 3.78
C.elegans-T.brucei | 1.10 | M.tuberculosis-S.solfataricus | 3.78
C.elegans-P.falciparum | 1.72 | M.tuberculosis-M.thermo | 3.78
C.elegans-E.coli | 2.60 | B.subtilis-T.maritima | 3.64
C.elegans-M.thermo | 3.97 | B.subtilis-P.horikoshii | 3.78
A.thaliana-O.sativa | 1.17 | B.subtilis-S.solfataricus | 3.78
A.thaliana-S.cerevisiae | 1.67 | B.subtilis-M.thermo | 3.78
A.thaliana-T.brucei | 1.72 | T.maritima-P.horikoshii | 3.78
A.thaliana-P.falciparum | 1.72 | T.maritima-S.solfataricus | 3.78
A.thaliana-E.coli | 2.60 | T.maritima-M.thermo | 3.78
A.thaliana-M.thermo | 3.97 | P.horikoshii-S.solfataricus | 3.46
O.sativa-S.cerevisiae | 1.39 | P.horikoshii-M.thermo | 3.35
O.sativa-T.brucei | 1.01 | S.solfataricus-M.thermo | 3.46

Appendix 2.2. Fitting the three models of sequence divergence to 30 EC data sets.

Appendix 2.3. Time of divergence necessary for orthologous enzymes to reach 5 % sequence identity; determined from the Gamma-corrected model of evolution.

Enzymatic Function | Divergence Time (Byr)
1.1.1.205 | 3.48 × 10^3
1.1.1.37 | 23.40
1.8.1.4 | 35.26
1.8.1.9 | 1.35 × 10^3
2.1.1.45 | 1.98 × 10^3
2.3.1.9 | 154.65
2.3.3.1 | 19.90
2.4.2.14 | 1.31 × 10^3
2.4.2.9 | 169.10
2.6.1.16 | 799.86
2.7.2.3 | 100.57
2.7.6.1 | 63.60
2.7.7.23 | 401.52
2.7.7.9 | 71.50
3.1.3.25 | 260.16
3.6.1.1 | 1.66 × 10^3
4.1.2.13 | 124.81
4.2.1.11 | 470.22
4.2.1.2 | 28.03
5.1.3.2 | 212.04
5.3.1.1 | 155.10
5.3.1.6 | 607.12
5.4.2.1 | 1.50 × 10^3
5.4.2.2 | 17.50
5.4.2.8 | 504.60
6.2.1.1 | 7.35 × 10^5
6.2.1.5 | 425.92
6.3.2.6 | 316.22
6.3.4.4 | 88.50
6.3.5.2 | 1.05 × 10^4

Appendix 2.4. Hubble-like derivation: rate of change of the evolutionary distance with respect to the divergence time for 30 enzymatic functions.

Appendix 2.5. Comparison, using the Akaike Information Criterion, of the non-linear and linear fits of the rate of change of the evolutionary distance (with respect to the divergence time) as a function of evolutionary distance.

E.C. Number | Better Model | Non-linear AIC Weight | Linear AIC Weight | Likelihood Ratio
6.2.1.1 | Non-linear | 1.00 | 0.00 | 230900.49
1.8.1.4 | Non-linear | 1.00 | 0.00 | 86365.36
5.3.1.6 | Non-linear | 1.00 | 0.00 | 78635.35
5.4.2.8 | Non-linear | 1.00 | 0.00 | 6846.22
4.1.2.13 | Non-linear | 1.00 | 0.00 | 2967.50
6.3.2.6 | Non-linear | 1.00 | 0.00 | 2953.66
5.4.2.2 | Non-linear | 1.00 | 0.00 | 756.58
1.8.1.9 | Non-linear | 1.00 | 0.00 | 368.00
2.4.2.9 | Non-linear | 0.99 | 0.01 | 158.24
5.1.3.2 | Non-linear | 0.99 | 0.01 | 85.51
5.3.1.1 | Non-linear | 0.98 | 0.02 | 61.42
2.6.1.16 | Non-linear | 0.98 | 0.02 | 40.84
6.3.4.4 | Non-linear | 0.97 | 0.03 | 36.74
4.2.1.11 | Non-linear | 0.97 | 0.03 | 35.19
2.7.6.1 | Non-linear | 0.97 | 0.03 | 34.88
2.3.1.9 | Non-linear | 0.97 | 0.03 | 29.38
2.4.2.14 | Non-linear | 0.96 | 0.04 | 21.46
4.2.1.2 | Non-linear | 0.95 | 0.05 | 19.82
6.2.1.5 | Non-linear | 0.95 | 0.05 | 19.45
2.7.7.9 | Non-linear | 0.95 | 0.05 | 19.01
5.4.2.1 | Non-linear | 0.95 | 0.05 | 18.68
1.1.1.37 | Non-linear | 0.93 | 0.07 | 13.69
2.7.7.23 | Non-linear | 0.93 | 0.07 | 13.43
2.3.3.1 | Non-linear | 0.91 | 0.09 | 10.52
1.1.1.205 | Non-linear | 0.91 | 0.09 | 9.77
3.6.1.1 | Non-linear | 0.90 | 0.10 | 9.31
2.7.2.3 | Non-linear | 0.89 | 0.11 | 8.48
3.1.3.25 | Non-linear | 0.89 | 0.11 | 8.32
2.1.1.45 | Non-linear | 0.89 | 0.11 | 8.31
6.3.5.2 | Non-linear | 0.85 | 0.15 | 5.87
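
The relationship between the AIC weights and the likelihood ratios in Appendix 2.5 can be illustrated with a minimal Python sketch. It assumes, which the table itself does not state, that the weights are standard Akaike weights and that the likelihood ratio is the ratio of the two weights. The function names and the AIC scores below are hypothetical and chosen only for illustration; they are not values from this work.

```python
import math


def akaike_weights(aic_values):
    """Return Akaike weights for a list of AIC scores (lower AIC is better)."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference to the best model.
    rel = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


def evidence_ratio(weight_a, weight_b):
    """Ratio of two Akaike weights: how much more support model A has than model B."""
    return weight_a / weight_b


# Hypothetical AIC scores for the non-linear and linear fits (assumed values).
aic_nonlinear, aic_linear = 100.0, 103.54
w_nonlinear, w_linear = akaike_weights([aic_nonlinear, aic_linear])
ratio = evidence_ratio(w_nonlinear, w_linear)
print(f"non-linear weight {w_nonlinear:.2f}, linear weight {w_linear:.2f}, ratio {ratio:.2f}")
```

Under these assumptions, an AIC difference of about 3.5 already yields weights near 0.85 and 0.15, which is the weakest level of support for the non-linear model listed in the table.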

Appendix 2.6. Two-dimensional projections of the sequence identity vs. RMSD density plot for EC4 Pairs.

Appendix 3.1. Probability of correct function at different sequence identity levels given active site region conservation.

Global (%) | Act. Site 0-10% | Act. Site 10-20% | Act. Site 20-30% | Act. Site 30-40% | Act. Site 40-50% | Act. Site 50-60% | Act. Site 60-70% | Act. Site 70-80% | Act. Site 80-90% | Act. Site 90-100%
0-10 | 0.000 | 0.015 | 0.012 | 0.006 | 0.003 | 0.016 | 0.021 | 0.021 | 0.026 | 0.103
10-20 | 0.000 | 0.377 | 0.032 | 0.021 | 0.013 | 0.045 | 0.061 | 0.063 | 0.078 | 0.214
20-30 | 0.005 | 0.090 | 0.085 | 0.069 | 0.058 | 0.123 | 0.163 | 0.174 | 0.212 | 0.395
30-40 | 0.027 | 0.198 | 0.206 | 0.207 | 0.226 | 0.293 | 0.369 | 0.400 | 0.462 | 0.609
40-50 | 0.129 | 0.383 | 0.421 | 0.480 | 0.579 | 0.550 | 0.638 | 0.679 | 0.733 | 0.788
50-60 | 0.443 | 0.609 | 0.670 | 0.765 | 0.866 | 0.784 | 0.842 | 0.870 | 0.898 | 0.899
60-70 | 0.957 | 0.957 | 0.957 | 0.957 | 0.968 | 0.957 | 0.957 | 0.957 | 0.966 | 0.957
70-80 | 0.990 | 0.990 | 0.990 | 0.990 | 0.993 | 0.990 | 0.990 | 0.990 | 0.990 | 0.990
80-90 | 0.990 | 0.998 | 0.998 | 0.998 | 0.999 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998
90-100 | 0.994 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999
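
As an illustration of how the values in Appendix 3.1 might be used, the following Python sketch looks up the probability of correct function for a given pair of global and active-site sequence identities. The function names are hypothetical, and the binning convention (half-open 10 % bins, with exactly 100 % assigned to the top bin) is an assumption, since the table does not specify how bin boundaries are handled.

```python
# Rows: global sequence identity bins (0-10, 10-20, ..., 90-100 %).
# Columns: active-site identity bins (0-10, 10-20, ..., 90-100 %).
PROB_CORRECT = [
    [0.000, 0.015, 0.012, 0.006, 0.003, 0.016, 0.021, 0.021, 0.026, 0.103],
    [0.000, 0.377, 0.032, 0.021, 0.013, 0.045, 0.061, 0.063, 0.078, 0.214],
    [0.005, 0.090, 0.085, 0.069, 0.058, 0.123, 0.163, 0.174, 0.212, 0.395],
    [0.027, 0.198, 0.206, 0.207, 0.226, 0.293, 0.369, 0.400, 0.462, 0.609],
    [0.129, 0.383, 0.421, 0.480, 0.579, 0.550, 0.638, 0.679, 0.733, 0.788],
    [0.443, 0.609, 0.670, 0.765, 0.866, 0.784, 0.842, 0.870, 0.898, 0.899],
    [0.957, 0.957, 0.957, 0.957, 0.968, 0.957, 0.957, 0.957, 0.966, 0.957],
    [0.990, 0.990, 0.990, 0.990, 0.993, 0.990, 0.990, 0.990, 0.990, 0.990],
    [0.990, 0.998, 0.998, 0.998, 0.999, 0.998, 0.998, 0.998, 0.998, 0.998],
    [0.994, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999, 0.999],
]


def bin_index(identity_pct):
    """Map a percent identity in [0, 100] to a 10 %-wide bin index (assumed convention)."""
    if not 0 <= identity_pct <= 100:
        raise ValueError("percent identity must lie between 0 and 100")
    # Half-open bins [lo, hi); exactly 100 % falls into the last bin.
    return min(int(identity_pct // 10), 9)


def prob_correct_function(global_pct, active_site_pct):
    """Look up the tabulated probability for a global / active-site identity pair."""
    return PROB_CORRECT[bin_index(global_pct)][bin_index(active_site_pct)]


# Example: 35 % global identity with 85 % identity around the active site.
print(prob_correct_function(35, 85))
```

For low global identity the lookup makes the effect of active-site conservation explicit: at 30-40 % global identity, the tabulated probability rises from 0.027 to 0.609 as active-site identity increases from the lowest to the highest bin.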

Appendix 3.2. Optimized set of parameters obtained from training GLOBUS (no structural component) on the E. coli metabolic network iAF1260 (red), 3D-GLOBUS on the E. coli metabolic network iAF1260 (dark blue), and 3D-GLOBUS on the S. cerevisiae metabolic network iLL672.

[Figure: optimized parameter values (vertical axis from -1.5 to 2.5) for Globus, 3D-Globus(Eco), and 3D-Globus(Sce). Parameter labels: Seq/AS, Phylo, Clustering, Co-exp, Orthology, EC co-occ, Out-of-net, Gene Context.]

Appendix 4.1. Residues that form polar contacts with the ligand.

Gene | Residues forming polar contact with ligand
Rv3255c | Pro-100, His-131, Lys-132, Ala-276, Asn-280
Rv2332 | Lys-169, Glu-240, Asp-264, Asn-410, Asn-454, Asn-455
Rv3811 | His-228, Thr-369, Asp-370
Rv1592c | Asp-116, Ser-205
Rv0751c | Asn-11, Met-12, Asp-31, Leu-65, Pro-66, Thr-93, Lys-243
Rv3406 | Thr-124, His-251, Arg-262, Arg-266
Rv0891 | Glu-100, Ser-101, Arg-172

Appendix 4.2. Catalogue of target genes in M. tuberculosis and homologous drug targets with candidate drugs.

Gene | Function | Pathway | Top homologue | DrugBank Candidate Inhibitors (plus references) | % identity to human protein | Essentiality support
Rv1185c | Fatty-acid-CoA ligase (E.C: 6.2.1.3) | Fatty acid metabolism | ACSL4_HUMAN | Icosapent (inducer) [183]; Rosiglitazone [184] | 27.0 | 156
Rv1235 | Sugar ABC transporter (E.C: na) | Sugar transport | MALE_ECOLI | | 9.0 | 156, 185, 186
Rv1238 | Sugar ABC transporter (E.C: na) | Sugar transport | ABCA1_HUMAN | Glyburide [187]; Probucol [188] | 25.9 | 172, 156, 185, 186
Rv1338 | Glutamate racemase (E.C: 5.1.1.3) | Glutamate metabolism | MURI_LISMO | Pyrazolopyrimidinediones [178, 189]; 1H-benzimidazole-2-sulfonic acid [179] | n/a | 156, 172, 190, 191
Rv1569 | 8-amino-7-oxononanoate synthase (E.C: 2.3.1.47) | Biotin metabolism | BIOF_ECOLI | 8-Amino-7-oxo-8-phosphonononaoate [180]; 4-carboxybutyl(1-amino-1-carboxyethyl)phosphonate [180] | 31.4 | 156, 172, 190, 191
Rv1589 | Biotin synthase (E.C: 2.8.1.6) | Biotin metabolism | BIOB_ECOLI | 5’-deoxyadenosine [192]; S-adenosyl-L-homocysteine [192] | n/a | 156, 172, 193
Rv1661 | (E.C: 2.3.1.41) | Fatty acid metabolism | Q9ZGI2_STRVZ | | 23.3 | 152, 172, 194
Rv1663 | Polyketide synthase (E.C: 2.3.1.41) | Fatty acid metabolism | Q9ZGI2_STRVZ | 2-Phenylamino-4-Methyl-5-Acetyl Thiazole [195] | 26.7 | 152, 172
Rv1664 | Polyketide synthase (E.C: 2.3.1.41) | Fatty acid metabolism | Q9ZGI2_STRVZ | 2-Phenylamino-4-Methyl-5-Acetyl Thiazole [195] | 29.8 |
Rv2048c | Polyketide synthase (E.C: 2.3.1.41) | Fatty acid metabolism | Q9ZGI2_STRVZ | 2-Phenylamino-4-Methyl-5-Acetyl Thiazole [195] | 15.4 | 156, 172
Rv2211c | Aminomethyltransferase (E.C: 2.1.2.10) | Glycine, serine, threonine metabolism | GCST_THEMA | | 30.0 | 152, 156, 172
Rv2677c | Protoporphyrinogen oxidase (E.C: 1.3.3.4) | Porphyrin metabolism | PPOX_BACSU | 1-(benzothiazol-5-yl)-1H-1,2,4-triazol-5-ones [196, 197] (herbicides); Diphenyl-ether [198] (anti-cancer) | 29.6* | 172, 199
Rv2678c | Uroporphyrinogen decarboxylase (E.C: 4.1.1.37) | Porphyrin metabolism | DCUP_HUMAN | PI-16 [200] (porphodimethene compound, anti-cancer) | 35.5* | 172
Rv2946c | Polyketide synthase (E.C: 2.3.1.41) | Fatty acid metabolism | Q9ZGI2_STRVZ | 2-Phenylamino-4-Methyl-5-Acetyl Thiazole [195] | 28.8 |
Rv2957 | (E.C: 2.4.1.-) | Membrane metabolism | SPSA_BACSU | | 22.9 |
Rv2958c | Glycosyltransferase (E.C: 2.4.1.-) | Membrane metabolism | UD3A1_HUMAN | | 25.3 | 201
Rv2962c | Rhamnosyltransferase (E.C: 2.4.1.-) | Membrane metabolism | UD3A1_HUMAN | | 24.7 | 201
Rv3264c | Mannose 1-phosphate guanylyltransferase (E.C: 2.7.7.13) | Sugar metabolism | RMLA_SALTY | | 32.3* | 152, 172
Rv3502c | 3-hydroxybutyrate dehydrogenase (E.C: 1.1.1.30) | Fatty acid metabolism | FABG_ECOLI | (L)-3-hydroxybutyrate [181] | 15.4 | 156, 172
Rv3606c | 7,8-dihydroxymethylpterin pyrophosphokinase (E.C: 2.7.6.3) | Folate metabolism | HPPK_ECOLI | 2-amino-7,7-dimethyl-4-oxo-3,4,7,8-tetrahydro-pteridine-6-carboxylic acid (2-(2-[5-(6-amino-purin-9-yl)-3,4-dihydroxy-tetrahydro-furan-2-ylmethanesulfonyl]-ethylcarbamoyl)-ethyl)-amide [182] | n/a | 202

Shaded rows are described in detail in the text and figure 4.4.
*Closest human homolog encodes the same enzymatic activity.
