Universal Protein Families and the Functional Content of the Last Universal Common Ancestor
Total Page:16
File Type:pdf, Size:1020Kb
J Mol Evol (1999) 49:413–423 © Springer-Verlag New York Inc. 1999 Universal Protein Families and the Functional Content of the Last Universal Common Ancestor Nikos Kyrpides,1,2 Ross Overbeek,2 Christos Ouzounis3 1 Department of Microbiology, University of Illinois at Urbana–Champaign, 407 South Goodwin Avenue, Urbana IL 61801, USA 2 Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Case Avenue, Argonne IL 60439, USA 3 Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK Abstract. The phylogenetic distribution of Metha- ties of a complete set of archaeal proteins with the other nococcus jannaschii proteins can provide, for the first two domains, Bacteria and Eukarya. This universal set of time, an estimate of the genome content of the last com- protein families present in all domains can then be used mon ancestor of the three domains of life. Relying on as an estimate for the genome content of the Last Uni- annotation and comparison with reference to the species versal Common Ancestor (Woese 1982; Woese and Fox distribution of sequence similarities results in 324 pro- 1977). Evidently, factors such as gene loss, horizontal teins forming the universal family set. This set is very transfer across domains and species peculiarities make well characterized and relatively small and nonredun- this task difficult, especially when only interspecies com- dant, containing 301 biochemical functions, of which parisons are used (Becerra et al. 1997). Previous such 246 are unique. This universal function set contains approaches have ignored the above problems (Mush- mostly genes coding for energy metabolism or informa- egian and Koonin 1996), resulting in descriptions that tion processing. It appears that the Last Universal Com- may be relevant only from a functional but not an evo- mon Ancestor was an organism with metabolic networks lutionary viewpoint, despite such claims (Koonin and and genetic machinery similar to those of extant unicel- Mushegian 1996). lular organisms. Until recently, there has been no description and only a single prediction for the set of universal protein fami- Key words: Last Universal Common Ancestor — lies, based on the presence or absence of sequence pat- Cellular evolution — Metabolic reconstruction — Bio- tern sets in the three domains of life (Ouzounis and Kyr- chemical pathways — Genomics pides 1996b). However, this approach was limited primarily by two factors: first, sequence patterns cannot adequately describe an exact molecular function, but only protein families; and second, the archaeal families Introduction were clearly underrepresented (Ouzounis et al. 1995a), before the availability of the M. jannaschii genome. Al- With the completion of the sequencing of the first ar- though only 77 protein families were found to be uni- chaeal genome, that of Methanococcus jannaschii (Bult versal, a much larger number of protein families was et al. 1996), it has been possible to describe the similari- predicted to be present in Archaea, based on functional relationships such as metabolic pathways (Ouzounis and Kyrpides 1996b). Herein, we describe for the first time a list of univer- Correspondence to: C. Ouzounis; e-mail: [email protected] sal functions based on sequence comparison and detailed 414 functional annotation. The current approach takes into account, but is not restricted to, complete genomes, thus providing a basis for the identification of the broadest possible number of functions present in representative species from all three domains of life. Methods All function assignments were derived after detailed family analysis of every single M. jannaschii ORF with continuing updates (Andrade et al. 1997; Kyrpides et al. 1996a). In addition, intradomain similarities were considered through the complete collection of species in public databases, thus eliminating some of the problems with pairwise inter- Fig. 1. Evolution of functional annotations for the M. jannaschii species comparisons encountered in similar studies (Tatusov et al. genome. “Function” signifies any functional assignment with a varying 1997). It must be emphasized that if a protein is present in any species degree of accuracy, “hypothetical” denotes similarity to uncharacter- for a given domain, it is irrelevant whether this protein may be absent ized sequences, and “unique” represents unique sequences without any from complete genomes, as is sometimes thought (Mushegian and Koo- functional annotation. Within a year, the level of function assignment nin, 1996). For this particular problem, what is needed is abundant increased to 52%, while homologies to other proteins have covered sequence information, and not complete genome sequences (Ouzounis another quarter of the total genome. Continuing annotation is available and Kyrpides 1996b). Manually derived results were compared with an at <http://geta.life.uiuc.edu/∼nikos/MJannotations.html>. automatic analysis obtained by the WIT system (<http:// wit.mcs.anl.gov/WIT/>) (Overbeek et al. 1997). Metabolic information was obtained through the EMP database (Selkov et al. 1997a). have concluded that the level of functional assignment Each M. jannaschii ORF was used as a query in Blast2 similarity for M. jannaschii is much lower. Approximately another searches against the nonredundant protein sequence database at the National Center of Biotechnology Information (NCBI). The complete quarter of the genome contains proteins with homo- genome sequence of Methanobacterium thermoautotrophicum (Smith logues of unknown function in the database and the re- et al. 1997) was obtained from the Genome Therapeutics Corporation maining quarter of the genome contains unique se- (<http://www.cric.com/>). The complete genome sequence of Ar- quences, with no homologues (even after the completion chaeoglobus fulgidus (Klenk et al. 1997) was retrieved from TIGR of two additional archaeal genomes). This is an impor- (<http://www.tigr.org/>). These genome sequences were only used for comparisons within the archaeal domain. Database and bibliographic tant observation, since these levels of function and simi- searches and function assignments were performed as described pre- larity relationships are now comparable to those in other viously (Andrade et al. 1997; Kyrpides et al. 1996a). All functional model species (Ouzounis et al. 1996). annotations are available at <http://geta.life.uiuc.edu/∼nikos/ MJannotations.html> and mirrored at <http://www.ebi.ac.uk/research/ cgg/annotation/MJannotations.html>. Updates of the phylogenetic dis- Phylogenetic Distribution tribution for all M. jannaschii genes are available at <http:// geta.life.uiuc.edu/∼nikos/Domain.Comparisons.html>. There are 324 proteins in M. jannaschii, with at least one homologue present in some species from both the other two domains, Bacteria and Eukarya, forming the univer- Results sal protein family set (Table 1, Fig. 2). Of those, only 23 are hypothetical (families without functional annotation). The universal set of proteins contains metabolic en- Continued Annotation zymes, transporters, various ATP/GTP-binding proteins, protein tyrosine phosphatases (Stravopodis and Kyrpides At the time of the original publication, the M. jannaschii 1999), ribosomal proteins, aminoacyl-tRNA synthetases, genome was thought to be unique in that it contained translation initiation factors (Kyrpides and Woese 1998), only a few functionally characterized homologues, 38% helicases, and RNA polymerase subunits. Most of these of the total genome (Bult et al. 1996). One explanation proteins were previously predicted using family patterns was that this was the first completely sequenced archaeal (Ouzounis and Kyrpides 1996b). About a quarter of these genome and its uniqueness represented the peculiarity of proteins are paralogues within M. jannaschii, therefore the yet unexplored archaeal domain. However, with con- limiting the number of predicted universal functions to tinued updates, the level of functional annotation has 246 (Table 2). Structural RNA (rRNA and tRNA) genes now surpassed 50% (Fig. 1). Previous claims that have can also be considered to belong to this universal func- raised this number to as high as 70% (Koonin et al. 1997) tion set but are not further discussed in this context. should be discarded due to a large number of false- Another 522 proteins have at least one homologue in positive identifications and overpredictions (Kyrpides the bacterial domain only, of which 132 are hypothetical and Ouzounis 1999a). Three independent analyses (An- (Table 1, Fig. 2). This “uneukaryotic” set (Ouzounis and drade et al. 1997; Bult et al. 1996; Kyrpides et al. 1996a) Kyrpides 1996b) of proteins contains electron-transport 415 Table 1. Domain and class distribution of the M. jannaschii proteins Domains/classes Energy Information Communication Hypothetical Total (domain) Universal (ABE) 178 103 20 23 324 Uneukaryotic (AB) 264 94 32 132 522 Unbacterial (AE) 10 79 7 27 123 Archaeal (A) 80 34 11 662 787 Total (class) 532 310 70 844 1756 There are two unexpected observations from this analysis: first, the universal set of proteins is relatively small, but very well characterized (Tables 1 and 2); sec- ond, Archaea, as represented by M. jannaschii, seem to contain four times more bacterial-type than eukaryotic- type proteins. This final point comes in sharp contrast to previous beliefs (Keeling et al. 1994). It remains to