Q-D/ 1995 Oxford University Press Nucleic Acids Research, 1995, Vol. 23, No. 4 565-570 Novel protein families in archaean genomes Christos Ouzounis*, ' and

European Molecular Biology Laboratory, Meyerhofstrasse 1, Postfach 10.2209, D-69012 Heidelberg, and 1 Institute for Molecular Biology and Biotechnology, Heraklion,

Received December 7, 1994; Revised and Accepted January 9, 1995

ABSTRACT there are direct implications not only for the origins oflife and the definition ofthe common ancestor, but also for the understanding In a quest for novel functions in archaea, all archaean of the origins and distribution of contemporary molecular hypothetical open reading frames (ORFs), as annotated systems, such as replication, transcription and translation. in the Swiss-Prot protein sequence database, were Some ofthe recent successes in this field were the identification used to search the latest databases forthe identification of general transcription factors in archaea (2), for example TFIIB of characterized homologues. Of the 95 hypothetical (3), TFIID (4,5) and TFHS (6,7), RNA polymerases (8) and archaean ORFs, 25 were found to be homologous to , such as members of the (9) and methionine another hypothetical archaean ORF, while 36 were aminopeptidase (10) families. homologous to non-archaean proteins, of which as As a test case for our analysis tools, enriched by new methods many as 30 were homologous to a characterized protein and larger and better sequence databases, we have analysed a set family. Thus the level of sequence similarity in this set of sequences, those of hypothetical ORFs of archaea. The new reaches 64%, while the level of function assignment is functional assignments provide new insights into universal only 32%. Of the ORFs with predicted functions, 12 protein families present before the divergence of the prime homologies are reported here for the first time and domains into the contemporary ones. represent nine new functions and one gene duplication at an acetyl-coA synthetase locus. The novel functions include components of the transcriptional and transla- MATERIALS AND METHODS tional apparatus, such as ribosomal proteins, modifica- As a prime dataset, 95 archaean ORFs annotated as hypothetical tion enzymes and a translation initiation factor. In ORFs from Swiss-Prot (11) were used as query sequences to addition, new enzymes are identified in archaea, such search databases for the identification of homologues. as cobyric acid synthase, dCTP deaminase and the first The sequence databases that were searched were Swiss-Prot, archaean homologues of a new subclass of ATP GenPept (translated GenBank, courtesy ofMark Gunnell), PirOnly binding proteins found in fungi. Finally, it is shown that (a subset of PIR, non-overlapping with Swiss-Prot, courtesy of the putative laminin receptor family of eukaryotes and Peter Rice) and TrEBML (translated EMBL, courtesy of Thure an archaean homologue belong to the previously Etzold). The versions used were as of October 10, 1994. Database characterized ribosomal S2 from eubac- searches were exclusively performed using BLAST (12), using the teria. From the present and previous work, the major BLOSUM62 substitution matrix and default parameters. implication is that archaea seem to have a mode of The database searches were controlled, parsed and visualized expression of genetic information rather similar to using GeneQuiz (13), a system for large scale sequence analysis eukaryotes, while eubacteria may have proceeded into that allows semi-automatic interactive evaluation. Multiple align- unique ways of transcription and translation. In addi- ments were performed using MaxHom (14) and ClustalW (15). tion, with the detection of proteins in various metabolic Access to bibliographic databases was greatly facilitated using and genetic processes in archaea, we can further Nentrez (NCBI, NLM) and SRS (16). A side goal of the analysis predict the presence of additional proteins involved in was to perform as many actions of the process without recourse these processes. to paper, using only a workstation and the public databases available, something impossible only a few years ago (17). INTRODUCTION RESULTS The definition of three major domains in the tree of life and the emergence of archaea as a separate group (1), distinct to and Of the 95 hypothetical ORFs, as many as 61 had a homologue in separate from eubacteria and eukaryotes has stirred much interest the databases: 25 of those were similar to another hypothetical in the past years. Studies of deep phylogenies and the definition ORF in archaea and remain unassigned, while the other 36 were of a hypothetical universal common ancestor from which all homologous to proteins from eubacteria and eukaryotes (Fig. 1). branches of life may have emerged independently are based on Of these, 30 proteins were homologous to 22 characterized the analysis of molecular families of extant species. In addition, families and thus their functions can be predicted. Twelve ofthese

* To whom correspondence should be addressed 566 Nucleic Acids Research, 1995, Vol. 23, No. 4

a proteins represent the discovery of 10 functions, nine ofwhich are novel for the archaean domain (Table 1). The remaining 18 Class Number Percent represent 12 functions and have recently been reported, but not adequately annotated (Table 2). No homology 34 36% An interesting pattern emerging from the present analysis is the Homology 61 64% Archaea 25 26% rather equal representation ofrelatives of archaean proteins in the Eubactena only 14 15% two other major domains, eubacteria and eukaryotes. of the 61 Eukaryotes only 10 10.5% sequences with homology to any protein, 25 were confined to the Universal 12 12.5% archaean domain only (all ofunknown function) and the other 36 were distributed as follows: 14 were homologous to eubacteria Total 95 100% only (of which the functions of four remain unknown), 10 were homologous to eukaryotes only (of which two are of unknown b function) and finally 12 were ubiquitous (and all assigned a function) (Fig. 1). The 12 novel function assignments belonging to 10 families are presented below in some detail.

Archaea and eubacteria only function _ homogy Cobyric acid synthase. The partial 0RF230 downstream of the Elunknown \/!0000-000000 32% glutamine synthetase gene of Methanococcus voltae (18) is 64% highly similar (45% sequence identity) to CobQ from Pseudomo- nas denitrifitans (19) and homologous to dethiobiotin synthases (data not sh6wn). Cobalamin synthesis is known to occur in various eubacterial species. The pathway is not fully understood, Figure 1. Summary of results and patterns in the analysis of the current set. (a) but cloning of the locus revealed the presence of five different Numbers and percentages ofproteins without homology and with homology to genes, named cobN-cobQ and cobW (19). CobQ was identified the three domains. Note that as many as one third of the ORFs with homology as sequence to non-archaean genomes are universal. (b) A pie diagram showing the levels cobyric acid synthase (19). Therefore, the detected of function prediction by homology (32%) and the detection of homology similarity establishes the presence of an which may have without functional assignment (another 32%). While the latter number is a similar function in a methanogenic archaeon and enables us to comparable to values obtained from the analysis of other model genomes, the predict the presence of additional genes in archaea for cobalamin function prediction is significantly lower. biosynthesis.

Table 1. The 12 ORFs in archaea for which novel functions are predicted on the basis of sequence similarity

Querya Closest homologue P(N) Predicted function Archaea and eubacteria only ygln_metvo swisslP29932lcobq_psede 1. le-64 Cobyric acid synthase ylg5_desam swisslP28248ldcd_ecoli 1.3e-06 Deoxycytidine triphosphate deaminase ydo3_sulso trembllZ21677ltmribprsa 0.0014 50S Ribosomal protein L24 ynil_mettl swisslP297491mtt8 theth 0.0047 Modification methylase Archaea and eukaryotes only yrp2_theac gpIL189601humeif4c_1 3.8e-15 Protein synthesis factor eIF-4C yory_pyrwo swissIP05744Irl37-yeast 1.6e-13 60S Ribosomal protein L37 yorz_pyrwo swisslPI55651trml_yeast 2.1e-07 N2,N2-dimethylguanosine tRNA methyltransferase Ubiquitous families yac2_metso swissIP27095Iacua_metso 1.3e-51 Acetyl-coenzyme A synthetase yac lmetso swissIP27095Iacua_metso 4.2e-23 Acetyl-coenzyme A synthetase yeno_halma swisslP32905lnabl_yeast 2.5e-31 NAB 1, putative laminin receptor, RS2 ypvl_mettf trembllL16793Ispcdcl8xg 8.5e-07 cdc 18, ATP binding protein ypzl_mettf trembllL16793Ispcdcl8xg 8.7e-07 cdc 18, ATP binding protein

Query: Swiss-Prot name; closest homologue: unique identifier of closest homologue in databases (format: dbaselaccession numberlidentifier); P(N): probability value from BLAST; predicted function: function of closest homologue, tentative assignment. aThe 12 ORFs fall into 10 functional classes, one of which has been previously found in archaea (acetyl-CoA synthetase has been identified before; upstream of the two ORFs, the closest homologue is also an archaean protein). Nucleic Acids Research, 1995, Vol. 23, No. 4 567

Table 2. The 18 ORFs in archaea for which their functions have been predicted and published recently

Query Closest homologue P(N) Predicted function ys91_halma swissIP221391rpbx.yeast l.le-18 DNA-directed RNA polymerases RPB10b ys92_halma pironlyIS386271S38627 3.7e-06 DNA-directed RNA polymerases RPB6b y9k_halha pironly1S237951S23795 3.le-12 DNA-directed RNA polymerase Hc yrp I_sulac swissIP31815Irpoh_thece 1.le-21 DNA-directed RNA polymerase Hc

yl4k_halha swissIP15738Iyl4k_halmo 7.6e-50 NusA transcription regulatord yl4k_halmo swissIP15739Iyl4k_halha 3.4e-50 NusA transcription regulatord yrpl_theac swisslPl4026lyrp7Lmetva 7.4e-16 NusA transcription regulatord yrp3_sulac swisslP29157lyrpl_thece 4.2e-22 NusA transcription regulatord yrp7_metva swissiPI 15231yrp3.sulac 1.7e-17 NusA transcription regulatord yrpl_thece swissIPi 15231yrp3_sulac 2.0e-22 NusA transcription regulatord

yrpl_metva swisslP08245lycih_ecoli 2.9e-09 Protein translation factor Suile yrp2_sulac swissiPi 15221rl3e_sulaca 1.6e-49 Ribosomal protein L30f

y36k_metsm trembllX03250lmtcomppu 1.9e-64 AIR carboxylasesg yhmf_metfe swissnewlP37819lspeb strcl 2.5e-14 /arginase familyh yhsh_halma swissIP36132lykl8_yeast 2.5e-27 Glycoproteinaseij yrbl_halcu swisslP32235lgtplIschpo 4.3e-40 GTP binding protein Ik yrp8_metva swissnewlP37742lrfk7 ecoli 0.0072 Phosphomannomutase/phosphoglucomutase family' ysyn_metfe swissnewIP38062ii2a6.rat 1.5e-1 1 IF2-associated Met aminopeptidasem

Annotation as in Table 1. References are given to the publications where the prediction has been made. The 18 ORFs fall into 12 functional classes. aRenamed in the latest Swiss-Prot release. The table is split in four sections (from top to bottom) according to function: transcription, regulation, translation, metab- olism. Phylogenetic distribution data (as in Table 1) is available from the authors on request. bK. McKune, N. A. Woychik (1994) J. Bacteriol. 176, 4754-4756. CH. P. Klenk, P. Palm, F. Lottspeich, W. Zillig (1992) Proc. Natl. Acad. Sci. USA 89, 407-410. dT. J. Gibson, J. D. Thompson, J. Heringa (1993) FEBS Lett. 324, 361-366. eC. Fields, M. D. Adams (1994) Biochem. Biophys. Res. Commun. 198, 288-291. fA. Bairoch, B. Boeckmann (1991) Nucleic Acids Res. 19, 2247-2249. gD. J. Ebbole, H. Zalkin (1987) J. Bio. Chem. 262, 8274-8287. hC. A. Ouzounis, N. C. Kyrpides (1994) J. Mol. Evol. 39, 101-104. iK. M. Abdullah, R. Y Lo, A. Mellors (1991) J. Bacteriol. 173, 5597-5603. jE. Arndt, C. Steffens (1992) FEBS Lett. 314, 211-214. kj. D. Hudson, P. G. Young (1993) Gene 125, 191-193. 'K. Robison, W. Gilbert, G. M. Church (1994) Nature Genet. 7, 205-214. mJ. F. Bazan, L. H. Weaver, S. L. Roderick, R. Huber, B. W. Matthews (1994) Proc. Natl. Acad. Sci. USA 91, 2473-2477. dCTP deaminase. The ORF present in the opposite strand of a logue). Ribosomal proteins known to be common only to archaea locus coding for an ATP-dependent eukaryotic-like DNA and eubacteria are very few (25) and it can be predicted that L24P (ORF3) in Desulfurolobus ambivalens (20) is 33% identical to the may also be found in eukaryotes. dCTP deaminase from Escherichia coli (21). This enzyme converts dCTP to dUTP, which is then converted to dUMP and Modification methylase. A partial ORF (C-terminal part only) at used as for thymidylate synthase (21). This reaction was the nijfH locus of Methanococcus thenrolithotrophicus (26) is only known to be present in E.coli, while in eukaryotes dUMP is similar to a modification methylase from Thernnus aquaticus (27) produced by dCMP deaminase (22). It is therefore unlikely that and to the methyltransferase coded by the bcgIA gene from dCTP deaminase will ever be found in other eukaryotes. Bacillus coagulans (28). The sequence identity between the two enzymes is 26% and, in addition, the PROSITE pattern for N-6 SOS ribosomal protein L24 (L24Pfamily). An unidentified ORF adenine-specific DNA methylases (29) is present in this particu- (ORF3) between the DPa gene and ribosomal proteins LIi, LI, lar ORF (positions 91-97). The presence of modification L10 and L12 (in that order) of Sulfolobus solfataricus (23) is methylases and related restriction systems has been previously similar to the 50S ribosomal protein L24 (L24P family) of confirmed in archaea (30), yet this case represents the detection eubacteria (24) (38% sequence identity with the closest homo- of an archaean methylase of a distinct type. 568 Nucleic Acids Research, 1995, Vol. 23, No. 4

Archaea and eukaryotes only yeast Schizosaccharomyces pombe, named CDC18+ (44). This protein has been identified as a control factor that allows cells to Translation initiationfactor eIF-4C. An ORF at the locus coding enter the S phase of cell division, preventing mitosis until the S for DNA-dependent RNA polymerase subunits H, B, Al and A2 phase is completed (44). Another protein from the yeast in Thennoplasma acidophilum (ORF125) (31) has 31% sequence Saccharomyces cerevisiae, CDC6, which controls replication and identity with eukaryotic translation initiation factor eIF-4C (32). nuclear division (45), also belongs to this family (Fig. 3). It is not Translation factor eIF-4C is involved in the initiation pathway, clear what the exact function of the archaean proteins is for the enhancing ribosome dissociation and stabilizing the interaction of replication of the plasmids. This protein family contains the initiator tRNA with the 40S ribosomal subunit (32). typical motifs A and B present in purine NTP binding protein families (46) (Fig. 3) and it seems to be distantly related to a large 60S ribosomal protein L37 (L3SAEfamily). The first upstream family of ATP binding proteins, which includes FtsH, PAS 1 and ORF (ORFy) in the antisense strand of the locus coding for CDC48 (data not shown) (47). glyceraldehyde-3-phosphate dehydrogenase from Pyrococcus woesei (33) is 34% identical to tRNA binding 60S ribosomal protein L37 from yeast (L35AE family) (34). It is unclear how Archaea only many ribosomal proteins are common only between archaea and There still remain in the dataset 25 families of ORFs ofunknown eukaryotes (35). function, unique to archaea. Their functions cannot be currently predicted, but the presence of homologues in various related N2,N2-dimethylguanosine tRNA methyltransferase. The second species implies that these ORFs represent genuine proteins. partial ORF (ORFz) at the same locus as the previous case (33) Sensitive methods such as pattern and profile searches (43,48) is 50% identical to the protein coded by the TRMI gene of failed to detect homology to proteins of known function. Saccharomyces cerevisiae (36). This enzyme is necessary for the N2,N2-dimethylguanosine modification of tRNA (36). Previous DISCUSSION observations have already established the presence of modifica- tion mechanisms for rRNA and tRNA in archaea (37). Some general patterns in sequence analysis emerging from genome projects (17,49) were reproduced in this effort. In Archaea, eubacteria and eukaryotes particular, the level of detection for sequence similarity was 64% (61 of 95 sequences have a homologue in the databases, 34 A sequential duplication of acetyl-CoA synthase? The acetyl- sequences do not), while the function assignment was rather low, CoA synthase gene (acs) has been cloned from Methanothrix at 32% (30 sequences have a homologue ofknown function). Yet soehngenii (38) and was reported to be a single copy gene in this 30 interesting cases were found where function can be assigned methanogenic archaeon. The two ORFs downstream of acs to the corresponding ORFs, of which 12 represent 10 new (ORFl, ORF2-partial) were reported not to show any significant functions in archaea. Of these novel protein families discovered, similarity to proteins in the databases (38). However, the two some are not surprising (e.g. ribosomal protein L37), while some ORFs have high sequence similarity to the upstream acs gene are most remarkable (e.g. CDC18 family). (ORF1 55% identity, ORF2 73% identity) and in consecutive Another implication ofthis particular study is the realization of order, a fact that may be interpreted as a sequencing error the value of updating datasets of interest regularly. With better resulting in a frameshift. Thus, interestingly, the two ORFs tools and larger and richer databases, the task of homology downstream ofacs in M.soehngenii represent a gene duplication. identification has become easier and the probability of detecting The functional significance of this tandem arrangement is not homologies to proteins ofknown function has reached an average clear. level of-50%. The challenge ofcollecting all hypothetical ORFs from E.coli, yeast, C.elegans or any other model organism and A 'laminin receptor'-like protein is the archaean RS2. An ORF submitting them to regular database searches to update our downstream of the enolase gene (MSG) at the S9 locus of the knowledge about their possible function by homology is im- archaeon Haloarcula marismortui (39) is 40% identical to the mense. putative laminin receptor of eukaryotes. This was noted in the The biological implications of the present and other recent original publication, but the observation was offered without findings reinforce the idea that archaea are indeed the domain further explanations (40). In two recent reports, ribosomal which holds the secrets about the origin ofthe cell in general and proteins S2 have been found to share sequence similarities with the eukaryotic domain in particular. The sharing of components the putative laminin receptors (41,42). Now this observation is from both eubacteria and eukaryotes is most puzzling. confimned by the discovery of a 'missing link': as shown here The most convincing demonstration of this duality emerges (Fig. 2), this family includes the archaean and eukaryotic from the study of genomic organization in relationship to gene homologues of the ribosomal protein family S2 from eubacteria expression. Curiously enough, archaea display the typical (40). Profile searches (43) with the putative laminin receptors eubacterial mode of gene organization in operons, while their identify the eubacterial ribosomal proteins S2 unambiguously. various levels ofgene expression seem to be related to eukaryotes The sequences are of comparable length, but seem to evolve (e.g. transcription and translation factors). Whether the eukaryo- rapidly, thus making the identification ofhomology possible only tic genome composition is a primitive characteristic or a derived by using sensitive and selective methods of similarity detection. one, amplified by long evolution and natural selection, cannot be decided at present. Two CDC18 homologues. Two homologous ORFs (ORFI) coded Even more enigmatic is the observation that even within by two plasmids (pfZl and pfVl) in M.thermoformicicum (30) archaean operons, there are certain discrepancies as far as are in tum 21% identical to a recently identified protein from the sequence similarities with the other two domains are concerned. Nucleic Acids Research, 1995, Vol. 23, No. 4 569

zE L W FT': fi S 1 a Rs:E*iA A GVH H T R Y WT KIaK p FNG -NK| '. H I N EN PNFFN E7L A EL NEKQASRLRK F .T AA. 77 FA N ILF D H IL T N WFKTIRS 11 A G V H F G HH T F WINIP ININIK p y N 11 H I N L E LELIFIN DAIL G,NKLASANNTSNT TPR Y I - D A A S' R G K F A AD AVE A I R A R C H Y VIN K T N ';STTETi 114 Rs iTobac AAVH F AG H G T K WINIP KINIA p yEtI AIRI-AE- II Hlr N L T A R F|L|S EAIC D LIVF FLVG T ILG II, A G G IH T R K W|INFP DIIA p Yy I|S VKlr KG 1HT N L T A R F|L|S EIAIC D LIVF Y A A S' R G K FFL 11; T Ki AD AA N K ;IF T N WASTTETf 114 Rfs Epivi tfF LV3 A AAIRSHACVI S C H Y V N K IFS II,T N WSITYTf 114 RNi_a ize A G V F G FH G I K W|NI KINIAI p y N L A A R FILIS EAIAC D LF D A A S GFATLS 1AT AD ,L V MAA h h ANASSEAEAFL A K R C H Y V N Q ILC II,T N WSTIETR 114 Rs i_Narpo AGA H F G H Y A R K WNN jP KN A p Yy ITrBl-K£ IH II L TA A R F|L| EAC D L|INASSK5KQ FL;.TSA'"TE AD L I E A L A II, Rs!_Galsu V H R F N WNPKPNVD s F I YABE - N I I vLQ AFLL. EAI LLA EKGEK K FLD VA. TER EL NT I L A E A E C N S F Y IIN Q ILG T N X VT I S f 11: H L GE 1;. V F?F K L C N S F F IT K T N WITIEN. IiL Rsi8Euggr V H L G H K V K Q WNNP NR|I I Yy I Yl ElEX-K DL L I Y CILIK KC N F SV KGK R A|LIF V C T SI L T RA V ILG LLII, I T T K 111 Y V F -K D V Y V N S I E G K K V S A ER 4LII V D L H A N V T F S F Y I IT E ILG IA T N Nw Rsi Astlo AH L G HF S T W EFlK K y II KT J Y|LEK K C EE1 IILIF A WITTE SG Ili- A s y I DN Q LYLKEA K v G lS E G G I IILIF L G T R KR G L E A K K T H Y Y V ;II T N RtO4-Yeast NLG Q S T S L W AS AQ YoY G E Y- I W YNNN F F V 111 - I H T N - A L VN T N AAEY ] 5A K K T N Q S I N ;I1 GG RtEOANarpo L G iHR M P TSD FAQ G YYA FA |- N IC EK L I LIR C N LIAACSI I S K G H NK IL NablIYeast - -A R N V Q V H E p yEYV FIN |I-ND I GK W e LIV L A R IIIA AIIP N F E D V AASS R T RA NI A T G A - T I A G ?TI IA F T N N -A VHI AN R H G E A T N Laur/Pneca A NG --AS R NRAC E L p Yy VFK RWA G I W E ELI LAA R I|I|V T|IIE N F T D I I AIVIS S S cQIRA H K Y T A I, ?TI --- S vm P R - A G R TI |AAFL T N lic LamrNHuuan A NGA - T N L D FQAIIE Q Yy IYl1 RR DG N LK R W E RLIL AIA R AIYV AI E N A D V S VIAIS S N RA,V L K F AAA T G A T I y 11 G Iyl IIN RN w L AIA R IIIV N P A D V C S R F RA V L K F H T G A - T II A G R !TI T N Laur/Trigr AI[G --S S N L D Y QIIT Q Y VYYKR E --A N N YY Vy DG|IG ,LII N R K E LIAA R AAIIE N A D I CVAI|SVIIS A R F RA NL KFY Y TG TARA - I.A G R ?T T N Lar/AUreca CSHG V F Q|IEE Q YK RP P I, - 9 Lamr /Drome AIH L G --S E N VAjF Q E Q Yy V YK RI A G IN LG X W E XDLIQ L|AIA R A|I|VAIVA A|I|DA|IDE N F S D I F V|IIS S R P tQIRA V L K F AK Y T D -T P I A G R TI T N I P A - A IGA F Lamr/Chivi A VH L G - - TNVGSN SSQ G YY V Fl R S G IN LNRK W E ILI L|AIA R I|IA S|IE N A D V C V ASs R F RA L K F AQ STG A I I, GR ITI T N 1 14 Y - A )A T N Laur/Arath G --T K N C QINE R Y V FK RN G Ir, NGA EKUQIAIAAR VVA ENFQDIIV SARPF RA V L K F AQY TGA N G TI |A - Yeno_Halma' --TQAQKQD ERRU V T D Yl3L YN D E RIR IUA D F S N E P E Q I L A A S S hQR GR1FF AE I F E A I G AV.RT G I3 - - Y -R W G L T N - 10A Cons*nsusA G V H L G - - - N N - Y - RI G -H - I- N L - T - I IL A I I - L I K F A A I

A LF DmH I NI N NIb D A GIDF F V ri Rs2-Ecoli I R L D LEE T Q - - - S D AG T F D K L T K K E A LN h T R E L E K L E N S L G GEK D N1 G G I IVrA K N L A N - L RL G I V - ND N DP 1VG YAIFIF D Rs2_Spipl I REF ER D L E Q -- A N D G T F D K L T K K E A LN R R R eN E K L E R GAIIIFS I G D N G G LP I ED A QE IA IA N 1 P A N RsN2Tobac L H K F R D L R E -- -AQ K T G R L N R L P KR D A AN L KR Q L SR L Q T GAIIIEY L G Y N TGA VFPED DVIDIAN hEE TI T L I F I -1- T DEPI 1D AIIFC D Rs2NEpivi L H KE L R D L I E -- - Q EK A R L N R L K K D E A A V KR Q L AR L Q T Y L G GIIIK Y N T RILF D I I I HEEY K AL HHE AIIf NDI P I V EIEIA A AIIF ND| 211all4 R D L R A E - - - E K N G K F H H L F KR D A A I L KR K L S T L QR GAIIIKY L G Y N TRILP RDI V LIDEQQAA QE - -- s Rs!-Naize L S Q F ECIT IF I I NCDF D If D RsLflarpo L AQ F K D LE N -- -- K T G T I N R L P K K E A A N L K RA L D HGLIAL Q K Y L GA YN T SI LpPDIVIIIPQQKEjjFTA QQ9EC A I N C D GI P A N -E4 I II LI S- -V T lC D1 AsF If NDI 1o9 Rs2-Galsu V Q - - - A L E K K - - - E SEE G F F N S L P K E S A R L K K E L D K L RR Y F N GWR K N N T LLPPRYV REEDL TAVQQE C N KI TL AIGN|IL I T I C CT- VD D N F D E PAsN - - - E Y F R N TEN RPPHEI V I ND I NAYN RE K VI P IF ND| Rs2-Euggr I N E L - Q L S K Q KEE K H H N L L T K EK R L V L K K L K L K S D E CLKK L I T I - - - D C DTI FI vI PA N I DR L N Y Y K F N S S V D G L E L Y D S L T K R R L I S Ef K Q Q S K L A K Y Y F GAK Q K K IIpP RAYV IA AES KKA GGE II I L TCFADFEYpIFI IFPANND| 211 Rs2_Astlo E I -I C D D F ------L N PT EN RNA C L K L -V L P S T P If 140 RtO4-Yeast NE KQ E I DS N DN P TERAL SP N E TS K Q V K - -pEEDL LE I T Y PAvN - P C N A ANA E N S A ALV N L Al IF Nl 04 RE H F D F SA H NP L K D A F T S S P F D - Y FFP R F K N Q K CF EGAIIN T HN Ip LV. NA0 LE E QIL K VF LD S - ] 7E H K L XT Rt02-Narpo Q A ------YIIIT R S F KE Pp L VI RAS A AQA KEI P V E AlIf N|1 It4 NablIYNast E I A L PSIFS II Lamr/Pneca ------YIIT R S Y KE p Al NDIP R A I8N - Y A I --A D P L R Y VI A IF 1f' Lanr_Human ------Q I Q A A F R pEER L7VAT DINA0 E A S V N F A L -A S D P C N F I V T D R AID T P R IA IF lfC LaNr/Trigr ------Q I Q A A RE F HNAP T E AS YV EFI A L D S L Y V D P CNNINN1N - - P Y A II 165 Lamr/Ureca ------QIIQ TA F RE F LL VYVT ]JNDHQP TKE E S Y VN PV IAL --NT D S FL Y V DI P CNNIN ------EPN T D N II 14e0 Lamr/Drome ------Q|IQ F A F RE pEEkL EA S Y V N|IVIAL- - -T T PS L KY I A C NNIN Lamr/Chlvi Qll Q K R F R E p ER L QEP NE A IF le' E S Y V I V I A L RF V PIA PAcNNIN ------PIP L - - II if; Lame /Arath - - - Ql Q T S F SE I NA PB EI A L G N AI I I A C T F H G o YenoNHalma' ------F D Y DG Y IE ElI u aNN

- - - - - D Consensus ------I ------P D - - I V - D - - - D - - A I - E A - - I P V I A L- - - D T D S - - - L V D I I P N

b Rs2_Galsu

Rs2_Epivi

RR

Rs2 Euggr

LamrHuman I Lamr/Arath Lamr/Chlvi

Figure 2. Alignment (a) and tree (b) of the eubacterial ribosomal family S2, the putative laminin receptor from eukaryotes (which is predicted to be the eukaryotic homologue of ribosomal protein S2) and the ORF from Haloarcula marismortui (39), marked with an asterisk. The tree displays bootstrap values from 1000 simulations. It is evident that eubacterial S2 shows a much greater variability than the eukaryotic homologues. The archaean sequence is reliably placed (bootstrap value 1000) at a location equidistant from the other two domains. The alignment figure was generated by PrettyPlot (written by Peter Rice) and the tree by TreeTool (University of Illinois). Sequence identifiers are from Swiss-Prot; provisional names are marked accordingly (name/species). Species names conventions as in Swiss-Prot. Sequence positions are marked.

First, the S9 operon ofHaloarcula marismortui (39), which with unique or highest similarity to eukaryotes, yet its structure is the present work has been completely characterized, contains the reminiscent of eubacterial operons. genes for the ribosomal proteins L29, L13, S9, the two subunits Finally, it is expected that the more our knowledge of archaean of the eukaryotic RNA polymerase RPB 10 and RPB6 (Table 2), sequences increases, the more evidence of a mosaic genome an enolase gene and the ribosomal protein S2. This operon is an organization will be obtained. An interesting goal is the collection example where some of its proteins only have eukaryotic and documentation of proteins common to all three domains, homologues, others only eubacterial ones and some homologues allowing us to speculate about the origins and nature of the to both domains. Secondly, the locus coding for DNA-dependent universal common ancestor (37), a hypothetical genetic or RNA polymerase subunits H, B, Al, A2 and eIF-4C in pre-cellular entity present before the divergence of the three Thermoplasma acidophilum (31) contains mostly ORFs with prime domains of life. It seems that the common intersection of 570 Nucleic Acids Research, 1995, Vol. 23, No. 4

. . t t . . t I Ypvl_Hettf R Lfr N RV --FL VFD T QD X V G A 7 S tTGYIrG A T F H L 7 T T K Y Ypzl_Kettf K LE R V T-D h-KI, GA S QIY LI Y II WG A T F F H I F T T V RKi --FLMVFD A Q Y ' LVNL N S S D G E A E H D ` S- Cc(-Yeast _ss LRRSRSAVL .ETGFA RT EY V Ni FJLAR H S3fi S I L D NGKTtI k K F Q- S L F L L S~T F SFz3 V E L F. H T 14 F N L`<11 L S W F E L F is'- CdclS/Schpo REIS]E'GiQAfN A; '; AYI1L CNVVStYE F -

------Consensus F K D L S - F - - - - L - - fi - - E ------L - - - L D - - - - L - I - G F F G - G F T V - - - V; ------1..

Ypvl._Hettt F~~~~~~E H T D I I., A D G T A QA T S I A A F rh G L G N I KIRhERhA SE G K 8 I D E I E, _INE,E SY Yr FL *tEl V. D, RTLS G3- Kl L H L -' F. 14- Ypzl_HetSti NIELL~~~S E EEH T SD I AD GTAYIASIATS IA RAAFr GL GIF L N SI EK SAEABRA S E GK NSSII.HND El I K s Ar rIG L--.j-J H L K1F 14i CcEYeast D G ALE S VA VT NC IS L GE FS SF SFD SF --LQLLNV.F TL -~IK NNQH Q K F lYHKK--TT F''V-L D NIDIALH A NTSVSE T-.o AV:F,TI LIE A FL F - - - - - KL K Cdcl3/Schpo K V N V C Y JC T I N E P Fm H S K I IV E IUL E N E D H H I N F y C E jE s H FQ S A N E L Y N F - V,tl I V; L D I H L - - - A h E C. E 'V L Y T L F j F Sc F F .1 - - - - Consensus ------I N - L ------A - - E - I ------Q V ------E ------I I V - D E - D - - L ------L L 'i- - - -

Ypvl-Nettf E P N V C ]I K LSnT V- N I G D S G V I S SK R R A ' ErF 9 L N Y v------E F N D G nE D G'J i..14 Ypzl_Nettf E P N V C I VGLSIN ILITV I G D S G V SIF R RISF A L N y V ------E D G F L I IQL NOF N ELE D D V Cc6.Yeast T V S F V L ]AN S LID X R F L S h L N L D R GL Q 1lm Q Yl Al S I I Q T N S S L P T I ------I FI N A I F F '0 Cdcl BSchpo T S R L I L .LAu AJD R F L P R L R - T K tT K L t1T Q S I K AL K T A A T T S E K N N P F T F I K S I S EV' S D D S I N ': V S ' H(aD E T F H P A A E 4 0@

------F - Consensus V G L -N -L D F - - I S F - P Y A - L - E I - - - R ------0-- A - - - - L ------L C 00

Ypvlj4ettf A L A Q R N GA Y L -FLIMQ LMG V V EESSES MA T D GESE E F ------Q h N IEI A s 1 Ypzl_Nettf AA L AIAIQ R L D YSE S V T DVE V E ------E NGDIAlRLLDNLILGSFDIL D ISIQ8 LGV NATIA FL--ARGLIR D N I ii Cc6-Yeast Ai CCN ETD F RSRG IBSB F L LP TG S LiN S A IF L T7 T S P V K K S Y F E F Q G F I G L N Y I A K V F SK F V N N N S T A K i I KILSL CIT 40 V A S S= I t ITR HWSUl - - S V JF A SWA S ------Cdcl8/Schpo RER E^ -KIM-Q H JNT R H V V R A T S A N Q - - S AAR NT G L QIQ A I T 44f Consensus A - - - A - - - G D - R - A L D - L - - A - - I A - R - - K - - - S - - D ------E 'J - - - I ------R - I - - L - - - G K - L - - 4 0

Ypvl_lettf - - - - G G EEQFG N A R RS Q L L R ErI KEME V G h G RAR INHf N E A I RSAR ------004 iTMnN - - - G E F Q LfrT FVS Ypzl_llettf IVI NqIPSI - T1 1YI N A I Q GG N A SQRR-R S Q LLVLI4L R L L IBIIE V V G h G R C C V N H F S D N E A I R R ------I Cc6Y*ast Q|IL I L N S D I I I T K T T L A PLR N |L E I C T ILqICG K TCGK T K -FI D LD E Y D ET K I S I L K F F L H - - 51? V L DI8a|F|D S Cdcl8/Schpo VCE K T S DuULRRV AIDVFE S C L R D R L I Y RSFSEQCD V A N - T F Q R N G Q DIW - D - - - I T A G D I G T L K R F F D fi R C------Consensus V - S - T - - E I K Y - N - L T ------L E - Y G -LL V - I ------V - I D - L - L - - - L ------500

Figure 3. Alignment of the two homologous ORFs (ORF1) coded by plasmids (pfZl and pfVl) in M.thermnoformicicum (30) with their closest relatives, from fungi. The ATP binding residues and the conserved hydrophobic positions preceding them are marked with an asterisk. Annotation as in Figure 2. Figure also generated by PrettyPlot.

the protein sets present in all organisms today contains only some 23 Ramirez, C., Matheson, A.T. (1991) Mol. Microbiol. 5, 1687-1693. components of the gene expression machinery, such as ribosomal 24 Ohama, T., Muto, A., Osawa, S. (1989) J. MoL Evol. 29, 381-395. 25 Otaka, E., Hashimoto, T., Mizuta, K. (1993) Protein Sequence Data Anal. proteins, as well as the series of enzymes participating in basic 5, 285-300. metabolic pathways. 26 Souillard, N., Magot, M., Possot, O., Sibold, L. (1988) J. MoL Evol. 27, 65-76. 27 Barany, F., Danzitz, M., Zebala, J., Mayer, A. (1992) Gene 112, 3-12. REFERENCES 28 Kong, H., Morgan, R.D., Maunus, R.E., Schildkraut, I. (1994) Nucleic Acids Res. 21, 987-991. 1 Woese, C.R., Kandler, O., Wheelis, M.L. (1990) Proc. Natl. Acad. Sci. 29 Bairoch, A. (1992) Nucleic Acids Res. 20, 2013-2018. USA 87,4576-4579. 30 N61ing, J., van Eeden, J.M., Eggen, R.I.L., de Vos, W.M. (1992) Nucleic 2 Klenk, H.-P., Doolittle, W.F. (1994) Curr: Biol. 4,920-922. Acids Res. 20,6501-6507. 3 Ouzounis, C., Sander, C. (1992) Cell 71, 189-190. 31 Klenk, H.-P., Renner, O., Schwass, V., Zillig, W. (1992) Nucleic Acids Res. 4 Marsh, T.L., Reich, C.I., Whitelock, R.B., Olsen, G.J. (1994) Proc. Natl. 20, 5226-5226. Acad. Sci. USA 91,4180-4184. 32 Dever, T.E., Wei, C.L., Benkowski, L.A., Browning, K., Merrick, W.C., S Rowlands, T., Baumann, P., Jackson, S.P. (1994) Science 264, 1326-1329. Hershey, J.W. (1994) J. Biol. Chem. 269, 3212-3218. 6 Langer, D., Zillig, W. (1993) Nucleic Acids Res. 21, 2251-2251. 33 Zwickl, P., Fabry, S., Bogedain, C., Haas, A., Hensel, R. (1990) J. 7 Kaine, B.P., Mehr, I.J., Woese, C.R. (1994) Proc. Natl. Acad. Sci. USA 91, Bacteriol. 172,4329-4338. 3854-3856. 34 Santangelo, G.M., Tomow, J., McLaughlin, C.S., Moldave, K. (1991) 8 Huet, J., Schnabel, R., Sentenac, A., Zillig, W. (1983) EMBO J. 2, Gene 105, 137-138. 1291-1294. 35 Otaka, E., Hashimoto, T., Mizuta, K., Suzuki, K. (1993) Protein Sequence 9 Ouzounis, C.A., Kyrpides, N.C. (1994) J. Mol. Evol. 39, 101-104. Data Anal. 5,301-313. 10 Bazan, J.F., Weaver, L.H., Roderick, S.L., Huber, R., Matthews, B.W. 36 Ellis, S.R., Hopper, A.K., Martin, N.C. (1987) Proc. Natl. Acad. Sci. USA (1994) Proc. Natl. Acad. Sci. USA 91,2473-2477. 84, 5172-5176. 11 Bairoch, A., Boeckmann, B. (1991) Nucleic Acids Res. 19, 2247-2249. 37 Woese, C.R. (1987) Microbiol. Rev. 51, 221-271. 12 Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) J. 38 Eggen, R.I.L., Geerling, A.C.M., Boshoven, A.B.P., de Vos, W.M. (1991) Mol. Biol. 215,403-410. J. BacterioL 173, 6383-6389. 13 Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C., 39 Kromer, W.J., Arndt, E. (1991) J. Biol. Chem. 266, 24573-24579. Sander, C. (1994) In: Second International Conference on Intelligent 40 An, G., Bendiak, D.S., Mamelak, L.A., Friesen, J.D. (1981) Nucleic Acids Systems for Molecular Biology, Altnan, R., Brutlag, D., Karp, P., Lathrop, Res. 9,4163-4172. R., Searls, D. (eds). AAAI Press, Stanford, CA, pp. 348-353. 41 Davis, S.C., Tzagoloff, A., Ellis, S.R. (1992) J. Biol. Chem. 267, 14 Sander, C., Schneider, R. (1991) Proteins 9, 56-68. 5508-5514. 15 Higgins, D.G., Bleasby, A.J., Fuchs, R. (1992) Comput. Appl. Biosci. 8, 42 Tohgo, A., Takasawa, S., Munakata, H., Yonekura, H., Hayashi, N., 189-191. Okamoto, H. (1994) FEBS Lett. 340, 133-138. 16 Etzold, T., Argos, P. (1993) Comput. Appl. Biosci. 9,49-57. 43 Gribskov, M., Luethy, R., Eisenberg, D. (1990) Methods Enzymol. 183, 17 Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R., 146-159. Sonnhammer, E. (1992) Protein Sci. 1, 1677-1690. 44 Kelly, T.J., Martin, G.S., Forsburg, S.L., Stephen, R.J., Russo, A., Nurse, P. 18 Possot, O., Sibold, L., Aubert, J.P. (1989) Res. Microbiol. 140, 355-371. (1993) Cell 74, 371-382. 19 Crouzet, J., Levy-Schil, S., Cameron, B., Cauchois, L., Rigault, S., 45 Bueno, A., Russell, P. (1992) EMBO J. 11, 2167-2176. Rouyez, M.C., Blanche, F., Debussche, L.,Thibaut, D. (1991) J. BacterioL 46 Ouzounis, C.A., Blencowe, B.J. (1991) Nucleic Acids Res. 19, 6953-6953. 173, 6074-087. 47 Gorbalenya, A.E., Koonin, E.V. (1993) Curr: Opin. Struct. Biol. 3, 20 Kletzin, A. (1992) Nucleic Acids Res. 20,5389-5396. 419-429. 21 Wang, L., Weiss, B. (1992) J. Bacteriol. 174, 5647-5653. 48 Hodgman, T.C. (1989) Comput. Appl. Biosci. 5, 1-13. 22 Mathews, C.K., van Holde, K.E. (1990) Biochemistry. Benjamin-Cum- 49 Bork, P., Ouzounis, C., Sander, C. (1994) Curr Opin. Struct. Biol. 4, mings, Redwood City, CA, pp. 763-764. 393-403.