bioinformatics
Computational comparisons of model genomes
Christos Ouzounis, Georg Casari, Chris Sander, Javier Tamames and Alfonso Valencia

Complete genomes from model organisms provide new challenges for computational molecular biology. Novel questions emerge from the genome data obtained from the functional prediction of thousands of gene products. In this review, we present some approaches to the computational comparison of genomes, based on sequence and text analysis, and comparisons of genome composition and gene order.

C. Ouzounis is at the Artificial Intelligence Center, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA. A. Valencia and J. Tamames are at the Protein Design Group, Centro Nacional de Biotecnologia, CSIC, Campus U. Autonoma, E-28049 Madrid, Spain. G. Casari is at the European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69012 Heidelberg, Germany. C. Sander is at the European Bioinformatics Institute, EMBL, Hinxton Hall, Cambridge, UK CB10 1RQ.

TIBTECH AUGUST 1996 (VOL 14). Copyright © 1996, Elsevier Science Ltd. All rights reserved. 0167-7799/96/$15.00. PII: S0167-7799(96)10043-3

With the recent publication of the complete genome sequences from two bacteria, Haemophilus influenzae Rd¹ and Mycoplasma genitalium², new challenges are emerging for computational biology. Two such challenges are (1) to predict and annotate the functions of the gene products as rapidly and completely as possible, and (2) to derive adequate abstractions that make genomes comparable at a higher-than-molecular level.

Function prediction is a primary goal of genome-sequence analysis, as many newly determined sequences have no experimental information associated with them, while functional information can be derived by examining homology to proteins of known function. Prediction can be carried out by integrating and co-ordinating a number of well-tested methods that rapidly and efficiently identify sequences with the highest degree of similarity from complete databases, and can, therefore, assist function prediction using homology observations. The GeneQuiz system³ automatically annotates protein-encoding sequences and can identify novel functions, e.g. for H. influenzae⁴ and M. genitalium⁵ sequences*. The use of integrated databases and software tools, combined with the application of a number of empirical rules that can automatically eliminate false annotations⁶, makes this possible.

*Analysis of the H. influenzae, M. genitalium and S. cerevisiae genomes, including functional classification, is available on the World Wide Web at

If the functional annotations are known, what is the next step in genome analysis? We have been exploring new ways to make use of this information, and have been performing global comparisons of genomic data, addressing a number of questions. These analyses yield a profile of the composition of genomic functions of an organism, identifying some components that are common to other species, and some that appear to be unique. These predicted 'expression patterns' can help in the identification of novel or potential metabolic or regulatory pathways, and provide a faster route to the development of targets for drug design and discovery.

Orthologues: functionally equivalent genes across species

A gene with a certain level of sequence similarity to its homologue in the genome of another species may have the same function as its homologue; such genes are defined as orthologues⁷. How is it possible to determine whether the two proteins encoded by the genes of different species have the same function? Proteins change during evolution, forming families of related molecules that have similar primary, secondary and tertiary structures, but which have divergent functions. Algorithms for sequence comparison can detect genes encoding homologous proteins, but are unable to determine definitively whether two molecules have exactly the same function. Therefore, function prediction by detection of sequence similarity, especially in incomplete genomes, can only be approximate, because it usually makes use of genes encoding the most similar proteins; however, these genes might not yet have been identified.

In complete genomes, however, there is a finite number of genes, so it is possible to determine which genes share the highest degree of similarity, thus narrowing the set of experiments that must be performed to prove that proteins encoded by orthologous genes have the same function. The average similarity value can therefore be calculated, which helps in estimating the rate of change in different families. The redundancy of homologous families of proteins (e.g. transport proteins in bacteria) can also be measured, and the most conserved pairs of proteins can be identified. The first analysis of this kind was performed with a contig set from Mycoplasma capricolum⁸ that encoded almost half of the total number of proteins. We have found that the average similarity between protein-encoding genes in M. capricolum and Escherichia coli is 40%, with a wide variance (Fig. 1). It should be noted that analyses of putative orthologues in incomplete genomes provide a lower estimate of sequence similarity between orthologues. There may be other genes, as yet unidentified in either genome, that display an even higher similarity.

The identification of orthologues can be used in the reconstruction of metabolic pathways encoded by various bacterial genomes; E. coli metabolism has been used as the model⁹. This approach can provide a deeper understanding of the primary-sequence data, and can help in the identification of potentially interesting targets for medical, industrial and environmental applications.

Genome compositions from automated function-classification

Functional classification of genes and gene products is another aspect of genome analysis that can provide a basis for comparative genomics.

Figure 1. Frequency histogram indicating the sequence identity of genes encoding 82 Mycoplasma capricolum proteins with their putative orthologues in Escherichia coli⁸ (horizontal axis: sequence identity (%), 25-75). These numbers represent lower estimates: if protein-encoding genes with higher sequence similarities are found in either genome, the distribution would probably shift to the right. The three most conserved gene pairs between the two bacteria are those encoding FtsH, enolase and ClpB, all of which share ≥63% sequence identity over the length of available sequences (60-125 amino acids).
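The identity distribution summarized in Figure 1 could be tabulated along the following lines. This is only a minimal Python sketch under stated assumptions, not the procedure used in the cited analysis: the `hits` table, the protein identifiers, the identity values and the 25% cut-off are all hypothetical, and a one-directional best hit is used as a simple stand-in for a putative orthologue.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pairwise hits: (query protein, subject protein, % identity).
# Identifiers and numbers are invented for illustration only.
hits = [
    ("mc_001", "ec_A1", 63.0),
    ("mc_001", "ec_A2", 31.0),
    ("mc_002", "ec_B1", 65.0),
    ("mc_003", "ec_C1", 28.0),
    ("mc_003", "ec_C2", 40.0),
    ("mc_004", "ec_D1", 20.0),
]


def best_hits(hits, min_identity=25.0):
    """Keep, per query, the subject with the highest identity above a cut-off."""
    best = {}
    for query, subject, identity in hits:
        if identity < min_identity:
            continue  # assumed cut-off; too weak to call a putative orthologue
        if query not in best or identity > best[query][1]:
            best[query] = (subject, identity)
    return best


def identity_histogram(best, bin_width=5):
    """Count putative orthologue pairs per identity bin (bin lower bound -> count)."""
    counts = defaultdict(int)
    for _, identity in best.values():
        counts[int(identity // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


putative = best_hits(hits)
average_identity = mean(identity for _, identity in putative.values())
histogram = identity_histogram(putative)
```

Because only hits that are actually present in the table can be chosen, a table built from an incomplete genome can miss the true best partner, which is the reason the text treats such averages as lower estimates.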

Box 1. Functional classes for genome comparison

Our definition of functional classes used for genome comparisons corresponds with the one that was previously proposed²³ for characterizing Escherichia coli. Of the nine classes, eight are shown, and these can be classified into three super-classes; the ninth class consists of unknown functions, and is not shown. Typical keywords for the super-classes are: ENERGY: photosynthesis, oxygen transport, respiratory protein, monooxygenase, kinase, hydrophobic ion transporter; INFORMATION: activator, zinc finger, early protein, nuclear protein, developmental protein; COMMUNICATION: hormone, amidation, serine/threonine protein kinase, G-protein coupled receptor, serine protease inhibitor, cell adhesion.

ENERGY

Metabolism
• Amino acid metabolism
• Cofactor biosynthesis
• Prosthetic groups and carriers
• Central intermediary metabolism
• Energy conservation (e.g. sugar modification and degradation, pentose phosphate cycle, glycolysis, gluconeogenesis, electron transport, respiration, secondary metabolism)
• Fatty acid and phospholipid biosynthesis
• Purines, pyrimidines, nucleosides, nucleotides

Transport
• Transport through membranes

INFORMATION

DNA and RNA
• Replication (including recombination and repair)
• Transcription (including splicing)

Translation
• Translation, ribosomal proteins

Proteins
• Protein biosynthesis, folding, internal transport, translocation, post-translational modification and degradation

COMMUNICATION

Signal
• Regulatory functions
• Cell division
• Cell killing

Environment
• Interaction with environment: recognition, adhesion, defense (toxic substances), extracellular degradation

Structure
• Structural proteins
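The hierarchy in Box 1 maps naturally onto a small lookup table. The Python sketch below is illustrative only: the class and super-class names come from Box 1, but the `composition` helper, the choice to leave unknown functions out of the percentages, and the ten example gene assignments are assumptions made for the example.

```python
from collections import Counter

# Eight functional classes from Box 1, grouped into three super-classes.
SUPER_CLASSES = {
    "ENERGY": ["Metabolism", "Transport"],
    "INFORMATION": ["DNA and RNA", "Translation", "Proteins"],
    "COMMUNICATION": ["Signal", "Environment", "Structure"],
}

# Reverse lookup: functional class -> super-class.
CLASS_TO_SUPER = {
    cls: super_cls
    for super_cls, classes in SUPER_CLASSES.items()
    for cls in classes
}


def composition(assignments):
    """Percentage of classified genes per super-class.

    Genes whose class is not one of the eight listed (e.g. 'Unknown')
    are ignored, which is an assumption of this sketch.
    """
    counts = Counter(
        CLASS_TO_SUPER[cls] for cls in assignments if cls in CLASS_TO_SUPER
    )
    total = sum(counts.values())
    if total == 0:
        return {super_cls: 0.0 for super_cls in SUPER_CLASSES}
    return {s: 100.0 * counts[s] / total for s in SUPER_CLASSES}


# Hypothetical class assignments for twelve genes (invented for illustration).
genes = (
    ["Metabolism"] * 5
    + ["Transport"]
    + ["Translation"] * 2
    + ["DNA and RNA"]
    + ["Signal"]
    + ["Unknown"] * 2
)
profile = composition(genes)
```

Keeping the two levels in one table means a gene only has to be assigned to one of the eight classes; its super-class follows from the lookup.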

Figure 4. The functional compositions of two complete genomes⁴,⁵. The relative sizes of the charts reflect genome size. While Haemophilus influenzae displays a typical composition for a bacterial genome (compare with Fig. 3), Mycoplasma genitalium appears to devote most of its coding potential to replication and translation, while its metabolic capacity is restricted, owing to its parasitic lifestyle. [Pie charts; legible labels: Haemophilus influenzae, Energy (59%); Mycoplasma genitalium, Energy (40%), Communication (2%).]

Figure 2. Schema of the method used to classify sequences in functional classes. (a) Sequences are classified into functional classes; (b) a dictionary with keywords corresponding to each functional class is created; (c) the database is searched and all sequences are classified into functional classes; and (d) the procedure is iterated. Classified sequences can be input manually, or from the automatic classification provided by the method, or from a combination of both methods. [Flow diagram; legible labels: classified sequences, extract keywords, assign keywords to classes, dictionary of keywords, scan keyword field, scan other fields, assign sequences to classes, iterate.]
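The iterative scheme of Figure 2 can be sketched in a few lines of Python. This is a toy version under explicit assumptions, not the authors' unpublished system: a keyword is taken to be "uniquely associated" with a class when every classified sequence carrying it belongs to that class, sequences with conflicting evidence are simply left unclassified, and the sequence identifiers, keywords and seed classes are invented.

```python
from collections import defaultdict


def build_dictionary(classified, keywords):
    """Map each keyword to a class if it occurs with exactly one class."""
    seen = defaultdict(set)
    for seq_id, cls in classified.items():
        for kw in keywords[seq_id]:
            seen[kw].add(cls)
    return {kw: next(iter(cs)) for kw, cs in seen.items() if len(cs) == 1}


def classify_iteratively(seed, keywords):
    """Alternate dictionary building and classification until nothing changes."""
    classified = dict(seed)
    while True:
        dictionary = build_dictionary(classified, keywords)
        newly = {}
        for seq_id, kws in keywords.items():
            if seq_id in classified:
                continue
            votes = {dictionary[kw] for kw in kws if kw in dictionary}
            if len(votes) == 1:  # skip sequences with no or conflicting evidence
                newly[seq_id] = votes.pop()
        if not newly:  # converged: no further sequence can be assigned
            return classified, dictionary
        classified.update(newly)


# Hypothetical keyword annotations and seed classes (invented for illustration).
keywords = {
    "s1": {"glycolysis"},
    "s2": {"glycolysis", "kinase"},
    "s3": {"kinase"},
    "s4": {"ribosomal protein"},
    "s5": {"ribosomal protein", "kinase"},
    "s6": {"hypothetical"},
}
seed = {"s1": "Metabolism", "s4": "Translation"}
classes, dictionary = classify_iteratively(seed, keywords)
```

In this toy run the keyword "kinase" ends up attached to sequences of two different classes, so it is dropped from the dictionary and sequence `s3` stays unclassified; this is one simple way of handling the conflicts the method has to deal with.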

Figure 3. Basic functional composition of various taxa¹² (animals, plants, yeasts, bacteria and viruses; horizontal axis: functional composition (%), 0-100). Only organisms with more than 100 sequences in SWISS-PROT are represented. Viruses include phages and eukaryotic viruses, both displaying very similar composition. For each taxon, (blue) energy-, (green) information- and (red) communication-related composition is shown as a percentage of the total. The three most interesting patterns are: (1) that viruses have a small, but significant, amount of metabolic enzymes (mostly involved in nucleotide biosynthesis), (2) that yeasts and plants are remarkably similar, and (3) that there is an increase in communication-related proteins during cellular evolution.

Functional classification of this kind raises the following questions: How can we devise useful abstractions that allow model genomes to be compared? In other words, how can we classify proteins beyond the similarity of the sequences encoding them? The problem is how genomes can be compared when sequence similarity is not sufficient; for example, if most homologous proteins have not been identified, or if they are too divergent or simply not as abundant between species (e.g. between mycoplasmas and mammals). We have approached this problem by categorizing protein functions into nine classes (eight classes of major functions, and one unknown; Box 1). These classes, which correspond with a number of cellular processes, can be clustered into three super-classes: energy-, information- and communication-related genes and proteins¹⁰, in a simplification that has a physical basis; i.e. the substrates of these super-classes are usually small molecules, nucleic acids or proteins, respectively. This situation is analogous to that seen for expression patterns obtained from expressed sequence tag (EST) sequencing¹¹, except that the compositional profiles define a potential, and not an actual, genetic pattern.


Figure 5. Numbers of pairs of functionally related genes in the genome of Haemophilus influenzae¹. The two planar axes represent the classes to which neighboring genes belong; the third axis represents the number of gene pairs. The values on the diagonal clearly dominate, corresponding to adjacent genes of the same functional class²². Genes for which there is no function assignment, or for which the function assignment is not clear, are classified as unknown (and are not shown); values above 60 (only for pairs of metabolic genes) are also not shown.
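Counts of the kind plotted in Figure 5 could be tallied as in the Python sketch below. It is a hedged illustration rather than the published analysis: the gene order is invented, pairs involving an unknown gene are skipped to mirror the caption, and the expectation uses a simple shuffled-order null model chosen for this example, which is not necessarily the random model cited in the article.

```python
from collections import Counter


def adjacent_pair_counts(classes_in_order):
    """Count neighbouring gene pairs by (class, class), skipping 'Unknown'."""
    counts = Counter()
    for a, b in zip(classes_in_order, classes_in_order[1:]):
        if "Unknown" in (a, b):
            continue
        counts[tuple(sorted((a, b)))] += 1  # orientation of the pair is ignored
    return counts


def same_class_enrichment(classes_in_order):
    """Return (observed, expected) numbers of same-class neighbouring pairs.

    The expectation assumes that class labels are shuffled at random along
    the chromosome (an assumption of this sketch).
    """
    counts = adjacent_pair_counts(classes_in_order)
    n_pairs = sum(counts.values())
    observed = sum(v for (a, b), v in counts.items() if a == b)
    known = [c for c in classes_in_order if c != "Unknown"]
    n = len(known)
    if n < 2:
        return observed, 0.0
    freq = Counter(known)
    # Probability that two distinct, randomly chosen genes share a class.
    p_same = sum(k * (k - 1) for k in freq.values()) / (n * (n - 1))
    return observed, p_same * n_pairs


# Hypothetical functional classes of genes in chromosomal order.
order = [
    "Metabolism", "Metabolism", "Metabolism", "Translation",
    "Translation", "Unknown", "Signal", "Metabolism",
]
pair_counts = adjacent_pair_counts(order)
observed, expected = same_class_enrichment(order)
```

An observed count well above the expected one, as on the diagonal of Figure 5, is what would suggest clustering of functionally related genes; a proper significance estimate would need the variance of the null model as well.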

We have developed an automatic functional classification system (Fig. 2; J. Tamames, G. Casari, C. Ouzounis, C. Sander and A. Valencia, unpublished) to classify thousands of predicted protein functions as rapidly and reliably as possible. A set of pre-classified sequences is used to build a dictionary of keywords characteristic of each functional class, and this dictionary is then used to classify other sequences based on their keywords. The process is repeated, starting with the enlarged set of classified sequences, deriving new associations between keywords and functional classes, and classifying more sequences. In this way, the dictionaries are enriched with unique and distinctive patterns of keywords that describe each functional class. Inconsistencies, conflicts and other problems are dealt with, and the procedure converges when no more keywords can be uniquely associated with a class (Fig. 2).

This system has been used to estimate the compositions of functional classes of a number of species that are well-represented in the sequence databases¹². For example, it appears that during evolution, the fraction of (intra- and extra-cellular) communication-related proteins increases monotonically with the phylogenetic status of major taxa (Fig. 3). This pattern of increase clearly reflects the differences between prokaryotic and eukaryotic cells, and the changes that occurred during the transition from unicellular to multicellular organisms. In our analysis, differences in genome composition are observed between the genes encoding the complete sets of identified proteins of H. influenzae and M. genitalium (Fig. 4), and for the genes encoding the proteins of Saccharomyces cerevisiae (yeast), the first completely sequenced eukaryotic genome¹³. Functional compositions correspond well with the estimates previously obtained from the taxa to which these species belong. These compositional patterns are useful general descriptors of the genomic make-up of model organisms, and implicitly define a similarity measure between genomes that can be used in future comparative studies.
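One simple way to make such an implicit similarity measure concrete is to treat each compositional profile as a vector of percentages and compute a distance between vectors. The article does not specify a measure, so the Euclidean distance used in this Python sketch is purely an assumption, and the three profiles are invented numbers rather than values taken from the figures.

```python
import math


def profile_distance(p, q):
    """Euclidean distance between two compositional profiles (percent per super-class)."""
    keys = sorted(set(p) | set(q))
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical profiles; the percentages are invented for illustration.
genome_a = {"ENERGY": 60.0, "INFORMATION": 30.0, "COMMUNICATION": 10.0}
genome_b = {"ENERGY": 40.0, "INFORMATION": 55.0, "COMMUNICATION": 5.0}
genome_c = {"ENERGY": 57.0, "INFORMATION": 34.0, "COMMUNICATION": 9.0}

d_ab = profile_distance(genome_a, genome_b)
d_ac = profile_distance(genome_a, genome_c)
closest_to_a = "genome_c" if d_ac < d_ab else "genome_b"
```

A smaller distance means a more similar functional make-up; other choices, such as a correlation between profiles over all eight classes, would serve the same purpose.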


Conserved clusters of functionally related genes

Is chromosome organization conserved in related species? This question, which was first raised in relation to genomic maps of plastids¹⁴,¹⁵ and bacteria¹⁶⁻¹⁹, was investigated using only homologous sequences. However, using the comprehensive functional classification described above, we can ask whether functionally related, but not necessarily homologous, genes of any two genomes are in a similar sequential order along chromosomes.

To facilitate this, we have compiled data from intra- and inter-genomic comparisons of E. coli and H. influenzae. All E. coli sequences were classified into the nine functional classes, using annotation from the sequence databases in conjunction with keywords, and a functional class dictionary was created. The relative locations of genes (not their absolute order) were taken from genome repositories, and genes for which there is no accurately mapped location were omitted. The positions of genes, which were deduced from the order of the genes along the chromosome, were used for comparing the two genomes¹,²⁰. The numbers of gene pairs are compared with the results from a random model²¹, which estimates the probability of two functionally related genes being adjacent on the genome, in order to establish the statistical significance of these observations.

The main conclusion is that clusters of genes belonging to the same functional classes (that do not otherwise share sequence similarity) have patterns of aggregation along chromosomes that appear to be statistically significant²². For example, in H. influenzae, there is a clear aggregation of genes encoding metabolism- and translation-associated proteins (Fig. 5). In addition, patterns of linear conservation between genomes are observed (Fig. 6). Two examples of genome-order comparison in regions of the H. influenzae and E. coli genomes are presented. The first example illustrates the case for the dcw cluster in the 2′ region on the E. coli chromosome. In this region, all sequences belong to the signal or structural classes, and the order is preserved (Fig. 6a). The second example is that of the fuc operon, at 63′ in E. coli and 35′ in H. influenzae: together with fucA, fucO (upstream) is transcribed in the negative orientation with respect to the origin in E. coli, while fucP (adjacent to fucA) and the downstream genes are transcribed in the positive orientation; in H. influenzae, all genes are transcribed in the opposite direction to their direction of transcription in E. coli (Fig. 6b). Not all of the clusters are operons, and these clusters may represent newly identified regulatory units. In addition, it may be possible to predict the functional class of unknown genes and formulate a hypothesis for the cellular processes with which they are associated, further guiding experimental analysis.

Figure 6. (a) An illustration of the conserved dcw cluster present in Escherichia coli and Haemophilus influenzae²², containing a number of functionally related genes. The direction of transcription is colinear. Genes are labeled according to their name in E. coli, and markers represent functional classifications: red circles are used to denote genes encoding proteins involved in signaling, blue squares are used to denote genes encoding proteins with structural functions. Axes display the absolute position of genes in the H. influenzae genome paired with their orthologues from E. coli. This example demonstrates the patterns that emerge at this level of sequence comparison. (b) The conserved fucose (fuc) operon. While the genes in the H. influenzae fuc operon (top) are all transcribed in the same orientation, most genes in the E. coli fuc operon (bottom) are transcribed in the opposite orientation, with the exception of the first gene. Some rearrangements are also visible. Genes encoding transport proteins are colored orange, genes encoding proteins involved in metabolism are colored green, and genes encoding proteins involved in signaling are colored blue. [Panel (a) gene labels: ENVA_ECOLI, FTSZ_ECOLI, FTSA_ECOLI, FTSQ_ECOLI, DDLB_ECOLI, MURC_ECOLI, MURG_ECOLI, FTSW_ECOLI, MURD_ECOLI, MRAY_ECOLI; horizontal axis: position in Haemophilus influenzae genome. Panel (b) gene labels, Haemophilus influenzae: FUCP_HAEIN (HI0610), FUCA_HAEIN (HI0611), FUCU_HAEIN (HI0612), FUCK_HAEIN (HI0613), FUCI_HAEIN (HI0614), FUCR_HAEIN (HI0615); Escherichia coli: FUCA_ECOLI, FUCP_ECOLI, FUCI_ECOLI, FUCK_ECOLI, FUCU_ECOLI, FUCR_ECOLI.]

Global views

The approaches described above are posing new questions for the computational comparison of model genomes, and are helping us to obtain different global views of complete genome data. Having accurate estimates for the expected sequence similarity between orthologues in different species will be the basis for more-reliable function predictions. Automatic annotation and classification of sequences, possibly based on alternative classification schemes, can supply global views of complete genomes. Finally, gene-order comparisons contribute towards the understanding of chromosome structure and evolution, complementing traditional sequence comparison.

Acknowledgements

We thank our colleagues in the GeneQuiz consortium, especially Reinhard Schneider and Antoine de Daruvar, for their valuable contributions to our continuing efforts in genome analysis. C.O. is a recipient of a long-term fellowship from the Human Frontiers Science Program Organization.

References

1 Fleischmann, R. D. et al. (1995) Science 269, 496-512
2 Fraser, C. M. et al. (1995) Science 270, 397-403
3 Scharf, M. et al. (1994) in Intelligent Systems for Molecular Biology 1994 (Altman, R., Brutlag, D., Karp, P. D., Lathrop, R. and Searls, D., eds), pp. 348-353, AAAI Press
4 Casari, G. et al. (1995) Nature 376, 647-648
5 Ouzounis, C., Casari, G., Valencia, A. and Sander, C. (1996) Mol. Microbiol. 20, 897-899
6 Casari, G., Ouzounis, C., Valencia, A. and Sander, C. (1996) in First Annual Pacific Symposium on Biocomputing (Hunter, L. and Klein, T. E., eds), pp. 707-709, World Scientific
7 Fitch, W. M. (1970) Syst. Zool. 19, 99-106
8 Bork, P. et al. (1995) Mol. Microbiol. 16, 955-967
9 Karp, P. D., Ouzounis, C. and Paley, S. (1996) in Intelligent Systems for Molecular Biology 1996 (States, D., Gaasterland, T. and Smith, R., eds), pp. 116-124, AAAI Press
10 Ouzounis, C., Valencia, A., Tamames, J., Bork, P. and Sander, C. (1995) in European Conference on Artificial Life 1995 (ECAL95) (Moran, F., Moreno, A., Merelo, J. J. and Chacon, P., eds), pp. 843-851, Springer-Verlag
11 Adams, M. D., Kerlavage, A. R., Fields, C. and Venter, J. C. (1993) Nat. Genet. 4, 256-267
12 Tamames, J., Ouzounis, C., Sander, C. and Valencia, A. FEBS Lett. (in press)
13 Ouzounis, C., Bork, P., Casari, G. and Sander, C. (1995) Protein Sci. 4, 2424-2428
14 Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F. and Cedergren, R. (1992) Proc. Natl. Acad. Sci. USA 89, 6575-6579
15 Boudreau, E., Otis, C. and Turmel, M. (1994) Plant Mol. Biol. 24, 585-602
16 Taylor, D. E., Eaton, M., Yan, W. and Chang, N. (1992) J. Bacteriol. 174, 2332-2337
17 Kunisawa, T. (1995) J. Mol. Evol. 40, 585-593
18 Liu, S. L. and Sanderson, K. E. (1995) Proc. Natl. Acad. Sci. USA 92, 1018-1022
19 Lopez-Garcia, P., St Jean, A., Amils, R. and Charlebois, R. L. (1995) J. Bacteriol. 177, 1405-1408
20 Tatusov, R. L. et al. (1996) Curr. Biol. 6, 279-291
21 Karlin, S. and Ladunga, I. (1994) Proc. Natl. Acad. Sci. USA 91, 12832-12836
22 Tamames, J., Ouzounis, C., Casari, G. and Valencia, A. J. Mol. Evol. (in press)
23 Riley, M. (1993) Microbiol. Rev. 57, 862-952
