Computational Comparisons of Model Genomes Christos Ouzounis,Georg Casari, Chris Sander,Javier Tamames and Alfonsovalencia

280 bioinfovmatics Computational comparisons of model genomes Christos Ouzounis,Georg Casari, Chris Sander,Javier Tamamesand AlfonsoValencia Complete genomes from model organisms provide new challenges for computational molecular biology. Novel questions emerge from the genome data obtained from the functional prediction of thousands of gene products. In this review, we present some approaches to the computational comparison of genomes, based on sequence and text analysis, and comparisons of genome composition and gene order. With the recent publication of the complete genome tions of an organism, identifying some components sequences from two bacteria, Haemophilus injuenzae that are common to other species, and some that Rdi and Mycoplasma genitalium2, new challenges are appear to be unique. These predicted ‘expression pat- emerging for computational biology. Two such chal- terns’ can help in the identification of novel or poten- lenges are (1) to predict and annotate the fimctions of tial metabolic or regulatory pathways, and provide a the gene products as rapidly and completely as poss- faster route to the development of targets for drug ible, and (2) to derive adequate abstractions that make design and discovery. genomes comparable at a higher-than-molecular level. Function prediction is a primary goal of genome- Orthologues: functionally equivalent genes sequence analysis, as many newly determined across species sequences have no experimental information associ- A gene with a certain level of sequence similarity to ated with them, while functional information can be its homologue in the genome of another species may derived by examining homology to proteins of known have the same function as its homologue; such genes function. Prediction can be carried out by integrating are defined as orthologues7. How is it possible to deter- and co-ordinating a number of well-tested methods mine whether the two proteins encoded by the genes that rapidly and efficiently identie sequences with the of different species have the same function? Proteins highest degree of similarity horn complete databases, change during evolution, forming families of related and can, therefore, assist function prediction using molecules that have similar primary, secondary and homology observations. The GeneQuiz system3 auto- tertiary structures, but which have divergent functions. matically annotates protein-encoding sequences and Algorithms for sequence comparison can detect genes can identify novel functions, e.g. for H. influenzati and encoding homologous proteins, but are unable to 111. genitaliums sequences*. The use of integrated data- determine definitively whether two molecules have bases and software tools, combined with the appli- exactly the same function. Therefore, function pre- cation of a number of empirical rules that can auto- diction by detection of sequence similarity, especially matically eliminate false annotation@, make this in incomplete genomes, can only be approximate, possible. because it usually makes use of genes encoding the If the functional annotations are known, what is the most similar proteins; however, these genes might not next step in genome analysis? We have been explor- yet have been identified. ing new ways to make use of this information, and In complete genomes, however, there is a finite have been performing global comparisons of genomic number of genes, so it is possible to determine which data, addressing a number of questions. These analy- genes share the highest degree of similarity, thus nar- ses yield a profile of the composition of genomic func- rowing the set of experiments that must be performed to prove that proteins encoded by orthologous genes C. Ouzounis ([email protected]) is at the Art$cial Intelligeflce have the same function. Therefore, the average Center, SRI International, 333 Ravenswood Avenue, Menlo Park, similarity value can be calculated, helping in the esti- CA 94025, USA. A. Valencia andJ. Tamames are at the Protein mation of the rate of change in different families. The Design Group, Centro National de Biofecknologia, CSIC, Campus U. Autonoma, E-28049 Madrid, Spain. G. Casari is at the Euro- pean Molecular Biology Labovatory, MeyevkofstraJe 1, D-69012 *Analysis of the H. in&wrzae, M. yerzitalicrm and S. cerevisiae genomes, Heidelberg, Germany. C. Sander is at the Euvopean Bioinfrmatics including functional classification, is available on the World Wide Web Institute, EMBL, Hinxton Hall, Cambridge, UK CBlO 1RQ. at <http://www.sander.embl-heidelberg.de/ge. TIBTECH AUGUST 1996 (VOL 14) Copyright 0 1996, Elsevier Science Ltd. All rights reserved. 0167 - 7799/96/$15.00. PII: SO167-7799(96)10043-3 281 bioinformatics redundancy of homologous families of proteins (e.g. 20 transport proteins in bacteria) can also be measured, and the most conserved pairs of proteins can be identified. The first analysis of this kind was performed with 2 a contig set from Myco$asma capricolums that encoded ‘g 15 almost half of the total number of proteins. We have found that the average similarity between protein- .g encoding genes in M. capricolum and Escherichia coli is 5 40%, with a wide variance (Fig. 1). It should be noted = 10 that analyses of putative orthologues in incomplete -5 genomes provide a lower estimate of sequence simi- b larity between orthologues. There may be other genes, e as yet unidentified in either genome, that display an z’ 5 even higher similarity The identification of orthologues can be used in the reconstruction of metabolic pathways encoded by 0 various bacterial genomes; E. coli metabolism has been 25 30 35 40 45 50 55 60 65 70 75 used as the modelg. This approach can provide a Sequence identity (%) deeper understanding of the primary-sequence data, 1 and can help in the identification of potentially inter- Figure 1 esting targets for medical, industrial and environmen- Frequency histogram indicating the sequence identity of genes encoding tal applications. 82 Mycoplasma capricolum proteins with their putative orthologues in Escherichia colis. These numbers represent lower estimates: if protein-encoding genes Genome compositions from automated with higher sequence-similaritres are found in either genome, the drstrrbution would function-classification probably shift to the right. The three most conserved gene pairs between the Functional classification of genes and gene products two bacteria are those encoding FtsH, enolase and ClpB, all of which share is another aspect of genome analysis that can provide 263% sequence identity over the length of available sequences (60-125 amino a basis for comparative genomics. This type of acids). Box 1. Functional classes for genome comparison Our definition of functional classes used for genome comparisons corresponds with the one that was previously proposed*3 for characterizing Escherichia co/i. Of the nine classes, eight are shown, and these can be classified into three super-classes; the ninth class consists of unknown functions, and is not shown. Typical keywords for the superclasses are: ENERGY - photosynthesis, oxygen transport, respiratory protein, monooxygenase, kinase, hydrophobic ion transporter; INFORMATION -activator, zinc finger, early protein, nuclear protein, developmental protein; COMMUNICATION - hormone, amidation, serine/threonine protein kinase, G-protein coupled reCeptOr, Serine protease inhibitor, cell adhesion. ENERGY INFORMATION COMMUNICATION Metabolism DNA and RNA Signal l Amino acid metabolism l Replication (including l Regulatory functions l Cofactor biosynthesis recombination and repair) l Cell division l Prosthetic groups and carriers l Transcription (including l Cell killing l Central intermediary splicing) metabolism Environment l Energy conservation (e.g. Translation l Interaction with environment: sugar modification and l Translation, ribosomal proteins recognition, adhesion, defense degradation, pentose (toxic substances), phosphate cycle, glycolysis, Proteins extracellular gluconeogenesis, electron l Protein biosynthesis, folding, degradation transport, respiration, internal transport, secondary metabolism) translocation, post-translational Structure l Fatty acid and phospholipid modification and degradation l Structural proteins biosynthesis l Purines, pyrimidines, nucleosides, nucleotides Transport l Transport through membranes TlBTECHAUGUST1996WOL14) 282. bioin formatics 1 Haemophilus influenzae Energy ---F- (59%) -+- -+- Extract --- ---w- keywords l-i--+- I * Assign -+- keywords ---+- to classes --+--+- Communication Function, L-NFunction: iequences class Keywords class Classified sequences Dictionary of keywords Mycoplasma genitalium l Assign sequences Iterate l \ Scan keyword field Energy I to classes , l Scan other fields (40%) Information ommunication Figure 2 (2%) Schema of the method used to classify sequences in functional classes. _______ (a) Sequences are classified into functlonal classes; (b) a dictionary with keywords Figure 4 corresponding to each functional class is created; (c)the database is searched and The functional compositions of two complete genomes4,5. The rela- all sequences are classified into functional classes; and (d) the procedure is iterated. tive sizes of the charts reflects genome size. While Haemophilus Classified sequences can be input manually, or from the automatic classification pro- influenzae displays a typical compositlon for a bacterial genome vided by the method, or from a combination of both methods, (compare with Fig. 31, Mycoplasma genitalium appears

Computational Comparisons of Model Genomes Christos Ouzounis,Georg Casari, Chris Sander,Javier Tamames and Alfonsovalencia

Biolayout Paper Fig3 V3

Computational Analysis of Protein Function Within Complete Genomes

EMBL-EBI Press Release European Molecular Biology Laboratory European Bioinformatics Institute for Immediate Release

Designing and Running an Advanced Bioinformatics and Genome Analyses Course in Tunisia

Automated Annotation of Protein Sequences, This Is Vital and Similar Approaches Will Be Implemented for Other Domains in the Future

Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and Their Roles in Complex Disease Jeremy J

Avancesin Systemsand Synthetic Biology

An Application to Phylogenetic Analysis

EMBL April 2003 &Cetera Newsletter of the European Molecular Biology Laboratory

Experimental Investigation of Enzyme Functional Annotations Reveals Extensive Annotation Error

BIOINFORMATICS Pages 48–64

Adam Alexander Thil SMITH Exploitation Automatisée Des Contextes