Measuring Genome Evolution (Ortholog͞synteny͞computer Analysis͞horizontal Gene Transfer)
Total Page:16
File Type:pdf, Size:1020Kb
Proc. Natl. Acad. Sci. USA Vol. 95, pp. 5849–5856, May 1998 Colloquium Paper This paper was presented at the colloquium ‘‘Computational Biomolecular Science,’’ organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine CA. Measuring genome evolution (orthologysyntenyycomputer analysisyhorizontal gene transfer) MARTIJN A. HUYNEN* AND PEER BORK European Molecular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Germany, and Max-Delbru¨ck-Centrum for Molecular Medicine, 13122 Berlin-Buch, Germany ABSTRACT The determination of complete genome se- structures, coding densities, protein coding genes, operons, the quences provides us with an opportunity to describe and size distribution of gene families (4), etc. They also can be analyze evolution at the comprehensive level of genomes. Here compared at a variety of levels: a first-order level where we we compare nine genomes with respect to their protein coding regard the genome as a ‘‘bag of genes’’ without taking account genes at two levels: (i) we compare genomes as ‘‘bags of genes’’ of interactions between the various components, and a second- and measure the fraction of orthologs shared between ge- order level that considers whether properties of genomes are nomes and (ii) we quantify correlations between genes with cross-correlated (e.g., the absence of certain polynucleotides respect to their relative positions in genomes. Distances together with the presence of restriction enzymes that specif- between the genomes are related to their divergence times, ically cut these polynucleotides; ref. 5). In this paper we focus measured as the number of amino acid substitutions per site on first- and second-order patterns in protein coding regions in a set of 34 orthologous genes that are shared among all the in genomes. Specifically we measure: (i) the fraction of or- genomes compared. We establish a hierarchy of rates at which thologous sequences between genomes, (ii) the conservation genomes have changed during evolution. Protein sequence of gene order between genomes, and (iii) the spatial clustering identity is the most conserved, followed by the complement of of genes in one genome that have an ortholog in another genes within the genome. Next is the degree of conservation of genome. We correlate these measures with the divergence time the order of genes, whereas gene regulation appears to evolve between the genomes compared. It is not our goal to define at the highest rate. Finally, we show that some genomes are new distance measures to construct phylogenetic trees. Rather more highly organized than others: they show a higher degree it is to analyze the conservation and differentiation of patterns of the clustering of genes that have orthologs in other genomes. between genomes, to show how we can extract useful infor- mation from these, and to analyze at what relative time scales Molecular evolution usually is studied at the level of single they change. The analyses are done on the first nine sequenced genes. With the determination of genome sequences we have Archaea and Bacteria that were publicly available: H. influen- an opportunity to study it at a higher, comprehensive level, that zae (6), Mycoplasma genitalium (7), Synechocystis sp. PCC 6803 of complete genomes. This leads to the pertinent question: how (8), Methanococcus jannaschii (9), Mycoplasma pneumoniae can genomic information be used to obtain useful information (10), E. coli (1), Methanobacterium thermoautotrophicum (11), concerning genome evolution? The goal of this paper is to Helicobacter pylori (12), and Bacillus subtilis (13). Although the create baseline expectations for measures of genome distances total number of publicly available genome sequences is grow- that are based on gene content. By describing some general ing rapidly, the trends that we observe should remain largely patterns one also can identify the exceptions. Measuring unchanged with the comparison of new species, given the evolution at the level of complete genomes is pertinent as it is, diverse range of evolutionary distances of the species com- after all, the principal level for natural selection. Furthermore, pared in this paper. it is intermediate to levels at which evolution has long been Methodological Issues in Comparisons of Genomes studied: namely, the molecular level in genes and genotypes, and the organismal level in the fossil record. The genome in Identification of Orthologous Genes. Defining orthology. In principle contains all of the information necessary to bridge comparing the genes of different genomes it is important that the gap between genotype and phenotype. For example, by we avoid comparisons of ‘‘apples and pears’’: i.e., that we are knowing the functions of the genes in a genome of a species we able to identify which genes correspond to each other in the can postulate a model for its complete metabolism. However, various genomes. Fitch (14) introduced the term ‘‘orthologs’’ we have to be careful not to overstate our expectations. The for genes whose independent evolution reflects a speciation situation might turn out to be analogous to that of proteins, for event rather than a gene duplication event. ‘‘Where the which, in principle, all information necessary to determine homology is the result of gene duplication so that both copies three-dimensional structures in the form of amino acid se- have descended side by side during the history of an organism, quences is known, yet we remain unable to predict their (for example, alpha and beta hemoglobin) the genes should be tertiary structures. called paralogous (para 5 in parallel). Where the homology is Genomes can be analyzed and compared for various fea- the result of speciation so that the history of the gene reflects tures: e.g., nucleotide content, compositional biases of leading the history of the species (for example, alpha hemoglobin in and lagging strands in replication (e.g., in Escherichia coli) (1), man and mouse) the genes should be called orthologous dinucleotide frequencies (2), the occurrence of repeats (e.g., in (ortho 5 exact)’’ (14). Note that orthology and paralogy are virulence genes of Haemophilus influenzae; ref. 3), RNA *To whom reprint requests should be addressed at: European Molec- © 1998 by The National Academy of Sciences 0027-8424y98y955849-8$2.00y0 ular Biology Laboratory, Meyerhofstrasse 1, 69012 Heidelberg, Ger- PNAS is available online at http:yywww.pnas.org. many. e-mail: [email protected]. 5849 Downloaded by guest on September 29, 2021 5850 Colloquium Paper: Huynen and Bork Proc. Natl. Acad. Sci. USA 95 (1998) defined only with respect to the phylogeny of the genes and not fer, and ancient gene duplications cannot be distinguished. with respect to function. Besides the construction of phylogenetic trees an additional Identifying orthology by using relative levels of sequence iden- strategy for finding horizontal gene transfer is the comparison tity. Ideally one would expect that the orthologous genes of two of nucleotide frequencies within a genome. Recently trans- genomes are those that have the highest pairwise identity, ferred genes often display nucleotide frequencies that deviate having bifurcated relatively recently compared with genes that significantly from the rest of the genome (20, 21). A conser- duplicated before the speciation. The most straightforward vative estimate of the amount of genes that recently have been approach to identifying orthologous genes is to compare all transferred to E. coli, based on nucleotide frequencies and genes in genomes with each other, and then to select pairs of dinucleotide frequencies in genomes is 10%215% of the E. genes with significant pairwise similarities. A pair of sequences coli genome (Phil Green, personal communication; ref. 21). A with the highest level of identity then is considered ortholo- third strategy for finding horizontal gene transfer is synteny. gous. Because gene order is rarely conserved in evolution, the Auxiliary information for detection of orthology. Auxiliary presence in two distant evolutionary branches of the same information that is useful to assess orthology is ‘‘synteny’’: the order of genes, combined with the absence of this gene order presence in both genomes of neighboring sequences that are in other more closely related branches, can point to horizontal also orthologs of each other. As shown below, there is little gene transfer. This strategy has been used successfully to find conservation of the order of genes in genomes in evolution at the example of horizontal gene transfer described in Fig. 1. a time when divergence of their orthologous genes reaches a Orthology in multidomain proteins. In multidomain proteins level of 50% amino acid identity (see Fig. 3). Hence the two levels of orthology can be distinguished: one is at the level potential for using synteny for identifying orthologs is limited of single domains, a second at the level of the whole protein. mainly to genomes that have speciated only relatively recently. This may lead to situations where nonorthologous proteins A second type of auxiliary information that can be used is the possess orthologous domains. Modularity of genes in the sense comparison of genes with those of a third genome. If two genes that modules can have different positions, but the same from different genomes have the highest level of identity both function, in various proteins, is not well documented in Bac- to each other and to a single gene from a third genome, then teria and Archaea. A first step toward modularity, the presence this is a strong indication that they are orthologs (see ref. 15 of ‘‘gene fusion’’ or ‘‘gene splitting,’’ however, does occur for a large-scale implementation of this idea). However for a regularly. Comparative analysis of the genomes H. influenzae large fraction of genes identifying orthologs by relative se- and E. coli showed 10 (24) clear-cut cases of genes that were quence identity is hampered by a variety of evolutionary separate in E.