Review TRENDS in Ecology and Evolution Vol.18 No.6 June 2003

Evolution by gene duplication: an update

Jianzhi Zhang

Department of Ecology and Evolutionary Biology, University of Michigan, 3003 Nat. Sci. Bldg, 830 N. University Ave, Ann Arbor, MI 48109, USA

Corresponding author: Jianzhi Zhang ([email protected]).
http://tree.trends.com 0169-5347/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. doi:10.1016/S0169-5347(03)00033-8

The importance of gene duplication in supplying raw genetic material to biological evolution has been recognized since the 1930s. Recent genomic sequence data provide substantial evidence for the abundance of duplicated genes in all organisms surveyed. But how do newly duplicated genes survive and acquire novel functions, and what role does gene duplication play in the evolution of genomes and organisms? Detailed molecular characterization of individual gene families, computational analysis of genomic sequences and population genetic modeling can all be used to help us uncover the mechanisms behind the evolution by gene duplication.

In 1936, Bridges reported one of the earliest observations of gene duplication from the doubling of a chromosomal band in a mutant of the fruit fly Drosophila melanogaster, which exhibited extreme reduction in eye size [1]. The potential role of gene duplication in evolution was subsequently suggested and possible scenarios of duplicate gene evolution were proposed [2–4]. Ohno’s seminal book in 1970, Evolution by Gene Duplication [5], further popularized this idea among biologists. It was, however, not until the late 1990s, when many genome sequences were determined and analyzed, that the prevalence and importance of gene duplication was clearly demonstrated. Through genomic sequence analysis, population genetic modeling and molecular experimentation, rapid progress has also been made in disclosing the mechanisms by which duplicate genes diverge in function and contribute to evolution. Here, I review current understandings of these mechanisms. I do not discuss genome duplication, as there have been several recent reviews of this topic [6–8].

Prevalence of gene duplication in all three domains of life

Table 1 lists the estimated numbers of duplicated genes in completely or nearly completely sequenced genomes of representative bacteria, archaebacteria and eukaryotes. One finds that, in all three domains of life, large proportions of genes were generated by gene duplication. It is almost certain that these proportions are underestimates, because many duplicated genes have diverged so much that virtually no sequence similarity is found.

Table 1. Prevalence of gene duplication in all three domains of lifeᵃ

Species                       Total number of genes   Number of duplicate genes (% of duplicate genes)   Refs
Bacteria
  Mycoplasma pneumoniae       677                     298 (44)                                           [65]
  Helicobacter pylori         1590                    266 (17)                                           [66]
  Haemophilus influenzae      1709                    284 (17)                                           [67]
Archaea
  Archaeoglobus fulgidus      2436                    719 (30)                                           [68]
Eukarya
  Saccharomyces cerevisiae    6241                    1858 (30)                                          [67]
  Caenorhabditis elegans      18 424                  8971 (49)                                          [67]
  Drosophila melanogaster     13 601                  5536 (41)                                          [67]
  Arabidopsis thaliana        25 498                  16 574 (65)                                        [69]
  Homo sapiens                40 580ᵇ                 15 343 (38)                                        [11]

ᵃ Use of different computational methods or criteria results in slightly different estimates of the number of duplicated genes [12].
ᵇ The most recent estimate is ~30 000 [61].

Glossary
Concerted evolution: a mode of gene family evolution in which members of a family remain similar in sequence and function because of frequent gene conversion and/or unequal crossing over.
Gene conversion: a recombination process that nonreciprocally homogenizes gene sequences.
Nonsynonymous (nucleotide substitution): a nucleotide substitution in the coding region of a gene that changes the protein sequence.
Positive (darwinian) selection: natural selection that promotes the fixation of advantageous alleles.
Pseudogene: a DNA sequence derived from a functional gene but has been rendered nonfunctional by mutation.
Purifying selection: natural selection that prevents the fixation of deleterious alleles.
Operon: a unit of gene expression and regulation, including structural genes and control elements.
Synonymous (nucleotide substitution): a nucleotide substitution in the coding region of a gene that does not change the protein sequence.

Lynch and Conery estimated that gene duplication arises (and is fixed in populations) at an approximate rate of 1 gene⁻¹ 100 million years (MY)⁻¹ in eukaryotes such as Homo sapiens, Mus musculus, D. melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and Saccharomyces cerevisiae [9]. This rate is comparable to that of nucleotide substitution, which is 0.1–0.5 site⁻¹ 100 MY⁻¹ in nuclear genomes of vertebrates [10]. The above duplication rate is the gene-birth rate, which was derived from recent duplications. Many fixed duplicated genes later become pseudogenes (see Glossary) and are deleted from the genome. The rate of duplication that gives rise to stably maintained genes is the birth rate multiplied by the retention rate, which is expected to fluctuate with gene function, among other things.

Duplicated genes are often referred to as paralogous genes, which form gene families. Several authors have tabulated the distribution of gene family size for a few completely sequenced genomes [11,12] and this varies substantially among species and gene families [13]; for instance, the biggest gene family in D. melanogaster is the

trypsin family [12], with 111 members, whereas the biggest family in mammals is the olfactory receptor family, with ~1000 members [14,15]. From a genomic sequence analysis of the bacterium Escherichia coli, two yeasts, C. elegans and D. melanogaster, Conant and Wagner found that ribosomal proteins and transcription factors generally form smaller gene families than do other proteins, such as those controlling cell cycles and metabolism [16].

Generation of duplicate genes

Gene duplication can result from unequal crossing over (Fig. 1a), retroposition (Fig. 1b), or chromosomal (or genome) duplication, the outcomes of which are quite different. Unequal crossing over usually generates tandem gene duplication; that is, duplicated genes are linked in a chromosome (Fig. 1a). Depending on the position of crossing over, the duplicated region can contain part of a gene, an entire gene, or several genes. In the latter two cases, introns, if present in the original genes, will also be present in the duplicated genes. This is in sharp contrast to the result from retroposition (Fig. 1b). Retroposition occurs when a messenger RNA (mRNA) is retrotranscribed to complementary DNA (cDNA) and then inserted into the genome. As expected from this process, there are several molecular features of retroposition: loss of introns and regulatory sequences, presence of poly A tracts, and presence of flanking short direct repeats, although deviations from these common patterns do occasionally occur [17]. Another major difference from unequal crossing over is that a duplicated gene generated by retroposition is usually unlinked to the original gene, because the insertion of cDNA into the genome is more or less random. It is also impossible to have blocks of genes duplicated together by retroposition unless the genes involved are all in an OPERON. Only those genes that are expressed in the germ line are subject to heritable retroposition. Because promoter and regulatory sequences of a gene are not transcribed and hence not duplicated by retroposition, the resulting duplicate often lacks necessary elements for transcription and thus immediately becomes a pseudogene. Nevertheless, several retroposition-mediated duplicate genes are expressed, probably because of the chance insertion of cDNA into a genomic location that is downstream of a promoter sequence [17].

Chromosomal or genome duplication occurs probably by a lack of disjunction among daughter chromosomes after DNA replication. Substantial evidence shows that these large-scale duplications occurred frequently in plants but infrequently in animals [10]. Recent human genome analysis reveals another type of large-scale duplication, segmental duplication, which often involves 1000 to >200 000 nucleotides [18]. That most segmental duplications do not generate tandem repeats suggests that unequal crossing over is probably not responsible, although the exact duplication mechanism is unclear [18].

Evolutionary fate of duplicate genes

Duplication occurs in an individual, and can be fixed or lost in the population, similar to a point mutation. If a new allele comprising duplicate genes is selectively neutral, compared with pre-existing alleles, it only has a small probability, 1/2N, of being fixed in a diploid population [19], where N is the effective population size. This suggests that many duplicated genes will be lost. For those that do become fixed, fixation is time consuming, because it takes, on average, 4N generations for a neutral allele to become fixed [19]. Upon fixation, the long-term evolutionary fate of duplication will still be determined by functions of the duplicate genes. The birth and death of genes is a common theme in gene family and genome evolution [20,21], with those genes involved in the physiologies that vary greatly among species (e.g. immunity, reproduction and sensory systems) probably having high rates of gene birth and death.
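To give a feel for the magnitudes involved, the two neutral-theory quantities quoted above (fixation probability 1/2N and mean fixation time 4N generations) can be evaluated for a few population sizes. This is only an illustrative sketch: the function names and the example values of N are assumptions chosen for the illustration, not figures from the studies cited.

```python
def neutral_fixation_probability(n_e: int) -> float:
    """Chance that one new neutral allele eventually fixes in a diploid population: 1/(2N)."""
    if n_e <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 / (2 * n_e)


def mean_generations_to_fixation(n_e: int) -> int:
    """Average number of generations a neutral allele takes to fix, given that it fixes: 4N."""
    if n_e <= 0:
        raise ValueError("effective population size must be positive")
    return 4 * n_e


if __name__ == "__main__":
    # Hypothetical effective population sizes, chosen only for illustration.
    for n_e in (1_000, 10_000, 1_000_000):
        p = neutral_fixation_probability(n_e)
        t = mean_generations_to_fixation(n_e)
        print(f"N={n_e}: P(fixation)={p:.2e}, mean fixation time={t} generations")
```

The larger the population, the less likely a neutral duplicate is to be fixed and the longer fixation takes, which is why most new duplicates are expected to be lost.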

Fig. 1. Two common modes of gene duplication. (a) Unequal crossing over, which results in a recombination event in which the two recombining sites lie at nonidentical locations in the two parental DNA molecules. (b) Retroposition, which occurs when a messenger RNA (mRNA) is retrotranscribed to complementary DNA (cDNA) and then inserted into the genome. Squares represent exons and bold lines represent introns. [Panel (b) labels: transcription and RNA splicing (intron sequences are spliced out during mRNA maturation); mature mRNA (AAAAAAA); reverse transcription; cDNA; random insertion into the genome (the parental gene resides in a different chromosome).]

Pseudogenization

Gene duplication generates functional redundancy, as it is often not advantageous to have two identical genes. In other words, mutations disrupting the structure and function of one of the two genes are not deleterious and are not removed by selection. Gradually, the mutation-containing gene becomes a pseudogene, which is either unexpressed or functionless, an evolutionary fate that has been shown by population genetic modeling [22,23] as well as by genomic analysis [9,24]. After a long time evolutionarily speaking, pseudogenes will either be deleted from the genome or become so diverged from the parental genes that they are no longer identifiable. Relatively young pseudogenes are recognizable because of sequence similarity. For example, genomic analyses have identified 2168 pseudogenes in C. elegans, or about one pseudogene for every eight functional genes [25]. More pseudogenes exist in humans, with about one pseudogene for every two functional genes in the two completely sequenced chromosomes [24]. Pseudogenization, the process by which a functional gene becomes a pseudogene, usually occurs in the first few million years after duplication if the duplicated gene is not under any selection [9]. Nevertheless, some duplicated genes had been maintained in the genome for a long time for specific functions, before recently becoming pseudogenes because of the relaxation of functional constraints. For example, the size of the olfactory receptor gene family (~1000) is similar in humans and mice, but the percentage of pseudogenes is >60% in humans and only 20% in mice. Many olfactory receptor genes have become pseudogenes since the origin of hominoids [26]. This is probably related to the reduced use of olfaction in hominoids, which can be compensated for by other sensory mechanisms, such as better vision.

Pseudogenes do occasionally serve some function. In chickens, there is only one functional gene (VH1) encoding the heavy chain variable region of immunoglobulins, and immunoglobulin diversity is generated by GENE CONVERSION of the VH1 gene by the many duplicated variable region pseudogenes that occur on its 5′ side [27]. Although unlikely, pseudogenes can also be revived. In cows, the pancreatic ribonuclease gene has a paralogous gene called the seminal ribonuclease gene, which is expressed in semen. These two genes are the result of gene duplication that occurred before the radiation of ruminants at least 35 MY ago. In all other ruminants, the seminal ribonuclease gene either contains deleterious mutations or is not expressed [28–30], which suggests that the seminal ribonuclease gene had been a pseudogene for much of its history, but was revived recently in the cow. How this could have happened is unclear.

In my view, there have not been sufficient studies of pseudogenization probably because pseudogenes are regarded to be uninteresting. In fact, lineage-specific pseudogenization, such as the aforementioned example of olfactory receptor genes of hominoids, provides rich information about organismal evolution.

Conservation of gene function

The presence of duplicate genes is sometimes beneficial simply because extra amounts of protein or RNA products are provided. This applies mainly to strongly expressed genes the products of which are in high demand, such as rRNAs and histones. How can two paralogous genes maintain the same function after duplication? One way is by gene conversion. Under frequent gene conversion, two paralogous genes will have very similar sequences and functions, and this mode of evolution is often referred to as CONCERTED EVOLUTION [10]. Alternatively, strong PURIFYING SELECTION against mutations that modify gene function can also prevent duplicated genes from diverging. Purifying selection can be distinguished from gene conversion by an examination of synonymous (or silent) nucleotide differences among duplicated genes. Synonymous differences are more or less immune to selection and cannot be reduced by purifying selection. But they can be removed by gene conversion, because gene conversion homogenizes DNA sequences regardless of whether the differences are synonymous or nonsynonymous (amino-acid-altering). Using this strategy, Nei and his associates re-examined several large gene families that were previously thought to be under concerted evolution. Their results suggest that purifying selection is much more important than is gene conversion in maintaining common functions of these duplicated genes [21,31]. A recent population genetic analysis also suggested that the conditions for gene conversion to be favored selectively are relatively restrictive [32].

Subfunctionalization

Unless the presence of an extra amount of gene product is advantageous, two genes with identical functions are unlikely to be stably maintained in the genome [33]. Theoretical population genetics predicts that both duplicates can be stably maintained when they differ in some aspects of their functions [33], which can occur by subfunctionalization, in which each daughter gene adopts part of the functions of their parental gene [34–36]. One form of subfunctionalization that is potentially important in the evolution of development is division of gene expression after duplication ([37], Fig. 2). Several duplicate genes have been demonstrated to evolve following this model of subfunctionalization [37]. For example, zebrafish engrailed-1 and engrailed-1b are a pair of transcription factor genes generated by a chromosomal segmental duplication that occurred in the lineage of ray-finned fish. Zebrafish engrailed-1 is expressed in the pectoral appendage bud, whereas engrailed-1b is expressed in a specific set of neurons in the hindbrain/spinal cord [37]. The sole engrailed-1 gene of the mouse, orthologous to both genes of the zebrafish, is expressed in both pectoral appendage bud and hindbrain/spinal cord. Changes of gene expression after gene duplication appear to be a general rule rather than exception [38,39] and these changes often occur quickly after gene duplication [39].

Subfunctionalization can also occur at the protein function level and can lead to functional specialization when one of the duplicate genes becomes better at performing one of the original functions of the progenitor gene [40]. A recent study illustrates how a specialized digestive enzyme gene emerged following the duplication of a bifunctional gene in the leaf-eating monkey douc langur ([41] Box 1). It is unclear, however, what percentage of duplicate genes evolved by subfunctionalization.

Fig. 2. Division of expression after gene duplication. Squares represent genes, closed ovals represent cis-acting elements that regulate gene transcription, and open ovals represent deactivated cis-elements. Consider a gene that is expressed in tissues T1 and T2, with a cis-acting regulatory element A1 controlling the expression in T1 and A2 controlling the expression in T2. Following gene duplication, one daughter gene might lose the A1 element whereas the other gene might lose A2, so that each is expressed in only one of the two tissues. Under such conditions, both genes are necessary and therefore will be maintained in the genome.

Neofunctionalization

One of the most important outcomes of gene duplication is the origin of novel function. Although it seems improbable that an entirely new function could emerge in a duplicate gene, there are several examples. For instance, the eosinophil-derived neurotoxin (EDN) and eosinophil cationic protein (ECP) genes of humans were generated in the lineage of hominoids and Old World monkeys via gene duplication [42]. Both genes belong to the RNase A gene superfamily. After duplication, a novel antibacterial activity emerged in ECP. This activity is absent in human EDN and the EDN of New World monkeys, which represents the progenitor gene before duplication. More surprisingly, the antibacterial activity of ECP does not depend on the ribonuclease activity [43]. Molecular evolutionary analysis suggests that the new function is probably conferred by a large number of arginine substitutions that occurred in a short period after duplication [42]. ECP is toxic to bacteria because it makes their cell membranes porous; the positively charged arginine residues might be important for establishing tight contact between the ECP and negatively charged bacterial cell membranes in the pore-formation process [42].

In many cases, however, a related function, rather than an entirely new function, evolves after gene duplication. One good example is the red- and green-sensitive opsin genes of humans, which were generated by gene duplication in hominoids and Old World monkeys [44]. After duplication, the two opsins have diverged in function, resulting in a 30-nm difference in the maximum absorption wavelength. This confers the sensitivity to a wide range of colors that humans and related primates have.

Neofunctionalization of duplicated genes requires varying numbers of amino acid substitutions. The functional change in ECP probably required many substitutions [42], but the functional difference between the red and green opsins is largely attributable to two substitutions [45]. In the case of neofunctionalization by a large number of genetic changes, an intriguing question is what function the protein has during the many steps of the genetic changes. This could be answered by the reconstruction of ancestral proteins and functional analysis of these proteins.

Box 1. Digestive RNases of a leaf-eating monkey: a case study of duplicate gene evolution

Colobine monkeys are unique among primates in using leaves rather than fruits and insects as their primary food source; the leaves are fermented in the foregut by symbiotic bacteria [70]. Similar to ruminants, colobines recover nutrients by breaking and digesting the bacteria with the use of various enzymes, including RNase1, which is secreted from the pancreas and transported into the small intestine to degrade RNA [71,72]. Initial studies revealed a substantially greater amount of RNases in the pancreas of colobines and ruminants than in other mammals [71,72]. This is probably because rapidly growing bacteria have the highest RNA-nitrogen:total nitrogen ratio of all cells, and high concentrations of RNases are needed to break down bacterial RNAs so that nitrogen can be recycled efficiently [72]. Zhang et al. identified two copies (RNase1 and RNase1B) of the RNase gene in douc langur Pygathrix nemaeus, an Asian colobine, but only one copy (RNase1) in each of the 15 noncolobine primates examined [41]. Phylogenetic analyses suggested that the gene duplication occurred after colobines diverged from other monkeys (Fig. I). After duplication, RNase1B evolved much more rapidly than did RNase1 (Fig. I). In fact, the rate of nucleotide substitution in RNase1B since duplication was significantly higher at nonsynonymous sites than at synonymous and noncoding sites, providing evidence for the action of positive selection on RNase1B.

Because the pH in the small intestine of colobines is 6–7, whereas that in humans and other monkeys is 7.4–8, Zhang et al. hypothesized that RNase1B has been under selection for a high catalytic efficiency in an acidic environment. To test this hypothesis, recombinant proteins from the douc langur RNase1B gene as well as the RNase1 genes of humans, rhesus monkeys and douc langur were prepared and their ribonucleolytic activities were quantified at different pHs. The catalytic optimal pH of douc langur RNase1B was found to be 6.3, whereas that of RNase1 was 7.4 for all three species examined. At pH 6.3, RNase1B is approximately six times more efficient in degrading RNA than is RNase1 [41].

Why has the douc langur RNase1 been conserved after duplication and why does its optimal catalytic pH remain at 7.4? Human RNase1 is expressed in many tissues other than the pancreas and has a second enzyme activity (EA_dsRNA) in degrading double-stranded RNA, an activity that might be related to defense against viral infection but unrelated to digestion. Zhang et al. found similar EA_dsRNA among the RNase1 proteins of the human, rhesus monkey and douc langur, although that of the douc langur RNase1B was reduced by >300-fold [41]. Apparently, RNase1B can lose EA_dsRNA and become specialized in digesting bacterial RNAs because the paralogous RNase1 retains EA_dsRNA. There are nine amino acid differences between douc langur RNase1 and RNase1B. Using site-directed mutagenesis, Zhang et al. made protein mutants and showed that each of the nine substitutions in RNase1B was detrimental to EA_dsRNA. Thus, if the EA_dsRNA function had not been relaxed in RNase1B, none of the adaptive substitutions that shifted the optimal pH could have occurred [41]. This detailed molecular evolutionary study demonstrated complementary roles of positive selection and relaxation of purifying selection in functional divergence of duplicate genes and provides a vivid example of the contribution of gene duplication toward organismal adaptation to changing environments.

Fig. I. Phylogenetic relationships of RNase1 and RNase1B genes of primates. Douc langur has both genes, whereas other species have only the RNase1 gene. Bootstrap percentages >50 are shown. Branch lengths are drawn to scale, indicating the number of nucleotide substitutions per site. This neighbor-joining tree was reconstructed with the use of Kimura’s two-parameter distances. Reprinted, with permission, from [41]. [Taxa shown: Old World monkeys, comprising colobines (douc langur RNASE1B and RNASE1, with the gene duplication marked on this lineage) and cercopithecines (green monkey, talapoin monkey, baboon, pig-tailed macaque, rhesus monkey); hominoids (human, chimpanzee, gorilla, orangutan, gibbon); New World monkeys (woolly monkey, spider monkey, tamarin, squirrel monkey); prosimian (lemur). Scale bar: 0.05.]
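The comparison of nonsynonymous and synonymous substitution rates used in this box can be illustrated with a simplified calculation. The sketch below is a stand-in, not the procedure of [41]: it assumes integer site counts, applies no correction for multiple substitutions at a site, uses a one-sided Fisher's exact test as one possible significance test, and all numbers in the usage section are hypothetical.

```python
from math import comb


def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Ratio of per-site nonsynonymous to synonymous substitution proportions."""
    ds = syn_changes / syn_sites
    if ds == 0:
        raise ValueError("no synonymous changes: ratio undefined")
    return (nonsyn_changes / nonsyn_sites) / ds


def fisher_one_sided(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """P(at least this many of the changes fall on nonsynonymous sites),
    if changes were spread over all sites without regard to site class."""
    total_sites = nonsyn_sites + syn_sites
    total_changes = nonsyn_changes + syn_changes
    upper = min(nonsyn_sites, total_changes)
    favourable = sum(
        comb(nonsyn_sites, k) * comb(syn_sites, total_changes - k)
        for k in range(nonsyn_changes, upper + 1)
    )
    return favourable / comb(total_sites, total_changes)


if __name__ == "__main__":
    # Hypothetical counts for one duplicate lineage (not data from [41]).
    counts = dict(nonsyn_changes=12, nonsyn_sites=300, syn_changes=1, syn_sites=100)
    print("dN/dS =", dn_ds(**counts))
    print("one-sided P =", fisher_one_sided(**counts))
```

A ratio well above 1 together with a small P value would argue against neutral evolution under relaxed constraint alone; a ratio near or below 1 would not distinguish relaxed purifying selection from weak positive selection.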


Evolutionary forces behind functional divergence of duplicate genes

In the case of division of expression, such as the engrailed-1 and engrailed-1b genes of zebrafish, it is likely that random fixations of complementary degenerate mutations under relaxed functional constraints are the main cause [37]. In other words, it is a result of neutral evolution, without the involvement of POSITIVE SELECTION. In the case of functional specialization and neofunctionalization, two models have been widely cited. The first model is known as the Dykhuizen–Hartl effect, which does not require positive selection [19,42,46,47]. In this model, after gene duplication, random mutations are fixed in one daughter gene under relaxed purifying selection, which occurs by reduced functional constraint provided by genetic redundancy. These fixed mutations later induce a change in gene function when the environment or the genetic background is altered. The second model requires positive selection and involves two scenarios. In the first scenario, after gene duplication, a few neutral or nearly neutral substitutions create a new, but only weakly active function in one daughter gene, and positive selection then accelerates the fixation of advantageous mutations that enhance the activity of the novel function [42]. In the second scenario, the ancestral gene already has dual functions. Gene duplication provides the opportunity for each daughter gene to adopt one ancestral function, and further substitutions under positive selection can refine the functions [40].

Acceleration of protein sequence evolution following gene duplication is often observed [9,48,49]; however, this can be explained by either model. Under such circumstances, one often assumes the null hypothesis of neutral evolution with relaxed purifying selection. A significantly higher rate of NONSYNONYMOUS than SYNONYMOUS NUCLEOTIDE SUBSTITUTION can be used to reject the null hypothesis and to establish the action of positive selection. Indeed, several cases of positive selection after gene duplication have been reported, including immunoglobulins, conotoxins, ribonucleases, pregnancy-associated glycoproteins, triosephosphate isomerase and the ECP gene [40–42,50–54]. A cautionary note is that relaxation of purifying selection is often treated as the null hypothesis and is thus accepted even without direct evidence. Because of the relatively low power of statistical methods for detecting positive selection, actions of positive selection have probably been overlooked and relaxation of purifying selection incorrectly invoked. Box 1 illustrates how molecular experimentation can be used to test the hypothesis of relaxation of purifying selection. Both positive selection and relaxation of purifying selection are necessary in the functional divergence of duplicate genes [41] (Box 1). This might be particularly so when the functional change involves multiple amino acid substitutions.

When specialization or neofunctionalization is completed, duplicate genes are likely to be maintained under different functional constraints and show different substitution patterns. Several statistical methods have been developed to identify amino acid sites evolving with altered substitution rates and to test the rate difference [55–58]. Box 2 shows an application of such statistical methods in finding candidate sites responsible for functional differences between two subfamilies of the proteases involved in apoptosis. These computational results can guide molecular experimentation, leading to the identification of functionally important amino acid substitutions, which might help us to reveal the molecular basis of functional divergence as well as provide crucial information for drug design and other biomedical applications.

Box 2. Identifying amino acid changes behind the functional differences among caspases

When two duplicated genes A and B have different functions, the amino acid substitution rate in A might differ from that in B at certain sites. Let X be a sequence data set that usually contains multiple orthologous sequences of protein A and B, and let S1 be the event that the substitution rate in A and B are different at a given site. Gu has developed a method to compute P(S1|X), the posterior probability that a site has different substitution rates in A and B, given the data X [55]. He has also invented a statistic θ as a measure of the overall site-specific rate difference between proteins A and B [55]. The computer program for these methods [73] is available at http://xgu1.zoo1.iastate.edu.

Wang and Gu [74] applied this method to cysteine aspartyl proteases (caspases), which are key components in apoptosis. The authors were interested in caspases because, in vertebrates, these can be divided into two subfamilies (CED-3 and ICE), which have different functions. CED-3 type caspases are essential for most apoptotic pathways whereas the major function of ICE-type caspases is to mediate immune response, although some members are also involved in apoptosis under some circumstances. The two subfamilies also show structural differences.

The authors found θ = 0.29 ± 0.05, suggesting significant site-specific rate differences between the CED-3 and ICE subfamilies (Fig. I). The significantly positive value of θ was found to be due to 21 sites, as θ becomes virtually 0 when these sites are removed. In fact, experimental evidence is available supporting the functional importance of four of the 21 sites. It would be interesting to examine experimentally whether the rest of the sites are also functionally important. Such experiments will also help computational biologists to improve predictive methods.

Fig. I. The site-specific profile for predicting crucial amino acid residues responsible for the functional divergence between CED-3 and the ICE subfamilies, measured by the posterior probability P(S1|X). The arrows point to four amino acid residues at which functional divergence between two subfamilies has been verified by experiments. Reprinted, with permission, from [74]. [Plot axes: alignment position (x, 1–191) against P(S1|X) (y, 0.2–1).]

Contributions of gene duplication to genomic and organismal evolution

The most obvious contribution of gene duplication to evolution is providing new genetic material for mutation, drift and selection to act upon, the result of which is specialized or new gene functions. Without gene duplication,

the plasticity of a genome or species in adapting to changing environments would be severely limited, because no more than two variants (alleles) exist at any locus within a (diploid) individual. It seems difficult to imagine, for instance, how the vertebrate adaptive immune system (with dozens of duplicated immunoglobulin genes) could have evolved without gene duplication.

Gene duplication has probably also contributed to the evolution of gene networks in such a way that sophisticated expression regulations can be established [59]. An interesting case is eyeless (also known as Pax6), the master control gene of eye development in metazoans. This gene [...]

[...] total number of human genes [61] and 6 is the approximate time in MY since the human–chimpanzee split. A more conservative estimate of ~720 gene duplications can be obtained using the recent result from human segmental duplications [62,63]. Although many of these duplicated genes might have become pseudogenes, it is possible that some acquired new functions. Identification of human-specific gene duplications might help pinpoint the genetic basis of human-unique features. With the human genome sequence now available, such studies should be feasible. In fact, several potential cases of human-specific gene duplication are known [64]. It is also possible, as Lynch has suggested [63], that differential gene duplication and pseudogenization in geographically isolated populations causes reproductive isolation and speciation, although this intriguing hypothesis awaits empirical evidence. Box 3 lists several outstanding questions on the evolution of gene duplication that I believe are important. It can be expected that, with an explosive increase in genomic data and rapid advances in molecular genetic technology, the manifold and fundamental roles of gene duplication will become even more evident and the once imaginative idea of evolution by gene duplication will be established as one of the cornerstones of evolutionary biology.

Box 3. Outstanding questions about the evolution of duplicate genes

1. What is the relative importance of positive darwinian selection and relaxation of purifying selection in functional divergence of duplicated genes? In spite of many case studies, a general pattern is still lacking. This is likely to be a long-standing question, because sequence analysis, even at the genomic level, only provides a partial answer because of the inefficiency of current statistical methods in detecting selection.
2. How does an entirely new function originate after gene duplication? More detailed molecular studies of model gene families are needed to look into the emergence of novel gene function.
3. What roles does gene duplication play in the establishment of complex gene expression networks and protein–protein interaction networks, which are key characteristics of biological systems? A few studies have been conducted using genomic data from model organisms [59,75,76], but more comprehensive studies of gene duplication in the evolution of gene/protein networks are needed.
4. How does genetic buffering function? Biological systems are quite robust and deletions of some genes often do not show severe phenotypes. Is this because of the existence of duplicate genes that share functional similarity with the deleted genes or because of redundant metabolic networks? A recent study using mammalian gene sequences suggested that gene duplication is as important as redundant metabolic networks [77]. Another study using yeast gene knockout experiments showed that, on average, deletions of genes that have closely related duplicates exhibit less severe phenotypes than do deletions of genes with distantly related duplicates or without duplicates [78]. These studies provide the first evidence that gene duplication plays a key role in making biological systems robust against genetic turbulence. Related to this issue, one wonders whether functional redundancy of duplicated genes is adaptive or simply a result of structural constraints brought by a common origin.
5. How important is gene duplication to the origin of species-specific features and speciation? Answering this question requires empirical data from closely related species, particularly the molecular details of the genes involved in reproductive isolation.

Acknowledgements

I thank Lizhi Gao, Xun Gu, David Mindell, David Webb and three anonymous reviewers for valuable comments. This work was supported by a startup fund and a Rackham research fellowship of the University of Michigan and an NIH grant (R01-GM67030) to J.Z.

References

1 Bridges, C.B. (1936) The Bar ‘gene’ a duplication. Science 83, 210–211
2 Stephens, S.G. (1951) Possible significance of duplication in evolution. Adv. Genet. 4, 247–265
3 Ohno, S. (1967) Sex Chromosomes and Sex-Linked Genes, Springer
4 Nei, M. (1969) Gene duplication and nucleotide substitution in evolution. Nature 221, 40–42
5 Ohno, S. (1970) Evolution by Gene Duplication, Springer
6 Wolfe, K. (2001) Yesterday’s polyploids and the mystery of diploidization. Nat. Rev. Genet. 2, 333–341
7 Friedman, R. and Hughes, A.L. (2001) Pattern and timing of gene duplication in animal genomes. Genome Res. 11, 1842–1847
8 Spring, J. (2002) Genome duplication strikes back. Nat. Genet. 31, 128–129
9 Lynch, M. and Conery, J.S. (2000) The evolutionary fate and consequences of duplicate genes. Science 290, 1151–1155
10 Li, W.H. (1997) Molecular Evolution, Sinauer
11 Li, W.H. et al. (2001) Evolutionary analyses of the human genome. Nature 409, 847–849
12 Gu, Z. et al. (2002) Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol. Biol. Evol. 19, 256–262
13 Lespinet, O. et al.
(2002) The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 12, 1048–1059 was duplicated in Drosophila, and its paralog, twin of 14 Mombaerts, P. (2001) The human repertoire of odorant receptor genes eyeless, now regulates the expression of eyeless [60]. and pseudogenes. Annu. Rev. Hum. Genet. 2, 493–510 Species-specific gene duplication can also lead to species- 15 Zhang, X. and Firestein, S. (2002) The olfactory receptor gene specific gene functions, which might facilitate species- superfamily of the mouse. Nat. Neurosci. 5, 124–133 16 Conant, G.C. and Wagner, A. (2002) GenomeHistory: a software tool specific adaptation, as exemplified in [41] (Box 1). In other and its application to fully sequenced genomes. Nucleic Acids Res. 30, words, gene duplication contributes to species divergence 3378–3386 and origins of species-specific features. For example, if the 17 Long, M. (2001) Evolution of novel genes. Curr. Opin. Genet. Dev. 11, gene duplication rate of 1 gene21 100 MY21 [9] is used, 673–680 I estimate that there have been 1 £ 30 000 £ 6=100 ¼ 1800 18 Samonte, R.V. and Eichler, E.E. (2002) Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet. 3, 65–72 gene duplications in the human genome since humans 19 Kimura, M. (1983) The Neutral Theory of Molecular Evolution, diverged from chimpanzees. Here, 30 000 is the estimated Cambridge University Press http://tree.trends.com 298 Review TRENDS in Ecology and Evolution Vol.18 No.6 June 2003

20 Hughes, A.L. and Nei, M. (1989) Evolution of the major histocompatibility complex: independent origin of nonclassical class I genes in different groups of mammals. Mol. Biol. Evol. 6, 559–579
21 Nei, M. et al. (2000) Purifying selection and birth-and-death evolution in the ubiquitin gene family. Proc. Natl. Acad. Sci. U. S. A. 97, 10866–10871
22 Walsh, J.B. (1995) How often do duplicated genes evolve new functions? Genetics 139, 421–428
23 Lynch, M. et al. (2001) The probability of preservation of a newly arisen gene duplicate. Genetics 159, 1789–1804
24 Harrison, P.M. et al. (2002) Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res. 12, 272–280
25 Harrison, P.M. et al. (2001) Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res. 29, 818–830
26 Rouquier, S. et al. (2000) The olfactory receptor gene repertoire in primates and mouse: evidence for reduction of the functional fraction in primates. Proc. Natl. Acad. Sci. U. S. A. 97, 2870–2874
27 Ota, T. and Nei, M. (1995) Evolution of immunoglobulin VH pseudogenes in chickens. Mol. Biol. Evol. 12, 94–102
28 Trabesinger-Ruef, N. et al. (1996) Pseudogenes in ribonuclease evolution: a source of new biomacromolecular function? FEBS Lett. 382, 319–322
29 Breukelman, H.J. et al. (1998) Secretory ribonuclease genes and pseudogenes in true ruminants. Gene 212, 259–268
30 Kleineidam, R.G. et al. (1999) Seminal-type ribonuclease genes in ruminants, sequence conservation without protein expression? Gene 231, 147–153
31 Piontkivska, H. et al. (2002) Purifying selection and birth-and-death evolution in the histone H4 gene family. Mol. Biol. Evol. 19, 689–697
32 Hurst, L.D. and Smith, N.G.C. (1998) The evolution of concerted evolution. Proc. R. Soc. Lond. Ser. B 265, 121–127
33 Nowak, M.A. et al. (1997) Evolution of genetic redundancy. Nature 388, 167–171
34 Jensen, R.A. (1976) Enzyme recruitment in the evolution of new function. Annu. Rev. Microbiol. 30, 409–425
35 Orgel, L.E. (1977) Gene duplication and the origin of proteins with novel functions. J. Theor. Biol. 67, 773
36 Hughes, A.L. (1994) The evolution of functionally novel proteins after gene duplication. Proc. R. Soc. Lond. Ser. B 256, 119–124
37 Force, A. et al. (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545
38 Wagner, A. (2000) Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl. Acad. Sci. U. S. A. 97, 6579–6584
39 Gu, Z. et al. (2002) Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18, 609–613
40 Hughes, A.L. (1999) Adaptive Evolution of Genes and Genomes, Oxford University Press
41 Zhang, J. et al. (2002) Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat. Genet. 30, 411–415
42 Zhang, J. et al. (1998) Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. U. S. A. 95, 3708–3713
43 Rosenberg, H.F. (1995) Recombinant human eosinophil cationic protein. Ribonuclease activity is not essential for cytotoxicity. J. Biol. Chem. 270, 7876–7881
44 Yokoyama, S. and Yokoyama, R. (1989) Molecular evolution of human visual pigment genes. Mol. Biol. Evol. 6, 186–197
45 Asenjo, A.B. et al. (1994) Molecular determinants of human red/green color discrimination. Neuron 12, 1131–1138
46 Dykhuizen, D. and Hartl, D.L. (1980) Selective neutrality of 6PGD allozymes in E. coli and the effects of genetic background. Genetics 96, 801–817
47 Li, W.H. (1983) Evolution of duplicate genes and pseudogenes. In Evolution of Genes and Proteins (Nei, M. and Koehn, R.K., eds) pp. 14–37, Sinauer
48 Ohta, T. (1994) Further examples of evolution by gene duplication revealed through DNA sequence comparisons. Genetics 138, 1331–1337
49 Van de Peer, Y. et al. (2001) The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. J. Mol. Evol. 53, 436–446
50 Tanaka, T. and Nei, M. (1989) Positive darwinian selection observed at the variable-region genes of immunoglobulins. Mol. Biol. Evol. 6, 447–459
51 Duda, T.F. Jr and Palumbi, S.R. (1999) Molecular genetics of ecological diversification: duplication and rapid evolution of toxin genes of the venomous gastropod Conus. Proc. Natl. Acad. Sci. U. S. A. 96, 6820–6823
52 Zhang, J. et al. (2000) Evolution of the rodent eosinophil-associated RNase gene family by rapid gene sorting and positive selection. Proc. Natl. Acad. Sci. U. S. A. 97, 4701–4706
53 Hughes, A.L. et al. (2000) Adaptive diversification within a large family of recently duplicated, placentally expressed genes. Proc. Natl. Acad. Sci. U. S. A. 97, 3319–3323
54 Merritt, T.J. and Quattro, J.M. (2001) Evidence for a period of directional selection following gene duplication in a neurally expressed locus of triosephosphate isomerase. Genetics 159, 689–697
55 Gu, X. (1999) Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16, 1664–1674
56 Dermitzakis, E.T. and Clark, A.G. (2001) Differential selection after duplication in mammalian developmental genes. Mol. Biol. Evol. 18, 557–562
57 Knudsen, B. and Miyamoto, M.M. (2001) A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. U. S. A. 98, 14512–14517
58 Gaucher, E.A. et al. (2002) Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem. Sci. 27, 315–321
59 Wagner, A. (1994) Evolution of gene networks by gene duplications: a mathematical model and its implications on genome organization. Proc. Natl. Acad. Sci. U. S. A. 91, 4387–4391
60 Czerny, T. et al. (1999) Twin of eyeless, a second Pax-6 gene of Drosophila, acts upstream of eyeless in the control of eye development. Mol. Cell 3, 297–307
61 Waterston, R.H. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562
62 Bailey, J.A. et al. (2002) Recent segmental duplications in the human genome. Science 297, 1003–1007
63 Gagneux, P. and Varki, A. (2001) Genetic differences between humans and great apes. Mol. Phylogenet. Evol. 18, 2–13
64 Lynch, M. (2002) Gene duplication and evolution. Science 297, 945–947
65 Himmelreich, R. et al. (1996) Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24, 4420–4449
66 Tomb, J.F. et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539–547
67 Rubin, G.M. et al. (2000) Comparative genomics of the eukaryotes. Science 287, 2204–2215
68 Klenk, H.P. et al. (1997) The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370
69 The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815
70 Kay, R.N.B. and Davies, A.G. (1994) Digestive physiology. In Colobine Monkeys: Their Ecology Behaviour and Evolution (Davies, A.G. and Oates, J.F., eds) pp. 229–250, Cambridge University Press
71 Barnard, E.A. (1969) Biological function of pancreatic ribonuclease. Nature 221, 340–344
72 Beintema, J.J. (1990) The primary structure of langur (Presbytis entellus) pancreatic ribonuclease: adaptive features in digestive enzymes in mammals. Mol. Biol. Evol. 7, 470–477
73 Gu, X. and Vander Velden, K. (2002) DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18, 500–501
74 Wang, Y. and Gu, X. (2001) Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158, 1311–1320
75 Wagner, A. (2001) The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol. Biol. Evol. 18, 1283–1292
76 Bhan, C. et al. (2002) A duplication growth model of gene expression networks. Bioinformatics 18, 1486–1493
77 Kitami, T. and Nadeau, J.H. (2002) Biochemical networking contributes more to genetic buffering in human and mouse metabolic pathways than does gene duplication. Nat. Genet. 32, 191–194
78 Gu, Z. et al. (2003) Role of duplicate genes in genetic robustness against null mutations. Nature 421, 63–66