TECHNICAL COMMENTS

www.sciencemag.org SCIENCE VOL 293 31 AUGUST 2001 1551a

Gene Duplication and Evolution

Lynch and Conery (1) presented one of the first serious efforts to study the evolutionary fate of gene duplication using genomic sequence data. Their analysis led to several interesting observations, particularly with respect to the rate of gene duplication in eukaryotic genomes and the subsequent half-life of duplicates. These two parameters are of particular importance in studying the evolutionary processes of gene duplication and subsequent functional divergence. The most frequent class of duplications appeared to be similar in all six [...], which suggests some silencing process for old duplicates. Several additional considerations in the analysis and interpretation, however, might have led to some different conclusions.

First, Lynch and Conery (1) used the number of substitutions per silent site, S, to measure the age of a duplicate-gene pair [figure 2 of (1)]. It is unclear, however, that silent divergence is a suitable proxy for a molecular clock involving different genes or gene duplicates. For example, Zeng et al. (2) reported 9- to 15-fold differences in S values and a flat distribution of S for 24 single-copy genes in Drosophila. Two points are important in this context: (i) this large variation in S is expected when the divergence time is low; and (ii) the divergence time for each comparison made by Zeng et al. (2) was fixed. Thus, for different genes, S may vary by more than an order of magnitude given a fixed divergence time. This situation differs from the description of divergence time using S values from homologous genes across a group of [...], in which a dependable molecular clock may exist. The same S values may represent duplicates of very different ages, and the different S values may be from duplicates of the same or similar ages. Thus, figure 2 of (1) should be viewed with caution as a description of the age distribution of gene duplications. A related issue is the reliability of estimates of S, because many of the values presented by Lynch and Conery (1) were larger than 1. Estimates larger than 1 are associated with a large variance due to saturation of substitutions and should generally be considered unreliable (3).

Second, the calculation of the half-life of gene duplicates was based on the untested, hidden assumption that the rate of gene duplication is constant over evolutionary time, an assumption implicit in both figure 3 and equation 3 of (1). Unfortunately, there are insufficient data with which to estimate the variation in the rate of gene duplication on a short time scale; nevertheless, there is some evidence that the duplication rate for some families may indeed not be stationary over a short evolutionary time. For example, in the mouse Sp100-rs family, a short lineage of Mus musculus has created at least 60 gene duplicates within 1.7 million years; other lineages such as the sibling taxa Mus caroli, a group that diverged 2.5 million years ago, contain few duplicates (4). If the duplication rate over the time during which divergence is observed is much lower than the recent rate of duplication, the half-life calculated by Lynch and Conery would represent a serious underestimate.

Finally, an alternative interpretation for the short half-life of duplicate genes before silencing may deserve consideration. Assuming that small values of S may more reliably reflect a short evolutionary time, the authors chose to estimate the half-life of duplicate genes only from gene pairs with S values in the range of 0 to 0.25. They estimated a mean half-life of 4 million years, concluding that “the fate awaiting most gene duplications appears to be silencing rather than preservation,” and, hence, that “duplicate genes may only rarely evolve new functions.” Yet their analysis appears to have ignored several important features of the data [figure 2 of (1)]. (i) Notwithstanding their model of “young” duplicates, the tails of the distribution are long and flat, which suggests that the data are actually heterogeneous. (ii) The proportions of the duplications that reside in the tails are high: 85% for Drosophila melanogaster, 66% for [...], and 65% for Saccharomyces cerevisiae. (iii) The tails include old and ancient duplications. The heterogeneity of the age distribution in figure 2 of (1) suggests that the short half-life calculated from young duplicate-gene pairs cannot be extended to most pairs. After all, a large proportion of these older duplicates may be much older than 4 million years, with real ages of tens or hundreds of millions of years. It is likely that these genes have been functional since their origin; otherwise, the duplicate sequences would have been deleted from the genome (5).

In addition, the absolute number of old or ancient gene duplicates is relatively large. For example, 40% of the approximately 13,600 coding sequences in the D. melanogaster genome appear to have arisen by gene duplication (6). Thus, some 34% of the fly genome, or 4624 genes [40% × 85% × 13,600, with the 85% from item (ii), above], comprise old or ancient duplicates. It is therefore misleading to assert that the vast majority of gene duplicates are quickly silenced, even if the calculation of the half-life is correct. Rather, it appears that the accumulation of “survivors” of the silencing process constitutes a large fraction of modern eukaryotic genomes.

An analogy for the application of half-life is the mortality of newborns centuries ago: At that time the infant mortality rate was very high, because medical science was underdeveloped, but just because the “half-life” of newborns is short, it does not follow that half of all adults will die shortly. We suggest that figure 2 of (1) supports a conclusion opposite to the one that Lynch and Conery drew: A large proportion of duplicate genes either have evolved new functions (7) or have been maintained by subfunctionalization (8, 9) or other mechanisms.

Manyuan Long
Department of Ecology and Evolution
University of Chicago
1101 East 57th Street
Chicago, IL 60637, USA

Kevin Thornton
Committee on [...]
University of Chicago

References
1. M. Lynch, J. S. Conery, Science 290, 1151 (2000).
2. L.-W. Zeng, J. M. Comeron, B. Chen, M. Kreitman, Genetica 102–103, 369 (1998).
3. W.-H. Li, Molecular Evolution (Sinauer Associates, Sunderland, MA, 1997).
4. D. Weichenhan, B. Kunze, W. Traut, H. Winking, Cytogenet. Cell Genet. 80, 226 (1998).
5. D. A. Petrov, E. R. Lozovskaya, D. L. Hartl, Nature 384, 346 (1996).
6. G. M. Rubin et al., Science 287, 2204 (2000).
7. W. Wang, J. Zhang, C. Alvarez, A. Llopart, M. Long, Mol. Biol. Evol. 17, 1294 (2000).
8. A. Force et al., Genetics 151, 1531 (1999).
9. M. Lynch, A. Force, Genetics 154, 459 (2000).

21 December 2000; accepted 21 June 2001

Lynch and Conery (1) have proposed a number of provocative hypotheses regarding the evolution of duplicate genes, using data from nine eukaryotic species. One hypothesis is that the ratio of replacement (R) to silent (S) nucleotide substitutions among recently duplicated genes is near 1.0, the neutral expectation. Their analysis indicates that this phase of relaxed selection is confined to recently duplicated gene pairs. Another hypothesis is that many duplicate-gene pairs are short-lived, with half-lives of 3 to 7 million years, depending on the species.

Unfortunately, their conclusions are compromised by the fact that their data, obtained through GenBank taxon searches, included many redundant records. For example, 43.3% of the gene pairs in their Arabidopsis data set had no synonymous differences (S = 0). We randomly examined 50 of these gene pairs and found that 86% were derived from the same genomic sequence, mostly because of the presence of a single gene on two overlapping clones. These redundant sequences were used to estimate the rate at which duplicate-gene pairs reverted to single copies, a procedure that tended to overestimate the rate of gene loss. Such problems were not limited to Arabidopsis; 58.3% of human gene pairs and 67.7% of mouse gene pairs had R = S = 0. Because Lynch and Conery recognized the potential problem of redundancy, human and mouse gene pairs with S < 0.01 were not used in their analyses. In many cases, however, both gene sequences from an S < 0.01 pair were compared with a more distant gene family member, which did result in the use of redundant data entries.

Also problematic are the mammalian gene pairs in the 0.01 < S < 0.05 class, which were crucial to the conclusion by Lynch and Conery (1) that selective constraint is temporarily relaxed after gene duplication. We manually inspected 20 pairs from each species and found that 50% in human and 80% in mouse are actually allelic or alternatively spliced forms of the same locus. Allelism was determined primarily by GenBank annotation, which provided the same gene name for different sequence entries. These observations suggest that several data sets were problematic and cast doubt on the value of the analyses presented by Lynch and Conery.

There are additional problems with the approach used by Lynch and Conery (1). First, the authors applied an exponential-decay model to estimate the rate of gene turnover, assuming a steady state between the origin and loss of duplicated pairs. In yeast and Arabidopsis, however, this assumption has clearly been violated by episodic, large-scale genomic duplication events (2, 3). Second, their model failed to account for the fact that the number of pairwise comparisons within gene families can be substantially larger than the number of actual duplication events. To estimate the rate of gene loss, one needs to know the distribution of the latter, not the former. Finally, the authors proposed a curvilinear model for the relationship of R to S, but failed to test that model against the null hypothesis that R is a simple linear function of S. In our reanalyses, we found that the curvilinear model fits significantly better than the linear model for all nine species, but we obtained substantially different parameter estimates, with smaller sums of squares than those reported. Our results, which appear to support the curvilinear model, will require independent verification in light of the problems with several data sets.

Although they have not succeeded in demonstrating empirical support for all of their hypotheses, Lynch and Conery nonetheless have offered a variety of stimulating ideas (the apparently high rate of gene duplication, the role of duplication in chromosomal repatterning, and the role of gene duplication in reproductive isolation between species) that call for further investigation.

Liqing Zhang
Brandon S. Gaut
Department of Ecology and [...]
University of California, Irvine
Irvine, CA 92697, USA

Todd J. Vision
USDA-ARS Center for Agricultural [...]
Cornell University
Ithaca, NY 14853, USA

References
1. M. Lynch, J. S. Conery, Science 290, 1151 (2000).
2. K. H. Wolfe, D. C. Shields, Nature 387, 708 (1997).
3. T. J. Vision, D. G. Brown, S. D. Tanksley, Science 290, 2114 (2000).

5 March 2001; accepted 21 June 2001

Response: Our understanding of the evolutionary dynamics of duplicate genes stems largely from a handful of studies on gene families known to have functions that may enhance the likelihood of evolutionary diversification [see references in (1)]. These studies, though important, may lead to bias in our understanding of the usual fate of gene duplicates. To avoid this problem, our study (1) exploited the information inherent in fully sequenced genomes to evaluate the average evolutionary properties of the members of duplicate-gene pairs. Although this alternative approach glossed over the specific properties of individual gene pairs, the emergent patterns provided a broad description of the types of observations that a general theory for the evolution of duplicate genes must explain.

Long and Thornton have three concerns with our analyses. First, they argue that variation among genes in the rate of nucleotide substitution at silent sites reduces the reliability of the number of substitutions per silent site, S, as an indicator of the age of a duplicate pair. As stated in (1), we confined our analyses of the birth and death rates of gene duplicates to pairs with S < 0.25 because of the high statistical uncertainty associated with large estimates of S. Long and Thornton state that this problem may also be serious for pairs of duplicates with low S, but with only two exceptions, the study to which they refer (2) showed that the range of S for the 24 genes studied in the youngest species pair (Drosophila subobscura and D. pseudoobscura) is 0.09 to 0.50. An unknown fraction of this variation must simply be due to statistical sampling error. In any event, variation among loci in the rate of silent-site substitution would have the effect of dispersing the data points in figures 2 and 3 of (1) over the horizontal axis relative to the positions expected on the basis of actual ages of the pairs. This would cause our half-life estimates to be upwardly biased, but would not alter the basic conclusions reported in (1).

Second, Long and Thornton argue that our demographic analysis of gene duplicates made the hidden assumption of a constant rate of gene duplication over time. This assumption was actually stated explicitly in (1), although strictly speaking we only assumed rate constancy over the time scale for which S < 0.25. Long-term rate constancy was not relevant to our birthrate estimates, which were simply the average values that apply over the time scale required for S to reach 0.01, perhaps the past few hundred thousand to million years for the species analyzed. Rate constancy is an important assumption underlying the use of the slope of an age distribution to estimate a half-life. We note, however, that there appears to be no intrinsic reason why our half-life estimates should be biased in one direction versus the other by temporal variation in birth or mortality rates. In contrast to the scenario painted by Long and Thornton, a recent reduction in the rate of origin of duplicates would result in a flatter age distribution and an overestimate of the half-life.

Third, Long and Thornton question our assertion that the vast majority of gene duplicates enjoy a rather short half-life, arguing that many ancient pairs of duplicates can be found in most eukaryotic genomes. In principle, the probability of loss of a duplicate gene may progressively decline once preservational events such as neofunctionalization or subfunctionalization begin to take hold. However, the argument adduced by Long and Thornton themselves about inaccuracies in the estimation of S provides the answer for why the long, flat profile at the high end of the age distribution cannot provide quantitative insight into rates of duplicate-gene silencing: because the sampling variance of S becomes increasingly large with increasing S, the age distribution will become artificially flattened at high S. Ignoring the plateau on the right, as we did, will cause a slight overestimate of the initial downward slope, which will cause a slight underestimate of the average half-life of the vast majority of duplicates that appear to be quickly silenced.

Statistical problems aside, there is a fundamental problem with using the observation that “the accumulation of ‘survivors’ of the silencing process constitutes a large fraction of modern eukaryotic genomes” to support the claim that a large proportion of newborn gene duplicates become permanently preserved. The ancient duplicates to which Long and Thornton refer are the rare survivors of duplication events that have accumulated over vast periods of time (many tens to hundreds of millions of years). In the long run, each origin of a gene must be balanced by the loss of another to prevent indefinite genome expansion or contraction.

Many of the gene duplicates that we have identified with S > 1 may have arisen by processes substantially different from the incremental single-gene duplications that we focused on in (1). Most notable is the process of complete genome duplication, the ancient remnants of which have been implicated in yeast (3) and Arabidopsis (4). Although the genomic extent is not yet understood, a massive amount of gene duplication occurred early in the vertebrate lineage (5), and we cannot rule out the possibility of similar large-scale events prior to the radiation of the [...] phyla. The probability of duplicate-gene preservation following polyploidization may be substantially elevated relative to that for single-copy duplicates for two reasons. First, as we noted previously, polyploidization maintains the dosage ratios of all pairs of genes relative to the situation in the diploid state, and selection may favor the maintenance of the ancestral stoichiometric ratios. Second, when whole genomes are duplicated, the constituent genes are guaranteed to initiate with all essential regulatory regions intact, and this may further reduce the likelihood of negative selection against new copies.

Zhang et al. argue that three of the data sets that we worked with in (1) contained flaws that may have influenced the outcome of our analyses. We agree that this issue merits close scrutiny, and at the close of this response, we will present some reanalyses for both the Arabidopsis and human genomes that take into consideration the concerns raised by Zhang et al. First, however, we respond to three technical issues raised by these authors:

1) As noted in (1), the inability to easily distinguish allelic sequences or alternatively spliced forms from duplicate genes raises potential complications with some databases. This is unlikely to be a serious problem with inbred species such as C. elegans or haploid species such as S. cerevisiae, whose genomic sequences are well annotated and curated. For outbred species, it is difficult to see how one can unambiguously resolve this issue with data sets constructed from random sequences (contrary to the suggestion by Zhang et al.), and 5% sequence divergence seems rather high for allelic variants. Nevertheless, this problem remains a serious consideration for data sets that are not highly refined. If nonduplicate sequences are inadvertently included in a survivorship analysis for duplicate loci, the estimated half-life will be unaffected so long as the incidence of such errors is independent of the degree of divergence for the genes involved in the analysis, but the estimated rate of origin of new duplicates will be inflated.

2) As noted above, the ancient genome duplications known to have occurred in Arabidopsis and yeast have no bearing on our conclusions, because the duplicate pairs associated with these events were not included in our demographic analyses.

3) The distinction raised by Zhang et al. between numbers of extant duplicate pairs and number of actual duplication events is correct and important. However, multigene families were excluded from our analyses and the vast majority of the young gene duplicates that we identified were simple pairs (in which case, there is no ambiguity with respect to event counting), so this distinction has little effect on our estimates. Nevertheless, the reanalyses presented below are based on estimates of duplication events rather than on observed numbers of duplicate pairs.

After publication of (1), a well-curated version of the Arabidopsis genome became available (6) that has eliminated most of the redundancies and ambiguities noted by Zhang et al. A complete reanalysis of the data is beyond the scope of this response and will be reported elsewhere (7); to summarize, however, using our prior methods for demographic analysis, we have estimated the rate of origin of new duplicates in Arabidopsis, based on the new data set, to be 0.0022 per gene per million years, which is of the same order of magnitude as that observed for D. melanogaster (0.0023) and yeast (0.0083), but lower than that for C. elegans (0.0208). Because the incidence of putative duplicates in the nearly identical class is greatly reduced in this newly available data set (consistent with the arguments of Zhang et al.), the half-life estimate increases from our previous value of 3.2 million years to 23.4 million years, which exceeds our previous estimates for invertebrates by a factor of seven and for mammals by a factor of three.

We are also now able to provide an estimate of the rate of origin of new duplicates in the human genome, using the database of the publicly funded project (8, 9). Our estimate, 0.0071 per gene per million years, falls in the middle of the range for other species. Our revised estimate of the half-life for human duplicate genes, 16 million years, is about double our previous estimate. However, because assembly problems probably result in the exclusion of substantial numbers of young gene duplicates from the “complete” human genomic sequence (9, 10), our estimated rate of origin of new duplicates in humans is probably downwardly biased, whereas our estimated half-life is likely upwardly biased. A recent comparison of chromosomal contents in mice and humans strongly supports our contention that a high rate of duplicate-gene turnover occurs in mammals (11).

One must be cautious to avoid overinterpreting the degree of precision associated with all of these estimates; most large-scale genome projects are still in a stage of maturation, with updated annotations being released regularly. At this point, however, we see no reason to alter our basic conclusions that the rate of origin of new duplicates in [...] is quite high, often in the range of 0.002 to 0.020 per gene per million years, and that most gene duplicates have a relatively short life-span, the average being in the neighborhood of 1 to 10 million years (with a possible exception in Arabidopsis). Functional studies will be required to determine the fraction of duplicates identifiable from coding-region identity that are actually biologically active.

Michael Lynch
Department of Biology
Indiana University
Bloomington, IN 47405, USA

John C. Conery
Department of Computer and Information Science
University of Oregon
Eugene, OR 97403, USA

References and Notes
1. M. Lynch, J. S. Conery, Science 290, 1151 (2000).
2. L.-W. Zeng, J. M. Comeron, B. Chen, M. Kreitman, Genetica 102–103, 369 (1998).
3. K. H. Wolfe, D. C. Shields, Nature 387, 708 (1997).
4. T. J. Vision, D. G. Brown, S. D. Tanksley, Science 290, 2114 (2000).
5. A. Sidow, Curr. Opin. Genet. Dev. 6, 715 (1996).
6. www.tigr.org/tdb/e2k1/ath1/
7. A complete list of data underlying the conclusions discussed in these paragraphs appears at http://csi.uoregon.edu/projects/genetics/duplications/letters.
8. www.ensembl.org
9. International Human Genome Sequencing Consortium, Nature 409, 860 (2001).
10. J. C. Venter et al., Science 291, 1304 (2001).
11. D. Paramvir et al., Science 293, 104 (2001).
12. We thank B. Haas and S. Salzberg for help in clarifying issues involving the TIGR database for Arabidopsis gene sequences and E. Birney for assistance in interpreting the Ensembl database resulting from the work of the International Human Genome Sequencing Consortium.

30 April 2001; accepted 31 July 2001
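
Two quantitative steps in the exchange above can be illustrated with a short sketch. The first is the Response's point that, under a constant birth rate and exponential loss, the slope of the age distribution on a log scale gives a loss rate and hence a half-life. The second is the Long and Thornton count of old duplicates in the fly genome. This is only an illustration under stated assumptions, not the procedure of (1), whose equation 3 is not reproduced here: the function names, the synthetic counts, and the calibration `s_per_myr` (how fast S accumulates per million years) are all hypothetical.

```python
import math


def loss_rate_from_age_distribution(s_midpoints, counts):
    """Least-squares slope of ln(count) against S, returned as a positive loss rate.

    Assumes counts decline as N(S) = N0 * exp(-d * S), which holds only if the
    birth rate is constant and loss is exponential over the S range used.
    """
    if len(s_midpoints) != len(counts) or len(counts) < 2:
        raise ValueError("need at least two (S, count) bins of equal length")
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mean_s = sum(s_midpoints) / n
    mean_log = sum(logs) / n
    sxy = sum((s - mean_s) * (y - mean_log) for s, y in zip(s_midpoints, logs))
    sxx = sum((s - mean_s) ** 2 for s in s_midpoints)
    return -sxy / sxx  # d, in units of "per unit S"


def half_life_myr(loss_rate_per_s, s_per_myr):
    """Convert a loss rate per unit S into a half-life in millions of years.

    s_per_myr is a hypothetical calibration supplied by the caller; it is not
    a value taken from the text.
    """
    half_life_in_s = math.log(2) / loss_rate_per_s
    return half_life_in_s / s_per_myr


def old_duplicate_count(total_genes, frac_duplicated, frac_in_tail):
    """Number of genes that are duplicates and fall in the tail of the age distribution."""
    return round(total_genes * frac_duplicated * frac_in_tail)


# Synthetic age distribution restricted to S < 0.25, generated with d = 10.
bins = [0.025, 0.075, 0.125, 0.175, 0.225]
synthetic_counts = [1000 * math.exp(-10 * s) for s in bins]
d = loss_rate_from_age_distribution(bins, synthetic_counts)
print(half_life_myr(d, s_per_myr=0.02))  # hypothetical calibration

# Long and Thornton's arithmetic: 40% x 85% x 13,600 coding sequences.
print(old_duplicate_count(13600, 0.40, 0.85))  # 4624
```

A flatter distribution (smaller d) yields a longer half-life, which is the direction of bias the Response attributes to a recent reduction in the rate of origin of duplicates.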
