<<

Downloaded by guest on October 4, 2021 www.pnas.org/cgi/doi/10.1073/pnas.1708151114 a Harpak Arbel of divergence duplicates the gene on effect its and the lineage on conversion gene nonallelic Frequent eeconversion gene plays it duplicates. that gene role of our the evolution improves the and in work mechanism NAGC This the on duplicates. of effect understanding of average small divergence a has sequence NAGC the result, sequences a the As duplicate this influence. of as its of “escape” down from Despite eventual slows order an rate). NAGC (an diverge—until that show sequences converted we point rate, be the baseline to high than nucleotide higher 2.5 a of magnitude mean probability for a a and generation : bp per in 250 of NAGC length govern to tract it converted that use parameters and methods the previous estimate by sequence used in than a information sequences develop more duplicate We substantially NAGC. leverages of that con- model bias These evolution GC them. large of a 46% reflect in gene versions tracts duplicate converted survey identify we and Here, families characterized. the govern well and that not parameters diseases loci. are the genetic NAGC genetic families, various gene distinct of in at evolution plays concerted a are conversion NAGC acceptor from role gene the the sequence nonallelic Despite and In genetic donor “acceptor.” a the an May (NAGC), of review to for copying (received region 2017 the 12, “donor” October is Hartl L. conversion Daniel Member Gene Board Editorial by accepted and Laboratory, Harbor 2017) Spring 17, Cold Siepel, Adam by Edited 94305 CA Stanford, University, Stanford Institute, Medical emd“ocre vlto”(7.NG sa meit sus- immediate an been simi- is has NAGC their highly phenomenon (17). from This evolution” be greatly “concerted (13–16). differ noticed termed can species they other was genes when in even It duplicated orthologs species, that (10–12). one within ago duplicates lar century gene a of half evolution the ing asso- genes of (5). 1% diseases in showed inherited found study with are ciated func- recent NAGC A by in introduced gene. (9)—resulting acceptor that the splicing of 7), aberrant impairment (6, genes— tional or duplicated (8), nonsynonymous tandemly frameshifting cause between —can alleles or of with case transfer the The often is (2)—as (3). duplicates similar gene tandem highly young are recombination during they aligned conversion because accidentally paral- are when gene occur sequences also nonallelic ogous can loci genetic However, distinct between Gene gene (1). (NAGC) mutation. (allelic can paste” regions AGC) nonrecipro- and and allelic conversion, another, between “copy the occurs to a in typically as sequence conversion of result one thought from can be alleles therefore it of recombination, transfer cal of part ural as conversion. strand gene called other is the process using This to strand, template. segment one a DNA on a sequence (“het- synthesizing the mismatch by overwrite This repaired chromosome. then same is the eroduplex”) of strands two the A rate mutation eateto ilg,Safr nvriy tnod A94305; CA Stanford, University, Stanford Biology, of Department ACi locniee ob oiatfrerestrict- force dominant a be to considered also is NAGC 5). 4, (2, diseases 20 over of driver a as implicated is NAGC nat- a rather but error an not is conversion gene Although rmtetohmlgu hoooe a n pon up end may originate that chromosomes alleles homologous distinct two recombination, the of from result a s | a,1,2 eeduplicates gene u Lan Xun , b,1 | iu Gao Ziyue , euneevolution sequence b,c n oahnK Pritchard K. Jonathan and , | Cbias GC b | × eateto eeis tnodUiest,Safr,C 40;and 94305; CA Stanford, University, Stanford , of Department 10 −7 1073/pnas.1708151114/-/DCSupplemental at online information supporting contains article This 2 Editorial the by invited 1 editor guest the a under is Published A.S. Submission. Direct PNAS Board. a is article This interest. of conflict no paper. declare the authors wrote The A.H. and A.H., data; tools; analyzed analytic J.K.P. new and contributed J.K.P. Z.G., and X.L., Z.G., X.L., A.H., research; performed J.K.P. ae prahs(9 4 edt els aibe AChas in mutation NAGC point than variable: faster less times 100 be to 10 to be to tend estimated been 34) (19, approaches evolutionary- based Alternatively, 33). them, (27, between variability distance experimental the and and sequences duplicate sim- the sequence of location, ilarity genomic magnitude as of such that orders rate experiments, the across eight of vary as determinants key much to as due by 29–32)—presumably of (26, vary estimates NAGC However, (28). of has pairs rate experi- base length the hundred and few tract organisms a be mean across to ments consistently The fairly 27). estimated (26, been sequences artificially genes—typically, DNA NAGC accumu- single inserted of mutation to limited with rate experiments been organisms the have lation nonhuman parameters (i) in These parameters: probed length. tract mostly key converted two the (ii) know and to need we with interplay its and NAGC mutations. in characterize other 25). evolution must (3, function ques- we duplicates and in gene sequence duplicates, for puts for expectations clocks it develop NAGC molecular To divergence, if of Importantly, fidelity sequence the 19–24). down tion 11, slowing (3, indeed paralogs is in evolution similar sequence restrict be Both may to selection. selection natural positive is and Another evolution purifying 18). concerted 14, of 13, driver (10, possible homogenizes accumulate mechanisms that mutational it differences other because through reversing evolution, by sequences concerted paralogous driving for pect uhrcnrbtos ...... n ...dsge eerh ...... and Z.G., X.L., A.H., research; designed J.K.P. and Z.G., X.L., A.H., contributions: Author stanford.edu. owo orsodnemyb drse.Eal rehsafr.d rpritch@ or [email protected] Email: addressed. be may correspondence whom work. To this to equally contributed X.L. and A.H. ua ulctsadcnetdeouini o sperva- as thought. not previously is as of sur-sive evolution divergence a concerted sequence has average duplicates—and the NAGC the human than rate, on faster effect high times small this 20 prisingly Despite is rate. humans mutation in point the NAGC that of find and rate We NAGC duplicates. char- baseline gene study well human to for not consequences tools are its statistical NAGC developed any govern We that eliminate acterized. parameters to evo- impor- the its acts “concerted Despite tance, the it them. between drive because accumulate that to duplicates than differences thought gene more also of of is driver lution” It a is diseases. (NAGC) 20 conversion gene Nonallelic Significance natmtn oln ACmttost euneevolution, sequence to mutations NAGC link to attempting In a,b,c,2 NSlicense. PNAS . www.pnas.org/lookup/suppl/doi:10. NSEryEdition Early PNAS c oadHughes Howard | f6 of 1

EVOLUTION Saccharomyces cerevisiae (35), Drosophila melanogaster (36, 37), A NAGC affects divergence patterns and human (19, 38–40). These estimates are typically based on single loci (but see refs. 41, 42). Recent family studies (43–45) G duplication −6 point have estimated the rate of AGC to be 5.9×10 per base pair per mutation generation. This is likely an upper bound on the rate of NAGC, G T GGspeciation since NAGC requires a misalignment of homologous chromo- NAGC somes during recombination, while AGC does not. T Here, we estimate the parameters governing NAGC with a sequence evolution model. Our method is not based on direct GTT G empirical observations, but it leverages substantially more infor- macaque gene 1 human gene 1 human gene 2 macaque gene 2 (m1) (h1) (h2) (m2) mation than previous experimental and computational methods: We use data from a large set of segmental duplicates in mul- B Genealogy changes (hidden) tiple species, and exploit information from a long evolutionary Null tree NAGC tree history. We estimate that the rate of NAGC in newborn dupli- NAGC initiation cates is an order of magnitude higher than the point mutation rate in humans. Surprisingly, we show that this high rate does not necessarily imply that NAGC distorts the molecular clock. NAGC tract ends Results m1 h1 h2 m2 unlikely m1 h1 h2 m2 To investigate NAGC in duplicate sequences across primates, we used a set of gene duplicate pairs in humans that we had assem- bled previously (46). We focused on young pairs where we esti- C Divergence patterns likely unlikely likely mate that the duplication occurred after the human–mouse split, (observed) h1 A T T and identified their orthologs in the reference genomes of chim- h2 C T T panzee, , , macaque, and mouse. We required m1 A G G that each gene pair have both orthologs in at least one nonhu- m2 C T G man primate and exactly one ortholog in mouse. Since our infer- ence methods implicitly assume neutral sequence evolution, we D Inferred genealogy map focused our analysis on intronic sequence at least 50 bp away Inferred NAGC tract Intron boundary from intron–exon junctions. After applying these filters, our data ****** **** ***** * consisted of 97, 055 bp of sequence in 169 intronic regions from 040kbCYP2C9 and CYP2C19 75 gene families (SI Appendix). We examined divergence patterns (the partition of alleles in ** **** ******** * gene copies across primates) in these gene families. We noticed 0 FCN1 and FCN2 4kb that some divergence patterns are rare and clustered in spe- Fig. 1. NAGC alters divergence patterns. (A) NAGC can drive otherwise rare cific regions. We hypothesized that NAGC might be driving this divergence patterns, like the sharing of alleles between paralogs but not clustering. To illustrate this, consider a family of two duplicates orthologs. (B) An example of a local change in genealogy, caused by NAGC. in human and macaque which resulted from a duplication fol- (C) Examples of divergence patterns in a small multigene family. Some diver- lowed by a speciation event—as illustrated in Fig. 1B (“Null gence patterns—such as the one highlighted in purple—were both rare and tree”). Under this genealogy, we expect certain divergence pat- spatially clustered. We hypothesized that underlying these changes are local terns across the four genes to occur more frequently than oth- changes in genealogy, caused by NAGC. (D) Genealogy map (null geneal- ers. For example, the gray sites in Fig. 1C can be parsimoniously ogy marked by white, NAGC marked by purple tracts) inferred by our HMM explained by one substitution under the null genealogy. They based on observed divergence patterns (stars). Two different gene families should therefore be much more common than purple sites, as are shown. For simplicity, only the most informative patterns (purple and gray sites, as exemplified in C) are plotted. purple sites require at least two mutations. However, if we con- sider sites in which an NAGC event occurred after speciation (Fig. 1A and “NAGC tree” in Fig. 1B), our expectation for diver- gence patterns changes: Now, purple sites are much more likely Applying our HMM, we identified putatively converted tracts than gray sites. in 18/39 (46%) of the gene families, affecting 25.8% of the intronic sequence (Fig. 2A; see complete list of identified tracts in Mapping Recent NAGC Events. We developed a Hidden Markov Datasets S1–S4). Previous studies estimate that only several per- Model (HMM) which exploits the fact that observed local cent of the sequence is affected by NAGC, but the definition of changes in divergence patterns may point to hidden local changes “affected sequence” statistic is arguably method-dependendent in the genealogy of a (Fig. 1 B and C). In our model, and therefore not directly comparable (41, 51, 52). Fig. 1D shows genealogy switches occur along the sequence at some rate; the an example of the maximum likelihood genealogy maps for two likelihood of a given divergence pattern at a site then depends gene families. The average length of the detected converted only on its own genealogy and nucleotide substitution rates. Our tracts is 880 bp (Fig. 2B). As previously discussed for other meth- method is similar to others that are based on incongruency of ods (27), this is likely an overestimate of the mean tract length of inferred genealogies along a sequence (47–49), but it is model- NAGC, because some identified NAGC tracts result from mul- based and robust to substitution rate variation across genes (SI tiple NAGC events occurring in close proximity (SI Appendix, Appendix). Fig. S2). We applied the HMM to a subset of the gene families When an AT/GC heteroduplex DNA arises during AGC, that we described above: families of four genes consisting of it is preferentially repaired toward GC alleles (53, 54). We two duplicates in human and a nonhuman primate. Since the sought to examine whether the same bias can be observed for HMM assumes that the duplication preceded the speciation, we NAGC (53, 55–57). We found that converted regions have a required that the overall intronic divergence patterns support this high GC content (percentage of bases that are either guanine genealogy, using the software MrBayes (50). This requirement or cytosine): 48.9%, compared with 39.6% in matched uncon- decreased the number of gene families considered to 39. verted regions (p = 4 × 10−5, two-sided t test; Fig. 2C). This

2 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1708151114 Harpak et al. Downloaded by guest on October 4, 2021 Downloaded by guest on October 4, 2021 apke al. et Harpak eecneso htw osdri iuainsuy(8 59), (58, study simulation Among a biased in 44). underlie consider (43, could we AGC that that conversion for mechanisms gene estimated repair bias possible estimate GC several Our the heteroduplexes. with strong/weak agrees correcting when les of probability a to corresponds ence Appendix SI isdACaoe(xc ioiltest binomial (exact alone AGC an biased with compared that changes, strong of found to expectation We weak were both nucleotides. changes (G/C) involving these we of strong substitutions consideration, and such parsimony (A/T) of a weak pat- directionality Using divergence the 1C). “purple” (Fig. inferred by the before substitution with as nucleotide sites tern of the evidence are sites strongest NAGC—these on the focused we carry preferentially GC, that NAGC toward whether heteroduplexes test AT/GC To driver repairs NAGC. a of be for result could observed a difference and/or the previously However, been (55). paralogs has histone difference composition base bias. GC interval no confidence lines with binomial model colored 95% “tract” a solid the is The for area differ- gray-shaded conversion. three The gene fits. under linear biased results show simulation of show models dots mechanistic color favor ent The in . heteroduplex GC/AT G/C around a the resolving errors of of standard probability estimated two the show shows (E bars estimates. Error point for contents. the accounting GC derive after line) to different (pink used NAGC their we unbiased which for through regions, proportion substitutions expected unconverted AT→GC the in of AGC and proportion mutations estimated point the shows sites The bar purple substitutions. GC→AT left In than common (D) more significantly paralogs. are intronic substitutions identified the no shows with 1C line (Fig. genes Shaded black human tract). The for focal intervals. bp average the confidence 200 (excluding symmetric 95% regions for show respective content regions the GC show at for lines centered gene average The bins same the regions. the within converted shows and the dot length as gray in The matched regions, regions. unconverted converted random in content (C GC distribution. age length Tract (B) intron. per 2. Fig. C B A htaems ieyt eadrc euto AC(ih a) AT→GC bar), (right NAGC of result direct a be to likely most are that ) ubro tracts of Number (A) tracts. converted HMM-inferred of Properties .W siaeta hsosre differ- observed this that estimate We 2D). Fig. and 44% on siaeo Cba.Tedse upeline purple dashed The bias. GC of estimate Point ) hog on uaindfeecsadGC- and differences mutation point through E D 67. h upedtsosteaver- the shows dot purple The ) 3% p 7.5 = nfvro togalle- strong of favor in × 10 −7 n see and , 61% .Frt ecniee oe hr ACat tthe at acts NAGC where model a considered we First, Appendix). sequence? in we diverge should to 65), never (64, duplicates divergence NAGC gene by the expect slower divergence If much of is elimination arises: mutation the question than point light the through In infer, sequences paralogous we duplication. of rate post high paralogs the of of Sequence dynamics Neutral divergence on the NAGC of Effect The Divergence. Young? Stay Fast, Live experi- accumulation (27). diseases mutation NAGC-driven with NAGC and and ments many 63) of (43, mean AGC metaanalysis a for a estimates than estimated with slower simultaneously CI)—consistent magnitude strap of We of length 44). order tract (43, an NAGC rate is AGC and 27) the polymor- (19, using data sizes sample phism smaller on based with accords estimates estimate esti- previous This 3D). We Fig. be generation. CI, bootstrap to per nonparametric rate pair this base per mate converted our is nucleotide limited therefore and most with genes introns be to would diverged estimation model parameter lowly the the as that to con- concluded stop, a applicable we complete assumes rate, its model NAGC our or Since stant rate, sequence. NAGC in in diverge duplicates slowdown as a decrease to estimates due rate NAGC (Spearman that increases found We gene. tive ace hs siae with estimates these matched dependence the (ignoring sites pairs). of between 3B pairs likelihood (Fig. all composite over values (MLE) maximum length estimates obtained tract then mean We and Appendix). rate SI over NAGC species, available of all grid in a alleles observed the of likelihood the likeli- the in model. for two-site accounted tract the the are of no hood Unlike hits makes NAGC: multiple model of method, Our frequency detection 3A). the (Fig. on are assumptions the other site prior the to one affecting incorporate respect events to (with NAGC likely by length), close tract are mean acts NAGC consideration NAGC under the increase—while When sites paralogs. to two between acts divergence mutation decrease—sequence short, to at nucleotides In of time. pair the a a of at likelihood paralogs the and between model divergence mutation to observed able point were in both we intractable, through like- (62) is evolving NAGC full al. sequence the et a computing of McVean lihood While the and rates: by (61) recombination inspired point Hudson estimating is with guided model evolution that This rationale sequence (Methods). NAGC of we and NAGC, model mutation of two-site distribution Mutation. length a Point tract than developed the and Faster rate Magnitude the of estimate Order an Is NAGC without parameters a NAGC assumption. have implicit estimate would this to a making it describe us we that allowed Next, that 60). pervasive (28, method so only patterns divergence not applicable on is effect therefore global NAGC is where with it cases disagreement patterns; in clear intron-wide show global patterns the divergence local 59). ref. where by suggested also (as drive species mechanisms different different in that bias GC mech- suggest GC repair of could mismatch the This driver (like (58). dominant tracts anism) long the over acts that is yeast shown in repair bias been to has Appendix strand it (SI excision base Conversely, heteroduplex of the each choice is for bias the independent large which a mechanism—in such underlie repair to likely most the ecniee eea oeso eunedvrec (SI divergence sequence of models several considered We edfieNG aea h rbblt htarandom a that probability the as rate NAGC define We efis siae Lsfrec nrnsprtl,and separately, intron each for MLEs estimated first We computed we data, our in intron each in sites of pair each For conversions, recent to limited likely is HMM our of power The enx osdrteipiain forrslson results our of implications the consider next We p 2.5 1 = 250 × × 10 p([63, bp 10 −7 ds −5 ([0.8 .Ti rn slikely is trend This 3C). Fig. , 1)i xn fterespec- the of exons in (16) 1,000] ds × < 10 NSEryEdition Early PNAS 5%. oprmti boot- nonparametric −7 , 5.0 × n i.2E). Fig. and 10 −7 | ] 95% f6 of 3 and To ds

EVOLUTION tory power of these different theoretical models for synonymous A NAGC events are correlated for nearby sites divergence in human duplicates. We wished to obtain an esti- The two focal sites are far apart The two focal sites are nearby mate of the age of duplication that is independent of ds between gene A gene B gene A gene B the human duplicates; we therefore used the extent of shar- ing of both paralogs in different species as a measure of the substitutions substitutions duplication time. For example, if a duplicate pair was found in human, gorilla, and orangutan—but only one ortholog was NAGC NAGC found in macaque—we estimated that the duplication occurred at the time interval between the human–macaque split and the human–orangutan split. Except for the continuous NAGC model mouse B Datum for the two-site model (or global threshold ≥4.5%), all models displayed similar broad agreement with the data (Fig. 4). The small effect of NAGC on divergence levels is intuitive in retrospect: For identical sequences, NAGC has no effect. Once human chimp. gorilla orangutan macaque differences start to accumulate, there is only a small window of gene A gene B opportunity for NAGC to act before the paralogous sequences C Rate MLE decreases with seq. divergence D Composite likelihood estimates (ds<5%) escape from its hold. This suggests that neutral sequence diver- -5 10 gence (e.g., ds) may be an appropriate molecular clock even Spearman p=10-5 bootstrap in the presence of NAGC (as also suggested by refs. 41, 46, -6 10 1000 samples and 68).

-7 Discussion 10

Tract length (bp) Tract 100 In this work, we identify recently converted regions in humans -8 10

NAGC rate estimate rate NAGC and other primates, and estimate the parameters that govern −7 0.0 0.1 0.2 0.3 0.4 0.5 1 5 x 10 NAGC. Previously, it has been somewhat ambiguous whether NAGC rate ds (in exons) (conversions per bp generation) (conversions per bp per generation) observations were due to , abundant NAGC, or a combination of the two (3, 22, 23). Today, Fig. 3. Estimation of NAGC parameters. (A) The two-site sequence evo- equipped with genomic data, we can revisit the pervasiveness of lution model exploits the correlated effect of NAGC on nearby sites (near concerted evolution; the data in Fig. 4 suggest that, in humans, with respect to the mean tract length). In this illustration, orange squares duplicates’ divergence levels are roughly as expected from the represent focal sites. Point substitutions are shown by the red points, and accumulation of point mutations alone. When we plugged in our a converted tract is shown by the purple rectangle. (B) Illustration of a sin- gle datum on which we compute the full likelihood, composed of two sites estimates for NAGC rate, most mechanistic models of NAGC in two duplicates across multiple species (except for the mouse outgroup also predicted a small effect on neutral sequence divergence. for which only one ortholog exists). (C) MLE rate estimates for each intron This result suggests that neutral sequence divergence may be an (orange points). MLEs of zero are plotted at the bottom. The solid line shows appropriate molecular clock even in the presence of NAGC. a natural cubic spline fit. The rate decreases with sequence divergence (ds). One important topic left for future investigation is the varia- We therefore only use lowly diverged genes (ds ≤ 5%) to get point estimates tion of NAGC parameters. Our model assumes constant action of the baseline rate. (D) Composite likelihood estimates. The black point is of NAGC through time and across the genome to get a robust centered at our point estimates for ds ≤ 5% genes. The blue points show 1,000 nonparametric bootstrap estimates, where the intensity of each point corresponds to the number of bootstrap samples. The corresponding 95% marginal confidence intervals are shown by black lines. Small effect of NAGC on the divergence of duplicates

75% constant rate that we estimated throughout the duplicates’ evolu- tion (“continuous NAGC”). In this case, divergence is expected 30% to plateau around 4.5%, and concerted evolution continues for a long time [red line in Fig. 4; in practice, there will eventually 10% be an “escape” through a chance rapid accumulation of multi- 5% ple mutations (11, 66)]. However, NAGC is hypothesized to be contingent on high sequence similarity between paralogs. data (median) We therefore considered two alternative models of NAGC no NAGC 1% local threshold dynamics: first, a model in which NAGC acts only while the global threshold 3% sequence divergence between the paralogs is below some thresh- global threshold ≥ 4.5%

old (“global threshold”); second, a model in which the initiation between duplicates Sequence divergence of NAGC at a site is contingent on perfect sequence today at a short 400-bp flanking region upstream from the site [“local human divergence time with: chimp. gorilla orang. macaque mouse opossum chicken threshold”, (2, 27, 67)]. The local threshold model yielded a sim- Duplication age (log time-scale) ilar average trajectory to that in the absence of NAGC. A global threshold of as low as 4.5% may lead to an extended period of Fig. 4. The effect of NAGC on the divergence of duplicates. The fig- concerted evolution as in the continuous NAGC model. A global ure shows both data from human paralogs and theoretical predictions of threshold of <4.5% results in a different trajectory. For example, different NAGC models. The blue line shows the expected divergence in with a global threshold of 3%, duplicates born at the time of the the absence of NAGC and the red line shows the expected divergence primates’ most recent common ancestor would diverge at 3.9% with NAGC acting continuously. The pink, orange, and red lines show the expected divergence for models in which NAGC initiation is contingent on of their sequence, compared with 5.7% in the absence of NAGC sequence similarity between the paralogs. The gray horizontal bars cor- (Fig. 4 and SI Appendix, Figs. S10–S12 show trajectories for other respond to human duplicate pairs. The duplication time for each pair is rates and threshold values). inferred by examining the nonhuman species that carry orthologs for both Lastly, we asked what these results mean for the validity of of the human paralogs. The y axis shows the synonymous sequence dissimi- molecular clocks for gene duplicates. We examined the explana- larity between the two human paralogs.

4 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1708151114 Harpak et al. Downloaded by guest on October 4, 2021 Downloaded by guest on October 4, 2021 4 mt P(94 nqa rsoe n h vlto fmlieefamilies. multigene of evolution the and crossover Unequal (1974) GP Smith 14. 3 mt P odL ic M(91 nioydiversity. Antibody (1971) WM Fitch L, Hood GP, Smith 13. genes duplicated of evolution non-neutral and Neutral (2011) H Innan JA, Fawcett 11. (1987) M Nei 10. 2 Hartas 12. apke al. et Harpak e eeain(4 n AC h rbblt fast en ovre per converted being site a of is probability generation The NAGC. and (64) generation per iinpoaiiymti.Teeaetopsil vnsta a euti a of in rate result a may at tran- occur that generation) which events mutations (per possible sites point two the transition: right are derive There and first matrix. left We probability the allele. sition each that same to respect mean the with not have defined are does necessarily 1 0000 and 0 state labels separately—the The B. site copy in site right the at state The l vector”). (“state vector four-bit a Process. l with Markov sites homogeneous four discrete these a describe as We genes duplicate two in sites lelic Matrix. Transition Model infer- Two-Site for families gene 39 and tracts. estimation converted of parameter ence for families least Appendix gene tasks, at (SI inference 75 filters two sequences method-specific the intronic of additional each applied on For we junctions. analysis intron–exon from our and away focused macaque, bp 50 We orangutan, S1). gorilla, in (Table , orthologs mouse of their a genomes identified with reference and consistent split, the pairs human–mouse young the on human after focused the We duplication using 37). (46) (build previously genome assembled reference had we human the that in genome pairs gene reference protein-coding best-matched reciprocal 1,444 of set Data. Families Gene of Methods evolution the of understanding our duplicates. further gene molec- and of the dating events, clarify the improve ular may of disease, results drive effects to our NAGC the of conversion, potential selection. measure gene natural on to as factors efforts genomic such contemporary duplicates, with gene Together of evolution shap- forces the other of ing studies future guide could alone mechanism (69). of NAGC percent pervasive experience several and duplicates. comprise highly genome, repeats the gene pervasive, terminal segmental long in example, than distributions factors For other These different sequences similarity. very homologous sequence have and also ), S9 paralogs can between Fig. recombina- distance as Appendix, physical such (SI factors context, to sequence due rate, pairs tion gene across variation exists substantial However, likely parameters. mean the of estimate .Lro L annE nrpyE,WrhB(99 igencetd nteSMN the in nucleotide single A (1999) B Wirth EJ, Androphy E, its Hahnen and CL, p47-phoxgene Lorson the 9. between events Recombination (2000) al. et conver- J, Gene (1998) Roesler GG Germino 8. HP, Neumann H, Weber MA, Gandolph TJ, Watnick causes (1q32) 7. cluster gene RCA the in conversion gene novo De (2006) al. et S, Heinen con- gene 6. Interlocus (2012) of MW Hahn capable DN, Cooper pseudogenes AD, of Phillips U, identification Zekonyte C, Genome-wide Casola (2006) 5. and al. Classifying duplications: et gene J, of Bischoff evolution The 4. (2010) F Kondrashov H, Innan F 3. N, Chuzhanova DN, Neurospora. Cooper of JM, mutants Chen pyridoxine 2. of recombination Aberrant (1955) MB Mitchell 1. B A l ttelf iei oyB allele B, copy in site left the at B u siae o h aaeesta oentemutational the govern that parameters the for estimates Our pigHro,N) o 8 p507–513. pp 38, Vol NY), Harbor, Biology Spring Quantitative on Symposia Harbor Spring 969–1012. atrophy. muscular spinal USA for Sci responsible is and splicing regulates gene chronic recessive autosomal of cause main disease. granulomatous the are pseudogenes homologous highly pkd1. in mutation of cause likely a syn- is sion uremic hemolytic atypical with associated H drome. factor complement in mutations asso- genes human disease. of inherited 1% least with at ciated into mutations deleterious introduce events version conversion. gene disease-causing models. between distinguishing disease. human and evolution Mechanisms, USA Sci Acad Natl Proc ou eecneso n rsoe nsgetldpiain ne eta sce- neutral a under duplications segmental in crossover nario. and conversion gene locus conversion. gene with r A r B ,1} {0, ∈ 3(Bethesda) G3 nhzD,Vall Da, anchez ´ u Mutat Hum 96:6307–6311. c oeua vltoayGenetics Evolutionary Molecular ecnie hs uainleet ob aeadignore and rare be to events mutational these consider We . 4 orsod oallele to corresponds 27:292–293. 4:1479–1489. oivsiaeNG ndpiaesqecs eue a used we sequences, duplicate in NAGC investigate To sCdn ,Bras O, es-Codina ` Genes 41:215–220. Blood 2:191–209. eoeRes Genome 95:2150–2156. a e Genet Rev Nat u Mutat Hum r A tte“ih”st ncp ,adallele and A, copy in site “right” the at ecnie h vlto ftobial- two of evolution the consider We -ie ,NvroA(04 nepa finter- of Interplay (2014) A Navarro M, o-Vives ´ rcC arnsG 20)Gn conversion: Gene (2007) GP Patrinos C, erec ´ u o Genet Mol Hum 22:429–435. l A a e Genet Rev Nat 27:545–552. tte“et iei oyA allele A, copy in site “left” the at Clmi nvPes e York). New Press, Univ (Columbia 11:97–108. Cl pigHro a rs,Cold Press, Lab Harbor Spring (Cold µ 7:1239–1243. 8:762–775. = nuRvBiochem Rev Annu 1.2 –evn swith us )–leaving × 10 rcNt Acad Natl Proc −8 e site per Cold 40: r B 0 hle h ,Kas J iklf A(98 eecneso rc ietoaiyis directionality tract Gene-conversion (1998) JA Nickoloff GJ, Khalsa J, Cho Whelden intrachromosomal 30. of products gene of interlocus analysis Fine-resolution detecting conversion (1997) for AS gene Waldman methods D, Yang the of of length 29. power tract The (2010) and H rate Innan SP, The Mansai (2011) H 28. Innan T, Kado SP, Mansai 27. between conversion gene meiotic yeast High-frequency the (1985) in TD conversion Petes gene S, of Jinks-Robertson rate a low 26. in Very (2012) conversion MW gene Hahn ectopic GC, Conant and C, sweep Casola selective 25. Hard (2013) to concerted al. force under et genes a M, duplicated as of Hanikenne rate product 24. evolutionary same The the (2008) H of Innan more S, the for Mano in Selection 23. conversion (2006) gene H and Innan selection RP, of Sugino signatures 22. Complex (2007) al. et JF, Storz 21. 0 ehm M na 20)Noucinlzto fdpiae ee ne the under genes duplicated of Neofunctionalization (2007) H Innan application its KM, and Teshima selection with 20. model conversion gene two-locus A (2003) evolve. H families Innan gene How 19. (1990) T Ohta 18. 7 imrE atnS eelyS a ,Wlo C(90 ai ulcto n loss and duplication Rapid (1980) AC Wilson Y, Kan S, Beverley S, Martin E, Zimmer 17. (1991) D Graur WH, Li of 16. 5S (1973) K Sugimoto DD, Brown 15. npr,b elwhp rmteSafr etrfrCmuainl Evolu- Computational, for Center Genomics. Stanford Human HG009431 the supported, and was were from tionary and Z.G. fellowships work and HG008140 by A.H. part, This Institute. Grants Medical in Health Hughes manuscript. Howard of the the by Institutes and on National discussions/comments by funded review- helpful anonymous two for and Siepel, ers Adam Rosenberg, Noah Knowles, David ACKNOWLEDGMENTS. is j node and matrix i probability node between transition edge the correspond- the Therefore, the to rate. that—for equal assume mutation rate We a ing state. at gen- a occurs to constant types—substitution corresponds a mutation node assumed both Each and y. 70, 25 of ref. time from eration times split primate for of root estimates the lengths on prior edge a constant set reference and to used human be Appendix only the (SI will tree gene) two- the in one mouse in The alleles sites genomes. (two reference the state primate bit other encoding four to paralogs) one bial- and two two genome to in (corresponding sites vectors state lelic of consists datum Each timescales. a has sites) two the between on breakpoint In effect a negligible distribution. (with geometric recombination the that show of memorylessness the by the includes it on conditional site one including is conversion other a of mean with ability distributed geometrically as length alleles. unobserved fourth) and (third transi- to full mutations the possible derive ignores can we Similarly, matrix other. probability the tion includes it that given sites where right the involving A copy therefore to is B probability copy transition from The the left. NAGC the at by not mutation or but point A site through copy either of happen site are can right that transition sites two This for apart. 0100, bp to 0110 from probability transition generation order the of terms aty etr ocmuetasto rbblte ln evolutionary along probabilities transition compute to turn we Lastly, derive next We nune ytecrmsm environment. chromosome the by influenced cells. mammalian in recombination homeologous conversion. genes. duplicated between yeast. in chromosomes 82:3350–3354. nonhomologous on genes repeated genome. adaptation. environmental affording cluster gene evolution. genes. duplicated of evolution concerted enhance mice. house of genes globin duplicated rsueo eeconversion. gene of pressure genes. RHD and RHCE human the to 2158–2162. fgnscdn o h lh hiso hemoglobin. of chains alpha the for coding genes of MA). derland, family. gene a of lution P g(d (0110 stepoaiiyo ovrineeticuigoeo the of one including event conversion a of probability the is ) o ilEvol Biol Mol Genetics Genetics → 0100) g(d O(µ 180:493–505. g .W sueacntn renml,afie topology fixed a tree—namely, constant a assume We ). 184:517–527. init udmnaso oeua Evolution Molecular of Fundamentals .Floigpeiu ok(8,w oe h tract the model we (28), work previous Following ). 29:3817–3826. = P o Biol Mol J 2 etakElnSao,DcEg,Kle Harris, Kelley Edge, Doc Sharon, Eilon thank We . ), IAppendix (SI µ/3 O(c Genes g {t Genetics init ij 2 + } P (d ,and ), deij) (edge ∗ sdfie in defined as 78:397–415. 2:313–331. c ) (1 = rcNt cdSiUSA Sci Acad Natl Proc 1398:1385–1398. −  Genetics O(µc .W oeta u parameterization our that note We ). ho ou Biol Popul Theor g(d = 1 eou laevis Xenopus − P )) urGenet Curr .Freape osdrteper- the consider example, For ). t ij λ 1 + . 177:481–500.  o elBiol Cell Mol eused We S5. Fig. Appendix, SI LSGenet PLoS O(µ rnsGenet Trends d tflosta h prob- the that follows It λ. , NSEryEdition Early PNAS 2 ) rcNt cdSiUSA Sci Acad Natl Proc 34:269–279. + and 37:213–219. 100:8793–8798. SnurAscae,Sun- Associates, (Sinauer O(c rcNt cdSiUSA Sci Acad Natl Proc 9:e1003707. eou mulleri Xenopus 17:3614–3628. 2 22:642–644. IAppendix SI ) + O(µc P deij) (edge ∗ | ), f6 of 5 Evo- : we , for 77: d

EVOLUTION 31. Taghian DG, Nickoloff JA (1997) Chromosomal double-strand breaks induce gene con- 52. Dennis MY, et al. (2016) The evolution and population diversity of human-specific version at high frequency in mammalian cells. Mol Cell Biol 17:6386–6393. segmental duplications. Nat Ecol Evol 1:69. 32. Lichten M, Haber J (1989) Position effects in ectopic and allelic mitotic recombination 53. Duret L, Galtier N (2009) Biased gene conversion and the evolution of mammalian in Saccharomyces cerevisiae. Genetics 123:261–268. genomic landscapes. Annu Rev Genomics Hum Genet 10:285–311. 33. Schildkraut E, Miller CA, Nickoloff JA (2005) Gene conversion and deletion frequen- 54. Odenthal-Hesse L, Berg IL, Veselis A, Jeffreys AJ, May CA (2014) Transmission distor- cies during double-strand break repair in human cells are controlled by the distance tion affecting human noncrossover but not crossover recombination: A hidden source between direct repeats. Nucleic Acids Res 33:1574–1580. of meiotic drive. PLoS Genet 10:e1004106. 34. Sawyer S (1989) Statistical tests for detecting gene conversion. Mol Biol Evol 6: 55. Galtier N (2003) Gene conversion drives GC content evolution in mammalian histones. 526–538. Trends Genet 19:65–68. 35. Takuno S, Innan H (2009) Selection to maintain paralogous amino acid differences 56. Assis R, Kondrashov AS (2011) Nonallelic gene conversion is not GC-biased in under the pressure of gene conversion in the heat-shock protein genes in yeast. Mol Drosophila or primates. Mol Biol Evol 29:1291–1295. Biol Evol 26:2655–2659. 57. McGrath CL, Casola C, Hahn MW (2009) Minimal effect of ectopic gene con- 36. Thornton K, Long M (2005) Excess of amino acid substitutions relative to polymor- version among recent duplicates in four mammalian genomes. Genetics 182:615– phism between X-linked duplications in Drosophila melanogaster. Mol Biol Evol 622. 22:273–284. 58. Lesecque Y, Mouchiroud D, Duret L (2013) GC-biased gene conversion in yeast is 37. Arguello JR, Chen Y, Yang S, Wang W, Long M (2006) Origination of an X-linked testes specifically associated with crossovers: Molecular mechanisms and evolutionary sig- chimeric gene by illegitimate recombination in Drosophila. PLoS Genet 2:e77. nificance. Mol Biol Evol 30:1409–1419. 38. Rozen S, et al. (2003) Abundant gene conversion between arms of palindromes in 59. Arbeithuber B, Betancourt AJ, Ebner T, Tiemann-Boege I (2015) Crossovers are associ- human and ape Y chromosomes. Nature 423:873–876. ated with mutation and biased gene conversion at recombination hotspots. Proc Natl 39. Bosch E, Hurles ME, Navarro A, Jobling MA (2004) Dynamics of a human interparalog Acad Sci USA 112:2109–2114. gene conversion hotspot. Genome Res 14:835–844. 60. Betran´ E, Rozas J, Navarro A, Barbadilla A (1997) The estimation of the number and 40. Hurles ME (2001) Gene conversion homogenizes the CMT1A paralogous repeats. BMC the length distribution of gene conversion tracts from population DNA sequence Genomics 2:11. data. Genetics 146:89–99. 41. Dumont BL, Eichler EE (2013) Signals of historical interlocus gene conversion in human 61. Hudson RR (2001) Two-locus sampling distributions and their application. Genetics segmental duplications. PLoS One 8:e75949. 159:1805–1817. 42. Ji X, Griffing A, Thorne JL (2016) A phylogenetic approach finds abundant interlocus 62. McVean G, Awadalla P, Fearnhead P (2002) A coalescent-based method for gene conversion in yeast. Mol Biol Evol 33:2469–2476. detecting and estimating recombination from gene sequences. Genetics 160:1231– 43. Williams AL, et al. (2015) Non-crossover gene conversions show strong GC bias and 1241. unexpected clustering in humans. Elife 4:e04637. 63. Jeffreys AJ, May CA (2004) Intense and highly localized gene conversion activity in 44. Halldorsson BV, et al. (2016) The rate of meiotic gene conversion varies by sex and human meiotic crossover hot spots. Nat Genet 36:151–156. age. Nat Genet 48:1377–1384. 64. Kong A, et al. (2012) Rate of de novo mutations and the importance of father’s age 45. Narasimhan VM, et al. (2017) Estimating the human mutation rate from autozygous to disease risk. Nature 488:471–475. segments reveals population differences in human mutational processes. Nat Com- 65. Segurel´ L, Wyman MJ, Przeworski M (2014) Determinants of mutation rate variation mun 8:303. in the human germline. Annu Rev Genomics Hum Genet 15:47–70. 46. Lan X, Pritchard JK (2016) Coregulation of tandem duplicate genes slows evolution 66. Teshima KM, Innan H (2004) The effect of gene conversion on the divergence of subfunctionalization in mammals. Science 352:1009–1013. between duplicated genes. Genetics 166:1553–1560. 47. Balding DJ, Nichols RA, Hunt DM (1992) Detecting gene conversion: Primate visual 67. Jinks-Robertson S, Michelitch M, Ramcharan S (1993) Substrate length require- pigment genes. Proc Biol Sci 249:275–280. ments for efficient mitotic recombination in Saccharomyces cerevisiae. Mol Cell Biol 48. Jakobsen IB, Wilson SR, Easteal S (1997) The partition matrix: Exploring variable phy- 13:3937–3950. logenetic signals along nucleotide sequence alignments. Mol Biol Evol 14:474–484. 68. Dumont BL (2015) Interlocus gene conversion explains at least 2.7 % of single 49. Weiller GF (1998) Phylogenetic profiles: A graphical method for detecting genetic nucleotide variants in human segmental duplications. BMC Genomics 16:456. recombinations in homologous sequences. Mol Biol Evol 15:326–335. 69. Trombetta B, Fantini G, D’Atanasio E, Sellitto D, Cruciani F (2016) Evidence of exten- 50. Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic sive non-allelic gene conversion among LTR elements in the human genome. Sci Rep trees. Bioinformatics 17:754–755. 6:28710. 51. Jackson MS, et al. (2005) Evidence for widespread reticulate evolution within human 70. Scally A, et al. (2012) Insights into hominid evolution from the gorilla genome duplicons. Am J Hum Genet 77:824–840. sequence. Nature 483:169–175.

6 of 6 | www.pnas.org/cgi/doi/10.1073/pnas.1708151114 Harpak et al. Downloaded by guest on October 4, 2021