
Natural insertions in rice commonly form tandem duplications indicative of patch-mediated double-strand break induction and repair

Justin N. Vaughn and Jeffrey L. Bennetzen1

Department of , University of Georgia, Athens, GA 30602

Contributed by Jeffrey L. Bennetzen, December 4, 2013 (sent for review September 13, 2013)

The insertion of DNA into a genome can result in the duplication and dispersal of functional sequences through the genome. In addition, a deeper understanding of insertion mechanisms will inform methods of genetic engineering and transformation. Exploiting structural variations in numerous rice accessions, we have inferred and analyzed intermediate length (10–1,000 bp) insertions in . Insertions in this size class were found to be approximately equal in frequency to deletions, and compound insertion–deletions comprised only 0.1% of all events. Our findings indicate that, as observed in humans, tandem or partially tandem duplications are the dominant form of insertion (48%), although short duplications from ectopic donors account for a sizable fraction of insertions in rice (38%). Many nontandem duplications contain insertions from nearby DNA (within 200 bp) and can contain multiple donor sources, some distant, in single events. Although replication slippage is a plausible explanation for tandem duplications, the end homology required in such a model is most often absent and rarely is >5 bp. However, end homology is commonly longer than expected by chance. Such findings lead us to favor a model of patch-mediated double-strand-break creation followed by nonhomologous end-joining. Additionally, a striking bias toward 31-bp partially tandem duplications suggests that errors in nucleotide excision repair may be resolved via a similar, but distinct, pathway. In summary, the analysis of recent insertions in rice suggests multiple underappreciated causes of structural variation in .

double-strand break repair | structural DNA variation

Significance

Very short insertions are usually attributable to replication slippage. Another class of longer insertions (>10 bp) creates tandem duplications even in the absence of preexisting repeats. This work provides analysis into the properties and mechanistic implications of such insertion polymorphisms segregating in rice. To our knowledge, this work is the first comprehensive analysis of this major class of natural mutations in any plant. Inspired by the prior experiments of Stéphane Vispé and Masahiko Satoh, we propose a model for how a substantial number of double-strand breaks are created and how they might result in tandem duplications. The model, based on patch-mediated nick repair, is indirectly supported by recently published experiments using a modified CRISPR-associated 9 nicking enzyme.

Author contributions: J.N.V. and J.L.B. designed research; J.N.V. performed research; J.N.V. analyzed data; and J.N.V. and J.L.B. wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
1To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1321854111/-/DCSupplemental.

6684–6689 | PNAS | May 6, 2014 | vol. 111 | no. 18 | www.pnas.org/cgi/doi/10.1073/pnas.1321854111

Genomic DNA insertion causes genome expansion and, potentially, the rearrangement and diffusion of protein domains and regulatory elements throughout the genome (1, 2). Additionally, genetic engineers generally aim to integrate specific DNA into the nuclear genome, so the natural mechanisms by which this integration occurs may serve as a starting point to elaborate and improve genome modification (3, 4). Common causes of gene-sized insertions are unequal recombination (5), transposable element replication (1), and ectopic recombination stimulated by double-strand breaks (DSBs) in the genome (2, 6). Shorter events are less well characterized, but it appears that these can be created by similar processes (7). Still, high-throughput sequencing of DSB repair events in humans (8) and plants (9) suggests that insertions related to induced breaks are very rare and very short.

Although the processes described above can produce duplications at distant genetic loci, the most common form of non- -associated insertions in humans is tandem duplications (10). Once created, tandem duplications can be dramatically expanded by unequal recombination or replication slippage. Such duplications may be deleterious, or they may be promoted by selection for a novel or expanded function (11, 12).

Although tandem repeats are ubiquitous in eukaryotic genomes, the mechanisms for their origin are still in question. Early analysis of human indel mutations indicated that replication slippage was the most effective model to explain the origin of assorted repeats (13). In other studies, longer, de novo tandem duplications were also hypothesized to be caused by slippage because, out of 85 insertions producing such duplications, 50 were associated with flanking repeats >2 bp (14). Replication slippage would presumably require a preexisting short repeat because priming must occur between the end of the loop that will become the duplication and the position to where replication slips.

Authors of more recent work investigating insertions across the human genome suggest alternatives to replication slippage on the grounds that homology is often either nonexistent or very short, whereas the length of homology and the length of insertion are not correlated (10). These researchers favor a model based on DSBs being repaired by nonhomologous end-joining (NHEJ). However, conventional models of DSB repair are strained to predict tandem duplications >10 bp, much less >100 bp. Such models require extensive single-stranded, complementary ends to be preserved during the break. Moreover, DSBs produced by Tal-effector nucleases in humans do not yield insertions that form tandem repeats, despite the fact that the breaks generate a 5′ overhang (15). Thus, this common class of mutations currently lacks a firm molecular explanation.

Similar to tandem duplications, short duplications are commonly found within 100 bp of one another, but with unique intervening DNA (16). By comparing human polymorphisms with chimp sequence, Thomas et al. (16) inferred that the repeats were recent insertions. As discussed by the authors and herein, a mechanism for such duplications is even less forthcoming than for tandem duplications.

In this study, we used extensive population-scale rice resequencing data to confirm that tandem duplications are also abundant natural polymorphisms in the plant kingdom. Additionally, we found that many insertions in rice, although not perfectly tandem, are from a ∼50-bp window around the insertion site. We rarely found the end homology in tandem repeats that is expected for replication slippage, although we did note a bias toward short microhomology between insertion ends and insertion site. These data led us to elaborate on the DSB model of tandem duplication, proposing that long patch base excision repair (BER) on complementary strands commonly leads to such patterns (17). Additionally, we characterized common forms of nontandem, but local, duplication.

Results

Inferring Insertions. Recent mutational events can be inferred by comparing orthologous sequences between two lineages with an orthologous sequence in a known outgroup lineage (1). The extant state in the sister lineages matching the outgroup state is inferred to be the ancestral state. Although such inferences may be false due to segregating polymorphisms in the ancestral population, they are generally valid for Oryza sativa (rice) comparisons using Oryza glaberrima as an outgroup (1). Insertion variations segregating in an extant population are more likely to be recent mutations and, hence, are less likely to complicate interpretation resulting from multiple events and sequence divergence. For these reasons, we chose to use recently published data regarding genetic variation in a sample of 50 rice accessions (18). In that study, indels >9 bp were ascertained by pooling sequence data for subpopulations in the sample and de novo assembling reads. Once assembled, these contigs were positioned on the rice reference genome of the cultivar Nipponbare to localize the relative mutation. The authors also resequenced the Nipponbare genome and assessed structural variations relative to a different rice reference genome, indica cultivar 93-11. Using known structural variations between these two Sanger-sequenced genomes (93-11 and Nipponbare), they determined that 96% of structural variants identified in their short-read pipeline were true positives.

To infer whether an indel was an insertion or deletion, we used the recently released genome sequence of O. glaberrima (Table S1). To establish large tracts of colinearity, each O. glaberrima chromosome was aligned with the homologous chromosomes from cultivars indica 93-11 and Nipponbare. The region surrounding each of the polymorphic indels was extracted and realigned by using the outgroup, the Nipponbare reference, and the variant sequence. Variant sequences were synthetically derived from the Nipponbare reference by using associated indel information. A variant was called an insertion if a gap of the exact size was also found in the outgroup and 15 bp flanking the gaps were gap-free and shared >90% identity with the outgroup.

For clarity of exposition and to ease manual curation of the data, we chose to describe data from only 4 of 12 rice chromosomes: the longest (no. 1), the shortest (no. 10), the most heavily curated (no. 3), and no. 8, which falls between nos. 3 and 10 in length. On these four chromosomes, only 92 of 65,391 insertions (0.1%) relative to Nipponbare reference have a perfectly adjacent deletion (Table 1). Thus, complex structural variants derived from mixed insertion–deletion events are rare within this size class of 10–1,000 bp. Also, the inferred insertion/deletion profile deviates dramatically from induced DSB repair outcomes, which commonly produce deletions (6, 9), whereas we inferred approximately equivalent frequencies of insertions relative to deletions. As expected, the number of structural variations correlated with the size of a chromosome. Chromosome 3 had the highest percentage of inferable events, likely because of its high-quality assembly.

In the human genome, short insertions (8–100 bp) commonly create tandem duplications that are in the same orientation and have no unique spacer sequence between the resultant repeats (10). Such insertions are impossible to position exactly (Fig. 1 A–C), and so we used the trace extension metric, d, to first characterize the inferred insertions. As illustrated in Fig. 1 A–C and Messer and Arndt (10), d can characterize whether an insertion is a tandem duplication (l ∼ d in Fig. 1B) or comes from a more distant site (d = 0 in Fig. 1A). Also, d allows one to determine whether an insertion and its donor have similarity that extends beyond the boundaries of the insertion, even when the insertion creates a tandem duplication (d > l in Fig. 1C).

Tandem Duplications Account for More Than Half of >9-bp Insertions and Rarely Exhibit Extensive End Homology. The sharp diagonal in Fig. 1D demonstrates that, as in humans, insertions >9 bp commonly create tandem duplication: d is often equal to l. Across all analyzed chromosomes, tandem duplications account for approximately half of all insertions. The remaining insertions either come from more distant loci (d ∼ 0) or are partially tandem duplications (Fig. 1D). Of additional interest, there is a clear bias toward partially tandem duplications of 31 bp, and, within that group, there is very little variation around this value. In other words, it is common for insertions to occur that duplicate 31 bp around the site of insertion but include intervening sequence between the duplications. We discuss the implications of such a bias below.

Replication slippage is a common explanation for tandem duplications (14), although this explanation is now being challenged, particularly with regard to whole genome analysis (10). One assumed prerequisite for replication slippage is a repeat that allows for the stabilization of the slippage intermediate and resumption of synthesis at the mispaired site (Fig. 2A). The minimum primer length for extension for most polymerases in vitro appears to be 6 bp (19). Although we cannot be certain that this rule applies at all times in vivo, length restrictions on microsatellite expansion support this result (20).

Data above the d = l diagonal are indicative of the degree of homology extending beyond the insertion and its donor sequence. A model of replication slippage would predict d − l to be ∼6 or greater. In fact, if replication slippage were common for these insertions, the observed 1:1 diagonal should be shifted at least 6 units higher. Additionally, for longer insertions, we expect that greater homology would be required to stabilize the intermediate loop. This expectation was not observed (Fig. 1D). More often than not, d is equal to l; therefore, if replication slippage is occurring, it is occurring most often in the absence of any priming base.
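The three cases distinguished by d (Fig. 1 A–C) can be made concrete with a small sketch. It rests on a simplifying assumption: d is measured here as the number of positions by which the insertion gap can slide left or right without changing the derived sequence, which is not the dot-plot procedure used in the study. The function names, category labels, and toy sequences are hypothetical.

```python
def trace_extension(ancestral: str, pos: int, insertion: str) -> int:
    """Count how many positions the insertion gap can slide (left + right).

    Simplified stand-in for the trace extension metric d: the gap can move
    one step whenever the base entering the gap equals the base leaving it.
    """
    derived = ancestral[:pos] + insertion + ancestral[pos:]
    l = len(insertion)

    # Slide right: compare the first inserted base with the base after the insertion.
    right = 0
    while pos + l + right < len(derived) and derived[pos + right] == derived[pos + l + right]:
        right += 1

    # Slide left: compare the base before the insertion with the last inserted base.
    left = 0
    while pos - 1 - left >= 0 and derived[pos - 1 - left] == derived[pos + l - 1 - left]:
        left += 1

    return left + right


def classify(d: int, l: int) -> str:
    """Hypothetical labels for the cases in Fig. 1 A-C (exact thresholds assumed)."""
    if d == 0:
        return "distant donor"
    if d < l:
        return "partially tandem"
    if d == l:
        return "tandem"
    return "tandem with end homology"


if __name__ == "__main__":
    examples = [
        ("TTGACCTGGT", 6, "ACC"),  # exact copy of the 3 bp upstream: d = l
        ("TTGACCAGGT", 6, "ACC"),  # same copy plus 1 bp of end homology: d = l + 1
        ("TTGACCTGGT", 6, "GAG"),  # no similarity to the flanks: d = 0
    ]
    for ancestral, pos, ins in examples:
        d = trace_extension(ancestral, pos, ins)
        print(ins, d, d - len(ins), classify(d, len(ins)))
```

In this toy version, d − l is the end homology discussed above, so a slippage model requiring a primer of about 6 bp would correspond to values of d − l near 6 or greater.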

Table 1. Summary of structural variations and inferred events for chromosomes analyzed

Chr.    Chr. size,* Mb   Structural variations   Adjacent events†   Deletions   Insertions   Inferable events, %
1       43.2             22,278                  53                 4,912       4,390        42
3       36.4             14,990                  18                 4,285       3,544        52
8       28.4             14,757                  9                  3,047       2,711        39
10      23.1             13,366                  12                 2,907       2,400        40
Total   131.1            65,391                  92                 15,151      13,045       43

Chr., chromosome.
*From www.gramene.org.
†Adjacent events are defined as insertion polymorphisms directly adjacent to deletion polymorphisms.
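As a quick arithmetic check on Table 1, the sketch below recomputes the "Inferable events, %" column as (deletions + insertions) divided by structural variations. That formula is an assumption inferred from the table itself, not stated in the text, and the variable names are illustrative.

```python
# Values copied from Table 1: (structural variations, adjacent events, deletions, insertions)
TABLE_1 = {
    "1": (22278, 53, 4912, 4390),
    "3": (14990, 18, 4285, 3544),
    "8": (14757, 9, 3047, 2711),
    "10": (13366, 12, 2907, 2400),
}


def inferable_percent(variations: int, deletions: int, insertions: int) -> int:
    """Share of structural variations inferred as a deletion or an insertion (assumed formula)."""
    return round(100 * (deletions + insertions) / variations)


def totals(table: dict) -> tuple:
    """Column-wise sums across the analyzed chromosomes."""
    return tuple(sum(row[i] for row in table.values()) for i in range(4))


if __name__ == "__main__":
    for chrom, (sv, adj, dels, ins) in TABLE_1.items():
        print(chrom, inferable_percent(sv, dels, ins))
    sv, adj, dels, ins = totals(TABLE_1)
    print("Total", sv, adj, dels, ins, inferable_percent(sv, dels, ins))
    # Ratio of inferred deletions to inferred insertions across the four chromosomes
    print("deletions per insertion:", dels / ins)
```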

Fig. 1. Trace extension, d, spectra for rice insertions. (A–C) Simplified dot plots between ancestral sequence without an insertion and derived sequence with an insertion. (A) When d = 0, a gap can be placed exactly, and there is no similarity between the insertion and regions adjacent to the insertion site. (B) When d = l, a gap cannot be placed exactly, and the insertion is a duplication of DNA adjacent to the site of insertion. (C) When d > l, not only do the conditions of d = l apply, but the similarity between donor loci and the site of insertion extends beyond the inserted sequence, suggesting a homology-dependent mechanism. (D) A heatmap plotting the total counts of all insertions having a particular combination of length (x axis) and d (y axis). Gray-scale range is from 1 to >90 insertions.

Fig. 2. End homology between donors and insertions and the implication for mechanistic models. (A) Replication slippage is dependent on priming, and thus tandem duplications resulting from such a model should exhibit similarity between the end of the duplicated region and the beginning of the donor site, described here as end homology. When the replicating strand slips back to another annealing site, the DNA exhibits a transient bubble. Replication from the slipped site duplicates the intervening DNA (light gray) as well as the priming site (dark gray). (B) A null distribution of end homology was generated by randomly sampling positions along the rice genome (Methods). The tandem repeat homology was assessed by subtracting the length of the insertion from the trace extension, d. Values >5 bp suggest a synthesis-dependent mechanism of duplication. (C) A patch-mediated model of tandem duplication formation in which synchronous BER events create a DSB that is repaired via NHEJ.

Bias Toward Short End Homology Suggests That NHEJ, Not Replication Slippage, Occurs During Tandem Duplication. As discussed above, tandem repeats are generally not associated with end homology: d − l = 0 (Fig. 1 B and D). Still, we tested for statistical bias in our (d − l) values by randomly selecting one position in the Nipponbare genome and another position 100 bp downstream and generating a null distribution based on the number of downstream bases that the two points shared. In fact, end homologies for tandem repeats are typically longer than expected by chance, although they are often 0 and are nearly always <6 bp (Fig. 2B). This end homology profile exhibited by tandem repeats is strikingly similar to that of transfected DNA repaired by NHEJ in mammalian cells (21).

As elaborated on in Discussion, these data lead us to a model for the origin of tandem duplications that depends on long-patch BER. Originally proposed as a major (but underappreciated) cause of DSBs (17), the “DNA repair patch-mediated pathway” also explains the high frequency of tandem duplications in plants and . Briefly, we propose that DNA lesions (e.g., single-strand nicks) close to one another on complementary strands trigger simultaneous BER. Because complementary strands for both events are concurrently displaced by the other repair event, the repair results in a DSB (Fig. 2C). Regardless of the degree of end degradation, repair of this DSB via NHEJ will result in a tandem duplication.

Nontandem Duplications Are Often Derived from Local Sequences Within 200 bp and Do Not Exhibit Signs of Canonical Conversion Mechanisms. Although Fig. 1 reveals tandem duplications, it cannot show the homologies that exist >3–4 bp away from the insertion. Many of the insertions with d ∼ 0 could in fact come from very near the insertion. Based on manual curation and prior work (7, 16), we predicted that local homologies were more likely to be the source of insertions than more distant sites. Therefore, we first searched each insertion that was not a tandem duplication against the ancestral sequence upstream and downstream for 100 + 2.5l bp, where l is the length of the insertion. For each match within this region, we determined the coverage of the match relative to the insertion and the distance of the match relative to the site of insertion (Methods).

The distance between insertion site and donor locus is typically within 50 bp, although more distant conversions are clearly possible (Fig. 3A). In some cases, only a fragment of the insertion appears to be derived from a local locus. There are a substantial number of complex events in Fig. 3A, in which 15% or more of an insertion could be accounted for by a local donor sequence, but the rest of the sequence either comes from another local site or a region outside of our search (gray bars in Fig. 3A). When an insertion is made up of two distinct local donor sequences, these donor loci can be quite distant from one another and can be derived from two upstream sites, two downstream sites, or both upstream and downstream sites. Note that these are not overlapping matches but account for distinct regions of the insertion (Methods).

Unlike tandem duplications, local duplications with intervening unique sequence are difficult to explain by invoking replication slippage or patch-mediated repair. Another model has been postulated to explain the types of insertions found at repaired DSBs

that were induced by homing endonucleases. The synthesis-dependent strand-annealing (SDSA) model assumes that after a DSB and 5′ to 3′ resectioning, the 3′ overhang is able to synthesize from a local or ectopic site. Once synthesis has occurred, the temporary duplex is denatured, and then the newly synthesized single strand can begin again its search for an appropriate ligation partner (3). This process appears to be a major mechanism by which sequences are duplicated from distant loci (2, 7) and may account for short, local duplications as well (22).

The most plausible scenario for local duplication by SDSA would be that the 3′ overhang of one side of the DSB uses the 3′ overhang of the other side as a template. Alternatively, a 3′ overhang could form a loop and copy sequence on its own strand in the reverse orientation. We only observed ∼0.5% of local duplications in the reverse orientation. Thus, we consider this possibility rare. The paucity of reversed insertions also indicates that the 3′ overhang rarely invades the duplex of the other half of the DSB because it could then arbitrarily copy DNA in either direction. It follows, then, that duplications in the donor orientation are also unlikely to be the result of the 3′ overhang invading the duplex on its own side of the DSB.

If we consider the 3′ overhang as only being able to copy from the 3′ overhang of the other side of the DSB, then insertions with 5′ donors would have to be extended by the 3′ overhang of the 3′ side of the DSB. The opposite would be true for insertions with 3′ donors. To test this model, we determined the 5′ and 3′ homology between insertions and their donors, using only local duplications with >95% coverage and a spacer distance of >9 bp. As with the end homology seen in tandem duplications, the homology patterns are slightly skewed toward longer tracks of homology relative to a null model. However, if SDSA was a common cause of these insertions, the 5′ donor and 3′ homology and the 3′ donor and 5′ homology should be highly enriched in >6-bp end homologies (Fig. 3C). Although we observe that ∼5% of these high-confidence local duplications follow this expectation, the great majority favor the NHEJ-like pattern observed for tandem duplications.

Fig. 3. Characterization of insertions donated from the sequence around an insertion site. (A) The length and position of the donor site relative to the insertion site are plotted (x axis) for each insertion with d < 4 and < 3 extensive local matches. Insertions are sorted top to bottom based on shortest-to-longest length. Darkness of a donor bar indicates the percentage of the insertion covered by the donor, with black being 100%. Only donors with >15% coverage and an overlap of >7 bp are plotted. For brevity, only insertions from chromosome 10 are presented; other chromosomes exhibit a similar profile (Figs. S1–S3). (B) Summary of all insertion types for chromosomes 1, 3, 8, and 10. Examples of each category are given as schematics and representative dot plots in Figs. S4–S8. (C) Similarly to Fig. 2B, end homology of local duplications is plotted against a null distribution. Gap homology is the frequency of particular d scores <12 for insertions >20 bp. Excluding partially tandem duplications, the gap homology distribution should resemble the null distribution. (Upper) The schematic illustrates what is meant by 5′ and 3′ homology. The extent of the insertion is colored light gray, and extendable homologous regions are dark gray. Only duplications with 95% coverage of the insertion and positioned >9 bp from the insertion were considered. Categories are broken into the position of the donor relative to the insertion site: for 5′ donors, n = 264; for 3′ donors, n = 301.

Insertions Associated with Partially Tandem Duplications Comprise Mostly Local Duplications but with a Substantial Fraction of More Distant Donors. Fig. 1D exhibits a striking feature characterized by insertions with various lengths >31 bp and d = 31. Such insertions are tandem duplications with a substantial block of intervening DNA. If such events were the result of replication slippage, DNA polymerase would need to stall, copy DNA from ectopic or local positions, commonly slip back 31 bp from its original template position, and resume replication. Alternatively, under the patch-mediated model, the presence of a tandem duplication with intervening DNA copied from another locus indicates a series of events: patch-mediated DSB induction, followed by additional synthesis from elsewhere in the genome. Interestingly, insertions that are 31 bp in length with d = 31 are no more abundant than other tandem duplications (Fig. 1D). This finding indicates that these partially tandem insertions nearly always result in the insertion of at least some DNA from outside the tandemly duplicated sequence.

We manually inspected all 227 insertions >40 bp in length with a d value between 29 and 31 (Dataset S1). We found that 72% of the nontandemly duplicated parts of these sequences came from the local sequence context. Indeed, 14% of the insertions

overlap the tandem duplication, thus resulting in a triplication of some of the donor sequence, which would be expected if additional replication was occurring after patch-mediated tandem duplication. Unlike the analysis shown in Fig. 3B, a much greater percentage of the insertions involve local duplication. This observation may in part be a result of an excessive number of inferred insertions caused by paralog-related misassembly, whereas these partially tandem insertions are more legitimate examples of DSB repair. Still, 16% of the insertions appear to have been copied from a distant locus, and the remainder, 12%, are a combination of local and unidentified donors.

Discussion

Short insertions of DNA into the genome are most commonly triggered by microsatellite repeats. In contrast, gene-sized insertions are often mediated by transposable elements and SDSA. In humans, there is a third and distinct class of common insertions of intermediate length that has no clear explanation (10, 20). We have identified abundant recent insertions >9 bp in rice using population polymorphisms and a reproductively isolated outgroup. Of the insertions analyzed, the majority are tandem duplications, and these tandem duplications lack a signature of replication slippage (Figs. 2 and 3B). We feel that the weight of evidence supports a model in which patch-mediated displacement of complementary strands creates tandemly duplicated ends that are rejoined by NHEJ (Fig. 2C). This model precludes the need for the end homology required by type B DNA polymerases and naturally resolves the difficult problem of how lengthy duplicates are established, even in the absence of end homology. Additionally, when modified to account for cross-reactivity with nucleotide excision repair (NER) errors, the model explains the peculiar observation of a bias toward 31-bp partially tandem duplications.

Long-patch BER has been shown to create DSBs in vivo when single-strand lesions, such as uracils, are present on complementary strands within 30 bp of each other (Fig. 2C and ref. 17). The efficiency of DSB is reduced with longer patches, but patches as long as 80 bp of BER have been observed in vivo (23), where DNA synthesis may be more rapid. Moreover, even longer distances between paired nicks induced by modified CRISPR-associated 9 enzymes can stimulate DSBs in human cells (24). In support of our model, insertions resulting from these DSBs formed tandem duplications (24). NER is distinct from BER in that a consistently sized segment of DNA is removed to repair large adducts such as UV-induced thymine dimers. In NER, RPA 70 protein is used to measure ∼30 bp from an initial nicking site to a secondary nicking site on the same strand (25). We hypothesize that the 31-bp bias seen in partially tandem duplications is a result of nicking errors on the complementary strand (Fig. 1D). Such errors would induce a DSB with two complementary 31-bp 5′ overhangs. Unlike the patch-mediated model (Fig. 2C), where in-filling is concurrent with DSB creation, overhangs produced by NER error would likely undergo less direct NHEJ. As noted above, we do not see an enrichment in 31-bp tandem duplications as would be expected if these ends were simply filled in and ligated.

The presence of local duplications with short stretches of intervening DNA has been observed in the human genome (16). Many of these are likely to be partially tandem duplications (Fig. 3), but the authors also report on a class of mutations similar to the local duplications seen in this study. As observed in humans and Arabidopsis (16), there is a sharp limitation to the distance between these duplications (Fig. 3A). In addition, we find that such insertions can be the composite of many local donor sequences. Unlike tandem repeats, such patterns are difficult to explain by the patch-mediated model, but, like tandem repeats, they appear to lack a signature of primed synthesis at an ectopic site (Fig. 3C). Some members of the X family of DNA polymerases in humans engage in primer-free but template-directed synthesis during NHEJ. Polymerase μ can even tolerate steric conflict between template and the terminal priming base (26). Rice only has one X polymerase, and it is more similar to the less-promiscuous human polymerase λ (27). Given the divergence, its function may have expanded, and/or other polymerases from the Y family that are tolerant of terminal priming mismatch may be active during NHEJ (28, 29). Thus, the molecular machinery to explain local duplications is present in humans and, likely, in plants. The constrained distance from the insertion site of these duplications suggests that the polymerase is using the single-stranded end of the resectioned DNA as a template because they cannot invade double-stranded DNA. Still, given that single-strand annealing (SSA) is commonly used to resolve DSBs (3), it is difficult to imagine why the nascent duplex synthesized by an X or Y polymerase would ever denature. Such denaturing in SDSA appears to be mediated by specific helicases (30, 31), and given that we observe long, local duplications (Fig. 3A), the proposed pathway would also be dependent on these helicases outcompeting the SSA machinery.

If X/Y-type polymerases are mediating replication slippage, arguments against the slippage model based on priming constraints are not particularly valid. Still, replication slippage is an unlikely explanation given the following additional observations. (i) The partially tandem duplications that are commonly observed would require the polymerase to copy from an ectopic site and then slip 5′ to where it initially stalled, despite the added insertion and the affinity between preexisting strands. Moreover, the 31-bp bias in this duplication category (Fig. 1D) would require that the polymerase commonly slip back 31 bp from where it initially stalled. (ii) As observed previously (10), we find no correlation between length of insertion and end homology (Fig. 1D). Longer homology would almost certainly be required for stabilizing long, unhybridized loops. (iii) If long strands are so easily displaced, it is unclear why, during a slippage event, the leading strand would prefer its template strand over the lagging strand template, thus creating a duplication in the opposite orientation. As described above, such duplications are rarely observed.

It is plausible that, if a terminal priming base is not required, both local and tandem duplications could be explained by a polymerase X or Y repair model discussed above. In the case of tandem duplication, the polymerase would start at the end of the break and copy the resectioned strand; the extension would be denatured, and NHEJ or SSA would attach the ends. However, data on various forms of resolved DSBs show that insertions, when they do occur, rarely, if ever, produce tandem duplications (9, 22). Although they do produce local duplications, these occur adjacent to deletions more commonly than as insertions alone (9). We observe very few such adjacent events: ∼3 in 1,000 insertions (Table 1). The fraction of local duplications is clearly >0.3% (Fig. 3B); therefore, a large number of local duplications still lack a strong mechanistic explanation. Indeed, one of the strengths of the patch-mediated model is that tandem repeats can mask the deletions observed in DSB repair experiments, simply resulting in a shorter duplicated region.

Because of the ubiquity of tandem duplications and the generality of the patch-mediated model, it is tempting to include local duplications as a possible outcome, but such a model would require an internal deletion of the duplicated region that is positionally coincident on the start of the duplication. A reasonable biochemical scenario that would result in such an outcome has yet to be conceived. Indeed, an agreed-upon mechanism to explain local duplications is still lacking (11). The continued analysis of mutants involved in DNA repair in plants, particularly the repair polymerases, should inspire better models. When coupled with next-generation sequencing, mutant analysis can be particularly informative and robust (9). However, current methods for inducing DSBs may poorly simulate the majority of naturally occurring DSBs. In the future, poststress resequencing of wild-type and mutant lines may be the most accurate method to achieve a thorough understanding of these events (32).

6688 | www.pnas.org/cgi/doi/10.1073/pnas.1321854111 | Vaughn and Bennetzen

Methods

Whole-Genome Alignments and Insertion Inference. To infer insertions and deletions, we first generated whole-genome alignments of O. sativa var. indica, O. sativa var. japonica, and O. glaberrima. The source and version of each genome are given in Table S1. The Mauve progressive aligner was run on each chromosome separately with default parameters (33). These alignments were not used directly for indel inference, but as a guide for downstream alignments used to infer the ancestral state of indel variations found in a large sample of cultivated rice accessions. The list of structural variations, all of which are >9 bp, was downloaded from ref. 18 on September 28, 2012. Only structural variations for indica, tropical japonica, and aromatic rice accessions were used. For each polymorphic site in the Nipponbare reference sequence, the 100 + 2.5l bases upstream and downstream, where l is the indel length, were extracted. Similarly, the outgroup sequence associated with the same region of the alignment was extracted. The variant sequence was reconstructed from the Nipponbare sequence based on the position of the indel and its sequence. These three sequences (reference, variant, and outgroup) were realigned by using MAFFT in default mode (FFT-NS-2) (34), and these alignments were used to infer events. For a gap to be inferred as an insertion or deletion, the 15 bases flanking both sides of the gap had to match the outgroup with a pairwise identity >90%. Additionally, the flanking sequences could not contain gaps.

Dot plots for assorted derived-ancestral pairs were generated by using dottup, which is available through the EMBOSS program suite (35), with additional PDF conversion. A word size of 3 bp was used in all cases.

Assessing the Trace Extension Metric, d. An insertion will produce a gap when aligned with a reference. The gap can only be placed exactly if the sequence within the insertion is unique relative to the flanking sequences into which it has inserted. This ambiguity can be seen in a dot plot, where the position chosen to start a gap can clearly be varied within a defined range (Fig. 1 B and C). Previous authors proposed a “trace extension” value, d, to account for this property (10). The trace extension is best illustrated in a dot plot; with the ancestral state on the x axis and the derived state on the y axis, d is the horizontal distance between the end of the first major diagonal and the beginning of the second. Based on the relationship between d and the length of the insertion, one can quickly determine whether the insertion resulted in a tandem duplication and whether the donor sequence possessed short, terminal repeats before insertion. We reimplemented the trace extension approach using blast2seq, which in effect reports all significantly long diagonals in a dot plot between two sequences. We aligned each derived state with the ancestral state using blast2seq with default parameters, except that the DUST filter was turned off (-F F) and gap extension and gap opening penalties were set to 4 and 2, respectively (-G 2 -E 4). These were optimized to disallow the spanning of gaps >10 bp. Only hits with an e-value of <0.001 were kept. Major diagonals, which represent alignment between the sequences to the left and right of an insertion (Fig. 1A), had to be >2.5l, where l is the length of the insertion.

Characterizing Insertions and Identifying Donor Loci. For counting purposes (Fig. 3B), we considered any derived insertion as a tandem duplication if the alignment with the ancestral sequence resulted in d − l > −3. End homology of a tandem duplication (Fig. 2) was calculated by subtracting l from d (Fig. 1) for d − l > −1. The blast2seq alignments used above could also be used to characterize local duplications. For entirely nontandem insertions (d < 4), local donor sites were found by identifying sequences upstream/downstream of the insertion that overlapped the insertion by >7 bp and had identity of >95%. The percent coverage of a match was calculated as the length of the match divided by the length of the insertion. Multiple donor sites were allowed if the donor sites did not overlap within the insertion by >3 bp. The 3′ end homology of local duplications (Fig. 3C) was only assessed for insertions with d = 0 and was calculated by subtracting the start of the gap in the ancestral sequence (relative to the derived sequence) from the beginning of the insertion match position in the ancestral sequence and vice versa for 5′ end homology.

Software. Unless otherwise stated, Perl scripts were written to perform these analyses; the following external libraries were also used and are available from CPAN (www.cpan.org): Bioperl (Version 6.1) (36) and Set::IntSpan (Version 1.16).

ACKNOWLEDGMENTS. This work was supported by National Science Foundation Plant Genome Program Awards 0607123 and 043707-01.
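The numeric rules given in Methods for flank extraction and for classifying insertions by the trace extension d and insertion length l can be sketched as follows. This is a minimal illustration in Python, not the Perl code used in the study. All function names are hypothetical. Rounding the flank length up, labeling the cases that are neither tandem nor entirely nontandem as "intermediate", and taking end homology as d − l (nonnegative under the stated condition d − l > −1) are assumptions.

```python
import math
from typing import Optional


def flank_length(indel_len: int) -> int:
    """Bases taken on each side of a polymorphic site: 100 + 2.5 * l.

    Rounding up is an assumption; the text does not say how a
    fractional length is handled.
    """
    return math.ceil(100 + 2.5 * indel_len)


def is_tandem(d: int, l: int) -> bool:
    """Counting rule from Methods: tandem duplication if d - l > -3."""
    return d - l > -3


def end_homology(d: int, l: int) -> Optional[int]:
    """End homology of a tandem duplication, assumed here to be d - l.

    Only defined when d - l > -1; returns None otherwise.
    """
    return d - l if d - l > -1 else None


def is_entirely_nontandem(d: int) -> bool:
    """Entirely nontandem insertions are those with d < 4."""
    return d < 4


def percent_coverage(match_len: int, insertion_len: int) -> float:
    """Match length divided by insertion length, expressed as a percentage."""
    return 100.0 * match_len / insertion_len


def classify(d: int, l: int) -> str:
    """Assign one label per insertion from d and l.

    The 'intermediate' label for the remaining cases is an assumption.
    """
    if is_tandem(d, l):
        return "tandem"
    if is_entirely_nontandem(d):
        return "nontandem"
    return "intermediate"
```

For insertions in the 10–1,000 bp size class studied here, the tandem and entirely nontandem conditions cannot both hold, so the order of the checks in `classify` does not affect the result.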

1. Ma J, Bennetzen JL (2004) Rapid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA 101(34):12404–12410.
2. Wicker T, Buchmann JP, Keller B (2010) Patching gaps in plant genomes results in gene movement and erosion of colinearity. Genome Res 20(9):1229–1237.
3. Puchta H (2005) The repair of double-strand breaks in plants: Mechanisms and consequences for genome evolution. J Exp Bot 56(409):1–14.
4. Fauser F, et al. (2012) In planta gene targeting. Proc Natl Acad Sci USA 109(19):7535–7540.
5. Woodhouse MR, Pedersen B, Freeling M (2010) Transposed genes in Arabidopsis are often associated with flanking repeats. PLoS Genet 6(5):e1000949.
6. Salomon S, Puchta H (1998) Capture of genomic and T-DNA sequences during double-strand break repair in somatic plant cells. EMBO J 17(20):6086–6095.
7. Pace JK, 2nd, Sen SK, Batzer MA, Feschotte C (2009) Repair-mediated duplication by capture of proximal chromosomal DNA has shaped vertebrate genome evolution. PLoS Genet 5(5):e1000469.
8. Mali P, et al. (2013) RNA-guided human genome engineering via Cas9. Science 339(6121):823–826.
9. Huefner ND, Mizuno Y, Weil CF, Korf I, Britt AB (2011) Breadth by depth: Expanding our understanding of the repair of transposon-induced DNA double strand breaks via deep-sequencing. DNA Repair (Amst) 10(10):1023–1033.
10. Messer PW, Arndt PF (2007) The majority of recent short DNA insertions in the human genome are tandem duplications. Mol Biol Evol 24(5):1190–1197.
11. Thomas EE (2005) Short, local duplications in eukaryotic genomes. Curr Opin Genet Dev 15(6):640–644.
12. Nourmohammad A, Lässig M (2011) Formation of regulatory modules by local sequence duplication. PLOS Comput Biol 7(10):e1002167.
13. Levinson G, Gutman GA (1987) Slipped-strand mispairing: A major mechanism for DNA sequence evolution. Mol Biol Evol 4(3):203–221.
14. Chen J-M, Chuzhanova N, Stenson PD, Férec C, Cooper DN (2005) Meta-analysis of gross insertions causing human genetic disease: Novel mutational mechanisms and the role of replication slippage. Hum Mutat 25(2):207–221.
15. Reyon D, et al. (2012) FLASH assembly of TALENs for high-throughput genome editing. Nat Biotechnol 30(5):460–465.
16. Thomas EE, et al. (2004) Distribution of short paired duplications in mammalian genomes. Proc Natl Acad Sci USA 101(28):10349–10354.
17. Vispé S, Satoh MS (2000) DNA repair patch-mediated double strand DNA break formation in human cells. J Biol Chem 275(35):27386–27392.
18. Xu X, et al. (2012) Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol 30(1):105–111.
19. Zhao G, Guan Y (2010) Polymerization behavior of Klenow fragment and Taq DNA polymerase in short primer extension reactions. Acta Biochim Biophys Sin (Shanghai) 42(10):722–728.
20. Montgomery SB, et al.; 1000 Genomes Project Consortium (2013) The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res 23(5):749–761.
21. Roth DB, Porter TN, Wilson JH (1985) Mechanisms of nonhomologous recombination in mammalian cells. Mol Cell Biol 5(10):2599–2607.
22. Lloyd AH, Wang D, Timmis JN (2012) Single molecule PCR reveals similar patterns of non-homologous DSB repair in tobacco and Arabidopsis. PLoS ONE 7(2):e32255.
23. Ward JF (1988) DNA damage produced by ionizing radiation in mammalian cells: Identities, mechanisms of formation, and reparability. Prog Nucleic Acid Res Mol Biol 35:95–125.
24. Mali P, et al. (2013) CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering. Nat Biotechnol 31(9):833–838.
25. Costa RMA, Chiganças V, Galhardo RdaS, Carvalho H, Menck CF (2003) The eukaryotic nucleotide excision repair pathway. Biochimie 85(11):1083–1099.
26. Nick McElhinny SA, et al. (2005) A gradient of template dependence defines distinct biological roles for family X polymerases in nonhomologous end joining. Mol Cell 19(3):357–366.
27. Amoroso A, et al. (2011) Oxidative DNA damage bypass in Arabidopsis thaliana requires DNA polymerase λ and proliferating cell nuclear antigen 2. Plant Cell 23(2):806–822.
28. García-Ortiz MV, Ariza RR, Hoffman PD, Hays JB, Roldán-Arjona T (2004) Arabidopsis thaliana AtPOLK encodes a DinB-like DNA polymerase that extends mispaired primer termini and is highly expressed in a variety of tissues. Plant J 39(1):84–97.
29. Garcia-Diaz M, Bebenek K (2007) Multiple functions of DNA polymerases. CRC Crit Rev Plant Sci 26(2):105–122.
30. Sebesta M, Burkovics P, Haracska L, Krejci L (2011) Reconstitution of DNA repair synthesis in vitro and the role of polymerase and helicase activities. DNA Repair (Amst) 10(6):567–576.
31. Roth N, et al. (2012) The requirement for recombination factors differs considerably between different pathways of homologous double-strand break repair in somatic plant cells. Plant J 72(5):781–790.
32. St Charles J, et al. (2012) High-resolution genome-wide analysis of irradiated (UV and γ-rays) diploid yeast cells reveals a high frequency of genomic loss of heterozygosity (LOH) events. Genetics 190(4):1267–1284.
33. Rissman AI, et al. (2009) Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 25(16):2071–2073.
34. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol 30(4):772–780.
35. Rice P, Longden I, Bleasby A (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet 16(6):276–277.
36. Stajich JE, et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12(10):1611–1618.

Vaughn and Bennetzen | PNAS | May 6, 2014 | vol. 111 | no. 18 | 6689