SACKLER SPECIAL FEATURE

Access to DNA establishes a secondary target site bias for the yeast retrotransposon Ty5

Joshua A. Baller1, Jiquan Gao1, and Daniel F. Voytas2

Department of Genetics, Cell Biology, and Development, and Center for Genome Engineering, University of Minnesota, Minneapolis, MN 55455

Edited by Joan Curcio, Wadsworth Center, New York State Department of Health, Albany, NY, and accepted by the Editorial Board June 28, 2011 (received for review March 11, 2011)

GENETICS

PNAS | December 20, 2011 | vol. 108 | no. 51 | 20351–20356 | www.pnas.org/cgi/doi/10.1073/pnas.1103665108

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Telomerase and Retrotransposons: Reverse Transcriptases That Shaped Genomes” held September 29 and 30, 2010, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and audio files of most presentations are available on the NAS Web site at www.nasonline.org/telomerase_and_retrotransposons.

Author contributions: J.A.B., J.G., and D.F.V. designed research; J.A.B. and J.G. performed research; J.A.B. and J.G. contributed new reagents/analytic tools; J.A.B., J.G., and D.F.V. analyzed data; and J.A.B., J.G., and D.F.V. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission. J.C. is a guest editor invited by the Editorial Board.

1J.A.B. and J.G. contributed equally to this work.

2To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1103665108/-/DCSupplemental.

Integration sites for many retrotransposons and retroviruses are determined by interactions between retroelement-encoded integrases and specific DNA-bound proteins. The Saccharomyces retrotransposon Ty5 preferentially integrates into heterochromatin because of interactions between Ty5 integrase and the heterochromatin protein silent information regulator 4. We mapped over 14,000 Ty5 insertions onto the S. cerevisiae genome, 76% of which occurred in heterochromatin, which is consistent with the known target site bias of Ty5. Using logistic regression, associations were assessed between Ty5 insertions and various chromosomal features such as genome-wide distributions of nucleosomes and histone modifications. Sites of Ty5 insertion, regardless of whether they occurred in heterochromatin or euchromatin, were strongly associated with DNase hypersensitive, nucleosome-free regions flanking genes. Our data support a model wherein silent information regulator 4 tethers the Ty5 integration machinery to domains of heterochromatin, and then, specific target sites are selected based on DNA access, resulting in a secondary target site bias. For insertions in euchromatin, DNA access is the primary determinant of target site choice. One consequence of the secondary target site bias of Ty5 is that insertions in coding sequences occur infrequently, which may preserve genome integrity.

The insertion of mobile genetic elements into new chromosomal sites profoundly impacts genome structure and evolution. For many mobile elements, integration sites are not chosen randomly. Target site biases are particularly well-documented for the LTR retrotransposons and retroviruses (1–3). These retroelements replicate by reverse-transcribing mRNA into cDNA and then inserting the cDNA into their host’s genome using an element-encoded integrase (IN). Retrotransposons are among the most abundant interspersed repeats in eukaryotic genomes, and retroviruses are often used as vectors for gene therapy. Understanding mechanisms of retroelement target site choice, therefore, has value for both basic and applied research.

In the best studied cases, retroelement target site choice is dictated by interactions between IN and specific DNA-bound proteins. HIV IN, for example, interacts with the transcription coactivator lens epithelial-derived growth factor (4), and sites of HIV integration are influenced by sites of this protein’s chromosomal occupancy (5). The role of chromatin in target site choice is also well-established for model yeast retrotransposons. The Schizosaccharomyces pombe Tf1 element inserts preferentially into regions upstream of some genes transcribed by RNA polymerase (pol) II (6). Tf1 IN interacts with the transcription factor Atf1p (7), and at the fbp1 promoter, Atf1p alone mediates target site choice (8). The Saccharomyces cerevisiae Ty1 and Ty3 retrotransposons prefer to integrate upstream of genes transcribed by RNA pol III, likely because of interactions between IN and components of the pol III machinery or associated chromatin (9, 10). In the case of Ty3, critical factors for targeting are the TATA binding protein and Brf (also called TFIIIB70) (11, 12).

The first retroelement for which a targeting mechanism was described in detail was the Saccharomyces retrotransposon Ty5. Ty5 integrates preferentially into heterochromatin, which, in yeast, is found near the telomeres and silent mating loci (HML and HMR) (13–15). Ty5 IN selects integration sites using a 6-aa motif at the IN C terminus (16, 17). This IN targeting domain interacts with a protein component of heterochromatin, namely silent information regulator 4 (Sir4) (16, 18). The Ty5 IN/Sir4 interaction tethers the integration complex to target sites and results in the primary target site bias of Ty5.

In this study, we applied high-throughput DNA sequencing to characterize a large number of Ty5 insertions that we mapped to the S. cerevisiae genome. Whereas the majority of Ty5 elements integrated as predicted in heterochromatin, a secondary target site bias was revealed for both euchromatic and heterochromatic insertions. Logistic regression established that this secondary bias was influenced by chromosomal features characteristic of open chromatin, including DNase hypersensitivity, lack of nucleosomes, presence of transcription factors, and epigenetic marks associated with gene transcription. We provide evidence suggesting that this secondary target site bias reflects sites that can be easily accessed by the Ty5 integration complex during integration.

Results

Ty5 Insertion Dataset. To observe genome-wide patterns of Ty5 integration, we created an integrant library of ∼400,000 independent transposition events. This library was derived from 16 separate Ty5 transposition assays: 8 assays using the WT YPH499 haploid strain and 8 assays using the isogenic WT diploid YPH501. Ty5/host DNA junction fragments were recovered from each of the 16 populations using linker-mediated PCR. Linkers were ligated to genomic DNA that had been digested with restriction enzymes. Four enzymes (each recognizing four bases) were used to maximize potential to recover sites and minimize recovery bias. The genomic sequence at each insertion site was determined by pyrosequencing using the 454 GS FLX platform.

In total, ∼337,000 sequencing reads were obtained (Table 1). Specific barcode sequences in the PCR primers made it possible to assign reads to 1 of 16 transposition assays. Reads were excluded that (i) did not have a perfect match to a barcode and surrounding DNA or (ii) had more than four mismatches to the primer. Furthermore, insertions at a given position and orientation were only counted once in each pool. In total, ∼160,000 reads passed our filters. Sequences sharing more than 98% sequence identity to a single site on the S. cerevisiae genome were designated as unambiguous insertions. Because Ty5 integrates preferentially into repetitive, subtelomeric regions, reads mapping to multiple sites in the genome (greater than 98% sequence identity) were also considered. These ambiguous insertions were down-weighted by a factor equal to the number of sites to which the read mapped (i.e., each ambiguous site was assigned a fraction of an integration event); 40% of the high-quality reads were ambiguous.

Table 1. Ty5 insertion sites recovered by pyrosequencing

Strain name and                                     Base pairs hosting     Base pairs hosting
pool number      Ploidy    All reads  Clean reads   ambiguous alignments   unambiguous alignments
YPH499-1         Haploid   21,960     10,368        468                    743
YPH499-2         Haploid   22,050     11,082        423                    847
YPH499-3         Haploid   22,559     10,356        370                    673
YPH499-4         Haploid   23,351     10,868        444                    766
YPH499-5         Haploid   22,102     10,525        400                    719
YPH499-6         Haploid   21,367      9,161        361                    637
YPH499-7         Haploid   21,361      9,816        540                    912
YPH499-8         Haploid   21,779     10,749        568                    987
YPH501-1         Diploid   18,605      9,127        348                    389
YPH501-2         Diploid   20,365      9,485        207                    228
YPH501-3         Diploid   19,292      8,680        264                    214
YPH501-4         Diploid   20,889      9,903        222                    287
YPH501-5         Diploid   19,572      9,182        212                    234
YPH501-6         Diploid   21,967     10,460        205                    279
YPH501-7         Diploid   21,014     10,542        346                    450
YPH501-8         Diploid   19,237      8,906        205                    243
Not assigned                6,399     –             –                      –

Primary Target Site Bias of Ty5. The majority of Ty5 insertions mapped to the ends of all 16 S. cerevisiae chromosomes (Fig. 1 and Fig. S1). Thus, the primary pattern of Ty5 integration matched what we predicted based on our previous work showing the key role played by heterochromatin in target site choice (15, 18).

[Figure 1: insertion counts along chr 3. y axis: insertions; x axis: position (bp), 0 to 300,000; landmarks: HML, CEN3, MAT, HMR.]

Fig. 1. Distribution of Ty5 insertions on chr 3. The x axis denotes position along the chromosome at 1,000-bp resolution. Black bars indicate the number of unambiguous integrations at a particular site; stacked green bars indicate additional ambiguous integrations. Bars above the x axis indicate data from the haploid strain; bars below the x axis denote data from the diploid strain.

Because most Ty5 insertions were subtelomeric, for subsequent analyses, the genome was split into two regions, designated euchromatin and heterochromatin. Heterochromatic regions began at the end of a chromosome and ended 10 kb centromere proximal to the subtelomeric X repeat or one of the silent mating loci, HML or HMR. By this definition, heterochromatin constituted 4% of the genome and received 76% of the insertions. This insertion density is likely an underestimate, because reads mapping to the same position were excluded if they were derived from the same pool; such duplicate reads may represent independent insertions at the same site. Euchromatic regions comprised most of the chromosomes and were bounded by centromere-proximal points 40 kb distant from an X repeat, HML, or HMR. This left a 30-kb buffer between heterochromatin and euchromatin to ensure that signals were distinct. The euchromatin and buffer regions constituted 88% and 7% of the genome, respectively. The rDNA and MAT were excluded from euchromatin, because the former is not accurately represented in the reference genome and the latter contained many ambiguous insertions because of duplicated sequences at the silent mating loci.

Selection could influence the distribution of Ty5 insertions; for example, insertions may not be recovered if they occur in essential genes in haploid strains. To assess impacts of selection, Ty5 insertion sites were compared between the haploid and diploid populations. The haploid and diploid chromosomal distributions were nearly identical, with a Pearson’s correlation of 0.82 at 10-bp resolution. Selection, therefore, does not play a significant role in global patterns of Ty5 integration.

Relationships Between Ty5 Insertions and Chromosomal Features. For S. cerevisiae, a large body of genome-wide data has accumulated describing, for example, distributions of various histone modifications, transcription factor binding sites, and nucleosome occupancy (Table S1). To better understand factors that influence Ty5 target site choice, we used logistic regression to establish associations between insertions and these chromosomal features as well as DNA sequence landmarks such as ORFs or specific gene classes (e.g., those genes transcribed by RNA pol III). Our implementation compared sites of observed integration (case) with a random subset of sites without integrations (control). The random distribution was corrected for possible recovery bias because of restriction site distribution. Additionally, the overall quality of the model was evaluated using receiver operating characteristic (ROC) analysis, particularly the value of the area under the curve (AUC). Logistic regression was applied to the euchromatic and heterochromatic datasets separately (Fig. 2). Both single- and multidimensional models were evaluated, and both gave the same overall conclusions. In the following paragraphs, we illustrate the major findings of 1D logistic regression using representative examples of euchromatic and heterochromatic Ty5 target sites. Details about the multidimensional models are provided in Fig. S2.

Ty5 insertions in heterochromatin. Recently, a genome-wide map of Sir4 chromosomal occupancy was determined (19), and to our initial surprise, logistic regression did not reveal an association between Ty5 insertions and sites of Sir4. In Fig. 3, we plot Ty5 insertions and Sir4 distribution at a few subtelomeric regions, and as can be seen, peaks of Sir4 and Ty5 insertions occur near the subtelomeric X repeats and the silencers flanking HMR (Fig. S1). As illustrated by these examples, Sir4 is highly localized, and sites of Sir4 occupancy are predictive of sites of Ty5 integration. However, because very little Sir4 is found elsewhere throughout the subtelomeric region (or the remainder of the genome), the majority of insertions in heterochromatin (or euchromatin) have no clear link to Sir4 distribution. Our logistic regression model only considers chromosomal features at or near (e.g., within 1 kb) a Ty5 insertion site, and therefore, logistic regression did not reveal a strong Ty5/Sir4 association.

Sir4 aside, logistic regression identified several chromosomal features in heterochromatin that were positively or negatively associated with Ty5 insertions. Among these features was a positive association (AUC-0.5 = 0.11) with 1-kb regions centered on known autonomously replicating sequences (ARSs), which often serve as sites of DNA replication (Fig. 2) (20). The subtelomeric X repeats, which are bound by Sir4, also contain an ARS, and in our previous work, Ty5 insertions were considered targeted if they occurred within a 3-kb window centered on an X ARS (15).

[Figure 2: heat maps of AUC-0.5 values, one panel per chromatin class.
Euchromatic: DNAseI 0.25; Hermes 0.24; Ty5 in Δsir4 0.24; Upstream ORF 0.15; Acetyl H3K14 0.15; Acetyl H3K9 0.15; Transcription Factors 0.14; Acetyl H4 0.11; 3-Methyl H3K4 0.06; 1-Methyl H3K4 −0.10; HMM Nucleosomes −0.20; In ORF −0.20; ChIP Nucleosomes −0.25.
Heterochromatic: DNAseI 0.25; Hermes 0.19; Ty5 in Δsir4 0.16; Near ARS 0.11; Upstream ORF 0.06; In Uncharacterized ORF −0.08; In Y’ Element −0.12; In ORF −0.13; HMM Nucleosomes −0.14; H2AZ Nucleosomes −0.18; ChIP Nucleosomes −0.25.]

Fig. 2. Associations between Ty5 insertions and chromosomal features. Heat maps showing the area under the curve (AUC) of the receiver operating characteristic (ROC) curve from logistic classifiers trained on single features. Actual values shown are AUC-0.5. As such, zero indicates a model of no predictive power, whereas 0.5 and −0.5 indicate models of perfect predictive power. Positive AUCs signify features associated with case integrations; negative AUCs signify sites associated with control integrations. Heat maps for insertions in euchromatin (on the left) and heterochromatin (on the right) were generated from separate models. Details of the datasets used for various chromosomal features can be found in Table S1.

[Figure 3: three panels, A (Chr 15L), B (Chr 12L), and C (Chr 3R). Each panel shows a Sir4 heat map and insertion counts against position (bp); landmarks: TEL15L-X (A), TEL12L-X (B), HMR and TEL03R-X (C).]

Fig. 3. Ty5 insertions in heterochromatin. Representative heterochromatic domains are shown for the left subtelomeric region of chr 15 (A), the left subtelomeric region of chr 12 (B), and the right subtelomeric region of chr 3 (C). Verified (red) and uncharacterized (tan) ORFs are depicted. Black and green bars indicate the frequency of unambiguous and ambiguous integration events, respectively. Bars above the x axis indicate integrations in the haploid strain; bars below the x axis are integrations in the diploid. The heat map at the top of the graph displays Sir4 occupancy in red; the color intensity was normalized to the chromosomal regions depicted.

The high incidence of insertions near X repeats (Fig. 3) likely explains the observed association with ARSs.

A negative association (AUC-0.5 = −0.12) was identified between Ty5 insertions and Y′ elements, which are repeats at the ends of some yeast chromosomes that are typically either 5.5 or 6.7 kb in length and encode a helicase (21). The Y′ coding region, in particular, was a cold spot for integration, which is illustrated for the two tandem Y′ elements on chr 12L (Fig. 3B). Insertion hotspots, however, occurred on the centromere-proximal side of the Y′ elements (the side adjacent to an X repeat) and at sites rich in Sir4 between the Y′ elements and at the telomere itself. The coding sequences of Y′ elements are bound by nucleosomes, and the Ty5 insertion hotspots flanking Y′ elements lack nucleosomes (22, 23). The pattern of Ty5 insertions is, therefore, consistent with the finding that nucleosome occupancy is a strong negative predictor of Ty5 insertion sites (Fig. 2). Nucleosomes were represented in two different forms in the regression model: either as processed ChIP probe values (AUC-0.5 = −0.25) or as a ternary prediction from a hidden Markov model trained on the ChIP data (AUC-0.5 = −0.14) (23). Nucleosomes were also avoided if they contained H2AZ (AUC-0.5 = −0.18), an H2A variant enriched in transcriptionally inactive genes (24).

On chr 3, heterochromatic domains are found at the telomeres and silent mating loci, the latter of which are located up to 30 kb from the end of the chromosome. As illustrated for the right arm of chr 3 (Fig. 3C), in addition to peaks of Ty5 insertions near the silencers flanking HMR and at the X repeat, clusters of insertions occur throughout the region telomere proximal to HMR, particularly in intergenic regions. Localized selection does not contribute to the distribution pattern, because none of the genes on the right arm of chr 3 are essential (25). Furthermore, a similar insertion distribution is observed in both haploid and diploid strains. Clustering of Ty5 insertions adjacent to coding sequences can also be seen in other subtelomeric regions (e.g., chr 12L) (Fig. 3B). This pattern is consistent with the results of logistic regression, indicating that heterochromatic insertions are slightly associated with upstream regions of genes (AUC-0.5 = 0.06) and very strongly associated with DNase hypersensitive sites (AUC-0.5 = 0.25), a feature characteristic of many promoters.

Ty5 insertions in euchromatin. Logistic regression performed on euchromatic insertions revealed a similarly pronounced association between Ty5 and regions flanking genes. As with heterochromatin, Ty5 insertions showed a strong positive association with DNase hypersensitive sites (AUC-0.5 = 0.25) and regions upstream of verified ORFs (AUC-0.5 = 0.20). Other features characteristic of actively transcribed genes were also positively associated, such as H3 K14 and H3 K9 acetylation (AUC-0.5 = 0.15) (26) and sites bound by transcription factors (AUC-0.5 = 0.14). Negative associations were similar to those of heterochromatin, namely that Ty5 was less likely to be found in coding sequences (AUC-0.5 = −0.20) and sites bound by nucleosomes (AUC-0.5 = −0.20 hidden Markov model or −0.25 ChIP). Representative Ty5 hotspots in euchromatin are illustrated in Fig. S3.

Secondary Target Site Bias of Ty5. Because Ty5 insertions in both euchromatin and heterochromatin were enriched in intergenic regions, we generated composite figures relating Ty5 insertions to ORFs in both of these chromatin environments (Fig. 4). On average, insertions begin to occur near the start codon and peak ∼100 bp upstream at a site corresponding to minimal nucleosome occupancy. Insertion frequency falls off to background levels ∼1,000 bp upstream of the translational start. A smaller peak of insertions is also observed in a nucleosome-poor region downstream of the ORFs. As indicated by the logistic regression analyses, Ty5 avoids integrating into the nucleosome-bound coding sequences. Subtle discrepancies distinguished euchromatin and heterochromatin integration patterns; for example, there is a clear peak of Sir4 density downstream of ORFs in heterochromatin and an adjacent peak of Ty5 insertions. In the Δsir4 strain, the Ty5 peak shifts to the site occupied by Sir4 in the WT, suggesting that this site may now be more accessible to the integration complex.

[Figure 4: two panels, Heterochromatin and Euchromatin, each plotting Ty5 density, Ty5 density in Δsir4, nucleosome density, and Sir4 density. x axis: distance upstream of ORF (from −1,000 bp), percentage of way through ORF (0% to 100%), and distance downstream of ORF (to 1,000 bp).]

Fig. 4. Ty5 insertions near verified ORFs. The X dimension represents position in and around verified ORFs. To account for ORFs of different lengths, the region within the ORFs was scaled as a percentage of ORF length. Datasets were smoothed and scaled for easy comparison. As a result of scaling, all units are arbitrary, and the integrals of all curves are equal.

One hypothesis to explain local Ty5 integration patterns is that there is a host protein like Sir4 that acts as a positive targeting determinant, drawing Ty5 insertions to promoter regions. To assess whether Sir4 itself contributes to local integration patterns, we evaluated a large dataset of Ty5 insertions recovered from a Δsir4 strain. These insertions were generated to establish baseline patterns of Ty5 integration for calling card experiments (27). A given transcription factor can be made into a Ty5 calling card by fusing it to the domain of Sir4 that interacts with Ty5 IN (28). Ty5 insertion sites in yeast strains expressing the calling cards identify chromosomal sites occupied by the transcription factor. We treated Ty5 insertions in the Δsir4 strain as chromosomal features and evaluated their association with insertions generated in WT strains using logistic regression. The Ty5 insertions in Δsir4 showed a significant positive association with insertions generated in WT in both euchromatin (AUC-0.5 = 0.24) and heterochromatin (AUC-0.5 = 0.16) (Fig. 2). Insertion sites in both strains were correlated (assuming 1-kb windows; Spearman ρ = 0.255, P < 2.2e-16). This finding is evidenced in Fig. 4, where insertions in Δsir4 are mapped relative to ORFs. Secondary targeting patterns, therefore, are not caused by Sir4, and if a different positive targeting determinant is responsible, it remains elusive.

An alternative hypothesis to explain secondary targeting patterns is that insertion hotspots simply reflect sites accessible to the Ty5 integration complex. This hypothesis is consistent with DNase hypersensitivity being the strongest positive predictor of Ty5 integration sites in both heterochromatin and euchromatin (Fig. 2). Recently, a large number of insertion sites were recovered in yeast using the Hermes DNA transposon from housefly (29). Like Ty5, Hermes strongly prefers nucleosome-free regions. The Hermes dataset proved to be the second best predictor of Ty5 integration sites in both euchromatin (AUC-0.5 = 0.24) and heterochromatin (AUC-0.5 = 0.19) (Fig. 2). Correspondence between Hermes insertions and Ty5 insertions in WT and Δsir4 strains can be visualized on a genome-wide level (Fig. 1 and Fig. S1) and at select euchromatic sites (Fig. S3). As with the Ty5 insertions in Δsir4, the distribution of Hermes insertions is correlated with the distribution of Ty5 insertions in WT (assuming 1-kb windows; Spearman ρ = 0.257, P < 2.2e-16). One explanation for the similarity in integration patterns of Hermes and Ty5 in WT and Δsir4 strains is that these preferred sites represent open chromatin where these mobile elements can gain access to DNA. This explanation is also supported by the observation that Ty5 insertion sites are most positively associated with sites of DNase hypersensitivity and by our multidimensional model (Fig. S2), which produces an AUC-0.5 of 0.30 using only features associated with open DNA. Access to DNA, therefore, is likely the basis for the secondary target site bias of Ty5.

Discussion

The ability to recover large numbers of insertions using high-throughput DNA sequencing technologies provides a powerful means to understand mechanisms underlying target site choice. Complementing the robust and quantitative measures of target specificity afforded by this approach is the wealth of genome-wide information that makes it possible to discern associations between mobile element insertions and specific chromosomal features. Pioneering work in this regard was performed with HIV, in which associations between insertion sites and various chromosomal features were assessed by computational approaches, including logistic regression (30, 31). We adopted a similar approach with our dataset of over 14,000 Ty5 insertions and the extensive genome-wide datasets available for S. cerevisiae. One additional advantage of applying this approach in a model organism like yeast is that insertions can be readily recovered in various mutant backgrounds (e.g., Δsir4). The additional use of genetic resources available for S. cerevisiae will undoubtedly lead to new insights into mechanisms by which Ty5 and other yeast transposable elements select chromosomal integration sites.

Our genome-wide analysis reinforced what was previously known about the primary target site preference of Ty5, namely that insertions predominantly occur in domains of heterochromatin. To our surprise, however, we did not observe a tight association between sites of Ty5 integration and Sir4 occupancy; rather, insertions occurred throughout subtelomeric domains, including regions largely devoid of Sir4. Our 2D view of the genome and Sir4 occupancy, however, most certainly belies the actual architecture of subtelomeric regions. We believe that much of the subtelomeric DNA is actually within close proximity to sites enriched in Sir4 (Fig. 5); therefore, after the Ty5 IN/Sir4 tether is established, integration can occur throughout the subtelomeric region. Alternatively, Ty5 IN could be loaded onto heterochromatin by Sir4 and then scan the subtelomeric regions for target sites.

Ty5 integration patterns provide a readout for boundaries of heterochromatin on the yeast chromosomes. Probing chromatin is not a new role for Ty5, because changes in integration patterns have previously documented the chromatin dynamics that occur during aging, particularly the movement of Sir4 from the telomeres to the rDNA (32). In addition, the recently developed calling card approach cleverly uses Ty5’s ability to mark chromosomal occupancy of proteins (28). Ty5 calling cards are created by fusing the domain of Sir4 that interacts with Ty5 to a transcription factor, and Ty5 insertions mark chromosomal sites where the transcription factor is bound. Because many retroelements recognize specific chromatin features during integration, retroelements may increasingly prove to be valuable probes of chromatin dynamics.

Regardless of whether Ty5 integrates into euchromatin or heterochromatin, the chromosomal features influencing Ty5 target site choice were remarkably consistent. Ty5 insertions were associated with DNase hypersensitive, nucleosome-free sites, and other features linked to transcription, a pattern that we refer to as the secondary target bias of Ty5. On average, Ty5 insertions peak in nucleosome-free windows ∼100 bp upstream and downstream of coding sequences. A very similar pattern is observed for insertions generated in a Δsir4 strain, indicating that this secondary target site bias is not caused by Sir4. Hermes, a completely unrelated DNA transposon from the housefly, has an integration pattern correlated to that of Ty5. Hermes is not adapted to life in its heterologous host and uses a very different enzyme to catalyze integration into the yeast genome. Hermes insertion sites, therefore, likely identify open chromatin, and this finding is consistent with their correlation with DNase hypersensitive sites (Spearman ρ = 0.715, P < 2.2e-16). We believe that, based on the data at hand, the most parsimonious explanation of the secondary target site bias of Ty5 is that it is dictated by the accessibility of DNA to the Ty5 integration complex.

Secondary targeting patterns are not without consequence for genome structure and evolution. One consequence of integrating into nucleosome-free sites is that coding regions are often avoided, thereby limiting a negative consequence of transposition, namely insertional mutagenesis. It has been argued that heterochromatin, because it is gene-poor, provides a safe haven for Ty5 integration that minimizes deleterious consequences of transposition (33). It may be that integration into open chromatin provides an additional mechanism to avoid genes. That said, insertions in promoter regions likely have consequences for the regulation of adjacent genes, which could have important evolutionary outcomes. Our proposed mechanism underlying the secondary target bias of Ty5 may underlie well-established associations between other mobile genetic elements and promoter regions (6, 34, 35). Clearly, the discovery and initial characterization of the secondary target site bias of Ty5 as reported here reinforces the importance of chromatin in dictating retroelement target site choice.
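The window-based rank correlations quoted in the Results and Discussion (1-kb windows; Spearman ρ) can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the authors' code: the function names, the 0-based coordinates, and the toy positions are hypothetical, and no P value is computed. The `weights` argument reflects the down-weighting of ambiguous reads described in the Results, where a read mapping to k sites contributes 1/k of an insertion to each.

```python
import math


def window_counts(positions, weights, chrom_length, window=1000):
    """Sum insertion weights in fixed-size windows (0-based positions,
    each position assumed to be smaller than chrom_length)."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0.0] * n_windows
    for pos, weight in zip(positions, weights):
        counts[pos // window] += weight
    return counts


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: two insertion sets on a hypothetical 5-kb chromosome.
wt = window_counts([5, 1500, 1999, 2500, 4100], [1.0, 1.0, 0.5, 1.0, 1.0], 5000)
other = window_counts([700, 1200, 1300, 4500], [1.0, 1.0, 1.0, 1.0], 5000)
rho = spearman(wt, other)
```

Rank correlation is a natural choice here because insertion counts per window are heavily skewed toward a few hotspots, and ranks are insensitive to the size of those peaks.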

Fig. 5. A model describing the primary and secondary target site biases of Ty5. Ty5 IN interacts with Sir4, which localizes the integration complex to heterochromatin. This interaction results in the primary target site bias of Ty5, namely the association of ∼75% of Ty5 insertions with domains of heterochromatin. The secondary target site bias of Ty5 is determined by DNA access. Sites in heterochromatin are chosen for being nucleosome-free and accessible to the integration complex. Access to DNA also dictates the preferred integration sites of Ty5 in euchromatin, resulting in integration primarily in nucleosome-free regions flanking genes.

Baller et al. PNAS | December 20, 2011 | vol. 108 | no. 51 | 20355

Materials and Methods

Recovery of Ty5 Insertions. Ty5 transposition assays were performed as previously described using the haploid and diploid strains YPH499 and YPH501, respectively (15). The donor Ty5 plasmid was pNK254, which contains a galactose-inducible Ty5 element with a marker gene to detect transposition. Each Ty5 transposition assay gave rise to a pool of ∼25,000 Ty5 integrants. Genomic DNA was prepared from the pools and treated with two sets of restriction enzymes, AciI/TaqI and MspI/HinplI (Fig. S4). Linker-mediated amplification of integration sites was performed using the protocol found in the work by Ciuffi et al. (36). Digested DNA was ligated to a linker made up of two oligonucleotides, DVO4621 and DVO4622 (Table S2 shows linker sequences). To prevent amplification of the 5′ LTR, DNA samples treated with AciI/TaqI were digested with AseI; samples treated with MspI/HinplI were digested with EcoRI. The first round of PCR amplification used the Ty5 LTR-specific primer DVO495 and the linker-specific primer DVO4632. The second round of PCR amplification used DVO4665 and one of several bar-coded Ty5 LTR primers (DVO4666–DVO4681) (Table S2). PCR products were gel-purified, and fragments between 100 and 500 bp were sequenced using a 454 GLX sequencer.

Random Control Insertions. A total of 19,934 control insertions were produced in silico for euchromatin and 7,034 were produced for heterochromatin. Each control insertion was the product of three random values: a restriction site value, a position value, and an orientation value. These values select, respectively, a restriction site in the genome, a distance away from the restriction site, and an orientation for the control insertion. The probability distribution function for a control insertion's position and orientation was calculated as the normalized frequency of recovered insertions relative to the restriction sites used in recovery. Control insertions were made to be disjointed from known insertion sites. This process resulted in a set of control insertions with restriction bias similar to that of the recovered insertions.

Data Annotation and Analysis. Logistic regression was used to identify discriminative features for integration (Table S1). Regression models were trained using the glm log-linear regression function in the R statistical package (37, 38). Our implementation compared the sites of observed integration (case) with a random subset of the sites without integrations (control). Logistic regression fits the equation (Eq. 1)

\[
f(z) = \frac{1}{1 + e^{-z}}, \tag{1}
\]

where $f(z)$ is the class prediction and $z$ is a linear function, $z = \beta_0 + \sum_{i=1}^{n} \beta_i x_i$, of the levels $x_i$ of the $n$ chromosomal features.

Predictions from a logistic regression fall within the interval (0, 1), with proximity to the endpoints indicating greater certainty of a class designation. This information was used to produce a ROC curve, a plot of the true-positive rate vs. the false-positive rate parameterized on a discrimination threshold. An area under a ROC curve (AUC-ROC or AUC) of 0.5 indicates a model with no predictive power, whereas an AUC of 1.0 indicates perfect prediction. All AUC data presented herein are in the form of AUC − 0.5, where negative values indicate features showing a greater association with the control dataset.

ACKNOWLEDGMENTS. We thank H. Wang, D. Mayhew, and R. Mitra for making data available before publication. We thank R. Bushman and N. Milani for advice on data processing and statistical approaches.
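As an illustration of the logistic model (Eq. 1) and the AUC − 0.5 score described under Data Annotation and Analysis, the following is a minimal Python sketch. It is not the authors' implementation, which used the glm function in R. The function names are hypothetical, and the AUC is computed through its standard equivalence to the probability that a randomly chosen case site scores higher than a randomly chosen control site, rather than by tracing the ROC curve explicitly.

```python
import math


def logistic(z):
    """Eq. 1: f(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(beta0, betas, xs):
    """z = beta0 + sum_i beta_i * x_i for the feature levels xs of one site."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))


def auc(case_scores, control_scores):
    """Area under the ROC curve: the fraction of (case, control) pairs in
    which the case has the higher score, with ties counted as one half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def centered_auc(case_scores, control_scores):
    """AUC - 0.5: zero means no predictive power, negative values mean the
    feature is more strongly associated with the control set."""
    return auc(case_scores, control_scores) - 0.5


# Hypothetical coefficients and feature levels, for illustration only.
beta0, betas = -1.0, [2.0]
case_scores = [logistic(linear_predictor(beta0, betas, [x])) for x in (1.0, 0.8)]
control_scores = [logistic(linear_predictor(beta0, betas, [x])) for x in (0.1, 0.3)]
print(centered_auc(case_scores, control_scores))
```

The pairwise comparison is quadratic in the number of sites, which is adequate for a small illustration; a rank-based computation would be a more practical choice for tens of thousands of insertions.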

1. Ciuffi A, Bushman FD (2006) Retroviral DNA integration: HIV and the role of LEDGF/p75. Trends Genet 22:388–395.
2. Bushman FD (2003) Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell 115:135–138.
3. Sandmeyer S (2003) Integration by design. Proc Natl Acad Sci USA 100:5586–5588.
4. Cherepanov P, et al. (2003) HIV-1 integrase forms stable tetramers and associates with LEDGF/p75 protein in human cells. J Biol Chem 278:372–381.
5. Ciuffi A, et al. (2005) A role for LEDGF/p75 in targeting HIV DNA integration. Nat Med 11:1287–1289.
6. Guo Y, Levin HL (2010) High-throughput sequencing of retrotransposon integration provides a saturated profile of target activity in Schizosaccharomyces pombe. Genome Res 20:239–248.
7. Leem YE, et al. (2008) Retrotransposon Tf1 is targeted to Pol II promoters by transcription activators. Mol Cell 30:98–107.
8. Majumdar A, Chatterjee AG, Ripmaster TL, Levin HL (2011) Determinants that specify the integration pattern of retrotransposon Tf1 in the fbp1 promoter of Schizosaccharomyces pombe. J Virol 85:519–529.
9. Chalker DL, Sandmeyer SB (1992) Ty3 integrates within the region of RNA polymerase III transcription initiation. Genes Dev 6:117–128.
10. Devine SE, Boeke JD (1996) Integration of the yeast retrotransposon Ty1 is targeted to regions upstream of genes transcribed by RNA polymerase III. Genes Dev 10:620–633.
11. Yieh L, Hatzis H, Kassavetis G, Sandmeyer SB (2002) Mutational analysis of the transcription factor IIIB-DNA target of Ty3 retroelement integration. J Biol Chem 277:25920–25928.
12. Yieh L, Kassavetis G, Geiduschek EP, Sandmeyer SB (2000) The Brf and TATA-binding protein subunits of the RNA polymerase III transcription factor IIIB mediate position-specific integration of the gypsy-like element, Ty3. J Biol Chem 275:29800–29807.
13. Zou S, Voytas DF (1997) Silent chromatin determines target preference of the Saccharomyces retrotransposon Ty5. Proc Natl Acad Sci USA 94:7412–7416.
14. Zou S, Kim JM, Voytas DF (1996) The Saccharomyces retrotransposon Ty5 influences the organization of chromosome ends. Nucleic Acids Res 24:4825–4831.
15. Zou S, Ke N, Kim JM, Voytas DF (1996) The Saccharomyces retrotransposon Ty5 integrates preferentially into regions of silent chromatin at the telomeres and mating loci. Genes Dev 10:634–645.
16. Xie W, et al. (2001) Targeting of the yeast Ty5 retrotransposon to silent chromatin is mediated by interactions between integrase and Sir4p. Mol Cell Biol 21:6606–6614.
17. Gai X, Voytas DF (1998) A single amino acid change in the yeast retrotransposon Ty5 abolishes targeting to silent chromatin. Mol Cell 1:1051–1055.
18. Zhu Y, Dai J, Fuerst PG, Voytas DF (2003) Controlling integration specificity of a yeast retrotransposon. Proc Natl Acad Sci USA 100:5891–5895.
19. Zill OA, Scannell D, Teytelman L, Rine J (2010) Co-evolution of transcriptional silencing proteins and the DNA elements specifying their assembly. PLoS Biol 8:e1000550.
20. Rehman MA, Yankulov K (2009) The dual role of autonomously replicating sequences as origins of replication and as silencers. Curr Genet 55:357–363.
21. Louis EJ, Haber JE (1992) The structure and evolution of subtelomeric Y’ repeats in Saccharomyces cerevisiae. Genetics 131:559–574.
22. Zhu X, Gustafsson CM (2009) Distinct differences in chromatin structure at subtelomeric X and Y’ elements in budding yeast. PLoS One 4:e6363.
23. Lee W, et al. (2007) A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 39:1235–1244.
24. Li B, et al. (2005) Preferential occupancy of histone variant H2AZ at inactive promoters influences local histone modifications and chromatin remodeling. Proc Natl Acad Sci USA 102:18385–18390.
25. Cherry JM, et al. (1997) Genetic and physical maps of Saccharomyces cerevisiae. Nature 387(6632 Suppl):67–73.
26. Pokholok DK, et al. (2005) Genome-wide map of nucleosome acetylation and methylation in yeast. Cell 122:517–527.
27. Wang H, Mayhew D, Chen X, Johnston M, Mitra RD (2011) Calling Cards enable multiplexed identification of the genomic targets of DNA-binding proteins. Genome Res 21:748–755.
28. Wang H, Johnston M, Mitra RD (2007) Calling cards for DNA-binding proteins. Genome Res 17:1202–1209.
29. Gangadharan S, Mularoni L, Fain-Thornton J, Wheelan SJ, Craig NL (2010) DNA transposon Hermes inserts into DNA in nucleosome-free regions in vivo. Proc Natl Acad Sci USA 107:21966–21972.
30. Berry C, Hannenhalli S, Leipzig J, Bushman FD (2006) Selection of target sites for mobile DNA integration in the human genome. PLoS Comput Biol 2:e157.
31. Wang GP, Ciuffi A, Leipzig J, Berry CC, Bushman FD (2007) HIV integration site selection: Analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res 17:1186–1194.
32. Zhu Y, Zou S, Wright DA, Voytas DF (1999) Tagging chromatin with retrotransposons: Target specificity of the Saccharomyces Ty5 retrotransposon changes with the chromosomal localization of Sir3p and Sir4p. Genes Dev 13:2738–2749.
33. Boeke JD, Devine SE (1998) Yeast retrotransposons: Finding a nice quiet neighborhood. Cell 93:1087–1089.
34. Bellen HJ, et al. (2004) The BDGP gene disruption project: Single transposon insertions associated with 40% of Drosophila genes. Genetics 167:761–781.
35. Liu S, et al. (2009) Mu transposon insertion sites and meiotic recombination events co-localize with epigenetic marks for open chromatin across the maize genome. PLoS Genet 5:e1000733.
36. Ciuffi A, et al. (2006) Integration site selection by HIV-based vectors in dividing and growth-arrested IMR-90 lung fibroblasts. Mol Ther 13:366–373.
37. R Development Core Team (2008). R: A language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna). Available at http://www.R-project.org. Accessed February 2011.
38. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22.

20356 | www.pnas.org/cgi/doi/10.1073/pnas.1103665108 Baller et al.