Three classes of recurrent DNA break clusters in brain progenitors identified by 3D proximity-based break joining assay

Pei-Chi Weiᵃ,ᵇ,ᶜ,¹, Cheng-Sheng Leeᵃ,ᵇ,ᶜ,¹, Zhou Duᵃ,ᵇ,ᶜ, Bjoern Schwerᵃ,ᵇ,ᶜ,², Yuxiang Zhangᵃ,ᵇ,ᶜ, Jennifer Kaoᵃ,ᵇ,ᶜ, Jeffrey Zuritaᵃ,ᵇ,ᶜ, and Frederick W. Altᵃ,ᵇ,ᶜ,³

ᵃHoward Hughes Medical Institute, Harvard Medical School, Boston, MA 02115; ᵇProgram in Cellular and Molecular Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115; and ᶜDepartment of Genetics, Harvard Medical School, Boston, MA 02115

Contributed by Frederick W. Alt, January 9, 2018 (sent for review November 17, 2017; reviewed by Fred H. Gage and Irving L. Weissman)

We recently discovered 27 recurrent DNA double-strand break (DSB) clusters (RDCs) in mouse neural stem/progenitor cells (NSPCs). Most RDCs occurred across long, late-replicating RDC genes and were found only after mild inhibition of DNA replication. RDC genes share intriguing characteristics, including encoding surface proteins that organize brain architecture and neuronal junctions, and are genetically implicated in neuropsychiatric disorders and/or cancers. RDC identification relies on high-throughput genome-wide translocation sequencing (HTGTS), which maps recurrent DSBs based on their translocation to “bait” DSBs in specific chromosomal locations. Cellular heterogeneity in 3D genome organization allowed unequivocal identification of RDCs on 14 different chromosomes using HTGTS baits on three mouse chromosomes. Additional candidate RDCs were also implicated, however, suggesting that some RDCs were missed. To more completely identify RDCs, we exploited our finding that joining of two DSBs occurs more frequently if they lie on the same cis chromosome. Thus, we used CRISPR/Cas9 to introduce specific DSBs into each mouse chromosome in NSPCs that were used as bait for HTGTS libraries. This analysis confirmed all 27 previously identified RDCs and identified many new ones. NSPC RDCs fall into three groups based on length, organization, transcription level, and replication timing of genes within them. While mostly less robust, the largest group of newly defined RDCs shares many intriguing characteristics with the original 27. Our findings also revealed RDCs in NSPCs in the absence of induced replication stress, and support the idea that the latter treatment augments an already active endogenous process.

nonhomologous end-joining | neural stem cells | replication stress | neurodevelopment | recurrent DNA break clusters

Significance

Human brain neuron genomes can differ from one another, giving rise to brain mosaicism. We developed a sensitive DNA break joining assay that uses “bait” DNA breaks introduced on different chromosomes to detect endogenous “prey” DNA breaks across the mouse brain progenitor cell genome. This approach revealed 27 recurrently breaking sites, many of which occur in long neural-specific genes associated with mental illnesses and cancer. We have exploited the finding that bait and prey DSBs join more frequently when on the same chromosome to increase assay sensitivity. This approach confirms previously identified breaking neural genes and identifies new ones, often with the same intriguing characteristics. Our study offers potential insights into brain diversification and disease.

Author contributions: P.-C.W., C.-S.L., B.S., and F.W.A. designed research; P.-C.W., C.-S.L., and J.K. performed research; P.-C.W., C.-S.L., Z.D., and J.Z. contributed new reagents/analytic tools; P.-C.W., C.-S.L., Z.D., Y.Z., and F.W.A. analyzed data; and P.-C.W. and F.W.A. wrote the paper.
Reviewers: F.H.G., The Salk Institute for Biological Studies; and I.L.W., Stanford University.
The authors declare no conflict of interest.
Published under the PNAS license.
Data deposition: Sequencing data have been deposited in the Gene Expression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession no. GSE106822).
¹P.-C.W. and C.-S.L. contributed equally to this work.
²Present address: Department of Neurological Surgery and Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, CA 94158.
³To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1719907115/-/DCSupplemental.

NEUROSCIENCE
www.pnas.org/cgi/doi/10.1073/pnas.1719907115 PNAS | February 20, 2018 | vol. 115 | no. 8 | 1919–1924

Classical nonhomologous end-joining (C-NHEJ) is a major DNA double-strand break (DSB) repair pathway in somatic cells that was first implicated based on its requisite role in V(D)J recombination in the developing lymphocytes (1, 2). Subsequently, we found that inactivation of XRCC4, a core C-NHEJ factor (3), specifically abrogates both lymphocyte and neuronal development due to unrepaired DSBs in progenitor cells (4). Similar findings have been reported for inactivation of DNA ligase 4 (5, 6), with which XRCC4 partners in C-NHEJ end-ligation (7). The unrepaired DSBs that cause blocked lymphocyte development in C-NHEJ–deficient mice are generated in antigen receptor genes by the RAG endonuclease, the protein complex that initiates V(D)J recombination (2). The nature of the DNA breaks that cause neuronal apoptosis in the absence of XRCC4 or DNA ligase 4 has remained unresolved, however. Nonetheless, these previous studies promoted speculation that specific DSBs might play a role in brain development or disease (8, 9). In this regard, more recent studies have highlighted the potential of genomic alterations to contribute to brain diversification and disease (10, 11). Somatic “brain only” mutations and genomic variations also have been implicated in neurodevelopmental and neuropsychiatric disorders (12).

Over the past decade, we have developed and refined high-throughput genome-wide translocation sequencing (HTGTS) to identify recurrent endogenous DSBs (13–15). Application of the HTGTS approach recently allowed us to map a set of recurrently breaking genes in mouse neural stem/progenitor cells (NSPCs) (16).

HTGTS maps, at nucleotide resolution, genome-wide DSBs based on their ability to translocate to a “bait” DSB introduced at a specific chromosomal location (13–15). Bait DSBs can be either introduced ectopically by designer nucleases (14) or provided by endogenous DSBs, including RAG-initiated V(D)J recombination DSBs (17–19) or clusters of activation-induced cytidine deaminase-initiated DSBs in IgH switch (S) regions during class switch recombination (CSR) in mature B cells (20). The ability of HTGTS to identify recurrent DSBs across the genome relies on cellular heterogeneity in 3D genome organization (2); however, due to the increased potential for interaction, the joining frequency between two separate DSBs is greatly enhanced if the two lie on the same cis chromosome (2, 14, 17), and is enhanced even further if the two lie within the same topological or loop domain (2, 17, 21). Indeed, B lymphocytes exploit enhanced DSB joining within a topological domain to promote robust joining of activation-induced cytidine deaminase-initiated S region DSBs separated by hundreds of kb to effect exon shuffling during CSR (20), and developing lymphocytes exploit joining within chromosomal loops to mediate physiological V(D)J recombination (18, 19).

To identify recurrent DSB clusters (RDCs) in the NSPC genome, we applied HTGTS to NSPCs from Xrcc4−/−p53−/− mice, as this background enhances HTGTS detection of genomic DSBs due to their persistence (2). HTGTS analyses from bait DSBs on three separate chromosomes revealed 27 RDCs found by at least two of the three chromosomal baits, along with many additional “candidate” RDCs found with only a bait from one chromosome (16). Notably, all 27 RDCs occurred within genes (“RDC genes”), and these genes shared an intriguing set of characteristics, including encoding surface proteins that organize brain architecture and neuronal junctions. Moreover, human counterparts of most mouse RDC genes had already been implicated genetically in neuropsychiatric disorders and cancer (16). RDC genes also tend to be very long, moderately transcribed, and late replicating. In the latter context, most RDCs appeared only after treatment of NSPCs with aphidicolin (APH) to create mild replication stress, and even those found spontaneously were enhanced as RDCs by APH treatment (16). The common transcription and replication characteristics of RDC genes suggest that, as has been proposed for related copy number variations (CNVs) (22), collisions between RNA and DNA polymerases might play a role in recurrent RDC DSBs (23, 24).

Our finding of numerous RDC candidates based on interaction with only one of our three baits used previously suggested that we could have missed many RDCs due to lower interaction with two or more of the bait locations and/or because they are weaker RDCs. Thus, based on our finding that joining of two DSBs occurs much more frequently if they lie on the same cis chromosome (2, 14, 17), we used sgRNAs specific for each of the 20 mouse chromosomes as bait for the generation of HTGTS libraries from control or APH-treated Xrcc4−/−p53−/− mouse NSPCs. This analysis robustly confirmed the 27 previously identified RDCs and conclusively identified a substantial number of new RDCs that shared a similar spectrum of intriguing characteristics with the initial 27.

Results

Use of HTGTS Bait DSBs to Identify RDCs on cis Chromosomes. Our previous RDC identification studies used CRISPR/Cas9-induced HTGTS bait DSBs on mouse chromosomes 12, 15, and 16 (16) (Fig. 1A). To more completely identify RDCs by HTGTS, we exploited our finding that joining of two separate DSBs generally occurs much more frequently if the two lie on the same cis chromosome (2, 14, 17). Thus, for more complete coverage of the genome, we designed 17 additional sgRNAs to generate HTGTS bait DSBs on each of the remaining 16 mouse autosomes and the X chromosome (Fig. 1A and SI Appendix, Table S1). Because the rejoining of two resected DSBs in close proximity to the bait break site is the most frequently detected event in HTGTS analyses, the sgRNAs were designed to target genomic sequences that were at least 5 Mb away from known RDC genes to avoid potentially confounding effects of resection events extending into adjacent RDC genes (16). In addition, to ensure that mapping from HTGTS bait DSBs was not influenced by repetitive sequences, we selected bait genomic locales that did not contain telomeric or simple repeat sequences. To maximize RDC DSB detection efficiency, we used Xrcc4−/−p53−/− primary mouse NSPCs for the current HTGTS experiments, as deficiency for XRCC4 facilitates DSB persistence and detection of translocations (2, 16).

We used the same general approach to identify RDCs as described previously (16) (SI Appendix, Fig. S1). The 17 Cas9:sgRNA constructs were introduced individually in Xrcc4−/−p53−/− NSPCs to experimentally induce a bait DSB on one specific chromosome at a time. In each case, NSPCs were treated with either APH to induce RDCs or diluted DMSO as the vehicle control. Specific HTGTS primers were designed for each bait site (SI Appendix, Table S2), and each individual Cas9:sgRNA HTGTS experiment was repeated at least three times and analyzed as described previously (16). A substantial proportion of HTGTS junctions in all experiments resulted from the rejoining of two resected ends of Cas9:sgRNA-induced bait DSBs, which were distributed mostly within 10 kb of the break site in both APH-treated and control cells (SI Appendix, Table S3). In addition, HTGTS junctions representing joining to low-level APH-induced or endogenous DSBs not associated with either bait site resections or RDCs were enhanced within the cis chromosome via proximity-based mechanisms for each independent chromosome bait site, as expected (14, 16) (SI Appendix, Table S3).

To identify APH-induced RDCs, we applied our RDC identification pipeline (16) (SI Appendix, Methods) to HTGTS libraries generated from DNA of the DMSO- or APH-treated Xrcc4−/−p53−/− NSPCs expressing various independent bait sgRNAs (16). For these analyses, we included independent Chr-X-sgRNA libraries created in both male and female NSPCs, and we also reanalyzed our previously created Chr-12-sgRNA-1, Chr-15-sgRNA-1, and Chr-16-sgRNA-2 libraries to achieve a whole-genome view based on the full set of libraries (16) (Fig. 1A). This RDC discovery analysis revealed a total of 113 RDCs (Fig. 1A, SI Appendix, Fig. S1, and Dataset S1). We also applied a multiple-comparison correction test and confirmed that all 113 RDCs

[Figure 1 image. (A) Chromosome map, centromere (Cen) to telomere (Tel), of chromosomes 1–19 and X; legend: Chr-12, Chr-15, Chr-16 HTGTS baits; new HTGTS baits; 27 RDC-genes (ref. 16); newly identified RDCs. (B) y-axis: number of HTGTS bait locations (0–16); x-axis: number of RDCs (0–30); legend: 27 RDC-genes (ref. 16); newly identified RDCs; boxes: 11 or more baits (6): Npas3, Lsamp, Nrxn1, Ptn, Nfia, Ctnna2; 7–10 baits (21).]

Fig. 1. Identification of NSPC RDCs by a proximal DSB joining approach. (A) Map illustrating the 19 murine autosomes and the X chromosome (gray hollow bars), the 20 HTGTS bait DSB locations (arrowheads), and the 113 RDC locations. Cen, centromere; Tel, telomere. Horizontal lollipop symbols mark the locations of RDCs in the murine NSPC genome. (B) Graph showing a total of 113 RDCs either identified previously by three HTGTS bait DSBs located at chromosomes 12, 15, and 16 (green dots) or newly identified by at least two of the 20 HTGTS bait DSBs (black dots) as indicated in A. The y-axis indicates the number of different HTGTS baits significantly joined to DSBs in each RDC; the x-axis indicates the number of RDCs. The genes within the six most frequently identified RDCs are listed in the orange box, and RDCs identified by seven to ten chromosomal baits are indicated in the blue box. The numbers of RDCs in the orange and blue boxes are indicated. The robustness scores for the RDCs are provided in Dataset S1 and SI Appendix, Fig. S3B.
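The quantity plotted on the y-axis of Fig. 1B (the number of distinct baits that detect each RDC) and the stated criterion that an RDC be identified by at least two independent baits can be illustrated with a minimal sketch. The input format, the function names, and the cluster labels are hypothetical and chosen only for illustration; the toy records are not data from the study, and the actual identification pipeline is the one described in SI Appendix, Methods.

```python
from collections import defaultdict

MIN_BAITS = 2  # criterion stated in the text: at least two independent baits


def baits_per_cluster(calls):
    """Map each candidate cluster to the set of bait chromosomes that detected it.

    `calls` is an iterable of (cluster_name, bait_chromosome) pairs.
    This input format is an assumption made for illustration.
    """
    baits = defaultdict(set)
    for cluster, bait_chrom in calls:
        baits[cluster].add(bait_chrom)
    return dict(baits)


def qualifying_rdcs(calls, min_baits=MIN_BAITS):
    """Return {cluster: number of distinct baits} for clusters that meet the criterion."""
    return {
        cluster: len(chroms)
        for cluster, chroms in baits_per_cluster(calls).items()
        if len(chroms) >= min_baits
    }


# Toy input, not data from the paper.
toy_calls = [
    ("clusterA", "chr12"), ("clusterA", "chr15"), ("clusterA", "chr16"),
    ("clusterB", "chr12"), ("clusterB", "chr12"),  # the same bait twice counts once
    ("clusterC", "chr1"), ("clusterC", "chr12"),
]
result = qualifying_rdcs(toy_calls)
print(result)
```

In this toy input, clusterB is detected twice but only by one bait chromosome, so it would remain a candidate rather than qualify as an RDC.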

were free from false-positive calls (SI Appendix, Methods and Dataset S1). These 113 RDCs included the previously described 27 RDC genes identified by at least two of three chromosome 12, 15, and 16 HTGTS baits, along with 58 previously identified RDC candidates identified by only one of the three baits (16) (Dataset S1) and 28 additional RDCs. These 113 NSPC RDCs were distributed to all autosomes as well as to the X chromosome (Fig. 1A and Dataset S1). In this regard, we found RDCs on chromosomes not previously identified as RDC-containing, including chromosomes 2, 3, 7, 11, 13, and X (16) (Fig. 1A). We did not assay a bait from the Y chromosome, because no RDCs on it were identified by baits on any other chromosome; thus, even candidate RDCs on it would not qualify as RDCs based on our current criterion of being identified by at least two independent baits. As for previously identified RDC genes, DSBs in newly identified RDC genes tended to map across the length of the RDC gene transcription units (Figs. 2 and 3; data available in the GEO database, accession no. GSE106822).

RDC DSBs Translocate to Recurrent Bait DSBs Genome-Wide. Neuronal PAS domain protein 3 (Npas3) and limbic system-associated membrane protein (Lsamp) were the first identified RDC genes, because they qualified as RDCs even in NSPCs not treated with APH (16). Notably, Npas3 and Lsamp were also the genes most frequently detected by different members of the set of chromosome-specific baits (Figs. 1B and 2 A and B). Indeed, both genes were detected as RDCs from bait DSBs on 15 different chromosomes (Fig. 1B; Npas3 complete example shown in Fig. 2 A and B). For Lsamp, this number does not include chromosome 16, on which it lies, because the bait DSB was too close to Lsamp (∼600 kb upstream) to be called as a separate RDC by the SICER program. At the other extreme, the least frequently detected RDCs, including the previously identified diacylglycerol kinase beta (Dgkb) and oxidation resistance 1 (Oxr1) genes, were identified only by baits from one other chromosome besides their host chromosomes (Fig. 1B, Dgkb example shown in Fig. 2 C and D). Neurexin 1 (Nrxn1) and pleiotrophin (Ptn) also were robustly identified, being found by 13 different bait chromosome DSBs, as were nuclear factor I/A (Nfia) and catenin alpha 2 (Ctnna2), which were identified by 11 chromosomal bait DSBs (Dataset S1 and Fig. 1B). In the majority of cases that could be examined, RDCs were detected most robustly from HTGTS bait DSBs on the same chromosome;

[Figure 2 image. (A) Chr12-sgRNA-1 bait joining to Npas3 on chromosome 12 (Cen to Tel), with HTGTS junction tracks for APH-treated (+) and control (−) cells. (B) Trans-chromosomal baits (Chr1–Chr11, Chr13–Chr19, ChrX) joining to Npas3, with APH (+) and control (−) tracks; window chr12:53,483,879-56,038,947. (C) Chr12-sgRNA-1 bait joining to Dgkb on chromosome 12 (Cen to Tel). (D) Trans-chromosomal baits (Chr1–Chr11, Chr13–Chr19, ChrX) joining to Dgkb, with APH (+) and control (−) tracks; window chr12:37,706,111-40,261,179.]

Fig. 2. Joining frequency of genome-wide HTGTS bait DSBs to strong and weak RDC DSBs. (A, Upper) An HTGTS bait DSB (orange box) induced by Cas9:sgRNA (Chr-12-sgRNA-1, black arrowhead) joining to the prey DSBs (blue box) in the Npas3 RDC ∼40 Mb downstream of the bait DSB on chromosome 12. Cen, centromere; Tel, telomere. The green arrowhead indicates the HTGTS primer; dashed line/arrows indicate joining possibilities between the bait DSB and RDC DSBs. (A, Lower) The HTGTS prey junctions (black bars) distributed across the Npas3 gene and its surrounding genomic area on chromosome 12 in the APH-treated (+) or control (−) Xrcc4−/−p53−/− NSPCs. Yellow rectangles indicate overall RDC locations; RefGene (blue track) indicates the gene location. A total of 17,701 randomly selected HTGTS prey junctions from APH-treated or control experiments were plotted. (B, Upper) The joining between transchromosomal Cas9:sgRNA-induced HTGTS bait DSBs and the Npas3 RDC DSBs on chromosome 12. (B, Lower) The HTGTS prey junctions distributed across the Npas3 gene and its surrounding area. Each panel represents an independent experiment using a bait on the indicated chromosome. The locations of RDCs identified by each chromosomal bait are indicated with yellow boxes and generally represent a subset of the longest RDCs identified. (C, Upper) The joining between the HTGTS bait DSB induced by Cas9:sgRNA (Chr-12-sgRNA-1) and the Dgkb RDC DSBs ∼15 Mb downstream of the bait DSB. (C, Lower) Graph showing HTGTS prey junctions distributed across and surrounding the Dgkb gene. (D, Upper) Graph illustrating the joining between a transchromosomal Cas9:sgRNA-induced HTGTS bait DSB and the Dgkb RDC DSBs at chromosome 12. (D, Lower) HTGTS prey junctions distributed in and around the Dgkb gene. Panels are organized as described for B. (Scale bars: 1 Mb.) *Panels generated using previously published HTGTS datasets (16). Dataset S1 presents the MACS-based adjusted P values of RDCs.
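The bait-to-RDC distances quoted in this legend, and the design rule in Results that sgRNA targets lie at least 5 Mb from known RDC genes, both come down to the distance between a point and a genomic interval. The following is a minimal sketch of such a check, not the procedure the authors used; the function names, the tuple layout, and all coordinates are hypothetical.

```python
MIN_DISTANCE_BP = 5_000_000  # design rule stated in the text: at least 5 Mb


def distance_to_interval(position, start, end):
    """Distance in bp from a point to the closed interval [start, end]; 0 if inside."""
    if position < start:
        return start - position
    if position > end:
        return position - end
    return 0


def bait_is_acceptable(bait_chrom, bait_pos, genes, min_distance=MIN_DISTANCE_BP):
    """True if the bait lies at least `min_distance` bp from every listed gene
    on the same chromosome. `genes` holds (chrom, start, end) tuples; genes on
    other chromosomes are ignored."""
    return all(
        distance_to_interval(bait_pos, start, end) >= min_distance
        for chrom, start, end in genes
        if chrom == bait_chrom
    )


# Hypothetical coordinates, not taken from the paper.
toy_genes = [("chrA", 50_000_000, 51_000_000), ("chrB", 10_000_000, 12_000_000)]
far_bait = bait_is_acceptable("chrA", 10_000_000, toy_genes)   # 40 Mb from the chrA gene
near_bait = bait_is_acceptable("chrA", 47_000_000, toy_genes)  # 3 Mb from the chrA gene
print(far_bait, near_bait)
```

A real design step would also need the repeat-sequence filter mentioned in Results, which this sketch omits.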

representative examples for Npas3, Dgkb, and mastermind-like transcriptional coactivator 2 (Maml2)/myotubularin-related protein 2 (Mtmr2) are shown in Figs. 2 and 3B and SI Appendix, Fig. S2A, respectively.

The six most frequently detected RDCs (found by 11 or more chromosomal-specific baits) were identified in our previous studies with baits from only three chromosomes (16) (Fig. 1B, orange box). On the other hand, newly confirmed RDCs included one of three RDCs detected by 10 different chromosome baits (glutamate receptor, ionotropic, delta 2; Grid2), one of three detected with nine different chromosome baits (Maml2/Mtmr2), one of four found with eight chromosome baits (transcription factor 4; Tcf4), and five of 11 detected with seven chromosome baits [autism-susceptibility candidate 2 (Auts2), neuroligin 2 (Nlgn2), low-density lipoprotein-related protein 1B (Lrp1b), semaphorin 6D (Sema6d), and Quaking/Parkin] (Fig. 1B, blue box). Overall, it is notable that 19 of the known 27 RDC genes were in the group found with at least seven different baits, while only eight of the 86 newly detected RDCs were found in this group (Fig. 1B, blue and orange boxes). Indeed, more than one-half of the newly identified RDCs were found with only their host chromosome bait and one or two others (Fig. 1B).

Identification of Additional Spontaneous RDCs by the Intrachromosomal Bait Approach. The use of baits on each chromosome to detect RDCs in cis greatly enhanced the detection efficiency of weaker RDCs. To assay for additional “spontaneous” RDCs that are detectable in the absence of APH treatment, we analyzed the HTGTS libraries of DMSO-treated Xrcc4−/−p53−/− NSPCs through a SICER-based approach designed to detect RDCs in the nontreated NSPC genome (16). In most cases, genomic regions detected by this method were those that contained the sgRNA off-target (OT) sites (SI Appendix, Table S4). Nevertheless, we found two genomic regions that did not contain sgRNA OT sites and were significantly enriched with HTGTS junctions (Fig. 3). A SICER-called cluster in HTGTS libraries that used the chromosome 6 bait occurred within the Ctnna2 gene on chromosome 6, which we previously found as an RDC gene only after APH treatment (16) (Fig. 3A). In this analysis, the chromosome 6 bait DSB was introduced ∼5 Mb upstream of Ctnna2. The other spontaneous RDC detected was within the Maml2/Mtmr2 gene cluster (Fig. 3B). The power of the chromosome-specific bait approach is also evident from the finding that Maml2/Mtmr2 barely reached significance after APH treatment in libraries from chromosome 4, 12, 13, 18, and 19 baits (SI Appendix, Fig. S2A). Finally, we note that, as for Lsamp and Npas3 (16), the HTGTS junction density across the Ctnna2 and Maml2/Mtmr2 RDC genes was further enhanced after APH treatment (Fig. 3 A and B).

[Figure 3 image. (A) Chr-6-sgRNA bait joining to Ctnna2 (Cen to Tel), with HTGTS tracks for control (−) and APH-treated (+) cells and GRO-seq tracks; window chr6:75,775,341-78,985,951. (B) Chr-9-sgRNA bait joining to Maml2/Mtmr2 (Cen to Tel), with HTGTS tracks for control (−) and APH-treated (+) cells and GRO-seq tracks; window chr9:12,777,719-14,109,875. Scale bars: 0.5 Mb and 0.2 Mb.]

Fig. 3. Proximal intrachromosomal HTGTS bait DSBs facilitate spontaneous RDC identification. (A, Upper) The joining between an HTGTS bait DSB and Ctnna2 RDC DSBs located ∼5 Mb downstream of the bait DSB on chromosome 6. The figure is organized as described in Fig. 2A. (A, Lower) The HTGTS junction distribution across the Ctnna2 gene and its surrounding area in Xrcc4−/−p53−/− NSPCs. RDC areas are shaded in yellow. Ctnna2 RDC HTGTS junctions are significantly enriched in both the DMSO-treated control (P = 4.5 × 10⁻²) and APH-treated experiments (P = 7.0 × 10⁻⁷⁵). Transcription activity of the Ctnna2 gene and its surrounding genomic DNA by GRO-seq is shown in the centromeric-to-telomeric direction (blue) and the telomeric-to-centromeric direction (red). The scale indicates normalized GRO-seq counts [reads per kilobase per million (RPKM)]. (B) Joining of HTGTS bait DSBs to Maml2/Mtmr2 RDC DSBs located ∼12 Mb upstream on chromosome 9. The figure is organized as described for panel A. Maml2/Mtmr2 RDCs are significant in the control (P = 4.5 × 10⁻²) and APH-treated experiments (P = 2.17 × 10⁻²³).

Characteristics of Newly Identified RDCs. All of the previously identified 27 RDCs were contained within a single gene, which in most cases was very long (16). Notably, all of the newly identified RDCs were within genes or gene clusters (Dataset S1 and Fig. 4A). We also tested RDC robustness based on a new RDC robustness test (SI Appendix, Methods). This analysis revealed that, while we identified many new RDCs, most of the previously reported RDCs that were discovered with just three separate bait DSBs (16) are among the most robust RDCs (SI Appendix, Fig. S3). We classified newly identified RDCs into three groups according to RDC DSB distribution. Group 1 RDCs occur within one, usually long, gene; this category has 76 members, including most of our previously identified RDCs (Fig. 4A and Dataset S1; examples shown in Figs. 2 and 3A). While group 1 RDCs overall show a wide range in robustness, approximately 80% of the most robust RDCs are in group 1 (SI Appendix, Fig. S3 B and C). Group 2 RDCs contain multiple genes, with at least one gene >80 kb long; this category contains 34 members with varying degrees of robustness, including five that fall into the most robust RDC category (Figs. 3B and 4A; SI Appendix, Figs. S2 A and B and S3B and Table S5; and Dataset S1). Group 3 RDCs include a cluster of multiple small (<20 kb) genes (examples in SI Appendix, Fig. S2 C and D); this category contains three members, one of which is robust, and comprises 24 genes and two long noncoding RNA (lncRNA) sequences (Fig. 4A; SI Appendix, Figs. S2D and S3 and Table S5; and Dataset S1).

The lengths of group 1 and 2 RDCs are comparable (mean, 1.07 ± 0.09 and 1.10 ± 0.13 Mb, respectively), while group 3 RDCs are much shorter (mean, 0.22 ± 0.05 Mb). The genes within the new group 1 RDCs are very long, comparable in length to most of the RDC genes described previously (16) (Fig. 4B). However, the genes in the group 2 and group 3 RDCs are significantly shorter than those in group 1 RDCs (Fig. 4B). We also extracted transcription rate information for newly identified RDCs from our existing Xrcc4−/−p53−/− NSPC GRO-seq data (16). We found that the overall transcription rates of group 1 and 2 RDC genes are not significantly different, but the transcription rate of group 3 RDCs is significantly greater than that of group 1 (Fig. 4C). Based on existing murine neural progenitor replication timing data (25), the majority of group 1 RDCs replicate late (Fig. 4D), whereas the majority of both group 2 and group 3 RDCs replicate early (Fig. 4 E and F).

To further compare the newly identified RDC genes with the 27 previously identified RDC genes, we performed gene function and disease association analyses based on the published literature (PubMed, OMIM at the National Center for Biotechnology Information website). Ten of the 51 newly identified group 1 RDCs (19.6%) harbor genes that encode cell membrane proteins that are adhesion molecules, 18 (35.3%) harbor genes implicated in synaptic functions, and 23 (45.1%) harbor genes implicated in neurogenesis (Fig. 4G and SI Appendix, Table S5). In addition, nearly all of the genes within the newly identified group 1 RDCs have been linked to diseases in mice, humans, or both, including neuropsychiatric and developmental disorders (36 of 51; 70.6%) and cancer (35 of 51; 68.6%) (Fig. 4H and SI Appendix, Table S5). Compared with the genes in group 1 RDCs, the genes in group 2 RDCs showed fewer neuronal functional correlations (16.9% implicated in neurogenesis, 22.5% implicated in synaptic function or neuronal plasticity, and 4.2% implicated as adhesion molecules), and also were less closely associated with neuropsychiatric disorders (39.4%) or cancer (38.0%) (SI Appendix, Table S5). Finally, only one of 24 small genes (Mef2d) and one of two lncRNA loci found within the group 3 clusters (Malat1) have been implicated in synaptic function (7.7%; SI Appendix, Table S5), and only two genes and two lncRNA loci (Mef2d and Cct3 and Malat1 and Neat1, respectively; SI Appendix, Table S5) have been associated with cancer. Indeed, the majority of the genes

[Figure 4 image. (A) Venn diagram: 27 RDC-genes, Group 1 (76), Group 2 (34), Group 3 (3). (B) Box plot, y-axis Log2 [kb]: 27 RDC-genes, New Group 1 (51), New Group 2 (71), Group 3 (24). (C) Box plot, y-axis Log2 [RPKM], same gene sets as B. (D) Replication timing ratio, log2[early/late], for group 1 genes: Fhit, Tcf4, Utrn, Tcf12, Tnik, Rora, Dmd, Nrg3, Vav3, Sox5, Nav3, Nav2, Dpyd, Ptprg, Dpp6, Gphn, Ptprd, Grid2, Fgf14, Prkce, Lrp1b, Nlgn1, Auts2, Rev3l, Ptprm, Erbb4, Cntn4, Creb5, Exoc4, Lphn3, Lrrc4c, Pcdh9, Agap1, Hdac9, Slc4a4, Inpp4b, Slc1a3, Zbtb20, Ccser1, Dlgap1, Nkain2, Rbms3, Il1rapl1, Srgap2, Nckap5, Pde10a, Kcnma1, Sema6d, Pcdh11x, Macrod2, Naaladl2. (E and F) Replication timing ratio, log2[early/late], for RDCs 091, 101, 078, 098, 100, 096, 086, 106, 095, 105, 107, 097, 029, 041, 049, 031, 054, 071, 024, 023, 013, 002, 042, 070, 026, 016, 036, 066, 005, 015, 025, 055, 065, 075, 072, 058, 064. (G) Venn diagram: Cell adhesion (10); Neurogenesis (23); Synapse, synaptogenesis, plasticity (18). (H) Venn diagram: Neuropsychiatric and development disorder (36); Cancer (35).]

Fig. 4. Characteristics and functional classification of murine NSPC RDCs and their relevance to diseases. (A) Venn diagram of the indicated classes among the 113 RDCs, including 27 previously identified RDC genes (blue); 76 group 1 RDCs (pink), including 25 previously identified RDC genes; 34 group 2 RDCs (gray), including two previously identified RDC genes; and three group 3 RDCs (yellow). Additional group 1, 2, and 3 examples are shown in Figs. 2 and 3 and SI Appendix, Fig. S2. For viewing additional RDC junction distributions, all datasets are available in the GEO database (accession no. GSE106822). (B and C) Length (B) and transcription rate determined by GRO-seq (RPKM) (C) of the 27 RDC genes (blue), genes in newly identified group 1 RDCs (pink), genes >80 kb in newly identified group 2 RDCs (gray), and all genes in group 3 RDCs (orange). The number of genes analyzed in each group is indicated. Whiskers indicate minimum and maximum values; the bottom and top edges of the boxes correspond to the 25th and 75th percentiles, respectively; and the horizontal line indicates median values. *P < 0.05; ****P < 0.0001 (Mann–Whitney U test); n.s., P ≥ 0.05. (D–F) Timing of replication of the newly identified group 1 RDC genes (D), group 2 RDCs (E), and group 3 RDCs (F). Average and SEM are shown. Details are provided in Materials and Methods and SI Appendix, Materials and Methods. Green, early; blue, late. The corresponding locations of each indicated RDC are provided in Dataset S1. (G and H) Venn diagram of the indicated gene function (G) and link to diseases (H) among the 51 newly identified group 1 RDCs. Details are provided in SI Appendix, Table S5.
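The three-group scheme summarized in panel A, and the counts and percentages reported for it, can be cross-checked with a short sketch. The classifier below is a simplified reading of the definitions in Results (one gene; multiple genes with at least one >80 kb; multiple genes all <20 kb) and is an assumption for illustration, since the authors classified RDCs according to DSB distribution; the function names and example gene lengths are hypothetical. The zero for previously identified group 3 RDCs is inferred from 25 + 2 = 27.

```python
def classify_rdc(gene_lengths_kb):
    """Assign an RDC to group 1, 2, or 3 from the lengths (kb) of the genes it spans.

    Simplified reading of the text: a single gene -> group 1; multiple genes
    with at least one > 80 kb -> group 2; multiple genes all < 20 kb -> group 3.
    Cases the text does not describe return None.
    """
    if not gene_lengths_kb:
        return None
    if len(gene_lengths_kb) == 1:
        return 1
    if any(length > 80 for length in gene_lengths_kb):
        return 2
    if all(length < 20 for length in gene_lengths_kb):
        return 3
    return None


# Counts reported in the text and in this legend.
GROUP_TOTALS = {1: 76, 2: 34, 3: 3}
PREVIOUSLY_IDENTIFIED = {1: 25, 2: 2, 3: 0}  # group 3 value inferred, see lead-in

new_counts = {g: GROUP_TOTALS[g] - PREVIOUSLY_IDENTIFIED[g] for g in GROUP_TOTALS}
total_rdcs = sum(GROUP_TOTALS.values())
total_new = sum(new_counts.values())


def percent(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


# Fractions of the 51 newly identified group 1 RDCs reported in Results.
group1_percentages = {
    "cell adhesion": percent(10, 51),
    "synaptic function": percent(18, 51),
    "neurogenesis": percent(23, 51),
    "neuropsychiatric/developmental": percent(36, 51),
    "cancer": percent(35, 51),
}
print(total_rdcs, total_new, new_counts, group1_percentages)
```

The recomputed values agree with the figures quoted in the text (113 RDCs in total, 86 new, 51 new group 1 RDCs, and the five reported percentages).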

within group 3 RDCs function in general cellular processes (SI Appendix, Table S5).

In summary, the new group 1 RDCs harbor genes that generally share most of the characteristics of the previously discovered RDC genes (16), while a much smaller fraction of genes in group 2 RDCs share common characteristics with group 1 RDC genes, including being found most frequently in neuropsychiatric diseases and/or cancer. Genes in group 3 RDCs appear to be a functionally quite distinct class.

Discussion

By generating HTGTS libraries from control or APH-treated mouse Xrcc4−/−p53−/− NSPCs in which bait DSBs were introduced separately on 20 different mouse chromosomes, we confirmed 27 previously identified RDCs and identified 86 new ones. All but eight RDCs were most robustly identified by bait DSBs on their host chromosome. The exceptions were RDCs lying too close to the bait DSB (Lsamp, Prkg1, Magi2, Macrod2, Auts2, and RDC-106) or too close to a separate robust RDC (Zbtb20 and RDC-072) to allow unequivocal identification by our strict pipeline. According to our RDC robustness estimation (Fig. 3B), 19 of the 31 most robust RDCs (which all had an RDC robustness score >50) were those that we previously identified using bait DSBs on just three chromosomes (Fig. 1B and SI Appendix, Fig. S3B). These findings strongly support the idea that cellular heterogeneity in genomic 3D proximity greatly facilitates the joining of HTGTS bait DSBs to most other classes of robust recurrent endogenous DSBs or recurrent DSB clusters genome-wide (2). However, while our new approach using baits on all chromosomes allows us to detect many new lower-level RDCs, including different classes of RDCs, it is possible that some RDCs are less prone to translocate to recurrent DSBs even in distant domains on the same chromosome, due to organizational or mechanistic features of their DSBs or their repair (18). In this regard, it may be possible to further reveal such putative additional RDCs by introducing bait DSBs within the topologic domain in which they lie or by using endogenous RDC DSB clusters within a domain of candidate RDCs as bait, as we have shown for V(D)J recombination and CSR DSBs (18–20).

Based on our RDC robustness test (SI Appendix, Fig. S3), two of the six most robust RDCs (the Ctnna2 and Maml2/Mtmr2 genes) contain RDCs that are identifiable in the absence of replication stress, similar to the most robust Lsamp and Npas3 RDC genes (16) (SI Appendix, Fig. S3). These findings support the possibility that at least some RDCs have an intrinsic fragility augmented by induced replication stress (16). In this regard, replication stress induced by APH or hydroxyurea leads to CNVs, which in some cases correspond to known common fragile sites that often contain very large transcribed and late-replicating gene units (22, 26). In many cases, these late-replicating gene units correspond to the previously described RDCs (22). Based on our new dataset, such CNVs were found in 33 of 76 group 1 RDCs, 10 of 34 group 2 RDCs, and one of three group 3 RDCs in APH-treated mouse embryonic stem cells in which most of these overlapping RDC genes were actively transcribed (22). Similarly, CNVs also corresponded to 21 of the 76 group 1 and nine of the 34 group 2 murine NSPC RDC human orthologs in human fibroblasts treated with APH or hydroxyurea (22). Taken together, the foregoing findings suggest mechanistic overlaps between NSPC RDCs and CNVs occurring in other cell types in which particular RDC genes are transcribed. For group 1 RDCs and related CNVs, late S-phase transcription/replication collisions or the entry into mitotic phase with collapsed replication forks could lead to DSBs or other mechanisms of fragility (26). On the other hand, group 2 and group 3 RDCs harbor genes that are shorter and mostly early replicating and have higher transcription rates, as exemplified by the robust group 2 RDC gene Ptn (16). More than 600 early replicating fragile sites (ERFSs) have been found to occur in response to replication stress induced by hydroxyurea in B lymphocytes and are also proposed to result from conflicts between transcription and replication (27). We found that 13 of 76 group 1, eight of 34 group 2, and one of three group 3 APH-treated NSPC RDCs overlap with the B lymphocyte ERFSs. More studies are needed to determine potential relationships between certain NSPC RDCs and ERFSs.

Many newly identified group 1 RDCs also lie within long genes encoding proteins that regulate synaptic function/cell adhesion and/or have been associated with neuropsychiatric disorders or cancer (Fig. 4 G and H). The frequency with which DSBs can be captured via translocation across the robust Lsamp RDC in NSPCs has been estimated to roughly approach the same order of magnitude as that of IgH switch region breaks in activated B lymphocytes (16). Because group 1 RDC genes typically have long introns interspersed with small exons, most of their RDC DSBs occur within introns (16) (Figs. 2 and 3). For RDC genes lying within a specific topologic domain, two RDC gene DSBs would mostly be either rejoined or joined to other DSBs within different introns of the same RDC gene, the latter of which could functionally alter encoded proteins (28). In this regard, while many RDC genes are thought to produce numerous protein isoforms via differential RNA processing (29, 30), such plasticity conceivably could be augmented or “hardwired” by intragenic rearrangements (28). In this context, 30 RDCs, including the previously described nine RDCs (16), are located in CNV regions found in single human neurons (10) (SI Appendix, Fig. S4). However, due to the large size of currently characterized human neuron CNVs (∼5–15 Mb), the significance of this potential overlap awaits a higher-resolution map of single neuron genomic sequences. Regardless of any developmental role, RDC gene breakage and joining still might contribute to genetic variations associated with neuropsychiatric diseases and cancer (9, 16). Finally, to extend our findings to the potential impact of RDC DSBs on neural development and/or neural diseases, it will be necessary to assay for RDC formation during neural differentiation and to assess potential impacts of endogenous or induced replication stresses on RDC formation in vivo.

Materials and Methods

Primary NSPC Isolation, Culture, and HTGTS Bait DSB Induction. Primary Xrcc4−/−p53−/− NSPCs were prepared as described previously (16, 31). All related animal procedures were performed under protocol 14–10-2790R approved by the Institutional Animal Care and Use Committee of Boston Children’s Hospital. Details are provided in SI Appendix, Materials and Methods.

HTGTS. Libraries were prepared as described previously (15, 16) and sequenced (Illumina MiSeq). Reads from demultiplexed FASTQ files were aligned to the genome build mm9/NCBI37 through Bowtie2 and processed through the HTGTS pipeline (15). In each library, only unique junctions were preserved for the RDC identification. SI Appendix, Table S3 lists the number of junctions for each experiment.

RDC Identification. A SICER-based, unbiased, genome-wide method and a MACS-based method were both applied to identify APH-induced and spontaneous RDCs as described previously (16) and now further modified. A new method for evaluating relative RDC robustness was used as well. Details are provided in SI Appendix, Materials and Methods.

Replication Timing Analysis. The median replication timing ratio of genomic regions was analyzed using Repli-chip datasets from murine 46C, TT2, and D3 ES cell-derived neural progenitor cells (25) by a custom Python script as described previously (16).

ACKNOWLEDGMENTS. We thank members of the F.W.A. laboratory for stimulating discussions. This work in the F.W.A. laboratory was supported by the Boston Children’s Hospital Department of Medicine, the Harvard Brain Initiative Collaborative Seed Fund, and the Howard Hughes Medical Institute. P.-C.W. is supported by Charles A. King Trust Postdoctoral Research Fellowship Program, Bank of America, co-trustees. C.-S.L. is supported by a Cancer Research Institute Irvington Postdoctoral Fellowship.

Wei et al. PNAS | February 20, 2018 | vol. 115 | no. 8 | 1923
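The replication timing analysis in Materials and Methods reports a median log2(early/late) ratio per genomic region. The following is a minimal sketch of that kind of calculation, not the authors' script: the probe layout `(chrom, position, log2_ratio)`, the function names `median_timing_ratio` and `timing_label`, the sign-based early/late call, and all toy values are assumptions made for illustration only.

```python
from statistics import median


def median_timing_ratio(probes, chrom, start, end):
    """Median log2(early/late) of probes inside [start, end) on chrom.

    probes: iterable of (chrom, position, log2_ratio) tuples.
    This layout is hypothetical, not the Repli-chip file format.
    Returns None when no probe falls inside the region.
    """
    values = [
        ratio
        for probe_chrom, position, ratio in probes
        if probe_chrom == chrom and start <= position < end
    ]
    return median(values) if values else None


def timing_label(ratio):
    """Assumed convention: positive ratio is early, otherwise late."""
    if ratio is None:
        return "no data"
    return "early" if ratio > 0 else "late"


# Toy probe values, invented for illustration (not data from the paper).
probes = [
    ("chr1", 1_000_000, -1.2),
    ("chr1", 1_500_000, -0.8),
    ("chr1", 2_000_000, -1.0),
    ("chr2", 5_000_000, 0.9),
    ("chr2", 5_100_000, 1.1),
]

late_region = median_timing_ratio(probes, "chr1", 900_000, 2_100_000)
early_region = median_timing_ratio(probes, "chr2", 4_900_000, 5_200_000)
print(timing_label(late_region), timing_label(early_region))
```

Using the median rather than the mean keeps a single outlying probe from dominating the value for a long region, which is one plausible reason to summarize regions this way.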

1. Taccioli GE, et al. (1993) Impairment of V(D)J recombination in double-strand break repair mutants. Science 260:207–210.
2. Alt FW, Zhang Y, Meng FL, Guo C, Schwer B (2013) Mechanisms of programmed DNA lesions and genomic instability in the immune system. Cell 152:417–429.
3. Li Z, et al. (1995) The XRCC4 gene encodes a novel protein involved in DNA double-strand break repair and V(D)J recombination. Cell 83:1079–1089.
4. Gao Y, et al. (1998) A critical role for DNA end-joining proteins in both lymphogenesis and neurogenesis. Cell 95:891–902.
5. Barnes DE, Stamp G, Rosewell I, Denzel A, Lindahl T (1998) Targeted disruption of the gene encoding DNA ligase IV leads to lethality in embryonic mice. Curr Biol 8:1395–1398.
6. Frank KM, et al. (2000) DNA ligase IV deficiency in mice leads to defective neurogenesis and embryonic lethality via the p53 pathway. Mol Cell 5:993–1002.
7. Lieber MR (2010) The mechanism of double-strand DNA break repair by the nonhomologous DNA end-joining pathway. Annu Rev Biochem 79:181–211.
8. Gilmore EC, Nowakowski RS, Caviness VS, Jr, Herrup K (2000) Cell birth, cell death, cell diversity and DNA breaks: How do they all fit together? Trends Neurosci 23:100–105.
9. Weissman IL, Gage FH (2016) A mechanism for somatic brain mosaicism. Cell 164:593–595.
10. McConnell MJ, et al. (2013) Mosaic copy number variation in human neurons. Science 342:632–637.
11. Poduri A, Evrony GD, Cai X, Walsh CA (2013) Somatic mutation, genomic variation, and neurological disease. Science 341:1237758.
12. McConnell MJ, et al.; Brain Somatic Mosaicism Network (2017) Intersection of diverse neuronal genomes and neuropsychiatric disease: The brain somatic mosaicism network. Science 356:eaal1641.
13. Chiarle R, et al. (2011) Genome-wide translocation sequencing reveals mechanisms of chromosome breaks and rearrangements in B cells. Cell 147:107–119.
14. Frock RL, et al. (2015) Genome-wide detection of DNA double-stranded breaks induced by engineered nucleases. Nat Biotechnol 33:179–186.
15. Hu J, et al. (2016) Detecting DNA double-stranded breaks in mammalian genomes by linear amplification-mediated high-throughput genome-wide translocation sequencing. Nat Protoc 11:853–871.
16. Wei PC, et al. (2016) Long neural genes harbor recurrent DNA break clusters in neural stem/progenitor cells. Cell 164:644–655.
17. Zhang Y, et al. (2012) Spatial organization of the mouse genome and its role in recurrent chromosomal translocations. Cell 148:908–921.
18. Hu J, et al. (2015) Chromosomal loop domains direct the recombination of antigen receptor genes. Cell 163:947–959.
19. Zhao L, et al. (2016) Orientation-specific RAG activity in chromosomal loop domains contributes to Tcrd V(D)J recombination during T cell development. J Exp Med 213:1921–1936.
20. Dong J, et al. (2015) Orientation-specific joining of AID-initiated DNA breaks promotes antibody class switching. Nature 525:134–139.
21. Zarrin AA, et al. (2007) Antibody class switching mediated by yeast endonuclease-generated DNA breaks. Science 315:377–381.
22. Wilson TE, et al. (2015) Large transcription units unify copy number variants and common fragile sites arising under replication stress. Genome Res 25:189–200.
23. Helmrich A, Ballarino M, Tora L (2011) Collisions between replication and transcription complexes cause common fragile site instability at the longest human genes. Mol Cell 44:966–977.
24. Aguilera A, García-Muse T (2013) Causes of genome instability. Annu Rev Genet 47:1–32.
25. Hiratani I, et al. (2008) Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol 6:e245.
26. Glover TW, Wilson TE, Arlt MF (2017) Fragile sites in cancer: More than meets the eye. Nat Rev Cancer 17:489–501.
27. Barlow JH, et al. (2013) Identification of early replicating fragile sites that contribute to genome instability. Cell 152:620–632.
28. Alt FW, Wei PC, Schwer B (2017) Recurrently breaking genes in neural progenitors: Potential roles of DNA breaks in neuronal function, degeneration and cancer. Genome Editing in Neurosciences, eds Jaenisch R, Zhang F, Gage F (Springer, Basel), pp 63–72.
29. Schreiner D, et al. (2014) Targeted combinatorial alternative splicing generates brain region-specific repertoires of neurexins. Neuron 84:386–398.
30. Treutlein B, Gokce O, Quake SR, Südhof TC (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. Proc Natl Acad Sci USA 111:E1291–E1299.
31. Schwer B, et al. (2016) Transcription-associated processes cause DNA double-strand breaks and translocations in neural stem/progenitor cells. Proc Natl Acad Sci USA 113:2258–2263.

1924 | www.pnas.org/cgi/doi/10.1073/pnas.1719907115 Wei et al.