The Journal of Immunology

Multiple Transcription Factor Binding Sites Predict AID Targeting in Non-Ig Genes

Jamie L. Duke,* Man Liu,†,1 Gur Yaari,‡ Ashraf M. Khalil,§ Mary M. Tomayko,¶ Mark J. Shlomchik,†,§ David G. Schatz,†,‖ and Steven H. Kleinstein*,‡

Aberrant targeting of the enzyme activation-induced cytidine deaminase (AID) results in the accumulation of somatic mutations in ∼25% of expressed genes in germinal center B cells. Observations in Ung−/− Msh2−/− mice suggest that many other genes efficiently repair AID-induced lesions, so that up to 45% of genes may actually be targeted by AID. It is important to understand the mechanisms that recruit AID to certain genes, because this mistargeting represents an important risk for genome instability. We hypothesize that several mechanisms combine to target AID to each locus. To resolve which mechanisms affect AID targeting, we analyzed 7.3 Mb of sequence data, along with the regulatory context, from 83 genes in Ung−/− Msh2−/− mice to identify common properties of AID targets. This analysis identifies three transcription factor binding sites (E-box motifs, along with YY1 and C/EBP-β binding sites) that may work together to recruit AID. Based on previous knowledge and these newly discovered features, a classification tree model was built to predict genome-wide AID targeting. Using this predictive model, we were able to identify a set of 101 high-interest genes that are likely targets of AID. The Journal of Immunology, 2013, 190: 3878–3888.

*Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511; †Department of Immunobiology, Yale University School of Medicine, New Haven, CT 06510; ‡Department of Pathology, Yale University School of Medicine, New Haven, CT 06510; §Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT 06510; ¶Department of Dermatology, Yale University School of Medicine, New Haven, CT 06510; and ‖Howard Hughes Medical Institute, New Haven, CT 06510

1Current address: Drinker Biddle & Reath LLP, Washington, D.C.

Received for publication September 26, 2012. Accepted for publication February 15, 2013.

J.L.D. was supported in part by the Pharmaceutical Research and Manufacturers of America Foundation and National Institutes of Health Grant T15 LM07056 from the National Library of Medicine. D.G.S. is an investigator of the Howard Hughes Medical Institute. Computational resources were provided by the Yale University Biomedical High Performance Computing Center (National Institutes of Health Grant RR19895).

J.L.D. and S.H.K. designed the analyses; M.L. and D.G.S. designed the RNA polymerase II ChIP-Seq experiment; M.L. performed the ChIP portion of the experiment; M.M.T. and M.J.S. provided the microarray expression data; J.L.D. and G.Y. wrote software for the analyses; J.L.D. performed the analyses; and J.L.D., D.G.S., and S.H.K. wrote the manuscript. All authors commented on the manuscript.

The microarray data presented in this article have been submitted to the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260) under accession number GSE44260.

Address correspondence and reprint requests to Dr. Steven H. Kleinstein, Yale University School of Medicine, 300 George Street, Suite 505, New Haven, CT 06511. E-mail address: [email protected]

The online version of this article contains supplemental material.

Abbreviations used in this article: AID, activation-induced cytidine deaminase; ChIP, chromatin immunoprecipitation; ChIP-Seq, chromatin immunoprecipitation followed by massively parallel sequencing; CSR, class switch recombination; dKO, double knockout; FDR, false-discovery rate; GC, germinal center; GSEA, gene set enrichment analysis; KO, knockout; KW, Kruskal–Wallis test; MWU, Mann–Whitney U test; NB, negative binomial; RefSeq, National Center for Biotechnology Information Reference Sequence; SHM, somatic hypermutation; TC-Seq, translocation-capture sequencing; TSS, transcription start site; ZI-NB, zero-inflated negative binomial.

Copyright © 2013 by The American Association of Immunologists, Inc. 0022-1767/13/$16.00

www.jimmunol.org/cgi/doi/10.4049/jimmunol.1202547

Downloaded from http://classic.jimmunol.org by guest on October 1, 2021. Copyright 2013 Pageant Media Ltd.

Somatic hypermutation (SHM) occurs in germinal center (GC) B cells, resulting in the introduction of point mutations into Ig genes. Although SHM provides an important source of genetic diversity, capable of producing specific Abs for quickly evolving pathogens, the process also poses a severe threat to genomic stability. Activation-induced cytidine deaminase (AID), the enzyme that deaminates cytosines to initiate SHM, can act outside of the Ig locus. In a previous sequencing study, we showed that >45% of expressed genes in GC B cells are targeted by AID in Ung−/− Msh2−/− double-knockout (dKO) mice, where the absence of DNA repair reveals the “footprint” of AID. Even among genes that were targeted by AID, this study revealed a wide range of mutation frequencies observed across 83 genes (1). In this study, we seek to address two basic questions that are raised by the former study: Why are some genes targeted by AID, whereas others are not? and How do the genes targeted by AID accumulate different levels of mutation? The main hypothesis we pursue is that sequence features of each locus are responsible for this differential targeting.

The current model of SHM proposes two phases (2). In the first phase, AID converts a cytosine (C) residue to a uracil (U) in ssDNA created during the process of transcription, which, if left unrepaired, leads to a C to T (thymine) transition mutation when the DNA is replicated for cell division (3). The second phase of SHM begins when DNA repair mechanisms attempt to remove the uracil lesion from the DNA. The repair of the uracil happens via two pathways: base excision repair with UNG and mismatch repair facilitated by the MSH2/MSH6 complex, both of which are capable of working in an error-prone fashion and contributing to the observed mutation frequency (4). In the dKO setting, the second phase of SHM is unavailable, thus revealing the underlying “footprint” of AID, where the expectation is primarily C → T transition mutations. We previously sequenced 83 non-Ig genes from dKO mice on average 70 times per gene over a 1-kb region downstream of the transcription start site (TSS) (1). Mutation frequencies varied widely, ranging from <1 × 10⁻⁵ to 116.1 × 10⁻⁵ mutations/bp, but they were highly predictable for the same gene across samples from multiple mice. In the same system, sequencing of an IgH positive control, specifically the VhJ558-Jh4 intron 3′ flanking region (hereafter referred to as the Jh4 intron), found a mutation frequency of 9.96 × 10⁻³ mutations/bp. Each gene represents a unique genomic context in which to explore the various properties associated with AID targeting.

Differential AID activity in non-Ig genes may be influenced by multiple underlying mechanisms. A higher transcription rate may be associated with an increased mutation frequency. Genes with

a higher mutation frequency may contain a large number of AID hotspots, such as WRC (W = A/T; R = A/G), and/or few AID coldspots, such as SYC (S = C/G; Y = C/T), where the C is the mutated position (5, 6). Clonal recruitment of AID to certain genes may lead to an increased mutation frequency (7). Finally, the genes for which high mutation frequencies are observed may share functional elements, like transcription factor binding sites, which recruit AID to the locus for mutation. In this study, we first examine each of the possible mechanisms independently and then develop an integrated model to predict targeting of AID in the non-Ig genes.

Materials and Methods

Stratification of dKO genes

Genes were selected for sequencing in our previous study based on multiple criteria (1): expression of the gene determined through microarray studies in both mouse and human B cells (8–11), because it is expected that expressed genes are undergoing transcription, a requirement for AID targeting (3, 12); a well-defined TSS and chromosomal location; and a high level of homology between the mouse and human genes for the first exon. Genes known to be involved in the immune response, tumorigenesis, cell proliferation, or apoptosis or known to undergo chromosomal translocations or deletions, especially in B cell tumors, were given preference. Finally, the set of genes chosen was also selected to provide good coverage across the set of mouse chromosomes. A total of 83 non-Ig genes was sequenced from Peyer’s patches of Ung−/− Msh2−/− dKO mice, in addition to the positive control of the Ig Jh4 intron. Each gene was sequenced, as previously described, in a 1-kb region directly downstream of the TSS (1). Mutations were determined through use of the neighborhood quality standard algorithm (13), using previously defined criteria (1). The resulting mutation frequency of each gene in the dKO setting was compared with the background mutation rate determined by sequencing 31 genes from Aicda−/− knockout (KO) mice using the Fisher exact test. False-discovery rate (FDR) q-values were determined for each gene (λ = 0) and used as the basis for ranking the genes and determining group assignments: group A = q-value < 10⁻⁶; group B = 10⁻⁶ ≤ q-value < 10⁻²; and group C = q-value ≥ 10⁻².

Gene-expression analysis

BALB/c background mice carrying IgH-transgenic alleles and targeted deletions of the endogenous JH locus (VH186.2-Tg JH KO and V23-Tg JH KO) were immunized i.p. with 50 μg NP25-chicken γ globulin precipitated in alum, as described (9). Fourteen days later, splenocytes were isolated and stained with fluorescently labeled peanut agglutinin (Vector Laboratories), anti-λ (goat polyclonal; SouthernBiotech), and anti-CD45R/B220 (RA3-6B2) to identify λ+ PNA+ B220+ GC B cells. Cells were purified on a FACSAria cell sorter (BD), as previously described (14). Live/dead discrimination was accomplished based on forward/side scatter and propidium iodide exclusion. Cells were kept at 4°C in buffers containing 0.05% sodium azide to minimize alterations in gene expression. All animal immunizations and experiments were approved by the Yale Institutional Animal Care and Use Committee.

The mRNA expression levels among sorted cells were determined using Affymetrix Mouse 430 2.0 microarray and previously described methods (14, 15). Annotation files were obtained directly from Affymetrix (v. 31) and were used to associate the gene symbols and National Center for Biotechnology Information Reference Sequence (RefSeq) (16) identifiers with each probe set. The samples were normalized using the GC Robust Multi-array Average algorithm. The average expression value reported for each gene was determined by taking the log2 of the average expression values across all probes associated with each RefSeq identifier for all samples. These microarray data are available through the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260.

Polymerase II chromatin immunoprecipitation followed by massively parallel sequencing

Mice carrying an mVH186.2-Tg IgH Tg allele and targeted deletions of the JH locus (17) were immunized with NP25-chicken γ globulin, and spleens were removed 12 d postimmunization. B cells were enriched by an EasySep customized B cell negative selection kit (without anti-CD43 Ab; Stem Cell Technologies). Cells were stained with anti-mouse B220–Cy-Chrome, anti-mouse CD19-PE, and anti-mouse GL7-FITC (all from Becton Dickinson Biosciences) Abs and sorted on a DakoCytomation MoFlo cell sorter at the Yale Immunobiology Cell Sorting Facility.

Dynabeads A (Invitrogen) were incubated with RNA polymerase II Ab N20 (Santa Cruz Biotechnologies) or normal rabbit serum. Excess Ab was washed away. Then Ab-bound beads were incubated with chromatin from 20 million sorted spleen GC B cells (previously cross-linked with 1% HCHO and then sonicated to shear the DNA fragments to 100–300 bp) at 4°C overnight. Beads were washed, chromatin was eluted, and the cross-linking was reversed. DNA was purified, precipitated, and redissolved in TE buffer. Precipitated DNA was quantified using a PicoGreen dsDNA quantification kit (Molecular Probes). The ends of 200 ng chromatin immunoprecipitation (ChIP) DNA (from 40 million cells) were repaired using polynucleotide kinase and Klenow enzyme, followed by treatment with Taq polymerase to generate a protruding 3′ A base used for adaptor ligation. Following ligation of a pair of Solexa adaptors to the repaired ends, the ChIP DNA was amplified using the adaptor primers for 17 cycles, and the fragments around 220 bp (mononucleosome + adaptors) were isolated from agarose gel. The purified DNA was used directly for cluster generation and sequencing analysis using the Solexa 1G Genome Analyzer, following the manufacturer’s protocols.

The resulting 25-bp reads were aligned against the mouse genome (mm8) using Efficient Local Alignment of Nucleotide Data (Illumina), allowing up to two mismatches against the reference. The reads kept for further analysis had to map uniquely to the genome, and a maximum of three copies of the same read was kept to reduce PCR artifact. The reads were then converted to browser extensible data format. The TSSs for all genes were identified using the RefSeq identifier (16) and were obtained through the UCSC Genome Table Browser (18) for the mouse mm8 genome build. For each group, genes were aligned by the TSS, and the maximum peak of overlapping reads was determined for each gene in the region 100 bases around the TSS. Differences among the three groups were determined using the Kruskal–Wallis test (KW), and the Mann–Whitney U test (MWU) was used for determining differences among pairs of groups.

Mutability

Mutations and all positions sequenced that occurred at C on either the top or bottom strands were considered for analysis if the residue and the two residues directly upstream passed the neighborhood quality standard criteria of Liu et al. (1). Both mutations and positions that did not meet this criterion were excluded from the analysis, resulting in an observed mutation frequency for each gene that is slightly different from the previously published figures. Mutability was calculated in a manner similar to that of Shapiro et al. (19), restricted to the case when the mutation is a C residue in the third position (or G in the first position for the reverse complement). To do so, the frequency of mutations in each of the 16 possible dinucleotides upstream of the mutated C residue was tabulated for each sequence and normalized by the total number of mutations in that sequence. The same was done for all sequenced C residues to determine the background. The mutability for each sequence was calculated by dividing the normalized frequency of the dinucleotide motifs for the mutated residues by the background of the sequenced region. The overall mutability for a set of sequences was calculated as the mean mutability in individual sequences weighted by the total number of mutations in each sequence. Error bars were calculated by bootstrapping the original set of sequences 10,000 times. The p values were calculated for each motif by comparing the bootstrapped mutability values and the observed mutability values for each sample. An aggregated p value for all motifs was computed using the Fisher method for combining p values of individual tests.

Hot/cold-spot analysis

For each gene, the region sequenced was analyzed for occurrences of the AID hotspot WRC and the AID coldspot SYC on both the top and bottom strands. The total number of hotspots found in the sequenced region was normalized by the number of C and G residues in the sequence to account for the unique composition of each gene.

Negative binomial analysis

The number of mutations from each sequence was determined for each of the 15 genes in group A and for the Jh4 intron. The negative binomial (NB) distribution was fit to each gene using the fitdistr method available through the MASS package for R (20), and a χ² goodness-of-fit test was applied. In the cases in which the NB did not fit the data (p < 0.05), a zero-inflated NB (ZI-NB) distribution was fit as defined through the emdbook package for R (21) and tested using a χ² goodness-of-fit test.

Determining transcription factor binding sites

The region 2 kb around the TSS for each sequenced gene was calculated based on RefSeq gene definitions (accessed July 2009). The corresponding aligned sequences for both mouse (mm9) and human (hg18) were obtained

from the multiple sequence alignment, multiz30way (22), for the mouse genome available from the UCSC Genome Bioinformatics Site (23). Repetitive sequences were masked from both the mouse and human sequences using RepeatMasker (24). With regard to the identification of E-box core sequences, two consensus sequences defined the site: CASSTG (where S = C or G) and CANNTG (where N = A, C, G or T). In addition, the MATCH (25) program was used to scan the sequences for hits of TRANSFAC-defined binding sites (26) (including all high-quality vertebrate binding sites used for v. 2010.3) using the cutoff values defining a hit to minimize the sum of false positives and false negatives. Only hits that occurred in the same position in the mouse and human aligned sequences were retained for analysis. Note that this step of evolutionary conservation does not impose upon the model an assumption that mutation spectrum from AID targeting in the species used for the conservation will correlate perfectly, or at all, with mouse.

Gene set enrichment analysis

Gene set enrichment analysis (GSEA) was performed using the GSEA Preranked tool with classic weighting within the GSEA tool (27). Genes from the dKO setting were ranked according to the FDR q-value, comparing the dKO mutation frequency with the background mutation frequency from Liu et al. (1). A total of 10,000 permutations was used to determine the null distribution of enrichment scores for each gene set and to calculate the p value. The FDR q-value corrects for multiple hypothesis testing and is influenced by the total number of gene sets tested. In the case of E-boxes and YY1 sites, the total number of gene sets tested was 6; in the general test of all other transcription factor binding sites (including C/EBP-β), the total number of gene sets tested was 705.

Location analysis

Binding site locations were compared between sets of genes using a Bonferroni-corrected one-tailed MWU. For each gene, the binding site closest to the TSS (in the region ±7.5 kb) was selected for analysis. Two sets of control genes were included. The first set of control genes was defined as the top 10% of genes with the highest average expression (across all probe sets and samples) in the GC B cell microarray data previously described. The second set of control genes included the set of 17 nonexpressed genes that was sequenced in the wild-type setting (1).

Colocation and trilocation analysis

For the pair-wise analysis of transcription factor binding site locations performed, a score was computed for each possible pair of binding sites between two transcription factors (Eq. 1),

\[ \text{Colocation score} = d(S) + d(T) \qquad (1) \]

where d(S) is the distance between the binding sites for each factor, and d(T) is the distance of the binding site furthest from the TSS. Genes without a predicted binding site in the analyzed region were assigned a distance equal to the maximal distance from the TSS. The genes were again split into groups A, B, and C and compared against the two sets of control genes from above. In computing the trilocation score of three factors, d(S) was defined as the maximal distance between all three sites. A Bonferroni-corrected one-tailed MWU was used to determine the differences between the groups of genes and the control sets.

Estimation of variable importance

To determine the significance of the variables that were used in this analysis, the measurements were obtained for all group A, B, and C genes for all independent attributes measured across all genes. In addition, the total number of binding sites for each significant factor within the region 2 kb around the TSS was included for analysis. The cforest algorithm (28) (available through the party package for R) was used to generate a random forest of classification trees to determine the variable importance using an unbiased approach. A forest of 1000 trees was built with the following parameters specified: minimum samples in a node for splitting = 15, minimum number of samples in a leaf = 5, and number of variables randomly selected at each node = 5. Variable importance was calculated from the random forest, defined by the mean decrease in prediction accuracy when values are permuted for each variable. Important variables were defined as having a variable importance score greater than the absolute value of the least important variable (29).

Classification tree model to predict genome-wide SHM

The classification tree model was generated using the rpart package for R (30). The model was constructed using the default parameters, with the exception of the following parameters: the minimum number of samples in each node to allow a split was set to 9, the splitting index used was “information,” and the loss matrix used was:

\[
\begin{bmatrix}
0 & 3 & 15 \\
10 & 0 & 10 \\
20 & 1 & 0
\end{bmatrix},
\]

where the columns correspond to the true group A, B, or C and the rows correspond to predicting group A, B, or C. Thus, a gene from group B predicted to be in group A results in a loss (or penalty) for misclassification of 10. Ten-fold cross-validation was performed during the construction of the model and used to prune the full tree by choosing the tree that minimized the cross-validated error to create the final model (31).

Results

Mutation frequency is correlated weakly with transcription

Transcription is required for SHM at the Ig locus, and we previously observed that nontranscribed genes did not accumulate mutations in the genome-wide setting (1). We asked whether, for expressed genes, there is a correlation with mutation frequency. We previously determined the mutation frequency for 83 non-Ig genes isolated from Ung−/− Msh2−/− dKO Peyer’s patch B cells and classified these into three groups: A, B, and C (1) (see Materials and Methods). In this setting, group A genes had the highest mutation frequency, percentage of C → T transition mutations, and hotspot focusing, followed by genes in groups B and C, with decreases in each subsequent group, respectively. The relative mRNA expression of these genes was compared in GC B cells; we found a minimal correlation between a gene’s log2 mRNA expression level, as measured by Affymetrix microarray, and its mutation frequency (r = +0.079) (Supplemental Fig. 1A). When genes were grouped based on their p value for AID targeting (group A and B genes accumulate mutations that are significantly different from the background and show strong signs of AID targeting, whereas genes in group C are minimally mutated compared with the background), we observed a positive trend between average log2 expression and mutation frequency (Fig. 1A). However, there is not a significant difference between these groups (p = 0.064, KW). Thus, although AID-targeted genes seem to have higher average log2 mRNA expression at the group level, this was not statistically significant, and the correlation between mutation and steady-state transcript levels is very weak.

Because steady-state mRNA levels are not a direct measure of the rate of transcription and performing a nuclear run-on assay would be difficult with GC B cells (32), RNA polymerase II ChIP followed by massively parallel sequencing (ChIP-Seq) was performed on GC B cells to better address the relationship between transcription rate and mutation frequency. For each gene, transcription was quantified by the maximum peak of overlapping polymerase II tags in the region 100 bp around the TSS. A statistically significant difference was found among the genes in groups A, B, and C (p = 0.04, KW), indicating an unequal distribution of polymerase II (Fig. 1B). A further examination of the polymerase II levels among the groups revealed that group A had a significantly higher amount of polymerase II compared with both group B (p = 0.0045, MWU) and group C (p = 4.6 × 10⁻⁴). As a control, group A, B, and C genes all had significantly higher polymerase II levels compared with the set of 17 nonexpressed genes that were also analyzed in the previous study (1) (p values: group A = 8.2 × 10⁻⁷; group B = 9.0 × 10⁻⁵; group C = 5.1 × 10⁻⁴, MWU). Thus, AID targeting is correlated positively with higher levels of polymerase II near the TSS of a gene. However, as with mRNA expression, polymerase II levels are only weakly predictive of mutation frequency at the individual gene level (Fig. 1C).
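The polymerase II measure used here, the maximum peak of overlapping tags near the TSS, can be illustrated with a short sweep-line sketch. This is not the authors' software: the function name `max_overlap_peak`, the representation of tags as half-open (start, end) intervals, and the reading of the window as symmetric (±100 bp) are assumptions made only for illustration.

```python
def max_overlap_peak(tags, tss, flank=100):
    """Return the largest number of tags covering any single base
    in the window [tss - flank, tss + flank].

    tags  : iterable of (start, end) half-open intervals (hypothetical format)
    tss   : transcription start site coordinate
    flank : half-width of the window (assumed symmetric in this sketch)
    """
    lo, hi = tss - flank, tss + flank + 1  # half-open window
    events = []
    for start, end in tags:
        # Clip each tag to the window and skip tags that miss it entirely.
        s, e = max(start, lo), min(end, hi)
        if s < e:
            events.append((s, 1))    # coverage rises at s
            events.append((e, -1))   # coverage falls at e
    # Tuples sort by position first; at equal positions the -1 event sorts
    # before the +1 event, so abutting tags are not counted as overlapping.
    events.sort()
    depth = peak = 0
    for _, delta in events:
        depth += delta
        peak = max(peak, depth)
    return peak


if __name__ == "__main__":
    example_tags = [(950, 975), (960, 985), (970, 995), (2000, 2025)]
    print(max_overlap_peak(example_tags, tss=1000))
```

Per-gene peaks computed this way could then be compared across groups with a rank-based test, as the text describes.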

FIGURE 1. Transcription is correlated weakly with mutation frequency. (A) Normalized mRNA expression levels (averaged over all probe sets and microarray samples) for genes in groups A, B, and C. A set of 17 nonexpressed genes (NE) is included as a control. (B) The maximum peak for RNA polymerase II binding based on ChIP-Seq of GC B cells for each group of genes, including the control of nonexpressed genes (NE). (C) Comparison of the maximum peak of RNA polymerase II and the observed mutation frequency for the 83 individual genes sequenced in the dKO setting.
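The group labels used in Fig. 1 follow the q-value thresholds stated in Materials and Methods. A minimal sketch of that stratification is given below. The function names are hypothetical, and the Benjamini–Hochberg adjustment is an assumption about what "(λ = 0)" implies: with λ = 0 the estimated proportion of true nulls is 1, in which case q-values coincide with Benjamini–Hochberg adjusted p values.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p values (assumed here to match q-values at lambda = 0)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that each q-value
    # is the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def assign_group(q_value):
    """Thresholds from the text: A if q < 1e-6, B if 1e-6 <= q < 1e-2, C otherwise."""
    if q_value < 1e-6:
        return "A"
    if q_value < 1e-2:
        return "B"
    return "C"


if __name__ == "__main__":
    p = [1e-9, 3e-4, 0.2, 0.04]  # made-up p values, for illustration only
    for p_i, q_i in zip(p, bh_q_values(p)):
        print(p_i, q_i, assign_group(q_i))
```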

Increased AID hotspot density does not lead to higher mutation frequency

AID preferentially targets WRC/GYW motifs, so it is possible that genes with more of these motifs may accumulate a greater number of mutations independent of AID recruitment to the locus. To test this hypothesis, we first confirmed that the hot/cold-spot targeting of AID, which was previously defined at the Ig locus, was conserved in the genome-wide setting. It was shown previously that the observed mutations, especially in the group A genes, were biased toward the SHM hotspot WRCY/RGYW motif (1). To extend this analysis to cover the full spectrum of DNA motifs, focusing on the AID hotspot, and to increase coverage across all motifs given the relatively low number of mutations in the genome-wide dataset, the relative mutability of each of the 16 possible dinucleotide combinations in the motif NNC/GNN was calculated. Under the null hypothesis of no hot/cold-spot targeting, the mutability for each motif will be focused around a value of 1. However, we observed significant deviations from this null hypothesis for both the group A genes and the positive control Jh4 intron. In each case, the same general trend is observed: trinucleotide motifs representing the classic hotspot WRC have a higher mutability than random targeting, whereas those representing the classic coldspot SYC have a lower mutability than random targeting (Fig. 2A). In addition, inspection of the bootstrap interval for the mutability of each motif shows a high overlap between the two datasets. Thus, we conclude that AID displays a range of hot/cold-spot preferences in the genome-wide setting that is consistent with those at the Ig locus.

To test whether differences in the mutability of each gene could account for differential mutation accumulation, we compared the hot-/coldspot densities to the mutation frequency across the set of 83 genes from the dKO setting. Supplemental Fig. 1B shows a minimal negative relationship between hotspot frequency and mutation frequency (r = −0.16), and this was not significant when the genes were split into groups A/B/C (p = 0.33, KW) (Fig. 2B). The AID coldspot SYC/GRS displayed a weak positive relationship with the mutation frequency (r = +0.2118, Supplemental Fig. 1C), but once again this was not statistically significant (p = 0.31) for groups A/B/C (Fig. 2C). We conclude that the density of AID hot-/coldspots does not account for the differential targeting of AID to group A/B/C genes.

Clonal recruitment of AID is not preserved in non-Ig genes

Clonal recruitment of SHM to the Ig locus was suggested based on transgene studies (7). According to this model, once a particular gene in a cell is targeted, the gene remains marked as accessible to

FIGURE 2. Genome-wide AID hot/cold-spots mirror the Ig locus but do not influence overall targeting. (A) The relative mutability of each DNA triplet motif ending in C based on group A genes and the Jh4 intron. The error bars represent the 95% confidence interval, as determined through bootstrapping of the original sequences. (B) Normalized WRC/GYW hotspot frequency for each group of genes. (C) Normalized SYC/GRS coldspot frequency for each group of genes.
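The normalized hotspot and coldspot frequencies plotted in Fig. 2B and 2C can be sketched as follows. The code counts a motif on the top strand together with its reverse complement (WRC with GYW, SYC with GRS), which is equivalent to scanning both strands, and divides by the number of C and G residues as described in Materials and Methods. The function names and the handling of sequences without C or G are assumptions for illustration.

```python
import re

# IUPAC codes needed for the motifs discussed in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "S": "[CG]", "Y": "[CT]", "N": "[ACGT]",
}


def count_motif(seq, motif):
    """Count occurrences of an IUPAC motif, including overlapping ones."""
    # A zero-width lookahead lets matches start at adjacent positions.
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return len(re.findall(pattern, seq.upper()))


def motif_density(seq, top_motif, bottom_motif):
    """Motif count on both strands, normalized by the number of C and G residues."""
    seq = seq.upper()
    n_cg = seq.count("C") + seq.count("G")
    if n_cg == 0:
        return 0.0  # assumption: define the density as 0 when there is nothing to mutate
    return (count_motif(seq, top_motif) + count_motif(seq, bottom_motif)) / n_cg


if __name__ == "__main__":
    region = "AGCTAGCT"  # toy sequence, not real data
    print("hotspot density:", motif_density(region, "WRC", "GYW"))
    print("coldspot density:", motif_density(region, "SYC", "GRS"))
```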

SHM in the cell’s progeny. If true, the differential ability of genes to initiate clonal recruitment could contribute to the wide range of observed mutation frequencies in the genome-wide setting. We first sought to confirm the idea of clonal recruitment at the Ig locus by analyzing the distribution of mutation counts in the Jh4 intronic region. If mutation acts randomly among all of the cells in a population, the number of mutations per sequence should fit an NB distribution (33, 34). In contrast, the process of clonal recruitment implies that only some of the cells will accumulate mutations efficiently, whereas others that fail to recruit AID will not accumulate mutations at all. In this case, we propose that the number of mutations per sequence should fit a ZI-NB distribution, which has the capability of modeling the distribution of the number of mutations per sequence for cells that have recruited AID separately from those with no mutations present. Fig. 3A clearly shows that the NB distribution is a poor model for the observed mutations per sequence in the Jh4 intron (p = 2.6 × 10⁻⁸, χ² test), whereas the ZI-NB distribution is appropriate (p > 0.05) (35, 36). Thus, these data support the idea of clonal recruitment of AID at the Ig locus. In contrast, the NB distribution was an appropriate fit to all genes in group A (p > 0.05) (Fig. 3, Supplemental Fig. 2). This was true for both genes with a low mutation frequency, like Eif4a2 (Fig. 3E), as well as the most mutated genes, like Myc, Bcl6, and H2afx (Fig. 3B–D). Thus, unlike the Ig locus, we find no evidence for clonal recruitment of AID in the genome-wide setting, and this process is unlikely to play a role in the differential accumulation of mutations.

E-boxes, YY1, and C/EBP-β binding sites are associated with AID targeting

We previously demonstrated that evolutionarily conserved E-box motifs (CASSTG), which bind various E-proteins, including E12 and E47, were enriched in the region 2 kb around the TSS for highly mutated genes (1). Extending these results, we also found moderate enrichment for a more general form of the motif (CANNTG) (GSEA p value = 0.006; FDR q-value = 0.061). This association between E-box motifs and highly mutated genes is further supported by a shift in the location of the motif in group A genes toward the TSS (CASSTG, p = 0.0089; CANNTG, p = 0.0092, MWU). Previous studies showed that binding sites located closer to the TSS have a higher probability of being true regulatory sites (37, 38). These results strongly support a role for E-boxes in recruiting AID.

We next tested whether YY1 binding sites were associated with AID targeting, because this transcription factor was recently shown to be a regulator of the GC gene-expression program (39). Four separate definitions exist for a YY1 binding site in the TRANSFAC database, where each defined YY1 binding site has a different length and slightly different consensus motif. Two of these four binding site definitions (YY1_Q6 and YY1_Q6_02) were significantly enriched in the promoter regions of group A genes (GSEA p < 0.05; FDR q-value < 0.1). Further supporting the involvement of YY1, these binding sites tended to be farther from the TSS in group C genes (Fig. 4). This shift in location was significant for the YY1_Q6_02 binding site definition (p = 0.014), and we refer to this definition as a YY1 binding site throughout the rest of this article. Thus, along with E-box motifs, YY1 motifs are associated with the recruitment of AID.

To identify additional transcription factors influencing AID targeting, we screened the remaining set of 705 high-quality vertebrate TRANSFAC transcription factor binding sites. This analysis identified an additional factor associated with AID targeting: evolutionarily conserved binding sites for C/EBP-β (CCAAT-enhancer binding protein, β), also known as NF-IL6 (p < 0.001; FDR q-value = 0.094). As with both E-box and YY1 sites, the location of C/EBP-β binding sites tends to drift away from the TSS with the decrease in mutation frequency (Fig. 4), and this is significant for group C genes (p = 0.0086). At no point in our analysis did we find enrichment for binding sites inhibiting AID targeting. Altogether, our transcription factor binding site screen and location analysis identified three binding sites that could be involved in the recruitment of AID: E-boxes, YY1, and C/EBP-β.

Colocation of E-boxes, YY1, and C/EBP-β sites in AID-targeted genes

Because we showed that evolutionarily conserved binding sites for E-boxes, YY1, and C/EBP-β have a nonrandom location distri-

FIGURE 3. AID is clonally recruited to the Ig locus but not to non-Ig genes. (A) The frequency distribution of the total number of mutations per Jh4 intron sequence was fit to both the NB (s) and the ZI-NB (n). A similar analysis was carried out for group A genes, including Myc (B), Bcl6 (C), H2afx (D), and Eif4a2 (E). The Journal of Immunology 3883

FIGURE 4. E-box, YY1, and C/EBP-b binding sites exhibit a nonrandom location distribution with respect to the TSS of group A genes. Each gene is represented by the binding site that is closest to the TSS within the window 7.5 kb in either direction. For groups A, B and C, comparisons were made against two control groups: a set of highly expressed genes in GC B cells (HC) and a set of nonexpressed genes (NE). A statistically significant shift in bind- ing site locations is indicated by the group’s name at the top of the panel (p , 0.05, versus HC, Bon- ferroni-corrected MWU). No significant difference was detected for any group compared with the set of NE genes.

bution, we set out to test whether these sites are colocated near zation of the three sites together, we developed a trilocation score one another. A colocation score was defined that combines the that incorporates the maximal distances between the three binding distance between pairs of binding sites with the distance of the sites and the location from the TSS of the gene in a manner similar pair from the TSS of the gene, such that a low colocation score to the colocation score (see Materials and Methods). We find that reflects a pair of binding sites that are located close to one another both group A and group B have significantly lower trilocation scores and close to the TSS (see Materials and Methods). Using this compared with control genes (p , 0.05, Fig. 6). In addition, both scoring metric, we found that CASSTG E-box motifs were colo- groups A and B have significantly lower trilocation scores compared cated both with YY1 and C/EBP-b binding sites in group A genes with group C (p =1.73 1024 and p =9.33 1024, respectively) but at a statistically significant level compared with control genes are not distinguishable from one another (p = 0.07). These results (p , 0.01, Fig. 5). In addition, these two pairs of sites are sig- suggest that E-box motifs, YY1, and C/EBP-b binding sites form nificantly colocated in group A compared with either group B a regulatory module that is capable of recruiting AID to target the genes or group C genes (p , 6.5 3 1024 and 6.0 3 1023,re- gene for mutation. spectively). Although a similar trend was observed in the colo- A combination of features can predict AID targeting genome cation scores of the CANNTG E-box with both YY1 and C/EBP-b, wide the colocation signal is much stronger for the CASSTG E-box motif. 
Although none of the features investigated above could individ- In addition to being colocated with E-box motifs, YY1 and ually separate genes in groups A, B, and C (note the high overlap C/EBP-b sites were found to be colocated with one another in both in box plots for each group), we hypothesized that a combination group A and B genes (Fig. 5, far right panel), and both groups have of these features could accurately predict AID targeting. As a first step, importance analysis based on random forests was used to

by guest on October 1, 2021. Copyright 2013 Pageant Media Ltd. significantly lower colocation scores compared with group C genes (p =5.73 1024 and 5.0 3 1024, respectively). Interestingly, the select a subset of variables to be included in the final model. A colocation scores from genes in groups A and B are indistin- list of variables included in this analysis is shown in Supplemental guishable from one another (p . 0.8). Because the strong colo- Fig. 3. Note that some colocation scores were excluded to reduce cation of YY1 and C/EBP-b binding sites in groups A and B can redundancy. Variable importance analysis (see Materials and distinguish genes that mutate from nonmutating genes (group C), it Methods) identified seven features as significant for separating seems that more information is required to explain the differential group A, B, and C genes: CASSTG E-box and YY1 colocation, level of mutation observed between the two groups. CASSTG E-box and C/EBP-b colocation, YY1 and C/EBP-b colocation, number of C/EBP-b sites (6 2 kb around TSS), lo- CASSTG E-boxes, YY1, and C/EBP-b sites are trilocated in

http://classic.jimmunol.org cation of the C/EBP-b site closest to the TSS, and the maximum highly mutated genes polymerase II peak (6 100 bp around the TSS) (Supplemental The observation that both YY1 and C/EBP-b binding sites are Fig. 3). colocated with E-box motifs suggests that the three binding sites To predict whether each gene belongs to group A, B, or C, the may be located together, perhaps forming a cis-regulatory module seven variables identified as important were used to create a clas- located close to the TSS of the gene. To characterize the organi- sification tree. In generating this model, we strongly penalized the Downloaded from

FIGURE 5. E-box, YY1, and C/EBP-β binding sites are colocated in group A genes. For each pair of enriched transcription factors, the colocation score combining the distance between factors and the distance to the TSS was calculated for each gene. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in colocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.
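The colocation and trilocation scores are described in the text only qualitatively (pair or triplet distance combined with distance from the TSS; the exact definition is in Materials and Methods). The sketch below is one possible reading, not the published formula: it assumes an additive score, signed site positions in bp relative to the TSS, and the hypothetical function names `colocation_score` and `trilocation_score`.

```python
from itertools import product


def colocation_score(sites_a, sites_b):
    """Lowest pair score for two factors' binding sites near one gene.

    Positions are signed offsets in bp from the TSS (TSS = 0).
    ASSUMPTION: score = separation of the pair + distance of the
    nearer site from the TSS; the paper's exact formula may differ.
    """
    scores = [
        abs(a - b) + min(abs(a), abs(b))
        for a, b in product(sites_a, sites_b)
    ]
    # No score can be computed if either factor has no site.
    return min(scores) if scores else None


def trilocation_score(sites_a, sites_b, sites_c):
    """Same idea for three factors, using the maximal pairwise distance.

    ASSUMPTION: additive form, chosen only to mirror colocation_score.
    """
    scores = [
        max(abs(a - b), abs(a - c), abs(b - c)) + min(abs(a), abs(b), abs(c))
        for a, b, c in product(sites_a, sites_b, sites_c)
    ]
    return min(scores) if scores else None
```

Under this assumed form, a low score requires both conditions at once: two sites that sit next to each other but far from the TSS, or one site at the TSS with its partner far away, both score high.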

FIGURE 6. E-boxes (CASSTG), C/EBP-β and YY1 binding sites trilocate in AID targets. For each gene, the trilocation score was calculated by combining the distance between all three sites and the distance from the TSS. Groups A, B, and C were compared separately with each set of control genes (HC [highly expressed control] and NE lines at top of plot, see Fig. 4 legend). Statistically significant shifts in trilocation scores are indicated above the relevant group. Single letter: p < 0.05, double letter: p < 0.01, triple letter: p < 0.001, Bonferroni-corrected MWU.

misclassification between groups A and C. The penalty for misclassification of group B genes was deemed less significant for two reasons. First, the original classification of a gene in this group could change based on where the initial cut points between groups were drawn by Liu et al. (1). Second, group B genes are still considered to be targeted by AID, albeit at a lower level than group A genes; therefore, misclassification would not be as severe for this middle group. The full model generated from the data was pruned to an appropriate size based on the 10-fold cross-validation error, producing a final model containing five terminal nodes with four decision points (Fig. 7A). For the 83 genes used to build the model, the overall misclassification rate was 25%, with the majority of these misclassified genes from group B (as expected by the penalty setup). Importantly, the misclassification rate between groups A and C was only 8%. It was determined through cross-validation that the sensitivity of the model for correctly predicting a group A gene was 55%, with a specificity of 84% in predicting non–group A genes. Consequently, this model allows for the accurate separation of genome-wide AID targets (groups A and B) compared with nontargeted genes (group C).

The classification tree model was applied genome wide to predict additional AID targets. Of the 22,492 RefSeq-defined mouse genes that were not used to build the model, 3,648 (16%) were predicted to be in group A, 1,949 (9%) were predicted to be in group B, and the remaining 16,895 (75%) were predicted to be group C (Supplemental Table I). To generate a list of high-quality AID targets that would be of interest for experimental validation, predicted group A genes were filtered to include only those that have an average gene expression in the top 25% of all genes in GC B cells, as determined through mRNA expression; have human homologs [defined by the Mouse Genome Database (40)] annotated as either oncogenes or tumor suppressors [identified by an Entrez query through the CancerGenes tool (41)], or are found within the COSMIC Cancer Gene Census (42) version 56. This filtering produced a set of 101 high-interest genes shown in Table I. Some notable members of this list are immune-related genes: Bcl7a, Btg1, Cd82, Cdk4, Cxcr4, Foxo1, Foxp1, Irf1, Irf8, Raf1, Rela, Stat3, and Tcf4.

We carried out two validations of the model by comparing the predicted AID targets with sets of experimentally identified genes that should be enriched for actual AID targets. In the first validation, the set of actual AID targets was based on results from a translocation-capture sequencing (TC-Seq) study (43). This study identified 92 genes that could translocate with either IgH or Myc in an AID-dependent manner in ex vivo B cells. Of the 92 TC-Seq genes, 10 were excluded from the validation because they had been used to generate the model, 11 were not in our RefSeq database, and 1 was translocated to Myc in the AID−/− control. Of the remaining 70 TC-Seq genes, the model predicted 19 (27%) as group A, 13 (19%) as group B, and 38 (54%) as group C (Supplemental Table I). When compared with the set of predictions from all mouse RefSeq-defined genes, the frequency of predicted group A and B genes was significantly higher among the TC-Seq genes than expected by chance (p = 0.00019, χ² test) (Fig. 7B). This overrepresentation of predicted group A and B genes remained significant, even when the background was restricted to genes that were highly expressed in GC B cells and, therefore, were more likely to accumulate AID-induced mutations (p = 0.0031, χ² test). Overall, these results demonstrate that the classification model predictions are enriched for actual AID targets.

In the second validation, the set of actual AID targets was defined by AID binding, according to a ChIP-Seq study (44). Although the B cells used in this study undergo class-switch recombination (CSR) and not SHM, we expect that because the two processes are linked in their use of AID we would still observe enrichment of AID-binding genes for SHM. This assumption is supported by the observation that group A, B, and C genes differed in the mean number of AID sequence tags that were bound (p = 0.013, ANOVA), with the highest frequency of AID tags found in group A genes (Fig. 7C). Differential AID binding was also observed among highly expressed genes that were predicted to be in groups A, B, and C, with predicted group A genes exhibiting the highest frequency of AID tags (p = 1.9 × 10⁻⁹, ANOVA) (Fig. 7C). An even higher frequency of AID binding was found among the 101 high-interest predicted group A genes. The significantly higher AID binding among targets predicted by the classification model may be even more significant because CSR and SHM are dependent on different regions of the AID protein and may involve process-specific cofactors that can change the targeting (3, 45), so that complete overlap with the model predictions was not expected. Taken together, these results further support the validity of the classification tree model to predict genome-wide targets of AID and highlight the rationale for choosing the set of 101 high-interest genes put forth as candidates for experimental validation.

FIGURE 7. Classification tree model to predict genome-wide targets of AID. (A) Classification tree to predict AID targeting for individual genes. For each decision node in the tree (gray circle), the gene proceeds down the left branch if it satisfies the decision condition. The leaves of the tree indicate the predicted group (A, B, or C), along with the number of genes falling into this group from the 83 genes sequenced in the dKO setting (group A/group B/group C). For example, genes in leaf #1 are predicted to be in group A, and this leaf has a total of nine genes from the training data: six group A genes, three group B genes, and zero group C genes. (B) AID-mediated translocated genes (43) are enriched among genes predicted by the model to be AID targets (group A) compared with the set of all genes (22,492 genes) and the set of highly expressed genes comprised of the top 25% of all genes by mRNA expression levels (4,403 genes). (C) Based on the ChIP-Seq study by Yamane et al. (44), group A and predicted group A genes show greater recruitment of AID than do either group B or C. The mean and 95% confidence intervals of AID tags per gene are shown for each group of genes. The set of 17 nonexpressed genes is shown as a negative control. Also shown separately is the set of 101 high-interest predicted group A genes. ANOVA was used to compare the number of AID tags per gene for the group A/B/C genes used to build the model, as well as for the genes predicted by the model to be in groups A/B/C, with the p value shown above each comparison.

Discussion

The mechanisms that underlie AID targeting to the Ig locus and mistargeting to non-Ig loci are poorly understood. We analyzed the sequence context of 83 non-Ig genes that are targeted by AID to varying degrees in dKO mice to define several mechanisms responsible for genome-wide AID targeting. Our data confirm the association of AID targeting with transcription and the presence of E-box motifs. E-boxes were found to increase the frequency of SHM in several systems (12, 46, 47). Our data also implicate several new factors as relevant to the targeting of AID. In particular, YY1 and C/EBP-β binding sites are associated with increased mutation accumulation, and these sites, together with E-box motifs, are colocated near the TSS of AID targets, perhaps forming a cis-regulatory module. YY1 is of high interest because it is active in GC B cells and was recently identified as a regulator for the GC program (39). Intriguingly, YY1 was recently shown to interact directly with AID, to influence levels of AID in the nucleus, and to enhance CSR (48). C/EBP-β mRNA and protein levels were reported to increase in abundance as B cells mature, and C/EBP-γ, the negative regulator of C/EBP-β, decreases with B cell maturity (49). In addition, a C/EBP-β site was recently identified in a mutational enhancer element in the Ig L chain of DT40 cells that is conserved in the condor and zebra finch (50). These results support a role for shared functional elements in explaining differential AID activity in non-Ig genes.

The contribution of AID hot/cold-spots and gene transcription to mutation accumulation was also evaluated. Although known AID hotspots are specifically targeted in non-Ig genes (1), the frequency of hotspots (or coldspots) in the sequenced region did not correlate with the overall gene-mutation frequency. This suggests that AID recruitment to the locus is a rate-limiting step. Gene transcription

Table I. High-interest predicted group A genes from classification tree model

| Entrez Identifier | Model Leaf | Genes |
|---|---|---|
| Oncogene | 1 | Golga5^a, Lmo4, Rab18, Rab30, Rab5c, Snrpe, Ubtf |
| Oncogene | 3 | Actb, Fus^a |
| Oncogene | 4 | Cblb^b, Mybl1, Nae1, Pfdn5, Rapgef1, Sh3kbp1, Tpd52, Vav1 |
| Tumor suppressor | 1 | Anp32b, Ccna2, Cd82, Cdk2ap2, Cdkn2c^b, Cfl1, Eif2s1, Flna, Foxp1^a, G3bp2, Gabarap, Gtf2e1, Ing3, Ltf, Mef2c, Nbr1, Nfkbia, Rbbp5, Tcf4, Uhrf1, Zbtb33 |
| Tumor suppressor | 3 | Btg2, Ddx3x, Ddx5^a, Dhx9, G3bp1, Irf1, Irf8, Klf10, Tnfaip3^b, Uvrag |
| Tumor suppressor | 4 | Anxa7, Chfr, Foxo1, Msh2^b, Ndufa13, Raf1^a, Rnf129, Rnf40, Rtn4, Sdhb^b, Sept9, Spint2, Stk4, Tes, Tusc2, Ywhah, Ywhaq, Zfp238 |
| Tumor suppressor and oncogene | 1 | Ccnd3^a, Pten^b, Rela, Rhoa |
| Tumor suppressor and oncogene | 3 | Ctnnb1^a, Cxcr4, Hif1a, Lyn, Stat3 |
| Tumor suppressor and oncogene | 4 | Mapk1, Prkar1a^a,b, Smarcb1 |
| Neither | 1 | Bcl7a^a, Brd4^a, Ep300^a,b, Eps15^a, Gnas^b, Herpud1^a, Ikzf1^b, Lasp1^a, Myh9^a, Ncoa1^a, Nin^a, Nsd1^a, Thrap3^a |
| Neither | 3 | Btg1^a |
| Neither | 4 | Cdk4^b, Chic2^a, Crebbp^a,b, Ncoa2^a, Nfe2l2^b, Numa1^a, Sept6^a, Sh3gl1^a, Tpm3^a |

Genes are listed according to the Entrez Query Identifier and by the leaf of the model which led to the gene being predicted as a strong AID target.
^a Identified in the COSMIC database as translocation type.
^b Identified in the COSMIC database as other mutation type.
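The three criteria behind the high-interest list (predicted group A, expression in the top 25% of genes in GC B cells, and a known cancer association) can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' pipeline: the function name `select_high_interest` and the record keys `symbol`, `predicted_group`, `expression`, and `cancer_associated` are hypothetical, and the cancer flag is assumed to have been derived beforehand from the CancerGenes and COSMIC annotations.

```python
def select_high_interest(genes, top_fraction=0.25):
    """Return symbols of predicted group A genes that pass both filters.

    `genes` is a list of dicts with hypothetical keys:
      'symbol', 'predicted_group' ('A', 'B', or 'C'),
      'expression' (mean GC B cell mRNA level), and
      'cancer_associated' (bool: oncogene, tumor suppressor,
      or Cancer Gene Census member).
    """
    if not genes:
        return []
    # Expression cutoff: the lowest value still inside the top fraction
    # of ALL genes, not only the predicted group A genes.
    ranked = sorted((g["expression"] for g in genes), reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    cutoff = ranked[n_top - 1]
    return [
        g["symbol"]
        for g in genes
        if g["predicted_group"] == "A"
        and g["expression"] >= cutoff
        and g["cancer_associated"]
    ]
```

Computing the cutoff over every gene before filtering by predicted group keeps the expression criterion independent of the model output, which matches the wording "top 25% of all genes".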

is also important, and some transcription is required for mutation accumulation in non-Ig genes, because nonexpressed genes do not accumulate mutations (1). Indeed, there was a small, but significant, increase in polymerase II binding among genes with the highest mutation frequencies (group A). However, there was only a weak correlation at the level of individual genes, and highly mutated genes could be identified with widely varying levels of polymerase II binding. Notably, the polymerase II–associated pause/elongation factor Spt5, a component of the DRB-sensitivity inducing complex, interacts with AID and with Ig and non-Ig targets of AID (51). Although our data indicate that total polymerase II levels do not correlate strongly with AID targeting, a better correlation might be seen upon examination of stalled polymerase II and Spt5 in GC B cells. A recent study found that levels of stalled polymerase II, but not Spt5, correlated with AID-mediated mutation of an artificial mutation reporter (52). Overall, although our analysis identifies several associations that are statistically significant when comparing group A, B, and C genes, none of these features alone is sufficient to predict the differential accumulation of mutations by individual genes.

We hypothesized that several mechanisms work together to promote AID targeting to individual genes. By combining the features that were each associated with AID targeting, we were able to construct a classification tree model that separated highly mutating genes (group A) from genes that do not mutate (group C) with an accuracy of 92%. It is instructive to consider the genes for which AID targeting is incorrectly predicted. In the training data, there are five genes that are considered major misclassifications: Pax5 exon 1b, Il21r, Fas, B2m, and Mll1. For the group A genes misclassified as group C genes (Pax5 exon 1b, Il21r, and Fas) the lack of strong evolutionary conservation between mouse and human near the TSS diminishes the ability of the model to accurately predict these genes as true group A genes. Pax5 exon 1b presents unique challenges to this analysis because the sequencing was focused on an alternative promoter, which may not be adhering to the same conventions of the other genes within the group. The two group C genes were predicted as group A genes for different reasons: B2m had a high peak of polymerase II near the TSS (max peak = 20), and Mll1 had a conserved YY1 and E-box that had a very low colocation score (score = 11), reflecting two sites that were overlapping and located at the TSS of the gene. Thus, many features of each gene contribute in a small way to AID targeting. In predicting AID targets, there was not a single reason underlying false positive and negative predictions, and improvements in accuracy are likely to come from the inclusion of additional features in the model.

The model was applied genome wide to predict AID targets. We identified a set of 101 high-interest genes that fulfilled three criteria: they are predicted to be in group A, they are highly expressed in GC B cells, and they have a known association with cancer. These targets are significantly enriched for genes identified through TC-Seq of B cells (43). They also tend to bind higher levels of AID in an ex vivo model of CSR (44). About 25% of the high-interest genes are known to undergo translocations in oncogenic settings as identified through the Cancer Gene Census. Several of these genes undergo translocations with other genes known to sustain high levels of AID targeting: Ccnd3 translocates with IgH, Bcl7a and Btg1 each translocate with Myc, and Foxp1 translocates with Pax5. Additionally, two genes, Btg1 and Raf1, were previously identified as sustaining high levels of SHM in the wild-type setting (1) but were not sequenced in dKO mice. Thus, we have high confidence in the validity of our model to predict AID-induced mutations and suggest these high-interest genes as targets for future experiments. Although it is not clear whether a common mechanism underlies genome-wide targeting of AID and targeting of AID at the Ig locus, at least some features seem to be shared. Several of the transcription factors that we implicate in genome-wide targeting have been associated with targeting at the Ig locus (46, 47, 53). In addition, we find that AID hot/cold-spots for mutation are generally shared between the Ig and non-Ig genes. Interestingly, there is a significant difference in the relative mutability of the four individual trinucleotide motifs that form the WRC hotspot (Fig. 2A). This may be the result of varying cofactors present in each unique genomic setting that slightly alter the specificity of AID to its preferred nucleotide target.

One significant difference found between AID targeting at Ig and non-Ig loci is the occurrence of clonal recruitment of AID. For a random mutation process with hot-/coldspots, the distribution of the number of mutations per sequence should fit an NB distribution (33). Indeed, we find this model to be an appropriate fit for the non-Ig genes. In contrast, the NB is a poor fit for the Jh4 intron mutation data, because this model significantly underestimates the frequency of unmutated sequences, given the mutation distribution among sequences that do acquire mutations. Instead, these data can be fit by a ZI-NB distribution, which separately models the process of mutation accumulation from the induction of AID activity. This suggests that these sequences from the Jh4 intron were derived from two subsets of cells: one subset in which AID had been effectively recruited to the gene being sequenced, resulting in a high degree of mutation, and a second subset in which AID had not (yet) been recruited. One caveat of this analysis is that we may not observe enough mutations in the non-Ig genes to be able to detect whether clonal recruitment of AID is occurring, and further sequencing is necessary to increase the number of observed mutations in each non-Ig gene. Overall, these results provide additional support for the idea that AID is clonally recruited to the Ig loci but that clonal recruitment is not a property of genome-wide AID targeting.

The evidence presented in this article supporting clonal recruitment of AID to the Ig genes has potential implications with respect to Ab diversity during the affinity-maturation process. Current thought predisposes B cells with receptors of initial high affinity to the Ag as having a competitive advantage for expansion within GCs (54). The addition of clonal recruitment may allow for the B cells with receptors of lower initial affinity, if able to recruit AID early, to go through enough rounds of mutation, selection, and expansion to yield mutated receptors with equal or higher affinity than the unmutated higher-affinity cells. These lower initial affinity, but early AID-recruiting, B cells could then be competitive in the GC environment. Thus, depending on the time of clonal recruitment of AID in B cells, the expected competitive advantage of B cells with high-affinity receptors may be diminished enough to allow for a broader spectrum of Abs to be produced from B cells with a wider range of initial affinities to the Ag.

A limitation of this study is the focus on the promoter region when analyzing transcription factor binding site associations. At the Ig locus, enhancer regions were shown to play a role in SHM that is separate from influencing the rate of transcription (12, 50, 55). Nevertheless, through this approach we identified three important DNA elements that can effectively differentiate group A and C genes. It is likely that future work incorporating additional elements in the enhancer regions will improve these predictions and may also distinguish group B genes, which were not a major focus of this work. However, there are still many problems to be worked out in the identification of enhancer boundaries and their association(s) with particular genes.

In conclusion, we identified several mechanisms that are significantly associated with AID targeting to non-Ig genes, including polymerase II binding along with the presence of E-boxes, YY1, and C/EBP-β binding sites. A classification model integrating these

features captures ∼55% of highly targeted (group A) genes with the benefit of being specific for AID targets. This model was used to predict AID targeting genome wide, and a set of 101 additional high-interest AID targets was identified for further experimental study.

Acknowledgments

We thank Dustin Schones (Department of Cancer Biology, Beckman Research Institute, City of Hope, Duarte, CA), Kairong Cui (Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD), and Keji Zhao (Systems Biology Center, National Heart, Lung, and Blood Institute, National Institutes of Health) for expertise, sequencing, and read mapping of the RNA polymerase II ChIP-Seq experiment and Annette Molinaro (Departments of Neurological Surgery and of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA) for guidance in developing the classification tree model.

Disclosures

The authors have no financial conflicts of interest.

References

1. Liu, M., J. L. Duke, D. J. Richter, C. G. Vinuesa, C. C. Goodnow, S. H. Kleinstein, and D. G. Schatz. 2008. Two levels of protection for the B cell genome during somatic hypermutation. Nature 451: 841–845.
2. Rada, C., M. R. Ehrenstein, M. S. Neuberger, and C. Milstein. 1998. Hot spot focusing of somatic hypermutation in MSH2-deficient mice suggests two stages of mutational targeting. Immunity 9: 135–141.
3. Peled, J. U., F. L. Kuang, M. D. Iglesias-Ussel, S. Roa, S. L. Kalis, M. F. Goodman, and M. D. Scharff. 2008. The biochemistry of somatic hypermutation. Annu. Rev. Immunol. 26: 481–511.
4. Rada, C., J. M. Di Noia, and M. S. Neuberger. 2004. Mismatch recognition and uracil excision provide complementary paths to both Ig switching and the A/T-focused phase of somatic mutation. Mol. Cell 16: 163–171.
5. Pham, P., R. Bransteitter, J. Petruska, and M. F. Goodman. 2003. Processive AID-catalysed cytosine deamination on single-stranded DNA simulates somatic hypermutation. Nature 424: 103–107.
6. Bransteitter, R., P. Pham, P. Calabrese, and M. F. Goodman. 2004. Biochemical analysis of hypermutational targeting by wild type and mutant activation-induced cytidine deaminase. J. Biol. Chem. 279: 51612–51621.
7. Goyenechea, B., N. Klix, J. Yélamos, G. T. Williams, A. Riddell, M. S. Neuberger, and C. Milstein. 1997. Cells strongly expressing Ig(kappa) transgenes show clonal recruitment of hypermutation: a role for both MAR and the enhancers. EMBO J. 16: 3987–3994.
8. Yu, D., M. C. Cook, D.-M. Shin, D. G. Silva, J. Marshall, K.-M. Toellner, W. L. Havran, P. Caroni, M. P. Cooke, H. C. Morse, et al. 2008. Axon growth and guidance genes identify T-dependent germinal centre B cells. Immunol. Cell Biol. 86: 3–14.
9. Anderson, S. M., A. Khalil, M. Uduman, U. Hershberg, Y. Louzoun, A. M. Haberman, S. H. Kleinstein, and M. J. Shlomchik. 2009. Taking advantage: high-affinity B cells in the germinal center have lower death rates, but similar rates of division, compared to low-affinity cells. J. Immunol. 183: 7314–7325.
10. Klein, U., Y. Tu, G. A. Stolovitzky, J. L. Keller, J. Haddad, Jr., V. Miljkovic, G. Cattoretti, A. Califano, and R. Dalla-Favera. 2003. Transcriptional analysis of the B cell germinal center reaction. Proc. Natl. Acad. Sci. USA 100: 2639–2644.
11. Alizadeh, A., M. Eisen, R. E. Davis, C. Ma, H. Sabet, T. Tran, J. I. Powell, L. Yang, G. E. Marti, D. T. Moore, et al. 1999. The lymphochip: a specialized cDNA microarray for the genomic-scale analysis of gene expression in normal and malignant lymphocytes. Cold Spring Harb. Symp. Quant. Biol. 64: 71–78.
12. Odegard, V. H., and D. G. Schatz. 2006. Targeting of somatic hypermutation. Nat. Rev. Immunol. 6: 573–583.
13. Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin, L. Linton, and E. S. Lander. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513–516.
14. Tomayko, M. M., S. M. Anderson, C. E. Brayton, S. Sadanand, N. C. Steinel, T. W. Behrens, and M. J. Shlomchik. 2008. Systematic comparison of gene expression between murine memory and naive B cells demonstrates that memory B cells have unique signaling capabilities. J. Immunol. 181: 27–38.
15. Affymetrix, Inc. 2004. Eukaryotic sample and array processing. In GeneChip Expression Analysis Technical Manual, 701021 Rev. 5. Affymetrix, Inc., Santa Clara, CA, 2.1.3–2.3.18.
16. Pruitt, K. D. 2004. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33: D501–D504.
17. Hannum, L. G., A. M. Haberman, S. M. Anderson, and M. J. Shlomchik. 2000. Germinal center initiation, variable gene region hypermutation, and mutant B cell selection without detectable immune complexes on follicular dendritic cells. J. Exp. Med. 192: 931–942.
18. Karolchik, D., A. S. Hinrichs, T. S. Furey, K. M. Roskin, C. W. Sugnet, D. Haussler, and W. J. Kent. 2004. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32: D493–D496.
19. Shapiro, G. S., K. Aviszus, J. Murphy, and L. J. Wysocki. 2002. Evolution of Ig DNA sequence to target specific base positions within codons for somatic hypermutation. J. Immunol. 168: 2302–2306.
20. Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S, 4th Ed. Springer, New York.
21. Bolker, B. M. 2008. Ecological Models and Data in R. Princeton University Press, Princeton, NJ.
22. Blanchette, M., W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708–715.
23. Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. 2002. The human genome browser at UCSC. Genome Res. 12: 996–1006.
24. Smit, A. F. A., R. Hubley, and P. Green. 2011. RepeatMasker. Available at: http://repeatmasker.org. Accessed: March 20, 2011.
25. Kel, A. E., E. Gössling, I. Reuter, E. Cheremushkin, O. V. Kel-Margoulis, and E. Wingender. 2003. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31: 3576–3579.
26. Matys, V., E. Fricke, R. Geffers, E. Gössling, M. Haubrock, R. Hehl, K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, et al. 2003. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31: 374–378.
27. Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102: 15545–15550.
28. Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn. 2007. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8: 25.
29. Strobl, C., J. Malley, and G. Tutz. 2009. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychol. Methods 14: 323–348.
30. Therneau, T. M., and E. J. Atkinson. 1997. Mayo Foundation. An introduction to recursive partitioning using the RPART routines. Available at: http://www.mayo.edu/hsr/techrpt/61.pdf. Accessed: January 4, 2011.
31. Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen. 1984. Classification and Regression Trees. Wadsworth International Group, Belmont, CA.
32. García-Martínez, J., A. Aranda, and J. E. Pérez-Ortín. 2004. Genomic run-on evaluates transcription rates for all yeast genes and identifies gene regulatory mechanisms. Mol. Cell 15: 303–313.
33. Uzzell, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172: 1089–1096.
34. Storb, U., E. L. Klotz, J. Hackett, K. Kage, G. Bozek, and T. E. Martin. 1998. A hypermutable insert in an immunoglobulin transgene contains hotspots of somatic mutation and sequences predicting highly stable structures in the RNA transcript. J. Exp. Med. 188: 689–698.
35. Zuur, A. F., E. N. Ieno, N. Walker, A. A. Saveliev, and G. M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R (Statistics for Biology and Health), 1st Ed. Springer, New York.
36. Lord, D., S. Washington, and J. Ivan. 2005. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accid. Anal. Prev. 37: 35–46.
37. Liu, R., R. C. McEachin, and D. J. States. 2003. Computationally identifying novel NF-kappa B-regulated immune genes in the human genome. Genome Res. 13: 654–661.
38. Tabach, Y., R. Brosh, Y. Buganim, A. Reiner, O. Zuk, A. Yitzhaky, M. Koudritsky, V. Rotter, and E. Domany. 2007. Wide-scale analysis of human functional transcription factor binding reveals a strong bias towards the transcription start site. PLoS One 2: e807.
39. Green, M. R., S. Monti, R. Dalla-Favera, L. Pasqualucci, N. C. Walsh, M. Schmidt-Supprian, J. L. Kutok, S. J. Rodig, D. S. Neuberg, K. Rajewsky, et al. 2011. Signatures of murine B-cell development implicate Yy1 as a regulator of the germinal center-specific program. Proc. Natl. Acad. Sci. USA 108: 2873–2878.
40. Blake, J. A., C. J. Bult, J. A. Kadin, J. E. Richardson, and J. T. Eppig. 2011. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 39: D842–D848.
41. Higgins, M. E., M. Claremont, J. E. Major, C. Sander, and A. E. Lash. 2007. CancerGenes: a gene selection resource for cancer genome projects. Nucleic Acids Res. 35: D721–D726.
42. Santarius, T., J. Shipley, D. Brewer, M. R. Stratton, and C. S. Cooper. 2010. A census of amplified and overexpressed human cancer genes. Nat. Rev. Cancer 10: 59–64.
43. Klein, I. A., W. Resch, M. Jankovic, T. Oliveira, A. Yamane, H. Nakahashi, M. Di Virgilio, A. Bothmer, A. Nussenzweig, D. F. Robbiani, et al. 2011. Translocation-capture sequencing reveals the extent and nature of chromosomal rearrangements in B lymphocytes. Cell 147: 95–106.
44. Yamane, A., W. Resch, N. Kuo, S. Kuchen, Z. Li, H. Sun, D. F. Robbiani, K. McBride, M. C. Nussenzweig, and R. Casellas. 2011. Deep-sequencing identification of the genomic targets of the cytidine deaminase AID and its cofactor RPA in B lymphocytes. Nat. Immunol. 12: 62–69.
45. Teng, G., and F. N. Papavasiliou. 2007. Immunoglobulin somatic hypermutation. Annu. Rev. Genet. 41: 107–120.

46. Michael, N., H. M. Shen, S. Longerich, N. Kim, A. Longacre, and U. Storb. 2003. The E box motif CAGGTG enhances somatic hypermutation without enhancing transcription. Immunity 19: 235–242.
47. Tanaka, A., H. M. Shen, S. Ratnam, P. Kodgire, and U. Storb. 2010. Attracting AID to targets of somatic hypermutation. J. Exp. Med. 207: 405–415.
48. Zaprazna, K., and M. L. Atchison. 2012. YY1 controls immunoglobulin class switch recombination and nuclear activation-induced deaminase levels. Mol. Cell. Biol. 32: 1542–1554.
49. Cooper, C. L., A. L. Berrier, C. Roman, and K. L. Calame. 1994. Limited expression of C/EBP family proteins during B lymphocyte development. Negative regulator Ig/EBP predominates early and activator NF-IL-6 is induced later. J. Immunol. 153: 5049–5058.
50. Kothapalli, N. R., K. M. Collura, D. D. Norton, and S. D. Fugmann. 2011. Separation of mutational and transcriptional enhancers in Ig genes. J. Immunol. 187: 3247–3255.
51. Pavri, R., A. Gazumyan, M. Jankovic, M. Di Virgilio, I. Klein, C. Ansarah-Sobrinho, W. Resch, A. Yamane, B. R. San-Martin, V. Barreto, et al. 2010. Activation-induced cytidine deaminase targets DNA at sites of RNA polymerase II stalling by interaction with Spt5. Cell 143: 122–133.
52. Kohler, K. M., J. J. McDonald, J. L. Duke, H. Arakawa, S. Tan, S. H. Kleinstein, J.-M. Buerstedde, and D. G. Schatz. 2012. Identification of core DNA elements that target somatic hypermutation. J. Immunol. 189: 5314–5326.
53. Liu, M., and D. G. Schatz. 2009. Balancing AID and DNA repair during somatic hypermutation. Trends Immunol. 30: 173–181.
54. Shlomchik, M. J., and F. Weisel. 2012. Germinal center selection and the development of memory B and plasma cells. Immunol. Rev. 247: 52–63.
55. Blagodatski, A., V. Batrak, S. Schmidl, U. Schoetz, R. B. Caldwell, H. Arakawa, and J.-M. Buerstedde. 2009. A cis-acting diversification activator both necessary and sufficient for AID-mediated hypermutation. PLoS Genet. 5: e1000332.