A Model of Targeting in Mice Based on High-Throughput Ig Sequencing Data

This information is current as Ang Cui, Roberto Di Niro, Jason A. Vander Heiden, Adrian of October 1, 2021. W. Briggs, Kris Adams, Tamara Gilbert, Kevin C. O'Connor, Francois Vigneault, Mark J. Shlomchik and Steven H. Kleinstein J Immunol published online 5 October 2016 http://www.jimmunol.org/content/early/2016/10/05/jimmun ol.1502263 Downloaded from

Supplementary http://www.jimmunol.org/content/suppl/2016/10/05/jimmunol.150226

Material 3.DCSupplemental http://www.jimmunol.org/

Why The JI? Submit online.

• Rapid Reviews! 30 days* from submission to initial decision

• No Triage! Every submission reviewed by practicing scientists

• Fast Publication! 4 weeks from acceptance to publication by guest on October 1, 2021

*average

Subscription Information about subscribing to The Journal of Immunology is online at: http://jimmunol.org/subscription Permissions Submit copyright permission requests at: http://www.aai.org/About/Publications/JI/copyright.html Email Alerts Receive free email-alerts when new articles cite this article. Sign up at: http://jimmunol.org/alerts

The Journal of Immunology is published twice each month by The American Association of Immunologists, Inc., 1451 Rockville Pike, Suite 650, Rockville, MD 20852 Copyright © 2016 by The American Association of Immunologists, Inc. All rights reserved. Print ISSN: 0022-1767 Online ISSN: 1550-6606. Published October 5, 2016, doi:10.4049/jimmunol.1502263 The Journal of Immunology

A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Immunoglobulin Sequencing Data

Ang Cui,* Roberto Di Niro,† Jason A. Vander Heiden,* Adrian W. Briggs,‡ Kris Adams,‡ Tamara Gilbert,‡ Kevin C. O’Connor,x,{ Francois Vigneault,‡ Mark J. Shlomchik,† and Steven H. Kleinstein*,{,‖

Analyses of somatic hypermutation (SHM) patterns in B cell Ig sequences have important basic science and clinical applications, but they are often confounded by the intrinsic biases of SHM targeting on specific DNA motifs (i.e., hot and cold spots). Modeling these biases has been hindered by the difficulty in identifying mutated Ig sequences in vivo in the absence of selection pressures, which skew the observed mutation patterns. To generate a large number of unselected mutations, we immunized B1-8 H chain transgenic mice with nitrophenyl to stimulate nitrophenyl-specific l+ germinal center B cells and sequenced the unexpressed k L chains using k next-generation methods. Most of these sequences had out-of-frame junctions and were presumably uninfluenced by selection. Downloaded from Despite being nonfunctionally rearranged, they were targeted by SHM and displayed a higher mutation frequency than functional sequences. We used 39,173 mutations to construct a quantitative SHM targeting model. The model showed targeting biases that were consistent with classic hot and cold spots, yet revealed additional highly mutable motifs. We observed comparable targeting for functional and nonfunctional sequences, suggesting similar biological processes operate at both loci. However, we observed species- and chain-specific targeting patterns, demonstrating the need for multiple SHM targeting models. Interestingly, the

targeting of C/G bases and the frequency of transition mutations at C/G bases was higher in mice compared with , http://www.jimmunol.org/ suggesting lower levels of DNA repair activity in mice. Our models of SHM targeting provide insights into the SHM process and support future analyses of mutation patterns. The Journal of Immunology, 2016, 197: 000–000.

omatic hypermutation (SHM) is a process that diversifies C/G base pair (phase I) or at neighboring base pairs (phase II) (2). BCRs by introducing point mutations into Ig at a high Although stochastic, SHM is biased by the local DNA sequence S rate (1). SHM is initiated when activation-induced cytidine context and preferentially introduces mutations at specific DNA deaminase (AID) is recruited to the Ig locus and converts cytosine motifs (hot spots) while avoiding others (cold spots) (3–5). SHM (C) to uracil (U). Error-prone DNA repair pathways are then ac- plays a crucial role in the B cell immune response and immune- by guest on October 1, 2021 tivated, resulting in somatic mutations either at the AID-targeted mediated disorders. The analysis of mutation patterns and distri- butions has been widely used to infer selective processes involved in such responses (6). However, the analysis of SHM patterns can *Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511; †Department of Immunology, University of be confounded by the intrinsic biases of SHM targeting, driving Pittsburgh, Pittsburgh, PA 15213; ‡AbVitro Inc., Boston, MA 02210; xDepartment of the need for accurate characterization of targeting patterns that { Neurology, Yale School of Medicine, New Haven, CT 06511; and Transla- reflect inherent SHM properties in the absence of Ag-driven tional Immunology Program, Yale School of Medicine, New Haven, CT 06511; and ‖Departments of Pathology and Immunobiology, Yale School of Medicine, New selection (7, 8). Haven, CT 06511 The SHM process can be quantitatively characterized by a ORCIDs: 0000-0002-1087-8568 (A.C.); 0000-0002-1474-310X (J.A.V.H.); 0000- targeting model, consisting of a mutability model, which specifies 0002-7056-419X (K.C.O.); 0000-0003-4957-1544 (S.H.K.). the relative mutation frequency of DNA microsequence motifs, and Received for publication October 21, 2015. Accepted for publication August 22, a substitution model, which describes the specific nucleotide 2016. substitution frequencies at the mutated sites (9–13). These models This work was supported by the National Institutes of Health (Grants R03AI092379 can serve as a background distribution for statistical analysis of and R01AI104739 to S.H.K., Grant R01AI43603 to M.J.S.), the Myasthenia Gravis Foundation of America, the National Institute of Allergy and Infectious Diseases mutation patterns in Ig sequences, improving the ability to detect (Awards R01AI114780 and U19 AI056363 to K.C.O.), the Natural Sciences and deviations in SHM pathways related to diseases or identify se- Engineering Research Council of Canada (NSERC PGS-M to A.C.), and the National Library of Medicine (Grant T15LM07056 to J.A.V.H.). The Yale University Biomed- lected mutations that drive Ag specificity and affinity maturation ical High Performance Computing Center is supported by National Institutes of (7, 8). However, modeling these intrinsic biases has been limited Health Grants RR19895 and RR029676-01. by the lack of large sets of Ig sequences that have undergone SHM The content is solely the responsibility of the authors and does not necessarily in vivo in the absence of selection pressures. Previous work has represent the official views of the National Institutes of Health. focused on studying mutations in intronic regions or in Ig se- Address correspondence and reprint requests to Dr. Steven H. Kleinstein, Yale School quences that were determined to be nonfunctional (e.g., because of of Medicine, 300 George Street, Suite 505, New Haven, CT 06511. E-mail address: [email protected] an out-of-frame junction) (9–11, 13). However, intronic regions The online version of this article contains supplemental material. have limited diversity and a different base composition from ex- Abbreviations used in this article: GC, germinal center; NP, (4-hydroxy-3-nitro- onic V(D)J regions, and some mutations in nonfunctional se- phenyl)acetyl; RS, replacement and silent; RS5NF, RS, 5-mer, nonfunctional; S5F, quences may be subject to selection pressures if the sequences silent, 5-mer, functional; S5NF, silent, 5-mer, nonfunctional; SHM, somatic hyper- were rendered nonfunctional during the affinity maturation mutation; UID, unique identifier; UNG, uracil-DNA glycosylase. process. Another strategy to determine targeting models in- Copyright Ó 2016 by The American Association of Immunologists, Inc. 0022-1767/16/$30.00 volves using mutations that do not alter the amino acid sequence

www.jimmunol.org/cgi/doi/10.4049/jimmunol.1502263 2 MOUSE SOMATIC HYPERMUTATION TARGETING MODEL

(i.e., silent or synonymous mutations), which presumably are not barcoded on the 59 end in a template switchlike reaction using an oligo- subject to selection pressures. We previously used this strategy to nucleotide harboring a semidegenerate 17-nt-long unique identifier (UID) construct the human silent, 5-mer, functional (S5F) SHM targeting barcode and a universal flanking sequence. The cDNA was purified using . streptavidin bead pulldown and then subjected to a first round of 12 cycles model from 800,000 mutations in functional Ig sequences (12). of PCR. The human samples used a human C region primer mix composed Despite the high resolution of this S5F model, the mutability of of Ig H chain C region, Ig k L chain C region, and Ig l L chain C region some DNA motifs could not be estimated directly because they do flanked by another universal sequence (19–21), and the mouse samples k l not yield silent mutations. used a mouse primer mix composed of Ig L chain C region and Ig L chain C region (21). A second round of 12 cycles of PCR was then Modeling and analysis of SHM would also benefit from a clear performed using universal overhang from both ends to add the required understanding of whether equivalent models can be used across Illumina sequencing and clustering adapters. The samples were then L and H chains and species (mouse and human). L and H chain pooled at equimolar ratio and subjected to sequencing on the Illumina genes are located on different chromosomes; thus, different reg- MiSeq platform at 325 3 275 bp. Raw mouse and human sequences are ulatory elements and epigenetic effects may impact microsequence available via the Sequence Read Archive under BioProject accession numbers PRJNA283640 and PRJNA338795, respectively. Human samples targeting specificity (14–17). Previously, Shapiro et al. (9) reported included in this study were restricted to BioSample accession numbers comparable dinucleotide and trinucleotide mutabilities between SAMN05570585, SAMN05570589, SAMN05570593, SAMN05570596, L and H chains, and between mouse and human sequences. SAMN05570598, SAMN05570602, SAMN05570604. However, the small sequence database and short motif compari- Sequencing data preprocessing sons limited the resolution of this analysis. In this study, we use a novel experimental strategy based on the Raw sequences were processed using the Repertoire Sequencing Toolkit (pRESTO) (22) as previously described (12). The script containing the (4-hydroxy-3-nitrophenyl)acetyl (NP) immunization system in

detailed procedures is in the supplemental material. In summary, we re- Downloaded from mice, combined with next-generation sequencing technologies, to moved low-quality sequences, annotated sequences by UID (corresponding generate a large set of unselected mutations from nonfunctionally to the same mRNA molecule) and isotype-specific primer, generated a rearranged Ig k-chain sequences. For comparison, we also se- consensus sequence for each UID, and assembled pair-end reads. To obtain high-fidelity sequences, we constrained the reads being used for mutation quenced L and H chains from healthy human subjects. Using a analysis to appear at least twice. Furthermore, a sliding window approach modified version of the 5-mer motif framework previously de- was used to eliminate sequences with $6 mutations in 10 consecutive veloped (12), we have used this novel database of mutations nucleotides. IMGT was used to assign germline V(D)J segments and de- spanning the full V to develop a next-generation targeting termine functionality (23). Mutations were defined as nucleotides that were http://www.jimmunol.org/ model. We found that targeting patterns on nonfunctional se- different from the inferred germline sequence. Clustering of sequences into clonal groups was carried out using the Change-O command line tool (24). quences resemble those on functional sequences, but revealed H chain sequences were assigned into clonal groups on the basis of unexpected chain- and species-specific differences in SHM tar- identical V gene, J gene, and junction length, with a weighted intraclonal geting. These models provide insights into the SHM process and distance threshold of 10 using the substitution probabilities previously support future analyses of SHM patterns. described (25). L chain sequences were assigned into clonal groups on the basis of identical V gene, J gene, and junction sequence. Materials and Methods Estimation of the sequencing error rate after quality control

mIgM Tg mouse data set: cell sorting The empirical sequencing error rate for each sequence was computed by by guest on October 1, 2021 Four mIgM Tg JHD2/2 BALB/c strain female mice (18), 6–10 wk old, comparing the number of mismatched nucleotides with the consensus se- were immunized with NP-CGG in alum adjuvant. Twenty-eight days after quence in each UID group. The UID consensus sequence was obtained by immunization the mice were sacrificed and splenocytes were harvested. majority rule at each nucleotide base in each UID group. The error rate of Germinal center (GC) cells (live, B220+, NIP+, CD95+, CD382) were UID sequences was estimated by subtracting the probability that each base FACS-sorted, leaving 64,000, 87,000, 120,000, and 107,000 cells, respec- is correct (defined by more than half of the sequences at each position being tively. As a control, non-GC B cells expressing k L chain with a B220+, correct) from 1. The actual error is expected to be lower because sequencing CD952,CD38+, l2 phenotype were FACS-sorted from two nonimmunized may result in distinct errors, in which case less than half of correct bases at a mice, leaving 500,000 and 400,000 cells, respectively. RNA was isolated position would result in the correct UID consensus. from sorted cells using the RNeasy Mini kit (Qiagen) per the manufacturer’s Mutability calculation for replacement and silent models protocol and stored at 280˚C. The construction of silent (S) 5-mer models was previously described (12). Human data set: cell sorting The replacement and silent (RS) model was constructed using the same method except that both silent and replacement mutations were included. Four healthy donors were recruited for the assessment of B cell Ab rep- ertoire analysis. Specimens were collected after informed written consent Inference for substitution and mutability frequencies was obtained, under a protocol approved by the Human Research Protection Program at Yale School of Medicine. PBMCs were isolated from whole The SHazaM R package (version 0.1.1) was used to compute the targeting blood by Ficoll-Paque gradient centrifugation. B cells were first enriched models (12, 24). For substitution models, the values of 5-mers with ,20 with CD20 MicroBeads (Miltenyi Biotec); then memory (CD27+) B cells mutations were inferred. Inference was performed by first averaging sub- were isolated through FACS on a FACSAria flow cytometer (BD) into bulk stitution frequencies of all mutations in 5-mers matching the same inner collections before RNA isolation and sequencing. For cytometric analysis 3-mers. If the number of mutations was still insufficient (,20), substitu- and sorting, enriched B cells were stained with PerCP-Cy5.5 anti-human tion frequencies of all mutations in 5-mers matching the same middle base CD27 (clone O323), Pacific Blue anti-human CD19 (clone HIB 19) (both were averaged. In the mutability models, 5-mers with ,500 mutations in from BioLegend), and with FITC anti-human IgM (clone G20-127; BD). sequences containing the mutated 5-mer were inferred. If the mutated base For this study, memory and bulk unsorted B cell populations for subjects was A or T, inference was done by averaging the mutability of all 5-mers HD09, HD10, and HD13 were analyzed; only bulk unsorted B cells were sharing the same inner 3-mer. For the base C, inference was based on analyzed for subject HD07. 5-mers sharing the same upstream 3-mer (the mutated base and two nu- cleotides upstream of it), whereas for the base G, inference was based on Sequencing 5-mers sharing the same downstream 3-mer (the mutated base and two nucleotides downstream of it). RNA from sorted memory B cells was extracted via Qiagen RNeasy kit as per manufacturer’s recommendation. A total of 250 ng of RNA input was Construction of sequencing-error models used for each sample. Immune sequencing library construction for human samples was performed as previously described using human VH, VL-l Sequencing-error models were constructed for the purposes of comparing and VL-k constant region primer mix in each reaction (19), and for mouse with targeting models. Combined PCR and MiSeq sequencing error samples was performed using mouse VL-l and VL-k primer mix (21). In (hereafter referred to as sequencing error) profiles were generated using brief, RNA was reverse transcribed using biotin-dT oligonucleotide and sequences from large UID groups (.20 sequences) in mice. A consensus The Journal of Immunology 3 sequence was constructed per group through majority rule at each nucle- they contained an out-of-frame junction (92%) or a stop codon (8%) otide position. Within each UID group, the deviations from consensus (Supplemental Fig. 1). The small fraction of in-frame k Lchain sequences were counted as sequencing errors and used to construct the sequences may be derived from B cells reacting to the NP carrier error models. Targeting models for these errors were built the same way as the 5-mer RS targeting model, except that all sequences were treated in- (CGG) that contaminated the sort, or may represent cells dependently (instead of collapsing mutations for each clone). 5-mer motifs carrying both a productive k-andl-chain (28, 29). with ,20 mutations in the center base or ,500 mutations in any base in Previous data suggested that the nonfunctional k sequences sequences containing the mutated motif were excluded. Samples that did would accumulate SHMs because they were derived from GC not have any 5-mers matching the criteria were excluded from the study. B cells, where the SHM machinery is active, and where Ig is Principal component analysis on targeting models clearly being transcribed (because the sequencing was carried out For each pair of samples, the distance was defined as follows: 1 2 Spearman from an mRNA template) (9, 13, 30). To determine the extent of correlation coefficient between any two targeting models, considering only SHM, we used IMGT/HighV-QUEST to identify and align the the motifs observed in both models. The R function prcomp was used to germline V and J gene segments for each of the observed se- project the distance matrix onto the first two principal components. quences. We found that the nonfunctional k L chain sequences in Mutation simulation NP-binding GC B cells had an average mutation frequency of 1.49% in the V region. Interestingly, the nonfunctional k se- Ten simulated repertoires were generated for each SHM model (mouse quences were consistently more mutated than functional k se- L chain, human L chain, and human H chain). For each repertoire, 2000 , 2 sequences were simulated with five mutations per sequence, roughly the quences, which had a mutation frequency of 0.95% (p 1e 15) same mutation levels as observed in the GC cells from mice analyzed in this (Fig. 1). The decreased mutation frequency observed in functional study. The mutations were added to the sequences stochastically using a sequences was not due to the use of V segments with fewer hot- mutation algorithm previously described (8). All the simulations were spot motifs, because a similar pattern was observed when com- Downloaded from initiated with the germline Mus mus IGKV1-135*01, the most abundant V segment among nonfunctional chains in the NP-immunized mice. The paring mutation frequencies in functional and nonfunctional simulation results were compared across different SHM targeting models sequences using the same V segment. The l L chains sequenced by calculating the number of different mutations (defined as mutations at from the same B cell subsets had a similar mutation frequency as different positions or mutations to different bases) between repertoires the functional k-chain sequences. In contrast, k-chain sequences divided by the total number of mutations. from non-GC cells sorted from two unimmunized mice were significantly less mutated than k-chain sequences from GC cells http://www.jimmunol.org/ Results (p , 1e215). The mutation frequencies in all of these populations Nonfunctional rearrangements of the k L chain accumulate were higher than the estimated sequencing error rate after cor- SHMs rection using the molecular barcodes (estimated sequencing error To generate a large number of unselected mutations necessary to rate ,1e29). To identify independent SHM events, we partitioned precisely model SHM targeting, we carried out next-generation the sequences into groups that were likely to be clonally related repertoire sequencing (Rep-Seq) (26) in NP-immunized mIgM Tg (see Materials and Methods), and identified the set of distinct mice (18). These mice express a transgenic H chain (B1-8) that mutations from each clone. In total, we obtained 39,173 inde- binds the NP Ag when paired with l, but not k, L chains. We pendent and unselected mutations from NP-binding nonfunctional by guest on October 1, 2021 reasoned that NP-binding B cells in these mice would likely carry k-chain sequences. nonfunctionally rearranged k-chains, because l-chains are gener- ally used only after failure to generate a productive k-chain rear- SHM targeting at the nonfunctional locus resembles the rangement during B cell development (27). Next-generation functional locus sequencing of k L chains of sorted GC NP-binding B cells from Similar to previous models (12), we defined the mutability of a four mice 28 d postimmunization with NP produced 25,341 distinct DNA motif as the probability of the central base in the motif k L chain sequences (Table I). Consistent with our expectations, being targeted by SHM relative to all other motifs (described in 20,194 sequences (80%) were identified as nonfunctional because Materials and Methods). To capture the classic SHM hot spots

Table I. Mouse sequencing data

No. of Unique Sequences No. of Clones No. of Mutations

Functional Nonfunctional Sample Chain Processed Functional Total Functional Nonfunctionala Functional Nonfunctionala Silent Silenta NP1 k 2,526 534 1,163 232 942 739 4,225 183 1,081 l 25,755 21,332 1,298 516 794 2,728 3,691 718 703 NP2 k 8,703 1,555 2,671 464 2,233 1,546 12,489 370 3,347 l 84,141 70,990 2,552 930 1,653 4,923 7,874 1,440 1,566 NP3 k 10,006 2,226 3,245 656 2,626 2,305 14,992 565 3,758 l 89,976 76,406 2,720 1,045 1,711 5,528 8,508 1,590 1,767 NP4 k 4,106 832 1,813 373 1,459 1,095 7,467 252 1,964 l 50,587 42,523 1,923 658 1,281 3,639 6,214 1,017 1,230 Total k 25,341 5,147 8,892 1,725 7,260 5,685 39,173 1,370 10,150 l 250,459 211,251 8,493 3,149 5,439 16,818 26,287 4,765 5,266

C1 k 89,707 76,245 15,688 9,905 5,989 18,984 7,843 3,871 879 l 3,844 590 798 138 662 202 1,293 32 178 C2 k 26,041 22,814 5,797 3,853 2,026 6,900 2,539 1,331 266 l 1,240 183 305 44 262 62 493 9 65 Total k 115,748 99,059 21,485 13,758 8,015 25,884 10,382 5,202 1,145 l 5,084 773 1,103 182 924 264 1,786 41 243 aThe number of nonfunctional sequences is the number of processed sequences minus the number of functional sequences. 4 MOUSE SOMATIC HYPERMUTATION TARGETING MODEL

both replacement (R) and silent (S) mutations because neither type of mutation should be subject to selection pressures in nonfunc- tional k-chain sequences in l+ cells. Using the notation developed in our previous work, we refer to this model as RS, 5-mer, non- functional (RS5NF) (Fig. 2). The RS5NF targeting model con- tained 825 directly observed motifs (Fig. 2A) and 199 unobserved motifs, whose values were inferred from similar motifs (Fig. 2B). We built a separate targeting model using data from each mouse and confirmed that the mutabilities were highly similar among them (Supplemental Fig. 2A). As expected, classic hot spots generally had higher mutabilities, whereas classic cold spots had lower mutabilities. Furthermore, we observed potential new strand-symmetric hot-spot motifs, CRCY/RGYG (Fig. 2, orange FIGURE 1. Nonfunctional sequences accumulate a high frequency of bars) and ATCT/AGAT (Fig. 2, magenta bars), whose mean mu- somatic mutations. k L chains were sequenced from four NP-immunized tability levels were more than twice as large as the mean muta- (NP1-4) or two control mice (C1-2) (Table I). The mutation frequency of bility level of the classic WA/TW hot spot. These SHM targeting the V segment in functional (shaded boxes) and nonfunctional (open patterns remained stable over the course of B cell maturation boxes) sequences was calculated by comparing each sequence with its because models constructed from sequences with ,1% mutation germline segment determined by IMGT/HighV-Quest. Boxes indicate the frequency were well correlated with those constructed from se- mean mutation frequency and 5–95% quantiles. Mean mutation frequen- $ Downloaded from cies for comparable l L chains are indicated by X. quences with 1% mutation frequency; the strength of the cor- relations between highly and lowly mutated sequences (average (WRC/GYW and WA/TW, where W = [A, T], R = [G, A], Y = pairwise Pearson r = 0.61) was comparable with pairwise corre- [C, T]) (2) on both the forward and reverse strands, we modeled lations for sequences with ,1% mutation frequency (average targeting patterns based on 5-mer motifs. Furthermore, we used Pearson r = 0.58). http://www.jimmunol.org/ by guest on October 1, 2021

FIGURE 2. SHM targeting model for murine L chains. An RS5NF model was constructed to estimate the mutability of 5-mer motifs using both R and S mutations in nonfunctional k sequences from NP-binding cells in four immunized mice (NP1-4 in Table I). Hedgehog plots depict the relative targeting of 5-mer motifs centered on the bases A, T, C, and G, with each center base in an individual circle. (A) The SHM targeting model in which the mutability values were measured directly from Ig sequencing data. (B) The complete SHM targeting model with inferred values for 5-mer motifs that could not be directly estimated. Classic hot spots are colored in red (WRC/GYW) and green (WA/TW), whereas cold spots are colored in blue (SYC/GRS). Potential novel hot spots are colored in orange (CRCY/RGYG, R = [G,A], Y = [C,T]) and magenta (ATCT/AGAT). All other motifs are shown in grey. The Journal of Immunology 5

To test whether SHM targeting at the nonfunctional locus was model and the human S5F model (Pearson r = 0.63, Spearman r = similar to the functional locus, we compared targeting models 0.63; Fig. 4A). In both species, classic hot spots generally constructed from functional and nonfunctional sequences. Because exhibited higher mutabilities, whereas classic cold spots exhibited selection pressures may influence mutation patterns in functional lower mutabilities. Despite this similarity, we observed a global sequences, we constructed a model based only on silent mutations. shift in mutabilities at certain bases. In the mouse model, 5-mer For this comparison, the model for nonfunctional sequences was motifs with C or G (hereafter written as C/G) as the central base also based on silent mutations to ensure comparability. Specifically, had much higher mutabilities compared with the same motifs in we built a silent, 5-mer, nonfunctional (S5NF) model using silent the human model across the 299 motifs observed in both models, mutations from nonfunctional k sequences from GC cells in im- whereas 5-mer motifs centered at A or T (hereafter written as A/T) munized mice, and an S5F model using silent mutations from had much lower mutabilities (Fig. 4B). When considering the functional k sequences from both GC B cells in immunized mice mutability of motifs centered on each base separately, the muta- and non-GC B cells in control mice. The mutability models for bilities were highly consistent (Supplemental Fig. 3), suggesting nonfunctional (S5NF) and functional (S5F) sequences were highly that the difference between mouse and human SHM targeting correlated (Pearson r = 0.86 and Spearman r = 0.58 over 250 exists mainly in the overall targeting frequencies of C/G verses common 5-mer motifs), consistent with the notion that the same A/T bases. The models adjust for differences in CG contents be- mutational pathways are operating at both loci in this murine tween mice and humans by normalizing mutabilities using fre- system (Fig. 3A). quencies of microsequence motifs in the germline sequences. To To test whether SHM targeting was also consistent across ensure that the observed species-specific difference was not due to functional and nonfunctional loci in humans, we recruited four tissue-specific differences (mouse B cells were collected from the healthy human subjects and sequenced both their Ig H and L chains. spleen, whereas human B cells were from PBMCs), we examined Downloaded from Overall, we collected 32,260 unique H chain sequences and an independent data set derived from lymph node B cells from 111,625 unique L chain sequences (Table II). Comparison of S5F four multiple sclerosis subjects (20). The largest mutability dif- targeting models constructed independently for each subject ference between species (i.e., mouse versus human) was five times showed that targeting patterns were highly conserved across in- greater than the largest difference between tissues (i.e., blood dividuals (Supplemental Fig. 2B, 2C), consistent with previous versus lymph node) in the same species (data not shown), sug- observations (12). We thus agglomerated mutations from all in- gesting that the large increase observed in C/G targeting in mice http://www.jimmunol.org/ dividuals to build L chain S5F and S5NF targeting models, using compared with humans was likely due to a species-dependent functional and nonfunctional (defined by out-of-frame junctions) mechanistic difference in SHM. sequences, respectively. The relative mutability of DNA motifs in The increase in C/G (verses A/T) targeting in the mouse model the nonfunctional and functional sequences was highly correlated suggested a relative decrease in the recruitment of the MSH2/ (Pearson r = 0.93, Spearman r = 0.88; Fig. 3B). Similar results MSH6 DNA mismatch repair pathway that drives mutation were obtained for the human H chain sequences (Pearson r = 0.93, spreading from the original AID-induced lesion, thus leading to a Spearman r = 0.86; Fig. 3C). Overall, these results suggest the shift in the balance toward phase I (versus phase II) SHM. To test SHM targeting mechanism is highly conserved at the functional whether there was also a decreased efficiency of recruiting uracil- by guest on October 1, 2021 and nonfunctional loci in both the human and the mouse systems. DNA glycosylase (UNG), the base excision repair enzyme asso- ciated with phase Ib of SHM, we analyzed the pattern of nucleotide C/G bases are targeted more frequently and exhibit a higher substitution frequencies computed for each base by aggregating all transition frequency in mice mutations at positions where only silent mutations were possible To understand whether SHM targeting was species specific, we (Table III). We reasoned that decreased recruitment of UNG compared the mouse and human L chain mutability models. would lead to an increase in AID-induced lesions being resolved Overall, there was a high correlation between the mouse S5NF through simple replication events, and consequently an increase in

FIGURE 3. SHM targeting is conserved at the nonfunctional locus. Functional or nonfunctional sequences were analyzed separately to estimate the mutability of each 5-mer (points) using data from (A)mousek L chains from six mice (NP1-4 and C1-2 in Table I) (or four NP-immunized mice only for nonfunctional chains), (B) human L chains from four subjects (HD samples in Table II), or (C) human H chains from four subjects (HD samples in Table II). Classic hot and cold spots are colored as specified in Fig. 2. The mutabilities were normalized such that the average mutability of the motifs observed in both loci was 1. 6 MOUSE SOMATIC HYPERMUTATION TARGETING MODEL

the frequency of transition mutations (versus transversion muta-

a tions) at C/G bases. Consistent with this hypothesis, and with previously reported transition frequencies in mice (4, 31, 32) and humans (30), the transition frequencies from C-to-T and G-to-A in 479 469 354 552 189 1,299 1,484 1,837 1,564 5,099 mice were significantly higher than those in humans (Table III) (p , 1e215, Pearson’s x2 test). This species-specific difference in transition frequency was found only at C/G bases, whereas the substitution profiles at A/T bases were similar between mice and humans. This is consistent with a shift in the balance toward phase Ia compared with phase Ib SHM in mice. Overall, these observa- tions suggest a general decrease in the efficiency of recruiting DNA

8,650 error-prone repair pathways normally associated with SHM to the 14,280 31,415 31,694 16,831 40,399 16,908 20,200 62,589 117,788 sites of AID-induced lesions in mice compared with humans. Functional Silent Nonfunctional Silent Targeting in human H chains is distinct from L chains

a Because of different chromosomal contexts, it is possible that the SHM process for H and L chains is subject to distinct regulatory

616 machineries. To examine this, we compared the human L chain S5F 2,077 6,243 6,532 1,632 7,909 1,627 2,157 6,032 model with the human H chain S5F model previously built. The L chain and the H chain models were well correlated (Pearson r = Downloaded from 0.66, Spearman r = 0.75; Fig. 5A), and the classic hot and cold spots showed the expected deviations from neutral motifs. How- ever, the cross-chain correlations were significantly weaker than 41,135 89,687 87,735 24,032 48,479 49,327 56,551 the correlations for same-chain comparisons across individuals Functional Nonfunctional (p , 1e215; Fig. 5B). Furthermore, .30% of the 5-mer motifs had mutabilities that were .2-fold different between H and http://www.jimmunol.org/

a L chains (100/323 motifs observed in both models), and .7% of the motifs were at least 4-fold different (25/323 motifs), whereas 57 418 901 664 178,389 190 217 200 ,17% of the motifs were .2-fold different between the same type of chains (when comparing separate models built for each indi- vidual). The targeting model built solely on human k L chains (instead of combining k and l) did not yield higher correlation with the mouse k-chain model, possibly because a smaller data- base is more prone to noise. Overall, these results indicate that by guest on October 1, 2021 H and L chains exhibit similar SHM targeting patterns, but distinct mechanisms are likely acting at each locus, and analysis of H and L chain SHM patterns is likely to require separate targeting models.

Total Functional Nonfunctional Species-specific and chain-specific targeting To explore the global relationships between the species- and chain-

a specific SHM targeting models, we built separate targeting models for each individual mouse and human of the following categories:

58 2,870 2,789 k 454 6,455 5,944 708 22,532 21,682 199233218 6,426 7,333 6,182 5,903 7,064 5,647 1) L chain sequences from four NP-immunized mice (RS5NF models), 2) k L chain sequences from two control mice (S5F models), 3) k and l L chain sequences from four healthy human subjects (S5F models), 4) H chain sequences from four healthy human subjects (S5F models), 5) H chain sequences from 11 human subjects used in a previous study by Yaari et al. (12) (S5F models), and 6) sequencing errors from three mouse samples (described in Materials and Methods). The pairwise distance be-

No. of Unique Sequences No. of Clonestween each of these models was No. of Mutations defined as follows: 1 2 Spearman correlation of the mutability estimates. Principal component analysis was then used to project this distance matrix onto the first two principal components, which together account for most of the Processed Functional Nonfunctional variance in the distance matrix (95%) (see Materials and Meth- ods) (Fig. 6A). We found that mouse L chains, human L chains, and human H chains clustered into distinct groups. The high L 21,011 20,362 LL 36,596L 26,089 34,846 27,929 24,553 26,505 1,514 1,338 16,319 1,191 14,611 14,909 12,508 13,322 11,490 1,266 1,144 113,235 L 111,625 106,266 4,497 49,893 45,665 3,729 331,792 22,761 consistency of samples within each group, combined with the disparity across the groups, indicates that different species and chains have distinct targeting patterns. These differences were not completely due to the increased targeting of C/G bases in mice, Nonfunctional sequences are defined as sequences with out-of-frame junctions. SampleHD07 Chain H 5,697 5,570 HD09HD10 HHD13 H H 8,628 8,742 8,362 9,193 8,462 8,900 Total H 32,260 32,294

a because renormalizing the models to have the same mutability for

Table II. Human sequencing data all 5-mers with the same central base produced a similar result. The Journal of Immunology 7

FIGURE 4. The mutability of C/G (relative to A/T) is increased in mice compared with humans. SHM targeting models were estimated using silent mutations from nonfunctional mouse k L chains in four NP-immunized mice (NP1-4 in Table I) or functional human L chains from four subjects (HD samples in Table II). (A) The mutability of each 5-mer (points) was compared between the mouse and human models. (B) The distribution of muta- bilities for all 5-mer motifs centered at each base (A, T, G, and C) was compared between the mouse (shaded boxes) and human (open boxes) models. The mutabilities were normalized such that the average mutability of the motifs observed in both species was 1.

The differences observed across chains and between species were process, and highlight the need to use the proper targeting models also not likely due to usage of distinct germline sequences because for the analysis of SHM patterns. mutabilities are normalized based on the frequency of occurrence Downloaded from of each motif in the germline sequences. To confirm this, the Discussion analyses were repeated using separate models constructed for each Models of SHM microsequence targeting and substitution bias can V gene family independently. These results showed that the var- provide mechanistic insights into the underlying mutation and iation between V gene families at the same locus was small DNA repair pathways, and serve as critical background models for compared with the differences observed across chains and be- statistical analyses of mutation patterns. The development of SHM

tween species. All of the models were clustered far from the se- targeting models has been hindered by the difficulty in identifying http://www.jimmunol.org/ quencing error model, providing reassurance that the SHM large numbers of somatic mutations that have not been influenced targeting models were not significantly influenced by potential by selection. To overcome this limitation, we used high-throughput sequencing errors. Interestingly, the human L chain models were sequencing to generate a large number of k L chains from intermediate between the murine L chain and the human H chain NP-binding GC B cells. Because B cells specific for the NP hapten models, suggesting some conservation of SHM mechanisms at this use l L chains, many of these cells also carry a nonfunctional k L locus. To better understand how these differences in SHM tar- chain sequence. These nonfunctional k L chains accumulated geting could impact affinity maturation, we designed a simulation- significant numbers of somatic mutations after NP immunization. based analysis to quantify how often the models lead to different To model SHM targeting, we adapted the 5-mer microsequence mutations being introduced in a repertoire of 2000 Mus mus targeting framework (12), which estimates the mutability of each by guest on October 1, 2021 IGKV1-135*01 sequences (the most frequent segment among base as a function of the surrounding two bases upstream and two mouse nonfunctional sequences), each carrying five mutations. bases downstream. This approach captured SHM targeting pat- Ten simulated repertoires were generated for each SHM model terns that were highly reproducible across individual mice. The (mouse L chain, human L chain, and human H chain) using a patterns were also similar in sequences with low (,1%) and high mutation algorithm previously described (8). We found that two ($1%) mutation frequencies, suggesting that the SHM process is repertoires that responded independently under the same muta- stable over time. This contrasts with the hierarchical model pro- bility model (mouse L chain) would differ in ∼13% of mutations. posed by Yeap et al. (33). Although our data also show that early In contrast, responses that evolved under the mouse L chain model mutations are targeted preferentially to hot-spot motifs, this is due differed from those using the human H chain model by .38% to their higher overall mutability. Hot- and cold-spot mutations are (Fig. 6B). In contrast with the PCA in Fig. 6A, the human L chain both observed in sequences with few mutations (,1%) and appear model was not closer to the mouse L chain than the human H at their expected frequencies. Comparative analysis of this mouse chain. This is because the V segment chosen for the simulation L chain model with SHM targeting models for human L and lacks the sequence motifs exhibiting similar mutabilities between H chains identified both species- and chain-specific differences. the mouse L chain model and the human L chain model. However, Although the mutability of classic hot- and cold-spot motifs was in simulations with other V segment sequences, we observed that broadly similar in mice and humans, SHM targeting in mice was the human L chain model rendered more similar mutations with characterized by increased C/G targeting. Previous studies suggested the mouse L chain model. Overall, these analyses provide strong that SHM targeting in mice and humans was similar, and that there evidence for species- and chain-specific targeting in the SHM were no chain-specific differences (9). However, the number of

Table III. SHM substitution matrices

Mouse L Chain Human L Chain Human H Chain

From\ToACGTACGTACGT A — 0.16 0.54 0.29 — 0.26 0.50 0.24 — 0.28 0.49 0.23 C 0.13 — 0.09 0.78 0.25 — 0.30 0.45 0.21 — 0.35 0.44 G 0.77 0.14 — 0.09 0.53 0.31 — 0.16 0.49 0.35 — 0.16 T 0.25 0.57 0.18 — 0.20 0.57 0.23 — 0.19 0.44 0.37 — The substitution frequency from each base to every other base is computed using the set of mutations observed at positions where all base changes are silent. 8 MOUSE SOMATIC HYPERMUTATION TARGETING MODEL

FIGURE 5. SHM targeting in human H and L chains is distinct. SHM targeting models were estimated using silent mutations in functional sequences from four subjects (HD samples in Table II). (A) The mutability of each 5-mer (points) was compared between L and H chains for the models based on all individuals (HD samples) combined. (B) Separate SHM targeting models were constructed for H chains and L chains from each individual, and the

Spearman correlation coefficients of the 5-mer mutabilities between each pair of individuals was determined. Comparisons between the same types of Downloaded from chains (H versus H or L versus L) are represented by empty bars, whereas comparisons between H and L chains are represented by filled bars. The mutabilities were normalized such that the average mutability of the motifs observed in both chains was 1. sequences and mutations in these studies were significantly smaller or mismatch repair. L and H chains display overall similar targeting compared with this study. Although the gross patterns are similar patterns, but the detailed mutability patterns within each hot- or with the previous finding, the novel observations of both species- cold-spot displayed large discrepancy. One explanation for locus- http://www.jimmunol.org/ and locus-specific patterns of mutation imply that there are unique specific targeting is that differences in the regulatory elements (e.g., molecular mechanisms operating and/or that the extent of activity promoters and enhancers) in H and L chain genes control recruit- of certain mechanisms varies across species and loci. The increased ment of AID and DNA repair factors like UNG and error-prone targeting of C/G (versus A/T) bases combined with the increased polymerases. Each repair mechanism may have distinct preference transition frequency specifically at C/G bases in mice is consistent in targeting motifs, and the shift in the contribution of repair with less involvement of the DNA error repair pathways normally mechanisms may be reflected in microsequence targeting specific- associated with SHM relative to humans. Thus, in this mouse ity. Overall, these species- and chain-specific differences highlight model, more of the AID-induced lesions appear to be resolved by a the importance of using an SHM targeting model that is appropri- simple replication event without the involvement of base excision ately matched for the data being analyzed. by guest on October 1, 2021

FIGURE 6. SHM targeting displays species- and chain-specific features. (A) The distance between each pair of targeting models was defined as follows: 1 2 Spearman correlation measure between the mutabilities of the 5-mer motifs observed in both models. PCA was used to project this distance matrix onto two components that explain most of the variance. Analyzed models include S5F model from human H chains in 15 subjects (HD samples in Table II and previous data set [12]; open circles), S5F model from human L chains in four subjects (HD samples in Table II; squares), S5F model from murine L chains in two mice (C1-2 in Table I; open triangles), RS5NF model from murine L chains in four mice (NP1-4 in Table I; filled triangles), and sequencing errors determined from large UID groups (3). (B) A simulation-based analysis was used to quantify how often the SHM targeting models lead to different mutations being introduced in a repertoire of 2000 Mus mus IGKV1-135*01 sequences (the most frequent segment among nonfunctional mouse se- quences), each carrying five mutations (see Materials and Methods for details). The Journal of Immunology 9

A potential limitation of this study concerns the comparison of 8. Yaari, G., M. Uduman, and S. H. Kleinstein. 2012. Quantifying selection in high- throughput Immunoglobulin sequencing data sets. Nucleic Acids Res. 40: e134. functional versus nonfunctional sequences. In the mouse system, the 9. Shapiro, G. S., K. Aviszus, D. Ikle, and L. J. Wysocki. 1999. Predicting regional total number of mutations in functional sequences was low, and thus mutability in antibody V genes based solely on di- and trinucleotide sequence may be more influenced by sequencing errors. However, the con- composition. J. Immunol. 163: 259–268. 10. Cowell, L. G., and T. B. Kepler. 2000. The nucleotide-replacement spectrum firmation that SHM targeting in functional and nonfunctional se- under somatic hypermutation exhibits microsequence dependence that is strand- quences in human subjects was also similar supports our conclusion symmetric and distinct from that under germline mutation. J. Immunol. 164: that data from nonfunctional sequences can be used as a model for 1971–1976. 11. Shapiro, G. S., M. C. Ellison, and L. J. Wysocki. 2003. Sequence-specific tar- mutation patterns in functional sequences. It is possible that the geting of two bases on both DNA strands by the somatic hypermutation 5-mer model does not fully capture the influence of microsequence mechanism. Mol. Immunol. 40: 287–295. context on mutability and that targeting may be influenced by bases 12. Yaari, G., J. A. Vander Heiden, M. Uduman, D. Gadala-Maria, N. Gupta, J. N. Stern, K. C. O’Connor, D. A. Hafler, U. Laserson, F. Vigneault, and that are further upstream or downstream. However, extending the S. H. Kleinstein. 2013. Models of somatic hypermutation targeting and substi- analysis to longer motifs (e.g., 7-mers) would require much higher tution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front. Immunol. 4: 358. sequence coverage to account for the large number of possible 13. Oprea, M., L. G. Cowell, and T. B. Kepler. 2001. The targeting of somatic hyper- nucleotide combinations (e.g., 16,384 combinations of 7-mers). It is mutation closely resembles that of meiotic mutation. J. Immunol. 166: 892–899. also important to acknowledge that no motif-based model may be 14. Inlay, M. A., H. H. Gao, V. H. Odegard, T. Lin, D. G. Schatz, and Y. Xu. 2006. Roles of the Ig kappa light chain intronic and 39 enhancers in Igk somatic sufficient to capture the full complexity of SHM targeting, which hypermutation. J. Immunol. 177: 1146–1151. may be influenced uniquely by each V gene context. However, the 15. Rouaud, P., C. Vincent-Fabert, A. Saintamand, R. Fiancette, M. Marquet, observation that the same conclusions hold when running the I. Robert, B. Reina-San-Martin, E. Pinaud, M. Cogne´, and Y. Denizot. 2013. The IgH 39 regulatory region controls somatic hypermutation in germinal center analysis independently on each V gene family suggests that this is B cells. J. Exp. Med. 210: 1501–1507. Downloaded from unlikely. There is also a caveat in the comparison between species. 16. Buerstedde, J.-M., J. Alinikula, H. Arakawa, J. J. McDonald, and D. G. Schatz. The human samples in the study were from a diverse population, 2014. Targeting of somatic hypermutation by immunoglobulin enhancer and enhancer-like sequences. PLoS Biol. 12: e1001831. whereas the mouse models were based on a single inbred laboratory 17. Betz, A. G., C. Milstein, A. Gonza´lez-Ferna´ndez, R. Pannell, T. Larson, and strain. It is possible that the patterns observed are specific to this M. S. Neuberger. 1994. Elements regulating somatic hypermutation of an im- munoglobulin kappa gene: critical role for the intron enhancer/matrix attachment strain. Finally, it is important to note that the sequencing error models region. Cell 77: 239–248. were built from sequences with identical UIDs, which captures the 18. Chan, O. T., L. G. Hannum, A. M. Haberman, M. P. Madaio, and M. J. Shlomchik. PCR and sequencing errors, but not errors introduced in the reverse 1999. A novel mouse with B cells but lacking serum antibody reveals an antibody- http://www.jimmunol.org/ independent role for B cells in murine lupus. J. Exp. Med. 189: 1639–1648. transcription step prior to the UID-tagging process. 19. Tsioris, K., N. T. Gupta, A. O. Ogunniyi, R. M. Zimnisky, F. Qian, Y. Yao, The models developed in this study quantitatively characterize X. Wang, J. N. Stern, R. Chari, A. W. Briggs, et al. 2015. Neutralizing antibodies SHM targeting without the biasing influence of selection pressures. In against West Nile virus identified directly from human B cells by single-cell analysis and next generation sequencing. Integr Biol (Camb) 7: 1587–1597. addition to revealing previously unrecognized hot and cold spots as 20.Stern,J.N.,G.Yaari,J.A.VanderHeiden,G.Church,W.F.Donahue well as the influence of species and locus, these models provide R.Q.Hintzen,A.J.Huttner,J.D.Laman,R.M.Nagra,A.Nylander,etal. important background distributions for statistical analysis of affinity 2014. B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Sci. Transl. Med. 6: 248ra107. maturation in experimentally derivedIgsequencingdata.Specifically, 21. Di Niro, R., S.-J. Lee, J. A. Vander Heiden, R. A. Elsner, N. Trivedi, we developed two new models: the mouse RS5NF L chain and the J. M. Bannock, N. T. Gupta, S. H. Kleinstein, F. Vigneault, T. J. Gilbert, et al. 2015. Salmonella infection drives promiscuous B cell activation followed by by guest on October 1, 2021 human S5F L chain models. These models complement the previously extrafollicular affinity maturation. 43: 120–131. published human S5F H chain model (12) and are available in the 22. Vander Heiden, J. A., G. Yaari, M. Uduman, J. N. Stern, K. C. O’Connor, SHazaM R package (24) (http://shazam.readthedocs.io/). D. A. Hafler, F. Vigneault, and S. H. Kleinstein. 2014. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor rep- ertoires. Bioinformatics 30: 1930–1932. Acknowledgments 23. Alamyar, E., V. Giudicelli, S. Li, P. Duroux, and M.-P. Lefranc. 2012. HighV- QUEST: the IMGT web portal for immunoglobulin (IG) or antibody and T cell We thank Dr. Gur Yaari and Dr. Mohamed Uduman for feedback and tech- receptor (TR) analysis from NGS high throughput and deep sequencing. nical assistance and the Yale University Biomedical High Performance Immunome Res. 8: 26. Computing Center for use of computing resources. 24. Gupta, N. T., J. A. Vander Heiden, M. Uduman, D. Gadala-Maria, G. Yaari, and S. H. Kleinstein. 2015. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31: 3356–3358. Disclosures 25. Smith, D. S., G. Creadon, P. K. Jena, J. P. Portanova, B. L. Kotzin, and The authors have no financial conflicts of interest. L. J. Wysocki. 1996. Di- and trinucleotide target preferences of somatic muta- genesis in normal and autoreactive B cells. J. Immunol. 156: 2642–2652. 26. Benichou, J., R. Ben-Hamo, Y. Louzoun, and S. Efroni. 2012. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. References Immunology 135: 183–191. 1. McKean, D., K. Huppi, M. Bell, L. Staudt, W. Gerhard, and M. Weigert. 1984. 27. Hieter, P. A., S. J. Korsmeyer, T. A. Waldmann, and P. Leder. 1981. Human Generation of antibody diversity in the immune response of BALB/c mice to immunoglobulin k light-chain genes are deleted or rearranged in l-producing influenza virus hemagglutinin. Proc. Natl. Acad. Sci. USA 81: 3180–3184. B cells. Nature 290: 368–372. 2. Peled, J. U., F. L. Kuang, M. D. Iglesias-Ussel, S. Roa, S. L. Kalis, 28. Jacob, J., R. Kassir, and G. Kelsoe. 1991. In situ studies of the primary immune M. F. Goodman, and M. D. Scharff. 2008. The biochemistry of somatic hyper- response to (4-hydroxy-3-nitrophenyl)acetyl. I. The architecture and dynamics of mutation. Annu. Rev. Immunol. 26: 481–511. responding cell populations. J. Exp. Med. 173: 1165–1175. 3. Rogozin, I. B., and N. A. Kolchanov. 1992. Somatic hypermutagenesis in im- 29. Diaw, L., D. Siwarski, W. DuBois, G. Jones, and K. Huppi. 2000. Double pro- ducers of kappa and lambda define a subset of B cells in mouse plasmacytomas. munoglobulin genes. II. Influence of neighbouring base sequences on muta- Mol. Immunol. 37: 775–781. genesis. Biochim. Biophys. Acta 1171: 11–18. 30. Do¨rner, T., H. P. Brezinschek, R. I. Brezinschek, S. J. Foster, R. Domiati-Saad, 4. Betz, A. G., C. Rada, R. Pannell, C. Milstein, and M. S. Neuberger. 1993. and P. E. Lipsky. 1997. Analysis of the frequency and pattern of somatic mu- Passenger transgenes reveal intrinsic specificity of the antibody hypermutation tations within nonproductively rearranged human variable heavy chain genes. J. mechanism: clustering, polarity, and specific hot spots. Proc. Natl. Acad. Sci. Immunol. 158: 2779–2789. USA 90: 2385–2388. 31. Ye´lamos, J., N. Klix, B. Goyenechea, F. Lozano, Y. L. Chui, A. Gonza´lez Ferna´ndez, 5. Pham, P., R. Bransteitter, J. Petruska, and M. F. Goodman. 2003. Processive R. Pannell, M. S. Neuberger, and C. Milstein. 1995. Targeting of non-Ig sequences in AID-catalysed cytosine deamination on single-stranded DNA simulates somatic place of the V segment by somatic hypermutation. Nature 376: 225–229. hypermutation. Nature 424: 103–107. 32. Golding, G. B., P. J. Gearhart, and B. W. Glickman. 1987. Patterns of somatic 6. Kenneth, H., F. Anna, L. Gerton, and O. G. Pybus. 2016. The diversity and molecular mutations in immunoglobulin variable genes. Genetics 115: 169–176. evolution of B cell receptors during infection. Mol. Biol. Evol. 33: 1147–1157. 33. Yeap, L. S., J. K. Hwang, Z. Du, R. M. Meyers, F. L. Meng, A. Jakubauskaitė, 7. Betz, A. G., M. S. Neuberger, and C. Milstein. 1993. Discriminating intrinsic and M. Liu, V. Mani, D. Neuberg, T. B. Kepler, et al. 2015. Sequence-intrinsic antigen-selected mutational hotspots in immunoglobulin V genes. Immunol. mechanisms that target AID mutational outcomes on antibody genes. Cell 163: Today 14: 405–411. 1124–1137. #!/bin/bash # Super script to run the pRESTO pipeline on AbVitro library v3.0 data # # Author: Jason Anthony Vander Heiden, Gur Yaari, Namita Gupta, Ang Cui # Date: 2015.10.20 # # Required Arguments: # $1 = read 1 file (C-region start sequence) # $2 = read 2 file (V-region start sequence) # $3 = output directory (absolute path) # $4 = number of subprocesses for multiprocessing tools

# Define run parameters LOG_RUNTIMES=true ZIP_FILES=false ALIGN_STEP=false CALC_DIV=true MAXDIV=0.1 PRCONS=0.6 FSQUAL=20 BCQUAL=0 R1_MAXERR=0.2 R2_MAXERR=1.0 # VPRIMERs start at variable positions but are not very relevant; make the maxerr cutoff high AP_MAXERR=0.2 ALPHA=0.01 FS_MISS=20 CS_MISS=20 MUSCLE_EXEC=$HOME/bin/muscle3.8.31_i86linux64 NPROC=$4

# Define input files R1_FILE=$1 R2_FILE=$2 OUTDIR=$3 R1_PRIMER_FILE=‘/CPrimers_MouseV3.fasta' R2_PRIMER_FILE=‘/VPrimers_MouseV3.fasta'

# Define script execution command and log files mkdir -p $OUTDIR; cd $OUTDIR RUNLOG="${OUTDIR}/Pipeline.log" echo '' > $RUNLOG if $LOG_RUNTIMES; then TIMELOG="${OUTDIR}/Time.log" echo '' > $TIMELOG RUN="nice -19 /usr/bin/time -o ${TIMELOG} -a -f %C\t%E\t%P\t%Mkb" else RUN="nice -19" fi

# Start echo "DIRECTORY:" $OUTDIR echo "VERSIONS:" >> $RUNLOG AlignSets.py -v >> $RUNLOG AssemblePairs.py -v >> $RUNLOG BuildConsensus.py -v >> $RUNLOG CollapseSeq.py -v >> $RUNLOG FilterSeq.py -v >> $RUNLOG MaskPrimers.py -v >> $RUNLOG PairSeq.py -v >> $RUNLOG ParseHeaders.py -v >> $RUNLOG ParseLog.py -v >> $RUNLOG SplitSeq.py -v >> $RUNLOG

# Filter low quality reads echo " 1: FilterSeq quality $(date +'%H:%M %D')" $RUN FilterSeq.py quality -s $R1_FILE -q $FSQUAL --nproc $NPROC --outname R1 \ --outdir . --log QualityLogR1.log --clean >> $RUNLOG $RUN FilterSeq.py quality -s $R2_FILE -q $FSQUAL --nproc $NPROC --outname R2 \ --outdir . --log QualityLogR2.log --clean >> $RUNLOG

# Identify primers and UID echo " 2: MaskPrimers score $(date +'%H:%M %D')" $RUN MaskPrimers.py score -s R1_quality-pass.fastq -p $R1_PRIMER_FILE --mode cut --start 0 \ --maxerror $R1_MAXERR --nproc $NPROC --log PrimerLogR1.log --clean >> $RUNLOG $RUN MaskPrimers.py score -s R2_quality-pass.fastq -p $R2_PRIMER_FILE --mode cut --barcode --start 17 \ --maxerror $R2_MAXERR --nproc $NPROC --log PrimerLogR2.log --clean >> $RUNLOG

# Assign UIDs to read 1 sequences echo " 3: PairSeq $(date +'%H:%M %D')" $RUN PairSeq.py -1 R2*primers-pass.fastq -2 R1*primers-pass.fastq -f BARCODE -- coord illumina --clean >> $RUNLOG if $ALIGN_STEP; then # Multiple align UID read groups echo " 4: AlignSets muscle $(date +'%H:%M %D')" $RUN AlignSets.py muscle -s R1*pair-pass.fastq --exec $MUSCLE_EXEC -- nproc $NPROC \ --log AlignLogR1.log --clean >> $RUNLOG $RUN AlignSets.py muscle -s R2*pair-pass.fastq --exec $MUSCLE_EXEC -- nproc $NPROC \ --log AlignLogR2.log --clean >> $RUNLOG BCR1_FILE=R1*align-pass.fastq BCR2_FILE=R2*align-pass.fastq else BCR1_FILE=R1*pair-pass.fastq BCR2_FILE=R2*pair-pass.fastq fi

# Build UID consensus sequences echo " 5: BuildConsensus $(date +'%H:%M %D')" if $CALC_DIV; then $RUN BuildConsensus.py -s $BCR1_FILE --bf BARCODE --pf PRIMER --prcons $PRCONS \ -q $BCQUAL --maxdiv $MAXDIV --nproc $NPROC --log ConsensusLogR1.log --clean >> $RUNLOG $RUN BuildConsensus.py -s $BCR2_FILE --bf BARCODE --pf PRIMER \ -q $BCQUAL --maxdiv $MAXDIV --nproc $NPROC --log ConsensusLogR2.log --clean >> $RUNLOG else $RUN BuildConsensus.py -s $BCR1_FILE --bf BARCODE --pf PRIMER --prcons $PRCONS \ -q $BCQUAL --nproc $NPROC --log ConsensusLogR1.log --clean >> $RUNLOG $RUN BuildConsensus.py -s $BCR2_FILE --bf BARCODE --pf PRIMER \ -q $BCQUAL --nproc $NPROC --log ConsensusLogR2.log --clean >> $RUNLOG fi

# Assemble paired ends echo " 6: AssemblePairs $(date +'%H:%M %D')" $RUN AssemblePairs.py align -1 R2*consensus-pass.fastq -2 R1*consensus- pass.fastq \ --1f CONSCOUNT --2f PRCONS CONSCOUNT --coord presto --rc tail --maxerror $AP_MAXERR --alpha $ALPHA \ --nproc $NPROC --log AssembleLog.log --clean >> $RUNLOG

# Remove sequences with many Ns echo " 7: FilterSeq missing $(date +'%H:%M %D')" $RUN FilterSeq.py missing -s *assemble-pass.fastq -n $FS_MISS --inner \ --nproc $NPROC --log MissingLog.log >> $RUNLOG

# Rewrite header with minimum of CONSCOUNT echo " 8: ParseHeaders collapse $(date +'%H:%M %D')" $RUN ParseHeaders.py collapse -s *missing-pass.fastq -f CONSCOUNT --act min \ --outname Assembled --fasta > /dev/null

# Remove duplicate sequences echo " 9: CollapseSeq $(date +'%H:%M %D')" $RUN CollapseSeq.py -s Assembled_reheader.fasta -n $CS_MISS --uf PRCONS \ --cf CONSCOUNT --act sum --outname Assembled --inner >> $RUNLOG

# Filter to sequences with at least 2 supporting sources echo " 10: SplitSeq group $(date +'%H:%M %D')" $RUN SplitSeq.py group -s Assembled_collapse-unique.fasta -f CONSCOUNT --num 2 >> $RUNLOG

# Create table of final repertoire echo " 11: ParseHeaders table $(date +'%H:%M %D')" $RUN ParseHeaders.py table -s Assembled_collapse-unique_atleast-2.fasta -f ID PRCONS CONSCOUNT DUPCOUNT >> $RUNLOG

# Process log files echo " 12: ParseLog $(date +'%H:%M %D')" $RUN ParseLog.py -l QualityLogR[1-2].log -f ID QUALITY > /dev/null & $RUN ParseLog.py -l PrimerLogR[1-2].log -f ID BARCODE PRIMER ERROR > /dev/null & $RUN ParseLog.py -l ConsensusLogR[1-2].log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCOUNT PRFREQ DIVERSITY > /dev/null & $RUN ParseLog.py -l AssembleLog.log -f ID OVERLAP LENGTH PVAL ERROR HEADFIELDS TAILFIELDS > /dev/null & $RUN ParseLog.py -l MissingLog.log -f ID MISSING > /dev/null & wait if $ZIP_FILES; then tar -cf LogFiles.tar *LogR[1-2].log *Log.log gzip LogFiles.tar rm *LogR[1-2].log *Log.log

tar -cf TempFiles.tar R[1-2]_*.fastq *under* *duplicate* *undetermined* *reheader* gzip TempFiles.tar rm R[1-2]_*.fastq *under* *duplicate* *undetermined* *reheader* fi

# End echo -e "DONE\n" cd ../

Supplementary File 1: Script for Ig sequence pre-processing. 3 Functional 2

1

0

3 Non − functional

2

Number of Sequences (log10) 1

0 0 25 50 75 100 125 Junction Length (bp)

Figure S1: Many non-functional κ sequences contain out-of-frame junctions. Junction length distribution for functional (top panel) and non-functional (bottom panel) κ light chain sequences from NP-immunized mice (NP1-4 in Table 1). Functionality was determined by IMGT/HighV- Quest.

A B C 0.05 0.50 5.00 0.02 0.50 10.00 0.05 0.50 5.00 0.05 0.50 5.00 0.05 0.50 5.00 0.05 0.50 5.00 5.00 5.00 5.00 0.75 0.78 0.76 0.88 0.87 0.87 0.91 0.91 0.92 0.50 0.50 GC1 HD17 0.50 HD17 0.05 0.05 0.05

GGCG G G GCGCGG GAGACGCA CGGC A G A 5.00 AC 5.00 G 5.00 GGCGG AACC C T AGCGGCACAGAGCGCC GAAGTCAGACAGCCC TCGCTCGACGTCGCCA AGAACGAGGCAGGCGACGC ATAGCTGCTAGATCTGCAAC T AATAGTTGACGTCATGC GCAGTCATACATGCAGACGATACAA GTAGTAACTATAGTCGATCTGCCT CGAATGATCGATCTAGCAGTCATCGTA GAATCACTGTACGCAGTCTAATGCACGACAGAAC ATTTGAGTACGTACTGACTGTCATCTAGT GTCCACAATCGCATAGC ATCATAGTTCATACGATCATGTAGTCAGACGCAATAGGTCACAA 0.77 0.77 TTATCGTACTACTACGTGCATACTC 0.91 0.93 T TGATGATCACGTATGCTGTACT 0.91 0.92 AATGGA T CGAGG 0.50 C AC 9 GCTA AA T 0.50 GTTCTCCC 9 G HD2 0.50 GTCTGATTTTATTTA GC2 GGGTGCACC HD2 TCGCT AACTTGTCAGCTATCTACTGAAGTCTAGC GCATGTGCTGTCCCTCGCGCTT CACGCGCGCAGCGTACGCGT TTACAGTATTAGTTCTCCGCTGATACATGTACG ATAT GCCGGCTCGCGCTCCG CGGGCGCGCC TT TGTCAGATTA G CTGCCCGCC CGCGGCGTCCG TTA CTGTAGCACT G GGGGC C CC C AG A C GCC C 0.05 CC

T 0.05 0.05 C C C C G GGG CGGGC GGGCCGGG A A G A AGAG GGC CGG ACGGC AGGCGGCGC 10.00 G GAGACGCA G GACGAG CCGC CCG GGC GCGG G G G 5.00 AGCAGGCGACGGCCG AGGAGCGAGCGCGGC CCAGACGC CCCGCAAG 5.00 GCGCGGCA GCACGGCA G GCAAGACACGACAGCGCAGCCGGACC GAACCTCGAAAGTCACGCAGCTCGACGCGAG TACGTCAGACGAACCC T AATACGACGACTAC CGCTGACGTACGCTCA GCTCGTCAGTCAGTCATA ATAACAAGAATCGACAGCAGAAGAT GTATAGTGTTGAGCATATCAATGTCAGCAGATGAGCA GTGATGATGCATACATCGTCCA GCATCAGATCATGCAGTACGACA T CTAAGTCTGCTACTAGTC CGTCGTACGTACTAGTATT ACTATGATGACATGTGCATATGCACAAGACGAGCCACCATAA G TCCGTACATGACTATGAAGTCTAGCAGTCACGTACATCACGCCTA ATAGCTACGCTAGTAGCTATGCTTACAT CACAGATCGTCATCGATCATGTATAGCTAA CCGATATCGACGACGTCTCAT TCCTTTGGACTGCTATGCTACATT ACTATGATCGACTTCAACTGATATGCACGATGCATGACATGCTA AGTAAGTACATCTGTCATTACGTATCAGTCATAGCATCAGAGA GTACGTACGATCTGACTACTGCTATTGT CTCTTCTACGATCTGCATCGTAGTATCTG GCGCCAGCTATATCGATCTAAA GTCGTCGCTACTGACGACTATCATAGATAT 0.92 TACATATTATGCCAACTGTATACGTAGTAGCGACCATATC TCTATCAATAGACTTCATACTAGTCATCTAGTCAGACTGATACCTTGC 0.78 0.50 T AATTCTGCTGCTTACTTCT C GAGCTCATCATCGTACTGTCT 0.92 CCGTGTCAACGAGTTGT C AGGCTGGCTTATGAA ATATCTCCA A TTCCATGCTAGCTAAA T CTACGCG TGGGCTCGTA AGGT GGAT HD310 0.50 AGCTTAGGAATTG TT TTGTGATTAGCCATGTCTATATT GC3 GCTGTGTTCTC GTGTGCGTCTTT HD310 0.50 CAGCCCAT GCCCGCAAT TTTGTTACTGACTGATCATATGA CATGTTTAGTTCTCCGTATGTCATGACTGTAT G GTAGGTCGTCACTCTGCTATCGCTC C GCTCGTCGTCGCTGAGTCTAGATTG GTGCGCGATCAACT CGGCCTGACACATATTT T GGCTGGT TA GCGGTCTGACGTTATTATT GTCGCTCGCGCTCCCTT GCCTGCCTGCTGTCT CCGGCTGTCGCGCATCGT C GCTCTGCGGACCGTCT CT TCTAGCC CTCTTCGATA T G GGCCCCGC GCGCGTC G GACGCTC GCGGCCACT T CCGCCCCC CCCCGGC GCCCC C CC CCGC C T T CC GC CCG G C C C C 0.05 C G G G G 0.05 G

0.01 C C G GGG GGG G G G CGGGC GGCG GCGCG A G A AGA AAGA GGC CGA AGCG 10.00 GCCG CGCCG CCGG AGGAGA AAGAG AGAGA CCG CGGG CGCGG G GGC GGC 5.00 GC GCG GCG GCCG GCGGC GCGC 5.00 GACC CACG CGGAC CCA CCTA ACT G GAGAGGCACGACGAGCGCGC AGCGCAGACGACGACGCGGACGC AGCAGCAGCGACCGGCG GAATTGCGAGCACC CATCGCGACTGCAA GGAGTCACTAGCAC GCGTCAGCTCGCAC CACCGTCGACGGC ACGTACGCAGCC CTAACGAGACACGCTGACAGCG TCGATCAATAATAGCAGCAGAGCACGG GTATACAAATCGCAGCATGCAGACGAGA ATATGACTAGCACGATACTAC CATCTACTGACAGTCAGACA CACTATCGTACAGTCAGCC T AATCGTGCATGACTAGCTC TGTAGTCATCTCAGTATGACTCG T GATCATACTGACTGCATCT A AGTAAGCAGCAGCCGTACGGATACGGAGAC TATATACACATGTAGCAGCAAGCATCAGATCGAG TAGCATACGATAGCTGCACGATAGCAGCTGACGGAA GTCTACTAAGCGATCGAGTAC TAGGTACTGTACTAGTCAGTACGT TATGCTGATCATCAGATGCCTA GCGTCTAACGTCATCGAT TTAGCGTACGATCTACGATTT TATTATGTCAGACTGTCATATC CAATAGCTACTTGATCTGCAGCAATACGGCACTATAA GATATATCAGCTGTATGACGTACTGACTAGCTATGACTGACACT A ACAACGAATGTACGATCGCTAGCTAGTACGTAACTGATAC GTGCATGACATAGTCGCATACTTCGCC TCGTCATGACTACGTACTGTACACTCT TATGTTCGTACTGACTAGCTACGCC GACCATGATGTCACTTACTGTCT GCTGACTAGTATCATCTGCATCTA GTATTGCATCAGTCATCTGTAC TAAGGACTATCTACTGCAGTATGTACTAGACATGCTGCACT CTTTCTGTGACGATCATCAGTCAGTCATGCATCTAGCTACTGCTCAC CTTTCTATGTCATCAGCATCACGTCTAGATGCATCAGAGT GATGTACGTACGATCGTCATTCTTAT CGTACGCTGTCTACTAGTCGTTACGTATA TTAGTGCTCGTACGTATAGTCTAGTTA G TGCGTGAACAGCTATCAGACT T AGGCGCGAGCTACTATCGAGATA T AGTCAGCACTGATCAGCGTACGTACT 0.50 CATGC AGTTAATA TCATAGATAGCATAATTGA TATGACATATGCATTATTA TATATCC CTCATAA ATTCA GG GGG GGG A TGTCAGTA A A TCCGATGATCGCTA TTATAGACGACTA C TAGTTTCCG CTGTCTATGTT TTTCCGTTAT 0.50 ATGGTCT GTACTTC CAGGTCTT HD413 CTCGACACTCACGTGCCACTACCC CAGCTCTGATACCTGAGCACATACCA TGACTAGTCTACCTACACTGAAACA GC4 0.50 T ATGGCTGTTGCTTCC GCTGCTGCTACTTC GTGCCGTGACTCTTCT HD413 GCTGCAGCGTCACGC GACTGCATCGTC GCTATCGCTACGCC ATATTATTCTATCGTACTATCATAATGA TGTAGTATCGATACTCGATCATCTACTTCATAATGACAATA AGCATCATCGATCATATCTACTGTCGAGATA GTGGCTCGCCTGCTTCGTCCTT GTCCGTGCTCCGTCGCTTTG G GCTCGTCTCGTCCT C G CGTGCCTGCCGCACC CCGTGCCGCCATTCT A CCCGCCTTCGCACGACT ACTGAGTCTTTATGCTAC CCGCTTATCTAGTTCATATCTAA CGTCTTCAATATCTTCCTTGAAT G TCCGGCTCGCCTGACGCCG GCCCGCCGCTCAGTG CCCGCGCGCTCACGTG C CGACGCCC CCGTGCGGCAC CGCACGTGCGC C GTGATGAAGAG ATGGTGTGCAAAG A GTCGGTATGCATAGA GGGCCC CT C GCCGGTCT T GCCGCCTCCGC T CCGGCCC C CGCG C CCGC C Mutabilityin Mouse Light Chain T TT T G G G G G G T T TT Mutabilityin Human Light Chain G GCCCCC CCG GCC CCCCG CCT C C T CTC TT TT G GTT C G CG GC C CG GCCCC C CCC GCCCC C C C 0.05

0.05 C C C C C CG GCC G CG GC C GG CG 0.02

G G G Mutabilityin Human Heavy Chain 0.05 0.50 5.00 0.01 0.50 10.00 0.05 0.50 5.00 0.05 0.50 5.00 0.05 0.50 5.00 0.05 0.50 5.00 Mutability in Mouse Light Chain Mutability in Human Light Chain Mutability in Human Heavy Chain

Figure S2: SHM targeting is highly consistent across individual samples. (A) Separate RS5NF models were constructed for each of the four individual mice (NP1-4 in Table 1) to estimate the mutability of 5-mer motifs in non-functional κ sequences from NP-binding B-cells. Separate S5F models were constructed for each of the four individual human subjects (HD samples in Table 2) to estimate the mutability of 5-mer motifs in (B) light chain sequences and (C) heavy chain sequences. The mutabilities were normalized such that the average mutability of the motifs observed in each pair of samples was 1. The lower triangle shows the pair-wise comparisonFigure'S2' of mutabilities for each 5-mer (points). The upper triangle shows the Spearman correlations between the estimated mutabilities in each motif.

G C Total number: 63 Total number: 86 G Pearson: 0.75 G Pearson: 0.67 C C G C CC Spearman: 0.82 G Spearman: 0.83 C C C G G G C C C G G C C CCCC GG G GG G C C G G G CCC C CC 1.0 G G C C C C C GG GG CCC C C G GG G C C G GG CC CCC C GGGGG G C C C C CC C GGG G G G C C CC C G GG C C C C G C C CCC C G GG C CC G G C CC 0.1 G G GG C C C C

G

A T Total number: 66 Total number: 84 Pearson: 0.64 Pearson: 0.55 A AA Spearman: 0.6 A A Spearman: 0.6 AAAAA T AA AAAAA A T A AAAAA A T T T T A AAA A T TT T T A A AAA A A TTTT Mutability in Human 1.0 A A AA A TTT A AA T T T T TT T A AAAA A T T TTTTTTT A A A T TT TT A A T TT T TTT T T A A T TTTTTT A T T TT T T TTT TT T TT TT T 0.1 T

0.1 1.0 0.1 1.0 Mutability in Mouse

Figure S3: SHM targeting in mouse and human is highly correlated when conditioned on the base being targeted. The mutability values from Figure 4A were plotted separately for 5- mer motifs where the central base was A, T, G or C.