Cutting Edge: Ig H Chains Are Sufficient to Determine Most B Cell Clonal Relationships Julian Q. Zhou and Steven H. Kleinstein This information is current as J Immunol published online 4 September 2019 of September 23, 2021. http://www.jimmunol.org/content/early/2019/09/03/jimmun ol.1900666 Downloaded from Supplementary http://www.jimmunol.org/content/suppl/2019/09/03/jimmunol.190066 Material 6.DCSupplemental

Why The JI? Submit online. http://www.jimmunol.org/ • Rapid Reviews! 30 days* from submission to initial decision

• No Triage! Every submission reviewed by practicing scientists

• Fast Publication! 4 weeks from acceptance to publication

*average by guest on September 23, 2021 Subscription Information about subscribing to The Journal of Immunology is online at: http://jimmunol.org/subscription Permissions Submit copyright permission requests at: http://www.aai.org/About/Publications/JI/copyright.html Email Alerts Receive free email-alerts when new articles cite this article. Sign up at: http://jimmunol.org/alerts

The Journal of Immunology is published twice each month by The American Association of Immunologists, Inc., 1451 Rockville Pike, Suite 650, Rockville, MD 20852 Copyright © 2019 by The American Association of Immunologists, Inc. All rights reserved. Print ISSN: 0022-1767 Online ISSN: 1550-6606. Published September 4, 2019, doi:10.4049/jimmunol.1900666

Cutting Edge: Ig H Chains Are Sufficient to Determine Most B Cell Clonal Relationships Julian Q. Zhou* and Steven H. Kleinstein*,†,‡ B cell clonal expansion is vital for adaptive immunity. largely unique, with recent estimates suggesting 1 3 1016–1018 High-throughput BCR sequencing enables investigat- unique paired Abs in the circulating repertoire (2). ing this process but requires computational inference Adaptive immune repertoire receptor sequencing allows to identify clonal relationships. This inference usually for high-throughput profiling of the diverse BCR repertoire relies on only the BCR H chain, as most current pro- via full-length V(D)J sequencing in bulk (3). An ensuing tocols do not preserve H:L chain pairing. The extent to challenge is to computationally infer B cell clonal relation- which paired L chains aids inference is unknown. Us- ships (4). This step is of great importance as the assessment ing human single-cell paired BCR datasets, we assessed of repertoire properties such as diversity (5) depends on the Downloaded from the ability of H chain–based clonal clustering to iden- proper identification of clones, as does the reconstruction of tify clones. Of the expanded clones identified, ,20% B cell clonal lineages (6) for tracing switching (7) and grouped cells expressing inconsistent L chains. H chains Ag-specific (8) Abs. To infer clones, differences at the se- from these misclustered clones contained more distant quence nucleotide level, especially the high diversity in the junction sequences and shared fewer V segment muta- CDR3 region, can serve as “fingerprints” (9). Likelihood- tions than the accurate clones. This suggests that addi- based (10) and distance-based (11–14) approaches exist. For http://www.jimmunol.org/ tional H chain information could be leveraged to refine instance, cells sharing the same IGHV and IGHJ and clonal relationships. Conversely, L chains were in- whose H chain junctional sequences are sufficiently similar sufficient to refine H chain–based clonal clusters. based on a fixed (11–13) or adaptive (14) distance threshold Overall, the BCR H chain alone is sufficient to identify may be clustered as clones. For validation, existing methods clonal relationships with confidence. The Journal of used simulated and experimental H chain sequences (10, 13, Immunology, 2019, 203: 000–000. 14), measuring the fractions of sequences inferred to be clonally unrelated and related of being, respectively, truly un- related and related (specificity and sensitivity). Recently, Nouri by guest on September 23, 2021 cell–mediated immunity relies on Ig Abs produced and Kleinstein (14) reported both metrics at over 96% based as a result of B cell clonal expansion. A BCR is the on simulated data. B membrane-bound form of an Ab and is made up of The majority of current BCR repertoire studies utilize bulk H and L chains paired in a heterodimeric fashion. Each chain sequencing (15), during which VH:VL pairing is lost (16). contains a variable (V) region, and together, the V regions In the absence of VH:VL pairing, computational methods from the H and L chains form the Ag-binding sites. The V for identifying clones have focused on H chain BCR data. regions are formed via V(D)J recombination. In human, This is justified under the assumption that H chain junc- this shuffling process brings together one each from tional diversity alone should be sufficiently high such that, numerous IGHV, IGHD, and IGHJ genes for the H chain even without L chains, the likelihood of clonally unrelated V (VH) region and one gene each from either IGKV and cells being clustered together will be negligibly small (13). IGKJ genes or IGLV and IGLJ genes for, respectively, the This reasoning has yet to be rigorously tested with experimental k or the l L chain V (VL) region. Enzyme-mediated editing data. Recent breakthroughs in single-cell BCR sequencing of the V(D)J junctions and the pairing of H and L chains technology have enabled the recovery of native VH:VL pair- inject additional diversity (1). During adaptive immune re- ing (17, 18). We now have the opportunity to investigate the sponses, B cells proliferate and further diversify via somatic extent to which the inclusion of L chains impacts the ability to hypermutation (SHM), forming clones consisting of cells that accurately detect B cell clonal relationships. originated from the same V(D)J recombinant events yet whose Using single-cell VH:VL paired BCR data, we assessed the BCRs differ at the nucleotide level. As a result, each BCR is performance of H chain–based computational methods for

*Interdepartmental Program in Computational Biology and Bioinformatics, Yale Address correspondence and reprint requests to Dr. Steven H. Kleinstein, Yale School University, New Haven, CT 06511; †Department of Pathology, Yale School of Medi- of Medicine, 300 George Street, Suite 505, New Haven, CT 06511. E-mail address: cine, New Haven, CT 06520; and ‡Department of Immunobiology, Yale School of [email protected] Medicine, New Haven, CT 06520 The online version of this article contains supplemental material. ORCIDs: 0000-0001-9602-2092 (J.Q.Z.); 0000-0003-4957-1544 (S.H.K.). Abbreviations used in this article: NSCLC, non–small cell lung carcinoma; SHM, Received for publication June 18, 2019. Accepted for publication August 2, 2019. somatic hypermutation; V, variable; VH, H chain V; VJL, V-J-junction-length; VL, L chain V. This work was supported in part by the National Institutes of Health under Award R01AI104739. Copyright Ó 2019 by The American Association of Immunologists, Inc. 0022-1767/19/$37.50

www.jimmunol.org/cgi/doi/10.4049/jimmunol.1900666 2 CUTTING EDGE: H CHAINS CAN DETERMINE CLONES WITH CONFIDENCE identifying clones by measuring the extent to which the in- Calculation of the number and frequency of nucleotide mutations ferred clonal members expressed consistent L chains sharing The number and frequency of nucleotide mutations were calculated based the same V and J genes and junction length. We conclude that on IGH/K/LV positions leading up to the junction region using the clonal members of the majority of the inferred clones exhibited “calcObservedMutations” function from SHazaM (v0.1.10) (22). To calculate the number of IGHV mutations shared pairwise between cells L chain consistency. For the majority of the accurately inferred from the same clone, we counted the number of positions at which mutations H chain–based clones, L chain information did not lead to involving the same nucleotide change were observed in both cells. further clonal clustering with greater granularity. At least some of the information gained from paired L chain data were Results apparent when considering the pattern of shared mutations in We performed clonal relationship inference for five single-cell, the H chain V segment, which is not considered in current VH:VL paired, human BCR datasets, using only the H chain distance–based clonal clustering methods, thus offering the sequence from each cell. The datasets included four publicly potential for further improvements in H chain–based clonal available ones from 10x Genomics and one described by (19) inference. (Materials and Methods). Of these, the B lymphoblastoid GM12878 cell line dataset served as a positive control, as Materials and Methods any cells present in a cell line culture can be expected to Single-cell immune profiling datasets comprise genetically identical clonal members. A distance- Four human datasets (Supplemental Fig. 1A) published by 10x Genomics for based spectral clustering method (14) was applied to identify public use on August 1, 2018 were accessed (https://support.10xgenomics.com/ clones for each dataset. The datasets contained between 3 and Downloaded from single-cell-vdj/datasets) on November 3, 2018. Two datasets were sorted and produced by direct Ig enrichment of, respectively, CD19+ B cells isolated from 157 nonsingleton clones (i.e., clones containing at least two PBMCs from a healthy donor, and GM12878 B lymphoblastoid cell line. They cells) (Table I). Because of the small number of clones in each contain VH:VL paired reads for individual cells. The other two datasets were of the six food-allergic individuals in the dataset from (19), we unsorted and produced by V(D)J+59 gene expression profiling of, respectively, aggregated those results for display after performing analysis PBMCs from a healthy donor and squamous non–small cell lung carcinoma (NSCLC) cells from a fresh surgical resection. These contain gene expres- on a per-subject basis. sion measurements and Ig enrichment with VH:VL pairing. All datasets http://www.jimmunol.org/ were outputted by 10x Genomics via Cell Ranger (v2.2.0). We used “filtered H chain–based clonal clustering is accurate for over 80% of clones contigs” and “filtered gene-barcode matrices” for Ig and gene expression, To assess the extent to which H chain–based clonal clustering respectively. A fifth dataset (Supplemental Fig. 1B) from (19) contains BCR contigs captures the underlying biological truth of B cell clonal covering full-length V(D)J segments reconstructed from single-cell relationships, we examined whether cells clustered into the + RNA sequencing of FACS-sorted CD19 B cells from six food-allergic same clone based on their H chains alone expressed con- individuals (19). sistent L chains. Specifically, within the same clone, cells Germline V(D)J gene annotation of BCR contigs that are truly clonally related should carry L chains comprised

of the same combination of IGK/LV gene, IGK/LJ gene, and by guest on September 23, 2021 Germline V(D)J gene annotation was performed using IMGT/HighV-QUEST and IgBLAST (v1.10.0). The germline reference used was IMGT release junction sequences of identical lengths (hereafter referred to as 201839-3. The 10x Genomics datasets also contained annotations by Cell the VJL combination). An inferred clone was considered ac- Ranger. IMGT/HighV-QUEST annotations were used as final annotations curate if all of its clonal members carried L chains with the postfiltering. same VJL combination and “misclustered” otherwise. Using Filtering of BCR contigs and cells spectral clustering with adaptive thresholds, 83–97% of the inferred clones were accurate (Fig. 1A). Another distance- For all datasets, only productively rearranged BCR contigs with valid V and J gene annotations, consistent chain annotation (excluding such contigs with basedhierarchicalclusteringmethodusingafixeddistance IGHV and IGK/LJ), and junctions with nucleotide lengths being a multiple threshold (13) yielded similar results (Fig. 1B), and there- of three were used. A contig must meet all abovementioned criteria based on fore, we focus on presenting results from spectral clustering annotations from all programs used. Furthermore, only cells with exactly one hereafter. To test the possibility that the observed accuracy H chain contig paired with at least one L chain contig were examined. From the two unsorted 10x Genomics datasets with gene expression, we considered only arose by chance, we randomly permuted the VH:VL pair- cells displaying a transcriptomic profile consistent with being a B cell. Taking ings of the cells while maintaining their H chain–based into account the high dropout rate of single-cell RNA sequencing, B cells were clustering structures. Across 100 permutations, only 1–6% defined as any cell with nonzero log-normalized expression for any one of the following genes: pan-B cell markers CD19, CD24, and CD72 (20), (SD1–8%)oftheinferredcloneswereaccuratebychance plasmablast markers CD38 and MKI67 (21), and the isotype-encoding (Fig. 1A). Overall, these results show that H chain–based genes IGHA1, IGHA2, IGHD, IGHE, IGHG1, IGHG2, IGHG3, IGHG4, clonal clustering can determine clonal relationships with rea- and IGHM. sonable confidence (.80%) in terms of L chain consistency. H chain–based B cell clonal clustering We next investigated the possibility that the observed level of confidence was deflated by factors unrelated to the clonal For each dataset, on a per-subject basis, we identified clones using distance- based methods. We used, separately, spectral clustering (SCOPer v0.1.999) clustering method itself. We considered the possibility that (14) and hierarchical clustering (13) (Change-O v0.4.3) (22). Both methods the misclustered clones (Supplemental Fig. 2A) arose because first partitioned cells into groups sharing the same combination of IGHV of erroneous barcoding during sequencing preparation, which gene, IGHJ gene, and H chain junction length (H chain V-J-junction-length has the potential to link together H and L chains from un- [VJL] combination), in which junction is defined as the IMGT-numbered codon 104 (conserved Cys) to codon 118 (conserved Phe/Trp) (23). Within related cells. We reasoned that incorrectly paired H and each group, based on distances among the H chain junction sequences, L chains would show a decreased correlation in their SHM a threshold was used to cluster cells within that group into clones. For frequencies relative to correctly paired chains. Thus, we com- spectral clustering, adaptive thresholds were chosen by an unsupervised puted the Pearson correlation coefficient between IGHV and machine learning algorithm. For hierarchically clustering, a subject-specific, fixed threshold was chosen upon inspection of distance-to-nearest-neighbor IGK/LV mutation frequencies for cells expressing nonmajority plots (Supplemental Fig. 1C) (13). L chain VJL combinations from misclustered clones. We found The Journal of Immunology 3

Table I. Numbers of H chain–based clones and clone sizes

Dataset Total Number of Clones Number of Nonsingleton Clones Clone Size of Nonsingleton Clones (Number of Clones) CD19+ B cells 8268 157 2 (136); 3 (15); 4 (4); 5 (2) B cells from NSCLC tumor 1247 98 2 (81); 3 (9); 4 (4); 6 (2); 8 (1); 14 (1) B cells from food-allergic individuals 952 12 2 (10); 3 (1); 5 (1) B cells from healthy PBMCs 1105 8 2 (6); 3 (2) GM12878 cell line 5 3 2 (1); 5 (1); 790 (1) no significant difference (p = 0.926, Fisher r-to-z trans- and those of misclustered clones (p = 0.810, Fig. 2A, Supplemental formation and z test) between the levels of correlation for Fig. 2C). Thus, it is unlikely that H chain junction length misclustered clones (0.761) and for accurate clones (0.769), could serve as an effective indicator for misclustered clones. suggesting that erroneous barcoding was unlikely a concern. A central component of distance-based clonal clustering is In addition, we considered the possibility of poor germline V/J the choice of a distance threshold that determines how dis- gene annotation for the L chains leading to a false appearance similar the junction sequence can be before it is unlikely for of L chain inconsistency. Because higher SHM frequencies cells to be clonally related. To determine whether a better are associated with increasing V(D)J annotation errors, we choice of distance threshold had the potential to correct compared the L chain mutation frequency in misclustered misclustered clones, we compared the maximum pairwise Downloaded from and accurate clones. We found that the average IGK/LV distance between H chain junction sequences of cells in mutation frequency across cells expressing nonmajority L chain accurate clones with the minimum pairwise distance between VJL combinations in misclustered clones was not signifi- cells carrying L chains with different VJL combinations cantly higher than that across cells in accurate clones (p =0.957; in misclustered clones. Cells with inconsistent L chains in Supplemental Fig. 2B). Overall, these results based on the misclustered clones had significantly more dissimilar H chain analysis of SHM suggest that the observed confidence for junction sequences compared with cells in accurate clones http://www.jimmunol.org/ 2 accurately identifying clones using H chain–based clonal (p =5.63 10 9,Fig.2B,SupplementalFig.2D).This clustering was not deflated by a false appearance of L chain implies that some of the misclustered clones could have inconsistency created by erroneous barcoding or incorrect been corrected by using a numerically lower (stricter) distance L chain germline gene annotation. threshold and that this lower threshold would not break apart the accurate clones. Characteristics of misclustered clones suggest room for improvement in True B cell clones are expected to share mutations resulting H chain–based clustering from SHM followed by clonal expansion and/or positive se-

Current distance-based, H chain–based clonal clustering methods lection (12, 14). Hershberg and Luning Prak (12) suggested by guest on September 23, 2021 use information confined to VJL combination and distances a minimum threshold of four shared mutations for infer- between junction sequences to identify clonally related se- ring clones. We investigated whether cells in misclustered quences. We investigated the characteristics of the H chains clones shared fewer mutations in their IGHVs compared of cells from misclustered clones to determine whether there with cells in accurate clones. To do so, we compared the was additional information in the H chains that could im- minimum number of shared mutations between cells in ac- prove the clustering. It has been noted that shorter H chain curate clones with the maximum number of shared mutations junction lengths rendered lower and possibly insufficient between cells carrying L chains with different VJL combi- diversity for effectively distinguishing clonal members from nations in misclustered clones. We found that cells carrying nonclonal ones (13). However, we found no significant differ- inconsistent L chains in misclustered clones shared signifi- ence between the H chain junction lengths of accurate clones cantly fewer mutations compared with cells in accurate clones

FIGURE 1. Performance of H chain–based (A) spectral clustering with adaptive threshold and (B) hierarchical clustering with fixed threshold. Solid and shaded bars show percentages of nonsingleton inferred clones in which clonal members all carried L chains with the same VJL combination. Numbers at the top indicate the denominators. A background distribution was generated by permuting VH:VL pairings 100 times while maintaining inferred clonal lineage structures. Hollow bars show average percentages of accurate clones across permutations with SEs. * denotes empirical one-sided, Bonferroni-corrected p value ,0.05. 4 CUTTING EDGE: H CHAINS CAN DETERMINE CLONES WITH CONFIDENCE

FIGURE 2. Characteristics of misclustered clones. (A) H chain junction lengths of accurate clones versus misclustered clones. (B) The maximum pairwise Downloaded from distance between H chain junction sequences in accurate clones versus the minimum pairwise distance between cells expressing inconsistent L chains in mis- clustered clones. (C) The minimum pairwise shared IGHV mutations in accurate clones versus the maximum pairwise shared IGHV mutations between cells expressing inconsistent L chains in misclustered clones. Analysis was performed for each dataset separately (Supplemental Fig. 2C–E), and a Fisher combined p value was calculated.

2 http://www.jimmunol.org/ (p = 2.8 3 10 5, Fig. 2C, Supplemental Fig. 2E). Using In other words, clonal members tended to carry L chains with Hershberg and Luning Prak’s (12) threshold of four shared junction sequences that were at least 80% similar to each mutations, 26 of the 30 misclustered clones (87%) would not other. have been clustered together based on their H chains (thus We next investigated whether there would be further clus- increasing specificity). In contrast, within four of these tering based on L chain junctions at lower distance thresholds misclustered clones, a subset of cells with consistent L chains ranging, in increasing order of stringency, from 0.15 to would also become separated (thus potentially reducing sen- 0.05 normalized Hamming distance while bearing in mind sitivity). Although the tradeoff between sensitivity and spec- that one could always artificially yield further clustering by ificity needs further investigation, these results suggest that the imposing an increasingly stricter clustering threshold. At by guest on September 23, 2021 extent of shared mutations in the IGHV is a potentially useful each clustering threshold, we determined the percentage characteristic to consider in distance-based clonal clustering of H chain–based clones that were further clustered on the methods. basis of distances between their L chain junction sequences (Table II). On average, 5% of the H chain–based accurate L chain information is insufficient for refining H chain–based clones clones inferred via spectral clustering were further clustered Given the availability of paired L chains in the single-cell datasets at 0.15, the most lenient threshold explored. Even at 0.05, the that we analyzed, we assessed the value added from that in- strictest threshold explored, only 23.2% of the accurate clones formation. For misclustered clones (Supplemental Fig. 2A), were further clustered. This threshold is approaching the mean this was trivial as any L chain inconsistency was immediately L chain SHM frequency, which ranges from 0.01 to 0.05 resolved by regrouping the cells into smaller clusters based on across the datasets, raising questions as to whether such their L chain VJL combinations. For accurate clones, although further clustering is artificial rather than biological. Overall, the cells express consistent L chains, it is possible that these L chain information does not support clonal clustering with clusters may still contain multiple true clones grouped together. greater granularity for the majority of H chain–based accurate We thus investigated the extent to which further clustering cells clones. based on the similarity of their L chain junctions (analogous to H chain–based clustering) would further split the accurate Discussion clones. When clustering using the H chains, a threshold In this study, we investigated the accuracy of H chain–based around 0.2 normalized Hamming distance tended to sepa- clonal inference. With single-cell VH:VL paired BCR datasets, rate clonally related cells from unrelated ones (Supplemental we performed B cell clonal inference using only the H chains, Fig. 1C). However, applying this same threshold to cluster effectively treating the datasets as if bulk sequenced and L chains within accurate clones added virtually no infor- unpaired. Over 80% of the inferred clones were accurate as mation. The L chain junction regions of cells in accurate defined by L chain consistency. Within the majority of these clones were highly similar and significantly more so com- accurate clones, an additional round of clustering using the pared with their H chain counterparts (Bonferroni-corrected L chain sequence failed to yield finer resolution (,10% at a p values , 0.001) (Supplemental Fig. 2F). In all but four threshold of 0.1 normalized Hamming distance). Including of the accurate clones, the L chain junction sequence of a requirement that members of a clone share mutations be- a clonal member was at most 0.2 normalized Hamming tween H chain sequences would have corrected 87% of the distance away from the junction sequence of another misclustered clones, although at the expense of also breaking clonal member most similar to itself (Supplemental Fig. 2F). up the accurately clustered part in 13% of these misclustered The Journal of Immunology 5

Table II. Percentage of accurate H chain–based clones further clustered based on L chains

Clustering Threshold for L Chain Junctions 0.05 0.10 0.15

Dataset Percentage of H Chain–Based Clones Undergoing Further Clustering CD19+ B cells 20.3 6.8 3.0 B cells from NSCLC tumor 32.6 18.9 16.8 B cells from food-allergic individuals 40.0 10.0 0.0 B cells from healthy PBMCs 0.0 0.0 0.0 Average 23.2 8.9 5.0 The unit of measure for the clustering threshold for L chain junctions is normalized Hamming distance. clones. Overall, whereas there remains additional information (10). Although computationally slower compared with the from the H chain that could be leveraged for improvement, distance-based methods explored in this study, future studies we found H chain–based clustering alone capable of identifying should explore the potential benefit of such methods that use clonal relationships with reasonable confidence. both junction distance and shared mutation patterns. Cells from four of the misclustered clones (13%) carried A limitation of this study is that all but one of the datasets closely related H chains sharing a reasonable number of mu- contained relatively small numbers of B cells, and none was tations in the V segment, yet they expressed L chains with sorted for specific subsets, such as memory B cells, that would Downloaded from different VJL combinations. There are several possibilities for enrich for expanded clones. As a result, only a small number of how such clones could arise. During B cell maturation, the the inferred clones contained multiple cells and were thus H chain rearranges first, and the cell proliferates before the suitable for analysis. As high-quality, single-cell B cell datasets L chain rearranges (1). Thus, it is possible that the misclustered of higher throughput with sorting for relevant B cell subsets clonal relationships we detect represent daughter cells of the emerge, similar analyses could be performed, leading to better same H chain VDJ-rearranged B cell that developed different estimates of performance. In addition, although L chains were http://www.jimmunol.org/ L chains. Alternatively, a group of identical, autoreactive, insufficient to refine accurate H chain–based clones, they did immature B cells could have undergone receptor editing, identify ∼20% of the clones as misclustered and can be critical during which independent rearrangements gave rise to different in other contexts, such as synthesizing functional Abs and L chains paired with an identical H chain rearrangement. In evaluating Ag-binding specificity (27). these scenarios, the H chain–based clonal clustering accurately Clonal relationship inference is an early step crucial for reflects the underlying biology. However, we believe that these computational BCR repertoire analyses. As studies taking scenarios are unlikely as the chance that cells with identical advantage of the relatively low cost of bulk BCR sequencing

H chains but different L chains would developmentally share continue to generate unpaired BCR data, current clonal clus- by guest on September 23, 2021 thesametemporaland/orspatial trajectories, in addition to tering methods can determine most clonal relationships with being sampled together for sequencing, is expected to be very reasonable confidence using H chains only, and their perfor- small. mance may continue to improve by considering additional When examining L chain consistency of cells from an inferred characteristics such as the number of shared mutations in the clone, if more than one L chain sequence was associated with H chains. a cell, we checked every sequence present for a match against the clonal majority VJL combination. We did not, however, Acknowledgments consider the possible complication in which there was partial We thank Dr. Nima Nouri for critical reading of this manuscript. sampling in such a cell. Hypothetically, should two clonally related B cells with dual L chains each have a different L chain Disclosures sequenced, our analysis would have considered an H chain– The authors have no financial conflicts of interest. based clone containing these cells misclustered. However, such B cells have been reported to be rare, especially outside References autoimmunity, with dual-k and dual-l cells estimated to 1. Murphy, K., and C. Weaver. 2017. Janeway’s Immunobiology, 9th Ed. Garland Science, New York. occupy ∼2–10% (24, 25) and 0.2% (26) of the normal 2. Briney, B., A. Inderbitzin, C. Joyce, and D. R. Burton. 2019. Commonality despite murine repertoire, consistent with the percentages of cells exceptional diversity in the baseline human repertoire. Nature 566: 393–397. with multiple L chains observed in the single-cell datasets 3. Robins, H. 2013. Immunosequencing: applications of immune repertoire deep in this study (dual: 1.00–7.06%; triple: 0.00–0.31%; sequencing. Curr. Opin. Immunol. 25: 646–652. 4.Greiff,V.,E.Miho,U.Menzel,andS.T.Reddy.2015.Bioinformaticand quadruple: 0.00–0.02%; all cells in the dataset from (19) had statistical analysis of adaptive immune repertoires. Trends Immunol. 36: 738– exactly one L chain). 749. 5.Gibson,K.L.,Y.-C.Wu,Y.Barnett,O.Duggan,R.Vaughan,E.Kondeatis, Our results suggest several ways that H chain–based clonal B.-O. Nilsson, A. Wikby, D. Kipling, and D. K. Dunn-Walters. 2009. B-cell clustering could be improved. For example, H chain junctions diversity decreases in old age and is correlated with poor health status. Aging Cell 8: were more similar to each other in accurate clones than in 18–25. 6. Hoehn, K. B., G. Lunter, and O. G. Pybus. 2017. A phylogenetic codon substi- misclustered clones, suggesting that the clustering threshold tution model for antibody lineages. Genetics 206: 417–427. was perhaps too lenient for the latter. We also observed that 7. Horns, F., C. Vollmers, D. Croote, S. F. Mackey, G. E. Swan, C. L. Dekker, M. M. Davis, and S. R. Quake. 2016. Lineage tracing of human B cells reveals the accurate clones shared more mutations in their V segment. in vivo landscape of human antibody class switching. [Published erratum appears in Some likelihood-based methods, such as partis, implicitly take 2016 Elife 5.] Elife 5: e16578. 8. Tru¨ck, J., M. N. Ramasamy, J. D. Galson, R. Rance, J. Parkhill, G. Lunter, into consideration shared mutations via the use of a multi-Hidden A. J. Pollard, and D. F. Kelly. 2015. Identification of antigen-specific B cell receptor Markov model that simultaneously emits multiple sequences sequences using public repertoire analysis. J. Immunol. 194: 252–261. 6 CUTTING EDGE: H CHAINS CAN DETERMINE CLONES WITH CONFIDENCE

9. Dunn-Walters, D., C. Townsend, E. Sinclair, and A. Stewart. 2018. Immunoglobulin 18. Busse, C. E., I. Czogiel, P. Braun, P. F. Arndt, and H. Wardemann. 2014. Single- gene analysis as a tool for investigating human immune responses. Immunol. Rev. 284: cell based high-throughput sequencing of full-length immunoglobulin heavy and 132–147. light chain genes. Eur. J. Immunol. 44: 597–603. 10. Ralph, D. K., and F. A. Matsen, IV. 2016. Likelihood-based inference of B cell 19. Croote, D., S. Darmanis, K. C. Nadeau, and S. R. Quake. 2018. High-affinity clonal families. PLOS Comput. Biol. 12: e1005086. allergen-specific human cloned from single IgE B cell transcriptomes. 11.Kepler,T.B.,S.Munshaw,K.Wiehe,R.Zhang,J.-S.Yu,C.W.Woods, Science 362: 1306–1309. T.N.Denny,G.D.Tomaras,S.M.Alam,M.A.Moody,etal.2014.Recon- 20. LeBien, T. W., and T. F. Tedder. 2008. B lymphocytes: how they develop and structing a B-cell clonal lineage. II. Mutation, selection, and affinity maturation. function. Blood 112: 1570–1580. Front. Immunol. 5: 170. 21. Fink, K. 2012. Origin and function of circulating plasmablasts during acute viral 12. Hershberg, U., and E. T. Luning Prak. 2015. The analysis of clonal expansions in infections. Front. Immunol. 3: 78. normal and autoimmune B cell repertoires. Philos. Trans. R. Soc. Lond. B Biol. Sci. 22. Gupta, N. T., J. A. Vander Heiden, M. Uduman, D. Gadala-Maria, G. Yaari, and 370: 20140239. S. H. Kleinstein. 2015. Change-O: a toolkit for analyzing large-scale B cell im- 13. Gupta, N. T., K. D. Adams, A. W. Briggs, S. C. Timberlake, F. Vigneault, and munoglobulin repertoire sequencing data. Bioinformatics 31: 3356–3358. S. H. Kleinstein. 2017. Hierarchical clustering can identify B cell clones with high 23. Lefranc, M.-P., C. Pommie´, M. Ruiz, V. Giudicelli, E. Foulquier, L. Truong, confidence in Ig repertoire sequencing data. J. Immunol. 198: 2489–2499. V. Thouvenin-Contet, and G. Lefranc. 2003. IMGT unique numbering for im- 14. Nouri, N., and S. H. Kleinstein. 2018. A spectral clustering-based method for munoglobulin and T cell receptor variable domains and Ig superfamily V-like identifying clones from high-throughput B cell repertoire sequencing data. domains. Dev. Comp. Immunol. 27: 55–77. Bioinformatics 34: i341–i349. 24. Casellas, R., Q. Zhang, N.-Y. Zheng, M. D. Mathias, K. Smith, and P. C. Wilson. 15. Nielsen, S. C. A., and S. D. Boyd. 2018. Human adaptive immune receptor rep- 2007. Igkappa allelic inclusion is a consequence of receptor editing. J. Exp. Med. ertoire analysis-past, present, and future. Immunol. Rev. 284: 9–23. 204: 153–160. 16. Georgiou, G., G. C. Ippolito, J. Beausang, C. E. Busse, H. Wardemann, and 25. Velez, M.-G., M. Kane, S. Liu, S. B. Gauld, J. C. Cambier, R. M. Torres, and S. R. Quake. 2014. The promise and challenge of high-throughput sequencing of R. Pelanda. 2007. Ig allotypic inclusion does not prevent B cell development or the antibody repertoire. Nat. Biotechnol. 32: 158–168. response. J. Immunol. 179: 1049–1057. 17. DeKosky, B. J., G. C. Ippolito, R. P. Deschner, J. J. Lavinder, Y. Wine, 26. Pelanda, R. 2014. Dual immunoglobulin light chain B cells: trojan horses of B. M. Rawlings, N. Varadarajan, C. Giesecke, T. Do¨rner, S. F. Andrews, et al. autoimmunity? Curr. Opin. Immunol. 27: 53–59. 2013. High-throughput sequencing of the paired human immunoglobulin heavy 27. Robinson, W. H. 2015. Sequencing the functional antibody repertoire--diagnostic Downloaded from and light chain repertoire. Nat. Biotechnol. 31: 166–169. and therapeutic discovery. Nat. Rev. Rheumatol. 11: 171–182. http://www.jimmunol.org/ by guest on September 23, 2021 C A B

Density

Density Density 0 2 4 6 8 0 1 2 3 4 5 6 Subject 0.0 0.0 B cellsfrom healthy PBMCs PA16 PA15 PA14 PA13 PA12 PA11 Total B cellsfrom food−allergic B individual PA13 0.2 0.2 froma

- (NSCLC)tumor Cells 0.2 0.2 lymphoblastoid lung carcinoma Cell counts for for counts Cell CD19+ B cells B CD19+ (unsorted) non healthy healthy As in Croote Croote in As i solatedfrom PBMCs of a PBMCs of et al., 2018 al., et 0.4 0.4 Description

- GM12878 (unsorted) squamous small cell line line cell PBMCs Dataset

0.6 0.6 donor donor 971 123 110 212 202 267

of a of cell cell 59

Cell counts for 10x Genomics datasets at various stages of data processing data of stages various at datasets Genomics 10x for counts Cell

Density Density dataset 2018 al. et Croote 0 1 2 3 4 5 6 7 0 1 2 3 4 Number of B cells B of Number After filtering filtering After Enrichment Enrichment BCR contigs BCR 0.0 0.0 Expression Expression B cellsfrom NSCLCtumor B cellsfrom food−allergic Direct Ig Ig Direct Ig Direct Platform V(D)J + + V(D)J + V(D)J 5’ Gene 5’ Gene 5’ individual PA14 0.2 0.2 0.2 0.22 968 123 110 211 200 265 59 0.4 0.4

Hamming distance (normalized bylength) (normalized distance Hamming (Catalogno. (Catalogno. (Catalogno. GM12878) After filtering for for filtering After 1 VH1 and

PB010 0.6 0.6 resection AllCells AllCells surgical PB001) A fresh fresh A Coriell Coriell Source -

0) Density 10 Density

cells w/ cells 0 2 4 6 8 0 1 2 3 4 5

³ 0.0 0.0 1 VL1 B cellsfrom food−allergic 968 123 110 211 200 265 Total 9465 7802 7726 59 0.1 854 individual PA15 0.2 0.2 CD19+ Bcells

0.2

0.4 0.4 available were likely likely were separate the bi a same partition seque junction chain heavy other to distance non smallest junction the sequence, chain heavy each length. for junction partition, each chain Within heavy and gene, IGHJ IGHV gene, same the sharing groups into partitioned first Cells clustering. hierarchical via clones inferring a choosing guide to plot individuals. allergic Genomics. 10x A. 1. Figure Supplemental contig(s) contig(s) ahd line dashed 9465 2953 1321 BCR 0.6 0.6 854

filtering to be

Density Density 10 12 14 contigs 0 2 4 6 8 0 2 4 6 8 Number of Cells of Number , 9219 1817 1255 After After 0.0 0.0 mod BCR was was calculated.

819 B cellsfrom food−allergic B cellsfrom food−allergic

a coe va aul inspection manual via chosen was B. clonally related from from related clonally al al distribution

individual PA16 individual PA11 0.2 0.2 umr o B el fo food from cells B of Summary 0.2 0.2 C. C. for cells w/ 1 VH1 and

filtering 0.4 0.4 Distance ³ fixed 8454 1530 1196 After After - 1 VL1 eo uloie Hamming nucleotide zero Summary of datasets from 799

A thresh 0.6 0.6

distance threshold for for threshold distance representing -

to profile of a Bcell a of profile After filtering for for filtering After

Density - transcriptomic nearest neighbor neighbor nearest 0 1 2 3 4 5 6 7 old unrelated ones. unrelated 0.0 B cellsfrom food−allergic ,

i ndicated nces nces cells w/ cells individual PA12 0.2 cells cells that 0.2 1388 1115 in the in

were were 0.4 - -

by to

- 0.6 A Dataset : CD19+ B cells Dataset: CD19+ B cells Dataset: CD19+ B cells B Fisher’s combined p=0.957 clone ID: 3147 Inferred germline clone ID: 3929 Inferred germline clone ID: 3937 Inferred germline (Mann-Whitney for misclustered > accurate) IGHV3−23, IGHJ4 IGHV3−15, IGHJ4 IGHV5−10−1, IGHJ6 0 0 0 133 24 95 3 10 2 7 1 2 6 IGLV1-44 1 13 IGLJ1

0.15 4 14 42nt (1) IGKV3-20 IGKV1-39 IGKJ2 4 2 1 3 IGLV2-14 IGKV3-20 IGKJ2 5 1 IGKV1-12 IGKV1-27 IGLJ2 60nt (1) IGKJ2 33nt (3) IGKJ3 IGKJ1 36nt (1) 5 2 36nt (4) 33nt (1) 33nt (1) ●

0.10 ● ● Dataset: B cells from NSCLC tumor Dataset: B cells from NSCLC tumor Dataset: B cells from food-allergic individuals ● clone ID: 362 clone ID: 636 clone ID: PA14_29 Inferred germline Inferred germline ● Inferred germline ● ● IGHV1 −24, IGHJ4 IGHV4−34, IGHJ6 ● ● ●● IGHV3−23, IGHJ4 ● ● ●● 2 0 0 ● ● ●

0.05 ●● ● ●●● ● ●●●● ● ● ●● ● IGKV1-39 ● ● ● ●● ●● ●●● ●● IGKJ1 ●●●● ● 2 5 ●●●● ● 5 9 ●●●●●● ● 12 3 ●●● 33nt (1) ●●●●● ●● ● ●●● ● ● ● ● ●● ● ● ●●●●● ● ● ●●

Average IGK/LV mutation frequency mutation IGK/LV Average ● ● ● IGLV3-1 ●● ● ●● ● ● IGLV2-11 10 12 1 3 ●●●●●●●●●●●●●● ● ●●●●● ● IGLJ2 17

4 0.00 36nt (1) IGKV2-28 IGLV1-47 IGLJ1 Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. IGKJ4 5 2 IGLJ3 36nt (2) clones clones clones clones clones clones clones clones 33nt (4) IGKV3-20 IGLV2-14 39nt (1) 2 2 CD19+ B cells from B cells from B cells from IGKJ1 IGLJ3 33nt (3) B cells NSCLC food−allergic healthy ● Inferred 36nt (3) tumor individuals PBMCs ● Light chain w/ majority VJL combination ● Inferred ● Light chain w/ non−majority VJL combination ● Light chain w/ majority VJL combination ● Light chain w/ non−majority VJL combination C Fisher’s combined p=0.810 D Fisher’s combined p=5.6*10-9 E Fisher’s combined p=2.8*10-5 (two-sided Mann-Whitney) (two-sided Mann-Whitney) (two-sided Mann-Whitney)

133 24 95 3 10 2 7 1 133 24 95 3 10 2 7 1 50 133 24 95 3 10 2 7 1 81 ● 78 ● ● 75 72 ● ● 0.3 69 ● ● 40 ● 66 ●●●●●● ● ● 63 ●●●● ● 60 ●●●●●●● ● ● ● ●● ●● ●● ● ● ●● 57 ● 30 ●●●●●●●●● ● 0.2 ● 54 ● ● ● ● ●●●●●● ● ● ● ● ● 51 ● ● ●● ● ●●●●●●●●● ●● ●● ● ●● ● 48 ● ● ●●● ● ● ●●●●●●●●●●● ●●●●● ● ● ●● 20 ● ●● ● 45 ●●● ● ● ●●●● ● ● ● ●● ● ●● 42 ●● ●●●●●●●●● ●● 39 ● ● ● 0.1 ● ● ● ● ●●● ● ●● ●●●● ● ● ● ● ● ●● 36 ●● ● ● ● ● ● ● ● ● ●●●● ●● ● 33 ● 10 ●● ●●●● ● ●●●●●● ● ● ●●●● 30 ● ●● ● ● ●● ● ● ●●● ● ●● 27 ● ●● ●●●● Heavy chain junction length (nt) Heavy ● ●●● ● ●● ● ● ●● 24 ● ● ●●● ●●● ●●●●●●●●● 21 ● 0.0 ●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●● Hamming distance normalized by length Hamming distance normalized by 0 Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. mutations Number of shared nucleotide Acc. Mis. Acc. Mis. Acc. Mis. Acc. Mis. clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones clones CD19+ B cells from B cells from B cells from CD19+ B cells from B cells from B cells from CD19+ B cells from B cells from B cells from B cells NSCLC food−allergic healthy B cells NSCLC food−allergic healthy B cells NSCLC food−allergic healthy tumor individuals PBMCs tumor individuals PBMCs tumor individuals PBMCs

F Bonferroni-corrected p<0.001 for each dataset Supplemental Figure 2. A. Maximum-parsimony trees based on IMGT- (two-sided Mann-Whitney) numbered nucleotides 1-312 of IGHV for misclustered clones with >2 cells. 1156 133 275 95 82 10 213 7 Clonal members are shown as nodes and colored according to whether their light chain carries the clonal majority VJL combination. In case of a tie, a majority is designated arbitrarily (such as in Clone 3937). Cells with identical IGHV positions 1-312 share the same node, with node size proportionate to the 0.6 number of cells at that node. Edge labels represent the number of mutations accumulated from one node to the next. B. Average IGK/LV mutation frequency across cells in accurate clones compared with that across cells expressing clonal non-majority VJL combination in misclustered clones. C. 0.4 Heavy chain junction lengths of accurate clones versus misclustered clones. D. The maximum pairwise distance between heavy chain junction sequences in accurate clones, versus the minimum pairwise distance between cells expressing inconsistent light chains in misclustered clones. E. The minimum 0.2 pairwise shared IGHV mutations in accurate clones, versus the maximum pairwise shared IGHV mutations between cells expressing inconsistent light chains in misclustered clones. F. Means of the distributions of the smallest Mean normalized Hamming distance from the junction of a cell to its nearest neighbor normalized Hamming distance between heavy chain junctions in partitions of 0.0 Heavy Light Heavy Light Heavy Light Heavy Light cells with the same heavy chain VJL combinations, and between light chain CD19+ B cells from B cells from B cells from junctions in accurate clones inferred via spectral clustering. A distribution was B cells NSCLC food−allergic healthy tumor individuals PBMCs calculated for each heavy chain-based partition or clone. The mean of each distribution is visualized. Acc. = Accurate. Mis. = Misclustered.