A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data
Total Page:16
File Type:pdf, Size:1020Kb
A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data This information is current as Yaxuan Yu, Rhodri Ceredig and Cathal Seoighe of September 26, 2021. J Immunol published online 23 January 2017 http://www.jimmunol.org/content/early/2017/01/20/jimmun ol.1601710 Downloaded from Supplementary http://www.jimmunol.org/content/suppl/2017/01/20/jimmunol.160171 Material 0.DCSupplemental Why The JI? Submit online. http://www.jimmunol.org/ • Rapid Reviews! 30 days* from submission to initial decision • No Triage! Every submission reviewed by practicing scientists • Fast Publication! 4 weeks from acceptance to publication *average by guest on September 26, 2021 Subscription Information about subscribing to The Journal of Immunology is online at: http://jimmunol.org/subscription Permissions Submit copyright permission requests at: http://www.aai.org/About/Publications/JI/copyright.html Email Alerts Receive free email-alerts when new articles cite this article. Sign up at: http://jimmunol.org/alerts Errata An erratum has been published regarding this article. Please see next page or: /content/198/9/3758.full.pdf The Journal of Immunology is published twice each month by The American Association of Immunologists, Inc., 1451 Rockville Pike, Suite 650, Rockville, MD 20852 Copyright © 2017 by The American Association of Immunologists, Inc. All rights reserved. Print ISSN: 0022-1767 Online ISSN: 1550-6606. Published January 23, 2017, doi:10.4049/jimmunol.1601710 The Journal of Immunology A Database of Human Immune Receptor Alleles Recovered from Population Sequencing Data Yaxuan Yu,* Rhodri Ceredig,† and Cathal Seoighe* High-throughput sequencing data from TCRs and Igs can provide valuable insights into the adaptive immune response, but bioinformatics pipelines for analysis of these data are constrained by the availability of accurate and comprehensive repositories of TCR and Ig alleles. We have created an analytical pipeline to recover immune receptor alleles from genome sequencing data. Applying this pipeline to data from the 1000 Genomes Project we have created Lym1K, a collection of immune receptor alleles that combines known, well-supported alleles with novel alleles found in the 1000 Genomes Project data. We show that Lym1K leads to a significant improvement in the alignment of short read sequences from immune receptors and that the addition of novel alleles discovered from genome sequence data are likely to be particularly significant for comprehensive analysis of populations that are not currently well represented in existing repositories of immune alleles. The Journal of Immunology, 2017, 198: 000–000. Downloaded from he adaptive immune system provides protection from input sequences with the reference V, D (for Ig H chain and TCR disease by initiating specific immune responses to specific b-chain), and J genes, determine the CDR3 region, and estimate T foreign Ags following their recognition by the clonally the repertoire diversity. Two main factors that directly influence distributed surface-bound receptors of T and B lymphocytes, namely the quality of these analyses are the alignment algorithm and the TCRs for T cells and Igs for B cells. In its secreted form, the B cell’s completeness and accuracy of the reference database. In the past 3 http://www.jimmunol.org/ Ig molecule is called an Ab. The variability in molecular shapes of y, several software packages have been developed with different foreign Ags represents a significant challenge for the immune alignment algorithms for analyzing next-generation sequencing system, requiring development of an extensive receptor repertoire sequence data of TCR and/or Ig (11–19). These studies all focused for Ag recognition. In response to this challenge, TCRs and Igs on improving the accuracy, effectiveness, and completeness of achieve vast diversity by a unique mechanism called VDJ recom- their alignment algorithms. Most of the software packages use bination, which is a largely stochastic process of gene rearrange- the reference V(D)J sequences provided by the international ment encoding the variable parts of TCRs and Igs, respectively. ImMunoGeneTics information system (IMGT) (20), which is cur- Additionally, genes encoding Igs undergo somatic hypermutation rently the most comprehensive collection of TCR and Ig germline during T cell–dependent B cell responses, resulting in further ex- gene sequences available. by guest on September 26, 2021 pansion in Ab diversity and subsequent selection of B cell clones However, previous studies have reported numerous novel V generating Igs with higher affinity for Ag. Quantitative analysis of alleles that are not found in the IMGT database (21–25), suggesting the diversity of TCR and Ig repertoires can shed light on receptor– that this database is incomplete. Additionally, a study of 226 Ag interactions and reveal insights into the state of the immune IGHV alleles in IMGT reported a high rate of false positives (of system under diverse conditions, such as aging (1–3), chronic viral 226 alleles, 104 were classified as very likely to contain se- infections (4–6), and autoimmune diseases (7–10). quencing errors) (26). This is because most of the IMGT TRBV Next-generation sequencing technology revolutionized the and IGHV alleles were included from early studies, conducted analysis of TCR/Ig diversity by providing vast amounts of data at between 1984 and 1996 (27, 28). After that, there were only two much higher resolution, leading also to a need for effective and major updates on IGHV alleles in 2002 and 2009 introducing in robust bioinformatics pipelines. Generally, these pipelines align the total 36 alleles, and no updates on TRBV alleles have been sub- mitted in the past 20 y. Furthermore, many alleles were reported in only one study, and multiple allelic variants of the same gene were *School of Mathematics, Statistics and Applied Mathematics, National University of † derived from the same individual. For instance, an individual can Ireland Galway, Galway H91 HX31, Ireland; and Discipline of Psychology, National University of Ireland Galway, Galway H91 HX31, Ireland have at most two alleles of any gene; however, seven IGHV3-15 ORCIDs: 0000-0002-4393-727X (R.C.); 0000-0003-0496-4029 (C.S.). alleles were derived from one individual. This was pointed out by Received for publication October 5, 2016. Accepted for publication January 3, 2017. Wang et al. (26), but the issue remains unresolved in the current version of IMGT. Additionally, many TRBV and TRAV alleles in This work was supported by the European Union Seventh Framework Programme (FP7/2007–2013) under Grant 317040 (QuanTI). Funding for open access was by the IMGT are annotated as having come from rearranged sequences, European Union Seventh Framework Programme. which may contain somatic mutations and partially truncated 39 Address correspondence and reprint requests to Prof. Cathal Seoighe, National ends resulting from VDJ recombination. University of Ireland Galway, School of Mathematics, Statistics and Applied Since the development of IMGT, a large and ever-increasing Mathematics, University Road, Galway H91 HX31, Ireland. E-mail address: [email protected] number of full human genomes have been sequenced. The chro- The online version of this article contains supplemental material. mosomal regions containing the immune receptor alleles have both Abbreviations used in this article: G1K, 1000 Genomes Project; IMGT, international a high level of intrachromosomal duplication and a high level of ImMunoGeneTics information system; SNP, single nucleotide polymorphism; SRA, diversity and thus present challenges for genome assembly. National Center for Biotechnology Information Sequence Read Archive; VCF, var- However, improvement of the reference genome over time has the iant cell format. potential to enable recovery of a comprehensive set of immune gene Copyright Ó 2017 by The American Association of Immunologists, Inc. 0022-1767/17/$30.00 alleles found in human populations from population resequencing www.jimmunol.org/cgi/doi/10.4049/jimmunol.1601710 2 TCR/Ig ALLELES RECOVERED FROM POPULATION SEQUENCING DATA studies. We have developed a bioinformatics pipeline that can be we randomly chose a V, D (only for the H chain of Igs and the b-chain of used to generate a TCR/Ig reference database using data from TCRs), and J gene assuming uniform gene usage. We then decided the population resequencing studies. We applied this pipeline to data particular allele of the chosen V gene based on the frequency of its oc- currence in the chosen population. Second, we “mutated” the chosen V and from the 1000 Genomes Project (G1K) (29), which contains the J segments by introducing 0–10 mismatches on the V region and 0–5 most comprehensive collection of global human variation currently mismatches on the J region. To simulate junction diversity, we inserted available in the public domain, to create the Lym1K database. We zero to six randomly generated nucleotides into the V-D and D-J junction a aligned short read sequence data from human immune receptors to (V-J junction only for the Ig L chain and TCR -chain). We applied our simulation pipeline for TCRB, TCRA, IGH, and IGL Lym1K and obtained significantly improved alignments compared genes separately. In all, we produced 27 datasets, with separate datasets with what could be achieved using the IMGT database as reference. corresponding to each of the 26 populations included in G1K and 1 dataset We also report evidence that the diversity of immune receptor al- derived from the pool of all populations. Each dataset consisted of 20 leles observed in non-African populations is particularly under- simulated samples, and each sample contained 200,000 simulated represented in the existing database. sequences. G1K data Materials and Methods Immune receptor sequencing dataset We retrieved variant cell format (VCF) files corresponding to the TCRB, TCRA, IGH, and IGL regions from phase 3 of G1K. We used the R We obtained two datasets from the National Center for Biotechnology In- package SNPRelate (30) to perform principal component analysis sepa- formation Sequence Read Archive (SRA).