Using Large Sequencing Data Sets to Refine Intragenic Disease Regions and Prioritize Clinical Variant Interpretation
Total Page:16
File Type:pdf, Size:1020Kb
ORIGINAL RESEARCH ARTICLE © American College of Medical Genetics and Genomics Using large sequencing data sets to refine intragenic disease regions and prioritize clinical variant interpretation Sami S. Amr, PhD1,2, Saeed H. Al Turki, PhD1, Matthew Lebo, PhD1,2, Mahdi Sarmady, PhD3, Heidi L. Rehm, PhD1,2 and Ahmad N. Abou Tayoun, PhD3 Purpose: Classification of novel variants is a major challenge fac- commonly impacted in symptomatic individuals. Our data show ing the widespread adoption of comprehensive clinical genomic that a significant number of exons and domains in genes strongly sequencing and the field of personalized medicine in general. This is associated with disease can be defined as disease-sensitive or disease- largely because most novel variants do not have functional, genetic, tolerant, leading to potential reclassification of at least 26% (450 out or population data to support their clinical classification. of 1,742) of variants of uncertain clinical significance in the 132 genes. Methods: To improve variant interpretation, we leveraged the Exome Conclusion: This approach leverages domain functional annotation Aggregation Consortium (ExAC) data set (N = ~60,000) as well as and associated disease in each gene to prioritize candidate disease 7,000 clinically curated variants in 132 genes identified in more than variants, increasing the sensitivity and specificity of novel variant 11,000 probands clinically tested for cardiomyopathies, rasopathies, assessment within these genes. hearing loss, or connective tissue disorders to perform a systematic evaluation of domain level disease associations. Genet Med advance online publication 22 September 2016 Results: We statistically identify regions that are most sensitive Key Words: disease burden; domains; intragenic; intolerance; to functional variation in the general population and also most variant interpretation INTRODUCTION program is dissemination of standards and guidelines for vari- Genomic sequencing in the form of large gene panels, whole- ant interpretation and gene–disease association.7,8 In a simi- genome sequencing or whole-exome sequencing has entered lar approach for gene–disease associations, we have recently into genetics clinics with thousands or even tens of thousands shown that a systematic evaluation of genes associated with of such tests performed to date, significantly increasing the hearing loss can largely eliminate unnecessary interpretation of diagnostic yields in several clinical scenarios.1–4 The promise of variants in genes with weak disease associations.9 Still, a large genomic medicine has long been acknowledged to be at the cen- number of novel variants of uncertain clinical significance are ter of precision medicine by the scientific community and has routinely identified in genes with strong disease associations, just recently gained national recognition.5 With such compre- necessitating novel approaches to prioritize candidate disease- hensive testing, however, come several challenges, the key one causing variants. of which is the interpretation of the large number of variants— The use of population sequencing data sets such as the Exome estimated to be 3–5 million per human genome—generated in Aggregation Consortium (ExAC) has been useful in genome- the process, with thousands of which having uncertain clinical wide assessment of the tolerance of genes to variation and has significance. Even more alarming is the fact that the majority shown a greater degree of intolerance for genes that are associ- of variants identified are very rare, having been seen in only ated with Mendelian disorders.10–13 A similar strategy at a gene one family.6 Characterizing the functional impact of all variants level can potentially identify tolerant or sensitive intragenic identified requires tremendous resources that are beyond the regions for a particular class of variants for any given disease capacity of any single molecular diagnostic laboratory. gene.14 The tolerance predictions of these regions to different Recognizing this challenge, the National Institutes of Health classes of variation can be further enhanced through the use of has recently funded three different groups under the Clinical clinically interpreted variant data sets from diagnostic laborato- Genome Resource (ClinGen) Program, whose main goal is to ries. Here, we used variant data sets from population databases build and maintain a publicly accessible genomic knowledge and from a clinical laboratory to statistically define domain- base to promote variant data sharing among clinical labora- level disease associations and to assess exon-level tolerance of tories, researchers, and clinicians.6 Another major goal of this loss-of-function variants across 132 genes included in gene 1Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, Massachusetts, USA; 2Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, USA; 3Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, The Children’s Hospital of Philadelphia, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA. Correspondence: Heidi L. Rehm ([email protected]) or Ahmad N. Abou Tayoun ([email protected]) Submitted 13 May 2016; accepted 28 July 2016; advance online publication 22 September 2016. doi:10.1038/gim.2016.134 496 Volume 19 | Number 5 | May 2017 | GENETICS in MEDICINE Characterizing intragenic disease burden | AMR et al ORIGINAL RESEARCH ARTICLE panels offered at our laboratory. We evaluated the utility of this hidden Markov models, downloaded from UCSC tables, approach by examining the impact of regions defined through last updated 1 July 2013). Coding regions that do not over- our analysis of classification of variants previously determined lap with any Pfam domain were labeled as “outside domains” to be of uncertain clinical significance. In addition, we show a for each transcript. We then generated a 2 × 2 table at each significant role for our approach in refining the variant inter- protein domain/“outside domain” region for the number of pretation process. positives and negatives in cases and controls and assessed sig- nificance using Fisher’s exact test. All P values were adjusted by MATERIALS AND METHODS Bonferroni correction based on the total number of domains Clinically curated variants (the significance level was <1.52 × 10–5). All significant, tolerant, A total of 186,009 variant observations representing 6,978 and intolerant regions are listed in Table 1 and Supplementary unique variants (Supplementary Table S1 online) in 132 genes Table S4 online. (Supplementary Table S2 online) were identified in a cohort As a quality check, significantly “intolerant” regions were of 11,219 probands tested at the Laboratory for Molecular confirmed to have adequate coverage in ExAC (Supplementary Medicine between 2005 and 2015. This cohort consisted of Table S4 online), thus ruling out any potential false-positive individuals affected with cardiomyopathies (n = 5,466), rasopa- results due to the lack of coverage in the general population. thies (n = 3,022), hearing loss (n = 1,990), and connective tissue Furthermore, all identified regions were validated for enrich- disorders (n = 741). ment of clinically significant variants in our internal database, All 6,978 unique variants were manually assessed for clinical HGMD, and ClinVar. significance and each was classified into one of five categories based on its potential impact and role in causing the patient’s dis- Loss-of-function variant analysis per RefSeq transcripts ease: pathogenic (P), likely pathogenic (LP), variant of uncertain After reannotating variants from the Exome Aggregation significance (VUS), likely benign (LB), and benign (B). Details Consortium, we investigated all RefSeq transcripts in the 132 of the variant classification workflow at our laboratory have genes (361 transcripts and 2,825 unique exons) to identify been described.15 The third category, VUS, is further refined exons with high-allele frequency (MAF ≥0.1%) loss-of-func- into three subcategories (VUS–favor benign (FB), VUS, and tion (LoF) variant(s) and/or with multiple such variants. The VUS–favor pathogenic (FP)) based on whether evidence favors loss-of-function effect of each variant was predicted by SnpEff a benign or pathogenic classification but does not meet thresh- v4.0e (see details under “Protein Domain Burden Analysis”) olds or criteria for a likely pathogenic or likely benign classifica- and included mainly stop-gains, splice sites (± 1 or 2), frame- tion. VUS-FP leans toward LP and VUS-FB leans toward LB, shifts, and start-losses. whereas the VUS subcategory denotes a lack of evidence to favor Exons with high-allele frequency LoF variants were inter- a causative or neutral role or equally conflicting evidence. preted in the context of disease prevalence, mode of inheri- tance, potential annotation errors, and/or gene structure (see Protein domain burden analysis “Results”). The numbers of probands positive or negative for each of the 6,978 clinically curated variants were determined using RESULTS our internal database. For controls, we queried the Exome Intragenic disease burden Aggregation Consortium (release 0.3)12 database (60,706 whole A significant number of variants in known genes are continu- exome samples) for all high-quality (PASS) variants and their ally identified that have uncertain clinical significance due allele frequencies