Exome-Scale Discovery of Hotspot Mutation Regions in Human Cancer Using 3D Protein Structure Collin Tokheim1, Rohit Bhattacharya1, Noushin Niknafs1, Derek M
Total Page:16
File Type:pdf, Size:1020Kb
Published OnlineFirst April 28, 2016; DOI: 10.1158/0008-5472.CAN-15-3190 Cancer Integrated Systems and Technologies: Mathematical Oncology Research Exome-Scale Discovery of Hotspot Mutation Regions in Human Cancer Using 3D Protein Structure Collin Tokheim1, Rohit Bhattacharya1, Noushin Niknafs1, Derek M. Gygax2, Rick Kim2, Michael Ryan2, David L. Masica1, and Rachel Karchin1,3 Abstract The impact of somatic missense mutation on cancer etiology hotspot regions. In addition to experimentally determined and progression is often difficult to interpret. One common protein structures, we considered high-quality structural mod- approach for assessing the contribution of missense mutations els, which increase genomic coverage from approximately in carcinogenesis is to identify genes mutated with statistically 5,000 to more than 15,000 genes. We provide new evidence nonrandom frequencies. Even given the large number of that 3D mutation analysis has unique advantages. It enables sequenced cancer samples currently available, this approach discovery of hotspot regions in many more genes than previ- remains underpowered to detect drivers, particularly in less ously shown and increases sensitivity to hotspot regions in studied cancer types. Alternative statistical and bioinformatic tumor suppressor genes (TSG). Although hotspot regions have approaches are needed. One approach to increase power is to long been known to exist in both TSGs and oncogenes, we focus on localized regions of increased missense mutation provide the firstreportthattheyhavedifferent characteristic density or hotspot regions, rather than a whole gene or protein properties in the two types of driver genes. We show how cancer domain. Detecting missense mutation hotspot regions in three- researchers can use our results to link 3D protein structure and dimensional (3D) protein structure may also be beneficial the biologic functions of missense mutations in cancer, and to because linear sequence alone does not fully describe the generate testable hypotheses about driver mechanisms. Our biologically relevant organization of codons. Here, we present results are included in a new interactive website for visualizing a novel and statistically rigorous algorithm for detecting mis- protein structures with TCGA mutations and associated hotspot sense mutation hotspot regions in 3D protein structures. We regions. Users can submit new sequence data, facilitating the analyzed approximately 3 Â 105 mutations from The Cancer visualization of mutations in a biologically relevant context. Genome Atlas (TCGA) and identified 216 tumor-type–specific Cancer Res; 76(13); 3719–31. Ó2016 AACR. Introduction Major Findings Missense mutations are perhaps the most difficult mutation We used The Cancer Genome Atlas mutation data and type to interpret in human cancers. Truncating loss-of-function fi identi ed 3D clusters of cancer mutations ("hotspot regions") mutations and structural rearrangements generate major at amino-acid-residue resolution in 91 genes, of which 56 are changes in the protein product of a gene, but a single missense fi known cancer-associated genes. The hotspot regions identi ed mutation yields only a small change in protein chemistry. The – by our method are smaller than a protein domain or protein impact of missense mutation on protein function, cellular protein interface and in many cases can be linked precisely behavior, cancer etiology, and progression may be negligible with functional features such as binding sites, active sites, and or profound, for reasons that are not yet well understood. sites of experimentally characterized mutations. The hotspot Missense mutations are frequent in most cancer types, account- regions are shown to be biologically relevant to cancer, and we ing for approximately 85% of the somatic mutations observed discovered that there are characteristic differences between in solid human tumors (1), and the cancer genomics regions in the two types of driver genes, oncogenes and tumor suppressor genes (TSG). These differences include region size, mutational diversity, evolutionary conservation, and amino 1Department of Biomedical Engineering and Institute for Computa- 2 acid residue physiochemistry. For the first time, we quantify tional Medicine, Johns Hopkins University, Baltimore, Maryland. In Silico Solutions, Fairfax, Virginia. 3Department of Oncology, Johns why the great majority of well-known hotspot regions occur in Hopkins University School of Medicine, Baltimore, Maryland. oncogenes. Because hotspot regions in TSGs are larger, more Note: Supplementary data for this article are available at Cancer Research fi heterogeneous than those in oncogenes, they are more dif cult Online (http://cancerres.aacrjournals.org/). to detect using protein sequence alone and are likely to be – Corresponding Author: Rachel Karchin, Johns Hopkins University School of underreported. Our results indicate that protein structure Medicine, 217 A Hackerman Hall, 3400 N. Charles St., Baltimore, MD 21218. Phone: based 3D mutation clustering increases power to find hotspot 410-516-5578; Fax: 410-516-5294; E-mail: [email protected] regions, particularly in TSGs. doi: 10.1158/0008-5472.CAN-15-3190 Ó2016 American Association for Cancer Research. www.aacrjournals.org 3719 Downloaded from cancerres.aacrjournals.org on September 30, 2021. © 2016 American Association for Cancer Research. Published OnlineFirst April 28, 2016; DOI: 10.1158/0008-5472.CAN-15-3190 Tokheim et al. Quick Guide to Equations and Assumptions An experimentally determined or theoretically modeled protein structure consists of a set of atoms, each with a unique coordinate in three-dimensional (3D) Euclidean space. Each amino acid residue consists of many atoms and may harbor zero, one, or multiple missense mutations in a cohort of sequenced cancer samples. Two key mathematical concepts in our study are the density of local missense mutations in 3D space, which underlies our statistical measure to define missense mutation hotspot regions, and mutational diversity of a hotspot region. Local missense mutation density X k ¼ k Dr Mn 2 k n Nr is definedforeachaminoacidresiduer and each protein structure k. It considers the sum of the count of missense mutations that occurred at r and those that occurred at residues proximal to r, that is, in its "neighborhood." Proximity is measured in 3D space and the neighborhood is limited to residues up to 1 nm away from r, where 1 nm was chosen because it is the order of k th magnitude of an amino acid side chain. The term Mn is the missense mutation count for the n residue neighbor of r. The k observed value of Dr is compared with simulations of its value under an empirical null distribution, where the total number of missense mutations observed in k remains the same, but they are distributed uniformly in 3D. Residue r has significantly k increased Dr if its adjusted P value is 0.01 after multiple testing correction. A 3D hotspot region is a grouping of residues fi k fi with signi cantly increased Dr that are linked as connected components in a neighbor graph. Our algorithm can nd 3D hotspot regions directly on protein complexes, enabling detection of hotspot regions that occur on both sides of a protein– protein interface. It also handles complexes with multiple chains originating from a single gene product (e.g., a homodimer) by running identical simulations simultaneously. g Mutational diversity is computed for each hotspot region si (where i indexes the region and g indexes the gene) based on the Shannon entropy of the joint probability of a missense mutation occurring at a specificresiduer and having a specificmutant amino acid m X X g g g g g g si ; si ¼ si ¼ ; si ¼ si ¼ ; si ¼ HR M PR r M m log2PR r M m g g s s r2R i 2 i m Mr Because the maximum possible Shannon entropy grows with the number of residues in a hotspot region, the score is normalized so hotspot regions of different sizes can be compared. sg sg ÀÁ HRi ; M i g ¼ MD si HmaxðÞN; R; A N is the number of mutations in the hotspot region, R is the number of residues, and A is the number of possible alternate amino acids per residue. In this work, mutational diversity is found to be significantly different between hotspot regions that occur in oncogenes versus those that occur in tumor suppressor genes. Major assumptions of the model: * In the absence of selection for drivers, somatic missense mutations in cancers are equally likely to appear at any amino acid residue position in a protein structure of interest. * Many driving missense mutations have significantly increased local mutation density. * Residues with significantly increased mutation density and proximal to each other in three dimensions are likely to be subject to similar selective pressures and can be grouped together into hotspot regions. * The most parsimonious number of hotspot regions in a protein structure is preferred. * Carefully filtered theoretical protein structure models are accurate enough to capture local missense mutation densities and groupings of proximal residues with significantly increased densities. community has prioritized the task of identifying important chance (2–5). Metrics to call SMGs now appear to be under- missense mutations discovered in sequencing studies. Whole- powered given the size of current cohorts in The Cancer exome sequencing (WES) studies of cancer have created new Genome Atlas (TCGA) and International Cancer Genome opportunities to better understand the importance of missense Consortium (ICGC). A recent study suggested that approxi- mutations. This enormous collection of data now allows mately 1,500 cases of endometrial cancer would need to be detection of patterns with power that was unheard of a few sequenced to attain 90% power to detect mutations in 90% of years ago. genes with a mutation frequency of 2% with the SMG approach The first approaches to identify cancer drivers from WES (5). The recognition of the limitations of the SMG paradigm mutations looked for significantly mutated genes (SMG), har- has motivated interest in orthogonal analysis techniques to boring a larger number of somatic mutations than expected by detect mutational patterns associated with drivers (1, 6–9).