Building a Statistical Model for Predicting Cancer Genes
Total Page:16
File Type:pdf, Size:1020Kb
Dartmouth College Dartmouth Digital Commons Open Dartmouth: Published works by Dartmouth faculty Faculty Work 11-15-2012 Building a Statistical Model for Predicting Cancer Genes Ivan P. Gorlov The University of Texas MD Anderson Cancer Center Christopher J. Logothetis The University of Texas MD Anderson Cancer Center Shenying Fang The University of Texas MD Anderson Cancer Center Olga Y. Gorlova The University of Texas MD Anderson Cancer Center Christopher Amos Dartmouth College Follow this and additional works at: https://digitalcommons.dartmouth.edu/facoa Part of the Oncology Commons Dartmouth Digital Commons Citation Gorlov, Ivan P.; Logothetis, Christopher J.; Fang, Shenying; Gorlova, Olga Y.; and Amos, Christopher, "Building a Statistical Model for Predicting Cancer Genes" (2012). Open Dartmouth: Published works by Dartmouth faculty. 2572. https://digitalcommons.dartmouth.edu/facoa/2572 This Article is brought to you for free and open access by the Faculty Work at Dartmouth Digital Commons. It has been accepted for inclusion in Open Dartmouth: Published works by Dartmouth faculty by an authorized administrator of Dartmouth Digital Commons. For more information, please contact [email protected]. Building a Statistical Model for Predicting Cancer Genes Ivan P. Gorlov1*, Christopher J. Logothetis1, Shenying Fang2, Olga Y. Gorlova3, Christopher Amos4 1 Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, 2 Department of Cancer Genetics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, 3 Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, 4 Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire, United States of America Abstract More than 400 cancer genes have been identified in the human genome. The list is not yet complete. Statistical models predicting cancer genes may help with identification of novel cancer gene candidates. We used known prostate cancer (PCa) genes (identified through KnowledgeNet) as a training set to build a binary logistic regression model identifying PCa genes. Internal and external validation of the model was conducted using a validation set (also from KnowledgeNet), permutations, and external data on genes with recurrent prostate tumor mutations. We evaluated a set of 33 gene characteristics as predictors. Sixteen of the original 33 predictors were significant in the model. We found that a typical PCa gene is a prostate-specific transcription factor, kinase, or phosphatase with high interindividual variance of the expression level in adjacent normal prostate tissue and differential expression between normal prostate tissue and primary tumor. PCa genes are likely to have an antiapoptotic effect and to play a role in cell proliferation, angiogenesis, and cell adhesion. Their proteins are likely to be ubiquitinated or sumoylated but not acetylated. A number of novel PCa candidates have been proposed. Functional annotations of novel candidates identified antiapoptosis, regulation of cell proliferation, positive regulation of kinase activity, positive regulation of transferase activity, angiogenesis, positive regulation of cell division, and cell adhesion as top functions. We provide the list of the top 200 predicted PCa genes, which can be used as candidates for experimental validation. The model may be modified to predict genes for other cancer sites. Citation: Gorlov IP, Logothetis CJ, Fang S, Gorlova OY, Amos C (2012) Building a Statistical Model for Predicting Cancer Genes. PLoS ONE 7(11): e49175. doi:10.1371/journal.pone.0049175 Editor: Ludmila Prokunina-Olsson, National Cancer Institute, National Institutes of Health, United States of America Received August 3, 2012; Accepted October 9, 2012; Published November 15, 2012 Copyright: ß 2012 Gorlov et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This study was supported in part by the David H. Koch Center for Applied Research of Genitourinary Cancers, the National Institutes of Health Prostate SPORE grant CA140388-01, and the National Institutes of Health Cancer Center Support Grant 5 P30 CA016672. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] Introduction secondary (i.e., related) terms. As a result, each gene receives a confidence score (CS): the higher the CS, the stronger the A census of human cancer genes conducted by Futreal et al. [1] association of the gene with a specified phenotype; in our case, and updated by Santarious et al. [2] to identify 400 cancer-related PCa. We identified a total of 707 genes with CSs ranging from genes. It is obvious that this list of cancer-related genes is not 2.663 to 0.001 (Table S1) and used the top 100 genes as ‘‘known complete: a PubMed search of the literature conducted in June PCa genes.’’ The other 607 genes from the list were considered 2011 using the term ‘‘novel cancer gene’’ in the title identified ‘‘putative PCa genes.’’ The remaining 14,641 genes with more than 100 papers published in 2011 (data not shown). aCS,0.001 were considered ‘‘non-PCa genes.’’ We excluded Development of a predictive model for cancer genes could the 607 putative PCa genes to create a well-defined binary accelerate their identification. In this study, we developed outcome for our analysis. a statistical model for the prediction of prostate cancer (PCa) Because our initial search term to identify PCa genes was genes. Our study was motivated by the following: i) a number of ‘‘prostate cancer,’’ the algorithm searches the genes associated PCa-related genes with strong experimental evidence have been with any aspects of prostate carcinogenesis, including initiation, identified, ii) many genes in the human genome are extensively progression, recurrence, and survival. In other words, we used annotated, and iii) genome-wide profiling of gene-expression data a broad definition of PCa genes. Of course, a search can be more is available [3,4]. In this study, we identified traits that are specific, e.g., ‘‘prostate cancer recurrence,’’ and this is expected to characteristic of known PCa genes and used them to predict novel produce a training set that will be different from the list we used. PCa genes. Model and Variables Materials and Methods A binary logistic regression (BLR) model was used to discriminate between the ‘‘known PCa’’ and the ‘‘non-PCa’’ Known PCa Genes genes. Each gene was described by 33 variables (Table 1). The We used the KnowledgeNet (KN; a literature-mining algorithm) variables were selected on the basis of evidence published by us approach to identify PCa genes [5]. The KN algorithm searches and others that the individual variables are associated with PCa for an association between the gene and both primary and [6–11]. A detailed description of the variables can be found in the PLOS ONE | www.plosone.org 1 November 2012 | Volume 7 | Issue 11 | e49175 Statistical Model for Predicting Cancer Genes Information S1. We subclassified the variables into two categories: reduce the risk of experimental follow-up of false positives, which prostate-tissue specific and nonspecific. Tissue-specific variables can be costly. included gene expression data in normal and tumorous prostate In total, we used 15,348 genes. Gene expression data were tissues. Non–tissue-specific variables were those that can be a limiting factor of inclusion of each gene in the analysis. We used applied to any type of tissue, e.g., ‘‘growth factor,’’ ‘‘phosphory- the publicly available datasets GSE6919 [12,13] and GSE21034 lated’’ variables. [13] from the Gene Expression Omnibus (GEO) [3,4] and used 2 Because our regression model was naturally unbalanced, with AmiGO [14] to identify the genes associated with specific biologic too many ‘‘non-PCa’’ genes and too few PCa genes, we could not function, cellular location, and posttranslational modifications. use a 0.5 threshold to decide whether the gene was a PCa or ‘‘non- The number of human orthologs reported in the HomoloGene PCa’’ gene. The classification threshold (0.05) was chosen to database (http://www.ncbi.nlm.nih.gov/HomoloGene) was used ensure that at least 95% of non-PCa genes were predicted as the evolutionary conservation index [15,16]. correctly, and because it reflects the proportion of genes that were identified as prostate cancer (707) related to the total number of Validation of the Model genes studied in the training phase (14,641). This relatively high To validate the model, we first randomly subclassified the 200 rate of correct classification of ‘‘non-PCa’’ genes was selected to genes with the highest CS into discovery and validation sets. Next we built the BLR model by using only the discovery set and used it Table 1. Variables used to build a binary logistic model to discriminate known PCa genes. Type of variable Variable Source of the data Specific Three-level meta-analysis Ref. [10] Nonspecific Acetylated GO* Nonspecific Angiogenesis GO Nonspecific Antiapoptotic GO Nonspecific Cell adhesion GO Nonspecific