Mining the Dark Matter of the Cancer Proteome for Novel Biomarkers
Current Cancer Therapy Reviews, 2013, 9, 265-277

Ana Paula Delgado, Pamela Brandao, Sheilin Hamid and Ramaswamy Narayanan

Department of Biological Sciences, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Abstract: The post genome era has ushered us into therapeutic target discovery, empowering us to mine the genome using rational approaches. Numerous cancer targets have emerged from the genome project for diagnostics, therapeutics and response to therapy prediction. Among thousands of genes predicted in the human genome, nearly half remain uncharacterized. Considerable attention in the last decade has focused on the well-characterized known genes. However, the future of cancer target discovery resides in the uncharacterized or novel genes called the dark matter of the human genome. Realizing the importance of this vast untapped potential, the US National Cancer Institute recently announced a new initiative called "Illuminating the Dark Matter of the Genome for Druggability". This area of cancer research, albeit exciting, remains a challenge due to the lack of adequate information about the uncharacterized genes. Amongst the plethora of bioinformatics tools and databases, a streamlined approach remains elusive. In this review, we present a simplified approach to mine the cancer proteome directly for rapid target discovery. Using such an approach, we have created a database of uncharacterized cancer genes and have shown the biomarker and drug target potential of an uncharacterized protein, C1ORF87, as a putative solid tumor target. In view of this protein's association with carcinomas, C1ORF87 is termed the Carcinoma-Related EF-Hand (CREF) gene. The approaches discussed in this review should aid in lighting the dark matter of the human cancer proteome.
Keywords: Post genome, druggable genes, uncharacterized genes, C1ORF87, CREF gene.

1. INTRODUCTION

Rational approaches to cancer target discovery have been greatly accelerated in the last decade by the completion of the human genome project [1-6]. The Cancer Genome Anatomy Project (CGAP) from the National Cancer Institute (NCI) is an attractive starting point for cancer gene discovery [7-12]. The number of bioinformatics tools available (public and private) to mine the cancer genome is expanding. The most commonly used tools in the public domain include the CGAP database [7], the NCBI UniGene database, the European Bioinformatics Institute (EBI) database, Serial Analysis of Gene Expression (SAGE) [13], the UCSC Genome Browser [14], ArrayExpress [15], the Roche Cancer Genome Database (RCGDB) [16], the canSAR database [17], the Catalogue of Somatic Mutations in Cancer (COSMIC) [18], the molecular targets database at the NCI Developmental Therapeutics Program (DTP) [19] and the Gene Chip Oncology Database (GCOD) from the Dana Farber Gene Index [20].

Currently it is estimated that there are 22,000 protein coding genes in the human genome [6]. It is clear that with the isoforms and the post translationally modified proteins [21], a larger number of druggable genes will emerge. The majority of these genes remain uncharacterized and their function is unknown. In addition, the noncoding RNAs (ncRNAs), some of which may also code for proteins, may ultimately contribute to the actual number of disease targets [22, 23]. Numerous databases are available for mining the ncRNAs and establishing their relevance to various diseases including cancer [24-29]. Together, the uncharacterized proteins and the ncRNAs constitute the dark matter of the genome [30, 31].

Recent efforts on cancer biomarkers and drug target discovery have focused on the well characterized (known) genes [22, 32-34]. The concept of the druggable genome has gained increased prominence for pharmaceutical drug development [21, 35-39]. Current drug targets largely revolve around such classes of proteins as enzymes, receptors, transporters and channel proteins [21, 38]. Numerous drug target databases are available for readily mining the genome to detect drug interactions (PharmGKB, the Pharmacogenomics Knowledgebase; the Therapeutic Target Database, TTD [39]; DrugBank [40, 41]; and the Drug Gene Interaction database, DGIdb).

The biomarker potential of the genes is often neglected due to the greater attraction of druggable target discovery. In addition to therapeutic potential, a new target can hold promise in diagnostics and response to therapy prediction. By mining the uncharacterized cancer proteins, researchers can begin to elucidate the mechanisms of complex gene network interactions among the known and unknown genes [42-44].

For mining the cancer proteome, the UniProt Knowledgebase (UniProtKB) [45] is a useful starting point to obtain functional information on proteins. Other knowledge databases such as GeneCards [46], the HUGO Gene Nomenclature Committee (HGNC) from EBI and the NCBI Gene and RefSeq provide highly integrated approaches to mining the genome at the genomic DNA and mRNA level and to finding protein related information. The Gene Ontology databases (QuickGO, GoMiner, Gene Ontology) offer complex interpretations from the Omics data.

*Address correspondence to this author at the Department of Biological Sciences, Charles E. Schmidt College of Science, Florida Atlantic University, Boca Raton, FL 33431; Tel: 561 297 2247; Fax: 561 297 3859; E-mail: [email protected]

1875-6301/13 $58.00+.00 © 2013 Bentham Science Publishers. Current Cancer Therapy Reviews, 2013, Vol. 9, No. 4.

[Figure 1: number of hits per database: GeneCards, 1142; Mutome DB, 1436; UniProtKB, 855; RefSeq, 4563; HGNC, 4025.]

Fig. (1). Novel cancer proteome output by text mining.
The genome knowledge databases (UniProtKB, GeneCards, Mutome DB, RefSeq and the HGNC) were text mined with advanced search options and the uncharacterized cancer-associated ORFs were identified.

For analyzing the proteins, the SIB Bioinformatics Resource Portal ExPASy [47] has numerous tools to perform a detailed characterization of the structure and the class of the proteins. Multiple meta analysis bioinformatics tools are also becoming available to perform a comprehensive analysis of proteins, including PredictProtein [48] and the Meta Server for Protein Sequence Analysis (MESSA) [49]. In the private domain several integrated datamining tools (most of which are freely available for academic users) exist for mining the cancer genome and the proteome, including Oncomine for cancer microarray analysis [50], NextBio [51] and the Ingenuity Pathway Analysis (IPA) tools from Qiagen. These meta analysis tools offer the advantage of performing complex Omics and pathway analysis for gene discovery and characterization. A number of datasets exist for mining the cancer transcriptome and proteome in diverse databases [34, 52-59]. Approaches to mine the known genes versus the uncharacterized genes vary. In this review, we will address a streamlined approach to identify and predict function for the dark matter of the human proteome, the uncharacterized genes.

2. MINING THE BIOINFORMATICS DATABASES FOR THE DARK MATTER: UNCHARACTERIZED PROTEINS

The dark matter of the human genome for biomarker discovery resides in the uncharacterized proteins and the ncRNAs [30-31, 60, 61]. These novel proteins hold the clue to deciphering biology and to developing a drug therapy rationale or identifying biomarkers. Mining for these novel protein targets, however, remains daunting. What little data exists for these putative proteins is spread over a multitude of databases, and is defined in various ways: as uncharacterized, as [...] on a comprehensive bioinformatics and proteomics strategy of datamining. Various knowledge based bioinformatics tools can be used to develop an initial list of uncharacterized ORF genes. Mining of the genome knowledge databases can rapidly identify such putative novel proteins. For example, text mining of databases such as GeneCards, UniProtKB, the RCGDB-Mutome Database, NCBI RefSeq and the HGNC with advanced search options can be performed to identify cancer-related uncharacterized proteins (Fig. 1).

Text mining these databases requires individual optimization of the query using advanced search options [62-65]. Some of the tools returned significantly different results when terms such as "cancer" or "tumor" were used instead of "cancer AND tumor". A universally employed query identification search algorithm is critically needed for seamless mining of these diverse databases.

Our optimized query definition for the databases shown in Fig. 1 identified a distinct number of cancer-related uncharacterized human ORFs in each database (GeneCards, n=1,142; HGNC, n=4,025; RefSeq, n=4,563; Mutome DB (Roche), n=1,436; and UniProtKB, n=855). The variation in the number of putative cancer-related proteins from these databases reflects the presence of partials, pseudogenes and noncoding RNAs (ncRNAs). Numerous ORFs were identified by more than one database; however, due to the continually evolving nature of the datasets, not all the ORFs are identifiable by each of the databases used (SH and RN, unpublished). In theory, any one of these databases could be used as a starting point for developing an initial list of cancer-related novel ORFs. For the sake of simplicity, this review employs the 855 hits from UniProtKB as an example to mine the dark matter of the cancer proteome. These hits were verified as uncharacterized using select databases such as HGNC,