Current Cancer Therapy Reviews, 2013, 9, 265-277

Mining the Dark Matter of the Cancer Proteome for Novel Biomarkers

Ana Paula Delgado, Pamela Brandao, Sheilin Hamid and Ramaswamy Narayanan

Department of Biological Sciences, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

Abstract: The post-genome era has ushered in rational approaches to therapeutic target discovery by mining the genome. Numerous cancer targets have emerged from the genome project for diagnostics, therapeutics and prediction of response to therapy. Among the thousands of genes predicted in the human genome, nearly half remain uncharacterized. Considerable attention in the last decade has focused on the well-characterized known genes. However, the future of cancer target discovery resides in the uncharacterized or novel genes, called the dark matter of the human genome. Realizing the importance of this vast untapped potential, the US National Cancer Institute recently announced a new initiative called "Illuminating the Dark Matter of the Genome for Druggability". This area of cancer research, albeit exciting, remains a challenge due to the lack of adequate information about the uncharacterized genes. Amongst the plethora of bioinformatics tools and databases, a streamlined approach remains elusive. In this review, we present a simplified approach to mine the cancer proteome directly for rapid target discovery. Using such an approach, we have created a database of uncharacterized cancer genes and have shown the biomarker and drug target potential of an uncharacterized protein, C1ORF87, as a putative solid tumor target. In view of this protein's association with carcinomas, C1ORF87 is termed the Carcinoma-Related EF-Hand (CREF) gene. The approaches discussed in this review should aid in lighting the dark matter of the human cancer proteome.

Keywords: Post genome, druggable genes, uncharacterized genes, C1ORF87, CREF gene.

1. INTRODUCTION

Rational approaches to cancer target discovery have been greatly accelerated in the last decade by the completion of the human genome project [1-6]. The Cancer Genome Anatomy Project (CGAP) from the National Cancer Institute (NCI) is an attractive starting point for cancer gene discovery [7-12]. The number of bioinformatics tools available (public and private) to mine the cancer genome is expanding. The most commonly used tools in the public domain include the CGAP database [7], the NCBI UniGene database, the European Bioinformatics Institute database (EBI), Serial Analysis of Gene Expression (SAGE) [13], the UCSC Genome Browser [14], ArrayExpress [15], the Roche Cancer Genome Database (RCGDB) [16], the canSAR database [17], the Catalogue of Somatic Mutations in Cancer (COSMIC) [18], the molecular targets database at the NCI Developmental Therapeutics Program (DTP) [19] and the Gene Chip Oncology Database (GCOD) from the Dana Farber Gene Index [20].

Currently it is estimated that there are 22,000 protein coding genes in the human genome [6]. It is clear that with the isoforms and the post-translationally modified proteins [21], a larger number of druggable genes will emerge. The majority of these genes remain uncharacterized and their function unknown. In addition, the noncoding RNAs (ncRNAs), some of which may also code for proteins, may ultimately contribute to the actual number of disease targets [22, 23]. Numerous databases are available for mining the ncRNAs and establishing their relevance to various diseases including cancer [24-29]. Together, the uncharacterized proteins and the ncRNAs constitute the dark matter of the genome [30, 31].

Recent efforts on cancer biomarkers and drug target discovery have focused on the well-characterized (known) genes [22, 32-34]. The concept of the druggable genome has gained increased prominence for pharmaceutical drug development [21, 35-39]. Current drug targets largely revolve around such classes of proteins as enzymes, receptors, transporters and channel proteins [21, 38]. Numerous drug target databases are available for readily mining the genome to detect drug interactions (PharmGKB, The Pharmacogenomics Knowledgebase; Therapeutic Target Database, TTD [39]; DrugBank [40, 41]; and the Drug Gene Interaction database, DGIdb).

The biomarker potential of the genes is often neglected due to the greater attraction of druggable target discovery. In addition to therapeutic potential, a new target can hold promise in diagnostics and response to therapy prediction. By mining the uncharacterized cancer proteins researchers can begin to elucidate the mechanisms of complex gene network interactions among the known and unknown genes [42-44]. For mining the cancer proteome, the UniProt Knowledge Base (UniProtKB) [45] is a useful starting point to obtain functional information on proteins. Other knowledge databases such as GeneCards [46], the HUGO Gene Nomenclature Committee (HGNC) from EBI and the NCBI Gene and RefSeq provide highly integrated approaches to mining the genome at the genomic DNA and mRNA level and to find protein related information. The gene ontology databases (QuickGO, GoMiner, Gene Ontology) offer complex interpretations from the Omics data.

*Address correspondence to this author at the Department of Biological Sciences, Charles E. Schmidt College of Science, Florida Atlantic University, Boca Raton, FL 33431; Tel: 561 297 2247; Fax: 561 297 3859; E-mail: [email protected]

1875-6301/13 $58.00+.00 © 2013 Bentham Science Publishers

[Figure 1: cancer-related uncharacterized ORF hits per database. GeneCards 1142; Mutome DB 1436; UniProtKB 855; RefSeq 4563; HGNC 4025.]

Fig. (1). Novel cancer proteome output by text mining. The genome knowledge databases (UniProtKB, GeneCards, Mutome DB, Refseq and the HGNC) were text mined with advanced search options and the uncharacterized cancer-associated ORFs were identified.
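The merging and curation of hit lists from several databases can be illustrated with a minimal sketch. The lists below are hypothetical toy data, not the authors' results, and the helper names (`normalize`, `merge_hits`) are assumptions introduced only for illustration. The sketch normalizes the spelling of ORF symbols, records which databases reported each symbol, and keeps the symbols supported by at least two sources.

```python
from collections import Counter

# Hypothetical toy hit lists; symbols and memberships are illustrative only.
hits_by_database = {
    "UniProtKB": ["C1orf87", "C8orf34", "CXorf48"],
    "GeneCards": ["C1ORF87", "C17orf66", "C8ORF34"],
    "RefSeq": ["c1orf87", "C1orf101"],
}


def normalize(symbol: str) -> str:
    """Upper-case a symbol so that C1orf87 and C1ORF87 compare equal."""
    return symbol.strip().upper()


def merge_hits(hit_lists):
    """Return {symbol: sorted list of databases that reported it}."""
    merged = {}
    for database, symbols in hit_lists.items():
        for symbol in symbols:
            merged.setdefault(normalize(symbol), set()).add(database)
    return {symbol: sorted(dbs) for symbol, dbs in merged.items()}


merged = merge_hits(hits_by_database)

# Number of databases supporting each symbol.
support = Counter({symbol: len(dbs) for symbol, dbs in merged.items()})

# Symbols reported by two or more databases.
shared = sorted(symbol for symbol, n in support.items() if n >= 2)

for symbol in shared:
    print(symbol, merged[symbol])
```

Real identifiers differ between databases by more than letter case (probe set IDs, accession numbers, protein IDs), so in practice an ID conversion step would likely precede this kind of merge.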

For analyzing the proteins, the SIB Bioinformatics Resource Portal ExPASy [47] has numerous tools to perform a detailed characterization of the structure and the class of the proteins. Multiple meta-analysis bioinformatics tools are also becoming available to perform a comprehensive analysis of proteins, including Predict Protein [48] and the Meta Server for Protein Sequence Analysis (MESSA) [49]. In the private domain several integrated datamining tools (most of which are freely available for academic users) exist for mining the cancer genome and the proteome, including Oncomine for cancer microarray analysis [50], NextBio [51] and the Ingenuity Pathway Analysis (IPA) tools from Qiagen. These meta-analysis tools offer the advantage of performing complex Omics and pathway analysis for gene discovery and characterization. A number of datasets exist for mining the cancer transcriptome and proteome in diverse databases [34, 52-59]. Approaches to mine the known genes versus the uncharacterized genes vary. In this review, we will address a streamlined approach to identify and predict function for the dark matter of the human proteome, the uncharacterized genes.

2. MINING THE BIOINFORMATICS DATABASES FOR THE DARK MATTER: UNCHARACTERIZED PROTEINS

The dark matter of the human genome for biomarker discovery resides in the uncharacterized proteins and the ncRNAs [30-31, 60, 61]. These novel proteins hold the clue to deciphering biology and to developing a drug therapy rationale or identifying biomarkers. Mining for these novel protein targets, however, remains daunting. What little data exists for these putative proteins is spread over a multitude of databases, and defined in various ways: as uncharacterized, as putative Open Reading Frames (ORF), or with various gene ID numbers. Hence discovery research for cancer targets has thus far tended to focus on the well-characterized known genes for which considerable information is already available.

Developing a streamlined, rational approach to mine the uncharacterized proteins could greatly facilitate discovery of novel cancer targets. Towards this end, we have embarked on a comprehensive bioinformatics and proteomics strategy of datamining. Various knowledge-based bioinformatics tools can be used to develop an initial list of uncharacterized ORF genes. Mining of the genome knowledge databases can rapidly identify such putative novel proteins. For example, text mining of databases such as GeneCards, UniProtKB, the RCGDB-Mutome Database, NCBI RefSeq and the HGNC with advanced search options can be performed to identify cancer-related uncharacterized proteins (Fig. 1).

Text mining these databases requires individual optimization of the query using advanced search options [62-65]. Some of the tools show a significant difference in output when terms such as "cancer", "tumor" vs. "cancer AND tumor" are used. A universally employed query identification search algorithm is critically needed for seamless mining of these diverse databases.

Our optimized query definition for the databases shown in Fig. 1 identified a distinct number of cancer-related uncharacterized human ORFs in each (GeneCards, n=1,142; HGNC, n=4,025; RefSeq, n=4,563; Mutome DB (Roche), n=1,436; and UniProtKB, n=855). The variation in the number of putative cancer-related proteins from these databases reflects the presence of partials, pseudogenes and noncoding RNAs (ncRNAs). Numerous ORFs were identified by more than one database; however, due to the continually evolving nature of the datasets, not all the ORFs are identifiable by each of the databases used (SH and RN, unpublished). In theory any one of these databases could be used as a starting point for developing an initial list of cancer-related novel ORFs. For the sake of simplicity, this review employs the 855 hits from UniProtKB as an example to mine the dark matter of the cancer proteome. These hits were verified as uncharacterized using select databases such as HGNC, NCBI RefSeq and NCBI Gene.

One of the major problems encountered when mining uncharacterized genes is the need for various gene identifier information. Since multiple datamining tools use different identifiers such as probe set ID, GenBank numbers, EST, protein ID etc., it is currently challenging to rapidly scan the ORFs against multiple databases. However, numerous ID


Table 1. Verification of protein motifs and domains analysis: experimental design. Well-characterized proteins with detailed knowledge of protein motifs and domains were analyzed using indicated motifs and domain analysis tools. Positive controls: TP53 WT (transcription factor), IL-7 (secreted factor) and KRAS (enzyme superfamily). Negative control: TP53 WT with the transactivation domain (TAD) deleted (mutant). The key motifs and domains expected are bolded.

Expected Protein Name CDD InterProScan ProDom Pfam HMMER Domain/Motif

P53 DNA binding trasactivation P53 tetramerisation P53 DNA binding P53 tetramerisation domain, P53 motif, P53 motif, p53 DNA domain, TAD, P53 motif, p53 DNA bind- Nuclear N P53 P73L TP53 (WT) tetramerisation tetramerisation binding domain, tetramerisation ing domain, p53 trans- Tumor DELTA motif, P53 transacti- motif, p53 DNA TAD motif activation domain vation motif binding domain

Interleukin-7 IL-7 Cytokine Growth Precur- Interleukin 7/9 fam- Interleukin 7/9 Interleukin 7/9 family, Interleukin 7/9 Interleukin 7/9 IL7 sor Glycoprotein Factor ily, signal peptide family signal peptide family family Signal, 3-D Structure Sequencing

GTP-Binding Lipoprotein Ras GTPase family Small GTPase super- Prenylation RAS- Small GTPase super- containing H-Ras,N- family, P-loop contain- KRAS RELATED GTPASE Ras family Ras family family, RAS family Ras and K- ing nucleoside thriphos- ADP-Ribosylation Small Ras4A/4B phate hydrolase Family Factor RAS

P53 DNA binding P53 tetramer- P53 tetramerisation P53 tetramerisation DNA binding domain, Nuclear N P53 P73L domain, P53 isation motif, p53 TP53 (Mutant) motif, p53 DNA motif, P53 DNA tetramerisation Tumor DELTA tetramerisation DNA binding binding domain binding domain domain motif domain

converter tools are becoming available. The clone/gene ID converter from the Bioinformatics Unit, CNIO, gives comprehensive gene IDs for a query sequence [66].

We manually curated the UniProtKB output into a database of putative full-length ORF protein hits. A working database of putative uncharacterized cancer-related full-length proteins (n=455) from UniProtKB is shown in Supplemental Table 1. This database is available upon request. These putative proteins can be readily classified by chromosome into ORFs to facilitate mining for deletions, amplifications or genome wide association studies (GWAS). In addition, other databases such as GeneCards, the UCSC/Ensembl Genome Browser, Mutome DB, canSAR, cBioPortal and the COSMIC disease databases can be mined to verify rapidly the protein hits at the levels of gene, mRNA and protein expression in normal and cancer tissues and cell lines. Diverse proteomic tools can help identify the nature of the proteins, which should help in establishing druggableness and predicting putative function (see Section 3).

3. UNCHARACTERIZED CANCER TARGET DISCOVERY

The expression specificity and knowledge of protein motifs are crucial for predicting the putative biomarker potential of a novel protein. Hence, we have attempted to develop an algorithm to identify protein hits, verify expression specificity to cancer and define the class of the protein for lead cancer target discovery. A streamlined approach to identify and verify the cancer protein leads from the list of uncharacterized ORFs is shown in Fig. 2. As the genome databases are continually being updated, it is necessary to verify the uncharacterized nature of the ORF prior to undertaking extensive analysis. Below, we provide a stepwise approach to aid in lead cancer target discovery and verification. These steps can be easily incorporated into a workflow and can be automated.

1). The uncharacterized protein ORFs can be identified by text mining diverse genome knowledge databases such as UniProtKB, GeneCards, RefSeq, HGNC and Mutome DB. Any one of these databases offers an attractive starting point (see Fig. 1). Alternatively, hits from all of these databases can be merged and curated.

2). Databases such as HGNC, NCBI Gene and RefSeq can be used to rapidly verify the novelty or uncharacterized nature of the test ORF. The ncRNAs, pseudogenes and partial cDNAs can be filtered out.

3). The mRNA expression specificity in cancer and normal tissues can be established using the gene information from the NCBI UniGene, Serial Analysis of Gene Expression tags (SAGE) and the Cancer Genome Anatomy Project (CGAP). The UniGene and SAGE databases can be mined for Expressed Sequence Tag (EST) expression profiles in normal, cancer and developmental tissues, for sources of cDNAs, and for digital expression in normal and tumor tissues (SAGE Anatomical Viewer). An essential drawback of these diverse RNA expression-profiling tools is that these databases use different sources of cDNAs for EST and SAGE expression verification. The depth of the EST sequences in the cDNA libraries varies significantly, leading to both false positives and false negatives in the data output. Often, the user faces a lack of correlation of mRNA expression across the tools. An additional limitation is that most of the cDNAs are derived from bulk tumor tissues. The results tend to generate considerable noise because of the


[Figure 2: flowchart with the following boxes.
Identify uncharacterized proteins (ORF) hits in cancer (UniProtKB, GeneCards, HGNC, RefSeq)
Verify protein hits (RefSeq, HGNC, NCBI-Gene)
Establish EST specificity (UniGene, SAGE, CGAP)
Analyze cancer transcriptome (Oncomine, GCOD, NCI 60, NextBio, Gene Expression Atlas, IPA)
Verify protein expression (Human Protein Atlas, Allen Brain Atlas, MOPED, Human Proteinpedia)
Characterize hits (NCBI Gene, AceView, canSAR, GO, Mutome DB, ExPASy, UCSC Genome Browser)
Predict protein class: motif/domain detection (CDD, Pfam, InterProScan, ProDom, Meta Analysis, PDB)
Predict clinical relevance (SNP, GWAS, HapMap, COSMIC, canSAR, Mutome DB, PubMed)
Explore pathway: interactome/gene network
Leads (druggable genes/diagnostic targets)]

Fig. (2). A streamlined approach to identify novel cancer protein leads. The strategy used in this review to mine the human proteome for novel cancer targets discovery is shown.
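Because the steps of Fig. 2 are described as suitable for automation, a minimal sketch of how such a workflow could be chained as successive filters is given here. Everything in it is an assumption made for illustration: the `Candidate` record, the field names, the two-fold expression ratio, the pseudocount, and the rule that at least two tools must agree on a protein class are hypothetical choices, not parameters taken from the review. The records are toy data.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Candidate:
    symbol: str
    record_type: str             # "protein", "ncRNA", "pseudogene" or "partial"
    still_uncharacterized: bool  # outcome of a nomenclature/annotation check
    normal_tpm: float            # expression in normal tissue (hypothetical)
    tumor_tpm: float             # expression in tumor tissue (hypothetical)
    class_calls: dict = field(default_factory=dict)  # tool -> class or None


def passes_verification(c: Candidate) -> bool:
    """Keep full protein records that are still uncharacterized."""
    return c.record_type == "protein" and c.still_uncharacterized


def is_differential(c: Candidate, min_ratio: float = 2.0,
                    pseudocount: float = 1.0) -> bool:
    """Flag candidates whose tumor/normal expression differs by min_ratio."""
    high = max(c.normal_tpm, c.tumor_tpm) + pseudocount
    low = min(c.normal_tpm, c.tumor_tpm) + pseudocount
    return high / low >= min_ratio


def consensus_class(c: Candidate, min_tools: int = 2) -> Optional[str]:
    """Return a protein class only if enough tools agree on it."""
    votes = Counter(label for label in c.class_calls.values() if label)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_tools else None


def select_leads(candidates):
    """Apply the filters in order and return (symbol, class) pairs."""
    leads = []
    for c in candidates:
        if not passes_verification(c) or not is_differential(c):
            continue
        label = consensus_class(c)
        if label is not None:
            leads.append((c.symbol, label))
    return leads


toy = [
    Candidate("ORF_A", "protein", True, 8.0, 0.0,
              {"Pfam": "channel", "HMMER": "channel", "CDD": None}),
    Candidate("ORF_B", "ncRNA", True, 8.0, 0.0, {}),
    Candidate("ORF_C", "protein", True, 5.0, 6.0,
              {"Pfam": "enzyme", "HMMER": "enzyme"}),
    Candidate("ORF_D", "protein", True, 1.0, 20.0, {"Pfam": "enzyme"}),
]

leads = select_leads(toy)
print(leads)
```

Requiring agreement between tools reflects the cross-verification the review recommends for uncharacterized proteins; the specific threshold of two tools is only a placeholder.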

lack of high quality cDNAs. Expansion of laser microdissected tumors and normal tissue-derived cDNAs would greatly enhance the quality of the data output from these databases.

4). The cancer transcriptome of the chosen ORF can be investigated using microarray databases. The Oncomine database provides comprehensive mRNA expression data from numerous subtypes of tumors. Additional microarray data can be generated using ArrayExpress, the EMBL-EBI Gene Expression Atlas, the Gene Chip Oncology Database (GCOD) from the Dana Farber Gene Indices (formerly The Institute for Genomic Research, TIGR), the NCI 60 Developmental Therapeutics Targets database (DTP) for expression in sixty cancer cell lines, and the NextBio database [51] for normal and cancer tissue mRNA expression profiling. The RNAseq data can be mined with the Ingenuity Pathway Analysis (IPA) tool. Caution should be exercised in verifying the datasets from diverse microarray experiments to ensure that they meet the requirements for Minimum Information About a Microarray Experiment (MIAME) [67].

5). In contrast to mRNA analysis, protein analysis tools are few in number. The most comprehensive protein analysis tool is the Human Protein Atlas [68], which includes protein expression data for various normal and tumor tissues (Tissue Microarrays) as well as for cell lines. In addition to protein expression, the Human Protein Atlas also contains data from mRNA expression. Data for most of the uncharacterized proteins exist and this database is updated regularly with additional antibody verification. One major drawback of the Protein Atlas, however, is the lack of quantitative protein expression data; another is the general lack of antibody data for individual isoforms, which limits interpretations to the canonical protein sequences.

Further protein expression data can be generated from the Model Organism Protein Expression Database (MOPED) [69], the Human Proteinpedia [70] and the Human Protein Reference Database (HPRD) [71]. The Allen Brain Atlas [72] can be mined for developmental expression and glioblastomas for both mRNA and protein expression.

6). The expression-verified cancer protein hits can be further characterized for the genomic organization, isoforms, drug interactions and splice variants using the UCSC Genome Browser, NCBI Gene and AceView. The Gene Ontology (GO) can be established from GoMiner. The Mutome DB and canSAR can be mined to verify the mutational status of the ORF in question. The ExPASy protein server can be explored to develop an initial hint of the related proteins and homologues and other protein characteristics.

7). Identification of the protein class and structural proteomics analysis (motifs, fingerprints, domains, structures) can be performed using the ExPASy and NCBI proteomic tools such as the Conserved Domain Database (CDD) [73], the Protein Family database (Pfam) [74], InterProScan [75], Protein Domain (ProDom) [76] and the Protein Data Bank (PDB) [77].

8). Once a hint of putative motifs and domains is obtained, a comprehensive secondary, tertiary and quaternary structural modeling can be performed using the Meta

analysis tools such as Predict Protein and the Meta Server for Sequence Analysis (MESSA).

9). The clinical cancer relevance of the ORF hits can be inferred using the Mutome DB, canSAR, cBioPortal for cancer genomics [78] and the COSMIC disease databases, as well as by using the International HapMap Project tool (HapMap) and PubMed-based text mining. These databases provide comprehensive data on deletions, mutations, amplifications, SNPs and Genome Wide Association Studies (GWAS) [79].

10). To understand the mechanism and to develop a pathway for gene network interactions, the protein-protein interacting partners for the ORFs can be predicted using various interactome tools such as STRING [80], the Human Interactome Database [81], Protein Interaction Network Analysis (PINA) [82], the Molecular INTeraction database (MINT) [83], the EMBL-EBI interactome IntAct [84] and the BioGRID interactome tools [85].

11). Finally, based on the expression specificity and the protein motifs and fingerprints, cancer leads for biomarkers and drug therapy targets can be chosen for wet laboratory verification.

Our approach to illuminating the novel cancer proteome flows from discovery to expression specificity to protein class identification and prediction of function.

4. PROTEIN MOTIF VERIFICATION: EXPERIMENTAL DESIGN

Numerous protein motif verification tools are available, but these tools use different algorithms for mining the genome database, so false negatives are common. Since the reliability of the druggableness prediction of a cancer target is dependent on the accurate prediction of the protein class through the knowledge of protein motifs, proper experimental design is imperative. We have developed an experimental standard for motif verification using well-characterized proteins (see Table 1). Varying classes of proteins with known motifs and domains were used to verify the reliability of motif predictions. As positive controls for the tools, we chose a transcription factor with DNA binding activity (TP53, WT), a secreted cytokine with signal peptide sequence (IL7) and an oncogene with enzyme function (KRAS). To serve as a negative control, we deleted the transactivation domain (TAD) in the TP53 WT gene. These control sequences were analyzed using various motif and domain analysis tools: CDD, InterProScan, ProDom, Pfam [74] and HMMER [86]. As can be seen from the results (Table 1), all of the tools detected the expected protein motifs and domains (DNA binding, signal peptide and Ras-GTPase) in the well-characterized positive control genes. As expected, in the deleted version of TP53 (negative control) the TAD was not detected, but the remaining motifs of TP53 were intact. While any one of the motif tools would have yielded the information about the nature of the motifs, we believe that when dealing with uncharacterized proteins, it is crucially important to use multiple tools to cross verify the prediction. When a test ORF is investigated using these motif and domain tools and a putative class is inferred, it is essential to verify the predictions by employing multiple controls using known members of the predicted class of the protein.

5. FUNCTIONAL CLASS IDENTIFICATION OF THE UNCHARACTERIZED CANCER TARGETS

We next sought to analyze the uncharacterized proteins from the database. Four of the ORFs from the database were chosen for initial studies. Preliminary investigation of these ORFs for gene expression with Oncomine microarray and Human Protein Atlas tools predicted upregulated expression in certain tumor types (APD and RN, data not shown). We reasoned that if a putative protein class could be assigned to these ORFs, a proof of concept for the approach would be established. Hence, we embarked on extensive motif and domain characterization of these hits. Multiple protein motif and domain detection tools (NCBI-CDD, InterProScan, ProDom, Pfam and HMMER) were used to predict the protein class of the ORFs (Table 2).

The C8ORF34 isoform 2 is predicted to belong to the cAMP-dependent protein kinase subunit (enzyme class) of proteins. The C1ORF101 isoform 1 is inferred to belong to the cation channel class of proteins. The CXORF48 isoform 1 belongs to the RNA binding class of proteins. The

Table 2. Functional class identification of the uncharacterized cancer targets. Indicated uncharacterized ORFs were analyzed using multiple motifs and domains analysis tools. Each ORF hit was assigned a putative class (bolded).

Protein Name Length (aa) CDD InterProscan ProDom Pfam HMMER Protein Class

cAMP-dependent protein C8orf34 ADK 372 Negative kinase, regulatory subunit, Negative Negative Enzyme isoform 2 type I/II alpha/beta (Adenylate Kinase),

Cation channel Cation channel C1orf101 951 Negative Negative Negative sperm-associated sperm-associated Channel isoform 1 protein subunit delta protein subunit delta

Nucleic acid-binding CXorf48 S1- Like AAA domain, 264 proteins superfamily Negative S1-like RNA Binding isoform 1 Superfamily OB- fold S1-like

HEAT repeats, C17orf66 Armadillo-type fold Cell 570 Heat-2 Negative domain of unknown HEAT repeats isoform 1 Transport ARM repeat function (DUF3385)


[Figure 3 panels. A and B: immunohistochemistry images labeled Normal Liver NOS (M-00100), Hepatocellular Carcinoma (T-56000), Normal Lung NOS (M-00100) and Lung Adenocarcinoma (T-28000); patient ids 3402, 2280, 2268 and 2585. C: Normal Lung (58) vs. Lung Adenocarcinoma (58). D: Normal Breast (61) vs. Invasive Carcinoma (76).]

Fig. (3). Discovery of an EF-hand containing novel protein, C1ORF87 (CREF) as a putative solid tumor target. The CREF IHC data for lung adenocarcinoma and normal lung is shown in panel A (Human Protein Atlas). Panel B shows the CREF IHC data for normal liver and hepatocellular carcinoma. The mRNA expression (Oncomine microarray) for lung adenocarcinoma (Selamat Lung, n= 116, P-value: 1.96E-11, fold change: -2.031 and gene rank: top 1%) and invasive breast carcinoma (TCGA Breast 2, n= 593, P-value: 1.14E-6, fold change: -1.575 and gene rank: top 1%) models are shown in panels C and D respectively.
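The fold changes quoted in the caption are signed values. As a minimal sketch, assuming the common convention that a signed fold change of -x denotes a tumor/normal ratio of 1/x (this convention is an assumption here, not something the caption states), the values can be placed on a symmetric log2 scale so that up- and downregulation are directly comparable. The function name is hypothetical.

```python
import math


def signed_fold_change_to_log2(fold_change: float) -> float:
    """Convert a signed fold change to log2(tumor/normal).

    Assumed convention: +x means tumor/normal = x, and -x means
    tumor/normal = 1/x, with x >= 1 in both cases.
    """
    if abs(fold_change) < 1.0:
        raise ValueError("signed fold changes are expected to have magnitude >= 1")
    magnitude = math.log2(abs(fold_change))
    return magnitude if fold_change > 0 else -magnitude


# Values quoted in the caption for lung adenocarcinoma and invasive
# breast carcinoma, respectively.
for label, fc in [("lung adenocarcinoma", -2.031),
                  ("invasive breast carcinoma", -1.575)]:
    print(f"{label}: log2 ratio = {signed_fold_change_to_log2(fc):.3f}")
```

Under this assumption both values map to negative log2 ratios, which matches the downregulation described for these two tumor types.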

C17ORF66 was predicted to belong to the HEAT repeat-containing transporter family. Domain detection tools such as CDD and ProDom were not always effective in predicting domains. Frequently, domains of unknown function (see C17ORF66 isoform 1) were detected. The classes of proteins identified, encompassing enzymes, ion channels and cell transporters, belong to the druggable targets [87-89]. These results demonstrate the feasibility of rapidly mining the uncharacterized proteome for druggable targets and biomarker discovery following the streamlined strategy we have outlined.

6. ILLUMINATING AN UNCHARACTERIZED PROTEIN, C1ORF87 (CREF)

We chose one uncharacterized ORF, C1ORF87, to verify our approach outlined in Fig. 2. Preliminary experiments using the UniGene, SAGE/EST expression and the CGAP gene information tools indicated that the C1ORF87 gene is not ubiquitously expressed. The UniGene-based analysis of normal tissue expression showed that the mRNA expression for CREF is restricted to brain, connective tissue, lung, pharynx, testis and uterus. EST expression profiling from UniGene verified the downregulation of C1ORF87 in lung tumors (8 transcripts per million (TPM) in normal versus zero TPM in tumors). The expression was restricted to select tumor tissues (APD and RN, data not shown). Hence we investigated the C1ORF87 protein expression in tumor and normal tissues using the Human Protein Atlas tool.

As shown in Fig. 3A, the protein expression was downregulated in the lung carcinoma as analyzed by tissue microarray using immunohistochemistry (IHC). The IHC staining of human bronchus showed strong cytoplasmic and membranous positivity in respiratory epithelial cells using two different antibodies, HPA031366 and HPA031368 (Human Protein Atlas).

In panel B, the IHC data for the normal liver and liver carcinoma specimens are shown. A strong granular cytoplasmic staining is seen in the cancer tissue in comparison to the normal liver tissue. However, the sample size with the current data is very small (n=3/12 high expression) and additional samples are needed for verification. Further, moderate staining was observed in several cases of endometrial, prostate, thyroid and renal cancers. The remaining normal and malignant tissues were mainly negative (Human Protein Atlas).

Overall, about 20% of tumors analyzed for protein expression showed positive staining for IHC (HPA031368). No protein expression data is currently available for normal and tumor breast tissues. The second antibody, HPA031366, showed very weak to no reactivity with many tumor samples analyzed. These results with the Human Protein Atlas for the C1ORF87 protein should be considered as preliminary; additional experiments are necessary to verify the findings. The expression of C1ORF87 protein was also detected in normal lung, blood plasma and heart by another protein expression analysis tool, the Human Protein Reference Database (HPRD).

We next investigated the mRNA expression of the C1ORF87 gene using microarray databases. Consistent with the downregulation of C1ORF87 protein expression seen in lung and breast carcinomas by the Human Protein Expression Atlas, the Oncomine microarray-based mRNA profiling also showed downregulation of C1ORF87 expression in lung adenocarcinomas and invasive breast carcinomas in a statistically significant number of samples (panels C & D). The downregulation of C1ORF87 was also seen in large cell carcinoma (n= normal vs. tumor: 5 vs. 4), and squamous cell carcinoma (n= normal vs. tumor: 5 vs. 12).

The downregulation of C1ORF87 mRNA expression was also seen in different subtypes of breast cancers including ductal breast carcinoma in situ, invasive ductal breast carcinoma, and invasive lobular breast carcinoma (PB and RN, data not shown). Conversely, the CREF gene may be upregulated in cancer of the uterus (6.10 fold, z-test 4.02E-1) according to GeneHub GEPIS. The downregulation of C1ORF87 mRNA in lung and breast carcinomas was also verified by means of the NextBio meta-analysis tool and the ArrayExpress tool. Consistent with the Oncomine prediction, these two tools also showed a significant downregulation of the C1ORF87 mRNA in diverse lung and breast carcinoma samples analyzed (RN, data not shown). While at the present time mining of the Oncomine dataset did not reveal mRNA upregulation of C1ORF87 in liver carcinomas, the NextBio dataset analysis did show an upregulation of the C1ORF87 mRNA expression in diverse liver carcinomas compared to normal tissues (RN, data not shown). No significant differential expression of C1ORF87 was detected in the hematopoietic or neuronal tumors. These results demonstrate the need to use multiple bioinformatics tools to verify leads.

In view of the expression specificity of this ORF in carcinomas, the gene C1ORF87 was named the Carcinoma Related EF-Hand (CREF) gene. Subsequently, a comprehensive molecular characterization using diverse bioinformatics tools was undertaken to create a knowledge base for the CREF gene (see Table 3).

The CREF gene is located in chromosome 1p32.1 and spread over 12 exons. Allelic loss at chromosome position 1p32-pter is a frequent event in non-small cell lung cancer [90]. CREF is, however, neither significantly focally amplified nor deleted in 14 individual subtypes of tumors (Mutome DB). We next analyzed the cancer genome for mutations. The Catalogue of Somatic Mutations in Cancer (COSMIC) has a comprehensive list of genes that are somatically mutated in human cancer [18]. In lung tumors, 12% of the samples analyzed (58/476 samples) and in breast tumors 11% (90/782 samples) show point mutations for the C1ORF87 (CREF) gene. Among various tumor tissues analyzed for Genome Wide Association Studies (GWAS), amplification in one central nervous system (CNS) tumor tissue, loss of heterozygosity (LOH) in diverse tumors (134) and mutation in one breast tumor were detected for the C1ORF87 (CREF) gene. No homozygous deletion was identified. LOH was identified in 27% (37/139) of tumors and 9% (4/45) of breast tumors. To date, lung tumors had the largest number of LOH identified. The LOH results from the COSMIC disease database raise the possibility of a tumor suppressor gene in chromosome 1p32.1 [91, 92]. Frequent allelic loss at 1p32.1 is seen in lung, breast and other tumors [93]. A list of all genes (Atlas of Genetics and Cytogenetics in Oncology and Haematology) present in this chromosomal region (both known and uncharacterized) indicates the presence of numerous cancer-related genes [94]. Using co-expression and interactome analyses, we are currently investigating chromosome 1p32.1 to find the key protein partners of CREF protein in the solid tumors.

The SNP analysis from the NCBI SNP database showed 1690 variants for CREF. These SNPs are largely located at the 3' UTR, intron, and at the 5' UTR encompassing synonymous or missense mutations. Five different natural variants have been detected (in aa positions 151, 185, 301, 403, and 406). The aa position 151 variant results in glutamine to glutamic acid substitution in breast cancer [95].

The CREF gene exists as at least three different isoforms and is spread over 12 exons. It is highly conserved with a function of calcium binding gene ontology. The precise subcellular location is unclear at present (nucleus or cytoplasmic) as measured by two different tools. The CREF protein is developmentally regulated and its mRNA expression is restricted to select normal tissues (testis, lung, brain, fallopian tubes, uterus and peritoneum); it is absent in many normal tissues. Current data on protein expression in normal tissues is limited to lung, heart and fallopian tubes. In the Swiss PDB database, the CREF protein shows alignment with multiple EF-Hand motif containing proteins. InterProScan from InterPro enabled identification of 1) an EF-Hand domain (aa 180-386) and 2) an EF-Hand domain pair (aa 477-529) in CREF protein. Subsequently an extensive structural analysis of CREF protein was undertaken to develop a further understanding of its nature. The Meta Server for Sequence Analysis (MESSA) and the Predict Protein tools were used to characterize the structure of the CREF protein. The secondary structure prediction modules of these tools showed largely the presence of coiled-coil and alpha helical structures along with beta strands using the PROFsec tool [48]. Three protein-binding sites (aa positions 1, 85-86 and 243-244) were detected in the CREF protein using profisis (ISIS), a machine learning-based method [96]. No signal peptide motifs were found. A disordered region lacking a stable tertiary structure was predicted using these meta structural tools.

In addition to the EF-Hand motifs, the Prosite proteomic tool predicted a putative leucine zipper motif, which is characteristic of nuclear binding proteins; however, no nuclear localization signal was detected in the CREF protein. Hence, the relevance of the leucine zipper is unclear as it is a commonly occurring signature (Prosite). However, LETM1, a leucine zipper-EF-hand containing transmembrane protein, is localized to the inner mitochondrial membrane [95]. This

272 Current Cancer Therapy Reviews, 2013, Vol. 9, No. 4 Delgado et al.

Table 3. Detailed characterization of the CREF protein. Diverse bioinformatics tools were used to develop a comprehensive knowledge base for the uncharacterized C1ORF87 (CREF) protein.

| Characteristics | Gene Description | Tool Used |
| Map Position | 1p32.1 | NCBI Map Viewer |
| Amplifications/Deletions | None (The Roche Cancer Genome Database, canSAR) | NCBI Gene |
| Physical Location | 60.456.066 - 60.539.442 | NCBI Gene |
| Intron/Exon Structure, DNA Size | 12 exons, 83.36 kb | GENEATLAS |
| Transcription Factor Binding Sites | GATA-2, STAT-3, MafF, CTCF, Rad21 | UCSC Genome Browser |
| | POU3F1, FOXC1, FOXJ2, LHX3, FOXM1, VSX1, PITX3 and SP1 | Geneset - NextBio |
| Isoforms | 3 different isoforms (Isoform 1, 546 aa; Isoform 2, 180 aa and Isoform 3, 138 aa) | UniProt KB |
| Canonical Sequence | NP_689590.1 (isoform 1) | Homologue Gene |
| Homologs | 7 homologs (highly conserved): Homo sapiens, Pan troglodytes, Macaca mulatta, Canis lupus familiaris, Bos taurus, Mus musculus, Rattus norvegicus | Homologue Gene |
| Gene Ontology | Calcium ion binding | UniProt KB - Go Miner |
| Subcellular Location | Nucleus | Predict Protein |
| | Cytoplasmic and membranous | Human Protein Atlas |
| Size of mRNA | 2126 bp | RefSeq |
| Expression of mRNA (normal) | Testis, brain, lung, pharynx, connective tissue, uterus, fallopian tubes, peritoneum | UniGene, NextBio |
| Protein Expression (normal) | Selective expression in ciliated cells including fallopian tube and airway epithelia | NextBio |
| Stem Cell Expression of mRNA | Urogenital, keratinocytes, adipose and embryonal stem cells | NextBio |
| Expression of mRNA (cancer) | Upregulated: ovary, muscle, bladder; downregulated: breast, lung | Oncomine, NextBio |
| Expression of mRNA (cancer cell lines) | Multiple. Highest in melanoma cell line AO4 | NextBio |
| Protein Expression (cancer) | Upregulated: liver, prostate, renal carcinomas, endometrial; downregulated: breast, colorectal, lung, stomach carcinomas | Human Protein Atlas |
| SNP Variants | 1690: 5 natural variants, one breast cancer variant, Q --> E | GeneCards, COSMIC, canSAR |
| microRNA | hsa-miR-338-5p, hsa-miR-103b, hsa-miR-26b*, hsa-miR-3175, hsa-miR-4273 | GeneCards |
| | hsa-miR-383 | NextBio |
| Protein Size (aa), Molecular Weight | 546, 62 kDa | UniProtKB |
| Structural Alignment in PDB | Calcium binding EF-hand proteins | PredictProtein |
| Motif/Domain | EF-like domain | InterProscan |
| | Putative Leucine Zipper | Prosite |
| Methylation | Hypermethylated in lung and breast carcinomas and hypomethylated in liver carcinoma | NextBio |
| Gene Regulation | Downregulated by Genistein (G2 arrest), 2-methoxy estradiol (mitotic modulator), Capecitabine (antimetabolite). Upregulated by Trichostatin A (G1 arrest), Azacytidine (antimetabolite), Isoascorbic acid (antioxidant), Oxyquinoline (chelating agent), Paclitaxel (antimicrotubule). | NextBio |


| Most Correlated Gene Perturbations | Downregulated by AICDA, CHMP2B, TP53BP2. Upregulated by NUPL1 and RTN. | NextBio |
| Other Diseases | Mood disorders, Chronic sinusitis, Gout, Hyperuricemia, Interstitial lung disease | NextBio |
| Interacting Protein Partners | Amyloid beta precursor protein APP | BioGrid |

| Putative Class of Protein | Carcinoma-associated, metal binding, calcium binding, potential transmembrane protein | UniProtKB, InterProscan, Human Protein Atlas, NextBio, Oncomine |

gene is expressed in diverse tumors analyzed by the Human Protein Expression Atlas (RN, data not shown).

The nature of the post-translational modification site was next investigated using diverse proteomic tools. CREF protein contains 273 Serine kinase/phosphatase motifs and 36 Serine binding motifs. This protein also contains 10 Tyrosine kinase/phosphatase motifs and 10 Tyrosine binding motifs according to the PhosphoMotif Finder [97] tool from the HPRD database. N-glycosylation sites were not detected in CREF (NetNGlyc). However, a mucin type GalNAc O-glycosylation site was detected using NetOGlyc [98]. The tool PrePS predicts that the CREF protein is not a substrate for farnesylation. Using Myristoylator from Expasy we found that the CREF protein does not have N-terminus myristoylation sites [96].

The selective regulation of CREF gene expression was next investigated for CpG methylation (NextBio). The CREF gene is hypermethylated in lung and breast carcinomas and hypomethylated in liver carcinomas, consistent with the downregulation of the CREF gene in lung and breast carcinomas and its upregulation in liver carcinomas observed in the same dataset (meta analysis). Our results strongly suggest that the CREF gene is regulated at the level of DNA CpG methylation. Promoter hyper- and hypomethylation play a crucial role in the development of malignancy, and differential methylation is a critical determinant of regulation of gene expression [99, 100].

Additional data on the regulation of the CREF gene expression indicates control at two distinct phases of the cell cycle, G1 and G2. The gene perturbation studies dataset from NextBio provides a strong hint of key pathways involving specific kinases, key transcription factor signaling, nucleotide pathways and apoptotic pathways (APD and RN, manuscript in preparation). Our results also implicate the CREF gene in other diseases (see Table 3). We have identified one putative interacting protein partner, the amyloid beta precursor protein (APP). APP, a transmembrane protein, has recently been implicated in breast carcinomas [101]. Further, the expression of APP protein was detected in over 85% of tumors analyzed (Human Protein Atlas). Efforts are underway by co-expression analysis to clarify whether the CREF-APP interaction has a functional consequence. Using diverse interactome tools as well as microarray-based co-expression analysis, we have systematically begun to dissect the pathways in the mechanism involved in the CREF gene function in carcinomas (APD and RN, manuscript in preparation).

6.1. The CREF Gene, EF-Hand and Cancer

Our bioinformatics characterization of the CREF protein led to the new finding that it is a novel EF-Hand containing calcium binding transmembrane protein relevant to carcinomas. The EF-Hand domain consists of a duplication of two EF-hand units, where each unit is composed of two helices connected by a twelve-residue calcium-binding loop. The calcium ion in the EF-hand loop is coordinated in a pentagonal bipyramidal configuration. Many calcium-binding proteins contain an EF-hand type calcium-binding domain [102]. Numerous proteins, including S100 family members, calpain, phospholipase, myosin, oncomodulins and protein phosphatases, are included in a group of proteins which harbor the EF-Hand domain (see Prosite documentation entry PS00018).

Calcium binding proteins containing EF-Hand motifs are key targets for various cancers [102-109]. A well-studied prototype of the EF-Hand proteins is the S100 family [103, 110]. The S100 family members are differentially expressed in various solid tumors such as breast, lung, bladder, kidney, gastric, thyroid, prostate and oral cancers [103]. Individual family members of the S100 proteins act as tumor promoters as well as tumor suppressor genes, demonstrating a complex function in tumor growth [103, 111-112]. For example, S100A2 expression is downregulated in diverse tumor types (renal, neuroendocrine, skin, prostate) yet upregulated in lung, breast and pancreatic cancers (NextBio). The S100A2 gene is associated with a poor prognosis [107].

Many calcium-binding proteins belong to the same evolutionary family and share a type of calcium-binding domain known as the EF-hand (Prosite documentation entry PDOC00018). This domain consists of a twelve-residue loop flanked on both sides by a twelve-residue α-helical domain (PDB: 1CLL). The structural/functional unit of EF-hand proteins involves a pair of EF-hand motifs that together form a stable four-helix bundle domain. The EF-hand pairing is essential to the cooperativity in the binding of Ca2+ ions. The consensus pattern of the EF-Hand calcium binding is D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].

The differential regulation of the CREF protein in diverse tumors suggests a complex tissue type regulation of gene expression. The CREF gene exists as at least three isoforms, but at present, expression data is available for only the canonical sequence, isoform 1. It is unclear whether any isoform-specific expression exists for this gene that would account for tissue specificity. Efforts are underway to further clarify the isoform specificity. The differential and complex expression profile of the CREF gene in diverse tumors is consistent with the role of EF-Hand containing calcium binding proteins [103, 106-107].

[Figure 4: flowchart. Panel labels: Knowledge Base; Text Mining; Identify uncharacterized/novel ORF proteins; Bioinformatics; Verify expression (RNA/Protein); Structural Proteomics; Predict function/nature (motif/domain/structure); Cancer Genome Analysis; GWAS; Druggable/Diagnostic Targets]

Fig. (4). Summary of illuminating the cancer proteome. The uncharacterized ORFs identified by text mining are verified for mRNA and protein expression using diverse bioinformatics tools. Structural proteomic analysis leads to protein characterization. The GWAS studies verify cancer relevance resulting in putative cancer leads.

7. CRITICAL ISSUES

The discovery of the CREF gene within the uncharacterized cancer proteome database demonstrates the feasibility of the approaches outlined in this review for illuminating the cancer proteome. However, several challenges remain in our ability to effectively mine the human genome for biomarkers and druggable proteins. The lack of universal text mining of protein queries across multiple knowledge bases necessitates individualized query definition. The quality of the datasets across diverse bioinformatics tools varies. Currently, a user must resort to trial-and-error approaches. Some of the major stumbling blocks for efficient mining include

1). A lack of correlation among various mRNA expression analysis tools, leading to high levels of false hits

2). Large variations in the quality of cDNAs used in transcriptome studies, which often contribute to unreliability of the data output

3). Multiple gene identifiers for a test gene across diverse bioinformatics tools, resulting in complex data output

4). Inadequate isoform expression related information, leading to an underestimation of the number of true druggable targets

5). An insufficient number of protein expression analysis tools, which seriously limits the interpretation of data gleaned from diverse mRNA analysis tools and

6). A lack of information on the modification status of the proteins in diverse tissues, an aspect crucial to the functional relevance of a protein.

8. SUMMARY

In this review, we have attempted to outline approaches to deciphering the dark matter of the human proteome. Novel proteins hold the clues to new approaches for targeted therapy, diagnosis and response to therapy prediction for cancer patients. Despite the challenges, mining the novel cancer proteome for druggable targets is a promising approach for rational drug discovery. A brief, streamlined approach shown in Fig. (4) can provide a basis for illuminating the dark matter of the cancer proteome. For the sake of simplicity, we have focused on the putative protein coding full length ORFs. The ncRNAs and partial protein fragments were filtered out from the database. Text mining the knowledge base creates a database of initial hits of novel proteins. Expression analysis allows identification of putative leads based on specificity. Motif prediction and cancer genome analysis for association unearth potential biomarkers and drug targets. The feasibility of our approach is demonstrated by our discovery and characterization of an ORF protein as a carcinoma related gene (CREF). The putative leads can be rapidly taken for wet laboratory verification. Improvements are still needed in the quality of datasets, the standardization of text mining queries, and in the number and scope of meta analysis and protein expression analysis tools if researchers are to tap the potential of the dark matter of the genome.

CONTRIBUTIONS

RN was responsible for the overall execution of the project. Data generation and validation were performed by APD, PB and SH.

CONFLICT OF INTEREST

The authors confirm that this article content has no conflict of interest.
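As an illustration of the motif prediction step, the EF-Hand consensus pattern quoted in Section 6.1 can be turned into a regular expression and scanned against a protein sequence. The sketch below is a minimal example under stated assumptions, not the procedure used in this review (which relied on Prosite and InterProScan). The function names `prosite_to_regex` and `find_motifs` are hypothetical, the converter handles only the simple pattern elements that appear in this consensus, and the test fragment is an invented sequence shaped like a canonical EF-hand loop rather than any part of the CREF protein.

```python
import re

# Consensus pattern for the EF-Hand calcium-binding loop, as quoted in Section 6.1.
EF_HAND_PATTERN = (
    "D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-"
    "[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]"
)

# One pattern element, optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supported elements (an assumption of this sketch): single residues,
    [allowed] sets, {excluded} sets, x for any residue, and (n) / (n,m) repeats.
    """
    pieces = []
    for element in pattern.split("-"):
        core, low, high = _ELEMENT.fullmatch(element).groups()
        if core == "x":
            piece = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            piece = "[^" + core[1:-1] + "]"  # excluded residues
        else:
            piece = core                     # literal residue or [allowed] set
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_motifs(sequence: str, pattern: str = EF_HAND_PATTERN):
    """Return (1-based start position, matched residues) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical test fragment: four filler residues, a loop-like stretch, four more.
    print(find_motifs("AAAADKDGDGTITTKELGGGG"))
```

Run as a script, this prints `[(5, 'DKDGDGTITTKEL')]` for the invented fragment. A pattern hit of this kind is only a first-pass signal; short consensus patterns can match by chance, so a lead would still need the domain-level and expression checks described above.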


ACKNOWLEDGEMENTS

This work was supported in part by the Genomics of Cancer Fund, Florida Atlantic University Foundation. We thank Jeanine Narayanan for editorial assistance.

SUPPLEMENTARY MATERIAL

Supplementary material is available on the publisher's web site along with the published article.

REFERENCES

[1] Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409(6822): 860-921.
[2] Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001; 291(5507): 1304-51.
[3] Wheeler DA, Wang L. From human genome to cancer genome: the first decade. Genome research 2013; 23(7): 1054-62.
[4] Alfoldi J, Lindblad-Toh K. Comparative genomics as a tool to understand evolution and disease. Genome research 2013; 23(7): 1063-8.
[5] Brunschweiger A, Hall J. A decade of the human genome sequence--how does the medicinal chemist benefit? Chem Med Chem 2012; 7(2): 194-203.
[6] Pertea M, Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome biology 2010; 11(5): 206.
[7] Strausberg RL. The Cancer Genome Anatomy Project: new resources for reading the molecular signatures of cancer. The Journal of pathology 2001; 195(1): 31-40.
[8] Scheurle D, DeYoung MP, Binninger DM, Page H, Jahanzeb M, Narayanan R. Cancer gene discovery using digital differential display. Cancer research 2000; 60(15): 4037-43.
[9] De Young MP, Damania H, Scheurle D, Zylberberg C, Narayanan R. Bioinformatics-based discovery of a novel factor with apparent specificity to colon cancer. In vivo 2002; 16(4): 239-48.
[10] Narayanan R. Bioinformatics approaches to cancer gene discovery. Methods Mol Biol 2007; 360: 13-31.
[11] Schmitt AO. Mining expressed sequence tag (EST) libraries for cancer-associated genes. Methods Mol Biol 2010; 576: 89-98.
[12] Lauriola M, Ugolini G, Rosati G, et al. Identification by a Digital Gene Expression Displayer (DGED) and test by RT-PCR analysis of new mRNA candidate markers for colorectal cancer in peripheral blood. Int J Oncol 2010; 37(2): 519-25.
[13] Boon K, Osorio EC, Greenhut SF, et al. An anatomy of normal and malignant gene expression. Proceedings of the National Academy of Sciences of the United States of America 2002; 99(17): 11287-92.
[14] Kent WJ, Sugnet CW, Furey TS, et al. The human genome browser at UCSC. Genome research 2002; 12(6): 996-1006.
[15] Parkinson H, Sarkans U, Shojatalab M, et al. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2005; 33(Database issue): D553-5.
[16] Kuntzer J, Maisel D, Lenhof HP, Klostermann S, Burtscher H. The Roche Cancer Genome Database 2.0. BMC medical genomics 2011; 4: 43.
[17] Halling-Brown MD, Bulusu KC, Patel M, Tym JE, Al-Lazikani B. canSAR: an integrated cancer public translational research and drug discovery resource. Nucleic Acids Res 2012; 40(Database issue): D947-56.
[18] Forbes SA, Bindal N, Bamford S, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res 2011; 39(Database issue): D945-50.
[19] Monga M, Sausville EA. Developmental therapeutics program at the NCI: molecular target and drug discovery process. Leukemia 2002; 16(4): 520-6.
[20] Liu F, White JA, Antonescu C, Gusenleitner D, Quackenbush J. GCOD - GeneChip Oncology Database. BMC bioinformatics 2011; 12: 46.
[21] Hopkins AL, Groom CR. The druggable genome. Nature reviews Drug discovery 2002; 1(9): 727-30.
[22] Hauptman N, Glavac D. MicroRNAs and long non-coding RNAs: prospects in diagnostics and therapy of cancer. Radiol Oncol 2013; 47(4): 311-8.
[23] Costa PM, Pedroso de Lima MC. MicroRNAs as Molecular Targets for Cancer Therapy: On the Modulation of MicroRNA Expression. Pharmaceuticals 2013; 6(10): 1195-220.
[24] Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets in Drosophila. Genome biology 2003; 5(1): R1.
[25] Dweep H, Sticht C, Pandey P, Gretz N. miRWalk--database: prediction of possible miRNA binding sites by "walking" the genes of three genomes. J Biomed Inform 2011; 44(5): 839-47.
[26] Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. Prediction of mammalian microRNA targets. Cell 2003; 115(7): 787-98.
[27] Xiao F, Zuo Z, Cai G, Kang S, Gao X, Li T. miRecords: an integrated resource for microRNA-target interactions. Nucleic Acids Res 2009; 37(Database issue): D105-10.
[28] Jiang Q, Wang Y, Hao Y, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2009; 37(Database issue): D98-104.
[29] Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics 2013; 29(5): 638-44.
[30] Martin L, Chang HY. Uncovering the role of genomic "dark matter" in human disease. J Clin Invest 2012; 122(5): 1589-95.
[31] Nagano T, Fraser P. No-nonsense functions for long noncoding RNAs. Cell 2011; 145(2): 178-81.
[32] Mak L, Liggi S, Tan L, Kusonmano K, et al. Anti-cancer drug development: computational strategies to identify and target proteins involved in cancer metabolism. Curr Pharm Des 2013; 19(4): 532-77.
[33] Natrajan R, Wilkerson P. From integrative genomics to therapeutic targets. Cancer research 2013; 73(12): 3483-8.
[34] Nevins JR, Potti A. Mining gene expression profiles: expression signatures as cancer phenotypes. Nat Rev Genet 2007; 8(8): 601-9.
[35] Chung CW. Small molecule bromodomain inhibitors: extending the druggable genome. Prog Med Chem 2012; 51: 1-55.
[36] Griffith M, Griffith OL, Coffman AC, et al. DGIdb: mining the druggable genome. Nature methods 2013; 10(12): 1209-10.
[37] Rask-Andersen M, Almen MS, Schioth HB. Trends in the exploitation of novel drug targets. Nature reviews Drug discovery 2011; 10(8): 579-90.
[38] Russ AP, Lampel S. The druggable genome: an update. Drug discovery today 2005; 10(23-24): 1607-10.
[39] Zhu F, Han B, Kumar P, et al. Update of TTD: Therapeutic Target Database. Nucleic Acids Res 2010; 38(Database issue): D787-91.
[40] Wishart DS, Knox C, Guo AC, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 2006; 34(Database issue): D668-72.
[41] Knox C, Law V, Jewison T, et al. DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Nucleic Acids Res 2011; 39(Database issue): D1035-41.
[42] Li J, Duncan DT, Zhang B. CanProVar: a human cancer proteome variation database. Human mutation 2010; 31(3): 219-28.
[43] Nadzirin N, Firdaus-Raih M. Proteins of Unknown Function in the Protein Data Bank (PDB): An Inventory of True Uncharacterized Proteins and Computational Tools for Their Analysis. Int J Mol Sci 2012; 13(10): 12761-72.
[44] Babcock JJ, Li M. Deorphanizing the human transmembrane genome: A landscape of uncharacterized membrane proteins. Acta Pharmacol Sin 2014; 35(1): 11-23. doi: 10.1038/aps.2013.142. Epub 2013 Nov 18.
[45] UniProt C. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 2013; 41(Database issue): D43-7.
[46] Safran M, Dalah I, Alexander J, et al. GeneCards Version 3: the human gene integrator. Database: the journal of biological databases and curation 2010; 2010: baq020.
[47] Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, et al. ExPASy: SIB bioinformatics resource portal. Nucleic Acids Res 2012; 40(Web Server issue): W597-603.
[48] Rost B, Yachdav G, Liu J. The PredictProtein server. Nucleic Acids Res 2004; 32(Web Server issue): W321-6.
[49] Cong Q, Grishin NV. MESSA: MEta-Server for protein Sequence Analysis. BMC biology 2012; 10: 82.
[50] Rhodes DR, Kalyana-Sundaram S, Mahavisno V, et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007; 9(2): 166-80.


[51] Kupershmidt I, Su QJ, Grewal A, et al. Ontology-based meta-analysis of global collections of high-throughput public data. PloS one 2010; 5(9).
[52] Cheng JQ, Lindsley CW, Cheng GZ, Yang H, Nicosia SV. The Akt/PKB pathway: molecular target for cancer drug discovery. Oncogene 2005; 24(50): 7482-92.
[53] Chelala C, Lemoine NR, Hahn SA, Crnogorac-Jurcevic T. A web-based platform for mining pancreatic expression datasets. Pancreatology: official journal of the International Association of Pancreatology 2009; 9(4): 340-3.
[54] Mohandass J, Ravichandran S, Srilakshmi K, Rajadurai CP, Sanmugasamy S, Kumar GR. BCDB - A database for breast cancer research and information. Bioinformation 2010; 5(1): 1-3.
[55] Kao S, Shiau CK, Gu DL, et al. IGDB.NSCLC: integrated genomic database of non-small cell lung cancer. Nucleic Acids Res 2012; 40(Database issue): D972-7.
[56] Dayem Ullah AZ, Cutts RJ, Ghetia M, et al. The pancreatic expression database: recent extensions and updates. Nucleic Acids Res 2013.
[57] Maqungo M, Kaur M, Kwofie SK, et al. DDPC: Dragon Database of Genes associated with Prostate Cancer. Nucleic Acids Res 2011; 39(Database issue): D980-5.
[58] Kaur M, Radovanovic A, Essack M, et al. Database for exploration of functional context of genes implicated in ovarian cancer. Nucleic Acids Res 2009; 37(Database issue): D820-3.
[59] Essack M, Radovanovic A, Schaefer U, et al. DDEC: Dragon database of genes implicated in esophageal cancer. BMC cancer 2009; 9: 219.
[60] Brylinski M. Exploring the "dark matter" of a mammalian proteome by protein structure and function modeling. Proteome science 2013; 11(1): 47.
[61] Blaxter M. Genetics. Revealing the dark matter of the genome. Science 2010; 330(6012): 1758-9.
[62] Korhonen A, Seaghdha DO, Silins I, Sun L, Hogberg J, Stenius U. Text mining for literature review and knowledge discovery in cancer risk assessment and research. PloS one 2012; 7(4): e33427.
[63] Aguiar-Pulido V, Seoane JA, Gestal M, Dorado J. Exploring patterns of epigenetic information with data mining techniques. Curr Pharm Des 2013; 19(4): 779-89.
[64] Zhu F, Patumcharoenpol P, Zhang C, et al. Biomedical text mining and its applications in cancer research. J Biomed Inform 2013; 46(2): 200-11.
[65] Rivenbark AG, Coleman WB. Dissecting the molecular mechanisms of cancer through bioinformatics-based experimental approaches. J Cell Biochem 2007; 101(5): 1074-86.
[66] Alibes A, Yankilevich P, Canada A, Diaz-Uriarte R. IDconverter and IDClight: conversion and annotation of gene and protein IDs. BMC bioinformatics 2007; 8: 9.
[67] Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature genetics 2001; 29(4): 365-71.
[68] Uhlen M, Oksvold P, Fagerberg L, et al. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol 2010; 28(12): 1248-50.
[69] Kolker E, Higdon R, Haynes W, et al. MOPED: Model Organism Protein Expression Database. Nucleic Acids Res 2012; 40(Database issue): D1093-9.
[70] Mathivanan S, Ahmed M, Ahn NG, et al. Human Proteinpedia enables sharing of human protein data. Nature biotechnology 2008; 26(2): 164-7.
[71] Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database--2009 update. Nucleic Acids Res 2009; 37(Database issue): D767-72.
[72] Shen EH, Overly CC, Jones AR. The Allen Human Brain Atlas: comprehensive gene expression mapping of the human brain. Trends Neurosci 2012; 35(12): 711-4.
[73] Marchler-Bauer A, Zheng C, Chitsaz F, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 2013; 41(Database issue): D348-52.
[74] Finn RD, Bateman A, Clements J, et al. Pfam: the protein families database. Nucleic Acids Res 2013.
[75] Hunter S, Jones P, Mitchell A, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 2012; 40(Database issue): D306-12.
[76] Servant F, Bru C, Carrere S, et al. ProDom: automated clustering of homologous domains. Brief Bioinform 2002; 3(3): 246-51.
[77] Guex N, Peitsch MC. SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 1997; 18(15): 2714-23.
[78] Cerami E, Gao J, Dogrusoz U, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery 2012; 2(5): 401-4.
[79] Sur I, Tuupanen S, Whitington T, Aaltonen LA, Taipale J. Lessons from functional analysis of genome-wide association studies. Cancer research 2013; 73(14): 4180-4.
[80] Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2013; 41(Database issue): D808-15.
[81] Rual JF, Venkatesan K, Hao T, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005; 437(7062): 1173-8.
[82] Cowley MJ, Pinese M, Kassahn KS, et al. PINA v2.0: mining interactome modules. Nucleic Acids Res 2012; 40(Database issue): D862-5.
[83] Licata L, Briganti L, Peluso D, et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 2012; 40(Database issue): D857-61.
[84] Orchard S, Ammari M, Aranda B, et al. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2013.
[85] Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006; 34(Database issue): D535-9.
[86] Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011; 39(Web Server issue): W29-37.
[87] Katt WP, Cerione RA. Glutaminase regulation in cancer cells: a druggable chain of events. Drug Discov Today 2013.
[88] Sioud M, Leirdal M. Druggable signaling proteins. Methods Mol Biol 2007; 361: 1-24.
[89] Stegh AH. Targeting the p53 signaling pathway in cancer therapy - the promises, challenges and perils. Expert Opin Ther Targets 2012; 16(1): 67-83.
[90] Chizhikov V, Zborovskaya I, Laktionov K, et al. Two consistently deleted regions within chromosome 1p32-pter in human non-small cell lung cancer. Molecular carcinogenesis 2001; 30(3): 151-8.
[91] Gasparian AV, Laktionov KK, Belialova MS, Pirogova NA, Tatosyan AG, Zborovskaya IB. Allelic imbalance and instability of microsatellite loci on chromosome 1p in human non-small-cell lung cancer. Br J Cancer 1998; 77(10): 1604-11.
[92] Smyth I, Narang MA, Evans T, et al. Isolation and characterization of human patched 2 (PTCH2), a putative tumour suppressor gene in basal cell carcinoma and medulloblastoma on chromosome 1p32. Human molecular genetics 1999; 8(2): 291-7.
[93] Opocher G, Schiavi F, Vettori A, et al. Fine analysis of the short arm of chromosome 1 in sporadic and familial pheochromocytoma. Clinical endocrinology 2003; 59(6): 707-15.
[94] Huret JL, Ahmad M, Arsaban M, et al. Atlas of genetics and cytogenetics in oncology and haematology in 2013. Nucleic Acids Res 2013; 41(Database issue): D920-4.
[95] Sjoblom T, Jones S, Wood LD, et al. The consensus coding sequences of human breast and colorectal cancers. Science 2006; 314(5797): 268-74.
[96] Ofran Y, Rost B. ISIS: interaction sites identified from sequence. Bioinformatics 2007; 23(2): e13-6.
[97] Amanchy R, Periaswamy B, Mathivanan S, Reddy R, Tattikota SG, Pandey A. A curated compendium of phosphorylation motifs. Nature biotechnology 2007; 25(3): 285-6.
[98] Hansen JE, Lund O, Tolstrup N, Gooley AA, Williams KL, Brunak S. NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconjugate journal 1998; 15(2): 115-30.
[99] Ma X, Wang YW, Zhang MQ, Gazdar AF. DNA methylation data analysis and its application to cancer research. Epigenomics 2013; 5(3): 301-16.
[100] Gnyszka A, Jastrzebski Z, Flis S. DNA methyltransferase inhibitors and their emerging role in epigenetic therapy of cancer. Anticancer research 2013; 33(8): 2989-96.
[101] Takagi K, Ito S, Miyazaki T, et al. Amyloid precursor protein in human breast cancer: An androgen-induced gene associated with cell proliferation. Cancer science 2013.


[102] Ikura M, Osawa M, Ames JB. The role of calcium-binding proteins in the control of transcription: structure to function. BioEssays: news and reviews in molecular, cellular and developmental biology 2002; 24(7): 625-36.
[103] Salama I, Malone PS, Mihaimeed F, Jones JL. A review of the S100 proteins in cancer. European journal of surgical oncology: the journal of the European Society of Surgical Oncology and the British Association of Surgical Oncology 2008; 34(4): 357-64.
[104] Heizmann CW, Ackermann GE, Galichet A. Pathologies involving the S100 proteins and RAGE. Sub-cellular biochemistry 2007; 45: 93-138.
[105] Gibadulinova A, Tothova V, Pastorek J, Pastorekova S. Transcriptional regulation and functional implication of S100P in cancer. Amino acids 2011; 41(4): 885-92.
[106] Braunewell KH. The darker side of Ca2+ signaling by neuronal Ca2+-sensor proteins: from Alzheimer's disease to cancer. Trends in Pharmacological Sciences 2005; 26(7): 345-51.
[107] Wolf S, Haase-Kohn C, Pietzsch J. S100A2 in cancerogenesis: a friend or a foe? Amino acids 2011; 41(4): 849-61.
[108] Yanez M, Gil-Longo J, Campos-Toimil M. Calcium binding proteins. Advances in experimental medicine and biology 2012; 740: 461-82.
[109] Subramanian L, Polans AS. Cancer-related diseases of the eye: the role of calcium and calcium-binding proteins. Biochem Biophys Res Commun 2004; 322(4): 1153-65.
[110] Leclerc E, Heizmann CW. The importance of Ca2+/Zn2+ signaling S100 proteins and RAGE in translational medicine. Front Biosci 2011; 3: 1232-62.
[111] Hwang SK, Piao L, Lim HT, et al. Suppression of lung tumorigenesis by leucine zipper/EF hand-containing transmembrane-1. PloS one 2010; 5(9).
[112] Tsai WC, Lin YC, Tsai ST, et al. Lack of modulatory function of coding polymorphism S100A2_185G>A in oral squamous cell carcinoma. Oral Diseases 2011; 17(3): 283-90.

Received: December 28, 2013 Revised: January 17, 2014 Accepted: January 20, 2014