BIOINFORMATICS MINING OF THE DARK MATTER PROTEOME FOR

CANCER TARGETS DISCOVERY

by

Ana Paula Delgado

A Thesis Submitted to the Faculty of

The Charles E. Schmidt College of Science

In Partial Fulfillment of the Requirements for the Degree of

Master of Science

Florida Atlantic University

Boca Raton, Florida

May 2015

Copyright 2015 by Ana Paula Delgado


ACKNOWLEDGEMENTS

I would first like to thank Dr. Narayanan for his continuous encouragement, guidance, and support during the past two years of my graduate education. It has truly been an unforgettable experience working in his laboratory. I also want to express gratitude to my external advisor, Professor Van de Ven from the University of Leuven, Belgium, for his constant involvement and assistance on my project. Moreover, I would like to thank Dr. Binninger and Dr. Dawson-Scully for their advice and for agreeing to serve on my thesis committee. I also thank Provost Dr. Perry for his involvement in my project. I thank Jeanine Narayanan for editorial assistance with the publications and with this dissertation.

It has been a pleasure working with various undergraduate students, some of whom became lab mates, including Pamela Brandao, Maria Julia Chapado and Sheilin Hamid. I thank them for their expert help in the projects we were involved in.

Lastly, I want to express my profound thanks to my parents and brother for their unconditional love, support and guidance over the last couple of years. They were my rock when I was in doubt and never let me give up. I would also like to thank my boyfriend Spencer Daniel and best friends for being part of an incredible support system.


ABSTRACT

Author: Ana Paula Delgado

Title: Bioinformatics Mining of the Dark Matter Proteome for Cancer Targets Discovery

Institution: Florida Atlantic University

Thesis Advisor: Dr. Ramaswamy Narayanan

Degree: Master of Science

Year: 2015

Mining the human genome for therapeutic target(s) discovery promises novel outcomes. Over half of the proteins in the human genome, however, remain uncharacterized. These proteins offer a potential for new target(s) discovery for diverse diseases. Additional targets for cancer diagnosis and therapy are urgently needed to help move away from the cytotoxic era to a targeted therapy approach. Bioinformatics and proteomics approaches can be used to characterize novel sequences in the genome database to infer putative function.

The hypothesis that the motifs and protein domains of the uncharacterized proteins can be used as a starting point to predict the putative function of these proteins provided the framework for the research discussed in this dissertation.

Initially, a comprehensive atlas of 800 uncharacterized proteins was established using meta-analysis approaches. The novel proteins in the atlas were then characterized with a streamlined strategy that combined genome-wide association studies, transcriptome- and proteome-based expression studies, motif and domain analysis, and interactome and pathway mapping tools.

Druggable proteins such as channel proteins, receptors and transporters, as well as secreted biomarkers, were identified amongst the novel proteins. Genome association studies show the involvement of these novel proteins in multiple diseases. An uncharacterized calcium binding protein, the Carcinoma-Related EF-hand Protein (CREF), was chosen as a proof of concept for the approach. A comprehensive characterization of the CREF protein suggests therapeutic and diagnostic potential in breast, liver and lung cancers. The atlas of the novel proteins generated in this study provides a rationale for new target(s) discovery for cancer and other diseases.


BIOINFORMATICS MINING OF THE DARK MATTER PROTEOME FOR

CANCER TARGETS DISCOVERY

List of Tables

List of Figures

Specific Aims

Chapter 1: Background and Significance

Genes and Cancer

Bioinformatics and Cancer Gene Discovery

Novel ESTs and Uncharacterized Proteins

Illuminating the Dark Matter of the Human Genome

Protein Domains, Motifs, and Fingerprints

Genome Wide Association Studies

Genome to Phenome Studies

Pharmacogenomics

Drug Discovery

Biomarkers

Significance

Chapter 2: Materials and Methods

Genome Analysis

Transcriptome Analysis

Proteome Analysis

Knowledge-based Datamining

Textmining Query Definition

GeneALaCart (LifeMap Discovery) Batch Analysis

Data Analysis

Statistics

Chapter 3: The Approach and Experimental Design

Database Generation

The Approach

Expression Analysis

The mRNA Expression

The Protein Expression

Protein Class

Characterization of Protein Motifs

Structure, Interaction and Pathway Identification

GWAS and Genome to Phenome Analysis

Putative Diagnostic/Druggable Targets

Chapter 4: Results

Characterization of the Carcinoma Related EF-Hand Protein

Expression Profiling of the CREF Protein

GWAS of the CREF Gene

Comprehensive Mutational Analysis of the CREF Gene

Molecular Characterization of the CREF Gene

Regulation of the CREF

Effect of Knockout of Genes on CREF

Effect of Knockdown of Genes on CREF

Effect of Overexpression of Genes on CREF

Effect of Mutations of Genes on CREF

Effect of Methylation on CREF Gene Regulation

MicroRNA Regulation of the CREF Gene

Protein Motif and Domains Analysis

Structural Characterization

3D Modeling of the CREF Protein

Templates Used by I-TASSER

Proteins with Highly Similar Structure in PDB

Post Translation Modification

Interactome Mapping for the CREF Protein

Involvement of CREF in Other Diseases

Further Supporting Evidence for the Gene Discovery Approach

Establishment of the Dark Matter Landscape

OncoORF as a Disease Specific Sub-database

Prediction of a Secreted Biomarker as a Further Proof of Concept for the Approach

Applicability of the Approach to Another Therapeutic Area: Diabetes

Chapter 5: Discussion

Calcium Binding Proteins and Cancer

Cancer Implications of the CREF Protein

Genome-Wide Association Studies of the CREF Gene

Regulation of the CREF Gene

Critical Issues

The Correlation of the mRNA and Protein Expression

Lack of Consistency across Bioinformatics Tools

Future Directions

Assay Development for CREF Transporter Function for

Profiling Methylation Status of the CREF Gene in the Tumors of Interest Could Lead to Biomarker Identification

CREF, Loss of Heterozygosity and Cancer

Appendices

Appendix A – List of Publications

Appendix B – CREF Characterization

Bibliography

TABLES

Table 1. Biomarkers and Respective FDA Approved Drug

Table 2. Verification of Motif/domain Tools with Controls

Table 3. Motif Detection Analysis of the Hit Proteins Under Study

Table 4. Detailed Mutational Analysis of the CREF Gene

Table 5. Effect of Pharmacological Agents on CREF by NextBio

Table 6. Perturbation of Top Genes and Their Effect on CREF by NextBio Analysis

Table 7. List of PDB Hits Structural Analogs of the CREF Protein

FIGURES

Figure 1. The Human Genome Dark Matter Landscape

Figure 2. Bioinformatics and Proteomic Approach

Figure 3. Expression Profiling of CREF

Figure 4. Oncomine Expression Profiling of CREF in Lung and Breast Tumors

Figure 5. mRNA Expression Profiling of C1orf87/CREF

Figure 6. Mutational Types in the C1orf87/CREF Gene

Figure 7. CREF Gene Perturbations Study

Figure 8. Comparative C1orf87/CREF Gene Changes with Positive Controls for Lung Adenocarcinoma

Figure 9. CREF Gene as a Response to Therapy Predictor in Lung Carcinoma

Figure 10. Alternative mRNAs of the CREF Gene

Figure 11. Approach of Identification of CREF Regulating

Figure 12. Motif and Domains Analysis of the CREF Protein

Figure 13. The Secondary Structure of the CREF Protein Predicted by the MESSA Meta Analysis Tool

Figure 14. 3D Modeling of the CREF Protein Using the I-Tasser Tool Based on Protein Templates

Figure 15. List of Genes Interacting with the C1orf87/CREF

Figure 16. Postulated Model for Regulation of the CREF Gene

SPECIFIC AIMS

Aim 1: To establish a database of novel, putative cancer-associated proteins.

By using bioinformatics tools, a database of proteins with unknown function that show a strong level of cancer association will be established. The cancer selectivity of each protein will be validated, using electronic expression, microarray and proteomic tools. Hits will be validated by performing correlation studies of mRNA and protein expression.

Aim 2: To perform motif and interaction analysis of the novel proteins for prediction of putative function.

Diverse bioinformatics and proteomic tools will be used to detect protein signatures and putative networks and pathways. A knowledge database of the uncharacterized proteins with putative functions will be established.

Aim 3: To develop proof of concept for the approach by verifying one target protein (C1orf87).

The chosen target protein (C1orf87) will go through a thorough characterization. The analysis will include the mRNA and protein expression levels, protein signatures and 3D structures, somatic mutations, protein interactions, clinical relevance, drug interactions and genetic association. A possible druggable target/biomarker nature will be established.

Aim 4: To integrate C1orf87 results obtained in this thesis and results of similar studies on other target proteins to substantiate proof of concept.

The bioinformatics tools defined in the C1orf87 studies will be used to predict a secreted protein. The biomarker potential and genetic association of such a protein in cancer will be investigated in detail. Expanding target(s) discovery to a different therapeutic area (diabetes) will test the broad scope of the approach.


CHAPTER 1: BACKGROUND AND SIGNIFICANCE

Genes and Cancer

Cancer is a disease with one of the highest mortality rates in the world, owing to a serious lack of effective treatments for various cancers. Cancer is a consequence of widespread genetic abnormalities and aberrations in gene expression. Many tumor-specific mutations can lead to cancer, including the activation of proto-oncogenes to oncogenes and the inactivation of tumor suppressor genes (1). The mutation of these classes of genes can cause the cell to become aberrant, resulting in uncontrolled proliferation (2-4).

A proto-oncogene is a normal gene which, when altered by mutation, becomes an oncogene that can contribute to cancer. Most proto-oncogenes play an important role in diverse normal cellular processes such as differentiation, adhesion and the cell cycle (5). Three types of mutation mechanisms can result in the activation of oncogenes: 1) point mutations in proto-oncogenes leading to a constitutively active protein; 2) amplification of a region of DNA sequence that codes for a gene, resulting in overexpression of the encoded protein; and 3) a chromosomal translocation, such as the Philadelphia translocation t(9;22), which generates a hybrid fused gene, as is the case with the BCR-ABL gene, causing aberrant expression of the hybrid gene (6, 7). Alternatively, the chromosomal translocation could involve the swapping of a gene through reciprocal chromosomal translocations involving 8q12, as is the case for the PLAG1 gene and beta-catenin (CTNNB1). This results in the activation of the PLAG1 gene in a subset of salivary gland pleomorphic adenomas, with a concomitant reduction of the CTNNB1 protein (8).

Once mutated or activated, these genes become oncogenes and perturb normal cellular functions, leading to indefinite cell growth. When these cells become tumorigenic, the cancer cells start to invade normal tissues (9). One of the most commonly mutated genes resulting in aberrant cancerous growth is the one encoding the GTPase Ras protein, whose mutation causes uncontrolled proliferation of the cell (10).

Tumor suppressor genes encode proteins that serve to inhibit cell proliferation, and their inactivation due to mutations can cause a normal cell to become cancerous (11). There are five main classes of proteins encoded by tumor suppressor genes that have a role in normal cell division. These proteins are 1) intracellular proteins that inhibit progression of cell growth, 2) receptors for secreted proteins that halt cell proliferation, 3) checkpoint-control proteins that stop the cell cycle if DNA becomes aberrant, 4) proteins that promote cell death (apoptosis), and 5) enzymes needed to repair damaged DNA (12). An example of a tumor suppressor commonly mutated in cancer is the protein TP53 (13). Once mutated, it can fail to sense DNA damage and can block apoptosis, allowing the cell to proliferate indefinitely (14).

In addition to the oncogenes and tumor suppressor genes, the stability genes (for example ATM, MLH1) are a third type of regulatory mechanism that keeps genetic alterations to a minimum. When these genes become inactivated, other genes suffer from mutations at a higher rate (15).

Bioinformatics and Cancer Gene Discovery

The field of bioinformatics is constantly developing and introducing new multifaceted tools for genome biology and medical applications. A vast amount of cancer research uses bioinformatics tools to analyze cancer-specific expression profiles of genes by mRNA datamining of microarray data and by protein expression analysis. The data this field deals with include genomic, mRNA and protein sequences and their respective modifications and functions. The proper management of the available tools can lead to discoveries that can subsequently be validated by wet laboratory experiments. The use of these tools can potentially save resources and accelerate the process of discovery and characterization (16-21).

The majority of all sequences collected from the Human Genome Project and other genetic projects are stored in genomic databases. It is estimated that the total number of genes in the human genome ranges from 20,000 to 25,000 (22). The most commonly used databases are those of the European Molecular Biology Laboratory (EMBL)/European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI, GenBank) (23), and the DNA Data Bank of Japan (DDBJ) (24). All these databases are constantly being updated and are integrated. Thirty million or more expressed sequence tags (ESTs) stored in GenBank (23) are human; however, an increasing number of sequences from other species and biological models are being introduced to this and other databases (16, 25, 26).

Since the completion of the Human Genome Project a decade ago, most of the efforts have revolved around the known genes, leaving the unknown genes (about half of the genes) to be mined. The analysis of ESTs and uncharacterized protein sequences could be very useful for new target discovery. Different bioinformatics tools can be used for characterization of both known and unknown genes. Algorithms such as the Basic Local Alignment Search Tool (BLAST) (27) from the National Center for Biotechnology Information (NCBI) (28) website can be used to compare a sequence against different databases for identification purposes. There is a series of mRNA expression profiling tools, such as Serial Analysis of Gene Expression (SAGE) (29) from the Cancer Genome Anatomy Project (CGAP) (30), that display gene expression in normal and malignant human tissues. Protein expression tools like The Human Protein Atlas (HPA) (31, 32) store expression profiles of proteins based on immunohistochemistry for a vast number of human tissues, cancers and cell lines. Other tools such as ScanProsite (33), HMMER (34), InterProScan (35, 36), and the DAVID v6.7 annotation tool (37) allow scanning of protein sequences for motifs and domains. Furthermore, a tool such as the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (38) facilitates gene network predictions from a database of known and predicted protein-protein interactions. The numerous tools available can serve in the discovery and characterization of novel genes and add information about known ones.

Examples from our laboratory of novel genes and drug therapy targets discovered with bioinformatics tools include the Colon Carcinoma Related Gene (CCRG) and the identification of the SIM2 gene as a target for solid tumors. The discovery of the CCRG gene was initiated by data mining the cancer genome using the Digital Differential Display (DDD) tool from the CGAP (30), which stores thousands of known and novel expressed sequences in the form of ESTs (18). The tool was used in our laboratory to predict solid tumor- and organ-specific genes from the EST database, with one particular EST showing high specificity to colon tumor tissues. This gene was not present in other carcinomas such as those of breast, lung, ovary and prostate, further indicating its specificity for colon tumors. A wet lab validation showed that the gene stimulates proliferation of colon cancer cells. This gene encodes a novel cysteine-rich motif and belongs to a new class of chemokine growth factors (19).

Another example of the bioinformatics approach to drug target discovery from our laboratory is the identification of the Down's syndrome critical locus gene SIM2-s (Single Minded 2 short form) as a specific target for solid tumors (39-42). Following the bioinformatics strategy used to find the CCRG gene, the SIM2-s gene was discovered in the solid tumor-specific EST database datamined from the CGAP database (30) of the National Cancer Institute (NCI). This gene was shown to be highly relevant to colon, pancreas and prostate tumors, and it showed lack of expression in other cancers such as breast, ovary and lung. In wet lab experiments, antisense inhibition of the gene in cell lines and in animal models inhibited both gene expression and cell growth, and promoted apoptosis (39). The identification of genes like CCRG and SIM2-s supports the discovery of novel genes and drug targets using computational methods in combination with wet lab validation.

Novel ESTs and Uncharacterized Proteins

The term ESTs refers to short DNA sequences derived from complementary DNA (cDNA) libraries constructed from a tissue. ESTs can be used for novel gene discovery and to map gene positions within a genome. Novel ESTs can be expanded into a larger consensus sequence and scanned using ORF finder tools to predict an open reading frame (ORF) of a novel protein (43). This ORF protein sequence can be further analyzed for domains and motifs to predict protein function.
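The ORF prediction step described above can be illustrated with a minimal sketch. It is not the ORF finder tool cited in the text; the function names `find_orfs` and `longest_orf` and the `min_codons` threshold are hypothetical, and the scan is deliberately simplified to the three forward reading frames of a consensus sequence (a real tool would also scan the reverse complement and support alternative genetic codes).

```python
# Simplified ORF scan: forward strand only, standard stop codons (assumption).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for every ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    coding codons (start codon included, stop codon excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next start
    return orfs


def longest_orf(seq, min_codons=3):
    """Return the nucleotide sequence of the longest ORF, or None if none exists."""
    orfs = find_orfs(seq, min_codons)
    if not orfs:
        return None
    start, end, _ = max(orfs, key=lambda orf: orf[1] - orf[0])
    return seq.upper()[start:end]


if __name__ == "__main__":
    consensus = "CCATGAAATTTGGGTAACC"  # toy sequence, not real EST data
    print(find_orfs(consensus))
    print(longest_orf(consensus))
```

The predicted ORF would then be translated and passed to the motif and domain tools discussed in the following sections.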

An uncharacterized protein is a protein with a function that has not yet been defined; however, the identification of protein motifs in these sequences can be used for functional characterization of the protein. Information pertaining to proteins of unknown function is constantly being added to databases, and it might provide the basis for further structural and functional genomic analysis (44).

Illuminating the Dark Matter of the Human Genome

The Human Genome Project provides a starting point for gene discovery to facilitate diagnosis and therapy of diverse diseases (26, 45). A considerable knowledge base exists for the well-characterized known genes; however, a significant number of proteins in the genome still remain to be characterized. These genes, together with the non-coding RNAs (ncRNAs), are called the “Dark Matter of the human genome” (46-48) and offer a potential to discover new opportunities for diagnosis and therapy (49-52). Realizing the importance of this area, the National Cancer Institute has recently announced a new funding initiative called “Illuminating the Dark Matter of the Genome for Druggability” (http://commonfund.nih.gov/idg/index).

The current status of the human genome, encompassing the known and the uncharacterized proteins, is shown in Figure 1. This includes protein coding sequences, the RNA genes, pseudogenes and the uncharacterized genes including the ORFs. Nearly half of the protein coding sequences (10,427/23,282) are uncharacterized to date. In addition, a vast number of uncharacterized proteins (n = 3,451), pseudogenes, which are evolutionary remnants of genes that have lost their protein-coding capabilities (n = 12,722), and ncRNAs (n = 5,883) are present in the genome. Interestingly, RNA genes make up the largest class of genes in the human genome (n = 123,309). All these offer additional clues for novel targets discovery and for the creation of a new knowledge base of the genome (53, 54).

Currently, the uncharacterized proteins in the databases are identified with ORF numbers (for example C1orf102, Open Reading Frame 102 present in chromosome 1), clone ID numbers (for example KIAA clones) or simply by a GenBank or Ensembl ID number. Limited protein motif and domain results are available for these ORFs, which makes functional annotation and prediction of protein classes difficult. Numerous bioinformatics and proteomics tools have become available in recent years, such as microarray and Next Generation sequencing analysis (55), protein expression analysis (31, 32, 56-58), protein motif and domain analysis tools (59) and the Genome Wide Association Studies (GWAS) tools (60-66).

The whole genome transcriptome (mRNA) and proteome expression datasets for both the known and the uncharacterized ORFs are available in the various metadata resources of the human genome. Peptide antibodies to nearly 90% of the human proteins are made available by the Human Genome project, and these antibodies have been used in various datasets for protein expression analysis using immunohistochemistry of both normal and tumor tissues (31, 32). The Cancer Genome Atlas (TCGA) (67) has expression data for both mRNA and protein for matched samples (surrounding normal and tumor) from the same patients. This allows a correlation of gene expression at the mRNA and protein levels to be developed (49). These enhanced capabilities allowed us to derive functional knowledge about the uncharacterized genome (the ORFs). Hence, a comprehensive knowledge base can readily be created for these ORFs to aid in the prediction of putative function. Using these approaches, it is possible to begin to decipher the Dark Matter of the human proteome to predict novel uses for therapy and diagnosis.

Figure 1: The Human Genome Dark Matter Landscape. The current number of genes (known and uncharacterized) was obtained from the GeneCards, HGNC and UniProtKB databases. The disease associated ORFs were identified from the Genetic Association database, the MalaCards and the NextBio Meta analysis tool.

Protein Domains, Motifs, and Fingerprints

Protein domains are individual units within the amino acid sequence of a protein that impart a particular function or structure. Domains result from polypeptide chains that contain ~200 amino acid residues which generally fold into two or more globular clusters. The structure of a domain is assumed to fold independently of the rest of the protein, as it determines a certain function and interaction towards the role of proteins (68, 69).

Protein sequence motifs are structural signatures of protein families within domains that are used for protein function prediction. This term often implies the conservation of short regions within larger protein sequences, such as signal peptides and DNA-binding regions. The majority of protein motifs are detectable sequences that can be used to group a set of proteins into families or superfamilies. The amino acid motifs are shared across different members of a family of proteins with a common function (70).

Protein motifs can be identified as signatures for protein families in protein/domain databases such as PROSITE (71), BLOCKS (72), InterProScan (35, 36) and PFAM (73). These various tools can scan the ORF sequences for the presence of amino acid signatures and also provide clues regarding the putative function of the motif (70).
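As an illustration of how such signature scanning works in principle, the sketch below converts a PROSITE-style pattern into a regular expression and reports every match in a protein sequence. It is a simplified stand-in, not the ScanProsite implementation: the helper names `prosite_to_regex` and `scan_motif` are hypothetical, only the basic pattern syntax is handled (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` terminus anchors), and the example sequence is invented. The pattern used is the well-known N-glycosylation site signature `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} = any residue but P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) = repeat count
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # N-/C-terminus anchors
        parts.append(element)
    return "".join(parts)


def scan_motif(sequence, pattern):
    """Return (position, matched_text) for every occurrence, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    n_glycosylation = "N-{P}-[ST]-{P}."
    toy_protein = "MKNGSAANPTLNLTQ"  # invented sequence for illustration only
    print(prosite_to_regex(n_glycosylation))
    print(scan_motif(toy_protein, n_glycosylation))
```

Short patterns of this kind match frequently by chance, which is one reason the database tools report them alongside profile-based domain hits rather than on their own.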

Protein fingerprints can also be used for characterization of unknown genes. Fingerprints are groups of conserved motifs within sequence alignments of similar proteins, which can be used as signatures of a class of protein members. For diagnostic purposes, protein fingerprints appear to be more useful than single motif identification (74).

Genome Wide Association Studies

The GWAS analyze genetic variations occurring at the level of single nucleotide polymorphisms (SNPs) across diverse populations (75). Genome sequences from thousands of humans are being made available through the International HapMap Project (HapMap) (76). This dataset is dynamic and continues to expand, for example through the UK's Genomics England 100K project (http://www.genomicsengland.co.uk/).

The GWAS datasets can be accessed using the NCBI 1000 Genomes Browser (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/), the UCSC Genome Browser (77), the Database of Genomic Variants, DGV (66) and the GeneCards (78). Currently, the 1000 Genomes Project encompasses datasets from the following populations: Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo (JPT); Chinese in Beijing (CHB); Utah residents with ancestry from northern and western Europe (CEU); Luhya in Webuye, Kenya (LWK); Maasai in Kinyawa, Kenya (MKK); Toscani in Italy (TSI); Peruvians in Lima, Peru (PEL); Gujarati Indians in Houston (GIH); Chinese in metropolitan Denver (CHD); people of Mexican ancestry in Los Angeles (MXL); and people of African ancestry in the southwestern United States (ASW). These datasets make it possible to perform population-based analyses for a target gene, making the interpretations more reliable.

Genome to Phenome Studies

Linking a polymorphic SNP to a specific disease using the GWAS datasets offers an advantage in identifying the most relevant targets. Thus, a novel uncharacterized ORF can be investigated rapidly for a disease-relevant phenotype. Clinically relevant SNPs can be verified using the NCBI Clinical Variations (ClinVar) database (65), which can facilitate lead target prioritization.

The phenotypic studies at the level of an individual SNP can be linked to mRNA expression, which enables genome to phenome analysis using the NCBI Phenome-Genome Integrator bioinformatics tool, PheGenI (60). An expression Quantitative Trait Locus (eQTL) (79) represents a marker (locus) in the genome in which variation between individuals is associated with a quantitative gene expression trait, often measured as mRNA abundance. The eQTL analysis incorporates i) a SNP marker; ii) the gene expression levels, as measured by a probe or sequence information; and iii) a measure of the statistical association between the two in a study population, such as the P-value. The eQTL browser provides an approach to query the eQTL database (79). The eQTLs can be cis, where the genotyped marker is near the expressed gene, or trans, in which the genotyped marker is distant from the expressed gene, either on the same or on another chromosome. Currently only the cis-eQTLs are available (80).
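The three ingredients of an eQTL analysis listed above can be made concrete with a minimal sketch. It is not the method used by the eQTL browser; it assumes an additive model in which genotypes are coded as the number of minor alleles (0, 1 or 2) and expression is regressed on that dosage by ordinary least squares. The function names, the toy data and the 1 Mb cis window are assumptions made for illustration. The sketch reports the slope and Pearson correlation only; a P-value would additionally require a t-test from a statistics library.

```python
import math


def eqtl_association(dosages, expression):
    """Least-squares fit of expression on genotype dosage (additive model).

    Returns (slope, intercept, pearson_r).
    """
    if len(dosages) != len(expression) or len(dosages) < 3:
        raise ValueError("need paired observations from at least 3 individuals")
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / math.sqrt(sxx * syy)


def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_pos, cis_window=1_000_000):
    """Label a SNP-gene pair 'cis' or 'trans'; the 1 Mb window is an assumption."""
    if snp_chrom == gene_chrom and abs(snp_pos - gene_pos) <= cis_window:
        return "cis"
    return "trans"


if __name__ == "__main__":
    dosages = [0, 0, 1, 1, 2, 2]                 # toy genotypes, not real data
    expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]  # toy mRNA abundance
    print(eqtl_association(dosages, expression))
    print(classify_eqtl("1", 60_000_000, "1", 60_400_000))
```

In this toy example each additional copy of the minor allele raises expression by about one unit, which is the kind of marker-expression relationship an eQTL entry summarizes.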

Thus, the analysis of the uncharacterized ORFs using such tools can provide valuable information on the relevance of the gene product(s) for diverse disease and response to therapy phenotype traits. Discovery of an ORF associated with a disease trait, in conjunction with protein motif information, can provide a rationale for a druggable target or diagnostic marker. Analyzing the ORF protein expression in diverse body fluids (the secretome) such as blood, saliva, tears and urine using the proteomics expression datasets can help verify biomarker potential. Utilizing a genome to phenome approach with these tools, our laboratory has recently identified phenotypic traits and known and novel ORF proteins associated with diverse diseases including diabetes (81), pancreatic cancer (82), neurodegenerative diseases (83) and Ebola virus disease (84).

Pharmacogenomics

Pharmacogenomics is a fast growing field in biology that seeks to use genetic information from patients to predict drug response (21). Certain drugs might not be as effective in particular patients due to select genetic differences.

Druggable Targets

Prior knowledge of gene mutation and drug responsiveness is already beginning to provide the basis for personalized medicine. An understanding of how a drug would affect a disease-causing protein is essential in drug discovery. By the same token, knowledge of the target's expression is essential to determine whether possible toxicity factors will be minimal. About half of the current drug targets are cell surface receptors, a quarter are enzymes, and the remaining quarter are hormones, with a small percentage of unknown classes of targets (85, 86).

An example of this powerful approach to personalized medicine is the adjuvant therapy with Trastuzumab (Herceptin) for HER-2/neu-positive breast cancers. It has been reported that in 20%-25% of breast cancer cases, the overexpression of the HER-2/neu protein, which is a cell surface receptor, causes the transformation of normal mammary epithelial cells to malignant cancerous cells, resulting in a poor disease prognosis (87). Herceptin is a humanized monoclonal antibody that directly inhibits the function of the extracellular and the intracellular domains of the HER-2/neu protein. Herceptin has proven to benefit patients with metastatic breast cancer by prolonging disease-free survival. Thus, the HER-2/neu protein expression status provides a pharmacogenomics (response to therapy) target for choosing the therapy for breast cancer patients. Other examples of therapeutic approaches targeting proteins specific to cancer include imatinib (Gleevec) and gefitinib (Iressa). Gleevec is used to treat chronic myelogenous leukemia (CML) (88, 89) by inhibiting the BCR-ABL kinase that results from a chromosomal translocation. Iressa is used to treat lung cancer by inhibiting EGFR kinases (10, 90, 91).

Biomarkers

Tumor biomarkers are substances detectable in tissues, blood, urine, stool, and other bodily fluids that are produced by cancer cells or by cells responding to a tumor environment (92, 93). These proteins (the cancer secretome) are secreted into the body fluids by both classical and non-classical secretory pathways (94-96). Most current diagnostic/prognostic biomarkers are proteins, and more than 20 different tumor markers are currently in clinical use. Tumor markers are used in cancer treatment to facilitate detection, diagnosis and the assessment of treatment efficacy. Measuring the levels of tumor markers in patients can be used to determine treatment efficacy, or to verify whether the levels have decreased or increased, indicating the prognosis.

16 Similarly, microRNA (miRNA) and long non-coding RNA molecules

(lincRNA), which are considered to be non-coding RNA (ncRNA), have been identified in many biological processes (46), (89). These include transcriptional and post-transcriptional regulation of gene expression, and protein activity modulation (97). For this reason, many studies suggest that these ncRNA might be considered potential biomarkers, as these molecules are differentially expressed in tumors and not in normal healthy tissues (98, 99).

Multiple tumor markers are in clinical use, each with a corresponding drug to treat select cancers. Below is a current list of FDA approved biomarkers (Table 1):

Biomarker | Drug | Target Cancer
ERBB2 (HER2) | Ado-Trastuzumab Emtansine | Breast cancer, lymphoma
PML/RARα | Arsenic Trioxide | Leukemia
CD30 | Brentuximab Vedotin | Hodgkin lymphoma (HL), anaplastic large cell lymphoma
Ph Chromosome | Busulfan | Certain kinds of leukemia
DPD | Capecitabine | Breast cancer, colorectal cancer
EGFR | Cetuximab (1) | Colorectal cancer, head and neck cancer
KRAS | Cetuximab (2) | Colorectal cancer, head and neck cancer
TPMT | Cisplatin | Testicular cancer, bladder cancer, ovarian cancer, lung cancer
ALK | Crizotinib | Advanced or metastatic non-small cell lung cancer
Ph Chromosome | Dasatinib | Certain kinds of leukemia
CD25 | Denileukin Diftitox | Cutaneous T-cell lymphoma
EGFR | Erlotinib | Pancreatic cancer and non-small cell lung cancer
ERBB2 (HER2) | Everolimus | Advanced kidney cancer, subependymal giant cell astrocytoma, angiomyolipoma
ER &/ PGR | Exemestane | Advanced breast cancer
DPD | Fluorouracil (2) | Colorectal, pancreatic cancer, aggressive breast cancer
ER | Fulvestrant | Breast cancer
EGFR | Gefitinib | Breast cancer, lung cancer
C-Kit | Imatinib (1) | Chronic myelogenous leukemia, gastrointestinal stromal tumors
Ph Chromosome | Imatinib (2) | Chronic myelogenous leukemia, gastrointestinal stromal tumors
PDGFR | Imatinib (3) | Chronic myelogenous leukemia, gastrointestinal stromal tumors
FIP1L1-PDGFRα | Imatinib (4) | Chronic myelogenous leukemia, gastrointestinal stromal tumors
UGT1A1 | Irinotecan | Colon or rectum cancer
ERBB2 (HER2) | Lapatinib | Advanced or metastatic HER2-receptor positive breast cancer
ER &/ PGR | Letrozole | Certain types of breast cancer
TPMT | Mercaptopurine | Maintenance treatment of acute lymphatic leukemia
Ph Chromosome | Nilotinib (1) | Chronic myelogenous leukemia
UGT1A1 | Nilotinib (2) | Chronic myelogenous leukemia
EGFR | Panitumumab (1) | Metastatic colorectal cancer
KRAS | Panitumumab (2) | Metastatic colorectal cancer
ERBB2 (HER2) | Pertuzumab | Breast cancer
G6PD | Rasburicase | Leukemia, lymphoma
ER | Tamoxifen (1) | Breast cancer
Factor V Leiden (FV) | Tamoxifen (2) | Breast cancer
Prothrombin mutations (F2) | Tamoxifen (3) | Breast cancer
TPMT | Thioguanine | Certain kinds of leukemia
CD20 antigen | Tositumomab | Non-Hodgkin lymphoma
ERBB2 (HER2) | Trastuzumab | Breast cancer
PML/RARα | Tretinoin | Acute promyelocytic leukemia
BRAF | Vemurafenib | Melanoma

Table 1. Biomarkers and Respective FDA Approved Drug (http://www.fda.gov/drugs)

Significance

Over half of the genes in the human genome are yet to be characterized. Characterization of currently unknown genes offers an attractive starting point for new uses (both therapeutic and diagnostic) for cancer. The ability to predict putative motifs and domains for these novel proteins can provide a hint of function for further validation. Starting from full-length uncharacterized proteins (instead of mRNA) from the genome database can accelerate drug discovery and the identification of diagnostic markers. Creating a knowledgebase for the uncharacterized ORF proteins, the Dark Matter proteome, can help fill the void in the current understanding of these novel proteins.


CHAPTER 2: MATERIALS AND METHODS

Genome Analysis

The whole genome analysis was performed using the Genetic Association Database, GAD (100, 101), the University of California Santa Cruz (UCSC) Genome Browser (77), the Ensembl Genome Browser (102), the National Center for Biotechnology Information (NCBI) Gene, and the NCBI AceView (103).

The cancer genome analysis was performed using the Sanger Institute’s Catalogue Of Somatic Mutations In Cancer, COSMIC (63), the Integrated Drug Discovery platform, canSAR v2 (61), cBioPortal (64), the International Cancer Genome Consortium, ICGC (104), the Roche Cancer Genome Database, Mutome DB (105), the Expression Quantitative Trait Loci browser, eQTL (60), the Genotype-Tissue Expression Project, GTEx (79), and the European Bioinformatics Institute-European Molecular Biology Laboratory (EBI-EMBL) tools.

Transcriptome Analysis

The mRNA expression of the novel genes was studied using diverse transcriptome analysis tools. These tools included the NCBI Unigene, the SAGE Digital Anatomical Viewer (29), the Cancer Genome Anatomy Project, CGAP (17), the Oncomine microarray analysis tool (106), ArrayExpress (27), the Gene Expression Omnibus, GEO, and the Gene Indices from the Dana Farber Cancer Institute (107).

Proteome Analysis

The proteome analysis of the uncharacterized ORFs was performed using the UniProt Knowledgebase, UniProtKB (108), the Swiss ExPASy server, the Protein Data Bank (PDB), and diverse post-translational modification site tools at the ExPASy server (59). The protein structural models were generated using PredictProtein (109), the Meta Server for Sequence Analysis, MESSA (110), and the I-TASSER server (111). The protein motifs and domains of the ORFs were predicted using the NCBI Conserved Domain Database, CDD (112), Pfam (73), ProDom (113), PROSITE (71), InterProScan 4 (36), HMMER (34), the SignalP server (114), and the Eukaryotic Linear Motif prediction tool, ELM (115).

The protein expression analysis was performed using the Human Protein Atlas, HPA (31, 32), the Multi Omics Proteins Expression Database, MOPED (56, 116), the Human Protein Reference Database, HPRD (117), the Human Proteome Map, HPM (57), the Human Gene and Protein Database, HGPD (118), and Proteomics DB (58). The ORF expression in body fluids (secretome) was inferred from MOPED (56, 116), HPRD (117) and Proteomics DB (58). The fetal and adult tissue expression results were obtained from the Human Proteome Map. For the batch analysis, a comma-separated values (CSV) file of the Dark Matter ORF gene list was first generated for use with the protein expression analysis tools indicated above.

Knowledge-based Datamining

To further characterize the novel proteins, various knowledge-based databases were used. Protein and genome knowledge databases used included GeneCards (78), GeneAtlas (119), the NextBio transcriptome meta-analysis tool (120), MalaCards (121), Online Mendelian Inheritance in Man, OMIM, the HUGO Gene Nomenclature Committee, HGNC (111), AmiGO (122), the NCBI SNP database, the NCBI Phenotype-Genotype Integrator, PheGenI (60), the Expression Quantitative Trait Browser, eQTL (79), the NCBI Clinical Variations database, ClinVar (51), and the Ensembl database. The pathway and interactome analyses were performed using the STRING interactome (38), BioGRID (123), IntAct (124), the KEGG pathway database and the DAVID V6.7 functional annotation tool (125).

Textmining Query Definition

The UniProtKB proteomic database (108) was scanned for uncharacterized cancer-enriched proteins. The analysis consisted of several search queries that included keywords such as “human”, “uncharacterized”, and the cancer of a specific tissue type. Nineteen different tissue types were scanned for uncharacterized proteins. The tissues included were bladder, bone marrow (leukemia), brain, breast, cervix, colon, kidney, lymph nodes (lymphoma), liver, lung, ovary, pancreas, peritoneum, prostate, retina, skin (melanoma), stomach, thyroid, and uterus.
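The keyword combinations described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_queries`, the pairing of each tissue with a disease term, and the free-text query layout are assumptions made here, not the query syntax actually entered into UniProtKB.

```python
# Tissue list taken from the text; the second element is the disease keyword
# given in parentheses in the text, or None when only the tissue is named.
TISSUES = [
    ("bladder", None), ("bone marrow", "leukemia"), ("brain", None),
    ("breast", None), ("cervix", None), ("colon", None), ("kidney", None),
    ("lymph nodes", "lymphoma"), ("liver", None), ("lung", None),
    ("ovary", None), ("pancreas", None), ("peritoneum", None),
    ("prostate", None), ("retina", None), ("skin", "melanoma"),
    ("stomach", None), ("thyroid", None), ("uterus", None),
]


def build_queries(tissues=TISSUES):
    """Return one free-text keyword query per tissue (hypothetical layout)."""
    queries = {}
    for tissue, disease in tissues:
        # Assumption: use the named disease if given, else "<tissue> cancer".
        term = disease if disease else f"{tissue} cancer"
        queries[tissue] = f'human uncharacterized "{term}"'
    return queries


if __name__ == "__main__":
    for tissue, query in build_queries().items():
        print(f"{tissue}: {query}")
```

Keeping the tissue list in one place makes it easy to confirm that all nineteen tissue types are covered by a query.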

GeneALaCart (LifeMap Discovery) Batch Analysis

The GeneALaCart (LifeMap Discovery) Meta-analysis tool from the GeneCards (LifeMap Discovery) (78) was used to batch analyze the Dark Matter ORF database. Key information such as gene names, aliases and descriptions, gene ontology, protein motifs and domains, pathways and interactome data, as well as drug bank hits were obtained from these analyses.

Data Analysis

The entire GAD (100), HPA (31, 32) and Unigene (126) databases were downloaded and the Excel filtering tool was used to scan for the ORFs. Batch analysis of the ORF database was performed for canSAR (62), MOPED (56, 116), the DAVID V6.7 Annotation tool (37), the Human Proteome Map (57), Proteomics DB (58), HPRD (117), PheGenI (60) and the eQTL browser (79).

Two independent investigators verified all the bioinformatics mining data. Only results that were statistically significant per each tool’s requirement are reported. The default parameters indicated by each bioinformatics tool’s browser were used. The large amount of data downloaded from each database for use in the study was first generated as a tab-separated values file and was then imported into a working Excel document. The Dark Matter ORF Excel data were filtered using the advanced filtering options to create annotated therapeutic clusters. Prior to using a bioinformatics tool, a series of positive and negative control query sequences was tested to evaluate the predicted outcome of the results (81, 127, 128).
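The filtering step was carried out in Excel; an equivalent scripted filter is sketched below for illustration. The column name `gene`, the sample rows, and the regular expression for ORF-style gene symbols (such as C1orf87 or CXorf48) are assumptions introduced here and are not taken from the downloaded files.

```python
import csv
import io
import re

# Assumed pattern for uncharacterized ORF gene symbols, e.g. C1orf87, C17orf66, CXorf48.
ORF_SYMBOL = re.compile(r"^C(?:[1-9]|1\d|2[0-2]|X|Y)orf\d+$")


def filter_orf_rows(tsv_text, symbol_column="gene"):
    """Keep only the rows of a tab-separated table whose gene symbol looks like an ORF."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if ORF_SYMBOL.match(row[symbol_column].strip())]


# Toy input (hypothetical rows, for demonstration only).
sample = (
    "gene\tdisease\n"
    "C1orf87\tlung cancer\n"
    "TP53\tbreast cancer\n"
    "CXorf48\tmelanoma\n"
)
orf_rows = filter_orf_rows(sample)

if __name__ == "__main__":
    for row in orf_rows:
        print(row["gene"], row["disease"])
```

A scripted filter of this kind can be rerun unchanged whenever a database is updated, which a manual spreadsheet filter cannot guarantee.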

Statistics

The statistical significance of the mRNA expression results was inferred from the top 1% gene rank (Oncomine), with a fold change threshold value higher than 1.5, and a P-value lower than 0.05 (NextBio Transcriptome tool and Oncomine). The R-value parameter for the eQTL (79) tool was 0.3, the BLAST e-values were 10E-20 or less, and the top five hits were selected. When comparing normal versus cancer expression, the tissue sample sizes were taken into consideration. The protein expression data from mass spectrometry tools such as the Human Proteome Map (31, 32) took into consideration peptide spectral counts per gene per experiment. Samples with spectral counts higher than 8 indicated high expression and were chosen. For survival analysis with the cBioPortal (64) cancer genomics tools, results with a p-value lower than 0.05 were chosen.
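As a minimal sketch, the mRNA expression criteria above can be written as a single predicate. The class and function names are hypothetical, and applying the 1.5 threshold to the absolute value of a signed fold change is an assumption made here, since downregulation is reported with negative fold changes.

```python
from dataclasses import dataclass


@dataclass
class ExpressionResult:
    gene: str
    fold_change: float        # signed; negative values denote downregulation
    p_value: float
    gene_rank_percent: float  # 1.0 means "top 1%"


def passes_thresholds(result, min_abs_fold=1.5, max_p=0.05, max_rank_percent=1.0):
    """Apply the fold change, P-value and gene rank cutoffs stated in the text."""
    return (
        abs(result.fold_change) > min_abs_fold
        and result.p_value < max_p
        and result.gene_rank_percent <= max_rank_percent
    )


if __name__ == "__main__":
    # Values as reported for the Okayama Lung dataset in the Figure 3 caption.
    okayama = ExpressionResult("CREF", -2.834, 2.71e-4, 1.0)
    print(passes_thresholds(okayama))
```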


CHAPTER 3: THE APPROACH AND EXPERIMENTAL DESIGN

Database Generation

A streamlined approach to identify and verify the putative druggable/diagnostic targets from the list of disease-relevant uncharacterized ORFs is shown in Figure 2. The disease-related databases, including the Genetic Association Database (GAD) (100), the NextBio Transcriptome Meta-analysis (120) tool and the MalaCards (121), were textmined to cluster disease-related uncharacterized ORF proteins. The Genetic Association Database was downloaded in its entirety (update: 2014-04-19) and enriched using Excel filtering for open reading frames. Evidence of genetic association of the uncharacterized proteins was established from the output data. The preliminary cancer proteome database from the Narayanan lab (source: UniProtKB) was used to complement the GAD database (129). The NextBio Transcriptome (120) data were mined and diverse parameters including 1) the most correlated diseases, 2) tissues, 3) compounds and 4) perturbed genes information were collected. To identify additional disease-relevant ORFs, the MalaCards (121) disease database from the GeneCards (78) was also mined and the output from all three databases was merged. This initial step led to the creation of a master working database. This master database was manually curated and duplicates were eliminated. This resulted in a final database of 800 uncharacterized ORFs (52). These 800 uncharacterized ORFs provided the basis of the Dark Matter ORFs investigated in this dissertation.
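The merge and duplicate-removal step can be illustrated with a short Python sketch. The curation in this work was manual; the function name `merge_orf_lists` and the toy list contents below are hypothetical, and the assignment of gene symbols to particular databases is invented purely for demonstration.

```python
def merge_orf_lists(*sources):
    """Merge gene-symbol lists, dropping case-insensitive duplicates
    while keeping the first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for symbol in source:
            cleaned = symbol.strip()
            key = cleaned.upper()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged


# Toy lists standing in for the three text-mined outputs (memberships are hypothetical).
gad_hits = ["C1orf87", "C8orf34"]
nextbio_hits = ["C8orf34", "CXorf48"]
malacards_hits = ["c1orf87", "C4orf50"]

master = merge_orf_lists(gad_hits, nextbio_hits, malacards_hits)

if __name__ == "__main__":
    print(master)
```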

The Approach

Flowchart steps: Text mine disease-related databases (GAD, NextBio, Malacards); Dark Matter ORF database (800); Expression analysis at mRNA level (Unigene, NextBio, HPA); Expression analysis at protein level (HPA, HPRD, MOPED, Human Proteome Map, Proteomics DB); mRNA correlation analysis (87 show correlation); Proteomics tools used to predict function/nature (InterProScan, DAVID V6.7, SignalP, MESSA, I-Tasser, NCBI CDD); Proteomics tools used to find pathways genes might be involved in (STRING, KEGG, IntAct); GWAS Cancer Genome Analysis (cBio, COSMIC, ICGC, eQTL, PheGenI, HapMap, ClinVar); Putative Druggable/Diagnostic Targets (Ex. C1orf87 - CREF).

Figure 2. Bioinformatics and Proteomic Approach.

Expression Analysis

The database of ORFs was analyzed for expression at the mRNA and protein levels in the fetal and adult tissues, and in body fluids, using the transcriptome and proteome expression analysis tools. A correlation in expression at both the mRNA and protein levels for the ORFs was also established using the transcriptome and proteome expression datasets.

The mRNA Expression

The mRNA expression analysis of the entire database was conducted using the Expressed Sequence Tag (EST) Unigene tool (126), and microarray and Next Generation Sequencing (NGS) tools such as the NextBio transcriptome meta-analysis tool (120) and the HPA (31, 32) RNASeq dataset. The Unigene tool (31, 32) was used to collect EST-restricted expression information for the ORFs. Most Correlated Tissues-based microarray mRNA data were gathered from the NextBio (120) and the HPA (31, 32) (NGS) databases.

The Protein Expression

Using proteomics tools such as the HPA (31, 32), HPRD (117), MOPED (56, 116), the Human Proteome Map, and the Proteomics DB (58), protein expression analysis of the uncharacterized proteins was performed. Batch analysis of multiple ORFs was conducted to extract expression data using the MOPED (56, 116), HPA (31, 32), HPRD (117), and Proteomics DB (58) protein expression databases.

Proteins Class

To predict possible motifs and domains within the uncharacterized ORF protein sequences, diverse proteomic tools were used and a batch analysis was performed using the DAVID V6.7 functional annotation tool (37). To test the accuracy of detection by these tools, several positive controls (known genes) were tested. Using NCBI’s CDD (112), InterProScan (36), ProDom (113), Pfam (73) and HMMER (34), the predicted motifs were detected for each protein analyzed.

Positive and negative controls were used to test the accuracy of motif/domain detection by the tools (Table 2). The positive controls consisted of 1) a transcription factor and tumor suppressor gene (TP53), 2) a signal peptide-containing secreted protein (IL-7) and 3) an oncogene (KRAS). The negative control consisted of a deleted version of the TP53 sequence lacking the transactivation motif. Two other negative controls were also tested in the same assay: the IL-7 protein missing the signal peptide and a mutant version of KRAS without the GTPase motif (data not shown). The tools detected the motifs present in all three positive controls, but did not detect the motifs and signal peptide intentionally deleted in the mutant sequences (negative controls), thus verifying the reliability of the motif/domain detection tools.
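The logic of the control test described above can be expressed as a small check. This is a sketch under stated assumptions: `check_controls` is a hypothetical helper, and `MOCK_TOOL_OUTPUT` is an invented stand-in for the output of a motif tool, not actual results from CDD, InterProScan, Pfam or HMMER.

```python
def check_controls(detect, positives, negatives):
    """detect(seq_id) returns the set of motif names a tool reports.
    positives: motif that must be found; negatives: deleted motif that must be absent."""
    failures = []
    for seq_id, motif in positives.items():
        if motif not in detect(seq_id):
            failures.append((seq_id, motif, "missed"))
    for seq_id, motif in negatives.items():
        if motif in detect(seq_id):
            failures.append((seq_id, motif, "false positive"))
    return failures


# Invented tool output for wild type (wt) and mutant (mu) control sequences.
MOCK_TOOL_OUTPUT = {
    "TP53_wt": {"transactivation motif", "DNA-binding domain", "tetramerisation motif"},
    "TP53_mu": {"DNA-binding domain", "tetramerisation motif"},
    "IL7_wt": {"Interleukin 7/9 family", "signal peptide"},
    "IL7_mu": {"Interleukin 7/9 family"},
    "KRAS_wt": {"Ras family", "GTPase motif"},
    "KRAS_mu": {"Ras family"},
}


def mock_detect(seq_id):
    return MOCK_TOOL_OUTPUT.get(seq_id, set())


POSITIVE = {"TP53_wt": "transactivation motif", "IL7_wt": "signal peptide", "KRAS_wt": "GTPase motif"}
NEGATIVE = {"TP53_mu": "transactivation motif", "IL7_mu": "signal peptide", "KRAS_mu": "GTPase motif"}

failures = check_controls(mock_detect, POSITIVE, NEGATIVE)

if __name__ == "__main__":
    print("all controls behaved as expected" if not failures else failures)
```

An empty failure list corresponds to the outcome reported in the text: every expected motif was found in the positive controls and none of the deleted motifs was reported in the negative controls.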

Protein Name | Minimum expected domain/motif detected | CDD | InterProScan | Pfam | HMMER

P53 DNA- binding domain, TAD, tetrameris- P53 P53 P53 P53 ation tetramerisation tetramerisation transactivation tetramerisa TP53 motif, motif, P53 DNA- motif, P53 domain, DNA- -tion motif, (wt) DEC-1 binding domain, DNA-binding binding,tetram- DNA- protein, N- TAD domain, TAD erisation binding terminal domain region, P53 transactiv ation motif Interleukin 7/9 Interleukin 7/9 Interleukin 7/9 Interleukin Interleukin IL7 family, signal family, signal family 7/9 family 7/9 family peptide peptide Ras GTPase family containing H- Ras, N-Ras Small GTPase and K- superfamily, P- Small GTPase Ras4A/4B; H- loop containing KRAS superfamily, Ras family Ras family Ras/N-Ras/ K- nucleoside Ras family Ras subfamily. thriphosphate H-Ras, N-Ras, hydrolase and K-R, Ras- like protein; provisional P53 DNA- binding domain, P53 P53 P53 P53 DNA- tetrameris- tetramerisa tetramerisation tetramerisation binding TP53 ation -tion motif, motif, P53 motif, P53 domain, (mu) motif, DNA- DNA-binding DNA-binding tetramerisation DEC-1 binding domain domain motif protein, N- domain terminal region

Table 2. Verification of Motif/domain Tools with Controls. Common motifs (bolded) were detected for each protein using multiple proteomic tools. mu: mutant; wt: wild type.

Characterization of Protein Motifs

Table 3 shows that when multiple motif/domain tools were used to analyze five test case ORF proteins, a high degree of agreement was seen. For example, motifs such as cAMP kinase, S1-like superfamily, and HEAT repeats, as well as domains of unknown function (bolded), were identified by more than one tool.

Protein CDD InterProScan Pfam HMMER cAMP- dependent protein ADK (Adenylate C8orf34 Negative kinase, Negative Kinase) regulatory subunit, type I/II alpha/beta Cation channel Cation channel sperm- sperm-associated C1orf101 Negative Negative associated protein subunit protein subunit delta delta Nucleic acid- binding S1-Like CXorf48 proteins S1-like AAA, S-1 like Superfamily superfamily OB-fold HEAT repeat, Armdillo-type Domain of C17orf66 Heat_2 fold ARM HEAT repeats unknown function repeat (DUF3385) Protein of unknown (DUF4527), AAA Protein of domain, Syntaxin, unknown C4orf50 Negative Negative Outer membrane function protein (OmpH- (DUF4527) like), MutS family domain IV

Table 3. Motif Detection Analysis of the Hit Proteins Under Study. Six different motif tools were used to detect domains and motifs with five hit proteins. Common motifs from multiple tools are bolded.

Structure, Interaction and Pathway Identification

The MESSA (110) and I-Tasser 3D (111) modeling tools were used to study the nature of the query proteins by building secondary and tertiary structures of the query proteins. Both tools use protein 3D templates to generate the structural models. Identification of template homologues used to generate the predicted model provided a hint of putative function of the novel ORF proteins.

Interaction and pathway tools such as STRING (38), KEGG (130) and IntAct (131) were used to determine the possible interactions and pathways implicating the uncharacterized ORF proteins to further verify or predict a putative function. Additional results were obtained by performing batch analysis using the DAVID V6.7 annotation tool (37) and the canSAR (62) platform to identify potential interactions and pathways. Results from these multiple tools were merged to create a master database and common features were extracted.

GWAS and Genome to Phenome Analysis

GWAS were performed to analyze the different types of alterations present in the Dark Matter Database. Copy number alteration and mutational datasets from the cBioPortal (64), COSMIC (63), and the ICGC (104) were batch analyzed, downloaded and mined for the uncharacterized ORF proteins. Relationships among human polymorphic variations in the novel genes were analyzed using the NCBI PheGenI (60) and eQTL (79) tools. Additional GWAS data and clinical relevance were obtained from the HapMap (76) and the NCBI ClinVar (65) databases.

The eQTL tool (79) was used to identify genetic markers such as SNPs that are associated with a quantitative mRNA gene expression trait in the uncharacterized proteins. The parameters included using , , near gene and Untranslated Region (UTR) to select genotypes; only statistically significant data were collected.

Putative Diagnostic/ Druggable Targets

From the working database of 800 disease-related uncharacterized proteins, one ORF was chosen based on the expression selectivity and a putative hint of function (calcium binding). This gene was investigated in detail to develop a proof of concept for the gene discovery approach.

Critical issues and potential alternatives are described in the Critical Issues section, see page 81.


CHAPTER 4: RESULTS

Characterization of the Carcinoma Related EF-Hand Protein

From the list of 800 genes of the Dark Matter ORF database, one protein was chosen for verification of the gene discovery approach based on preliminary data. C1orf87 showed a) a restricted expression at the mRNA and protein levels in select normal tissues, b) a differential expression in the lung, breast and liver tumors, c) a putative calcium binding motif, and d) a possible druggable target nature since its EF-Hand containing calcium binding motif is present in many anti-cancer drugs (132-134). In view of this association with carcinomas, C1orf87 was renamed the Carcinoma Related EF-Hand (CREF) protein. The name CREF has been adopted by the HGNC as one of the aliases for the C1orf87 protein.

Expression Profiling of the CREF Protein

Initially, using the HPA tool (31, 32), the CREF protein expression was analyzed in tumor and normal tissues. As shown in Figure 3A, the protein expression was downregulated in the breast and lung carcinomas and upregulated in liver carcinoma as analyzed by immunohistochemistry (IHC). The antibody (HPA031368), a rabbit polyclonal peptide antibody, detected a predicted

62kDa protein on the .

Immunohistochemical staining of human bronchus shows strong cytoplasmic and membranous positivity in respiratory epithelial cells (HPA). In panel B, the IHC data for the normal and liver cancer specimens are shown. A strong granular cytoplasmic staining is seen in the cancer tissue in comparison to the normal liver tissue. Further, moderate staining was observed in several cases of endometrial, prostate, thyroid and renal cancers. Remaining normal and malignant tissues were mainly negative (HPA).

Overall, about 20% of tumors analyzed for protein expression were positive by IHC. The CREF protein expression was also detected in normal lung, blood, plasma and heart by another protein expression analysis tool, HPRD (117). Further, protein expression was also detected in fetal gut, adult heart and B cells by the HPM tool (57). The CREF protein expression was also inferred from the MOPED tool (56, 116), where the expression was seen in the olfactory epithelium. According to AceView (103), this gene is moderately expressed, at only 35.1% of the average gene in this release. The sequence of this gene is defined by 41 GenBank (23) accessions from 38 cDNA clones, some from testis (seen 4 times), brain (3), lung (3), caudate nucleus (2), medulla (2), multiple sclerosis lesions (2), normal nasopharynx (2) and 11 other tissues. Consistent with the downregulation of CREF protein expression in the breast and the lung carcinomas (Figure 3), the Oncomine (106) microarray-based mRNA profiling also showed downregulation of CREF expression in lung adenocarcinoma and invasive breast carcinoma (panels C & D).


Figure 3. Expression Profiling of CREF. The summary of cancer tissue protein expression data is shown in panel A (HPA). Red, yellow, light yellow and white indicate strong, moderate, weak and no expression, respectively. Panel B shows IHC data for normal liver and hepatocellular carcinoma (ab-HPA031368). The mRNA expression (Oncomine microarray) for lung adenocarcinoma (Okayama Lung, n= 246, P-value: 2.71E-4, fold change: -2.834 and gene rank: top 1%) and invasive ductal breast carcinoma (TCGA Breast 2, n= 450, P-value: 4.1490E-6, fold change: -1.451 and gene rank: top 1%) models are shown in panels C and D respectively (129).

The downregulation of CREF was also seen in large cell lung carcinoma (N= normal vs. tumor: 65 vs. 19) and squamous cell lung carcinoma (N= normal vs. tumor: 65 vs. 27). The downregulation was also seen in different subtypes of breast cancer: ductal breast carcinoma in situ (N= normal vs. tumor: 4 vs. 2), invasive ductal breast carcinoma (N= normal vs. tumor: 4 vs. 14) as seen in Figure 3, and invasive lobular breast carcinoma (N= normal vs. tumor: 4 vs. 4). At the present time, mining of the Oncomine (106) dataset did not reveal mRNA upregulation of CREF in liver carcinomas.

Figure 4. Oncomine Expression Profiling of CREF in Lung and Breast Tumors. The mRNA expression (Oncomine microarray) for large cell lung carcinoma (Hou Lung, n= 84, P-value: 3.38E-8, fold change: -1.827 and gene rank: top 1%) and squamous cell lung carcinoma (Hou Lung, n= 92, P-value: 1.54E-5, fold change: -1.616 and gene rank: top 1%) are shown in panels A and B respectively. In addition, breast carcinoma in situ (Radvanyi Breast, n= 6, P-value: 0.041, fold change: -1.764 and gene rank: top 1%) and invasive ductal breast carcinoma (Radvanyi Breast, n= 18, P-value: 0.195, fold change: -1.886 and gene rank: top 1%) models are shown in panels C and D respectively.

Unigene (126) based analysis of normal tissue expression showed that the mRNA expression for CREF is restricted to brain, connective tissue, lung, pharynx, testis and uterus. EST expression profiling from Unigene (126) verified the downregulation of CREF in lung tumors (8 transcripts per million, TPM, versus zero in normal), consistent with the protein and microarray-based data shown in Figure 4. Additional evidence for CREF mRNA expression was obtained from the NextBio meta-analysis tool (Illumina/Qiagen). Restricted expression of CREF mRNA was seen in fallopian tube, testes, peritoneum, bronchial and tracheal tissues (see Figure 5). The mRNA expression data in these tissues, with the exception of peritoneal tissue expression, correlate with the protein expression data from the HPA (31, 32). CREF showed the highest level of expression in epithelial cells of bronchial small airways and trachea, as opposed to other organs and cell types, as seen in Figure 5.


Figure 5. mRNA Expression Profiling of C1orf87/CREF. The summary of cancer tissue mRNA expression data is shown in panel A (NextBio). Panel B shows a comparison of mRNA expression (NextBio microarray) in different studies of lung adenocarcinoma (Lung adenocarcinoma stage I, II, and III and healthy lung tissue). [Lung adenocarcinoma vs. normal tissue adjacent to tumor, P-value: 1.5E-10, fold change: -2.04 and imported ID: ILMN_2097943], [Lung adenocarcinomas stage IB vs. healthy control, P-value: 4.1E-9, fold change: -2.01 and imported ID: 236710_at], [Lung adenocarcinomas stage IIA vs. healthy control, P-value: 7.7E-6, fold change: -2.01 and imported ID: 236710_at], [Lung adenocarcinomas stage IA vs. healthy control, P-value: 1.2E-9, fold change: -2.1 and imported ID: 236710_at].

GWAS of the CREF Gene

The altered expression of the CREF gene in the lung, breast and liver tumors raises a possibility that it may be a biomarker for these tumor types. In order to develop a further understanding of the CREF gene in the etiology of cancers, it is necessary to understand the nature of mutations and other perturbations at this gene locus. Hence, a comprehensive bioinformatics analysis of the CREF gene perturbations was undertaken using the GWAS tools.

Numerous GWAS datamining tools are available to mine the cancer genome from samples around the world. The key tools include canSAR (62), COSMIC (63), the cBioPortal (64), the ICGC (104), the Single Nucleotide Polymorphism database, SNP (135) and TCGA (67).

Preliminary experiments with SNP analysis indicated 1,694 SNPs at the CREF gene (GeneCards). These SNPs included deletions, insertions and complex mutations. These SNPs are largely located at the 3’ UTR, , and at the 5’ UTR encompassing synonymous or missense mutations. Five different natural variants have been detected (at aa positions 151, 185, 301, 403, and 406). The aa position 151 variant results in a glutamine to glutamic acid substitution (p.Q151E) in breast cancer (136). In addition, the NCBI PheGenI (60) detected related missense SNPs (rs626251, p403, K-G; rs17120025, p301, N-D; rs7528245, p88, H-Y; rs1382602, p45/235, Q-L).

To expand these results further and identify relevant mutations, the COSMIC database (63) was mined to identify verified mutations and deletions for the CREF gene (see Figure 6). Currently, in the COSMIC database (63), twenty-eight mutations were found, which were largely missense and nonsense mutations. These mutations were present in the breast, lung and liver tumors.

Figure 6. Mutational Types in the C1orf87/ CREF Gene. The COSMIC database was mined to identify different mutations for all the tumor types using tissue sample datasets (137).

Additional evidence of the CREF gene perturbations was found using the cBioPortal tool (64). This tool allows the identification of not only mutations and deletions, but also possible gene amplification status of a query gene. The result output can be selected for specific tumor datasets. In view of the accumulated evidence for the relevance of the CREF gene in three tumor types (breast, lung and liver), the datasets for these tumors were chosen (see Figure 7). Consistent with the results from the COSMIC database (63), these results also predominantly identified mutations in these tumor types. In addition, both CREF gene deletions and amplifications, as measured by copy number variation (CNV) studies, were seen in these three tumors.

[Bar chart: alteration frequency (0% to 80%) by cancer type. Datasets shown: Liver (TCGA), Breast (TCGA), Breast (TCGA), Breast (BCCRC), Breast (BROAD), Breast (SANGER), Lung squ. (TCGA), Lung squ. (TCGA), Lung adeno (TSP), Lung adeno (TCGA), Lung adeno (TCGA), Lung adeno (BROAD). Legend: Mutation, Deletion, Amplification, Multiple alterations.]

Figure 7. CREF Gene Perturbations Study. The cBioPortal dataset encompassing the breast, lung and liver tumor-based GWAS was mined to identify mutations, amplifications and deletions at the CREF gene (137).

Preliminary analysis of the cBioPortal (64) dataset indicated that the CREF gene mutation frequency ranges from 1-3% in the tumors analyzed. In order to establish the relevance of these mutations in tumors, known genes were chosen as positive controls. These genes included the p53 tumor suppressor gene (TP53), an oncogene (KRAS), a receptor amplified in numerous tumors (EGFR), a well-established breast cancer biomarker (BRCA2) and ERBB2/HER2Neu. As can be seen below (Figure 8), the CREF gene mutations and amplifications were present in the same tumor tissues along with these positive controls. As predicted, the TP53, KRAS and EGFR gene perturbations were seen in a large majority of tumor tissues. The BRCA2 and ERBB2/Her2Neu mutations were restricted to a subset of breast tumors. These results helped establish the tumor relevance of the CREF gene.

Figure 8. Comparative C1orf87/CREF Gene Changes with Positive Controls for Lung Adenocarcinoma. The cBioPortal GWAS dataset was mined for mutations for the CREF, TP53, KRAS, EGFR, BRCA2 and ERBB2/Her2Neu genes simultaneously in the breast, lung and liver tumors (137).

The occurrence of CREF gene mutations in a subset of tumors raises a possibility that it may offer a response to therapy potential. Hence, these tumor data were analyzed for patient survival outcome. Among various tumor types, the lung tumor dataset provided a preliminary proof of concept (138).

The 50% survival time of lung cancer patients with a CREF mutation was ten months, versus approximately 45 months for patients without the mutation (see Figure 9). These results were significant (Logrank p-value, 0.05).
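The 50% survival time is read from a Kaplan-Meier curve as the first time point at which the estimated survival probability falls to 0.5 or below. The following minimal Python sketch shows how such an estimate is computed. It is not the cBioPortal implementation, the function names are hypothetical, and the follow-up times in the example are invented toy values unrelated to the patient data.

```python
def kaplan_meier(times, events):
    """times: follow-up in months; events: 1 = death observed, 0 = censored.
    Returns (time, survival probability) pairs at each observed death time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which estimated survival is 0.5 or lower (None if never reached)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


if __name__ == "__main__":
    # Toy follow-up data (hypothetical): four deaths and one censored subject.
    toy_curve = kaplan_meier([3, 6, 9, 14, 21], [1, 1, 1, 0, 1])
    print(toy_curve)
    print(median_survival(toy_curve))
```

Censored subjects leave the risk set without lowering the survival estimate, which is why the estimator is preferred over a simple proportion of deaths.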


Figure 9. CREF Gene as a Response to Therapy Predictor in Lung Carcinoma. The lung adenocarcinoma (TCGA Provisional) GWAS data were analyzed for patient survival using a Kaplan-Meier plot. Logrank p-value=0.006 (significant) (137).

The knowledge of mutations at the CREF gene might offer a significant hint of involvement in distinct cancers. This is becoming increasingly possible with the worldwide Cancer Genome Analysis. For example, the ICGC (104) has cancer mutations from over 10,000 cancer patients across the world. This database is linked to the COSMIC (63) as well as the TCGA database (67). Hence a comprehensive study was undertaken to catalogue the CREF mutations across all of these databases. Mutations identified were cross-verified across the datasets. Table 7 lists the current CREF mutations. This database provides a starting point for identifying key mutations relevant to specific cancer types.

Allelic loss at chromosome position 1p32-pter is a frequent event in non-small cell lung cancer, NSCLC (139). CREF is neither significantly focally amplified nor deleted in 14 individual subtypes of tumors analyzed by Mutome DB (105).

The COSMIC database (63) offers a list of genes that are somatically mutated in human cancer. In the lung tumors, 12% of the samples analyzed (58/476 samples) and in the breast tumors 11% (90/782 samples) show point mutations for the CREF gene.

Among various tumor tissues analyzed for GWAS, amplification in one CNS tumor tissue, loss of heterozygosity (LOH) in diverse tumors (134) and mutation in one breast tumor were detected for the CREF gene. No homozygous deletion was identified. LOH was identified in 27% (37/139) of lung tumors and 9% (4/45) of breast tumors. To date, lung tumors have the largest number of LOH events identified.

Comprehensive Mutational Analysis of the CREF Gene

CDS Mutation Type   CDS Mutation Syntax   AA Mutation Type   AA Mutation Syntax   Primary Sites
Substitution        c.451C>G              Missense           p.Q151E              breast
Substitution        c.163A>G              Missense           p.T55A               breast
Substitution        c.172G>A              Missense           p.A58T               endometrium
Substitution        c.1622G>A             Missense           p.R541Q              endometrium
Substitution        c.1589C>T             Missense           p.S530L              large intestine
Substitution        c.1603C>A             Missense           p.L535I              large intestine
Substitution        c.1210G>T             Missense           p.A404S              large intestine
Substitution        c.613C>A              Missense           p.L205M              lung
Substitution        c.1622G>T             Missense           p.R541L              lung
Substitution        c.991G>A              Missense           p.E331K              lung
Substitution        c.838G>A              Missense           p.E280K              lung
Substitution        c.25C>A               Missense           p.R9S                lung
Substitution        c.385G>T              Nonsense           p.G129*              lung
Substitution        c.1494G>T             Missense           p.K498N              lung
Substitution        c.1405G>A             Missense           p.E469K              lung
Substitution        c.8C>G                Nonsense           p.S3*                lung
Substitution        c.1258A>T             Nonsense           p.K420*              lung
Substitution        c.1546C>A             Missense           p.L516M              lung
Substitution        c.545G>C              Missense           p.R182T              lung
Substitution        c.1157G>T             Missense           p.R386I              pancreas
Substitution        c.40A>G               Missense           p.M14V               skin
Substitution        c.479C>T              Missense           p.P160L              skin
Substitution        c.632G>A              Missense           p.G211E              ovary
Substitution        c.1173G>T             Missense           p.L391F              ovarian
Substitution        c.1622G>A             Missense           p.R541Q              uterus
Substitution        c.1508G>A             Missense           p.R503H              uterus
Substitution        c.637C>A              Missense           p.L213I              uterus
Substitution        c.997C>T              Missense           p.R333C              uterus
Substitution        c.676G>T              Nonsense           p.E226*              uterus

Table 4. Detailed Mutational Analysis of the CREF Gene. The GWAS datasets from the COSMIC (63), cBioPortal (64), CanSar (61) and ICGC (104) were compared and common mutations at the CREF gene were identified.
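As an illustration of how entries of this kind can be processed, the sketch below parses the amino acid change syntax and tallies mutations by consequence and by residue position. The records are copied from a few rows of the table above. The helper names `parse_aa_change` and `tally` are hypothetical and do not correspond to any tool used in this study.

```python
import re
from collections import Counter

# (CDS syntax, AA syntax, primary site), copied from selected rows of the table.
RECORDS = [
    ("c.451C>G", "p.Q151E", "breast"),
    ("c.1622G>A", "p.R541Q", "endometrium"),
    ("c.1622G>T", "p.R541L", "lung"),
    ("c.385G>T", "p.G129*", "lung"),
    ("c.8C>G", "p.S3*", "lung"),
    ("c.1622G>A", "p.R541Q", "uterus"),
    ("c.676G>T", "p.E226*", "uterus"),
]

# One-letter reference residue, position, then alternate residue or '*' (stop).
AA_PATTERN = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")


def parse_aa_change(aa):
    """Split e.g. 'p.R541Q' into ('R', 541, 'Q', 'missense')."""
    match = AA_PATTERN.match(aa)
    if match is None:
        raise ValueError(f"unrecognized amino acid change: {aa}")
    ref, pos, alt = match.groups()
    consequence = "nonsense" if alt == "*" else "missense"
    return ref, int(pos), alt, consequence


def tally(records):
    """Count mutations per consequence and per residue position."""
    by_consequence = Counter()
    by_position = Counter()
    for _cds, aa, _site in records:
        _ref, pos, _alt, consequence = parse_aa_change(aa)
        by_consequence[consequence] += 1
        by_position[pos] += 1
    return by_consequence, by_position


by_consequence, by_position = tally(RECORDS)
# Residue positions hit more than once across the selected records.
recurrent = sorted(pos for pos, n in by_position.items() if n > 1)
print(dict(by_consequence), recurrent)
```

A tally of this kind is one simple way to flag residues that recur across tumor types as candidates for closer inspection.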

Molecular Characterization of the CREF Gene

To further understand the CREF gene, a detailed molecular characterization was performed using different bioinformatics tools (see Table 5).

The CREF gene is located on chromosome 1p32.1, is spread over 12 exons and has an mRNA size of 2126 bp (Figure 10). The CREF gene exists as at least five different isoforms, with the canonical sequence being NP_689590.1 (UniProtKB). Three complete alternative mRNAs (Isoform 1, 546 aa; Isoform 2, 180 aa; and Isoform 3, 138 aa) and two partial sequences are shown in Figure 10 (AceView). This gene is highly conserved across species, with seven homologs identified in Homo sapiens, Pan troglodytes, Macaca mulatta, Canis lupus familiaris, Bos taurus, Mus musculus and Rattus norvegicus (HomoloGene). For a complete characterization summary of CREF see Appendix B - CREF Characterization.

Initial gene ontology analysis indicates a putative calcium ion binding class with an EF binding motif. Further, transcription factor binding sites including GATA-2, STAT-3, MafF, CTCF and Rad21 were identified 5’ of the initiation codon by the UCSC Genome Browser (77). Also, transcription factor binding sites such as POU3F, FOXC1, FOXJ2, LHX3, VSX1 and SP1 are present in the promoter region of the CREF gene according to NextBio (120). Additional details of the molecular characterization of the gene are shown in Table 3. Based on prediction analysis by I-Tasser (111), the top hit for the Gene Ontology prediction of the CREF protein was a protein transporter (function, GO-score 0.78) with a cytosolic location (cellular location, GO-score 0.71). CscoreGO values range from 0 to 1.

Figure 10. Alternative mRNAs of the CREF Gene. The isoforms are shown aligned from 5' to 3' on a virtual genome. Exon size is proportional to length, intron height reflects the number of cDNAs supporting each intron, and the small numbers show the support of the introns in deep sequencing. Introns of the same color are identical. Those of different colors are different. 'Good proteins' are pink, partial or not-good proteins are yellow, uORFs are green. 5' cap or 3' poly A flags show completeness of the transcript. Alignments a, b, and d are considered complete sequences. Alignments c and e are considered partial sequences.

Regulation of CREF Gene Expression

Understanding the regulation of CREF gene expression is essential to understanding the behavior of the gene. Diverse tools such as NextBio (120), the UCSC Genome Browser (77) and the Comparative Toxicogenomics Database (140) were mined to gather data on the effect of perturbed genes on CREF, in addition to pharmacological data and regulation of methylation of this gene. MicroRNA data were also analyzed with tools such as miR2Disease (141) and miRCancer (47) to identify potential regulators of the CREF gene, in addition to the criteria mentioned above.

In an effort to develop an understanding of the regulation of CREF gene expression, the Comparative Toxicogenomics Database (CTD) (140) was mined. The microarray-based expression profiling from the CTD database (140) identified two compounds that appear to have an effect on the regulation of gene expression. Both arbutin (tyrosinase inhibitor) and kojic acid (chelation agent) cause downregulation of CREF mRNA expression in A375 human malignant melanoma cells (142).

Table 5 presents data obtained from NextBio (120) describing the effect of regulation of CREF by different pharmacological agents in different studies. Key agents were identified using the NextBio meta-analysis tool that may provide information on the regulation of the CREF gene expression. The CREF gene expression was found to be upregulated by Tamoxifen (angiogenesis inhibitor), Paclitaxel (mitotic inhibitor), Azacitidine (methyltransferase inhibitor), Decitabine (methyltransferase inhibitor), Isoascorbic acid (antioxidant), and Trichostatin A (HDAC inhibitor). The downregulation of the CREF gene expression was seen upon treatment with Gefitinib, Genistein (protein-tyrosine kinase inhibitors), and 2-Methoxyestradiol (angiogenesis inhibitor).

Perturbation         Nature/Significance                          Effect on CREF
Tamoxifen            Angiogenesis inhibitor, estrogen inhibitor   Upregulation
Paclitaxel           Hyper-stabilization of microtubules          Upregulation
Azacitidine          Inhibition of DNA methyltransferase          Upregulation
Decitabine           Inhibition of DNA methyltransferase          Upregulation
Isoascorbic acid     Antioxidant                                  Upregulation
Trichostatin A       HDAC inhibitor                               Upregulation
Genistein            Protein-tyrosine kinase inhibitor            Downregulation
Gefitinib            Protein-tyrosine kinase inhibitor            Downregulation
2-Methoxyestradiol   Angiogenesis inhibitor                       Downregulation

Table 5: Effect of Pharmacological Agents on CREF by NextBio

The NextBio meta-analysis tool was used to analyze the effect of perturbed genes on CREF mRNA regulation, as seen in Table 6.

Effect of Knockout of Genes on CREF

The knockout of methyltransferase genes such as DNMT1 and DNMT3B causes upregulation of CREF in breast and colon cancer cells and DNMT-KO HCT116 cells (143).

Effect of Knockdown of Genes on CREF

Knockdown of TP53BP2 (p53 interacting protein) causes downregulation of the CREF mRNA in the HCC-LM3 hepatocarcinoma cell line (144). The knockdown of ESR1 (estrogen receptor) causes an upregulation of CREF mRNA in MDA-MB-435 breast cancer cells (145).

Effect of Overexpression of Genes on CREF

The overexpression of certain genes also appears to have a direct effect on CREF expression. The overexpression of POU2F1 (transcription factor) caused the upregulation of CREF mRNA in pluripotent stem cells (146). In addition, the overexpression of Nuclear Receptor Subfamily 4, Group A, Member 2 (NR4A2) appears to upregulate CREF in Nurr1-overexpressing SKNAS neuroblastoma clonal cell lines (147). The overexpression of MYC, a phosphoprotein, seems to upregulate CREF mRNA in pluripotent stem cells (146). The overexpression of Ras (GTPase) appears to downregulate CREF mRNA in mammary epithelial cells (148). Further, the overexpression of IDH1 (isocitrate dehydrogenase) seems to cause a downregulation of the CREF mRNA at the methylation level.

Effect of Mutations of Genes on CREF

The mutation of genes can also affect the expression of the CREF gene. Mutations in Bcr (serine/threonine kinase) and Abl1 (tyrosine kinase) induced downregulation of the CREF gene at the DNA copy number level in a mouse model (149). In addition, the mutation of EGFR (tyrosine kinase) induced the downregulation of CREF mRNA in 226 lung adenocarcinomas (127 with EGFR mutation, 20 with KRAS mutation, 11 with EML4-ALK fusion and 68 triple negative cases) (150).

Influence of other genes:

Perturbation          Gene      Nature/Significance         Effect on CREF
Gene Knockout         DNMT1     Regulation of Methylation   Upregulation
Gene Knockout         DNMT3B    Regulation of Methylation   Upregulation
Gene Knockdown        TP53BP2   p53 interacting protein     Downregulation
Gene Knockdown        DNMT3B    Regulation of Methylation   Upregulation
Gene Knockdown        ESR1      Regulation of Methylation   Upregulation
Gene Overexpression   POU2F1    Transcription factor        Upregulation
Gene Overexpression   NR4A2     Transcription factor        Upregulation
Gene Overexpression   MYC       Phosphoprotein              Upregulation
Gene Overexpression   Ras       small GTPase                Downregulation
Gene Overexpression   IDH1      Isocitrate dehydrogenase    Downregulation
Gene Overexpression   Abl1      Tyrosine kinase             Downregulation
Gene Overexpression   Bcr       Serine/threonine kinase     Downregulation
Gene Mutation         EGFR      Tyrosine kinase             Downregulation
Gene Mutation         Abl1      Tyrosine kinase             Downregulation
Gene Mutation         Bcr       Serine/threonine kinase     Downregulation

Table 6. Perturbation of Top Genes and their Effect on CREF by NextBio Analysis. Data measured at RNA, Copy Number Variation (CNV) and Methylation (ME) levels.

Effect of Methylation on CREF Regulation

Regulation of gene expression is also exerted at the level of promoter function by methylation, at the level of non-coding RNA, or at the level of protein destabilization and/or degradation or by posttranslational modification.

The analysis of UCSC Genome Browser (77) for CpG islands and DNA methylation tracks showed the presence of at least six hypermethylation sites.

The selective regulation of CREF gene expression was next investigated for CpG methylation at the promoter by NextBio (120). The CREF gene is hypermethylated in lung and breast carcinomas and hypomethylated in liver carcinomas, consistent with the downregulation of the CREF gene in lung and breast carcinomas and its upregulation in liver carcinomas observed in the same dataset (meta analysis). These results strongly suggest that the CREF gene is regulated at the level of DNA CpG methylation.
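The consistency check described above (promoter hypermethylation paired with downregulation, hypomethylation paired with upregulation) can be written as a simple rule. The sketch below encodes only the qualitative calls reported in this section. The function name `is_concordant` and the string labels are hypothetical conventions chosen for the example and are not part of the NextBio tool.

```python
# Qualitative calls taken from the paragraph above.
OBSERVATIONS = {
    "lung carcinoma": {"methylation": "hyper", "expression": "down"},
    "breast carcinoma": {"methylation": "hyper", "expression": "down"},
    "liver carcinoma": {"methylation": "hypo", "expression": "up"},
}

# Expression direction expected if promoter CpG methylation silences the gene.
EXPECTED = {"hyper": "down", "hypo": "up"}


def is_concordant(call):
    """True when the expression change matches the methylation-based expectation."""
    return EXPECTED.get(call["methylation"]) == call["expression"]


concordant = {tumor: is_concordant(call) for tumor, call in OBSERVATIONS.items()}
print(concordant)
```

A tumor type in which hypermethylation coincided with upregulation would be flagged as discordant and would argue against methylation being the main regulatory mechanism in that tissue.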

MicroRNA Regulation of the CREF Gene

Other factors that could potentially influence gene expression that must be considered are microRNAs (MiR). Diverse microRNA databases were mined to obtain MiRs that may have an impact on the regulation of the CREF gene, and possibly infer why the CREF mRNA appears to be downregulated in the tumors of interest (breast and lung).

Several databases such as Genecards (78), NextBio (120) and TargetScan (151), among others, were used to identify microRNA binding sites that may be responsible for the regulation of CREF mRNA. The list of MiR hits was screened for an inverse correlation in expression in the breast, lung and liver carcinomas. An additional disease relevance filter using the miR2Disease (141) and miRCancer (47) databases was used to obtain two leads that correlate with CREF downregulation in breast and lung tumors, as seen in Figure 11.

Two microRNAs (hsa-miR-221 and hsa-miR-222) appear to be overexpressed in aggressive non-small cell lung cancer and triple-negative primary breast cancer according to miR2Disease and miRCancer. In addition, NextBio results indicate the downregulation of hsa-miR-383 in liver tissue, which could be used to infer the upregulation of CREF in liver tumors.

Figure 11. Approach of Identification of CREF Regulating microRNAs. Several microRNA databases were screened to obtain leads after correlation and disease relevance filters. The asterisks denote downregulation in early stage and upregulation in late stage NSCLC.
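The screening steps described in this section (collect predicted binding microRNAs, keep those whose expression runs opposite to the target, then apply a disease relevance filter) can be sketched as below. This is an illustration under stated assumptions: predictions from the different databases are pooled (the text does not specify whether a union or an intersection was used), and the database names, microRNA identifiers and function name `filter_mirna_leads` are placeholders rather than real results.

```python
def filter_mirna_leads(predictions, mirna_direction, target_direction, disease_annotated):
    """Return candidate regulators of a target mRNA, sorted by name.

    predictions       : dict of database name -> set of miRNA ids predicted to bind the target
    mirna_direction   : dict of miRNA id -> "up" or "down" in the tumor of interest
    target_direction  : "up" or "down" for the target mRNA in the same tumor
    disease_annotated : set of miRNA ids with a curated link to the disease
    """
    # Step 1: pool the binding-site predictions from all databases (assumption).
    candidates = set().union(*predictions.values())
    # Step 2: keep miRNAs whose expression moves opposite to the target.
    opposite = {"up": "down", "down": "up"}[target_direction]
    inverse = {m for m in candidates if mirna_direction.get(m) == opposite}
    # Step 3: keep only miRNAs with disease relevance annotations.
    return sorted(inverse & disease_annotated)


# Placeholder inputs, for illustration only.
example_predictions = {
    "database_1": {"miR-A", "miR-B"},
    "database_2": {"miR-B", "miR-C", "miR-D"},
}
example_direction = {"miR-A": "up", "miR-B": "up", "miR-C": "down", "miR-D": "up"}
example_disease = {"miR-B", "miR-C", "miR-D"}

leads = filter_mirna_leads(example_predictions, example_direction, "down", example_disease)
print(leads)
```

In this toy input the target is downregulated, so only upregulated microRNAs survive the second step, and the disease filter then removes any candidate without a curated disease link.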

Protein Motif and Domains Analysis

Preliminary Gene Ontology results from GeneCards, UniProtKB, GoMiner and I-Tasser indicated that the CREF gene encodes a putative calcium binding protein. Hence a comprehensive motif and domain analysis of the CREF protein was undertaken.

InterProScan from InterPro enabled identification of 1) EF hand domain (aa 180-386, 8.0E-12) and 2) EF hand domain pair (aa 477-529, 3.4E-4) in CREF.

Figure 12. Motif and Domains Analysis of the CREF Protein. InterProScan analysis of the CREF protein showing the EF hand domain.

Structural Characterization

Subsequently, extensive structural analysis was undertaken to develop a further understanding of the nature of the CREF protein. The Meta Server for Sequence Analysis (MESSA) and PredictProtein tools were used for protein structure characterization of CREF. The secondary structure prediction tools showed largely the presence of coiled-coil and alpha helical structures along with beta strands. No transmembrane helix or signal peptide motifs were found. A disordered region lacking stable tertiary structure was predicted using multiple structural tools.

Figure 13 shows structural characterization of CREF at the secondary and tertiary levels. Panel A shows the results of the MESSA meta-analysis tool (110) for secondary structure prediction. In panel B, a 3D molecular model of calmodulin containing EF hands 2 and 3 domains (NCBI Structure) is shown. Panel C shows the tertiary structure prediction using an EF hand containing calmodulin template for the 3D modeling with the PredictProtein meta-analysis tool (SWISS-MODEL). The QMEANscore4, a measure of the reliability of the 3D model prediction, was 0.45. This value is at the 50% confidence level.


Figure 13. The Secondary Structure of the CREF Protein Predicted by the MESSA Meta Analysis Tool. Panel A is a visualization of a 3D molecular model of the CREF protein using Calmodulin containing EF hands 2 and 3 domains (NCBI Structure) by the predict protein tool (Panel B). Panel C shows the 3D structural model query for the CREF template used to build the model. The model residue range was from 318 to 389, based on a heterodimeric template (99.9 Å).

3D Modeling of the CREF Protein

Structural modeling prediction of the CREF protein can be generated by tools such as I-Tasser (111) and Swiss PDB (59). The predictions are created based on templates of other proteins that appear to share structural similarities with the CREF protein. This information is valuable as it can be used to obtain possible 3D models of the protein and relevant information such as GO data.

Templates Used by I-TASSER

The top ten templates used by I-Tasser (111) included various EF-Hand containing proteins (Multidomain EF-Hand protein CBP40, Calmodulin, Calcineurin subunit B type 1, and EF-Hand protein from Danio rerio) and nuclear transporter proteins (transportin-3, transportin-1, and Dynamin-1-like protein) (Normalized Z-score >1) (152).

Figure 14. 3D Modeling of the CREF Protein Using the I-Tasser Tool Based on Protein Templates. C-score: -0.80. (Typical range is -5 to 2). Estimated accuracy of Model1: 0.61±0.14 (TM-score).

Proteins with Highly Similar Structure in PDB

In the Swiss PDB (59), ten proteins were identified as structural analogs of the CREF protein by the I-Tasser (111) program. These proteins included diverse nuclear transporter proteins (see Table 7). The above results with the I-Tasser (111) and Swiss PDB (59) models identified protein transporters and numerous calcium binding EF-Hand containing proteins. This raises the possibility that the CREF gene may belong to a transporter family (see discussion).

PDB Hit                                 Classification                          TM-Score
Transportin-3                           Transport Protein/RNA Binding Protein   0.943
Importin-3                              Nuclear Transport                       0.824
Snurportin-1                            Protein Transport                       0.803
GTP-binding nuclear protein GSP1/CNR1   Protein Transport                       0.794
Importin-13                             Transport Protein                       0.793
Exportin-T                              RNA Binding Protein                     0.773
Exportin 1                              Transport Protein                       0.729
GTP-binding nuclear protein             Transport Protein                       0.709
GTP-binding nuclear protein Ran         Nuclear Transport                       0.696
Transportin-1                           Nuclear Transport Protein Complex       0.689

Table 7. List of PDB Hits of Structural Analogs of the CREF Protein. TM-score is a measure of global structural similarity between query and template protein and it is measured at the [0-1] range.

Post Translation Modification

The nature of the post-translational modification sites was next investigated using diverse proteomic tools. CREF contains 273 Serine kinase/phosphatase motifs and 36 Serine binding motifs. This protein also contains 10 Tyrosine kinase/phosphatase motifs and 10 Tyrosine binding motifs according to the PhosphoMotif Finder tool from the HPRD database (117). N-glycosylation sites were not detected in CREF (NetNGlyc). However, a mucin type GalNAc O-glycosylation site was detected using NetOGlyc. The CREF protein is not a substrate for farnesylation according to PrePS. The CREF protein does not have N-terminus myristoylation according to the Myristoylator tool from Expasy (59).

Interactome Mapping for the CREF Protein

Interaction and coexpression studies were performed by mining the STRING bioinformatics tool (38), as seen in Figure 15. Three genes that seem to interact with the CREF protein were further analyzed. The WD Repeat-Containing Protein (WDR78) was present in both the Oncomine (106) and the STRING interactome datasets (38) and was found to be co-expressed and to interact with CREF. The Coiled-coil domain containing 33 (CCDC33) protein showed a coexpression correlation value of 0.898 with CREF (acceptable range 0-1) according to Oncomine (106) and was implicated in lung tumors. In addition, other CREF interacting proteins were predicted using the interactome tools BioGrid (153) and iRefIndex (154). These tools identified amyloid beta (A4) precursor protein (APP) as an interacting partner for the CREF protein. The HPA showed the APP gene to be associated with both breast and lung tumors (155). Interestingly, one of the CREF-specific miRs, hsa-miR-222, was also found to be a regulator of the APP gene (156).


Figure 15. List of Genes Interacting with C1orf87/CREF. STRING interactome analysis is shown. The score indicating the level of confidence in the relationship between the query gene and the interacting gene ranges from 0 to 1.

Involvement of CREF in other Diseases

In addition to cancer, the CREF gene is linked with numerous other diseases including cardiovascular, immune, metabolic, neurological and psychiatric disorders. The GAD database (100) showed phenotypic association of the CREF gene-related SNPs with coronary artery disease, stroke, bipolar disorder and rheumatoid arthritis (52). The NextBio transcriptome meta analysis tool (120) identified the CREF gene expression changes in diseases such as chronic sinusitis, gout, hyperuricemia, interstitial lung disease and mood disorder.

The MalaCards disease encyclopedia showed evidence of disease relationship with bipolar disorder, cancer (breast and colon) and rheumatoid arthritis (121).

Further Supporting Evidence for the Gene Discovery Approach

Establishment of the Dark Matter Landscape

In the search for novel therapies, the Dark Matter of the human genome shows promise in offering different classes of druggable targets and biomarkers. These targets could be beneficial not only in cancer but also in distinct therapeutic areas such as aging, cardiovascular, chemodependency, immune, infection, inflammation, metabolic, neurological, pharmacogenomic, psychiatric, and vision. Following this reasoning, we designed a systematic bioinformatics-based strategy to decode these uncharacterized ORFs to aid in target discovery.

By mining disease related databases such as the GAD (100), NextBio (120) and the MalaCards (121) disease encyclopedia, we have created a database of 800 disease-oriented novel genes. The novel proteins of this Dark Matter Atlas went through a thorough omics characterization that included information on baseline mRNA and protein expression, protein motifs and domains, somatic mutations, possible pathways/interactions, disease traits and clinical relevance. The batch analysis of all proteins was performed using tools such as GeneALaCart (78), the DAVID V6.7 functional annotation tool (125), the canSar integrated database (62), and the protein analysis tools MOPED (56), (116), HPRD (117) and HPA (31, 32). This streamlined approach has proven to be effective in mining the uncharted territory of Dark Matter to identify putative lead targets.

OncoORF as a Disease Specific Sub-database

A database of 62 uncharacterized proteins with genetic association evidence among diverse types of solid tumors and hematological malignancies was established. The approach presented in this thesis involving gene expression, protein signature identification, and GWAS was applied to the group of OncoORFs.

All the ORFs went through mRNA and protein expression analyses using tools such as Unigene, HPA (31, 32), MOPED (56), (116), and HPRD (117) to determine specificity. The mRNA expression profiling identified tissue-enriched levels for adrenal gland (C7orf16); brain (C2orf80, C2orf85, C7orf16); connective tissues (C20orf61); liver (C1orf38); placenta (CXorf67); skin (C6orf15); and testis (C1orf94, C20orf61, C20orf79, C7orf16, CXorf61, CXorf66, CXorf67). Moreover, protein expression in body fluids was detected for C1orf94 (blood plasma) and C3orf10 (semen), suggesting a possible secreted nature.

The ORFs then went through motifs and domain detection analyses and as a result were categorized into putative protein classes; this classification was used to prioritize genes for follow-up studies. The results showed that 50/62 proteins belonged to distinct classes such as antigen, binding proteins, carriers, enzymes, pseudogenes, secreted, sorting, transporter, ncRNAs, vacuolar and Zinc finger containing genes.

Cancer-association traits of the OncoORFs were noted in thirteen of the novel proteins. Strong eQTL association evidence was detected in breast, colorectal, esophageal, gall bladder, head and neck, liver, ovary, prostate, renal and non-small-cell lung carcinomas, neuroblastoma and myeloid leukemia. This information was useful when identifying statistically significant SNPs across diverse populations.

The mutational analyses performed with tools such as COSMIC (63), ICGC (104), DGV (66) and cBioPortal (64) showed somatic mutations present for the novel genes across solid tumors and hematological malignancies. Diverse mutations (nonsense, missense, deletions, insertions, frameshifts, in-frame, homozygous and heterozygous) were detected in fifty-six OncoORFs. The largest class of mutation observed was substitution missense (n=1,813), followed by heterozygous mutations (n=1,080). Furthermore, homozygous mutations (n=105) were present for sixteen of the OncoORFs in malignant melanomas and in endometrial, cervical, breast, renal and stomach neoplasms. This implied a cancer relation across various populations.

The results from this study offer an understanding of the role of the OncoORFs across different types of neoplasms and other diseases. This cancer fingerprint encompasses a broad spectrum of uncharacterized proteins that include enzymes, membrane receptors, transporters, DNA/nucleotide/metal binding proteins and secreted proteins. This cancer signature has the potential to facilitate a strong rational basis for novel cancer therapy (157).

Prediction of a Secreted Biomarker as a Further Proof of Concept for the Approach

The streamlined approach used to establish the cancer fingerprint described above was applied to another cancer related uncharacterized gene. A comprehensive characterization of a second uncharacterized protein (CXorf66) was performed as another proof of concept, and it led to the gene being renamed SGPX (Secreted glycoprotein in chromosome X). The SGPX name has been adopted by the HGNC as one of the aliases for the CXorf66 gene.

The gene expression profiling results for SGPX showed a medium expression level in 7 out of 77 analyzed normal tissues according to the HPA tool (31, 32). The major normal tissues included salivary gland, spleen, lymph node, kidney and tonsil. The Unigene (126) tool indicated a restricted expression of SGPX in the testis. The protein expression results of the ORF correlated with the mRNA results and, in addition, detected levels of expression in seminiferous ducts. Moreover, the protein was detected in body fluids such as blood plasma and serum, indicating a secreted nature. The transcriptome expression of the ORF in tumors was investigated by using the NextBio meta analysis tool (120) and the CGAP (30) short SAGE tag (sTTTCAAGCAA) analysis of the CGAP tissue libraries. Both tools indicated upregulation of SGPX in brain and lung tumors and a downregulation of expression in liver and prostate tumors. On the other hand, the HPA (31, 32) tool detected high levels of expression of the novel protein in colon, lung, ovarian and urothelial cancers. Current protein expression data for SGPX are limited, and further experiments may need to be performed in the future.

A characterization of the protein was undertaken to assign a putative class to SGPX. The UniProtKB (108) database showed that SGPX is a single-pass type-1 transmembrane protein with a signal peptide. This was verified by the identification of a classical signal peptide at the N-terminus of the protein by SignalP (114). The functional aspect of the ORF remains unclear. However, 3D modeling analyses by the I-TASSER (111) and UCSC Genome Browser tools (77) indicated a possible glycoprotein with cellular trafficking, transport and budding function. This is based on the fact that structural homologues such as the properdin glycoprotein, the phosphatidylinositol 3,4,5-trisphosphate 3-phosphatase and the dual-specificity protein phosphatase (PTEN)-like region of auxilin were used as templates to build the models. A comprehensive motif and domain characterization by NCBI CDD (112), InterProScan (158), HMMER (34), PFAM (73), and ProDom (113) was performed. The following protein signatures were detected: DU936, Signal Ribosomal S2, INP family, and FAM163.

Genetic alterations of this gene were detected in the form of somatic mutations in glioma, lung, uterine and pancreatic cancer. Furthermore, amplifications were identified in prostate, lung and glioma while deletions were found in sarcoma. Significant elevation of DNA copy numbers was observed in different subtypes of brain cancer samples of female patients, suggesting an escape from gene dosage compensation. For this reason we hypothesize that SGPX is a potential secreted biomarker for reproductive and urological tumors (128). Additional experiments are needed to verify this prediction.

Applicability of the Approach to Another Therapeutic Area: Diabetes

The feasibility of our approach has allowed its expansion to other therapeutic areas, to continue seeking novel druggable targets and biomarkers beneficial in diseases such as diabetes. The GAD (100) was downloaded to provide a basis for mining the diabetes genomes. A vast number of genetic polymorphisms associated with cancer and with cardiovascular, immune, infectious, metabolic and neurodegenerative disorders were detected.

A sub-database of 58 novel genes involved with type 1 and type 2 diabetes was built to investigate possible druggable targets and biomarkers that could be beneficial in treating this disease and related illnesses. The GAD database (100) was mined for diabetes relevant genes and their respective genetic polymorphisms. A disease association profile was stratified indicating complications and disorders related to these diabetes genes. These include albuminuria, allergies, Alzheimer’s disease, autoimmune, cardiovascular, glucose intolerance, infection, inflammation, insulin resistance, metabolic syndrome, neurodegenerative, neoplasm, obesity, and viral infections based on gene-associated polymorphisms. The relevance of these uncharacterized proteins to diabetes type 1 and type 2 was determined by genome-phenome tools such as the eQTL (60), ClinVar (51) and PheGenI (60). This catalog of uncharacterized proteins provided a number of novel secreted proteins in body fluids and putative druggable targets such as receptors, transporters and enzymes that could be advantageous in the treatment of diabetes type 2 and its complication-related illnesses (81).


CHAPTER 5: DISCUSSION

Mining the human genome for novel drug targets for cancer has been an effective approach (20, 21), (39-42), (54), (81-83), (86), (127-129), (157), (159, 160). The current availability of numerous proteomics analysis tools has enhanced our ability to harness the genome databases for target discovery. More than half of the human proteins in the genome however, remain uncharacterized.

At the start of this dissertation research, it was reasoned that understanding the role of the unknown proteins might help develop a strong basis for future research. The starting premise was that the uncharacterized proteins could be evaluated for prediction of function from the amino acid motif and domains. The working hypothesis was that the amino acid fingerprint in a novel protein can provide a hint of function. The long-term goal of this project is to develop a pipeline of novel cancer markers to facilitate diagnosis and therapy.

At the onset, the project was considered risky; limited information was available on the uncharacterized proteins. However, reasoning that the ability to establish a knowledgebase of the unknown proteins could greatly aid the future of cancer research, the initial feasibility experiments were undertaken. From a preliminary list of 450 ORFs, motif and domain analysis revealed putative hints of function for various ORFs. Well-established motifs including DNA binding, catalytic site bearing enzyme-like signatures, transporters, and signal peptide bearing putative secreted proteins were detected (129). Encouraged by these findings, efforts were initiated to develop a comprehensive database of novel, uncharacterized ORFs (the Dark Matter proteome).

A great number of genome datasets had to be screened for this purpose. To accomplish this, a lab-wide, multi-investigator data mining effort was undertaken to create the Atlas of the Dark Matter ORFs (52). This Atlas of novel ORFs provides the basis for our laboratory’s future research. In addition to cancer, other disease areas can also benefit from these findings. The results presented in this study as well as work published from the laboratory (52), (81-84), (157) underscore the involvement of cancer targets in diverse diseases and disorders such as cardiovascular, infectious, immunologic, metabolic, neurological, and psychiatric. Thus, targets characterized in this study are relevant not only to cancer models but also to other diseases.

In independent findings, the Narayanan laboratory has recently identified several ORFs involved in diverse disease areas including neurodegenerative diseases (83), infectious disease (84) and pancreatic cancer (82). These additional data provide validation for the mining approach, which starts from genome-to-phenome association evidence as outlined in the strategy (see Figure 2).

Until recently, gene expression (mRNA and protein) correlation has been a serious limitation due to the lack of available protein expression databases (161, 162). Hence, protein function was frequently interpreted from the mRNA expression results. However, during the period of this dissertation research, numerous protein expression databases became available (31, 32), (56-58), (116), (161). These databases greatly enhanced our ability to develop correlative evidence at the mRNA versus protein expression level.

Our finding that about 10% of the Dark Matter ORFs show such a correlation is consistent with recent findings from the proteogenomic characterization of colon cancer genes (49), (161-164). We reasoned that targets can be prioritized using ORFs that show an expression correlation. However, such a filter is likely to produce false negatives; crucial lead target proteins may be missed. Various factors, such as sample variation, mRNA and protein stability, regulation of gene expression and post-translational modifications, can contribute to a lack of correlation. Thus, it may be too simplistic an approach for lead target prioritization. This is addressed in detail under the critical issues.

Identification and characterization of the CREF protein from the database of novel ORFs provides a proof of concept for the target discovery approaches undertaken in this study. Key findings of 1) restricted expression in select normal tissues, 2) a calcium-binding EF-Hand domain, 3) a putative transporter function inferred from the I-Tasser model homologue templates, 4) differential regulation in tumor subtypes and 5) loss of heterozygosity in lung carcinomas allowed us to create a working knowledgebase for this protein. The CREF protein, with its putative transporter function, joins a growing list of EF-Hand containing proteins as cancer biomarkers and druggable targets (133), (156), (165-167). Current druggable protein targets include enzymes, ion channel proteins, receptors (in particular G-protein-coupled receptors) and transporters (168).

Results to date point to a possible role of the CREF protein in three distinct tumor types: breast, colon and liver carcinomas. The results obtained suggest that the CREF protein could elicit a tumor-type dependent function: a tumor promoter role in liver neoplasms and possibly a mutated tumor suppressor function in breast and lung carcinomas. Additional experiments are warranted to verify these findings.

Calcium Binding Proteins and Cancer

Calcium-binding proteins containing EF-hand motifs are key targets for various cancers (133), (156), (165-167). A well-studied prototype of the EF-hand proteins is the S100 family. The S100 family members are differentially expressed in various solid tumors such as breast, lung, bladder, kidney, gastric, thyroid, prostate, and oral cancers (132), (165), (167), (169-171). Individual members of the S100 family can act as either tumor promoters or tumor suppressors, demonstrating a complex function in tumor growth (169), (172). For example, S100A2 expression is downregulated in diverse tumor types and is associated with poor patient prognosis (148).

Many calcium-binding proteins belong to the same evolutionary family and share a type of calcium-binding domain known as the EF-Hand (see Prosite documentation entry, PDOC00018). This domain consists of a twelve-residue loop flanked on both sides by a twelve-residue α-helical domain (see PDB: 1CLL). The structural/functional unit of EF-Hand proteins involves a pair of EF-Hand motifs that together form a stable four-helix bundle domain. The EF-Hand pairing is essential to the cooperative binding of Ca2+ ions. The consensus pattern of the EF-Hand calcium-binding proteins is

D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]

The EF hand domain consists of a duplication of two EF-Hand units, where each unit is composed of two helices connected by a twelve-residue calcium-binding loop. The calcium ion in the EF-Hand loop is coordinated in a pentagonal bi-pyramidal configuration.
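The consensus pattern quoted above can be translated into a regular expression and used to scan a protein sequence for candidate EF-Hand loops. The following is a minimal Python sketch under stated assumptions, not the tool used in this study: the function names prosite_to_regex and find_motifs are hypothetical, only the pattern elements that occur in this particular pattern are handled, and the short test fragment is illustrative rather than data from this work.

```python
import re

# EF-Hand consensus pattern as quoted in the text (PROSITE-style syntax).
EF_HAND_PROSITE = (
    "D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-"
    "[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]"
)


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex.

    Handles literal residues, [allowed] sets, {forbidden} sets,
    x (any residue) and repeat counts such as (2) or (2,4).
    """
    parts = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            token = core                       # literal or [allowed set]
        if low and high:
            token += "{" + low + "," + high + "}"
        elif low:
            token += "{" + low + "}"
        parts.append(token)
    return "".join(parts)


def find_motifs(sequence, regex):
    """Return (1-based start, matched text) for every match, overlaps included."""
    lookahead = re.compile("(?=(" + regex + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


EF_HAND_REGEX = prosite_to_regex(EF_HAND_PROSITE)

# Illustrative fragment (an assumption for this sketch, not study data).
fragment = "MAAFSLFDKDGDGTITTKELGTV"
print(find_motifs(fragment, EF_HAND_REGEX))
```

A pattern hit of this kind only flags a candidate loop; as discussed in the text, functional calcium binding would still need to be verified experimentally.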

Cancer Implications of the CREF Protein

The differential and complex expression profile of the CREF gene in diverse tumors (upregulated in liver carcinomas and downregulated in breast and colon carcinomas) is consistent with the role of EF-hand containing calcium-binding proteins (Selma et al). Currently, the expression data for the CREF protein are based on the canonical sequence, isoform 1 (NP_689590.1). However, at least three isoforms of the CREF protein are predicted. The current proteome expression datasets do not provide any isoform-level evidence for the CREF protein. It is unclear whether any isoform-specific expression exists for this gene that can account for tissue and tumor selectivity. Future experiments will have to address this issue.

The LOH results from the COSMIC disease database (63) raise the possibility of a tumor suppressor gene at chromosome 1p32.1 (139), (173). Frequent allelic loss at locus 1p32.1 is seen in lung, breast and other tumors (174). The list of all genes (both known and uncharacterized) present in chromosome 1 according to the Atlas of Genetics and Cytogenetics in Oncology and Haematology indicates the presence of numerous cancer-related genes such as TP73 and MYCBP, among others. Therefore, it is reasonable to speculate that the CREF gene (located in chromosome 1) is cancer-related.

Genome-Wide Association Studies of the CREF Gene

The GWAS analysis presented in this work establishes strong cancer relevance for the CREF gene. Overall, 3% of the tumors analyzed by mining the GWAS datasets harbored CREF gene mutations. These mutations were found in tumor samples from around the world across diverse databases representing different populations (ICGC database).

The CREF gene perturbations in tumors were largely missense and nonsense mutations. Further, amplifications and deletions at the CREF gene were detected in breast, lung, and liver tumors. A natural variant, Q → E at aa 151, is present in breast cancer samples (136), pointing to the importance of the CREF gene in breast cancers. In addition, an E331K mutation was found in lung tumors. This lung tumor mutation occurs in the EF-Hand domain. It is tempting to speculate that such a mutation would result in inactivation of the CREF protein, contributing to tumorigenicity. If the effect of this mutation is verified using recombinant wild-type and mutant CREF proteins in calcium binding assays, functional relevance in cancer can be established. This would also provide a basis for developing an assay for an antagonist screen in drug discovery.

The CREF gene mutations may occur only in a small subset of tumors. If verified with a larger cohort of tumor samples, the CREF gene may offer a way to stratify the group of patients for therapy. This would indicate a pharmacogenomics (response to therapy) potential for the CREF gene.

Encouraging results for such a potential use of the CREF gene were obtained from the analysis of CREF gene mutations and survival (see Figure 7). The enhanced survival of lung adenocarcinoma patients without CREF gene mutations who were treated with chemotherapy implies a response-to-therapy potential for this gene. These findings, together with the genome association evidence, support the premise that the CREF gene may be a solid tumor biomarker for patient stratification in therapy. If proven, the CREF protein may be added to the growing list of current pharmacogenomics targets such as ERBB2 (HER2/neu) for breast cancer (175), KRAS for lung cancer (176), EGFR (177), and S100 for melanomas and neurological tumors (178).

Regulation of the CREF Gene

Knowledge about the regulation of CREF gene expression can help develop an understanding of the role and function of the protein in normal and cancer tissues. To this end, transcriptome-based profiling from the NextBio Meta Analysis tool (120) provided valuable clues. The upregulation of CREF mRNA expression when methyltransferase genes such as DNMT1 and DNMT3B were knocked out suggests regulation of the CREF gene at the level of methylation (143). In addition, our data suggest that the CREF gene is p53-regulated, as the gene becomes downregulated when TP53BP2 is knocked down. The implication of tyrosine kinases and GTPases in the regulation of the CREF gene is supported by the observation that overexpression of oncogenes such as Ras, Bcr, IDH1, and Abl1 appears to cause downregulation of CREF mRNA. Mutations of the EGFR, Abl1, and Bcr kinases also cause downregulation of the CREF gene (150). These results seem to implicate kinase pathways in the regulation of CREF gene expression.

The results on the effect of pharmacological agents on CREF gene expression provided additional evidence of cell cycle regulation. Paclitaxel, a mitotic inhibitor that acts by hyper-stabilizing microtubules, causes upregulation of the CREF mRNA. This result suggests that CREF may be regulated at the G2 phase of the cell cycle. The upregulation of the CREF gene seen with methyltransferase inhibitors such as Azacitidine and Decitabine is consistent with the upregulation caused by knockout of methyltransferase genes such as DNMT1, DNMTBP2, and ESR1 (179).

The implication of the CREF protein in cell growth and in the cell proliferation signal cascade is further supported by the effect of tyrosine kinase inhibitors such as Gefitinib and Genistein. The upregulation of CREF gene expression by isoascorbic acid, an antioxidant, implicates redox pathways in the regulation of CREF. Upregulation of CREF by Trichostatin A (TSA), a histone deacetylase (HDAC) inhibitor that causes G1 arrest (180), suggests close cell cycle regulation between the G1 and G2/M transitions. This observation is further supported by the upregulation of CREF caused by Tamoxifen, an angiogenesis inhibitor (180) that causes cells to remain in the G0 and G1 phases of the cell cycle. Thus, a complex two-phase regulation (G1 and G2/M) for the CREF gene can be inferred from our data. The downregulating effect of Arbutin / kojic acid (142) on CREF correlates with the downregulation of CREF by knockdown of TP53BP2, as these two drugs regulate apoptotic genes such as p53 (181).

MicroRNAs provide an additional layer of regulation of CREF gene expression. The upregulation of hsa-miR-221 and hsa-miR-222 in lung and breast tumors is accompanied by the downregulation of CREF in these tumors (182, 183). Also, the downregulation of hsa-miR-383 (184) in liver cancer may cause the upregulation of CREF in hepatic tissue. These three miRs may offer valuable reagents for understanding the regulation of the CREF gene. Accumulating evidence indicates a key role for miRs in the diagnosis and therapy of cancer and other diseases (185, 186). Thus, follow-up studies on these three miRs might provide additional insights into the biomarker potential of the CREF protein.

Based on these transcriptome-based analyses, a proposed model for the regulation of CREF gene expression is shown in Figure 16. In this model, the CREF gene is regulated at the cell cycle level in a biphasic manner (the G1/S and G2/M phases). Further, transcriptional regulation involving the p53 binding protein (TP53BP2) is inferred. The CREF protein is also post-translationally regulated by kinases and GTPases (RAS, ABL1, BCR, EGR) and by promoter DNA methylation (DNMT1, DNMTP3, ESR1). An additional level of noncoding RNA-mediated regulation of the CREF gene can be seen: two miRs (hsa-miR-221/222) are involved in the regulation of the CREF gene. From the interactome analyses, key interacting proteins are implicated in the regulation of the CREF protein (WDR78, CCDC33, APP, EF-Hand solute transporter).

Interestingly, another EF-Hand containing transporter was identified as a CREF interacting protein partner. This suggests the possible involvement of calcium signaling in the mechanism of CREF protein function. The Amyloid beta precursor protein (APP), a transmembrane protein, is implicated in breast carcinomas (155). The APP interaction with the CREF protein extends the functional relevance of the CREF protein to neurological systems. This working model provides a basis for experimentation.


Figure 16. Postulated Model for Regulation of the CREF Gene. Panel A shows transcription factor, methylation and microRNA as possible regulators of the CREF gene. Panel B shows regulation by apoptosis regulation, kinase pathways, methylation and the interactome. Panel C shows cell cycle regulation of the CREF gene (G1 & G2-M phase).

In summary, the characterization of the CREF gene product and the prediction of its putative function as a transporter provide a strong rationale for druggability. The feasibility of the gene discovery approach is further strengthened by the identification of a secreted cancer biomarker, the SGPX protein, as well as by the discovery of diabetes-related novel ORFs. Taken together, the results presented in this dissertation support the starting hypothesis that function can be predicted using protein motif and domain knowledge. The atlas of the dark matter database generated in the study is likely to provide a basis for drug target discovery for cancer and other diseases.

Critical Issues

The discovery of the uncharacterized protein CREF is a proof of concept of our approach to decode the Dark Matter of the human genome. Although the strategy demonstrates feasibility in characterizing novel genes, there are critical issues that should be kept in mind. A discussion of these critical issues follows.

The Correlation of the mRNA and Protein Expression

Gene expression is monitored at the level of 1) gene amplification through copy number variations (CNV), 2) mRNA expression and 3) protein expression. Until recently, the number of available tools to monitor protein expression in normal and diseased tissues was limited; most of the tools for measuring gene expression revolved around mRNA expression (EST expression, Unigene analysis, Serial Analysis of Gene Expression (SAGE), microarray and Next Generation Sequencing (NGS) technology) (29), (106), (187-190). Further, mRNA and protein expression datasets utilized tissue samples from different patients. This was a major handicap in our ability to infer the functional consequence of protein targets (161), (163), (191). As proteins determine function, it is essential to establish a correlation of gene expression at the mRNA versus protein levels. Recently, a wide variety of protein expression databases became available, increasing the reliability of such correlative analysis.

Protein expression data for the majority of the 22,000 human proteins in fetal and adult tissues (including diverse body fluids) are present in databases such as the HPA, HPRD, MOPED, the Human Proteome Map and the Proteomics DB (31), (56-58), (192). Further, isoform-related information is also available for 20,000 human proteins from the Proteomics DB (58). Thus, it is possible to prioritize lead targets for follow-up studies based on the correlation of gene expression at the mRNA and protein levels.

Due to distinct regulatory mechanisms that affect gene amplification, transcription, translation and protein activity (164), (193, 194), establishing a correlation of gene expression is often difficult. In a recent study, Zhang et al. (162), using The Cancer Genome Atlas (TCGA) (67) tissues (n=87), showed that gene amplification and mRNA abundance often did not correlate. In addition, only a third of the genes analyzed (n=3,764) in their study showed a correlation of mRNA and protein expression.

The results presented in this study are consistent with these findings. The CREF gene showed a strong correlation of mRNA (Oncomine, HPA, RNA-Seq) and protein expression (HPRD, HPA, MOPED) data in the lung tissues. Further, the atlas of the ORF database, established as a basis for the study presented in this dissertation, showed a 10% correlation of expression (80/800 ORFs) at the mRNA and protein levels (195). These results at the very least helped establish a subset of ORFs as putative leads for further studies.

While the hit genes cannot be dismissed as unimportant based on a lack of correlation between mRNA and protein expression, it is reasonable to expect such a correlation for a lead drug target. Choosing such a stringent filter can increase the odds of druggable protein target(s) discovery and can reduce the noise. However, it is important to keep in mind that the datasets for proteins and mRNAs come from different tissue sources. Hence, the expression differences between mRNA and protein levels could be a consequence of genetic variability. Further, the protein activity rather than the level of the protein is often a key determinant of function and relevance. Issues related to protein stability, half-life and modification add a further layer of complexity. Thus, reliance on correlative evidence for mRNA and protein expression will produce false negatives. Despite these limitations, the hit genes showing correlation across gene amplification and expression at the mRNA and protein levels can be chosen for further studies. The eventual development of single-cell transcriptome and proteome analysis capability can increase the reliability of such a correlation analysis. Further, the choice of datasets such as CanSar (61) and the HPA (56, 116), which have the genome and expression data (CNV, mRNA and protein) from the same tissues, may help address some of these issues.
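The correlation filter discussed above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each ORF has already been reduced to a direction call ("up", "down" or "none") at the mRNA and protein levels; the function names, the call format and the placeholder ORF names (ORF_A to ORF_D) are hypothetical and do not reproduce the actual datasets or tools used in this study. The only figure taken from the text is the 80 of 800 ORFs.

```python
def concordant_orfs(mrna_calls, protein_calls):
    """Return ORFs called in the same direction ('up' or 'down') at both levels."""
    shared = mrna_calls.keys() & protein_calls.keys()
    return sorted(
        orf
        for orf in shared
        if mrna_calls[orf] in ("up", "down")
        and mrna_calls[orf] == protein_calls[orf]
    )


def concordance_rate(n_concordant, n_total):
    """Fraction of ORFs that pass the mRNA/protein correlation filter."""
    return n_concordant / n_total


# Hypothetical toy calls, not real expression data.
mrna = {"C1orf87": "down", "ORF_A": "up", "ORF_B": "up", "ORF_C": "down"}
protein = {"C1orf87": "down", "ORF_A": "down", "ORF_B": "none", "ORF_D": "up"}
toy_hits = concordant_orfs(mrna, protein)

# Figure reported in the text: 80 of the 800 ORFs showed a correlation.
reported_rate = concordance_rate(80, 800)
print(toy_hits, f"{reported_rate:.0%}")
```

In this toy example ORF_A and ORF_B are dropped by the filter even though either could be a genuine target, which is the false-negative risk noted in the text.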

Lack of Consistency across Bioinformatics Tools

The vast number of bioinformatics tools available provides an extensive amount of data accessible for diverse types of analytical studies. The tools offer a range of information regarding transcriptome and proteomic expression levels in normal and infected tissues, mutational analysis, interaction/pathway examination and GWAS data (29), (31, 32), (56), (58), (60), (63), (106), (116).

The dynamic nature of the datasets from the different databases is both an advantage and a hindrance in mining the dataset. The material presented by the tools is constantly updated and expanded. This means some of the earlier data may become irrelevant. By the same token some of the false negatives may become hits as newer information becomes available (196, 197).

Hence, the datasets need to be constantly reevaluated prior to performing extensive bioinformatics follow up experiments. Such a dynamic aspect of the dataset sometimes could add a level of complexity to the interpretations.

Another critical problem is a possible lack of consistency across multiple bioinformatics tools (198). Different tools (for example, for expression analysis, motif and domain analysis, and mutational characterization) utilize different algorithms, datasets, samples, etc. Hence, it is reasonable to expect differences in the output across multiple tools. Despite these possible differences, if two or more tools show concordance in the output of hit genes, then it is possible to develop a higher level of confidence in the results. While false negatives cannot be avoided, the positives are likely to be reliable (albeit fewer in number). In fact, we have adopted this strategy in our approach, i.e., hits identified by more than two bioinformatics tools were treated as possible leads for follow-up studies (see Table 1).
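The concordance strategy described above amounts to keeping only the hits reported by a minimum number of independent tools. A minimal Python sketch follows; the function name consensus_hits, the tool names and the ORF names are placeholders introduced for illustration, and the threshold is left as a parameter so that either "two or more" or "more than two" tools can be required.

```python
from collections import Counter


def consensus_hits(hits_by_tool, min_tools=2):
    """Return genes reported as hits by at least `min_tools` different tools."""
    counts = Counter(
        gene
        for genes in hits_by_tool.values()
        for gene in set(genes)  # count each gene at most once per tool
    )
    return sorted(gene for gene, n in counts.items() if n >= min_tools)


# Hypothetical tool outputs; tool and ORF names are placeholders.
hits_by_tool = {
    "expression_tool": ["ORF_A", "ORF_B", "ORF_C"],
    "motif_tool": ["ORF_A", "ORF_C", "ORF_D"],
    "mutation_tool": ["ORF_A", "ORF_D", "ORF_E"],
}
at_least_two = consensus_hits(hits_by_tool, min_tools=2)
all_three = consensus_hits(hits_by_tool, min_tools=3)
print(at_least_two, all_three)
```

Raising the threshold shrinks the list of leads, which mirrors the trade-off noted in the text between reliability of the positives and the number of false negatives.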

This dissertation research depended on the premise that the uncharacterized ORF proteins can be verified for function using knowledge of motifs and domains. The reliability of the outcome from the diverse tools to be used was a major critical issue. To address this concern, known proteins with known motifs and domains were chosen as control sequences. In addition, for certain proteins, an in silico modified construct was generated by deletion experiments. These motif-deleted constructs provided an additional level of negative controls. The use of positive and negative controls was fundamental to test the reliability of the tools used and to dismiss any potential false hits generated by them. An example of such logic was introduced in the “Approach and Experimental Design” section (see page 26). The integration of different tools in a study has the potential to raise these types of issues; however, standardization of the data could help resolve them (199).
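The control logic described above can be sketched in Python as follows. This is an illustrative assumption, not the procedure actually used: motif_scanner stands in for any motif or domain tool, delete_motif builds the in silico motif-deleted construct, and the short sequence is a hypothetical positive control. The regular expression is a direct transcription of the EF-Hand consensus pattern quoted earlier in the text.

```python
import re

# Regular-expression form of the EF-Hand consensus pattern quoted in the text.
EF_HAND_REGEX = (
    "D[^W][DNS][^ILVFYW][DENSTG][DNQGHRK][^GP][LIVMC][DENQSTAGC].{2}[DE][LIVMFYW]"
)


def motif_scanner(sequence):
    """Stand-in for a motif tool: True when the EF-Hand pattern is found."""
    return re.search(EF_HAND_REGEX, sequence) is not None


def delete_motif(sequence, motif_regex=EF_HAND_REGEX):
    """Build an in silico motif-deleted construct by removing every motif match."""
    return re.sub(motif_regex, "", sequence)


def tool_passes_controls(tool, positive_control, negative_control):
    """Trust a tool only if it detects the motif in the positive control
    and does not detect it in the motif-deleted negative control."""
    return tool(positive_control) and not tool(negative_control)


positive = "MAAFSLFDKDGDGTITTKELGTV"  # hypothetical fragment with one EF-Hand loop
negative = delete_motif(positive)      # motif-deleted construct
print(negative, tool_passes_controls(motif_scanner, positive, negative))
```

A tool that reported the motif in the deleted construct would fail this check, flagging its hits on uncharacterized ORFs as potentially false.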

Future Directions

The CREF gene is differentially regulated in distinct tumor types (upregulation in liver carcinomas and downregulation in breast and lung carcinomas). This raises the possibility that the CREF protein may elicit a tissue-type dependent function in growth regulation, similar to other EF-Hand containing calcium-binding proteins (132), (165, 166), (200). Further characterization of the CREF protein, including assay development for measuring the activity of the protein product, would provide a rationale for verifying its drug therapy potential. Some of these potential areas are discussed below.

Assay Development for CREF Transporter Function for Drug Discovery

The calcium-binding EF-hand family of proteins are important drug targets for cancer and other diseases (156), (165), (201, 202). The prediction of a putative transporter function of the CREF protein based on I-Tasser (111) protein domain analysis (see Figure 14) opens the possibility of assay development for potential drug discovery. Currently, in DrugBank (www.drugbank.ca), 113 transporter proteins are in the FDA-approved list of drug targets (31), (203). These proteins include EF-Hand containing proteins such as calmodulins, calcineurin and S100 RAGE proteins. A small molecular weight inhibitor of the S100 RAGE protein is in clinical trials for cancer (171), (200), (204). Assays based on mobility shift and enzyme-linked immunosorbent assays (ELISA) are readily available for these proteins through various commercial vendors. Thus, it is possible to develop cell-based assay systems to measure CREF protein activity. The E331K mutation found within the EF-Hand domain could be further analyzed to understand the level of impact it may have on the function of the protein. In view of the upregulation of the CREF protein seen in liver carcinoma, liver cancer-derived cell lines can be used as a model system to develop the assay.

An initial proof of concept for the requirement of CREF function in tumor cell growth can be established in liver cancer models using knockdown technology such as siRNA or antisense techniques. Appropriate liver cancer cell line models can be chosen using mRNA and protein expression evidence; an antibody to the CREF protein is available (HPA, Nova Biological, Abcam). Low and high expressor cell lines, if available, can be chosen for verification experiments using the knockdown technology. If growth inhibition is seen in these experiments, a strong rationale for assay development to identify inhibitors can be developed.

Recently, our laboratory has identified small molecular weight inhibitors for enzymes, receptors and transporters using chemogenomics approaches (205).

Compounds from the database of bioactive compounds at the EMBL laboratories (ChEMBL) can be screened for hits that bind the CREF protein, using its 3D structure (206). The CanSar protein annotation tool can be used for these studies (62). These drug-like compounds can then be tested in the liver cancer-derived cell line models for growth inhibition in in vitro and in vivo experiments. Further, the specificity of such compounds can be verified using various other transporter assays. Such an approach may help discover a rational drug or an antagonist based on the function of the CREF protein. This could help eliminate random screening approaches to drug discovery for targeting the CREF protein.

The CREF protein showed phenotype association evidence using GWAS tools in bipolar disorder, coronary artery disease, rheumatoid arthritis and stroke.

Hence, one can anticipate that an inhibitor/antagonist of the CREF calcium binding function may be useful for some of these diseases as well.

Profiling Methylation Status of the CREF Gene in the Tumors of Interest Could Lead to Biomarker Identification

DNA methylation plays a key role in the epigenetic effects on gene expression in both normal tissue and cancer cells (207). The hypermethylation of CpG islands within promoter sequences might contribute to tumorigenesis by promoting transcriptional silencing or the downregulation of tumor suppressor genes, while hypomethylation could lead to the mutation of proto-oncogenes (208). Further studies of changes in DNA methylation have the potential to lead to the identification of biomarkers for early detection and prognosis of diverse tumors. For example, our results indicate a selective regulation of CREF gene expression by methylation of the CpG islands in the promoter region. More specifically, the CREF gene is hypermethylated in lung and breast carcinomas and hypomethylated in liver neoplasms. This is consistent with the downregulation of the CREF gene in lung and breast tumors and its upregulation in liver carcinomas observed in the same dataset. The analysis of the effect of pharmacological agents and the perturbation of genes on CREF (see Tables 5 and 6) further corroborated this claim: the upregulation of CREF seen with methyltransferase inhibitors such as Azacitidine and Decitabine is consistent with the upregulation caused by knockout of methyltransferase genes such as DNMT1, DNMTBP2, and ESR1.

The use of high-throughput DNA methylation assay technology (such as MethyLight) and methylation quantitative trait loci (mQTLs) could improve the generation of epigenetic profiles of tumor samples of interest for genetic marker identification (209, 210). This strategy is already being applied to identify possible biomarkers in colorectal cancer (CRC) (208).

In addition, the putative tumor suppressor role of CREF could be tested by measuring whether methylation silences the capability of the gene to suppress malignancy. Assays based on DNMT activity/inhibition and ELISA-based methods that detect DNA methyltransferase activity on provided CpG-enriched DNA substrates are commercially available for different proteins (www.activemotif.com). Furthermore, methyltransferase inhibitors such as Azacitidine and Decitabine are FDA-approved drugs used in anticancer treatment. These pharmacological agents can allow cancer cells to revert to the standard phenotype required for cell cycle regulation and apoptosis induction while having restricted toxicity (211). Hence, it is tempting to speculate that these agents could potentially reactivate the silent tumor suppressor gene in tissue samples and allow for normalization of cell growth (212). These approaches could further corroborate the tumor suppressor role of CREF.

CREF, Loss of Heterozygosity and Cancer

The results from our study demonstrate the presence of Loss of Heterozygosity (LOH) in the CREF gene in 27% (37/139) of lung carcinomas and 9% (4/45) of breast tumors. LOH is a hallmark of cancer development. It is associated with genomic instability at the chromosomal level, since one of the gene alleles is lost. This often leads to disruption and/or functional silencing of tumor suppressor genes located in the region of LOH (213-215). The loss observed with the CREF gene in the breast and lung neoplasms appears to correlate with the downregulation of the CREF gene in these tumor types. Such a relationship led us to speculate a putative tumor suppressor role for the CREF gene.
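The LOH frequencies quoted above can be reproduced from the raw counts, and an approximate interval gives a sense of how much uncertainty the sample sizes leave. The Python sketch below is an illustration added here, not an analysis performed in this study: the Wilson score interval and the 95% level (z = 1.96) are assumptions chosen for the example, and the function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Counts reported in the text.
loh_counts = {"lung carcinoma": (37, 139), "breast tumor": (4, 45)}

for tumor, (with_loh, n) in loh_counts.items():
    low, high = wilson_interval(with_loh, n)
    print(f"{tumor}: {with_loh}/{n} = {with_loh / n:.1%} "
          f"(approx. 95% interval {low:.1%} to {high:.1%})")
```

The breast tumor estimate rests on only 45 samples, so its interval is comparatively wide, which supports the case for verification in a larger cohort.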

It is possible to test this speculation by overexpressing the siRNA or antisense CREF gene in non-neoplastic breast epithelial cells, for example the MCF10A cells (216). These breast cells are chemically transformed immortal but non-neoplastic cells. The transfected cells can be tested for growth suppression in vitro and, if efficacy is seen, then in nude mouse or transplantation models. However, it is possible that the expression of the CREF gene in the MCF10A cells is toxic, and hence it may not be possible to establish the transfected cells. To overcome this, it is possible to choose an inducible expression vector (such as a tetracycline- or dexamethasone-inducible vector) for such transfection studies (217). Similar studies can also be performed using BEAS2B cells, an immortalized human bronchial epithelial cell line (218).


APPENDICES

APPENDIX A – LIST OF PUBLICATIONS

Delgado AP, Brandao P and Narayanan R. Mining the dark matter of the cancer proteome for novel biomarkers. Current Cancer Therapy Reviews, 2013, 9, 1-13


This review illustrates a rationalized strategy to decode the dark matter of the human proteome by using bioinformatics and proteomics tools. A proof of concept is described by performing a bioinformatics profiling of an uncharacterized gene (C1orf87), renamed Carcinoma-Related EF hand protein (CREF). The CREF gene encodes a calcium-binding protein that is likely a novel druggable target for breast, lung and liver tumors.

Delgado AP, Hamid S, Brandao P, and Narayanan R. A Novel Transmembrane Glycoprotein Cancer Biomarker Present in the X Chromosome. Cancer Genomics Proteomics. 2014 Mar-Apr;11(2):81-92.

The manuscript describes the discovery and characterization of SGPX (Secreted glycoprotein in chromosome X), previously named CXorf66. This novel gene is differentially expressed in brain, lung, liver and prostate neoplasms and in leukemia. We hypothesize that SGPX has the potential to be a secreted biomarker for reproductive and urological tumors. In these tumors, the observed elevated DNA copy number of SGPX suggests an escape from gene dosage compensation.

Delgado AP, Brandao P, Chapado MJ, Hamid S, and Narayanan R (2014) Open reading frames associated with cancer in the dark matter of the human proteome. Cancer Genomics Proteomics 11(4): 201-213

This study portrays the development of a signature of sixty-two novel genes that show verified GWAS relevance to diverse neoplasms. These OncoORFs (Oncology Open Reading Frames) include druggable targets (enzymes, transporters, receptors) and secreted biomarkers for multiple tumor types. Furthermore, these novel genes show association with numerous other diseases, suggesting a complex landscape in human diseases.

Delgado AP, Brandao P, and Narayanan R (2014) Diabetes associated genes from the dark matter of the human proteome. MOJ Proteomics Bioinform 1(4): 00020.

This paper establishes a catalog of fifty-eight ORFs involved with type 1 and type 2 diabetes and its complication-related illnesses. The relevance of these uncharacterized proteins was determined by genome-phenome tools such as the eQTL, ClinVar and PheGenI. Putative novel therapy targets for type 2 diabetes and related disorders were extracted from this database.

Delgado AP, Brandao P, Chapado MJ, Hamid S, and Narayanan R (2015). Atlas of the Open Reading Frames in Human Diseases: Dark Matter of the Human Genome. MOJ Proteomics Bioinform.

This manuscript illustrates the establishment of a database of disease-related uncharacterized ORFs as a possible source for accelerated therapeutic target and diagnostic marker discovery in diverse disorders. These proteins went through a comprehensive profiling including information on baseline mRNA and protein expression, protein signatures, GWAS, possible pathways/interactions, disease traits and clinical relevance.

Delgado AP, Brandao P, Chapado MJ, Hamid S, Van de Ven W and Narayanan R (2014). "Discovery of a novel carcinoma-associated EF hand containing protein by mining the dark matter of the human proteome." Cancer Res (74): 4189. (Poster presented at the American Association for Cancer Research, San Diego, 2014)

This poster describes the profiling of the open reading frame C1orf87 using our bioinformatics-based approach to decode the Dark Matter of the human genome. We explain the results obtained from mRNA and protein expression analyses, protein characterization, microRNA regulation, and GWAS, and the relevance of this recently named CREF protein as a potential druggable cancer target.

APPENDIX B – CREF CHARACTERIZATION

| Characteristics | Gene Description | Tool Used |
|---|---|---|
| Map Position | 1p32.1 | NCBI MapViewer |
| Exon Structure / DNA, mRNA & Protein Size | 12 exons, 83.36 kb, 2126 bp, 545 aa (62 kDa) | Genatlas, RefSeq, UniProtKB |
| Transcription Factor Binding Sites | GATA-2, STAT-3, MafF, CTCF, Rad21 / POU3F1, FOXC1, FOXJ2, LHX3, FOXM1, VSX1, PITX3 and SP1 | UCSC Genome Browser, Geneset-NextBio |
| Canonical Sequence / Isoforms | NP_689590.1 (isoform 1) / 3 isoforms (Isoform 1, 546 aa; Isoform 2, 180 aa and Isoform 3, 138 aa) | UniProtKB |
| Homologs | 7 homologs (highly conserved): Homo sapiens, Pan troglodytes, Macaca mulatta, Canis lupus familiaris, Bos taurus, Mus musculus, Rattus norvegicus | Homologue Gene |
| Gene Ontology | Calcium ion binding | UniProtKB, GO Miner |
| Subcellular Location | Nucleus, cytoplasmic and membranous | Predict Protein, HPA |
| Expression of mRNA (normal) | Testis; brain; lung; pharynx; connective tissue; uterus, fallopian tubes, peritoneum | Unigene, NextBio |
| Protein Expression (normal) | Selective expression in ciliated cells including fallopian tube and airway epithelia | NextBio |
| Expression of mRNA (cancer) | Upregulated: ovary, muscle, bladder; downregulated: breast, lung | Oncomine, NextBio |
| Protein Expression (cancer) | Upregulated: liver, prostate, renal carcinomas, endometrial; downregulated: breast, colorectal, lung, stomach carcinomas | Human Protein Atlas |
| SNP Variants | 1690, 5 natural variants, one breast cancer variant, Q --> E | GeneCards, COSMIC |
| microRNA | hsa-miR-338-5p, hsa-miR-103b, hsa-miR-26b, hsa-miR-3175, hsa-miR-4273, hsa-miR-383 | GeneCards |
| Structural Alignment in PDB | Calcium binding EF hand proteins | PredictProtein |
| Motif/Domain | EF-like domain | InterProscan |
| Methylation | Hypermethylated in lung and breast carcinomas and hypomethylated in liver carcinoma | NextBio |
| Gene Regulation | Downregulated by Genistein (G2 arrest), 2-methoxy estradiol (mitotic modulator), Capecitabine (antimetabolite). Upregulated by Trichostatin A (G1 arrest), Azacytidine (antimetabolite), Isoascorbic acid (antioxidant), Oxyquinoline (chelating agent), Paclitaxel (antimicrotubule). | NextBio |
| Most Correlated Gene Perturbations | Downregulated by AICDA, CHMP2B, TP53BP2. Upregulated by NUPL1 and RTN. | NextBio |
| Other Diseases | Mood disorders, Chronic sinusitis, Gout, Hyperuricemia, Interstitial lung disease | NextBio |
| Interacting Protein Partners | Amyloid beta precursor protein APP | BioGrid |
| Putative Class of Protein | Carcinoma-associated, calcium binding | UniProtKB, InterProscan, Human Protein Atlas, NextBio, Oncomine |


BIBLIOGRAPHY

1. Di Lonardo A, Nasi S, Pulciani S. Cancer: we should not forget the past. Journal of Cancer. 2015;6(1):29-39.
2. Croce CM. Oncogenes and cancer. The New England journal of medicine. 2008;358(5):502-11.
3. Hong B, Le Gallo M, Bell DW. The mutational landscape of endometrial cancer. Current opinion in genetics & development. 2015;30C:25-31.
4. Marshall CJ. Tumor suppressor genes. Cell. 1991;64(2):313-26.
5. Gabay M, Li Y, Felsher DW. MYC activation is a hallmark of cancer initiation and maintenance. Cold Spring Harbor perspectives in medicine. 2014;4(6).
6. Maru Y, Witte ON. The BCR gene encodes a novel serine/threonine kinase activity within a single exon. Cell. 1991;67(3):459-68.
7. Shah NP, Witte ON, Denny CT. Characterization of the BCR promoter in Philadelphia chromosome-positive and -negative cell lines. Molecular and cellular biology. 1991;11(4):1854-60.
8. Kas K, Voz ML, Roijer E, Astrom AK, Meyen E, Stenman G, et al. Promoter swapping between the genes for a novel zinc finger protein and beta-catenin in pleiomorphic adenomas with t(3;8)(p21;q12) translocations. Nature genetics. 1997;15(2):170-4.
9. Kunz M. Oncogenes in melanoma: an update. European journal of cell biology. 2014;93(1-2):1-10.
10. Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nature medicine. 2004;10(8):789-99.
11. Gasparian AV, Laktionov KK, Belialova MS, Pirogova NA, Tatosyan AG, Zborovskaya IB. Allelic imbalance and instability of microsatellite loci on chromosome 1p in human non-small-cell lung cancer. British journal of cancer. 1998;77(10):1604-11.
12. Merino D, Malkin D. p53 and hereditary cancer. Sub-cellular biochemistry. 2014;85:1-16.
13. Pflaum J, Schlosser S, Muller M. p53 Family and Cellular Stress Responses in Cancer. Frontiers in oncology. 2014;4:285.
14. Stegh AH. Targeting the p53 signaling pathway in cancer therapy - the promises, challenges and perils. Expert opinion on therapeutic targets. 2012;16(1):67-83.
15. Levine AJ, Puzio-Kuter AM. The control of the metabolic switch in cancers by oncogenes and tumor suppressor genes. Science. 2010;330(6009):1340-4.
16. Teufel A, Krupp M, Weinmann A, Galle PR. Current bioinformatics tools in genomic biomedical research (Review). International journal of molecular medicine. 2006;17(6):967-73.

17. Strausberg RL. The Cancer Genome Anatomy Project: new resources for reading the molecular signatures of cancer. The Journal of pathology. 2001;195(1):31-40.
18. Scheurle D, DeYoung MP, Binninger DM, Page H, Jahanzeb M, Narayanan R. Cancer gene discovery using digital differential display. Cancer research. 2000;60(15):4037-43.
19. De Young MP, Damania H, Scheurle D, Zylberberg C, Narayanan R. Bioinformatics-based discovery of a novel factor with apparent specificity to colon cancer. In vivo. 2002;16(4):239-48.
20. Narayanan R. Bioinformatics approaches to cancer gene discovery. Methods in molecular biology. 2007;360:13-31.
21. Hopkins AL, Groom CR. The druggable genome. Nature reviews Drug discovery. 2002;1(9):727-30.
22. Pertea M, Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome biology. 2010;11(5):206.
23. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic acids research. 2013;41(Database issue):D36-42.
24. Kodama Y, Mashima J, Kaminuma E, Gojobori T, Ogasawara O, Takagi T, et al. The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of experiments. Nucleic acids research. 2012;40(Database issue):D38-42.
25. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921.
26. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science. 2001;291(5507):1304-51.
27. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403-10.
28. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, et al. Database resources of the National Center for Biotechnology Information. Nucleic acids research. 2001;29(1):11-6.
29. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270(5235):484-7.
30. O'Brien C. Cancer genome anatomy project launched. Molecular medicine today. 1997;3(3):94.
31. Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419.
32. Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, et al. Towards a knowledge-based Human Protein Atlas. Nature biotechnology. 2010;28(12):1248-50.
33. Gattiker A, Gasteiger E, Bairoch A. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied bioinformatics. 2002;1(2):107-8.
34. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic acids research. 2011;39(Web Server issue):W29-37.

35. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, et al. InterProScan: protein domains identifier. Nucleic acids research. 2005;33(Web Server issue):W116-20.
36. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, et al. InterPro: the integrative protein signature database. Nucleic acids research. 2009;37(Database issue):D211-5.
37. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols. 2009;4(1):44-57.
38. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic acids research. 2013;41(Database issue):D808-15.
39. Deyoung MP, Scheurle D, Damania H, Zylberberg C, Narayanan R. Down's syndrome-associated single minded gene as a novel tumor marker. Anticancer research. 2002;22(6A):3149-57.
40. DeYoung MP, Tress M, Narayanan R. Identification of Down's syndrome critical locus gene SIM2-s as a drug therapy target for solid tumors. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(8):4760-5.
41. DeYoung MP, Tress M, Narayanan R. Down's syndrome-associated Single Minded 2 gene as a pancreatic cancer drug therapy target. Cancer letters. 2003;200(1):25-31.
42. Aleman MJ, DeYoung MP, Tress M, Keating P, Perry GW, Narayanan R. Inhibition of Single Minded 2 gene expression mediates tumor-selective apoptosis and differentiation in human colon cancer cells. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(36):12765-70.
43. Schmitt AO. Mining expressed sequence tag (EST) libraries for cancer-associated genes. Methods in molecular biology. 2010;576:89-98.
44. Nadzirin N, Firdaus-Raih M. Proteins of Unknown Function in the Protein Data Bank (PDB): An Inventory of True Uncharacterized Proteins and Computational Tools for Their Analysis. International journal of molecular sciences. 2012;13(10):12761-72.
45. Natrajan R, Wilkerson P. From integrative genomics to therapeutic targets. Cancer research. 2013;73(12):3483-8.
46. Hauptman N, Glavac D. MicroRNAs and long non-coding RNAs: prospects in diagnostics and therapy of cancer. Radiology and oncology. 2013;47(4):311-8.
47. Xie B, Ding Q, Han H, Wu D. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics. 2013;29(5):638-44.
48. Nagano T, Fraser P. No-nonsense functions for long noncoding RNAs. Cell. 2011;145(2):178-81.
49. Narayanan R. The Next Horizon in Proteomics and Genomics Research. MOJ Proteomics Bioinform. 2014;1(1):0006.

50. Griffith M, Griffith OL, Coffman AC, Weible JV, McMichael JF, Spies NC, et al. DGIdb: mining the druggable genome. Nature methods. 2013;10(12):1209-10.
51. Martin L, Chang HY. Uncovering the role of genomic "dark matter" in human disease. The Journal of clinical investigation. 2012;122(5):1589-95.
52. Delgado AP, Chapado MJ, Brandao P, Hamid S, Narayanan R. Atlas of the Open reading Frames in human diseases: Dark matter of the human genome. MOJ Proteomics Bioinform. 2015;2(1).
53. Blaxter M. Genetics. Revealing the dark matter of the genome. Science. 2010;330(6012):1758-9.
54. Russ AP, Lampel S. The druggable genome: an update. Drug discovery today. 2005;10(23-24):1607-10.
55. Gogol-Doring A, Chen W. An overview of the analysis of next generation sequencing data. Methods in molecular biology. 2012;802:249-57.
56. Montague E, Stanberry L, Higdon R, Janko I, Lee E, Anderson N, et al. MOPED 2.5--an integrated multi-omics resource: multi-omics profiling expression database now includes transcriptomics data. Omics : a journal of integrative biology. 2014;18(6):335-43.
57. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature. 2014;509(7502):575-81.
58. Wilhelm M, Schlegl J, Hahne H, Moghaddas Gholami A, Lieberenz M, Savitski MM, et al. Mass-spectrometry-based draft of the human proteome. Nature. 2014;509(7502):582-7.
59. Artimo P, Jonnalagedda M, Arnold K, Baratin D, Csardi G, de Castro E, et al. ExPASy: SIB bioinformatics resource portal. Nucleic acids research. 2012;40(Web Server issue):W597-603.
60. Ramos EM, Hoffman D, Junkins HA, Maglott D, Phan L, Sherry ST, et al. Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet. 2014;22(1):144-7.
61. Halling-Brown MD, Bulusu KC, Patel M, Tym JE, Al-Lazikani B. canSAR: an integrated cancer public translational research and drug discovery resource. Nucleic acids research. 2012;40(Database issue):D947-56.
62. Bulusu KC, Tym JE, Coker EA, Schierz AC, Al-Lazikani B. canSAR: updated cancer research and drug discovery knowledgebase. Nucleic acids research. 2014;42(Database issue):D1040-7.
63. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research. 2011;39(Database issue):D945-50.
64. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer discovery. 2012;2(5):401-4.
65. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research. 2014;42(Database issue):D980-5.

66. MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic acids research. 2014;42(Database issue):D986-92.
67. Tomczak K, Czerwinska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary oncology. 2015;19(1A):A68-77.
68. Thapar R. Structure-specific nucleic acid recognition by L-motifs and their diverse roles in expression and regulation of the genome. Biochimica et biophysica acta. 2015.
69. Van Roey K, Gibson TJ, Davey NE. Motif switches: decision-making in cell regulation. Current opinion in structural biology. 2012;22(3):378-85.
70. Bork P, Koonin EV. Protein sequence motifs. Current opinion in structural biology. 1996;6(3):366-76.
71. Sigrist CJ, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, et al. New and continuing developments at PROSITE. Nucleic acids research. 2013;41(Database issue):D344-7.
72. Pietrokovski S, Henikoff JG, Henikoff S. The Blocks database--a system for protein classification. Nucleic acids research. 1996;24(1):197-200.
73. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic acids research. 2014;42(D1):D222-D30.
74. Attwood TK, Beck ME, Bleasby AJ, Degtyarenko K, Parry Smith DJ. Progress with the PRINTS protein fingerprint database. Nucleic acids research. 1996;24(1):182-8.
75. Low SK, Takahashi A, Mushiroda T, Kubo M. Genome-wide association study: a useful tool to identify common genetic variants associated with drug toxicity and efficacy in cancer pharmacogenomics. Clinical cancer research : an official journal of the American Association for Cancer Research. 2014;20(10):2541-52.
76. International HapMap C, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467(7311):52-8.
77. Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, et al. The UCSC Genome Browser database: 2014 update. Nucleic acids research. 2013.
78. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, et al. GeneCards Version 3: the human gene integrator. Database. 2010;2010.
79. Consortium GT. The Genotype-Tissue Expression (GTEx) project. Nature genetics. 2013;45(6):580-5.
80. Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, et al. Patterns of cis regulatory variation in diverse human populations. PLoS genetics. 2012;8(4):e1002639.
81. Delgado AP, Brandao P, Narayanan R. Diabetes Associated Genes from the Dark Matter of the Human Proteome. MOJ Proteomics Bioinform. 2014 b;1(4):00020.

82. Narayanan R. Phenome-Genome Association Studies of Pancreatic Cancer: New Targets for Therapy and Diagnosis. Cancer genomics & proteomics. 2015;12(1):9-19.
83. Narayanan R. Neurodegenerative diseases: Phenome to genome analysis. MOJ Proteomics Bioinform. 2014;1(6):00033.
84. Narayanan R. Ebola-associated genes in the human genome: Implications for novel targets. MOJ Proteomics Bioinform. 2014;1(5):0032.
85. Dixon SJ, Stockwell BR. Identifying druggable disease-modifying gene products. Current opinion in chemical biology. 2009;13(5-6):549-55.
86. Orth AP, Batalov S, Perrone M, Chanda SK. The promise of genomics to identify novel therapeutic targets. Expert opinion on therapeutic targets. 2004;8(6):587-96.
87. Gonzalez-Angulo AM, Hortobagyi GN, Esteva FJ. Adjuvant therapy with trastuzumab for HER-2/neu-positive breast cancer. The oncologist. 2006;11(8):857-67.
88. Kang Y, Hodges A, Ong E, Roberts W, Piermarocchi C, Paternostro G. Identification of drug combinations containing imatinib for treatment of BCR-ABL+ leukemias. PloS one. 2014;9(7):e102221.
89. Agafonov RV, Wilson C, Otten R, Buosi V, Kern D. Energetic dissection of Gleevec's selectivity toward human tyrosine kinases. Nature structural & molecular biology. 2014;21(10):848-53.
90. Chen R, Chen B. The role of dasatinib in the management of chronic myeloid leukemia. Drug design, development and therapy. 2015;9:773-9.
91. Pilkington G, Boland A, Brown T, Oyee J, Bagust A, Dickson R. A systematic review of the clinical effectiveness of first-line chemotherapy for adult patients with locally advanced or metastatic non-small cell lung cancer. Thorax. 2015;70(4):359-67.
92. Beretov J, Wasinger VC, Graham PH, Millar EK, Kearsley JH, Li Y. Proteomics for breast cancer urine biomarkers. Advances in clinical chemistry. 2014;63:123-67.
93. Hochberg FH, Atai NA, Gonda D, Hughes MS, Mawejje B, Balaj L, et al. Glioma diagnostics and biomarkers: an ongoing challenge in the field of medicine and science. Expert review of molecular diagnostics. 2014;14(4):439-52.
94. Patel S, Ngounou Wetie AG, Darie CC, Clarkson BD. Cancer secretomes and their place in supplementing other hallmarks of cancer. Advances in experimental medicine and biology. 2014;806:409-42.
95. Inal JM, Kosgodage U, Azam S, Stratton D, Antwi-Baffour S, Lange S. Blood/plasma secretome and microvesicles. Biochimica et biophysica acta. 2013;1834(11):2317-25.
96. Paltridge JL, Belle L, Khew-Goodall Y. The secretome in cancer progression. Biochimica et biophysica acta. 2013;1834(11):2233-41.
97. Costa PM, Pedroso de Lima MC. MicroRNAs as Molecular Targets for Cancer Therapy: On the Modulation of MicroRNA Expression. Pharmaceuticals. 2013;6(10):1195-220.

98. Zarate R, Boni V, Bandres E, Garcia-Foncillas J. MiRNAs and LincRNAs: Could they be considered as biomarkers in colorectal cancer? International journal of molecular sciences. 2012;13(1):840-65.
99. Ferdin J, Kunej T, Calin GA. Non-coding RNAs: identification of cancer-associated microRNAs by gene profiling. Technology in cancer research & treatment. 2010;9(2):123-38.
100. Becker KG, Barnes KC, Bright TJ, Wang SA. The genetic association database. Nature genetics. 2004;36(5):431-2.
101. Zhang Y, De S, Garner JR, Smith K, Wang SA, Becker KG. Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC medical genomics. 2010;3:1.
102. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic acids research. 2014;42(Database issue):D749-55.
103. Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome biology. 2006;7 Suppl 1:S12 1-4.
104. International Cancer Genome C, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. International network of cancer genome projects. Nature. 2010;464(7291):993-8.
105. Kuntzer J, Maisel D, Lenhof HP, Klostermann S, Burtscher H. The Roche Cancer Genome Database 2.0. BMC medical genomics. 2011;4:43.
106. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, et al. Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia. 2007;9(2):166-80.
107. Liu F, White JA, Antonescu C, Gusenleitner D, Quackenbush J. GCOD - GeneChip Oncology Database. BMC bioinformatics. 2011;12:46.
108. UniProt C. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic acids research. 2013;41(Database issue):D43-7.
109. Nair R, Rost B. LOCnet and LOCtarget: sub-cellular localization for targets. Nucleic acids research. 2004;32(Web Server issue):W517-21.
110. Cong Q, Grishin NV. MESSA: MEta-Server for protein Sequence Analysis. BMC biology. 2012;10:82.
111. Zhang Y. I-TASSER server for protein 3D structure prediction. BMC bioinformatics. 2008;9:40.
112. Marchler-Bauer A, Zheng C, Chitsaz F, Derbyshire MK, Geer LY, Geer RC, et al. CDD: conserved domains and protein three-dimensional structure. Nucleic acids research. 2013;41(Database issue):D348-52.
113. Servant F, Bru C, Carrere S, Courcelle E, Gouzy J, Peyruc D, et al. ProDom: automated clustering of homologous domains. Briefings in bioinformatics. 2002;3(3):246-51.
114. Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nature methods. 2011;8(10):785-6.

115. Dinkel H, Van Roey K, Michael S, Davey NE, Weatheritt RJ, Born D, et al. The eukaryotic linear motif resource ELM: 10 years and counting. Nucleic acids research. 2014;42(Database issue):D259-66.
116. Kolker E, Higdon R, Haynes W, Welch D, Broomall W, Lancet D, et al. MOPED: Model Organism Protein Expression Database. Nucleic acids research. 2012;40(Database issue):D1093-9.
117. Mathivanan S, Ahmed M, Ahn NG, Alexandre H, Amanchy R, Andrews PC, et al. Human Proteinpedia enables sharing of human protein data. Nature biotechnology. 2008;26(2):164-7.
118. Maruyama Y, Kawamura Y, Nishikawa T, Isogai T, Nomura N, Goshima N. HGPD: Human Gene and Protein Database, 2012 update. Nucleic acids research. 2012;40(Database issue):D924-9.
119. Frezal J. Genatlas database, genes and development defects. Comptes rendus de l'Academie des sciences Serie III, Sciences de la vie. 1998;321(10):805-17.
120. Kupershmidt I, Su QJ, Grewal A, Sundaresh S, Halperin I, Flynn J, et al. Ontology-based meta-analysis of global collections of high-throughput public data. PloS one. 2010;5(9).
121. Rappaport N, Nativ N, Stelzer G, Twik M, Guan-Golan Y, Stein TI, et al. MalaCards: an integrated compendium for diseases and their annotation. Database : the journal of biological databases and curation. 2013;2013:bat018.
122. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics. 2000;25(1):25-9.
123. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, et al. The BioGRID Interaction Database: 2011 update. Nucleic acids research. 2011;39(Database issue):D698-704.
124. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, et al. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic acids research. 2013.
125. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocols. 2008;4(1):44-57.
126. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, et al. Database resources of the National Center for Biotechnology Information: update. Nucleic acids research. 2004;32(Database issue):D35-40.
127. Delgado A, Brandao P, Chapado M, Hamid S, Narayanan R. Open Reading Frames Associated with Cancer in the Dark Matter of the Human Genome. Cancer Genomics Proteomics. 2014 c;11(4):201-13.
128. Delgado AP, Hamid S, Brandao P, Narayanan R. A novel transmembrane glycoprotein cancer biomarker present in the X chromosome. Cancer genomics & proteomics. 2014 a;11(2):81-92.
129. Delgado AP, Brandao P, Hamid S, Narayanan R. Mining the Dark Matter of the Cancer Proteome for novel biomarkers. Current Cancer Therapy Reviews. 2013;9(4):265-77.

130. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic acids research. 2014;42(Database issue):D199-205.
131. Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, et al. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic acids research. 2014;42(Database issue):D358-63.
132. Xu C, Chen H, Wang X, Gao J, Che Y, Li Y, et al. S100A14, a member of EF-hand Calcium-Binding Proteins, is overexpressed in breast cancer and acts as a modulator of HER2 signaling. The Journal of biological chemistry. 2013.
133. Wolf S, Haase-Kohn C, Pietzsch J. S100A2 in cancerogenesis: a friend or a foe? Amino acids. 2011;41(4):849-61.
134. Lewit-Bentley A, Rety S. EF-hand calcium-binding proteins. Current opinion in structural biology. 2000;10(6):637-43.
135. Sherry ST, Ward M, Sirotkin K. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research. 1999;9(8):677-9.
136. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314(5797):268-74.
137. Delgado AP, Brandao P, Hamid S, Van De Ven W, Narayanan R. Discovery of a novel carcinoma-associated EF hand containing protein by mining the dark matter of the human proteome. Cancer research. 2014(74):4189.
138. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(24):13790-5.
139. Chizhikov V, Zborovskaya I, Laktionov K, Delektorskaya V, Polotskii B, Tatosyan A, et al. Two consistently deleted regions within chromosome 1p32-pter in human non-small cell lung cancer. Molecular carcinogenesis. 2001;30(3):151-8.
140. Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, et al. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic acids research. 2015;43(Database issue):D914-20.
141. Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic acids research. 2009;37(Database issue):D98-104.
142. Cheng SL, Liu RH, Sheu JN, Chen ST, Sinchaikul S, Tsay GJ. Toxicogenomics of A375 human malignant melanoma cells treated with arbutin. Journal of biomedical science. 2007;14(1):87-105.
143. Sproul D, Nestor C, Culley J, Dickson JH, Dixon JM, Harrison DJ, et al. Transcriptionally repressed genes become aberrantly methylated and distinguish tumors of different lineages in breast cancer. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(11):4364-9.

144. Bremmer F, Schweyer S, Martin-Ortega M, Hemmerlin B, Strauss A, Radzun HJ, et al. Switch of cadherin expression as a diagnostic tool for Leydig cell tumours. APMIS : acta pathologica, microbiologica, et immunologica Scandinavica. 2013;121(10):976-81.
145. Bhati R, Gokmen-Polar Y, Sledge GW, Jr., Fan C, Nakshatri H, Ketelsen D, et al. 2-methoxyestradiol inhibits the anaphase-promoting complex and protein translation in human breast cancer cells. Cancer research. 2007;67(2):702-8.
146. Zaehres H, Kogler G, Arauzo-Bravo MJ, Bleidissel M, Santourlidis S, Weinhold S, et al. Induction of pluripotency in human cord blood unrestricted somatic stem cells. Experimental hematology. 2010;38(9):809-18, 18 e1-2.
147. Johnson MM, Michelhaugh SK, Bouhamdan M, Schmidt CJ, Bannon MJ. The Transcription Factor NURR1 Exerts Concentration-Dependent Effects on Target Genes Mediating Distinct Biological Processes. Frontiers in neuroscience. 2011;5:135.
148. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006;439(7074):353-7.
149. Klemm L, Duy C, Iacobucci I, Kuchen S, von Levetzow G, Feldhahn N, et al. The B cell mutator AID promotes B lymphoid blast crisis and drug resistance in chronic myeloid leukemia. Cancer cell. 2009;16(3):232-45.
150. Okayama H, Kohno T, Ishii Y, Shimada Y, Shiraishi K, Iwakawa R, et al. Identification of genes upregulated in ALK-positive and EGFR/KRAS/ALK-negative lung adenocarcinomas. Cancer research. 2012;72(1):100-11.
151. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005;120(1):15-20.
152. Roy A, Yang J, Zhang Y. COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic acids research. 2012;40(Web Server issue):W471-7.
153. Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, et al. The BioGRID interaction database: 2013 update. Nucleic acids research. 2013;41(Database issue):D816-23.
154. Razick S, Magklaras G, Donaldson IM. iRefIndex: a consolidated protein interaction database with provenance. BMC bioinformatics. 2008;9:405.
155. Takagi K, Ito S, Miyazaki T, Miki Y, Shibahara Y, Ishida T, et al. Amyloid precursor protein in human breast cancer: An androgen-induced gene associated with cell proliferation. Cancer science. 2013.
156. Braunewell KH. The darker side of Ca2+ signaling by neuronal Ca2+-sensor proteins: from Alzheimer's disease to cancer. Trends in pharmacological sciences. 2005;26(7):345-51.
157. Delgado AP, Brandao P, Chapado MJ, Hamid S, Narayanan R. Open reading frames associated with cancer in the dark matter of the human genome. Cancer genomics & proteomics. 2014;11(4):201-13.
158. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic acids research. 2012;40(Database issue):D306-12.

159. Brunschweiger A, Hall J. A decade of the human genome sequence--how does the medicinal chemist benefit? ChemMedChem. 2012;7(2):194-203.
160. Desany B, Zhang Z. Bioinformatics and cancer target discovery. Drug discovery today. 2004;9(18):795-802.
161. Narayanan R, Van De Ven W. Transcriptome and Proteome Analysis: A Perspective on Correlation. MOJ Proteomics Bioinform. 2014;1(5):00027.
162. Zhang B, Wang J, Wang X, Zhu J, Liu Q, Shi Z, et al. Proteogenomic characterization of human colon and rectal cancer. Nature. 2014.
163. Gry M, Rimini R, Stromberg S, Asplund A, Ponten F, Uhlen M, et al. Correlations between RNA and protein expression profiles in 23 human cell lines. BMC genomics. 2009;10:365.
164. Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nature reviews Genetics. 2012;13(4):227-32.
165. Salama I, Malone PS, Mihaimeed F, Jones JL. A review of the S100 proteins in cancer. European journal of surgical oncology : the journal of the European Society of Surgical Oncology and the British Association of Surgical Oncology. 2008;34(4):357-64.
166. Heizmann CW, Ackermann GE, Galichet A. Pathologies involving the S100 proteins and RAGE. Sub-cellular biochemistry. 2007;45:93-138.
167. Gibadulinova A, Tothova V, Pastorek J, Pastorekova S. Transcriptional regulation and functional implication of S100P in cancer. Amino acids. 2011;41(4):885-92.
168. Zheng CJ, Han LY, Yap CW, Ji ZL, Cao ZW, Chen YZ. Therapeutic targets: progress of their exploration and investigation of their characteristics. Pharmacological reviews. 2006;58(2):259-79.
169. Tsai WC, Lin YC, Tsai ST, Shen WH, Chao TL, Lee SL, et al. Lack of modulatory function of coding nucleotide polymorphism S100A2_185G>A in oral squamous cell carcinoma. Oral diseases. 2011;17(3):283-90.
170. Tothova V, Gibadulinova A. S100P, a peculiar member of S100 family of calcium-binding proteins implicated in cancer. Acta virologica. 2013;57(2):238-46.
171. Bresnick AR, Weber DJ, Zimmer DB. S100 proteins in cancer. Nature reviews Cancer. 2015;15(2):96-109.
172. Hwang SK, Piao L, Lim HT, Minai-Tehrani A, Yu KN, Ha YC, et al. Suppression of lung tumorigenesis by zipper/EF hand-containing transmembrane-1. PloS one. 2010;5(9).
173. Garnis C, Campbell J, Davies JJ, Macaulay C, Lam S, Lam WL. Involvement of multiple developmental genes on chromosome 1p in lung tumorigenesis. Human molecular genetics. 2005;14(4):475-82.
174. Opocher G, Schiavi F, Vettori A, Pampinella F, Vitiello L, Calderan A, et al. Fine analysis of the short arm of chromosome 1 in sporadic and familial pheochromocytoma. Clinical endocrinology. 2003;59(6):707-15.
175. Tolaney SM, Barry WT, Dang CT, Yardley DA, Moy B, Marcom PK, et al. Adjuvant paclitaxel and trastuzumab for node-negative, HER2-positive breast cancer. The New England journal of medicine. 2015;372(2):134-41.

176. Sunaga N, Shames DS, Girard L, Peyton M, Larsen JE, Imai H, et al. Knockdown of oncogenic KRAS in non-small cell lung cancers suppresses tumor growth and sensitizes tumor cells to targeted therapy. Molecular cancer therapeutics. 2011;10(2):336-46.
177. Reungwetwattana T, Dy GK. Targeted therapies in development for non-small cell lung cancer. Journal of carcinogenesis. 2013;12:22.
178. Haase-Kohn C, Wolf S, Herwig N, Mosch B, Pietzsch J. Metastatic potential of B16-F10 melanoma cells is enhanced by extracellular S100A4 derived from RAW264.7 macrophages. Biochemical and biophysical research communications. 2014;446(1):143-8.
179. Biswal BK, Beyrouthy MJ, Hever-Jardine MP, Armstrong D, Tomlinson CR, Christensen BC, et al. Acute hypersensitivity of pluripotent testicular cancer-derived embryonal carcinoma to low-dose 5-aza deoxycytidine is associated with global DNA Damage-associated p53 activation, anti-pluripotency and DNA demethylation. PloS one. 2012;7(12):e53003.
180. Wood CE, Kaplan JR, Fontenot MB, Williams JK, Cline JM. Endometrial profile of tamoxifen and low-dose estradiol combination therapy. Clinical cancer research : an official journal of the American Association for Cancer Research. 2010;16(3):946-56.
181. Wang Y, Bu F, Royer C, Serres S, Larkin JR, Soto MS, et al. ASPP2 controls epithelial plasticity and inhibits metastasis through beta-catenin-dependent regulation of ZEB1. Nature cell biology. 2014;16(11):1092-104.
182. Nurul-Syakima AM, Yoke-Kqueen C, Sabariah AR, Shiran MS, Singh A, Learn-Han L. Differential microRNA expression and identification of putative miRNA targets and pathways in head and neck cancers. International journal of molecular medicine. 2011;28(3):327-36.
183. Fassina A, Cappellesso R, Simonato F, Siri M, Ventura L, Tosato F, et al. A 4-MicroRNA signature can discriminate primary lymphomas from anaplastic carcinomas in thyroid cytology smears. Cancer cytopathology. 2014;122(4):274-81.
184. Xu Z, Zeng X, Tian D, Xu H, Cai Q, Wang J, et al. MicroRNA-383 inhibits anchorage-independent growth and induces cell cycle arrest of glioma cells by targeting CCND1. Biochemical and biophysical research communications. 2014;453(4):833-8.
185. Wang W, Luo YP. MicroRNAs in breast cancer: oncogene and tumor suppressors with clinical potential. Journal of Zhejiang University Science B. 2015;16(1):18-31.
186. Chien HY, Lee TP, Chen CY, Chiu YH, Lin YC, Lee LS, et al. Circulating microRNA as a diagnostic marker in populations with type 2 diabetes mellitus and diabetic complications. Journal of the Chinese Medical Association : JCMA. 2014.
187. Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, et al. Database resources of the National Center for Biotechnology. Nucleic acids research. 2003;31(1):28-33.

188. Larsson M, Stahl S, Uhlen M, Wennborg A. Expression profile viewer (ExProView): a software tool for transcriptome analysis. Genomics. 2000;63(3):341-53.
189. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, et al. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic acids research. 2005;33(Database issue):D553-5.
190. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods. 2008;5(7):621-8.
191. Greenbaum D, Colangelo C, Williams K, Gerstein M. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome biology. 2003;4(9):117.
192. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human Protein Reference Database--2009 update. Nucleic acids research. 2009;37(Database issue):D767-72.
193. Deribe YL, Pawson T, Dikic I. Post-translational modifications in signal integration. Nature structural & molecular biology. 2010;17(6):666-72.
194. Gedeon T, Bokes P. Delayed protein synthesis reduces the correlation between mRNA and protein fluctuations. Biophysical journal. 2012;103(3):377-85.
195. Delgado AP, Chapado MJ, Brandao P, Hamid S, Narayanan R. Atlas of the Open reading Frames in human diseases: Dark matter of the human genome. MOJ Proteomics Bioinform. 2015;2(1):00036.
196. Miklos I, Novak A, Dombai B, Hein J. How reliably can we predict the reliability of protein structure predictions? BMC bioinformatics. 2008;9:137.
197. Carugo O. Detailed estimation of bioinformatics prediction reliability through the Fragmented Prediction Performance Plots. BMC bioinformatics. 2007;8:380.
198. Rajaram S. A novel meta-analysis method exploiting consistency of high-throughput experiments. Bioinformatics. 2009;25(5):636-42.
199. Wu K, Rao CV. Computational methods in synthetic biology: towards computer-aided part design. Current opinion in chemical biology. 2012;16(3-4):318-22.
200. Subramanian L, Polans AS. Cancer-related diseases of the eye: the role of calcium and calcium-binding proteins. Biochemical and biophysical research communications. 2004;322(4):1153-65.
201. Ikura M, Osawa M, Ames JB. The role of calcium-binding proteins in the control of transcription: structure to function. BioEssays : news and reviews in molecular, cellular and developmental biology. 2002;24(7):625-36.
202. Yanez M, Gil-Longo J, Campos-Toimil M. Calcium binding proteins. Advances in experimental medicine and biology. 2012;740:461-82.
203. Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic acids research. 2014;42(Database issue):D1091-7.

204. Leclerc E, Heizmann CW. The importance of Ca2+/Zn2+ signaling S100 proteins and RAGE in translational medicine. Frontiers in bioscience. 2011;3:1232-62.
205. Narayanan R. Druggableness of the Ebola Associated Genes in the Human Genome: Chemoinformatics Approach. MOJ Proteomics Bioinform. 2015;2(2):00038-44.
206. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, et al. The ChEMBL bioactivity database: an update. Nucleic acids research. 2014;42(Database issue):D1083-90.
207. Li L, Chen BF, Chan WY. An Epigenetic Regulator: Methyl-CpG-Binding Domain Protein 1 (MBD1). International journal of molecular sciences. 2015;16(3):5125-40.
208. Ng JM, Yu J. Promoter Hypermethylation of Tumour Suppressor Genes as Potential Biomarkers in Colorectal Cancer. International journal of molecular sciences. 2015;16(2):2472-96.
209. Eads CA, Danenberg KD, Kawakami K, Saltz LB, Blake C, Shibata D, et al. MethyLight: a high-throughput assay to measure DNA methylation. Nucleic acids research. 2000;28(8):E32.
210. van Eijk KR, de Jong S, Boks MP, Langeveld T, Colas F, Veldink JH, et al. Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC genomics. 2012;13:636.
211. Peedicayil J. The role of DNA methylation in the pathogenesis and treatment of cancer. Current clinical pharmacology. 2012;7(4):333-40.
212. Momparler RL. Epigenetic therapy of cancer with 5-aza-2'-deoxycytidine (decitabine). Seminars in oncology. 2005;32(5):443-51.
213. Lapunzina P, Monk D. The consequences of uniparental disomy and copy number neutral loss-of-heterozygosity during human development and cancer. Biology of the cell / under the auspices of the European Cell Biology Organization. 2011;103(7):303-17.
214. Paulsson K. Genomic heterogeneity in acute leukemia. Cytogenetic and genome research. 2013;139(3):174-80.
215. Ostrovnaya I. Testing clonality of three and more tumors using their loss of heterozygosity profiles. Statistical applications in genetics and molecular biology. 2012;11(4).
216. Imbalzano KM, Tatarkova I, Imbalzano AN, Nickerson JA. Increasingly transformed MCF-10A cells have a progressively tumor-like phenotype in three-dimensional basement membrane culture. Cancer cell international. 2009;9:7.
217. Narayanan R, Lawlor KG, Schaapveld RQ, Cho KR, Vogelstein B, Bui-Vinh Tran P, et al. Antisense RNA to the putative tumor-suppressor gene DCC transforms Rat-1 fibroblasts. Oncogene. 1992;7(3):553-61.
218. Kinnula VL, Yankaskas JR, Chang L, Virtanen I, Linnala A, Kang BH, et al. Primary and immortalized (BEAS 2B) human bronchial epithelial cells have significant antioxidative capacity in vitro. American journal of respiratory cell and molecular biology. 1994;11(5):568-76.
