SYSTEMS BIOLOGY OF HUMAN COLORECTAL CANCER

by

ROD K. NIBBE

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation advisor: Dr. Mark R. Chance

Department of Pharmacology

CASE WESTERN RESERVE UNIVERSITY

May, 2010

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

______Rod K. Nibbe______

candidate for the _Ph.D.______degree *.

(signed)___Charles L. Hoppel, M.D. (chair of the committee)

____Noa Noy, Ph.D.______

_____John Letterio, M.D.______

____Mark R. Chance, Ph.D.______

______

______

(date) ___November 23rd, 2009______

*We also certify that written approval has been obtained for any proprietary material contained therein.


To my wife, for her never-ending patience, support and love. And to my parents,

family, friends and colleagues – I told you it’s never too late!


Table of contents

TITLE PAGE
SIGNATURE PAGE
DEDICATION
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT

CHAPTER I - Approaches to biomarkers in human colorectal cancer: looking back, to go forward.
1. EPIDEMIOLOGY AND GENETICS OF CRC
2. BIOMARKERS: OLD AND NEW
3. GENOMIC MARKERS IN CRC
    MUTATION
    TRANSCRIPTOME CHANGES
    SINGLE NUCLEOTIDE POLYMORPHISMS (SNPs) AND COPY NUMBER VARIATION (CNV)
    EPIGENETIC STATUS
    microRNA – THE NEW –OME
    PREDICTIVE MARKERS IN CRC THERAPY – PHARMACOGENOMICS
    LIMITATIONS AND CHALLENGES
4. TRADITIONAL PROTEOMIC APPROACHES TO MARKERS
    MARKERS FROM BIOFLUIDS
    PROTEIN MARKERS FROM TISSUE BIOPSY
    PROTEIN MARKERS FROM CELL CULTURE
5. INTERACTOME
6. NETWORK-BASED MARKERS IN CRC
7. FIGURES

CHAPTER II - Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer.
ABSTRACT
INTRODUCTION
EXPERIMENTAL PROCEDURES
RESULTS & DISCUSSION
FIGURES
TABLES
SUPPLEMENTAL DATA

CHAPTER III - An Integrative -omics Approach to Identify Functional Sub-networks in Human Colorectal Cancer.
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
RESULTS
DISCUSSION
FIGURES
TABLES

CHAPTER IV – Summary and future directions
OVERVIEW
FUTURE BIOLOGICAL DIRECTION: IN VIVO SUBNETWORK VALIDATION
FUTURE COMPUTATIONAL DIRECTION: CLASSIFICATION
FIGURES

APPENDIX

LIST OF TABLES

Table 2.1 – List of unique proteins (18) from a total of 20 identified by experiment one, pI 3-10
Table 2.2 – List of unique proteins from a total of 67 identified by experiment two, pI = 4-7
Table 2.3 – The mutual information (MI) scores for each signature
Table 3.1 – A glossary of terms frequently used in this paper.


LIST OF FIGURES

Fig. 1.1 – Stage-wise progression of colorectal cancer
Fig. 1.2 – Proteomic profiling: 2D-differential gel electrophoresis
Fig. 1.3 – Relevant colorectal cancer subnetworks and pathways
Fig. 1.4 – Proposed bioinformatic flow
Fig. 2.1 – Experimental design
Fig. 2.2 – Representative gel from experiment 1 (5144)
Fig. 2.3 – Representative gel from experiment 2
Fig. 2.4 – MetaCore subnetwork
Fig. 2.5 – Flow chart showing steps required to compute MI
Fig. 2.6 – Estimated null distributions (probability density function) for hypotheses H1 and H2
Fig. 2.7 – Expanded subnetworks from corresponding signatures
Fig. 2.8 – Relative expression change of signature proteins and mRNA in subnetwork 1, tumor versus normal, for three patients (507, 534, and 540)
Fig. 2.S1 – Clinical cohort
Fig. 2.S2 – Distribution of decoy over the
Fig. 2.S3 – Additional three MetaCore protein interaction sub-networks returned by a search seeded by significant proteomic targets
Fig. 3.1 – Schematic of an integrated, proteomics-first approach for the discovery of functional, candidate sub-networks in a disease phenotype
Fig. 3.2 – Crosstalkers are not significant at level of individual mRNA expression
Fig. 3.3 – Synergistic dysregulation versus network size for candidate sub-networks associated with proteomic seeds obtained from Nibbe et al.
Fig. 3.4 – Synergistic dysregulation versus network size for candidate sub-networks associated with proteomic seeds obtained from Friedman et al.
Fig. 3.5 – Significant sub-networks induced by proteomic seeds.
Fig. 3.6 – Validation of select targets predicted to be dysregulated in TCP1 sub-network.
Fig. 3.7 – Synergistic dysregulation versus network size for candidate sub-networks associated with the CRC driver gene seeds obtained from Sjöblom et al.
Fig. 3.8 – Cross-validation performance comparison of sub-network based classifiers.
Fig. 4.1 – Network classification performance.


Acknowledgements

I want to thank the Biomedical Sciences Training Program for the

invitation to come to Case Western, and providing me the academic basis which

proved to be crucial to my ability to conduct this research. I also want to thank the Department of Pharmacology for all their support, financial and academic,

both of which helped me to mature as a scientist. None of this research would

have been possible without the patient support of many people in the Case

Center for Proteomics and Bioinformatics. In particular, Elizabeth Yohannes for

teaching me all aspects of the 2D-gel method; Giri Gokulrangan, Jana Kiselar,

Dani Schlatzer and Dasa Leary for guidance with all aspects of mass

spectrometry; Katy Lundberg and Jennifer Burgoyne for help with the MALDI

mass spectrometer and protein digestion; Jim Crish and Chao Yuan for help with

Western blots; Shannon Shiatowski, Maita Diaz and especially Joan Schenkel for

all things administrative; Rob Ewing, Vishal Patel and Gurkan Bebek for

discussions and feedback on bioinformatics; and finally my mentor Mark Chance,

for his generous support, critical guidance, and confidence in me, especially when I needed them the most. I’d like to thank Dr. Mehmet Koyutürk from the

Case Electrical Engineering and Computer Science department for his guidance and collaborative efforts toward our joint publications. I’d also like to thank Dr.

Sanford Markowitz and members of his lab for providing us clinical tissue samples, colon cancer microarray data, relevant cell lines, technical assistance, and manuscript reviews. I am indebted to my committee members for their thoughtful and critical contributions at my committee meetings, and for their help

in keeping me on track to complete this work. Most of all I want to thank my wife,

Nancy Nibbe, who did all the heavy lifting at home, as well as at work, so I could focus on my education and research. Without her in my life this would not have been possible.


Systems Biology of Human Colorectal Cancer

Abstract

by

ROD K. NIBBE

Like all human cancers, colorectal cancer (CRC) is a complicated disease. While a mature body of research involving CRC has implicated the putative sequence of genetic alterations that trigger the disease and sustain its progression, there is a surprising paucity of well-validated, clinically useful diagnostic markers of this disease. For prognosis or guiding therapy, single gene-based markers of CRC often have limited specificity and sensitivity. Genome-wide analyses (microarray) have been used to propose candidate patterns of gene expression that are prognostic of outcome or predict the tumor’s response to a therapy regimen; however, these patterns frequently do not overlap, and this has raised questions concerning their power as biomarkers. The limitation of gene expression approaches to marker discovery occurs because the change in mRNA expression across tumors is highly variable and alone accounts for a limited portion of the variability of the phenotype, e.g. cancer. It is largely unknown how the integration of proteomic data and genomic data, along with protein-protein interaction data, may enhance the discovery of more quantitatively powerful biomarkers. In this work we show that a proteomics-first approach can discover significantly


differentially expressed proteins between cancer and control tissues. In turn,

these targets may be integrated with mRNA and protein-protein interaction data

to discover networks of proteins that are quantitatively significant discriminators of cancer versus control. Further, we show that our bioinformatic methods are extensible and robust with respect to publicly available proteomic data and public

PPI datasets. Further, a proteomics-first approach for finding significant sub-networks in CRC is comparable to the same approach seeded instead with a set of genes implicated as “drivers” of CRC. Finally, because these network discriminators exist at the level of the proteome, they provide an optimal basis for mechanistic validation in in vitro disease models, such as cell culture. It is thought that network-based approaches may provide improved diagnostic, prognostic, or predictive markers in CRC, and lead to improvements in molecularly targeted therapies.


Chapter I – Approaches to biomarkers in human colorectal cancer: looking

back, to go forward.

Much of this chapter was published in Biomarkers In Medicine (2009)

Nibbe RK, Chance MR

1. Epidemiology and genetics of CRC

Colorectal cancer (CRC) is the second leading cause of cancer death in the

United States and the United Kingdom [1]. The disease can be classified into two groups, inherited and sporadic. The former is an inherited predisposition to CRC that may be broadly classified into two forms, familial adenomatous polyposis coli

(FAP) and hereditary nonpolyposis colon cancer (HNPCC), although other syndromes are known. FAP is an autosomal dominant disease that results from a germline mutation in the APC gene, often an N-terminal truncation of the APC protein [2], which inevitably results, before age 50, in the development of hundreds or more polyps on the colonic wall, one or more of which inevitably will

progress to an established cancer. Often, however, a somatic mutation of the

other APC allele is associated with adenoma formation [3] and considered to be

the determining event initiating CRC [4]. HNPCC arises from a germline mutation in one or more of the DNA mismatch repair (MMR) genes, commonly hMLH1 or hMSH2, and is characterized by microsatellite instability [5]. Approximately 80% of people with one

or more of these mutations will develop CRC, usually by age 45 [101]. By

contrast, sporadic CRC, which affects approximately 5-6% of the American

population [102], is a progressive disease arising from the accumulation of somatic mutations in colonic epithelial cells. The mutations and epigenetic alterations commonly implicated in the progression are shown in Figure 1, overlaid on the stage (0-IV) of the cancer, during formation of an early adenoma

(0), to invasion of the mucosa (I), followed by increasing angiogenesis and lymph node involvement (II-III), and finally a breach of the colonic wall and metastasis

(IV).

2. Biomarkers: old and new

In a clinical sense, and depending on the context of its application, the optimum biomarker of human cancer may have one or more properties. It would be a single molecule whose level of expression or activity would mark the onset of cancer (diagnostic), or indicate the specific treatment regimen for a patient with established cancer (predictive), or indicate the fate of the cancer, e.g., good versus poor outcome (prognostic) [6]. The biomarker would be easily assayable by a single test, present in both the disease state and the normal state, and capable of both high specificity and sensitivity. Ideally, it would be readily detectable in body fluid (e.g. serum or urine). Unfortunately, human cancer is a complicated disease [7]. At the molecular level it is quite unlikely we will find a single optimum marker of CRC, or any cancer, in the traditional sense.

Unquestionably, researchers need to take advantage of the tremendous insights resulting from decades of research as well as clinical and pre-clinical trials in


human cancer (legacy data), but at the same time we need to understand the limitations of any one approach, and guard against bias in favoring the results of one approach over others. The unprecedented volume of data coming from high-throughput experiments in genomics and proteomics is rapidly advancing our understanding of cancer. However, these results need to be placed in the context of specific molecular functions. A more complete understanding of the functional implications of differentially expressed genes or proteins in cancer promises to deliver improved biological markers of the disease [8], markers that are more

sensitive to the known heterogeneity of patient tumors, and offer improved

specificity and sensitivity in classification compared to existing approaches that

do not account for function. In turn, these markers will provide an improved focus

for follow-on experiments to verify mechanism, which in turn will help inform the

development of novel drug targets.

3. Genomic markers in CRC

There are a variety of indicators evident in nuclear DNA or its transcribed

product, mRNA, which may be said to provide biomarkers of cancer. These

markers may have a physiologic role in initiating the cancer, or regulating its

progression or in mediating its response to drug treatment. CRC patients may

have their tumors genotyped for one or more of these mutations, or epigenetic

modifications, either or both of which may be useful as prognostic or predictive

markers. As indicated above, the clinical utility of these markers is presently


limited, nevertheless many alterations at the genomic level do have an important

role in the disease, and for the purpose of review these merit discussion.

Gene mutation

It is widely agreed that sporadic CRC is caused by the accumulation of somatic gene mutations evident in colonic epithelial cells [9]. A landmark study

recently published in the journal Science revealed the protein coding genes most

frequently mutated in breast and colorectal cancer (candidate driver genes),

obtained by a genome-wide screen on a cohort of human tumor biopsies [10]. Not surprisingly APC (adenomatosis polyposis coli), TP53 (tumor protein 53), SMAD4

(mothers against decapentaplegic homolog 4) and KRAS (Kirsten rat sarcoma viral oncogene homolog) were among the sixty-nine driver genes identified in CRC, which mapped to no fewer than seven distinct gene ontological (GO) processes. APC is the “gatekeeper” gene in CRC and was found mutated in greater than 90% of the thirty-five tumors used in the discovery and validation screens. Likewise, TP53 and KRAS were found mutated in 51% and 44% of tumors, respectively. Three isoforms (2, 3, & 4) of the SMAD tumor suppressor gene were mutated in >5% of the tumors. While the results with respect to these four genes confirmed their known role in CRC, except for APC up to half of the tumors did not contain mutations in one or more of these genes. Further, the authors noted that in no case did any single cancer specimen have more than six candidate driver genes in common with another sample; overall each specimen had its own “signature”


pattern of somatic mutation. This observation underscores the wide variability of expression patterns in individual tumors, and the limitation of markers in CRC which are based exclusively on changes in the transcriptome. Indeed, to date it has been suggested [11] that only certain mutations in KRAS are clinically useful as predictive markers for estimating the success of certain chemotherapy treatments in CRC, which we discuss later in this review.

Transcriptome changes

Evidence of the ability to quantify genome-wide expression of mRNA by microarray in cancer was reported over a dozen years ago [12]. Since then thousands of microarray experiments have been conducted with the goal of discovering gene patterns or “signatures” that change significantly between treated or diseased samples and control. Encouraged by a call for standards in reporting the results of microarray experiments, due to their inherent technical variability, a number of public databases were established where the raw data could be deposited along with the relevant annotations and details of sample preparation. Indeed, many journals now require authors who report the results of a microarray experiment to deposit these data in a public database as a condition of publication of their manuscript. One such database is the Gene Expression

Omnibus (GEO) hosted at the NCBI web site [13]. A recent search of this database with the keyword “cancer” returned over 2600 experiments. Refining the search with respect to CRC, 246 experiments were returned, 203 of which


had been conducted on human tissue or derived cell lines. Many of these gene expression profiles have been mined to find signatures that characterize the early stages of CRC tumorigenesis [14], regulate its progression [15], or predict the tumor’s response to a particular therapy [16]. Additionally, the high dimensional nature of these data has proved to be a rich substrate for increasingly sophisticated bioinformatic methods that attempt to overcome the problem that obtains when the number of predictor variables (genes) greatly exceeds the number of samples [17]. Despite these advances, however, evidence from studies in other human cancers counsels caution with respect to gene expression signatures of CRC. For instance, the evaluation of candidate signatures from two landmark studies of breast cancer metastasis [18,19] revealed strikingly little overlap at the level of individual genes, although a number of the canonical pathways to which the genes resolved were shared. While technical variation may explain some of the variability, these observations otherwise suggest the way forward in marker discovery is an integrative –omics approach, one that leverages all the relevant information we have regarding the disease, not merely changes in the transcriptome.
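The contrast between poor agreement at the gene level and better agreement at the pathway level can be made concrete with a simple set-overlap measure. The Python sketch below computes the Jaccard index for two signatures, first on gene symbols and then on the pathways those genes map to. The gene symbols, pathway labels, and the function name are hypothetical placeholders chosen for illustration; they are not taken from the studies cited above.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections, or 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical signatures (illustrative only, not real study results).
signature_1 = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
signature_2 = {"GENE_D", "GENE_E", "GENE_F", "GENE_G"}

# Hypothetical gene-to-pathway annotation.
pathway_of = {
    "GENE_A": "cell cycle", "GENE_B": "apoptosis", "GENE_C": "cell cycle",
    "GENE_D": "adhesion", "GENE_E": "cell cycle", "GENE_F": "apoptosis",
    "GENE_G": "adhesion",
}

# Overlap measured on the genes themselves.
gene_overlap = jaccard(signature_1, signature_2)

# Overlap measured on the pathways the genes resolve to.
pathway_overlap = jaccard(
    {pathway_of[g] for g in signature_1},
    {pathway_of[g] for g in signature_2},
)

print(f"gene-level Jaccard:    {gene_overlap:.2f}")
print(f"pathway-level Jaccard: {pathway_overlap:.2f}")
```

In this toy case the two signatures share only one gene yet resolve to the same set of pathways, which is the pattern the breast cancer comparison is said to show.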

Single nucleotide polymorphisms (SNPs) and copy number variation (CNV)

SNPs are alterations of a single nucleotide that occur with an allelic frequency of greater than one percent in members of a species. They are frequently the basis for genome-wide association studies (GWAS) that target susceptibility markers for cancer. These alterations may occur within or outside


the protein coding region of the gene. CNVs are chromosomal aberrations that result from the loss of a gene, its duplication or translocation causing an aberrant

number of transcripts to be produced in the cell. As with mutations, genome-wide

sequencing may target certain SNPs and CNVs known to predispose an

individual to certain cancers, including CRC. For a recent review of the relevance of SNPs and CNVs in CRC, and their relationship to gene expression and chromosomal aberrations, see Tsafrir et al. [20].
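The one-percent allele-frequency threshold in the definition of a SNP can be checked directly from genotype counts. The Python sketch below assumes a biallelic site in a diploid population; the function name and the genotype counts are assumptions made for the example, not data from any study cited here.

```python
def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Frequency of the less common allele at a biallelic site.

    Each individual carries two alleles, so the alternate-allele count is
    n_het + 2 * n_alt_hom out of 2 * (total individuals).
    """
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    alt_freq = (n_het + 2 * n_alt_hom) / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


# Hypothetical genotype counts for 1,000 individuals (illustrative only).
maf = minor_allele_frequency(n_ref_hom=950, n_het=48, n_alt_hom=2)

# The variant qualifies as a polymorphism if the frequency exceeds 1%.
is_polymorphism = maf > 0.01
print(f"MAF = {maf:.3f}, polymorphism: {is_polymorphism}")
```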

Epigenetic status

Gene expression in eukaryotes may be regulated by chromatin

remodeling, through post-translational modification of histone proteins, or by

direct modification of a nucleotide, e.g. methylation of cytosine which acts to

silence gene transcription. The vimentin gene (VIM) codes for a type III intermediate filament

protein highly abundant in many tissues, and is transcriptionally silent in both

normal and tumor colon epithelial tissue. However, the methylation status of this

gene detected in colonic cells isolated from the stool of 94 CRC patients

predicted the presence of a tumor with 46% sensitivity. In 198 cancer-free

controls it achieved a specificity of 90%. Notably, the same test was 43%

sensitive at detecting early cancer, Dukes’ stage I and II [21]. Screening for this

marker is often recommended in addition to the fecal occult blood (FOB) test.

Recent improvements in the test have raised its prediction rate to 77%, and it has


gained the recommendation of the American Cancer Society as a screening tool

for CRC.
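The sensitivity and specificity figures quoted for the VIM methylation test follow from a two-by-two table of test results. The Python sketch below back-calculates approximate counts from the reported percentages (46% of 94 patients, 90% of 198 controls). These counts are an assumption made for illustration and are not taken from the cited study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased individuals that the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of disease-free individuals that the test flags as negative."""
    return true_neg / (true_neg + false_pos)


# Assumed counts, back-calculated from the percentages in the text.
tp, fn = 43, 51    # 94 CRC patients: 43 test positive, 51 are missed
tn, fp = 178, 20   # 198 cancer-free controls: 178 test negative, 20 false alarms

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A test with this profile rarely raises a false alarm but misses roughly half of the tumors, which is consistent with its use alongside, rather than instead of, other screening methods.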

microRNA – the new –ome

Not all RNAs code for protein. microRNAs (miRNA) are small (<22

nucleotides) RNAs known to regulate a variety of cellular processes [22],

especially translation and mRNA stability. This has launched a bioinformatic hunt

for the precursors in the genome, and informed a variety of experiments

evaluating their role in disease, including CRC [23]. It’s too early to say whether

the expression of these molecules alone will become useful markers of disease.

More likely, as with protein coding genes, their differential expression in cancer

will be one contributor to an integrated molecular phenotype, rather than being diagnostic by itself.

Predictive markers in CRC therapy – pharmacogenomics

CRC is commonly treated with fluoropyrimidines such as 5-Fluorouracil (5-FU) or Capecitabine, platinum based drugs such as Oxaliplatin, Topoisomerase I

(TOP-1) inhibitors, e.g. Irinotecan, or, particularly in late stage CRC, drugs that inhibit the epidermal growth factor receptor (EGFR), such as Erbitux. Recently, inhibitors of two isoforms of the vascular endothelial growth factor (VEGF) receptor have shown efficacy in CRC as well. Certain mutations or

polymorphisms of the enzymes involved in the metabolism of these drugs have been investigated as biomarkers for predicting the response to the drug, or the disease prognosis. For instance, Thymidylate synthase (TS) is the target enzyme of the active metabolite of 5-FU. Several studies have evaluated the expression level of this enzyme in patients administered 5-FU, but reached conflicting conclusions as to whether a high or low level confers a favorable response to 5-

FU, or improves prognosis. Genetic alterations in certain DNA excision repair genes may confer differential efficacy in patients treated with Oxaliplatin. A particular polymorphism in the protein involved in the glucuronidation

(inactivation) of Irinotecan (UGT1A1) confers reduced metabolism of the drug, and increases the patient’s chance of myelosuppression and diarrhea. Screening for this polymorphism in CRC patients has been approved by the U.S. Food and Drug Administration (FDA). KRAS is an important protein involved in the

epidermal growth factor receptor (EGFR) pathway, often amplified in CRC, as

well as other cancers. Inhibitors of EGFR, such as Erbitux, have differential efficacy

depending on the mutation status of KRAS, or certain polymorphisms upstream

of the coding region of the gene. Screening for KRAS mutations is now common

in the clinic to better inform the oncologist’s decision to treat with these drugs.

BRAF is also involved in the EGFR pathway, and several studies have indicated

its mutation status is important in predicting the patient’s response to the drug,

but unlike KRAS it has not been widely used in the clinic as a predictive marker

for informing therapy. It is beyond the scope of this review to cover in detail all

the genetic mutations and polymorphisms that have been shown in many studies


to confer a differential response to a variety of the CRC drugs mentioned. For a

thorough review of this topic see Strimpakos et al. [24]. Suffice it to say that only

KRAS has gained wide acceptance at the clinical level as a genetic biomarker in

CRC relevant for predicting the response to EGFR inhibitors.

Limitations and challenges

Setting aside the concern of technical variance not unique to microarray

experiments, gene expression signatures have limitations as direct markers of

biological significance. For instance, many of the so-called driver genes of cancer are not differentially expressed at the level of mRNA, or the cancer progression

may not be regulated at the level of expression [25]. Additionally, the significant,

differentially expressed genes in a signature may not resolve to only one or two

distinct gene ontological (GO) processes, or the pathways they map to are unknown altogether, and this limits their usefulness as guides for mechanistic experiments. Further, the expression of the mRNA does not always correlate with the expression level of the protein [26], which is the immediate effector of

cellular phenotype. In these cases the level of gene transcription does not

necessarily have an important role in the disease. These limitations should not

be misunderstood to mean that genome-wide measures of protein-coding

mRNAs are no longer useful as indicators of dysregulation possibly important in

disease. Rather, it bears repeating that these data are likely to be most useful

when integrated into a comprehensive analysis that factors in all the relevant

information we have about the cell.

4. Traditional proteomic approaches to markers

Unlike the human genome, estimates of the size of the human proteome are widely variable. The Human Protein Initiative (HPI) estimates that 20,500 genes could code for over one million proteins [103]. Compounded with the myriad post-translational modifications, e.g. phosphorylation, ubiquitination, and glycosylation,

to name only a few, and recognizing that a modification may occur on one or

more protein residues, the full annotation of the human proteome presents an

extraordinary challenge. A significant change in the expression or activity (e.g. a

kinase) of one or more of these proteomic species between cancer and control

indicates dysregulation in the cell, and may be a candidate biomarker. However,

at present there is no high-dimensional equivalent of the microarray in

proteomics. Even for that portion of the proteome which is well annotated, it

cannot be comprehensively surveyed for expression changes between tumor and

control, be the sample tissue, cells, or biofluids. However, significant

technological improvements have been made. Cox and Mann [27] recently

showed that high-resolution mass spectrometry, the workhorse of many

proteomic approaches, paired with statistical rigor can quantify the differential

expression change of over 4000 proteins between control and treated

mammalian cells in a high-throughput manner. Much progress has been made

and many technical hurdles have been cleared, so that now perhaps only money

and cooperation stand in the way of high-throughput, proteome-wide profiling


[28]. Also encouraging for biomarker discovery is the empirical evidence that the

expression of most protein species does not significantly change between cancer

and control. It is therefore not necessary that the entire human proteome be

annotated or completely assayable to identify candidate markers of cancer.

Indeed, as we discuss in a subsequent section, significant targets found by proteomic profiling are useful inputs to bioinformatic approaches that implicate other proteins with a role in cancer, proteins that lack direct experimental

evidence and are unlikely to be found significant by proteomic or genomic

profiling alone.

Protein markers from biofluids

Presently, the only potentially useful clinical biomarker of CRC is the

serum protein CEA, and its value as a predictive marker of disease recurrence has been questioned [11]. As with various genomic markers, certain circulating

proteins have shown high sensitivity and specificity as diagnostic, prognostic, or

predictive markers in cohorts of limited size. Kaaks et al. [29] proposed a model

of how chronically high circulating levels of insulin-like growth factors (IGFs) in serum

are associated with a higher risk of CRC in women leading a Western lifestyle.

SELDI-TOF mass spectrometry has been used to find distinct protein species in

biofluids that were able to discriminate the sera of CRC patients from controls

[30,31]. When paired with a novel bioinformatic method, a similar approach was

apparently able to distinguish adenoma from carcinoma using sera obtained from


a large cohort of patients with mixed stage (Dukes’ A-D) CRC [32]. Various quantitative proteomic methods exist [33] that involve covalent modification of the proteins in a sample. The proteins in each sample are differentially labeled with

moieties of distinct mass, then mixed and digested and the peptides analyzed by

LC-MS/MS to determine the relative abundance of parent proteins present in

each sample. The relative abundance of a protein between samples may also be

measured by label-free strategies using mass spectrometers capable of high

mass accuracy [34]. Mass spectrometers capable of high sensitivity and mass

accuracy, paired with nano-chromatography are also capable of detecting and quantifying post-translationally modified proteins [35], which increasingly are

recognized as having an important role in cancer [36]. There are two challenges

particular to marker discovery in biofluids. One involves the fact that biofluids are

frequently highly concentrated in high-abundance proteins. The digestion products

of these proteins can overwhelm the mass spectrometer’s detector and mask the

detection of less abundant proteins. One strategy to overcome this problem

involves depleting the sample of these proteins on columns designed specifically

for this purpose. The eluate is then digested in the usual manner and the peptides submitted to LC-MS/MS for sequencing. Perhaps the biggest challenge to marker discovery in fluids is the fact that a heterogeneous mix of proteins is secreted

from a variety of tissues in the body, making it difficult to attribute a candidate

marker to a tissue-specific disease.
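The differential-labeling strategy described above yields, for each peptide, a pair of intensities that differ only by the mass of the label. A common way to summarize these at the protein level is the median log2 ratio across the protein's peptides. The Python sketch below illustrates that summary under stated assumptions: the function name, the choice of the median, and all intensities are hypothetical, and real pipelines add normalization and statistical testing that are omitted here.

```python
import math
from statistics import median


def protein_log2_ratio(peptide_pairs):
    """Median log2(heavy / light) ratio across the peptides of one protein.

    peptide_pairs is a list of (light_intensity, heavy_intensity) tuples.
    Pairs with a non-positive intensity are skipped because the log ratio
    is undefined for them.
    """
    ratios = [math.log2(heavy / light)
              for light, heavy in peptide_pairs
              if light > 0 and heavy > 0]
    if not ratios:
        raise ValueError("no quantifiable peptide pairs")
    return median(ratios)


# Hypothetical intensities: light label = control, heavy label = tumor.
peptides = [(1000.0, 2000.0), (500.0, 1000.0), (800.0, 3200.0)]

log2_ratio = protein_log2_ratio(peptides)
fold_change = 2 ** log2_ratio
print(f"log2 ratio = {log2_ratio:.2f}, fold change = {fold_change:.2f}")
```

The median is used here so that a single outlying peptide (the third pair in the toy data) does not dominate the protein-level estimate.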


Protein markers from tissue biopsy

Profiling for changes in oncogenic proteins between tumor and control in

tissue has three distinct advantages over biofluids: 1) a putative marker protein

may not be secreted to biofluids, 2) the ambiguity of source tissue is eliminated,

and 3) the sample is enriched in changes for tumor versus control. A common

method for separating proteins collected from tissue is 2D-DIGE, a variant of the

2D-PAGE method that allows the multiplexing of up to three samples in a single

gel, typically normal, tumor and an internal standard. Each sample is labeled by a

distinct fluorophore and then the samples are mixed, loaded onto a polyacrylamide

gel and separated by isoelectric point and molecular weight. Follow-on image

analysis allows for the identification of spots significant for the tumor phenotype.

The spots are excised from the gel, digested by trypsin, and the peptides

submitted for sequencing by LC-MS/MS. The identification of proteins in the

samples is performed subsequently by database search (Figure 2). A number of

studies have used this approach to identify significantly changing proteins

between matched normal and tumor tissues obtained from CRC patients

[37,38,39]. Verification of select findings is commonly carried out by Western blot or appropriate mass spectrometry based methods, optimally with samples not used in the discovery phase. The method has an ascertainment bias for detecting highly expressed proteins, but is considered suitable for identifying post-translationally modified proteins or isoforms. The bias may be overcome by

pre-fractionating the samples to focus on proteins differentially expressed in a

particular sub-cellular compartment (e.g. mitochondria). An alternative approach

involves mapping the proteins to protein interaction networks, and subsequently

analyzing the activity of a suite of protein interactions (a network) between tumor

and control. The end point of this analysis is the inference of a functional role of a

well-connected set of proteins in disease, one that is readily testable in an in vivo

model. An example of this approach to functional marker discovery is discussed

in the final section.
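The role of the internal standard in the 2D-DIGE design described earlier in this section can be illustrated numerically. Because the same pooled standard is run on every gel, expressing each spot volume as a log ratio to the standard on its own gel puts measurements from different gels on a common scale. The Python sketch below is a minimal illustration under stated assumptions: the function name and all spot volumes are hypothetical, and dedicated image-analysis software performs additional normalization and statistical testing not shown here.

```python
import math
from statistics import mean


def standardized_log_abundance(spot_volume, standard_volume):
    """log2 ratio of a sample spot volume to the internal standard on the same gel."""
    if spot_volume <= 0 or standard_volume <= 0:
        raise ValueError("spot volumes must be positive")
    return math.log2(spot_volume / standard_volume)


# Hypothetical volumes for one protein spot on three gels:
# (normal, tumor, internal standard). Illustrative values only.
gels = [
    (75.0, 300.0, 150.0),
    (60.0, 240.0, 120.0),
    (40.0, 320.0, 160.0),
]

# Standardize every measurement against the standard on its own gel.
normal = [standardized_log_abundance(n, s) for n, _, s in gels]
tumor = [standardized_log_abundance(t, s) for _, t, s in gels]

# Difference of group means on the standardized log2 scale.
mean_log2_change = mean(tumor) - mean(normal)
print(f"mean log2(tumor/normal) = {mean_log2_change:.2f}")
```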

Protein markers from cell culture

Cell culture continues to be one of the most expedient experimental

models for testing biological hypotheses. At the level of the genome, modern

tools of cell and molecular biology allow for genes to be knocked in, knocked out,

systematically mutated or in myriad ways modified, and the phenotype analyzed

by a wide variety of methods capable of impressive precision and specificity.

Similarly, at the level of the proteome, many perturbation experiments can be

conducted (e.g., interference of protein expression, ectopic over-expression,

pharmacological inhibition or constitutive activation, etc.) and the ensuing

phenotype analyzed. As differential expression of protein continues to be viewed as a quality indicator of cellular dysregulation in cancer, methods have been developed [40] that have enabled quantitative protein analysis between treated

and control cells by sensitive mass spectrometry. Additionally, proteins

differentially modified in disease may also be found with cell models of CRC. Kim

et al. [41] surveyed the phosphoproteome of HT-29 cells and found 238 unique

phosphorylation sites that the authors suggested may be used as surrogate

markers implicating the differential activity of a suite of kinases. Cell culture is an equally useful model for verification of mechanistic hypotheses involving markers found in tissue or biofluids. If one has a cell line with a similar genetic background and pathologic stage to a cohort of clinical samples used to screen for candidate markers, mechanistic hypotheses involving that marker may be tested in a cell model. In addition, the conditioned media in which the cells grow can be quantitatively assayed for proteins in the secretome, and in this way used to verify the candidate markers found in a screen of biofluids [42]. The limitation of cell culture as a model of human disease lies in the fact that cells grown in

(2D) culture are known to have altered metabolism, and the microenvironment is often strikingly different, e.g. lacking features of angiogenesis. Xenografts or other animal models of disease may then be more appropriate.

5. Interactome

Proteins in the cell do not function independently. Cellular phenotype is the result of proteins interacting with each other and with other molecules in the cell (e.g. lipids, RNA, DNA, hormones, drugs, etc.). How these interactions are coordinated and regulated is to a large extent unclear. There is, however, wide agreement that cancer is caused and sustained by dysregulated pathways

(networks) driven by mutant proteins. Many efforts are underway to annotate and catalog in computer databases individual interactions between proteins (and other molecules), and these databases may be used to build large network


graphs of interactions. These graphs reveal the daunting complexity we must

deal with if we are to understand the functional implications of differentially

expressed genes or proteins in disease. For example, Figure 3a is an interaction

graph that depicts the proteins (70) known to interact with APC, the so-called

“gatekeeper” gene found mutated in over 90% of CRC tumors [10]. By contrast, Figure 3b shows the relatively few interactions involving APC in the well-studied

WNT-signaling pathway, known to be dysregulated in CRC. Clearly, Figure 3a indicates there are many more interactions on the APC axis which may have an important role in causing, sustaining, or suppressing the cancer phenotype.

Further, it is certainly conceivable that the differences in the activity of these various interactions may account for the heterogeneity of tumors in patients, the difference in the aggressiveness of their tumors, and may ultimately be the source of the differential response to treatment frequently observed in the clinic

[43]. The implication is clear: improved markers of CRC will need to account for

the coordinated, differential expression of many genes or proteins that

synergistically accelerate or retard the activity of networks responsible for

disease [44].

It is beyond the scope of this review to delve into the details of all the

interactomic databases presently available, both public and commercial. It is

sufficient to note that many are species specific; the first attempt to build a

comprehensive protein-protein interaction (PPI) database was done in yeast [45].

Human PPIs also have been constructed based on a variety of evidence: from

pull-down experiments (e.g. Y2H or Co-IP), inference from homology,

computational predictions from binding motifs, evidence found in the literature, or

a combination of these [46]. For a review of publicly available human PPIs, see

Mathivanan et al. [47]. Human PPIs mark a milestone for biomarker discovery in

cancer, because they provide a functional context in which to analyze the

mechanistic role of genes or proteins found significantly differentially expressed

by traditional screens.

6. Network-based markers in CRC

Traditional genomic and proteomic approaches to biomarker discovery in cancer,

by themselves, have limitations. Integrating all the information about the cell into

a functional model may improve the robustness and reliability of markers in

cancer. Indeed, Chuang et al. [48] recently showed a network modeling

approach which integrated gene expression profiles with interaction networks to

more reliably and robustly predict breast cancer metastasis. Their approach

mapped microarray data to a human PPI and then searched for small

subnetworks within the PPI that could distinguish metastatic and non-metastatic

patients. A classifier constructed from these subnetworks was more accurate at

predicting metastasis when compared to single gene markers. In a similar

approach, Jonsson et al. [49] mapped a consensus of 346 cancer genes to a

carefully constructed human PPI (based on homology), and found that these

genes had, on average, twice as many connections as non-cancer genes. In a

related approach, Segal et al. [50] used a series of gene expression signatures and human curated annotations to identify “cancer modules”, some of which

generalized to tumorigenesis while others were found to be stage or tissue specific. These studies, in addition to others, provide compelling evidence that co-expressed genes in cancer concentrate non-randomly in “hot spots”, and when mapped to the interactome reveal well-connected sets of proteins (“modules”).

In an evolutionary sense, it is these modules (or subnetworks) that are seen as being selected for the growth advantage they convey in cancer.


Figures

Figure 1.1. Stage-wise progression of colorectal cancer. Genetic alterations commonly associated with progression to subsequent stages are shown above the arrow. Reproduced with permission [104].


Figure 1.2. Proteomic profiling: 2D-differential gel electrophoresis. Tripartite

samples of N, T, and pooled control were mixed (1); samples were separated by

isoelectric focusing (2); then by molecular weight (3); and each fluorophore

independently imaged (4). Using the DeCyder software, spots were matched on an intragel basis with differential image analysis (5), and on an intergel basis with biological variation analysis to assess biological variation (6). Significant spots

were selected for robotic excision (7), digested by trypsin (8), and the peptides

separated by reverse-phase chromatography and detected by tandem mass

spectrometry (9). MS2 spectra were searched against a protein database using SEQUEST (10).

MS2: Tandem mass spectrometry; N: normal; RPLC-MS: Reverse phase liquid

chromatography – mass spectrometry; T: tumor.


Figure 1.3. Relevant colorectal cancer subnetworks and pathways. (A)

Protein interaction graph of the APC (center) axis. Nodes (n=70) are indicated as glyphs (enzymes, transcription factors, receptors, etc.) and interactions with APC are indicated by lines. Evidence of the interactions is based on literature curation.

The nominal cellular compartment of each protein is indicated by annotation on the right. (B) Map of the canonical WNT-signaling pathway, including APC (left, center). MetaCore®


Figure 1.4. Proposed bioinformatic flow. The proteomic-first approach begins with a seed of targets found significant for disease. These are used to search for hot-spots in the interactome, i.e. subnetworks of well-connected proteins that reveal significant proximity and interactivity (“crosstalk”) to the targets in the seed. Using differential mRNA expression between normal and tumor

(microarray) as a surrogate for network activity, each subnetwork is scored and pruned for its ability to discriminate normal from tumor. The pruned subnetworks are evaluated for their role in the disease state, and may be used to inform mechanistic verification experiments. HPRD: Human protein reference database.


Chapter II - Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer.

This work was published in Molecular and Cellular Proteomics (2009)

Rod K. Nibbe, Sanford Markowitz, Lois Myeroff, Rob Ewing, Mark R. Chance

ABSTRACT

We used a systems biology approach to identify and score protein interaction subnetworks whose activity patterns are discriminative of late stage human colorectal cancer (CRC) versus control in colonic tissue. We conducted two gel-based proteomics experiments to identify significantly changing proteins between normal and late stage tumor tissues obtained from an adequately sized cohort of human patients. A total of 67 proteins identified by these experiments was used to seed a search for protein-protein interaction subnetworks. A scoring scheme based on mutual information, calculated using gene expression data as a proxy for subnetwork activity, was developed to score the targets in the subnetworks.

Based on this scoring, the subnetwork was pruned to identify the specific protein combinations that were significantly discriminative of late stage cancer versus control. These combinations could not be discovered using only proteomics data or by merely clustering the gene expression data. We then analyzed the resultant pruned subnetwork for biological relevance to human CRC. A number of the proteins in these smaller subnetworks have been associated with the progression


(CSNK2A2, , and IGFBP3) or metastatic potential (PDGFRB) of CRC.

Others have been recently identified as potential markers of CRC (IFITM1), and the role of others is largely unknown in this disease (CCT3, CCT5, CCT7, and

GNA12). The functional interactions represented by these signatures provide new experimental hypotheses that merit follow-on validation for biological significance in this disease. Overall the method outlines a quantitative approach for integrating proteomics data, gene expression data, and the wealth of accumulated legacy experimental data to discover significant protein subnetworks specific to disease.


INTRODUCTION

A fundamental presumption of the –omics revolution is that high-dimensional datasets resulting, for example, from proteomic and genomic experiments should be integrated with functional annotations to give a more complete account of the cellular changes underlying the etiology of human disease. Nevertheless, the accumulation of specific gene annotations and experimental protein or gene expression data is presently outpacing data integration. Network modeling of protein-protein interactions provides a context for such data integration [51-53]. These modeling approaches can build networks using databases created from literature curation, inference by homology, high-throughput data or a combination of these [54]. Network generation, analysis, and modeling are clearly fundamental to a new generation of systems biology approaches that promise both an improved understanding of the causes of human disease and novel biomarkers of its progression.

We undertook a systems biology approach to identify protein “signatures” that were significantly discriminative of late stage human colorectal cancer (CRC) versus control. CRC continues to be the second leading cause of cancer death in adult Americans [1]. While a great deal of research is focused on the early detection of CRC, comparatively less attention has been paid to understanding the patho-physiology of a late-stage (Dukes’ D) phenotype. As the prognosis of a late stage diagnosis is significantly poorer (<10% long-term survivability [5]) than one following early detection (>90% [1]), identifying significant network-level changes


in a late stage cohort holds the possibility of more clearly elucidating the mechanisms of tumorigenesis specific to this phenotype.

Proteomic studies of CRC using tumor and adjacent normal tissue obtained by

biopsy from human patients have been conducted [37-39,56]. Even more studies

have profiled protein expression changes in colon cancer cell lines [38,42,57-59].

However, these tissue based studies have used either a sample cohort of mixed

pathologic stage, or a cohort size that was smaller than optimal. As colon cancer

is a disease that progresses over a number of years, and is marked by distinct

pathologic stages of increasing severity (Dukes’ A-D), it is reasonable to expect

changes in the proteome that associate with particular stages of disease. Hence,

a cohort of homogenous pathology may improve the detection of stage-specific

changes. Further, we expected that the most dramatic changes in protein

expression, in terms of quantity and magnitude, would be detectable between

control and tumor using a statistically robust late stage cohort.

Figure 1 outlines our overall experimental design and emphasizes an

integrated –omics approach to understanding the patho-physiology of stage D

colon cancer. We began by quantitative proteomic profiling of a late stage CRC

cohort, and used the differentially expressed proteins to seed a search for protein

interaction sub-networks possibly involved in this disease (steps 1-11). Our list of

differentially expressed proteins was imported into a bioinformatic data mining

tool that permits a search of a database comprised of tens of thousands of

manually curated protein-protein interactions [60]. This provided us with a list of

sub-networks that we were able to rank by significance, and reduce to a


subsequently manageable number on which to focus our analysis (Figure 1,

steps 12-14).

We seeded our initial search for sub-networks with proteomic data (versus

transcriptomic data) because changes in both protein expression and isoform

abundance provide the most direct functional readout of the cell. As such, we

expected our seed proteins would represent “fence posts” within the sub-

networks, each with one or more functional roles. These sub-networks represent

expansions of the cancer proteome in the sense that the algorithm we employed

(see Experimental Procedures) builds an extended interactome comprised of

many targets around a smaller set of seed proteins. This extended interactome, although it provides clues to the regulatory connections that drive the observed abundance changes, is merely qualitative and inferential. Thus, a potential criticism of this type of approach is that a small set of proteins that is potentially causative with respect to a disease state has simply been expanded to a larger set whose members may or may not be important to the phenotype. If the myriad protein interaction networks (and there are many tools to build interactomes from a set of seed targets) are to be useful to researchers for informing new hypotheses, new methods are needed to quantitatively evaluate the significance of the targets within the sub-network that is generated. To address this, we adapted a method described by Ideker and co-workers [48] to score our sub-networks, and then we systematically searched within each one using the metric of mutual information (MI) to identify statistically significant combinations of proteins that were highly discriminative of stage D cancer versus control. Gene


expression data was an ideal basis for scoring our sub-networks because of its

complete coverage, i.e. we were able to assign an mRNA expression value to

every target.
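The scoring idea can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the averaging of per-gene z-scores into a single activity value per sample, and the equal-width binning are all assumptions made for the example. The sketch computes a sub-network activity score from expression values and then the mutual information (in bits) between the discretized activity and the class labels (e.g. 0 = control, 1 = stage D).

```python
import math
from collections import Counter

def subnetwork_activity(expr_by_gene):
    """Average the per-gene z-scores across genes, one value per sample.

    expr_by_gene maps a gene symbol to a list of expression values,
    one per sample, in the same sample order for every gene.
    """
    rows = list(expr_by_gene.values())
    n_samples = len(rows[0])
    z_rows = []
    for values in rows:
        mean = sum(values) / n_samples
        var = sum((v - mean) ** 2 for v in values) / n_samples
        sd = math.sqrt(var) or 1.0  # guard against constant genes
        z_rows.append([(v - mean) / sd for v in values])
    return [sum(row[j] for row in z_rows) / len(z_rows)
            for j in range(n_samples)]

def discretize(values, n_bins):
    """Equal-width binning (an assumption; other schemes are possible)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete values."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

A sub-network whose activity perfectly separates two equally sized classes scores 1 bit, while one whose activity is independent of the class labels scores 0.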

Overall our guiding principle is that protein is the immediate effector of

phenotype. Therefore, profiling changes in the proteome is likely to provide the

most direct evidence for cellular changes causing, or resulting from, a disease.

However, proteomics experiments typically have incomplete coverage of the

proteome. In particular, gel-based expression experiments are most likely to

detect highly abundant proteins. These highly abundant targets may include

sub-network nodes that participate in larger sub-networks of protein interactions, and may also be regulated at the level of transcription. If so, patterns of mRNA expression can be useful for discriminating between disease and non-diseased states within these “discovered” sub-networks. Of course, mRNA expression data

has the characteristic of whole genome coverage, and these data can enable

queries for sub-networks of interest that are “saturated”. Here, we present an

integrated approach to cancer biology: one that shows how proteomic data,

genomic data, and a vast database of legacy experimental data can be integrated with MI scoring schemes to reveal protein signatures significantly

discriminative of disease. The signatures are useful for focusing follow-on

experiments to verify their functional role in a disease phenotype. In addition, our

approach is very general for use with existing public datasets as well as newly

generated data and can be applied in the context of multiple types of protein

interaction networks.


EXPERIMENTAL PROCEDURES

Sample preparation

pI 3-10 experiment: Tissue samples were procured from a human tissue

repository at the Case Comprehensive Cancer Center (Supplemental 1). In

addition to a tumor biopsy during surgical resection, a normal biopsy adjacent to the patient’s tumor was also taken, typically >10 cm from the tumor. Validation of the tissue as normal or tumor (including stage of the tumor) was performed by a pathologist. Tissues were immediately frozen and stored at -80°C. A fifty

milligram sample provided an adequate mass of protein for the 2D-DIGE

experiment. The tissue was weighed, placed in lysis buffer (4% CHAPS, 7M

urea, 2M thiourea, 30 mM Tris) on ice and the cells disrupted by a three cycle

sonication protocol in a 4°C cold room. A protease inhibitor cocktail (Sigma

Aldrich, #P8340) and a wide spectrum phosphatase inhibitor (Roche) were

added to the buffer at the manufacturer’s suggested concentration, to inhibit protein degradation and dephosphorylation, respectively. The homogenate was centrifuged at 12,000 rpm for ten minutes and the protein fraction withdrawn by pipette. Protein concentration was quantified by a colorimetric assay, similar to

the Bradford assay, using the 2D-Quant kit (GE Healthcare). Aliquots were

stored at -80°C.

pI 4-7 experiment: Protein fractions from the prior experiment

were thawed, cleaned with the 2D-cleanup kit (GE Healthcare), and the

concentration re-determined as before. Aliquots were re-stored at -80°C.


2D gel electrophoresis

We used the 2D-DIGE system available from GE Healthcare (formerly

Amersham), described by Marouga et al. [61]. This system provides two distinct

advantages over conventional 2D-PAGE. First, it allows for up to three distinct

samples to be labeled by spectrally resolvable fluorophores (CyDyes, Cy2, Cy3,

& Cy5) and multiplexed in a single gel. Second, by using one of these CyDyes

(typically Cy2) to label a pooled sample, constituted by a proportional amount of

every sample in the experiment, the Cy2 dimension is useful as an internal

standard. This internal standard is crucial in the image analysis phase to a

confident assessment of real biological variation from gel to gel, as distinct from

changes arising from variance in protein loading. For the purpose of detection by

image analysis, 50 micrograms of protein is sufficient for labeling by each of the

CyDyes. Additionally, gels intended to be used for spot excision were loaded with

an additional 350 micrograms of an unlabeled, pooled sample sufficient for tryptic

digestion and detection of the peptides by LC-MS2.

1st dimension: Each minimal CyDye was reconstituted in fresh N,N-dimethylformamide (DMF) and a 400 pmol quantity used to label 50 µg of protein at pH 8-9. Cy2 was used to label the pooled internal standard as described above.

Cy3 and Cy5 were used to label the normal and tumor samples, and we alternately swapped the dyes on subsequent sample pairs to alleviate dye-specific effects which could bias image analysis. Labeling proceeded for thirty


minutes in the dark and was quenched with 10 mM lysine. Samples were then

mixed with an equal volume of 2X sample buffer (8 M urea, 4% CHAPS, 2%

dithiothreitol (DTT), 2% Pharmalyte, 3-10 or 4-7, non-linear), placed on ice for

ten minutes, then loaded onto non-linear 3-10 (or 4-7) Immobiline DryStrips (GE

Healthcare), placed in a strip holder and focused with an IPGphor system using a

step gradient protocol ranging from 30 to 8000 volts for approximately

twenty-seven hours. The strips were then stored at -80°C, ready for the 2nd dimension.

Additionally, for the first experiment (pI 3-10), two pooled, unlabeled 350

microgram samples were prepared and focused separately, to be subsequently

separated in the 2nd dimension on separate gels intended for spot excision. By contrast, for the second experiment (pI 4-7), the unlabeled, pooled sample was

mixed with the labeled samples and run on the same gel. This is possible because the Deep Purple gel stain (GE Healthcare) we used to stain the unlabeled sample is spectrally resolvable from the CyDye fluorophores. This reduces the number of gels required for the experiment.

2nd dimension: For separation by molecular weight we used the Ettan

DALT Twelve apparatus. The DryStrips were rehydrated in 10 µL of

re-equilibration buffer (8 M urea, 100 mM Tris-HCl (pH 6.8), 30% glycerol, 1% SDS,

45 mg/mL iodoacetamide (to reduce streaking)) for 10 minutes, laid across the

top of a homogeneous 12.5% polyacrylamide gel “sandwiched” between two

glass plates submerged in running buffer (40% Acrylamide-Bis, 1.5 Tris, 10%

SDS, 10% APS, 10% TEMED), then covered with a 0.5% agarose solution.

Separation proceeded at 15°C at 0.5 watts/gel, then 1.0 watt/gel for fifteen hours.

Separation was stopped when the bromophenol dye front reached the bottom of

the gel.

Gel fixation: Gels to be used for spot excision were previously secured to one

glass plate in the “sandwich” with silane (bind-silane, GE Healthcare). After the experiment was stopped, these gels were fixed in 50% methanol and 7.5% acetic acid and subsequently stained with Deep Purple Total Protein Stain (GE

Healthcare).

Image analysis

Gels were scanned using a Typhoon 9400 variable mode imager (GE

Healthcare). During this phase each CyDye fluorophore is independently excited by laser light specific to its particular excitation spectrum. Emission sensitivity, i.e. the PMT (photomultiplier tube) setting, was adjusted until the most intense spot on the gel approached saturation. This tuning was performed at a 1000 micron resolution, and once the PMT value was optimal, a final high resolution scan was performed at 100 microns. Gel(s) intended to be used for spot excision were post-stained with Deep Purple (GE Healthcare), imaged using 532/560 wavelength light, then stored in the dark at 4°C. In general, by following the

Typhoon’s recommended settings, our experience indicates that dye-specific

biases in spot intensity are eliminated or reduced below significance. After

imaging the three fluorophores for each gel, the images were imported to the


DeCyder image analysis software (GE Healthcare) for spot detection, spot matching (intra-gel), and determination of statistically significant biological variation (inter-gel) based on the measurement of relative abundance change after background subtraction and normalization to the internal standard.

Typically, about 90% of the spots on a gel will fall within two standard deviations of the mean, and not show a significant fold change (the null hypothesis), though this is highly sample dependent.
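The role of the internal standard in this normalization can be illustrated with a short Python sketch. This is a simplified illustration and not the DeCyder algorithm itself: the function name is hypothetical, and the use of a log10 ratio of background-subtracted spot volumes is an assumption made for the example.

```python
import math

def log_standardized_abundance(sample_volume, standard_volume):
    """Log10 ratio of a spot's volume in a sample channel (Cy3 or Cy5)
    to the volume of the same spot in the pooled Cy2 channel on the
    same gel. Both volumes are assumed to be background-subtracted."""
    return math.log10(sample_volume / standard_volume)

# The same spot on two gels, the second loaded with twice the protein:
gel_1 = log_standardized_abundance(200.0, 100.0)
gel_2 = log_standardized_abundance(400.0, 200.0)
```

Because sample and standard are co-separated on each gel, a difference in loading scales both volumes together and cancels in the ratio, so `gel_1` and `gel_2` agree.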

Image analysis is a time consuming part of this experiment. A statistically significant spot is one whose mean fold change is at least 50% in either direction

(depending on statistical power), and whose paired t-test p-value is less than or equal to 0.05.

Each spot that passes significance must then be manually checked to ensure it is

likely a protein spot and not a gel artifact. Satisfying these criteria, a pick list is

generated and exported to the software controlling the Ettan robotic spot picker

(GE Healthcare). Spots were excised with a 3mm core from the post-stained gel

and loaded to a 96-well plate for digestion.
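The two significance criteria can be sketched as a simple filter in Python. The record layout (`id`, `ratio`, `p`) and the signed fold-change convention, in which a tumor/normal ratio below 1 is reported as the negative reciprocal, are assumptions made for the example. The p-values are taken as already computed by a paired t-test, and the manual check for gel artifacts is not modeled.

```python
def fold_change(ratio):
    """Signed fold change from a tumor/normal abundance ratio (> 0).
    A ratio of 0.5 is reported as -2.0 (assumed convention)."""
    return ratio if ratio >= 1.0 else -1.0 / ratio

def significant_spots(spots, min_fold=1.5, max_p=0.05):
    """Return ids of spots changing by at least 50% with p <= max_p."""
    return [s["id"] for s in spots
            if abs(fold_change(s["ratio"])) >= min_fold and s["p"] <= max_p]

# Hypothetical spot records, for illustration only
spots = [
    {"id": 1, "ratio": 2.0, "p": 0.01},   # up-regulated, significant
    {"id": 2, "ratio": 0.5, "p": 0.04},   # down-regulated, significant
    {"id": 3, "ratio": 1.2, "p": 0.001},  # change too small
    {"id": 4, "ratio": 3.0, "p": 0.20},   # not statistically significant
]
pick_list = significant_spots(spots)
```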

In-gel Digestion

Excised gel plugs were washed four times for 10 minutes with 50 µl of both 25

mM ammonium bicarbonate (ABC) and 50% acetonitrile (ACN) removing the

liquid between each wash. 10 mM dithiothreitol (DTT) prepared fresh in 30 µl of

25 mM ABC, was added to each gel plug. The samples were then incubated for


45 minutes at 56°C. Following incubation, gel plugs were cooled at room

temperature for 20 minutes. The DTT was removed and 30 µl of 55 mM

iodoacetamide (IAM) was added. The samples were incubated in the dark for 45

minutes at room temperature. The IAM was removed and samples were washed

four times with 50 µl of both 25 mM ABC and 50% ACN. Gel plugs were covered

with 10 µl of a 100ng trypsin solution, incubated at room temperature for 10 minutes to allow absorption of trypsin, then covered with 15 µl of 25 mM ABC and placed in a 37°C water bath overnight for digestion. The reaction was quenched the following day with 7 µl of 1% (final concentration) formic acid.

Extraction of the peptides from the gel plugs was completed by adding 30 µl of

50% ACN/5% formic acid, vortexing for 30 minutes, spinning the samples, and finally sonication for 5 minutes.

Mass spectrometry and database software

Most samples were analyzed by tandem mass spectrometry using an LTQ mass spectrometer (Thermo Electron Corp., Bremen, Germany) equipped with an

Ettan MDLC (GE Healthcare). Six samples were run on a Finnigan LTQ FT

hybrid mass spectrometer (Thermo Electron Corp., Bremen, Germany) operated

in a positive ion mode. 2.5 µL of tryptic peptides were desalted on a C-18 pre

column (PepMap 100, 300 x 5 µm particle size, 100 Å, Dionex), then separated

on a reverse phase column (C-18, 75 µm x 150 mm, 3 µm, Dionex) using mobile

phase A (0.1 % formic acid) and B (84% acetonitrile, 0.1% formic acid) with a


linear gradient of 2% per min, beginning with 100% A. Peptides were

subsequently infused at a flow rate of 300 nL/min via a Pico Tip emitter (New

Objective, Inc, Woburn, MA) at a voltage of 1.8 kV. Mass spectra were recorded

in the ion trap, and MS2 spectra were acquired for the five most intense ions in the LTQ employing collision energy of 35 eV and an isolation width of 2.5 Da.

Bioworks version 3.2 (Thermo Electron Corp.) employing the SEQUEST software was used to search against an indexed human database with a peptide mass tolerance of 2.5 Da, and a fragment tolerance of 1.0 Da. Search parameters included partial methionine oxidation, complete carbamidomethylation of cysteine, and two missed cleavage sites. Statistically significant peptides were those satisfying p<0.001, and cross correlation (Xcorr) values of 1.9, 2.5, and 3.0 for 1+, 2+, and 3+ charged ions, respectively. The protein probability cutoff was

p<0.001, and each “hit” necessarily required at least three peptides for

consideration, with rare exceptions for low molecular weight proteins. Surviving

that filter, each protein call was manually “rationalized” to the gel, that is the

theoretical pI and molecular weight were compared to the observed values on

the gel image. Cleavage products and post-translational modifications were

considered in this step.
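The peptide and protein acceptance criteria described above can be sketched as a filter in Python. The record fields, the treatment of the Xcorr values as inclusive lower bounds, and the counting of distinct peptide sequences are assumptions made for the example. The protein probability cutoff, the exceptions for low molecular weight proteins, and the manual rationalization to the gel are not modeled.

```python
from collections import defaultdict

# Minimum Xcorr by precursor charge state, as stated in the text
XCORR_MIN = {1: 1.9, 2: 2.5, 3: 3.0}

def peptide_passes(pep, max_p=0.001):
    """True if the peptide meets the probability and Xcorr criteria."""
    threshold = XCORR_MIN.get(pep["charge"])
    return (threshold is not None
            and pep["p"] < max_p
            and pep["xcorr"] >= threshold)

def confident_proteins(peptides, min_peptides=3):
    """Proteins supported by at least min_peptides distinct passing peptides."""
    by_protein = defaultdict(set)
    for pep in peptides:
        if peptide_passes(pep):
            by_protein[pep["protein"]].add(pep["sequence"])
    return sorted(name for name, seqs in by_protein.items()
                  if len(seqs) >= min_peptides)
```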

Mass spectrometry detail may be found in Appendix 1.

Statistical power analysis


The power of a statistical test (1-β) is a measure of the probability of correctly

rejecting the null hypothesis, H0, if it is false. Low powered studies consequently have a higher rate of false negatives. Formally, β is functionally related to sample size (n), the standard deviation of the distribution (σ), the difference in the means being tested (µ1-µ0), and the area under the standard curve at a given significance level (α) (Z100):

$$ n = \frac{\left[ Z_{100(1-\alpha)} - Z_{100(1-\beta)} \right]^{2} \sigma^{2}}{(\mu_{1} - \mu_{0})^{2}} $$

Prior to our second experiment we estimated the average spot variance (σ) by

considering all spots on all 12 gels, under the assumption that the source of

variance was primarily biological and experimental variance relatively minimal.

Since the samples had been prepared, labeled, and separated at once under

near identical conditions we thought this assumption reasonable. Next, at a fixed

level of significance (α=0.05) we calculated the relationship between power and

fold change (µ1-µ0) at three different sample levels (n=3, 6, and 12). This provided an estimate of the minimum number of paired samples required to measure a particular minimum fold change with a power of 0.8.
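The relationship between power, fold change, and sample size can be explored with a short Python sketch. It uses a one-sided normal approximation for a paired design, which is an assumption made for the example and may differ in detail from the calculation performed here. The function name is hypothetical, `sigma` is taken to be the standard deviation of the paired differences, and `delta` is the difference in means on the same (e.g. log abundance) scale.

```python
from statistics import NormalDist

def power_paired(n, delta, sigma, alpha=0.05):
    """Approximate power to detect a mean paired difference `delta`
    with n pairs, using a one-sided normal approximation."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)
    return z.cdf(abs(delta) * n ** 0.5 / sigma - z_crit)

# Power at the three sample levels for an illustrative effect size
# of one standard deviation (delta == sigma)
powers = {n: power_paired(n, delta=1.0, sigma=1.0) for n in (3, 6, 12)}
```

Scanning `delta` at fixed `n` gives the smallest difference detectable at a power of 0.8 for that cohort size.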


Gene expression data

mRNA expression was measured by cDNA microarray on 171 human colon tissue samples of various stages of colon cancer (normal=16, stage B=41, stage

C=25, stage D=50, metastatic=39) using the Affymetrix Human EXON 1.0 ST chip. Expression values were generated with the Expression Console program from Affymetrix (Affymetrix, Santa Clara, CA) using the PLIER algorithm to minimize the effect of outliers. Expression values for all 171 samples for select genes in our networks, plus the decoy database of 1000 genes, were generously provided to us by the Case Comprehensive Cancer Center. The decoy genes were randomly chosen from >17,800 probe sets with core evidence. The distribution of the decoy was evaluated to ensure representation across all 23 (1-22, plus X) chromosomes (Supplemental 2), and was verified not to overlap any genes in our four networks.


Protein interaction network database and sub-network(s) build algorithm

We used MetaCore from GeneGo Inc. (version 4.6 build 12332) to search for

protein-protein interaction networks. MetaCore uses a protein interaction

database comprised of tens of thousands of protein interactions that have been manually curated based on a thorough reading of evidence reported in the literature. MetaCore covers 2400 journals and does not use natural language

processing (NLP) algorithms. In essence, the database represents a vast wealth

of legacy experimental evidence that can be quickly mined in a number of ways

for proteins and interactions relevant to a particular disease. These data can be

usefully represented by directed graphs (“networks”) that illustrate not only which

proteins interact with each other, but the functional nature of the interaction

between them (binding, cleavage, phosphorylation, etc.). We will use the term

“network” to refer to the entire database of protein interactions, and “sub-

network” will mean any network smaller than the whole network. There are a

number of algorithms available in MetaCore with which to build sub-networks

around a set of differentially expressed targets (“seed”). We chose an algorithm

that would extend a sub-network around our seed, while minimizing the number

of outgoing and/or incoming connections needed to enclose the seed in a “cloud”

of interactions topologically constrained by shortest path. As this sub-network is

likely to be very large, it is subsequently divided into smaller sub-networks by

maximizing the saturation of the seed targets in each, while obeying our input

constraint of sub-network size (n=50). We further constrained the search by

species (human) and tissue type (colon); all other pre-filter options retained their

default values. The end result was a list of sub-networks (13) ranked by p-value

and zScore. The reported p-value is calculated assuming a hypergeometric

distribution and it represents the probability of a particular mapping arising by

chance, given the numbers of genes in the set of total networkable genes (i.e.

genes or network objects that have at least one annotated functional interaction),

all genes on maps/sub-networks/processes, genes on a particular map/sub-

network/process, and genes in our experiment. The zScore is a statistical

measure of the concentration of the seed targets in the sub-network.

$$\text{zScore} = \frac{r - n\frac{R}{N}}{\sqrt{n\left(\frac{R}{N}\right)\left(1 - \frac{R}{N}\right)\left(1 - \frac{n-1}{N-1}\right)}}$$

where
N = total number of nodes in MetaCore database
R = number of the sub-network’s objects corresponding to the genes in the import list
n = total number of nodes in each sub-network generated from import list
r = number of nodes with data in each sub-network generated from import list

A white paper providing additional details of network construction algorithms can be found at: http://portal.genego.com/help/Whitepaperversionsep1905official.zip.
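As an illustration only, the zScore defined above and a hypergeometric tail probability can be sketched in a few lines of Python. The function names are hypothetical, and the choice of an upper-tail probability P(X ≥ r) for the p-value is an assumption; MetaCore's exact computation is described in the white paper and is not reproduced here. Note that the denominator of the zScore is the standard deviation of a hypergeometric count, so the zScore is simply the standardized number of seed nodes in the sub-network.

```python
import math

def z_score(N, R, n, r):
    """zScore from the formula above: (r - n*R/N) / sqrt(n*(R/N)*(1-R/N)*(1-(n-1)/(N-1)))."""
    p = R / N
    expected = n * p
    variance = n * p * (1 - p) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

def hypergeom_p_value(N, R, n, r):
    """Assumed form of the p-value: P(X >= r) when n nodes are drawn from N, R of them marked."""
    total = math.comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper, so impossible terms vanish.
    tail = sum(math.comb(R, k) * math.comb(N - R, n - k)
               for k in range(r, min(n, R) + 1))
    return tail / total
```

For example, `z_score(10, 5, 5, 5)` evaluates a toy database of ten nodes in which all five drawn nodes carry data; the numbers are chosen for arithmetic convenience and have no biological meaning.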

Network scoring and significance tests

A flowchart of the scoring scheme is outlined in figure 5. We obtained global gene expression data as measured by cDNA microarray (for detail see Gene expression data) for every gene product in each of the four sub-networks chosen for analysis. Importantly, although the search criteria constrained each sub-network to 50 proteins or fewer, some proteins in the sub-networks were complexes involving multiple gene products, sub-units, or isoforms. We chose to include all these gene products when we exported the sub-network list of genes from MetaCore. Consequently, and depending on the particular sub-network, the number of genes may exceed 50. Additionally, the microarray data set also included five experiments that had used micro-dissected epithelial cells from colonic crypts. We considered these to be controls for the normal tissue samples because of the homogeneous cell type, whereas the normal tissue samples had detectable levels of stromal markers (e.g. vimentin). Hence, genes with an average expression value less than 40 (below the detection limit of q-PCR) across the crypt samples were considered unexpressed in the epithelial layer, and removed from consideration during scoring.

Mutual Information (MI) – MI is a concept from information theory used to measure the dependence of two random, discrete variables, say X and Y, based on their joint and marginal distributions. A high MI score (0≤MI≤1) indicates that

X and Y are non-randomly associated with each other, whereas in the limit an MI value of zero indicates the two variables are statistically independent. To apply the concept to our problem, we retrieved the mRNA expression for every gene product in each of the four networks, and used these data to populate two

distributions of network activity values, one for a set of normal samples (X), and

one for a set of stage D samples (Y). With these two distributions we were able

to calculate an MI score between normal and stage D for each network, and also

use it as an optimization metric to search within the network for combinations of

proteins (“signatures”) that would maximize this score. Intuitively, a high MI score

would indicate the corresponding proteins non-randomly associate between

normal and stage D, inferring their functional importance in the network in late

stage cancer, and suggesting new experiments for elucidating mechanisms of

tumorigenesis.
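For reference, the textbook definition of mutual information for two discrete variables is written out here. The symbols p(x, y), p(x), and p(y) for the joint and marginal probabilities, and H for entropy, are notation introduced for this illustration. The base-2 logarithm is an assumption; under it, MI is bounded by 1 bit whenever one of the two variables is a two-class label such as normal versus stage D, which is consistent with the stated range 0≤MI≤1.

$$\mathrm{MI}(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log_2\frac{p(x,y)}{p(x)\,p(y)}, \qquad 0 \le \mathrm{MI}(X;Y) \le \min\{H(X),\,H(Y)\}$$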

The raw mRNA expression values in each sub-network were first

normalized by subtracting the population mean and dividing that difference by the

population standard deviation (z-score). Next, a sub-network activity score was

determined for each sample (column) in each network by summing the

corresponding normalized mRNA expression values (rows). The activity score

across phenotypes is a random, continuous variable that we discretized using a binning procedure. The number of bins (figure 5, step 3) was determined by log2(# samples) + 1, which is Sturges' rule, in the same manner as described by

Ideker and co-workers [48]. The range of the bins was determined from the range of the normalized expression values, plus or minus a small adjustment to ensure

all values fell within a bin. Finally, the marginal and joint distributions were determined for the two phenotypes being compared (normal and stage D), and

the mutual information value computed (figure 5, steps 4 and 5). Note that the

example in figure 5 indicates that network activity values were calculated over

every patient sample (column), which as a matter of course we did do, but we only calculated the MI between normal and stage D, consistent with the proteomic comparison. The other values, however, were useful for testing hypothesis 2 (H2, see following).
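The scoring steps described above (normalize, sum into an activity score, bin, compute MI) can be sketched as follows. This is a minimal illustration and not the Matlab code used in the study. Several details are assumptions made for the sketch: each gene is z-scored across all samples, the Sturges bin count is rounded down, bin edges span the observed activity range with the top value clamped into the last bin, MI is computed in bits between the binned activity and the phenotype label, and all function names are hypothetical.

```python
import math
from collections import Counter

def zscore_rows(expr):
    """Normalize each gene (row) to zero mean and unit population SD across samples."""
    out = []
    for row in expr:
        mu = sum(row) / len(row)
        sd = math.sqrt(sum((v - mu) ** 2 for v in row) / len(row))
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return out

def activity_scores(expr):
    """Sum the normalized expression over genes for each sample (column)."""
    return [sum(col) for col in zip(*zscore_rows(expr))]

def sturges_bins(n_samples):
    """log2(# samples) + 1, rounded down (the rounding is an assumption)."""
    return int(math.floor(math.log2(n_samples))) + 1

def discretize(values, n_bins):
    """Assign each value to one of n_bins equal-width bins over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # guard against a zero-width range
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(bins, labels):
    """MI in bits between the binned activity and the phenotype labels."""
    n = len(bins)
    joint, px, py = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi
```

A sub-network whose activity separates the two phenotypes perfectly yields the maximum of 1 bit under these assumptions, whereas an activity pattern unrelated to phenotype yields a value near zero.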

Significance testing – To test the hypothesis that the genes in a given sub-network were not significantly discriminative of phenotype compared to a random selection of n genes (where n= number of genes in sub-network being measured), we randomly selected 1,000,000 combinations of n genes from the decoy database to create the null distribution, then evaluated the actual MI score of the sub-network on the cumulative distribution function (CDF). To test the second hypothesis, which is that the genes in our network do not associate with a particular phenotype, we permuted the phenotypes (columns) of the relevant array 100,000 times and evaluated significance in the same way. The evaluation on the CDF is expressed as a percent value. A value of, say, 95% indicates there is a 5% chance (p=0.05) of observing a higher MI value, assuming the null hypothesis is true.
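The two null distributions and the evaluation on the empirical CDF can be sketched as below. This is an illustrative outline under stated assumptions: `score_fn` is a hypothetical callable standing in for the MI computation, the helper names are invented for this sketch, and the number of draws would be set to 1,000,000 (H1) or 100,000 (H2) to match the text.

```python
import bisect
import random

def empirical_cdf_percent(null_scores, observed):
    """Percent of null scores that are less than or equal to the observed score."""
    ordered = sorted(null_scores)
    return 100.0 * bisect.bisect_right(ordered, observed) / len(ordered)

def p_value_from_cdf(cdf_percent):
    """Chance of observing a higher score under the null, i.e. 1 - CDF."""
    return 1.0 - cdf_percent / 100.0

def null_h1(decoy_genes, n, score_fn, n_draws, rng):
    """H1 null: score sets of n genes drawn at random from the decoy pool."""
    return [score_fn(rng.sample(decoy_genes, n)) for _ in range(n_draws)]

def null_h2(labels, score_fn, n_perms, rng):
    """H2 null: score the real gene set against shuffled phenotype labels."""
    scores = []
    for _ in range(n_perms):
        permuted = labels[:]          # copy so the real labels stay intact
        rng.shuffle(permuted)
        scores.append(score_fn(permuted))
    return scores
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes a run reproducible, which is useful when comparing significance across sub-networks.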

The programs required to import the expression data, organize it for analysis, visualize it, perform the scoring and optimization search, as well as the hypothesis testing, were written using Matlab, and are available on request.

Label-free mass spectrometry – protein fold change determination

Sample preparation – 50 µg of total protein derived from colonic tissue lysate was precipitated with acetone (-20 °C) at -80 °C for twenty minutes, followed by centrifugation at 12,000 rpm for ten minutes at 4 °C. The samples were then dried, re-buffered in 20 µL of 0.2% ProteaseMAX surfactant (Promega Corp., Madison WI), and gently shaken for thirty minutes. The buffer volume was then increased to 93.5 µL with 50 mM ammonium bicarbonate, incubated with 1 µL of 0.5 M dithiothreitol (DTT) for twenty minutes at 56 °C, followed by incubation with 2.7 µL of 0.55 M iodoacetamide for fifteen minutes in the dark. 1 µL of 1.0% ProteaseMAX was added to the buffer, followed by 1.8 µL of trypsin (Promega Corp., Madison WI) that had been dissolved in 50 mM ammonium bicarbonate to a final concentration of 1 µg/µL. Digestion was carried out for 3 hours at 37 °C. Peptides were concentrated on a 100 µL C18 UltraMicroTip column (Net Group, Inc.), eluted in 20 µL of 0.1% formic acid in 60% acetonitrile, then diluted with UltraPure water to a final concentration of 500 ng/µL and stored at -80 °C.

LC-MS/MS – Analyses were performed on a LC Packings/Dionex Ultimate

3000 HPLC-Orbitrap XL (Finnigan, San Jose, CA) system. The HPLC system is

equipped with two independent ternary gradient pumps, suitable for high-

throughput dual column parallel-HPLC mode applications. A standard injection

volume of 10 µL was used for all the samples, giving a total of one microgram of

digest on column. The data collection method incorporated a 30K Orbitrap full

scan in the FT mode, using profile mode data collection, followed by data-

dependent (DDA) mode MS2 acquisition of the top 5 precursors from each full

scan (centroid mode). CID mode fragmentation was chosen for generating the

MS2 spectra in the linear ion trap with a standardized value of 30% normalized

collision energy (NCE) being chosen for the peptides’ fragmentation. The LC

method included a slow, 95-minute acetonitrile ramp from an initial 4.8% till a

final composition of 50.2% was achieved. Further high organic elution was

performed (5 minutes) to complete the elution of peptides from the analytical

column, followed by equilibration of the column for succeeding analyses.

Analysis – Raw files were searched by Mascot against the IPI_Human database (ipi.HUMAN.v3.28.fasta). For each protein of interest one tryptic peptide with tandem MS evidence in at least one of (six) replicate samples was

used to measure the relative expression change between normal and tumor.

Peptide abundance was determined by the area under the elution curve which

was extracted from the total chromatogram using a mass window adequate to

capture all isotopes of the observed monoisotopic mass (≤ 10 ppm). The curves

were smoothed by an 11-point Gaussian filter and baseline subtracted using the

Xcalibur software (Thermo Electron Corp., Bremen, Germany). Fold change

between normal and tumor was calculated as the ratio of the integrated curve

areas. Three replicate runs of a single sample pair (#507, normal/tumor) were

used to estimate technical variance. The coefficient of variation for fold change

ranged from 6% - 39%. The fold change for IGFBP3 was determined by densitometry from Western blot analysis.
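The quantities in this analysis reduce to simple arithmetic, sketched here for clarity. The function names are hypothetical, and two conventions are assumptions not stated in the text: the fold change is taken as tumor area over normal area, and the coefficient of variation uses the sample (n-1) standard deviation.

```python
import statistics

def ppm_window(mz, ppm=10.0):
    """Lower and upper m/z bounds of a +/- ppm extraction window."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta

def fold_change(area_tumor, area_normal):
    """Ratio of integrated elution-curve areas (tumor over normal is assumed)."""
    return area_tumor / area_normal

def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)
```

Applied to the fold changes from the three replicate runs of one sample pair, `coefficient_of_variation` gives the kind of technical-variance estimate reported above.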

RESULTS

2D-DIGE Discovery Proteomics

The features of our cohort and the overall design of our proteomic experiments are shown in Figure 1. Twenty-four tissue samples (12 normal and 12 tumor, each pair from the same patient) were prepared using standard procedures (see

Experimental Procedures). The tissue samples had been vetted by a pathologist to establish tumor grade. The experimental design involved alternately labeling tumor and control samples with Cy3 and Cy5 dyes, while Cy2 was used to label a 50 µg pooled fraction that served as an internal standard for each gel. The usefulness of this standard cannot be over-emphasized as it assists in providing a confident assessment of real biological variation, by controlling for variance in protein loading [62]. Each tripartite sample was first separated by isoelectric focusing (IEF) over a broad pH range (3-10), then by molecular weight using

12.5% homogeneous SDS polyacrylamide gels. All the samples were labeled and separated simultaneously under identical conditions to minimize experimental variation (see Experimental Procedures). Each gel yields three images (step 4, figure 1), and along with the post-stained gel used for spot excision, a total of thirty-seven images were imported to the DeCyder software (GE Healthcare) for differential image analysis (DIA, i.e. spot matching), followed by statistical analysis of biological variation (BVA). Figure 2 is a (Cy5) image representative of a typical analytical gel (patient #5144) indicating the significant spots matched. In total for this experiment 58 spots were identified as significantly (p ≤ 0.05, |mean f.c.| > 50%) changing between normal and tumor. For the majority of these spots

DeCyder was able to detect a match on greater than thirty of the images. In no

case did that number fall below twenty, or fail to be matched on the post-stained

gel used for spot excision. The spots were robotically excised from gel, digested

by trypsin overnight, and submitted to reverse-phase LC-MS2 followed by

database search. Twenty-three spots were confidently (p ≤ 0.001, peptide and

protein) identified by database search (Table 1, figure 2 – annotation). Thirteen

proteins were up-regulated in cancer, and seven were down-regulated.

The IEF range chosen for this experiment resulted in spot overlay in a

number of regions of interest. We anticipated we could improve separation and

focus more spots by using an IEF range of 4-7. Additionally, using a measure of spot variance from the first experiment, we found we only needed six sample pairs to capture fold changes greater than ±30% while maintaining statistical power of 0.8 (see Statistical power analysis, Experimental Procedures).

Accordingly, we performed a second experiment using a subset of six sample pairs from the first experiment (#s 145, 321, 362, 468, 480, 602). The protein fractions were thawed and cleaned with a kit (2D-Cleanup kit, GE Healthcare) to remove impurities known to interfere with proper separation. The protein concentration was re-determined as before by colorimetric assay. We employed even more stringent criteria in the image analysis phase as compared to

experiment 1. Spots not only had to satisfy the same statistical criteria, but

additionally, for a spot to be considered for picking, it had to have been matched

by DeCyder on every one of the 19 gel images. Employing these criteria we

identified 150 significantly (p ≤ 0.05, |mean f.c.| > 30%) changing spots (Figure

3). With the false discovery rate (FDR) filter in DeCyder activated (based on the method of Benjamini and Hochberg [63]), over 40 of these spots retained their significance (q ≤ 0.05). The indicated spots were excised from the gel, digested by trypsin, and submitted to reverse-phase LC-MS2 followed by database search.

Out of 150 spots, we confidently identified 67 proteins. 35 spots were up-

regulated in cancer, and 32 were down-regulated (Table 2), indicative of a lack of

bias toward up- or down-regulated proteins.
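For readers unfamiliar with the FDR filter mentioned above, the standard Benjamini-Hochberg step-up adjustment can be sketched as follows. This is the textbook procedure, offered as an illustration; how DeCyder implements its filter internally is not described in the text, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

A spot would then be retained when its q-value is at or below the chosen threshold, 0.05 in the experiment described here.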

Highly significant late-stage CRC signatures

As stated above, our guiding hypothesis was that the differentially expressed

proteins (“targets”) found in our experiments represented nodes upstream and/or

downstream of other nodes in one or more functional sub-networks that may be

dysregulated in stage D colon cancer. Accordingly, we used the set of unique

targets from both experiments (n=67) to seed a search for functional sub-

networks. A detailed description of the protein interaction database we searched,

and the sub-network construction algorithm we used is provided in Experimental

Procedures. The search returned thirteen sub-networks, each of which contained

a variable number of between one and twelve of the seed targets. We limited our

attention to four sub-networks judged most significant by a combination of p-

value and zScore. One of these sub-networks is annotated in Figure 4 with a

breakdown of the most significant gene-ontological (GO) processes, their percent representation in the sub-network, the p-value, zScore, and sub-network size, i.e.

the total number of gene products it contains. The remaining three are provided in Supplemental Data (3).

To test our hypothesis that our 67 proteomic targets were significant for

late stage CRC, we implemented a quantitative method to score the sub-

networks. Figure 5 outlines our approach. A more detailed explanation of the

scoring procedure is provided in Experimental Procedures. Briefly, we obtained

mRNA expression data for every gene product in each of the four sub-networks

from a set of unpublished microarray experiments (Affymetrix) performed on a

large cohort of clinical tissue samples of varying CRC stage (figure 5, step 1).

With these data we computed an activity value for each sub-network over each

normal experiment (n=16) and each stage D experiment (n=50) (figure 5, step 2).

The activity values represent a continuous variable, and for the purpose of

computing MI need to be discretized by a binning procedure (figure 5, step 3).

With discrete values we computed the relevant distributions (figure 5, step 4),

then the mutual information (MI) between normal and stage D for each sub-

network. The scores are shown in Table 3 (last column).

There are two null hypotheses relevant to evaluate the significance of the

MI score. The first, which we will call H1, states that genes in a sub-network are

not discriminative of the disease phenotype compared to a random set of genes.

For example, if the sub-network contained ten genes, then any ten genes taken

at random would produce an MI score at least as good as the real sub-network of ten genes, under the null hypothesis. The second, call it H2, is subtly different; it

states that the expression levels of the genes in the sub-network do not

associate with a particular phenotype. For example, if the network contains ten

genes that produce a high MI score between normal and stage D, then under the

null hypothesis scores at least as high will be found for random permutations of

phenotypes, i.e. by disrupting the real association between patient and gene expression.

From the microarray we obtained mRNA expression data for 1000 random genes (“decoys”), and ensured that these genes had no overlap with any of the genes in the four sub-networks. A null distribution was estimated for testing H1 by evaluating an MI score for 1,000,000 combinations of n random genes selected from the decoy dataset, where n equals the size of the particular sub-network being assessed. The null distribution for H2 was estimated from 100,000 permutations of the phenotypes (columns) in the 2-dimensional array representing a sub-network. Significance was then determined by evaluating the

MI score on the cumulative distribution function (CDF) of the respective null distribution. 1-CDF (MI) indicates the probability of finding a higher MI score.

Probability values of 1% or less were considered to be significant.

By this measure neither the null hypothesis for H1 nor H2 could be

rejected for any of the four sub-networks, i.e. when all the gene products in the

sub-network were used to compute its activity value (Table 3, last column). We reasoned that this result could be attributed to how the sub-networks were built

and scored. Even though all four sub-networks were discovered by proteomic

profiling and were judged statistically significant, the individual interactions in a sub-network are nevertheless based in large part on a diverse set of experiments performed in vitro, and in vivo in a variety of different tissues. Consequently, although many of the sub-networks are indicated to be active in colonic tissue, they are not necessarily important to a metastatic cancer phenotype. Second, we observed that the sub-networks, in terms of the nodes and edges they contain, were sensitive to the parameters used to search for them, even including the version of the database software. This results in a certain degree of arbitrariness in the sub-network’s overall topology. Given these caveats the statistically insignificant MI scores were not very surprising.

However, we further hypothesized that the activity of specific protein combinations within the sub-network(s) would be highly discriminative of disease.

Thus, we performed an exhaustive combinatorial search over each sub-network for combinations that would maximize the MI score between control and stage D.

We did this for combinations of up to six proteins (readily accomplished on a conventional desktop computer). For each sub-network except one (sub-network 2), the MI score steadily increased with combination size (Table 3). MI scores for these specific combinations (“signatures”) were much higher than those for any of the sub-networks taken as a whole, and, more importantly, in each case the MI score was highly significant with respect to both null hypotheses, H1 and H2 (Figure 6, compare to MI in column labeled “signature 6” in Table 3). Notably, the top scoring signatures included proteins for which we had independently found direct proteomic evidence (e.g., CCT2, HSP90AB1, SERPINA1, and CapG), as did

certain other signatures of fewer combinations. Figure 7 highlights the gene products (grey) from each sub-network participating in signature 6. To extend the potential functional importance of these signatures, we added to each directed graph those proteins that were one hop away from the signature proteins, i.e. those from the corresponding parent sub-network, immediately up- or downstream. We briefly discuss the proteins and functional interactions of these expanded sub-networks in more detail in the following section. Additionally,

similar to the conclusion of Ideker and co-workers [48], we found that the signature genes did not cluster in a dendrogram computed using traditional

distance metrics, e.g., Euclidean or Spearman (data not shown), and would likely

have gone overlooked by conventional gene classification techniques.
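The exhaustive combinatorial search described above can be outlined as follows. This is a minimal sketch, assuming a hypothetical `score_fn` that returns the MI score for a given combination of genes; it is not the Matlab implementation used in the study, and the function name is invented for illustration.

```python
from itertools import combinations

def best_signatures(genes, score_fn, max_size=6):
    """For each size k = 1..max_size, return the highest-scoring k-gene combination."""
    best = {}
    for k in range(1, min(max_size, len(genes)) + 1):
        top_combo, top_score = None, float("-inf")
        for combo in combinations(genes, k):    # every k-subset is visited once
            score = score_fn(combo)
            if score > top_score:
                top_combo, top_score = combo, score
        best[k] = (top_combo, top_score)
    return best
```

Because every combination of each size is scored, the result for a given size is the global maximum over that size rather than the end point of a stepwise selection.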

As mentioned above, with the exception of parent network two, MI scores

increased or were constant for successively larger combinations of proteins.

Further, we found that combinations of less than six proteins (signatures 1-5)

were also significant when tested against the appropriate H1 and H2 null

distributions (data not shown). Indeed the relevant distributions for smaller

combinations of proteins had similar characteristics (e.g. mean and variance) as

those for signature six. If the proteins appearing in successively larger

combinations were completely different this would suggest that our method was

not sensitive to small variations in the underlying network activity patterns.

However, this is not what we observed. As indicated in Table 3, for each network

all signatures (1-6) frequently show the repeated contribution of either the same

protein or proteins involved in the same complex of proteins, suggesting that the

method is sensitive to the specific interactions of functionally similar sub-network proteins. Even for the signatures derived from parent sub-network 2, the contribution of proteins capable of phospholipase activity (PLA2 and PRDX6) appears in five out of six signatures.

Biological relevance to CRC of signature proteins in extended sub-networks

It merits emphasis that each of the most significant extended sub-networks

(Figure 7) contained targets for which we had direct proteomic evidence (CCT2,

HSP90AB1, SERPINA1 and CapG), indicating that gene products significant by their contribution to MI maintain their significance at the level of the proteome in late stage CRC. The most significant targets (highlighted grey) are generally classified according to those with a known role in CRC, a role in other human cancers, and those with no known role in cancer. The fact that we found significant genes with a known role in CRC can be understood as a positive control of our analytical method.

Genes with a role in CRC – IGFBP3 (aka IBP3) is an insulin-like growth factor binding protein that was recently identified to cause apoptosis in a TNF-related-apoptosis-inducing-ligand (TRAIL)-mediated fashion in relevant CRC cells [64].

Additionally, a large association study found that paired polymorphisms in

IGFBP3 and its substrate predicted a significant increase in risk for CRC [65].

The integrin family of proteins has been well studied in CRC. They are generally responsible for cell-cell and cell-matrix adhesion. Loss of expression of certain sub-units in this family has been associated with increased neoplastic transformation in colonic epithelium, and the specific loss of beta-1 (ITGB1) chains was associated with benign to malignant transformations [66]. Notably, integrins are active as heterodimers, and although only ITGB1 contributed to the significant MI score, the sub-network indicates that the dimer ITGB1/ITGA4 is of particular interest. IFITM1 (IFI17) is a member of the family of interferon-inducible transmembrane proteins. A recent study [67] proposed it as a possible marker for human colorectal tumors. The sub-network also revealed it to be regulated at the level of transcription by the PBAF complex, certain members of which we also found significant. SMARCA4 (aka BRG-1) plays a key role in the chromatin remodeling complex in mammals. One study [68] showed in vivo evidence that

BRG-1 interacts with β-catenin and induces the transcription of T-cell factor

(TCF) target genes. Mutant forms of BRG-1 lacking ATPase activity disrupted this induction. TCF target gene activation is the final consequence of the WNT-signaling pathway, mutations in which are well known to cause tumorigenesis in the human colon. The role of platelet-derived growth factor receptor (PDGFR) has been well-studied in CRC along with other receptors capable of tyrosine kinase activity and downstream signaling. One recent study [69] found a significant association between the stromal expression of the B sub-unit

(PDGFRB) and the metastatic potential of CRC tumors. In addition to being overexpressed in a number of human cancers, Casein kinase II (CSNK2A2) in CRC suppresses apoptosis by de-sensitizing cells to TRAIL in a caspase-dependent manner but independent of NF-κB [70]. It also has been shown to promote survival of colon cancer cells by increasing the expression of survivin via the canonical transcription pathway hyperactive in CRC (TCF/LEF) [71]. PLA2G12A

is a member of the family of secreted phospholipases, many of which display

distinct patterns of expression in adenocarcinomas [72].

Genes with a role in other human cancers - CapG, a gelsolin-like capping

protein, has been identified as a possible tumor-suppressor gene [73], though our proteomic screen revealed it to be up-regulated in cancer, in agreement with

the mRNA expression. A closer look at this study revealed that the authors had

measured a near complete loss of the CapG protein in a variety of primary

human cancer tissues, but not colon tissue. Our evidence that the CapG

message and protein are up-regulated in CRC indicates that it may have oncogenic activity in the colon. Further, we actually identified CapG at two closely spaced but distinct spots on the gel, suggesting that post-translational modification may be important to its activity. The human gene PLK1 (or PLK), a

serine/threonine protein kinase, was characterized many years ago [74] and its

expression was found to strongly correlate with the mitotic activity of a variety of

tumor cell lines, including those derived from human colon. Notably, the study

found PLK1 was not expressed in a variety of the normal human tissues, with the

exception of normal colon tissue. More recently, PLK1 was found to be overexpressed in primary CRC tumors [75], identified as a prognostic factor for CRC [76], and when knocked down or inhibited in human adenocarcinoma cells (RKO) led to dramatic mitotic arrest [77], thus showing promise as a possible drug

target. Lastly, driver mutations in PLK1 are not unknown, as was recently

revealed by a large screen for somatic mutations on over 500 protein kinases covering a large cohort of human cancers [78]. RPS2 is a gene encoding a ribosomal protein in the small 40S sub-unit. RPS2 was found by a proteomics screen to be a novel kinase substrate differentially phosphorylated in breast cancer development [79].

Genes with no known role in human cancer – A literature search of PubMed revealed little to nothing about the role of the remaining significant sub-network targets in human cancer (HSP90AB1, HIST1H2AB, TUBA4A, TUBB3, GNA12,

TRAP1, DYNLT3, CCT3, CCT5, CCT7, POLR2D). Guided by the evaluation of interactions on the sub-network, along with select proteomic evidence, these targets may merit follow-on experiments to discover their role, if any, in late stage

CRC. For example, the chaperone containing t-complex proteins (CCT) play a role in protein folding in eukaryotes and are widely expressed in the cytosol. One study did find a significant elevation of the CCT transcript in human colon carcinomas, and validated the change in protein expression by immunohistochemistry, whereas our proteomic screen revealed it down-regulated in the cancer tissue. Notably, that study had not vetted the samples for tumor stage, which highlights the importance of stage-specific studies.

Additionally, CCT3 and CCT5 were identified by microarray analysis to be significantly differentially expressed in the epithelium of other human cancer

tissues (esophageal, breast, ovarian and lung), but their functional role in cancer,

CRC in particular, is unknown.

Significant targets in the developmental process sub-network are coordinately regulated

We used a label-free mass spectrometry approach (see Experimental Procedures)

to verify the relative expression of four of the significant targets in the

developmental processes sub-network (panel 1, Figure 7), in a new cohort of

clinical tissue samples. The differential expression of IGFBP3 was determined by

Western blot. The relative expression change between normal and stage D at the

level of mRNA for each of these targets was computed from the microarray and

used for comparison. Nearly all of the targets were up-regulated in cancer in all

patients at the level of mRNA and protein (Figure 8). Overall, these data indicate coordinated regulation at the level of mRNA and protein, but also highlight the relatively large variation of expression of both mRNA and protein across patients.

An interpretation consistent with this observation is that subtle changes in the transcription of one or more targets may have a synergistic effect on the activity of the other targets in maintaining the phenotype, something the measure of

mutual information is well-suited to capture, and consistent with our guiding

hypothesis.

The advantages of an integrated –omics approach to cancer

Colon cancer has a strong genetic basis due to the accumulation of somatic mutations in oncogenes and tumor suppressor genes. However, it is also widely accepted that due to the resiliency of mammalian cells, single gene mutations are usually insufficient to cause this disease [80]. While a great deal of work has been done identifying genes involved in colon cancer, as well as the canonical pathways they resolve to [81-83], comparatively little work has been done to evaluate the functional protein interactions derived directly from proteomic data.

It is in fact not known how genomic, transcriptomic, or proteomic perspectives may differently inform our understanding of colon cancer onset or progression.

Classification of disease phenotype using candidate gene, candidate RNA, or candidate protein target approaches has been the bedrock of modern -omics research. However, in some cases these single gene/protein models of disease have been disappointing in follow-on studies [53]. Alternative approaches, using network and sub-network classifiers, are currently under examination. In this study, we searched for protein sub-networks by leveraging a database built on a very large number of legacy experiments, using proteomic data as a seed to discover sub-networks discriminative of late-stage CRC. We reasoned that this approach would quickly lead us to significant protein-protein interaction sub-networks that would reveal the functional cause, or consequence, of stage-specific phenotype(s). We then developed a novel approach for searching within these sub-networks for particular “signatures” that are significant discriminators of the disease phenotype, using a scoring process based on gene expression

data. Computational and bio-statistical methods to unify proteomics and gene expression data are a powerful way of identifying novel interaction signatures involved in late stage cancer. While gene lists combined with rigorous statistical analysis can identify significant genes whose expression profiles cluster, the resultant gene lists alone provide no functional information of the post-transcriptional mechanism(s) of dysregulation.

One criticism of our approach is that gel-based proteomics experiments, which provided the seed proteins for our search, typically identify highly abundant proteins as differentially expressed. Many of these are either so-called “housekeeping” genes with a role in metabolism, or in any case may often be considered unimportant to a disease phenotype such as cancer as they may lack transcription factors or receptors as protein classes. However, our integrated approach was able to locate these seed proteins within regulatory sub-networks of great interest. Each of the four sub-networks we scored included between eight and twelve proteomics targets that were directly identified. This underscores the usefulness of a network-based approach that identifies specific and significant functional interactions possibly relevant to the patho-physiology of cancer. It also revealed the large diversity of sub-network interactions that these high expressers are evidently involved in. However, as the scoring shows, the entire sub-network(s) are not statistically significant for classifying phenotype.

Only the end product of our quantitative approach, which identified root nodes with functional interactions significant for phenotype, presents a focused set of testable hypotheses suitable for validation by perturbation experiments.

Additionally, we acknowledge that different protein interaction databases are likely to return different sub-networks given the same target seed. However, this is most likely because present-day databases represent an undersampling of the human interactome, coverage of which has been recently estimated at less than one percent [84], and not because of any inherent arbitrariness attributable to our approach. Indeed, as coverage of the human interactome continues to improve, interaction databases are likely to converge with respect to sub-network selectivity.

Using the measure of mutual information to score the networks had an advantage over other classification methods in that there is no requirement that the underlying data be normally distributed. This made the method particularly well-suited to examining gene expression data which, for many of the genes in our networks, exhibited non-normal distributions of expression for particular stages of cancer. Pairing this approach with an exhaustive combinatorial search, rather than a greedy search, reduces the possibility that the signatures represent a local rather than global maximum. A complete exhaustive search of the expression landscape for even larger combinations of gene products (>6) is limited only by computer power. Finally, some of the proteins identified in our signatures did not, independently, show a change in mRNA expression large enough to be considered significant by simple gene expression profiling. But it is certainly conceivable that the cumulative effect of small changes in network activity (mRNA expression) may lead to significant changes in the proteome, and this was a guiding hypothesis of our study.
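The scoring and search strategy described above can be illustrated with a minimal sketch. It assumes that sub-network activity is the per-sample sum of mean-centred expression values and that MI is taken between equal-width-binned activity and a binary class label. The function names, the four-bin discretization, and the toy matrix are illustrative assumptions, not the code used in this study.

```python
from collections import Counter
from itertools import combinations
from math import log2


def activity(expr, genes):
    """Per-sample activity: sum of mean-centred expression over `genes`."""
    n_samples = len(expr[0])
    total = [0.0] * n_samples
    for g in genes:
        mean = sum(expr[g]) / n_samples
        for j, value in enumerate(expr[g]):
            total[j] += value - mean
    return total


def mutual_information(values, labels, n_bins=4):
    """MI in bits between equal-width-binned values and class labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant vector
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(bins, labels))
    p_bin = Counter(bins)
    p_lab = Counter(labels)
    return sum(
        (c / n) * log2((c / n) / ((p_bin[b] / n) * (p_lab[l] / n)))
        for (b, l), c in joint.items()
    )


def best_signature(expr, labels, size):
    """Exhaustively score every combination of `size` genes (no greedy step)."""
    scored = (
        (mutual_information(activity(expr, genes), labels), genes)
        for genes in combinations(range(len(expr)), size)
    )
    return max(scored, key=lambda pair: pair[0])


# Toy data (assumed): 3 genes (rows) x 8 samples; the first 4 samples are
# "normal" (label 0) and the last 4 are "stage D" (label 1).
toy_expr = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

for k in (1, 2):
    print(k, best_signature(toy_expr, toy_labels, k))
```

Because every combination of a given size is scored, the returned signature is the global maximum for that size. The cost is the number of combinations, which grows rapidly with signature size: choosing 6 genes from 100 already gives 1,192,052,400 candidate sets.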

As high-throughput methods continue to produce more genomic and proteomic data, it will become increasingly important to find new ways to integrate these data and to provide precise, quantitatively significant classifications of human disease stage. These classifiers will likely be critical to the assessment of individual phenotype important for development of personalized medicine.

Figures

Figure 2.1. Experimental design. For each patient, a tripartite sample of normal (N), tumor (T), and pooled control were mixed (1); unlabeled samples were run on separate gels for poststaining or mixed in the analytical gels (see “Experimental Procedures”). Samples were separated by isoelectric focusing (2) and then by molecular weight (3), and each fluorophore was imaged independently (4). Using the DeCyder software, spots were matched on an intragel basis with differential image analysis (DIA) (5) and on an intergel basis with biological variation analysis (BVA) to assess biological variation (6). Significant spots (∆fold dependent on statistical power) were selected for robotic excision (7) and digested by trypsin (8), and the peptides were separated by reverse-phase (RP) chromatography and detected by tandem mass spectrometry (9). MS2 spectra were searched using SEQUEST (10), and identified proteins were imported to MetaCore to search for relevant networks (11). Significant protein signatures are scored by mutual information using gene expression profiles (12); signatures are extended out one hop to infer functional relevance (13). Resultant subnetworks were analyzed for biological relevance to CRC and new hypothesis generation (14).

Figure 2.2. Representative gel from experiment 1 (5144). Polygons indicate spots significantly changing between normal and cancer as determined by DeCyder. Spots identified by mass spectrometry are labeled. See Table 2.1.

Figure 2.3. Representative gel from experiment 2. Polygons indicate spots significantly changing between normal and cancer as determined by DeCyder. Magenta polygons indicate that the spot passed a multiple comparison filter test (false discovery rate) in DeCyder. Labeled spots were identified by mass spectrometry and appeared in one or more of four significant MetaCore networks. See Table 2.2.

Figure 2.4. MetaCore subnetwork. Shown is a characteristic example of one of four significant MetaCore protein interaction subnetworks returned by a search seeded by significant proteomic targets: subnetwork 1, regulation of developmental processes. Interaction effects are positive (green), negative (red), and unspecified (black). Red and blue circles beside certain objects indicate that the protein was identified by proteomics, either up-regulated in cancer (red) or down-regulated in cancer (blue). Size indicates the total number of gene products used for scoring by mutual information. Similar details for each of the other three subnetworks chosen for scoring are provided in Supplemental 2.S3.

Figure 2.5. Flow chart showing steps required to compute MI. Gene expression values (Xs, rows) were mean-shifted to 0 across samples (columns) (1). Normalized values were then summed to produce an activity value for each sample (2); activity scores are continuous and need to be assigned into discrete bins for an MI calculation (3). A joint distribution matrix is calculated between two sets of samples, e.g. normal (N) and stage D (4). MI is calculated as shown where p(x) is the marginal distribution of normal activity, p(y) is the marginal distribution of stage D activity, and p(x,y) is the joint distribution of x and y (5).

Figure 2.6. Estimated null distributions (probability density function) for hypotheses H1 and H2. For the H1 null distribution, an array of pseudo six-gene signatures was computed from 1,000,000 combinations of genes randomly selected six at a time from the decoy data set. The activity value for each pseudo signature was computed between normal and stage D followed by the MI scores, which were then used to populate the distribution. For the H2 null distributions, as each signature (best six) comprises different genes, a separate H2 null distribution was computed for each. First we computed an array of 100,000 (100K) random permutations of phenotypes (n = 171). Then using the six genes for each signature (Table 2.3), we computed an activity value for 16 pseudonormals and 50 pseudostage D samples (refer to array in Figure 2.5). We then computed 100,000 MI scores to populate the H2 distributions. H1 and H2 were modeled by the generalized extreme value distribution in Matlab.
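The permutation scheme behind the H2 null distribution can be sketched as follows. This is a minimal illustration under stated assumptions: it reports an empirical p-value directly from the permuted scores rather than fitting a generalized extreme value distribution, and `mean_gap` is a hypothetical stand-in for the MI score so that the example stays short.

```python
import random


def mean_gap(values, labels):
    """Stand-in score (assumed): absolute difference between class means."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_null(score_fn, values, labels, n_perm=1000, seed=0):
    """Scores obtained after randomly permuting the phenotype labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # class sizes are preserved by shuffling
        null.append(score_fn(values, shuffled))
    return null


def empirical_p(observed, null):
    """Fraction of null scores at least as large as the observed score.

    The +1 terms keep the estimate away from an impossible p-value of zero.
    """
    exceed = sum(1 for s in null if s >= observed)
    return (exceed + 1) / (len(null) + 1)


# Toy activity values (assumed) for 4 normal and 4 stage D samples.
toy_activity = [0.1, 0.2, 0.0, 0.3, 1.1, 0.9, 1.2, 1.0]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

observed = mean_gap(toy_activity, toy_labels)
null_scores = permutation_null(mean_gap, toy_activity, toy_labels, n_perm=500)
print(observed, empirical_p(observed, null_scores))
```

Fitting a parametric tail, as the caption describes, would allow p-values smaller than the reciprocal of the number of permutations; the empirical estimate above cannot go below that floor.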

Figure 2.7. Expanded subnetworks from corresponding signatures. Signature 6 proteins were expanded by one hop inside the corresponding parent subnetwork(s) (Figure 2.4 and Supplemental 2.S3) to infer functional relevance. Signature 6 proteins are highlighted gray; overlapping gray circles indicate multiple members or subunits of a complex participating in the signature. See also Table 2.3, column labeled Signature 6. Horizontal lines demarcate cellular compartments. Panels 1–4 are pruned versions of the parent subnetworks, Figure 2.4 and Supplemental 2.S3 (subnetworks 2–4), respectively.

Figure 2.8. Relative expression change of signature proteins and mRNA in subnetwork 1, tumor versus normal, for three patients (507, 534, and 540). mRNA values are the difference between the means measured using the normal samples (n = 16) and stage D samples (n = 50) obtained from the microarray. Protein fold change was determined by label-free mass spectrometry except for IGFBP3, which was determined by Western blot. As most targets were up-regulated in the tumor, carbonic anhydrase I (CA1) is included as a loading control to indicate that the observed up-regulation of most targets was not merely due to a greater amount of tumor digest on column versus normal.

Table 2.1. List of unique proteins (18) from a total of 20 identified by experiment one, pI 3-10. When a protein was identified at multiple spots on the gel, the fold change here represents the average value.

GENE      NCBI gi|number  Fold change  MW       Theoretical pI  Protein
ACADS     gi|19684166     -1.82        44299.8  7.96            Acyl-Coenzyme A dehydrogenase, C-2 to C-3 short chain
AHCY      gi|9951915      1.51         47799.3  5.9             S-adenosylhomocysteine hydrolase
ALDH2     gi|48256839     1.56         56318.6  6.37            mitochondrial aldehyde dehydrogenase 2
ANXA2     gi|16306978     -2.46        38594    7.77            Annexin A2
ANXA3     gi|12654115     1.93         36452.7  5.69            Annexin III
CA1       gi|4502517      -2.8         28853.4  6.67            carbonic anhydrase I
DLST      gi|643589       2.66         48555    8.89            dihydrolipoamide succinyltransferase
HSPD1     gi|77702086     2.15         61175.5  5.59            heat shock protein 60
LMNA      gi|27436948     1.55         65208    8.5             lamin A/C isoform 3
NDUFS1    gi|18490405     -1.69        79417.5  5.84            Homo sapiens similar to zinc finger protein (LOC147947), mRNA.
NM23A     gi|35068        1.79         20399.3  7.07            Nucleoside diphosphate kinase A
PMPCB     gi|94538354     1.75         54332.5  6.38            mitochondrial processing peptidase beta subunit precursor
PPIA      gi|89058333     1.57         24502    7.11            predicted: similar to peptidylprolyl isomerase A isoform 1
SERPINF1  gi|15217079     -1.39        46314    5.95            pigment epithelium-derived factor
SNX6      gi|88703041     1.52         47775.3  5.99            sorting nexin 6 isoform b
TALDO1    gi|5803187      1.24         37517.5  6.38            transaldolase 1
TF        gi|37747855     -1.66        77030.6  6.86            transferrin
TPI1      gi|88942747     -1.85        26926    8.1             predicted: similar to Triosephosphate isomerase (TIM) (Triose-phosphate isomerase) isoform 8

Table 2.2. List of unique proteins from a total of 67 identified by experiment two, pI = 4-7. Y=yes, N=no, indicates whether the protein survived the false discovery rate (FDR) significance test. When a protein was identified at multiple spots on the gel, the fold change in this table usually represents the average value.

GENE      NCBI gi|number  Fold change  FDR  MW       Theoretical pI  Protein
ACTB      gi|4501885      3.54         Y    41170    5.18            beta
ACTG2     gi|49168516     3.26         N    41898    5.2             ACTG2
ACTR3     gi|5031573      1.77         N    47432    5.54            ARP3 actin-related protein 3 homolog
ALDH2     gi|48256839     2.17         N    56420    6.67            mitochondrial aldehyde dehydrogenase 2
ANXA4     gi|4502105      1.43         Y    36063    5.75            Annexin IV
ANXA5     gi|49168528     1.99         Y    35941    4.78            Annexin V
APOH      gi|18089104     -1.52        N    38273    7.84            Apolipoprotein H (beta-2-glycoprotein I)
ATP5B     gi|32189394     -1.38        N    56525    5.14            ATP synthase, H+ transporting, mitochondrial F1 complex, beta
CA1       gi|4502517      -1.77        N    28853    6.67            carbonic anhydrase I
CAPG      gi|63252913     1.69         N    38475    5.79            gelsolin-like capping protein
CAPNS1    gi|40674605     2.68         N    28212    4.82            calpain, small subunit 1
CAPZA1    gi|5453597      -2.59        N    32903    5.36            F-actin capping protein alpha-1 subunit
CCT2      gi|5453603      -1.55        Y    57453    6               containing TCP1, subunit 2
CES1      gi|68508957     -1.96        Y    62354    6.15            carboxylesterase 1 isoform c precursor
COMT      gi|6466450      1.52         N    24434    5.02            catechol-O-methyltransferase isoform S-COMT
CTSD      gi|30584113     -1.47        N    44637    6.1             cathepsin D
CTSX      gi|3719219      -1.74        N    32681    6.1             preprocathepsin P
DPYSL2    gi|4503377      1.57         N    62255    5.93            dihydropyrimidinase-like 2
ECH1      gi|70995211     -1.68        N    35972    6.68            peroxisomal enoyl-coenzyme A hydratase-like protein
FGB       gi|70906435     6.53         N    55893    8.23            fibrinogen, beta chain preproprotein
FGG       gi|182440       1.77         N    51464    5.19            fibrinogen gamma-prime chain
FKBP4     gi|4503729      1.59         N    51773    5.22            FK506-binding protein 4
HNRPF     gi|4826760      1.99         N    45643    5.27            heterogeneous nuclear ribonucleoprotein F
HNRPH1    gi|5031753      -1.51        Y    49199    5.86            heterogeneous nuclear ribonucleoprotein H1
HP        gi|4826762      -2.53        N    45177    6.13            haptoglobin
HPX       gi|11321561     -1.39        N    51643    6.6             hemopexin
HSP90     gi|15010550     2.36         N    92412    4.61            heat shock protein gp96 precursor
HSP90AA1  gi|12654329     1.86         Y    64350    4.96            HSP90AA1 protein
HSP90AB1  gi|39644662     1.57         N    74769    4.91            HSP90AB1 protein
HSPA5     gi|16507237     -1.86        Y    72289    4.92            heat shock 70kDa protein 5
IMPDH2    gi|15277480     -1.52        N    55770    6.46            IMP (inosine monophosphate) dehydrogenase 2
KRT18     gi|33875698     2.85         Y    55788    5.49            KRT8 protein
KRT8      gi|4504919      6.13         Y    53672    5.38            8
KRT9      gi|113197968    1.52         N    48057    4.7             KRT9 protein
LDHD      gi|37595756     6.32         Y    52112    6.02            D-lactate dehydrogenase isoform 2 precursor
MAPRE1    gi|6912494      2.05         N    29981    4.87            -associated protein, RP/EB family, member 1
MRLC2     gi|15809016     1.71         Y    19767    4.54            regulatory light chain MRCL2
NNMT      gi|5453790      2.24         Y    29556    5.46            nicotinamide N-methyltransferase
OXCT      gi|48146215     -1.57        N    56159    7.21            Succinyl-CoA:3-ketoacid-coenzyme A transferase 1, mitochondrial precursor
PDIA3     gi|21361657     -1.54        N    56747    5.95            protein disulfide isomerase-associated 3 precursor
PDIA5     gi|1710248      -1.38        N    46171    4.81            protein disulfide isomerase-related protein 5
PGAM1     gi|4505753      2.62         N    28802    6.79            phosphoglycerate mutase 1 (brain)
PGM1      gi|21361621     -2.18        N    61411    6.31            phosphoglucomutase 1
PITPNA    gi|5453908      -1.35        N    31787    6.11            phosphatidylinositol transfer protein, alpha
PKM2      gi|33870117     -1.76        Y    61362    8.86            PKM2 protein
PMPCB     gi|40226469     1.84         N    53475    6.3             Mitochondrial-processing peptidase subunit beta, mitochondrial precursor
PPA1      gi|33875891     1.67         N    35449    5.92            inorganic pyrophosphatase
RRBP1     gi|110611220    -1.77        Y    108590   5.33            ribosome binding protein 1
RUVBL2    gi|5730023      1.57         N    51125    5.37            RuvB-like 2
SELENBP1  gi|16306550     -1.89        N    52357    5.9             selenium binding protein 1
SEPT2     gi|23274163     1.74         N    42659    6.4             SEPT2 protein
SERPINA1  gi|50363219     -1.66        N    46588    5.27            serine (or cysteine) proteinase inhibitor, clade A (alpha-1
SERPINB2  gi|62898301     1.55         Y    42743    5.87            serine (or cysteine) proteinase inhibitor, clade B (ovalbumin),
SERPINB6  gi|12655087     -1.98        N    42563    5.1             SERPINB6 protein
SYNCRIP   gi|33874520     1.84         N    46715    5.81            Heterogeneous nuclear ribonucleoprotein Q
TAGLN     gi|48255905     4.14         N    22596    9.4             transgelin
TALDO1    gi|5803187      1.46         Y    37517    6.38            transaldolase 1
TCP1      gi|30584211     -2.02        N    60419    5.74            Homo sapiens t-complex 1
TF        gi|4557871      -1.54        N    77000    6.75            transferrin
TPI1      gi|17389815     -1.51        N    26624    6.5             Triosephosphate isomerase 1
TUBB      gi|18088719     -1.75        N    49641    4.9             tubulin, beta
TXNDC4    gi|52487191     2.55         N    46900    5.01            thioredoxin domain containing 4 (endoplasmic reticulum)
TXNDC5    gi|42794775     2.2          N    43642    5.73            thioredoxin domain containing 5 isoform 2
VIM       gi|47115317     -3.05        Y    53548    4.94            vimentin
YWHAZ     gi|49119653     1.99         Y    29928    4.57            14-3-3 protein zeta/delta

Table 2.3. The mutual information (MI) scores for each signature. Signature 1 represents the single best protein by the measure of MI, signature 2 the highest scoring combination of 2, signature 3 the highest scoring combination of 3, etc. MI values of the corresponding whole parent sub-network (Figure 2.4, Supplemental 2.S3) are listed with each set of signatures. P-values are included for signature 6 and the whole parent sub-network: PH1 = probability of achieving a higher MI value under null hypothesis H1, PH2 = probability of achieving a higher MI value under null hypothesis H2. Colors are matched to corresponding H2-null distributions in Figure 2.6. Genes in bold font indicate proteins with direct proteomic evidence.

MetaCore network: Figure 2.4, sub-network 1
Signature  MI      Genes
1          0.4116  PDGFRB
2          0.4326  TUBA4A, TUBB3
3          0.4525  H2AFX, TUBA1A, TUBB3
4          0.4545  IGFBP4, PPID, TUBA4A, TUBB3
5          0.4820  IGFBP4, PPID, TUBA4A, TUBB3, HIST1H2AB
6          0.4981  CAPG, HIST1H2AB, IGFBP3, ITGB1, TUBA4A, TUBB3
Signature 6: PH1 = 0.0004, PH2 << 0.0001
Whole parent network: MI = 0.1774, PH1 = 0.73, PH2 = 0.63

MetaCore network: Supplemental 2.S3, sub-network 2
Signature  MI      Genes
1          0.4971  PRDX6
2          0.4981  PLA2G10, PRDX6
3          0.4713  FOS, PLA2G10, PRDX6
4          0.4786  CSNK2A2, HSP90AA1, PLK1, RB1
5          0.4530  FOS, PLA2G10, PLA2G4A, PLA2G6, PRDX6
6          0.4628  CSNK2A2, GNA12, PDGFRB, PLA2G12A, PLK1, SERPINA1
Signature 6: PH1 = 0.0009, PH2 << 0.0001
Whole parent network: MI = 0.1668, PH1 = 0.92, PH2 = 0.82

MetaCore network: Supplemental 2.S3, sub-network 3
Signature  MI      Genes
1          0.4971  PRDX6
2          0.5171  CCT3, TRAP1
3          0.5717  PPA1, TRAP1, TUBA1A
4          0.5717  HSP90AA1, PPA1, TRAP1, TUBA1A
5          0.6063  CCT2, CCT6A, CCT7, SMARC4A, TRAP1
6          0.6063  CCT3, CCT5, CCT7, DYNLT3, PLK1, TRAP1
Signature 6: PH1 = 0.0007, PH2 << 0.0001
Whole parent network: MI = 0.1879, PH1 = 0.42, PH2 = 0.44

MetaCore network: Supplemental 2.S3, sub-network 4
Signature  MI      Genes
1          0.4000  TUBA4A
2          0.4552  NCBP2, POLR2D
3          0.4983  RPS15A, SMARC4A, TUBA4A
4          0.5057  ACTL6A, HSP90AB1, SMARC4A, SMARCB1
5          0.5462  ACTL6A, IFITM1, POLR2D, SMARC4A, SYNCRIP
6          0.5398  ACTL6A, HSP90AB1, IFITM1, POLR2D, SMARC4A, RPS2
Signature 6: PH1 = 0.0002, PH2 << 0.0001
Whole parent network: MI = 0.2176, PH1 = 0.92, PH2 = 0.82

Supplemental 2.S1. Clinical cohort. For each patient a pair of biopsies was obtained, one from the tumor and one from the adjacent normal epithelium.

Supplemental 2.S2. Distribution of decoy genes over the chromosomes.

Supplemental 2.S3 (3 panels). Additional three MetaCore protein interaction sub-networks returned by a search seeded by significant proteomic targets. Interaction effects are positive (green), negative (red), and unspecified (black). Red and blue circles beside certain objects indicate that the protein was identified by proteomics, either up-regulated in cancer (red) or down-regulated in cancer (blue). Size indicates the total number of gene products used for scoring by mutual information.

Supplemental 2.S3: Sub-network 2 – Cell Proliferation.

Supplemental 2.S3: Sub-network 3 – organization and biogenesis.

Supplemental 2.S3: Sub-network 4 – Apoptosis and regulation of programmed cell death.

Chapter III - An Integrative -omics Approach to Identify Functional Sub-networks in Human Colorectal Cancer.

This work was published in PLoS Computational Biology (2009)

Rod K. Nibbe, Mehmet Koyutürk, Mark R. Chance

ABSTRACT

Emerging evidence indicates that gene products implicated in human cancers often cluster together in “hot spots” in protein-protein interaction (PPI) networks. Additionally, small sub-networks within PPI networks that demonstrate synergistic differential expression with respect to tumorigenic phenotypes were recently shown to be more accurate classifiers of disease progression when compared to single targets identified by traditional approaches. However, many of these studies rely exclusively on mRNA expression data, a useful but limited measure of cellular activity. Proteomic profiling experiments provide information at the post-translational level, yet they generally screen only a limited fraction of the proteome. Here, we demonstrate that integration of these complementary data sources with a “proteomics-first” approach can enhance the discovery of candidate sub-networks in cancer that are well-suited for mechanistic validation in disease.

We propose that small changes in the mRNA expression of multiple genes in the neighborhood of a protein-hub can be synergistically associated with significant changes in the activity of that protein and its network neighbors. Further, we hypothesize that proteomic targets with significant fold change between phenotype and control may be used to “seed” a search for small PPI sub-networks that are functionally associated with these targets. To test this hypothesis, we select proteomic targets having significant expression changes in human colorectal cancer (CRC) from two independent 2-D gel-based screens. Then, we use random walk based models of network crosstalk and develop novel reference models to identify sub-networks that are statistically significant in terms of their functional association with these proteomic targets. Subsequently, using an information-theoretic measure, we evaluate synergistic changes in the activity of identified sub-networks based on genome-wide screens of mRNA expression in CRC. Cross-classification experiments to predict disease class show excellent performance using only a few sub-networks, underscoring the strength of the proposed approach in discovering relevant and reproducible sub-networks.

INTRODUCTION

Colorectal cancer (CRC) is the second leading cause of cancer death in adult Americans [1]. Interest in this complex disease is represented by a very mature body of research, much of it at the genomic level. Yet the identification and verification of proteins that have a functional role in the patho-physiology of CRC remains an important goal as proteins directly mediate the functions dysregulated in the disease. Modern, high-throughput proteomic methods provide one way of profiling the significant changes in protein expression of tumor samples with respect to control, using tissue biopsies obtained from patients diagnosed with this disease [37,39,56].

Proteomic screening techniques are particularly useful for furthering the understanding of the mechanisms that underlie complex phenotypes like CRC, in that they provide information at the post-translational level. However, due to various biological and experimental constraints (e.g., ascertainment bias and physical properties of proteins), proteomic methods may screen only a limited fraction of proteins and protein isoforms present in cells and tissues. We propose that this limitation may be mitigated through the integration of proteomic data with genome scale data sources, such as measurements of gene expression. In addition, protein-protein interaction (PPI) databases, which are rapidly growing in terms of both the quality and quantity of their annotations, provide another source of genome scale data integration [85]. Such integrative approaches can potentially lead to functional inference at the systems level, through identification of pathways and molecular sub-networks that are implicated in CRC.

In support of this approach, a recent review by Ideker and Sharan [86] summarizes studies that indicate that genes with a role in cancer tend to cluster together on well-connected sub-networks of protein-protein interactions. This suggests a hypothesis that the synergistic expression of multiple cancer-related genes at the level of mRNA can co-regulate the expression of proteins in their immediate “network neighborhood”. These differentially expressed proteins may be captured by expression proteomics experiments, thus their network neighborhood should provide an ideal starting place to search for sub-networks with a possible role in the disease.

The effectiveness of network-based approaches to the identification of multiple disease markers has been demonstrated in the context of various diseases, including Huntington’s disease [87], the inflammatory response [88], and human breast cancer [89]. Furthermore, it was recently shown that “differentially expressed sub-network markers” were more accurate predictors of metastasis in breast cancer (compared to single gene markers) [48]. However, existing approaches are generally limited to mRNA expression data in terms of quantification of molecular expression, which captures post-transcriptional activity only to a limited extent [90, 91]. Consequently, inclusion of protein expression data in the search for sub-network markers has the potential to improve the effectiveness of systems biology approaches [92]. However, it remains largely unknown how a network-based approach may be enhanced when starting with proteomic data.

In this paper, we propose a novel computational approach that takes into account certain topological features of the interactome, namely connectivity and proximity, for searching the neighborhoods of proteomic targets to find significant sub-networks implicated in CRC. In doing so, we partly overcome (i) the bias inherent in proteomic profiling experiments, particularly those that are gel-based, which are typically limited to capturing changes only in relatively abundant proteins and (ii) the noise, missing data, and ascertainment bias in PPI data. This is accomplished by assessing the functional association between proteins based on the quantification of the statistical significance of network crosstalk through information-flow based modeling of the PPI network and development of a reference model that takes into account the network connectivity of proteomic targets. We hypothesize that identification of candidate sub-networks with a significant association to proteomic targets can reveal proteins that are not detected to be differentially expressed at the level of the proteome, but whose activity in the network may play a key role in maintaining the phenotype. Consequently, the proposed framework provides a means for expanding proteome expression data to infer a role for proteins that exhibit significant crosstalk to the proteomic targets. The flow of the proposed computational framework is illustrated in Figure 1.

A key objective of this study is to systematically elaborate a proteomics-driven approach as a sound method for inferring small sub-networks implicated in complex phenotypes, and ultimately make these methods practically available to a wider community of researchers working in this area. For this purpose, we ground our approach on the hypothesis that the observed fold change of the proteomic targets may be associated with the synergistic dysregulation of their interacting partners at the level of mRNA. From a computational perspective, our hypothesis is based on the premise that sub-networks which exhibit significant association with the proteomic targets should also show a significant change in activity between control and cancer. To test this hypothesis, we first score each protein in the network based on its crosstalk with the proteomic targets. In order to account for noise, incompleteness of data, and ascertainment bias, we also develop novel methods for assessing the significance of these “crosstalk scores”. Then, for each proteomic target, we identify a candidate sub-network that is composed of its interacting partners with significant crosstalk scores. Subsequently, using an information theoretic measure, we evaluate the synergistic differential expression of these candidate sub-networks between control and disease, based on changes in mRNA expression obtained from microarray experiments performed on tissue biopsies collected from a cohort of patients with CRC. Finally, using the sub-networks that exhibit significant synergistic dysregulation as features, we develop classifiers to predict disease class across different data sets.

The proposed computational approach for assessing functional association between proteomic targets and other proteins uses a random-walk based algorithm. Recently, Kohler et al. [93] and Chen et al. [94] used similar network algorithms to prioritize candidate disease genes implicated by linkage analysis in a variety of human diseases. Vanunu and Sharan [95] developed a global, propagation-based method that exploits information on known causal disease genes and PPI confidence scores. Their method more accurately recovered known disease gene relationships compared to several other extant methods. In contrast to these applications, rather than using the raw scores obtained by such information flow based algorithms, we develop reference models to assess the statistical significance of these scores, with a view to identifying proteins that are significantly associated with proteomic targets.

Furthermore, our biological hypothesis, which drives our approach, is that targets (proteomic or genomic) significant for the CRC phenotype may reside in or near cancer hotspots in the network, and thus present an ideal starting place to search for high-value sub-networks associated with the disease. Therefore, our computational approach does not rely on canonical disease-related genes or proteins; rather, it is a global, unbiased search that tries to identify network interactions statistically significant with respect to all targets in an experimentally-derived set.
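A minimal sketch of a random walk with restart, the kind of information-flow scoring referred to above, is given below. The restart probability of 0.3, the convergence tolerance, the function name, and the three-protein toy network are assumptions for illustration only. The sketch also assumes an undirected network given as an adjacency dictionary in which every node has at least one neighbour.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adj   : dict mapping each node to an iterable of its neighbours
    seeds : collection of seed nodes (e.g. proteomic targets)
    """
    nodes = sorted(adj)
    # Restart distribution: uniform over the seed proteins.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        nxt = {v: restart * p0[v] for v in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for u in nodes:
            if not adj[u]:
                continue  # isolated node: assumed absent in this sketch
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path network A - B - C (assumed), seeded at A.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
scores = random_walk_with_restart(toy_network, {"A"})
for protein in sorted(scores, key=scores.get, reverse=True):
    print(protein, round(scores[protein], 4))
```

Raw scores of this kind favour highly connected proteins, which is one motivation for assessing them against a reference model instead of using them directly.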

Our previous work in this area [96] was limited in scope due to the lack of access to the topology of the commercial PPI we employed. This prevented us from assessing the importance of topology for sub-network generation, which is the primary focus of our computational approach in this study. Likewise, our network scoring and statistical hypothesis testing were greatly limited in the previous work due to incomplete access to an unpublished microarray data set. For the same reason we were practically prevented from iteratively adjusting network search parameters in the commercial software that would have generated a large list of candidate sub-networks for scoring.

Here we describe a new network search method for finding high-value candidate sub-networks associated with CRC. To overcome the limitations of the previous study and to permit independent evaluation of our methods, we utilize a public PPI (HPRD) and public microarrays (Gene Expression Omnibus) to evaluate performance using two independent sets of proteomic targets obtained by 2D-PAGE that are also publicly available. We compare this result to that obtained using a set of CRC driver gene mutants as seeds for the network search. The basis for this test is the hypothesis that if mutated gene products map to cancer hotspots on the network, they would be similarly useful as seeds for our network search algorithm. To reveal the practical utility of our integrative approach, and to extend it beyond a merely theoretical computational framework, we validate by western blot several targets in a sub-network predicted by our method to be dysregulated, using a cohort of tissue biopsies not used in the original proteomic screen. Finally, we employ a cross-validation approach to compare the disease classification performance of the proteomic- versus genomic-derived sub-networks.

Our results show that the proposed proteomics-driven approach, as it integrates a variety of biologically relevant data, can identify significant sub-networks implicated in a complex phenotype, i.e., CRC. The definition of terminology frequently used in this paper is provided in Table 1.

MATERIALS AND METHODS

Proteomic Methods

Target screen: The Nibbe et al. proteomic targets were determined using two gel-based screens of twelve and six, respectively, late-stage CRC tumor tissue biopsies (with matched adjacent normals) obtained from the Case Comprehensive Cancer Center. Briefly, the biologically significant spots between normal and tumor were identified by image analysis of the 2D-gels. The spots were then robotically excised, digested by trypsin, and the peptide sequences determined by LC-MS/MS. Parent proteins were subsequently identified by database search. Full experimental details as well as the lists of the targets identified in both screens can be found in Nibbe et al. [96]. It merits emphasis that the targets selected for network analysis were highly significant given the stringent p-values used (< 0.01) at the level of peptide and protein identification. The targets in the Friedman seed were similarly identified (see the Methods section of Friedman et al.) on a smaller cohort of paired biopsies (n=6) of mixed stage CRC.

Western blot: The tissue samples were thawed and homogenized in Laemmli buffer with a Polytron mixer. Protein concentration was determined by a kit (Amersham Biosciences, 2D-Quant). Aliquots were diluted to 5 µg/µL and stored at -80 °C. A 15 µg aliquot of total protein was separated by 1D-PAGE on homogeneous 10% gels. The protein was immediately transferred to a nitrocellulose membrane (40 mA for 4 hours on ice). Membranes were blocked overnight at 4 °C with 5% milk in TBS-T, washed 2X with TBS-T at room temperature, and subsequently incubated with the primary antibody (Sigma) overnight at 4 °C. The membranes were once again washed 2X at room temperature and incubated with the secondary antibody (Cell Signaling) for two hours. The membranes were then washed 3X with TBS-T, incubated with ECL reagent (Pierce) and exposed from one to ten minutes (protein dependent). Fold change was determined using the 2D-QUANT software (Amersham Biosciences).

Computational Methods

The computational framework for integrating proteomic, transcriptomic, and interactomic data to discover sub-networks implicated in complex phenotypes is shown in Figure 1. As seen in the figure, we first identify disease targets with significant differential expression with respect to control, via proteomic screening as described above. Once these targets, called proteomic seeds, are identified, we map these seeds on the PPI network obtained from HPRD to identify proteins that are functionally associated with the proteomic seeds.

In order to develop biologically sound measures to quantify the functional association between proteins, we develop information flow based algorithms to

compute crosstalk scores, which capture network proximity and connectivity to

proteomic seeds. We discuss this procedure in Subsections A and B. In order to

account for experimental artifacts, incompleteness of data, and ascertainment

bias, we use Monte Carlo simulations to assess the significance of the crosstalk

scores computed by these algorithms. Our statistical evaluation scheme is based

on a reference model that captures the basic characteristics of the proteomic

seeds, in terms of the number of seeds and their degree distribution. This

procedure is described in Subsection C.

Subsequently, for each proteomic seed, we construct two “candidate sub-networks”: (i) the sub-network induced by all interacting partners of the seed protein, and (ii) the sub-network induced by the interacting partners that have significant crosstalk scores (in our experiments, we use a p-value cut-off of 0.001 to determine

“significant crosstalkers”). Finally, we evaluate the mutual information score of each candidate sub-network with respect to the phenotype of interest (in this paper, CRC), using mRNA expression data for test and control samples. For this purpose, we use an established information-theoretic scheme that quantifies synergistic differential expression in terms of the mutual information between the aggregate expression of the sub-network and disease classes across samples.

This procedure is explained in Subsection D. In order to assess the statistical significance of synergistic differential expression, we also use Monte Carlo simulations based on reference models that accurately capture the basic topological characteristics of each sub-network. This procedure is explained in

Subsection E. We then use identified sub-networks to develop classifiers for predicting disease class in CRC. This procedure is explained in Subsection F.

A. Relationship between Synergistic Expression, Functional Association, and Network Topology

Systematic studies of differentially expressed genes in certain phenotype

classes show that these genes are related to each other in molecular networks, composed of protein-protein interactions, transcriptional regulatory interactions, and metabolic interactions [99]. In one of the early algorithmic studies, Ideker et al. [100] develop a method for identifying differentially expressed metabolic sub-

networks with respect to GAL80 deletion in yeast. This method is based on

searching for connected groups of enzymes within the yeast metabolic network,

such that the aggregate differential expression of genes coding these enzymes is

statistically significant. Variations of this method prove useful in identifying

multiple gene markers implicated in a variety of diseases, including prostate

cancer [101], melanoma [102], and diabetes [103]. Building on these results,

information theoretic schemes for assessing synergistic differential expression

are also shown to be effective in network based disease classification [48, 104].

While differential network analysis is effective in identifying multiple gene

markers, most of the existing methods utilize network information to primarily find

the genes that are connected, hence potentially related to each other. In other

words, these approaches do not take into account network topology, connectivity

patterns, or degree of connectivity between proteins. This is because (i) much of

the available network information is noisy and incomplete [105], therefore,

connectivity patterns cannot be interpreted as well-defined wiring schemes, and

(ii) network models (particularly, high-throughput protein-protein interactions)

provide only a high-level qualitative description of the information flow in the cell.

However, several studies show that variations in molecular expression can be interpreted in terms of network topology (e.g., subunits of a protein complex are co-expressed significantly over a time course [106], and functional similarity of proteins correlates with proximity in a network of interactions [107,108]).

Motivated by these considerations, we develop network-based scoring schemes to quantify the crosstalk between proteomic seeds and the rest of the proteins in a network of interactions. Based on the premise that synergistic changes in transcriptional expression may be associated with significant changes in proteomic activity, we expect that proteins that demonstrate significant crosstalk with proteomic seeds will be good candidates for being implicated in the phenotype of interest. In order to assess the crosstalk between a group of proteomic targets and any other protein in the network accurately, we develop information flow based algorithms, as discussed in the next section.

B. Network Crosstalk: Capturing Functional Association via Connectivity

and Proximity

Let G=(V,E) be a network of protein interactions, where V consists of the proteins in the network, and an undirected edge uv ∈E represents an interaction between proteins u ∈V and v ∈V. For convenience, we also define N(v) as the set of interacting partners of protein v ∈V, i.e., N(v) ={u ∈V: uv ∈E}. Let S ⊆ V be the set of proteomic seeds, i.e., the proteins that are identified by proteomic studies to exhibit significant fold change with respect to the phenotype of interest.

Our objective is to compute a score α(v) for each protein v ∈V, to quantify the

network crosstalk between v and the proteins in S. Here, network crosstalk is

used as an indicator of functional association between proteins.

In order to develop a biologically sound measure of network crosstalk, we

rely on the following observations: (i) Functional similarity between two proteins,

as measured by semantic similarity of annotations [109], is

significantly correlated with their network proximity, as measured by the shortest

path (number of hops) between these proteins [107,108]. (ii) Existence of

multiple alternate paths between two proteins is an indicator of their functional

association, since functional multiple paths are often conserved through evolution

owing to their contribution to robustness against perturbations, as well as

amplification of signals [110].

To incorporate both the number of hops and multiple alternate paths into

the assessment of crosstalk between proteins, we use an information flow based

algorithm based on random walks with restarts [111]. This algorithm can be

considered a generalization of Google’s well-known PageRank algorithm [112].

Furthermore, a special case of the proposed crosstalk score, when |S|=1, is a

network proximity measure [111] known to be closely related to commute

distance and effective resistance [113] in graphs. Similar graph-theoretic measures are also used for the identification of functional modules in PPI networks [114], annotation of protein function [115], and prioritization of disease genes [93-95].

We assign crosstalk scores to all proteins in the network for a given S by

simulating a random walk as follows. The random walk starts at a randomly

chosen protein in S. At each step, when the random walk is at some protein v, it

either moves to an interacting partner of v with probability 1-r, or it restarts at a

protein in S with probability r. Here, the parameter 0≤r≤1 is called the restart

probability (in our experiments, we use r=0.5). For each move, the interacting

partner to be moved to is selected uniformly at random from N(v). However, the

move probabilities can also be adjusted to reflect the confidence of each

interaction, so that more reliable interactions contribute more to the quantification

of crosstalk. In other words, one can define the probability of a move from v to u

as P(u,v)=w(u,v)/Σu’∈N(v) w(u’,v) if u ∈N(v), 0 otherwise. Here, w(u,v) denotes the

reliability of the interaction between u and v. Similarly, for each restart, the

protein to be restarted is selected uniformly at random from S. These

probabilities can also be adjusted to reflect the significance of the fold change of

each protein in S, so that proteins with more significant fold change are

considered as more reliable seed proteins. In other words, one can define the

probability of restart at u∈V as ρ(u)=zP(u)/Σu’∈S zP(u’) if u ∈S and 0 otherwise.

Here, zP(u) denotes the z-score of the fold change of u with respect to the

phenotype of interest, based on proteomic screening.
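The weighted move and restart probabilities described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the nested-dictionary layout used for w(u,v), and the toy reliability values and z-scores are assumptions made here for the example, and the z-scores are assumed to be positive.

```python
def move_probabilities(weights, v):
    """Return P(u, v) = w(u, v) / sum of w(u', v) over u' in N(v).

    weights[v] maps each interacting partner u of v to the reliability w(u, v).
    """
    total = sum(weights[v].values())
    return {u: w / total for u, w in weights[v].items()}


def restart_probabilities(seed_z):
    """Return rho(u) = z_P(u) / sum of z_P(u') over u' in S.

    seed_z maps each seed protein to the (positive) z-score of its fold change.
    """
    total = sum(seed_z.values())
    return {u: z / total for u, z in seed_z.items()}


# Toy example with hypothetical reliabilities and z-scores.
weights = {"v": {"a": 0.9, "b": 0.3}}
print(move_probabilities(weights, "v"))
print(restart_probabilities({"s1": 4.0, "s2": 2.0, "s3": 2.0}))
```

With uniform weights and equal z-scores, both functions reduce to the uniform choices used in our experiments.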

Based on this random walk model, we define the crosstalk between the

proteins in S and each protein v∈V as the relative amount of time spent at v by

such an infinite random walk, or equivalently, the probability that the random walk

will be at protein v at a randomly chosen time step after the random walk

proceeds for a sufficiently long time. More precisely, let αt denote a |V|-

dimensional vector, such that αt(v) is equal to the probability that the random

walk will be at protein v at step t, where ||αt||1=1 (here, ||.||1 denotes the 1-norm of

a vector, defined as the sum of magnitudes of its elements). Let P denote the

stochastic matrix derived from network G=(V,E), i.e., P(u,v)=1/|N(v)| if uv ∈E, 0

otherwise. Then, we have

αt+1 = (1-r)P αt + rρ, (1)

where ρ denotes the restart vector with ρ(u)=1/|S| for u∈S, and 0 otherwise.

Then, letting α0= ρ, the vector containing the crosstalk scores for each node in

the network is given by α=limt→∞ αt. Observe that this formulation lends itself to

an iterative algorithm to compute crosstalk scores efficiently, where each iteration requires O(|E|) time, since P is a sparse matrix with 2|E| non-zero entries.

Note that, when r=0, α is equal to the eigenvector of P that corresponds to its

largest eigenvalue (with numerical value 1), i.e., α(v) is exactly equal to the page

rank of v in G for all v ∈ V. Therefore, the crosstalk score of a protein is not only

an indicator of its connectivity and proximity to seed proteins, but it is also

influenced by the centrality of the protein in the network. In order to account for

such sources of bias, as well as the choice of parameter r (in our experiments,

we use r=0.5), we adjust the crosstalk scores statistically as we discuss in the

next section.
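The iteration in Equation (1) can be illustrated with a minimal Python sketch. The adjacency-dictionary representation, the function name, the convergence tolerance, and the four-protein toy network are assumptions made for illustration; the sketch also assumes that every protein has at least one interacting partner.

```python
def crosstalk_scores(neighbors, seeds, r=0.5, tol=1e-10, max_iter=1000):
    """Iterate alpha_{t+1} = (1 - r) P alpha_t + r rho until convergence.

    neighbors: dict mapping each protein v to the set N(v) of its partners
    seeds:     iterable of seed proteins S
    Returns a dict alpha with the (unadjusted) crosstalk score of each protein.
    """
    seeds = set(seeds)
    rho = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in neighbors}
    alpha = dict(rho)  # alpha_0 = rho
    for _ in range(max_iter):
        nxt = {v: r * rho[v] for v in neighbors}  # restart term
        for v, nbrs in neighbors.items():
            if not nbrs:
                continue  # isolated proteins are assumed absent
            share = (1.0 - r) * alpha[v] / len(nbrs)  # P(u, v) = 1 / |N(v)|
            for u in nbrs:
                nxt[u] += share
        delta = sum(abs(nxt[v] - alpha[v]) for v in neighbors)
        alpha = nxt
        if delta < tol:
            break
    return alpha


# Toy path network a - b - c - d with a single seed a.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
alpha = crosstalk_scores(toy, ["a"], r=0.5)
print({v: round(s, 4) for v, s in alpha.items()})
```

Each pass touches every edge twice, which matches the O(|E|) cost per iteration noted above. On the toy path the scores decay with the number of hops from the seed.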

C. Dealing with Experimental Artifacts, Ascertainment Bias, and

Incomplete Data

Due to variability in physical properties of proteins and other experimental

artifacts, it is likely that there will be significant ascertainment bias in the selection

of proteomic seeds, as well as the availability of interaction data for each protein

[116]. Indeed, our results show that the seed proteins extracted by proteomic

screening are likely to be highly connected in the PPI network derived from

HPRD. More specifically, the 60 proteins that are identified to have significant

fold change (p < 0.01) in late stages of human colorectal cancer have 24.1

interactions in HPRD on average, while the average degree of a protein in the

HPRD network is 9.1. Consequently, highly connected proteins in the network

are likely to be assigned artificially high crosstalk scores just by chance. Since

available network data is often incomplete and prone to ascertainment bias,

these effects are likely to amplify the ascertainment bias and skew the results

toward well-studied proteins. However, we are very interested in finding those

proteins that are relatively less characterized but may provide novel insights into

phenotype. Therefore, the crosstalk scores described above need to be assigned

significance scores based on reliable statistical models.

In order to deal with such experimental and data-related sources of bias,

we use a reference model that captures the degree distribution of seed proteins

accurately. Namely, for a given seed set S, we generate a random instance S(i) representative of S as follows. For every protein u ∈ S, we create a bucket B(u) of proteins in the network, such that $\bigcup_{u \in S} B(u) = V$ and $B(u) \cap B(u') = \emptyset$ for all distinct u, u’ ∈ S. Here, protein v ∈ V is assigned to bucket B(u) if $\big|\,|N(v)| - |N(u)|\,\big| \le \big|\,|N(v)| - |N(u')|\,\big|$ for all u’ ∈ S, and ties are broken randomly. Then, we construct S(i) by choosing one protein from each bucket uniformly at random, so that |S(i)|=|S|. Observe that

each bucket consists of proteins whose number of interactions is similar to that of a particular seed protein; therefore, each seed protein is represented in S(i) by

exactly one protein in terms of its number of interactions. Consequently, the

expected total degree of the proteins in S(i) is likely to be very close to the total

degree of the proteins in S. Once a random instance S(i) is generated, we

compute the corresponding crosstalk vector α(i) by letting ρ(i)(u)=1/|S(i)| for u∈S(i),

and 0 otherwise.

Repeating this procedure n times, where n is sufficiently large (we use

n=1000 in our experiments), we obtain a sampling {α(1), α(2), …, α(n)} of the null

distribution of crosstalk scores, with respect to seed sets that are representative

of S in terms of their size and degree distribution. We then estimate the mean

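$\mu_S = \frac{1}{n}\sum_{i=1}^{n} \alpha^{(i)}$ and standard deviation $\sigma_S$, with $\sigma_S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(\alpha^{(i)} - \mu_S\right)^2$ (computed element-wise), of the null distribution of crosstalk scores for S using this sample. Subsequently, we compute adjusted crosstalk scores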

zS(v)=(α(v)-μS(v))/σS(v) (2)

for each protein v ∈ V. These adjusted crosstalk scores represent the statistical

significance of the crosstalk between each protein and the proteins in the seed

set, accounting for the centrality of the protein in the network, as well as the degree

distribution of seed proteins.
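The degree-matched reference model and the adjustment in Equation (2) can be sketched as follows. This is a sketch under stated assumptions rather than our implementation: the function names and the toy network are hypothetical, and empty buckets (possible only when two seeds share the same degree and ties fall unevenly) are simply skipped.

```python
import random
import statistics


def degree_buckets(neighbors, seeds, rng):
    """Assign every protein to the seed whose degree is closest to its own."""
    seeds = list(seeds)
    buckets = {u: [] for u in seeds}
    for v in neighbors:
        gaps = [abs(len(neighbors[v]) - len(neighbors[u])) for u in seeds]
        best = min(gaps)
        ties = [u for u, g in zip(seeds, gaps) if g == best]
        buckets[rng.choice(ties)].append(v)  # ties are broken randomly
    return buckets


def sample_seed_set(buckets, rng):
    """Draw one protein per non-empty bucket: a degree-matched random seed set."""
    return [rng.choice(members) for members in buckets.values() if members]


def adjusted_scores(observed, null_samples):
    """z_S(v) = (alpha(v) - mu_S(v)) / sigma_S(v), estimated from null samples."""
    z = {}
    for v, a in observed.items():
        null = [sample[v] for sample in null_samples]
        mu = statistics.mean(null)
        sigma = statistics.stdev(null)  # uses the n - 1 denominator
        z[v] = (a - mu) / sigma if sigma > 0 else 0.0
    return z


# Toy network: degrees are h=3, c=2, and a=b=d=1.
toy = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h", "d"}, "d": {"c"}}
rng = random.Random(0)
buckets = degree_buckets(toy, ["h", "a"], rng)
print(sample_seed_set(buckets, rng))
```

In a full run, each sampled seed set would be passed to the crosstalk computation to obtain one null vector α(i), and the n resulting vectors would be supplied to `adjusted_scores` as `null_samples`.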

D. Assessing Synergistic Dysregulation of Candidate Sub-Networks

Once all proteins in the network are scored according to their crosstalk with proteomic seeds, we construct candidate sub-networks as follows:

1. Interactor sub-networks: For each proteomic seed u, the sub-network

induced by its interacting partners in the network (N(u)) is considered a

candidate sub-network, based on the hypothesis that significant changes

in the expression of a protein may be associated with synergistic changes

in the transcriptional expression of proteins in its neighborhood.

2. Crosstalker sub-networks: For each proteomic seed u, the sub-network

induced by the proteins in N(u) that have significant adjusted crosstalk

scores with respect to S is considered a candidate sub-network, based on

the hypothesis that sub-networks composed of proteins with significant

crosstalk to the proteomic seeds (as opposed to solely interacting with one

proteomic seed) are likely to exhibit significant synergistic differential

expression.

Formally, the set of candidate sub-networks is defined as $C(S) = \{N(u) : u \in S\} \cup \{N^*(u) : u \in S\}$, where $N^*(u) = \{v \in N(u) : z_S(v) > z^*\}$. Here, $z^*$ denotes the cut-off for adjusted crosstalk scores to be considered significant. In our experiments, we use z*=3.45, to reflect a p-value cut-off of 0.001, under the assumption of normally distributed crosstalk scores.
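As an illustration of the definition of C(S), the following Python sketch builds both versions of each candidate sub-network from an adjacency dictionary and a dictionary of adjusted crosstalk scores. The function name, the data layout, and the toy scores are assumptions made for the example.

```python
def candidate_subnetworks(neighbors, seeds, z, z_star=3.45):
    """Return the interactor sets N(u) and crosstalker sets N*(u) for each seed.

    neighbors: dict mapping each protein to the set of its interacting partners
    z:         dict of adjusted crosstalk scores z_S(v)
    z_star:    significance cut-off on the adjusted scores
    """
    interactor = {u: set(neighbors[u]) for u in seeds}
    crosstalker = {
        u: {v for v in neighbors[u] if z.get(v, float("-inf")) > z_star}
        for u in seeds
    }
    return interactor, crosstalker


# Toy example with hypothetical adjusted scores.
toy = {"s": {"a", "b", "c"}}
scores = {"a": 5.0, "b": 1.0, "c": 3.45}
interactor, crosstalker = candidate_subnetworks(toy, ["s"], scores)
print(sorted(interactor["s"]), sorted(crosstalker["s"]))
```

Because the comparison is strict, a protein whose score equals the cut-off is not counted as a crosstalker in this sketch.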

For each candidate sub-network Q in C(S), we quantify the synergistic expression of the proteins in Q using an information-theoretic scheme developed by Chuang et al. [48]. Namely, for protein v∈V, let e(v) denote the properly

normalized m-dimensional mRNA expression vector, provided by genome-scale

transcriptomic screening of m disease and control samples. Let c denote an m-dimensional binary vector indicating the phenotype class of each sample, such that c(i)=1 if the ith sample is diagnosed with the disease, 0 otherwise.

Furthermore, define the aggregate expression vector e(Q) for the sub-network induced by set of proteins Q as

$$e(Q) = \frac{1}{|Q|} \sum_{v \in Q} e(v). \qquad (3)$$

Then, the synergistic differential expression ϕ(Q) of the genes coding for

proteins in Q with respect to the phenotype of interest is given by the mutual

information between the quantized aggregate expression ē(Q) and c, i.e.,

ϕ(Q)=I(ē(Q), c)=H(ē(Q))+H(c)-H(ē(Q), c). (4)

Here, ē(Q) denotes a discrete-valued vector obtained by quantizing e(Q) into k

bins, H(x) denotes the entropy of a discrete-valued vector x over a finite alphabet

A, i.e., H(x) = -Σa∈A p(a) log(p(a)), and p(a)=|{i:x(i)=a}|/m (in the context of our

problem, A represents the set of bins). In this paper, we use k=6, since this value

of k was found to provide reasonable estimates for mutual information in our

experiments.
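Equations (3) and (4) can be illustrated with a short Python sketch. The binning rule (k equal-width bins over the observed range) and the base-2 logarithm are assumptions made here for illustration, since neither is fixed by the description above; the function names and toy expression values are likewise hypothetical.

```python
import math
from collections import Counter


def aggregate_expression(expr, Q):
    """Equation (3): average the expression vectors of the proteins in Q."""
    m = len(next(iter(expr.values())))
    return [sum(expr[v][i] for v in Q) / len(Q) for i in range(m)]


def quantize(x, k):
    """Map each value to one of k equal-width bins over [min(x), max(x)]."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * len(x)
    width = (hi - lo) / k
    return [min(int((xi - lo) / width), k - 1) for xi in x]


def entropy(symbols):
    """Empirical entropy (in bits) of a sequence of discrete symbols."""
    m = len(symbols)
    return -sum((n / m) * math.log2(n / m) for n in Counter(symbols).values())


def mutual_information(e_Q, c, k=6):
    """Equation (4): H(e) + H(c) - H(e, c), computed on the quantized vector."""
    bins = quantize(e_Q, k)
    return entropy(bins) + entropy(c) - entropy(list(zip(bins, c)))


# Toy data: two genes, two control samples followed by two tumor samples.
expr = {"g1": [0.0, 0.0, 1.0, 1.0], "g2": [0.0, 0.0, 3.0, 3.0]}
c = [0, 0, 1, 1]
e_Q = aggregate_expression(expr, ["g1", "g2"])
print(e_Q, mutual_information(e_Q, c))
```

In the toy data the aggregate expression separates the two classes perfectly, so the mutual information equals the entropy of the class labels.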

E. Statistical Significance of Synergistic Dysregulation

Finally, we assess the statistical significance of synergistic differential

expression for each candidate sub-network. In order to do so, for a given

Q∈C(S), we generate a null distribution for synergistic differential expression of

sub-networks that reflect the topological properties of Q. Since Q is composed of

proteins that are connected to each other via a single protein (that is, the

corresponding proteomic seed), the null distribution should also be derived from

sub-networks that consist of the same number of proteins as Q, which are

connected to each other through a single protein in the network. Therefore, we

first construct a bag D of proteins in the network with degree at least |Q|, i.e.,

D={v∈V:|N(v)|≥|Q|}. Subsequently, we choose a protein v from D uniformly at random. Finally, we choose |Q| proteins uniformly at random from N(v) to

construct a random instance Q(i) representative of Q. Repeating this procedure n

times (in our experiments, we use n=1000) and computing ϕ(Q(i)), we obtain a

null distribution of synergistic differential expression for sub-networks similar to

Q. Observe that, only the size of Q(i) depends on Q in this procedure. For this

reason, in our experiments, we do not explicitly generate a null distribution for

each Q∈C(S). Rather, we generate a null distribution for sub-networks of size 2,

4, 8, 16, 32, 64. Then we interpolate the mean and standard deviation of

synergistic differential expression for these distributions, to obtain a curve that

characterizes the behavior of synergistic differential expression with respect to

sub-network size.
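The sampling scheme for the null distribution can be sketched in Python as follows. The function names are hypothetical, and `score` stands for any callable that returns ϕ for a list of proteins (for instance, a mutual information routine applied to their aggregate expression); the sketch assumes that at least one protein has degree at least equal to the requested size.

```python
import random
import statistics


def sample_null_subnetwork(neighbors, size, rng):
    """Pick a protein with degree >= size, then `size` of its partners at random."""
    bag = [v for v, nbrs in neighbors.items() if len(nbrs) >= size]
    hub = rng.choice(bag)  # raises IndexError if no protein is large enough
    return rng.sample(sorted(neighbors[hub]), size)


def null_mean_std(neighbors, size, score, n=1000, seed=0):
    """Mean and standard deviation of `score` over n random sub-networks."""
    rng = random.Random(seed)
    scores = [score(sample_null_subnetwork(neighbors, size, rng)) for _ in range(n)]
    return statistics.mean(scores), statistics.stdev(scores)


# Toy star network; `len` is used as a placeholder scoring function.
toy = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
print(sample_null_subnetwork(toy, 2, random.Random(1)))
print(null_mean_std(toy, 2, score=len, n=50))
```

Calling `null_mean_std` once for each of the sizes 2, 4, 8, 16, 32, and 64 would give the points between which the mean and standard deviation are interpolated.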

F. Sub-network Classification

In order to assess the reproducibility of discovered subnetworks across

different data sets and evaluate the potential of the proposed framework for

feature selection in classification of CRC, we perform cross-classification

experiments. In these experiments, we use the aggregate expression profiles

(e(Q)) of crosstalker and interactor subnetworks associated with Nibbe and CAN seeds as features for classification. For this purpose, in each experiment, we select the crosstalker (or interactor) subnetworks with synergistic differential

expression (ϕ(Q)) one standard deviation above random mean, according to a

specific mRNA expression data set (e.g., GSE8671). Assume that there are K

such subnetworks. Then, for each k≤K, we use the k subnetworks with maximum

ϕ(Q) to train an SVM classifier on the same data set (GSE8671), using Matlab’s

svmtrain function. Subsequently, we use this classifier to predict the class

(tumor vs. normal) of each sample on a different data set (e.g., GSE10950), using Matlab’s svmclassify function. We evaluate the performance of the

classifier using the harmonic mean of precision (selectivity) and recall

(sensitivity), known as the F-measure, defined as

$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$

Here, precision is the fraction of true positives among all samples

classified as tumor and recall is the fraction of tumor samples called accurately

by the classifier among all tumor samples.
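The F-measure can be computed directly from true and predicted class labels, as in the following minimal Python sketch. The function name and the toy labels are illustrative assumptions; a score of 0 is returned when there are no true positives, which is a convention adopted here to avoid division by zero.

```python
def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` (tumor) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # convention: no true positives gives F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = tumor, 0 = normal.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(f_measure(y_true, y_pred))
```

In the toy labels, two of the three samples called tumor are correct and two of the three tumor samples are recovered, so precision and recall are both 2/3.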

RESULTS

We searched the PPI network obtained from the Human Protein

Reference Database (HPRD) for CRC-implicated sub-networks using two distinct

sets of proteomic targets from Nibbe et al. [96] (n=67) and Friedman et al. [37]

(n=55). Both sets contain significant targets of CRC obtained by a proteomic

screen using tissue biopsies (tumor and matched controls) obtained from twelve

and six patients, respectively (see Proteomic Methods for details of the screen

performed in our lab). We call these targets proteomic seeds. The HPRD PPI

network was downloaded from the HPRD website in September 2008 and

contained 35023 binary interactions between 9299 proteins, as well as 1060 protein complexes consisting of 2146 proteins. We integrated the binary

interactions and protein complexes using a matrix model (e.g., each complex is

represented as a clique between the proteins in the complex), to obtain a PPI

network composed of 42781 binary interactions among 9442 proteins. 60 of the

proteomic seeds from the data of Nibbe et al. had at least one interaction in

HPRD, while 37 of the seeds from the data of Friedman et al. had at least one

interaction in HPRD. 14 of the proteins in the two seed sets were common.
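The matrix model used to merge binary interactions with protein complexes can be sketched in Python as follows. The function name, the input layout, and the toy identifiers are assumptions made for illustration; dropping self-interactions is also an assumption of this sketch.

```python
from itertools import combinations


def build_ppi_network(binary_interactions, complexes):
    """Merge binary interactions and complexes into one undirected edge set.

    Each complex is expanded into a clique (matrix model): every pair of its
    members is connected by an edge. Duplicate edges collapse automatically.
    """
    edges = set()
    for u, v in binary_interactions:
        if u != v:  # assumption: self-interactions are ignored
            edges.add(frozenset((u, v)))
    for members in complexes:
        for u, v in combinations(sorted(set(members)), 2):
            edges.add(frozenset((u, v)))
    return edges


# Toy input with hypothetical protein identifiers.
binary = [("A", "B"), ("B", "A"), ("B", "C")]
complexes = [["C", "D", "E"]]
edges = build_ppi_network(binary, complexes)
proteins = set().union(*edges)
print(len(edges), len(proteins))
```

A complex with k members contributes up to k(k-1)/2 edges, which is why large complexes can add many interactions to the merged network.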

For every protein in HPRD, our procedure assigns a score based on the

protein’s proximity and connectivity to all the seeds (see Methods). If the score is

not significant (at a p-value cut-off of 0.001) but the protein directly interacts with one or more of the

seeds, we call it an interactor, whereas a crosstalker is any protein whose score is significant. Note that a crosstalker is generally (but not necessarily always) an

interactor since a significant crosstalk score for a protein indicates that it is in the

network neighborhood of one or more of the seeds; however, there are many

interactors that do not qualify as crosstalkers. Overall, this procedure revealed

233 crosstalkers for Nibbe seeds, and 210 crosstalkers for Friedman seeds.

Subsequently, for each proteomic seed in each set, a candidate sub-

network consisting of its interactors, termed the interactor sub-network, was

obtained, resulting in a total of 55 interactor sub-networks (46 for Nibbe seeds

exclusively, 23 for Friedman seeds exclusively, and 14 additional sub-networks

for both). Similarly, for each seed in both sets, a crosstalker sub-network was

obtained. Thus, for every seed there are two corresponding sub-networks, an

interactor sub-network and a crosstalker sub-network. The proteins in an

interactor sub-network are merely characterized by their direct interactions with

the corresponding proteomic seed. By contrast, proteins in a crosstalker sub-

network are characterized by their degree of functional association with all proteomic seeds.

Relationship of Expression Between Crosstalkers and Individual Proteins in HPRD at the level of mRNA

We evaluated the individual differential gene expression of each crosstalker identified using the Nibbe and Friedman proteomic seeds using two microarray datasets obtained from GEO (GSE10950 & GSE8671). GSE8671 represents 64 experiments using mRNA isolated from tissue biopsies obtained from 32 patients (matched tumor and adjacent normal mucosa) performed on an

Affymetrix GeneChip (Human U133 Plus 2.0). Similarly, GSE10950 represents

48 experiments on matched tissue biopsies (24 patients) performed on an

Illumina array (Human ref-8, v2.0).

The cumulative distribution of individual differential expression scores for proteomic seeds (and a seed of CRC driver genes, discussed later), as well as for all proteins in the network, is shown in Figure 2 (see the Methods section for details on how differential

expression is quantified). As seen in the figure, we found no significant difference in the distribution of individual differential expression of the crosstalkers, as compared to the distribution of differential expression of all proteins in the HPRD network. This observation indicates that at the level of individual genes, significant network crosstalk with proteomic seeds in CRC is not associated with transcriptomic dysregulation in CRC.

Synergistic Regulation of Sub-networks Induced by Proteomic Seeds

For the purpose of discussion we will refer to a sub-network by the proteomic seed it is generated from (e.g., TCP1). For each version of each sub-network, we computed the mutual information (MI) between its aggregate expression and the phenotype (control versus tumor) using the mRNA expression data from microarrays GSE10950 and GSE8671 (see Computational Methods), and we used this score to estimate the significance of the various sub-networks in differentiating the phenotype (Figure 1). The comparison of mutual information for the two versions of each sub-network associated with the Nibbe seed is shown in Figure 3. We plotted the

results only for those (crosstalker) sub-networks where the mutual information exceeded 0.35 (approximately 1σ above the random mean). The purpose of this analysis is to understand how the synergy of each crosstalker sub-network compares to that of its corresponding interactor sub-network. The MI and significance scores for all sub-networks can be found in Supplemental Table 1.

Of the 46 candidate sub-networks associated with Nibbe proteomic seeds,

10 unique interactor sub-networks (green squares) exhibited significant MI scores. For five of these sub-networks (CCT2, TCP1, SYNCRIP, HNRPF and

HNRPH1) the crosstalker version of the sub-network was found to have enhanced MI on one or the other microarray dataset. Two crosstalker sub-networks (red diamonds), CCT2 and TCP1, show improvement over their corresponding interactor sub-networks on both arrays. Notably, on GSE10950, the mutual information score of the TPI1 crosstalker sub-network is significant, while the corresponding interactor sub-network failed to show significance.

Figure 4 shows the corresponding plots for the Friedman proteomic seeds.

Here, seven unique interactor sub-networks have significant MI scores; two of them (ANXA3 and PSMA6) were common to both sets of microarray data. For the Friedman seeds, the crosstalkers for candidate sub-network TUBA1B showed dramatically increased mutual information compared to its interactor network. Furthermore, four other crosstalker sub-networks (associated with

MYL9, GARS, ANXA3 and GSTP1) all revealed much higher synergy compared to their corresponding interactor sub-networks, two of which (MYL9, GSTP1)

failed to show significance on either array. We discuss a possible explanation for

these findings in the Discussion section.

Figures 5a and 5b show unions of crosstalker sub-networks associated with the Friedman and Nibbe seeds, respectively, for which the synergy was higher than the corresponding interactor sub-network. The graphs reveal that

many proteomic seeds reside within or near dense sub-networks of crosstalkers.

Post-Transcriptional Dysregulation of TCP1 Sub-Network

We observed that several of the sub-networks generated using the two proteomic seed sets contained proteins in common. In particular, certain sub-units of the TCP1 complex exhibited marked crosstalk in the sub-network induced by CCT2 in the Nibbe seed, and TUBA1B in the Friedman seed (Figure 4). In addition, we had previously shown [5] that certain sub-units of this complex (CCT3, CCT5, and CCT7) were also significant for the late-stage CRC phenotype, as revealed by a similar network scoring methodology but using a commercial PPI database unrelated to HPRD.

TCP1 (or TCPα) is a hetero-oligomeric complex composed of two stacked ring structures, each containing eight known subunits, and it plays a functional role in maintaining the CRC phenotype. Specifically, it was shown [97] to be required for the proper biogenesis of PLK1, a kinase that has a critical role in cytokinesis. However, other than their role as sub-units in the formation of the

TCP complex little is known about the independent role, if any, of these sub-units

in CRC [98]. Consequently, these targets present an opportunity for follow-on

mechanistic studies. For this reason, we verified the protein expression of TCP1,

CCT3, CCT5, CCT7, and PLK1 by western blot in a separate cohort of three patient sample pairs not used in the screening phase, and compared this to the average expression at the level of mRNA (Figure 6). Consistent with our hypothesis, the data indicate co-regulation at the level of mRNA and protein, but also reveal the wide variability of expression of these targets among individual patients. CCT3 and CCT7 were dramatically over-expressed in two patients (507 and 534), but less so in patient 540, which was similar to the pattern for PLK1.

Synergistic Dysregulation of Sub-networks Induced by CRC Driver-gene

Seeds

Although these data show that proteomic seeds are well-suited for identifying synergistically dysregulated sub-networks, we wished to investigate the power of genetically identified seed sets in discovering significant sub-networks. As CRC is commonly thought to be caused by the accumulation of somatic mutations, a number of cancer research labs have collaborated to conduct whole genome sequencing to identify the genes thought to be “drivers” in cancer, i.e., the genes that appeared most frequently mutated in a robust cohort of clinical biopsies. The results of one such study on human breast and colon cancer were recently reported by Sjöblom et al.

[10]. We hypothesized that the gene products of the CRC driver genes reported in this study would be located at hotspots in the interactome. Further, if the

mutations lead to dysregulation of neighboring genes at the level of mRNA, then

the seed should reveal significant sub-networks using our method. Additionally,

since there is less bias in PCR sequencing and high genome coverage, at least

as compared to proteomic profiling, we supposed that driver gene seeds (n=42)

might be superior both in terms of the number and significance of the sub-

networks identified.

As shown in Figure 7, when scored by GSE8671, only four significant sub-

networks were found. Strikingly, for every one of them, only the crosstalker sub-

networks were significant. Using GSE10950, seven sub-networks of crosstalkers

were significant, including all four found on GSE8671. For all but two of the sub-

networks (P2RX7, OBSCN), the crosstalkers show substantially higher

synergistic differential expression as compared to their interactor counterparts.

Notably, APC, a tumor suppressor gene widely viewed as the “gate-keeper” in

CRC, was associated with a significantly dysregulated sub-network with respect

to both arrays, and of all the genes in the driver seed it was found to be mutated

in the highest percentage (90%) of the clinical samples. This expected finding

may be viewed as a positive control for our analytical method.

In terms of the overall number of significant sub-networks identified,

however, there was no apparent improvement using the driver gene seed set

versus either proteomic seed set. Additionally, a number of the significant

crosstalk sub-networks identified by the proteomic seeds show markedly higher

synergy (MI>0.60) than all but one (EVL) of the sub-networks found by the driver

gene seed.

Classification Performance of Sub-networks as Features

We evaluated the quality of the crosstalker versus interactor sub-networks

in terms of their ability to classify tumor versus control on the microarrays, using

an SVM-based classifier in a cross-classification approach (see Methods). The

significant sub-networks in each group were first ranked by MI, and the features

were valued by superposing the mRNA expression values of each gene in the

sub-network. When trained on GSE10950 and validated on GSE8671, proteomic crosstalkers outperformed the interactor sub-networks (both proteomic and genomic) when the number of features used to train the classifier was three or fewer. Beyond three features, both the proteomic interactor and CAN (candidate

CRC driver genes) crosstalker sub-networks outperformed the proteomic crosstalkers (Figure 8a). Performance was similar when the training and validation sets were reversed, although the performance of proteomic crosstalkers dropped when more than two sub-networks were used for classification (Figure 8b). The raw classification data are provided in

Supplemental Table 1.
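The train-on-one-array, validate-on-the-other procedure described above can be sketched as follows. This is a minimal illustration only: a nearest-centroid rule is swapped in for the SVM so that the example needs nothing beyond the Python standard library, the mean is assumed as the aggregation of gene expression within a sub-network, and all function names, the gene names GENE_X and GENE_Y, and the expression values are hypothetical toy inputs rather than data from either array.

```python
from statistics import mean

def subnetwork_feature(sample, genes):
    """Aggregate one sample's expression over a sub-network (mean is an assumption)."""
    return mean(sample[g] for g in genes)

def featurize(samples, subnetworks):
    """One feature per sub-network for every sample (dict: gene -> expression)."""
    return [[subnetwork_feature(s, genes) for genes in subnetworks] for s in samples]

def fit_centroids(X, y):
    """Nearest-centroid stand-in for the SVM: one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def cross_dataset_accuracy(train, y_train, test, y_test, ranked_subnetworks, k):
    """Train on one array using the top-k ranked sub-networks, test on the other."""
    top_k = ranked_subnetworks[:k]
    model = fit_centroids(featurize(train, top_k), y_train)
    preds = [predict(model, x) for x in featurize(test, top_k)]
    return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

# Hypothetical toy data; sub-networks are assumed to be already ranked by MI.
ranked = [["TCP1", "CCT2"], ["GENE_X", "GENE_Y"]]
train = [
    {"TCP1": 2.0, "CCT2": 1.8, "GENE_X": 0.1, "GENE_Y": 0.0},
    {"TCP1": 1.9, "CCT2": 2.1, "GENE_X": 0.2, "GENE_Y": 0.1},
    {"TCP1": 0.1, "CCT2": 0.0, "GENE_X": 0.0, "GENE_Y": 0.2},
    {"TCP1": 0.2, "CCT2": 0.1, "GENE_X": 0.1, "GENE_Y": 0.1},
]
y_train = ["tumor", "tumor", "normal", "normal"]
test = [
    {"TCP1": 1.7, "CCT2": 1.6, "GENE_X": 0.2, "GENE_Y": 0.1},
    {"TCP1": 0.3, "CCT2": 0.2, "GENE_X": 0.1, "GENE_Y": 0.0},
]
y_test = ["tumor", "normal"]
accuracy = cross_dataset_accuracy(train, y_train, test, y_test, ranked, k=1)
```

Varying k from one upward and swapping the roles of the two data sets would reproduce the shape of the comparison, although not the SVM decision rule itself.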

DISCUSSION

We have shown that proteomic targets showing significant expression

changes for a complex phenotype, such as CRC, provide valuable inputs for our

algorithms designed to discover phenotypically significant sub-networks with

connectivity and proximity to these targets. In addition, certain crosstalker sub-

networks, when scored with respect to phenotype by the measure of mutual

information, display significant differential synergistic expression at the level of

mRNA with respect to the seed targets. When these implicated sub-networks

contain proteins with no known role in the disease, they present new

opportunities for follow-on mechanistic experiments to verify the in silico

inference of biological significance in the disease. This point cannot be over-emphasized, because in our view the promotion of a candidate, disease-associated sub-network to a functional sub-network with a validated role in disease must be accomplished by wet lab experiments.
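To make the scoring step concrete, the following sketch computes the mutual information (in bits) between a binarized sub-network activity and the phenotype label. It is an illustration under stated assumptions, not the scoring code used in this work: the activity is taken as the mean expression of the sub-network genes, the discretization is a simple median split, and the function names and toy values (including GENE_X and GENE_Y) are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import mean, median

def subnetwork_activity(samples, genes):
    """Aggregate expression of a sub-network in each sample (mean is an assumption)."""
    return [mean(s[g] for g in genes) for s in samples]

def binarize(values):
    """Discretize at the median (an assumed binning): 1 = high, 0 = low."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from the empirical joint distribution."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

# Hypothetical toy data: 1 = tumor, 0 = normal.
phenotype = [1, 1, 0, 0]
samples = [
    {"TCP1": 2.0, "CCT2": 1.8, "GENE_X": 1.0, "GENE_Y": 1.0},
    {"TCP1": 1.9, "CCT2": 2.1, "GENE_X": 0.0, "GENE_Y": 0.0},
    {"TCP1": 0.1, "CCT2": 0.0, "GENE_X": 1.0, "GENE_Y": 1.0},
    {"TCP1": 0.2, "CCT2": 0.1, "GENE_X": 0.0, "GENE_Y": 0.0},
]
mi_informative = mutual_information(
    binarize(subnetwork_activity(samples, ["TCP1", "CCT2"])), phenotype
)
mi_uninformative = mutual_information(
    binarize(subnetwork_activity(samples, ["GENE_X", "GENE_Y"])), phenotype
)
```

In this toy case the first sub-network separates the classes perfectly and the second carries no information about the phenotype, which brackets the range of scores a two-class problem can produce.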

As mentioned in the previous section, with respect to the proteomic seeds, a number of the same sub-networks showed significance (>1σ from background)

when scored by either GSE10950 or GSE8671. With respect to the driver gene

seed, every sub-network that showed significance when scored by the GSE8671

array was also found to be significant when scored by the GSE10950 array. One

explanation for why the sub-networks with respect to a given set of proteomic seeds did not show complete redundancy between arrays is that the microarrays represent experiments performed on different pathologic stages of CRC tumors,

very early stage in the case of GSE8671 (adenoma) versus a more established tumor in GSE10950 (primary). The pathologic stage of the proteomic samples in the Nibbe seed was homogeneous late stage CRC (Dukes’ D) while the Friedman seed was a mix of mid to late stage samples (Dukes’ B-D). This highlights a potential limitation of an integrated –omics approach, namely, it is often difficult to establish an optimal match of the biology underlying the measures made at the level of the proteome and transcriptome. However, in our case, if the sub-networks become dysregulated early in the disease and have a role in maintaining the phenotype through later stages, this limitation can turn into an opportunity for development of hypotheses regarding the mechanisms of the progression of CRC. In particular, the complete overlap of crosstalk sub-networks between arrays observed with the driver gene seed indicates the synergistic activity of these sub-networks may be independent of pathologic stage.
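The significance criterion referred to above (a score more than 1σ above background) can be illustrated with a small sketch. It assumes the background is estimated by scoring random gene sets of the same size as the candidate; the scoring function, the number of random draws, and all names below are hypothetical placeholders, and the toy per-gene scores merely stand in for a mutual information score computed on an array.

```python
import random
from statistics import mean, stdev

def null_distribution(score_fn, all_genes, size, n_random=200, seed=0):
    """Score n_random random gene sets of the given size; return (mean, sd)."""
    rng = random.Random(seed)
    scores = [score_fn(rng.sample(all_genes, size)) for _ in range(n_random)]
    return mean(scores), stdev(scores)

def is_significant(score, null_mean, null_sd, n_sigma=1.0):
    """True if the score exceeds the background mean by more than n_sigma sd."""
    return score > null_mean + n_sigma * null_sd

# Toy stand-in for a real scoring function: a fixed, hypothetical per-gene score.
toy_gene_score = {f"g{i}": i / 100 for i in range(50)}
toy_gene_score.update({"c1": 0.9, "c2": 0.8})

def toy_score(genes):
    return mean(toy_gene_score[g] for g in genes)

background = [g for g in toy_gene_score if g.startswith("g")]
null_mean, null_sd = null_distribution(toy_score, background, size=2)
candidate_score = toy_score(["c1", "c2"])
significant = is_significant(candidate_score, null_mean, null_sd)
```

Repeating the call with the scoring function of each array would give one significance call per array, and the overlap between the two resulting sets is the quantity discussed in this paragraph.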

We also noted that only a relatively small fraction of the seeds induced significant sub-networks, either interactors or crosstalkers, and this was the case for both the proteomic and the genomic seeds. One potential explanation for this observation is that current human PPI networks capture only a very small fraction of all protein relationships in the human interactome [84], and therefore cannot be expected to reveal a significant sub-network for every experimentally determined seed. As these networks improve, we expect their value in uncovering interesting biology will only grow.

The classification performance indicates that experimentally-derived

proteomic disease targets combined with our network search algorithm can

discover high-valued sub-networks for mechanistic in vivo verification. This was consistent with our hypothesis, and supports the claim that a proteomic seed can identify sub-networks that provide additional pathways of interest (e.g. CCT2, TCP1). To strengthen this claim, in an independent cohort of patient biopsies, we validated the differential expression of several targets in the TCP1 sub-network, predicted by our model to be coordinately dysregulated.

The genomic seed showed excellent classification performance, and

crosstalkers were superior in most instances to their corresponding interactor

sub-networks, consistent with our computational hypotheses. When three or

more features were used to train the classifier, they were also better than the proteomic crosstalkers. This result is not entirely unexpected, as the proteomic data have low coverage and may lack key seeds, and thus important sub-networks. Moreover, the favorable classification performance of the genomic-derived sub-networks may be viewed as a positive control for this experimental approach. It is also unlikely that all relevant sub-networks are regulated at the level of transcription, and this may reduce the number of significant sub-networks discoverable by our approach. Nevertheless, the

approach can be generalized to many proteomics expression data sets to

discover novel sub-networks dysregulated in many complex diseases.

In many classification applications, high dimensionality is an important

problem and it is often desirable to be able to choose a small number of features that will provide reasonable performance (to overcome the “curse of dimensionality”). In this respect, the classification performance provided by only a few sub-networks is indeed very promising, in that “crosstalk to proteomic targets” may actually provide a shortcut to the identification of a compact set of useful sub-network features. As our classification experiments were carried out in a cross-classification setting, the high accuracy of classification using up to three sub-networks indicates that the most significant crosstalker sub-networks were highly

reproducible. Reproducibility is an important concern in classification

applications, since if the sub-network features that are used are not reproducible

across datasets, this will result in over-fitting. In this regard, the use of proteomic

data can also be considered a tool for obtaining useful biological insights for

feature selection.

Figures

Figure 3.1. Schematic of an integrated, proteomics-first approach for the

discovery of functional, candidate sub-networks in a disease phenotype.

Disease targets significant for a phenotype (e.g. cancer) are used to seed an

information-flow based search of the human interactome for candidate sub-

networks subsequently classified as crosstalkers or interactors. Candidate sub-

networks are then scored between test and control (e.g. normal vs. tumor) using

the mutual information of aggregate mRNA expression data as a proxy for

synergistic dysregulation. High-scoring sub-networks may be experimentally validated for their role in disease.

Figure 3.2. Crosstalkers are not significant at the level of individual mRNA expression. Cumulative distribution of differential expression for crosstalkers identified using two proteomic seeds (Nibbe et al., Friedman et al.), a seed of

CRC driver genes (Sjöblom et al.), and all proteins in the HPRD PPI network, as quantified by mutual information with phenotype, using GSE8671 and

GSE10950.

Figure 3.3. Synergistic dysregulation versus network size for candidate sub-networks associated with proteomic seeds obtained from Nibbe et al.

Sub-network dysregulation (i.e. mutual information of sub-network mRNA expression profile with phenotype class) versus network size for candidate sub-networks. All interactors (green squares) and crosstalkers (red diamonds) were scored using GSE8671 (top) and GSE10950 (bottom). The blue lines represent the linear interpolation of the means of the estimated null distributions computed for random candidate sub-networks of sizes 2, 4, 8, 16, 32, and 64, using the respective arrays (see Materials and Methods for details). Vertical bars represent one standard deviation from the mean.

Figure 3.4. Synergistic dysregulation versus network size for candidate sub-networks associated with proteomic seeds obtained from Friedman et al. Please see Figure 3.3 for annotation.

Figure 3.5. Significant sub-networks induced by proteomic seeds. Network graph visualization of sub-networks induced by Friedman seed, scored using GSE10950 (a) and Nibbe seed, scored using GSE8671 (b). Proteomic seeds that induced a significant crosstalker sub-network are shown in red, other proteomic seeds are shown in orange, crosstalkers are black and interactors are white. Visualization was performed with the Pajek software.

Figure 5a

Figure 5b

Figure 3.6. Validation of select targets predicted to be dysregulated in TCP1 sub-network. Immunoblot data were obtained from three (540, 534, 507) late-stage matched (N=normal/T=tumor) patient tissue biopsies not used in the original proteomic screen by Nibbe et al. Values are in kilodalton (kDa). GSE8671 and GSE10950 represent the ratio of the mean mRNA value (tumor/normal) from the respective microarray. Fold change was determined by densitometry.

Figure 3.7. Synergistic dysregulation versus network size for candidate sub-networks associated with the CRC driver gene seeds obtained from

Sjöblom et al. Please see Figure 3.3 for annotation.

Figure 3.8. Cross-validation performance comparison of sub-network based classifiers. The sub-networks induced by proteomic and genomic seeds were first ranked by mutual information with phenotype (MI). Then the normalized mRNA expression values for the genes were aggregated to compute a feature for each sub-network with significant MI. These features were used to train an

SVM-based classifier to distinguish normal from tumor using GSE10950, and then cross-validated on GSE8671 (top), and vice-versa (bottom).

Tables

Table 3.1. A glossary of terms used frequently in this paper.

Term: Definition

Proteomic seed: A protein that is significantly differentially expressed between tumor and control, as identified by proteomic screening.

Proteomic seed set: A set of proteomic seeds that are identified together in one proteomic screening cohort.

Network crosstalk: The degree of network proximity and connectivity between (groups of) proteins, modeled as the amount of “information flow” between these proteins in a PPI network.

Crosstalker: A protein that exhibits statistically significant network crosstalk with proteins in a particular proteomic seed set.

Interactor sub-network: A sub-network of the PPI network induced by the interacting partners of a particular proteomic seed.

Crosstalker sub-network: A sub-network of the PPI network induced by the interacting partners of a particular proteomic seed, which are also identified as crosstalkers with respect to the corresponding proteomic seed set.

Synergistic dysregulation: Coordinate mRNA-level differential expression of a group of genes in the phenotype.

Chapter IV – Summary and future directions

Overview

Human CRC continues to be the second leading cause of death from cancer for adults living in the United States and the United Kingdom. A mature body of

research into this cancer has largely been focused on the genetics of the

disease, specifically the accumulation of certain gene mutations which, it is

widely agreed, initiate the malignancy and variably sustain its stage-wise

progression to a metastatic tumor. As with many cancers, the early detection of a

malignancy by colonoscopy greatly improves the survivorship of the patient. In

fact, the long-term survival rate is 90% for patients who have their cancers

detected in the earliest stage (Dukes’ A or B), and who respond well to surgery

and adjuvant chemotherapy. Nevertheless, despite the advances in our

understanding of the molecular basis of this disease, there are no well-validated,

clinically useful markers of CRC to rival the colonoscopy for early detection. For

guiding therapy determinations, only certain mutations in the K-RAS gene are

useful to oncologists to predict the success of EGFR inhibitors in certain patients.

There are no serum-based markers useful for the diagnosis or prognosis of the

disease. Numerous research studies have proposed a variety of candidate markers of CRC; however, many of them are based on experimental cohorts of limited size.

The intense focus on the genetic basis of CRC has given way to a number of genome-wide studies which have identified the putative driver genes of

sporadic CRC. These genes are called drivers because they were found to be mutated at a statistically significant rate in a large cohort of tumor tissue biopsies, and are thus considered to play a causative role in the cancer. They are distinct from

“passenger” genes, which are genetically linked to the driver gene(s) but have no

causative role in the somatic evolution of the cancer. It is thought that understanding the downstream molecular effects of the mutated gene products that drive CRC, particularly in the earliest stages of the disease, may lead to the identification of improved, clinically useful markers. A number of the driver

genes identified (e.g. APC, K-RAS, TP53) mapped to pathways that have a well

characterized role in CRC, e.g. the WNT-signaling pathway. Many others,

however, have yet to be characterized, and overall the set of CRC driver genes

displayed little overlap with driver genes similarly identified in other cancers, for

example breast cancer. However, follow-up studies showed that several of the

known pathways to which the genes mapped did overlap (see Wood et al.). This

observation has led a number of researchers to hypothesize that pathways, i.e.

coordinated networks of interacting proteins, and not individual genes should be

the focus of cancer research because mutations to individual genes, by

themselves, are often insufficient to cause the disease [8].

Modeling the activity of networks of interacting proteins is a philosophically

different approach to studying the molecular basis of disease, compared to traditional wet-bench approaches in molecular biology which usually focus on

characterizing the role of one or a few genes at a time, for example, a certain

signal transduction pathway. So-called “systems biology” approaches endeavor

to explain how a complex phenomenon (e.g. disease phenotype) emerges from

the interactions of all the components that have a putative role in causing the

phenomenon. By contrast, more traditional approaches are called “reductionist”

because they try to reduce a complex phenomenon to a few components (e.g.

interactions). With respect to cancer research, the goal of a systems biology

approach is to explain the cancer phenotype in terms of the concerted action of

many interacting proteins, some of which may reside in molecular networks

differentially active (dysregulated) in a tumor, with respect to normal tissue.

A rich literature now exists in the field of network biology which indicates that all experimental information about the cell, i.e. proteomic, genomic, and interactomic, can be integrated within various computational approaches to provide improved classifiers of a variety of important human diseases (note aforementioned references). While traditional wet-bench experiments will continue for the foreseeable future to be the mainstay for characterizing individual molecular interactions, e.g. signal transduction pathways, these results can be integrated with other experimentally derived information in order to more fully explain the variability of a phenotype, e.g. cancer. The guiding biological hypothesis of this thesis was that since protein is the immediate effector molecule of phenotype, the differential expression of certain proteins should provide direct clues to the underlying dysregulation of cellular processes in disease. The computational hypothesis was that i) these proteins may be used to seed the discovery of disease-related networks of

proteins, and ii) the activity of these networks should be significantly

discriminative of the disease phenotype as compared to control.

The research goal of this thesis was to take a systems biology approach to further the understanding of the late stage CRC phenotype. We have shown that

a set of experimentally derived, differentially expressed proteins significant for the CRC phenotype, paired with a complement of mRNA expression data and a well-annotated PPI, can be used to discover well-connected networks of proteins statistically significant for the late stage CRC phenotype. These dysregulated networks are proposed as candidate markers of CRC, and should be validated in

relevant in vivo models, e.g. cell culture. Once dysregulation of the network is validated, the interference of one or more network interactions presents new opportunities for therapeutic intervention in the disease. Further, we also developed a computational framework and implemented it to show how a proteomics-first approach paired with a non-biased, statistical network search

can give us a compact set of networks with dysregulation patterns that are

reproducible across different gene expression data sets. Finally, our work

indicates that networks derived from a proteomics seed are superior to networks

derived from a genomic seed (driver genes) when used for feature selection to

train a binary classifier. In this regard, the proposed methodology has the

potential to be very useful in selecting meaningful network features.

Future biological direction: in vivo subnetwork validation

The work in chapter II inferred four subnetworks (hereafter referred to as

“networks”) of proteins with a significant role in the late stage phenotype. Beyond

merely western blotting in matched tissue samples to validate the differential

expression of the network proteins, a mechanistic validation of the role of the

network interactions is indicated as well. Toward that goal we have obtained from

our collaborator a cell line derived from a patient diagnosed with late stage CRC.

Since our model predicts that these networks are dysregulated and have a role in maintaining the disease, a mechanistic validation involves perturbing the network by the pharmacological interference of one or more targets, or by an siRNA approach, followed by assaying for a change in phenotype (e.g. apoptosis). We are currently experimenting with this approach in our lab. Specifically, we are using an siRNA approach to interfere with the expression of certain TCP-1 complex subunits (CCT3 and CCT7) and PLK-1 (network 3, figure reference), one at a time and in combination, followed by the assessment of cell fate by flow cytometry. Other network signatures may be validated similarly.

Another outstanding problem is to address whether or not the inferred networks are disease stage-specific, or if they are dysregulated in earlier stages of the disease. Our proteomic screen was performed on a cohort of samples of homogeneous late-stage CRC, so if the networks are activated late in the disease but not active earlier, then presumably the proteins found by screening tissues from an earlier stage would be different, and resolve a different set of networks with correspondingly different scores of significance. If, on the other hand, the networks are dysregulated early in the disease we would expect some overlap of

the set of networks discovered by the different experiments. Testing this hypothesis would involve repeating the entire experimental process beginning with a proteomics screen of tissue samples obtained from earlier stages of CRC

(available from the tissue repository maintained at the Case Cancer Center).

Future computational direction: classification

At the computational level, our finding that the four networks (Chapter II, page and figure) are discriminative of tumor versus control would be strengthened by showing that the networks are able to classify normal versus tumor on independent datasets not used in the discovery process, and by showing the classification power is better than random networks or other traditional classification approaches. Since the publication of the papers represented by chapters II and III, we have begun to develop a classification framework to do this. We obtained two public microarrays from the Gene Expression Omnibus (GEO) at the

NCBI web site: GSE10950 represents experiments on 24 matched pairs of colon biopsies, normal and primary tumor, and GSE8671 represents experiments on

32 matched pairs of biopsies, normal and early adenoma. Using a two-way cross-validation approach we trained a support vector machine (SVM) classifier using one array, and then evaluated the classification performance on the other array. We compared the performance of the most significant 6-gene combinations in each network (see Chapter II, Table 4) to correctly classify tumor versus control against the performance of 1000 random 6-gene networks and

four 6-gene sets selected from a set of twenty-four genes that had the highest mutual information between normal and tumor. Figure 4.1 shows the result. The

classification performance of the networks is similar on both arrays, suggesting

that the networks are dysregulated early in the disease and the pattern persists

through later stages. They perform notably better than the high-scoring genes on GSE10950, and the difference is greater as more features are added to train the classifier. The high-scoring genes perform better in the validation on GSE8671, but are nearly identical to the network markers beyond three features. The figure

also illustrates the poor performance, worse than random, of the entire network

as a feature for classification, supporting our approach of searching these networks for the most significant gene combinations. These early results support the proteomics-first computational approach for finding network signatures that can serve as biomarkers of CRC.
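The comparison against random gene sets can be sketched as follows. This is a simplified illustration, not the framework under development: the area under the ROC curve is computed directly from the aggregated (mean) expression of a gene set by its rank interpretation, without an SVM, and the function names, gene names, and expression values are hypothetical.

```python
import random
from statistics import mean

def auc(scores, labels):
    """Area under the ROC curve as P(tumor score > normal score); ties count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gene_set_auc(samples, labels, genes):
    """AUC obtained by ranking samples on the mean expression of a gene set."""
    return auc([mean(s[g] for g in genes) for s in samples], labels)

def random_baseline_auc(samples, labels, all_genes, size, n_random=1000, seed=0):
    """Mean AUC over n_random randomly drawn gene sets of the given size."""
    rng = random.Random(seed)
    return mean(
        gene_set_auc(samples, labels, rng.sample(all_genes, size))
        for _ in range(n_random)
    )

# Hypothetical toy data: 1 = tumor, 0 = normal.
labels = [1, 1, 0, 0]
samples = [
    {"A": 2.0, "B": 1.9, "N1": 0.3, "N2": 0.7},
    {"A": 1.8, "B": 2.1, "N1": 0.6, "N2": 0.2},
    {"A": 0.2, "B": 0.1, "N1": 0.5, "N2": 0.4},
    {"A": 0.1, "B": 0.3, "N1": 0.4, "N2": 0.6},
]
candidate_auc = gene_set_auc(samples, labels, ["A", "B"])
baseline_auc = random_baseline_auc(samples, labels, ["N1", "N2"], size=1, n_random=20)
```

A candidate 6-gene combination would be compared in the same way against the mean of 1000 random 6-gene sets drawn from the array.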

Formalizing the computational framework in software and making it available to a wider community of researchers would be an important step forward for the Center for Proteomics and Bioinformatics in its goal to lead the implementation of network biology approaches.

Figure 4.1. Network classification performance. An SVM-based classifier was trained using public microarray GSE8671 and validated on GSE10950 for its ability to classify cancer from normal (top). Performance was measured by the area under the receiver-operating-characteristic (ROC) curve. Similarly, training was performed on GSE10950 and validated on GSE8671 (bottom). The classifier was trained using 1-4 features, where the features were computed as follows: the significant 6-gene combinations from each subnetwork (see Chapter II, Table 4) (blue line), all genes in each subnetwork (red line), the best 6 genes on the microarray as scored by mutual information for feature 1, the next best 6 for feature 2, etc. (green line), and the mean of 1000 random 6-gene groups (purple line).

[Figure 4.1 plots. Top panel: Classification Performance, Train: GSE8671, Test: GSE10950. Bottom panel: Classification Performance, Train: GSE10950, Test: GSE8671. x-axis: # Subnetworks (1-4). y-axis: Area Under ROC Curve (0.10-1.00). Series: Significant Subnetwork Proteins; All Subnetwork Proteins; Best Four 6-gene sets (microarray); Random 6-gene sets (microarray).]

Appendix 1.

Protein rows: IEF range | Gene | NCBI Accession # | P(Protein) | Score | pI | MW | % Coverage
Peptide rows: P(Peptide) | XC | ΔCN | Sp | MH+ | z | Peptide Sequence (# = Oxidation Methionine, * = Carbamidomethylation Cysteine)

3-10 | ACADS | 19684166 | 7.47E-09 | 20.28 | 8.00 | 44298.8 | 8.74
7.47E-09 | 5.52 | 0.63 | 1319.0 | 2210.0448 | 2 | K.IGC*FALSEPGNGSDAGAASTTAR.A
2.46E-04 | 3.80 | 0.52 | 1255.1 | 1469.6995 | 2 | R.GSSTANLIFEDC*R.I

3- AHCY 30584 8.10E- 210. 5.9 4779 37.18 10 089 11 26 0 8.4 2.27E- 2.61 0.2 686. 1068.6 2 K.VNIKPQVDR.Y 04 8 8 160 2.42E- 2.83 0.2 1061 1061.5 2 K.RATDVMIAGK.V 05 8 .7 771 7.24E- 3.04 0.3 893. 921.47 2 R.ATDVM#IAGK.V 05 2 8 10 6.31E- 2.97 0.4 1184 905.47 2 R.ATDVMIAGK.V 05 3 .2 61 1.41E- 5.13 0.4 2111 1380.7 2 K.KLDEAVAEAHLGK.L 08 1 .2 482 1.14E- 3.22 0.4 1021 1259.6 2 K.SKFDNLYGC*R.E 07 9 .4 143 2.36E- 4.11 0.4 1763 1252.6 2 K.LDEAVAEAHLGK.L 07 3 .1 532 4.04E- 2.98 0.3 780. 1040.6 2 R.RIILLAEGR.L 05 0 9 575 1.10E- 2.71 0.3 610. 1102.5 2 K.WLNENAVEK.V 04 3 0 527 4.99E- 3.86 0.4 452. 1719.8 2 R.KALDIAENEM#PGLM#R.M 07 4 6 404 1.55E- 3.71 0.4 731. 1648.8 2 R.GISEETTTGVHNLYK.M 06 8 2 177 2.38E- 3.15 0.5 1735 1134.6 2 K.VAVVAGYGDVGK.G 08 7 .2 154 2.54E- 3.93 0.6 1231 1256.6 2 K.VPAINVNDSVTK.S 09 0 .5 844 9.02E- 3.12 0.4 1420 1156.6 2 K.YPVGVHFLPK.K 07 2 .8 514 1.05E- 2.12 0.3 695. 884.55 2 R.IILLAEGR.L 04 8 7 64 9.91E- 3.74 0.4 950. 1591.7 2 K.ALDIAENEM#PGLM#R.M 06 5 9 454 8.10E- 4.70 0.4 2051 2147.1 3 K.M#M#ANGILKVPAINVNDSVTK.S 11 7 .4 199 3.33E- 2.27 0.2 373. 1004.5 2 K.AGIPVYAWK.G 05 0 9 564 9.03E- 3.25 0.4 910. 1575.7 2 K.ALDIAENEMPGLM#R.M 06 5 0 505 6.56E- 3.34 0.3 1134 1056.6 2 K.YPQLLPGIR.G 06 3 .8 200 9.15E- 3.20 0.4 1125 1128.6 2 K.VADIGLAAWGR.K 06 8 .6 160

3- ANXA2 16306 1.57E- 40.1 7.7 3859 11.8 10 978 07 7 7 3.8 9.00E- 2.67 0.1 466. 889.55 2 R.KLM#VALAK.G 02 3 5 39 5.87E- 3.07 0.3 1283 1244.6 2 R.TNQELQEINR.V 05 4 .3 229 1.57E- 3.38 0.5 1325 1222.5 2 K.TPAQYDASELK.A 07 9 .4 950 1.99E- 3.48 0.3 1557 1421.6 2 K.SLYYYIQQDTK.G 06 6 .5 947

3- ANXA3 12654 1.05E- 220. 5.5 3635 60.37 10 115 13 27 0 2.7 4.41E- 4.84 0.3 1133 1539.8 3 R.RDESLKVDEHLAK.Q 06 8 .0 125 1.40E- 3.62 0.4 1632 1383.7 2 R.DESLKVDEHLAK.Q 07 6 .5 114 172

2.02E- 2.55 0.4 1026 1057.6 2 R.KALLTLADGR.R 04 6 .5 365 3.45E- 2.98 0.1 648. 1429.7 3 K.KHYGYSLYSAIK.S 04 6 2 474 1.14E- 3.30 0.3 1017 1770.9 3 K.VDEHLAKQDAQILYK.A 03 7 .1 385 5.73E- 2.55 0.4 863. 1510.8 3 R.QLIVKEYQAAYGK.E 07 2 1 264 4.94E- 2.29 0.3 392. 1359.7 2 R.NISQKDIVDSIK.G 03 1 8 478 1.46E- 2.08 0.2 341. 1018.5 2 R.NTPAFLAER.L 04 7 1 316 1.71E- 4.92 0.4 932. 1865.9 3 R.QMKDISQAYYTVYKK.S 06 8 8 467 5.88E- 3.31 0.5 788. 1301.6 2 K.HYGYSLYSAIK.S 08 8 1 525 8.08E- 4.44 0.5 1275 1713.7 2 K.SLGDDISSETSGDFRK.A 11 9 .1 926 3.87E- 3.47 0.4 1086 1478.7 2 K.DISQAYYTVYKK.S 07 6 .6 526 8.13E- 2.42 0.3 1288 929.54 2 K.ALLTLADGR.R 06 6 .8 14 1.05E- 4.78 0.5 2182 2195.0 3 R.GTVRDYPDFSPSVDAEAIQK.A 13 5 .6 615 2.53E- 3.29 0.5 1464 1222.6 2 K.GIGTDEFTLNR.I 06 7 .6 062 1.76E- 3.26 0.4 943. 1737.8 2 R.QMKDISQAYYTVYK.K 07 4 6 517 2.55E- 2.79 0.5 681. 1350.6 2 K.DISQAYYTVYK.K 06 3 6 576 1.39E- 3.18 0.4 964. 1441.7 2 K.SDTSGDYEITLLK.I 06 2 8 057 6.75E- 3.05 0.2 1375 1073.5 2 R.SEIDLLDIR.T 04 4 .7 837 2.50E- 3.48 0.3 957. 2036.0 3 K.SM#KGAGTNEDALIEILTTR.T 04 4 5 328 3.95E- 2.97 0.1 1136 1075.6 2 K.MLISILTER.S 01 2 .8 180 8.03E- 4.60 0.4 1371 1673.8 2 K.GAGTNEDALIEILTTR.T 06 1 .2 705

3- ALDH2 30584 2.66E- 170. 6.3 5645 37.14 10 723 14 31 7 8.8 2.44E- 3.48 0.4 1150 1342.7 2 R.VIQVAAGSSNLKR.V 07 1 .2 802 7.94E- 4.38 0.5 1638 1186.6 2 R.VIQVAAGSSNLK.R 09 0 .5 790 4.17E- 2.81 0.2 579. 902.49 2 K.TIEEVVGR.A 04 4 8 41 3.46E- 2.81 0.3 751. 1106.6 2 K.KILGYINTGK.Q 04 9 1 569 1.51E- 3.25 0.4 1757 1137.5 2 K.VAFTGSTEIGR.V 06 7 .7 898 1.28E- 2.39 0.5 454. 962.49 2 R.VVGNPFDSK.T 06 5 9 41 4.54E- 2.64 0.2 408. 978.56 2 K.ILGYINTGK.Q 03 7 2 19 1.81E- 4.23 0.5 1037 1385.7 2 K.LGPALATGNVVVM#K.V 06 3 .5 821 4.68E- 3.90 0.4 724. 2450.1 3 R.VVGNPFDSKTEQGPQVDETQFK.K 11 8 5 833 1.17E- 6.14 0.5 954. 2963.4 3 R.KTFPTVNPSTGEVIC*QVAEGDKEDVDK. 11 8 0 608 A 6.17E- 3.03 0.2 751. 1369.7 2 K.LGPALATGNVVVMK.V 04 8 8 872 2.96E- 3.45 0.6 1303 1599.7 2 R.ELGEYGLQAYTEVK.T 08 1 .5 900 8.54E- 4.48 0.5 1845 1527.7 2 R.ANNSTYGLAAAVFTK.D 07 6 .4 802 2.66E- 4.87 0.5 900. 2203.0 2 R.GYFIQPTVFGDVQDGM#TIAK.E 14 8 4 740 2.52E- 2.27 0.4 170. 1531.7 2 K.TIPIDGDFFSYTR.H 03 7 3 428 7.91E- 5.25 0.5 1351 1789.8 2 R.TFVQEDIYDEFVER.S 10 7 .8 279 173

1.29E- 4.43 0.6 928. 1844.0 2 K.VAEQTPLTALYVANLIK.E 10 1 1 527

3- CA1 45025 1.09E- 80.2 6.7 2885 10 17 10 3 0 2.4 1.86E- 2.42 0.4 944. 1026.5 2 K.YSSLAEAASK.A 29.12 05 1 4 1025 4.25E- 3.52 0.4 517. 1612.7 3 K.YSAELHVAHWNSAK.Y 08 5 3 8662 1.09E- 4.45 0.5 1138 1929.0 2 K.HDTSLKPISVSYNPATAK.E 10 9 .7 0757 2.02E- 4.16 0.5 468. 2759.3 3 R.SLLSNVEGDNAVPMQHNNRPTQPLK.G 07 1 4 8940 5.58E- 2.23 0.2 504. 985.43 2 K.GGPFSDSYR.L 03 5 8 738 3.88E- 4.54 0.4 953. 1742.9 2 K.LYPIANGNNQSPVDIK.T 10 6 3 0723 1.73E- 3.02 0.4 846. 970.59 2 K.VLDALQAIK.T 05 3 7 314 7.15E- 3.83 0.4 1679 1202.6 2 K.ADGLAVIGVLM#K.V 06 1 .4 8131

3- DLST 1.16E- 30.2 7.7 4855 7.73 10 06 1 0 4.4 1.16E- 3.86 0.5 1695 1424.6 2 R.NVEAM#NFADIER.T 06 2 .5 4744 1.52E- 4.23 0.4 1976 1478.7 2 K.TPAFAESVTEGDVR.W 05 5 .0 1216 7.52E- 2.76 0.1 1384 1015.5 2 K.LGFM#SAFVK.A 04 8 .9 2810

3- LMNA 55957 3.23E- 60.2 6.2 6920 11.07 10 499 07 2 0 7.4 3.63E- 3.87 0.5 943. 1359.6 2 R.SGAQASSTPLSPTR.I 05 5 3 8628 6.15E- 2.51 0.0 755. 1171.6 3 K.KEGDLIAAQAR.L 04 3 5 4294 8.43E- 3.63 0.4 1071 1148.5 2 R.ITESEEVVSR.E 06 1 .2 7935 2.61E- 3.05 0.2 1201 1089.5 2 R.SLETENAGLR.L 04 6 .2 5347 3.23E- 4.32 0.4 1358 1491.7 2 R.TALINSTGEEVAMR.K 07 2 .7 4719 1.60E- 3.58 0.2 1260 1028.5 2 R.LADALQELR.A 04 2 .5 7349

3- NFUDS1 15082 8.28E- 30.1 5.6 2756 10 323 07 6 0 2.3 5.13E- 3.01 0.3 1003 1097.5 2 K.SATYVNTEGR.A 05 9 .3 2222 2.03E- 3.17 0.4 1462 1250.5 2 K.DFYM#TDSISR.A 03 5 .5 3576 8.28E- 3.03 0.3 1412 2071.1 3 R.IASQVAALDLGYKPGVEAIR.K 07 7 .5 5454

3- SNX6 88703 1.70E- 30.2 6.0 4777 6.07 10 041 09 4 0 4.4 1.70E- 4.84 0.4 1563 2050.0 3 K.NKDVLQAETSQQLC*C*QK.F 09 7 .5 3027 3.35E- 3.90 0.3 849. 1031.5 2 K.SADGVIVSGVK.D 05 8 6 7312 1.26E- 4.15 0.5 1472 1807.8 2 K.DVLQAETSQQLC*C*QK.F 06 2 .3 9238

3- PMPCB 14714 1.98E- 90.2 6.4 5415 16.56 10 528 09 7 0 7.5 2.37E- 3.05 0.3 1010 1351.6 2 R.LC*TSVTESEVAR.A 06 1 .0 8274 6.25E- 3.24 0.2 1251 1034.5 2 R.VTC*LESGLR.V 04 8 .4 6044 9.15E- 2.15 0.2 405. 1085.6 2 R.TILGPTENIK.S 06 9 1 2012 1.13E- 2.02 0.3 296. 1081.6 2 K.GEIPALPPC*K.F 03 9 8 0158 4.16E- 3.31 0.3 886. 1193.7 2 R.RIPIPELEAR.I 06 6 3 0007 6.40E- 4.73 0.5 683. 1713.9 2 R.STQAATQVVLNVPETR.V 08 0 0 1296 9.80E- 3.70 0.5 1088 1367.6 2 K.DLVDYITTHYK.G 08 0 .0 8420 1.98E- 5.37 0.5 1997 2149.0 2 K.TNM#LLQLDGSTPIC*EDIGR.Q 09 5 .7 5690 3.65E- 3.86 0.4 940. 2133.0 2 K.TNMLLQLDGSTPIC*EDIGR.Q 09 0 9 6200

3- NM23A 35068 1.58E- 60.1 7.1 2039 40 10 09 9 0 8.3 7.92E- 3.50 0.3 1139 1485.7 2 R.NIIHGSDSVESAEK.E 07 2 .8 1802 1.58E- 3.70 0.3 692. 1801.9 3 R.VM#LGETNPADSKPGTIR.G 09 9 8 1126 7.40E- 2.34 0.2 638. 984.62 2 R.GLVGEIIKR.F 04 6 4 006 8.64E- 2.60 0.1 306. 1344.7 3 R.TFIAIKPDGVQR.G 04 7 8 6343 4.64E- 3.01 0.2 904. 1197.5 2 K.FM#QASEDLLK.E 05 9 9 8199 1.55E- 2.70 0.1 789. 1149.6 2 K.DRPFFAGLVK.Y 04 8 1 4148

3- PPIA 48145 2.40E- 118. 7.8 1798 40.61 10 531 09 32 0 6.9 1.53E- 3.62 0.0 720. 1537.7 3 K.VKEGM#NIVEAM#ER.F 04 5 2 3486 9.50E- 3.54 0.0 573. 1537.7 2 K.VKEGM#NIVEAM#ER.F 06 9 6 3486 3.36E- 3.95 0.4 957. 1521.7 2 K.VKEGMNIVEAM#ER.F 05 4 8 3996 3.94E- 3.45 0.0 699. 1521.7 3 K.VKEGMNIVEAM#ER.F 05 1 4 3996 8.10E- 3.32 0.0 372. 1521.7 3 K.VKEGM#NIVEAMER.F 04 7 4 3996 4.31E- 3.80 0.0 937. 1521.7 2 K.VKEGM#NIVEAMER.F 07 5 9 3996 2.87E- 3.67 0.2 1664 1505.7 2 K.VKEGMNIVEAMER.F 09 1 .2 4512 1.21E- 2.80 0.1 609. 1294.5 2 K.EGM#NIVEAMER.F 04 8 5 7658 2.38E- 4.01 0.0 661. 1505.7 3 K.VKEGMNIVEAMER.F 04 3 2 4512 2.40E- 6.37 0.0 1639 2807.3 3 K.HTGPGILSM#ANAGPNTNGSQFFIC*TAK 09 9 .9 5448 .S 1.08E- 4.94 0.2 1322 2821.3 3 K.HTGPGILSM#ANAGPNTNGSQFFIC*TAK 06 0 .3 5448 ^.S 9.79E- 2.75 0.1 691. 1154.5 2 K.FEDENFILK.H 05 1 6 7288 3.14E- 2.41 0.2 188. 1278.5 1 K.EGMNIVEAMER.F 05 2 9 8167 9.49E- 5.21 0.2 1587 1831.9 2 K.SIYGEKFEDENFILK.H 09 2 .7 1125 3.16E- 3.45 0.5 1094 1379.7 2 R.VSFELFADKVPK.T 09 6 .9 5696 1.83E- 2.54 0.2 467. 1278.5 2 K.EGMNIVEAMER.F 04 9 1 8167 3.33E- 2.91 0.2 929. 1055.5 2 R.VSFELFADK.V 05 7 7 4077 1.46E- 2.36 0.2979. 1055.5 1 R.VSFELFADK.V 175

04 1 4 4077

3- TF 37747 1.65E- 128. 6.9 7702 24.93 10 855 10 26 0 9.7 1.32E- 2.72 0.5 1031 1725.8 2 K.IEC*VSAETTEDC*IAK.I 03 0 .5 2805 2.14E- 3.04 0.5 373. 1415.7 2 K.SVIPSDGPSVAC*VK.K 04 6 1 5043 1.78E- 3.00 0.3 831. 997.50 2 K.ASYLDC*IR.A 02 3 7 768 7.68E- 3.94 0.4 1378 1195.5 2 K.DSGFQMNQLR.G 05 2 .9 5249 1.03E- 2.79 0.4 765. 1283.5 2 K.EGYYGYTGAFR.C 04 8 3 6909 1.31E- 4.02 0.2 1440 1881.9 3 K.ADRDQYELLC*LDNTR.K 03 2 .9 0649 8.40E- 3.19 0.3 800. 1249.6 2 K.SASDLTWDNLK.G 05 2 0 0596 1.65E- 4.37 0.6 1635 1577.6 2 R.FDEFFSEGC*APGSK.K 07 0 .6 8822 2.15E- 4.46 0.5 1493 1494.7 2 K.M#YLGYEYVTAIR.N 07 5 .1 2971 1.65E- 5.18 0.5 835. 2071.9 2 K.SDNC*EDTPEAGYFAVAVVK.K 10 9 0 5825 7.97E- 2.28 0.4 393. 1629.8 2 K.EDPQTFYYAVAVVK.K 07 8 1 1592 1.57E- 2.76 0.5 493. 2175.0 2 K.IM#NGEADAMSLDGGFVYIAGK.C 02 1 7 0965 4.12E- 3.35 0.1 725. 2175.0 2 K.IMNGEADAM#SLDGGFVYIAGK.C 06 8 6 0965

3-10 TPI1 88942747 1.17E-09 70.22 8.10 26925.9 26.1
2.52E-03 2.09 0.35 361.3 850.46692 2 K.VVFEQTK.V
2.89E-06 3.37 0.46 1344.0 1137.60264 2 K.IAVAAQNC*YK.V
1.17E-09 3.76 0.54 2043.9 1234.60217 2 K.SNVSDAVAQSTR.I
4.57E-06 3.68 0.37 1808.8 1614.82349 3 R.RHVFGESDELIGQK.V
1.58E-09 3.33 0.53 1610.7 1326.70275 2 R.IIYGGSVTGATC*K.E
1.70E-09 4.41 0.57 942.8 1458.72229 2 R.HVFGESDELIGQK.V
8.22E-02 2.52 0.17 707.5 1082.57813 2 R.KFFVGGNWK.M

3- TALDO1 14603 3.25E- 80.1 6.4 3751 24.33 10 290 08 9 0 6.5 7.40E- 2.03 0.1 639. 991.52 2 K.IYNYYKK.F 03 3 1 472 3.67E- 3.37 0.4 1425 1074.5 2 K.LGGSQEDQIK.N 05 2 .5 4260 8.01E- 2.31 0.2 720. 1050.5 2 R.M#ESALDQLK.Q 02 5 4 1357 1.60E- 2.02 0.2 432. 1233.5 2 K.SYEPLEDPGVK.S 03 8 3 9973 7.16E- 2.37 0.3 805. 1034.5 2 R.MESALDQLK.Q 06 7 7 1868 1.26E- 3.73 0.5 1741 1276.6 2 K.LSSTWEGIQAGK.E 03 0 .7 5320 6.63E- 3.69 0.4 1201 1213.6 2 K.LLGELLQDNAK.L 06 9 .7 7871 3.25E- 3.41 0.5 1868 1392.7 2 K.ALAGC*DFLTISPK.L 08 8 .3 4970

3-10 UQCRC1 16307022 3.18E-06 30.17 5.90 52612.5 8.54
3.18E-06 3.21 0.43 1081.9 1323.65144 2 R.LC*TSATESEVAR.G
1.93E-05 3.31 0.36 865.3 1256.67456 2 R.RIPLAEWESR.I
5.08E-04 3.21 0.34 891.4 2054.02767 3 R.NALVSHLDGTTPVC*EDIGR.S

3- HSPD1 77702 5.39E- 150. 5.6 6117 27.83 10 086 11 27 0 4.5 4.49E- 2.68 0.3 423. 1233.5 2 K.VGGTSDVEVNEK.K 05 5 0 9570 5.59E- 3.14 0.3 768. 901.53 2 K.LSDGVAVLK.V 05 4 6 534 1.16E- 4.09 0.4 1328 1215.6 2 K.NAGVEGSLIVEK.I 06 2 .5 5796 8.26E- 2.75 0.3 861. 1153.7 2 R.LKVGLQVVAVK.A 06 9 7 6672 4.49E- 3.31 0.2 1367 912.58 2 K.VGLQVVAVK.A 04 9 .9 771 8.24E- 3.92 0.4 552. 1646.9 2 K.VGEVIVTKDDAM#LLK.G 05 2 5 0332 9.98E- 4.18 0.5 765. 1646.9 3 K.VGEVIVTKDDAM#LLK.G 06 4 7 0332 6.37E- 4.52 0.4 1653 1344.7 2 R.TVIIEQSWGSPK.V 06 6 .6 1582 8.72E- 4.34 0.5 1419 1630.9 2 K.VGEVIVTKDDAMLLK.G 09 0 .5 0845 7.69E- 3.82 0.4 560. 1630.9 3 K.VGEVIVTKDDAMLLK.G 06 6 5 0845 7.23E- 4.55 0.4 1359 1771.8 2 R.C*IPALDSLTPANEDQK.I 09 5 .6 8363 3.15E- 3.49 0.5 1197 1389.7 2 R.GYISPYFINTSK.G 07 1 .8 0483 5.39E- 5.34 0.4 1916 1684.9 2 R.AAVEEGIVLGGGC*ALLR.C 11 4 .2 3560 1.23E- 4.15 0.4 1371 1520.7 2 K.TLNDELEIIEGM#K.F 07 3 .7 5124 5.83E- 4.59 0.4 1664 1601.7 2 K.C*EFQDAYVLLSEK.K 06 6 .4 8212 3.48E- 3.56 0.3 713. 1919.0 3 K.ISSIQSIVPALEIANAHR.K 07 0 8 7092 2.47E- 5.03 0.4 1779 1504.7 2 K.TLNDELEIIEGMK.F 08 8 .8 5635

4-7 ACTB 45018 9.06E- 70.3 5.1 4170 24.37 85 12 4 8 9.7 1.64E- 1.93 0.3 596. 1354.6 3 K.DSYVGDEAQSKR.G 03 7 5 2329 2.74E- 3.03 0.4 1014 1354.6 2 K.DSYVGDEAQSKR.G 06 4 .2 2329 1.79E- 3.51 0.5 1126 1198.5 2 K.DSYVGDEAQSK.Y 05 0 .4 2222 3.19E- 2.54 0.4 798. 1132.5 2 R.GYSFTTTAER.E 03 5 1 2698 5.31E- 4.23 0.3 982. 1954.0 2 R.VAPEEHPVLLTEAPLNPK.A 08 6 4 6445 9.59E- 2.16 0.2 659. 945.55 2 R.AVFPSIVGR.P 04 8 0 164 8.30E- 4.82 0.2 1506 1790.8 2 K.SYELPDGQVITIGNER.F 07 8 .0 9197 9.06E- 6.82 0.6 3168 2566.1 2 K.LC*YVALDFEQEM#ATAASSSSLEK.S 12 4 .7 9927

4-7 ACTG2 49168516 4.71E-05 20.16 5.20 41897.8 5.85
2.23E-04 3.25 0.49 1047.3 1198.52222 2 K.DSYVGDEAQSK.Y
4.71E-05 2.84 0.35 1196.1 976.44830 2 K.AGFAGDDAPR.A

4-7 ACTR3 5031573 3.76E-10 90.29 5.50 47341.0 34.93
3.76E-10 4.23 0.52 713.8 1768.90759 2 R.DREVGIPPEQSLETAK.A
1.09E-05 3.98 0.45 1555.5 1540.77450 2 R.LPAC*VVDC*GTGYTK.L
4.86E-04 2.09 0.43 211.4 1497.77954 2 R.EVGIPPEQSLETAK.A
3.90E-04 3.68 0.51 1311.8 1515.68964 2 R.HGIVEDWDLM#ER.F
8.04E-05 3.97 0.46 703.8 2482.18848 3 R.AEPEDHYFLLTEPPLNTPENR.E
8.77E-08 5.53 0.57 1071.0 3058.60706 3 R.TLTGTVIDSGDGVTHVIPVAEGYVIGSC*IK.H
1.41E-06 4.53 0.44 695.7 2192.17254 2 K.LGYAGNTEPQFIIPSC*IAIK.E
2.61E-09 4.34 0.52 758.8 2444.18677 3 K.GVDDLDFFIGDEAIEKPTYATK.W
7.20E-05 2.65 0.46 771.7 1409.77869 2 R.DITYFIQQLLR.E

4-7 ALDH2 48256 1.16E- 90.2 6.4 5631 21.28 839 07 3 0 7.6 1.86E- 3.36 0.5 1303 1342.7 2 R.VIQVAAGSSNLKR.V 04 0 .8 8015 9.68E- 3.08 0.5 2030 1186.6 2 R.VIQVAAGSSNLK.R 04 3 .2 7896 6.43E- 3.84 0.5 1272 1506.7 2 K.TEQGPQVDETQFK.K 04 5 .6 0703 8.61E- 3.00 0.4 1200 1137.5 2 K.VAFTGSTEIGR.V 04 6 .3 8984 3.89E- 3.55 0.4 929. 1385.7 2 K.LGPALATGNVVVM#K.V 04 6 8 8208 1.16E- 4.25 0.5 1904 1527.7 2 R.ANNSTYGLAAAVFTK.D 07 6 .3 8015 7.91E- 2.29 0.4 234. 1531.7 2 K.TIPIDGDFFSYTR.H 05 4 1 4268 3.14E- 4.61 0.6 1857 1789.8 2 R.TFVQEDIYDEFVER.S 07 0 .8 2788 7.46E- 4.57 0.5 830. 1844.0 2 K.VAEQTPLTALYVANLIK.E 07 7 4 5273

4-7 ANXA4 1.28E- 180. 5.8 3606 54.52 08 26 0 2.2 1.38E- 3.40 0.4 1178 1527.7 3 R.RISQTYQQQYGR.S 04 4 .9 6624 7.02E- 4.16 0.4 801. 1527.7 2 R.RISQTYQQQYGR.S 06 5 8 6624 9.34E- 3.70 0.4 1418 1371.6 2 R.ISQTYQQQYGR.S 06 4 .9 6516 2.62E- 1.74 0.2 500. 815.39 2 K.SAYFAEK.L 03 4 5 337 1.75E- 2.97 0.3 1493 1494.7 3 R.QDAQDLYEAGEKK.W 04 7 .7 0703 4.14E- 1.80 0.1 570. 834.39 2 K.WGTDEVK.F 02 8 2 923 1.96E- 4.86 0.5 2223 1597.7 2 K.AASGFNAM#EDAQTLR.K 06 3 .0 2748 5.47E- 2.49 0.4 1374 958.56 2 R.VLVSLSAGGR.D 03 8 .7 799 2.04E- 2.90 0.2 1257 1134.4 2 R.SDTSFM#FQR.V 02 5 .4 8842 4.56E- 3.35 0.4 1027 1570.8 3 R.NHLLHVFDEYKR.I 04 0 .4 1250 1.92E- 2.95 0.4 1300 1174.6 2 K.GLGTDDNTLIR.V 03 3 .0 0620 1.54E- 2.89 0.3 767. 1414.7 3 R.NHLLHVFDEYK.R 03 6 7 1143 4.32E- 2.50 0.3 1154 1091.5 2 R.AEIDM#LDIR.A 02 3 .6 4012 1.28E- 4.69 0.6 1712 2319.1 3 R.VLVSLSAGGRDEGNYLDDALVR.Q 08 1 .3 9385 2.18E- 3.30 0.5 1128 1379.6 2 R.DEGNYLDDALVR.Q 04 6 .4 4380 1.54E- 4.60 0.3 914. 2319.1 2 R.VLVSLSAGGRDEGNYLDDALVR.Q 05 6 3 9385 9.23E- 2.48 0.5 819. 1060.5 1 K.VLLVLC*GGDD 05 1 1 6486 1.00E- 4.76 0.5 1340 1661.8 2 K.GAGTDEGC*LIEILASR.T 04 6 .7 4685

7.88E- 4.70 0.5 1649 1692.8 2 K.GLGTDEDAIISVLAYR.N 07 9 .3 8025 9.05E- 1.90 0.2 265. 1692.8 3 K.GLGTDEDAIISVLAYR.N 04 3 8 8025 3.04E- 4.60 0.6 2238 1666.8 2 K.SETSGSFEDALLAIVK.C 06 1 .2 5339

4-7 ANXA5 49168 2.03E- 50.3 4.8 3594 22.81 528 11 0 0 0.5 3.60E- 2.36 0.3 361. 1143.6 2 K.LIVALM#KPSR.L 02 4 6 9181 2.89E- 3.21 0.3 1431 1001.5 2 K.VLTEIIASR.T 04 0 .3 9900 2.41E- 2.97 0.4 554. 1446.7 2 R.DLLDDLKSELTGK.F 04 7 2 6868 1.87E- 4.14 0.5 1464 1704.9 2 K.GLGTDEESILTLLTSR.S 06 6 .0 0137 2.03E- 6.06 0.6 2297 2658.2 2 R.DPDAGIDEAQVEQDAQALFQAGELK.W 11 2 .1 5293

4-7 APOH 18089 1.08E- 130. 7.8 3827 43.48 104 10 31 0 2.7 1.76E- 3.77 0.4 1009 1150.6 2 K.KATVVYQGER.V 03 7 .1 2146 3.57E- 3.12 0.5 1494 992.46 2 K.TDASDVKPC* 03 1 .2 587 6.79E- 2.90 0.4 1098 1022.5 2 K.ATVVYQGER.V 04 3 .6 2655 1.08E- 5.70 0.5 1901 2629.1 3 K.DKATFGC*HDGYSLDGPEEIEC*TK.L 10 1 .0 7918 2.95E- 4.27 0.5 1841 2214.0 2 K.KC*SYTEDAQC*IDGTIEVPK.C 09 8 .0 6638 6.42E- 6.06 0.5 2715 2386.0 3 K.ATFGC*HDGYSLDGPEEIEC*TK.L 05 2 .0 5727 1.93E- 4.94 0.6 1966 2383.1 2 K.TFYEPGEEITYSC*KPGYVSR.G 09 1 .2 2163 9.67E- 3.86 0.5 967. 1723.7 2 K.TFYEPGEEITYSC*K.P 07 9 3 8252 5.45E- 5.01 0.4 958. 2731.3 3 K.C*PFPSRPDNGFVNYPAKPTLYYK.D 05 3 1 6426 1.14E- 6.03 0.6 1662 2085.9 2 K.C*SYTEDAQC*IDGTIEVPK.C 08 3 .2 7142 4.52E- 3.32 0.3 1554 1914.0 3 R.TC*PKPDDLPFSTVVPLK.T 04 7 .1 3465 1.04E- 3.55 0.2 1198 1914.0 2 R.TC*PKPDDLPFSTVVPLK.T 06 8 .7 3465 6.35E- 3.59 0.4 927. 1502.8 2 R.VC*PFAGILENGAVR.Y 06 5 3 0895 3.25E- 4.77 0.4 2016 1773.0 2 K.FIC*PLTGLWPINTLK.C 06 7 .3 0731

4-7 ATP5B 32189394 4.45E-07 110.24 5.10 56524.7 28.73
1.40E-04 2.92 0.38 1058.9 1086.50623 2 K.ADKLAEEHSS
6.45E-04 2.93 0.45 1171.4 1278.63581 2 R.TIAM#DGTEGLVR.G
7.70E-07 4.08 0.32 997.2 1617.80534 2 K.VALVYGQM#NEPPGAR.A
9.43E-05 3.81 0.39 1680.8 1401.70423 2 R.IM#NVIGEPIDER.G
7.12E-06 4.89 0.55 810.8 2298.07403 3 R.IPSAVGYQPTLATDM#GTM#QER.I
9.85E-04 3.07 0.48 1149.2 975.56219 2 K.IGLFGGAGVGK.T
9.50E-07 3.71 0.49 709.6 1650.91736 3 R.LVLEVAQHLGESTVR.T
4.45E-07 4.55 0.48 1729.7 1650.91736 2 R.LVLEVAQHLGESTVR.T
6.10E-04 2.68 0.34 862.7 1088.63501 2 K.VVDLLAPYAK.G
3.72E-06 4.02 0.53 1703.0 1435.75403 2 R.FTQAGSEVSALLGR.I
1.02E-06 3.91 0.56 844.1 1988.03345 2 R.AIAELGIYPAVDPLDSTSR.I
2.39E-04 4.07 0.59 1857.8 1439.78931 2 R.VALTGLTVAEYFR.D

4-7 CA1 45025 5.68E- 20.1 6.7 2885 13.41 17 05 9 0 2.4 5.68E- 3.78 0.3 815. 0.3912 3 K.EIINVGHSFHVNFEDNDNR.S 05 6 7 1 8.55E- 2.75 0.2 423. 1.4304 2 K.LYPIANGNNQSPVDIK.T 05 1 3 7

4-7 CAPG 63252 8.61E- 80.2 3847 34.77 913 08 8 4.5 8.17E- 3.88 0.5 1465 1280.6 2 K.VSDATGQM#NLTK.V 05 1 .3 1508 1.16E- 3.40 0.3 869. 1649.8 3 R.GLKYQEGGVESAFHK.T 04 7 6 2825 6.75E- 2.21 0.3 534. 1258.6 2 R.DLALAIRDSER.Q 04 1 7 7505 8.61E- 4.31 0.5 617. 2319.1 3 K.EGNPEEDLTADKANAQAAALYK.V 08 7 6 0986 8.20E- 5.52 0.4 1479 2117.1 3 K.ANEKERQAALQVAEGFISR.M 07 9 .8 0986 4.89E- 3.53 0.4 441. 2778.4 3 K.AQVEIVTDGEEPAEM#IQVLGPKPALK.E 04 8 5 5936 1.05E- 4.15 0.5 1634 1389.7 2 R.QAALQVAEGFISR.M 04 7 .8 4854 2.42E- 4.30 0.5 1491 1934.8 2 R.EVQGNESDLFM#SYFPR.G 05 7 .8 5889

4-7 CAPNS1 40674 1.40E- 60.3 4.8 2821 38.81 605 09 0 0 1.7 1.08E- 4.37 0.5 2165 1373.5 2 R.SM#VAVM#DSDTTGK.L 05 8 .3 9228 1.14E- 4.80 0.4 2013 1777.7 3 R.THYSNIEANESEEVR.Q 04 1 .6 9871 2.84E- 4.76 0.5 2650 1777.7 2 R.THYSNIEANESEEVR.Q 07 6 .1 9871 4.73E- 3.52 0.4 1169 1141.5 2 K.TDGFGIDTC*R.S 05 1 .2 2478 1.88E- 5.78 0.6 1590 2555.2 2 R.LFAQLAGDDM#EVSATELM#NILNK.V 09 1 .5 3674 1.40E- 5.96 0.6 2543 2447.2 3 R.ILGGVISAISEAAAQYNPEPPPPR.T 09 3 .7 9297 9.53E- 4.55 0.6 1410 2284.9 2 R.YSDESGNM#DFDNFISC*LVR.L 07 4 .5 7905

4-7 CAPZA1 54535 2.22E- 60.2 5.4 3290 33.57 97 07 6 0 2.3 2.22E- 3.41 0.5 800. 1542.6 2 K.EASDPQPEEADGGLK.S 07 4 2 9177 1.64E- 3.42 0.4 929. 2022.9 3 R.AYVKDHYSNGFC*TVYAK.T 04 0 8 6836 4.02E- 3.99 0.5 1501 1705.8 2 K.DVQDSLTVSNEAQTAK.E 05 8 .1 2385 5.18E- 3.30 0.3 1297 1197.6 2 R.LLLNNDNLLR.E 04 0 .7 9495 2.15E- 5.27 0.2 1615 2089.0 3 K.FITHAPPGEFNEVFNDVR.L 05 7 .7 1367 3.11E- 3.07 0.3 269. 2314.1 3 K.TIDGQQTIIAC*IESHQFQPK.N 04 9 5 8014

4-7 CCT2 5453603 4.56E-11 160.29 6.00 57452.3 48.79
2.30E-05 3.35 0.45 1449.6 1282.66370 2 K.VAEIEHAEKEK.M
3.83E-06 3.63 0.62 1084.5 1546.69143 2 R.AAHSEGNTTAGLDM#R.E
9.21E-04 2.68 0.42 1248.1 1114.50849 2 K.EAVAM#ESYAK.A
1.60E-07 4.68 0.54 2214.4 1656.80750 2 R.EALLSSAVDHGSDEVK.F
5.21E-05 2.77 0.48 503.0 2398.12532 3 R.TVYGGGC*SEM#LM#AHAVTQLANR.T
6.76E-04 3.17 0.38 889.1 1291.64498 2 K.LIEEVM#IGEDK.L
2.23E-04 4.10 0.57 1048.6 1494.74682 2 R.QDLM#NIAGTTLSSK.L
1.80E-04 4.58 0.51 1720.1 1564.78868 2 R.DASLM#VTNDGATILK.N
1.99E-09 5.70 0.61 1617.0 2169.14380 3 K.KLGGSLADSYLDEGFLLDKK.I
2.96E-05 3.27 0.44 797.5 2097.12256 3 R.LALVTGGEIASTFDHPELVK.L
2.20E-08 5.30 0.65 1672.7 2097.12256 2 R.LALVTGGEIASTFDHPELVK.L
6.80E-07 4.43 0.51 1652.5 1554.86138 2 R.SLHDALC*VLAQTVK.D
4.56E-11 5.13 0.56 2130.8 2363.19112 3 R.M#LPTIIADNAGYDSADLVAQLR.A
2.36E-08 5.11 0.61 2564.0 2363.19112 2 R.M#LPTIIADNAGYDSADLVAQLR.A
1.03E-07 5.00 0.54 1123.3 2025.02087 2 R.EGTIGDM#AILGITESFQVK.R
1.12E-06 4.94 0.34 2493.7 1582.91626 2 R.QVLLSAAEAAEVILR.V
8.35E-07 4.98 0.51 1918.7 1517.89380 2 R.LTSFIGAIAIGDLVK.S
1.16E-10 5.72 0.58 2590.3 2288.16162 2 R.VQDDEVGDGTTSVTVLAAELLR.E
4.33E-07 4.69 0.50 1562.7 2288.16162 3 R.VQDDEVGDGTTSVTVLAAELLR.E

4-7 CES1 68508 3.82E- 70.2 6.1 6235 16.96 957 09 8 0 3.2 9.96E- 2.20 0.2 305. 1490.7 3 K.AVEKPPQTEHIEL 05 2 2 8491 1.11E- 4.88 0.4 1588 1591.8 2 K.EGYLQIGANTQAAQK.L 04 1 .0 0750 9.44E- 3.93 0.4 1608 1436.7 2 R.GNWGHLDQVAALR.W 06 2 .5 3938 3.92E- 3.29 0.5 585. 1727.7 2 R.DAGAPTYM#YEFQYR.P 07 1 8 3698 8.49E- 3.12 0.4 895. 1260.6 2 K.FLSLDLQGDPR.E 04 8 1 5833 2.74E- 3.52 0.5 1233 1348.7 2 K.AGQLLSELFTNR.K 04 2 .7 2192 3.82E- 5.67 0.5 2039 2129.0 2 K.LSEDC*LYLNIYTPADLTK.K 09 8 .8 7763

4-7 COMT 6466450 3.38E-07 30.26 5.02 24433.4 20.36
4.78E-05 2.98 0.52 999.8 1146.57898 2 K.AIYKGPGSEAGP
3.11E-04 3.04 0.48 1080.1 2170.95677 3 R.GSSC*FEC*THYQSFLEYR.E
3.38E-07 5.22 0.59 1440.0 1827.99385 2 R.LITIEINPDC*AAITQR.M

4-7 CTSD 4503143 1.07E-10 60.31 6.10 44523.7 24.7
1.07E-10 5.10 0.60 1004.8 1803.80652 2 R.DPDAQPGGELM#LGGTDSK.Y
4.27E-04 3.38 0.40 1555.3 1255.61396 2 K.FDGILGM#AYPR.I
2.30E-07 5.57 0.67 2245.3 2350.12062 2 K.EGC*EAIVDTGTSLM#VGPVDEVR.E
2.75E-07 3.84 0.52 842.3 2350.12062 3 K.EGC*EAIVDTGTSLM#VGPVDEVR.E
4.97E-08 4.61 0.53 1200.3 2005.04382 2 K.AIGAVPLIQGEYM#IPC*EK.V
3.11E-04 4.11 0.50 1444.2 2317.19506 3 K.AYWQVHLDQVEVASGLTLC*K.E
8.73E-06 3.72 0.56 1495.9 1601.83228 2 K.LVDQNIFSFYLSR.D
5.19E-04 4.05 0.42 1295.4 1601.83228 3 K.LVDQNIFSFYLSR.D

4-7 CTSX(P? or Z?) 3719219 5.49E-06 30.18 6.10 32681.6 8.19
5.49E-06 2.94 0.39 1452.8 1267.62769 2 R.VGDYGSLSGREK.M
3.58E-04 3.64 0.47 1459.1 1010.49017 2 R.VGDYGSLSGR.E
1.68E-05 3.40 0.50 1476.3 1308.65430 2 R.NVDGVNYASITR.N

4-7 DPYSL2 45033 4.75E- 150. 5.9 6225 39.34 77 10 32 3 4.7 6.52E- 3.12 0.5 606. 1310.6 2 R.M#VIPGGIDVHTR.F 05 2 1 8852 9.84E- 3.08 0.3 1380 1262.6 2 K.GIQEEM#EALVK.D 04 8 .2 2966 1.13E- 5.48 0.0 2270 1741.8 2 K.M#DENQFVAVTSTNAAK.I 08 8 .4 0613 1.10E- 3.07 0.3 927. 1084.6 2 R.GSPLVVISQGK.I 03 8 1 3611 9.15E- 2.67 0.4 663. 1140.6 2 R.KPFPDFVYK.R 04 1 6 0876 5.40E- 3.99 0.3 1004 2102.9 3 K.THNSSLEYNIFEGM#EC*R.G 07 0 .0 2114 2.49E- 3.31 0.6 1355 1620.8 2 R.GLYDGPVC*EVSVTPK.T 07 0 .7 2432 6.55E- 2.26 0.3 526. 1323.7 2 K.QIGENLIVPGGVK.T 03 8 4 6306 7.23E- 4.65 0.6 1462 1820.9 2 R.AITIANQTNC*PLYITK.V 09 9 .2 8803 2.05E- 5.38 0.4 2819 2377.1 3 R.DIGAIAQVHAENGDIIAEEQQR.I 08 1 .2 7432 2.51E- 4.81 0.5 1676 2182.9 2 R.FQM#PDQGM#TSADDFFQGTK.A 06 8 .5 0557 3.62E- 2.22 0.0 323. 1031.5 2 K.SSAEVIAQAR.K 04 5 5 4797 3.42E- 2.99 0.3 873. 1792.8 2 K.DNFTLIPEGTNGTEER.M 06 1 6 3484 4.75E- 4.72 0.6 1956 1899.9 2 R.IAVGSDADLVIWDPDSVK.T 10 2 .9 6985 4.26E- 4.72 0.6 1951 2365.0 2 K.IVNDDQSFYADIYM#EDGLIK.Q 09 5 .6 9041

4-7 ECH1 16924 4.24E- 70.3 8.2 3573 28.35 265 09 0 3 5.4 9.97E- 2.98 0.2 977. 1596.7 3 R.KM#M#ADEALGSGLVSR.V 04 8 7 7197 4.44E- 4.64 0.5 2281 1468.6 2 K.M#M#ADEALGSGLVSR.V 06 5 .3 7701 3.64E- 3.02 0.4 1065 1298.6 2 R.YQETFNVIER.C 04 1 .7 3757 1.99E- 3.28 0.5 1424 1376.6 2 R.YC*AQDAFFQVK.E 05 5 .6 6088 5.57E- 5.30 0.1 2936 1542.8 2 K.EVDVGLAADVGTLQR.L 07 5 .7 1226 4.24E- 3.87 0.5 993. 1731.9 2 K.VIGNQSLVNELAFTAR.K 09 8 1 3884 2.41E- 5.96 0.6 1691 2998.3 3 R.DHSVAESLNYVASWNM#SM#LQTQDLV 04 0 .7 9207 K.S

4-7 FGB 70906435 7.49E-10 30.26 8.20 55892.2 7.74
3.40E-04 4.49 0.42 1626.7 1972.94081 3 K.IQKLESDVSAQM#EYC*R.T
1.08E-07 3.84 0.56 1321.4 1618.85381 2 R.TPC*TVSC*NIPVVSGK.E
7.49E-10 4.98 0.49 1608.9 2548.31201 3 R.TPC*TVSC*NIPVVSGKEC*EEIIR.K

4-7 FGG 18244 1.58E- 150. 5.2 5146 52.54 0 10 32 0 3.9 4.96E- 2.49 0.3 485. 1034.5 2 K.VGPEADKYR.L 04 9 4 2649

1.80E- 4.04 0.4 1563 1545.8 2 R.LTIGEGQQHHLGGAK.Q 04 5 .3 1323 2.75E- 3.69 0.5 1297 1560.7 2 K.VAQLEAQC*QEPC*K.D 05 0 .8 7556 2.53E- 2.17 0.4 691. 1150.5 2 R.TSTADYAM#FK.V 04 0 0 0849 2.27E- 4.76 0.3 1433 1513.7 2 R.YLQEIYNSNNQK.I 05 0 .0 2815 9.77E- 4.01 0.5 1045 1491.7 2 K.YEASILTHDSSIR.Y 09 1 .9 4377 2.00E- 4.94 0.5 1613 2768.3 3 K.VAQLEAQC*QEPC*KDTVQIHDITGK.D 08 8 .4 9526 5.24E- 2.72 0.3 1091 1117.5 2 R.VELEDWNGR.T 04 8 .5 2722 4.86E- 3.44 0.3 619. 2536.2 3 K.AIQLTYNPDESSKPNM#IDAATLK.S 05 9 9 5993 2.04E- 4.52 0.5 2022 1682.9 2 K.IHLISTQSAIPYALR.V 08 7 .2 5886 7.18E- 4.55 0.6 1038 2207.0 2 K.EGFGHLSPTGTTEFWLGNEK.I 09 3 .7 4028 5.25E- 4.73 0.5 1138 1893.9 2 K.ASTPNGYDNGIIWATWK.T 06 1 .6 1296 2.14E- 5.51 0.5 2418 2661.2 3 K.ANQQFLVYC*EIDGSGNGWTVFQK.R 10 9 .5 7075 1.58E- 5.27 0.6 1466 2661.2 2 K.ANQQFLVYC*EIDGSGNGWTVFQK.R 10 1 .6 7075 1.18E- 4.83 0.5 963. 2417.1 2 R.FGSYC*PTTC*GIADFLSTYQTK.V 06 7 6 3988 7.89E- 4.81 0.5 1522 2417.1 3 R.FGSYC*PTTC*GIADFLSTYQTK.V 06 3 .1 3988 8.70E- 6.43 0.6 2055 2834.1 2 R.LTYAYFAGGDAGDAFDGFDFGDDPSDK. 09 7 .3 7407 F

4-7 FKBP4 45037 2.97E- 90.2 32. 5177 32.46 29 09 7 50 2.2 1.20E- 2.66 0.3 743. 2105.8 3 K.AEASSGDHPTDTEM#KEEQK.S 05 1 0 9277 5.37E- 3.54 0.5 1323 1408.6 2 K.SNTAGSQSQVETEA 05 2 .4 1865 6.25E- 2.11 0.4 433. 1425.5 2 R.EGTGTEM#PM#IGDR.V 05 0 7 9843 2.24E- 4.88 0.5 2312 1697.8 3 R.RGEAHLAVNDFELAR.A 07 1 .9 7183 3.19E- 4.45 0.4 882. 2101.0 2 K.ATESGAQSAPLPM#EGVDISPK.Q 04 9 5 1176 3.09E- 4.44 0.5 2422 1438.7 2 K.LQAFSAAIESC*NK.A 06 5 .5 3003 1.71E- 3.13 0.3 639. 2039.0 3 K.GEHSIVYLKPSYAFGSVGK.E 06 5 0 5969 3.74E- 4.18 0.6 1094 2145.9 2 K.IVSWLEYESSFSNEEAQK.A 08 4 .4 9756 2.97E- 5.33 0.6 1377 1950.9 2 R.FEIGEGENLDLPYGLER.A 09 3 .4 4434

4-7 HNRPF 16876910 3.76E-07 40.30 5.27 45670.9 14.46
4.36E-06 3.74 0.60 1035.9 1630.72046 2 K.HSGPNSADSANDGFVR.L
4.93E-04 3.01 0.52 1086.6 1092.57959 2 R.VHIEIGPDGR.V
1.67E-06 5.81 0.53 2161.2 1867.94360 2 K.ITGEAFVQFASQELAEK.A
3.76E-07 4.22 0.58 1543.7 1996.97632 2 R.ATENDIYNFFSPLNPVR.V

4-7 HNRPF1 48145 3.09E- 30.2 4909 10.24 673 07 5 9.3 3.09E- 3.40 0.5 1304 1684.7 2 K.HTGPNSPDTANDGFVR.L 07 4 .3 6733 1.79E- 3.61 0.5 496. 1504.7 2 R.GLPWSC*SADEVQR.F 05 0 7 1544 5.86E- 4.95 0.4 2070 1841.8 2 R.STGEAFVQFASQEIAEK.A 06 7 .4 9160

4-7 HP 78174 4.57E- 30.2 3835 390 05 2 5.3 4.57E- 2.56 0.4 953. 980.49 2 R.VGYVSGWGR.N 8.93 05 1 8 481 8.88E- 3.83 0.5 1790 1345.6 2 K.SC*AVAEYGVYVK.V 05 8 .4 7620 1.67E- 3.56 0.3 1114 1203.6 2 K.VTSIQDWVQK.T 04 8 .6 3684

4-7 HSP90 15010 1.15E- 180. 4.5 9013 26.85 (gp96) 550 10 28 8 8.0 8.75E- 3.13 0.3 1533 1139.5 2 K.LGVIEDHSNR.T 05 7 .9 8032 8.93E- 2.40 0.1 611. 1167.5 2 K.EGVKFDESEK.T 03 5 3 5286 3.22E- 3.04 0.3 689. 982.48 2 K.SGTSEFLNK.M 02 2 5 401 2.06E- 2.21 0.3 611. 1150.5 2 K.EAESSPFVER.L 03 4 4 3748 4.29E- 2.50 0.3 916. 1047.4 2 K.IYFM#AGSSR.K 04 7 9 9278 1.21E- 3.49 0.5 1210 1289.6 2 K.DISTNYYASQK.K 04 6 .4 0083 2.03E- 3.66 0.3 1144 1529.7 3 K.NLLHVTDTGVGM#TR.E 04 0 .0 7404 1.88E- 3.70 0.4 2187 1529.7 2 K.NLLHVTDTGVGM#TR.E 04 7 .2 7404 2.46E- 2.84 0.3 787. 1275.6 2 R.ELISNASDALDK.I 04 9 5 4270 4.28E- 2.09 0.3 225. 1515.7 3 K.IADDKYNDTFWK.E 03 0 6 1143 9.69E- 3.22 0.4 1638 1515.7 2 K.IADDKYNDTFWK.E 07 9 .6 1143 1.17E- 5.07 0.4 2417 2260.0 3 R.FQSSHHPTDITSLDQYVER.M 07 3 .5 6299 7.09E- 3.50 0.5 1853 1544.8 2 R.ELISNASDALDKIR.L 05 1 .9 2788 2.67E- 1.80 0.1 568. 1015.4 2 R.GLFDEYGSK.K 03 8 9 7308 1.04E- 2.77 0.2 475. 963.58 2K.LIINSLYK.N 02 8 1 734 1.54E- 2.96 0.2 872. 1187.6 2 K.SILFVPTSAPR.G 03 7 1 7834 2.29E- 4.39 0.5 2285 1734.7 2 -.DDEVDVDGTVEEDLGK.S 08 7 .3 5525 1.04E- 4.69 0.5 2550 1785.8 2 R.EEEAIQLDGLNASQIR.E 04 9 .1 9771 1.42E- 4.05 0.3 2006 1525.7 2 K.EEASDYLELDTIK.N 05 5 .4 2681 1.15E- 5.66 0.6 1625 2046.0 2 R.LISLTDENALSGNEELTVK.I 10 2 .9 6006

4-7 HSP90AA1 12654329 3.90E-10 120.25 5.00 64349.8 25.73
1.48E-04 2.85 0.14 572.7 1296.66553 3 K.LGIHEDSQNRK.K
1.05E-05 3.09 0.48 844.4 1224.62585 2 K.HIYYITGETK.D
1.99E-05 3.96 0.60 1241.2 1566.69920 2 R.YYTSASGDEM#VSLK.D
4.69E-06 3.60 0.58 1067.7 1550.70435 2 R.YYTSASGDEMVSLK.D
8.15E-08 3.02 0.45 599.2 1915.03955 3 K.KHLEINPDHSIIETLR.Q
5.04E-06 4.44 0.48 1491.4 1527.74377 2 K.SLTNDWEDHLAVK.H
7.00E-05 4.27 0.51 1484.9 1786.94458 2 K.HLEINPDHSIIETLR.Q
1.65E-06 3.28 0.60 1197.5 1348.66443 2 K.HFSVEGQLEFR.A
9.40E-06 5.06 0.51 2167.7 1833.78137 2 R.NPDDITNEEYGEFYK.S
4.59E-04 2.88 0.51 1084.6 1108.54224 2 R.APFDLFENR.K
3.90E-10 4.46 0.66 1071.4 2462.17594 2 R.LVTSPC*C*IVTSTYGWTANM#ER.I
3.44E-06 4.16 0.30 959.1 2593.29819 3 K.HGLEVIYM#IEPIDEYC*VQQLK.E

4-7 HSP90AB 39644 1.40E- 220. 4.9 7474 37.08 1 662 09 27 0 6.0 2.57E- 2.20 0.3 459. 1297.6 3 K.LGIHEDSTNRR.R 03 3 7 6077 8.34E- 1.89 0.0 410. 1865.8 3 K.IEDVGSDEEDDSGKDKK.K 03 2 1 2471 1.19E- 3.26 0.4 684. 1141.5 2 K.LGIHEDSTNR.R 03 2 4 5957 3.93E- 2.92 0.5 775. 1296.4 2 R.DNSTM#GYM#M#AK.K 03 2 8 9044 5.88E- 2.43 0.2 443. 1009.5 2 K.AKFENLC*K.L 03 9 0 4406 9.69E- 3.38 0.2 663. 1151.5 2 K.YIDQEELNK.T 03 4 4 5786 4.18E- 2.51 0.0 790. 886.54 2 R.RLSELLR.Y 02 8 7 688 1.27E- 2.68 0.3 788. 1249.6 2 K.EQVANSAFVER.V 03 7 4 1719 3.13E- 4.14 0.4 966. 2192.9 3 R.YHTSQSGDEM#TSLSEYVSR.M 07 3 9 4005 7.45E- 5.40 0.6 2269 2192.9 2 R.YHTSQSGDEM#TSLSEYVSR.M 08 0 .9 4005 2.57E- 2.12 0.2 651. 891.42 1 K.FYEAFSK.N 03 9 1 468 4.66E- 3.84 0.2 682. 1364.7 3 R.RAPFDLFENKK.K 02 2 4 3206 3.16E- 3.12 0.5 1079 1160.5 2 K.SIYYITGESK.E 05 3 .5 8337 1.09E- 1.99 0.3 910. 1160.5 1 K.SIYYITGESK.E 04 5 8 8337 4.41E- 2.14 0.0 416. 1416.6 2 K.EGLELPEDEEEK.K 03 9 1 3770 7.63E- 3.63 0.2 1128 1782.9 3 K.HLEINPDHPIVETLR.Q 06 4 .5 4971 3.52E- 2.73 0.0 350. 1365.7 2 R.TLTLVDTGIGM#TK.A 03 0 0 2938 1.68E- 3.33 0.5 1228 1348.6 2 K.HFSVEGQLEFR.A 06 4 .5 6443 1.10E- 4.40 0.5 1198 1527.7 2 K.SLTNDWEDHLAVK.H 05 3 .4 4377 5.81E- 3.18 0.2 934. 1236.6 2 R.RAPFDLFENK.K 03 4 5 3721 5.60E- 5.10 0.6 2083 1847.7 2 R.NPDDITQEEYGEFYK.S 09 1 .8 9700 3.70E- 2.15 0.3 169. 1080.5 2 R.APFDLFENK.K 03 3 0 3601 1.40E- 4.61 0.6 1327 2448.1 2 R.LVSSPC*C*IVTSTYGWTANM#ER.I 09 4 .6 6029 2.92E- 4.94 0.3 1179 2464.1 2 R.GFEVVYM#TEPIDEYC*VQQLK.E 07 1 .2 7160

4-7 HSPA5 16507237 1.35E-11 158.28 4.90 72288.5 33.49
1.40E-03 2.61 0.50 714.8 1191.63684 2 K.VYEGERPLTK.D
5.30E-04 3.62 0.32 1708.7 1228.62805 2 K.VEIIANDQGNR.T
1.59E-03 3.48 0.48 1197.1 1430.69104 2 R.TWNDPSVQQDIK.F
2.93E-03 3.03 0.29 825.5 1653.97852 3 K.KKELEEIVQPIISK.L
1.53E-03 3.10 0.46 1425.1 1233.62558 2 K.DAGTIAGLNVM#R.I
1.35E-11 5.54 0.62 1412.1 2175.99292 2 K.LYGSAGPPPTGEEDTAEKDEL
2.31E-07 4.53 0.61 1192.4 1677.80786 2 K.NQLTSNPENTVFDAK.R
4.78E-08 4.56 0.63 1065.1 1887.97119 2 K.VTHAVVTVPAYFNDAQR.Q
1.59E-07 5.04 0.62 2196.4 1836.93384 2 K.SQIFSTASDNQPTVTIK.V
8.19E-05 4.55 0.53 1887.5 1588.85413 2 K.KSDIDEIVLVGGSTR.I
1.65E-05 2.80 0.58 799.5 1566.77991 2 R.ITPSYVAFTPEGER.L
9.76E-04 3.62 0.36 1389.8 1316.63684 2 R.NELESYAYSLK.N
3.24E-02 3.83 0.34 1235.6 1397.78857 2 K.ELEEIVQPIISK.L
2.38E-04 3.35 0.44 1338.8 1552.79271 2 K.TFAPEEISAM#VLTK.M
2.36E-07 3.93 0.55 1313.3 1934.01306 2 K.DNHLLGTFDLTGIPPAPR.G
2.40E-09 3.99 0.63 1675.7 2164.99219 2 R.IEIESFYEGEDFSETLTR.A

4-7 IMPDH2 3.26E- 70.2 6.5 5576 18.48 09 6 0 9.7 3.71E- 2.35 0.2 507. 1063.5 2 K.NRDYPLASK.D 02 8 4 5310 4.50E- 3.92 0.5 1683 1158.6 2 K.VAQGVSGAVQDK.G 05 1 .6 1133 4.21E- 3.48 0.4 896. 1430.7 2 R.HGFC*GIPITDTGR.M 04 8 8 1505 2.44E- 2.80 0.2 809. 1481.8 3 K.REDLVVAPAGITLK.E 03 5 2 6865 3.26E- 5.12 0.5 1689 2048.1 3 R.RFGVPVIADGGIQNVGHIAK.A 09 9 .4 3989 9.10E- 4.37 0.4 1480 1820.9 3 K.KYEQGFITDPVVLSPK.D 04 3 .3 7925 1.73E- 4.64 0.5 1762 1820.9 2 K.KYEQGFITDPVVLSPK.D 08 8 .2 7925 4.65E- 2.89 0.2 1091 1156.6 2 K.NLIDAGVDALR.V 03 2 .4 3208

4-7 KRT18 12653 4.51E- 60.2 5.3 4800 19.77 819 11 9 0 2.6 2.40E- 3.16 0.3 1004 1174.6 2 R.KVIDDTNITR.L 04 3 .6 4258 9.46E- 3.14 0.3 1000 1065.5 2 K.LEAEIATYR.R 04 8 .5 5750 4.24E- 3.37 0.5 1817 1319.6 2 R.AQIFANTVDNAR.I 05 2 .0 7029 1.95E- 2.33 0.2 1155 1041.6 2 R.IVLQIDNAR.L 04 5 .7 0510 1.60E- 5.79 0.5 2058 2293.0 3 R.GGM#GSGGLATGIAGGLAGM#GGIQNE 07 1 .0 9108 K.E 4.51E- 5.63 0.6 2945 1884.0 2 K.GLQAQIASSGLTVEVDAPK.S 11 8 .9 0732

4-7 KRT8 33875 2.98E- 60.2 5.5 5578 18.01 698 10 9 0 7.2 2.76E- 3.52 0.0 1116 1137.5 2 K.YEELQSLAGK.H 04 6 .9 7861 2.98E- 5.88 0.6 1849 2125.0 2 R.ELQSQISDTSVVLSM#DNSR.S 10 6 .4 0774 1.10E- 3.98 0.4 1591 1344.6 2 R.ASLEAAIADAEQR.G 04 9 .7 7542 1.60E- 5.40 0.5 1873 2050.0 3 K.LKLEAELGNM#QGLVEDFK.N 08 5 .2 5250 9.10E- 4.64 0.5 1323 1879.7 2 R.SNM#DNM#FESYINNLR.R 06 4 .8 9489 2.30E- 4.20 0.5 1705 1419.7 2 R.LEGLTDEINFLR.Q 06 8 .4 4780

4-7 KRT9 43547 4.44E- 90.4 5.0 27.45 6 16 2 0 1.31E- 5.52 0.6 1791.7 2 R.GGSGGSYGGGGSGGGYGGGSGSR.G 10 9 2766 8.44E- 3.37 0.6 1235.5 2 R.FSSSSGYGGGSSR.V 07 2 2869

1.66E- 4.56 0.5 1232.5 2 R.SGGGGGGGLGSGGSIR.S 04 6 9778 3.07E- 8.36 0.7 3223.2 3 R.GGSGGSHGGGSGFGGESGGSYGGGE 11 4 8149 EASGSGGGYGGGSGK.S

3.21E- 5.30 0.4 1586.7 2 K.VQALEEANNDLENK.I 06 6 6563 2.30E- 3.37 0.3 1586.7 3 K.VQALEEANNDLENK.I 07 6 6563 1.92E- 6.10 0.5 2510.1 3 K.EIETYHNLLEGGQEDFESSGAGK.I 05 7 3184 2.20E- 5.39 0.5 1966.0 3 R.HGVQELEIELQSQLSKK.A 05 4 6042 1.60E- 6.10 0.5 1837.9 2 R.HGVQELEIELQSQLSK.K 09 4 6545 1.00E- 4.20 0.4 1837.9 3 R.HGVQELEIELQSQLSK.K 04 5 6545 4.44E- 5.18 0.6 2902.4 3 K.NYSPYYNTIDDLKDQIVDLTVGNNK.T 16 2 1040

4-7 LDHD 37595 6.10E- 60.2 6.0 5211 16.74 756 08 2 2 1.5 6.10E- 4.33 0.6 1204 1409.7 2 K.AVVGGSHVSTAAVVR.E 08 1 .2 8589 2.72E- 2.87 0.5 896. 1355.6 2 K.AVLDPQGLM#NPGK.V 04 7 1 9875 2.89E- 3.53 0.5 1092 1353.6 2 K.GYSTDVC*VPISR.L 05 3 .9 7726 1.81E- 3.46 0.3 1657 1273.6 2 R.HNAWYAALATR.P 06 3 .2 4368 4.44E- 4.49 0.5 1611 1774.9 2 R.QLLQEEVGAVGVETM#R.Q 06 4 .0 0036 1.40E- 3.73 0.4 2342 1552.8 2 R.DNVLNLEVVLPDGR.L 07 2 .0 3301

4-7 MAPRE1 6912494 8.82E-10 50.29 4.90 29980.2 30.97
9.50E-04 2.83 0.57 745.3 1076.46838 2 K.FFDANYDGK.D
3.56E-07 4.27 0.47 1088.3 2019.12329 3 R.QGQETAVAPSLVAPALNKPK.K
4.55E-04 4.69 0.39 1500.8 2442.19291 3 R.KNPGVGNGDDEAAELM#QQVNVLK.L
8.82E-10 5.89 0.63 2321.8 2270.10229 2 R.NIELIC*QENEGENDPVLQR.I
3.06E-06 4.87 0.49 1987.4 1634.76379 2 K.FQDNFEFVQWFK.K

4-7 MRLC2 15809 1.88E- 70.2 4.5 1976 56.98 016 08 8 0 6.5 1.70E- 2.83 0.3 697. 1253.5 2 K.EAFNM#IDQNR.D 02 8 3 5790 4.15E- 3.40 0.4 1519 1228.6 2 K.LNGTDPEDVIR.N 04 5 .2 1682 4.28E- 3.66 0.4 1429 1415.6 2 R.FTDEEVDELYR.E 06 6 .5 3257 4.99E- 2.98 0.2 911. 2019.9 3 R.DGFIDKEDLHDM#LASLGK.N 04 7 9 6917 2.04E- 2.92 0.3 789. 1260.6 2 K.GNFNYIEFTR.I 03 1 2 0071 1.88E- 5.09 0.6 2535 2106.9 2 R.ATSNVFAM#FDQSQIQEFK.E 08 0 .2 8007 1.31E- 5.66 0.6 2765 2350.0 2 R.NAFAC*FDEEATGTIQEDYLR.E 07 1 .4 5975

4-7 NNMT 5453790 1.95E-07 30.26 5.50 29555.1 18.56
5.89E-06 3.41 0.52 953.2 1249.59607 2 K.DTYLSHFNPR.D
1.95E-07 5.01 0.55 2386.4 2022.98657 2 K.EIVVTDYSDQNLQELEK.W
2.45E-06 5.30 0.42 1145.6 2611.24387 3 K.KEPEAFDWSPVVTYVC*DLEGNR.V

4-7 OXCT 48146 4.50E- 160. 7.2 5615 40.58 215 10 31 0 6.0 3.57E- 4.50 0.5 1949 1505.8 3 K.YNKDGSVAIASKPR.E 04 1 .9 0701 2.81E- 3.27 0.3 1166 1100.6 2 K.DGSVAIASKPR.E 05 6 .1 0583 3.28E- 2.52 0.3 885. 921.50 2 K.AVFDVDKK.K 03 4 1 403 4.29E- 3.62 0.5 1380 1255.5 2 K.GM#GGAM#DLVSSAK.T 03 4 .6 6567 3.73E- 2.73 0.5 1396 1168.5 2 K.STGC*DFAVSPK.L 03 4 .7 6084 8.31E- 2.45 0.3 392. 889.51 2 K.C*TLPLTGK.Q 03 9 5 170 1.18E- 1.60 0.3 188. 776.44 1 R.AGNVIFR.K 02 4 0 135 3.82E- 2.28 0.1 604. 1039.5 2R.NFNLPM#C*K.A 02 4 2 0047 3.23E- 3.90 0.2 2570 1675.8 3 R.GGHVDLTM#LGAM#QVSK.Y 03 6 .7 1417 4.10E- 5.10 0.4 2266 1675.8 2 R.GGHVDLTM#LGAM#QVSK.Y 05 7 .5 1417 1.82E- 4.75 0.6 1929 1633.7 2 R.M#VSSYVGENAEFER.Q 06 5 .2 1625 1.17E- 2.12 0.3 526. 1168.5 2 K.FYTDPVEAVK.D 03 2 8 8850 5.70E- 4.21 0.5 1587 1380.6 2 K.YGDLANWM#IPGK.K 04 3 .5 6163 4.50E- 5.35 0.6 943. 2595.3 2 R.AGGAGVPAFYTPTGYGTLVQEGGSPIK. 10 8 5 0884 Y 9.98E- 5.21 0.4 1112 2595.3 3 R.AGGAGVPAFYTPTGYGTLVQEGGSPIK. 07 8 .8 0884 Y 7.30E- 3.78 0.5 942. 2233.1 2 R.QYLSGELEVELTPQGTLAER.I 07 1 0 3477 3.03E- 5.36 0.6 2167 2101.1 2 K.GLTAVSNNAGVDNFGLGLLLR.S 09 1 .6 4014 5.47E- 4.40 0.6 1203 2421.1 2 K.ETVTILPGASFFSSDESFAM#IR.G 10 1 .2 6424

4-7 PDIA3 21361 5.83E- 100. 5.9 5674 20 657 08 22 0 6.8 5.67E- 4.11 0.5 1364 1397.6 2 K.VDC*TANTNTC*NK.Y 04 2 .5 3946 2.69E- 3.70 0.4 1111 1168.6 2 R.TADGIVSHLKK.Q 05 9 .1 6846 2.10E- 4.17 0.4 1804 1347.7 2 K.RLAPEYEAAATR.L 06 4 .8 0154 1.18E- 3.22 0.5 1279 1236.5 2 R.DGEEAGAYDGPR.T 05 6 .5 1270 3.22E- 3.55 0.4 652. 1802.8 3 R.TAKGEKFVM#QEEFSR.D 05 3 2 7415 5.83E- 4.21 0.5 1203 1652.7 2 K.IFRDGEEAGAYDGPR.T 08 8 .5 6636 2.39E- 4.46 0.5 886. 2317.2 3 R.LKGIVPLAKVDC*TANTNTC*NK.Y 07 1 2 6133 1.10E- 2.95 0.3 664. 1432.7 2 R.LAPEYEAAATRLK.G 06 4 8 7942 6.13E- 3.69 0.4 1200 1645.8 3 K.FLDAGHKLNFAVASR.K 08 4 .6 8086 1.09E- 3.81 0.4 1242 1515.7 2 R.FLQDYFDGNLKR.Y 05 4 .3 5903

4-7 PDIA6 5031973 5.44E-11 80.29 4.80 48091.3 34.68
2.20E-04 3.47 0.41 597.4 1191.55017 2 K.NRPEDYQGGR.T
9.05E-04 3.00 0.34 1406.6 1015.61462 2 K.AATALKDVVK.V
2.44E-09 4.79 0.63 1475.2 1615.84387 2 R.GSTAPVGGGAFPTIVER.E
1.19E-05 4.41 0.57 2568.5 1527.84888 2 K.LAAVDATVNQVLASR.Y
2.30E-04 4.05 0.45 1516.6 1483.71753 2 K.GSFSEQGINEFLR.E
3.00E-05 5.58 0.45 1142.7 2646.37374 3 R.TC*EEHQLC*VVAVLPHILDTGAAGR.N
1.74E-04 3.73 0.43 1425.4 1386.75867 2 R.TGEAIVDAALSALR.Q
5.44E-11 5.50 0.65 1439.3 2758.26758 2 R.DGELPVEDDIDLSDVELDDLGKDEL

4-7 PGAM1 38566176 1.46E-07 30.23 15.40 28801.9 15.35
1.31E-05 3.63 0.33 781.1 1979.87695 3 R.FSGWYDADLSPAGHEEAK.R
8.75E-07 4.13 0.46 1111.5 1868.88877 2 R.YADLTEDQLPSC*ESLK.D
1.46E-07 4.51 0.44 1153.7 2425.18568 3 R.YADLTEDQLPSC*ESLKDTIAR.A

4-7 PGM1 21361 1.56E- 160. 6.3 6141 37.72 621 08 28 0 0.6 1.37E- 1.42 0.2 289. 1050.4 2 R.SM#PTSGALDR.V 03 9 0 8842 9.47E- 4.20 0.4 1009 1649.8 2 K.TQAYQDQKPGTSGLR.K 06 4 .2 2422 2.79E- 3.59 0.4 723. 1649.8 3 K.TQAYQDQKPGTSGLR.K 02 2 5 2422 8.39E- 3.13 0.4 1132 1090.5 2 R.LSGTGSAGATIR.L 04 8 .5 8508 1.13E- 3.56 0.4 1352 1201.6 2 R.QEATLVVGGDGR.F 03 1 .4 1719 1.36E- 2.75 0.2 714. 1401.7 2 R.IDAM#HGVVGPYVK.K 03 9 6 1948 6.22E- 1.92 0.3 235. 1443.7 3 R.LYIDSYEKDVAK.I 04 7 0 3657 1.68E- 3.32 0.4 1725 1443.7 2 R.LYIDSYEKDVAK.I 05 2 .1 3657 5.60E- 4.69 0.4 1753 2325.0 3 K.SGEHDFGAAFDGDGDRNM#ILGK.H 04 8 .2 2003 2.52E- 2.78 0.3 527. 1145.5 2 K.FFGNLM#DASK.L 03 3 3 2956 3.12E- 4.56 0.5 2524 1630.8 2 K.FNISNGGPAPEAITDK.I 07 0 .9 0713 2.55E- 2.03 0.3 983. 800.48 2K.VDLGVLGK.Q 02 8 9 761 5.80E- 5.62 0.5 1673 1771.7 2 K.ADNFEYSDPVDGSISR.N 06 6 .2 7698 2.40E- 2.85 0.3 847. 1278.6 2 K.IALYETPTGWK.F 02 2 9 7285 1.01E- 2.20 0.2 772. 1125.5 2 K.DLEALM#FDR.S 03 2 6 2447 4.07E- 3.65 0.3 802. 2011.1 2 R.LVIGQNGILSTPAVSC*IIR.K 04 7 5 6739 3.34E- 2.12 0.3 440. 1027.5 2 R.SIFDFSALK.E 04 3 3 4590 8.22E- 1.77 0.1 403. 1027.5 1 R.SIFDFSALK.E 03 3 5 4590 1.56E- 5.58 0.5 2755 1980.1 2 K.INQDPQVM#LAPLISIALK.V 08 8 .9 1980

4-7 PITPNA 46249 1.67E- 60.2 3587 29.11 793 09 9 7.0 1.75E- 2.74 0.3 806. 1165.5 2 K.DYKAEEDPAK.F 04 7 9 3711 1.89E- 3.15 0.4 782. 1438.7 2 R.M#LAPEGALNIHEK.A 06 3 2 3586 7.71E- 3.74 0.3 578. 1859.9 2 R.TVITNEYM#KEDFLIK.I 04 8 6 4591 1.67E- 5.39 0.5 2014 2019.9 2 K.NETGGGEGVEVLVNEPYEK.D 09 9 .2 5056 4.97E- 2.59 0.4 546. 1400.7 2 K.HVEAVYIDIADR.S 06 9 1 1680 1.86E- 5.36 0.6 1704 2494.3 2 R.VILPVSVDEYQVGQLYSVAEASK.N 09 1 .1 0762

4-7 PKM2 33870117 2.11E-12 70.27 8.86 61362.0 21.28
3.43E-04 3.07 0.38 827.1 1213.57690 2 K.ITLDNAYM#EK.C
3.82E-04 3.32 0.54 1539.3 1359.73545 2 R.NTGIIC*TIGPASR.S
9.92E-05 3.25 0.40 2290.8 1779.87598 2 K.GADFLVTEVENGGSLGSK.K
1.22E-04 3.15 0.47 774.4 2465.29224 3 R.TATESFASDPILYRPVAVALDTK.G
4.58E-07 3.64 0.52 1756.4 1462.81519 2 K.IYVDDGLISLQVK.Q
2.56E-05 3.92 0.53 1554.3 1642.77075 2 K.DPVQEAWAEDVDLR.V
2.11E-12 5.31 0.61 1835.5 2175.11792 2 R.LAPITSDPTEATAVGAVEASFK.C

4-7 PMPCB 40226469 8.83E-06 40.24 6.30 53475.1 10.21
1.00E-04 3.12 0.39 1388.6 1351.68274 2 R.LC*TSVTESEVAR.A
1.28E-04 3.63 0.49 1253.9 1101.58984 2 R.IDAVNAETIR.E
8.83E-06 4.21 0.55 1262.0 1713.91296 2 R.STQAATQVVLNVPETR.V
2.89E-05 3.19 0.49 1114.5 1367.68420 2 K.DLVDYITTHYK.G

4-7 PPA1 33875 2.44E- 110. 5.9 3544 50.32 891 12 30 0 8.7 9.41E- 3.35 0.3 651. 1266.7 2 K.DPLNPIKQDVK.K 05 8 9 0520 6.43E- 4.99 0.4 2718 2133.9 3 K.HTGC*C*GDNDPIDVC*EIGSK.V 07 1 .6 5502 1.77E- 5.67 0.5 1679 2133.9 2 K.HTGC*C*GDNDPIDVC*EIGSK.V 08 5 .1 5502 1.89E- 4.80 0.4 1009 1687.7 2 K.GISC*M#NTTLSESPFK.C 07 9 .4 9711 2.86E- 3.30 0.4 523. 2230.0 3 R.YKVPDGKPENEFAFNAEFK.D 05 0 4 8154 9.15E- 3.95 0.4 623. 2444.2 3 K.VIAINVDDPDAANYNDINDVKR.L 12 6 5 0532 1.91E- 5.07 0.5 930. 2444.2 2 K.VIAINVDDPDAANYNDINDVKR.L 11 9 3 0532 2.67E- 2.65 0.3 430. 1938.9 3 K.VPDGKPENEFAFNAEFK.D 04 9 7 2322 3.54E- 4.06 0.5 694. 2355.1 2 R.AIVDALPPPC*ESAC*TVPTDVDK.W 07 8 8 8175 2.44E- 5.96 0.6 1696 2288.1 2 K.VIAINVDDPDAANYNDINDVK.R 12 4 .0 0400 4.43E- 3.75 0.2 957. 2461.1 3 K.GYIWNYGAIPQTWEDPGHNDK.H 06 6 3 2085 3.79E- 3.65 0.5 733. 1694.8 2 R.LKPGYLEATVDWFR.R 09 1 4 9001 5.58E- 5.69 0.6 2787 1805.8 2 K.VLGILAM#IDEGETDWK.V 08 2 .5 9896

4-7 RRBP1 38014595 3.89E-11 180.27 4.80 73639.5 25.49
3.18E-04 3.25 0.50 1118.4 1360.70667 2 K.LKGELESSDQVR.E
1.90E-04 2.85 0.45 717.3 1503.74377 2 K.HPPAPAEPSSDLASK.L
1.48E-08 4.80 0.45 1657.2 1988.89038 3 R.DAQDVQASQAEADQQQTR.L
3.89E-11 5.40 0.63 3588.6 1988.89038 2 R.DAQDVQASQAEADQQQTR.L
6.59E-04 3.84 0.47 1433.0 1409.67045 2 K.AM#EALATAEQAC*K.E
5.50E-08 4.54 0.53 1459.0 1941.97524 3 R.SKC*EELSGLHGQLQEAR.A
2.75E-04 2.71 0.33 961.3 1218.62122 2 K.ELESQVSGLEK.E
1.87E-05 3.86 0.41 1819.5 1459.80029 2 R.LKELESQVSGLEK.E
7.81E-07 3.67 0.55 1251.2 1390.68091 2 R.DALNQATSQVESK.Q
7.33E-08 4.58 0.68 2520.7 2057.90219 2 R.EAEETQSTLQAEC*DQYR.S
2.93E-07 5.15 0.48 2312.5 2327.08736 3 K.LREAEETQSTLQAEC*DQYR.S
8.37E-07 4.65 0.55 2474.9 1612.75770 2 K.LTAEFEEAQTSAC*R.L
2.43E-09 4.56 0.57 1313.1 1776.84973 2 R.TAGPLESSETEEASQLK.E
9.50E-04 3.04 0.35 1315.0 1290.59607 2 K.SVEEEEQVWR.A
1.00E-07 4.33 0.53 1859.2 2912.39087 3 K.SHVEDGDIAGAPASSPEAPPAEQDPVQLK.T
4.31E-04 3.65 0.11 1296.5 1811.92456 2 R.TLQEQLENGPNTQLAR.L
1.23E-05 4.86 0.58 1637.5 1543.83264 2 R.QLLLESQSQLDAAK.S
8.05E-04 3.28 0.52 921.6 1257.67969 2 R.SIEALLEAGQAR.D
1.62E-06 4.40 0.59 1406.3 2090.00366 2 K.TQLEWTEAILEDEQTQR.Q

4-7  RUVBL2  5730023  3.12E-11  140.36  5.40  51124.6  48.38
    7.59E-04  2.51  0.27  1037.0  1387.75403  2  R.KGTEVQVDDIKR.V
    3.10E-05  4.61  0.23  1533.0  1387.75403  3  R.KGTEVQVDDIKR.V
    8.78E-05  3.13  0.50  635.4   1332.66884  2  R.QASQGM#VGQLAAR.R
    4.46E-05  3.24  0.57  1307.4  1111.64697  2  R.AVLIAGQPGTGK.T
    7.71E-05  4.33  0.48  1708.6  1401.75842  2  K.DKVQAGDVITIDK.A
    4.42E-06  3.94  0.38  1184.9  1694.92114  2  R.LLIVSTTPYSEKDTK.Q
    5.02E-06  3.83  0.50  792.7   1947.02942  3  K.EVVHTVSLHEIDVINSR.T
    6.27E-05  3.62  0.53  1338.2  1517.70395  2  K.TTEM#ETIYDLGTK.M
    2.11E-06  5.30  0.53  1473.0  1763.86660  2  R.ALESDM#APVLIM#ATNR.G
    6.70E-04  4.60  0.42  1148.7  2461.11627  3  R.IRC*EEEDVEM#SEDAYTVLTR.I
    3.12E-11  7.17  0.63  1568.0  2941.53638  3  R.IKEETEIIEGEVVEIQIDRPATGTGSK.V
    2.85E-05  4.62  0.50  1003.3  1868.95007  2  R.GTSYQSPHGIPIDLLDR.L
    1.19E-06  4.67  0.54  2122.8  1578.89776  2  R.YAIQLITAASLVC*R.K
    1.77E-06  4.24  0.60  1259.4  2253.98561  2  K.EYQDAFLFNELKGETM#DTS
    1.49E-07  4.57  0.57  2596.9  1683.85889  2  R.TQGFLALFSGDTGEIK.S

4-7  SELENBP1  16306550  4.21E-06  80.25  5.90  52357.7  21.61
    2.78E-04  2.68  0.54  900.0   1167.67322  2  R.HEIVQTLSLK.D
    4.61E-06  3.67  0.40  2123.6  1233.64734  2  R.IYVVDVGSEPR.A
    4.21E-06  3.92  0.54  776.1   1529.79919  2  R.VAGGPQM#IQLSLDGK.R
    3.83E-05  5.06  0.48  1212.5  1905.90759  2  R.NTGTEAPDYLATVDVDPK.S
    3.91E-05  3.94  0.47  1309.7  1332.78857  2  R.LTGQLFLGGSIVK.G
    6.58E-05  2.57  0.37  782.0   1084.67249  2  K.LVLPSLISSR.I


    9.28E-04  3.47  0.45  905.4   1454.76535  2  R.EEIVYLPC*IYR.N
    3.46E-04  4.14  0.47  1532.9  1510.77881  2  K.GGFVLLDGETFEVK.G

4-7  SEPT2  23274163  1.36E-08  40.24  6.40  42658.0  15.63
    1.91E-04  3.58  0.34  1428.4  1352.70557  2  R.ILDEIEEHNIK.I
    9.75E-07  4.34  0.59  2318.4  1603.81738  2  R.TVQIEASTVEIEER.G
    1.36E-08  4.85  0.64  1495.3  1750.87340  2  R.LTVVDTPGYGDAINC*R.D
    1.88E-05  3.85  0.54  1178.0  1759.95886  2  K.ASIPFSVVGSNQLIEAK.G

4-7  SERPINA1  50363219  7.46E-10  100.28  5.27  46707.1  39.71
    3.81E-09  4.95  0.48  1312.2  2186.04004  3  K.LYHSEAFTVNFGDTEEAKK.Q
    2.94E-08  5.18  0.57  1886.4  1891.85559  2  K.DTEEEDFHVDQVTTVK.V
    7.35E-05  3.91  0.40  1076.2  1803.95996  3  K.LQHLENELTHDIITK.F
    7.41E-09  4.22  0.45  1625.8  1803.95996  2  K.LQHLENELTHDIITK.F
    5.49E-05  3.10  0.42  1170.7  1110.60413  2  K.LSITGTYDLK.S
    9.32E-09  5.66  0.54  1705.4  1833.92285  2  K.VFSNGADLSGVTEEAPLK.L
    6.07E-06  3.87  0.25  1957.3  1871.97240  3  K.FNKPFVFLM#IEQNTK.S
    7.46E-10  5.19  0.64  1837.7  2574.34106  2  R.TLNQPDSQLQLTTGNGLFLSEGLK.L
    2.56E-05  4.11  0.60  749.4   2291.12975  2  K.GTEAAGAM#FLEAIPM#SIPPEVK.F
    6.74E-08  4.33  0.52  894.9   1641.86353  2  K.ITPNLAEFAFSLYR.Q
    4.71E-06  4.21  0.59  1847.7  1576.84094  2  R.DTVFALVNYIFFK.G

4-7  SERPINB1  13489087  5.55E-06  30.22  5.90  42714.8  10.29
    5.55E-06  3.59  0.56  835.3   1602.76587  3  K.TFHFNTVEEVHSR.F
    6.46E-06  4.42  0.40  945.1   1785.90173  3  R.FKLEESYTLNSDLAR.L
    2.14E-05  4.40  0.51  1772.9  1785.90173  2  R.FKLEESYTLNSDLAR.L
    6.61E-06  3.19  0.49  841.2   1207.63171  2  R.LGVQDLFNSSK.A

4-7  SERPINB6  12655087  3.12E-07  30.25  5.00  42562.1  9.84
    8.37E-05  2.82  0.40  1206.6  1239.64673  2  R.TVEKELTYEK.F
    3.12E-07  5.01  0.59  2080.5  1609.80025  2  K.GNTAAQM#AQILSFNK.S
    8.27E-05  2.86  0.37  685.7   1311.62491  2  R.NLGM#TDAFELGK.A

4-7  SYNCRIP  33874520  3.30E-05  50.19  5.81  46714.0  15.35
    2.68E-04  3.26  0.44  933.2   1311.65393  2  R.TGYTLDVTTGQR.K
    3.74E-05  3.62  0.40  1077.1  1351.71033  2  K.TKEQILEEFSK.V
    5.41E-04  3.13  0.25  1389.5  1292.63369  2  R.LM#M#DPLTGLNR.G
    2.89E-04  3.45  0.48  351.6   1942.97571  3  K.VTEGLTDVILYHQPDDK.K
    3.30E-05  3.86  0.52  1033.0  1593.80469  2  R.DLFEDELVPLFEK.A


4-7  TAGLN  48255905  2.65E-05  30.20  9.24  22596.4  14.43
    1.54E-04  2.66  0.39  720.1   965.49384   2  K.AAEDYGVIK.T
    4.14E-05  3.35  0.44  969.3   2334.07403  3  K.TDM#FQTVDLFEGKDM#AAVQR.T
    2.65E-05  4.10  0.63  1591.5  1546.70937  2  K.TDM#FQTVDLFEGK.D

4-7  VIM  47115317  1.40E-12  190.29  4.90  53547.2  46.57
    1.41E-02  3.56  0.38  1254.2  1216.62805  2  R.RQVDQLTNDK.A
    1.89E-03  2.06  0.34  251.8   1023.51056  2  R.QQYESVAAK.N
    3.62E-04  2.70  0.39  996.8   1088.53308  2  R.QDVDNASLAR.L
    4.52E-07  3.42  0.41  726.8   1836.79944  2  R.DGQVINETSQHHDDLE
    2.38E-05  4.24  0.44  1377.9  1587.79724  2  R.TNEKVELQELNDR.F
    3.64E-06  4.20  0.35  1466.5  1587.79724  3  R.TNEKVELQELNDR.F
    2.58E-04  3.96  0.38  1395.7  1093.52722  2  K.FADLSEAANR.N
    1.09E-03  3.00  0.28  1260.7  1115.56909  2  K.VELQELNDR.F
    2.00E-04  3.48  0.48  1063.0  1270.56198  2  R.LGDLYEEEM#R.E
    1.66E-03  2.36  0.34  264.5   1704.82211  2  R.VEVERDNLAEDIM#R.L
    7.08E-06  2.97  0.34  632.8   1704.82211  3  R.VEVERDNLAEDIM#R.L
    9.30E-05  3.17  0.52  771.0   1668.84387  2  R.ETNLDSLPLVDTHSK.R
    1.71E-03  2.51  0.42  846.2   1309.60596  2  K.NLQEAEEWYK.S
    1.11E-04  4.59  0.33  1065.1  1661.94727  3  R.KVESLQEEIAFLKK.L
    1.70E-02  3.88  0.48  1027.9  1490.78246  2  R.QVQSLTC*EVDALK.G
    3.52E-02  3.28  0.19  1290.1  1311.66130  2  K.M#ALDIEIATYR.K
    5.90E-06  5.47  0.36  2047.1  1533.85229  2  R.KVESLQEEIAFLK.K
    3.57E-06  4.86  0.32  2711.5  1533.85229  3  R.KVESLQEEIAFLK.K
    1.61E-08  4.75  0.63  1500.4  2202.96079  2  R.EM#EENFAVEAANYQDTIGR.L
    2.61E-03  3.90  0.26  1196.0  1169.71399  2  K.ILLAELEQLK.G
    2.66E-07  3.01  0.48  814.6   1570.89514  2  R.ISLPLPNFSSLNLR.E
    1.40E-12  5.82  0.65  2713.1  2126.06519  2  R.LLQDSVDFSLADAINTEFK.N

4-7  YWHAZ  6.54E-08  50.28  7.00  35313.8  24.53
    4.29E-03  2.53  0.54  1195.0  1136.43620  2  R.YDDM#AAC*M#K.S
    6.30E-04  3.31  0.36  1448.9  1279.65283  2  R.YLAEVAAGDDKK.G
    1.97E-06  3.94  0.53  2414.1  1548.71362  2  K.SVTEQGAELSNEER.N
    6.54E-08  5.56  0.63  2328.3  2040.98730  2  -.GIVDQSQQAYQEAFEISK.K
    4.73E-03  2.53  0.15  464.1   1418.75009  2  R.DIC*NDVLSLLEK.F


BIBLIOGRAPHY

1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Murray T et al. Cancer statistics, 2008. CA

Cancer J Clin. 58, 71–96 (2008).

2. Powell SM, Petersen GM, Krush AJ, Booker S, Jen J, Giardiello FM,

Hamilton SR, Vogelstein B, Kinzler KW: Molecular diagnosis of familial

adenomatous polyposis. N Engl J Med. 329, 1982–7 (1993).

3. Markowitz SD, Dawson DM, Willis J, Willson JK: Focus on colon cancer.

Cancer Cell. 1, 233-6 (2002).

4. Kinzler KW, Vogelstein B: Lessons from hereditary colon cancer. Cell. 87,

159-70 (1996).

5. Green RC, Parfrey PS, Woods MO, Younghusband HB: Prediction of Lynch

syndrome in consecutive patients with colorectal cancer. J Natl. Cancer Inst.

101, 331-40 (2009).

6. Sawyers CL: The cancer biomarker problem. Nature. 452, 548-52 (2008).

7. Hornberg JJ, Bruggeman FJ, Westerhoff HV, Lankelma J: Cancer: A Systems

Biology disease. Biosystems. 83, 81-90 (2006).

8. Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat.

Med. 10, 789-99 (2004).

9. Calvert PM, Frucht H: The genetics of colorectal cancer. Ann Intern Med. 137,

603-12 (2002).

10. Sjöblom T, Jones S, Wood LD, et al.: The consensus coding sequences of

human breast and colorectal cancers. Science 314, 268-74 (2006).


11. Saltz LB: Biomarkers in colorectal cancer: added value or just added

expense? Expert Rev Mol Diagn. 8, 231-33 (2008).

12. DeRisi J, Penland L, Brown PO, et al.: Use of a cDNA microarray to analyse

gene expression patterns in human cancer. Nat Genet. 14, 457-60 (1996).

13. Barrett T, Troup DB, Wilhite SE, et al.: NCBI GEO: archive for high-

throughput functional genomic data. Nucleic Acids Res. 37(Database issue),

D885-90 (2009).

14. Sabates-Bellver J, Van der Flier LG, de Palo M, et al.: Transcriptome profile

of human colorectal adenomas. Mol Cancer Res. 5, 1263-75 (2007).

15. Jiang X, Tan J, Li J, Kivimäe S, et al.: DACT3 is an epigenetic regulator of

Wnt/beta-catenin signaling in colorectal cancer and is a therapeutic target of

histone modifications. Cancer Cell. 13, 529-41 (2008).

16. Boyer J, Allen WL, McLean EG, et al.: Pharmacogenomic identification of

novel determinants of response to chemotherapy in colon cancer. Cancer

Res. 66, 2765-77 (2006).

17. Huang DS, Zheng CH: Independent component analysis-based penalized

discriminant method for tumor classification using gene expression data.

Bioinformatics 22, 1855-62 (2006).

18. Wang Y, Klijn JG, Zhang Y, et al.: Gene-expression profiles to predict distant

metastasis of lymph-node-negative primary breast cancer. Lancet 365, 671-9.

(2005).

19. van 't Veer LJ, Dai H, van de Vijver MJ, et al.: Gene expression profiling

predicts clinical outcome of breast cancer. Nature 415, 530-6. (2002).


20. Tsafrir D, Bacolod M, Selvanayagam Z, et al.: Relationship of gene

expression and chromosomal abnormalities in colorectal cancer. Cancer Res.

66, 2129-37 (2006).

21. Chen WD, Han ZJ, Skoletsky J, et al.: Detection in fecal DNA of colon

cancer-specific methylation of the nonexpressed vimentin gene. J Natl

Cancer Inst. 97, 1124-32 (2005).

22. Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell 136,

215-33. (2009)

23. Cummins JM, He Y, Leary RJ, et al.: The colorectal microRNAome. Proc Natl

Acad Sci U S A. 103, 3687-92 (2006).

24. Strimpakos AS, Syrigos KN, Saif MW: Pharmacogenetics and biomarkers in

colorectal cancer. The Pharmacogenomics Journal 9, 147-60. (2009).

25. Turner N, Tutt A, Ashworth A: Hallmarks of 'BRCAness' in sporadic cancers.

Nat Rev Cancer. 4, 814-19 (2004).

26. Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression

data with protein-protein interactions. Genome Res. 12, 37-46 (2002).

27. Cox J, Mann M: MaxQuant enables high peptide identification rates,

individualized p.p.b.-range mass accuracies and proteome-wide protein

quantification. Nat Biotechnol. 26, 1367-72 (2008).

28. Service RF: Proteomics ponders prime time. Science 321, 1758-61. (2008).

29. Kaaks R, Toniolo P, Akhmedkhanov A, et al.: Serum C-peptide, insulin-like

growth factor (IGF)-I, IGF-binding proteins, and colorectal cancer risk in

women. J Natl Cancer Inst. 92, 1592-600 (2000).


30. Ward DG, Suggett N, Cheng Y, et al.: Identification of serum biomarkers for

colon cancer by proteomic analysis. Br J Cancer. 94, 898-905 (2006).

31. Habermann JK, Paulsen U, Roblick UJ, et al.: Stage-specific alterations of

the genome, transcriptome, and proteome during colorectal carcinogenesis.

Genes Chromosomes Cancer 46, 10-26 (2007).

32. Chen YD, Zheng S, Yu JK, et al.: Artificial neural networks analysis of

surface-enhanced laser desorption/ionization mass spectra of serum protein

pattern distinguishes colorectal cancer from healthy population. Clin Cancer

Res. 10, 8380-5 (2004).

33. Aggarwal K, Choe LH, Lee KH: Shotgun proteomics using the iTRAQ isobaric

tags. Brief Funct Genomic Proteomic. 5, 112-20 (2006).

34. Asara JM, Christofk HR, Freimark LM, et al.: A label-free quantification

method by MS/MS TIC compared to SILAC and spectral counting in a

proteomics screen. Proteomics 8, 994-9 (2008).

35. Gafken PR, Lampe PD: Methodologies for characterizing phosphoproteins by

mass spectrometry. Cell Commun Adhes. 13, 249-62 (2006).

36. Hanash SM, Pitteri SJ, Faca VM: Mining the plasma proteome for cancer

biomarkers. Nature 452, 571-9 (2008).

37. Friedman DB, Hill S, Keller JW, et al.: Proteome analysis of human colon

cancer by two-dimensional difference gel electrophoresis and mass

spectrometry. Proteomics 4, 793-811 (2004).


38. Alfonso P, Núñez A, Madoz-Gurpide J, et al.: Proteomic expression analysis

of colorectal cancer by two-dimensional differential gel electrophoresis.

Proteomics 5, 2602-11 (2005).

39. Mazzanti R, Solazzo M, Fantappié O, et al.: Differential expression

proteomics of human colon cancer. Am J Physiol Gastrointest Liver Physiol.

290, 1329-38 (2006).

40. Ong SE, Foster LJ, Mann M: Mass spectrometric-based approaches in

quantitative proteomics. Methods 29, 124-30 (2003).

41. Kim JE, Tannenbaum SR, White FM: Global phosphoproteome of HT-29

human colon adenocarcinoma cells. J Proteome Res. 4, 1339-46 (2005).

42. Volmer MW, Stühler K, Zapatka M, et al.: Differential proteome analysis of

conditioned media to detect Smad4 regulated secreted biomarkers in colon

cancer. Proteomics 5, 2587-601 (2005).

43. Yu J, Shannon WD, Watson MA, et al.: Gene expression profiling of the

irinotecan pathway in colorectal cancer. Clin Cancer Res. 11, 2053-62 (2005).

44. McMurray HR, Sampson ER, Compitello G, et al.: Synergistic response to

oncogenic mutations defines gene class critical to cancer phenotype. Nature

453, 1112-16 (2008).

45. Boone C, Bussey H, Andrews BJ: Exploring genetic interactions and networks

with yeast. Nat Rev Genet. 8, 437-49 (2007).

46. De Las Rivas J, de Luis A: Interactome data and databases: different types of

protein interaction. Comp Funct Genomics. 5, 173-78 (2004).


47. Mathivanan S, Periaswamy B, Gandhi TK, et al.: An evaluation of human

protein-protein interaction data in the public domain. BMC Bioinformatics.

7 Suppl 5, S19 (2006).

48. Chuang HY, Lee E, Liu YT, et al.: Network-based classification of breast

cancer metastasis. Mol Syst Biol. 3,140-50. (2007).

49. Jonsson PF, Bates PA: Global topological features of cancer proteins in the

human interactome. Bioinformatics. 22, 2291-7 (2006).

50. Segal E, Friedman N, Koller D, et al.: A module map showing conditional

activity of expression modules in cancer. Nat Genet. 36, 1090-8 (2004).

51. Vidal M.: Interactome modeling. FEBS Lett. 579, 1834-8 (2005).

52. Edelman EJ, Guinney J, Chi JT, Febbo PG, Mukherjee S.: Modeling cancer

progression via pathway dependencies. PLoS Comput Biol. 4, e28 (2008).

53. Auffray C.: Protein subnetwork markers improve prediction of cancer

outcome. Mol Syst Biol. 3, 141-2 (2007).

54. Shoemaker BA, Panchenko AR.: Deciphering Protein–Protein Interactions.

Part I. Experimental Techniques and Databases. PLoS Comput Biol. 3, e42

(2007).

55. Ries LAG, Melbert D, Krapcho M, Mariotto A, Miller BA, Feuer EJ, Clegg L, Horner

MJ, Howlader N, Eisner MP, Reichman M, Edwards BK (eds). (2007) SEER

Cancer Statistics Review, 1975-2004, National Cancer Institute. Bethesda,

MD, http://seer.cancer.gov/csr/1975_2004/, based on November 2006 SEER

data submission, posted to the SEER web site.


56. Bi X, Lin Q, Foo TW, Joshi S, You T, Shen HM, Ong CN, Cheah PY, Eu KW,

Hew CL.: Proteomic analysis of colorectal cancer reveals alterations in

metabolic pathways. Mol Cell Proteomics 5, 1119-30 (2006).

57. Rahman-Roblick R, Roblick UJ, Hellman U, Conrotto P, Liu T, Becker S,

Hirschberg D, Jörnvall H, Auer G, Wiman KG: p53 targets identified by protein

expression profiling. Proc Natl Acad Sci USA 104, 5401-6 (2007).

58. Tan S, Seow TK, Liang RC, Koh S, Lee CP, Chung MC, Hooi SC.: Proteome

analysis of butyrate-treated human colon cancer cells (HT-29). Int J Cancer

98, 523-531 (2002).

59. Ahmed N, Oliva K, Wang Y, Quinn M, Rice G.: Proteomic profiling of proteins

associated with urokinase plasminogen activator receptor in a colon cancer

cell line using an antisense approach. Proteomics 3, 288-298 (2003).

60. Ekins S, Nikolsky Y, Bugrim A, Kirillov E, Nikolskaya T.: Pathway mapping

tools for analysis of high content data. Methods Mol Biol. 356, 319-50 (2006).

61. Marouga R, David S, Hawkins E.: The development of the DIGE system: 2D

fluorescence difference gel analysis technology. Anal Bioanal Chem, 382,

669-678 (2005).

62. Viswanathan S, Unlü M, Minden JS.: Two-dimensional difference gel

electrophoresis. Nature Protocols. 1, 1351-8 (2006).

63. Benjamini Y, Hochberg Y.: Controlling the False Discovery Rate: A Practical

and Powerful Approach to Multiple Testing. J Royal Stat Soc. 57, 289-300

(1995).


64. Williams AC, Smartt H, H-Zadeh AM, Macfarlane M, Paraskeva C, Collard

TJ.: Insulin-like growth factor binding protein 3 (IGFBP-3) potentiates TRAIL-

induced apoptosis of human colorectal carcinoma cells through inhibition of

NF-kappaB. Cell Death Differ. 14, 137-45 (2007).

65. Slattery ML, Samowitz W, Curtin K, Ma KN, Hoffman M, Caan B, Neuhausen

S.: Associations among IRS1, IRS2, IGF1, and IGFBP3 genetic

polymorphisms and colorectal cancer. Cancer Epidemiol Biomarkers Prev.

13, 1206-14 (2004).

66. Stallmach A, von Lampe B, Matthes H, Bornhöft G, Riecken EO.: Diminished

expression of integrin adhesion molecules on human colonic epithelial cells

during the benign to malign tumour transformation. Gut 33, 342-6 (1992).

67. Andreu P, Colnot S, Godard C, Laurent-Puig P, Lamarque D, Kahn A, Perret

C, Romagnolo B.: Identification of the IFITM family as a new molecular

marker in human colorectal tumors. Cancer Res. 66, 1949-55 (2006).

68. Barker N, Hurlstone A, Musisi H, Miles A, Bienz M, Clevers H.: The chromatin

remodeling factor Brg-1 interacts with beta-catenin to promote target gene

activation. EMBO J. 20, 4935-43 (2001).

69. Kitadai Y, Sasaki T, Kuwai T, Nakamura T, Bucana CD, Hamilton SR, Fidler

IJ.: Expression of activated platelet-derived growth factor receptor in stromal

cells of human colon carcinomas is associated with metastatic potential. Int J

Cancer. 119, 2567-74 (2006).


70. Izeradjene K, Douglas L, Delaney A, Houghton JA.: Casein kinase II (CK2)

enhances death-inducing signaling complex (DISC) activity in TRAIL-induced

apoptosis in human colon carcinoma cell lines. Oncogene 24, 2050-8 (2005).

71. Tapia JC, Torres VA, Rodriguez DA, Leyton L, Quest AF.: Casein kinase 2

(CK2) increases survivin expression via enhanced beta-catenin-T cell

factor/lymphoid enhancer binding factor dependent transcription. Proc Natl

Acad Sci U S A. 103, 15079-84 (2006).

72. Mounier CM, Wendum D, Greenspan E, Fléjou JF, Rosenberg DW, Lambeau

G.: Distinct expression pattern of the full set of secreted phospholipases A2 in

human colorectal adenocarcinomas: sPLA2-III as a biomarker candidate. Br J

Cancer 98, 587-95 (2008).

73. Watari A, Takaki K, Higashiyama S, Li Y, Satomi Y, Takao T, Tanemura A,

Yamaguchi Y, Katayama I, Shimakage M, Miyashiro I, Takami K, Kodama K,

Yutsudo M.: Suppression of tumorigenicity, but not anchorage independence,

of human cancer cells by new candidate tumor suppressor gene CapG.

Oncogene 25, 7373-80 (2006).

74. Holtrich U, Wolf G, Bräuninger A, Karn T, Böhme B, Rübsamen-Waigmann H,

Strebhardt K.: Induction and down-regulation of PLK, a human

serine/threonine kinase expressed in proliferating cells and tumors. Proc Natl

Acad Sci U S A. 91, 1736-40 (1994).

75. Takahashi T, Sano B, Nagata T, Kato H, Sugiyama Y, Kunieda K, Kimura M,

Okano Y, Saji S.: Polo-like kinase 1 (PLK1) is overexpressed in primary

colorectal cancers. Cancer Sci. 94, 148-52 (2003).


76. Weichert W, Kristiansen G, Schmidt M, Gekeler V, Noske A, Niesporek S,

Dietel M, Denkert C.: Polo-like kinase 1 expression is a prognostic factor in

human colon cancer. World J Gastroenterol. 11, 5644-50 (2005).

77. Schmidt M, Hofmann HP, Sanders K, Sczakiel G, Beckers TL, Gekeler V.:

Molecular alterations after Polo-like kinase 1 mRNA suppression versus

pharmacologic inhibition in cancer cells. Mol Cancer Ther. 5, 809-17 (2006).

78. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, et al.: Patterns

of somatic mutation in human cancer genomes. Nature 446, 153-8 (2007).

79. Chen Y, Choong LY, Lin Q, Philp R, Wong CH, Ang BK, Tan YL, Loh MC,

Hew CL, Shah N, Druker BJ, Chong PK, Lim YP.: Differential expression of

novel tyrosine kinase substrates during breast cancer development. Mol Cell

Proteomics, 6, 2072-87 (2007).

80. Vogelstein B, Kinzler KW.: Cancer genes and the pathways they control. Nat

Med. 10, 789-99 (2004).

81. Zou TT, Selaru FM, Xu Y, Shustova V, Yin J, Mori Y, Shibata D, Sato F,

Wang S, Olaru A, Deacu E, Liu TC, Abraham JM, Meltzer SJ.: Application of

cDNA microarrays to generate a molecular taxonomy capable of

distinguishing between colon cancer and normal colon. Oncogene 21, 4855-

62 (2002).

82. Williams NS, Gaynor RB, Scoggin S, Verma U, Gokaslan T, Simmang C,

Fleming J, Tavana D, Frenkel E, Becerra C.: Identification and validation of

genes involved in the pathogenesis of colorectal cancer using cDNA

microarrays and RNA interference. Clin Cancer Res. 9, 931-46 (2003).


83. Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, et al.: The genomic

landscapes of human breast and colorectal cancers. Science, 318, 1108-13

(2007).

84. Stumpf MP, Thorne T, de Silva E, Stewart R, An HJ, Lappe M, Wiuf C.:

Estimating the size of the human interactome. Proc Natl Acad Sci U S A. 105,

6959-64 (2008).

85. Bader JS: Greedily building protein networks with confidence. Bioinformatics

19, 1869-74 (2003).

86. Ideker T, Sharan R: Protein networks in disease. Genome Res. 18, 644-52

(2008).

87. Goehler H, Lalowski M, Stelzl U, Waelter S, Stroedicke M, et al: A protein

interaction network links GIT1, an enhancer of huntingtin aggregation, to

Huntington’s disease. Mol. Cell 15, 853–865 (2004).

88. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker, HV, et al.: A network

based analysis of systemic inflammation in humans. Nature 437, 1032–1037

(2005).

89. Pujana MA, Han JD, Starita LM, Stevens KN, Tewari M, et al.: Network

modeling links breast cancer susceptibility and centrosome dysfunction. Nat.

Genet. 39, 1338–1349 (2007).

90. de Godoy LMF, Olsen JV, Cox J, Nielsen ML, Hubner NC, Fröhlich F, Walther

TC, and Mann, M: Comprehensive mass-spectrometry-based proteome

quantification of haploid versus diploid yeast. Nature, 455, 1251-1254 (2008).


91. Chang J, Chance MR, Nicholas C, Ahmed N, Guilmeau S, Flandez M, Wang

D, Nasser S, and Albanese JM, (2008) Proteomic changes during intestinal

cell maturation in vivo. J Proteomics, 71(5):530–546.

92. Joyce AR, Palsson B: The model organism as a system: integrating ‘omics’

data sets. Nature Reviews Molecular Cell Biology, 7(3), 198–210 (2006).

93. Köhler S, Bauer S, Horn D, Robinson PN: Walking the Interactome for

Prioritization of Candidate Disease Genes. Am J Hum Genet., 82(4):949-58

(2008).

94. Chen J, Aronow B, Jegga A.: Disease candidate gene identification and

prioritization using protein interaction networks. BMC Bioinformatics , 10 (1),

73+ (2009).

95. Vanunu, O and Sharan R: A propagation based algorithm for inferring gene-

disease associations, Proceedings of German Conference on Bioinformatics,

54-62 (2009).

96. Nibbe RK, Markowitz S, Myeroff L, Ewing R, Chance MR: Discovery and

scoring of protein interaction sub-networks discriminative of late stage human

colon cancer. Mol. Cell. Prot. 8, 827-45 (2009).

97. Liu X, Lin CY, Lei M, Yan S, Zhou T, et al.: CCT chaperonin complex is

required for the biogenesis of functional Plk1. Mol Cell Biol. 25, 4993-5010

(2005).

98. Coghlin C, Carpenter B, Dundas SR, Lawrie LC, Telfer C, et al.:

Characterization and over-expression of chaperonin t-complex proteins in

colorectal cancer. J Pathol. 210, 351-7 (2006).


99. Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome.

Nat Genet, 37:S31-S37 (2005).

100. Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and

signalling circuits in molecular interaction networks. In Bioinformatics Suppl.

on ISMB, 18, S233–S240 (2002).

101. Guo, Z., Li, Y., Gong, X., Yao, C., Ma, W., Wang, D., Li, Y., Zhu, J., Zhang,

M., Yang, D., Wang, J.: Edge based scoring and searching method for

identifying condition-responsive protein–protein interaction sub-network.

Bioinformatics, 23(16), 2121–8 (2007).

102. Nacu, Ş., Critchley-Thorne, R., Lee, P., and Holmes, S.: Gene

expression network analysis and applications to immunology. Bioinformatics,

23(7), 850–858 (2007).

103. Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S.:

Network-based analysis of affected biological processes in type 2 diabetes

models. PLoS Genetics, 3, e96 (2007).

104. Watkinson J, Wang X, Zheng T, Anastassiou D.: Identification of gene

interactions associated with disease from gene expression data using

synergy networks. BMC Systems Biology, 2, 10 (2008).

105. Lee, I., Date, S.V., Adai, A.T., Marcotte, E.M.: A probabilistic functional

network of yeast genes. Science 306, 1555–1558 (2004).

106. Jansen, R., Greenbaum, D., Gerstein, M.: Relating whole-genome

expression data with protein-protein interactions. Genome Res., 12, 37–46

(2002).


107. Sharan R, Ulitsky I, Shamir R.: Network-based prediction of protein

function. Mol Sys Bio, 3, 88 (2007).

108. Pandey, J., Koyutürk, M., Subramaniam, S., & Grama, A.: Functional

coherence in domain interaction networks. Bioinformatics , 24 (16), i28-34

(2008).

109. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry,

J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill,

D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson,

J. E., Ringwald, M., Rubin, G. M., & Sherlock, G.: Gene Ontology: tool for the

unification of biology. The Gene Ontology Consortium. Nature Genetics, 25(1),

25-29 (2000).

110. Kelley, R. and Ideker, T.: Systematic interpretation of genetic interactions

using protein networks. Nature Biotechnology, 23(5), 561-566 (2005).

111. Tong, H., Faloutsos, C., and Pan, J.-Y.: Random walk with restart: fast

solutions and applications. Knowledge and Information Systems, 14(3):327-

346 (2008).

112. Brin, S. and Page, L.: The anatomy of a large-scale hypertextual web

search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117

(1998).

113. Tetali, P.: Random walks and the effective resistance of networks, Journal

of Theoretical Probability, 4(1):101-109 (1991).

114. Stojmirović, A. and Yu, Y. K.: Information flow in interaction networks. J

Comput Biol, 14(8), 1115-1143 (2007).


115. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M.: Whole-proteome

prediction of protein function via graph-theoretic analysis of interaction maps.

Bioinformatics , 21 Suppl 1:i302-i310 (2005).

116. de Silva, E., Thorne, T., Ingram, P. J., Agrafioti, I., Swire, J., Wiuf, C., and

Stumpf, M. P. H.: The effects of incomplete protein interaction data on

structural and evolutionary inferences. BMC Biology, 4, 39+ (2006).
