
Constructing and Analyzing Biological Interaction Networks

for Knowledge Discovery

Dissertation

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Duygu Ucar

Graduate Program in Computer Science and Engineering

The Ohio State University

2009

Dissertation Committee:

Srinivasan Parthasarathy, Advisor

Yusu Wang

Umit Catalyurek

© Copyright by

Duygu Ucar

2009

ABSTRACT

Many biological datasets can be effectively modeled as interaction networks where nodes represent biological entities of interest such as , , or complexes and edges mimic associations among them. The study of these biological network structures can provide insight into many biological questions including the functional characterization of genes and products, the characterization of DNA- bindings, and the understanding of regulatory mechanisms. Therefore, the task of constructing biological interaction networks from raw data sets and exploiting information from these networks is critical, but is also fraught with challenges. First, the network structure is not always known a priori; it must be inferred from raw and heterogeneous biological data sources. Second, biological networks are noisy (containing unreliable interactions) and incomplete (missing real interactions), which makes the task of extracting useful information difficult. Third, these networks typically have non-trivial topological properties (e.g., uneven degree distribution, small world) that limit the effectiveness of traditional knowledge discovery algorithms. Fourth, these networks are usually dynamic, and investigation of their dynamics is essential to understand the underlying . In this thesis, we address these issues by presenting a set of computational techniques that we developed to construct and analyze three specific types of biological interaction networks: protein-protein interaction networks, gene co-expression networks, and regulatory networks.

Dedicated to my mother, who gave me the courage for the journey to the PhD. I wish she were here to see the end of the journey.

Acknowledgments

I would like to first thank my advisor Dr. Srinivasan Parthasarathy for his invaluable support and guidance throughout my years at OSU. Starting from my first months at OSU, he supported me greatly and provided a stimulating environment for my academic studies. I am grateful for his openness to new ideas and fields including . I would also like to thank Dr. Yusu Wang, Dr. Ramana Davuluri, and Dr. Umit Catalyurek for serving on my candidacy and defense committees and for providing me with invaluable insights and suggestions. I would also like to thank my collaborators Dr. Hakan Ferhatosmanoglu and Dr. Fatih Altiparmak for their assistance in my work with the gene external similarity.

I would also like to thank and acknowledge my collaborators outside OSU. First, I would like to thank Dr. Christopher Workman for supporting my study at the Technical University of Denmark and for providing a very motivating and friendly environment for me in Denmark. I would like to further thank Dr. Workman and Dr. Andreas Beyer for all their domain expertise and guidance in my work with regulatory networks. I would also like to thank Dr. Rui-Ru Ji for hosting me at Bristol-Myers Squibb as a research intern and for her efforts in our work and in arranging my accommodation and transportation in New Jersey.

I would like to acknowledge the National Science Foundation and Department of Energy for supporting this work in part through the following grants: NSF CAREER Grant IIS-0347662, NSF SGER Grant IIS-0742999, NSF RI CNS-0403342, and DOE DE-FG02-04ER25611. Any opinions, findings, and conclusions expressed in this dissertation are those of the author, her advisor, and collaborators, and do not necessarily reflect the views of the National Science Foundation or the Department of Energy.

I would like to thank my friends and colleagues at the Data Mining Research Lab (DMRL) at OSU. I heartily thank my friend Sitaram Asur for the long hours we spent discussing ideas, writing papers, and sharing our passion for and travel. I also want to thank Shirish Tatikonda for always being so positive and genuine. I want to thank Hui Yang for being very supportive during my first years at OSU. She has been a great mentor and a very caring friend to me. I also want to thank other friends at the DMRL for sharing so many things: Amol Ghoting, Greg Buehrer, Matthew Otey and Matthew Goyder, Sameep Mehta, Venu Satuluri, Xintian Yang, and Ye Wang. It was a great pleasure to be a member of this motivating and friendly group.

I am more thankful than I can express to my family. Their patience and constant support through my studies kept me going. Special thanks are owed to my dear brother Utku Ucar for sharing my apartment, my ideas, and my in Columbus during the last year of my PhD. It was a great pleasure to have him here as a friend and as a caring and loving family member.

My PhD studies would not have come to an end without the constant support of my friends. I am grateful to my dear friend Zulal Fazlioglu Akin for being such a good listener and a supporter. I thank her for her advice and enlightening thoughts in many difficult situations. I also want to thank Yigit and Zulal Akin for sharing so many dinners, conversations, and laughs with me over the years at Crane’s. It will always be remembered as a place filled with joy and friendship. I would like to thank Sahika Vatan Korkmaz for always listening to me and cheering me up. I also want to thank Gokhan Korkmaz for his friendship and for graduating a year ahead of Sahika. The days we spent with Sahika in 2007 were among the best in my life. I also would like to thank my dear friend Hasibe Otter for her sisterly support and for taking such good care of me during her years in Columbus. Hasibe, Thomas, Artun, and Timon always made me feel at home and welcomed me to the joyful Otter family. I would also like to thank my friends Arif and Hulya Cetintas for their friendship and for the many hours we spent together in Columbus. And I would like to thank my friends in Turkey, whom I see in person only once in a while, for being there with love, support, and encouragement.

And last of all, I want to pay my special thanks to my friend and partner Emre Sencer for his continuous support during my studies and for his joyful existence in my life. I am more thankful than I can say to share my life with such a loving and caring person and to have my share of his colorful stories, his appetite for ethnic food, his history lessons, and his passion for travel. I had the privilege to learn a lot from him in a very broad range of topics, though none of those are relevant enough to be included in this dissertation.

VITA

July 27, 1980 ...... Born – Corum, Turkey

May 2003 ...... B.S. Computer Engineering, Bilkent University, Turkey

September 2007 ...... M.Sc Computer Science & Engineering, The Ohio State University

2003 - 2009 ...... Graduate Teaching/Research Associate, The Ohio State University

June 2007 - Sept 2007 ...... Research Intern, Bristol-Myers Squibb

January 2008 - June 2008 ...... Guest Researcher, Technical University of Denmark

PUBLICATIONS

Research Publications

Duygu Ucar, Fatih Altiparmak, Hakan Ferhatosmanoglu, and Srinivasan Parthasarathy. Mutual Information Based Extrinsic Similarity for Microarray Analysis. International Conference on Bioinformatics and Computational Biology, BiCOB 2009.

Duygu Ucar, Andreas Beyer, Christopher T. Workman, Srinivasan Parthasarathy. Predicting functionality of protein-DNA interactions by integrating diverse evidence. Bioinformatics Volume 25:12, pages 137-144, 2009.

Duygu Ucar, Andreas Beyer, Christopher T. Workman, Srinivasan Parthasarathy. Predicting functionality of protein-DNA interactions by integrating diverse evidence. In the Proceedings of the 17th Annual International Conference on Intelligent Systems for Molecular Biology, ISMB, 2009.

Duygu Ucar, Isaac Neuhaus, Petra Ross-MacDonald, Charles Tilford, Srinivasan Parthasarathy, Nathan Siemers, and Rui-Ru Ji. Construction of a Reference Gene Association Network from Multiple Profiling Data: Application to Data Analysis. Bioinformatics Volume 23:20, pages 2716-2724, August 2007.

Sitaram Asur, Duygu Ucar, Srinivasan Parthasarathy. An Ensemble Framework for Clustering Protein-Protein Interaction Networks. Bioinformatics Volume 23:13, pages 29-40, July 2007.

Sitaram Asur, Duygu Ucar, Srinivasan Parthasarathy. An Ensemble Framework for Clustering Protein-Protein Interaction Networks. In the Proceedings of the 15th Annual International Conference on Intelligent Systems for Molecular Biology, ISMB, 2007.

Hui Yang, Srinivasan Parthasarathy, Duygu Ucar. A spatio-temporal mining approach towards summarizing and analyzing protein folding trajectories. Algorithms for Molecular Biology, Volume 2:3, April 2007.

Duygu Ucar, Fatih Altiparmak, Hakan Ferhatosmanoglu, and Srinivasan Parthasarathy. Investigating the use of Extrinsic Similarity Measures for Microarray Analysis. In the BioKDD workshop held at the 13th ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2007.

Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. An Event-based Framework for Characterizing the Evolutionary Behavior of Interaction Graphs. The 13th International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2007.

Duygu Ucar, Sitaram Asur, Umit Catalyurek, and Srinivasan Parthasarathy. Functional Modularity in Protein-Protein Interactions Graphs Using Hub-induced Subgraphs. The 17th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD, 2006.

Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. An Ensemble Approach for Clustering Scale-Free Graph. The LinkKDD workshop held at the 12th ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2006.

Hui Yang, Srinivasan Parthasarathy, and Duygu Ucar. Protein Folding Trajectories Analysis: Summarization, Folding Events Detection and Common Partial Folding Pathway Identification. The BioKDD workshop held at the 12th ACM International Conference on Knowledge Discovery and Data Mining, SIGKDD, 2006.

Duygu Ucar, Srinivasan Parthasarathy, Sitaram Asur and Chao Wang. Effective Pre-processing Strategies for Functional Clustering of a Protein-Protein Interactions Network. The 5th IEEE Symposium on Bioinformatics and Bioengineering, BIBE, 2005.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Studies in Data Mining: Prof. Srinivasan Parthasarathy

TABLE OF CONTENTS

Abstract
Dedication
Vita
List of Figures
List of Tables

Chapters:

1. Introduction
   1.1 Challenges in analyzing biological interaction networks
   1.2 Proposed Framework
   1.3 Contributions
   1.4 Organization

2. Background and Related Work
   2.1 Background
       2.1.1 The Analysis of PPI Networks
       2.1.2 The Analysis of Gene Expression Networks
       2.1.3 The Analysis of Regulatory Networks
       2.1.4 Clustering and Graph Partitioning Algorithms

3. PPI Networks
   3.1 Dataset
   3.2 Network Purification Algorithms
       3.2.1 Topological Metrics
       3.2.2 Line Graph Transformation
       3.2.3 Clustering
       3.2.4 Validation
       3.2.5 Experiments
       3.2.6 Discussion
   3.3 Network Refinement Algorithm based on Hub Duplications
       3.3.1 Evolutionary Implications
       3.3.2 Hub-induced Subgraphs
       3.3.3 Hub Duplication
       3.3.4 Clustering
       3.3.5 Validation Measures
       3.3.6 Experiments
       3.3.7 Discussion

4. Gene Co-expression Networks
   4.1 An Extrinsic Method to Infer Gene Similarity
       4.1.1 Datasets and Pre-processing
       4.1.2 Similarity Measures
       4.1.3 Experiments
       4.1.4 Discussion
   4.2 Hypothesis Testing using Gene Modules
       4.2.1 Methods
       4.2.2 Experiments
       4.2.3 Discussion

5. Gene Regulatory Networks
   5.1 Datasets
       5.1.1 ChIP-chip
       5.1.2 Gene Expression Data
       5.1.3 TFBS Data
       5.1.4 Nucleosome Occupancy Data
       5.1.5 Training and Validation Data
   5.2 Methodology
       5.2.1 Probability of Promoter Binding
       5.2.2 Probability of Transcriptional Response
       5.2.3 Characterization of Binding Events
       5.2.4 Determining Factors that Explain Functional Binding
   5.3 Results and Discussion
       5.3.1 Prediction of Condition-Specific Promoter Binding
       5.3.2 Characterization of TFs and their Promoter Binding
       5.3.3 Exploration of Predictive Features for Functional Binding Events
       5.3.4 Identification of Co-factor Hierarchy Networks
       5.3.5 Conclusion

6. Discussion and Future Directions
   6.1 Discussion
   6.2 Limitations
   6.3 Future Directions
       6.3.1 Integrating temporal epigenetic signatures
       6.3.2 Predicting interactions which are yet to be produced
       6.3.3 Analyzing temporality of biological interactions
       6.3.4 Computational

Bibliography

LIST OF FIGURES

1.1 The overview of the proposed workflow

3.1 The degree distribution of DIP Budding Yeast interaction maps

3.2 Clustering Algorithms vs. Random Clustering (a) kMETIS vs. Random (Avg p-value) (b) kMETIS vs. Random (Insignificant Clusters) (c) Hierarchical vs. Random (Avg p-value) (d) Hierarchical vs. Random (Insignificant Clusters)

3.3 Pre-processed with the Clustering Coefficient metric and partitioned with the Hierarchical clustering algorithm (k=500). Evaluated with respect to average p-value, variance, and insignificance scores.

3.4 Pre-processed with the Clustering Coefficient metric and partitioned with the Hierarchical clustering algorithm (k=700). Evaluated with respect to average p-value, variance, and insignificance scores.

3.5 Pre-processed with the Clustering Coefficient metric and partitioned with the kMETIS algorithm (k=120). Evaluated with respect to average p-value, variance, and insignificance scores.

3.6 Pre-processed with the Betweenness Centrality metric and evaluated with respect to average p-value, variance, and insignificance scores.

3.7 Pre-processed with the Closeness Centrality metric and evaluated with respect to average p-value, variance, and insignificance scores.

3.8 The comparison of our edge removal algorithm with random edge elimination.

3.9 Example clusters

3.10 Illustration of the benefits of hub duplications on a toy example. The modularity of the original graph (a) improves after hub duplications (b).

3.11 Modularity scores before (Original) and after (CC: 0.3, CC: 0.4, etc.) refinement

3.12 Clustering scores are improving after hub duplications using two different algorithms: a) kMETIS and b) Spectral graph partitioning.

3.13 P-value distribution of significant clusters before and after the refinement. The y axis represents the log(p-value) for each corresponding cluster.

4.1 The average size of neighborhood lists with respect to different κ thresholds (depicted for the colon cancer dataset).

4.2 Average semantic similarity (SS) is calculated for the top ‘similar’ pairs identified via alternative measures for Colon cancer (top) and Yeast microarray datasets (bottom). The y axis represents the number of top pairs considered in each experiment.

4.3 The p-value distribution of significant clusters extracted from Colon Cancer (top) and Yeast gene networks (bottom). The y axis represents the −log of the enrichment score of each corresponding cluster.

4.4 The schematic view of the RGA construction and hypothesis testing.

4.5 Randomly generated datasets of different distributions are compared with the real data. The x-axis indicates the total number of probe pairs considered starting from the top of the sorted probe pair list and the y-axis indicates the corresponding FDR for that set of probe pairs.

4.6 The x-axis indicates the number of overlapping GO terms and the y-axis is the cumulative probability. The x-axis is truncated at 220 for visualization purposes.

4.7 The distribution of the GO enrichment p-values for sub-networks extracted by different algorithms.

4.8 The box-plot of sub-network sizes generated by selected network partitioning algorithms with k = 500. The algorithms are shown on the x-axis, and the log(sub-network size) values are indicated on the y-axis.

4.9 Example sub-networks. (a) The neuronal development and signaling cluster (b) The cluster

5.1 The average nucleosome occupancy score of the Saccharomyces cerevisiae genome aligned with respect to translation start sites of all studied Saccharomyces cerevisiae Open Reading Frames (ORFs) (7048 ORFs). We also identify the ones that are known to be binding (and functionally binding) to an example TF (ABF1 and FHL1 in this example).

5.2 The overall framework of the proposed methodology. *Though nucleosome occupancy is known to be condition dependent, it is treated as condition independent for this study.

5.3 TFs are clustered with respect to the distance of their motif hit site from the TLS of genes. This clustering could not be explained using the nature of binding (column 1) or the structural classification of TFs (column 2).

5.4 (A) ROC curves for TF-target predictions based on individual (TFBS(T), ChIP-chip(C), and Nucleosome Occupancy(N)) and integrated evidence and (B) area under ROC curve (AUC scores) generated by 5-fold cross validation.

5.5 Context dependent TF-target gene interactions can be induced from our analysis. In this TF-gene interaction network, colors on edges represent under which stress condition each binding takes place. The color coding is as follows: Black (aa starvation), Green (MMS treatment), Red (Heat Shock), Blue (peroxide treatment), Yellow (Raffinose treatment), and Purple (Galactose treatment).

5.6 Functional binding rates for individual TFs by condition. The size of each point indicates the number of associated differential binding events predicted. The intensity of red indicates the significance of a χ² test comparing the FB-rate to the global mean FB-rate (49.2% using 0.4 cut-off value).

5.7 Score threshold vs. uncertainty on the FB-rate estimates.

5.8 Rank Distance values for different cut-off thresholds under different stress conditions aa starvation (blue line), h202 (green line), and H2O2 treatment (red line).

5.9 RD values calculated with respect to the two step apart cut-off thresholds for aa starvation, mms treatment, and peroxide conditions respectively.

5.10 Significant TF-TF co-factor relationships as determined by the multivariate Random Forest method (p < 0.01) under amino acid starvation (AAS) and hydrogen peroxide (H2O2) stress conditions. This hierarchical network view shows TFs (nodes) and co-factor relationships (edges) where direction of dependency is indicated by the arrow. In this representation, X -> Y implies that binding functionality of Y depends on X. The thickness of the edge indicates the significance of the X variable in determining functionality of Y binding. The node color indicates the FB-rate of the TF in that condition: red indicates rates higher than expected while green indicates lower than expected rates.

6.1 The overview of the proposed bipartite learning technique.

LIST OF TABLES

3.1 The first column represents the ontology annotated with the specified cluster. P, F, and C stand for biological process, molecular function, and cellular component, respectively. GO-Term refers to the biological association for the proteins in each cluster. The Cluster Frequency represents the ratio of proteins annotated with the specified ontology term in the given cluster whereas the Genome Frequency column represents the ratio for the whole genome.

4.1 Top 15 scoring sub-networks for the HIV data. Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’, and ‘Cellular Component’ ontologies.

4.2 Top 15 scoring sub-networks for the cigarette smoking data. Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’, and ‘Cellular Component’ ontologies.

5.1 Condition specific binding data

5.2 Condition Dependent Binding Events

5.3 Significant Predictors for determining the functionality of TFs under studied conditions. The values in parentheses (in column 2) correspond to the number of times the corresponding TF has been found to be a significant co-factor at 0.05 p-value cut-off.

CHAPTER 1: Introduction

Advances in data generation and storage technologies have led to an exponential growth in structured datasets available for analysis. The of data is especially significant in the field of bioinformatics with the constant production of data via high-throughput techniques. A major focus here is to make sense of such enormous amounts of biological data within a reasonable time. Given the size and complexity of the data, development of computational solutions is essential to accomplish this task. To that end, the field of data mining has been very useful in helping to organize, analyze, and interpret the accumulated biological data.

Biological data may arise directly out of experimental observations, such as protein complexes derived through mass spectrometry technology or the mRNA levels measured by gene expression microarray studies. These measurements can often be effectively converted into pairwise or groupwise interactions between entities of interest such as genes, proteins, and metabolites. Accumulation of these interactions in the form of networks provides a concise and higher-level picture of the associations among these entities. This representation also enables us to capture higher-order patterns of networks, such as motifs and clusters. Moreover, recent studies have shown that these networks should be considered not only as accumulations of complex interactions, but as the key determinants of the structure, function, and dynamics of a living system [75, 76, 137]. Therefore, the construction of biological networks from raw datasets and their analysis for the discovery of useful information have become an important step towards reaching a higher-level understanding of complex and dynamic living systems. Specific examples of such biological networks abound, but in this work we will focus on three major biological interaction networks¹: Protein-Protein Interaction (PPI) networks, gene co-expression networks, and regulatory networks.

One particular interest of scientists has been the characterization of proteins and their roles in the complicated structure of the living system by studying protein interactions. Pairwise and groupwise protein interactions have been an important source for accomplishing these tasks at an level [23, 55, 72, 118]. A reported interaction between two proteins (or a group of proteins) indicates in vitro detection of a physiochemical interaction via biochemical assays such as yeast two-hybrid (Y2H) or tandem affinity purification [141]. Today, due to the advent of high-throughput proteomics, scientists have monitored many protein-protein interactions of different . The monitored protein interactions of an organism can be naturally represented in the form of an interaction network where the nodes refer to proteins and the edges between these nodes represent the detected interactions [54]. A specific interest here is to identify dense subgraphs of PPI networks which, in theory, correspond to functional protein modules. These modules have then been employed for inferring the functional behavior of unknown proteins through the ‘guilt by association’ principle [112].

¹ Throughout this dissertation, ‘we’ is used to refer to the author instead of ‘I’. It is a matter of style that reflects the writing style preference of the computer science . Unless otherwise noted, whenever ‘we’ is used, the author is referring to her own work or the work done jointly with her advisor.

Another major research area is the study of gene interactions by inferring the similarity of gene expression profiles through microarray data. Gene profiles provide a comprehensive overview of gene expression in a given at a certain time. This global screening of gene expression levels has great potential to shed light on cellular functions, pathways, diseases, and drug targets. Advances in microarray technology have enabled profiling expression levels of thousands of genes under diverse experimental conditions. Recently, there has been a growing interest in representing gene interactions deduced from microarray studies in the form of interaction networks to effectively analyze them and to mine them for useful information [27, 42]. In such networks, nodes represent genes, and two nodes are connected if they have significantly similar expression patterns over different conditions. Analyzing these networks has been shown to be effective in revealing the molecular and biochemical processes that sustain the physiological state of the cell [140] through the identification of their densely linked subnetworks [46]. These networks can also be studied to track and compare interactions between different gene groups across different conditions [13].
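As a rough illustration of the co-expression network idea described above (genes as nodes, an edge between two genes whose expression patterns are similar across conditions), the following minimal sketch may help. It rests on assumptions that are not taken from this dissertation: similarity is measured with plain Pearson correlation, "significantly similar" is approximated by a fixed threshold, and the gene names, expression values, and function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (norm_x * norm_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with r >= threshold.

    The fixed threshold is an assumed stand-in for a significance test.
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression levels of three genes over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [3.0, 1.0, 4.0, 2.0],
}

for g1, g2, r in coexpression_edges(profiles):
    print(f"{g1} -- {g2} (r = {r:.3f})")
```

In this toy input only geneA and geneB rise together across the conditions, so they form the single edge of the network; geneC stays an isolated node.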

Another major direction of study is the identification of regulatory interactions, which indicate associations between regulatory proteins and promoter regions of genes for the initiation of gene expression. The expression of a gene is the synthesis of a protein from the DNA. In a eukaryotic cell, gene expression is composed of several stages including transcription, translation, and post-translation. Among these, transcription, in which a DNA sequence is transcribed into an mRNA sequence, is recognized as an important stage. The main players in transcriptional regulation are specific proteins called transcription factors (TFs) that bind to promoter regions of DNA sequences and control the access of RNA polymerase to these regions. The regulation of transcription controls the timing and the amount of mRNA synthesis, which then specifies the structure and the function of a cell. Thus, studying these complex regulatory interactions between TFs and their target genes is vital to understand the whole cellular machinery. Today, by using technologies such as ChIP-chip (Chromatin Immuno-Precipitation on microarray chip), it is possible to measure the DNA binding of transcription factors at a genomic scale [124, 154, 159]. These experiments, however, suffer from high levels of noise, leading to the prediction of many false positive and false negative interactions [16, 66]. Supporting evidence from other data sources can be employed to eliminate the noisy inferences of high-throughput sources [16, 150]. Identified regulatory interactions between TFs and genes from predictive sources can be conveniently represented in the form of a regulatory network. In a regulatory network, genes and regulatory proteins form the network nodes. A directed edge from a protein to a gene represents the binding of the corresponding regulatory protein to the promoter region of that gene. The identification of these networks and their study under different living conditions is an important and complex problem that has drawn a lot of recent attention.

By revealing the structure and the functioning of PPI networks, gene co-expression networks, and regulatory networks, and by utilizing these networks for knowledge discovery, we can greatly enhance our understanding of the organization and functioning of biological systems. However, the construction and analysis of these networks is fraught with diverse challenges, which we discuss next in detail.
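The regulatory network representation described above, with a directed edge from a regulatory protein to each gene whose promoter it binds, can be sketched as a small directed graph. This is only an illustrative data structure under stated assumptions: the class name, its methods, and the TF and gene identifiers are hypothetical and are not drawn from the datasets used in this dissertation.

```python
class RegulatoryNetwork:
    """Directed graph with edges TF -> gene (hypothetical helper class)."""

    def __init__(self):
        self._targets = {}     # TF   -> set of genes it binds
        self._regulators = {}  # gene -> set of TFs binding its promoter

    def add_binding(self, tf, gene):
        """Record that `tf` binds the promoter region of `gene`."""
        self._targets.setdefault(tf, set()).add(gene)
        self._regulators.setdefault(gene, set()).add(tf)

    def targets_of(self, tf):
        """Genes reached by an outgoing edge from `tf`."""
        return set(self._targets.get(tf, set()))

    def regulators_of(self, gene):
        """TFs with an edge pointing into `gene`."""
        return set(self._regulators.get(gene, set()))


# Hypothetical binding events.
net = RegulatoryNetwork()
net.add_binding("TF1", "geneA")
net.add_binding("TF1", "geneB")
net.add_binding("TF2", "geneB")

print(sorted(net.targets_of("TF1")))       # genes bound by TF1
print(sorted(net.regulators_of("geneB")))  # TFs binding geneB's promoter
```

Keeping both directions of the mapping makes the two common queries (which genes a TF regulates, and which TFs regulate a gene) equally cheap, at the cost of storing each edge twice.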

1.1 Challenges in analyzing biological interaction networks

One of the major problems with biological data is the presence of technical noise due to many imperfections inherent in the high-throughput detection processes. For example, it is known that differences in sample preparation can be a significant source of noise. The highly erroneous nature of existing high-throughput techniques has produced many spurious associations, which are accumulated in biological interaction networks (i.e., false positives). As an example, it has been hypothesized that almost half of the catalogued protein-protein interactions are false positives with no biological implication. In addition, these analyses miss a large number of known real interactions (i.e., false negatives) [6]. The presence of these inaccuracies and the large fraction of missing interactions in the current interaction databases may produce biologically invalid inferences. Therefore, detecting and mitigating the impact of noise and uncertainty is crucial while analyzing biological interaction networks.

Another challenge originates from the specific topological properties of these networks. Typically a biological interaction network includes a few hub nodes with many links and many nodes with a few connections. This is evident from the uneven degree distribution of these networks [153]. Although the mathematical form of the uneven degree distribution is heavily exploited [86, 121, 153], its impact on the analysis of the interaction network is very much an open question. This topological property makes the task of applying traditional data mining algorithms for the discovery of useful information difficult. For example, the existence of hub nodes in PPI networks complicates the discovery of functional protein modules since a typical algorithm can only identify a few huge partitions including hubs and their neighbors and many singleton clusters [10]. None of these partitions can be mapped to functional protein groups. Therefore, it is essential to take into consideration the topological challenges of these network structures while analyzing these networks.

A further challenge to understanding and modeling complicated system-level interactions lies in the diversity of predictive data sources. The cross analysis of such datasets has proved crucial for discovering broad and accurate information about biological relations and mechanisms [77, 168]. As an example, regulation of a gene by a transcription factor can be effectively inferred by integrating diverse and complementary sources such as motif binding sites, nucleosome occupancy data, and ChIP-chip measurements [16]. It is paramount to compile such common and complementary information from existing biological datasets in order to have a comprehensive understanding of regulatory interactions. Therefore, a further challenge is to develop integrative models to put together diverse pieces of information from heterogeneous data sources into a single network structure.

The timing of biological interactions as well as their dependency on environmental stimuli play a key role in determining the biological meaning of these interactions [39, 93]. Biological interactions do not occur continuously and under all living conditions. A recent study of protein complexes along with gene expression time series datasets has shown that some proteins are only periodically expressed whereas others are continuously measured [39]. Models and algorithms that take into consideration the dynamics of biological interaction networks become more important in light of such findings. However, the identification of biological response systems and their timing is still an open problem of .

In addition to the aforementioned challenges, knowledge discovery from raw biological datasets (or measurements) is fraught with its own challenges. After a biological interaction network is constructed, the next challenge is to effectively use this network structure for inferring information. Depending on the underlying dataset as well as the questions that one is trying to answer, techniques from statistics, mathematics, machine learning, and data mining can be employed for this purpose. For example, clustering algorithms can be used to identify the densely interacting regions of a PPI network [149], or regression models can be employed to co-analyze gene expression and DNA-gene binding datasets [56]. Novel and effective applications of existing computational tools and the development of new ones are essential for the effective interpretation and analysis of biological interaction networks.

This dissertation focuses on novel computational techniques that aim to solve the aforementioned challenges associated with the construction and the analysis of three major biological interaction networks. Although here we focus only on PPI networks, gene co-expression networks, and regulatory networks, we believe that the proposed algorithms can be applied to other biological networks as well as to networks from different disciplines ranging from sociology to the World Wide Web. Our computational techniques are developed as pieces of a comprehensive framework, which we discuss in the next section. We are now in a position to highlight our thesis statement.

Thesis Statement: Cellular function is often governed by an intricate web of interactions among DNA, RNA, proteins, and other small molecules. Such biological interactions can be naturally abstracted as an interaction network. Our thesis statement is that it is possible to develop noise-resistant, integrative, and topology-aware algorithms for the construction and analysis of such biological interaction networks.

1.2 Proposed Framework

In this dissertation, we seek to address the aforementioned challenges associated with the task of constructing and analyzing biological interaction networks. The objective is to realize a set of algorithms that can be interlinked within a general framework to facilitate the construction and the analysis of diverse biological networks. As shown in Figure 1.1, our framework is composed of three main stages: data pre-processing, network construction, and network analysis. We next describe each of these stages in detail.

Figure 1.1: The overview of the proposed workflow

• Data pre-processing: The first component of our pre-processing stage addresses the question of data acquisition. Raw biological data are accumulated in unstructured databases, either in the form of interaction networks or as raw measurements of the amount of biological entities such as genes, proteins, and mRNAs. Acquisition of these data is the first step towards the construction and the analysis of biological networks. In some cases, the transformation of raw biological sources into more informative values and pairwise associations requires a set of pre-processing steps. As an example, obtaining mRNA measurements from raw images of microarray studies requires image processing, normalization, background correction, and log transformation. Although such analyses are not part of this dissertation, wherever applicable, we employed the most commonly used pre-processing techniques prior to our analysis.

After raw datasets are obtained, the next step is to alleviate the impact of noise. Here, we aim to develop purification techniques to enhance the data quality by reducing spurious edges of a given biological network or noisy measurements. In the case of networks, such as PPI networks, we will rely on topological evidence inherent in these networks to reveal potential false interactions. However, in the case of raw measurements, such as gene expression studies, we will develop noise-tolerant similarity measures and statistical methods to derive reliable gene associations.

We further aim to develop techniques to overcome difficulties associated with the topological features of biological interaction networks, specifically the uneven degree distribution of these networks. For this purpose, we use a refinement technique that is motivated by the biological and topological importance of hub nodes (nodes with high degree). This step aims to refine these datasets to enhance the quality of their interactions and higher-level patterns such as communities.

• Network Construction: In the second step of our framework, we aim to construct biological interaction networks from raw datasets. Here, the integration of diverse datasets - either produced by the same technology or different technologies - poses a major challenge for our analysis. In addition, we also aim to develop noise-tolerant similarity measures to infer pairwise associations between proteins and genes.

In many instances, it is possible to improve the quality of inferences by including common and complementary information that can be retrieved from disparate sources. For instance, in order to predict a regulatory interaction between a gene and a transcription factor, we can employ diverse data sources such as motif binding sites, genome-wide binding locations, nucleosome occupancy, and co-expression of target and gene pairs. By introducing diverse aspects of biological interactions, we will be able to reduce the impact of noise inherent in each of these data sources. In addition, integration can produce a comprehensive and complete model that can give insight into complex high-level interactions. We propose using probabilistic and statistical models in order to integrate raw and disparate data sources for constructing gene co-expression networks and regulatory networks.

In addition to data integration, we also propose noise-tolerant similarity measures that will be used to infer biological networks from raw microarray datasets. For this purpose, we propose two alternative noise-tolerant gene similarity measures: a rank-based measure and an extrinsic similarity based measure. In the rank-based similarity of two genes, instead of using raw gene expression values, we propose to use the similarity ranking of genes to derive gene co-expression networks. In the latter case, we propose to derive the similarity of two genes with respect to the similarity of their co-occurrence patterns in gene neighborhoods.

• Network Analysis: In the third step of our framework, we focus on discovering useful information by reasoning about the individual nodes based on their interactions as well as about the community structures. The main motivation is to use motifs and clusters inherent in these networks to draw useful conclusions about the underlying living system. For example, densely interacting regions of a co-expression network can be a good source of information to characterize unknown genes. For our purposes, we propose to make use of available standard data mining techniques where it makes sense, such as clustering, graph partitioning, and pattern discovery. Clustering and graph partitioning algorithms aim to group together nodes that are similar. In many cases, we propose to develop our own data mining algorithms that may work best for the task at hand. Additionally, it may be possible to improve the quality of inferences by considering evidence from many sources of information. This can be accomplished by using additional data as a filter, or by incorporating additional knowledge into the data mining process.
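The rank-based construction mentioned under Network Construction can be illustrated with a minimal sketch. It assumes, purely for illustration, that genes are ranked by Pearson correlation over a single expression matrix and that two genes are linked only when each appears in the other's top-k list. The function names, the parameter `k`, and the toy profiles are hypothetical and are not taken from this dissertation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def top_k_partners(profiles, k):
    """For each gene, the set of its k most correlated other genes."""
    partners = {}
    for g, x in profiles.items():
        ranked = sorted(
            (h for h in profiles if h != g),
            key=lambda h: pearson(x, profiles[h]),
            reverse=True,
        )
        partners[g] = set(ranked[:k])
    return partners

def reciprocal_rank_edges(profiles, k):
    """Link two genes only if each is in the other's top-k list (assumed rule)."""
    partners = top_k_partners(profiles, k)
    return {
        tuple(sorted((g, h)))
        for g in partners
        for h in partners[g]
        if g in partners[h]
    }

# Hypothetical toy expression profiles (one list of sample values per gene).
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.2, 7.9],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
edges = reciprocal_rank_edges(profiles, k=1)
print(edges)
```

Because only ranks are compared, the rule does not depend on the scale of the raw similarity values, and the mutual requirement discards one-sided associations such as g4 ranking g1 first while g1 ranks g2 first.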

As the final output of our framework, we hope to reach a comprehensive and reliable model of the underlying system that can be used to assist domain experts in a decision-making process. For each stage of our framework we propose novel techniques and, where appropriate, validate them on real-world datasets. While not every technique will be applicable to all types of biological networks we are studying, it is our hope that this general framework can yield important insights into the analysis of diverse biological interaction networks from the genomics and proteomics domains.

1.3 Contributions

While developing novel computational techniques in our work, we also address a number of specific research questions. We categorize our contributions towards answering these questions by framework stage as follows:

I Pre-processing:

• We propose pre-processing strategies based on two topological metrics and a graph transformation to eliminate potential false positive interactions from PPI networks. Our strategies improve the quality of clusters obtained using standard clustering algorithms [152]. Our comparative results indicate that our strategies provide improvements regardless of the clustering algorithm applied.

• We introduce the notion of a hub-induced subgraph. We propose a method to duplicate a hub node by identifying the dense regions extracted from its hub-induced subgraph by employing the Edge Betweenness measure [149].

• We create a key refinement of PPI networks based on hub duplications to improve the modular structure of these networks [149]. We find that the clusters we obtain after refinement match very well with known biological annotations. In addition, we obtain groupings after refinement that could not be obtained from the original graph. Our technique also allows soft clustering of multi-functional proteins, which is shown to be effective by our analysis.

II Network Construction:

• We investigate and demonstrate the efficacy of using extrinsic measures in inferring pairwise gene similarities and subsequently in constructing gene networks. We propose effective extrinsic similarity measures for microarray analysis motivated by the notion of mutual information [147, 148]. Our experimental results show that, using the extrinsic measures, it is possible to identify gene pairs that are biologically more relevant. In addition, we show that association networks generated based on these measures contain more real edges and fewer spurious edges.

• We propose an alternative gene association network construction process based on the reciprocal rank of genes deduced from their co-regulation over independently collected microarray samples. Our methodology eliminates the bias introduced by the diversity of gene expression distributions. In addition, by enforcing a mutual relation, we mitigate noise effects to some extent, and consequently we generate a more reliable gene interaction network.

• We build a comprehensive gene association network via the analysis of a compendium of human and cell line samples from independently conducted projects.

• We integrate diverse and complementary data sources in order to predict regulatory interactions between genes and transcription factors. We propose a Bayesian model to accomplish the integration of diverse sources. We also employ condition-dependent data to predict perturbations in these interactions. Our analysis shows that data integration is essential for this task and that our integrated model is more informative than models derived from individual data.

III Network Analysis:

• We introduce the notion of mutual independence of two genes based on their associations with other genes. We also study the biological meaning of the mutual independence of two genes [147].

• We examine the application of well-known clustering/partitioning techniques on purified PPI and gene expression networks. We utilize these extracted clusters to infer the functional annotation of genes and proteins [147, 149, 152].

• We study the use of known biological annotations to validate our experimental results and to infer the signal-to-noise ratio in a gene association network.

• We propose a novel use of densely interacting sub-networks extracted from a gene association network as a reference tool for the analysis of future microarray datasets. We further incorporate these sub-networks into a MANOVA statistical model to test hypotheses at the gene-set level.

• We propose to employ gene expression measurements in order to identify the regulatory implications of an interaction between a transcription factor and the promoter region of a target gene. Using our methodology, we rank regulatory interactions in terms of their regulatory implications, i.e., functionality. We also employ a feature selection algorithm in order to explain some of the biological factors behind the functionality of these interactions.

1.4 Organization

This manuscript is organized as follows. In Chapter 2, we provide background on related concepts and data mining algorithms that we employed for our analysis. In Chapter 3, we present several case studies on PPI networks that highlight the methods we have developed for various stages of our framework. We begin by describing our data cleaning methodology based on topological evidence, and then we introduce our data refinement strategy that aims to improve the quality of protein groupings. We demonstrate the efficacy of both of these techniques with detailed experimental results on the PPI network of the budding yeast organism. In Chapter 4, we introduce our analysis of gene expression datasets in order to construct reliable gene interaction networks and to employ these networks for further discoveries. We begin by introducing our rank-based methodology to derive a comprehensive gene association network from many gene expression studies. We used this methodology to construct a human gene association network. We also show how the sub-clusters of this network can be used for gene-set level hypothesis testing. We run real-world case studies that use our gene-set level model to reveal system-level responses to HIV treatment and cigarette smoking. Next, we introduce extrinsic similarity measures and discuss their applicability to microarray datasets. Our techniques are applied to multiple gene expression datasets of human and budding yeast to derive similar gene pairs and to compile these pairs into gene co-expression networks. In Chapter 5, we present our work on the construction and analysis of regulatory networks. Here, we describe our probabilistic model to integrate ChIP-chip, TFBS, and nucleosome occupancy data to identify TF and target gene interactions. We also propose a methodology to incorporate gene expression data to discriminate regulatory and spurious interactions. We apply our model to Saccharomyces cerevisiae and discuss our detailed experiments for this organism. We conclude in Chapter 6 and detail our plans for future directions.

CHAPTER 2: Background and Related Work

Representing biological relations in the form of a graph enables researchers to mine these structures for various purposes such as the identification of common patterns, the detection of anomalies in graphs, and the discovery of useful subgraph structures. Biological relations that will form the edges of the graph structure can be obtained through high-throughput experiments. For example, mRNA measurements from microarray studies can be used to infer pairwise gene relations that imply co-regulation of two genes over experimental conditions. On the other hand, regulatory relations between DNA binding proteins and genes can also be identified via various experimental technologies including ChIP-chip, ChIP-seq, or DamID [93, 98, 131, 154]. Learning a biological network structure that reflects the real-world relations from raw experimental data is a challenge in itself. However, a further challenge lies in the analysis of these networks for the discovery of useful information. Graph mining methodologies, which we discuss in this chapter, have been particularly useful for this purpose [61, 88, 164, 169]. In this chapter, we also describe the previous work on the analysis and construction of the biological interaction networks that we study.²

² Throughout this dissertation, the terms ‘graph’ and ‘network’ are used interchangeably.

2.1 Background

2.1.1 The Analysis of PPI Networks

Interacting proteins can be detected by conducting small- or large-scale experiments. These interactions can be naturally represented in the form of networks to facilitate the process of knowledge discovery. The number and coverage of public databases that collect experimental data on the physical protein bindings of diverse organisms have been increasing with the advances in high-throughput techniques. Although there is no established standard database of PPIs today, there have been efforts to integrate existing interactions in publicly available databases. One such database, the Human Protein Reference Database (HPRD), includes 34,624 interactions between human proteins that are derived from a number of platforms including mass spectrometry, yeast two-hybrid (Y2H), and co-immunoprecipitation [106]. Similarly, another freely accessible database, BIOGRID, includes more than 238,634 raw interactions from various organisms including Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. These large collections of protein interactions are likewise represented as networks. Of particular interest to many scientists is the study of PPI networks to isolate densely interacting regions, since they are presumed to be functional protein modules. However, the identification of these regions using traditional mining/clustering algorithms has proved to be fraught with challenges due to the high rate of technical noise in these networks and their non-trivial topology. In this section we provide a brief overview of the studies that attempt to handle these two challenges.

The primary property of PPI networks that is detrimental to traditional graph mining is the presence of technical noise. Sprinzak, Sattath, and Margalit estimated the reliability of high-throughput yeast two-hybrid assays to be around 50% for Saccharomyces cerevisiae, based on the database annotations of co-localization and cellular role [138]. In addition to many false positives, there are a large number of known interactions between proteins that are missed by these analyses (i.e., false negatives) [41]. For the same organism, the false negative rate of interactions obtained with the Y2H technology has been estimated to be as high as 70% [41]. An initial attempt to handle problems associated with the low reliability of these networks focused on the intersection of interaction maps from several studies [40]. However, this study generated only a small set of highly confident interactions due to the varying coverage of different detection techniques. As an alternative, in recent years many researchers have studied computational techniques to eliminate potential false positive interactions from PPI networks [28, 126, 127]. For this purpose, Saito, Suzuki, and Hayashizaki [126, 127] proposed the Interaction Generality measure (IG1), which leverages the network topology to predict interaction reliability. This work addressed the task of eliminating false positive edges produced by the ‘sticky’ proteins in Y2H assays. Sticky proteins tend to interact with many other proteins without signifying any biological relevance. As an alternative to Interaction Generality measures, Deng, Sun, and Chen proposed a reliability measure based on the presence of alternative interaction paths in the underlying network. Their measure, the Interaction Reliability by Alternative Path (IRAP), has been shown to outperform Interaction Generality measures in determining the false positives of a real PPI network. In a later work, an extension of the IRAP measure was proposed to detect both false positives and false negatives in a PPI network [28]. This methodology employs different weightings of the original graph derived from diverse topological metrics in order to calculate interaction confidence scores.

Another challenge in partitioning these networks for potentially useful information is their complex topological properties. Most of these networks possess a skewed degree distribution, with a few nodes (hubs) having very large degrees and the rest of the nodes having very few interactions [153]. The existence of hub nodes obscures the underlying community structure in these networks, since hubs are connected to many nodes within and across communities. Therefore, a typical graph clustering algorithm produces a few giant core clusters and many uninformative small-sized clusters. To address this problem, researchers have proposed various refinement techniques [3, 34, 35]. Earlier approaches based on the elimination of hubs from scale-free graphs have found that this strategy disconnects the graph and breaks down the modules as well [3]. Recently, Costa [35] introduced a hub-centered community detection algorithm which locates hubs at the center of the identified clusters. However, this methodology will not be effective in biological networks, since hubs have a large number of neighbors that belong to disparate communities. Moreover, in clustering these graphs there is a need to assign multi-faceted proteins to different groups, i.e., soft clustering. Such proteins typically have multiple functions and they are likely to be essential for the organism. To address the difficulties in partitioning scale-free graphs, in a recent work Abou-Rjeili and Karypis [2] presented several multi-level graph partitioning algorithms. A multi-level graph partitioning algorithm approximates a given graph by a set of increasingly smaller graphs and accomplishes the initial partitioning on the smallest graph. Then, partitions are refined in a backward manner to obtain a partitioning of each level of graphs and finally of the original graph. Wu, Garland, and Han [162] have presented a geodesic path-based clustering approach to partition networks that exhibit the scale-free topology. Although the proposed algorithms result in better groupings compared to the traditional ones, they still do not perform soft clustering.

In the analysis of PPI networks, of particular interest to many scientists is the isolation of the densely interacting regions, i.e., communities, of these networks. Such regions are presumed to be protein complexes or functional modules.
A protein complex can be defined as a set of proteins that bind to each other in order to accomplish a cellular-level task. The identification of these structures is useful to understand cell functioning and to predict the functionality of unknown proteins. The interest in their identification is motivated by the fact that proteins that interact heavily among themselves tend to participate in the same biological processes. Thus, the discovery of dense subgraphs from PPI networks is recognized as an important task for the identification of potential protein complexes. Based on this underlying principle, a set of algorithms that employ local dense regions of PPI networks to discover putative complexes have been proposed [12, 32, 96].

Bader and Hogue [12] proposed a three-step algorithm, Molecular COmplex DEtection (MCODE), for the identification of densely interacting proteins. MCODE starts by weighting each node of the network based on the density of its local neighborhood. Nodes with high weights are assigned as seeds and, starting from these seed nodes, initial clusters are obtained by iteratively including the neighboring nodes in the cluster. Finally, an optional third step is proposed to filter proteins according to a connectivity criterion. MCODE is evaluated on an integrated dataset of budding yeast that is composed of 9088 protein-protein interactions among 4379 proteins from the MIPS, YPD, and PreBIND databases.
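The three steps described for MCODE can be sketched roughly as follows. This is a simplified illustration and not the published MCODE procedure: the node weight used here (edge density of a node's closed neighborhood multiplied by its degree), the `tolerance` parameter, and the minimum cluster size are assumptions chosen only to keep the example short.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_weight(adj, node):
    """Assumed weight: density of the node's closed neighborhood times its degree."""
    members = adj[node] | {node}
    pairs = len(members) * (len(members) - 1) / 2
    present = sum(1 for u, v in combinations(members, 2) if v in adj[u])
    return (present / pairs) * len(adj[node])

def grow_clusters(adj, weights, tolerance=0.2):
    """Grow a cluster from each unvisited seed, heaviest seed first."""
    visited, clusters = set(), []
    for seed in sorted(adj, key=lambda n: (-weights[n], n)):
        if seed in visited:
            continue
        cutoff = weights[seed] * (1.0 - tolerance)
        cluster, frontier = {seed}, [seed]
        visited.add(seed)
        while frontier:
            node = frontier.pop()
            for nb in adj[node]:
                # Include a neighbor only if its weight is close to the seed's.
                if nb not in visited and weights[nb] >= cutoff:
                    visited.add(nb)
                    cluster.add(nb)
                    frontier.append(nb)
        clusters.append(cluster)
    return clusters

# Hypothetical toy network: a 4-clique (a, b, c, d) with a tail d-e-f.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
adj = build_adjacency(edges)
weights = {n: node_weight(adj, n) for n in adj}
clusters = grow_clusters(adj, weights)
# Optional post-filter (assumed criterion): keep clusters with at least 3 nodes.
complexes = [c for c in clusters if len(c) >= 3]
print(complexes)
```

On this toy network the clique is recovered as a single candidate complex, while the tail nodes form singleton clusters that the final filter discards.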

In this evaluation, 166 complexes are predicted, of which 52 were shown to match known MIPS complexes. MCODE is based on the observation that proteins share functions with their immediate neighbors.

In a more recent work, Chua et al. [32] utilized another observation based on ‘level-2’ interactions in PPI networks. A topological weighting schema is proposed, namely the Functional Similarity Weight (FS-Weight), that enables weighting both direct and indirect (i.e., ‘level-2’) interactions. FS-Weight makes use of the estimated reliability of each interaction to reduce the impact of noise. The reliability of each experimental source is estimated based on evidence from the Gene Ontology (GO). FS-Weight is a mechanism that favors two proteins that share many common neighbors from a reliable source. The number of non-common neighbors is also included in the calculation in order to reduce the number of potential false positive inferences. Based on FS-Weights, the studied PPI network is expanded with ‘level-2’ interactions. Next, this network is filtered by eliminating the interactions with small FS-Weights. After this preprocessing step, cliques in the modified PPI network are identified and these cliques are iteratively merged to form larger dense subgraphs.

More recently, Li, Foo, and Ng [96] proposed an algorithm named DECAFF (Dense Neighborhood Extraction using Connectivity and conFidence measures) which employs the Hub Removal algorithm of Ravasz et al. [123]. DECAFF initially identifies local dense neighborhoods of each protein by iteratively removing nodes with low degree from the local neighborhoods. These local cliques are merged with dense subgraphs detected by the Hub Removal algorithm [123] based on a Neighborhood Affinity criterion. The Neighborhood Affinity of two subgraphs is calculated based on their size and the number of their common neighbors. Finally, DECAFF improves the quality of the extracted clusters by removing subgraphs with low reliability scores. The reliability of a subgraph is defined as the average reliability of all interactions of that subgraph, where interaction reliability can be deduced from the functional relevance of the two interacting proteins.
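The final filtering step described for DECAFF, in which a subgraph's reliability is the average reliability of its interactions, can be illustrated with a small sketch. The reliability values, the subgraph names, and the `min_score` threshold below are hypothetical; how per-interaction reliabilities are derived is outside the scope of this example.

```python
def subgraph_reliability(edges, reliability):
    """Average reliability over all interactions (edges) of a subgraph."""
    scores = [reliability[frozenset(edge)] for edge in edges]
    return sum(scores) / len(scores) if scores else 0.0

def filter_subgraphs(subgraphs, reliability, min_score):
    """Keep the names of subgraphs whose average reliability reaches min_score."""
    return [
        name
        for name, edges in subgraphs.items()
        if subgraph_reliability(edges, reliability) >= min_score
    ]

# Hypothetical per-interaction reliabilities, keyed by unordered protein pair.
reliability = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p2", "p3")): 0.8,
    frozenset(("p1", "p3")): 0.7,
    frozenset(("p4", "p5")): 0.2,
    frozenset(("p5", "p6")): 0.4,
}
subgraphs = {
    "A": [("p1", "p2"), ("p2", "p3"), ("p1", "p3")],
    "B": [("p4", "p5"), ("p5", "p6")],
}
kept = filter_subgraphs(subgraphs, reliability, min_score=0.5)
print(kept)
```

Keying the reliabilities by `frozenset` makes the lookup independent of the order in which the two proteins of an interaction are listed.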

In addition to network modules, motifs of PPI networks have also been utilized to characterize and better understand the group-level relations. A motif of a graph refers to a substructure that recurs frequently within the graph. For the identification of large motifs in PPI networks, a scalable algorithm, NEtwork MOtif FINDER (NEMOFINDER) [29], has been proposed as an extension to existing subgraph mining algorithms. This algorithm is based on the formation of frequent trees of varying size from 2 to k, which are then used to partition the graph into a set of graphs such that each graph embeds a size-k tree. In the next step, frequent size-k graphs are generated by performing graph join operations. The frequency of these size-k graphs can be calculated with respect to the number of times the graph occurs in randomized networks. On the other hand, the uniqueness of a subgraph is determined as the number of times a subgraph is more frequent in the real graph than in randomized graphs. A frequent subgraph that is also unique is considered a Network Motif by NEMOFINDER. Existing Apriori-based algorithms are not able to capture interesting network motifs that are frequent and unique. Therefore, an extension to the SPIN [73] algorithm is proposed, which enables multiple membership. The input to the NEMOFINDER algorithm is a PPI network, together with user-defined thresholds for frequency, uniqueness, and maximal network size. The algorithm outputs Network Motifs that are frequent and unique with respect to the defined thresholds. NEMOFINDER is tested on the PPI network of budding yeast, and it identified motifs up to size 12. Later, an extension to NEMOFINDER, named LaMoFinder, was proposed, which takes into consideration the labels of nodes [30]. GO terms are used as node labels during the application of LaMoFinder for the discovery of PPI network motifs. First, the motifs of an unannotated network are extracted. Next, these motifs are labeled with GO functions. The analysis in [30] showed that by incorporating GO terms as labels, it is possible to capture the biological context of motifs as well as their topological shapes.
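The uniqueness test described above can be illustrated for the smallest non-trivial pattern, a triangle. The sketch below is not NEMOFINDER itself: it counts only triangles, it takes frequency to be the raw occurrence count in the real network, and it randomizes by drawing the same number of edges uniformly at random, which is a simplifying assumption (degree-preserving randomization is a common alternative). All function names and thresholds are hypothetical.

```python
import random
from itertools import combinations

def count_triangles(edges):
    """Number of triangles in an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )

def random_graph_like(nodes, n_edges, rng):
    """Assumed null model: n_edges distinct node pairs drawn uniformly."""
    return rng.sample(list(combinations(nodes, 2)), n_edges)

def uniqueness(real_count, random_counts):
    """Number of randomized networks in which the pattern is less frequent."""
    return sum(1 for c in random_counts if real_count > c)

def is_motif(real_count, random_counts, min_frequency, min_uniqueness):
    """A pattern is a motif if it is both frequent and unique."""
    return (real_count >= min_frequency
            and uniqueness(real_count, random_counts) >= min_uniqueness)

# Hypothetical toy network: two disjoint 4-cliques.
real = list(combinations("abcd", 2)) + list(combinations("efgh", 2))
nodes = sorted({n for edge in real for n in edge})
rng = random.Random(0)
random_counts = [
    count_triangles(random_graph_like(nodes, len(real), rng))
    for _ in range(100)
]
real_count = count_triangles(real)
print(real_count, uniqueness(real_count, random_counts))
```

The two thresholds play the role of the user-defined frequency and uniqueness parameters mentioned in the text; larger patterns would require a subgraph enumeration step that this sketch omits.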

2.1.2 The Analysis of Gene Expression Networks

Recently, microarray technologies have enabled the prediction of a comprehensive profile of gene expression levels of a cell under a particular condition. To analyze and mine these gene profiles for useful information, various techniques and ideas have been proposed [27, 171]. One particular interest of many scientists is to transform these expression profiles into gene co-expression networks. Here, the nodes represent genes, and two nodes are linked if the corresponding genes behave significantly similarly across different samples (i.e., co-expression) [27]. When constructing these networks, a similarity notion of genes should first be defined in order to identify interacting pairs. Then, the most naive and commonly used approach is to threshold the similarity score to arrive at a gene co-expression network [171]. Earlier approaches have employed linear similarity measures such as the Pearson correlation coefficient as a standard way of inferring gene similarity [27, 171]. Recently, Butte and Kohane [24] pioneered the use of the mutual information of gene expression levels instead of the Pearson correlation coefficient. They generated a set of Relevance Networks by calculating the mutual information of two genes based on their binned expression levels. Motivated by the success of this measure, Margolin et al. [103] have recently proposed an improvement over Relevance Networks, named ARACNE, that uses the Data Processing Inequality. Their analysis showed that ARACNE achieves very low error rates on synthetic datasets and outperforms Relevance Networks.

However, similar to the Pearson correlation coefficient, these measures are also purely based on the points in question (i.e., they are intrinsic measures). Therefore, given the noise inherent in microarray datasets, intrinsic similarity measures will not be adequate to distinguish accidentally co-regulated genes from those that share a biological context. In this dissertation, we propose and investigate the use of extrinsic similarity measures, which induce gene similarity by using the relative positions of many genes as a reference for the similarity of two genes. The use of extrinsic measures and their advantages have been previously studied for various data mining problems [37, 38]. Das, Mannila, and Ronkainen [37] proposed using extrinsic measures on market basket data in order to derive the similarity between two products from the buying patterns of customers. We discuss this measure in detail in Section 4.1.2. More recently, Palmer and Faloutsos [115] defined an extrinsic similarity measure (REP) with an analogy to electric circuits. Both groups concluded that extrinsic measures can give additional insight into the data. Recently, Ravasz et al. [123] took a step towards using extrinsic properties along with the intrinsic ones. Their measure, the Topological Overlap Measure (TOM), infers the similarity of two nodes in a biochemical network in terms of their pairwise similarity as well as the number of their common neighbors.
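As a small illustration of a topological overlap score, the sketch below uses one commonly cited form for unweighted networks: the number of shared neighbors plus the direct link, divided by the smaller of the two degrees plus one minus the direct link. This particular normalization is an assumption made for the example and may differ in detail from the definition in [123]; the toy network is hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def topological_overlap(adj, i, j):
    """Assumed form: (shared + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    shared = len(adj[i] & adj[j])          # number of common neighbors
    direct = 1 if j in adj[i] else 0       # a_ij: 1 if i and j are linked
    smaller_degree = min(len(adj[i]), len(adj[j]))
    return (shared + direct) / (smaller_degree + 1 - direct)

# Hypothetical toy network: triangle a-b-c with a pendant node d attached to a.
adj = build_adjacency([("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")])
tom_ab = topological_overlap(adj, "a", "b")
tom_bd = topological_overlap(adj, "b", "d")
print(tom_ab, tom_bd)
```

Nodes b and d are not directly linked, yet they receive a non-zero score through their shared neighbor a, which is the sense in which the measure combines intrinsic and extrinsic evidence.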
Yip and Horvath [169] proposed a generalized version of the TOM measure, the Generalized Topological Overlap Measure (GTOM), which considers common neighbors of two nodes in higher-order neighborhoods. In addition to the immediate neighbors of two nodes, GTOM also takes into consideration the number of common neighbors of two nodes in their m-step neighborhoods.

In the field of gene co-expression network analysis, one extensively studied area is the partitioning of these networks into subgraphs to elucidate gene functional groups. Since genes that share a functionality are often co-regulated, such genes exhibit similar expression patterns under diverse conditions. Thus, identifying and studying groups of highly interacting genes in co-expression networks is an important step towards capturing functional groups of genes at a global scale. For this purpose, in addition to diverse graph partitioning algorithms, popular clustering algorithms such as hierarchical clustering [134], k-means clustering [102], and self-organizing maps [89] have also been employed. To find gene groups that have similar expression patterns, Hartuv and Shamir proposed an algorithm that recursively splits the weighted co-expression graph into its highly connected components [68]. A highly connected component is defined as a subnetwork that includes at least two nodes, i.e., n > 1, and that can only be disconnected by the removal of more than n/2 edges. Their algorithm, the Highly Connected Subgraphs (HCS) algorithm, splits the network into subgraphs at each iteration until a highly connected component is identified.

Shamir and Sharan [130] proposed an extension of the HCS algorithm named CLuster Identification via Connectivity Kernels (CLICK). At each step of the CLICK algorithm, a minimum cut of the input graph is computed, which outputs two subgraphs. Then, subgraphs that satisfy a certain criterion are labeled as kernels. Each of these kernels is associated with a fingerprint calculated from the kernel elements. After the identification of all kernels, nodes that are not part of any kernel are further analyzed, and each such node is included in a kernel if it is similar enough to that kernel's fingerprint. Next, at the adoption step, the fingerprint of the kernel is re-calculated. Finally, kernels that are similar enough are merged and the adoption operation is repeated. The adoption and kernel merging steps are repeated until there are no more changes in the kernel structures. The final kernels are reported as the gene clusters produced by the CLICK algorithm. The authors have shown that their algorithm outperforms existing clustering algorithms when it is applied to gene expression datasets originating from various studies, such as the yeast cell cycle dataset and data on the response of human fibroblasts to serum.

Besides the use of such clusters for the functional characterization of unknown genes, we also propose an alternative way of utilizing them for the gene-set level analysis of further microarray datasets. The goal of gene set analysis is to determine whether the genes of a defined set are differentially affected by a treatment. The major advantage of these analyses over single gene analysis is their ability to provide a unifying theme for data interpretation.

To date, several gene set approaches have been proposed [18, 142]. Among these, GO Term Finder [18] and Gene Set Enrichment Analysis (GSEA) [142] require the prior generation of a rank-ordered gene list. Given a set of genes, GO Term Finder calculates an enrichment score based on GO annotations [18]. On the other hand, GSEA calculates an enrichment score (ES) to reflect the degree of over-representation of a set S at the top or bottom of a given ranked list of genes [142]. Recently, Lu et al. [101] proposed the use of Hotelling’s T-square test as a multivariate analysis method for pathway significance analysis. In their work, a set of dependent variables, such as genes sharing a particular property, is tested in order to detect a significant difference between the two groups studied (e.g., treated vs. control). Later, motivated by the success of this pioneering work, Hotelling’s T-square test was successfully used to detect differentially regulated genes or gene groups [156, 166]. Nevertheless, the application of Hotelling’s T-square test is limited since it can only deal with two treatment groups. In our work, we explore the use of a generalized analogue of Hotelling’s T-square test, namely the multivariate analysis of variance (MANOVA), which can be used effectively in the presence of multiple groups.

2.1.3 The Analysis of Regulatory Networks

Gene regulation is the key to explaining a number of biological processes including , cell cycle, signaling, and stress. The regulation of genes is realized by a complex set of interactions between regulatory proteins (transcription factors) and their target genes. To properly respond to environmental stimuli and initiate the necessary cellular programs, distinct sets of interactions take place inside the cell. Therefore, identifying the complex architecture of regulation, as well as discovering how this architecture changes in response to various environmental conditions, has great potential for revealing the biological mechanisms that are fundamental to the maintenance of life. The most important components of gene regulatory systems are the Transcription Factor (TF) proteins, which bind DNA and initiate transcription. Recently, genome-wide TF-DNA interactions have been systematically detected through the ChIP-chip technique [93, 98, 131].

This approach enables the detection of all genome fragments bound directly by a TF of interest. However, the binding of a TF to a gene under a particular condition does not necessarily imply that the TF actually regulates that gene under the prevailing condition [16]. First, a detected binding event may not be regulatory, or the observed binding may relate to some cellular function other than gene expression. Moreover, some interactions between TFs and their target genes only occur under very specific conditions, which implies that many true binding events may be missed by ChIP-chip analysis because the relevant conditions have not yet been examined. Therefore, interactions identified via ChIP-chip experiments are by definition static, i.e., they represent the interactions but do not imply a relevant context. Recently, computational models have been proposed to address the problem of correctly interpreting measurements of TF-target binding from noisy and condition-dependent datasets [16, 56]. As proposed by some of these authors, other sources of information can be combined with TF-gene binding evidence for the reliable identification of regulatory relations, the most important of these being gene expression profiles. In an attempt to discriminate between functional and non-functional TF-DNA interactions, Gao, Foat, and Bussemaker [56] integrated gene expression with ChIP-chip datasets. They proposed a multivariate regression model to infer the activity of each transcription factor and to deduce the correlation between TF activities and the expression levels of genes. Their model aims to quantify to what extent each transcription factor is responsible for the observed changes in mRNA expression.

Probabilistic models have been studied to integrate multiple lines of evidence into a single, more comprehensive model. Segal et al. [128] proposed a probabilistic model to integrate information from gene expression profiles and protein interaction networks. In addition, it has been shown that probabilistic models enable prior knowledge to be leveraged in the learning process. Along these lines, Hartemink et al. [67] analyzed gene expression datasets using Bayesian networks in which ChIP-chip data was employed as a prior. In a more recent work by Beyer et al. [16], seven distinct lines of evidence are incorporated into a model: DNA binding intensities measured with ChIP-chip technology, TF binding motifs, coexpression information, physical protein-protein interactions, gene pairs with shared phylogenetic profiles, and pairs of genes that were fused together in other . Their analysis showed that an integrative model can be very effective in assigning confidences to raw interaction measurements. In addition to attempts to derive probabilistic integrative techniques, scientists have also been studying the identification of regulatory modules. These modules can be inferred from diverse datasets including ChIP-chip, motif, and gene expression datasets. A regulatory module is composed of a set of genes that are co-regulated by a common set of regulators.

In order to identify such modules from ChIP-chip data and gene expression profiles, Bar-Joseph et al. [14] proposed the GRAM algorithm. This algorithm first identifies, with an exhaustive search, a set of genes that are bound by the same regulator in the ChIP-chip data. Subsequently, a subset of this set that is similarly expressed is selected to serve as the seed. The algorithm then identifies genes that are expressed similarly to the seed genes and that are connected to the same set of transcription factors based on a relaxed binding criterion. Lemmens et al. [95] improved the GRAM algorithm by incorporating motif data as an additional source. Their Apriori-like algorithm identifies gene modules by finding genes that have a common expression profile, share the same regulatory program, and have the same motifs in their intergenic regions.

Another research direction in the analysis of regulatory interactions is the study of the temporality of these interactions. Lee et al. [93] generated ChIP-chip data to study protein-DNA interactions under normal growth conditions and various stress conditions. Their analysis showed that the binding space of many transcription factors shrinks or expands when the experimental condition changes, which points to the condition-dependency of TF-gene interactions. Recently, to reveal different condition-dependent regulatory mechanisms, Lee et al. [92] analyzed ChIP-chip [66], motif, and gene expression datasets [58] under three different experimental conditions (heat shock, nitrogen depletion, and mitotic cell cycle). Their analysis identified dynamic transcriptional modules composed of co-expressed genes and their common regulators under the tested experimental condition. In another study, time series gene expression and ChIP-chip datasets were analyzed to identify bifurcation points in different stress studies [47]. A bifurcation point is defined as a time point where the expression of a subset of genes diverges significantly from that of the rest of the genes.

2.1.4 Clustering and Graph Partitioning Algorithms

Clustering is the task of grouping a set of data points into subgroups, i.e., clusters, such that the points in each subgroup are similar to each other according to some distance or similarity measure. Most biological interaction networks exhibit a modular structure, which implies that they include groups of well-connected nodes with relatively loose connections between these groups. Clustering algorithms can be used effectively to identify these groups. In this section, we review some clustering and graph partitioning algorithms that are relevant to this work.

Clustering of gene co-expression networks has been investigated to elucidate gene functions at a global scale by identifying groups of highly-interacting genes in these networks. For this purpose, in addition to diverse graph partitioning algorithms, popular clustering algorithms have also been employed, such as hierarchical clustering [134], k-means clustering [102], and self-organizing maps [89]. In addition to these standard techniques, algorithms that are more suitable for the specific task have been studied. Among these are the biclustering algorithms, which identify a group of genes that behave similarly only for a subset of all conditions [145]. Given a gene expression matrix of samples and genes, biclustering algorithms perform clustering in two dimensions (samples and genes) simultaneously. Cheng and Church proposed a biclustering algorithm for the identification of gene biclusters, which finds maximal-sized biclusters that satisfy a certain condition on the residue score, an assessment of the quality of a bicluster [31]. Their algorithm identifies each bicluster separately by iteratively removing rows and columns until the mean squared residue score for the sub-matrix is smaller than a threshold and by iteratively adding rows and columns until the quality assessment score exceeds this threshold. Each run of the algorithm identifies a single bicluster, and the next bicluster is extracted after the previously found bicluster is masked by randomization. Tanay, Sharan, and Shamir [144] converted the biclustering problem into a graph theory problem using bipartite modeling. Initially, the expression data is converted into a bipartite graph of genes and microarray samples. This modeling reduces the biclustering problem to the problem of finding the densest subgraphs in the bipartite graph. Since the identification of the heaviest biclique is an NP-complete problem, the authors restricted their search space by assuming a degree bound on one side of the bipartite graph. In a more recent work, Koyuturk, Szpankowski, and Grama [90] proposed a model that associates statistical significance with the extracted biclusters. They formulated this problem as an optimization problem based on the statistical significance objective and proposed fast heuristics to solve it in a scalable manner.

In addition to biclustering algorithms, an ensemble clustering algorithm has been proposed for the analysis of biological networks in order to generate a more robust clustering compared to standard clustering algorithms [10]. Cluster ensembles can be defined as a mapping from a set of clusterings generated by a variety of sources into a single consensus clustering arrangement. Asur, Ucar, and Parthasarathy [10] proposed an ensemble clustering approach for the PPI network problem. First, they employed different topological weighting schemes to generate different views of the original unweighted PPI network. Next, these different views are clustered with diverse algorithms to obtain a set of base clusterings of the weighted networks. Later, these base clusterings are integrated into a Cluster Membership Matrix, which is reduced in size using Principal Component Analysis (PCA) to eliminate redundancy and to scale the consensus determination problem. Subsequently, standard hierarchical clustering algorithms are applied to the reduced Cluster Membership Matrix to generate the consensus clustering.

Moreover, scientists have studied soft clustering algorithms for the analysis of biological networks, which enable assigning multiple-cluster membership to biological entities. To enable multiple cluster membership for proteins while identifying PPI clusters, Asur, Ucar, and Parthasarathy [10] proposed a soft ensemble clustering technique that goes a step further than their PCA-based ensemble clustering. This adapted algorithm iteratively calculates the strength of the membership of each protein in each consensus cluster based on shortest path distances. Proteins that have a high propensity towards multiple membership are then assigned to their alternate clusters. A soft biclustering algorithm (MF-PINCoC) has also been proposed to identify overlapping dense subgraphs by using a local search technique [120], as an extension to the PINCoC algorithm. The PINCoC algorithm applies a greedy search strategy in order to find the locally optimal sub-matrices with respect to a quality function [119]. More recently, Avogadri and Valentini [11] proposed an ensemble fuzzy clustering method for decomposing gene expression datasets into their soft clusters. Their algorithm first generates multiple views of the data by using random projections. A random projection maps the data from a high-dimensional space to a lower-dimensional space. The fuzzy k-means algorithm is applied to these views, and the resulting fuzzy clustering arrangements are integrated into a similarity matrix. Fuzzy k-means is then applied to this similarity matrix to identify the final fuzzy consensus clusters.

In the rest of this section, we review the algorithms that we employed for our analysis.

• Hierarchical Clustering: Hierarchical clustering is a traditional clustering technique that is commonly applied to biological datasets due to its robustness [134]. Agglomerative hierarchical clustering is a bottom-up clustering paradigm which results in a nested set of clusters, where at each level, clusters are generated by merging clusters of a lower level. At the bottom level, each cluster is a single entity (e.g., a single gene). At each level, the clusters that are most similar are merged to form higher-level clusters. This hierarchical clustering process can be represented as a tree, or dendrogram, where each step in the clustering process is illustrated by a join of the tree. The algorithm requires a parameter specifying the number of clusters and a similarity criterion. During our analysis, we utilized a fast implementation of agglomerative clustering from the CLUTO clustering toolkit [83]. The average linkage method was used to find the most similar clusters at each step of the algorithm.

• Spectral Graph Partitioning: The spectral clustering algorithm uses a matrix constructed from the graph, namely the Laplacian matrix. The eigenvectors of the Laplacian matrix are used to effectively group the data points. For our analysis, we employed the publicly available Chaco software [69]. This algorithm performs a weighted version of spectral bisection. At each step, the algorithm uses the eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix, known as the Fiedler vector, to divide the data points into two. To obtain the desired number of partitions, the data is bisected repeatedly. Spectral methods have been shown to be effective in finding a good general cut [110]. However, they usually perform poorly in the fine details. Thus, local refinement of the initial partitions is usually employed after the spectral partitioning of the data. To improve the quality of partitions, Chaco refines the spectral output with a generalized version of the Kernighan-Lin algorithm.

• Multilevel k-way Graph Partitioning: METIS is a family of algorithms developed to partition graphs and hyper-graphs [84]. These algorithms have three major phases: coarsening, initial partitioning, and refinement. In the coarsening phase, the original graph is transformed into a sequence of smaller graphs. In the next phase, an initial two-way partitioning of the coarsest graph is obtained that satisfies the balancing constraints while minimizing the cut value. During the uncoarsening and refinement phase, the partitioning is projected back to the original graph by going through intermediate partitions. After projecting a partition, a refinement algorithm is employed to reduce the edge-cut while conserving the balance constraints. The fundamental multilevel paradigm of the METIS algorithms produces balanced and high quality partitions in a scalable manner.

• Graclus Partitioning Algorithm: Graclus is a recent algorithm that computes normalized cuts for a given network using a kernel function [43]. In their work, Dhillon, Guan, and Kulis [43] also established a mathematical equivalence between general cut or association objectives and the weighted kernel k-means objective. This algorithm does not require any time-consuming eigenvector computation, which is why, unlike spectral clustering, it is scalable to large networks. We used the Graclus software, which embeds the weighted kernel k-means algorithm within a multi-level framework.
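Of the partitioning methods reviewed above, the spectral bisection step is the simplest to illustrate. The following is a minimal sketch only, not the Chaco implementation: it assumes a connected, undirected graph given as a dense adjacency matrix, splits the nodes by the sign of the Fiedler vector entries, and omits the local refinement step. The function name `fiedler_bisection` is hypothetical.

```python
import numpy as np

def fiedler_bisection(adjacency):
    """Split a connected undirected graph in two using the Fiedler vector.

    Illustrative sketch: uses the unnormalized Laplacian L = D - A and a
    sign-based split; no refinement of the resulting cut is performed.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    # eigenvector of the second smallest eigenvalue (the Fiedler vector)
    fiedler = eigenvectors[:, 1]
    # nodes with non-negative entries go to one side, the rest to the other
    return fiedler >= 0

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
side = fiedler_bisection(A)
```

On this toy graph the two triangles end up on opposite sides of the cut. Repeating the bisection on each side would give more than two partitions.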

CHAPTER 3: PPI Networks

Experimentally obtained protein interactions are typically represented in a graphical form, where nodes represent proteins and edges refer to the interactions. Recently, researchers have proposed various methods to analyze these networks, one of which is the extraction of tightly connected modules [12, 54, 55, 72]. It has been shown that there is a correspondence between such modules and functional groups of proteins [122]. Przulj, Wigle, and Jurisica [122] showed that densely interacting clusters deduced from the Saccharomyces cerevisiae interaction map are functionally more homogeneous than random groupings of these proteins. Moreover, the connectivity of these networks can be used to infer the functions of unannotated proteins based on the functions of their neighbors, which is known as the ‘guilt by association’ principle [23, 72, 163]. Providing functional annotation for uncharacterized proteins will take us a step forward towards revealing the life of a biological cell. However, extracting this information from PPI networks is a challenging task due to the uneven degree distribution of these networks and the existence of technical noise. PPI networks typically contain a few highly connected proteins (hubs) linking the rest of the proteins to the system, which is detrimental to the application of traditional graph partitioning/clustering algorithms for the identification of protein partitions/clusters. Moreover, high-throughput screening, which produces the majority of extant protein interactions, generates a large number of false positives [40]. As a result, interaction maps include many physical interactions with no biological significance. These two issues complicate the extraction of functional modules from these networks. This chapter details our attempts to alleviate both of these problems by employing evidence from the network topology.

In the first part of this chapter, we focus on a pre-processing technique that uses a key transformation and separate weighting functions to effectively identify and eliminate potential false positives from the graph [152]. In the following section, we propose a novel refinement method based on local neighborhoods and the biological and topological significance of hub proteins [149]. Both of the proposed methods are evaluated on the PPI graph of Saccharomyces cerevisiae obtained from the Database of Interacting Proteins (DIP) [165]. We start our discussion with a brief description of this data. For validation, we employ the Gene Ontology (GO) Consortium database [8], which provides structured vocabularies (ontologies) to annotate genes in terms of their associations with biological processes, molecular functions, and cellular components.

3.1 Dataset

We evaluated our work on the PPI network of budding yeast (Saccharomyces cerevisiae) downloaded from the Database of Interacting Proteins (DIP) [165]. This database catalogs experimentally determined interactions of various organisms from both small-scale and large-scale experiments. The dataset we used throughout this section consists of 15147 interactions among 4741 yeast proteins. We focus on this organism since it is well studied and has large amounts of interaction data. For the purpose of our study, these interactions can be naturally visualized as an interaction network with nodes representing proteins and the edges between these nodes denoting the experimentally obtained interactions. The frequency-degree plot of this network is depicted in Figure 3.1. Frequency (P(k)) indicates the probability that a randomly chosen node has degree k. Usually, the degree distribution of a network is shown as a frequency-degree plot, which approximates a straight line of slope on a log-log scale for scale-free networks. As evident from Figure 3.1, the degree distribution of our experimental data roughly follows a straight line. This implies that most proteins in the graph participate in a small number of interactions while a few proteins, known as hubs, are involved in a large number of interactions.

Figure 3.1: The degree distribution of the DIP budding yeast interaction map
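The frequency P(k) described above can be computed directly from an edge list. The sketch below is illustrative and uses a toy edge list rather than the DIP data; the function name `degree_distribution` is hypothetical, and proteins that appear in no interaction are not counted.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)                       # nodes with at least one edge
    counts = Counter(degree.values())     # how many nodes have each degree
    return {k: c / n for k, c in sorted(counts.items())}

# Toy star network: one hub interacting with three other proteins
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
p_k = degree_distribution(toy_edges)      # {1: 0.75, 3: 0.25}
```

Plotting log P(k) against log k for a real network gives the frequency-degree plot discussed in the text.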

3.2 Network Purification Algorithms

The majority of protein interactions are obtained using high-throughput techniques such as the yeast two-hybrid (Y2H) system [51]. This technique has become one of the most commonly used technologies to detect protein-protein interactions. Its main advantages are its simplicity, low cost, and high throughput. However, it is burdened by a tendency to produce a large number of false positives. A number of studies conducted to assess the quality of the data have demonstrated a large number of erroneously identified interactions. Hence, the biological relevance of interacting proteins obtained from this system needs to be re-affirmed. However, experimental validation of these interactions is infeasible due to the expense and time required to conduct more reliable small-scale experiments. Therefore, we propose computational techniques to identify potential false positives in a given PPI network. Our methodology employs an abstraction of the network, namely the line graph transformation, along with two topological metrics to transform the PPI network into a sparser network with a reduced number of interactions. We aim to show that the transformed graph contains fewer false positives and leads to a more biologically relevant partitioning than the original graph using the same algorithm with the same settings. To study the efficacy of our pre-processing techniques with different graph partitioning/clustering techniques, we examine two different approaches for module extraction: a hierarchical agglomerative clustering algorithm and a multi-way graph partitioning algorithm.

We want to emphasize that it would be impossible to determine without experimental examination whether the interactions eliminated by our methodology are indeed false. Nevertheless, we believe that the presence of balanced, biologically more significant clusters in the cleaned data serves as a preliminary validation of our technique. As discussed in detail in Chapter 2, Saito, Suzuki, and Hayashizaki [126] worked on eradicating false positive interactions using a metric called Interaction Generality. However, this metric focuses only on the degree of individual proteins without considering the topology of the network. We believe that the degree, by itself, is not sufficient. It is important to consider the connectivity and density of sub-networks to adequately deal with false positive interactions. Our technique is, therefore, governed by the following intuition. If a node is strongly connected to its neighbors (i.e., it lies inside a dense subnetwork), each of its interactions is supported by several other interactions. Hence, the edges (interactions) that are not part of dense subnetworks are more likely to be falsely obtained interactions. Edges that connect subnetworks are also potential false interactions. Hence, we use topological metrics of the network, namely the Clustering Coefficient and Centrality (Betweenness and Closeness), to quantify the possibility of an interaction being false.

3.2.1 Topological Metrics

In this study, we propose using topological metrics to identify potential false positive interactions. Here, we describe the topological metrics we employed in detail.

Clustering Coefficient

The Clustering Coefficient [158] is a metric commonly employed to identify well-connected sub-components in networks. It represents the interconnectivity of the neighbors of a node. The Clustering Coefficient of a node v in a graph can be defined as follows:

\[ CC(v) = \frac{2 n_v}{k_v (k_v - 1)} \tag{3.1} \]

where $n_v$ denotes the number of triangles that go through node v and $k_v$ indicates the degree of this node. The quantity $k_v (k_v - 1)/2$ is the maximum number of triangles that can go through node v, so the coefficient ranges from 0 to 1. Nodes with a high Clustering Coefficient have neighbors that are more likely to be connected to each other.
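As a small worked illustration of Equation 3.1, the sketch below computes the Clustering Coefficient from an adjacency structure. It assumes an undirected graph stored as a dictionary mapping each node to the set of its neighbors; the function name and the toy graph are hypothetical, and nodes with fewer than two neighbors are assigned a coefficient of zero by convention.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """CC(v) = 2 * n_v / (k_v * (k_v - 1)) for an undirected graph."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no triangle can pass through v (convention assumed here)
    # n_v: pairs of neighbors of v that are themselves connected
    triangles = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * triangles / (k * (k - 1))

# Triangle A-B-C with an extra pendant node D attached to A
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
cc_a = clustering_coefficient(toy, "A")  # one triangle out of three possible
cc_b = clustering_coefficient(toy, "B")  # one triangle out of one possible
```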

Centrality

The Centrality of a node in a network is a measure of the structural importance of the node. There are three important kinds of Centrality: Degree, Closeness, and Betweenness. In this work we use Betweenness and Closeness as they are more informative than degree- based centrality and more suitable for this problem.

Betweenness Centrality: Betweenness Centrality [53] is a measure of the centrality of a node and its influence over data flows in the network. For a node v, it is normally calculated as the fraction of the shortest (geodesic) paths between node pairs that pass through node v. More precisely, if $d_v(i, j)$ is the number of shortest paths from i to j that pass through node v in a graph G having n nodes, then the Betweenness Centrality of node v can be calculated as

\[ B(v) = \frac{\sum_{i, v, j \in G} d_v(i, j)}{(n - 1)(n - 2)} \tag{3.2} \]
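Equation 3.2 can be illustrated with a breadth-first search that counts shortest paths. The sketch below makes an assumption that goes beyond the text: when several shortest paths exist between i and j, the contribution of the pair is taken as the fraction of those paths that pass through v (Freeman's definition), which equals the plain count whenever the shortest path is unique. The sum runs over ordered pairs (i, j) with i, j, and v distinct, the graph is assumed to be undirected with at least three nodes, and all names are hypothetical.

```python
from collections import deque

def shortest_path_counts(adj, source):
    """BFS from source: distance and number of shortest paths per reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """Normalized betweenness of v, following the shape of Equation 3.2."""
    n = len(adj)
    dist_v, sigma_v = shortest_path_counts(adj, v)
    total = 0.0
    for i in adj:
        if i == v or i not in dist_v:
            continue
        dist_i, sigma_i = shortest_path_counts(adj, i)
        for j in adj:
            if j == v or j == i or j not in dist_i:
                continue
            # v lies on a shortest i-j path exactly when the distances add up
            if dist_i[v] + dist_v[j] == dist_i[j]:
                total += sigma_i[v] * sigma_v[j] / sigma_i[j]
    return total / ((n - 1) * (n - 2))

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
```

In the three-node path, every shortest path between the end nodes passes through the middle node, so its score is 1. In the four-node cycle, each node carries half of the shortest paths between its two neighbors.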

Closeness Centrality: Closeness Centrality [52] is a measure of the closeness of a node, on average, to all the other nodes. Formally, the closeness of a node v in a graph G is defined by the following expression:

\[ C(v) = \frac{N - 1}{\sum_{v, w \in G} d(v, w)} \tag{3.3} \]

where d(v, w) denotes the pairwise geodesic distance between nodes v and w, and N denotes the number of nodes reachable from node v. Due to the scale-free property, the nodes with the highest closeness scores in the PPI network are the hubs, and hence they are viewed as core components of the network.
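A minimal sketch of Equation 3.3 follows. It assumes an unweighted, undirected graph stored as a dictionary of neighbor sets, and it assumes that N counts node v itself, so that N − 1 is the number of other nodes reachable from v; the text does not state this explicitly. The function name is hypothetical.

```python
from collections import deque

def closeness(adj, v):
    """C(v) = (N - 1) / sum of geodesic distances from v to reachable nodes."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    if total == 0:
        return 0.0  # isolated node: no other node is reachable
    # len(dist) counts v itself (assumption about N stated in the lead-in)
    return (len(dist) - 1) / total

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
c_mid = closeness(path, "b")   # (3 - 1) / (1 + 1)
c_end = closeness(path, "a")   # (3 - 1) / (1 + 2)
```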

3.2.2 Line Graph Transformation

The metrics defined above all apply to the nodes of a graph. However, we do not want to eliminate any nodes (proteins) of our PPI network since we wish to cluster them at the end. Therefore, our aim is to target the edges (interactions) of a given network. In order to use metrics defined on nodes, we transform our data into a line graph representation [132]. In this representation, each node corresponds to an edge in the original graph, and two nodes are connected if and only if the corresponding edges have a common endpoint (i.e., protein) in the original graph. More formally, let G(V, E) be a graph, where V denotes the set of nodes and E denotes the set of interactions (edges) between the network nodes. The line graph of G, L(G), has the set E as its nodes. Two nodes in L(G) are connected if they are adjacent edges in G, which implies that they share a node from the set V [65].
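The definition of L(G) above can be sketched in a few lines. This is an illustration only, assuming an undirected graph without self-loops given as a list of edges; the function name `line_graph` and the toy protein names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def line_graph(edges):
    """Build L(G): one node per edge of G, linked when the edges share an endpoint."""
    # Normalize each undirected edge to a sorted tuple and drop duplicates
    edge_nodes = sorted({tuple(sorted(e)) for e in edges})
    incident = defaultdict(list)          # endpoint -> edges touching it
    for e in edge_nodes:
        for endpoint in e:
            incident[endpoint].append(e)
    adjacency = {e: set() for e in edge_nodes}
    for shared in incident.values():
        # every pair of edges meeting at the same protein is adjacent in L(G)
        for e1, e2 in combinations(shared, 2):
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency

# Toy chain of interactions P1 - P2 - P3 - P4
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
lg = line_graph(toy_edges)
```

Because each node of the result is labeled with its two endpoint proteins, removing a node from the line graph maps back to removing the corresponding interaction from the original network.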

Line graph transformation has several advantages for our purposes. First, it emphasizes the edges (interactions) rather than the nodes. Since we aim to eliminate false positive interactions, this proves to be useful. Second, it retains information about the proteins involved. Hence, we are still able to cluster all the proteins. Pereira-Leal, Enright, and Ouzounis [118] have used the line graph transformation of a PPI network to deduce its functional modules. Although this transformation allows soft clustering of nodes, their work does not take into account the false positive interactions present in these networks. Therefore, their solution is prone to errors caused by the erroneous interactions in the PPI network. Moreover, their solution is not scalable due to the complexity of clustering line graphs. Edges in the original graph form nodes in the line graph. Therefore, the line graph is larger in size than the original graph. Conventional clustering algorithms do not scale well when confronted with large graphs. Since a graph partition and its line graph representation have very different topologies in terms of compactness, we believe that finding dense components on the line graph will not reveal the actual dense regions of the original graph. Hence, we used the line graph representation only for pre-processing purposes. We transform the reduced line graph back to the original graph before clustering.

Although various studies have been made on line graphs, to the best of our knowledge, no earlier work has focused on using the line graph transformation to pre-process protein-protein interactions. Some work has also been done on the Clustering Coefficient of line graphs. Nacher et al. [107] showed that nodes with high Clustering Coefficient in the original graph will have high Clustering Coefficient after the line graph transformation. Since proteins sharing a significant number of interaction partners are likely to participate in common cellular processes, nodes with high Clustering Coefficient in the line graph are more likely to participate in efficient partitions. As a result, we decided to remove the nodes that have low Clustering Coefficient in the line graph, since they correspond to edges that are not part of any dense component in the original network and are, therefore, most likely to be false positives. FAS Research [1] studied the node betweenness of a line graph. They claimed that betweenness values calculated for the nodes of a line graph provide information about the contribution of each edge to the betweenness in the original network. Therefore, nodes of a line graph with high betweenness scores correspond to edges that are likely to lie between communities and are more likely to be biologically irrelevant. Due to these observations, we remove nodes of the transformed graph with low Clustering Coefficient and high Centrality values.

3.2.3 Clustering

As we mentioned earlier, we use two different clustering algorithms: an agglomerative hierarchical algorithm and a multi-level graph partitioning algorithm. Details of these algorithms are discussed in Chapter 2. The hierarchical clustering algorithm requires a similarity criterion. In order to define the similarity of proteins in the PPI network, we use the well-known Czekanowski-Dice distance metric [23]. This metric is well suited to this domain, since it increases the weight of shared interacting proteins: two proteins having no common interactors have the maximum distance value, while those interacting with exactly the same set of proteins have a distance of zero. Our similarity metric is defined as:

\[ \mathrm{Sim}(i, j) = 1 - \frac{|\mathrm{Int}(i) \,\Delta\, \mathrm{Int}(j)|}{|\mathrm{Int}(i) \cup \mathrm{Int}(j)| + |\mathrm{Int}(i) \cap \mathrm{Int}(j)|} \tag{3.4} \]

Here, Int(i) and Int(j) denote the sets of interactors (including the proteins themselves) of proteins i and j, respectively, and ∆ represents the symmetric difference between the sets. The value of this metric ranges from 0 to 1.

We varied the number of clusters for both algorithms and picked the values that gave the best balance in terms of cluster size. We found that for smaller values of k (number of clusters), the Hierarchical algorithm produced an imbalanced cluster arrangement. However, at k = 500 and k = 700, the clusters obtained were balanced and suitable for our analysis. In the case of kMETIS, since the algorithm is designed to obtain balanced clusters, a high value of k, such as k = 500, results in extremely small clusters. Hence, we set the number of clusters to 120 for the kMETIS experiments.
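The similarity of Equation 3.4 can be sketched in a few lines of Python. The adjacency-set dictionary and the function names interactors and cd_similarity are hypothetical and serve only to illustrate the formula.

```python
def interactors(adj, p):
    """Interaction partners of protein p, including p itself."""
    return set(adj.get(p, ())) | {p}


def cd_similarity(adj, i, j):
    """Equation 3.4: 1 - |A symdiff B| / (|A union B| + |A intersect B|)."""
    a, b = interactors(adj, i), interactors(adj, j)
    # The denominator is never zero because each set contains its own protein.
    return 1.0 - len(a ^ b) / (len(a | b) + len(a & b))
```

Two proteins with identical interactor sets obtain a similarity of 1, and two proteins whose interactor sets are disjoint obtain a similarity of 0, which matches the range stated above.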

3.2.4 Validation

To test the hypothesis that the final clusters obtained from the pre-processed data correspond to better functional modules, we need to validate our clusters using domain information. For this purpose, we employ gene annotations from the Gene Ontology Consortium online database [8]. The Gene Ontology (GO) is an important tool designed to support the work of researchers in genomics and biomedicine by providing a common terminology for reporting results. GO consists of three terminologies comprising biological process (BP), molecular function (MF), and cellular component (CC) terms. The CC terms refer to the localization of the corresponding biological entities within a single cell. This provides anatomical and structural association information. The MF terms refer to shared activities at the molecular level. The BP terms, on the other hand, refer to entities at both the cellular and organism levels of granularity. Each of them provides valuable information about the biological significance of protein associations in the organism. We employ the May 2005 distribution of GO, which contains 7000 genes annotated with 1644 cellular component terms, 7502 molecular function terms, and 9706 biological process terms.

Merely counting the proteins that share an annotation would be misleading, since the underlying distribution of genes among different annotations is not uniform. Hence, we use p-values to calculate the statistical significance of a group of proteins that share a GO term. The p-value essentially represents the chance of seeing that particular grouping, or a better one, given the background distribution. Assume we have a cluster of size n, out of which m proteins share a particular annotation. Also, there are N proteins in the database, with M of them known to have that same annotation. Then, using the Hypergeometric Distribution, the probability of observing m or more proteins that are annotated with the same GO term out of n proteins is:

\[ p\text{-value} = \sum_{i=m}^{n} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} \tag{3.5} \]

Smaller p-values imply that the grouping is unlikely to be random and is more likely to be biologically meaningful: the smaller the p-value, the less likely it is to obtain that group by chance. A cut-off value (alpha level) is used to differentiate significant groups from insignificant ones. If a group of proteins is associated with a p-value greater than the cut-off, it is considered insignificant. We have used the recommended cut-off of 0.05 for all our validations.
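For illustration, the hypergeometric tail probability of Equation 3.5 can be evaluated exactly with integer binomial coefficients. The sketch below is not the GO Term Finder implementation; the function name enrichment_p_value is hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(N, M, n, m):
    """P(X >= m) for X ~ Hypergeometric(N, M, n), as in Equation 3.5.

    N: proteins in the database, M: proteins carrying the annotation,
    n: cluster size, m: annotated proteins observed in the cluster.
    """
    total = comb(N, n)
    upper = min(n, M)  # terms with i > M are zero and can be skipped
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total
```

A call such as enrichment_p_value(7000, 32, 41, 19) corresponds to a cluster of 41 proteins in which 19 carry an annotation shared by 32 of 7000 proteins. The value it returns need not coincide with the scores reported by an annotation tool, which may apply additional corrections.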

For each cluster obtained by our experiments, we query the GO annotations using the GO Term Finder [18] tool to calculate the enrichment p-values in each of the three GO categories: biological process, molecular function, and cellular component. To assign a single p-value score to the overall clustering scheme, we define an average p-value as follows:

\[ \mathrm{avg\ pvalue} = \frac{\sum_{i=1}^{n} \min(\mathrm{pvalue}_i)}{n} \tag{3.6} \]

where n denotes the number of partitions with significant p-values (smaller than the cut-off) and min(pvalue_i) denotes the smallest p-value of partition i. We calculate the average separately for each of the three ontologies. Apart from the average p-value, we also calculate the variance of the p-values over the clusters and the number of clusters that have a p-value higher than the threshold of 0.05. The variance provides information about the distribution of p-values, while the latter quantifies the balance of the clusters obtained.
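The three summary statistics can be sketched as follows for a single ontology. The input format (one list of term p-values per cluster), the function name summarize_clustering, the use of the population variance, and the restriction of the variance to significant clusters are all assumptions made for this illustration; the text above does not fix these details.

```python
from statistics import mean, pvariance


def summarize_clustering(cluster_p_values, cutoff=0.05):
    """Summarize one ontology: Equation 3.6, variance, insignificant count."""
    # Best (smallest) p-value per cluster; a cluster without any enriched
    # term is treated as insignificant (assumption).
    best = [min(p, default=1.0) for p in cluster_p_values]
    significant = [p for p in best if p < cutoff]
    return {
        "avg_pvalue": mean(significant) if significant else None,
        "variance": pvariance(significant) if significant else None,
        "insignificant": len(best) - len(significant),
    }
```

The score ratios used in the experiments below can then be obtained by dividing each statistic for the pre-processed data by the same statistic for the original data.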

3.2.5 Experiments

The first experiment highlights the effectiveness of our validation scheme. The subsequent experiments examine the effectiveness of our proposed pre-processing techniques.

Effectiveness of the Validation Metric

We use the p-value obtained from the GO annotations to validate our clusters. To test this method of validation, we use the clusters obtained from the two clustering algorithms as well as clusters obtained by randomly partitioning the dataset. In order to make a fair comparison, we generate random clusters having the same cluster-size distribution as the ones from the two algorithms. Since this is not a test of our pre-processing method, we use the PPI dataset with the original set of interactions. We expect the two clustering methods to yield better clusters than a random assignment of proteins to clusters would. Hence, a lower p-value on the clusters obtained from the two algorithms would indicate that our validation scheme is capable of identifying better clusters.

We use two specific criteria for evaluation. The first criterion is the average of the p-values of the clusters obtained. Since we expect the clustering algorithms to work better than random assignments, smaller average p-values for the clusters from the two algorithms would suggest that our metric is capturing this difference. The second criterion is the number of insignificant clusters obtained, which gives a measure of the difference in biological relevance. Lower average p-values and lower insignificant cluster counts signify higher biological relevance.

Figures 3.2-a and 3.2-b present the comparison between kMETIS and random clustering. Figures 3.2-c and 3.2-d present the comparison between Hierarchical and random clustering. From these results, we observe that the average p-values of the hierarchical and kMETIS clustering schemes are significantly smaller than the average p-values of random clustering. From the number of insignificant clusters, we observe that both clustering algorithms identify many more biologically relevant clusters than the random grouping, as we expected. The results show that our validation metrics are appropriate, since they produce results that are consistent with our initial expectation: clustering outperforms random grouping in terms of the biological enrichment of the final groupings.

[Figure 3.2 panels: bar charts comparing Random (120) with Metis (120) in (a) and (b), and Random (500) with Hierarchical (500) in (c) and (d). Vertical axes: Avg p-value (e-05) in (a) and (c), Insignificant Cluster Count in (b) and (d). Series: process, function, component.]

Figure 3.2: Clustering Algorithms vs. Random Clustering (a) kMETIS vs. Random (Avg p-value) (b) kMETIS vs. Random (Insignificant Clusters) (c) Hierarchical vs. Random (Avg p-value) (d) Hierarchical vs. Random (Insignificant Clusters)
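The size-matched random baseline used in this experiment can be sketched as below: the proteins are shuffled and then cut into groups whose sizes equal those of a given clustering. The function name random_clustering_like and the seed parameter are hypothetical; this is an illustration of the idea and not the exact procedure used to produce Figure 3.2.

```python
import random


def random_clustering_like(clusters, seed=None):
    """Randomly regroup the proteins while preserving every cluster size."""
    rng = random.Random(seed)
    proteins = [p for c in clusters for p in c]
    rng.shuffle(proteins)
    shuffled, start = [], 0
    for c in clusters:
        # Cut the shuffled list into consecutive slices of the original sizes.
        shuffled.append(proteins[start:start + len(c)])
        start += len(c)
    return shuffled
```

Because only the assignment of proteins is randomized, any difference in the validation scores between the real and the random clustering cannot be explained by cluster sizes alone.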

Pre-processing using the Clustering Coefficient

In this experiment, we examine the effect of using the Clustering Coefficient for pre-processing. We eliminated the nodes with low Clustering Coefficient values in the line graph to improve the quality of the original dataset. We eliminated 30%, 40%, 50%, and 60% of the PPI interactions using this methodology. We then ran the two clustering algorithms on the original network and the reduced network separately. Figures 3.3, 3.4, and 3.5 present our experimental results for the two clustering algorithms. For validation, we use the average p-values, the variance of the p-values, and the number of insignificant clusters generated by each experiment. The p-value score is presented as the ratio between the average p-value of the clustering obtained on the pre-processed data and that of the clustering obtained on the original data. Therefore, values smaller than 1 indicate an improvement over the original case, and the smaller this value gets, the greater the improvement. The variance and insignificance scores are defined similarly. As before, smaller score values indicate a greater improvement due to pre-processing. The variance of the p-values reflects the fluctuation of quality across the identified clusters; smaller variance values indicate more stable clusters.

From these figures, we clearly observe the benefit of the Clustering Coefficient pre-processing. In most cases, both algorithms return better results on the pre-processed data in terms of all three evaluation criteria. Hierarchical works well even when the number of clusters is increased, as shown by the fact that most points lie below the line where the score value is 1. Interestingly, the pre-processing scheme seems to work better for the hierarchical clustering than for the kMETIS algorithm. This may be attributed to the hierarchical algorithm getting stuck at local optima when applied to the noisy original dataset; our pre-processing step might have helped to overcome this disadvantage. In the case of kMETIS, since the algorithm forces a balanced partitioning, it misses some significant unbalanced clusters.

Pre-processing with Centrality

The results of the two clustering algorithms are evaluated after pre-processing the data using the Betweenness and Closeness Centrality metrics, respectively. Figures 3.6a-c depict the results in terms of average p-value, variance, and insignificance scores for Betweenness Centrality. Similarly, results for Closeness Centrality are provided in Figures 3.7a-c. The results show a significant improvement for the Hierarchical algorithm after pre-processing when compared to kMETIS. The Hierarchical algorithm has better scores in terms of insignificance and p-value in all cases. When we increase the number of clusters to 700, there is still significant improvement for the Hierarchical algorithm when applied after the proposed pre-processing step. This indicates the robustness of the Hierarchical algorithm. The score values are also significantly smaller than in the Clustering Coefficient experiments. This suggests that the Closeness Centrality might be a better metric to use for our purposes.

[Figure 3.3 panels: P-value Score, Variance Score, and Insignificance Score plotted against Percent Removed (30% CC, 40% CC, 50% CC, 60% CC). Series: process, function, component.]

Figure 3.3: Pre-processed with the Clustering Coefficient metric and partitioned with the Hierarchical clustering algorithm (k=500). Evaluated with respect to average p-value, variance, and insignificance scores.

[Figure 3.4 panels: P-Value Score, Variance Score, and Insignificance Score plotted against Percent Removed (30% CC, 40% CC, 50% CC, 60% CC). Series: process, function, component.]

Figure 3.4: Pre-processed with the Clustering Coefficient metric and partitioned with the Hierarchical clustering algorithm (k=700). Evaluated with respect to average p-value, variance, and insignificance scores.

From the results in both of the above cases, it is evident that the pre-processed data provides a significant improvement over the original dataset.

Comparison with Random Elimination

Here, we perform an experiment to show that the improvement obtained from the pre-processing is due to its capability to eliminate possible false positive interactions, as opposed to merely reducing the number of interactions. To do this, we eliminate interactions randomly from the original network. We then apply the Hierarchical clustering algorithm (with k=500) to the randomly reduced data. For comparison, we use the datasets pre-processed using Clustering Coefficients with 30% and 40% of the interactions eliminated. In order to account for randomness, we average the values of the clustering metrics over 5 runs for the random scheme. Figure 3.8 presents the experimental results comparing the performance of the hierarchical algorithm on the pre-processed datasets and the randomly reduced datasets. We choose the hierarchical algorithm for this experiment since it is observed to be the more robust of the two for this task. To evaluate the two strategies, we use the number of insignificant clusters, since it provides an indication of how good the clustering is at a global scale. The benefit of our proposed methods over the random scheme can be clearly seen from the data. As can be seen from the figure, the hierarchical clustering algorithm works much worse on the randomly reduced data. The data with randomly eliminated interactions results in a skewed clustering arrangement with a large cluster of size 4000 and a few small clusters.

[Figure 3.5 panels: P-value Score, Variance Score, and Insignificance Score plotted against Percent Removed (30% CC, 40% CC, 50% CC, 60% CC). Series: process, function, component.]

Figure 3.5: Pre-processed with the Clustering Coefficient metric and partitioned with the kMETIS algorithm (k=120). Evaluated with respect to average p-value, variance, and insignificance scores.

[Figure 3.6 panels: P-Value Score, Variance Score, and Insignificance Score for Metis, Hierarchical-500, and Hierarchical-700. Series: process, function, component.]

Figure 3.6: Pre-processed with the Betweenness Centrality metric and evaluated with respect to average p-value, variance, and insignificance scores.

[Figure 3.7 panels: Insignificance Score, Variance Score, and P-Value Score for Metis, Hierarchical-500, and Hierarchical-700. Series: process, function, component.]

Figure 3.7: Pre-processed with the Closeness Centrality metric and evaluated with respect to average p-value, variance, and insignificance scores.

[Figure 3.8 panel: Insignificant Clusters (out of 500) for 30% CC, 30% random, 40% CC, and 40% random. Series: process, function, component.]

Figure 3.8: The comparison of our edge removal algorithm with random edge elimination.

3.2.6 Discussion

Our results show that we are able to quantify the biological meaning of a cluster in terms of the three ontologies defined by the GO Consortium. We believe that the proposed validation methodology will be helpful for interpreting the results of a clustering algorithm applied to PPI networks. Moreover, this validation method might also be effective in deciding which final groups to focus on when mining for novel biological findings. As an example, a high-scoring partition of the dataset cleaned using Closeness Centrality has a small p-value of 8.93e-37 for the BP ontology. We analyze this partition further to indicate how informative our partitioning can be. 19 proteins (UTP15, DIP2, IMP4, PWP2, UTP8, UTP4, NAN1, EMG1, RRP9, UTP10, UTP7, MPP10, UTP6, ENP1, NOP14, UTP9, NOP58, UTP13, IMP3) out of 41 in this cluster are annotated with the term ‘Processing of 20S Pre-rRNA’ (GO:0030490), whereas only 32 proteins are associated with this term in the whole genome of 7000 proteins. Similarly, in the same group, 17 proteins (UTP10, UTP7, UTP15, DIP2, MPP10, IMP4, PWP2, UTP8, UTP6, UTP4, NAN1, NOP14, UTP9, NOP58, IMP3, UTP13, RRP9) are associated with ‘Small Nucleolar Ribonucleoprotein Complex’ (GO:0005732), whereas only 30 proteins are associated with this cellular component in the whole genome. Based on the very small p-values of these annotations, we can hypothesize that all proteins in this cluster may belong to the same protein complex, namely the ‘Small Nucleolar Ribonucleoprotein Complex’. The proteins that are not yet annotated with this GO term might have an undetected functionality.

GO-Term                                            Cluster Fr.  Genome Fr.   p-value
P - mRNA splicing                                  48 of 69     80 of 7000   2.13e-84
P - mRNA splicing                                  37 of 49     80 of 7000   6.02e-66
P - proteolysis and peptidolysis                   27 of 31     113 of 7000  4.42e-46
P - vacuolar acidification                         14 of 33     19 of 7000   1.22e-30
P - processing of 20S pre-rRNA                     19 of 41     32 of 7000   8.93e-37
F - pre-mRNA splicing factor activity              29 of 69     45 of 7000   4.08e-50
F - pre-mRNA splicing factor activity              22 of 49     45 of 7000   5.58e-38
F - proteasome endopeptidase activity              26 of 31     34 of 7000   1.38e-61
F - structural constituent of ribosome             32 of 50     226 of 7000  2.29e-36
F - snoRNA binding                                 15 of 41     23 of 7000   8.45e-30
F - RNA polymerase II transcription mediator act.  17 of 42     20 of 7000   4.48e-37
C - small nuclear ribo-nucleoprotein complex       32 of 38     36 of 7000   7.08e-67
C - small nuclear ribo-nucleoprotein complex       38 of 49     65 of 7000   1.80e-73
C - proteasome complex                             28 of 31     36 of 7000   9.48e-68
C - organellar large ribosomal subunit             29 of 50     47 of 7000   8.49e-55
C - hydrogen-translocating V-type ATPase comp.     12 of 33     16 of 7000   2.23e-26
C - small nucleolar ribo-nucleoprotein complex     17 of 41     30 of 7000   2.71e-32

Table 3.1: The GO-Term column gives the biological association of the proteins in each cluster, prefixed by the ontology to which the term belongs: P, F, and C stand for biological process, molecular function, and cellular component, respectively. The Cluster Frequency (Cluster Fr.) column gives the fraction of proteins in the given cluster that are annotated with the specified ontology term, whereas the Genome Frequency (Genome Fr.) column gives the corresponding fraction for the whole genome.

Another high-scoring cluster is obtained after removing 40 percent of the data using the Clustering Coefficient metric. This cluster contains 51 proteins. 32 of these proteins are annotated with the biological process ‘Protein Biosynthesis’ (GO:0006412) with p-value 1.30e-19. Similarly, 32 of them (IMG1, MRPL25, MRPL15, YDR116C, MRPL13, MRP49, MRPL27, MRPL38, MRPL35, MRPL19, MRPL9, YPR100W, MRP20, IMG2, MRPL10, MRPL7, RPL19B, YPL183W-A, MRPL3, MRPL16, MRPL20, MRPL24, MRPL28, MRPL17, MRPL39, YML025C, MRPL36, MRP7, MRPL8, MRPL44, MRPL6, MRPL23) are annotated with the same molecular function, ‘Structural Constituent of Ribosome’, with p-value 2.29e-36. This result might be helpful in inferring that the remaining proteins, such as CKA1, SLD2, NSP1, LOS1, HOT1, and TRL1, have a high chance of being involved in this molecular function.

We are able to obtain clusters with extremely small p-values with hierarchical clustering. As an example, after the Closeness Centrality reduction, we applied hierarchical clustering with the number of clusters set to 500. One of the resulting clusters has a p-value score of 2.13e-84 for the BP ontology. 48 of the 69 proteins in this cluster are annotated with ‘mRNA Splicing’. This process is only associated with 80 proteins in the whole genome. The same cluster has p-value scores of 7.08e-67 and 4.08e-50 for the CC and the MF ontologies, respectively. Our pre-processing method thus improves the biological value of the final clusters: we are able to obtain clusters that are enriched in proteins with the same GO annotations.

A cluster obtained by the hierarchical clustering algorithm was composed of the following proteins: PRE5, PRE1, UBP6, RPN8, RPT4, PRE9, ECM29, PRE7, RPT2, RPN9, RPT3, YGL004C, PRE6, RPT1, PRE4, PUP3, NAS6, RPN7, RPN10, RPN12, RPN11, RPN6, SCL1, PRE2, RPT5, RPN3, RPN4, RPN5, RPN13, PRE3, PHO4, RPT6. 28 of these proteins are annotated with the biological process ‘Proteolysis and Peptidolysis’. This suggests that the remaining three proteins might have an unrevealed task in the same process. Also, 27 of them are annotated with the molecular function ‘Proteasome Endopeptidase Activity’, whereas only 34 proteins in the whole genome have this molecular function annotation. Proteins in this cluster also share the same CC annotations: the cluster includes almost all the proteins in the proteasome complex. Our experimental results indicate that the pre-processing methods are more effective with the hierarchical clustering. We provide details of some of the clusters we obtained in Table 3.1. The interactions between the proteins of the above two clusters (obtained by the hierarchical clustering) are depicted in Figure 3.9.

[Figure 3.9 panels: two network drawings produced with Pajek, one per example cluster, with nodes labeled by protein name.]

Figure 3.9: Example clusters

In this section, we have proposed novel pre-processing strategies to eliminate redundant and potentially false interactions. We have demonstrated the effectiveness of this technique through detailed experiments on the extraction of biologically relevant clusters from PPI datasets. Our results indicate that our pre-processing strategies improve the quality of the clusters obtained using standard clustering algorithms. Our comparative results for the two algorithms indicate that our strategies provide an improvement regardless of the clustering algorithm applied.

3.3 Network Refinement Algorithm based on Hub Duplications

As we have previously discussed, one of the primary properties of the PPI graph that is detrimental to traditional graph mining is its uneven degree distribution [146]. This implies that most proteins in the graph participate in a small number of interactions, while a few proteins, known as hubs, are involved in a large number of interactions. Due to the presence of hub nodes, the topology typically consists of a giant central core containing a significant number of proteins and their interactions. The rest of the proteins are either completely disconnected or part of small disconnected groups. Thus, the tendency of the hubs to interact with a high fraction of proteins makes isolation of the modules hidden inside the central core all but impossible [170].

Another challenge in clustering PPI graphs is the need to assign proteins to different groups (soft clustering) based on their functions. Hub proteins typically have multiple functions and are likely to be essential for the organism. Traditional algorithms fail to distinguish these essential proteins and cannot assign them to multiple groups. Recently, Abou-Rjeili and Karypis [2] presented several multi-level graph partitioning algorithms to address the difficulty of partitioning scale-free graphs. Although the proposed algorithms result in better groupings, they still do not perform soft clustering.

In order to address these two issues, in this part of our thesis, we describe a key refinement of the PPI graph, motivated by the topological and biological importance of the hub proteins [80]. Our aim is to target the neighborhood of these potentially multi-faceted proteins and isolate, for each of their functions, the corresponding densely connected regions.

Our approach consists of two stages. In the first stage, we refine the PPI graph to improve its functional modularity, using hub-induced subgraphs. We employ the edge betweenness measure [109] to identify dense regions within the local neighborhoods of hubs. These dense regions are later used to determine how the original network will be refined. In the second stage, we cluster the refined graph using traditional algorithms. As in our previous work, the end goal is to isolate components with a high degree of overlap with known functional modules. An additional advantage of the refinement process is its ability to perform soft clustering of hub proteins. Although hub nodes have been studied as a means to break down a scale-free network into disconnected components [3, 34], to the best of our knowledge, we are the first to suggest duplicating hubs to improve the modular decomposition of biological networks in the presence of hubs. Although we focus on PPI networks in this work, our refinement technique is applicable to other networks that exhibit similar topological properties.

As we detail in Chapter 2, there have been some attempts to extract dense regions to isolate protein complexes from PPI graphs [12, 97] using concepts such as k-cores or cliques. Although dense regions of the PPI graph are highly associated with known functional modules, they are by themselves not entirely informative in terms of function prediction. We expect that mining the entire PPI graph will prove to be a superior source for novel discovery of protein functions. Hence, we aim to improve the modularity of a PPI graph as a whole, which enables enhanced functional prediction and identification for every protein of the graph. The proposed refinement technique is evaluated on the PPI graph of Saccharomyces cerevisiae that is described in Section 3.1. In order to quantify the quality of our clustering in terms of overlap with known biological annotations, we again employ the Gene Ontology (GO) Consortium database. We find that the clusters obtained after our refinement strategy match very well with known biological annotations. In addition, we obtain groupings after refinement that could not be obtained from the original graph. Our technique also allows soft clustering of multi-functional hub proteins. We find that each of these clusters includes proteins sharing a certain function with the multi-functional protein.

3.3.1 Evolutionary Implications

Recently, several groups [15, 33, 155] have suggested mathematical models to explain the evolutionary growth of protein-protein interaction graphs. They claim that preferential attachment is one of the main causes of the scale-free topology of interaction graphs. According to the duplication-divergence model [155], there is a linear relation between the degree of a node and the probability of a new node attaching to that node, known as the ‘preferential attachment’ principle. Since hubs have very high degrees, new proteins added to the graph are more likely to interact with hubs than with other nodes. Hence, if a hub belongs to a functional module, most of the other proteins in that module will prefer to connect to the hub rather than to a node with the same function but a lower degree. This suggests that proteins with the same function interact among themselves and also individually with at least one hub. For this reason, we believe that it is important to consider the neighborhoods of hubs to isolate functional modules.

3.3.2 Hub-induced Subgraphs

It has been shown that hubs typically tend to be essential proteins [80], having several important functions inside the cell. Hubs can, therefore, be assigned to several functional modules. However, most of their interactions do not imply a functional similarity. In this work, we aim to identify the neighbors of hubs that share functionalities with the hub protein. Hence, our goal is to identify all dense components that lie within the neighborhood of a hub. Once such a component is identified, the neighboring hub is duplicated and all of its interactions with the members of the component are reassigned to the duplicate. In addition, all the duplicates are linked to the original hub to preserve the original interactions of the proteins belonging to the isolated dense component. Note that we are not eliminating any interactions. We are merely re-assigning the interactions between the hub and the proteins belonging to a dense component to the duplicate.
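The re-assignment described above can be sketched on an adjacency-set representation of the graph. The function name duplicate_hub, the copy_name argument, and the in-place update are illustrative assumptions, not the DuplicateHub routine of our implementation.

```python
def duplicate_hub(adj, hub, component, copy_name):
    """Move the hub's edges into `component` onto a duplicate of the hub.

    adj maps each protein to the set of its interaction partners and is
    modified in place. The duplicate is linked back to the original hub.
    """
    members = set(component) & adj[hub]
    adj[copy_name] = set()
    for p in members:
        # Re-assign the interaction (hub, p) to (duplicate, p).
        adj[hub].discard(p)
        adj[p].discard(hub)
        adj[copy_name].add(p)
        adj[p].add(copy_name)
    # Link the duplicate to the original hub.
    adj[copy_name].add(hub)
    adj[hub].add(copy_name)
    return adj
```

In this sketch no interaction is dropped: every edge between the hub and a member of the dense component reappears as an edge between the duplicate and that member, and one additional edge connects the duplicate to the hub.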

(a) Example graph

(b) Graph after hub duplications

Figure 3.10: Illustration of the benefits of hub duplications on a toy example. The modularity of the original graph (a) improves after hub duplications (b).

If the proteins of a functional module are divided across the neighborhoods of several hubs, each of those hubs will be duplicated once and its duplicate will be included in the functional module. This isolates the functional module from the unrelated neighbors of the hubs and creates a tightly knit group. An example can be seen in Figure 3.10. We duplicate each hub into several new nodes, one for each dense component in the hub’s neighborhood. In order to identify these dense components, we introduce the notion of a hub-induced subgraph.

Definition 1: Let G = (V, E) be a graph. G′ = (V′, E′) is a vertex-induced subgraph of G if V′ ⊆ V and E′ consists of all the edges of G having both endpoints in V′.

Definition 2: A hub-induced subgraph of G is a vertex-induced subgraph G″ = (V″, E″) of G, where V″ corresponds to a hub’s adjacency list.

Thus, for every hub of the graph, there exists a corresponding hub-induced subgraph obtained from the adjacency list of the hub. We isolate these hub-induced subgraphs to identify potential functional modules.
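Definitions 1 and 2 translate directly into a short Python sketch. The degree threshold min_degree used to decide which proteins count as hubs is a hypothetical parameter introduced for this illustration, as are the function names.

```python
def hubs(adj, min_degree):
    """Proteins whose degree is at least min_degree (illustrative rule)."""
    return [v for v in adj if len(adj[v]) >= min_degree]


def hub_induced_subgraph(adj, hub):
    """Vertex-induced subgraph on the hub's adjacency list (Definition 2).

    The hub itself is not a vertex of the subgraph; only edges with both
    endpoints among the hub's neighbours are kept (Definition 1).
    """
    nodes = set(adj[hub])
    return {v: adj[v] & nodes for v in nodes}
```

For a hub H with neighbours A, B, and C, where only A and B interact with each other, the hub-induced subgraph contains the three vertices A, B, and C and the single edge between A and B.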

3.3.3 Hub Duplication

To obtain information about the neighborhoods of hubs, we use the edge betweenness measure, which was first introduced by Newman and Girvan [109]. This measure favors edges between communities and disfavors ones within communities. Newman and Girvan [109] introduced three different edge betweenness measures: Shortest-path, Random-walk, and Current-flow based edge betweenness. We use the Shortest-path betweenness measure, which counts the number of shortest paths between all pairs of nodes that run along each edge. The original algorithm of Newman and Girvan was designed to hierarchically split the graph into communities by eliminating edges with high edge betweenness values. More formally, given a graph G(V, E) and a known number of partitions k, the algorithm identifies k groups such that the intra-group connections are dense and the inter-group connections are sparse, by repeatedly removing edges with high betweenness values.

Our goal is to detect dense regions inside each hub-induced subgraph. We implement the algorithm without the k parameter and include the Clustering Coefficient of the subgraphs as the stopping criteria. As we already described in Section 3.2.1, Clustering Coefficient [158] is a measure that represents the interconnectivity of a vertex’s neighbors. More formally, the Clustering

Coefficient of a vertex v with degree kv can be defined as follows:

2nv CC(v) = (3.7) kv(kv − 1)

where $n_v$ denotes the number of triangles that go through node v. The Clustering Coefficient of a graph is the mean over the coefficients of all vertices in it and lies between 0 and 1. Tightly knit groups are associated with high Clustering Coefficients.

Although the Shortest-path betweenness algorithm is computationally costly ($O(E^2 V)$ running time), it is tractable for our purpose because the hub-induced subgraphs are small (< 284 nodes). The pseudo-code of our refinement algorithm with Shortest-path betweenness and Clustering Coefficient measures is given in Algorithm 1. Here, most-between-edge($G_i$) returns the edge with the highest Shortest-path betweenness score in the subgraph $G_i$. $T_{size}$ represents the minimum size (number of nodes) of the dense components we will consider and $T_{cc}$ represents the Clustering Coefficient threshold. When a component (of size >= $T_{size}$) is dense enough, the algorithm calls the DuplicateHub function to duplicate the corresponding hub and re-assign its interactions with the members of the dense component to the duplicate. A duplication event takes place for each dense component identified from a hub-induced subgraph. Note that Betweenness scores are recalculated whenever an edge is removed from the graph, to capture the topology of the remaining graph.
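Definition 2 and the DuplicateHub step can be illustrated with a minimal Python sketch. It assumes the graph is stored as a dictionary mapping each node to a set of neighbours, that the hub itself is left out of its hub-induced subgraph, and that the duplicate receives a caller-chosen name. The function names `hub_induced_subgraph` and `duplicate_hub` are illustrative and are not taken from the thesis implementation.

```python
def hub_induced_subgraph(adj, hub):
    """Subgraph induced by the hub's adjacency list (hub excluded: an assumption)."""
    members = adj[hub]
    return {v: adj[v] & members for v in members}


def duplicate_hub(adj, hub, component, copy_name):
    """Re-assign the hub's edges into `component` to a new node `copy_name`.

    Mutates `adj` in place; `copy_name` is a hypothetical label for the duplicate.
    """
    moved = adj[hub] & set(component)
    adj[copy_name] = set(moved)
    for v in moved:
        adj[v].discard(hub)
        adj[v].add(copy_name)
    adj[hub] -= moved


# Toy graph: hub "h" touches a triangle (a, b, c) and an unrelated node d.
ppi = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b", "c"},
    "b": {"h", "a", "c"},
    "c": {"h", "a", "b"},
    "d": {"h"},
}
sub = hub_induced_subgraph(ppi, "h")
duplicate_hub(ppi, "h", {"a", "b", "c"}, "h_dup")
```

After the call, the original hub keeps only its edge to `d`, while the duplicate carries the three edges into the dense triangle, so a hard clustering algorithm can place the two copies in different clusters.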

3.3.4 Clustering

Once the PPI graph is refined using hub-induced subgraphs, the resulting graph is clustered to separate out the functional modules. We used two graph clustering algorithms: a single-level Spectral algorithm and kMETIS [84], a multi-level partitioning algorithm. For more details on these algorithms, please refer to the Related Work chapter.

Algorithm 1 Identify-Dense-Regions(G_i)
INPUT G_i = (V_i, E_i): hub-induced subgraph of Hub_i
if size(G_i) < T_size then
    Return
else if CC(G_i) ≥ T_cc then
    DuplicateHub(Hub_i, G_i)
else
    e = most-between-edge(G_i)
    G_i ← G_i − e (remove e from G_i)
    recalculate Edge betweenness values
    if G_i is partitioned into G_i^1 and G_i^2 then
        Identify-Dense-Regions(G_i^1)
        Identify-Dense-Regions(G_i^2)
    else
        Identify-Dense-Regions(G_i)
    end if
end if
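The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the thesis implementation: it assumes an undirected graph without self-loops stored as a dictionary of neighbour sets, replaces the recursion with an explicit work list, computes shortest-path edge betweenness with a Brandes-style breadth-first accumulation, and returns the dense node sets instead of calling DuplicateHub. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def mean_cc(adj):
    """Mean of CC(v) = 2 n_v / (k_v (k_v - 1)); vertices with k_v < 2 count as 0."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            tri = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2.0 * tri / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def edge_betweenness(adj):
    """Shortest-path edge betweenness (BFS from every node, Brandes-style).

    Every unordered node pair is visited from both ends, so scores are twice
    the usual value; the ranking of edges is unaffected.
    """
    score = {frozenset((u, w)): 0.0 for u in adj for w in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start not in seen:
            comp, stack = {start}, [start]
            while stack:
                for w in adj[stack.pop()]:
                    if w not in comp:
                        comp.add(w)
                        stack.append(w)
            seen |= comp
            parts.append(comp)
    return parts


def identify_dense_regions(adj, t_size, t_cc):
    """Iterative form of Algorithm 1; returns node sets that would go to DuplicateHub."""
    dense, work = [], [{v: set(ns) for v, ns in adj.items()}]
    while work:
        g = work.pop()
        if len(g) < t_size:
            continue
        if mean_cc(g) >= t_cc:
            dense.append(set(g))
            continue
        scores = edge_betweenness(g)
        if not scores:
            continue  # no edges left to remove
        u, w = max(scores, key=scores.get)
        g[u].discard(w)
        g[w].discard(u)
        work.extend({v: g[v] & comp for v in comp} for comp in components(g))
    return dense


# Two triangles joined by the bridge c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
regions = identify_dense_regions(toy, t_size=3, t_cc=0.9)
```

On the toy graph the bridge has the highest betweenness, so it is removed first and the two triangles are reported as dense regions. The work list avoids deep recursion when many edges must be removed before a subgraph splits.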

3.3.5 Validation Measures

Topological Measure

To evaluate our clusters, we use a topology-based modularity metric proposed by Newman [109]. This metric considers a k × k symmetric matrix of clusters where each element $a_{ij}$ represents the fraction of edges that link nodes between clusters i and j and each $a_{ii}$ represents the fraction of edges linking vertices within cluster i. The modularity measure is given by

$$M = \sum_{i} \left( a_{ii} - \Big( \sum_{j} a_{ij} \Big)^2 \right) \tag{3.8}$$
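As a worked illustration of Equation 3.8, the sketch below builds the cluster matrix from an edge list and a cluster assignment. One detail is an assumption: following the usual Newman and Girvan convention, an edge between clusters i and j is split evenly between $a_{ij}$ and $a_{ji}$ so that the matrix sums to 1. The text above does not spell this out. The function name `modularity` and the toy data are illustrative.

```python
def modularity(edges, cluster_of):
    """M = sum_i (a_ii - (sum_j a_ij)^2) for an undirected edge list."""
    m = len(edges)
    if m == 0:
        return 0.0
    labels = sorted(set(cluster_of.values()))
    index = {c: n for n, c in enumerate(labels)}
    k = len(labels)
    a = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        i, j = index[cluster_of[u]], index[cluster_of[v]]
        if i == j:
            a[i][i] += 1.0 / m
        else:
            # Assumption: an inter-cluster edge is shared by a_ij and a_ji.
            a[i][j] += 0.5 / m
            a[j][i] += 0.5 / m
    return sum(a[i][i] - sum(a[i]) ** 2 for i in range(k))


# Two triangles joined by one bridge, clustered into the two triangles.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {v: 0 for v in two_clusters}
m_split = modularity(edges, two_clusters)
m_whole = modularity(edges, one_cluster)
```

With the two-triangle split, each diagonal entry is 3/7 and each row sums to 1/2, giving M = 2(3/7 − 1/4) = 5/14. Placing all nodes in one cluster gives M = 0.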

Statistical Measure based on Domain Information

To test whether the clusters obtained correspond to known functional modules, we need to validate our dense components using known biological associations. We use the Gene Ontology Consortium Online Database [8] to look for biological relations between proteins assigned to the same cluster. Similar to our previous analysis, we used all three annotations for validation and comparison, which is also in accordance with earlier work [7]. As of May 2005, the GO database contained 7000 genes annotated with 1644 cellular component, 7502 molecular function and 9706 biological process terms.

We calculated p-values for each cluster obtained in each of the 3 ontologies using the GO Term Finder tool [18]. The p-value calculations are based on the hypergeometric distribution assumption. We used the recommended cut-off of 0.05 for all our validations. Details of this calculation are discussed in Section 3.1.4. As the p-value of a single cluster is statistically not representative, we define a Clustering score function in order to quantify the overall clustering. We defined this score as follows.

$$\text{Clustering score} = \frac{\sum_{i=1}^{n_S} \min(p_i) + (n_I \times \text{cutoff})}{n_S + n_I} \tag{3.9}$$

where $n_S$ and $n_I$ denote the numbers of significant and insignificant clusters, respectively, cutoff stands for the alpha level (0.05), and $\min(p_i)$ denotes the smallest p-value of the significant cluster i. Hence, each cluster is associated with one p-value for each of the three ontologies.
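Equation 3.9 can be checked on a small example with the sketch below. It assumes that each cluster has already been reduced to its smallest p-value for one ontology and that a cluster counts as significant when that value is strictly below the cutoff. The function name and the p-values are made up for illustration.

```python
def clustering_score(min_p_values, cutoff=0.05):
    """Clustering score of Equation 3.9 for one ontology.

    min_p_values: the smallest p-value of each cluster (one entry per cluster).
    """
    if not min_p_values:
        raise ValueError("at least one cluster is required")
    significant = [p for p in min_p_values if p < cutoff]
    n_s = len(significant)
    n_i = len(min_p_values) - n_s
    return (sum(significant) + n_i * cutoff) / (n_s + n_i)


# Two significant and two insignificant clusters (hypothetical p-values).
score = clustering_score([1e-8, 0.01, 0.2, 0.6])
# Every cluster insignificant: the score equals the cutoff itself.
worst = clustering_score([0.3, 0.9])
```

The score is bounded above by the cutoff, which is reached when no cluster is significant, and it moves toward 0 as more clusters become strongly enriched. This is why lower values indicate better clusterings.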

3.3.6 Experiments

In this section, we discuss our experimental results. First we validate the clusters obtained using topology-based modularity, following which we provide biological validation for the clusters.

Topology-based Modularity: We use the modularity metric on the clusters obtained using both the kMETIS and Spectral algorithms.

Figure 3.11: Modularity scores before (Original) and after (CC: 0.3, CC: 0.4, etc.) refinement. The plot shows Modularity (y axis) for the Metis and Spectral algorithms across the settings Original and CC: 0.3 through CC: 0.8 (x axis).

Figure 3.11 shows the modularity comparison between the original graph and refined graphs for the two algorithms. We find that the refinement improves the modularity of the graph for both algorithms. From the curves, we find that the modularity scores peak at Clustering Coefficient values between 0.4 and 0.6 in both cases. Further, the modularity for clustering the original graph is much lower than for any of the refined graphs. kMETIS produces clusters with higher modularity (up to 45% better than the original) than Spectral (up to 8% better than the original) for the refined graphs.

Biological Modularity: We test the effectiveness of our refinement technique by comparing with the clusters obtained from the original graph. The DIP dataset consists of 15147 interactions among 4741 proteins. By analyzing the degree distribution of this network (as shown in Figure 3.1), we defined all nodes of degree greater than 25 (2% of all nodes) to be hubs. We ran the algorithm to find all dense components within every hub-induced subgraph using the Clustering Coefficient and size as stopping criteria. We chose 6 as the size threshold for dense components, since components with fewer than 6 nodes are likely to be insignificant. To choose a suitable threshold for the Clustering Coefficient, there are two things that we should consider. First, we want the resulting components to be dense enough to correspond to a functional module, so we ideally need to consider components whose Clustering Coefficient is greater than 0.5 (i.e., half of all possible triangles are formed). On the other hand, it is well known that PPI datasets are prone to high false negative rates. Hence, we cannot expect perfect cliques (i.e., all possible triangles are formed) among the hub-induced subgraphs. We therefore vary the Clustering Coefficient parameter ($T_{cc}$) between 0.3 and 0.8 and obtain a refined graph for each value. We believe that, considering the incompleteness of PPI graphs and the need for obtaining dense components, a reasonable Clustering Coefficient value would be around 0.5-0.6. Components that have Clustering Coefficients within this range are likely to be dense enough to be considered as functional groups and would not be affected too much by the incomplete nature of the dataset. The refined graphs and the original graph are clustered by the kMETIS and Spectral clustering algorithms separately. The results obtained are depicted in Figure 3.12. As can be seen from this figure,

Clustering scores decrease (improve) after refinement for both algorithms. Our hypothesis above is supported by the fact that, although an improvement is observed for every Clustering Coefficient threshold, this improvement is small for low and high values. Also, both algorithms have their smallest Clustering scores for the threshold values of 0.5, 0.6 and 0.7. If we consider the improvement in this Clustering Coefficient range, our refinement technique improves Clustering scores by up to 52%, 48% and 28% for the MF, BP and CC ontologies in the case of kMETIS and by up to 30%, 21% and 38% for the same three ontologies for the Spectral algorithm. This confirms that kMETIS produces better clusters than the Spectral algorithm. Note that our Clustering score considers both significant and insignificant clusters.

Next, we evaluate the significance of our clustering results for all three ontologies. In Figure 3.13(a-c) we show the p-value distribution of significant clusters in both original and refined graphs. For all ontologies, we find that the refined graph can be clustered into more biologically meaningful groups. For example, the best cluster we obtained on the original graph had a p-value of 1.2089e-32 for Cellular Component, whereas the best cluster after refinement had a p-value of 5.4658e-49 for the same ontology. We obtained similar results for the other two ontologies. In addition, we are able to identify more significant clusters after the refinement.

Figure 3.12: Clustering scores improve after hub duplications using two different algorithms: (a) kMETIS and (b) Spectral graph partitioning. Each panel plots the Clustering Score (y axis) for the Process, Function, and Component ontologies across the settings Original and CC: 0.3 through CC: 0.8 (x axis).

3.3.7 Discussion

In this section, we have proposed a refinement technique to improve the modular decomposition of PPI graphs. We refined the PPI graph based on Shortest-path betweenness and Clustering Coefficient measures. From our experimental results, we found that duplicating the hubs of a scale-free PPI graph improves the modularity of the graph. Thus, we are able to obtain topologically and biologically more significant clusters even using traditional clustering algorithms. A detailed examination of the obtained clusters revealed that the proposed method has three major benefits:

Figure 3.13: P-value distribution of significant clusters before and after the refinement for (a) Biological Process, (b) Molecular Function, and (c) Cellular Component. The y axis represents the log(p-value) for each corresponding cluster and the x axis the Significant Cluster Rank; the two series are Metis and Metis+Refinement.

• Enhancement of available functional groupings: We obtain larger groups of proteins that are annotated with the same GO term from our refined graph than from the original graph.

• Isolation of new functional groupings: We find groupings of proteins that could not be obtained from the original graph.

• Soft-clustering: Our approach can identify multi-functional hub proteins and group them into modules corresponding to each of their functions.

We now provide some illustrations from our results for each of these cases. KAP95 (karyopherin beta), an essential protein, is known to take part in ‘nucleocytoplasmic transport’. Specifically, it participates in a complex mediating nuclear import via a nuclear localization signal (NLS). It interacts with nucleoporins to guide transport across the nuclear pore complex [60]. When we cluster the original DIP dataset, this protein is correctly grouped with 8 proteins (NTF2, MLP2, SSA1, ASM4, YRB1, YNL253W, RNA1, NDC1) that are also annotated with the ‘nucleocytoplasmic transport’ term, with p-value 1.14e-08.

Using our refinement technique, KAP95 is duplicated once. The hub and its duplicate appear in two separate clusters when we use the kMETIS algorithm. In one, KAP95 is grouped with proteins (NTF2, SSA1, YRB1, RNA1, GSP1, SRM1, MTR10, KAP122, KAP142, KAP124, NUP157, NUP2, NUP1, NUP60, NUP82, NUP170, NUP145, NUP42) sharing the same biological process (‘nucleocytoplasmic transport’) with p-value 1.07e-27.

The major difference between this group and the one from the original graph is the inclusion of NUPs (Nucleoporins, 8 proteins) and KAPs (Karyopherins, 3 proteins). Transport through the nuclear pore complex is facilitated by transient interactions between the KAPs and the nuclear pore complex proteins (NUPs) [117]. Thus, locating NUPs and KAPs together is a noticeable benefit of our refinement. Clearly, our approach groups together more proteins that belong to the same functional module. This suggests that hub duplications make isolation of modules easier. These clusters are also valuable for predicting the functions of unknown proteins. In the above group, four proteins (YKL061W, YKR064W, YNL122C, YER004W) do not have a known function. Among these four, YKL061W is predicted by Brun et al. [22] to take part in the ‘nucleus-cytoplasm transport’ process, which is in accordance with our findings. Since two different datasets and approaches are used to infer the same conclusion about protein YKL061W, the overlap is noteworthy. This also suggests that the other three proteins might have an unrevealed task in the ‘nucleocytoplasmic transport’ biological process.

In addition to enhancing clusters, our method is able to assign hub proteins which were originally in insignificant clusters into significant clusters. To illustrate this, we consider the hub protein LSM8. The LSM (Sm-like) proteins interact with each other and with U6

snRNA complex and influence pre-mRNA splicing [105]. In the original dataset, this protein is assigned to a cluster which does not have any significant annotations. However, after the refinement, this protein is placed in a cluster which has a biological process annotation with p-value 1.2e-12. In addition to LSM8, ten other proteins in this group are associated with ‘mRNA splicing’. LSM8 is located with the members of its complex (other SM-like proteins) as well as the components of the U6 snRNP complex (PRP proteins). This example shows that our technique not only improves functional modules which can be identified from the original dataset, but also allows detection of functional modules which cannot be discovered from the original dataset.

Another advantage of our refinement technique is its ability to perform soft clustering on certain hub proteins. This feature is important since these proteins are known to take part in multiple unrelated functional modules. CKA1 is one of these multi-faceted proteins and is involved in several cellular events. It is known to function in the maintenance of cell morphology and polarity, and to regulate the actin and tubulin cytoskeletons [26]. When the original dataset was clustered, CKA1 and seven other proteins annotated with the ‘transcription, DNA-dependent’ term were located in the same cluster (p-value 3.47e-05). On the other hand, our algorithm duplicates CKA1 twice, so there exist three nodes corresponding to this protein. When we cluster, these 3 nodes are assigned to different clusters, resulting in three different groupings for protein CKA1. All three correspond to different functional modules of the CKA1 protein. One of these clusters is an enhancement of the ‘transcription, DNA-dependent’ functional module (very low p-value of 2.3e-19). The second cluster in which CKA1 is located includes proteins which are annotated with the biological process term ‘protein amino acid phosphorylation’ with p-value 1.2e-05. CKA1 is itself annotated with the same term. The third cluster contains 21 proteins and CKA1, all of which are annotated for ‘ organization and biogenesis’ (with p-value 3.2e-12). Thus, we found that our technique not only improved the obtainable clusters (by decreasing the p-value from 3.47e-05 to 2.3e-19), but also grouped CKA1 with proteins that share its different functions. Altogether, these examples indicate the effectiveness of our approach for isolating functional modules from PPI graphs.

In this chapter, we investigated two different ideas that tackle two major problems associated with knowledge discovery from PPI networks. Our detailed experimental analysis showed that both techniques, network cleaning and network refinement, are effective in improving the quality of communities that can be derived from these networks.

CHAPTER 4: Gene Co-expression Networks

Gene expression datasets provide vital information that can be used to gain insight into diverse biological questions. Novel strategies are required to analyze the growing archives of microarray data and to extract useful information from them. One particular area of interest is the identification of gene groups that have similar expression patterns over various samples, known as co-expressed genes. There has been a growing interest in representing co-expressed genes as an interaction network to explore the system-level functionality of genes [27, 153, 171]. In a co-expression network, nodes represent genes and two nodes are linked if the corresponding genes are significantly co-expressed (correlated) across the samples. The study of co-expression networks is particularly important for characterizing the functionality of unknown genes and revealing biological mechanisms [140]. It has been predicted that genes behaving similarly over changing conditions are part of the same functional module and correspond to densely linked structures in gene interaction networks [46]. Therefore, an effective identification of such functional modules is essential. However, the analysis of gene expression data is fraught with challenges due to the significant levels of noise produced by microarray technology. In such data, a major challenge is to isolate the biological signal from the experimental noise. Therefore, we aim to investigate alternative techniques that mitigate the effect of noise in order to improve the quality of inferences that can be drawn from gene expression datasets. The purpose of our analysis in this section is two-fold: (i) developing techniques that mitigate the effect of noise inherent in microarray datasets in order to improve the quality of resulting gene interactions and gene interaction networks, and (ii) investigating novel ways of utilizing gene interaction networks and their substructures for knowledge discovery.

To accomplish these tasks, we first explore an extrinsic way of calculating gene similarity based on their relations with other genes. A typical technique to infer gene similarity is to use a linear similarity measure like Pearson's correlation coefficient or Euclidean distance. However, the noise inherent in microarray datasets reduces the sensitivity of these measures and produces many spurious pairs with no real biological meaning. Our analysis shows that, in comparison to traditional measures, ‘similar’ pairs identified by extrinsic measures overlap better with known biological annotations available in the GO database [8].

We also show that extrinsic measures are useful to enhance the quality of gene networks constructed from similar gene pairs by reducing spurious edges and introducing missing edges between network nodes.

In addition to extrinsic similarity, we also explore the use of a rank-based methodology to transform the gene profiling data into a gene association network. Our methodology applies the principle of reciprocal best match pairs, often used in the area of gene sequence analysis. Furthermore, we propose a False Discovery Rate (FDR) analysis to characterize and control the signal-to-noise ratio in the gene co-expression network. The FDR of a set of predictions can be defined as the expected proportion of false predictions in the set. For example, when the FDR is set to 0.2, we should expect 80% of all interactions in our association network to be correct.

Improving the quality of gene interaction networks³ enhances the biological homogeneity of modules that can be extracted from these networks through graph partitioning or clustering algorithms. These modules can be used to understand and characterize the modular structure of gene interactions, which has direct implications in predicting cellular functions of unknown genes [140]. In addition to their use for functional characterization of unknown genes, we propose a novel statistical model to utilize these modules for gene-set level hypothesis testing. We investigate the use of a MANOVA approach that can take individual probe expression values as input and perform hypothesis testing at the sub-network level. We apply this MANOVA methodology to two published studies on HIV treatment and cigarette smoking to identify functional groups or pathways that are perturbed. Our analysis shows that this methodology is effective in capturing the known effects of these two perturbations on human along with some novel predictions.

³ Throughout this section, gene association, co-expression, and interaction terms are used interchangeably.

4.1 An Extrinsic Method to Infer Gene Similarity

Earlier approaches have used expression levels of two genes over all samples to surmise their correlation. However, this similarity notion does not necessarily imply that genes are functionally related. Given the noise inherent in microarray datasets, it is our hypothesis that intrinsic similarity measures are not adequate to distinguish accidentally regulated genes from those that are biologically related. We argue that since any given gene is likely to fluctuate in its measured expression level due to many possible sources of error, a similarity based on two genes' measurements is more error-prone than one that uses the relative positions of many genes as a reference. In addition to being noise-tolerant, since gene products act as complexes to accomplish certain cellular level tasks [137], these groupings can be effectively leveraged to infer two genes' similarity via their relations with other genes. Thus, we introduce a methodology for the application of extrinsic similarity measures on microarray datasets. We propose two different extrinsic measures motivated by the notion of mutual independence analysis.

The proposed similarity measures are evaluated on a well-studied cancer microarray dataset [5] obtained with Affymetrix oligonucleotide arrays, as well as a yeast microarray dataset generated with custom complementary DNA (cDNA) arrays [74]. For both datasets and platforms, we showed that gene pairs obtained by extrinsic similarity measures overlap better with known biological annotations from the Gene Ontology (GO) database when compared to Pearson's correlation coefficient and the Topological Overlap Measure (TOM) [123]. To further analyze the efficacy of extrinsic measures for gene function inference, we constructed co-expression networks by using different measures. We observe that co-expression networks constructed based on extrinsic measures contain fewer spurious and more biologically verified edges compared to their counterparts generated with other measures. We also studied the modular structure of these networks by decomposing them into co-expressed modules. We found that gene modules extracted from Extrinsic Gene Networks are also functionally more homogeneous.

4.1.1 Datasets and Pre-processing

For this study, we employed a well-studied cancer dataset [5] and the Rosetta compendium yeast data (Saccharomyces cerevisiae) [74]. The first dataset is composed of gene expression values of 62 colon tissue samples measured with the Affymetrix Hum6000 array with 6819 probes [5]. Of these samples, 42 were collected from colon adenocarcinoma patients and 20 from normal colon tissues of these patients. Among all probes, 2000 were selected from 6817 by Alon et al. [5] according to the highest minimum intensity. The second dataset, the Rosetta yeast data, was obtained using a two-color cDNA microarray hybridization assay [74]. It is composed of 300 compendium experiments on the Saccharomyces cerevisiae organism. As suggested by the authors, we used the scale factor for our further analysis, which is defined as the standard deviation of log10(ratio)/[error of log10(ratio)] over all experiments. We perform thresholding, log transformation, and quantile normalization on these two datasets as suggested by our analysis. In addition to these, we further standardize the datasets using a robust standardization method, median absolute deviation (MAD). Genes with zero MAD values, which implies that they are expressed at very similar levels across all of the samples, are excluded from further analysis.
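The MAD standardization and the zero-MAD filter can be sketched as follows. This is a minimal reading of the step under two assumptions: each gene is standardized independently as (value − median) / MAD, and no consistency constant (such as 1.4826) is applied, since the text does not mention one. The function name and the toy profiles are hypothetical.

```python
from statistics import median


def mad_standardize(profiles):
    """Standardize each gene profile by its median and MAD.

    profiles: dict mapping gene name -> list of expression values.
    Genes whose MAD is zero are dropped, mirroring the exclusion rule.
    """
    standardized = {}
    for gene, values in profiles.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # flat profile: excluded from further analysis
        standardized[gene] = [(v - med) / mad for v in values]
    return standardized


toy_profiles = {
    "g1": [1, 2, 3, 4, 5],   # median 3, MAD 1
    "g2": [7, 7, 7, 7, 7],   # MAD 0, will be removed
}
result = mad_standardize(toy_profiles)
```

Using the median and MAD instead of the mean and standard deviation keeps a few extreme array measurements from dominating the scale of a gene's profile.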

4.1.2 Similarity Measures

To quantify the resemblance of two points, one needs a measure of similarity. Similarity measures can be divided into two categories: extrinsic and intrinsic. An intrinsic similarity of two points i and j is defined purely in terms of the values of i and j. On the other hand, an extrinsic similarity measure takes into account other points to infer the similarity of i and j. Previous studies have shown the usability of extrinsic similarity measures in other domains [37, 38]. The standard method to infer the similarity of two genes from their expression patterns is to use a linear intrinsic similarity such as Pearson's correlation coefficient. To our knowledge, we are the first to study extrinsic measures for the analysis of microarray datasets [147].

Intrinsic Similarity

Intrinsic similarity is purely defined on the points in question. In the context of microarray analysis, the intrinsic similarity of two genes is defined on the measured expression levels of the two genes over all samples. In a typical microarray experiment, each gene is expressed at a certain level under each condition; the collection of these levels is the expression profile of the gene. More formally, a gene (say, x) is associated with a profile vector ($V_x$) composed of its expression values over all samples, such that $V_x = [x_1, x_2, ..., x_n]$, where n denotes the number of samples in the dataset. Thus, intrinsic similarity between genes x and y is a measure defined on their profile vectors, $V_x$ and $V_y$. A prevailing measure used for inferring the similarity of two genes based on their gene profiles is Pearson's correlation coefficient [114], which is defined as follows:

$$r_{xy} = \frac{\sum_{i=1}^{n} (V_x^i - \bar{V}_x)(V_y^i - \bar{V}_y)}{\sqrt{\sum_{i=1}^{n} (V_x^i - \bar{V}_x)^2 \, \sum_{i=1}^{n} (V_y^i - \bar{V}_y)^2}} \tag{4.1}$$

where $\bar{V}_x$ and $\bar{V}_y$ are the profile averages and $V_x^i$ represents the $i$th entry of the vector $V_x$. Accordingly, genes which are positively (or negatively) correlated have a value close to 1 (or -1), whereas dissimilar gene pairs have values close to 0. Throughout our analysis, we employ the absolute value of Pearson's correlation scores since both positive and negative correlations can play an important role in gene association.
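Equation 4.1 and the use of its absolute value can be sketched directly from the definition. The function name `abs_pearson` and the toy profiles are illustrative; returning 0 for a constant profile is an assumption made here to avoid division by zero, and is not taken from the text.

```python
import math


def abs_pearson(vx, vy):
    """|r_xy| for two equally long expression profiles (Equation 4.1)."""
    n = len(vx)
    mean_x = sum(vx) / n
    mean_y = sum(vy) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(vx, vy))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in vx)
                    * sum((b - mean_y) ** 2 for b in vy))
    if den == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return abs(num / den)


gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [8.0, 6.0, 4.0, 2.0]    # perfectly anti-correlated with gene_x
gene_z = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with gene_x

strong = abs_pearson(gene_x, gene_y)
weak = abs_pearson(gene_x, gene_z)
```

Here `gene_x` and `gene_y` have r = −1, so their absolute score is 1 and they are treated as strongly associated, while `gene_x` and `gene_z` score 0.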

Extrinsic Similarity

Extrinsic similarity of two attributes (i.e., genes) is defined over other attributes in the dataset [37]. In general, an extrinsic similarity between two attributes, i and j, can be defined as follows:

$$ES_P(i, j) = \sum_{k \in P} f(i, j, k) \tag{4.2}$$

Here, f(i, j, k) denotes a function that signifies the association between attributes i and j with respect to a third attribute k. P refers to the set of attributes that will contribute to the extrinsic similarity of attributes i and j. As noted by Das, Mannila, and Ronkainen [37], the proper choice of the attribute set P and function f is crucial for the usefulness of the resulting extrinsic measure. Different choices of P and f will result in different similarity notions. In their work [37], they chose to define an extrinsic dissimilarity measure based on the confidence of association rules, which we discuss in Section 4.1.2.

In this thesis, we propose using the Mutual Information of Information Theory to derive efficient extrinsic gene similarity measures. Our final goal is to surmise the similarity of two genes from the similarity of their relations with other genes. We believe that an extrinsic measure for microarray analysis has a twofold advantage over the use of intrinsic measures.

First, these measures may reduce the impact of noise inherent in the dataset on the similarity analysis. It is well known that the expression level of each gene is likely to fluctuate due to many sources of variability in a typical microarray analysis. Thus, the similarity deduced from expression levels of two genes is likely to be more error-prone than a similarity deduced from relative positions of these two genes with respect to many other genes. Second, extrinsic measures fit well with the biological hypothesis that genes and gene products act in the form of complexes (i.e., groups) to accomplish certain tasks in the cell. As hypothesized, two gene products that belong to the same complex behave similarly with the members of this complex. Thus a similarity notion that is defined based on the relation of two genes with other genes can potentially capture the modular structure of the genomic interactions. Moreover, an a priori known modular structure of a biological system can be incorporated into the similarity analysis by defining the set P from these pre-defined gene sets.

Defining Measures for Extrinsic Gene Similarity/Dissimilarity

To define proper extrinsic measures for gene expression studies, we first need to determine the gene set, P, and the association function, f, that will constitute our measures. For the set P, we make use of the close proximity of each gene determined by an intrinsic similarity notion. We propose to use Conditional Mutual Information and Specific Mutual Information as our association functions.

Choice of Attribute Set (P): To derive an efficient extrinsic measure, we need an effective gene set that will be used to infer the extrinsic similarity of two genes. To compile such a set, we initially identified for each gene a set of genes that are intrinsically similar to that gene. We refer to this as the neighborhood list of gene i and define it as $N_i = \{ j \mid j \in G, |r_{ij}| > \kappa \}$, where G denotes the set of all genes in our dataset and $|r_{ij}|$ refers to the absolute value of the Pearson's correlation coefficient between genes i and j. We investigated the effect of the threshold parameter κ in our experiments and observed that the size of the neighborhood lists can help us set this parameter. Next, for a pair of genes i and j, the attribute set P is designated as the intersection of their neighborhood lists, i.e., $P = N_i \cap N_j$. Using common elements of two neighborhood lists has two important implications. First, instead of using the whole gene set (G), a smaller set is taken into consideration for each similarity calculation, which significantly reduces the required number of calculations. Secondly, it filters out irrelevant information, which enhances the power of the extrinsic measure. Moreover, by using the intrinsic similarity to determine elements in set P, we take advantage of both extrinsic and intrinsic properties. We believe this will be helpful in reducing the noise that can be introduced into the similarity inference by using each technique separately.

Although we prefer to use the neighborhood lists for determining the set P, it is noteworthy that an extrinsic measure can be easily extended to other groupings of related genes. For instance, an extrinsic similarity can be defined by using an attribute set containing the genes that are mapped to close chromosomal locations or the genes that belong to the same pathway.
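The construction of neighborhood lists and of the attribute set P can be sketched as below. The correlation values are hypothetical, the function names are illustrative, and leaving gene i out of its own list $N_i$ is an assumption (the definition as written would admit it, since $|r_{ii}| = 1$).

```python
def neighborhood_lists(corr, kappa):
    """N_i = {j in G : |r_ij| > kappa}; gene i itself is left out (an assumption)."""
    genes = list(corr)
    return {i: {j for j in genes if j != i and abs(corr[i][j]) > kappa}
            for i in genes}


def attribute_set(neigh, i, j):
    """P = N_i ∩ N_j, the genes that are neighbours of both i and j."""
    return neigh[i] & neigh[j]


# Hypothetical symmetric correlation values for four genes.
pairs = {("g1", "g2"): 0.90, ("g1", "g3"): -0.80, ("g1", "g4"): 0.10,
         ("g2", "g3"): 0.85, ("g2", "g4"): 0.20, ("g3", "g4"): 0.75}
genes = ["g1", "g2", "g3", "g4"]
corr = {g: {} for g in genes}
for (a, b), r in pairs.items():
    corr[a][b] = corr[b][a] = r

neigh = neighborhood_lists(corr, kappa=0.7)
p_g1_g2 = attribute_set(neigh, "g1", "g2")
p_g1_g4 = attribute_set(neigh, "g1", "g4")
```

With κ = 0.7, genes g1 and g4 are not intrinsically similar (|r| = 0.10), yet they share the neighbour g3, so an extrinsic measure summed over P still has evidence to work with for this pair.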

Choice of Association Function (f): Das, Mannila, and Ronkainen [37] proposed using the confidence of association rules in an application on a market basket dataset, which we discuss in Section 4.1.2. Our analysis showed that it is possible to improve their measure for the task of similar gene identification by using the Mutual Independence of genes. We propose using Conditional Mutual Information and Specific Mutual Information to derive effective extrinsic microarray measures.

To leverage the Mutual Information of genes, we first start with the following probability definitions: the probability of occurrence of a gene and the probability of co-occurrence of two genes in neighborhood lists. Formally, we define these probabilities as follows:

Definition 1: The probability of occurrence for a gene i, P(i), is defined as the frequency of encountering that gene in all neighborhood lists. Since Pearson's correlation coefficient is a symmetric measure, a gene has as many neighbors as the number of times it occurs in all neighborhood lists. Thus, the frequency of a gene's occurrence can be simplified to the following:

$$P(i) = \frac{|N_i|}{|G|} \tag{4.3}$$

where |·| denotes the number of elements (cardinality) of its argument. Note that frequency of occurrence is an indication of the discriminatory nature of gene expression profiles. Therefore, genes with indistinct expression profiles will have a higher frequency of occurrence.

Definition 2: The probability of co-occurrence for two genes, i and j, P (i, j), is defined as the frequency of encountering these two genes together in the neighborhood lists. More formally, based on the symmetric Pearson’s measure, P (i, j) can be defined as follows:

$$P(i, j) = \frac{|\{a \mid a \in G,\ i \in N_a,\ j \in N_a\}|}{|G|} \tag{4.4}$$

where N_a refers to the neighborhood list of gene a.

Using these two probability definitions, we develop two different extrinsic gene association measures motivated by Information Theory concepts: Conditional Mutual Information based gene similarity and Specific Mutual Information based gene dissimilarity. We discuss these in detail next.
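As a minimal sketch of Definitions 1 and 2, the two probabilities can be computed directly from the neighborhood lists. The dictionary layout, the function names, and the toy lists below are illustrative assumptions and not part of the original implementation; the lists are assumed to be symmetric, as they are when derived from Pearson’s correlation.

```python
def occurrence_probabilities(neighborhoods):
    """P(i) = |N_i| / |G| (Eq. 4.3) for every gene i.

    `neighborhoods` maps each gene to the set of its neighbors and is
    assumed to be symmetric (j in N_i if and only if i in N_j).
    """
    n_genes = len(neighborhoods)
    return {gene: len(nbrs) / n_genes for gene, nbrs in neighborhoods.items()}


def cooccurrence_probability(neighborhoods, i, j):
    """P(i, j): fraction of neighborhood lists containing both i and j (Eq. 4.4)."""
    n_genes = len(neighborhoods)
    shared = sum(1 for nbrs in neighborhoods.values() if i in nbrs and j in nbrs)
    return shared / n_genes


# Toy, made-up neighborhood lists for four genes.
toy = {
    "g1": {"g2", "g3"},
    "g2": {"g1", "g3"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g3"},
}
p = occurrence_probabilities(toy)
p_12 = cooccurrence_probability(toy, "g1", "g2")
```

In this toy example only the list of g3 contains both g1 and g2, so their co-occurrence probability is one quarter.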

Conditional Mutual Information based Gene Similarity: Conditional Mutual Information between variables X and Y, I(X, Y | C), signifies the quantity of information shared between X and Y when C is known. Formally, it is defined as follows:

$$I(X, Y \mid C) = H(X \mid C) - H(X \mid Y, C) \tag{4.5}$$

where H(X) signifies the Shannon entropy of the discrete random variable X. Conditional Mutual Information thus quantifies the information shared between X and Y once C is given. I(X, Y | C) is equal to zero iff X and Y are conditionally independent given C.

We employ this Information Theory notion of association in order to infer similarity of two genes with respect to their behavior in gene neighborhood lists. Probabilities of occurrence and co-occurrence are used to calculate the Conditional Mutual Information of two genes given the neighborhood list of a third gene. A high Conditional Mutual Information between two genes implies that these two genes prefer to co-occur with the same set of genes when a third gene is known to be occurring in the neighborhood lists. If they are not co-occurring with the same set of genes, they will have a smaller Conditional Mutual Information. If two genes bring the same information to the neighborhood lists of many third parties, we expect these two genes to be regulated by the same mechanism.

Based on this heuristic, we define Conditional Mutual Information based extrinsic gene Similarity as follows:

$$\mathrm{CMI}_P(i, j) = \sum_{k \in P} I(i, j \mid k = 1) \tag{4.6}$$

This measure calculates the quantity of information shared by i and j in the neighborhood lists where k is present, i.e., k = 1. As can be seen above, the final score is the sum of the Conditional Mutual Information between i and j with respect to all elements in set P. If i and j tend to share the same information in many neighborhood lists, they will have a high CMI similarity value⁴.

⁴ CMI values are normalized with respect to the size of the P set.
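The following sketch illustrates one way Eq. 4.6 could be evaluated. The text does not spell out how I(i, j | k = 1) is estimated, so the plug-in estimator below is an assumption: it treats the presence of i and j as binary indicators over the neighborhood lists that contain k. The function names and toy lists are hypothetical.

```python
import math


def conditional_mi_given_present(neighborhoods, i, j, k):
    """Plug-in estimate of I(i, j | k = 1) over the lists that contain k.

    Assumption: the presence of i and j in a list are binary indicators,
    and probabilities are relative frequencies among the lists containing k.
    """
    lists_with_k = [nbrs for nbrs in neighborhoods.values() if k in nbrs]
    n = len(lists_with_k)
    if n == 0:
        return 0.0
    # Joint counts of (i present?, j present?).
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for nbrs in lists_with_k:
        counts[(int(i in nbrs), int(j in nbrs))] += 1
    mi = 0.0
    for (a, b), c in counts.items():
        if c == 0:
            continue
        p_ab = c / n
        p_a = (counts[(a, 0)] + counts[(a, 1)]) / n
        p_b = (counts[(0, b)] + counts[(1, b)]) / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def cmi_similarity(neighborhoods, i, j, attribute_set):
    """Eq. 4.6, normalized by the size of the attribute set P."""
    if not attribute_set:
        return 0.0
    total = sum(conditional_mi_given_present(neighborhoods, i, j, k)
                for k in attribute_set)
    return total / len(attribute_set)


# Made-up lists: i and j always appear together when k is present.
toy_dependent = {"a": {"i", "j", "k"}, "b": {"k"}, "c": {"i", "j"}}
# Made-up lists: i and j appear independently when k is present.
toy_independent = {"a": {"i", "j", "k"}, "b": {"i", "k"},
                   "c": {"j", "k"}, "d": {"k"}}
dep = cmi_similarity(toy_dependent, "i", "j", {"k"})
ind = cmi_similarity(toy_independent, "i", "j", {"k"})
```

In the first toy case the two genes carry the same information given k, so the estimate equals ln 2; in the second case they are conditionally independent and the estimate is zero.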

Specific Mutual Information based Gene Dissimilarity: Specific Mutual Information is a measure of association commonly used in Information Theory to infer mutual dependency. The Specific Mutual Information of two variables, X and Y, given their joint distribution, P(X, Y), and individual distributions, P(X) and P(Y), is defined as follows:

$$\mathrm{SMI}(X, Y) = \frac{P(X, Y)}{P(X)\,P(Y)} \tag{4.7}$$

where P(X, Y) is the observed value (O) for the joint probability of events X and Y, whereas P(X)P(Y) is its expected value (E) under independence.

We propose to utilize this ratio to deduce the co-occurrence relation between two genes when their neighbors are considered. If the Specific Mutual Information of two genes is 1, it can be concluded that these two genes are independent. In this context, being independent means that genes i and j appear together in the neighborhood lists only by chance. However, if two genes are not independent, the occurrence of one gene in a neighborhood list makes it either less probable or more probable for the other gene to occur in that list. Based on this analysis, we propose the following extrinsic measure to quantify the dissimilarity of two genes (i and j):

$$\mathrm{SMI}_P(i, j) = \sum_{k \in P} \left| \frac{P(i, k)}{P(i)\,P(k)} - \frac{P(j, k)}{P(j)\,P(k)} \right| \tag{4.8}$$

This definition ensures that two genes having the same co-occurrence relations with their common neighbors are closely related to each other (SMI value close to 0), whereas two genes that have different dependency relations with their common neighbors are dissimilar and are associated with higher SMI values. We normalize SMI values with respect to the size of the attribute set P.
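A minimal sketch of Eq. 4.8, including the normalization by the size of P, is given below. The helper names and the toy neighborhood lists are hypothetical. P(i) is computed as the fraction of lists containing i, which coincides with Eq. 4.3 when the lists are symmetric, and every gene involved is assumed to occur in at least one list so that no denominator is zero.

```python
def fraction_containing(neighborhoods, *genes):
    """Fraction of neighborhood lists that contain every gene in `genes`."""
    hits = sum(1 for nbrs in neighborhoods.values()
               if all(g in nbrs for g in genes))
    return hits / len(neighborhoods)


def smi_dissimilarity(neighborhoods, i, j, attribute_set):
    """Eq. 4.8 divided by |P|; values near 0 mean i and j behave alike."""
    def p(*genes):
        return fraction_containing(neighborhoods, *genes)

    total = 0.0
    for k in attribute_set:
        ratio_i = p(i, k) / (p(i) * p(k))  # observed / expected for (i, k)
        ratio_j = p(j, k) / (p(j) * p(k))  # observed / expected for (j, k)
        total += abs(ratio_i - ratio_j)
    return total / len(attribute_set)


# Made-up symmetric neighborhood lists for five genes.
toy = {
    "g1": {"g3", "g4", "g5"},
    "g2": {"g3", "g4"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g1", "g2", "g3"},
    "g5": {"g1"},
}
common = toy["g1"] & toy["g2"]  # here taken as the attribute set P
d_12 = smi_dissimilarity(toy, "g1", "g2", common)
d_11 = smi_dissimilarity(toy, "g1", "g1", common)
```

A gene compared with itself yields a dissimilarity of zero, while g1 and g2 differ slightly because g1 occurs in more lists than g2.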

Previous Work

Topological Overlap Measure: Recently, Ravasz et al. [123] proposed the Topological Overlap Measure (TOM), which takes a step toward using extrinsic measures to infer similarity between two nodes of a biological network. This measure is considered an improvement over the intrinsic similarity because it incorporates additional external knowledge derived from the network topology (i.e., the number of common neighbors). According to their definition, two nodes have high topological overlap if they are connected to roughly the same group of nodes.

More formally, TOM of two genes i and j can be expressed as follows:

$$\mathrm{TOM}(i, j) = \frac{|N_i \cap N_j| + r_{ij}}{\min\{|N_i|, |N_j|\} + 1 - r_{ij}} \tag{4.9}$$

where r_ij is the pairwise similarity between these two genes. The inclusion of the intrinsic similarity (r_ij) in this definition makes the TOM measure explicitly dependent on the intrinsic similarity of the two nodes in question. The drawback of this dependency will be discussed in Section 4.1.3.
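Eq. 4.9 can be transcribed directly, as in the short sketch below. The function name, argument layout, and example values are illustrative assumptions.

```python
def topological_overlap(neighbors_i, neighbors_j, r_ij):
    """TOM(i, j) as written in Eq. 4.9.

    `neighbors_i` and `neighbors_j` are the neighbor sets N_i and N_j,
    and `r_ij` is the intrinsic (pairwise) similarity of the two genes.
    """
    shared = len(neighbors_i & neighbors_j)
    smaller = min(len(neighbors_i), len(neighbors_j))
    return (shared + r_ij) / (smaller + 1 - r_ij)


# Made-up example: two shared neighbors out of three and four.
tom = topological_overlap({"a", "b", "c"}, {"b", "c", "d", "e"}, r_ij=0.5)
```

Because r_ij appears in both the numerator and the denominator, the score moves with the intrinsic similarity even when the neighbor sets are held fixed.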

Confidence of Association Rules: Das, Mannila, and Ronkainen [37, 38] previously studied the extrinsic similarity of attributes in a market basket dataset, where the confidence of association rules is used as the association function, f. In a market-basket problem, each customer fills their market basket with a subset of a large number of items (e.g., bread, milk). Such datasets are mined for association rules of the form (X1, ..., Xn ⇒ Y) to identify the relation between items. The confidence of an association rule is defined as the frequency of encountering the consequent of the rule (Y) among all the groups containing its antecedent (X1, ..., Xn). More specifically, their proposed extrinsic similarity measure is the following:

$$\mathrm{ES}_P(A, B) = \sum_{D \in P} \left| \mathrm{conf}(A \Rightarrow D) - \mathrm{conf}(B \Rightarrow D) \right| \tag{4.10}$$

where confidence is defined as follows:

$$\mathrm{conf}(A \Rightarrow D) = \frac{P(A, D)}{P(A)} \tag{4.11}$$

For the task at hand, the analogue of a market basket is a gene neighborhood list. Accordingly, we use the frequency of occurrence (P(i)) and the frequency of co-occurrence (P(i, j)) to derive a corresponding confidence based extrinsic measure for microarray analysis. We again normalize this measure by dividing it by the size of the set P to avoid favoring pairs with large attribute sets.

We compare the newly proposed extrinsic similarity measures (SMI and CMI) with the existing ideas in the literature (i.e., TOM and confidence) as well as with the most commonly used intrinsic measure for microarray analysis, namely Pearson’s correlation coefficient.
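The confidence based measure of Eqs. 4.10 and 4.11, adapted to neighborhood lists and normalized by the size of P, might be sketched as follows. The helper names and the toy lists are hypothetical, and each gene is assumed to occur in at least one list.

```python
def fraction_containing(neighborhoods, *genes):
    """Fraction of neighborhood lists that contain every gene in `genes`."""
    hits = sum(1 for nbrs in neighborhoods.values()
               if all(g in nbrs for g in genes))
    return hits / len(neighborhoods)


def confidence(neighborhoods, a, d):
    """conf(a => d) = P(a, d) / P(a), Eq. 4.11."""
    return (fraction_containing(neighborhoods, a, d)
            / fraction_containing(neighborhoods, a))


def confidence_dissimilarity(neighborhoods, i, j, attribute_set):
    """Eq. 4.10 divided by |P|."""
    total = sum(abs(confidence(neighborhoods, i, k)
                    - confidence(neighborhoods, j, k))
                for k in attribute_set)
    return total / len(attribute_set)


# Made-up symmetric neighborhood lists for five genes.
toy = {
    "g1": {"g3", "g4", "g5"},
    "g2": {"g3", "g4"},
    "g3": {"g1", "g2", "g4"},
    "g4": {"g1", "g2", "g3"},
    "g5": {"g1"},
}
es_12 = confidence_dissimilarity(toy, "g1", "g2", toy["g1"] & toy["g2"])
```

Unlike the SMI ratio, the confidence is not divided by P(k), so frequently occurring attribute genes are not discounted.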

4.1.3 Experiments

Domain Based Evaluation

‘Similar’ pairs identified according to different similarity/dissimilarity measures are evaluated based on the Pairwise Semantic Similarity measure of Resnik [125]. Semantic similarity based on ontologies has been studied many times in the past [81, 99]. Resnik suggested a novel way to calculate semantic similarity in an ontology based on the notion of information content. Resnik’s measure is preferred over other semantic similarity measures [81, 99], since it has been shown to outperform the others and to be better suited for GO analysis [129].

This measure makes use of known annotations in the GO database as well as its hierarchy. GO accumulates the results of investigations in the areas of genomics and biomedicine by providing a large database of known associations in the form of a hierarchy. GO terms that are general enough to annotate many genes are close to the root of the hierarchy, and very distinctive terms form the leaf nodes in this hierarchy. The biological relevance of two genes can be quantified with respect to the Information Content of their shared GO annotations using the Semantic Similarity (SS) measure defined by Resnik [125]. The Information Content (IC) of a GO term, using Resnik’s definition, is given as:

$$IC(k_i) = -\ln\left(\frac{F(k_i)}{F(\mathrm{root})}\right) \tag{4.12}$$

where k_i represents a term and F(k_i) is the frequency of encountering that particular term over the entire corpus, such as the GO annotations for the whole genome. Here, F(root) is the frequency of the root term of the hierarchy. Note that the frequency count of a term includes the frequency counts of all subsumed terms in an is-a hierarchy. Accordingly, the root of our hierarchy includes the frequency counts of every other term in the ontology and is associated with the lowest IC value. Terms with smaller frequency counts will have higher information content values, i.e., they will be more informative.

Using the above Information Content definition, the Semantic Similarity (SS) between two GO terms can be computed as follows:

$$SS(k_i, k_j) = IC(\mathrm{lcs}(k_i, k_j)) \tag{4.13}$$

where lcs(k_i, k_j) refers to the lowest common subsumer of terms k_i and k_j. Using GO annotations for the yeast genome, we calculated the pairwise semantic similarity for all available GO terms.

Next, for a given gene pair, the semantic similarity is assigned as the maximum SS over all pairs of their GO annotations. More formally, for two genes x and y, the semantic similarity is defined as follows:

$$SS(x, y) = \max_{a_x,\, a_y} SS(a_x, a_y) \tag{4.14}$$

where a_x ranges over all terms annotating gene x and, similarly, a_y ranges over all terms annotating gene y. Accordingly, for the pairs found to be similar according to different similarity/dissimilarity measures, we calculated their similarity using the above definition. While running our experiments, we did not take unannotated genes into consideration, since there is not enough information to speculate about the biological concordance of such genes.

We then constructed gene association networks by linking the most similar gene pairs identified with respect to alternative similarity definitions. We obtained clusters of densely linked genes from these networks to study their efficacy in understanding the molecular and biological processes. The obtained clusters are evaluated with an enrichment score that shows the statistical significance of the homogeneity of the GO annotations in a cluster. Details of this enrichment score can be found in Section 3.2.4.
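Eqs. 4.12 to 4.14 can be illustrated on a small made-up is-a hierarchy. The sketch below is not the GO tooling used in our experiments: the term names, annotation counts, and function names are hypothetical, and the lowest common subsumer is taken to be the shared ancestor with the highest information content, which is an assumption about how ties in a DAG are resolved.

```python
import math


def ancestors(term, parents):
    """Return the term together with all of its is-a ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(direct_counts, parents, root):
    """IC(k) = -ln(F(k) / F(root)), Eq. 4.12.

    F(k) includes the counts of all terms subsumed by k.
    """
    freq = {}
    for term, count in direct_counts.items():
        for anc in ancestors(term, parents):
            freq[anc] = freq.get(anc, 0) + count
    return {t: -math.log(f / freq[root]) for t, f in freq.items() if f > 0}


def term_similarity(t1, t2, ic, parents):
    """Eq. 4.13, taking the most informative shared ancestor as the lcs."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max(ic[t] for t in common)


def gene_similarity(terms_x, terms_y, ic, parents):
    """Eq. 4.14: maximum term similarity over all annotation pairs."""
    return max(term_similarity(a, b, ic, parents)
               for a in terms_x for b in terms_y)


# Made-up hierarchy: root -> {A, B}, A -> {A1, A2}.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
direct_counts = {"root": 0, "A": 2, "B": 4, "A1": 1, "A2": 1}
ic = information_content(direct_counts, parents, "root")
ss = gene_similarity({"A1"}, {"A2", "B"}, ic, parents)
```

Here the term A subsumes half of all annotations, so two genes whose most specific shared ancestor is A receive a similarity of ln 2, whereas genes that only share the root receive zero.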

Throughout our experiments, we discuss the usability of extrinsic measures for microarray analysis. First, we discuss our rationale for setting the κ threshold while constructing gene neighborhood lists. Second, we present the biological relevance of ‘similar’ gene pairs identified with different measures. We then link these ‘similar’ genes to construct gene co-expression networks. Each of these networks is partitioned into its functional modules to study the effect of extrinsic similarity on the quality of information extracted from these networks.

Setting the κ parameter

Before comparing the newly proposed measures with the existing ones, we first investigated the effect of the κ parameter on the attribute set P. To choose a suitable κ threshold, there are two things that we should take into consideration. First, we want the attribute set (P) to be composed only of genes that are within close proximity of the two genes whose similarity is under investigation. Second, it is not desirable to have a set that is composed of only a very small number of genes, since this would limit the power of inference on common neighbors. Accordingly, we varied the κ parameter between 0.3 and 0.9 and observed the average size of the attribute sets for each of these values (shown in Figure 4.1). As expected, smaller values of κ resulted in larger P sets containing many dissimilar genes. On the other hand, higher κ values resulted in very small P sets, which are too restrictive to draw any conclusions. Given that observation, we believe that the average size of the attribute set P can guide us in setting the κ parameter. Consequently, we set the κ threshold to 0.5 for the colon cancer dataset and 0.9 for the yeast data, which generates neighborhood lists of size 40 and P sets of size 8 on average.
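The effect of κ on list sizes can be explored with a sketch such as the one below. It assumes that a gene’s neighborhood list contains every gene whose correlation with it is at least κ; this definition, the function names, and the toy correlations are illustrative assumptions.

```python
def neighborhood_lists(correlations, genes, kappa):
    """Build N_i for every gene, assuming N_i = {j : r_ij >= kappa}.

    `correlations` maps unordered gene pairs (a, b) to their correlation.
    """
    lists = {gene: set() for gene in genes}
    for (a, b), r in correlations.items():
        if r >= kappa:
            lists[a].add(b)
            lists[b].add(a)
    return lists


def average_size(lists):
    """Average number of neighbors per gene."""
    return sum(len(nbrs) for nbrs in lists.values()) / len(lists)


# Made-up pairwise correlations for four genes.
genes = ["g1", "g2", "g3", "g4"]
correlations = {("g1", "g2"): 0.95, ("g1", "g3"): 0.60,
                ("g2", "g3"): 0.55, ("g3", "g4"): 0.35}
sizes = {kappa: average_size(neighborhood_lists(correlations, genes, kappa))
         for kappa in (0.3, 0.5, 0.9)}
```

As in our observations, the average list size shrinks as κ grows, so the threshold trades off the proximity of the attribute genes against the amount of evidence available.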

Figure 4.1: The average size of neighborhood lists with respect to different κ thresholds (depicted for the colon cancer dataset).

Effect on top ‘similar’ pairs

In our first experiment, we compare the gene pairs that are labeled as ‘similar’ according to the discussed measures. For each measure, gene pairs are sorted starting from the most ‘similar’ (or least ‘dissimilar’) one. Next, we calculate the average semantic similarity for these gene pairs. Different numbers of top scoring pairs (varying between 1000 and 20000) are compared in our study to gain a better understanding of the effectiveness of extrinsic measures. We present these numbers in Figure 4.2.

When we analyze the distribution of average semantic similarities, we observe that extrinsic measures outperform existing measures, with a significant improvement in semantic similarity for both datasets. For the colon cancer dataset, as can be seen in Figure 4.2 (top), the pairs identified with the SMI measure show greater biological relevance than the pairs identified by other measures. For the top 1000 pairs, the improvement in the average semantic similarity score is up to 15% when an extrinsic measure is used instead of an intrinsic one. Since semantic similarity calculations are based on the information content of each GO term, which is on a logarithmic scale, this improvement is significant in practice, as our further analysis indicates. Although the TOM measure is also able to improve on Pearson’s correlation, this improvement is not as significant as that of our extrinsic measures.

Figure 4.2: Average semantic similarity (SS) calculated for the top ‘similar’ pairs identified via alternative measures for the Colon cancer (top) and Yeast (bottom) microarray datasets. The y axis represents the number of top pairs considered in each experiment.

When we analyze the yeast dataset, we again observe that extrinsic measures identify biologically more relevant gene pairs. As can be seen in Figure 4.2 (bottom), the improvement is more significant (up to 22%) when the top pairs obtained by the CMI measure are compared to the top pairs identified by the standard measure. Note that, in contrast to the colon cancer dataset, the yeast data is obtained using cDNA assays. This analysis shows that extrinsic measures are effective for studying gene expression profiles produced by both cDNA and oligonucleotide arrays. As can be observed in this figure, TOM contributes even less to the standard measure in this case. This can be attributed to the high mean r value observed for this data: with a higher contribution from its intrinsic component, the extrinsic component of the TOM measure becomes underrepresented.

Our analysis confirms that extrinsic measures better capture the biological relevance of two genes when compared to the standard intrinsic measure. We believe their power can be attributed to two reasons: the noisy nature of microarray datasets and the functional modularity of genes. Intrinsic measures directly possess and reflect the noise inherent in the data, since they are defined purely on the expression levels of the genes under study. We also believe that, since the TOM measure depends on the intrinsic measure in its definition, it is also affected by the noise inherent in these datasets. The poor performance of the TOM measure with respect to our extrinsic measures can be attributed to the fact that erroneous measurements will have a more drastic impact on any intrinsic or intrinsic based measure.

On the other hand, extrinsic measures are dependent on more evidence, since the similarity of two genes is inferred from their relative positions with respect to a set of other genes. Hence, we expect the impact of erroneous measurements to be less severe on the extrinsic similarity measures. Our experimental results are in accordance with this expectation, as extrinsic measures produce biologically more relevant pairs. In addition, inferring the similarity of two genes from a set of other genes can benefit from the group level interactions known to take place between genes and gene products when accomplishing certain cellular tasks [137].

Effect on gene networks

In this experiment, we constructed gene association networks by linking the top similar pairs identified via each measure. Here, nodes represent genes, and two nodes are linked if the corresponding genes are ‘similar’ to each other. To keep the same size for all networks, we only used the top 0.01% of all gene pairs sorted with respect to a similarity/dissimilarity measure. Accordingly, colon cancer networks are composed of 12,438 interactions and yeast networks are composed of 74,267 interactions. Tightly connected subnetworks of a co-expression network can provide insight into vital molecular and biochemical processes. Moreover, groups of genes that are densely linked in gene networks have been theorized to have similar cellular functions, with great implications for gene annotation at a global scale [12, 46, 135]. Thus, we extracted and studied densely linked sub-networks of these networks. For this purpose, we employ a graph partitioning algorithm, Graclus [44], that has been shown to be effective in analyzing gene association networks [151]. This algorithm is effective in obtaining balanced-size clusters while minimizing the normalized cuts criterion. To our knowledge, no entirely reliable method exists for identifying the correct number of partitions (i.e., k) in a network. Therefore, we partitioned colon cancer networks into 100 clusters and yeast networks into 200 clusters to make sure that clusters of reasonable size are generated. On average, each partition contains 20 genes.

Each partitioning is validated using the enrichment score p-values that signify the homogeneity of each cluster in terms of its known GO annotations. Smaller p-values imply that the grouping is not random and is functionally more homogeneous. A cut-off parameter is used to differentiate significant groups from insignificant ones. Accordingly, if a cluster is associated with a p-value greater than the recommended cut-off of 0.5, it is considered to be insignificant. The p-value distributions for the significant clusters extracted from various gene association networks are shown in Figure 4.3. As can be observed from the figure, extrinsic similarity measures produce a larger number of clusters that are significantly enriched with Biological Process GO term annotations⁵. For the colon cancer data, we are able to identify only 4 clusters that are functionally homogeneous when Pearson correlation is used. However, with the use of extrinsic measures this number increases to 10 for SMI and 9 for CMI. Similarly, for the yeast data, the number of significant clusters and their significance scores are drastically improved when extrinsic measures are used instead of the intrinsic measure. By using the SMI measure instead of Pearson’s correlation, the number of significant clusters that can be deduced from the same data increases more than three-fold.

These results suggest that using extrinsic measures provides a two-fold benefit for gene association network analysis. First, these measures enhance the functional homogeneity of clusters that can be identified with a standard algorithm. Second, they enable the identification of functionally homogeneous clusters that cannot be detected by standard measures, as is evident from the increase in the number of significant clusters.

4.1.4 Discussion

In this section, we investigate the usability of clusters extracted from different gene similarity networks by running a dataset specific analysis. For this part of our analysis, we analyze the colon cancer dataset, which is composed of tumorous and non-tumorous tissues of the human colon and rectum. A more detailed analysis of the significant clusters obtained from the colon cancer data revealed that they can be very useful in understanding and treating colorectal cancer. We discuss several of these clusters and their relation with colon cancer in the rest of this section.

⁵ Biological Process GO terms are used for this analysis since they directly reflect the functionality of genes and gene products.

Figure 4.3: The p-value distribution of significant clusters extracted from the Colon Cancer (top) and Yeast (bottom) gene networks. The y axis represents the −log of the enrichment score of each corresponding cluster.

By using the CMI measure, we obtained a cluster that is annotated with ‘aldehyde dehydrogenase (NAD) activity’. Previous studies measured the activity of aldehyde dehydrogenase in primary and metastatic human colonic adenocarcinomas [104]. We also identified clusters annotated with ‘phospholipase activity’ by employing the CMI measure. It has been shown that Phospholipase D (PLD) has a possible impact on carcinogenesis and its progression [113]. Another cluster obtained with the CMI measure is annotated with ‘NF-kappaB binding’. The NF-kappaB pathway has been shown to take part in the regulation of the Inhibitors of apoptosis (IAP) family in human colon cancers [157]. Identification of clusters that are known to be related to colon cancer is vital for developing new therapeutic targets and identifying potential tumor markers for colorectal cancer. However, we cannot identify such clusters via standard analysis of the same dataset.

From the SMI network, we extracted a cluster that is composed of genes associated with the GO term ‘cytoskeleton-dependent intracellular transport’. Recent evidence indicates that the interaction of a tumor suppressor gene (APC) with the cytoskeleton might contribute to colorectal tumor initiation and progression [111]. We therefore believe that the grouping of these genes into one cluster is driven by the role they play in colon cancer tumorigenesis. Unfortunately, it is still unknown how APC interacts with the cytoskeleton and how their interaction plays a role in the formation of colorectal tumors [111]. We believe that once functionally coherent clusters are identified, relations between these clusters can be used to reveal function level interactions vital for understanding the cause of some diseases.

4.2 Hypothesis Testing using Gene Modules

In this section, we propose a two-step study composed of a gene network construction step and a hypothesis testing step that uses the sub-graphs of this gene network. In the first part, multiple independently collected microarray datasets are analyzed to identify probe pairs that are positively co-regulated across microarray samples. A co-expression network, referred to as the Reference Gene Association (RGA) network, is constructed based on a reciprocal ranking criterion and a false discovery rate analysis. Graph partitioning and clustering algorithms are then utilized to extract functionally and topologically coherent sub-networks of the RGA network. We investigate the effectiveness of various algorithms based on domain knowledge from curated biological pathway databases. In the second part, we propose a novel MANOVA approach that can take individual probe expression values as input and perform hypothesis testing at the sub-network level. We apply this MANOVA methodology to two published studies, and our analysis indicates that our methodology is both effective and sensitive for identifying transcriptional sub-networks or pathways that are perturbed across treatments.

4.2.1 Methods

The overview of the proposed methodology is illustrated in Figure 4.4. As can be seen from this figure, in the first step, a reference gene association (RGA) network is constructed from publicly available microarray profiling data. In step two, this network is partitioned into its tightly coupled sub-networks. These sub-networks are subsequently fed into a MANOVA based procedure for hypothesis testing in individual profiling experiments.

[Figure 4.4 (schematic), three stages. RGA Network Construction: GEO data, correlation and rank analysis, sorted probe pairs, FDR analysis against pathways, RGA network. RGA Network Partitioning: clustering algorithms, sub-networks, annotation analysis, GO labeled sub-networks (e.g., Ion binding, Apoptosis, Angiogenesis). MANOVA based Hypothesis Testing: for a dataset D, are the genes in a sub-network (e.g., Ion binding) differentially expressed among the groups of D (i.e., different treatments)?]

Figure 4.4: The schematic view of the RGA construction and hypothesis testing.

Data collection and pre-processing

We collect experiments within the NCBI Gene Expression Omnibus (GEO) that were conducted with the Affymetrix HG-U133A chip. The starting data consists of 525 human tissue and cell line samples from 26 independently conducted projects. We exclude projects including primary tumor samples in order to minimize the possible influence of gross cytogenetic abnormalities, amplifications, and deletions on network construction. The resulting final dataset consists of 393 human tissue and cell line samples from 21 different projects.

The Affymetrix CEL files are analyzed with the MAS 5.0 algorithm from Affymetrix and standardized using quantile normalization [17]. For this analysis we make use of the publicly available Bioconductor tools (www.bioconductor.org).

RGA Network Construction

In order to identify probes that are co-expressed across multiple projects, we first calculate Pearson’s correlation coefficients between every probe pair across all samples from the selected microarray projects. Formally, the Pearson’s correlation coefficient between probes pA and pB is defined as:

$$r_{AB} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2 \, \sum_{i=1}^{n} (B_i - \bar{B})^2}} \tag{4.15}$$

where A_i and B_i are the expression levels of pA and pB in the i-th sample, n is the number of samples, and Ā and B̄ are the average expression levels of pA and pB, respectively.

In this work, we base our analysis only on positive correlations, as this ensures that the strongest correlations among gene pairs are included. In prior work, Goldenberg and Moore [62] demonstrated that positive correlations are much more significant than negative correlations in sparsely connected networks. The implicit importance of positive correlations over negative ones is also suggested by previous research in the context of microarray studies [4]. In our context, the goal is to derive a network that is relatively sparse, so by considering only the positive correlations we ensure that the strongest or most significant correlations among gene pairs are captured by our model.

Next, a rank ordered list of co-expressed neighbors is enumerated for each probe, with rank 1 being the most correlated. Note that, although the Pearson’s correlation scores are symmetric, the rank order of probe A with respect to probe B (say, RO_AB) is not necessarily equal to the rank of probe B with respect to probe A (say, RO_BA). Thus, two reciprocal rank values are associated with each probe pair (i.e., RO_AB and RO_BA).

Accordingly, for any probe pair, pA and pB, we define its reliability score as:

$$RS_{AB} = \frac{1}{RO_{AB} \times RO_{BA}} \tag{4.16}$$

Probe pairs with smaller reciprocal rank orders have higher reliability scores. We systematically computed reliability scores for every probe pair based on its reciprocal ranks. Pairs were sorted with respect to their scores, starting from the most reliable one.
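The reciprocal ranking step can be sketched in a few lines. The code below is a small illustration on made-up expression profiles rather than the pipeline used for the RGA network; the function names are hypothetical, ties between correlations are not handled specially, and only positively correlated partners are ranked, following the text above.

```python
import math


def pearson(a, b):
    """Pearson's correlation coefficient of two equally long profiles (Eq. 4.15)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def reliability_scores(profiles):
    """Return probe pairs sorted by RS = 1 / (RO_AB * RO_BA), Eq. 4.16."""
    probes = sorted(profiles)
    rank = {}
    for a in probes:
        partners = [(pearson(profiles[a], profiles[b]), b)
                    for b in probes if b != a]
        positive = [(r, b) for r, b in partners if r > 0]
        positive.sort(reverse=True)  # rank 1 = most correlated partner
        rank[a] = {b: pos for pos, (_, b) in enumerate(positive, start=1)}
    scores = {}
    for idx, a in enumerate(probes):
        for b in probes[idx + 1:]:
            if b in rank[a] and a in rank[b]:
                scores[(a, b)] = 1.0 / (rank[a][b] * rank[b][a])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up expression profiles over four samples.
profiles = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.5],
    "p3": [1.0, 3.0, 2.0, 5.0],
    "p4": [4.0, 3.0, 2.0, 1.0],
}
ranked = reliability_scores(profiles)
```

In this toy example p1 and p2 are each other’s top partner and therefore receive the maximum score of 1, while p4 is negatively correlated with every other probe and does not appear in any pair.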

Essentially, the network construction step can be viewed as the gradual addition of increasingly unreliable pairs. One can imagine that, with the increasing number of probe pairs in a network, the noise in the network also increases. To find a proper cut-off point, we adopt a false discovery rate (FDR) analysis step to quantify the signal to noise ratio in a given network. Two probes that are linked in the network can be either functionally relevant (true positive) or irrelevant (false positive). To quantify the relevance of probe pairs, we make use of curated pathway annotations from the KEGG, BioCarta and GenMAPP databases. Accordingly, a link between two probes that share a pathway annotation in any of the databases is labeled as a true positive (TP). Conversely, a link between two probes that are annotated with two non-overlapping sets of pathways is labeled as a false positive (FP). Probe pairs where at least one probe is not annotated are defined as non-discriminatory (ND), since current knowledge is not sufficient to conclude whether they are functionally relevant or not. Such pairs are excluded from our FDR calculation.

The effect of the FDR cut-off on the resulting network and the criteria that we take into consideration while setting this parameter are evaluated through experiments. In order to include pairs labeled as ND, a reliability score threshold corresponding to the chosen FDR cut-off is obtained and all probe pairs that have higher reliability scores than this threshold are included in the final reference network.
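The labeling and cut-off procedure described above might be sketched as follows. The stopping rule used here, which ends the network at the first pair that pushes the cumulative FDR above the cut-off, is an assumption; the function names, probe names, and pathway annotations are made up for illustration.

```python
def label_pair(a, b, pathways):
    """Label a probe pair as TP, FP, or ND from pathway annotations."""
    paths_a, paths_b = pathways.get(a), pathways.get(b)
    if not paths_a or not paths_b:
        return "ND"  # at least one probe is not annotated
    return "TP" if paths_a & paths_b else "FP"


def select_by_fdr(sorted_pairs, pathways, fdr_cutoff):
    """Keep the most reliable pairs while cumulative FDR stays within the cut-off.

    FDR = FP / (TP + FP); ND pairs are kept but excluded from the ratio.
    """
    tp = fp = 0
    selected = []
    for a, b in sorted_pairs:
        label = label_pair(a, b, pathways)
        if label == "TP":
            tp += 1
        elif label == "FP":
            fp += 1
        if tp + fp > 0 and fp / (tp + fp) > fdr_cutoff:
            break
        selected.append((a, b, label))
    return selected


# Made-up annotations; probe "x" has no pathway annotation.
pathways = {"a": {"P1"}, "b": {"P1"}, "c": {"P1", "P2"},
            "d": {"P3"}, "e": {"P4"}}
# Pairs assumed to be already sorted from most to least reliable.
sorted_pairs = [("a", "b"), ("a", "c"), ("b", "x"), ("c", "d"), ("d", "e")]
network = select_by_fdr(sorted_pairs, pathways, fdr_cutoff=0.2)
```

With a 20% cut-off the toy network stops before the pair (c, d), whose inclusion would raise the cumulative FDR to one third; the non-discriminatory pair (b, x) is retained because it ranks above that point.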

To check the significance of the network, we generated random datasets using the random number generator tool from the Partek software package (www.partek.com). Random datasets of normal, gamma, and uniform distributions, and of overall dimensions comparable to the experimental data, are generated. These datasets are processed the same way as the real dataset, and the sorted probe pair lists are analyzed and compared to that of the RGA network.

RGA Network Partitioning

Many published graph partitioning algorithms can be used to partition the co-expression network into densely-connected modules. Clustering algorithms can also be employed, utilizing the reliability scores of probe pairs as the similarity values. To find an optimal algorithm, two graph partitioning algorithms, Graclus [43] and Metis [85], as well as a clustering algorithm, agglomerative hierarchical clustering (UPGMA), were employed. We utilized a fast implementation of agglomerative clustering from the CLUTO clustering toolkit [83].

We utilize the GO Term Finder tool [18] to evaluate whether the extracted sub-networks are enriched in genes of similar functions and/or involved in similar biological processes. This analysis serves two purposes. First, each sub-network is annotated with the best GO terms that are significantly enriched. Second, the optimal algorithm is selected based on the number of annotated sub-networks and the significance of the annotations (i.e., p-values). P-values are calculated with the previously explained methodology (see Section 3.2.4).

MANOVA Model for Sub-network Significance Testing

Sub-networks extracted from the RGA network usually consist of co-regulated probes involved in a common biological function or process. We adopt a MANOVA based approach to assess the transcriptional response at the sub-network level. MANOVA is a statistical method that can be used to measure treatment effects across multiple dependent variables simultaneously. In our case, we treat each probe of a sub-network as a response variable. By combining all probes from the same sub-network, we can assess the response of that sub-network as a whole to some experimental factor. For experiments where there is only one independent variable (e.g., treatment), our MANOVA model is formulated as follows:

$$p_1 + p_2 + \dots + p_i + \dots = t \tag{4.17}$$

where p_i is the i-th probe in the sub-network and t is the treatment factor. Note that each probe is treated as a dependent variable. Using this model, we assess the following hypotheses for each of the sub-networks.

Hnull = Sub-network is not affected by different treatments

Ha = Sub-network is significantly affected by different treatments

Note that the model may be expanded to accommodate more complex experimental designs and interactions between factors. To validate the effectiveness of the approach, we apply it to two published studies on HIV protease inhibitors and cigarette smoking. Significantly affected sub-networks are analyzed and compared to the known biological effects of these two studies.
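To make the one-factor model concrete, the following is a minimal sketch of a one-way MANOVA test statistic for a single sub-network. It computes Wilks' lambda, one of several standard MANOVA statistics; the text does not state which statistic was used, so this choice is an assumption. The function names, the group labels, and the toy expression values are hypothetical, and the conversion of lambda to a p-value (normally done through an F approximation) is omitted.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def column_means(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each probe (column) over the given samples (rows)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sscp(rows: Sequence[Sequence[float]], center: Sequence[float]) -> Matrix:
    """Sums of squares and cross-products of the rows around `center`."""
    p = len(center)
    out = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [row[j] - center[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                out[a][b] += dev[a] * dev[b]
    return out


def determinant(m: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det


def wilks_lambda(groups: Dict[str, List[List[float]]]) -> float:
    """Wilks' lambda = det(W) / det(T) for samples grouped by treatment.

    Each sample is a list with one expression value per probe of the
    sub-network. Values near 0 suggest a treatment effect; values near 1
    suggest none.
    """
    all_rows = [row for rows in groups.values() for row in rows]
    n_samples, n_probes, n_groups = len(all_rows), len(all_rows[0]), len(groups)
    if n_samples - n_groups < n_probes:
        raise ValueError("too few samples for a sub-network of this size")
    within = [[0.0] * n_probes for _ in range(n_probes)]
    for rows in groups.values():
        group_sscp = sscp(rows, column_means(rows))
        for a in range(n_probes):
            for b in range(n_probes):
                within[a][b] += group_sscp[a][b]
    total = sscp(all_rows, column_means(all_rows))
    det_total = determinant(total)
    if det_total == 0.0:
        raise ValueError("total SSCP matrix is singular")
    return determinant(within) / det_total


# Hypothetical toy data: two probes measured in four samples per treatment.
toy = {
    "control": [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.2]],
    "treated": [[2.0, 3.1], [2.2, 2.9], [1.9, 3.0], [2.1, 3.2]],
}
print(round(wilks_lambda(toy), 4))
```

The size check in `wilks_lambda` reflects the practical constraint that a sub-network cannot be tested when it has more probes than the samples can support.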

4.2.2 Experiments

Network Construction

Microarray datasets generated from the Affymetrix HG-U133A platform are analyzed and a reference gene association network is constructed as previously described in Section 4.2.1. Primary tumor samples are excluded from the original data to minimize possible influence from cytogenetic abnormalities. For example, when we include primary tumor samples in our analysis, we frequently observe probe-probe correlations between probes mapped to the same chromosomal regions commonly amplified or deleted in cancer.

We employ a rank-based approach that avoids any assumption about the data distribution, since correlation coefficients calculated from profiling data do not follow the well-known normal or t distribution [4]. We use mutual rank orders to quantify the strength of the co-expression between probe pairs. Every probe pair is assigned a reliability score calculated from the two reciprocal correlation ranks between the pair. These scores are used to sort the probe pairs starting from the most reliable ones. Networks are constructed by incrementally adding probe pairs to the network starting from the top of the sorted list (footnote 6).

To estimate the signal to noise ratio in a network, False Discovery Rate (FDR) analysis is adopted. As previously described, each probe pair is given a label of TP (true positive), FP (false positive), or ND (non-discriminatory) based on its pathway annotations from the KEGG, BioCarta, and GenMAPP databases. As depicted in Figure 4.5, the FDR of the network increases with an increasing number of probe pairs. This indicates that reliability scores correlate well with the biological relevance of probe pairs. To construct a network based on FDR analysis, we choose a relatively moderate 20% FDR cut-off for two reasons. First, it yields a reasonable coverage of probes in the network. Second, we must take into account the fact that the current pathway annotation is not complete due to our limited understanding of biology, and the probe pairs falling into the false discovery class are in fact a mixture of false positives and true positives missing in the databases. We expect that the FDR threshold could be made stricter as pathway annotation becomes more complete. Using the 20% FDR cut-off, a network composed of 10314 probes and 9675 edges is generated.

To further assess the significance of probe-probe association in the RGA network, random datasets of normal, uniform, and gamma distributions are simulated. For each of the datasets, Pearson’s correlation coefficients are computed for probe pairs and FDR analysis is performed. As shown in Figure 4.5, networks

Footnote 6: Unless otherwise noted, whenever we use the term probe, we refer to the corresponding gene as well.

[Figure 4.5 plot, titled "Network Size vs. FPR". x-axis: Number of Edges (5000 to 20000); y-axis: False Positive Rate (0.0 to 1.0); series: Real, Gamma, Normal, Uniform.]

Figure 4.5: Randomly generated datasets of different distributions are compared with the real data. The x-axis indicates the total number of probe pairs considered starting from the top of the sorted probe pair list and the y-axis indicates the corresponding FDR for that set of probe pairs.

constructed from random datasets have a much higher FDR than the RGA network at any given number of probe pairs.

Another metric, which quantifies the functional relevance of probe pairs as their GO term overlap, is also utilized to compare these networks [91]. The GO term overlap of two genes is calculated as the number of terms that occur in the intersection of their annotation sets. The higher this score, the higher the similarity between the two genes. For our analysis, we calculate the number of shared GO annotations, including parent terms, for each probe pair. Next, these values are plotted in the form of a cumulative distribution to globally assess the functional relevance of each network (shown in Figure 4.6). For a better comparison, we use the same number of probe pairs from each of the random networks as in the real network (i.e., 9675 pairs). As depicted in Figure 4.6, the real network shows greater functional relevance than any of the random networks. In summary, these results demonstrate that our network construction procedure is effective in selecting biologically related probe pairs.
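The mutual rank scoring and incremental network construction used in this section can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure: the reliability score is taken here to be the geometric mean of the two reciprocal ranks (a common definition of mutual rank; the formula actually used is the one given in Section 4.2.1), correlations are ranked by their signed value, and the probe names and expression values are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mutual_ranks(expr: Dict[str, List[float]]) -> List[Tuple[float, str, str]]:
    """Score every probe pair; a lower score means a more reliable pair."""
    probes = sorted(expr)
    corr = {
        (a, b): pearson(expr[a], expr[b])
        for a in probes
        for b in probes
        if a != b
    }
    # rank[(a, b)] is the rank of b among all partners of a (1 = most correlated).
    rank: Dict[Tuple[str, str], int] = {}
    for a in probes:
        partners = sorted((b for b in probes if b != a), key=lambda b: -corr[(a, b)])
        for position, b in enumerate(partners, start=1):
            rank[(a, b)] = position
    # Assumed score: geometric mean of the two reciprocal ranks.
    scored = [
        (math.sqrt(rank[(a, b)] * rank[(b, a)]), a, b)
        for a, b in combinations(probes, 2)
    ]
    return sorted(scored)


def top_edges(scored: List[Tuple[float, str, str]], n_edges: int) -> List[Tuple[str, str]]:
    """Build a network from the n_edges most reliable probe pairs."""
    return [(a, b) for _, a, b in scored[:n_edges]]


# Hypothetical expression profiles over five samples.
toy_expr = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "p2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "p3": [5.0, 3.0, 4.0, 1.0, 2.0],
    "p4": [4.0, 3.5, 4.2, 1.5, 2.1],
}
print(top_edges(mutual_ranks(toy_expr), 2))
```

In practice the cut-off `n_edges` would be chosen from the FDR curve rather than fixed in advance.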

[Figure 4.6 plot. x-axis: Number of overlapping GO Terms (0 to 200); y-axis: Cumulative Probability (up to 1.0); series: Real, Gamma, Normal, Uniform.]

Figure 4.6: The x-axis indicates the number of overlapping GO terms and the y-axis is the cumulative probability. The x-axis is truncated at 220 for visualization purposes.
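The GO term overlap score behind Figure 4.6 can be illustrated with a short sketch. It counts the terms shared by two genes after each annotation set has been extended with all parent terms, and then builds the empirical cumulative distribution of the scores. The term identifiers, gene names, and the tiny ontology below are hypothetical placeholders, not real GO entries.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def with_ancestors(terms: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestor terms."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def go_overlap(
    gene_a: str,
    gene_b: str,
    annotations: Dict[str, Set[str]],
    parents: Dict[str, List[str]],
) -> int:
    """Number of terms shared by two genes, parent terms included."""
    terms_a = with_ancestors(annotations.get(gene_a, ()), parents)
    terms_b = with_ancestors(annotations.get(gene_b, ()), parents)
    return len(terms_a & terms_b)


def empirical_cdf(values: List[int]) -> List[Tuple[int, float]]:
    """Sorted (value, cumulative probability) pairs for plotting."""
    total = len(values)
    running = 0
    cdf = []
    for value, count in sorted(Counter(values).items()):
        running += count
        cdf.append((value, running / total))
    return cdf


# Hypothetical ontology: child term -> list of parent terms.
toy_parents = {
    "T:leaf1": ["T:mid"],
    "T:leaf2": ["T:mid"],
    "T:mid": ["T:root"],
    "T:other": ["T:root"],
}
toy_annotations = {
    "geneA": {"T:leaf1"},
    "geneB": {"T:leaf2"},
    "geneC": {"T:other"},
}
pairs = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC")]
scores = [go_overlap(a, b, toy_annotations, toy_parents) for a, b in pairs]
print(empirical_cdf(scores))
```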

Network Partitioning

To further exploit the association networks, it is desirable to partition the network into smaller pieces, with the overall intent of identifying functional biological modules within the larger network. Many published clustering algorithms can identify densely connected sub-networks in a network [12, 23, 87], which we discussed in detail in Chapter 2. Given that each probe pair can be assigned a weight based on its reliability score, we select three algorithms that can make use of these scores as edge weights. We employ two graph partitioning algorithms (Graclus and Metis) and one clustering algorithm (hierarchical clustering with average linkage) for our analysis. We consider two criteria when picking the number (k) of partitions. First, we would like to obtain sub-networks that are comparable in size to actual biological pathways. Second, the sub-network size should be feasible for use in the next section, MANOVA based data analysis. After a few trials, we pick a partition number of 500, which would result in an average of about 20 nodes per partition if nodes are evenly distributed among partitions. After partitioning, we identify connected components of sub-networks to be used in our MANOVA model.

If the partitioning algorithms could identify functional modules in the network, we would expect that probes of similar functions (thus likely to be co-expressed) would be partitioned into the same sub-network. The GO Term Finder tool is utilized to determine whether sub-networks are enriched in probes with similar GO term annotations and to compare the suitability of the partitioning and clustering algorithms. A plot of the best GO p-values for each of the sub-networks is shown in Figure 4.7. As can be seen from this figure, the Graclus algorithm outperforms the other two algorithms in terms of both the quality and the quantity of significantly annotated sub-networks. The hierarchical clustering method yields the worst performance on this data, likely due to the unbalanced sub-network sizes it produced.

The box plots of the connected sub-network sizes when k was set to 500 are shown in Figure 4.8. It can be observed that the graph partitioning algorithms yield sub-networks that are more balanced in size. By contrast, the hierarchical clustering algorithm tends to produce a few big sub-networks and many singletons.

After our comprehensive experiments, we decide to obtain the final partitioning using the Graclus algorithm and a k value of 500. Among these 500 partitions, we identified 2112 connected sub-networks. The average size of sub-networks is 5 probes per sub-network and the largest two sub-networks have 38 probes each.

Inspection of these sub-networks revealed many well-known areas of biological regulation. These areas include steroid biosynthesis, neuronal development, antigen processing and presentation, cell polarity maintenance, cytoskeletal components, the cell cycle, RNA splicing machinery, and many others. However, it is also worthwhile to recognize the limitations of the underlying experimental data and their effects on network structure when

Figure 4.7: The distribution of the GO enrichment p-values for sub-networks extracted by different algorithms.

interpreting the RGA network and its partitions. One critical feature of the mRNA profiling data generated from the U-133A array is that although many thousands of probes are independently measured within a single sample, significantly fewer independent signals can realistically be resolved from these data. Probe design has led to the intentional (and sometimes unintentional) construction of many redundant probes for RNA transcripts. We estimate that 2358 transcripts represented on the U-133A array have more than one probe set measuring them (data not shown). Although not technically an artifact, networks and network partitions containing these redundant probes have fewer independent genetic components than would be implied by the network connectivity.

The lack of specificity in probe hybridization (cross-hybridization) is another limitation of Affymetrix technology. However, the impact of cross-hybridization on network structure seems to be small. The U-133A array design assists in the identification and interpretation of cross-hybridization artifacts by attaching ‘_s’ and ‘_x’ suffixes in the nomenclature of probes to indicate higher risk of cross-hybridization across isoforms

[Figure 4.8 plot, titled "Boxplots of log(Cluster Size)". x-axis: Algorithms (Graclus, Metis, Hierarchical); y-axis: Log(cluster size).]

Figure 4.8: The box-plot of sub-network sizes generated by selected network partitioning algorithms with k = 500. The algorithms are shown on the x-axis, and the log(sub-network size) values are indicated on the y-axis.

and gene families for these probes. We found that the RGA network is only modestly enriched in probes capable of cross-hybridization (53% versus 47% on the U-133A array). This result suggests that, for the majority of the probes, the cross-hybridization impact at the probe set level is limited.

The above probe design considerations may affect the network construction and our assessment of network robustness in various ways. In the cases where probes within a given network partition are non-redundant, we can somewhat safely ascribe the network to the biology we are trying to discover. The highly co-regulated elements of neuronal development and signaling fall into this category. Probes in this sub-network represent genes including ASTN1, CA11, CDK5R1, CSPG3, CSPG5, CTNNA2, DIRAS2, DNM1, GABRB1, GPM6A, GRIA2, HMP19, HPCAL4, NTRK2, OMG, PPP2R2B, RAB3A, RPIP8, RTN1, RUFY3, SCG3, SLC6A1, SNAP25, SNPH, SV2B, and TUBB4 (Figure 4.9-a). These genes share only minor sequence similarity. Two unannotated probes are also

ID | Manova-p | Significant GO Annotation
1186 | 1.67E-09 | regulation of glutamate-cysteine ligase activity
495 | 4.08E-09 | aminosugar meta. pro., N-acetylglucosamine 6-O-sulfotransf.
1197 | 5.26E-08 | cellular respiration, TCA cycle, mitochondrion
812 | 5.72E-08 | No significant annotation
164 | 7.92E-08 | aromatic amino acid transport, antiporter activity
1520 | 8.10E-08 | 6-phosphofructokinase complex
547 | 8.95E-08 | glycolysis, energy derivation by oxidation of organic compounds
1302 | 2.82E-07 | glucose transport, integral to membrane
1272 | 3.51E-07 | amino acid biosynthetic/metabolic process
1862 | 5.67E-07 | amine metabolic process
568 | 5.96E-07 | steroid meta. pro., cholesterol, xenobiotic meta. pro.
1109 | 1.14E-06 | kinase activity
1707 | 2.61E-06 | protein targeting to ER, endoplasmic reticulum
1832 | 2.98E-06 | cell growth, phosphatase activity
291 | 3.64E-06 | response to chemical stimulus, response to drug

Table 4.1: Top 15 scoring sub-networks for the HIV data. Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’, and ‘Cellular Component’ ontologies.

ID | Manova-p | Significant GO Annotation
1770 | 1.69E-12 | response to extracellular stimulus, cell ion homeostasis
93 | 4.17E-10 | epithelial cell differentiation, phospholipid binding
261 | 8.39E-10 | microtubule associated complex, GKAP/Homer scaffold activity
495 | 4.08E-09 | amino sugar meta. pro., N-acetylglucosamine 6-O-sulfotransferase
2062 | 1.42E-08 | protein phosphatase binding, protein kinase CK2 activity
1123 | 4.84E-08 | acute-phase response
1369 | 7.04E-08 | fibril organization and biogenesis
1272 | 3.37E-07 | amino acid biosynthetic/metabolic process
198 | 3.91E-07 | nitrogen compound catabolic process, organic acid metabolic process
77 | 3.93E-07 | xenobiotic metabolic process
697 | 4.50E-07 | leukocyte activation, immune response
1967 | 5.54E-07 | phosphatidic acid metabolic process
568 | 5.96E-07 | steroid meta. pro., cholesterol homeostasis, xenobiotic meta. pro.
434 | 1.13E-06 | myosin complex, actin cytoskeleton
182 | 2.31E-06 | ciliary or flagellar motility, microtubule-based process

Table 4.2: Top 15 scoring sub-networks for the cigarette smoking data. Sub-networks are annotated with the top GO term hits in ‘Molecular Function’, ‘Biological Process’, and ‘Cellular Component’ ontologies.

present in the sub-network: 213484_at and 213841_at. Interestingly, both probes represent transcripts that are highly expressed in brain tissues (footnote 7).

In other cases, both RNA sequence and functional biological conservation are contained within probes in a sub-network. Such a sub-network is fundamentally confounded, and even when we can give the sub-network a high Gene Ontology overlap significance, we lack the resolution to identify whether the score is due to our capture of biology or to the fact that we are measuring many fewer independent signals within the sub-network than the number of probes would imply. The high degree of both sequence and functional similarity in the ribosome proteins (RPL13A, RPL18, RPL18A, RPL24, RPL28, RPL29, RPL31, RPL35, RPL36, RPL36A, RPL8, RPLP0, RPS13, RPS14, RPS17, RPS2, RPS5, RPS7, RPS9) is one example of this behavior, and is well represented within a sub-graph of the network (Figure 4.9-b).

Finally, although details of biological specimen preparation not intended in study design (in vitro cell confluency, culture temperature variation, etc.) are still valid biological effects, other technical artifacts of sample processing, hybridization, and data acquisition can interfere with the robustness of the network and network partitions. Some of these technical effects can be readily identified in the networks. For instance, a collection of control probes, including the BioB, BioC, BioD, and Cre control oligonucleotides that are spiked into the cRNA mixtures, appears in the resultant network, connected only to each other. Similarly, 18S and 28S ribosomal RNA contaminants carried through the probe preparation process also appear as an isolated sub-network of the total network.
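Two steps of this section lend themselves to a short sketch: splitting each partition into its connected sub-networks, and spotting sub-networks that consist only of control probes. The code below is an illustration, not the pipeline used in this work. It keeps only edges whose endpoints fall in the same partition and then collects connected components, singletons included. The `AFFX-` prefix used to recognize control probes is an assumption about probe naming, and the probe names and partition labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def connected_subnetworks(
    edges: Iterable[Tuple[str, str]], partition: Dict[str, int]
) -> List[Set[str]]:
    """Connected components within partitions.

    `partition` maps every probe to its partition id. Edges that cross
    partition boundaries are ignored, so a component never spans two
    partitions. Probes without any remaining edge become singletons.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        if partition[u] == partition[v]:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in sorted(partition):
        if start in seen:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


def control_only(components: List[Set[str]], prefix: str = "AFFX-") -> List[Set[str]]:
    """Components made up entirely of probes with the assumed control prefix."""
    return [c for c in components if all(name.startswith(prefix) for name in c)]


# Hypothetical network: the b-c edge crosses partitions and is dropped.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("AFFX-x", "AFFX-y")]
toy_partition = {"a": 0, "b": 0, "e": 0, "c": 1, "d": 1, "AFFX-x": 2, "AFFX-y": 2}
toy_components = connected_subnetworks(toy_edges, toy_partition)
print(sorted(sorted(c) for c in toy_components))
print(control_only(toy_components))
```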

MANOVA Based Sub-network Significance Testing

Gene networks have been used in the area of gene function inference and gene-gene interaction/regulation studies [82, 140]. Allocco, Kohane, and Butte have demonstrated that two genes are more likely to share a common transcriptional regulator if they are strongly co-expressed [4]. Since our sub-networks are composed of co-expressed genes, these sub-networks may be treated as a special transcriptional ontology where each sub-network is specifically under the control of one or a combination of transcription factors. For any individual microarray study, if the significantly affected sub-networks can be identified analytically, it will greatly improve our understanding of the underlying biological mechanism. Naturally, sub-networks can be treated as gene sets that can be fed into gene set analysis tools.

Footnote 7: Unigene, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene

[Figure 4.9 panels: (a) Neural Development, (b) Ribosome Proteins.]

Figure 4.9: Example sub-networks. (a) The neuronal development and signaling cluster. (b) The ribosomal protein cluster.

To date, several gene set significance testing approaches have been developed, including GO Term Finder, GSEA, PAGE, and Hotelling’s T-square test [18, 101, 142]. These are discussed in detail in Chapter 2. Here, we explored the use of a new, MANOVA (multivariate analysis of variance) based approach for network/pathway analysis. MANOVA is an extension of ANOVA (analysis of variance) to cover cases where there is more than one dependent variable. It is often used when a set of correlated dependent variables is present and a single, overall statistical test is desired. Hotelling’s T-square test is a special case of MANOVA for data with two groups. MANOVA has the same advantages as Hotelling’s T-square test but can be applied to a wider range of experimental designs, since it can deal with data of any number of groups. Naturally, we can treat each probe as a dependent variable and then derive a single statistic based on MANOVA for the sub-network as a unit. This approach is well suited to our setting because the sub-networks consist of co-expressed probes and MANOVA is specifically designed to deal with correlated dependent variables.

We test our MANOVA model on two published datasets. The first one is the study of HIV protease inhibitors including atazanavir, nelfinavir, and ritonavir [116]. It is well known that patients taking protease inhibitor drugs to treat HIV-AIDS often develop a lipodystrophy-like syndrome with symptoms such as hyperlipidemia, peripheral lipoatrophy, and central fat accumulation [25]. Parker et al. [116] used Affymetrix chip technology to survey gene expression changes after drug treatment. Their analysis indicated that protease inhibitors could induce gene expression changes indicative of dysregulation of metabolism, endoplasmic reticulum (ER) stress, and metabolic . This result is consistent with the clinical observations and provides evidence for a molecular mechanism for the patho- of protease inhibitor induced lipodystrophy.

The top 15 scoring sub-networks with respect to treatment effect are listed in Table 4.1. The model p-values give a relative measurement of the significance of the treatment. One can immediately note that this list includes all the major targets of the HIV protease inhibitors including lipid metabolism, amino acid metabolism, gluconeogenesis, cellular respiration, and endoplasmic reticulum.

The second dataset is from the study on the effects of cigarette smoking on the human airway epithelial cell transcriptome [136]. It is well established that cigarette smoking is the leading cause of lung cancer and other serious pulmonary diseases such as chronic obstructive pulmonary disease (COPD). Spira and colleagues analyzed three groups of samples from current smokers, previous smokers, and nonsmokers. Their results indicated that cigarette smoking could lead to expression changes in genes involved in xenobiotic metabolism, antioxidation, inflammation, cell adhesion function, and oncogenesis, which are consistent with previous studies using different model systems or experimental designs [59, 64]. The top 15 sub-networks identified by the MANOVA approach are listed in Table 4.2. Xenobiotic metabolism, cytoskeleton elements, amino acid and lipid metabolism, and protein kinase and phosphatase activities are found to be significantly differentiated among study groups. In addition, several immune system related responses are also present in the list, which are presumably due to the inflammation response caused by smoking. Again, the MANOVA approach was able to identify all the major effects caused by cigarette smoking.

To compare this approach to a typical analysis flow, i.e., single gene analysis followed by a gene set analysis, we analyze the two datasets using the Ingenuity IPA tool (footnote 8), which uses the Fisher exact test to calculate gene set enrichment scores. For the HIV dataset, a list of 579 affected probes (10% FDR) identified through single gene ANOVA is analyzed by IPA. While lipid and amino acid metabolism were picked up as the main affected pathways, ER stress response and gluconeogenesis were not identified (data not shown). For the cigarette smoking dataset, a list of 1190 affected probes (10% FDR) is analyzed. IPA identified xenobiotic metabolism, cytoskeleton structures, and electron transport as the main affected targets but missed inflammation response completely.
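For contrast with the MANOVA approach, the enrichment step of the typical analysis flow can be sketched as a one-sided Fisher exact test on a 2x2 table, computed here as a hypergeometric tail probability. This is a generic illustration and not a description of how IPA is implemented; the one-sided (over-representation) form, the function name, and the example counts are assumptions.

```python
from math import comb


def fisher_enrichment_p(n_universe: int, n_set: int, n_selected: int, n_overlap: int) -> float:
    """One-sided Fisher exact p-value for over-representation.

    n_universe: all probes that could have been selected
    n_set:      probes belonging to the gene set (e.g., a pathway)
    n_selected: probes called significant by the single gene analysis
    n_overlap:  significant probes that belong to the gene set

    Returns P(X >= n_overlap) where X is hypergeometrically distributed.
    """
    upper = min(n_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 5 of 40 pathway probes among 100 selected out of 2000.
print(fisher_enrichment_p(2000, 40, 100, 5))
```

Because the test only sees counts that pass the single gene cut-off, a gene set whose members each change modestly can contribute an overlap of zero, which is one way to read the pathways missed in the comparison above.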

4.2.3 Discussion

In this study, we show how reciprocal rankings and FDR analysis can be applied to integrate signal from independently conducted microarray experiments into a meta-network, which we named the RGA network. We have used three different algorithms to identify densely connected sub-networks from the network. Examination of the sub-networks indicated that they may represent various units of biological processes, which we can label using the GO annotations.

Footnote 8: www.ingenuity.com

Next, we developed a MANOVA based approach which allows interpreting expression profiling data at the gene set level. We conducted a thorough evaluation of our approach by applying it to two published studies. Our analysis showed that using our model we are able to capture unified signals that are missed by single-gene analysis. The gained sensitivity of our approach may be due to two factors. First, no ranked gene list with an artificial cut-off is used for our model. Therefore, genes having modest changes may be considered together to reveal a significant group effect. Second, instead of an enrichment score that does not reflect the actual expression data, our approach attempts to find the best separation between treatments. Given that a parametric method is generally considered more sensitive than its non-parametric counterpart, we are not surprised that our method generated results that better fit the biological hypothesis.

However, one should be cautious when applying MANOVA and the related Hotelling’s T-square models. There are three assumptions when applying MANOVA. First, the dependent variables should be normally distributed. In general, the more samples in a project, the more likely the assumption is met. Second, the relationships among dependent variables are linear. Since our calculation of co-expression is based on linear correlation, it is likely that genes of the same sub-network follow a linear relationship. However, deviations from a linear relationship will compromise the power of the test. Third, variances and co-variances of dependent variables should be homogeneous across the ranges of independent variables.

Several tests may be applied to check the validity of this assumption [143].

In addition to the aforementioned assumptions, MANOVA is also sensitive to outliers, which may lead to type I or type II error. There are tests available to check for univariate and multivariate outliers. Lastly, MANOVA analysis can only be done when the total number of samples is larger than the size of the gene set being tested. This is because one degree of freedom is lost for each dependent variable added [19]. In other words, large gene sets cannot be tested by MANOVA or Hotelling’s T-square for projects involving small numbers of samples. In fact, this is one of the criteria we used to pick the partition number k so that the sizes of the resulting sub-networks are feasible for MANOVA based hypothesis testing for most projects, which typically have 10 samples or more.

Although in this paper we limited our analysis to co-expression gene sub-networks, we strongly believe that our MANOVA model is expandable to other groups of related genes (e.g., a gene list containing genes mapped to the same chromosomal location). We expect this approach would serve as a novel method for analyzing transcriptional and proteomic profiling data at the gene set level.

CHAPTER 5: Gene Regulatory Networks

Gene regulation is a mechanism that controls the transcription of DNA sequences into mRNA structures. mRNAs are then transferred to the cytoplasm and translated into proteins. During the transcription process, proteins named transcription factors (TFs) bind to the promoter regions of genes and control the access of RNA polymerase to these regions. In addition, having bound to the promoter regions of genes, these regulatory proteins interact with other entities to activate (or suppress) the transcription process. The binding interactions between TFs and their target genes can be represented in the form of regulatory networks. In a regulatory network, nodes refer to regulatory proteins and genes, and an edge directed from a TF to a gene indicates the binding of this TF to the promoter region of that gene in order to activate or suppress the expression of the gene.

The regulation of transcription via the interactions of specific proteins with DNA sequences, known as transcriptional regulation (footnote 9), is the most important mechanism for controlling protein levels. Specific binding of TFs controls the differentiation of progenitor cells into somatic cell types; this binding also regulates the response to cellular stress. Using technologies such as ChIP-chip, ChIP-seq, or DamID, it is possible to measure the DNA binding of transcription factors at a genomic scale, which can be useful to infer regulatory networks [124, 154, 159]. These experiments, however, suffer from high levels of noise leading to the prediction of many false positive and false negative interactions [16, 66]. In addition, even if predicted binding is real, these experiments do not provide direct evidence about the downstream effects of the binding [56].

Footnote 9: Throughout this chapter, whenever we refer to a TF-gene interaction, we imply a physical binding between the TF and the promoter region of the gene.

Previous work has shown that DNA binding of transcription factors may have no effect on the transcription of proximal genes [21, 56, 160]. However, the extent of non-functional binding is still unknown. This lack of knowledge is also due to a methodological complication. It is relatively easy to predict functional binding, because additional information such as the conservation of TF binding sites, expression changes of putative target genes, and others can be used to corroborate that an actual interaction between the TF and the promoter of the predicted target gene is functional. Thereby, the number of false positives can be greatly reduced [16, 66]. After such a filtering procedure one is left with TF-DNA interactions that were measured with, e.g., ChIP-chip, but which have insufficient support from other data sources. These predicted bindings fall in one of two categories: (a) true binding that has no effect on the putative target gene and (b) no real binding, i.e., false positive prediction, due to noise in the DNA binding experiment. Since it is difficult to disentangle these two classes, it is hard to estimate the extent of non-functional binding.

Another equally important question is, given that physical binding to DNA takes place, what makes this binding event (or interaction) functional? Factors that may determine the functionality of TF-DNA binding are (i) distance of the binding site to the transcription start site, (ii) orientation of the TF with respect to the direction of transcription, (iii) local 3D structure of the DNA, and (iv) presence or absence of interacting proteins (co-factors) in the same DNA region.
Importantly, these factors have to be distinguished from other factors influencing the ability of a TF to bind a specific site per se, such as the presence of histones. Such competitive factors only affect the binding efficiency, but do not influence the functionality of the binding. Histone modifications, on the other hand, may affect both the binding itself and its functionality. Usually it is not known which of the above factors control the functionality of a specific TF binding. Once it is possible to distinguish true binding events that are functional from those that are non-functional, we will also be able to investigate the molecular factors distinguishing one from the other.

In this study, we utilize Bayesian logic to first estimate true physical binding events of budding yeast TFs and subsequently separate functional from non-functional binding events based on respective expression data. Here, we define TF binding as functional if it has a specific effect on the transcript levels of its target gene [56]. Since such a downstream effect may be condition specific, the statement ‘binding is functional’ always refers to a specific TF-target gene pair under a specific experimental condition. Therefore, we apply our analysis to a range of stress conditions.

We use condition specific ChIP-chip data as the primary evidence for DNA binding, which we supplement with nucleosome occupancy and TF binding site predictions. Once we have determined that TF-DNA interactions depend on the change in growth condition, we use expression changes of putative target genes corresponding to the same conditions to determine functionality. In other words, we only assume that the binding is functional if the differential binding correlates with the differential expression changes under the same condition. Due to the noise in the data we cannot reliably determine all true binding events under a certain condition. However, we can filter for a set of high confidence interactions and then ask which fraction of those was functional. Hence, our probabilistic framework allows us to determine the fraction of true binding events that have no effect on the expression of proximal genes, i.e., the fraction of non-functional bindings. We show that this fraction is stable when changing the probability thresholds that our analysis requires.

Next, we employ the multi-parametric Random Forests machine learning technique to determine the factors controlling the functionality of TF binding. This analysis reveals that functionality is mainly determined by the presence of specific co-factors. Distance to the target gene and the orientation of the TF also affect the functionality, but to a lesser extent. Importantly, we notice that functionality is determined in a highly combinatorial and hierarchical manner.

From a network-centric perspective, this work contributes to the existing literature in the following directions:

• The inference of regulatory interactions between TFs and their target genes by integrating evidence from ChIP-chip studies, motif data, and nucleosome occupancy of motif binding sites

• The extraction of regulatory interactions under six different conditions in an attempt to construct condition-dependent regulatory networks

• The assignment of semantic meaning (a potential downstream impact) to regulatory interactions by studying changes in gene expression level of TF bound genes
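The general idea of combining several evidence sources into a binding probability, and then asking which fraction of high confidence bindings shows no expression effect, can be illustrated with a deliberately simplified sketch. It is not the probabilistic framework developed in this chapter: it assumes a naive Bayes combination in which each evidence source contributes an independent likelihood ratio, and it calls a binding functional when the target gene's differential expression p-value falls below a threshold. All names, likelihood ratios, priors, and thresholds are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    tf: str
    gene: str
    likelihood_ratios: Sequence[float]  # one ratio per evidence source
    expr_p: float  # p-value of differential expression of the target gene


def posterior_binding(prior: float, likelihood_ratios: Sequence[float]) -> float:
    """Posterior probability of true binding under a naive Bayes assumption."""
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


def nonfunctional_fraction(
    candidates: List[Candidate],
    prior: float,
    min_posterior: float,
    max_expr_p: float,
) -> Optional[float]:
    """Fraction of high confidence bindings without an expression effect.

    Returns None when no candidate passes the binding threshold.
    """
    confident = [
        c
        for c in candidates
        if posterior_binding(prior, c.likelihood_ratios) >= min_posterior
    ]
    if not confident:
        return None
    silent = sum(1 for c in confident if c.expr_p > max_expr_p)
    return silent / len(confident)


# Hypothetical candidates; ratios stand for ChIP-chip, motif, and nucleosome evidence.
toy_candidates = [
    Candidate("TF_A", "gene1", (20.0, 3.0, 2.0), 0.001),
    Candidate("TF_A", "gene2", (15.0, 4.0, 1.5), 0.600),
    Candidate("TF_B", "gene3", (0.5, 1.0, 1.0), 0.010),
    Candidate("TF_B", "gene4", (30.0, 2.0, 2.0), 0.020),
]
for threshold in (0.8, 0.9):
    print(threshold, nonfunctional_fraction(toy_candidates, 0.1, threshold, 0.05))
```

Repeating the calculation over a range of thresholds, as in the final loop, is a simple way to check whether the estimated fraction is stable.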

5.1 Datasets

5.1.1 ChIP-chip

ChIP-chip studies detect genome-wide interactions between gene sequences and regulatory proteins, i.e., transcription factors. The ChIP-chip platform is critical in investigating the function of genomes and the proteins they encode. However, this platform is not enough to deduce all regulatory interactions, and eventually regulatory pathways, for several reasons. First, an observed protein-DNA interaction may relate to some cellular function other than gene expression. Moreover, the statistical analysis of the huge amount of data generated from these arrays is a challenge, where artifacts of the technology used might lead to erroneous interpretations.

TF-DNA binding for transcription factors of the budding yeast (Saccharomyces cerevisiae) has been profiled under normal growth conditions (rich media, YPD) [66, 79, 93] and under different stress conditions [66, 160]. We utilized ChIP-chip measurements of the genome-wide TF binding locations from the Harbison et al. study [66], which identified genome-wide binding locations for 203 TFs in YPD and 84 TFs in one or more stress conditions. In addition, we also employed ChIP-chip data from Workman et al. [160], which profiled 30 TFs after DNA damaging stress induced by methyl-methane sulfonate (MMS) treatment. In total we compared protein-DNA binding profiles from YPD and 6 environmental stress conditions, where mRNA expression responses to the same environmental conditions were also available [57]. In both cases, we made use of the binding p-values as calculated by the original studies. ChIP-chip experiments employed for our analysis are listed in Table 5.1.

5.1.2 Gene Expression Data

Microarray technology enables measuring the mRNA levels of a particular cell, at a particular time, under a particular experimental condition. Cluster analysis is often used to identify genes whose expression levels are correlated across numerous experiments. These co-expressed gene clusters are employed to infer regulatory modules, for example genes that share common promoter sites. However, clustering gene expression datasets to infer regulatory modules has its limitations. The intensity of a probe signal does not necessarily reflect a biological response and depends on a multitude of artifacts, such as measurement errors and technical variations. Moreover, due to post-transcriptional regulation, a transcription factor and its target genes might not be co-expressed, which means that even if a microarray experiment can identify co-regulated genes, it will miss the regulators. Therefore, although extremely informative for understanding transcriptional interactions, gene expression datasets by themselves contribute little to this task.

Gene expression profiles measured under 6 stress conditions relative to the expression in normal growth conditions (YPD) were employed for our analysis [57]. Log ratios of gene expression from all existing replicates for each of the studied stress conditions were considered. In some cases, measurements taken at comparable time points after the treatment were regarded as replicates of the same experiment after an initial evaluation of the profile similarities and overall quality. These log ratios were then used to calculate a p-value of differential expression for each gene and each stress condition using a t-test. To account for the small number of replicates, a t-test based on the moderated t-statistic [133] is used. The moderated t-statistic uses a variance estimate over many genes, instead of a single gene, and has been shown to be more robust for small sample sizes [133]. Details of the studied gene expression datasets can be found in Table 5.1.

Table 5.1: Condition specific binding data

Condition        ChIP-chip   #TFs   Gene Expression   #Assays
aa Starvation    [66]        34     [57]              5
Heat Shock       [66]        6      [57]              8
H2O2 high        [66]        28     [57]              9
Galactose        [66]        4      [57]              2
Raffinose        [66]        1      [57]              2
MMS Treatment    [160]       30     [160]             4
YPD              [66]        203    -                 -
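The moderated t-statistic described in Section 5.1.2 can be sketched as follows. This is a minimal illustration, not the implementation used in [133]: it treats the log ratios of one gene as a one-sample test against zero, and it takes the prior degrees of freedom and prior variance (`prior_df`, `prior_var`) as given inputs, whereas in practice they are estimated from the variances of many genes. All function and parameter names are assumptions made for this sketch.

```python
import math
from statistics import mean, variance


def moderated_t(log_ratios, prior_df, prior_var):
    """One-sample moderated t-statistic for H0: mean log ratio = 0.

    prior_df and prior_var are assumed hyperparameters here; an
    empirical-Bayes procedure would estimate them across all genes.
    """
    n = len(log_ratios)
    d = n - 1                      # residual degrees of freedom for this gene
    s2 = variance(log_ratios)      # per-gene sample variance
    # Shrink the per-gene variance toward the shared prior variance.
    s2_post = (prior_df * prior_var + d * s2) / (prior_df + d)
    t = mean(log_ratios) / math.sqrt(s2_post / n)
    return t, prior_df + d         # statistic and its degrees of freedom


# Hypothetical replicate log ratios for one gene under one stress condition.
t_stat, df = moderated_t([1.0, 2.0, 3.0], prior_df=2, prior_var=3.0)
```

With `prior_df=0` the function reduces to the ordinary one-sample t-statistic, which makes the effect of the variance shrinkage easy to compare. Converting the statistic into a p-value requires the t-distribution with the returned degrees of freedom, which is left out here because it is not part of the Python standard library.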

5.1.3 TFBS Data

Transcription factor binding sites (TFBSs) are short DNA sequences, frequently located in proximity to their target genes. Transcriptional regulation is mainly accomplished by transcription factors bound to gene sequences at these binding sites. Various computational methods have attempted to predict new target genes for known transcription factors based on their known DNA binding sequences. One of the major drawbacks of most current methods is that they predict many spurious binding sites (false positives), which resemble the actual binding site in sequence but are not known to function as such. TFBS data can nevertheless be informative as a complementary source in deducing functional interactions between genes and proteins.

We employed TFBS data of Saccharomyces cerevisiae in order to determine the potential binding sites for a given TF. We compiled position specific scoring matrices (PSSMs) from public databases, which provide evidence about the preferred binding sequences of a TF [16]. Using these existing models of transcription factor DNA-binding specificity, we predicted TF binding sites for 111 TFs. A log-likelihood score distribution for each PSSM is determined using the sequence scoring feature of the ANN-Spec tool [161] over all possible sites in the yeast genome. Using this empirical probability density function, we estimated p-values for each TFBS from the PSSM log-likelihood scores. This allowed us to ensure that the expected rate of predicted binding sites was $< 10^{-3}$. Accordingly, we identified all significant PSSM hits for the 111 TFs with a TFBS score smaller than the score/p-value threshold. Based on our prior knowledge of the typical base-pair length of budding yeast genes, TFBS hits occurring within 800 base pairs (bp) upstream and 200 bp downstream of a gene’s start codon were considered as potential promoter binding sites for that gene. When multiple motif hits exist between a PSSM and a gene, the TFBS with the most significant score is considered as the primary TFBS hit. Binding site scores are then used in our Bayesian framework as supporting evidence for a physical interaction. The orientation of a TFBS and its distance from the start codon are later used as predictive variables to explain the impact of binding geometry on the functionality of TF-gene interactions.
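The three filtering steps in this subsection (empirical p-values from a background score distribution, the promoter window around the start codon, and the choice of a primary hit) can be sketched as below. This is an illustrative sketch under stated assumptions, not the ANN-Spec procedure: it assumes that a higher log-likelihood score means a better motif match, that positions are genomic coordinates on a known strand, and all function names are hypothetical.

```python
from bisect import bisect_left


def empirical_p_value(score, sorted_background):
    """Fraction of background scores >= score (assumes higher = better match)."""
    n = len(sorted_background)
    n_at_least = n - bisect_left(sorted_background, score)
    return n_at_least / n


def in_promoter_window(site_pos, start_codon_pos, strand,
                       upstream=800, downstream=200):
    """True if the site lies within [-upstream, +downstream] bp of the start codon."""
    sign = 1 if strand == "+" else -1
    offset = (site_pos - start_codon_pos) * sign   # negative = upstream
    return -upstream <= offset <= downstream


def primary_hit(hits):
    """Most significant hit among (position, p_value) pairs for one PSSM and gene."""
    return min(hits, key=lambda hit: hit[1])


# Hypothetical background scores collected over all candidate sites.
background = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
p = empirical_p_value(9, background)
```

On the minus strand the sign of the offset is flipped, so a site at a higher coordinate than the start codon counts as upstream.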

5.1.4 Nucleosome Occupancy Data

In a eukaryotic cell, DNA is wrapped into a complex structure called chromatin inside the nucleus. This structure is composed of DNA nucleotides organized around the eight-histone protein complex, which is known as the nucleosome. Revealing the nucleosome organization of a genome is important for understanding the impact of chromatin structure on gene regulation. It has been shown that the nucleosome occupancy of the DNA sequence around functional TFBSs is remarkably lower [36]. Based on this, we employed Lee et al.’s experimentally obtained atlas of nucleosome occupancy [94] to identify potential binding sites with low nucleosome occupancy. In this study [94], nucleosome binding is profiled using 25-nucleotide probes spaced every 8 base pairs on both the Watson and Crick strands of the complete genome sequence. For each base pair of the yeast genome, we averaged all measurements covering that base pair and generated a mean nucleosome occupancy map of the yeast genome. This map is then used to calculate an average occupancy score for each TFBS.

After aligning the average nucleosome occupancy of yeast genes with respect to their start codon, we observed a significant depletion just before the AUG codon, which is the DNA region that is enriched in promoters (black line in Figure 5.1). Moreover, this figure indicates that the promoter regions of known TF binding sequences show a more significant depletion in average nucleosome occupancy scores than the full set of open reading frames (ORFs) of Saccharomyces cerevisiae. As evident from this figure, the average nucleosome occupancy is high at the translation start site (TLS) in the presence of a binding site for a TF (ABF1 and FHL1 are shown in our example, but the same trend is observed for all studied TFs for which we have a sufficient number of binding and functionally binding ORFs). This is in accordance with our expectation that binding sites are depleted of (not blocked by) the nucleosome structure. If the binding motif is covered by a nucleosome, it is not accessible and there should be no binding. Similarly, if there is binding, there is no nucleosome, regardless of whether or not the binding is functional. Accordingly, we do not observe a significant difference between functional binding sites and binding sites in terms of their nucleosome depletion. Therefore, we believe that nucleosome data do not help to distinguish between functional and non-functional binding sites. However, they are a good indicator of a binding region. Based on this observation, we integrated the nucleosome occupancy of TFBSs with evidence from the TFBS and ChIP-chip datasets as predictive evidence for determining potential binding sites.
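The per-base averaging of overlapping probe measurements and the per-site occupancy score described in this subsection can be sketched as follows. This is a simplified illustration under assumptions: probes are given as hypothetical (start, value) pairs with 0-based start positions on a single linear sequence, and strands and chromosomes are ignored.

```python
def mean_occupancy_map(probes, genome_length, probe_length=25):
    """Average all probe values covering each base pair.

    probes: iterable of (start, value); positions without coverage map to None.
    """
    totals = [0.0] * genome_length
    counts = [0] * genome_length
    for start, value in probes:
        for pos in range(start, min(start + probe_length, genome_length)):
            totals[pos] += value
            counts[pos] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


def site_occupancy(occupancy, site_start, site_end):
    """Mean occupancy over [site_start, site_end), skipping uncovered positions."""
    covered = [v for v in occupancy[site_start:site_end] if v is not None]
    return sum(covered) / len(covered) if covered else None


# Toy example: two overlapping probes of length 4 on a 6 bp sequence.
occupancy = mean_occupancy_map([(0, 1.0), (2, 3.0)], genome_length=6, probe_length=4)
score = site_occupancy(occupancy, 1, 4)
```

Because probes of length 25 are spaced every 8 bp, most positions are covered by about three probes per strand, so the averaging smooths the signal of individual probes.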

5.1.5 Training and Validation Data

In order to fit and validate the various estimates used in this work, 1324 high-confidence regulatory interactions were obtained from the Incyte YPD Database, a curated, literature-derived data repository (http://www.incyte.com). This data set includes many of the known regulatory interactions between TFs and genes of Saccharomyces cerevisiae. These interactions were used as a positive control set for calculating both the binding and the regulatory response probabilities. As the negative control data set, we employed random TF-gene interactions that are enhanced by a low co-citation criterion [16].

5.2 Methodology

To reduce the number of false positive and false negative binding predictions, we integrated evidence from multiple sources to calculate the probability of a TF binding to the promoter region of a gene under an experimental condition c, $P^{c}(B)$ (Figure 5.2). Importantly, we only use evidence for binding; at this step we exclude evidence such as expression data, which is predictive of the functionality of an interaction. ChIP-chip profiles generated under a particular experimental condition, transcription factor binding site (TFBS) data, and nucleosome occupancy of TFBSs are used as binding evidence. Using these datasets, a composite likelihood ratio of binding is calculated based on the conditional independence assumption between these three predictive sources. Subsequently, the composite likelihood ratios are converted into posterior odds and finally into posterior probabilities by using a prior odds estimate that is derived from the validation data statistics. In summary, this step of our methodology aims to reliably predict binding of a regulatory protein to its targets under different experimental conditions.

Figure 5.1: The average nucleosome occupancy score of the Saccharomyces cerevisiae genome aligned with respect to the translation start sites of all studied Saccharomyces cerevisiae open reading frames (ORFs) (7048 ORFs). We also identify the ORFs that are known to be binding (and functionally binding) to an example TF (ABF1 and FHL1 in this example).

We analyzed gene expression profiles along with this binding evidence to discriminate functional from non-functional binding of a transcription factor. A second probability, the probability of a transcriptional response to a change in growth condition (e.g., YPD to heat shock), is estimated for this purpose. This probability of functional response is calculated for each gene and for each stress condition based on the likelihood ratio obtained using the Bayesian formula and our training datasets.

We estimated binding probabilities in two different growth conditions, $P^{c_1}(B)$ and $P^{c_2}(B)$, as well as the response of the gene expression levels to the change in growth condition from $c_1$ to $c_2$. As an example, $c_1$ can be a stress condition and $c_2$ the normal growth condition. These probabilities were further used to categorize TF-target gene (TF-TG) bindings as regulatory or non-regulatory, i.e., functional (FB) versus non-functional (NFB). If a TF binds to or is released from the promoter region of a gene only when the growth condition has changed, the expression of this gene in response to this environmental change can be used to determine the regulatory implications of this binding. By studying the binding of a TF to the promoter region of a gene under two conditions as well as the accompanying expression change of that gene, we obtain two ranked lists of TF-gene bindings, one for the most functional and one for the least functional bindings. Next, we analyzed the top ranked functional and non-functional bindings to reveal biological factors that might play a role in determining the regulatory impacts of these bindings. The distance of the binding site from the target gene, the orientation of the binding site, and the binding of other TFs to the promoter region of the same gene are all considered as potential factors that might determine functional binding. We then used a feature selection algorithm based on the Random Forests classification algorithm to identify the discriminatory factors for the functionality of TFBSs and their corresponding TF-gene interactions. Our overall framework is summarized in Figure 5.2.

[Figure 5.2 diagram. Growth condition dependent inputs: ChIP-chip (YPD) and ChIP-chip (stress), yielding $P^{YPD}(B)$ and $P^{S}(B)$. Functional output: expression (YPD vs. stress), yielding $P(F)$. Growth condition independent inputs: PSSM-based TFBS hits within -800 to 200 bp of the target gene, and nucleosome occupancy*.]

Figure 5.2: The overall framework of the proposed methodology. *Though nucleosome occupancy is known to be condition dependent, it is treated as condition independent for this study.

5.2.1 Probability of Promoter Binding

The Bayesian formulation enables us to update our prior belief that a TF t binds to a gene g into a posterior probability after observing the evidence from predictive sources. More formally, we are interested in calculating the posterior probability below.

$$P(\text{Binding} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Binding})\, P(\text{Binding})}{P(\text{Data})} \tag{5.1}$$

This formula can be stated in terms of posterior and prior odds. The posterior odds for the physical binding of a TF to the promoter region of its target gene ($O_{posterior}$) can be calculated as the product of the prior odds ($O_{prior}$) and the likelihood ratio, $LR$. The prior odds quantify the chance of interaction for a given TF-target pair when all pairs are considered and can be defined as $\frac{P(B=1)}{P(B=0)}$, where $P(B=1)$ is the probability of a physical interaction. Accordingly, the posterior odds that a TF-target pair constitutes a binding interaction given predictive evidence can be defined as follows:

$$O_{posterior} = \frac{P(B = 1 \mid E_1, \ldots, E_n)}{P(B = 0 \mid E_1, \ldots, E_n)} = O_{prior} \cdot LR \tag{5.2}$$

Here, $E_i$ represents the value of the TF-target pair for the $i$-th evidence and $LR$ refers to the composite likelihood ratio, which can be defined as:

$$LR = \frac{P(E_1, \ldots, E_n \mid B = 1)}{P(E_1, \ldots, E_n \mid B = 0)} \tag{5.3}$$

The problem with computing the likelihood ratio using the above equation is that it requires us to estimate many probabilities from limited training data. To overcome this issue, we assume that features (evidence sources) are conditionally independent of each other given the value of $B$. This assumption is often referred to as the independent feature model or as the Naive Bayes assumption. According to this assumption, the simplified likelihood ratio can be written as:

$$LR = \prod_{i=1}^{n} \frac{P(E_i \mid B = 1)}{P(E_i \mid B = 0)} \tag{5.4}$$

The assumption of conditional independence, which is also the basis of Naive Bayes classification, may seem to be an oversimplification for many complex real-world situations, including the current one. However, as demonstrated by a number of studies, both empirical and theoretical, the performance of Naive Bayes classifiers has been surprisingly good even in domains where this assumption is known to be a gross simplification [45]. Additionally, recent efforts have shown that Naive Bayes models can also be effective for probabilistic estimation and inference [100], suggesting that using such a model to estimate likelihood ratios like the one we describe may be reasonably effective.

Using our probabilistic model, we integrated complementary evidence for protein-DNA binding from TFBS data [161], nucleosome occupancy data [94], and ChIP-chip data [66, 160]. Based on these three sources of evidence, we are interested in computing the probability of binding in a particular condition for a given TF t and a gene g. ChIP-chip binding p-values provide evidence about the physical interactions between t and g. The rank of a TFBS hit is another informative source in terms of the existence of a real binding. Moreover, nucleosome occupancy of the motif site also has an impact on TF-gene binding. So, in our model, we separately derive empirical likelihood ratio distributions for TFBS hit scores, average nucleosome occupancy of TF binding sites, and ChIP-chip p-values. These three likelihood ratio distributions are used to calculate a composite likelihood ratio for any given $\langle t, g \rangle$ pair. Later, this likelihood ratio is used to calculate the posterior probability of binding for that pair. With this methodology, we calculated binding probabilities for TF-TG pairs under 6 stress conditions and the normal growth condition (rich media, YPD).
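The conversion from per-source likelihood ratios to a posterior binding probability (Equations 5.2 and 5.4) can be sketched as follows. This is a minimal illustration: the prior odds and the three likelihood ratios in the example are hypothetical numbers, not values estimated in this work, and the function names are assumptions.

```python
import math


def composite_lr(likelihood_ratios):
    """Product of per-evidence likelihood ratios (Naive Bayes, Equation 5.4)."""
    return math.prod(likelihood_ratios)


def posterior_probability(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds * LR (Equation 5.2), returned as a probability."""
    odds = prior_odds * composite_lr(likelihood_ratios)
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for ChIP-chip, TFBS score, and nucleosome occupancy.
p_binding = posterior_probability(prior_odds=0.01, likelihood_ratios=[10.0, 5.0, 2.0])
```

In this toy case the composite likelihood ratio of 100 exactly offsets prior odds of 1:100, giving a posterior probability of 0.5. A source whose likelihood ratio is 1 leaves the posterior unchanged, which is how an uninformative measurement behaves in this model.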

5.2.2 Probability of Transcriptional Response

Next, we calculated the probability of an expression response of a gene based on the change of its expression level in a stress condition relative to the normal growth condition. Here, we aim to assess the likelihood ratio of a gene being active (induced or repressed) under this stress condition relative to the unstressed situation. We again calculate the likelihood ratios using the positive and negative training datasets to derive the posterior probabilities. The likelihood ratio for expression response is calculated based on the p-values of the differential expression test ($\Delta E$) that we explained in Section 5.1.2 ($P(F)$ in Figure 5.2). Similar to our probability of binding calculations, we learned a likelihood ratio distribution using the Bayesian model and training data from the Incyte YPD database. Next, this distribution and prior estimates are used to calculate the probability of expression response for any given gene g.

5.2.3 Characterization of Binding Events

In order to study context-dependent TF-target interactions, we analyzed ChIP-chip datasets generated under 6 stress conditions and the normal growth condition. Using our Bayesian model and the context-dependent ChIP-chip p-values together with the context-independent evidence, i.e., nucleosome occupancy and TFBS, we calculated binding probabilities under each of the studied conditions. These binding probabilities are then used to identify condition-dependent changes in binding. Evidence for dynamic TF binding is then compared to the functional output of the putative regulatory events by analyzing the changes in gene expression levels. To reveal functional bindings, our analysis is derived from the sample space tabulated below for the binding of a TF to its targets in two different conditions and the mRNA abundances compared between these two conditions. This table summarizes all possible binding and differential expression events that might take place when the living conditions of the biological system change from YPD to stress.

Table 5.2: Condition Dependent Binding Events

Bypd   Bstress   ∆E   Semantics
0      0         0    No Binding             No Response
0      0         1    No Binding             Functional Response
0      1         0    Differential Binding   No Response
0      1         1    Differential Binding   Functional Response
1      0         0    Differential Binding   No Response
1      0         1    Differential Binding   Functional Response
1      1         0    Constant Binding       No Response
1      1         1    Constant Binding       Functional Response

Here, we assume that a change in binding status accompanied by a change in expression can be considered as functional. Conversely, dynamic binding events with no change in the gene’s transcript level can be viewed as non-functional. However, when binding is constant across two conditions, it is not easy to associate the functional response of the gene (or lack of response) with the binding event. This may be due to other factors or protein modifications that may modulate the regulatory activity even though the TF appears to remain bound both before and after stress. Therefore, we focused only on differential binding events. If differential binding to a promoter is observed in addition to a change in this gene’s expression level (cases $(B_{ypd}, B_{stress}, \Delta E) \in \{(0, 1, 1), (1, 0, 1)\}$ in Table 5.2), we labeled the corresponding binding as functional. On the other hand, if differential binding to a gene is observed but the gene’s expression level does not change significantly, this is considered as evidence of non-functional binding (cases $\{(0, 1, 0), (1, 0, 0)\}$ in Table 5.2).
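The event semantics of Table 5.2 and the FB/NFB labeling rule above can be written as a small lookup. This sketch assumes that binding and expression response have already been reduced to 0/1 flags; the function names are hypothetical.

```python
def classify_event(b_ypd, b_stress, delta_e):
    """Map a (B_ypd, B_stress, dE) triple of 0/1 flags to its Table 5.2 semantics."""
    if b_ypd == b_stress:
        binding = "Constant Binding" if b_ypd else "No Binding"
    else:
        binding = "Differential Binding"
    response = "Functional Response" if delta_e else "No Response"
    return binding, response


def functional_label(b_ypd, b_stress, delta_e):
    """Return 'FB' or 'NFB' for differential binding events, None otherwise."""
    if b_ypd == b_stress:
        return None          # constant or absent binding is not labeled
    return "FB" if delta_e else "NFB"


label = functional_label(0, 1, 1)   # bound only under stress, expression changes
```

Returning `None` for constant or absent binding mirrors the decision to restrict the analysis to differential binding events.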

These definitions, along with the previously defined binding and expression response probabilities, allow us to rank TF-target bindings in terms of their functionality and non-functionality. Accordingly, we define the four scores below to identify the most functional and the most non-functional TF-target bindings.

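• Functional Binding in YPD Score: $S_{101} = P^{ypd}(B) \cdot P^{s}(\bar{B}) \cdot P(\Delta E)$

• Functional Binding in Stress Score: $S_{011} = P^{ypd}(\bar{B}) \cdot P^{s}(B) \cdot P(\Delta E)$

• Non-functional Binding in YPD Score: $S_{100} = P^{ypd}(B) \cdot P^{s}(\bar{B}) \cdot P(\overline{\Delta E})$

• Non-functional Binding in Stress Score: $S_{010} = P^{ypd}(\bar{B}) \cdot P^{s}(B) \cdot P(\overline{\Delta E})$

Here a bar denotes the complementary event, i.e., $P^{c}(\bar{B}) = 1 - P^{c}(B)$ and $P(\overline{\Delta E}) = 1 - P(\Delta E)$, so that the subscript of each score matches the corresponding $(B_{ypd}, B_{stress}, \Delta E)$ row of Table 5.2.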

Prior to score calculation, each of the probability distributions is equal-depth normalized (p-values based on ranks) separately to limit the impact of our prior estimates and of the variability in the ranges of the probability distributions on the final scores. Based on these four scores, we can identify potential functional and non-functional bindings between a TF and its target genes.
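The rank-based normalization and the four scores can be sketched as follows. Two assumptions are made here and should be read as such: a "0" in a score subscript is taken to mean the complementary probability (for example, not bound under stress), following the index convention of Table 5.2, and equal-depth normalization is implemented as the fraction of values less than or equal to each value. The function names and example probabilities are hypothetical.

```python
from bisect import bisect_right


def equal_depth_normalize(values):
    """Replace each value by its rank-based quantile in (0, 1]; ties share a rank."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


def binding_scores(p_ypd, p_stress, p_de):
    """Four scores indexed by (B_ypd, B_stress, dE); 0 means the complement."""
    return {
        "S101": p_ypd * (1 - p_stress) * p_de,
        "S011": (1 - p_ypd) * p_stress * p_de,
        "S100": p_ypd * (1 - p_stress) * (1 - p_de),
        "S010": (1 - p_ypd) * p_stress * (1 - p_de),
    }


# Hypothetical normalized probabilities for one TF-gene pair.
scores = binding_scores(p_ypd=0.9, p_stress=0.1, p_de=0.8)
```

In this example the pair is likely bound in YPD, likely released under stress, and its expression likely changes, so `S101` dominates the other three scores.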

5.2.4 Determining Factors that Explain Functional Binding

Next, we aimed at determining factors that explain the difference between functional and non-functional binding events of a given TF. Our prior analysis of the motif binding locations with respect to their distance from the AUG codon of genes revealed that there are specific binding locations preferred by different TFs. After examining all transcription factors that were studied under the normal growth conditions, we observed a clear separation of TFs in terms of the distance of their motif site from the translation start site (TLS). As can be seen in Figure 5.3, there exists a group of TFs whose binding sites are over-represented in the upstream region of the genes, such as the YAP proteins. In contrast, another set of TFs, such as MSN2 and MSN4, prefer to bind downstream of genes. This clear difference in their binding site preferences could not be explained by the structural differences of TFs (second column in Figure 5.3) or their impact on transcription, inducing or repressing (first column in Figure 5.3). Although the rationale behind these clear differences in the binding location preferences of TFs could not be explained with the current data, we hypothesize that the geometry of binding (location and direction of binding) plays an important role in determining the downstream effects of TF binding to DNA regions.

In addition, it is now well known that TFs act in groups in order to initiate transcription. Therefore, we considered the following information as potential biological factors that might explain the functionality of binding:

• Distance: the distance of the binding site with respect to the start codon of the gene

• Orientation: the binding orientation of the TFBS with respect to the direction of transcription

• Co-factors: the presence or absence of other TFs bound to the same promoter

The co-factor information is obtained from the previously calculated binding probabilities since they are a better predictor of binding than single data sources. However, note that not every TF is studied under every condition, which limits our analysis to the coverage of existing ChIP-chip studies. Since these factors are likely to act in parallel and in a combinatorial manner, we have employed a multivariate method for determining the individual importance of each factor.

We have also tested univariate methods, and the results are generally in agreement with the multivariate method. However, since our multivariate approach includes factors that are known to influence gene regulation, e.g., co-transcription factors, we will focus on those results in our work.

The Random Forests classification method [20] is used to determine features explaining the difference between functional and non-functional binding, either alone or in combination with other factors. Random Forests is an ensemble technique that combines individual classification trees into a forest of classification trees. Each individual tree is constructed from a bootstrap sample of the original data, and each splitting feature in the tree is chosen among a small random subset of the original predictor variables. Random Forests have also been shown to be effective in identifying relevant predictor variables in a classification task. We employed an alternative implementation of Random Forests that eliminates the bias in variable selection when potential predictor variables vary in their scale of measurement and in their number of categories [139]. Using this Random Forests algorithm, the variables are sorted according to their importance. The importance of each variable is calculated with the ‘permutation accuracy importance’ measure [20].

Figure 5.3: TFs are clustered with respect to the distance of their motif hit site from the TLS of genes. This clustering could not be explained using the nature of binding (column 1) or the structural classification of TFs (column 2).
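The idea behind the permutation accuracy importance measure can be illustrated with a model-agnostic sketch: shuffle one predictor at a time and record how much the classification accuracy drops. This is a simplification and not the implementation of [20] or [139]: it evaluates on the supplied rows rather than on out-of-bag samples, and `predict` is a hypothetical stand-in for a trained forest. The toy features (distance, orientation, co-factor flag) and their values are invented for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: (distance in bp, orientation flag, co-factor flag) -> functional (1) or not (0).
rows = [(100, 1, 0), (150, 0, 1), (200, 1, 1), (250, 0, 0),
        (400, 1, 0), (500, 0, 1), (600, 1, 1), (700, 0, 0)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in classifier that only looks at the distance feature."""
    return 1 if row[0] < 300 else 0


importances = permutation_importance(toy_model, rows, labels)
```

Because the stand-in model ignores orientation and co-factors, shuffling those columns cannot change its predictions, so only the distance feature receives a positive importance in this toy case.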

5.3 Results and Discussion

5.3.1 Prediction of Condition-Specific Promoter Binding

We calculated binding probabilities based on three sources of evidence under the normal growth condition (YPD) and the 6 stress conditions listed in Table 5.1. In order to assess the predictive power of the binding probabilities, we first used them to predict known transcription factor-target gene interactions from the Incyte YPD database using a 5-fold cross-validation approach. For each TF-target pair, the posterior probabilities are calculated based on the Bayesian formula. For varying posterior probability cut-offs, the ROC (Receiver Operating Characteristic) curves of our predictions are generated using different combinations of the three types of evidence, as shown in Figure 5.4. An ROC curve graphically displays the true positive rate versus the false positive rate across a range of posterior probability cut-offs. These ROC curves are generated based on ChIP-chip profiles measured under the YPD condition, as it is the most comprehensive ChIP-chip source we have.

The figure indicates the anticipated result, i.e., that by combining these three sources in a model, we can better predict the existence of a physical interaction. When considering the area under the ROC curve (AUC) as a metric for predictive power, we observe an 11% to 17% improvement in the overall AUC score obtained by our integrated model (AUC = 0.77) compared to the scores obtained by models based on single sources of evidence alone.

[Figure 5.4 plots. Panel A: ROC curves (sensitivity vs. 1−specificity) for ChIP (C), NO (N), TFBS (T), N+T, C+N, C+T, and C+N+T. Panel B: AUC for the same evidence combinations.]

Figure 5.4: (A) ROC curves for TF-target predictions based on individual (TFBS(T), ChIP- chip(C), and Nucleosome Occupancy(N)) and integrated evidence and (B) area under ROC curve (AUC scores) generated by 5-fold cross validation.
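For readers less familiar with ROC analysis, the two quantities used here can be computed as sketched below. The AUC is calculated through its rank interpretation, the probability that a randomly chosen known interaction receives a higher posterior probability than a randomly chosen negative pair, with ties counted as one half. The scores and labels in the example are hypothetical and the function names are assumptions.

```python
def roc_point(scores, labels, cutoff):
    """(false positive rate, true positive rate) when predicting score >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


def roc_auc(scores, labels):
    """AUC as P(positive outscores negative), counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical posterior probabilities and known labels (1 = known interaction).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

The pairwise formulation is quadratic in the number of pairs, which is acceptable for a sketch; a sort-based computation would be preferable for large validation sets.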

It has been shown that nucleosome occupancy is remarkably lower around transcription factor binding sites (TFBSs) [36] and that this data is useful for TFBS discovery [108]. Here, we introduce a methodology to integrate this source with ChIP-chip data for predicting true DNA binding of TFs. Figure 5.4 shows that accounting for nucleosome occupancy at potential binding sites significantly improves the predictions.

In order to identify potential condition-specific TF-target interactions, binding probability estimates greater than 0.9 are assumed to be true physical binding interactions. For each stress condition, such top-scoring pairs are identified. We depict the TF-gene interactions that are induced under stress conditions in Figure 5.5. Colors on the network edges indicate under which stress condition the corresponding TF (nodes in the middle of spheres) binds to the corresponding gene (nodes around the spheres). Although we did not pursue that direction in this work, this deduced network structure could be used to study changes in cellular-level responses with respect to different perturbations.

Figure 5.5: Context-dependent TF-target gene interactions can be inferred from our analysis. In this TF-gene interaction network, colors on edges represent the stress condition under which each binding takes place. The color coding is as follows: Black (aa starvation), Green (MMS treatment), Red (Heat Shock), Blue (peroxide treatment), Yellow (Raffinose treatment), and Purple (Galactose treatment).

5.3.2 Characterization of TFs and their Promoter Binding

In our next experiment, we employed the four scores we defined to quantify the functional status of TF-target bindings. Using these scoring schemes, for each stress condition we labeled TF-target interactions that score above 0.4 for scores $S_{011}$ or $S_{101}$ as functional binding events (FB). Similarly, pairs that score above 0.4 for scores $S_{010}$ or $S_{100}$ were labeled as non-functional binding events (NFB). We discuss our rationale behind the choice of this cut-off value of 0.4 at the end of this section. Based on these FB and NFB labels, for each TF and each stress condition we investigated the fraction of functional binding events. Given a condition and a specific TF, the ‘functional binding rate’ (FB-rate) is the ratio of functional binding events to total differential binding events (FB+NFB) for this TF-condition pair, i.e., $\text{FB-rate} = \frac{FB}{FB + NFB}$.

Figure 5.6 shows the functional binding rate for each TF-condition pair. The global FB-rate was found to be 49% and did not vary significantly by stress condition. Although the overall distribution of FB-rates varies considerably (standard deviation (sd) 0.06), this variation also did not appear to be condition specific (e.g., AAS sd = 0.059, H2O2 sd = 0.043, MMS sd = 0.067). The FB-rates of individual transcription factors did depend on the threshold used, though. More stringent thresholds (e.g., 0.5) resulted in more extreme FB-rates due to the low numbers of predicted differential binding events, though FB-rates were generally observed between 25% and 75%. It is also noteworthy that FB-rates of individual TFs may be dramatically different for different conditions (Figure 5.6). For example, the FB-rate of GCN4 under amino acid starvation is 64%, whereas it is just 36% under MMS treatment.

Each of the condition-specific FB-rates was compared to the background FB-rate (i.e., 49%) using a two-sided $\chi^2$ test. This analysis revealed that very few of the 90 TF-condition pairs generated FB-rates significantly different from the expected rate, though a few notable exceptions were found. In particular, the combination of GCN4 and amino acid starvation (AAS) was found to be significantly enriched for functional binding events (unadjusted $\chi^2$ p-value smaller than $10^{-2}$). Gcn4p is a well-known master regulator of amino acid metabolism, so this finding was not a surprise. The result does represent an important positive control and offers additional validation of our approach.

Figure 5.6: Functional binding rates for individual TFs by condition (panels: AAS, MMS, H2O2, HEAT, GAL, RAFF). The size of each point indicates the number of differential binding events (FB+NFB) predicted. The intensity of red indicates the significance (log10 p-value) of a $\chi^2$ test comparing the FB-rate to the global mean FB-rate (49.2% using the 0.4 cut-off value).
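The FB-rate and its comparison against the background rate can be sketched as below. The exact form of the $\chi^2$ test is not specified in the text, so this sketch assumes a one-sample goodness-of-fit test with one degree of freedom and no continuity correction. The counts in the example are hypothetical; only the 64% rate and the 49% background rate come from the text. For one degree of freedom the $\chi^2$ survival function equals `erfc(sqrt(x / 2))`, which avoids any dependency outside the standard library.

```python
import math


def fb_rate(n_fb, n_nfb):
    """Functional binding rate: FB / (FB + NFB)."""
    return n_fb / (n_fb + n_nfb)


def chi2_test_against_rate(n_fb, n_nfb, background_rate):
    """Goodness-of-fit chi-square test (1 d.f.) of FB/NFB counts against a fixed rate."""
    total = n_fb + n_nfb
    expected_fb = total * background_rate
    expected_nfb = total - expected_fb
    stat = ((n_fb - expected_fb) ** 2 / expected_fb
            + (n_nfb - expected_nfb) ** 2 / expected_nfb)
    p_value = math.erfc(math.sqrt(stat / 2.0))   # chi-square survival function, 1 d.f.
    return stat, p_value


# Hypothetical counts giving a 64% FB-rate, tested against the 49% background rate.
stat, p_value = chi2_test_against_rate(n_fb=64, n_nfb=36, background_rate=0.49)
```

With few differential binding events the expected counts become small and the $\chi^2$ approximation is unreliable, which is consistent with the more extreme FB-rates reported for stringent thresholds.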

Gene Ontology (GO) terms [8] for the FB and NFB target gene sets were analyzed for enrichment for each of the 90 TF-condition pairs using the GO Term Finder [18]. When we compared the total number of FB and NFB target sets that contained one or more significantly enriched GO terms (p < 0.05), it was clear in the AAS and H2O2 conditions that more of the FB target sets were enriched for functional ontology terms (28 for FB vs. 15 for NFB). In addition, the most significant results were observed for the FB target sets (data not shown). In summary, using our probabilistic method we find that roughly 50% of the condition-specific binding events are accompanied by differential expression of the targeted gene.

Each of these functional binding observations verifies a gain or loss of positive regulation or a repression or de-repression event across the compared growth conditions. The 50% of non-functional binding may arise for a number of reasons: 1) non-optimal distance of the TFBS to the transcription start site (TLS), 2) incorrect orientation of the TFBS relative to the direction of transcription, 3) lack of the appropriate co-factors. In Section 5.3.3, we explore the evidence for these possible factors.

Determining a Cut-off Score Threshold

In order to determine a sensible score threshold that will help us to identify potential FB and NFB pairs, we ran several analyses. First, we consider the impact of this threshold on the FB-rate estimates. Second, we analyze the effect of this threshold on the ordering of TFs in terms of their FB-rates.

As expected, increasing the threshold decreases the number of FB and NFB counts and confers a greater uncertainty on the FB-rate estimates. As can be seen in Figure 5.7, FB-rates vary greatly when we label only a few pairs as FB or NFB by setting the threshold very high (say 0.6). In contrast, a small threshold (say 0.2) labels many pairs as FB or NFB, which brings all FB-rates close to each other. We believe there should be some variability in the FB-rates of TFs, as it is unrealistic to expect every TF to behave the same. Therefore, we prefer to set a score threshold that produces some variability in these rates but that does not generate severe fluctuation. Accordingly, when we consider the variability of FB-rates for all 90 TF-condition combinations we have for 6 stress conditions, as estimated by the standard deviation (SteDev) and the median absolute deviation (MAD), we choose a threshold that corresponds to roughly 10% uncertainty on the FB-rate estimate (see Figure 5.7).

Figure 5.7: Score threshold vs. uncertainty on the FB-rate estimates.
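The threshold analysis above can be illustrated with a minimal Python sketch. It assumes that each TF-gene pair carries one functional and one non-functional score, that a pair is labeled FB or NFB when the larger of the two scores reaches the threshold, and that the FB-rate of a TF is FB / (FB + NFB). The labeling rule, the function names, and the toy scores are illustrative assumptions, not the scoring scheme defined earlier in this chapter.

```python
from statistics import median, pstdev

def fb_rate(pairs, threshold):
    """FB / (FB + NFB) for one TF; None if no pair passes the threshold.

    Assumed rule: a pair counts as FB (or NFB) when its functional
    (or non-functional) score is the larger one and reaches the threshold.
    """
    fb = sum(1 for f, n in pairs if f >= threshold and f >= n)
    nfb = sum(1 for f, n in pairs if n >= threshold and n > f)
    total = fb + nfb
    return fb / total if total else None

def mad(values):
    """Median absolute deviation around the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def variability(scores_by_tf, threshold):
    """Standard deviation and MAD of the FB-rates across TFs."""
    rates = [fb_rate(p, threshold) for p in scores_by_tf.values()]
    rates = [r for r in rates if r is not None]
    return pstdev(rates), mad(rates)

# Hypothetical (functional score, non-functional score) pairs per TF.
toy_scores = {
    "TF_A": [(0.7, 0.1), (0.5, 0.2), (0.1, 0.65), (0.3, 0.25)],
    "TF_B": [(0.2, 0.7), (0.45, 0.3), (0.1, 0.5), (0.25, 0.3)],
    "TF_C": [(0.65, 0.2), (0.3, 0.45), (0.5, 0.1), (0.22, 0.21)],
}

for t in (0.2, 0.4, 0.6):
    sd, dev = variability(toy_scores, t)
    print(f"threshold={t:.1f}  SD={sd:.3f}  MAD={dev:.3f}")
```

On this toy input a higher threshold leaves fewer labeled pairs per TF, so the FB-rates spread further apart, which is the qualitative trade-off discussed above.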

Since it is not clear how much uncertainty exists in the FB-rates of TFs in real life, we also ran an analysis to study the impact of this threshold on the ordering of TFs in terms of their FB-rates, rather than on the actual FB-rates. For this purpose, we propose the notion of Rank Distance (or Rank Similarity) (RD) to evaluate the robustness of our approach with respect to the cut-off value. We formally define the RD between two totally ordered sets (TFs ordered with respect to their functionality in our case), A and B, as follows:

\[
RD(A, B) = \frac{\sum_{i=1}^{N} \frac{|A_i \cap B_i|}{|A_i \cup B_i|}}{N} \tag{5.5}
\]

where Ai (or Bi) represents a totally ordered subset of A (or B) comprising the first i elements. N is a value that can be defined by the user, or it can be defined based on the cardinality of the sets A and B (i.e., N = min(|A|, |B|)). Intuitively, this formulation captures the similarity between two totally ordered sets, and it weights matches among elements higher up in the rank-ordered list more, since such elements appear in a larger number of subsets. This implies that the ordering at the beginning of the list (TFs with higher FB-rates) is given more importance than the ordering towards the end of this list. An RD value of 1 represents an exact match between two ordered sets. For our purposes, a perfect score implies that two TF functional orderings generated with different score thresholds are exactly the same. Low RD values imply that the two ordered sets do not share many matching subsets.

We used the RD notion to compare ordered sets of TFs in terms of their FB-rates under a given stress condition for different score cut-offs. We calculated the FB-rates of TFs for different cut-off thresholds ranging from 0.2 to 0.6. Each cut-off value generated an ordered set of TFs in terms of their FB-rates under that condition. Then, we calculated the RD scores for the different TF orderings produced by consecutive threshold pairs. In Figure 5.8, we plotted RD scores for three stress conditions in which a sufficient number of TFs are studied, so that an overlap in ordering is less likely to occur randomly. The consistently high RD values indicate that the choice of score cut-off does not impact the order of TFs in terms of their FB-rates in any of these stress conditions. This shows that

Figure 5.8: Rank Distance values for different cut-off thresholds under different stress conditions: aa starvation (blue line), MMS treatment (green line), and H2O2 treatment (red line).

functionality and non-functionality scores we defined based on binding and expressional response probabilities can robustly order a set of TFs in terms of their FB-rate in a given condition regardless of the choice of the cut-off. This implies that our methodology is robust for producing a ranked list of TFs in terms of their functionality under different conditions.
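As a minimal sketch of Equation 5.5, the following Python function computes RD as the mean overlap ratio of the top-i prefixes of two orderings. The function name and the three-element example orderings are hypothetical and chosen only to show how the measure behaves.

```python
def rank_distance(a, b, n=None):
    """RD(A, B) of Eq. 5.5: average of |A_i & B_i| / |A_i | B_i| for i = 1..N.

    a and b are sequences ordered from highest to lowest rank;
    n defaults to min(len(a), len(b)).
    """
    if n is None:
        n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both orderings must be non-empty")
    total = 0.0
    for i in range(1, n + 1):
        top_a, top_b = set(a[:i]), set(b[:i])
        total += len(top_a & top_b) / len(top_a | top_b)
    return total / n

# Hypothetical orderings of three TFs by FB-rate.
reference = ["GCN4", "MET4", "BAS1"]
swap_bottom = ["GCN4", "BAS1", "MET4"]   # disagreement at ranks 2 and 3
swap_top = ["MET4", "GCN4", "BAS1"]      # disagreement at ranks 1 and 2

print(rank_distance(reference, reference))
print(rank_distance(reference, swap_bottom))
print(rank_distance(reference, swap_top))
```

Swapping the two top-ranked elements lowers RD more than swapping the two bottom-ranked ones, because the top of the list takes part in every prefix comparison.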

Moreover, we also calculated RD scores for more distant orderings in addition to the orderings generated by consecutive cut-offs. For each cut-off value, we calculated the RD scores for two thresholds before and after the corresponding threshold value. As an example, for a cut-off value of 0.4, we calculated its RD values with respect to the following cut-off values: 0.38, 0.39, 0.41, and 0.42. Boxplots of these RD scores are depicted in Figure 5.9. Given these figures, we decided to choose a cut-off value (0.4) where RD scores start to stabilize in a smaller range.

[Figure 5.9 panels: Cond: AAS, Step: 2; Cond: MMS, Step: 2; Cond: H2O2, Step: 2. x-axis: Cut-off (0.23 to 0.56); y-axis: RD (0.2 to 0.8).]

Figure 5.9: RD values calculated with respect to the two step apart cut-off thresholds for aa starvation, mms treatment, and peroxide conditions respectively.

5.3.3 Exploration of Predictive Features for Functional Binding Events

In our next experiment, we try to explain why certain TF binding events are functional whereas others are not. In order to answer this question, we focused on high-confidence sets of interactions that are functional and non-functional according to our analysis. The top 1% of all pairs with the highest S101 and S011 scores are considered reliable functional binding events. Similarly, the top 1% of all pairs with the highest S100 and S010 scores are selected as the reliable non-functional binding events. Next, we identified binding site features and co-factor binding probabilities in an attempt to explain differences between these two groups. We used the multivariate Random Forests method to calculate the importance of each feature for predicting the class response, which is the functionality of binding for our purposes. To identify the significant predictors for each TF, we obtain a background distribution of this score by randomizing the class response variable (FB/NFB). Subsequently, the variable importance scores are re-calculated for each variable. These randomization experiments are repeated 1000 times to get a stable background distribution of the variable importance scores. Finally, importance scores calculated for each predictor are converted into p-values with respect to the empirical background distribution obtained from the randomization experiments. We tabulated the factors with p-values < 0.05 in Table 5.3.

These significant factors can be very useful in understanding the functionality of transcription factor binding. After examining these factors, we observe that in some cases the geometry of a binding site (distance to TLS and orientation) is very important for the regulatory implications of a physical interaction. For example, for CIN5 in the MMS stress condition, an incorrect distance of the TFBS can lead to non-functional binding. However, in most cases the binding of other TFs to the promoter region of a gene determines the functionality of binding. An extreme example here is MET31 under amino acid starvation. In this case, our methodology predicts that its functionality of binding depends on the binding of six other TFs to the same promoters. As can be seen from Table 5.3, GCN4 is clearly the most important co-factor for the amino acid starvation condition. In this condition we found that the functionality of 8 out of 24 tested regulators depended on the binding of GCN4 (shown in parentheses in column 2 of Table 5.3). This finding recapitulates well-established knowledge about GCN4's role in amino acid starvation. GCN4 is known to regulate most genes responding to this stress, and it is known to be the first-level responder [70]. In the case of H2O2 treatment, PHO2 and MSN4 are identified as the most important co-factors for regulation.

Another interesting example that we can draw from our analysis is the context-dependency of the factors that determine functionality. Our results show that not only are the TF-gene bindings condition dependent, but living conditions also affect the nature of binding (as evident from varying FB-rates for the same TF under different conditions) and the biological causes behind functionality. As an example, the functionality of HSF1 bindings depends on several co-factors as well as the orientation of these bindings under H2O2 treatment. However, the functionality of HSF1 bindings does not depend on orientation under MMS treatment.
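The randomization procedure described in this section, which converts an importance score into an empirical p-value, can be sketched as follows. To keep the example self-contained, the importance function here is a simple stand-in (the absolute difference of class means) and is not the Random Forest variable importance used in our analysis; the function names, the add-one correction in the p-value, and the toy data are likewise illustrative assumptions.

```python
import random
from statistics import mean

def mean_difference_importance(feature, labels):
    """Stand-in importance score (NOT Random Forest): |mean(FB) - mean(NFB)|."""
    fb = [x for x, y in zip(feature, labels) if y == "FB"]
    nfb = [x for x, y in zip(feature, labels) if y == "NFB"]
    return abs(mean(fb) - mean(nfb))

def empirical_p_value(feature, labels, importance, n_perm=1000, seed=0):
    """Fraction of label permutations whose importance reaches the observed one.

    The class response (FB/NFB) is shuffled n_perm times to build the
    background distribution; the add-one correction is an assumed convention.
    """
    rng = random.Random(seed)
    observed = importance(feature, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if importance(feature, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 10 FB and 10 NFB binding events.
labels = ["FB"] * 10 + ["NFB"] * 10
# Hypothetical co-factor binding probability that separates the classes.
informative = [0.80 + 0.015 * i for i in range(10)] + [0.05 + 0.015 * i for i in range(10)]
# Hypothetical feature with identical values in both classes.
uninformative = [0.1 * (i + 1) for i in range(10)] * 2

p_inf = empirical_p_value(informative, labels, mean_difference_importance)
p_unf = empirical_p_value(uninformative, labels, mean_difference_importance)
print(p_inf, p_unf)
```

In a real analysis, the `importance` argument would be replaced by a function that refits the forest on the permuted labels and returns the variable importance of the predictor under test.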

Table 5.3: Significant predictors for determining the functionality of TFs under the studied conditions. The values in parentheses (in column 2) correspond to the number of times the corresponding TF has been found to be a significant co-factor at the 0.05 p-value cut-off.

Condition | TF (co-factor instances) | Important Factors for Binding Functionality
Galactose | GAL4 (0) | Distance
Heat Shock | SKN7 (0) | MSN2
MMS Treatment | ADR1 (0) | Orientation
MMS Treatment | ASH1 (1) | RIM101
MMS Treatment | CAD1 (3) | FKH2, DAL81
MMS Treatment | CIN5 (0) | Distance, MCM1
MMS Treatment | DIG1 (2) | RFX1, SKO1
MMS Treatment | FKH2 (1) | DAL81, MCM1, RTG3
MMS Treatment | GCN4 (1) | YAP5
MMS Treatment | HSF1 (0) | DAL81, RPN4
MMS Treatment | INO4 (2) | ASH1, CAD1, INO4, MCM1
MMS Treatment | MSN4 (1) | DIG1
MMS Treatment | NDD1 (1) | YAP5
MMS Treatment | RFX1 (0) | INO4
MMS Treatment | RTG3 (1) | DIG1
MMS Treatment | SOK2 (0) | CAD1, UGA3
MMS Treatment | SWI4 (0) | PDR1, SKO1
MMS Treatment | SWI5 (0) | CAD1
MMS Treatment | SWI6 (0) | MCM1, MSN4, RIM101
MMS Treatment | YAP1 (0) | GCN4, NDD1
aa Starvation | ADR1 (1) | RPH1, MET4, SFP1
aa Starvation | BAS1 (4) | CBF1, STP1, UGA3
aa Starvation | CBF1 (2) | BAS1, GCN4, HAP5, MET4
aa Starvation | DAL81 (2) | Distance, CBF1
aa Starvation | DAL82 (4) | GCN4, PUT3
aa Starvation | FHL1 (0) | DAL82
aa Starvation | GAT1 (1) | ADR1
aa Starvation | GCN4 (8) | GLN3, MET32
aa Starvation | GLN3 (3) | GCN4, MOT3, PUT3, SFP1
aa Starvation | HAP4 (1) | MOT3, RTG3
aa Starvation | HAP5 (3) | BAS1, DAL82, GCN4
aa Starvation | LEU3 (0) | CBF1, UGA3
aa Starvation | MET31 (1) | DAL81, DAL82, GCN4, GLN3, HAP5, RAP1
aa Starvation | MET4 (5) | RCS1, RTG1
aa Starvation | MOT3 (2) | DAL82
aa Starvation | PHO2 (0) | GCN4, GLN3, HAP4, MET31
aa Starvation | RPH1 (1) | DAL81, MET4, UGA3
aa Starvation | RAP1 (2) | GAT1, RAP1
aa Starvation | RTG1 (2) | BAS1
aa Starvation | RTG3 (2) | GCN4, HAP5, MET4, SFP1
aa Starvation | SFP1 (4) | Orientation, MET4, PUT3, RAP1
aa Starvation | SIP4 (0) | SFP1
aa Starvation | STP1 (1) | GCN4, RTG1
aa Starvation | UGA3 (3) | BAS1, RCS1, RTG3
H2O2 | AFT2 (0) | CIN5, YAP7
H2O2 | CIN5 (2) | HSF1, PHO2
H2O2 | FKH2 (2) | XBP1
H2O2 | HAP4 (1) | MSN4, PHO2, PUT3, YAP6
H2O2 | HSF1 (1) | Orientation, FKH2, RIM101, SFP1
H2O2 | MBP1 (1) | MOT3, MSN2, REB1, SKN7
H2O2 | MOT3 (1) | MSN2, MSN4
H2O2 | MSN2 (2) | MSN4, PUT3, RCS1
H2O2 | MSN4 (5) | SFP1
H2O2 | PDR1 (0) | SFP1, SKN7
H2O2 | PHO2 (5) | FKH2, SKN7, XBP1
H2O2 | REB1 (2) | HAP4
H2O2 | RIM101 (1) | YAP1
H2O2 | ROX1 (0) | MSN4, RTG3, SFP1
H2O2 | RPH1 (2) | Distance, PHO2
H2O2 | RPN4 (0) | Distance
H2O2 | RTG3 (1) | NRG1
H2O2 | SFP1 (5) | Orientation, MSN4, RPH1
H2O2 | SKN7 (3) | CIN5, MBP1
H2O2 | YAP6 (3) | RPH1
H2O2 | YAP7 (1) | PHO2, REB1, YAP6
H2O2 | XBP1 (2) | PHO2, SFP1, YAP6

It only depends on co-factors. Note that, since stress-related studies are incomplete in terms of their TF coverage, we do not claim that we are able to identify all significant co-factors that explain binding functionality. However, at least for the geometry of binding we have complete information, and even for the binding geometry we have observed different patterns under different stress conditions.

5.3.4 Identification of Co-factor Hierarchy Networks

Using the co-factor relations with significant p-values defined in the previous section, we identified larger systems of dependencies between regulatory proteins in each condition. As an example, the functionality of 8 TFs in the amino acid starvation condition is dependent on the binding of GCN4 to the same promoter region (p < 0.05). On the other hand, GCN4's functionality appears to be dependent on only two co-factors (MET32 and GLN3) at this same threshold. Hence, this analysis establishes a hierarchy of regulatory relationships with GCN4 being the master regulator for responding to amino acid starvation.

The set of significant dependencies can be used to define a hierarchical network describing the co-factor dependencies between TFs under different conditions. Other examples of these significant relationships are shown for the amino acid starvation and hydrogen peroxide conditions for a more stringent variable importance threshold (p < 0.01) in Figure 5.10. Given this set of the most significant co-factor dependencies, we can still clearly observe the 'master regulator' status of GCN4 under the aa starvation (AAS) condition.

The peroxide stress results also point to MSN2 and MSN4 as being required co-factors for a cascade of other regulators (Fig. 5.10). The importance of MSN2 and MSN4 is well documented in the oxidative stress response [49], and both are known to bind stress response elements (STRE) in response to a number of stress conditions. Though their functional roles are known to be partially redundant, recent work also indicated distinct roles for MSN2 and MSN4 [48], as is also suggested in Figure 5.10.

Figure 5.10 also shows the FB-rates of TFs (node color red/green for high/low FB-rate), and it should be noted that this information does affect whether a significant co-factor relationship is likely to occur or not other than the case where all differential binding predictions are of only one category, FB or NFB. In these atypical cases, the FB vs NFB importance of other variables cannot be estimated. Based on these networks, it is tempting to suggest that TFs at the top of the dependency hierarchies are more enriched for functional binding. Indeed, this makes some intuitive sense: dynamic binding events that require the fewest additional co-factors are the ones most likely to be functional.
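As a minimal illustration of how such a dependency hierarchy can be assembled, the Python sketch below turns the amino acid starvation rows of Table 5.3 (p < 0.05) into directed edges X -> Y, meaning that the binding functionality of Y depends on X, and counts how many TFs depend on each co-factor. The variable names are hypothetical, and ignoring self-dependencies is an assumption made so that the counts match the values in parentheses in column 2 of the table.

```python
from collections import Counter

# TF -> significant co-factors under aa starvation, copied from Table 5.3.
# "Distance" and "Orientation" entries are left out because they are not TFs.
AAS_DEPENDS_ON = {
    "ADR1": ["RPH1", "MET4", "SFP1"],
    "BAS1": ["CBF1", "STP1", "UGA3"],
    "CBF1": ["BAS1", "GCN4", "HAP5", "MET4"],
    "DAL81": ["CBF1"],
    "DAL82": ["GCN4", "PUT3"],
    "FHL1": ["DAL82"],
    "GAT1": ["ADR1"],
    "GCN4": ["GLN3", "MET32"],
    "GLN3": ["GCN4", "MOT3", "PUT3", "SFP1"],
    "HAP4": ["MOT3", "RTG3"],
    "HAP5": ["BAS1", "DAL82", "GCN4"],
    "LEU3": ["CBF1", "UGA3"],
    "MET31": ["DAL81", "DAL82", "GCN4", "GLN3", "HAP5", "RAP1"],
    "MET4": ["RCS1", "RTG1"],
    "MOT3": ["DAL82"],
    "PHO2": ["GCN4", "GLN3", "HAP4", "MET31"],
    "RPH1": ["DAL81", "MET4", "UGA3"],
    "RAP1": ["GAT1", "RAP1"],
    "RTG1": ["BAS1"],
    "RTG3": ["GCN4", "HAP5", "MET4", "SFP1"],
    "SFP1": ["MET4", "PUT3", "RAP1"],
    "SIP4": ["SFP1"],
    "STP1": ["GCN4", "RTG1"],
    "UGA3": ["BAS1", "RCS1", "RTG3"],
}

# Directed edges (co-factor, dependent TF); self-dependencies are skipped.
edges = [
    (cofactor, tf)
    for tf, cofactors in AAS_DEPENDS_ON.items()
    for cofactor in cofactors
    if cofactor != tf
]

dependents = Counter(cofactor for cofactor, _ in edges)   # out-degree
requirements = Counter(tf for _, tf in edges)             # in-degree

for tf, count in dependents.most_common(5):
    print(f"{tf}: required by {count} TFs, depends on {requirements[tf]}")
```

Ranking the nodes by the number of TFs that depend on them places GCN4 at the top of the amino acid starvation hierarchy, in line with the discussion above.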

5.3.5 Conclusion

We believe that the ability to distinguish functional from non-functional TF-gene interactions within living cells is an important research area and will only increase in importance in the future. The need for methods to address this problem may already be acute considering the volume of protein-protein and protein-DNA interactions that have been systematically measured by yeast-two-hybrid, mass spectrometric, ChIP-chip, ChIP-Seq, and other methods. The functional fraction of these newly determined and valid protein interactions is currently unclear. Our work strives to answer this question by exploiting dynamic protein-DNA binding events coupled with potential expression changes in the output of the corresponding regulatory system. The method described in this work gives a first estimate for the functionality of condition-dependent protein-DNA interactions and sheds light on the possible causal factors determining functionality.

[Figure 5.10 panels: AAS and H2O2. Legend: FB rate (0.25, 0.50, 0.70); p-value (1e-2, 1e-3, 1e-4).]

Figure 5.10: Significant TF-TF co-factor relationships as determined by the multivariate Random Forest method (p < 0.01) under amino acid starvation (AAS) and hydrogen peroxide (H2O2) stress conditions. This hierarchical network view shows TFs (nodes) and co-factor relationships (edges), where the direction of dependency is indicated by the arrow. In this representation, X -> Y implies that the binding functionality of Y depends on X. The thickness of the edge indicates the significance of the X variable in determining the functionality of Y binding. The node color indicates the FB-rate of the TF in that condition: red indicates rates higher than expected, while green indicates lower than expected rates.

CHAPTER 6: Discussion and Future Directions

Cellular function is often governed by an intricate web of interactions among DNA, RNA, proteins, and other small molecules. Such biological interactions can be naturally abstracted as a relationship network. Our thesis statement is that it is possible to develop noise-resistant, integrative, and topology-aware algorithms for the construction and analysis of such biological relationship networks. We attacked several challenges associated with the construction and the analysis of diverse biological interaction networks, particularly the following three: PPI networks, gene co-expression networks, and regulatory networks. We leveraged data mining methods and statistical tools to accomplish data cleaning, topology refinement, and data integration. We demonstrated that by incorporating topological information one can improve the quality of biological networks through data cleaning and data refinement. We also showed that by integrating diverse and complementary datasets one can enhance the biological quality of extracted networks. Moreover, we found that integrative models can be useful in inferring the dynamics and the semantics of biological interactions.

We proposed our work within a generalized framework, which consists of three main stages: data pre-processing, network construction, and network analysis. This workflow is a significant step toward accomplishing our ultimate goal of employing biological networks and their sub-structures for knowledge discovery. We present our work as a series of case studies spanning different domains. In Chapter 3, we discussed computational techniques for eliminating false positive interactions and handling highly-connected nodes in PPI networks. These methods are shown to improve the biological modularity of protein sub-networks. We employed evidence from network topology to accomplish these tasks. In Chapter 4, we discussed alternative similarity measures that are suitable for the analysis of noisy microarray datasets: rank-based similarity of genes and external gene similarity. To assess the usefulness of these alternative measures, we compared them against the standard measures used in the literature. With these, we were able to show significant improvement in the quality of constructed gene co-expression networks. We then proposed a study to employ gene sub-networks derived from human co-expression networks within a statistical model to detect the gene-set level responses of human cells to changes such as HIV drug treatment. In Chapter 5, we integrated diverse datasets generated under normal growth conditions as well as six different stress conditions to infer context-dependent regulatory interactions between TFs and their target genes. Our results showed that by integrating direct protein-DNA interactions from ChIP-chip measurements, transcription factor motif hits, and nucleosome occupancy datasets, we can significantly reduce the level of noise that is introduced by each dataset alone.

While we have tried to make the proposed workflow as general as possible, it is evident that every case study has its own set of goals and datasets. Next, we describe the similarities and differences among these three network types.

6.1 Discussion

As validation, we considered the analysis of three diverse structures: protein interaction networks, gene co-expression networks, and regulatory networks. There exist some similarities among these three, which also motivated our generalized framework. Although each of these network structures points to diverse relations in the biological context, our analysis showed that they share common themes and infrastructures. First, we will discuss the common themes that are highlighted by our analysis.

• All three networks are either partially or completely extracted from high-throughput data sets. For constructing gene co-expression networks, gene expression profiles from microarray studies are employed. PPI networks are usually constructed from high-throughput measurements from the yeast two-hybrid system and protein co-immunoprecipitation (coIP) followed by mass spectrometry (MS). Regulatory networks can be inferred from protein-DNA interactions measured at large scale using the ChIP-chip technique. The involvement of large-scale measurements in the construction of these networks introduces common challenges into their analysis. It is well known that high-throughput screening techniques sacrifice sensitivity. Therefore, these networks are large in size, but they are also noisy. Given this, it becomes very important to develop computational models that are able to extract accurate information from uncertain and large networks. To address this challenge, we proposed a data cleaning strategy that eliminates noisy interactions of an existing interaction network by employing its topology. We showed the efficacy of this algorithm on the PPI network of the model organism Saccharomyces cerevisiae. Along these lines, we also developed noise-tolerant measures to be used to construct biological networks from raw and noisy datasets. These measures are employed to construct gene co-expression networks from gene expression profiles of human and Saccharomyces cerevisiae. Finally, we also showed that through the integration of complementary data sources, one can reduce the level of uncertainty in the constructed network. This methodology is used to construct regulatory networks of Saccharomyces cerevisiae. To sum up, within our framework we proposed alternative methods (edge elimination, noise-tolerant measures, data integration) to attack the uncertainty in biological interaction networks, which is a drawback of high-throughput screening technologies. We showed the efficacy of these methods by employing them for the analysis (and construction) of several biological interaction networks.

• All three interaction networks exhibit modular organization. The concept of modularity for a network structure is not well defined. However, a module can be vaguely defined as a group of entities (genes, proteins, or both) that share some commonalities within themselves which they do not share with others. There have been several attempts to explain the prevalent existence of modules in biological networks from an evolutionary perspective. It has been suggested that modular systems are selected as the most adaptable framework to the changing environment, among different evolutionary frameworks [71]. Although it is not easy to provide supporting experimental evidence for this hypothesis, it still makes intuitive sense since modular organization enables biological systems to be robust to , environmental changes, and knockouts [71]. Simulation studies are also in accordance with this expectation. It has been shown that the modularity of networks is the most important characteristic for understanding the robustness of a network, in comparison to other common properties such as clustering coefficient, degree distribution, and average path length [50].

• Another feature that is common to all three networks is their dynamic nature. As abstract representations of biological systems, biological networks capture the dynamic characteristic of the living system. The dynamics of a biological system are essential for its survival. Similarly, the dynamics of biological networks are essential for enabling these networks to adapt and respond to external stimuli in a controlled and coherent manner. Interactions and nodes of many biological networks appear and disappear temporally in response to external and internal stimuli. For example, a study on PPI networks revealed that two thirds of budding yeast proteins are dynamic during the cell cycle stages, which implies that they do not form interactions at all stages of the cell cycle. Moreover, it has been shown that such dynamic proteins are expressed just before they are needed to assemble a protein complex. This 'just-in-time assembly' of protein complexes is believed to enable fast and accurate response mechanisms for the system [39]. Similarly, regulatory interactions depend on the living conditions of the cell, and they appear (or disappear) in certain conditions. In our work, we investigated this issue in detail by constructing and studying condition-dependent regulatory networks under six stress conditions. Our analysis revealed that the existence of regulatory interactions as well as the semantics of these interactions depend very much on the experimental condition.

• Another feature that is common to all of our analyses, though not directly a common theme of these networks, is the lack of gold standard information. This lack of information mainly led to two particular challenges for our studies: learning sensible models from limited training datasets and evaluating our findings. We handle the first challenge by developing unsupervised techniques wherever applicable or by simplifying supervised models to get the most out of the limited training data. In order to handle the second challenge, we frequently employed existing domain information from the GO database [8]. However, it is known that GO annotations are biased towards well-studied pathways and genes in addition to being incomplete. Therefore, treating GO annotations as a complete and collective gold standard can be misleading. On the other hand, we believe it is safe to assume GO terms to be equally biased when they are used to annotate the same set of genes or proteins. Motivated by this assumption, throughout our analysis we employed GO-based evaluations in a comparative manner. For example, while analyzing PPI networks, we compared partitions deduced from the original data with ones from the cleaned data in terms of their GO term enrichment.

Having studied all three networks, we observe the above-mentioned common elements in their structure, topology, and organization. In addition, we also observe that these networks present similar challenges in their analysis. However, there exist some important differences among these three networks. We discuss these next.

• Each of these networks and their interactions have different real-world meanings and implications. A biological interaction might solely represent a physical relation (as in the case of PPIs) or might refer to a causal relation (as in the case of regulatory networks). In our case studies, we took into consideration the different requirements and implications of each domain. Certain models and algorithms are more appropriate for some data types than others. As an example, when we study regulatory networks, we propose a methodology that employs gene expression profiles to infer the semantics of regulatory interactions. However, in the case of PPI networks, we employed network topology for this purpose.

• In addition to their interactions, higher-level substructures of these networks, such as their modules, also refer to different biological contexts. As we discussed previously, all three networks we studied exhibit a modular organization. However, for each of these networks, modules have different implications and real-world correspondence. For example, genes that are co-expressed can be classified into a module of co-regulated genes. On the other hand, protein clusters that are deduced from PPI networks can correspond to protein complexes, where a complex refers to a group of proteins that bind to each other and accomplish a certain cellular task. Besides, a module in a regulatory network can represent a group of co-regulated genes along with their common regulators. Many studies have focused on the identification of modules from biological networks. In our analysis we also extracted module structures when they were useful in the knowledge discovery process. For this purpose, we employed existing clustering and graph partitioning algorithms. To further benefit from the modular nature of these networks, we also proposed a novel utilization of gene modules for detecting gene-set level responses. Our analysis revealed that effective identification and utilization of network modules are critical steps to convert them into biological investigation tools.

• Technologies that produce these networks have different technical artifacts. For example, Y2H technology produces 'sticky' proteins which bind non-functionally to many other proteins. It is essential to develop a methodology to differentiate sticky proteins from highly connected multi-functional proteins. In contrast, microarray-based technologies are overall sensitive mechanisms, which produce many false positive interactions. Therefore, while studying raw microarray data, the first step should be quality control checks and other pre-processing steps for reducing the number of false positive inferences. It is evident from this example that different technologies have different sources of noise. Modeling and eliminating the noise in these networks should be specifically tailored with respect to these differences. We believe that it is extremely important to understand the limitations and artifacts of existing technologies in order to develop effective algorithms and solutions for the study of biological networks.

• Each of these networks raises different research problems. Therefore, different outcomes are expected from the analysis of each network. For example, in the case of PPI networks, an interesting outcome would be clean and modular structures deduced from this network. These structures can be identified by obtaining densely linked regions inside the network. However, in the case of regulatory networks, an essential outcome will be the network structure. Inferring interactions of a regulatory network can be as interesting as inferring its substructures. As a result, while developing algorithms for the study of these networks, one needs to understand the specific research questions of the domain.

With the maturation of the field of Network Biology and the accumulation of diverse datasets, it is becoming clear that there is no solution that fits all the problems and analyses associated with biological networks. Instead, there is a wide range of frameworks and algorithms, each with its specific pros and cons. We believe our workflow is also such a framework, where we contributed significantly to the existing literature via our novel data preprocessing, network construction, and network analysis methods. Similarities among these networks enabled us to propose our generalized framework that is composed of three steps: data pre-processing, network construction, and network analysis. On the other hand, we also observe some significant differences among these networks, which led to network-specific solutions to each task at hand. We want to emphasize that despite the diversity of extant methodologies to analyze biological networks, all of them share an ultimate goal. This goal is to make sense of raw biological data in order to answer diverse and complex biological questions. We believe our work has contributed towards the accomplishment of this ultimate goal.

6.2 Limitations

In this dissertation we developed several algorithms and demonstrated their usefulness for constructing and analyzing three biological interaction networks of interest. However, the biological questions motivating our work are far from being resolved. In this section, we will review the limitations of our current methodologies. As part of our future work, we will briefly discuss some directions that might help resolve these limitations.

• Due to the lack of Gold Standard datasets, throughout our analysis we employed the GO database for evaluation and validation purposes. However, this lack of validation data led to the neglect of comprehensive domain information in the data modeling and algorithm development phases. In the future, we hope to overcome this limitation with the generation of reliable Gold Standard datasets, and by relying on other forms of validation, e.g., wet-lab experimental validation.

• In our previous work, we employed the nucleosome occupancy map of Saccharomyces cerevisiae to examine its impact on gene regulation. Due to the lack of data, we assumed this map to be static over different environmental conditions. However, it is known that chromatin structure changes dynamically as a function of external stimuli. Moreover, these changes enable (or disable) the progress of different nuclear machinery [63]. We want to overcome this limitation by incorporating histone modification data into our model. We detail some of our initial ideas on this in Section 6.3.1.

• As part of this dissertation we studied the identification of regulatory relations from diverse predictive data sources within a probabilistic model [150]. Although our model helped us predict TF-gene interactions under six different stress conditions, our predictions were limited to the TFs that were studied in each stress condition. Since most experimental conditions have only been studied for a limited number of TFs, our predictions were incomplete. In the future, we want to address this limitation by developing a computational methodology to predict missing data. The details of this future direction are explained in Section 6.3.2.

• Another limitation of our analysis of PPI networks and gene co-expression networks is our assumption about the stability of these interactions. It is well known that these interactions are dynamic in nature. Despite the importance of these dynamics, science and technology have not yet matured enough to monitor dynamic interactions between genes and proteins effectively and globally. Our work is also affected by this lack of temporal data, as interactions are assumed to be static. However, we believe that in the near future, with the production of temporal data, we will have a better understanding of network dynamics.

6.3 Future Directions

In this section, we will discuss future directions in order to improve this dissertation work.

6.3.1 Integrating temporal epigenetic signatures

Of particular interest to us is the integration of epigenetic data for the construction of regulatory networks. Epigenetic refers to phenotypic changes that are caused by mechanisms other than the . An example can be the impact of nucleosome occupancy on gene regulation. In a eukaryotic cell, DNA is wrapped around histone proteins and is condensed into a compact structure known as the chromatin structure. This structure has been shown to play a significant role in driving cellular processes such as replication, transcription, or DNA repair [63]. We demonstrated the impact of a static nucleosome occupancy map on gene regulation. However, chromatin structure has been shown to change dynamically, which affects the progress of different nuclear machinery [63]. In the future, we plan to enhance our previous study by taking chromatin structure dynamics into consideration. More specifically, we want to incorporate histone modification data for acetylation, methylation, and phosphorylation. We believe that with the inclusion of these new epigenetic signatures, we will enhance our understanding of the dynamics of regulatory networks.

6.3.2 Predicting interactions which are yet to be produced

Another potential way to improve our prior work on regulatory networks is to address a limitation that we faced due to the incompleteness of existing data. While studying regulatory networks, we employed condition-dependent ChIP-chip data, which were only available for a limited number of TFs. A complete list of TF-gene interactions under the conditions that we studied would produce a more reliable and comprehensive output with the same model. Therefore, we plan to develop a computational methodology to predict some of the unknown data by studying existing interactions between regulatory proteins and genes. Here, we describe this methodology in detail and discuss how it can be employed for handling the incompleteness in existing data.

Learning a bipartite graph to predict TF-gene interactions under stress conditions

In their recent work, Yamanishi et al. [167] proposed learning a ‘pharmacological space’ of drug compounds and their target proteins through a bipartite graph model. For this purpose they investigated the relation between the chemical structure similarity of drugs (‘chemical space’) and the drug-target network topology. They also studied the relation between the amino acid sequence similarity of drug targets (‘genomic space’) and the drug-target network topology. For each of these spaces, chemical and genomic, they learn a function that maps the original space into the ‘pharmacological space’, where interacting drugs and target proteins are close to each other. In this space, interactions between drugs and proteins are represented in the form of a bipartite graph. Their approach has been shown to be useful for predicting the structure of this bipartite graph, by identifying interactions between previously unseen drug candidate compounds and target candidate proteins.

Here, we propose a similar idea to predict interactions between TFs and their targets based on diverse evidence. This proposed work has an advantage over previous data integration studies: it can generate predictions for previously unseen regulatory proteins and their potential targets. In addition, by studying condition-dependent ChIP-chip datasets with this methodology, we plan to predict regulatory interactions of TFs under different conditions before they are experimentally studied under those conditions. We believe our findings will provide guidance for further experimental studies by identifying the activity level of TFs under new conditions. In addition, these predictions can be extremely useful for expanding the coverage of current computational models, including ours.

In this work, we are addressing the problem of predicting links between heterogeneous vertices of a complex interaction network. The first type of vertices that we consider are regulatory proteins. The ‘protein space’ quantifies the similarity between these proteins. It has been shown that transcription factors with similar motif binding sites tend to have similar biological effects, in order to minimize the impact of mis-recognition errors [78]. Thus, similarity of PSSMs can be used to quantify the similarities between the first type of vertices. It is also known that genes with strongly correlated mRNA expression profiles are more likely to be bound by a common transcription factor [4]. Accordingly, gene similarities can be quantified using the correlation between gene expression profiles. This information will construct our second space, namely the ‘genomic space’. We also employ a priori knowledge regarding the interactions between TFs and their targets to construct the bipartite network topology. For this purpose, it is possible to apply a cut-off value on binding probabilities or ChIP-chip p-values. However, it is also possible to represent the whole structure as a biclique in which every vertex of the first type is connected to every vertex of the second type, with a weight associated with each connection. This weight can be either the binding probability that we have already calculated in our work or the p-value from ChIP-chip studies. Similar to the work of Yamanishi et al., this idea can be visualized as in Figure 6.1. As can be seen in this figure, the ‘regulation space’ is a bipartite graph of genes and proteins in which interactions refer to regulatory relations.
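As a minimal sketch of the two ingredients described above, the following Python fragment computes a correlation-based ‘genomic space’ and builds the TF-gene edge set either as a full weighted biclique or with a cut-off. The function names, the toy expression profiles, and the binding weights are all hypothetical and chosen only for illustration; the weights are assumed to be binding probabilities (higher means stronger evidence), so a p-value based variant would need the comparison reversed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (norm_x * norm_y)


def genomic_space(expression):
    """Pairwise gene similarity from expression profiles (hypothetical helper)."""
    genes = sorted(expression)
    return {(g, h): pearson(expression[g], expression[h])
            for g in genes for h in genes}


def bipartite_edges(binding, cutoff=None):
    """Weighted TF-gene edges (hypothetical helper).

    With cutoff=None every TF-gene pair is kept with its weight (the
    biclique view); otherwise only pairs whose weight reaches the cutoff
    are kept. Weights are assumed to be binding probabilities.
    """
    if cutoff is None:
        return dict(binding)
    return {pair: w for pair, w in binding.items() if w >= cutoff}


# Toy data, invented for illustration only.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
binding = {
    ("TF_A", "g1"): 0.9, ("TF_A", "g2"): 0.8, ("TF_A", "g3"): 0.1,
    ("TF_B", "g1"): 0.2, ("TF_B", "g2"): 0.3, ("TF_B", "g3"): 0.7,
}

sim = genomic_space(expression)
biclique = bipartite_edges(binding)
thresholded = bipartite_edges(binding, cutoff=0.5)
```

In this toy example the biclique keeps all six weighted pairs, while the cut-off of 0.5 keeps only the three pairs with strong binding evidence. Which representation is preferable depends on how much the downstream similarity measure can tolerate weak edges.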

Figure 6.1: The overview of the proposed bipartite learning technique.

After representing TF-gene interactions in the form of a bipartite graph, the next step is to define a similarity between all vertices (TFs and genes) in this space. Yamanishi et al. employed a graph-based kernel similarity that is based on the shortest distance between all vertices [167]. To learn the correlation between the initial spaces and the ‘pharmacological space’, they proposed applying a kernel regression model. Initially, we plan to employ a similar idea to construct a similarity matrix based on the a priori network structure. At the end of our analysis, we intend to learn two mapping functions: one that maps the ‘protein space’ into the ‘regulation space’ and a second one that maps the ‘genomic space’ into the ‘regulation space’. After obtaining these mappings, we plan to predict TF-gene interactions based on a similarity cut-off in the ‘regulation space’. We intend to extend this work to predict TF behavior under different environmental conditions by incorporating condition-dependent ChIP-chip data into the model.
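The following Python sketch illustrates the general shape of this pipeline under strong simplifying assumptions. It derives a similarity from shortest-path distances in the known bipartite graph and then scores genes for an unseen TF with a Nadaraya-Watson style weighted average, which is a deliberately simple stand-in for the kernel regression model of [167], not a reimplementation of it. The Gaussian form of the kernel, the width parameter, the helper names, and the toy data are all assumptions made for illustration.

```python
from collections import deque
from math import exp


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def graph_kernel(adj, width=2.0):
    """Similarity exp(-d^2 / width^2) from shortest-path distance d.

    Unreachable vertex pairs get similarity 0. The Gaussian form and the
    width value are assumptions of this sketch.
    """
    kernel = {}
    for u in adj:
        dist = shortest_path_lengths(adj, u)
        for v in adj:
            kernel[(u, v)] = exp(-(dist[v] ** 2) / width ** 2) if v in dist else 0.0
    return kernel


def predict_scores(tf_similarity, kernel, genes):
    """Score genes for an unseen TF as a similarity-weighted average over
    known TFs (a Nadaraya-Watson estimate, used here as a simple stand-in)."""
    total = sum(tf_similarity.values())
    if total == 0:
        return {g: 0.0 for g in genes}
    return {g: sum(w * kernel[(tf, g)] for tf, w in tf_similarity.items()) / total
            for g in genes}


# Toy a priori network, invented for illustration only.
known_edges = [("TF_A", "g1"), ("TF_A", "g2"), ("TF_B", "g3")]
adj = {}
for tf, gene in known_edges:
    adj.setdefault(tf, set()).add(gene)
    adj.setdefault(gene, set()).add(tf)

kernel = graph_kernel(adj)
genes = ["g1", "g2", "g3"]

# Hypothetical 'protein space' similarity of an unseen TF to the known TFs,
# e.g. derived from PSSM comparison.
tf_x_similarity = {"TF_A": 0.9, "TF_B": 0.1}

scores = predict_scores(tf_x_similarity, kernel, genes)
predicted = sorted(g for g, s in scores.items() if s >= 0.5)
```

Because the unseen TF resembles TF_A far more than TF_B, the targets of TF_A (g1 and g2) pass the similarity cut-off while g3 does not. A symmetric function over gene similarities would play the role of the second mapping, from the ‘genomic space’.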

6.3.3 Analyzing temporality of biological interactions

Another limitation of our analysis of PPI networks and gene co-expression networks is our assumption about the stability of these interactions. These interactions are, however, dynamic in nature, and revealing their dynamics is a future direction for improving our dissertation work. Science and technology have not matured sufficiently to monitor dynamic interactions between biological entities at a large scale. Recently, however, researchers have started to integrate static accumulated datasets, such as PPI networks, with temporal datasets. A recent study investigated the gene expression levels measured during the cell cycle in order to identify dynamic protein complex formation in the model organism [39]. We also plan to develop data mining algorithms to co-analyze existing static and dynamic datasets in an attempt to exploit the dynamics of biological systems. Previously, we studied dynamic relations and clusters in social networks and showed that dynamic analysis reveals information that cannot be elucidated by static analysis [9]. Similar techniques can be employed for studying the temporal interactions and clusters of biological systems.
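One simple way to picture the co-analysis of static and temporal data is sketched below in Python: a static PPI edge is treated as present at a time point only if both of its proteins are expressed above a threshold at that time. This rule, the threshold, the function name, and the toy data are assumptions made for illustration; the sketch does not reproduce the method of [39] or the event-based framework of [9].

```python
def temporal_snapshots(ppi_edges, expression, threshold):
    """Split a static PPI network into one edge list per time point.

    An edge is kept at time t only if both endpoints have expression of at
    least `threshold` at t (an illustrative rule, assumed for this sketch).
    """
    n_points = len(next(iter(expression.values())))
    snapshots = []
    for t in range(n_points):
        active = {g for g, profile in expression.items() if profile[t] >= threshold}
        snapshots.append([(u, v) for u, v in ppi_edges if u in active and v in active])
    return snapshots


# Toy data, invented for illustration only.
ppi_edges = [("A", "B"), ("B", "C")]
expression = {
    "A": [1.0, 0.2, 0.9],
    "B": [0.8, 0.9, 0.1],
    "C": [0.1, 0.7, 0.8],
}

snapshots = temporal_snapshots(ppi_edges, expression, threshold=0.5)
```

Each snapshot can then be clustered or compared with its neighbors in time, which is where techniques for tracking the evolution of interaction graphs would apply.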

6.3.4 Computational systems biology

Over the years, computational biologists have analyzed basic units of many biological systems with mathematical, statistical, and computer science tools. Today, due to advances in data generation technology and the decrease in technology cost, scientists produce and accumulate enormous amounts of diverse data on many organisms. This increasing size and diversity of the data has led to a new area of research in computational biology that focuses on the holistic and composite characteristics of biological systems. A hallmark of this new field, known as Systems Biology, is the investigation of relationships between biological entities of a particular organism, including its genes, proteins, and metabolites, as a whole. In other words, Systems Biology focuses on networks of interacting biological molecules. This thesis focused on developing computational models for the construction and analysis of diverse biological networks. Systems Biology offers a number of opportunities for extending our work. More specifically, in the near future we plan to focus on the integration of diverse interaction networks into an integrated and dynamic map of the interactome.

BIBLIOGRAPHY

[1] http://www.fas.at.

[2] A. Abou-Rjeili and G. Karypis. Multilevel algorithms for partitioning power-law graphs. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.

[3] R. Albert, H. Jeong, and A. Barabasi. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000.

[4] D. Allocco, I. Kohane, and A. Butte. Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics, 5(1):18, 2004.

[5] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.

[6] P. Aloy and R. Russell. Potential artefacts in protein-interaction networks. FEBS letters, 530(1-3):253–254, 2002.

[7] V. Arnau, S. Mars, and I. Marín. Iterative cluster analysis of protein interaction data. Bioinformatics, 21(3):364–378, 2005.

[8] M. Ashburner, C. Ball, J. Blake, D. Botstein, H. Butler, J. Cherry, A. Davis, K. Dolinski, S. Dwight, J. Eppig, et al. Gene Ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.

[9] S. Asur, S. Parthasarathy, and D. Ucar. An event-based framework for characterizing the evolutionary behavior of interaction graphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 913– 921, 2007.

[10] S. Asur, D. Ucar, and S. Parthasarathy. An ensemble framework for clustering protein-protein interaction networks. Bioinformatics, 23(13):i29, 2007.

[11] R. Avogadri and G. Valentini. Fuzzy ensemble clustering based on random projections for DNA microarray data analysis. Artificial Intelligence in Medicine, 45(2-3):173–183, 2009.

[12] G. Bader and C. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2, 2003.

[13] Z. Bar-Joseph. Analyzing time series gene expression data. Bioinformatics, 20(16):2493–2503, 2004.

[14] Z. Bar-Joseph, G. Gerber, T. Lee, N. Rinaldi, J. Yoo, F. Robert, D. Gordon, E. Fraenkel, T. Jaakkola, R. Young, et al. Computational discovery of gene modules and regulatory networks. Nature , 21(11):1337–1342, 2003.

[15] J. Berg, M. Lässig, and A. Wagner. Structure and of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC , 4(1):51, 2004.

[16] A. Beyer, C. Workman, J. Hollunder, D. Radke, U. Moller, T. Wilhelm, and T. Ideker. Integrated assessment and prediction of transcription factor binding. PLoS Comput Biol, 2(6):e70, 2006.

[17] B. Bolstad, R. Irizarry, M. Astrand, and T. Speed. A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics, 19(2):185–193, 2003.

[18] E. Boyle, S. Weng, J. Gollub, H. Jin, D. Botstein, J. Cherry, and G. Sherlock. Go::termfinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics, 20:18:3710–3715, 2004.

[19] J. Bray and M. S.E. Multivariate analysis of variance. Quantitative applications in the social sciences series:Sage Publications., 1985.

[20] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[21] R. Brockmann, A. Beyer, J. Heinisch, and T. Wilhelm. Posttranscriptional expression regulation: What determines translation rates. PLoS Comput Biol, 3(3):e57, 2007.

[22] C. Brun, F. Chevenet, D. Martin, J. Wojcik, A. Guénoche, and B. Jacq. Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biology, 5(1):6–6, 2004.

[23] C. Brun, C. Herrmann, and A. Guénoche. Clustering proteins from interaction networks for the prediction of cellular functions. BMC bioinformatics, 5(1):95, 2004.

[24] A. Butte and I. Kohane. Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. In Proceedings of the Pacific Symp Biocomput (PSB), 5:418–429, 2000.

[25] L. Calza, R. Manfredi, and F. Chiodo. Dyslipidaemia associated with antiretroviral therapy in HIV-infected patients. Journal of Antimicrobial Chemotherapy, 53(1):10–14, 2004.

[26] D. Canton and D. Litchfield. The shape of things to come: an emerging role for protein kinase CK2 in the regulation of cell morphology and the cytoskeleton. Cellular signalling, 18(3):267–275, 2006.

[27] S. Carter, C. Brechbühler, M. Griffin, and A. T. Bond. Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 20:14:2242–2250, 2004.

[28] J. Chen, W. Hsu, L. Lee, and S. Ng. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics, 22:16:1998–2004, 2006.

[29] J. Chen, W. Hsu, M. Lee, and S. Ng. NeMoFinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 106–115. ACM New York, NY, USA, 2006.

[30] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng. Labeling network motifs in protein interactomes for protein function prediction. In Proceedings of the International Conference on Data Engineering (ICDE), 0:546–555, 2007.

[31] Y. Cheng and G. Church. Biclustering of gene expression data. In Proceedings of International Conference on Intelligent Systems for Molecular Biology (ISMB), volume 2000, pages 93–103, 2000.

[32] H. Chua, K. Ning, W. Sung, H. Leong, and L. Wong. Using indirect protein-protein interactions for protein complex prediction. In Proceedings of the IEEE Computational Systems Bioinformatics (CSB), page 97. Imperial College Press, 2007.

[33] F. Chung, L. Lu, T. Dewey, and D. Galas. Duplication models for biological networks. Journal of Computational Biology, 10(5):677–687, 2003.

[34] P. Crucitti, V. Latora, M. Marchiori, and A. Rapisarda. Efficiency of scale-free networks: error and attack tolerance. Physica A: Statistical Mechanics and its Applications, 320:622–642, 2003.

[35] L. da Fontoura Costa. Hub-based community finding. Arxiv preprint cond-mat/0405022, 2004.

[36] F. Daenen, F. van Roy, and P. De Bleser. Low nucleosome occupancy is encoded around functional human transcription factor binding sites. BMC Genomics, 9(1):332, 2008.

[37] G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23–29, 1998.

[38] G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. Report C-1997-66, University of Helsinki, Department of Computer Science, October 1997.

[39] U. de Lichtenberg, L. Jensen, S. Brunak, and P. Bork. Dynamic complex formation during the yeast cell cycle. Science, 307(5710):724–727, 2005.

[40] C. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg. Protein Interactions Two Methods for Assessment of the Reliability of High Throughput Observations*. Molecular & Cellular Proteomics, 1(5):349–356, 2002.

[41] M. Deng, S. Mehta, F. Sun, and T. Chen. Inferring domain-domain interactions from protein-protein interactions. Genome research, 12(10):1540–1548, 2002.

[42] P. D’haeseleer, S. Liang, and R. Somogyi. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16(8):707–726, 2000.

[43] I. Dhillon, Y. Guan, and B. Kulis. A fast kernel-based multilevel algorithm for graph clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 629–634, 2005.

[44] I. Dhillon, Y. Guan, and B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1944–1957, 2007.

[45] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning (ICML), pages 105–112. Morgan Kaufmann, 1996.

[46] M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.

[47] J. Ernst, O. Vainas, C. Harbison, I. Simon, and Z. Bar-Joseph. Reconstructing dynamic regulatory maps. Molecular Systems Biology, 3(74):1–13, 2007.

[48] F. Estruch. Stress-controlled transcription factors, stress-induced genes and stress tolerance in budding yeast. FEMS Reviews, 24(4):469–486, 2000.

[49] F. Estruch and M. Carlson. Two homologous zinc finger genes identified by multicopy suppression in a SNF1 protein kinase mutant of Saccharomyces cerevisiae. Molecular and Cellular Biology, 13(7):3872–3881, 1993.

[50] S. Eum and M. Shin’ichi Arakawa. Toward bio-inspired network robustness-Step 1. Modularity. Bio-Inspired Models of Network, Information and Computing Systems, 2007. Bionetics 2007. 2nd, pages 84–87, 2007.

[51] S. Fields and O. Song. A novel genetic system to detect protein protein interactions. Nature, 340(6230):245–246, 1989.

[52] L. Freeman. Centrality in social networks: Conceptual clarification. Social Networks, 1:215–239, 1979.

[53] L. Freeman. Centered graphs and the construction of ego networks. Mathematical Social Sciences, 3:291–304, 1982.

[54] C. Friedel and R. Zimmer. Inferring topology from clustering coefficients in protein-protein interaction networks. BMC bioinformatics, 7(1):519, 2006.

[55] J. Gagneur, R. Krause, T. Bouwmeester, and G. Casari. Modular decomposition of protein-protein interaction networks. Genome Biology, 5(8):R57, 2004.

[56] F. Gao, B. Foat, and H. Bussemaker. Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5(1):31, 2004.

[57] A. Gasch, P. Spellman, C. Kao, O. Carmel-Harel, M. Eisen, G. Storz, D. Botstein, and P. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Molecular biology of the cell, 11(12):4241–4257, 2000.

[58] A. Gasch and M. Werner-Washburne. The genomics of yeast responses to environmental stress and starvation. Functional & Integrative Genomics, 2(4-5):181–192, 2002.

[59] S. Gebel, B. Gerstmayer, A. Bosio, H. Haussmann, E. Van Miert, and T. Muller. Gene expression profiling in respiratory tissues from rats exposed to mainstream cigarette smoke. Carcinogenesis, 25:169–178, 2004.

[60] D. Gilchrist and M. Rexach. Molecular basis for the rapid dissociation of nuclear localization signals from karyopherin alpha in the nucleoplasm. J. Biol. Chem, 278:51:51937–51949, 2003.

[61] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821-7826, 2002.

[62] A. Goldenberg and A. Moore. Tractable learning of large bayes net structures from sparse data. Proceedings of the twenty-first international conference on Machine learning (ICML), 2004.

[63] P. Grant. A tale of histone modifications. Genome Biol, 2(4):0003–1, 2001.

[64] N. Hackett, A. Heguy, B. Harvey, T. O’Connor, K. Luettich, D. Flieder, R. Kaplan, and R. Crystal. Variability of antioxidant-related gene expression in the airway epithelium of cigarette smokers. Am. J. Respir. Cell Mol. Biol., 29:331–343, 2003.

[65] F. Harary and E. Palmer. Graphical enumeration. Academic press New York, 1997.

[66] C. Harbison, D. Gordon, T. Lee, N. Rinaldi, K. Macisaac, T. Danford, N. Hannett, J. Tagne, D. Reynolds, J. Yoo, et al. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99–104, 2004.

[67] A. Hartemink, D. Gifford, T. Jaakkola, and R. Young. Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput, 7:437–449, 2002.

[68] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information processing letters, 76(4-6):175–181, 2000.

[69] B. Hendrickson and R. Leland. The chaco user’s guide: version 2.0. Sandia Tech Report, SAND94–2692, 1994.

[70] A. Hinnebusch. Transcriptional regulation of GCN4 and the general amino acid control of yeast. Annu Rev Microbiol., 59:407–50, 2005.

[71] A. Hintze and C. Adami. Evolution of complex modular biological networks. PLoS Computational Biology, 4(2):23, 2008.

[72] J. Hua, D. Koes, and Z. Kou. Finding motifs in protein-protein interaction networks. Project Final Report, CMU, 2003.

[73] J. Huan, W. Wang, J. Prins, and J. Yang. Spin: mining maximal frequent subgraphs from graph databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–586, 2004.

[74] T. Hughes, M. Marton, A. Jones, C. Robets, R. Stoughton, C. Armour, and H. Bennett. Functional discovery via a compendium of expression profiles. Cell, 102(1):109.

[75] T. Ideker, O. Ozier, B. Schwikowski, and A. Siegel. Discovering regulatory and signalling circuits in molecular integration networks. Bioinformatics, 18:1:233–240, 2002.

[76] T. Ideker and R. Sharan. Protein networks in disease. Genome Research, 18(4):644, 2008.

[77] T. Ideker, V. Thorsson, J. Ranish, R. Christmas, J. Buhler, J. Eng, R. Bumgarner, D. Goodlett, R. Aebersold, and L. Hood. Integrated genomic and proteomic analysis of a systematically perturbed metabolic network. Science, 292:929–934, 2001.

[78] S. Itzkovitz, T. Tlusty, and U. Alon. Coding limits on the number of transcription factors. BMC genomics, 7(1):239, 2006.

[79] V. Iyer, C. Horak, C. Scafe, D. Botstein, M. Snyder, and P. Brown. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409(6819):533–538, 2001.

[80] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411:41–42, 2001.

[81] J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical . In Proceedings of the International Conference Research in Computational Linguistics, ROCKLING X, 1997.

[82] I. Jordan, L. Marino-Ramirez, Y. Wolf, and E. Koonin. Conservation and in the scale-free human gene coexpression network. Mol. Biol. Evol., 21:2058–2070, 2004.

[83] G. Karypis et al. Cluto: a software package for clustering high-dimensional datasets. http://www-users.cs.umn.edu/ karypis/cluto/index.html.

[84] G. Karypis et al. Metis: a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orders of sparse matrices. www-users.cs.umn.edu/ karypis/ metis/metis/files/manual.pdf.

[85] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. 36th Design Automation Conference, pages 343–348, 1999.

[86] R. Khanin and E. Wit. How scale-free are biological networks. Journal of Computational Biology, 13(3):810–818, 2006.

[87] A. King, N. Przulj, and I. Jurisica. Protein complex prediction via cost-based clustering. Bioinformatics, 20(17):3013–3020, 2004.

[88] K. Klemm and V. Eguiluz. Growing scale-free networks with small-world behavior. Phys Rev E Stat Nonlin Soft Matter Phys. 65, 2002.

[89] T. Kohonen. Self-organizing maps. Springer, Berlin, 1995.

[90] M. Koyuturk, W. Szpankowski, and A. Grama. Biclustering gene-feature matrices for statistically significant dense patterns. In Proceedings of the IEEE Computational Systems Bioinformatics Conference (CSB), pages 480–484, 2004.

[91] H. Lee, A. Hsu, J. Sajdak, J. Qin, and P. Pavlidis. Coexpression analysis of human genes across many microarray data sets. Genome Research, 14:1085–1094, 2004.

[92] H. Lee, H. Lee, S. Jeon, T. Chung, Y. Lim, and W. Huh. High-resolution analysis of condition-specific regulatory modules in Saccharomyces cerevisiae. Genome Biology, 9:R2, 2008.

[93] T. Lee, N. Rinaldi, F. Robert, D. Odom, Z. Bar-Joseph, G. Gerber, N. Hannett, C. Harbison, C. Thompson, I. Simon, et al. Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 298(5594):799–804, 2002.

[94] W. Lee, D. Tillo, N. Bray, R. Morse, R. Davis, T. Hughes, and C. Nislow. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics, 39:1235–1244, 2007.

[95] K. Lemmens, T. Dhollander, T. De Bie, P. Monsieurs, K. Engelen, B. Smets, J. Winderickx, B. De Moor, and K. Marchal. Inferring transcriptional modules from chip-chip, motif and microarray data. Genome biology, 7(5):R37, 2006.

[96] X. Li, C. Foo, and S. Ng. Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. In Proceedings of the IEEE Computational Systems Bioinformatics (CSB), page 157. Imperial College Press, 2007.

[97] X. Li, S. Tan, C. Foo, and S. Ng. Interaction graph mining for protein complexes using local clique merging. Genome Informatics, 16(2):260–269, 2005.

[98] J. Lieb, X. Liu, D. Botstein, and P. Brown. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet, 28(4):327–34, 2001.

[99] D. Lin. An information-theoretic definition of similarity. pages 296–304, 1998.

[100] D. Lowd and P. Domingos. Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning (ICML), volume 22, page 529. ACM Press, 2005.

[101] Y. Lu, P. Liu, P. Xiao, and H. Deng. Hotelling’s t2 multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21:14:3105–3113, 2005.

[102] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.

[103] A. Margolin, I. Nemenman, K. Basso, U. Klein, C. Wiggins, G. Stolovitzky, R. Favera, and C. A. Aracne: An algorithm for reconstruction of genetic networks in a mammalian cellular context. BMC Bioinformatics, 7:1:S7, 2006.

[104] M. Marselos and G. Michalopoulos. Changes in the pattern of aldehyde dehydrogenase activity in primary and metastatic adenocarcinomas of the human colon. Cancer letters, 34(1):27–37, 1987.

[105] A. Mayes, L. Verdone, P. Legrain, and J. Beggs. Characterization of sm-like proteins in yeast and their association with u6 snrna. EMBO J., 18(15):4321–4331, 1999.

[106] G. Mishra, M. Suresh, K. Kumaran, N. Kannabiran, S. Suresh, P. Bala, K. Shivakumar, N. Anuradha, R. Reddy, T. Raghavan, et al. Human protein reference database–2006 update. Nucleic acids research, 34(Database Issue):D411, 2006.

[107] J. Nacher, T. Yamada, S. Goto, M. Kanehisa, and T. Akutsu. Two complementary representations of a scale-free network. Physica A, pages 349–363, 2005.

[108] L. Narlikar, R. Gordan, and A. Hartemink. Nucleosome Occupancy Information Improves de novo Motif Discovery. Lecture Notes in Computer Science, 4453:107, 2007.

[109] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.

[110] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002.

[111] I. Näthke. Cytoskeleton out of the cupboard: colon cancer and cytoskeletal changes induced by loss of apc. Nature Reviews Cancer, 6:967–974, 2006.

[112] S. Oliver. Guilt-by-association goes global. Nature, 403(6770):601–603, 2000.

[113] H. Oshimoto, S. Okamura, M. Yoshida, and M. Mori. Increased activity and expression of phospholipase d2 in human colorectal cancer. Oncology research, 14(1):31, 2003.

[114] B. Ostel. Statistics in research basic concepts and techniques for research workers. Iowa State University Press, Ames, Iowa, USA, 1963.

169 [115] C. Palmer and C. Faloutsos. Electricity based external similarity of categorical at- tributes. Lecture notes in computer science, pages 486–500, 2003.

[116] R. Parker, O. Flint, R. Mulvey, C. Elosua, F. Wang, W. Fenderson, S. Wang, W. Yang, and M. Noor. Endoplasmic reticulum stress links dyslipidemia to inhibition of proteasome activity and glucose transport by hiv protease inhibitors. Molecular Pharmacology, 67:1909–1919, 2005.

[117] L. Pemberton and B. Paschal. Mechanisms of receptor-mediated nuclear import and nuclear export. Traffic, 6:187, 2005.

[118] J. Pereira-Leal, A. Enright, and C. Ouzounis. Detection of functional modules from protein interaction networks. Proteins, 54:1:49–57, 2004.

[119] C. Pizzuti and S. Rombo. Pincoc: a co-clustering based approach to analyze protein-protein interaction networks. Lecture Notes in Computer Science, 4881:821, 2007.

[120] C. Pizzuti and S. Rombo. Multi-functional Protein Clustering in PPI Networks. Bioinformatics Research and Development, 13:318–330, 2008.

[121] N. Przulj, D. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515, 2004.

[122] N. Przulj, D. Wigle, and I. Jurisica. Functional topology in a network of protein interactions. Bioinformatics, 20(3):340–348, 2004.

[123] E. Ravasz, A. Somera, D. Mongru, Z. Oltvai, and A. Barabasi. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551–1555, 2002.

[124] B. Ren, F. Robert, J. Wyrick, O. Aparicio, E. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, et al. Genome-wide location and function of dna binding proteins. Science, 290(5500):2306–2309, 2000.

[125] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1:448–453, 1995.

[126] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Research, 30(5):1163–1168, 2002.

[127] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, 19:6:756–763, 2003.

[128] E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet, 34(2):166–76, 2003.

[129] J. Sevilla, V. Segura, A. Podhorski, E. Guruceaga, J. Mato, L. Martinez-Cruz, F. Corrales, and A. Rubio. Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2(4):330–338, 2005.

[130] R. Sharan and R. Shamir. CLICK: A clustering algorithm with applications to gene expression analysis. In Proc. Int. Conf. Intell. Syst. Mol. Biol (ISMB), volume 8, pages 307–316, 2000.

[131] I. Simon, J. Barnett, N. Hannett, C. Harbison, N. Rinaldi, T. Volkert, J. Wyrick, J. Zeitlinger, D. Gifford, T. Jaakkola, et al. Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle. Cell, 106(6):697–708, 2001.

[132] S. Skiena. Line graph. In Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, pp. 128 and 135–139, 1990.

[133] G. Smyth, N. Thorne, and J. Wettenhall. Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor, pages 397–420, 2005.

[134] P. H. Sneath and R. Sokal. Hierarchical Clustering. Freeman, London, UK, 1973.

[135] B. Snel, P. Bork, and M. Huynen. The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci, 99:5890–5895, 2002.

[136] A. Spira, J. Beane, V. Shah, G. Liu, F. Schembri, X. Yang, J. Palma, and J. Brody. Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc. Natl. Acad. Sci. USA, 101:27:10143–8, 2004.

[137] V. Spirin and L. Mirny. Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences, 100(21):12123–12128, 2003.

[138] E. Sprinzak, S. Sattath, and H. Margalit. How reliable are experimental protein-protein interaction data? J. Mol. Biol., 327:919–923, 2003.

[139] C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25, 2007.

[140] J. Stuart, E. Segal, D. Koller, and S. Kim. A gene coexpression network for global discovery of conserved genetic modules. Science, 302:5643:249–255, 2003.

[141] M. Stumpf, W. Kelly, T. Thorne, and C. Wiuf. Evolution at the system level: the natural history of protein interaction networks. Trends in Ecology & Evolution, 22(7):366–373, 2007.

[142] A. Subramanian, P. Tamayo, V. Mootha, S. Mukherjee, B. Ebert, M. Gillette, A. Paulovich, S. Pomeroy, T. Golub, E. Lander, and J. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci., 102:15545–50, 2005.

[143] B. Tabachnick and L. Fidell. Using multivariate statistics. Harper Collins College Publishers: New York, 1996.

[144] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(Suppl 1):S136–S144, 2002.

[145] A. Tanay, R. Sharan, and R. Shamir. Biclustering algorithms: A survey. Handbook of computational molecular biology, 9:26–1, 2005.

[146] A. Thomas, R. Cannings, N. Monk, and C. Cannings. On the structure of protein-protein interaction networks. Biochemical Society Transactions, 31:1491–1496, 2003.

[147] D. Ucar, F. Altiparmak, H. Ferhatosmanoglu, and S. Parthasarathy. Investigating the use of extrinsic similarity measures for microarray analysis. In Proceedings of the BIOKDD workshop at the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.

[148] D. Ucar, F. Altiparmak, H. Ferhatosmanoglu, and S. Parthasarathy. Mutual Information Based Extrinsic Similarity for Microarray Analysis. In Proceedings of the International Conference on Bioinformatics and Computational Biology (BICOB), pages 424–436. Springer, 2009.

[149] D. Ucar, S. Asur, U. Catalyurek, and S. Parthasarathy. Improving functional modularity in protein-protein interactions graphs using hub-induced subgraphs. In Proceedings of the Principles and Practice of Knowledge Discovery in Databases (PKDD) Conference, pages 371–382, 2006.

[150] D. Ucar, A. Beyer, S. Parthasarathy, and C. Workman. Predicting functionality of protein-DNA interactions by integrating diverse evidence. Bioinformatics, 25(12):i137, 2009.

[151] D. Ucar, I. Neuhaus, P. Ross-MacDonald, C. Tilford, S. Parthasarathy, N. Siemers, and R. Ji. Construction of a reference gene association network from multiple profiling data: application to data analysis. Bioinformatics, 23(20):2716, 2007.

[152] D. Ucar, S. Parthasarathy, and S. Asur. Effective pre-processing strategies for functional clustering of a protein-protein interactions network. In Proceedings of the IEEE International Conference on BioInformatics and BioEngineering (BIBE), pages 129–136, 2005.

[153] V. van Noort, B. Snel, and M. Huynen. The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep, 5:3:280–284, 2004.

[154] B. van Steensel and S. Henikoff. Identification of in vivo DNA targets of chromatin proteins using tethered Dam methyltransferase. Nature Biotechnology, 18:424–428, 2000.

[155] A. Vázquez, A. Flammini, A. Maritan, and A. Vespignani. Modeling of protein interaction networks. Complexus, 1:38–44, 2003.

[156] V. Vinciotti, X. Liu, R. Turk, E. de Meijer, and P. Hoen. Exploiting the full power of temporal gene expression profiling through a new statistical test: application to the analysis of muscular dystrophy data. BMC Bioinformatics, 7:183–194, 2006.

[157] Q. Wang, X. Wang, and B. Evers. Induction of cIAP-2 in human colon cancer cells through PKCδ/NF-κB. J. Biol. Chem, 278:51091–51099, 2003.

[158] D. Watts and S. Strogatz. Collective dynamics of small world networks. Nature, 393(6684):440–442, June 1998.

[159] C. Wei, Q. Wu, V. Vega, K. Chiu, P. Ng, T. Zhang, A. Shahab, H. Yong, Y. Fu, Z. Weng, et al. A Global Map of p53 Transcription-Factor Binding Sites in the Human Genome. Cell, 124(1):207–219, 2006.

[160] C. Workman, H. Mak, S. McCuine, J. Tagne, M. Agarwal, O. Ozier, T. Begley, L. Samson, and T. Ideker. A Systems Approach to Mapping DNA Damage Response Pathways. Science, 312(5776):1054–1059, 2006.

[161] C. Workman and G. Stormo. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput, 5:464–475, 2000.

[162] A. Wu, M. Garland, and J. Han. Mining scale-free networks using geodesic clustering. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 719–724, 2004.

[163] L. Wu, T. Hughes, A. Davierwala, M. Robinson, R. Stoughton, and S. Altschuler. Large-scale prediction of saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31:255–265, 2002.

[164] S. Wuchty. Evolution and topology in the yeast protein interaction network. Genome Res., 14:1310–1314, 2004.

[165] I. Xenarios, L. Salwinski, X. Duan, P. Higney, S. Kim, and D. Eisenberg. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research, 30(1):303, 2002.

[166] H. Xiong. Non-linear tests for identifying differentially expressed genes or genetic networks. Bioinformatics, 22:8:919–923, 2006.

[167] Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):i232, 2008.

[168] C. Yeang and T. Jaakkola. Physical network models and multi-source data integration. 7th Int. Conf. Research in Computational Molecular Biology, 2003.

[169] A. M. Yip and S. Horvath. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics, 8:22, 2007.

[170] S. Yook, Z. N. Oltvai, and A. L. Barabasi. Functional and topological characterization of protein interaction networks. Proteomics, 4:928–942, 2004.

[171] B. Zhang and S. Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4:1, 2005.
