
NETWORK MINING APPROACH TO BIOMARKER DISCOVERY

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Praneeth Uppalapati, B.E.

Graduate Program in Computer Science and Engineering

The Ohio State University

2010

Thesis Committee:

Dr. Kun Huang, Advisor

Dr. Raghu Machiraju

Copyright by

Praneeth Uppalapati

2010

ABSTRACT

With the rapid development of high-throughput expression profiling technology, molecular profiling has become a powerful tool for characterizing disease subtypes and discovering gene signatures. Most existing gene signature discovery methods apply statistical methods to select genes whose expression values can differentiate subject groups. However, a drawback of these approaches is that the selected genes are not functionally related and hence cannot reveal the biological mechanisms behind the differences between the patient groups.

Gene co-expression network analysis can be used to mine functionally related sets of genes that can be marked as potential biomarkers through survival analysis. We present an efficient heuristic algorithm, EigenCut, that exploits the properties of gene co-expression networks to mine functionally related and dense modules of genes. We apply this method to a brain tumor (Glioblastoma Multiforme) study to obtain functionally related clusters. If functional groups of genes with predictive power on patient prognosis can be identified, insights into the mechanisms related to metastasis in GBM can be obtained and better therapeutic plans can be developed. We predicted potential biomarkers by dividing the patients into two groups based on their expression profiles over the genes in each cluster and comparing their survival outcomes through survival analysis. We obtained 12 potential biomarkers with log-rank test p-values less than 0.01.

DEDICATION

This document is dedicated to my family & friends.


ACKNOWLEDGMENTS

I would like to thank my research advisor Dr. Kun Huang for the support and guidance he has given me throughout the entire period of my work. It has been a great pleasure to work with him. I would also like to thank my advisor Dr. Raghu Machiraju for his unconditional help and suggestions.

I also thank Yang Xiang and Abhisek Kundu for the help and support they have extended to me. Their inputs and suggestions have been a great help throughout my thesis.

I would like to express my deepest gratitude to my parents, who have shown unconditional love and care throughout my life. I am thankful to my friends who have given me moral support and helped me get through many hard times.


VITA

May 2004 ...... Sri Chaitanya Jr. College

2008 ...... B.E. Computer Science and Engineering, Osmania University

2008 to present ...... Graduate Student, Computer Science and Engineering Department, The Ohio State University

2009 to present ...... Graduate Research Associate, Department of Biomedical Informatics, The Ohio State University

FIELDS OF STUDY

Major Field: Computer Science and Engineering


TABLE OF CONTENTS

ABSTRACT

DEDICATION

ACKNOWLEDGMENTS

VITA

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1: INTRODUCTION

1.1 Background

1.2 Motivation

1.3 Thesis Statement

1.4 Contribution

1.5 Organization

CHAPTER 2: GENE CO-EXPRESSION NETWORK ANALYSIS

2.1 Co-expression Similarity

2.2 Building a Gene Co-expression Network

2.3 Mining Modules

2.4 Comparing Modules

CHAPTER 3: DENSE NETWORK COMPONENT DISCOVERY METHODS

3.1 K-Core Algorithm

3.2 Min-Cut Algorithm

3.3 Prune-Cut Algorithm - Modification to Min-Cut Algorithm

CHAPTER 4: EIGEN CUT ALGORITHM: A NEW NETWORK APPROACH

4.1 The Algorithm

4.2 Performance

CHAPTER 5: APPLICATIONS

5.1 Application 1: TCGA Data on Glioblastoma Multiforme

5.1.1 Gene-miRNA Interaction Prediction

5.1.2 Gene Signature Discovery

5.1.2.1 Survival Analysis

5.1.2.3 Results

5.2 Application 2: Breast Cancer Data - GDS2250 Dataset

CHAPTER 6: CONCLUSION & FUTURE WORK

BIBLIOGRAPHY

Appendix A: Gene Lists for Clusters A-L

Appendix B: Codes/Programs


LIST OF TABLES

Table 1. Running times and number of clusters for different graph sizes (no. of nodes)

Table 2. Gene enrichment results for Cluster 1 (GO: Molecular Function)

Table 3. Gene enrichment results for Cluster 1 (GO: Biological Process)

Table 4. Gene enrichment results for Cluster 1 (GO: Cellular Component)

Table 5. List of clusters (EigenCut without overlaps and next-available-hub seed selection) with p-values less than 0.05 in the log-rank tests

Table 6. List of clusters (EigenCut without overlaps and next-available-higher-index seed selection) with p-values less than 0.05 in the log-rank tests

Table 7. List of clusters (EigenCut with multi-merge, overlaps and next-available-hub seed selection) with p-values less than 0.05 in the log-rank tests

Table 8. List of clusters (EigenCut on K-core) with p-values less than 0.05 in the log-rank tests

Table 9. Potential biomarkers identified through different methods

Table 10. GO enrichment results using ToppGene for Cluster D (GO: Biological Processes)

Table 11. GO enrichment results using ToppGene for Cluster E (GO: Biological Processes)

Table 12. GO enrichment results using ToppGene for Cluster H (GO: Biological Processes)

Table 13. Overlap values for the basal-like cancer clusters versus non-basal-like cancer clusters

Table 14. Cluster A

Table 15. Cluster B

Table 16. Cluster C

Table 17. Cluster D

Table 18. Cluster E

Table 19. Cluster F

Table 20. Cluster G

Table 21. Cluster H

Table 22. Cluster I

Table 23. Cluster J

Table 24. Cluster K

Table 25. Cluster L


LIST OF FIGURES

Figure 1. A graph colored according to the core membership of the vertices. The yellow vertices form a 3-core. The orange together with the yellow vertices form a 2-core. All the vertices except vertex 21 form a 1-core. All the vertices together form a 0-core

Figure 2. A graph in which two strongly connected components are connected together by a few inter-component edges (green edges) which are not eliminated by a K-core algorithm

Figure 3. Heatmap of expression values of a cluster returned by PruneCut

Figure 4. Plot of time taken vs. the product of size and maximum degree of the network (d•N). The regression coefficient of the linear fit is greater than 0.99

Figure 5. Eigengene network obtained with a threshold of 0.5

Figure 6. Kaplan-Meier curves for two types of glioma [22]

Figure 7. Clusters obtained by applying EigenCut with the next-available-hub seed-selection approach (only some of the clusters are shown in the figure)

Figure 8. Kaplan-Meier curves for cluster 8 in the above table

Figure 9. Kaplan-Meier curves for cluster 14 in the above table

Figure 10. Kaplan-Meier curves for cluster 32 in the above table

Figure 11. Functional enrichment analysis using IPA for Cluster D. The x-axis shows the log (base 10) of the p-values of the enriched terms from Fisher's exact tests

Figure 12. Functional enrichment analysis using IPA for Cluster E. The x-axis shows the log (base 10) of the p-values of the enriched terms from Fisher's exact tests

Figure 13. Functional enrichment analysis using IPA for Cluster H. The x-axis shows the log (base 10) of the p-values of the enriched terms from Fisher's exact tests

Figure 14. Functional enrichment analysis using IPA for Cluster L. The x-axis shows the log (base 10) of the p-values of the enriched terms from Fisher's exact tests


CHAPTER 1: INTRODUCTION

1.1 Background

A gene is a basic unit of heredity in a living organism. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring [1]. Cells in organisms possess many genes associated with many different biological traits, some of which are directly visible, such as eye color, and some of which are not, such as blood type or increased risk of certain diseases. In cells, a gene is a portion of DNA (deoxyribonucleic acid) that contains both coding regions (exons), which determine what the gene does, and regulatory regions, which control the expression or activation of the gene. When a gene is active, the exons are copied in a process called transcription, producing a messenger RNA (mRNA, or messenger ribonucleic acid) copy of the gene's information. The mRNA can then direct the synthesis of proteins via the ribosome, based on the genetic code. In some cases the RNA is used directly. The molecules resulting from gene expression, whether RNA or protein, are called gene products and are responsible for the development and functioning of all living organisms. The physical development and traits of an organism can be thought of as the result of genes interacting with each other and with the environment. Organisms often have many genes, and the entire set of genes is called the genome. The human genome is estimated to contain around 20,000-25,000 protein-coding genes [2].


MicroRNAs (miRNAs) are post-transcriptional regulators that bind to mRNAs, often causing repression of the corresponding genes [4]. However, they can also cause over-activation of genes [5]. They are short RNA molecules that can bind to complementary parts of mRNAs. Each miRNA can influence several mRNAs, and each mRNA can be influenced by several miRNAs. Aberrant expression or activation of miRNAs has been associated with several disease states. However, only a small number of miRNA-mRNA interactions have been observed through biological experiments.

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but for non-protein-coding genes the product is a functional RNA. Several steps in the gene expression process may be modulated. The genetic code is manifested through gene expression, and the properties of the expression products give rise to an organism's traits or phenotype. Gene expression is a highly complicated and tightly controlled process that lets the cell respond dynamically to external influences and to its own changing needs. This mechanism acts as an "on/off switch" controlling which genes are expressed, and also as a "+/- control" adjusting the level of expression of particular genes as necessary [3]. Gene expression profiling is the measurement of the activity or expression of thousands of genes at once to create a global picture of cellular function. The kinds and amounts of mRNA produced by a cell are studied to learn which genes are expressed in the cell, which in turn gives insight into how the cell responds to changes. Expression profiling experiments often involve measuring the relative expression of genes, or the relative amount of mRNA expressed, in two or more experimental conditions, because altered levels of a specific mRNA may suggest a change in the need for the protein encoded by that mRNA, indicating a probable response to changes in the environment. For example, comparing the expression of cell samples from a healthy human and a diseased human can suggest the association of genes with a particular disease.

A microarray is a tool for analyzing gene expression that consists of a small glass slide containing samples of many genes (whose DNA composition is already known) arranged in a regular pattern [3]. It works by exploiting the ability of mRNA (or cDNA) to bind specifically to the DNA template from which it originated. By using an array containing many DNA samples, the expression levels of hundreds or thousands of genes within a cell can be determined in a single experiment by measuring the amount of mRNA bound to each site on the array. The amount of mRNA bound to a site is proportional to the expression of the corresponding gene. Using a CCD camera and a computer, the amount of mRNA bound to each site on the array is measured, generating a profile of gene expression in the cell.

Gene expression depends on various environmental factors such as time of day, food intake and molecular concentrations within a cell. The expression of a gene changes in response to changes in the environment. For example, if a cell is deficient in some kind of receptor protein, then more of that receptor protein may be synthesized to bring the cell back to a normal state. Genes interact with each other in the cell, so the expression of a gene is also controlled by other genes via the mRNA and proteins they encode. Moreover, many genes are involved in a single cellular process, and many cells respond together to an external change. This implies that genes that encode similar proteins, or proteins involved in similar biological processes, often respond similarly and are therefore expressed together. Such genes are called co-expressed genes: they tend to be expressed together (or not at all) and at similar levels, and they may belong to a common biological pathway. A biological pathway is a network of actions among molecules in a cell that leads to a certain product or controls certain processes in a cell.

A network is a straightforward way to represent interactions between nodes, and network concepts are useful in analyzing complex interactions. Network-based methods are useful in many domains such as gene co-expression networks, protein-protein interaction networks, cell-cell interaction networks, the world wide web and social interaction networks (Zhang and Horvath, 2005). Gene co-expression networks are networks whose nodes represent genes and whose edges represent co-expression (interaction). These networks capture the transcriptional response of cells to changing conditions, which is measured by estimating expression values through microarray experiments. The nodes correspond to the expression profiles of genes, and the edges imply a significant pairwise expression profile association across different samples (or environmental perturbations). Since the coordinated co-expression of genes often encodes interacting proteins, the study of co-expression patterns provides insight into the underlying cellular processes (Zhang and Horvath, 2005).

Genes whose protein products are involved in similar biological processes tend to have similar expression profiles. Because a gene co-expression network is built on the basis of similarity between expression profiles, similar genes are connected to each other and form highly connected groups. The intra-group connections are dense, while the inter-group connections are sparse. Genes in these groups often have high similarity in their functions and thus form functional modules. Network-based methods such as dense-component and quasi-clique finding methods can be used to identify such groups in the co-expression network, and these groups could represent meaningful biological pathways.

The Gene Ontology (GO) project provides an ontology of a large number of terms that describe the properties and functions of gene products. The Gene Ontology covers three domains: cellular component, the parts of the cell or its environment; molecular function, the activities of the gene product at the molecular level; and biological process, molecular events pertaining to the functioning of integrated living units. Most of the genes in a genome are annotated with the ontology terms relevant to their gene products. Each gene is associated with many ontology terms, and each term is associated with more than one gene. Genes that are similar in their functioning share many common ontology terms. A GO enrichment analysis on a set of genes analyzes the GO terms associated with the set and returns the enriched terms in order of decreasing significance. The significance is given by a p-value, the probability that the term would be associated with the set by chance. A smaller p-value therefore implies that the term's enrichment is less likely due to chance and more likely due to a property of the group. If a GO enrichment analysis on a particular set of genes returns many terms with very high significance, i.e., very low p-values, then that set of genes is highly similar in its properties. However, this method alone cannot certify the similarity of the set; it only gives a basic insight into the functional homogeneity of the gene group. It can be used to evaluate the results of clustering techniques on gene datasets.
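As a concrete illustration, the enrichment p-value for a single GO term is often computed with a hypergeometric test (equivalent to a one-sided Fisher's exact test). The short Python sketch below shows this calculation; the counts and variable names are hypothetical and are not taken from the thesis data.

# Minimal sketch: hypergeometric enrichment p-value for one GO term.
# All counts below are hypothetical and only illustrate the calculation.
from scipy.stats import hypergeom

N = 20000   # annotated genes in the background (population size)
K = 300     # background genes annotated with the GO term of interest
n = 50      # genes in the cluster being tested (sample size)
k = 12      # cluster genes annotated with the term

# P(X >= k): probability of seeing at least k annotated genes by chance.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")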

1.2 Motivation

Several physical, biological and environmental factors influence the outcome of a disease in a patient. The main prognostic factors in breast cancer are age, tumor size, lymph node status, histological type of the tumor, pathological grade and hormone-receptor status. There are also other factors that can potentially predict the outcome of a disease, but they have limited predictive power. Gene expression profiling has proven to be a powerful predictor of disease outcome in cancer patients. With the rapid development of high-throughput gene expression profiling technology, it has become a popular tool for characterizing disease types and discovering gene signatures for disease prognosis and treatment prediction. This approach has been most popular in cancer research. However, most existing approaches consider the entire genome and apply statistical feature selection methods to identify genes whose expression profiles can differentiate subject groups, such as normal controls versus patients, or long-survival versus short-survival patients. A major drawback of these approaches is that the selected gene features are usually not functionally related and hence cannot reveal the key biological mechanisms and processes behind the differentiation of the groups. To overcome this issue, there have recently been studies combining pathway information with gene expression information to differentiate patients. Some pathway-based approaches use the gene expression of the contained genes to represent pathway activity, and the activity is used to differentiate between disease states. Other pathway-based approaches use gene expression differentiation to estimate the probability of pathway activation [16]. These approaches encode the markers as a set of genes constituting a pathway instead of individual genes. In [16], a subset of genes within a pathway that shows high expression differentiation, called the "condition-responsive genes (CORGs)", is chosen as the marker. However, these methods are supervised, are limited to previously known pathways, and cannot take into account the fact that genes interact across pathways.

The aim of gene co-expression network (GCN) analysis is to identify groups of genes whose expression levels are highly correlated with each other across multiple samples. The correlation metric is usually the Pearson correlation coefficient (PCC) or the Spearman correlation coefficient (SCC) between the expression profiles of the genes. It is standard to use the Pearson correlation coefficient as the co-expression measure, and the absolute value of the Pearson correlation is often used in gene expression cluster analysis [6]. Genes that are functionally related to each other are believed to be expressed similarly and thus to have high correlations between their expression profiles. Therefore, functionally similar genes group together in gene co-expression networks. A weighted graph is derived with the nodes representing the genes, or more precisely the expression profiles of the genes, and the edge weights being the correlation values between the profiles. For weighted GCN analysis, Horvath et al. have developed various methods for identifying highly correlated gene clusters using hierarchical clustering approaches. However, to apply hierarchical clustering all the edge weights have to be stored in memory, which can require huge amounts of memory given the large sizes of gene networks. The clustering approach also does not allow direct control over the intra-cluster connectivity, and it creates the whole cluster tree, which may not be needed when we are mostly concerned with very tightly bound clusters. Another approach to GCN analysis is to mine modules or sub-networks through various module-finding algorithms. However, many such module-finding algorithms work only on un-weighted graphs, i.e., graphs with binary edges (an edge between two genes indicates that the correlation between them is above a pre-defined threshold), and are computationally very expensive.

MicroRNAs have recently been found to influence the expression of genes by binding to complementary parts of mRNAs. However, little proven knowledge exists about the interactions between genes and miRNAs. Several software tools, such as ToppGene, predict these interactions by comparing sequence structure [27]. As research progresses on miRNA-gene interactions, new miRNAs and their interactions with genes are being discovered. However, it is very expensive to verify interactions between genes and miRNAs through biological experiments, considering the large number of genes and miRNAs in the genome. It is therefore useful to be able to predict such interactions without biologically verifying all possible gene-miRNA pairs. Predicting interactions with high probability or confidence can help researchers pick gene-miRNA pairs to consider for biological verification, reducing both time and cost by cutting down the number of biological experiments to be performed. However, not much research has been done on predicting these interactions from the expression of genes and miRNAs. Since miRNAs bind to mRNAs (whose abundance is what gene expression measures), studying the expression levels of genes (or mRNAs) and miRNAs can give insight into the potential interactions between them.

1.3 Thesis Statement

The main goal of this thesis is to address the problem of identifying functionally homogeneous groups of biomarker genes in an unsupervised way based on gene expression data. Functionally homogeneous genes can be mined from GCNs. However, present network clustering techniques have high time and space complexity. They also do not allow much user control over the algorithm, and most algorithms are generic and do not exploit the topological properties of the network. Once an efficient module-mining algorithm is developed, functionally homogeneous gene groups or modules can be mined from GCNs and used in different applications. This goal is realized in the context of identifying functionally homogeneous biomarkers for a disease called Glioblastoma Multiforme (GBM). Glioblastoma multiforme is a late-stage brain tumor (glioma) that is highly metastatic, and patients with GBM usually have a short survival time after diagnosis. Therefore, if biomarkers with predictive power on patient prognosis can be identified, insights into GBM can be obtained and better therapeutic plans can be developed. Our main research goal addresses the following issues:

• How can we develop a time- and space-efficient network clustering algorithm that can mine functionally homogeneous gene modules?

• How can we exploit the properties of biological networks to mine such dense modules?

Though the work is mainly in the context of GBM, it is also applied to other applications. This thesis proposes an algorithm that can be used to efficiently mine functionally homogeneous clusters from any GCN.

1.4 Contribution

Here I present a novel heuristic network mining algorithm called EigenCut that exploits the properties of intra-modular and inter-modular connectivity of biological modules. It exploits the fact that modules in a GCN, or in any biological network, have high intra-modular connectivity and low inter-modular connectivity. The algorithm takes a greedy approach and applies heuristics to find the cuts in the network that separate modules from the rest of the network. The algorithm outputs clusters that are dissimilar to each other, with the dissimilarity specified by a user-input threshold value. The algorithm has linear time complexity. It finds non-overlapping clusters, but it also extends easily to the overlapping case, which has higher time complexity. It also allows the user to specify candidate genes to be used as seeds when finding clusters. This gives the user the ability to mine modules related to a specific set of known genes, enabling the mining of candidate genes for other studies.

Gene signatures were identified such that the genes are functionally related to each other and such that the signatures serve as good prognostic markers. An unsupervised method is applied, in which pathway information is not used to group genes. The module mining method described above is used to mine modules that could potentially form biological pathways. Gene signatures are identified by performing survival analysis on the gene modules mined by the EigenCut algorithm from the GCN built from the expression data of genes in patients suffering from the disease under study. The EigenCut algorithm ensures that the genes in the clusters are functionally related, and the survival analysis helps identify potential biomarkers (gene signatures) among the clusters. New biomarkers for GBM were discovered that have better prognostic power than many previously discovered biomarkers [28]. These biomarkers help researchers study pathways and genes related to various key aspects of GBM and identify other pathways affected by GBM-relevant pathways.

This thesis contributes a novel network mining algorithm, well suited for mining functionally homogeneous modules of genes from GCNs, that enables unsupervised biomarker discovery from gene expression data. Finally, it also proposes new biomarkers for GBM with good prognostic predictive power.

1.5 Organization

The remainder of this manuscript is organized as follows. Chapter 2 gives an introduction to gene co-expression network (GCN) analysis. Chapter 3 presents various network mining techniques that I used to mine modules from GCNs. Chapter 4 describes a novel network mining algorithm that exploits the properties of biological networks to mine functionally homogeneous modules from GCNs. Chapter 5 describes two applications in which this algorithm is applied to obtain functionally homogeneous modules of genes and microRNAs. Finally, I conclude and outline several future directions in Chapter 6.


CHAPTER 2: GENE CO-EXPRESSION NETWORK ANALYSIS

Gene co-expression networks constructed from the gene expression data obtained by microarray experiments capture the relationships between transcripts. As already discussed, each node corresponds to a gene, or to its expression profile, and the edges encode the interaction between nodes as a pairwise similarity between the expression profiles of the nodes they connect. Thus, in un-weighted networks, an edge between two nodes exists if the similarity between their expression profiles is high; usually a threshold on the similarity is specified. In weighted networks, every pair of genes has an edge, and the weight associated with each edge is the measure of the similarity between the expression profiles. The process of building a gene co-expression network is described in the following subsections.

2.1 Co-expression Similarity

To create the co-expression network, a similarity measure between the gene expression profiles must be defined first. It measures the level of similarity between gene expression profiles across several experiments. The most commonly used similarity measure in the context of gene co-expression networks is the Pearson correlation coefficient (PCC). The Pearson correlation captures the linear relationship or dependency between two random variables. The PCC between two variables is their covariance divided by the product of their standard deviations; the covariance is the sum of the pairwise products of the corresponding mean-centered data points. The Pearson correlation coefficient is also called the Pearson product-moment correlation coefficient. The coefficient, represented by r, is given by

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

where X_i and Y_i are the i-th values of the variables X and Y, and \bar{X} and \bar{Y} are the means of X and Y. The values of r lie between -1 and 1. The sign of the coefficient indicates the direction of association: a relationship in which Y increases as X increases has a positive correlation coefficient, and a relationship in which Y decreases as X increases has a negative correlation coefficient. A value of 0 implies that there is no linear association between the variables. A correlation value of 1 or -1 implies that the variables are related by a linear equation y = ax + b; if the slope is positive the value is 1, and if it is negative the value is -1.

However, Pearson correlation is very sensitive to outliers and can only capture linear dependencies. It provides a good correlation only if the variables are linearly dependent.

The Spearman rank correlation coefficient [14] is a non-parametric statistic that captures the statistical dependence between two random variables by measuring how well the association between the variables can be fitted by a monotonic function. This method is less sensitive to outliers than the Pearson method and performs better than the Pearson method when the association between the variables is a monotonic function. The Spearman correlation coefficient ρ (rho) is given by

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

where n is the number of data points in each variable and d_i is the difference between the ranks of the corresponding data points in each observation: the n raw data points X_i, Y_i are converted into ranks x_i, y_i, and d_i = x_i - y_i. The Spearman correlation could be more useful in capturing the relationship between expression profiles, but the Pearson coefficient is most commonly used, because the purpose is to find out whether two genes co-express, which is a linear relationship y = ax + b, with b being the error or noise.

For n genes, a total of n² pairwise similarities or correlation coefficient values can be calculated, i.e., we find the correlation coefficient between each pair of genes or gene expression profiles. These values are stored in an n × n matrix called the similarity matrix or the correlation matrix. The similarity matrix may be modified to contain the absolute values of the correlation coefficients, or, to preserve the sign, the values can be increased by 1 and divided by 2 to transform them from [-1, 1] to [0, 1].
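The following minimal Python sketch illustrates this step: it computes pairwise Pearson and Spearman correlation matrices for an expression matrix and the two transformations just described. The function and variable names are illustrative assumptions, not the thesis code.

# Sketch: pairwise similarity matrices for an expression matrix X (genes x samples).
import numpy as np
from scipy.stats import spearmanr

def similarity_matrices(X):
    """X: (n_genes, n_samples) array of expression values."""
    pearson = np.corrcoef(X)                # n x n Pearson correlation matrix
    spearman, _ = spearmanr(X, axis=1)      # n x n Spearman correlation matrix
    abs_sim = np.abs(pearson)               # absolute correlation, in [0, 1]
    signed_sim = (pearson + 1.0) / 2.0      # sign-preserving rescale to [0, 1]
    return pearson, spearman, abs_sim, signed_sim

# Example with random data standing in for microarray measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))              # 100 genes, 20 samples
pearson, spearman, abs_sim, signed_sim = similarity_matrices(X)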


2.2 Building a Gene co-expression network

Un-weighted networks are represented by an adjacency matrix whose entries are 1s and 0s, a 1 representing an edge and a 0 representing no edge. For a network with n nodes, the adjacency matrix A is an n × n matrix whose rows and columns both represent the n nodes; an entry A(i,j) = 1 represents an edge between node i and node j, and a 0 represents no edge. In weighted networks, the matrix entries are the weights of the edges where they exist and 0 where there is no edge. Networks can also be represented by an adjacency list structure, where the network A is an array of n lists and A(i) is the list of all the nodes that node i is connected to. The adjacency list representation saves memory, but access time is worse than with the adjacency matrix representation.

To create a gene co-expression network, the correlation matrix is converted into an adjacency matrix by applying a transformation called an adjacency function. The choice of adjacency function depends on whether the resulting network should be a weighted network (soft threshold) or an un-weighted network (hard threshold). An adjacency function maps the correlation values stored in the correlation matrix into the interval [0, 1] for weighted networks and into {0, 1} for un-weighted networks. The most commonly used adjacency function is the signum function, which implements a hard threshold and involves a threshold parameter τ [6]. An entry a(i,j) of the adjacency matrix is given by

a(i,j) = \mathrm{signum}(s(i,j), \tau) \equiv \begin{cases} 1 & \text{if } s(i,j) \ge \tau \\ 0 & \text{if } s(i,j) < \tau \end{cases}

A hard threshold may lead to a loss of information. For example, if τ has been set to 0.7, there will be no connection between nodes with a correlation of 0.69. Soft adjacency functions, such as the sigmoid function or the power adjacency function (which raises the correlations to a power), can be used to avoid the disadvantages of a hard threshold. One potential drawback of a soft threshold is that it is not trivial to define the neighbors of a node; it only allows ranking the nodes of the network on the basis of their connection strengths (Zhang and Horvath, 2005).
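The two kinds of adjacency function can be written down directly; the sketch below shows the signum (hard-threshold) and power (soft-threshold) variants acting on a similarity matrix. The function and parameter names are illustrative.

# Sketch: hard (signum) and soft (power) adjacency functions.
import numpy as np

def signum_adjacency(sim, tau):
    """Un-weighted network: a(i,j) = 1 if s(i,j) >= tau, else 0 (no self-loops)."""
    adj = (sim >= tau).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def power_adjacency(sim, beta):
    """Weighted network: a(i,j) = |s(i,j)|**beta (soft threshold)."""
    adj = np.abs(sim) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

# In the un-weighted case, node degrees are simply row sums of the adjacency matrix:
# degrees = signum_adjacency(sim, 0.7).sum(axis=1)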

Choosing the hard threshold τ is tricky. Instead of thresholding the correlation coefficient values directly, the significance level of the correlation coefficient can be thresholded. The significance level can be found by using permutation tests to obtain p-values, which can then be thresholded. Another way to find the threshold is to fix the network size to a constant. However, instead of focusing on the significance of the correlation or the network size, we can choose the threshold by exploiting the fact that many biological networks, despite variations in their individual constituents and pathways, display approximate scale-free topology (Barabási and Albert, 1999). Therefore, we can choose threshold values that lead to networks satisfying approximate scale-free topology. A network satisfies a scale-free topology if the frequency distribution p(k) of the connectivity follows the power law p(k) ~ k^(-γ) (Barabási and Albert, 1999). To visually check whether a network satisfies approximate scale-free topology, log10(p(k)) can be plotted versus log10(k); a near-straight line indicates approximate scale-free topology. How well the network satisfies the scale-free topology can be measured by calculating the fitting index R² of a linear regression model that regresses log10(p(k)) on log10(k). If the R² of the model approaches 1, it means that log10(p(k)) and log10(k) are related by a straight-line relationship. Many co-expression networks satisfy the scale-free topology only approximately. So, we choose only those threshold values that lead to a network satisfying the scale-free topology, i.e., a threshold τ that gives rise to an R² of at least 0.80. While choosing the threshold, it should also be ensured that the mean connectivity of the network remains high, so as not to lose information, and that the slope of the regression line is close to -1 [6].
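A hedged sketch of this threshold check is given below: for a candidate τ it bins the degree distribution, regresses log10(p(k)) on log10(k), and reports the R² and slope. It assumes a simple histogram estimate of p(k) is adequate; the function name and binning are illustrative choices.

# Sketch: scale-free topology fit for a thresholded network.
import numpy as np

def scale_free_fit(adjacency, n_bins=10):
    degrees = adjacency.sum(axis=1)
    degrees = degrees[degrees > 0]
    # Estimate the frequency distribution p(k) by binning the degrees.
    counts, edges = np.histogram(degrees, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    # Linear regression of log10 p(k) on log10 k.
    slope, intercept = np.polyfit(log_k, log_p, 1)
    predicted = slope * log_k + intercept
    ss_res = np.sum((log_p - predicted) ** 2)
    ss_tot = np.sum((log_p - log_p.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return r_squared, slope

# A candidate threshold tau would be kept if r_squared >= 0.80 and slope is near -1.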

After the threshold value τ is determined, the adjacency function is applied to the gene expression correlation matrix to obtain a gene adjacency matrix. Applying the signum adjacency function to the correlation matrix results in an adjacency matrix of 1s and 0s that represents an un-weighted co-expression network. The correlation matrix and the adjacency matrix are symmetric matrices, as both the rows and the columns represent the genes in the same order. Each row or column represents the adjacency of a gene, and the degree of a node can easily be calculated by adding up all the ones in the corresponding row or column.

2.3 Mining Modules

An important aim of building co-expression networks is to detect modules of nodes that are tightly connected to each other. The definition of a gene module varies among researchers: a gene module could mean a group of genes whose expression profiles are similar, a group of genes that form a biological pathway, or a group of genes that have similar functions. One way to find modules is to use a node dissimilarity measure as the input to a clustering method such as hierarchical clustering. There are several dissimilarity measures. One such measure, based on the topological overlap measure (TOM), was used in [8] and was found to result in biologically meaningful modules. For un-weighted networks, the topological overlap measure between two nodes is essentially the normalized number of nodes commonly connected to both of them. The topological overlap matrix is a non-negative, symmetric similarity matrix containing the topological overlap measures for all gene pairs. The topological overlap measure can be converted into a dissimilarity measure by subtracting it from one: all the entries of the TOM matrix are subtracted from one to obtain a dissimilarity matrix, which can be fed as input to a hierarchical clustering algorithm to obtain gene modules (Zhang and Horvath, 2005).
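For reference, the sketch below computes a topological overlap matrix for an un-weighted network using one widely used formulation (shared neighbors normalized by the smaller degree); the exact normalization varies between papers, so treat this as an assumption rather than the thesis definition.

# Sketch: topological overlap matrix (TOM) for an un-weighted network.
import numpy as np

def topological_overlap(adjacency):
    A = adjacency.astype(float)
    shared = A @ A                        # l(i,j): neighbours shared by i and j
    degrees = A.sum(axis=1)
    min_deg = np.minimum.outer(degrees, degrees)
    tom = (shared + A) / (min_deg + 1.0 - A)
    np.fill_diagonal(tom, 1.0)
    return tom

# Dissimilarity for hierarchical clustering, as described above:
# dissimilarity = 1.0 - topological_overlap(adjacency)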

Network-based module mining techniques can be useful for mining biologically meaningful modules from gene co-expression networks. We are trying to find sub-graphs, or groups of genes, that are strongly connected to each other, as similar genes are expected to have similar expression profiles. A strongly connected group means that the expression profiles of the genes in the group are strongly correlated with each other, and thus the genes have similar functions, which is the reason they are expressed together. Several algorithms are known that can mine dense components, such as cliques [9] and K-cores [10], from networks. A clique is a completely connected subgraph of a network, i.e., each node in the subgraph is connected to every other node in that subgraph. There are also variants of the clique called quasi-cliques, a set of nodes such that the induced subgraph is γ-dense (the connectivity of each node is at least γ percent of the possible degree) [12]. Cliques are interesting components, and the modules in co-expression networks that form biologically meaningful components could potentially take the form of cliques or quasi-cliques.
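Under one common reading of the γ-dense condition (every node in the subset is adjacent to at least γ times the maximum possible number of other subset members), checking whether a vertex subset forms a quasi-clique is a small calculation; the sketch below states this assumed definition explicitly and uses an illustrative data structure.

# Sketch: gamma-quasi-clique check for a vertex subset.
def is_quasi_clique(adjacency_list, subset, gamma):
    """adjacency_list: dict mapping each vertex to a set of neighbours."""
    subset = set(subset)
    needed = gamma * (len(subset) - 1)    # assumed reading of "gamma-dense"
    return all(len(adjacency_list[v] & subset) >= needed for v in subset)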

Another interesting graph component is a K-core, which is a subgraph in which each node is connected to at least k other nodes of the subgraph. Since it is assumed that the intra-modular connections are dense while the inter-modular connections are sparse, by the properties of gene co-expression networks a K-core can be a useful tool for mining meaningful modules, provided an appropriate k value is selected. We have made use of the K-core algorithm in our work; it is dealt with in Chapter 3. There are also several other methods of mining dense modules that do not necessarily form one of the above-mentioned structures, such as the Min-Cut algorithm [11].

2.4 Comparing Modules

Once a network is built and meaningful modules are mined from it, the modules can be correlated with each other by correlating the corresponding eigengenes (Horvath et al., 2007). Highly connected modules could be merged to form more meaningful modules, as module mining techniques are not guaranteed to give perfectly correct results. The main purpose of network analysis of gene expression, however, is to relate network properties such as connectivity and modularity to external gene information. For example, Zhang and Horvath (2005) show that, in a network, the intra-modular connectivity within a particular module is highly correlated with gene essentiality, which is determined by gene knock-out experiments. Such analysis facilitates strategies for identifying therapeutic targets. Statistical methods such as regression, principal component analysis and permutation tests can be used for such network analysis.

Network modules in a single network, or across multiple networks with the same set of nodes, can be compared by correlating the module eigengenes. An eigengene is a single representative expression profile that represents an entire module. Each module contains a different number of genes, which makes it hard to compare modules using the individual gene expression profiles. So, principal component analysis (PCA) is applied, through eigen decomposition, to the expression profiles of the genes in a module to obtain a single expression profile that can represent the entire module. The first principal component obtained from the PCA is selected as the eigengene, as it accounts for the most variability in the expression profiles of the contained genes. A higher-level eigengene network can be constructed by considering each eigengene as a node, so the size of the network is on the order of the number of modules or eigengenes instead of the number of genes, which is very large. Such a network can be constructed just like a gene co-expression network, by correlating the eigengene expression profiles. This network provides information on how the dense and biologically relevant (functional) modules interact with other such modules.
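A minimal sketch of the eigengene computation and the eigengene network construction is given below, assuming modules are available as gene-by-sample expression matrices; the sign of the first principal component is arbitrary, which is why absolute correlations are used when linking eigengenes. Function names and the threshold value are illustrative.

# Sketch: module eigengenes (first principal component) and an eigengene network.
import numpy as np

def eigengene(expr):
    """expr: (n_genes, n_samples) expression matrix for one module."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    # First right-singular vector of the centered matrix = first principal
    # component in sample space, i.e. the module's representative profile.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def eigengene_network(modules, threshold=0.5):
    """modules: list of (n_genes, n_samples) arrays, one per module."""
    profiles = np.vstack([eigengene(m) for m in modules])
    corr = np.corrcoef(profiles)
    return (np.abs(corr) >= threshold).astype(int)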

Gene co-expression networks can be used to analyze the preservation of gene connections and modularity, and, at a higher level, the preservation of inter-modular connections across several networks, such as a human network versus a chimpanzee network (Horvath et al., 2007). In [13] the authors show that modularity is preserved between the human and chimpanzee networks not only at the gene level but also at the module level. Co-expression networks can also be used to compare networks built from different diseased datasets (gene expression profiles obtained from diseased cell samples), or to compare a normal and a diseased network, to look for significant changes in gene connectivity. Such changes in the network structure could give insight into how the biological pathways change in response to the environment and which genes play an important role in such changes. Such comparisons can also be used to identify prognostic markers for several diseases by identifying consistent clusters that are differentially expressed between diseased and normal patients.


CHAPTER 3: DENSE NETWORK COMPONENT DISCOVERY METHODS

Network approaches have been found useful for finding biologically meaningful modules in a co-expression network. Several researchers have applied different network mining techniques to co-expression networks, including topological-overlap-based clustering, clique mining and others. Applying these techniques results in modules that, together, contain only a small fraction of the total number of genes, because many very small modules are formed and eliminated. For example, only around 4,000 genes from a network of 12,000 genes might be retained in clusters. This is acceptable, because when analyzing co-expression networks researchers are mainly interested in strong, meaningful groups with a considerable number of genes.

3.1 K-Core Algorithm

We need to identify strongly connected components in the co-expression network that are only weakly connected to each other, i.e., components for which the intra-modular connectivity is high while the inter-modular connectivity is low. The co-expression network can be reduced by removing the less connected nodes/genes, so that the focus is put on strongly connected genes that have at least a certain degree of connectivity. One way to reduce the network is to mine K-cores from it. A K-core is a subgraph in which each node is connected to at least k other nodes of the subgraph. Finding K-cores in the network with an appropriate k value eliminates all vertices with low degrees, i.e., those genes that are not strongly enough correlated with other genes to form modules. So, by eliminating the edges corresponding to nodes with low degrees, the dense components, which are the cores of the network, are identified.

Figure 1. A graph colored according to the core membership of the vertices. The yellow vertices form a 3-core. The orange together with the yellow vertices form a 2-core. All the vertices except vertex 21 form a 1-core. All the vertices together form a 0-core

A simple O(m) algorithm to mine K-cores was developed by Batagelj et al. [10], where m is the number of edges in the network. The intuition behind the algorithm is simple: if the nodes with degrees less than k are iteratively eliminated from the graph, then the resulting subgraph is a K-core. The algorithm attempts to assign each vertex the core number to which it belongs; if a vertex v is assigned a core number of h, then the vertex is part of all K-cores with K ≤ h. A small code sketch of this peeling procedure is given after the listed steps.

Algorithm: K-Core

1. Sort all the vertices in a queue in increasing order of their degrees.

2. Remove the first element/vertex v (the lowest-degree vertex) from the queue.

3. Assign the degree of vertex v as its core number.

4. For all of v's neighbors with degrees greater than the degree of v, decrease their degrees by 1 and re-sort the queue.

5. If the queue is not empty, go to Step 2.

6. Output the K-cores based on the core numbers assigned to the vertices: if the k value is h, output all vertices that have a core number ≥ h.
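The sketch below implements this peeling idea in Python, using a min-heap in place of a re-sorted queue; the data structure (a dictionary of neighbour sets with sortable vertex labels) is an assumption made for the example.

# Sketch: core-number assignment by iteratively removing minimum-degree vertices.
import heapq

def core_numbers(adjacency_list):
    """adjacency_list: dict mapping each vertex to a set of neighbours.
    Vertex labels are assumed comparable (ints or strings) for heap tie-breaking."""
    degree = {v: len(nbrs) for v, nbrs in adjacency_list.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    removed, core = set(), {}
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != degree[v]:
            continue                      # stale heap entry, skip it
        core[v] = d                       # current degree becomes v's core number
        removed.add(v)
        for u in adjacency_list[v]:
            if u not in removed and degree[u] > d:
                degree[u] -= 1
                heapq.heappush(heap, (degree[u], u))
    return core

def k_core_vertices(adjacency_list, k):
    return {v for v, c in core_numbers(adjacency_list).items() if c >= k}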

Based on the smallest non-zero core number and the maximum core number, a k value of around 0.1 times the maximum core number is chosen. The subgraph of all k-cores is obtained by removing from the graph all nodes not returned by the K-Core algorithm. Since high intra-modular and low inter-modular connectivity is assumed, the subgraph obtained from the K-core algorithm was expected to contain disconnected components. However, for the gene co-expression networks created from various datasets, only one single component was obtained: the whole k-core itself. Finding K-core components filters out the nodes with low degrees, so components that are connected to each other by only a sparse set of edges should separate out; however, that is not what happens. The K-core algorithm fails to remove those sparse edges that bind together otherwise separate dense components. This behavior arises because the K-core algorithm mines modules by iterative elimination of vertices, not iterative elimination of edges: only edges incident to low-degree nodes are eliminated, and not edges like the green-colored edges in Figure 2.

Figure 2. A graph in which two strongly connected components are connected together by few inter-component edges (green edges) which are not eliminated by a K-core algorithm

3.2 Min-Cut Algorithm

Inter-modular connectivity is low for strongly connected modules, so connections such as those in the figure above are few in number compared to the connections within a module. A simple min-cut algorithm was developed by Mechthild Stoer and Frank Wagner in 1994 [11]. The algorithm finds the minimum cut in an undirected edge-weighted graph. Since edges like the green-colored edges in Figure 2 are very few compared to the edges within the modules, the cut of the network is formed by removing such edges. The MinCut algorithm runs |V| - 1 nearly identical phases, each called a minimum cut phase, and outputs the minimum cut. A minimum cut phase looks as follows.

Algorithm: MinimumCutPhase(G, w, a, V)
// G is the network, w are the edge weights, a is the seed vertex and V is the vertex list.

1. A ← {a}

2. Add to A the most tightly connected vertex (maximum total edge weight to A).

3. If A ≠ V, go to Step 2.

4. Store the cut-of-the-phase and shrink G by merging the two vertices added last.

The MinimumCutPhase iteratively adds vertices to the set A (which starts with an arbitrary vertex a) by selecting the vertex that is most strongly connected to the vertex set A. The phase stops when A becomes equal to V and then merges the last two vertices added to A. The edge joining the merged vertices is removed, and the edges from the merged vertices to the remaining vertices are replaced by edges from the merged vertex weighted by the sum of the weights of the replaced edges. The cut that separates the last added vertex from the rest of the graph is the cut-of-the-phase. The lowest of these cuts (based on cardinality) is the minimum cut that breaks the graph into two components. The MinCut algorithm is as follows:

Algorithm: MinCut(G, w, a)
// G is the network, w are the edge weights, a is the seed vertex and V is the vertex list.

1. Call MinimumCutPhase(G, w, a, V).

2. If the cut-of-the-phase returned in Step 1 ≤ the current minimum cut, store the cut-of-the-phase as the minimum cut.

3. If |V| > 1, go to Step 1.

The MinCut function is called recursively on the subgraphs it separates, until the subgraphs fall below a maximum size specified by the user. A brief sketch of this recursive bisection is given below.
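A compact way to realize this recursion is sketched below using the Stoer-Wagner implementation shipped with networkx; connected components are handled separately because that routine assumes a connected graph, and every edge is given unit weight to match the un-weighted co-expression network. The function name and the 5% size threshold are illustrative.

# Sketch: recursive bisection of a graph with the Stoer-Wagner minimum cut.
import networkx as nx

def recursive_min_cut(G, max_size):
    """Return clusters (sets of nodes) no larger than max_size (max_size >= 1)."""
    clusters = []
    for comp in nx.connected_components(G):
        sub = G.subgraph(comp).copy()
        if sub.number_of_nodes() <= max(max_size, 1):
            clusters.append(set(sub.nodes()))
            continue
        _, (side_a, side_b) = nx.stoer_wagner(sub)     # global minimum cut
        clusters.extend(recursive_min_cut(sub.subgraph(side_a).copy(), max_size))
        clusters.extend(recursive_min_cut(sub.subgraph(side_b).copy(), max_size))
    return clusters

# Usage on an un-weighted co-expression graph (unit edge weights):
# G = nx.Graph(); G.add_edges_from(edge_list)
# nx.set_edge_attributes(G, 1, "weight")
# clusters = recursive_min_cut(G, max_size=int(0.05 * G.number_of_nodes()))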

The K-core algorithm was deployed on a gene co-expression network obtained from The Cancer Genome Atlas (TCGA) website, containing microarray expression data for genes and microRNAs from samples taken from patients suffering from Glioblastoma Multiforme (brain cancer). Contrary to what was expected, only a single K-core, i.e., only one module, was obtained for any value of k. So the Min-Cut algorithm discussed above was applied to remove any edges that bind together otherwise separable, highly connected modules. The min-cut algorithm was applied on the graph recursively, cutting the graph in two at each recursion until the graph size (number of nodes) fell below a certain threshold, say 5% of the total nodes. Since the co-expression network is an un-weighted graph, all the edge weights equal 1. However, the algorithm separated out only one node from the input graph at each recursion, thus repeatedly forming modules of size 1. Min-cut essentially obtains the minimum cut by removing the vertices with the lowest degrees, which sit on the surface of the graph: if the nodes are arranged in a spherical shape with the high-degree nodes at the core and the low-degree nodes on the periphery, each cut continuously exposes more nodes to the periphery.

3.3 Prune-Cut Algorithm – Modification to Min-Cut algorithm

To overcome the drawback of separating a single node at each step, the Min-Cut algorithm was modified to select the minimum cut as the minimum of only those cuts-of-the-phase that separate modules whose sizes are both above a threshold.

The PruneCut algorithm succeeded in separating the highly connected modules, i.e., the nodes/genes with similar expression profiles, but with a few misclassifications. The heatmaps of the modules showed that the PruneCut algorithm grouped together genes with similar expression profiles, with a few mis-groupings. The Gene Ontology enrichment analysis gave good results on some groups and not-so-good results on others; the p-values of many enriched terms were below 0.01.

Algorithm: PruneCut(G, w, a)
// G is the input graph, w are the edge weights, a is the seed vertex.

1. Call MinimumCutPhase(G, w, a).

2. If the cut-of-the-phase returned in Step 1 ≤ the current minimum cut, and size(module 1) ≥ threshold and size(module 2) ≥ threshold, store the cut-of-the-phase as the minimum cut.

3. If |V| > 1, go to Step 1.

Figure 3. Heatmap of expression values of a cluster returned by PruneCut

Figure 3 shows the heatmap of expression values of one of the clusters obtained from the PruneCut algorithm. It can be seen that the cluster could be further divided into five distinct clusters based on the expression values.

The K-core algorithm has a complexity of O(|E|), and the PruneCut algorithm runs in O(C(|V||E| + |V|² log |V|)), where C is the number of clusters returned by the algorithm, |V| is the number of nodes and |E| is the number of edges in the co-expression network. The K-core algorithm and the PruneCut algorithm applied together resulted in meaningful modules, but there were some mis-groupings and the GO enrichment results were not very good. Also, the PruneCut algorithm needs a threshold on the size of the modules as an input, which hinders proper grouping of the genes, because size is a poor parameter for controlling the modules. A good clustering algorithm should be able to cluster the genes without any threshold on the cluster size, as the ideal cluster sizes are arbitrary and cannot be guessed. The time complexity of the PruneCut algorithm is also high, especially for networks as large as gene co-expression networks with around 10,000-50,000 nodes. Moreover, the PruneCut and MinCut algorithms do not enable module separation based on inter-modular connectivity, i.e., there is no control over the similarity between the modules.

An additional filtering algorithm was applied to the graph. For each edge in the graph, the intersection of the neighbor sets of its end vertices is calculated. If the intersection is very small compared to the number of neighbors of the end vertices, then the edge is removed. Any edges between the neighbors of one end vertex and the other end vertex are also removed, so as to form a graph cut. This process targets cases where strongly connected components are held together by one or a few edges, as in Figure 2; it was hoped that such modules would be disconnected through this filtering (a small sketch of the filter is given below). However, no improvement was observed. The PruneCut algorithm was also deployed on the whole network without first finding the K-core, and the results were similar; it also produced a large number of clusters, many of which might not be of interest.
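The sketch below shows the shared-neighbour edge filter just described; the overlap ratio and the cutoff value are illustrative choices, not the thesis parameters.

# Sketch: drop edges whose endpoints share only a small fraction of neighbours.
def filter_weak_edges(adjacency_list, min_overlap=0.1):
    """adjacency_list: dict mapping each vertex to a set of neighbours (modified in place)."""
    to_remove = []
    for u, nbrs_u in adjacency_list.items():
        for v in nbrs_u:
            if u < v:                                   # visit each undirected edge once
                shared = len(nbrs_u & adjacency_list[v])
                smaller = min(len(nbrs_u), len(adjacency_list[v]))
                if smaller > 0 and shared / smaller < min_overlap:
                    to_remove.append((u, v))
    for u, v in to_remove:
        adjacency_list[u].discard(v)
        adjacency_list[v].discard(u)
    return adjacency_list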


CHAPTER 4: EIGEN CUT ALGORITHM: A NEW NETWORK APPROACH

4.1 The Algorithm

The PruneCut algorithm requires, as an input, a threshold on the size of the clusters. This constraint makes the clustering inflexible, as the size of a cluster cannot be known in advance, and it also leads to some mis-groupings. If the threshold is high, clusters that should ideally be divided into two distinct clusters may be wrongly grouped into a single cluster by PruneCut. A good clustering algorithm should place no constraints on the cluster size and should be able to exploit the properties of the graph and the nodes to obtain optimal clusters.

An algorithm called EigenCut, similar to the min-cut algorithm, was developed that exploits the properties of intra-modular and inter-modular connectivity of biological modules. The algorithm exploits the fact that modules in a biological network have high intra-modular connectivity (in terms of number of neighbors) but low inter-modular connectivity, and also that nodes in a module are highly correlated with each other and less correlated with other modules or with nodes in other modules. It takes a greedy approach with some heuristics. The algorithm clusters the genes based on their expression similarity, assigning a gene to a cluster only if its similarity to the cluster is above a certain threshold. The algorithm takes as input a threshold on the inter-modular connectivity. It finds clusters such that they are similar to each other by at most that threshold value, while the nodes within each cluster are similar to each other by at least that threshold.

The EigenCut algorithm starts with an arbitrary node and collapses neighboring nodes with high similarity into it until the similarity of the node with the remaining nodes in the graph falls below a user-defined threshold. The collapsed set of nodes forms a cluster. The similarity of a cluster with the rest of the nodes is the correlation of the representative expression profile of the cluster with the expression profiles of those nodes. The representative expression profile of a cluster is calculated by performing principal component analysis (PCA) over the expression profiles of the contained genes. The PCA uses eigen decomposition, and so the representative expression profile of a cluster is called its eigengene (an expression profile representing all the genes in a particular group). Once the similarity of a cluster with the rest of the genes falls below the threshold, the cluster is isolated from the graph and another iteration starts with another arbitrary node in the remaining graph. This is repeated until there are no more nodes left in the graph.

An un-weighted gene co-expression network is built as described in Chapter 2. However, the correlation values between the genes, i.e. the edge weights of the network, are retained. The network and the weights are stored as two matrices, the adjacency matrix and the weight matrix respectively, whose indices correspond to the vertex indices. The algorithm iteratively merges nodes into clusters and removes them from the network, repeating the process on the remaining graph until there are no more nodes to merge.

Each iteration starts with an empty set A of vertices. A seed vertex v, chosen as the next available vertex ordered by index (the next-available-higher-index seed-selection approach), is added to A. Alternatively, the choice of seed can be made explicitly by the user by supplying the vertex list in the required selection order, or the user can let the algorithm select the next available vertex with the highest degree (a hub gene) as the seed (the next-available-hub seed-selection approach).

Nodes are added to the vertex set A in steps until the similarity of the representative expression profile of A with the remaining nodes falls below a threshold. In each step, the vertex outside A that is most strongly connected to v, with an edge weight above the user-defined inter-cluster correlation threshold thresh, is added to A, and the representative expression profile (Eigen-gene) for A is recalculated through PCA (selecting the first principal component, which accounts for the most variability in the cluster). Adding a node to A means that the node is collapsed into the node v. Any edges from the collapsed node to the remaining nodes are replaced by edges from v, weighted by the correlations of the Eigen-gene of A (the expression profile of the new version of vertex v) with the expression profiles of those nodes, and the edge connecting v with the collapsed node is deleted. Once the similarity (i.e. correlation, or edge weight) of v to all remaining nodes falls below the threshold thresh, the set A is added to the Clusters set, which stores all the clusters, and the nodes in A are removed from the network. The next iteration starts with a new empty set A and follows the same process. The iterations continue until there are no nodes left in the graph. The EigenCut algorithm is given below: graph is the adjacency matrix representation of the co-expression network, corrs is the pairwise correlation (weight) matrix, expval is a matrix of the expression values of the nodes/genes versus samples, V is the set of nodes (or node indices), and thresh is the threshold on the inter-modular connectivity.

Algorithm: EigenCut(graph, corrs, expval , V, thresh)

1. Clusters = {};

2. v = V(1); A = {v};

3. u = neighbor(v) with maximum edge weight

4. If corrs(u,v) ≥ thresh then add u to A else go to step 8

5. Merge u into v: connect all neighbors of u to v and delete u.

6. Calculate Eigengene for v

7. Recalculate correlation values or edge weights from v to all its neighbors.

8. If the maximum edge weight from v to its remaining neighbors ≥ thresh, go to step 3

9. Clusters = Union(Clusters , A) and V = V – A;

10. If V is not empty, go to step 2 else return Clusters;

The algorithm returns the set of all clusters found in the network. The clusters can then be filtered to retain only those of significant size, i.e. with a significant number of nodes.
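For concreteness, a minimal Python sketch of the single-merge loop described by the pseudocode is given below. It assumes numpy and re-uses an eigen-gene helper like the one sketched earlier; all names are illustrative and the thesis implementation may differ in detail:

    import numpy as np

    def _eigengene(block):
        # first principal component over samples of a genes x samples block
        # (note: the sign of the first PC is arbitrary; a full implementation
        # could align it with the seed's profile)
        centered = block - block.mean(axis=1, keepdims=True)
        return np.linalg.svd(centered, full_matrices=False)[2][0]

    def eigencut(adj, corrs, expval, thresh):
        # adj: boolean adjacency matrix; corrs: pairwise correlation (weight)
        # matrix; expval: genes x samples expression matrix; thresh: the
        # inter-modular connectivity threshold. Returns a list of clusters,
        # each a list of original gene indices.
        adj = adj.astype(bool).copy()
        corrs = corrs.astype(float).copy()
        remaining = list(range(adj.shape[0]))
        clusters = []
        while remaining:
            v = remaining[0]                    # next-available-index seed
            cluster = [v]
            while True:
                nbrs = [u for u in remaining
                        if u not in cluster and adj[v, u]]
                if not nbrs:
                    break
                u = max(nbrs, key=lambda x: corrs[v, x])
                if corrs[v, u] < thresh:        # nothing left above the threshold
                    break
                cluster.append(u)               # collapse u into v
                adj[v] |= adj[u]                # v inherits u's neighbours
                profile = _eigengene(expval[cluster])
                for w in nbrs:                  # update edge weights from v
                    if w not in cluster:
                        corrs[v, w] = np.corrcoef(profile, expval[w])[0, 1]
            clusters.append(cluster)
            remaining = [x for x in remaining if x not in cluster]
        return clusters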

Because the algorithm groups nodes so that the inter-modular strength, or similarity, stays below the specified threshold, changing the input threshold lets the algorithm return clusters at different levels of inter-modular and intra-modular strength. The inter-modular and intra-modular strength can also be varied by changing the initial threshold used to create the co-expression network. In the simple form of the algorithm described above, the two thresholds are equal or differ only slightly.

The algorithm described above takes the next-available-higher-index seed-selection approach. However, it can be modified to select the next hub node as the seed at each iteration, i.e. to select the next available node with the highest degree (the next-available-hub seed-selection approach) instead of merely selecting the node with the next lowest or highest index. Since biological networks are believed to be scale-free, with a few hub (high-degree) genes to which the remaining genes are connected, such hub-centered modules can be identified by always selecting a hub gene to start a clustering iteration. When the algorithm was deployed on a gene co-expression network built from the TCGA dataset discussed in the next chapter, it was observed that a random selection of the seed node at each iteration has little influence on the clustering; the algorithm returned almost the same clusters, an indication that it is largely independent of seed selection. Further analysis of this seed-selection independence is given in the next section, where the clustering results are compared using random node selection. However, the next-available-hub seed-selection approach resulted in clusters that were different from both the next-index and the random selection approaches, and these clusters were better in terms of the gene ontology enrichment results. The enrichment results suggested that the next-hub approach gave clusters whose genes were strongly related to each other in terms of associated functions. The results are discussed in the next chapter.

The algorithm can be made faster by performing a multi-merge at each step of an iteration. At each merge stage, instead of just merging the node with the highest correlation with the vertex v, the algorithm can choose to merge all the nodes with correlation above a certain threshold. All the neighbors of the merged nodes are now connected to the current seed node and the correlations are updated accordingly.

Everything else in the algorithm remains the same. This approach greatly reduces the running time, but it might not give results identical to the single-node merge approach.

Moreover, the results from the multi-merge may not always be correct. In the single-merge approach, since only the most highly correlated node is merged with the seed vertex v at each step, the recalculated representative expression vector of the new v (the Eigen expression vector) usually differs little from the expression vector of the old v, so the edge weights do not change rapidly. In the multi-node merge approach, however, all the nodes above the threshold are merged together in a single step regardless of their correlation values, and the Eigen expression vector calculated for v may differ substantially from its previous expression vector. This variation may cause many of the updated edge weights to fall below the threshold, excluding nodes that should have been included. For example, suppose the seed node v is correlated with several nodes, with correlations ranging from 0.5 to 0.95, and the user-specified threshold is 0.75. Let u be the node with the highest correlation to the seed, 0.95, and let z be a node correlated with u by 0.85; the vertex z is therefore also highly similar to node v. In the single-node merge approach, node u is merged into node v and the recalculated correlation between node v and node z is, say, 0.82; in the next step node z is merged into node v, giving a cluster containing v, u and z. In the multi-node merge approach, all the nodes correlated with v above the threshold of 0.75 are merged at once and the correlations with the neighbors of the merged nodes (which include node z) are recalculated. The vertex z might now be correlated with v by only 0.73, because among the merged nodes there may be many with correlation values closer to 0.75 than to 0.95, so the Eigen expression vector is correlated with the seed expression vector by only, say, around 0.8. This prevents node z from being merged into the new node v, even though it should have been. When time complexity is not an issue, the single-node merge approach performs well; but if fast processing is needed even at the cost of some accuracy, the multi-merge approach can be applied. The multi-merge version can also be modified so as to merge not all the nodes above the threshold but only some of them, using a heuristic: only nodes whose correlation exceeds the value midway between the threshold and the highest possible correlation (which is 1) are merged in a single step, and all other nodes are merged one at a time. Such a heuristic makes the algorithm faster without compromising too much on correctness.
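A sketch of the merge-candidate selection for this heuristic (a hypothetical helper, not the thesis code): nodes above the midpoint between the threshold and 1 are collapsed in one batch, the rest one at a time.

    def merge_batch(corrs_v, thresh):
        # corrs_v: dict mapping each neighbour of the seed v to its current
        # correlation with v. Returns the nodes to collapse in this step.
        midpoint = (thresh + 1.0) / 2.0
        batch = [u for u, c in corrs_v.items() if c >= midpoint]
        if batch:                               # multi-merge the strongest neighbours
            return batch
        best = max(corrs_v, key=corrs_v.get, default=None)
        if best is not None and corrs_v[best] >= thresh:
            return [best]                       # fall back to a single merge
        return []                               # nothing above the threshold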

The EigenCut algorithm can easily be modified to produce overlapping clusters. The basic algorithm does not allow overlaps because, at the end of each iteration, i.e. once a cluster has been found, all the vertices in that cluster are removed from the network and from the vertex list V; the graph and weight matrices are modified during each iteration so that already-clustered nodes are not considered in the following iterations. To find overlapping clusters, it is enough to consider the whole graph during each iteration while requiring only that the seed vertex not belong to any previously found cluster. In the non-overlapping case the graph processed at each iteration is the result of the previous iteration, but in the overlapping case the graph from the previous iteration is discarded and the original graph is reloaded, so the already-clustered vertices remain candidates for merging into the current seed.

This enables vertices to belong to more than one cluster. However, the overlapping version takes longer than the non-overlapping version: because nodes can be revisited, the number of nodes to be processed and merged at each iteration is higher, whereas in the non-overlapping version the graph shrinks at every iteration.

This algorithm is also useful when a set of seed genes is known and the aim is to grow the gene set. For instance, if one or a few genes associated with a disease are known, the EigenCut algorithm can be run with these genes as seeds to obtain the cluster of genes related to them, enabling researchers to find more genes that are potentially related to the disease. Such clusters can form the candidate gene sets for further research on the disease.

40

4.2 Performance

The algorithm runs in O(dN) time, where d is the maximum degree in the network and N is the number of nodes. There are at most N−1 collapses of nodes, and each collapse requires the calculation of the eigen expression profile and the update of at most d edges. The calculation of the correlations for d edges takes O(d) time. Therefore the total time complexity of the algorithm is O(dN). However, the time complexity of the overlapping case increases to as much as O(dN²).

The algorithm scales well with size and is almost independent of seed selection: it returns nearly the same set of clusters whatever criterion is used to select the seed vertex at each iteration, unless the seed-selection method is the next-available-hub approach. To test computational scalability and stability to seed selection, tests were performed by randomizing the graph sizes and the seed selection over several repetitions. From a network of around 12,897 genes (the TCGA data discussed in the next chapter), sets of 12000, 10000, 8000, 6000, 5000, 3000, 1000 and 500 nodes were randomly selected and the subgraphs induced by those nodes were obtained. The algorithm was deployed on each of these networks 100 times and the time taken on each run was noted. The mean running time in seconds for each size was plotted against the product of the size of the graph (the number of nodes) and the maximum degree. The plot was nearly a straight line, and the regression coefficient from regressing the running time on the product d•N was greater than 0.99. This provides evidence that the EigenCut algorithm is time scalable.

41

# nodes (N)   Maximum degree (d)   d•N         Running time (secs)   # clusters
12000         629                  7548000     488                   30
10000         520                  5200000     320                   23
8000          415                  3320000     200                   19
6000          336                  2016000     124                   17
5000          233                  1165000     71                    11
3000          161                  483000      27                    5
1000          79                   79000       5                     3
500           48                   24000       2                     1

Table 1. Running times and number of clusters for different graph sizes (no. of nodes)

Figure 4. Plot of time taken vs. product of size and maximum degree of the network (d•N). The regression coefficient of the linear fit is greater than 0.99
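The timing experiment can be reproduced along the following lines (a sketch assuming numpy/scipy and an eigencut implementation like the one sketched in Chapter 4; the sizes and repetition counts follow the text):

    import time
    import numpy as np
    from scipy import stats

    def timing_curve(adj, corrs, expval, thresh,
                     sizes=(500, 1000, 3000, 5000, 6000, 8000, 10000, 12000),
                     reps=100):
        # Returns the R^2 of regressing running time on d*N over all runs.
        rng = np.random.default_rng(0)
        xs, ys = [], []
        for n in sizes:
            for _ in range(reps):
                idx = rng.choice(adj.shape[0], size=n, replace=False)
                sub_adj = adj[np.ix_(idx, idx)]
                sub_corrs = corrs[np.ix_(idx, idx)]
                d = int(sub_adj.sum(axis=1).max())   # max degree of this subgraph
                t0 = time.perf_counter()
                eigencut(sub_adj, sub_corrs, expval[idx], thresh)
                xs.append(d * n)
                ys.append(time.perf_counter() - t0)
        fit = stats.linregress(xs, ys)
        return fit.rvalue ** 2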

42

To test the stability of our algorithm 200 independent random experiments were performed and the mean overlap of the clusters with a reference cluster was calculated for each experiment. The reference clusters were the results of running the algorithm by selecting the seed at each iteration to be the next available vertex sorted by their indices.

Out of the 200 experiments, in 100 of them the seed vertices were selected randomly from the reference clusters, and in the other 100 the seed vertices were selected randomly from vertices not in the reference clusters. In each experiment, random seed selection without replacement was performed; in the former 100 experiments, if no vertices from the reference clusters remained, a vertex was picked randomly from the remaining available vertices. Each experiment yields a set of clusters (say m clusters). The mean overlap of these clusters with the reference cluster set (say n clusters) was calculated by building a pair-wise m × n overlap matrix between the two sets and taking the mean of the m (or n, whichever is smaller) maximum values in the matrix. This gave the mean cluster overlap for each experiment. The mean overlaps were observed to be nearly 97%, which implies that the algorithm is very stable in the sense that seed selection has little influence on the final results. However, when clusters were formed using the next-hub approach, the individual cluster overlaps ranged from 10% to 100%, with a mean overlap of around 70%, so the algorithm is not completely independent of seed selection in that case. The time taken for each experiment was observed to be around 450 seconds on average. The difference between the clusterings obtained with random seed selection and with the next-available-hub seed-selection approach may arise because the next-hub approach exploits the inherent properties of a biological network.
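One way to compute the mean overlap score used in these stability experiments is sketched below. The thesis does not spell out the overlap measure itself, so the intersection-over-union used here is an assumption; only the "mean of the row/column maxima of the m × n pair-wise overlap matrix" step follows the text.

    import numpy as np

    def mean_cluster_overlap(clusters_a, clusters_b):
        # clusters_a, clusters_b: lists of collections of gene indices
        a = [set(c) for c in clusters_a]
        b = [set(c) for c in clusters_b]
        overlap = np.zeros((len(a), len(b)))
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                overlap[i, j] = len(ca & cb) / len(ca | cb)   # assumed measure
        # mean of the best matches, taken on the smaller side
        best = overlap.max(axis=1) if len(a) <= len(b) else overlap.max(axis=0)
        return float(best.mean())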

These results provide evidence that the EigenCut algorithm is time scalable and is independent of seed selection, except in the case of the next-hub seed-selection approach.

The question now is whether the EigenCut algorithm returns modules that are biologically meaningful. The clusters formed by applying the EigenCut algorithm on our datasets were evaluated through Gene Ontology enrichment analysis (discussed in Chapter 1).

The analysis returned terms with high enrichment, i.e. very low p-values, for a majority of the clusters, showing that the genes within each cluster have significant similarity in their ontology, or functions. A Gene Ontology enrichment analysis tool that we developed was used for the analysis, and the results were compared with an online enrichment analysis tool called ToppGene (www.toppgene.com). It was observed that the next-available-hub seed-selection approach resulted in much more meaningful clusters than the next-available-higher-index approach in terms of the functional homogeneity of the clusters as evaluated by the enrichment analysis. The results are discussed in more detail in the next chapter.

The TOM-based hierarchical clustering algorithm [8] discussed in Chapter 1 was implemented for comparison with our results. A TOM matrix, which contains the topological similarity of the nodes in terms of their neighborhoods, is built; it is subtracted from 1 to obtain a dissimilarity matrix, which is then input to an average-linkage hierarchical clustering algorithm to obtain clusters. It was observed that this method could not cluster the network properly, producing many very small and single-node clusters and one large cluster. It was also observed that the large majority of TOM values were below 0.1.
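For reference, the TOM-based comparison can be sketched as follows, using the standard unweighted topological overlap formula from [8] and scipy's average-linkage clustering (the cut height is an illustrative parameter):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def tom_clusters(adj, cut_height=0.9):
        a = adj.astype(float)
        np.fill_diagonal(a, 0.0)
        shared = a @ a                               # number of shared neighbours
        k = a.sum(axis=1)                            # node degrees
        tom = (shared + a) / (np.minimum.outer(k, k) + 1.0 - a)
        np.fill_diagonal(tom, 1.0)
        dissim = 1.0 - tom                           # dissimilarity = 1 - TOM
        tree = linkage(squareform(dissim, checks=False), method="average")
        return fcluster(tree, t=cut_height, criterion="distance")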

45

CHAPTER 5: APPLICATIONS

5.1 Application 1: TCGA data on Glioblastoma Multiforme

The Cancer Genome Atlas (TCGA) data portal is a database that lets users search, download and analyze data generated by TCGA. It is maintained by the National Cancer Institute TCGA project. The goal of TCGA is to collect genotype (e.g., somatic mutations, SNPs, gene copy number), phenotype (e.g., gene and microRNA expression), and clinical (e.g., treatment, survival, gender, age) information from patients of more than 20 different types of cancer, with hundreds of patients for each type. Glioblastoma Multiforme (GBM) is the first type of cancer to be catalogued in TCGA, and currently TCGA holds data from more than 200 GBM patients. The gene expression was measured using whole-genome microarrays (the expression values for the whole genome are measured in a single experiment). Glioblastoma multiforme is a late-stage brain tumor (glioma) that is highly metastatic, and patients with GBM usually have a short survival time after diagnosis. Therefore, if functional groups of genes with predictive power on patient prognosis can be identified, insights into the metastatic mechanisms in GBM can be obtained and better therapeutical plans can be developed.

The gene expression and microRNA expression data on GBM patients were obtained from the TCGA data portal. Co-expression networks were built from the gene data and the microRNA data, and clusters from both networks were mined using the EigenCut algorithm. Interactions between genes and microRNAs were predicted by calculating the similarity between the gene clusters and the microRNA clusters on the basis of expression similarity. Survival analysis was performed on each gene cluster by plotting Kaplan-Meier curves and performing a log-rank test to identify gene signatures associated with GBM.

5.1.1 Gene – miRNA Interaction Prediction

The gene expression data was available as a data table of gene probes vs. patients with a size of 22215 × 340. The microRNA expression data was also similar with a size of 1510

× 250. Each gene or microRNA can have more than one probe on the microarray which means that the data table has multiple readings for some genes. The expression data was reduced into a table of genes/microRNAs vs. patients from the probes vs. patients table.

For genes with multiple probes on the microarray, the expression value can be chosen as the maximum or the mean of the expression values at the associated probes; I chose to take the mean. The same was done for the microRNA data, yielding 12897 genes and 534 microRNAs. The tables were further reduced to contain values only for the patients common to both the gene expression table and the microRNA table, so that gene clusters and microRNA clusters can be compared; there were 242 common patients. Co-expression networks were built for both the genes and the microRNAs as described in Chapter 2. Spearman correlation coefficients were chosen instead of Pearson correlation coefficients because the former model monotonic relationships better than the latter. A threshold of 0.5 was applied for both networks. The EigenCut algorithm was applied with an inter-cluster threshold of 0.6 on both networks to obtain clusters. The gene clusters were filtered to include only those with at least 20 genes, giving 31 clusters totaling 3004 genes; 85 microRNA clusters were obtained after filtering to include only those with at least 2 microRNAs.
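The preprocessing and network-construction steps above can be sketched as follows (pandas/scipy assumed; the inputs probe_expr and probe_to_gene are placeholders for the TCGA tables, and applying the correlation threshold to absolute values is an assumption):

    import numpy as np
    import pandas as pd
    from scipy.stats import spearmanr

    def build_coexpression_network(probe_expr, probe_to_gene, corr_thresh=0.5):
        # probe_expr: DataFrame of probes x patients; probe_to_gene: Series
        # mapping each probe id to a gene symbol (aligned on the probe index).
        gene_expr = probe_expr.groupby(probe_to_gene).mean()   # mean over probes
        rho, _ = spearmanr(gene_expr.T.values)   # gene x gene Spearman matrix
        np.fill_diagonal(rho, 0.0)
        adj = np.abs(rho) >= corr_thresh         # un-weighted co-expression network
        return gene_expr, rho, adj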

The clusters obtained by running the EigenCut algorithm on the TCGA gene data were evaluated for functional homogeneity by performing a Gene Ontology enrichment analysis. The enrichment analysis returned terms with high enrichment, i.e. very low p-values, for a majority of the clusters (Tables 2, 3 and 4), showing that the genes within these clusters have significant homogeneity in their ontology, or functions. Conversely, poor enrichment results help us identify bad clusters. It was also observed that the next-hub approach for seed selection in the EigenCut algorithm gave good ontology enrichment results, so this approach was chosen here for clustering the co-expression networks.

ID Name P-value Term in Query Term in Genome

1 GO:0019864 IgG binding 2.543E-5 6 10

2 GO:0019865 immunoglobulin binding 1.303E-3 6 17

3 GO:0030246 carbohydrate binding 1.510E-3 27 414

4 GO:0042802 identical protein binding 1.594E-3 40 765

5 GO:0004896 cytokine receptor activity 8.246E-3 9 60

6 GO:0019899 enzyme binding 1.486E-2 34 664

7 GO:0019955 cytokine binding 1.607E-2 12 117

Table 2. Gene Enrichment results for Cluster 1 (GO: Molecular Function)

48

ID Name P-value Term in Query Term in Genome

1 GO:0002376 immune system process 1.982E-37 119 1258

2 GO:0006955 immune response 1.976E-35 95 832

3 GO:0006952 defense response 6.475E-23 75 760

4 GO:0002684 positive regulation of immune system process 1.780E-21 47 303

5 GO:0002682 regulation of immune system process 8.594E-21 58 493

6 GO:0009605 response to external stimulus 3.538E-18 84 1114

7 GO:0050776 regulation of immune response 1.298E-17 41 277

8 GO:0009611 response to wounding 2.580E-17 62 658

9 GO:0048583 regulation of response to stimulus 2.061E-16 56 564

10 GO:0050778 positive regulation of immune response 4.952E-16 33 188

Table 3. Gene Enrichment results for cluster 1 (GO: Biological Process)

ID Name P-value Term in Query Term in Genome

1 GO:0042611 MHC protein complex 6.815E-10 13 39

2 GO:0005764 lysosome 2.066E-7 24 228

3 GO:0000323 lytic vacuole 2.066E-7 24 228

4 GO:0005773 Vacuole 3.778E-7 26 274

5 GO:0009897 external side of plasma membrane 1.850E-6 20 178

6 GO:0042613 MHC class II protein complex 6.179E-6 7 15

7 GO:0009986 cell surface 7.601E-6 30 407

8 GO:0044421 extracellular region part 1.120E-5 54 1060

9 GO:0031226 intrinsic to plasma membrane 1.587E-5 62 1313

10 GO:0005887 integral to plasma membrane 4.209E-5 60 1287

Table 4. Gene Enrichment results for cluster 1 (GO: Cellular Component)

49

The interactions between genes and microRNAs are predicted from their expression profiles by finding the similarities (correlations) between the expression profiles of the gene clusters and the microRNA clusters, under the assumption that functionally related genes and microRNAs have similar expression profiles. Each cluster contains a different number of nodes, so the expression profiles of the clusters cannot be compared directly. Thus, a representative expression profile that can represent the expression profiles of all the nodes (genes/microRNAs) in a particular cluster is needed. This is achieved with PCA, which involves eigenvalue decomposition of a data covariance matrix. Applied to n-dimensional data, PCA finds an orthogonal coordinate system of n coordinates, where the first coordinate (principal component) represents the most variability/variance in the data, the second coordinate the next most variability, and so on. The first principal component, which accounts for the most variance in the expression profiles of the nodes in the cluster, is chosen here. For each cluster a PCA is performed to obtain the respective first principal component, called an Eigen-gene for the gene clusters and an Eigen-microRNA for the microRNA clusters. We then compute the pair-wise correlations between the Eigen-genes and Eigen-microRNAs to obtain a correlation matrix. The matrix is thresholded so as to have 1s for values above the threshold and 0s for values below it. A bipartite graph is obtained with the Eigen-genes on the left side and the Eigen-microRNAs on the right side; the edges between these left and right nodes represent potential interactions between the gene and microRNA modules.
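A sketch of this module-interaction step (numpy assumed; taking the absolute correlation before thresholding is an assumption, since the sign of an eigen-profile is arbitrary):

    import numpy as np

    def eigen_bipartite_edges(gene_eigengenes, mirna_eigengenes, thresh=0.5):
        # gene_eigengenes:  array with one row per gene cluster (values over patients)
        # mirna_eigengenes: array with one row per microRNA cluster (same patients)
        n_g = gene_eigengenes.shape[0]
        full = np.corrcoef(np.vstack([gene_eigengenes, mirna_eigengenes]))
        cross = full[:n_g, n_g:]                 # gene-cluster vs microRNA-cluster block
        keep = np.abs(cross) >= thresh           # thresholding step from the text
        return list(zip(*np.nonzero(keep)))      # (gene cluster, miRNA cluster) edges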

50

In our experiment, 31 gene clusters (and hence 31 Eigen-genes) and 85 microRNA clusters (and hence 85 Eigen-microRNAs) were obtained. The pairwise correlation matrix was calculated and thresholded at a value of 0.5. A total of 41 edges were obtained, and many Eigen-genes were not connected to any Eigen-microRNA.

Figure 5. Eigen network obtained by a threshold of 0.5

The results were validated using an online tool called ToppGene, which predicts interactions between genes and microRNAs through sequence analysis. However, it was observed that our results contradicted, or had very little similarity to, the results from ToppGene. This could be because ToppGene predicts interactions based on sequence structure, many of which are not biologically proven, and also because sequence structure and expression values may not be correlated.

5.1.2 Gene Signature Discovery

A gene signature, or biomarker, is a group of genes in a cell whose combined expression pattern [24] is associated with a particular disease, i.e. the expression pattern is a unique characteristic of the disease [25]. Survival analysis was performed on the gene clusters obtained from the gene co-expression network to identify gene signatures associated with Glioblastoma multiforme. Survival analysis is a branch of statistics that deals with death in biological organisms (humans in our case) and involves modeling of 'time to death' data [26]. A survival function is estimated from the lifetime data of patients through the Kaplan-Meier estimator, which can be used to measure the fraction of patients living for a certain amount of time after treatment. The Kaplan-Meier estimate of the survival function is plotted for different groups of patients to compare survival under different conditions, and it can be used to identify gene signatures associated with a particular disease.

5.1.2.1 Survival Analysis

Kaplan-Meier curves can be plotted for different gene signatures and compared to identify signatures associated with a disease. A low survival time for patients with a particular gene signature indicates that the signature is associated with the disease under study, and a considerable difference between Kaplan-Meier curves plotted for different expression patterns of a group of genes reveals an associated gene signature. For example, suppose patients suffering from a particular disease are sorted into two groups based on the expression of a group of genes, and Kaplan-Meier curves are plotted for each of the two groups. If the two curves are considerably different, i.e. the area between the curves is large, the survival times of the two groups differ enough that we could say the genes form a gene signature, and that their expression pattern in the patient group with low survival is associated with early death in patients suffering from the disease. A sample Kaplan-Meier survival curve is shown in Figure 6.

Figure 6. Kaplan-Meier curves for two types of glioma. Figure adapted from [22]

53

If a new patient has an expression pattern over the genes in the gene signature that is similar to the expression pattern of the low-survival-time patient group studied above, then there is a high chance of low survival time for that patient. If the expression pattern is similar to that of the high-survival-time patient group, then there is a fair chance that the patient will survive longer.

The Kaplan-Meier estimate of a survival function is a series of horizontal steps of declining magnitude which approaches the true survival function when the sample size is large enough [21]. The value of the survival function between successive sampled observations is assumed to be constant. An advantage of the estimate is that it can take into account censored data, i.e. losses from the sample for which the final outcome is unknown. This is useful when patients withdraw from the study before their final outcome is known. We wish to estimate the proportion surviving by any given time, which is the estimated probability of survival until that time for a member of the patient population from which the sample was taken; the Kaplan-Meier method is used so that censored data can be handled. At each time interval, the conditional probability that those who have survived to the beginning of the interval will survive to its end is estimated, and the survival to any time point is the product of the conditional probabilities of surviving each previous interval. The survival probabilities for all time points are calculated and a survival curve is drawn. The curve is a step function with horizontal steps because the probability is assumed to remain constant between time points; each step starts at a time point in the data, and the time points need not be equally spaced. Time points with censored data are marked by vertical lines. Three main assumptions are made when calculating the Kaplan-Meier estimate [20]. First, at any time point, censored patients are assumed to have the same survival prospects as the un-censored patients; this assumption is not easily testable, as censoring can result from various related and unrelated events. Second, the survival probabilities are assumed to be independent of the time at which subjects are recruited into the study. Third, an event is assumed to occur at the time point at which it is identified, though it may have happened between time points or earlier. The survival curves can be plotted for two or more groups of subjects/patients and compared for differences in survival probabilities, which can be used to establish an association between survival and unique characteristics of a group. Formal methods are needed to test hypotheses about survival in two or more groups.
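In symbols, the product-limit estimate described in this paragraph is the textbook Kaplan-Meier formula (a standard formulation, not copied from the thesis):

    \hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

where the t_i are the observed death times, d_i is the number of deaths at t_i, and n_i is the number of patients still at risk just before t_i.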

To compare the groups, the survival curves for each group can be calculated and the proportion of subjects surviving at each time point compared. However, this approach compares the survival of two groups only at an arbitrary time point and cannot compare their total survival experience. For a statistical comparison of the Kaplan-Meier curves there are several methods, such as the log-rank test. Sometimes called the Mantel-Cox test, it is a hypothesis test used to compare the survival distributions of two samples. It is non-parametric and is especially useful when considering right-censored data with non-informative censoring [20]. It is used to test hypotheses about survival in two different patient groups. The log-rank test is the most popular method for comparing the survival of groups and takes into account the whole period of study; its advantage is that it requires no knowledge of the shape of the survival curve or the distribution of survival times [22]. It tests the null hypothesis that the patient groups do not differ in the probability of death (or any other event under consideration) at any time point. At each time point, the observed number of deaths in each group and the expected number of deaths (assuming no difference in survival between the groups) are calculated; the calculations are performed each time a death occurs. For censored survival times, the patient is considered to be at risk of dying during the censored time period but not during subsequent periods. From these calculations the total numbers of expected deaths for each group are obtained, and a chi-square test of the null hypothesis can be used. The chi-square test statistic is the sum of (O−E)²/E over the groups, where O and E are the total observed and expected deaths, and the number of degrees of freedom is the number of groups minus one. If a p-value less than 0.01 is obtained from a table of the chi-square distribution, the difference between the groups is statistically significant. The log-rank test performs well when the risk of death is consistently greater for one group than another, but it performs poorly when the survival curves cross [22]; therefore, when analyzing survival data, the survival curves should be plotted along with the log-rank tests.

Also, the log rank test is a test of significance and cannot estimate the size of difference between the groups. One common method to measure the difference is to use the hazard ratio.
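The simplified chi-square version of the statistic described above, the sum of (O−E)²/E over the groups, can be sketched as follows (numpy/scipy assumed; the full log-rank test uses a variance-based statistic, so treat this as the approximation the text describes, with illustrative names):

    import numpy as np
    from scipy.stats import chi2

    def logrank_chi_square(times, events, groups):
        # times:  survival or censoring time per patient
        # events: True if the death was observed, False if censored
        # groups: group label per patient
        times, events, groups = map(np.asarray, (times, events, groups))
        events = events.astype(bool)
        labels = np.unique(groups)
        observed = np.zeros(len(labels))
        expected = np.zeros(len(labels))
        for t in np.unique(times[events]):      # each time a death occurs
            at_risk = times >= t                # censored patients are at risk up to t
            deaths = events & (times == t)
            n_risk, n_dead = at_risk.sum(), deaths.sum()
            for i, g in enumerate(labels):
                in_g = groups == g
                observed[i] += (deaths & in_g).sum()
                expected[i] += n_dead * (at_risk & in_g).sum() / n_risk
        stat = ((observed - expected) ** 2 / expected).sum()
        return stat, chi2.sf(stat, df=len(labels) - 1)   # statistic and p-value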

56

5.1.2.3 Results

For each gene cluster obtained by applying the EigenCut algorithm to the TCGA gene data, the patients are divided into two sets through K-means clustering (k = 2), which groups the patients based on their expression profiles across the genes in the cluster. Within a cluster, the two sets of patients therefore have different expression profiles. The survival functions for the two sets of patients (two groups) are plotted using the Kaplan-Meier estimate, and a log-rank test is performed to compare the survival curves. A gene cluster for which the two survival functions are significantly different is of interest: if the survival times of the two sets of patients, divided on the basis of their expression profiles over a set of genes, are very different, the survival of a patient is related to the expression of these genes in that patient. Once such clusters are recognized, they can be biologically validated.
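A sketch of this per-cluster screen, assuming scikit-learn for the K-means step and the lifelines package for the Kaplan-Meier curves and log-rank test (neither package is named in the thesis, so this only mirrors the procedure, not the original code):

    import numpy as np
    from sklearn.cluster import KMeans
    from lifelines import KaplanMeierFitter
    from lifelines.statistics import logrank_test

    def cluster_survival_pvalue(expr, surv_time, surv_event):
        # expr: numpy array of patients x genes, restricted to one gene cluster
        # surv_time, surv_event: survival time and death indicator per patient
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expr)
        g0, g1 = labels == 0, labels == 1
        result = logrank_test(surv_time[g0], surv_time[g1],
                              event_observed_A=surv_event[g0],
                              event_observed_B=surv_event[g1])
        for mask, name in ((g0, "group 0"), (g1, "group 1")):
            KaplanMeierFitter().fit(surv_time[mask], surv_event[mask],
                                    label=name).plot_survival_function()
        return result.p_value   # clusters with p < 0.05 are of interest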

Out of the 31 clusters obtained by applying the EigenCut algorithm without overlaps using the next-available-hub seed-selection approach, 7 clusters show potential predictive capacity, with log-rank test p-values less than 0.05; they are listed in Table 5. The Kaplan-Meier curves for the two patient groups of cluster 8 are shown in Figure 8.

Table 6 shows the 6 clusters, out of the 32 clusters obtained by applying the EigenCut algorithm on the TCGA data without overlaps using the next-available-higher-index seed-selection approach, that show potential predictive capacity with log-rank test p-values less than 0.05. Table 7 shows the 8 clusters with predictive capacity out of the 71 clusters obtained by applying EigenCut with multi-merge, overlaps and the next-available-hub seed-selection approach. The clusters obtained from the PruneCut algorithm do not display the predictive power indicated by low p-values for the Kaplan-Meier curves. The EigenCut algorithm was also applied on the K-core to obtain 24 clusters, out of which 7 showed good predictive power; they are listed in Table 8.

Figure 7. Clusters obtained by applying EigenCut with the next-available-hub seed-selection approach (only some of the clusters are shown here).

58

Cluster   p-value
1         0.0091268
5         0.026664
8         0.0010946
9         0.00054934
20        0.0074659
24        0.032249
27        0.0016763

Table 5. List of clusters (EigenCut without overlaps and next-available-hub seed-selection) with p-values less than 0.05 in the log-rank tests

Figure 8. Kaplan-Meier curves for cluster 8 in the above table

Cluster   p-value
2         0.0063116
6         0.0057298
14        0.000957
17        0.010389
22        0.0010559
31        0.022746

Table 6. List of clusters (EigenCut without overlaps and next-available-higher index seed-selection) with p-values less than 0.05 in the log-rank tests

59

Figure 9. Kaplan-Meier curves for cluster 14 in the above table

Cluster   p-value
1         0.0069956
2         0.0086392
13        0.023455
14        0.016102
20        0.01743
32        0.00098224
41        0.021736
43        0.034206

Table 7. List of clusters (EigenCut with multi-merge, overlaps and next-available-hub seed-selection) with p-values less than 0.05 in the log-rank tests

60

Figure 10. Kaplan-Meier curves for cluster 32 in the above table

Cluster   p-value
1         0.0091268
4         0.026664
8         0.010111
11        0.013947
13        0.015916
20        0.0097023
21        0.0061901

Table 8. List of clusters (EigenCut on K-core) with p-values less than 0.05 in the log-rank tests

All the clusters with good predictive power (p < 0.05) from the above methods were identified. These clusters could be potential gene signatures that help differentiate patients into good-prognosis and bad-prognosis sets. The log-rank test results for some of these clusters are more significant than most published results on TCGA data (Bredel M, et al. 2009). A few clusters of interest with low p-values were identified and labeled Clusters A - K. The clusters are listed in Table 9; however, they warrant further experimental validation. It was observed that cluster A and cluster F gave almost identical Kaplan-Meier curves and p-values, so Kaplan-Meier curves were also plotted for the intersection of the two clusters, which contained 42 genes and was labeled cluster L. The resulting Kaplan-Meier curves had a p-value of 0.000352, significantly lower than that of any of the clusters A - K. The genes in each of the clusters are listed in Appendix A.

Cluster   Membership                        p-value      # Genes
A         Cluster 8, Table 5                0.0010946    79
B         Cluster 9, Table 5                0.00054934   87
C         Cluster 27, Table 5               0.0016763    23
D         Cluster 2, Table 6 *              0.0063116    466
E         Cluster 6, Table 6 *              0.0057298    154
F         Cluster 14, Table 6               0.000957     79
G         Cluster 22, Table 6               0.0010599    29
H         Cluster 2, Table 7 *              0.0086392    303
I         Cluster 32, Table 7               0.00098224   39
J         Cluster 20, Table 8               0.0097023    21
K         Cluster 21, Table 8               0.0061901    97
L         Intersection of clusters A & F *  0.000352     42

Table 9. Potential biomarkers identified through different methods

For the above gene clusters, the biological functions were investigated using the ToppGene functional enrichment software and the Ingenuity Pathway Analysis (IPA) software. Three clusters of particular interest were identified, though they warrant further experimental validation. The first is cluster D, which is highly enriched with immune response and other inflammatory response genes. Immune response is a key process in cancer development: during cancer initiation and progression, immune cells such as T-cells play important and complicated roles. The cluster is also enriched with genes related to immune cell trafficking, cellular development, cell death, cellular movement, and cell growth and proliferation, including genes related to the activation, movement and recruitment of cells. Our results are likely to help researchers zoom in on the immune genes that play key roles in the progression of GBM.

Figure 11. Functional Enrichment analysis using IPA for cluster D. The x-axis shows the log (base 10) of p-values of the enriched terms using the Fisher's exact tests.

63

ID Name P-value Term in Query Term in Genome

1 GO:0002376 immune system process 5.380E-37 119 1258

2 GO:0006955 immune response 4.421E-35 95 832

3 GO:0006952 defense response 1.172E-22 75 760

4 GO:0002684 positive regulation of immune system process 2.644E-21 47 303

5 GO:0002682 regulation of immune system process 1.374E-20 58 493

6 GO:0050776 regulation of immune response 1.825E-17 41 277

7 GO:0009611 response to wounding 4.149E-17 62 658

8 GO:0050778 positive regulation of immune response 6.545E-16 33 188

9 GO:0006954 inflammatory response 1.640E-15 47 414

10 GO:0002252 immune effector process 4.530E-14 36 260

11 GO:0019882 antigen processing and presentation 6.096E-14 21 73

12 GO:0045321 leukocyte activation 1.002E-13 45 421

13 GO:0001775 cell activation 2.898E-13 47 471

14 GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains 1.167E-12 27 153

15 GO:0002250 adaptive immune response 1.384E-12 27 154

16 GO:0002443 leukocyte mediated immunity 7.152E-12 28 178

17 GO:0045087 innate immune response 3.463E-11 29 204

18 GO:0002253 activation of immune response 5.102E-11 23 123

19 GO:0002449 lymphocyte mediated immunity 2.286E-9 23 146

20 GO:0046649 lymphocyte activation 2.565E-9 35 349

Table 10. GO enrichment results using ToppGene for Cluster D (GO: Biological Processes)

64

Figure 12. Functional Enrichment analysis using IPA for cluster E. The x-axis shows the log (base 10) of p-values of the enriched terms using the Fisher's exact tests.

Another interesting gene cluster is cluster E. It is highly enriched with genes for cellular growth and proliferation, cell death and cell movement. The first term is associated with programmed cell death, or apoptosis; as is well established, the ability to escape apoptosis is a hallmark of cancer cells, so our results will help researchers identify apoptosis genes that play key roles in GBM. Cell movement is also one of the most important processes associated with the fatality of GBM: metastasis. The cell growth genes in the cluster include genes related to the growth of tumor cell lines and cancer cells. Of the 154 genes in cluster E, many are extracellular genes (e.g. MMP14, critical for tumor cells to break the basal membrane) or genes related to the tumor microenvironment (e.g. CAV1, mediating the signal transduction between the collagen receptor integrin and the tyrosine kinase pathway). The genes in cluster E will not only help researchers identify tumor-microenvironment-related pathways for studying the metastasis of GBM, they also provide potential drug targets. Cluster H is also highly enriched with genes for cellular development and cell growth; its cellular development genes include genes for differentiation, cell development and cell spreading, and it is also enriched with genes for cellular movement.

ID Name P-value Term in Query Term in Genome

1 GO:0008219 cell death 8.245E-3 28 1385

2 GO:0016265 death 8.829E-3 28 1390

3 GO:0010033 response to organic substance 1.299E-2 17 605

4 GO:0012501 programmed cell death 1.735E-2 26 1279

5 GO:0034097 response to cytokine stimulus 1.742E-2 7 93

6 GO:0009628 response to abiotic stimulus 2.393E-2 14 443

7 GO:0048545 response to steroid hormone stimulus 2.469E-2 10 225

8 GO:0051093 negative regulation of developmental process 3.889E-2 18 728

9 GO:0006915 apoptosis 4.244E-2 25 1265

Table 11. GO Enrichment results using ToppGene for Cluster E (GO: Biological Processes)

66

Figure 13. Functional Enrichment analysis using IPA for cluster H. The x-axis shows the log (base 10) of p-values of the enriched terms using the Fisher's exact tests.

67

ID Name P-value Term in Query Term in Genome

1 GO:0009605 response to external stimulus 1.711E-10 54 1114

2 GO:0002376 immune system process 6.800E-8 53 1258

3 GO:0009611 response to wounding 1.501E-7 36 658

4 GO:0002252 immune effector process 2.165E-6 21 260

5 GO:0006955 immune response 7.177E-6 38 832

6 GO:0042060 wound healing 8.574E-6 20 254

7 GO:0002682 regulation of immune system process 1.693E-4 26 493

8 GO:0050776 regulation of immune response 1.963E-4 19 277

9 GO:0002684 positive regulation of immune system process 7.928E-4 19 303

10 GO:0006952 defense response 8.910E-4 32 760

11 GO:0045321 leukocyte activation 2.097E-3 22 421

12 GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains 2.121E-3 13 153

13 GO:0002250 adaptive immune response 2.285E-3 13 154

14 GO:0051093 negative regulation of developmental process 3.304E-3 30 728

15 GO:0001775 cell activation 3.759E-3 23 471

16 GO:0019724 B cell mediated immunity 4.284E-3 10 92

17 GO:0060548 negative regulation of cell death 4.338E-3 23 475

18 GO:0031099 regeneration 4.735E-3 10 93

19 GO:0006916 anti-apoptosis 6.494E-3 16 254

20 GO:0002449 lymphocyte mediated immunity 7.970E-3 12 146

21 GO:0043066 negative regulation of apoptosis 8.642E-3 22 459

Table 12. GO Enrichment results using ToppGene for Cluster H (GO: Biological Processes)

68

Cluster L had 42 genes and was moderately enriched with genes for cellular growth and proliferation, cell death and DNA replication. However, the p-value for its Kaplan-Meier curves is lower than that of any other cluster identified as a potential biomarker. The cluster does not contain many of the genes for key processes already identified as associated with GBM. These genes, and the genes in the other clusters from our biomarker cluster set, will help researchers identify relationships between pathways and the changes induced by glioma-related genes on other genes or pathways.

Figure 14. Functional Enrichment analysis using IPA for cluster L. The x-axis shows the log (base 10) of p-values of the enriched terms using the Fisher's exact tests.

Once a biomarker, or gene signature, is identified, the patients can be divided into good-prognosis and bad-prognosis sets. The prognosis of a new patient can then be predicted by comparing the expression profile of the marker genes in that patient with the profiles of the good- and bad-prognosis patient sets. The comparison is made by computing the Pearson correlation coefficient (PCC) between the expression vector of the patient and the expression vectors of the good- and bad-prognosis patient sets over the marker genes; the new patient is classified into whichever patient group has a correlation of at least 0.5.
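A sketch of this classification rule (numpy assumed; representing each prognosis group by its mean expression vector over the marker genes is an assumption, since the thesis does not state how the group vector is formed):

    import numpy as np

    def classify_new_patient(patient_expr, good_group_expr, bad_group_expr,
                             min_corr=0.5):
        # patient_expr: expression of the new patient over the marker genes
        # good_group_expr, bad_group_expr: patients x genes matrices for the
        # good- and bad-prognosis sets over the same marker genes
        good_profile = good_group_expr.mean(axis=0)   # assumed group profile
        bad_profile = bad_group_expr.mean(axis=0)
        r_good = np.corrcoef(patient_expr, good_profile)[0, 1]
        r_bad = np.corrcoef(patient_expr, bad_profile)[0, 1]
        label, r = max((("good", r_good), ("bad", r_bad)), key=lambda t: t[1])
        return label if r >= min_corr else None       # None: no confident call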

5.2 Application 2: Breast Cancer Data - GDS2250 dataset

The Gene Expression Omnibus (GEO) is a public functional genomics repository of array- and sequence-based data that stores curated gene expression datasets. The GDS2250 dataset in the GEO database contains expression data on sporadic basal-like cancer (BLC), BRCA-associated breast cancer and non-basal-like tumor samples. Basal-like breast cancers are defined by a specific pattern of gene expression that is similar to that of normal breast basal cells; they correspond to estrogen-receptor (ER)-negative, progesterone-receptor (PR)-negative and HER-2-negative tumors (triple-negative tumors). The changes in network topology and in cluster membership between co-expression networks created from basal-like cancer expression data and non-basal-like cancer expression data are analyzed.

The expression data is available as a table of probes × samples and is reduced to genes × samples as discussed in the previous application. The data is divided into two tables, one for the 'basal-like cancer' samples and the other for the 'non-basal-like cancer' samples. The data had expression values for 22215 genes across 18 basal-like cancer samples and 20 non-basal-like cancer samples. Gene co-expression networks are built for both datasets, and the EigenCut algorithm is deployed on both networks to obtain clusters. The overlaps between the clusters are calculated to see whether the network is stable across the two datasets.

To obtain the gene co-expression networks for the above datasets, an adaptive threshold method is used. Instead of arbitrarily selecting a threshold value, which usually is 0.5, adherence to scale-free topology is evaluated for the gene co-expression network at each threshold, and a value is selected such that the scale-free topology property is maximized.

The defining characteristic of such networks is the presence of a few highly connected nodes (nodes with high degree), called hub nodes, that connect the rest of the nodes to the network [6][7]. Scale-free networks have high tolerance to errors, because the ability of the nodes to communicate is unaffected even at high failure rates, and it is believed that gene expression networks and other biological networks have a scale-free topology.

An adaptive threshold analysis is conducted on the datasets by building a gene co-expression network for threshold values from 0.1 to 0.9 at intervals of 0.05 and computing the linear regression model fit index R² for each threshold value. Only threshold values for which the network has nearly scale-free topology, i.e. an R² value of at least 0.85, are considered. Several threshold values may lead to R² > 0.85, so the first threshold value for which R² > 0.85 holds both for that threshold and for all following thresholds is selected. For the two datasets above, threshold values of 0.72 (basal-like cancer data) and 0.69 (non-basal-like cancer data) are observed to give rise to gene co-expression networks with scale-free topology.
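The adaptive threshold selection can be sketched as follows (numpy/scipy assumed). The exact degree-binning and fitting procedure is not specified in the thesis, so the log-log fit below is one common way to check scale-free behavior, not the original implementation:

    import numpy as np
    from scipy.stats import linregress

    def scale_free_r2(corrs, thresh, n_bins=10):
        # R^2 of the log-log fit of the degree distribution at one cut-off
        adj = np.abs(corrs) >= thresh
        np.fill_diagonal(adj, False)
        k = adj.sum(axis=1)
        k = k[k > 0]
        counts, edges = np.histogram(k, bins=n_bins)
        centers = 0.5 * (edges[:-1] + edges[1:])
        keep = counts > 0
        fit = linregress(np.log10(centers[keep]),
                         np.log10(counts[keep] / counts.sum()))
        return fit.rvalue ** 2

    def pick_threshold(corrs, r2_min=0.85):
        # first cut-off from which all larger cut-offs reach R^2 >= r2_min
        grid = np.arange(0.1, 0.91, 0.05)
        r2 = np.array([scale_free_r2(corrs, t) for t in grid])
        for i, t in enumerate(grid):
            if np.all(r2[i:] >= r2_min):
                return float(round(t, 2))
        return None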

The goal is to understand the difference in gene expression and gene connectivity across the two datasets. To start with, the adjacency matrices of the two co-expression networks were compared: an element-wise AND operation was performed on the matrices to obtain the preserved network. The resulting matrix had only very few connected nodes; out of 22215 genes, only around 5400 genes were connected to each other. The EigenCut algorithm was deployed on this new adjacency matrix to obtain the consistent clusters.
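The preserved-network step itself is a one-line matrix operation; a sketch (numpy assumed, with illustrative names):

    import numpy as np

    def preserved_network(adj_basal, adj_nonbasal):
        # element-wise AND of the two adjacency matrices (same gene order);
        # only edges present in both co-expression networks are kept
        preserved = np.logical_and(adj_basal, adj_nonbasal)
        n_connected = int(preserved.any(axis=1).sum())   # genes keeping >= 1 edge
        return preserved, n_connected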

The EigenCut algorithm was deployed on both co-expression networks using the next-available-hub seed-selection approach to obtain two sets of clusters, one for each network. The cluster overlaps between the two sets were then calculated: the pair-wise overlaps between the two sets were computed and stored in an overlap matrix, and the top m overlap values were picked out, where m is the number of clusters in the smaller cluster set. The maximum overlap was observed to be around 0.6, with the majority of the overlaps lying below 0.2. This shows that the gene connectivity and the overall network topology change substantially across the two datasets, which suggests that there is a major difference in gene behavior between basal-like cancer and non-basal-like cancer.

72

Table 13 shows the overlap values for clusters obtained by applying the EigenCut algorithm with the next-available-hub seed-selection approach and an inter-modular similarity threshold of 0.85.

73

[Pair-wise overlap matrix between the basal-like cancer clusters (rows) and the non-basal-like cancer clusters (columns); most entries are zero and the largest overlaps are below about 0.6.]

Table 13. Overlap values for the basal-like cancer clusters versus non-basal-like cancer clusters

74

CHAPTER 6: CONCLUSION & FUTURE WORK

In this thesis, I propose an approach for selecting potential biomarkers based on GCN analysis. I propose an efficient GCN module-finding algorithm and use it to search for predictive markers in glioblastoma using TCGA data and to predict interactions between genes and miRNAs using the same data. The EigenCut algorithm is time-efficient compared to other methods used so far for GCN analysis. It gives the user control over the inter-cluster and intra-cluster similarity, as well as flexibility in selecting the seed nodes for clustering, i.e. the algorithm lets the user specify the seed-selection order. It also provides the facility of growing a cluster from a known set of genes. The next-available-hub seed-selection approach exploits the inherent scale-free topology of biological networks, unlike many other techniques, and is well suited to the complex topology of biological networks, which is a mix of scale-free and modular topologies. The simple modifications of the algorithm, such as multi-merge and overlapping clusters, can be used with certain trade-offs depending on the context. This algorithm is very useful in the context of co-expression networks, or any other interaction network whose nodes are correlated through feature vectors and where groups of highly inter-correlated nodes are sought.

This heuristic algorithm provides opportunities for interesting extensions. A more efficient heuristic can be implemented to create clusters that more appropriately exploit the complex topology of biological networks. Topological information, such as the degree or connectivity of the nodes, could also be incorporated into the algorithm to make it more efficient. The algorithm can then be extended to biological networks other than GCNs or, in fact, to any network whose nodes represent vectors in a feature space.

Several novel gene-miRNA interactions were predicted. However, these predictions need further validation through biological experiments. These results will help researchers in gene-miRNA interaction studies and serve as a starting point for using GCN analysis in interaction prediction.

A few clusters of genes were identified as potential biomarkers. The log-rank test results for some of these clusters are more significant than most published results on TCGA data. Twelve potential biomarkers with good predictive power, indicated by log-rank test p-values < 0.01, were identified. For three of these clusters, the functional enrichment results show genes related to key aspects of GBM and cancer in general. Other potential biomarkers, though not highly enriched with GBM-related genes, have better predictive power, which suggests that GBM-related pathways may induce changes in, or interact with, other pathways. My work on GBM biomarker discovery demonstrates that GCN analysis can be an effective approach to biomarker discovery and a research tool for identifying candidate genes that may play an important role in patient survival or disease prognosis.

76

Through EigenCut clustering and an overlap comparison of the clusters between the basal-like cancer and non-basal-like cancer datasets, I also found that the gene connectivity and co-expression network topology change substantially across the two datasets. This implies that the expression pattern and co-expression of genes differ drastically between the datasets, suggesting that these two types of cancer may be very different at the functional level. The biological relevance of such a change in expression behavior is yet to be studied and might reveal interesting findings.

My future work will involve developing a more efficient heuristic for the EigenCut algorithm that can exploit the complex topology of biological networks by incorporating the topological properties of the network. I plan to extend the algorithm to be more generic in the context of biological networks and then to networks in general. I also plan to apply the algorithm to biomarker discovery for other cancers and to other research areas such as protein-protein interaction networks.

77

BIBLIOGRAPHY

[1] http://en.wikipedia.org/wiki/Gene

[2] Human Genome Project Information. http://www.ornl.gov/sci/techresources/Human_Genome/faq/genenumber.shtml

[3] Just the Facts: A basic introduction to the Science underlying NCBI resources. Science Primer, NCBI.

[4] Bartel, D. P. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281-297.

[5] http://en.wikipedia.org/wiki/MicroRNA

[6] Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene Co- Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17

[7] Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks science. Science, 286(5439), 509–512.

[8] Ravasz, E., Somera, A. L., Mongru, D. A., Oltavi, Z. N. and Barabasi A.L. (2002)). Hierarchical organization of modularity in metabolic networks. Science, 297( Aug.), 1151-1155.

[9] C. Bron and J. Kerbosch. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM, 16(9):575–577, 1973.

[10] V. Batagelj and M. Zaversnik. An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2003.

[11] Mechthild Stoer and Frank Wagner. A simple Min-cut algorithm Proceedings of the 2nd annual European Symposium on Algorithms. Lecture notes in Computer Science, vol. 855, 1994, pp. 141-147

[12] J. Abello, M.G. Resende, and S. Sudarsky, “Massive Quasi-Clique Detection,” Springer Berlin / Heidelberg, 2002.

78

[13] Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54

[14] J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, Feb 1988.

[15] Biological Pathways: Fact sheets: www.genome.gov, National Human Genome Research Institute.

[16] Lee E, Chuang H-Y, Kim J-W, Ideker T, Lee D (2008), Inferring Pathway Activity toward Precise Disease Classification. PLoS Comput Biol 4(11): e1000217. Doi:10.1371/journal.pcbi.1000217

[17] Rui Liu et al.(2007) The prognostic Role of a Gene Signature from Tumorigenic Breast-Cancer cells, The New England Journal of Medicine. Vol. 356: No. 3.

[18] Marc J. Van De Vijver et al.(2002) A Gene-Expression signature as a predictor of survival in Breast Cancer, The New England Journal of Medicine. Vol. 347:No. 25.

[19] Christos Sotiriou and Lajos Pusztai (2009) Molecular origins of cancer: Gene-Expression Signatures in Breast Cancer, The New England Journal of Medicine; 360:790-800

[20] J Martin Bland and Douglas G Altman. Survival Probabilities (The Kaplan-Meier method), Statistics Notes, BMJ 1998;317:1573

[21] Steve Dunn. Survival curves:Accrual and the Kaplan-Meier estimate. CancerGuide: Statistics., 2002. http://www.cancerguide.org/scurve_km.html

[22] J Martin Bland and Douglas G Altman. The Logrank test, Statistics Notes, BMJ 2004;328:1073

[23] NCI (2009) TCGA Data Portal. URL http://cancergenome.nih.gov/dataportal/.

[24] Itadani H, Mizuarai S, Kotani H. Can systems biology understand pathway activation? Gene expression signatures as surrogate markers for understanding the complexity of pathway activation. Curr Genomics. 2008 Aug;9(5):349-60.PMID: 19517027

[25] Liu J, Campen A, Huang S, Peng SB, Ye X, Palakal M, Dunker AK, Xia Y, Li S. Identification of a gene signature in cell cycle pathway for breast cancer prognosis using gene expression profiling data. BMC Med Genomics. 2008 Sep 11;1:39. PMID: 18786252

[26] http://en.wikipedia.org/wiki/Survival_analysis

[27] Chen J, Bardes EE, Aronow BJ, Jegga AG 2009. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Research doi: 10.1093/nar/gkp427 (PubMed).

[28] Bredel M. et al. A network model of a cooperative genetic landscape in brain tumors. Jama, 2009. 302(3): p.261-75

[29] MacLennan, N.K., et al., Weighted gene coexpression network analysis identifies biomarkers in glycerol kinase deficient mice. Mol Genet Metab, 2009.98(1-2): p. 203-14.

[30] Hu, H., et al., Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 2005. 21 Suppl 1: p. i213-21.

[31] Langfelder, P. and S. Horvath, WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 2008. 9: p. 559.


Appendix A: Gene lists for Clusters A - L

XRCC4 FDPS EIF1AY SLC1A2 OBFC1 NTAN1 XRCC2 MRPL17 MED8 GLRX2 HIST1H2BL MRPL35 BLVRB APH1B RP11-336K24.9 AZGP1 PPCS APRT MYBPC3 SRRM1 C1orf78 DERL2 ESR1 GSTK1 COL14A1 VAMP5///VAMP8 GSTP1 YIPF2 AHDC1 GNG5 C20orf24 PITX1 S100A6 SLC35A2 ORC2L GEMIN6 SMG7 C19orf10 CENPI MRLC2 TMEM208 TNFRSF1A QKI MRPS33 SEPX1 GTF3A EMG1 SYT1 EMR2 MRPL34 MTCP1 ATXN2L C19orf53 GLRX LGTN DPM3 ZNHIT1 ABCC10 SSR4 BCL7C COPS6 COX7BP1 CYB5B CDH3 BUD31 FZD7 MYO1D TOR3A ARD1A TIMM8B TRSPAP1 NDUFB2 NECAP2 CREM MRPS16 SSNA1 BNIP2 PPCDC DVL3 Table 14. Cluster A

DDT NELL2 GRM6 COMMD3 NSMCE4A TYRP1 RRAS HSPA14 CDC123 TIMM23 PDSS1 PQBP1 LOC92482 RPS17 4-Sep C10orf97 CUEDC2 ARL3 CISD1 PPA1 C10orf88 IDI1 CBARA1 NDUFB8 PRPF18 ASCC1 ASB13 CUTC KIAA1128 EBP ZNF239 KIAA1462 YME1L1 COX15 C12orf24 ACAT2 C8orf32 HCRTR1 MRPL48 UQCRH NDUFB11 YWHAQ SUCLG1 HARS2 POLR3K ETFB PDE2A ANKHD1 COPS4 PSME2 C16orf61 PDE6D TOMM20 ZNHIT3 OMG LYRM4 RPL15 hCG_39912///RPL17 HINT1 SNCB Magmas ORC4L NOLA2 RPS3A UBE2D2 METTL9 OPRS1 FGF6 OLA1 ATP5J HIST1H4G P15RS CRIPT KIDINS220 TMEM14B UBE2N MXRA5 PSMD10 VBP1 WDR7 CLINT1 C16orf80 CDC40 HSF2 GML NDUFC2 TBC1D4 Table 15. Cluster B 81

HCG2P7 PRINS MTRF1L LRRFIP1 MCM3AP GPR1 SPG21 CACNA1G RPL35A HRSP12 HDAC9 DYM TIGD1L SCN11A SFTPB ATP6V1A C4orf34 DCLRE1C C14orf124 ZNF492 MKRN3 MLN TUBA4B Table 16. Cluster C

HSPA6 VPS8 KIAA0564 FCGR2C FCGR2B PLAUR AP3B1 CD163 MYO10 PF4V1 C5AR1 ADCY7 DSE SLC11A1 OPRL1 FOXD2 MS4A4A FCER1G KRT5 C1QA TRPC6 ITGB2 HFE ACADSB HCK SERPINA1 SLAMF8 SCARB1 CD53 HELZ ANK1 GNL2 VAMP8 BLNK MS4A6A LILRB4 RNASE6 KDSR AIF1 KLF1 SAMSN1 NPC2 RNASE2 MNDA GIPC1 CPB1 HLA-DMA GIMAP4 CTSC S100A4 C8orf71 PTPRC LILRB1 PRPF19 TNFSF12 ID3 MYO1F HSPA1B GFRA1 CENTA2 RAD1 TNF MAFB HTR1E MFSD1 PYCARD GPR65 CLEC7A TFEC HLA-DRB1 HLA-DMB RNASET2 RFPL3 C4orf15 ITGAM SQRDL CD300A GMFG GC CYBA FAM50A ZNF518A RAB22A PTGS1 PTPN6 FCGR1B CD68 HLA-DRB5 HAPLN1 FCGR1A MGAT4A PILRA NEURL PLCG2 DLD GRK5 GIMAP6 RHOG C7orf54 UGT1A9 FLJ22662 LCK LYN MAP3K8 LIMS1 PLA2R1 LCMT2 S100A11 KYNU LILRB2 MATN2 SIGLEC7 RPS6KA1 TREM2 LGALS9 CD33 ADAM28 HLA-DPB1 CIZ1 HIF1A GSTM4 GPX1 CLCN4 CASP1 CD84 IL4R TDO2 PAPPA2 SFRS16 GNA15 IL18 LST1 APOD BIN2 NPL PSMD4 ATP8B4 NMT2 LAT2 ATP5L2 TLR7 CTSB B4GALT1 HTATIP RGS19 ELA3A HLA-DRB6 F13A1 STAB1 RARG ARHGAP15 CTDSP1 SCPEP1 CTSS DAB2 KCNN4 FCGRT DENND2D CPVL COL1A2 GNAI3 C3 TUSC4 RBM39 ELF2 LITAF PTPRS SH3TC1 RBM47 S100A8 SPRY2 S100A9 MGAT1 KIR2DL4 OLFML3 RHBDF2 GRN WDR76 NEDD4 SLC22A4 IL15 HSF4 RPL3L ZMYM6 OAS1 C17orf60 SLC15A3 FLI1 API5 C11orf75 HPSE RNASE3 ATP5H TRAF5 FBP1 LTBP2 PAX9 FXYD5 PAX2 NOD2 MR1 HAMP 82

SLC15A1 HLA-DQA1 CMTM6 REG1B CCNO TRIOBP HISPPD2A OCEL1 PDE1A CLEC4A TRIM24 IL15RA TNFSF10 PLOD2 ACSL1 SNTB2 SMARCD2 P2RY5 NKG7 IER3 ZC3H11A C21orf82 MAPKAPK3 CLEC2B KRT13 NCF1B PTPN2 CAP1 NCF1 RFPL1 RAPGEF3 DPYD OLFML2B RPS18 SFRS18 TMEM176B PLSCR1 SPP1 TRADD CD180 SREBF1 CCL2 CTSD DOCK1 THBS1 TPK1 MPV17 DENND1C OSTF1 SSX5 PGRMC1 PTGS2 TGFB1I1 HLA-DPA1 GAL3ST4 CAPNS1 A2M VNN2 PHF8 IMP4 TREM1 CFI WIPI1 TANK MOBKL1B ACTB ELF4 CTBS DNAJC7 CCT2 SRRM2 C20orf177 FLVCR2 GIMAP5 FPR3 SELL PCLO NDUFB7 FAS FDX1 CSTB TMEM109 IL16 RIN3 SYNGR2 HYAL3 SAT1 SEC24D BTK LILRA2 SOAT1 APOBEC3G BRP44 CD69 BMP2K CIDEB ANXA4 MGST2 DNM1L CSF3R PCNT CFB GNPDA1 HAGH HNF1B RPP38 CD2 RASSF9 SACM1L CLIC1 AKT1 TNFSF13 SP110 PIR CSF1 IL6R PCDH11Y SMPDL3A CP ISG20 INPP5D TMEM176A DENND3 TAP2 SCCPDH WIPF1 SERPING1 CDCP1 GNS LGR5 MMP19 TMBIM1 IFITM2 TOP1 COIL DNASE1L1 RAB20 TPMT SOD2 NAMPT FHL2 INA COPZ2 TRIM38 CECR1 NAGA GLB1 C11orf80 SH3GLB1 ROR1 TRIM34 BAI2 DAPK1 TIMM17B TRPC4AP SPECC1L GMIP FER1L3 SH3BGRL3 MRCL3 C9orf167 ADAM19 RAB11FIP1 PGDS ZNF711 MYD88 MFHAS1 EBI3 FAM26B IL21R NSF COL8A2 RAB27A CD44 ADPGK TRIM21 KIAA0194 C1RL ANXA2P2 INPP1 CD1C CLEC5A PDHA1 UPP1 FTL CTSZ REXO2 LOC391020 DYRK3 IL2RA ITPKB DRAM LASS2 ZMYM6 PRSS23 TLR3 DCLK1 HERPUD1 BMP2 OGFRL1 ELL2 JMJD2B TRPV2 ACTA2 ANXA1 CPE GRPEL1 TRIM22 STEAP3 RNF40 NFATC4 GYPC KIAA0133 MATK PLAU CRHR2 NAT1 SERPINE1 SLCO2B1 APOBEC3F SLFN12 ROM1 KMO MTHFD1 USH2A PRUNE2 IL1R2 SLC24A6 GAA ALDOB SRF SP100 FOSL2 GALNT6 RGS11 LAIR1 SECTM1 FAM129A KIAA0152 CCBL1 IRAK3 IQGAP1 TOMM34 HLA-G C1orf54 CEBPD CD72 RCAN1 CCDC109B DUSP3 MXD4 SFT2D2 CD59 CD3D PPP1R3D FMNL1 TAF11 MST075 NPHS1 DLX4 CDH6 CD1B HLX Table 17. Cluster D


MMP14 SERPINH1 LAMC1 COL5A2 SRPX2 FNDC3B MYH9 MRC2 GALNT2 CA12 MCM6 KIAA1012 C2orf12 GLT25D1 KIAA0776 TXNDC5 HTR2B LEPRE1 PIN4 SRPX ARSJ CSGlcA-T CPB2 LGALS3 PLS3 HNRPA1P5 ZYX M6PRBP1 CSDA CBFB CD151 EHD2 GUSB GAD1 TNFRSF12A ADAM9 BMP1 CHRNA9 TM9SF1 GDF15 ACSS3 LAMA2 CHST2 VIM DNAJC16 ALX3 EMP1 KIAA0391 SLC35D1 LIF PTX3 HRH1 WWTR1 FOSL1 KLK3 FAM46A SDC1 SKIL HSD17B6 HSPA5 TMEM43 LSS SLC22A18 TRIP10 LMNA RFXANK CLCF1 ACAA1 C7orf42 OSBPL3 ICAM4 MAP3K6 TSPAN4 PLOD1 KANK2 C8orf4 SLC10A3 PIGT PKM2 CASP7 TM9SF4 TMEM214 LRP10 FGF18 BACE2 FKBP11 ICAM3 GMPPA NNMT COL6A1 LGALS3BP OXTR LOC26010 STARD13 DNAJB1 TMED9 SKP2 PTPN14 PVRL2 GM2A PAM SLC19A1 PDIA4 ST3GAL1 NUAK2 EPHB4 ARHGAP29 CAV1 AKR1C3 PMEPA1 MGAT4B CYR61 IL13RA1 FLNA PRKCI IL6ST CALD1 COL6A2 MAFF LY6G6C TMEM184B PTPN9 KIAA0323 RAD21 STAT3 DIRAS3 BCAT1 RABIF SDF4 CD99 ADAMTS1 KDELR1 PRKCE SLC9A1 MCL1 SPAG4 GP9 LOC57228 SLC2A4RG LDHA TRH SLC39A1 MALT1 MT1H MT1P2 TBX10 SLC2A3 MAP1B TMOD3 KDELR3 PTGIR KCNE4 GPC4 SBNO2 Table 18. Cluster E

SMNDC1 ABT1 DNAJA3 IMP3 C15orf44 ITIH3 HYPK HDC C15orf24 NUBP1 ANXA13 TMED3 TMEM208 SYT1 ORC2L GEMIN6 PITX1 OR7A17 ORMDL2 APRT MYBPC3 PSMB8 PRUNE PSMB10 APH1B XRCC4 AKAP8 EDEM2 SMG7 C19orf10 C1orf78 YIPF2 C20orf24 XRCC2 MRPL17 MED8 CENPI MRLC2 DERL2 BNIP2 PPCDC BLVRB S100A6 AHDC1 GNG5 SRRM1 ESR1 GSTK1 ASL PFDN1 GSTP1 NECAP2 SLC35A2 SEPX1 COL14A1 VAMP5 HLA-J PPIE CYBRD1 STK17A TNFRSF11B AZGP1 PPCS GLRX LGTN DCTD TOR3A EMR2 CHCHD8 EIF4E2 HDAC3 TMEM50B MRPL34 VPS45 SLC38A6 TLN1 CYB561D2 MTUS1 RBCK1 Table 19. Cluster F


S100A10 TIMM9 PPP2R3C PSMA6 GNAI1 EPB41L2 C14orf166 ACTR10 TOM1L1 SGSM2 FKBP3 TBC1D20 TINF2 GMPR2 COMMD4 KLHDC2 C14orf156 MBIP GALNT1 AP4S1 PPP2R5C NGDN CHMP4A OAS2 SIP1 THTPA ACBD3 CCNB1IP1 RDH11 Table 20. Cluster G

MMP14 NPM1 MSN RAD21 CAP1 LDHA S100A11 GRN KDELR2 NPC2 GPX1 CALU LAMC1 IQGAP1 ACTB RPL12 CTSB PSMD4 CAGE1 LGALS3BP ACTA2 PDHA1 CD59 SERPING1 ANXA1 CD99 CPE MGAT1 CSDA TERF2IP GNAI3 CSTB PLS3 SRRM1 PKM2 SYPL1 SDC1 MOBKL1B ANXA4 IFITM2 MRCL3 TMEM109 CRTAP LRP10 VIM CTSC DUSP3 CLIC4 DCTD GLB1 SMG7 FER1L3 LPCAT1 SCCPDH SMARCD2 IL13RA1 MCM6 MYO10 ACAA1 WWTR1 BLVRB NNMT LSS SREBF1 PAM ADAM9 SRF COL1A2 PLSCR1 PRSS23 VAMP8 GUSB PLOD2 LYN SERPINE1 RFXANK FBN1 MAPKAPK3 INPP1 SACM1L SERPINA1 FHL2 AP3B1 RHOG S100A4 APRT IL4R UPP1 TLN1 F13A1 ARF6 LMNA SAT1 ELF4 S100A9 CD163 ITPKB ADCY7 RBMS1 WIPI1 AKAP8 CFI PDIA5 CA12 CEBPD KDELR3 MGST2 RARG APOBEC3G SLC17A7 FCER1G FOSL1 INA CD44 UGT1A9 PIN4 ASL DPYD LTBP2 ISG20 FAS TRIM21 CP C11orf80 SLC10A3 TNFRSF11B ICAM3 AHDC1 XRCC4 LAMA2 CLCN4 ESR1 LIF GAD1 FOSL2 PLAU SRPX2 HSD17B6 ROR1 ROM1 REG1B SLC22A4 TDO2 CD1C IL15 CASP1 RGS11 RNASE2 PTX3 TRH HTN1 CPB2 RPL3L RNF40 FGF18 PAX9 GNG5 CASP7 ICAM4 KIAA1012 IL15RA SLC35A2 MR1 TANK KRT83 LILRB2 SERPINH1 ZNF711 PF4V1 CRHR2 MYBPC3 CPVL MALT1 ST3GAL1 RCAN1 OR7A17 CLIC1 CHD3 ANXA2P2 TMED3 LGALS3 STAT3 SH3GLB1 MYD88 HNRPA1P5 TM9SF1 TSPAN4 MRC2 SWAP70 TPM4 TNFSF13 RAB27A VPS8 OSBPL3 CYB561D2 SERBP1 SLC35D1 CLEC2B RIPK1 CCNO CTSZ C7orf54 EFNA3 DYRK3 LGR5 C8orf71 MXD4 PLAUR FCGR2B HSF4 85

FCGR2C ANXA2P3 PAPPA2 HSPA5 MTUS1 MAP1B IMP4 KIAA0194 EXOC3 KIAA0776 FTL DNAJC16 KIAA0564 NTAN1 STARD13 tcag7.1314 MFHAS1 ZMYM6 NAT1 SFT2D2 APOBEC3F SOD2 C2orf12 LOC391020 CCL2 TMBIM1 NAMPT GSTK1 PIGT SLC39A1 YKT6 GALNT2 TMEM43 C20orf24 BACE2 FAM129A SQRDL C7orf42 RBM47 GMPPA FXYD5 REXO2 EDEM2 DERL2 TNFRSF12A STEAP3 CDCP1 FLJ22662 C8orf4 ORMDL2 FNDC3B DRAM LEPREL1 CCDC109B DSE CTBS C1RL FKBP11 CKLF TWSG1 MAP3K6 C1GALT1C1 TREM1 CLCF1 COPZ2 C9orf167 RAB20 TTC26 SPAG4 CLEC5A ZMYM6 C5AR1 TBC1D19 DNAJC22 TXNDC15 TMOD3 ADPGK NUAK2 TMEM49 APH1B C21orf7 WBSCR16 TXNDC5 SH3BGRL3 SOAT1 FAM26B ACOT9 F11R COL5A2 C19orf10 DNAJC10 CSGlcA-T LASS2 Table 21. Cluster H

CKB CCND2 ARCN1 SPCS2 CUL4A ZFR SLC20A2 ARNT2 GALNT3 RICS LPHN1 PHKG2 ATP2B2 MTMR9 APC2 CSPG5 BAI3 G3BP2 NTRK3 ABAT DPP6 SEZ6L SRI UPF1 KIF1B HIP1R NAG18 PHLPP RTN4 ATP9A CLIP3 ASTN1 C11orf2 SLC22A17 PNMAL1 SOBP SLC24A3 C1orf21 SIRT3 Table 22. Cluster I

GZMH RPL7A RPL36A RECQL5 MTMR1 dJ507I15.1 CAMK4 RPS17L4 SLC44A5 RPL29 RPLP0-like RPL17 MAP3K7IP2 ACVR1 RPL30 XDH ITPKC RPS4X PCDHGA8 RPS4L AFM RPL13 RPS15A RPS25 RPL13A RPS27A RPL19 CRYBA4 TNFRSF1B RPL10L RPS23 MS4A1 Table 23. Cluster J

CPSF1 ABCC5 CCL14 HERC1 CHEK1 HERC2 DENND4B CDON PHIP TTBK2 MYO9A SIN3B SPTAN1 ANKRD12 CPT1B SPINK2 KIAA0528 SNAPC4 CATSPER2 ZNF236 GSK3A KIAA1109 GPR35 RPAP2 EZH1 USP48 SEC31B N4BP2L2 DUSP8 KIF13A MGEA5 COL13A1 OTUD3 TRPV1

Table 24. Cluster K

GSTP1 SRRM1 SMG7 BLVRB APRT SYT1 ORC2L AHDC1 XRCC4 ESR1 GLRX GNG5 SLC35A2 CENPI XRCC2 EMR2 MYBPC3 PITX1 BNIP2 AZGP1 COL14A1 MED8 VAMP5 S100A6 GSTK1 C20orf24 SEPX1 LGTN DERL2 PPCS TOR3A PPCDC YIPF2 GEMIN6 C1orf78 NECAP2 APH1B MRLC2 TMEM208 MRPL34 C19orf10 MRPL17 Table 25. Cluster L


Appendix B: Codes/Programs

All programs are coded in MATLAB 2009b.
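
For orientation, the sketch below illustrates how the programs in this appendix could be chained into the analysis pipeline described in the thesis; the input file names, the loading routines, and all parameter values are placeholders and would have to be adapted to the actual data set.

% Illustrative pipeline sketch; file names and parameter values are placeholders.
expval   = dlmread('expression_matrix.txt');        % genes X samples (assumed file format)
patients = textread('patient_barcodes.txt', '%s');  % one TCGA barcode per sample (assumed)

corrs  = abs(spearman_corr(expval));                % gene-gene Spearman correlations (B.7)
thresh = adapthresh(corrs, 0.5, 1, 0.01, 0.8);      % scale-free topology threshold (B.11)
graph  = corrs >= thresh;                           % thresholded co-expression network

V        = 1:size(expval,1);
clusters = eigen_cut(expval, corrs, 1, graph, thresh, V, 20, 1, 1);  % EigenCut (B.3)
clusters = merge_clusters(clusters, 0.5);           % filtering stage (B.10), used with overlap = 1

[Ind1 Ind2] = kaplan_meijer_curves(clusters, expval, patients, 'gbm_cluster_');  % survival analysis (B.5)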

B.1 K-Core program

% The function returns the vertices of the k-core and their corresponding
% core values
% graph       : adjacency matrix of the graph
% core_thresh : threshold to select K
function [core_vert cores] = kcore(graph,core_thresh)
[m m] = size(graph);
deg = sum(graph);                  % degrees
cores = zeros(1,m);                % store core numbers for all vertices

[sdg ix] = sort(deg);              % sort degrees
vert = ix;                         % vertex indices in order of degree
while numel(sdg) > 1               % while there are more than one element in the queue
    si = vert(1);                  % get vertex with lowest degree
    cores(si) = sdg(1);            % assign core number
    list = find(graph(si,:));
    for j=1:numel(list)            % modify degree of adjacent vertices
        v = list(j);
        id = (vert == v);
        if sdg(id) > sdg(1)
            sdg(id) = sdg(id)-1;
        end
    end
    sdg = sdg(2:end);              % remove lowest degree vertex from the queue
    vert = vert(2:end);
    [sdg ix] = sort(sdg);
    vert = vert(ix);
end
si = vert(1);
cores(si) = sdg(1);

core_vert = [];                    % list of all vertices in the k-core
thresh = max(cores)*core_thresh;
for i=1:numel(cores)
    if cores(i) >= thresh
        core_vert = [core_vert i];
    end
end
end
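
As a small illustration of the calling convention, the toy invocation below extracts an approximate core from an arbitrary example graph; the adjacency matrix and the 0.8 threshold are placeholders.

% Toy graph: a 5-clique (vertices 1-5) with a short tail (vertices 6-7).
adj = zeros(7);
adj(1:5,1:5) = 1 - eye(5);
adj(5,6) = 1; adj(6,5) = 1;
adj(6,7) = 1; adj(7,6) = 1;
[core_vert cores] = kcore(adj, 0.8);   % keep vertices with core number >= 0.8 * max core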

B.2 PruneCut Program

% finds cluster by the Prune Cut algorithm % outputs a cluster set by recursively applying prunecut algorithm on % subgraphs starting with the whole graph % inputs : % graph_mat: adjacency matrix repersentation of the graph % graph_wts: weights of the edges, for un-weighted networks use the % graph_mat as the graph_wts % Vert: vertex list in the order of the graph_mat matrix indices % nv: no of nodes in the input graph in the present interation % modv: no of nodes in the initial graph % alpha: size parameter to control recursion % gamma: size parameter to choose the cut function cluster = PruneCut(graph_mat,graph_wts,Vert,nv,modv,alpha,gamma) %store the matrix and wts for usage after the cut rmat = graph_mat; rwts = graph_wts; degs = sum(graph_mat); cluster = {}; nv V = [1:nv]; min_cut = 1000000; cut_num = 0; cut_cntr = 0; cut_wt = 0; cut_flag = 0; for i=1:length(V) merge_list{i} = []; end cuts = []; tic % MINIMUM CUT while length(V) > 1 A = []; u = V(1); A = [A u]; queue = setxor(A,V);%all vertices not in A queueW = zeros(1,numel(queue)); alist = find(graph_mat(u,:)); [c ia ib] = intersect(queue,alist); queueW(ia) = graph_wts(u,alist);


[queueW indx] = sort(queueW,'descend'); %sort queue weights in descending order queue = queue(indx); % reoder queue by descending weights % MINIMUM CUT PHASE while numel(setxor(A,V))~=0 % while A ~= V v = queue(1); A = [A v]; % add vertex to A %update the queue weights of vertices connected to v in the queue alist = find(graph_mat(v,:)); cut_wt = queueW(1); if length(queue) > 1 queue = queue(2:end); % remove v from queue queueW = queueW(2:end); [c ia ib] = intersect(queue,alist); graph_wts; % debug line t = graph_wts(v,alist(ib)); % debug line t = alist(ib); % debug line queueW(ia) = queueW(ia)+graph_wts(v,alist(ib)); % reorder the queue [queueW indx] = sort(queueW,'descend'); queue = queue(indx); % debug line queueW; % debug line end end cut_cntr = cut_cntr+1; % store the cut v1 = A(end); v2 = A(end-1); tlist1 = [v1 , merge_list{v1}]; tlist2 = []; for ptr = 1:(numel(A)-1) tlist2 = [tlist2,merge_list{A(ptr)},A(ptr)]; end tlist1 = numel(tlist1); tlist2 = numel(unique(tlist2)); if min(tlist1,tlist2)>= 0.03*nv cuts = [cuts,cut_wt]; end if min(tlist1,tlist2) >= gamma*nv && min_cut > cut_wt min_cut = cut_wt; cut_num = cut_cntr; cut_list = merge_list; cutv1 = v1; cutv2 = v2; cut_flag = 1; end %update the weights of v2 and delete v1 from matrix list1 = find(graph_mat(v1,:)); list2 = find(graph_mat(v2,:)); [c ia ib] = intersect(list1,list2); t = list2(ib); % debug line t = list1(ia); % debug line


graph_wts(list2(ib),v2) = graph_wts(list2(ib),v2) + graph_wts(list1(ia),v1); graph_wts(v2,list2(ib)) = graph_wts(v2,list2(ib)) + graph_wts(v1,list1(ia)); graph_wts; % debug line graph_mat(v1,:) = 0; graph_mat(:,v1) = 0; idx = V == v1; V = V(~idx); %store the merge merge_list{v2} = [merge_list{v2} ,v1, merge_list{v1}]; merge_list{v1} = []; end toc if cut_flag % the partition cutv1; % debug line cut_list{cutv1}; % debug line cut_vert = [cutv1,cut_list{cutv1}]; nv2 = numel(cut_vert); nv1 = nv-nv2;

m = nv; labels = zeros(1,m); labels(cut_vert) = 1; idx = labels > 0; mat1 = rmat(~idx,~idx); mat2 = rmat(idx,idx); rwt1 = rwts(~idx,~idx); rwt2 = rwts(idx,idx); map1 = Vert(find(~idx)); map2 = Vert(find(idx));

%subgraph 1 if nv1 <= alpha*modv % if size of partition is small C = map1; cluster = [cluster, C]; % add the cluster to list of clusters else C = mincut(mat1,rwt1,map1,nv1,modv,alpha,gamma); % cut the cluster further cluster = [cluster, C]; end

%subgraph 2 if nv2 <= alpha*modv C = map2; cluster = [cluster, C]; else C = mincut(mat2,rwt2,map2,nv2,modv,alpha,gamma); cluster = [cluster, C]; end else 91

cluster = [cluster Vert]; end end
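
A possible call on a random test graph is sketched below; the alpha and gamma values are arbitrary placeholders. Note that the recursive calls inside the function are written as mincut(...), so in practice the routine is presumably saved as mincut.m, or those calls are renamed to PruneCut.

% Random unweighted test graph; for an unweighted network the adjacency matrix
% is also passed as the weight matrix, as noted in the header comments above.
graph_mat = double(rand(50) > 0.85);
graph_mat = triu(graph_mat,1);
graph_mat = graph_mat + graph_mat';
graph_wts = graph_mat;
Vert = 1:size(graph_mat,1);
clusters = PruneCut(graph_mat, graph_wts, Vert, numel(Vert), numel(Vert), 0.2, 0.1);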

B.3 EigenCut Program

% expval   = expression values ; genes X samples
% corrs    = correlation matrix ; absolute values
% absl     = 1 if corrs are absoluted else 0
% graph    = thresholded graph
% thresh   = cut threshold
% V        = vertex list
% cl_size  = minimum cluster size
% overlap  = 1 if need overlapping clusters else 0
% hub      = 1 if seed-selection is next-available-hub, 0 for next-available-higher index
function clusters = eigen_cut(expval,corrs,absl,graph,thresh,V,cl_size,overlap,hub)
if overlap
    tgraph = graph;
    tcorrs = corrs;
end
clusters = {};                     % stores clusters
A = [];
[m n] = size(expval);
Vdeg = sum(graph,1);
if hub                             % sort vertices in decreasing order of degree
    [Vdeg ia] = sort(Vdeg,'descend');
    V = V(ia);
end
while ~isempty(V)

    if hub && max(Vdeg) == 0
        break;
    end

    while isempty(find(corrs(V(1),:) >= thresh))   % delete vertices that cannot yield clusters
        if numel(V) == 1
            V = [];
            break;
        end
        V = V(2:end);
        Vdeg = Vdeg(2:end);
    end

    if isempty(V)
        break;
    end

    v = V(1);                      % select a vertex to start expanding; can select based on the degree of the vertex
    A = [A v];                     % add to A

    while ~isempty(A)

        if max(corrs(v,:)) >= thresh                    % if max corr is greater than threshold
            indx = find(corrs(v,:) == max(corrs(v,:))); % find maximum correlated vertices
            u = indx(1);                                % max corr vertex
            corrs(v,u) = 0;                             % delete the edges from
            corrs(u,v) = 0;                             % graph and corrs
            graph(v,u) = 0;
            graph(u,v) = 0;
            A = [A u];                                  % add vertex u to the cluster

            % u is added to v, and edges incident on u are replaced by
            % edges incident on v
            eigenexp = pca(expval(A,:));                % get eigen gene for cluster A

            % Update graph and corrs - update the neighbors of u
            list = find(graph(u,:));                    % find neighbors of u
            corrs(u,:) = 0;                             % delete u
            corrs(:,u) = 0;
            graph(u,:) = 0;
            graph(:,u) = 0;
            list = [list find(graph(v,:))];             % add neighbors of v to list for recalculation of edge-weights

            % update correlation values for neighbors of u
            exp = zeros(numel(list)+1,n);
            exp(1,:) = eigenexp;                        % cluster eigen exp
            exp(2:end,:) = expval(list,:);              % exp of neighbors(list)
            eigen_corrs = eigen_spearman(exp);          % get correlations

            if absl
                eigen_corrs = abs(eigen_corrs);
            end
            clear exp;

            corrs(v,list) = eigen_corrs;                % update corrs between vertex
            corrs(list,v) = eigen_corrs';               % v and neighbors of vertex u

            graph(v,list) = (eigen_corrs >= thresh);    % add edges from v to neighbors of u
            graph(list,v) = (eigen_corrs >= thresh)';

        else                                            % if there are no more edges above threshold
            clusters = [clusters A];

            if overlap
                graph = tgraph;                         % load original graph
                corrs = tcorrs;                         % load original edge-weights
                [V ia] = setdiff(V,A);                  % delete A from V
                Vdeg = Vdeg(ia);
            else
                [V ia] = setdiff(V,A);                  % delete A from V
                Vdeg = Vdeg(ia);
            end
            if hub                                      % sort by degree
                [Vdeg ia] = sort(Vdeg,'descend');
                V = V(ia);
            end
            A = [];
        end
    end
end

% return only those clusters above cl_size
k = 0;
gcluster = {};
for i=1:numel(clusters)
    if numel(clusters{i}) >= cl_size
        k = k+1;
        gcluster{k} = clusters{i};
    end
end
clusters = gcluster;
clear gcluster;
end
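
To make the calling convention concrete, a placeholder invocation is shown below; the random matrix only stands in for the genes-by-samples expression data, and the threshold, minimum cluster size, and flag settings are examples rather than the values used in the thesis.

expval = randn(200, 40);                  % placeholder for the genes X samples expression matrix
corrs  = abs(spearman_corr(expval));      % absolute Spearman correlations (B.7)
thresh = 0.7;                             % placeholder; adapthresh (B.11) selects this in practice
graph  = corrs >= thresh;                 % thresholded co-expression network
V      = 1:size(expval,1);
clusters = eigen_cut(expval, corrs, 1, graph, thresh, V, 20, 0, 1);  % non-overlapping, hub seeding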

B.4 EigenCut multimerge Program

% expval = expression values ; genes X samples % corrs = correlation matrix ; absolute values % absl = 1 if corrs are absoluted else 0 % graph = thresholded graph ; % thresh = cut threshold % V = vertex list % cl_size = minimum cluster size % overlap = 1 if need overlapping clusters else 0 94

% hub = 1 if seed-selection is next-available-hub , 0 for % next-available-higher index function clusters = eigen_cut_multimerge(expval,corrs,absl,graph,thresh,V,cl_size,overlap,h ub) tic if overlap tgraph = graph; tcorrs = corrs; end clusters = {}; A = []; [m n] = size(expval); Vdeg = sum(graph,1); if hub [Vdeg ia] = sort(Vdeg,'descend'); V = V(ia); end while ~isempty(V) if hub && max(Vdeg) == 0 break; end if isempty(V) break; end while isempty(find(corrs(V(1),:) >= thresh)) if numel(V) == 1 && isempty(find(corrs(V(1),:) >= thresh)) V=[]; break; end V = V(2:end); Vdeg = Vdeg(2:end); end if isempty(V) break; end

v = V(1); % select a vertex to start expanding ; can select based on the degree of the vertex A = [A v]; % add to A

while ~isempty(A)

if max(corrs(v,:)) >= thresh % if max corr is greater than threshold t =( thresh+1)/2; % find a threshold value midway between thresh and 1 if ~isempty(find(corrs(v,:) >= t)) % if correlations above t, then combine all of those vertices


u = find(corrs(v,:) >= t); % find maximum correlated vertices else u = find(corrs(v,:) >= thresh); % find maximum correlated vertices u = u(1); % find the max corr vertex end

corrs(v,u) = 0; % delete the edges from corrs(u,v) = 0; % graph and corrs graph(v,u) = 0; graph(u,v) = 0; A = [A u]; % add vertex u to the cluster

% u is added to v, and edges incident on u are replaced by % edges incident on v eigenexp = pca(expval(A,:)); % get eigen gene for cluster A

% Update graph and corrs - update the neighbors of u list = []; for k=1:numel(u) list = [list find(graph(u(k),:))]; % find neighbors of u end

%list = unique(list); corrs(list,:) = 0; % delete list corrs(:,list) = 0; graph(list,:) = 0; graph(:,list) = 0; list=[list find(graph(v,:))]; list = unique(list); % update correlation values for neigbors of u exp = zeros(numel(list)+1,n); exp(1,:) = eigenexp; % cluster eigen exp exp(2:end,:) = expval(list,:); % exp of neighbors(list) eigen_corrs = eigen_spearman(exp); % get correlations if absl eigen_corrs = abs(eigen_corrs); end clear exp; corrs(v,list) = eigen_corrs; % update corrs between vertex corrs(list,v) = eigen_corrs'; % v and neighbors of vertex u

graph(v,list) = (eigen_corrs >= thresh); % add edges from v to neighbors of u graph(list,v) = (eigen_corrs >= thresh)';

else % if there are no more edges above threshold clusters = [clusters unique(A)];

if overlap graph = tgraph;% load original graph 96

corrs = tcorrs;% load original edge-weights [V ia] = setdiff(V,A); % delete A from V Vdeg = Vdeg(ia); else [V ia] = setdiff(V,A); Vdeg = Vdeg(ia); end if hub % sort by degree [Vdeg ia] = sort(Vdeg,'descend'); V = V(ia); end A = []; end end end k=0; gcluster = {}; for i=1:numel(clusters) if numel(clusters{i}) >= cl_size k=k+1; gcluster{k} = clusters{i}; end end clusters = gcluster; clear gcluster; toc end

B.5 Kaplan-Meier curves plotting program

% finds kaplan meier curvesa nd also does logrank test % input : % clusters: the cluster set % exp: exp values of genes in the cluster : gene X patients format % patients: patient set % fname: filename to output survival curves % output : % Ind1 and Ind2 contain indices of two sets of patients function [Ind1 Ind2] = kaplan_meijer_curves(clusters,exp,patients,fname)

% read survival data [FileName,PathName] = uigetfile('*.txt','Select the survival data file'); fid = fopen([PathName,FileName],'r'); reqd_indx = [];

97 line = fgetl(fid); parsed = textscan(line,'%s','Delimiter','\t'); parsed = parsed{1}; % get the indices/ column numbers of the required fields reqd_indx = [reqd_indx find(strcmpi(parsed,'BCRPATIENTBARCODE'))]; reqd_indx = [reqd_indx find(strcmpi(parsed,'VITALSTATUS'))]; reqd_indx = [reqd_indx find(strcmpi(parsed,'DAYSTODEATH'))]; reqd_indx = [reqd_indx find(strcmpi(parsed,'DAYSTOLASTFOLLOWUP'))]; clinical_patients = {}; % patients vstatus = {}; % living status ttt = []; % time to treatment line = fgetl(fid); while line~=-1 parsed = textscan(line,'%s','Delimiter','\t'); parsed = parsed{1}; parsed = parsed(reqd_indx); if strcmpi(parsed{2},'null') == 0 clinical_patients = [clinical_patients parsed{1}]; vstatus = [vstatus parsed{2}]; if strcmpi(parsed{2},'LIVING') == 0 ttt = [ttt str2num(parsed{3})]; else ttt = [ttt str2num(parsed{4})]; end end line = fgetl(fid); end fclose(fid);

% filter patients; get data only for patients in the in the input argument [compat ia ib] = intersect(clinical_patients,patients); vstatus = vstatus(ia); ttt = ttt(ia); compat_exp = exp(:,ib);

% get time to treatment values treatVec = strcmpi(vstatus,'LIVING');

Cidx = {}; Ind1 = {}; Ind2 = {}; for i=1:numel(clusters) list = clusters{i}; cl_exp = compat_exp(list,:); cl_exp = cl_exp'; % kmeans to be done on patients, so transpose the expression data [cidx, ctrs] = kmeans(cl_exp, 2, 'dist','corr', 'rep',100,... 'disp','off'); Cidx = [Cidx cidx];


ind1 = find(cidx == 1); ind2 = find(cidx == 2); Ind1 = [Ind1 ind1]; Ind2 = [Ind2 ind2]; survivalEvent1 = treatVec(ind1); survivalEvent2 = treatVec(ind2); [f1, x1] = ecdf(ttt(ind1), 'function', 'survivor', 'censoring', survivalEvent1); [f2, x2] = ecdf(ttt(ind2), 'function', 'survivor', 'censoring', survivalEvent2); pS = logrank([ttt(ind1); survivalEvent1], [ttt(ind2); survivalEvent2], 2); scrsz = get(0,'ScreenSize'); figure(i); set(gcf, 'Position', [200 200 scrsz(4)*2/3 scrsz(4)*2/3]); set(gcf, 'color', [1 1 1]); stairs(x1,f1,'LineWidth',2); hold on; stairs(x2,f2,'r-', 'LineWidth',2); axis([0 max(ttt)*1.1 0 1]); hold on; plot([x1(end), max(ttt(ind1))], [f1(end), f1(end)], 'b-', 'LineWidth',2); hold on; plot([x2(end), max(ttt(ind2))], [f2(end), f2(end)], 'r-', 'LineWidth',2); p = title(strcat('Kaplan-Meier Curves, Cluster ',num2str(i) , ', Log-rank test p = ', num2str(pS))); set(p, 'FontSize', 18); p = xlabel('Time to Death (days)'); set(p, 'FontSize', 18); p = ylabel('Ratio'); set(p, 'FontSize', 18); xname = 'Patient Set A'; yname = 'Patient Set B'; legend(xname,yname); filename = ['kaplancurves\' fname num2str(i) '.jpg'] saveas(i,filename); close(i); end end
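
A placeholder call is shown below; clusters, expval, and patients are assumed to come from the earlier steps of the pipeline, and the patient barcodes must match those in the TCGA clinical file, which the function asks for through a file dialog. The figures are written into a kaplancurves subfolder, so that folder has to exist.

if ~exist('kaplancurves', 'dir')
    mkdir('kaplancurves');               % output folder used by the function above
end
[Ind1 Ind2] = kaplan_meijer_curves(clusters, expval, patients, 'gbm_cluster_');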

B.6 Log Rank test function p=logrank(x,y,show,xname,yname) % Log-rank test(Approximate) % The logrank test is a hypothesis test to compare the survival distributions of two samples. % It is a nonparametric test and appropriate to use when the data are right % censored (technically, the censoring must be non-informative). % % Syntax: p=logrank(x,y,show,xname,yname) 99

% % Input: %x and y - two groups; % show -0:don't show the result % -1(default):show the result % -2:show the graph and the result % xname,yname :name for x and y ,(default):X;Y % Output: p:p value % % % % Example: % m1=[2 3 9 10 10 -12 15 -15 16 -18 -24 30 -36 -40 -45] % m2=[9 -12 16 19 -19 -20 -20 -24 -24 -30 -31 -34 -42 -44 -53 -59 -62] % OR: % m1=[2 3 9 10 10 12 15 15 16 18 24 30 36 40 45;... % 0 0 0 0 0 1 0 1 0 1 1 0 1 1 1] % m2=[9 -12 16 19 -19 -20 -20 -24 -24 -30 -31 -34 -42 -44 -53 -59 - 62;... % 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1] % Elements are 1 for observations that are right-censored and 0 for % observations that are observed exactly. % % logrank(m1,m2,2,'OperA','OperB') % or logrank(m1,m2) % or logrank(m1,m2,0) % Created by foxet -----Fan Lin % [email protected] if nargin<3;show=1;end if length(x(:,1))>1 && length(x(1,:))>1 if length(x(1,:))>2 x=x';y=y';end alx1=x(:,1);alx2=x(:,2);x=alx1.*(1-alx2.*2); aly1=y(:,1);aly2=y(:,2);y=aly1.*(1-aly2.*2); else if length(x(:,1))==1 x=x';y=y';end end x=sortrows(x);y=sortrows(y);lx=length(x);ly=length(y); [bx,ix1,mx1]=unique(x,'first'); [bx,ix2,mx2]=unique(x,'last'); ux=ix2-ix1+1;cx=ux;dx=ux; dx(find(bx<0))=0;cx(find(bx>0))=0; [by,iy1,my1]=unique(y,'first'); [by,iy2,my2]=unique(y,'last'); uy=iy2-iy1+1;cy=uy;dy=uy; dy(find(by<0))=0;cy(find(by>0))=0; datax=[dx cx ux];datay=[dy cy uy]; %dataxy:A table for failure event ,censored value and so on. %dataxy %dataxy(:,1:2):original data %dataxy(:,3):Failure events for x (dx) %dataxy(:,4):Censored values for x %dataxy(:,5):Events(Failure+Censor) for x


%dataxy(:,6):inital number for x (nx) %dataxy(:,7):Failure events for y (dx) %dataxy(:,8):Censored values for y %dataxy(:,9):Events(Failure+Censor) for y %dataxy(:,10):inital number (ny) %dataxy(:,11):Expect T (TX=(dx+dy)*ny/(nx+ny)) %dataxy(:,11):Expect T (TY=(dx+dy)*nyx/(nx+ny)) dataxy=unique([bx;by]); dataxy=sortrows([abs(dataxy) dataxy]); datam=dataxy; dataxy(1,6)=lx;dataxy(1,10)=ly; for i=1:length(dataxy) k=find(bx==dataxy(i,2));k2=find(by==dataxy(i,2)); if isempty(k) dataxy(i,3:5)=[0,0,0]; else dataxy(i,3:5)=datax(k,:); end if isempty(k2) dataxy(i,7:9)=[0,0,0]; else dataxy(i,7:9)=datay(k2,:); end if(i>1) dataxy(i,6)=lx-sum(dataxy(1:i-1,5)); dataxy(i,10)=ly-sum(dataxy(1:i-1,9)); end end n1j=dataxy(:,6);n2j=dataxy(:,10);d1j=dataxy(:,3);d2j=dataxy(:,7); nj=n1j+n2j;dj=d1j+d2j; dataxy(:,11)=dj.*n1j./nj; dataxy(:,12)=dj.*n2j./nj; % just for test dataxy(:,13)=n1j.*n2j.*dj.*(nj-dj)./(nj.^2.*(nj-1)) sumall=nansum(dataxy); U1=sumall(3)-sumall(11);U2=sumall(7)-sumall(12); chi1=U1^2/sumall(11)+U2^2/sumall(12); p=1-cdf('chi2',chi1,1);

%display result if show>0 if nargin<4 xname='X';end if nargin<5 yname='Y';end disp(' ') disp('Summary of the Number of Censored and Uncensored Values') disp(' ') disp(' GROUP Total Failed Censored "%Censored" ' ); fprintf('------\n'); fprintf('%10s %10.0f %10.0f %10.0f %10.2f\n',xname(1:min(10,length(xname))),lx,sumall(3),sumall(4),sumall( 4)*100/lx); 101

fprintf('%10s %10.0f %10.0f %10.0f %10.2f\n',yname(1:min(10,length(yname))),[ly sumall(7) sumall(8) sumall(8)*100/ly]); fprintf('%10s %10.0f %10.0f %10.0f %10.2f\n','Total',[lx+ly sumall(3)+sumall(7) sumall(4)+sumall(8) (sumall(4)+sumall(8))*100/(lx+ly)]);

fprintf('------\n'); fprintf('Chi-square:%3.4f P:%6.4f \n',[chi1,p].'); end if show>1 [f,xx,flo,fup] = ecdf(abs(x),'censoring',(1- x./abs(x))/2,'function','survivor'); [f2,yy,flo2,fup2] = ecdf(abs(y),'censoring',(1- y./abs(y))/2,'function','survivor'); end

Utility functions (functions called by the above functions)

B.7 Spearman Correlation finding function

%% SPEARMAN CORRELATION
% correlation matrix using SPEARMAN correlation
% input:  exp_data - data matrix or expression value matrix : gene X patients
% output: pair-wise correlation matrix
function corrs = spearman_corr(exp_data)
% exp_data is genes X samples
exp_data = exp_data';
[m n] = size(exp_data);
corrs = zeros(n,n);

ranks = zeros(m,n);

for i=1:n
    temp1 = exp_data(:,i);
    temp1 = temp1';
    temp2 = sort(temp1);
    for j=1:m
        idx = find(temp2 == temp1(j));
        idx = sum(idx)/numel(idx);
        ranks(j,i) = idx;
    end
end

for i=1:n
    dij = ranks;
    temp = ranks(:,i);
    for j=1:m
        dij(j,:) = dij(j,:) - temp(j);
    end
    dij2 = dij.^2;
    sdij2 = sum(dij2,1);
    sdij2 = sdij2.*6;
    sdij2 = sdij2/(m*(m^2 - 1));
    rho = 1 - sdij2;
    rho(i) = 0;
    corrs(:,i) = rho';
end
end
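
When the Statistics Toolbox is available, the implementation can be cross-checked against MATLAB's built-in corr function; apart from the diagonal, which spearman_corr sets to zero, and small differences when tied values occur, the two results should agree. The test matrix below is an arbitrary example.

X    = randn(30, 12);                        % toy data: 30 genes X 12 samples
rho1 = spearman_corr(X);
rho2 = corr(X', 'type', 'Spearman');         % built-in Spearman correlation between genes
rho2(1:size(rho2,1)+1:end) = 0;              % zero the diagonal to match spearman_corr
max(abs(rho1(:) - rho2(:)))                  % near zero for tie-free data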

B.8 Principal Component Analysis to find eigen expression profile

% finds pca of the expression vectors input as a matrix of expression
% values
% input :
%   exp: expression values as gene X samples
% output: eigen or representative expression vector
function eigenexp = pca(exp)
exp = exp';                  % now samples X genes
[m n] = size(exp);
eigenexp = zeros(m,1);

% find mean along the patient dimension
mean_exp = mean(exp,1);

% subtract mean from the original data to get mean adjusted data
for j=1:n
    exp(:,j) = exp(:,j) - mean_exp(j);
end

% get covariance matrix
cov_matrix = cov(exp);
% find the eigen decomposition
[v d] = eig(cov_matrix);
% find the largest eigen value
dia = diag(d);
max_eig = find(dia == max(dia));
% get the eigen vector corresponding to maximum eigen value
max_v = v(:,max_eig(1));
max_vt = max_v';

% find the principal component
temp = max_vt*exp';
eigenexp(:,1) = temp';
end
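
The same eigengene can be obtained, up to an arbitrary sign, from the singular value decomposition of the mean-centred data without forming the covariance matrix explicitly; the snippet below is an equivalent sketch on toy data rather than the code used in the thesis.

X  = randn(25, 15);                     % toy data in the same genes X samples format
Xc = bsxfun(@minus, X', mean(X', 1));   % samples X genes, column-centred
[U S V] = svd(Xc, 'econ');
eigenexp_svd = Xc * V(:,1);             % first principal-component scores, one value per sample
% pca(X) above returns the same vector up to sign, i.e.
% max(abs(abs(pca(X)) - abs(eigenexp_svd))) should be near zero.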

B.9 Eigen Spearman function (to update edge weights in EigenCut)

% finds the correlation values between the seed node and its neighbors,
% to be used in the EigenCut algorithm
% similar to the spearman_corr function but finds correlations between the seed
% and its neighbors only, instead of all pair-wise correlations
% input:
%   expression profiles/vectors input as a matrix of expression values with the first
%   row corresponding to the seed node and the following rows corresponding
%   to the neighbors
% output:
%   rho : vector containing correlation values from the seed to its neighbors
function rho = eigen_spearman(exp_data)
exp_data = exp_data';
[m n] = size(exp_data);
ranks = zeros(m,n);
for i=1:n
    temp1 = exp_data(:,i);
    temp1 = temp1';
    temp2 = sort(temp1);
    for j=1:m
        idx = find(temp2 == temp1(j));
        idx = sum(idx)/numel(idx);
        ranks(j,i) = idx;
    end
end
i = 1;                               % the seed node is the first column
dij = ranks;
temp = ranks(:,i);
for j=1:m
    dij(j,:) = dij(j,:) - temp(j);
end
dij2 = dij.^2;
sdij2 = sum(dij2,1);
sdij2 = sdij2.*6;
sdij2 = sdij2/(m*(m^2 - 1));
rho = 1 - sdij2;
rho = rho(2:end);                    % drop the seed's self-correlation
end

B.10 Merge clusters (filtering stage after EigenCut with overlap)

% merges clusters based on the input overlap threshold values
% merges all those clusters with overlap greater than thresh
% inputs are the clusters set and the threshold value
% outputs the final clusters and the overlap between these clusters
function [final_cl olap] = merge_clusters(clusters, thresh)
k = numel(clusters);
olap = zeros(k);
for i=1:k
    for j=1:k
        m = min(numel(clusters{i}),numel(clusters{j}));
        olap(i,j) = numel(intersect(clusters{i},clusters{j}))/m;
    end
    olap(i,i) = 0;
end
cntr = 0;
while max(max(olap)) > thresh
    cntr = cntr+1;
    idx = find(olap == max(max(olap)));
    idx = idx(1);
    [r c] = ind2sub(size(olap), idx);   % row/column of the most overlapping pair

    clusters{r} = unique(union(clusters{r},clusters{c}));
    clusters{c} = [];

    for i=1:k
        olap(r,i) = numel(intersect(clusters{r},clusters{i}))/min(numel(clusters{r}),numel(clusters{i}));
        olap(i,r) = olap(r,i);
    end
    olap(r,r) = 0;
    olap(c,:) = 0;
    olap(:,c) = 0;
end
final_cl = {};
for i=1:k
    if ~isempty(clusters{i})
        final_cl = [final_cl clusters{i}];
    end
end

% recompute the overlap matrix for the final cluster set
k = numel(final_cl);
olap = zeros(k);
for i=1:k
    for j=1:k
        m = min(numel(final_cl{i}),numel(final_cl{j}));
        olap(i,j) = numel(intersect(final_cl{i},final_cl{j}))/m;
    end
    olap(i,i) = 0;
end
end
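
A small worked example (arbitrary gene indices, overlap threshold 0.5) illustrates the behaviour: the first two clusters share three of four genes (overlap 0.75), so they are merged, while the third cluster is left untouched.

clusters = { [1 2 3 4 5], [3 4 5 6], [10 11 12] };   % toy clusters of gene indices
[final_cl olap] = merge_clusters(clusters, 0.5);     % final_cl = { [1 2 3 4 5 6], [10 11 12] }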

B.11 Adaptive threshold calculation

%% ADAPTIVE THRESHOLD % corrs = correlation matrix % min_thresh = minimum threshold to start calculating R2; usually 0.1 for % non-absoluted correlations and 0.5 for absoluted correlations % max_thresh = minimum threshold to start calculating R2; usually 1 % R2 value cut-off % Returns the threshold value that maximizes the scale-free topology function thresh = adapthresh(corrs,min_thresh,max_thresh,inc,R2value) mt = min_thresh; [m n] = size(corrs); incs = (max_thresh - min_thresh)/inc; % # increments i=0; R2 = []; while min_thresh < max_thresh idx = corrs >= min_thresh; % find those that are above the threshold i=i+1; temp = sum(idx,2); % sum columns to get degrees temp3 = unique(temp); % for each unique degree temp3 = temp3(temp3 ~= 0); temp2 = []; for j=1:numel(temp3) idx = find(temp == temp3(j)); % find occurence of the degree temp2 = [temp2 numel(idx)]; % add the occurence count into histogram end histo = temp2/(m-1); % store the probabilities loghisto = log(histo); clear histo; logdeg = log(temp3);

%% LINEAR REGRESSION

n=numel(logdeg); xav=sum(logdeg)/n; 106

yav=sum(loghisto)/n; Sxy=0; Sxx=0; for i=1:n Sxy=Sxy +logdeg(i)*loghisto(i)-xav*yav; Sxx=Sxx + (logdeg(i))^2-xav^2; end

a1=Sxy/Sxx; a0=yav-a1*xav; Yp=zeros(1,floor(max(logdeg)+1)); for i=0:floor(max(logdeg)+1) Yp(i+1)=a0+a1*i; end Xp=(0:floor(max(logdeg)+1)); Yp;

Sr=0; St=0; for i=1:n Sr=Sr+(loghisto(i)-a0-a1*logdeg(i))^2; St=St+(loghisto(i)-yav)^2; end R2=[R2 (St-Sr)/St];

min_thresh = min_thresh+inc; % increment min_thresh by inc end R2; for i=1:(numel(R2)-1) if R2(i) >= R2value && R2(i+1) >=R2value thresh = mt + inc*(i-1); break; end end end
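
A typical call with absolute correlations is sketched below, assuming corrs is the absolute Spearman correlation matrix from B.7; the parameter values follow the ranges suggested in the header comments. Note that thresh is only assigned once two consecutive R^2 values reach the cut-off, so the R2value argument may need to be lowered for noisy data.

corrs  = abs(spearman_corr(expval));            % expval: genes X samples expression matrix
thresh = adapthresh(corrs, 0.5, 1, 0.01, 0.8);  % scan 0.5:0.01:1, accept R^2 >= 0.8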

B.12 Depth first search to find components

% input: adjacency representation of a graph n X n
% output: components - n length array holding the component numbers for the
% vertices
function components = dfs_nr(graph)
[m m] = size(graph);
components = zeros(1,m);            % hold component numbers for vertices
k = 0;
vert_stack = [];                    % stack of vertices
idx = find(components == 0);

while ~isempty(idx)
    k = k+1;
    vert = idx(1);                  % get a vertex not assigned to any component
    if components(vert) == 0
        components(vert) = k;
        list = find(graph(vert,:));                 % adjacent vertices
        vert_stack = [vert_stack list];             % add adjacent vertices into stack
        [vert_stack_temp ids] = unique(vert_stack);
        vert_stack = vert_stack(ids);
        clear vert_stack_temp;
    end
    while ~isempty(vert_stack)      % while the stack is not empty
        vert = vert_stack(end);
        vert_stack = vert_stack(1:end-1);
        if components(vert) == 0
            components(vert) = k;
            list = find(graph(vert,:));
            vert_stack = [vert_stack list];
            [vert_stack_temp ids] = unique(vert_stack);
            vert_stack = vert_stack(ids);
            clear vert_stack_temp;
        end
    end
    idx = find(components == 0);    % check if there are vertices not assigned to any component
end
end
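
A toy example (arbitrary graph) illustrates the output; component labels are assigned in the order in which unvisited vertices are encountered.

g = zeros(5);
g(1,2) = 1; g(2,3) = 1; g(1,3) = 1;    % triangle on vertices 1-3
g(4,5) = 1;                            % separate edge on vertices 4-5
g = g + g';                            % symmetric adjacency matrix
components = dfs_nr(g)                 % returns [1 1 1 2 2]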

B.13 Eigen network builder

% expval1 and expval2 are in patient X /gene format % finds an eigen network between the input clusters % inputs are the two clusters and the corresponding expression values, % patient list and the threshold to create eigen -network % outputs the network in rho and the correlation values in eigencorr function [rho eigencorr] = build_eigennet(cluster1, expval1, cluster2, expval2, thresh)

% dataset 1 [m1 n1] = size(expval1); [m2 n2] = size(expval2); if n1~=n2 warning('error: sample sizes of expression values do not match'); exit; 108 end eigen1 = zeros(n1,numel(cluster1)); for i=1:numel(cluster1) list = unique(cluster1{i}); exp1 = expval1(:,list);

% find mean along the patient dimension mean_exp = mean(exp1,1); [m n] = size(exp1);

%subtract mean from the original data to get mean adjusted data for j=1:n exp1(:,j) = exp1(:,j) - mean_exp(j); end

%get covariance matrix cov_matrix = cov(exp1); %find the eigen decompostion [v d] = eig(cov_matrix); % find the largest eigen value dia = diag(d); max_eig = find(dia == max(dia)); % get the eigen vector corresponding to maximum eigen value max_v = v(:,max_eig(1)); max_vt = max_v';

% find the principal component temp = max_vt*exp1'; eigen1(:,i) = temp'; end

% Dataset 2 eigen2 = zeros(n2,numel(cluster2)); for i=1:numel(cluster2) list = unique(cluster2{i}); exp2 = expval2(:,list);

% find mean along the patient dimension mean_exp = mean(exp2,1); [m n] = size(exp2);

%subtract mean from the original data to get mean adjusted data for j=1:n exp2(:,j) = exp2(:,j) - mean_exp(j); end

%get covariance matrix cov_matrix = cov(exp2); %find the eigen decompostion [v d] = eig(cov_matrix); % find the largest eigen value 109

dia = diag(d); max_eig = find(dia == max(dia)); % get the eigen vector corresponding to maximum eigen value max_v = v(:,max_eig(1)); max_vt = max_v';

% find the principal component

temp = max_vt*exp2'; eigen2(:,i) = temp'; end eigencorr = corr(eigen1,eigen2); rho = abs(eigencorr); idx = rho > thresh; rho(idx) = 1; rho(~idx) = 0; end
