EXPLORING CANCER-ASSOCIATED GENES BY NETWORK MINING AND MANAGEMENT

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of

Science in the Graduate School of the Ohio State University

By

Kewei Lu, B.E.

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Thesis Committee:

Dr. Kun Huang, Advisor

Dr. Raghu Machiraju

© Copyright by

Kewei Lu

2011

ABSTRACT

Biomedical data, including cancer-related data, are widely available in the form of networks in which nodes are biomedical objects and edges are relationships between objects. Some of these networks, such as gene-coexpression networks, have nodes of the same type within a network, while others, such as the UMLS network, have various types of nodes interacting with each other within a network. Both kinds of networks contain very rich hidden information beyond the obvious facts conveyed by the original data.

To turn this hidden information into useful knowledge for cancer research, we developed network mining and management methods for these networks. Our network mining method focuses on identifying candidate gene markers by finding dense components in gene-coexpression networks, which have homogeneous nodes. In addition, we propose methods to efficiently manage very large networks. By managing the UMLS network, which is formed from data of various sources, we are able to prioritize genes with respect to cancers and generate hypotheses that are very informative for cancer research.

Dedication

This thesis is dedicated to my family.

ACKNOWLEDGMENTS

I would like to express my gratitude to all those who helped me during the writing of this thesis.

First of all, I would like to thank my advisor Dr. Kun Huang for his patience, encouragement, and professional advice during my graduate study. Without his patient instruction and expert guidance, I could not have made such progress during my graduate study. It has been an honor for me to work with him.

I would like to thank Dr. Raghu Machiraju, a member of my committee, for his valuable help and suggestions.

I am also grateful to Dr. Yang Xiang and Dr. Jie Zhang for their helpful suggestions and support. They always gave me valuable advice whenever I had questions in either graduate study or research.

Special thanks to my friends. Their support and friendship encouraged me a lot.

Last, I would like to thank my beloved family for their love, help, and great confidence in me through all these years.

VITA

2009 ...... B.E. Computer Science and Technology, Wuhan University of Technology, China

2009 to present ...... M.S. Computer Science and Engineering, The Ohio State University

2010 to present ...... Student Assistant, Comprehensive Cancer Center, The Ohio State University

PUBLICATIONS

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract

Dedication

Acknowledgments

Vita

List of Figures

CHAPTER

1 Introduction
1.1 Related work
1.2 Thesis Organization

2 Problem Formulation

3 Gene Co-expression Network Mining
3.1 Data Preprocessing
3.2 Building Gene Co-expression Network
3.3 Mining Subgraphs with Bounded Density
3.4 Weighted Subgraph Pattern Mining for Biomedical Applications
3.5 Discovering Candidate Cancer Biomarkers

4 Network Management
4.1 Handling large networks
4.2 Managing the UMLS network
4.3 Prioritizing Cancer Genes by Managing the UMLS Network

5 Conclusion and Future work

Bibliography

LIST OF FIGURES

3.1 Plot of expression value versus samples for CSN2.

3.2 Plot of expression value versus samples for C17orf99.

3.3 Plot of expression value versus samples for A1BG.

3.4 The distribution of PCC for GSE18864.

3.5 The flow chart of selecting the start edges (k = 0.9). The numbers are the edge weights; red edges have been added to the set of start edges, red and blue edges have been covered, and the other edges have not been covered.

3.6 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI).

3.7 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI LN-POS).

3.8 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI ER-NEG).

3.9 Visualization of the co-expression network for the gene cluster of NKI with a smaller p-value than vant Veer 70 Genes.

3.10 Visualization of the co-expression network for the gene cluster of NKI LN-POS with a smaller p-value than vant Veer 70 Genes.

3.11 Visualization of the co-expression network for the gene cluster of NKI ER-NEG with a smaller p-value than vant Veer 70 Genes.

4.1 The flow chart of localized 2-hop.

4.2 Illustration of greedily selecting a batch of vertices.

4.3 Distance query of localized 2-hop vs. BFS; the green line is localized 2-hop and the blue line is BFS.

4.4 Construction time of localized 2-hop for graphs of different sizes.

4.5 Label size of localized 2-hop for graphs of different sizes.

4.6 An illustration of distance and path queries.

4.7 Number of labels and total label size for different kDLS broadcast ranges.

CHAPTER 1

INTRODUCTION

A large portion of cancer-related biomedical data is aggregated in the form of networks, also known as graphs. In these networks, nodes typically represent biomedical objects, and an edge may exist between two nodes to represent a relationship between the corresponding objects. Some of these networks, such as gene-coexpression networks, have nodes of the same type within a network, while others, such as the UMLS network, have various types of nodes interacting with each other within a network. In this thesis,

I name the former networks “homogeneous networks”, and the latter “heterogeneous networks”. Both kinds of networks contain very rich hidden information beyond the obvious facts conveyed by the original data.

Given this form of data, efficiently mining and managing them becomes an important application for areas such as algorithmic graph theory, graph mining, and graph databases. Dense component mining algorithms for gene-coexpression networks, a typical type of homogeneous network, often lead to the discovery of gene clusters that are candidate biomarkers. Graph indexing schemes on heterogeneous networks enable knowledge discovery via efficient graph queries such as reachability, distance, and path queries. In the next section, I review the related techniques for mining and managing these networks.

1.1 Related work

In many networks, dense components themselves are important clusters. Detecting dense components is a nontrivial task for these networks. The problem seems straightforward, but even the simplest version of it is NP-hard: it is NP-hard to find a maximum clique in an undirected graph [1]. On the other hand, managing or indexing a network for answering reachability, distance, and shortest path queries has polynomial-time solutions. However, the availability of polynomial-time solutions by no means implies that indexing is not challenging. On the contrary, it is practically impossible to use brute-force methods to efficiently manage large networks. When the network size is very large, as in some biomedical applications, it becomes a problem even for most of the latest methods.

Clique and Quasi-Clique Mining Although listing all maximal cliques is NP-hard, it is still possible to efficiently list them in some cases involving small and sparse graphs. A classical algorithm for enumerating all maximal cliques is proposed in [2] and redescribed in [3]. However, the definition of a clique is so strict that it does not cover many cases arising in real data, especially in the presence of noise. Therefore, a good number of works focus on finding quasi-cliques instead [4, 5, 6]. Though the detailed definition of a quasi-clique varies across the literature, a quasi-clique is often considered a generalization of a clique. Thus the concept of a quasi-clique is better for modeling dense components in real networks.

Biclique and Frequent Itemset Mining A bipartite network (or bipartite graph) contains two types of nodes, with edges connecting nodes of different types. The concept of a bipartite graph sits between homogeneity and heterogeneity. Such networks are also available in biomedical data, such as gene expression data where rows are genes and columns are patients (or vice versa). A dense component cluster is a Cartesian product between two different types of vertices. These Cartesian products often go by other names in different works, such as tiles [7, 8], hyperrectangles [9], and blocks [10], and corresponding mining algorithms for dense components have been proposed in those works. The biclique mining problem connects nicely to the closed frequent itemset mining problem. This connection has been used for maximal biclique generation in [11] for effective knowledge discovery in (0,1)-matrices. However, to the best of our knowledge, similar connections are not available for clique or quasi-clique mining, so we cannot ease our mining tasks by borrowing techniques from frequent itemset mining. Nevertheless, our method for finding dense components in homogeneous networks provides new insight into the biclique mining problem in bipartite graphs.

Weighted Dense Subgraph Mining Most available works consider only unweighted graphs, in which edges have no weights (or, equivalently, unit weights). To use the methods proposed in these works on biomedical data, people often convert edge weights in the real data into 0 or 1 by setting a threshold. Although this conversion works, it discards a significant portion of the original information. To fully utilize the edge weight information of a weighted graph, Ou and Zhang [12] propose a method to mine weighted dense subgraphs from a weighted graph, with density guarantees for the resulting dense subgraphs. Since this work fits gene-coexpression data, we derive our gene-coexpression network mining solution from it, with special consideration for cancer research.

Reachability Indexing Building a reachability index for a directed network is a basic network management. In graph theory, the reachability problem is to answer whether we can reach a target vertex from a source vertex in a directed graph. The major challenge of this problem is how to use limited index size and construction time to enable fast query. In recent years, quite a few indexing schemes have been proposed to efficiently answer reachability in a large directed graph [13, 14, 15, 16].

Most of these approaches aim at building a compact data structure for the network to facilitate reachability queries.

Distance and Path Indexing Compared to reachability queries, distance and path queries can be applied to both directed and undirected networks and are more difficult to handle. In the past years, a number of approaches have been proposed to answer distance and path queries [17, 18, 19, 20, 21, 22] by efficiently indexing the graph. Most of these approaches handle distance and path queries very well on small and sparse networks. However, none of them offers a satisfying solution for very large and dense networks.

Distance and Routing Labeling Schemes The distance labeling scheme (DLS), initially defined by Peleg [23], is similar to distance indexing. The major difference between a DLS and distance indexing is that the former answers the distance between two vertices by using the labels of those two vertices only, while the latter has no such restriction. Thus, a DLS is suitable for applications in a distributed network environment such as communication networks, while distance indexing is good for applications in a centralized environment such as biomedical network analysis.

The difference between routing labeling scheme (see [24] for a formal definition) and path indexing is similar.

1.2 Thesis Organization

Given the background and challenge as introduced above, I, together with my coauthors, developed methods specifically for exploring cancer-associated genes by network mining and management. Results of network management were partially presented in [25]. To facilitate discussions in this thesis, I provide basic definitions and formal problem statements in Chapter 2. Then I present our main results in

Chapter 3 and Chapter 4.

In Chapter 3, we identify candidate cancer biomarkers through mining dense components in gene-coexpression networks. By extending the weighted graph mining algorithm by Ou and Zhang [12], we propose a method specifically targeting the mining of biomarkers in gene-coexpression networks. After applying our method to gene coexpression data for breast cancer, we find gene clusters and further select them by statistical tests. The final results are candidate biomarkers for breast cancer.

In Chapter 4, we propose network management methods for very large networks.

First, we explore the possibility of extending the available algorithms for indexing large networks for knowledge discovery. Then, we propose the k-neighborhood Decentralization Labeling Scheme (kDLS) for efficiently indexing the UMLS, a network with the power-law property. By using kDLS labels for the UMLS network, we effectively prioritize breast cancer genes and generate breast cancer related hypotheses.

Finally, I conclude our main results and discuss future work in Chapter 5.

CHAPTER 2

PROBLEM FORMULATION

In this thesis network and graph are used interchangeably. Similarly, node (a term often used in networks) and vertex (a term often used in graphs) have the same meaning. A graph G = (V,E) consists of a set of vertices V and a set of edges E. If edges have no directions, the graph is called “undirected graph”, otherwise “directed graph”. If we assign a numerical value to each edge in a graph to represent the weight of the edge, then the graph becomes a weighted graph. We denote the weight of an edge e ∈ E as w(e).

As we discussed in Chapter 1, in many networks dense components themselves are important clusters, and these dense components are usually in the form of cliques.

Its formal definition is given below.

Definition 1. Let G = (V, E) be an undirected graph. A clique in this graph is a set of vertices V' ⊆ V such that (p, q) ∈ E for every two vertices p, q ∈ V'.

It is often impractical to enumerate all cliques and thus it is a very common practice to enumerate maximal cliques instead. A maximal clique is a clique which is not a subgraph of any other cliques. A maximum clique is a clique with maximum number of vertices. A graph may have more than one maximum clique.

In a weighted graph, we use density to measure a component/subgraph as follows.

For simplicity, we normalize edge weights between 0 and 1.

Definition 2. Let G = (V, E) be a weighted undirected graph with w(u, v) (0 ≤ w(u, v) ≤ 1) denoting the weight of an edge (u, v) ∈ E. Let V' ⊆ V. The subgraph induced by V' is denoted as G(V'). The density of G(V') is

density(G(V')) = \frac{2 \sum_{u, v \in V'} w(u, v)}{|V'|\,(|V'| - 1)}
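As a concrete illustration of Definition 2, the following is a minimal Python sketch (not from the thesis) of how the density of an induced subgraph can be computed from a table of normalized edge weights; the dictionary-of-weights representation is an assumption made purely for illustration.

```python
# Density of the subgraph induced by V', per Definition 2.
from itertools import combinations

def density(weights, v_prime):
    """weights: dict mapping frozenset({u, v}) -> w(u, v) in [0, 1];
    v_prime: iterable of vertices inducing the subgraph."""
    nodes = list(v_prime)
    if len(nodes) < 2:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(nodes, 2))
    # each unordered pair is summed once, hence the factor 2
    return 2.0 * total / (len(nodes) * (len(nodes) - 1))

# Example: a triangle with edge weights 0.9, 0.8, 0.7 has density 0.8
w = {frozenset({"a", "b"}): 0.9, frozenset({"b", "c"}): 0.8, frozenset({"a", "c"}): 0.7}
print(density(w, ["a", "b", "c"]))  # 0.8
```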

Mining dense components from biomedical networks has many applications. These dense components often imply important biological information. The dense component mining method for gene co-expression networks described in this thesis can help us to discover clusters of genes which can be potential biomarkers.

For heterogeneous networks with various types of nodes, such as the UMLS network, dense subgraphs may not correspond to a cluster with clear significance in application. In these networks, we are more interested in mining them by discovering the relationship between two nodes. The relationship between two nodes is often measured by the reachability, distance, and path between them. In graph theory, a path is a sequence of vertices such that each vertex connects to the next vertex in the sequence by an edge, and a path with no repeated vertices is called a simple path. We say node u can reach node v if and only if there is a path (or directed path if the graph is directed) connecting u to v. The distance between u and v in the graph, denoted d_G(u, v), is the length (i.e., number of edges) of the shortest path connecting them. The research described in this thesis aims at answering the following problems.

• How to efficiently mine dense components in a weighted gene-coexpression network for the purpose of discovering candidate biomarkers for cancer research?

• How to efficiently index heterogeneous networks to understand the relationships between different nodes (concepts) for the purpose of knowledge discovery?

CHAPTER 3

GENE CO-EXPRESSION NETWORK MINING

A gene co-expression network is built from microarray data, which contain the gene expression values for different samples. In this kind of network, each node represents a gene and each edge between two nodes encodes the co-expression level between these two genes. Usually, the gene co-expression network is weighted: the weight of an edge encodes the co-expression level, or correlation, between the two genes' expression profiles. A dense component in a gene co-expression network usually corresponds to a group of highly correlated genes and carries meaningful biological information. By exploring the gene co-expression network, we can find such groups of genes, which may correspond to candidate biomarkers. In this chapter, we describe our network mining method for exploring cancer-associated genes in detail.

3.1 Data Preprocessing

The data set we use is GSE18864, obtained from the Gene Expression Omnibus (GEO). GEO is a public repository of microarray, next-generation sequencing, and other functional genomic data. The GSE18864 data contains tumor expression data from triple negative breast cancer patients who were treated with neoadjuvant cisplatin.

The GSE18864 dataset is available as a 54675×84 matrix with some header information. In the original data, each row represents a probe and each column represents a sample (patient). First, we need to reduce it to a gene × sample matrix in order to mine the data based on gene co-expression. The GSE18864 dataset corresponds to platform GPL570, which provides a map from probes to genes. However, this mapping is not one-to-one: some probes do not map to any gene, and multiple probes may map to the same gene. When reducing the probe × sample matrix to the gene × sample matrix, these two situations have to be handled separately. If a probe does not map to any gene, its row is discarded. If multiple probes map to one gene, we choose the row with the maximum sum of expression values over all samples. After this preprocessing, the 54675×84 matrix reduces to a 20827×84 matrix. Finally, because all values in GSE18864 are base-2 logarithms, we transform them back to the original scale by taking 2^p for each value p.
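The following is a rough sketch of this preprocessing, not the exact code used for the thesis; the file names and column layout are assumptions made for illustration.

```python
import pandas as pd

# probes x samples expression matrix and a probe-to-gene map (hypothetical file names)
expr = pd.read_csv("GSE18864_expression.txt", sep="\t", index_col=0)
probe2gene = pd.read_csv("GPL570_probe_to_gene.txt", sep="\t", index_col=0)["gene"]

# drop probes that do not map to any gene
expr = expr.loc[expr.index.intersection(probe2gene.index)]
expr["gene"] = probe2gene.loc[expr.index]

# if multiple probes map to one gene, keep the probe with the largest
# total expression across all samples
expr["row_sum"] = expr.drop(columns="gene").sum(axis=1)
expr = (expr.sort_values("row_sum", ascending=False)
            .drop_duplicates(subset="gene", keep="first")
            .set_index("gene")
            .drop(columns="row_sum"))

# undo the base-2 logarithm to recover the original expression values
expr = 2 ** expr
```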

3.2 Building Gene Co-expression Network

In the co-expression network, each node corresponds to a gene, and each edge between two nodes has a numerical value associated with it that represents the co-expression level between the two nodes. We use the Pearson Correlation Coefficient (PCC) as the measure of the co-expression level between two genes. The PCC is a common measure of the correlation between two random variables and is widely used to measure the correlation between different genes' expression profiles.

Given two random variables, the PCC is their covariance divided by the product of their standard deviations. The value of the PCC is between -1 and 1: a PCC close to -1 means that the two random variables are highly negatively correlated, while a PCC close to 1 means they are highly positively correlated. Given two random variables X and Y, the standard formula for the PCC is:

p = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}

where X_i is the i-th value of X, Y_i is the i-th value of Y, and \bar{X} and \bar{Y} are the means of X and Y, respectively.
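A small numerical check of this formula is sketched below (the values are illustrative, not taken from GSE18864):

```python
import numpy as np

def pcc(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

x = [1.0, 2.0, 3.0, 4.0]
print(pcc(x, [2.1, 3.9, 6.0, 8.2]))   # close to +1: strongly positively correlated
print(pcc(x, [3.0, 1.0, 4.0, 2.0]))   # 0: uncorrelated
```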


Figure 3.1: Plot of expression value versus samples for CSN2.

For example, consider three genes, CSN2, C17orf99, and A1BG, from the GSE18864 data set. The PCC for CSN2 and C17orf99 is about 0.99981 according to the above formula; this implies that they are strongly positively correlated. On the other hand, the

PCC for CSN2 and A1BG is about 0.149511, which implies they are only weakly correlated.


Figure 3.2: Plot of expression value versus samples for C17orf99.

Figures 3.1, 3.2, and 3.3 show the plots of expression value versus samples for these three genes, respectively. From these figures, we can see that CSN2 and C17orf99 have similar patterns over all the samples, which implies they are highly positively correlated, while CSN2 and A1BG have very different patterns, which implies they are neither positively nor negatively correlated. This kind of relationship is well captured by the Pearson Correlation Coefficient. To build the gene co-expression network, we calculate the PCC for each pair of genes in the GSE18864 dataset and set the absolute value of the PCC to be the weight of the edge connecting the two genes, since we want to mine genes that are highly correlated either positively or negatively. Figure 3.4 shows the cumulative distribution function (CDF) of the PCC; the x axis represents the PCC value. From Figure 3.4 we observe that about 5% of the edges have an absolute PCC value greater than 0.37. Because we want to mine the dense components of the gene co-expression network, we primarily


Figure 3.3: Plot of expression value versus samples for A1BG.

target edges with high weight. Thus we keep the top 5% of edges (absolute PCC values ranging from 0.366196 to 0.999839) for our further study.
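A minimal sketch of this construction is given below; it is not the thesis code, and the variable layout and the pairwise loop (which would be replaced by a vectorized correlation matrix for 20,827 genes in practice) are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def coexpression_network(expr, cutoff=0.366196):
    """expr: dict mapping gene name -> 1-D array of expression values.
    Returns the edge set of the thresholded co-expression network."""
    genes = list(expr)
    edges = {}
    for g1, g2 in combinations(genes, 2):
        w = abs(np.corrcoef(expr[g1], expr[g2])[0, 1])
        if w >= cutoff:
            edges[(g1, g2)] = w   # edge weight = |PCC|
    return edges
```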

3.3 Mining Subgraphs with Bounded Density

Dense components in the gene co-expression network correspond to groups of highly correlated genes. By exploring these groups of genes, we can find candidate biomarkers. A weighted dense subgraph mining algorithm was proposed by Ou and

Zhang in [12]. In their paper, they name a dense component ∆-quasi-clique where ∆ is actually the density of the dense component.

In Chapter 2, the density of a subgraph S of G was defined as:

density(S) = \frac{2 \sum_{e \in E(S)} w(e)}{|V(S)|\,(|V(S)| - 1)}

G is a weighted graph; a subgraph S is called a ∆-quasi-clique in [12] if density(S) ≥ ∆ for some positive real number ∆.

Figure 3.4: The distribution of PCC for GSE18864.

In [12] the authors introduce the Quasi Clique Merger (abbreviated as QCM) algorithm to find dense subgraphs. The algorithm works by recursively adding a node to a subgraph S if the contribution of this node satisfies a predefined condition. For a node v ∉ V(S), [12] defines the contribution of v to S by

c(v, S) = \frac{\sum_{u \in V(S)} w(u, v)}{|V(S)|}

The whole QCM Algorithm contains three basic steps. The first step is growing.

In this step, the algorithm starts from a subgraph S which only contains one edge and iteratively adds a node with the maximum contribution to this subgraph S if such contribution satisfies a predefined condition. This step outputs a number of dense components. The second step is merging. In this step, the algorithm merges two dense components if the two dense components are overlapped to certain extent.

This step normally ends up with a smaller number of dense components. The last step

is building hierarchical clusters based on the dense components. The first two steps are most helpful in identifying dense components in gene co-expression networks that may correspond to candidate biomarkers. The sketch of the growing step is shown in Algorithm 1 and the sketch of the merging step is shown in Algorithm 2.
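To make the growing step concrete, the following is an illustrative Python sketch (using the thesis's notation, not the authors' code) of its acceptance test: a vertex v joins the current component C only if its contribution c(v, C) is at least α_n times the density of C.

```python
from itertools import combinations

def _density(weights, component):
    pairs = list(combinations(component, 2))
    s = sum(weights.get(frozenset(p), 0.0) for p in pairs)
    return 2.0 * s / (len(component) * (len(component) - 1)) if len(component) > 1 else 0.0

def contribution(weights, v, component):
    return sum(weights.get(frozenset({u, v}), 0.0) for u in component) / len(component)

def accept(weights, v, component, lam=2.0, t=1.0):
    # alpha_n = 1 - 1 / (2 * lambda * (|V(C)| + t)), as in Algorithm 1
    alpha_n = 1.0 - 1.0 / (2.0 * lam * (len(component) + t))
    return contribution(weights, v, component) >= alpha_n * _density(weights, component)
```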

Notice that the size of a dense component is affected by the parameters λ and t in the QCM algorithm. These two parameters control the lower bound of the density of the dense component. Since an edge with both endpoints in existing components will not be used to generate a new dense component in the QCM algorithm, the two parameters λ and t essentially affect the QCM algorithm's search space in a somewhat unpredictable manner. That is, if we adjust λ and t slightly in order to get slightly larger components, we may obtain a set of very different components, a phenomenon similar to the "butterfly effect". To counteract this adverse effect of the QCM algorithm in mining dense components from gene-coexpression networks, we propose a method with both reliability and flexibility. Our method can adjust the size (adding or removing vertices) of each dense component in the result by setting different λ and t, rather than generating a new set in which many dense components are unrelated to the previous results. On the other hand, our method also allows users to flexibly adjust the search space. The details are described in the next section.

3.4 Weighted Subgraph Pattern Mining for Biomedical Applications

To overcome the negative effects of the QCM algorithm described at the end of the previous section, we propose to select the set of start edges based on the definition of a node's contribution to a subgraph S given in [12]. Recall that for a subgraph S and a node v ∉ S, the contribution of v to S is defined as

c(v, S) = \frac{\sum_{u \in V(S)} w(u, v)}{|V(S)|}

According to [12], the density of a dense component is bounded by the weight of the start edge. The QCM algorithm does not add a node to a dense component if the node's contribution to the component is less than the density of the component times a constant α (α ∈ (0, 1), with α close to 1). Thus, the selection of start edges is crucial in this process.

Algorithm 1 Quasi Clique Merger Algorithm 1, grow(input G = (V, E), γ, λ, t; output L)
1: Initialize w0 ← γ · max{w(e) : e ∈ E(G)}, 0 ≤ γ ≤ 1;
2: Sort the edges {e ∈ E(G) : w(e) ≥ w0} into a sequence S = e1, e2, ..., em such that w(e1) ≥ w(e2) ≥ ... ≥ w(em);
3: µ ← 1, p ← 0, L ← ∅;
4: while µ ≤ m do
5:   let x, y be the two endpoints of eµ;
6:   if (x ∉ ∪_{i=1..p} V(Ci)) or (y ∉ ∪_{i=1..p} V(Ci)) then
7:     p ← p + 1, Cp ← V(eµ), L ← L ∪ Cp;
8:     while V(G) − V(Cp) ≠ ∅ do
9:       for each v ∈ V(G) − V(Cp) do
10:        c(v, Cp) = (Σ_{u ∈ V(Cp)} w(u, v)) / |V(Cp)|;
11:      end for
12:      select the vertex u such that c(u, Cp) is maximum;
13:      αn = 1 − 1 / (2λ(|V(Cp)| + t));
14:      d(Cp) = (2 Σ_{e ∈ E(Cp)} w(e)) / (|V(Cp)| (|V(Cp)| − 1));
15:      if c(u, Cp) ≥ αn · d(Cp) then
16:        Cp ← Cp ∪ {u}
17:      else
18:        µ ← µ + 1;
19:        break;
20:      end if
21:    end while
22:  else
23:    µ ← µ + 1;
24:  end if
25: end while
26: return L

Algorithm 2 Quasi Clique Merger Algorithm 2, merge(input L, β; output L)
1: s ← |L|;
2: Sort all the elements in L such that |V(C1)| ≥ |V(C2)| ≥ ... ≥ |V(Cs)|;
3: h ← 2, j ← 1;
4: while h ≤ s do
5:   while j ≤ h do
6:     if |Cj ∩ Ch| > β · min(|Cj|, |Ch|) then
7:       Cs+1 ← Cj ∪ Ch;
8:       L = L − Cj;
9:       L = L − Ch;
10:      L.append(Cs+1);
11:      s ← s − 1;
12:      h ← max(h − 2, 1);
13:      h ← h + 1;
14:      j ← 1;
15:      break;
16:    else
17:      j ← j + 1;
18:    end if
19:  end while
20: end while
21: return L

The basic idea of our method for choosing the set of start edges is as follows. First, we take all edges with weight greater than a user-input threshold γ to form a set E', sort these edges in decreasing order of weight, and mark them as uncovered. Second, we choose an uncovered edge from E' with maximum weight and add it to the set of start edges. Third, after we add an edge e = (u, v) to the set of start edges, for each node that is a neighbor of u or v we calculate its contribution c to the subgraph S = {u, v}. If the contribution c of a node v' is greater than w(e) times a constant k (where w(e) is also the density of S), this implies that when growing from e, the node v' is likely to be added to that dense component, so we mark the edge e1 = (u, v') or e2 = (v, v') as covered if e1 ∈ E' or e2 ∈ E'. We repeat the last two steps until every edge in E' has been either covered or selected as a start edge. Algorithm 3 shows the sketch of the start edge selection.

Figure 3.5 shows the flow chart of how the start edges are selected; here we set k = 0.9.

An important feature of our method is that we can use the parameter k to control the search space. When k → +∞, we search from every edge and the search space is the largest possible, while if k = 0 the search space is minimal. In the latter case, once an edge e is selected as a start edge, no edge incident to e can be selected as a start edge, so the final set of start edges is exactly a matching, or independent edge set, of the graph G. The matching (independent edge set) of a graph G = (V, E) is defined as follows:

Definition 3. Given a graph G = (V, E), a matching (or independent edge set) in this graph is a set of edges U such that no two edges e1, e2 ∈ U are adjacent in G.

Algorithm 3 Start edge selection Algorithm, start edge selection(input G = (V, E), k, γ; output U)
1: Initialize T, E' to be sets of edges and N to be a set of nodes;
2: T = ∅, E' = ∅, U = ∅;
3: for each e ∈ E do
4:   if w(e) ≥ γ then
5:     E' = E' ∪ {e};
6:   end if
7: end for
8: for each u ∈ V do
9:   for each v ∈ V do
10:    if (u, v) ∉ E' then
11:      w(u, v) = 0;
12:    end if
13:  end for
14: end for
15: for each e ∈ E' do
16:   e.covered = false;
17: end for
18: while T ≠ E' do
19:   Pick the edge e = (u, v) ∈ E' such that w(e) is maximum and e.covered = false;
20:   U = U ∪ {e}, T = T ∪ {e}, N = ∅, e.covered = true;
21:   for each q ∈ Neighbor(u) with q ≠ u do
22:     N = N ∪ {q};
23:   end for
24:   for each q ∈ Neighbor(v) with q ≠ v do
25:     if q ∉ N then
26:       N = N ∪ {q};
27:     end if
28:   end for
29:   for each q ∈ N do
30:     c = (w(q, u) + w(q, v)) / 2;
31:     if c > k × w(u, v) then
32:       if ((q, u) ∈ E') and ((q, u) ∉ T) then
33:         (q, u).covered = true;
34:         T = T ∪ {(q, u)};
35:       end if
36:       if ((q, v) ∈ E') and ((q, v) ∉ T) then
37:         (q, v).covered = true;
38:         T = T ∪ {(q, v)};
39:       end if
40:     end if
41:   end for
42: end while
43: return U
Figure 3.5: The flow chart of selecting the start edges (k = 0.9). The numbers are the edge weights. The red edges have been added to the set of start edges; the red and blue edges have been covered; the other edges have not been covered.

After the start edges are determined, we use a portion of the QCM algorithm (Lines 8-21 of Algorithm 1) to search from these start edges. We name our method start-edge-predetermined QCM, or simply preQCM. Algorithm 4 gives the complete pseudocode of preQCM.

Algorithm 4 preQCM Algorithm, preQCM(input G = (V, E), k, γ, λ, t; output L)
1: Initialize U to be a set of edges;
2: U = start edge selection(G = (V, E), k, γ);
3: µ ← 1, p ← 0, L ← ∅, m ← |U|;
4: while µ ≤ m do
5:   p ← p + 1, Cp ← V(eµ), L ← L ∪ Cp;
6:   while V(G) − V(Cp) ≠ ∅ do
7:     for each v ∈ V(G) − V(Cp) do
8:       c(v, Cp) = (Σ_{u ∈ V(Cp)} w(u, v)) / |V(Cp)|;
9:     end for
10:    select the vertex u such that c(u, Cp) is maximum;
11:    αn = 1 − 1 / (2λ(|V(Cp)| + t));
12:    d(Cp) = (2 Σ_{e ∈ E(Cp)} w(e)) / (|V(Cp)| (|V(Cp)| − 1));
13:    if c(u, Cp) ≥ αn · d(Cp) then
14:      Cp ← Cp ∪ {u}
15:    else
16:      µ ← µ + 1;
17:      break;
18:    end if
19:  end while
20: end while
21: return L

Method | Dense components generated | Dense components passing survival test
QCM | 1472 | 839
preQCM (k=0) | 874 | 503
preQCM (k=0.9) | 2357 | 1461

Table 3.1: Results when set γ = 0.7, β = 0.9

Method | Dense components generated | Dense components passing survival test
QCM | 830 | 523
preQCM (k=0) | 496 | 307
preQCM (k=0.9) | 1276 | 862

Table 3.2: Results when set γ = 0.7, β = 0.8

We run preQCM with (1) k = 0 (i.e., minimal search space), γ = 0.7/0.8/0.9, β = 0.7/0.8/0.9, λ = 2, t = 1, and (2) k = 0.9, γ = 0.7/0.8/0.9, β = 0.7/0.8/0.9, λ = 2, t = 1. As a comparison, we run the original QCM algorithm with γ = 0.7/0.8/0.9, β = 0.7/0.8/0.9, λ = 2, and t = 1. After obtaining dense components from the data, we use the microarray data GSE1456 (159 patients) to perform survival tests. The survival test results are demonstrated by Kaplan-Meier curves, which were also used in [26] to study time-to-treatment. The following tables show the results.
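The sketch below illustrates one way such a survival test can be run in Python with the lifelines package; the grouping of GSE1456 patients (for example, by high versus low cluster expression) and the argument names are assumptions for illustration, not the thesis implementation.

```python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def survival_test(time_high, event_high, time_low, event_low):
    """time_*: survival times in years; event_*: 1 if the event was observed."""
    result = logrank_test(time_high, time_low,
                          event_observed_A=event_high, event_observed_B=event_low)
    # Kaplan-Meier curves for the two patient groups
    KaplanMeierFitter().fit(time_high, event_high, label="high").plot_survival_function()
    KaplanMeierFitter().fit(time_low, event_low, label="low").plot_survival_function()
    return result.p_value   # a cluster "passes" if p < 0.05
```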

From these results, we can see that users can use k to control the search space.

When the search space is big enough, we can often find more promising patterns that pass survival tests from the results of our method than from the results of the original QCM method.

Method | Dense components generated | Dense components passing survival test
QCM | 477 | 318
preQCM (k=0) | 322 | 215
preQCM (k=0.9) | 486 | 326

Table 3.3: Results when set γ = 0.7, β = 0.7

Method | Dense components generated | Dense components passing survival test
QCM | 446 | 203
preQCM (k=0) | 298 | 139
preQCM (k=0.9) | 505 | 223

Table 3.4: Results when set γ = 0.8, β = 0.9

Method | Dense components generated | Dense components passing survival test
QCM | 254 | 138
preQCM (k=0) | 172 | 94
preQCM (k=0.9) | 282 | 155

Table 3.5: Results when set γ = 0.8, β = 0.8

Method | Dense components generated | Dense components passing survival test
QCM | 158 | 91
preQCM (k=0) | 111 | 64
preQCM (k=0.9) | 160 | 95

Table 3.6: Results when set γ = 0.8, β = 0.7

Method | Dense components generated | Dense components passing survival test
QCM | 101 | 25
preQCM (k=0) | 59 | 15
preQCM (k=0.9) | 122 | 28

Table 3.7: Results when set γ = 0.9, β = 0.9

Method | Dense components generated | Dense components passing survival test
QCM | 67 | 14
preQCM (k=0) | 44 | 15
preQCM (k=0.9) | 61 | 17

Table 3.8: Results when set γ = 0.9, β = 0.8

Method | Dense components generated | Dense components passing survival test
QCM | 46 | 16
preQCM (k=0) | 34 | 13
preQCM (k=0.9) | 39 | 14

Table 3.9: Results when set γ = 0.9, β = 0.7

3.5 Discovering Candidate Cancer Biomarkers

[Panels: vant Veer 70 Genes, NKI ALL, p = 6.7302e-011 (left); our cluster, NKI ALL, p = 1.1984e-012 (right). Axes: Survival Ratio versus Survival Time (Year).]

Figure 3.6: Survival Test of 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI)

In Table 3.10 we show the parameter settings for our experiments; bold face indicates the setting that achieves the minimum p-value (most significant). In Figure 3.6, Figure 3.7, and Figure 3.8, we show the survival test results for three dense components for the NKI, NKI LN-POS, and NKI ER-NEG categories, respectively. These three dense components were obtained by preQCM with k = 0.9. Each of them has a p-value less than the p-value of the vant Veer 70 Genes in the corresponding category. In Table 3.11 we list the genes in the above three dense components.

After obtaining the three dense components with a smaller p-value than the vant Veer 70 Genes for NKI, NKI LN-POS, and NKI ER-NEG, we visualize these dense components as graphs using a graph drawing method. In the visualization, different edge colors represent different PCC values: red represents PCC > 0.9, green represents

24 NKI NKI LN-Pos NKI ER-Neg GSE1456 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 QCM γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 preQCM,k=0.0 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.7 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.8 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.7, β = 0.9 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 γ = 0.8, β = 0.7 preQCM,k=0.9 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.8 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.8, β = 0.9 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.7 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.8 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 γ = 0.9, β = 0.9 min p-value 1.198375e-012 6.313093e-008 5.275878e-006 2.202827e-008

Table 3.10: Number of statistically significant (p < 0.05) gene-clusters for survival tests on each patient group. The bold face for some number in the table means we get the minimum p-value when we set parameters to that value.

[Panels: vant Veer 70 Genes, NKI LN-Positive, p = 0.00026302 (left); our cluster, NKI LN-Positive, p = 6.3131e-008 (right). Axes: Survival Ratio versus Survival Time (Year).]

Figure 3.7: Survival Test of 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI LN-POS)

0.8 < PCC ≤ 0.9, blue represents 0.7 < PCC ≤ 0.8, light steel blue represents 0.6 < PCC ≤ 0.7, yellow represents 0.5 < PCC ≤ 0.6, and violet red represents PCC ≤ 0.5. This visualization clearly shows the approximate co-expression level between two genes and helps us understand the dense components.

[Panels: vant Veer 70 Genes, NKI ER-Negative, p = 0.68436 (left); our cluster, NKI ER-Negative, p = 0.0024199 (right). Axes: Survival Ratio versus Survival Time (Year).]

Figure 3.8: Survival Test of 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI ER-NEG)

Dataset | Sample gene cluster with a smaller p-value than vant Veer 70 Genes
NKI | ANKRA2, ARID4A, AUH, CETN3, CSNK1G3, CTR9, DMXL1, DNAJC12, ESR1, FAM179B, FOXA1, GRAMD1C, HISPPD1, HNRNPH2, IL6ST, LRBA, MKL2, MOAP1, MRFAP1L1, NPY1R, PCNP, PCTK2, RNF14, RNF141, SENP7, SYNJ2BP, TBC1D9, TMEM144, YTHDC2, ZC3H14
NKI LN-POS | AURKA, CCNB2, CDCA3, CDT1, CENPF, FAM64A, MCAM, NCAPH, PTDSS1, SKP2, SSRP1, UBE2C
NKI ER-NEG | C12orf69, FAM38B, C12orf64, LOC100128641, NPY5R, AMDHD1, RMND1, FAM169A, FCRLB, TRIM9, DPY19L2, GPM6A, C14orf53, ALOX15, C4orf43, KCNK17, MAGI2, LOC441601, SLC14A1, VSIG2, hCG 1742852 /// LOC100132169, SYNPO2L, GOLSYN, STC2, INPP4B, LOC100134363 /// LOC100134366 /// PLGLA /// PLGLB1 /// PLGLB2, RLN2, DFNB59, PGR, CCDC66, RABEP1, MAP2K4, INADL, TRERF1, GLI3, ARHGEF12, KIAA0999

Table 3.11: Genes in the dense components with a smaller p-value than vant Veer 70 Genes for NKI, NKI LN-POS and NKI ER-NEG respectively.

Figure 3.9: Visualization of the co-expression network for the gene cluster of NKI with a smaller p-value than vant Veer 70 Genes

Figure 3.10: Visualization of the co-expression network for the gene cluster of NKI LN-POS with a smaller p-value than vant Veer 70 Genes

Figure 3.11: Visualization of the co-expression network for the gene cluster of NKI ER-NEG with a smaller p-value than vant Veer 70 Genes

CHAPTER 4

NETWORK MANAGEMENT

In heterogeneous networks with various types of nodes, the relationship between two nodes is often measured by the reachability, distance, and path between them. However, designing efficient reachability, distance, or path indexing schemes for large networks is a challenging problem in the database and biomedical informatics communities. The challenge comes from designing a practical method that balances the index size, construction time, and query time. In this chapter, we propose methods to tackle this challenge, in both random networks and the power-law network that models the Unified Medical Language System. Since reachability indexing has been extensively studied in [13, 14, 15, 16], our focus is distance and path indexing, and the knowledge discovery enabled by such indexing on heterogeneous biomedical networks. Results of this chapter were partially presented in [25].

4.1 Handling large networks

The 2-hop algorithm, proposed by Cohen et al. [17], offers a good solution for answering reachability and distance queries on moderate-sized graphs. The method is based on 2-hop covers of the shortest distances in a graph. Given a graph G = (V, E), a 2-hop cover P is a set of paths such that for any two vertices u, v ∈ V, if there is a shortest path from u to v (denoted p_uv), there are two paths p1, p2 ∈ P such that p_uv = p1 p2.

However, finding the 2-hop cover of minimum size is NP-hard. Thus, in [17] the authors propose a solution that generates an almost optimal 2-hop cover. In their approach, each vertex v ∈ V maintains two lists of intermediate vertices, L_out(v) and L_in(v). L_out(v) is a collection of pairs (x, d(v, x)) that records the vertices that v can reach and the corresponding distances. The other list, L_in(v), is a collection of pairs (x, d(x, v)) that records the vertices that can reach v and the distances. Here d(u, v) denotes the shortest distance from u to v in G. Reachability and distance queries between two vertices can therefore be answered using L_out(v) and L_in(v). Let P_uv denote the paths from u to v. The 2-hop cover problem is equivalent to the set cover problem and can be defined as follows:

Definition 4. Given a graph G = (V, E) and a ground set of elements Q = {(u, v) | u, v ∈ V, P_uv ≠ ∅} that need to be covered, for each vertex x ∈ V we have a set

S(L_in, x, L_out) = {(u, v) ∈ Q | (u, d(u, x)) ∈ L_in, (v, d(x, v)) ∈ L_out, d(u, v) = d(u, x) + d(x, v)}

and there is a weight w_S corresponding to each set S, with w_S = |L_in| + |L_out|. The goal is to find a collection U of such sets S so that \sum_{S \in U} w_S is minimized and, for every (u, v) ∈ Q, (u, v) ∈ \bigcup_{S \in U} S.

The label size for this approach is the total size of the two lists, \sum_{v \in V} (|L_out(v)| + |L_in(v)|), and the computation cost to answer whether v can reach u, or what the distance between v and u is, is O(|L_out(v)| + |L_in(u)|).

However, because of the extremely time-consuming procedure for identifying the densest subgraph, the index construction time is too long to be practical for large graphs.

This weakness of the original 2-hop approach is recognized in the database research community and has been discussed in [15, 20].
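Before moving on, the following minimal sketch (an illustration, not code from the cited works) shows how a distance query is answered from 2-hop labels: the answer is the minimum of d(u, x) + d(x, v) over hop centers x that appear in both L_out(u) and L_in(v).

```python
def two_hop_distance(l_out_u, l_in_v):
    """l_out_u: dict {x: d(u, x)}; l_in_v: dict {x: d(x, v)} (illustrative layout)."""
    best = float("inf")
    for x, dux in l_out_u.items():
        dxv = l_in_v.get(x)
        if dxv is not None and dux + dxv < best:
            best = dux + dxv
    return best   # float("inf") means v is not reachable from u

# Example: u reaches v through hop center "c" with distance 2 + 3 = 5
print(two_hop_distance({"a": 4, "c": 2}, {"c": 3, "d": 1}))   # 5
```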

To overcome this weakness, Cheng and Yu [20] proposed a top-down divide-and-conquer approach to compute distance-aware 2-hop covers. The basic strategy behind the top-down approach is to find a subgraph that partitions the graph into two parts. The vertices in this subgraph become the candidate centers of distance-aware 2-hop covers. This top-down approach provides a scalable solution for large graphs. The largest graph reported in their paper is XMK20M, which contains

336,243 vertices and 397,713 edges.

The efficiency of the top-down approach in [20] relies on identifying a small subgraph partitioner in each iteration. In a dense graph, a small subgraph partitioner may not exist. This explains why most datasets in [20] are very sparse, i.e., |E|/|V| < 2. However, in biomedical applications many datasets are dense, so the top-down approach is not a good strategy.

To design a distance and path indexing scheme specifically for biomedical data, which are usually dense, we instead take a bottom-up divide-and-conquer approach. This approach does not need to find a subgraph partitioner. Instead, it labels a small portion of the graph in each iteration and eventually finishes labeling the whole graph. Since knowledge discovery in biomedical data largely focuses on close transitive relations, our approach focuses on answering distance and path queries within k hops, and we name it localized 2-hop.

Given a graph G = (V,E), in the localized 2-hop method, we first choose a batch of vertices U1 from V as intermediate vertices. The set U2 contains the set of vertices which can be reached by any vertex in U1 within k hops. The localized 2-hop cover problem is defined as follows:

Definition 5. Given a graph G = (V, E) and a ground set of elements Q = {(u, v) | u ∈ U1, v ∈ U2, d(u, v) ≤ k} that need to be covered, for each vertex q ∈ U1 we have a set

S(L_in, q, L_out) = {(u, v) ∈ Q | (u, d(u, q)) ∈ L_in, (v, d(q, v)) ∈ L_out, d(u, v) = d(u, q) + d(q, v)}

and there is a weight w_S corresponding to each set S, with w_S = |L_in| + |L_out|. The goal is to find a collection R of such sets S so that \sum_{S \in R} w_S is minimized and, for every (u, v) ∈ Q, (u, v) ∈ \bigcup_{S \in R} S.

Then we apply the 2-hop cover method, which involves densest-subgraph searching, to complete the localized 2-hop labeling task. We repeat this until all the vertices in V have been selected. Figure 4.1 shows the flow chart of localized 2-hop.

Given a vertex v ∈ V , we apply the Breadth First Search to calculate all the vertices which can be reached by v in k hops. Algorithm 5 demonstrates the process of using BFS to find all the vertex that can be reached by v in k hops, and saving these vertices and their distance to v in the data structure TCout and TCin.

Algorithm 5 LocalBFS(G = (V,E), v, Length Limit, T Cout,TCin) 1: Initialize visited to be a set; 2: Initialize a queue Q and enqueue v with v.level = 0 onto Q; 3: visited = visited ∪ {v}; 4: while Q 6= ∅ do 5: u = pop(Q); 6: Add (u, u.level) to the entry of v in TCout; 7: Add (v, u.level) to the entry of u in TCin; 8: if u.level ≤ Length Limit - 1 then 9: for each w ∈ u.NEIGHBOR do 10: if w∈ / visited then 11: visited = visited ∪ {w}; 12: enqueue w with w.level = u.level + 1 onto Q 13: end if 14: end for 15: end if 16: end while

[Flow chart: select a batch of vertices U from the input graph G = (V, E) → apply the 2-hop method to compute localized 2-hop labels → V = V − U → repeat until V is empty, then terminate.]

Figure 4.1: The flow chart of localized 2-hop.

When choosing a batch of vertices from V, it is intuitive to select vertices that are close to each other, so that there is more redundancy in the 2-hop cover for label optimization. We choose these vertices following this intuition. In the initialization step, each vertex has a numerical value, i.e., a rank, associated with it; the ranks of all vertices are initialized to 0. We start at a random vertex and increase by one the rank of every vertex adjacent to the starting vertex. Then a vertex with the highest rank is selected. We repeat this until a batch of vertices has been selected. Figure 4.2 illustrates the greedy method. In this figure, the red vertices are those that have been selected and the white vertices are those that have not been selected; the numbers are the vertex ranks.


Figure 4.2: Illustration of greedily selecting a batch of vertices.
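A short sketch of this rank-based greedy selection is given below (illustrative, not the exact implementation): picking a vertex raises the rank of its neighbors, so later picks tend to stay close to earlier ones.

```python
def select_batch(adj, batch_size):
    """adj: dict mapping vertex -> set of neighbor vertices."""
    rank = {v: 0 for v in adj}
    selected = []
    while len(selected) < batch_size and rank:
        # the first pick is arbitrary (all ranks are 0); later picks prefer
        # vertices adjacent to already-selected ones
        v = max(rank, key=rank.get)
        selected.append(v)
        del rank[v]
        for w in adj[v]:
            if w in rank:
                rank[w] += 1
    return selected
```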

The algorithm for localized 2-hop is shown in Algorithm 6. The algorithm takes three parameters: G is the graph; Length Limit controls, after labeling, within how many hops reachability and distance queries should be answered; and Batch size controls how many vertices are selected from V each time.

We test the localized 2-hop method on a 2.4GHz AMD Opteron machine with a Linux 2.6 kernel on several large sparse graphs (the number of vertices ranges from 10,000 to 100,000 and the highest density is 1.0079) and compare its performance with

36 Algorithm 6 Localized2Hop(G = (V,E), Length Limit, Batch size) 1: for each v ∈ V do 2: v.rank = 0; 3: v.selected = false; 4: end for 0 5: U = ∅; 6: for i=1 to |V | do 7: Select vi such that vi.rank is maximum and vi.selected == false; 8: vi.selected = true; 0 0 9: U = U ∪ {vi}; 10: for each ω ∈ Neighbor(v) do 11: ω.rank = ω.rank + 1; 12: end for 0 13: if |U | < Batch Size then 14: continue; 15: end if 16: Initialize TCout to an array of set structure; 17: Initialize TCin to an array of set structure; 0 18: for each v ∈ U do 19: LocalBFS( G = (V,E), v, Length limit, TCout, TCin); 20: end for 21: 2HOPCOVER(G = (V,E), TCout, TCin); 0 22: U = ∅; 23: Set all ranks of vertices in V to be 0; 24: end for

graph | density | BFS (ms) | Localized 2-hop (ms) | Speedup
rand2 10k | 0.9943 | 305.894 | 5.423 | 54.407
rand2 20k | 1.0041 | 618.813 | 4.908 | 126.083
rand2 30k | 0.9794 | 963.119 | 5.169 | 186.326
rand2 40k | 1.0050 | 1302.51 | 5.572 | 233.760
rand2 50k | 1.0017 | 1611.34 | 5.753 | 280.087
rand2 60k | 0.9886 | 1923.82 | 5.746 | 334.636
rand2 70k | 0.9748 | 2245.89 | 6.181 | 363.354
rand2 80k | 0.9998 | 2617.19 | 6.354 | 411.896
rand2 90k | 0.9896 | 2934.82 | 6.531 | 449.368
rand2 100k | 1.0079 | 3245.19 | 6.857 | 473.267

Table 4.1: Distance queries of localized 2-hop and BFS on 10,000 random tests; the hop cover range for localized 2-hop is 6 hops, and the BFS search depth is also set to 6. Results are aggregates over the 10,000 tests. Time unit is milliseconds (ms).

BFS. The results are shown in Table 4.1, and Figure 4.3 shows the performance of localized 2-hop versus BFS.

The experiment shows that in large sparse graphs (|E|/|V| ≤ 2), the localized 2-hop method has significantly better performance than the BFS method. More importantly, the distance query performance of localized 2-hop scales well with graph size.

From Figure 4.4 and Figure 4.5, we can also see that the construction time and label size of localized 2-hop are in general proportional to the graph size. The construction time increases a little faster as the graph size increases, but the growth ratio becomes stable once the graph size exceeds 70k. This is understandable because when the graph size is small, the "locality" (i.e., U') takes up a significant portion of the whole graph, so it is not accurate to observe the performance of the localized algorithm at the beginning. The scalability of construction time and label size shows that localized 2-hop is a practical method for large biomedical graphs. By saving the path (instead of the distance) to the hop center, it is easy to see that the localized

38 4

3.5

3

2.5

2 Log10 on time (ms)

1.5

1

0.5 10k 20k 30k 40k 50k 60k 70k 80k 90k 100k Number of Nodes |V|

Figure 4.3: Distance query of localized 2-hop vs BFS, the green line is localized 2-hop and the blue line is the BFS

2-hop can also be used to answer path queries, at the cost of a k-fold increase in label size.

The above performance results of localized 2-hop are on random graphs. When testing on power-law graphs, which model many real biomedical data, we found its performance is not as good as on random graphs. This is because power-law graphs are not statistically homogeneous across sampled subgraphs; therefore, the localized 2-hop method is not optimized for these graphs. To handle power-law graphs, especially the UMLS, one of the largest power-law biomedical networks, we apply another approach, described in the next section.

4.2 Managing the UMLS network

The Unified Medical Language System (UMLS) is one of the large networks that store a wide variety of biomedical concepts. Answering distance and path queries on the UMLS is important for many applications. However, due to the extremely


Figure 4.4: Construct time of localized 2-hop for graph with different size

large size of the UMLS graph, no practical solutions are available to index it for efficiently answering distance and path queries. The graph modeling the UMLS demonstrates a strong power-law property, i.e., a small fraction of vertices have very large degrees. Therefore, as mentioned above, the localized 2-hop scheme is not optimized for indexing it. The UMLS studied in this section is the

UMLS2010AB under its default installation (level 0, no additional license agreements necessary), and downloaded from the UMLS website on Feb 03, 2011.

To handle the UMLS, we first preprocess it into a directed graph, which contains more than 0.6 million unique concepts (vertices) and more than 5 million available links (edges) between concepts. As indicated in our manuscript, 4-10 hops represent the majority of distances between two concepts. Guided by this fact and the power-law property of the UMLS graph, we apply the k-neighborhood Decentralization Labeling Scheme (kDLS) to the UMLS. After preprocessing, we repeatedly pick a largest-degree vertex, delete it from the graph, and broadcast its information to its neighbors. This process is named "decentralizing". We repeat it until all the remaining vertices are isolated, or no vertex remains.

40 250

200

150

Label Size (KB) 100

50

0 10k 20k 30k 40k 50k 60k 70k 80k 90k 100k Number of Nodes |V|

Figure 4.5: Label size of localized 2-hop for graph with different size

To make kDLS work efficiently, the key issue is designing a super-compact label. First, we observe that a three-tuple record is enough to hold the information a node receives from decentralizing. The first element is a vertex id, for which 3 bytes (24 bits) are sufficient (up to 16 million vertices). The last element is the distance: since most shortest distances are within 10 hops, focusing on distances within 255 hops (using 1 byte) is sufficient for applications on the UMLS.

It is somewhat tricky to handle the second element. Instead of using 3 bytes to record the via-neighbor vertex, we use only 2 bytes to record a port number of up to 65,535. A port number is the offset of the neighbor vertex in the vertex adjacency list. There may exist vertices with degree larger than 65,535 in a UMLS graph; in the UMLS graph used for this work, there is only one such vertex, and it is deleted from the graph first. We therefore never need to record an offset larger than 65,535, so 2 bytes are sufficient.

Thus, a vertex only uses 3 + 2 + 1 = 6 bytes (even smaller than an 8-byte long integer) to record the information to or from a decentralized vertex (or 12 bytes for both "to" and "from").
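The following is a sketch of this 6-byte record layout: 3 bytes for the decentralized vertex id, 2 bytes for the port (the offset into the adjacency list), and 1 byte for the distance. The byte order is an assumption for illustration, not the thesis implementation.

```python
import struct

def pack_record(vertex_id, port, distance):
    assert vertex_id < 2**24 and port < 2**16 and distance < 2**8
    # struct has no 3-byte integer type, so the vertex id is packed manually
    return vertex_id.to_bytes(3, "big") + struct.pack(">HB", port, distance)

def unpack_record(buf):
    vertex_id = int.from_bytes(buf[:3], "big")
    port, distance = struct.unpack(">HB", buf[3:6])
    return vertex_id, port, distance

rec = pack_record(vertex_id=1234567, port=42, distance=5)
print(len(rec), unpack_record(rec))   # 6 (1234567, 42, 5)
```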

After the decentralizing process is finished, kDLS is ready to handle various queries with the labels. The query procedure is similar to 2-hop [17, 18, 19], which compares the labels of two query vertices.

To facilitate query processing, we sort the records in the label of each vertex according to the decentralized vertex IDs. Thus, a linear-time comparison between the labels of the two query vertices can answer a shortest distance query.
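A minimal sketch of this linear-time comparison follows: with each label sorted by the decentralized vertex id, the two labels are merged like sorted lists and the smallest combined distance over shared ids is returned. The list-of-tuples layout is an illustrative assumption.

```python
def kdls_distance(label_u, label_v):
    """label_u, label_v: lists of (vertex_id, distance) sorted by vertex_id."""
    i = j = 0
    best = float("inf")
    while i < len(label_u) and j < len(label_v):
        id_u, d_u = label_u[i]
        id_v, d_v = label_v[j]
        if id_u == id_v:
            best = min(best, d_u + d_v)
            i += 1
            j += 1
        elif id_u < id_v:
            i += 1
        else:
            j += 1
    return best
```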

To answer path queries, we need to construct paths by recursively locating the next vertex via the offset in the corresponding record, and searching for the next record within the label of the located vertex. This requires more memory access and thus takes more computing time compared to answering distance query.

Figure 4.6 gives an illustration for answering distance and path queries from x to y.

To understand how kDLS performs in answering queries, we compared it with standard Breadth-First Search (BFS) and Depth-First Search (DFS) on the time to find the shortest distance, the time to find one shortest path, the time for path estimation over the total number of estimated paths, etc. We performed these measurements on randomly generated tests.

Since no other index methods are available for handling the UMLS graph, BFS and DFS are the only methods we compared against. We note, however, that DFS cannot guarantee finding the shortest distance or a shortest path, so it is not used for these two measurements. We mainly focused on kDLS with the broadcast range limited to 6-neighborhoods, since this limit covers a majority of the search space on the UMLS graph while using only approximately 10GB of memory, as suggested in Figure 4.7, and is thus representative of how the UMLS graph is used in many applications. Correspondingly, we limited the BFS and DFS search depth to 6 hops.

Figure 4.6: An illustration of distance and path queries.

In our empirical study, we found that kDLS is far superior to BFS and DFS in answering queries. The speedup typically ranges from 3 to 7 orders of magnitude across the various measurements. In particular, kDLS can estimate the number of paths from one vertex to another very quickly by comparing their labels, without actually finding these paths, while BFS and DFS have to search the graph to obtain an estimation. This is particularly helpful for building a similarity matrix based on path estimation. The application of kDLS path estimation is discussed in Section 4.3.
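Conceptually, the estimation only intersects the two labels: every decentralized vertex recorded in both labels witnesses one kDLS path whose length is the sum of the two recorded distances. The sketch below assumes labels are dictionaries from decentralized vertex id to distance; it illustrates the idea rather than the exact counting used by kDLS.

    def estimate_paths(label_x, label_y, min_length=1):
        # Count kDLS paths from x to y (and their total length) by label comparison only.
        count, total_length = 0, 0
        for w, dx in label_x.items():
            dy = label_y.get(w)
            if dy is not None and dx + dy >= min_length:
                count += 1
                total_length += dx + dy
        return count, total_length

Section 4.3 applies the same idea with paths of length 1 excluded (min_length set to 2) to build the gene-disease similarity matrix.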

Figure 4.7: Number of labels and total label size for different kDLS broadcast ranges.

Finally, we would like to know how scalable kDLS is. To determine this, we calculated the total number of records (recall that a label may contain multiple records) for different broadcast limits. Note that a simple counter suffices to obtain this number in each test, so the memory cost can be estimated in advance without actually creating the records. Figure 4.7 displays the results. Since each record uses only 6 bytes, we can calculate the total memory cost for all labels. We have verified that the actual memory cost is very close to the estimated label size in Figure 4.7, apart from a few hundred MB of additional memory for storing the UMLS graph itself. Thus, according to Figure 4.7, kDLS takes only around 10GB of memory to explore a majority of the search space for knowledge discovery, only around 20GB to explore most of the search space, and ultimately only around 64GB (≈ 68.76 × 10^9 bytes) to explore the whole search space for large-scale knowledge discovery (i.e., k set to the number of vertices).

4.3 Prioritize Cancer Genes by managing the UMLS network

Besides efficiently answering the shortest distance between two concepts and generating a shortest path between them, the k-neighborhood Decentralization Labeling Scheme can also efficiently count the number of kDLS paths (i.e., paths that can be identified by comparing the labels of the starting and ending vertices) and generate all simple kDLS paths, as detailed in our manuscript [25]. These functions extend the basic queries illustrated in Figure 4.6 and are very useful for large-scale knowledge discovery applications that utilize the UMLS network.

In the following we show how kDLS can be applied to prioritizing cancer genes, an important application of large-scale knowledge discovery. For this purpose, we found 348 disease concepts in the UMLS which are associated with gene concepts via at least one of the following semantic relations: “disease has associated gene”, “disease is marked by gene”, “gene associated with disease”. These were treated as confirmed gene-disease relations. Meanwhile, we selected 4,444 concepts in the UMLS with the semantic type “gene or genome”. These 4,444 gene concepts combine two sets of gene concepts. One set includes the gene concepts directly associated with at least one of the 348 disease concepts via the above three semantic relations in the UMLS. The other set comes from mapping the UCSC gene list 1 to the HUGO (Human Genome Organization) data source in the UMLS via the MetaMap program [27].

1Publicly available at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/

CUI        Gene Concept            R value       Percentile
C0079419   TP53 gene               284.110000    0.067507%
C2825744   MIRLET7A1 wt Allele     283.639000    0.112511%
C2825745   MIRLET7A3 wt Allele     283.639000    0.112511%
C1537910   MIRLET7A1 gene          283.622000    0.157516%
C1537912   MIRLET7A3 gene          283.622000    0.157516%
C1705526   TP53 wt Allele          283.237000    0.202520%
C2825796   MIR17 wt Allele         281.032000    0.337534%
C1537713   MIR17 gene              281.016000    0.360036%
C0919524   ATM gene                279.284000    0.562556%
C1705846   ATM wt Allele           278.561000    0.607561%
C2825936   MIR126 wt Allele        278.266000    0.652565%
C1537773   MIR126 gene             278.217000    0.675068%
C2826019   MIR221 wt Allele        277.598000    0.765077%
C1537859   MIR221 gene             277.582000    0.787579%
C2825803   MIR20A wt Allele        277.147000    0.810081%
C1825998   MIR20A gene             277.131000    0.832583%
C1826002   MIR146A gene            276.892000    0.877588%
C2825957   MIR146A wt Allele       276.860000    0.900090%
C1705702   TSSC4 wt Allele         276.072000    1.080108%
C1336686   TSSC4 gene              275.991000    1.102610%
C2825930   MIR124-1 wt Allele      275.501000    1.260126%
C1836306   MIR124-1 gene           275.485000    1.282628%
C1537719   MIR21 gene              275.439000    1.305131%
C2825806   MIR21 wt Allele         275.407000    1.327633%
C1335209   TSPAN32 gene            274.182000    1.687669%
C1705700   TSPAN32 wt Allele       273.662000    2.025203%
C1537774   MIR127 gene             273.278000    2.205221%
C2825937   MIR127 wt Allele        273.246000    2.272727%
C1332764   CLCA2 gene              272.318000    2.677768%
C1332126   AXIN2 gene              271.719000    2.947795%

Table 4.2: The percentiles of the top 30 confirmed disease gene concepts for Breast Carcinoma, by semantic relations “disease has associated gene”, “disease is marked by gene”, or “gene associated with disease”.

Rank  Mark  CUI        Gene Concept            R value       Percentile    PubMed ID
1     √     C0694884   MEN1 gene               287.543000    0.022502%     15168774
2     √     C1384665   HFE gene                286.534000    0.045005%     16503999
3     *     C0079419   TP53 gene               284.110000    0.067507%
4     *     C2825744   MIRLET7A1 wt Allele     283.639000    0.112511%
5     *     C2825745   MIRLET7A3 wt Allele     283.639000    0.112511%
6     *     C1537910   MIRLET7A1 gene          283.622000    0.157516%
7     *     C1537912   MIRLET7A3 gene          283.622000    0.157516%
8     √     C0162832   APC gene                283.384000    0.180018%     10874025
9     *     C1705526   TP53 wt Allele          283.237000    0.202520%
10          C1415882   IDS gene                283.208000    0.225023%
11          C1419102   PTPN22 gene             281.453000    0.247525%
12    √     C2825818   MIR29B2 wt Allele       281.199000    0.270027%     19048628
13    √     C1537734   MIR29B2 gene            281.149000    0.292529%     19048628
14          C1705386   PTPN22 wt Allele        281.067000    0.315032%
15    *     C2825796   MIR17 wt Allele         281.032000    0.337534%
16    *     C1537713   MIR17 gene              281.016000    0.360036%
17    √     C1416498   ITGB3 gene              280.993000    0.382538%     16317580
18          C1332802   CTLA4 gene              280.824000    0.405041%
19    √     C0694889   RB1 gene                280.809000    0.427543%     11108660
20          C1705969   CTLA4 wt Allele         280.633000    0.450045%
21    √     C1706243   PPARG wt Allele         280.558000    0.472547%     9660931
22          C2825817   MIR29B1 wt Allele       279.573000    0.495050%
23          C1835840   MIR29B1 gene            279.557000    0.517552%
24    √     C1335238   PPARG gene              279.333000    0.540054%     9660931
25    *     C0919524   ATM gene                279.284000    0.562556%
26    √     C1366515   LEPR gene               278.706000    0.585059%     19697123
27    *     C1705846   ATM wt Allele           278.561000    0.607561%
28          C1439329   CBS gene                278.455000    0.630063%
29    *     C2825936   MIR126 wt Allele        278.266000    0.652565%
30    *     C1537773   MIR126 gene             278.217000    0.675068%

Table 4.3: Top thirty ranked gene concepts for Breast Carcinoma. Confirmed disease gene concepts (by semantic relations “disease has associated gene”, “disease is marked by gene”, or “gene associated with disease”) are marked by *. Gene concepts that have been studied for Breast Cancer in the literature are marked by √; one publication for each of these gene concepts is listed by its PubMed ID.

CUI        Gene Concept       R value        Percentile
C1704743   BRCA1 wt Allele    1025.620000    0.067507%
C1705143   BRCA2 wt Allele    874.877000     2.812781%
C0376571   BRCA1 gene         796.161000     7.875788%
C0598034   BRCA2 gene         760.271000     11.408641%

Table 4.4: The percentiles of all 4 confirmed disease gene concepts for Breast Cancer, Familial, by semantic relations “disease has associated gene”, “disease is marked by gene”, or “gene associated with disease”.

We measure the relationship between a gene concept and a disease concept by considering the number of kDLS paths between them and the average length of these paths. To make this measurement unbiased with respect to the semantic relations that directly connect a gene concept and a disease concept, we do not count any paths of length 1 between them. Formally, the closeness between a gene concept x and a disease concept y is measured as follows.

Let P(x,y) be the set of paths from x to y excluding paths of length 1, and let P(y,x) be the set of paths from y to x excluding paths of length 1. Let d_{P(x,y)} and d_{P(y,x)} be the average lengths of the paths in P(x,y) and P(y,x), respectively, i.e.,

d_{P(x,y)} = \frac{1}{|P(x,y)|} \sum_{p \in P(x,y)} \mathrm{length}(p), \qquad d_{P(y,x)} = \frac{1}{|P(y,x)|} \sum_{p \in P(y,x)} \mathrm{length}(p).

The relationship between a gene concept and a disease concept is then measured by

R(x,y) = \frac{(|P(x,y)| + |P(y,x)|)^2}{d_{P(x,y)} \cdot |P(x,y)| + d_{P(y,x)} \cdot |P(y,x)|},

which is the total number of paths divided by their average length. Therefore, the closeness of the relationship between a gene concept and a disease concept is proportional to the number of kDLS paths (of length at least 2) connecting them, and inversely proportional to the average length of these paths. With this measurement, we can prioritize gene concepts with respect to disease concepts.
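Given the path lengths, R(x, y) reduces to a few lines of arithmetic; the following is a minimal sketch in which each path is represented only by its length and length-1 paths are assumed to be already excluded.

    def closeness(path_lengths_xy, path_lengths_yx):
        # R(x, y) = (|P(x,y)| + |P(y,x)|)^2 / (sum of all path lengths),
        # i.e. the total number of paths divided by their average length.
        n = len(path_lengths_xy) + len(path_lengths_yx)
        total_length = sum(path_lengths_xy) + sum(path_lengths_yx)
        if total_length == 0:
            return 0.0  # no qualifying paths between the two concepts
        return n * n / total_length

    # Example: paths of lengths 2, 2, 3 from x to y and 2, 4 from y to x
    # give R = 5^2 / 13 = 25/13, roughly 1.92.
    assert abs(closeness([2, 2, 3], [2, 4]) - 25 / 13) < 1e-12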

Rank  Mark  CUI        Gene Concept            R value        Percentile    PubMed ID
1     √     C1705526   TP53 wt Allele          1071.210000    0.022502%     16551709
2     √     C1705310   PTEN wt Allele          1039.270000    0.045005%     9072974
3     *     C1704743   BRCA1 wt Allele         1025.620000    0.067507%
4     √     C1705628   RB1 wt Allele           1009.750000    0.090009%     8616837
5           C1707160   CTNNB1 wt Allele        992.704000     0.112511%
6     √     C1708817   MDM2 wt Allele          976.469000     0.135014%     11178989
7     √     C1705846   ATM wt Allele           976.125000     0.157516%     16832357
8           C1368192   KRAS2 gene              969.935000     0.180018%
9     √     C1705766   APC wt Allele           965.605000     0.202520%     10874025
10    √     C1704866   SMAD4 wt Allele         964.959000     0.225023%     16489022
11    √     C2825744   MIRLET7A1 wt Allele     962.573000     0.247525%     19246618
12    √     C1384665   HFE gene                959.355000     0.270027%     14973098
13          C1705702   TSSC4 wt Allele         959.042000     0.292529%
14    √     C1705280   CDKN2A wt Allele        957.665000     0.315032%     15879498
15          C1705700   TSPAN32 wt Allele       956.340000     0.337534%
16          C2825793   MIR15A wt Allele        955.208000     0.360036%
17    √     C1706590   BCL10 wt Allele         954.700000     0.382538%     10408401
18          C1335209   TSPAN32 gene            953.125000     0.405041%
19    √     C1537910   MIRLET7A1 gene          952.441000     0.427543%     19246618
20          C1336686   TSSC4 gene              951.540000     0.450045%
21          C1537709   MIR15A gene             950.946000     0.472547%
22          C1537912   MIRLET7A3 gene          949.723000     0.495050%
23          C2825745   MIRLET7A3 wt Allele     949.601000     0.517552%
24    √     C1367449   BCL10 gene              947.860000     0.540054%     10408401
25    √     C1706047   CHEK2 wt Allele         947.651000     0.562556%     12094328
26          C1332803   CTNNB1 gene             947.269000     0.585059%
27    √     C1706879   BAX wt Allele           945.097000     0.607561%     11895913
28          C1705041   NBN wt Allele           944.958000     0.630063%
29          C1332806   CTSH gene               942.869000     0.652565%
30          C1705982   KRAS wt Allele          942.386000     0.675068%

Table 4.5: Top thirty ranked gene concepts for Breast Cancer, Familial. Confirmed disease gene concepts (by semantic relations “disease has associated gene”, “disease is marked by gene”, or “gene associated with disease”) are marked by *. Gene concepts that have been studied for Breast Cancer in the literature are marked by √; one publication for each of these gene concepts is listed by its PubMed ID.

We evaluate our knowledge discovery results by fold-enrichment [28, 29, 30], which measures how well the known gene concepts are ranked with respect to their associated disease concepts. Typically, if only one disease concept is considered, the fold-enrichment is the maximum α/β such that α% of the known causative gene concepts are ranked within the top β% of gene concepts for this disease concept. The measurement can be extended to a set of gene concepts and a set of disease concepts by considering the two percentiles in aggregate. It is easy to see that if the results are not statistically significant, the fold-enrichment value will be close to 1.
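For a single disease concept, the criterion can be sketched as follows; the percentile positions of the confirmed gene concepts are assumed to be given, and the function simply searches for the prefix of the ranking that maximizes α/β. This reflects our reading of the definition above, not a reference implementation.

    def fold_enrichment(confirmed_percentiles):
        # confirmed_percentiles: percentile positions (in %) of the confirmed gene
        # concepts within the full ranking, e.g. [0.07, 0.11, 2.3, ...]
        ranks = sorted(confirmed_percentiles)
        best = 0.0
        for i, beta in enumerate(ranks, start=1):
            alpha = 100.0 * i / len(ranks)   # % of confirmed genes within the top beta%
            if beta > 0:
                best = max(best, alpha / beta)
        return best

    # Example: if half of the confirmed gene concepts fall within the top 1% of the
    # ranking, the fold-enrichment is at least 50 / 1 = 50.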

By excluding paths of length 1 and accumulating, for each concept pair, the number of valid paths and the sum of their lengths, we build the similarity matrix between the 4,444 gene concepts and the 348 disease concepts in only a few minutes. Using the fold-enrichment evaluation, we obtain a fold-enrichment value of around 42.3, which demonstrates the strong statistical significance of our result. More importantly, and unlike existing methods [28, 29, 30], our result is both unbiased (i.e., the relationship between a gene concept and a disease concept is not affected by the semantic relations directly connecting them, since paths of length 1 are removed) and complete (i.e., all relationships between the 4,444 gene concepts and the 348 disease concepts are evaluated).

To provide a detailed look at the results, we selected breast cancer, one of the most common cancers affecting humans. In particular, we focus on the disease concepts C0678222 (Breast Carcinoma) and C0346153 (Breast Cancer, Familial). The former term denotes general breast cancer, while the latter is a special case in which there is a history of breast cancer in closely related family members. Table 4.2 lists the percentiles of the top 30 confirmed disease gene concepts in the ranking, and we can see that known disease gene concepts for Breast Carcinoma are generally ranked high in percentile. Table 4.3 lists the 30 top-ranked gene concepts for Breast Carcinoma. Table 4.4 lists the percentiles of all 4 confirmed disease gene concepts in the ranking for Breast Cancer, Familial, and Table 4.5 lists the 30 top-ranked gene concepts for Breast Cancer, Familial.

Gene concepts in Table 4.3 marked by * are confirmed disease gene concepts by the semantic relations “disease has associated gene”, “disease is marked by gene”, or “gene associated with disease”. For the remaining genes, we searched the literature for their relationship with Breast Carcinoma and found that more than half of the gene concepts have been studied for Breast Carcinoma (marked by √ with the PubMed ID of one publication). Some studies are very recent; for example, the relationship between the LEPR gene and breast carcinoma was examined in a very recent work [31]. For the remaining gene concepts not found in the literature for Breast Carcinoma, our ranking nevertheless suggests that they are very likely to be associated with Breast Carcinoma. This information provides promising research directions for biologists and clinicians. For Breast Cancer, Familial, we performed the same experiment, and the result is shown in Table 4.5.

CHAPTER 5

CONCLUSION AND FUTURE WORK

We proposed network mining and management methods to handle very large biomedical data for knowledge discovery on cancer-associated genes. To handle gene-coexpression networks, we proposed preQCM, an improved algorithm over QCM for mining homogeneous biomedical networks in which each edge has a weight. The preQCM algorithm has the flexibility of selecting the search space (i.e., the starting edges), ranging from a complete search to a selected search corresponding to an independent edge set. More importantly, the results returned by preQCM under different parameters, which adjust the dense components, are consistent with respect to a given search-space selection. The experimental results also show that preQCM is able to find more promising potential biomarkers from the gene-coexpression data. Correspondingly, we also proposed methods to index large biomedical data for knowledge discovery. As an extension of the classic 2-hop labeling scheme, we proposed localized 2-hop, which scales well on large random datasets and can answer distance and path queries efficiently on similar types of biomedical data. As many real biomedical data show a power-law vertex degree distribution, we designed the k-neighborhood Decentralization Labeling Scheme (kDLS) to index them. kDLS not only answers distance and path queries very efficiently, but also enables large-scale knowledge discovery between two groups of concepts. The empirical studies on breast cancer related genes demonstrate the effectiveness of kDLS in knowledge discovery and hypothesis generation for cancer-related research.

Given the methods we have proposed for mining and managing cancer-related networks, future work involves improving and optimizing our algorithms to make them more efficient. For the network mining algorithm, future work will focus on how to further explore the search space without significantly increasing the computational cost. For the network indexing and management scheme, future work will focus on how to further reduce the label size so as to handle the extremely large networks emerging from future biomedical applications.

BIBLIOGRAPHY

[1] R. M. Karp, “Reducibility among combinatorial problems,” R. E. Miller and J. W. Thatcher (editors), Complexity of Computer Computations, pp. 85–103, 1972.

[2] C. Bron and J. Kerbosch, “Algorithm 457: finding all cliques of an undirected graph,” Commun. ACM, vol. 16, pp. 575–577, 1973.

[3] F. Cazals and C. Karande, “A note on the problem of reporting maximal cliques,” Theoretical Computer Science, vol. 407, no. 1-3, pp. 564–568, 2008.

[4] J. Abello, M. G. C. Resende, and S. Sudarsky, “Massive quasi-clique detection,” in Proceedings of the 5th Latin American Symposium on Theoretical Informatics, ser. LATIN ’02, 2002, pp. 598–612.

[5] S. Seidman, “Network structure and minimum degree,” Social Networks, vol. 5, no. 3, pp. 269–287, 1983.

[6] S. Seidman and B. Foster, “A graph-theoretic generalization of the clique con- cept,” The Journal of Mathematical Sociology, vol. 6, no. 1, pp. 139–154, 1978.

[7] F. Geerts, B. Goethals, and T. Mielikäinen, “Tiling databases,” in Discovery Science, 2004, pp. 278–289.

[8] A. Gionis, H. Mannila, and J. K. Seppänen, “Geometric and combinatorial tiles in 0-1 data,” in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, ser. PKDD ’04. Springer-Verlag New York, Inc., 2004, pp. 173–184.

[9] Y. Xiang, R. Jin, D. Fuhry, and F. F. Dragan, “Summarizing transactional databases with overlapped hyperrectangles,” Data Mining and Knowledge Dis- covery, vol. 23, no. 2, pp. 215–251, 2011.

[10] R. Jin, Y. Xiang, H. Hong, and K. Huang, “Block interaction: a generative sum- marization scheme for frequent patterns,” in Proceedings of the ACM SIGKDD Workshop on Useful Patterns, ser. UP ’10. ACM, 2010, pp. 55–64.

[11] Y. Xiang, P. R. Payne, and K. Huang, “Transactional database transformation and its application in prioritizing human disease genes,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press, DOI: http://dx.doi.org/10.1109/TCBB.2011.58.

[12] Y. Ou and C.-Q. Zhang, “A new multimembership clustering method,” Journal of Industrial and Management Optimization, vol. 3, no. 4, pp. 619–624, 2007.

[13] H. Yildirim, V. Chaoji, and M. Zaki, “GRAIL: scalable reachability index for large graphs,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 276–284, 2010.

[14] R. Jin, Y. Xiang, N. Ruan, and H. Wang, “Efficiently answering reachability queries on very large directed graphs,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ser. SIGMOD ’08. ACM, 2008, pp. 595–608.

[15] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry, “3-hop: a high-compression indexing scheme for reachability query,” in Proceedings of the 35th SIGMOD international conference on Management of data, ser. SIGMOD ’09. ACM, 2009, pp. 813–826.

[16] R. Jin, H. Hong, H. Wang, N. Ruan, and Y. Xiang, “Computing label-constraint reachability in graph databases,” in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD ’10. ACM, 2010, pp. 123–134.

[17] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, “Reachability and Distance Queries via 2-Hop Labels,” SIAM Journal on Computing, vol. 32, no. 5, pp. 1338–1355, 2003.

[18] R. Schenkel, A. Theobald, and G. Weikum, “Hopi: An efficient connection index for complex xml document collections,” in EDBT, 2004, pp. 237–255.

[19] ——, “Efficient creation and incremental maintenance of the hopi index for complex xml document collections,” in Proceedings of the 21st International Conference on Data Engineering, ser. ICDE ’05. IEEE Computer Society, 2005, pp. 360–371.

[20] J. Cheng and J. X. Yu, “On-line exact shortest distance query processing,” in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ser. EDBT ’09. ACM, 2009, pp. 481–492.

[21] Y. Xiao, W. Wu, J. Pei, W. Wang, and Z. He, “Efficiently indexing shortest paths by exploiting symmetry in graphs,” in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ser. EDBT ’09. ACM, 2009, pp. 493–504.

[22] F. Wei, “Tedi: efficient shortest path query answering on graphs,” in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD ’10. ACM, 2010, pp. 99–110.

[23] D. Peleg, “Proximity-preserving labeling schemes,” J. Graph Theory, vol. 33, pp. 167–176, March 2000.

[24] V. Chepoi, F. Dragan, B. Estellon, M. Habib, Y. Vaxes, and Y. Xiang, “Additive Spanners and Distance and Routing Labeling Schemes for Hyperbolic Graphs,” Algorithmica, in press, DOI: http://dx.doi.org/10.1007/s00453-010-9478-x.

[25] Y. Xiang, K. Lu, S. L. James, T. B. Borlawsky, K. Huang, and P. R. Payne, “k-neighborhood decentralization: A comprehensive solution to index the UMLS for large scale knowledge discovery,” Manuscript under submission, 2011.

[26] J. Zhang, Y. Xiang, L. Ding et al., “Using gene co-expression network analysis to predict biomarkers for chronic lymphocytic leukemia,” BMC bioinformatics, vol. 11, no. Suppl 9, p. S5, 2010.

[27] A. Aronson, “Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.” in Proceedings of the AMIA Symposium. American Medical Informatics Association, 2001, p. 17.

[28] K. Gaulton, K. Mohlke, and T. Vision, “A computational system to select can- didate genes for complex human traits,” Bioinformatics, vol. 23, no. 9, p. 1132, 2007.

[29] X. Wu, R. Jiang, M. Zhang, and S. Li, “Network-based global inference of human disease genes,” Molecular Systems Biology, vol. 4, no. 1, 2008.

[30] B. Linghu, E. Snitkin, Z. Hu, Y. Xia, and C. DeLisi, “Genome-wide prioritiza- tion of disease genes and identification of disease-disease associations from an integrated human functional linkage network,” Genome Biol, vol. 10, no. 9, p. R91, 2009.

[31] R. Cleveland, M. Gammon, C.-M. Long, M. Gaudet, S. Eng, S. Teitelbaum, A. Neugut, and R. Santella, “Common genetic variations in the LEP and LEPR genes, obesity and breast cancer incidence and survival,” Breast Cancer Research and Treatment, vol. 120, pp. 745–752, 2010.
