
EXPLORING CANCER-ASSOCIATED GENES BY NETWORK MINING AND MANAGEMENT

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of the Ohio State University

By
Kewei Lu, B.E.

Graduate Program in Computer Science and Engineering
The Ohio State University
2011

Thesis Committee:
Dr. Kun Huang, Advisor
Dr. Raghu Machiraju

© Copyright by Kewei Lu 2011

ABSTRACT

Biomedical data, including cancer-related data, are widely available in the form of networks in which nodes are biomedical objects and edges are relationships between objects. Some of these networks, such as gene-coexpression networks, have nodes of a single type, while others, such as the UMLS network, have nodes of various types interacting with each other. Both kinds of networks contain very rich hidden information beyond the obvious facts conveyed by the original data.

To turn this hidden information into useful knowledge for cancer research, we developed network mining and management methods for these networks. Our network mining method focuses on identifying candidate gene markers by finding dense components in gene-coexpression networks, which have homogeneous nodes. In addition, we propose methods to efficiently manage very large networks. By managing the UMLS network, which is formed from data from various sources, we are able to prioritize genes with respect to cancers and generate hypotheses that are very informative for cancer research.

Dedication

This thesis is dedicated to my family.

ACKNOWLEDGMENTS

I would like to express my gratitude to all those who helped me during the writing of this thesis.

First of all, I would like to thank my advisor, Dr. Kun Huang, for his patience, encouragement, and professional advice during my graduate study. Without his patient instruction and expert guidance, I could not have made such progress during my graduate study. It has been an honor for me to work with him.
I would like to thank Dr. Raghu Machiraju, a member of my committee, for his valuable help and suggestions.

I am also grateful to Dr. Yang Xiang and Dr. Jie Zhang for their helpful suggestions and support. They always gave me valuable advice whenever I had questions about either graduate study or research.

Special thanks to my friends. Their support and friendship encouraged me a lot.

Last, I would like to thank my beloved family for their love, help, and great confidence in me through all these years.

VITA

2009: B.E. Computer Science and Technology, Wuhan University of Technology, China
2009 to present: M.S. Computer Science and Engineering, The Ohio State University
2010 to present: Student Assistant, Comprehensive Cancer Center, The Ohio State University

PUBLICATIONS

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

CHAPTER
1 Introduction
  1.1 Related work
  1.2 Thesis Organization
2 Problem Formulation
3 Gene Co-expression Network Mining
  3.1 Data Preprocess
  3.2 Building Gene Co-expression Network
  3.3 Mining Subgraphs with Bounded Density
  3.4 Weighted Subgraph Pattern Mining for Biomedical Applications
  3.5 Discovering Candidate Cancer Biomarkers
4 Network Management
  4.1 Handling large networks
  4.2 Managing the UMLS network
  4.3 Prioritize Cancer Genes by managing the UMLS network
5 Conclusion and Future work
Bibliography

LIST OF FIGURES

FIGURE
3.1 Plot of expression value versus samples for CSN2.
3.2 Plot of expression value versus samples for C17orf99.
3.3 Plot of expression value versus samples for A1BG.
3.4 The distribution of PCC for GSE18864.
3.5 The flow chart of selecting the start edges (k=0.9). The numbers are the weights of the edges.
The red edges are the edges that have been added to the set of start edges; the red and blue edges are the edges that have been covered; the other edges have not been covered.
3.6 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI)
3.7 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI LN-POS)
3.8 Survival test of the 70-gene breast cancer gene signature versus our cluster with a smaller p-value (NKI ER-NEG)
3.9 Visualization of the co-expression network for the gene cluster of NKI with a smaller p-value than the van 't Veer 70 genes
3.10 Visualization of the co-expression network for the gene cluster of NKI LN-POS with a smaller p-value than the van 't Veer 70 genes
3.11 Visualization of the co-expression network for the gene cluster of NKI ER-NEG with a smaller p-value than the van 't Veer 70 genes
4.1 The flow chart of localized 2-hop.
4.2 Illustration of greedily selecting a batch of vertices.
4.3 Distance query of localized 2-hop versus BFS; the green line is localized 2-hop and the blue line is BFS
4.4 Construction time of localized 2-hop for graphs of different sizes
4.5 Label size of localized 2-hop for graphs of different sizes
4.6 An illustration of distance and path queries.
4.7 Number of labels and total label size for different kDLS broadcast ranges.

CHAPTER 1

INTRODUCTION

A large portion of cancer-related biomedical data is aggregated in the form of networks, also known as graphs. In these networks, nodes typically represent biomedical objects, and an edge between two nodes represents a relationship between the corresponding objects. Some of these networks, such as gene-coexpression networks, have nodes of a single type, while others, such as the UMLS network, have nodes of various types interacting with each other.
In this thesis, I call the former "homogeneous networks" and the latter "heterogeneous networks". Both kinds of networks contain very rich hidden information beyond the obvious facts conveyed by the original data.

Given data in this form, mining and managing these networks efficiently becomes an important application for areas such as algorithmic graph theory, graph mining, and graph databases. Dense component mining algorithms for gene-coexpression networks, a typical type of homogeneous network, often lead to the discovery of gene clusters that are candidate biomarkers. Graph indexing schemes on heterogeneous networks enable knowledge discovery via efficient graph queries such as reachability, distance, and path queries. In the next section, I review the related techniques for mining and managing these networks.

1.1 Related work

In many networks, dense components themselves are important clusters. Detecting dense components is a nontrivial task for these networks. The problem seems straightforward, but even the simplest version of it is NP-hard: it is NP-hard to find a maximum clique in an undirected graph [1]. On the other hand, managing or indexing a network for answering reachability, distance, and shortest path queries has polynomial-time solutions. However, the availability of polynomial-time solutions by no means implies that indexing these networks is free of challenges. On the contrary, it is practically impossible to use brute-force methods to efficiently manage large networks. When the network is very large, as it is in some biomedical applications, indexing becomes a problem even for most of the latest methods.

Clique and Quasi-Clique Mining. Although listing all maximal cliques is NP-hard, it is still possible to list them efficiently in some cases involving small and sparse graphs. A classical algorithm for enumerating all maximal cliques is proposed in [2] and redescribed in [3].
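As a concrete illustration of maximal clique enumeration, the following is a minimal Python sketch of the well-known Bron-Kerbosch recursion without pivoting. It is offered only as an example of this family of algorithms; whether it matches the method of [2] or [3] is an assumption. The function name `maximal_cliques`, the adjacency-dictionary input format, and the toy graph are hypothetical choices made for the example.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops).
    This is the basic Bron-Kerbosch recursion without pivoting, so the
    running time is exponential in the worst case.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique
        # p: candidate vertices adjacent to every vertex in r
        # x: vertices already tried, used to avoid reporting non-maximal cliques
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy graph: a triangle a-b-c plus a pendant edge c-d.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
found = set(maximal_cliques(toy))
```

On the toy graph the sketch reports two maximal cliques, the triangle and the pendant edge. The approach is practical only for small or sparse inputs, which is consistent with the remark that efficient listing is possible in some such cases.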
However, the definition of a clique is so strict that it excludes many real-world cases, especially in the presence of noise. Therefore, a good number of works focus on finding quasi-cliques instead [4, 5, 6]. Though the detailed definition of a quasi-clique varies across the literature, a quasi-clique is often considered a generalization of a clique. Thus the concept of a quasi-clique is better suited to modeling dense components in real networks.

Biclique and Frequent Itemset Mining. A bipartite network (or bipartite graph) contains two types of nodes, with edges connecting nodes of different types. The concept of a bipartite network lies between homogeneity and heterogeneity. Networks of this kind also appear in biomedical data, such as gene expression data where rows are genes and columns are patients (or vice versa). A dense component in such a network is a Cartesian product of two sets of vertices of different types. These Cartesian products go by other names in different works, such as tiles [7, 8], hyperrectangles [9], and blocks [10], and these works propose corresponding mining algorithms for dense components. The biclique mining problem connects nicely to the closed frequent itemset mining problem. This connection has been used for maximal biclique generation in [11] for effective knowledge discovery in (0,1)-matrices. However, to the best of our knowledge, similar connections are not available for clique or quasi-clique mining, so we cannot ease our mining tasks by borrowing techniques from frequent itemset mining. Nevertheless, our method for finding dense components in homogeneous networks provides new insight into the biclique mining problem in bipartite graphs.

Weighted Dense Subgraph Mining. Most available works consider only unweighted graphs, in which edges have no weights (or, in other words, unit weights).
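The distinction drawn above between unweighted and weighted graphs, and between cliques and quasi-cliques, can be illustrated with a small Python sketch. It assumes one common density-based definition: the density of a vertex subset is the sum of the edge weights inside it divided by the number of vertex pairs, and a subset counts as a quasi-clique when this density reaches a threshold. This definition, the threshold name `gamma`, and the function names are assumptions for illustration only, since quasi-clique definitions vary across the literature.

```python
from itertools import combinations


def weighted_density(vertices, weight):
    """Sum of edge weights inside `vertices` divided by the number of vertex pairs.

    weight maps frozenset({u, v}) to a non-negative edge weight; a missing
    pair is treated as weight 0 (no edge). With unit weights this reduces to
    the ordinary edge density of the induced subgraph.
    """
    vertices = list(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0
    total = sum(weight.get(frozenset(pair), 0.0)
                for pair in combinations(vertices, 2))
    return total / (n * (n - 1) / 2)


def is_quasi_clique(vertices, weight, gamma):
    """Assumed definition: density of the subset is at least gamma."""
    return weighted_density(vertices, weight) >= gamma


# Hypothetical unweighted graph encoded with unit weights:
# a triangle a-b-c plus a pendant edge c-d.
unit = {frozenset(e): 1.0 for e in [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}

triangle_density = weighted_density(["a", "b", "c"], unit)
whole_graph_density = weighted_density(["a", "b", "c", "d"], unit)
```

Under this assumed definition a clique with unit weights has density 1, and lowering `gamma` below 1 admits subsets with a few missing edges, which is one way of tolerating noise. Replacing the unit weights with real-valued weights (for example, correlation values) changes only the dictionary, not the functions.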