Gene Co-Expression Network Mining Using Graph Sparsification

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Jinchao Di

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2013

Master's Examination Committee: Dr. Kun Huang, Advisor; Dr. Raghu Machiraju; Dr. Yuejie Chi

Copyright by Jinchao Di 2013

Abstract

Identifying and analyzing gene modules is important since it can help understand gene function globally and reveal the underlying molecular mechanism. In this thesis, we propose a method combining local graph sparsification and a gene similarity measurement method, TOM. This method can detect gene modules from a weighted gene co-expression network more effectively, since we set a local threshold that can detect clusters with different densities. In our algorithm, we retain only one edge for each gene, which can generate more balanced gene clusters. To evaluate the effectiveness of our algorithm, we use DAVID to functionally evaluate the gene list for each module we detect and compare our results with those of some well-known algorithms. The results of our algorithm show better biological relevance than the compared methods, and the number of meaningful biological clusters is much larger than that of the other methods, implying that we discover some previously missed gene clusters. Moreover, we carry out a robustness test by adding Gaussian noise with different variances to the expression data. We find that our algorithm is robust to noise. We also find that some hub genes considered to be important genes could be artifacts.

Dedication

This document is dedicated to my family.

Acknowledgments

I would like to offer my gratitude to my advisor, Dr. Kun Huang, for his guidance. I would like to thank Dr. Raghu Machiraju for providing suggestions for my project.
I would like to thank Dr. Yuejie Chi for being my committee member. I would like to thank Chao Wang, Qihang Li, Nan Meng and Hao Ding for the help they offered me during my study at OSU.

Vita

2011: B.S. Electrical and Computer Engineering, Harbin Institute of Technology, China
2011 to present: Graduate Student, ECE Graduate Program, The Ohio State University, USA

Fields of Study

Major Field: Electrical and Computer Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
1.1 Background
1.2 Previous work
1.3 Problem statement
1.4 Thesis statement
1.5 Roadmap
2. Biological Network Structure
2.1 Biological Network
2.2 Scale Free Topology
3. Sparsification of WGCN with TOM
3.1 Datasets Used in this Thesis
3.2 Algorithm
3.2.1 Co-expression Measure
3.2.2 Local Graph Sparsification
3.2.3 Choosing the Parameter for Adjacency Function
3.3 Properties and Observation
3.3.1 Property 1: In a cluster with N nodes, there are N − 1 edges
3.3.2 Observation
4. Experiment and Evaluation
4.1 Gene Ontology Analysis
4.1.1 Algorithms Comparison
4.1.2 Evaluation Method
4.2 Robustness test
5. Discussion and Conclusion
Bibliography

List of Tables

4.1 The biological relevance analysis result for both ABA and BC dataset using the four algorithms
4.2 Number of common genes of BC dataset and BC dataset with noise when k=1
4.3 Hub genes of clusters of BC dataset and hub genes of corresponding clusters of BC dataset with noise when k=1
4.4 Number of common genes of BC dataset and BC dataset with noise when k=5
4.5 Hub genes of clusters of BC dataset and hub genes of corresponding clusters of BC dataset with noise when k=5
4.6 Number of common genes of BC dataset and BC dataset with noise when k=10
4.7 Hub genes of clusters of BC dataset and hub genes of corresponding clusters of BC dataset with noise when k=10
4.8 Number of common genes of BC dataset and BC dataset with noise when k=20
4.9 Number of common genes of ABA dataset and ABA dataset with noise when k=1
4.10 Hub genes of clusters of ABA dataset and hub genes of corresponding clusters of ABA dataset with noise when k=1
4.11 Number of common genes of ABA dataset and ABA dataset with noise when k=5
4.12 Hub genes of clusters of ABA dataset and hub genes of corresponding clusters of ABA dataset with noise when k=1
4.13 Number of common genes of ABA dataset and ABA dataset with noise when k=10
4.14 Number of common genes of ABA dataset and ABA dataset with noise when k=20

List of Figures

2.1 Degree distribution for biological data: breast cancer data and ABA data; (a) degree distribution p(k) ~ k of breast cancer data; (b) degree distribution log(p(k)) ~ log(k) for breast cancer data; (c) degree distribution p(k) ~ k of ABA data; (d) degree distribution log(p(k)) ~ log(k) for ABA data
2.2 Degree distribution for random network with 1000 nodes and 100 samples
3.1 Maximum cluster size as e changes; (a) Maximum cluster size as e changes for GDS2250 breast cancer dataset; (b) Maximum cluster size as e changes for ABA dataset
3.2 Cluster size distribution for e = 0, e = 2 and e = 5; (a) Cluster size distribution for GDS2250 breast cancer dataset; (b) Cluster size distribution for ABA dataset
3.3 Maximum cluster size as β changes; (a) Maximum cluster size as β changes for GDS2250 breast cancer dataset; (b) Maximum cluster size as β changes for ABA dataset
3.4 Cluster size distribution as β changes; (a) Cluster size distribution for GDS2250 breast cancer dataset, β = 2, β = 4, β = 6, β = 8; (b) Cluster size distribution for ABA dataset, β = 6, β = 8, β = 10, β = 12
3.5 Visualization result of a subnetwork of breast cancer dataset
3.6 Visualization result of a subnetwork of simulated dataset

Chapter 1 Introduction

1.1 Background

A gene is a subunit of DNA that usually codes for a particular protein and contributes to phenotype [29]. Genes do not work in isolation; interactions among genes carry out biological functions [28]. A gene module is a group of genes that share common biological properties and functions. It is important to identify and analyze gene modules, since doing so can help us understand gene function globally and reveal the underlying molecular mechanism. A disease-related gene module may also contribute to therapy development, since it can elucidate how an abnormal phenotype is induced [2].

If two genes are regulated by common transcription factors, they tend to be on the same pathway and have similar expression patterns [23]. Therefore, the genes within a gene module tend to have similar expression patterns, a property called co-expression [14]. Microarrays measure the expression levels of many genes in one biological sample. We can identify gene modules by analyzing the gene co-expression patterns in a microarray gene expression dataset.

1.2 Previous work

Many methods can be used to measure the strength of gene co-expression. One of the most commonly used metrics is the Pearson correlation coefficient (PCC) [8]. The Pearson correlation coefficient is defined as the ratio of the covariance of two gene expression profiles (vectors) to the product of their standard deviations.
Usually, we use the absolute value of the Pearson correlation, which lies between 0 and 1. The higher the value, the more strongly the two genes are co-expressed. If the absolute value of the correlation is close to one, the two genes are highly co-expressed. In contrast, the two genes are uncorrelated if the value is close to zero. The Pearson correlation coefficient measures only the linear dependence between two genes, so it may miss non-linear relationships between them. A low absolute correlation does not necessarily mean independence.

Mutual information offers a more general measurement of the similarity of gene expression [26]. The concept of entropy was introduced by Shannon [25]; it is the self-information of the expression profile of a gene. This concept can be extended to mutual information, which quantifies the dependence between the expression profiles of two genes [4]. Assume we have a random variable A with M_A possible states a_i, i = 1, ..., M_A. The entropy H(A) of the variable is defined as the negative sum, over all states, of the probability of each state times the logarithm of that probability:

$$H(A) = -\sum_{i=1}^{M_A} p(a_i) \log p(a_i). \tag{1}$$

The mutual information I(A, B) of variables A and B is defined as the sum of the entropy of A and the entropy of B minus the joint entropy of A and B [4]:

$$I(A, B) = H(A) + H(B) - H(A, B), \tag{2}$$

where the joint entropy of A and B is defined as:

$$H(A, B) = -\sum_{i=1}^{M_A} \sum_{j=1}^{M_B} p(a_i, b_j) \log p(a_i, b_j). \tag{3}$$

To analyze gene expression data using mutual information, one can partition the expression data into bins and estimate the probability of each bin by counting the number of data points that fall into it [5]. The mutual information of two genes can then be calculated from the definitions above.

Some studies show that both Pearson correlation and mutual information can generate good results in measuring co-expression [16][13]. However, there is no clearly demonstrated advantage of mutual information over PCC [15].
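As a rough illustration of the two measures discussed in this section, the following Python sketch computes the absolute PCC and a binned estimate of mutual information following Eqs. (1) to (3). It is not code from the thesis: the function names, the use of NumPy, the equal-width bins, the bin count, and the natural logarithm are all assumptions made for the example.

```python
import numpy as np


def abs_pearson(x, y):
    """Absolute Pearson correlation of two expression profiles (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    # Covariance divided by the product of standard deviations;
    # the 1/n factors cancel, so plain sums are enough.
    return abs((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def entropy(counts):
    """Entropy (in nats) of a histogram of counts, as in Eq. (1) and Eq. (3)."""
    counts = np.asarray(counts, dtype=float).ravel()
    p = counts[counts > 0] / counts.sum()  # empty bins contribute nothing
    return -np.sum(p * np.log(p))


def mutual_information(x, y, bins=8):
    """Binned estimate of I(A, B) = H(A) + H(B) - H(A, B), as in Eq. (2).

    Assumes equal-width bins; the bin count is an arbitrary choice.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_a = entropy(joint.sum(axis=1))  # marginal counts of the first profile
    h_b = entropy(joint.sum(axis=0))  # marginal counts of the second profile
    h_ab = entropy(joint)             # joint counts
    return h_a + h_b - h_ab


# Synthetic profiles: y depends on x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

pcc = abs_pearson(x, y)
mi = mutual_information(x, y)
print(f"|PCC| = {pcc:.3f}, binned MI = {mi:.3f} nats")
```

On this synthetic pair the absolute PCC is essentially zero while the binned mutual information is clearly positive, which is the non-linear case the text describes. The estimate depends on the number of bins, so the value should be read as indicative only.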
