Refinement of Subnetwork Discovery Algorithm for Biological Networks

REFINEMENT OF SUBNETWORK DISCOVERY ALGORITHM FOR BIOLOGICAL NETWORKS

By SATRIA PUTRA SAJUTHI

A Thesis Submitted to the Graduate Faculty of WAKE FOREST UNIVERSITY in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE in the Department of Computer Science

May, 2010
Winston-Salem, North Carolina

Approved By: Jacquelyn Su Fetrow, Ph.D., Advisor
Examining Committee: Victor Paúl Pauca, Ph.D., Chairperson
William Hansel Turkett Jr., Ph.D.

Table of Contents

Acknowledgments
List of Figures
List of Tables
Abstract
Introduction
Chapter 1 Background
  1.1 JActiveModules (JAM)
  1.2 JActiveModules Algorithm
  1.3 Problems with Current JActiveModules
    1.3.1 Initial List of Subnetworks
    1.3.2 Unbounded List of Subnetworks
    1.3.3 Regional Scoring
Chapter 2 Methods
  2.1 Randomize Initial List of Tracked Subnetworks
  2.2 Constrained Size of the List of Tracked Subnetworks
  2.3 Modification to Improve Regional-Scoring Heuristic
  2.4 Running JAM over Multiple Trials
  2.5 Sensitivity and Specificity
Chapter 3 Method Validation
  3.1 Networks
  3.2 Validation for Randomizing Initial List of Subnetworks
  3.3 Validation for Limiting the Size of the Subnetwork List
  3.4 Validation for Improved Regional Scoring
  3.5 Modified JAM Result with All Modifications Active
Chapter 4 Biological Results: Identification of Active Subnetworks from Time Course Microarray Data
  4.1 Datasets Description
    4.1.1 Mouse KEGG Network
    4.1.2 Dendritic Cell Microarray Data
  4.2 Conversion from Signal Log Ratio to p-value
  4.3 Conversion from Affy ID to Entrez ID
  4.4 Overall Process
  4.5 From Subnetworks to Pathway Annotation
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
    5.1.1 Comparison to Other Methods
  5.2 Future Work
Appendix A Subnetwork Results
Appendix B Expression Data for Test Networks
Appendix C Expression Data and Annotations for KEGG Network

Acknowledgments

First, I want to extend my gratitude to my advisor, Dr. Jacquelyn S. Fetrow, who introduced me to the fields of bioinformatics and systems biology and guided me through every step of my thesis work. I always appreciated the 8 o'clock sessions in the Dean's office. I also want to thank Dr. William H. Turkett, who has always been patient and open when I needed to brainstorm. His advice helped me greatly as I structured my writing and taught me to think critically. I would also like to thank Dr. V. Paul Pauca for his help and input during the initial phase of my thesis and for his willingness to serve on my thesis committee. I also want to thank my academic advisor, Dr. David John, who kept track of my academic progress throughout the years. I want to thank Stacy Howerton for being a good friend and Amy Olex, who helped guide me when Jacque was busy with her dean's work. I would also like to thank Paul Whitener for lending me the Sunray, which has been useful to my thesis work. The friendship of Shuai and Harry also contributed a lot when I needed friends to have fun with and to escape from the stress of the thesis work. I would like to thank my aunt for a place to live over the past two years. She also helped me adapt to life in the United States. Finally, I would like to thank my parents for their enormous support. They have always believed in me and supported my decision to continue my studies with another five years of graduate school.

List of Figures

1.1 JAM algorithm behaviour when exploring state space
1.2 The regional-scoring heuristic takes into account all of the surrounding neighbors when scoring a subnetwork
2.1 An example of the distribution of nodes resulting from a JAM search over ten runs shows inconsistency in the subnetwork results
3.1 Various networks used for validation purposes
3.2 Randomizing the initial list of subnetworks neither improves overall scores nor prevents unusually low scores
3.3 The progression of size-distribution from original JAM
3.4 Graphs of the iteration space of modified JAM runs, in terms of size distribution and score distribution, that lead to high-scoring subnetworks and low-scoring subnetworks show a relationship between size and score
3.5 Top score distribution from original JAM (green) shows consistently higher Z-scores than Z-scores from modified JAM (red) over one hundred JAM runs
3.6 Tracing the iteration space of size and score distribution demonstrates how the size of the list of tracked subnetworks affects the performance of the search
3.7 Examples of the state of active nodes achieved for the mouse cell cycle network as a result of employing the original JAM algorithm and the JAM algorithm with only the modification to limit the size of the tracked subnetwork list
3.8 The subnetwork result from modified JAM with improved regional scoring (red bordered) shows better connectivity compared to the result from the original JAM (dashed orange border) on the tadpole network
3.9 The modified JAM results offer an increase in sensitivity and decrease in specificity compared to original JAM results over ten runs on galactose utilization pathway
4.1 An example of conversion from Affy ID to Entrez ID
4.2 Proposed JAM search with DC data and KEGG network as an input
4.3 Genes labeled as TLR signaling pathway across timepoints
4.4 Dynamics of gene activity in toll-like receptor signaling pathway

List of Tables

2.1 Example of binomial distribution over 10 JAM runs with p : 624/2691
3.1 Subnetworks distribution over 100 runs
4.1 1hr DAVID annotation result
4.2 3hr DAVID annotation result
4.3 6hr DAVID annotation result
4.4 12hr DAVID annotation result
4.5 24hr DAVID annotation result
4.6 Summary of the results from Tables 4.1-4.5
Abstract

The abundance of biological experimental data from new high-throughput technologies, such as microarray data, suggests the need for new methods to study biological processes using a systems-based approach. Handling this huge volume of data requires the development of new computational methods to analyze and extract interesting pieces of biological information. JAM (jActiveModules) is a Cytoscape plugin developed to find connected sets of genes with high levels of differential expression. This network approach helps biologists to generate new hypotheses concerning the biological mechanisms underlying observed changes in gene expression. In this work, the search algorithm of JAM is modified and measured. The goal was to improve the sensitivity and specificity of the method compared to the original JAM algorithm. The modifications made to the search algorithm involve: randomizing the starting point of the search, constraining the number of subnetworks maintained while searching, and improving the regional-scoring heuristic. Importantly, these modifications increase the number of significant genes observed in the results. To ensure consistency in the search results, we apply the search algorithm multiple times and develop a statistical filter to retain consistent genes appearing across JAM runs. Furthermore, we apply this improved version of JAM to DC (Dendritic Cell) maturation microarray data and KEGG pathways to study the underlying mechanisms behind the DC maturation process, an essential part of the development of protective immunity to a number of infectious pathogens.

Introduction

Advances in high-throughput technologies such as microarrays make it possible to generate large amounts of gene expression data. In particular, this technology allows researchers to study the effects of an external stimulus, such as a drug or virus, on system-wide gene expression.
The large volume of data generated by microarray experiments has given rise to new computational methods to analyze and understand the data.

In the field of molecular biology, genes usually work together with other genes. This collaboration between genes is required to perform various important functions inside the cell. Many of these interactions between genes or gene products have been studied and documented in various public databases. The rapid growth of interaction data resources allows researchers to analyze gene expression data within a network context, where the network is defined by the known interaction data.

The idea of integrating interaction data and gene expression data was first presented by Ideker et al. in 2002. In that work, a network was constructed with genes or gene products as the nodes, known interactions as edges, and weights attached to the nodes based on the expression value of the associated genes. A method was proposed to find an active subnetwork within the network, where an active subnetwork is defined as a connected region in the network that consists of nodes with significant changes of expression. Ideker's work presented both a subnetwork scoring scheme and an optimization algorithm based on simulated annealing. The method itself was implemented in a Cytoscape plugin called jActiveModules. The algorithm aims to maximize the subnetwork score through a heuristic search of the space of subnetworks. However, there are several problems with the implementation of jActiveModules that are believed to lessen the performance of the originally developed search algorithm. In this thesis work, three modifications have been employed to address the perceived problems of the current jActiveModules implementation.
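
As a rough illustration of the scoring and search idea described above, the following Python sketch scores a set of nodes by an aggregate z-score (each p-value converted to a z-score, summed, and divided by the square root of the set size) and searches for a high-scoring connected region with a simple simulated annealing loop that toggles one node at a time. This is a simplified sketch under stated assumptions, not the jActiveModules code: it omits the calibration of scores against random gene sets, and every function name, the geometric temperature schedule, and the default parameters are hypothetical choices made for illustration.

```python
import math
import random
from statistics import NormalDist


def node_z(p_value):
    """Convert a p-value in (0, 1) to a z-score; small p gives large z."""
    return NormalDist().inv_cdf(1.0 - p_value)


def aggregate_score(nodes, z):
    """Aggregate z-score of a node set: sum of z-scores over sqrt(size)."""
    if not nodes:
        return 0.0
    return sum(z[n] for n in nodes) / math.sqrt(len(nodes))


def components(active, adj):
    """Connected components of the subgraph induced by the active nodes."""
    seen, comps = set(), []
    for start in sorted(active):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v in active and v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def best_component(active, adj, z):
    """Highest-scoring connected component among the active nodes."""
    comps = components(active, adj)
    if not comps:
        return set(), 0.0
    best = max(comps, key=lambda c: aggregate_score(c, z))
    return best, aggregate_score(best, z)


def anneal(adj, z, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Toggle one node per step; accept worse states with falling odds."""
    rng = random.Random(seed)
    names = sorted(adj)
    active = {n for n in names if rng.random() < 0.5}  # random start
    best, best_score = best_component(active, adj, z)
    current = best_score
    for i in range(steps):
        # Assumed geometric cooling from t_start down toward t_end.
        temp = t_start * (t_end / t_start) ** (i / steps)
        node = rng.choice(names)
        active ^= {node}  # toggle the node on or off
        comp, new = best_component(active, adj, z)
        if new >= current or rng.random() < math.exp((new - current) / temp):
            current = new
            if new > best_score:
                best, best_score = set(comp), new
        else:
            active ^= {node}  # reject the move and undo the toggle
    return best, best_score
```

Here `adj` maps each node name to the set of its neighbors and `z` maps each node name to its z-score, for example `z = {gene: node_z(p) for gene, p in p_values.items()}`. Scoring only the best connected component keeps the returned result a connected region, which matches the definition of an active subnetwork given in the text.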
