Network Mining Approach to Cancer Biomarker Discovery
NETWORK MINING APPROACH TO CANCER BIOMARKER DISCOVERY

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Praneeth Uppalapati, B.E.
Graduate Program in Computer Science and Engineering
The Ohio State University
2010

Thesis Committee:
Dr. Kun Huang, Advisor
Dr. Raghu Machiraju

Copyright by
Praneeth Uppalapati
2010

ABSTRACT

With the rapid development of high-throughput gene expression profiling technology, molecular profiling has become a powerful tool to characterize disease subtypes and discover gene signatures. Most existing gene signature discovery methods apply statistical methods to select genes whose expression values can differentiate subject groups. However, a drawback of these approaches is that the selected genes are not functionally related and hence cannot reveal the biological mechanisms behind the differences between patient groups. Gene co-expression network analysis can be used to mine functionally related sets of genes that can be marked as potential biomarkers through survival analysis.

We present an efficient heuristic algorithm, EigenCut, that exploits the properties of gene co-expression networks to mine functionally related and dense modules of genes. We apply this method to a brain tumor (glioblastoma multiforme, GBM) study to obtain functionally related clusters. If functional groups of genes with predictive power on patient prognosis can be identified, insights into the mechanisms related to metastasis in GBM can be obtained and better therapeutic plans can be developed. We predicted potential biomarkers by dividing the patients into two groups based on their expression profiles over the genes in the clusters and comparing their survival outcomes through survival analysis. We obtained 12 potential biomarkers with log-rank test p-values less than 0.01.

DEDICATION

This document is dedicated to my family & friends.
ACKNOWLEDGMENTS

I would like to thank my research advisor, Dr. Kun Huang, for the support and guidance he has given me throughout the entire period of my work. It has been a great pleasure to work with him. I would also like to thank my advisor Dr. Raghu Machiraju for his unconditional help and suggestions. I also thank Yang Xiang and Abhisek Kundu for the help and support they have extended to me. Their input and suggestions have been a great help throughout my thesis. I would like to express my deepest gratitude to my parents, who have shown unconditional love and care throughout my life. I am thankful to my friends, who have given me moral support and helped me get through many hard times.

VITA

May 2004 ............ Sri Chaitanya Jr. College
2008 ................ B.E. Computer Science and Engineering, Osmania University
2008 to present ..... Graduate Student, Computer Science and Engineering Department, The Ohio State University
2009 to present ..... Graduate Research Associate, Department of Bio-Medical Informatics, The Ohio State University

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
VITA
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
  1.1 Background
  1.2 Motivation
  1.3 Thesis Statement
  1.4 Contribution
  1.5 Organization
CHAPTER 2: GENE CO-EXPRESSION NETWORK ANALYSIS
  2.1 Co-expression Similarity
  2.2 Building a Gene co-expression network
  2.3 Mining Modules
  2.4 Comparing Modules
CHAPTER 3: DENSE NETWORK COMPONENT DISCOVERY METHODS
  3.1 K – Core Algorithm
  3.2 Min-Cut Algorithm
  3.3 Prune-Cut Algorithm – Modification to Min-Cut algorithm
CHAPTER 4: EIGEN CUT ALGORITHM: A NEW NETWORK APPROACH
  4.1 The Algorithm
  4.2 Performance
CHAPTER 5: APPLICATIONS
  5.1 Application 1: TCGA data on Glioblastoma Multiforme
    5.1.1 Gene – miRNA Interaction Prediction
    5.1.2 Gene Signature Discovery
      5.1.2.1 Survival Analysis
      5.1.2.3 Results
  5.2 Application 2: Breast Cancer Data - GDS2250 dataset
CHAPTER 6: CONCLUSION & FUTURE WORK
BIBLIOGRAPHY
Appendix A: Gene lists for Clusters A - L
Appendix B: Codes/Programs

LIST OF TABLES

Table 1. Running times and number of clusters for different graph sizes (no. of nodes)
Table 2. Gene Enrichment results for Cluster 1 (GO: Molecular Function)
Table 3. Gene Enrichment results for cluster 1 (GO: Biological Process)
Table 4. Gene Enrichment results for cluster 1 (GO: Cellular Component)
Table 5. List of clusters (EigenCut without overlaps and next-available-hub seed-selection) with p-values less than 0.05 in the log-rank tests
Table 6. List of clusters (EigenCut without overlaps and next-available-higher index seed-selection) with p-values less than 0.05 in the log-rank tests
Table 7. List of clusters (EigenCut with multi-merge, overlaps and next-available-hub seed-selection) with p-values less than 0.05 in the log-rank tests
Table 8. List of clusters (EigenCut on K-core) with p-values less than 0.05 in the log-rank tests
Table 9. Potential Biomarkers identified through different methods
Table 10. GO enrichment results using ToppGene for Cluster D (GO: Biological Processes)
Table 11. GO Enrichment results using ToppGene for Cluster E (GO: Biological Processes)
Table 12. GO Enrichment results using ToppGene for Cluster H (GO: Biological Processes)
Table 13. Overlap values for the basal-like cancer clusters versus non-basal-like cancer clusters
Table 14. Cluster A
Table 15. Cluster B