Active Module Discovery: Integrated Approaches of Gene Co-Expression and PPI Networks and MicroRNA Data
Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Ayat Hatem, M.Sc.
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2014

Dissertation Committee:
Ümit V. Çatalyürek, Advisor
Yuejie Chi
Kun Huang
Füsun Özgüner

© Copyright by Ayat Hatem 2014

Abstract

Integrating protein-protein interaction (PPI) networks with gene expression data to extract active modules has been shown to be promising for detecting meaningful biomarkers for cancer and other diseases. However, current algorithms suffer from several drawbacks: they focus only on the highly differentially expressed genes; they analyze dependencies between genes in the PPI network only, neglecting genes whose interactions are not yet known; and they use only mRNA gene expression data, ignoring other types of data such as gene mutation information and microRNA expression. In addition, sequencing mRNA with next-generation sequencing technology (RNA-Seq) has recently become the new standard for measuring gene expression. However, existing algorithms either cannot handle RNA-Seq data or return large modules that are hard to analyze. Therefore, we need new approaches that address these drawbacks while integrating RNA-Seq data into the module discovery process.

This work explores some of the drawbacks of current active module discovery algorithms. We first discuss the differences between RNA-Seq data and microarray data. With experimental evidence, we show that RNA-Seq is more powerful than microarray in providing better active modules, at the expense of generating larger ones. Therefore, new approaches are needed to handle RNA-Seq data.
Afterwards, we present a new workflow, PRASE, that is specifically designed to obtain better active modules from RNA-Seq data. PRASE employs a variation of the well-known PageRank algorithm to preprocess the gene expression p-values. Then, it applies a scaling function to construct new p-values for the genes. These new p-values redefine the importance of the genes: a gene is important based not only on its own value but also on the values of the surrounding genes, which boosts the importance of genes that might not be differentially expressed from the p-value perspective. Finally, PRASE uses the new p-values with the existing active module discovery algorithms to extract the final modules. We applied our workflow to colorectal cancer, oligodendroglioma tumor, and breast cancer datasets. Using PRASE, we obtain more specialized modules that contain information overlooked by existing algorithms.

Finally, we present our novel microRNA-mRNA integration technique, Mica, which efficiently integrates microRNA and mRNA expression with the PPI network to discover more disease-specific active modules. The novelty of Mica lies in the early integration of microRNA expression with mRNA expression to better highlight the indirect dependencies between genes. We applied Mica to microRNA-Seq and mRNA-Seq datasets of 699 invasive ductal carcinoma samples and 150 invasive lobular carcinoma samples from the Cancer Genome Atlas Project (TCGA). The Mica modules unravel new and interesting dependencies between the genes and miRNAs. Additionally, the modules accurately differentiate between case and control samples while being highly enriched with disease-specific pathways and genes.

To my parents, Karim, Omar, and Maleeka.

Acknowledgments

I would like to thank and express gratitude to my advisor, Prof. Ümit V. Çatalyürek, for his continuous and generous support and guidance throughout my study at OSU. Prof.
Çatalyürek showed great faith in my abilities and allowed me to work quite independently, but at the same time provided invaluable guidance at the necessary times.

I would also like to thank the dissertation examination committee members, including Prof. Füsun Özgüner, Prof. Kun Huang, Prof. Yuejie Chi, and Prof. Dawn Chandler. The discussion and comments I received during my defense were invaluable, opening my mind to new ideas and research directions. I would also like to thank Prof. Kamer Kaya for his support and the various discussions we had, some of which already generated ideas used in my work.

I want to thank all of my colleagues and friends at the HPC lab, including Erdem Sariyuce, Mehmet Deveci, Anas AbuDolah, and Izzet Senturk. Also, I would like to thank the former members of the HPC lab, including Erik Saule, Onur Kucuktunc, and Doruk Bozdağ. It has been a privilege to know such a great group of people. In particular, I would like to mention Doruk Bozdağ and Erik Saule for the numerous fruitful and interesting discussions.

I would like to extend my deepest gratitude and love to my mother and my late father, who supported me during my research career and always encouraged me to follow my dreams. I also would like to thank my children, Omar and Maleeka, for their sense of humor and their wonderful characters; they totally changed my life. I can't describe how grateful I am to my husband, Karim, whose sweet presence has brought happiness into my life. He was always there for me in my tough times, always encouraging me to go forward with my PhD and never to give up.

Finally, I acknowledge the support of the Graduate School of The Ohio State University, for the University Fellowship Award, and the support from the National Science Foundation.

Vita

September 15th, 1985 ..... Born - Giza, Egypt
July 2007 ..... B.S., Computer Engineering, Cairo University, Cairo, Egypt
August 2009 ..... M.S., Software Engineering, Nile University, Cairo, Egypt
September 2009–August 2010 ..... University Fellow, The Ohio State University, Columbus, OH, USA
September 2010–Spring 2013 ..... Grad. Research Assoc., The Ohio State University, Columbus, OH, USA
Spring 2013–Present ..... Grad. Teaching Assoc., The Ohio State University, Columbus, OH, USA

Publications

Research Publications

A. Hatem, K. Kaya, J. Parvin, K. Huang, Ü. V. Çatalyürek, "MICA: MicroRNA Integration for Active Module Discovery," In the 13th European Conference on Computational Biology (ECCB), Submitted

K. Kaya, A. Hatem, H. G. Özer, K. Huang, Ü. V. Çatalyürek, "High-Performance Computing in High-Throughput Sequencing," In Biological Knowledge Discovery Handbook, John Wiley & Sons, Editors M. Elloumi, A. Y. Zomaya, 2014

L. Wang, A. Hatem, Ü. V. Çatalyürek, M. Morrison, Z. Yu, "Metagenomic Insights into the Carbohydrate-Active Enzymes Carried by the Microorganisms Adhering to Solid Digesta in the Rumen of Cows," In PLoS One, vol. 8, no. 11, pg. e78507, Nov 2013

A. Hatem, D. Bozdağ, A. E. Toland, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In BMC Bioinformatics, vol. 14, no. 1, pg. 184, 2013

A. Hatem, K. Kaya, Ü. V. Çatalyürek, "PRASE: PageRank-based Active Subnetwork Extraction," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Sep 2013

A. Hatem, K. Kaya, Ü. V. Çatalyürek, "Microarray vs. RNA-Seq: A comparison for active subnetwork discovery," In Proc. of ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB), Oct 2012

A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Mapping Tools," In Proc. of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2011

D. Bozdağ, A. Hatem, Ü. V. Çatalyürek, "Exploring Parallelism in Short Sequence Mapping Using Burrows-Wheeler Transform," In Proc.
of 9th IEEE International Workshop on High Performance Computational Biology (in conjunction with IPDPS), 2010

A. Hatem, D. Bozdağ, Ü. V. Çatalyürek, "Benchmarking Short Sequence Alignment Tools," In Abstract, Bioinformatics, 2010 Ohio Collaborative Conference, 2010

Fields of Study

Major Field: Electrical & Computer Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
    1.1 Dissertation Outline and Summary of Contributions

2. Background and Related Work
    2.1 DNA and the central dogma
        2.1.1 Measuring gene expression levels
        2.1.2 Other elements in the central dogma
    2.2 Active module discovery problem
    2.3 microRNA and mRNA integration

3. An Evaluation of RNA-Seq Mapping Tools
    3.1 Background
        3.1.1 Features
        3.1.2 Tools' description
        3.1.3 Default options of the tested tools
        3.1.4 Evaluation criteria
    3.2 Methods
        3.2.1 Benchmark design
        3.2.2 Use case: SNP calling
    3.3 Results and discussion
        3.3.1 Mapping options
        3.3.2 Input properties
        3.3.3 Algorithmic features
        3.3.4 Scalability
        3.3.5 Accuracy evaluation
        3.3.6 Rabema evaluation
        3.3.7 Use case: SNP calling
    3.4 Conclusion

4. Efficiency of RNA-Seq Data for Active Module Discovery in Comparison to MicroArrays
    4.1 Background
        4.1.1 Tools for Active Module Discovery
        4.1.2 Microarray vs. RNA-Seq: History
    4.2 Experimental Evaluation
        4.2.1 Colorectal cancer cell lines
        4.2.2 Oligodendroglioma tumors
    4.3 Conclusion and Future Work

5. PRASE: PageRank-based Active Module Extraction
    5.1 Background
        5.1.1 Active module extraction tools
        5.1.2 PageRank for gene ranking
    5.2 PRASE
        5.2.1 Input network and matrix construction
        5.2.2 Re-ranking
        5.2.3 Scaling and combining
    5.3 Experimental Results
        5.3.1 Breast invasive carcinoma
        5.3.2 Colorectal cancer cell line (CRC)