A Probabilistic Approach for Automated Discovery of Biomarkers

A Probabilistic Approach for Automated Discovery of Biomarkers using Expression Data from microarray or RNA-Seq datasets

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Molecular & Cellular Physiology of the College of Medicine

By Gopinath Sundaramurthy
MS, University of Cincinnati 2015

Thesis Advisor and Committee Chair: Dr. Hamid Eghbalnia

Abstract

The response to perturbations in cellular systems is governed by a large number of molecular circuits that coalesce into a complex network. In complex diseases, the breakdown of cellular components is brought about by multiple molecular and environmental perturbations. While individual signatures of cellular components might vary significantly among clinical patients, commonality in signs and symptoms of disease progression is a compelling indicator that key cellular sub-processes follow similar trajectories. Our approach aims for an enhanced understanding of the effect of disease perturbations on the cell by developing an automated platform that assigns more significance to changes that occur at the sub-network level, focusing on genes that are “wired” together and change together. The platform that we have developed is motivated by the study of concomitant expression changes in sub-networks. The analysis by our platform produces a small subset of signaling and regulatory genes that are wired together and change together beyond random chance. To evaluate the effectiveness of our platform in producing subsets that can distinguish diseases and disease subtypes, we used publicly available RNA-Seq and microarray breast cancer expression datasets. Each dataset was analyzed independently using our platform, and the disease-related sub-network perturbations among breast cancer subtypes were identified.
The resulting subset was subjected to standard multi-way classification, and predictions based on our approach were compared with PAM50 predictions. Biomarkers identified from the microarray and RNA-Seq datasets reproduced the PAM50 classification with 100% and 80% agreement respectively, despite having only 10% of genes in common with PAM50. This proof-of-concept analysis using breast cancer datasets is indicative of the platform’s stable cross-validation results. This platform can potentially be used for automated and unbiased computational discovery of disease-related genes. Our results suggest that probabilistic and automated approaches may offer a powerful complement to existing approaches by providing an unbiased initial screen.

This research is dedicated to my father, Mr. Sundaramurthy Arunachalam.

Acknowledgement

This dissertation would not have been possible without the help of my family, friends and mentors. I would first like to express my sincere appreciation to my committee chair, professor and mentor, Dr. Hamid Eghbalnia, for his support, guidance, and encouragement throughout my graduate study and research. Without his persistence, advice and support, this dissertation would not have been possible. I would also like to express my heartfelt appreciation to Dr. Steven Kleene for his constant support and encouragement. I would also like to thank the other members of my dissertation committee, Dr. Yana Zavros, Dr. Jarek Meller, Dr. Judith Heiny and Dr. Anil Jegga, for their valuable feedback, insights and advice throughout my research. I would also like to thank Dr. Nelson Horseman and Dr. Jay Hove, who had previously served on my committee, for their support and guidance. I would also like to thank Jeannie Cummins and Betty Young for making Cincinnati my second home. I would like to thank my mom, Ms. Shanthi Sundaramurthy, and my dad, Mr. Sundaramurthy Arunachalam, for providing me an excellent home where I could learn, grow and develop.
My research would not have been possible without their support, determination, inspiration and encouragement. I would like to thank my wife, Dhivya Jeganathan, without whose unwavering support I would not have been able to complete my dissertation. I would also like to take this opportunity to thank my brothers and sisters-in-law, Swaminathan Sundaramurthy and Nirupama Srinivasan, and Palani Sundaramurthy and Sampoorni Deivasigamani, for always being there and supporting me through my challenging times. Finally, I would also like to thank my friends Shruti, Shatrunjai and Preeti for making my PhD experience a memorable one. I would especially like to thank Dr. Sun Wook Kim, Dr. Shreya Ghosh and Dr. Kirthi Radhakrishnan for all the support through my research and dissertation.

TABLE OF CONTENTS

Abstract
Acknowledgement
List of Figures
List of Tables
Chapter I: Introduction and Historical Perspective
    I - 1 Biological Complexity and Emergence
    I - 2 Complex Diseases
    I - 3 Organizing Principles of Biological Systems
    I - 4 Network Biology
    I - 5 Network Theory and Properties of Biological Networks
    I - 6 Dynamics of Biological Networks
        I - 6.1 Differential Expression Analysis
    I - 7 Pathway Analysis
        I - 7.1 Classification of Pathway Analysis
        I - 7.2 Limitations of Pathway Analysis
Chapter II: Aim of the Thesis
Chapter III: Methods
    III - 1 Genomic and Network Data
        III - 1.1 Genomic Data
        III - 1.2 Network Data
    III - 2 Probability of Change (Nodes and Edges)
    III - 3 Hub Interaction Score
    III - 4 Path Analysis
    III - 5 Path and Feature Genes Selection
    III - 6 Biomarker Selection, Validations, and Functional Analysis
    III - 7 Pathway Analysis Software Architecture
        III - 7.1 GSE Analyzer
        III - 7.2 GSE Project Compiler
        III - 7.3 GSE Run Scheduler
        III - 7.4 GSE EMD Calculation
        III - 7.5 GSE Path Analysis Calculations
        III - 7.6 GSE Feature Selection
        III - 7.7 GSE Validation, Results and Reports
Chapter IV: Expression Analysis of Breast Cancer Subtypes
    IV - 1 Molecular Subtypes of Breast Cancer
    IV - 2 Dataset Information
