Computational Methods in Metabolomics
Mai Hamdalla, University of Connecticut - Storrs

University of Connecticut
OpenCommons@UConn
Doctoral Dissertations, University of Connecticut Graduate School
5-9-2014

Recommended Citation
Hamdalla, Mai, "Computational Methods in Metabolomics" (2014). Doctoral Dissertations. 376.
https://opencommons.uconn.edu/dissertations/376

Computational Methods in Metabolomics
Mai A. Hamdalla, Ph.D.
University of Connecticut, 2014

ABSTRACT

Humanity currently faces diverse health challenges, such as the rising incidence of metabolic disease, rapid aging, and increasing antibiotic resistance. Most diseases involve many genes in complex interactions, as well as environmental influences that are often not well understood. High-throughput advances in genome sequencing, transcript measurement, and protein measurement have been developed to address these challenges. A number of disease biomarkers have been identified as a result of an increased understanding of cellular functions. The observation of such systems-level cellular behavior has naturally extended to the metabolite level, leading to the study of metabolomics. Measurement of the metabolites in a biological sample represents a snapshot of the physiology of the cell. The study of metabolites can help assign biochemical functions to so-called orphan genes (genes that cannot be ascribed a function by sequence analogy) and validate them as molecular targets for therapeutic intervention. Integration of metabolomics data with other omics data will provide a more complete picture of the functioning of organisms. Due to the chemical diversity of metabolites, the identification process in metabolomics is currently less advanced than that in proteomics and transcriptomics.
Development of a computational workflow to improve and accelerate metabolite identification and biochemical pathway reconstruction is required for metabolomics to increase its impact in systems biology. The goal of this thesis is to design, develop, and validate methods for identifying metabolite structures and for defining their biochemical functions by predicting their metabolic pathway associations.

First, I propose BioSM, a cheminformatics tool that uses known endogenous mammalian biochemical compounds and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy (95%). In addition, results suggest that approximately 13% of PubChem compounds are mammalian biochemicals. Thus, BioSM may be useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large. BioSM is freely available at http://metabolomics.pharm.uconn.edu.

Despite its encouraging results, a major drawback of BioSM is that it must exhaustively search all known biochemical structures before it can make a decision about the molecular structure under investigation, which results in an undesirably high run time. To address this concern, I introduce BioSMXpress, designed and developed as an enhancement to BioSM. BioSMXpress is, on average, 8 times faster than BioSM without compromising the quality of the predictions made. BioSMXpress will be an extremely useful tool in the timely identification of unknown biochemical structures in metabolomics.

Finally, I present TrackSM, a bioinformatics tool designed to predict the metabolic pathway classes, as well as the individual pathways, with which small molecules might be associated, based only on their molecular structures.
Validation experiments show that TrackSM is capable of associating 93% of the structures with their correct pathway classes as defined by KEGG and 88% of them with the correct individual KEGG pathway. These impressive results suggest that TrackSM may be a valuable tool to aid in recognizing the biochemical functions of small molecules.

Computational Methods in Metabolomics

Mai A. Hamdalla

M.S. University of Connecticut, USA, 2013
M.S. Helwan University, Egypt, 2005
B.S. Helwan University, Egypt, 2001

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at the University of Connecticut
2014

Copyright by
Mai A. Hamdalla
2014

APPROVAL PAGE

Doctor of Philosophy Dissertation
Computational Methods in Metabolomics
Presented by Mai A. Hamdalla, B.S., M.S., M.S.

Major Advisor: Sanguthevar Rajasekaran
Major Advisor: Reda A. Ammar
Associate Advisor: Ion I. Mandoiu
Associate Advisor: Jimbo Bi

University of Connecticut
2014

ACKNOWLEDGMENTS

I would never have been able to finish my dissertation without the guidance of my committee members, the help of my friends, the support and love of my family, and specifically the patience of my daughter.

First and foremost, I offer my sincerest gratitude to my co-major advisors, Dr. Reda Ammar and Dr. Sanguthevar Rajasekaran, whose support and guidance have been instrumental in finishing my doctoral degree. Dr. Ammar was the one to welcome me to the lab on my first day, and he always kept his doors open for discussion. No matter what the issue was, I knew that he had a solution for me. I have excelled on both the professional and personal levels as a result of the patience, kindness, and support of Dr. Raj.

I would like to express my deepest appreciation to Dr. Ion Mandoiu for teaching me resilience and to Dr. David Grant for introducing me to the beautiful field of Metabolomics. I am also very grateful to Dr.
Dennis Hill for his scientific advice, knowledge, and many insightful discussions and suggestions.

The good advice, support, and friendship of Dr. Sahar AlSisi have been invaluable on both academic and personal levels, for which I am extremely grateful. Special thanks to Dr. Samir ElSayed, Dr. Rania Kilany, and Dr. Manal Albzor, who as good friends were always there to support me when I went through tough times; it would have been a lonely lab without them. Special thanks to Rebecca Rndazzo and Debra Mielczarek, the CSE Administrative Staff, for being so helpful when it came to paperwork and deadlines.

I would like to acknowledge the support of the Egyptian Ministry of Higher Education and Helwan University (Cairo, Egypt), particularly in the award of a Doctorate Scholarship that provided the necessary financial support for this research.

My friends in Egypt, the US, and other parts of the world were sources of laughter, joy, and support. I would particularly like to thank my dear friend Dr. Elena Castellari for always reminding me that God is looking over us. In addition, I would like to thank all my friends in Storrs and Hartford who gave me the necessary distractions from my research and made my stay in Connecticut memorable.

Finally, my deep and sincere gratitude goes to my family for their continuous and unparalleled love. I am grateful to my aunts, uncles, and cousins in Egypt for believing in me. Their prayers for me are what have sustained me thus far. I would like to thank my older brothers, Mohamed and Islam Hamdalla, for being my source of motivation and stimulation. Last but not least, I would like to thank my parents, Meeza Elbek and Dr. Ahmed Hamdalla, for their unconditional support, both financial and emotional, throughout my degree. Their love was my inspiration and driving force. I owe them everything and wish I could show them just how much I love and appreciate them. This journey would not have been possible if not for them.
I dedicate this thesis to my daughter, Nadia, and my beloved late grandma, Ateyat. Thank you for believing in me way before I ever did. I love you both dearly.

Contents

List of Figures
List of Tables

Ch. 1. Introduction
  1.1 Motivation
  1.2 Thesis Objective
  1.3 Thesis Structure

Ch. 2. Background and Related Work
  2.1 Introduction
  2.2 Applications of Metabolomics
  2.3 Metabolomics Approaches and Platforms
    2.3.1 Analytical Technologies
    2.3.2 Metabolomics Approaches
  2.4 Untargeted Metabolomics
    2.4.1 Metabolite identification
    2.4.2 Identifying Altered Metabolic Pathways

Ch. 3. Basic Evaluation Techniques
  3.1 Cross Validation Framework
    3.1.1 K-folds Cross Validation Framework
    3.1.2 Nested Cross Validation Framework
    3.1.3 Leave-one-out Cross Validation Framework
  3.2 Analysis of Variance
  3.3 Accuracy Measures

Ch. 4. Identifying endogenous mammalian biochemical structures in chemical structure space
  4.1 Introduction
  4.2 Computational Algorithm
  4.3 Datasets
    4.3.1 Chemical Space Definition
    4.3.2 Non-Biological Subsections (NBS)
    4.3.3 Training Data
    4.3.4 Prospective Validation Sets
    4.3.5 Extended Scaffolds List
  4.4 Results and Discussion
    4.4.1 Selection of Candidate Scoring Methods by CV
    4.4.2 Leave-One-Out Cross Validation Experiments
    4.4.3 Prospective Validation
    4.4.4 Extended Scaffolds List
  4.5 Conclusions

Ch. 5. Efficient identification of endogenous mammalian biochemical structures
  5.1 Introduction
  5.2 Computational Algorithm
  5.3 Datasets
    5.3.1 Biological Dataset (Scaffolds list)
    5.3.2 Non-Biological Dataset (Synthetic compounds list)
    5.3.3 Training Dataset
    5.3.4 Independent Datasets
  5.4 Results and Discussion
    5.4.1 Classification Methods Selection
    5.4.2 Leave-One-Out Cross Validation Analysis
    5.4.3 Prospective Validation
    5.4.4 Execution and CPU Time Comparison
  5.5 Conclusions

Ch. 6. Classifying Small Molecules into Metabolic Pathways
  6.1 Introduction
  6.2 Computational Algorithm
    6.2.1 Pathway Classes Prediction Method
    6.2.2 Individual Pathways Prediction Method
  6.3 Dataset
  6.4 Results and Discussion
