University of Connecticut OpenCommons@UConn

Doctoral Dissertations University of Connecticut Graduate School

5-9-2014

Computational Methods in Metabolomics

Mai Hamdalla University of Connecticut - Storrs, [email protected]


Recommended Citation
Hamdalla, Mai, "Computational Methods in Metabolomics" (2014). Doctoral Dissertations. 376. https://opencommons.uconn.edu/dissertations/376

Computational Methods in Metabolomics

Mai A. Hamdalla, Ph.D. University of Connecticut, 2014

ABSTRACT

Diverse health challenges, such as the rising incidence of metabolic disease, rapid aging, and increasing antibiotic resistance, currently face humanity. Most diseases involve many genes in complex interactions, as well as environmental influences that are often not well understood. High-throughput advances in sequencing, transcript measurement, and protein measurement have been developed to address these challenges. A number of disease biomarkers have been identified as a result of an increased understanding of cellular functions. The observation of such systems-level cellular behavior has naturally extended to the metabolite level, leading to the study of metabolomics. Measurement of the metabolites in a biological sample represents a snapshot of the physiology of the cell. The study of metabolites can help assign biochemical functions to so-called orphan genes (genes that cannot be ascribed a function by sequence analogy) and validate them as molecular targets for therapeutic intervention. Integration of metabolomics data with other data will provide a more complete picture of the functioning of organisms. Due to the chemical diversity of metabolites, the identification process in metabolomics is currently less advanced than that in transcriptomics. Development of a computational workflow to improve and accelerate metabolite identification and biochemical pathway reconstruction is required for metabolomics to increase its impact.

The goal of this thesis is to design, develop, and validate methods for identifying metabolite structures and for defining their biochemical functions by predicting their metabolic pathway associations. First, I propose BioSM, a tool that uses known endogenous mammalian biochemical compounds and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space. The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy (95%). In addition, results suggest that approximately 13% of PubChem compounds are mammalian biochemicals. Thus, BioSM may be useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large. BioSM is freely available at http://metabolomics.pharm.uconn.edu.

Despite its encouraging results, a major downside of BioSM was its need to exhaustively search all known biochemical structures before it could make a decision about the molecular structure under investigation, which resulted in an undesirably high run time. To address this concern, I introduce BioSMXpress, designed and developed as an enhancement to BioSM. BioSMXpress is, on average, 8 times faster than BioSM without compromising the quality of the predictions made. BioSMXpress will be an extremely useful tool in the timely identification of unknown biochemical structures in metabolomics.

Finally, I present TrackSM, a tool designed to predict the metabolic pathway classes as well as the individual pathways with which small molecules might be associated, based only on their molecular structures. Validation experiments show that TrackSM is capable of associating 93% of the structures with their correct pathway classes as defined by KEGG and 88% of them with the correct individual KEGG pathway. These impressive results suggest that TrackSM may be a valuable tool to aid in recognizing the biochemical functions of small molecules.

Computational Methods in Metabolomics

Mai A. Hamdalla

M.S. University of Connecticut, USA, 2013
M.S. Helwan University, Egypt, 2005
B.S. Helwan University, Egypt, 2001

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at the University of Connecticut

2014

Copyright by

Mai A. Hamdalla

2014

APPROVAL PAGE

Doctor of Philosophy Dissertation

Computational Methods in Metabolomics

Presented by Mai A. Hamdalla, B.S., M.S., M.S.

Major Advisor Sanguthevar Rajasekaran

Major Advisor Reda A. Ammar

Associate Advisor Ion I. Mandoiu

Associate Advisor Jinbo Bi

University of Connecticut 2014

ACKNOWLEDGMENTS

I would never have been able to finish my dissertation without the guidance of my committee members, the help of my friends, the support and love of my family and specifically the patience of my daughter.

First and foremost I offer my sincerest gratitude to my co-major advisors, Dr. Reda Ammar and Dr. Sanguthevar Rajasekaran, whose support and guidance have been instrumental in finishing my doctoral degree. Dr. Ammar was the one to welcome me on my first day to the lab and he kept his doors always open for discussion. No matter what the issue was, I knew that he had a solution for me. I have excelled on both the professional and personal levels as a result of the patience, kindness and support of Dr. Raj.

I would like to express my deepest appreciation to Dr. Ion Mandoiu for teaching me resilience and to Dr. David Grant for introducing me to the beautiful field of Metabolomics. I am also very grateful to Dr. Dennis Hill for his scientific advice, knowledge and many insightful discussions and suggestions.

The good advice, support and friendship of Dr. Sahar AlSisi have been invaluable on both academic and personal levels, for which I am extremely grateful. Special thanks to Dr. Samir ElSayed, Dr. Rania Kilany and Dr. Manal Albzor, who as good friends were always there to support me when I went through tough times; it would have been a lonely lab without them. Special thanks to Rebecca Randazzo and Debra Mielczarek, the CSE Administrative Staff, for being so helpful when it came to paperwork and deadlines.

I would like to acknowledge the support of the Egyptian Ministry of Higher Education and Helwan University (Cairo, Egypt), particularly in the award of a Doctorate Scholarship that provided the necessary financial support for this research.

My friends in Egypt, the US and other parts of the world were sources of laughter, joy, and support. I would particularly like to thank my dear friend Dr. Elena Castellari for always reminding me that God is looking over us. In addition, I would like to thank all my friends in Storrs and Hartford who gave me the necessary distractions from my research and made my stay in Connecticut memorable.

Finally, my deep and sincere gratitude goes to my family for their continuous and unparalleled love. I am grateful to my aunts, uncles and cousins in Egypt for believing in me. Their prayers for me are what sustained me thus far. I would like to thank my older brothers, Mohamed and Islam Hamdalla, for being my source of motivation and stimulation. Last but not least, I would like to thank my parents, Meeza Elbek and Dr. Ahmed Hamdalla, for their unconditional support, both financial and emotional, throughout my degree. Their love was my inspiration and driving force. I owe them everything and wish I could show them just how much I love and appreciate them. This journey would not have been possible if not for them.

I dedicate this thesis to my daughter, Nadia, and my beloved late grandma, Ateyat. Thank you for believing in me way before I ever did. I love you both dearly.

Contents

List of Figures

List of Tables

Ch. 1. Introduction
1.1 Motivation
1.2 Thesis Objective
1.3 Thesis Structure

Ch. 2. Background and Related Work
2.1 Introduction
2.2 Applications of Metabolomics
2.3 Metabolomics Approaches and Platforms
  2.3.1 Analytical Technologies
  2.3.2 Metabolomics Approaches
2.4 Untargeted Metabolomics
  2.4.1 Metabolite identification
  2.4.2 Identifying Altered Metabolic Pathways

Ch. 3. Basic Evaluation Techniques
3.1 Cross Validation Framework
  3.1.1 K-folds Cross Validation Framework
  3.1.2 Nested Cross Validation Framework
  3.1.3 Leave-one-out Cross Validation Framework
3.2 Analysis of Variance
3.3 Accuracy Measures

Ch. 4. Identifying endogenous mammalian biochemical structures in chemical structure space
4.1 Introduction
4.2 Computational Algorithm
4.3 Datasets
  4.3.1 Chemical Space Definition
  4.3.2 Non-Biological Subsections (NBS)
  4.3.3 Training Data
  4.3.4 Prospective Validation Sets
  4.3.5 Extended Scaffolds List
4.4 Results and Discussion
  4.4.1 Selection of Candidate Scoring Methods by CV
  4.4.2 Leave-One-Out Cross Validation Experiments
  4.4.3 Prospective Validation
  4.4.4 Extended Scaffolds List
4.5 Conclusions

Ch. 5. Efficient identification of endogenous mammalian biochemical structures
5.1 Introduction
5.2 Computational Algorithm
5.3 Datasets
  5.3.1 Biological Dataset (Scaffolds list)
  5.3.2 Non-Biological Dataset (Synthetic compounds list)
  5.3.3 Training Dataset
  5.3.4 Independent Datasets
5.4 Results and Discussion
  5.4.1 Classification Methods Selection
  5.4.2 Leave-One-Out Cross Validation Analysis
  5.4.3 Prospective Validation
  5.4.4 Execution and CPU Time Comparison
5.5 Conclusions

Ch. 6. Classifying Small Molecules into Metabolic Pathways
6.1 Introduction
6.2 Computational Algorithm
  6.2.1 Pathway Classes Prediction Method
  6.2.2 Individual Pathways Prediction Method
6.3 Dataset
6.4 Results and Discussion
  6.4.1 Ranking Method Formulation and Selection
  6.4.2 Performance of the Predictive Method for Metabolic Pathway Classes
  6.4.3 Performance of the predictive method for individual Metabolic Pathways
6.5 Conclusions

Ch. 7. Conclusions and Recommendations
7.1 Conclusions
7.2 Recommendations for future work

Bibliography

List of Figures

1.1.1 Schema of omics technologies, their corresponding analysis targets, and assessment methods. Adapted from [1]

2.1.1 Comparison of the number of publications/year for the keywords "metabolomics OR metabonomics" in Scopus from 1999 to 2013.
2.2.1 Applications of metabolomics. Adapted from [2]
2.3.1 Schematic representation of a metabolomics experimental workflow. Based on the study design, biological samples are collected, processed and subsequently analyzed using various analytical platforms. Adapted from http://www.cial.uam-csic.es/metabolomics/workflow.html
2.3.2 Workflow illustrating both untargeted and targeted metabolomics approaches. Adapted from [3].
2.4.1 Untargeted metabolomics workflow. Adapted from [4]
2.4.2 The citric acid cycle is central to the chemical processing of carbohydrates, fats, and proteins to produce energy. Adapted from http://math.uwaterloo.ca/

3.1.1 K-folds cross validation. In this illustration k = 4
3.1.2 Nested cross validation framework
3.1.3 Leave-one-out cross validation

4.2.1 Matching a candidate structure (panel A) with 4 different scaffolds (panels B1-B5; note that scaffold B2 = scaffold B3) as substructures and the similarity score of each match. The union scaffold structure incorporating all scaffold matches is shown in panel C.
4.2.2 Matching a candidate structure (panel A) with 2 scaffolds (panels B1 and B2) as superstructures and the similarity score of each match. The scaffold structure with the highest similarity score (scaffold B2) is selected.
4.2.3 (A) General flow of BioSM and (B) an example showing how the union scaffold structure and superstructure scaffold are used in the prediction process based on 5BSS.
4.3.1 Mass distribution of compounds in the validation datasets in 50 Da bins.
4.4.1 Biological predictions within each mass bin for each dataset using KEGGscafs. 5BSS bin threshold values (thr) are also displayed. *LOOCV results.
4.4.2 Frequency distribution of candidate scores for each dataset. 5BSS threshold values for each of the 5 bin masses are given in Figure 4. *LOOCV results.
4.4.3 Biological predictions within each mass bin for each dataset using KHHscafs. 5BSS bin threshold values (thr) are also displayed. *LOOCV results.
4.4.4 Distribution of compounds based on their candidate scores using KHHscafs. *LOOCV results.
4.4.5 Percentage of biological predictions in each data set using KEGGscafs versus using KHHscafs. *Refer to LOOCV results when using the KEGGscafs dataset (turquoise bar) and the KHHscafs (purple bar) as defined in the methods section above.

5.2.1 Scaffold selection and sorting process. In this example, it is assumed that the candidate compound (cq) consists of 9 atoms and that subThr = 0.5 and superThr = 0.51. Therefore, $\text{minAC} = \lfloor 9 \times 0.5 \rfloor = 4$ and $\text{maxAC} = \lceil 9 / 0.51 \rceil = 18$. (a) The hashed scaffolds list with minAC and maxAC identified. (b) The sorted scaffolds list consists of all the scaffolds with 9 atoms followed by those with 10 atoms followed by those with 8 atoms and so on.
5.2.2 (A) General flow of BioSMXpress and (B) an example showing how the appropriate scaffolds list is populated.
5.4.1 Biological predictions resulting from a set of LOOCV experiments by BioSMXpress and BioSM with 1,387 KEGG compounds. Compounds were binned by atom count.
5.4.2 Biological predictions within each atom count bin for each dataset using BioSMXpress. SSB bin threshold values (subThr and superThr) are also displayed. *Representing LOOCV results.
5.4.3 (a) Average runtime (in hh:mm:ss) needed to make predictions using BioSM versus BioSMXpress. (b) Average CPU time (in seconds) for BioSM and BioSMXpress when annotating sets of compounds of different sizes.

6.2.1 Schematic of TrackSM's predictive process.
6.4.1 Distribution of 3,190 scaffolds based on (a) the number of classes they belong to and (b) the number of individual pathways they belong to. Panel (c) shows the mass distribution of 3,190 scaffolds into 8 mass bins ranging from 0 to 922 Daltons.
6.4.2 LOOCV prediction accuracy of the 1st and 2nd orders of predictions made by Gao et al.'s method, TrackSM with Match100, and TrackSM with Match90 when predicting metabolic pathway classes.
6.4.3 Breakdown of the 1st order of class predictions made by Match100 versus those made by Match90 for 3,190 compounds based on molecular mass from a set of LOOCV experiments.
6.4.4 Accuracy of mass bins per number of class associations for TrackSM when using Match90 to predict metabolic pathway classes.
6.4.5 Distribution of class predictions when using Match100 versus Match90 based on the query compound's class association.
6.4.6 Prediction accuracy of the 1st, 2nd and 3rd orders of predictions made by TrackSM with Match100, Match90 and Match90ClassBased when predicting individual metabolic pathways.
6.4.7 Percentage of compounds with at least one correct individual pathway prediction when compounds are distributed by mass amongst 8 mass bins.

List of Tables

2.3.1 Definitions and terms used in metabolomics. Adapted from [5, 6, 7]. Original references [8, 9]
2.4.1 Freely accessible databases. Adapted from [10].

3.3.1 Definitions of abbreviations used to explain accuracy measures

4.3.1 Pathway classes and individual pathways included in the study.
4.3.2 List of Non-biological substructures (NBS)
4.3.3 List of Non-biological substructures (NBS)
4.4.1 Mean and standard deviation of accuracy measures obtained for 15 cross validation experiments using 6 different scoring methods and KEGGscafs (N = 1,565 compounds)
4.4.2 Prediction results for 3 random PubChem datasets using the 5BSS classifier and KEGGscafs.
4.4.3 Predictive results using the 5BSS classifier for 6 different datasets using KEGGscafs.
4.4.4 Average and standard deviation of accuracy measures obtained for 15 cross validation experiments using 6 different scoring methods and the KHHscafs (N = 3,927 compounds).

5.4.1 Mean and standard deviation of accuracy measures obtained for 15 cross validation experiments using 4 different scoring
5.4.2 Predictive results using the SSB classifier for 6 different datasets.

6.3.1 Distribution of 3,190 KEGG compounds among the 11 KEGG metabolic pathway classes.
6.4.1 SENS of each ranking method when TrackSM predicts 1, 2 or 3 classes per candidate compound.

Chapter 1

Introduction

1.1 Motivation

Detailed knowledge of the molecular nature of biological systems has become readily available thanks to current developments in genome sequencing and related high-throughput technologies [11, 12]. Systems biology is a discipline that strives to explain biological phenomena through the net interactions of all cellular and biochemical components within a cell or organism [13]. This has led to the establishment of the so-called omics technologies, a group of high-throughput research tools that includes genomics, transcriptomics, proteomics, metabolomics, peptidomics, and interactomics. Figure 1.1.1 displays a schematic of omics technologies. These technologies are based on comprehensive analyses of genetic information, including information from DNA, RNA, proteins and small compounds (metabolites) from tissue samples, cell lines, and body fluids [14]. Investigators are responsible for obtaining, integrating, and analyzing complex data from multiple experimental sources using interdisciplinary tools [15].

Figure 1.1.1: Schema of omics technologies, their corresponding analysis targets, and assessment methods. Adapted from [1]

In the post-genomic era, researchers became interested in studying the metabolome to describe the relationship between the genome and the phenotype in cells and organisms [16]. This was influenced by the fact that an organism's phenotype is not revealed by a complete understanding of the state of its genes, messages, and proteins [17]. The metabolome is the complete set of metabolites in a cell or organism [6]. Metabolites are small molecules, usually <1000 Daltons (Da), which are chemically transformed during metabolism. Metabolites are required for the maintenance, growth and normal function of a cell. It is estimated that a metabolome may comprise anywhere from 1,000 to 200,000 distinct metabolites depending on the organism [10].

The rapid, high-throughput analysis and characterization of metabolites within a cell, tissue or biofluid of an organism in response to some external stress is referred to as metabolomics [18]. The metabolome reflects the enzymatic pathways and networks encoded within the genome [19]. The interactions of the developmental processes and the changing environment over the lifetime of an organism can be conveyed by the entire composition of metabolites. Metabolomics promises to provide a more precise snapshot of the actual physiological state of an organism by monitoring the overall effect of various factors acting on a cell [20]. The future of metabolomics promises comprehensive insight into the effective use of metabolomics in drug discovery, drug repurposing, pre-clinical development and clinical trials [4].

1.2 Thesis Objective

The objective of this thesis is to develop computational methods and software tools to aid in the diagnosis and treatment of human diseases, particularly tools that improve and accelerate the identification of endogenous mammalian biochemical structures and the prediction of their biological function by identifying the metabolic pathways with which they might interfere. These tools are designed to reach such decisions based on chemical structural similarity methods. Work in this thesis was designed with the following objectives:

1. Design and develop methods for metabolite structure identification. These methods may be useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large.

2. Design and develop bioinformatics tools for identifying the biochemical function of small molecules. This can be achieved by predicting the metabolic pathway class or the individual pathways with which the compound might be associated.

3. Apply the above methods to unseen data for evaluation.

Successful implementation of the above objectives will result in a practical toolbox that can be integrated with existing workflows for efficient metabolomics data analysis.

1.3 Thesis Structure

Chapter 2 gives background and related work, including the different metabolomics analysis approaches. This is followed by an overview of some applications of metabolomics and of common metabolomics platforms and strategies. Finally, a general metabolomics workflow is reviewed. Chapter 3 describes the different validation frameworks and accuracy measures applied in this thesis. Chapter 4 describes the design, development and validation of a cheminformatics tool, called BioSM, that identifies endogenous mammalian biochemical structures in chemical structure space. Chapter 5 describes the design, development and validation of BioSMXpress, an enhancement to BioSM. The speed-up algorithm is described along with validation that the quality of the predictions has not been compromised. Chapter 6 describes a bioinformatics tool, referred to as TrackSM, designed to predict the metabolic pathway classes as well as the individual pathways with which small molecules might be associated, based only on their molecular structures. Method validation is also discussed. Finally, Chapter 7 offers conclusions and recommendations for future work.

Chapter 2

Background and Related Work

2.1 Introduction

Metabolomics is the critical level of post-genomic analysis. It can reveal changes in metabolite fluxes that are controlled by minor changes within gene expression measured using transcriptomics and/or by analysing the proteome that exposes post-translational control over activity [21]. The term metabolomics was initially defined, by the Nicholson Group at Imperial College London in 1999, as the global analysis of all metabolites present in a sample [22], while the term metabonomics was defined, by the Fiehn group at the Max Planck Institute of Molecular Plant Physiology in Germany in 2002, as the analysis of metabolic responses to drugs and diseases [7]. Nowadays, both terms are generally considered to be synonymous [2]. However, the monitoring of metabolite components (colors, smells and tastes) of a patient's body fluids, such as urine or saliva, in relation to various medical conditions can be traced back to ancient cultures [2, 20, 23, 24]. The field of metabolomics has seen explosive growth in the post-genomic era. The number of publications on the topic has grown exponentially, reaching ~2,123 PubMed-indexed citations in 2013. Figure 2.1.1 displays a comparison of the number of publications per year from 1999, when the term was coined, to 2013.

Figure 2.1.1: Comparison of the number of publications/year for the keywords "metabolomics OR metabonomics" in Scopus from 1999 to 2013.

2.2 Applications of Metabolomics

Metabolomics research has been applied in a variety of fields, including microbiology [20, 25, 26, 27, 28, 29], plant science [21, 30, 31, 32, 33] and medical science. Within medical science, there are three broad areas that might benefit from metabolomics (Figure 2.2.1). Metabolic profiling of individuals could be used in personalized health care to work out patients' susceptibilities to disease or their responses to medicines, and to tailor their lifestyles and drug therapies accordingly [34, 35, 36, 37]. Metabolic profiling of populations could allow the identification of many metabolites as possible biomarkers for diseases and could yield new insights into the development or progression of disease [2, 4, 12, 24, 38]. Examples include non-invasive diagnosis of coronary heart disease [6, 7, 39], lung disease [15], mental disorders [40], and cancer [1, 14, 41, 42, 43]. This is accomplished by comparing the metabolomes of healthy and diseased subjects. Finally, by identifying biochemical pathways for disease, metabolomics could uncover new drug targets [44], prioritize lead compounds, and assess toxicity in a non-invasive manner, enabling the development of novel, smarter and safer drugs [2, 13, 43, 45, 46, 47] as well as drug repositioning [48, 49, 50, 51, 52, 53]. It is anticipated that within the next 10 years metabolomics will become a standard tool in the pharma industry as validated metabolomic-based biomarkers begin to emerge and cost-effective screens are established [47].

Figure 2.2.1: Applications of metabolomics. Adapted from [2]

2.3 Metabolomics Approaches and Platforms

A metabolomics experiment poses unique challenges in fulfilling the goal of improving the current state of biological information related to the metabolome and more generally [5, 8, 9]. A typical metabolomics research flow starts with the design of an experiment plan, then proceeds to sampling and sample preparation, followed by data acquisition using analytical instrumentation, and finally data processing and data interpretation using a variety of statistical techniques [32, 54]. Figure 2.3.1 shows the typical scheme employed in metabolome analysis. For a metabolomics study to be successful, all steps, from the definition of the biological question and the experimental design up to the biostatistics, should be optimized and appropriate for their intended use [24]. The ultimate goal of a metabolomics experiment is to identify and quantify all of the metabolites in a cell or tissue in a given state at a given point in time [9]. However, it is currently impossible to do so simultaneously in a single, high-throughput platform because no single extraction technique or analytical instrument can isolate and detect every metabolite within a biological sample [30, 55]. These difficulties are further amplified by issues such as human error in sample preparation and extraction, sample storage and instrument reproducibility [5]. Additionally, metabolomics is challenging due to the fundamental diversity in chemical structure, size, abundance and reactivity of the collection of metabolites in any biological sample [21, 55].

2.3.1 Analytical Technologies

Metabolomic platforms can be categorized based on the detection method. Detection methods include nuclear magnetic resonance (NMR) [56], Fourier transform infrared spectroscopy (FT-IR) [28, 57], and mass spectrometry (MS) [58] coupled to separation techniques such as high performance liquid chromatography (HPLC) [59], gas chromatography (GC) [24], or capillary electrophoresis (CE) [6]. The use of different analytical platforms provides complementary information that can be integrated for deeper metabolome coverage [60]. Generally, the technology platform of choice depends on the type of sample to be analyzed [9]. Sample types commonly investigated include plant tissue, plasma, urine, cerebrospinal fluid, mammalian tissue, and cultured eukaryotic and prokaryotic cells [61].

Figure 2.3.1: Schematic representation of a metabolomics experimental workflow. Based on the study design, biological samples are collected, processed and subsequently analyzed using various analytical platforms. Adapted from http://www.cial.uam-csic.es/metabolomics/workflow.html

2.3.2 Metabolomics Approaches

Analytical approaches for metabolomics can be categorized broadly into two distinct groups: targeted or untargeted [62]. The distinction depends on whether the methodology is designed to quantify a number of specific metabolites (targeted) or to measure a generally larger set of metabolites restricted only by the sensitivity and applicability of the analytical platform(s) and data processing employed (untargeted). These approaches can be further subdivided into metabolic profiling, which uses an untargeted approach, and metabolite identification and quantitation, which uses a targeted approach [55], as seen in Figure 2.3.2. Different research areas within metabolomics use different terminology for these approaches; some of these terms are defined in Table 2.3.1.

Targeted metabolomics is commonly driven by a specific biochemical question or hypothesis that motivates the investigation of one or more related pathways of interest [45, 55]. These studies involve the use of biochemical and analytical tools for the quantification of known metabolites of biological interest [62]. Targeted metabolomics can be effective for pharmacokinetic studies of drug metabolism as well as for measuring the influence of therapeutics or genetic modifications on a specific enzyme [45].

Untargeted metabolomics is global in scope, usually hypothesis-free, and aims to simultaneously measure as many metabolites as possible from biological samples [63]. It is used to identify metabolic pathways that are altered following perturbations of biological systems [3]. Since untargeted metabolomics does not require prior knowledge, it can be used to identify novel metabolic biomarkers of disease and drug efficacy as well as to analyze the global metabolic profile of the whole system [55]. Both targeted and untargeted metabolomics reveal the expected behavior of known metabolites, but only untargeted metabolomics allows the detection of concurrent effects between variables which cannot be observed at an individual level.

Figure 2.3.2: Workflow illustrating both untargeted and targeted metabolomics approaches. Adapted from [3].

2.4 Untargeted Metabolomics

Term | Definition
Metabolomics | Unbiased identification and quantification of all the metabolites present in a biological system.
Metabolome | Complete set of low-molecular-weight metabolites present in a biological sample (i.e., biofluid, organism, bacterial community).
Metabolite target analysis | Qualitative and quantitative analysis of one or a few metabolites related to a specific metabolic reaction.
Metabolite profiling / Metabolic profiling | Identification and quantification of a selected number of pre-defined metabolites, generally related to a specific metabolic pathway(s).
Metabolic footprint | Analysis of the metabolites secreted/excreted by an organism; it may include environmental and growth substances. Does not rely on the measurement of intracellular metabolites but rather on monitoring those that are secreted or fail to be taken up by a cell or tissue [31, 32].
Metabolic fingerprinting | Unbiased, high-throughput, rapid, global analysis of samples to provide sample classification. Analysis is oriented towards defining clinically relevant differences rather than identifying all the molecules present in a sample [30].
Metabonomics | Evaluation of tissues and biological fluids for changes in endogenous metabolite levels that result from disease or therapeutic treatments.
Metabolite flux analysis | Also known as fluxomics. Labeled metabolites are fed into a biosystem and the destination of the label is assessed, usually in a time-dependent manner.

Table 2.3.1: Definitions and terms used in metabolomics. Adapted from [5, 6, 7]. Original references [8, 9]

In contrast to targeted metabolomic results, untargeted metabolomic data sets are exceedingly complex, with file sizes on the order of gigabytes per sample for some new high-resolution mass spectrometry instruments. Manual inspection of the thousands of peaks detected is impractical. With recent developments in bioinformatic tools, identification of metabolite peaks that are differentially altered between sample groups has become a relatively automated process [4]. Several metabolomic software programs that provide methods for peak picking, nonlinear retention time alignment, visualization, relative quantification and statistical analysis are available [64, 65, 66].

2.4.1 Metabolite identification.

One of the known bottlenecks in metabolomics is the identification of unknown metabolites, since currently available metabolomics software does not output metabolite identifications. Rather, it provides a table of features with p-values and fold changes related to their difference in relative intensity between samples [4]. To determine the identity of a feature of interest, the accurate mass of the compound is first searched in an online chemical structure database. Databases range from general chemical structure databases such as PubChem [67], ZINC [68], and ChemSpider [69] to specialized databases such as HMDB [70], DrugBank [71], or HumanCyc [72]. A list of freely accessible small molecule databases useful in metabolomics research is presented in Table 2.4.1. A database match represents only a putative metabolite assignment that must be confirmed by comparing the retention time and MS/MS data of a model compound to those of the feature of interest in the research sample (Figure 2.4.1) [4].

Figure 2.4.1: Untargeted metabolomics workflow. Adapted from [4]

Database        # of cpds     Reference   Type
PubChem         ~33 million   [67]        General chemical structure database
ChemSpider      ~28 million   [69]        General chemical structure database
Zinc            ~20 million   [68]        General chemical structure database (commercially available small molecules)
Metlin          ~64,000       [64]        Metabolites and MS/MS data
HMDB            8,608         [65]        Human metabolome database
KEGG compound   16,834        [73]        A collection of small molecules, biopolymers and other compounds relevant to biological systems
DrugBank        6,711         [71]        Drugs (approved, illicit, withdrawn and experimental)
PlantCyc        3,334         [74]        Plant metabolites and pathways
HumanCyc        1,321         [72]        Human metabolites and pathways

Table 2.4.1: Freely accessible small molecule databases. Adapted from [10].

A typical mass search in PubChem may yield several thousand chemical structures, whereas the same search in HMDB often results in fewer than a dozen. Both types of databases have advantages and disadvantages. Querying a focused small database such as HMDB makes identification relatively trivial if the unknown metabolite happens to be among the candidates. However, this approach cannot be used to identify previously unknown metabolites. On the other hand, searching a large chemical database such as PubChem greatly improves the odds of finding the unknown compound in the database. On the downside, the excessive number of potential candidates in PubChem may lead to a large number of false positives, making the identification of the correct "unknown" extremely difficult. However, by applying carefully configured curation steps, the candidate list from a large database may be shortened substantially. Initial curation steps may include removing disconnected structures, eliminating charged species, clustering stereoisomers and eliminating compounds containing elements other than C, H, N, O, P and S. These curation steps alone can eliminate anywhere from 40% to 90% of candidates from the initial bin of structures matching the molecular mass of the unknown [10].
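The initial curation steps can be illustrated with a minimal sketch. It assumes each candidate is already described by a molecular formula, a net charge, and a count of disconnected fragments; the record layout, the field names, and the `curate` helper are hypothetical and are not taken from [10]. Stereoisomer clustering is omitted because it requires a structure-aware toolkit.

```python
import re

# Elements allowed in a candidate, as listed in the curation steps above.
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "P", "S"}

def elements_in(formula):
    """Return the set of element symbols in a formula such as 'C6H12O6'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

def curate(candidates):
    """Keep connected, neutral candidates made only of C, H, N, O, P and S."""
    kept = []
    for c in candidates:
        if c["n_fragments"] != 1:   # disconnected structure (e.g. a salt)
            continue
        if c["charge"] != 0:        # charged species
            continue
        if not elements_in(c["formula"]) <= ALLOWED_ELEMENTS:
            continue                # contains another element
        kept.append(c)
    return kept

# Hypothetical candidate bin returned by a mass search.
candidates = [
    {"name": "glucose", "formula": "C6H12O6", "charge": 0, "n_fragments": 1},
    {"name": "chlorinated hit", "formula": "C6H5Cl", "charge": 0, "n_fragments": 1},
    {"name": "sodium salt", "formula": "C2H3NaO2", "charge": 0, "n_fragments": 2},
    {"name": "cation", "formula": "C5H14NO", "charge": 1, "n_fragments": 1},
]
curated = curate(candidates)
```

In this toy bin only the first record survives all three filters.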

2.4.2 Identifying Altered Metabolic Pathways

A metabolic pathway is a series of chemical reactions, catalyzed by enzymes, that occur within a cell and are connected such that the reactants of one reaction are the products of the previous one. Understanding these pathways is essential to understanding the machinery of life [75]. The reconstruction of the metabolic network of an organism based on its genome sequence is a key challenge in systems biology [76]. In each pathway, a principal chemical is modified by a series of chemical reactions. These reactions often require dietary minerals, vitamins, and other cofactors in order to function properly. In addition, numerous distinct pathways co-exist within a cell. This collection of pathways is called the metabolic network. Pathways are important to the maintenance of homeostasis within an organism. Figure 2.4.2 gives a glimpse of the complexity of a metabolic pathway. By performing global metabolite profiling, also known as untargeted metabolomics, new discoveries linking cellular pathways to biological mechanisms are being revealed and are shaping our understanding of cell biology, physiology and medicine [4].

Figure 2.4.2: The citric acid cycle is central to the chemical processing of carbohydrates, fats, and proteins to produce energy. Adapted from http://math.uwaterloo.ca/

Chapter 3

Basic Evaluation Techniques

3.1 Cross Validation Framework

Cross Validation (CV) is a statistical method for estimating how accurately a predictive algorithm will perform in practice while guarding against overfitting; it is also used to tune the parameters of a learning model [77]. This is accomplished by dividing the data into two segments: one used to train a model and the other used to validate it. In typical cross-validation, the training and validation sets cross over in successive rounds so that each data point has a chance of being validated against. CV is also used to compare the performance of two or more algorithms and identify the best one for the available data, or alternatively to compare two or more variants of a parameterized model.


3.1.1 K-folds Cross Validation Framework

The data set is divided into k subsets. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set (Figure 3.1.1). Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set k different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.

Figure 3.1.1: K-folds cross validation. In this illustration k = 4
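The k-fold procedure can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names (`k_fold_indices`, `cross_validate`) and the toy mean-predicting learner are assumptions made for the example, not part of the methods used in this thesis.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, error, seed=0):
    """Average the test error over k rounds; each fold is the test set once."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors.append(
            error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_errors) / k

# Toy learner (hypothetical): predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mean_squared_error(model, test_x, test_y):
    return sum((y - model) ** 2 for y in test_y) / len(test_y)

xs = list(range(12))
ys = [3.0] * 12
cv_error = cross_validate(xs, ys, k=4, fit=fit_mean, error=mean_squared_error)
```

Calling `cross_validate` with `k=len(xs)` makes every fold a single point, which is the leave-one-out case described in Section 3.1.3.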

3.1.2 Nested Cross Validation Framework

Classification accuracy is empirically assessed using 2-fold CV, with parameter tuning performed by executing k-fold CV on the training data (Figure 3.1.2). Briefly, the dataset is divided randomly into two halves: one half for model training and the other half for model testing. The training half is further randomly split into k roughly equal parts, and each part is then used to evaluate the classification accuracy of models trained on the remaining (k−1) parts. The average accuracy measure of all k training sets is used as the cutoff score when evaluating the testing data.

Figure 3.1.2: Nested cross validation framework
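The control flow of the nested framework can be sketched as follows. The sketch assumes each item is a (score, label) pair and that the tuned parameter is a single score cutoff; the toy `pick_cutoff` tuner and all function names are hypothetical stand-ins, chosen only to show how the outer 2-fold split and the inner k-fold tuning fit together.

```python
import random
from statistics import mean

def split_parts(items, k, rng):
    """Shuffle the items and deal them into k parts of near-equal size."""
    items = list(items)
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]

def accuracy(items, cutoff):
    """Fraction of (score, label) pairs where 'score >= cutoff' agrees with the label."""
    return sum((score >= cutoff) == label for score, label in items) / len(items)

def pick_cutoff(train_parts, held_out):
    """Toy tuner: among scores seen in training, keep the one most accurate on held_out."""
    candidates = sorted({score for score, _ in train_parts})
    return max(candidates, key=lambda c: accuracy(held_out, c))

def nested_cv(data, k, seed=0):
    """Outer 2-fold CV for assessment, inner k-fold CV for tuning the cutoff."""
    rng = random.Random(seed)
    halves = split_parts(data, 2, rng)
    outer_accuracies = []
    for h in (0, 1):
        train_half, test_half = halves[h], halves[1 - h]
        parts = split_parts(train_half, k, rng)
        cutoffs = []
        for j, held_out in enumerate(parts):
            rest = [x for i, part in enumerate(parts) if i != j for x in part]
            cutoffs.append(pick_cutoff(rest, held_out))
        # The cutoff averaged over the k inner rounds is applied to the test half.
        outer_accuracies.append(accuracy(test_half, mean(cutoffs)))
    return outer_accuracies

# Hypothetical toy data: 20 positives with high scores, 20 negatives with low scores.
data = [(0.6 + 0.02 * i, True) for i in range(20)] + [(0.02 * i, False) for i in range(20)]
results = nested_cv(data, k=5)
```

The test half never influences the cutoff, which is the point of nesting the tuning loop inside the assessment loop.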

3.1.3 Leave-one-out Cross Validation Framework

Leave-one-out cross validation (LOOCV) is k-fold cross validation with k equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point (Figure 3.1.3).

Figure 3.1.3: Leave-one-out cross validation

3.2 Analysis of Variance

The general purpose of the analysis of variance (ANOVA) is to test for significant differences between means [77]. ANOVA is preferred when there are more than two levels of an independent variable to compare. ANOVA can also analyze data from several independent variables simultaneously. In this thesis, all ANOVA analyses were carried out using the Single Factor ANOVA function in Microsoft Excel 2007.

3.3 Accuracy Measures

Statistical measures used to evaluate the performance of classification models in this thesis are defined in this section. The definition of some abbreviations used below can be found in Table 3.3.1.

TP    True positive     Correctly identified
FP    False positive    Incorrectly identified
TN    True negative     Correctly rejected
FN    False negative    Incorrectly rejected

Table 3.3.1: Definitions of abbreviations used to explain accuracy measures

Sensitivity (SENS) refers to the proportion of actual positives which are correctly identified, and is computed as

$$\mathrm{SENS} = \frac{TP}{TP + FN} \tag{3.3.1}$$

Specificity (SPEC) refers to the proportion of negatives which are correctly identified, and is given by

$$\mathrm{SPEC} = \frac{TN}{TN + FP} \tag{3.3.2}$$

The Positive Predictive Value (PPV) is the proportion of positive test results that are true positives, and is defined by

$$\mathrm{PPV} = \frac{TP}{TP + FP} \tag{3.3.3}$$

The Matthews Correlation Coefficient (MCC) [78], defined by

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}} \tag{3.3.4}$$

is commonly used as a combined measure of the overall quality of two-class classifiers. MCC can range from 1 to −1, where

$$\mathrm{MCC} = \begin{cases} 1 & \text{perfect prediction} \\ 0 & \text{randomized prediction} \\ -1 & \text{perfectly inverse prediction} \end{cases}$$
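The measures of this section, including the F-Score of equation 3.3.5, follow directly from the four counts in Table 3.3.1. The sketch below is illustrative only; the function name `accuracy_measures` and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
from math import sqrt

def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (an assumed convention)."""
    return numerator / denominator if denominator else 0.0

def accuracy_measures(tp, fp, tn, fn):
    """Compute SENS, SPEC, PPV, MCC and F from true/false positive/negative counts."""
    sens = ratio(tp, tp + fn)                       # equation 3.3.1
    spec = ratio(tn, tn + fp)                       # equation 3.3.2
    ppv = ratio(tp, tp + fp)                        # equation 3.3.3
    mcc = ratio(tp * tn - fp * fn,                  # equation 3.3.4
                sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp)))
    f_score = ratio(2 * sens * ppv, sens + ppv)     # equation 3.3.5
    return {"SENS": sens, "SPEC": spec, "PPV": ppv, "MCC": mcc, "F": f_score}

# Hypothetical confusion counts for a two-class classifier.
measures = accuracy_measures(tp=90, fp=10, tn=90, fn=10)
```

For these counts SENS, SPEC, PPV and F are each 0.9, while MCC is (8100 − 100) / 10000 = 0.8.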

Finally, the F-Score is the harmonic mean of SENS and PPV, i.e.,

$$F = 2 \cdot \frac{\mathrm{SENS} \cdot \mathrm{PPV}}{\mathrm{SENS} + \mathrm{PPV}} \tag{3.3.5}$$

Chapter 4

Identifying endogenous mammalian biochemical structures in chemical structure space

4.1 Introduction

The interpretation of the massive amount of data produced by high-throughput techniques is a major challenge in metabolomics [79]. The most common approach involves matching experimentally determined features, such as a mass spectrum or retention index, with computationally simulated features for a set of candidate compounds downloaded from a general chemical structure database [66]. Various on-line chemical structure databases (Table 2.4.1) provide the fundamental support for molecular identification. The relative advantages or disadvantages of utilizing chemical structure databases vary depending on the size of the database. Small databases often will not contain the candidate compound of interest. On the other hand, searching large databases such as PubChem often results in a large number of false positives, making identification of the "unknown" extremely difficult. Hence, cheminformatics methods are needed to more efficiently search large chemical databases in order to identify unknown endogenous biochemical compounds. Ideally, these methods would allow discrimination between candidate structures that are synthetic and candidate structures that are biochemical [16, 80].

Nobeli et al. [81], using two-dimensional (2D) molecular structures and cheminformatics tools, reported the first attempt to solve this problem. They visually examined the 2D molecular structures of 745 E. coli metabolites and manually derived a library of 57 structural fragments commonly found in those metabolites to reveal the main constituents of metabolites and to assist in the classification of the metabolome into biochemically relevant classes. Preliminary efforts correlating similarities between metabolites and protein structures, as well as with metabolic pathways, were reported.

In related work, Gupta and Aires-de-Sousa [82] defined the chemical space of endogenous biochemicals using the KEGG/LIGAND database. Any compound in KEGG that was involved in a metabolic reaction was included in the study. These included metabolites from different species as well as xenobiotics. The chemical space of non-metabolites was represented by a random set of commercially available compounds from the ZINC [83] chemical database. They compared both chemical spaces based on 2D and 3D structures and descriptors of global properties. They found that overlap between metabolites and non-metabolites was smallest in the space defined by the global descriptors and suggested that the most discriminative features were the number of OH groups, the presence of aromatic systems, and molecular weight.
Using a random forest (RF) [84] classifier and global molecular descriptors they were able to correctly annotate 95% of the 1,811 KEGG compounds used for training the model. An RF is a collection of unpruned classification trees created by using bootstrap samples of the training data and random subsets of variables to define the best split at each node.

Extending Gupta and Aires-de-Sousa's work, Peironcely et al. [85] used 6,954 molecular structures in HMDB [70] to represent the chemical space occupied by endogenous human metabolites and an updated collection of compounds from ZINC as non-biological structures. Both datasets were clustered independently and 532 molecules (cluster centers) from each dataset, selected to represent each cluster, were used for building the classification model. The remaining (6,422) molecules were used for training the model. They showed that using MDL public keys [86] and RF resulted in the best accuracy for their classifier. The authors reported that 96% of 457 HMDB compounds not used for training the model, 54% of 6,532 DrugBank [71] compounds and 22% of 6,312 compounds from ChEMBL [87] were classified as endogenous metabolites.

Both Gupta & Aires-de-Sousa and Peironcely et al. employed fingerprints for classification. Molecular fingerprints represent the structure of a molecule as a list of binary values (0 or 1) that indicate the presence or absence of structural features in the molecule [88]. A structural feature may include properties (such as molecular weight), the presence/absence of an element, an unusual or important electronic configuration (such as triple-bonded nitrogen), rings and ring systems and functional groups. An alternative approach is based on viewing a molecule as a graph and using graph-matching algorithms to find common substructures. Previous work [81] suggests that matching common substructures may describe structural similarity more accurately than fingerprint-based methods.
Although this has been suggested, it has not been explored due to concerns related to computational efficiency. In addition, this approach of matching common substructures is consistent with how endogenous biochemicals are produced enzymatically in vivo, i.e., from precursors with similar and/or overlapping structures.

Here, I present BioSM, a molecular classifier that can identify endogenous mammalian biochemical structures contained within chemical structure space. BioSM uses the structures of known endogenous mammalian biochemical compounds as scaffolds to aid in the classification process, as opposed to other works that use fragments of known structures. The graph-based method implemented within BioSM can also be expanded to predict metabolic pathways since it links a set of annotated scaffold structures to each candidate structure.

In the empirical evaluation of BioSM, I initially focused on a curated set of endogenous human biochemicals obtained from the KEGG/LIGAND database to represent the scaffolds list. The chemical space of non-biological compounds was approximated by a randomly selected set of compounds from the Chembridge and Chemsynthesis chemical databases. Since structurally similar molecules tend to have similar properties [89], I use a graph matching algorithm to identify compounds that are structurally similar to those in our scaffolds list. The classification method is based on a novel scoring scheme that combines all matches of scaffolds to substructures of a candidate compound as well as matches of the candidate compound's structure to substructures of the scaffolds. I was also interested in determining whether increasing the number of scaffolds (i.e., increasing our representation of biochemical structure space) would improve model sensitivity and specificity. Therefore, the initial KEGG scaffolds list was supplemented with 2,362 curated compounds from HMDB and HumanCyc and the assessment experiments were repeated.

4.2 Computational Algorithm

Marvin [90], a chemical structure processing software, was used to generate canonical SMILES (Simplified Molecular-Input Line-Entry System) [91] from structure data files (.sdf) for all compounds described in this chapter. The Small Molecule Sub-graph Detector (SMSD) Toolkit [92] was used to carry out molecule similarity searches. SMSD is a Java-based software library for finding the maximum common sub-graph between small molecules using atom type matches and bond sensitivity information. In this work, two molecular structures match if and only if the smaller structure is an exact substructure (atom and bond types) of the larger structure being compared. If two molecular structures r and q are found to be a match, their similarity score is defined by

$$S_c = \frac{AC(r)}{AC(q)} \tag{4.2.1}$$

where AC(r) represents the total number of atoms in the substructure, r, and AC(q) represents the total number of atoms in the superstructure, q.

Clearly, a candidate molecule may match more than one scaffold structure, resulting in several similarity scores computed for each candidate compound. Initially, the highest similarity score was selected to represent the degree of biochemical similarity between scaffold structures and the candidate compound's structure. However, it was observed that multiple scaffolds could match different substructures of the candidate, significantly strengthening the evidence that the candidate compound is an endogenous mammalian biochemical. Thus, I developed a "union scaffold structure" approach that incorporates all scaffolds matching a candidate compound's structure and serves to reduce bias that might exist due to overlap among scaffolds. This representation provides a quantitative assessment of a candidate compound's overall "biochemical coverage". Figure 4.2.1 illustrates BioSM's scaffold matching process and shows how scaffolds are mapped onto the candidate structure to generate the union scaffold structure. When multiple matches exist, BioSM incorporates each one into the union scaffold structure being generated (Figure 4.2.1, matches B2 and B3). Note that a disjoint union scaffold structure may be generated if matching substructure scaffolds do not overlap.

Figure 4.2.1: Matching a candidate structure (panel A) with 4 different scaffolds (panels B1-B5; note that scaffold B2 = scaffold B3) as substructures and the similarity score of each match. The union scaffold structure incorporating all scaffold matches is shown in panel C.

Once a union scaffold structure is mapped to a candidate structure, a similarity score, known as the union-scaffold score (US), is computed using equation 4.2.1 with the candidate structure as the superstructure and the union scaffold structure as the substructure.

I considered using the number of scaffolds that match a candidate structure as an optional scoring parameter. It was realized, however, that this approach would bias BioSM's predictions depending on the over- or under-abundance of any particular group of structures in the scaffolds list. Knowing that our scaffolds list is incomplete, since not all endogenous mammalian biochemical compounds are known, it was decided not to include the number of scaffold matches in a candidate compound's score.

I also recognized that some candidate structures may be small and thus have very few scaffolds matching as substructures. Obviously, larger candidate compounds have a better chance of matching substructures in the scaffolds list. Accordingly, the method was modified to match and score scaffolds that are superstructures of a candidate structure as well as those that are substructures.
This approach seems intuitive since many biochemical compounds are produced enzymatically (i.e., products) from larger precursor scaffolds (i.e., substrates) via biochemical pathways [93]. If a scaffold is found to be a superstructure of a candidate structure, a similarity score is computed using equation 4.2.1. In addition, a candidate compound may be a substructure of several scaffolds, as shown in Figure 4.2.2. In that case, the scaffold with the highest similarity score is selected, and that score is used as the superstructure score. Hence, a candidate compound can have a score of zero (when no matches are found), a union scaffold score, a superstructure score, or both.

Figure 4.2.2: Matching a candidate structure (panel A) with 2 scaffolds (panels B1 and B2) as superstructures and the similarity score of each match. The scaffold structure with the highest similarity score (scaffold B2) is selected.

In order to have one value represent the structural match of a candidate compound to the biochemical scaffold structures, we combined the union scaffold and superstructure scores in two different ways. In the first approach, referred to as the Sum of Scores (SS), we obtained a candidate's overall score by adding the union scaffold score to the superstructure score. In the second approach, referred to as the Maximum Score (MS), the candidate's score was the larger of the union scaffold score and superstructure score. Figure 4.2.3 shows an overview of the general flow of BioSM and an illustrative example.
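The scoring scheme of this section can be sketched once the graph matching itself has been done. The sketch assumes that the matcher (SMSD in this work) has already reported, for every substructure scaffold, the set of candidate atom indices it covers, and, for every superstructure scaffold, its atom count; that input layout and all function names are hypothetical and do not reflect the BioSM source code.

```python
def union_scaffold_score(candidate_atoms, substructure_matches):
    """US: atoms covered by the union of all scaffold matches over candidate atoms."""
    covered = set().union(*substructure_matches)  # overlapping atoms count once
    return len(covered) / candidate_atoms

def superstructure_score(candidate_atoms, superstructure_atom_counts):
    """Best equation 4.2.1 score over scaffolds that contain the candidate."""
    return max(
        (candidate_atoms / count for count in superstructure_atom_counts),
        default=0.0,  # no superstructure scaffold matched
    )

def combined_scores(candidate_atoms, substructure_matches, superstructure_atom_counts):
    """Return US, the superstructure score, and the SS and MS combinations."""
    us = union_scaffold_score(candidate_atoms, substructure_matches)
    sup = superstructure_score(candidate_atoms, superstructure_atom_counts)
    return {"US": us, "SUP": sup, "SS": us + sup, "MS": max(us, sup)}

# Hypothetical candidate with 10 atoms, three substructure scaffold matches
# (two of them overlapping) and two superstructure scaffolds of 20 and 16 atoms.
scores = combined_scores(
    candidate_atoms=10,
    substructure_matches=[{0, 1, 2, 3}, {2, 3, 4, 5}, {8, 9}],
    superstructure_atom_counts=[20, 16],
)
```

Here the union covers 8 of 10 atoms (US = 0.8), the best superstructure score is 10/16 = 0.625, so SS = 1.425 and MS = 0.8. Taking the union rather than summing per-scaffold scores is what keeps overlapping scaffolds from being counted twice.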

4.3 Datasets

4.3.1 Chemical Space Definition

• Biological Dataset (Scaffolds List): The KEGG database served as the source of the first set of endogenous mammalian scaffolds used in this study. These scaffolds were selected based on their inclusion within at least one of 63 known KEGG mammalian pathways (scaffold pathway and metabolic class information is given in Table 4.3.1). However, some compounds were excluded from the final scaffold list. Compounds with elements other than C, H, N, O, P and S are typically found only in marine organisms and are extremely rare in mammals. Hence, we decided to treat these compounds as non-mammalian and eliminated them (59 compounds). Molecules with a molecular mass less than 50 Da (12 compounds) were removed, as were compounds that had duplicate structures (174 compounds) or were polymers (223 compounds). Additionally, we eliminated compounds that did not have an associated formula (27 compounds) and all charged structures (11 compounds) except those in which the charge was due to quaternary amines or sulfonium ions. This curation resulted in a final list of 1,565 mammalian scaffolds (KEGGscafs) for our initial representation of biochemical structure space.

Figure 4.2.3: (A) General flow of BioSM and (B) an example showing how the union scaffold structure and superstructure scaffold are used in the prediction process based on 5BSS.

• Non-Biological Dataset (Synthetic compounds List): The Chembridge and Chemsynthesis databases, comprising synthetic compounds for chemical synthesis and drug screening and design, were chosen to represent non-biological chemical space. A set of 29,207 compounds was downloaded from the Chemsynthesis database on 7/18/2011 and a set of 760,517 compounds was downloaded from the Chembridge database on 7/20/2011. Because the Chemsynthesis and Chembridge databases mainly contain compounds with low molecular weights, a value of 700 Da was set as the maximum molecular weight of candidate compounds included in this study. Accordingly, 177 KEGG compounds (with masses greater than 700 Da) were eliminated from any testing set throughout this study and were only used for superstructure scaffold matching. This mass restriction was enforced to ensure that any compound with a mass in the range 50 – 700 Da was equally likely to be biological/non-biological and thus discrimination would be based solely on structure. Similar to KEGGscafs, the combined synthetic set of compounds was curated by removing all compounds containing elements other than C, H, O, N, S and P (297,721 structures), organic salts (3,496 structures), charged compounds (39,170 structures), duplicate compounds (153 structures), and compounds with molecular mass less than 50 Da (8 structures). Additionally, we removed 127 compounds that were identical to compounds in KEGGscafs. This curation resulted in a final set of putative non-biological compounds consisting of 483,615 structures.

Pathway Class: Pathways per Class, Compounds per Class
    Pathway KEGG IDs

Carbohydrate Metabolism (CM): 15 pathways, 293 compounds
    ko00010, ko00020, ko00030, ko00040, ko00051, ko00052, ko00053, ko00500, ko00520, ko00562, ko00620, ko00630, ko00640, ko00650, ko00660
Energy Metabolism (EM): 1 pathway, 10 compounds
    ko00190
Lipid Metabolism (LM): 16 pathways, 430 compounds
    ko00061, ko00062, ko00071, ko00072, ko00100, ko00120, ko00121, ko00140, ko00561, ko00564, ko00565, ko00590, ko00591, ko00592, ko00600, ko01040
Nucleotide Metabolism (NM): 2 pathways, 137 compounds
    ko00230, ko00240
Amino Acid Metabolism (AM): 13 pathways, 502 compounds
    ko00250, ko00260, ko00270, ko00280, ko00290, ko00300, ko00310, ko00330, ko00340, ko00350, ko00360, ko00380, ko00400
Metabolism of Other Amino Acids (OAM): 3 pathways, 69 compounds
    ko00410, ko00430, ko00480
Metabolism of Cofactors and Vitamins (MCV): 12 pathways, 296 compounds
    ko00130, ko00670, ko00730, ko00740, ko00750, ko00760, ko00770, ko00780, ko00785, ko00790, ko00830, ko00860
Metabolism of Terpenoids and Polyketides (MTP): 1 pathway, 26 compounds
    ko00900

Table 4.3.1: Pathway classes and individual pathways included in the study.

4.3.2 Non-Biological Substructures (NBS)

In addition to these non-biological compounds, we empirically derived a set of non-biological substructures (NBS) which are not commonly found in mammalian biochemical compounds [94]. The NBS list was checked against KEGGscafs. If an NBS was found to be part of a compound in KEGGscafs, the NBS was removed. This resulted in 35 substructures in the final NBS list (Tables 4.3.2 and 4.3.3). The NBS list was used as an initial filter in the identification process. If a candidate compound was found to contain at least one NBS, it was predicted to be non-biological.

Table 4.3.2: List of Non-biological substructures (NBS)

#     NBS SMILES      NBS Structure      Exceptions
1     C=S
2     S=O
3     S(=O)(=O)

Table 4.3.3: List of Non-biological substructures (NBS)

#     NBS SMILES      NBS Structure      #     NBS SMILES      NBS Structure
4     C1=CC=C1                           5     C1C=C1
6     C1C2C1C2                           7     C1CC=C1
8     C1CC1                              9     P1PP1
10    S1SS1                              11    N1NN1
12    C=C=C                              13    [N+](=O)[O-]
14    N=[N+]=[N-]                        15    NNN
16    OOO                                17    SSS
18    C=P                                19    N=O
20    N=P                                21    N=S
22    P=P                                23    P=S
24    S=S                                25    NS
26    CON                                27    OON
28    PON                                29    NON
30    SON                                31    PP
32    PS                                 33    SN
34    SO                                 35    C#N
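The role of the NBS list as a first-stage filter can be illustrated with a deliberately simplified sketch. It assumes a molecule is given as a list of bonds of the form (element, element, bond order) and it checks only a handful of the two-atom patterns from Tables 4.3.2 and 4.3.3; rings, longer chains, charged groups and the Exceptions column are not handled, and a real implementation would use a substructure matcher. All names are hypothetical.

```python
def bond_key(a, b, order):
    """Order-independent key for a bond between elements a and b."""
    return (tuple(sorted((a, b))), order)

# A few two-atom NBS patterns taken from the tables: C=S, N=O, P=P, PP, C#N.
NBS_BOND_PATTERNS = {
    bond_key("C", "S", 2),
    bond_key("N", "O", 2),
    bond_key("P", "P", 2),
    bond_key("P", "P", 1),
    bond_key("C", "N", 3),
}

def contains_nbs(bonds):
    """True if any bond of the molecule matches one of the listed patterns."""
    return any(bond_key(a, b, order) in NBS_BOND_PATTERNS for a, b, order in bonds)

def first_stage(bonds):
    """Apply the NBS filter before any scaffold matching is attempted."""
    return "non-biological" if contains_nbs(bonds) else "continue to scaffold matching"

# Heavy-atom bond lists for three small molecules (hydrogens omitted).
acetonitrile = [("C", "C", 1), ("C", "N", 3)]
ethanol = [("C", "C", 1), ("C", "O", 1)]
thioacetone = [("C", "C", 1), ("C", "C", 1), ("S", "C", 2)]

labels = [first_stage(m) for m in (acetonitrile, ethanol, thioacetone)]
```

Only ethanol passes the filter here; acetonitrile is caught by C#N and thioacetone by C=S, regardless of the order in which the two atoms of the bond are written.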

4.3.3 Training Data

From the selected set of 1,565 KEGGscafs, there were 1,388 compounds with molecular weights in the range 50 – 700 Da. These were used as the training set for our method. A set of 1,388 synthetic compounds, selected from the synthetic compounds dataset to match the mass distribution of the 1,388 biological set, was used to represent non-biological chemical space. Synthetic compounds containing one or more NBS were not used for training since BioSM applies the NBS filter before the scaffolds matching step.

4.3.4 Prospective Validation Sets

To estimate the performance of our predictive model, five external validation sets were used: one set of drugs, two sets of putative human metabolites, one set of plant secondary metabolites, and one set of synthetic compounds. Figure 4.3.1 shows the mass distribution of the compounds in each validation dataset. For each dataset, any compound identical to any of KEGGscafs was removed. Also, structures found in more than one dataset were removed from all datasets except one, as explained below. The following is a description of the five datasets:

1. A dataset which contained 7,036 compounds obtained from DrugBank version 3.0 downloaded on 01/18/2012, combined with a set of 5,390 structures obtained from the 1989 USAN and the USP Dictionary of Drug Names, was used as a drug dataset. Salts, mixtures, compounds containing elements other than C, H, N, O, S, and P; duplicate structures and compounds with molecular weight outside the 50 – 700 Da range were removed resulting in a set of 3,895 compounds.

2. I used compounds from HMDB version 2.5, downloaded on 7/15/2012, to represent human metabolites. Out of the 8,534 molecules in that set, 174 compounds contained elements other than C, H, N, O, S, and P; 4,209 molecules were outside the considered mass range (50 – 700 Da) and 133 compounds had duplicate structures. Additionally, 1,138 molecules were eliminated because they were found in KEGGscafs and 132 were found in the drug dataset. Finally, all charged structures except those in which the charge was due to quaternary amines or sulfonium ions were eliminated. This resulted in an independent dataset of 2,563 putative human metabolites.

3. I downloaded a set of 2,396 compounds from HumanCyc version 16.0 on 5/24/2012 to represent another dataset of putative human metabolites. A curated set of 158 compounds was available for testing after eliminating compounds containing elements other than C, H, N, O, S, and P (111 compounds), those not in the mass range 50 – 700 Da (289 compounds), compounds found in KEGGscafs (198 compounds), charged compounds (792 compounds), duplicate structures (283 compounds), polymers (368 compounds), drugs (28 compounds), and HMDB compounds (169 compounds).

Figure 4.3.1: Mass distribution of compounds in the validation datasets in 50 Da bins.

4. A dataset of 2,829 secondary plant metabolites [95], as specified by KEGG, was downloaded on 6/25/2012 to represent plant structures. A total of 2,416 compounds remained after removing compounds present in KEGGscafs (75 compounds), drugs (54 compounds), compounds not in the mass range 50 – 700 Da (217 compounds), compounds containing elements other than C, H, N, O, S, and P (10 compounds), and compounds with charges (57 compounds).

5. A fifth dataset of 458,207 compounds from the Chembridge and Chemsynthesis databases, not used in training the model, was used as a synthetic compound test set. The same curation steps described above were used for these compounds.

In addition to these five validation datasets, we classified a random set of compounds taken from the PubChem chemical database. On 12/15/2011, 30,142,651 compounds were downloaded from PubChem. I eliminated 1,003,580 compounds with molecular masses not in the range of 50 – 700 Da. Further, 13,171,123 compounds that contained elements other than C, H, O, N, S, P were eliminated. Three replicate datasets, each containing approximately 320,000 compounds, were randomly chosen from the remaining 15,967,948 PubChem compounds, resulting in a total of 959,420 molecules. Further curation resulted in the elimination of 7,280 compounds with duplicate structures, 67,449 compounds with charges and 12 compounds that had disconnected structures. This resulted in three random samples totaling 883,199 test molecules. It should be noted that there was no attempt to remove compounds present in any of the other validation sets from the PubChem dataset. The PubChem dataset was intended to be a random sampling (other than curation requirements) of PubChem compounds.

4.3.5 Extended Scaffolds List

In order to determine whether BioSM's prediction accuracy would improve if the number of scaffolds was increased, we compiled an updated scaffolds list of 3,927 compounds (referred to as KHHscafs; KEGG, HMDB, and HumanCyc Scaffolds List) using our initial KEGGscafs, plus additional compounds from the HMDB and HumanCyc databases. Only non-redundant compounds from HMDB and HumanCyc predicted to be endogenous mammalian biochemical compounds by BioSM using KEGGscafs were included in KHHscafs. This list consisted of the original 1,565 KEGGscafs, 2,273 compounds from HMDB and 89 compounds from HumanCyc. A set of compounds from the synthetic dataset (randomly selected to match the KHHscafs mass distribution) was chosen to represent non-biological compounds. I then used the same cross validation framework and scoring methods described earlier for KEGGscafs. BioSM using KHHscafs was used to analyze the following independent datasets:

1. the drug dataset described above (3,894 compounds),

2. the plant secondary metabolites dataset (2,354 compounds) after eliminating 62 compounds found in the KHHscafs,

3. compounds from the synthetic dataset (374,143 Chemsynthesis and Chembridge compounds) not used in training BioSM, and

4. one of the randomly generated PubChem datasets (294,671 compounds).

4.4 Results and Discussion

4.4.1 Selection of Candidate Scoring Methods by CV

In this thesis, the nested CV framework defined in 3.1.2 was used to evaluate the performance of the proposed prediction methods. Specifically, parameter tuning was performed by executing 5-fold CV on the training data and the classification accuracy was empirically assessed using 2-fold CV on the testing data. Compounds in the scaffolds list and an equal number of atom-matched non-biological compounds were individually divided randomly into two halves: one half for model training and the other half for model testing. The training half, composed of 711 biological and 711 non-biological compounds, was further randomly split into k = 5 roughly equal parts. Thus, the 711 biological compounds as well as the 711 non-biological compounds were divided into 5 random parts. One part from the biological data along with one part from the non-biological data were used to evaluate classification accuracy of models trained on the remaining (k − 1) parts of the biological data as scaffolds. For the results of each training fold, the score where SENS = SPEC was recorded as the cutoff threshold of that fold. This process was repeated 5 times to ensure that each of the 5 parts was evaluated. Five-fold CV was used to determine bin boundaries ensuring that each bin had approximately the same number of compounds, as well as independent score threshold values for each bin. Both threshold scores and bin boundaries obtained from each of the 5 training folds were averaged and then applied to the testing fold. Several methods for scoring a candidate compound were examined in this CV analysis. Specifically, the US reflects the value of equation (1) having the candidate compound as the superstructure and the union scaffolds as the substructure, the SS reflects the sum of the union scaffold score and the superstructure score, and the MS reflects the larger of the union scaffold score and superstructure score.
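The per-fold cutoff selection described above (the score where SENS = SPEC, averaged over the training folds) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, a compound is assumed to be called biological when its score is strictly greater than the threshold, and on finite data the function returns the observed score at which SENS and SPEC are closest rather than exactly equal.

```python
def sens_spec(threshold, bio_scores, nonbio_scores):
    """Sensitivity and specificity when 'score > threshold' means biological."""
    sens = sum(s > threshold for s in bio_scores) / len(bio_scores)
    spec = sum(s <= threshold for s in nonbio_scores) / len(nonbio_scores)
    return sens, spec


def balanced_threshold(bio_scores, nonbio_scores):
    """Return the observed score at which |SENS - SPEC| is smallest."""
    def gap(threshold):
        sens, spec = sens_spec(threshold, bio_scores, nonbio_scores)
        return abs(sens - spec)

    candidates = sorted(set(bio_scores) | set(nonbio_scores))
    return min(candidates, key=gap)


def averaged_threshold(fold_thresholds):
    """Average the cutoffs recorded for the individual training folds."""
    return sum(fold_thresholds) / len(fold_thresholds)


# Hypothetical scores for one training fold.
bio = [0.9, 0.8, 0.7, 0.4]
nonbio = [0.1, 0.2, 0.3, 0.6]
fold_cutoff = balanced_threshold(bio, nonbio)
```

With these toy scores the cutoff is 0.4, where three of four biological compounds score above the threshold and three of four non-biological compounds score at or below it.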
In preliminary experiments we noted that the molecular weight of a compound had an impact on its final score. This is because smaller compounds are more likely to match larger scaffolds, larger compounds are more likely to match smaller scaffolds, and compounds of intermediate size could match both smaller and larger scaffolds. Therefore, we chose to split the set of test compounds into 5 mass bins. Five-fold CV was used to determine bin boundaries ensuring that each bin had approximately the same number of compounds, as well as independent score threshold values for each bin (as explained in 3.1.2). Both threshold scores and bin boundaries obtained from each of the 5 training folds were averaged before applying BioSM to the testing fold. Thus, the sum of threshold values obtained from each fold divided by the number of folds (5) would be the averaged threshold score applied by BioSM to the testing fold. I refer to classification obtained by applying the three scoring methods discussed above with independent threshold values for each of the 5 bins as 5-Bin Union-scaffold Score (5BUS), 5-Bin Sum of Scores (5BSS), and 5-Bin Maximum Score (5BMS), respectively. The accuracy measures explained above were used to compare results generated from 15 CV experiments for each of the scoring functions (US, MS, SS, 5BUS, 5BMS, and 5BSS) as shown in Table 4.4.1. An ANOVA test was carried out to check for statistical significance between the 6 scoring methods (see section 3.2 for more details). ANOVA results indicated no statistically significant difference between any of the 6 methods (P > 0.05). However, 5BSS accuracy was consistently higher than the other methods on all measures and thus was selected as the scoring method for all remaining experiments.

It is noticeable (Table 4.4.1) that the sensitivity of the model in the CV experiments is relatively low. As explained in the methods section, in each CV experiment only half of the KEGGscafs were used for training the model and the other half were used for testing. Thus, a candidate could be predicted to be non-biological because there were no scaffolds in the randomly selected training set to match it in that specific experiment.

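Structure Scoring Methods

| Measure | Statistic | US | MS | SS | 5BUS | 5BMS | 5BSS |
|---|---|---|---|---|---|---|---|
| SENS | Mean | 0.77 | 0.78 | 0.78 | 0.76 | 0.77 | 0.79 |
| SENS | StdDev | 0.02 | 0.02 | 0.02 | 0.03 | 0.03 | 0.02 |
| SPEC | Mean | 0.71 | 0.71 | 0.72 | 0.71 | 0.71 | 0.73 |
| SPEC | StdDev | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| PPV | Mean | 0.73 | 0.74 | 0.74 | 0.73 | 0.73 | 0.75 |
| PPV | StdDev | 0.03 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 |
| MCC | Mean | 0.49 | 0.50 | 0.50 | 0.47 | 0.48 | 0.51 |
| MCC | StdDev | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.04 |
| F Score | Mean | 0.75 | 0.75 | 0.75 | 0.74 | 0.75 | 0.76 |
| F Score | StdDev | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |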

Table 4.4.1: Mean and standard deviation of accuracy measures obtained for 15 cross validation experiments using 6 different scoring methods and KEGGscafs (N = 1,565 compounds)
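For reference, the five measures reported in the table can be computed from a confusion matrix as sketched below. The formulas are the standard textbook definitions, which are assumed here to agree with those given in section 3.3; the function name and the example counts are hypothetical and are not results from this work.

```python
import math


def accuracy_measures(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    sens = tp / (tp + fn)          # sensitivity (true positive rate)
    spec = tn / (tn + fp)          # specificity (true negative rate)
    ppv = tp / (tp + fp)           # positive predictive value
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation
    f_score = 2 * ppv * sens / (ppv + sens)              # harmonic mean of PPV and SENS
    return {"SENS": sens, "SPEC": spec, "PPV": ppv, "MCC": mcc, "F": f_score}


# Hypothetical counts for a balanced test set of 100 + 100 compounds.
measures = accuracy_measures(tp=79, fp=27, tn=73, fn=21)
```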

4.4.2 Leave-One-Out Cross Validation Experiments

Using the averaged meta-parameters determined by CV (as explained in 3.1.3), I carried out a set of LOOCV experiments on the N = 1,388 structures (with masses between 50 and 700 Da) in our reference scaffolds database as an additional method of evaluating the accuracy of BioSM in predicting endogenous mammalian biochemical structures. N experiments were performed and for each experiment, N − 1 compounds (plus 177 KEGG compounds with masses 700 – 1200 Da) were used as scaffolds and the remaining compound was treated as an unknown. This allowed the use of all but one scaffold in the prediction process. As a result, BioSM annotated 95% of the compounds as being biochemical.
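The LOOCV loop described above can be sketched in a few lines. This is an illustration only: `classify` stands in for the BioSM prediction step, and in the toy example a plain substring test on strings replaces real substructure matching, which is an assumption made purely to keep the example self-contained.

```python
def loocv_fraction_biological(scaffolds, extra_scaffolds, classify):
    """Hold out each scaffold in turn and classify it against all the others."""
    hits = 0
    for i, candidate in enumerate(scaffolds):
        # N - 1 remaining scaffolds plus the scaffolds that are never held out.
        reference = scaffolds[:i] + scaffolds[i + 1:] + extra_scaffolds
        if classify(candidate, reference):
            hits += 1
    return hits / len(scaffolds)


def toy_classify(candidate, reference):
    """Toy stand-in: a 'match' is a substring relation in either direction."""
    return any(candidate in s or s in candidate for s in reference)


# Hypothetical strings standing in for structures.
fraction = loocv_fraction_biological(["CCO", "CC", "CCN", "OPO"], ["CCNCC"], toy_classify)
```

Here three of the four held-out items find a match among the remaining ones, so the function returns 0.75.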

4.4.3 Prospective Validation

Five prospective datasets (drugs, plant secondary metabolites, 2 independent human metabolite datasets, and a synthetic molecule dataset) were classified by BioSM using the 5BSS method. The compounds in each dataset were split into 5 bins (mass range/bin determined as described in the CV experiments) and the percentage of biochemical predictions per bin was computed (Figure 4.4.1). For the sake of comparison, the results from the LOOCV experiments with 1,388 KEGG endogenous metabolites (described above) are also included in Figure 4.4.1. The prediction accuracy for KEGG compounds (LOOCV results) is uniform across all mass bins. For the other datasets, compounds in the mass range 287 – 700 Da (bins 4 and 5) tended to have a higher probability of being predicted as endogenous mammalian biochemical structures. This was especially true for the HumanCyc compounds, plant metabolites and drugs.

The overall results (Table 4.4.3) show that out of the 2,563 HMDB molecules, 89% were predicted to be biochemical structures. However, only 58% of HumanCyc compounds were predicted to be biological. Visual examination of the HumanCyc structures predicted to be non-biological showed that many of them are indeed non-biological. For example, anthracene, triazene and compounds with cyclopropane rings are included in the list (these non-biochemical structures are given in the supplementary material). Thus, the above results are consistent with the intent of the HMDB and HumanCyc databases to include compounds that are found in humans; however, these are not necessarily endogenous mammalian biochemical compounds.

For the 2,416 plant compounds, 72% were predicted to be biochemical. Although this high percentage might seem initially surprising given that we are using mammalian scaffolds to represent biochemical space, this result is consistent with current biochemical and evolutionary data suggesting that plant secondary metabolites and mammalian biochemicals (i.e., our KEGGscafs) share multiple conserved biochemical pathways and thus an overlapping biochemical phylogeny [96]. Interestingly, only 1% of the plant secondary metabolites matched one or more superstructure scaffolds, and those plant compounds were found to have relatively small molecular weights (116 – 299 Da). This suggests that plants have expanded upon conserved biochemical pathways to produce compounds containing unique combinations of common scaffolds, and that these unique combinations are not substructures of known mammalian scaffolds.

Forty-eight percent of 3,895 drug structures were predicted to be endogenous mammalian biochemical structures. These results are very similar to those found earlier by Peironcely et al. using a similar drug dataset [87]. It is perhaps not surprising that approximately half of the drugs were predicted to be endogenous biochemical structures since many are derived from natural products [51]. In contrast, only 29% of the synthetic compounds were predicted to be endogenous biochemical structures. By chance, synthetic compounds may be structurally similar to biochemical compounds. Indeed, as mentioned previously, we found 127 compounds that had to be removed from the synthetic dataset prior to cross validation because they were identical to compounds in KEGGscafs.

Figure 4.4.1: Biological predictions within each mass bin for each dataset using KEGGscafs. 5BSS bin threshold values (thr) are also displayed. *LOOCV results.

In addition to these five prospective datasets, three random samples of approximately 294,000 compounds (883,199 total) from PubChem were tested. Thirty-four percent (± 0.02%) of these were predicted to be biochemical. This suggests that the PubChem database contains mostly non-biological compounds. Thus, for metabolomics studies where identification of unknown endogenous biochemicals is the primary goal, BioSM would facilitate more efficient use of large chemical databases such as PubChem by removing non-biological candidate compounds from further consideration. For example, BioSM will be incorporated into MolFind [66], a recently described program that aids in the identification of unknown compounds detected in biological samples by LC/MS. Table 4.4.2 shows the detailed prediction results for each of the PubChem random samples as well as the average and standard deviation.

Next, we evaluated the distribution of candidate scores regardless of compound mass (Figure 4.4.2) for each prospective dataset. The PubChem, synthetic, and drug datasets each contain a large number of compounds (31%, 32%, and 25%, respectively) with a candidate score of zero. After eliminating compounds with a zero score due to NBSs (Table 4.4.2), we found that 8% of PubChem compounds, 10% of the synthetic compounds and 9% of the drug compounds had no structural similarity with any of our scaffolds. It is also clear in Figure 4.4.2 that PubChem compounds and synthetic compounds have very similar candidate score distributions.

| PubChem Random Set | Number of Compounds | Non-Biological (5BSS) | Non-Biological (NBSs) | Biological (5BSS) |
|---|---|---|---|---|
| 1 | 294,651 | 23.62% | 42.37% | 34.01% |
| 2 | 293,885 | 23.62% | 42.36% | 34.02% |
| 3 | 294,643 | 23.65% | 42.30% | 34.06% |
| Average | 294,393 | 23.63% | 42.34% | 34.03% |
| Std Dev | 439.96 | 0.0002 | 0.0004 | 0.0002 |

Table 4.4.2: Prediction results for 3 random PubChem datasets using the 5BSS classifier and KEGGscafs.

A candidate score greater than 1.0 can only be achieved if the candidate compound has at least one matching substructure scaffold and at least one matching superstructure scaffold. Figure 4.4.2 shows that 82% of the KEGG endogenous compounds, 54% of the HMDB compounds and 31% of the HumanCyc compounds have scores between 1 and 2. Only a few of the drug, plant, PubChem and synthetic compound structures have candidate scores in that range (9%, 7%, 2%, and 1%, respectively). As mentioned earlier, only about 1% of the plant compounds matched one or more superstructure scaffolds. Thus, of the 7% of plant compounds with scores between 1 and 2, approximately 6% had a score of 1. Using KEGGscafs, the largest threshold value over all 5 bins was 0.89. Therefore any compound, regardless of its mass, with a score greater than 0.89 would be annotated as an endogenous mammalian biochemical compound.
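How the three candidate scores relate to one another, and how a mass-binned cutoff is applied, can be sketched as follows. This is an illustration under stated assumptions: the union scaffold score and superstructure score are taken as given inputs in the range 0 to 1, the function names are hypothetical, and the bin edges and thresholds are invented for the example, except that 287 Da is used as the lower edge of bin 4 and 0.89 as the largest threshold, as reported in the text.

```python
from bisect import bisect_right


def combined_scores(union_score, super_score):
    """US, SS and MS for one candidate, given its two component scores."""
    return {
        "US": union_score,
        "SS": union_score + super_score,      # sum of scores, can exceed 1.0
        "MS": max(union_score, super_score),  # larger of the two scores
    }


def is_biological_5bss(ss_score, mass, bin_edges, bin_thresholds):
    """Compare the SS score against the threshold of the candidate's mass bin."""
    bin_index = bisect_right(bin_edges, mass)  # 4 interior edges give 5 bins
    return ss_score > bin_thresholds[bin_index]


# Hypothetical interior bin edges (Da) and per-bin thresholds.
EDGES = [150.0, 220.0, 287.0, 400.0]
THRESHOLDS = [0.50, 0.60, 0.70, 0.89, 0.80]

scores = combined_scores(0.75, 0.5)
```

Because each component score is at most 1, an SS value above 1.0 requires both a substructure and a superstructure match, and any SS value above the largest bin threshold is annotated as biological whatever the mass.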

| Dataset | Number of Compounds | Non-Biological (5BSS) | Non-Biological (NBSs) | Biological (5BSS) |
|---|---|---|---|---|
| HMDB | 2,563 | 1% | 10% | 89% |
| Plant Secondary Metabolites | 2,416 | 0% | 28% | 72% |
| HumanCyc | 158 | 7% | 35% | 58% |
| Drugs | 3,895 | 16% | 36% | 48% |
| Synthetics | 458,207 | 21% | 50% | 29% |
| PubChem | 959,420 | 22% | 46% | 32% |

Table 4.4.3: Predictive results using the 5BSS classifier for 6 different datasets using KEGGscafs.

4.4.4 Extended Scaffolds List

The analysis above was based on using BioSM and the curated set of 1,565 KEGGscafs. This assumes that these 1,565 structures provide a complete (or nearly complete) representation of mammalian biochemical structure space. Thus, an important question is whether a larger scaffold list (larger biochemical structure space) would significantly change the results presented above. After updating the scaffolds list to 3,927 compounds (KHHscafs described above), I followed the same process for finding the best scoring method, cutoff values, and bin masses using 15 CV experiments with 3,750 training scaffolds (3,927 – 177 = 3,750) in the 50 – 700 Da mass range. For the non-biological set we selected a random set of structures from the Chembridge and Chemsynthesis databases which matched the mass distribution of the 3,750 training KHHscafs. Note that since this non-biological set was chosen at random from our curated dataset of 483,615 synthetic compounds, it is not identical to the non-biological set used for CV of KEGGscafs. Table 4.4.4 shows the average accuracy measures of the 15 CV experiments for US, MS, SS, 5BUS, 5BMS and 5BSS methods. An ANOVA of the results in Table 4.4.4 indicated statistically significant (P < 0.05) differences between SPEC and PPV for one or more of the 6 scoring methods. Having the highest SPEC (0.75) and PPV (0.83), 5BSS was selected as the scoring method for BioSM when using KHHscafs to reanalyze the various datasets as described above. A further ANOVA of the 5BSS CV results for KEGGscafs and KHHscafs showed a statistically significant (P < 0.05) difference between all measures.

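Structure Scoring Methods

| Measure | Statistic | US | MS | SS | 5BUS | 5BMS | 5BSS |
|---|---|---|---|---|---|---|---|
| SENS | Mean | 0.84 | 0.84 | 0.84 | 0.83 | 0.84 | 0.83 |
| SENS | StdDev | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 |
| SPEC | Mean | 0.72 | 0.72 | 0.72 | 0.73 | 0.73 | 0.75 |
| SPEC | StdDev | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| PPV | Mean | 0.81 | 0.81 | 0.81 | 0.82 | 0.82 | 0.83 |
| PPV | StdDev | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| MCC | Mean | 0.56 | 0.56 | 0.57 | 0.56 | 0.57 | 0.58 |
| MCC | StdDev | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 |
| F Score | Mean | 0.82 | 0.82 | 0.83 | 0.82 | 0.83 | 0.83 |
| F Score | StdDev | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |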

Table 4.4.4: Average and standard deviation of accuracy measures obtained for 15 cross validation experiments using 6 different scoring methods and the KHHscafs (N = 3,927 compounds).

Figure 4.4.2: Frequency distribution of candidate scores for each dataset. 5BSS threshold values for each of the 5 bin masses are given in Figure 4. *LOOCV results.

Figure 4.4.3: Biological predictions within each mass bin for each dataset using KHHscafs. 5BSS bin threshold values (thr) are also displayed. *LOOCV results.

Figure 4.4.3 shows the results of LOOCV as well as the results of the prospective datasets per mass bin. Ninety-six percent of the 3,750 KHHscafs were correctly predicted as biological using LOOCV. Even though this value is high, four percent of our scaffolds were still incorrectly annotated (these structures are found in the supplementary material). In many cases, we noted that these false negatives arose because BioSM requires an exact match between the scaffold and the candidate. This was particularly problematic for predicting specific classes of compounds. For example, lipids with a double bond in the middle of the structure were poorly predicted by BioSM since there may not be scaffolds that match either side of the double bond. I explored using scaffold matching without the requirement of exact bond matching; however, the specificity of the system was negatively affected. It is important to note that bin masses and cut-off thresholds changed after running CV with the updated KHHscafs. This explains why some compounds predicted to be biological using KEGGscafs might be predicted to be non-biological using KHHscafs or vice versa.

Although the 96% sensitivity suggested by our LOOCV analysis is quite good, a possible approach to further improve BioSM would be to expand the set of scaffolds by using enzyme reaction information (oxidation and/or reduction reactions, for example). In this case, not only would BioSM be searching for exact structure matches between scaffolds and candidate compounds, but also among putative metabolites of those scaffolds. BioSM would apply a set of applicable enzyme reactions to a candidate compound; if any of the metabolites produced were found to be an endogenous mammalian biochemical compound by BioSM, then the candidate would also be considered biochemical.
Using KHHscafs, BioSM predicted 74% of the 2,354 plant compounds, 42% of the 3,894 drug compounds, 26% of the 374,143 synthetic compounds and 25% of the 294,671 random PubChem compounds as biological. It is important to point out that this 25% value for PubChem does not include compounds that were eliminated during the initial curation steps (mass range requirement, compounds with elements other than C, H, N, O, P, S, stereoisomers, salts and disconnected structures). Thus, starting with approximately 29,000,000 PubChem compounds with MIMW between 50 and 700 Da, it was estimated that approximately 3,680,000 (13%) of these would be annotated as mammalian biochemical compounds using our curation steps and BioSM. Figure 4.4.4 shows the distribution of candidate scores from each dataset regardless of compound mass.

Figure 4.4.4: Distribution of compounds based on their candidate scores using KHHscafs. *LOOCV results.

Figure 4.4.5 illustrates the percentage of molecules predicted to be biological by BioSM using KHHscafs versus KEGGscafs in each of the prospective datasets. Although sensitivity, specificity, MCC, PPV and F score are all significantly higher when using KHHscafs, overall, the percentages predicted to be biological are very similar using the two sets of scaffolds. Thus, it is unlikely that the use of additional scaffolds will significantly improve our representation of biochemical structure space as defined here, and the model appears to be reasonably robust. One could argue that the 2,362 added scaffolds may not have contributed appropriate biochemical structure diversity since they were predicted to be biological using KEGGscafs. However, this seems unlikely due to the large number of non-redundant structures added, and the fact that all CV model parameters were significantly improved compared to KEGGscafs. Further slight improvements may still be possible by iteratively expanding the scaffold list; notably, out of the 275 HMDB compounds classified as non-biological using KEGGscafs, 91 were classified as biological using KHHscafs.

Figure 4.4.5: Percentage of biological predictions in each dataset using KEGGscafs versus using KHHscafs. *Refer to LOOCV results when using the KEGGscafs dataset (turquoise bar) and the KHHscafs (purple bar) as defined in the methods section above.

It is difficult to measure the accuracy of BioSM based on the results displayed in Figure 4.4.5, as there is no definite answer as to whether or not each compound in these datasets is actually an endogenous mammalian biochemical. Yet it is still interesting to see how BioSM classifies compounds from each dataset.
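The order of magnitude of the PubChem estimate quoted in this section can be checked with the counts reported in section 4.3. The calculation itself is not spelled out in the text, so the chain of factors below is an assumption: the compounds surviving the mass and element filters are scaled by the fraction of sampled compounds that survived further curation and by the 25% predicted as biological with KHHscafs. Variable names are hypothetical.

```python
# Counts reported in the dataset description.
downloaded = 30_142_651
outside_mass_range = 1_003_580
other_elements = 13_171_123
sampled = 959_420                 # three random replicates, before further curation
sampled_after_curation = 883_199  # after removing duplicates, charges, disconnected
fraction_biological = 0.25        # KHHscafs prediction on one curated random sample

in_mass_range = downloaded - outside_mass_range  # compounds with masses of 50 - 700 Da
eligible = in_mass_range - other_elements        # only C, H, O, N, S, P
survival = sampled_after_curation / sampled      # share passing the remaining curation

# Assumed estimate: eligible compounds x curation survival x predicted-biological share.
estimated_biological = eligible * survival * fraction_biological
share_of_mass_range = estimated_biological / in_mass_range
```

Under these assumptions the estimate comes to roughly 3.7 million compounds, about 13% of the approximately 29 million compounds in the mass range, which is in line with the figures quoted above.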

4.5 Conclusions

In this chapter, I described the development and validation of BioSM, a novel supervised classifier that uses endogenous mammalian biochemical scaffolds to predict whether a candidate chemical structure is biochemical or synthetic. BioSM was able to correctly classify 96% of 3,750 biochemical compounds in a leave-one-out cross validation experiment. In addition, the results suggest that approximately 13% of PubChem compounds are mammalian biochemicals. Thus, BioSM may be useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large. Additionally, BioSM can place molecules in the context of metabolic pathways since it can link potentially unknown biochemicals to matched substructure and superstructure scaffolds for which metabolic pathways are known.

Chapter 5

Efficient identification of endogenous mammalian biochemical structures

5.1 Introduction

Despite its encouraging results, BioSM is limited by its need to exhaustively search all known biochemical structures before it can make a decision about the candidate compound under investigation, which results in an undesirably high run time. In this chapter, I introduce BioSMXpress. BioSMXpress was designed as an enhancement to BioSM with the aim of making the smallest possible number of structure comparisons to efficiently identify biochemical structures with the aid of a scaffolds list. BioSMXpress decides if a candidate structure is biochemical based upon how similar that structure is to any of the structures in the scaffolds list. BioSMXpress is a highly multi-threaded desktop application written in Java.


Similar to BioSM, two molecular structures are considered to be a "match" if the smaller structure is an exact substructure (atom and bond types) of the larger structure being compared. Also, the underlying cheminformatics functionality of BioSMXpress is based on SMSD [92]. In addition to SMSD, BioSMXpress uses Marvin, a chemical structure processing software, to generate both the canonical SMILES and atom counts from structure data files (.sdf) for all the compounds described in this work.

5.2 Computational Algorithm

Here, I will introduce a tool that can efficiently identify small endogenous mammalian biochemical structures from the chemical structure space. First, I will start by defining some notations followed by a detailed explanation of the computational model behind BioSMXpress. Let $c_q$ be the molecular structure of a query compound, and let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of $n$ small biological compounds (scaffolds). Let $s_x \sim c_q$ indicate that scaffold $s_x$ is a substructure of candidate compound $c_q$, and let $AC(s_x)$ represent the number of atoms in compound $s_x$. Let $\mathit{minAC}$ define the minimum number of atoms required in a scaffold $s_y$ to identify $c_q$ as biological if $c_q \sim s_y$. If two molecular structures $r$ and $q$ are found to be a match, a similarity score (as defined in equation 4.2.1) is calculated where $r \sim q$. Based on a given substructure threshold (subThr), the minimum atom count is computed as

$$\mathit{minAC} = \lfloor AC(c_q) \times \mathit{subThr} \rfloor. \tag{5.2.1}$$

Similarly, based on a given superstructure threshold (superThr), BioSMXpress computes the maximum atom count,

$$\mathit{maxAC} = \left\lceil \frac{AC(c_q)}{\mathit{superThr}} \right\rceil. \tag{5.2.2}$$

Finally, let $\bar{S} = \{s'_1, s'_2, \ldots, s'_l \mid l \le n\}$ be the scaffold list assigned to $c_q$, where $\bar{S} \subseteq S$. A scaffold $s'_x \in S$ is assigned to $\bar{S}$ if and only if $\mathit{minAC} \le AC(s'_x) \le \mathit{maxAC}$. Please note that each candidate structure with a different atom count is provided with a different set of scaffolds in $\bar{S}$. Once $\bar{S}$ is populated with the appropriate scaffolds for $c_q$, BioSMXpress examines $c_q$ against each of those scaffolds. As soon as a match (substructure or superstructure) is found, BioSMXpress predicts that $c_q$ is biological and terminates. Otherwise, it is predicted to be non-biological. Values for subThr and superThr are determined by cross validation as described in the following section.

Figure 5.2.1 illustrates a visual example of the BioSMXpress scaffolds selection and ordering process. In addition, BioSMXpress orders the potential scaffolds in $\bar{S}$ such that scaffolds with atom counts closer to the candidate atom count are examined first, followed by those with a larger atom count difference. In this case, once a match is found the search terminates and it is guaranteed that this is the best possible match (as a substructure or superstructure). Figure 5.2.2 shows an overview of the general flow of BioSMXpress and an illustrative example.
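The scaffold selection, ordering, and early-termination steps of this section can be sketched as follows. This is a minimal illustration and not the Java implementation of BioSMXpress: the function names are hypothetical, scaffolds are represented only by an identifier and an atom count, and the structure comparison itself (performed with SMSD in this work) is abstracted as a caller-supplied `is_match` function. The tie-breaking rule, larger scaffold first at equal atom-count distance, follows the order 9, 10, 8 given in the example of Figure 5.2.1.

```python
import math


def atom_count_window(candidate_atoms, sub_thr, super_thr):
    """minAC and maxAC as in equations 5.2.1 and 5.2.2."""
    min_ac = math.floor(candidate_atoms * sub_thr)
    max_ac = math.ceil(candidate_atoms / super_thr)
    return min_ac, max_ac


def ordered_scaffolds(scaffold_atom_counts, candidate_atoms, sub_thr, super_thr):
    """Scaffold ids inside the window, closest atom count first."""
    min_ac, max_ac = atom_count_window(candidate_atoms, sub_thr, super_thr)
    eligible = [
        (sid, ac) for sid, ac in scaffold_atom_counts.items() if min_ac <= ac <= max_ac
    ]
    # Sort by distance to the candidate's atom count; on ties, larger scaffolds first.
    eligible.sort(key=lambda item: (abs(item[1] - candidate_atoms), -item[1]))
    return [sid for sid, _ in eligible]


def is_biological(candidate, candidate_atoms, scaffold_atom_counts,
                  sub_thr, super_thr, is_match):
    """Stop at the first matching scaffold; non-biological if none matches."""
    for sid in ordered_scaffolds(scaffold_atom_counts, candidate_atoms, sub_thr, super_thr):
        if is_match(candidate, sid):
            return True
    return False


# Hypothetical scaffold ids with their atom counts.
counts = {"a": 9, "b": 10, "c": 8, "d": 3, "e": 19, "f": 18, "g": 4}
order = ordered_scaffolds(counts, candidate_atoms=9, sub_thr=0.5, super_thr=0.51)
```

For a 9-atom candidate with subThr = 0.5 and superThr = 0.51 the window is 4 to 18 atoms, so scaffolds `d` and `e` are never compared, and the remaining scaffolds are visited in the order `a`, `b`, `c`, `g`, `f`.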

Figure 5.2.1: Scaffold selection and sorting process. In this example, it is assumed that the candidate compound ($c_q$) consists of 9 atoms and that subThr = 0.5 and superThr = 0.51. Therefore, $\mathit{minAC} = \lfloor 9 \times 0.5 \rfloor = 4$ and $\mathit{maxAC} = \lceil 9 / 0.51 \rceil = 18$. (a) The hashed scaffolds list with minAC and maxAC identified. (b) The sorted scaffolds list consists of all the scaffolds with 9 atoms followed by those with 10 atoms followed by those with 8 atoms and so on.

Figure 5.2.2: (A) General flow of BioSMXpress and (B) an example showing how the appropriate scaffolds list is populated.

5.3 Datasets

I will briefly describe the source and nature of the datasets selected to train and validate BioSMXpress. Since these datasets will be used to compare BioSMXpress and BioSM in terms of prediction accuracy, I utilized the same datasets and followed the same curation steps described in 4.3.4.

In each dataset, compounds with any of the following characteristics were eliminated: (1) compounds with elements other than C, H, N, O, P and S; (2) compounds with fewer than 4 atoms or more than 53 atoms (explained below); (3) compounds that were polymers; (4) charged structures except those in which the charge was due to quaternary amines or sulfonium ions; (5) compounds with duplicate structures; and (6) compounds with disjoint structures.

5.3.1 Biological Dataset (Scaffolds list):

The KEGG database was chosen as the source of endogenous mammalian compounds. The list of 1,564 mammalian scaffolds (KEGGscafs) defined in [94] was used to represent the biochemical structure space in BioSMXpress. Each compound in the scaffolds list contains between 4 and 80 atoms.

5.3.2 Non-Biological Dataset (Synthetic compounds list):

The Chembridge (www.chembridge.com) and Chemsynthesis (www.chemsynthesis.com) databases served as the sources of compounds representing the non-biological chemical space. These databases were selected because they comprise synthetic compounds for chemical synthesis and drug screening and design. After curation, a set of 375,930 structures represented the synthetic compounds list. Chemsynthesis and Chembridge databases mainly contain compounds with low molecular weights (a maximum atom count of 53 atoms per compound). Accordingly, 143 of the 1,564 KEGGscafs (with atom count between 54 and 80) were eliminated from any testing set throughout this study and were only used for superstructure scaffold matching. This restriction was enforced to ensure that the sole discrimination between a compound being biological or non-biological is based on the structure of a compound and not on the number of atoms in that compound.

5.3.3 Training Dataset

A total of 2,842 compounds, with at least 4 atoms and at most 53, were used to train and test our predictive model. Half of those compounds were obtained from the scaffolds list (representing the endogenous mammalian chemical space) and the other half from the synthetic compounds list (representing the non-biological chemical space). The latter molecules were randomly selected from the synthetic dataset to match the atom count distribution of the set of 1,421 biological compounds.
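One way to draw a random non-biological set whose atom count distribution matches that of the biological set is stratified sampling, sketched below. The text only states that the selection was random and distribution-matched, so the exact per-atom-count matching, the function name, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_matching_distribution(bio_atom_counts, synthetic, seed=0):
    """Pick, for every atom count, as many synthetic compounds as biological ones.

    synthetic is a list of (compound_id, atom_count) pairs.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for compound_id, atom_count in synthetic:
        pool[atom_count].append(compound_id)

    chosen = []
    for atom_count, needed in Counter(bio_atom_counts).items():
        if len(pool[atom_count]) < needed:
            raise ValueError(f"not enough synthetic compounds with {atom_count} atoms")
        picked = rng.sample(pool[atom_count], needed)
        chosen.extend((compound_id, atom_count) for compound_id in picked)
    return chosen


# Hypothetical toy data: four biological atom counts and eight synthetic compounds.
bio = [4, 4, 10, 53]
synthetic = [(f"s{i}", ac) for i, ac in enumerate([4, 4, 4, 10, 10, 53, 53, 20])]
matched = sample_matching_distribution(bio, synthetic)
```

The returned set has exactly the same multiset of atom counts as the biological input, so atom count alone cannot separate the two classes.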

5.3.4 Independent Datasets

To estimate the performance of our predictive model and compare it with that of BioSM, four external validation sets were used: one set of putative human metabolites, one set of plant secondary metabolites, one set of drugs, and one set of synthetic compounds. For each dataset, any compound with a structure identical to any of those in the scaffolds list was removed. Also, structures found in more than one dataset were removed from all datasets except one. Molecules in each dataset had to satisfy both mass (50 – 700 Da) and atom count (4 – 53 atoms) constraints to allow for a fair comparison between BioSMXpress and BioSM. Additionally, compounds with at least one NBS (see 4.3.2) were eliminated. This decision was based on our interest in comparing the core predictive models of BioSM and BioSMXpress since, in practice, NBS filters will be applied to both models before any scaffold comparisons are involved.

The following is a brief description of the five datasets. Please note that the numbers of compounds reported below refer to the datasets after curation. The first dataset consisted of 2,329 compounds and was obtained from HMDB version 2.5, representing putative human metabolites. The second dataset consisted of 2,416 secondary plant metabolites, as specified by KEGG, representing plant structures. The drug dataset was represented by 3,282 compounds and was obtained from DrugBank [71] version 3.0 and from the 1989 USAN and the USP Dictionary of Drug Names. A randomly chosen set of approximately 46,203 molecules was drawn from the National Center for Biotechnology Information’s (NCBI) PubChem database [67]. PubChem is the largest freely accessible compound database currently available. Finally, a set of 374,509 compounds from the Chembridge and Chemsynthesis databases, not used in training the model, was used as a synthetic compound test set.

5.4 Results and Discussion

5.4.1 Classification Methods Selection

Four classification methods were proposed for BioSMXpress. The SSF method declares a candidate biological as soon as a substructure scaffold match or a superstructure scaffold match is found in the sorted scaffolds list. The SSSF method searches for the best substructure scaffold similarity score ($Sc_{sub}$), if one exists, and the best superstructure scaffold similarity score ($Sc_{super}$), if one exists. It declares the candidate as biological if $Sc_{sub} + Sc_{super} \ge \mathit{sumThr}$. From my experience with BioSM, I found that distributing candidates into mass bins, with each bin having its own threshold values, improved the prediction quality. Thus, we decided to test whether the same concept applies here. I split the set of test compounds into five bins based on the number of atoms per compound and used CV to evaluate the model. This introduced two more methods, SSB and SSSB, similar to SSF and SSSF respectively, except that independent thresholds are assigned to each bin.

The CV training phase was used to record $Sc_{sub}$, $Sc_{super}$ and the sum of scores ($Sc_{sub} + Sc_{super}$) of the training data and to determine the cutoff thresholds subThr, superThr and sumThr where SENS = SPEC. The average thresholds of all five training sets were then used as the cutoff values when evaluating the CV testing data, as explained in 3.1.2. I ran 15 CV experiments to evaluate the performance of each method. Some of the accuracy measures defined in 3.3 were applied to the results of each of the 15 CV experiments. The mean and standard deviation of the 15 experiments are displayed in Table 5.4.1. The highest sensitivity of 90% was obtained by the SSF classifier. At the same time, SSF suffered from the lowest specificity, only 55%. Another observation is that the application of sumThr improved the specificity significantly, 82% (SSSB) and 71% (SSSF) versus 62% (SSB) and 55% (SSF), but affected the sensitivity negatively (53% and 73%, respectively). SSB had the best MCC of 51% and hence was selected as the method of choice for BioSMXpress with a sensitivity of 86% and a specificity of 62%.

5.4.2 Leave-One-Out Cross Validation Analysis

As an additional method to evaluate how well BioSMXpress can identify endogenous mammalian biochemical structures using the SSB classifier, we carried out a set of LOOCV experiments (defined in 3.1.3) using the SSB method with the averaged subThr, superThr and bin boundaries determined by CV as explained in the Methods section. Here, 1,421 experiments representing KEGG structures (with atom count between 4 and 53 atoms/compound) were carried out. Please note that in every experiment, the scaffolds list was composed of 1,420 compounds plus 143 compounds (those with atom count between 54 and 80 atoms/compound) as the scaffolds list. As a result, BioSMXpress annotated 94% of the 1,421 compounds as being biological structures. Using 1,387 scaffolds in a set of LOOCV experiments implemented by BioSMXpress and BioSM independently were implemented and compared. These 1,387 compounds were the scaffolds that satisfied the constraints of both BioSM and BioSMXpress (mass in the range of 50 – 700 Da and number of atoms in the range of 4 – 53). BioSM was capable of identifying 94.5% of the 1,387 scaffolds as biochemical struc- tures while BioSMXpress identified 94.2%. Figure 5.4.1 shows the breakdown of the

Measure  Statistic  SSF    SSSF   SSB    SSSB
SENS     Mean       0.90   0.73   0.86   0.58
SENS     StdDev     0.02   0.04   0.02   0.03
SPEC     Mean       0.55   0.71   0.62   0.82
SPEC     StdDev     0.04   0.05   0.04   0.03
MCC      Mean       0.41   0.45   0.51   0.48
MCC      StdDev     0.03   0.02   0.05   0.05

Table 5.4.1: Mean and standard deviation of accuracy measures obtained for 15 cross validation experiments using 4 different scoring methods.

results of this comparison with compounds binned by atom count. BioSMXpress performed best when identifying compounds in the first bin (99% positive identification of biochemical compounds), while BioSM identified only 92% of those compounds. On the other hand, BioSM's highest performance was achieved in the fourth bin, where it identified 96% of the compounds while BioSMXpress identified 88% of them. In general, Figure 5.4.1 indicates that BioSMXpress is better at identifying biochemical compounds in the first and second bins (99% and 97%, respectively), while BioSM is better at identifying biochemical compounds in the fourth and fifth bins (96% and 94%, respectively). Both identify compounds in the third bin with the same accuracy (95%). It is also worth noting that, in addition to the 1,387 scaffolds annotated, BioSMXpress positively identified 34 compounds in the scaffolds list that were rejected by BioSM without classification due to mass restrictions (their masses were greater than 700 Daltons). This indicates that BioSMXpress has broadened the range of compounds examined simply by restricting the number of atoms in a candidate compound rather than its molecular mass.
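The leave-one-out procedure used in this section can be sketched in a few lines. This is only an outline under stated assumptions: `classify` stands in for whichever predictor is being evaluated, and the toy classifier below uses plain string containment as a crude, hypothetical stand-in for a substructure or superstructure match.

```python
def loocv_positive_rate(compounds, classify):
    """Leave-one-out: classify each compound against all the others.

    `classify(candidate, scaffolds)` is a hypothetical callable returning
    True when the candidate is predicted to be biological.
    """
    hits = 0
    for index, candidate in enumerate(compounds):
        # Hold the candidate out of the scaffolds list for this experiment.
        scaffolds = compounds[:index] + compounds[index + 1:]
        if classify(candidate, scaffolds):
            hits += 1
    return hits / len(compounds)

def toy_classify(candidate, scaffolds):
    """Toy rule: positive if any scaffold string contains, or is contained
    in, the candidate string. Not a real chemical match."""
    return any(s in candidate or candidate in s for s in scaffolds)

toy_compounds = ["CC", "CCO", "CCOC", "NN"]
toy_rate = loocv_positive_rate(toy_compounds, toy_classify)
```

With n compounds this runs n experiments, each against a scaffolds list of n - 1 entries, which mirrors the 1,421 experiments with 1,420 held-in compounds described above (the 143 larger scaffolds would simply be appended to every scaffolds list).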

5.4.3 Prospective Validation

Independent datasets containing plant secondary metabolites, drugs, independent human metabolites, synthetic molecules, and PubChem compounds were classified by BioSMXpress. The compounds in each dataset were split into 5 bins (atom count bins determined as described in the CV experiments) and the percentage of biochemical predictions per bin was computed (Figure 5.4.2). For the sake of comparison, the results from the LOOCV experiments with 1,387 KEGG endogenous metabolites

Figure 5.4.1: Biological predictions resulting from a set of LOOCV experiments by BioSMXpress and BioSM with 1,387 KEGG compounds. Compounds were binned by atom count.

(described above) are also included in the figure. BioSMXpress discriminates effectively between mammalian structures in datasets such as the KEGG and HMDB compounds and non-mammalian ones such as those in the Plant, Drug, PubChem, and Synthetic datasets, specifically for compounds with between 4 and 19 atoms (Bins 1, 2, and 3). The averaged biological prediction over the first 3 bins was very high for the KEGG and HMDB datasets (97% and 86%, respectively) and considerably lower for the Plant, Drug, PubChem, and Synthetic datasets (35%, 39%, 26%, and 24%, respectively). The results also show that compounds with 20 – 53 atoms from both the plant and drug datasets tend to look more biological (84% and 69%, respectively). This might seem initially surprising, but the plant dataset results are consistent with current biochemical and evolutionary data suggesting that plant secondary metabolites and mammalian biochemical structures share multiple conserved biochemical pathways and thus have an overlapping biochemical phylogeny. Similarly, since many drugs are derived from natural products, it is not unexpected to find that 69% of the larger compounds were predicted to be endogenous biochemical structures.

Figure 5.4.2: Biological predictions within each atom count bin for each dataset using BioSMXpress. SSB bin threshold values (subThr and superThr) are also displayed. *Representing LOOCV results.

Subsequently, the same datasets were also annotated by BioSM. Table 5.4.2 presents a comparison of BioSM's predictions versus those of BioSMXpress. The results indicate that BioSMXpress classified 91% of the 2,329 HMDB molecules as endogenous mammalian metabolites, while BioSM identified 88% of them. Predictions for the 2,416 plant metabolites by BioSMXpress and BioSM were comparable, at 72% and 73%, respectively. As for the 3,282 drug compounds, 58% were predicted to be biological by BioSMXpress versus 62% by BioSM. In contrast, only 25% of the randomly selected 46,203 PubChem compounds were predicted as biological by

BioSMXpress as opposed to 35% by BioSM. In addition to these four prospective datasets, a set of 374,509 synthetic compounds (represented by Chembridge and Chemsynthesis compounds) was evaluated by BioSMXpress and BioSM, with 36% and 33% of these being predicted to be biochemical, respectively. Overall, the comparison in Table 5.4.2 shows that the biochemical prediction percentages of BioSM and BioSMXpress are comparable, except that BioSMXpress predicted 10 percentage points fewer of the PubChem compounds as biological (25% versus 35%).

Dataset              Number of Compounds   BioSM   BioSMXpress
HMDB Compounds       2,329                 88%     91%
Plant Metabolites    2,416                 73%     72%
Drug Compounds       3,282                 62%     58%
PubChem Compounds    46,203                35%     25%
Synthetic Compounds  374,509               33%     36%

Table 5.4.2: Predictive results using the SSB classifier for 6 different datasets.

5.4.4 Execution and CPU Time Comparison

Now that I have shown that the predictive performance of BioSMXpress is comparable to that of BioSM, this section discusses their time performance. I used a high-end cluster (http://becat.uconn.edu/hpc/) hosted by the School of Engineering and the Taylor L. Booth Engineering Center for Advanced Technology (BECAT) at the University of Connecticut to run and compare the performance of both BioSM and BioSMXpress. I ran both classifiers with a set of randomly generated datasets as candidate datasets for prediction. Each dataset was evaluated by both predictive models under the same circumstances (same number of cluster nodes, threads, same scaffolds list, etc.)

and the time for each model was recorded. I generated multiple candidate datasets with 50, 100, 500, 1,000, 5,000, 10,000, and 50,000 compounds. Each dataset was composed of compounds randomly selected from a pool of all the independent datasets used in this study, as described in the Methods section. To ensure that the only factor measured is the number of compounds in a set, regardless of the nature of the compounds included, I generated three random datasets for each size. So, I ran 3 groups each containing 50 randomly selected compounds, 3 groups of 100 compounds, and so on, and then reported the average response time for each group size. Figure 5.4.3a displays the average run time of BioSM versus that of BioSMXpress when annotating datasets of 50 to 50,000 compounds. BioSMXpress outperformed BioSM across all datasets. BioSMXpress was 10 times faster than BioSM when analyzing 10,000 compounds and, across all datasets examined, provided an average 8-fold speedup over BioSM. Another interesting observation is that BioSM takes an average of 6 minutes and 51 seconds to evaluate 1,000 compounds, while BioSMXpress takes an average of 5 minutes and 8 seconds to evaluate 10,000 compounds (10 times more compounds in less time).

I was also interested in comparing the CPU time utilized by BioSM and BioSMXpress. CPU time is the amount of time for which a central processing unit (CPU) was used for processing instructions of a computer program or operating system, as opposed to, for example, waiting for input/output operations. Figure 5.4.3b shows the average CPU time utilized by each of the classifiers when making predictions for each dataset size (50 – 50,000 compounds). Similar to response time, BioSMXpress outperformed BioSM, using on average 7 times less CPU time.
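The run-time observation quoted in this section (6 minutes 51 seconds for BioSM on 1,000 compounds versus 5 minutes 8 seconds for BioSMXpress on 10,000 compounds) can be turned into a rough per-compound figure. The short calculation below is only a back-of-the-envelope reading of those two numbers; the variable names are illustrative, and because it mixes two different dataset sizes it is not the same quantity as the like-for-like speedups read off Figure 5.4.3a.

```python
def to_seconds(minutes, seconds):
    """Convert a mm:ss duration to seconds."""
    return 60 * minutes + seconds

# Average wall-clock times quoted in the text.
biosm_seconds = to_seconds(6, 51)     # BioSM, 1,000 compounds
xpress_seconds = to_seconds(5, 8)     # BioSMXpress, 10,000 compounds

# Rough per-compound cost implied by each measurement.
biosm_per_compound = biosm_seconds / 1_000
xpress_per_compound = xpress_seconds / 10_000

# Ratio of the two per-compound costs (cross-size comparison, so indicative only).
per_compound_ratio = biosm_per_compound / xpress_per_compound
```

The two durations come to 411 s and 308 s, so the implied per-compound cost differs by a factor of roughly 13 between these two particular measurements.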

Figure 5.4.3: (a) Average runtime (in hh:mm:ss) needed to make predictions using BioSM versus BioSMXpress. (b) Average CPU time (in seconds) for BioSM and BioSMXpress when annotating sets of compounds of different sizes.

This drastic difference in run time and CPU time can be explained by the number of scaffold comparisons each predictive model requires to make a prediction about a given candidate compound. BioSM needs to compare the candidate structure with every structure in the scaffolds list, accumulating scores, and only then compares the accumulated score with a threshold to make a prediction. BioSMXpress selects and sorts the scaffolds that would produce the highest match scores, based on the thresholds, if they were to match the candidate. Only a portion of the scaffolds is added to the list that BioSMXpress actually uses as the scaffolds list, and once a match is found the candidate is predicted to be biological with no further computation needed.
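The contrast between the two search strategies described above can be illustrated with a small sketch. Everything here is hypothetical: the function names, the `eligible`, `promise`, `score` and `matches` callables, and the toy string-based matching are stand-ins chosen for the example and do not reproduce the actual BioSM or BioSMXpress scoring.

```python
def exhaustive_predict(candidate, scaffolds, score, threshold):
    """Exhaustive sketch: score against every scaffold, then threshold."""
    total = 0.0
    comparisons = 0
    for scaffold in scaffolds:
        total += score(candidate, scaffold)
        comparisons += 1
    return total >= threshold, comparisons

def early_exit_predict(candidate, scaffolds, eligible, promise, matches):
    """Early-exit sketch: keep only eligible scaffolds, try the most
    promising ones first, and stop at the first match."""
    shortlist = [s for s in scaffolds if eligible(candidate, s)]
    shortlist.sort(key=lambda s: promise(candidate, s), reverse=True)
    comparisons = 0
    for scaffold in shortlist:
        comparisons += 1
        if matches(candidate, scaffold):
            return True, comparisons
    return False, comparisons

# Toy stand-ins: strings instead of molecules, containment instead of matching.
def toy_matches(a, b):
    return a in b or b in a

def toy_score(a, b):
    return 1.0 if toy_matches(a, b) else 0.0

def toy_eligible(a, b):
    return abs(len(a) - len(b)) <= 2      # drop scaffolds of very different size

def toy_promise(a, b):
    return -abs(len(a) - len(b))          # closer in size = more promising

toy_scaffolds = ["N", "CC", "CCOC", "NNN", "CCCCCCCC"]
slow = exhaustive_predict("CCO", toy_scaffolds, toy_score, threshold=1.0)
fast = early_exit_predict("CCO", toy_scaffolds, toy_eligible, toy_promise, toy_matches)
```

Both calls reach the same positive prediction for the toy candidate, but the early-exit variant performs fewer comparisons, which is the effect the paragraph above attributes the speedup to.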

5.5 Conclusions

In this chapter, I described the development and validation of BioSMXpress, an efficient supervised cheminformatics tool that uses endogenous mammalian biochemical scaffolds to predict whether a candidate chemical structure is biochemical or synthetic. BioSMXpress is on average 8 times faster than BioSM without compromising the accuracy of the predictions. BioSMXpress correctly classified 94% of 1,421 biochemical compounds in a set of leave-one-out cross validation experiments. Thus, BioSMXpress may be useful for efficiently searching large chemical databases in metabolomics applications where both the number of candidates and the number of potential false positives are extremely large.

Chapter 6

Classifying Small Molecules into Metabolic Pathways

6.1 Introduction

Metabolic pathways are characterized as a series of chemical reactions catalyzed by enzymes connected in a way such that the reactants of one reaction are the products of the previous one. Understanding these pathways is essential to understanding the machinery of life [75]. The reconstruction of the metabolic network of an organism based on its genome sequence is a key challenge in systems biology [97]. Predicting which metabolic pathways are present in the organism based on the annotated genome of the organism is a possible strategy to address this issue [76]. Such metabolic pathways are selected from a reference database of known pathways. Other strategies provide some data mining capabilities to correlate protein annotations to pathway templates so that organism-specific pathways can be derived. Some of the commonly

used tools include PathComp [98], Pathway Analyst [75], Rahnuma [99], Pathway Tools [100], UM-BBD Pathway Prediction System [101], and PathPred [102]. Pathway prediction can involve predicting pathways that were previously known in other organisms, or predicting novel pathways that have not been previously observed (pathway discovery) [76]. The work in this chapter focuses on methodologies that do the former: predicting pathways from a curated reference database. A number of databases collecting biological pathway information are available, allowing broader exploration of metabolism. One of these is the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [73]. KEGG contains a collection of manually drawn pathway maps representing molecular interaction and reaction networks. Eleven major metabolic pathway classes that are strongly associated with the biological functions of compounds are defined by KEGG [103]: Carbohydrate Metabolism, Energy Metabolism, Lipid Metabolism, Nucleotide Metabolism, Amino Acid Metabolism, Metabolism of Other Amino Acids, Glycan Biosynthesis and Metabolism, Metabolism of Cofactors and Vitamins, Metabolism of Terpenoids and Polyketides, Biosynthesis of Other Secondary Metabolites, and Xenobiotics Biodegradation and Metabolism, each of which contains several individual pathways. Some compounds serve as intermediates in multiple pathways and appear on multiple KEGG pathways. New metabolic experiments combined with computational methods are likely to reveal the structures of new metabolites that do not belong to any known metabolic pathway. Placing these molecules in the context of known metabolic pathways would aid in understanding their biological functions. It would also shed light on the presence of as yet unidentified gene products that may be catalyzing relevant reactions [93].
Thus, the aim of this work is to develop and assess a model to predict the pathway classes and individual pathways that a given query molecule would lie closest to. Early attempts to correlate the similarities between metabolites with metabolic pathways were made by Nobeli et al. [104]. Further investigations by Cai et al. [105] used the functional group composition of compounds to represent small molecules. They proposed a Nearest Neighbor Algorithm to map small chemical molecules to the metabolic pathway class to which they likely belong. After excluding all compounds that belonged to two or more metabolic pathway classes, a set of 2,764 compounds from 11 major classes of metabolic pathways, obtained from KEGG, was selected for the study. An overall prediction rate of 73.3% was observed. Since the authors focused on the single-label classification problem, their methods could not be used to deal with "multi-function" compounds, that is, compounds that belong to more than one pathway class. Macchiarulo et al. [93] used 32 quantitative structure activity relationship descriptors to estimate the proximity of any small molecule to a given pathway class. When classifying 681 small molecules into 7 KEGG pathway classes using a random forest classifier [84], they reported an average Matthews correlation coefficient of 0.73. They also expanded their investigation to predict the individual pathways to which these small molecules would lie close. When classifying those metabolites into 52 individual KEGG pathways, they were able to predict the correct pathways for only 31% of the molecules. A multi-target model for predicting which of the 11 KEGG metabolic pathway classes a query compound may be involved in was proposed by Hu et al. [103].
The model was built upon information on chemical-chemical interactions retrieved from STITCH [106], a database containing known and predicted interactions of chemicals and proteins derived from experiments, literature, and other databases. In their model, an interaction unit consists of two chemicals and their interaction weight (confidence score) representing the probability that the interaction occurs between the two chemicals concerned. The overall success rate obtained by the method in a 5-fold cross-validation test on a benchmark dataset consisting of 3,137 compounds was 77.97%. Gao et al. [107] extended Hu et al.'s work by integrating interactions among chemicals and proteins. Their work included not only chemical-chemical interactions but also protein-protein interactions and chemical-protein interactions, to predict metabolic pathways in which small molecules and enzymes of yeast participate. The data concerning protein-protein interactions was retrieved from STRING [108]. They constructed a hybrid interaction network having small molecules and enzymes as its nodes, with an edge between two nodes if and only if there is data showing that they can interact with one another. Results of the jackknife test, a leave-one-out cross validation method, show that the first order prediction accuracy was 77.12% for 3,348 small molecules and 92.05% for the 655 enzymes, which does not reflect any improvement over Hu et al.'s approach. One of the major limitations of the approaches proposed in [103, 107] is their dependency on interaction information. Hu et al. reported that they were unable to process 1,229 compounds due to the lack of interaction information with other compounds within the dataset they were using. In this chapter, I introduce TrackSM, a bioinformatics tool designed to predict the metabolic pathway class as well as the individual pathways with which small molecules might be associated, based only on their molecular structures. Small molecules within a typical pathway tend to look similar, as they are related to each other through stepwise chemical transformations. TrackSM is guided by structural similarity information acquired from a set of compounds, hereafter referred to as scaffolds. In other words, TrackSM represents pathways using the scaffolds they comprise.

6.2 Computational Algorithm

In TrackSM, a query compound is predicted to belong to a metabolic pathway based on how similar its chemical structure is to the structures associated with that pathway. TrackSM relies on the SMSD Toolkit [92] to carry out molecular similarity searches. In this chapter, I define two ways of matching molecular structures: Match100 and Match90. In Match100, two molecular structures $r$ and $q$ are considered a match if and only if the smaller structure, $r$, is an exact substructure (atom and bond types) of the larger structure, $q$. Match90 considers $r$ and $q$ similar if at least 90% of the atoms of the smaller structure match the larger structure. Regardless of the matching method, when $r$ and $q$ match ($r \sim q$; $r$ is a substructure of $q$), a similarity score (as defined in Equation 4.2.1) is calculated. Clearly, a candidate molecule may match more than one scaffold structure, some as substructures and others as superstructures, resulting in several similarity scores being computed for each candidate compound.

As previously mentioned, molecules within a typical pathway tend to have similar structures since they are related to each other through stepwise chemical transformations. Our hypothesis is that, for a query compound $c_q$, the larger the number of compounds within a given metabolic pathway that are structurally similar to $c_q$, the more likely $c_q$ is to participate in that pathway. Also, if at least one of those scaffolds matches $c_q$ as a substructure and another as a superstructure, this may provide further evidence that $c_q$ belongs to that pathway. TrackSM identifies the biological function of a molecular compound in two steps. It first predicts a metabolic class to which the molecule is likely to belong, based on information from structurally similar scaffolds. It then uses the predicted metabolic class together with scaffold similarity information to predict an individual pathway in which the molecule is likely to participate. Figure 6.2.1 shows a general overview of the TrackSM prediction process. I propose an algorithm to predict the biological function of a small compound by associating it with a metabolic class and an individual metabolic pathway based on its molecular structure alone. I first define some notation and then explain the computational model behind TrackSM.

Figure 6.2.1: Schematic of TrackSM's predictive process.
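The difference between Match100 and Match90 can be illustrated with a small decision function. This is a simplified sketch under stated assumptions: the number of atoms of the smaller structure that map onto the larger one (`common_atoms`) is taken as an input that a substructure search tool would supply, the function names are hypothetical, and bond-type checks (which Match100 also requires) are not modelled.

```python
def atoms_match(common_atoms, size_a, size_b, min_percent):
    """True when at least `min_percent` of the smaller structure's atoms
    are matched in the larger structure.

    common_atoms: matched atom count (assumed to come from an external
                  substructure search; not computed here).
    size_a, size_b: atom counts of the two structures being compared.
    Integer arithmetic avoids floating-point edge cases at the cutoff.
    """
    smaller = min(size_a, size_b)
    return 100 * common_atoms >= min_percent * smaller

def match100(common_atoms, size_a, size_b):
    """Every atom of the smaller structure must be matched."""
    return atoms_match(common_atoms, size_a, size_b, 100)

def match90(common_atoms, size_a, size_b):
    """At least 90% of the smaller structure's atoms must be matched."""
    return atoms_match(common_atoms, size_a, size_b, 90)

# Example: a 10-atom scaffold with 9 atoms mapped onto a 25-atom query.
nine_of_ten_100 = match100(9, 10, 25)
nine_of_ten_90 = match90(9, 10, 25)
```

In this example the pair is rejected by the exact criterion but accepted by the relaxed one, which is the kind of near-miss that Match90 is meant to recover.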

Let $c_q$ be the molecular structure of a query compound, $S = \{s_1, s_2, \dots, s_n\}$ be a set of $n$ small compounds (scaffolds), $M = \{M_1, M_2, \dots, M_l\}$ be a set of metabolic pathway classes, and $P = \{P_1, P_2, \dots, P_m\}$ be the set of individual pathways. Let $s_x \to M_y$ indicate that the scaffold $s_x$ belongs to the metabolic pathway class $M_y$, and let $s_x \to P_z$ indicate that $s_x$ belongs to the individual metabolic pathway $P_z$. Let $CL(c_q)$ represent the list of candidate pathway classes with which $c_q$ is predicted to be associated. Similarly, let $PL(c_q)$ represent the list of candidate individual pathways with which $c_q$ is predicted to be associated.

6.2.1 Pathway Classes Prediction Method

TrackSM predicts the biological function of a query compound $c_q$ in two steps. First, it predicts the metabolic pathway class that $c_q$ belongs to. This step is carried out in the following manner: $c_q$ is matched against all the scaffolds in $S$. If a scaffold $s_x$ is found to be a substructure of $c_q$, then $s_x$ is added to set $S_b$. If a scaffold $s_y$ is found to be a superstructure of $c_q$, then $s_y$ is added to set $S_p$. Hence, let $\bar{S}$ denote the set of scaffolds that structurally match $c_q$, such that $\bar{S} = S_b \cup S_p$. The list of candidate metabolic classes $CL(c_q)$ is assembled such that it represents all the metabolic classes associated with the scaffolds in $\bar{S}$. For each candidate pathway class $M'$ associated with at least one compound in $\bar{S}$, a vector $V(c_q, M') = [\mathit{Ss}, \mathit{Sc}, \mathit{Co}]$ is populated. Vector $V(c_q, M')$ represents the confidence that class $M'$ is the pathway class to which $c_q$ belongs. Let $\mathit{Ss}$ be a binary value representing the existence of at least one substructure compound and at least one superstructure compound that belong to $M'$, i.e.

\[
\mathit{Ss} =
\begin{cases}
1, & \text{if } \exists\, s_x \in S_b \text{ and } s_y \in S_p \text{ such that } s_x \to M' \text{ and } s_y \to M' \\
0, & \text{otherwise}
\end{cases}
\]

Let $\mathit{Sc}$ represent the highest SimScore (defined by Equation 4.2.1) found between $c_q$ and all the matching compounds in $\bar{S}$ that belong to $M'$:

\[
\mathit{Sc} = \max_{s_j \in \bar{S},\ s_j \to M'} \mathrm{SimScore}(c_q, s_j)
\]

Finally, $\mathit{Co}$ is defined as the number of scaffolds in $\bar{S}$ that belong to pathway class $M'$:

\[
\mathit{Co} = \mathrm{count}(s_j \in \bar{S},\ s_j \to M')
\]

All pathway classes in $CL(c_q)$ are then ranked as discussed in Section 6.4.1, and the class with the highest scores is predicted to be $PC$, the class with which $c_q$ is associated.
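A minimal sketch of how the confidence vector [Ss, Sc, Co] described above could be assembled for every candidate class is given below. The data layout is an assumption made for illustration: matching scaffolds are supplied as (scaffold id, relation, similarity score) tuples, with the relation string "sub" or "super" marking whether the scaffold matched the query as a substructure or a superstructure, and the class codes and scores are toy values.

```python
from collections import defaultdict

def class_vectors(matches, scaffold_classes):
    """Build {class code: (Ss, Sc, Co)} for one query compound.

    matches: list of (scaffold_id, relation, sim_score) for the scaffolds
             that structurally match the query (hypothetical layout).
    scaffold_classes: dict mapping scaffold_id to a set of class codes.
    """
    relations = defaultdict(set)   # relations seen per class
    best = defaultdict(float)      # highest similarity score per class
    count = defaultdict(int)       # number of matching scaffolds per class
    for scaffold_id, relation, sim_score in matches:
        for cls in scaffold_classes[scaffold_id]:
            relations[cls].add(relation)
            best[cls] = max(best[cls], sim_score)
            count[cls] += 1
    return {
        cls: (int(relations[cls] == {"sub", "super"}), best[cls], count[cls])
        for cls in count
    }

# Toy input: three matching scaffolds, one of them in two classes.
toy_matches = [("s1", "sub", 0.6), ("s2", "super", 0.8), ("s3", "sub", 0.9)]
toy_scaffold_classes = {"s1": {"CM"}, "s2": {"CM", "LM"}, "s3": {"EM"}}
toy_vectors = class_vectors(toy_matches, toy_scaffold_classes)
```

Here only class CM has both a substructure and a superstructure match, so it is the only class with Ss = 1, even though class EM holds the single highest similarity score.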

6.2.2 Individual Pathways Prediction Method

In the second step, TrackSM predicts one or more individual pathways to which the query compound might belong. List $PL(c_q)$ is populated and ranked via a method very similar to that used to populate $CL(c_q)$, except that it references individual pathways instead of pathway classes. For each candidate individual pathway $P_r$ associated with at least one compound in $\bar{S}$, a vector $V(c_q, P_r) = [\mathit{Ss}, \mathit{Sc}, \mathit{Co}]$ is populated. Vector $V(c_q, P_r)$ represents the confidence that pathway $P_r$ is the pathway to which $c_q$ belongs. $\mathit{Ss}$ is a binary value representing the existence of at least one substructure compound and at least one superstructure compound that belong to $P_r$, i.e.

\[
\mathit{Ss} =
\begin{cases}
1, & \text{if } \exists\, s_x \in S_b \text{ and } s_y \in S_p \text{ such that } s_x \to P_r \text{ and } s_y \to P_r \\
0, & \text{otherwise}
\end{cases}
\]

Let $\mathit{Sc}$ represent the highest SimScore (defined by Equation 4.2.1) found between $c_q$ and all the matching compounds in $\bar{S}$ that belong to $P_r$:

\[
\mathit{Sc} = \max_{s_j \in \bar{S},\ s_j \to P_r} \mathrm{SimScore}(c_q, s_j)
\]

Finally, $\mathit{Co}$ is defined as the number of scaffolds in $\bar{S}$ that belong to pathway $P_r$:

\[
\mathit{Co} = \mathrm{count}(s_j \in \bar{S},\ s_j \to P_r)
\]

Specific to predicting individual pathways, we have developed an additional method referred to as Match90ClassBased. In this method, TrackSM uses the pathway class $PC$ predicted in the first step to further guide its search for individual pathway associations for $c_q$. Hence, $PL(c_q)$ is only populated with individual pathways that are associated with scaffolds that structurally match $c_q$ and belong to the predicted pathway class $PC$; that is, any scaffold in $\bar{S}$ must belong to the predicted class $PC$. All the calculations following this step are the same as in the previously explained method.
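The class-based restriction used by Match90ClassBased can be sketched as a filter applied before the candidate pathway list is assembled. The helper names, the tuple layout of `matches`, and the toy pathway labels P1 to P3 are hypothetical and serve only to illustrate the filtering step described above.

```python
def restrict_to_class(matches, scaffold_classes, predicted_class):
    """Keep only the matching scaffolds that belong to the predicted class."""
    return [
        match for match in matches
        if predicted_class in scaffold_classes.get(match[0], set())
    ]

def candidate_pathways(matches, scaffold_pathways):
    """Collect every individual pathway linked to a remaining scaffold."""
    found = set()
    for scaffold_id, _relation, _sim_score in matches:
        found |= scaffold_pathways.get(scaffold_id, set())
    return found

# Toy data: (scaffold id, relation, similarity score); labels are made up.
toy_matches = [("s1", "sub", 0.6), ("s2", "super", 0.8), ("s3", "sub", 0.9)]
toy_scaffold_classes = {"s1": {"CM"}, "s2": {"CM", "LM"}, "s3": {"EM"}}
toy_scaffold_pathways = {"s1": {"P1"}, "s2": {"P1", "P2"}, "s3": {"P3"}}

kept = restrict_to_class(toy_matches, toy_scaffold_classes, "CM")
unrestricted = candidate_pathways(toy_matches, toy_scaffold_pathways)
class_based = candidate_pathways(kept, toy_scaffold_pathways)
```

With the predicted class set to CM, scaffold s3 is discarded before pathways are collected, so its pathway no longer competes in the ranking that follows.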

6.3 Dataset

Pathway information for the 3,190 small molecules of the dataset used by Gao et al. [107], as well as their molecular structures, was downloaded (January 2013) from the KEGG database. The distribution of those compounds among the pathway classes and the number of individual pathways associated with each class are shown in Table 6.3.1. Some compounds are associated with more than one pathway class. Others are associated with more than one individual pathway within the same class. This is evident in Table 6.3.1, where the total number of compounds summed over all classes is 4,404, greater than the actual number of compounds (3,190). Figure 6.4.1a shows that 90% of the 3,190 scaffolds used in this study are associated with only one pathway class, while 7% are associated with two classes, and only 3% are associated with 3 or more pathway classes. Figure 6.4.1b shows the distribution of scaffolds based on their association with individual pathways. Of the 3,190 scaffolds, 85% are associated with one individual pathway. This means that 5% of the compounds associated with one pathway class belong to more than one individual pathway within that class. Nine percent are associated with 2 individual pathways and only 6% are associated with 3 or more individual pathways. The mass distribution of the molecules in the scaffolds list used in this work is displayed in Figure 6.4.1c. This figure shows that the majority of the molecules (76%) fall in the mass range 116 – 460 Daltons.

6.4 Results and Discussion

6.4.1 Ranking Method Formulation and Selection

No.  Pathway Class Name                            Class Code  Individual Pathways  Compounds
1    Carbohydrate metabolism                       CM          15                   575
2    Energy metabolism                             EM          7                    193
3    Lipid metabolism                              LM          16                   444
4    Nucleotide metabolism                         NM          2                    137
5    Amino acid metabolism                         ACM         13                   580
6    Metabolism of other amino acids               MOAA        9                    170
7    Glycan biosynthesis and metabolism            GBM         5                    48
8    Metabolism of cofactors and vitamins          MCV         12                   365
9    Metabolism of terpenoids and polyketides      MTP         18                   541
10   Biosynthesis of other secondary metabolites   BOSM        20                   555
11   Xenobiotics biodegradation and metabolism     XBM         20                   796
     Total                                                     137                  4404

Table 6.3.1: Distribution of 3,190 KEGG compounds among the 11 KEGG metabolic pathway classes.

I formalized six possible ways of ranking the candidate classes in $CL(c_q)$ and the candidate pathways in $PL(c_q)$, referred to as SsScCo, ScSsCo, ScCoSs, CoScSs, CoSsSc, and SsCoSc. SsScCo indicates sorting the candidate pathway classes in $CL(c_q)$ by the Ss value (in descending order), then breaking ties by the highest Sc, followed by the highest Co. Table 6.4.1 shows the sensitivity obtained when a set of LOOCV experiments predicting metabolic pathway classes was carried out using the dataset of 3,190 KEGG compounds. The results from SsScCo, ScCoSs, and ScSsCo are comparable and are much better than those obtained by SsCoSc, CoSsSc, and CoScSs. I carried out an ANOVA to check for statistically significant differences among the top 3 ranking methods. The ANOVA indicated no statistically significant difference (P > 0.05). However, the accuracy of SsScCo was consistently higher than that of the other methods, and it was therefore selected as the ranking method for TrackSM.
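The six ranking methods named above can be read as six lexicographic sort orders over the same three-component vector. The sketch below illustrates that reading; the `ORDERS` table, the `rank` helper and the toy vectors are assumptions introduced for the example, with each vector stored as (Ss, Sc, Co).

```python
# Each method name lists the components in order of priority.
# Indices refer to positions in the (Ss, Sc, Co) tuple.
ORDERS = {
    "SsScCo": (0, 1, 2),
    "ScSsCo": (1, 0, 2),
    "ScCoSs": (1, 2, 0),
    "CoScSs": (2, 1, 0),
    "CoSsSc": (2, 0, 1),
    "SsCoSc": (0, 2, 1),
}

def rank(vectors, method):
    """Return candidate labels sorted best-first under the given method.

    vectors: dict mapping a class or pathway label to (Ss, Sc, Co).
    Ties on the first component are broken by the second, then the third.
    """
    order = ORDERS[method]
    return sorted(
        vectors,
        key=lambda label: tuple(vectors[label][i] for i in order),
        reverse=True,
    )

# Toy vectors (hypothetical values).
toy_vectors = {
    "CM": (1, 0.8, 2),
    "LM": (0, 0.8, 1),
    "EM": (0, 0.9, 1),
    "XBM": (0, 0.5, 5),
}
by_ss_first = rank(toy_vectors, "SsScCo")
by_sc_first = rank(toy_vectors, "ScSsCo")
by_co_first = rank(toy_vectors, "CoScSs")
```

The toy example shows how the choice of leading component changes the top prediction: putting Co first promotes a class that merely has many matching scaffolds, which is one plausible reading of why the Co-first methods performed worst in Table 6.4.1.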

Figure 6.4.1: Distribution of 3,190 scaffolds based on (a) the number of classes they belong to and (b) the number of individual pathways they belong to. Panel (c) shows the mass distribution of 3,190 scaffolds into 8 mass bins ranging from 0 to 922 Daltons.

Classes     Ranking Method
Predicted   SsScCo   ScCoSs   ScSsCo   SsCoSc   CoSsSc   CoScSs
1           84.92%   84.73%   83.76%   64.70%   50.47%   50.41%
2           92.82%   92.76%   92.23%   81.38%   73.17%   73.13%
3           95.39%   95.27%   95.14%   89.78%   86.36%   86.36%

Table 6.4.1: SENS of each ranking method when TrackSM predicts 1, 2 or 3 classes per candidate compound.

6.4.2 Performance of the Predictive Method for Metabolic Pathway Classes

Here, I evaluated the predictive method with a set of LOOCV experiments using the dataset of 3,190 KEGG compounds. The 1st and 2nd orders of predictions made by Gao et al. [107], as well as those of TrackSM using both Match100 and Match90 with the SsScCo ranking method, are shown in Figure 6.4.2. TrackSM predicted at least one correct pathway class for 85% of the compounds using Match90 versus 79% using Match100. Both methods improve on the results reported by Gao et al. (77% [107]). In fact, TrackSM using Match90 and predicting only one class per query compound had a 4% improvement in SENS over Gao et al.'s method predicting 2 classes per compound. This also indicates that TrackSM has a better PPV than Gao et al.'s method. When TrackSM using Match90 made two class predictions per candidate compound, 93% of the 3,190 compounds had at least one correct class prediction. Additionally, I distributed the 3,190 compounds into 8 mass bins. Results from both prediction methods (Match100 and Match90) were compared in each bin. Figure 6.4.3a displays the 1st order of class predictions made by TrackSM using Match100 versus Match90. It indicates that Match90 outperforms Match100 across all the bins.
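The "order of prediction" figures reported in this section count a compound as correctly predicted when at least one of its top k ranked classes is among its true classes. A minimal sketch of that measure follows; the function name and the toy rankings are illustrative assumptions, not data from the study.

```python
def top_k_sensitivity(ranked_predictions, true_labels, k):
    """Fraction of compounds with at least one correct label in the top k.

    ranked_predictions: one best-first list of predicted labels per compound.
    true_labels: one set of true labels per compound (a compound may belong
                 to several classes).
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if set(ranked[:k]) & truth
    )
    return hits / len(true_labels)

# Toy example with four compounds (hypothetical labels).
toy_ranked = [["CM", "LM"], ["EM", "CM"], ["LM", "XBM"], ["NM"]]
toy_truth = [{"CM"}, {"CM", "LM"}, {"EM"}, {"NM"}]
first_order = top_k_sensitivity(toy_ranked, toy_truth, k=1)
second_order = top_k_sensitivity(toy_ranked, toy_truth, k=2)
```

Because the top-2 list always contains the top-1 prediction, the measure can only stay the same or increase as k grows, which matches the pattern in Table 6.4.1.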

Figure 6.4.2: LOOCV prediction accuracy of the 1st and 2nd orders of predictions made by Gao et al’s method, TrackSM with Match100, and TrackSM with Match90 when predicting metabolic pathway classes.

Figure 6.4.3: Breakdown of the 1st order of Class predictions made by Match100 versus those made by Match90 for 3,190 compounds based on molecular mass from a set of LOOCV experiments.

Figure 6.4.3b plots the SENS in each bin when Match90 was applied. The plot shows that TrackSM is capable of predicting the metabolic class of a given compound in the mass range 231 – 460 Da with 93% accuracy. It also shows that bins 3 through 7 have an average SENS of 90%, while the average SENS for bin 1 and bin 8 is 70%. We think that predictions at both ends of the mass range are poorer because, as compounds get very large or very small, there is a higher chance that they match many scaffolds as substructures only or as superstructures only, respectively. When that happens, many noise matches are introduced, decreasing the sensitivity. Figure 6.4.4 shows the distribution of the 1st order of Match90 class predictions based on a compound's molecular mass and the number of pathway classes it is associated with. The first bar in each subsection shows the overall predictions per number of class associations. Match90 can predict at least one class to which a compound might belong with a SENS of 85%, 80%, 83%, and 85% for compounds associated with one, two, three, and four or more classes, respectively. This suggests that the number of class associations a compound has does not strongly affect the prediction quality of TrackSM. It is clear from Figure 6.4.4 that the predictions for compounds that belong to more than one metabolic class do not follow the same distribution (based on mass) as those for compounds associated with only one class. This is likely because only 10% of the compounds used in this analysis are associated with more than one class, so those compounds are not sufficiently represented to draw a firm conclusion.

Figure 6.4.4: Accuracy of mass bins per number of class associations for TrackSM when using Match90 to predict metabolic pathway classes.

To further analyze our results, we explored the prediction distribution based on class associations. Figure 6.4.5 shows the distribution of compounds among the 11 KEGG metabolic classes based on the 1st order of prediction produced by Match100 versus Match90. In this analysis, only the 2,874 compounds associated with exactly one class were included. Match100 had a 4% improvement over Match90 when associating compounds with class EM. Both methods performed equivalently when associating compounds with classes CM, MOAA, and XBM. Match90 outperformed Match100 in the other 7 classes, with the largest improvement (18%) in class MTP. Also, Match90 correctly associated 90% of the compounds belonging to six metabolic classes, specifically BOSM, CM, LM, MTP, NM, and XBM. It was also noted that Match90 could correctly associate only 44% of the compounds belonging to class EM. This is likely due to the small size of class EM, with only 45 scaffold associations.

Figure 6.4.5: Distribution of class predictions when using Match100 versus Match90 based on the query compound’s class association.

6.4.3 Performance of the Predictive Method for Individual Metabolic Pathways

In this section, I investigate using TrackSM to predict the individual pathways to which a candidate compound might belong. I show results from applying Match100 and Match90, as well as a method exclusive to pathway predictions referred to as Match90ClassBased. Figure 6.4.6 shows the SENS of the 1st, 2nd, and 3rd orders of prediction when using Match100, Match90, and Match90ClassBased. Match90ClassBased outperformed the other two methods. Specifically, for the 1st order of individual pathway prediction, Match90ClassBased had 80% accuracy, while Match100 had only 66% and Match90 had 69%. When making 2 predictions per query compound, Match90ClassBased predicted at least one correct individual pathway for 88% of the compounds.

Figure 6.4.6: Prediction accuracy of the 1st, 2nd and 3rd orders of predictions made by TrackSM with Match100, Match90 and Match90ClassBased when predicting individual metabolic pathways.

Finally, we distributed the 3,190 compounds into 8 mass bins and compared the results of two prediction methods (Match100 and Match90ClassBased) in each bin. Figure 6.4.7a displays the 1st order of individual pathway predictions made by TrackSM using Match100 versus Match90ClassBased. Match90ClassBased outperforms Match100 in every bin except bin 1 (0 – 115 Da). Figure 6.4.7b plots the SENS in each bin when Match90ClassBased was applied to predict one individual pathway per query compound. The plot shows that TrackSM is capable of predicting the individual pathway of a given compound in the mass ranges 346 – 460 Da and 691 – 805 Da with 94% accuracy. Bins 4 through 7 have an average SENS of 92%. As with pathway class predictions, individual pathway predictions for compounds at both ends of the mass range have noticeably lower sensitivity than those in the other bins.
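The per-bin comparison can be sketched as follows. The 115 Da bin width is inferred from the ranges reported above (0 – 115, 346 – 460, and 691 – 805 Da). Treating the eighth bin as open-ended, and the function names themselves, are assumptions made for illustration only.

```python
import math
from collections import defaultdict

BIN_WIDTH_DA = 115  # inferred from the reported bin ranges; an assumption
NUM_BINS = 8


def mass_bin(mass_da):
    """Return the 1-based mass bin index for a molecular mass in daltons."""
    if mass_da < 0:
        raise ValueError("mass must be non-negative")
    index = math.ceil(mass_da / BIN_WIDTH_DA)
    # Mass 0 falls in bin 1; anything above bin 7 is placed in bin 8 (assumption).
    return min(max(index, 1), NUM_BINS)


def sensitivity_per_bin(records):
    """Compute the fraction of correct predictions in each mass bin.

    records: iterable of (mass_da, is_correct) pairs.
    Returns a dict mapping bin index -> fraction correct, for non-empty bins.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for mass_da, is_correct in records:
        b = mass_bin(mass_da)
        totals[b] += 1
        hits[b] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Under these assumptions, a compound of 346 Da or 460 Da lands in bin 4 and one of 691 Da or 805 Da lands in bin 7, matching the ranges quoted in the text.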

6.5 Conclusions

In this chapter, I presented TrackSM, a bioinformatics tool designed to predict the metabolic pathway classes, as well as the individual pathways, with which small molecules might be associated, based only on their molecular structures. TrackSM can place molecules in the context of metabolic pathways because it links potentially unknown biochemicals to matched substructure and superstructure scaffolds for which metabolic pathways are known. Validation experiments show that TrackSM is capable of associating 93% of the structures with their correct pathway classes as defined by KEGG and 88% of them with the correct individual KEGG pathway. These results suggest that TrackSM may be a valuable tool to aid in recognizing the biochemical functions of small molecules.

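Figure 6.4.7: Percentage of compounds with at least one correct individual pathway prediction when compounds are distributed by mass amongst 8 mass bins. (a) 1st order of prediction with Match100 versus Match90ClassBased. (b) SENS per bin with Match90ClassBased.

Chapter 7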

Conclusions and Recommendations

7.1 Conclusions

Although there has been a long-standing interest in metabolic profiling, only recently have technologies emerged that enable the global analysis of metabolites at a systems level, comparable to its omic predecessors. Unlike genomics, transcriptomics and proteomics, however, metabolomics provides a tool for measuring biochemical activity directly by monitoring the substrates and products transformed during cellular metabolism. Untargeted profiling of these chemical transformations at a global level serves as a phenotypic readout that can be used effectively in diagnosing pathologies, identifying therapeutic targets of disease and investigating the mechanisms of fundamental biological processes.

In this thesis, I described the development and validation of BioSM, a novel supervised classifier that uses endogenous mammalian biochemical scaffolds to predict whether a candidate chemical structure is biochemical or synthetic. BioSM was able to correctly classify 96% of 3,750 biochemical compounds in a leave-one-out cross validation experiment. In addition, our results suggest that approximately 13% of PubChem compounds are mammalian biochemicals. Next, I introduced BioSMXpress, a tool designed to enhance the performance of BioSM. I showed that BioSMXpress is, on average, 8 times faster than BioSM without compromising the quality of the predictions made. BioSMXpress will be an extremely useful tool in the timely identification of unknown biochemical structures in metabolomics. It may be especially useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large. BioSMXpress can be easily tailored to specific application domains. For example, if one is interested in identifying unknown chemical structures in plant samples, the current scaffolds list can be supplemented with known plant biochemical structures and the NBS list can be modified appropriately.

Finally, I presented TrackSM, a bioinformatics tool designed to predict the metabolic pathway classes, as well as the individual pathways, with which small molecules might be associated, based only on their molecular structures. TrackSM can place molecules in the context of metabolic pathways because it links potentially unknown biochemicals to matched substructure and superstructure scaffolds for which metabolic pathways are known. Validation experiments show that TrackSM is capable of associating 93% of the structures with their correct pathway classes as defined by KEGG and 88% of them with the correct individual KEGG pathway. These results suggest that TrackSM may be a valuable tool to aid in recognizing the biochemical functions of small molecules.

7.2 Recommendations for future work

Based on the work presented in this thesis, I present the following recommendations for future work:

∗ Include compounds with halogens. Currently, BioSMXpress does not allow annotation of candidate compounds with halogens (i.e., F, Cl, Br) because the current scaffolds list is based upon endogenous human biochemical compounds.

∗ Metabolomics pipeline. Currently, BioSMXpress and TrackSM are two independent programs. I would like to create a pipeline in which the outputs of BioSMXpress are automatically handed to TrackSM as inputs. Results from both programs would be presented together.

∗ User-friendly website. I would like to develop a user-friendly web-based application where users can input one or more molecular structures and get a prediction of whether each structure looks biochemical or not, in addition to the pathway class and individual pathway it might interact with.

Bibliography

[1] R. Q. Wu, X. F. Zhao, Z. Y. Wang, M. Zhou, and Q. M. Chen, “Novel molecular events in oral carcinogenesis via integrative approaches.,” Journal of dental research, vol. 90, pp. 561–72, May 2011.

[2] J. K. Nicholson and J. C. Lindon, “Metabonomics,” Nature, vol. 455, no. October, pp. 1054–1056, 2008.

[3] E. C. Laiakis, R. Bogumil, C. Roehring, M. Daxboeck, S. Lai, M. Breit, J. Shockcor, S. Cohen, J. Langridge, A. J. F. Jr, and G. Astarita, “Targeted Metabolomics Using the UPLC / MS-based AbsoluteIDQ p180 Kit,” tech. rep., Waters, 2012.

[4] G. J. Patti, O. Yanes, and G. Siuzdak, “Metabolomics: The Apogee of the Omic Trilogy,” Nature Reviews, vol. 13, pp. 263–269, 2012.

[5] W. B. Dunn and D. I. Ellis, “Metabolomics: Current analytical platforms and methodologies,” TrAC Trends in Analytical Chemistry, vol. 24, pp. 285–294, Apr. 2005.

[6] M. G. Barderas, C. M. Laborde, M. Posada, F. de la Cuesta, I. Zubiri, F. Vivanco, and G. Alvarez-Llamas, “Metabolomic profiling for identification of novel potential biomarkers in cardiovascular diseases.,” Journal of biomedicine & biotechnology, vol. 2011, p. 790132, Jan. 2011.

[7] I. Barba, D. Garcia-dorado, I. D. Recerca, A. Cor, and H. Universitari, “Metabolomics in Cardiovascular Disease : Towards Clinical Application,” in Coronary Artery Disease New Insights and Novel Approaches, pp. 207–224, InTech, 2012.

[8] O. Fiehn, “Metabolomics the link between genotypes and phenotypes,” Plant Molecular Biology, vol. 48, pp. 155–171, 2002.

[9] R. Goodacre, S. Vaidyanathan, W. B. Dunn, G. G. Harrigan, and D. B. Kell, “Metabolomics by numbers: acquiring and understanding global metabolite data.,” Trends in biotechnology, vol. 22, pp. 245–52, May 2004.

[10] L. C. Menikarachchi, M. A. Hamdalla, D. W. Hill, and D. F. Grant, “Chemical Structure Identification in Metabolomics: Computational Modeling of Experimental Features,” Computational and Structural Biotechnology Journal, vol. 5, no. 6, 2013.

[11] H. Kitano, “Systems biology: a brief overview.,” Science (New York, N.Y.), vol. 295, pp. 1662–4, Mar. 2002.

[12] J. van der Greef, P. Stroobant, and R. van der Heijden, “The role of analytical sciences in medical systems biology.,” Current opinion in chemical biology, vol. 8, pp. 559–65, Oct. 2004.

[13] S. Rochfort, “Metabolomics Reviewed: A New Omics Platform Technology for Systems Biology and Implications for Natural Products Research.,” Journal of Natural Products, vol. 68, pp. 1813–1820, 2005.

[14] N. S. Nagaraj, “Evolving ’omics’ technologies for diagnostics of head and neck cancer.,” Briefings in functional genomics & proteomics, vol. 8, pp. 49–59, Jan. 2009.

[15] S. Bhattacharya and T. J. Mariani, “Systems biology approaches to identify developmental bases for lung diseases.,” Pediatric research, vol. 73, pp. 514–22, Apr. 2013.

[16] E. L. Schymanski, M. Meringer, and W. Brack, “Automated Strategies To Identify Compounds on the Basis of GC/EI-MS and Calculated Properties.,” Analytical Chemistry, vol. 83, pp. 903–912, 2011.

[17] S. G. Villas-Bôas, S. Mas, M. Åkesson, J. Smedsgaard, and J. Nielsen, “Mass spectrometry in metabolome analysis.,” Mass spectrometry reviews, vol. 24, no. 5, pp. 613–46, 2005.

[18] B. P. Lankadurai, H NMR-based metabolomics for elucidating the of contaminants in the earthworm Eisenia fetida after sub-lethal exposure. PhD thesis, University of Toronto, 2013.

[19] L. V. Tong, Development and application of mass spectrometry-based metabolomics methods. PhD thesis, Massachusetts Institute of Technology, 2008.

[20] J. Tang, “Microbial metabolomics.,” Current genomics, vol. 12, pp. 391–403, Sept. 2011.

[21] J. W. Allwood, D. I. Ellis, and R. Goodacre, “Metabolomic technologies and their application to the study of plants and plant-host interactions.,” Physiologia plantarum, vol. 132, pp. 117–35, Feb. 2008.

[22] J. K. Nicholson, J. C. Lindon, and E. Holmes, “’Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data.,” Xenobiotica; the fate of foreign compounds in biological systems, vol. 29, pp. 1181–9, Nov. 1999.

[23] J. van der Greef and A. K. Smilde, “Symbiosis of chemometrics and metabolomics: past, present, and future,” Journal of Chemometrics, vol. 19, pp. 376–386, May 2005.

[24] M. M. Koek, Gas chromatography mass spectrometry: key technology in metabolomics. PhD thesis, Leiden University, 2009.

[25] D. B. Kell, “Metabolomics and systems biology: making sense of the soup.,” Current opinion in microbiology, vol. 7, pp. 296–307, June 2004.

[26] R. Goodacre, “Metabolomics of a superorganism.,” The Journal of Nutrition, vol. 137, pp. 259S–266S, Jan. 2007.

[27] M. Li, B. Wang, M. Zhang, M. Rantalainen, S. Wang, H. Zhou, Y. Zhang, J. Shen, X. Pang, M. Zhang, H. Wei, Y. Chen, H. Lu, J. Zuo, M. Su, Y. Qiu, W. Jia, C. Xiao, L. M. Smith, S. Yang, E. Holmes, H. Tang, G. Zhao, J. K. Nicholson, L. Li, and L. Zhao, “Symbiotic gut microbes modulate human metabolic phenotypes.,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, pp. 2117–22, Feb. 2008.

[28] H. E. Johnson, D. Broadhurst, D. B. Kell, M. K. Theodorou, R. J. Merry, and G. W. Griffith, “High-Throughput Metabolic Fingerprinting of Legume Silage Fermentations via Fourier Transform Infrared Spectroscopy and Chemometrics,” Applied and Environmental Microbiology, vol. 70, no. 3, pp. 1583–1592, 2004.

[29] K. D. Nadella, S. S. Marla, and P. A. Kumar, “Metabolomics in agriculture.,” OMICS: A Journal of Integrative Biology, vol. 16, pp. 149–159, Apr. 2012.

[30] L. W. Sumner, P. Mendes, and R. A. Dixon, “Plant metabolomics: large-scale phytochemistry in the functional genomics era,” Phytochemistry, vol. 62, pp. 817–836, Mar. 2003.

[31] B. Teusink, F. H. J. V. Enckevort, C. Francke, A. Wiersma, A. Wegkamp, E. J. Smid, J. Roland, and R. J. Siezen, “In Silico Reconstruction of the Metabolic Pathways of Lactobacillus plantarum: Comparing Predictions of Nutrient Requirements with Those from Growth Experiments,” Applied and Environmental Microbiology, vol. 71, no. 11, pp. 7253–7262, 2005.

[32] T. Ogura and Y. Sakamoto, “Application of Metabolomics Techniques using LC/MS and GC/MS Profiling Analysis of Green Tea Leaves,” Tech. Rep. 10, Lifescience, Tokyo, 2007.

[33] R. D. Hall and N. W. Hardy, “Practical Applications of Metabolomics in Plant Biology,” in Plant Metabolomics (N. W. Hardy and R. D. Hall, eds.), Humana Press, 2012.

[34] J. van der Greef, T. Hankemeier, and R. N. McBurney, “Metabolomics-based systems biology and personalized medicine: moving towards n = 1 clinical trials?,” Pharmacogenomics, vol. 7, pp. 1087–94, Oct. 2006.

[35] E. Baraldi, S. Carraro, G. Giordano, F. Reniero, G. Perilongo, and F. Zacchello, “Metabolomics : moving towards personalized medicine,” Italian Journal of Pediatrics BioMed Central Commentary, vol. 4, no. 35, pp. 2–5, 2009.

[36] A. D. Eckhart, K. Beebe, and M. Milburn, “Metabolomics as a key integrator for ‘omic’ advancement of personalized medicine and future therapies.,” Clinical and translational science, vol. 5, pp. 285–8, June 2012.

[37] M. Güzey, “Personalized Medicine: New Perspectives in Cancer Treatments,” Journal of Postgenomics Drug & Biomarker Development, vol. 03, no. 02, pp. 2–3, 2013.

[38] X. Li, C. Li, D. Shang, J. Li, J. Han, Y. Miao, Y. Wang, Q. Wang, W. Li, C. Wu, Y. Zhang, X. Li, and Q. Yao, “The implications of relationships between human diseases and metabolic subpathways.,” PloS one, vol. 6, p. e21131, Jan. 2011.

[39] J. T. Brindle, H. Antti, E. Holmes, G. Tranter, J. K. Nicholson, H. W. L. Bethell, S. Clarke, P. M. Schofield, E. McKilligin, D. E. Mosedale, and D. J. Grainger, “Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using 1H-NMR-based metabonomics.,” Nature medicine, vol. 8, pp. 1439–44, Dec. 2002.

[40] C. Ohdoi, W. L. Nyhan, and T. Kuhara, “Chemical diagnosis of Lesch-Nyhan syndrome using gas chromatography-mass spectrometry detection.,” Journal of chromatography. B, Analytical technologies in the biomedical and life sciences, vol. 792, pp. 123–30, July 2003.

[41] J. L. Spratlin, N. J. Serkova, and S. G. Eckhardt, “Clinical Applications of Metabolomics in Oncology: A Review,” Clinical Cancer Research, vol. 15, no. 2, pp. 431–440, 2009.

[42] A. Sreekumar, L. M. Poisson, T. M. Rajendiran, A. P. Khan, Q. Cao, J. Yu, B. Laxman, R. Mehra, R. J. Lonigro, Y. Li, M. K. Nyati, A. Ahsan, S. Kalyana-Sundaram, B. Han, X. Cao, J. Byun, G. S. Omenn, D. Ghosh, S. Pennathur, D. C. Alexander, A. Berger, J. R. Shuster, J. T. Wei, S. Varambally, C. Beecher, and A. M. Chinnaiyan, “Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression.,” Nature, vol. 457, pp. 910–4, Feb. 2009.

[43] C. Li, D. Shang, Y. Wang, J. Li, J. Han, S. Wang, Q. Yao, Y. Wang, Y. Zhang, C. Zhang, Y. Xu, W. Jiang, and X. Li, “Characterizing the network of drugs and their affected metabolic subpathways.,” PloS one, vol. 7, p. e47326, Jan. 2012.

[44] A. S. Reddy and S. Zhang, “Polypharmacology: drug discovery for the future.,” Expert review of clinical pharmacology, vol. 6, pp. 41–47, Jan. 2013.

[45] J. K. Nicholson, J. Connelly, J. C. Lindon, and E. Holmes, “Metabonomics: a platform for studying drug toxicity and gene function.,” Nature reviews. Drug discovery, vol. 1, pp. 153–61, Feb. 2002.

[46] A. L. Harvey, “Natural products in drug discovery.,” Drug Discovery Today, vol. 13, pp. 894–901, Oct. 2008.

[47] T. Nanda, M. Das, K. Tripathy, and R. Teja Y, “Metabolomics: The Future of Systems Biology,” Journal of Computer Science & Systems Biology, vol. 04, no. 02, pp. 1–6, 2011.

[48] T. T. Ashburn and K. B. Thor, “Drug repositioning: identifying and developing new uses for existing drugs.,” Nature reviews. Drug discovery, vol. 3, pp. 673–83, Aug. 2004.

[49] D. J. S. Jr, “New uses for old drugs,” Nature, vol. 448, no. August, pp. 645–646, 2007.

[50] J. T. Dudley, T. Deshpande, and A. J. Butte, “Exploiting drug-disease relationships for computational drug repositioning.,” Briefings in bioinformatics, vol. 12, pp. 303–11, July 2011.

[51] B. B. Mishra and V. K. Tiwari, “Natural products in drug discovery: Clinical evaluations and investigations.,” in Opportunity, Challenge and Scope of Natural Products in Medicinal Chemistry, vol. 661, ch. 1, pp. 1–62, Kerala, India: Research Signpost, 2011.

[52] Z. Liu, H. Fang, K. Reagan, X. Xu, D. L. Mendrick, W. Slikker, and W. Tong, “In silico drug repositioning - what we need to know.,” Drug discovery today, vol. 18, pp. 110–5, Feb. 2013.

[53] N. T. Issa, J. Kruger, S. W. Byers, and S. Dakshanamurthy, “Drug repurposing a reality: from computers to the clinic.,” Expert review of clinical pharmacology, vol. 6, pp. 95–7, Mar. 2013.

[54] Z. León, J. C. García-Cañaveras, M. T. Donato, and A. Lahoz, “Mammalian cell metabolomics: experimental design and sample preparation.,” Electrophoresis, vol. 34, pp. 2762–75, Oct. 2013.

[55] P. Bais, Bioinformatics methods for metabolomics based biomarker detection in functional genomics studies. PhD thesis, Iowa State University, 2011.

[56] N. V. Reo, “NMR-based metabolomics.,” Drug and Chemical Toxicology, vol. 25, pp. 375–382, Nov. 2002.

[57] G. G. Harrigan, R. H. LaPlante, G. N. Cosma, G. Cockerell, R. Goodacre, J. F. Maddox, J. P. Luyendyk, P. E. Ganey, and R. A. Roth, “Application of high-throughput Fourier-transform infrared spectroscopy in toxicology studies: contribution to a study on the development of an animal model for idiosyncratic toxicity,” Toxicology Letters, vol. 146, pp. 197–205, Feb. 2004.

[58] K. Dettmer, P. A. Aronov, and B. D. Hammock, “Mass Spectrometry-based Metabolomics.,” Mass Spectrometry Reviews, vol. 26, pp. 51–78, 2007.

[59] M. Entzeroth, “Emerging trends in high-throughput screening,” Current Opinion in Pharmacology, vol. 3, pp. 522–529, Oct. 2003.

[60] A. Zhang, H. Sun, P. Wang, Y. Han, and X. Wang, “Modern analytical techniques in metabolomics analysis.,” The Analyst, vol. 137, pp. 293–300, Jan. 2012.

[61] M. A. Lorenz, Development and application of metabolomic techniques for identification and quantification of intercellular metabolites relevant to glucose stimulated insulin secretion in β-cells. PhD thesis, The University of Michigan, 2011.

[62] J. Boccard, J.-L. Veuthey, and S. Rudaz, “Knowledge discovery in metabolomics: an overview of MS data handling.,” Journal of separation science, vol. 33, pp. 290–304, Feb. 2010.

[63] M. Commisso, P. Strazzer, K. Toffali, M. Stocchero, and F. Guzzo, “Untargeted metabolomics : an emerging approach to determine the composition of herbal products,” Computational and Structural Biotechnology Journal, vol. 4, no. 5, pp. 1–7, 2013.

[64] C. A. Smith, G. O’Maille, E. J. Want, C. Qin, S. A. Trauger, T. R. Brandon, D. E. Custodio, R. Abagyan, and G. Siuzdak, “METLIN: a metabolite mass spectral database.,” Therapeutic Drug Monitoring, vol. 27, pp. 747–751, Dec. 2005.

[65] D. S. Wishart, C. Knox, A. C. Guo, R. Eisner, N. Young, B. Gautam, D. D. Hau, N. Psychogios, E. Dong, S. Bouatra, R. Mandal, I. Sinelnikov, J. Xia, L. Jia, J. A. Cruz, E. Lim, C. A. Sobsey, S. Shrivastava, P. Huang, P. Liu, L. Fang, J. Peng, R. Fradette, D. Cheng, D. Tzur, M. Clements, A. Lewis, A. De Souza, A. Zuniga, M. Dawe, Y. Xiong, D. Clive, R. Greiner, A. Nazyrova, R. Shaykhutdinov, L. Li, H. J. Vogel, and I. Forsythe, “HMDB: a knowledgebase for the human metabolome.,” Nucleic Acids Research, vol. 37, pp. D603–D610, Jan. 2009.

[66] L. C. Menikarachchi, S. Cawley, D. W. Hill, L. M. Hall, L. Hall, S. Lai, J. Wilder, and D. F. Grant, “MolFind: a software package enabling HPLC/MS-based identification of unknown chemical structures.,” Analytical Chemistry, vol. 84, pp. 9388–9394, Nov. 2012.

[67] E. E. Bolton, Y. Wang, P. A. Thiessen, and S. H. Bryant, “PubChem: Integrated Platform of Small Molecules and Biological Activities,” in Annual Reports in Computational Chemistry, vol. 4, ch. 12, pp. 217–241, Washington: American Chemical Society, 4 ed., 2008.

[68] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman, “ZINC: A Free Tool to Discover Chemistry for Biology.,” Journal of Chemical Information and Modeling, vol. 52, pp. 1757–1768, June 2012.

[69] ChemSpider, “http://www.chemspider.com/.”

[70] D. S. Wishart, D. Tzur, C. Knox, R. Eisner, A. C. Guo, N. Young, D. Cheng, K. Jewell, D. Arndt, S. Sawhney, C. Fung, L. Nikolai, M. Lewis, M. Coutouly, I. Forsythe, P. Tang, S. Shrivastava, K. Jeroncic, P. Stothard, G. Amegbey, D. Block, D. D. Hau, J. Wagner, J. Miniaci, M. Clements, M. Gebremedhin, N. Guo, Y. Zhang, G. E. Duggan, G. D. Macinnis, A. M. Weljie, R. Dowlatabadi, F. Bamforth, D. Clive, R. Greiner, L. Li, T. Marrie, B. D. Sykes, H. J. Vogel, and L. Querengesser, “HMDB: the Human Metabolome Database.,” Nucleic Acids Research, vol. 35, pp. D521–526, Jan. 2007.

[71] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. Pon, K. Banco, C. Mak, V. Neveu, Y. Djoumbou, R. Eisner, A. C. Guo, and D. S. Wishart, “DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs.,” Nucleic Acids Research, vol. 39, pp. D1035–41, Jan. 2011.

[72] P. Romero, J. Wagg, M. L. Green, D. Kaiser, M. Krummenacker, and P. D. Karp, “Computational prediction of human metabolic pathways from the complete human genome.,” Genome Biology, vol. 6, pp. R2.1–R2.17, Jan. 2004.

[73] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya, “The KEGG databases at GenomeNet.,” Nucleic Acids Research, vol. 30, pp. 42–46, Jan. 2002.

[74] P. M. Network, “http://www.plantcyc.org/.”

[75] L. Pireddu, D. Szafron, P. Lu, and R. Greiner, “The Path-A metabolic pathway prediction web server.,” Nucleic acids research, vol. 34, pp. W714–9, July 2006.

[76] J. M. Dale, L. Popescu, and P. D. Karp, “Machine learning methods for metabolic pathway prediction.,” BMC bioinformatics, vol. 11, p. 15, Jan. 2010.

[77] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer Series in Statistics, 2 ed., 2009.

[78] B. W. Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme.,” Biochim. Biophys. Acta., vol. 405, no. 2, pp. 442–451, 1975.

[79] T. Kertesz, D. W. Hill, D. Albaugh, L. Hall, L. Hall, and D. F. Grant, “Database searching for structural identification of metabolites in complex biofluids for mass spectrometry-based metabonomics.,” Bioanalysis, vol. 1, no. 9, pp. 1627–1643, 2009.

[80] M. Hamdalla, D. Grant, I. Mandoiu, D. Hill, S. Rajasekaran, and R. Ammar, “The use of graph matching algorithms to identify biochemical substructures in synthetic chemical compounds: Application to metabolomics.,” in 2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS), (NV, USA), Feb. 2012.

[81] I. Nobeli, H. Ponstingl, E. B. Krissinel, and J. M. Thornton, “A structure-based anatomy of the E.coli metabolome.,” Journal of Molecular Biology, vol. 334, pp. 697–719, Dec. 2003.

[82] S. Gupta and J. Aires-de-Sousa, “Comparing the chemical spaces of metabolites and available chemicals: models of metabolite-likeness.,” Molecular Diversity, vol. 11, pp. 23–36, Feb. 2007.

[83] J. J. Irwin and B. K. Shoichet, “ZINC A Free Database of Commercially Available Compounds for Virtual Screening.,” Journal of Chemical Information and Modeling, vol. 45, no. 1, pp. 177–182, 2005.

[84] L. Breiman, “Random forests,” in Machine Learning, vol. 45, pp. 5–32, Kluwer Academic Publishers, 45 ed., 2001.

[85] J. E. Peironcely, T. Reijmers, L. Coulier, A. Bender, and T. Hankemeier, “Understanding and classifying metabolite space and metabolite-likeness.,” PloS One, vol. 6, Jan. 2011.

[86] J. L. Durant, B. A. Leland, D. R. Henry, and J. G. Nourse, “Reoptimization of MDL keys for use in drug discovery.,” Journal of Chemical Information and Computer Sciences, vol. 42, no. 6, pp. 1273–1280, 2002.

[87] W. A. Warr, “ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI).,” Journal of Computer-Aided Molecular Design, vol. 23, pp. 195–198, Apr. 2009.

[88] C. A. James, D. Weininger, and J. Delany, “Fingerprints - Screening and Similarity,” in Daylight Theory Manual, Irvine, CA and Santa Fe, NM: Daylight Chemical Information Systems, Inc., Nov. 2000.

[89] G. M. Maggiora and V. Shanmugasundaram, “Molecular Similarity Measures.,” in Chemoinformatics and Computational Chemical Biology, vol. 672, pp. 39–100, Humana Press, 2011.

[90] Marvin, “http://www.chemaxon.com/,” 2012.

[91] D. Weininger, “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules,” Journal of Chemical Information and Computer Sciences, vol. 28, no. 1, pp. 31–36, 1988.

[92] S. A. Rahman, M. Bashton, G. L. Holliday, R. Schrader, and J. M. Thornton, “Small Molecule Subgraph Detector (SMSD) toolkit.,” Journal of Cheminformatics, vol. 1, Jan. 2009.

[93] A. Macchiarulo, J. M. Thornton, and I. Nobeli, “Mapping human metabolic pathways in the small molecule chemical space.,” Journal of Chemical Information and Modeling, vol. 49, pp. 2272–2289, Oct. 2009.

[94] M. A. Hamdalla, I. I. Mandoiu, D. W. Hill, S. Rajasekaran, and D. F. Grant, “BioSM: A chemoinformatics tool for identifying biochemical structures in chemical structure space,” Journal of Chemical Information and Modeling, 2012.

[95] K. P. Compounds, “www.genome.jp/kegg-bin/get htext?org name=br08003&query=&htext=br08003.keg&filedir=&highlight=&option=- &extend=C1-162B19&uploadfile=&format=&wrap=&length=&open=&close=&hier=0.”

[96] J.-K. Weng, R. N. Philippe, and J. P. Noel, “The rise of chemodiversity in plants.,” Science, vol. 336, pp. 1667–1670, June 2012.

[97] N. Chen, I. J. del Val, S. Kyriakopoulos, K. M. Polizzi, and C. Kontoravdi, “Metabolic network reconstruction: advances in in silico interpretation of analytical information.,” Current opinion in biotechnology, vol. 23, pp. 77–82, Mar. 2012.

[98] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in KEGG.,” Nucleic acids research, vol. 34, pp. D354–7, Jan. 2006.

[99] A. Mithani, G. M. Preston, and J. Hein, “Rahnuma: hypergraph-based tool for metabolic pathway prediction and network comparison.,” Bioinformatics (Oxford, England), vol. 25, pp. 1831–2, July 2009.

[100] P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J. M. Dale, T. J. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I. M. Keseler, and R. Caspi, “Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology.,” Briefings in bioinformatics, vol. 11, pp. 40–79, Jan. 2010.

[101] J. Gao, L. B. M. Ellis, and L. P. Wackett, “The University of Minnesota Biocatalysis/Biodegradation Database: improving public access.,” Nucleic acids research, vol. 38, pp. D488–91, Jan. 2010.

[102] Y. Moriya, D. Shigemizu, M. Hattori, T. Tokimatsu, M. Kotera, S. Goto, and M. Kanehisa, “PathPred: an enzyme-catalyzed metabolic pathway prediction server.,” Nucleic acids research, vol. 38, pp. W138–43, July 2010.

[103] L.-L. Hu, C. Chen, T. Huang, Y.-D. Cai, and K.-C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions.,” PloS one, vol. 6, p. e29491, Jan. 2011.

[104] I. Nobeli and J. M. Thornton, “A bioinformatician’s view of the metabolome.,” BioEssays : news and reviews in molecular, cellular and developmental biology, vol. 28, pp. 534–45, May 2006.

[105] Y.-D. Cai, Z. Qian, L. Lu, K.-Y. Feng, X. Meng, B. Niu, G.-D. Zhao, and W.-C. Lu, “Prediction of compounds’ biological function (metabolic pathways) based on functional group composition.,” Molecular diversity, vol. 12, pp. 131–7, May 2008.

[106] STITCH, “http://stitch.embl.de/.”

[107] Y.-F. Gao, L. Chen, Y.-D. Cai, K.-Y. Feng, T. Huang, and Y. Jiang, “Predicting metabolic pathways of small molecules and enzymes based on interaction information of chemicals and proteins.,” PloS one, vol. 7, p. e45944, Jan. 2012.

[108] STRING, “http://string.embl.de/.”