COMPUTATIONAL ANALYSIS, VISUALIZATION AND TEXT MINING OF METABOLIC NETWORKS
by
XINJIAN QI
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Dissertation Advisor: Dr. Gültekin Özsoyoğlu
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
January, 2014
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis/dissertation of
____Xinjian Qi______candidate for the Doctor of Philosophy___degree *.
(signed) __Dr. Gültekin Özsoyoğlu______(chair of the committee)
______Dr. Andy Podgurski______
______Dr. M. Cenk Cavusoğlu______
______Dr. Nicola Lai______
______Dr. Z. Meral Özsoyoğlu______
______
(date) _____June 24, 2013____
*We also certify that written approval has been obtained for any proprietary material contained therein.
Table of Contents
List of Tables
List of Figures
Acknowledgements
Abstract
Introduction
1.1 Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis
1.2 Performing Gene Lethality Testing with SMDA
1.3 Visualization Tools for PathCase Systems
1.4 Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks
1.5 Thesis Organization
Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis
2.1 Introduction
2.2 Condition-Based Modeling
2.2.1 Assumptions and Terminology
2.2.2 Metabolite Pool Label Identifiers
2.2.3 Metabolite Label Condition Characterization
2.2.4 Trigger Values and Activation Condition Sets for Reactions, Transport Processes, or Pathways
2.2.5 Biochemistry-Based Rules
2.3 Active/Inactive Graph Generation And Expansion
2.3.1 Initial GAI Generation
2.3.2 GAI Graph Expansion
2.3.3 Merging GAI Graphs
2.3.4 Algorithm Sketch
2.4 Experimental Evaluation
2.4.1 Experimental Setting
2.4.2 Experimental Results
2.5 Related Work: Metabolic Network Analysis Techniques
2.6 Conclusions
Performing Gene Lethality Testing with SMDA
3.1 Introduction
3.2 Summary of SMDA Algorithm
3.2.1 SMDA Terminology
3.2.2 Algorithm Flow
3.2.3 Conflicts
3.3 Existing Gene Lethality Techniques and SMDA
3.4 Revising SMDA For Gene Lethality Testing
3.5 Experimental Evaluation
3.5.1 Experimental Setting
3.5.2 Gene Lethality Test Results
3.5.3 Gene Non-Lethality Test Results
3.6 Conclusions
Visualization Tools for PathCase Systems
4.1 Introduction
4.2 Visualization Tool for PathCase-SB System
4.3 Visualization Tools for other PathCase Systems
4.3.1 PathCase-MAW and PathCase-MAW Editor
4.3.2 PathCase-SMDA
4.3.3 PathCase-RCMN and PathCase-Recon
4.3.4 PathCase-MQL
4.4 General Framework
4.5 Visualization Tool for iPad Applications
4.6 Conclusions
Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks
5.1 Introduction
5.1.1 Entity Identification
5.1.2 Similarity Score
5.2 Metabolite Identification
5.2.1 Exact Match via Metabolite Id/Synonyms
5.2.2 Approximate Name Matching
5.2.3 Filtering Metabolite Match Candidates
5.3 Reaction Identification
5.3.1 Reaction Name Matching
5.3.2 Reaction Property Matching
5.3.3 Reaction Compound Matching
5.3.4 Reaction Similarity Score
5.4 Experimental Evaluation
5.4.1 Metabolite Identification Results
5.4.2 Reaction Identification Results
5.5 Conclusion
Conclusions and Future Work
6.1 Future work
6.1.1 SMDA
6.1.2 Visualization
6.1.3 Bio-Entity Identifications
Appendix 1. Core Metabolites (Total count: 617)
Bibliography
List of Tables
Table 2.1 The number of observations vs. the number of output graphs for small sub-networks
Table 2.2 The number of observations vs. the number of graphs for a large network
Table 3.1 Metabolite pool observations from the T. cruzi paper
Table 3.2 Metabolite pool observations from biomass reaction
Table 3.3 Energy pools are set as Available
Table 3.4 Inactive reactions for epimastigote case
Table 3.5 Lethal genes to be verified
Table 3.6 SMDA test results on lethal genes
Table 5.1 Biologically significant terms
List of Figures
Figure 2.1 SMDA result as a single GAI graph
Figure 2.2 Illustration of three alternative versions of transport processes
Figure 2.3 A partial metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.
Figure 2.4 The first level of the GAI graph generation hierarchy for the metabolic network in Figure 2.3
Figure 2.5 A metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.
Figure 2.6 The GAI graphs before merging two GAI-GROUPs
Figure 2.7 Sketch of the SMDA algorithm
Figure 2.8 SMDA time cost for a single network versus the number of observations for Glycolysis and TCA Cycle combined
Figure 3.1 A partial network with reversible reaction
Figure 3.2 Partial depiction of the optimal flux distribution on epimastigote model of T. cruzi network
Figure 3.3 A complete network for Example 3.2
Figure 3.4 A partial network for Example 3.3
Figure 4.1 Visualization Tools and Applications
Figure 4.2 Visualization of Albert2005-Glycolysis Model
Figure 4.3 An example of built-in query (reaction-to-process mapping)
Figure 4.4 Visualization of a query (Figure 4.3) result
Figure 4.5 Glycolysis in Cytosol_Adipose and Cytosol_Liver
Figure 4.6 Catabolism of Phenylalanine pathway in PathCase-MAW editor
Figure 4.7 TCA cycle pathway in PathCase-MAW
Figure 4.8 SMDA query results
Figure 4.9 Fatty Acid Metabolism pathway in the iMM1415 model (2010)
Figure 4.10 E. coli Textbook in PathCase-Recon System
Figure 4.11 An example of MQL query
Figure 4.12 MQL query result of the example in Figure 4.11
Figure 4.13 Visualization tools in PathCase systems
Figure 5.1 Metabolite Identification Algorithm Sketch
Figure 5.2 CandidatesM() function
Figure 5.3 BST-Filter() function
Figure 5.4 Reaction Identification Algorithm Sketch
Figure 5.5 Model-to-model metabolite matching results for
Figure 5.6 Model-to-model metabolite matching results for <Model2008_09_23_13_13_29, Model2008_08_15_12_13_14>
Figure 5.7 Model-to-model metabolite matching results for <03_16_09_TM_minimal_medium_glc, barkeri iAF692>
Figure 5.8 Model-to-model reaction matching results for
Figure 5.9 Model-to-model reaction matching results for <Model2008_09_23_13_13_29, Model2008_08_15_12_13_14>
Figure 5.10 Model-to-model reaction matching results for <03_16_09_TM_minimal_medium_glc, barkeri iAF692>

Acknowledgements

First and foremost, I would like to express my deepest appreciation to my advisor, Prof. Gultekin Ozsoyoglu, for the guidance, encouragement and support that helped me finish this dissertation. It was only with his effective guidance, infinite patience, full trust and continuous faith in me that I have been able to succeed in my graduate studies and become a researcher. He has served as a role model in so many ways, not only in pursuing research goals, collaborating with other people, and guiding diverse students, but also in balancing work and life. He has been a great mentor, full of wisdom and endless energy. It is my great honor to have worked with him. His spirit of working hard, being thoughtful of others and sharing generously with others will accompany me for the rest of my life.

I would like to thank Drs. Podgurski, Cavusoğlu, Lai, and Ozsoyoglu for serving as members of my dissertation committee and for their constructive comments. I appreciate the time and effort that they spent on reading this work and providing me with feedback. I am grateful to Dr. Z. Meral Ozsoyoglu for attending my presentations and research meetings, and for providing precious feedback and suggestions throughout the whole period of my graduate studies.

I would like to thank my research collaborators and lab mates, Dr. Ali Cakmak, A. Ercument Cicek and Sarp Coskun. Dr. Cakmak has helped me not only with my research papers, but also with my studies. A. Ercument Cicek is a perfect colleague and friend, and we have collaborated on many projects and papers successfully. Sarp Coskun is full of enthusiasm for programming and new technologies, and he has made the lab a pleasurable place. Besides, everyone in the Databases and Bioinformatics Laboratory at Case Western Reserve University deserves acknowledgement and thanks for their friendship and warm company; it has been a great pleasure to work with and learn from all of you.

Most importantly, my deepest gratitude goes to my family. My parents and my sister have always been supportive of the decisions I have made. Their love, encouragement and absolute confidence are important sources of energy for me. Special thanks to my wife, Dr. Hong Guo, who helped me to start this journey, accompanied me in the process, and gave me endless love, patience and persistent support. It is her diligence, goodness, courage and self-sacrifice that have made this degree possible. She is always there for me. I must thank our twenty-month-old son, Yang Qi, who has caught up with this special journey and filled our lives with so much joy.

And finally, I would like to acknowledge the support from the National Science Foundation grants DBI-0849956 and DBI-0743705, and the National Institutes of Health under grant R01 GM088823.

Computational Analysis, Visualization and Text Mining of Metabolic Networks

Abstract
by
XINJIAN QI

With the recent advances in experimental technologies, such as gas chromatography/mass spectrometry, the number of metabolites that can be measured in biofluids of individuals has markedly increased. Given a set of such measurements, a very common task encountered by biologists is to identify the metabolic mechanisms that lead to changes in the concentrations of given metabolites and to interpret the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases.
This thesis presents the steady-state metabolic network dynamics analysis (SMDA) approach in detail. Experimental evaluation of the SMDA tool against a mammalian metabolic network database is also presented. The query output space of the SMDA tool is kept manageable in two ways: (i) a larger number of observations exponentially reduces the output size, and (ii) exploratory search and browsing of the query output space allows users to find what they are looking for.

SMDA is then applied to gene lethality testing. Compared with other methods used for gene lethality testing, the advantages of the SMDA algorithm are that it (1) requires less input, and (2) does not make optimality assumptions. The algorithm has been tested on the genome-scale reconstructed network of Trypanosoma cruzi, with its reported gene lethality testing results taken as ground truth.

Also, in this thesis, we study the general framework of the visualization tools, as well as the distinct features of each tool, in the PathCase systems, namely PathCase-SB, PathCase-MAW Editor, PathCase-MAW, PathCase-SMDA, PathCase-RCMN, PathCase-Recon, and PathCase-MQL.

Finally, this thesis proposes a number of metabolite/reaction identification techniques for Genome-Scale Reconstructed Metabolic Networks (GSRMN) (by matching metabolites/reactions to corresponding metabolites/reactions of a source model or data source). We employ a variety of computer science techniques that include approximate string matching, similarity score functions and filtering techniques, all enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are made using the proposed metabolite/reaction identification techniques.

Introduction

With recent advances in experimental technologies, the number of metabolites measured in bio-fluids of organisms has markedly increased.
Given a set of measurements, a common metabolomics task is to identify the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and to interpret the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases.

In metabolic networks, gene lethality is defined in terms of essential metabolite availability. An essential metabolite is a metabolite without which the organism cannot stay alive. Existing gene lethality testing methods either require optimality conditions or other assumptions (such as “the quality of the biomass reaction and the assumption of biomass optimization which is debatable even for unicellular organisms” [1][2]), assume prior knowledge about the network (e.g., complete stoichiometry), or generate results that may not be biochemically meaningful.

PathCase-SB aims to integrate systems biology model data and metabolic network data from selected biological data sources. The PathCase-SB system provides a database-enabled framework and web-based computational tools to facilitate the development of kinetic models for biological systems. The visualization interface is an important component of the PathCase-SB system. Visualization interfaces/tools also exist in many other PathCase systems.

The number and use of Genome-Scale Reconstructed Metabolic Networks (GSRMN) have been increasing in recent years. It is noted in the literature [3][4] that published GSRMNs have two basic limitations, which reduce their full utilization. One is the inability to match metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG) or another GSRMN, due to naming inconsistencies involving species (metabolites), reactions, and compartments. Another noted difficulty is in identifying pathways of a GSRMN.
To address the above problems, this thesis applies biological data analysis techniques to (1) performing computational interpretation of metabolomics measurements; (2) applying SMDA to gene lethality testing; (3) implementing visualization tools for PathCase systems; and (4) locating basic bio-entities in Genome-Scale Reconstructed Metabolic Networks. Next, we briefly describe these four metabolic network-related problems that we study in this dissertation.

1.1 Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis

Given a set of observed measurements in a metabolic network, we present the steady-state metabolic network dynamics analysis (SMDA) approach to interpret the metabolic consequences in terms of physiological problems, nutritional deficiencies, or diseases. SMDA needs no assumptions other than the steady-state assumption.

SMDA can be viewed as both a constraint- and rule-based approach. It is constraint-based [5][6][7] in that it uses conditions (pre-stored in its database) to locate all “allowable states” [8] of the reconstructed metabolic network model (pre-stored in its database). And it is rule-based in that its graph-expansion and merge strategies employ a number of biochemistry rules to capture the underlying metabolic biochemistry as much as possible.

In this research, a complete condition- and rule-based model of the metabolic network behavior is specified. We list the assumptions of our model and define the notion of (quasi-) steady-state for the metabolic network. We present the notion of metabolite pool label identifiers, the three-valued logic used to specify metabolite pool label conditions, and the Activation Condition Sets for reactions as well as transport processes. Transport process rules and a number of basic biochemistry-based rules are also listed.

The SMDA algorithm runs in a cycle of two phases: Expansion and Merge.
The cycle continues until all reactions and metabolite pools in the network are assigned a status. The Expansion phase starts from the labeled metabolite pools (observations), each of which is a flow-graph with a single metabolite pool. Then, expanded flow-graph(s) are generated by adding neighboring reactions and metabolite pools to the original flow-graph. SMDA generates all possible combinations of label assignments to those neighboring pools and reactions. This process continues until all reactions and metabolite pools are assigned a label.

SMDA returns only two flux values for a reaction, namely, 0 (inactive) and 1 (active). The query output space of the SMDA tool is exponentially large in the number of reactions of the network. However, (i) larger numbers of observations exponentially reduce the output size, and (ii) exploratory search and browsing of the query output space allows users to mine and search for what they are looking for. The problems addressed and their solutions are new and specific to the SMDA approach.

Advantages of SMDA include its ease of use and simplicity; it is designed as a “first-step” and “online” tool for wet lab researchers (a) to evaluate their hypotheses about observed measurements, and (b) to be used for “what if” types of questions (i.e., knowledge discovery). The SMDA technique and its computational performance limits are evaluated using a mammalian metabolic network database [9].

Our work on evaluating the activation/inactivation scenarios of the metabolic network at steady state is related to metabolic network analysis techniques such as metabolic control analysis (MCA) [10], flux balance analysis (FBA) [11], metabolic flux analysis [12] and, finally, metabolic pathway analysis (MPA) (elementary flux modes (EFM) and extreme pathways (EP)) [13].
MCA and FBA solve a set of under-constrained differential equations; in comparison, our SMDA approach can be considered a rule-based knowledge discovery approach within a given metabolic network database. A detailed comparison between these techniques and SMDA can be found in Section 2.5.

1.2 Performing Gene Lethality Testing with SMDA

In this research, we study the usefulness of SMDA for the problem of gene lethality testing [14]. A gene is lethal if its knockout causes the unavailability of at least one essential metabolite in the organism at the steady state. In other words, a gene is lethal if its removal from the organism’s genome results in the non-production of at least one essential metabolite, and, thus, the death of the organism.

An SMDA-based gene lethality test is done in three steps. First, reactions catalyzed by the enzymes produced by the knocked-out gene are marked as inactive in the network. Then all essential metabolite pools are labeled as Available. Finally, SMDA is run to check if there is at least one feasible flow-graph in the metabolic network that produces and consumes each and every essential metabolite. The stopping conditions are adjusted accordingly: if the algorithm terminates with no possible flow-graphs, or any merge/expansion conflict is encountered during the process, the knocked-out gene is verified to be lethal. However, if SMDA produces even one valid flow-graph, the gene is said to be non-lethal, since the organism is still alive without the gene.

The SMDA gene lethality algorithm is validated via a selected reconstructed network of the core metabolism of Trypanosoma cruzi [16], a kinetoplastid parasite in humans that causes Chagas disease [17].
Trypanosoma cruzi has a small core reconstructed metabolic network [18] with 215 genes, 162 reactions, and 4 compartments. Seven genes are considered lethal according to the literature, the full model prediction, and the epimastigote model prediction in the Trypanosoma cruzi paper [15]. We obtained the reconstructed network model of Trypanosoma cruzi in the form of an SBML document, and parsed and exported the model (with an in-house SBML parser tool) into our PathCase-RCMN database. We take the model network as one input for SMDA. The other input for SMDA, the metabolite pool observations, is derived from the availability of extracellular metabolites according to the paper supplement [16]. All seven lethal genes are verified with SMDA. We have also selected one non-lethal gene in Trypanosoma cruzi, namely, adenosine kinase, and SMDA has correctly verified its non-lethality as well. Thus, we confirm that SMDA can be used for gene lethality testing purposes.

SMDA has been compared with current methods used for gene lethality testing, for example, topological analysis of regulatory networks [2], Barabasi’s computational estimate method [3,4], and Flux Balance Analysis (FBA). The advantages of the SMDA algorithm are that it (1) requires less input, and (2) does not make optimality assumptions. On the negative side, SMDA has its limitations for very large networks, since it enumerates all possible activation/inactivation scenarios for the network at hand. The complexity of SMDA can be reduced with a domain expert’s knowledge, by reducing the network (i.e., abstracting a pathway into a single abstract reaction), or by providing more observations. A detailed comparison between these techniques and SMDA can be found in Chapter 3.
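The three-step lethality test described above can be illustrated with a minimal sketch. It is not the SMDA implementation: it replaces the expansion/merge procedure and the ACT condition sets with a brute-force enumeration of active/inactive assignments, and the toy network, gene names, and the `is_lethal` helper are all hypothetical.

```python
from itertools import product

# Hypothetical toy network: reaction -> (substrates, products).
REACTIONS = {
    "R1": (("A",), ("B",)),
    "R2": (("A",), ("B",)),  # alternative route to B
    "R3": (("B",), ("C",)),
    "R4": (("C",), ("D",)),  # consumes the essential metabolite C
}
# Hypothetical gene -> catalyzed reactions mapping.
GENE_TO_REACTIONS = {"g1": ["R1"], "g2": ["R2"], "g3": ["R3"], "g4": ["R4"]}
ESSENTIAL = {"C"}  # essential metabolites (assumed)
EXTERNAL = {"A"}   # metabolites assumed available from outside


def is_lethal(gene, reactions=REACTIONS, gene_to_reactions=GENE_TO_REACTIONS,
              essential=ESSENTIAL, external=EXTERNAL):
    """Return True if no feasible active/inactive scenario survives the knockout."""
    # Step 1: reactions of the knocked-out gene are forced inactive.
    knocked = set(gene_to_reactions.get(gene, ()))
    free = [r for r in reactions if r not in knocked]
    # Steps 2-3: search for one scenario in which every essential
    # metabolite is both produced (or external) and consumed.
    for states in product((False, True), repeat=len(free)):
        active = [r for r, on in zip(free, states) if on]
        available = set(external)
        consumed = set()
        for r in active:
            substrates, products = reactions[r]
            available.update(products)
            consumed.update(substrates)
        # Simplified feasibility rule: an active reaction needs all
        # of its substrates to be available.
        if not consumed <= available:
            continue
        if essential <= available and essential <= consumed:
            return False  # one valid scenario: gene is non-lethal
    return True  # no valid scenario: gene is lethal


print(is_lethal("g3"), is_lethal("g1"))
```

In this toy network, knocking out `g3` removes the only producer of C, so no scenario is feasible, while knocking out `g1` leaves the alternative reaction R2. The enumeration is exponential in the number of reactions, which mirrors the limitation noted above for large networks.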
1.3 Visualization Tools for PathCase Systems

Released in August 2010, the PathCase-SB system [17][18][19] brings together (i) systems biology sources, e.g., BioModels [20][21][22], and (ii) pathway sources, e.g., KEGG [23][24][25][26], with the goal of providing additional capabilities and tools made possible by the integration. Currently, PathCase-SB provides capabilities and interfaces for visualization, browsing, querying, simulation and comparison, model composition, and user model upload.

The visualization tool in the PathCase-SB system has the following new features: (1) integration of interactive pathway graph visualization; (2) display of a model according to its compartment hierarchy; and (3) a mapping between the model network and the pathway, shown by displaying both side by side. The visualization tool also has multiple simplification capabilities, such as truncating long entity names and hiding common species. Another feature is layout manipulation, i.e., revising and saving visualization layouts.

The visualization interface is accessed from different places within PathCase-SB. It is employed by the Browser Interface, Built-In Queries, the iModel Tool, and the Model Composition Tool.

Similar to the visualization interface of PathCase-SB, the PathCase visualization tools present metabolic data, relationships in the data, and analysis results of the data via a Java applet. These tools are components of many PathCase systems. Based on the differing requirements of the PathCase systems, some features are adjusted or redesigned. For example, in PathCase-MAW’s visualization tool, common metabolites are reproduced for each reaction they participate in, which removes many edges between common metabolites and reactions and thus yields a cleaner visualized graph. In the PathCase-SMDA visualization tool, reversible reactions are connected via double edges to show the direction of flow.
The visualization tools are included in the following PathCase systems:
• PathCase-SB: PathCase Systems Biology Workbench, featuring BioModels models and KEGG pathways; it has 409 systems biology models and 139 KEGG pathways.
• PathCase-MAW Editor: a stand-alone Java application for maintaining a mammalian metabolic database, MAW.
• PathCase-MAW: PathCase Metabolomics Analysis Workbench, featuring a manually created generic mammalian metabolic network; it has 27 pathways.
• PathCase-RCMN: PathCase ReconstruCted Metabolic Networks; it has four models, namely, the Mus musculus iMM1554 model (2008), the Mus musculus iMM1415 model (2010), the H. sapiens Recon 1 model and the Trypanosoma cruzi iSR215 model (2009).
• PathCase-Recon: PathCase RECON Workbench, featuring Genome-Scale Reconstructed Metabolic Networks and KEGG pathways; it has 53 networks and 139 KEGG pathways.
• PathCase-SMDA: an online tool to analyze metabolomics data in terms of the dynamic behavior of the metabolic network under steady state.
• Metabolism Query Language Interface: an interface for querying the PathCase-MAW database with the Metabolism Query Language.

And three iPad applications:
• iPathCaseMAW: the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool.
• iPathCaseRCMN: the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks.
• iPathCaseKEGG: the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

Finally, the visualization framework of all PathCase visualizations has the following steps:
• Designing an XML schema for the visualization data file,
• Defining parameters for web services to communicate with the PathCase applet on the client side,
• Retrieving information from the PathCase database, based on the parameters,
• Composing the obtained information into an XML data file, and
• Parsing the data file, and providing visualization via the applet (except for PathCase iPad applications).
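The compose-then-parse steps of this framework can be sketched as follows. This is only an illustration in Python: the actual PathCase client is a Java applet, and the element and attribute names used here (`graph`, `nodes`, `edges`, `metabolite`, `reaction`) as well as both function names are hypothetical, not the PathCase XML schema.

```python
import xml.etree.ElementTree as ET


def compose_graph_xml(metabolites, reactions):
    """Compose retrieved records into an XML data file (as a string)."""
    root = ET.Element("graph")
    nodes = ET.SubElement(root, "nodes")
    for m in metabolites:
        ET.SubElement(nodes, "metabolite", id=m["id"], name=m["name"])
    for r in reactions:
        ET.SubElement(nodes, "reaction", id=r["id"], name=r["name"])
    edges = ET.SubElement(root, "edges")
    for r in reactions:
        # Substrate -> reaction and reaction -> product edges.
        for s in r["substrates"]:
            ET.SubElement(edges, "edge", source=s, target=r["id"])
        for p in r["products"]:
            ET.SubElement(edges, "edge", source=r["id"], target=p)
    return ET.tostring(root, encoding="unicode")


def parse_graph_xml(xml_text):
    """Parse the data file back into node ids and directed edges."""
    root = ET.fromstring(xml_text)
    node_ids = [n.get("id") for n in root.find("nodes")]
    edges = [(e.get("source"), e.get("target")) for e in root.find("edges")]
    return node_ids, edges


# Records standing in for the result of a database query.
metabolites = [{"id": "glc", "name": "glucose"},
               {"id": "g6p", "name": "glucose 6-phosphate"}]
reactions = [{"id": "hk", "name": "hexokinase",
              "substrates": ["glc"], "products": ["g6p"]}]
node_ids, edges = parse_graph_xml(compose_graph_xml(metabolites, reactions))
```

Keeping the data file as the only contract between the server and the viewer is what lets the same framework serve several PathCase systems with different visualization features.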
1.4 Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks

The basic bio-entity identification problem is to match metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG or another GSRMN). This can be difficult due to naming inconsistencies involving species (metabolites), reactions and compartments. Inconsistent prefixes, suffixes, spacing, numbering, and formulas are the main causes of the basic entity identification problem. Additionally, for reaction identification, differing numbers of alternative substrates in different GSRMNs complicate the identification problem further.

In this research, we focus on the basic bio-entity identification problem in a GSRMN model (referred to as the “target model” from here on) with respect to a “source model” (where the “source model” may easily be replaced by a “data source”, generalizing the identification problem), and propose three types of matches for metabolite identification, and a multi-step identification process for reaction identification.

For metabolite identification in GSRMNs (by matching metabolites to corresponding metabolites of a source model or data source), we employ a variety of computer science techniques that include token-based approximate string matching, similarity score functions and filtering techniques, i.e., formula matching and biologically significant term matching. All techniques are enhanced by the underlying metabolic biochemistry-based knowledge.

Based on metabolite identification, reaction identification is performed via three-step matching techniques, namely, reaction name matching, reaction property matching and reaction compound matching. For reaction compound matching, compounds are paired by exact name match, name length match and core metabolite match.
Also, the reversibility property of the reactions, as well as ignorable metabolites, i.e., e⁻, H⁺, and H2O, are all taken into account in the matching process. A new reaction similarity score function is given to measure matching results.

Since the number of compartments in GSRMNs is small and their names are shorter than reaction or metabolite names, we employ a curated data set for compartment name matching. In the data set, compartments are grouped, and various compartments are mapped into the corresponding groups. We also collect compartment names from other data sources, i.e., KEGG, BioModels, and Reactome, to enhance this data set. Given a compartment name, we locate the compartment’s group name in the data set. All compartments in the located group are considered identical to (matched with) the given compartment name.

The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs [28], namely, “iAM303” and “E. coli textbook”, “H. pylori iIT341” and “EryNet”, “Model2008_09_23_13_13_29” and “Model2008_08_15_12_13_14”, and “03_16_09_TM_minimal_medium_glc” and “M. barkeri iAF692”. Our results indicate that significant accuracy gains are possible using the proposed metabolite/reaction identification techniques.

1.5 Thesis Organization

Chapter 2 presents Steady-State Metabolic Network Dynamics Analysis. In Chapter 3, gene lethality testing with SMDA is performed as one application of the SMDA tool. Chapter 4 presents the visualization interface framework and the distinct features of the PathCase systems. In Chapter 5, similarity-score-based techniques for locating basic bio-entities in GSRMNs are explained. Chapter 6 concludes and discusses several directions for future work.
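As a concrete illustration of the token-based approximate name matching mentioned in Section 1.4, the sketch below scores candidate names by the Jaccard similarity of their token sets. The Jaccard score, the 0.5 threshold, and the function names are assumptions made for illustration; they are not the similarity score function defined in Chapter 5.

```python
import re


def tokens(name):
    """Split a name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1] (assumed stand-in score)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_match(target, candidates, threshold=0.5):
    """Return (best candidate, score), or (None, score) below the threshold."""
    scored = [(jaccard(target, c), c) for c in candidates]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= threshold else (None, score)


source_names = ["d-glucose-6-phosphate", "D-Fructose 6-phosphate", "ATP"]
match = best_match("D-Glucose 6-phosphate", source_names)
```

Tokenizing on non-alphanumeric characters makes the comparison insensitive to the spacing, hyphenation, and capitalization differences listed above as common causes of naming inconsistency; formula and biologically significant term filters would then prune candidates that score well on names alone.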
Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis

2.1 Introduction

Currently, metabolomics data analysis necessitates time-consuming, extensive, and manual cross-referencing of metabolic pathways in order to critically evaluate the measurement data. Recently, a novel in silico approach (IOMA) that integrates metabolomics data with a metabolic network model and infers metabolic fluxes was proposed [29]. IOMA (a) requires many pieces of information (e.g., availability of the stoichiometry matrix of the network, dissociation constants, enzyme turnover rates, mass balance constraints, flux capacity constraints, etc.), and (b) infers a single network state with all the computed metabolic fluxes.

On the other hand, manual analysis of fluxes in small (and usually abstracted) sub-networks is quite common in life science publications. For examples, see Figure 5 in Bederman et al. [30] and Figure 1 in Gasier et al. [31]. Researchers seek alternative activation/inactivation scenarios in small-scale networks, without needing or having access to additional information such as that required by IOMA. Note that, even for small networks, the number of possible flow (flux) scenarios grows exponentially as the size of the network grows, which makes manual enumeration error-prone. This manual process can be automated using computer science and bioinformatics techniques that employ biochemistry rules and constraints, pre-stored in a metabolic network database. Once the results are obtained, users can also visualize and query them (e.g., “list those alternative flows where one targeted reaction is active, and another targeted reaction is inactive”).

In this chapter, we propose a database-enabled and graph-traversal-based technique, called SMDA (Steady-state Metabolic network Dynamics Analysis), that infers all allowable (flux) states of a network.
Given a set of bio-fluid (e.g., blood) and tissue-based metabolite concentration measurements at steady-state, SMDA answers the query “list alternative steady-state metabolic network activation/inactivation (i.e., flux) scenarios, given the observed measurements”. That is, SMDA takes as input from the user (i) metabolomics data, and (ii) a metabolic sub-network, selected from a metabolic network database already made available to users, and produces a set of possible alternative flow scenarios (i.e., activation/inactivation scenarios) for the metabolic sub-network. SMDA then lets users further visualize and query the alternatives (not discussed in this thesis).

SMDA can be viewed as both a constraint- and rule-based approach. It is constraint-based [5][6][7] in that it uses conditions (pre-stored in a database) to locate all “allowable states” [8] of a sub-network in a metabolic network model (also pre-stored in a database). And SMDA is rule-based in that its graph-expansion and merge strategies employ a number of biochemistry rules to capture the underlying metabolic biochemistry as much as possible.

Advantages of SMDA include:
• Ease of use and simplicity. It is designed as a “first-step” and “online” tool for biochemists and wet lab researchers to
o Evaluate their hypotheses about observed measurements in small-scale networks, and
o Be used as a “knowledge discovery” tool, e.g., for “what if” types of questions.
• No flux optimization. SMDA does not require knowledge of reaction kinetics or any utility/optimization function for flux optimization.

The disadvantages of SMDA include:
• SMDA returns only two flux values for a reaction: 0 (Inactive) and 1 (Active).
• As is the case with other techniques that return “all allowable states”, SMDA is inherently exponential in its output size.
However, the computational performance of SMDA is acceptable for networks with up to 60 reactions (with some paths/pathways abstracted into “abstract reactions”; see Section 2.4 and the original paper[32][9]). SMDA is implemented and functional as a prototype, both as an online tool called PathCase-SMDA[33], which is part of the PathCase family of applications[34], and as an iPad application named “PathCase MAW”[35].

1. SMDA Overview

Prior Preparation. We assume a fully hierarchical and compartmentalized metabolic network, i.e., one with tissues, organelles, etc., already available in a metabolic network database. The steady-state “activation conditions” (i.e., the ACT condition set) under which each reaction and transport process is active are characterized a priori, saved in a database, and used during query-time analysis. Initially, the status values of all reactions and all metabolite pools in the metabolic network are Unknown.

Query-time Analysis. At query time, the user chooses a smaller metabolic sub-network (the “query network”) to query. SMDA takes the observed metabolite set and the query network as input, and executes the following steps.

Initialization. (i) For each bio-fluid-based metabolite observation, identify whether its transport processes are active or not (by checking, for each transport process, whether all conditions in its ACT set are satisfied). (ii) For each tissue-based metabolite observation, derive its metabolite pool label, which is one of Unavailable, Available, Accumulated, or Severely Accumulated.

Expansion and Merge: Metabolic Sub-Network Traversal and Active-Inactive Reaction Assessment.
Starting with active/inactive transport processes and tissue-based observed metabolites, and continuing with metabolic reactions in tissues of the query network, iteratively locate those reactions with satisfied or unsatisfied ACT condition sets, and mark (i) those reactions whose ACT conditions are completely satisfied as Active, and (ii) those reactions whose ACT condition sets contain at least one unsatisfied condition as Inactive. (This process results in multiple expansions.) When two disconnected “active/inactive sub-networks” “touch” each other, merge them to obtain a larger active-inactive sub-network.

The query-time analysis summarized above creates and iteratively expands multiple possible metabolic flux sub-graphs, called Active-Inactive Graphs (GAI), where, in each GAI graph, the status of each reaction and the label of each metabolite pool is clearly marked (i.e., no reactions or metabolite pools with “Unknown” status/label exist). The result is a set of GAI graph sets, where each GAI graph set specifies one distinct alternative steady-state activation/inactivation scenario for the metabolic network. An alternative output to GAI graphs is flow-graphs, where a flow-graph is a GAI graph without metabolite pool labels; flow-graphs are utilized in Section 2.4. We give an example.

Figure 2.1 SMDA result as a single GAI graph

Example 2.0. Assume that the user selects Catabolism of Cysteine in liver as the metabolic sub-network to be queried (as shown in Figure 2.1), and has three observed metabolite measurements in cytosol: O2 as 80 mM/L (we assume that O2 is “estimated”, as it is very difficult to measure O2 in the tissue of an intact organ), cysteine as 60 µM/L, and SO3 (3-sulfino-L-alanine) as 80 µM/L. Assume that the database conditions state that, in liver cytosol, “O2 is marked as Available if it is in the range [1, 100] mM/L”, “cysteine is marked as Available if it is in the range [1, 100] µM/L”, and “SO3 is marked as Available if it is in the range [1, 100] µM/L”.
Thus, the SMDA initialization step concludes that O2, cysteine, and SO3 are all Available. The execution of the expansion step, as summarized above, concludes that there is only one flow-graph with only one GAI graph in the output of the query, as shown by the (actual) SMDA output of Figure 2.1.

In summary, given metabolomics observations and a query network, SMDA locates all possible alternative active-inactive network scenarios on the selected sub-network. This approach provides compact and complete steady-state views of possible metabolism dynamics as independent and alternative snapshots, in the form of user-friendly visual steady-state views of the metabolic network.

There are four issues. The first issue is prioritizing and ranking the different alternatives produced by SMDA. This issue is not discussed in this chapter; please see the supplement of the original paper[9] for a number of ranking mechanisms.

The second issue is related to the space complexity of SMDA: what happens when, for a large sub-network, there are many alternative GAI graphs? As a first response to this issue, SMDA switches to the use of flow-graphs, as opposed to GAI graphs, where a single flow-graph captures multiple GAI graphs. Second, SMDA allows for an exploratory search of the resulting GAI graphs. That is, an “interactive query” execution takes place where, as a response to the query, the user is given the total number of “possible results” (i.e., GAI graphs), and is then prompted to choose and view different GAI graphs or flow-graphs in the output with respect to participating metabolites and reactions. For example, the user is told, say, that Pyruvate dehydrogenase is active in two flow-graphs and inactive in four flow-graphs, and is given the option of viewing only the first two, or the latter four, or all six flow-graphs. We refer to this process as “exploratory search and browsing” of the SMDA query output search space.
The third issue is related to the time complexity of SMDA. Given a large metabolic network, the SMDA output may grow so quickly that SMDA cannot complete its execution within a reasonable amount of time. When this occurs, our suggestion to the user is to reduce the network size, either by eliminating sub-networks or by “abstracting” a sub-network (e.g., a pathway of a metabolism) into an “abstract reaction”. From our interactions with wet-lab biochemists, both approaches are quite common, and are used extensively in practice to (manually) analyze the behavior of metabolic networks[30][31].

Finally, the fourth issue is about the way SMDA works: as described above, SMDA discretizes metabolite observations into four categories, namely, Unavailable, Available, Accumulated, or Severely Accumulated. This discretization can be done by users employing their domain expertise, as is done in Section 2.4. Alternatively, it can be done automatically on the basis of ranges for each category, which are in turn obtained from the HMDB data source[36]. However, in some cases, HMDB lists multiple levels of “normal” ranges for a metabolite, leading to “observation misclassifications” in SMDA. This issue and the SMDA actions taken are discussed in a separate study[37].

All figures in this chapter are obtained from the web-based SMDA application[33]. The observation set of Example 2.0 is available on the web site of the browser-based application PathCase-MAW as “Sample Observation 0”; running the SMDA Tool with Sample Observation 0 produces the results of Example 2.0. Figure 2.1 and Example 2.0 are from a manually constructed mammalian network database, available at the PathCase-MAW site[38]. All other examples and visualizations in the figures of this chapter are obtained from the PathCase-RCMN (ReConstructed genome-scale Metabolic Network) application[39], which is, in turn, built by importing the SBML of the reconstructed metabolic network of the parasite Trypanosoma cruzi[15].
The SMDA tool, an evolution of the OMA Tool[40], is currently being beta-tested in cystic fibrosis metabolomics data analysis.

This chapter is organized as follows. Section 2.2 specifies a complete condition- and rule-based model of the metabolic network behavior. We
• list the assumptions of our model and define the notion of (quasi-)steady-state for the metabolic network,
• introduce the notion of metabolite pool label identifiers,
• employ a three-valued logic to specify metabolite pool label conditions and Activation Condition Sets for reactions as well as transport processes,
• list transport process rules, and, finally,
• specify a number of basic biochemistry-based rules.

Section 2.3 presents the SMDA algorithm with its three steps, namely, GAI (flow-) graph initialization, expansion, and merge. The SMDA algorithm iteratively constructs a GAI Generation Hierarchy where, when it terminates, each leaf node of the hierarchy contains one possible activation/inactivation scenario within the query sub-network. Section 2.4 presents a computational performance evaluation of the SMDA tool using the PathCase-MAW mammalian metabolic network database. SMDA can be viewed as a new approach within the category of metabolic network flux analysis techniques such as flux balance analysis[11], elementary flux modes[13], and extreme pathways[41]. Section 2.5 compares SMDA with these other techniques. Section 2.6 briefly concludes this chapter.

2.2 Condition-Based Modeling

2.2.1 Assumptions and Terminology

We make the following assumptions about our environment. The complete metabolic network is pre-captured and available in a metabolic network database. The metabolic network database models tissue-level compartmentalization; that is, it is a multi-tissue and multi-compartment (e.g., cytosol, mitochondrion, etc.) environment.
The metabolic network is “sound” in the sense that all metabolites that are not in bio-fluids are both produced by (i.e., are a product of) at least one reaction and consumed by (i.e., are a substrate of) at least one reaction. Initially, we label each unmeasured metabolite pool size with the identifier “Unknown”. During query-time analysis, the labels may change into one of “Unavailable”, “Available”, “Accumulated”, or “Severely Accumulated”. The reason for non-quantitative labeling (as opposed to numerical size values) is that this work does not employ quantitative pool size estimation techniques, as discussed in more detail in Section 2.2.2. No a priori knowledge of the size of each metabolite pool is assumed, except for measured metabolites. Given a reaction r, and a metabolite m as a substrate, co-factor-in, or activator (product, co-factor-out, or inhibitor) of r, the knowledge of the lowest (highest) metabolite pool size label of m at steady-state for m to activate (inhibit) a reaction so that r is “active” (“inactive”) is assumed to be available. This is discussed in more detail in Section 2.2.4.

The organism (represented by its metabolic network database) is queried when it is at a steady-state for a time interval T. Steady-state is defined in terms of two properties:

a. Production-Consumption Rate Equality (PCRE): During the time interval T, the rate of formation of every metabolite m is (almost) equal to its rate of degradation, i.e., all metabolite pool sizes (concentrations) remain (almost) constant during the time interval T. Put another way, the production rate of each metabolite is equal to its consumption rate.

b. Metabolite Pool Label Invariability (MPLI): During the time interval T, all metabolite pool labels stay the same. That is, if the label of a metabolite pool is Available, it stays Available during the time interval T.
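The two steady-state properties can be illustrated with a minimal Python sketch. It is not part of SMDA: the function names, the tolerance `eps` (standing in for "(almost) equal"), and the idea of sampling a pool label several times during T are all assumptions made for illustration.

```python
from typing import Sequence


def pcre_holds(production: Sequence[float],
               consumption: Sequence[float],
               eps: float) -> bool:
    """PCRE: total production of a metabolite is (almost) equal to its
    total consumption; eps is an assumed tolerance."""
    return abs(sum(production) - sum(consumption)) <= eps


def mpli_holds(labels_during_T: Sequence[str]) -> bool:
    """MPLI: the pool label does not change during the interval T.
    labels_during_T is a hypothetical series of sampled labels."""
    return len(set(labels_during_T)) <= 1


def at_steady_state(production: Sequence[float],
                    consumption: Sequence[float],
                    labels_during_T: Sequence[str],
                    eps: float) -> bool:
    """A metabolite satisfies both properties over T."""
    return (pcre_holds(production, consumption, eps)
            and mpli_holds(labels_during_T))


# Two producers and two consumers whose totals differ by about 0.1.
print(at_steady_state([2.0, 1.0], [1.5, 1.4],
                      ["Available", "Available"], eps=0.2))
```

In this sketch an organism would be at steady-state when `at_steady_state` holds for every metabolite; the choice of `eps` per metabolite is left open.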
The PCRE property at steady-state is a natural property, referring to the state of constancy or the homeostasis (equilibrium) of the organism. As an example, in the “fed” state of, say, humans, glucose is catabolized through Glycolysis to Acetyl CoA, which is converted to fatty acids or oxidized in the TCA Cycle. Although Acetyl CoA is available to both metabolic pathways (i.e., Fatty Acid Synthesis and the TCA Cycle), it does not accumulate, as the combined consumption rate of Acetyl CoA by Fatty Acid Synthesis and the TCA Cycle is (almost) the same as its production rate by Glycolysis. We use the MPLI property in order to capture a snapshot of the metabolism when metabolite pool size labels also stay constant during steady-state.

Next we define some terminology.

Definition (Metabolic Network). A metabolic network is a connected graph G(V, E) with a vertex set V of reactions and metabolite pools (a metabolite pool can be a substrate, regulator, or product in a reaction), and a directed edge set E such that there is an edge from node u to node v if (i) v is a reaction, and u is a substrate or regulator of v, or (ii) u is a reaction, and v is a product of u.

Definition (ProductionRate and ConsumptionRate of metabolite pool m): Consider any metabolite pool m, its producer reactions p_1, p_2, …, p_i, and its consumer reactions c_1, c_2, …, c_j. Let pr_{m,k} denote the production contribution rate of reaction p_k, 1 ≤ k ≤ i, for metabolite m, and cr_{m,v} denote the consumption contribution rate of reaction c_v, 1 ≤ v ≤ j, for metabolite m, during time period T.
Then P_m = {(p_1, pr_{m,1}), (p_2, pr_{m,2}), …, (p_i, pr_{m,i})} is the active producer set of m, where each pair (p_k, pr_{m,k}) refers to a producer p_k of m and its contribution rate pr_{m,k}; and (pr_{m,1} + pr_{m,2} + … + pr_{m,i}) is the ProductionRate(m) of m; and C_m = {(c_1, cr_{m,1}), (c_2, cr_{m,2}), …, (c_j, cr_{m,j})} is the active consumer set of m, where each pair (c_v, cr_{m,v}) refers to an activated consumer c_v of m and its consumption rate cr_{m,v}; and (cr_{m,1} + cr_{m,2} + … + cr_{m,j}) is the ConsumptionRate(m) of m.

Below we formally characterize the notion of (quasi-)steady-state for the metabolism.

Definition ((quasi-)steady-state for an organism during a time period): Given an organism Org, its metabolites m_l, 1 ≤ l ≤ n, and two constants ε_{m_l} and T, the organism Org is said to be in a steady-state during the time period T if (a) ProductionRate(m_l) = ConsumptionRate(m_l) ± ε_{m_l} for each m_l, 1 ≤ l ≤ n, during the time period T, and (b) the label of each metabolite m_l, 1 ≤ l ≤ n, stays the same during the time period T.

2.2.2 Metabolite Pool Label Identifiers

The purpose of metabolite pool label identifiers is to simplify the ACT (activation condition) set specifications for reactions and transport processes.

Definition (Metabolite pool label during a time period): Let T_AVAIL(m), T_ACC(m), and T_SAC(m), with T_AVAIL(m) < T_ACC(m) < T_SAC(m), be three threshold constants for a metabolite m, stored in the database. Given the metabolite pool m, the label of m during the time period T is marked with one of the following five identifiers.

Unknown (id: -1): The metabolite pool size for m, denoted by Size(m), is unknown during time period T.

Unavailable (id: 0): Size(m) is less than the threshold T_AVAIL(m) and ProductionRate(m) ≤ ε_m during time period T, where ε_m is a small constant.

Available (id: 1): Size(m) is greater than or equal to the threshold T_AVAIL(m) and less than the threshold T_ACC(m) during time period T.
Accumulated (id: 2): Size(m) is equal to or above the threshold T_ACC(m), but less than the threshold T_SAC(m), during time period T.

Severely Accumulated (id: 3): Size(m) is equal to or above the threshold T_SAC(m) during time period T. This label is used for the product inhibition rule BC4 of Section 2.2.5.

Note that there is a need to distinguish the metabolite pool labels Available and Accumulated because, for some reactions, “availability” of a metabolite m as a substrate (or regulator) may be sufficient for the reaction (i) to be active through substrate availability (provided that there are no other inhibiting mechanisms) or (ii) to experience the regulating effect (i.e., inhibition/activation) of m, in those cases where m is a regulator. However, for activation/regulation, other reactions may require the “accumulation” of m, at least at moderate levels. We give an example.

Example 2.1. Acetyl CoA is an allosteric activator of the first (also the committed) step in Gluconeogenesis, which is catalyzed by pyruvate carboxylase, and pyruvate carboxylase activation needs Acetyl CoA accumulation. In the fed state of the organism, Acetyl CoA is produced by Glycolysis (hence, is Available), but does not accumulate (hence, is not Accumulated). Thus, pyruvate carboxylase is not activated, which leads to the inactivation of the Gluconeogenesis pathway. But, in the fasting state of the organism, Acetyl CoA is produced by Beta Oxidation, and consumed by the TCA Cycle and Ketone Body Synthesis. In this case, accumulation of Acetyl CoA occurs (slowly, but steadily), since its production rate by Beta Oxidation is higher than its combined consumption rate by the TCA Cycle and Ketone Body Synthesis.

2.2.3 Metabolite Label Condition Characterization

The metabolite label condition C about the label identifier q of a metabolite pool m is denoted as C<q, m>.

Example 2.2. Ketone Body Synthesis requires the accumulation of Acetyl CoA to use it as a substrate.
Then, the required condition can be stated as C<Accumulated, Acetyl CoA>.

We employ three-valued logic (True, False, Unknown) in evaluating conditions about metabolite pool labels of reactions.

Definition (Satisfaction of a metabolite label condition): A metabolite label condition C<q, m> is
(i) True if m is marked with the identifier qActual where either (a) 0 < q.id ≤ qActual.id or (b) q.id = qActual.id = 0 holds,
(ii) False if m is marked with the identifier qActual where either (qActual.id ≠ -1 and qActual.id < q.id) or (q.id = 0 and qActual.id > 0),
(iii) Unknown if m is marked with the identifier qActual where qActual.id = -1.

Example 2.3. The condition C<Accumulated, Acetyl CoA> of Example 2.2 is True when the corresponding pool of Acetyl CoA has the label Accumulated (id: 2) or Severely Accumulated (id: 3).

Definition (Negation of a Condition): Negation of a condition C C

Example 2.4. The negation of the condition from Example 2.2, C<Accumulated, Acetyl CoA>, is True only when Acetyl CoA is marked as Available (id: 1) or Unavailable (id: 0) (i.e., no active producer).

Definition (Conflicting Conditions): Two conditions C1

Example 2.5. C1 CoA>.

Definition (Condition Subsumption): Condition C1 C2

Example 2.6. C1

2.2.4 Trigger Values and Activation Condition Sets for Reactions, Transport Processes, or Pathways

The label of a reaction r, a transport process Tc1-to-c2 from compartment c1 to compartment c2 (not to be confused with the time interval T), or an “abstract pathway” can be one of Active, Inactive, or Unknown, as discussed next.

2.2.4.1 Reaction

We start with the notion of a “metabolite trigger value” for a reaction, which can be either Available or Accumulated.

Definition (Trigger value for metabolite m for reaction r to be active): Let m be a metabolite involved in a reaction r.
For r to be active, metabolite m is said to have a trigger value t_{m,r}, where t_{m,r} ∈ {Available, Accumulated}, if (i) m is a substrate, cofactor-in, or an activator of r, and the metabolite pool identifier for m is t_{m,r}, or (ii) m is an inhibitor of r, and the metabolite pool identifier for m is below (the integer id value of) t_{m,r}.

Each reaction r (or pathway) is associated with a set of participating metabolite pools and their predetermined trigger values, already available in a database. Each reaction (or pathway) is also associated with a set of “activation conditions” (i.e., the ACT set), which are created based on the participating metabolites and their trigger values, as discussed next.

Definition (Activation Condition Set of a Reaction/Pathway): The activation condition set of a reaction (or a pathway) r, denoted as ACT(r), defines the conditions for r to be active, and is constructed as follows.
o For each m in reaction r where m is a substrate/cofactor-in/activator of r with trigger value t_{m,r}, C Available and Accumulated labels, respectively)
o For each m in r where m is an inhibitor of r with trigger value t_{m,r}, C ACT(r) where t_{m,r} {1}
o For each m in r where m is a product/cofactor-out of r, C<3, m> ACT(r) (Product Inhibition rule BC4; 3 is the id of the Severely Accumulated label).
o If the ratio Tr = Size(m1)/Size(m2) of energy metabolite pairs is specified as an activator for r, then C1(Accumulated,m1)ACT(r), and C2(Accumulated,m2) ACT(r). If Tr is an inhibitor for r, then C1(Accumulated, m1)ACT(r), and C2(Accumulated,m2)ACT(r).

As mentioned before, the activation condition set ACT of each reaction is defined a priori (offline), before any metabolomics analysis is carried out.

2.2.4.2 Transport Processes

We view each transport process Tc1-to-c2 as having one metabolite transported from compartment c1 to compartment c2, subject to the activation condition set ACT for Tc1-to-c2. We give an example.

Example 2.7.
The transport process Tblood-to-muscle(glucose) of glucose from blood to muscle may be characterized within the ACT set as {C C Tmuscle-to-blood(glutamine) of glutamine from muscle to blood can be conditioned based on its availability in muscle, i.e., ACT(Tmuscle-to-blood(glutamine)) contains {C

We have the following transport process rules.

Rule TR1. Let c1 and c2 be two compartments, m be an observed metabolite in compartment c1, and Tc1-to-c2(m, c1, c2) be m’s transport process from c1 to c2. Assume that the pool label of m in c2 is Unknown. Then, if ACT(Tc1-to-c2) is satisfied, Tc1-to-c2(m) is active; otherwise, it is inactive.

Rule TR2. For active transport processes (i.e., the ACT set is satisfied), we assume that the metabolite pool of the product has the same label as the substrate.

Rule TR3. For transport processes, the product inhibition rule (see Rule BC4 of Section 2.2.5) does not apply.

2.2.4.3 Steady-State Labels for Reactions and Transport Processes

We define the steady-state label of a reaction/transport process as one of Active, Inactive, or Unknown, based on the satisfaction of its associated activation condition set ACT.

Definition (Active, Inactive, or Unknown reaction/transport process state): Given a reaction/transport process r with an associated activation condition set ACT(r) defined on the participating metabolites, r is said to be Active (i.e., having a nonzero flux) during the steady-state time period if (i) all conditions in ACT(r) are satisfied, i.e., all conditions that involve substrates, cofactors, and products of r are satisfied, and (ii) among the conditions involving regulators of r, those conditions that include the regulator(s) with the highest precedence are satisfied. Reaction/transport process r is Inactive if there is at least one unsatisfied condition in ACT(r). Otherwise, the state of r is Unknown.
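The three-valued evaluation of label conditions and the resulting Active/Inactive/Unknown state can be sketched in Python as follows. This is a simplified illustration rather than the SMDA implementation: a condition is represented as a hypothetical `(required_label, metabolite)` pair, Python's `None` stands for the logical value Unknown, and negated conditions and regulator precedence are deliberately left out.

```python
from enum import IntEnum
from typing import Dict, Iterable, Optional, Tuple


class Label(IntEnum):
    # Integer ids follow Section 2.2.2.
    UNKNOWN = -1
    UNAVAILABLE = 0
    AVAILABLE = 1
    ACCUMULATED = 2
    SEVERELY_ACCUMULATED = 3


def satisfied(required: Label, actual: Label) -> Optional[bool]:
    """Evaluate a label condition; None stands for Unknown."""
    if actual == Label.UNKNOWN:
        return None
    if required == Label.UNAVAILABLE:
        # Case (b): both ids must be 0.
        return actual == Label.UNAVAILABLE
    # Case (a): 0 < q.id <= qActual.id.
    return actual >= required


def reaction_state(act: Iterable[Tuple[Label, str]],
                   pool_labels: Dict[str, Label]) -> str:
    """Active if every condition is True, Inactive if any is False,
    Unknown otherwise. Unlabeled pools default to UNKNOWN."""
    results = [satisfied(req, pool_labels.get(m, Label.UNKNOWN))
               for req, m in act]
    if any(r is False for r in results):
        return "Inactive"
    if all(r is True for r in results):
        return "Active"
    return "Unknown"


# Hypothetical ACT set: Acetyl CoA must be Accumulated, OAA Available.
act = [(Label.ACCUMULATED, "acetyl_coa"), (Label.AVAILABLE, "oaa")]
print(reaction_state(act, {"acetyl_coa": Label.ACCUMULATED,
                           "oaa": Label.AVAILABLE}))
```

Checking for a False condition before checking for Unknown ones mirrors the definition: a single unsatisfied condition is enough to mark the reaction Inactive, even if other pool labels are still Unknown.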
Note that, for some reactions, there may be multiple activators and inhibitors, in which case we assume that (a) we have a priori information about the precedence of regulators, and (b) we make use of such precedence information in deciding whether the reaction is active or inactive.

2.2.5 Biochemistry-Based Rules

Next, we list a number of basic biochemistry (BC)-based rules that we use in the rest of this chapter.

Rule BC1. For each reaction, when multiple regulators with conflicting regulatory effects (activation or inhibition) on an enzyme are in place, the regulator with the strongest effect (highest precedence) on the enzyme is considered, and the other regulators are ignored.

The regulated reactions in a pathway may be classified as rate-limiting and committed steps. Once the committed step takes place, other reactions in the pathway follow this reaction until the end-product is produced, provided that none of the other regulated processes are blocked or inhibited. A committed step of a pathway is usually one of the early irreversible reactions in the pathway. As an example, in glycolysis, the committed step is the same as the rate-limiting step, PFK1.

Rule BC2. If the committed step of a pathway p is blocked (i.e., inactive), then p is Inactive (i.e., all reactions in p are Inactive).

We associate each compartment with particular pools of metabolites as its input and output. We then connect two compartments in the metabolic network if a transport process connects the two.

Rule BC3. Each input and/or output metabolite of a compartment is associated with a transport process (pre-captured and modeled in the database). A transport reaction and an enzymatic metabolic reaction are connected if they share at least one metabolite pool (i.e., as their substrate and/or product).

Due to similarities in the way they bind to enzymes, substrates are in competition with products to bind to their enzymes.
As the concentration of products increases, this competition slows down the rate at which enzymes bind the substrates. Hence, the reaction rate decreases. Eventually, when the product accumulation reaches high levels, the corresponding reaction is inhibited dramatically.

Rule BC4. Whenever a non-bio-fluid metabolite m is marked as “Severely Accumulated”, all reactions that produce m (and therefore, due to the steady-state assumption, all reactions that consume m) are Inactive.

The next set of rules follows from the steady-state assumption.

Rule BC5. If all producers (consumers) of a metabolite pool m are Inactive then, due to the PCRE property, regardless of the pool label of m, all consumers (producers) of m are Inactive.

Rule BC6. If at least one producer (consumer) of a metabolite m is Active, then (i) m is either Available or Accumulated, and (ii) at least one consumer (producer) of m is Active.

Rule BC7. If the metabolite m is Unavailable, then all consumers of m (and thus, due to the steady-state assumption, all producers of m) are Inactive.

Rule BC8. Substrate and product labels of a transport process with no conditions are always the same.

Next, using rules BC1-8, we specify the notion of “inconsistent” metabolite pool and reaction label assignments.

Definition (Inconsistency): For each Rule BCi, 1 ≤ i ≤ 8, violation of Rule BCi in terms of metabolite pool and/or reaction label assignments constitutes an inconsistency in metabolite pool and reaction labels.

For example, as a product of an Active reaction r, the label of metabolite pool m should not be Severely Accumulated, since this violates Rule BC4.

2.3 Active/Inactive Graph Generation And Expansion

Starting from a given set of observations, we employ iterative backward and forward reasoning with the goal of identifying possible metabolic mechanisms which may have led to the observed changes. We first give some definitions.
Definition (Reaction Participants): Given a reaction r, RP(r) is the set of substrates, products, and regulators of r (i.e., “Reaction Participants” of r).

We refer to a metabolite pool concentration measurement as an observation. Next we define the notion of an Active/Inactive graph, which has labeled reactions and labeled reaction participants.

Definition (Active/Inactive Graph GAI): An active/inactive graph GAI(RAI, AI, SRP, RP, M, O) is a connected subgraph of the metabolic network M with respect to a set O of observations where (i) RAI consists of a set of reactions or pathways in the subgraph GAI(RAI, AI, SRP, RP, M, O), (ii) each reaction/pathway in RAI is assigned a label of Active or Inactive through the function AI: RAI → {Active, Inactive}, (iii) SRP is the set of reaction participants (i.e., substrates, products, activators, etc.) of reactions in RAI, and (iv) each reaction participant in SRP is assigned a label of Unavailable, Available, Accumulated, or Severely Accumulated through the function RP: SRP → {Unavailable, Available, Accumulated, Severely Accumulated}.

During the GAI graph generation process, inconsistencies in GAI are avoided, where inconsistency is as defined in Section 2.2.5.

2.3.1 Initial GAI Generation

A generated GAI graph should be valid, as defined below.

Definition (Valid Active/Inactive Graph): A GAI graph is valid when
a. all metabolite pool/reaction labels in GAI are consistent, and
b. for all active reactions r in GAI, ACT(r) is satisfied, and for all inactive reactions r in GAI, ACT(r) contains at least one unsatisfied condition.

2.3.1.1 Converting Observations into Metabolite Pool Labels

As discussed in the introduction, there are two alternative ways of converting metabolite observations into discretized metabolite pool labels of Available, Unavailable, Accumulated, or Severely Accumulated. In the first way, users decide on these labels themselves using their domain expertise.
In the second way, given a quantitative concentration statement on a metabolite pool m, SMDA compares the value with threshold constants (obtained from HMDB) for the metabolite m, and then marks m with the corresponding label identifier. SMDA marks m with only one identifier, which is the highest satisfied identifier. However, thresholds obtained from HMDB may be problematic: (1) HMDB may have more than one “normal” level for a metabolite, or (2) there may be no information at all. Please see Cicek et al.[37] for more details.

When observations on metabolite concentrations are converted into one of Unavailable, Available, Accumulated, or Severely Accumulated, we use the underscore notation to distinguish between metabolites in different compartments, and refer to metabolite m in compartment c as “m_c”.

Finally, for each observed bio-fluid metabolite, we investigate iteratively which of the possible GAI graphs (initially, each contains only one measured bio-fluid metabolite) is valid.

We illustrate the initial GAI graph construction with an example from the metabolic network of T. cruzi (all network visualizations, except Figure 2.1, are from PathCase-RCMN for T. cruzi).

Example 2.8. Let pi be an observed metabolite in compartment cytosol, denoted as pi_c, and let the label of pi_c be Available. Let phosophatetransporter,peroxisome and phosphatetransportl be two transport processes transporting pi_c from compartment cytosol to compartment glycosome, and from cytosol to compartment mitochondria, respectively (see Figure 2.2). By evaluating their ACT sets, we determine whether the two transport processes phosophatetransporter,peroxisome and phosphatetransportl are active (by Rule BC8, at least one must be active). This means that one of the three alternative GAI graphs involving pi_c is consistent.

2.3.2 GAI Graph Expansion

Each valid GAI graph is iteratively expanded at each step with a set of reactions and/or transport processes.
We start with some definitions.

Definition (Distance between two metabolite pools): The number of reactions on the shortest path that connects two metabolite pools, regardless of reaction directions, is the distance between the two metabolite pools.

Figure 2.2 Illustration of three alternative versions of transport processes

Definition (Border Metabolite Pool): Given a metabolite pool m and a nonempty active/inactive graph GAI(RAI, AI, SRP, RP, M, O), m is called a border metabolite pool of GAI if, in the metabolic network M, there is a pair of reactions (r1, r2) such that m participates in both r1 and r2, and r1 ∈ RAI, r2 ∉ RAI.

Note that when a GAI graph contains only a single metabolite pool m, m becomes a border metabolite pool of GAI. Also, the label of a border metabolite in a GAI graph is one of Unavailable, Available, Accumulated, or Severely Accumulated, and never Unknown. We denote the border metabolite pool set of GAI as BMP(GAI).

The process of extending a given GAI graph to a new GAI graph via the addition of new reactions connected to its border metabolite pools is called GAI graph expansion. The newly added reactions of the GAI graph are assigned the label values of either Active or Inactive (which are consistent, i.e., not in conflict with the existing reaction label assignments in the graph). If there is no such consistent expansion, then the expansion is terminated. Next we characterize the GAI graph expansion process.
Definition (GAI graph expansion): Let GAI(RAI, AI, SRP, RP, M, O) denote the original GAI graph to be expanded; G^exp_AI(R^exp_AI, AI^exp, S^exp_RP, RP^exp, M, O) denote one of the alternative GAI graph expansions of GAI(RAI, AI, SRP, RP, M, O); BMP(GAI) denote the set of all border metabolite pools of GAI; NRS(BMP(GAI)) denote the set of (“new”) reactions involved with border metabolites and not (yet) in GAI, i.e., those reactions r where r has, as a substrate/product/regulator, a metabolite pool in BMP(GAI) and r is not in RAI; and NMP(NRS(BMP(GAI))) denote the set of (new) metabolite pools p where p participates, as a substrate, product, or regulator, in a reaction of NRS(BMP(GAI)) and p is not in SRP. Then the expansion G^exp_AI(R^exp_AI, AI^exp, S^exp_RP, RP^exp, M, O) is characterized as follows.

(i) R^exp_AI = RAI ∪ NRS(BMP(GAI))
(ii) S^exp_RP = SRP ∪ NMP(NRS(BMP(GAI)))
(iii) Each r in R^exp_AI is assigned the label of Active or Inactive through the function AI^exp: R^exp_AI → {Active, Inactive}.
(iv) Each metabolite in S^exp_RP is assigned one of the labels Unavailable, Available, Accumulated, or Severely Accumulated through the function RP^exp: S^exp_RP → {Unavailable, Available, Accumulated, Severely Accumulated}.
(v) AI^exp is consistent with AI.
(vi) RP^exp is consistent with RP.

End of Definition.

Border metabolite pools of G^exp_AI can be characterized as those metabolite pools which (i) are not in GAI, and (ii) participate in reactions that are not in G^exp_AI. Clearly, border metabolite pools of G^exp_AI are always within one reaction distance from any border metabolite pool of GAI.

Note that the GAI graph expansion process is not unique. Each expansion step of a GAI graph generates a new “alternative” GAI graph by assigning different labels to new reactions. At each step, the newly formed set of GAI graphs that are alternatives of each other is called a GAI-group.
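The set constructions in parts (i) and (ii) of the expansion definition can be sketched in Python. The representation below is an assumption made for illustration only: the network is a dictionary from a reaction name to the set of metabolite pools participating in it, and the toy reaction and pool names (`r1`, `a`, and so on) are hypothetical. Label assignment and consistency checking, parts (iii) to (vi), are not covered.

```python
from typing import Dict, Set, Tuple

# Assumed representation: reaction name -> participating metabolite pools.
Network = Dict[str, Set[str]]


def border_pools(network: Network,
                 graph_reactions: Set[str],
                 graph_pools: Set[str]) -> Set[str]:
    """BMP(GAI): pools already in the graph that also participate in at
    least one reaction outside the graph. A graph holding a single
    observed pool and no reactions is covered by the same test."""
    outside = [r for r in network if r not in graph_reactions]
    return {m for m in graph_pools
            if any(m in network[r] for r in outside)}


def expand_sets(network: Network,
                graph_reactions: Set[str],
                graph_pools: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (R_exp, S_exp) following parts (i) and (ii)."""
    bmp = border_pools(network, graph_reactions, graph_pools)
    # NRS: reactions touching a border pool and not yet in the graph.
    nrs = {r for r, pools in network.items()
           if r not in graph_reactions and pools & bmp}
    # NMP: pools of the new reactions that are not yet in the graph.
    nmp = set().union(*(network[r] for r in nrs)) - graph_pools
    return graph_reactions | nrs, graph_pools | nmp


toy: Network = {
    "r1": {"a", "b"},
    "r2": {"b", "c"},
    "r3": {"b", "d"},
    "r4": {"d", "e"},
}
reactions, pools = expand_sets(toy, {"r1"}, {"a", "b"})
print(sorted(reactions), sorted(pools))
```

Here pool `b` is the only border pool of the starting graph, so one expansion step adds `r2` and `r3` but not `r4`, which becomes reachable only in a later step through `d`.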
Each GAI graph in the same GAI-group is non-redundant, meaning each GAI graph has at least one reaction or metabolite pool assignment differing from the corresponding assignment in any other GAI graph in the same group.

The GAI graph generation/expansion process is represented as a hierarchy, called the GAI generation hierarchy, where each node represents a GAI graph of the metabolism, and each edge from parent to child in the GAI generation hierarchy represents the expansion of the parent GAI graph in the next step by additional reactions, leading to a new child GAI graph. Branching in the GAI generation hierarchy occurs whenever alternative (i.e., OR-connected), but conflicting, graph extension steps are taken, which leads to alternative GAI graphs. Each such set of alternative graphs forms a GAI-group.

The GAI generation hierarchy is a directed acyclic graph with only one node that does not have any incoming edges, called the root, and with every other node containing a GAI-group. The hierarchy is constructed as follows.

Initialization:
Level 0: The root is a dummy node at level 0, i.e., it does not contain any information. The root has |O| immediate children, one for each observation in O.
Level 1: Each immediate child of the root is a GAI-group with only one GAI graph containing (i) a single node corresponding to a measured metabolite pool in the set O of observations, and (ii) no reactions.

Expansion-and-merge:
Level i: Nodes in each level i, i>1, of the hierarchy are constructed from the GAI-groups in level (i-1) in two steps, as follows.
Expansion step: Let Gr be a GAI-group node in level (i-1). Each GAI graph in Gr is expanded by following a GAI graph expansion as specified by the GAI graph expansion definition. The set of all such expanded graphs of Gr forms a new GAI-group node at level i in the hierarchy.
Merge step: Let GAI-groups GrX and GrY be two newly expanded GAI-groups of the expansion step.
Let Gx and Gy be the GAI graphs of GAI-groups GrX and GrY, respectively. If Gx and Gy have a nonzero number of common border metabolites with identical border metabolite labels, then GrX and GrY are merged into a single GAI-group, say GrZ, in the hierarchy by (i) merging Gx with Gy and placing the result in GrZ, and (ii) replacing GrX and GrY by GrZ in the hierarchy at level i.

Example 2.9. Consider a part of a metabolic network M, shown in Figure 2.3. Reaction malatedehydrogenasel is already decided as Active, and oaa_m is a border metabolite with a label value other than Unknown. Assume that oaa_m is already assigned the pool label identifier Available. The border metabolite oaa_m is involved in two reactions, namely, aspartatetransaminase and citratesynthase, whose label assignments are not yet made.

Figure 2.3 A partial metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.

Based on the network of Figure 2.3 and given the pool label identifier Available on oaa_m, we would like to generate the possible valid GAI graphs with active/inactive reactions in the metabolic network. At the same time, each valid GAI graph must preserve the observed Available pool label for oaa_m.

Next, starting from the border metabolite oaa_m, we generate different possible GAI graphs. New GAI graphs are generated by expanding the initial metabolic subgraph with reactions from the larger metabolic network M. Figure 2.4 shows the original GAI graph and the next level of the GAI generation hierarchy. Each of GAI1, GAI2, GAI3 and GAI4 is distinct and non-redundant. Thus, they form alternatives of each other, called a "GAI-group". End of Example 2.9

Figure 2.4 The first level of the GAI graph generation hierarchy for the metabolic network in Fig. 2.3.

We next discuss the creation of GAI graphs in more detail.
For a given metabolite pool m, R(m) denotes the set of producer and consumer reactions of m in the metabolic network, and r.label represents the current label assignment (i.e., Active, Inactive, or Unknown) for reaction r.

Definition (Label Assignment for reactions in the Reaction Set R(m) of metabolite m): Given a metabolite pool m, SA(R(m), SA,m) is a label assignment for reactions in R(m), where each reaction in R(m) is assigned a label of either Active or Inactive through a function SA,m: R(m) → {Active, Inactive}.

Note that the number of possible label assignments for a set of consumer/producer reactions of a given metabolite pool m is exponential in the number of consumers and producers of m.

Remark 3.1: Given a metabolite pool m, let i be the number of consumer reactions of m, and j be the number of producer reactions of m in the metabolic network. Then, the maximum number of possible distinct label assignments for m’s producers and consumers is 2^(i+j).

Note that one does not need to evaluate each such combination of reaction label assignments as a valid GAI graph expansion. Metabolite pool label assignment for metabolite m in GAI is subject to three requirements:
1. Conditions that involve m are either True or False, but not Unknown, and
2. For each reaction r in GAI, either all conditions in ACT(r) are True, or ACT(r) contains at least one False condition, and
3. All rules (of Sections 2.2.4.2 and 2.2.5) are satisfied.

To check the satisfaction of all three requirements above, our approach (described in Section 2.3.4 next) is as follows. For initialization, start with observed bio-fluid metabolites and non-bio-fluid metabolites as “seed” metabolites; use satisfied conditions of observed bio-fluid metabolites to locate their transport processes (i) with ACT sets having only satisfied conditions (in which case they are Active transport processes), or (ii) with at least one unsatisfied condition (in which case they are Inactive transport processes).
When both (i) and (ii) fail for a transport process, the label of the transport process is Unknown. Next, after the initialization (of GAI graphs), repeat the GAI graph expansion process (as defined above) via the “border metabolites” of GAI graphs, until there are no more border metabolites involved in active reactions.

Next, for a given border metabolite m, we define the notion of a “valid label assignment for R(m)”, the reaction set (i.e., all producers and consumers) of m.

Definition (Valid Label Assignment for Reactions in the Reaction Set of a Border Metabolite m): Given the graph GAI(RAI, AI, SRP, RP, M, O), a border metabolite pool m in GAI, and the reaction set R(m) of m, let SA(R(m), SA,m) be a label assignment for all reactions in R(m). Then, SA(R(m), SA,m) is said to be a valid label assignment for R(m) with respect to GAI if the following conditions hold.
a. No Label Conflict Among Reactions: For each reaction r ∈ R(m), SA,m(r) = AI(r) or r ∉ RAI.
b. Backward Compatibility: The label assignment SA(R(m), SA,m) results in a set Q of pool label assignments for the border metabolite m, each resulting in a new expanded GAI graph. Then, for GAI and each metabolite pool assignment q in Q, the following two conditions hold:
o With m having the label q, all the conditions in the ACT sets of “active” reactions in RAI ∩ R(m) are satisfied by the assignment q.
o With m having the label q, for each “inactive” reaction r in RAI ∩ R(m) that involves the border metabolite m in its ACT set, there is at least one unsatisfied condition.

2.3.3 Merging GAI Graphs

During the GAI graph expansion, two GAI graphs in two different GAI-groups may intersect, in which case the two graphs are reconciled into a single GAI graph (leading to a GAI graph generation “hierarchy”, rather than a GAI generation “tree”).
If the reconciliation is not possible, then the two GAI graphs are not consistent, and the metabolic network model characterized by merging the two GAI graphs is inconsistent. In such a case, this specific merge of the two GAI graphs is stopped, the inconsistency is noted, and the expansion of the GAI generation hierarchy continues for other possibilities.

To expedite the process of expanding the GAI graphs, we start by assigning labels to observed metabolites and forming single-node GAI graphs. Initially, each observed metabolite in a bio-fluid results in a single GAI-group with a single GAI graph. We attempt to merge GAI graphs in different GAI-groups when they intersect, i.e., when two GAI graphs that are in two distinct GAI-groups have the same border metabolite(s). We illustrate the process with an example.

Example 2.10. Consider the metabolic network M of Figure 2.5. Assume GAI-Group-1 has GAI3 with border metabolites {oaa_m, sdhlam_m}, and GAI-Group-2 has GAI4 with border metabolites {succ_m, sdhlam_m}, as shown in Figure 2.6. Since both groups share the border metabolite sdhlam_m, we merge the two groups of GAI graphs into one group: Let the new GAI graph created by merging GAI3 and GAI4 be GAI6. For each reaction with an Active or Inactive label in GAI3 and GAI4, we assign the same label in GAI6. For border metabolites in GAI6, we assign each border metabolite a label that is possible for it in both GAI3 and GAI4, e.g., the label of sdhlam_m becomes Available.

Figure 2.5 A metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.

Figure 2.6 The GAI graphs before merging two GAI-groups.

2.3.4 Algorithm Sketch

Input to the SMDA algorithm is a set of quantitative metabolite concentration values, and a metabolic sub-network to which the user wants to restrict the analysis.
The very first initialization step starts from an observed metabolite, possibly a bio-fluid metabolite, and results in one GAI graph per observed metabolite, where each such single-node graph is placed in its own GAI-group. In each expansion step, a GAI graph is expanded with the producer/consumer reaction set of a “border metabolite” while the validity of reaction label assignments is enforced, as described in Section 2.3.2. Each possible expansion with a different label assignment on the same metabolite pool, or expansions on different metabolite pools, leads to a distinct GAI graph. Expansion can result in alternative GAI graphs, all placed into yet another GAI-group. This process builds the GAI generation hierarchy, where nodes are GAI-groups and distinct extensions lead to branching in the hierarchy. At the end, each leaf-level node in the hierarchy represents a complete GAI graph set, i.e., one possible activation/inactivation scenario. At any point during the expansion process, if a border metabolite with no valid label assignment is encountered, then the expansion of the GAI graph is stopped, and the graph is eliminated as an invalid GAI graph. The expansion process is performed in a breadth-first manner. In Figure 2.7, we present a sketch of the SMDA algorithm. Note that GAI graphs in different GAI-groups are AND-alternatives. GAI graphs in the same GAI-group are XOR-alternatives.

2.4 Experimental Evaluation

In this section, we present an experimental evaluation of the SMDA algorithm, and compare different expansion strategies on our experimental data.

2.4.1 Experimental Setting

The experiments are performed on a Dell PowerEdge R710 Server with two Intel® Xeon® quad processors and 48 GB main memory, running Windows Server 2008. The web application server is Microsoft IIS 7. The database server is Microsoft SQL Server 2010. The SMDA web site is implemented with Microsoft ASP.NET, and the client visualization is implemented with Java.
The experiment data set includes pathways that are built for the PathCase Metabolomics Analysis Workbench, with 22 pathways, 202 metabolites, 375 metabolite pools, and 240 reactions. The thresholds are set up according to the Human Metabolome Database.

Figure 2.7 Sketch of the SMDA algorithm

2.4.2 Experimental Results

2.4.2.1 Relationship between the number of observations and the number of GAI and flow-graphs.

In this experiment, we evaluate the performance of SMDA for different numbers of user observations. We experiment with three sub-networks of different sizes. For each sub-network, we change the number of metabolite pool observations and record the number of graphs in the result, as listed in Table 2.1.

Table 2.1 The number of observations vs. the number of output graphs for small sub-networks.

Sub-Network                    | # Reactions | # M. Pools | # Observations | # GAI-graphs | # flow-graphs
Pentose pathway                | 8           | 16         | 1              | 8938         | 846
Pentose pathway                | 8           | 16         | 2              | 860          | 423
Pentose pathway                | 8           | 16         | 3              | 588          | 376
Glycolysis pathway             | 14          | 25         | 1              | 152          | 12
Glycolysis pathway             | 14          | 25         | 2              | 8            | 8
Glycolysis pathway             | 14          | 25         | 3              | 4            | 4
Glycolysis+TCA Cycle pathways  | 24          | 48         | 2              | 332288       | 160
Glycolysis+TCA Cycle pathways  | 24          | 48         | 4              | 166144       | 80
Glycolysis+TCA Cycle pathways  | 24          | 48         | 6              | 128          | 32

Observation 1. For small sub-networks, a linear increase in the number of observations results in an exponential decrease in the number of GAI and flow-graphs in the output.

From Table 2.1, regardless of the size of the sub-network, the number of GAI- and R-graphs decreases as we provide more observations as input. Note that, in some cases, increasing the number of observations will not reduce the number of graphs, since there is only one possible label for the input pools in the results. In that case, the input pool observation is duplicate information and does not reduce the result size.

In another experiment, we observe how the algorithm scales on a larger sub-network. We choose a connected sub-network with 6 pathways, 48 reactions and 132 metabolite pools. The number of GAI- and flow-graphs versus different numbers of observations is shown in Table 2.2.

Table 2.2 The number of observations vs. the number of graphs for a large network.

# Reactions | # M. Pools | # Observations | # GAI-graphs | # flow-graphs
48          | 132        | 17             | 3072         | 40
48          | 132        | 23             | 1536         | 20
48          | 132        | 31             | 384          | 12
48          | 132        | 33             | 192          | 12
48          | 132        | 35             | 192          | 12
48          | 132        | 37             | 192          | 12

From Table 2.2, we can see that, even in a large sub-network, we can get reasonably small numbers of GAI- and flow-graphs with an increased number of pool observations.

Observation 2. For larger sub-networks, a linear increase in the number of observations results in an exponential decrease in the number of GAI-graphs and a linear decrease in the number of flow-graphs in the output.

2.4.2.2 Algorithm time efficiency

The execution time is composed of two parts: expansion time and merge time. For each sub-network, we execute each of the three expansion strategies. The results show that, in general, increasing the number of pool observations decreases the execution time exponentially. This is because, with more observed values, the algorithm expands many small sub-networks instead of one large network, which decreases the expansion time exponentially. However, in some experiments, increasing the number of pool observations has actually increased the execution time. In those cases, we have found that merge time costs are significantly higher than expansion time costs.

Observation 3. A linear increase in the number of metabolite pool observations results in an exponential decrease in the execution time of the algorithm, as in Figure 2.8.

Figure 2.8 SMDA time cost for a single network versus the number of observations for Glycolysis and TCA Cycle combined.

2.5 Related Work: Metabolic Network Analysis Techniques

The SMDA technique can be viewed as belonging to the general category of metabolic analysis techniques. In this section we summarize the existing metabolic network analysis techniques, and briefly compare them with the SMDA approach.
Over the last 30+ years, a number of powerful mathematical modeling approaches and their corresponding computational tools have been proposed and used to study the dynamics of cellular metabolism. These techniques have many goals such as determining the metabolic fluxes of reactions in the metabolic network, or finding all the “optimal” routes, etc. They include metabolic control analysis (MCA) [10][42][43][44], flux balance analysis (FBA) [45][11][46] (also known as constrained optimization), metabolic flux analysis [12], and metabolic pathway analysis (more specifically, elementary flux modes and extreme pathways) [39][36][40][41]. Next we briefly summarize these techniques, and compare them with the SMDA approach.

Comparison of MCA, EMA, and SMDA approaches. Next we briefly list the differences between the MCA (or FBA), EMA, and SMDA approaches:

Different goals. The four approaches are useful in different contexts, focus on providing different sets of information to users, and have different goals.
(a) MCA focuses on “control as a property of the whole system”: One can (i) measure (at quasi-steady state) the effect of single enzyme perturbations on the system, and (ii) calculate the control distribution, relating the system behavior to individual reactions.
(b) EMA can be used for tasks like the recognition of operational modes, finding all optimal paths, and analysis of network flexibility (structural robustness, redundancy) [47]. Under steady-state conditions, the metabolic fluxes of an organism can be expressed as non-negative, linear, weighted combinations of elementary flux modes [48]; however, identifying the weighting factors to determine the fractional contributions of each elementary mode is difficult, if not impossible [48][49]. Visualizations of elementary flux modes within a given KEGG pathway are also available (via YANAsquare).
(c) SMDA, working with a possibly large metabolic network within a multi-tissue (organ) environment (i.e., not within a cell) and assuming steady-state behavior, returns to users all metabolic action scenarios as well as their visualizations within the metabolic network, allowing users to quickly concentrate on locating possibly activated paths for a given set of observed metabolite concentration changes. SMDA does not derive the (steady-state) flux values of the MCA (FBA) method, and, thus, it yields none of the control-related (i.e., rate limitation) conclusions of the MCA method.

Different underlying fundamentals. SMDA is condition-based, and employs graph traversal and expansion algorithms across the metabolic network. In comparison, MCA and FBA involve solving a set of underconstrained differential equations corresponding to a possibly smaller metabolic network at hand. EMA determines elementary fluxes via a linear combination of “null space basis vectors” of the stoichiometry matrix [50].

Ease of use. MCA (or FBA), even with the easiest-to-use GUI-oriented software tools (such as COPASI), requires (i) additional information to be collected and provided by the users, including the stoichiometry information, and (ii) setup and usage expertise on the part of biologists. The EMA tools YANA and YANAsquare do provide user-friendly elementary flux derivations and their visualizations. In comparison, SMDA uses a metabolic pathways database, which already contains the metabolic network, biochemistry-based rules and other information, so that all a user is expected to provide is a set of observed metabolite changes.

Modeling-related restrictions/assumptions. As listed above, MCA has a number of assumptions (such as requiring a connected network of pathways) [51] which are not needed for SMDA. EMA also requires connectivity.

Computational Complexity.
Computational complexity of MCA is exponential in the number of reactions involved, forcing users to use various compaction, aggregation, and clustering/merging techniques. Computational complexity of EMA is also exponential [47], and various approaches to tackle the high complexity have been proposed, such as parallel computing [52], network decomposition and “functional conversion of flux cones”. SMDA is also exponential in the number of reactions in its worst case.

2.6 Conclusions

In this chapter, we have proposed Steady-State Metabolic Network Dynamics Analysis, a computational metabolomics analysis approach that captures a metabolic network and biochemical principles in a metabolic network database [9]. Given a set of metabolic observations and a selected metabolic sub-network, SMDA executes an expansion phase and a merge phase to locate all possible steady-state activation/inactivation scenarios of the reactions in the network, based on biochemistry rules. We have given the SMDA algorithm and presented an experimental evaluation of the SMDA tool against a mammalian metabolic network database.

Performing Gene Lethality Testing with SMDA

3.1 Introduction

Steady-State Metabolic Network Dynamics Analysis (SMDA) is a recently proposed computational metabolomics analysis approach that captures a metabolic network and biochemical principles in a metabolic network database [9]. Given a set of metabolic observations and a selected metabolic sub-network, SMDA locates all possible steady-state activation/inactivation scenarios of the reactions in the network, based on biochemistry rules.

Our goal in this chapter is to describe how SMDA can be used in the context of gene lethality testing, where a gene is said to be lethal if its knockout (i.e., elimination from the genome of the organism) causes the death of the organism. The direct effect of a knocked-out gene is the elimination of the enzymes that it encodes.
This corresponds to the removal from the metabolic network of all reactions whose catalyzing enzyme is encoded by the knocked-out gene, with the exception of those reactions that have other associated isozymes (i.e., other enzymes that catalyze the same reaction). We define gene lethality in terms of essential metabolite availability. An essential metabolite is a metabolite without which the organism cannot stay alive. Thus, a gene is lethal if its knockout causes the unavailability of at least one essential metabolite in the organism at the steady state. In other words, a gene is lethal if its removal from the organism’s genome results in the non-production of at least one essential metabolite, and, thus, the death of the organism.

Topological analysis of regulatory networks [53], Barabasi’s computational estimate method [54][55], and Flux Balance Analysis (FBA) [45][11][46][56] are techniques used for testing gene lethality. FBA is mostly used to test whether a knock-out is lethal for the organism by using the reconstructed metabolic network of the organism (e.g., Duarte et al., 2007 [56] for humans, or Sigurdsson et al., 2010 [1] for Mus musculus). FBA calculates reaction flux values by first constraining the knocked-out reaction to zero flux, and then checking whether there is flux through the biomass reaction, which is a reaction that is added for simulation purposes (e.g., to simulate the growth of the organism). The main problems with the FBA approach include: (1) the optimal conditions and the assumptions proposed by the technique are questionable (e.g., “the quality of the biomass reaction and the assumption of biomass optimization which is debatable even for unicellular organisms” [17][2]), (2) the prior knowledge about the network (e.g., complete stoichiometry) might not always be available for the organism at hand, and (3) the FBA result may not be meaningful biochemically (illustrated with an example in Section 3.3).
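The direct effect of a knockout described at the start of this section (remove every reaction whose catalyzing enzymes are all encoded by the knocked-out gene, and keep reactions that still have an isozyme) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the two dictionaries, and the simplification that each enzyme is encoded by exactly one gene are hypothetical and are not taken from the SMDA database schema.

```python
def removed_by_knockout(reaction_enzymes, enzyme_gene, gene):
    """Return the reactions left with no catalyzing enzyme once `gene` is out.

    reaction_enzymes: reaction name -> set of enzymes able to catalyze it
    enzyme_gene:      enzyme name   -> the single gene assumed to encode it
    """
    lost = {enzyme for enzyme, g in enzyme_gene.items() if g == gene}
    return {
        rxn
        for rxn, enzymes in reaction_enzymes.items()
        # A reaction disappears only if every one of its enzymes is lost,
        # i.e., no isozyme encoded by another gene remains.
        if enzymes and enzymes <= lost
    }


# Hypothetical data: rB keeps working through isozyme e2 when g1 is removed.
reaction_enzymes = {"rA": {"e1"}, "rB": {"e1", "e2"}, "rC": {"e3"}}
enzyme_gene = {"e1": "g1", "e2": "g2", "e3": "g3"}

removed = removed_by_knockout(reaction_enzymes, enzyme_gene, "g1")
remaining = sorted(set(reaction_enzymes) - removed)
```

In this toy case only rA is removed. Lethality would then be decided on the reduced network by checking whether some essential metabolite becomes Unavailable at the steady state, which this sketch does not attempt.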
This chapter proposes the use of SMDA as an alternative gene lethality testing technique. SMDA requires as input (i) the metabolic network of the organism, and (ii) a set of metabolite pool observations. Then, as output, it enumerates all possible flux scenarios (in the form of “active/inactive reactions”). SMDA does not perform any optimization or stoichiometric calculations; hence, it does not require any of the assumptions stated above (see Section 3.3 for details). However, similar to Elementary Mode Analysis [11][13][41][57], which enumerates all possible elementary flux modes that can occur at the steady state, SMDA also suffers from exponential computation time. The number of possible scenarios produced by SMDA can be exponential with respect to the size of the network, and is inversely proportional to the number of metabolite observations provided. However, the complexity of SMDA can be reduced with a domain expert’s knowledge, as shown in the experiments in this chapter.

To validate the SMDA gene lethality algorithm, we have selected the reconstructed network of the core metabolism of Trypanosoma cruzi [15]. Trypanosoma cruzi, a kinetoplastid parasite of humans that causes Chagas disease [58], has a small core reconstructed metabolic network [16] with 215 genes, 162 reactions, and 4 compartments. Roberts et al. [59] used the core metabolism network of Trypanosoma cruzi to perform experimental gene lethality tests. To evaluate the gene lethality testing algorithm with SMDA, we have used the same seven lethal genes verified by Roberts et al., and SMDA gene lethality testing has correctly verified the lethality of all seven genes. We have also selected one non-lethal gene in Trypanosoma cruzi, namely, adenosine kinase, and SMDA has also correctly verified its non-lethality.
These results show that SMDA can be used for gene lethality testing of organisms.

In Section 3.2, we briefly introduce the SMDA algorithm. Section 3.3 summarizes the existing gene lethality testing techniques in more detail, discusses their shortcomings, and compares them to SMDA. In Section 3.4, we provide a sketch of the revised SMDA algorithm for gene lethality testing. Section 3.5 experimentally evaluates SMDA gene lethality testing in the context of Trypanosoma cruzi. Our conclusion is that the SMDA gene lethality testing algorithm successfully locates lethal and non-lethal genes of organisms when either the core metabolism network of the organism is not large or the number of observed metabolites is large.

3.2 Summary of SMDA Algorithm

In this section we explain the terminology of SMDA and the algorithm flow.

3.2.1 SMDA Terminology

The metabolic network is a connected graph G(V,E) where the vertex set V consists of metabolite pools and reactions, and the edge set E consists of directed edges from a vertex u to a vertex v if (i) u is a metabolite pool that plays the role of substrate or regulator of the reaction v, or (ii) u is a reaction and v is a product of that reaction. SMDA makes use of many biochemistry principles such as Substrate Availability, Product Inhibition and Committed Steps; see Cakmak et al. for details [1].

There are five discrete states a metabolite pool can be in, namely, Unknown, Unavailable, Available, Accumulated, and Severely Accumulated. If a metabolite pool is not observed (i.e., not measured), it is labeled as Unknown; otherwise the algorithm compares the user-provided observation with predefined metabolite level thresholds (originally either obtained from HMDB [36] and stored in the SMDA database, or decided manually). Let TAVAIL(m), TACC(m), and TSAC(m) (such that TAVAIL(m) < TACC(m) < TSAC(m)) be three metabolite level thresholds for metabolite m. Also let the observed value for metabolite m be Obs(m).
If Obs(m) < TAVAIL(m) then SMDA marks the metabolite pool m as Unavailable; if TAVAIL(m) ≤ Obs(m) < TACC(m) then SMDA marks m as Available; if TACC(m) ≤ Obs(m) < TSAC(m) then SMDA marks m as Accumulated; otherwise (i.e., TSAC(m) ≤ Obs(m)) SMDA marks m as Severely Accumulated.

There are three discrete states for a reaction, namely, Unknown, Active and Inactive. Initially, all reactions are labeled as Unknown. Each reaction has a set of conditions for it to be Active. For instance, for reaction r to be Active, its substrates have to be labeled as Available, and its product p has to be labeled as Available or Accumulated. There are two types of reactions, reversible and irreversible. A reversible reaction can be active in the forward or backward direction, where forward, rather arbitrarily, refers to one direction, and backward means the substrates and products of the reaction are reversed.

A basic assumption of the SMDA algorithm is that the metabolic profile (observations) is obtained when the organism is at a steady state. That is, for a time interval T, (i) the production rate of each metabolite pool in the network is equal to its consumption rate, and (ii) the metabolite pool labels stay constant. This assumption corresponds to the homeostasis of the organism.

The SMDA algorithm makes use of two main concepts: Activation/Inactivation Graph (GAI) and GAI Group. A GAI is a connected sub-graph of the metabolic sub-network, where (i) each metabolite pool is assigned a label other than Unknown (Unavailable, Available, Accumulated, or Severely Accumulated) and (ii) each reaction is assigned a label other than Unknown (Active or Inactive). A GAI group is a set of GAIs, where (a) all GAIs share the same set of reactions and metabolite pools, and (b) any two GAIs of the GAI group differ by at least one metabolite pool label assignment. In other words, a GAI group represents all possible activation/inactivation scenarios within a selected sub-graph of the query network.
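The threshold comparison that turns an observation into a pool label can be written as a short function. This is a minimal sketch: the function and parameter names are hypothetical, `None` is used here to stand for an unobserved pool, and the boundaries for Accumulated and Severely Accumulated are an assumption that follows the threshold ordering TAVAIL(m) < TACC(m) < TSAC(m).

```python
def classify_pool(obs, t_avail, t_acc, t_sac):
    """Map an observed level to one of the five discrete pool labels.

    obs is None when the pool was not measured. The three thresholds are
    assumed to satisfy t_avail < t_acc < t_sac.
    """
    if not (t_avail < t_acc < t_sac):
        raise ValueError("thresholds must satisfy t_avail < t_acc < t_sac")
    if obs is None:
        return "Unknown"
    if obs < t_avail:
        return "Unavailable"
    if obs < t_acc:
        return "Available"
    if obs < t_sac:
        return "Accumulated"
    return "Severely Accumulated"


# Hypothetical thresholds, not taken from HMDB.
labels = [classify_pool(x, 1.0, 2.0, 3.0) for x in (None, 0.5, 1.0, 2.5, 3.0)]
```

Each lower threshold is treated as inclusive for the label above it, matching the "≤" on the left-hand side of the inequalities in the text.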
An alternative output to GAI graphs is flow-graphs, where a flow-graph is a GAI graph without metabolite pool labels, and a single flow-graph captures multiple GAI graphs. In other words, a flow-graph represents a scenario where each reaction is marked as either Active or Inactive, regardless of the metabolite pool labels. Flow-graphs are used to speed up the original SMDA algorithm.

3.2.2 Algorithm Flow

The algorithm runs in a cycle of two phases: Expansion and Merge. The cycle continues until all reactions and metabolite pools in the network are assigned a status.

3.2.2.1 Expansion Phase

The expansion phase starts from the labeled metabolite pools (observations), which are flow-graphs with single metabolite pools. Then, expanded flow-graph(s) are generated by adding neighboring reactions and metabolite pools to the original flow-graph. SMDA generates all possible combinations of label assignments to those neighboring pools and reactions. This process continues until all reactions and metabolite pools are assigned a label.

The metabolite pools that are in the flow-graph and attached to reactions that are not in the flow-graph are called border metabolite pools. At each expansion step, one of the border metabolite pools is chosen for expansion. There can be different options for choosing the border pool to expand. To control the number of expanding flow-graphs, the algorithm heuristically keeps the number of alternatives as small as it can. It tries to generate as few new flow-graphs as possible at each step, in the hope of avoiding the generation of cases that are eliminated in later iterations (e.g., due to the unavailability of a substrate). Therefore, SMDA picks for expansion the border pools with the least number of reactions attached. We also delay picking common metabolites like ATP or H2O, since they are highly connected.
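The border-pool selection heuristic described above can be sketched in a few lines. The names, the example pools, and the tie-breaking by pool name are assumptions made for illustration; the text states only that pools with the fewest attached reactions are preferred and that common metabolites such as ATP or H2O are delayed.

```python
# Pools treated as "common" (highly connected); the text names ATP and H2O
# as examples, and the identifiers used here are hypothetical.
COMMON_POOLS = frozenset({"atp", "h2o"})


def pick_border_pool(border_pools, attached, common=COMMON_POOLS):
    """Choose the next border pool to expand.

    border_pools: iterable of border pool names
    attached:     pool name -> set of reactions attached to that pool
    Common pools are considered last; otherwise the pool with the fewest
    attached reactions wins, with ties broken by name for determinism.
    """
    return min(
        border_pools,
        key=lambda m: (m in common, len(attached[m]), m),
    )


attached = {
    "atp": {"r1"},
    "g6p": {"r2", "r3"},
    "pyr": {"r4", "r5", "r6"},
}
first_choice = pick_border_pool(["atp", "g6p", "pyr"], attached)
```

Here g6p is chosen even though atp has fewer attached reactions, because common metabolites are postponed until no other border pool is left.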
72 The reversible reactions that are attached to the border pool result in the generation of different flow-graphs because of directional combination possibilities. As an example shown in Figure 3.1, consider r1 and r2, two Unknown-labeled reactions (i.e., not yet in the flow-graph) that are attached to a border metabolite pool m. In Figure 3.1(a), assume r1 is the producer of m, and r2 is a reversible reaction. Then there are two cases to consider. Case 1 is Figure 3.1(b): r1 is the producer and r2 is the consumer. Case 2 is Figure 3.1(c), both are producers of m. We avoid having dead-ends (i.e., a reaction with no consumers and/or producers), and, thus, we disregard such cases. For example, case 2 would be eliminated if there are no other consumer reactions in the flow-graph that is already assigned a label. After each generated case, reactions are assigned Active/Inactive status based on the already known metabolite pool labels in the flow-graph. For instance, if a reaction is the only consumer of a pool, which has Available/Accumulated label, the reaction must be Active. Alternatively, all producers and consumers of a pool should be Inactive if the pool is Unavailable. If the status of a reaction cannot be decided at the time, both cases (active/inactive) are generated for that reaction. For each Inactive reaction, a list of possible metabolite pool status “combinations” are generated that would make that reaction Inactive. After reactions are assigned labels, related metabolite pools are assigned labels. For example, the product pool of an Active reaction must be Available; or a pool is Unavailable if the only producer is Inactive. 3.2.2.2 Merge Phase Merge phase comes after each expansion phase. Border pools of each pair of flow-graphs are checked to see if they intersect. If so, possible cases among two flow-graphs that 73 Figure 3.1 A partial network with reversible reaction agree on the shared border pool(s) are joined into a larger flow-graph. 
The metabolite pool status-combination lists of the Inactive reactions attached to the border pools are then updated. For example, assume that two flow-graphs flg1 and flg2 have a shared border metabolite pool m1 and can be merged. The label of m1 is Unknown in flg1, where m1 is related to an Inactive reaction r1, and the label of m1 is Available in flg2. Then, in the merged flow-graph flg1-2, m1 is Available and is removed from the pool status-combination list of r1. If the status-combination list of such an Inactive reaction is empty after the merge, the reaction, which has already been labeled Inactive, would have to be Active. This creates a conflict between the two flow-graphs, meaning that they cannot be merged, and this merge alternative is removed from the expansion.

3.2.3 Conflicts

One of the fundamental assumptions of SMDA is that the provided observations are consistent with each other; otherwise, the algorithm runs into a conflict state. For example, consider a network that consists of a single reaction with a single substrate and a single product. The user observes both the substrate and the product: the substrate is classified as Unavailable and the product as Available. SMDA would (a) mark the reaction as Inactive, since the substrate is Unavailable, and (b) mark the product as Unavailable, since its only producer in the network is Inactive. The latter contradicts the observation that the product is Available, so the algorithm reaches a conflict.

SMDA may encounter two types of conflicts, depending on the stage of the algorithm: Expansion Conflict and Merge Conflict. Given an observation set and a metabolic sub-network, an expansion conflict occurs when a reaction in a GAI graph/flow-graph can be assigned neither the Active nor the Inactive label. A merge conflict occurs when two GAI groups have shared border metabolite pool(s), but the GAI graphs in the two groups cannot be merged because of inconsistent metabolite pool label(s). Further details are given in [20].
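The merge step and the merge conflict can be sketched as follows, using the flg1/flg2 example from the merge phase. This is a deliberately simplified illustration and not the SMDA implementation: the dictionary layout is hypothetical, and each status-combination list is reduced to a plain set of pools whose unavailability could still explain why the reaction is Inactive.

```python
from typing import Dict, Optional, Set

UNKNOWN = "Unknown"
AVAILABLE = "Available"


def merge_flow_graphs(fg1: dict, fg2: dict) -> Optional[dict]:
    """Merge two flow-graphs; return None on a merge conflict.

    Assumed layout (hypothetical):
      fg["labels"]              pool -> label
      fg["inactive_candidates"] Inactive reaction -> set of pools that
                                could still explain its inactivity
    Each Inactive reaction is assumed to be tracked in only one graph.
    """
    labels: Dict[str, str] = dict(fg1["labels"])
    for pool, label in fg2["labels"].items():
        current = labels.get(pool, UNKNOWN)
        # Two different known labels on a shared pool cannot be reconciled.
        if UNKNOWN not in (current, label) and current != label:
            return None
        if current == UNKNOWN:
            labels[pool] = label

    candidates: Dict[str, Set[str]] = {}
    for fg in (fg1, fg2):
        for reaction, pools in fg["inactive_candidates"].items():
            candidates[reaction] = set(pools)  # copy, inputs stay untouched

    for reaction, pools in candidates.items():
        # A pool now known to be Available no longer explains inactivity.
        pools -= {p for p in pools if labels.get(p) == AVAILABLE}
        if not pools:
            # Nothing left to keep the reaction Inactive: conflict.
            return None

    return {"labels": labels, "inactive_candidates": candidates}


if __name__ == "__main__":
    flg1 = {"labels": {"m1": UNKNOWN},
            "inactive_candidates": {"r1": {"m1", "m2"}}}
    flg2 = {"labels": {"m1": AVAILABLE}, "inactive_candidates": {}}
    print(merge_flow_graphs(flg1, flg2))
```

In the usage above, m1 becomes Available in the merged graph and is dropped from the candidate set of r1; if m1 had been the only candidate, the function would report a conflict by returning None.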
3.3 Existing Gene Lethality Techniques and SMDA

Flux Balance Analysis (FBA) is a computational technique that computes the fluxes of reactions under “optimal” conditions, such as the maximization of biomass [17]. The technique cannot directly compute metabolite pool sizes (under either optimal or non-optimal conditions). FBA works by defining an objective function to optimize. For gene lethality, the objective function is the maximization of the flux (flow) in the artificially defined “biomass production” reaction. The reasoning is that organisms strive to maximize their chances of survival by maximizing the production of essential metabolites, a debatable argument [17][2].

FBA can be characterized as having two steps:

1. Define an artificial reaction called the “biomass reaction” whose substrates and products are essential metabolites (it is not clear how researchers choose which essential metabolite is a substrate and which one is a product, except that, obviously, the goal is to maximize the flux of the biomass reaction and, hence, the pool sizes of the selected products).
2. Given a set of stoichiometry equations, characterize the metabolite pool consumption and production of the organism at steady state by performing a linear programming-based optimization with the goal of maximizing the flux in the biomass reaction.

If the optimization returns zero flux in the biomass reaction, then the conclusion is that the knocked-out genes are lethal; otherwise, they are non-lethal. For simplicity, from now on we discuss only the single gene-knockout case.

There are four criticisms of the FBA technique in the literature.

1. The full stoichiometry of the metabolic network is commonly unavailable to researchers, rendering the optimization inapplicable. In this case, the common approach is to “estimate” the stoichiometry matrix by using a matrix of 1’s, 0’s, and -1’s.
2. Organisms routinely survive under sub-optimal conditions. Therefore, the optimality criterion needed by the metabolic control analysis techniques is an artificial, and not always correct, criterion (please see Example 3.1).
3. The biomass maximization criterion is applicable only to simple organisms (e.g., unicellular organisms), and it is hard to define an objective function for more complex organisms [29].
4. For an organism to live, all metabolite pools that are produced must be consumed at steady state (see [9] for a definition of steady state). This in turn means that, in the metabolic network, there should not be “dead-end” (or dangling) metabolites (e.g., a metabolite with no consumer reactions or a metabolite with no producer reactions). However, dead-end metabolites routinely occur in the metabolic networks of organisms, due to lack of knowledge about the organism. Metabolic control analysis techniques must work with “consistent networks with no dead-end metabolites” to perform their optimization; otherwise, the optimization fails. Therefore, it is not uncommon for researchers to make changes to the network at hand, such as adding reactions in order to eliminate dead-end metabolites. For example, multiple “source flux” and “escape flux” reactions are added in the reconstructed network of Trypanosoma cruzi [59].

In comparison, SMDA does not need or use stoichiometry equations, and therefore does not suffer from the first criticism of FBA. Similarly, SMDA does not perform optimization and does not require any objective function. Hence, it does not suffer from the second and third criticisms either. SMDA also needs a full metabolic network to perform its analysis. And, as a negative, SMDA has exponential time complexity due to the enumeration of all possible flow-graphs.
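For reference, the optimization performed in the second FBA step described above is commonly written as the linear program below. The notation is an assumption of this sketch and is not taken from the text: S denotes the stoichiometry matrix, v the vector of reaction fluxes, v_biomass the flux of the biomass reaction, and l_j, u_j the lower and upper bounds on the flux of reaction j.

```latex
\begin{aligned}
\max_{v}\quad & v_{\mathrm{biomass}} \\
\text{subject to}\quad & S\,v = 0 \qquad \text{(steady state)}, \\
& l_j \le v_j \le u_j \qquad \text{for every reaction } j .
\end{aligned}
```

Under this formulation, a gene knockout would be modeled by setting l_j = u_j = 0 for every reaction j catalyzed by the knocked-out gene, and the gene is declared lethal when the optimal value of v_biomass is zero, which matches the decision rule stated above.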
Enumerating the possible states enables SMDA to avoid criticisms 2 and 3, but it comes at the price of exponential time complexity, similar to Elementary Mode Analysis, whose goal is to enumerate elementary fluxes at steady state. We also note that, with more observations and/or domain expert knowledge, the complexity of SMDA can be reduced.

Finally, to compare SMDA with FBA, we have attempted to replicate the “optimal flux distribution” generated by FBA in the Trypanosoma cruzi paper [15]. We have found that SMDA does not generate the “optimal case” found by FBA. The reason is as follows. To maximize the flux in the biomass reaction, FBA freely shuts down the flux in any reaction, even when all the substrates and the enzyme of such a reaction are indeed available. SMDA does not allow such a case, as it would create a conflict with the underlying biochemistry. That is, for SMDA, a reaction r is considered to be active when all substrates of r are available and no product of r is severely accumulated (i.e., product inhibition does not occur).

In more detail, SMDA uses basic biochemistry knowledge to reason about possible scenarios of the metabolic network, based on observations. Some example rules are listed below (for a complete list of rules, please see Cakmak et al. [9]).

• Whenever a non-bio-fluid metabolite m is marked as “Severely Accumulated”, all reactions that produce m (and therefore, due to the steady-state assumption, all reactions that consume m) are “Inactive”.
• If all producers (consumers) of a metabolite pool m are Inactive then, due to the PCRE [9] property, regardless of the pool label of m, all consumers (producers) of m are Inactive.
• If at least one producer (consumer) of a metabolite m is Active, then (i) m is either Available or Accumulated, and (ii) at least one consumer (producer) of m is Active.
• If the metabolite m is Unavailable, then all consumers of m (and thus, due to the steady-state assumption, all producers of m) are Inactive.
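Two of the rules above can be written down as a small sketch: the activity condition for a reaction, and the rule that an Unavailable or Severely Accumulated pool forces its producers and consumers to be Inactive. The function names and the dictionary-based label store are hypothetical, the bio-fluid exception is ignored, and treating an Accumulated substrate as available is an assumption of this sketch.

```python
from typing import Dict, Iterable, Set

AVAILABLE = "Available"
ACCUMULATED = "Accumulated"
SEVERELY_ACCUMULATED = "Severely Accumulated"
UNAVAILABLE = "Unavailable"


def can_be_active(substrates: Iterable[str], products: Iterable[str],
                  labels: Dict[str, str]) -> bool:
    """True when every substrate is known to be available and no product
    is severely accumulated (i.e., no product inhibition).

    Pools without a known label are treated as not satisfying the
    substrate condition, which is a conservative choice of this sketch.
    """
    substrates_ok = all(labels.get(m) in (AVAILABLE, ACCUMULATED)
                        for m in substrates)
    products_ok = all(labels.get(m) != SEVERELY_ACCUMULATED
                      for m in products)
    return substrates_ok and products_ok


def forced_inactive(label: str, producers: Iterable[str],
                    consumers: Iterable[str]) -> Set[str]:
    """Reactions that must be Inactive given the label of one pool
    (first and last rule in the list above)."""
    if label in (UNAVAILABLE, SEVERELY_ACCUMULATED):
        return set(producers) | set(consumers)
    return set()


if __name__ == "__main__":
    labels = {"a": AVAILABLE, "b": ACCUMULATED, "p": AVAILABLE}
    print(can_be_active(["a", "b"], ["p"], labels))
    print(forced_inactive(UNAVAILABLE, ["r1"], ["r2"]))
```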
To judge whether a reaction is active or not, SMDA employs biochemistry rules that are captured as “activation conditions” (the ACT condition set). If the ACT condition set of a reaction is satisfied, SMDA labels the reaction as Active; otherwise, the reaction is labeled Inactive. We give an example of a case in which the optimal solution found by FBA is rejected by SMDA as a valid flux distribution alternative.

Example 3.1. Supplemental file 5 [16] of the Trypanosoma cruzi paper [15], “Flux distribution for epimastigote model”, presents a graphical depiction of the FBA-located optimal flux distribution for the epimastigote phase of the organism. Figure 3.2 shows part of the network with the optimal flux distribution. The reaction NADH2-u6m (NADH dehydrogenase, mitochondrial) in mitochondria is assigned zero flux by FBA. However, in SMDA, NADH2-u6m is assigned the Active label (indicating the existence of flux) as follows.

• From the network, the ACT set of the reaction NADH2-u6m is defined as {q6[m] is Available; h[m] is Available; nadh[m] is Available; q6h2[m] is not Severely Accumulated; nad[m] is not Severely Accumulated}.
• SMDA infers that all three substrates of NADH2-u6m, namely, q6[m], h[m] and nadh[m], are Available since
  o reaction CYOR_u6m (ubiquinol-6 cytochrome c reductase) has non-zero flux, and is producing q6[m];
  o reaction SUCD3_DASH_u6m (succinate dehydrogenase (ubiquinone-6), mitochondrial) has non-zero flux, and is consuming q6[m];
  o reaction P5CDm_i (1-pyrroline-5-carboxylate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is producing h[m];
  o reaction MDHm (malate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is producing nadh[m].
• Both products of the reaction NADH2-u6m, namely, q6h2[m] and nad[m], are not severely accumulated, since (i) reaction CYOR_u6m (ubiquinol-6 cytochrome c reductase) has non-zero flux, and is consuming q6h2[m]; and (ii) reaction MDHm (malate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is consuming nad[m].

Thus, since the ACT set of NADH2-u6m is satisfied, SMDA labels it as Active. In other words, an optimal FBA result may lack biological justification: the flux distribution it reports may or may not occur in reality, and it is not necessarily generated by SMDA.

Figure 3.2 Partial depiction of the optimal flux distribution on the epimastigote model of the T. cruzi network

3.4 Revising SMDA For Gene Lethality Testing

An SMDA-based gene lethality test can be done in three steps. First, the reactions catalyzed by the enzymes produced by the knocked-out gene are removed from the network. Then, all essential metabolite pools are labeled as Available. Finally, SMDA is run to check if there is at least one feasible flow-graph in the metabolic network that produces and consumes each and every essential metabolite. Thus, the stopping conditions for gene lethality/non-lethality are as follows.

Stopping criterion to decide that the gene is lethal: The algorithm terminates with no flow-graphs. This means that the algorithm could not find a feasible flow-graph that has the biomass reaction active, because it encountered a merge/expansion conflict (please see Section 3.2 for an explanation of the conflicts and Cicek et al. [37] for more details). Either conflict reveals a discrepancy between the observation set and the metabolic sub-network. The observation set is unchanged and is based on a live organism. However, the sub-network was affected by knocking out the gene, which causes some reactions to be inactive because the enzyme is not available. So the reason that there is no feasible flow in the organism is that the gene was knocked out (i.e., the gene is lethal).
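The first two steps of the test and the lethal stopping criterion can be sketched as below. Everything here is illustrative: the function names and data layout are hypothetical, the SMDA run itself is not implemented, and the gene-to-reaction mapping in the usage example uses the reaction identifiers reported for hexokinase (HEXg, GLUKg) and phosphofructokinase (PFKg) in Table 3.5.

```python
from typing import Dict, Iterable, List, Set, Tuple


def prepare_lethality_test(reactions: Iterable[str],
                           gene_to_reactions: Dict[str, Set[str]],
                           knocked_out_gene: str,
                           essential_pools: Iterable[str]
                           ) -> Tuple[Set[str], Dict[str, str]]:
    """Steps 1 and 2 of the test (assumed representation).

    Returns the reactions that remain after removing those catalyzed by
    the knocked-out gene, and the initial pool labels in which every
    essential metabolite pool is Available.
    """
    removed = gene_to_reactions.get(knocked_out_gene, set())
    remaining = {r for r in reactions if r not in removed}
    initial_labels = {pool: "Available" for pool in essential_pools}
    return remaining, initial_labels


def is_lethal(feasible_flow_graphs: List[dict]) -> bool:
    """Lethal stopping criterion: the run ends with no flow-graphs."""
    return len(feasible_flow_graphs) == 0


if __name__ == "__main__":
    mapping = {"hexokinase": {"HEXg", "GLUKg"}, "phosphofructokinase": {"PFKg"}}
    remaining, labels = prepare_lethality_test(
        ["HEXg", "GLUKg", "PFKg"], mapping, "hexokinase", ["M_r5p_c"])
    print(sorted(remaining), labels)
    # An empty result from a (not shown) SMDA run would mean "lethal".
    print(is_lethal([]))
```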
Stopping criterion to decide that the gene is non-lethal: The algorithm produces one flow-graph in which all essential metabolites are both produced and consumed, given that either all reactions in the network have been expanded, or a subset of the reactions has been expanded and we are guaranteed not to have an expansion conflict as the algorithm proceeds (i.e., no conflicts).

Gene lethality/non-lethality testing algorithm (sketch):

1) Mark all reactions of the knocked-out enzyme(s) as Inactive.
2) Mark all essential metabolite pools and some energy pools (details are in Section 3.5) as Available. Mark all observed metabolite pools as Available or Unavailable according to supplement 5 [16] of the Trypanosoma cruzi paper [15].
3) Starting from all metabolite pools that have a status other than Unknown, SMDA creates flow-graphs and expands them as described in Section 3.2. If a merge or expansion conflict is encountered, then the knocked-out gene is lethal since, given the observations, SMDA is not able to find a scenario in which all essential metabolites are Available.

When there is a single flow-graph, there is no chance of a merge conflict, yet there is still a chance that SMDA encounters an expansion conflict (see Example 3.2). One option is to run the algorithm until all reactions in the network are expanded and it halts with at least one flow-graph. This means that there is a feasible flow on this network with the provided observations. A second option is to guarantee that there would have been no expansion conflicts if SMDA had expanded all reactions. An expansion conflict can occur only while expanding a reaction that has more than one border pool associated with it (e.g., SMDA knows the labels of both the substrate and the product, and expands the reaction). Considering the flow-graph as a super node, if there is a cycle (disregarding the directions of the edges) in the subnetwork that contains the super node, then there is a chance to have an expansion conflict.
Expansion should continue until there is no such cycle. Moreover, an expansion conflict can only occur on a reaction that connects an Unavailable pool to an Available pool. Thus, SMDA needs to consider only the cycles that include both an Available pool and an Unavailable pool in the flow-graphs.

Figure 3.3 A complete network for Example 3.2

Example 3.2. Assume that the network in Figure 3.3 is the single flow-graph after the merge, and that all reactions in the network are included. Since the reaction RPE is Unknown, SMDA will keep expanding the flow-graph. However, an “expansion conflict” will occur in the following step, since the substrate of RPE is not available but the product of RPE is available.

3.5 Experimental Evaluation

In this section, we describe the environment in which SMDA was run, the gene lethality/non-lethality tests conducted, and the results.

3.5.1 Experimental Setting

Metabolic Network. Our tests were run on the reconstructed metabolic network for Trypanosoma cruzi [15]. We obtained the reconstructed network model of this organism in the form of an SBML document, and parsed and exported the model (with a home-made SBML parser tool) into our PathCase-RCMN database, Metabolomics.Chlamydomonas_reinhardtii_curated. The data can be browsed from the web interface of the PathCase-RCMN website [39]. The data was cleaned up manually, since there were some inconsistencies in the SBML file itself. For example, some reactions were not marked as transport processes although they had substrates and products in different compartments. We located those reactions and relabeled them as transport reactions. Also, the biomass reaction was removed from the database, as it is an artificial reaction and is not needed for SMDA. It is worth mentioning that, in different supplement files of the Trypanosoma cruzi paper [15], the directions (forward/backward) of some reactions were not consistent. We located such discrepancies and fixed them during the experiments.
The database consists of 162 reactions, which include 58 transport reactions and 92 gene-associated reversible reactions. To construct the sub-network for SMDA, we include all 15 pathways (the Trypanosoma cruzi paper [15] has more pathways in the data file, and we count similar pathways as one), and 52 transport reactions that connect metabolite pools of the same metabolite in different compartments. This is done to ensure the connectivity of the sub-network.

Algorithm Input. Given the full network described above, we run the SMDA gene lethality testing algorithm using the extracellular metabolite observations provided in the paper supplement [16]. There are 17 such metabolites; 12 out of 17 are marked as Available, and the rest are marked as Unavailable, as shown in Table 3.1. We also input the substrates and products of the biomass reaction, as described in Section 3.4. This corresponds to another 20 metabolites that are marked as Available. Their roles and names are shown in Table 3.2.

Table 3.1 Metabolite pool observations from the T. cruzi paper (12 Available metabolite pools and 5 Unavailable metabolite pools)
M_nh4_e, M_pro_DASH_L_e, M_glu_DASH_L_e, M_o2_e, M_b_DASH_D_DASH_glucose_e, M_a_DASH_D_DASH_glucose_e, M_ac_e, M_asp_DASH_L_e, M_gly_e, M_pi_e, M_glyc_e, M_h_e, M_succ_e, M_thr_DASH_L_e, M_ala_DASH_L_e, M_co2_e, M_h2o_e

Table 3.2 Metabolite pool observations from biomass reaction
6 products of biomass reaction: M_pi_c, M_nadp_c, M_nadh_c, M_coa_c, M_co2_c, M_adp_c
14 substrates of biomass reaction: M_r5p_c, M_pyr_c, M_pep_c, M_oaa_c, M_nh4_c, M_nadph_c, M_nad_c, M_g6p_DASH_B_c, M_g3p_c, M_e4p_c, M_atp_c, M_akg_c, M_accoa_c, M_3pg_c

Since energy pools (i.e., metabolite pools related to energy metabolism) are essential for the organism, we add those pools and mark them as Available as well. The 17 energy pools are in Table 3.3.
Table 3.3 Energy pools are set as Available
17 Available energy pools: M_fad_m, M_fadh2_m, M_nad_x, M_nadh_x, M_nadp_x, M_nadph_x, M_nad_m, M_nadh_m, M_nadp_m, M_nadph_m, M_atp_x, M_atp_m, M_adp_x, M_adp_m, M_h2o_m, M_h2o_c, M_h2o_x

Our experiments in this chapter focus on the epimastigote model of Trypanosoma cruzi. According to supplement 3 [59] of the paper, there are 18 reactions that are always Inactive in the epimastigote model. We mark those 18 reactions as Inactive. The reactions are listed in Table 3.4.

Table 3.4 Inactive reactions for epimastigote case
18 inactive reactions:
aldose1epimerase-likeprotein in Glycosome
inorganicdiphosphatase__ in Glycosome
ribokinase,glycosomal in Glycosome
gluconatekinase,glycosomal in Glycosome
deoxyribokinase,glycosomal in Glycosome
glycerol-3-phosphatedehydrogenase(nad) in Glycosome
glycerolkinase() in Glycosome
inorganicdiphosphatase in Cytosol
glycerol-3-phosphatedehydrogenase(FAD) in Cytosol
NADHdehydrogenase in Mitochondria
malicenzyme(NADP)l in Mitochondria
acetatesuccinatecoatransferase in Mitochondria
L-alaninetransaminasel in Mitochondria
NADHdehydrogenasel in Mitochondria (NADH2-u6m)
NADHdehydrogenasel in Mitochondria (NADH2-u6am)
L-threoninedehydrogenase,mitochondrion in Mitochondria
inorganicdiphosphatase_ in Mitochondria
alternativeoxidase in Mitochondria

As we discussed before, for SMDA to produce gene lethality test results, the directions of reversible reactions should be set. We utilize the “FBA-selected optimal result” from the paper’s supplement 5 [15] as an example network, and set the direction of the reactions according to the paper’s optimal result. Out of 92 reversible reactions, 68 reactions are set in one (forward) direction, and 24 reactions are set to the other (backward) direction.

3.5.2 Gene Lethality Test Results

Table 3.6 lists SMDA lethality test results for the genes in Table 3.5, along with the reason why the SMDA algorithm has found the gene to be lethal.
Table 3.5 Lethal genes to be verified
Experimental Target | Model reaction(s) constrained
fructose-1,6-bisphosphate aldolase | FBAg
Phosphogluconate dehydrogenase | PGDH
glyceraldehyde-3-phosphate dehydrogenase | GAPDg
Hexokinase | HEXg, GLUKg
Phosphofructokinase | PFKg
phosphoglycerate mutase | PGM
enolase | ENO

Table 3.6 SMDA test results on lethal genes
Gene | Reason
fructose-1,6-bisphosphate aldolase | Expansion conflict after reaching single flow group
phosphogluconate dehydrogenase | Merge conflict
glyceraldehyde-3-phosphate dehydrogenase | M_nadh_x is Available, but the only producer glyceraldehyde-3-phosphatedehydrogenase is Inactive due to the gene knock out. (Note: another reaction malatedehydrogenase,peroxisomal is in backward direction; so it is a consumer instead of a producer.)
hexokinase | Merge conflict
phosphofructokinase | Expansion conflict after reaching single flow group
phosphoglycerate mutase | M_3pg_c is Available, but the only consumer phosphoglyceratemutase is Inactive due to the gene knock out.
enolase | M_pep_c is Available, but the only producer enolase is Inactive due to the gene knock out.

Observation 1: The SMDA gene lethality testing algorithm correctly verified the lethality of all seven lethal genes. Thus, the success rate of the SMDA algorithm for verifying lethal genes was 100%.

That said, due to reversible reactions, the number of possible networks, and thus the number of possible flow-graphs (i.e., the number of possible steady-state network flows), is exponential: in the Trypanosoma cruzi network, there are 92 reversible reactions. When all combinations of directions of the reversible reactions are considered, there are 2^92 different possible networks. Therefore, one is confronted with the problem of pruning the search space in order to locate feasible flow-graphs. In our experiments, in order to avoid testing each and every combination of reversible reaction directions, we set a priori the directions of the reversible reactions in the Trypanosoma cruzi network according to the “optimal” flow result found by a constraint-based technique such as FBA. We did so by using the results in the supplement of the paper [16], and we use the corresponding network as opposed to testing 2^92 different possible networks. Another reason for choosing one network is that, for different networks, SMDA may give different results in order to be consistent with the underlying biochemistry. We give an example.

Example 3.3. Consider the partial network of Figure 3.4, where a circle represents a metabolite, a rectangle represents a reaction, and a directed edge between them represents a role (substrate or product) of the metabolite in the reaction. Assume that RPI is a reversible reaction, and that the metabolite r5p is a substrate of the biomass reaction (not included in the figure), which means that r5p is an essential metabolite and must be available. When the gene PGDH is knocked out, the metabolite ru5p-D is not Available, since PGDH is the only producer of the metabolite (assume that the direction of reaction RPI is forward, consuming ru5p-D and producing r5p, as shown in the figure). As a consequence, r5p is also not Available, since its only producer reaction RPI is Inactive due to the unavailability of ru5p-D. This conflicts with the observation that r5p is an essential metabolite, and thus SMDA concludes that the gene PGDH is lethal. However, if RPI works in the direction opposite to that in Figure 3.4, i.e., r5p is a substrate of RPI and ru5p-D is a product of RPI, SMDA will conclude that the gene PGDH is non-lethal, since ru5p-D is still available, being produced by RPI.

Figure 3.4 A partial network for Example 3.3.

Observation 2: The SMDA lethality testing algorithm covers all stopping conditions for the seven genes. The lethality testing algorithm has stopping conditions such as the expansion conflict, the merge conflict and the expansion conflict after reaching a single flow group.
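Returning to Example 3.3, the dependence of the lethality verdict on the direction of RPI can be reproduced with a small sketch. This is a heavily simplified stand-in for SMDA, not the algorithm itself: availability is propagated forward to a fixed point, a reaction is taken to be active when it is not knocked out and all of its substrates are available, and a conflict is reported when an observed-Available pool has producers in the fragment but none of them is active. The pool name "upstream" for the unnamed substrate of PGDH is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Reaction = Tuple[List[str], List[str]]  # (substrates, products)


def knockout_is_lethal(reactions: Dict[str, Reaction],
                       knocked_out: Set[str],
                       observed_available: Set[str]) -> bool:
    """Simplified lethality check for a small network fragment."""
    available = set(observed_available)
    active: Set[str] = set()
    changed = True
    while changed:  # propagate availability until nothing changes
        changed = False
        for name, (substrates, products) in reactions.items():
            if name in knocked_out or name in active:
                continue
            if all(s in available for s in substrates):
                active.add(name)
                available.update(products)
                changed = True
    for pool in observed_available:
        producers = [n for n, (_, prods) in reactions.items() if pool in prods]
        # Pools without producers in the fragment are assumed to be
        # supplied from outside it, so they cannot cause a conflict here.
        if producers and not any(p in active for p in producers):
            return True
    return False


if __name__ == "__main__":
    forward = {"PGDH": (["upstream"], ["ru5p-D"]),
               "RPI": (["ru5p-D"], ["r5p"])}
    reverse = {"PGDH": (["upstream"], ["ru5p-D"]),
               "RPI": (["r5p"], ["ru5p-D"])}
    observed = {"upstream", "r5p"}  # r5p is essential, hence Available
    print(knockout_is_lethal(forward, {"PGDH"}, observed))
    print(knockout_is_lethal(reverse, {"PGDH"}, observed))
```

With RPI in the forward direction, the only producer of r5p is inactive after the knockout and the check reports a conflict; with RPI reversed, r5p has no producer inside the fragment and no conflict arises, mirroring the two outcomes described in the example.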
In the lethality testing of the seven genes, the algorithm stops as follows: (i) for three genes, with the expansion conflict; (ii) for two genes, with the merge conflict; and (iii) for two genes, with the expansion conflict after reaching a single flow group.

3.5.3 Gene Non-Lethality Test Results

We use the same network and observations as in the gene lethality test, but this time we knock out a non-lethal gene instead of a lethal one. When a feasible flow-graph is generated, non-lethality is verified, since the organism is functioning without the knocked-out gene. In this experiment, we performed one non-lethality test to verify the non-lethality of the gene adenosine kinase in the epimastigote model. The gene adenosine kinase is related to the reaction ADK1g.

Testing a gene for non-lethality is less time consuming than testing it for lethality, as it requires the generation of only one feasible flow scenario that produces and consumes all essential metabolites. However, the single feasible flow-graph should be complete in its assignment of Active/Inactive status to all reactions. Since one complete flow-graph is sufficient for this experiment, we pre-set the status of some reactions a priori according to the optimal result of the paper, in order to expedite the SMDA run and reduce the running time and complexity of the task. In addition to the 18 inactive reactions in the epimastigote model [59], which are listed in Table 3.4, and the one reaction related to the knocked-out gene adenosine kinase, we set another 16 reactions as Inactive and 53 reactions as Active [16]. With the algorithm of Section 3.4, the non-lethality test execution stops when there is no cycle that could cause a conflict in the current flow-graphs, at which time all 167 reactions are covered in the test. To have a complete flow-graph, we have SMDA keep running until all 167 reactions in the network have a status label. There are 3,328 flow-graphs in the final result.
We have thus been able to conclude that the gene “adenosine kinase” is not lethal, since there are feasible flows in the organism that keep it alive even when the gene is knocked out. Clearly, SMDA can be used in the same manner to verify the non-lethality of other genes.

3.6 Conclusions

In this chapter, we have proposed an algorithm to verify gene lethality with the SMDA metabolomics tool. Using the SMDA gene lethality test algorithm, we have successfully verified seven lethal genes and one non-lethal gene in a genome-scale metabolic network of the organism Trypanosoma cruzi. This confirms that SMDA can be used for gene lethality testing purposes. Compared with other computational techniques such as FBA, SMDA produces results consistent with the underlying biochemistry. On the negative side, for a very large network, SMDA has its limitations, since it enumerates all possible activation/inactivation scenarios for the network at hand. We have discussed ways of reducing the complexity, e.g., abstracting pathways into single “abstract” reactions. Thus, we conclude that SMDA can be used as an alternative tool to verify gene lethality.

Visualization Tools for PathCase Systems

4.1 Introduction

PathCase visualization tools visualize metabolic data, relationships in the data, and analysis results of the data via a Java applet. The visualization tools are components of many PathCase systems, as shown in Figure 4.1. In this chapter, we present the visualization tools in the PathCase-SB system, as well as the different and specific features in the other PathCase systems listed below:

• PathCase-SB: PathCase Systems Biology Workbench featuring BioModels models and KEGG Pathways, with 409 Systems Biology Models and 139 KEGG pathways.
• PathCase-MAW Editor: a stand-alone Java application for maintaining the mammalian metabolic database MAW.
• PathCase-MAW: PathCase Metabolomics Analysis Workbench featuring a manually created generic mammalian metabolic network with 27 pathways.
• PathCase-RCMN: PathCase ReconstruCted Metabolic Networks, with four models, namely, the Mus Musculus iMM1554 model (2008), the Mus Musculus iMM1415 model (2010), the H. sapiens Recon 1 model and the Trypanosoma Cruzi iSR215 model (2009).
• PathCase-Recon: PathCase RECON Workbench featuring Genome-Scale Reconstructed Metabolic Networks and KEGG Pathways, with 53 networks and 139 KEGG pathways.
• PathCase-SMDA: an online tool to analyze metabolomics data in terms of the dynamic behavior of the metabolic network under steady state.
• Metabolism Query Language Interface: an interface to query the PathCase-MAW database.

Figure 4.1 Visualization Tools and Applications

We generalize the visualization framework of all PathCase visualization tools. Part of the framework is also used to provide visualization data for three different iPad applications, namely,

• iPathCaseMAW: the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool,
• iPathCaseRCMN: the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks,
• iPathCaseKEGG: the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

4.2 Visualization Tool for PathCase-SB System

Released in August 2010, the PathCase-SB system [17][18][19] brings together (i) systems biology sources, e.g., BioModels [20][21][22], and (ii) pathway sources, e.g., KEGG [23][24][25][26], with the goal of providing additional capabilities and tools made possible by the integration. The PathCase-SB system provides a database-enabled framework and web-based computational tools for facilitating the development of kinetic models for biological systems. Currently, PathCase-SB provides visualization, browsing, querying, simulation and comparison, model composition, and user model upload capabilities and interfaces.
On the server side, PathCase-SB data is managed by a relational database management system, namely, Microsoft SQL Server 2008. An object-oriented data-access interface between the relational database and the application layer is provided in the form of a large set of wrapper class functions, in order to provide easy, extensible, and fast data access, as well as to prevent major changes in the application when a schema change occurs during the evolution of PathCase-SB. Web services are provided so that the visualization interface and other applications can get detailed model and pathway data. Mappings between BioModels and KEGG pathways are created between species of systems biology models and molecular entities of KEGG, between reactions of systems biology models and processes of KEGG, and between systems biology models and pathways of KEGG.

The visualization interface is accessed from different places within PathCase-SB. It is employed by different sub-components, namely, the Browser Interface (it appears as a menu item in many places with the name "Interactive Model/Pathway Graph"), Built-In Queries (by each query that produces a metabolic sub-network), the iModel Tool (biochemical networks of uploaded user models), and the Model Composition Tool.

Compared with the visualization tool in previous PathCase systems, i.e., PathCase-KEGG, the visualization tool in the PathCase-SB system has the following new features:

• Integration of the interactive pathway graph visualization, which shows molecular entities, processes and enzymes, cofactors, activators, regulators and inhibitors of pathways, with the interactive model graph visualization, which shows species, reactions, compartments and their properties in models.
• Displaying a model according to the compartment hierarchy of the model, as shown in Figure 4.2. The movement of elements in a compartment is limited by the compartment’s boundary.
• For models that are related to a pathway, the mapping between the model network and the pathway is provided by displaying both side by side, and highlighting the species in the model and the molecular entity in the pathway together with the corresponding qualifiers (such as is-part-of), as shown in Figure 4.4 (the result of Figure 4.3).

Figure 4.2 Visualization of Albert2005-Glycolysis Model

Figure 4.3 An example of built-in query (reaction-to-process mapping).

Figure 4.4 Visualization of a query (Figure 4.3) result.

Also, the visualization tool has the capabilities of

• visualization simplifications: (i) Long entity names are truncated. Full names and related information are available when the user moves the cursor over the entity. (ii) Common species that participate in many reactions in the network (e.g., H2O, ATP, ADP) can optionally be excluded from the visualization.
• layout manipulations: The visualization layout can be manually revised and saved, while movement is constrained to keep the layout biologically meaningful (e.g., entities are limited to the boundary of their compartment).

4.3 Visualization Tools for other PathCase Systems

The visualization tools in the different PathCase systems are adjusted to fit each system. Next, we introduce several PathCase systems and the specific features of their visualization tools.

4.3.1 PathCase-MAW and PathCase-MAW Editor

The PathCase Metabolomics Analysis Workbench (PathCase-MAW) provides a database-enabled framework and web-based computational tools for browsing, querying, analyzing, and visualizing stored metabolic networks. It features a manually created generic mammalian metabolic network with 27 pathways. The metabolic network can be accessed through a web interface or an iPad application. The PathCase-MAW editor, a stand-alone Java application with a user-friendly interface, can be used to create a new metabolic network and/or update an existing metabolic network.
The visualization tool in PathCase-MAW and the PathCase-MAW editor also has the following capabilities.

Visualizing a pathway side by side if it exists in multiple tissues. In the PathCase-MAW database, pathways are organized by tissue. The same pathway name may exist in different tissues with different reaction compositions. For example, the pentose phosphate pathway exists in adipose and cytosol_liver. When a pathway exists in multiple tissues, its instances are presented side by side in the visualization, as shown in Figure 4.5.

Figure 4.5 Glycolysis in Cytosol_Adipose and Cytosol_Liver.

Reproducing common metabolites for each reaction they participate in, which removes many edges between common metabolites and reactions and therefore yields a cleaner graph. Metabolites that participate in many reactions (e.g., NAD, ATP, O2) have high connectivity and tend to produce cluttered visualizations. Such metabolites are called common metabolites. Unlike other metabolite pools, which are displayed once, common metabolites are displayed once per reaction in which they participate, as shown in Figure 4.6.

Distinguishing transport reactions with dotted lines, as shown in Figure 4.7.

Figure 4.6 Catabolism of Phenylalanine pathway in PathCase-MAW editor.

4.3.2 PathCase-SMDA

As discussed in Chapter Two, SMDA is developed as a computational tool to analyze measurements of a mammalian metabolic network database. It evaluates the activation/inactivation scenarios of the metabolic network and interprets the metabolic consequences of the observed changes at steady state. Both the user-selected sub-network and the SMDA results can be visualized.

The visualization tool in PathCase-SMDA also has the following capabilities: connecting reversible reactions via double edges, which enables displaying the direction of the reaction flow; and highlighting the reaction flow via bold lines in the resulting visualization.
An SMDA result with highlighted flow is shown in Figure 4.8.

Figure 4.7 TCA cycle pathway in PathCase-MAW.

Figure 4.8 SMDA query results.

4.3.3 PathCase-RCMN and PathCase-Recon

Both the PathCase-RCMN [60] and PathCase-Recon [61] systems integrate metabolomics data of genome-scale reconstructed networks. However, they target two different databases and focus on different aspects of the reconstructed networks.

The PathCase-ReConstructed Metabolic Networks of organisms (PathCase-RCMN) system contains three reconstructed metabolic networks for two organisms, namely the Mus musculus iMM1554 model (2008), the Mus musculus iMM1415 model (2010), and the Trypanosoma cruzi iSR215 model (2009). They are parsed from the literature and stored in the database to let users browse pathways, reactions, and metabolites, see compartmentalized visualizations of the pathways, and query the networks using the built-in queries provided. As an example, Figure 4.9 shows the Fatty Acid Metabolism pathway in the Mus musculus iMM1415 model (2010).

The PathCase-Recon Workbench features genome-scale reconstructed metabolic networks and KEGG pathways. It contains reconstructed networks from the literature, the BiGG database [62], MEMOSys [63], and the GSMNDB [64] site. Currently, it has 53 reconstructed metabolic networks, which include 39,692 reactions and 30,117 species. From the KEGG site, there are 139 pathways, which include 7,932 processes, 27,926 basic molecules, 5,342 proteins, and 6,295,654 genes. In the visualization of the PathCase-Recon system, instead of displaying the network pathway by pathway, the whole reconstructed network is visualized as one graph. As an example, Figure 4.10 shows the E. coli Textbook reconstructed network.

4.3.4 PathCase-MQL

The PathCase-MQL system [65] provides a metabolism query language interface to query the PathCase-MAW database. The MQL tool enables users to query the metabolism under different stress conditions [66].
MQL enables users to specify multiple and different classes of queries, such as (i) computing (and visualizing) “Activated/Inactivated (metabolic) Paths” with increased and decreased fluxes under specified physiological conditions, (ii) identifying/verifying “Potential Futile Cycles”, (iii) querying for the metabolic concentration change sets required to prevent a particular futile cycle, (iv) searching for concentration change sets that lead to the (in)activation of a user-specified metabolic subnetwork, and (v) exploring the metabolic behavior of a set of (possibly reversible) reactions.

Figure 4.9 Fatty Acid Metabolism pathway in the iMM1415 model (2010).

Our framework allows users to input concentration change statements on key metabolites, and incorporates such input into its query processing. Both the selected sub-network and the query results can be visualized. In the visualization tool of PathCase-MQL, common metabolites are duplicated for each reaction they participate in; reaction direction is highlighted in the query result; and transport reactions are emphasized via dotted lines between compartments. As an example, Figure 4.11 gives a query on the Urea Cycle of Cytosol_Liver, and Figure 4.12 displays the query result.

Figure 4.10 E. coli Textbook in PathCase-Recon System.

Figure 4.11 An example of MQL query.

Figure 4.12 MQL query result of the example in Figure 4.11.

4.4 General Framework

For all the PathCase systems’ visualization tools, the general procedure can be summarized as shown in Figure 4.13. We also generalize the visualization framework of all PathCase systems as having the following steps.

Designing an XML schema for the visualization data file. For all visualization tools, visualization data is encapsulated in an XML file that is transferred between the web server and the user’s terminal. Depending on the visualization requirements of each PathCase system, different data, relations, and customized visualization properties may be needed.
The XML schema is customized for each PathCase system to fulfill these requirements.

Figure 4.13 Visualization tools in PathCase systems

Defining parameters for web services to communicate with the visualization applet on the client side. On the server side, web services provide a variety of capabilities for retrieving the data to be visualized. However, each call from a client may need only part of that data. Parameters are therefore defined for communication between the applet and the web service.

Retrieving information from the PathCase system’s database. On the server side, based on parameters from the applet, web services are used to access the database and obtain the data to be visualized.

Composing the obtained information into an XML data file. Data retrieved from the database is assembled into an XML file according to the predefined schema.

Parsing the data file and providing visualization via the applet. After obtaining the XML file via the web service, the applet of the visualization tool parses the data file and restores the data into relations on the client side. The applet then visualizes the data and interacts with the user.

Based on the differing requirements of PathCase systems, one or more of the steps above may need to be adjusted or revised. For example, in the PathCase-MAW visualization tool, common metabolites are reproduced for each reaction they participate in, which removes many edges between common metabolites and reactions and therefore yields a cleaner graph. And in the PathCase-SMDA visualization tool, reversible reactions are connected via double edges to show the direction of flow.

4.5 Visualization Tool for iPad Applications

Three iPad applications have been developed to provide mobile users with the mobile-feasible features of PathCase systems.
The three iPad applications are: iPathCaseMAW, the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool; iPathCaseRCMN, the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks; and iPathCaseKEGG, the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

Compared with the full PathCase system, each iPad application includes only mobile-feasible data and functionalities. The XML schema, parameters, and web services are all redesigned for the iPad applications. Using iPathCaseKEGG as an example, the application does not download the entire pathway data from the PathCaseKEGG database at once. HTTP requests are made as new data is needed, and the result is cached in the device’s internal storage. After the XML response has been parsed into one or more KEGG internal objects, these objects are not serialized. In other words, all objects generated by web services are rebuilt from the XML each time they are needed.

4.6 Conclusions

In this chapter, we have introduced the visualization tools that are integrated into different PathCase systems, namely PathCase-SB, PathCase-MAW, PathCase-MAW Editor, PathCase-RCMN, PathCase-Recon, PathCase-SMDA, and the PathCase Metabolism Query Language Interface. We have summarized the new features of the PathCase-SB visualization and the specific features of the other PathCase systems’ visualization tools. The visualization framework of the tools in different PathCase systems is generalized. Part of this framework is also used to provide visualization data for three iPad applications, namely iPathCaseKEGG, iPathCaseMAW, and iPathCaseRCMN.

Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks

5.1 Introduction

The number of genome-scale reconstructed metabolic networks (GSRMNs) has been increasing at a higher rate in the last five years [7][67].
GSRMNs are being built for a wide variety of organisms and are used in many applications, for the tasks of (a) contextualization of high-throughput data, (b) guidance of metabolic engineering, (c) directing hypothesis-driven discovery, (d) interrogation of multi-species relationships, and (e) network property discovery [7]. The number of GSRMNs and their sizes in terms of the number of reactions continue to increase: most GSRMNs now have more than 500 reactions, with some having more than 3,700 reactions [68]. GSRMNs are specified in many ways: as SBML documents, as supplements to publications (e.g., [15]), and/or over the internet on the web pages of researchers.

It is noted in the literature [3][4] that published GSRMNs have two basic limitations, which prevent their full utilization. One is the inability to match metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG) or another GSRMN, due to naming inconsistencies involving species (metabolites), reactions, and compartments. We refer to metabolites, reactions, and compartments of GSRMNs as basic bio- (biological) entities. Another noted difficulty is in identifying the pathways of a GSRMN. We refer to this task as identifying higher-level bio-entities of a GSRMN automatically, where we classify pathways and metabolism-based sub-networks (e.g., the lipid metabolism) as “higher-level” biological entities of GSRMNs. We refer to the basic and higher-level bio-entity identification problems in GSRMNs collectively as the “bio-entity identification problem” of GSRMNs.
In this chapter, we focus on the basic bio-entity identification problem in a GSRMN model (referred to as the “target model” from here on) with respect to a “source model” (where the “source model” may easily be replaced by a “data source”, generalizing the identification problem). We propose three types of matches for metabolite identification and a multi-step identification process for reaction identification. Compartment identification with a curated dataset is omitted here due to space limitations. Identification results for metabolites and reactions are ranked via a variety of similarity scores. Finally, we present an empirical study of entity identification for four pairs of models: “iAM303” and “E. coli textbook”, “H. pylori iIT341” and “EryNet”, “Model2008_09_23_13_13_29” and “Model2008_08_15_12_13_14”, and “03_16_09_TM_minimal_medium_glc” and “M. barkeri iAF692”. We also evaluate the usefulness of the entity identifications and/or the ease of interpretation of our results.

5.1.1 Entity Identification

Stobbe et al. [3] compare the contents of five human metabolic pathway databases and report that the level of agreement among the metabolic networks is very low. For example, the five databases that describe the human metabolic network agree on only 3% of ~7,000 reactions. Even for the well-studied TCA cycle pathway, only 5 of the 30 reactions agree in all five databases. The low agreement on pathways in different data sources may be due to differences in pathway definitions, different intermediate steps, different numbers of alternative substrates, and difficulties in determining metabolites’ identities.
Stobbe et al. [3] suggest that the low agreement problem among pathways of different sources can be eliminated via (i) comparing metabolites by KEGG compound ID, KEGG Glycan, ChEBI, PubChem Compound, or CAS before comparing metabolite names, (ii) ignoring electrons, protons, and water while comparing reactions, (iii) treating the enzyme-bound and unbound versions of a metabolite as identical metabolites, etc.

The MetRxn database [4] is designed to resolve incompatibilities in content representation, where (i) metabolite and reaction descriptions are standardized by integrating information from 8 metabolic databases and 90 GSRMNs, (ii) all metabolite entries have matched synonyms, resolved protonation states, and are linked to unique structures, (iii) all reaction entries are elementally and charge-balanced, and (iv) the standardized descriptions allow for a direct comparison of metabolite and reaction content between metabolic models and databases. MetRxn makes the standardized versions of this metabolic information available for use. Thus, we utilize the standardized networks when available. As of April 2, 2013, at least one unsolved problem with MetRxn is that all metabolites/reactions that lack full atomistic information are excluded from comparison.

5.1.2 Similarity Score

We briefly summarize metabolite and reaction similarity scores in the literature. To measure the closeness of located metabolites to a specified target metabolite n, one can use many scoring functions, based on either a structure-based metric or a string-based metric.

For structure-based similarity scores, metabolites’ chemical structures are compared. Most methods for calculating chemical similarity are based on a compound’s two- or three-dimensional structure. Molecular structures are sometimes represented by molecular fingerprints, in which case the fingerprints are compared. Maximum common substructure, or MCS, is used in the assessment of molecular similarity based on chemical graphs [69].
The score can be based on the number of common fragments or common subgraphs defined by the atom types [70]. Structure-based chemical similarity scores can also be computed via SMILES strings. A SMILES (simplified molecular-input line-entry system) string is a way to represent a 2D molecular graph as a 1D string. SIMCOMP (SIMilar COMPound) is a graph-based method for comparing chemical structures [71].

For string-based similarity scores, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ [72]. Levenshtein distance, or edit distance, is a string metric for measuring the amount of difference between two sequences [73]. Monge-Elkan distance is a general text string comparison method [74]. Needleman–Wunsch distance and Smith–Waterman distance are usually used for sequence alignment (global and local, respectively), e.g., of protein sequences [75][76]. Jaro–Winkler distance is a measure of similarity between two strings and is used in the area of record linkage [77]. The Matching Coefficient is a vector-based approach that simply counts the number of terms on which both vectors are nonzero. The Dice coefficient is a similarity measure over sets. Jaccard similarity uses word sets from the comparison instances to evaluate similarity [78]. The Tversky index is an asymmetric similarity measure that compares a variant to a prototype. The overlap coefficient is a similarity measure related to the Jaccard index. Cosine similarity is a common vector-based similarity measure similar to the Dice coefficient [79]. TF/IDF is not typically considered a similarity metric; it provides a relevance measure for a string with respect to a given query, and hence is often used in searching. Maximal matches are often used for protein and DNA sequences.
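To make a few of the string-based measures above concrete, the following is a minimal sketch of the Hamming distance and three set-based coefficients (Jaccard, Dice, and overlap). The function names and the hyphen-based tokenizer are assumptions made for this illustration only; they are not taken from any PathCase component.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is defined only for strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def tokens(name):
    """Hypothetical tokenizer for this sketch: lower-case, then split on hyphens."""
    return set(name.lower().split("-"))


def jaccard(a, b):
    """|A & B| / |A | B| over two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2 * |A & B| / (|A| + |B|) over two token sets."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def overlap(a, b):
    """|A & B| / min(|A|, |B|) over two token sets."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 1.0


x = tokens("D-fructose-6-phosphate")
y = tokens("D-glucose-6-phosphate")
print(hamming_distance("GTP", "GDP"))
print(jaccard(x, y), dice(x, y), overlap(x, y))
```

The set-based coefficients stay high for the two sugar phosphates because three of their four tokens coincide, which illustrates why a plain string or token score alone may not separate distinct metabolites.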
The similarity of reactions can be computed as the Tanimoto coefficient between the sets of bond changes describing the transformation from substrates to products in each pair of reactions. The Tanimoto coefficient is defined as the ratio of the size of the intersection to the size of the union of the two sets [80]. SimR [81] considers the input compounds, output compounds, and enzymes of the two reactions by integrating all three matching weights. SimR is computed via Maximum Weight Bipartite Matching.

5.2 Metabolite Identification

For metabolite identification in a target model, we employ three types of matches (comparisons) to metabolites in the source model. We also apply filtering techniques to the entities of the target model, depending on what is available in both the source model SM and the target GSRMN TM at hand:

Metabolite id matching (exact match) [3][4],
Metabolite name synonym matching (exact match) [4],
Approximate metabolite name (string) matching, and
Filtering approximate string matching candidates via mandatory and optional techniques.

The algorithm is summarized in Figure 5.1.

Figure 5.1 Metabolite Identification Algorithm Sketch

The function CandidatesM() locates possible matches in the target GSRMN TM, and is summarized in Figure 5.2. Biologically significant term matching is described in Figure 5.3 of Section 5.2.3.2.

Figure 5.2 CandidatesM() function

We use the similarity score s to rank the closeness of a match. As summarized in Section 5.1.2, both structure- and string-based similarity scores are used for metabolite matching in the literature. Since metabolite structures are not available in current GSRMNs, we use a string-based metric. Edit distance, the most popular rudimentary metric [72], is the basis of many other string metrics, e.g., Needleman–Wunsch distance [75].
Since q-gram-based approximate string matching also uses edit distance, and we use q-gram approximate string matching to locate metabolites, we define the following similarity score function for a string a, its edit distance to a string b, and the threshold k:

S = SimScore_Original(a, b) = 1 - D_{a,b} / (k + 1),

where D_{a,b} ≥ 0 is the edit distance from string a to string b, and k > 0 is the edit distance threshold for the approximate match between a and b. The similarity score S is a value in [0, 1]. The score is 0 when a or b does not exist. The edit distance threshold k is required as a parameter to calculate approximate metabolite similarity scores.

5.2.1 Exact Match via Metabolite Id/Synonyms

In terms of metabolite id matching, different types of metabolite identifiers can be used for matching two metabolites. Some identifiers identify a metabolite uniquely, for example, KEGG compound id, KEGG Glycan id, ChEBI id, or PubChem Compound (CAS) id. For others, a single identifier corresponds to multiple metabolites. If the source model SM or the target GSRMN TM does not have unique identifiers specified, then synonym-based comparisons as in MetRxn [4] can be attempted.

As summarized in Section 5.1.1, MetRxn has a single unified data set of standardized metabolite and reaction descriptions for 90 GSRMNs. If TM, the target GSRMN to be analyzed, is a metabolic network in MetRxn, then one can obtain metabolite synonyms from MetRxn. For each metabolite in TM, one can locate metabolites, each with a name identical to a MetRxn metabolite or its synonym. The located metabolites form the set of exactly matched metabolite entities.

5.2.2 Approximate Name Matching

Metabolite name matching can be done via approximate string matching [82], and there are many approximate string matching techniques in the literature [83][84][85][86]. However, they can be inaccurate if applied directly.
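As a point of reference for the approximate matching discussed in this section, the score SimScore_Original defined at the start of Section 5.2 can be sketched as below. This is a minimal illustration, not the PathCase implementation: the function names are hypothetical, the edit distance is the textbook dynamic program, and returning 0 when the distance exceeds k is an assumption made here so that the score stays in [0, 1].

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def sim_score_original(a, b, k):
    """1 - D(a, b) / (k + 1) for an edit distance threshold k > 0."""
    if k <= 0:
        raise ValueError("the edit distance threshold k must be positive")
    if not a or not b:
        return 0.0          # the score is 0 when a or b does not exist
    d = edit_distance(a, b)
    if d > k:
        return 0.0          # assumption of this sketch: beyond the threshold, no match
    return 1 - d / (k + 1)


print(sim_score_original("GTP", "GTP", 2))
print(sim_score_original("GTP", "GDP", 2))
```

With this definition, an exact match scores 1, and each additional edit within the threshold lowers the score by 1/(k+1).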
For approximate metabolite name matching, we build our method on a revised version of the most well-studied and efficient technique [79], namely q-gram based approximate string matching via string joins, to locate “similar metabolite names” between SM and TM, summarized next.

We start with some preliminaries. The edit distance between two strings is the minimum number of edit operations (i.e., insertions, deletions, and substitutions) of single characters needed to transform the first string into the second. A q-gram of a string is a contiguous sequence of q characters from the string. A positional q-gram at location i of string σ is a pair (i, g), where g is the q-gram of σ that starts at position i, i.e., g = σ[i … i+q-1]. Thus, from a string t, one can create |t|-q+1 overlapping q-grams, where |t| denotes the length of t.

Gravano et al. [79] propose efficient methods for evaluating the edit distance using such q-grams. One observation is that, for an edit distance of 1, the sets of q-grams from two strings will differ by at most q, and only these q substrings contain the character affected by the one edit operation. The remaining q-grams correspond to each other. Based on this, Gravano et al. introduce count filtering: if edit(t1, t2) ≤ d is true, then t1 and t2 share at least (max(|t1|, |t2|) + q - 1) - d·q corresponding q-grams, where d·q is the maximum number of q-grams that can be affected by d edit operations.

Given a metabolite m in SM, an edit distance threshold k, and a similarity score threshold ƟM, we propose to locate the set of metabolites mTM in TM whose names are within edit distance k of m and whose similarity scores are no less than ƟM, using token and q-gram based approximate string matching.

5.2.2.1 Problems with Approximate Name Match

Directly using approximate string matching on metabolite names may lead to incorrect results. We give an example.

Example 5.1.
“D-fructose-6-phosphate” and “D-glucose-6-phosphate” differ in only a few characters (an edit distance of 3). Thus, given an edit distance threshold of k=3, they are considered an approximate match although they are different entities [87].

Also, approximate string matching results and scores are influenced by other factors:

The metabolite name has or does not have a prefix/suffix; for example, “M_fe2_m” and “fe2_m”, or “NAD(+)” and “NAD(+) [cytoplasm]”;
Metabolite names differ in prefix or suffix only; for example, “crn_b” and “crn_c”, which are the same metabolite in different compartments;
The metabolite name has additional information, e.g., a formula; for example, “m_xyl_b” and “M_XYL_C5H10O5”;
Metabolite names have differing notations; for example, “xyl-D[c]” and “M_xyl_DASH_D_c”;
The suffix/prefix uses different notations; e.g., “chol[c]” and “chol_c”;
The suffix/prefix uses an abbreviation instead of a full name; for example, “NADPH [cytoplasm]” and “nadph_c”;
Different metabolite names represent the same entity; for example, “Glucose” and “D-Glucose”.

The “(-)” and “(+)” symbols are usually used to represent the optical rotation (optical activity) of a chiral molecule, a type of molecule that has a non-superimposable mirror image [88]. The (+) symbol refers to a dextrorotatory molecule. Such molecules rotate linearly polarized light to the right (clockwise) when viewed in the direction of light propagation. Molecules labeled with (-) are laevorotatory, and rotate the polarization to the left (counterclockwise). Sometimes the letters D and L are used for these respective cases, instead of (+) and (-) [89]. Naturally, most metabolites/enzymes have only one type of optical rotation (optical activity), either “(-)” or “(+)”. In that case, “(-)” or “(+)” may be absent from the name, and it makes no difference whether the name has it or not; e.g., “(-)-ureidoglycolate” and “ureidoglycolate”, or “(-)-MAACKIAIN” and “MAACKIAIN” [90].
However, some metabolites exist in both forms; for example, “(+)-alpha-Pinene” and “(-)-alpha-Pinene” both exist. In the case of “Glucose”, only one isomer exists in nature, the right-handed form of glucose, denoted “D-glucose” [91]. These isomers can be identified via synonym matching.

5.2.2.2 Split-and-Match Approach

To improve the accuracy of identification, we further process the results of approximate string matching. Next, we propose a token and q-gram based technique to split the name string into substrings, classify the substrings, and compare them. Below we define separators, token types, and the matching procedures.

5.2.2.2.1 Separators in Metabolite Names

According to the naming conventions [92][93] and the names in the GSRMNs, we first define the separators that are used to obtain substrings, which include “ ” (space), “-”, “_”, “:”, “,”, “[…]”, and “(…)”. To keep the original meaning of metabolites, “’” and “.” are not considered separators. For example, “3',5'-cyclic IMP” is split into “3',5'” and “cyclic IMP”, and “AN2623.3” is not split.

“:” is treated as a separator when it is not between two numbers. For example, the name “YLR060W:YFL022C” is split into “YLR060W” and “YFL022C”. However, in “Hexadecanoate (n-C16:0)”, “(n-C16:0)” is not split further. “,” and “_” are treated like “:”. For example, “alpha,alpha-trehalose” is split into “alpha”, “alpha”, and “trehalose”, but “estrone-2,3-semiquinone” is split into “estrone”, “2,3”, and “semiquinone”. Similarly, “IDP_C10H12N4O11P2” is split into “IDP” and “C10H12N4O11P2”, but “3_5_Cyclic_GMP_C10H11N5O7P” is split into “3_5”, “Cyclic”, “GMP”, and “C10H11N5O7P”.

Additionally, some substrings are treated as separators because that is their semantic role. Examples of such substrings include ‘_DASH_’ and ‘minus ’.

5.2.2.2.2 Token Types in Metabolite Names

We define six types of tokens in a metabolite name: prefix, suffix, parenthesis, number, single character, and main token.
A prefix token is at the beginning of a string. A single character or a substring in parentheses is viewed as the prefix. If a parenthesis is nested in another parenthesis, then the first inner parenthesis is the prefix. A string may not have a prefix. For example, “L” is the prefix of “L-Rhamnose”, “(9Z)” is the prefix of “(9Z)-Hexadecenoic acid”, “(R)” is the prefix of “((R)-3-Hydroxybutanoyl)(n-2)”, and “5,10-Methenyltetrahydrofolate” has no prefix.

A suffix token is at the end of a string. A single character or a substring in parentheses, “(…)” or “[…]”, is considered the suffix. If a parenthesis is nested in another parenthesis, then the last inner parenthesis is the suffix. Also, a formula at the end of the string is a suffix. A string may not have a suffix. For example, “b” is the suffix of “M_o2_b”, “(Val)” is the suffix of “tRNA(Val)”, “(9Z)” is the suffix of “octadecenoate (18:1(9Z))”, “C4H4N2O2” is the suffix of “M_Uracil_C4H4N2O2”, and “Acetyl-CoA” has no suffix.

There are some special characters that have different notations, for example, α as alpha, β as beta, etc. The notations are considered a prefix or suffix if they are at the beginning or at the end of the string. The notations include: alpha, beta, gamma, delta, epsilon, omega, trans, cis. “Phosphate” is also a suffix.

A name string has at most one prefix and one suffix. Apart from the prefix and suffix, substrings of the remaining name string are further classified into parenthesis, number, single character, and main tokens.

A substring is classified as a parenthesis token if it is inside parentheses in the name string. For example, “1”, “D”, and “ribityl” are parenthesis tokens of the name string “6,7-dimethyl-8-(1-D-ribityl)lumazine [cytoplasm]”.

A number token is a substring that has only numeric characters, with or without separators. For example, “2” and “1_2_4” are number tokens of “M_2_Hydroxybutane_1_2_4_tricarboxylate_C7H7O7”. And “6,7”, “8”, and “1” are number tokens of “6,7-dimethyl-8-(1-D-ribityl)lumazine [cytoplasm]”.
A single character token is a substring that has a single character. For example, “L” is a single character token of “M_L_Cysteine_”. Main tokens are the remaining substrings.

Prefix/suffix tokens may have separators in them; e.g., “(2S,3R)” is the prefix of “(2S,3R)-3-Hydroxybutane-1,2,3-tricarboxylate_C7H7O7”. A single number character is not considered a prefix or a suffix. For example, “4” is not the prefix of “4-OH-13-cis-retinal”, which has no prefix in this case.

5.2.2.2.3 Tokens with sequence information

Given a string, the prefix and the suffix of the string are located first. Then we split the remaining string into substrings by separators. The substrings are classified into different categories according to the sequence of single character, number, and parenthesis tokens. The final remaining substrings are main tokens. We also record each substring’s original position so that we can use it during the pairing and matching process. For example, the string “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” is split as

Prefix: (“M”, 1)
Suffix: (“C5H8O4”, 6)
Number: (“4_5”, 2), (“2_3”, 4)
Main tokens: (“dihydroxy”, 3), (“pentanedione”, 5)

Single character, number, and parenthesis tokens are all considered modifiers of main tokens. By keeping sequence information, we track main-token related modifiers, including pre- and post-positions of substrings.

5.2.2.2.4 Pair Tokens and Match

First, the two strings to be compared (i.e., metabolite names from the source and target models) are split into tokens. We then pair the tokens of the two strings according to their classifications, i.e., a prefix/suffix is compared with another prefix/suffix substring only. For single character, number, and parenthesis substrings, we pair them according to their sequence relative to the main tokens. For the main tokens, we pair them by cases as follows.

1. Neither name has main tokens: we check and compare parenthesis tokens instead.
2. One name does not have a main token and the other one does: we pair the parenthesis tokens of the name that has no main token with the main tokens as well as the parenthesis tokens of the other name.

3. Both names have the same or different numbers of main tokens: we pair main tokens according to the sequence as follows. (1) Identical tokens are paired. (2) Size/length-similar tokens are paired. (3) Tokens that have the same core metabolites (listed in appendix 1) are paired.

For the prefix and suffix, we do not separate them further even though they may contain separators, as shown in Section 5.2.2.2.2. We give an example.

Example 5.2. Assume the two metabolites are “4,5-dihydroxy-2,3-pentanedione” and “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4”.

“4,5-dihydroxy-2,3-pentanedione” is split as

Number: (“4,5”, 1), (“2,3”, 3)
Main tokens: (“dihydroxy”, 2), (“pentanedione”, 4)

“M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” is split as

Prefix: (“M”, 1)
Suffix: (“C5H8O4”, 6)
Number: (“4_5”, 2), (“2_3”, 4)
Main tokens: (“dihydroxy”, 3), (“pentanedione”, 5)

When the main tokens (“dihydroxy”, 2) and (“dihydroxy”, 3), and (“pentanedione”, 4) and (“pentanedione”, 5), are paired, the number tokens (“4,5”, 1) and (“4_5”, 2), and (“2,3”, 3) and (“2_3”, 4), are paired accordingly. This is because the number (“4,5”, 1) is related to (“dihydroxy”, 2), and (“4_5”, 2) is related to (“dihydroxy”, 3). Separators are ignored during number matching. For example, “4,5” and “4_5” are considered an exact match in this example.

5.2.2.3 Similarity Score functions

Based on our token and q-gram based approximate string matching method, we revise the similarity score computations accordingly.
The similarity score

S = SimScore_Original(a, b) = 1 - D_{a,b} / (k + 1), where k > 0 and D_{a,b} ≥ 0,

is revised as

SimScore_{d,k}(a, b) = α·SimScore_Prefix(a, b) + β·SimScore_Main(a, b) + γ·SimScore_Suffix(a, b),

where α + β + γ = 1, SimScore_Prefix(a, b) is the original similarity score of the prefix pair, SimScore_Suffix(a, b) is the original similarity score of the suffix pair, and SimScore_Main(a, b) is the average of the main tokens’ original similarity scores. Since number and parenthesis tokens are always related to main tokens, their scores are not calculated separately, but are considered in SimScore_Main(a, b). Next we illustrate the similarity score computation with two metabolites.

Example 5.3. “4,5-dihydroxy-2,3-pentanedione” and “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” have the following similarity score computations. Let a = “4,5-dihydroxy-2,3-pentanedione”, b = “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4”, α = γ = 0.05, and β = 0.9. Then

SimScore_Prefix(a, b) = 0;
SimScore_Main(a, b) = (SimScore(“dihydroxy”, “dihydroxy”) + SimScore(“pentanedione”, “pentanedione”)) / 2 = (1 + 1) / 2 = 1, since we use the average of the main tokens’ similarity scores;
SimScore_Suffix(a, b) = 0;

and therefore SimScore(a, b) = α·0 + β·1 + γ·0 = β = 0.9. In this example, the adjusted score thus depends on how the user weighs main token matching relative to the matching of all three parts (prefix, suffix, and main token).

5.2.3 Filtering Metabolite Match Candidates

Instead of processing all metabolites in TM, we locate candidates for approximate match calculation by pruning names unlikely to be in the result. Two types of filtering techniques are used in this chapter, namely mandatory filters and optional filters. Mandatory filters are based on the observations of Gravano et al. [79]: length filtering, count filtering, and the corresponding q-gram filtering. In addition, based on GSRMN metabolite names, we have two optional filters, the formula filter and the Biologically Significant Term filter, which apply if additional information is available for m in SM and mTM in TM.
For example, when chemical formula specifications exist for m and mTM, or when both contain “biologically significant terms”, we can check them for a match and thereby identify the candidates for which an approximate similarity score is calculated. In some GSRMN data files, the formula is listed under the “notes” tag of the SBML file, e.g., S. aureus iSB619 [28].

5.2.3.1 Formula Matching
A chemical formula expresses the proportions of atoms that constitute a particular chemical compound. Formula information can be used to further improve approximate string matching results, since identical metabolites have the same formula. However, a formula by itself is not enough to identify a metabolite, because different metabolites may have the same proportions of atoms but different structures, which chemical formulas do not show. For example, “glucose” and “fructose” have the same formula, “C6H12O6”, but differ structurally.

5.2.3.2 Biologically Significant Term Matching
Considering the characteristics of metabolite and reaction names during matching adds depth to the matching process. Next we introduce the concept of biologically significant terms (BSTs), which are checked before the approximate string matching process for both metabolites and reactions.
Treated as strings, the names of nucleotides, e.g., GTP, GDP, and GMP, are very similar, and direct approximate string matching techniques may yield high similarity scores for them. The same holds for some coenzymes, e.g., NAD and NADP. However, these are distinct metabolites in biochemistry, with different characteristics and functions. They are also important in biochemistry for transferring chemical energy between reactions or for catalyzing chemical reactions. We call the names of such metabolites biologically significant terms, or BSTs for short. We identify three types of BSTs for the purposes of filtering.
These terms denote nucleotides and coenzymes. Level 1 terms do not include any other BST as a substring, e.g., ATP, ADP, and AMP. Level 3 terms are not a substring of any other BST, e.g., dATP, dADP, and dAMP. A level 2 term, e.g., NADP, includes a level 1 term (NAD) as a substring and is itself a substring of a level 3 term (NADPH). From the level definitions, a level i term, 1 ≤ i ≤ 2, may be a substring of a level i+1 term. Level i+1 and level i+2 terms that include a level i term as a substring are called the upper-level terms of that level i term. BSTs of all levels are collected from metabolite names and reaction names and are maintained separately, as shown in Table 5.1.

Table 5.1 Biologically significant terms
Type of terms    Terms
Level 1          ATP, ADP, AMP, CTP, CDP, CMP, GTP, GDP, GMP, ITP, IDP, IMP, TTP, TDP, TMP, UTP, UDP, UMP, XMP, NAD, FAD
Level 2          NADP
Level 3          dATP, dADP, dAMP, cAMP, dCTP, dCDP, dCMP, cCMP, dGTP, dGDP, dGMP, cGMP, dITP, dIDP, dIMP, cIMP, dTTP, dTDP, dTMP, dUTP, dUDP, dUMP, cUMP, dNAD, XTP, XDP, hXMP, NADH, NADPH, FADH

We make use of BSTs as follows. Given a metabolite/reaction name m in SM, the task is to find approximately matching names mTM in the metabolite or reaction sets of TM. We first locate candidates via formula matching when both m and mTM have formulas; then, if m contains one or more BSTs, we use them to further filter the located candidates in CSM. Based on the type of the term encountered in m, we check whether mTM in CSM has a BST matching that of m, and remove mTM from CSM if it does not. Since the terms we chose have specific biochemical semantics, it is safe to assume that these terms must be present in a genome-scale network. We use the filtered candidates in CSM to obtain more accurate matches than approximate string matching alone would provide. Next we give the BST algorithm, shown in Figure 5.3.
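As an illustration of the filtering idea, the following minimal Python sketch keeps only those candidates whose BSTs agree with those of the source name. It is a sketch under stated assumptions, not the BST-Filter() function of Figure 5.3: the helper names bst_terms_in and bst_filter are hypothetical, names are assumed to be split on non-alphanumeric separators, comparison is assumed to be case-insensitive, and a BST is assumed to be recognized only when it forms a whole token (so that NAD is not found inside nadp).

```python
import re

# Terms of Table 5.1, lower-cased for case-insensitive comparison (assumption).
BST_TERMS = {t.lower() for t in (
    "ATP ADP AMP CTP CDP CMP GTP GDP GMP ITP IDP IMP TTP TDP TMP "
    "UTP UDP UMP XMP NAD FAD NADP "
    "dATP dADP dAMP cAMP dCTP dCDP dCMP cCMP dGTP dGDP dGMP cGMP "
    "dITP dIDP dIMP cIMP dTTP dTDP dTMP dUTP dUDP dUMP cUMP dNAD "
    "XTP XDP hXMP NADH NADPH FADH"
).split()}


def bst_terms_in(name):
    """Return the BSTs that occur as whole tokens in a name (hypothetical helper)."""
    tokens = re.split(r"[^A-Za-z0-9]+", name.lower())
    return {t for t in tokens if t in BST_TERMS}


def bst_filter(source_name, candidates):
    """Drop candidates whose BSTs differ from those of the source name.

    If the source name has no BST, the candidate list is returned unchanged.
    """
    wanted = bst_terms_in(source_name)
    if not wanted:
        return list(candidates)
    return [c for c in candidates if bst_terms_in(c) == wanted]


print(bst_filter("M_nad_m", ["M_nad_m", "M_nadp_m", "M_nadph_m"]))  # ['M_nad_m']
```

With the three names of Example 5.4, only M_nad_m survives the filter for the source name M_nad_m, because the other two names carry the upper-level terms NADP and NADPH.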
In summary, the idea is that, before comparing a metabolite m to the metabolites in the set CSM, we filter out those metabolites of CSM that are distinct from m and yet would be incorrectly identified as close matches by approximate matching. We give an example with a type-1 term.

Example 5.4. GSRMN iAB-AMØ-1410-Mt-661 [94] has the metabolite M_nad_m; the corresponding subset of metabolite names to be tested for approximate matching includes M_nad_m, M_nadp_m, and M_nadph_m. Since M_nad_m has the type-1 term NAD, the BST algorithm removes M_nadp_m and M_nadph_m from the result because they contain the terms NADP and NADPH, which in turn contain the term NAD.

Figure 5.3 BST-Filter() function

5.3 Reaction Identification
For locating identical reactions in SM and TM, two methods are used in the literature. In the first, only compounds are compared, and two reactions rTM and r are considered the same if all substrates, products, and modifiers of rTM and r match one-to-one, e.g., see Stobbe et al. [3]. In the second, all compounds and catalyzing enzymes are compared, e.g., see Radrich et al. [95]. Since enzyme names and reaction names are usually the same, and enzyme information is specified in only a few GSRMNs (e.g., M. tuberculosis iNJ661 [28]) and is unavailable in most, we compare reaction names and compound names here.
For reaction identification, we employ a three-step process, namely, (i) reaction name match, (ii) reaction property match, and (iii) reaction compound match. The reaction name match filters reaction candidates via approximate string matching before further processing, i.e., compound pairing and matching, which is more time consuming. The reaction property match includes two property comparisons, namely, (i) the reversibility property and (ii) the transport property, both of which are optional since not all reactions have these properties available.
Figure 5.4 presents the reaction identification algorithm. SimScore_{d,k}(a, b) is the similarity score function between two strings a and b.

5.3.1 Reaction Name Matching
Reaction name matching is used first to obtain candidates. As in metabolite name matching, we first locate reactions rTM in TM whose names are identical to that of the reaction r of SM. If the GSRMN is included in the MetRxn data set, reactions that match via synonyms are also located by using r’s synonyms in MetRxn. If no exact reaction match is found, tokenized q-gram-based approximate string matching is performed to find reactions rTM in the GSRMN. As in metabolite name matching, approximate reaction name matching splits and classifies the reaction name string into prefix, suffix, main tokens, and the corresponding modifier tokens. Classified tokens are compared similarly, and the similarity score SimScore_{d,k}(a, b) between two reaction names is calculated. However, there are two important differences between a reaction name match and a metabolite name match: (a) Reaction names contain some special substrings, e.g., “transport”, “exchange”, “reversible”, and “irreversible”, which only represent reaction properties. These substrings are excluded from the reaction name string and are not compared in this step; instead, they are used to set a reaction’s properties in the property match step. (b) Multiple reactions with the same name that are located via approximate name matching are all considered possible candidates for later steps, since they may have different reactants, products, or modifiers.

Figure 5.4 Reaction Identification Algorithm Sketch

5.3.2 Reaction Property Matching
Two major properties of reactions, namely, the reversibility property and the transport property, are compared in this step.
A reaction in the source/target model is considered reversible when its reversibility property is true in the source/target database (which is consistent with the original model data), or when its name string explicitly contains “reversible” but not “irreversible”. For example, the reaction “D alanine D alanine ligase reversible” in the model “Salmonella_consensus_build_1” is considered a reversible reaction although its reversibility property is false in the GSRMN database. There is no transport property in the GSRMN database; so, we consider a reaction to be a transport reaction only when its name string contains one of the words “transport”, “exchange”, “transporter”, or “transferase”.
Whether property matches are required is left to the user, because in some of the literature the comparison does not consider the direction of a reaction, or reactions that differ only in their directions are counted as the same, for example, in the work of Stobbe et al. [3]. The reason is that the direction or reversibility of a reaction varies across GSRMNs due to different data sources and/or vague or missing data [7][3][95][96]. Even in the ideal case where the knowledge about an organism is complete, some ambiguous decisions remain in the reconstruction process, resulting from the core approximations of constraint-based modeling [96], which confine the fundamentally analog nature of biology to digital categorizations (e.g., a continuum of enzyme thermodynamics is categorized into “reversible” and “non-reversible”, and a continuum of substrate affinities is converted into “yes” or “no” decisions on which metabolites can be acted on by an enzyme [96]).
When property matches are required during the identification process, the reversibility and transport properties of the two reactions are obtained and compared accordingly.
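The name-based property rules above can be sketched as follows. This is a minimal Python illustration, not the system's implementation: the function names is_reversible, is_transport, and property_score are hypothetical, each name is assumed to be split into lower-case alphabetic words, and property_score follows the 1 / 0.5 / 0 scale that Section 5.3.4 defines for SimReactionProperty.

```python
import re

# Words that mark a transport reaction, as listed in the text above.
TRANSPORT_WORDS = {"transport", "exchange", "transporter", "transferase"}


def _words(name):
    """Lower-case alphabetic words of a reaction name (assumed tokenization)."""
    return set(re.findall(r"[a-z]+", name.lower()))


def is_reversible(name, db_flag=False):
    """True if the database flag is set, or if the name contains the word
    'reversible' without also containing 'irreversible'."""
    words = _words(name)
    return bool(db_flag) or ("reversible" in words and "irreversible" not in words)


def is_transport(name):
    """True if the name contains one of the transport marker words."""
    return bool(_words(name) & TRANSPORT_WORDS)


def property_score(name_a, name_b, flag_a=False, flag_b=False):
    """1.0 if both properties agree, 0.5 if exactly one agrees, otherwise 0.0."""
    same_reversibility = is_reversible(name_a, flag_a) == is_reversible(name_b, flag_b)
    same_transport = is_transport(name_a) == is_transport(name_b)
    return (same_reversibility + same_transport) / 2


print(is_reversible("D alanine D alanine ligase reversible", db_flag=False))  # True
print(is_transport("O2 exchange"))  # True
```

Matching on whole words rather than on substrings is a design choice of this sketch; it keeps a name such as “irreversible” from being read as “reversible”.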
5.3.3 Reaction Compound Matching
In this step, the compounds of each reaction candidate are compared with the source reaction’s compounds, and compound similarity scores are computed.
First, the two reactions’ compounds are matched by role. A reaction’s compounds have three roles, namely, “Reactant”, “Product”, and “Modifier”, and the two reactions’ reactants, products, and modifiers are matched separately. If either reaction is reversible, we check whether the compound counts match for each role. If they do not, the reversible reaction’s reactants and products are switched and matched against the other reaction’s reactants and products to obtain a better match. For example, the source reaction “fructose-bisphosphate aldolase” in GSRMN “A. baylyi” has two reactants, “glyceraldehyde-3-phosphate” and “dihydroxy-acetone-phosphate”, and one product, “fructose-1,6-bisphosphate”. The located candidate reaction “R_fructose_bisphosphate_aldolase” in GSRMN “E. coli iAF1260” has one reactant, “M_D_Fructose_1_6_bisphosphate_C6H10O12P2”, and two products, “M_Glyceraldehyde_3_phosphate_C3H5O6P” and “M_Dihydroxyacetone_phosphate_C3H5O6P”. Since both reactions are reversible, we match the source reaction’s product “fructose-1,6-bisphosphate” with the candidate reaction’s reactant “M_D_Fructose_1_6_bisphosphate_C6H10O12P2”, and the source reaction’s two reactants with the candidate reaction’s two products.
Then, the compounds are paired by exact name match, name length match, and core metabolite match. Compounds whose names have equal or closest lengths are paired. Core metabolites are used to pair compounds when name length is not enough to decide on pairs, i.e., when more than one compound of the target reaction can be paired with a compound of the source reaction. We also consider the input and output roles of metabolites for pairing. That is, a reaction’s input metabolites (i.e., substrates and modifiers) are compared only with the other reaction’s input metabolites.
Also, output metabolites (i.e., products) are compared accordingly.
Some metabolites are ignored during the match since reactions are not always balanced, especially with respect to electrons (e-), protons (H+), and water (H2O) [3][95]. Also, some data sources inadvertently include reactions that are not consistent in their stoichiometry [4][97].
After all compounds are paired, each pair’s similarity score is calculated via the SimScore_{d,k}(a, b) function.

5.3.4 Reaction Similarity Score
In Section 5.1.2, we summarized two reaction scores from the literature. They differ from the GSRMN reaction similarity score in that the Tanimoto coefficient of similarity considers the transformation from substrates to products in reactions, and SimR measures all pairwise entity similarities between reactions in two pathways. In the GSRMN reaction similarity score, we consider all factors involved in the matching process, i.e., reaction name, reaction properties (i.e., the reversibility property and the transport property), substrate names, regulator/modifier names, and product names. Enzyme information is not available in most GSRMNs, so it is not included in the GSRMN reaction similarity score.
In this chapter, we compute the similarity of reactions r and rTM by the function SimReaction(rTM, r), as described below.
$$\mathrm{SimReaction}(r_{TM}, r) = f_1 \cdot \mathrm{SimReactionName}(r_{TM}, r) + f_2 \cdot \mathrm{SimReactionProperty}(r_{TM}, r) + (1 - f_1 - f_2) \cdot \mathrm{SimReactionCompounds}(r_{TM}, r),$$

where SimReactionName(rTM, r) is the similarity score of the two reaction names; SimReactionProperty(rTM, r) is the score of the two reactions’ reversibility and transport properties, which is 1 when both properties are the same in the two reactions, 0.5 when one of the properties is the same in both reactions but the other is not, and 0 when neither property is the same;

$$\mathrm{SimReactionCompounds}(r_{TM}, r) = \frac{1}{3}\left( \frac{\sum_{\text{substrate pairs}} \mathrm{SimScore}_{d,k}(s_{r_{TM}}, s_r)}{|\text{substrates in } r|} + \frac{\sum_{\text{product pairs}} \mathrm{SimScore}_{d,k}(p_{r_{TM}}, p_r)}{|\text{products in } r|} + \frac{\sum_{\text{regulator pairs}} \mathrm{SimScore}_{d,k}(reg_{r_{TM}}, reg_r)}{|\text{regulators in } r|} \right);$$

and f1 and f2 are system-adjusted weight factors for the reaction name match and the reaction property match, respectively.

5.4 Experimental Evaluation
In this section, we present the experimental evaluation in terms of precision and recall analysis [98][99][100][101] on basic bio-entities. Given a similarity score threshold s (0 ≤ s ≤ 1), a source model SM, and a target model TM, in the experiments of this section we locate, from selected source-target model pairs, all matching source-target metabolites/reactions with a similarity score not less than s.
Using metabolite identification as an example, we illustrate how precision and recall are calculated. For each source metabolite m, multiple target metabolites mTMx (x=1,…,n) with similarity scores greater than or equal to s are identified. Then, for precision and recall analysis, we manually locate m’s “real matches”, mR-TMx (x=1,…,n), in the target model. More specifically, for a given m, we manually check and locate target metabolites, mR-TMx, which have either the same name as m or (based on the underlying biochemistry, model-related documents, and the literature) the same/similar biological function as m.
We refer to (m, mR-TMx) as the “real source-target metabolite pair”, or simply the real pair. To analyze the experimental results, we classify each source-target metabolite pair match as a true positive, a false positive, or a false negative. True positive pairs are in the set {mTMx | x=1,…,n} ∩ {mR-TMx | x=1,…,m}; false positive pairs are in the set {mTMx | x=1,…,n} - ({mTMx | x=1,…,n} ∩ {mR-TMx | x=1,…,m}); and false negative pairs are in the set {mR-TMx | x=1,…,m} - ({mTMx | x=1,…,n} ∩ {mR-TMx | x=1,…,m}).
Precision is defined as the proportion of correct matches among the metabolite names of retrieved pairs, (m, mTMx). Let C_TP and C_FP denote the true positive and false positive counts. Then,
Precision = C_TP / (C_TP + C_FP)
Recall is defined as the proportion of correct matches among the metabolite names of real pairs, (m, mR-TMx). Let C_TP and C_FN denote the true positive and false negative counts. Then,
Recall = C_TP / (C_TP + C_FN)

5.4.1 Metabolite Identification Results
5.4.1.1 GSRMN Pairs and Statistics
We have chosen three source-target model pairs (all from [28]), namely, they are related).
b) The model pairs are from different organisms (so that they have differing metabolites).
c) The model pairs’ sizes are typical of GSRMNs (and we manually check the results and evaluate precision and recall).
Next, for the selected model pairs, we list three different case statistics for (source, target) metabolite pairs.
Case 1: “A target metabolite (reaction) identical to the source metabolite (reaction) exists in the target model of the Recon Models database”. That is, the two names are identified as equal via a SQL Server string equality comparison.
Case 2: “Target and source metabolites (reactions) are not identical; however, their similarity score is 1”.
For a source metabolite for which there is no target metabolite with a similarity score greater than or equal to s, we locate, for further analysis, the target metabolites with the highest similarity score and produce the relevant statistics. Below, we identify this case as:
Case 3: “The given source metabolite (reaction) does not have a matching target metabolite (reaction) with a score greater than or equal to s”.
We list one statistic for (source, target) reaction pairs.
Case r: The given source reaction has matching target reaction(s) with a score greater than or equal to s.
For all three source-target model pairs, we use edit distance threshold k=3, q-gram size q=3, and similarity score threshold s=0.9. The three source-target model pairs and their statistics are:
1) Source Model: H. pylori iIT341 (412 distinct metabolites; 554 reactions)
Target Model: EryNet (482 distinct metabolites; 438 reactions)
Case 1, 2, 3 counts for metabolites: 129, 14, 248
2) Source Model: Model2008_09_23_13_13_29 (175 distinct metabolites; 167 reactions)
Target Model: Model2008_08_15_12_13_14 (583 distinct metabolites; 581 reactions)
Case 1, 2, 3 counts for metabolites: 69, 0, 19
Case r count for reactions: 67
3) Source Model: 03_16_09_TM_minimal_medium_glc (647 distinct metabolites; 645 reactions)
Target Model: M. barkeri iAF692 (698 distinct metabolites; 690 reactions)
Case 1, 2, 3 counts for metabolites: 247, 60, 227
Case r count for reactions: 332

5.4.1.2 Equal Functionality Criteria
In this analysis, to focus on non-exact matches, we do not include the (source, target) metabolite pairs of Case 1 (i.e., “a target metabolite that is identical to the source metabolite exists in the target model of the Recon Models database”). For example, in experiment (1), 44 source-target metabolite pairs are excluded from the precision and recall analysis. Also, when the source (or the target) model has two metabolites with the same name, we use only one of them for matching purposes.
In terms of “equal functionality” of metabolites, we use four criteria, illustrated below with examples.
a. Metabolites in the real pair differ only by their compartments, e.g., M_adp_m (the metabolite is in the compartment mitochondria) and M_adp_c (the metabolite is in the compartment cytosol).
b. Metabolites in the real pair differ at the chemical formula level, but have the same function as identified by data sources such as KEGG, e.g., M_Glycyl_tRNA_Gly__C2H4NOR and M_Glycyl_tRNA_Gly__C2H5NO2X have the same KEGG id C02412.
c. Metabolites in the real pair differ only in the location of the phosphate, e.g., D-Glucose 6-phosphate and D-Glucose 1-phosphate. However, metabolites with and without phosphate, or metabolites with different numbers of phosphates, are not real pairs, e.g., D-Glucose and D-Glucose 6-phosphate, or D-Fructose 1,6-bisphosphate and D-Fructose 6-phosphate.
d. Metabolites in the real pair are mirror images, i.e., enantiomers, e.g., L-Lactate and D-Lactate. Enantiomers, or optical isomers, have identical chemical and physical properties except that they have opposite orientations, like one’s left and right hands.
We have chosen these “equal functionality” criteria carefully to make sure that they are consistent with the literature and the data. We use these criteria in all experiments to observe how the parameters influence precision and recall.

5.4.1.3 Results and Observations
In Figures 5.5-5.7, we vary the similarity score threshold (the x dimension) from 0.5 to 0.9 to show changes in the counts of true positives, false positives, and false negatives, as well as in precision and recall, for the three pairs of GSRMNs. The values of α, β, and γ in SimScore(a,b) are set to 0.05, 0.9, and 0.05, respectively.

Figure 5.5 Model-to-model metabolite matching results for <H. pylori iIT341, EryNet>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall. Both panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Figure 5.6 Model-to-model metabolite matching results for <Model2008_09, Model2008_08>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall. Both panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Figure 5.7 Model-to-model metabolite matching results for <03_16_09_TM_minimal_medium_glc, M. barkeri iAF692>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall. Both panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Observation 1. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease of true-positive counts by 3.316%.
Observation 2. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease of false-positive counts by 97.639%.
False positive counts drop significantly as the similarity score threshold increases, while true positive counts may also drop slightly. This improvement shows that, with a proper similarity score threshold, most, if not all, true positive results can be obtained, with at most a small percentage of false positive results.
Observation 3. Increasing the similarity score threshold may result in an increase in false-negative counts.
When the similarity score threshold is too high, a very small number of matching pairs may be eliminated from the results, which leads to a slight increase in false negative counts.
Observation 4. Increasing the similarity score threshold results in a precision increase of 415.91% on average and of at least 115.6%, and in a recall decrease of 2.335% on average and of at least 0.
Since false-positive counts drop when the similarity score threshold increases, precision increases significantly while recall decreases much less. As the similarity score threshold increases, increases in the false-negative count and decreases in the true-positive count lead to a decrease in recall.
Observation 5. The way of splitting tokens may occasionally result in additional false negatives, though only in small numbers.
We allow at most one prefix and one suffix while obtaining the tokens. This works in most (99.37%) of the cases in the experiments, but in some special cases it may lead to a low similarity score. We give an example. For “alpha-D-Ribose 5-phosphate” in the source model “H. pylori iIT341”, “D-Ribose 5-phosphate” in the target model “EryNet” is not identified as a match. The reason is that “alpha-D-Ribose 5-phosphate” actually has two prefixes, “alpha” and “D”, and the “one prefix” rule leads to unmatched body tokens.
Observation 6. A lower edit distance threshold may occasionally result in additional false negatives, though only in small numbers.
We give an example. For “_3-Phospho-D-glycerate” in the source model “iAF692 network flux distributions for BOF optimization on methanol”, “3-Phospho-D-Glycerate (E)” in the target model “Natronomonas pharaonis metabolic network” is not identified as a match when the edit distance threshold is 3. The reason is that the two names actually have the edit distance 4, which leads to the name “3-Phospho-D-Glycerate (E)” being eliminated by the edit distance filtering condition.
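The precision and recall values discussed in these observations follow directly from the set definitions given at the beginning of Section 5.4. A minimal Python sketch is given below; the function names retrieved_at and precision_recall, the toy pairs, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration only and are not taken from the experiments.

```python
def retrieved_at(scored_pairs, s):
    """Source-target pairs whose similarity score is not less than s."""
    return {pair for pair, score in scored_pairs.items() if score >= s}


def precision_recall(retrieved, real):
    """Precision and recall from retrieved pairs and manually located real pairs."""
    retrieved, real = set(retrieved), set(real)
    c_tp = len(retrieved & real)   # true positives
    c_fp = len(retrieved - real)   # false positives
    c_fn = len(real - retrieved)   # false negatives
    # Returning 0.0 for an empty denominator is an assumed convention.
    precision = c_tp / (c_tp + c_fp) if c_tp + c_fp else 0.0
    recall = c_tp / (c_tp + c_fn) if c_tp + c_fn else 0.0
    return precision, recall


# Hypothetical scores and real pairs, for illustration only.
scored = {("m1", "t1"): 0.95, ("m1", "t2"): 0.60, ("m2", "t3"): 0.55}
real = {("m1", "t1"), ("m2", "t4")}

for s in (0.5, 0.9):
    print(s, precision_recall(retrieved_at(scored, s), real))
```

In this toy example, raising the threshold from 0.5 to 0.9 removes the two false positives, so precision rises from 1/3 to 1 while recall stays at 1/2, which mirrors the qualitative trend of Observations 2 and 4.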
5.4.2 Reaction Identification Results
5.4.2.1 GSRMN Pairs and Statistics
In the following experiments, we use two of the source-target model pairs used in the metabolite identification experiment. We do not use pair 1, and we add one new model pair (pair 4). For all three pairs (pair 2 to pair 4), we use edit distance threshold k=9, q-gram size q=3, and similarity score threshold s=0.5. The source-target model pair statistics for the new model pair are:
4) Source Model: iAM303 (279 reactions)
Target Model: E. coli textbook (95 reactions)
Case rTM count: 77

5.4.2.2 Equal Functionality Criteria
In the experiments below, for each source reaction r, multiple target reactions rTMx (x=1,…,n) with similarity scores greater than or equal to s are identified. For a given r, we manually check and locate target reactions, rR-TMx, which have identical compounds and corresponding roles, unless one reaction is a reversible reaction. We refer to (r, rR-TMx) as the “real source-target reaction pair”, or the “real reaction pair”.
In terms of identical compounds, we use the same criteria as in the metabolite similarity experiments; that is, metabolites that differ only by the compartments in which they reside are considered identical. For example, the reaction “aspartatetransaminase” in model “Model2008_09” and the reaction “aspartatetransaminase” in model “Model2008_08” are identical, since their compounds differ only by their compartments: the first reaction’s compounds “M_akg_m”, “M_glu_DASH_L_m”, “M_oaa_m”, and “M_asp_DASH_L_m” are all in “Mitochondria” (thus the suffix m), and the second reaction’s compounds “M_akg_c”, “M_glu_DASH_L_c”, “M_oaa_c”, and “M_asp_DASH_L_c” are all in the compartment “Cytosol”.
There are also a number of generic reaction names, such as “. escape flux”, “. source flux”, and “BiomassRxn”, which are used to represent different reactions. Compound similarities and differences of these reactions are captured via their similarity scores.
However, such reactions typically serve different purposes (such as FBA analysis), and thus we exclude reactions with these three names from the experimental results.

5.4.2.3 Results and Observations
In Figures 5.8-5.10, we vary the similarity score threshold (the x dimension) from 0.5 to 0.9 to show changes in the counts of true positives, false positives, and false negatives, as well as in precision and recall. The values of f1 and f2 in SimReaction(rTM, r) are set to 0.5 and 0, respectively.

Figure 5.8 Model-to-model reaction matching results for <iAM303, E. coli textbook>: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall without a rule-based pre-match filter; (c) Precision and Recall after a rule-based pre-match filter. All panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Figure 5.9 Model-to-model reaction matching results for <Model2008_09, Model2008_08>: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall. Both panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Figure 5.10 Model-to-model reaction matching results for <03_16_09_TM_minimal_medium_glc, M. barkeri iAF692>: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall. Both panels are plotted against the Similarity Score Threshold (0.5 to 0.9).

Observation 7. Inconsistent modeling decisions on the part of the modelers can be identified via either rule-based pre-match filters or a closer examination of the results.
Figure 5.8 shows that, compared with pair 2 and pair 3, recall for pair 4 drops significantly at similarity score threshold 0.8. From Figure 5.8(a), we can see that the reason is the true positive count, which drops by thirteen at score 0.8. On closer examination, nine of these thirteen real reaction pairs are exchange reactions, e.g., O2 exchange or Phosphate exchange. All exchange reactions in the source model “iAM303” have only one compound, e.g., O2 for O2 exchange. However, exchange reactions in the target model “E. coli textbook” have two (identical) compounds, one as a substrate and one as a product. Such inconsistent reaction modeling decisions on the part of the modelers distort the results of our matching algorithms; clearly, they can be identified by additional pre-match-time rule-based filtering steps. The recall at similarity score 0.8 improves from 0.754 to 0.896 after the filtering steps, as shown in Figure 5.8(b) and Figure 5.8(c).
Observation 8. Identical reactions with differing reaction names and identical compounds can be identified via high SimReactionCompounds() scores.
The reactions “malicenzyme(NADP)” in model “Model2008_09” and “malicenzyme(NAD)” in model “Model2008_08” look like different reactions, since NADP and NAD differ by a phosphate.
However, their substrates and products are identical, which means that they are the same reaction with different names; this is captured by a similarity score of 0.99.
Observation 9. Increasing the similarity score threshold improves precision but reduces recall.
A change in the similarity score threshold from 0.5 to 0.9 results in a precision increase of 107.99% on average and of at least 57%, and in a recall decrease of 9.44% on average and of at least 1.3%. As in the metabolite similarity experiments, both true positive and false positive counts drop when the similarity score threshold increases. Since false positive counts drop faster, precision improves significantly while recall declines slightly.
Observation 10. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease in false-positive counts of 91.50%.

5.5 Conclusion
In this chapter, we have proposed a number of metabolite/reaction identification techniques for Genome-Scale Reconstructed Metabolic Networks (GSRMNs), which match metabolites/reactions to the corresponding metabolites/reactions of a source model or data source. We employ a variety of computer science techniques that include approximate string matching, similarity score functions, and filtering techniques, all enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are made using the proposed metabolite/reaction identification techniques.

Conclusions and Future Work
In this thesis, we have studied the computational analysis of metabolic networks, the presentation of metabolomics data and its analysis results via visualization tools, and text mining among Genome-Scale Reconstructed Metabolic Networks.
We have developed the SMDA (steady-state metabolic network dynamics analysis) technique, built the system, and evaluated its computational performance limits using a mammalian metabolic network database. SMDA takes a set of measurements and a metabolic network as input, identifies the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and interprets the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases. The SMDA problems addressed and their solutions are new and specific to the SMDA approach. This work of evaluating the activation/inactivation scenarios of the metabolic network at steady state is related to metabolic network analysis techniques such as metabolic control analysis (MCA) [10], flux balance analysis (FBA) [11], metabolic flux analysis [12], and metabolic pathway analysis (MPA) (elementary flux modes (EFM) and extreme pathways (EP)) [13]. Comparisons between SMDA and these related techniques are discussed.
We have performed a usefulness study of SMDA for the problem of gene lethality testing [10], and the SMDA algorithm is revised accordingly. We also examined this approach on the reconstructed network model of Trypanosoma cruzi. We take the model network as one input for SMDA. Metabolite pool observations, derived from the availability of extracellular metabolites according to the paper supplement [16], are the other input. In the examination, all seven lethal genes are verified with SMDA, and one non-lethal gene selected from the paper is also verified. Thus, we confirm that SMDA can be used for gene lethality testing purposes. Compared with other computational techniques such as FBA, SMDA produces results consistent with the underlying biochemistry.
PathCase visualization tools present metabolic data, relationships in the data, and analysis results of the data via a Java applet.
These tools are components of many PathCase systems, namely PathCase-SB, PathCase-MAW, PathCase-RCMN, PathCase-Recon, PathCase-SMDA, and the Metabolism Query Language Interface. We also generalize the visualization framework shared by all PathCase visualization tools and introduce the distinct features of the visualization tools in each PathCase system. Parts of the framework are revised for three iPad applications: iPathCaseMAW, iPathCaseRCMN and iPathCaseKEGG.

In the basic-level bio-entities research, we propose metabolite/reaction identification techniques for GSRMNs. A variety of computer science techniques, including approximate string matching, similarity score functions and filtering techniques, are employed. All techniques are enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are possible using the proposed metabolite/reaction identification techniques.

6.1 Future work

6.1.1 SMDA

For the SMDA research, exploratory data mining and analysis capabilities for the SMDA query output search space can be integrated. The output space can be queried to filter specific scenarios, such as results with a malfunctioning urea cycle. Backward reasoning can also be implemented to locate specific observations. The input of SMDA can be improved by allowing the user to preset the status of metabolites and/or reactions. Finally, the user-selected network can be examined against the biochemistry rules so that the tool can offer suggestions to the user.

6.1.1.1 Querying Result Space

The result space of the SMDA tool can be huge, and there are many ways to query the results. Here we propose an example that queries the results from a specific angle, i.e., querying the SMDA results for selected metabolism disease-/disorder-related scenarios.
The goal is to answer a basic question: Are there any plausible steady-state metabolic network activation/inactivation scenarios that would implicate specific diseases or disorders (e.g., urea cycle disorders), given the observed metabolomics measurements? Disorders or diseases sometimes occur due to low levels of activity for certain reactions; that is, a reaction has slowed down sufficiently to cause a disorder/disease, but has not necessarily shut down (i.e., become Inactive). The current SMDA approach, however, attaches only three labels to reactions: Unknown, Active and Inactive. In this study, we stay with the current SMDA model, and assume that reactions related to disorders and labeled as Inactive may in fact have very low flux rates (i.e., “LowActive” or “NotActive Enough”), leading to the disorder.

As an idea for solving this problem, two steps can be implemented based on the SMDA tool’s output. First, SMDA results can be clustered into groups via preset biochemistry-based clustering rules. Second, the biological process of a specific disease can be analyzed; the disease’s disorder characteristics can then be extracted and captured via one or more disease/disorder identification rules, which can be used to identify scenarios in the SMDA outputs.

6.1.1.2 Finding Observations for a Desired Flow-Graph

A user may be interested in locating the metabolite observations that are related to a specific result. This can be done via backward reasoning. Suppose an SMDA query is run with a number of observed metabolites, etc., and the output with flow-graphs is returned. First, our SMDA tool should retain the first query, and should be able to provide the same interface, as specified, to the user.
Next, the SMDA tool should be able to switch to a different interface that allows the user to reverse the question into the following query: “Let Q define a query network with the following X, …, Z metabolite pool labels (observations), called PoolAssignment, and the following ri, rj, …, rv active/inactive reactions/pathways, called ReactionAssignment. List the metabolite pool labels of selected metabolites in Q (possibly those in bio-fluids) and the associated flow-graphs where R is consistent with PoolAssignment and ReactionAssignment”.

6.1.1.3 Allowing User to Specify Status

Besides the measurements, we may utilize the user’s domain knowledge as well. This can be done by allowing the user to provide more information as input to SMDA. The user does not have to be familiar with the SMDA model or its terminology. The interface should be able to take statements in the user’s biochemistry language and convert them into known conditions for the SMDA tool. For example, the user’s input could be that some pathway/reaction is known to be active/inactive, that the flow of a metabolite goes to one branch only instead of other branches, or that a known fuel is being used for a pathway. In our implementation, we will translate these inputs into status labels of the nodes in the sub-network. For a known pathway/reaction, we set it as active/inactive according to the user's input. For a known flow, we can assign labels to metabolite pools and to reactions. For a known fuel, we can set the label of the corresponding pool as available, with the labels of its producer and consumer reactions set as active.

6.1.1.4 Notifying Users of Special Cases

Since the user-selected sub-network may not be a complete network, when we apply biochemistry rules to the sub-network, we may miss some cases that could exist in the real world. For example, if a user chooses the Tricarboxylic Acid (TCA) Cycle as the sub-network, Acetyl-CoA is produced by the Pyruvate dehydrogenase (PDH) complex, and is consumed by Citrate synthase.
When Citrate synthase is inactive, we will conclude that the Pyruvate dehydrogenase (PDH) complex is also inactive, according to Rule BC7 (if no consumer is active, then no producer is active). However, in reality, if we include the reaction Pyruvate Carboxylase in the sub-network, the Pyruvate dehydrogenase (PDH) complex could be active, since Pyruvate Carboxylase is consuming Acetyl-CoA.

To avoid such cases, the user-chosen sub-network can be analyzed intelligently; SMDA may then remind the user when such situations arise, and let the user choose to continue or to change the sub-network.

6.1.2 Visualization

Visualization tools are integrated in the exploratory search, querying, and visualization functions of PathCase systems. The tools can be enhanced further via new functionalities. We give some examples.

In a large metabolic network, e.g., a genome-scale reconstructed metabolic network with hundreds of reactions/metabolites, it can be hard to locate a specific element manually even when the element’s name is known. A node location functionality may be provided that allows the user to input the name of an element; the visualization tool should then locate all nodes with the specified name and highlight them in the visualization graph.

In PathCase-RCMN or PathCase-SMDA, it would be helpful to provide a whole picture of all the pathways/sub-networks in the database, in addition to the visualization of a single pathway or a user-selected network. In this whole picture, all pathways in the database are visualized, and the user-selected pathway(s)/sub-network(s) are highlighted. In this way, the user will have a global view of the complete network, as well as of the connections/relations between the sub-networks.
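The node location functionality proposed in Section 6.1.2 could be prototyped along the following lines. This is only a minimal sketch under stated assumptions: the node list, the identifiers (`n1`, `n2`, ...) and the function names `build_name_index` and `locate_nodes` are hypothetical and are not part of any PathCase system. It assumes that a name is matched exactly after trimming whitespace and ignoring case.

```python
from collections import defaultdict


def build_name_index(nodes):
    """Map each normalized display name to the ids of all nodes carrying it."""
    index = defaultdict(list)
    for node_id, name in nodes:
        index[name.strip().lower()].append(node_id)
    return index


def locate_nodes(index, query):
    """Return the ids of all nodes whose name equals the query, ignoring case."""
    return index.get(query.strip().lower(), [])


# Hypothetical node list: (node id, display name). The same element may be
# drawn more than once in a large network, so one name can map to many nodes.
nodes = [
    ("n1", "Pyruvate"),
    ("n2", "Acetyl-CoA"),
    ("n3", "Citrate synthase"),
    ("n4", "pyruvate"),
]
index = build_name_index(nodes)

# The returned ids are the nodes a viewer would highlight.
to_highlight = locate_nodes(index, "PYRUVATE")
print(to_highlight)
```

Building the index once keeps each lookup independent of the network size, which matters for networks with many nodes. An approximate match on names could be substituted for the exact lookup if misspelled queries should also be served.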
6.1.3 Bio-Entity Identifications

Based on the basic-level bio-entity identification techniques and results, the scope of the bio-entity identification problem in GSRMNs can be extended in multiple new and innovative ways; e.g., higher-level bio-entities, basic and higher-level graph-based (GB) bio-entities, and basic social network-based (SNB) entities can also be located. Identification efficiency may be improved by applying techniques from related work. Also, multiple levels of identification can be integrated into current PathCase systems, i.e., PathCase-RCMN and PathCase-Recon.

6.1.3.1 Different levels of identifications

Higher-level bio-entity identification, as defined in Chapter Five, includes locating pathways or metabolism sub-networks in the GSRMNs. In addition to metabolite, reaction and compartment identification, the topological structure of the entity needs to be considered and matched in some manner. Also, disconnected components should be identified if they exist in the GSRMNs.

GSRMNs can be mapped onto graphs for analysis, and there are several mapping techniques [102]. For GSRMN analysis, one can map metabolites to vertices and reactions to edges, or map reactions to vertices and metabolites to edges. For each mapped graph, clustering algorithms can be applied and the results compared. This is called GB bio-entity identification, and it is a syntactic cluster location problem. Basic-level GB bio-entity identification includes basic entities.

Network analysis can also be applied to GSRMN identification. Authorities are objects pointed to by a large number of hubs (i.e., they have a large number of incoming edges), and are likely to be good sources of information. Hubs are objects that are likely to point to many such authorities through the link structure of the data (i.e., they have a large number of outgoing edges) [103].
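To make the graph mapping and the hub/authority notions above concrete, the sketch below maps a toy network to a directed graph (metabolites as vertices, one edge per substrate/product pair of a reaction) and computes hub and authority scores with the standard HITS power iteration. Several points are assumptions for illustration only: the toy reactions and their names (`R1`, `A`, `B`, ...) are hypothetical, the helper names `reactions_to_edges` and `hits` are not from the thesis, and the thesis does not state that HITS is the intended algorithm.

```python
import math


def reactions_to_edges(reactions):
    """Map reactions to directed metabolite -> metabolite edges.

    Each reaction is (name, substrates, products). Metabolites become
    vertices; every substrate/product pair of a reaction becomes an edge.
    """
    edges = set()
    for _name, substrates, products in reactions:
        for s in substrates:
            for p in products:
                edges.add((s, p))
    return edges


def hits(edges, iterations=50):
    """Compute hub and authority scores by power iteration (HITS)."""
    vertices = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in vertices}
    auth = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of vertices pointing to v.
        auth = {v: 0.0 for v in vertices}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of vertices v points to.
        new_hub = {v: 0.0 for v in vertices}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Rescale both score vectors to unit Euclidean length.
        for scores in (hub, auth):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth


# Hypothetical toy network, not taken from any GSRMN.
toy_reactions = [
    ("R1", ["A"], ["B", "C"]),
    ("R2", ["D"], ["B"]),
    ("R3", ["E"], ["B"]),
]
hub, auth = hits(reactions_to_edges(toy_reactions))
print("top hub:", max(hub, key=hub.get))
print("top authority:", max(auth, key=auth.get))
```

The alternative mapping mentioned above (reactions as vertices, metabolites as edges) would only change `reactions_to_edges`; the scoring step stays the same, which makes it straightforward to compare scores across both mappings.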
PageRank scores of metabolites and reactions in a GSRMN can be computed and compared across GSRMNs. The computation and analysis of authorities, hubs and PageRank scores is called basic social network-based (SNB) entity identification.

6.1.3.2 Improve efficiency of identifications

For basic-level bio-entity identification, efficiency can be improved further via approximate string matching techniques from the literature. We give some examples.

In Li et al.’s work [84], three algorithms for answering approximate string search queries are proposed, called ScanCount, MergeSkip and DivideSkip. The ScanCount algorithm adopts the simple idea of scanning the inverted lists of the grams and counting candidate strings. The MergeSkip algorithm exploits the value differences among the inverted lists and the threshold on the number of common grams of similar strings to skip many irrelevant candidates on the lists. The DivideSkip algorithm combines the MergeSkip algorithm with the idea of the MergeOpt algorithm proposed in [104], which divides the lists into two groups: one group for the long lists, and the other for the remaining lists.

Instead of having the user choose the q value of the q-grams, VGRAM, proposed in Li et al.’s work [85], can be used to improve the performance of the algorithm. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. It acts as an index structure associated with a collection of strings that are going to be queried approximately. The frequencies of variable-length grams in the strings are analyzed to build a gram dictionary. For a string, a set of grams of variable lengths is generated using the gram dictionary. The gram sets of two strings are compared to obtain their similarity.

In Zhang et al.’s work [86], the authors propose the Bed-tree, a B+-tree-based index structure to support string similarity queries with respect to edit distance.
The index can be built once and used with arbitrary distance thresholds and for all query types. The paper identifies the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries, and presents three different string transformations that capture useful information from different aspects of a string.

All these techniques can be adapted to the entity identification techniques.

6.1.3.3 Integration with PathCase systems

Of all the PathCase systems introduced in Chapter Four, PathCase-RCMN and PathCase-Recon use a GSRMN database. Basic-level bio-entity identification can be integrated into both PathCase systems, allowing the user to locate similar entities with ranking scores.

Appendix 1. Core Metabolites (Total count: 617)

(+-)-Malic 2,2',6,6'- 2- Acenaphthylene (-)- Tetrachlorobiph Acetylaminoflu Acetaminophen enyl orene Histrionicotoxin Acetate 2,2'- 2-Furoic 1,2- Acetic Dichloroethane Dichlorobiphen 2-Propenoic Acetone yl 1,2- 4,4'- Acetophenone Dichloronapthal 2,3,7,8- Dichlorobiphen Acetyl ene Tetrachloro- yl Acetylcholine dibenzo-p- 1,2- 5-Fluorouracil dioxin Acetylene Diphenylhydrazi 6,6'-Dibromo- 2,3,7,8- Acroleic ne indigo Tetrachloro- Acrylamide 1,3,5,7- 6- dibenzofuran Acrylic Tetrafluorocylco Mercaptopurine octatetraene 2,4 Acrylonitrile 9-BBN 1,5- 2,4- Acyclovir a-Actinin Dichlorophenox Dichloronapthal Adenine a-Aminobutyric ene yacetic Adenosine a-Tocopherol 1-Bromo-1- Adipic Abscisic chloro-ethene ADP Adrenaline Antimony Azurite Boron Adrucil Apatite Bacteriopheoph Brazilianite Advil Apophyllite ytin Brevetoxin Alanine Aqua-kleen Barite Bromo-chloro- Aldrin Aquamarine Barium fluoro-methane Allene Aragonite Benitoite Bromoaureol Aluminum Arginine Benzene Bromopentafluo ride Amidox Arkelite Benzo(a)pyrene Brooklax Ammonium Arsenic Benzoic Buckminsterfull Amoxone Arsenopyrite Benzophenone erene Amphidinolide Arsine Benzothiazole BuLi Anatase Ascidiacyclamid Beryl Bupropion
Androsterone e Biacetyl Butylbenzoic Anhydrite Ascorbic Bicyclomycin Butyric Anhydroanguiba Asparagine Biotin C60 ctin Aspartame Biphosphate C70 Anhydroscymno Aspartic Bismuth Caffeine l Aspirin Bisulphite Calcite Aniline Aspirochlorine Boracite Calcium Annulin ATP Borax Caledonite Anthracene AZT Borazine 162 Calyculin Chlorine Codeine Cyclomarin Camphor Chloro-difluoro- Collagen Cyclopropane Cantharidin methane Copiapite Cyclopropenylid Captan Chlorocresol Copper ene Carbon Chloromethane Coronene Cycloxazoline Carbonate Chlorophyll Cortisol Cymobarbatol Carbonic Chlorosulfuric Cortisone Cysteine Carletonite Cholecalciferol Corundum Cytidine Carnallite Cholesterol Coumarin Cytosine Caryophyllene Cholic Creatine D-(-)-Luciferin Cassiterite Chromate Crocoite D-Glucitol Catechol Chromium Cryolite Dactylallene Cavansite Chrysene Cucurbitine DDT CDP Chrysoberyl Cumene Decachlorobiph enyl Celestine Cinnabar Curcumin Decamine Cembranolide Cinnamic Cyanide Dechlorane Cerussite Cinnamon Cyanoacetylene Decopur Chalcanthite Cisplatin Cyanoacrylate Di-t-butyl- Chalcopyrite Citric Cyanogen peroxide Chlorate Clinoclase Cyclobutane Diacetylene Chlordene Cocaine Cyclohexane 163 Diamond Dinitrotoluene Epsomite Fructose-6- Diazepam Dioxane Erythrite phosphate Diazomethane Dioxin Erythromycin Fumarate Dibenzoyl Diuron Estradiol Fumaric Dicamba Divinyl Estrol Fumiquinazolin e Dicarbon DL-3- Estrone Furaldehyde Dichloro- Aminoisobutyri Ethane c galactosamine difluoro- Ethanol methane Dodecanedioic Galacturonic Ethyl Dichromate Dolomite Galena Ethylene Dieldrin Domeykite Gallic Ferredoxin Dihydroxyaceto dTDP Garnet Ferric ne Durdenite Gaspeite Ferrous Diketene Durotox GDP Fluoranthene Dimethyl Dynamite Germane Fluorapatite Dimethylpyrazi Dysamide Glucarate Fluorene ne e-Caprolactam Glucocorticoid Fluorite Dimethyltrypta Ecstasy glucosamine Fluoxetine mine Emmonsite Glucose Fool's Dinitrogen Endrin Glucuronic Formic Dinitrophenol Epinephrine Glutamate Fructose 164 Glutamic Hexachlorocycl 
Hyposulfite L-Arginine Glutamine ohexane Ibuprofen L-Carvone Glutaric Hexachlorocycl Indole Lactose opentadiene Glycine Inesite Lankalapuol Hexafluorosulfi Gold Iodine Laurencin de Graphite Iron Lauric Hexahydrobenz Guanidinium isobutane Leucine ene Guanine Isodrin Limonene Hexane Guanosine isoleucine Lindane Histidine Gypsum isopropanol Linoleic Honulactone Halite isopropyl LSD Hyaluronidase Halloysite Jadeite Lyphocin Hydrated Halomon Juglone Lysine Hydrazine Hardystonite Juncusol Lysozyme Hydrochloric HCB Kalihinene m-Cresol Hydrogen Hematite Kepone m- Hydronium Heptachlor Keramamine-A Hydroxybenzoic Hydroxide Hessite Keramaphidin m-Xylene Hydroxyisobuty HEX Ketene Magic ric Hexachlorobenz Kilprop Malachite Hypochlorite ene L-Alanine Maleic 165 Malonic Methylamine N-(p- Nitrous Maltol Methylcyanoace Bromobenzamid Norcholestane e)gymnodimine Maltose tylene Nuprin n-Pentacosane Manzamine Methyldiacetyle Nutrasweet ne NAG Marcasite Octanitrocubane Methylpyrazine Naphthol MDMA Octanoic Millerite Naprosyn Mecopar Oestrin Mimetite Naproxen Mecoprop Oestrone Miracle Napthalene Melanin Oleic Mirbane Neamphine Melanterite Oxalate Mirex Needle Melatonin Oxalic Molybdenite Neohalicholacto Melittin Oxirane Molybdenum ne Menadione Oxychlor Molybdic Nicotine Mercaptopurine Oxytetracycline Monosan Niter Mescaline Ozone Morphine Nitrate Methacrylate p,p-DDE Motrin Nitric Methane p-Benzoquinone Muscovite Nitrite Methanol p-Cresol Musk Nitrobenzol Methionine p- Nitroguanidine Methoxychlor Hydroxybenzald Nitrophenol ehyde Methyl 166 p- Pepsin Phycocyanin Prometone Hydroxybenzoic Perchlorate Phycocyanobilin Propadienyliden p-Xylene Periclase Phycoerythrin e Paraherquamide Permanganate Phylloquinone Propane Paroxetine Peroxide Picene Propene PCB-15 Perylene Picric Propionic PCB-4 Phenanthrene Picryl Propyne Pectenotoxin Phencyclidine Pinene Prozac Pectenotoxin-1 Phengite Pinnatazane Pseudo- conhydrine Penicillin Phenol Piperazinomyci Pseudopterosin Pennamine Phenolphthalein n 
Psilocybin Pentaacetoxy Phenoxymethyl Piperine Pentacarbon penicillin Plastocyanin Pyrene Pentacene Phenylalanine Plastoquinone Pyridine Pentachlorophen Phenylmercuric Platinum Pyrite ol Phloroglucinol Potassium Pyrrhotite Pentaerythritol Phosgene Pregnenolone Pyruvate Pentane Phosphine Prianosin Quartz Pentatetraenylid Phosphoenol Progesterone R-Carnitine ene Phosphoric Progestin RDX Peppermint Phosphorus Proline Realgar 167 Retinoic Sorbitol Tartaric Thymidine Rhodochrosite Sphalerite Testosterone Thymine Ribulose- Spinel Tetrabromodichl Thymol bisphosphate Stannic orobipyrrole Tin Rotenone Stearic Tetracycline Tinstone Saccharin Stream Tetradecanol Titanium Salicylaldehyde Streptonigrin Tetrahydrocortis Topaz ol Salicylic Strontianite Tourmaline Tetrahydrofuran Salinamide Strychnine Toxaphene Tetranitroaniline Saxitoxin Styphnic Trans- Tetrodotoxin Scalaradial Styrene Chlordane Tetryl Scapolite Succinic Triacetylene THF Serine Sucrose Tridecane Thiocarbohydra Showdomycin Sugar Trimesic zide Siderite Sulfate Trimethylamine Thioketene Silver Sulfite Trimethylene Thionyl Silylene Sulfur Trimethylpyrazi Thiophene ne Sinhalite Sulfuric Thioredoxin Trinitrobenzene Sodalite Sweet Thiourea Trinitroresorcin Sodium Tamoxifen Threonine ol Sorbic Tanzanite 168 Trioxane Uracil Vancomycin Weed Triphenylene Urea Vanillin Weedtrol Triphosgene Urethane Variscite Wellbutrin Trunculin Uridine Venlafaxine Wood Tryptophan Uridine-5- Vinyl Wulfenite Turquoise oxyacetic Vinylformic Zippeite Tyrosine Valine Vitamin Zircon Ubiquinone Valium Warfarin Zyban Undecanol Vanadinite Water Vancocyn 169 Bibliography [1] M. I. Sigurdsson, N. Jamshidi, E. Steingrimsson, I. Thiele, and B. Ø. Palsson, “A detailed genome-wide reconstruction of mouse metabolism based on human Recon 1.,” BMC systems biology, vol. 4, no. 1, p. 140, Jan. 2010. [2] R. Schuetz, L. Kuepfer, and U. 
Sauer, “Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli.,” Molecular systems biology, vol. 3, no. 119, p. 119, Jan. 2007. [3] M. Stobbe and S. Houten, “Critical assessment of human metabolic pathway databases: a stepping stone for future integration,” BMC Systems Biology, vol. 5, p. 165, 2011. [4] A. Kumar, P. F. Suthers, and C. D. Maranas, “MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases.,” BMC bioinformatics, vol. 13, no. 1, p. 6, Jan. 2012. [5] N. D. Price, J. L. Reed, and B. Ø. Palsson, “Genome-scale models of microbial cells: evaluating the consequences of constraints.,” Nature reviews. Microbiology, vol. 2, no. 11, pp. 886–97, Nov. 2004. [6] A. Joyce and B. Palsson, “Towards whole cell modeling and simulation: comprehensive functional genomics through the constraint-based approach,” Progress in drug research, vol. 64, pp. 267–309, 2007. [7] M. a Oberhardt, B. Ø. Palsson, and J. a Papin, “Applications of genome-scale metabolic reconstructions.,” Molecular systems biology, vol. 5, no. 320, p. 320, Jan. 2009. [8] S. Schuster, D. a Fell, and T. Dandekar, “A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks.,” Nature biotechnology, vol. 18, no. 3, pp. 326–32, Mar. 2000. [9] A. Cakmak, X. Qi, A. E. Cicek, and G. Ozsoyoglu, “Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis,” in Proceedings of 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM BCB 2011), Aug 1 -3, Chicago, IL., 2011. [10] D. Fell, Understanding the control of metabolism. Portland Press, London, UK, 1996. 170 [11] C. H. Schilling, S. Schuster, B. O. Palsson, and R. Heinrich, “Metabolic pathway analysis: basic concepts and scientific applications in the post-genomic era.,” Biotechnology progress, vol. 15, no. 3, pp. 296–303, 1999. [12] N. G. 
Stephanopoulos, A. A. Aristidou, and J. Nielsen, Metabolic Engineering: Principles and Methodologies. Academic Press, Maryland Hts, MO, 1998. [13] S. Schuster and C. Hilgetag, “On elementary flux modes in biochemical reaction systems at steady state,” Journal of Biological Systems, vol. 2, pp. 165–182, 1994. [14] X. Qi, A. E. Cicek, and G. Ozsoyoglu, “Performing Gene Lethality Testing with SMDA.” 2012. [15] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. a Manque, V. Lee, A. M. Lara, J. a Papin, and G. a Buck, “Proteomic and network analysis characterize stage- specific metabolism in Trypanosoma cruzi.,” BMC systems biology, vol. 3, p. 52, Jan. 2009. [16] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. a Manque, V. Lee, A. M. Lara, J. a Papin, and G. a Buck, “Supplement 5, Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi,” BMC systems biology, vol. 3, p. 52, 2009. [17] M. Sajitz-Hermstein and Z. Nikoloski, “A novel approach for determining environment-specific protein costs: the case of Arabidopsis thaliana.,” Bioinformatics (Oxford, England), vol. 26, no. 18, pp. i582–8, Sep. 2010. [18] A. Cakmak, X. Qi, S. a Coskun, M. Das, E. Cheng, a E. Cicek, N. Lai, G. Ozsoyoglu, and Z. M. Ozsoyoglu, “PathCase-SB architecture and database design.,” BMC systems biology, vol. 5, no. 1, p. 188, Jan. 2011. [19] S. a Coskun, X. Qi, A. Cakmak, E. Cheng, a E. Cicek, L. Yang, R. Jadeja, R. K. Dash, N. Lai, G. Ozsoyoglu, and Z. M. Ozsoyoglu, “PathCase-SB: integrating data sources and providing tools for systems biology research.,” BMC systems biology, vol. 6, no. 1, p. 67, Jan. 2012. [20] “BioModels database—A Database of Annotated Published Models.” [Online]. Available: http://www.ebi.ac.uk/biomodels-main. [21] N. Le Novère, B. Bornstein, A. Broicher, M. Courtot, M. Donizelli, H. Dharuri, L. Li, H. Sauro, M. Schilstra, B. Shapiro, J. L. Snoep, and M. 
Hucka, “BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems.,” Nucleic acids research, vol. 34, no. Database issue, pp. D689–91, Jan. 2006. 171 [22] C. Li, M. Donizelli, N. Rodriguez, H. Dharuri, L. Endler, V. Chelliah, L. Li, E. He, A. Henry, M. I. Stefan, J. L. Snoep, M. Hucka, N. Le Novère, and C. Laibe, “BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models.,” BMC systems biology, vol. 4, p. 92, Jan. 2010. [23] “KEGG (Kyoto Encyplopedia of Genes and Genomes) Pathways.” [Online]. Available: http://www.genome.jp/KEGG/pathway.html. [24] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa, “KEGG: Kyoto Encyclopedia of Genes and Genomes.,” Nucleic acids research, vol. 27, no. 1, pp. 29–34, Jan. 1999. [25] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in KEGG.,” Nucleic acids research, vol. 34, no. Database issue, pp. D354–7, Jan. 2006. [26] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa, “KEGG for representation and analysis of molecular networks involving diseases and drugs.,” Nucleic acids research, vol. 38, no. Database issue, pp. D355–60, Jan. 2010. [27] R. S. Johnson, X. Qi, A. E. Cicek, and G. Ozsoyoglu, “iPathCase-KEGG: An iPad Interface for KEGG Metabolic Pathways,” Health Information Science and Systems, 2012. [28] “RECON:Online Database of Reconstructed Metabolic Networks.” [Online]. Available: http://www.reconmodels.org/Web. [29] K. Yizhak, T. Benyamini, W. Liebermeister, E. Ruppin, and T. Shlomi, “Integrating quantitative proteomics and metabolomics with a genome-scale metabolic network model.,” Bioinformatics (Oxford, England), vol. 26, no. 12, pp. i255–60, Jun. 2010. [30] I. R. Bederman, S. Foy, V. Chandramouli, J. C. Alexander, and S. F. 
Previs, “Triglyceride synthesis in epididymal adipose tissue: contribution of glucose and non-glucose carbon sources.,” The Journal of biological chemistry, vol. 284, no. 10, pp. 6101–8, Mar. 2009. [31] H. G. Gasier, J. D. Fluckey, and S. F. Previs, “The application of 2H2O to measure skeletal muscle protein synthesis.,” Nutrition & metabolism, vol. 7, p. 31, Jan. 2010. [32] A. Cakmak, X. Qi, a E. Cicek, I. Bederman, L. Henderson, M. Drumm, and G. Ozsoyoglu, “A new metabolomics analysis technique: steady-state metabolic network dynamics analysis.,” Journal of bioinformatics and computational biology, vol. 10, no. 1, p. 1240003, Feb. 2012. 172 [33] “SMDA tool.” [Online]. Available: http://nashua.case.edu/pathwayssmda/web. [34] “PathCase family of applications.” [35] S. Johnson and G. Ozsoyoglu, “PathCase MAW, an iPad application.” [36] “HMDB site.” [Online]. Available: http://www.hmdb.ca/sources. [37] A. E. Cicek, F. Olnh, I. O. X. Ri, D. Q. Hq, P. H. Ru, and F. Ri, “Resolving Observation Conflicts in Steady State Metabolic Network Dynamics Analysis,” vol. 9, pp. 409–414. [38] “PathCase-MAW application.” [Online]. Available: http://nashua.case.edu/PathwaysMAW/. [39] “PathCase-RCMN application for organism Trypanosoma cruzi.” [Online]. Available: http://nashua.case.edu/PathwaysMAW_Trypanosoma/web/. [40] “OMA Tool.” [41] C. H. Schilling, D. Letscher, and B. O. Palsson, “Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective.,” Journal of theoretical biology, vol. 203, no. 3, pp. 229–48, Apr. 2000. [42] D. A. Fell, “Metabolic control analysis: A survey of its theoretical and experimental development,” Biochem, no. 286, pp. 313–330, 1992. [43] J. C. Liao and J. Delgado, “Advances in Metabolic Control Analysis,” Biotechnology progress, vol. 9, no. 3, pp. 221–233, 1993. [44] M. C. Wildermuth, M. G. Hospital, and B. 
Street, “Minireview Metabolic control analysis : biological applications and insights,” Genome biology, vol. 1, no. 6, pp. 1–5, 2000. [45] A. Varma and B. O. Palsson, “Metabolic capabilities of Escherichia coli. II. Optimal growth patterns.pdf,” Journal of Theoretical Biology, vol. 165, no. 4, pp. 503–522, 1993. [46] J. S. Edwards and B. O. Palsson, “Systems properties of the Haemophilus influenzae Rd metabolic genotype.,” The Journal of biological chemistry, vol. 274, no. 25, pp. 17410–6, Jun. 1999. [47] S. Klamt and J. Stelling, “Two approaches for metabolic pathway analysis?,” Trends in biotechnology, vol. 21, no. 2, pp. 64–9, Feb. 2003. 173 [48] A. P. Wlaschin, C. T. Trinh, R. Carlson, and F. Srienc, “The fractional contributions of elementary modes to the metabolism of Escherichia coli and their estimation from reaction entropies.,” Metabolic engineering, vol. 8, no. 4, pp. 338– 52, Jul. 2006. [49] M. G. Poolman, K. V Venkatesh, M. K. Pidcock, and D. a Fell, “A method for the determination of flux in elementary modes, and its application to Lactobacillus rhamnosus.,” Biotechnology and bioengineering, vol. 88, no. 5, pp. 601–12, Dec. 2004. [50] R. Urbanczik and C. Wagner, “An improved algorithm for stoichiometric network analysis: theory and applications.,” Bioinformatics (Oxford, England), vol. 21, no. 7, pp. 1203–10, Apr. 2005. [51] D. J. Glykys and S. Banta, “Metabolic control analysis of an enzymatic biofuel cell.,” Biotechnology and bioengineering, vol. 102, no. 6, pp. 1624–35, Apr. 2009. [52] S. Llamt, J. Gagneur, and A. von Kamp, “Algorithmic approaches for computing elementary modes in large biochemical reaction networks,” System Biology, vol. 4, no. 152, pp. 249–55, 2005. [53] G. Alterovitz, S. Member, V. Muralidhar, and M. F. Ramoni, “Gene Lethality Detection and Characterization via Topological Analysis of Regulatory Networks,” IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, vol. 53, no. 11, pp. 2438– 2443, 2006. [54] A.-L. Barabási and Z. N. 
Oltvai, “Network biology: understanding the cell’s functional organization.,” Nature reviews. Genetics, vol. 5, no. 2, pp. 101–13, Feb. 2004. [55] H. Jeong, S. P. Mason, a L. Barabási, and Z. N. Oltvai, “Lethality and centrality in protein networks.,” Nature, vol. 411, no. 6833, pp. 41–2, May 2001. [56] N. C. Duarte, S. a Becker, N. Jamshidi, I. Thiele, M. L. Mo, T. D. Vo, R. Srivas, and B. Ø. Palsson, “Global reconstruction of the human metabolic network based on genomic and bibliomic data.,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 6, pp. 1777–82, Feb. 2007. [57] C. T. Trinh, A. Wlaschin, and F. Srienc, “Elementary mode analysis: a useful metabolic pathway analysis tool for characterizing cellular metabolism.,” Applied microbiology and biotechnology, vol. 81, no. 5, pp. 813–26, Jan. 2009. [58] T. D. Jamison, Disease Control Priorities in Developing Countries, 2nd ed. Washington (DC): World Bank, 2006. 174 [59] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. a Manque, V. Lee, A. M. Lara, J. a Papin, and G. a Buck, “Supplement 3, Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi,” BMC systems biology, vol. 3, p. 52, 2009. [60] “PathCaseRCMN.” [Online]. Available: http://nashua.case.edu/PathCaseRCMN/web/. [61] “PathCase-Recon.” [Online]. Available: http://nashua.case.edu/PathCaseRECON/Web/. [62] “BiGG Database.” [Online]. Available: http://bigg.ucsd.edu/bigg/main.pl. [63] “MEMOSys.” [Online]. Available: http://icbi.at/software/memosys/memosys.shtml. [64] “GSMNDB.” [Online]. Available: http://synbio.tju.edu.cn/GSMNDB/gsmndb.htm. [65] “PathCase-MQL.” [Online]. Available: http://nashua.case.edu/PathwaysMQL/web/. [66] A. Cakmak, G. Ozsoyoglu, R. W. Hanson, and C. Science, “MANAGING AND QUERYING MAMMALIAN METABOLIC NETWORKS : A METABOLISM QUERY LANGUAGE AND ITS QUERY PROCESSING 1 . 1 . A Query Template and Its Instance 1 . 2 . 
... is denoted as C. C is True if m is marked with an identifier qActual such that either (a) qActual.id ≠ -1 and qActual.id < q.id, or (b) q.id = 0 and qActual.id > 0.
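The two-clause test above can be read as a small predicate over a pair of identifiers. The following is a minimal sketch only, not the dissertation's implementation: the class name LabelIdentifier and the function name condition_holds are hypothetical, and the sketch assumes that each identifier exposes a single integer field id. The value -1 is treated simply as the special value excluded by clause (a); its meaning is not restated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelIdentifier:
    """Hypothetical stand-in for a label identifier with an integer id."""
    id: int


def condition_holds(q_actual: LabelIdentifier, q: LabelIdentifier) -> bool:
    """Return True if the identifier marking m (q_actual) satisfies the
    condition stated for the reference identifier q.

    Clause (a): q_actual.id is not -1 and is strictly smaller than q.id.
    Clause (b): q.id is 0 and q_actual.id is strictly positive.
    """
    clause_a = q_actual.id != -1 and q_actual.id < q.id
    clause_b = q.id == 0 and q_actual.id > 0
    return clause_a or clause_b


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the text.
    pairs = [(1, 2), (-1, 2), (3, 0), (2, 2)]
    for actual_id, ref_id in pairs:
        result = condition_holds(LabelIdentifier(actual_id), LabelIdentifier(ref_id))
        print(f"qActual.id={actual_id}, q.id={ref_id}: {result}")
```

Under these assumptions the two clauses are evaluated independently, so a pair such as qActual.id = 3 and q.id = 0 fails clause (a) but satisfies clause (b), while qActual.id = -1 can never satisfy either clause.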