COMPUTATIONAL ANALYSIS, VISUALIZATION AND TEXT MINING OF METABOLIC NETWORKS

by

XINJIAN QI

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Gültekin Özsoyoğlu

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

January, 2014

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

____Xinjian Qi______candidate for the Doctor of Philosophy___degree *.

(signed) __Dr. Gültekin Özsoyoğlu______(chair of the committee)

______Dr. Andy Podgurski______

______Dr. M. Cenk Cavusoğlu______

______Dr. Nicola Lai______

______Dr. Z. Meral Özsoyoğlu______

______

(date) _____June 24, 2013____

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

Table of Contents ...... 1

List of Tables ...... 6

List of Figures ...... 7

Acknowledgements ...... 10

Abstract ...... 12

Introduction ...... 14

1.1 Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis ...... 15

1.2 Performing Gene Lethality Testing with SMDA ...... 17

1.3 Visualization Tools for PathCase Systems...... 19

1.4 Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks ...... 22

1.5 Thesis Organization...... 24

Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis ...... 25

2.1 Introduction ...... 25

2.2 Condition-Based Modeling ...... 33

2.2.1 Assumptions and Terminology ...... 33

2.2.2 Metabolite Pool Label Identifiers ...... 36

2.2.3 Metabolite Label Condition Characterization ...... 38

2.2.4 Trigger Values and Activation Condition Sets for Reactions, Transport Processes, or Pathways ...... 39

2.2.5 Biochemistry-Based Rules ...... 43

2.3 Active/Inactive Graph Generation And Expansion ...... 45

2.3.1 Initial GAI Generation ...... 46

2.3.2 GAI Graph Expansion ...... 48

2.3.3 Merging GAI Graphs ...... 56

2.3.4 Algorithm Sketch ...... 58

2.4 Experimental Evaluation ...... 59

2.4.1 Experimental Setting ...... 59

2.4.2 Experimental Results ...... 61

2.5 Related Work: Metabolic Network Analysis Techniques ...... 63

2.6 Conclusions ...... 66

Performing Gene Lethality Testing with SMDA ...... 67

3.1 Introduction ...... 67

3.2 Summary of SMDA Algorithm ...... 70

3.2.1 SMDA Terminology ...... 70

3.2.2 Algorithm Flow ...... 72

3.2.3 Conflicts ...... 74

3.3 Existing Gene Lethality Techniques and SMDA ...... 75

3.4 Revising SMDA For Gene Lethality Testing ...... 80

3.5 Experimental Evaluation ...... 83

3.5.1 Experimental Setting ...... 83

3.5.2 Gene Lethality Test Results ...... 86

3.5.3 Gene Non-Lethality Test Results ...... 90

3.6 Conclusions ...... 91

Visualization Tools for PathCase Systems ...... 92

4.1 Introduction ...... 92

4.2 Visualization Tool for PathCase-SB System ...... 94

4.3 Visualization Tools for other PathCase Systems ...... 98

4.3.1 PathCase-MAW and PathCase-MAW Editor ...... 98

4.3.2 PathCase-SMDA ...... 99

4.3.3 PathCase-RCMN and PathCase-Recon ...... 101

4.3.4 PathCase-MQL ...... 103

4.4 General Framework ...... 106

4.5 Visualization Tool for iPad Applications ...... 108

4.6 Conclusions ...... 109

Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks ...... 110

5.1 Introduction ...... 110

5.1.1 Entity Identification ...... 111

5.1.2 Similarity Score ...... 112

5.2 Metabolite Identification ...... 114

5.2.1 Exact Match via Metabolite Id/Synonyms ...... 116

5.2.2 Approximate Name Matching...... 117

5.2.3 Filtering Metabolite Match Candidates ...... 125

5.3 Reaction Identification ...... 129

5.3.1 Reaction Name Matching ...... 130

5.3.2 Reaction Property Matching ...... 131

5.3.3 Reaction Compound Matching ...... 132

5.3.4 Reaction Similarity Score ...... 134

5.4 Experimental Evaluation ...... 135

5.4.1 Metabolite Identification Results ...... 136

5.4.2 Reaction Identification Results ...... 144

5.5 Conclusion ...... 150

Conclusions and Future Work ...... 152

6.1 Future work ...... 153

6.1.1 SMDA ...... 154

6.1.2 Visualization ...... 157

6.1.3 Bio-Entity Identifications...... 157

Appendix 1. Core Metabolites (Total count: 617) ...... 161

Bibliography ...... 170

List of Tables

Table 2.1 The number of observations vs. the number of output graphs for small sub-networks...... 61

Table 2.2 The number of observations vs. the number of graphs for a large network. .... 62

Table 3.1 Metabolite pool observations from the T. cruzi paper ...... 85

Table 3.2 Metabolite pool observations from biomass reaction ...... 85

Table 3.3 Energy pools are set as Available ...... 85

Table 3.4 Inactive reactions for epimastigote case ...... 86

Table 3.5 Lethal genes to be verified ...... 87

Table 3.6 SMDA test results on lethal genes ...... 88

Table 5.1 Biologically significant terms ...... 128

List of Figures

Figure 2.1 SMDA result as a single GAI graph ...... 29

Figure 2.2 Illustration of three alternative versions of transport processes ...... 48

Figure 2.3 A partial metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites...... 52

Figure 2.4 The first level of the GAI graph generation hierarchy for the metabolic network in Fig.2.3...... 53

Figure 2.5 A metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites...... 58

Figure 2.6 The GAI graphs before merging two GAI-GROUPs...... 58

Figure 2.7 Sketch of the SMDA algorithm ...... 60

Figure 2.8 SMDA time cost for a single network versus the number of observations for Glycolysis and TCA Cycle combined...... 63

Figure 3.1 A partial network with reversible reaction ...... 74

Figure 3.2 Partial depiction of the optimal flux distribution on epimastigote model of T. cruzi network...... 80

Figure 3.3 A complete network for Example 3.2 ...... 83

Figure 3.4 A partial network for Example 3.3...... 89

Figure 4.1 Visualization Tools and Applications ...... 93

Figure 4.2 Visualization of Albert2005-Glycolysis Model ...... 96

Figure 4.3 An example of built-in query (reaction-to-process mapping)...... 96

Figure 4.4 Visualization of a query (Figure 4.3) result...... 97

Figure 4.5 Glycolysis in Cytosol_Adipose and Cytosol_Liver...... 99

Figure 4.6 Catabolism of Phenylalanine pathway in PathCase-MAW editor...... 100

Figure 4.7 TCA cycle pathway in PathCase-MAW...... 101

Figure 4.8 SMDA query results...... 102

Figure 4.9 Fatty Acid Metabolism pathway in the iMM1415 model (2010)...... 104

Figure 4.10 E. Coli Textbook in PathCase-Recon System...... 105

Figure 4.11 An example of MQL query...... 105

Figure 4.12 MQL query result of the example in Figure 4.11...... 106

Figure 4.13 Visualization tools in PathCase systems ...... 107

Figure 5.1 Metabolite Identification Algorithm Sketch ...... 115

Figure 5.2 CandidatesM () function ...... 115

Figure 5.3 BST-Filter() function ...... 129

Figure 5.4 Reaction Identification Algorithm Sketch ...... 131

Figure 5.5 Model-to-model metabolite matching results for ...... 140

Figure 5.6 Model-to-model metabolite matching results for

29, Model2008_08_15_12_13_14> ...... 141

Figure 5.7 Model-to-model metabolite matching results for <03_16_09_TM_minimal_medium_glc, barkeri iAF692> ...... 142

Figure 5.8 Model-to-model reaction matching results for

...... 147

Figure 5.9 Model-to-model reaction matching results for

Model2008_08_15_12_13_14> ...... 148

Figure 5.10 Model-to-model reaction matching results for <03_16_09_TM_minimal_medium_glc, barkeri iAF692> ...... 149

Acknowledgements

First and foremost, I would like to express my deepest appreciation to my advisor, Prof. Gultekin Ozsoyoglu, for his guidance, encouragement and support, which helped me to finish this dissertation. It is only with his effective guidance, infinite patience, full trust and continuous faith in me that I have been able to succeed in my graduate studies and become a researcher. He has served as a role model in so many ways, not only in pursuing research goals, collaborating with other people, and guiding diverse students, but also in balancing work and life. He has been a great mentor, full of wisdom and endless energy. It is my great honor to have worked with him. His spirit of working hard, being thoughtful of others and sharing generously with others will accompany me for the rest of my life.

I would like to thank Drs. Podgurski, Cavusoğlu, Lai, and Ozsoyoglu for serving as members of my dissertation committee and for their constructive comments. I appreciate the time and effort that they spent on reading this work and providing me with feedback.

I am grateful to Dr. Z. Meral Ozsoyoglu for attending my presentations, research meetings, and providing precious feedback and suggestions throughout the whole period of my graduate studies.

I would like to thank my research collaborators and lab mates, Dr. Ali Cakmak, A. Ercument Cicek and Sarp Coskun. Dr. Cakmak has helped me not only in my research papers, but also in my studies. A. Ercument Cicek is a perfect colleague and friend, and we have collaborated on many projects and papers successfully. Sarp Coskun is full of enthusiasm in programming and new technologies, and he has made the lab a pleasurable place. Besides, everyone in the Databases and Bioinformatics Laboratory at Case Western Reserve University deserves acknowledgement and thanks for their friendship and warm company; it has been a great pleasure to work with and learn from all of you.

Most importantly, my deepest gratitude goes to my family. My parents and my sister have always been supportive of the decisions I have made. Their love, encouragement and absolute confidence are important sources of energy for me. Special thanks to my wife, Dr. Hong Guo, who helped me to start this journey, accompanied me in the process, and gave me endless love, patience and persistent support. It is her diligence, goodness, courage and self-sacrifice that have made this degree possible. She is always there for me.

I must thank our twenty-month old son, Yang Qi, who has caught up with this special journey and filled our lives with so much joy.

And finally, I would like to acknowledge the support from National Science Foundation grants DBI-0849956 and DBI-0743705, and from the National Institutes of Health under grant R01 GM088823.

Computational Analysis, Visualization and Text Mining of Metabolic Networks

Abstract

by

XINJIAN QI

With recent advances in experimental technologies, such as gas chromatography/mass spectrometry, the number of metabolites that can be measured in biofluids of individuals has markedly increased. Given a set of such measurements, a very common task for biologists is to identify the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and to interpret the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases.

This thesis presents the steady-state metabolic network dynamics analysis (SMDA) approach in detail. Experimental evaluation of the SMDA tool against a mammalian metabolic network database is also presented. The query output space of the SMDA tool can be reduced in two ways: (i) a larger number of observations exponentially reduces the output size, and (ii) exploratory search and browsing of the query output space allows users to find what they are looking for. SMDA is then applied to gene lethality testing.

Compared with other methods used for gene lethality testing, the SMDA algorithm (1) requires less input and (2) does not make optimality assumptions. The algorithm has been tested on the genome-scale reconstructed network of Trypanosoma cruzi, with the published gene lethality results for that network taken as ground truth.

Also, in this thesis, we study the general framework of the visualization tools as well as the distinct features of each tool in the PathCase systems, namely PathCase-SB, PathCase-MAW Editor, PathCase-MAW, PathCase-SMDA, PathCase-RCMN, PathCase-Recon, and PathCase-MQL.

Finally, this thesis proposes a number of metabolite/reaction identification techniques for Genome-Scale Reconstructed Metabolic Networks (GSRMNs) (by matching metabolites/reactions to corresponding metabolites/reactions of a source model or data source). We employ a variety of computer science techniques that include approximate string matching, similarity score functions and filtering techniques, all enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are made using the proposed metabolite/reaction identification techniques.

Introduction

With recent advances in experimental technologies, the number of metabolites measured in bio-fluids of organisms has markedly increased. Given a set of measurements, a common metabolomics task is to identify the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and interpret the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases.

In metabolic networks, gene lethality is defined in terms of essential metabolite availability. An essential metabolite is a metabolite without which the organism cannot stay alive. Existing gene lethality testing methods either require optimality conditions or other assumptions (such as “the quality of the biomass reaction and the assumption of biomass optimization which is debatable even for unicellular organisms” [1][2]), assume prior knowledge about the network (e.g., complete stoichiometry), or generate results that may not be biochemically meaningful.

PathCase-SB aims to integrate systems biology model data and metabolic network data from selected biological data sources. The PathCase-SB system provides a database-enabled framework and web-based computational tools to facilitate the development of kinetic models for biological systems. The visualization interface is an important component of the PathCase-SB system, and visualization interfaces/tools also exist in many other PathCase systems.

The number and use of Genome-Scale Reconstructed Metabolic Networks (GSRMNs) have been increasing in recent years. It is noted in the literature [3][4] that published GSRMNs have two basic limitations that prevent their full utilization. One is the inability to match metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG) or another GSRMN, due to naming inconsistencies involving species (metabolites), reactions, and compartments. The other is the difficulty of identifying pathways of a GSRMN.

To address the above problems, this thesis employs biological data analysis techniques for:

(1) Performing computational interpretation of metabolomics measurements;

(2) Applying SMDA to gene lethality testing;

(3) Implementing visualization tools for PathCase Systems; and

(4) Locating basic bio-entities in Genome-Scale Reconstructed Metabolic Networks.

Next we briefly describe the above-listed four different metabolic network-related problems that we study in this dissertation.

1.1 Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis

Given a set of observed measurements in a metabolic network, we present the steady-state metabolic network dynamics analysis (SMDA) approach to interpret the metabolic consequences in terms of physiological problems, nutritional deficiencies, or diseases.

SMDA needs no assumptions other than the steady-state assumption. SMDA can be viewed as both a constraint-based and a rule-based approach. It is constraint-based [5][6][7] in that it uses conditions (pre-stored in its database) to locate all “allowable states” [8] of the reconstructed metabolic network model (also pre-stored in its database). It is rule-based in that its graph-expansion and merge strategies employ a number of biochemistry rules to capture the underlying metabolic biochemistry as much as possible.

In this research, a complete condition- and rule-based model of metabolic network behavior is specified. We list the assumptions of our model and define the notion of (quasi-) steady state for the metabolic network. We present the notion of metabolite pool label identifiers, the three-valued logic used to specify metabolite pool label conditions, and the Activation Condition Sets for reactions and transport processes. Transport process rules and a number of basic biochemistry-based rules are also listed.

The SMDA algorithm runs in a cycle of two phases, Expansion and Merge, which repeats until all reactions and metabolite pools in the network are assigned a status. The Expansion phase starts from the labeled metabolite pools (observations), each of which is a flow-graph with a single metabolite pool. Expanded flow-graphs are then generated by adding neighboring reactions and metabolite pools to the original flow-graph; SMDA generates all possible combinations of label assignments to those neighboring pools and reactions. This process continues until all reactions and metabolite pools are assigned a label.

SMDA returns only two flux values for a reaction, namely, 0 (inactive) and 1 (active). The query output space of the SMDA tool is exponentially large in the number of reactions of the network. However, (i) larger numbers of observations exponentially reduce the output size, and (ii) exploratory search and browsing of the query output space allows users to mine and search for what they are looking for. The SMDA problems addressed here and their solutions are new and specific to the SMDA approach.
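To make the size argument concrete, the following is a minimal sketch, not the SMDA implementation. It enumerates every 0/1 labeling of the reactions of a toy network by brute force and counts how many labelings survive as observations are added. The network `REACTIONS`, the function names, the label names, and the simplified consistency rule (an Available pool needs at least one active producer; an Unavailable pool may be neither produced nor consumed by an active reaction) are illustrative assumptions that stand in for the activation condition sets and biochemistry rules of Chapter 2.

```python
from itertools import product

# Hypothetical toy network: reaction -> (consumed pools, produced pools).
REACTIONS = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"A"}, {"C"}),
    "r3": ({"B"}, {"D"}),
    "r4": ({"C"}, {"D"}),
}


def consistent(state, observations):
    """Check a 0/1 reaction labeling against metabolite pool observations.

    Simplified stand-in rule (an assumption, not the thesis's rule set):
    an Available pool needs an active producer; an Unavailable pool must
    not be produced or consumed by any active reaction.
    """
    for pool, label in observations.items():
        produced = any(state[r] and pool in outs
                       for r, (_, outs) in REACTIONS.items())
        consumed = any(state[r] and pool in ins
                       for r, (ins, _) in REACTIONS.items())
        if label == "Available" and not produced:
            return False
        if label == "Unavailable" and (produced or consumed):
            return False
    return True


def scenarios(observations):
    """Enumerate all activation/inactivation scenarios that fit."""
    names = sorted(REACTIONS)
    result = []
    for bits in product((0, 1), repeat=len(names)):
        state = dict(zip(names, bits))
        if consistent(state, observations):
            result.append(state)
    return result


for obs in ({}, {"D": "Available"}, {"D": "Available", "B": "Unavailable"}):
    print(len(scenarios(obs)), "scenarios for observations", obs)
```

With four reactions and no observations, all 2^4 = 16 labelings are returned; in this toy example one observation leaves 12 and two observations leave 2, which illustrates how additional observations shrink the output space.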

Advantages of SMDA include its ease of use and simplicity; it is designed as a “first-step” and “online” tool for wet lab researchers (a) to evaluate their hypotheses about observed measurements, and (b) to ask “what if” types of questions (i.e., knowledge discovery). The SMDA technique and its computational performance limits are evaluated using a mammalian metabolic network database [9].

Our work on evaluating the activation/inactivation scenarios of the metabolic network at steady state is related to metabolic network analysis techniques such as metabolic control analysis (MCA) [10], flux balance analysis (FBA) [11], metabolic flux analysis [12] and, finally, metabolic pathway analysis (MPA) (elementary flux modes (EFM) and extreme pathways (EP)) [13]. MCA and FBA solve a set of under-constrained differential equations; in comparison, our SMDA approach can be considered a rule-based knowledge discovery approach within a given metabolic network database. A detailed comparison between these techniques and SMDA can be found in Chapter 2.

1.2 Performing Gene Lethality Testing with SMDA

In this research, we have attempted a usefulness study for SMDA for the problem of gene lethality testing [14]. A gene is lethal if its knockout causes the unavailability of at least one essential metabolite in the organism at the steady state. In other words, a gene is lethal if its removal from the organism’s genome results in the non-production of at least one essential metabolite, and, thus, the death of the organism.

An SMDA-based gene lethality test is done in three steps. First, reactions catalyzed by the enzymes produced by the knocked-out gene are marked as inactive in the network. Then, all essential metabolite pools are labeled as Available. Finally, SMDA is run to check whether there is at least one feasible flow-graph in the metabolic network that produces and consumes each and every essential metabolite. The stopping conditions are adjusted accordingly: if no flow-graph exists, or any merge/expansion conflict is encountered during the process, the knocked-out gene is lethal; if any valid flow-graph is found, the knocked-out gene is non-lethal.

To verify a known lethal gene, we mark the reactions corresponding to the gene as inactive and run SMDA. When the algorithm terminates with no possible flow-graphs, or any merge/expansion conflict is encountered during the process, the gene is verified to be lethal. However, if SMDA produces even one possible result, the gene is said to be non-lethal, since the organism can still stay alive without the gene.
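The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SMDA implementation: the toy network, the gene-to-reaction map `GENE_REACTIONS`, the sets `EXTERNAL` and `ESSENTIAL`, and the feasibility rule are all hypothetical, and brute-force enumeration stands in for the expansion/merge cycle. The rule used here only checks that every essential metabolite is produced and that every active reaction has its substrates available; it does not model consumption of essential metabolites or merge/expansion conflicts.

```python
from itertools import product

# Hypothetical toy network: reaction -> (substrates, products).
REACTIONS = {
    "r1": ({"glc"}, {"g6p"}),
    "r2": ({"g6p"}, {"pyr"}),
    "r3": ({"glc"}, {"pyr"}),   # alternative route to pyr
    "r4": ({"pyr"}, {"atp"}),
}
GENE_REACTIONS = {"geneA": {"r1"}, "geneB": {"r4"}}  # assumed mapping
EXTERNAL = {"glc"}    # pools supplied from outside the network
ESSENTIAL = {"atp"}   # essential metabolites, required to be Available


def feasible(state):
    """True if this 0/1 labeling makes every essential metabolite available."""
    active = [r for r, on in state.items() if on]
    available = set(EXTERNAL)
    for r in active:
        available |= REACTIONS[r][1]
    # An active reaction must have all of its substrates available.
    if any(not REACTIONS[r][0] <= available for r in active):
        return False
    return ESSENTIAL <= available


def is_lethal(gene):
    """Knock out a gene and search for any feasible scenario."""
    knocked = GENE_REACTIONS[gene]
    free = sorted(set(REACTIONS) - knocked)
    for bits in product((0, 1), repeat=len(free)):
        state = dict(zip(free, bits))
        state.update({r: 0 for r in knocked})  # step 1: mark as inactive
        if feasible(state):                    # steps 2-3: essentials met
            return False                       # one valid scenario suffices
    return True                                # no scenario: gene is lethal


for gene in sorted(GENE_REACTIONS):
    print(gene, "lethal" if is_lethal(gene) else "non-lethal")
```

In this toy network, knocking out `geneB` removes the only producer of the essential metabolite, so no scenario is feasible, whereas knocking out `geneA` leaves the alternative route through `r3`. The early return mirrors the adjusted stopping condition: a single valid flow-graph is enough to declare the gene non-lethal.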

The SMDA gene lethality algorithm is validated on a reconstructed network of the core metabolism of Trypanosoma cruzi [16], a kinetoplastid parasite in humans that causes Chagas disease [17]. Trypanosoma cruzi has a small core reconstructed metabolic network [18] with 215 genes, 162 reactions, and 4 compartments. Seven genes are considered lethal according to the literature, the full model prediction, and the epimastigote model prediction reported in the Trypanosoma cruzi paper [15]. We obtained the reconstructed network model of Trypanosoma cruzi as an SBML document, then parsed and exported the model (with an in-house SBML parser) into our PathCase-RCMN database.

We take the model network as one input for SMDA. The other input, the metabolite pool observations, is derived from the availability of extracellular metabolites given in the paper supplement [16]. All seven lethal genes are verified with SMDA. We also selected one non-lethal gene in Trypanosoma cruzi, namely, adenosine kinase, and SMDA correctly verified its non-lethality. These results indicate that SMDA can be used for gene lethality testing.

SMDA has been compared with current methods used for gene lethality testing, for example, topological analysis of regulatory networks [2], Barabasi’s computational estimate method [3,4], and Flux Balance Analysis (FBA). The advantages of the SMDA algorithm are that it (1) requires less input and (2) does not make optimality assumptions. On the negative side, SMDA has limitations for very large networks, since it enumerates all possible activation/inactivation scenarios for the network at hand.

The complexity of SMDA can be reduced with a domain expert’s knowledge, by reducing the network (i.e., abstracting a pathway into a single abstract reaction), or by providing more observations. A detailed comparison between these techniques and SMDA can be found in Chapter 3.

1.3 Visualization Tools for PathCase Systems

Released in August 2010, the PathCase-SB system [17][18][19] brings together (i) systems biology sources, e.g., BioModels [20][21][22], and (ii) pathway sources, e.g., KEGG [23][24][25][26], with the goal of providing additional capabilities and tools made possible by the integration. Currently, PathCase-SB provides capabilities and interfaces for visualization, browsing, querying, simulation and comparison, model composition, and user model upload.

The visualization tool in the PathCase-SB system has the following new features: (1) integration of interactive pathway graph visualization; (2) display of a model according to its compartment hierarchy; and (3) side-by-side display of the model network and the pathway to show the mapping between them. The tool also offers visualization simplification capabilities, namely truncating long entity names and hiding common species. Another feature is layout manipulation: visualization layouts can be revised and saved.

The visualization interface is accessed from different places within PathCase-SB. It is employed by:

• Browser Interface
• Built-In Queries
• iModel Tool
• Model Composition Tool

Similar to the visualization interface of PathCase-SB, PathCase visualization tools present metabolic data, relationships in the data, and analysis results via a Java applet. These tools are components of many PathCase systems. Some features are adjusted or redesigned to meet the differing requirements of each system. For example, in PathCase-MAW’s visualization tool, common metabolites are duplicated for each reaction they participate in, which reduces edge clutter between common metabolites and reactions and yields a cleaner graph. In the PathCase-SMDA visualization tool, reversible reactions are connected via double edges to show the direction of flow.

The visualization tools are included in the following PathCase systems:

• PathCase-SB: PathCase Systems Biology Workbench, featuring BioModels models and KEGG pathways, with 409 systems biology models and 139 KEGG pathways.
• PathCase-MAW Editor: a stand-alone Java application for maintaining a mammalian metabolic database (MAW).
• PathCase-MAW: PathCase Metabolomics Analysis Workbench, featuring a manually created generic mammalian metabolic network with 27 pathways.
• PathCase-RCMN: PathCase ReconstruCted Metabolic Networks, with four models, namely, the Mus musculus iMM1554 model (2008), the Mus musculus iMM1415 model (2010), the H. sapiens Recon 1 model, and the Trypanosoma cruzi iSR215 model (2009).
• PathCase-Recon: PathCase RECON Workbench, featuring Genome-Scale Reconstructed Metabolic Networks and KEGG pathways, with 53 networks and 139 KEGG pathways.
• PathCase-SMDA: an online tool to analyze metabolomics data in terms of the dynamic behavior of the metabolic network under steady state.
• Metabolism Query Language Interface: an interface for querying the PathCase-MAW database with the Metabolism Query Language.

And three iPad applications:

• iPathCaseMAW: the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool,
• iPathCaseRCMN: the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks, and
• iPathCaseKEGG: the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

Finally, the visualization framework shared by all PathCase visualizations consists of the following steps:

• Designing an XML schema for the visualization data file,
• Defining parameters for web services to communicate with the PathCase applet on the client side,
• Retrieving information from the PathCase database based on the parameters,
• Composing the retrieved information into an XML data file, and
• Parsing the data file and rendering the visualization via the applet (except for the PathCase iPad applications).
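The compose and parse steps of this framework can be illustrated with a small round trip. The sketch below is written in Python purely for illustration (the PathCase client is a Java applet), and the element and attribute names (`graph`, `node`, `edge`, `id`, `kind`, `source`, `target`) are hypothetical; the actual PathCase XML schema is not given in this chapter.

```python
import xml.etree.ElementTree as ET


def compose(nodes, edges):
    """Compose retrieved graph data into an XML data file (as a string).

    nodes: list of (node_id, kind) pairs, e.g. ("glc", "metabolite")
    edges: list of (source_id, target_id) pairs
    """
    root = ET.Element("graph")  # hypothetical root element
    for node_id, kind in nodes:
        ET.SubElement(root, "node", id=node_id, kind=kind)
    for source, target in edges:
        ET.SubElement(root, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def parse(xml_text):
    """Parse the data file back into node and edge lists for rendering."""
    root = ET.fromstring(xml_text)
    nodes = [(n.get("id"), n.get("kind")) for n in root.findall("node")]
    edges = [(e.get("source"), e.get("target")) for e in root.findall("edge")]
    return nodes, edges


# Example data standing in for a database query result.
nodes = [("glc", "metabolite"), ("hexokinase", "reaction"), ("g6p", "metabolite")]
edges = [("glc", "hexokinase"), ("hexokinase", "g6p")]

data_file = compose(nodes, edges)
print(data_file)
print(parse(data_file))
```

Keeping the data file as the only contract between the server side and the client is what lets each PathCase system adjust its visualization features without changing the other steps.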

1.4 Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks

The basic bio-entity identification problem is defined as matching metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG or another GSRMN). This can be difficult due to naming inconsistencies involving species (metabolites), reactions, and compartments. Inconsistent prefixes, suffixes, spacing, numbers, and formulas are the main causes of the identification problem. Additionally, for reaction identification, differing numbers of alternative substrates in different GSRMNs complicate the problem further.

In this research, we focus on the basic bio-entity identification problem in a GSRMN model (referred to as the “target model” from here on) with respect to a “source model” (where the “source model” may easily be replaced by a “data source”, generalizing the identification problem). We propose three types of matches for metabolite identification and a multi-step identification process for reaction identification.

For metabolite identification in GSRMNs (by matching metabolites to corresponding metabolites of a source model or data source), we employ a variety of computer science techniques that include token-based approximate string matching, similarity score functions, and filtering techniques, namely, formula matching and biologically significant term matching. All techniques are enhanced by the underlying metabolic biochemistry-based knowledge.
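As a rough illustration of token-based approximate name matching, the sketch below splits metabolite names into tokens and scores a pair of names with Jaccard similarity. Jaccard similarity, the tokenization rule, the `threshold` default, and the function names are assumptions chosen for brevity; they are not the similarity score or filters defined in Chapter 5.

```python
import re


def tokens(name):
    """Lower-case a metabolite name and split it on non-alphanumerics."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def token_similarity(a, b):
    """Jaccard similarity of the two token sets, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_match(target, candidates, threshold=0.5):
    """Return the best-scoring candidate name, or None below the threshold."""
    if not candidates:
        return None
    score, name = max((token_similarity(target, c), c) for c in candidates)
    return name if score >= threshold else None


# Names that differ only in case, spacing, and hyphenation share all tokens.
print(token_similarity("D-Glucose 6-phosphate", "d-glucose-6-phosphate"))
print(best_match("alpha-D-Glucose", ["D-Glucose", "D-Fructose"]))
```

Token-level comparison tolerates the spacing and punctuation inconsistencies mentioned above, but it cannot tell apart names that share most tokens yet denote different compounds, which is where filters such as formula matching become necessary.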

Based on metabolite identification, reaction identification is performed via a three-step matching technique, namely, reaction name matching, reaction property matching, and reaction compound matching. For reaction compound matching, compounds are paired by exact name match, name length match, and core metabolite match. The reversibility property of the reactions, as well as ignorable metabolites, i.e., e-, H+, and H2O, are also taken into account in the matching process. A new reaction similarity score function is given to measure matching results.
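The role of ignorable metabolites and reversibility in compound matching can be sketched as follows. The scoring used here (the mean of the Jaccard similarities of the substrate sets and the product sets, with the swapped orientation also tried for reversible reactions) is an assumption made for illustration and is not the reaction similarity score function of Section 5.3.4; compound names are assumed to have been matched already, so plain string equality is used.

```python
# Compounds that are ignored when comparing reactions (per the text above).
IGNORABLE = {"e-", "h+", "h2o"}


def compounds(side):
    """Normalize one side of a reaction and drop ignorable metabolites."""
    return frozenset(c.lower() for c in side) - IGNORABLE


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def reaction_similarity(r1, r2, reversible=False):
    """Score two reactions given as (substrates, products) pairs.

    If either reaction is reversible, the swapped orientation of r2 is
    also tried and the better of the two scores is returned.
    """
    s1, p1 = compounds(r1[0]), compounds(r1[1])
    s2, p2 = compounds(r2[0]), compounds(r2[1])
    forward = (jaccard(s1, s2) + jaccard(p1, p2)) / 2
    if not reversible:
        return forward
    backward = (jaccard(s1, p2) + jaccard(p1, s2)) / 2
    return max(forward, backward)


hexokinase_a = (["ATP", "glucose"], ["ADP", "g6p", "H+"])
hexokinase_b = (["glucose", "atp"], ["g6p", "adp"])
reversed_b = (hexokinase_b[1], hexokinase_b[0])

print(reaction_similarity(hexokinase_a, hexokinase_b))
print(reaction_similarity(hexokinase_a, reversed_b))
print(reaction_similarity(hexokinase_a, reversed_b, reversible=True))
```

Dropping protons, electrons, and water before comparison prevents two otherwise identical reactions from scoring poorly only because one model balances them explicitly and the other does not.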

Since GSRMNs have few compartments, and compartment names are shorter than reaction or metabolite names, we employ a curated data set for compartment name matching. In the data set, compartments are grouped, and the various compartment names are mapped to their corresponding groups. We also collect compartment names from other data sources, e.g., KEGG, BioModels, and Reactome, to enhance this data set. Given a compartment name, we locate the compartment’s group in the data set. All compartments in the located group are considered identical to (matched with) the given compartment name.
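The group lookup can be sketched with a small dictionary. The groups and name variants in `COMPARTMENT_GROUPS` below are hypothetical examples and are not taken from the curated data set itself.

```python
# Hypothetical excerpt of a curated data set: group name -> known variants.
COMPARTMENT_GROUPS = {
    "cytosol": {"c", "cytosol", "cytoplasm"},
    "mitochondrion": {"m", "mitochondria", "mitochondrion"},
    "extracellular": {"e", "extracellular", "extracellular space"},
}


def group_of(name):
    """Return the group that contains this compartment name, or None."""
    key = name.strip().lower()
    for group, members in COMPARTMENT_GROUPS.items():
        if key in members:
            return group
    return None


def same_compartment(a, b):
    """Two names match when both map to the same known group."""
    group = group_of(a)
    return group is not None and group == group_of(b)


print(group_of("Cytoplasm"))
print(same_compartment("Cytoplasm", "c"))
print(same_compartment("m", "e"))
```

A curated lookup is practical here precisely because the set of compartment names is small; approximate matching on one-letter abbreviations such as "c" or "m" would be unreliable.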

The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs in a GSRMN database maintained by us [28], namely, “iAM303” and “E. coli textbook”, “H. pylori iIT341” and “EryNet”, “Model2008_09_23_13_13_29” and “Model2008_08_15_12_13_14”, and “03_16_09_TM_minimal_medium_glc” and “M. barkeri iAF692”. Our results indicate that significant accuracy gains are possible using the proposed metabolite/reaction identification techniques.

1.5 Thesis Organization

Chapter 2 presents Steady-State Metabolic Network Dynamics Analysis. In Chapter 3, gene lethality testing with SMDA is performed as one application of the SMDA tool. Chapter 4 presents the visualization interface framework and distinct features of the PathCase systems. In Chapter 5, similarity-score-based techniques for locating basic bio-entities in GSRMNs are explained. Chapter 6 concludes and discusses several future work directions.

Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis

2.1 Introduction

Currently, metabolomics data analysis necessitates time-consuming, extensive, manual cross-referencing of metabolic pathways in order to critically evaluate the measurement data. Recently, a novel in silico approach (IOMA) that integrates metabolomics data with a metabolic network model and infers metabolic fluxes was proposed [29]. IOMA (a) requires many pieces of information (e.g., availability of the stoichiometry matrix of the network, dissociation constants, enzyme turnover rates, mass balance constraints, flux capacity constraints, etc.), and (b) infers a single network state with all the computed metabolic fluxes. On the other hand, manual analysis of fluxes in small (and usually abstracted) sub-networks is quite common in life science publications.

As examples, see Figure 5 in Bederman et al. [30] and Figure 1 in Gasier et al. [31]. Researchers seek alternative activation/inactivation scenarios in small-scale networks, without needing or having access to additional information such as that required by IOMA. Note that, even for small networks, the number of possible flow (flux) scenarios grows exponentially with the size of the network, which makes manual enumeration error-prone. This manual process can be automated using computer science and bioinformatics techniques that employ biochemistry rules and constraints pre-stored in a metabolic network database. Once the results are obtained, users can also visualize and query them (e.g., “list those alternative flows where one targeted reaction is active, and another targeted reaction is inactive”).
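A query of this kind amounts to filtering the set of alternative scenarios. The sketch below assumes, purely for illustration, that each scenario is a dictionary mapping a reaction name to 1 (active) or 0 (inactive); the function name `select` and the example data are hypothetical and do not describe the query interface of the tool.

```python
def select(scenarios, active=(), inactive=()):
    """Keep scenarios where the targeted reactions have the given status."""
    return [
        s for s in scenarios
        if all(s[r] == 1 for r in active)
        and all(s[r] == 0 for r in inactive)
    ]


# Three alternative flow scenarios over the same three reactions.
alternatives = [
    {"r1": 1, "r2": 0, "r3": 1},
    {"r1": 1, "r2": 1, "r3": 0},
    {"r1": 0, "r2": 1, "r3": 1},
]

# "List those alternative flows where r1 is active and r2 is inactive."
print(select(alternatives, active=["r1"], inactive=["r2"]))
```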

In this chapter, we propose a database-enabled and graph-traversal-based technique, called SMDA (Steady-state Metabolic network Dynamics Analysis), that infers all allowable (flux) states of a network. Given a set of bio-fluid (e.g., blood) and tissue-based metabolite concentration measurements at steady-state, SMDA answers the query “list alternative steady-state metabolic network activation/inactivation (i.e., flux) scenarios, given the observed measurements”. That is, SMDA takes as input from the user (i) metabolomics data, and (ii) a metabolic sub-network, selected from a metabolic network database already made available to users, and produces a set of possible alternative flow scenarios (i.e., activation/inactivation scenarios) for the metabolic sub-network. SMDA then lets users further visualize and query the alternatives (not discussed in this thesis).

SMDA can be viewed as both a constraint-based and a rule-based approach. It is constraint-based [5][6][7] in that it uses conditions (pre-stored in a database) to locate all “allowable states” [8] of a sub-network in a metabolic network model (also pre-stored in a database). SMDA is rule-based in that its graph-expansion and merge strategies employ a number of biochemistry rules to capture the underlying metabolic biochemistry as much as possible.

Advantages of SMDA include:

• Ease of use and simplicity. It is designed as a “first-step” and “online” tool for biochemists and wet lab researchers to

o Evaluate their hypotheses about observed measurements in small-scale networks, and

o Use it as a “knowledge discovery” tool, e.g., for “what if” types of questions.

• No flux optimization. SMDA does not require knowledge of reaction kinetics or any utility/optimization function for flux optimization.

The disadvantages of SMDA include:

• SMDA returns only two flux values for a reaction: 0 (Inactive) and 1 (Active).

• As is the case with other techniques that return “all allowable states”, SMDA is inherently exponential in its output size. However, the computational performance of SMDA is acceptable for networks with up to 60 reactions (with some paths/pathways abstracted into “abstract reactions”; see Section 2.4 and the original paper[32][9]).

SMDA is implemented and functional as a prototype, both as an online tool called PathCase-SMDA[33], which is part of the PathCase family of applications[34], and as an iPad application named “PathCase MAW”[35].

1. SMDA Overview

Prior Preparation. We assume a fully hierarchical and compartmentalized metabolic network, i.e., one with tissues, organelles, etc., already available in a metabolic network database. The steady-state “activation conditions” (or the ACT condition set) for each reaction and transport process to be active are characterized a priori, saved in a database, and used during query-time analysis. Initially, the status values of all reactions and all metabolite pools in the metabolic network are Unknown.

Query-time Analysis. At query time, the user chooses a smaller metabolic sub-network (i.e., the query network) to query. SMDA takes the observed metabolite set and the selected query network as input, and executes the following steps.

Initialization. (i) For each bio-fluid-based metabolite observation, identify whether its transport processes are active or not (by checking, for each transport process, whether all conditions in its ACT set are satisfied or not). (ii) For each tissue-based metabolite observation, derive its metabolite pool label, which is one of Unavailable, Available, Accumulated, or Severely Accumulated.

Expansion and Merge: Metabolic Sub-Network Traversal and Active-Inactive Reaction Assessment. Starting with active/inactive transport processes and tissue-based observed metabolites, and continuing with metabolic reactions in tissues of the query network, locate iteratively those reactions with satisfied or unsatisfied ACT condition sets, and mark (i) those reactions whose ACT conditions are completely satisfied as Active, and (ii) those reactions whose ACT conditions contain at least one unsatisfied ACT condition as Inactive. (This process results in multiple expansions.) When two disconnected “active/inactive sub-networks” “touch” each other, merge them to obtain a larger active-inactive sub-network.

The above-summarized query-time analysis creates and iteratively expands multiple possible metabolic flux sub-graphs, called Active-Inactive Graphs (GAI), where, in each GAI graph, the status of each reaction and the label of each metabolite pool are clearly marked (i.e., no reactions or metabolite pools with “Unknown” status/label exist). The result is a set of GAI graph sets where each GAI graph set specifies one distinct alternative steady-state activation/inactivation scenario for the metabolic network. An alternative output to GAI graphs is flow-graphs, where a flow-graph is a GAI graph without metabolite pool labels; flow-graphs are utilized in Section 2.4. We give an example.

Figure 2.1 SMDA result as a single GAI graph

Example 2.0. Assume that the user selects Catabolism of Cysteine in liver as the metabolic sub-network to be queried (as shown in Figure 2.1), and has three observed metabolite measurements in cytosol: O2 as 80mM/L (we assume that O2 is “estimated”, as it is very difficult to measure O2 in the tissue of an intact organ), cysteine as 60µM/L, and SO3 (3-sulfino-L-alanine) as 80µM/L. Assume that the database conditions state that, in liver cytosol, “O2 is marked as Available if it is in [1, 100]mM/L”, “cysteine is marked as Available if it is in [1, 100]µM/L”, and “SO3 is marked as Available if it is in [1, 100]µM/L”. Thus, the SMDA initialization step concludes that O2, cysteine, and SO3 are all Available. The execution of the expansion step as summarized above concludes that there is only one flow-graph with only one GAI graph in the output of the query, as shown by the (actual) SMDA output of Figure 2.1.

In summary, given metabolomics observations and a query network, SMDA locates all possible alternative active-inactive network scenarios on the selected sub-network. This approach provides compact and complete steady-state views of possible metabolism dynamics as independent and alternative snapshots, in the form of user-friendly visual steady-state views of the metabolic network. Four issues arise. The first issue is prioritizing and ranking the different alternatives produced by SMDA. This issue is not discussed in this chapter; please see the supplement of the original paper[9] for a number of ranking mechanisms.

The second issue is related to the space complexity of SMDA: what happens when, for a large sub-network, there are many alternative GAI graphs? As a first response to this issue, SMDA switches to the use of flow-graphs, as opposed to GAI graphs, where a single flow-graph captures multiple GAI graphs. Second, SMDA allows for an exploratory search of the resulting GAI graphs. That is, an “interactive query” execution takes place where, as a response to the query, the user is given the total number of “possible results” (i.e., GAI graphs), and is then prompted to choose and view different GAI graphs or flow-graphs in the output with respect to participating metabolites and reactions. For example, the user is told, say, that Pyruvate dehydrogenase is active in two flow-graphs and inactive in four flow-graphs, and is given the option of viewing only the first two, or the latter four, or all six flow-graphs. We refer to this process as “exploratory search and browsing” of the SMDA query output search space.
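The exploratory search described above can be pictured as grouping flow-graphs by the status of a chosen reaction. The following is a minimal sketch and not the PathCase-SMDA implementation; the dictionary-based flow-graph representation and the function name `group_by_status` are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_status(flow_graphs, reaction):
    """Group flow-graph indices by the status of one reaction.

    Each flow-graph is assumed to be a dict mapping reaction names to
    "Active" or "Inactive" (a hypothetical representation).
    """
    groups = defaultdict(list)
    for index, graph in enumerate(flow_graphs):
        groups[graph[reaction]].append(index)
    return dict(groups)


# Six hypothetical flow-graphs: the reaction is active in two, inactive in four.
flow_graphs = (
    [{"Pyruvate dehydrogenase": "Active"}] * 2
    + [{"Pyruvate dehydrogenase": "Inactive"}] * 4
)
groups = group_by_status(flow_graphs, "Pyruvate dehydrogenase")
counts = {status: len(indices) for status, indices in groups.items()}
print(counts)
```

The user could then be offered `groups["Active"]`, `groups["Inactive"]`, or all indices for viewing.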

The third issue is related to the time complexity of SMDA. Given a large metabolic network, the SMDA output may grow so quickly and become so large that SMDA may not complete its execution within a reasonable amount of time. When this occurs, our suggestion to the user is to reduce the network size, either by eliminating sub-networks or by “abstracting” a sub-network (e.g., a pathway of a metabolism) into an “abstract reaction”. From our interactions with wet-lab biochemists, both approaches are quite common and are used extensively in practice to (manually) analyze the behavior of metabolic networks[30][31].

Finally, the fourth issue is about the way SMDA works: as described above, SMDA discretizes metabolite observations into four categories, namely, Unavailable, Available, Accumulated, or Severely Accumulated. This discretization can be done by users employing their domain expertise, as is done in Section 2.4. Or, it can be done automatically on the basis of ranges for each discretization, which are in turn obtained from the HMDB data source[36]. However, in some cases, HMDB reports multiple levels of “normal” ranges for a metabolite, leading to “observation misclassifications” in SMDA. This issue and the SMDA actions taken are discussed in a separate study[37].

All figures in this chapter are obtained from the web-based SMDA application[33]. The observation set of Example 2.0 is available on the web site of the browser-based application PathCase-MAW as “Sample Observation 0”, and running the SMDA Tool with Sample Observation 0 produces the results of Example 2.0. Figure 2.1 and Example 2.0 are from a manually constructed mammalian network database, available at the PathCase-MAW site[38]. All other examples and visualizations in figures of this chapter are obtained from the PathCase-RCMN (ReConstructed genome-scale Metabolic Network) application[39], which is, in turn, built by importing the SBML of the reconstructed metabolic network of the parasite Trypanosoma cruzi[15]. The SMDA tool, an evolution of the OMA Tool[40], is currently being beta-tested in cystic fibrosis metabolomics data analysis.

This chapter is organized as follows. Section 2.2 specifies a complete condition- and rule-based model of the metabolic network behavior. We

• List the assumptions of our model and define the notion of (quasi-)steady-state for the metabolic network,

• Introduce the notion of metabolite pool label identifiers,

• Employ a three-valued logic to specify metabolite pool label conditions and Activation Condition Sets for reactions as well as transport processes,

• List transport process rules, and, finally,

• Specify a number of basic biochemistry-based rules.

Section 2.3 presents the SMDA algorithm with its three steps, namely, the GAI (flow-) graph initialization, expansion, and merge steps. The SMDA algorithm iteratively constructs a GAI Generation Hierarchy where, when it terminates, each leaf node of the hierarchy contains one possible activation/inactivation scenario within the query sub-network. Section 2.4 presents a computational performance evaluation of the SMDA tool using the PathCase-MAW mammalian metabolic network database. SMDA can be viewed as a new approach within the category of metabolic network flux analysis techniques such as flux balance analysis[11], elementary flux modes[13], and extreme pathways[41]. Section 2.5 compares SMDA with these other techniques. Section 2.6 briefly concludes this chapter.

2.2 Condition-Based Modeling

2.2.1 Assumptions and Terminology

We make the following assumptions about our environment.

• The complete metabolic network is pre-captured and available in a metabolic network database.

• The metabolic network database models tissue-level compartmentalization; that is, it is a multi-tissue and multi-compartment (e.g., cytosol, mitochondrion, etc.) environment.

• The metabolic network is “sound” in the sense that all metabolites that are not in bio-fluids are both produced by (i.e., are a product of) at least one reaction and consumed by (i.e., are a substrate of) at least one reaction.

• Initially, we label each unmeasured metabolite pool size with the identifier “Unknown”. During query-time analysis, the labels may change into one of “Unavailable”, “Available”, “Accumulated”, or “Severely Accumulated”. The reason for non-quantitative labeling (as opposed to numerical size values) is that this work does not employ quantitative pool size estimation techniques, as discussed in more detail in Section 2.2.2.

• No a priori knowledge of the size of each metabolite pool is assumed, except for measured metabolites.

• Given a reaction r, and a metabolite m as a substrate, co-factor-in, or activator (product, co-factor-out, or inhibitor) of r, the knowledge of the lowest (highest) metabolite pool size label of m at steady-state for m to activate (inhibit) r, so that r is “active” (“inactive”), is assumed to be available. This is discussed in more detail in Section 2.2.4.

• The organism (represented by its metabolic network database) is queried when it is at a steady-state for a time interval T. Steady-state is defined in terms of two properties:

a. Production-Consumption Rate Equality (PCRE): During the time interval T, the rate of formation of every metabolite m is (almost) equal to its rate of degradation, i.e., all metabolite pool sizes (concentrations) remain (almost) constant during the time interval T. Put another way, the production rate of each metabolite is equal to its consumption rate.

b. Metabolite Pool Label Invariability (MPLI): During the time interval T, all metabolite pool labels stay the same. That is, if the label of a metabolite pool is Available, it stays Available during the time interval T.

The PCRE property at steady-state is a natural property, referring to the state of constancy or the homeostasis (equilibrium) of the organism. As an example, in the “fed” state of, say, humans, glucose, through Glycolysis, is catabolized to Acetyl CoA, which is converted to fatty acids or oxidized in the TCA Cycle. Although Acetyl CoA is available to both metabolic pathways (i.e., Fatty Acid Synthesis and the TCA Cycle), it does not accumulate, as the combined consumption rate of Acetyl CoA by Fatty Acid Synthesis and the TCA Cycle is (almost) the same as its production rate by Glycolysis.

We use the MPLI property in order to capture a snapshot of the metabolism when metabolite pool size labels also stay constant during steady-state. Next we define some terminology.

Definition (Metabolic Network). A metabolic network is a connected graph G(V, E) with a vertex set V of reactions and metabolite pools (a metabolite pool can be a substrate, regulator, or product in a reaction), and a directed edge set E such that there is an edge from node u to node v if (i) v is a reaction, and u is a substrate or regulator of v, or (ii) u is a reaction, and v is a product of u.

Definition (ProductionRate and ConsumptionRate of metabolite pool m): Consider any metabolite pool m, its producer reactions p_1, p_2, …, p_i, and its consumer reactions c_1, c_2, …, c_j. Let pr_{m,k} denote the production contribution rate of reaction p_k, 1 ≤ k ≤ i, for metabolite m, and cr_{m,v} denote the consumption contribution rate of reaction c_v, 1 ≤ v ≤ j, for metabolite m, during time period T. Then

• P_m = {(p_1, pr_{m,1}), (p_2, pr_{m,2}), …, (p_i, pr_{m,i})} is the active producer set of m, where each pair (p_k, pr_{m,k}) refers to a producer p_k of m and its contribution rate pr_{m,k}; and (pr_{m,1} + pr_{m,2} + … + pr_{m,i}) is the ProductionRate(m) of m; and

• C_m = {(c_1, cr_{m,1}), (c_2, cr_{m,2}), …, (c_j, cr_{m,j})} is the active consumer set of m, where each pair (c_v, cr_{m,v}) refers to an activated consumer c_v of m and its consumption rate cr_{m,v}; and (cr_{m,1} + cr_{m,2} + … + cr_{m,j}) is the ConsumptionRate(m) of m.

Below we formally characterize the notion of (quasi-)steady-state for the metabolism.

Definition ((quasi-)steady-state for an organism during a time period): Given an organism Org, its metabolites m_l, 1 ≤ l ≤ n, and two constants ε_{m_l} and T, the organism Org is said to be in a steady-state during the time period T if

(a) ProductionRate(m_l) = ConsumptionRate(m_l) ± ε_{m_l} for each m_l, 1 ≤ l ≤ n, during the time period T, and

(b) the label of each metabolite m_l, 1 ≤ l ≤ n, stays the same during the time period T.
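As an illustration of the two steady-state properties (PCRE and MPLI), the check below is a minimal sketch. It assumes that production and consumption rates and a sequence of sampled labels are available for each metabolite, which SMDA itself does not require; the function name and the data layout are hypothetical.

```python
def is_steady_state(production, consumption, labels_over_time, eps):
    """Check conditions (a) and (b) of the steady-state definition.

    production, consumption: dict metabolite -> rate during T
    labels_over_time: dict metabolite -> list of labels sampled during T
    eps: dict metabolite -> tolerance for that metabolite
    """
    for m in production:
        # (a) PCRE: production and consumption rates agree within eps.
        if abs(production[m] - consumption[m]) > eps[m]:
            return False
        # (b) MPLI: the pool label never changes during T.
        if len(set(labels_over_time[m])) != 1:
            return False
    return True


steady = is_steady_state(
    {"Acetyl CoA": 5.0},
    {"Acetyl CoA": 4.98},
    {"Acetyl CoA": ["Available", "Available"]},
    {"Acetyl CoA": 0.05},
)
print(steady)
```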

2.2.2 Metabolite Pool Label Identifiers

The purpose of metabolite pool label identifiers is to simplify the ACT (activation condition) set specifications for reactions and transport processes.

Definition (Metabolite pool label during a time period): Let T_AVAIL(m), T_ACC(m), and T_SAC(m), with T_AVAIL(m) < T_ACC(m) < T_SAC(m), be three threshold constants for a metabolite m, stored in the database. Given the metabolite pool m, the label of m during the time period T is marked with one of the following five identifiers.

• Unknown (id: -1): the metabolite pool size for m, denoted by Size(m), is unknown during time period T.

• Unavailable (id: 0): Size(m) is less than the threshold T_AVAIL(m) and ProductionRate(m) ≤ ε_m during time period T, where ε_m is a small constant.

• Available (id: 1): Size(m) is greater than or equal to the threshold T_AVAIL(m) and less than the threshold T_ACC(m) during time period T.

• Accumulated (id: 2): Size(m) is equal to or above the threshold T_ACC(m), but less than the threshold T_SAC(m) during time period T.

• Severely Accumulated (id: 3): Size(m) is equal to or above the threshold T_SAC(m) in time period T. This label is used for the product inhibition rule BC4 of Section 2.2.5.
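The label definition above amounts to a threshold lookup. The sketch below is one possible reading, with hypothetical threshold values; the case of a pool that is below the availability threshold but still has a non-negligible production rate is not covered by the definition, and returning Unknown for it is an assumption of this sketch.

```python
UNKNOWN, UNAVAILABLE, AVAILABLE, ACCUMULATED, SEVERELY_ACCUMULATED = -1, 0, 1, 2, 3


def pool_label(size, t_avail, t_acc, t_sac, production_rate=0.0, eps=1e-9):
    """Map a pool size to a label id using the three thresholds."""
    if size is None:
        return UNKNOWN
    if size >= t_sac:
        return SEVERELY_ACCUMULATED
    if size >= t_acc:
        return ACCUMULATED
    if size >= t_avail:
        return AVAILABLE
    # Below the availability threshold: Unavailable only if the
    # production rate is also negligible.
    if production_rate <= eps:
        return UNAVAILABLE
    return UNKNOWN  # assumption: case not covered by the definition


# Hypothetical thresholds (1, 100, 1000) in the metabolite's own units.
label = pool_label(60, t_avail=1, t_acc=100, t_sac=1000)
print(label)
```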

Note that there is a need to use different metabolite pool labels of Available and Accumulated because, for some reactions, “availability” of a metabolite m as a substrate (or regulator) may be sufficient for the reaction (i) to be active through substrate availability (provided that there are no other inhibiting mechanisms) or (ii) to experience the regulating effect (i.e., inhibition/activation) of m, in those cases where m is a regulator. However, for activation/regulation, other reactions may require the “accumulation” of m, at least at moderate levels. We give an example.

Example 2.1. Acetyl CoA is an allosteric activator of the first (also the committed) step in Gluconeogenesis, which is catalyzed by pyruvate carboxylase, and pyruvate carboxylase activation needs Acetyl CoA accumulation. In the fed state of the organism, Acetyl CoA is produced by Glycolysis (hence, it is Available), but does not accumulate (hence, it is not Accumulated). Thus, pyruvate carboxylase is not activated, which leads to the inactivation of the Gluconeogenesis pathway. But, in the fasting state of the organism, Acetyl CoA is produced by Beta Oxidation, and consumed by the TCA Cycle and Ketone Body Synthesis. In this case, accumulation of Acetyl CoA occurs (slowly, but steadily), since its production rate by Beta Oxidation is higher than its combined consumption rate by the TCA Cycle and Ketone Body Synthesis.

2.2.3 Metabolite Label Condition Characterization

The metabolite label condition C about the label identifier q of a metabolite pool m is denoted as C<q, m>.

Example 2.2. Ketone Body Synthesis requires the accumulation of Acetyl CoA to use it as a substrate. Then, the required condition can be stated as C<Accumulated, Acetyl CoA> or, equivalently, as C<2, Acetyl CoA> when the id of Accumulated is used.

We employ three-valued logic (True, False, Unknown) in evaluating conditions about metabolite pool labels of reactions.

Definition (Satisfaction of a metabolite label condition): A metabolite label condition C<q, m> is

(i) True if m is marked with the identifier qActual where either (a) 0 < q.id ≤ qActual.id or (b) q.id = qActual.id = 0 holds,

(ii) False if m is marked with the identifier qActual where either (qActual.id ≠ -1 and qActual.id < q.id) or (q.id = 0 and qActual.id > 0),

(iii) Unknown if m is marked with the identifier qActual where qActual.id = -1.

Example 2.3. The condition C<Accumulated, Acetyl CoA> (or, C<2, Acetyl CoA>) from Example 2.2 is True when the corresponding pool of Acetyl CoA has the label Accumulated (id: 2) or Severely Accumulated (id: 3).
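The three-valued satisfaction test can be written directly from the definition. This is a minimal sketch; the function name `evaluate` and the use of Python's `None` for Unknown are illustrative choices, and q.id is assumed to be one of 0, 1, 2, or 3.

```python
def evaluate(q_id, actual_id):
    """Evaluate condition C<q, m> given the actual label id of m.

    Returns True, False, or None (standing for Unknown).
    """
    if actual_id == -1:
        return None  # (iii) the pool label is Unknown
    if 0 < q_id <= actual_id or q_id == actual_id == 0:
        return True  # (i)
    return False     # (ii) the remaining cases for known labels


# Example 2.3: C<2, Acetyl CoA> under each possible actual label id.
results = {actual: evaluate(2, actual) for actual in (-1, 0, 1, 2, 3)}
print(results)
```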


Definition (Negation of a Condition): Negation of a condition C<q, m> is denoted as ¬C<q, m>. ¬C<q, m> is True if m is marked with an identifier qActual such that either (a) qActual.id ≠ -1 and qActual.id < q.id, or (b) q.id = 0 and qActual.id > 0.

Example 2.4. The negation of the condition from Example 2.2, i.e., ¬C<Accumulated, Acetyl CoA>, is True only when Acetyl CoA is marked as Available (id: 1) or Unavailable (id: 0) (i.e., no active producer).
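A short sketch of the negation test follows. The definition only states when the negated condition is True; treating an Unknown pool label as an Unknown result is an assumption of this sketch, as is the function name.

```python
def negation(q_id, actual_id):
    """Evaluate the negated condition for a pool with label id actual_id."""
    if actual_id == -1:
        return None  # assumption: Unknown label gives an Unknown result
    return actual_id < q_id or (q_id == 0 and actual_id > 0)


# Example 2.4: negation of C<2, Acetyl CoA> for each known label id.
negated = {actual: negation(2, actual) for actual in (0, 1, 2, 3)}
print(negated)
```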

Definition (Conflicting Conditions): Two conditions C1<q1, m> and C2<q2, m>, which are defined on the same metabolite m, are in conflict if there is no possible pool label identifier for m that would satisfy both C1 and C2.

Example 2.5. C1 is in conflict with C2

CoA>.

Definition (Condition Subsumption): Condition C1<q1, m> subsumes another condition C2<q2, m> if C2 is satisfied whenever C1 is satisfied.

Example 2.6. C1<Accumulated, m> subsumes C2<Available, m>.
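Because only four known label ids exist, conflict and subsumption between two (non-negated) conditions on the same metabolite can be decided by enumeration. The sketch below is an illustration under that assumption; the function names are hypothetical.

```python
LABEL_IDS = (0, 1, 2, 3)  # Unavailable, Available, Accumulated, Severely Accumulated


def satisfied(q_id, actual_id):
    """True if a condition with id q_id holds for a known label id."""
    return 0 < q_id <= actual_id or q_id == actual_id == 0


def in_conflict(q1, q2):
    """No label id satisfies both conditions."""
    return not any(satisfied(q1, a) and satisfied(q2, a) for a in LABEL_IDS)


def subsumes(q1, q2):
    """The second condition holds whenever the first one does."""
    return all(satisfied(q2, a) for a in LABEL_IDS if satisfied(q1, a))


print(in_conflict(0, 1), subsumes(2, 1))
```

Here a condition requiring Unavailable (0) conflicts with one requiring Available (1), and a condition requiring Accumulated (2) subsumes one requiring Available (1), but not the other way around.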

2.2.4 Trigger Values and Activation Condition Sets for Reactions, Transport

Processes, or Pathways

The label of a reaction r, a transport process Tc1-to-c2 from compartment c1 to compartment c2 (not to be confused with the time interval T), or an “abstract pathway” can be one of active, inactive, or unknown, as discussed next.

2.2.4.1 Reaction


We start with the notion of a “metabolite trigger value” for a reaction, which can be either Available or Accumulated.

Definition (Trigger value for metabolite m for reaction r to be active): Let m be a metabolite involved in a reaction r. For r to be active, metabolite m is said to have a trigger value t_{m,r}, where t_{m,r} ∈ {Available, Accumulated}, if

(i) m is a substrate, cofactor-in, or an activator of r, and the metabolite pool identifier for m is t_{m,r}, or

(ii) m is an inhibitor of r, and the metabolite pool identifier for m is below (the integer id value of) t_{m,r}.

Each reaction r (or pathway) is associated with a set of participating metabolite pools and their predetermined trigger values, already available in a database. Each reaction (or a pathway) is associated with a set of “activation conditions” (i.e., ACT set), which are created based on the participating metabolites and their trigger values, as discussed next.

Definition (Activation Condition Set of a Reaction/Pathway): Activation condition set of a reaction (or a pathway) r, denoted as ACT(r), defines the conditions for r to be active, and is constructed as follows.

o For each m in reaction r where m is a substrate/cofactor-in/activator of r with trigger value t_{m,r}, C<t_{m,r}, m> ∈ ACT(r) where t_{m,r} ∈ {1, 2} (1 and 2 are the ids of the Available and Accumulated labels, respectively).

o For each m in r where m is an inhibitor of r with trigger value t_{m,r}, ¬C<t_{m,r}, m> ∈ ACT(r) where t_{m,r} ∈ {1}.

o For each m in r where m is a product/cofactor-out of r, ¬C<3, m> ∈ ACT(r) (product inhibition rule BC4; 3 is the id of the Severely Accumulated label).

o If the ratio Tr = Size(m1)/Size(m2) of energy metabolite pairs is specified as an activator for r, then C1(Accumulated, m1) ∈ ACT(r), and ¬C2(Accumulated, m2) ∈ ACT(r). If Tr is an inhibitor for r, then ¬C1(Accumulated, m1) ∈ ACT(r), and C2(Accumulated, m2) ∈ ACT(r).

As mentioned before, the activation condition set ACT of each reaction is defined a priori (offline) before any metabolomics analysis is carried out.
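The construction of an ACT set from reaction participants and their trigger values can be sketched as follows. This is an illustration only: the tuple layout `(negated, q_id, metabolite)` and the function name are assumptions, the conditions for inhibitors and products are read here as negated conditions (an inhibitor must stay below its trigger value, and a product must not be Severely Accumulated), and the energy-ratio case is left out. The trigger values in the usage example are hypothetical.

```python
def build_act(participants):
    """Build an ACT set from (metabolite, role, trigger_id) triples.

    Each condition is a tuple (negated, q_id, metabolite); trigger_id is
    ignored for products and cofactor-outs.
    """
    act = set()
    for metabolite, role, trigger in participants:
        if role in ("substrate", "cofactor-in", "activator"):
            act.add((False, trigger, metabolite))
        elif role == "inhibitor":
            # Assumed reading: the pool must stay below the trigger value.
            act.add((True, trigger, metabolite))
        elif role in ("product", "cofactor-out"):
            # Assumed reading: the product is not Severely Accumulated (id 3).
            act.add((True, 3, metabolite))
    return act


# Pyruvate carboxylase with hypothetical trigger values.
act = build_act([
    ("pyruvate", "substrate", 1),
    ("Acetyl CoA", "activator", 2),
    ("oxaloacetate", "product", None),
])
print(sorted(act, key=lambda c: c[2]))
```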

2.2.4.2 Transport Processes

We view each transport process Tc1-to-c2 as having one metabolite transported from compartment c1 to compartment c2, subject to the activation condition set ACT for Tc1-to-c2.

We give an example.

Example 2.7. The transport process Tblood-to-muscle(glucose) of glucose from blood to muscle may be characterized within the ACT set as {C<Available, glucose>, C<Available, insulin>}. That is, for glucose to be transported from blood to muscle, both glucose and insulin must be at least Available. On the other hand, the transport process Tmuscle-to-blood(glutamine) of glutamine from muscle to blood can be conditioned based on its availability in muscle, i.e., ACT(Tmuscle-to-blood(glutamine)) contains {C<Available, glutamine>}.

We have the following transport process rules.


Rule TR1. Let c1 and c2 be two compartments, m be an observed metabolite in compartment c1, and Tc1-to-c2 (m, c1, c2) be m’s transport process from c1 to c2. Assume that pool label of m in c2 is Unknown. Then if ACT(Tc1-to-c2) is satisfied then Tc1-to-c2 (m) is active; otherwise, it is inactive.

Rule TR2. For active transport processes (i.e., the ACT set is satisfied), we assume that the metabolite pool of the product has the same label as the substrate.

Rule TR3. For transport processes, the product inhibition rule (please see rule BC4 of Section 2.2.5) does not apply.
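Rules TR1 and TR2 can be illustrated with a small sketch. The data layout (labels keyed by metabolite and compartment, ACT conditions as pairs of a required id and a pool) and the function names are assumptions for illustration; the sketch also assumes, as TR1 does, that the label of the metabolite in the target compartment is Unknown beforehand.

```python
UNKNOWN = -1


def holds(q_id, actual_id):
    """True only if a condition with id q_id is satisfied by a known label id."""
    if actual_id == UNKNOWN:
        return False
    return 0 < q_id <= actual_id or q_id == actual_id == 0


def apply_transport(act, labels, metabolite, source, target):
    """Apply rules TR1 and TR2 to one transport process.

    act: list of (q_id, (metabolite, compartment)) conditions
    labels: dict mapping (metabolite, compartment) to a label id
    """
    # TR1: active if every ACT condition is satisfied, otherwise inactive.
    active = all(holds(q_id, labels.get(pool, UNKNOWN)) for q_id, pool in act)
    if active:
        # TR2: the transported pool takes the label of the source pool.
        labels[(metabolite, target)] = labels[(metabolite, source)]
    return "Active" if active else "Inactive"


# Example 2.7: glucose from blood to muscle needs glucose and insulin Available.
labels = {("glucose", "blood"): 1, ("insulin", "blood"): 1}
act = [(1, ("glucose", "blood")), (1, ("insulin", "blood"))]
state = apply_transport(act, labels, "glucose", "blood", "muscle")
print(state, labels[("glucose", "muscle")])
```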

2.2.4.3 Steady-State Labels for Reactions and Transport Processes

We define the steady-state label of a reaction/transport process as one of Active, Inactive, or Unknown, based on the satisfaction of its associated activation condition set ACT.

Definition (Active, Inactive, or Unknown reaction/transport process state): Given a reaction/transport process r with an associated activation condition set ACT(r) defined on the participating metabolites, r is said to be Active (i.e., having a nonzero flux) during the steady-state time period if

(i) All conditions in ACT(r) are satisfied; i.e., all conditions that involve substrates, cofactors, and products of r are satisfied, and

(ii) Among the conditions involving regulators of r, those conditions that include regulator(s) with the highest precedence are satisfied.

Reaction/transport process r is Inactive if there is at least one unsatisfied condition in ACT(r). Otherwise, the state of r is Unknown.

Note that, for some reactions there may be multiple activators and inhibitors, in which case, we assume that (a) we have a priori information about the precedence of regulators, and (b) we make use of such precedence information in deciding whether the reaction is active or inactive.
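The three possible states can be derived from the three-valued results of the individual ACT conditions. The sketch below is a simplification: it assumes that regulator precedence has already been resolved, so that only the conditions that count are passed in, and it uses `None` for an Unknown condition value.

```python
def reaction_state(condition_values):
    """Derive a reaction state from True/False/None condition values."""
    values = list(condition_values)
    if any(v is False for v in values):
        return "Inactive"  # at least one unsatisfied condition
    if all(v is True for v in values):
        return "Active"    # every condition is satisfied
    return "Unknown"       # nothing unsatisfied, but something is Unknown


states = [
    reaction_state([True, True]),
    reaction_state([True, None]),
    reaction_state([None, False]),
]
print(states)
```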

2.2.5 Biochemistry-Based Rules

Next, we list a number of basic biochemistry (BC)-based rules that we use in the rest of this chapter.

Rule BC1. For each reaction, when multiple regulators with conflicting regulatory effects (activation or inhibition) on an enzyme are in place, the regulator with the strongest effect (highest precedence) on the enzyme is considered, and the other regulators are ignored.

The regulated reactions in a pathway may be classified as rate-limiting and committed steps. Once the committed step takes place, other reactions in the pathway follow this reaction until the end-product is produced, provided that none of the other regulated processes are blocked or inhibited. A committed step of a pathway is usually one of the early irreversible reactions in the pathway. As an example, in glycolysis, the committed step is the same as the rate-limiting step, PFK1.

Rule BC2. If the committed step of a pathway p is blocked (i.e., inactive), then p is Inactive (i.e., all reactions in p are Inactive).

We associate each compartment with particular pools of metabolites as its input and output. We then connect two compartments in the metabolic network if a transport process connects the two.


Rule BC3. Each input and/or output metabolite of a compartment is associated with a transport process (pre-captured and modeled in the database). A transport reaction and an enzymatic metabolic reaction are connected if they share at least one metabolite pool (i.e., as their substrate and/or product).

Due to similarities in the way they bind to enzymes, substrates are in competition with products to bind to their enzymes. As the concentration of products increases, this competition slows down the rate at which enzymes bind the substrates. Hence, the reaction rate decreases. Eventually, when the product accumulation reaches high levels, the corresponding reaction is inhibited dramatically.

Rule BC4. Whenever a non-bio-fluid metabolite m is marked as “Severely Accumulated”, all reactions that produce m (and, therefore, due to the steady-state assumption, all reactions that consume m) are Inactive.

The next set of rules follows from the steady-state assumption.

Rule BC5. If all producers (consumers) of a metabolite pool m are Inactive then, due to the PCRE property, regardless of the pool label of m, all consumers (producers) of m are Inactive.

Rule BC6. If at least one producer (consumer) of a metabolite m is Active, then (i) m is either Available or Accumulated, and (ii) at least one consumer (producer) of m is Active.

Rule BC7. If the metabolite m is Unavailable, then all consumers of m (and, thus, due to the steady-state assumption, all producers of m) are Inactive.

Rule BC8. Substrate and product labels of a transport process with no conditions are always the same.

Next, using rules BC1-8, we specify the notion of “inconsistent” metabolite pool and reaction label assignments.

Definition (Inconsistency): For each Rule BCi, 1 ≤ i ≤ 8, violation of Rule BCi in terms of metabolite pool and/or reaction label assignments constitutes an inconsistency in metabolite pool and reaction labels.

For example, as a product of an Active reaction r, the label of metabolite pool m should not be Severely Accumulated, since it violates Rule BC4.
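The steady-state rules can be checked pool by pool. The sketch below covers only rules BC4 to BC7 for a single non-bio-fluid metabolite pool and assumes that every producer and consumer already has a known status (Active or Inactive); the function name and data layout are hypothetical.

```python
def violated_rules(label_id, producers, consumers, status):
    """Return the rules among BC4-BC7 violated around one metabolite pool.

    label_id: 0..3 label id of the pool
    producers, consumers: lists of reaction names
    status: dict reaction name -> "Active" or "Inactive"
    """
    prod_active = any(status[r] == "Active" for r in producers)
    cons_active = any(status[r] == "Active" for r in consumers)
    any_active = prod_active or cons_active
    violated = set()
    if label_id == 3 and any_active:
        violated.add("BC4")  # Severely Accumulated pool with an active reaction
    if prod_active != cons_active:
        violated.update({"BC5", "BC6"})  # one side active, the other fully inactive
    if any_active and label_id not in (1, 2):
        violated.add("BC6")  # pool must be Available or Accumulated
    if label_id == 0 and any_active:
        violated.add("BC7")  # Unavailable pool with an active reaction
    return sorted(violated)


status = {"r": "Active", "c": "Active"}
# Product of an Active reaction labeled Severely Accumulated (id 3).
print(violated_rules(3, ["r"], ["c"], status))
```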

2.3 Active/Inactive Graph Generation And Expansion

Starting from a given set of observations, we employ iterative backward and forward reasoning with the goal of identifying possible metabolic mechanisms which may have led to the observed changes. We first give some definitions.

Definition (Reaction Participants): Given a reaction r, RP(r) is the set of substrates, products, and regulators of r (i.e., “Reaction Participants” of r).

We refer to a metabolite pool concentration measurement as an observation. Next we define the notion of Active/Inactive graph, which has labeled reactions and labeled reaction participants.

Definition (Active/Inactive Graph GAI): An active/inactive graph GAI(RAI, fAI, SRP, fRP, M, O) is a connected subgraph of the metabolic network M with respect to a set O of observations where (i) RAI consists of a set of reactions or pathways in the subgraph GAI(RAI, fAI, SRP, fRP, M, O), (ii) each reaction/pathway in RAI is assigned a label of Active or Inactive through the function fAI: RAI → {Active, Inactive}, (iii) SRP is the set of reaction participants (i.e., substrates, products, activators, etc.) of reactions in RAI, and (iv) each reaction participant in SRP is assigned a label of Unavailable, Available, Accumulated, or Severely Accumulated through the function fRP: SRP → {Unavailable, Available, Accumulated, Severely Accumulated}.

During the GAI graph generation process, inconsistencies in GAI are avoided where inconsistency is as defined in Section 2.2.5.

2.3.1 Initial GAI Generation

A generated GAI graph should be valid, as defined below.

Definition (Valid Active/Inactive Graph): A GAI graph is valid when

a. All metabolite pool/reaction labels in GAI are consistent.

b. For all active reactions r in GAI, ACT(r) is satisfied, and for all inactive reactions r in GAI, ACT(r) contains at least one unsatisfied condition.

2.3.1.1 Converting Observations into Metabolite Pool Labels

As discussed in the introduction, there are two alternative ways of converting metabolite observations into discretized metabolite pool labels of Available, Unavailable, Accumulated, or Severely Accumulated. First, users can decide on these labels themselves using their domain expertise. Second, given a quantitative concentration statement on a metabolite pool m, SMDA compares the value with threshold constants (obtained from HMDB) for the metabolite m, and then marks m with the corresponding label identifier. SMDA marks m with only one identifier, which is the highest satisfied identifier. However, thresholds obtained from HMDB may be problematic: (1) HMDB may have more than one “normal” level for a metabolite, or (2) there may be no information at all. Please see Cicek et al.[37] for more details.

When observations on metabolite concentrations are converted into one of Unavailable, Available, Accumulated, or Severely Accumulated, to distinguish between metabolites in different compartments, we use the underscore notation, and refer to metabolite m in compartment c as “m_c”.

Finally, for each observed bio-fluid metabolite, we investigate iteratively which of the possible GAI graphs (initially, each contains only one measured bio-fluid metabolite) is valid. We illustrate the initial GAI graph construction with an example from the metabolic network of T. cruzi (all network visualizations, except Figure 2.1, are from PathCase-

RCMN for T. cruzi).

Example 2.8. Let pi be an observed metabolite in compartment cytosol, denoted as pi_c, and let the label of pi_c be Available. Let phosophatetransporter,peroxisome and phosphatetransportl be two transport processes transporting pi_c from compartment cytosol to compartment glycosome, and from cytosol to compartment mitochondria, respectively (see Figure 2.2). By evaluating their ACT sets, we determine whether the two transport processes phosophatetransporter,peroxisome and phosphatetransportl are active (by Rule BC8, at least one must be active). This means that one of the three alternative GAI graphs involving pi_c is consistent.

2.3.2 GAI Graph Expansion

Each valid GAI graph is iteratively expanded at each step with a set of reactions and/or transport processes. We start with some definitions.

Definition (Distance between two metabolite pools): The number of reactions on the shortest path that connect two metabolite pools, regardless of reaction directions, is the distance between two metabolite pools.
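The distance defined above can be computed with a breadth-first search in which every traversed reaction counts as one step and reaction directions are ignored. The sketch below is only an illustration: the dictionary-based network representation and the toy reaction and pool names are assumptions, not the PathCase data model.

```python
from collections import deque
from typing import Dict, Optional, Set


def pool_distance(reactions: Dict[str, Set[str]], src: str, dst: str) -> Optional[int]:
    """Number of reactions on the shortest undirected path from src to dst.

    reactions maps a reaction name to the set of metabolite pools that
    participate in it (substrates and products alike, direction ignored).
    Returns None when the two pools are not connected.
    """
    if src == dst:
        return 0
    # Index: pool -> reactions it participates in.
    by_pool: Dict[str, Set[str]] = {}
    for rxn, pools in reactions.items():
        for p in pools:
            by_pool.setdefault(p, set()).add(rxn)

    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        pool, dist = queue.popleft()
        for rxn in by_pool.get(pool, ()):
            for nxt in reactions[rxn]:
                if nxt == dst:
                    return dist + 1  # one more reaction reaches dst
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None


# Toy chain a -r1- b -r2- c -r3- d (hypothetical names).
toy = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": {"c", "d"}}
print(pool_distance(toy, "a", "d"))
```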

Figure 2.2 Illustration of three alternative versions of transport processes

Definition (Border Metabolite Pool): Given a metabolite pool m and a nonempty active/inactive graph GAI(RAI, AI, SRP, RP, M, O), m is called a border metabolite pool of GAI if, in the metabolic network M, there is a pair of reactions (r1, r2) such that m participates in both r1 and r2, r1 ∈ RAI, and r2 ∉ RAI.

Note that when a GAI graph contains only a single metabolite pool m, m becomes a border metabolite pool of GAI. Also, the label of a border metabolite in a GAI graph is one of Unavailable, Available, Accumulated, or Severely Accumulated, and never Unknown.

We denote the border metabolite pool set of GAI as BMP(GAI).
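As an illustration of the border metabolite pool definition, the following sketch computes BMP(GAI) for a network given as a mapping from reactions to their participating pools. The representation and all names are hypothetical; the reaction-free case follows the note above about a graph that contains only a single metabolite pool.

```python
from typing import Dict, Set


def border_pools(network: Dict[str, Set[str]], r_ai: Set[str],
                 pools_in_gai: Set[str]) -> Set[str]:
    """Border metabolite pools of a GAI graph.

    network: reaction -> participating pools, for the whole network M.
    r_ai: reactions already in the GAI graph (RAI).
    pools_in_gai: metabolite pools already in the GAI graph.
    """
    if not r_ai:
        # Initial graph without reactions: its pools are the border pools.
        return set(pools_in_gai)
    inside: Set[str] = set()   # pools touching a reaction in RAI
    outside: Set[str] = set()  # pools touching a reaction not in RAI
    for rxn, pools in network.items():
        (inside if rxn in r_ai else outside).update(pools)
    # A border pool participates in both kinds of reactions.
    return inside & outside


# Toy chain a -r1- b -r2- c -r3- d (hypothetical names).
toy = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": {"c", "d"}}
print(border_pools(toy, {"r1"}, {"a", "b"}))
```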

The process of extending a given GAI graph to a new GAI graph via the addition of new reactions connected to its border metabolite pools is called GAI graph expansion. The newly added reactions of the GAI graph are assigned the label values of either Active or Inactive (which are consistent, i.e., not in conflict with the existing reaction label assignments in the graph). If there is no such consistent expansion, then the expansion is terminated. Next we characterize the GAI graph expansion process.

Definition (GAI graph expansion): Let GAI(RAI, AI, SRP, RP, M, O) denote the original GAI graph to be expanded;

$G_{AI}^{exp}(R_{AI}^{exp}, AI^{exp}, S_{RP}^{exp}, RP^{exp}, M, O)$ denote one of the alternative GAI graph expansions of GAI(RAI, AI, SRP, RP, M, O);

BMP(GAI) denote the set of all border metabolite pools of GAI;

NRS(BMP(GAI)) denote the set of (“new”) reactions involved with border metabolites and not (yet) in GAI, i.e., those reactions r where r has, as a substrate/product/regulator, a metabolite pool in BMP(GAI) and r is not in RAI;

NMP(NRS(BMP(GAI))) denote the set of (new) metabolite pools p where p participates, as a substrate, product, or regulator, in a reaction of NRS(BMP(GAI)) and p is not in SRP.

Then the expansion $G_{AI}^{exp}(R_{AI}^{exp}, AI^{exp}, S_{RP}^{exp}, RP^{exp}, M, O)$ is characterized as follows.

(i) $R_{AI}^{exp} = R_{AI} \cup NRS(BMP(G_{AI}))$

(ii) $S_{RP}^{exp} = S_{RP} \cup NMP(NRS(BMP(G_{AI})))$

(iii) Each r in $R_{AI}^{exp}$ is assigned the label of Active or Inactive through the function $AI^{exp}: R_{AI}^{exp} \rightarrow$ {Active, Inactive}.

(iv) Each metabolite in $S_{RP}^{exp}$ is assigned one of the labels Unavailable, Available, Accumulated, Severely Accumulated through the function $RP^{exp}: S_{RP}^{exp} \rightarrow$ {Unavailable, Available, Accumulated, Severely Accumulated}.

(v) $AI^{exp}$ is consistent with AI.

(vi) $RP^{exp}$ is consistent with RP.

End of Definition.
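The set-level part of the expansion, items (i) and (ii), can be sketched as below. The sketch covers only the computation of the new reaction and metabolite pool sets; the label functions of items (iii) to (vi) are not modeled. The network representation (reaction mapped to its participating pools) and all names are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def expansion_sets(network: Dict[str, Set[str]], r_ai: Set[str],
                   s_rp: Set[str], bmp: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return the expanded reaction set and the expanded pool set.

    network: reaction -> participating pools in the whole network M.
    r_ai, s_rp: reactions and pools of the graph being expanded.
    bmp: border metabolite pools of that graph.
    """
    # NRS: reactions that touch a border pool and are not yet in the graph.
    nrs = {r for r, pools in network.items() if r not in r_ai and pools & bmp}
    # NMP: pools of those new reactions that are not yet in the graph.
    nmp = set().union(*(network[r] for r in nrs)) - s_rp
    return r_ai | nrs, s_rp | nmp


# Toy chain a -r1- b -r2- c -r3- d (hypothetical names).
toy = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": {"c", "d"}}
r_exp, s_exp = expansion_sets(toy, {"r1"}, {"a", "b"}, {"b"})
print(sorted(r_exp), sorted(s_exp))
```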

Border metabolite pools of $G_{AI}^{exp}$ can be characterized as those metabolite pools which (i) are not in GAI, and (ii) participate in reactions that are not in $G_{AI}^{exp}$. Clearly, border metabolite pools of $G_{AI}^{exp}$ are always within one reaction distance from some border metabolite pool of GAI.

Note that the GAI graph expansion process is not unique. Each expansion step of GAI graph generates a new “alternative” GAI graph by assigning different labels to new reactions. At each step, the newly formed set of GAI graphs that are alternatives of each other is called a GAI-group. Each GAI graph in the same GAI-group is non-redundant, meaning each GAI graph has at least one reaction or metabolite pool assignment differing from the corresponding assignment in any other GAI graph in the same group.

The GAI graph generation/expansion process is represented as a hierarchy, called the GAI generation hierarchy, where each node represents a GAI graph of the metabolism, and each edge from parent to child in the GAI generation hierarchy represents the expansion of the parent GAI graph in the next step by additional reactions leading to a new child GAI graph. Branching in the GAI generation hierarchy occurs whenever alternative (i.e., OR-connected), but conflicting, graph extension steps are taken, which leads to alternative GAI graphs. Each such set of alternative graphs forms a GAI-group.

The GAI generation hierarchy is a directed acyclic graph with only one node that does not have any incoming edges, called root, and with every other node containing a GAI-group. The hierarchy is constructed as follows.

Initialization:

Level 0: root is a dummy node at level 0, i.e., it does not contain any information. Root has |O| immediate children, one for each observation in O.

Level 1: Each immediate child of root is a GAI-group with only one GAI graph containing (i) a single node corresponding to a measured metabolite pool in the set O of observations, and (ii) no reactions.

Expansion-and-merge:

Level i: Nodes in each level i, i>1, of the hierarchy are constructed from the GAI-groups in level (i-1) in two steps, as follows.

Expansion step: Let Gr be a GAI-group node in level (i-1). Each GAI graph in Gr is expanded by following a GAI graph expansion as specified by the GAI graph expansion definition. The set of all such expanded graphs of Gr forms a new GAI-group node at level i in the hierarchy.

Merge step: Let GAI-groups GrX and GrY be two newly expanded GAI-groups of the expansion step. Let Gx and Gy be GAI graphs of GAI-groups GrX and GrY, respectively. If Gx and Gy have a nonzero number of common border metabolites with identical border metabolite labels, then GrX and GrY are merged into a single GAI-group, say, GrZ, in the hierarchy by (i) merging Gx with Gy, and placing the result in GrZ, and (ii) replacing GrX and GrY by GrZ in the hierarchy at level i.

Example 2.9. Consider a part of a metabolic network M, shown in Figure 2.3. Reaction malatedehydrogenasel is already decided as Active, and oaa_m is a border metabolite with a label value other than Unknown. Assume that oaa_m is already assigned the pool label identifier Available. The border metabolite oaa_m is involved in two reactions, namely, aspartatetransaminase and citratesynthase, whose label assignments are not yet made.

Figure 2.3 A partial metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.

Based on the network of Figure 2.3 and given the pool label identifier information of Available on oaa_m, we would like to generate possible valid GAI graphs with active/inactive reactions in the metabolic network. At the same time, each valid GAI graph must preserve the observed Available pool label mark for oaa_m.

Next, starting from the border metabolite oaa_m, we generate different possible GAI graphs. New GAI graphs are generated by expanding the initial metabolic subgraph with reactions from the larger metabolic network M. Figure 2.4 shows the original GAI graph and the next level of the GAI generation hierarchy. Each of GAI1, GAI2, GAI3 and GAI4 is distinct and non-redundant. Thus, they form alternatives of each other, called a "GAI-group".

End of Example 2.9

Figure 2.4 The first level of the GAI graph generation hierarchy for the metabolic network in Figure 2.3.

We next discuss the creation of GAI graphs in more detail. For a given metabolite pool m, R(m) denotes the set of producer and consumer reactions of m in the metabolic network; and r.label represents the current label (i.e., Active, Inactive, Unknown) assignment for reaction r.

Definition (Label Assignment for Reactions in the Reaction Set R(m) of Metabolite m): Given a metabolite pool m, SA(R(m), SA,m) is a label assignment for reactions in R(m), where each reaction in R(m) is assigned a label of either Active or Inactive, through a function SA,m: R(m) → {Active, Inactive}.

Note that the number of possible label assignments for a set of consumer/producer reactions of a given metabolite pool m is exponential in the number of consumers and producers of m.

Remark 3.1: Given a metabolite pool m, let i be the number of consumer reactions of m, and j be the number of producer reactions of m in the metabolic network. Then, the maximum number of possible distinct label assignments for m’s producers and consumers is $2^{i+j}$.

Note that one does not need to evaluate each such combination of reaction label assignment as a valid GAI graph expansion.
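To make the count in Remark 3.1 concrete, the sketch below enumerates every candidate label assignment for the consumers and producers of one metabolite pool. It is a brute-force illustration only (the text notes that not every combination needs to be evaluated); the function name is hypothetical, and the consumer and producer lists are assumed to be disjoint.

```python
from itertools import product
from typing import Dict, Iterator, Sequence


def label_assignments(consumers: Sequence[str],
                      producers: Sequence[str]) -> Iterator[Dict[str, str]]:
    """Yield every Active/Inactive assignment over consumers and producers."""
    reactions = list(consumers) + list(producers)
    for labels in product(("Active", "Inactive"), repeat=len(reactions)):
        yield dict(zip(reactions, labels))


# Two undecided reactions around oaa_m, as in Example 2.9: 2**2 candidates.
candidates = list(label_assignments(["aspartatetransaminase", "citratesynthase"], []))
print(len(candidates))
```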

Metabolite pool label assignment for metabolite m in GAI is subject to three requirements:

1. Conditions that involve m are either True or False, but not Unknown, and

2. For each reaction r in GAI, either all conditions in ACT(r) are True, or ACT(r) contains at least one False condition, and

3. All rules (of Sections 2.2.4.2 and 2.2.5) are satisfied.

To check the satisfaction of all three requirements above, our approach (described in Section 2.3.4) is as follows. For initialization, start with observed bio-fluid metabolites and non-bio-fluid metabolites as “seed” metabolites; use satisfied conditions of observed bio-fluid metabolites to locate their transport processes (i) with ACT sets having only satisfied conditions (in which case they are Active transport processes), or (ii) with at least one unsatisfied condition (in which case they are Inactive transport processes). When both (i) and (ii) fail for a transport process, the label of the transport process is Unknown. Next, after the initialization (of GAI graphs), repeat the GAI graph expansion process (as defined above) via the “border metabolites” of GAI graphs, until there are no more border metabolites involved in active reactions.

Next, for a given border metabolite m, we define the notion of a “valid label assignment for R(m)”, the reaction set (i.e., all producers and consumers) of m.

Definition (Valid Label Assignment for Reactions in the Reaction Set of a Border Metabolite m): Given the graph GAI(RAI, AI, SRP, RP, M, O), a border metabolite pool m in GAI, and the reaction set R(m) of m, let SA(R(m), SA,m) be a label assignment for all reactions in R(m) of m. Then, SA(R(m), SA,m) is said to be a valid label assignment for R(m) with respect to GAI if the following conditions hold.

a. No Label Conflict Among Reactions: For each reaction r where r ∈ R(m), SA,m(r) = AI(r) or r ∉ RAI.

b. Backward Compatibility: The label assignment SA(R(m), SA,m) results in a set Q of pool label assignments for the border metabolite m, each resulting in a new expanded GAI graph. Then, for GAI and each metabolite pool label assignment q in Q, the following two conditions hold:

o With m having the label q, all the conditions in the ACT sets of “active” reactions in RAI ∩ R(m) are satisfied by the assignment q.

o With m having the label q, for each “inactive” reaction r in RAI ∩ R(m) that involves the border metabolite m in its ACT set, there is at least one unsatisfied condition.
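The two conditions of this definition can be sketched as follows. This is an illustration under stated assumptions, not the SMDA code: an ACT set is modeled as a list of Python predicates over the current pool labels, the set Q is approximated by testing each of the four pool labels for m, and, for simplicity, every inactive reaction in RAI and R(m) is checked rather than only those whose ACT set involves m. All names and the toy data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

Labels = Dict[str, str]                # metabolite pool -> label
Condition = Callable[[Labels], bool]   # one condition of an ACT set

POOL_LABELS = ("Unavailable", "Available", "Accumulated", "Severely Accumulated")


def no_label_conflict(sa: Dict[str, str], ai: Dict[str, str]) -> bool:
    """Condition (a): a reaction already labeled in the graph keeps its label."""
    return all(r not in ai or ai[r] == label for r, label in sa.items())


def compatible_labels(m: str, r_m: Iterable[str], ai: Dict[str, str],
                      act: Dict[str, List[Condition]], pools: Labels) -> Set[str]:
    """Condition (b): labels q of m that keep labeled reactions of R(m) correct."""
    reactions = list(r_m)
    result: Set[str] = set()
    for q in POOL_LABELS:
        trial = {**pools, m: q}
        ok = True
        for r in reactions:
            if r not in ai:
                continue  # only reactions in both RAI and R(m) are checked
            all_true = all(cond(trial) for cond in act[r])
            # Active needs every condition true; Inactive needs one false.
            if (ai[r] == "Active") != all_true:
                ok = False
                break
        if ok:
            result.add(q)
    return result


# Toy data: r1 is Active and requires m to be at least Available.
act = {"r1": [lambda labels: labels["m"] != "Unavailable"]}
ai = {"r1": "Active"}
q_set = compatible_labels("m", ["r1", "r2"], ai, act, {})
print(sorted(q_set))
```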

2.3.3 Merging GAI Graphs

During the GAI graph expansion, it is possible for two GAI graphs in two different GAI-groups to intersect, in which case the two graphs are reconciled into a single GAI graph (leading to a GAI graph generation “hierarchy”, rather than a GAI generation “tree”). If the reconciliation is not possible, then the two GAI graphs are not consistent, and the metabolic network model characterized by merging the two GAI graphs is inconsistent. In such a case, this specific merge of the two GAI graphs is stopped, the inconsistency is noted, and the expansion of the GAI generation hierarchy is continued for other possibilities.

To expedite the process of expanding the GAI graphs, we start by assigning labels to observed metabolites and forming single-node GAI graphs. Initially, each observed metabolite in a bio-fluid results in a single GAI-group with a single GAI graph. We attempt to merge GAI graphs in different GAI-groups when they intersect, i.e., when two GAI graphs that are in two distinct GAI-groups have the same border metabolite(s). We illustrate the process with an example.

Example 2.10. Consider the metabolic network M of Figure 2.5. Assume C and C are satisfied from observed measurements, as shown in Figure 2.5. Let’s say, after multiple expansions, we reach a point where GAI-Group-1 has GAI3 with border metabolites {oaa_m, sdhlam_m}; and GAI-Group-2 has GAI4 with border metabolites {succ_m, sdhlam_m}, as shown in Figure 2.6.

Figure 2.5 A metabolic network M. Circle nodes are metabolites, rectangle nodes are reactions and edges represent relations between reactions (which consume and/or produce metabolites) and metabolites.

Since both groups share the border metabolite sdhlam_m, we merge the two groups of GAI graphs into one group. Let the new GAI graph to be created by merging GAI3 and GAI4 be GAI6. For each reaction with an Active or Inactive label in GAI3 and GAI4, we assign the same label in GAI6. For each border metabolite in GAI6, we assign a common possible label that both GAI3 and GAI4 allow for that border metabolite, e.g., the label of sdhlam_m becomes Available.
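The merge described in Example 2.10 can be sketched as below. This is a simplified illustration: each graph is reduced to a dictionary of reaction labels plus a dictionary giving, for each border metabolite, the set of labels it may still take. The function returns None both when the graphs do not intersect and when they cannot be reconciled; the reaction names and label sets in the toy data are hypothetical and are not read from Figure 2.6.

```python
from typing import Dict, Optional, Set, Tuple

ReactionLabels = Dict[str, str]      # reaction -> Active / Inactive
BorderLabels = Dict[str, Set[str]]   # border pool -> possible labels


def merge_graphs(rx1: ReactionLabels, bd1: BorderLabels,
                 rx2: ReactionLabels, bd2: BorderLabels
                 ) -> Optional[Tuple[ReactionLabels, BorderLabels]]:
    """Merge two GAI graphs that share at least one border metabolite."""
    shared = bd1.keys() & bd2.keys()
    if not shared:
        return None  # the graphs do not intersect
    # Reactions labeled in both graphs must carry the same label.
    if any(rx1[r] != rx2[r] for r in rx1.keys() & rx2.keys()):
        return None
    merged_bd: BorderLabels = {}
    for pool in bd1.keys() | bd2.keys():
        if pool in shared:
            common = bd1[pool] & bd2[pool]
            if not common:
                return None  # no common label: inconsistent merge
            merged_bd[pool] = common
        else:
            merged_bd[pool] = set(bd1[pool] if pool in bd1 else bd2[pool])
    return {**rx1, **rx2}, merged_bd


# Border metabolites follow Example 2.10; the label sets are made up.
gai3_border = {"oaa_m": {"Available"}, "sdhlam_m": {"Available", "Accumulated"}}
gai4_border = {"succ_m": {"Available"}, "sdhlam_m": {"Available"}}
gai6 = merge_graphs({"rA": "Active"}, gai3_border, {"rB": "Inactive"}, gai4_border)
print(gai6)
```

Whether a pool remains a border metabolite after the merge depends on the surrounding network, so a fuller version would recompute the border set of the merged graph.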

Figure 2.6 The GAI graphs before merging two GAI-groups.

2.3.4 Algorithm Sketch

Input to the SMDA algorithm is a set of quantitative metabolite concentration values, and a metabolic sub-network to which the user wants to restrict the analysis. The very first initialization step starts from an observed metabolite, possibly a bio-fluid metabolite, and results in a GAI graph per observed metabolite, where each such single-node graph is placed in a single GAI-group. In each expansion step, a GAI graph is expanded with a producer/consumer reaction set of a “border metabolite” while the validity of reaction label assignments is enforced, as described in Section 2.3.2. Each possible expansion with a different label assignment on the same metabolite pool, or expansions on different metabolite pools, leads to a distinct GAI graph. Expansion can result in alternative GAI graphs, all placed into yet another GAI-group. This process builds the GAI generation hierarchy, where nodes are GAI-groups, and distinct extensions lead to branching in the hierarchy. At the end, each leaf level node in the hierarchy represents a complete GAI graph set, i.e., one possible activation/inactivation scenario. At any point during the expansion process, if a border metabolite with no valid label assignment is encountered, then the expansion of the GAI graph is stopped, and it is eliminated as an invalid GAI graph. The expansion process is performed in a breadth-first manner. In Figure 2.7, we present a sketch of the SMDA algorithm. Note that GAI graphs in different GAI-groups are AND-alternatives. GAI graphs in the same GAI-group are XOR-alternatives.
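The breadth-first control flow described above can be outlined generically. The sketch below is not the algorithm of Figure 2.7: it leaves out GAI-groups and the merge step, and it delegates all biochemistry to two caller-supplied functions, expand and is_complete, which are hypothetical placeholders. It shows only how alternatives branch level by level and how a graph with no valid expansion is eliminated.

```python
from typing import Callable, Iterable, List, TypeVar

G = TypeVar("G")


def expand_breadth_first(seeds: Iterable[G],
                         expand: Callable[[G], List[G]],
                         is_complete: Callable[[G], bool],
                         max_levels: int = 1000) -> List[G]:
    """Expand graphs level by level and collect the complete ones.

    expand(g) returns the valid alternative expansions of g; an empty list
    means no valid label assignment exists, so g is dropped.
    The search stops after max_levels levels even if graphs remain.
    """
    level = list(seeds)
    complete: List[G] = []
    for _ in range(max_levels):
        if not level:
            break
        next_level: List[G] = []
        for g in level:
            if is_complete(g):
                complete.append(g)
            else:
                next_level.extend(expand(g))
        level = next_level
    return complete


# Toy stand-in: a "graph" is a tuple of labels for two reactions, and the
# combination (Inactive, Inactive) is treated as invalid.
def toy_expand(g):
    options = [g + ("Active",), g + ("Inactive",)]
    return [o for o in options if o != ("Inactive", "Inactive")]


scenarios = expand_breadth_first([()], toy_expand, lambda g: len(g) == 2)
print(scenarios)
```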

2.4 Experimental Evaluation

In this section, we present an experimental evaluation of the SMDA algorithm, and compare different expansion strategies on our experimental data.

2.4.1 Experimental Setting

The experiments are performed on a Dell PowerEdge R710 Server with two Intel® Xeon® quad processors and 48 GB main memory, running Windows Server 2008. The web application server is Microsoft IIS 7. The database server is Microsoft SQL Server 2010. The SMDA web site is implemented with Microsoft ASP.NET; and the client visualization is implemented with Java.

The experiment data set includes pathways that are built for PathCase Metabolomics Analysis Workbench, with 22 pathways, 202 metabolites, 375 metabolite pools, and 240 reactions. The thresholds are set up according to the Human Metabolome Database.

Figure 2.7 Sketch of the SMDA algorithm

2.4.2 Experimental Results

2.4.2.1 Relationship between the number of observations and the number of GAI and flow-graphs.

In this experiment, we evaluate the performance of SMDA for different numbers of user observations. We experiment with three different size sub-networks. For each sub-network, we change the number of metabolite pool observations and record the number of graphs in the result, as listed in Table 2.1.

Observation 1. For small sub-networks, a linear increase in the number of observations results in an exponential decrease in the number of GAI and flow-graphs in the output.

From Table 2.1, regardless of the size of the sub-network, the number of GAI- and flow-graphs decreases as we provide more observations as input. Note that, in some cases, increasing the number of observations will not reduce the number of graphs, since there is only one possible label for the input pools in the results. In such cases, the additional pool observation is duplicate information and does not reduce the result size.

Table 2.1 The number of observations vs. the number of output graphs for small sub-networks.

Sub-Network                      # Reactions   # M. Pools   # Observations   # GAI-graphs   # flow-graphs
Pentose pathway                  8             16           1                8938           846
Pentose pathway                  8             16           2                860            423
Pentose pathway                  8             16           3                588            376
Glycolysis pathway               14            25           1                152            12
Glycolysis pathway               14            25           2                8              8
Glycolysis pathway               14            25           3                4              4
Glycolysis+TCA Cycle pathways    24            48           2                332288         160
Glycolysis+TCA Cycle pathways    24            48           4                166144         80
Glycolysis+TCA Cycle pathways    24            48           6                128            32

In another experiment, for a larger sub-network, we observe how the algorithm scales. We choose a connected sub-network with 6 pathways, 48 reactions and 132 metabolite pools. The number of GAI- and flow-graphs versus different numbers of observations is shown in Table 2.2.

Table 2.2 The number of observations vs. the number of graphs for a large network.

# Reactions   # M. Pools   # Observations   # GAI-graphs   # flow-graphs
48            132          17               3072           40
48            132          23               1536           20
48            132          31               384            12
48            132          33               192            12
48            132          35               192            12
48            132          37               192            12

From Table 2.2, we can see that, even in a large sub-network, we can get reasonably small numbers of GAI- and flow-graphs with increased number of pool observations.

Observation 2. For larger sub-networks, a linear increase in the number of observations results in an exponential decrease in the number of GAI-graphs and a linear decrease in the number of flow-graphs in the output.
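The trend in Table 2.2 can be checked with a few lines of Python that compute, for consecutive rows, the factor by which the number of GAI-graphs shrinks. The row values are copied from Table 2.2; the helper name is hypothetical.

```python
# (number of observations, number of GAI-graphs) rows of Table 2.2
table_2_2 = [(17, 3072), (23, 1536), (31, 384), (33, 192), (35, 192), (37, 192)]


def reduction_factors(rows):
    """Ratio of GAI-graph counts between each row and the next one."""
    return [prev[1] / cur[1] for prev, cur in zip(rows, rows[1:])]


factors = reduction_factors(table_2_2)
print(factors)  # [2.0, 4.0, 2.0, 1.0, 1.0]
```

The count halves or quarters between the first four rows and then stays at 192, which matches the earlier remark that some additional observations carry no new information.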

2.4.2.2 Algorithm time efficiency

The execution time is composed of two parts: expansion time and merge time. For each sub-network, we execute each of the three expansion strategies. The results show that, in general, increasing the number of pool observations decreases the execution time exponentially. This is because, with more observed values, the algorithm expands many small sub-networks instead of one large network, which decreases the expansion time exponentially. However, in some experiments, increasing the number of pool observations has actually increased the execution time, instead of decreasing it. In those cases, we have found that merge time costs are significantly higher than expansion time costs.

Observation 3. A linear increase in the number of metabolite pool observations results in an exponential decrease in the execution time of the algorithm, as in Figure 2.8.

Figure 2.8 SMDA time cost for a single network versus the number of observations for Glycolysis and TCA Cycle combined.

2.5 Related Work: Metabolic Network Analysis Techniques

The SMDA technique can be viewed as being in the general category of metabolic analysis techniques. In this section we summarize the existing metabolic network analysis techniques, and briefly compare them with the SMDA approach.

Over the last 30+ years, a number of powerful mathematical modeling approaches and their corresponding computational tools have been proposed and used to study the dynamics of cellular metabolism. These techniques have many goals such as determining the metabolic fluxes of reactions in the metabolic network, or finding all the “optimal” routes, etc. They include metabolic control analysis (MCA) [10][42][43][44], flux balance analysis (FBA) [45][11][46] (also known as constrained optimization), metabolic flux analysis [12], and metabolic pathway analysis (more specifically, elementary flux modes and extreme pathways) [39, 36, 40, 41]. Next we briefly summarize these techniques, and compare them with the SMDA approach.

Comparison of MCA, EMA, and SMDA approaches. Next we briefly list the differences between the MCA (or FBA), EMA, and SMDA approaches:

Different goals. The four approaches are useful in different contexts, focus on providing different sets of information to users, and have different goals.

(a) MCA focuses on “control as a property of the whole system”: One can (i) measure (at quasi-steady state) the effect of single enzyme perturbations on the system, and (ii) calculate the control distribution, relating the system behavior to individual reactions.

(b) EMA can be used for tasks like the recognition of operational modes, finding all optimal paths, and analysis of network flexibility (structural robustness, redundancy) [47]. Under steady-state conditions, the metabolic fluxes of an organism can be expressed as non-negative, linear, weighted combinations of elementary flux modes [48]; however, identifying the weighting factors to determine the fractional contributions of each elementary mode is difficult, if not impossible [48][49]. Visualizations of elementary flux modes within a given KEGG pathway are also available (via YANAsquare).

(c) SMDA, working with a possibly large metabolic network within a multi-tissue (organ) environment (i.e., not within a cell) and assuming steady-state behavior, returns to users all metabolic action scenarios as well as their visualizations within the metabolic network, allowing users to quickly concentrate on locating possibly activated paths for a given set of observed metabolite concentration changes. SMDA does not derive the (steady-state) flux values of the MCA (FBA) method, and, thus, there are no control-related (i.e., rate limitation) conclusions (of the MCA method).

Different underlying fundamentals. SMDA is condition-based, and employs graph traversal and expansion algorithms across the metabolic network. In comparison, MCA and FBA involve solving a set of underconstrained differential equations corresponding to a possibly smaller metabolic network at hand. EMA determines elementary fluxes via a linear combination of “null space basis vectors” of the stoichiometry matrix [50].

Ease of use. MCA (or FBA), even with the easiest-to-use GUI-oriented software tools (such as COPASI), requires (i) additional information to be collected and provided by the users including the stoichiometry information, and (ii) setup and usage expertise, for biologists to use them. The EMA tools YANA and YANAsquare do provide user-friendly elementary flux derivations and their visualizations. In comparison, SMDA uses a metabolic pathways database, which already contains the metabolic network, biochemistry-based rules and other information so that all that a user is expected to provide is a set of observed metabolite changes.

Modeling-related restrictions/assumptions. As listed above, MCA has a number of assumptions (such as requiring a connected network of pathways) [51] which are not needed for SMDA. EMA also requires connectivity.

Computational Complexity. Computational complexity of MCA is exponential in the number of reactions involved, forcing users to use various compaction, aggregation, and clustering/merging, etc. techniques. Computational complexity of EMA is also exponential [47], and various approaches to tackle the high complexity are proposed such as parallel computing [52], network decomposition and “functional conversion of flux cones”. SMDA is also exponential in the number of reactions in its worst case.

2.6 Conclusions

In this chapter, we have proposed Steady-State Metabolic Network Dynamics Analysis, a computational metabolomics analysis approach that captures a metabolic network and biochemical principles in a metabolic network database [9]. Given a set of metabolic observations and a selected metabolic sub-network, SMDA executes an expansion phase and a merge phase to locate all possible steady-state activation/inactivation scenarios of the reactions in the network, based on biochemistry rules. We have given the SMDA algorithm and presented an experimental evaluation of the SMDA tool against a mammalian metabolic network database.

Performing Gene Lethality Testing with SMDA

3.1 Introduction

Steady-State Metabolic Network Dynamics Analysis (SMDA) is a recently proposed computational metabolomics analysis approach that captures a metabolic network and biochemical principles in a metabolic network database [9]. Given a set of metabolic observations and a selected metabolic subnetwork, SMDA locates all possible steady-state activation/inactivation scenarios of the reactions in the network, based on biochemistry rules. Our goal in this chapter is to describe how SMDA can be used in the context of gene lethality testing, where a gene is said to be lethal if its knockout (i.e., elimination from the genome of the organism) causes the death of the organism.

The direct effect of a knocked-out gene is the elimination of the enzymes that it encodes. This corresponds to the removal of all reactions from the metabolic network in which the catalyzing enzyme is encoded by the knocked-out gene, with the exception of those reactions that have other associated isozymes (i.e., another enzyme that catalyzes the same reaction).

We define gene lethality in terms of essential metabolite availability. An essential metabolite is a metabolite without which the organism cannot stay alive. Thus, a gene is lethal if its knockout causes the unavailability of at least one essential metabolite in the organism at the steady state. In other words, a gene is lethal if its removal from the organism’s genome results in the non-production of at least one essential metabolite, and, thus, the death of the organism.
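The network-level effect of a knockout and the essential-metabolite test can be sketched as follows. This is a simplified illustration with assumed data structures: each reaction maps its catalyzing enzymes to one encoding gene, and reactions with no annotated enzyme are left untouched. The second helper inspects the pool labels of a single scenario only; how results are combined across all scenarios returned by SMDA is not modeled here. All names are hypothetical.

```python
from typing import Dict, Iterable, Set


def surviving_reactions(catalysts: Dict[str, Dict[str, str]], gene: str) -> Set[str]:
    """Reactions that remain after knocking out one gene.

    catalysts: reaction -> {enzyme: encoding gene}. A reaction survives if
    at least one of its enzymes (an isozyme) is encoded by another gene, or
    if it has no annotated enzyme at all.
    """
    survivors: Set[str] = set()
    for rxn, enzymes in catalysts.items():
        if not enzymes or any(g != gene for g in enzymes.values()):
            survivors.add(rxn)
    return survivors


def essential_unavailable(pool_labels: Dict[str, str],
                          essential: Iterable[str]) -> Set[str]:
    """Essential metabolites labeled Unavailable in one steady-state scenario."""
    return {m for m in essential if pool_labels.get(m) == "Unavailable"}


# Toy data: r1 depends only on geneA, r2 has an isozyme from geneB.
catalysts = {
    "r1": {"enzA": "geneA"},
    "r2": {"enzA": "geneA", "enzB": "geneB"},
    "r3": {},
}
remaining = surviving_reactions(catalysts, "geneA")
missing = essential_unavailable({"x": "Unavailable", "y": "Available"}, ["x", "y"])
print(sorted(remaining), sorted(missing))
```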

Topological analysis of regulatory networks [53], Barabasi’s computational estimate method [54][55], and Flux Balance Analysis (FBA) [45][11][46][56] are techniques used for testing gene lethality. FBA is mostly used to test if a knock-out is lethal for the organism by using the reconstructed metabolic network of the organism (e.g., Duarte et al., 2007 [56] for humans, or Sigurdsson et al., 2010 [1] for Mus musculus). FBA calculates metabolite pool sizes and flux values by first constraining the knocked out reaction with zero flux, and then checking whether there is flux through the biomass reaction, which is a reaction that is added for simulation purposes (e.g., to simulate the growth of the organism). Main problems with the FBA approach include: (1) the optimal conditions and the assumptions proposed by the technique are questionable (e.g., “the quality of the biomass reaction and the assumption of biomass optimization which is debatable even for unicellular organisms” [17][2]), (2) the prior knowledge about the network (e.g., complete stoichiometry) might not always be available for the organism at hand, and (3) the FBA result may not be meaningful biochemically (illustrated with an example in Section 3.3).

This chapter proposes the use of SMDA as an alternative gene lethality testing technique. SMDA requires as input (i) the metabolic network of the organism, and (ii) a set of metabolite pool observations. Then, as output, it enumerates all possible flux scenarios (in the form of “active/inactive reactions”). SMDA does not perform any optimization or stoichiometric calculations; hence, it does not require any of the assumptions stated above (see Section 3.3 for details). However, similar to Elementary Mode Analysis [11][13][41][57], which enumerates all possible elementary flux modes that can occur at the steady state, SMDA also suffers from exponential computation time. The number of possible scenarios produced by SMDA can be exponential with respect to the size of the network, and is inversely proportional to the number of metabolite observations provided. However, the complexity of SMDA can be reduced with domain expert’s knowledge, as shown in the experiments in this chapter.

To validate the SMDA gene lethality algorithm, we have selected the reconstructed network of the core metabolism of Trypanosoma cruzi [15]. Trypanosoma cruzi, a kinetoplastid parasite of humans that causes Chagas disease [58], has a small core reconstructed metabolic network [16] with 215 genes, 162 reactions, and 4 compartments. Seth B Roberts, Jennifer L Robichaux, Arvind K Chavali, Patricio A Manque, Vladimir Lee, Ana M Lara, Jason A Papin and Gregory A Buck [59] used the core metabolism network of Trypanosoma cruzi to perform experimental gene lethality tests. To evaluate the gene lethality testing algorithm with SMDA, we have used the same seven lethal genes verified by Roberts et al., and SMDA gene lethality testing has correctly verified the lethality of all seven genes. We have also selected one non-lethal gene in Trypanosoma cruzi, namely, adenosine kinase, and SMDA has also correctly verified its non-lethality. These results show that SMDA can be used for gene lethality testing of organisms.

In Section 3.2, we briefly introduce the SMDA algorithm. Section 3.3 summarizes the existing gene lethality testing techniques in more detail, discusses their shortcomings, and compares them to SMDA. In Section 3.4, we provide a sketch of the revised SMDA algorithm for gene lethality testing. Section 3.5 experimentally evaluates SMDA gene lethality testing in the context of Trypanosoma cruzi. Our conclusion is that the SMDA gene lethality testing algorithm successfully locates lethal and non-lethal genes of organisms when either the core metabolism network of the organism is not large or the number of observed metabolites is large.

3.2 Summary of SMDA Algorithm

In this section we explain the terminology of SMDA and the algorithm flow.

3.2.1 SMDA Terminology

The metabolic network is a connected graph G(V,E) where the vertex set V consists of metabolite pools and reactions, and the edge set E consists of directed edges from a vertex u to a vertex v if (i) u is a metabolite pool that plays the role of substrate or regulator of reaction v, or (ii) u is a reaction and v is a product of that reaction. SMDA makes use of many biochemistry principles such as Substrate Availability, Product Inhibition and Committed Steps, etc.; see Cakmak et al. for details [1].

There are five discrete states a metabolite pool can be in, namely, Unknown, Unavailable, Available, Accumulated, and Severely Accumulated. If a metabolite pool is not observed (i.e., not measured), it is labeled as Unknown; otherwise the algorithm compares the user-provided observation with predefined metabolite level thresholds (originally either obtained from HMDB [36] and stored in the SMDA database, or decided manually). Let T_AVAIL(m), T_ACC(m), and T_SAC(m) (such that T_AVAIL(m) < T_ACC(m) < T_SAC(m)) be three metabolite level thresholds for metabolite m. Also let the observed value for metabolite m be Obs(m). If Obs(m) < T_AVAIL(m) then SMDA marks the metabolite pool m as Unavailable; if T_AVAIL(m) ≤ Obs(m) < T_ACC(m) then SMDA marks m as Available; if T_ACC(m) ≤ Obs(m) < T_SAC(m) then SMDA marks m as Accumulated; and if Obs(m) ≥ T_SAC(m) then SMDA marks m as Severely Accumulated.
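The threshold comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not the SMDA implementation: the function name classify_pool, the PoolLabel enumeration, and the convention that an unobserved pool is passed as None are assumptions made for the example, and the Accumulated and Severely Accumulated branches assume that the two remaining labels follow the same pattern with the two upper thresholds.

```python
from enum import Enum


class PoolLabel(Enum):
    UNKNOWN = "Unknown"
    UNAVAILABLE = "Unavailable"
    AVAILABLE = "Available"
    ACCUMULATED = "Accumulated"
    SEVERELY_ACCUMULATED = "Severely Accumulated"


def classify_pool(obs, t_avail, t_acc, t_sac):
    """Map an observed level to a discrete pool label.

    obs is None when the pool was not measured (hypothetical convention).
    """
    if not (t_avail < t_acc < t_sac):
        raise ValueError("thresholds must satisfy t_avail < t_acc < t_sac")
    if obs is None:
        return PoolLabel.UNKNOWN
    if obs < t_avail:
        return PoolLabel.UNAVAILABLE
    if obs < t_acc:
        return PoolLabel.AVAILABLE
    if obs < t_sac:  # assumed pattern for the upper two labels
        return PoolLabel.ACCUMULATED
    return PoolLabel.SEVERELY_ACCUMULATED
```

For instance, with illustrative thresholds 1, 5 and 10, an observation of 3 would be labeled Available under these assumptions.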

There are three discrete states for a reaction, namely, Unknown, Active and Inactive. Initially, all reactions are labeled as Unknown. Each reaction has a set of conditions for it to be Active. For instance, for reaction r to be Active, its substrates have to be labeled as Available, and its product p has to be labeled as Available or Accumulated.

There are two types of reactions: reversible and irreversible. A reversible reaction can be active in forward and backward directions, where forward, rather arbitrarily, refers to one direction, and backward means the substrates and products of the reaction are reversed.

A basic assumption of the SMDA algorithm is that the metabolic profile (the observations) is obtained when the organism is at a steady state. That is, for a time interval T, (i) the production rate of each metabolite pool in the network is equal to its consumption rate, and (ii) the metabolite pool labels stay constant. This assumption corresponds to the homeostasis of the organism.

The SMDA algorithm makes use of two main concepts: the Activation/Inactivation Graph (GAI) and the GAI group. A GAI is a connected sub-graph of the metabolic sub-network, where (i) each metabolite pool is assigned a label other than Unknown (Unavailable, Available, Accumulated, or Severely Accumulated) and (ii) each reaction is assigned a label other than Unknown (Active or Inactive). A GAI group is a set of GAIs, where (a) all GAIs share the same set of reactions and metabolite pools, and (b) any two GAIs of the GAI group differ by at least one metabolite pool label assignment. In other words, a GAI group represents all possible activation/inactivation scenarios within a selected sub-graph of the query network. An alternative output to GAI graphs is flow-graphs, where a flow-graph is a GAI graph without metabolite pool labels, and a single flow-graph captures multiple GAI graphs. In other words, a flow-graph represents a scenario where each reaction is marked as either Active or Inactive, regardless of the metabolite pool labels. Flow-graphs are used to speed up the original SMDA algorithm.
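To illustrate how a single flow-graph captures multiple GAI graphs, the following minimal sketch groups GAIs by their reaction labels only. The dictionary layout (keys "pools" and "reactions") and the function name to_flow_graphs are assumptions made for illustration; they are not taken from the SMDA code base.

```python
from collections import defaultdict


def to_flow_graphs(gais):
    """Collapse GAIs that agree on every reaction label into one flow-graph.

    Each GAI is a dict with two hypothetical keys:
      "pools":     metabolite pool name -> pool label
      "reactions": reaction name -> "Active" or "Inactive"
    The result maps a flow-graph (a frozenset of (reaction, label) pairs)
    to the list of pool labelings it captures.
    """
    groups = defaultdict(list)
    for gai in gais:
        flow_graph = frozenset(gai["reactions"].items())
        groups[flow_graph].append(gai["pools"])
    return dict(groups)


gais = [
    {"pools": {"m1": "Available"}, "reactions": {"r1": "Active"}},
    {"pools": {"m1": "Accumulated"}, "reactions": {"r1": "Active"}},
    {"pools": {"m1": "Unavailable"}, "reactions": {"r1": "Inactive"}},
]
flow_graphs = to_flow_graphs(gais)
```

In this toy input, the first two GAIs differ only in the label of m1, so they fall into the same flow-graph, and three GAIs reduce to two flow-graphs.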

3.2.2 Algorithm Flow

The algorithm runs in a cycle of two phases: Expansion and Merge. It lasts until all reactions and metabolite pools in the network are assigned a status.

3.2.2.1 Expansion Phase

The expansion phase starts from the labeled metabolite pools (observations), which are flow-graphs with single metabolite pools. Then, expanded flow-graph(s) are generated by adding neighboring reactions and metabolite pools to the original flow-graph. SMDA generates all possible combinations of label assignments to those neighboring pools and reactions. This process continues until all reactions and metabolite pools are assigned a label.

The metabolite pools that are in the flow-graph and attached to reactions that are not in the flow-graph are called border metabolite pools. At each expansion step, one of the border metabolite pools is chosen for expansion. There can be different options for choosing the border pool to expand. To control the number of expanding flow-graphs, the algorithm, as a heuristic, keeps the number of alternatives as small as it can. It tries to generate as few new flow-graphs as possible at each step, in the hope of avoiding the generation of cases that are eliminated in later iterations (e.g., due to the unavailability of a substrate). Therefore, SMDA picks for expansion the border pools with the least number of reactions attached. We also delay picking common metabolites like ATP or H2O, since they are highly connected.
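The border pool selection heuristic can be sketched as a single ordering key. The names below (pick_border_pool, COMMON_METABOLITES, and the attached_reactions mapping) are hypothetical and chosen for illustration; the sketch simply prefers non-common pools and, among those, the pool with the fewest attached reactions that are not yet in the flow-graph.

```python
# Assumed set of "common" metabolites whose expansion is delayed.
COMMON_METABOLITES = frozenset({"atp", "h2o"})


def pick_border_pool(attached_reactions, common=COMMON_METABOLITES):
    """Choose the next border pool to expand.

    attached_reactions maps each border pool to the set of reactions that
    touch it but are not yet in the flow-graph.
    """
    def key(pool):
        # Common metabolites sort last; ties are broken by pool name so
        # that the choice is deterministic.
        return (pool in common, len(attached_reactions[pool]), pool)

    return min(attached_reactions, key=key)


border = {
    "atp": {"r1"},
    "g6p": {"r2", "r3"},
    "f6p": {"r4"},
}
chosen = pick_border_pool(border)
```

Here f6p is chosen: atp has equally few attached reactions, but it is delayed because it is treated as a common metabolite.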

The reversible reactions that are attached to the border pool result in the generation of different flow-graphs because of directional combination possibilities. As an example, shown in Figure 3.1, consider r1 and r2, two Unknown-labeled reactions (i.e., not yet in the flow-graph) that are attached to a border metabolite pool m. In Figure 3.1(a), assume r1 is the producer of m, and r2 is a reversible reaction. Then there are two cases to consider. Case 1 is Figure 3.1(b): r1 is the producer and r2 is the consumer. Case 2 is Figure 3.1(c): both are producers of m. We avoid having dead-ends (i.e., a metabolite pool with no consumers and/or producers), and, thus, we disregard such cases. For example, case 2 would be eliminated if there are no other consumer reactions in the flow-graph that are already assigned a label.
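The case generation for reversible reactions around a border pool can be sketched as follows. The function name direction_cases and its parameters are illustrative assumptions; the sketch enumerates both roles for each reversible reaction and discards any case that would leave the pool m as a dead-end.

```python
from itertools import product


def direction_cases(fixed_roles, reversible,
                    has_other_producer=False, has_other_consumer=False):
    """Enumerate role assignments of the reactions attached to a pool m.

    fixed_roles: reaction -> "producer" or "consumer" (irreversible ones)
    reversible:  reactions that may either produce or consume m
    has_other_*: whether the flow-graph already contains a labeled
                 producer/consumer of m (hypothetical flags)
    """
    cases = []
    for roles in product(("producer", "consumer"), repeat=len(reversible)):
        case = dict(fixed_roles)
        case.update(zip(reversible, roles))
        produced = has_other_producer or "producer" in case.values()
        consumed = has_other_consumer or "consumer" in case.values()
        if produced and consumed:  # otherwise m would be a dead-end
            cases.append(case)
    return cases


# Figure 3.1: r1 produces m, r2 is reversible.
only_case_1 = direction_cases({"r1": "producer"}, ["r2"])
both_cases = direction_cases({"r1": "producer"}, ["r2"],
                             has_other_consumer=True)
```

With no other consumer of m in the flow-graph, only Case 1 survives; when another consumer is already present, both cases are kept.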

After each generated case, reactions are assigned Active/Inactive status based on the already known metabolite pool labels in the flow-graph. For instance, if a reaction is the only consumer of a pool that has the Available/Accumulated label, the reaction must be Active. Alternatively, all producers and consumers of a pool should be Inactive if the pool is Unavailable. If the status of a reaction cannot be decided at the time, both cases (Active/Inactive) are generated for that reaction. For each Inactive reaction, a list of possible metabolite pool status “combinations” is generated that would make that reaction Inactive. After reactions are assigned labels, related metabolite pools are assigned labels. For example, the product pool of an Active reaction must be Available; or a pool is Unavailable if its only producer is Inactive.

3.2.2.2 Merge Phase

Figure 3.1 A partial network with reversible reaction

The merge phase comes after each expansion phase. Border pools of each pair of flow-graphs are checked to see if they intersect. If so, possible cases among two flow-graphs that agree on the shared border pool(s) are joined into a larger flow-graph. The metabolite pool status-combination lists for Inactive reactions that are attached to the border pools are updated. For example, assume that two flow-graphs flg1 and flg2 have a shared border metabolite pool m1 and can be merged. The metabolite pool label of m1 is Unknown in flg1 and is related to an Inactive reaction r1, and the metabolite pool label of m1 is Available in flg2. Then in the new merged flow-graph flg1-2, m1 is Available and is removed from the pool status-combination list of r1. If any metabolite pool status-combination list of such an Inactive reaction is empty after the merge, this means the reaction which is already assigned the label of Inactive has to be Active. This creates a conflict between the two flow-graphs, meaning that they cannot be merged, and this merge alternative is removed from the expansion.
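The merge step and the update of the status-combination lists can be sketched as below. This is one possible reading, not the SMDA implementation: flow-graphs are reduced to pool-label dictionaries, a status combination is modeled as a dictionary of pool labels that would keep a reaction Inactive, and the names merge_labels and prune_combinations are hypothetical.

```python
UNKNOWN = "Unknown"


def merge_labels(labels1, labels2):
    """Merge two pool-label dicts; return None if they disagree on a pool."""
    merged = dict(labels1)
    for pool, label in labels2.items():
        current = merged.get(pool, UNKNOWN)
        if current == UNKNOWN:
            merged[pool] = label
        elif label != UNKNOWN and label != current:
            return None  # merge conflict on a shared pool
    return merged


def prune_combinations(combinations, merged):
    """Keep the combinations that still agree with the merged labels.

    An empty result means the Inactive reaction would have to be Active,
    i.e., the two flow-graphs cannot be merged.
    """
    return [
        combo for combo in combinations
        if all(merged.get(pool, UNKNOWN) in (UNKNOWN, label)
               for pool, label in combo.items())
    ]


flg1 = {"m1": "Unknown", "m2": "Unknown"}   # r1 is Inactive in flg1
flg2 = {"m1": "Available"}
flg1_2 = merge_labels(flg1, flg2)
r1_combos = prune_combinations(
    [{"m1": "Unavailable"}, {"m2": "Unavailable"}], flg1_2)
```

In this toy case, m1 becomes Available after the merge, so the combination that relied on m1 being Unavailable is dropped and r1 can remain Inactive only through m2.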

3.2.3 Conflicts

One of the fundamental assumptions of SMDA is that the observations provided are consistent within themselves; otherwise the algorithm runs into a conflict state. For example, consider a network that consists of a single reaction with a single substrate and a single product. The user observes both the substrate and the product. The substrate is classified as Unavailable and the product is classified as Available. The SMDA algorithm would (a) mark the reaction as Inactive as the substrate is Unavailable, and also (b) mark the product as Unavailable as the only producer in the network is Inactive. The inferred label in (b) contradicts the observation that the product is Available, which is a conflict.

SMDA may encounter two types of conflicts based on the stage of the algorithm, namely, Expansion Conflict and Merge Conflict. Given an observation set and a metabolic sub-network, an expansion conflict occurs when a reaction in a GAI graph/flow-graph cannot be assigned the label of Active or Inactive. A merge conflict happens when two GAI groups have shared (same) border metabolite pool(s), but the GAI graphs in the two groups cannot be merged because of inconsistent metabolite pool label(s). See [20] for details.

3.3 Existing Gene Lethality Techniques and SMDA

Flux Balance Analysis (FBA) is a computational technique that computes the fluxes of reactions under “optimal” conditions such as the maximization of biomass [17]. The technique cannot directly compute metabolite pool sizes (under either optimal or non-optimal conditions). FBA works by defining an objective function to optimize. For gene lethality, the objective function is the maximization of the flux (flow) in the artificially defined “biomass production” reaction. The reasoning is that organisms strive to maximize their chances of survival by maximizing the production of essential metabolites, a debatable argument [17][2]. FBA can be characterized as having two steps:

• Define an artificial reaction called the “biomass reaction” whose substrates and products are essential metabolites (it is not clear how the researchers choose which essential metabolite is a substrate, and which one is a product, except that, obviously, the goal is to maximize the flux of the biomass reaction, and, hence, to maximize the pool sizes of selected products).

• Given a set of stoichiometry equations, characterize the metabolite pool consumption and production of the organism at steady-state by performing a linear programming-based optimization with the goal of maximizing the flux in the biomass reaction.

If the optimization returns zero flux in the biomass reaction then the conclusion is that the knocked-out genes are lethal; otherwise, they are non-lethal. For the sake of simplicity in the discussion, from now on, we discuss the single gene-knockout case.
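For reference, the optimization described above is commonly written as a linear program of the following form. The symbols are the usual FBA notation and are assumptions here, since they do not appear in the text: S is the stoichiometry matrix, v is the vector of reaction fluxes, and l_j and u_j are lower and upper flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & v_{\text{biomass}} \\
\text{subject to} \quad & S\,v = 0, \\
& l_j \le v_j \le u_j \quad \text{for every reaction } j, \\
& v_k = 0 \quad \text{for every reaction } k \text{ of the knocked-out gene.}
\end{aligned}
```

Under this formulation, the gene is declared lethal when the optimal value of the biomass flux is zero, and non-lethal otherwise.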

There are four criticisms of the FBA technique in the literature.

1. Full stoichiometry of the metabolic network is commonly unavailable to researchers, rendering the optimization inapplicable. In this case, the common approach is to “estimate” the stoichiometry matrix by using a stoichiometry matrix of 1’s, 0’s, and -1’s.

2. Organisms routinely survive under sub-optimal conditions. Therefore, the optimality criterion needed by the metabolic control analysis techniques is an artificial, and not always correct, criterion (please see Example 3.1).

3. The biomass maximization criterion is applicable only to simple organisms (e.g., unicellular organisms), and it is hard to define an objective function for more complex organisms [29].

4. For an organism to live, all metabolite pools that are produced must be consumed at steady-state (see [9] for a definition of steady-state). This in turn means that, in the metabolic network, there should not be “dead-end” (or dangling) metabolites (e.g., a metabolite with no consumers or a metabolite with no producer reactions). But dead-end metabolites routinely occur in metabolic networks of organisms, due to lack of knowledge about the organism. Metabolic control analysis techniques must work with “consistent networks with no dead-end metabolites” to perform their optimization; otherwise, the optimization fails. Therefore, it is not uncommon for researchers to make changes to the network at hand, such as adding reactions in order to eliminate dead-end metabolites. For example, multiple “source flux” and “escape flux” reactions are added in the reconstructed network of Trypanosoma cruzi [59].

In comparison, SMDA does not need or use stoichiometry equations, and therefore does not suffer from the first criticism of FBA. Similarly, SMDA does not perform optimization and does not require any objective function. Hence, it does not suffer from the second and third criticisms either. Like FBA, SMDA also needs a full metabolic network to perform its analysis.

And, as a negative, SMDA has exponential time complexity due to the enumeration of all possible flow-graphs. Enumeration of possible states enables SMDA to avoid criticisms 2 and 3, but it comes at the price of exponential time complexity, similar to Elementary Mode Analysis, whose goal is to enumerate elementary fluxes at the steady state. We also note that, with more observations and/or domain expert knowledge, SMDA complexity can be reduced.

Finally, to compare SMDA with FBA, we have attempted to replicate the “optimal flux distribution” generated by FBA in the Trypanosoma cruzi paper [15]. We have found that SMDA does not generate the “optimal case” as found by FBA. The reason is as follows.

To maximize the flux in the biomass reaction, FBA freely shuts down the flux in any reaction, even when all the substrates and the enzyme of such a reaction are indeed available. SMDA does not allow such a case, as it would create a conflict with the underlying biochemistry. That is, for SMDA, a reaction r is considered to be active when all substrates of r are available, and no product of r is severely accumulated (i.e., product inhibition does not occur). In more detail, SMDA uses basic biochemistry knowledge to reason about possible scenarios of the metabolic network, based on observations. Some example rules are listed below (for a complete list of rules, please see Cakmak et al. [9]).

• Whenever a non-bio-fluid metabolite m is marked as “Severely Accumulated”, all reactions that produce (and, therefore, due to the steady-state assumption, consume) m are “Inactive”.

• If all producers (consumers) of a metabolite pool m are Inactive then, due to the PCRE [9] property, regardless of the pool label of m, all consumers (producers) of m are Inactive.

• If at least one producer (consumer) of a metabolite m is Active, then (i) m is either Available or Accumulated, and (ii) at least one consumer (producer) of m is Active.

• If the metabolite m is Unavailable then all consumers (and, thus, due to the steady-state assumption, all producers) of m are Inactive.
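Three of the example rules above (the first, second and fourth) can be sketched as simple label propagation. The function names and the list-based representation of producers and consumers are assumptions made for illustration; the sketch covers only a single non-bio-fluid pool and does not implement the third rule, which requires enumerating alternatives.

```python
def reactions_forced_inactive(pool_label, producers, consumers):
    """First and fourth rules: an Unavailable or Severely Accumulated
    (non-bio-fluid) pool forces all of its producers and consumers to be
    Inactive."""
    if pool_label in ("Unavailable", "Severely Accumulated"):
        return {r: "Inactive" for r in (*producers, *consumers)}
    return {}


def propagate_inactivity(reaction_labels, producers, consumers):
    """Second rule: if every producer (consumer) of a pool is Inactive,
    then every consumer (producer) of that pool is Inactive as well."""
    labels = dict(reaction_labels)
    if producers and all(labels.get(r) == "Inactive" for r in producers):
        labels.update({r: "Inactive" for r in consumers})
    if consumers and all(labels.get(r) == "Inactive" for r in consumers):
        labels.update({r: "Inactive" for r in producers})
    return labels


forced = reactions_forced_inactive("Unavailable", ["r1"], ["r2", "r3"])
spread = propagate_inactivity({"r1": "Inactive", "r2": "Unknown"},
                              ["r1"], ["r2"])
```

In the second call, r1 is the only producer and is Inactive, so the consumer r2 is labeled Inactive regardless of the pool label.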

To judge whether a reaction is active or not, SMDA employs biochemistry rules that are captured as “activation conditions” (ACT condition set), and, if the ACT condition set of a reaction is satisfied, SMDA labels the reaction as Active; otherwise, the reaction is labeled Inactive. We give an example of a case in which the optimal solution found by FBA is rejected by SMDA as a valid flux distribution alternative.

Example 3.1. Supplemental file 5 [16] of the Trypanosoma cruzi paper [15], “Flux distribution for epimastigote model”, presents a graphical depiction of the FBA-located optimal flux distribution for the epimastigote phase of the organism. Figure 3.2 shows part of the network with the optimal flux distribution. The reaction NADH2-u6m (NADH dehydrogenase, mitochondrial) in mitochondria is assigned zero flux by FBA. However, in SMDA, NADH2-u6m is assigned the Active label (indicating the existence of flux) as follows.

• From the network, the ACT set of the reaction NADH2-u6m is defined as {q6[m] is Available; h[m] is Available; nadh[m] is Available; q6h2[m] is not Severely Accumulated; nad[m] is not Severely Accumulated}.

• SMDA infers that all three substrates of NADH2-u6m, namely, q6[m], h[m] and nadh[m], are Available since

o reaction CYOR_u6m (ubiquinol-6 cytochrome c reductase) has non-zero flux, and is producing q6[m];

o reaction SUCD3_DASH_u6m (succinate dehydrogenase (ubiquinone-6), mitochondrial) has non-zero flux, and is consuming q6[m];

o reaction P5CDm_i (1-pyrroline-5-carboxylate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is producing h[m];

o reaction MDHm (malate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is producing nadh[m].

• Both products of the reaction NADH2-u6m, namely, q6h2[m] and nad[m], are not Severely Accumulated, since (i) reaction CYOR_u6m (ubiquinol-6 cytochrome c reductase) has non-zero flux, and is consuming q6h2[m]; and (ii) reaction MDHm (malate dehydrogenase, mitochondrial) (not shown in Figure 3.2) has non-zero flux, and is consuming nad[m].

Thus, since the ACT set of NADH2-u6m is satisfied, SMDA labels it as Active. In other words, some optimal results of FBA may not have a biological justification. FBA results may or may not occur in reality, and are not necessarily always generated by SMDA.
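The evaluation of the ACT set in Example 3.1 can be sketched as a list of (pool, condition) pairs. The representation and the names below are illustrative assumptions, and the concrete labels given to q6h2[m] and nad[m] are placeholders: the example only establishes that these two pools are not Severely Accumulated.

```python
def is_available(label):
    return label == "Available"


def not_severely_accumulated(label):
    return label != "Severely Accumulated"


# ACT set of NADH2-u6m as stated in Example 3.1.
ACT_NADH2_U6M = [
    ("q6[m]", is_available),
    ("h[m]", is_available),
    ("nadh[m]", is_available),
    ("q6h2[m]", not_severely_accumulated),
    ("nad[m]", not_severely_accumulated),
]


def reaction_label(act_set, pool_labels):
    """Return Active if every activation condition holds, else Inactive."""
    satisfied = all(cond(pool_labels[pool]) for pool, cond in act_set)
    return "Active" if satisfied else "Inactive"


pool_labels = {
    "q6[m]": "Available",
    "h[m]": "Available",
    "nadh[m]": "Available",
    "q6h2[m]": "Available",   # placeholder: any label but Severely Accumulated
    "nad[m]": "Available",    # placeholder: any label but Severely Accumulated
}
label = reaction_label(ACT_NADH2_U6M, pool_labels)
```

With these labels the reaction is labeled Active, which is the outcome described in the example; changing nad[m] to Severely Accumulated would make it Inactive.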

Figure 3.2 Partial depiction of the optimal flux distribution on epimastigote model of T. cruzi network

3.4 Revising SMDA For Gene Lethality Testing

An SMDA-based gene lethality test can be done in three steps. First, reactions catalyzed by the enzymes produced by the knocked-out gene are removed from the network. Then all essential metabolite pools are labeled as Available. Finally, SMDA is run to check if there is at least one feasible flow-graph in the metabolic network that produces and consumes each and every essential metabolite. Thus, stopping conditions for gene lethality/non-lethality are as follows.

Stopping criterion to decide that gene is lethal: The algorithm terminates with no flow-graphs. This means the algorithm could not find a feasible flow-graph that has the biomass reaction active because it had a merge/expansion conflict (please see Section 3.2 for an explanation of the conflicts and Cicek et al. [37] for more details). Either conflict reveals a discrepancy between the observation set and the metabolic sub-network. The observation set is unchanged and is based on a live organism. However, the sub-network was affected by knocking out the gene, which causes some reactions to be inactive due to the enzyme not being available. So the reason for no feasible flow in the organism is that the gene was knocked out (the gene is lethal).

Stopping criterion to decide that gene is non-lethal: The algorithm produces one flow-graph in which all essential metabolites are both produced and consumed, given that either all reactions in the network are expanded, or a subset of the reactions have been expanded and we are guaranteed not to have an expansion conflict as the algorithm proceeds (no conflicts).

Gene lethality/non-lethality testing algorithm (sketch):

1) Mark all reactions of the knocked-out enzyme(s) as Inactive.

2) Mark all essential metabolite pools and some energy pools (details are in Section 3.5) as Available. Mark all observed metabolite pools as Available or Unavailable according to supplement 5 [16] of the Trypanosoma cruzi paper [15].

3) Starting from all metabolite pools that have a status other than Unknown, SMDA creates flow-graphs and expands them as described in Section 3.2. If a merge or expansion conflict is encountered, then the knocked-out gene is lethal since, given the observations, SMDA is not able to find a scenario where all essential metabolites are Available. When there is a single flow-graph, there is no chance of a merge conflict, yet SMDA may still encounter an expansion conflict (see Example 3.2). One option is to run the algorithm until all reactions in the network are expanded and it halts with at least one flow-graph. This means that there is a feasible flow on this network with the provided observations. A second option is to guarantee that there would have been no expansion conflicts if SMDA had expanded all reactions. An expansion conflict can occur only while expanding a reaction that has more than one border pool associated with it (e.g., SMDA knows the labels of both the substrate and the product, and expands the reaction). Considering the flow-graph as a super node, if there is a cycle (disregarding the directions of the edges) in the sub-network that contains the super node, then there is a chance of an expansion conflict. Expansion should continue until there is no such cycle. Moreover, an expansion conflict can only occur on a reaction that connects an Unavailable pool to an Available pool. Hence, SMDA needs to consider only the cycles that include an Available pool and an Unavailable pool in the flow-graphs.

Figure 3.3 A complete network for Example 3.2

Example 3.2. Assume that the network in Figure 3.3 is the single flow-graph after merge, and all reactions in the network are included. Since the reaction RPE is Unknown, SMDA will keep expanding the flow-graph. However, “expansion conflict” will occur in the following step since the substrate of RPE is not available, but the product of RPE is available.

3.5 Experimental Evaluation

In this section, we describe the environment in which the SMDA was run, the conducted gene lethality/non-lethality tests, and the results.

3.5.1 Experimental Setting

Metabolic Network. Our tests were run on the reconstructed metabolic network for Trypanosoma cruzi [15]. We have obtained the reconstructed network model of this organism in the form of an SBML document, and parsed and exported the model (with a home-made SBML parser tool) into our PathCase-RCMN database Metabolomics.Chlamydomonas_reinhardtii_curated. The data is available to browse from the web interface of the PathCase-RCMN website [39]. The data was cleaned up manually since there were some inconsistencies in the SBML file itself. For example, some reactions were not marked as transport processes even though their substrates and products were in different compartments. We located those reactions and corrected them to transport reactions. Also, the biomass reaction was removed from the database as it is an artificial reaction and is not needed for SMDA. It is worth mentioning that, in different supplement files of the Trypanosoma cruzi paper [15], some reaction directions (forward/backward) were not consistent. We located such discrepancies and fixed them during the experiments.

The database consists of 162 reactions, which include 58 transport reactions and 92 gene-associated reversible reactions. To construct the sub-network for SMDA, we include all 15 pathways (the Trypanosoma cruzi paper [15] has more pathways in the data file, and we count similar pathways as one), and 52 transport reactions that connect metabolite pools of the same metabolite in different compartments. This is done to ensure the connectivity of the sub-network.

Algorithm Input. Given the full network described above, we run the SMDA gene lethality testing algorithm using the extracellular metabolite observations provided in the paper supplement [16]. There are 17 such metabolites; 12 out of 17 are marked as Available, and the rest are marked as Unavailable, as shown in Table 3.1.

We also input substrates and products of the biomass reaction as described in Section 3.4. This corresponds to another 20 metabolites that are marked as Available. Their roles and names are shown in Table 3.2.

Table 3.1 Metabolite pool observations from the T. cruzi paper

12 Available metabolite pools: M_nh4_e, M_pro_DASH_L_e, M_glu_DASH_L_e, M_o2_e, M_b_DASH_D_DASH_glucose_e, M_a_DASH_D_DASH_glucose_e, M_ac_e, M_asp_DASH_L_e, M_gly_e, M_pi_e, M_glyc_e, M_h_e

5 Unavailable metabolite pools: M_succ_e, M_thr_DASH_L_e, M_ala_DASH_L_e, M_co2_e, M_h2o_e

Table 3.2 Metabolite pool observations from biomass reaction

6 products of biomass reaction: M_pi_c, M_nadp_c, M_nadh_c, M_coa_c, M_co2_c, M_adp_c

14 substrates of biomass reaction: M_r5p_c, M_pyr_c, M_pep_c, M_oaa_c, M_nh4_c, M_nadph_c, M_nad_c, M_g6p_DASH_B_c, M_g3p_c, M_e4p_c, M_atp_c, M_akg_c, M_accoa_c, M_3pg_c

Since energy pools (i.e., metabolite pools related to energy metabolism) are essential for the organism, we add those pools and mark them as Available also. The 17 energy pools are in Table 3.3.

Table 3.3 Energy pools are set as Available

17 Available energy pools: M_fad_m, M_fadh2_m, M_nad_x, M_nadh_x, M_nadp_x, M_nadph_x, M_nad_m, M_nadh_m, M_nadp_m, M_nadph_m, M_atp_x, M_atp_m, M_adp_x, M_adp_m, M_h2o_m, M_h2o_c, M_h2o_x

Our experiments in this chapter focus on the epimastigote model of Trypanosoma cruzi. According to supplement 3 [59] of the paper, there are 18 reactions that are always Inactive in the epimastigote model. We mark those 18 reactions as Inactive. The reactions are listed in Table 3.4.

Table 3.4 Inactive reactions for epimastigote case

18 inactive reactions: aldose1epimerase-likeprotein in Glycosome, inorganicdiphosphatase__ in Glycosome, ribokinase,glycosomal in Glycosome, gluconatekinase,glycosomal in Glycosome, deoxyribokinase,glycosomal in Glycosome, glycerol-3-phosphatedehydrogenase(nad) in Glycosome, glycerolkinase() in Glycosome, inorganicdiphosphatase in Cytosol, glycerol-3-phosphatedehydrogenase(FAD) in Cytosol, NADHdehydrogenase in Mitochondria, malicenzyme(NADP)l in Mitochondria, acetatesuccinatecoatransferase in Mitochondria, L-alaninetransaminasel in Mitochondria, NADHdehydrogenasel in Mitochondria (NADH2-u6m), NADHdehydrogenasel in Mitochondria (NADH2-u6am), L-threoninedehydrogenase,mitochondrion in Mitochondria, inorganicdiphosphatase_ in Mitochondria, alternativeoxidase in Mitochondria

As we discussed before, for SMDA to produce gene lethality test results, the directions of reversible reactions should be set. We utilize the “FBA-selected optimal result” from the paper’s supplement 5 [15] as an example network, and set the direction of the reactions according to the paper’s optimal result. Out of 92 reversible reactions, 68 reactions are set in one (forward) direction, and 24 reactions are set to the other (backward) direction.

3.5.2 Gene Lethality Test Results

Table 3.6 lists SMDA lethality test results for the genes in Table 3.5, along with the reason why the SMDA algorithm has found the gene to be lethal.

Observation 1: SMDA gene lethality testing algorithm verified correctly the lethality of all seven lethal genes.

Table 3.5 Lethal genes to be verified

Experimental Target                         Model reaction(s) constrained
fructose-1,6-bisphosphate aldolase          FBAg
Phosphogluconate dehydrogenase              PGDH
glyceraldehyde-3-phosphate dehydrogenase    GAPDg
Hexokinase                                  HEXg, GLUKg
Phosphofructokinase                         PFKg
phosphoglycerate mutase                     PGM
enolase                                     ENO

Thus, the success rate for the SMDA algorithm for verifying lethal genes was 100%. That said, due to reversible reactions, the number of possible networks and thus the number of possible flow-graphs (i.e., the number of possible steady-state network flows) is exponential: in the Trypanosoma cruzi network, there are 92 reversible reactions. When all combinations of directions of the reversible reactions are considered, there are 2^92 different possible networks. Therefore, one is confronted with the problem of pruning the search space in order to locate feasible flow-graphs. In our experiments, in order to avoid testing each and every combination of reversible reaction directions for possibilities, we set a priori the directions of reversible reactions in the Trypanosoma cruzi network according to the “optimal” flow result found by any constraint-based technique, such as FBA. We did so by using the results in the supplement of the paper [16], and use the corresponding network as opposed to testing 2^92 different possible networks.
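The size of this search space is easy to check with a short computation. The variable names below are chosen for illustration only; the count assumes that each of the 92 reversible reactions can independently take one of two directions.

```python
n_reversible = 92

# Each reversible reaction can run forward or backward, so the number of
# distinct direction assignments doubles with every reaction.
n_networks = 2 ** n_reversible

print(n_networks)           # exact integer count
print(f"{n_networks:.2e}")  # 4.95e+27
```

That is roughly 5 x 10^27 candidate networks, which is why the directions are fixed a priori rather than enumerated.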

Another reason for choosing one network is that for different networks, SMDA may give different results to be consistent with the underlying biochemistry. We give an example.


Table 3.6 SMDA test results on lethal genes

Gene                                        Reason
fructose-1,6-bisphosphate aldolase          Expansion conflict after reaching single flow group
phosphogluconate dehydrogenase              Merge conflict
glyceraldehyde-3-phosphate dehydrogenase    M_nadh_x is Available, but the only producer
                                            glyceraldehyde-3-phosphatedehydrogenase is Inactive
                                            due to the gene knock out. (Note: another reaction,
                                            malatedehydrogenase,peroxisomal, is in backward
                                            direction; so it is a consumer instead of a producer.)
hexokinase                                  Merge conflict
phosphofructokinase                         Expansion conflict after reaching single flow group
phosphoglycerate mutase                     M_3pg_c is Available, but the only consumer
                                            phosphoglyceratemutase is Inactive due to the gene
                                            knock out.
enolase                                     M_pep_c is Available, but the only producer enolase
                                            is Inactive due to the gene knock out.

Example 3.3. Consider the partial network of Figure 3.4 where a circle represents a metabolite, a rectangle represents a reaction, and a directed edge between them represents a role (substrate or product) of the metabolite in the reaction. Assume that RPI is a reversible reaction, and the metabolite r5p is a substrate of the biomass reaction (not included in the figure), which means r5p is an essential metabolite and must be available.

When the gene PGDH is knocked out, metabolite ru5p-D is not Available since PGDH is the only producer of the metabolite (assume reaction RPI’s direction is forward, consuming ru5p-D and producing r5p, as shown in the figure). As a consequence, r5p is also not Available since its only producer reaction RPI is Inactive due to the unavailability of ru5p-D. This conflicts with the observation that r5p is an essential metabolite; and, thus, SMDA concludes that the gene PGDH is lethal. However, if RPI works in the direction opposite to that of Figure 3.4, i.e., r5p is a substrate of RPI and ru5p-D is a product of RPI, SMDA will conclude that the gene PGDH is non-lethal since ru5p-D is still available, being produced by RPI.
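The dependence of the verdict on the direction of RPI can be reproduced with a drastically simplified propagation. This sketch is not the SMDA algorithm: it fixes all directions in advance, enumerates no alternatives, ignores accumulation labels, and treats a pool with no producer in the partial network as a boundary pool that stays Available. The function name knockout_is_lethal and the tuple-based reaction encoding are assumptions made for illustration, and the upstream substrates of PGDH are omitted.

```python
def knockout_is_lethal(reactions, knocked_out, essential):
    """reactions: name -> (substrates, products), directions already fixed."""
    inactive = set(knocked_out)
    unavailable = set()
    pools = {p for subs, prods in reactions.values() for p in (*subs, *prods)}
    changed = True
    while changed:
        changed = False
        # A pool with producers becomes Unavailable once all are Inactive.
        for pool in pools - unavailable:
            producers = [r for r, (_, prods) in reactions.items()
                         if pool in prods]
            if producers and all(r in inactive for r in producers):
                unavailable.add(pool)
                changed = True
        # A reaction becomes Inactive once any substrate is Unavailable.
        for name, (subs, _) in reactions.items():
            if name not in inactive and any(s in unavailable for s in subs):
                inactive.add(name)
                changed = True
    # Conflict: an essential pool must be Available but was derived otherwise.
    return any(pool in unavailable for pool in essential)


forward = {"PGDH": ((), ("ru5p-D",)), "RPI": (("ru5p-D",), ("r5p",))}
backward = {"PGDH": ((), ("ru5p-D",)), "RPI": (("r5p",), ("ru5p-D",))}

lethal_forward = knockout_is_lethal(forward, {"PGDH"}, {"r5p"})
lethal_backward = knockout_is_lethal(backward, {"PGDH"}, {"r5p"})
```

Under these assumptions, the forward direction of RPI yields a conflict on r5p (lethal), while the backward direction does not, mirroring the two outcomes described in the example.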

Observation 2: SMDA lethality testing algorithm covers all stopping conditions for the seven genes.

Figure 3.4 A partial network for Example 3.3.

The lethality testing algorithm has stop conditions such as the expansion conflict, the merge conflict and the expansion conflict after reaching a single flow group. In the lethality testing of seven genes, the algorithm stops as follows: (i) for three genes: the expansion conflict; (ii) for two genes: the merge conflict, and (iii) for two genes: the expansion conflict after reaching single flow group.


3.5.3 Gene Non-Lethality Test Results

We use the same network and observations as in the gene lethality test, but this time we knock out a non-lethal gene instead of a lethal one. When a feasible flow-graph is generated, the non-lethality is verified since the organism is functioning without the knocked-out gene. In this experiment, we performed one non-lethality test to verify the non-lethality of the gene adenosine kinase in the epimastigote model. The gene adenosine kinase is related to the reaction ADK1g.

Testing a gene for non-lethality is less time consuming than testing it for lethality, as it requires the generation of only one feasible flow scenario that produces and consumes all essential metabolites. However, the single feasible flow-graph should be complete in its assignment of Active/Inactive status to all reactions. Since one complete flow-graph is sufficient for this experiment, we pre-set the status of some reactions a priori according to the optimal result of the paper, in order to expedite the SMDA run and reduce the running time and complexity of the task. In addition to the 18 Inactive reactions in the epimastigote model [59], which are listed in Table 3.4, and the one reaction related to the knocked-out gene adenosine kinase, we set another 16 reactions as Inactive and 53 reactions as Active [16].

With the algorithm of Section 3.4, the non-lethality test execution stops when there is no cycle to cause a conflict in the current flow-graphs, at which time all 167 reactions are covered in the test. To have a complete flow-graph, we have SMDA keep running until all 167 reactions in the network have a status label. There are 3,328 flow-graphs in the final result. We have been able to conclude that the gene “adenosine kinase” is not lethal since there are possible feasible flows in the organism to keep it alive even when the gene is knocked out. Clearly, SMDA can also be used in the same manner to verify the non-lethality of other genes.

3.6 Conclusions

In this chapter, we have proposed an algorithm to verify gene lethality with the SMDA metabolomics tool. Using the SMDA gene lethality test algorithm, we have successfully verified seven lethal genes and one non-lethal gene in a genome-scale metabolic network of Trypanosoma cruzi organism. This confirms that SMDA can be used for gene lethality testing purposes. Compared with other computational techniques such as FBA, SMDA produces results consistent with the underlying biochemistry. On the negative side, for a very large network, SMDA has its limitations since it enumerates all possible activation/inactivation scenarios for the network at hand. We have discussed ways of reducing the complexity, e.g., abstracting pathways into single “abstract” reactions. Thus, we conclude that SMDA can be used as an alternative tool to verify gene lethality.


Visualization Tools for PathCase Systems

4.1 Introduction

PathCase visualization tools visualize metabolic data, relationships in the data, and analysis results of the data via a Java applet. The visualization tools are components of many PathCase systems, as shown in Figure 4.1. In this chapter, we present the visualization tools in the PathCase-SB system, as well as the different and specific features in other PathCase systems, as listed below:

• PathCase-SB: PathCase Systems Biology Workbench, featuring BioModels models and KEGG pathways; it has 409 systems biology models and 139 KEGG pathways.

• PathCase-MAW Editor: a stand-alone Java application for maintaining a mammalian metabolic database, MAW.

• PathCase-MAW: PathCase Metabolomics Analysis Workbench, featuring a manually created generic mammalian metabolic network with 27 pathways.

• PathCase-RCMN: PathCase ReconstruCted Metabolic Networks, which has four modes, namely, Mus Musculus iMM1554 model (2008), Mus Musculus iMM1415 model (2010), H. sapiens Recon 1 model and Trypanosoma Cruzi iSR215 model (2009).

• PathCase-Recon: PathCase RECON Workbench, featuring Genome-Scale Reconstructed Metabolic Networks and KEGG pathways; it has 53 networks and 139 KEGG pathways.

• PathCase-SMDA: an online tool to analyze metabolomics data in terms of the dynamic behavior of the metabolic network under steady state.

• Metabolism Query Language Interface: an interface to query the PathCase-MAW database with the Metabolism Query Language.

Figure 4.1 Visualization Tools and Applications

We generalize the visualization framework of all PathCase visualization tools. Part of the framework is also used in providing visualization data for three different iPad applications, namely,

• iPathCaseMAW: the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool,

• iPathCaseRCMN: the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks,

• iPathCaseKEGG: the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

4.2 Visualization Tool for PathCase-SB System

Released in August 2010, the PathCase-SB system [17][18][19] brings together (i) systems biology sources, e.g., BioModels [20][21][22], and (ii) pathway sources, e.g., KEGG [23][24][25][26], with the goal of providing additional capabilities and tools made possible by the integration. The PathCase-SB system provides a database-enabled framework and web-based computational tools to facilitate the development of kinetic models for biological systems. Currently, PathCase-SB provides capabilities and interfaces for visualization, browsing, querying, simulation and comparison, model composition, and uploading of user models.

At the server side, PathCase-SB data is managed by a relational database management system, namely, Microsoft SQL Server 2008. An object-oriented data-access interface between the relational database and the application layer is provided in the form of a large set of wrapper class functions, in order to provide easy, extensible, and fast data access, and to prevent major changes in the application when a schema change occurs during the evolution of PathCase-SB. Web services are provided for the visualization interface and other applications to obtain detailed model and pathway data. Mappings between BioModels and KEGG pathways are created between species of systems biology models and molecular entities of KEGG, reactions of systems biology models and processes of KEGG, and systems biology models and pathways of KEGG.

The visualization interface is accessed from different places within PathCase-SB. It is employed by different sub-components, namely,

• Browser Interface (appears as a menu item at many places with the name "Interactive Model/Pathway Graph"),

• Built-In Queries (by each query that produces a metabolic sub-network),

• iModel Tool (biochemical networks of uploaded user models),

• Model Composition Tool.

Compared with the visualization tool in previous PathCase systems, i.e., PathCase-KEGG, the visualization tool in the PathCase-SB system has the following new features:

• Integration of the interactive pathway graph visualization, which shows the molecular entities, processes and enzymes, cofactors, activators, regulators and inhibitors of pathways, with the interactive model graph visualization, which shows the species, reactions, compartments and their properties of models.

• Display of a model according to its compartment hierarchy, as shown in Figure 4.2. Movement of elements in a compartment is limited by the compartment’s boundary.

• For those models that are related to a pathway, the mapping between the model network and the pathway is provided by displaying both side by side, and highlighting the species in the model and the molecular entity in the pathway together with the corresponding qualifiers (such as is-part-of), as shown in Figure 4.4 (the result of Figure 4.3).

Figure 4.2 Visualization of Albert2005-Glycolysis Model

Also, the visualization tool has the capabilities of

• visualization simplifications

(i) Long entity names are truncated. Full names and related information are shown when the user moves the cursor over the entity.

(ii) Common species, which participate in many reactions in the network (e.g., H2O, ATP, ADP, etc.), can be elected not to be visualized.

• layout manipulations

The visualization layout can be manually revised and saved, while the movement is constrained to keep it biologically meaningful (e.g., entities are limited to the boundary of their compartment).

Figure 4.3 An example of built-in query (reaction-to-process mapping).

Figure 4.4 Visualization of a query (Figure 4.3) result.

4.3 Visualization Tools for other PathCase Systems

Visualization tools in different PathCase systems are adjusted to fit each system. Next we introduce several PathCase systems and the specific features of their visualization tools.

4.3.1 PathCase-MAW and PathCase-MAW Editor

PathCase Metabolomics Analysis Workbench (PathCase-MAW) provides a database-enabled framework and web-based computational tools for browsing, querying, analyzing, and visualizing stored metabolic networks. It features a manually created generic mammalian metabolic network with 27 pathways. The metabolic network can be accessed through a web interface or an iPad application. PathCase-MAW Editor, a stand-alone Java application with a user-friendly interface, can be used to create a new metabolic network and/or update an existing metabolic network.

Also, the visualization tool in PathCase-MAW and PathCase-MAW Editor has the following capabilities:

• Pathways that exist in multiple tissues are visualized side by side.

In the PathCase-MAW database, pathways are organized by tissue. The same pathway name may exist in different tissues with different reaction compositions. For example, the pentose phosphate pathway exists in adipose and Cytosol_Liver. When a pathway exists in multiple tissues, its instances are presented side by side in the visualization, as shown in Figure 4.5.

Figure 4.5 Glycolysis in Cytosol_Adipose and Cytosol_Liver.

• Common metabolites are reproduced for each reaction they participate in, which removes many edges between common metabolites and reactions and therefore yields a cleaner visualized graph.

Metabolites that participate in many reactions (e.g., NAD, ATP, O2, etc.) have high connectivity and tend to result in unclear visualizations. Such metabolites are called common metabolites. Unlike other metabolite pools, which are displayed once, these metabolites are displayed once per reaction they participate in to obtain clearer results, as shown in Figure 4.6.

• Transport reactions are distinguished with dotted lines, as shown in Figure 4.7.

4.3.2 PathCase-SMDA

As discussed in Chapter Two, SMDA is developed as a computational tool to analyze measurements of a mammalian metabolic network database. It evaluates the activation/inactivation scenarios of the metabolic network and interprets the metabolic consequences of the observed changes at steady state. Both the user-selected sub-network and the SMDA running results can be visualized.

Figure 4.6 Catabolism of Phenylalanine pathway in PathCase-MAW editor.

Also, the visualization tool in PathCase-SMDA has the capabilities of

• connecting reversible reactions via double edges, which enables displaying the direction of the reaction flow,

• highlighting the reaction flow via bold lines in the resulting visualization.

An SMDA result with highlighted flow is shown in Figure 4.8.


Figure 4.7 TCA cycle pathway in PathCase-MAW.

4.3.3 PathCase-RCMN and PathCase-Recon

Both the PathCase-RCMN [60] and PathCase-Recon [61] systems integrate metabolomics data of genome-scale reconstructed networks. However, they target two different databases and focus on different aspects of the reconstructed networks.

The PathCase-ReConstructed Metabolic Networks of organisms (PathCase-RCMN) system contains three reconstructed metabolic networks for two organisms, namely, Mus Musculus iMM1554 model (2008), Mus Musculus iMM1415 model (2010), and Trypanosoma Cruzi iSR215 model (2009). They are parsed from the literature and stored in the database to let users browse pathways, reactions and metabolites, see compartmentalized visualizations of the pathways, and query the networks using the various built-in queries provided. As an example, Figure 4.9 shows the Fatty Acid Metabolism pathway in the Mus Musculus iMM1415 model (2010).

Figure 4.8 SMDA query results.

The PathCase-Recon Workbench, featuring Genome-Scale Reconstructed Metabolic Networks and KEGG Pathways, contains reconstructed networks from the literature, the BiGG Database [62], MEMOSys [63], and the GSMNDB [64] site. Currently, it has 53 reconstructed metabolic networks, which include 39,692 reactions and 30,117 species. From the KEGG site, there are 139 pathways, which include 7,932 processes, 27,926 basic molecules, 5,342 proteins, and 6,295,654 genes. In the visualization of the PathCase-Recon system, instead of displaying the network pathway by pathway, the whole reconstructed network is visualized as one graph. As an example, Figure 4.10 shows the E. Coli Textbook reconstructed network.

4.3.4 PathCase-MQL

The PathCase-MQL system [65] provides a metabolism query language interface to query the PathCase-MAW database. The MQL tool enables users to query the metabolism under different stress conditions [66]. MQL enables users to specify multiple and different classes of queries, such as

(i) computing (and visualizing) “Activated/Inactivated (metabolic) Paths” with increased and decreased fluxes under specified physiological conditions,

(ii) identifying/verifying “Potential Futile Cycles”,

(iii) querying for required metabolic concentration change sets to prevent a particular futile cycle,

(iv) searching for concentration change sets which lead to the (in)activation of a user-specified metabolic subnetwork, and,

(v) exploring the metabolic behavior of a set of (possibly reversible) reactions.


Figure 4.9 Fatty Acid Metabolism pathway in the iMM1415 model (2010).

Our framework allows users to input concentration change statements on key metabolites, and incorporates such input into its query processing.

Both the selected sub-network and the query results can be visualized. In the visualization tool of PathCase-MQL,

• common metabolites are duplicated for each reaction they participate in;

• the reaction direction is highlighted in the query result;

• transport reactions are emphasized via dotted lines between compartments.

As an example, Figure 4.11 gives a query in the Urea Cycle of Cytosol_Liver, and Figure 4.12 displays the query result.

Figure 4.10 E. Coli Textbook in PathCase-Recon System.

Figure 4.11 An example of MQL query.

Figure 4.12 MQL query result of the example in Figure 4.11.

4.4 General Framework

For all the PathCase systems’ visualization tools, the general procedure can be summarized as shown in Figure 4.13. We generalize the visualization framework of all PathCase systems as having the following steps:

Figure 4.13 Visualization tools in PathCase systems

• Designing an XML schema for the visualization data file.

For all visualization tools, visualization data is encapsulated into an XML file, which is transferred between the web server and the user’s terminal. Depending on the visualization requirements of each PathCase system, different data, relations and customized visualization properties may be needed. The XML schema is customized for each PathCase system to fulfill these requirements.

• Defining parameters for web services to communicate with the visualization applet on the client side.

On the server side, web services provide a variety of capabilities for retrieving data to be visualized. However, each call from a client may need only part of that data. Parameters are defined for communication between the applet and the web service.

• Retrieving information from the PathCase system’s database.

On the server side, based on the parameters received from the applet, web services are used to access the database and obtain the data to be visualized.

• Composing the obtained information into an XML data file.

Data retrieved from the database is assembled into an XML file according to the predefined schema.

• Parsing the data file, and providing visualization via the applet.

After obtaining the XML file via the web service, the applet of the visualization tool parses the data file and restores the data into relations on the client side. The applet then visualizes the data and interacts with the user.

Based on the differing requirements of PathCase systems, one or more of the steps above may need to be adjusted or revised. For example, in the PathCase-MAW visualization tool, common metabolites are reproduced for each reaction they participate in, which removes many edges between common metabolites and reactions and therefore yields a cleaner visualized graph. In the PathCase-SMDA visualization tool, reversible reactions are connected via double edges to show the direction of flow.
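The compose and parse steps of the framework can be illustrated with a minimal Python sketch. It is not the PathCase implementation (which uses web services and a Java applet); the element and attribute names (graph, node, edge, id, label, source, target) and both function names are hypothetical and stand in for whatever a system-specific XML schema defines.

```python
import xml.etree.ElementTree as ET


def compose_graph_xml(nodes, edges):
    """Server side (sketch): assemble retrieved data into an XML string.

    nodes: iterable of (node_id, label); edges: iterable of (source, target).
    The tag and attribute names are assumptions, not the PathCase schema.
    """
    root = ET.Element("graph")
    for node_id, label in nodes:
        ET.SubElement(root, "node", id=node_id, label=label)
    for source, target in edges:
        ET.SubElement(root, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def parse_graph_xml(xml_text):
    """Client side (sketch): restore the data into simple relations."""
    root = ET.fromstring(xml_text)
    nodes = {n.get("id"): n.get("label") for n in root.findall("node")}
    edges = [(e.get("source"), e.get("target")) for e in root.findall("edge")]
    return nodes, edges


xml_text = compose_graph_xml(
    [("m1", "glucose"), ("r1", "hexokinase"), ("m2", "glucose-6-phosphate")],
    [("m1", "r1"), ("r1", "m2")],
)
nodes, edges = parse_graph_xml(xml_text)
```

Keeping composition and parsing symmetric around one schema is what lets each PathCase system customize the schema without changing the overall procedure.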

4.5 Visualization Tool for iPad Applications

Three iPad applications have been developed to provide mobile users with the mobile-feasible features of PathCase systems. The three iPad applications are:

• iPathCaseMAW: the iPad version of the PathCase-MAW system, which includes visualizations of metabolic pathways and the SMDA tool,

• iPathCaseRCMN: the iPad version of the PathCase-RCMN system, which includes visualizations of three reconstructed networks,

• iPathCaseKEGG: the iPad version of the PathCase-KEGG system, which includes visualizations of the Kyoto Encyclopedia of Genes and Genomes [27].

Compared with the full PathCase system, each iPad application includes only mobile-feasible data and functionalities. The XML schema, parameters and web services are all redesigned for the iPad applications. Using iPathCaseKEGG as an example, the application does not download the entire pathway data from the PathCaseKEGG database at once. HTTP requests are made as new data is needed, and the result is cached in the device’s internal storage.

After the XML response has been parsed into one or more KEGG internal objects, these objects are not serialized. In other words, all objects generated by web services are built from the XML each time they are needed.
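The fetch-on-demand behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name OnDemandCache, the injected fetch callable, and the in-memory dictionary (standing in for the device's internal storage) are hypothetical, and the actual iPad applications are not written in Python.

```python
import xml.etree.ElementTree as ET


class OnDemandCache:
    """Caches raw XML responses; parsed objects are rebuilt on every access."""

    def __init__(self, fetch):
        # fetch: callable taking a key and returning an XML string; in a real
        # client it would issue the HTTP request to the web service.
        self._fetch = fetch
        self._raw = {}  # stands in for the device's internal storage

    def get_xml(self, key):
        # Request the data only the first time it is needed.
        if key not in self._raw:
            self._raw[key] = self._fetch(key)
        return self._raw[key]

    def get_object(self, key):
        # Parsed objects are not stored: they are built from the XML each time.
        return ET.fromstring(self.get_xml(key))
```

Caching the raw XML rather than the parsed objects trades some repeated parsing for a simpler storage format.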

4.6 Conclusions

In this chapter, we have introduced the visualization tools that are integrated into different PathCase systems, namely, PathCase-SB, PathCase-MAW, PathCase-MAW Editor, PathCase-RCMN, PathCase-Recon, PathCase-SMDA and the PathCase Metabolism Query Language Interface. We have summarized the new features of the PathCase-SB visualization, and the specific features of other PathCase systems’ visualization tools. The visualization framework of the tools in different PathCase systems is generalized. Part of this framework is also used to provide visualization data for three iPad applications, namely, iPathCaseKEGG, iPathCaseMAW and iPathCaseRCMN.


Locating Basic Bio-Entities in Genome-Scale Reconstructed Metabolic Networks

5.1 Introduction

The number of genome-scale reconstructed metabolic networks (GSRMNs) has been increasing at a higher rate in the last five years [7][67]. GSRMNs are being built for a wide variety of organisms, and used in many applications for the tasks of (a) contextualization of high-throughput data, (b) guidance of metabolic engineering, (c) directing hypothesis-driven discovery, (d) interrogation of multi-species relationships, and (e) network property discovery [7]. The number of GSRMNs and their sizes in terms of the number of reactions continue to increase: most GSRMNs now have more than 500 reactions, with some having more than 3,700 reactions [68]. GSRMNs are specified in many ways: as SBML documents, as supplements to publications (e.g., [15]), and/or over the internet at the web pages of researchers.

It is noted in the literature [3][4] that published GSRMNs have two basic limitations, which prevent their full utilization. One is the inability to match metabolites/reactions/compartments in a given GSRMN to metabolites/reactions/compartments in a given data source (e.g., KEGG) or another GSRMN, due to naming inconsistencies involving species (metabolites), reactions, and compartments. We refer to metabolites, reactions, and compartments of GSRMNs as basic bio- (biological) entities. Another noted difficulty is in identifying the pathways of a GSRMN. We refer to this task as identifying higher-level bio-entities of a GSRMN automatically, where we classify pathways and metabolism-based sub-networks (e.g., the lipid metabolism) as “higher-level” biological entities of GSRMNs. We refer to the basic and higher-level bio-entity identification problems in GSRMNs together as the “bio-entity identification problem” of GSRMNs.

In this chapter, we focus on the basic bio-entity identification problem in a GSRMN model (referred to as the “target model” from here on) with respect to a “source model” (where the “source model” may easily be replaced by a “data source”, generalizing the identification problem), and propose three types of matches for metabolite identification, and a multi-step identification process for reaction identification. Compartment identification with a curated dataset is omitted here due to space limitations.

Identification results for metabolites and reactions are ranked via a variety of similarity scores. Finally, we present an empirical study of entity identification for four pairs of GSRMNs in a GSRMN database maintained by us [28], namely, “iAM303” and “E. coli textbook”, “H. pylori iIT341” and “EryNet”, “Model2008_09_23_13_13_29” and “Model2008_08_15_12_13_14”, and “03_16_09_TM_minimal_medium_glc” and “M. barkeri iAF692”. Also, we evaluate the usefulness of the entity identifications and/or ease of interpretation of our results.

5.1.1 Entity Identification

Stobbe et al. [3] compare the contents of five human metabolic pathway databases, and report that the level of agreement among the metabolic networks is very low. For example, the five databases that describe the human metabolic network agree on only 3% of ~7,000 reactions. Even for the well-studied TCA cycle pathway, only 5 of the 30 reactions agree in all five databases. The low agreement on pathways in different data sources may be due to differences in pathway definitions, different intermediate steps, different numbers of alternative substrates, and difficulties in determining the identities of metabolites. Stobbe et al. [3] suggest that the low agreement among pathways of different sources can be addressed via (i) comparing metabolites by KEGG compound ID, KEGG Glycan, ChEBI, PubChem Compound or CAS before comparing metabolite names, (ii) ignoring electrons, protons, and water while comparing reactions, (iii) treating metabolites with enzyme-bound/unbound versions as identical metabolites, etc.

The MetRxn database [4] is designed to resolve incompatibilities in content representation, where (i) metabolite and reaction descriptions are standardized by integrating information from 8 metabolic databases and 90 GSRMNs, (ii) all metabolite entries have matched synonyms, resolved protonation states, and are linked to unique structures, (iii) all reaction entries are elementally and charge-balanced, and (iv) the standardization in description allows for a direct comparison of metabolite and reaction content between metabolic models and databases. MetRxn allows the metabolic information to be used in its standardized version. Thus, we utilize the standardized networks when available. As of April 2, 2013, at least one unsolved problem with MetRxn is that all metabolites/reactions which lack full atomistic information are excluded from comparison.

5.1.2 Similarity Score

We briefly summarize metabolite and reaction similarity scores in the literature. To measure the closeness of located metabolites to a specified target metabolite n, one can use many scoring functions, based on either a structure-based metric or a string-based metric.

For structure-based similarity scores, the chemical structures of metabolites are compared. Most methods for calculating chemical similarity are based on a compound’s two- or three-dimensional structure. Molecular structures are sometimes represented by molecular fingerprints, in which case the fingerprints are compared. The maximum common substructure, or MCS, is used in the assessment of molecular similarity based on chemical graphs [69]. The score can be based on the number of common fragments or common subgraphs defined by the atom types [70]. A structure-based chemical similarity score can also be given via SMILES strings. A SMILES (simplified molecular-input line-entry system) string is a way to represent a 2D molecular graph as a 1D string. SIMCOMP (SIMilar COMPound) is a graph-based method for comparing chemical structures [71].

For string-based similarity scores, many metrics exist. For two strings of equal length, the Hamming distance is the number of positions at which the corresponding symbols are different [72]. The Levenshtein distance, or edit distance, is a string metric for measuring the amount of difference between two sequences [73]. The Monge–Elkan distance is a general text string comparison method [74]. The Needleman–Wunsch and Smith–Waterman distances are usually used for performing global and local sequence alignment, respectively, e.g., of protein sequences [75][76]. The Jaro–Winkler distance is a measure of similarity between two strings and is used in the area of record linkage [77]. The Matching Coefficient is a vector-based approach which simply counts the number of terms on which both vectors are nonzero. The Dice coefficient is a similarity measure over sets. Jaccard similarity uses word sets from the comparison instances to evaluate similarity [78]. The Tversky index is an asymmetric similarity measure that compares a variant to a prototype. The overlap coefficient is a similarity measure related to the Jaccard index. Cosine similarity is a common vector-based similarity measure similar to the Dice coefficient [79]. TF/IDF is not typically considered to be a similarity metric, but it provides a relevance metric for a string with respect to a given query; hence it is often used in searching. Maximal matches are often used for protein and DNA sequences.
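Three of the simpler measures above can be written out in a few lines of Python. This is a minimal sketch using the textbook definitions; the function names are illustrative, and applying the set-based measures to tokens of metabolite names is an assumption made only for the example.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def dice_coefficient(x, y):
    """Dice coefficient of two non-empty sets: 2|X n Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y))


def overlap_coefficient(x, y):
    """Overlap coefficient of two non-empty sets: |X n Y| / min(|X|, |Y|)."""
    x, y = set(x), set(y)
    return len(x & y) / min(len(x), len(y))


# Token sets of two metabolite names (tokenization chosen for illustration).
glucose_6p = {"D", "glucose", "6", "phosphate"}
fructose_6p = {"D", "fructose", "6", "phosphate"}
score = dice_coefficient(glucose_6p, fructose_6p)  # 2 * 3 / 8
```

Note that the token-based Dice score of these two different metabolites is high, which already hints at why name similarity alone is not a reliable identification criterion.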

The similarity of reactions is computed as the Tanimoto coefficient between the sets of bond changes describing the transformation from substrates to products in each pair of reactions. The Tanimoto coefficient is defined as the ratio of the size of the intersection set to the size of the union set [80]. SimR [81] considers the input compounds, output compounds and enzymes of the two reactions by integrating all three matching weights. SimR is computed via Maximum Weight Bipartite Matching.
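As a small illustration of the Tanimoto coefficient on sets, consider the sketch below. The bond-change labels are made up for the example and do not come from any of the cited methods; returning 1.0 for two empty sets is an assumed convention.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |A n B| / |A u B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


# Hypothetical bond-change sets for two reactions.
reaction_1 = {"C-O formed", "O-H broken", "C-N broken"}
reaction_2 = {"C-O formed", "O-H broken", "P-O formed"}
similarity = tanimoto(reaction_1, reaction_2)  # 2 shared / 4 total
```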

5.2 Metabolite Identification

For metabolite identification in a target model, we employ three types of matches (comparisons) to metabolites in the source model. We also apply filtering techniques on the entities of the target model, depending on what is available in both the source model SM and the target GSRMN TM at hand:

• Metabolite id matching (exact match) [3][4],

• Metabolite name synonym matching (exact match) [4],

• Approximate metabolite name (string) matching, and

• Filtering approximate string matching candidates via mandatory and optional techniques.

The algorithm is summarized in Figure 5.1.


Figure 5.1 Metabolite Identification Algorithm Sketch

The function CandidatesM() locates possible matches in the target GSRMN TM, and is summarized in Figure 5.2. Biologically significant term matching is described in Figure 5.3 of Section 5.2.3.2.

Figure 5.2 CandidatesM () function


We use the similarity score s to rank the closeness of a match. As summarized in Section 5.1.2, both structure-based and string-based similarity scores are used for metabolite matching in the literature. Since metabolite structures are not available in current GSRMNs, we use a string-based metric. Edit distance, the most popular rudimentary metric [72], is the basis of many other string metrics, e.g., the Needleman–Wunsch distance [75]. Since q-gram-based approximate string matching also uses edit distance, and we use q-gram approximate string matching to locate metabolites, we define the following similarity score function for a string a, its edit distance to string b, and the threshold k:

$$s = \mathrm{SimScore\_Original}(a,b) = 1 - \frac{D_{a,b}}{k+1}, \qquad k > 0,\; D_{a,b} > 0,$$

where $D_{a,b}$ is the edit distance of string a to string b, and k is the edit distance threshold for the approximate match between a and b. The similarity score s is a value in [0, 1]. The score is 0 when a or b does not exist. The edit distance threshold k is required as a parameter to calculate approximate metabolite similarity scores.
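The score can be sketched in Python as follows. The function names are illustrative. Two choices in the sketch are assumptions rather than part of the definition above: a distance larger than k is clamped to a score of 0, and an empty string is treated as a non-existent name.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sim_score_original(a, b, k):
    """1 - D(a, b) / (k + 1) for an edit distance threshold k > 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not a or not b:
        return 0.0  # score is 0 when a or b does not exist
    d = edit_distance(a, b)
    return max(0.0, 1.0 - d / (k + 1))  # clamping is an assumption
```

For example, with k = 1 the names "chol_c" and "chol_b" are at edit distance 1 and receive a score of 0.5.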

5.2.1 Exact Match via Metabolite Id/Synonyms

For metabolite id matching, different types of metabolite identifiers can be used to match two metabolites. Some identifiers identify a metabolite uniquely, for example, KEGG compound id, KEGG Glycan id, ChEBI id, or PubChem Compound (CAS) id. For other identifier types, a single identifier may correspond to multiple metabolites.

If the source model SM or the target GSRMN TM does not have unique identifiers specified, then synonym-based comparisons as in MetRxn [4] can be attempted. As summarized in Section 5.1.1, MetRxn has a single unified data set of standardized metabolite and reaction descriptions for 90 GSRMNs. If TM, the target GSRMN to be analyzed, is a metabolic network in MetRxn, then one can obtain metabolite synonyms from MetRxn. For each metabolite in TM, one can locate metabolites, each with a name identical to a MetRxn metabolite or its synonym. The located metabolites form the set of exactly matched metabolite entities.
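Exact matching over names and synonyms amounts to an index lookup, as in the sketch below. The function name, the input layout (a dictionary from metabolite id to its names and synonyms) and the sample ids are hypothetical; comparing names case-insensitively after trimming whitespace is an assumption of the sketch.

```python
def exact_synonym_matches(query_names, synonyms_by_id):
    """Map each query name to the ids whose name or synonym equals it."""
    index = {}
    for met_id, names in synonyms_by_id.items():
        for name in names:
            index.setdefault(name.strip().lower(), set()).add(met_id)
    return {q: sorted(index.get(q.strip().lower(), ())) for q in query_names}


# Made-up synonym table for illustration.
synonyms = {"met_1": ["D-Glucose", "Glucose"], "met_2": ["ATP"]}
matches = exact_synonym_matches(["Glucose", "xyl-D"], synonyms)
```

Names that return an empty list here are the ones passed on to approximate name matching.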

5.2.2 Approximate Name Matching

Metabolite name matching can be done via approximate string matching [82], and there are many approximate string matching techniques in the literature [83][84][85][86]. However, they can be inaccurate if applied directly. For approximate metabolite name matching, we build our method on a revised version of the most well-studied and efficient one [79], namely, q-gram based approximate string matching via string joins, to locate “similar metabolite names” between SM and TM, as summarized next. We start with some preliminaries.

The edit distance between two strings is the minimum number of edit operations (i.e., insertions, deletions, and substitutions) of single characters needed to transform the first string into the second. A q-gram of a string is a contiguous sequence of q characters from the string. A positional q-gram at location i of string σ is a pair (i, g), where g is the q-gram of σ that starts at position i, i.e., g = σ[i … i+q-1].
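Positional q-grams as defined above can be generated with a one-line Python sketch; the function name is illustrative, and positions are 1-based to follow the notation σ[i … i+q-1].

```python
def positional_qgrams(s, q):
    """Return the positional q-grams (i, g) of s with 1-based positions."""
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]


grams = positional_qgrams("glucose", 3)
# [(1, 'glu'), (2, 'luc'), (3, 'uco'), (4, 'cos'), (5, 'ose')]
```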

Thus, from a string t, one can create |t|-q+1 overlapping q-grams, where |t| denotes the length of t. Gravano et al. [79] propose efficient methods for evaluating the edit distance using such q-grams. One observation is that, for an edit distance of 1, the sets of q-grams from two strings differ by at most q, since only these q substrings contain the character affected by the one edit operation. The remaining q-grams correspond to each other. Based on this, Gravano et al. introduce count filtering: if edit(t1,t2)⩽d is true, then t1 and t2 share at least (max(|t1|,|t2|)+q-1)-d·q corresponding q-grams, where d·q is the maximum number of q-grams that can be affected by d edit operations.
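Count filtering can be sketched in Python as below. The bound in the text equals the total number of q-grams when d = 0, which holds only if a string of length n yields n+q-1 q-grams, so the sketch assumes each string is extended with q-1 padding characters at both ends. The padding characters '#' and '$' (assumed not to occur in metabolite names) and the function names are illustrative.

```python
from collections import Counter


def padded_qgrams(s, q, left="#", right="$"):
    """Multiset of q-grams of s extended with q-1 padding characters per side."""
    ext = left * (q - 1) + s + right * (q - 1)
    return Counter(ext[i:i + q] for i in range(len(ext) - q + 1))


def shared_qgrams(t1, t2, q):
    """Number of q-grams the two strings share (multiset intersection)."""
    return sum((padded_qgrams(t1, q) & padded_qgrams(t2, q)).values())


def passes_count_filter(t1, t2, q, d):
    """False means edit(t1, t2) <= d is impossible; True means 'maybe'."""
    bound = max(len(t1), len(t2)) + q - 1 - d * q
    return shared_qgrams(t1, t2, q) >= bound
```

The filter is a necessary condition only: pairs that pass still have to be verified by computing the actual edit distance.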

Given a metabolite m in SM, an edit distance threshold k, and a similarity score threshold ƟM, we propose to locate the set of metabolites mTM in TM whose names are within edit distance k of m and have a similarity score no less than ƟM, using token and q-gram based approximate string matching.

5.2.2.1 Problems with Approximate Name Match

Directly using approximate string matching on metabolite names may lead to incorrect results. We give an example.

Example 5.1. “D-fructose-6-phosphate” and “D-glucose-6-phosphate” are at an edit distance of only three (two substitutions and one deletion). Thus, given the edit distance threshold k=3, they are reported as an approximate match although they are different entities [87].

Also, approximate string matching results and scores are influenced by other factors:

• The metabolite name may or may not have a prefix/suffix; for example, “M_fe2_m” and “fe2_m”, or “NAD(+)” and “NAD(+) [cytoplasm]”;

• Metabolite names differ in prefix or suffix only; for example, “crn_b” and “crn_c”, which are the same metabolite in different compartments;

• The metabolite name has additional information, e.g., a formula; for example, “m_xyl_b” and “M_XYL_C5H10O5”;

• Metabolite names have differing notations; for example, “xyl-D[c]” and “M_xyl_DASH_D_c”;

• Suffixes/prefixes use different notations, e.g., “chol[c]” and “chol_c”;

• A suffix/prefix uses an abbreviation instead of a full name; for example, “NADPH [cytoplasm]” and “nadph_c”;

• Different metabolite names represent the same entity; for example, “Glucose” and “D-Glucose”.

• The “(-)” and “(+)” symbols are usually used to represent the optical rotation (optical activity) of a chiral molecule, a type of molecule that has a non-superimposable mirror image [88]. The (+) symbol refers to a dextrorotatory molecule. Such molecules rotate linearly polarized light to the right (clockwise) when viewed in the direction of light propagation. Molecules labeled with (-) are laevorotatory, and rotate the polarization to the left (counterclockwise). Sometimes the letters D and L are used for these respective cases, instead of (+) and (-) [89]. Naturally, most metabolites/enzymes have only one type of optical rotation, either “(-)” or “(+)”. In that case, “(-)” or “(+)” may be absent from the name, and it makes no difference whether the name has it or not; e.g., “(-)-ureidoglycolate” and “ureidoglycolate”, or “(-)-MAACKIAIN” and “MAACKIAIN” [90]. But there also exist some metabolites for which both forms occur; for example, “(+)-alpha-Pinene” and “(-)-alpha-Pinene” both exist. For the case of “Glucose”, only one isomer exists in nature, which is the right-handed form of glucose, denoted “D-glucose” [91]. These isomers can be identified via synonym matching.

5.2.2.2 Split-and-Match Approach

To refine the accuracy of identification, we further process results of approximate string matching. Next we propose a token and q-gram based technique to split the name string into substrings, classify substrings, and compare them. Below we define separators, token types, and the matching procedures.


5.2.2.2.1 Separators in Metabolite Names

According to the naming conventions [92][93] and the names in the GSRMNs, we first define the separators that are used to obtain substrings, which include “ ” (space), “-”, “_”, “:”, “,”, “[…]” and “(…)”.

To keep the original meaning of metabolites, “’” and “.” are not considered as separators.

For example, “3',5'-cyclic IMP” is split as “3',5'” and “cyclic IMP”. And, “AN2623.3” is not split.

“:” is treated as a separator when it is not between two numbers. For example, the name of “YLR060W:YFL022C” is split into “YLR060W” and “YFL022C”. However, in

“Hexadecanoate (n-C16:0)”, “(n-C16:0)” is not split further.

“,” and “_” are treated similarly to “:”. For example, “alpha,alpha-trehalose” is split as “alpha”, “alpha”, and “trehalose”, but “estrone-2,3-semiquinone” is split as “estrone”, “2,3”, and “semiquinone”. Likewise, “IDP_C10H12N4O11P2” is split into “IDP” and “C10H12N4O11P2”, but “3_5_Cyclic_GMP_C10H11N5O7P” is split into “3_5”, “Cyclic”, “GMP”, and “C10H11N5O7P”.

Additionally, some substrings are treated as separators because they act as separators semantically. Examples of such substrings include ‘_DASH_’ and ‘minus ’.

5.2.2.2.2 Token Types in Metabolite Names

We define six types of tokens in a metabolite name: prefix, suffix, parenthesis, number, single character and main token.

A prefix token is at the beginning of a string. A single character or a substring in parentheses is viewed as the prefix. If a parenthesis is nested in another parenthesis, then the first inner parenthesis is the prefix. A string may have no prefix. For example, “L” is the prefix of “L-Rhamnose”, “(9Z)” is the prefix of “(9Z)-Hexadecenoic acid”, and “(R)” is the prefix of “((R)-3-Hydroxybutanoyl)(n-2)”, whereas “5,10-Methenyltetrahydrofolate” has no prefix.

A suffix token is at the end of a string. A single character or a substring in parentheses, “(…)” or “[…]”, is considered as the suffix. If a parenthesis is nested in another parenthesis, then the last inner parenthesis is the suffix. Also, a formula at the end of the string is a suffix. A string may have no suffix. For example, “b” is the suffix of “M_o2_b”, “(Val)” is the suffix of “tRNA(Val)”, “(9Z)” is the suffix of “octadecenoate (18:1(9Z))”, and “C4H4N2O2” is the suffix of “M_Uracil_C4H4N2O2”, whereas “Acetyl-CoA” has no suffix.

There are some special characters which have different notations, for example, α as alpha, β as beta, etc. The notations are considered as prefix or suffix if they are at the beginning or at the end of the string. The notations include: alpha, beta, gamma, delta, epsilon, omega, trans, cis. “Phosphate” is also a suffix.

A name string has at most one prefix and one suffix. Except for prefix and suffix, substrings of the remaining name string can be further classified into parenthesis, number, single character and main token.

A substring is classified as a parenthesis token if it is inside a parenthesis in the name string. For example, “1”, “D”, and “ribityl” are parenthesis tokens of the name string “6,7-dimethyl-8-(1-D-ribityl)lumazine [cytoplasm]”.

A number token is a substring that contains only digits, with or without separators. For example, “2” and “1_2_4” are number tokens of “M_2_Hydroxybutane_1_2_4_tricarboxylate_C7H7O7”, and “6,7”, “8”, and “1” are number tokens of “6,7-dimethyl-8-(1-D-ribityl)lumazine [cytoplasm]”.

Single character token is a substring which has a single character. For example, “L” is a single character token of “M_L_Cysteine_”.

Main Tokens are the remaining substrings.

Prefix/suffix tokens may have separators in them, e.g., “(2S,3R)” is the prefix of “(2S,3R)-3-Hydroxybutane-1,2,3-tricarboxylate_C7H7O7”.

A single number character is not considered a prefix or a suffix. For example, “4” is not prefix of “4-OH-13-cis-retinal”, which has no prefix in this case.

5.2.2.2.3 Tokens with sequence information

Given a string, the prefix and the suffix of the string are located first. Then we split the remaining string into substrings by separators. The substrings are classified into single character, number, and parenthesis tokens, in that order; the remaining substrings are main tokens. We also record each substring’s original position so that it can be used during the pairing and matching process. For example, the string “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” is split as

Prefix: (“M”, 1)

Suffix: (“C5H8O4”, 6)

Number: (“4_5”, 2), (“2_3”, 4)

Main tokens: (“dihydroxy”, 3), (“pentanedione”, 5)

Single character, number and parenthesis tokens are all considered as modifiers of main tokens. By keeping sequence information, we track the modifiers related to each main token, including whether they precede or follow it.
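The splitting and classification steps above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in this work: the function names (split_name, classify) and the regular expressions are hypothetical, nested parentheses are not handled, a bracketed group is kept as one token rather than being split into its inner substrings, and the Greek-letter and “Phosphate” affix rules are omitted.

```python
import re

# Bracketed groups such as "(n-C16:0)" or "[cytoplasm]" stay whole (nesting is not handled).
BRACKETED = re.compile(r"(\([^()]*\)|\[[^\[\]]*\])")
# Space and "-" always separate; "_", ":" and "," separate unless they sit between two digits.
SEPARATOR = re.compile(r"(?<!\d)[_:,]|[_:,](?!\d)|[\s-]+")
NUMBER = re.compile(r"\d+(?:[_:,]\d+)*")
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def split_name(name):
    """Split a name into substrings, keeping bracketed groups intact."""
    tokens = []
    for segment in BRACKETED.split(name):
        if BRACKETED.fullmatch(segment):
            tokens.append(segment)
        else:
            tokens.extend(t for t in SEPARATOR.split(segment) if t)
    return tokens


def _is_affix(token):
    # A single non-digit character or a bracketed group may act as prefix/suffix.
    return (len(token) == 1 and not token.isdigit()) or bool(BRACKETED.fullmatch(token))


def _is_formula(token):
    # Assumption: a formula looks like element symbols with counts, e.g. "C5H8O4".
    return bool(FORMULA.fullmatch(token)) and any(c.isdigit() for c in token)


def classify(name):
    """Return {token type: [(token, 1-based position), ...]} for one name."""
    tokens = [(tok, pos) for pos, tok in enumerate(split_name(name), start=1)]
    kinds = ("prefix", "suffix", "number", "parenthesis", "single", "main")
    result = {kind: [] for kind in kinds}
    if len(tokens) > 1 and _is_affix(tokens[0][0]):
        result["prefix"].append(tokens.pop(0))
    if len(tokens) > 1 and (_is_affix(tokens[-1][0]) or _is_formula(tokens[-1][0])):
        result["suffix"].append(tokens.pop())
    for tok, pos in tokens:
        if NUMBER.fullmatch(tok):
            kind = "number"
        elif BRACKETED.fullmatch(tok):
            kind = "parenthesis"
        elif len(tok) == 1:
            kind = "single"
        else:
            kind = "main"
        result[kind].append((tok, pos))
    return result
```

Calling classify("M_4_5_dihydroxy_2_3_pentanedione_C5H8O4") yields the same prefix, suffix, number, and main tokens, with the same positions, as the split shown above.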

5.2.2.2.4 Pair Tokens and Match

First, two strings to be compared (i.e., metabolite names from the source and target models) are split into tokens. We then pair tokens in each according to their classifications, i.e., prefix/suffix is compared with another prefix/suffix substring only.

For single character, number and parenthesis substrings, we pair them according to their sequence related to the main tokens. For the main tokens, we pair them by different cases as follows.

1. Neither name has main tokens: we check and compare parenthesis tokens instead.

2. Only one name has main tokens: we pair the parenthesis tokens of the name that has no main token with the main tokens, as well as the parenthesis tokens, of the other name.

3. Both names have main tokens (in equal or different numbers): we pair main tokens according to the sequence as follows.

(1) Identical tokens are paired.

(2) Size/length-similar tokens are paired.

(3) Tokens that have the same core metabolites (listed in appendix 1) are paired.

For prefix and suffix, we do not separate them further though they may contain separators, as shown in Section 5.2.2.2.2.

We give an example.

Example 5.2. Assume the two metabolites are “4,5-dihydroxy-2,3-pentanedione” and “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4”.

“4,5-dihydroxy-2,3-pentanedione” is split as

• Number: (“4,5”, 1), (“2,3”, 3)

• Main tokens: (“dihydroxy”, 2), (“pentanedione”, 4)

“M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” is split as

• Prefix: (“M”, 1)

• Suffix: (“C5H8O4”, 6)

• Number: (“4_5”, 2), (“2_3”, 4)

• Main tokens: (“dihydroxy”, 3), (“pentanedione”, 5)

When main tokens (“dihydroxy”,2) and (“dihydroxy”, 3), (“pentanedione”,4) and

(“pentanedione”, 5) are paired, number (“4,5”, 1) and (“4_5”, 2), (“2,3”,3) and (“2_3”, 4) are paired accordingly. This is because number (“4,5”, 1) is related to (“dihydroxy”,2), and (“4_5”, 2) is related to (“dihydroxy”, 3).

Separators are ignored during the number matching. For example, “4,5” and “4_5” are considered as exact match in this example.

5.2.2.3 Similarity Score functions

Based on our token and q-gram based approximate string matching method, we revise the similarity score computation accordingly. The original similarity score

SimScore_Original(a,b) = 1 – D_{a,b}/(k+1),

where k>0 and D_{a,b}>0, is revised as

SimScore_{d,k}(a,b) = α*SimScore_Prefix(a,b) + β*SimScore_Main(a,b) + γ*SimScore_Suffix(a,b)


where α+ β+ γ =1, SimScore_Prefix(a,b) is the original similarity score of prefix pair,

SimScore_Suffix(a,b) is the original similarity score of suffix pair, and

SimScore_Main(a,b) is the average of main tokens’ original similarity scores. Since number and parenthesis are always related with main tokens, their scores are not calculated separately, but are considered in SimScore_Main(a,b).

Next we illustrate the similarity score computation with two metabolites.

Example 5.3. “4,5-dihydroxy-2,3-pentanedione” and

“M_4_5_dihydroxy_2_3_pentanedione_C5H8O4” have the following similarity score computations:

Let a = “4,5-dihydroxy-2,3-pentanedione” and b = “M_4_5_dihydroxy_2_3_pentanedione_C5H8O4”, with α=γ=0.05 and β=0.9.

SimScore_Prefix(a,b) = 0 and SimScore_Suffix(a,b) = 0, since a has neither a prefix nor a suffix;

SimScore_Main(a,b) = (SimScore(“dihydroxy”, “dihydroxy”) + SimScore(“pentanedione”, “pentanedione”))/2 = (1+1)/2 = 1, which is the average of the main tokens’ similarity scores;

then SimScore(a,b) = α*0 + β*1 + γ*0 = β = 0.9.

That is, in this example, the adjusted score depends on how the user weighs the main token match relative to the prefix and suffix matches.
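The weighted score of Example 5.3 can be reproduced with the short Python sketch below. It is an illustration only: the function names are hypothetical, D_{a,b} is assumed to be the Levenshtein edit distance, and two choices that the text does not specify are made explicit in the comments (a missing token on either side scores 0, and a distance above k scores 0).

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def sim_original(a, b, k=3):
    """1 - D/(k+1). Assumption: 0 if either token is missing or if D > k."""
    if not a or not b:
        return 0.0
    d = edit_distance(a, b)
    return 1.0 - d / (k + 1) if d <= k else 0.0


def sim_score(prefix_pair, main_pairs, suffix_pair,
              alpha=0.05, beta=0.9, gamma=0.05, k=3):
    """alpha*prefix + beta*(average over paired main tokens) + gamma*suffix."""
    main = 0.0
    if main_pairs:
        main = sum(sim_original(x, y, k) for x, y in main_pairs) / len(main_pairs)
    prefix = sim_original(prefix_pair[0], prefix_pair[1], k)
    suffix = sim_original(suffix_pair[0], suffix_pair[1], k)
    return alpha * prefix + beta * main + gamma * suffix


# Example 5.3: a has no prefix or suffix, and both main tokens match exactly.
score = sim_score(("", "M"),
                  [("dihydroxy", "dihydroxy"), ("pentanedione", "pentanedione")],
                  ("", "C5H8O4"))
```

With the weights of Example 5.3, score evaluates to β = 0.9 (up to floating-point rounding).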

5.2.3 Filtering Metabolite Match Candidates

Instead of processing all metabolites in TM, we locate candidates for approximate match calculation by pruning names unlikely to be in the result. Two types of filtering techniques are used in this chapter, namely, mandatory filters and optional filters.

Mandatory filters are based on the observations of Gravano et al. [79]: length filtering, count filtering, and the corresponding q-gram filtering. In addition, based on GSRMN metabolite names, we have two optional filters, the formula filter and the Biologically Significant Term filter, which apply when additional information is available for m in SM and mTM in TM. For example, when chemical formula specifications exist for m and mTM, or when they contain “biologically significant terms”, we can check them for a match and identify candidates for which to calculate an approximate similarity score. In some GSRMN data files, e.g., S. aureus iSB619 [28], the formula is listed under the “notes” tag of the SBML file.
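As an illustration of the mandatory filters, the sketch below implements length filtering and count filtering in Python. It is a sketch under stated assumptions rather than the code used in this work: the function names are hypothetical, q-grams are taken over the string padded with q-1 “#” characters in front and q-1 “$” characters at the end, and the count bound max(|a|,|b|) - 1 - (k-1)q is the one usually stated for padded q-grams. Position-based q-gram filtering is not shown.

```python
from collections import Counter


def qgrams(s, q=3):
    """Multiset of q-grams of s, padded with q-1 '#' before and q-1 '$' after."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def passes_length_filter(a, b, k=3):
    """Strings within edit distance k cannot differ in length by more than k."""
    return abs(len(a) - len(b)) <= k


def passes_count_filter(a, b, k=3, q=3):
    """Strings within edit distance k share at least max(|a|,|b|) - 1 - (k-1)*q q-grams."""
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= max(len(a), len(b)) - 1 - (k - 1) * q


def candidates(m, target_names, k=3, q=3):
    """Target names that survive both filters for the source name m."""
    return [t for t in target_names
            if passes_length_filter(m, t, k) and passes_count_filter(m, t, k, q)]
```

Both filters only prune: a name that passes them still has to be scored with the edit distance, but a name that fails either one cannot be within edit distance k of m.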

5.2.3.1 Formula Matching

A chemical formula is a way of expressing information about the proportions of atoms that constitute a particular chemical compound. Formula information can be used to further improve approximate string matching results, since identical metabolites have the same formula. However, a formula by itself is not enough to identify a metabolite, because different metabolites may have the same proportions of atoms but different structures, which chemical formulas do not show. As an example, “glucose” and “fructose” have the same formula “C6H12O6”, but differ structurally.
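A formula filter of this kind can be sketched in Python as below. The function names and the (name, formula) data layout are hypothetical, and only flat formulas without brackets or charges are parsed. Candidates with no formula are kept, reflecting that the filter is optional and applies only when both sides have a formula.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets or charges)."""
    counts = Counter()
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(n) if n else 1
    return counts


def formula_filter(source_formula, candidates):
    """Keep (name, formula) candidates whose formula matches or is unknown."""
    if not source_formula:
        return list(candidates)
    wanted = parse_formula(source_formula)
    return [(name, f) for name, f in candidates
            if not f or parse_formula(f) == wanted]


kept = formula_filter("C6H12O6", [("glucose", "C6H12O6"),
                                  ("fructose", "C6H12O6"),
                                  ("ATP", "C10H16N5O13P3"),
                                  ("unknown", None)])
```

In this usage example both “glucose” and “fructose” survive the filter, which illustrates why the formula can only narrow the candidate set and cannot identify a metabolite on its own.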

5.2.3.2 Biologically Significant Term Matching

Considering the characteristics of metabolite and reaction names during matching adds depth to the matching process. Next we introduce the concept of biologically significant terms (BSTs), which are checked before the approximate string matching process for both metabolites and reactions.


Treated as strings, the names of nucleotides, e.g., GTP, GDP, GMP, etc., are very similar, and direct approximate string matching techniques may yield high similarity scores for them. The same holds for some coenzymes, e.g., NAD, NADP, etc. However, these are distinct metabolites in biochemistry, with different characteristics and functionalities. They are also important in biochemistry for chemical energy transfer between different reactions, or for catalyzing chemical reactions. We call such metabolites biologically significant terms, or BSTs for short.

We identify three types of BSTs, called levels, for the purposes of filtering. These terms denote nucleotides and coenzymes. Level 1 contains terms that do not include any other BST as a substring, such as ATP, ADP, AMP, etc. Level 3 terms are not a substring of any other BST, such as dATP, dADP, dAMP, etc. A level 2 term, e.g., NADP, includes a level 1 term (ADP) as a substring, and is a substring of a level 3 term (NADPH).

From the level definitions, level i term, 1≤ i ≤2 is a substring of a level i+1 term. Level i+1 terms and level i+2 terms are called level i term’s upper level terms if level i term is included as substring. All levels of BST terms are collected from metabolite names and reaction names, and are maintained separately, as shown in Table 5.1.

We make use of BSTs as follows. Given a metabolite/reaction name m in SM, we want to find approximately matching names mTM in the metabolite or reaction sets in TM. We first locate candidates via formula match when both m and mTM have formulas; then, if m contains one or more biologically significant terms, we use BSTs to further filter the located candidates in the candidate set CSM.

Based on the type of the term encountered in m, we check whether each mTM in CSM has a BST matching that of m, and remove mTM from CSM if it does not. Since the terms we chose have specific biochemistry semantics, it is safe to assume that these terms must be present in a genome-scale network. We utilize the candidates in CSM to obtain more accurate matching, instead of only using approximate string matching.

Table 5.1 Biologically significant terms

Type of terms    Terms
Level 1          ATP, ADP, AMP, CTP, CDP, CMP, GTP, GDP, GMP, ITP, IDP, IMP, TTP, TDP, TMP, UTP, UDP, UMP, XMP, NAD, FAD
Level 2          NADP
Level 3          dATP, dADP, dAMP, cAMP, dCTP, dCDP, dCMP, cCMP, dGTP, dGDP, dGMP, cGMP, dITP, dIDP, dIMP, cIMP, dTTP, dTDP, dTMP, dUTP, dUDP, dUMP, cUMP, dNAD, XTP, XDP, hXMP, NADH, NADPH, FADH

Next we give the BST algorithm, shown in Figure 5.3. In summary, the idea is, before comparing a metabolite m to a metabolite in the set CSM, we filter out those metabolites of CSM that are distinct from m, and, yet, approximate matching would incorrectly identify them as close matches.

We give an example on a level 1 term.

Example 5.4. GSRMN iAB-AMØ-1410-Mt-661 [94] has the metabolite M_nad_m. The corresponding subset of metabolite names to be tested for approximate matching includes M_nad_m, M_nadp_m and M_nadph_m. Since M_nad_m has the level 1 term NAD, the BST algorithm removes M_nadp_m and M_nadph_m from the result because they contain the terms NADP and NADPH, which in turn contain the term NAD.
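The effect described in Example 5.4 can be illustrated with the Python sketch below. It is not a reproduction of the BST-Filter() function: the function names are hypothetical, and, as a simplifying assumption, BSTs are recognized as whole tokens of the name (split on non-alphanumeric characters, case-insensitively) instead of through the substring-based level relationships described above.

```python
import re

# Terms of Table 5.1, compared case-insensitively.
BST_TERMS = {t.lower() for t in (
    "ATP ADP AMP CTP CDP CMP GTP GDP GMP ITP IDP IMP TTP TDP TMP "
    "UTP UDP UMP XMP NAD FAD NADP "
    "dATP dADP dAMP cAMP dCTP dCDP dCMP cCMP dGTP dGDP dGMP cGMP "
    "dITP dIDP dIMP cIMP dTTP dTDP dTMP dUTP dUDP dUMP cUMP dNAD "
    "XTP XDP hXMP NADH NADPH FADH"
).split()}


def bst_terms_in(name):
    """Set of BSTs that occur as whole tokens of the name."""
    tokens = re.split(r"[^A-Za-z0-9]+", name.lower())
    return {t for t in tokens if t in BST_TERMS}


def bst_filter(m, candidates):
    """Drop candidates whose BSTs differ from those of m (no-op if m has none)."""
    wanted = bst_terms_in(m)
    if not wanted:
        return list(candidates)
    return [c for c in candidates if bst_terms_in(c) == wanted]


remaining = bst_filter("M_nad_m", ["M_nad_m", "M_nadp_m", "M_nadph_m"])
```

For the names of Example 5.4, only M_nad_m remains; M_nadp_m and M_nadph_m are removed before any approximate matching takes place.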


Figure 5.3 BST-Filter() function

5.3 Reaction Identification

For locating identical reactions in SM and TM, two methods are used in the literature. In the first, only compounds are compared, and two reactions rTM and r are considered to be the same if all substrates, products and modifiers of rTM and r match one-to-one with each other, e.g., see Stobbe et al. [3]. In the second, all compounds and catalyzing enzymes are compared, e.g., see Radrich et al. [95]. Since enzyme names and reaction names are usually the same (and enzyme information is specified in only a few GSRMNs, e.g., M. tuberculosis iNJ661 [28], but is not available in most GSRMNs), we compare reaction names and compound names here. For reaction identification, we employ a three-step identification process, namely, (i) reaction name match, (ii) reaction property matches, and (iii) reaction compound matches. The reaction name match is used to filter possible reaction candidates via approximate string match before further processing, i.e., compound pairing and matching, which is more time consuming.


Reaction property match includes two property comparisons, namely, (i) reversible reaction property match and (ii) transport reaction property match, which are both optional since not all reactions have these properties available. Figure 5.4 presents the reaction identification algorithm. SimScore_{d,k}(a, b) is the similarity score function between two strings a and b.

5.3.1 Reaction Name Matching

Reaction name matching is first used to get candidates. Similar to metabolite name matching, we first locate reactions rTM in TM which have names identical to the reaction r of SM. If GSRMN is included in the MetRxn data set, reactions that match via synonyms are also located by using r’s synonyms in MetRxn.

If no exact reaction match was found, tokenized q-gram based approximate string matching is performed to find reactions rTM in GSRMN. As in metabolite name matching, reaction name approximate matching is executed by splitting and classifying reaction name string into prefix, suffix, main tokens, and the corresponding modifier tokens.

Classified tokens are compared similarly, and the similarity score SimScore_{d,k}(a,b) between the two reaction names is calculated. However, there are several important differences between a reaction name match and a metabolite name match: (a) some special substrings in reaction names, e.g., “transport”, “exchange”, “reversible”, and “irreversible”, only represent reactions’ properties. These substrings are excluded from the reaction name string and are not compared in this step. Instead, they are used to set a reaction’s property in the property match step; (b) multiple reactions with the same name that are located via approximate name matching are all considered as possible candidates for future steps, since they may have different reactants, products or modifiers.

Figure 5.4 Reaction Identification Algorithm Sketch

5.3.2 Reaction Property Matching

Two major properties of reactions, namely, reversibility property and the transport property, are compared in this step.

A reaction in the source/target model is considered reversible when its reversibility property is true in the source/target database (which is consistent with the original model data), or when its name string explicitly contains “reversible” but not “irreversible”. For example, the reaction “D alanine D alanine ligase reversible” in the model “Salmonella_consensus_build_1” is considered a reversible reaction although its reversibility property is false in the GSRMN database.

There is no transport property in the GSRMN database; so, we consider a reaction to be a transport reaction only when its name string contains the words “transport”, “exchange”, “transporter”, or “transferase”.
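The two name-based property rules can be sketched in Python as follows. The function names and the db_flag parameter are hypothetical, and the sketch assumes that the keywords must occur as whole words of the name; whether substrings such as “acetyltransferase” should also count is not stated in the text.

```python
import re

TRANSPORT_WORDS = {"transport", "exchange", "transporter", "transferase"}
PROPERTY_WORDS = {"transport", "exchange", "reversible", "irreversible"}


def _words(name):
    """Lower-case alphabetic words of a reaction name."""
    return re.findall(r"[a-z]+", name.lower())


def is_reversible(name, db_flag=False):
    """True if the database flag is set, or the name says 'reversible' but not 'irreversible'."""
    words = _words(name)
    return bool(db_flag) or ("reversible" in words and "irreversible" not in words)


def is_transport(name):
    """True if the name contains one of the transport keywords as a word."""
    return not TRANSPORT_WORDS.isdisjoint(_words(name))


def strip_property_words(name):
    """Remove property-only words before the reaction name match (Section 5.3.1)."""
    return " ".join(w for w in name.split() if w.lower() not in PROPERTY_WORDS)
```

For instance, is_reversible("D alanine D alanine ligase reversible", db_flag=False) is true, which matches the Salmonella example above, and strip_property_words leaves “D alanine D alanine ligase” for the name comparison.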

Whether property matches are applied is left to the user, because some comparisons in the literature do not consider the direction of a reaction, or count reactions as being the same if they differ only in their directions, for example, the work of Stobbe et al. [3]. The reason is that a reaction’s direction or reversibility varies across GSRMNs due to different data sources and/or vague or missing data [3][7][95][96].

Even in the ideal case where the knowledge about an organism is complete, there still remain some ambiguous decisions in the reconstruction process resulting from the core approximations of constraint-based modeling [96], which confine the fundamentally analog nature of biology to digital categorizations (e.g., a continuum of enzyme thermodynamics is categorized into “reversible” and “non-reversible”; and a continuum of substrate affinities is converted into ‘yes’ or ‘no’ decisions on which metabolites can be acted on by an enzyme, etc. [96]).

When two reactions’ property matches are required during the identification process, reactions’ reversibility property and transport property are obtained and compared accordingly.

5.3.3 Reaction Compound Matching

In this step, compounds of each reaction candidate are compared with the source reaction’s compounds and compounds similarity scores are computed.

First, the two reactions’ compounds are matched by roles. A reaction’s compounds have three roles, namely, “Reactant”, “Product” and “Modifier”, and the two reactions’ reactants, products and modifiers are matched separately. If either reaction is reversible, we check the compound counts of each role to see whether they match. If they do not, the reversible reaction’s reactants and products are switched and matched against the other reaction’s reactants and products to get a better match. E.g., the source reaction “fructose-bisphosphate aldolase” in GSRMN “A. baylyi” has two reactants, “glyceraldehyde-3-phosphate” and “dihydroxy-acetone-phosphate”, and one product, “fructose-1,6-bisphosphate”. The located candidate reaction “R_fructose_bisphosphate_aldolase” in GSRMN “E. coli iAF1260” has one reactant, “M_D_Fructose_1_6_bisphosphate_C6H10O12P2”, and two products, “M_Glyceraldehyde_3_phosphate_C3H5O6P” and “M_Dihydroxyacetone_phosphate_C3H5O6P”. But, since both reactions are reversible, we match the source reaction’s product “fructose-1,6-bisphosphate” with the candidate reaction’s reactant “M_D_Fructose_1_6_bisphosphate_C6H10O12P2”, and the source reaction’s two reactants with the candidate reaction’s two products.
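The side switching used in the aldolase example can be sketched in Python as below. The dictionary layout and the function name are hypothetical, and the rule is an interpretation of the text: the candidate’s sides are swapped only when at least one reaction is reversible, the per-role counts disagree as given, and they agree after the swap.

```python
def align_sides(source, candidate):
    """Return the candidate's (reactants, products), swapped if that fits the source better."""
    src = (len(source["reactants"]), len(source["products"]))
    cand = (len(candidate["reactants"]), len(candidate["products"]))
    may_swap = source["reversible"] or candidate["reversible"]
    if may_swap and src != cand and src == cand[::-1]:
        return candidate["products"], candidate["reactants"]
    return candidate["reactants"], candidate["products"]


source = {
    "reversible": True,
    "reactants": ["glyceraldehyde-3-phosphate", "dihydroxy-acetone-phosphate"],
    "products": ["fructose-1,6-bisphosphate"],
}
candidate = {
    "reversible": True,
    "reactants": ["M_D_Fructose_1_6_bisphosphate_C6H10O12P2"],
    "products": ["M_Glyceraldehyde_3_phosphate_C3H5O6P",
                 "M_Dihydroxyacetone_phosphate_C3H5O6P"],
}
reactant_side, product_side = align_sides(source, candidate)
```

After the call, reactant_side holds the candidate’s two triose phosphates and product_side holds its fructose bisphosphate, so each side can be paired with the corresponding side of the source reaction.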

Then, the compounds are paired by exact name match, name length match and core metabolite match. Names with equal length or the closest-length compounds are paired.

Core metabolites are also used to pair compounds when their name length is not enough to decide on pairs, i.e., when more than one compound of the target reaction can be paired with a compound of source reaction.

We also consider metabolite’s input and output roles for pairing. That is, a reaction’s input metabolites (i.e., substrates and modifiers) can be compared with the other reaction’s input metabolites only. Also, output metabolites (i.e., products) are compared accordingly.

Some metabolites are ignored during the match since reactions are not always balanced, especially with respect to electrons (e⁻), protons (H⁺), and water (H2O) [3][95]. Also, some data sources inadvertently include reactions that are not consistent in their stoichiometry [4][97].

After all compounds are paired, each pair’s similarity score is calculated via the SimScore_{d,k}(a,b) function.

5.3.4 Reaction Similarity Score

In Section 5.1.2, we summarized two reaction scores from the literature. They differ from the GSRMN reaction similarity in that the Tanimoto coefficient of similarity considers the transformation from substrates to products in reactions, and SimR measures all pairwise entity similarities between reactions in two pathways. In the GSRMN reaction similarity score, we consider all factors that are involved in the matching process, i.e., reaction name, reaction properties (i.e., reversibility property and transport property), and the names of substrates, regulators/modifiers, and products. Enzyme information is not available in most GSRMNs, so it is not included in the GSRMN reaction similarity score.

In this chapter, we compute the similarity of reactions r and rTM by the function SimReaction(rTM, r), as described below.

SimReaction(rTM, r) = f1*SimReactionName(rTM, r) + f2*SimReactionProperty(rTM, r) + (1-f1-f2)*SimReactionCompounds(rTM, r), where

• SimReactionName(rTM, r) is the two reaction names’ similarity score;

• SimReactionProperty(rTM, r) is the two reactions’ reversibility-property and transport-property score, which is 1 when both properties are the same in the two reactions; 0.5 when one of the properties is the same in both reactions but the other is not; and 0 when neither property is the same in both reactions;

• SimReactionCompounds(rTM, r) = (TOTAL_{all substrate pairs}(SimScore_{d,k}(s_rTM, s_r))/|number of substrates in r| + TOTAL_{all product pairs}(SimScore_{d,k}(p_rTM, p_r))/|number of products in r| + TOTAL_{all regulator pairs}(SimScore_{d,k}(reg_rTM, reg_r))/|number of regulators in r|)/3;

• f1 and f2 are system-adjusted weight factors for reaction name match and reaction property match, respectively.
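The combination of the three component scores can be sketched in Python as follows. The function names are hypothetical, and the weights f1 = 0.2 and f2 = 0.1 in the usage example are illustrative values only, not values taken from this work. One choice that the formula above does not specify is made explicit: a role with no compounds in r is skipped and the average is taken over the remaining roles, which coincides with the division by 3 whenever all three roles are present.

```python
def property_score(rev_a, rev_b, transport_a, transport_b):
    """1 if both properties agree, 0.5 if exactly one agrees, 0 otherwise."""
    return ((rev_a == rev_b) + (transport_a == transport_b)) / 2


def compounds_score(roles):
    """roles maps a role name to (pair similarity scores, number of compounds of that role in r).

    Assumption: roles with no compounds in r are skipped to avoid dividing by zero.
    """
    per_role = [sum(scores) / count for scores, count in roles.values() if count > 0]
    return sum(per_role) / len(per_role) if per_role else 0.0


def sim_reaction(name_score, prop_score, comp_score, f1, f2):
    """f1*name + f2*property + (1 - f1 - f2)*compounds."""
    return f1 * name_score + f2 * prop_score + (1 - f1 - f2) * comp_score


comp = compounds_score({
    "substrates": ([1.0, 0.8], 2),
    "products": ([1.0], 1),
    "regulators": ([], 0),
})
total = sim_reaction(1.0, property_score(True, True, False, True), comp, f1=0.2, f2=0.1)
```

Here the property score is 0.5 (the reversibility properties agree, the transport properties do not), and the compound score averages the substrate and product roles only.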

5.4 Experimental Evaluation

In this section, we present the experimental evaluation in terms of precision and recall analysis [98][99][100] [101] on basic bio entities.

Given a similarity score threshold s (0≤s≤1), a source model SM and a target model TM, in experiments of this section, we locate from selected source-target model pairs all matching source-target metabolites/reactions with a similarity score not less than s.

Using metabolite identification as an example, we illustrate how precision and recall are calculated. For each source metabolite m, multiple target metabolites mTMx (x=1,…,n) with similarity scores greater than or equal to s are identified. Then, for precision and recall analysis, we manually locate m’s “real matches”, mR-TMx (x=1,…,n′), in the target model. More specifically, for a given m, we manually check and locate target metabolites, mR-TMx, which have either the same name or (based on the underlying biochemistry, model-related documents and the literature) the same/similar biological function as m. We refer to (m, mR-TMx) as the “real source-target metabolite pair”, or simply, the real pair.

To analyze the experimental results, we identify each source-target metabolite pair match as being a true positive, a false positive, or a false negative. Note that true positive pairs are in the set {mTMx | x=1,…,n} ∩ {mR-TMx | x=1,…,n′}; false positive pairs are in the set {mTMx | x=1,…,n} – {mR-TMx | x=1,…,n′}; and false negative pairs are in the set {mR-TMx | x=1,…,n′} – {mTMx | x=1,…,n}.

Precision is defined as the proportion of correct matches amongst the metabolite names of retrieved pairs, (m, mTMx). Let CTP, CFP denote true positive and false positive counts.

Then,

Precision = CTP / (CTP + CFP)

Recall is defined as the proportion of correct matches amongst the metabolite names of real pairs, (m, mR-TMx). Let CTP, CFN denote true positive and false negative counts. Then,

Recall = CTP / (CTP + CFN)
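The set definitions and the two ratios above translate directly into the following Python sketch. The function name and the metabolite identifiers in the usage example are hypothetical; returning 0 when a denominator is empty is a convention chosen here, not one stated in the text.

```python
def precision_recall(retrieved, real):
    """Counts and ratios for one source metabolite.

    retrieved: target metabolites with similarity score >= s
    real:      manually identified real matches in the target model
    """
    retrieved, real = set(retrieved), set(real)
    tp = len(retrieved & real)   # true positives
    fp = len(retrieved - real)   # false positives
    fn = len(real - retrieved)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall


tp, fp, fn, precision, recall = precision_recall(
    retrieved={"M_adp_c", "M_adp_m", "M_atp_c"},
    real={"M_adp_c", "M_adp_m", "M_adp_e"},
)
```

In this usage example two of the three retrieved names are real matches and one real match is missed, so the true positive, false positive, and false negative counts are 2, 1, and 1, and precision and recall are both 2/3.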

5.4.1 Metabolite Identification Results

5.4.1.1 GSRMN Pairs and Statistics

We have chosen three source-target model pairs (all from [28]), namely, <H. pylori iIT341, EryNet>, <Model2008_09_23_13_13_29, Model2008_08_15_12_13_14>, and <03_16_09_TM_minimal_medium_glc, M. barkeri iAF692>, for metabolite and reaction matching analysis:

a) The model pairs are selected to have non-empty intersections of metabolites (so that they are related).

b) The model pairs are from different organisms (so that they have differing metabolites).

c) The model pairs’ sizes are typical of GSRMNs (and we manually check the results and evaluate precision and recall).

Next, for the selected model pairs, we list three different case statistics for (source, target) metabolite pairs.

• Case 1: “Target metabolite (reaction) identical to the source metabolite (reaction) exists in the target model of the Recon Models database”. That is, the two names are identified as being equal via a SQL Server string equality comparison.

• Case 2: “Target and source metabolites (reactions) are not identical; however, their similarity score is 1”.

For a source metabolite where there is no target metabolite with a similarity score greater than or equal to s, for further analysis, we locate those metabolites with the highest similarity score, and produce the relevant statistics. Below, we identify this case as:

• Case 3: “The given source metabolite (reaction) does not have a matching target metabolite (reaction) with a score greater than or equal to s”.

We list one statistic for (source, target) reaction pairs.

• Case r: The given source reaction has matching target reaction(s) with a score greater than or equal to s.

For all three source-target model pairs, we use edit distance threshold k=3, q-gram size q=3, and similarity score threshold s =0.9. The three source-target model pairs and their statistics are:

1) Source Model: H. pylori iIT341


(412 distinct metabolites; 554 reactions)

Target Model: EryNet (482 distinct metabolites; 438 reactions)

Case 1, 2, 3 counts for metabolites: 129, 14, 248

2) Source Model: Model2008_09_23_13_13_29

(175 distinct metabolites; 167 reactions)

Target Model: Model2008_08_15_12_13_14

(583 distinct metabolites; 581 reactions)

Case 1, 2, 3 counts for metabolites: 69, 0, 19

Case r count for reactions: 67

3) Source Model: 03_16_09_TM_minimal_medium_glc

(647 distinct metabolites; 645 reactions)

Target Model: M. barkeri iAF692

(698 distinct metabolites; 690 reactions)

Case 1, 2, 3 counts for metabolites: 247, 60, 227

Case r count for reactions: 332

5.4.1.2 Equal Functionality Criteria

In this analysis, to focus on nonexact matches, we do not include the (source, target) metabolite pairs for Case 1 (i.e., “a target metabolite that is identical to the source metabolite exists in the target model of the Recon Models database”). For example, in experiment (1), 44 source-target metabolite pairs are excluded from the precision and recall analysis.

Also, when the source (or the target) model has two metabolites with the same name, we only use one of them for matching purposes.


In terms of “equal functionality” of metabolites, we use four criteria as illustrated below with examples.

a. Metabolites in the real pair differ only by their compartments, e.g., M_adp_m (i.e., the metabolite is in the compartment mitochondria) and M_adp_c (i.e., the metabolite is in the compartment cytosol).

b. Metabolites in the real pair differ at the chemical formula level, but have the same function as identified by data sources such as KEGG, e.g., M_Glycyl_tRNA_Gly__C2H4NOR and M_Glycyl_tRNA_Gly__C2H5NO2X have the same KEGG id C02412.

c. Metabolites in the real pair differ only in the location of the phosphate, e.g., D-Glucose 6-phosphate and D-Glucose 1-phosphate. However, metabolites with and without phosphate, or metabolites with different numbers of phosphates, are not real pairs, e.g., D-Glucose and D-Glucose 6-phosphate; or D-Fructose 1,6-bisphosphate and D-Fructose 6-phosphate.

d. Metabolites in the real pair are mirror images, or enantiomers, e.g., L-Lactate and D-Lactate. Enantiomers, or optical isomers, have identical chemical and physical properties except that they have opposite orientations, like one’s left and right hands.

We have chosen these “equal functionality” criteria carefully to make sure that they are consistent with the literature and data. And, we use these criteria in all experiments to observe how the parameters influence precision and recall.

5.4.1.3 Results and Observations

In Figures 5.5-5.7 we vary the similarity score threshold (i.e., the x dimension) from 0.5 to 0.9 to show changes in the counts of true positives, false positives, and false negatives, as well as precision and recall, for the three pairs of GSRMNs. The values of α, β and γ in SimScore(a,b) are set to 0.05, 0.9 and 0.05, respectively.

Figure 5.5 Model-to-model metabolite matching results for <H. pylori iIT341, EryNet>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall, each plotted against the similarity score threshold (0.5 to 0.9)

Figure 5.6 Model-to-model metabolite matching results for <Model2008_09_23_13_13_29, Model2008_08_15_12_13_14>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall, each plotted against the similarity score threshold (0.5 to 0.9)

We have the following observations.

Observation 1. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease of true-positive counts by 3.316%.


Observation 2. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease of the false-positive counts by 97.639%.

False-positive counts drop significantly as the similarity score threshold increases, while true-positive counts may also drop slightly. This improvement shows that, with a proper similarity score threshold, most, if not all, true positive results can be obtained, with at most a small percentage of false positive results.

Figure 5.7 Model-to-model metabolite matching results for <03_16_09_TM_minimal_medium_glc, M. barkeri iAF692>: (a) True Positive, False Positive, and False Negative counts; (b) Precision and Recall, each plotted against the similarity score threshold (0.5 to 0.9)

Observation 3. Increasing the similarity score threshold may result in an increase in false-negative counts.

When the similarity score threshold is too high, a very small number of matching pairs may be eliminated from the results, which leads to a slight increase in false negative counts.

Observation 4. Increasing the similarity score threshold results in a precision increase of 415.91% on average (and of at least 115.6%), and in a recall decrease of 2.335% on average (and of at least 0%).

Since false-positive counts drop when the similarity score threshold increases, precision increases significantly while recall decreases much less.

As the similarity score threshold increases, increases in the false-negative count and decreases in the true-positive count lead to a decrease in recall.
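The way the three counts determine precision and recall can be sketched as follows. This is a minimal illustration based on the standard definitions; the function name and the sample counts are hypothetical and are not taken from the experiments.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from true-positive,
    false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts at a low and at a high similarity score threshold.
low = precision_recall(tp=100, fp=300, fn=0)
high = precision_recall(tp=97, fp=7, fn=3)
print(low, high)
```

With these made-up counts, removing most false positives raises precision sharply, while the few lost true positives lower recall only slightly, which is the pattern described in Observation 4.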

Observation 5. The way of splitting tokens may occasionally result in additional false negatives, though in only small numbers.

We allow at most one prefix and one suffix while obtaining the tokens. This works in most (99.37%) of the cases in the experiments, but for some special cases it may lead to a low similarity score. We give two examples.

For “alpha-D-Ribose 5-phosphate” in the source model “H. pylori iIT341”, “D-Ribose 5-phosphate” in the target model “EryNet” is not identified as a match. The reason is that “alpha-D-Ribose 5-phosphate” actually has two prefixes, “alpha” and “D”, and the “one prefix” rule leads to unmatched body tokens.

Observation 6. A lower edit distance threshold may occasionally result in additional false negatives, though only in small numbers.

We give an example. For “_3-Phospho-D-glycerate” in the source model “iAF692 network flux distributions for BOF optimization on methanol”, “3-Phospho-D-Glycerate (E)” in the target model “Natronomonas pharaonis metabolic network” is not identified as a match when the edit distance threshold is 3. The reason is that the two names actually have an edit distance of 4, which leads to the name “3-Phospho-D-Glycerate (E)” being eliminated by the edit distance filtering condition.
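The effect of the edit distance filter on this example can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, and the normalization step (lowercasing and dropping a leading underscore) is an assumption, under which the two names are at distance 4.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(name):
    # Assumed preprocessing (not specified in the text):
    # lowercase the name and drop a leading underscore.
    return name.lower().lstrip("_")


def passes_filter(source, target, k):
    """True when the normalized names are within edit distance k."""
    return edit_distance(normalize(source), normalize(target)) <= k


src = "_3-Phospho-D-glycerate"
tgt = "3-Phospho-D-Glycerate (E)"
print(passes_filter(src, tgt, 3), passes_filter(src, tgt, 4))
```

Under this assumed normalization the pair is rejected at threshold 3 and accepted at threshold 4, which mirrors the false negative described above.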

5.4.2 Reaction Identification Results

5.4.2.1 GSRMN Pairs and Statistics

In the following experiments, we use two of the source-target model pairs used in the metabolite identification experiment. We do not use the pair involving the model “EryNet” because reaction names in “EryNet” are not biochemistry-based names but user-defined, number-based names, which is not consistent with the other models in the experiment. Also, we add one new model pair to the experiments, namely, <iAM303, E. coli textbook>. “E. coli textbook” is the core Escherichia coli metabolic model, which serves as an educational guide to the reconstruction and use of microbial metabolic networks [28].

For all three pairs (pair 2 - pair 4), we use edit distance threshold k=9, q=3 for q-gram, and similarity score threshold s =0.5. The source-target model pair statistics for the new model pair are:

4) Source Model: iAM303 (279 reactions) Target Model: E. coli textbook (95 reactions)

Case rTM count: 77


5.4.2.2 Equal Functionality Criteria

In the experiments below, for each source reaction r, multiple target reactions rTMx (x=1,…,n) with similarity scores greater than or equal to s are identified. For a given r, we manually check and locate the target reactions, rR-TMx, that have identical compounds with corresponding roles, unless one of the reactions is reversible. We refer to (r, rR-TMx) as the “real source-target reaction pair”, or the “real reaction pair”. For compound identity, we use the same criterion as in the metabolite similarity experiments; that is, metabolites that differ only by the compartments in which they reside are considered identical. For example, reaction “aspartatetransaminase” in model “Model2008_09” and reaction “aspartatetransaminase” in model “Model2008_08” are considered identical because their compounds differ only by compartment: the first reaction’s compounds “M_akg_m”, “M_glu_DASH_L_m”, “M_oaa_m”, and “M_asp_DASH_L_m” are all in “Mitochondria” (thus the suffix m), whereas the second reaction’s compounds “M_akg_c”, “M_glu_DASH_L_c”, “M_oaa_c”, and “M_asp_DASH_L_c” are all in the compartment “Cytosol”.
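The compartment-insensitive comparison of compounds can be sketched as follows. This is a minimal illustration that assumes compartments are encoded as a trailing single-letter suffix (such as "_m" or "_c"), as in the identifiers quoted above; the function names are hypothetical, and the sketch compares compound sets only, without checking substrate/product roles or reversibility.

```python
def strip_compartment(compound_id):
    """Drop a trailing single-letter compartment suffix such as '_m' or '_c'.

    Assumes every identifier carries such a suffix; an identifier without
    one but ending in '_<letter>' would be shortened incorrectly.
    """
    base, sep, suffix = compound_id.rpartition("_")
    if sep and len(suffix) == 1 and suffix.isalpha():
        return base
    return compound_id


def same_compounds(compounds_a, compounds_b):
    """True when the two compound lists are equal after removing compartments."""
    return ({strip_compartment(c) for c in compounds_a}
            == {strip_compartment(c) for c in compounds_b})


mito = ["M_akg_m", "M_glu_DASH_L_m", "M_oaa_m", "M_asp_DASH_L_m"]
cyto = ["M_akg_c", "M_glu_DASH_L_c", "M_oaa_c", "M_asp_DASH_L_c"]
print(same_compounds(mito, cyto))
```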

There are also a number of generic reaction names, such as “. escape flux”, “. source flux”, and “BiomassRxn”, which are used to represent different reactions. Compound similarities and differences of these reactions are captured via their similarity scores. However, such reactions typically serve different purposes (such as FBA analysis), and thus we exclude reactions with these three names from the experimental results.

5.4.2.3 Results and Observations

In Figures 5.8-5.10, we vary the similarity score threshold (i.e., the x dimension) from 0.5 to 0.9 to show changes in the counts of true positives, false positives, and false negatives, as well as in precision and recall. The values of f1 and f2 in SimReaction(rTM, r) are set to 0.5 and 0, respectively.

[Figure 5.8, Source: iAM303, Target: E. Coli Textbook. Panels: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall without a rule-based pre-match filter; (c) Precision and Recall after a rule-based pre-match filter. Horizontal axis: Similarity Score Threshold (0.5 to 0.9).]

Figure 5.8 Model-to-model reaction matching results for <iAM303, E. Coli Textbook>

[Figure 5.9, Source: Model2008_09, Target: Model2008_08. Panels: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall. Horizontal axis: Similarity Score Threshold (0.5 to 0.9).]

Figure 5.9 Model-to-model reaction matching results for <Model2008_09, Model2008_08>

[Figure 5.10, Source: 03_16_09_TM_minimal_medium_glc, Target: M. barkeri iAF692. Panels: (a) True Positive, False Positive and False Negative counts; (b) Precision and Recall. Horizontal axis: Similarity Score Threshold (0.5 to 0.9).]

Figure 5.10 Model-to-model reaction matching results for <03_16_09_TM_minimal_medium_glc, M. barkeri iAF692>

Observation 7. Inconsistent modeling decisions on the part of the modelers can be identified via either rule-based pre-match filters or a closer examination of the results.

Figure 5.8 shows that, as compared with pair 2 and pair 3, for pair 4, recall drops significantly at similarity score threshold 0.8. From Figure 5.8(a), we can see that the reason is the true-positive count, which drops by thirteen at threshold 0.8. On closer examination, nine of these thirteen real reaction pairs are exchange reactions, e.g., O2 exchange or Phosphate exchange. All exchange reactions in the source model “iAM303” have only one compound, e.g., O2 for O2 exchange. However, exchange reactions in the target model “E. Coli textbook” have two instances of the same compound, one as a substrate and one as a product. Such inconsistent reaction modeling decisions on the part of the modelers distort the results of our matching algorithms; clearly, they can be identified by additional pre-match-time rule-based filtering steps. The recall at similarity score threshold 0.8 improves from 0.754 to 0.896 after the filtering steps, as shown in Figure 5.8(b) and Figure 5.8(c).
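One possible form of such a pre-match rule can be sketched as follows. The actual filtering rules are not listed in the text, so this is only an assumed example: the dictionary layout of a reaction and the function name are hypothetical, and the rule simply rewrites an exchange reaction written as "X to X" into the single-compound form.

```python
def normalize_exchange(reaction):
    """Collapse an exchange reaction whose single substrate equals its
    single product into the one-compound representation."""
    subs, prods = reaction["substrates"], reaction["products"]
    if len(subs) == 1 and subs == prods:
        return {"name": reaction["name"],
                "substrates": list(subs),
                "products": []}
    # Any other reaction is passed through unchanged.
    return reaction


two_sided = {"name": "O2 exchange", "substrates": ["O2"], "products": ["O2"]}
print(normalize_exchange(two_sided))
```

Applying the same normalization to both models before scoring would make the two styles of exchange reaction directly comparable.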

Observation 8. Identical reactions with differing reaction names and identical compounds can be identified via high SimReactionCompounds() scores.

Reaction “malicenzyme(NADP)” in model “Model2008_09” and reaction “malicenzyme(NAD)” in model “Model2008_08” look like different reactions, since NADP and NAD differ by a phosphate. However, their substrates and products are identical, which means that they are the same reaction with different names; this is captured by a similarity score of 0.99.

Observation 9. Increasing the similarity score threshold improves precision but reduces recall. A change in the similarity score threshold from 0.5 to 0.9 results in a precision increase of 107.99% on average (and of at least 57%), and in a recall decrease of 9.44% on average (and of at least 1.3%).

Similar to the metabolite similarity experiments, both true-positive and false-positive counts drop when the similarity score threshold increases. Since false-positive counts drop faster, precision improves significantly while recall declines slightly.

Observation 10. Increasing the similarity score threshold from 0.5 to 0.9 results in an average decrease in false-positive counts by 91.50%.

5.5 Conclusion

In this chapter, we have proposed a number of metabolite/reaction identification techniques for Genome-Scale Reconstructed Metabolic Networks (GSRMN) (by matching metabolites/reactions to corresponding metabolites/reactions of a source model or data source). We employ a variety of computer science techniques that include approximate string matching, similarity score functions and filtering techniques, all enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are made using the proposed metabolite/reaction identification techniques.


Conclusions and Future Work

In this thesis, we have studied metabolic network computational analysis, presenting metabolomics data as well as its analysis results via visualization tools, and text mining among Genome-Scale Reconstructed Metabolic Networks.

We have developed the SMDA (steady-state metabolic network dynamics analysis) technique, built the system, and evaluated its computational performance limits using a mammalian metabolic network database. SMDA takes a set of measurements and a metabolic network as input, identifies the metabolic mechanisms that lead to changes in the concentrations of given metabolites, and interprets the metabolic consequences of the observed changes in terms of physiological problems, nutritional deficiencies, or diseases. The SMDA problems addressed and their solutions are new and specific to the SMDA approach. This work of evaluating the activation/inactivation scenarios of the metabolic network at steady state is related to metabolic network analysis techniques such as metabolic control analysis (MCA) [10], flux balance analysis (FBA) [11], metabolic flux analysis [12] and, finally, metabolic pathway analysis (MPA) (elementary flux modes (EFM) and extreme pathways (EP)) [13]. Comparisons between SMDA and these related techniques are discussed.

We have performed a usefulness study of SMDA for the problem of gene lethality testing [10]. The SMDA algorithm is revised accordingly. We also examined this approach using the reconstructed network model of Trypanosoma cruzi. We take the model network as one input to SMDA; metabolite pool observations derived from the extracellular metabolites’ availability, according to the paper supplement [16], form the other input. In this examination, all seven lethal genes are verified with SMDA, and one non-lethal gene selected from the paper is also verified. Thus, we confirm that SMDA can be used for gene lethality testing purposes. Compared with other computational techniques such as FBA, SMDA produces results consistent with the underlying biochemistry.

PathCase visualization tools present metabolic data, relationships in the data, and analysis results of the data via a Java applet. These tools are components of many PathCase Systems, i.e., PathCase-SB, PathCase-MAW, PathCase-RCMN, PathCase-Recon, PathCase-SMDA, and the Metabolism Query Language Interface.

Also, we generalize the visualization framework shared by all PathCase visualization tools and introduce the distinct features of the visualization tools in each PathCase system. Parts of the framework are revised for three iPad applications: iPathCaseMAW, iPathCaseRCMN and iPathCaseKEGG.

In the basic-level bio-entities research, we propose metabolite/reaction identification techniques for GSRMNs. A variety of computer science techniques that include approximate string matching, similarity score functions and filtering techniques are employed. All techniques are enhanced by the underlying metabolic biochemistry-based knowledge. The proposed metabolite/reaction identification techniques are evaluated by an empirical study on four pairs of GSRMNs. Our results indicate that significant accuracy gains are possible using the proposed metabolite/reaction identification techniques.

6.1 Future work


6.1.1 SMDA

For the SMDA research, exploratory data mining and analysis capabilities for the SMDA query output search space can be integrated. The output space can be queried to filter specific scenarios, such as results with a malfunctioning Urea Cycle. Backward reasoning can also be implemented to locate specific observations. The input of SMDA can be improved by allowing the user to preset the status of metabolites and/or reactions. Finally, the user-selected network can be examined with biochemistry rules so that suggestions can be provided to the user intelligently.

6.1.1.1 Querying Result Space

The result space of the SMDA tool can be huge, and there are many ways to query the results. Here we propose an example that queries the results from a specific angle, i.e., querying the SMDA results for selected metabolism disease-/disorder-related scenarios. The goal is to answer a basic question: Are there any plausible steady-state metabolic network activation/inactivation scenarios that would implicate specific diseases or disorders (e.g., urea cycle disorders), given the observed metabolomics measurements?

Disorders or diseases sometimes occur due to low levels of activity for certain reactions; that is, a reaction has slowed down sufficiently to cause a disorder/disease but has not necessarily shut down (i.e., become Inactive). The current SMDA approach, however, attaches only three labels to reactions: Unknown, Active, and Inactive. In this study, we stay with the current SMDA model and assume that reactions related to disorders and labeled as Inactive may in fact have very low flux rates (i.e., “LowActive” or “NotActive Enough”), leading to the disorder.


As one idea for solving this problem, two steps can be implemented on top of the SMDA tool’s output. First, SMDA results can be clustered into groups via preset biochemistry-based clustering rules. Second, the biological process of a specific disease can be analyzed; the disease’s disorder characteristics can then be extracted and captured via one or more disease/disorder identification rules, which can be used to identify matching scenarios in the SMDA outputs.
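The second step, applying a disease/disorder identification rule to SMDA outputs, can be sketched as follows. This is only an assumed data model for illustration: a scenario is represented as a mapping from reaction names to the labels Unknown, Active, or Inactive, a rule lists the labels that certain reactions must carry, and all reaction names and function names are hypothetical.

```python
def matches_rule(scenario, rule):
    """A scenario matches when every reaction named in the rule carries
    the required label; reactions missing from the scenario are Unknown."""
    return all(scenario.get(rxn, "Unknown") == label
               for rxn, label in rule.items())


def query_scenarios(scenarios, rule):
    """Return the scenarios consistent with the identification rule."""
    return [s for s in scenarios if matches_rule(s, rule)]


# Hypothetical SMDA output with two scenarios over reactions R1..R3.
scenarios = [
    {"R1": "Active", "R2": "Inactive", "R3": "Active"},
    {"R1": "Active", "R2": "Active", "R3": "Unknown"},
]
# Hypothetical rule: the disorder is implicated when R2 is Inactive.
rule = {"R2": "Inactive"}
print(query_scenarios(scenarios, rule))
```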

6.1.1.2 Finding Observations for a Desired Flow-Graph

The user may be interested in locating metabolite observations that are related to a specific result. This can be done via backward reasoning.

Suppose an SMDA query is run with a number of observed metabolites, etc., and the output with flow-graphs is returned. First, our SMDA tool should retain the first query and should be able to provide the same interface, as specified, to the user. Next, the SMDA tool should be able to switch to a different interface that allows the user to reverse the question into the following query:

“Let Q define a query network with the following X, …, Z metabolite pool labels (observations), called PoolAssignment, and the following ri, rj, …, rv active/inactive reactions/pathways, called ReactionAssignment. List the metabolite pool labels of selected metabolites in Q (possibly those in bio-fluids) and the associated flow-graphs where R is consistent with PoolAssignment and ReactionAssignment”.

6.1.1.3 Allowing User to Specify Status

Other than the measurements, we may utilize the user’s domain knowledge as well. This can be done by allowing the user to provide more information as SMDA input. The user does not have to be familiar with the SMDA model or its terminology. The interface should be able to take the user’s biochemistry language and convert it into known conditions for the SMDA tool.

For example, the user’s input could state that some pathway/reaction is known to be active/inactive, that the flow of a metabolite goes to one branch only instead of other branches, or that a known fuel is being used for a pathway. In our implementation, we will translate these inputs into status labels of the nodes in the sub-network. For a known pathway/reaction, we set its label to active/inactive according to the user's input. For a known flow, we can assign labels to the metabolite pools and reactions involved. For a known fuel, we can set the label of the corresponding pool to available and the labels of the producing and consuming reactions to active.

6.1.1.4 Notifying Users of Special Cases

Since the user-selected sub-network may not be a complete network, when we apply biochemistry rules to the sub-network, we may miss some cases that could exist in the real world. For example, if a user chooses the Tricarboxylic Acid (TCA) Cycle as the sub-network, Acetyl CoA is produced by the Pyruvate dehydrogenase (PDH) complex and is consumed by Citrate synthase. When Citrate synthase is inactive, we conclude that the Pyruvate dehydrogenase (PDH) complex is also inactive, according to Rule BC7 (if no consumer is active, then no producer is active). However, in reality, if we include the reaction Pyruvate Carboxylase in the sub-network, the Pyruvate dehydrogenase (PDH) complex could be active, since Pyruvate Carboxylase is consuming Acetyl CoA.

To avoid such cases, the user-chosen sub-network can be analyzed intelligently; SMDA may then remind the user when such situations arise and let the user choose either to continue or to change the sub-network.
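One way such a check could work is sketched below: before the rules are applied, the tool looks for metabolites produced inside the selected sub-network that also have consumers outside it, since those are the cases where a rule such as BC7 may be applied too eagerly. This is an assumed design, not the SMDA implementation; the network representation, the generic names R1..R3, and the function name are hypothetical.

```python
def external_consumers(network, subnetwork):
    """Map each metabolite produced inside the sub-network to the
    reactions outside the sub-network that consume it."""
    produced_inside = set()
    for rxn in subnetwork:
        produced_inside |= network[rxn]["produces"]

    warnings = {}
    for rxn, io in network.items():
        if rxn in subnetwork:
            continue
        for met in sorted(io["consumes"] & produced_inside):
            warnings.setdefault(met, []).append(rxn)
    return warnings


# Hypothetical network: R1 produces M; R2 and R3 both consume M.
network = {
    "R1": {"consumes": {"A"}, "produces": {"M"}},
    "R2": {"consumes": {"M"}, "produces": {"B"}},
    "R3": {"consumes": {"M"}, "produces": {"C"}},
}
# The user selects only R1 and R2, so R3 is an unseen consumer of M.
print(external_consumers(network, {"R1", "R2"}))
```

A non-empty result would be the trigger for reminding the user that the sub-network omits a consumer and for offering to extend the selection.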

6.1.2 Visualization

Visualization tools are integrated in the exploratory search, querying, and visualization functions of PathCase Systems. The tools can be enhanced further via new functionalities. We give some examples.

In a large metabolic network, e.g., a genome-scale reconstructed metabolic network with hundreds of reactions/metabolites, it can be hard to locate a specific element manually even when the element’s name is known. A node location functionality can be provided that allows the user to input the name of an element; the visualization tool then locates all nodes with the specified name and highlights them in the visualization graph.

In PathCase-RCMN or PathCase-SMDA, it would be helpful to provide a whole picture of all the pathways/sub-networks in the database, in addition to the visualization of a single pathway or a user-selected network. In this whole picture, all pathways in the database are visualized, and the user-selected pathway(s)/sub-network(s) are highlighted. In this way, the user has a global view of the complete network, as well as of the connections/relations between the sub-networks.

6.1.3 Bio-Entity Identifications

Based on the basic-level bio-entity identification techniques and results, the scope of the bio-entity identification problem in GSRMNs can be extended in multiple new and innovative ways; e.g., higher-level bio-entities, basic and higher-level graph-based (GB) bio-entities, and basic social network based (SNB) entities can also be located. Identification efficiency may be improved by applying techniques from related work. Also, multiple levels of identification can be integrated into current PathCase systems, i.e., PathCase-RCMN and PathCase-Recon.

6.1.3.1 Different levels of identifications

Higher-level bio-entities identification, as defined in Chapter Five, includes pathway or metabolism sub-network locating in the GSRMNs. In addition to metabolite, reaction and compartment identification, topological structure of the entity needs to be considered and matched in some manner. Also, disconnected components should be identified if they exist in the GSRMNs.

GSRMN networks can be mapped onto graphs for analysis. There are several mapping techniques [102]. For GSRMN analysis, one can map metabolites into vertices and reactions into edges, or map reactions into vertices and metabolites into edges. For each mapped graph, clustering algorithms can be applied and the results compared. This is called GB bio-entities identification and is a syntactic cluster location problem. Basic-level GB bio-entities identification includes basic entities.
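The two mappings can be sketched as follows. This is a minimal illustration with an assumed input format (each reaction given as a pair of substrate and product lists) and hypothetical function names; it builds directed adjacency sets and is not one of the specific mapping techniques of [102].

```python
from collections import defaultdict


def metabolite_graph(reactions):
    """Metabolites become vertices; each reaction contributes
    substrate -> product edges labeled with the reaction name."""
    edges = defaultdict(set)
    for name, (substrates, products) in reactions.items():
        for s in substrates:
            for p in products:
                edges[s].add((p, name))
    return dict(edges)


def reaction_graph(reactions):
    """Reactions become vertices; an edge r1 -> r2, labeled with a
    metabolite, exists when a product of r1 is a substrate of r2."""
    edges = defaultdict(set)
    for r1, (_, products1) in reactions.items():
        for r2, (substrates2, _) in reactions.items():
            if r1 != r2:
                for m in set(products1) & set(substrates2):
                    edges[r1].add((r2, m))
    return dict(edges)


# Hypothetical two-reaction chain: A -> B -> C.
reactions = {"r1": (["A"], ["B"]), "r2": (["B"], ["C"])}
print(metabolite_graph(reactions))
print(reaction_graph(reactions))
```

Either adjacency structure can then be handed to a clustering algorithm, which allows the clusters obtained from the two views to be compared.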

Network analysis can also be applied to GSRMN identifications. Authorities are objects pointed to by a large number of hubs (i.e., they have a large number of incoming edges), which are likely to be good sources of information. Hubs are objects that are likely to point to many such authorities through the link structure of the data (i.e., they have a large number of outgoing edges) [103]. Page rank scores of metabolites and reactions in a GSRMN network can be computed and compared across GSRMN networks. The computation and analysis of authorities, hubs, and page ranks are called basic social network based (SNB) entities identification.
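Hub and authority scores of this kind can be computed iteratively, as in the following sketch of the standard HITS iteration. It is an illustration only: the edge-list input, the fixed iteration count, and the node names are assumptions, and the graph would in practice come from one of the GSRMN-to-graph mappings.

```python
def hits(edges, iterations=50):
    """Compute hub and authority scores for a directed edge list
    using the standard HITS update with L2 normalization."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of nodes pointing to n.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub score: sum of authority scores of nodes that n points to.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Hypothetical graph: a and b point to c; a also points to d.
hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
print(hub, auth)
```

In this small graph, c receives the highest authority score because two nodes point to it, and a receives the highest hub score because it points to both authorities.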

6.1.3.2 Improve efficiency of identifications

For basic-level bio entities identification, the efficiency can be improved further via approximate string matching related techniques in the literature. We give some examples.

In Li et al.’s work [68], three algorithms for answering approximate string search queries are proposed, called ScanCount, MergeSkip, and DivideSkip. The ScanCount algorithm adopts the simple idea of scanning the inverted lists of the grams and counting candidate strings. The MergeSkip algorithm exploits the value differences among the inverted lists and the threshold on the number of common grams of similar strings to skip many irrelevant candidates on the lists. The DivideSkip algorithm combines the MergeSkip algorithm with the idea of the MergeOpt algorithm proposed in [104], which divides the lists into two groups: one group for the long lists and the other for the remaining lists.
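The counting idea behind ScanCount can be sketched as follows. This is a simplified illustration and not the implementation from [68]: the helper names are hypothetical, grams are taken without padding, and the threshold on the number of common grams is simply passed in by the caller.

```python
from collections import defaultdict


def qgrams(s, q):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted lists: gram -> ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)
    return index


def scan_count(query, strings, index, q, threshold):
    """Scan the inverted lists of the query's grams, count how often each
    string id occurs, and keep the ids that reach the threshold."""
    counts = [0] * len(strings)
    for g in qgrams(query, q):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [strings[sid] for sid, c in enumerate(counts) if c >= threshold]


names = ["glucose", "glucose-6-phosphate", "fructose"]
index = build_index(names, q=3)
print(scan_count("glucose", names, index, q=3, threshold=3))
```

The strings returned are only candidates; each would still be verified with the edit distance computation before being reported as a match.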

Instead of having the user choose the q value of the q-grams, VGRAM, proposed in Li et al.’s work [85], can be used to improve the performance of the algorithm. Its main idea is to judiciously choose high-quality grams of variable lengths from a collection of strings to support queries on the collection. It acts as an index structure associated with a collection of strings that are going to be queried approximately. The frequencies of variable-length grams in the strings are analyzed to build a gram dictionary. For a string, a set of grams of variable lengths is generated using the gram dictionary. The gram sets of two strings are compared to obtain their similarity.

In Zhang et al.’s work [86], the Bed-tree is proposed, a B+-tree based index structure to support string similarity queries with respect to edit distance. The index can be built once and used with arbitrary distance thresholds and for all query types. The paper identifies the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries, and presents three different string transformations that capture useful information from different aspects of a string.

All these techniques can be adapted to the entity identification techniques.

6.1.3.3 Integration with PathCase systems

Of all the PathCase systems introduced in Chapter Four, PathCase-RCMN and PathCase-Recon use the GSRMN database. Basic-level bio-entity identification can be integrated into both PathCase systems, allowing the user to locate similar entities with ranking scores.


Appendix 1. Core Metabolites (Total count: 617)

(+-)-Malic; (-)-Histrionicotoxin; 1,2-Dichloroethane; 1,2-Dichloronapthalene; 1,2-Diphenylhydrazine; 1,3,5,7-Tetrafluorocylcooctatetraene; 1,5-Dichloronapthalene; 1-Bromo-1-chloro-ethene; 2,2',6,6'-Tetrachlorobiphenyl; 2,2'-Dichlorobiphenyl; 2,3,7,8-Tetrachloro-dibenzo-p-dioxin; 2,3,7,8-Tetrachloro-dibenzofuran; 2,4; 2,4-Dichlorophenoxyacetic; 2-Acetylaminofluorene; 2-Furoic; 2-Propenoic; 4,4'-Dichlorobiphenyl; 5-Fluorouracil; 6,6'-Dibromo-indigo; 6-Mercaptopurine; 9-BBN; a-Actinin; a-Aminobutyric; a-Tocopherol; Abscisic; Acenaphthylene; Acetaminophen; Acetate; Acetic; Acetone; Acetophenone; Acetyl; Acetylcholine; Acetylene; Acroleic; Acrylamide; Acrylic; Acrylonitrile; Acyclovir; Adenine; Adenosine; Adipic; ADP;

Adrenaline; Adrucil; Advil; Alanine; Aldrin; Allene; Aluminum; Amidox; Ammonium; Amoxone; Amphidinolide; Anatase; Androsterone; Anhydrite; Anhydroanguibactin; Anhydroscymnol; Aniline; Annulin; Anthracene; Antimony; Apatite; Apophyllite; Aqua-kleen; Aquamarine; Aragonite; Arginine; Arkelite; Arsenic; Arsenopyrite; Arsine; Ascidiacyclamide; Ascorbic; Asparagine; Aspartame; Aspartic; Aspirin; Aspirochlorine; ATP; AZT; Azurite; Bacteriopheophytin; Barite; Barium; Benitoite; Benzene; Benzo(a)pyrene; Benzoic; Benzophenone; Benzothiazole; Beryl; Biacetyl; Bicyclomycin; Biotin; Biphosphate; Bismuth; Bisulphite; Boracite; Borax; Borazine; Boron; Brazilianite; Brevetoxin; Bromo-chloro-fluoro-methane; Bromoaureol; Bromopentafluoride; Brooklax; Buckminsterfullerene; BuLi; Bupropion; Butylbenzoic; Butyric; C60; C70; Caffeine; Calcite; Calcium; Caledonite;

Calyculin; Camphor; Cantharidin; Captan; Carbon; Carbonate; Carbonic; Carletonite; Carnallite; Caryophyllene; Cassiterite; Catechol; Cavansite; CDP; Celestine; Cembranolide; Cerussite; Chalcanthite; Chalcopyrite; Chlorate; Chlordene; Chlorine; Chloro-difluoro-methane; Chlorocresol; Chloromethane; Chlorophyll; Chlorosulfuric; Cholesterol; Cholic; Chromate; Chromium; Chrysene; Chrysoberyl; Cinnabar; Cinnamic; Cinnamon; Cisplatin; Citric; Clinoclase; Cocaine; Codeine; Collagen; Copiapite; Copper; Coronene; Cortisol; Cortisone; Corundum; Coumarin; Creatine; Crocoite; Cryolite; Cucurbitine; Cumene; Curcumin; Cyanide; Cyanoacetylene; Cyanoacrylate; Cyanogen; Cyclobutane; Cyclohexane; Cyclomarin; Cyclopropane; Cyclopropenylidene; Cycloxazoline; Cymobarbatol; Cysteine; Cytidine; Cytosine; D-(-)-Luciferin; D-Glucitol; Dactylallene; DDT; Decachlorobiphenyl; Decamine; Dechlorane; Decopur; Di-t-butyl-peroxide; Diacetylene;

Diamond; Diazepam; Diazomethane; Dibenzoyl; Dicamba; Dicarbon; Dichloro-difluoro-methane; Dichromate; Dieldrin; Dihydroxyacetone; Diketene; Dimethyl; Dimethylpyrazine; Dimethyltryptamine; Dinitrogen; Dinitrophenol; Dinitrotoluene; Dioxane; Dioxin; Diuron; Divinyl; DL-3-Aminoisobutyric; Dodecanedioic; Dolomite; Domeykite; dTDP; Durdenite; Durotox; Dynamite; Dysamide; e-Caprolactam; Ecstasy; Emmonsite; Endrin; Epinephrine; Epsomite; Erythrite; Erythromycin; Estradiol; Estrol; Estrone; Ethane; Ethyl; Ethylene; Ferredoxin; Ferric; Ferrous; Fluoranthene; Fluorapatite; Fluorene; Fluorite; Fluoxetine; Fool's; Formic; Fructose; Fructose-6-phosphate; Fumarate; Fumaric; Fumiquinazoline; Furaldehyde; galactosamine; Galacturonic; Galena; Gallic; Garnet; Gaspeite; GDP; Germane; Glucarate; Glucocorticoid; glucosamine; Glucose; Glucuronic; Glutamate;

Glutamic; Glutamine; Glutaric; Glycine; Gold; Graphite; Guanidinium; Guanine; Guanosine; Gypsum; Halite; Halloysite; Halomon; Hardystonite; HCB; Hematite; Heptachlor; Hessite; HEX; Hexachlorobenzene; Hexachlorocyclohexane; Hexachlorocyclopentadiene; Hexafluorosulfide; Hexahydrobenzene; Hexane; Histidine; Honulactone; Hyaluronidase; Hydrated; Hydrazine; Hydrochloric; Hydrogen; Hydronium; Hydroxide; Hydroxyisobutyric; Hypochlorite; Hyposulfite; Ibuprofen; Indole; Inesite; Iodine; Iron; isobutane; Isodrin; isoleucine; isopropanol; isopropyl; Jadeite; Juglone; Juncusol; Kalihinene; Kepone; Keramamine-A; Keramaphidin; Ketene; Kilprop; L-Alanine; L-Arginine; L-Carvone; Lactose; Lankalapuol; Laurencin; Lauric; Leucine; Lindane; Linoleic; LSD; Lyphocin; Lysine; Lysozyme; m-Cresol; m-Hydroxybenzoic; m-Xylene; Magic; Malachite; Maleic;

Malonic; Maltol; Maltose; Manzamine; Marcasite; MDMA; Mecopar; Mecoprop; Melanin; Melanterite; Melatonin; Melittin; Menadione; Mercaptopurine; Mescaline; Methacrylate; Methane; Methanol; Methionine; Methoxychlor; Methyl; Methylamine; Methylcyanoacetylene; Methyldiacetylene; Methylpyrazine; Millerite; Mimetite; Miracle; Mirbane; Mirex; Molybdenite; Molybdenum; Molybdic; Monosan; Morphine; Motrin; Muscovite; Musk; N-(p-Bromobenzamide)gymnodimine; n-Pentacosane; NAG; Naphthol; Naprosyn; Naproxen; Napthalene; Neamphine; Needle; Neohalicholactone; Nicotine; Niter; Nitrate; Nitric; Nitrite; Nitrobenzol; Nitroguanidine; Nitrophenol; Nitrous; Norcholestane; Nuprin; Nutrasweet; Octanitrocubane; Octanoic; Oestrin; Oestrone; Oleic; Oxalate; Oxalic; Oxirane; Oxychlor; Oxytetracycline; Ozone; p,p-DDE; p-Benzoquinone; p-Cresol; p-Hydroxybenzaldehyde;

p-Hydroxybenzoic; p-Xylene; Paraherquamide; Paroxetine; PCB-15; PCB-4; Pectenotoxin; Pectenotoxin-1; Penicillin; Pennamine; Pentaacetoxy; Pentacarbon; Pentacene; Pentachlorophenol; Pentaerythritol; Pentane; Pentatetraenylidene; Peppermint; Pepsin; Perchlorate; Periclase; Permanganate; Peroxide; Perylene; Phenanthrene; Phencyclidine; Phengite; Phenol; Phenolphthalein; Phenoxymethylpenicillin; Phenylalanine; Phenylmercuric; Phloroglucinol; Phosgene; Phosphine; Phosphoenol; Phosphoric; Phycocyanin; Phycocyanobilin; Phycoerythrin; Phylloquinone; Picene; Picric; Picryl; Pinene; Pinnatazane; Piperazinomycin; Piperine; Plastocyanin; Plastoquinone; Platinum; Potassium; Pregnenolone; Prianosin; Progesterone; Progestin; Proline; Prometone; Propadienylidene; Propane; Propene; Propionic; Propyne; Prozac; Pseudo-conhydrine; Pseudopterosin; Psilocybin; Pyrene; Pyridine; Pyrite; Pyrrhotite; Pyruvate; Quartz; R-Carnitine; RDX; Realgar;

Retinoic; Rhodochrosite; Ribulose-bisphosphate; Rotenone; Saccharin; Salicylaldehyde; Salicylic; Salinamide; Saxitoxin; Scalaradial; Scapolite; Serine; Showdomycin; Siderite; Silver; Silylene; Sinhalite; Sodalite; Sorbic; Sorbitol; Sphalerite; Spinel; Stannic; Stearic; Stream; Streptonigrin; Strontianite; Strychnine; Styphnic; Styrene; Succinic; Sucrose; Sugar; Sulfate; Sulfite; Sulfur; Sulfuric; Sweet; Tamoxifen; Tanzanite; Tartaric; Testosterone; Tetrabromodichlorobipyrrole; Tetracycline; Tetradecanol; Tetrahydrocortisol; Tetrahydrofuran; Tetranitroaniline; Tetrodotoxin; Tetryl; THF; Thiocarbohydrazide; Thioketene; Thionyl; Thiophene; Thioredoxin; Thiourea; Threonine; Thymidine; Thymine; Tin; Tinstone; Titanium; Topaz; Tourmaline; Toxaphene; Trans-Chlordane; Triacetylene; Tridecane; Trimesic; Trimethylamine; Trimethylene; Trimethylpyrazine; Trinitrobenzene; Trinitroresorcinol;

Trioxane; Triphenylene; Triphosgene; Trunculin; Tryptophan; Turquoise; Tyrosine; Ubiquinone; Undecanol; Uracil; Urea; Urethane; Uridine; Uridine-5-oxyacetic; Valine; Valium; Vanadinite; Vancocyn; Vancomycin; Vanillin; Variscite; Venlafaxine; Vinyl; Vinylformic; Vitamin; Warfarin; Water; Weed; Weedtrol; Wellbutrin; Wood; Wulfenite; Zippeite; Zircon; Zyban

Bibliography

[1] M. I. Sigurdsson, N. Jamshidi, E. Steingrimsson, I. Thiele, and B. Ø. Palsson, “A detailed genome-wide reconstruction of mouse metabolism based on human Recon 1,” BMC Systems Biology, vol. 4, no. 1, p. 140, Jan. 2010.

[2] R. Schuetz, L. Kuepfer, and U. Sauer, “Systematic evaluation of objective functions for predicting intracellular fluxes in Escherichia coli,” Molecular Systems Biology, vol. 3, no. 119, p. 119, Jan. 2007.

[3] M. Stobbe and S. Houten, “Critical assessment of human metabolic pathway databases: a stepping stone for future integration,” BMC Systems Biology, vol. 5, p. 165, 2011.

[4] A. Kumar, P. F. Suthers, and C. D. Maranas, “MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases,” BMC Bioinformatics, vol. 13, no. 1, p. 6, Jan. 2012.

[5] N. D. Price, J. L. Reed, and B. Ø. Palsson, “Genome-scale models of microbial cells: evaluating the consequences of constraints,” Nature Reviews Microbiology, vol. 2, no. 11, pp. 886–97, Nov. 2004.

[6] A. Joyce and B. Palsson, “Towards whole cell modeling and simulation: comprehensive functional genomics through the constraint-based approach,” Progress in Drug Research, vol. 64, pp. 267–309, 2007.

[7] M. A. Oberhardt, B. Ø. Palsson, and J. A. Papin, “Applications of genome-scale metabolic reconstructions,” Molecular Systems Biology, vol. 5, no. 320, p. 320, Jan. 2009.

[8] S. Schuster, D. A. Fell, and T. Dandekar, “A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks,” Nature Biotechnology, vol. 18, no. 3, pp. 326–32, Mar. 2000.

[9] A. Cakmak, X. Qi, A. E. Cicek, and G. Ozsoyoglu, “Computational Interpretation of Metabolomics Measurements: Steady-State Metabolic Network Dynamics Analysis,” in Proceedings of 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2011 (ACM BCB 2011), Aug 1-3, Chicago, IL, 2011.

[10] D. Fell, Understanding the control of metabolism. Portland Press, London, UK, 1996.

[11] C. H. Schilling, S. Schuster, B. O. Palsson, and R. Heinrich, “Metabolic pathway analysis: basic concepts and scientific applications in the post-genomic era.,” Biotechnology progress, vol. 15, no. 3, pp. 296–303, 1999.

[12] N. G. Stephanopoulos, A. A. Aristidou, and J. Nielsen, Metabolic Engineering: Principles and Methodologies. Academic Press, Maryland Hts, MO, 1998.

[13] S. Schuster and C. Hilgetag, “On elementary flux modes in biochemical reaction systems at steady state,” Journal of Biological Systems, vol. 2, pp. 165–182, 1994.

[14] X. Qi, A. E. Cicek, and G. Ozsoyoglu, “Performing Gene Lethality Testing with SMDA.” 2012.

[15] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. A. Manque, V. Lee, A. M. Lara, J. A. Papin, and G. A. Buck, “Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi.,” BMC systems biology, vol. 3, p. 52, Jan. 2009.

[16] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. A. Manque, V. Lee, A. M. Lara, J. A. Papin, and G. A. Buck, “Supplement 5, Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi,” BMC systems biology, vol. 3, p. 52, 2009.

[17] M. Sajitz-Hermstein and Z. Nikoloski, “A novel approach for determining environment-specific protein costs: the case of Arabidopsis thaliana.,” Bioinformatics (Oxford, England), vol. 26, no. 18, pp. i582–8, Sep. 2010.

[18] A. Cakmak, X. Qi, S. A. Coskun, M. Das, E. Cheng, A. E. Cicek, N. Lai, G. Ozsoyoglu, and Z. M. Ozsoyoglu, “PathCase-SB architecture and database design.,” BMC systems biology, vol. 5, no. 1, p. 188, Jan. 2011.

[19] S. A. Coskun, X. Qi, A. Cakmak, E. Cheng, A. E. Cicek, L. Yang, R. Jadeja, R. K. Dash, N. Lai, G. Ozsoyoglu, and Z. M. Ozsoyoglu, “PathCase-SB: integrating data sources and providing tools for systems biology research.,” BMC systems biology, vol. 6, no. 1, p. 67, Jan. 2012.

[20] “BioModels database—A Database of Annotated Published Models.” [Online]. Available: http://www.ebi.ac.uk/biomodels-main.

[21] N. Le Novère, B. Bornstein, A. Broicher, M. Courtot, M. Donizelli, H. Dharuri, L. Li, H. Sauro, M. Schilstra, B. Shapiro, J. L. Snoep, and M. Hucka, “BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems.,” Nucleic acids research, vol. 34, no. Database issue, pp. D689–91, Jan. 2006.

[22] C. Li, M. Donizelli, N. Rodriguez, H. Dharuri, L. Endler, V. Chelliah, L. Li, E. He, A. Henry, M. I. Stefan, J. L. Snoep, M. Hucka, N. Le Novère, and C. Laibe, “BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models.,” BMC systems biology, vol. 4, p. 92, Jan. 2010.

[23] “KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathways.” [Online]. Available: http://www.genome.jp/KEGG/pathway.html.

[24] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa, “KEGG: Kyoto Encyclopedia of Genes and Genomes.,” Nucleic acids research, vol. 27, no. 1, pp. 29–34, Jan. 1999.

[25] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa, “From genomics to chemical genomics: new developments in KEGG.,” Nucleic acids research, vol. 34, no. Database issue, pp. D354–7, Jan. 2006.

[26] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa, “KEGG for representation and analysis of molecular networks involving diseases and drugs.,” Nucleic acids research, vol. 38, no. Database issue, pp. D355–60, Jan. 2010.

[27] R. S. Johnson, X. Qi, A. E. Cicek, and G. Ozsoyoglu, “iPathCase-KEGG: An iPad Interface for KEGG Metabolic Pathways,” Health Information Science and Systems, 2012.

[28] “RECON:Online Database of Reconstructed Metabolic Networks.” [Online]. Available: http://www.reconmodels.org/Web.

[29] K. Yizhak, T. Benyamini, W. Liebermeister, E. Ruppin, and T. Shlomi, “Integrating quantitative proteomics and metabolomics with a genome-scale metabolic network model.,” Bioinformatics (Oxford, England), vol. 26, no. 12, pp. i255–60, Jun. 2010.

[30] I. R. Bederman, S. Foy, V. Chandramouli, J. C. Alexander, and S. F. Previs, “Triglyceride synthesis in epididymal adipose tissue: contribution of glucose and non-glucose carbon sources.,” The Journal of biological chemistry, vol. 284, no. 10, pp. 6101–8, Mar. 2009.

[31] H. G. Gasier, J. D. Fluckey, and S. F. Previs, “The application of 2H2O to measure skeletal muscle protein synthesis.,” Nutrition & metabolism, vol. 7, p. 31, Jan. 2010.

[32] A. Cakmak, X. Qi, A. E. Cicek, I. Bederman, L. Henderson, M. Drumm, and G. Ozsoyoglu, “A new metabolomics analysis technique: steady-state metabolic network dynamics analysis.,” Journal of bioinformatics and computational biology, vol. 10, no. 1, p. 1240003, Feb. 2012.

[33] “SMDA tool.” [Online]. Available: http://nashua.case.edu/pathwayssmda/web.

[34] “PathCase family of applications.”

[35] S. Johnson and G. Ozsoyoglu, “PathCase MAW, an iPad application.”

[36] “HMDB site.” [Online]. Available: http://www.hmdb.ca/sources.

[37] A. E. Cicek et al., “Resolving Observation Conflicts in Steady State Metabolic Network Dynamics Analysis,” vol. 9, pp. 409–414.

[38] “PathCase-MAW application.” [Online]. Available: http://nashua.case.edu/PathwaysMAW/.

[39] “PathCase-RCMN application for organism Trypanosoma cruzi.” [Online]. Available: http://nashua.case.edu/PathwaysMAW_Trypanosoma/web/.

[40] “OMA Tool.”

[41] C. H. Schilling, D. Letscher, and B. O. Palsson, “Theory for the systemic definition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective.,” Journal of theoretical biology, vol. 203, no. 3, pp. 229–48, Apr. 2000.

[42] D. A. Fell, “Metabolic control analysis: A survey of its theoretical and experimental development,” Biochemical Journal, vol. 286, pp. 313–330, 1992.

[43] J. C. Liao and J. Delgado, “Advances in Metabolic Control Analysis,” Biotechnology progress, vol. 9, no. 3, pp. 221–233, 1993.

[44] M. C. Wildermuth, “Metabolic control analysis: biological applications and insights,” Genome biology, vol. 1, no. 6, pp. 1–5, 2000.

[45] A. Varma and B. O. Palsson, “Metabolic capabilities of Escherichia coli. II. Optimal growth patterns,” Journal of Theoretical Biology, vol. 165, no. 4, pp. 503–522, 1993.

[46] J. S. Edwards and B. O. Palsson, “Systems properties of the Haemophilus influenzae Rd metabolic genotype.,” The Journal of biological chemistry, vol. 274, no. 25, pp. 17410–6, Jun. 1999.

[47] S. Klamt and J. Stelling, “Two approaches for metabolic pathway analysis?,” Trends in biotechnology, vol. 21, no. 2, pp. 64–9, Feb. 2003.

[48] A. P. Wlaschin, C. T. Trinh, R. Carlson, and F. Srienc, “The fractional contributions of elementary modes to the metabolism of Escherichia coli and their estimation from reaction entropies.,” Metabolic engineering, vol. 8, no. 4, pp. 338–52, Jul. 2006.

[49] M. G. Poolman, K. V. Venkatesh, M. K. Pidcock, and D. A. Fell, “A method for the determination of flux in elementary modes, and its application to Lactobacillus rhamnosus.,” Biotechnology and bioengineering, vol. 88, no. 5, pp. 601–12, Dec. 2004.

[50] R. Urbanczik and C. Wagner, “An improved algorithm for stoichiometric network analysis: theory and applications.,” Bioinformatics (Oxford, England), vol. 21, no. 7, pp. 1203–10, Apr. 2005.

[51] D. J. Glykys and S. Banta, “Metabolic control analysis of an enzymatic biofuel cell.,” Biotechnology and bioengineering, vol. 102, no. 6, pp. 1624–35, Apr. 2009.

[52] S. Klamt, J. Gagneur, and A. von Kamp, “Algorithmic approaches for computing elementary modes in large biochemical reaction networks,” IEE Proceedings - Systems Biology, vol. 152, no. 4, pp. 249–55, 2005.

[53] G. Alterovitz, V. Muralidhar, and M. F. Ramoni, “Gene Lethality Detection and Characterization via Topological Analysis of Regulatory Networks,” IEEE Transactions on Circuits and Systems, vol. 53, no. 11, pp. 2438–2443, 2006.

[54] A.-L. Barabási and Z. N. Oltvai, “Network biology: understanding the cell’s functional organization.,” Nature reviews. Genetics, vol. 5, no. 2, pp. 101–13, Feb. 2004.

[55] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai, “Lethality and centrality in protein networks.,” Nature, vol. 411, no. 6833, pp. 41–2, May 2001.

[56] N. C. Duarte, S. A. Becker, N. Jamshidi, I. Thiele, M. L. Mo, T. D. Vo, R. Srivas, and B. Ø. Palsson, “Global reconstruction of the human metabolic network based on genomic and bibliomic data.,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 6, pp. 1777–82, Feb. 2007.

[57] C. T. Trinh, A. Wlaschin, and F. Srienc, “Elementary mode analysis: a useful metabolic pathway analysis tool for characterizing cellular metabolism.,” Applied microbiology and biotechnology, vol. 81, no. 5, pp. 813–26, Jan. 2009.

[58] T. D. Jamison, Disease Control Priorities in Developing Countries, 2nd ed. Washington (DC): World Bank, 2006.

[59] S. B. Roberts, J. L. Robichaux, A. K. Chavali, P. A. Manque, V. Lee, A. M. Lara, J. A. Papin, and G. A. Buck, “Supplement 3, Proteomic and network analysis characterize stage-specific metabolism in Trypanosoma cruzi,” BMC systems biology, vol. 3, p. 52, 2009.

[60] “PathCaseRCMN.” [Online]. Available: http://nashua.case.edu/PathCaseRCMN/web/.

[61] “PathCase-Recon.” [Online]. Available: http://nashua.case.edu/PathCaseRECON/Web/.

[62] “BiGG Database.” [Online]. Available: http://bigg.ucsd.edu/bigg/main.pl.

[63] “MEMOSys.” [Online]. Available: http://icbi.at/software/memosys/memosys.shtml.

[64] “GSMNDB.” [Online]. Available: http://synbio.tju.edu.cn/GSMNDB/gsmndb.htm.

[65] “PathCase-MQL.” [Online]. Available: http://nashua.case.edu/PathwaysMQL/web/.

[66] A. Cakmak, G. Ozsoyoglu, and R. W. Hanson, “Managing and Querying Mammalian Metabolic Networks: A Metabolism Query Language and Its Query Processing.”

[67] J. D. Orth, I. Thiele, and B. Ø. Palsson, “What is flux balance analysis?,” Nature biotechnology, vol. 28, no. 3, pp. 245–248, 2010.

[68] M. L. Mo, N. Jamshidi, and B. Ø. Palsson, “A genome-scale, constraint-based approach to systems biology of human metabolism.,” Molecular bioSystems, vol. 3, no. 9, pp. 598–603, Sep. 2007.

[69] G. M. Maggiora and V. Shanmugasundaram, Molecular Similarity Measures, vol. 672. Totowa, NJ: Humana Press, 2011.

[70] M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa, “Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways.,” Journal of the American Chemical Society, vol. 125, no. 39, pp. 11853–65, Oct. 2003.

[71] M. Hattori, N. Tanaka, M. Kanehisa, and S. Goto, “SIMCOMP/SUBCOMP: chemical structure search servers for network analyses.,” Nucleic acids research, vol. 38, no. Web Server issue, pp. W652–6, Jul. 2010.

[72] “String metric.” [Online]. Available: http://en.wikipedia.org/wiki/String_metric.

[73] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.

[74] S. Jimenez, C. Becerra, A. Gelbukh, and F. Gonzalez, “Generalized Monge-Elkan Method for Approximate Text String Comparison,” pp. 559–570, 2009.

[75] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins.,” Journal of molecular biology, vol. 48, no. 3, pp. 443–53, Mar. 1970.

[76] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences.,” Journal of molecular biology, vol. 147, no. 1, pp. 195–7, Mar. 1981.

[77] W. E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” in Proceedings of the Section on Survey Research, 1990, pp. 354–359.

[78] P. Jaccard, “The distribution of the flora in the alpine zone,” New Phytologist, vol. 11, no. 2, 1912.

[79] L. Gravano and N. Koudas, “Approximate String Joins in a Database (Almost) for Free,” in Proceedings of the 27th International Conference on Very Large Data Bases, 2001, pp. 491–500.

[80] D. E. Almonacid, E. R. Yera, J. B. O. Mitchell, and P. C. Babbitt, “Quantitative comparison of catalytic mechanisms and overall reactions in convergently evolved enzymes: implications for classification of enzyme function.,” PLoS computational biology, vol. 6, no. 3, p. e1000700, Mar. 2010.

[81] F. Ay, T. Kahveci, and V. de Crécy-Lagard, “Consistent alignment of metabolic pathways without abstraction.,” Computational systems bioinformatics / Life Sciences Society. Computational Systems Bioinformatics Conference, vol. 7, pp. 237–48, Jan. 2008.

[82] M. Mednis and M. K. Aurich, “Application of string similarity ratio and edit distance in automatic metabolite reconciliation comparing reconstructions and models,” Biosystems and Information technology, vol. 1, no. 1, pp. 14–18, 2012.

[83] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, Mar. 2001.

[84] C. Li, J. Lu, and Y. Lu, “Efficient Merging and Filtering Algorithms for Approximate String Searches,” 2008 IEEE 24th International Conference on Data Engineering, pp. 257–266, Apr. 2008.

[85] C. Li and B. Wang, “VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams,” 2007.

[86] Z. Zhang, “Bed-Tree: An All-Purpose Index Structure for String Similarity Search Based on Edit Distance,” in SIGMOD’10, 2010.

[87] R. L. Anderson, “D-Fructose 1-Phosphate Kinase and D-Fructose 6-Phosphate Kinase from Aerobacter aerogenes,” Journal of Biological Chemistry, vol. 244, Nov. 25, 1969.

[88] “Chirality.” [Online]. Available: http://en.wikipedia.org/wiki/Chirality_(chemistry).

[89] “Optical Activity.” [Online]. Available: http://physics.unl.edu/~tgay/content/OA2.html.

[90] “Chemical Properties.” [Online]. Available: www.chemicalbook.com/ProductChemicalPropertiesCB7237292_EN.htm.

[91] “D-Glucose.” [Online]. Available: http://en.wikipedia.org/wiki/Glucose.

[92] “Naming Convention.” [Online]. Available: http://en.wikipedia.org/wiki/Enzyme#Naming_conventions.

[93] “IUBMB.” [Online]. Available: http://www.chem.qmul.ac.uk/iubmb/.

[94] A. Bordbar, N. E. Lewis, J. Schellenberger, B. Ø. Palsson, and N. Jamshidi, “Insight into human alveolar macrophage and M. tuberculosis interactions via metabolic reconstructions.,” Molecular systems biology, vol. 6, no. 422, p. 422, Oct. 2010.

[95] K. Radrich, Y. Tsuruoka, P. Dobson, A. Gevorgyan, N. Swainston, G. Baart, and J.-M. Schwartz, “Integration of metabolic databases for the reconstruction of genome-scale metabolic networks.,” BMC systems biology, vol. 4, p. 114, Jan. 2010.

[96] M. A. Oberhardt, J. Puchałka, V. A. P. Martins dos Santos, and J. A. Papin, “Reconciliation of genome-scale metabolic reconstructions for comparative systems analysis.,” PLoS computational biology, vol. 7, no. 3, p. e1001116, Mar. 2011.

[97] A. Gevorgyan, M. G. Poolman, and D. A. Fell, “Detection of stoichiometric inconsistencies in biomolecular models.,” Bioinformatics (Oxford, England), vol. 24, no. 19, pp. 2245–51, Oct. 2008.

[98] L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava, “Text joins in an RDBMS for web data integration,” Proceedings of the twelfth international conference on World Wide Web - WWW ’03, p. 90, 2003.

[99] J. Zobel and P. Dart, “Finding Approximate Matches in Large Lexicons,” Software: Practice and Experience, vol. 25, no. 3, pp. 331–345, 1994.

[100] “Precision and recall.” [Online]. Available: http://en.wikipedia.org/wiki/Precision_(information_retrieval).

[101] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. 2002.

[102] J. van Helden, L. Wernisch, D. Gilbert, and S. J. Wodak, “Graph-based analysis of metabolic networks.,” Ernst Schering Research Foundation workshop, no. 38, pp. 245–74, Jan. 2002.

[103] C. H. Q. Ding, H. Zha, X. He, P. Husbands, and H. D. Simon, “Link analysis: hubs and authorities on the world wide web,” vol. 2001, no. July, pp. 1–12, 2003.

[104] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates,” Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD ’04, p. 743, 2004.
