
The Pennsylvania State University

The Graduate School

The Huck Institutes of the Life Sciences

ELUCIDATION AND SYNTHETIC DESIGN OF BIOCHEMICAL PATHWAYS USING NOVOSTOIC

A Dissertation in

Integrative Biosciences

by

Akhil Kumar

© 2017 Akhil Kumar

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

May 2017


The dissertation of Akhil Kumar was reviewed and approved* by the following:

Costas D. Maranas
Donald B. Broughton Professor of Chemical Engineering
Dissertation Advisor
Chair of Committee

Ross Cameron Hardison
T. Ming Chu Professor of Biochemistry and Molecular Biology

Reka Z Albert
Distinguished Professor of Physics and Biology

Andrew David Patterson
Associate Professor of Molecular Toxicology

Peter J. Hudson
Director of Huck Institutes of Life Sciences

*Signatures are on file in the Graduate School


Abstract

Next generation pathway design algorithms and tools make the design of novel, sophisticated biosynthetic routes easier and faster. The goal of this dissertation is the development of computational designs for the biosynthesis of xenobiotics, and we discuss and demonstrate solutions to two key challenges.

The first challenge is the pace at which metabolic knowledge can be extracted, i.e. how readily data can be used in the form in which it was published. It is difficult to directly use data from genome-scale metabolic models (GSMs) as well as semi-curated databases such as BRENDA1, KEGG2, MetaCyc3, EcoCyc4, and BioCyc3. The difficulties arise from incompatibilities of representation, duplications, and errors, e.g. a single metabolite annotated with multiple names across different data sources. Also, in many cases, the same metabolite is annotated with multiple structures. This ambiguity gravely slows down the pooling of information across data sources. As a consequence, duplicated reaction information can mask gene deletions that would otherwise be identified as (synthetic) lethal. Such ambiguity affects the quality of predictions related to the overall metabolic potential of an organism. In addition, non-standard metabolite names and ids prevent the direct comparisons needed to identify reactions that overlap multiple data sources. This also leads to fragmented or disconnected datasets that provide smaller reaction domains for pathway traversal algorithms.

The second challenge is the capacity of algorithms and computer-aided design (CAD) tools to conceive novel biosynthetic pathway designs while reconciling the various engineering challenges. To systematize the engineering calculations needed for designing the biosynthesis of high-value chemicals, existing CAD tools explore the complex biochemical reaction space and enumerate metabolic engineering strategies for the heterologous production of target chemicals from substrates with native or engineered enzymes. Existing CAD tools are, however, limited and approximate in their design elements, i.e. they do not consider all the metabolic engineering paradigms in an integrated fashion5. Design elements such as reaction rules, network size, non-linear pathway topology, mass-conservation, cofactor balance, thermodynamic feasibility, chassis selection, toxicity, yield, and cost had not been unified into a single scheme in CAD tools before this work.

In the first chapter, we present a novel reaction rule based pathway design (CAD) tool and demonstrate it with biosynthetic designs for three pharmaceuticals, namely phenylephrine, epinine and naproxen. The second chapter presents a novel atom mapping algorithm that relies heavily on the concept of prime factorization. In the third chapter, we describe the algorithms we developed for curating biochemical data, i.e. the development of MetRxn. Finally, in the fourth chapter, we present an example of the MetRxn data being leveraged within a metabolic model.


Table of Contents

List of Figures
List of Tables
Acknowledgements

Chapter 1 Pathway synthesis using de novo steps through uncharted biochemical spaces
    1. Introduction
    2. Methods
    Description of the data and parameters required by rePrime and novoStoic:
    Developing a database of reaction rules using rePrime:
    novoStoic
    3. Results
    Phenylephrine synthesis:
    Naproxen synthesis:
    Epinine synthesis:
    Oxidative degradation of Benzo[a]pyrene to catechol:
    Discussion
    Acknowledgment
    Figure and Tables

Chapter 2 Maximum common molecular substructure queries within the MetRxn database
    1. Introduction
    2. Methods
    Reduction of a* search space
    Reaction atom mapping
    Common subgraphs between two reactions
    Common subgraphs between two metabolic pathways
    3. Results and discussion
    Application to E. coli iAF1260 metabolic model
    Alternate solutions in E. coli iAF1260 metabolic model due to equivalent groups
    Comparison with existing efforts:
    4. Summary
    Acknowledgments
    Abbreviations
    Figures and tables

Chapter 3 MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases
    Background
    Construction and Content
    MetRxn construction
    Step 1 Source data acquisition
    Step 2 Source data parsing
    Step 3 Metabolite charge and structural analysis
    Step 4 Metabolite synonyms and initial reaction reconciliation
    Step 5 Reaction charge and elemental balancing
    Step 6 Iterative reaction reconciliation
    Data export and display
    Source comparisons and visualization
    MetRxn Scope
    Utility and Discussion
    1. Charge and elementally balanced metabolic models
    2. Contrasting existing metabolic models
    3. Using MetRxn to Bio-Prospect for Novel Production Routes
    Conclusions
    Availability and requirements
    Acknowledgements and Funding
    Authors’ contributions
    Figures and tables

Chapter 4 Assessing the Metabolic Impact of Nitrogen Availability Using a Compartmentalized Maize Leaf Genome-Scale Model
    Results and discussion
    Effect of N Conditions on Biomass Components
    Development of the Second-Generation Maize Leaf Model
    Incorporation of Transcriptomic and Proteomic Data in the Model
    Flux Range Variations among Conditions
    Comparison of Model Predictions with Metabolomic Data
    Conclusion
    Materials and methods
    Plant Material
    Yield Components Analysis
    RNA and DNA Preparation
    Gene Expression Profiles Using Maize Complementary DNA Microarrays
    Total Protein Extraction, Solubilization, and Quantification
    Two-Dimensional Electrophoresis, Gel Staining, and Image Analysis
    Protein Identification by Liquid Chromatography-Tandem Mass Spectrometry
    Metabolite Extraction and Analyses
    Metabolome Analysis
    Model Development and Curation
    Incorporation of Transcriptomic, Proteomic, and Metabolomic Data
    Number of gene transcripts, proteins, and metabolites that vary significantly
    Acknowledgments
    Figures and tables

Bibliography


List of Figures

Figure 1-1 Reaction molecular graphs of 2-hydroxyisophthalate decarboxylase (Figure 1A) and salicylate decarboxylase (Figure 1B):

Figure 1-2: Calculation of $P^{\lambda}$ and $Z^{\lambda}$ ($\lambda \in 1, 2, 3$) for the metabolite 2-hydroxyisophthalate.

Figure 1-3 Reaction between moieties:

Figure 1-4 Synthesis of phenol:

Figure 1-5 phenylephrine synthesis.

Figure 1-6 synthesis of naproxen from the precursor guaiacol.

Figure 1-7 synthesis of naproxen from the precursor methyl o-toluate

Figure 1-8 synthesis of epinine from

Figure 1-9 epinine synthesis from phenylalanine

Figure 1-10 N-methyl-l-aspartate synthesis

Figure 1-11 benzo[a]pyrene oxidative degradation.

Figure 2-1 CLCA workflow.

Figure 2-2 Canonical labeling with stereodescriptors.

Figure 2-3 Addition of artificial vertices.

Figure 2-4 Addition of directional edges.

Figure 2-5 CLCA using the auxiliary graph data structure.

Figure 2-6 Alternate solutions.

Figure 2-7 Reaction atom mapping.

Figure 2-8 Reaction similarity.

Figure 2-9 Comparison of the two branched chain amino acid degradation pathways

Figure 2-10 Bond changes per reaction statistics for the E. coli iAF1260 metabolic model.

Figure 2-11 Example of possibly incorrect mapping from iAF1260.

Figure 2-12 Alternate solutions due to equivalent or symmetric carbon groups

Figure 2-13 CLCA incomplete mapping.

Figure 2-14 Comparison with MetaCyc MWED.

Figure 2-15 ReactionMap solution for acetyl-CoA acyltransferase.

Figure 2-16 CLCA solution for acetyl-CoA acyltransferase.

Figure 2-17 Incorrect CLCA solution for a CLASSIFY dataset reaction.

Figure 2-18 Correct CLCA solution for a CLASSIFY dataset reaction.

Figure 3-1 Typical incompatibilities and inconsistencies in genome-scale models and databases.

Figure 3-2 Flowchart outlining the construction of MetRxn.

Figure 3-3 Various levels of structural information were available for models (main) and databases (inset).

Figure 3-4 Comparison of metabolite and reaction overlaps for C. acetobutylicum and C. thermocellum, and B. subtilis.

Figure 3-5 Pathways from pyruvate to 1-butanol.

Figure 4-1 Weight percentage of biomass components.

Figure 4-2 Number of metabolic and transport reactions distributed between compartments in the bundle sheath and mesophyll cell types.

Figure 4-3 Number of metabolites in each condition that statistically varied from the N+ WT condition at the vegetative stage.

Figure 4-4 Effect of omics-based regulation on the flux-sum prediction compared with the experimental trend in metabolite concentration.

Figure 4-5 Model development and curation schematic.


List of Tables

Table 1-1: $P$ and $Z$ matrices populated by rePrime

Table 1-2 The $C$ molecular signature matrix

Table 1-3 The $T$ reaction rules matrix

Table 1-4 The reaction-rule template table

Table 1-5 The reaction templates for naproxen synthesis.

Table 1-6: The reaction-rule templates for PAHs degradation

Table 2-1 Characterization and prime number assignment for figure 4.

Table 3-1 Representation of glucose-6-phosphate dehydrogenase in selected metabolic models

Table 4-1 Experimental content of classes of metabolites in different conditions

Table 4-2 Number of reactions after each model creation and curation step

Table 4-3 Summary of reactions that affect biomass synthesis

Table 4-4 Number of gene transcripts, proteins, and metabolites that vary significantly


Acknowledgements

I would like to thank my advisor Dr. Costas Maranas for all his guidance and wisdom. The idea for the current project, the perspective of the research and the content of this dissertation have, in large part, been possible thanks to him. His track record of producing innovative work, his broad and deep scientific knowledge combined with his scholarly way of student advising has been exceptionally inspiring, which is why I am, indeed, indebted to him for teaching me how to perform high-quality research. In addition, incessant emphasis on improving presentation and communication skills as well as technical writing has played a major role towards achieving my academic and professional career goals. I would also like to thank Dr. Ross Hardison, Dr. Reka Albert, Dr. Andrew D. Patterson, and Dr. Kyle Bishop for agreeing to participate on my doctoral committee. Also, I would like to extend special thanks to past and present members of Costas Maranas’ group especially Dr. Rajib Saha for his invaluable guidance as well as Dr. Ali Khodayari, Chiam Yu Ng, Lin Wang, Ratul Chowdhury, Dr. Anthony Burgard, Maggie Simons, Satyakam Dash, Anupam Chowdhury, and Sarat Ram for insightful discussions. I would like to convey my sincere gratitude to my wife Yamini whose love, encouragement and persistent confidence in me helped me to stick to my goal. Last but not least, I thank my parents Indira and Kumar for their sacrifice, love and continuous support throughout my life and in all my pursuits.


Pathway synthesis using de novo steps through uncharted biochemical spaces

Akhil Kumar1, Costas D. Maranas2

1The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States

2Department of Chemical Engineering, Pennsylvania State University, University Park, Pennsylvania 16802, United States

Computational retrosynthesis tools provide a systematic way to traverse production routes to high-value chemicals. Existing tools generally target linear paths from a source to a sink metabolite using known enzymatic functions, sometimes supplemented with de novo steps. Generally, important considerations such as reaction rules, network size, the complexity of pathway topology, mass-conservation, cofactor balance, thermodynamic feasibility, microbial chassis selection, yield, and cost have not been placed within the same decision framework and are largely dealt with in an a posteriori fashion. The computational procedure we present here designs bioconversion routes while simultaneously considering all aforementioned design criteria. First, we track and codify as rules all reaction centers using a novel prime factorization based encoding technique (rePrime). The reaction rules are then used by the pathway design algorithm (novoStoic), which traces mass-balanced bioconversion strategies. rePrime is a recursive algorithmic procedure for cataloging reaction rules, whereas novoStoic is a search algorithm posed as a mixed-integer linear programming (MILP) optimization formulation. We demonstrate the use of novoStoic in pathway elucidation by predicting intermediates of ill-defined pathways and by designing novel synthetic routes from aromatic precursors for non-natural molecules such as phenylephrine, naproxen, epinine, and N-methyl-aspartate.

1. Introduction

Advances in genetic engineering capabilities have expanded the range of chemicals synthesized by microbial platforms to non-natural synthetic molecules such as drugs and various pharmaceutical ingredients. Synthetic biology tools such as CRISPR-cas6 for genome editing, MAGE7 for site-directed mutagenesis, and the modular assembly of genetic components (promoters, regulators, etc.) as circuits5 enable comprehensive rewiring of the host metabolic network for optimal production of high-value chemicals. Combining natural enzymes with heterologous genes using recombinant DNA technologies8 allows for de novo pathway reconstruction. Recent successes in de novo enzyme design, for example the design of a non-natural formolase9 enzyme, expand the scope of enzymatic functions that can be called upon in the construction of synthetic de novo metabolic pathways. Computational retrosynthetic tools can be used to guide the recruitment of both native and de novo enzymatic functions to assemble pathways towards targeted chemicals. This is an area of research with significant prior work.

Network based path finding methods such as PathComp10, PathWay Hunter11, MetaRoute12, DESHARKY13, FMM14, Rahnuma15, MRSD16, Metabolitinker17, and RouteSearch18 identify linear pathways from a single source to one target molecule. These methods rely on heuristics based on substrate-product similarity, atom transitions or substrate-product reaction co-occurrence frequency to reduce carbon loss while designing the path from source to target. In contrast to linear pathfinding methods, network optimization based approaches such as CFP19, k-shortest EFM20, and minRxn21 can incorporate non-injective (i.e., not necessarily a one-to-one mapping between reactants and products) stoichiometry and directly model carbon flow as well as cofactor balancing. All network based path finding methods require literature extracted information available in biochemical databases such as MetRxn22, KEGG2, BRENDA1, MetaCyc3, etc.

Prediction methods, on the other hand, expand upon the uncovered knowledge space by suggesting putative reactant combinations plausible under the tenets of organic chemistry. Specific to principles observed in biochemistry, atom connectivity changes or molecule fingerprint changes between substrate and product have been encoded as reaction rule operators in formats such as BEM23, RDM24, and SMIRKS25. Pathway prediction techniques such as BNICE26, XTMS27, UM-PPS28, PathPred29, Route Designer30 and gem-Path31 apply reaction rule operators to a single molecular target iteratively in a retrosynthetic fashion so as to identify a bioconversion from a single source. The traversal strategy invoked by such retrosynthesis algorithms prunes the vast combinatorial space of putative transformations by evaluating free-energy change and substrate similarity metrics at each iteration. However, after each step, the trade-off between carbon yield and energy (ATP and NADH requirements), and the thermodynamic feasibility of the main carbon conversion path, remain unexplored. Instead, tools such as tFBA32 and EFM33 are used in a posteriori steps to assess energy and carbon efficiency. Computational tools such as optStoic/minRxn/minFlux21 can be used to design mass and energy balanced pathways. However, they are limited to using reactions present in existing biochemical databases and metabolic models. Therefore, invoking novel molecules as intermediates using hypothetical reactions while maintaining a mass and energy balanced pathway remains elusive.

Protein engineering techniques have already demonstrated the feasibility of expanding the substrate range of existing enzymes. For example, the Lactobacillus kefir ketoreductase enzyme (KRED) has been tuned towards different non-native molecules34–36. Note that engineered KREDs are used in the production of intermediates for the pharmaceuticals atorvastatin, montelukast, duloxetine, phenylephrine, ezetimibe, and crizotinib35. Examples of other successful protein engineering efforts involving de novo enzymes such as the KEMP eliminase (KE59)37 and Formolase (FLS)9 allude to the increasingly important role of novel biocatalysts in traditional chemical manufacturing. Directed evolution or sophisticated in silico protein modeling tools such as IPRO38 and Rosetta39, complemented with modern reaction rule-based design procedures, can be used to change enzyme substrate specificity systematically.

Existing rule-based techniques can only fortuitously identify biotransformations towards the target molecule that meet a number of performance criteria. novoStoic identifies “by design” mass-balanced, free-energy feasible, high yield, and economically viable biotransformations from the substrate(s) to natural and synthetic product(s). In addition to designing biosynthetic routes, novoStoic can be used to elucidate the strategies to biodegrade xenobiotics. By invoking a broad activity towards a topologically diverse set of substrates, biodegradation possibilities for xenobiotics can be systematically assessed using novoStoic. In particular, we demonstrate the use of novoStoic in designing novel synthetic bioproduction routes for valuable pharmaceuticals from cost effective aromatic precursors.

First, we expanded the MetRxn (www.maranasgroup.com/metrxn) repository40 with a new dataset of elementally balanced reaction operators using the automated, CLCA based reaction rule extraction procedure termed rePrime. For each reaction, the rePrime procedure identifies and captures as reaction rules the topological changes underpinning the substrate graph to product graph conversion. The reaction rule is a vector of molecular signatures encoded as prime numbers that capture the location of active reaction centers affected by the conversion of substrate to product. The reaction rules are then used by an MILP procedure, novoStoic, that identifies a mass balanced biochemical network that converts $\mathit{source} \rightarrow \mathit{target}$. The novoStoic MILP formulation combines a number of constraints related to network size (i.e. the number of reactions and reaction rules), free energy change, reaction categorization and chassis selection for optimal network design.

We address a number of conversions from aromatic precursors to drug molecules and also present biodegradation strategies for benzo[a]pyrene. In the synthesis studies, the selected objective function maximizes the difference between source and target molecule cost while simultaneously imposing constraints related to network size, number of heterologous reactions, free energy change and reaction categorization. Both precursors and possibly co-substrates and co-products required to satisfy the mass balance constraints can serve as optimization variables. In biodegradation studies, novoStoic identifies the minimal network that biodegrades source(s) to target(s) metabolites with additional restriction on microbial system and reaction categories.


2. Methods

Description of the data and parameters required by rePrime and novoStoic:

Reactions and metabolites from KEGG, BRENDA, MetaCyc, Rhea, HMDB, ECMDB, ChEBI and ChEMBL and over 112 metabolic models were aggregated and standardized using the MetRxn curation workflow. The standardized dataset contains over 44,784 unique elementally balanced reactions and over a million unique metabolites, encoded within sets $J$ and $I$, respectively. The group contribution based formation energies of metabolites were calculated using eQuilibrator41 (standard cellular conditions, pH 7.0 and ionic strength 0.1 M) and stored as $\Delta_f G_i'^{\circ}$. Each metabolite was converted into its corresponding molecular graph, wherein each atom is represented by a node and is indexed uniquely in the set $N_i$. Adjacent nodes (i.e. nodes that correspond to bonded atoms) are indexed in $A_{ni}$. Using the ChemAxon42 Java API, each node ($n \in N_i$) in each ($\forall i \in I$) metabolic graph is annotated with the “atom feature string” and stored as $K_{ni}$. The “atom feature string” is a numeric fixed length string, constructed by concatenating the atom's number of non-hydrogen connections, number of non-hydrogen bonds, atomic number and number of hydrogen bonds (see figure 1a)43. In addition, pathway/subsystem (set $P$) ↔ reaction and organism/genus/taxa (set $B$) ↔ reaction annotations were downloaded from KEGG and stored as sets $J_P$ and $J_B$.


Developing a database of reaction rules using rePrime:

Reactions and compounds are often represented as molecular graphs in various chemical information systems. Molecular graphs, wherein atoms are represented as vertices and bonds are represented as edges, provide a convenient data structure for many elementary graph operations. Graph edit operations, i.e. the addition and deletion of edges that transform the molecular graphs of reactants into product molecular graphs, simulate the bond changes between reactants and products. Therefore, by relating bond changes to the reactant-product topology in cataloged reactions, we identify generalized reaction mechanism primitives. After analyzing ~44,784 (non-transport) MetRxn reactions using rePrime, reaction mechanism heuristics or reaction rules are extracted and codified as molecular graph edit operations. The database of molecular graph edit operations, alternatively known as reaction rule operators, is then used within the novoStoic procedure to predict the product molecules from reactant molecules.
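The idea of a reaction as a set of graph edit operations can be illustrated with a minimal Python sketch. It is not the rePrime encoding itself: it only diffs bond orders between an atom-mapped reactant and product. The atom labels (`C1`, `C7`, `O8`, `O9`), the restriction to heavy atoms, and the helper names `bond` and `bond_changes` are assumptions made for illustration, loosely modeled on the carboxyl group that leaves as carbon dioxide in a decarboxylation.

```python
def bond(atom_a, atom_b):
    """Undirected edge between two atom labels (hypothetical labels)."""
    return frozenset((atom_a, atom_b))


def bond_changes(reactant_bonds, product_bonds):
    """Return {edge: change in bond order} for every edge that differs."""
    changes = {}
    for edge in set(reactant_bonds) | set(product_bonds):
        delta = product_bonds.get(edge, 0) - reactant_bonds.get(edge, 0)
        if delta != 0:
            changes[edge] = delta
    return changes


# Heavy-atom neighborhood of a carboxyl group attached to ring carbon C1.
reactant = {bond("C1", "C7"): 1, bond("C7", "O8"): 2, bond("C7", "O9"): 1}
# After decarboxylation the same atoms form carbon dioxide (O=C=O).
product = {bond("C7", "O8"): 2, bond("C7", "O9"): 2}

changes = bond_changes(reactant, product)
for edge, delta in changes.items():
    print(sorted(edge), delta)
```

A negative entry corresponds to an edge deletion (or a reduced bond order) and a positive entry to an edge addition, which is the sense in which edit operations simulate bond changes.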

We developed the rePrime scheme to be consistent with the algorithmic necessities (i.e. elementally balanced reaction rules) of the novoStoic pathway prediction procedure. Principles from the existing reaction rule operator schemes BEM, RDM, and SMIRKS have been incorporated into rePrime. BEM tracks bond changes as a summation operator. RDM codifies the topological changes in the neighborhood of the reaction-center, as changes between chemical fingerprints. The SMIRKS protocol leverages prime factorization to codify reaction transformations as canonical strings.

In the next section, we introduce the novoStoic and rePrime procedures using a toy example that involves only two decarboxylase reactions (shown in Figure 1), 2-hydroxyisophthalate decarboxylase (2HIPD) and salicylate decarboxylase (SLD). First, we apply rePrime and generate the parameters for novoStoic: the molecular signatures $C$ for the metabolites 2-hydroxyisophthalate (2hipa), salicylate (sal), carbon dioxide (co2) and phenol (phnl), and the reaction signatures $T$ for reactions 2HIPD and SLD. The parameters $C$, $T$, and the stoichiometric matrix $S$ are then used by novoStoic to design a mass-balanced biosynthesis pathway for phenol (phnl).


Sets and indices

$\lambda$: indexes the circular-fingerprint/molecular-signature radius

$i \in I$: $I$ is the set of metabolites in the database

$j \in J$: $J$ is the set of reactions

$n \in N_i$: $N_i$ is the set of nodes in the molecular graph of metabolite $i$

$n^{*} \in A_{ni}$: $A_{ni}$ is the set of adjacent nodes of node $n$ for metabolite $i$, where $A_{ni} \subset N_i$

$\mathbb{P}, \mathbb{N}$: $\mathbb{P}$ is the set of prime numbers, and $\mathbb{N}$ is the set of natural numbers

Parameters and data

$K_{ni}$: stores, for each metabolite $i$, the atom features for each node $n \in N_i$

$Z^{\lambda}_{ni}$: stores the product of primes at radius $\lambda$ for node $n$ in metabolite $i$

$P^{\lambda}_{ni} \in \mathbb{P}$: stores the prime number assigned to each node $n$ in metabolite $i$ at radius $\lambda$

$C^{\lambda}_{mi}$: Cardinality of moieties $m$ in each metabolite $i \in I$

$S^{k}_{ij}$: Stoichiometric matrix that describes the coefficient of metabolite $i \in I$ in reaction $j$

$T^{\lambda}_{mj}$: Matrix that captures the change in number of moiety $m$ in reaction $j$ between substrates and products as reaction rules (tenets)

Scalars

$\Lambda$: The maximum circular-fingerprint/molecular-signature radius


Functions

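$\boldsymbol{h}: K \mapsto \mathbb{P}$: $\boldsymbol{h}$ is an injective function that maps a unique element in $K$ to a unique prime

$\boldsymbol{g}: \mathbb{N} \mapsto \mathbb{P}$: $\boldsymbol{g}$ is an injective function that maps a unique integer to a unique prime

$\mathbb{N}^{m \times n'} \leftarrow \boldsymbol{f}: \mathbb{N}^{m \times n}$: $\boldsymbol{f}$ extracts the non-redundant columns into an $m \times n'$ matrix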

rePrime

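$\lambda = 1$ (1.1)

$P^{\lambda}_{ni} \leftarrow \boldsymbol{h}: K_{ni} \quad \forall n \in N_i,\ \forall i \in I$ (1.2)

while ($\lambda \le \Lambda$) { (1.3)

$Z^{\lambda+1}_{ni} \leftarrow \operatorname{argmax}\left( \left(P^{\lambda}_{ni}\right)^{2} \prod_{n^{*} \in A_{ni}} P^{\lambda}_{n^{*}i},\ \left(P^{\lambda}_{ni}\right)^{2} \right) \quad \forall n \in N_i,\ \forall i \in I$ (1.4)

$\lambda = \lambda + 1$ (1.5)

$P^{\lambda}_{ni} \leftarrow \boldsymbol{g}: Z^{\lambda}_{ni} \quad \forall n \in N_i,\ \forall i \in I$ (1.6)

}

$C^{\lambda}_{mi} \leftarrow \sum_{n \in 1..N_i} \left[ m = P^{\lambda}_{ni} \right] \quad \forall i \in I,\ \forall m \in M^{\lambda},\ \forall \lambda \in 1..\Lambda$ (1.7)

$T^{\lambda}_{mj} \leftarrow \sum_{i \in I^{k}} S_{ij}\, C^{\lambda}_{mi} \quad \forall j \in J,\ \forall m \in M^{\lambda},\ \forall \lambda \in 1..\Lambda$ (1.8)

$T^{\lambda}_{mr} \leftarrow \boldsymbol{f}: T^{\lambda}_{mj} \quad \forall \lambda \in 1..\Lambda$ (1.9)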


Using rePrime, we process two decarboxylase reactions (shown in Figure 1), 2-hydroxyisophthalate decarboxylase (2HIPD) and salicylate decarboxylase (SLD). Each node $n$ in each molecular graph $i$ is annotated with atom-features in a preprocessing step to generate the parameter $K_{ni}$. As shown in Figure 2A, the atom feature “2-3-06-1” for the node with index $n = 6$ of 2-hydroxyisophthalate (2hipa) encodes the presence of two non-hydrogen connections, three non-hydrogen bonds, the atomic number of carbon and one hydrogen bond. Table 1A shows the atom features for all the nodes of 2hipa, sal, phnl and co2.
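The construction of an atom feature string can be sketched in a few lines of Python. This is an illustrative sketch only: the source computes the four descriptors with the ChemAxon Java API, whereas here they are passed in by hand. The hyphen-separated layout and the two-digit atomic number are inferred from the single example “2-3-06-1”, and the function names are hypothetical.

```python
def atom_feature(heavy_neighbors, heavy_bond_order_sum, atomic_number, hydrogens):
    """Concatenate the four atom descriptors into one fixed-layout string.

    Assumed layout (inferred from the example "2-3-06-1"):
    connections - bonds - two-digit atomic number - hydrogens.
    """
    return f"{heavy_neighbors}-{heavy_bond_order_sum}-{atomic_number:02d}-{hydrogens}"


def parse_atom_feature(feature):
    """Split a feature string back into its four integer descriptors."""
    connections, bonds, atomic_number, hydrogens = (int(x) for x in feature.split("-"))
    return {
        "connections": connections,
        "bonds": bonds,
        "atomic_number": atomic_number,
        "hydrogens": hydrogens,
    }


# Aromatic CH carbon: two heavy neighbors, bond orders summing to three,
# atomic number 6, one attached hydrogen.
aromatic_ch = atom_feature(2, 3, 6, 1)
print(aromatic_ch)
print(parse_atom_feature(aromatic_ch))
```

Because every field keeps a fixed position, features with the same local chemistry produce identical strings, which is what allows them to be ranked and mapped to primes in the next step.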

The rePrime procedure initiates at $\lambda = 1$ (eq. 1.1), and assigns to each node the prime number corresponding to its feature (Figure 2B; step 1.2).

$P^{\lambda}_{ni} \leftarrow \boldsymbol{h}: K_{ni}, \quad \forall n \in N_i \text{ and } \forall i \in I$ (1.2)

Based on the lexical ordering of atom-features in $K$, the function $\boldsymbol{h}$ assigns the prime number corresponding to each feature's rank. In the example shown in the figure, nodes 3, 4, 8, 9 and 12, which carry the feature “3-4-06-0”, are assigned the 5th prime number ‘11’, because “3-4-06-0” is ranked 5th in the lexical order of the atom-features column (see Table 1A). Also, as shown in Table 1A, the parameter $P^{\lambda}_{ni}$ for sal ($n \in 16, 17 \text{ and } 22$) and phnl ($n \in 28$) is likewise assigned the prime ‘11’. Next, within the while loop (step 1.3), as long as the condition $\lambda \le \Lambda$ evaluates true, we execute the operations defined in steps (1.4), (1.5) and (1.6). For the toy example, $\Lambda$ is equal to 3.
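The rank-to-prime assignment performed by the function h can be sketched as follows. The list of five atom features below is an assumption: it is one plausible set for the toy metabolites, chosen so that “3-4-06-0” ranks 5th as the text states, and it is not copied from Table 1A. The helper names are hypothetical.

```python
def first_primes(count):
    """Return the first `count` prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def assign_rank_primes(values):
    """Map each distinct value to the prime matching its sorted rank."""
    distinct = sorted(set(values))
    return dict(zip(distinct, first_primes(len(distinct))))


# Hypothetical feature list (assumption, not taken from Table 1A).
features = ["3-4-06-0", "1-1-08-1", "2-3-06-1", "1-2-08-0", "2-4-06-0"]
prime_of = assign_rank_primes(features)
for feature, prime in prime_of.items():
    print(feature, prime)
```

Since distinct values receive distinct primes, the mapping is injective, which is the property the text requires of h. The same helper can stand in for the function g used in step (1.6) if it is applied to integer-products instead of feature strings.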

In step (1.4), $Z^{\lambda+1}_{ni}$ stores the integer-product for each node $n$ of molecular graph $i$. Figure 2D depicts the integer-product calculated for each node $n \in N_{\mathrm{2hipa}}$ using the expression below.

$Z^{\lambda+1}_{ni} \leftarrow \operatorname{argmax}\left( \left(P^{\lambda}_{ni}\right)^{2} \prod_{n^{*} \in A_{ni}} P^{\lambda}_{n^{*}i},\ \left(P^{\lambda}_{ni}\right)^{2} \right), \quad \forall n \in N_i \text{ and } \forall i \in I$ (1.4)

For example, for $\lambda = 1$ and $n = 3$, the adjacent nodes $n^{*} \in 1, 2 \text{ and } 4$ are assigned the prime numbers 3, 2 and 11 respectively. Since node 3 itself carries the prime 11, $Z^{2}_{3,\mathrm{2hipa}}$ calculated in step (1.4) equals $11^{2} \times (3 \times 2 \times 11) =$ ‘7986’. The integer-product ‘7986’ implicitly indicates, through its unique prime factors, the occurrence of a carboxyl group (-COOH) in the molecules 2-hydroxyisophthalate (2hipa) and salicylate (sal). The same integer-product ‘7986’ is calculated for nodes $n \in 9 \text{ and } 16$, as shown in the column $Z^{2}_{ni}$ in Table 1A.
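The integer-product of step (1.4) can be checked numerically with a short Python sketch. It uses the values quoted in the text (prime 11 for node 3, primes 3, 2 and 11 for its neighbors). One assumption is labeled in the code: the expression written as argmax in the source is implemented here as `max`, i.e. the larger of the two candidate values is kept. The function names are hypothetical.

```python
from collections import Counter
from math import prod


def integer_product(own_prime, neighbor_primes):
    """Square of the node's prime times the product of its neighbors' primes.

    Assumption: the source's argmax is read as taking the larger value,
    so a node without neighbors keeps the square of its own prime.
    """
    own_squared = own_prime ** 2
    return max(own_squared * prod(neighbor_primes), own_squared)


def prime_factors(value):
    """Factorize by trial division; returns {prime: exponent}."""
    factors = Counter()
    divisor = 2
    while divisor * divisor <= value:
        while value % divisor == 0:
            factors[divisor] += 1
            value //= divisor
        divisor += 1
    if value > 1:
        factors[value] += 1
    return factors


z = integer_product(11, [3, 2, 11])
print(z)
print(dict(prime_factors(z)))
```

Factoring the result recovers the neighborhood: the node's own prime appears with an exponent of at least two, and every neighbor contributes one additional factor, so equal integer-products imply equal local environments at that radius.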

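We next update $\lambda = \lambda + 1$, and assign a unique prime number $m \in 2..\mathbb{P}$ to each $Z^{\lambda}_{ni}$ by invoking (step 1.6) the injective (i.e. one-to-one mapping) function $\boldsymbol{g}: Z^{\lambda}$.

$P^{\lambda}_{ni} \leftarrow \boldsymbol{g}: Z^{\lambda}_{ni}, \quad \forall n \in N_i \text{ and } \forall i \in I$ (1.6)

Similar to step (1.2), the function $\boldsymbol{g}: Z^{\lambda}_{ni}$ returns the prime corresponding to the rank of $Z^{\lambda}_{ni}$ and stores it in $P^{\lambda}_{ni}$. Therefore $P^{2}_{3,\mathrm{2hipa}}$, $P^{2}_{9,\mathrm{2hipa}}$ and $P^{2}_{16,\mathrm{sal}}$ are assigned the 8th prime number, $m = 19$. For a given $\lambda$, $m \in 2..\mathbb{P}$ indexes a unique circular topology around an atom. For the metabolites of reactions 2HIPD and SLD (i.e. $\forall i \in$ 2-hydroxyisophthalate (2hipa), salicylate (sal), carbon dioxide (co2) and phenol (phnl)), Table (1) shows $P^{\lambda}_{i}$ and $Z^{\lambda}_{i}$ for each $\lambda \in 1..3$. All values of $P^{\lambda}$ are indexed in the sets $M^{\lambda}$.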


Upon the termination of the while loop (i.e. when $\lambda = 3$), in step (1.7) we calculate the cardinality of each moiety $m$ in each metabolite $i$.

$C^{\lambda}_{mi} \leftarrow \sum_{n \in 1..N_i} \left[ m = P^{\lambda}_{ni} \right], \quad \forall i \in I,\ \forall m \in M^{\lambda} \text{ and } \forall \lambda \in 1..\Lambda$ (1.7)

The Iverson bracket $[Q] = \begin{cases} 1 & \text{if } Q \text{ is true} \\ 0 & \text{otherwise} \end{cases}$ converts the logical proposition $Q$ to 1 (if true) and 0 (if false). Therefore, for the moieties $m$ indexed over the domain of prime numbers ($\forall m \in 2..\mathbb{P}$), a value of 1 is returned for each node $n \in N_i$ when $[m = P^{\lambda}_{ni}]$ is true. Step (1.7) is a counting operation. Table (1B) shows this distribution/cardinality of moieties $m$ in each metabolite.
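The counting operation of step (1.7) can be written out directly, as in the Python sketch below. The node-to-prime map for phenol is an assumption made for illustration: only the prime 11 at node 28 is quoted in the text, while the remaining node numbers and the primes 2 and 5 are hypothetical placeholders for the hydroxyl oxygen and the aromatic CH carbons.

```python
from collections import Counter


def moiety_cardinality(node_primes, moiety):
    """Sum of Iverson brackets [moiety == prime of node] over all nodes."""
    return sum(1 for prime in node_primes.values() if prime == moiety)


def moiety_counts(node_primes):
    """Cardinality of every moiety present in one metabolite."""
    return Counter(node_primes.values())


# Hypothetical radius-1 primes for the heavy atoms of phenol
# (only node 28 -> 11 is stated in the text; the rest are assumptions).
phenol_primes = {24: 5, 25: 5, 26: 5, 27: 5, 28: 11, 29: 2, 30: 5}

counts = moiety_counts(phenol_primes)
print(counts)
print(moiety_cardinality(phenol_primes, 5))
```

Each metabolite is thereby reduced to a sparse vector of moiety counts, one column of the matrix that the text denotes by C.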

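In step (1.8), we capture within $T$ the stoichiometric changes in the distribution of moieties $m$ between the reactants and the products ($\forall j \in J$ and $\forall \lambda \in 1, 2, 3$).

$T^{\lambda}_{mj} \leftarrow \sum_{i \in I^{k}} S_{ij}\, C^{\lambda}_{mi}, \quad \forall j \in J,\ \forall m \in M^{\lambda} \text{ and } \forall \lambda \in 1..\Lambda$ (1.8)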

As each $m$ represents a distinct topology (i.e. moiety) around an atom, $T^{\lambda}_{mj}$ captures the topological changes around the reaction center atoms that occur in the transformation of reactant to product. Table 1C shows the changes between reactant and product moieties for 2HIPD and SLD. At a particular $\lambda$, reaction-center participant moieties do not cancel out, and are therefore indicated by non-zero entries. Notice that all the entries in each column sum to zero, i.e. moiety balance. $T^{\lambda}_{j}$ captures the topological changes between reactant and product in an elementally balanced fashion, and therefore encodes the reaction templates for each reaction $j$. As shown in Figure 3, the $T$ matrix is analogous to the $S$ matrix, wherein $S_{ij}$ captures reactions between balanced metabolites, while $T_{mj}$ captures the reactions between moieties. At $\lambda = 1$, $T^{1}_{2HIPD}$ and $T^{1}_{SLD}$ are identical, and ($C^{1}_{2hipa} + T^{1}_{2HIPD} = C^{1}_{sal} + C^{1}_{co2}$), hence ($C^{1}_{2hipa} + T^{1}_{SLD} = C^{1}_{sal} + C^{1}_{co2}$) or ($C^{1}_{sal} + T^{1}_{2HIPD} = C^{1}_{phnl} + C^{1}_{co2}$). Therefore, in step (1.9), the function $\boldsymbol{f}: T^{\lambda}_{mj}$ removes the redundancy in $T^{\lambda}_{j}$ and assigns $r \in r_1, r_2 .. R^{\lambda}$ to a unique reaction template (e.g. $T^{1}_{2HIPD}$ and $T^{1}_{SLD}$ are now stored as $T^{1}_{r_1}$).

$T^{\lambda}_{mr} \leftarrow \boldsymbol{f}: T^{\lambda}_{mj}, \quad \forall \lambda \in 1..\Lambda$ (1.9)

Table 2 below shows the total number of unique moieties in $M^{\lambda}$ and reaction rules in $R^{\lambda}$ upon the termination of the rePrime procedure. $C_{mi}^{\lambda}$ and $T_{mr}^{\lambda}$ are parameters required by novoStoic.

$\lambda$    $|M^{\lambda}|$    $|R^{\lambda}|$
1      50       826
2      298      1929
3      1110     6043
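Steps (1.8) and (1.9) can be sketched in a few lines of Python: the template of a reaction is the stoichiometry-weighted sum of moiety counts, and reactions with identical templates collapse into one rule. The moiety labels `A`, `B`, `K`, `H` and `X` and their counts are hypothetical stand-ins chosen so that the two decarboxylations 2HIPD and SLD share a template, as described in the text; they are not the values of Table 1.

```python
def reaction_templates(S, C):
    """Step (1.8): T[j][m] = sum_i S[i][j] * C[i][m]; zero entries are dropped."""
    T = {}
    for i, row in S.items():
        for j, coeff in row.items():
            col = T.setdefault(j, {})
            for m, count in C[i].items():
                col[m] = col.get(m, 0) + coeff * count
    return {j: {m: v for m, v in col.items() if v != 0} for j, col in T.items()}

def unique_rules(T):
    """Step (1.9): map every reaction j to a rule id shared by identical templates."""
    rules, rule_of = {}, {}
    for j in sorted(T):
        key = frozenset(T[j].items())
        rule_of[j] = rules.setdefault(key, f"r{len(rules) + 1}")
    return rule_of

# Hypothetical moiety counts (illustrative labels, not rePrime primes).
C = {
    "2hipa": {"A": 3, "B": 2, "K": 2, "H": 1},
    "sal":   {"A": 4, "B": 1, "K": 1, "H": 1},
    "phnl":  {"A": 5, "H": 1},
    "co2":   {"X": 1},
}
# S[i][j]: negative for reactants, positive for products.
S = {
    "2hipa": {"2HIPD": -1},
    "sal":   {"2HIPD": 1, "SLD": -1},
    "phnl":  {"SLD": 1},
    "co2":   {"2HIPD": 1, "SLD": 1},
}

T = reaction_templates(S, C)
print(T["2HIPD"] == T["SLD"])  # True
print(unique_rules(T))         # {'2HIPD': 'r1', 'SLD': 'r1'}
```

In this toy case the entries of each template sum to zero, which mirrors the moiety balance noted above.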

In the next section we set up the novoStoic formulation by defining the sets, parameters, variables and constraints used in the novoStoic MILP (mixed integer linear programming) formulation.

novoStoic

Sets

$m \in M^{\lambda}$: $M^{\lambda}$ is the set of moieties (e.g. at radius $\lambda = 3$ there are 1110 unique CMS)

$i \in I$: $I$ is the set of all metabolites in the database (~ 1 million)

$j \in J$: $J$ is the set of reactions

$r \in R^{\lambda}$: $R^{\lambda}$ is the set of reaction rules (e.g. at radius $\lambda = 3$ there are 6043 unique reaction rules)

$b \in B$: $B$ is the set of organisms

$p \in P$: $P$ is the set of pathway annotations

$J_p, J_b \subset J$: $J_p$ is the set of reactions in pathway $p$. $J_b$ is the set of reactions in organism $b$

Parameters

$S_{ij}^{k}$: Stoichiometric matrix that describes the coefficient of metabolite $i \in I$ in reaction $j$

$C_{mi}^{\lambda}$: Cardinality/number of moieties $m$ in each metabolite $i \in I$

$T_{mr}^{\lambda}$: Matrix that describes the change in moieties $m$ in rule $r \in R^{\lambda}$

$Price_i$: Bulk price of metabolite $i$

$\Delta_f G_i^{\prime\circ}$: Formation energy of $i$ at cellular conditions (pH = 7.0 and ionic strength of 0.1 M), identified using eQuilibrator41

Scalars

$W^{htg}$, $W^{hyp}$: Cutoffs on the number of heterologous and hypothetical reactions, respectively

$W^{rxn}$: Maximum number of known reactions the network can have

$W^{path}$: Maximum number of non-member reactions $j$ or rules $r$ that can be active for a single pathway $p$

$\Delta G_f^{\circ min}$: Threshold for free energy change under standard cellular conditions for the network (adjusted for pH = 7.0 and ionic strength of 0.1 M)

$\mathbb{M}$: big M

Variables

$v_j$: Flux of each reaction $j$

$u_r$: Coefficient for each rule $r$; a non-zero value indicates a novel substrate interaction and a transformation guided by the reaction rule $r$

$x_i$: Non-zero values indicate that metabolite $i \in I$ participates in both hypothetical and cataloged reactions

$x_i^{EX}$: Network exchange indicators; non-zero values indicate active uptake or export of metabolite $i \in I$

Binary variables

$y_j^{rxn}$: Binary variable to enable/disable the flux of each reaction $j$

$y_r^{rule}$: Binary variable to enable/disable the flux of each rule $r$

$y_b^{org}$: 1 indicates that most of the reactions $j$ are members of organism $b$

$y_p^{path}$: 1 indicates that most of the reactions $j$ are members of pathway $p$

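novoStoic (MILP formulation for synthesis/degradation network design)

$$\text{maximize } \phi$$

subject to

\begin{align}
x_{target}^{EX} &\geq 1 \tag{1.1} \\
-x_{source}^{EX} &\geq 1 \tag{1.2} \\
\sum_{j \in J} S_{ij} v_j - x_i &= 0, \quad \forall i \in I \tag{2} \\
\sum_{r \in R} T_{mr} u_r + \sum_{i \in I} C_{mi} x_i - \sum_{i \in I} C_{mi} x_i^{EX} &= 0, \quad \forall m \in M \tag{3} \\
y_j^{rxn} LB_j \leq v_j &\leq y_j^{rxn} UB_j, \quad \forall j \in J \tag{4} \\
\sum_{j \in J} y_j^{rxn} &\leq W^{rxn} \tag{5} \\
y_r^{rule} LB_r \leq u_r &\leq y_r^{rule} UB_r, \quad \forall r \in R \tag{6} \\
\sum_{r \in R} y_r^{rule} &\leq W^{hyp} \tag{7} \\
\sum_{j \notin J_b} y_j^{rxn} &\leq y_b^{org} W^{htg} + (1 - y_b^{org})\, \mathbb{M}, \quad \forall b \in B \tag{8} \\
\sum_{b \in B} y_b^{org} &= 1 \tag{9} \\
\sum_{j \notin J_p} y_j^{rxn} &\leq y_p^{path} W^{path} + (1 - y_p^{path})\, \mathbb{M}, \quad \forall p \in P \tag{10} \\
\sum_{p \in P} y_p^{path} &= 1 \tag{11} \\
\sum_{i \in I} \Delta_f G_i^{\prime\circ} x_i^{EX} &\leq \Delta G_f^{\circ min} \tag{12}
\end{align}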

To describe a target molecule, we use constraint (1.1) and we set the objective as

$$\phi = \sum_{i} Price_i \times x_i^{EX}$$

The objective function maximizes the difference in cost between the target and the precursors.

The economic values of each pharmaceutical and its precursors were collected from various chemical suppliers and stored as the parameter $Price_i$. Based on the per mole economic cost of the target, novoStoic designs a cost-effective network within the constraints of network size, free energy change and reaction categories (pathways/subsystems), and simultaneously identifies a suitable chassis organism for engineering. Constraint (1.2) forces novoStoic to design pathways to the target from predetermined precursors.

Constraints (2) and (3) are central to the concept of combining both known and hypothetical reactions within a single mass-balanced framework. Constraint (2) enforces the mass balance on all $v_j$ for each metabolite $i$, and constraint (3) enforces the atom/moiety balance on variables $u_r$ for each moiety $m$.

$$\sum_{j \in J} S_{ij} v_j - x_i = 0, \quad \forall i \in I \tag{2}$$

$$\sum_{r \in R} T_{mr} u_r + \sum_{i \in I} C_{mi} x_i - \sum_{i \in I} C_{mi} x_i^{EX} = 0, \quad \forall m \in M \tag{3}$$

The variables $x_i$ are constrained in both equations. A non-zero value of the 'mediator' variable $x_i$ indicates that metabolite $i$ participates in both the known and the hypothetical (reaction-rule based) sections of the network. Notice the opposite signs on $x_i$ between the two constraints. A positive value for $x_i$ indicates that metabolite $i$ is exported by the 'known' reaction network. The moiety balance constraints in (3) then force at least one of the two variables $u_r$ or $x_i^{EX}$ to take on a non-zero value. If $x_i \neq x_i^{EX}$, some fraction of metabolite $i$ is imported by the reaction-rule network. In principle, $x_i$ mediates the cross-talk between the hypothetical network and the known reaction network. Constraints (4) and (5) control the number of known reactions allowed into the network design. Constraints (6) and (7) control the number of reaction rules allowed into the network design. In Figure 4, using the rePrime generated parameters for $\lambda = 1$, we expand the closed form of equations for constraints (1.1), (2) to (7) and the objective function, and show the possible values the various variables would take on in our toy example. We assume hypothetical values for $Price_i$, with $W^{rxn} = 2$, $W^{hyp} = 1$, $UB_j, UB_r = 1$ and $LB_j, LB_r = -1$. Constraints (8) to (12) were not active for the toy example.
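How the mediator variable $x_i$ ties constraints (2) and (3) together can be checked numerically on a small case. The sketch below is not the Figure 4 example: the metabolites `a`, `b`, `c`, the moieties `m1`, `m2`, the known reaction `j1` (a → b) and the rule `r1` (b → c) are all hypothetical. It evaluates the left-hand sides of (2) and (3) for a candidate solution.

```python
def balance_residuals(S, T, C, v, u, x, x_ex):
    """Left-hand sides of constraints (2) and (3); all zeros means both hold."""
    mets = list(C)
    moieties = sorted({m for counts in C.values() for m in counts})
    res2 = {
        i: sum(S.get(i, {}).get(j, 0) * v[j] for j in v) - x.get(i, 0)
        for i in mets
    }
    res3 = {}
    for m in moieties:
        rule_part = sum(T[r].get(m, 0) * u[r] for r in u)
        known_part = sum(C[i].get(m, 0) * x.get(i, 0) for i in mets)
        exch_part = sum(C[i].get(m, 0) * x_ex.get(i, 0) for i in mets)
        res3[m] = rule_part + known_part - exch_part
    return res2, res3

# Hypothetical toy data.
C = {"a": {"m1": 2}, "b": {"m1": 1, "m2": 1}, "c": {"m2": 2}}
S = {"a": {"j1": -1}, "b": {"j1": 1}}   # known reaction j1: a -> b
T = {"r1": {"m1": -1, "m2": 1}}         # rule r1: b -> c, i.e. C_c - C_b
v = {"j1": 1}
u = {"r1": 1}
x = {"a": -1, "b": 1}                   # b is handed from the known to the rule network
x_ex = {"a": -1, "c": 1}                # take up a, export c

res2, res3 = balance_residuals(S, T, C, v, u, x, x_ex)
print(res2)  # {'a': 0, 'b': 0, 'c': 0}
print(res3)  # {'m1': 0, 'm2': 0}
```

Here `x["b"]` is 1 while `x_ex` has no entry for `b`, which corresponds to the case $x_i \neq x_i^{EX}$ discussed above: the intermediate is produced by the known reaction and consumed by the rule. Setting `u["r1"]` to 0 leaves a non-zero residual in (3).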

Constraints (8) to (12) help to reduce the solution space further. Constraints (8) and (9) force the solver to identify the organism $b$ that would require the fewest heterologous reactions (controlled by $W^{htg}$) for target synthesis. Constraints (10) and (11) ensure that the reactions and rules belong to common categories. The categories defined by pathway and subsystem annotations in databases such as KEGG, MetaCyc and BRENDA are based on manual annotation by experts and build on the observation that certain sequences of chemical transformations are conserved across various species and taxa.
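The big-M logic of constraints (8) and (9) can be illustrated without a solver by evaluating the inequality for a fixed set of active reactions. This is a minimal sketch: the host names, reaction ids and the helper functions are hypothetical, and a real MILP solver would choose the active reactions and $y_b^{org}$ jointly.

```python
def heterologous_count(active, native):
    """Number of active reactions that are not native to the host (j not in J_b)."""
    return len(set(active) - set(native))

def constraints_8_9_hold(active, J, y_org, W_htg, big_m):
    """Evaluate constraint (9) on y_org and constraint (8) for every organism b."""
    if sum(y_org.values()) != 1:
        return False
    return all(
        heterologous_count(active, J[b]) <= y_org[b] * W_htg + (1 - y_org[b]) * big_m
        for b in J
    )

def feasible_hosts(active, J, W_htg):
    """Organisms that could be selected (y_b = 1) without violating (8)."""
    return sorted(b for b in J if heterologous_count(active, J[b]) <= W_htg)

# Hypothetical data: four active reactions, two candidate hosts.
active = {"j1", "j2", "j3", "j4"}
J = {"host_A": {"j1", "j2", "j3"}, "host_B": {"j1"}}
W_htg = 1
big_m = len(active)  # large enough to relax (8) for unselected hosts

print(feasible_hosts(active, J, W_htg))                                          # ['host_A']
print(constraints_8_9_hold(active, J, {"host_A": 1, "host_B": 0}, W_htg, big_m))  # True
print(constraints_8_9_hold(active, J, {"host_A": 0, "host_B": 1}, W_htg, big_m))  # False
```

For an unselected host the right-hand side becomes $\mathbb{M}$, so the inequality is never binding; only the selected host has to keep its heterologous count within $W^{htg}$.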

Genetic loci (gene clusters) and genetic controls related to expression and regulation have also been factored in by the pathway and subsystem annotators44. By forcing such categorization into our pathway design, we could possibly avoid the auxiliary genetic engineering steps needed to implement the synthetic network45. In addition, it forces our design to emulate the reaction permutations observed in biochemistry, e.g. in order to replace a hydroxyl group with an amino group, an intermediate oxidoreductase reaction is required to form a keto group. As shown in the naproxen synthesis example, all the reactions from naphthol to naproxen follow the same reaction scheme that converts indole to indole 3-acetate in the tryptophan biosynthesis pathways. Constraint (12) forces the overall conversion to have a favorable free energy change. The parameters needed for this constraint ($\Delta_f G_i^{\prime\circ}$) were calculated using eQuilibrator41.
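Both constraint (12) and the synthesis objective are linear in the exchange variables $x_i^{EX}$, so they can be evaluated for any candidate exchange vector with two weighted sums. The sketch below uses made-up prices and formation energies for three placeholder metabolites; none of the numbers come from eQuilibrator or from supplier data.

```python
def synthesis_objective(price, x_ex):
    """phi = sum_i Price_i * x_i^EX (exports positive, uptakes negative)."""
    return sum(price[i] * x_ex[i] for i in x_ex)

def overall_free_energy(dfG, x_ex):
    """Left-hand side of constraint (12): sum_i dfG_i * x_i^EX."""
    return sum(dfG[i] * x_ex[i] for i in x_ex)

# Hypothetical values for illustration only.
x_ex = {"precursor": -1, "target": 1, "byproduct": 1}
price = {"precursor": 2.0, "target": 10.0, "byproduct": 0.5}
dfG = {"precursor": -50, "target": -40, "byproduct": -30}
dG_min = 0  # assumed threshold

phi = synthesis_objective(price, x_ex)
dG = overall_free_energy(dfG, x_ex)
print(phi)           # 8.5
print(dG)            # -20
print(dG <= dG_min)  # True
```

A design whose weighted sum of formation energies exceeds the threshold would be excluded by constraint (12) regardless of its objective value.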

To identify the degradation pathway for a molecule, we activate constraints (1.1) and (1.2) for the known source and target, and we set the objective as

$$\phi = -1 \times \left( \sum_{j \in J} y_j^{rxn} + \sum_{r} y_r^{rule} \right)$$

The objective function identifies the minimal set of reactions needed to degrade a source molecule to the known target molecule. All other constraints (2 - 11) were considered in our formulation for the degradation studies. Similar to the synthesis formulation, non-zero values of $x_i^{EX}$ indicate active uptake/export of co-substrates and co-products $i$. Non-zero values of $v_j$ and $u_r$ indicate the participation of the reactions $j$ and rules $r$ in our network. Similarly, a positive $x_i^{EX}$ indicates a co-product of the target. Negative values of $x_i^{EX}$ indicate the precursors $i$.

3. Results

Phenylephrine synthesis:

The non-natural molecule Phenylephrine is a member of the phenylethanolamines class. It mimics the action of such as and . The restrictions imposed by the Combat Methamphetamine Epidemic Act of 2005 have forced many pharmaceutical manufacturers to replace the standard sympathomimetic Pseudoephedrine (a substrate for illicit methamphetamine manufacture) with Phenylephrine as the active ingredient in nasal sprays. Traditionally, Phenylephrine is produced in a catalytic process involving the reduction of the aromatic substrate m-hydroxybenzaldehyde. Figure 5 illustrates three pathways identified by novoStoic, each starting with a different aromatic precursor. The pathways involve the co-utilization of (i) catechol and methyl salicylate, (ii) phenylalanine and methyl salicylate, and (iii) homovanillate as the only substrate.

Pathway 2a is an example of a thermodynamically infeasible design. It is a solution typically predicted by substrate-similarity-based retrosynthesis tools (single source → single target). Such algorithms would first choose a likely substrate as the target feedstock and then proceed to successively prune the retrosynthesis reaction network using similarity-based filters on each backward reaction until the shortest (one substrate to one product) linear route is found. The overall profit found for this pathway is $\phi = 14.12 \times Price_{phep}^{mol}$. However, the free energy change in the direction of bioconversion towards phep is positive, and the pathway is thus thermodynamically infeasible. The overall conversion for pathway 2a is:

$$\mathrm{hmv} + \mathrm{nh3} \xrightarrow{+127\ \mathrm{kcal}} \mathrm{phep} + \mathrm{o2}$$

Pathway 2b involves the conversion of phenylalanine (phal) and methyl salicylate (msal) to serine, acetate, and phenylephrine (phep). In the first half of the 14 reaction cascade (steps 1-9), the salicylate produced after the demethylation of msal is converted to acetate and pyruvate via the benzoate degradation pathway. Pyruvate is then aminated by serine hydratase to produce L-serine. The degradation of msal to serine and acetaldehyde also produces an S-adenosyl-L-methionine (SAM-e), required as a substrate by the N-methylation reaction in the second half of the network. In addition, the NADH-producing oxidation of acetaldehyde to acetate by an oxidoreductase keeps the net cofactor balance at zero. In the second half, the phal to phep conversion starts with the reduction of phenylalanine by an isozyme/homolog of phenylalaninase. Phenylalaninase requires tetrahydrobiopterin as a cofactor and hydroxylates phal at the 4th phenol position to produce p-tyrosine. However, to produce m-tyrosine, a hydroxylation at the 3rd phenol position is suggested (step 10). m-tyrosine has been identified in a number of bacterial species, humans, and plants. Studies on mammalian phenylalaninase indicate high regioselectivity in hydroxylation at the 4th position and attribute the formation of m-tyrosine to non-enzymatic hydroxylation at the 3rd position of phal under oxidative stress46–48. Pacidamycin studies focusing on bacterial phenylalaninase have indicated the presence of the phenylalaninase homolog in many Streptomyces species with regiospecific hydroxylation activity at the 3rd phenol position49. The phal to m-tyrosine reaction suggested by novoStoic is predicted as a secondary activity of the phenylalanine hydroxylase homolog PacX from Streptomyces coeruleorubidus. This reaction (step 10) is not present in any of the metabolic databases or metabolic models and is predicted de novo by novoStoic. Step 11 involves the conversion of m-tyrosine to m-tyramine by decarboxylation. The BRENDA entry for reaction “4.1.1.28” suggests the presence of m-tyrosine decarboxylase in mammals. The conversion of m-tyramine to phep (steps 12, 13 and 15, 16) in the remainder of the network is identical to the subnetwork of pathway 2a, with the same N-methyltransferase and oxidoreductase reactions. With msal contributing only one carbon as a methyl group towards phep synthesis, we can, therefore, consider only the second half (steps 10-14) of the network for phep if endogenous S-adenosyl-L-methionine (SAM-e) is supplied. The objective solution for this pathway is $\phi = 91.82 \times Price_{phep}^{mol}$. The overall conversion for pathway 2b is:

$$\mathrm{phal} + \mathrm{msal} + 2\ \mathrm{h2o} + 4\ \mathrm{o2} + \mathrm{nh3} \xrightarrow{-323\ \mathrm{kcal}} \mathrm{ser} + \mathrm{phep} + \mathrm{acetate} + 3\ \mathrm{co2} + \mathrm{h2o2}$$

The pathway 2c shown in green combines a number of fungal reactions from the tyrosine and phenylalanine metabolism pathways for the conversion of catechol and methyl salicylate to acetaldehyde and phenylephrine. This is similar to the phytochemical synthesis pathway of , wherein the carboligation product of the benzoate derivative and pyruvate condensation reaction undergoes transamination and N-methylation. Two possible routes with 17 reactions each are suggested. Catechol degrades to pyruvate and acetaldehyde via the benzoate degradation pathways in steps 1-4. 3-hydroxybenzoate, the product of methyl salicylate degradation (steps 5-7), condenses with pyruvate to form chorismate. Conversion of chorismate proceeds (steps 8-12) via the tyrosine metabolism pathways to form 2‐(3‐hydroxyphenyl) acetaldehyde (23hpa). novoStoic predicts a novel reaction to convert 23hpa to m-tyramine as the next step (step 13). The predicted aminating reaction to produce m-tyramine requires a monoamine oxidase50. The , tyramine oxidoreductase, typically (de)aminates p-tyramine, with trace (de)amination activity reported for m-tyramine51,52. The conversion of m-tyramine to phenylephrine, in the last two steps, proceeds further by the action of an N-methyltransferase and oxidoreductase. The sequence of the final two reaction rules is flexible since the reaction centers are unique (steps 14, 15 and 17, 18). The non-enzymatic alternatives for the reduction of 3‐[2‐(methylamino)ethyl]phenol53 (step 15) and N-methylation of norfenefrine54 (step 18) can provide templates for the development of engineered dopamine hydroxylase (steps 15, 17) and methyltransferase (steps 14, 18).

The co-substrate S-adenosyl-L-methionine (SAM) required by the N-methylation reaction is generated in the O-methylation reaction by the action of salicylate 1-O-methyltransferase (step 5). Therefore, methyl salicylate provides the carbons in the phenol and methylamino moieties, and catechol provides the carbons in the ethyl moiety of phenylephrine through its degradation to pyruvate. Steps 1-4 can be bypassed if pyruvate is directly provided. In contrast to the transamination reaction in the biosynthesis of pseudoephedrine, the assimilation of the amino group proceeds via the predicted monoamine oxidase step. novoStoic also suggested a pathway with transaminase, decarboxylase and additional amino acid recycling reactions (see supplementary). The objective solution for this pathway is $\phi = 116.07 \times Price_{phep}^{mol}$. The overall conversion for pathway 2c is

$$\mathrm{catechol} + \mathrm{msal} + 2\ \mathrm{o2} + \mathrm{h2o} + \mathrm{nh3} \xrightarrow{-60\ \mathrm{kcal}} \mathrm{phep} + \mathrm{acetaldehyde} + 2\ \mathrm{co2}$$

Naproxen synthesis:

Naproxen is a nonsteroidal anti-inflammatory drug (NSAID) with analgesic and antipyretic activities. Like many NSAID molecules, naproxen contains a propionate arm attached to an aromatic group that non-selectively inhibits both enzymes (i.e. cyclooxygenase-1 and cyclooxygenase-2) responsible for inflammation and pain.

Manufacturing of naproxen requires 2-naphthol and its derivatives as precursors.

Harrington and Lodewijk55 review the various synthesis techniques and how the manufacturing has evolved to reduce reagent waste and side products since its first introduction in 1976. Reactions related to regioselective electrophilic aromatic substitutions and stereoselectivity are cited as the primary challenges in naproxen synthesis. They also indicate the potential of asymmetric synthesis (e.g. enzymatic) for extracting targets in high enantiomeric excess as an advancement to the state of the art. In this study, novoStoic identified two pathways with 25 and 27 reactions, respectively. Each pathway proposes ten different routes from the precursors guaiacol and methyl-o-toluate to naproxen. The pathways differ in the overall conversion and in the meta cleavage degradation steps of the benzoate derivative.

The aromatic oil guaiacol, a precursor used in the production of various flavorants, is biosynthesized/degraded by a number of organisms56,57. Lignin-derived guaiacol and its derivative vanillate have been shown to serve as the primary carbon source for many organisms57,58. Commercial guaiacol is obtained from the petrochemical sources benzene and propylene and is inexpensive compared to many other methoxy benzoate derivatives59. In Pathway 3a, in the first half of the network (steps 1-7), the first reaction requires the activity of the SAM-e dependent catechol-O-methyltransferase enzyme on a methoxy benzoate derivative. The methyl group supplied by the methoxy derivative, guaiacol, is transferred onto the 2-naphthol derivative by another methyltransferase, to regenerate SAH, as part of the SAM-e cycle. The catechol produced after demethylation (step 1) is degraded by meta cleavage through the benzoate degradation pathway to produce formaldehyde, acetaldehyde, and pyruvate. Acetaldehyde is suggested as a co-product while pyruvate and formaldehyde are suggested as intermediates. In the second half of the network, the steps involving methyltransferases (R20 and R22) and the monooxygenase (R21) contribute to the pathway's plasticity. The ordering of these reactions is flexible as they have independent and unique reaction centers. The ordering of the steps (R18 and R19) is rigid since the oxidoreductase step (R19) can only occur after the amino acid is produced. Ten unique permutations of reaction steps can be traced from 2-naphthol to naproxen. The regioselective electrophilic aromatic substitution at C6 and concomitant amination, suggested as R18, uses the beta-tyrosinase/tryptophanase reaction rule. Vela et al60 demonstrated the non-enzymatic conversion of 2-naphthol to its tyrosine derivative (i.e. step 8). In the subsequent steps, identified by the reaction rule R19, an oxidoreductase step is suggested by novoStoic for the formation of an α-keto acid from an amino acid, using isozymes/homologs of tyrosine/tryptophan oxygen oxidoreductases. This oxidoreductase step can also be substituted by an aminotransferase step, similar to the bioconversion by the hyperthermostable Thermococcus profundus aminotransferase enzyme, which catalyzes the conversion of the non-natural naphthylalanine to 3-(2-naphthyl) pyruvate. Next, the pathway can proceed via a methylation step, similar to the reactions catalyzed by indolepyruvate methyltransferase, or lose a carbon dioxide via the oxidation step by a monooxygenase (e.g. indole-3-pyruvate monooxygenase). The ordering of the two aforementioned reactions is interchangeable. A non-enzymatic conversion of p3 to p7 is mentioned in the work by Kogure et al61. A non-enzymatic reaction to convert 6-mnaa to naproxen is known to be a part of the Syntex naproxen manufacturing process55. It is important to note that the methylation of the hydroxyl group on 2-naphthol C1 can take place at any point in the transformation route. This reaction is independent of other reactions, since its reaction center is topologically apart from other reaction centers. The objective solution for this pathway is $\phi = 18.57 \times Price_{naproxen}^{mol}$. The overall conversion for pathway 3a is

$$\mathrm{guaiacol} + 2\ \mathrm{o2} + 2\ \mathrm{h2o} + \text{2-naphthol} \xrightarrow{-0.132\ \mathrm{kcal}} \mathrm{naproxen} + \mathrm{acetate} + \mathrm{co2} + \mathrm{h2o2}$$

Pathway 3b is similar in conversion to pathway 3a, but with a more favorable free energy change. Analogous to step 1 in pathway 3a, the benzoate derivative methyl o-toluate is demethylated to o-toluate by a SAM-dependent O-methyltransferase (e.g. salicylate carboxymethyl transferase). Methyl o-toluate can also be dealkylated to o-toluate using chloroaluminate ionic liquids as the medium and catalyst.

A number of terrestrial organisms (e.g. Pseudomonas cepacia MB2) utilize o-toluate as the sole carbon source, and degrade o-toluate → catechol → pyruvate via the benzoate degradation pathways, with 3-methyl catechol as a predicted intermediate62. Unlike the production of only one SAM-e molecule in pathway 3a, pathway 3b proposes the generation of two SAM-e molecules while degrading the benzoate derivative to pyruvate.

The two SAM-e molecules are required by the SAM-dependent O-methyltransferase and pyruvate methyltransferase reactions to transfer two methyl groups onto the naphthol derivative. The objective solution for this pathway is $\phi = 2.7 \times Price_{naproxen}^{mol}$. The network in pathway 3b has a more favorable free energy change, with the following overall conversion:

$$\text{o-tolme} + 4\ \mathrm{o2} + \mathrm{h2o} + \text{2-naphthol} \xrightarrow{-10.2\ \mathrm{kcal}} \mathrm{naproxen} + \mathrm{acetaldehyde} + 3\ \mathrm{co2} + \mathrm{h2o2}$$

Epinine synthesis:

Like many , epinine is a and sympathomimetic. As the common names N-methyldopamine and suggest, epinine structurally differs from dopamine by the presence of an amide bond and differs from epinephrine by the absence of a β-hydroxyl group. Due to its structural and pharmacological resemblance to epinephrine, synthetic epinine hydrochloride salts were marketed as antihypotensive preparations. Currently, the pharmaceutical epinine available for in the treatment of congestive , is formulated as the prodrug ibopamine63, and marketed under the trade names Trazyl and Scandine. Ibopamine is the N-substituted 3,4-diisobutyryl ester of dopamine, and undergoes hydrolysis on dosage to form epinine64. The chemical stability of ibopamine is considered pharmaceutically superior to that of the epinine salt and its various analogues65. The production route of ibopamine involves the regioselective protection, and the consequent methylation and acetylation reactions, of the catechol-derived precursor 3,4-dimethoxyphenethylamine (dmpa)65,66. Due to the preferential acetylation of the nitrogen over the phenolic groups, the amino group is protected by substitution with a benzyl group67 prior to O-acylation by isobutyryl chloride68. The challenges in this process involve the subsequent removal of the benzyl group by hydrogenation67.

Using novoStoic, we propose two epinine synthesis pathways with 18 and 45 routes, respectively (phal → epinine). Each pathway differs by its overall conversion. The first pathway, shown in figure 7a, is similar to the synthesis of phenylephrine and involves a flexible collaboration of enzymes that catalyze decarboxylation, N-substitution, and monooxygenation reactions. The 18 routes proposed in this pathway can further be categorized by the initial conversions that phenylalanine undergoes. For example, the route converting phenylalanine to L-dopa, dopamine, and epinine initiates via the sequential hydroxylation at the para and meta positions on the aromatic side chain, with subsequent decarboxylation and N-methylation, respectively. In the route converting phenylalanine to phenylethylamine, methylphenethylamine, methyltyramine, and epinine, the phenolic groups form after the decarboxylation and N-methylation steps. It is important to note that the number of known reactions in the routes with the decarboxylation step preceding the N-substitution step is higher than in the routes where the N-substitution step precedes the decarboxylation step. This might also reflect the fact that the presence of the carboxyl group slows down the (de)methylase reaction69. A number of studies on biogenic synthesis in both eukaryotes and prokaryotes observe the decarboxylase activity to precede N-methylation. The N-methylation can either proceed via a direct methylation by formaldehyde in an oxidase reaction70 or via the SAM-e mediated N-methyltransferase reaction. Sarcosine degradation provides the methyl-group donors formaldehyde and SAM-e. The sarcosine degradation pathway serves as an energy source in a number of glyphosate-saturated soil-dwelling organisms71–74. Glyphosate, upon the cleavage of the carbon-phosphate, converts to sarcosine, which further degrades to formaldehyde/ via the intermediate glycine, concomitantly generating the reducing agents NADH/NADPH. The degradation pathway of the co-substrate sarcosine supplies the additional energy molecules needed by phal degradation.

The majority of reactions are present in a number of Pseudomonas species71,75,76. The overall conversion for this pathway is

$$\mathrm{phal} + \mathrm{sarcosine} + 3\ \mathrm{o2} \xrightarrow{-0.25\ \mathrm{kcal}} \mathrm{epinine} + 3\ \mathrm{co2} + \mathrm{nh3} + \mathrm{h2o2}$$

The second pathway combines 52 reactions to provide 45 unique routes from phal → epinine. This pathway, similar to the overall conversion shown in figure 7a, also requires the activity of decarboxylases and monooxygenases. However, each route additionally requires the action of four oxidoreductase enzymes (i.e. one oxidation reaction followed by three reduction reactions). In the oxidation step, the primary amine is converted to keto-carboxylic acids (e.g. tyrosine (tyr) → 4-hydroxyphenylpyruvate (4hpyr)). The keto acid is then directly converted into an α-hydroxy acid by decarboxylation and site-directed oxygenation by a synthase reaction. For example, the Streptomyces coelicolor 4-hydroxymandelate (4hmdl) synthase (HmaS) enzyme catalyzes the formation of 4hmdl by decarboxylation and regiospecific hydroxylation of 4hpyr. The similar conversion of phenylpyruvate (phlpyr) to the α-hydroxy aromatic acid mandelate (mdl) is proposed as a hypothetical reaction by novoStoic. This conversion is identified as a HmaS activity in the study by Giuro et al77. The α-hydroxy acids are then reduced to aldehydes by a NAD/NADH-dependent oxidoreductase. Next, the formation of secondary (for e.g. (synpr), tyramine (tyr), etc.) by the incorporation of an amide bond can proceed in two ways. The aromatic aldehyde is reduced by ammonia and then N-methylated, or it is directly reduced by methylamine. For example, in step 20, 4-hydroxymandelaldehyde (4hmal) can be reduced by ammonia to form (octpm) or reduced by methylamine to form synpr. Both aminating reaction steps require the activity of a monoamine oxidase. octpr is then methylated to form synpr. The reduction of 4hmal → octpr, suggested as a hypothetical step by novoStoic, has been identified in a number of phenylethylamine-degrading bacteria78,79. The hypothetical reaction 4hmal → synpr is also identified as an activity of a methylamine-dependent monoamine oxidase in the acetobacter Nocardia sp dm180. The third reduction step, requiring the activity of an ascorbate oxidoreductase, is a regioselective removal of the α-hydroxyl group from the α-hydroxy secondary amines (i.e., synephrine (synpr), epinephrine (eprh) and halostachine (hlstc)). Finally, subsequent oxidations to incorporate the m- and p-phenolic groups might be required for the formation of epinine.

Similar to the previous design (figure 7a), the sarcosine degradation pathway provides the methyl groups and reducing agents. The overall conversion for pathway 3b is

$$\mathrm{phal} + \mathrm{sarcosine} + 2\ \mathrm{o2} + \mathrm{h2o2} \rightarrow \mathrm{epinine} + 3\ \mathrm{co2} + 2\ \mathrm{h2o}$$

Many intermediates (for e.g. dopamine, adrenaline, etc.) in both 3a and 3b are attributed to metabolism in mammals and plants. These biogenic amines are reported as intermediates in the synthesis of Anthramycin and Calcium-dependent antibiotics in various Streptomyces77,81,82. Increasingly, a number of studies have also identified the presence of such biogenic amines in other prokaryotic species such as Pseudomonas, Vibrio tyrosinaticus, Rhodococcus and Neurospora 75,83–85.

Oxidative degradation of Benzo[a]pyrene to catechol:

Polycyclic aromatic hydrocarbons (PAHs), with mutagenic and carcinogenic properties, are unfortunately ubiquitous in the environment and have both natural and anthropogenic origins 86. Biodegradation studies around industrial effluent treatment plants and hydrocarbon drilling sites have implicated PAHs as the sole carbon and energy source for many soil dwelling organisms87. Metabolic assays on a number of terrestrial bacterial species indicate the involvement of a common cluster of metabolic genes to convert various PAHs to metabolites of the central carbon pathways88. Numerous studies have identified various pathway intermediates and pointed towards the use of recursive de-cyclization cleavage reactions at ortho and meta positions through dioxygen activation to degrade various PAHs into pyruvate, catechol, and phthalate (see Figure 9).

Decyclization reactions at ortho positions involve intradiol dioxygenases to cleave the carbon-carbon bonds between the two hydroxyl groups, while the decyclization reactions at meta positions involve extradiol dioxygenases to cleave carbon-carbon bonds adjacent to one of the hydroxyl groups. Extradiol and intradiol dioxygenases target specific topologies on the PAHs substrates. For example, the K-region carbons are predominantly targeted for dioxygen activation by intradiol dioxygenases, while the extradiol dioxygenases predominantly target the M and E-region carbons. When the substrate undergoes ortho cleavage, after every dioxygen activation, dehydrogenase and decarboxylase reactions enable the release of K-region carbons as two carbon dioxide molecules. However, when the substrate undergoes meta cleavage, a pyruvate and carbon dioxide molecule is released for every decyclization. The degradation initiates with the formation of a (poly)aromatic diol in a dioxygenase reaction while a subsequent oxidation

forms a heterocyclic chromene derivative. Next, a ring opening reaction of the chromene derivative is followed by an aldolase reaction to yield pyruvate and a (poly)aromatic aldehyde. The aldehyde is then oxidized, yielding carbon dioxide and a (poly)aromatic diol as a substrate for the next decyclization process. Using novoStoic, we consider both cleavage strategies while suggesting multiple catabolic routes for Benzo[a]pyrene (bzp) to catechol. In addition, we also factor into our pathway design the findings from multiple metabolic PAH degradation studies on various species. Metabolic studies on the Pseudomonas and Mycobacterium species indicate the recruitment of nah89 genes and their homologs associated with naphthalene degradation in converting diverse bay-region PAHs (e.g. benzo[a]pyrene, benzo[a]anthracene, chrysene, etc.) to pyruvate through the naphthalene degradation pathway intermediates 1-Naphthol-2-carboxylate and salicylaldehyde. By limiting the possible boundaries of catabolic routes to include cataloged intermediates and reactions, we identify the set of previously unknown reactions and provide a putative explanation for the complete degradation of benzo[a]pyrene to central carbon metabolites. The benzo[a]pyrene degradation network to catechol is derived by limiting the list of co-substrates/products to central carbon metabolites and small molecules such as CO2, H2O, O2. We also bias our design rules towards reaction rules related to naphthalene degradation pathways.

Figure 9 illustrates seven different routes from benzo[a]pyrene to catechol with the minimal number of hypothetical reaction steps. Our result recapitulates the findings in a number of studies and imputes metabolites into the ill-defined and incomplete benzo[a]pyrene degradation substrate annotations in current biochemistry databases.

Known reactions are indicated by solid lines while reactions suggested using rePrime reaction rules are denoted with dashed lines. Every step is indexed with a reaction rule id. For example, reactions in steps 7, 12 and 17 have the same id R7. Novel intermediates predicted by novoStoic are abbreviated using an alphanumeric scheme (i.e. p1, p2, p3, etc.). The pathway designs have been color coded and segregated into two groupings with respect to pyruvate yield. In Figure 9, the group in blue contains one bioconversion route from benzo[a]pyrene to catechol while the grouping in green contains six different bioconversion routes.

There is a total of 21 steps in group 1a, of which the first 16 steps, denoted as dashed lines, are reactions predicted by novoStoic. Steps 17-21 are known reactions indexed in current biochemistry databases. Reactions in steps 1-4 have incomplete EC numbers in KEGG. They were reproduced using reaction rules extracted from the reactions naphthalene dioxygenase (R1), naphthalene dihydrodiol dehydrogenase (R2), protocatechuate oxidoreductase (R3) and phthalate decarboxylase (R4). Steps 5 and 6 were derived using reaction rules extracted from the terephthalate dioxygenase (R5) and cyclohexadiene oxidoreductase (R6) reactions. Steps 1 and 2 are the dioxygen activation steps for ortho cleavage, while steps 5 and 6 are the dioxygen activation steps for meta cleavage. Steps 7-11, 12-16 and 17-21 depict a cyclic pathway wherein a chromene derivative is produced (steps 7, 12 and 17), followed by the ring opening reaction (steps 8, 13 and 18). Next, a pyruvate molecule and an aromatic aldehyde are produced in steps 9, 14 and 19. The aromatic carboxylates generated in steps 10, 15 and 20 are next decarboxylated to produce substrates (diols) for the next iteration of biodegradation. Of the 22 reaction products identified in this pathway, 16 are listed in most biochemistry databases. The overall biodegradation pathway produces three pyruvate molecules and five carbon dioxide molecules. The overall stoichiometric conversion for pathway 6a is

$$\mathrm{bzp} + 8\,\mathrm{o2} + 3\,\mathrm{h2o} \rightarrow \mathrm{catechol} + 3\,\mathrm{pyr} + 5\,\mathrm{co2}$$

Pathway 6b spans six separate routes. Both pathways share the reactions in steps 1-3 and 17-21. Each route in Pathway 6b requires 22 steps. The overall conversion for any route in Pathway 6b is

$$\mathrm{bzp} + 9\,\mathrm{o2} + 3\,\mathrm{h2o} \rightarrow \mathrm{catechol} + 2\,\mathrm{pyr} + \mathrm{acld} + 6\,\mathrm{co2}$$
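The element balance of the Pathway 6b conversion can be checked with a few lines of arithmetic. The sketch below is illustrative only: the elemental formulas are assumed to be those of the neutral, fully protonated species (pyruvic acid for pyr, acetaldehyde for acld), and the helper names `element_totals` and `is_balanced` are hypothetical rather than part of novoStoic.

```python
from collections import Counter

# Elemental formulas keyed by metabolite id (assumption: neutral, protonated forms).
FORMULAS = {
    "bzp": {"C": 20, "H": 12},             # benzo[a]pyrene
    "o2": {"O": 2},
    "h2o": {"H": 2, "O": 1},
    "catechol": {"C": 6, "H": 6, "O": 2},
    "pyr": {"C": 3, "H": 4, "O": 3},       # pyruvic acid
    "acld": {"C": 2, "H": 4, "O": 1},      # acetaldehyde
    "co2": {"C": 1, "O": 2},
}


def element_totals(side):
    """Sum the atoms of each element over one side of a conversion."""
    totals = Counter()
    for metabolite, coefficient in side.items():
        for element, count in FORMULAS[metabolite].items():
            totals[element] += coefficient * count
    return dict(totals)


def is_balanced(reactants, products):
    """Return True when both sides carry identical element totals."""
    return element_totals(reactants) == element_totals(products)


# Overall conversion for any route in Pathway 6b, as written in the text.
reactants_6b = {"bzp": 1, "o2": 9, "h2o": 3}
products_6b = {"catechol": 1, "pyr": 2, "acld": 1, "co2": 6}

print(element_totals(reactants_6b), is_balanced(reactants_6b, products_6b))
```

Under these assumed formulas, both sides of the Pathway 6b conversion carry 20 carbon, 18 hydrogen and 21 oxygen atoms.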

In each route, the substrate undergoes two ortho cleavage reactions and loses the K-region carbons as two carbon dioxide molecules, converging at the intermediate p9. Subsequently, p9 undergoes meta cleavage to yield a pyruvate and an acetaldehyde molecule and to form the naphthalene (nah) degradation pathway intermediate, naphthalene 1,2-diol.

Based on the degradation pathways we propose here and a number of PAH degradation studies, molecules such as chrysene and picene would not require any ortho cleavage reaction to form nah intermediates, while molecules such as pyrene and triphenylene would require at least one ortho cleavage to form nah intermediates. novoStoic enabled us to predict various intermediates and bioconversions and rapidly develop a hypothesis consistent with the sparse information in many PAH degradation studies and biochemical databases. The networks predicted by novoStoic provide a complete overall stoichiometry. None of the metabolic databases and metabolic models contain a complete degradation pathway from benzo[a]pyrene to catechol. A number of previous studies indicated that PAHs degrade via the $nah$ (naphthalene) degradation pathway to catechol and pyruvate. Therefore, we set $y^{path}_{nah} = 1$, $x^{EX}_{catechol} > 1$ and $x^{EX}_{pyr} > 1$. We also biased the constraints (8) and (9) by setting $y^{org}_{Burkholderia\ genus} = 1$ and $y^{org}_{Pseudomonas\ genus} = 1$ in individual runs, allowing us to study the PAH biodegradation putatively achievable by organisms belonging to the Burkholderia and Pseudomonas genera. The reaction-to-genus associations were downloaded from KEGG.


Discussion

In this paper, we introduce two novel procedures, rePrime and novoStoic. rePrime is currently the only reaction rule protocol to encode reaction centers as mass-balanced operators. rePrime provides additional resolution control over the specificity vs. generality of the operator targets. novoStoic is currently the only procedure to simultaneously integrate reaction rules with design factors related to network size, non-linear pathway topology, mass conservation, cofactor balance, thermodynamic feasibility, microbial chassis selection, and cost. The case studies on the biosynthesis of pharmaceuticals highlight the potential of this procedure to design and evaluate xenobiotic bioproduction schemes. A number of chemical manufacturing processes are increasingly exploiting the incredible chemoselectivity and catalytic rate acceleration abilities of enzymes. Our procedure can assist in rapid blueprint development for comprehensive biobased synthesis, or even assist in identifying the individual steps in a chemosynthesis pipeline that can be refined by a (bio)alternative. In addition, the rePrime / novoStoic framework can impute and elucidate the intermediates of ill-defined xenobiotic degradative pathways wherein only the metabolic fate is known. Our procedure can aid the development of detailed xenobiotic degradation maps, which can in turn assist in evaluating the toxicity and potential side effects of new drugs, and even enable the study of synergistic, antagonistic or toxic drug interactions. The git repos for rePrime and novoStoic will be made available post publication.


Acknowledgment

The authors acknowledge the contributions that Lin Wang and Chiam Yu Ng provided during the preparation of this manuscript. The inputs given by Anthony Burgard, Anupam Chowdhury, and Ali Khodayari at the various stages of ideation, refinement, and realization were invaluable. The authors gratefully acknowledge funding from the DOE (http://www.energy.gov/) grant no. DE-SC0008091. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.


Figures and Tables

Figure 1-1 Reaction molecular graphs of 2-hydroxyisophthalate decarboxylase (Figure 1A) and salicylate decarboxylase (Figure 1B):

Each atom/node in each molecular graph is identified by a unique numeric id that is consistent across all reactions in MetRxn. For example, the atoms/nodes of salicylate and carbon dioxide have the same numeric ids in both decarboxylase reactions.


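Figure 1-2: Calculation of $P^{\lambda}$ and $Z^{\lambda}$ ($\lambda \in \{1, 2, 3\}$) for the metabolite 2-hydroxyisophthalate.

Figures 2C, 2E and 2F show the $P^{\lambda}$ values projected onto the 2-hydroxyisophthalate molecular graph. For every increment in $\lambda$, $Z^{\lambda}_{ni}$ captures a radially larger chemical group centered at $n$. For example, when $\lambda = 2$ and $n = 3$, $Z^{3}_{3,2hipa}$ = '111910' indicates the presence of a carboxyl group that is two bonds adjacent to a hydroxyl group.

Table 1-1: $P$ and $Z$ matrices populated by rePrime

| $i$ | Atom $n \in N_i$ | Features | $P^{\lambda}_{ni}$ ($\lambda = 1$) | $Z^{\lambda+1}_{ni}$ ($\lambda = 1$) | $P^{\lambda}_{ni}$ ($\lambda = 2$) | $Z^{\lambda+1}_{ni}$ ($\lambda = 2$) | $P^{\lambda}_{ni}$ ($\lambda = 3$) |
|---|---|---|---|---|---|---|---|
| 2hipa | 1 | 1-2-08-0 | 3 | 99 | 5 | 475 | 17 |
| 2hipa | 2 | 1-1-08-1 | 2 | 44 | 2 | 76 | 5 |
| 2hipa | 3 | 3-4-06-0 | 11 | 7986 | 19 | 111910 | 47 |
| 2hipa | 4 | 3-4-06-0 | 11 | 73205 | 31 | 6883643 | 67 |
| 2hipa | 5 | 2-3-06-1 | 5 | 1375 | 13 | 57629 | 41 |
| 2hipa | 6 | 2-3-06-1 | 5 | 625 | 11 | 20449 | 29 |
| 2hipa | 7 | 2-3-06-1 | 5 | 1375 | 13 | 57629 | 71 |
| 2hipa | 8 | 3-4-06-0 | 11 | 73205 | 31 | 6883643 | 67 |
| 2hipa | 9 | 3-4-06-0 | 11 | 7986 | 19 | 111910 | 47 |
| 2hipa | 10 | 1-1-08-1 | 2 | 44 | 2 | 76 | 5 |
| 2hipa | 11 | 1-2-08-0 | 3 | 99 | 5 | 475 | 17 |
| 2hipa | 12 | 3-4-06-0 | 11 | 29282 | 29 | 1616402 | 59 |
| 2hipa | 13 | 1-1-08-1 | 2 | 44 | 2 | 116 | 11 |
| sal | 14 | 1-1-08-1 | 2 | 44 | 2 | 76 | 5 |
| sal | 15 | 1-2-08-0 | 3 | 99 | 5 | 475 | 17 |
| sal | 16 | 3-4-06-0 | 11 | 7986 | 19 | 111910 | 47 |
| sal | 17 | 3-4-06-0 | 11 | 73205 | 31 | 5459441 | 61 |
| sal | 18 | 2-3-06-1 | 5 | 1375 | 13 | 57629 | 41 |
| sal | 19 | 2-3-06-1 | 5 | 625 | 11 | 17303 | 23 |
| sal | 20 | 2-3-06-1 | 5 | 625 | 11 | 17303 | 23 |
| sal | 21 | 2-3-06-1 | 5 | 1375 | 13 | 42757 | 37 |
| sal | 22 | 3-4-06-0 | 11 | 13310 | 23 | 426374 | 53 |
| sal | 23 | 1-1-08-1 | 2 | 44 | 2 | 92 | 7 |
| co2 | 24 | 1-2-08-0 | 3 | 63 | 3 | 63 | 2 |
| co2 | 25 | 2-4-06-0 | 7 | 441 | 7 | 441 | 13 |
| co2 | 26 | 1-2-08-0 | 3 | 63 | 3 | 63 | 2 |
| phnl | 27 | 1-1-08-1 | 2 | 44 | 2 | 68 | 3 |
| phnl | 28 | 3-4-06-0 | 11 | 6050 | 17 | 97682 | 43 |
| phnl | 29 | 2-3-06-1 | 5 | 1375 | 13 | 31603 | 31 |
| phnl | 30 | 2-3-06-1 | 5 | 625 | 11 | 17303 | 23 |
| phnl | 31 | 2-3-06-1 | 5 | 625 | 11 | 14641 | 19 |
| phnl | 32 | 2-3-06-1 | 5 | 625 | 11 | 17303 | 23 |
| phnl | 33 | 2-3-06-1 | 5 | 1375 | 13 | 31603 | 31 |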
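The values in Table 1-1 appear consistent with a simple update in which the signature of an atom is the square of its own prime multiplied by the primes of its bonded neighbors. This rule is inferred here from the tabulated numbers and is an assumption, not a quotation of the rePrime definition. The function name `next_signature` and the adjacency list for carbon dioxide (O=C=O, atoms 24 to 26) are likewise illustrative.

```python
from math import prod


def next_signature(primes, neighbors):
    """Assumed rule: Z = (own prime)^2 * product of the neighbors' primes."""
    return {
        atom: primes[atom] ** 2 * prod(primes[k] for k in neighbors[atom])
        for atom in primes
    }


# Carbon dioxide atoms 24-26 with their P values for lambda = 1 (Table 1-1).
p_co2 = {24: 3, 25: 7, 26: 3}
# Assumed connectivity: atom 25 is the carbon bonded to both oxygens.
bonds_co2 = {24: [25], 25: [24, 26], 26: [25]}

z_co2 = next_signature(p_co2, bonds_co2)
print(z_co2)
```

For these three atoms the sketch reproduces the tabulated $Z$ entries 63, 441 and 63.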

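Table 1-2 The $C$ molecular signature matrix. Entries are $C^{\lambda}_{mi}$ for each prime $m$ (rows), molecule $i$ and $\lambda \in \{1, 2, 3\}$ (columns).

| $m$ | 2hipa, $\lambda = 1$ | 2hipa, $\lambda = 2$ | 2hipa, $\lambda = 3$ | sal, $\lambda = 1$ | sal, $\lambda = 2$ | sal, $\lambda = 3$ | co2, $\lambda = 1$ | co2, $\lambda = 2$ | co2, $\lambda = 3$ | phnl, $\lambda = 1$ | phnl, $\lambda = 2$ | phnl, $\lambda = 3$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | 3 | 0 | 2 | 2 | 0 | 0 | 0 | 2 | 1 | 1 | 0 |
| 3 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 1 |
| 5 | 3 | 2 | 2 | 4 | 1 | 1 | 0 | 0 | 0 | 5 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 11 | 5 | 1 | 1 | 3 | 2 | 0 | 0 | 0 | 0 | 1 | 3 | 0 |
| 13 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 0 |
| 17 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 19 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 23 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
| 29 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 37 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 41 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 47 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 59 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 61 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 67 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |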


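Table 1-3 The $T$ reaction rules matrix. Entries are $T^{\lambda}_{mj}$ for each prime $m$ (rows), reaction $j$ and $\lambda \in \{1, 2, 3\}$ (columns).

| $m$ | 2HIPD, $\lambda = 1$ | 2HIPD, $\lambda = 2$ | 2HIPD, $\lambda = 3$ | SLD, $\lambda = 1$ | SLD, $\lambda = 2$ | SLD, $\lambda = 3$ |
|---|---|---|---|---|---|---|
| 2 | -1 | -1 | 2 | -1 | -1 | 2 |
| 3 | 1 | 2 | 0 | 1 | 2 | 1 |
| 5 | 1 | -1 | -1 | 1 | -1 | -1 |
| 7 | 1 | 1 | 1 | 1 | 1 | -1 |
| 11 | -2 | 1 | -1 | -2 | 1 | 0 |
| 13 | 0 | 0 | 1 | 0 | 0 | 1 |
| 17 | 0 | 0 | -1 | 0 | 1 | -1 |
| 19 | 0 | -1 | 0 | 0 | -1 | 1 |
| 23 | 0 | 1 | 2 | 0 | -1 | 0 |
| 29 | 0 | -1 | -1 | 0 | 0 | 0 |
| 31 | 0 | -1 | 0 | 0 | -1 | 2 |
| 37 | 0 | 0 | 1 | 0 | 0 | -1 |
| 41 | 0 | 0 | -1 | 0 | 0 | -1 |
| 43 | 0 | 0 | 0 | 0 | 0 | 1 |
| 47 | 0 | 0 | -1 | 0 | 0 | -1 |
| 53 | 0 | 0 | 1 | 0 | 0 | -1 |
| 59 | 0 | 0 | -1 | 0 | 0 | 0 |
| 61 | 0 | 0 | 1 | 0 | 0 | -1 |
| 67 | 0 | 0 | -2 | 0 | 0 | 0 |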


Figure 1-3 Reaction between moieties:

The highlighted regions on the molecular graphs of reactions 2HIPD and SLD refer to the reaction-center participant moieties that remain after the spectator moieties cancel out. The matrix $T$ captures the changing moieties between reactants and products.
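The cancellation of spectator moieties can be illustrated numerically. The sketch below recomputes the $\lambda = 1$ columns of Table 1-3 from the $\lambda = 1$ columns of Table 1-2 by weighting each molecule's moiety counts with its stoichiometric coefficient. The stoichiometries (2-hydroxyisophthalate to salicylate plus CO2 for 2HIPD, salicylate to phenol plus CO2 for SLD) are assumed from the reaction names, and the function name `rule_matrix` is hypothetical.

```python
# Moiety counts C[m][i] at lambda = 1, copied from Table 1-2 (rows m = 2, 3, 5, 7, 11).
C = {
    2: {"2hipa": 3, "sal": 2, "co2": 0, "phnl": 1},
    3: {"2hipa": 2, "sal": 1, "co2": 2, "phnl": 0},
    5: {"2hipa": 3, "sal": 4, "co2": 0, "phnl": 5},
    7: {"2hipa": 0, "sal": 0, "co2": 1, "phnl": 0},
    11: {"2hipa": 5, "sal": 3, "co2": 0, "phnl": 1},
}

# Assumed stoichiometric coefficients: negative for reactants, positive for products.
S = {
    "2HIPD": {"2hipa": -1, "sal": 1, "co2": 1},
    "SLD": {"sal": -1, "phnl": 1, "co2": 1},
}


def rule_matrix(moiety_counts, stoichiometry):
    """Net moiety change per reaction: T[j][m] = sum_i S[j][i] * C[m][i]."""
    return {
        rxn: {
            m: sum(coeff * moiety_counts[m][met] for met, coeff in coeffs.items())
            for m in moiety_counts
        }
        for rxn, coeffs in stoichiometry.items()
    }


T = rule_matrix(C, S)
print(T["2HIPD"])
print(T["SLD"])
```

At $\lambda = 1$ both decarboxylases yield the same net moiety change, which matches the identical $\lambda = 1$ columns for 2HIPD and SLD in Table 1-3. The two rules only diverge at larger $\lambda$.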


Figure 1-4 Synthesis of phenol:

The figure shows a toy example with the expanded version of the constraints (1.1) and (2) to (7). The table shows the values the variables take on for four alternative solutions. The parameters $C$ and $T$ used to generate the constraints (2) and (3) above are for $\lambda = 1$.


Figure 1-5 phenylephrine synthesis.

Three routes are depicted for the synthesis of phenylephrine from benzoate derivatives. Each route is color coded to differentiate the routes by choice of precursor and overall conversion. In addition, the precursor and target are colored to depict moiety translocation.


Table 1-4 The reaction-rule templates for phenylephrine synthesis

The reaction-rule column presents the reaction that corresponds to the reaction rule identified by the alphanumeric scheme R1, R2, R3, etc. Step id identifies the reaction that was predicted from the corresponding reaction rule.


Figure 1-6 synthesis of naproxen from the precursor guaiacol

The network contains ten unique routes. The network contains reactions and rules from the benzoate degradation pathways and tryptophan/tyrosine synthesis.


Figure 1-7 synthesis of naproxen from the precursor methyl o-toluate

The network contains ten unique routes. Similar to Pathway 2a, the network contains reactions and rules from the benzoate degradation pathways and tryptophan/tyrosine synthesis.


Table 1-5 The reaction templates for naproxen synthesis.


Figure 1-8 synthesis of epinine from phenylalanine

The pathway uses a number of reactions from the Pseudomonas species


Figure 1-9 epinine synthesis from phenylalanine

The conversion uses a number of reactions from the Streptomyces species


Figure 1-10 N-methyl-l-aspartate synthesis

Figure 1-11 benzo[a]pyrene oxidative degradation.

The oxidative degradation routes suggested by novoStoic combine existing and predicted reactions in a mass-balanced fashion. The degradation products pyruvate and catechol were set as targets and benzo[a]pyrene was set as the source metabolite. Reaction rules are shown in red. Dashed arrows denote hypothetical reactions. The predicted routes are color-coded in blue and green based on the overall conversion. The full description of the abbreviated reaction and metabolite names is listed in the supplementary section.


Table 1-6: The reaction-rule templates for PAHs degradation

The reaction template column presents the reaction that corresponds to the reaction rule identified by the alphanumeric scheme R1, R2, R3, ... etc. Step id denotes the reaction that was predicted from the corresponding reaction rule for PAHs degradation.

Maximum common molecular substructure queries within the MetRxn database

Akhil Kumar1 and Costas D. Maranas2

This chapter has been previously published in modified form in the Journal of Chemical Information and Modeling (CLCA: Maximum Common Molecular Substructure Queries within the MetRxn Database, Akhil Kumar, Costas D. Maranas, Journal of Chemical Information and Modeling, 12, 3417-3438).

1The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA;

2Department of Chemical Engineering, Pennsylvania State University, University Park, PA;

The challenge of automatically identifying the preserved molecular moieties in a chemical reaction is referred to as the atom mapping problem. Reaction atom maps provide the ability to locate the fate of individual atoms across an entire metabolic network. Atom maps are used to track atoms in isotope labeling experiments for metabolic flux elucidation, trace novel biosynthetic routes to a target compound and contrast entire pathways for structural homology. However, rapid computation of the reaction atom mappings remains elusive despite significant research. We present a novel substructure search algorithm, Canonical Labeling for Clique Approximation (CLCA), with polynomial run-time complexity to quickly generate atom maps for all the reactions present in MetRxn. CLCA uses number theory (i.e. prime factorization) to generate canonical labels or unique ids and identify a bijection between the vertices (atoms) of two distinct molecular graphs. CLCA utilizes molecular graphs generated by combining atomistic information of reactions and metabolites from 112 metabolic models and 8 metabolic databases. CLCA offers improvements in runtime, accuracy and memory utilization over existing heuristic and combinatorial maximum common substructure (MCS) search algorithms. We provide detailed examples of the various advantages as well as failure modes of CLCA relative to existing algorithms.

1. Introduction

MetRxn22 is a standardized, non-redundant, searchable collection of published metabolic models and databases from a wide variety of organisms. MetRxn22 primarily focuses on organizing metabolic information such as metabolites, reactions, and pathways using atomistic details25,90. Atomistic details enable standardization algorithms to scrutinize and remove duplications and inaccuracies from the metabolic information, producing a curated dataset. The curated dataset is leveraged to automatically annotate and add missing information important for various metabolic modeling projects. Our annotations involve generating atom/bond transition maps, EC number classifications for all reactions and semantic ontologies for all metabolites. The aforementioned annotations are generated using our novel polynomial time maximum common substructure search algorithm, which we refer to as Canonical Labeling for Clique Approximation (CLCA).

The reaction atom mapping problem involves the task of matching atoms and bonds between reactants and products. This goal is computationally abstracted by identifying the mapping that conserves the maximum common substructure (MCS) between reactants and products91. By denoting atoms as vertices and bonds as edges, this task can be posed as a graph matching problem. There exist significant prior efforts devoted to the computational identification of biochemically correct mappings between reactants and products through arbitrary reactions. Jochum and Gasteiger92 introduced the heuristic known as the Principle of Minimum Chemical Distance (PMCD) to computationally solve reaction atom mapping problems as graph matching problems. PMCD is based on the assumption that in most cases reactants (R) are converted into products (P) with the minimal number of bond additions and deletions. MCS search algorithms can thus be used to match and infer a one-to-one mapping between the vertices of R and P. PMCD forms the basis for the methodology described here.
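As a minimal illustration of the PMCD criterion, the sketch below counts the bond additions and deletions implied by a candidate atom mapping and picks the mapping with the smallest count. It is a simplification under stated assumptions: bond orders are ignored, the molecules are toy graphs, and the name `chemical_distance` is hypothetical rather than taken from any published implementation.

```python
def chemical_distance(reactant_bonds, product_bonds, atom_map):
    """Count bonds broken plus bonds formed under a given atom mapping."""
    # Translate every reactant bond into product atom ids.
    mapped = {frozenset((atom_map[a], atom_map[b])) for a, b in reactant_bonds}
    target = {frozenset(bond) for bond in product_bonds}
    # Bonds present on only one side are deletions or additions.
    return len(mapped ^ target)


# Toy reaction: a three-atom chain 1-2-3 loses one terminal atom.
reactant_bonds = [(1, 2), (2, 3)]
product_bonds = [("a", "b")]  # product atoms a, b (bonded) and c (free)

candidate_maps = {
    "keep_1_2": {1: "a", 2: "b", 3: "c"},   # breaks only bond 2-3
    "scramble": {1: "a", 2: "c", 3: "b"},   # breaks both bonds, forms a-b
}

distances = {
    name: chemical_distance(reactant_bonds, product_bonds, mapping)
    for name, mapping in candidate_maps.items()
}
best = min(distances, key=distances.get)
print(distances, best)
```

The first mapping requires one bond change and the second requires three, so the first is preferred under PMCD.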

Maximum common substructure (MCS) searches on graphs of general class93 are computationally challenging94 and are classified under the larger class of maximum subgraph isomorphism problems with non-deterministic polynomial time complexity (NP-hard). A large number of efforts have addressed this challenge with varying degrees of success and scope91,95–103. These algorithms can identify a maximum substructure between most molecular graphs, but they have distinctive failure modes. For example, the algorithm by Lynch and Willett100, an adaptation of the Extended Connectivity (EC) algorithm104, fails to identify a matching when the compared molecular graphs are too small91 (< 7 atoms). It also fails to identify the bond changes in isomerization reactions involving functional group shifts91. The procedures by Vleduts105, McGregor and Willett99 and Funatsu et al.106 also make use of the EC algorithm to identify an MCS. All these methods have similar failure modes, as the error from an incorrect starting MCS solution provided by the EC algorithm propagates into the final mapping solution91. Nevertheless, due to computational tractability imperatives, commercial reaction atom mapping applications largely employ the EC algorithm to find MCS solutions91.

Non-EC based methods traditionally use combinatorial search strategies such as branch and bound to identify an MCS. Combinatorial searches require $\frac{m!\,n!}{k!\,(m-k)!\,(n-k)!}$ comparisons to identify a substructure containing $k$ atoms that is common to two structures containing $m$ and $n$ atoms, respectively99. Various algorithms attempt to speed up the MCS detection by using heuristics to reduce the combinatorial search space. For example, in the algorithm by Barker et al.107, functional groups are reduced to aggregate vertices, with the connectivity between the aggregate vertices reflecting the connectivity of atoms in the original chemical graph. Such a representation greatly reduces the search space for the branch and bound algorithm but might fail to detect an MCS between a cyclic and a linear structure107,108. The branch and bound algorithm presented by Caboche et al.109 can detect an MCS between a cyclic and a linear graph; however, the algorithm requires converting the original graphs into a data structure referred to as a Compatibility Graph (CG). In a CG, vertices represent a possible matching between a pair of atoms and edges represent the bonds in the two structures. The largest clique in a CG represents an MCS solution between the original graphs. A branch and bound algorithm is then used to exhaustively search the CG for the largest clique. However, the size of a CG scales exponentially with the number of vertices and edges in the original graph108. Therefore, for larger graphs such a representation is extremely dense and MCS searches do not scale108.

Similar to branch and bound based algorithms for MCS searches between pairs of molecular graphs, a number of algorithms for MCS detection between multiple molecules in reaction graphs exist. The MCS of a reaction graph is used to detect a one-to-one mapping between the vertices/atoms and bonds/edges99 of the reactant (R) and product (P) graph. First et al.102 recently described an efficient mixed integer linear optimization based technique to map reactions as well as identify multiple reaction mechanisms by minimizing the number of bond or edge changes between R and P. The MWED algorithm by Latendresse et al.96 improves upon the mathematical formulation by First et al.102 by incorporating chemical knowledge as edge weights or bond costs based on atom species. In addition, to speed up the matching process, ring detection and matching is done prior to the branch and bound search. However, such mixed integer linear optimization techniques cannot identify reaction maps for unbalanced reactions96,102.
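To give a feel for how quickly the comparison count quoted above for combinatorial searches grows, the following sketch evaluates it for a few sizes. The function name `mcs_comparisons` is illustrative. The expression itself is the one given in the text.

```python
from math import factorial


def mcs_comparisons(m, n, k):
    """Evaluate m! n! / (k! (m-k)! (n-k)!) for structures of m and n atoms."""
    if not 0 <= k <= min(m, n):
        raise ValueError("k must lie between 0 and min(m, n)")
    # The quotient is always an integer: it equals C(m, k) * n! / (n - k)!.
    return (factorial(m) * factorial(n)) // (
        factorial(k) * factorial(m - k) * factorial(n - k)
    )


for size in (3, 10, 20):
    print(size, mcs_comparisons(size, size, size // 2))
```

Even for two 20-atom structures and a 10-atom common substructure the count is far beyond what exhaustive enumeration can handle, which motivates the heuristics discussed above.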

Unlike the previously mentioned algorithms, CLCA uses a local search strategy to canonically label (uniquely order) vertices/atoms and detect an MCS between molecular graphs. CLCA uses the property of prime factorization to uniquely label vertices, and is an adaptation of the labeling technique used in the SMILES algorithm by Weininger et al.25. Compared to the labels generated by the EC algorithm, the labels generated by CLCA uniquely represent a vertex in the entire topological space and not just within a fixed neighborhood. Vertex labels between each reactant and product graph are compared and unmatchable vertices are iteratively removed from the MCS search calculation. For reactions with multiple reactant or product molecular graphs, many combinations of MCS matches exist. We therefore follow the heuristic known as the principle of minimum chemical distance92 (PMCD) to choose a matching that reduces the overall number of bond changes between reactants and products. When compared to MetaCyc3, KEGG110 and ReactionMap111, CLCA has 97.8%, 99.3% and 97.9% agreement, respectively, with their reported atom mapping solutions. We have also manually verified the atom mapping results for 1,293 reactions from the E. coli iAF1260 model112 for accuracy and found an error rate of only 0.5%. In addition, we tested and accurately mapped over 95.2% of the reactions in the ICMap113 test suite. Failure modes of CLCA, MetaCyc MWED96 and ReactionMap111 for different classes of reactions91,113 are analyzed and discussed in detail. The atom mapping solution for approximately 27,000 reactions in the MetRxn database is available online at maranasgroup.com/metrxn.

2. Methods

A molecular graph represents a chemical compound wherein the vertices denote the atoms and the edges represent the bonds. MetRxn22 stores information about the atoms, bonds and connectivity as SMILES90. This information is leveraged to construct molecular graphs for the algorithm we present in this article. Every graph $G$ is an ordered pair $\{V, E\}$, where $V$ is the set of vertices or nodes $v$ and $E$ is the set of edges or connections $e$. The induced graph of $G$, represented by $G'$, is homomorphic to $G$ (i.e., has the same topology as $G$). The induced graph $G'$ by definition preserves all the structural properties of $G$ and contains the same number of vertices and edges. For two graphs $G'$ and $H'$, let the graph $g = \{v', e'\}$ be the (sub)graph that is isomorphic to both, i.e. after a finite set of edge and vertex deletions on $G'$ and $H'$, both $G'$ and $H'$ would contain subgraphs that are isomorphic to $g$. A clique, which is a fully connected graph, is represented as $q = \{v'', e''\}$, where $v''$ represents a mapping between the elements of $G'$ and $H'$ and $e''$ represents a mapping between the edges of $G'$ and $H'$. CLCA identifies the largest $g$ between $G'$ and $H'$ by iteratively identifying the mappings represented by cliques $q$. Vertices in $G'$ and $H'$ are compared for common labels and unmatchable ones are successively deleted. Multiple cliques $q$, identified by common labels, are then combined into the graph $g$, which is (sub)graph isomorphic to both $G$ and $H$.

In order to identify vertex labels, a set of reordering operations is performed on the vertices of $G'$ and $H'$. The labels generated after reordering the vertices of $G'$ and $H'$ are representative of the vertex locations in the entire topological space (i.e., vertices in the center of the graph generally receive larger integer labels than the vertices in the periphery). Such a reordering also follows the popular graph conjecture114 that states that if two induced graphs are identical when reordered, then the parent graphs are identical. Therefore, to identify a mapping between $G$ and $H$, we identify a mapping between the reordered graphs $G'$ and $H'$. The reordering, labeling and mapping of two induced molecular graphs $G'$ and $H'$ by CLCA is as follows:

(i) Identify features for each vertex/atom in $G'$ and $H'$ and initialize the label on each vertex to '0'.
(ii) Rank order with prime numbers only the vertices with features common to at least two compared molecules. Vertices with unique features are assigned the value '1' (i.e., deletion from the MCS search space).
(iii) (Re)assign labels and features based on the product with neighboring primes.
(iv) Iterate through steps (ii) and (iii) until the atom labels no longer change.
(v) Identify all cliques $q$ by common labels.
(vi) Expand the isomorphic subgraph $g$ by appending cliques $q$.
(vii) Identify and keep the largest connected subgraph108.
(viii) Extend the largest connected subgraph to the maximum common subgraph (MCS) using the A* search algorithm98.

The subgraph $g$ identified after the A* search procedure is (sub)graph isomorphic to $G'$ as well as $G$.
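Steps (i) to (iv) above can be sketched in a few functions. This is an illustrative simplification and not the published CLCA code: feature strings are supplied directly as short placeholder strings, and step (iii) is implemented by pairing a vertex's current prime with the product of its neighbors' primes (an assumed design choice that guarantees label classes only split or drop out). All function names are hypothetical.

```python
from math import prod


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def primes_for_common_keys(keys_g, keys_h):
    """Step (ii): keys present in both graphs get primes by rank, others get 1."""
    common = sorted({k for k in keys_g.values() if k is not None}
                    & {k for k in keys_h.values() if k is not None})
    prime_of = dict(zip(common, first_primes(len(common))))
    return ({v: prime_of.get(k, 1) for v, k in keys_g.items()},
            {v: prime_of.get(k, 1) for v, k in keys_h.items()})


def neighbor_keys(labels, adjacency):
    """Step (iii): combine each surviving label with its neighbors' product."""
    return {v: None if labels[v] == 1  # removed vertices stay removed
            else (labels[v], prod(labels[u] for u in adjacency[v]))
            for v in labels}


def canonical_labels(features_g, adj_g, features_h, adj_h):
    """Steps (i)-(iv): iterate until the labels stop changing."""
    labels_g, labels_h = primes_for_common_keys(features_g, features_h)
    while True:
        new_g, new_h = primes_for_common_keys(neighbor_keys(labels_g, adj_g),
                                              neighbor_keys(labels_h, adj_h))
        if (new_g, new_h) == (labels_g, labels_h):
            return labels_g, labels_h
        labels_g, labels_h = new_g, new_h


# Hydrogen-suppressed propan-1-ol (G) and propylamine (H) with toy feature strings.
features_g = {"g0": "CH3", "g1": "CH2", "g2": "CH2", "g3": "OH"}
features_h = {"h0": "CH3", "h1": "CH2", "h2": "CH2", "h3": "NH2"}
adj_g = {"g0": ["g1"], "g1": ["g0", "g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
adj_h = {"h0": ["h1"], "h1": ["h0", "h2"], "h2": ["h1", "h3"], "h3": ["h2"]}

labels_g, labels_h = canonical_labels(features_g, adj_g, features_h, adj_h)
print(labels_g)
print(labels_h)
```

In this toy case the three carbons end up with distinct primes that agree between the two graphs, while the heteroatoms, whose features are not shared, are assigned '1' and drop out of the search.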

Before we iterate, we extract distinguishing features of each atom and convert them into a string representation. The first seven features we extract are as follows: (i) number of non-hydrogen connections (ii) number of non-hydrogen bonds (iii) atomic numbers (iv) sign of charge (v) absolute charge (vi) number of connected hydrogens (vii) atomic numbers of neighboring atoms. These features are sufficient for mapping molecules without chiral centers. All the features are concatenated as integers and then rank ordered using the natural ordering for strings. For example, in Figure 1, we show the feature string generation and ranking of each atom in L-tyrosine and hydroxyphenylpyruvate for identifying each atom uniquely. After we identify features on each atom, we iterate to rank-order (Figure 1A), assign rank-corresponding primes (Figure 1B) and calculate the product of adjacent primes (Figure 1C). If at any iteration the value for the product with adjacent primes is unique to a single vertex in $G'$ or $H'$, we assign a value of '1' to it (Figure 1B). Assigning a value of '1' to a vertex can be considered as removing it from the MCS search. Prime numbers are assigned only to vertices with features and product values common to both $G'$ and $H'$. The procedure is repeated until each atom is uniquely ranked or until the rank ordering does not change anymore (Figure 1G). The reason for using products of adjacent primes is to assign a unique rank order93 to atoms by the properties of their adjacent neighbors. At each iteration, the radius of influence of adjacent atoms on the rank of a given atom increases. As illustrated in the table in Figure 1, the rank at step A for the atom with index = 5 is only affected by the ranks of its surrounding atoms 4 and 6, but the rank at step D is affected by the ranks of atoms 3, 10 and 7. The procedure terminates when the same rank ordering remains for all vertices in two consecutive iterations.

CLCA iteratively removes vertices with distinct topological features and labels only those vertices with common topological features. The canonical labels are then used to recognize the mappings between the vertices of $G'$ and $H'$. We choose only the mappings for vertices in the largest connected subgraph and discard solutions to all other vertices. The mappings retained are stored as the graph $g$, which is (sub)graph isomorphic to both $G'$ and $H'$. The mappings identifiable by $g$ are then used as the starting solution for the combinatorial A* search98. A* search is a widely used graph traversal algorithm to efficiently traverse a weighted path between vertices. A* uses best-first search to find a path with the least cost from source to goal vertex. In graphs $G'$ and $H'$, the source vertices are the mapped atoms/vertices and the goal vertices are the unmapped atoms/vertices. The cost on each atom/vertex is the corresponding atomic number. Two goal vertices are mapped only if the least cost paths as well as the mappings for their initial vertices are equal in both $G'$ and $H'$. As shown in Figure 1G, the source vertices in $G'$ and $H'$ contain the labels 2, 6, 7, 8, 10 and 11. The goal vertices mapped by the A* search are identified by the labels 3, 4, 9 and 12. As with many combinatorial search algorithms, identifying a path containing $k$ atoms that is present in two structures with $m$ and $n$ atoms requires $\frac{m!\,n!}{k!\,(m-k)!\,(n-k)!}$ comparisons99. This generally enormous number of combinatorial comparisons otherwise needed by the A* search is avoided by identifying the source and goal vertices using vertex labels. We also note that CLCA always generates equivalent labels for homotopic groups. For example, in Figure 1G, vertices with label 7 represent carbons (indices = 5, 10, 21 and 26) that are symmetric due to an inversion center.


The mappings identified with only the seven aforementioned atomic properties can identify symmetry in a molecular graph if we ignore stereochemical distinctions. This is inadequate since stereospecificity is paramount in enzymatic reactions115. For example, there is a large number of NADH-dependent dehydrogenases that transfer a hydride ion either from/to the re-side or the si-side of the nicotinamide group in NADH/NAD+, implying that we need to differentiate between the pro-R and pro-S hydride ions116. The enzymes isocitrate dehydrogenase, malate dehydrogenase, lactate dehydrogenase and alcohol dehydrogenase show stereospecificity to the pro-R hydride ion, while α-ketoglutarate dehydrogenase, glucose-6-phosphate dehydrogenase, glutamate dehydrogenase and glyceraldehyde 3-phosphate dehydrogenase show stereospecificity to the pro-S side116. We therefore add additional characterizations related to stereochemistry to the feature string, so as to stereochemically discern between symmetric atoms. Specifically, we characterize atoms based on an additional three stereodescriptors117: (viii) R or S descriptor for chiral atoms, (ix) pro-R or pro-S for prochiral arms, and (x) cis and trans descriptors117. For stereochemically characterizing all carbons, the characterization procedure traverses the molecular graph to identify all chiral atoms, prochiral atoms and atoms with double bonds. To identify the prochiral arms/atoms, we calculate the dihedral angle between the heterotopic groups and the plane defined by the homotopic groups. For example, in the prochiral molecule citrate, for the plane defined by the prochiral carbon and the two carboxymethyl groups, we compare the angle between the hydroxyl group and the plane to the angle between the carboxyl group and the plane. For visual clarity, we only use hydrogen-suppressed molecular graphs in all the figures to explain CLCA. Figure 2A shows an example of mappings with and without stereodescriptors for the conversion of citrate to cis-aconitic acid, catalyzed by the enzyme aconitase. Ignoring stereodescriptors, CLCA identifies chemically equivalent, though biochemically distinct, symmetric groups. Citrate has a plane of symmetry and a prochiral carbon center. Aconitase differentiates between the pro-R and pro-S arms of citrate, thus distinguishing the carboxymethyl group and proton of the pro-R arm from the carboxymethyl group and proton of the pro-S arm when forming the cis-aconitate intermediate118. Upon appending the stereodescriptors for the two prochiral substituents, we correctly distinguish between the pro-R and pro-S arms of citrate. Other metabolic graphs shown in Figure 2 have distinct stereoisomers. 2,6-diammonioheptanedioate in Figure 2B has two chiral carbons and three distinct stereoisomers (i.e., meso-2,6-diaminopimelate, LL-2,6-diaminopimelate and (2R,6R)-2,6-diammonioheptanedioate). meso-2,6-diaminopimelate has an inversion center but lacks a rotation axis, thereby rendering the two symmetric groups stereochemically distinct. LL-2,6-diaminopimelate and (2R,6R)-2,6-diammonioheptanedioate contain an axis of rotation and therefore the two symmetric groups are also biochemically equivalent. In E. coli, the gene lysA encodes the enzyme diaminopimelate decarboxylase119, which stereospecifically catalyzes the decarboxylation of meso-2,6-diaminopimelate to L-lysine and is inactive for the RR- or SS-isomers of diaminopimelate. Similarly, as shown in Figure 2C, CLCA distinguishes between the symmetric groups that lie on the re-side or si-side of an sp2 hybridized atom. Isopentenyl pyrophosphate isomerase stereospecifically catalyzes the reversible isomerization reaction between isopentenyl pyrophosphate and dimethylallyl pyrophosphate. As part of the standardization procedure, we characterize all atoms in each molecule in the MetRxn22 database with chiral, prochiral and cis-trans flags. This allows us to identify and present to the user biochemically relevant symmetry information.

Reduction of A* search space.

As a test case, we compared 2,000 pairs of molecular graphs for (sub)graph isomorphism using CLCA and noticed that the algorithm still took considerable time to identify accurate solutions. For each comparison, the procedure took an average of 36 seconds to produce a solution. The canonical labeling step identifies a subgraph close to 61% of the size of the maximum common subgraph and the A* search step identifies the remaining 39%. Over 99% of the total runtime per comparison was used up by the A* extension step alone. For the examples shown in Figures 1 and 2, the runtime was close to one second; however, for comparisons with larger molecular graphs (i.e., atom count > 100) and polycyclic molecules, the runtime increased substantially.

We reduced the reliance on the A* search by augmenting the input molecular graphs $G'$ and $H'$ with additional information. First, we convert each bond in the induced graph $G'$ into an artificial vertex placed between the two adjacent vertices that the bond connects. The graph $G'$ together with this additional set of artificial vertices is the auxiliary graph $G''$. If there is a bond between $v_1$ and $v_2$, the bond is converted into an artificial vertex $a_{12}$ and new connections are formed between $v_1$ and $a_{12}$, and between $v_2$ and $a_{12}$.

In the example shown in Figure 3B, the vertex with index 24 is the artificial node in place of the edge between vertices 23 and 25. Similar to the previously mentioned vertex characterization step for induced graphs $G'$ and $H'$, we characterize each vertex in auxiliary graphs $G''$ and $H''$ by the properties of their corresponding atoms. These features are limited to (i) number of non-hydrogen connections, (ii) number of non-hydrogen bonds, (iii) atomic number, (iv) sign of charge, (v) absolute charge, (vi) number of connected hydrogens, and stereodescriptors such as (vii) R or S descriptor for chiral atoms, (viii) pro-R or pro-S for prochiral arms and (ix) cis and trans117. Since the artificial vertices represent the bonds in the parent molecular graph, they are characterized by the bond order. The column with header FHV in Table 1 shows the new features for each vertex in $G''$ and $H''$. The iterative labeling routine is then run on the auxiliary graphs. Figure 3B shows the improvement in mapping due to the introduction of artificial vertices identified by the labeling step alone. The introduction of the artificial vertex mitigates the removal of atoms and allows for the removal of bonds, thereby allowing the labeling of a larger subgraph without invoking the costly A* procedure, as shown in Figure 3B for $G''$ and $H''$.
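The bond-to-vertex conversion described above can be illustrated with a minimal sketch. The input format (a dictionary of atoms and a list of bond tuples), the function name `build_auxiliary_graph` and the tuple used to name artificial vertices are all assumptions made for illustration; they are not taken from the CLCA implementation.

```python
def build_auxiliary_graph(atoms, bonds):
    """Turn every bond into an artificial vertex.

    atoms: dict mapping vertex id -> element symbol (hypothetical input format)
    bonds: list of (u, v, bond_order) tuples
    Returns (labels, adjacency) describing the auxiliary graph.
    """
    labels = dict(atoms)
    adjacency = {v: set() for v in atoms}
    for u, v, order in bonds:
        a = ("bond", min(u, v), max(u, v))  # artificial vertex standing in for bond u-v
        labels[a] = ("bond", order)         # artificial vertices are labeled by bond order
        adjacency[a] = {u, v}               # a is connected to both former end points
        adjacency[u].add(a)
        adjacency[v].add(a)
    return labels, adjacency


# Heavy-atom skeleton of ethanol, C1-C2-O3 (hydrogens omitted).
atoms = {1: "C", 2: "C", 3: "O"}
bonds = [(1, 2, 1), (2, 3, 1)]
labels, adjacency = build_auxiliary_graph(atoms, bonds)
```

In the resulting graph no two atoms are directly adjacent; every atom-atom connection passes through an artificial vertex that carries the bond order as its label.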

In addition to the aforementioned features for characterizing each atom vertex, we use a connectivity criterion to improve the overall mapping accuracy of CLCA. We increase both the prediction fidelity108 and size of the MCS by using path information.

A vertex $v_g$ in $G''$ is mapped to a vertex $v_h$ in $H''$ iff a path from $v_g$ to $v_g'$ in $G''$ is the same as a path from $v_h$ to $v_h'$ in $H''$ (i.e., a path of the same length, type of vertices (atom type) and sequence of vertices). We use the Floyd-Warshall algorithm to identify shortest paths between all vertex/atom pairs. The set of shortest paths in an undirected graph with $n$ vertices is of size $n^2$, therefore many vertices may share the same set of paths. In order to reduce the commonality in paths, edges are directed away from non-carbon vertices and towards carbon vertices. The set of paths originating from $v_g$ and $v_h$ is now smaller and more distinct. Since we require only one path as a feature for each vertex, we choose the longest of all the (common) paths from the smaller set. This path, called the longest geodesic, is illustrated in Figure 4B, where all the edges in $G''$ and $H''$ are directed towards the carbon atoms and away from the non-carbon atoms. The artificial vertex also preserves this directionality in the auxiliary graph. The rationale for the directionality is based on the type of molecules under consideration or the underlying dataset we wish to map. We noticed that in MetRxn over 86% of reactions have reaction centers on non C-C bonds. Only 3,761 of the 27,632 reactions in MetRxn involve breaking or formation of C-C bonds. Therefore, to ensure that a longest geodesic contains the least number of non C-C bonds, we direct the edges away from non-carbon atoms. It is important to note that the mapping accuracy involving C-C bonds was not affected by this assumption, as shown by the example in Figure 4B. The relationship between direction and bond type is not fixed and can be modified to best suit the data that has to be mapped. A similar strategy is also followed by Kraut et al.113, wherein reaction rules prioritize breaking of bonds between heteroatoms over carbon-carbon bonds.
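The path feature described above can be sketched as follows. This is an illustration under stated assumptions, not the CLCA code: it works on a plain molecular graph rather than the auxiliary graph, it keeps bonds between two atoms of the same kind (C-C, or heteroatom-heteroatom) traversable in both directions, and the helper names `orient_bonds`, `floyd_warshall` and `longest_geodesic` are hypothetical.

```python
import math


def orient_bonds(bonds, element):
    """Direct each bond (u, v) away from non-carbon atoms and towards carbon atoms."""
    edges = []
    for u, v in bonds:
        u_is_c, v_is_c = element[u] == "C", element[v] == "C"
        if u_is_c and not v_is_c:
            edges.append((v, u))        # heteroatom -> carbon only
        elif v_is_c and not u_is_c:
            edges.append((u, v))
        else:
            edges += [(u, v), (v, u)]   # assumption: same-kind bonds stay two-way
    return edges


def floyd_warshall(vertices, edges):
    """All-pairs shortest path lengths (unit edge weights) plus next-hop table."""
    dist = {u: {v: (0 if u == v else math.inf) for v in vertices} for u in vertices}
    nxt = {u: {v: None for v in vertices} for u in vertices}
    for u, v in edges:
        dist[u][v] = 1
        nxt[u][v] = v
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def longest_geodesic(source, dist, nxt):
    """Longest of the shortest paths that start at `source`."""
    finite = [v for v in dist[source] if v != source and dist[source][v] < math.inf]
    if not finite:
        return [source]
    target = max(finite, key=lambda v: dist[source][v])
    path = [source]
    while path[-1] != target:
        path.append(nxt[path[-1]][target])
    return path


# Toy chain O1-C2-C3-C4 (hypothetical indices).
element = {1: "O", 2: "C", 3: "C", 4: "C"}
bonds = [(1, 2), (2, 3), (3, 4)]
dist, nxt = floyd_warshall(list(element), orient_bonds(bonds, element))
```

Because the O-C edge can only be traversed from oxygen to carbon, the oxygen atom sees the whole chain while the terminal carbon cannot reach the oxygen, so the two ends of the chain receive different path features.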

Figure 5 and Table 1 provide an example for MCS detection between the molecular graphs of D-fructose 6-phosphate and D-xylulose 5-phosphate. As shown previously, the features on each node are concatenated and integer hash-coded, as shown in column “FHV” in Table 1. The feature hash codes of all vertices in $G''$ and $H''$ are combined into a column vector FHV (feature hash code vector) and sorted for natural ordering. Equivalent hash codes in FHV receive equivalent rank order. We create another column vector, the prime number vector $A$, of the same size as FHV. Rows of columns $A$ and FHV link to each other by their indices. Each row in column $A$ stores a prime number based on the rank order of the corresponding element in FHV. If the rank of an element is unique in FHV, the corresponding row in $A$ stores ‘1’. Each vertex in $G''$ and $H''$ can also be linked to a prime number or ‘1’ in $A$ through FHV. For each vertex, we calculate the product of its prime with the primes of its adjacent vertices. The adjacent product for each vertex for the first iteration is shown in column $A'$ in Table 1. For example, in Figure 5, the vertex with index 10 has adjacent vertices with indices 9, 11 and 18. The prime numbers assigned to the adjacent vertices are 23, 23 and 43 respectively. The product for vertex 10 (prime = 17) is 17 × 23 × 23 × 43 = 386699. The products are appended to the original feature and sorted again to check for uniqueness. If in the current iteration no change in the rank ordering of a pair of vertices is noticed, then we identify them as a mapping. The clique identified by the mapping between vertices with common labels between $G''$ and $H''$ is $q = \{v'', e''\}$, where $v''$ represents a mapping between $G''$ and $H''$, and $e''$ is the edge between two vertices and represents a mapping between the edges of $G''$ and $H''$. As illustrated in Figure 5, the common cliques are appended to $g$ at each iteration. For example, in Table 1, iteration $A$ identifies 6 bijections. Only the bijections between the vertex pairs 24, 25 and 51, 52 form a clique (Figure 5A). Figure 5 illustrates the clique recognition and common subgraph expansion at each step. The graph $g = \{v'', e''\}$ obtained upon termination is subgraph isomorphic to both $G''$ and $H''$. We iterate to rank, identify the corresponding primes and compute the product of adjacent primes until we notice no change in the rank ordering of each vertex. The procedure terminates after the sixth iteration as vectors $F$ and $g$ are identical and improvement in the size of the common subgraph is not possible. For the example shown in Figure 5, we do not require the use of the A* search method98 to maximize the subgraph isomorph. If for any comparison the A* search is indeed needed, we limit the search depth to 3, as CLCA without A* now generates an optimal mapping solution in > 99% of the cases. After the introduction of the auxiliary graph data structure, the average time per MCS solution reduced from 36 seconds to 140 milliseconds.
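The rank, prime and adjacent-product steps described above can be sketched in a few lines. This is a simplified illustration, not the CLCA implementation: it refines the labels of a single graph, it assumes that primes are handed out in rank order only to hash codes shared by more than one vertex (unique codes receive 1, as in the text), and the names `first_primes`, `assign_primes`, `adjacent_product` and `refine` are hypothetical.

```python
from collections import Counter
from math import prod


def first_primes(n):
    """Return the first n primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def assign_primes(codes):
    """Map each vertex to a prime by the rank of its code; unique codes get 1."""
    counts = Counter(codes.values())
    shared = sorted(code for code, c in counts.items() if c > 1)
    prime_of = dict(zip(shared, first_primes(len(shared))))
    return {v: prime_of.get(code, 1) for v, code in codes.items()}


def adjacent_product(v, primes, adjacency):
    """Product of the vertex's own prime and the primes of its neighbors."""
    return primes[v] * prod(primes[u] for u in adjacency[v])


def refine(codes, adjacency, max_rounds=50):
    """Repeat rank -> prime -> product until the vertex classes stop splitting."""
    labels = dict(codes)
    for _ in range(max_rounds):
        primes = assign_primes(labels)
        new = {v: (labels[v], adjacent_product(v, primes, adjacency)) for v in labels}
        order = {lab: r for r, lab in enumerate(sorted(set(new.values())))}
        ranked = {v: order[new[v]] for v in new}
        if len(set(ranked.values())) == len(set(labels.values())):
            return ranked                 # no class was split: labeling is stable
        labels = ranked
    return labels


# Worked number from the text: vertex 10 (prime 17) with neighbors 9, 11 and 18.
example = adjacent_product(10, {10: 17, 9: 23, 11: 23, 18: 43}, {10: [9, 11, 18]})
```

With the primes quoted in the text, `example` evaluates to 17 × 23 × 23 × 43 = 386699. On a three-carbon chain whose vertices all start with the same code, `refine` separates the middle vertex from the two equivalent end vertices after one round.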

Applying CLCA on the auxiliary graph provides multiple benefits. In addition to improved computational performance, CLCA can now deal with reactions involving ring formation or breaking. Without the auxiliary graph, CLCA would only identify common substructures close to the reaction center. Using the auxiliary graph data structure, we were able to extend the common subgraph to include reaction centers as well as to increase the size to the maximum subgraph in most cases. This dramatically reduces the reliance on the costly A* search for identifying the maximum subgraph.

Strategies other than the A* search can also be utilized to extend the MCS identified by CLCA120. CLCA generates homotopic solutions for the atom mapping problem (i.e., the same mapping solution for symmetric groups). Figure 5 shows the equivalent oxygen atoms, with mapping = 24, on the phosphate arm of D-fructose 6-phosphate and D-xylulose 5-phosphate. A simple post-processing step to find all possible permutations of mappings produces alternate solutions. All alternate solutions are encoded within a single solution; therefore the mappings provided by CLCA greatly reduce the computational overhead needed to identify and iterate through the entire set of solutions due to symmetry.
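A minimal sketch of such a post-processing step is given below. It assumes that the mapping is available as a dictionary from reactant atom indices to product atom indices and that the groups of interchangeable atoms are already known; the function name `alternate_mappings` and the atom indices are hypothetical.

```python
from itertools import permutations, product


def alternate_mappings(mapping, equivalent_groups):
    """Yield every mapping obtained by permuting targets within each symmetric group.

    mapping: dict reactant atom -> product atom (one representative solution)
    equivalent_groups: list of lists of reactant atoms that are interchangeable
    """
    per_group = []
    for group in equivalent_groups:
        targets = [mapping[a] for a in group]
        per_group.append([dict(zip(group, p)) for p in permutations(targets)])
    for combo in product(*per_group):
        alternative = dict(mapping)
        for partial in combo:
            alternative.update(partial)
        yield alternative


# One carbon-bound atom (1) and three equivalent phosphate oxygens (2, 3, 4).
solutions = list(alternate_mappings({1: 11, 2: 12, 3: 13, 4: 14}, [[2, 3, 4]]))
```

For three equivalent oxygens the representative solution expands into 3! = 6 alternatives, all of which leave the mapping of atom 1 untouched.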


Reaction atom mapping.

A pairwise comparison for the maximum common substructure (MCS) between each product and reactant auxiliary graph is performed using CLCA for computing the atom mapping solution for every reaction. The subset of vertices and edges of the molecular graphs which undergo transformation are denoted as reaction centers. Figure 6 shows the atom mapping solution for the transketolase transformation between the substrates D-Erythrose 4-phosphate and D-Xylulose 5-phosphate. As illustrated in Figure 6C, CLCA is first applied to find common subgraphs between the molecular graphs of D-Fructose 6-phosphate and D-Xylulose 5-phosphate, D-Glyceraldehyde 3-phosphate and D-Xylulose 5-phosphate, D-Fructose 6-phosphate and D-Erythrose 4-phosphate, and D-Glyceraldehyde 3-phosphate and D-Erythrose 4-phosphate. The maximum subgraphs identified between the substrate-product pairs are combined to identify the maximum subgraph between all substrates and products. The identification of the optimal MCS solution for the reaction requires that the solutions from the pairwise matching are combined so as to reduce the overall bond changes.

Multiple mapping solutions exist not only due to the presence of equivalent or symmetric groups but also due to possibly alternate combinations of MCS solutions. Therefore, it is imperative to ensure that all the MCS solutions are correctly identified and assessed for biochemical feasibility. As in most methodologies95,98,121,122, we use the Principle of Minimal Chemical Distance (PMCD) heuristic proposed by Jochum et al.92. PMCD states that computational procedures that predict reaction mechanisms based on largest substructure overlap alone might not always be chemically acceptable. Instead, an MCS solution that minimizes the reordering of bonds is generally closer to the actual reaction mechanism. Figure 6 presents two optimal solutions for atom mappings of the D-glyceraldehyde-3-phosphate glycolaldehyde transferase reaction. In both cases, a minimum of three bond breaks are needed for the transformation between the substrates and the products. Both reaction atom mapping solutions are predicted using the PMCD “largest set of largest substructure” heuristic92. The mapping solutions have been identified by combining the maximum subgraph solutions between the substrate-product pairs. The order of this comparison dictates the outcome of the atom mapping solution and hence each comparison has to be assigned a precedence order. A precedence order is needed to ensure that we correctly recognize the largest overlap with minimum bond breaks in a pairwise comparison as the first solution. For example, in Figure 6C, only one bond is broken in the MCS between D-Xylulose 5-phosphate and D-Glyceraldehyde 3-phosphate. The MCS solution indicated by a|d receives the highest precedence. Also, the MCS between D-Erythrose 4-phosphate and D-Glyceraldehyde 3-phosphate, which involves a single broken bond, would get the highest precedence in an alternate solution scheme. The subgraphs that match with the least bond changes are chosen and successively removed from additional MCS comparisons. This greedy heuristic can be implemented using a variety of algorithms. We chose to use the minimum spanning tree (MST) to identify a precedence order for the various MCS solutions. The following summarizes the procedure for building an MST precedence order:

i. Identify the MCS between the pairs of molecular graphs of products and reactants and ascertain the number of bonds broken and formed.

ii. Generate a transition graph with weighted and directed edges. The vertices in this graph represent the task of comparing the two molecular graphs. The edges incident from these vertices are assigned edge weights that represent the output of the comparison. If we consider the minimization of bond changes, each edge is given a positive weight. The positive value on the outgoing edge represents the number of bonds that need to be broken or formed to get a maximum common subgraph. For example, in Figure 6 vertex a|d represents a comparison between D-Xylulose 5-phosphate and D-Glyceraldehyde 3-phosphate. The outgoing edge is always given a weight of ‘1’ since one edge has to be broken to derive the maximum common subgraph.

iii. Using the Kruskal minimum spanning tree (MST) algorithm we identify the spanning tree with the least weight. Each vertex in the tree identifies an MCS comparison with the highest precedence given to the root.

iv. The MCS solution from the root is fixed and the edge weights are recalculated. The edge weights are modified to reflect the solution proposed by the root. For example, in Figure 6D, after we consider the MCS solution a|d, comparison a|c requires effectively no bond breaks as no MCS between ‘a’ and ‘c’ can be found anymore. b|d requires two bond breaks for an MCS between ‘b’ and the remaining section of ‘d’. Therefore, all edges connected to the vertices comparing ‘a’ or ‘d’ are updated with new weights in the transition graph.

v. Upon completion of the a|d operation all edges outgoing from this vertex are removed.

vi. An MST is identified again and the MCS solution suggested by the new root is appended to earlier MCS comparisons. We iterate to update the transition graph and identify a new root from the MST again every time we fix an MCS solution.

vii. The iterations continue until all the vertices are fully disconnected or there is no change in the solutions proposed.
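The spanning-tree computation in step iii can be sketched with a standard Kruskal implementation. The sketch makes several simplifying assumptions: the transition graph is treated as undirected, the edge weights are invented for illustration and are not the values of Figure 6, and the re-weighting of steps iv to vi is left out. The function name `kruskal` and the comparison labels are hypothetical.

```python
def kruskal(vertices, edges):
    """Return the edges of a minimum spanning tree.

    vertices: iterable of hashable vertex ids
    edges: list of (weight, u, v) tuples for an undirected graph
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the set representative, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):       # lightest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                 # edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Four pairwise comparisons with hypothetical bond-change weights.
vertices = ["a|c", "a|d", "b|c", "b|d"]
edges = [
    (1, "a|d", "b|c"),
    (1, "a|c", "b|d"),
    (2, "a|d", "a|c"),
    (2, "a|d", "b|d"),
    (3, "b|c", "b|d"),
]
tree = kruskal(vertices, edges)
```

Here the tree keeps the two weight-1 edges and one weight-2 edge, for a total weight of 4. When several edges tie, more than one minimum spanning tree exists, which corresponds to the alternate precedence orders mentioned in the text.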

Moreover, we also identify all minimum spanning trees possible for a reaction. Each alternate spanning tree identifies a unique route to an optimal solution. In the example shown, for the minimum bond change condition, two minimum spanning trees are suggested with highest precedence, with roots a|c and a|d respectively. Each tree suggested by the minimum spanning tree algorithm indicates the order of comparisons to be considered to obtain a complete atom mapping solution. Of the two solutions suggested by the Kruskal minimum spanning tree algorithm for the example, the one in Figure 6A has a|c as its first root while the one in Figure 6B has a|d.

Using CLCA for MCS detection and the above stated greedy approach for the MCS solution combination, we generate atom mapping solutions for over 27,000 MetRxn reactions. Figure 7 shows examples of six different reactions, each representing common transformations in metabolism, with their reaction centers highlighted. The reactions are (A) Methionine synthase, (B) Glutamate decarboxylase, (C) L-threonine dehydrogenase, (D) Glutamine synthetase, (E) D-alanine-D-alanine dipeptidase and (F) Glucose-6-phosphate isomerase. The reaction graph of D-alanine-D-alanine dipeptidase depicts the symmetry arising from a stoichiometry of two for D-alanine upon cleavage of D-alanyl-D-alanine. The reaction is catalyzed by the enzyme D-alanine-D-alanine dipeptidase, a Zn2+-dependent enzyme. The enzyme encoded by the gene vanXB in Enterococcus faecium BM4147 is stereospecific for D-alanyl-D-alanine only and does not accept the other three (RS, SS and SR) stereoisomers as substrates123. L-threonine dehydrogenase is an oxidoreductase enzyme providing an example where only changes in bond order were identified, without changes in the position of atoms. Methionine synthase, a transferase enzyme, transfers a methyl group from 5-methyltetrahydrofolate to L-homocysteine to form L-methionine. The changes involved are bond breaks, bond formation and change in position of two vertices. Glutamate decarboxylase, a lyase, provides an example of a cleavage reaction with changes in atom and bond positions between the reactant and product graphs. Glucose-6-phosphate isomerase leads to changes in more than one bond and Glutamine synthetase results in bond and atom position changes in all three molecular graphs. All six reactions combine cleavage, condensation, substitution and unimolecular reaction operations124.

The utility of CLCA in MCS detection is not limited to metabolite graphs alone. We extend the application of CLCA to larger graphs and identify the conserved subsections between pairs of reaction or pathway graphs. These conserved subsections elucidate the common transformation mechanism from one molecule into another. In MetRxn, we incorporate reaction and pathway information from multiple sources. In order to identify common transformations, we perform a comparison for similarity between two reaction graphs and two pathway graphs as shown in Figures 8 and 9 respectively.


Common subgraphs between two reactions.

As shown in the previous sections, a one-to-one MCS search between a pair of molecules is required for reaction atom mapping. The utility of the CLCA algorithm is, however, not limited to MCS searches between a pair of molecular graphs and can also be extended to calculate similarity between larger sets of molecular graphs. We can use similarity scores to cluster, organize and annotate reactions with information such as EC numbers. For example, in a reaction similarity calculation, an MCS search between the entire sets of molecular graphs representing a pair of reactions is performed. We first find the common subgraph between the two reactions and calculate the similarity from the number of atoms/vertices present in the subgraph. As shown in Figure 8, we compare two amino acid oxidoreductase reactions. CLCA clearly identifies the conserved phenyl, carboxyl and amino groups on the reactant side and the conserved phenyl, carboxyl and oxo groups on the product side of both reactions. These conserved components, including the reaction center (shown as dotted bonds), are part of the common subgraph between the two reaction graphs. The only difference arises in the hydroxyl groups present in L-tyrosine:NAD+ oxidoreductase. The common subgraph identifies how similar the two graphs are to each other. We calculate the Jaccard similarity score by finding the ratio of twice the number of vertices in the common subgraph to the total number of vertices in both the reaction graphs. Users of MetRxn22 can search for reactions based on similarity scores for a query reaction or even sets of molecular graphs. Similarity scores in MetRxn can assist users in transferring ontological information such as Enzyme Commission (EC) number annotations from EC annotated reactions to EC unannotated reactions.
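The score described above reduces to simple arithmetic on vertex counts, sketched below. The function names are hypothetical and the vertex counts in the example are invented. Note that the ratio as worded in the text (twice the common vertices over the total vertices of both graphs) has the form usually called the Sørensen-Dice coefficient in the wider literature; the set-based Jaccard index is shown alongside for comparison, and the two are related by J = D / (2 - D).

```python
def overlap_score(n_common, n_first, n_second):
    """Ratio described in the text: 2 * |common| / (|first| + |second|)."""
    return 2 * n_common / (n_first + n_second)


def jaccard_index(n_common, n_first, n_second):
    """Set-based Jaccard index: |common| / |union|, computed from the same counts."""
    return n_common / (n_first + n_second - n_common)


# Hypothetical reaction graphs with 12 and 13 vertices sharing a 10-vertex subgraph.
score = overlap_score(10, 12, 13)
jaccard = jaccard_index(10, 12, 13)
```

Both scores equal 1 for identical graphs and 0 when nothing is shared, so they rank pairs of reactions in the same order even though their numerical values differ.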

Historically, all EC information for reactions is manually annotated and takes into account the co-factors, reaction center and the substrate/product structures. Recent efforts for automated EC classification are not powerful enough to classify a reaction fully and still depend upon expert knowledge to annotate the fourth number correctly. To identify the best EC class, these automated methods calculate similarity scores (e.g., Tanimoto, Jaccard, Sørensen-Dice, etc.) using chemical fingerprints125 or computationally intensive MCS calculations24,121,126–129 for reaction center detection. CLCA has advantages over existing methods since it does not require chemical fingerprints or prior reaction center information for assisting EC annotation of reactions.

Common subgraphs between two metabolic pathways.

Similar to the reaction similarity detection problem, CLCA can be used to identify similarity between metabolic pathways. The pairwise pathway alignment problem130,131 is the task of calculating similarity between two pathways. A large number of pathway alignment algorithms align protein-protein interaction networks or gene-protein interaction networks. The few algorithms available for metabolic pathway alignment align metabolic networks using EC numbers, reaction centers or metabolite identity. Methods that align using EC classifiers align pathways without considering the fourth classifier of the EC number132. Methods that align with metabolite identity cannot handle branched or cyclic pathways, because the aligned pathways need the start substrate and end product to be common133. Similarity calculated on atomistic details provides higher resolution by identifying the conserved moieties as well as conserved reaction mechanisms. As mentioned previously, the pairwise pathway alignment problem is also an MCS search problem. Aligning pathways134 using atomistic details of metabolites would be intractable for existing methods since they cannot handle large graphs. CLCA therefore has clear advantages over existing techniques since it (i) improves resolution of overlap by comparing atoms and bonds, (ii) accommodates reactions without EC annotations, (iii) can handle branched pathways and (iv) has polynomial time complexity. Figure 9 shows a comparison between two branched chain amino acid degradation pathways. We compare valine degradation to isoleucine degradation and correctly identify the common moieties and conserved reaction mechanism between the two. The first three reactions of transamination, decarboxylation and dehydrogenation are catalyzed by three common enzymes. The next two reactions of hydration and dehydrogenation are catalyzed by distinct enzymes. We note that CLCA correctly identifies the reaction centers as well as the moiety transferred at each step.

3. Results and discussion

Ahead of the mapping calculations, all reactions were balanced and standardized using the MetRxn22 workflow. All information regarding protonation, stereochemistry and bond order of atoms was removed or standardized to produce a unique canonical form. Each reaction is unique only in its connectivity. All examples and results presented henceforth consider this canonical form of molecular graphs, unless stated otherwise.

Metabolic information from over 8 different databases and 112 different metabolic models is combined to produce a canonical dataset of over 27,000 reactions. CLCA and A* search took an average of 140 milliseconds per reaction or 1.3 milliseconds per atom to calculate a single atom mapping solution. Overall, when only a single solution is requested, the run over the entire database was completed in 64 minutes using a standard desktop with a 2.3 GHz CPU and 8 GB memory. When all possible alternative solutions were requested, the complete task took 160 milliseconds on average to terminate. When compared with existing methodologies, CLCA performs better than other procedures in terms of literature-reported runtime and accuracy. Alternative mapping solutions arising due to group equivalence as well as alternative objectives were identified rapidly. The largest runtime for any reaction was 7 seconds, for the reaction O16 antigen (x4) ligase (periplasm)112, which includes more than 2,000 atoms.
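As a rough consistency check on the timings quoted above, and treating "over 27,000 reactions" as exactly 27,000 (an assumption made only for this estimate):

```latex
\[
27{,}000 \ \text{reactions} \times 0.140 \ \tfrac{\text{s}}{\text{reaction}}
  = 3{,}780 \ \text{s} = 63 \ \text{min},
\qquad
\frac{140 \ \text{ms per reaction}}{1.3 \ \text{ms per atom}} \approx 108 \ \text{atoms per reaction}.
\]
```

The first figure is in line with the reported 64 minutes for the full database, and the second suggests an average reaction size on the order of a hundred mapped atoms.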

Application to E. coli iAF1260 metabolic model.

We perform various comparisons, automated as well as manual, to validate and test the robustness of CLCA. We started with a manual inspection of all the reactions mapped in the E. coli iAF1260112 metabolic model. We flagged reactions with an abnormally high number of bond changes for manual curation. The bar graph in Figure 10 shows the number of reactions vs. the number of bond changes per atom mapping solution, excluding transport and exchange reactions. Overall atom mapping statistics for the remaining 1293 reactions in the E. coli metabolic model are depicted in Figure 10. 84 reactions were mapped with zero bond changes. These reactions either undergo protonation changes (no hydrogen atoms are tracked) or bond order changes (e.g., single to double). There are also reactions with a large number of bond changes (~30). Such reactions have substrates with a large stoichiometry (> 15), combining multiple elementary steps into a single one. We found seven reactions to be incorrectly mapped by CLCA in the E. coli iAF1260112 model. Figure 11 shows two such examples. We find that all the reactions that failed to map aggregate multiple elementary transformations. The reaction anhydrous-N-acetylmuramyl-tetrapeptide amidase combines multiple reactions of N-acetylmuramoyl-ala amidohydrolase and tetrapeptide L,D-carboxypeptidase, part of the murein recycling pathway. N-acetylmuramoyl-ala amidohydrolase at each step elongates the peptide chain on N-acetylmuramate by adding L-alanine. The reaction tetrapeptide L,D-carboxypeptidase cleaves the molecule L-alanine-D-glutamate-meso-2,6-diaminoheptanedioate-D-alanine one by one, generating five L-alanine molecules. Reactions represented in highly lumped form may lead to incorrect mappings as the exact bond operations are concealed.

Alternate solutions in E. coli iAF1260 metabolic model due to equivalent groups.

We generate the atom mapping solution for two cases. In the first case, we generate homotopic solutions by ignoring the stereochemical information of the molecule. We consider stereochemistry in the second case. Figure 12 shows the difference in the number of unique solutions when stereochemistry is considered. We notice a sharp increase in the number of reactions with only one unique solution. This reflects the fact that most enzymes are highly stereospecific and the identified transitions have to account for stereochemical information. For example, the enterobactin transport reaction, which transports enterobactin from the cytosol to the extracellular environment, involves nine alternate solutions, as the number of symmetries in the reaction graph can be calculated by $\prod_i S_i! \times c_i^{S_i}$, where $S_i$ is the stoichiometric coefficient and $c_i$ is the number of inherent symmetries of the $i$th molecule135.
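The symmetry count can be evaluated directly from the stoichiometric coefficients and per-molecule symmetry numbers. The sketch below assumes the formula is the product over molecules of the factorial of $S_i$ times $c_i$ raised to the power $S_i$; the function name `symmetry_count` is hypothetical, and the value $c = 3$ used for enterobactin on each side of the transport reaction is an assumption chosen because it reproduces the nine solutions quoted in the text.

```python
from math import factorial, prod


def symmetry_count(species):
    """Number of equivalent mappings: product over molecules of S! * c**S.

    species: iterable of (S, c) pairs, where S is the stoichiometric coefficient
    and c is the number of inherent symmetries of that molecule.
    """
    return prod(factorial(s) * c ** s for s, c in species)


# Enterobactin transport: one copy on each side, assumed threefold symmetry.
enterobactin = symmetry_count([(1, 3), (1, 3)])

# A dipeptide cleaved into two copies of an asymmetric product (c = 1 everywhere).
dipeptidase = symmetry_count([(1, 1), (2, 1)])
```

The first case gives 3 × 3 = 9, and the second gives 2, reflecting only the interchange of the two identical product molecules.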

Comparison with existing efforts:

MetRxn aggregates information primarily from three metabolic databases: BRENDA136, KEGG10 and MetaCyc3, of which KEGG and MetaCyc also provide atom mapping information. In addition, recent efforts from Fooshee et al.111 and Kraut et al.113 have made reaction mapping tools available online. We compare atom mapping data from KEGG10, MetaCyc96, ReactionMap111 and ICMAP113 with the atom mapping data generated by CLCA. The KEGG RPAIR110 database is a manually compiled list of reactant pair alignments. To generate reactant pair alignments, each reactant molecular graph is decomposed into 68 functional groups and atom microenvironments110. The functional group, atom microenvironment and alignment information are available in a proprietary format called the KEGG Chemical Function (KCF) format. We downloaded KCF and MOL files for 2636 reactant pairs from http://www.kegg.jp/ version 71.0 and converted them to the SMILES90 format before running CLCA. We found high agreement (99.3%) between the alignments suggested by CLCA and the manually identified alignments from KEGG. Only 16 of the alignments suggested by CLCA disagreed with KEGG RPAIR, because they were non-maximal or incorrect. A few examples of the disagreements are shown in Figure 13. The first alignment, between 2-Oxobutanoate and 1-Aminocyclopropane-1-carboxylate, was incomplete since one of the two symmetric carbons in the cyclopropane ring was not mapped. CLCA was unable to determine the correct bond breakage since, structurally, both C-C bonds are equivalent and breakage of either one would give a maximal alignment. Similarly in Figure 13, the second alignment, between Anhydroglycinol and Daidzein, involves a C-O bond breakage on the furan moiety with two possibilities. The phenol and naphthol moieties are connected by a rotatable bond and CLCA could not ascertain which of the two carbons (in phenol) in the third position was previously connected to the oxygen atom. In the third example, between (-)-leukotoxin B and Isoleukotoxin, CLCA was unable to map the oxygen atom on the oxirane ring of (-)-leukotoxin B. In cases wherein symmetric members of ring moieties were involved in the breakage or formation of bonds, CLCA could not proceed to identify a complete match. As shown in Figure 13, maximal overlaps might not necessarily give a correct solution and biochemical knowledge will always be needed to ascertain the correct bond breakage and formation. The remaining 13 cases of disagreement are shown in Figure S2 of the Supporting Information. Within the MetRxn database, 33 reactant pairs were found with such characteristics and we flag such alignments for manual validation.

MetaCyc3 is a reference database of metabolites, enzymes, reactions and metabolic pathways drawn from more than 30,000 publications. As part of our goal to align large metabolic databases, we have successfully cross-referenced all metabolite and reaction information from MetaCyc with KEGG10, BRENDA136, RHEA137, CHEBI138, HMDB139 and Reactome140. Atom transition information using the recently published MWED (minimum weighted edit-distance) procedure for all reactions in MetaCyc was made available with the release of version 16. The MWED model is an MILP formulation with an objective to minimize the cost of breaking and forming bonds while matching reactants and products. MWED also uses an empirically derived bond breakage/formation propensity metric to get an optimal matching solution. Ring detection and matching is performed separately as a preprocessing step to the complete matching. We obtained the reaction atom mapping information as a SMILES output for over 10,000 reactions for version 18.1 directly from the author. We compared the 10,585 reaction atom mapping solutions of MWED96 with the atom mapping solutions obtained using CLCA. We noticed disagreement in only 232 cases. CLCA suggested fewer bond breaks for 198 reactions while MWED suggested fewer bond breaks for 34 mapping solutions. Biochemical knowledge was used to manually verify the 232 CLCA solutions and 66 erroneous reaction atom mapping solutions were identified. Most of these cases were similar to the examples in Figure 13 or cases where elementary reactions were aggregated into a single reaction. Figure 14 provides an example of an MWED failure mode where CLCA was able to correctly identify a solution with fewer bond rearrangements. In the reaction sporulenol synthase, for the conversion from sporulenol to tetraprenyl-β-curcumene, 11 bond deletions on sporulenol were suggested by MWED whereas CLCA identified an optimal mapping solution with only 5 bond deletions, as shown in Figure 14. We found 38 such reactions wherein linear molecules transformed into polycyclic molecules. Most of these reactions were ring forming/breaking reactions with multiple bonds being formed or broken in the transformation between products and reactants. The MWED ring detection and matching step identifies only the rings that are conserved between the reactant and product. Therefore, in some cases where the polycyclic topology is not conserved between the reactant and products, MWED fails to detect the correct mapping.

In addition to the comparisons presented in the previous section, we compare the CLCA atom mapping solutions of 1000 randomly chosen MetRxn reactions with the atom mapping solutions provided by the ReactionMap algorithm. The ReactionMap algorithm by Fooshee et al.111 uses a combination of MCS search and bipartite matching steps to produce the atom mapping solution. In the first step of ReactionMap, a partial MCS mapping is identified using the OEChem toolkit141. Then, the bipartite matching step111 extends the MCS solution by incorporating chemical knowledge in the form of SMILES encoded cost functions. For 979 reactions, at least one CLCA generated solution matched the atom mapping solution provided by the ReactionMap web interface (http://cdb.ics.uci.edu). For the remaining 21 reactions, CLCA suggested mappings with fewer bond changes than ReactionMap for 20 reactions and an incomplete mapping for phytoene synthetase (see Figure S5). Figure 15 and Figure 16 show mapping solutions for acetyl-CoA acyltransferase by ReactionMap and CLCA, respectively. The reaction mechanism proposed by the CLCA mapping solution suggests a transfer of the hydroxybenzoyl group from hydroxybenzoyl-acetyl-CoA to CoA after the formation of a C-S bond to produce 4-hydroxybenzoyl-CoA and acetyl-CoA. Unlike MWED, wherein failure modes were specific to certain topological features, ReactionMap failures were primarily associated with reactions with greater than 200 atoms and reactions involving acyl-CoAs as reactants or products. As shown in Figure 15, ReactionMap frequently suggested reaction centers on the acyl and neopentane moieties. The 21 reactions for which CLCA and ReactionMap provide differing atom mapping solutions are available as SMILES in Table S3.

Applicability of the CLCA reaction mapping procedure is not limited to the MetRxn curated and mass balanced reaction database, and can be extended to other large unbalanced reaction databases as well. Various efforts devoted to the reaction atom mapping of large reaction databases have been reviewed in recent articles by Ehrlich and Rarey108 and Chen et al.91. They compare various MCS and reaction mapping algorithms and identify specific shortcomings while mapping chemical structures with certain topological features. Following the conclusions presented in the aforementioned reviews, Kraut et al.113 present a comprehensive dataset of 104 reactions. The CLASSIFY test set version 1.0 is categorized by difficulty into seven groups and is topologically representative of reactions in large commercial databases. We compare the solutions suggested by CLCA with the InfoChem ICMAP generated atom mapping solutions. Prior to mapping, reactant and product copies were added to minimize the difference in the number of reactant and product atoms in unbalanced reactions. There were disagreements for only five reactions: two in group G7, two in group G6, and one in group G5. Figure 17 shows the two mapping solutions by ICMAP and CLCA, respectively, for a hypoxanthine derivative synthesis reaction142. Figure 17A shows the correct mapping solution by ICMAP142. Figure 17B shows the incorrect mapping solution by CLCA wherein the phenyl moiety of phenylazomalonamidamidine is mapped to the chlorophenyl moiety of 2,8-di(p-chlorophenyl)hypoxanthine. This reaction was placed in group G6 by Kraut et al.113 as the reaction information provided is incomplete and requires another copy of a reactant (i.e., a total of two p-chlorobenzaldehyde). An incomplete mapping was suggested by CLCA as the reactant copy addition step did not suggest the addition of another p-chlorobenzaldehyde to the reaction. Figure 18B shows a correct mapping solution for a furanose silylation reaction143 generated by CLCA after reactant copies were added. CLCA-suggested mappings for all 104 test case reactions are available as reaction SMILES in Table S4.

4. Summary

We present a robust algorithm capable of identifying common substructures even for large molecular graphs. CLCA is novel: it differs from existing algorithms in conceptual design and overcomes their previously mentioned drawbacks. We show that CLCA is accurate even with large molecular graphs with complex topologies (rings and symmetric groups), and outperforms other algorithms in terms of computational complexity. The key operations performed in CLCA are the feature string sorting and prime number assignment step, the product of primes step and the Floyd-Warshall all-pairs shortest path calculation step. The sorting, prime assignment and product of primes steps have a convergence complexity of ω(n), similar to the canonical labeling algorithm presented by Weininger et al.25, and the Floyd-Warshall algorithm has a runtime complexity of O(n^3). All the steps in CLCA are highly parallelizable, making it well suited for vectorization. CLCA has advantages over existing algorithms in performance and accuracy with the additional capability to handle unbalanced reactions, find multiple optimal mappings, handle stereochemistry and handle large complex structures with polynomial time computational complexity. CLCA is integrated within the MetRxn22 database allowing for rapid searches for similar molecules and biochemical transformations using Jaccard and Tanimoto similarity indices. Atom maps for all reactions are available online on MetRxn22 for download. The Java executable and usage directions will be made available online for download at www.metrxn.che.psu.edu.
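The Floyd-Warshall step mentioned above can be illustrated with a minimal sketch. It assumes an unweighted, undirected molecular graph given as atom-index pairs; the function names are hypothetical and this is not the CLCA implementation. The per-vertex maximum of the resulting distances corresponds to the longest graph geodesic used as a vertex feature.

```python
from math import inf

def all_pairs_shortest_paths(n, bonds):
    """Floyd-Warshall on an unweighted, undirected molecular graph.

    n     -- number of atoms (vertices), indexed 0..n-1
    bonds -- iterable of (i, j) atom-index pairs
    """
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j in bonds:
        dist[i][j] = dist[j][i] = 1
    # Three nested loops over the vertices give the cubic runtime.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def longest_geodesics(dist):
    """Longest shortest path from each vertex to any reachable vertex."""
    return [max(d for d in row if d != inf) for row in dist]

# Heavy-atom skeleton of ethanol: C0-C1-O2
dist = all_pairs_shortest_paths(3, [(0, 1), (1, 2)])
print(longest_geodesics(dist))
```

Each update in the innermost loop for a fixed k depends only on values from row k and column k, which is one reason the step lends itself to vectorization.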

Marvin 6.3.142, (2014) ChemAxon (http://www.chemaxon.com) was used for drawing, displaying and characterizing chemical structures, substructures and reactions.

Acknowledgments

We acknowledge Saratram Gopalakrishnan for contributions during the manual verification of the atom mapping results for over 2000 reactions. We thank Anupam Chowdhury, Rajib Saha and Ali Khodayari for their valuable inputs during the preparation of the manuscript. We thank Mario Latendresse from SRI International for sharing the MetaCyc reaction atom mappings for comparison and evaluation with CLCA atom mapping data. We also thank the reviewers for their invaluable comments and suggestions for improving the manuscript.

Abbreviations

CLCA, Canonical Labeling for Clique Approximation; MCS, Maximum Common Substructure; MST, Minimum Spanning Tree; EC, Enzyme Commission or Extended Connectivity algorithm; MWED, Minimum Weighted Edit-Distance; GED, Graph Edit Distance; FHV, Feature Hash Vector

Figures and tables.

Figure 2-1 CLCA workflow.

The numbers shown in the table in columns A, D and G represent the labels generated by CLCA. Panel B shows the prime number assignment after ordering step A. Panel C shows the product with adjacent prime for each vertex. After the termination of the canonical labeling step (i.e., steps A through G), we extend the subgraph size using the A* search methodology. A* traversal and subgraph extension always starts from the largest fragment (i.e., the subgraph highlighted in blue) towards the unmapped vertices in the graph (i.e., the subgraph in red). The two non-equivalent atoms, identified after the termination of A* search, are stamped with different numbers (i.e., 1 and 5).

Figure 2-2 Canonical labeling with stereodescriptors.

Labeling with the three additional stereodescriptors is applied to three molecules, citrate, 2,6-diaminopimelate and dimethylallyl diphosphate, identified by graphs A, B and C, respectively. Graphs A1, B1 and C1 identify chemically equivalent atoms when stereodescriptors are ignored, while A2, B2 and C2 differentiate stereochemically distinct atoms. Stereoisomers LL-2,6-diaminopimelate and (2R,6R)-2,6-diammonioheptanedioate represented by graphs B3 and B4 have chemically and stereochemically equivalent atoms due to the presence of a rotation axis.

Figure 2-3 Addition of artificial vertices.

Molecular graphs of D-Fructose 6-phosphate (G') and D-Xylulose 5-phosphate (H') are converted into their auxiliary graphs G'' and H'', respectively. The smaller vertices are the artificial vertices. The auxiliary graphs are created by replacing each bond with an artificial vertex. The region in blue is the mapping identified by CLCA. The region in red is the mapping identified using A* search. Notice the increase in the mapping regions between panels A and B, identified by CLCA after the introduction of artificial vertices. An artificial vertex allows for the removal of a bond in place of an atom from the MCS search (i.e., assignment of ‘1’ instead of a prime number), thereby allowing CLCA to label a larger subgraph.

Figure 2-4 Addition of directional edges.

Vertices representing carbon atoms have edges directed towards them while vertices for non-carbon atoms have edges directed away from them. If two vertices containing incoming or outgoing edges are adjacent, then the edges between them are bidirectional. Artificial vertices are assigned directionality according to the adjacent atom-representing vertices. The dashed line represents the longest graph geodesic for the vertices a1 and b1 in both panels A and B. The longest geodesics of a1 and b1 terminate at carbon vertices in panel B as non-carbon vertices are unreachable due to edge directions. Notice the increase in the mapping regions between panels A and B, identified by CLCA after the introduction of directional edges. By calculating the longest geodesic feature on a directed graph, an MCS is identified without the A* search step.

Figure 2-5 CLCA using the auxiliary graph data structure.

Panels A through F show the increase in size of g due to the repeated addition of cliques q. Cliques q are identified if two adjacent labels are common to vertices in both G'' and H''. The final graph g, after iterations terminate, is subgraph-isomorphic to both G'' and H''.

Figure 2-6 Alternate solutions.

The reaction D-Fructose 6-phosphate:D-glyceraldehyde-3-phosphate glycolaldehyde transferase has two atom mapping solutions. Both solutions shown in Panels A and B follow the Principle of Minimum Chemical Distance92 (PMCD). The highlighted regions in Panels A and B represent the common substructures between reactant and product. Reactant-product pairwise MCS searches are performed (Panel C) and MCS combinations that obey PMCD are chosen (Panels A and B). To identify an MCS combination with the fewest bond breaks, a greedy procedure using a minimum spanning tree (MST) is implemented (Panel D). After each MST is identified, the subgraphs in the MCS solution suggested by the root are removed and a reactant-product pair MCS search is repeated. Since four MCS solutions exist, at most four MSTs are required. The root for the first MST receives the highest precedence while the root for the last MST receives the least precedence. The pairwise MCS solutions are then combined using the root precedence order to identify complete reaction atom mapping solutions. The MCS precedence depicted in Panel D identifies the biochemically correct solution shown in Panel B.

Figure 2-7 Reaction atom mapping.

CLCA identifies correct reaction atom maps for reactions representing six common EC classifications. The highlighted regions identify the common fragments between reactant and product.

Figure 2-8 Reaction similarity.

Panel A shows the reaction L-phenylalanine:NAD+ oxidoreductase and panel B shows the reaction L-tyrosine:NAD+ oxidoreductase. Reaction centers are depicted using dotted bonds and common substructures are highlighted with contrasting colors for visual clarity. A Jaccard similarity value of 0.96 is calculated from the number of atoms in the common substructures.

Figure 2-9 Comparison of the two branched chain amino acid degradation pathways for Valine (A) and Isoleucine (B). Common subgraphs between the two pathways are highlighted and the reaction centers are identified by the dotted bonds for visual clarity.

Figure 2-10 Bond changes per reaction statistics for the E. coli iAF1260 metabolic model.

The x-axis depicts the number of bond changes across all substrates and products in a reaction. This graph allows us to visually isolate reactions that may have been mapped incorrectly, as indicated by too many bond changes. Note that 86 reactions were mapped without any bond changes as we do not track changes due to bond-order change (i.e., single to double) or bonds with H atoms.

Figure 2-11 Example of possibly incorrect mapping from iAF1260.

Reactions 4-amino-2-methyl-5-phosphomethylpyrimidine synthetase (panel A) and anhydrous-N-Acetylmuramyl-tetrapeptide amidase (panel B) as mapped by CLCA involved too many bond breaks and formations (> 6). The connected common subgraphs between reactant and product are highlighted with contrasting colors for visual clarity.

Figure 2-12 Alternate solutions due to equivalent or symmetric carbon groups in the E. coli iAF1260112 metabolic model.

The x-axis depicts the number of alternate mapping solutions a reaction might have when stereochemistry is ignored/considered. As many as 425 reactions that were producing alternative mappings were reduced to unique mappings when stereospecificity information was considered.

Figure 2-13 CLCA incomplete mapping.

CLCA was unable to proceed with maximal mapping due to the presence of symmetric members in the reaction center of cyclic moieties. Such cases were rare and only 33 such reactant pairs were found in the entire MetRxn database. A complete mapping solution for transformation in panel B was however realized by A* search, but the mapping solution for atoms highlighted in Panel A and C require additional chemical information.

Figure 2-14 Comparison with MetaCyc MWED.

For reactions involving polycyclic compounds, MWED suggested a large number of bond changes, often suggesting breaking up of the entire reactant molecule before its transformation into the product molecules. As many as 34 additional reactions where the polycyclic topology was not preserved during the transformation led to incorrect atom mapping solutions. Panel A shows the MWED generated solution whereas panel B shows the CLCA correctly mapped solution as confirmed in KEGG. CLCA identifies the mapping solution with only 5 bond deletions on the reactant side. The highlighted regions in contrasting colors depict the connected common subgraphs between reactant and product. When compared to MWED, CLCA identifies a larger connected common subgraph with fewer bond rearrangements (shown as dotted bonds).

Figure 2-15 ReactionMap solution for acetyl-CoA acyltransferase.

For the conversion of CoA (coenzyme A) to acetyl-CoA by acetyl transfer, ReactionMap suggests 16 bond rearrangements. The solution suggested by ReactionMap indicates the reaction center (shown as dotted bonds) to exist on hydroxybenzoyl and neopentane moieties for the molecules 4-hydroxybenzoyl-acetyl-CoA and CoA, respectively.

Figure 2-16 CLCA solution for acetyl-CoA acyltransferase.

For the conversion of CoA to acetyl-CoA by acetyl transfer, CLCA suggests 2 bond rearrangements. The solution suggested by CLCA indicates the transfer of 3-hydroxybenzaldehyde from 4-hydroxybenzoyl-acetyl-CoA onto CoA to produce 4-hydroxybenzoyl-CoA and acetyl-CoA.

Figure 2-17 Incorrect CLCA solution for a CLASSIFY dataset reaction.

Panel A shows a correct solution from ICMAP wherein the phenyl moiety of phenylazomalonamidamidine remains unmapped. Panel B shows an incorrect mapping solution from CLCA wherein the phenyl moiety from phenylazomalonamidamidine is mapped to the chlorophenyl moiety of 2,8-di(p-chlorophenyl)-hypoxanthine.

Figure 2-18 Correct CLCA solution for a CLASSIFY dataset reaction.

Panel A shows the mapping solution generated by ICMAP wherein the atoms from tert-butyldimethylsilyl chloride are mapped thrice to atoms in 1,2,3,4 tert-butyldimethylsilyl furanose. Panel B shows the same solution generated by CLCA after reactant copies of tert-butyldimethylsilyl were added.

Table 2-1 Characterization and prime number assignment for figure 4.

The feature hash vector (FHV) is the hash value of the string that encodes the features and longest graph geodesic for each vertex. Prime numbers in column A are assigned based on the FHV and products A' are only calculated for vertices with no mapping. Subsequent prime numbers in columns B through F are assigned based on the products in columns A' through E' at each iteration. Iterations terminate when the prime values in subsequent iterations are the same (i.e., F and G). At each iteration, if a mapping (green) is identified, the prime value assigned to those vertices will remain fixed.

MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases

Akhil Kumar1, Patrick F Suthers and Costas D. Maranas2

This chapter has been previously published in modified form in BMC Bioinformatics. (MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases,

Akhil Kumar, Patrick F Suthers, Costas D. Maranas), BMC Bioinformatics, 1, 6)

1The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA;

2Department of Chemical Engineering, Pennsylvania State University, University Park, PA;

Increasingly, metabolite and reaction information is organized in the form of genome-scale metabolic reconstructions that describe the reaction stoichiometry, directionality, and gene to protein to reaction associations. A key bottleneck in the pace of reconstruction of new, high-quality metabolic models is the inability to directly make use of metabolite/reaction information from biological databases or other models due to incompatibilities in content representation (i.e., metabolites with multiple names across databases and models), stoichiometric errors such as elemental or charge imbalances, and incomplete atomistic detail (e.g., use of generic R-group or non-explicit specification of stereo-specificity).

MetRxn is a knowledgebase that includes standardized metabolite and reaction descriptions by integrating information from BRENDA, KEGG, MetaCyc, Reactome.org and 44 metabolic models into a single unified data set. All metabolite entries have matched synonyms, resolved protonation states, and are linked to unique structures. All reaction entries are elementally and charge balanced. This is accomplished through the use of a workflow of lexicographic, phonetic, and structural comparison algorithms.

MetRxn allows for the download of standardized versions of existing genome-scale metabolic models and the use of metabolic information for the rapid reconstruction of new ones.

The standardization in description allows for the direct comparison of the metabolite and reaction content between metabolic models and databases and the exhaustive prospecting of pathways for biotechnological production. This ever-growing dataset currently consists of over 76,000 metabolites participating in more than 72,000 reactions (including unresolved entries). MetRxn is hosted on a web-based platform that uses relational database models (MySQL).

Background

The ever accelerating pace of DNA sequencing and annotation information generation144 is spearheading the global inventorying of metabolic functions across all kingdoms of life. Increasingly, metabolite and reaction information is organized in the form of community145, organism, or even tissue-specific genome-scale metabolic reconstructions. These reconstructions account for reaction stoichiometry and directionality, gene to protein to reaction associations, organelle reaction localization, transporter information, transcriptional regulation and biomass composition. Already over 75 genome-scale models are in place for eukaryotic, prokaryotic and archaeal species146 and are becoming indispensable for computationally driving engineering interventions in microbial strains for targeted overproductions147–149, elucidating the organizing principles of metabolism150–153 and even pinpointing drug targets154,155. A key bottleneck in the pace of reconstruction of new high quality metabolic models is our inability to directly make use of metabolite/reaction information from biological databases156 (e.g., BRENDA157, KEGG158, MetaCyc, EcoCyc, BioCyc159, BKM-react160, UM-BBD161, Reactome.org162, Rhea137, PubChem, ChEBI138 etc.) or other models163 due to incompatibilities of representation, duplications and errors, as illustrated in Figure 1. A major impediment is the presence of metabolites with multiple names across databases and models, and in some cases within the same resource, which significantly slows down the pooling of information from multiple sources. Therefore, the almost unavoidable inclusion of multiple replicates of the same metabolite can lead to missed opportunities to reveal (synthetic) lethal gene deletions, repair network gaps and quantify metabolic flows. Moreover, most data sources inadvertently include some reactions that may be stoichiometrically inconsistent164 and/or elementally/charge unbalanced165,166, which can adversely affect the prediction quality of the resulting models if used directly. Finally, a large number of metabolites in reactions are only partly specified with respect to structural information and may contain generic side groups (e.g., alkyl groups -R), varying degree of a repeat unit participation in oligomers, or even just compound class identification such as “an amino acid” or “electron acceptor”. Over 3% of all metabolites and 8% of all reactions in the aforementioned databases and models exhibit one or more of these problems. There have already been a number of efforts aimed at addressing some of these limitations. The Rhea database, hosted by the European Bioinformatics Institute, aggregates reaction data primarily from IntEnz167 and ENZYME168, whereas Reactome.org is a collection of reactions primarily focused on human metabolism140,162. Even though

they crosslink their data to one or more popular databases such as KEGG, ChEBI, NCBI, Ensembl, Uniprot, etc., both retain their own representation formats. The more recent BKM-react database is a non-redundant biochemical reaction database containing known enzyme-catalyzed reactions compiled from BRENDA, KEGG, and MetaCyc159. The BKM-react database currently contains 20,358 reactions. Additionally, the contents of five frequently used human metabolic pathway databases have been compared169. An important step forward for models was the BiGG database, which includes seven genome-scale models from the Palsson group in a consistent nomenclature and exportable in SBML format170–172. Research towards integrating genome-scale metabolic models with large databases has so far been even more limited. Notable exceptions include the partial reconciliation of the latest E. coli genome-scale model iAF1260 with EcoCyc4 and the aggregation of data from the Arabidopsis thaliana database and KEGG for generating genome-scale models173 in a semi-automated fashion. Additionally, ReMatch integrates some metabolic models, although its primary focus is on carbon mappings for metabolic flux analysis174. Also, many metabolic models retain the KEGG identifiers of metabolites and reactions extracted during their construction175,176. An important recent development is the web resource Model SEED that can generate draft genome-scale metabolic models drawing from an internal database that integrates KEGG with 13 genome-scale models

(including six of the models in the BiGG database)177. All of the reactions in Model SEED and BiGG are charge and elementally balanced. In this paper, we describe the development and highlight applications of the web-based resource MetRxn that integrates, using internally consistent descriptions, metabolite and reaction information from 8 databases and 44 metabolic models. The MetRxn knowledgebase (as of October 2011) contains over 76,000 metabolites and 72,000 reactions (including unresolved entries) that are charge and elementally balanced. By conforming to standardized metabolite and reaction descriptions, MetRxn enables users to efficiently perform queries and comparisons across models and/or databases. For example, common metabolites and/or reactions between models and databases can rapidly be generated along with connected paths that link source to target metabolites. MetRxn supports export of models in SBML format. New models are being added as they are published or made available to us. It is available as a web-based resource at http://metrxn.che.psu.edu.

Construction and Content

MetRxn construction

The construction of MetRxn largely followed the steps illustrated in Figure 2: 1) download of primary sources of data from databases and models, 2) integration of metabolite and reaction data, 3) calculation and reconciliation of structural information, 4) identification of overlaps between metabolite and reaction information, 5) elemental and charge balancing of reactions, 6) successive resolution of remaining ambiguities in description.

Step 1 Source data acquisition

Metabolite and reaction data were downloaded from BRENDA, KEGG, BioCyc, BKM-react and other databases using a variety of methods based on protocols such as SOAP, FTP and HTTP. We preprocessed the data into flat files that were subsequently imported into the knowledgebase. All original information pertaining to metabolite name, abbreviations, metabolite geometry, related reactions, catalyzing enzyme and organism name, gene-protein-reaction associations, and compartmentalization was retained. For all 44 initial genome-scale models listed, the online information from the corresponding publications was also imported. The source code for all parsers used in Step 1 is available on the MetRxn website.

Step 2 Source data parsing

The “raw data” from both databases and models was unified using standard SQL scripts on a MySQL server. The description schema for metabolites includes source, name, abbreviations used in the source, , and geometry. The schema for reactions accounts for source, name, reaction string (reactants and products), organism designation, associated enzymes and genes, EC number, compartment, reversibility/direction, and pathway information. Once a source has been imported into the MySQL server, a data source-specific dictionary is created to map metabolite abbreviations onto names/synonyms and structures and metabolites to reactions.
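As a rough illustration of the description schema and the source-specific dictionary, the sketch below uses plain Python objects as a stand-in for the MySQL tables. All class, field and source names are assumptions chosen to follow the prose; they are not the actual MetRxn table definitions, and the example records are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class MetaboliteEntry:
    source: str                      # database or model the entry came from
    name: str
    abbreviation: str                # abbreviation used in the source
    geometry: Optional[str] = None   # structure as provided by the source, if any

@dataclass
class ReactionEntry:
    source: str
    name: str
    reaction_string: str             # reactants and products
    organism: str = ""
    enzymes: List[str] = field(default_factory=list)
    genes: List[str] = field(default_factory=list)
    ec_number: str = ""
    compartment: str = ""
    reversible: bool = True
    pathway: str = ""

def build_source_dictionary(metabolites: List[MetaboliteEntry]) -> Dict[str, Dict[str, MetaboliteEntry]]:
    """Map source -> abbreviation -> entry (one dictionary per data source)."""
    lookup: Dict[str, Dict[str, MetaboliteEntry]] = {}
    for m in metabolites:
        lookup.setdefault(m.source, {})[m.abbreviation] = m
    return lookup

# Placeholder records for two hypothetical sources.
entries = [
    MetaboliteEntry("model_A", "D-glucose", "glc-D"),
    MetaboliteEntry("database_B", "D-glucose", "C00031"),
]
lookup = build_source_dictionary(entries)
print(lookup["model_A"]["glc-D"].name)
```

Keeping one dictionary per source means the same abbreviation can carry different meanings in different sources without collision.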

Step 3 Metabolite charge and structural analysis

We used Marvin (ChemAxon) to analyze all 218,122 raw metabolite entries containing structural information (out of a total of 322,936, including BRENDA entries). Inconsistencies were found in 12,965 entries typically due to wrong atom connectivity, valence, bond length or stereochemical information, which were corrected using APIs available in Marvin. A final corrected version of the metabolite geometries was calculated at a fixed pH of 7.2 and converted into standard Isomeric SMILES format. The structure/formula used corresponded to the major microspecies found during the charge calculation, which effectively rounds the charge to an integer value in accordance with previous model construction conventions. The Isomeric SMILES format includes both chiral and stereo information, as it allows specification of molecular configuration90,178,179. Metabolites were also annotated with Canonical SMILES using the OpenBabel Interface from ChemSpider. The canonical representation encodes only atom-atom connectivity while ignoring all conformers for a metabolite. Using bond connectivity information from the primary sources and resources such as PubChem and ChemSpider we used Canonical SMILES90,180 to resolve the identity of 34,984 metabolites and 32,311 reactions. Another 6,100 metabolites and 11,401 reactions involved, in various degrees, lack of full atomistic detail in their description (e.g., use of R or X as side chains, or generic compounds like “amino acid” or “electron acceptor”). Over 25,000 duplicate metabolites and 27,000 reaction entries were identified and consolidated within the database. The metabolites and reactions present in the resolved repository were further classified with respect to the completeness of atomistic detail in their description.

Step 4 Metabolite synonyms and initial reaction reconciliation

Raw metabolite entries were assigned to Isomeric SMILES representations whenever possible. If insufficient structural information was available for a downloaded raw metabolite then it was temporarily assigned the Canonical SMILES and revisited during the reaction reconciliation. Canonical SMILES retain atom connectivity but not stereo-specificity and are used as the basic metabolite topology descriptors as many metabolic models lack stereo-specificity information. After generating the initial metabolite associations, we identified reaction overlaps using the reaction synonyms and reaction strings along with the metabolite SMILES representations. Directionality and cofactor usage were temporarily ignored. During this step, reactions were flagged as single-compartment or two-compartment (i.e., transport reactions). MetRxn internally retains the original compartment designations, but currently only displays these simplified compartment designations. In analogy to metabolites, reactions were grouped into families that shared participants but in the source data sets occurred in different compartments or differed only in protonation.
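One way to picture the overlap detection described above is to reduce each reaction to a key that ignores direction and cofactors. The sketch below is an assumption-laden illustration: the function name, the ignored-species set and the short structure strings are hypothetical placeholders, not MetRxn's internal representation.

```python
from typing import FrozenSet, Iterable, Set

# Hypothetical set of species ignored during the comparison (placeholder labels
# standing in for the structure strings of cofactors and protons).
IGNORED: Set[str] = {"NAD+", "NADH", "H+"}

def reaction_signature(reactants: Iterable[str], products: Iterable[str]) -> FrozenSet[FrozenSet[str]]:
    """Direction-independent key built from metabolite structure strings."""
    left = frozenset(s for s in reactants if s not in IGNORED)
    right = frozenset(s for s in products if s not in IGNORED)
    # A frozenset of the two sides makes A -> B and B -> A compare equal.
    return frozenset({left, right})

# Ethanol <-> acetaldehyde, written in opposite directions by two sources.
forward = reaction_signature(["CCO", "NAD+"], ["CC=O", "NADH", "H+"])
reverse = reaction_signature(["CC=O", "NADH"], ["CCO", "NAD+"])
print(forward == reverse)
```

Because the key is hashable, reactions from many sources can be grouped with an ordinary dictionary keyed on the signature.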

Step 5 Reaction charge and elemental balancing

Once metabolites were assigned correct elemental composition and protonation states, reactions were charge and elementally balanced. To this end, for charge balancing we relied on a linear programming representation that minimizes the difference in the sum of the charge of the reactants and the sum of the charge on the products. The complete formulation is provided in the documentation at MetRxn.
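The quantity being minimized can be made concrete with a small check. The sketch below is a deliberate simplification and not the linear programming formulation referred to above: it only computes the elemental and charge imbalance of a reaction and, under the assumption that protons are the only species added, reports how many H+ would restore balance. Function names and the input layout are hypothetical.

```python
from collections import Counter
import re

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def imbalance(reaction):
    """Return (elemental imbalance, charge imbalance) as products minus reactants.

    reaction -- list of (coefficient, formula, charge); coefficients are
                negative for reactants and positive for products.
    """
    elements = Counter()
    charge = 0
    for coeff, formula, q in reaction:
        for element, n in parse_formula(formula).items():
            elements[element] += coeff * n
        charge += coeff * q
    return {e: n for e, n in elements.items() if n != 0}, charge

def protons_to_add(reaction):
    """H+ to add to the product side (negative: reactant side), or None if
    protons alone cannot balance the reaction."""
    elements, charge = imbalance(reaction)
    if not elements and charge == 0:
        return 0
    if set(elements) == {"H"} and elements["H"] == charge:
        return -charge
    return None

# ATP hydrolysis written without the proton: ATP(4-) + H2O -> ADP(3-) + HPO4(2-)
atp_hydrolysis = [
    (-1, "C10H12N5O13P3", -4),
    (-1, "H2O", 0),
    (1, "C10H12N5O10P2", -3),
    (1, "HO4P", -2),
]
print(protons_to_add(atp_hydrolysis))
```

In this example the product side is short by one hydrogen and one positive charge, so a single proton on the product side balances the reaction.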

Step 6 Iterative reaction reconciliation

Reactions with one (or more) unresolved reactants and/or products were string compared against the entire resolved collection of reactions. This step was successively executed as newly resolved metabolites and reactions could enable the resolution of previously unresolved ones. After the first pass 164 metabolites were resolved, while subsequent passes (up to 18 for some models) helped resolve a total of 8,720 entries.
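The reason several passes are needed can be seen in a toy version of this loop. The sketch below assumes a much simpler matching rule than the string comparison used in MetRxn: a reaction with exactly one unresolved participant is matched to a resolved reaction that contains all of its known structures plus one more. The function name, the data layout and the default pass limit are illustrative assumptions.

```python
def resolve_iteratively(reference, reactions, resolved, max_passes=18):
    """Resolve unknown metabolite names by matching against resolved reactions.

    reference -- list of frozensets of structure identifiers (resolved reactions)
    reactions -- list of tuples of metabolite names (partially resolved reactions)
    resolved  -- dict mapping already known names to structure identifiers
    """
    resolved = dict(resolved)
    for _ in range(max_passes):
        newly = {}
        for participants in reactions:
            unknown = [m for m in participants if m not in resolved]
            if len(unknown) != 1:
                continue
            known = {resolved[m] for m in participants if m in resolved}
            # Candidate identities: the single structure left over in a
            # reference reaction that contains every known participant.
            candidates = {
                next(iter(ref - known))
                for ref in reference
                if known < ref and len(ref) == len(known) + 1
            }
            if len(candidates) == 1:
                newly[unknown[0]] = candidates.pop()
        if not newly:
            break  # a full pass resolved nothing new
        resolved.update(newly)
    return resolved

reference = [frozenset({"S1", "S2", "S3"}), frozenset({"S3", "S4"})]
reactions = [("a", "b", "x"), ("x", "y")]
result = resolve_iteratively(reference, reactions, {"a": "S1", "b": "S2"})
print(result["x"], result["y"])
```

Here "y" can only be resolved on the second pass, after "x" has been identified on the first.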

Reactions with significant (but not complete) overlapping sets of reactants/products are additionally sent to the curator GUI including phonetic information. Briefly, the phonetic tokens of synonyms with known structures were compared against the ones without any associated structure. The algorithm suppresses keywords/tokens depicting stereo information such as cis, trans, L-, D-, alpha, beta, gamma, and numerical entries because they change the phonetic signature of the synonym under investigation. In addition, the algorithm ignores non-chemistry-related words (e.g., use, for, experiment) that are found in some metabolite names. Certain tokens such as “-ic acid” and “-ate” are treated as equivalent. PubChem and ChemSpider sources were accessed through the GUI so that the curator gets as much information as possible to identify the data correctly. Phonetic matches provided clues for resolving over 159 metabolites. The iterative application of string and phonetic comparison algorithms resolved as many as 8,879 metabolites after 18 rounds of reconciliation. Upon completion of this workflow, all genome-scale models are reformatted into a computation-ready form and Flux Balance Analysis181 is performed on both the source model and the standardized model in MetRxn to ascertain the ability of the model to produce biomass before and after standardization. We performed the calculations using GAMS version 12.6. MetRxn is accessible through a web interface that indirectly generates MySQL queries. In order to facilitate analysis and use of the data, a number of tools are provided as part of MetRxn.
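The token handling used before the phonetic comparison of synonyms can be sketched as follows. The text does not name the phonetic encoding itself, so this example stops at the normalization stage; the token lists are taken from the examples given above, and the function name and splitting rule are assumptions.

```python
import re

# Tokens suppressed before comparison (examples taken from the description above).
STEREO_TOKENS = {"cis", "trans", "l", "d", "alpha", "beta", "gamma"}
NON_CHEMISTRY_WORDS = {"use", "for", "experiment"}

def normalize_name(name):
    """Reduce a metabolite synonym to a tuple of comparable tokens."""
    text = name.lower()
    # Treat "-ic acid" and "-ate" endings as equivalent.
    text = re.sub(r"ic acid\b", "ate", text)
    kept = []
    for token in re.split(r"[^a-z0-9]+", text):
        if not token or token.isdigit():
            continue  # drop empty pieces and numerical entries
        if token in STEREO_TOKENS or token in NON_CHEMISTRY_WORDS:
            continue
        kept.append(token)
    return tuple(kept)

print(normalize_name("L-Glutamic acid") == normalize_name("D-glutamate"))
print(normalize_name("cis-aconitic acid"))
```

Because stereo tokens are dropped on purpose, a match at this stage is only a clue for the curator and cannot distinguish stereoisomers.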

Data export and display

MetRxn supports a number of export capabilities. In general, any list that is displayed contains live links to the metabolite or reaction entities. These lists can consist of an entire model, data from a comparison, or query results. All items can be exported to SBML format. In addition, the public MySQL database will be made available upon request.

Because of licensing limitations, the BRENDA database cannot be exported and is not part of the public MySQL database. However, we plan to provide Java source code that allows for the integration of a local copy of the public MySQL database with the BRENDA database (provided upon request).

Source comparisons and visualization

In addition to listing the content (number of metabolites, reactions, etc.) of the selected data source(s), MetRxn contains tools for comparing two or more models and visualizing the results. These associations can be for metabolites or reactions. During these comparisons compartment information and reversibility are suppressed. Comparison tables are generated by comparing the associations between the selected data source(s) using the canonical structures.


MetRxn Scope

An initial repository of reaction (i.e., 154,399) and metabolite (i.e., 322,936) entries was downloaded from 8 databases and 44 genome-scale metabolic models. We compiled a non-redundant list of 42,540 metabolites and 35,474 reactions (after consolidating duplicate entries) containing full atomistic and bond connectivity detail. Another 6,100 metabolites and 11,401 reactions have partial atomistic detail, typically containing generic side-chains (R) and/or an unspecified number of polymer repeat units. Finally, 5,436 metabolites in metabolic models and 8,000 metabolites in databases are retained with no atomistic detail. In some cases lack of atomistic detail reflects complete lack of identity specificity (e.g., electron donor) whereas in other cases, even though the chemical species is fully defined, atomistic level description is not warranted (e.g., gene product of dsbC protein disulfide isomerase II (reduced)). Figure 3-3 shows the distribution of metabolite resolution across models and databases in MetRxn. In general, metabolites without fully specified structures tend to participate in a relatively small number of reactions. The workflow followed in the creation of the MetRxn knowledgebase identified a number of inconsistencies. For instance, the same metabolite name may map to molecules with different numbers of repeat units (e.g., lecithin) or completely different structures (e.g., AMP could refer to either adenosine monophosphate or ampicillin). Notably, even for the most well-curated metabolic model, E. coli iAF1260112, we found minor errors or omissions (a total of 17) arising from inconsistencies or incompleteness of representation in the data culled from other sources. For example, the metabolite abbreviation arbtn-fe3 was mistakenly associated with the KEGG ID and structure of aerobactin instead of ferric-aerobactin. The number of inconsistencies is dramatically increased for less-curated metabolic models. We used a variety of procedures to disambiguate the identity of metabolites lacking structural information ranging from reaction matching to phonetic searches. For example, in the Corynebacterium glutamicum model182, 7,8-aminopelargonic acid (DAPA) has no associated structural information. Reaction matching found the same reaction in the E. coli iAF1260 model:

C. glutamicum: DAPA + ATP + CO2 ⇔ DTBIOTIN + ADP + PI

iAF1260: [c] : atp + co2 + dann → adp + dtbt + (3) h + pi

which implies that 7,8-aminopelargonic acid (DAPA) is identical to 7,8-diaminononanoate (dann). Examination of pelargonic acid and nonanoate reveals that they are indeed known synonyms. In many cases, we were also able to assign stereo-specific information to metabolite entries in models (e.g., stipulate the L-lysine isomer for lysine). We made use of an iterative approach that allowed us to map structures from models with explicit links to structures (e.g., to KEGG or CAS numbers) to models that only provided metabolite names. Furthermore, by using a phonetic algorithm that uses tokens for equivalent strings in metabolite names (e.g., ‘-ic acid’ and ‘-ate’ are equivalent) we were able to resolve more than 159 additional metabolites. For example, phonetic searches flagged cis-4-coumarate and COUMARATE in the Acinetobacter baylyi model183 as potentially identical compounds. Additional checks revealed that indeed both metabolites should map to the same structure. A more complex matching example involved 1-(5’-Phosphoribosyl)-4-(N-succinocarboxamide)-5-aminoimidazole from the Bacillus subtilis model184 and 1-(5’-Phosphoribosyl)-5-amino-4-(N-succinocarboxamide)-imidazole from the Aspergillus nidulans model185. We note that the phonetic algorithm only makes suggestions and orders the possible matches for the curator. Next, we detail three examples that provide an insight into the type of tasks that MetRxn can facilitate.
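The reaction-matching idea behind the DAPA example can be sketched in a few lines of Python. This is a simplified illustration rather than the MetRxn procedure: the function name `infer_identity` and the `known` mapping are hypothetical, both reactions are assumed to be written in the same direction, protons are ignored, and at most one unresolved metabolite per side is allowed.

```python
from typing import Dict, FrozenSet, Set, Tuple

Reaction = Tuple[Set[str], Set[str]]  # (reactants, products)


def infer_identity(unresolved: Reaction,
                   reference: Reaction,
                   known: Dict[str, str],
                   ignore: FrozenSet[str] = frozenset({"h"})) -> Dict[str, str]:
    """Pair up the single unmatched metabolite on each side, if any."""
    inferred: Dict[str, str] = {}
    for side_u, side_r in zip(unresolved, reference):
        mapped = {known[n] for n in side_u if n in known}
        open_u = [n for n in side_u if n not in known]
        open_r = sorted(set(side_r) - mapped - ignore)
        # Every already-resolved metabolite must appear in the reference side.
        if not mapped <= set(side_r):
            return {}
        # Only an unambiguous one-to-one leftover is accepted.
        if len(open_u) != len(open_r) or len(open_u) > 1:
            return {}
        if open_u:
            inferred[open_u[0]] = open_r[0]
    return inferred


# Names taken from the two reaction entries quoted above.
c_glutamicum = ({"DAPA", "ATP", "CO2"}, {"DTBIOTIN", "ADP", "PI"})
iaf1260 = ({"atp", "co2", "dann"}, {"adp", "dtbt", "h", "pi"})
known = {"ATP": "atp", "CO2": "co2", "DTBIOTIN": "dtbt",
         "ADP": "adp", "PI": "pi"}

suggestion = infer_identity(c_glutamicum, iaf1260, known)
```

As with the phonetic matches, the result is a suggestion to be confirmed, here by checking that pelargonic acid and nonanoate are synonyms.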


Utility and Discussion

1. Charge and elementally balanced metabolic models

The standardized description of metabolites and balanced reactions afforded by MetRxn enables the expedient repair of existing models for metabolite naming inconsistencies and reaction balancing errors. Here we highlight one such metabolic model repair for Acinetobacter baylyi ADP1186. We identified that 189 out of 880 reactions are not elementally or charge balanced. Most of the reactions with charge balance errors involved a missed proton in reactions involving cofactor pairs such as NAD/NADH. For example, a proton had to be added to the reactants side in the reaction (R,R)-butanediol dehydrogenase in which butanediol reacts with NAD to form acetoin. In addition, the stoichiometric coefficient of water in GTP cyclohydrolase I was erroneously set at -2, which resulted in an imbalance in oxygen atoms. The re-balancing analysis changed the coefficient to -1 (as listed in BRENDA) and added a proton to the list of reactants (absent from BRENDA) in order to also balance charges. We performed flux balance analysis (FBA) on both the published and MetRxn-based rebalanced version of the Acinetobacter baylyi model using the uptake constraints listed in Durot et al.183 to assess the effect of re-balancing reaction entries on FBA results. We found that the maximum biomass using the glucose/ammonia uptake environment decreased by 9%, primarily due to the increased energetic costs associated with maintaining the proton gradient. This result demonstrates the significant effect that lack of reaction balancing may have on FBA calculations.

Overall, we found that nearly two-thirds of the models had at least one unbalanced reaction, with over 2,400 entries across all models that were either charge or elementally imbalanced. Frequently, the same reaction was imbalanced in multiple models (each occurrence was counted separately).
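An elemental and charge balance check of the kind described above can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, and the formulas and charges for the glucose-6-phosphate dehydrogenase example are assumed values for the species at neutral pH, not data taken from MetRxn.

```python
import re
from collections import Counter
from typing import Dict, Tuple


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula such as 'C6H11O9P' (no brackets)."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(stoich: Dict[str, int],
              formulas: Dict[str, str],
              charges: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Return (net atoms per element, net charge); both empty/zero if balanced.

    Coefficients are negative for reactants and positive for products.
    """
    atoms: Counter = Counter()
    net_charge = 0
    for metabolite, coeff in stoich.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            atoms[element] += coeff * n
        net_charge += coeff * charges[metabolite]
    return {el: n for el, n in atoms.items() if n != 0}, net_charge


# Assumed formulas and charges (illustrative values at neutral pH).
FORMULAS = {"g6p": "C6H11O9P", "nadp": "C21H25N7O17P3",
            "6pgl": "C6H9O9P", "nadph": "C21H26N7O17P3", "h": "H"}
CHARGES = {"g6p": -2, "nadp": -3, "6pgl": -2, "nadph": -4, "h": 1}

without_proton = {"g6p": -1, "nadp": -1, "6pgl": 1, "nadph": 1}
with_proton = dict(without_proton, h=1)

missing = imbalance(without_proton, FORMULAS, CHARGES)
repaired = imbalance(with_proton, FORMULAS, CHARGES)
```

Under these assumed formulas, the entry without the proton is short one hydrogen and one unit of charge on the product side, which is the typical cofactor-pair error noted above.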

2. Contrasting existing metabolic models

At the outset of creating MetRxn, we conducted a brief preliminary study to quantify the extent/severity of naming inconsistencies by contrasting the reaction information contained in an initial collection of 34 of the most popular genome-scale models spanning 21 bacterial, 10 eukaryotic and three archaeal organisms. Across all branches of life, most metabolic processes are largely conserved (e.g., glycolysis, pentose phosphate pathway, amino acid biosynthesis, etc.); therefore, we expected to uncover a large core of common reactions shared by all models. Surprisingly, we found that only three reactions (i.e., phosphoglycerate mutase, phosphoglycerate kinase, and CO2 transport) were directly recognized as common across those 34 models using a simple string match comparison. Even when examining models for only a few bacterial organisms (Bacillus subtilis, Escherichia coli, Mycobacterium tuberculosis, Mycoplasma genitalium, and Salmonella Typhimurium) simple text searches recognized only 40 common reactions (out of a possible 262, which is the size of the M. genitalium model). The reason for this glaring inconsistency is that differing metabolite naming conventions, compartment designations, stoichiometric ratios, reversibility, and water/proton balancing issues prevent the automated recognition of genuinely shared reactions across models. Using the glucose-6-phosphate dehydrogenase reaction as a representative example, Table 3-1 reveals some of the reasons for failing to automatically recognize common reactions across selected models. As many as nine different representations of the same reaction exist due to incomplete elemental and charge balancing, alternate cofactor usage among different organisms, and lack of universal metabolite naming conventions. We have found that this level of discord between models is representative for most metabolic reactions. This lack of consistency renders direct pathway comparisons across models meaningless and the aggregation of reaction information from multiple models precarious. This deficiency motivated the development of MetRxn.

Given standardization in metabolite naming and elementally/charge balanced reaction entries, MetRxn allows for the identification of shared reactions as well as differences between any two metabolic models (assuming that all the metabolites in the compared reaction entries have full atomistic information). When making the comparison of those same metabolic models, MetRxn found an additional 15 reactions in common (for a total of 55, a 38% increase) and found that 142 reactions are shared by B. subtilis, E. coli and Salmonella Typhimurium. The Web interface of MetRxn allows for any number of models to be simultaneously compared. As a demonstration of this capability we chose to contrast the metabolic content of two clostridia models: Clostridium acetobutylicum187 and Clostridium thermocellum188. Figure 3-4 shows the results in the form of a Venn diagram. Some of the differences between the clostridia species are not surprising, arising from their differing lifestyles (C. acetobutylicum contains solventogenesis pathways and a CoB12 pathway, whereas C. thermocellum contains cellulosome reactions). However, we found many differences that appear to reflect different conventions adopted when the two models were generated rather than genuine differences in metabolism. In particular, in the C. thermocellum model188 charged/uncharged tRNA metabolites are explicitly tracked whereas they are not included in the C. acetobutylicum model187. Surprisingly, both clostridia models are more similar, at the metabolite level, to the Bacillus subtilis iBsu1103 model189 than to each other (see Figure 3-4). Charged/uncharged tRNA metabolites account for most of the increased overlap between C. thermocellum and B. subtilis. Most of the reaction overlaps are in the amino acids biosynthesis pathways, carbohydrate metabolism, and nucleoside metabolism. It is important to note that 48 reactions in C. acetobutylicum, 67 reactions in C. thermocellum, and 120 reactions in B. subtilis lack full atomistic information (see Figure 3-3) and thus were excluded from any comparisons. It is possible that additional shared reactions between the two models can be deduced by further examining comparisons between not fully structurally specified metabolite entries. The string/phonetic comparison algorithms described under Step 6 along with assisted curation could be adapted for this task.

3. Using MetRxn to Bio-Prospect for Novel Production Routes

A “Grand Challenge” in biotechnological production is the identification of novel production routes that allow for the conversion of inexpensive resources (e.g., various sugars) into useful products (e.g., succinate, artemisinin) and bio-fuels (e.g., butanol, biodiesel, etc.). Selected production routes must exhibit high yields, avoid thermodynamic barriers, bypass toxic intermediates and circumvent existing intellectual property restrictions. Historically, the incorporation of heterologous pathways relied largely on human intuition and literature review followed by experimentation190,191.

Currently, rapidly expanding compilations of biotransformations such as KEGG158 and BRENDA157 are increasingly being prospected using search algorithms to identify biosynthetic routes to important product molecules. Several optimization and graph-based methods have been employed to computationally assemble novel biochemical routes from these sources. OptStrain192 used a mixed-integer linear optimization representation to identify the minimal number of reactions to be added (i.e., knock-ins) into a genome-scale metabolic model to enable the production of the new molecule.

However, the combinatorial nature of the problem poses a significant challenge to the OptStrain methodology as the number of reaction database entries increases from a few to tens of thousands. At the expense of not enforcing stoichiometric balances, graph-based algorithms have inherently better-scaling properties for exhaustively identifying all min-path reaction entries that link a source with a target metabolite. Hatzimanikatis et al.193 introduced a graph-based heuristic approach (BNICE) to identify all possible biosynthetic routes from a given substrate to a target chemical using hypothesized enzymatic reaction rules. In addition, the BNICE framework was used to identify novel metabolic pathways for the synthesis of 3-hydroxypropionate in E. coli194. Based on a similar approach, a new scoring algorithm195 was introduced to evaluate and compare novel pathways generated using enzyme-reaction rules. In addition, several techniques such as PathMiner196, PathComp, Pathway Tools197, MetaRoute12, PathFinder198 and UM-BBD Pathway Prediction System161 have been used to search databases for bioconversion routes. We recently published199 a graph-based algorithm that used reaction information from BRENDA and KEGG to exhaustively identify all connected paths from a source to a target metabolite using a customized minpath algorithm200. We first demonstrated the minpath procedure by identifying all synthesis routes for 1-butanol from pyruvate using a database of 9,921 reactions and 17,013 metabolites manually extracted from both BRENDA and KEGG. Here, we re-visited the same task using the full list of reactions and metabolites present in MetRxn to assess the discovery potential of using MetRxn. Figure 3-5 illustrates all identified pathways from pyruvate to 1-butanol before MetRxn (29, shown in blue) and the ones discovered after using MetRxn (112, shown in green). As many as 83 new avenues for 1-butanol production were revealed as a consequence of using the expanded and standardized MetRxn resource. In addition, the search algorithm recovered known201,202 synthesis routes using E. coli for the production of 1-butanol (shown in orange). The first pathway involves the fermentative transformation of pyruvate and acetyl-CoA to 1-butanol using enzymes from C. acetobutylicum201. The second pathway uses ketoacid precursors203. This example demonstrates how the biotransformations stored in MetRxn can be used to traverse a multitude of production routes for targeted bioproducts.
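To make the graph-based search concrete, the following Python sketch enumerates every shortest reaction path between a source and a target metabolite with a breadth-first search. It is a generic textbook procedure on a toy network, not the customized minpath algorithm cited above; the reaction IDs `R1` to `R9` and the intermediates `X1` to `X6` are placeholders, and stoichiometry is not enforced.

```python
from collections import deque
from typing import Dict, List, Tuple

# metabolite -> list of (reaction_id, product) edges
Graph = Dict[str, List[Tuple[str, str]]]


def all_shortest_paths(edges: Graph, source: str, target: str) -> List[List[str]]:
    """Return every minimum-length list of reaction IDs from source to target."""
    dist = {source: 0}
    preds: Dict[str, List[Tuple[str, str]]] = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for rxn, nxt in edges.get(node, []):
            if nxt not in dist:                 # first time reached
                dist[nxt] = dist[node] + 1
                preds[nxt] = [(rxn, node)]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:   # another equally short route
                preds[nxt].append((rxn, node))
    if target not in dist:
        return []

    def unwind(node: str) -> List[List[str]]:
        if node == source:
            return [[]]
        return [path + [rxn]
                for rxn, prev in preds[node]
                for path in unwind(prev)]

    return unwind(target)


# Toy network with placeholder intermediates (not real biochemistry).
TOY = {
    "pyruvate": [("R1", "X1"), ("R2", "X2"), ("R6", "X4")],
    "X1": [("R3", "X3")],
    "X2": [("R4", "X3")],
    "X3": [("R5", "1-butanol")],
    "X4": [("R7", "X5")],
    "X5": [("R8", "X6")],
    "X6": [("R9", "1-butanol")],
}

routes = all_shortest_paths(TOY, "pyruvate", "1-butanol")
```

In this toy network the two three-step routes are returned and the longer four-step route is not, which shows why a larger and better connected reaction set can expose routes that a smaller database misses.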

Conclusions

MetRxn enables the standardization, correction and utilization of rapidly growing metabolic information for over 76,000 metabolites participating in 72,000 reactions (including unresolved entries). The library of standardized and balanced reactions streamlines the process of reconstructing organism-specific metabolism and opens the way for identifying new paths for metabolic flux redirection. Moreover, the standardization of published genome-scale models enables the rapidly growing community of researchers who make use of metabolic information to understand metabolism at an organism-level and re-deploy it for various biotechnological objectives.

By removing standardization and data heterogeneity bottlenecks, the pace of knowledge creation and discovery by users of this resource will be accelerated. MetRxn is constructed in a way that allows for quick updating and tracking of changes that occur in the primary databases, and its parsing tools allow for rapid import of new genome-scale metabolic models as they become available. By having exports in SBML, MetRxn’s output can be directly interfaced with software packages such as the COBRA toolbox. During the construction of the initial release of MetRxn, we managed to associate structures with over 8,800 metabolites and re-balanced more than 2,400 reaction instances across 44 metabolic models. This enables the genuine comparison of metabolic content between metabolic models. Preliminary results reinforce that discrepancies between metabolic models echo not only genuine differences in metabolism but also assumptions and workflow followed by the model creator(s). Going forward, we will continue to expand MetRxn to include more genome-scale metabolic models and add additional tools to aid in their analysis. Because we anticipate that the scope and number of models will rapidly expand, we plan to invite and encourage the community to offer comments about metabolite and reaction information as well as provide feedback on MetRxn itself.

Availability and requirements

MetRxn is available at http://metrxn.che.psu.edu. Its use is freely available for all non-commercial activity.


Acknowledgements and Funding

This work was funded by DOE grant DE-FG02-05ER25684. The authors would like to gratefully acknowledge Robert Pantazes, Sridhar Ranganatha, Rajib Saha, and Alireza Zomorrodi for their help with testing and feedback on the MetRxn web interface.

Authors’ contributions

AK generated the software and tools for MetRxn and assisted in drafting the manuscript.

PFS participated in the design of the database, performed database curation and FBA analysis, and drafted the manuscript. CDM conceived the study, participated in the design of the database and edited the manuscript. All authors read and approved the final manuscript.


Figures and tables

Figure 3-1 Typical incompatibilities and inconsistencies in genome-scale models and databases.

Roadblocks to using genome-scale models and databases include ambiguities and differences in naming conventions, lack of balanced reactions, and incompleteness of structural information.


Figure 3-2 Flowchart outlining the construction of MetRxn.

After download of primary sources of data from databases and models, we integrated metabolite and reaction data, followed by calculation and reconciliation of structural information. By identifying overlaps between metabolite and reaction information, we generated elemental and charge balancing of reactions. The procedure for developing MetRxn was iterative with subsequent passes making use of previous associations to resolve remaining ambiguities.


Figure 3-3 Various levels of structural information were available for models (main) and databases (inset).

For every model, the majority of metabolites had full atomistic detail (blue). The smaller number of metabolites with partial atomistic detail (orange) such as generic side chains, or with no atomistic detail (green) such as gene products, participated in few reactions.


Table 3-1 Representation of glucose-6-phosphate dehydrogenase in selected metabolic models


Figure 3-4 Comparison of metabolite and reaction overlaps for C. acetobutylicum, C. thermocellum, and B. subtilis.

Although the two Clostridium organisms belong to the same genus, the models of these two species had significant numbers of unique metabolites (left) and reactions (right), and comparisons revealed that there was more similarity in metabolite usage with a model of B. subtilis than with each other. In part, these overlaps were driven by the explicit accounting for charged tRNA species in C. thermocellum and B. subtilis models, which was also reflected in the reaction overlaps through reactions involving these metabolites.


Figure 3-5 Pathways from pyruvate to 1-butanol.

Using the MetRxn knowledgebase, we identified a large number of new pathways (green) as well as previously established ones (orange) and those identified in a previous study (blue).

Assessing the Metabolic Impact of Nitrogen Availability Using a Compartmentalized Maize Leaf Genome-Scale Model

Margaret Simons2, Rajib Saha2, Nardjis Amiour, Akhil Kumar1, Lenaïg Guillard, Gilles Clément, Martine Miquel, Zhenni Li, Gregory Mouille, Peter J. Lea, Bertrand Hirel, and Costas D. Maranas2

This chapter has been previously published in modified form in Plant Physiology. (Assessing the Metabolic Impact of Nitrogen Availability Using a Compartmentalized Maize Leaf Genome-Scale Model, Margaret Simons2, Rajib Saha2, Nardjis Amiour, Akhil Kumar1, Lenaïg Guillard, Gilles Clément, Martine Miquel, Zhenni Li, Gregory Mouille, Peter J. Lea, Bertrand Hirel, and Costas D. Maranas2, Plant Physiology, 3, 1659-1674)

Departments of Chemical Engineering (M.S., R.S., C.D.M.) and Bioinformatics and Genomics, Huck Institutes of the Life Sciences (A.K.), Pennsylvania State University, University Park, Pennsylvania 16802; Institut Jean-Pierre Bourgin, Institut National de la Recherche Agronomique, Centre de Versailles-Grignon, Unité Mixte de Recherche 1318 Institut National de la Recherche Agronomique-Agro-ParisTech, Equipe de Recherche Labellisée, Centre National de la Recherche Scientifique 3559, F–78026 Versailles cedex, France (N.A., L.G., G.C., M.M., Z.L., G.M., B.H.); and Lancaster Environment Centre, Lancaster University, Lancaster LA1 4YQ, United Kingdom (P.J.L.)

Maize (Zea mays) is an important C4 plant due to its widespread use as a cereal and energy crop. A second-generation genome-scale metabolic model for the maize leaf was created to capture C4 carbon fixation and investigate nitrogen (N) assimilation by modeling the interactions between the bundle sheath and mesophyll cells. The model contains gene-protein-reaction relationships, elemental and charge-balanced reactions, and incorporates experimental evidence pertaining to the biomass composition, compartmentalization, and flux constraints. Condition-specific biomass descriptions were introduced that account for amino acids, fatty acids, soluble sugars, proteins, chlorophyll, lignocellulose, and nucleic acids as experimentally measured biomass constituents. Compartmentalization of the model is based on proteomic/transcriptomic data and literature evidence. With the incorporation of information from the MetaCrop and MaizeCyc databases, this updated model spans 5,824 genes, 8,525 reactions, and 9,153 metabolites, an increase of approximately 4 times the size of the earlier iRS1563 model. Transcriptomic and proteomic data have also been used to introduce regulatory constraints in the model to simulate an N-limited condition and mutants deficient in glutamine synthetase, gln1-3 and gln1-4. Model-predicted results achieved 90% accuracy when comparing the wild type grown under an N-complete condition with the wild type grown under an N-deficient condition.

Maize (Zea mays), also known as corn, is an essential dual-use food and energy crop. Maize production is increasing at the greatest rate among all cereals, with a worldwide trend of 0.06 tons ha−1 year−1 (Leveau et al., 2011) and a record 877 million tons produced in the 2011-2012 fiscal year (International Grains Council, 2013). With the recent completion of the maize genome in 2009 along with the creation and curation of databases such as MaizeGDB in 2011 (Schaeffer et al., 2011), MaizeCyc in 2013 (Monaco et al., 2013), and MetaCrop 2.0 in 2012 (Schreiber et al., 2012), there is a need for an updated genome-scale metabolic model (GSM; Saha et al., 2011) that will integrate all newly available information from diverse sources. The integration of this information with experimental transcriptomic data, proteomic data, and biomass composition measurements obtained with wild-type plants grown under optimal nitrogen (N+ WT) conditions and limited nitrogen (N− WT) conditions (Amiour et al., 2012), as well as two Gln synthetase (GS) mutants grown under optimal nitrogen (N), gln1-3 and gln1-4 (Martin et al., 2006), has provided a more accurate assessment of N metabolism within the maize leaf. Moreover, since integration of transcriptomic, proteomic, and metabolomic data appeared not to be straightforward (Amiour et al., 2012, 2014), the development of a model could help to identify putative candidate genes, proteins, and metabolic pathways contributing to plant growth and development.

Maize is a C4 plant that overcomes the inefficiency of Rubisco, which can capture oxygen instead of the preferred CO2, by separating the photosynthetic carbon fixation process into two cell types: the bundle sheath and mesophyll cells. In comparison with C3 plants, this separation allows C4 plants to have a lower rate of photorespiration, a higher rate of photosynthesis at high light intensities (under standard air and temperature conditions), and a higher photosynthetic nitrogen use efficiency (NUE; Christin and Osborne, 2013; Driever and Kromdijk, 2013; Peterhansel et al., 2013; Sage, 2014; Wang et al., 2014). A C4-specific maize GSM could provide insight into N metabolism and provide cues for improving NUE (i.e. the vegetative biomass or grain yield produced per unit of N present in the soil). Since N is the major limiting factor in agricultural production among mineral fertilizers (Vitousek et al., 1997; Hirel et al., 2007; Andrews and Lea, 2013; Andrews et al., 2013) and NUE is estimated to be far below 50% in cereal grains (Raun and Johnson, 1999), improving NUE is essential for improving overall productivity in maize (Hirel and Gallais, 2011). Amiour et al. (2012) experimentally determined 150 gene transcripts, 40 proteins, and 89 metabolites that are significantly different between the N+ WT and N− WT conditions during the vegetative stage of growth. N utilization is strongly linked to the GS enzyme, as all N, either in the form of nitrate or ammonium ions, is channeled through the reaction catalyzed by the GS enzyme (Martin et al., 2006; Cañas et al., 2010; Hirel and Gallais, 2011; Andrews et al., 2013). The mesophyll cell-specific GS1-3 isozyme is involved in synthesizing Gln after nitrate reduction from the vegetative state until the plant reaches maturity. Leaf aging induces the synthesis of the bundle sheath-specific GS1-4 isozyme. Consequently, Martin et al. (2006) hypothesized that the GS1-4 isoform is used in the reassimilation of ammonium during protein degradation in senescing leaves. During vegetative growth in the leaf tissue, DNA microarray data revealed that 243 gene transcripts, 46 proteins, and 48 metabolites exhibited significant differences in the gln1-3 mutants and 107 gene transcripts, 14 proteins, and 18 metabolites displayed substantial differences in the gln1-4 mutants (Amiour et al., 2014). In this second-generation maize model, we explore the effect of the computational knockout of genes encoding for GS1-3 and GS1-4 isozymes using flux balance analysis (FBA) to elucidate the role of GS in N metabolism.

FBA of GSMs is used to model organism-specific metabolism by simulating the internal flow of metabolites. The number of GSMs for plants has increased rapidly, with models available for Arabidopsis thaliana (Poolman et al., 2009; de Oliveira Dal’Molin et al., 2010a), barley (Hordeum vulgare) seed (Grafahrend-Belau et al., 2009), maize (de Oliveira Dal’Molin et al., 2010b; Saha et al., 2011), sorghum (Sorghum bicolor; de Oliveira Dal’Molin et al., 2010b), sugarcane (Saccharum officinarum; de Oliveira Dal’Molin et al., 2010b), rapeseed (Brassica napus; Pilalis et al., 2011), and rice (Oryza sativa; Poolman et al., 2013). These models rely on annotation information to assemble comprehensive compilations of all reactions and metabolites known to occur within the organism. Currently, whole-genome sequencing has been completed for approximately 40 vascular plants, including A. thaliana (Arabidopsis Genome Initiative, 2000), Arabidopsis lyrata (Hu et al., 2011), soybean (Glycine max; Schmutz et al., 2010), rice (Goff et al., 2002; Yu et al., 2002), Populus trichocarpa (Tuskan et al., 2006), sorghum (Paterson et al., 2009), Theobroma cacao (Tuskan et al., 2006), and maize (Schnable et al., 2009). Gene annotations of the whole-genome sequences have been used to determine the reactions within an organism and therefore build a GSM. FBA calculates all reaction fluxes in a metabolic network based on the optimization of an objective function (typically the maximization of the biomass yield). A quasi-steady state is assumed, and flux constraints are set based on the specific medium or the reversibility of reactions derived from thermodynamics. Incorporation of omics data into GSMs is achieved through appropriate constraints on fluxes that restrict metabolic flows to only condition-relevant phenotypes.
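For reference, the FBA problem described in this paragraph is commonly written as the linear program below. The notation is an assumption introduced here for illustration and does not appear in the original text: $S_{ij}$ is the stoichiometric coefficient of metabolite $i$ in reaction $j$, $v_j$ is the flux of reaction $j$, and $v_j^{\min}$ and $v_j^{\max}$ are its lower and upper bounds.

```latex
\begin{aligned}
\max_{v} \quad & v_{\text{biomass}} \\
\text{subject to} \quad & \sum_{j} S_{ij}\, v_j = 0 \qquad \forall\, i \\
& v_j^{\min} \le v_j \le v_j^{\max} \qquad \forall\, j
\end{aligned}
```

The equality constraints encode the quasi-steady-state assumption, and the bounds carry the medium, reversibility, and omics-derived restrictions mentioned above.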

During the last few years, multiple methods have been developed to integrate omics data into GSMs. Proteomic and transcriptomic data have been used to apply flux constraints on corresponding reactions determined by gene-protein-reaction (GPR) associations. The GIMME (Becker and Palsson, 2008), iMAT (Shlomi et al., 2008), and MADE (Jensen and Papin, 2011) algorithms use a switch approach to turn on/off reactions based on expression levels. The GIMME algorithm turns off reactions based on a user-specified threshold of the expression level. The iMAT algorithm turns on a minimal set of reactions associated with low expression data in order to achieve a user-specified metabolic function. The MADE algorithm incorporates related experimental data sets into the model to activate or repress reactions based on the progression of the experimental conditions.

A different class of algorithms, known as the valve approach, was developed to incorporate proteomic and transcriptomic data by constraining the allowable flux ranges of reactions. The E-Flux method incorporates a user-specified function to convert gene expression data to flux constraints (Colijn et al., 2009). Finally, the PROM algorithm (Chandrasekaran and Price, 2010) uses multiple data sets to constrain flux bounds (i.e. allowable flux ranges) based on the probabilities associated with gene activity among all data sets. Lee et al. (2012) integrated gene expression data by minimizing the difference between the predicted flux levels and gene expression data over all reactions with corresponding expression levels. Using the Yeast 5 model (Heavner et al., 2012) for Saccharomyces cerevisiae, Lee et al. (2012) compared the predicted fluxes with experimentally determined exometabolome fluxes using the coefficient of determination r2. The authors achieved r2 values of 0.87 and 0.96 at 75% and 85% of the maximal biomass level, respectively. In comparison, the authors generated a best FBA solution, which maximizes r2 over all feasible solutions generated for FBA, and achieved r2 values of 0.2 and 0.58 at 75% and 85% of the maximal biomass level, respectively. These advancements pertaining to the integration of omics data with GSMs have enabled more accurate model predictions.
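The difference between the switch and valve approaches can be illustrated with a deliberately simplified Python sketch. It does not reproduce GIMME, iMAT, MADE, E-Flux, or PROM as published: expression values are assumed to be already mapped to reactions through the GPR associations, the function names are hypothetical, and the valve scaling by the maximum expression value is only one possible choice of conversion function.

```python
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]  # reaction -> (lower, upper)


def apply_switch(bounds: Bounds, expression: Dict[str, float],
                 threshold: float) -> Bounds:
    """Switch approach: close reactions whose expression is below threshold."""
    return {
        rxn: (0.0, 0.0) if expression.get(rxn, threshold) < threshold else b
        for rxn, b in bounds.items()
    }


def apply_valve(bounds: Bounds, expression: Dict[str, float]) -> Bounds:
    """Valve approach: scale flux bounds in proportion to expression."""
    peak = max(expression.values())
    scaled = {}
    for rxn, (lower, upper) in bounds.items():
        # Reactions without data keep their original bounds.
        factor = expression.get(rxn, peak) / peak
        scaled[rxn] = (lower * factor, upper * factor)
    return scaled


bounds = {"R1": (0.0, 10.0), "R2": (-10.0, 10.0), "R3": (0.0, 10.0)}
expression = {"R1": 8.0, "R2": 2.0}  # no measurement for R3

switched = apply_switch(bounds, expression, threshold=4.0)
valved = apply_valve(bounds, expression)
```

In the switch case the weakly expressed reaction is removed entirely, whereas in the valve case it remains available with a narrower allowable flux range.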


In this work, we describe the reconstruction of a second-generation maize leaf model and the incorporation of omics data into the model with the goal of improving the understanding of N metabolism. Both the primary and secondary metabolic pathways of maize are included, by combining information from MetaCrop (Schreiber et al., 2012), MaizeCyc (Monaco et al., 2013), and the earlier iRS1563 (Saha et al., 2011) models. In comparison with the iRS1563 model, this second-generation model spans an additional 4,261 genes and 6,540 reactions. The increased number of genes and reactions enables the inclusion of additional pathways such as fructan biosynthesis, siroheme biosynthesis, and ubiquinol-9 biosynthesis. The model accounts for the two major cell types in the leaf (i.e. the bundle sheath and mesophyll cells). The bundle sheath cell contains seven compartments: the cytosol, mitochondrion, peroxisome, chloroplast stroma, plasma membrane, thylakoid membrane, and vacuole. The mesophyll cell contains six compartments: the cytosol, mitochondrion, chloroplast stroma, plasma membrane, thylakoid membrane, and vacuole. Compartmentalization is based on maize-specific experimental proteomic and transcriptomic measurements (Majeran et al., 2005; Friso et al., 2010; Li et al., 2010; Chang et al., 2012), as opposed to the A. thaliana-based compartmentalization adopted in the previous iRS1563 maize model (Saha et al., 2011).

Light reactions have been expanded from an aggregate reaction (as described in the iRS1563 model) to multiple reactions for each complex with the inclusion of a thylakoid membrane compartment. In contrast to the C4GEM maize model (de Oliveira Dal’Molin et al., 2010b), which focuses exclusively on primary metabolism in maize, the developed model also spans secondary metabolism by including all reactions known to occur within the maize leaf tissue. The model includes as many as 763 secondary metabolism reactions (excluding duplicate counts due to compartmentalization). Through the incorporation of omics data, regulatory restrictions are introduced in the model to switch reactions off or on under the N+ WT and N− WT conditions and two GS knockout mutants (gln1-3 and gln1-4) in the vegetative stage, during which the plant absorbs and assimilates N for root and leaf biomass production (Amiour et al., 2012, 2014). Reactions linked to genes or proteins with significantly different expression levels between the N+ WT and

N− WT conditions, as well as the gln1-3 and gln1-4 mutants versus the N+ WT condition, are conditionally turned on or off accordingly. The metabolite pool is simulated by maximizing the total flux through a metabolite (i.e. flux sum) as a proxy for the metabolite turnover rate (Chung and Lee, 2009). The directional changes of flux-sum levels between the N− WT condition and the N+ WT condition, as well as the GS mutant conditions and the N+ WT condition, are qualitatively compared with the directional change in experimentally measured concentration levels. These analyses reveal similar trends to the recently developed flux imbalance analysis (Reznik et al., 2013), which makes use of dual variable values associated with metabolite balances to infer the effect of concentration changes on the objective function value.
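The flux sum of a metabolite can be illustrated with a short sketch. Following the usual definition attributed to Chung and Lee (2009), the flux sum of metabolite i is half the sum of |S_ij · v_j| over all reactions j, which at steady state equals both its total production and its total consumption. The reaction names, stoichiometric coefficients, and flux values below are hypothetical, and the optimization step (maximizing the flux sum over feasible flux distributions) is omitted because it requires a linear programming solver.

```python
def flux_sum(stoich_row, fluxes):
    """Flux sum of one metabolite: 0.5 * sum_j |S_ij * v_j|.

    stoich_row maps reaction id -> stoichiometric coefficient of the metabolite.
    fluxes maps reaction id -> flux value (missing reactions count as zero).
    """
    return 0.5 * sum(
        abs(coeff * fluxes.get(rxn, 0.0)) for rxn, coeff in stoich_row.items()
    )


# Hypothetical stoichiometry for one metabolite (negative = consumed).
glu_row = {"GS": -1.0, "GOGAT": 2.0, "AlaAT": -1.0}
# Hypothetical steady-state fluxes: production (2 * 2) equals consumption (3 + 1).
fluxes = {"GS": 3.0, "GOGAT": 2.0, "AlaAT": 1.0}

net_balance = sum(glu_row[r] * fluxes[r] for r in glu_row)
glu_flux_sum = flux_sum(glu_row, fluxes)
```

Because the absolute values are halved, the result is the turnover of the metabolite rather than twice that amount.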


Results and discussion

Effect of N Conditions on Biomass Components

Biomass components were measured in the N+ WT condition as well as for each N background (N− WT, gln1-3, and gln1-4). Table I and Figure 1 display the composition of the classes of biomass metabolites, and Supplemental Table S1 indicates the specific biomass measurements in all modeled conditions. As expected, in the majority of cases, the N− WT condition produced a smaller concentration of biomass components than the N+ WT, gln1-3, and gln1-4 conditions. However, the concentration of amino acids produced was about 5 times higher in the gln1-4 mutant than the gln1-3 mutant, resulting in comparable amino acid concentrations between the gln1-4 mutant and N+ WT as well as between the gln1-3 mutant and N− WT. The similar amino acid concentrations between the gln1-4 mutant and the N+ WT condition in the vegetative stage help to confirm that the GS1-4 isozyme is essential in plant maturity and has a smaller effect compared with the GS1-3 isozyme at the vegetative stage. As expected, the concentration of starch was higher in the N− WT condition than in the N+ WT condition. Under the N− WT condition, the breakdown of starch is limited by the amount of N available (Tercé-Laforgue et al., 2004; Amiour et al., 2012). Due to the limited N available, the starch is stored rather than broken down to produce other biomass components. The stained micrographs depicting the starch visible in the N+ WT, gln1-3 mutant, and gln1-4 mutant conditions are available in Supplemental Figure S1. The condition-specific biomass concentrations have been incorporated in the maize leaf model to more accurately represent metabolism under each condition.


Development of the Second-Generation Maize Leaf Model

The second-generation maize leaf model was developed using a combination of gene, protein, and reaction information from the previously developed maize model iRS1563 (Saha et al., 2011), biological databases such as the Kyoto Encyclopedia of Genes and Genomes (Kanehisa et al., 2014), MaizeCyc (Monaco et al., 2013), and MetaCrop (Schreiber et al., 2012), as well as published literature sources. The model contains 5,824 genes and 8,525 reactions, a significant increase from the iRS1563 model, which contained 1,563 genes and 1,985 reactions. The second-generation maize model is split into two cell types (i.e. the bundle sheath and mesophyll cells). The bundle sheath cell is further divided into seven compartments, while the mesophyll cell contains six compartments (Fig. 2). Of the 8,525 reactions in the model, 3,892 reactions are unique, as duplicated counts due to compartmentalization have been disregarded. Of these 3,892 unique reactions, 1,012 reactions were assigned localization information based on transcriptomic and proteomic data (Majeran et al., 2005; Friso et al., 2010; Li et al., 2010; Chang et al., 2012). Light reactions were adjusted to model the flow of protons across the thylakoid membrane to the chloroplast stroma, to represent the pH differential between compartments, and to describe the conversion of light to ATP (Nelson and Cox, 2009). The mitochondrial electron transport chain was similarly updated to include the proton exchange of ATP synthase between the intermembrane space and the mitochondrial matrix (Taiz, 2010).

Finally, 303 specific reactions were added to model glycerolipid synthesis, as shown in Supplemental Figure S2 and Supplemental Table S2 (Moore, 1982; Murata, 1983; Murata and Tasaka, 1997; Mekhedov et al., 2000; Bachlava et al., 2009; Li-Beisson et al., 2010; Rolland et al., 2012). To the best of our knowledge, this is the first plant model to include detailed glycerolipid synthesis. Aggregate reactions were included to link specific two-tailed glycerolipids to the experimentally measured single lipids (Supplemental Table S2).

Compiling transcriptomic and proteomic compartmentalization data with literature-based pathways yielded a model of 4,103 reactions, leaving 2,880 unique reactions with unknown localizations.

Once reactions were compartmentalized based on transcriptomic data, proteomic data, and published literature, the reactions were divided into two groups. The first group (core set) includes reactions with known localizations, while the second group (noncore set) spans reactions known to occur within the maize leaf but with no localization evidence.

Whenever possible, core reactions were unblocked by first adding reaction(s) from the noncore set to one or multiple compartment(s) and second appending intercellular or intracellular transporters (see “Materials and Methods”). By following this approach, 1,032 unique reactions with previously unknown localizations were assigned to compartments and 729 transporters were added. The remaining 1,848 unique reactions were assigned to compartments based on available pathway information or assigned to the cytosol of both the bundle sheath and mesophyll cells.

With all the reactions assigned to specific compartments, thermodynamically infeasible cycles that were generated due to the overly permissive inclusion of reactions in the model, as well as lack of reaction directionality information, were subsequently identified and eliminated. By first restricting the directionality of reactions and second removing reactions, it was possible to eliminate all thermodynamically infeasible cycles in the model. By this process, we restricted the directionality of 36 reactions and removed 2,055 reactions from the model (Table II). Upon the resolution of thermodynamically infeasible cycles, attempts were made to unblock the remaining blocked core reactions and biomass formation by adding reactions from similar organisms (Krumholz et al., 2012) and model organisms (i.e. rice ssp. japonica, Brachypodium distachyon, sorghum, and A. thaliana).

By adding five unique reactions from similar organisms, the flux through three additional reactions known to be in maize was resolved. These reactions were all involved in the formation of Glu from His through urocanic acid. The model is provided in a Microsoft Excel format in Supplemental Table S3 and in Systems Biology Markup Language format in Supplemental Table S4.

Incorporation of Transcriptomic and Proteomic Data in the Model

In order to more accurately model the N+ WT, N− WT, and GS mutant conditions in maize, gene-protein-reaction (GPR) associations were used to map gene transcripts and proteins whose expression was significantly low to reactions, which were then turned off in the model. However, no reactions essential to the model (i.e. required for biomass formation) were altered.

For example, the δ-aminolevulinic acid dehydratase reaction was experimentally determined to be higher in the N+ WT condition, suggesting that it should be restricted in the N− WT condition. However, when the flux through the δ-aminolevulinic acid dehydratase reaction is restricted to zero, biomass cannot be formed, as this reaction produces porphobilinogen, a precursor to chlorophyll (Gupta et al., 2013). Due to the incomplete information available in published literature or databases regarding possible alternative routes of the production and degradation of a specific metabolite, regulating reactions that are essential to the model will restrict biomass synthesis. Based on experimental evidence, the fluxes through 83 reactions in the N+ WT condition, 20 in the N− WT condition, 100 in the gln1-3 mutant, and nine in the gln1-4 mutant were restricted.

The reactions regulated in the N+ WT condition mainly correspond to reactions known to occur only under stress and are expressed at a low level in comparison with the N− WT and mutant conditions. Reactions that have been down-regulated based on omics data are indicated in the model file (Supplemental Table S3). N perturbations within the leaf tissue were modeled by combining the incorporation of transcriptomic and proteomic data with the unique biomass composition for each condition.
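The rule that omics-based restrictions are applied only to reactions that are not essential for biomass formation can be sketched as a simple filter. This is an illustrative assumption about how such a guard could be coded, not the procedure used to build the model: the sequential order, the function names, and the toy feasibility check (which stands in for a flux balance calculation) are all hypothetical.

```python
def restrict_nonessential(candidates, can_form_biomass):
    """Switch off candidate reactions one at a time.

    A reaction stays active if switching it off (together with those already
    switched off) would make biomass formation infeasible.
    """
    switched_off = set()
    kept_active = []
    for rxn in candidates:
        trial = switched_off | {rxn}
        if can_form_biomass(trial):
            switched_off = trial
        else:
            kept_active.append(rxn)
    return switched_off, kept_active


def toy_can_form_biomass(disabled):
    """Toy stand-in for an FBA feasibility check (hypothetical).

    Here biomass needs the ALAD reaction, mirroring the
    delta-aminolevulinic acid dehydratase example in the text.
    """
    return "ALAD" not in disabled


# Hypothetical reaction ids flagged as low expression.
off, kept = restrict_nonessential(["ADH", "ALAD", "CAT"], toy_can_form_biomass)
```

In a real workflow the feasibility check would be a linear program over the full model, so the outcome can depend on the order in which candidates are tested.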

The minimal set of reactions, whose elimination causes a decrease in biomass yield, was determined for the N+ WT, N− WT, gln1-3 mutant, and gln1-4 mutant conditions. There are six reactions across the conditions that encompass the minimal set of reactions, as summarized in Table III. Of the 83 reactions with restricted flux in the N+ WT condition, only two reactions were identified to affect biomass yield. These two reactions are the conversion of ethanol to acetaldehyde through either ethanol oxidoreductase involving NAD+ or a hydrogen peroxide-dependent oxidation of ethanol catalyzed by catalase (Boamfa et al., 2005). These two reactions have a very slight effect on biomass formation, as biomass yield drops by less than 1%. As expected, we find that many of the reactions that correspond to genes that are significantly down-regulated in the N+ WT condition do not hinder biomass formation. In the N− WT condition, none of the reactions have an effect on the biomass yield, suggesting, as expected, that the decreased amount of N is the main limiting factor in biomass yield. In the gln1-3 mutant condition, three of the 100 reactions, which are switched off based on omics data, affect the biomass yield. These

three reactions are the glyceraldehyde-3-phosphate dehydrogenase, Fru-bisP aldolase, and Fru-bisphosphatase reactions. The capacity of glyceraldehyde-3-phosphate dehydrogenase to form a multienzyme complex in the chloroplasts for a range of plants is regulated by environmental conditions such as the light/dark transitions (Howard et al., 2011). Glyceraldehyde-3-phosphate is synthesized during carbon fixation in photosynthesis, and 1,3-bisphospho-d-glycerate (i.e. 3-phospho-d-glyceroyl phosphate) can be synthesized from 3-phospho-d-glycerate. ATP is required for the conversion of 3-phospho-d-glycerate to 1,3-bisphospho-d-glycerate catalyzed by 3-phospho-d-glycerate kinase in the bundle sheath chloroplast. This reaction is an important energy-requiring reaction in the Calvin-Benson cycle, as it is essential that the enzyme immediately metabolizes 3-phospho-d-glycerate, the product of the Rubisco reaction. This conclusion is also consistent with the findings that 3-phospho-d-glycerate 1-phosphotransferase is sensitive to changes in energy state (Nakamoto and Edwards, 1987). The Fru-bisP aldolase reaction, which is involved in the Calvin-Benson-Bassham cycle and the glycolysis pathway, can be bypassed using the sedoheptulose 1,7-bisphosphate/d-glyceraldehyde-3-phosphate-lyase reaction, which catalyzes the synthesis of sedoheptulose 1,7-bisphosphate using dihydroxyacetone phosphate (i.e. glycerone phosphate) and d-erythrose 4-phosphate (Lakshmanan et al., 2013). The decreased expression of the cytosolic Fru-bisphosphatase reaction has been shown to decrease the ATP-ADP ratio, lead to the switch from Suc to starch synthesis, and inhibit photosynthesis at high CO2 levels in A. thaliana, resulting in the inhibition of plant growth (Strand et al., 2000).

Finally, the regulatory restrictions for the gln1-4 mutant involve only nine reactions, of which one affected the biomass drain (i.e. Rib-5-P isomerase reaction). The lack of the Rib-5-P isomerase reaction has been experimentally shown to cause premature death and affect cellulose synthesis in A. thaliana (Howles et al., 2006; Xiong et al., 2009). A comparison of the number of reactions that affect the GS mutants suggests that at the vegetative stage, the impact of the gln1-4 mutation is less severe than that occurring in the gln1-3 mutant. Such a finding is not surprising, since it has been shown that the gene encoding the GS1-3 isozyme is constitutively expressed irrespective of the leaf development stage and that the expression of the gene encoding the GS1-4 isozyme is much lower and only enhanced at later stages of leaf development (Hirel et al., 2005).

Although only a subset of reactions affect the biomass production in the N+ WT, gln1-3 mutant, and gln1-4 mutant conditions, the additional regulation will have an effect on the flux predictions within the model.

Flux Range Variations among Conditions

The flux range of each reaction was determined in the N+ WT, N− WT, gln1-3 mutant, and gln1-4 mutant conditions under the assumption that biomass is maximized. The flux range of a reaction in the N− WT, gln1-3 mutant, and gln1-4 mutant conditions was compared with the flux range in the N+ WT reference condition to determine reactions with flux ranges that must deviate from the N+ WT flux range. Such a deviation indicates that the flux through the reaction must change as a result of the limited N or mutation. Overall, the flux through 202 reactions in the N− WT condition is not contained within the flux range of the N+ WT condition, 765 reaction fluxes in the gln1-3 mutant diverge from the N+ WT flux range, and 678 reaction fluxes in the gln1-4 mutant must change from the N+ WT flux range (Supplemental Table S5). In all three N backgrounds (i.e. the N− WT, gln1-3 mutant, and gln1-4 mutant conditions), the flux through the chlorophyll cycle, chlorophyllide a biosynthesis, farnesyl diphosphate biosynthesis, the methylerythritol phosphate pathway, and tetrapyrrole biosynthesis decreases under maximum biomass compared with the N+ WT reference condition. Tetrapyrrole biosynthesis, chlorophyllide a biosynthesis, and the chlorophyll cycle link the production of chlorophyll from Glu (Tanaka and Tanaka, 2007; Kim et al., 2013). The methylerythritol phosphate pathway and farnesyl diphosphate biosynthesis lead to a reactant required for the production of chlorophyll a from chlorophyllide a (Lange and Ghassemian, 2003). In both of the GS mutant conditions, the flux through chorismate biosynthesis (Tzin and Galili, 2010), Ser biosynthesis (Ho and Saito, 2001), and the urea cycle (Mérigout et al., 2008) must decrease compared with the N+ WT condition. Choline biosynthesis (McNeil et al., 2001) is decreased in the N− WT condition, increased in the gln1-3 mutant, and decreased in the gln1-4 mutant condition.

Flux through Ile and Leu biosynthesis (McCourt and Duggleby, 2006) is lower in the N− WT condition, higher in the gln1-3 mutant condition, and lower in the gln1-4 mutant condition compared with the N+ WT condition, as expected by the proportion of these biomass components in the various conditions. The flux through the glyoxylate cycle (Schnarrenberger and Martin, 2002), stearate biosynthesis (Li-Beisson et al., 2010), and urate degradation (Ramazzina et al., 2006) is higher in the gln1-3 mutant condition compared with the N+ WT condition. Val biosynthesis (McCourt and Duggleby, 2006) is lower in the gln1-3 mutant condition compared with the N+ WT condition. Flux through glutathione biosynthesis/degradation, Trp biosynthesis (Tzin and Galili, 2010), uracil degradation (Zrenner et al., 2006), and Xyl degradation (Penna et al., 2002) is higher in the gln1-4 mutant compared with the N+ WT condition. Glu is converted to glutathione through two ATP-dependent steps requiring the addition of Cys and then Gly. Glutathione is an essential protectant against oxidative stress, heavy metals, and xenobiotics (Noctor et al., 2012; Rahantaniaina et al., 2013). Several routes of glutathione breakdown have been proposed, including the formation of Cys and Gly through cysteinyl-Gly. The Cys is then degraded to form pyruvate, helping to alleviate the gln1-4 mutation. The increased fluxes associated with Xyl (from 1,4-β-d-xylan) and uracil degradation generate a larger pool of xylulose-5-phosphate and β-Ala, respectively.

Finally, phenylpropanoid biosynthesis (Vogt, 2010) is lower in the gln1-4 mutant condition compared with the N+ WT condition. The majority of the changes in these pathways are directly related to differences in the proportion of the biomass components between the modeled conditions.
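The flux-range comparison that opens this section can be sketched as an interval test. The sketch assumes that a reaction "must change" when its flux range in a given condition shares no point with its range in the N+ WT reference; the reaction ids, the ranges, and the numerical tolerance are hypothetical, and in practice the ranges would come from a flux variability analysis at maximum biomass.

```python
def ranges_overlap(range_a, range_b, tol=1e-9):
    """True if two closed intervals (lo, hi) share at least one point."""
    lo_a, hi_a = range_a
    lo_b, hi_b = range_b
    return lo_a <= hi_b + tol and lo_b <= hi_a + tol


def reactions_that_must_change(reference, condition, tol=1e-9):
    """Reactions whose range in `condition` misses the reference range entirely."""
    return sorted(
        rxn
        for rxn in reference
        if rxn in condition
        and not ranges_overlap(reference[rxn], condition[rxn], tol)
    )


# Hypothetical (min, max) flux ranges per reaction.
n_plus_wt = {"R1": (0.0, 5.0), "R2": (2.0, 3.0), "R3": (-1.0, 1.0)}
n_minus_wt = {"R1": (1.0, 2.0), "R2": (4.0, 6.0), "R3": (1.5, 2.0)}

changed = reactions_that_must_change(n_plus_wt, n_minus_wt)
```

Reaction R1 is not flagged because its narrower range still lies inside the reference range, so its flux could stay the same in both conditions.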

Comparison of Model Predictions with Metabolomic Data

The metabolomic data were compared with flux predictions within the model in each of the various N background conditions. The increasing or decreasing trend of the metabolite concentration, displayed in Figure 3, was qualitatively compared with the change in the flux-sum range determined by the model, as displayed in Figure 4. The flux sum is a measure of the amount of flow through the reactions associated with either the production or consumption of the metabolite. A variability analysis of the flux sum was performed, and flux-sum ranges, normalized by the biomass rate, that do not overlap between the N background condition and the N+ WT condition were analyzed. An increase/decrease in the flux sum (i.e. used as a proxy for the metabolite pool) of a metabolite between the N− WT condition and the N+ WT condition and between the two GS mutants and the N+ WT condition was compared with the metabolite concentration changes. Figure 4 demonstrates the importance of restricting fluxes based on transcriptomic and proteomic data. In the N− WT condition, the accuracy changes from

13% to 90% when the flux constraints based on omics data are incorporated. Without the incorporation of these constraints, all flux-sum ranges normalized by the biomass rate are predicted higher in the N− WT condition. The identified flux-sum levels are included in Supplemental Table S6. The flux-sum variability approach is able to predict the change in metabolite pool sizes more accurately when the flux ranges are similar to the wild-type condition, as in the N− WT condition. Between the N− WT and N+ WT conditions, only approximately 7% of the reactions active in either condition have flux ranges at the maximum biomass that do not overlap. In the gln1-3 and gln1-4 mutant conditions, the fluxes are significantly perturbed, with 49% and 45% of the active reactions at maximum biomass resulting in nonoverlapping ranges compared with the N+ WT condition, respectively. The accuracy of flux sum in the gln1-3 mutant and gln1-4 mutant conditions with omics-based constraints incorporated reaches 53% and 25%, with eight of 15 metabolites predicted correctly and one of four metabolites predicted correctly in the gln1-3 and gln1-4 mutant conditions, respectively. This level of prediction accuracy is far below what was seen for N− WT, suggesting a tenuous connection between concentration changes and gene expression levels when the genetic background changes.
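The accuracy figures above are fractions of metabolites whose predicted direction of change matches the measured one. A minimal sketch of that bookkeeping follows; the metabolite names and directions are invented for illustration, and only the counts (eight of 15, one of four) come from the text.

```python
def directional_accuracy(predicted, measured):
    """Count metabolites whose predicted direction matches the measured one.

    Both arguments map metabolite name -> "up" or "down". Only metabolites
    present in both dictionaries are scored.
    """
    shared = [m for m in predicted if m in measured]
    if not shared:
        raise ValueError("no metabolites in common")
    correct = sum(1 for m in shared if predicted[m] == measured[m])
    return correct, len(shared), correct / len(shared)


# Hypothetical directions for four metabolites (illustration only).
predicted = {"Glu": "up", "Gln": "down", "Ala": "up", "Asp": "down"}
measured = {"Glu": "up", "Gln": "up", "Ala": "down", "Asp": "up"}
correct, total, accuracy = directional_accuracy(predicted, measured)

# The percentages quoted in the text follow from the stated counts.
gln1_3_percent = round(100 * 8 / 15)
gln1_4_percent = round(100 * 1 / 4)
```

With small denominators such as four metabolites, a single additional correct call shifts the accuracy by 25 percentage points, which is worth keeping in mind when comparing conditions.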


We explored the efficacy of the flux-sum method under different genetic backgrounds in a far better studied and more data-rich organism (i.e. Escherichia coli) to determine whether the dissonance between gene expression levels and concentrations was maize specific or applied broadly. We applied flux-sum variability to the Ishii et al. (2007) fluxomic and metabolomic data using the iAF1260 (Feist et al., 2007) E. coli model. Two single-gene knockout mutants (i.e. ppsA and glk) were compared with the wild-type condition, and predicting the directional change of the metabolite pool size was met with less than 50% accuracy in each condition. This implies that changes in the genetic background seem to cause concentration changes that are not predictable by gene expression changes alone. In contrast, changes in nutrient availability, as in the N− WT condition, can be captured with 90% accuracy.

We also decided to explore whether the dissonance between gene expression levels and concentration ranges was caused by a deficiency in the proposed flux-sum method. As an alternative, we used flux imbalance analysis (Reznik et al., 2013), which measures the effect of the deviation of a metabolite’s concentration from steady state on the maximum biomass by applying the concept of duality. Flux imbalance analysis examines how the model responds to a deviation from steady state by measuring the effect on biomass when a metabolite is allowed to accumulate or deplete. By determining the change in biomass formation due to the accumulation or depletion of the metabolite, a prediction can be made regarding the change in metabolite levels. Flux imbalance analysis was applied to the model, and the deviation in the maximum biomass was qualitatively compared with the experimental data for the metabolite in each compartment. Only nonoverlapping

ranges of the marginal value associated with each compartment-specific metabolite were analyzed. If all compartment-specific metabolites have marginal values that indicate the same trend compared with the N+ WT condition, a prediction was made for the tissue-specific metabolite. The flux imbalance analysis is 66%, 33%, and 78% accurate in the N− WT, gln1-3 mutant, and gln1-4 mutant conditions, respectively, as compared with the N+ WT condition. While flux imbalance analysis makes a prediction for every metabolite in the model, the flux-sum analysis only predicts a direction of change for metabolites whose associated reactions can carry flux. Flux imbalance analysis allows for the prediction of compartment-specific metabolites whose associated reactions do not carry flux under maximum biomass formation. Comparable results between the flux imbalance analysis and flux-sum analysis in the N− WT condition provide independent backing regarding the validity of the flux-sum concept.
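The duality idea behind flux imbalance analysis can be written compactly. The formulation below is a generic sketch based on standard linear programming duality; the symbols (c for the objective vector, S for the stoichiometric matrix, v for fluxes, b for the imbalance vector, λ for the marginal values) are notation assumed here rather than taken from the source, and sign conventions differ between solvers.

```latex
\begin{align}
Z^{*}(b) \;=\; \max_{v}\quad & c^{\top} v \\
\text{s.t.}\quad & S v = b, \\
& v^{\min} \le v \le v^{\max}, \\[4pt]
\lambda_i \;=\; & \left.\frac{\partial Z^{*}}{\partial b_i}\right|_{b = 0}.
\end{align}
```

At steady state b = 0, and the marginal value λ_i of metabolite i is the dual variable of its mass balance: it estimates how the maximum biomass Z* would change if that metabolite were allowed to accumulate or deplete slightly.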

Conclusion

We have introduced a second-generation model that is specific for the leaf tissue of maize and differentiates between the bundle sheath and mesophyll cell types. By incorporating transcriptomic and proteomic data into the model, we were able to reproduce the metabolomic data with up to 90% accuracy when comparing the N− WT and N+ WT conditions. Ethanol oxidoreductase/catalase, glyceraldehyde-3-phosphate dehydrogenase, Fru-bisP aldolase, Fru-bisphosphatase, and Rib-5-P isomerase were shown to be important genes related to the decrease in biomass formation in the modeled

conditions. In order to study the impact of these genes on plant biomass production when optimal N is provided, their functional validation can be undertaken using transgenic technologies, mutagenesis, or association genetics, either at the single-gene or genome-wide level (Simons et al., 2014). The model also predicted a modification of the flux of metabolites formed during glutathione catabolism in the gln1-4 mutant condition compared with the N+ WT condition. This modification is predicted to compensate for the lack of GS1-4 by using the Glu and pyruvate derived from glutathione to produce Ala.

Thus, it will be interesting to determine whether the increase in Ala is related to the importance of the enzyme Ala aminotransferase in the improvement of plant productivity in general and NUE in particular (Good and Beatty, 2011; McAllister et al., 2013). In all N background conditions (i.e. N− WT, gln1-3 mutant, and gln1-4 mutant conditions), we find that the flux through chlorophyll biosynthesis, and through those pathways directly related to chlorophyll biosynthesis, decreases, confirming the important link between N metabolism and chlorophyll synthesis through the use of its precursor Glu (Forde and Lea, 2007). The leaf model, with the addition of other maize tissue-specific models, can be integrated into a whole-plant genome-scale model for maize. By determining a required metabolic function that is specific to each tissue, tissue-specific models can be created, ensuring that only relevant reactions are included in each tissue.

Future efforts will focus on tissue-specific models for the kernel, stalk, tassel, and root tissues. These tissue-specific models will follow community (Zomorrodi and Maranas, 2012; Zomorrodi et al., 2014) and multitissue human model (Duarte et al., 2007; Bordbar et al., 2011; Thiele et al., 2013) reconstruction principles. The tissues can be linked using intertissue transport reactions, with the stalk tissue acting as the central transporter among the various tissues and particularly to the developing ear (Cañas et al., 2012). A whole-plant genome-scale model of maize will help to elucidate the flow of N from the root to the other tissues in the plant, from the shoot to the ear, and within the developing ear (Cañas et al., 2010). By modeling the entire plant, nonintuitive bottlenecks in N metabolism can be determined, which then can be used to suggest genetic interventions through mutagenesis, transgenic technology, or marker-assisted selection to increase the

NUE in maize. In addition, the flow of sugars to the kernel tissue can be analyzed to guide the increase of carbohydrate/sugar content of maize kernel by breaking the inverse relationship existing between carbohydrates and proteins (Feil et al., 1990). Apart from its crucial role as a food crop, maize is also used for cellulosic biofuels. To this end, the amount and composition of cell wall polymers are important in developing cellulosic maize. Lignin not only provides rigidity to the maize plant (Vanholme et al., 2008) but also makes the digestion of cellulosic and hemicellulosic sugars difficult during delignification (Li et al., 2008). Recent research endeavors have focused on altering lignin content, since plant viability and fitness are affected by lignin reductions (Li et al., 2008; Bonawitz et al., 2014). Therefore, by utilizing the whole-plant genome-scale model, a system-wide implication of these genetic disruptions can be quantitatively assessed, thus facilitating new strategies for reducing lignin content without affecting the mechanical integrity of the maize plant.


Materials and methods

Plant Material

Maize (Zea mays; genotype B73) wild-type plants and gln1-3 and gln1-4 mutant seeds in the B73 background (for the production, selection, and characterization of the mutants, see Martin et al., 2006) were grown as described by Amiour et al. (2012) in a greenhouse at the Institut National de la Recherche Agronomique (Versailles, France) from May to September 2004. Three individual plants of similar size and of similar developmental stage were selected, corresponding to the three replicates used for the omics experiments. The three youngest fully expanded leaves at the 10- to 11-leaf stage without the midrib were harvested and pooled for the vegetative stage samples to obtain enough homogeneous plant material representative of this plant development stage.

Plants were watered daily with a complete nutrient solution containing 10 mM KNO3 as the sole N source in the N+ WT, gln1-3 mutant, and gln1-4 mutant conditions (Coïc and Lesaint, 1971). The N− WT condition was supplied 0.01 mM KNO3. The complete nutrient solution also contained 1.25 mM K+, 0.25 mM Ca2+, 0.25 mM Mg2+, 1.25 mM H2PO4−, 0.75 mM SO42−, 21.5 μM Fe2+ (Sequestrene; Ciba-Geigy), 23 μM B3+, 9 μM Mn2+, 0.3 μM Mo2+, 0.95 μM Cu2+, and 3.5 μM Zn2+.


Yield Components Analysis

Kernel yield, its components, and the N content of different parts of the plant at stages of development from silking to maturity were determined according to the method described by Martin et al. (2005) and corresponded to the data described by Martin et al. (2006) and Amiour et al. (2012).

RNA and DNA Preparation

Total RNA was extracted as described by Verwoerd et al. (1989) from leaves that had been stored at −80°C. Total RNAs (50 μg) for transcriptome and quantitative real-time (qRT)-PCR studies were treated and prepared as described previously by Amiour et al. (2012). Reverse transcription reactions and quantitative first strands were synthesized according to Amiour et al. (2012). Primers for qRT-PCR and reverse transcription-PCR cloning were designed from bacterial artificial chromosome sequences found in the public maize genome databases (Maizesequence.org, PlantGDB, and GenBank). The sequences of the primers used in reverse transcription-PCR and qRT-PCR are presented in Supplemental Table S1.


Gene Expression Profiles Using Maize Complementary DNA Microarrays

Whole-genome leaf transcript profiling was performed using the maize 46K arrays obtained from the maize oligonucleotide array project (http://www.maizecdna.org/outreach/resources.html) as described previously by Amiour et al. (2012). The maize 46K spotted oligonucleotide array contains 46,000 unique probes from maize. Its detailed description, composition, and putative gene annotation can be found at the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6438). Transcript abundance in each of the three replicates for vegetative leaves was determined using a mixture of all the samples (18 in total, each with the same mRNA concentration) as a reference. Statistical significance for differentially expressed genes was evaluated through statistical group comparisons performed using multiple testing procedures as described by Amiour et al. (2012). Transcriptomic data were validated by qRT-PCR analysis performed on a selected number of up- or down-regulated gene transcripts.

Statistical Analysis of Maize Complementary DNA Microarray Data

Statistical significance for differentially expressed genes was evaluated with statistical group comparisons using multiple testing procedures. The following two gene selection approaches were applied: the significance analysis of microarrays (Tusher et al., 2001) permutation algorithm and a P value ranking strategy using both z statistics in ArrayStat 1.0 software (Imaging Research) and moderated t statistics using a moderated Student’s t test available in MAnGO tools (http://bioinfome.cgm.cnrs-gif.fr) and BRBArrayTools version 3.2.3 (Korn et al., 2002). For multiple testing corrections, the false discovery rate procedure was used (Benjamini and Hochberg, 1995). Statistical tests were computed and combined for each probe set using the log-transformed data. A significant probe set indicates that an adjusted P value was less than the effective α level (α = 0.05) in at least one of the two gene selection tests. A filtering procedure was used to exclude data points with low signal intensities (average log intensity mean [Amean] < 7.0) that are considered biologically unreliable.
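The false discovery rate correction and intensity filter described above can be sketched in a few lines. The function implements the standard Benjamini-Hochberg step-up adjustment; the P values and Amean intensities are invented, and the sketch applies a single test per probe set, whereas the text calls a probe set significant if either of two selection tests passes. It is not the code of ArrayStat, MAnGO, or BRBArrayTools.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


ALPHA = 0.05       # effective alpha level stated in the text
MIN_AMEAN = 7.0    # probes below this average log intensity are excluded

# Hypothetical raw P values and average log intensities for four probe sets.
p_raw = [0.01, 0.04, 0.03, 0.20]
amean = [8.2, 9.0, 6.5, 7.4]

p_adj = benjamini_hochberg(p_raw)
significant = [
    i for i, (q, a) in enumerate(zip(p_adj, amean))
    if q < ALPHA and a >= MIN_AMEAN
]
```

In this toy input only the first probe set survives: the second and third have adjusted P values just above 0.05, and the third would also be dropped by the intensity filter.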

Total Protein Extraction, Solubilization, and Quantification

A TCA/acetone protein precipitation was performed as described by Méchin et al. (2007) from the leaves of optimal N and limited N plants harvested at the vegetative stage of development. The frozen leaf powder was resuspended in acetone with 0.07% (v/v) 2-mercaptoethanol and 10% (w/v) TCA. Proteins were allowed to precipitate for 1 h at −20°C. The pellet was then washed overnight with acetone containing 0.07% (v/v) 2-mercaptoethanol. The supernatant was discarded, and the pellet was dried under vacuum. Protein resolubilization was performed according to Méchin et al. (2007) using 60 μL mg−1 R2D2 buffer [5 M urea, 2 M thiourea, 2% (w/v) CHAPS, 2% (w/v) SB3-10 (i.e. N-decyl-N,N-dimethyl-3-ammonio-1-propane-sulfonate), 20 mM dithiothreitol, 5 mM Tris(2-carboxyethyl)phosphine hydrochloride, and 0.75% (v/v) carrier ampholytes]. After resolubilization, samples were centrifuged and the supernatant was transferred to an Eppendorf tube prior to protein quantification. Total protein content of each sample was evaluated using the 2-D Quant kit (Amersham Biosciences).


Two-Dimensional Electrophoresis, Gel Staining, and Image Analysis

Total protein extraction, solubilization, and quantification were performed as described by Méchin et al. (2007). Solubilized proteins (300 µg) were subjected to two-dimensional gel electrophoresis and identified by liquid chromatography-mass spectrometry as described by Amiour et al. (2012).

Protein Identification by Liquid Chromatography-Tandem Mass Spectrometry

Spot digestion and liquid chromatography-tandem mass spectrometry were performed as described by Martin et al. (2006). In-gel digestion was performed with the Progest system (Genomic Solution). Gel pieces were washed twice by successive separate baths of 10% (v/v) acetic acid, 40% (v/v) ethanol, and acetonitrile (ACN). The pieces were then washed twice with successive baths of 25 mM NH4CO3 and ACN. Digestion was subsequently performed for 6 h at 37°C with 125 ng of modified trypsin (Promega) dissolved in 20% (v/v) methanol and 20 mM NH4CO3. The peptides were extracted successively with 2% (v/v) trifluoroacetic acid and 50% (v/v) ACN and then with ACN. Peptide extracts were dried in a vacuum centrifuge and suspended in 20 mL of 0.05% (v/v) trifluoroacetic acid, 0.05% (v/v) formic acid, and 2% (v/v) ACN. HPLC was performed on an Ultimate liquid chromatography system combined with a Famos autosampler and a Switchos II microcolumn switch system (Dionex).

Trypsin digestion was declared with one possible cleavage. Cys carboxyamidomethylation and Met oxidation were set to static and variable modifications, respectively. A multiple-threshold filter was applied at the peptide level: Cross correlation (i.e. Xcorr) magnitudes were up to 1.7, 2.2, 3.3, and 4.3 for peptides with one, two, three, and four isotopic charges, respectively; peptide probability was lower than 0.05, ΔCn > 0.1, with a minimum of two different peptides for an identified protein. Here, ΔCn is the change between the first and second cross correlation. A database search was performed with BioWorks 3.3.1 (Thermo Electron). The Institute for Genomic Research maize gene index database version 16, with 72047*6 (for a total of 72047 EST reads in the six reading frames) EST sequences (http://compbio.dfci.harvard.edu/tgi/), was used.

Metabolite Extraction and Analyses

Lyophilized leaf material was used for metabolite extraction. Approximately 20 mg of the powder was extracted in 1 mL of 80% (v/v) ethanol and 20% (v/v) distilled water for 1 h at 4°C. During extraction, the samples were continuously agitated and then centrifuged for 5 min at 15,000 rpm. The supernatant was removed, and the pellet was subjected to a further extraction in 60% (v/v) ethanol and finally in water at 4°C, as described above. All supernatants were combined to form the aqueous extract.

Nitrate was determined by the method of Cataldo et al. (1975). Total soluble amino acids were determined by the colorimetric method of Rosen (1957) with Leu as a standard.


Chlorophyll was estimated using 10 mg of fresh leaf material (Arnon, 1949). The total N content of 2 mg of lyophilized material was determined in an N elemental analyzer using the combustion method of Dumas (Flash 2000; Thermo Scientific). Starch content was determined as described by Ferrario-Mery et al. (1998).

Total lipids were extracted from frozen leaf material according to Miquel and Browse (1992). Individual lipids were purified from the extracts by one-dimensional thin-layer chromatography on silica gel 60 plates (Lepage, 1967; Ohnishi and Yamada, 1980), which were obtained from Merck-Millipore. Lipids were located by spraying the plates with a solution of 0.001% (w/v) primuline (Sigma) in 80% (v/v) acetone, followed by visualization under UV light. To determine the fatty acid composition and relative amounts of individual lipids, the silica gel for each lipid was transferred to a screw-capped tube with 1 mL of 2.5% (v/v) H2SO4 in methanol and an appropriate amount of C17:0 fatty acid (Sigma) as an internal standard. After heating for 90 min at 80°C, 1 mL of hexane and 1.5 mL of 0.9% (w/v) NaCl were added. Fatty acids were extracted in the upper organic phase by shaking and low-speed centrifugation. Samples (1 µL) of the organic phase were separated by gas chromatography on a 30-m × 0.53-mm EC-WAX column (Alltech Associates) and quantified using a flame ionization detector. The gas chromatograph was programmed for an initial temperature of 160°C for 1 min, followed by an increase of 20°C min−1 to 190°C and a ramp of 4°C min−1 to 230°C, with a 9-min hold of the final temperature.
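As a quick arithmetic check on the temperature program above, the total run time per injection can be worked out from the stated holds and ramps. The helper below is a hypothetical illustration (its name and argument layout are assumptions); it only adds the hold times to the time each linear ramp takes.

```python
def gc_run_time(initial_temp, initial_hold, ramps, final_hold):
    """Total run time in minutes for a program of linear temperature ramps.

    ramps is a list of (rate in deg C per min, target temperature in deg C).
    """
    total = initial_hold
    temp = initial_temp
    for rate, target in ramps:
        total += (target - temp) / rate  # minutes spent on this ramp
        temp = target
    return total + final_hold


# 160 deg C for 1 min, 20 deg C/min to 190, 4 deg C/min to 230, 9-min hold
print(gc_run_time(160, 1, [(20, 190), (4, 230)], 9))  # 21.5
```

The two ramps take 1.5 and 10 min, so with the 1-min and 9-min holds each run lasts 21.5 min, excluding any cool-down between injections.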


The monosaccharide composition and linkage analysis of polysaccharides were determined as follows: 100 mg (fresh weight) of ground leaf was washed twice in 4 volumes of absolute ethanol for 15 min, then rinsed twice in 4 volumes of acetone at room temperature for 10 min, and left to dry under a fume hood overnight at room temperature.

The neutral monosaccharide composition was measured on 5 mg of dried alcohol-insoluble material after hydrolysis in 2.5 M trifluoroacetic acid for 1.5 h at 100°C as described by Harholt et al. (2006). To determine the cellulose content, the residual pellet obtained after the monosaccharide analysis was rinsed twice with 10 volumes of water and hydrolyzed with H2SO4 as described by Updegraff (1969). The released Glc was diluted 500 times and then quantified using high-performance anion-exchange chromatography-pulsed-amperometric detection as described by Harholt et al. (2006).

For lignin quantification, 100 mg (fresh weight) of ground leaf was washed twice in 4 volumes of absolute ethanol for 15 min and twice with 4 volumes of water at room temperature, then rinsed twice in 4 volumes of acetone at room temperature for 10 min, and left to dry under a fume hood overnight at room temperature. The following protocol is adapted from Fukushima and Hatfield (2001). Lignins from the prepared cell wall residue were solubilized in 1 mL of acetyl bromide solution (acetyl bromide:acetic acid, 1:3 [v/v]) in a glass vial at 55°C for 2.5 h under shaking. Samples were then allowed to cool to room temperature, and 1.2 mL of 2 M NaOH:acetic acid (9:50 v/v) was added to the vial. One hundred microliters of this sample was transferred into 300 µL of 0.5 M hydroxylamine chlorhydrate and mixed with 1.4 mL of acetic acid. The A280 of the samples was measured. The lignin content was calculated using the following formula:


Metabolome Analysis

All steps were adapted from the original protocol described by Fiehn (2006) following the procedure described by Amiour et al. (2012).

Model Development and Curation

Figure 4-5 outlines the work flow used for model development. Our previously developed maize model, iRS1563 (Saha et al., 2011), and biological databases such as MetaCrop (downloaded in December 2012; Schreiber et al., 2012) and MaizeCyc (version 2.0.2; Monaco et al., 2013) provided information pertaining to the genes, proteins, reactions, and metabolites used to reconstruct the second-generation maize leaf genome-scale model. In addition, available proteomic and transcriptomic data, maize-specific biological databases, namely MetaCrop and MaizeCyc, and published literature were used to assign cellular (i.e. bundle sheath or mesophyll) and intracellular organelle specificity to the curated reactions.

When the gene expression level was reported in reads per kilobase per million mapped reads (RPKM; Li et al., 2010; Chang et al., 2012), the cell specificity of any gene i can be calculated as:


Here, mi and bi are the RPKM abundance of gene i in the mesophyll and bundle sheath cells, respectively (Chang et al., 2012). A gene that is only expressed in one cell type will have an Ri of 1, while a gene that is equally expressed in both cell types will have an Ri of 0. As suggested by Chang et al. (2012), a threshold of 0.8 or a 5-fold abundance difference is adopted to assign gene cell type specificity. In the absence of RPKM information, an adjusted spectral count (adjSPC) along with the fold change difference between the mesophyll and bundle sheath cells was used to determine gene cell type specificity (Friso et al., 2010). The adjSPC is the number of mass spectra identified for a protein normalized by the number of unique spectral counts. Since low counts are not statistically informative, a cutoff of 10 was used for adjSPC (Zybailov et al., 2008; Kim et al., 2009). Similar to the threshold used for RPKM data, a 5-fold difference between the mesophyll and bundle sheath cell type normalized spectral abundance factor was used to determine the cellular specificity of any gene (Friso et al., 2010). The normalized spectral abundance factor is a weighted adjSPC based on the number of theoretical tryptic peptides with a relevant length (Ehleringer et al., 1997; Friso et al., 2010). Additional intracellular compartmentalization was carried out based on the MetaCrop database (Schreiber et al., 2012), the MaizeCyc database (Monaco et al., 2013), and primary literature sources (Chang et al., 2012; Zhao et al., 2013).
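The RPKM-based assignment rule described above can be illustrated with a short Python sketch. The exact expression for Ri is not reproduced in this text, so the sketch assumes a normalized difference, |mi − bi| / (mi + bi), chosen only because it equals 1 for a gene expressed in a single cell type and 0 for a gene expressed equally in both. The function names, the inclusive thresholds, and the handling of unexpressed genes are likewise assumptions.

```python
def cell_specificity(m, b):
    """Assumed index |m - b| / (m + b); None when the gene is not expressed."""
    if m + b == 0:
        return None
    return abs(m - b) / (m + b)


def fold_difference(m, b):
    """Ratio of the larger to the smaller abundance (inf if one is zero)."""
    low, high = sorted((m, b))
    return float("inf") if low == 0 else high / low


def assign_cell_type(m, b, r_threshold=0.8, fold_threshold=5.0):
    """Label a gene from mesophyll (m) and bundle sheath (b) RPKM values."""
    r = cell_specificity(m, b)
    if r is None:
        return "unassigned"
    if r >= r_threshold or fold_difference(m, b) >= fold_threshold:
        return "mesophyll" if m > b else "bundle sheath"
    return "both"


print(assign_cell_type(50, 10))  # mesophyll (5-fold difference)
print(assign_cell_type(12, 10))  # both
```

Under the assumed index, a 5-fold difference gives a value of about 0.67, which is why the sketch treats the 0.8 threshold and the 5-fold criterion as alternatives rather than as the same test.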

The intracellular compartmentalization was determined based first on the MetaCrop database (Schreiber et al., 2012), literature sources (Friso et al., 2010; Chang et al., 2012), compartmentalization information in the MaizeCyc database, and finally the Plant Proteomics Database (Sun et al., 2009). An original set of intercellular and intracellular transporters was determined based on literature evidence (Alberte and Thornber, 1977; Leegood, 1985; Stitt and Heldt, 1985; Furbank et al., 1989; Weiner and Heldt, 1992; Doulis et al., 1997; Burgener et al., 1998; Taniguchi et al., 2004; Sowiński et al., 2008; Friso et al., 2010). In the subsequent standardization step, the MetRxn knowledgebase (Kumar et al., 2012) as well as manual curation were used to standardize the description of metabolites and reactions such as fixing stoichiometric errors (i.e. elemental or charge imbalances) and incomplete atomistic details (e.g. absence of stereospecificity and presence of unspecified side chains). Reactions and metabolites were given the Kyoto Encyclopedia of Genes and Genomes identifiers where available or were otherwise given new identifiers (in the form of MR or MC, respectively). Reaction directionality was adopted from the manually curated MetaCrop database, as available, and from the MaizeCyc database for the remaining reactions.
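One of the stoichiometric checks mentioned above, elemental balance, is easy to illustrate. The sketch below is not the MetRxn implementation; it is a hypothetical helper that assumes simple molecular formulas without parentheses or charges and reports which elements fail to balance in a reaction. A charge check would follow the same pattern.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C6H12O6' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def elemental_imbalance(reaction, formulas):
    """Return {element: net atoms} for every element that does not balance.

    reaction maps metabolite id -> stoichiometric coefficient
    (negative for reactants, positive for products).
    """
    net = Counter()
    for met, coeff in reaction.items():
        for element, n in parse_formula(formulas[met]).items():
            net[element] += coeff * n
    return {element: n for element, n in net.items() if n != 0}


formulas = {"glc": "C6H12O6", "o2": "O2", "co2": "CO2", "h2o": "H2O"}
balanced = {"glc": -1, "o2": -6, "co2": 6, "h2o": 6}
broken = {"glc": -1, "o2": -6, "co2": 6, "h2o": 5}
print(elemental_imbalance(balanced, formulas))  # {}
print(elemental_imbalance(broken, formulas))
```

An empty result means the reaction is elementally balanced; the second reaction is missing one water and is flagged for H and O.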

In the next step of model development, all reactions (including metabolic, intracellular, and extracellular transport reactions) were divided into two categories based on the evidence of their intercellular and intracellular compartmental specificity. The core set contains all metabolic reactions with experimental or literature-backed evidence of intracellular or intercellular compartmentalization as well as known intracellular and intercellular transporters. The noncore set contains reactions with partial or completely absent localization information. Barring any conflicting evidence, these reactions were provisionally placed in all compartments. An optimization formulation (as shown below) was developed by imposing flow through the maximal number of core reactions while including minimal intracellular and intercellular transporters and minimal participation of noncore reactions in various compartments. A parsimony criterion was used to apportion noncore functions so that core functions could be restored. Furthermore, in order to restore a core function, the resolution strategy was prioritized in the following order: (1) apportion noncore reaction(s) in one/multiple compartment(s); (2) add intracellular transporter(s); and (3) add intercellular transporter(s). To this end, an objective function was formulated by taking the weighted sum of the number of noncore reactions and intracellular and intercellular transporters by providing weights of 1, 10^4, and 10^6, respectively, for these three groups of reactions. However, it is important to ensure that any resolution strategy does not cause thermodynamically infeasible cycles. Therefore, each of these solutions was further checked, and those reactions that result in the formation of a cycle were rejected. For each core reaction, multiple solutions were determined, and the solution that fixes the largest number of core reactions was accepted. When required, manual curation was used to delineate between multiple solutions. This approach is analogous to the one proposed by Mintz-Oron et al. (2012) but does not rely on a complicated scoring system. It is also computationally less taxing, as it activates one core reaction at a time. Furthermore, in contrast to the approach of Mintz-Oron et al. (2012), the method proposed here allows for the minimal number of transporters added, rather than potentially minimizing the flux through many transporters. The process of minimally adding the number of reactions and transporters to the model is similar to that used by the model SEED (Henry et al., 2010). In order to allow flux through all reactions in the core set C = {1,…,c}, we minimized the addition of reactions from the noncore set NC = {c + 1,…,g}, intracellular transporter set T = {g + 1,…,t}, and intercellular transporter set IC = {t + 1,…,m}. This encompasses an overall set of reactions M = {1,…,m} and a set of metabolites N = {1,…,n}. In addition, binary variable yj is defined as:

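$$y_j = \begin{cases} 1 & \text{if reaction } j \text{ is added to the model} \\ 0 & \text{otherwise} \end{cases} \qquad \forall j \in NC \cup T \cup IC$$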

The task of identifying the minimal set of additional reactions that enable flux through a core reaction j* is posed as the following mixed-integer linear programming problem:

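$$\text{Minimize} \quad c_1 \sum_{j \in NC} y_j + c_2 \sum_{j \in T} y_j + c_3 \sum_{j \in IC} y_j \tag{3}$$

subject to:

$$\sum_{j \in M} S_{ij} v_j = 0, \quad \forall i \in N \tag{4}$$

$$v_{j^*} \geq \varepsilon \tag{5}$$

$$v_{j,\min} \leq v_j \leq v_{j,\max}, \quad \forall j \in C \tag{6}$$

$$v_{j,\min}\, y_j \leq v_j \leq v_{j,\max}\, y_j, \quad \forall j \in NC \cup T \cup IC \tag{7}$$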

Here, Sij is the stoichiometric coefficient of metabolite i in reaction j and vj is the flux value of reaction j. Parameters vj,min and vj,max denote the minimum and maximum allowable fluxes for reaction j, respectively. vj* represents the core reaction flux that is currently being unblocked, and ε is a small value to ensure a threshold amount of flux through each core reaction. c1, c2, and c3 represent weights associated with each set of reactions (i.e. noncore set, intracellular transporter set, and intercellular transporter set, respectively). In this formulation, the objective function 3 above minimizes the number of added reactions (from three reaction sets as mentioned earlier) so as to restore flux through reaction j*. We chose values of 1, 10^4, and 10^6 for c1, c2, and c3, respectively, so that metabolic reactions without experimental or literature evidence for compartmental specificity are added to specific compartment(s) before including additional transport reactions with no literature evidence. Constraint set 4 above represents the pseudo-steady-state assumption, while constraint 5 determines the threshold amount of flux necessary through j*. Bounds on core reaction fluxes are imposed by constraint set 6, while constraint set 7 ensures that only reactions from those three sets having nonzero flow are added to the model. This algorithm is repeated for each core reaction j* to ensure flux and, hence, provides compartmentalization assignments for 431 metabolic reactions by assigning them to at least one compartment, adding 1,032 total metabolic reactions to the model, as shown in Table 4-2.
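The effect of the widely separated weights can be seen in a small Python sketch. It does not solve the mixed-integer program; it only evaluates the weighted objective for a few hypothetical candidate resolutions (the candidate names and counts are invented for illustration) to show how the weights of 1, 10^4, and 10^6 encode the stated priority order.

```python
# Weights for the three groups of added reactions, as stated in the text.
WEIGHTS = {"noncore": 1, "intracellular": 10**4, "intercellular": 10**6}


def resolution_cost(added):
    """Weighted count of added reactions; added maps group -> number added."""
    return sum(WEIGHTS[group] * count for group, count in added.items())


# Hypothetical alternative ways of unblocking one core reaction.
candidates = {
    "three noncore reactions": {"noncore": 3},
    "one intracellular transporter": {"intracellular": 1},
    "one intercellular transporter": {"intercellular": 1},
}
best = min(candidates, key=lambda name: resolution_cost(candidates[name]))
print(best)  # three noncore reactions
```

With these weights, any resolution using fewer than 10^4 noncore reactions is cheaper than one that adds a single intracellular transporter, and fewer than 100 intracellular transporters are cheaper than one intercellular transporter, so the objective behaves like a strict priority ordering at the scale of this model.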

The reactions identified by the above-mentioned algorithm plus the reactions from the core set constituted two new sets, a set of reactions with resolved compartmental information and a set whose location still needs resolution, as shown in Figure 4-5. Reactions from the latter set that are known to occur within the maize leaf tissue but were not in the initial model were added to intracellular/intercellular compartments manually based on pathway localization or simply added to the cytosol of bundle sheath and/or mesophyll cells. Thermodynamically infeasible cycles were resolved by changing the minimum number of reaction directionalities possible and eliminating the smallest number of reactions from the model (Schellenberger et al., 2011) while conserving biomass formation. An optimization procedure was iteratively run for each reaction in a thermodynamically infeasible cycle to determine the minimum number of directionality changes or reaction removals required to fix the cycle. These results were then compared for each reaction to determine the changes that resolve the largest number of reactions participating in thermodynamically infeasible cycles. The solutions found were manually inspected before the changes were applied to the model. The application of this optimization procedure led to restricting the directionality for 507 reactions that prevented 889 reactions from carrying unbounded fluxes, thus eliminating the corresponding thermodynamically infeasible cycles.
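To make the notion of a thermodynamically infeasible cycle concrete, the toy Python sketch below checks whether a flux vector over internal reactions alone (no exchange reactions) balances every metabolite while carrying flux. The three-reaction network and the function names are hypothetical, and the check is only the defining condition of such a loop, not the optimization procedure used in the study.

```python
def net_production(S, v):
    """Return S.v, the net production rate of each metabolite (rows of S)."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_internal_loop(S_internal, v, tol=1e-9):
    """True if v is nonzero yet balances all metabolites with no exchange."""
    has_flux = any(abs(v_j) > tol for v_j in v)
    balanced = all(abs(x) <= tol for x in net_production(S_internal, v))
    return has_flux and balanced


# Toy network: R1: A -> B, R2: B -> C, R3: C -> A (rows are A, B, C).
S = [[-1, 0, 1],
     [1, -1, 0],
     [0, 1, -1]]
print(is_internal_loop(S, [1, 1, 1]))  # True: flux circulates with no input
print(is_internal_loop(S, [1, 1, 0]))  # False: blocking R3 breaks the loop
```

In this toy case, restricting the direction of R3 so that it cannot carry forward flux is enough to remove the cycle, which mirrors the kind of directionality change described above.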

In the final step, as shown in Figure 4-5, the GapFind/GapFill (Kumar et al., 2007) procedure was applied to identify blocked/dead-end metabolites and subsequently restore their connectivity. A gap-filling database of reactions was created by combining reactions from phylogenetically close/model plant species (i.e. rice [Oryza sativa ssp. japonica], Brachypodium distachyon, sorghum [Sorghum bicolor], and Arabidopsis thaliana), noncore reactions without compartmental specificity (not identified by our aforementioned algorithm), and all possible intracellular/intercellular transporters. The gap-filling procedure was modified by prioritizing the addition of reactions from closely related/model plant species or noncore reactions over transporters to unblock the flow through metabolites while ensuring that no new thermodynamically infeasible cycles are created. After completing this step, we added five reactions from closely related/model plant species, changed the directionality of 14 reactions, and added eight intracellular transporters.
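The idea of a dead-end metabolite can be illustrated with a purely topological Python sketch. GapFind itself is an optimization-based procedure that also detects metabolites blocked further downstream, so the helper below is a simplified stand-in: it only flags metabolites that no reaction can produce or that no reaction can consume. The reaction dictionary layout and the toy network are assumptions.

```python
def dead_end_metabolites(reactions):
    """Find metabolites that can only be produced or only be consumed.

    reactions maps reaction id -> (stoichiometry dict, reversible flag),
    with negative coefficients for reactants and positive for products.
    """
    produced, consumed = set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            if coeff > 0 or reversible:
                produced.add(met)
            if coeff < 0 or reversible:
                consumed.add(met)
    all_mets = produced | consumed
    return {
        "no_production": sorted(all_mets - produced),
        "no_consumption": sorted(all_mets - consumed),
    }


toy = {
    "uptake": ({"A": 1}, False),
    "R1": ({"A": -1, "B": 1}, False),
    "R2": ({"B": -1, "C": 1, "D": 1}, False),
    "R3": ({"E": -1, "A": 1}, False),
    "export": ({"C": -1}, False),
}
print(dead_end_metabolites(toy))
```

Here D is produced but never consumed and E is consumed but never produced, so both would be candidates for gap filling by an added reaction or transporter.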


Incorporation of Transcriptomic, Proteomic, and Metabolomic Data

Significantly different gene transcripts and proteins were incorporated into the model by switching off corresponding reactions under the N+ WT, N− WT (Amiour et al., 2012), gln1-3 mutant, and gln1-4 mutant (Martin et al., 2006) conditions. The number of proteins, gene transcripts, and metabolites with abundances that are statistically differentially expressed in the various conditions are listed in Table 4-4. Reactions whose gene-protein-reaction associations (GPRs) are linked to significantly lowered transcriptomic and proteomic expression are switched off under the corresponding conditions. Metabolite turnover rates were determined based on the flux-sum analysis method (Chung and Lee, 2009) and compared with the metabolomic data. The range of the flux sum or the flow through of each metabolite with experimental measurements was maximized/minimized as follows:

Here, set E represents the set of metabolites with experimental measurements and set LE represents reactions with statistically lower expression of gene transcripts and/or proteins. The formulation was run in an iterative manner for each metabolite with experimental measurements. The formulation was also repeated for each individual condition, ensuring that the proper nutrients and simulated knockouts were considered. By linearizing the objective function, the resulting formulation is a mixed-integer linear programming problem similar to the description by Chung and Lee (2009). Therefore, the basic idea is to determine the range of the flux sum of a metabolite (for which metabolomic data are available) under a given condition by switching off reaction fluxes corresponding to gene transcripts and/or proteins with lower expression levels (i.e. constraint 9). The flux-sum ranges were determined at the maximum biomass for the condition as displayed in constraint 10. Predictions were made only when the flux-sum ranges did not overlap between the N background condition and the N+ WT condition and when the direction of change in all compartments was consistent. In this way, the compartment-specific predictions of the flux-sum ranges were compared with tissue-specific experimental measurements. The flux-sum levels in the N− WT, gln1-3 mutant, and gln1-4 mutant conditions were compared with the reference N+ WT condition to find the qualitative trend in the change of metabolite pool size between the conditions.
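The quantities compared above can be sketched in Python. The flux sum of a metabolite follows the definition of Chung and Lee (2009), half the sum of the absolute fluxes producing and consuming it. The remaining helpers are hypothetical illustrations of the prediction rule: a trend is called only when the flux-sum ranges do not overlap and every compartment agrees. The optimization that produces the ranges is not shown.

```python
def flux_sum(S_row, v):
    """Flux sum of one metabolite: half the total absolute flux through it."""
    return 0.5 * sum(abs(s_ij * v_j) for s_ij, v_j in zip(S_row, v))


def pool_trend(reference_range, condition_range):
    """Trend relative to the reference, or None if the ranges overlap."""
    ref_min, ref_max = reference_range
    cond_min, cond_max = condition_range
    if cond_min > ref_max:
        return "increase"
    if cond_max < ref_min:
        return "decrease"
    return None


def consistent_trend(ranges_by_compartment):
    """Single trend shared by all compartments, else None.

    ranges_by_compartment maps compartment -> (reference range, condition range).
    """
    trends = {pool_trend(ref, cond) for ref, cond in ranges_by_compartment.values()}
    return trends.pop() if len(trends) == 1 else None


# One producing reaction (flux 2.0) and two consuming reactions (1.5 and 0.5).
print(flux_sum([1, -1, -1], [2.0, 1.5, 0.5]))  # 2.0
print(consistent_trend({"cytosol": ((1, 2), (3, 4)), "chloroplast": ((0, 1), (2, 3))}))
```

At steady state the producing and consuming fluxes are equal, which is why the factor of one half makes the flux sum equal to the turnover rate of the metabolite.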


Flux variability analysis was used to determine the flux range of each reaction under maximum biomass by subsequently maximizing and minimizing the flux through each reaction. The flux range of each reaction for the N− WT, gln1-3 mutant, and gln1-4 mutant conditions was compared with the reference N+ WT condition. Flux ranges that did not overlap between one of the N background conditions and the reference condition were further analyzed. These are reactions that must change in response to the limited amount of N or the mutant conditions. Finally, we determined for each condition the minimum number of reactions that, when not regulated, will restore the biomass to the yield obtained when no omics-based regulation is applied. This was done by identifying the minimal set of reactions, included in the omics-based regulation, that when active would allow for a biomass yield equivalent to the yield under no omics-based regulation. This set of reactions represents the reactions whose restriction affects the biomass yield. The CPLEX solver (version 12.3 IBM ILOG) was used in the GAMS environment (version 23.3.3; GAMS Development) to solve the optimization problems. The Python programming language was also used during model development (mainly for scripting and data analysis). All computations were carried out on Intel Xeon X5675 Six-Core 3.06 GHz processors constituting the Lion-XF cluster, which was built and operated by the Research Computing and Cyberinfrastructure Group of Pennsylvania State University.

Acknowledgments

We thank Isabelle Quilleré, François Gosse, and Michel Lebrusq for technical help.


Figures and tables

Table 4-1 Experimental content of classes of metabolites in different conditions

The biomass components were determined experimentally for each of the conditions (N+ WT, N− WT, gln1-3 mutant, and gln1-4 mutant). Values are means of three replicates unless indicated by the asterisk, indicating that two replicate measurements were taken. Biomass measurements for the specific metabolites within each class are displayed in Supplemental Table S1.

Figure 4-1 Weight percentage of biomass components.

The weight percentage for each class of metabolites experimentally measured contributing to biomass synthesis is displayed. The composition is displayed for the N+ WT (A), N− WT (B), gln1-3 mutant (C), and gln1-4 mutant (D) conditions. The measurements for specific components within each class of metabolites are shown in Supplemental Table S1. [See online article for color version of this figure.]


Figure 4-2 Number of metabolic and transport reactions distributed between compartments in the bundle sheath and mesophyll cell types.

The numbers of metabolic and transport reactions are shown for each compartment. Integral membrane proteins are counted for the compartment in which the main biotransformation occurs. For example, the ATP synthase associated with the mitochondrial electron transport chain is counted as a metabolic reaction in the mitochondrion, not the inner mitochondrial membrane (IMM). [See online article for color version of this figure.]

Table 4-2 Number of reactions after each model creation and curation step

The original two data sets are the core set and the noncore set, which combine to form the final model statistics. The total number of metabolic, transport, exchange, and biomass reactions are displayed after each process during model curation. Metabolic reaction totals include duplication from compartmentalization.


Table 4-3 Summary of reactions that affect biomass synthesis

The minimum set of reactions that are down-regulated as a result of the inclusion of proteomic and transcriptomic data and affect biomass synthesis is displayed. The corresponding condition is displayed for each reaction as well as the role of the reaction.


Figure 4-3 Number of metabolites in each condition that statistically varied from the N+ WT condition at the vegetative stage.

The numbers of metabolites that experimentally significantly increased (up arrows) or decreased (down arrows) in comparison with the N+ WT condition are displayed for each of the N conditions tested (i.e. N− WT, gln1-3 mutant, and gln1-4 mutant conditions). The metabolites are shaded based on whether they are involved in carbon (C), N, or other metabolism. [See online article for color version of this figure.]


Figure 4-4 Effect of omics-based regulation on the flux-sum prediction compared with the experimental trend in metabolite concentration.

The accuracy in predicting the increasing (up arrows) or decreasing (down arrows) trend in metabolite change between the N background condition and the N+ WT condition is displayed. By restricting the reaction flux based on the transcriptomic and proteomic data, the accuracy of the qualitative trend in metabolite pool size between the N− WT and N+ WT conditions increases. Before adding omics-based constraints, the model was able to correctly predict the direction of change in 13% of the metabolites measured in the N− WT condition compared with the N+ WT condition. The accuracy increases to 90% when omics-based constraints are included. The flux-sum method is not able to accurately represent the gln1-3 and gln1-4 mutant conditions, suggesting that the genetic background affects the ability of the flux-sum method to predict metabolite changes. [See online article for color version of this figure.]


Figure 4-5 Model development and curation schematic.

The work flow for the second-generation genome-scale metabolic model of the maize leaf is displayed. The data sources give three types of retrieved data (i.e. raw reaction data, reaction directionality, and compartmentalization) that are then manipulated as shown to create the final model. [See online article for color version of this figure.]

Table 4-4 Number of gene transcripts, proteins, and metabolites that vary significantly

The wild-type conditions from each study were combined to create one uniform N+ WT condition. The numbers of gene transcripts, proteins, and metabolites that statistically vary are displayed.


Bibliography

1. Kumar A, Suthers PF, Maranas CD: MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases. BMC bioinformatics 2012, 13:6.
2. Saha R, Suthers PF, Maranas CD: Zea mays iRS1563: a comprehensive genome-scale metabolic reconstruction of maize metabolism. PloS one 2011, 6(7):e21784.
3. Saha R, Verseput AT, Berla BM, Mueller TJ, Pakrasi HB, Maranas CD: Reconstruction and comparison of the metabolic potential of cyanobacteria Cyanothece sp. ATCC 51142 and Synechocystis sp. PCC 6803. PloS one 2012, 7(10):e48285.
4. Kim TY, Sohn SB, Kim YB, Kim WJ, Lee SY: Recent advances in reconstruction and applications of genome-scale metabolic models. Current opinion in biotechnology 2012, 23(4):617-623.
5. Pitkanen E, Rousu J, Ukkonen E: Computational methods for metabolic reconstruction. Current opinion in biotechnology 2010, 21(1):70-77.
6. Esvelt KM, Wang HH: Genome-scale engineering for systems and synthetic biology. Mol Syst Biol 2013, 9:641.
7. Ranganathan S, Suthers PF, Maranas CD: OptForce: an optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS Comput Biol 2010, 6(4):e1000744.
8. Thiele I, Palsson BO: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature protocols 2010, 5(1):93-121.
9. Blazier AS, Papin JA: Integration of expression data in genome-scale metabolic network reconstructions. Frontiers in physiology 2012, 3:299.
10. Reed JL: Shrinking the Metabolic Solution Space Using Experimental Datasets. PLoS computational biology 2012, 8(8).
11. Kim HU, Kim SY, Jeong H, Kim TY, Kim JJ, Choy HE, Yi KY, Rhee JH, Lee SY: Integrative genome-scale metabolic analysis of Vibrio vulnificus for drug targeting and discovery. Mol Syst Biol 2011, 7:460.
12. Lerman JA, Hyduke DR, Latif H, Portnoy VA, Lewis NE, Orth JD, Schrimpe-Rutledge AC, Smith RD, Adkins JN, Zengler K et al: In silico method for modelling metabolism and gene product expression at genome scale. Nat Commun 2012, 3.
13. Karr JR, Sanghvi JC, Macklin DN, Gutschow MV, Jacobs JM, Bolival B, Assad-Garcia N, Glass JI, Covert MW: A Whole-Cell Computational Model Predicts Phenotype from Genotype. Cell 2012, 150(2):389-401.
14. Jerby L, Shlomi T, Ruppin E: Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Molecular Systems Biology 2010, 6.
15. Thiele I, Swainston N, Fleming RM, Hoppe A, Sahoo S, Aurich MK, Haraldsdottir H, Mo ML, Rolfsson O, Stobbe MD et al: A community-driven global reconstruction of human metabolism. Nature biotechnology 2013, 31(5):419-425.
16. Grafahrend-Belau E, Junker A, Eschenroder A, Muller J, Schreiber F, Junker BH: Multiscale metabolic modeling: dynamic flux balance analysis on a whole-plant scale. Plant physiology 2013, 163(2):637-647.
17. Pagani I, Liolios K, Jansson J, Chen IMA, Smirnova T, Nosrat B, Markowitz VM, Kyrpides NC: The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic acids research 2012, 40(D1):D571-D579.
18. Zhou T: Computational reconstruction of metabolic networks from KEGG. Methods Mol Biol 2013, 930:235-249.
19. Chen N, del Val IJ, Kyriakopoulos S, Polizzi KM, Kontoravdi C: Metabolic network reconstruction: advances in in silico interpretation of analytical information. Current opinion in biotechnology 2012, 23(1):77-82.
20. Zomorrodi AR, Suthers PF, Ranganathan S, Maranas CD: Mathematical optimization applications in metabolic networks. Metabolic engineering 2012, 14(6):672-686.
21. Hieno A, Naznin HA, Hyakumachi M, Sakurai T, Tokizawa M, Koyama H, Sato N, Nishiyama T, Hasebe M, Zimmer AD et al: ppdb: plant promoter database version 3.0. Nucleic Acids Res 2013.
22. Tanz SK, Castleden I, Hooper CM, Vacher M, Small I, Millar HA: SUBA3: a database for integrating experimentation and prediction to define the SUBcellular location of proteins in Arabidopsis. Nucleic Acids Res 2013, 41(Database issue):D1185-1191.
23. Mintz-Oron S, Aharoni A, Ruppin E, Shlomi T: Network-based prediction of metabolic enzymes' subcellular localization. Bioinformatics 2009, 25(12):i247-252.
24. Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muniz-Rascado L, Garcia-Sotelo JS, Weiss V, Solano-Lira H, Martinez-Flores I, Medina-Rivera A et al: RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res 2013, 41(Database issue):D203-213.
25. Yilmaz A, Nishiyama MY, Jr., Fuentes BG, Souza GM, Janies D, Gray J, Grotewold E: GRASSIUS: a platform for comparative regulatory genomics across the grasses. Plant physiology 2009, 149(1):171-180.
26. Wittig U, Kania R, Golebiewski M, Rey M, Shi L, Jong L, Algaa E, Weidemann A, Sauer-Danzwith H, Mir S et al: SABIO-RK--database for biochemical reaction kinetics. Nucleic Acids Res 2012, 40(Database issue):D790-796.
27. Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martinez C, Fulcher C, Huerta AM, Kothari A, Krummenacker M et al: EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res 2013, 41(Database issue):D605-612.
28. Schomburg I, Chang A, Placzek S, Sohngen C, Rother M, Lang M, Munaretto C, Ulas S, Stelzer M, Grote A et al: BRENDA in 2013: integrated reactions, kinetic data, enzyme function data, improved disease classification: new options and contents in BRENDA. Nucleic acids research 2013, 41(D1):D764-D772.
29. Devoid S, Overbeek R, DeJongh M, Vonstein V, Best AA, Henry C: Automated genome annotation and metabolic model reconstruction in the SEED and Model SEED. Methods Mol Biol 2013, 985:17-45.
30. Avila-Campillo I, Drew K, Lin J, Reiss DJ, Bonneau R: BioNetBuilder: automatic integration of biological networks. Bioinformatics 2007, 23(3):392-393.
31. Pitkanen E, Akerlund A, Rantanen A, Jouhten P, Ukkonen E: ReMatch: a web-based tool to construct, store and share stoichiometric metabolic models with carbon maps for metabolic flux analysis. Journal of integrative bioinformatics 2008, 5(2).
32. Dale JM, Popescu L, Karp PD: Machine learning methods for metabolic pathway prediction. BMC bioinformatics 2010, 11.
33. Reyes R, Gamermann D, Montagud A, Fuente D, Triana J, Urchueguia JF, de Cordoba PF: Automation on the generation of genome-scale metabolic models. Journal of computational biology : a journal of computational molecular cell biology 2012, 19(12):1295-1306.
34. Agren R, Liu LM, Shoaie S, Vongsangnak W, Nookaew I, Nielsen J: The RAVEN Toolbox and Its Use for Generating a Genome-scale Metabolic Model for Penicillium chrysogenum. PLoS computational biology 2013, 9(3).
35. Feng X, Xu Y, Chen Y, Tang YJ: MicrobesFlux: a web platform for drafting metabolic models from the KEGG database. BMC Syst Biol 2012, 6:94.
36. Suthers PF, Dasika MS, Kumar VS, Denisov G, Glass JI, Maranas CD: A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput Biol 2009, 5(2):e1000285.
37. Mueller TJ, Berla BM, Pakrasi HB, Maranas CD: Rapid construction of metabolic models for a family of Cyanobacteria using a multiple source annotation workflow. BMC Syst Biol 2013, 7:142.
38. Schellenberger J, Lewis NE, Palsson BO: Elimination of thermodynamically infeasible loops in steady-state metabolic models. Biophys J 2011, 100(3):544-553.
39. Satish Kumar V, Dasika MS, Maranas CD: Optimization based automated curation of metabolic reconstructions. BMC Bioinformatics 2007, 8:212.
40. Zomorrodi AR, Maranas CD: Improving the iMM904 S. cerevisiae metabolic model using essentiality and synthetic lethality data. BMC Systems Biology 2010, 4.
41. Kumar VS, Maranas CD: GrowMatch: an automated method for reconciling in silico/in vivo growth predictions. PLoS Comput Biol 2009, 5(3):e1000308.

42. Soh KC, Miskovic L, Hatzimanikatis V: From network models to network responses: integration of thermodynamic and kinetic properties of yeast genome-scale metabolic networks. FEMS yeast research 2012, 12:129-143.
43. Soh KC, Hatzimanikatis V: Network thermodynamics in the post-genomic era. Current opinion in microbiology 2010, 13:350-357.
44. Jankowski M, Henry C: Group Contribution Method for Thermodynamic Analysis of Complex Metabolic Networks. Biophysical journal 2008.
45. Noor E, Bar-Even A, Flamholz A, Lubling Y, Davidi D, Milo R: An integrated open framework for thermodynamics of reactions that combines accuracy and coverage. Bioinformatics (Oxford, England) 2012, 28:2037-2044.
46. Hamilton JJ, Dwivedi V, Reed JL: Quantitative assessment of thermodynamic constraints on the solution space of genome-scale metabolic models. Biophysical journal 2013, 105:512-522.
47. Feist AM, Henry CS, Reed JL, Krummenacker M, Joyce AR, Karp PD, Broadbelt LJ, Hatzimanikatis V, Palsson BO: A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular Systems Biology 2007, 3.
48. Wiechert W: 13C metabolic flux analysis. Metabolic engineering 2001, 3(3):195-206.
49. Antoniewicz MR, Kelleher JK, Stephanopoulos G: Elementary metabolite units (EMU): a novel framework for modeling isotopic distributions. Metabolic engineering 2007, 9(1):68-86.
50. Weitzel M, Noh K, Dalman T, Niedenfuhr S, Stute B, Wiechert W: 13CFLUX2--high-performance software suite for (13)C-metabolic flux analysis. Bioinformatics 2013, 29(1):143-145.
51. Nargund S, Sriram G: Mathematical modeling of isotope labeling experiments for metabolic flux analysis. Methods Mol Biol 2014, 1083:109-131.
52. Blum T, Kohlbacher O: MetaRoute: fast search for relevant metabolic routes for interactive network navigation and visualization. Bioinformatics 2008, 24(18):2108-2109.
53. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34(Database issue):D354-357.
54. Ravikirthi P, Suthers PF, Maranas CD: Construction of an E. Coli genome-scale atom mapping model for MFA calculations. Biotechnol Bioeng 2011, 108(6):1372-1382.
55. Latendresse M, Malerich JP, Travers M, Karp PD: Accurate atom-mapping computation for biochemical reactions. Journal of chemical information and modeling 2012, 52(11):2970-2982.
56. Antoniewicz MR: (13)C metabolic flux analysis: optimal design of isotopic labeling experiments. Current opinion in biotechnology 2013, 24(6):1116-1121.
57. Leighty RW, Antoniewicz MR: COMPLETE-MFA: Complementary parallel labeling experiments technique for metabolic flux analysis. Metabolic engineering 2013, 20:49-55.

58. Crown SB, Antoniewicz MR: Publishing (13)C metabolic flux analysis studies: A review and future perspectives. Metabolic engineering 2013, 20:42-48.
59. Leighty RW, Antoniewicz MR: Parallel labeling experiments with [U-13C]glucose validate E. coli metabolic network model for 13C metabolic flux analysis. Metabolic engineering 2012, 14(5):533-541.
60. Suthers PF, Chang YJ, Maranas CD: Improved computational performance of MFA using elementary metabolite units and flux coupling. Metabolic engineering 2010, 12(2):123-128.
61. Pey J, Rubio A, Theodoropoulos C, Cascante M, Planes FJ: Integrating tracer-based metabolomics data and metabolic fluxes in a linear fashion via Elementary Carbon Modes. Metabolic engineering 2012, 14(4):344-353.
62. Hyduke DR, Lewis NE, Palsson BO: Analysis of omics data with genome-scale models of metabolism. Molecular bioSystems 2013, 9(2):167-174.
63. Schmidt BJ, Ebrahim A, Metz TO, Adkins JN, Palsson BO, Hyduke DR: GIM3E: condition-specific models of cellular metabolism developed from metabolomics and expression data. Bioinformatics 2013, 29(22):2900-2908.
64. Lee D, Smallbone K, Dunn WB, Murabito E, Winder CL, Kell DB, Mendes P, Swainston N: Improving metabolic flux predictions using absolute gene expression data. BMC Syst Biol 2012, 6:73.
65. Hoppe A: What mRNA Abundances Can Tell us about Metabolism. Metabolites 2012, 2:614-631.
66. Covert MW, Xiao N, Chen TJ, Karr JR: Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics 2008, 24(18):2044-2050.
67. Berestovsky N, Zhou W, Nagrath D, Nakhleh L: Modeling integrated cellular machinery using hybrid Petri-Boolean networks. PLoS computational biology 2013, 9(11):e1003306.
68. Wang YC, Chen BS: Integrated cellular network of transcription regulations and protein-protein interactions. BMC Syst Biol 2010, 4:20.
69. Fisher CP, Plant NJ, Moore JB, Kierzek AM: QSSPN: dynamic simulation of molecular interaction networks describing gene regulation, signalling and whole-cell metabolism in human cells. Bioinformatics 2013, 29(24):3181-3190.
70. O'Brien EJ, Lerman JA, Chang RL, Hyduke DR, Palsson BO: Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Mol Syst Biol 2013, 9:693.
71. Cotten C, Reed JL: Mechanistic analysis of multi-omics datasets to generate kinetic parameters for constraint-based metabolic models. BMC bioinformatics 2013, 14:32.
72. Ishii N, Nakahigashi K, Baba T, Robert M, Soga T, Kanai A, Hirasawa T, Naba M, Hirai K, Hoque A et al: Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science 2007, 316(5824):593-597.
73. Chassagnole C, Noisommit-Rizzi N, Schmid JW, Mauch K, Reuss M: Dynamic modeling of the central carbon metabolism of Escherichia coli. Biotechnol Bioeng 2002, 79(1):53-73.

74. Vital-Lopez FG, Wallqvist A, Reifman J: Bridging the gap between gene expression and metabolic phenotype via kinetic models. BMC Syst Biol 2013, 7:63.
75. Zomorrodi AR, Lafontaine Rivera JG, Liao JC, Maranas CD: Optimization-driven identification of genetic perturbations accelerates the convergence of model parameters in ensemble modeling of metabolic networks. Biotechnol J 2013, 8(9):1090-1104.
76. Jamshidi N, Palsson BØ: Mass action stoichiometric simulation models: incorporating kinetics and regulation into stoichiometric models. Biophysical journal 2010, 98:175-185.
77. Smallbone K, Mendes P: Large-scale metabolic models: from reconstruction to differential equations. Industrial Biotechnology 2013, 9(4):179-184.
78. Tamagnini P, Axelsson R, Lindberg P, Oxelfelt F, Wunschiers R, Lindblad P: Hydrogenases and hydrogen metabolism of cyanobacteria. Microbiol Mol Biol Rev 2002, 66(1):1-20, table of contents.
79. Schopf J: The Fossil Record: Tracing the Roots of the Cyanobacterial Lineage. In: The ecology of cyanobacteria. Edited by B. W, Dordrecht PM: Kluwer Academic Publishers; 2000: 13-35.
80. Moisander PH, Beinart RA, Hewson I, White AE, Johnson KS, Carlson CA, Montoya JP, Zehr JP: Unicellular cyanobacterial distributions broaden the oceanic N2 fixation domain. Science, 327(5972):1512-1514.
81. Bryant DA, Frigaard NU: Prokaryotic photosynthesis and phototrophy illuminated. Trends Microbiol 2006, 14(11):488-496.
82. Popa R, Weber PK, Pett-Ridge J, Finzi JA, Fallon SJ, Hutcheon ID, Nealson KH, Capone DG: Carbon and nitrogen fixation and metabolite exchange in and between individual cells of Anabaena oscillarioides. Isme Journal 2007, 1(4):354-360.
83. Ducat DC, Way JC, Silver PA: Engineering cyanobacteria to generate high-value products. Trends Biotechnol, 29(2):95-103.
84. Savage DF, Way J, Silver PA: Defossiling fuel: How synthetic biology can transform biofuel production. Acs Chemical Biology 2008, 3(1):13-16.
85. Dismukes GC, Carrieri D, Bennette N, Ananyev GM, Posewitz MC: Aquatic phototrophs: efficient alternatives to land-based crops for biofuels. Current Opinion in Biotechnology 2008, 19(3):235-240.
86. Welsh EA, Liberton M, Stockel J, Loh T, Elvitigala T, Wang C, Wollam A, Fulton RS, Clifton SW, Jacobs JM et al: The genome of Cyanothece 51142, a unicellular diazotrophic cyanobacterium important in the marine nitrogen cycle. Proc Natl Acad Sci U S A 2008, 105(39):15094-15099.
87. Zehr JP, Church MJ, Moisander PH: Diversity, distribution and biogeochemical significance of nitrogen-fixing microorganisms in anoxic and suboxic ocean environments. In: NATO Series book on past and present water column anoxia. Springer; 2005: 337-369.
88. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S et al: Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 1996, 3(3):109-136.
89. Knoop H, Zilliges Y, Lockau W, Steuer R: The Metabolic Network of Synechocystis sp. PCC 6803: Systemic Properties of Autotrophic Growth. Plant Physiology 2010, 154(1):410-422.
90. Lindberg P, Park S, Melis A: Engineering a platform for photosynthetic isoprene production in cyanobacteria, using Synechocystis as the model organism. Metabolic Engineering 2010, 12(1):70-79.
91. Wu GF, Shen ZY, Wu QY: Possibility to improve the cyanobacterial poly-beta-hydroxybutyrate biosynthesis level. Journal of Chemical Engineering of Japan 2001, 34(9):1187-1190.
92. Liu XY, Curtiss R: Nickel-inducible lysis system in Synechocystis sp PCC 6803. Proceedings of the National Academy of Sciences of the United States of America 2009, 106(51):21550-21554.
93. Navarro E, Montagud A, de Cordoba PF, Urchueguia JF: Metabolic flux analysis of the hydrogen production potential in Synechocystis sp PCC6803. International Journal of Hydrogen Energy 2009, 34(21):8828-8838.
94. McHugh K: Hydrogen production methods. Virginia: MPR Associates, Inc; 2005.
95. Turner J, Sverdrup G, Mann MK, Maness PC, Kroposki B, Ghirardi M, Evans RJ, Blake D: Renewable hydrogen production. International Journal of Energy Research 2008, 32(5):379-407.
96. Bandyopadhyay A, Stockel J, Min H, Sherman LA, Pakrasi HB: High rates of photobiological H2 production by a cyanobacterium under aerobic conditions. Nat Commun 2010, 1:139.
97. Min H, Sherman LA: Hydrogen production by the unicellular, diazotrophic cyanobacterium Cyanothece sp. strain ATCC 51142 under conditions of continuous light. Appl Environ Microbiol 2010, 76(13):4293-4301.
98. Schirmer A, Rude MA, Li XZ, Popova E, del Cardayre SB: Microbial Biosynthesis of Alkanes. Science 2010, 329(5991):559-562.
99. Wu B, Zhang BC, Feng XY, Rubens JR, Huang R, Hicks LM, Pakrasi HB, Tang YJJ: Alternative isoleucine synthesis pathway in cyanobacterial species. Microbiology-Sgm 2010, 156:596-602.
100. Reed JL, Patel TR, Chen KH, Joyce AR, Applebee MK, Herring CD, Bui OT, Knight EM, Fong SS, Palsson BO: Systems approach to refining genome annotation. Proc Natl Acad Sci U S A 2006, 103(46):17480-17484.
101. Puchalka J, Oberhardt MA, Godinho M, Bielecka A, Regenhardt D, Timmis KN, Papin JA, Martins dos Santos VA: Genome-scale reconstruction and analysis of the Pseudomonas putida KT2440 metabolic network facilitates applications in biotechnology. PLoS Comput Biol 2008, 4(10):e1000210.
102. Vu TT, Stolyar SM, Pinchuk GE, Hill EA, Kucek LA, Brown RN, Lipton MS, Osterman A, Fredrickson JK, Konopka AE et al: Genome-scale modeling of light-driven reductant partitioning and carbon fluxes in diazotrophic unicellular cyanobacterium Cyanothece sp. ATCC 51142. PLoS Comput Biol 2012, 8(4):e1002460.

103. Hong SJ, Lee CG: Evaluation of central metabolism based on a genomic database of Synechocystis PCC6803. Biotechnology and Bioprocess Engineering 2007, 12(2):165-173.
104. Shastri AA, Morgan JA: Flux balance analysis of photoautotrophic metabolism. Biotechnology Progress 2005, 21(6):1617-1626.
105. Yang C, Hua Q, Shimizu K: Metabolic flux analysis in Synechocystis using isotope distribution from 13C-labeled glucose. Metabolic engineering 2002, 4(3):202-216.
106. Fu PC: Genome-scale modeling of Synechocystis sp PCC 6803 and prediction of pathway insertion. Journal of Chemical Technology and Biotechnology 2009, 84(4):473-483.
107. Montagud A, Navarro E, de Cordoba PF, Urchueguia JF, Patil KR: Reconstruction and analysis of genome-scale metabolic model of a photosynthetic bacterium. Bmc Systems Biology 2010, 4:-.
108. Montagud A, Zelezniak A, Navarro E, de Cordoba P, Urchueguia JF, Patil KR: Flux coupling and transcriptional regulation within the metabolic network of the photosynthetic bacterium Synechocystis sp PCC6803. Biotechnology Journal 2011, 6(3):330-342.
109. Nogales J, Gudmundsson S, Knight EM, Palsson BO, Thiele I: Detailing the optimality of photosynthesis in cyanobacteria through systems biology analysis. Proc Natl Acad Sci U S A 2012, 109(7):2678-2683.
110. Zhang SY, Bryant DA: The Tricarboxylic Acid Cycle in Cyanobacteria. Science 2011, 334(6062):1551-1553.
111. Nakamura Y, Kaneko T, Miyajima N, Tabata S: Extension of CyanoBase. CyanoMutants: repository of mutant information on Synechocystis sp. strain PCC6803. Nucleic Acids Res 1999, 27(1):66-68.
112. Young JD, Shastri AA, Stephanopoulos G, Morgan JA: Mapping photoautotrophic metabolism with isotopically nonstationary (13)C flux analysis. Metabolic Engineering 2011, 13(6):656-665.
113. Stockel J, Jacobs JM, Elvitigala TR, Liberton M, Welsh EA, Polpitiya AD, Gritsenko MA, Nicora CD, Koppenaal DW, Smith RD et al: Diurnal rhythms result in significant changes in the cellular protein complement in the cyanobacterium Cyanothece 51142. PLoS One, 6(2):e16680.
114. Allen MM: Simple Conditions for Growth of Unicellular Blue-Green Algae on Plates. Journal of Phycology 1968, 4(1):1-&.
115. Reddy KJ, Haskell JB, Sherman DM, Sherman LA: Unicellular, Aerobic Nitrogen-Fixing Cyanobacteria of the Genus Cyanothece. Journal of Bacteriology 1993, 175(5):1284-1292.
116. Porra RJ, Thompson WA, Kriedemann PE: Determination of accurate extinction coefficients and simultaneous equations for assaying chlorophylls a and b extracted with four different solvents: verification of the concentration of chlorophyll standards by atomic absorption spectroscopy. Biochim Biophys Acta 1989, 975:384-394.
117. Lichtenthaler HK: Chlorophylls and Carotenoids - Pigments of Photosynthetic Biomembranes. Methods in Enzymology 1987, 148:350-382.

118. Steiger S, Schafer L, Sandmann G: High-light-dependent upregulation of carotenoids and their antioxidative properties in the cyanobacterium Synechocystis PCC 6803. Journal of Photochemistry and Photobiology B-Biology 1999, 52(1-3):14-18.
119. Arnon DI, Mcswain BD, Tsujimoto HY, Wada K: Photochemical Activity and Components of Membrane Preparations from Blue-Green-Algae .1. Coexistence of 2 Photosystems in Relation to Chlorophyll Alpha and Removal of Phycocyanin. Biochimica Et Biophysica Acta 1974, 357(2):231-245.
120. Stoeckel J, Welsh EA, Liberton M, Kunnvakkam R, Aurora R, Pakrasi HB: Global transcriptomic analysis of Cyanothece 51142 reveals robust diurnal oscillation of central metabolic processes. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(16):6156-6161.
121. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcripts by RNA-Seq. Nature Methods 2008, 5(7):621-628.
122. Varma A, Palsson BO: Metabolic Flux Balancing - Basic Concepts, Scientific and Practical Use. Bio-Technology 1994, 12(10):994-998.
123. Kumar VS, Ferry JG, Maranas CD: Metabolic reconstruction of the archaeon methanogen Methanosarcina Acetivorans. Bmc Systems Biology 2011, 5.
124. Kucho K, Okamoto K, Tsuchiya Y, Nomura S, Nango M, Kanehisa M, Ishiura M: Global analysis of circadian expression in the cyanobacterium Synechocystis sp. strain PCC 6803. J Bacteriol 2005, 187(6):2190-2199.
125. Tredici MR, Margheri MC, Philippis RD, Materass R, Bocci F, Tomaselli: Conversion of solar energy into the energy of biomass by culture of marine cyanobacteria. Proceedings of the 1986 International Congress on Renewable Energy Sources 1986, 1:191-199.
126. Reddy KJ, Haskell JB, Sherman DM, Sherman LA: Unicellular, aerobic nitrogen-fixing cyanobacteria of the genus Cyanothece. J Bacteriol 1993, 175(5):1284-1292.
127. Bentley FK, Melis A: Diffusion-based process for carbon dioxide uptake and isoprene emission in gaseous/aqueous two-phase photobioreactors by photosynthetic microorganisms. Biotechnol Bioeng 2012, 109(1):100-109.
128. Nakao M, Okamoto S, Kohara M, Fujishiro T, Fujisawa T, Sato S, Tabata S, Kaneko T, Nakamura Y: CyanoBase: the cyanobacteria genome database update 2010. Nucleic Acids Res 2010, 38(Database issue):D379-381.
129. Minamizaki K, Mizoguchi T, Goto T, Tamiaki H, Fujita Y: Identification of two homologous genes, chlAI and chlAII, that are differentially involved in isocyclic ring formation of chlorophyll a in the cyanobacterium Synechocystis sp. PCC 6803. The Journal of biological chemistry 2008, 283(5):2684-2692.
130. Jansson C, Debus RJ, Osiewacz HD, Gurevitz M, McIntosh L: Construction of an Obligate Photoheterotrophic Mutant of the Cyanobacterium Synechocystis 6803 : Inactivation of the psbA Gene Family. Plant Physiol 1987, 85(4):1021-1025.

131. Chitnis PR, Reilly PA, Nelson N: Insertional Inactivation of the Gene Encoding Subunit-Ii of Photosystem-I from the Cyanobacterium Synechocystis Sp Pcc-6803. Journal of Biological Chemistry 1989, 264(31):18381-18385.
132. Nakamoto H: Targeted inactivation of the gene psaI encoding a subunit of photosystem I of the cyanobacterium Synechocystis sp PCC 6803. Plant Cell Physiol 1995, 36(8):1579-1587.
133. Burnap RL, Sherman LA: Deletion Mutagenesis in Synechocystis Sp Pcc6803 Indicates That the Mn-Stabilizing Protein of Photosystem-Ii Is Not Essential for O2 Evolution. Biochemistry-Us 1991, 30(2):440-446.
134. Shen JR, Ikeuchi M, Inoue Y: Analysis of the psbU gene encoding the 12-kDa extrinsic protein of photosystem II and studies on its role by deletion mutagenesis in Synechocystis sp. PCC 6803. Journal of Biological Chemistry 1997, 272(28):17821-17826.
135. Papadopoulos JS, Agarwala R: COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 2007, 23(9):1073-1079.
136. Chitnis PR, Reilly PA, Miedel MC, Nelson N: Structure and Targeted Mutagenesis of the Gene Encoding 8-Kda Subunit of Photosystem-I from the Cyanobacterium Synechocystis Sp Pcc-6803. Journal of Biological Chemistry 1989, 264(31):18374-18380.
137. Ughy B, Ajlani G: Phycobilisome rod mutants in Synechocystis sp strain PCC6803. Microbiology-Sgm 2004, 150:4147-4156.
138. Delorimier R, Bryant DA, Stevens SE: Genetic-Analysis of a 9 Kda Phycocyanin-Associated Linker Polypeptide. Biochimica Et Biophysica Acta 1990, 1019(1):29-41.
139. Jallet D, Gwizdala M, Kirilovsky D: ApcD, ApcF and ApcE are not required for the Orange Carotenoid Protein related phycobilisome fluorescence quenching in the cyanobacterium Synechocystis PCC 6803. Biochim Biophys Acta 2012, 1817(8):1418-1427.
140. Shen JR, Vermaas W, Inoue Y: The Role of Cytochrome C-550 as Studied through Reverse Genetics and Mutant Characterization in Synechocystis Sp Pcc-6803. Journal of Biological Chemistry 1995, 270(12):6901-6907.
141. Shen JR, Qian M, Inoue Y, Burnap RL: Functional characterization of Synechocystis sp. PCC 6803 Delta psbU and Delta psbV mutants reveals important roles of cytochrome c-550 in cyanobacterial oxygen evolution. Biochemistry-Us 1998, 37(6):1551-1558.
142. Manna P, Vermaas W: Lumenal proteins involved in respiratory electron transport in the cyanobacterium Synechocystis sp. PCC6803. Plant Molecular Biology 1997, 35(4):407-416.
143. Shen GZ, Boussiba S, Vermaas WFJ: Synechocystis Sp Pcc-6803 Strains Lacking Photosystem-I and Phycobilisome Function. Plant Cell 1993, 5(12):1853-1863.
144. Mo ML, Palsson BO, Herrgard MJ: Connecting extracellular metabolomic measurements to intracellular flux states in yeast. BMC Syst Biol 2009, 3:37.

145. Zahalak M, Pratte B, Werth KJ, Thiel T: Molybdate transport and its effect on nitrogen utilization in the cyanobacterium Anabaena variabilis ATCC 29413. Mol Microbiol 2004, 51(2):539-549.
146. Fernandez-Gonzalez B, Sandmann G, Vioque A: A new type of asymmetrically acting beta-carotene ketolase is required for the synthesis of echinenone in the cyanobacterium Synechocystis sp. PCC 6803. J Biol Chem 1997, 272(15):9728-9733.
147. Tottey S, Rich PR, Rondet SA, Robinson NJ: Two Menkes-type atpases supply copper for photosynthesis in Synechocystis PCC 6803. J Biol Chem 2001, 276(23):19999-20004.
148. Tottey S, Rondet SA, Borrelly GP, Robinson PJ, Rich PR, Robinson NJ: A copper metallochaperone for photosynthesis and respiration reveals metal-specific targets, interaction with an importer, and alternative sites for copper acquisition. J Biol Chem 2002, 277(7):5490-5497.
149. Cheng Z, Sattler S, Maeda H, Sakuragi Y, Bryant DA, DellaPenna D: Highly divergent methyltransferases catalyze a conserved reaction in tocopherol and plastoquinone synthesis in cyanobacteria and photosynthetic eukaryotes. Plant Cell 2003, 15(10):2343-2356.
150. Sakuragi Y, Zybailov B, Shen G, Jones AD, Chitnis PR, van der Est A, Bittl R, Zech S, Stehlik D, Golbeck JH et al: Insertional inactivation of the menG gene, encoding 2-phytyl-1,4-naphthoquinone methyltransferase of Synechocystis sp. PCC 6803, results in the incorporation of 2-phytyl-1,4-naphthoquinone into the A(1) site and alteration of the equilibrium constant between A(1) and F(X) in photosystem I. Biochemistry-Us 2002, 41(1):394-405.
151. Dahnhardt D, Falk J, Appel J, van der Kooij TA, Schulz-Friedrich R, Krupinska K: The hydroxyphenylpyruvate dioxygenase from Synechocystis sp. PCC 6803 is not required for plastoquinone biosynthesis. FEBS letters 2002, 523(1-3):177-181.
152. Ogawa T, Marco E, Orus MI: A gene (ccmA) required for carboxysome formation in the cyanobacterium Synechocystis sp. strain PCC6803. J Bacteriol 1994, 176(8):2374-2378.
153. Taiz L, Zeiger E: Plant Physiology, Third edn. Massachusetts: Sinauer Associates, Inc., Publishers; 2002.
154. Yeates TO, Kerfeld CA, Heinhorst S, Cannon GC, Shively JM: Protein-based organelles in bacteria: carboxysomes and related microcompartments. Nat Rev Microbiol 2008, 6(9):681-691.
155. Badger MR, Price GD: CO2 concentrating mechanisms in cyanobacteria: molecular components, their diversity and evolution. Journal of experimental botany 2003, 54(383):609-622.
156. Paerl HW: Cyanobacterial Carotenoids - Their Roles in Maintaining Optimal Photosynthetic Production among Aquatic Bloom Forming Genera. Oecologia 1984, 61(2):143-149.
157. Glazer AN: Structure and molecular organization of the photosynthetic accessory pigments of cyanobacteria and red algae. Molecular and cellular biochemistry 1977, 18(2-3):125-140.

158. Poutanen EL, Nikkila K: Carotenoid pigments as tracers of cyanobacterial blooms in recent and postglacial sediments of the Baltic Sea. Ambio 2001, 30(4-5):179-183.
159. Collins MD, Jones D: Distribution of isoprenoid quinone structural types in bacteria and their taxonomic implication. Microbiological reviews 1981, 45(2):316-354.
160. Stockel J, Jacobs JM, Elvitigala TR, Liberton M, Welsh EA, Polpitiya AD, Gritsenko MA, Nicora CD, Koppenaal DW, Smith RD et al: Diurnal rhythms result in significant changes in the cellular protein complement in the cyanobacterium Cyanothece 51142. PLoS One 2011, 6(2):e16680.
161. Allahverdiyeva Y, Ermakova M, Eisenhut M, Zhang P, Richaud P, Hagemann M, Cournac L, Aro EM: Interplay between flavodiiron proteins and photorespiration in Synechocystis sp. PCC 6803. The Journal of biological chemistry 2011, 286(27):24007-24014.
162. Bandyopadhyay A, Elvitigala T, Welsh E, Stockel J, Liberton M, Min H, Sherman LA, Pakrasi HB: Novel metabolic attributes of the genus cyanothece, comprising a group of unicellular nitrogen-fixing Cyanothece. Mbio 2011, 2(5).
163. Quintero MJ, Muro-Pastor AM, Herrero A, Flores E: Arginine catabolism in the cyanobacterium Synechocystis sp. Strain PCC 6803 involves the urea cycle and arginase pathway. J Bacteriol 2000, 182(4):1008-1015.
164. Solomon CM, Collier JL, Berg GM, Glibert PM: Role of urea in microbial metabolism in aquatic systems: a biochemical and molecular review. Aquatic Microbial Ecology 2010, 59(1):67-88.
165. Tripp HJ, Bench SR, Turk KA, Foster RA, Desany BA, Niazi F, Affourtit JP, Zehr JP: Metabolic streamlining in an open-ocean nitrogen-fixing cyanobacterium. Nature 2010, 464(7285):90-94.
166. Connor MR, Atsumi S: Synthetic Biology Guides Biofuel Production. J Biomed Biotechnol 2010.
167. Antal TK, Lindblad P: Production of H2 by sulphur-deprived cells of the unicellular cyanobacteria Gloeocapsa alpicola and Synechocystis sp. PCC 6803 during dark incubation with methane or at various extracellular pH. J Appl Microbiol 2005, 98(1):114-120.
168. Muro-Pastor MI, Reyes JC, Florencio FJ: The NADP+-isocitrate dehydrogenase gene (icd) is nitrogen regulated in cyanobacteria. J Bacteriol 1996, 178(14):4070-4076.
169. Jensen PA, Lutz KA, Papin JA: TIGER: Toolbox for integrating genome-scale metabolic models, expression data, and transcriptional regulatory networks. Bmc Systems Biology 2011, 5.
170. Colijn C, Brandes A, Zucker J, Lun DS, Weiner B, Farhat MR, Cheng TY, Moody DB, Murray M, Galagan JE: Interpreting Expression Data with Metabolic Flux Models: Predicting Mycobacterium tuberculosis Mycolic Acid Production. Plos Computational Biology 2009, 5(8).
171. Mahadevan R, Edwards JS, Doyle FJ: Dynamic Flux Analysis of diauxic growth in Escherichia coli. Biophys J 2003, 83:1331-1340.

172. Kim J, Reed JL: OptORF: Optimal metabolic and regulatory perturbations for metabolic engineering of microbial strains. Bmc Systems Biology 2010, 4.
173. Ranganathan S, Suthers PF, Maranas CD: OptForce: An Optimization Procedure for Identifying All Genetic Manipulations Leading to Targeted Overproductions. Plos Computational Biology 2010, 6(4).
174. Gao Q, Wang W, Zhao H, Lu X: Effects of fatty acid activation on photosynthetic production of fatty acid-based biofuels in Synechocystis sp. PCC6803. Biotechnology for biofuels 2012, 5(1):17.
175. Tan X, Yao L, Gao Q, Wang W, Qi F, Lu X: Photosynthesis driven conversion of carbon dioxide to fatty and hydrocarbons in cyanobacteria. Metab Eng 2011, 13(2):169-176.
176. Gronenberg LS, Marcheschi RJ, Liao JC: Next generation biofuel engineering in prokaryotes. Curr Opin Chem Biol 2013.
177. Heidorn T, Camsund D, Huang HH, Lindberg P, Oliveira P, Stensjo K, Lindblad P: Synthetic biology in cyanobacteria engineering and analyzing novel functions. Methods in enzymology 2011, 497:539-579.
178. Chen Y, Holtman CK, Taton A, Golden SS: Functional Analysis of the Synechococcus elongatus PCC 7942 Genome. In: Functional Genomics and Evolution of Photosynthetic Systems. Edited by Burnap R, Vermaas W, vol. 33: Springer; 2012: 119-137.
179. Xu Y, Alvey RM, Byrne PO, Graham JE, Shen G, Bryant DA: Expression of genes in cyanobacteria: adaptation of endogenous plasmids as platforms for high-level gene expression in Synechococcus sp. PCC 7002. Methods in molecular biology (Clifton, NJ) 2011, 684:273-293.
180. Zhang Y, Pu H, Wang Q, Cheng S, Zhao W, Zhang Y, Zhao J: PII is important in regulation of nitrogen metabolism but not required for heterocyst formation in the Cyanobacterium Anabaena sp. PCC 7120. The Journal of biological chemistry 2007, 282(46):33641-33648.
181. Taton A, Lis E, Adin DM, Dong G, Cookson S, Kay SA, Golden SS, Golden JW: Gene transfer in Leptolyngbya sp. strain BL0902, a cyanobacterium suitable for production of biomass and bioproducts. PloS one 2012, 7(1):e30901.
182. Liu X, Sheng J, Curtiss R, 3rd: Fatty acid production in genetically modified cyanobacteria. Proceedings of the National Academy of Sciences of the United States of America 2011, 108(17):6899-6904.
183. Wang B, Pugh S, Nielsen DR, Zhang W, Meldrum DR: Engineering cyanobacteria for photosynthetic production of 3-hydroxybutyrate directly from CO. Metabolic engineering 2013, 16C:68-77.
184. Lagarde D, Beuf L, Vermaas W: Increased Production of Zeaxanthin and Other Pigments by Application of Genetic Engineering Techniques to Synechocystis sp. Strain PCC 6803. Applied and Environmental Microbiology 2000, 66(1):64-72.
185. Cheah YE, Albers SC, Peebles CA: A novel counter-selection method for markerless genetic modification in Synechocystis sp. PCC 6803. Biotechnology progress 2013, 29(1):23-30.

186. Takahama K, Matsuoka M, Nagahama K, Ogawa T: High-Frequency Gene Replacement in Cyanobacteria Using a Heterologous rps12 Gene. Plant Cell Physiology 2004, 45(3):333-339.
187. Tan X, Liang F, Cai K, Lu X: Application of the FLP/FRT recombination system in cyanobacteria for construction of markerless mutants. Applied microbiology and biotechnology 2013.
188. Tyo KE, Jin YS, Espinoza FA, Stephanopoulos G: Identification of gene disruptions for increased poly-3-hydroxybutyrate accumulation in Synechocystis PCC 6803. Biotechnology progress 2009, 25(5):1236-1243.
189. Holtman C, Chen Y, Sandoval P, Gonzales A, Nalty M, Thomas T, Youderian P, Golden S: High-Throughput Functional Analysis of the Synechococcus elongatus PCC 7942 Genome. DNA Research 2005, 12:103-115.
190. Huang HH, Camsund D, Lindblad P, Heidorn T: Design and characterization of molecular tools for a Synthetic Biology approach towards developing cyanobacterial biotechnology. Nucleic acids research 2010, 38(8):2577-2593.
191. Landry B, Stockel J, Pakrasi H: Use of Degradation Tags to Control Protein Levels in the Cyanobacterium Synechocystis sp. Strain PCC 6803. Applied and Environmental Microbiology 2012, 70(8):2833-2835.
192. Huang HH, Lindblad P: Wide-dynamic-range promoters engineered for cyanobacteria. Journal of biological engineering 2013, 7(1):10.
193. Li MZ, Elledge SJ: Harnessing homologous recombination in vitro to generate recombinant DNA via SLIC. Nature methods 2007, 4(3):251-256.
194. Gibson DG, Young L, Chuang RY, Venter JC, Hutchison CA, 3rd, Smith HO: Enzymatic assembly of DNA molecules up to several hundred kilobases. Nature methods 2009, 6(5):343-345.
195. Quan J, Tian J: Circular polymerase extension cloning of complex gene libraries and pathways. PloS one 2009, 4(7):e6441.
196. Szewczyk E, Nayak T, Oakley CE, Edgerton H, Xiong Y, Taheri-Talesh N, Osmani SA, Oakley BR: Fusion PCR and gene targeting in Aspergillus nidulans. Nature protocols 2007, 1(6):3111-3120.
197. Engler C, Marillonnet S: Generation of families of construct variants using golden gate shuffling. Methods in molecular biology (Clifton, NJ) 2011, 729:167-181.
198. Hilson N, Rosengarten R, Keasling J: j5 DNA Assembly Design Automation Software. ACS Synthetic Biology 2012, 1(1):14-21.
199. Nagarajan A, Winter R, Eaton-Rye J, Burnap R: A synthetic DNA and fusion PCR approach to the ectopic expression of high levels of the D1 protein of photosystem II in Synechocystis sp. PCC 6803. Journal of photochemistry and photobiology B, Biology 2011, 104(1-2):212-219.
200. Shao Z, Zhao H: DNA assembler, an in vivo genetic method for rapid construction of biochemical pathways. Nucleic acids research 2009, 37(2):e16.
201. Jones KL, Kim SW, Keasling JD: Low-copy plasmids can perform as well as or better than high-copy plasmids for metabolic engineering of bacteria. Metabolic engineering 2000, 2(4):328-338.

202. Dunlop MJ, Dossani ZY, Szmidt HL, Chu HC, Lee TS, Keasling JD, Hadi MZ, Mukhopadhyay A: Engineering microbial biofuel tolerance and export using efflux pumps. Molecular systems biology 2011, 7:487. 203. Ng WO, Zentella R, Wang Y, Taylor JS, Pakrasi HB: PhrA, the major photoreactivating factor in the cyanobacterium Synechocystis sp. strain PCC 6803 codes for a cyclobutane-pyrimidine-dimer-specific DNA photolyase. Archives of microbiology 2000, 173(5-6):412-417. 204. Berla BM, Pakrasi HB: Upregulation of plasmid genes during stationary phase in Synechocystis sp. strain PCC 6803, a cyanobacterium. Appl Environ Microbiol 2012, 78(15):5448-5451. 205. Wang B, Wang J, Zhang W, Meldrum DR: Application of synthetic biology in cyanobacteria and algae. Frontiers in microbiology 2012, 3:344. 206. Taniuchi Y, Yoshikawa S, Maeda S, Omata T, Ohki K: Diazotrophy under continuous light in a marine unicellular diazotrophic cyanobacterium, Gloeothece sp. 68DGA. Microbiology 2008, 154(Pt 7):1859-1865. 207. Latysheva N, Junker VL, Palmer WJ, Codd GA, Barker D: The evolution of nitrogen fixation in cyanobacteria. Bioinformatics 2012, 28(5):603-606. 208. Pfreundt U, Stal LJ, Voss B, Hess WR: Dinitrogen fixation in a unicellular chlorophyll d-containing cyanobacterium. The ISME journal 2012, 6(7):1367-1377. 209. Stockel J, Welsh EA, Liberton M, Kunnvakkam R, Aurora R, Pakrasi HB: Global transcriptomic analysis of Cyanothece 51142 reveals robust diurnal oscillation of central metabolic processes. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(16):6156-6161. 210. Zhang F, Carothers JM, Keasling JD: Design of a dynamic sensor-regulator system for production of chemicals and fuels derived from fatty acids. Nature biotechnology 2012, 30(4):354-359. 211. Akiyama S: Structural and dynamic aspects of protein clocks: how can they be so slow and stable? Cellular and molecular life sciences : CMLS 2012, 69(13):2147-2160. 212.
Nakajima M, Imai K, Ito H, Nishiwaki T, Murayama Y, Iwasaki H, Oyama T, Kondo T: Reconstitution of circadian oscillation of cyanobacterial KaiC phosphorylation in vitro. Science 2005, 308(5720):414-415. 213. Xu Y, Ma P, Shah P, Rokas A, Liu Y, Johnson CH: Non-optimal codon usage is a mechanism to achieve circadian clock conditionality. Nature 2013, 495(7439):116-120. 214. Teng SW, Mukherji S, Moffitt JR, de Buyl S, O'Shea EK: Robust circadian oscillations in growing cyanobacteria require transcriptional feedback. Science 2013, 340(6133):737-740. 215. Woelfle MA, Ouyang Y, Phanvijhitsiri K, Johnson CH: The adaptive value of circadian clocks: an experimental assessment in cyanobacteria. Current biology : CB 2004, 14(16):1481-1486. 216. Melis A: Carbon partitioning in photosynthesis. Curr Opin Chem Biol 2013. 217. Atsumi S, Higashide W, Liao JC: Direct photosynthetic recycling of carbon dioxide to isobutyraldehyde. Nature biotechnology 2009, 27(12):1177-1180.

218. Selinger DW, Cheung KJ, Mei R, Johansson EM, Richmond CS, Blattner FR, Lockhart DJ, Church GM: RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature biotechnology 2000, 18(12):1262-1268. 219. Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, Chabas S, Reiche K, Hackermuller J, Reinhardt R et al: The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 2010, 464(7286):250-255. 220. Mitschke J, Georg J, Scholz I, Sharma CM, Dienst D, Bantscheff J, Voss B, Steglich C, Wilde A, Vogel J et al: An experimentally anchored map of transcriptional start sites in the model cyanobacterium Synechocystis sp. PCC6803. Proceedings of the National Academy of Sciences of the United States of America 2011, 108(5):2124-2129. 221. Gierga G, Voss B, Hess WR: The Yfr2 ncRNA family, a group of abundant RNA molecules widely conserved in cyanobacteria. RNA biology 2009, 6(3):222-227. 222. Duhring U, Axmann IM, Hess WR, Wilde A: An internal antisense RNA regulates expression of the photosynthesis gene isiA. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(18):7054-7058. 223. Ashby MK, Houmard J: Cyanobacterial two-component proteins: structure, diversity, distribution, and evolution. Microbiol Mol Biol Rev 2006, 70(2):472-509. 224. Montgomery BL: Sensing the light: photoreceptive systems and signal transduction in cyanobacteria. Mol Microbiol 2007, 64(1):16-27. 225. Waters CM, Bassler BL: The Vibrio harveyi quorum-sensing system uses shared regulatory components to discriminate between multiple autoinducers. Genes Dev 2006, 20(19):2754-2767. 226. Moon TS, Lou C, Tamsir A, Stanton BC, Voigt CA: Genetic programs constructed from layered logic gates in single cells. Nature 2012, 491(7423):249-253. 227. Erbe JL, Adams AC, Taylor KB, Hall LM: Cyanobacteria carrying an smt-lux transcriptional fusion as biosensors for the detection of heavy metal cations.
Journal of industrial microbiology 1996, 17(2):80-83. 228. Boyanapalli R, Bullerjahn GS, Pohl C, Croot PL, Boyd PW, McKay RM: Luminescent whole-cell cyanobacterial bioreporter for measuring Fe availability in diverse marine environments. Appl Environ Microbiol 2007, 73(3):1019-1024. 229. Blasi B, Peca L, Vass I, Kos PB: Characterization of stress responses of heavy metal and metalloid inducible promoters in Synechocystis PCC6803. Journal of microbiology and biotechnology 2012, 22(2):166-169. 230. Peca L, Kos PB, Mate Z, Farsang A, Vass I: Construction of bioluminescent cyanobacterial reporter strains for detection of nickel, cobalt and zinc. FEMS microbiology letters 2008, 289(2):258-264.

231. Peca L, Kos PB, Vass I: Characterization of the activity of heavy metal-responsive promoters in the cyanobacterium Synechocystis PCC 6803. Acta biologica Hungarica 2007, 58 Suppl:11-22. 232. Michel KP, Pistorius EK, Golden SS: Unusual regulatory elements for iron deficiency induction of the idiA gene of Synechococcus elongatus PCC 7942. Journal of bacteriology 2001, 183(17):5015-5024. 233. Guerrero F, Carbonell V, Cossu M, Correddu D, Jones PR: Ethylene synthesis and regulated expression of recombinant protein in Synechocystis sp. PCC 6803. PloS one 2012, 7(11):e50470. 234. Kunert A, Vinnemeier J, Erdmann N, Hagemann M: Repression by Fur is not the main mechanism controlling the iron-inducible isiAB operon in the cyanobacterium Synechocystis sp. PCC 6803. FEMS microbiology letters 2003, 227(2):255-262. 235. Imamura S, Asayama M: Sigma factors for cyanobacterial transcription. Gene regulation and systems biology 2009, 3:65-87. 236. Hansen LH, Knudsen S, Sorensen SJ: The effect of the lacY gene on the induction of IPTG inducible promoters, studied in Escherichia coli and Pseudomonas fluorescens. Current microbiology 1998, 36(6):341-347. 237. Satya Lakshmi O, Rao NM: Evolving Lac repressor for enhanced inducibility. Protein engineering, design & selection : PEDS 2009, 22(2):53-58. 238. Mackey SR, Ditty JL, Clerico EM, Golden SS: Detection of rhythmic bioluminescence from luciferase reporters in cyanobacteria. Methods in molecular biology (Clifton, NJ) 2007, 362:115-129. 239. Ghim CM, Lee SK, Takayama S, Mitchell RJ: The art of reporter proteins in science: past, present and future applications. BMB reports 2010, 43(7):451-460. 240. Meighen EA: Bacterial bioluminescence: organization, regulation, and application of the lux genes. FASEB journal : official publication of the Federation of American Societies for Experimental Biology 1993, 7(11):1016-1022. 241.
Hansen MC, Palmer RJ, Jr., Udsen C, White DC, Molin S: Assessment of GFP fluorescence in cells of Streptococcus gordonii under conditions of low pH and low oxygen concentration. Microbiology 2001, 147(Pt 5):1383-1391. 242. Golden SS, Ishiura M, Johnson CH, Kondo T: Cyanobacterial Circadian Rhythms. Annual review of plant physiology and plant molecular biology 1997, 48:327-354. 243. Drepper T, Eggert T, Circolone F, Heck A, Krauss U, Guterl JK, Wendorff M, Losi A, Gartner W, Jaeger KE: Reporter proteins for in vivo fluorescence without oxygen. Nature biotechnology 2007, 25(4):443-445. 244. Mukherjee A, Weyant KB, Walker J, Schroeder CM: Directed evolution of bright mutants of an oxygen-independent flavin-binding fluorescent protein from Pseudomonas putida. Journal of biological engineering 2012, 6(1):20. 245. Simkovsky R, Daniels EF, Tang K, Huynh SC, Golden SS, Brahamsha B: Impairment of O-antigen production confers resistance to grazing in a model

amoeba-cyanobacterium predator-prey system. Proceedings of the National Academy of Sciences of the United States of America 2012, 109(41):16678-16683. 246. Schwarz D, Orf I, Kopka J, Hagemann M: Recent Applications of Metabolomics Toward Cyanobacteria. Metabolites 2013, 3(1):72-100. 247. Fell DA, Small JR: Fat synthesis in adipose tissue. An examination of stoichiometric constraints. Biochem J 1986, 238(3):781-786. 248. Savinell JM, Palsson BO: Network analysis of intermediary metabolism using linear optimization. I. Development of mathematical formalism. J Theor Biol 1992, 154(4):421-454. 249. Varma A, Boesch BW, Palsson BO: Stoichiometric Interpretation of Escherichia-Coli Glucose Catabolism under Various Oxygenation Rates. Applied and Environmental Microbiology 1993, 59(8):2465-2473. 250. Orth JD, Thiele I, Palsson BO: What is flux balance analysis? Nature biotechnology 2010, 28(3):245-248. 251. Varma A, Palsson BO: Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli W3110. Appl Environ Microbiol 1994, 60(10):3724-3731. 252. Heyes DJ, Hunter CN: Making light work of enzyme catalysis: protochlorophyllide oxidoreductase. Trends Biochem Sci 2005, 30(11):642-649. 253. Kopecna J, Sobotka R, Komenda J: Inhibition of chlorophyll biosynthesis at the protochlorophyllide reduction step results in the parallel depletion of Photosystem I and Photosystem II in the cyanobacterium Synechocystis PCC 6803. Planta 2013, 237(2):497-508. 254. Mahadevan R, Edwards JS, Doyle FJ: Dynamic flux balance analysis of diauxic growth in Escherichia coli. Biophys J 2002, 83(3):1331-1340. 255. Mahadevan R, Schilling CH: The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic engineering 2003, 5:264-276. 256. Kaczmarzyk D, Fulda M: Fatty acid activation in cyanobacteria mediated by acyl-acyl carrier protein synthetase enables fatty acid recycling.
Plant physiology 2010, 152(3):1598-1610. 257. von Berlepsch S, Kunz HH, Brodesser S, Fink P, Marin K, Flugge UI, Gierth M: The acyl-acyl carrier protein synthetase from Synechocystis sp. PCC 6803 mediates fatty acid import. Plant physiology 2012, 159(2):606-617. 258. Collins MD, Jones D: Distribution of Isoprenoid Quinone Structural Types in Bacteria and Their Taxonomic Implications. Microbiol Rev 1981, 45(2):316-354. 259. Sakuragi Y: Studies of Quinones in Cyanobacteria. The Pennsylvania State University; 2004. 260. Hamilton JJ, Reed JL: Identification of Functional Differences in Metabolic Networks Using Comparative Genomics and Constraint-Based Models. PloS one 2012, 7(4). 261. Bennetzen JL, Hake S, SpringerLink (Online service): Handbook of Maize Genetics and Genomics. In. New York, NY: Springer New York; 2009.

262. Sanchez OJ, Cardona CA: Trends in biotechnological production of fuel ethanol from different feedstocks. Bioresource Technology 2008, 99(13):5270-5295. 263. Farrell AE, Plevin RJ, Turner BT, Jones AD, O'Hare M, Kammen DM: Ethanol can contribute to energy and environmental goals. Science 2006, 311(5760):506-508. 264. Stewart CN, Jr.: Biofuels and biocontainment. Nat Biotechnol 2007, 25(3):283-284. 265. Mechin V, Argillier O, Rocher F, Hebert Y, Mila I, Pollet B, Barriere Y, Lapierre C: In search of a maize ideotype for cell wall enzymatic degradability using histological and biochemical lignin characterization. J Agric Food Chem 2005, 53(15):5872-5881. 266. Dennis C, Surridge C: A. thaliana genome. Nature 2000, 408(6814):791-791. 267. Yu J, Hu SN, Wang J, Wong GKS, Li SG, et al: A draft sequence of the rice genome (Oryza sativa L. ssp indica). Science 2002, 296(5565):79-92. 268. Goff SA, Ricke D, Lan TH, Presting G, Wang RL, et al: A draft sequence of the rice genome (Oryza sativa L. ssp japonica). Science 2002, 296(5565):92-100. 269. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, et al: The Sorghum bicolor genome and the diversification of grasses. Nature 2009, 457(7229):551-556. 270. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, et al: The B73 maize genome: complexity, diversity, and dynamics. Science 2009, 326(5956):1112-1115. 271. Xavier Argout JS, Jean Marc Aury, Gaetan Droc, Jerome Gouzy, et al: Deciphering the genome structure and paleohistory of Theobroma cacao. Nature Proceedings 2010. 272. Dal'Molin CGD, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK: AraGEM, a Genome-Scale Reconstruction of the Primary Metabolic Network in Arabidopsis. Plant Physiology 2010, 152(2):579-589. 273. Sweetlove LJ, Last RL, Fernie AR: Predictive metabolic engineering: A goal for systems biology. Plant Physiology 2003, 132(2):420-425. 274. Gutierrez RA, Shasha DE, Coruzzi GM: Systems biology for the virtual plant.
Plant Physiology 2005, 138(2):550-554. 275. Feist AM, Herrgard MJ, Thiele I, Reed JL, Palsson BO: Reconstruction of biochemical networks in microorganisms. Nature Reviews Microbiology 2009, 7(2):129-143. 276. Park JM, Kim TY, Lee SY: Constraints-based genome-scale metabolic simulation for systems metabolic engineering. Biotechnology Advances 2009, 27(6):979-988. 277. Milne CB, Kim PJ, Eddy JA, Price ND: Accomplishments in genome-scale in silico modeling for industrial and medical biotechnology. Biotechnol J 2009, 4(12):1653-1670. 278. Poolman MG, Miguet L, Sweetlove LJ, Fell DA: A Genome-Scale Metabolic Model of Arabidopsis and Some of Its Properties. Plant Physiology 2009, 151(3):1570-1581.

279. Grafahrend-Belau E, Schreiber F, Koschutzki D, Junker BH: Flux Balance Analysis of Barley Seeds: A Computational Approach to Study Systemic Properties of Central Metabolism. Plant Physiology 2009, 149(1):585-598. 280. Dal'Molin CGD, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK: C4GEM, a Genome-Scale Metabolic Model to Study C-4 Plant Metabolism. Plant Physiology 2010, 154(4):1871-1885. 281. Pilalis E, Chatziioannou A, Thomasset B, Kolisis F: An in silico compartmentalized metabolic model of Brassica napus enables the systemic study of regulatory aspects of plant central metabolism. Biotechnol Bioeng. 282. Bennett MD, Leitch IJ, Price HJ, Johnston JS: Comparisons with Caenorhabditis (approximately 100 Mb) and Drosophila (approximately 175 Mb) using flow cytometry show genome size in Arabidopsis to be approximately 157 Mb and thus approximately 25% larger than the Arabidopsis genome initiative estimate of approximately 125 Mb. Ann Bot 2003, 91(5):547-557. 283. Liang C, Mao L, Ware D, Stein L: Evidence-based gene predictions in plant genomes. Genome Res 2009, 19(10):1912-1923. 284. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res 2000, 10(4):516-522. 285. Notebaart RA, van Enckevort FH, Francke C, Siezen RJ, Teusink B: Accelerating the reconstruction of genome-scale metabolic networks. BMC Bioinformatics 2006, 7:296. 286. Penningd FW, Brunstin AH, Vanlaar HH: Products, Requirements and Efficiency of Biosynthesis - Quantitative Approach. Journal of Theoretical Biology 1974, 45(2):339-377. 287. Spector WS: Handbook of biological data. Philadelphia: Saunders; 1956. 288. Muller F, Dijkhuis, DJ, Heida, YS: On the relationship between chemical composition and digestibility in vivo of roughages. Agricultural Research Report 1970, 736:1-27. 289. Wedig C, Jaster, EH, Moore, KJ: Hemicellulose monosaccharide composition and in vitro disappearance of orchard grass and alfalfa hay.
Journal of Agricultural and Food Chemistry 1987, 35(2):23-27. 290. Sun Q, Zybailov B, Majeran W, Friso G, Olinares PDB, van Wijk KJ: PPDB, the Plant Proteomics Database at Cornell. Nucleic Acids Res 2009, 37:D969-D974. 291. Heazlewood JL, Verboom RE, Tonti-Filippini J, Small I, Millar AH: SUBA: the Arabidopsis Subcellular Database. Nucleic Acids Res 2007, 35(Database issue):D213-218. 292. Volk RJ, Jackson WA: Photorespiratory Phenomena in Maize - Oxygen-Uptake, Isotope Discrimination, and Carbon-Dioxide Efflux. Plant Physiology 1972, 49(2):218-&. 293. Dai ZY, Ku MSB, Edwards GE: C-4 Photosynthesis - the Effects of Leaf Development on the Co2-Concentrating Mechanism and Photorespiration in Maize. Plant Physiology 1995, 107(3):815-825.

294. Jolivettournier P, Gerster R: Incorporation of Oxygen into Glycolate, Glycine, and Serine during Photorespiration in Maize Leaves. Plant Physiology 1984, 74(1):108-111. 295. Kumar VS, Dasika MS, Maranas CD: Optimization based automated curation of metabolic reconstructions. BMC bioinformatics 2007, 8:212. 296. Wei Y, Lin M, Oliver DJ, Schnable PS: The roles of aldehyde dehydrogenases (ALDHs) in the PDH bypass of Arabidopsis. BMC Biochem 2009, 10:7. 297. Ouzounis CA, Karp PD: Global properties of the metabolic map of Escherichia coli. Genome Res 2000, 10(4):568-576. 298. Wise RR HJ: Synthesis, export and partitioning of end products of photosynthesis, vol. 23. Dordrecht, The Netherlands: Springer; 2007. 299. Dennis DT, Miernyk JA: Compartmentation of Non-Photosynthetic Carbohydrate-Metabolism. Annual Review of Plant Physiology and Plant Molecular Biology 1982, 33:27-50. 300. Allen JF: Photosynthesis of ATP - Electrons, proton pumps, rotors, and poise. Cell 2002, 110(3):273-276. 301. Hervas M, Navarro JA, De La Rosa MA: Electron transfer between membrane complexes and soluble proteins in photosynthesis. Accounts of Chemical Research 2003, 36(10):798-805. 302. Gregory R: Biochemistry of Photosynthesis. Chichester, NY, USA: John Wiley & Sons; 1989. 303. Tsaftaris AS, Bosabalidis AM, Scandalios JG: Cell-Type-Specific Gene-Expression and Acatalasemic Peroxisomes in a Null Cat2 Catalase Mutant of Maize. Proceedings of the National Academy of Sciences of the United States of America-Biological Sciences 1983, 80(14):4455-4459. 304. Hisano H, Nandakumar R, Wang ZY: Genetic modification of lignin biosynthesis for improved biofuel production. In Vitro Cellular & Developmental Biology-Plant 2009, 45(3):306-313. 305. Winkel-Shirley B: Flavonoid biosynthesis. A colorful model for genetics, biochemistry, cell biology, and biotechnology. Plant Physiology 2001, 126(2):485-493. 306.
Styles ED, Ceska O: Genetic-Control of 3-Hydroxy-Flavonoids and 3-Deoxy-Flavonoids in Zea-Mays. Phytochemistry 1975, 14(2):413-415. 307. Winkel-Shirley B: Flavonoid biosynthesis. A colorful model for genetics, biochemistry, cell biology, and biotechnology. Plant Physiol 2001, 126(2):485-493. 308. Weidemann C, Tenhaken R, Hohl U, Barz W: Medicarpin and Maackiain 3-O-Glucoside-6'-O-Malonate Conjugates Are Constitutive Compounds in Chickpea (Cicer-Arietinum L) Cell-Cultures. Plant Cell Reports 1991, 10(6-7):371-374. 309. Vanholme R, Morreel K, Ralph J, Boerjan W: Lignin engineering. Current Opinion in Plant Biology 2008, 11(3):278-285. 310. Sattler SE, Funnell-Harris DL, Pedersen JF: Brown midrib mutations and their importance to the utilization of maize, sorghum, and pearl millet lignocellulosic tissues. Plant Science 2010, 178(3):229-238.

311. Marita JM, Vermerris W, Ralph J, Hatfield RD: Variations in the cell wall composition of maize brown midrib mutants. Journal of Agricultural and Food Chemistry 2003, 51(5):1313-1321. 312. Kuc J, Nelson OE: Abnormal Lignins Produced by Brown-Midrib Mutants of Maize. I. Brown-Midrib-1 Mutant. Archives of Biochemistry and Biophysics 1964, 105(1):103-&. 313. Guillaumie S, Pichon M, Martinant JP, Bosio M, Goffner D, Barriere Y: Differential expression of phenylpropanoid and related genes in brown-midrib bm1, bm2, bm3, and bm4 young near-isogenic maize plants. Planta 2007, 226(1):235-250. 314. Sticklen MB: Expediting the biofuels agenda via genetic manipulations of cellulosic bioenergy crops. Biofuels Bioproducts & Biorefining-Biofpr 2009, 3(4):448-455. 315. Sticklen MB: Plant genetic engineering for biofuel production: towards affordable cellulosic ethanol. Nat Rev Genet 2008, 9(6):433-443. 316. Li X, Weng JK, Chapple C: Improvement of biomass through lignin modification. Plant Journal 2008, 54(4):569-581. 317. Vega-Sanchez ME, Ronald PC: Genetic and biotechnological approaches for biofuel crop improvement. Current Opinion in Biotechnology 2010, 21(2):218-224. 318. Grabber JH, Schatz PF, Kim H, Lu FC, Ralph J: Identifying new lignin bioengineering targets: 1. Monolignol-substitute impacts on lignin formation and cell wall fermentability. Bmc Plant Biology 2010, 10:-. 319. Abramson M, Shoseyov O, Shani Z: Plant cell wall reconstruction toward improved lignocellulosic production and processability. Plant Science 2010, 178(2):61-72. 320. Torney F, Moeller L, Scarpa A, Wang K: Genetic engineering approaches to improve bioethanol production from maize. Current Opinion in Biotechnology 2007, 18(3):193-199. 321. Smidansky ED, Martin JM, Hannah LC, Fischer AM, Giroux MJ: Seed yield and plant biomass increases in rice are conferred by deregulation of endosperm ADP-glucose pyrophosphorylase. Planta 2003, 216(4):656-664. 322.
Kim J, Reed JL: OptORF: Optimal metabolic and regulatory perturbations for metabolic engineering of microbial strains. BMC Syst Biol 2010, 4:53. 323. Thiele I, Palsson BO: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nature Protocols 2010, 5(1):93-121. 324. Feist AM, Palsson BO: The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nature Biotechnology 2008, 26(6):659-667. 325. Duarte NC, Becker SA, Jamshidi N, Thiele I, Mo ML, Vo TD, Srivas R, Palsson BO: Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(6):1777-1782.

326. Shlomi T, Cabili MN, Herrgard MJ, Palsson BO, Ruppin E: Network-based prediction of human tissue-specific metabolism. Nature Biotechnology 2008, 26(9):1003-1010. 327. Jerby L, Shlomi T, Ruppin E: Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism. Molecular Systems Biology 2010, 6:-. 328. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215(3):403-410. 329. Leveau V, Lorgeou J, Prioul J-L: Maize in the world economy: a challenge for scientific research - how to produce more cheaper! In: Advances in Maize. Edited by Prioul JLT, C.; Molnar, T., vol. 3. UK: Society for Experimental Biology; 2011. 330. International Grains Council: International Grains Council: Report for Fiscal Year 2011/12. In.: International Grains Council; 2013. 331. Schaeffer ML, Harper LC, Gardiner JM, Andorf CM, Campbell DA, Cannon EK, Sen TZ, Lawrence CJ: MaizeGDB: curation and outreach go hand-in-hand. Database : the journal of biological databases and curation 2011, 2011:bar022. 332. Monaco MK, Sen TZ, Dharmawardhana PD, Ren L, Schaeffer M, Naithani S, Amarasinghe V, Thomason J, Harper L, Gardiner J et al: Maize Metabolic Network Construction and Transcriptome Analysis. Plant Gen 2013, 6(1):-. 333. Schreiber F, Colmsee C, Czauderna T, Grafahrend-Belau E, Hartmann A, Junker A, Junker BH, Klapperstuck M, Scholz U, Weise S: MetaCrop 2.0: managing and exploring information about crop plant metabolism. Nucleic Acids Res 2012, 40(Database issue):D1173-1177. 334. Saha R, Suthers PF, Maranas CD: Zea mays iRS1563: A Comprehensive Genome-Scale Metabolic Reconstruction of Maize Metabolism. Plos One 2011, 6(7). 335. Martin A, Lee J, Kichey T, Gerentes D, Zivy M, Tatout C, Dubois F, Balliau T, Valot B, Davanture M et al: Two cytosolic glutamine synthetase isoforms of maize are specifically involved in the control of grain production. The Plant cell 2006, 18(11):3252-3274.
336. Kennedy RA: Photorespiration in c(3) and c(4) plant tissue cultures: significance of kranz anatomy to low photorespiration in c(4) plants. Plant Physiol 1976, 58(4):573-575. 337. Brown RH: A Difference in N Use Efficiency in C3 and C4 Plants and its Implications in Adaptation and Evolution. Crop Sci 1978, 18(1):93-98. 338. Zelitch I: Pathways of Carbon Fixation in Green Plants. Annual Review of Biochemistry 1975, 44(1):123-145. 339. Vitousek PM, Aber JD, Howarth RW, Likens GE, Matson PA, Schindler DW, Schlesinger WH, Tilman DG: HUMAN ALTERATION OF THE GLOBAL NITROGEN CYCLE: SOURCES AND CONSEQUENCES. Ecological Applications 1997, 7(3):737-750. 340. Hirel B, Le Gouis J, Ney B, Gallais A: The challenge of improving nitrogen use efficiency in crop plants: towards a more central role for genetic variability

and quantitative genetics within integrated approaches. J Exp Bot 2007, 58(9):2369-2387. 341. Hirel B, Gallais A: Nitrogen use efficiency – Physiological, molecular and genetic investigations towards crop improvement. In: Advances in Maize. vol. 3. UK: Society for Experimental Biology; 2011: 285-310. 342. Miflin BJ, Habash DZ: The role of glutamine synthetase and glutamate dehydrogenase in nitrogen assimilation and possibilities for improvement in the nitrogen utilization of crops. J Exp Bot 2002, 53(370):979-987. 343. de Oliveira Dal'Molin CG, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK: AraGEM, a genome-scale reconstruction of the primary metabolic network in Arabidopsis. Plant Physiol 2010, 152(2):579-589. 344. Poolman MG, Miguet L, Sweetlove LJ, Fell DA: A genome-scale metabolic model of Arabidopsis and some of its properties. Plant Physiol 2009, 151(3):1570-1581. 345. Grafahrend-Belau E, Schreiber F, Koschutzki D, Junker BH: Flux balance analysis of barley seeds: a computational approach to study systemic properties of central metabolism. Plant Physiol 2009, 149(1):585-598. 346. de Oliveira Dal'Molin CG, Quek LE, Palfreyman RW, Brumbley SM, Nielsen LK: C4GEM, a genome-scale metabolic model to study C4 plant metabolism. Plant Physiol 2010, 154(4):1871-1885. 347. Pilalis E, Chatziioannou A, Thomasset B, Kolisis F: An in silico compartmentalized metabolic model of Brassica napus enables the systemic study of regulatory aspects of plant central metabolism. Biotechnology and bioengineering 2011, 108(7):1673-1682. 348. Poolman MG, Kundu S, Shaw R, Fell DA: Responses to light intensity in a genome-scale model of rice metabolism. Plant Physiol 2013, 162(2):1060-1072. 349. The Arabidopsis Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408(6814):796-815. 350.
Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H et al: The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet 2011, 43(5):476-481. 351. Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J et al: Genome sequence of the palaeopolyploid soybean. Nature 2010, 463(7278):178-183. 352. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H et al: A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 2002, 296(5565):92-100. 353. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X et al: A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 2002, 296(5565):79-92. 354. Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, Putnam N, Ralph S, Rombauts S, Salamov A et al: The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313(5793):1596-1604.

355. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A et al: The Sorghum bicolor genome and the diversification of grasses. Nature 2009, 457(7229):551-556. 356. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA et al: The B73 maize genome: complexity, diversity, and dynamics. Science 2009, 326(5956):1112-1115. 357. Becker SA, Palsson BO: Context-specific metabolic networks are consistent with experiments. PLoS computational biology 2008, 4(5):e1000082. 358. Shlomi T, Cabili MN, Herrgard MJ, Palsson BO, Ruppin E: Network-based prediction of human tissue-specific metabolism. Nature biotechnology 2008, 26(9):1003-1010. 359. Jensen PA, Papin JA: Functional integration of a metabolic network model and expression data without arbitrary thresholding. Bioinformatics 2011, 27(4):541-547. 360. Colijn C, Brandes A, Zucker J, Lun DS, Weiner B, Farhat MR, Cheng TY, Moody DB, Murray M, Galagan JE: Interpreting expression data with metabolic flux models: predicting Mycobacterium tuberculosis mycolic acid production. PLoS computational biology 2009, 5(8):e1000489. 361. Chandrasekaran S, Price ND: Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America 2010, 107(41):17845-17850. 362. Friso G, Majeran W, Huang MS, Sun Q, van Wijk KJ: Reconstruction of Metabolic Pathways, Protein Expression, and Homeostasis Machineries across Maize Bundle Sheath and Mesophyll Chloroplasts: Large-Scale Quantitative Proteomics Using the First Maize Genome Assembly. Plant Physiol 2010, 152(3):1219-1250. 363. Li PH, Ponnala L, Gandotra N, Wang L, Si YQ, Tausta SL, Kebrom TH, Provart N, Patel R, Myers CR et al: The developmental dynamics of the maize leaf transcriptome. Nat Genet 2010, 42(12):1060-U1051. 364.
Chang YM, Liu WY, Shih ACC, Shen MN, Lu CH, Lu MYJ, Yang HW, Wang TY, Chen SCC, Chen SM et al: Characterizing Regulatory and Functional Differentiation between Maize Mesophyll and Bundle Sheath Cells by Transcriptomic Analysis. Plant Physiol 2012, 160(1):165-177. 365. Majeran W, Cai Y, Sun Q, van Wijk KJ: Functional differentiation of bundle sheath and mesophyll maize chloroplasts determined by comparative proteomics. The Plant cell 2005, 17(11):3111-3140. 366. Chung BK, Lee DY: Flux-sum analysis: a metabolite-centric approach for understanding the metabolic network. BMC systems biology 2009, 3:117. 367. Reznik E, Mehta P, Segre D: Flux imbalance analysis and the sensitivity of cellular growth to changes in metabolite pools. PLoS computational biology 2013, 9(8):e1003195. 368. Nelson DL, Cox MM: Oxidative Phosphorylation and Photophosphorylation. In: Lehninger Principles of Biochemistry. Fifth edn. New York: W.H. Freeman & Co.; 2009: 707-772.

369. Taiz LaZ, E.: Plant Physiology, Fifth edn. Sunderland, Massachusetts: Sinauer Associates Inc.; 2010. 370. Bachlava E, Dewey R, Burton J, Cardinal AJ: Mapping candidate genes for oleate biosynthesis and their association with unsaturated fatty acid seed content in soybean. Mol Breeding 2009, 23(2):337-347. 371. Li-Beisson Y, Shorrosh B, Beisson F, Andersson MX, Arondel V, Bates PD, Baud S, Bird D, Debono A, Durrett TP et al: Acyl-lipid metabolism. The Arabidopsis book / American Society of Plant Biologists 2010, 8:e0133. 372. Mekhedov S, de Ilarduya OM, Ohlrogge J: Toward a functional catalog of the plant genome. A survey of genes for lipid biosynthesis. Plant Physiol 2000, 122(2):389-402. 373. Murata N: Molecular-Species Composition of Phosphatidylglycerols from Chilling-Sensitive and Chilling-Resistant Plants. Plant and Cell Physiology 1983, 24(1):81-86. 374. Moore TS: Phospholipid Biosynthesis. Annu Rev Plant Phys 1982, 33:235-259. 375. Rolland N, Curien G, Finazzi G, Kuntz M, Marechal E, Matringe M, Ravanel S, Seigneurin-Berny D: The biosynthetic capacities of the plastids and integration between cytoplasmic and chloroplast processes. Annual review of genetics 2012, 46:233-264. 376. Murata N, Tasaka Y: Glycerol-3-phosphate acyltransferase in plants. Bba-Lipid Lipid Met 1997, 1348(1-2):10-16. 377. Amiour N, Imbaud S, Clement G, Agier N, Zivy M, Valot B, Balliau T, Armengaud P, Quillere I, Canas R et al: The use of metabolomics integrated with transcriptomic and proteomic studies for identifying key steps involved in the control of nitrogen metabolism in crops such as maize. J Exp Bot 2012, 63(14):5017-5033. 378. Hirel B, Martin A, Terce-Laforgue T, Gonzalez-Moro MB, Estavillo JM: Physiology of maize I: A comprehensive and integrated view of nitrogen metabolism in a C4 plant. Physiol Plantarum 2005, 124(2):167-177. 379.
Martin A, Belastegui-Macadam X, Quillere I, Floriot M, Valadier MH, Pommel B, Andrieu B, Donnison I, Hirel B: Nitrogen management and senescence in two maize hybrids differing in the persistence of leaf greenness: agronomic, physiological and molecular aspects. New Phytologist 2005, 167(2):483-492. 380. Gallais A, Hirel B: An approach to the genetics of nitrogen use efficiency in maize. J Exp Bot 2004, 55(396):295-306. 381. Coïc Y, Lesaint C: Comment assurer une bonne nutrition en eau et en ions minéraux en horticulture. Hortic Française 1971, 8:11-14. 382. Terce-Laforgue T, Mack G, Hirel B: New insights towards the function of glutamate dehydrogenase revealed during source-sink transition of tobacco (Nicotiana tabacum) plants grown under different nitrogen regimes. Physiol Plant 2004, 120(2):220-228. 383. Verwoerd TC, Dekker BMM, Hoekema A: A Small-Scale Procedure for the Rapid Isolation of Plant Rnas. Nucleic Acids Res 1989, 17(6):2362-2362. 384. Dellaporta S, Wood J, Hicks J: A plant DNA minipreparation: Version II. Plant Mol Biol Rep 1983, 1(4):19-21.

213 385. Eberwine J: Amplification of mRNA populations using aRNA generated from immobilized oligo(dT)-T7 primed cDNA. Biotechniques 1996, 20(4):584-&. 386. Imbeaud S, Graudens E, Boulanger V, Barlet X, Zaborski P, Eveno E, Mueller O, Schroeder A, Auffray C: Towards standardization of RNA quality assessment using user-independent classifiers of microcapillary electrophoresis traces. Nucleic Acids Res 2005, 33(6):e56. 387. Graudens E, Boulanger V, Mollard C, Mariage-Samson R, Barlet X, Gremy G, Couillault C, Lajemi M, Piatier-Tonneau D, Zaborski P et al: Deciphering cellular states of innate tumor drug responses. Genome biology 2006, 7(3):R19. 388. Marisa L, Ichante JL, Reymond N, Aggerbeck L, Delacroix H, Mucchielli-Giorgi MH: MAnGO: an interactive R-based tool for two-colour microarray analysis. Bioinformatics 2007, 23(17):2339-2341. 389. Smyth GK, Speed T: Normalization of cDNA microarray data. Methods 2003, 31(4):265-273. 390. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(9):5116-5121. 391. Korn EL, McShane LM, Troendle JF, Rosenwald A, Simon R: Identifying pre- post chemotherapy differences in gene expression in breast tumours: a statistical method appropriate for this aim. British journal of cancer 2002, 86(7):1093-1096. 392. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 1995, 57(1):289-300. 393. Mechin V, Thevenot C, Le Guilloux M, Prioul JL, Damerval C: Developmental analysis of maize endosperm proteome suggests a pivotal role for pyruvate orthophosphate dikinase. Plant Physiol 2007, 143(3):1203-1219. 394. Cataldo DA, Haroon M, Schrader LE, Youngs VL: Rapid Colorimetric Determination of Nitrate in Plant-Tissue by Nitration of Salicylic-Acid. Commun Soil Sci Plan 1975, 6(1):71-80. 395. 
Rosen H: A Modified Ninhydrin Colorimetric Analysis for Amino Acids. Arch Biochem Biophys 1957, 67(1):10-15. 396. Arnon DI: Copper Enzymes in Isolated Chloroplasts - Polyphenoloxidase in Beta-Vulgaris. Plant Physiol 1949, 24(1):1-15. 397. Ferrario-Mery S, Valadier MH, Foyer CH: Overexpression of nitrate reductase in tobacco delays drought-induced decreases in nitrate reductase activity and mRNA. Plant Physiol 1998, 117(1):293-302. 398. Miquel M, Browse J: Arabidopsis Mutants Deficient in Polyunsaturated Fatty-Acid Synthesis - Biochemical and Genetic-Characterization of a Plant Oleoyl-Phosphatidylcholine Desaturase. J Biol Chem 1992, 267(3):1502-1509. 399. Ohnishi J, Yamada M: Glycerolipid Synthesis in Avena Leaves during Greening of Etiolated Seedlings .2. Alpha-Linolenic Acid Synthesis. Plant and Cell Physiology 1980, 21(8):1607-1618.

214 400. Harholt J, Jensen JK, Sorensen SO, Orfila C, Pauly M, Scheller HV: ARABINAN DEFICIENT 1 is a putative arabinosyltransferase involved in biosynthesis of Pectic Arabinan in Arabidopsis. Plant Physiol 2006, 140(1):49- 58. 401. Updegraff DM: Semimicro determination of cellulose in biological materials. Analytical biochemistry 1969, 32(3):420-424. 402. Harholt J, Jensen JK, Sorensen SO, Orfila C, Pauly M, Scheller HV: ARABINAN DEFICIENT 1 is a putative arabinosyltransferase involved in biosynthesis of pectic arabinan in Arabidopsis. Plant Physiol 2006, 140(1):49- 58. 403. Fukushima RS, Hatfield RD: Extraction and isolation of lignin for utilization as a standard to determine lignin concentration using the acetyl bromide spectrophotometric method. J Agr Food Chem 2001, 49(7):3133-3139. 404. Fiehn O: Metabolite profiling in Arabidopsis. Methods Mol Biol 2006, 323:439- 447. 405. Zybailov B, Rutschow H, Friso G, Rudella A, Emanuelsson O, Sun Q, van Wijk KJ: Sorting Signals, N-Terminal Modifications and Abundance of the Chloroplast Proteome. Plos One 2008, 3(4). 406. Kim J, Rudella A, Rodriguez VR, Zybailov B, Olinares PDB, van Wijk KJ: Subunits of the Plastid ClpPR Protease Complex Have Differential Contributions to Embryogenesis, Plastid Biogenesis, and Plant Development in Arabidopsis. The Plant cell 2009, 21(6):1669-1692. 407. Ehleringer JR, Cerling TE, Helliker BR: C₄ Photosynthesis, Atmospheric CO₂ , and Climate. Oecologia 1997, 112(3):285-299. 408. Zhao Q, Chen S, Dai S: C4 photosynthetic machinery: insights from maize chloroplast proteomics. Frontiers in plant science 2013, 4:85. 409. Leegood RC: The Intercellular Compartmentation of Metabolites in Leaves of Zea-Mays-L. Planta 1985, 164(2):163-171. 410. Weiner H, Heldt HW: Inter- and intracellular distribution of amino acids and other metabolites in maize (Zea mays L.) leaves. Planta 1992, 187:242-246. 411. 
Stitt M, Heldt HW: Generation and Maintenance of Concentration Gradients between the Mesophyll and Bundle Sheath in Maize Leaves. Biochim Biophys Acta 1985, 808(3):400-414. 412. Sowiński P, Szczepanik J, Minchin PEH: On the mechanism of C4 photosynthesis intermediate exchange between Kranz mesophyll and bundle sheath cells in grasses. J Exp Bot 2008, 59(6):1137-1147. 413. Taniguchi Y, Nagasaki J, Kawasaki M, Miyake H, Sugiyama T, Taniguchi M: Differentiation of dicarboxylate transporters in mesophyll and bundle sheath chloroplasts of maize. Plant & cell physiology 2004, 45(2):187-200. 414. Doulis AG, Debian N, KingstonSmith AH, Foyer CH: Differential localization of antioxidants in maize leaves. Plant Physiol 1997, 114(3):1031-1037. 415. Burgener M, Suter M, Jones S, Brunold C: Cyst(e)ine is the transport metabolite of assimilated sulfur from bundle-sheath to mesophyll cells in maize leaves. Plant Physiol 1998, 116(4):1315-1322.

215 416. Furbank RT, Jenkins CLD, Hatch MD: Co2 Concentrating Mechanism of C4 Photosynthesis - Permeability of Isolated Bundle Sheath-Cells to Inorganic Carbon. Plant Physiol 1989, 91(4):1364-1371. 417. Alberte RS, Thornber JP: Water stress effects on the content and organization of chlorophyll in mesophyll and bundle sheath chloroplasts of maize. Plant Physiol 1977, 59(3):351-353. 418. Mintz-Oron S, Meir S, Malitsky S, Ruppin E, Aharoni A, Shlomi T: Reconstruction of Arabidopsis metabolic network models accounting for subcellular compartmentalization and tissue-specificity. Proceedings of the National Academy of Sciences of the United States of America 2012, 109(1):339- 344. 419. Schellenberger J, Lewis NE, Palsson BO: Elimination of thermodynamically infeasible loops in steady-state metabolic models (vol 100, pg 544, 2010). Biophys J 2011, 100(5):1381-1381.

216

VITA

Akhil Kumar

Education

Institute                               Field of study            Degree   Year
Penn State University                   Integrative Biosciences   PhD      2011-2017
Penn State University                   Integrative Biosciences   MS       2011-2013
Maharishi Dayanand University, Rohtak   Computer Science          BEng     2002-2006

Publications

• Akhil Kumar, Costas D. Maranas. “De novo synthesis routes through uncharted biochemical spaces”
• Akhil Kumar, Costas D. Maranas. “CLCA: Maximum Common Molecular Substructure Queries within the MetRxn Database”, Journal of Chemical Information and Modeling, 2014
• Akhil Kumar, Costas D. Maranas. “Rapid Ontology Alignment in Large Metabolic Information Databases”, AIChE Annual Meeting, 2014
• Margret Simmons, Rajib Saha, Nardjis Amiour, Akhil Kumar, and 6 other authors, Costas D. Maranas. “Towards Multi-Tissue Type Metabolic Modeling of Maize”, AIChE Annual Meeting, 2013
• Margret Simmons, Rajib Saha, Nardjis Amiour, Akhil Kumar, and 6 other authors, Costas D. Maranas. “Assessing the Metabolic Impact of Nitrogen Availability Using a Compartmentalized Maize Leaf Genome-Scale Model”, Plant Physiology, 2013
• Akhil Kumar, Patrick Suthers, Costas D. Maranas. “MetRxn: a knowledgebase of metabolites and reactions spanning metabolic models and databases”, BMC Bioinformatics, 2012
• Patrick Suthers*, Akhil Kumar*, Costas D. Maranas. “Reaction/Metabolite Standardization and Congruency Across Databases and Genome-Scale Metabolic Models”, AIChE Annual Meeting, 2010

* These authors contributed equally.