The Pennsylvania State University

The Graduate School

Department of Chemical Engineering

OPTIMIZATION BASED REDESIGN OF MICROBIAL PRODUCTION SYSTEMS

A Thesis in

Chemical Engineering

by

Priti Pharkya

© 2005 Priti Pharkya

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2005

The thesis of Priti Pharkya was reviewed and approved* by the following:

Costas D. Maranas, Donald B. Broughton Professor of Chemical Engineering, Thesis Advisor, Chair of Committee

Patrick C. Cirino, Assistant Professor of Chemical Engineering

John M. Regan, Assistant Professor of Environmental Engineering

Andrew L. Zydney, Walter L. Robb Chair and Professor of Chemical Engineering, Head of the Department of Chemical Engineering

*Signatures are on file in the Graduate School

ABSTRACT

The primary objective of this research is to develop computational tools for guiding experimental strain engineering strategies. The key questions addressed in this work are: (i) how can optimal gene deletions be selected so that biochemical production in a stoichiometric network is coupled to biomass formation; (ii) alternatively, how can one select the recombination candidates from a database of biotransformations to confer the ability to produce a specific chemical from an optimal substrate in a host organism; (iii) what are the best reaction candidates for modulation to enhance the yields of biochemicals in a microbial network; and (iv) finally, what are the smallest enzyme sets that can be modulated to achieve the maximum possible enhancement of a specific reaction flux if a detailed kinetic model is used to describe metabolism?

This thesis begins with the description of a bilevel framework introduced to identify not only optimal reaction deletions but also the optimal transport rates of key metabolites so that complex compounds can be produced as an obligatory byproduct of growth. Case studies involve prediction of deletion strategies for different amino acids in Escherichia coli. Next, an integrated framework, OptStrain, is presented to uncover and investigate alternative pathways in conjunction with the examination of multiple organisms and substrates for selecting the best strategy for producing a target metabolite. These alternative pathways are identified from a database of reactions compiled from publicly available biopathway databases and stoichiometric models of metabolism. Results include manipulation strategies for overproducing hydrogen in networks of vastly different microbial organisms such as E. coli, Clostridium acetobutylicum and Methylobacterium extorquens, and for vanillin production in E. coli. The array of in silico genetic manipulations that can be predicted by using optimization frameworks is completed by the OptReg framework, which can predict inhibition, up regulation and deletion of reactions for strain redesign. The applicability of this tool is demonstrated for ethanol overproduction in Escherichia coli. Finally, a kinetic model of the central metabolism of E. coli is examined for elucidating the enzyme sets that can favorably influence the flux towards serine synthesis and through the phosphotransferase uptake system in the network. The broad array of genetic manipulation strategies identifiable through the proposed frameworks highlights their utility as efficient strain design tools.

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGEMENTS

Chapter 1 Introduction

1.1 Motivation and objective
1.2 Background
1.2.1 Kinetic models of metabolism
1.2.2 Constraint-based modeling
1.2.2.1 Flux balance analysis
1.3 Thesis overview

Chapter 2 Reaction deletions for complex compound overproduction

2.1 Background
2.2 Modifications to the OptKnock framework
2.2.1 Modified OptKnock procedure
2.3 Computational protocol
2.4 Results
2.4.1 Chorismate formation (Precursor to aromatic amino acids)
2.4.2 Alanine overproduction (Pyruvate family)
2.4.3 Serine overproduction (3-phosphoglycerate family)
2.4.4 Aspartate overproduction (Aspartate family)
2.4.5 Glutamate overproduction (α-ketoglutarate family)
2.5 Conclusion

Chapter 3 Redesigning strains by reaction additions and deletions

3.1 Background
3.2 The OptStrain Procedure
3.2.1 Curation of the database
3.2.2 Determination of the maximum yield
3.2.3 Identification of the minimum number of non-native reactions
3.2.4 Incorporating the non-native reactions into host organism’s stoichiometric model
3.3 Mathematical formulation
3.4 Results
3.4.1 Hydrogen production case study
3.4.1.1 Escherichia coli
3.4.1.2 Clostridium acetobutylicum
3.4.1.3 Methylobacterium extorquens AM1
3.4.2 Vanillin production case study
3.5 Conclusion

Chapter 4 Identifying Reaction Activation/Inhibition Candidates for Overproduction in Microbial Systems

4.1 Background
4.2 Modeling and computational protocol
4.2.1 Steady-state flux determination
4.2.2 Modeling of genetic manipulations
4.2.3 OptReg framework
4.3 Strategies for overproducing ethanol
4.3.1 Two reaction modification strategies
4.3.2 Three reaction modification strategies
4.3.3 Evaluating the mutant networks using an alternate objective: MOMA
4.3.4 Effect of the value of the regulation strength parameter, C
4.4 Conclusion
4.5 Appendix 4.1
4.6 Appendix 4.2: Stoichiometry of selected reactions

Chapter 5 Optimal enzyme selection using kinetic models

5.1 Introduction
5.2 Modeling cellular kinetics and cellular machinery
5.3 Solution Method
5.3.1 Mixed Integer Non-linear Programming Problem (MINLP)
5.3.2 Search for optimal enzyme sets and levels
5.3.3 Computational implementation
5.4 Results and discussion
5.5 Conclusion

Chapter 6 Synopsis

6.1 Future work

Bibliography

LIST OF FIGURES

Figure 2.1: The limits of lactate production of the network.

Figure 2.2: The modified bilevel formulation of OptKnock.

Figure 2.3: (a) The central metabolism in E. coli. (b) The amino acid biosynthetic pathways in E. coli.

Figure 2.4: Comparison of chorismate formation rates for different numbers of reaction eliminations in the network.

Figure 2.5: (a) Flux distribution for mutant A (six reaction deletions) with enhanced chorismate formation rate. (b) Chorismate formation limits of the mutant A.

Figure 2.6: (a) Flux distribution for mutant A (with three reaction eliminations) and (b) mutant B (with four reaction eliminations) overproducing alanine. (c) Alanine production limits of the mutant networks.

Figure 2.7: (a) Flux distribution for the serine-overproducing mutant (three reactions removed). (b) Serine production abilities of the network for mutant A compared with those of the wild type E. coli network.

Figure 2.8: (a) Flux distribution for the aspartate-secreting mutant A (with three reactions eliminated) and (b) for mutant B (with four reactions eliminated). (c) Aspartate producing limits of the mutant networks compared with those of wild type E. coli network.

Figure 2.9: (a) Flux distribution for the glutamate-secreting mutant A and (b) for mutant B. (c) Glutamate producing limits of the mutant networks compared with those of wild type E. coli network.

Figure 3.1: Pictorial representation of the OptStrain procedure. The red (X)’s pinpoint the deleted reactions.

Figure 3.2: Maximum hydrogen yield on a weight basis for different substrates.

Figure 3.3: Hydrogen production envelopes as a function of the biomass production rate of the wild-type E. coli network under aerobic and anaerobic conditions as well as the two-reaction and three-reaction deletion mutant networks.

Figure 3.4: Calculated flux distributions at the maximum growth rates in the (a) two and (b) three deletion E. coli mutant networks for overproducing hydrogen. A basis glucose uptake rate of 10 mmol/gDW.hr was assumed.

Figure 3.5: Calculated flux distributions at the maximum growth rates for the wild-type (light gray) and the two-reaction deletion mutant (dark gray) C. acetobutylicum networks. The (X)’s denote reactions that were selected for elimination in the mutant network. The wild-type network flux values are for the minimum hydrogen production scenario, corresponding to point A in Figure 3.6.

Figure 3.6: Hydrogen formation limits of the wild-type (solid) and mutant (dotted) Clostridium acetobutylicum metabolic network for a basis glucose uptake rate of 1 mmol/gDW.hr.

Figure 3.7: Calculated flux distributions at the maximum growth rates in the (a) one, (b) two, and (c) four deletion E. coli mutant networks for overproducing vanillin. Non-native reactions are denoted by the thicker gray arrows.

Figure 3.8: Vanillin production envelope of the augmented E. coli metabolic network for a basis 10 mmol/gDW.hr uptake rate of glucose. An anaerobic mode of growth is suggested in all cases.

Figure 4.1: A pictorial overview of the definitions of up/down regulations and deletions.

Figure 4.2: The flux values at steady-state fixed at experimental values extracted from (Fischer et al. 2004).

Figure 4.3: A pictorial representation of the central metabolic network of Escherichia coli and the associated genes to explain the proposed metabolic engineering strategies.

Figure 4.4: The biochemical production abilities of the mutant networks with two reactions knocked-out and/or modulated.

Figure 4.5: The flux distribution for the three reaction mutant network MF. The down regulated reaction is shown with a dashed arrow and the cross represents the reaction deletion.

Figure 4.6: The biochemical production abilities of the mutant networks with three reactions deleted and/or regulated.

Figure 4.7: The ethanol yields (in mmol/gDW.hr) of the six mutant networks obtained by using the OptReg framework (max biomass) contrasted against the yields when the minimal flux adjustment criterion (MOMA) is imposed on the networks. The yields from five of the six mutant networks are very close to the perfect correlation line (shown dotted).

Figure 5.1: A simulated annealing pseudo-code.

Figure 5.2: Escherichia coli central metabolism.

Figure 5.3: Ratios rL/r30 (%) are plotted as a function of the modulated set size L. Solid rhombi and white triangles correspond to serine production and PTS rates, respectively.

Figure 5.4: Flux control coefficients (FCCs) for the serine (solid bars) and PTS (white bars) fluxes.

Figure 5.5: FCCs are plotted as a function of the enzyme set size L.

LIST OF TABLES

Table 2.1: Reaction deletions for maximum formation of chorismate. The maximum theoretical yield of chorismate is 5.33 mmol/gDW.hr.

Table 2.2: Different approaches for maximizing flux towards aromatic acid production from the central metabolic pathway.

Table 2.3: The deletion mutants for overproducing alanine. The maximum theoretical yield of alanine is 19.77 mmol/gDW.hr. In all mutants, the secretion of D-alanine is blocked.

Table 2.4: The deletion mutants for overproduction of serine. The maximum theoretical yield of serine is 20 mmol/gDW.hr.

Table 2.5: The deletion mutants for overproduction of aspartate. The maximum theoretical yield of aspartate is 19.05 mmol/gDW.hr.

Table 2.6: The deletion mutants for overproduction of glutamate. The maximum theoretical yield of glutamate is 11.36 mmol/gDW.hr.

Table 3.1: Deletion mutants for enhanced hydrogen production. A basis glucose uptake rate of 10 mmol/gDW.hr is assumed.

Table 4.1: Strategies for overproducing ethanol (for C = 0.5) listed along with the corresponding ethanol production and biomass formation rates. The upregulated reactions are denoted with the (↑) symbol. Conversely, downregulated reactions are denoted with the (↓) symbol and the deleted reactions with a (X) symbol. The numbers within parentheses show the relative values of the modified fluxes when compared to the steady state fluxes. Biomass formation rate is expressed on a per hour basis and all the other fluxes are expressed in units of mmol/gDW·hr.

Table 4.2: Comparisons of the theoretical yields of ethanol obtained for a value of C equal to 0.5 when implemented on mutant networks after changing C. M stands for modulation and M/D stands for both modulation and deletion.

Table 5.1: Best enzyme sets leading to increased serine production.

ACKNOWLEDGEMENTS

My years at Penn State have exposed me to a new and exciting field of research. The experience and the learning process here would not have been so rewarding without the exceptional guidance and motivation of my thesis advisor, Costas Maranas. I sincerely thank him for all the words of wisdom that he has shared with me during the past four years. I am also very grateful to my committee members, Andrew Zydney, John Regan and Patrick Cirino, for agreeing to serve on my committee. I really appreciate the opportunity that Dr. Cirino provided me to conduct experiments in his lab. Special thanks to the past and present members of my group, especially Anthony Burgard, Evgeni Nikolaev and Anshuman Gupta, for their stimulating discussions and ideas.

One person who has been my constant companion and inspiration for more than ten years is my husband, Girish. Without his help and motivation, this experience would not have been the same. No words can express my gratitude to my parents, Yashoda and Ramesh, who have been pillars of support and encouragement throughout my life. I would also like to thank my brother and sister for being just wonderful. Last but not least, I appreciate the affection that Girish’s parents and his family have showered upon me since I first met them. I can’t really express how fortunate I feel to have all these people in my life.

Chapter 1

Introduction

1.1 Motivation and objective

The field of metabolic engineering has undergone a fundamental transformation in the past few years. The initial focus on modifying individual metabolic pathways for endowing microorganisms with desired properties has now been broadened to adopt a global view of metabolism (Stephanopoulos et al. 1998). This trend first became evident when small-scale models of metabolism, comprising fewer than 100 reactions, were employed to determine the performance limits of microorganisms (Papoutsakis and Meyer 1985; Majewski and Domach 1990; Varma et al. 1993). The sequencing of the Escherichia coli genome (Blattner et al. 1997) and the subsequent annotation of numerous genomes at an unprecedented pace have provided the necessary impetus for reconstructing genome-scale models of metabolism accounting for hundreds of genes in a microbial genome (Schilling et al. 2002; Famili et al. 2003; Reed et al. 2003; Becker and Palsson 2005).

An interesting property of microbial organisms is their robustness when subjected to genetic modifications and when grown in different environmental conditions. This characteristic can be attributed to the inherent redundancy in their metabolic networks. For example, in a recent study, it was pointed out that a large fraction of yeast genes (80%) is not essential under normal laboratory conditions (Duarte et al. 2004). The authors determined that many of these dispensable genes are required in different growth environments; others are compensated for by isozymes or by alternate metabolic pathways. This flexibility of cellular networks makes it more complex to predict their behavior in the face of imposed genetic manipulations, necessitating the use of computational tools for performing a systems-based analysis of the network behavior.

The underlying objective of this research is to develop optimization-based algorithms utilizing metabolic models for in silico design of strains efficient in biochemical production.

1.2 Background

Metabolism represents the sum of enzymatic reactions that convert nutrients necessary for nourishing living organisms into energy and the chemically finished products of the cell. Each enzyme catalyzing a reaction is a product of one or multiple genes in an organism’s DNA. The amount of raw material processed by an enzyme through a biochemical reaction is expressed in terms of metabolic flux and is quantified in units of mmol/gDW.hr. Models of metabolism fall into two broad categories: (i) kinetic and (ii) stoichiometric. These are discussed in the following subsections.

1.2.1 Kinetic models of metabolism

Metabolism can be described by accounting for the detailed kinetics of the enzyme-catalyzed reactions in a network. These models provide the most accurate and rigorous description of cellular behavior. Kinetic models of glycolysis in Saccharomyces cerevisiae have been utilized to investigate the effects of environmental conditions on metabolism (Reuss 1991), and to simulate physiology for understanding the underlying regulation (Rizzi et al. 1997). These kinetic models have more recently been applied within the framework of Metabolic Control Analysis (MCA) to amplify and/or redirect metabolic fluxes for biochemical production (Galazzo and Bailey 1990; Delgado et al. 1993). Briefly, MCA (Kacser and Burns 1973; Heinrich and Rapoport 1974) uses kinetic models to perform a sensitivity analysis for analyzing the distribution of flux control in a metabolic pathway among the different enzymes that constitute the pathway. An alternate formalism for quantifying metabolic control is based on the Biochemical Systems Theory (Savageau 1969a; Savageau 1969b; Savageau 1970) introduced in the late 1960s. In this approach, mass balances around each metabolite are approximated by two competing power-law functions describing the metabolite aggregation and consumption. Upon a logarithmic transformation, this provides the framework for steady-state enzyme level optimization through linear programming (Voit 1992; Regan et al. 1993; Torres et al. 1996).
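The logarithmic transformation mentioned above can be illustrated with the standard S-system form of Biochemical Systems Theory. The following is only a sketch: the symbols $X_i$ (metabolite concentration), $\alpha_i$, $\beta_i$ (rate constants), $g_{ij}$, $h_{ij}$ (kinetic orders) and $y_j = \ln X_j$ are notation assumed here for illustration and do not come from the thesis.

```latex
\begin{align}
\frac{dX_i}{dt} &= \alpha_i \prod_{j} X_j^{g_{ij}} - \beta_i \prod_{j} X_j^{h_{ij}} \\
\alpha_i \prod_{j} X_j^{g_{ij}} &= \beta_i \prod_{j} X_j^{h_{ij}}
  \qquad \text{(steady state)} \\
\sum_{j} \left(g_{ij} - h_{ij}\right) y_j &= \ln \beta_i - \ln \alpha_i ,
  \qquad y_j = \ln X_j
\end{align}
```

The last line is linear in the log-concentrations $y_j$, which is why steady-state optimization in this formalism can be posed as a linear program.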

There are, however, fundamental challenges associated with kinetics-based models of large-scale cellular processes. First, the intra-cellular chemical environment is complex and hard to define in terms of reaction equations. Second, assuming that all the governing equations are defined, it is hard to find numerical values for all equation parameters, the majority of which cannot be determined via current experimental methods. Third, even if all of the kinetic parameters are identified for a certain condition, there is no guarantee that they would remain unchanged if the organisms encounter different environmental conditions or undergo adaptive evolution, in which mutations can change the values of the kinetic constants. Consequently, kinetic models have so far been largely limited to restricted parts of a metabolic network, such as glycolysis (Rizzi et al. 1997), the pentose phosphate pathway (Vaseghi et al. 1999), central nitrogen metabolism (van Riel et al. 1998) in Saccharomyces cerevisiae, and the central metabolism in Escherichia coli (Chassagnole et al. 2002). In fact, with the exception of the human red blood cell (Joshi and Palsson 1989a; Joshi and Palsson 1989b; Joshi and Palsson 1990a; Joshi and Palsson 1990b; Jamshidi et al. 2001), the necessary information to describe the metabolism of a single cell in mathematical detail is still lacking despite the recent explosion of generated biological data. This has been the key motivation for the development of stoichiometric models of metabolism.

1.2.2 Constraint-based modeling

Genome-scale models for describing metabolism are primarily constructed by accounting for only the stoichiometry of the biochemical reactions functionally assigned to the genes in a specific organism. As mentioned earlier, these models are currently available for several organisms and when complemented with information on cellular composition and transcriptional regulation, they can describe the metabolic behavior of a cell more accurately.

The analysis of these genome-scale models has been most successfully accomplished through the constraint-based approach. This approach is based on restricting the allowable behavior or phenotypes of a cell, described in terms of flux distributions in the network, by successively imposing constraints on the system (Covert et al. 2003). These constraints are categorized into physico-chemical constraints, condition-dependent environmental constraints and regulatory or self-imposed constraints (Price et al. 2004).

Physico-chemical constraints in the form of stoichiometric, enzyme capacity and thermodynamic constraints represent hard inviolable constraints that the cells must abide by (Reed and Palsson 2003). Stoichiometric mass balance constraints require that no accumulation or depletion of metabolites should take place in a network at steady state, i.e., the rate of production of each metabolite in the network must equal its consumption. Mathematically, this can be represented as S·v = 0, where v is the vector of fluxes in the metabolic network and S is the stoichiometric matrix containing the stoichiometry of the reactions in the network. Capacity constraints limiting the maximum flux through a reaction (v ≤ v_max) are an outcome of the limited enzyme turnover rates in a cell. Finally, the thermodynamics of internal reactions determines the directionality of reactions. If a reaction is irreversible, the minimum flux through it is zero; otherwise the flux can assume negative values, indicating that the reaction can also proceed in the backward direction (Varma and Palsson 1994).
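The three physico-chemical constraints described above can be checked directly for a candidate flux vector. The following is a minimal sketch in plain Python; the three-reaction network, the function names and the numerical values are hypothetical and serve only to illustrate S·v = 0, v ≤ v_max and the directionality rule.

```python
def steady_state_residual(S, v):
    """Return S·v, the net production rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_feasible(S, v, v_max, reversible, tol=1e-9):
    """Check mass balance, capacity and directionality for a flux vector v."""
    balanced = all(abs(r) <= tol for r in steady_state_residual(S, v))
    within_capacity = all(v_j <= cap + tol for v_j, cap in zip(v, v_max))
    # Irreversible reactions may not carry negative flux.
    direction_ok = all(rev or v_j >= -tol for v_j, rev in zip(v, reversible))
    return balanced and within_capacity and direction_ok


# Hypothetical toy network: uptake -> A, A -> B, B -> export.
S = [
    [1, -1, 0],   # metabolite A
    [0, 1, -1],   # metabolite B
]
v_max = [10, 10, 10]
reversible = [False, True, False]

print(is_feasible(S, [5, 5, 5], v_max, reversible))  # every metabolite balanced
print(is_feasible(S, [5, 3, 3], v_max, reversible))  # metabolite A accumulates
```

A flux vector that leaves a nonzero residual for any metabolite violates the steady-state assumption and is rejected before capacity or directionality is even considered.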

Environmental constraints on the cell are time-dependent and result from nutrient availability, pH, temperature, and the availability of electron acceptors. Regulatory constraints are subject to evolutionary change and are self-imposed by the cell for increased survivability. These are implemented by the cell by modulating the amount of gene products made (transcriptional and translational regulation) and their activity (enzyme regulation) (Price et al. 2004). To account for the effects of transcriptional regulation, a Boolean representation of the transcriptional regulatory network can be constructed (Covert and Palsson 2002). Within this framework, genes can be found in only two states, either expressed or not expressed. If the gene is not expressed, the enzyme will not be present in the cell and the associated reaction is inactive (v_j = 0) (Covert and Palsson 2003).
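The Boolean treatment of regulation described above amounts to forcing the flux bounds of a reaction to zero whenever its gene is switched off. A minimal sketch follows; the reaction names, gene names and the oxygen-dependent rule are hypothetical and chosen only for illustration.

```python
def apply_regulation(bounds, gene_of_reaction, expressed):
    """Return flux bounds with reactions of unexpressed genes forced to zero."""
    regulated = {}
    for rxn, (lower, upper) in bounds.items():
        gene = gene_of_reaction.get(rxn)
        if gene is not None and not expressed[gene]:
            regulated[rxn] = (0.0, 0.0)   # enzyme absent, so v_j = 0
        else:
            regulated[rxn] = (lower, upper)
    return regulated


# Hypothetical example: gene_b is repressed whenever oxygen is available.
oxygen_available = True
expressed = {"gene_a": True, "gene_b": not oxygen_available}

bounds = {"R1": (0.0, 10.0), "R2": (-5.0, 5.0), "R3": (0.0, 8.0)}
gene_of_reaction = {"R1": "gene_a", "R2": "gene_b"}   # R3 has no assigned gene

regulated_bounds = apply_regulation(bounds, gene_of_reaction, expressed)
print(regulated_bounds)
```

The resulting bounds can then be passed to any constraint-based calculation in place of the unregulated ones.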

1.2.2.1 Flux balance analysis

The system of equations resulting from the constraints governing cellular behavior is typically underdetermined because the number of fluxes is more than the number of metabolites present in the cell. This has motivated the use of linear optimization to investigate flux distributions corresponding to an optimized objective function (Z). A typical flux balance analysis (FBA) problem for maximizing biomass formation can be written as:

\[
\begin{aligned}
\text{Maximize}\quad & Z = v_{\text{biomass}} \\
\text{subject to}\quad & \sum_{j=1}^{M} S_{ij}\, v_j = 0 \quad \forall\, i \\
& v_j \in \mathbb{R}^{+} \quad \forall\, j
\end{aligned}
\]

where v_biomass is a flux drain comprised of all necessary components of biomass (i.e., amino acids, nucleotides, etc.) in their appropriate biological ratios (Neidhardt and Curtiss 1996). A variety of such cellular objectives has been proposed such as the maximization of ATP production (Majewski and Domach 1990; Ramakrishna et al. 2001), minimization of nutrient uptake (Savinell and Palsson 1992), minimization of metabolic adjustment (MOMA) (Segre et al. 2002), and the maximization of the cellular biomass yield (Varma and Palsson 1993; Varma and Palsson 1994).
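The FBA problem above can be illustrated on a toy network. The sketch below uses only the Python standard library and solves the linear program by brute-force enumeration of candidate vertices, which is a stand-in chosen for self-containment and is practical only for a handful of reactions; genome-scale FBA relies on a dedicated LP solver. The four-reaction network, the upper bounds and all function names are hypothetical, and the code assumes that the stoichiometric matrix has full row rank.

```python
from fractions import Fraction
from itertools import combinations, product


def solve_square(A, b):
    """Solve A x = b by Gauss-Jordan elimination; return None if A is singular."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]


def toy_fba(S, upper, objective):
    """Maximize v[objective] subject to S v = 0 and 0 <= v <= upper."""
    m, n = len(S), len(S[0])
    best = None
    # Fix n - m fluxes at a bound and solve the mass balances for the rest.
    for fixed in combinations(range(n), n - m):
        free = [j for j in range(n) if j not in fixed]
        for values in product(*[(0, upper[j]) for j in fixed]):
            A = [[S[i][j] for j in free] for i in range(m)]
            b = [-sum(S[i][j] * val for j, val in zip(fixed, values))
                 for i in range(m)]
            solution = solve_square(A, b)
            if solution is None:
                continue
            v = [Fraction(0)] * n
            for j, val in zip(fixed, values):
                v[j] = Fraction(val)
            for j, val in zip(free, solution):
                v[j] = val
            feasible = all(0 <= v[j] <= upper[j] for j in range(n))
            if feasible and (best is None or v[objective] > best[objective]):
                best = v
    return best


# Hypothetical network: R1 uptake -> A, R2 A -> B, R3 A -> byproduct, R4 B -> biomass.
S = [
    [1, -1, -1, 0],   # metabolite A
    [0, 1, 0, -1],    # metabolite B
]
upper = [10, 100, 100, 100]          # uptake is the limiting capacity
v_opt = toy_fba(S, upper, objective=3)
print([float(x) for x in v_opt])
```

In this toy case all of the substrate taken up is routed to biomass and the byproduct flux is zero, which is the behavior that strain design methods later in the thesis aim to alter.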

Stoichiometric models of Escherichia coli metabolism utilizing the maximization of biomass objective have, under certain conditions, been successful at (i) the qualitative prediction of the outcomes of gene knockout experiments (Edwards and Palsson 2000; Badarinarayana et al. 2001), (ii) the correct sequence of byproduct secretion under increasingly anaerobic conditions (Varma et al. 1993), and (iii) the quantitative prediction of cellular growth rates (Edwards et al. 2001). Interestingly, it has been reported that even when FBA predictions under the biomass maximization assumption fail, metabolic networks can be evolved towards maximum growth (i.e., biomass yield) through adaptive evolution (Ibarra et al. 2002). Knockout studies for qualitative growth comparisons have also been performed in S. cerevisiae (Duarte et al. 2004; Papp et al. 2004), H. pylori (Schilling et al. 2002), H. influenzae (Edwards and Palsson 1999), and M. extorquens (Van Dien and Lidstrom 2002) with an accuracy rate of 60-90%. Genome-scale models of E. coli containing regulatory constraints in conjunction with the physicochemical constraints can predict the growth phenotypes with a higher accuracy (Covert et al. 2004).

1.3 Thesis overview

The subsequent chapters in this thesis discuss the optimization based frameworks that have been developed during the course of this research for efficient strain design.

Chapter 2 presents a bilevel framework aimed at identifying gene knockout strategies leading to the overproduction of complex compounds such as amino acids. Specifically, binary variables are employed to eliminate competing functionalities from the reaction network such that biochemical production becomes an obligatory byproduct of growth. In addition to reaction deletions, the transport rates of carbon dioxide, ammonia and oxygen as well as the secretion pathways for key metabolites are introduced as optimization variables in the framework. Strategies for enhancing the production of representative amino acids and key precursors for all five families span not only the central metabolic network but also the amino acid biosynthetic and degradation pathways. Computational results demonstrate the importance of manipulating energy producing/consuming pathways, controlling the uptake of nitrogen and oxygen as well as blocking the secretion pathways of key competing metabolites. A detailed discussion of these strategies can be found in:

• Pharkya, P., A. P. Burgard and C. D. Maranas (2003). "Exploring the overproduction of amino acids using the bilevel optimization framework OptKnock." Biotechnol Bioeng 84(7): 887-99.

An earlier version of this algorithm with the procedural details can be found in:

• Burgard, A. P., P. Pharkya and C. D. Maranas (2003). "OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization." Biotechnol Bioeng 84(6): 647-657.

In Chapter 3, we introduce a hierarchical computational framework, OptStrain, aimed at guiding pathway modifications, through reaction additions and deletions, of microbial networks for the overproduction of targeted compounds. A comprehensive database of biotransformations, referred to as the Universal database (with over 5,000 reactions), is compiled by downloading and curating reactions from multiple biopathway database sources. Combinatorial optimization is then employed to elucidate the set(s) of non-native functionalities, extracted from this Universal database, to add to the examined production host for enabling the desired product formation. Subsequently, competing functionalities that divert flux away from the targeted product are identified and removed to ensure higher product yields coupled with growth. This work establishes an integrated computational framework capable of constructing stoichiometrically balanced pathways, imposing maximum product yield requirements, pinpointing the optimal substrate(s), and evaluating different microbial hosts. A complete description of the work can be obtained from:

• Pharkya, P., A. P. Burgard and C. D. Maranas (2004). "OptStrain: An integrated optimization framework for microbial strain design." Genome Research 14(11): 2367-76.

In Chapter 4, the scope of the algorithm proposed in Chapter 2 is broadened to introduce OptReg, a bilevel framework for determining the optimal reaction activations/inhibitions and eliminations for targeted biochemical production. A reaction is deemed up or down regulated if it is constrained to assume flux values significantly above or below its steady-state value before the genetic manipulations. The developed framework is demonstrated by studying the overproduction of ethanol in Escherichia coli. Computational results reveal the existence of synergism between reaction deletions and modulations, implying that the simultaneous application of both types of genetic manipulations yields the most promising results. Conceptually, the proposed strategies redirect both the carbon flux and the cofactors to enhance ethanol production in the network. The OptReg framework is a versatile tool for strain design which allows for a broad array of genetic manipulations. The specifics of the framework can be obtained from:

• Pharkya, P. and C. D. Maranas (2005). "An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems." Metabolic Engineering: in press.
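The definition of up and down regulation given above can be expressed as modified flux bounds. The sketch below is one plausible parametrization and not a statement of the thesis's exact formulation, which is developed in Chapter 4: it assumes a single steady-state flux value v0 and a regulation strength parameter C between 0 and 1 that interpolates linearly between v0 and the flux limits. The function name and the numbers are hypothetical.

```python
def regulation_bounds(v0, v_min, v_max, C):
    """Flux bounds for down regulation, up regulation and knockout.

    Assumed parametrization: the regulated bound lies a fraction C of the
    way from the steady-state flux v0 to the corresponding flux limit.
    """
    if not 0.0 <= C <= 1.0:
        raise ValueError("C must lie between 0 and 1")
    down_upper = v0 * (1 - C) + v_min * C   # flux forced below steady state
    up_lower = v0 * (1 - C) + v_max * C     # flux forced above steady state
    return {
        "down": (v_min, down_upper),
        "up": (up_lower, v_max),
        "knockout": (0.0, 0.0),
    }


# Hypothetical reaction with steady-state flux 4 and limits [0, 10].
candidate_bounds = regulation_bounds(v0=4.0, v_min=0.0, v_max=10.0, C=0.5)
print(candidate_bounds)
```

Under this assumption, C = 0 leaves the steady-state value itself as the bound, while C = 1 pushes the regulated bound all the way to the flux limit.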

In Chapter 5, we explore the utility of kinetic models for enhancing the desired fluxes. Specifically, a hybrid optimization framework is introduced to identify enzyme sets and levels to meet overproduction requirements using kinetic models of metabolism. A simulated annealing algorithm is employed to navigate through the discrete space of enzyme sets while a sequential quadratic programming method is utilized to identify optimal enzyme levels. The framework is demonstrated on a model of E. coli central metabolism for serine biosynthesis. Computational results show that by optimally manipulating relatively small enzyme sets, a substantial increase in serine production can be achieved. The reference corresponding to this work is:

• Nikolaev, E. V., P. Pharkya, C. D. Maranas, and A. Armaou, "Optimal selection of enzyme levels using large-scale kinetic models," Proceedings of 16th International Federation of Automatic Control World Congress, Prague, Czech Republic, 2005.
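The outer loop of the hybrid scheme described above, a simulated annealing search over enzyme sets of fixed size, can be sketched as follows. This is an illustration under stated assumptions and not the implementation used in the thesis: the inner continuous optimization over enzyme levels (solved by sequential quadratic programming on a kinetic model in the original work) is replaced by a hypothetical additive scoring function, and the cooling schedule, move type and all names are assumed.

```python
import math
import random


def inner_level_optimum(enzyme_set, gains):
    """Placeholder for the continuous inner problem.

    The real framework would optimize enzyme levels on a kinetic model;
    here each enzyme simply contributes a fixed, made-up gain.
    """
    return sum(gains[e] for e in enzyme_set)


def anneal_enzyme_sets(gains, set_size, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for the enzyme set of a given size with the highest score."""
    rng = random.Random(seed)
    enzymes = list(range(len(gains)))
    current = set(rng.sample(enzymes, set_size))
    current_score = inner_level_optimum(current, gains)
    best, best_score = set(current), current_score
    temperature = t0
    for _ in range(steps):
        # Neighbor move: swap one selected enzyme for one unselected enzyme.
        removed = rng.choice(sorted(current))
        added = rng.choice(sorted(set(enzymes) - current))
        candidate = (current - {removed}) | {added}
        score = inner_level_optimum(candidate, gains)
        delta = score - current_score
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = set(candidate), score
        temperature *= cooling
    return best, best_score


# Hypothetical gains for six enzymes; three of them are to be modulated.
gains = [1, 9, 3, 8, 2, 7]
best_set, best_score = anneal_enzyme_sets(gains, set_size=3)
print(sorted(best_set), best_score)
```

Separating the discrete search from the scoring function keeps the annealing loop independent of how the enzyme levels are optimized, which mirrors the division of labor between the two levels of the hybrid framework.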

Chapter 6 provides a synopsis of the key contributions of the preceding chapters.

Chapter 2

Reaction deletions for complex compound overproduction

2.1 Background

The recent availability of genome-scale models of microbial organisms (Edwards et al. 2001; Schilling et al. 2002) has provided the pathway reconstructions necessary for developing novel computational methods aimed at identifying strain engineering strategies. These methods enable the investigation of the effect of a genetic manipulation, such as a gene addition or deletion, in an organism on a systemic level and thus help accelerate the construction of mutant strains with desired characteristics. The importance of such computational tools has been better recognized in the past decade or so.

The initial attempts to construct strains with improved yields of biochemicals

relied on the relatively straightforward approach of eliminating the competing pathways,

on removing feedback inhibitions in the biosynthetic pathways and on overexpressing

rate-limiting enzymes in the terminal steps of biosynthetic pathways (Stephanopoulos et

al. 1998). The yield improvements accomplished by such techniques were, however, in

many cases quite limited because they relied on the manipulation of only terminal

pathways. Efforts have been made in the past decade to address this problem for

redirecting additional carbon flux from the central metabolism into the biosynthetic

pathways (Patnaik and Liao 1994; Flores et al. 1996) by adopting a more global view on

metabolism. However, predicting the synergistic effect of multiple changes introduced in

the central metabolism has proved to be challenging considering the complexity of

metabolism in these organisms. This has motivated the development of computational

approaches which can systematically consider the effect of genetic modifications on the entire metabolic network of a production system.

Optimization tools have played a key role in in silico strain design predictions. A

bilevel framework termed OptKnock was introduced (Burgard et al. 2003) to propose

reaction eliminations from the Escherichia coli central metabolic network for maximizing

the production of simple compounds such as succinate, lactate and 1,3-propanediol. This

was achieved by relying on the maximization of biomass formation assumption for

identifying plausible flux distributions in the network, while the maximization of the

desired product flux was achieved by eliminating key reaction steps. This multiobjective

model aimed at ensuring that due to the stoichiometric constraints and pathway

connectivity, the desired biochemical had to be produced as an obligatory by-product of

growth. In this chapter, we introduce the modifications to this bilevel framework for

enhancing production of complex biochemicals such as amino acids.

2.2 Modifications to the OptKnock framework

Reaction deletions for the enhanced production of the desired amino acids are

considered not only in the central metabolic network of E. coli, but also in the amino acid

biosynthetic and degradation pathways. The resulting size of the explored network, in

conjunction with its redundancy, provides a large number of alternative routes for

channeling its carbon flux. This may not ensure the secretion of the desired biochemical,

even after the eliminations identified by OptKnock are imposed. This type of behavior

was observed when we investigated the mutants for lactate formation (Burgard et al.

2003). Specifically, after eliminating two reactions (i.e., acetate kinase and

phosphofructokinase) identified by OptKnock, we maximized and subsequently

minimized the lactate secretion as a function of different biomass production levels. As shown in Figure 2.1, this mutant could still avoid lactate formation (see point B) at maximum biomass formation and secrete other compounds because of a plethora of other options to redirect its carbon flux. Point A, identified by OptKnock, corresponds to maximum overproduction of lactate; however, point B involving no lactate production is also an equivalent solution. This is because the rightmost boundary of the range of feasible phenotypes is a vertical line and not a single point. In optimization terminology, there is not a single optimum but a family of alternative solutions, defined by the convex combinations of points A and B (i.e., the straight line joining points A and B).
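The family of alternative optima can be written out explicitly. The following is a short worked statement, where the symbols $v^{A}$ and $v^{B}$ (the full flux vectors at points A and B) and the parameter $\lambda$ are notation introduced here for illustration only:

```latex
% v^A, v^B: flux distributions at points A and B (both attain maximum biomass)
\[
v(\lambda) = \lambda\, v^{A} + (1-\lambda)\, v^{B}, \qquad 0 \le \lambda \le 1,
\]
\[
S\, v(\lambda) = \lambda\, S v^{A} + (1-\lambda)\, S v^{B} = 0, \qquad
v_{\text{biomass}}(\lambda) = v_{\text{biomass}}^{\max}, \qquad
v_{\text{lactate}}(\lambda) = \lambda\, v_{\text{lactate}}^{A}.
\]
```

Because the mass balances and flux bounds are linear, every $v(\lambda)$ is feasible and attains the same maximum biomass, while lactate secretion varies from zero (point B, $\lambda = 0$) to its maximum (point A, $\lambda = 1$).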

Therefore, we decided to introduce the cellular transport rates of ammonia, carbon dioxide and oxygen as well as key secretion pathways as optimization variables in the

OptKnock framework to reduce the degrees of freedom available to the network, forcing it to assume flux distributions as close as possible to the desired point A. Specifically, upon fixing the values of these variables equal to the ones for point A, we observe a strong coupling between the biochemical overproduction and the cellular objective of maximizing biomass formation. Note that ammonia, carbon dioxide and oxygen transport rates are inputs that can be controlled in production systems. The rate of carbon dioxide evolution can be varied by controlling pressure in industrial reactors. Nitrogen is often supplied as ammonia, whose concentration can be effectively maintained by monitoring the pH of the medium. The aeration rate along with the oxygen transfer coefficient (kLa) is used to maintain the requisite oxygen supply to the medium while

continuous stirring helps to ensure uniform oxygen concentration in the bioreactor

(Kumagai 2000; Ikeda 2003). Equivalently, the secretion of key compounds can be blocked by deleting the genes coding for their export mechanisms.

A pictorial overview of the OptKnock bilevel optimization model is given in

Figure 2.2. The following stepwise procedure summarizes the main steps undertaken for all amino acid studies. The technical details for each of these steps are described in

Section 2.3.

2.2.1 Modified OptKnock procedure

Step 1: Solve OptKnock optimization problem (see Section 2.3) to identify reaction

deletions (i.e., $y_j = 0$) for enhanced production of an amino acid (or a key precursor).

Step 2: Solve the max/min problem (see Section 2.3) for finding the production limits of

the biochemical of interest after fixing the fluxes through the deleted reactions to zero.

Step 3: If alternative optimal solutions exist to the inner biomass maximization problem,

fix transport rates of carbon dioxide, ammonia and/or oxygen to their values identified from the OptKnock solution and return to Step 2.

Step 4: If alternative solutions still exist, prevent the network from secreting metabolites associated with alternative solutions. Return to Step 2.
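The control flow of Steps 1 to 4 can be sketched as follows. This is only an illustrative outline: `solve_optknock`, `production_limits` and `secreted_byproducts` are hypothetical callables that the reader would have to supply (they stand for the optimization problems of Section 2.3), and the sketch fixes all transport rates returned by Step 1 at once, whereas the procedure above allows fixing only a subset.

```python
def modified_optknock(solve_optknock, production_limits, secreted_byproducts, tol=1e-6):
    """Outline of Steps 1-4; the three callables are hypothetical placeholders."""
    # Step 1: bilevel problem -> deleted reactions and transport rates at the optimum.
    knockouts, transport_at_optimum = solve_optknock()
    fixed_transport, blocked = {}, []

    # Step 2: min/max production of the target at maximum biomass.
    low, high = production_limits(knockouts, fixed_transport, blocked)

    # Step 3: a gap between the limits signals alternative optima of the
    # inner problem, so the transport rates are fixed and Step 2 is repeated.
    if high - low > tol:
        fixed_transport = dict(transport_at_optimum)
        low, high = production_limits(knockouts, fixed_transport, blocked)

    # Step 4: if a gap remains, block secretion of the competing metabolites
    # and repeat Step 2 once more.
    if high - low > tol:
        blocked = list(secreted_byproducts(knockouts, fixed_transport))
        low, high = production_limits(knockouts, fixed_transport, blocked)

    return {
        "knockouts": knockouts,
        "fixed_transport": fixed_transport,
        "blocked_secretion": blocked,
        "limits": (low, high),
    }
```

Using the gap between minimum and maximum production at maximum biomass as the test for alternative optima is an assumption of this sketch; it mirrors the max/min analysis described in Section 2.3.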

2.3 Computational protocol

The linear programming (LP) model for maximizing the biomass yield of a steady-state metabolic network comprising a set of metabolites $\mathcal{N} = \{1, \ldots, N\}$ and a set of reactions $\mathcal{M} = \{1, \ldots, M\}$ is given by

\[
\begin{aligned}
\text{maximize} \quad & v_{\text{biomass}} && \text{(max Biomass)} \\
\text{subject to} \quad & \sum_{j=1}^{M} S_{ij} v_j = 0, && \forall\, i \in \mathcal{N} \\
& v_{\text{pts}} + v_{\text{glk}} = v_{\text{glc\_uptake}} && \text{mmol/gDW.hr} \\
& v_{\text{ATP}} \ge v_{\text{atp\_main}} && \text{mmol/gDW.hr} \\
& v_j \ge 0, && \forall\, j \in \mathcal{M}_{\text{irrev}} \\
& v_j \le 0, && \forall\, j \in \mathcal{M}_{\text{secr\_only}} \\
& v_j \in \mathbb{R}, && \forall\, j \in \mathcal{M}_{\text{rev}}
\end{aligned}
\]

where $S_{ij}$ is the stoichiometric coefficient of metabolite i in reaction j, biomass formation is quantified as an aggregate reaction flux $v_{\text{biomass}}$ draining biomass components in their appropriate biological ratios (Neidhardt and Curtiss 1996), and $v_{\text{atp\_main}}$ is the non-growth associated minimum ATP requirement. The uptake rate of glucose, $v_{\text{glc\_uptake}}$, is fixed and encompasses both the phosphotransferase system ($v_{\text{pts}}$) and glucokinase ($v_{\text{glk}}$) uptake mechanisms. A basis glucose uptake rate of 10 mmol/gDW.hr was chosen in this study.

The set of reactions $\mathcal{M}$ is divided into reversible ($\mathcal{M}_{\text{rev}}$) and irreversible ($\mathcal{M}_{\text{irrev}}$) reactions. Transport reactions of metabolites that are only secreted by the network are included in the set $\mathcal{M}_{\text{secr\_only}}$. Note that the forward direction of transport fluxes corresponds to the uptake of the metabolite and the reverse direction corresponds to its secretion. Following from the basis uptake rate of glucose, fluxes are reported in mmol/gDW.hr and the biomass formation rate is given as grams biomass formed/gDW.hr or 1/hr.
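To make the structure of the (max Biomass) problem concrete, the following minimal sketch solves it for a hypothetical six-reaction toy network using `scipy.optimize.linprog`. The network, its stoichiometry and all names are assumptions made for illustration and are unrelated to the E. coli model; for brevity every reaction is written as irreversible with a non-negative flux, and no ATP maintenance constraint is included.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network (NOT the E. coli model). Columns are reactions:
# 0 uptake (-> A), 1 r1 (A -> 2 E), 2 r2 (A -> E + P),
# 3 biomass (A + E ->), 4 product secretion (P ->), 5 overflow (A ->)
S = np.array([
    [1, -1, -1, -1,  0, -1],   # metabolite A (substrate)
    [0,  2,  1, -1,  0,  0],   # metabolite E (energy carrier)
    [0,  0,  1,  0, -1,  0],   # metabolite P (product)
], dtype=float)
BIOMASS, PRODUCT = 3, 4
UPTAKE_RATE = 10.0  # basis uptake rate, mirroring the 10 mmol/gDW.hr basis

def max_biomass(S, uptake=UPTAKE_RATE):
    """Maximize the biomass flux subject to S v = 0 and v >= 0."""
    n = S.shape[1]
    c = np.zeros(n)
    c[BIOMASS] = -1.0                      # linprog minimizes, so negate
    bounds = [(uptake, uptake)] + [(0.0, None)] * (n - 1)
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x

v = max_biomass(S)
print("biomass flux:", round(v[BIOMASS], 4), "product flux:", round(v[PRODUCT], 4))
```

In this toy network the energy-efficient route r1 is preferred at maximum growth, so no product is secreted by the unmodified network.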

To incorporate reaction eliminations into the problem, we use binary variables $y_j$ (Burgard and Maranas 2001; Burgard et al. 2001), which take on a value of one if a particular reaction is active and a value of zero otherwise. An active reaction has an upper bound $v_j^{\max}$ and a lower bound $v_j^{\min}$ obtained by maximizing and minimizing each flux subject to the constraints in the max Biomass problem. We refer to this problem formulation as (max/min).

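The following bilevel mixed-integer programming problem is solved to identify which reactions should be eliminated from the network ($y_j = 0$) such that the secretion of a particular biochemical, $v_{\text{biochemical}}$, is maximized during maximal growth. This was briefly described in Step 1 of the computational procedure.

\[
\begin{aligned}
\underset{y_j}{\text{maximize}} \quad & v_{\text{biochemical}} && \text{(OptKnock)} \\
\text{s.t.} \quad & \left[
\begin{aligned}
\underset{v_j}{\text{maximize}} \quad & v_{\text{biomass}} && \text{(max Biomass)} \\
\text{s.t.} \quad & \sum_{j=1}^{M} S_{ij} v_j = 0, && \forall\, i \in \mathcal{N} \\
& v_{\text{pts}} + v_{\text{glk}} = v_{\text{glc\_uptake}} && \text{mmol/gDW.hr} \\
& v_{\text{ATP}} \ge v_{\text{atp\_main}} && \text{mmol/gDW.hr} \\
& v_{\text{biomass}} \ge v_{\text{biomass}}^{\min} && \text{1/hr} \\
& v_j^{\min} y_j \le v_j \le v_j^{\max} y_j, && \forall\, j \in \mathcal{M}
\end{aligned}
\right] \\
& \sum_{j \in \mathcal{M}} (1 - y_j) \le K \\
& y_j \in \{0, 1\}, \quad \forall\, j \in \mathcal{M}
\end{aligned}
\]

Here K is the number of allowable reaction eliminations and the max Biomass problem contains additional constraints ensuring that (i) a minimal level of biomass formation is attained and (ii) the fluxes through reactions proposed for elimination ($y_j = 0$) are set to zero.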

We employ a solution methodology based on linear programming duality theory to solve this problem efficiently. The solution procedure is described in detail by Burgard et al. (2003).
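For very small networks the bilevel logic can be illustrated without the duality-based reformulation by simply enumerating candidate deletions. The sketch below is such a brute-force illustration, not the solution method of Burgard et al. (2003): for each deletion set of size at most K it solves the inner biomass maximization and then maximizes product secretion at that growth level. The toy network, the reaction indices and the function names are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network. Columns: 0 uptake (-> A), 1 r1 (A -> 2 E),
# 2 r2 (A -> E + P), 3 biomass (A + E ->), 4 P secretion, 5 overflow (A ->)
S = np.array([
    [1, -1, -1, -1,  0, -1],   # A
    [0,  2,  1, -1,  0,  0],   # E
    [0,  0,  1,  0, -1,  0],   # P
], dtype=float)
UPTAKE, BIOMASS, PRODUCT = 0, 3, 4

def make_bounds(knockouts, min_biomass):
    bounds = [(0.0, None)] * S.shape[1]
    bounds[UPTAKE] = (10.0, 10.0)           # fixed substrate uptake
    bounds[BIOMASS] = (min_biomass, None)   # v_biomass >= minimum level
    for j in knockouts:
        bounds[j] = (0.0, 0.0)              # y_j = 0 forces v_j = 0
    return bounds

def maximize(index, bounds):
    c = np.zeros(S.shape[1])
    c[index] = -1.0                         # linprog minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
    return res.x[index] if res.success else None

def enumerate_knockouts(candidates, K, min_biomass=0.1, tol=1e-6):
    """Return (knockouts, product flux, growth) of the best deletion set."""
    best = ((), -1.0, 0.0)
    for k in range(K + 1):
        for ko in combinations(candidates, k):
            growth = maximize(BIOMASS, make_bounds(ko, min_biomass))  # inner problem
            if growth is None:
                continue                    # mutant cannot reach minimum growth
            product = maximize(PRODUCT, make_bounds(ko, growth - tol))  # outer objective
            if product is not None and product > best[1] + tol:
                best = (ko, product, growth)
    return best

knockouts, product, growth = enumerate_knockouts(candidates=[1, 2, 5], K=1)
print(knockouts, round(product, 4), round(growth, 4))
```

Enumeration scales combinatorially with the number of candidate reactions and K, which is why a single-level reformulation is needed for genome-scale networks.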

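To examine the coupling between biochemical production and biomass formation, we solve the following max/min optimization problem for various levels of biomass formation, $v_{\text{biomass}}^{\text{target}}$. This forms Step 2 of the procedure for all amino acid studies.

\[
\begin{aligned}
\text{maximize/minimize} \quad & v_{\text{biochemical}} && \text{(max/min)} \\
\text{subject to} \quad & \sum_{j=1}^{M} S_{ij} v_j = 0, && \forall\, i \in \mathcal{N} \\
& v_{\text{pts}} + v_{\text{glk}} = v_{\text{glc\_uptake}} && \text{mmol/gDW.hr} \\
& v_{\text{ATP}} \ge v_{\text{atp\_main}} && \text{mmol/gDW.hr} \\
& v_{\text{biomass}} \ge v_{\text{biomass}}^{\text{target}} && \text{1/hr} \\
& v_j \ge 0, && \forall\, j \in \mathcal{M}_{\text{irrev}} \\
& v_j \le 0, && \forall\, j \in \mathcal{M}_{\text{secr\_only}} \\
& v_j \in \mathbb{R}, && \forall\, j \in \mathcal{M}_{\text{rev}} \\
& v_j = 0, && \forall\, j \in \mathcal{M}_{\text{knockout}}
\end{aligned}
\]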

Note that the fluxes through reactions in the set $\mathcal{M}_{\text{knockout}}$ identified by OptKnock for deletion are fixed at zero. If the minimum value of biochemical production is considerably lower than the maximum value (or even zero) for most attainable biomass yields, biochemical production and biomass formation are said to be uncoupled. In that scenario, we investigated fixing the ammonia, carbon dioxide and/or oxygen transport fluxes to their values identified by

OptKnock to achieve a stronger coupling between biochemical and biomass production

(Step 3 of the computational procedure). Optimization problems were solved using

CPLEX 7.0 accessed via the GAMS modeling environment on an IBM RS6000-270 workstation.
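The coupling test of Step 2 can be illustrated on a small example. The sketch below is a minimal illustration under stated assumptions: the six-reaction toy network, the reaction indices and the function name `production_range` are hypothetical, and `scipy.optimize.linprog` stands in for the CPLEX/GAMS setup used in this work. It solves the (max/min) problem for a wild-type network and for a single-deletion mutant.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network. Columns: 0 uptake (-> A), 1 r1 (A -> 2 E),
# 2 r2 (A -> E + P), 3 biomass (A + E ->), 4 P secretion, 5 overflow (A ->)
S = np.array([
    [1, -1, -1, -1,  0, -1],   # A
    [0,  2,  1, -1,  0,  0],   # E
    [0,  0,  1,  0, -1,  0],   # P
], dtype=float)
UPTAKE, R1, BIOMASS, PRODUCT = 0, 1, 3, 4

def production_range(knockouts, target_biomass):
    """Return (min, max) product secretion subject to v_biomass >= target."""
    bounds = [(0.0, None)] * S.shape[1]
    bounds[UPTAKE] = (10.0, 10.0)
    bounds[BIOMASS] = (target_biomass, None)
    for j in knockouts:
        bounds[j] = (0.0, 0.0)             # fluxes of deleted reactions fixed at zero
    c = np.zeros(S.shape[1])
    c[PRODUCT] = 1.0
    limits = []
    for sign in (1.0, -1.0):               # +c minimizes, -c maximizes
        res = linprog(sign * c, A_eq=S, b_eq=np.zeros(S.shape[0]), bounds=bounds)
        if not res.success:
            raise ValueError("target biomass level is not attainable")
        limits.append(float(res.x[PRODUCT]))
    return tuple(limits)

wild_type = production_range((), 5.0)      # minimum is zero: uncoupled
mutant = production_range((R1,), 5.0)      # minimum equals maximum: coupled
print("wild type:", wild_type, "mutant:", mutant)
```

A minimum of zero for the wild type means the network is free to avoid secretion, whereas a strictly positive minimum for the mutant indicates that the product has become an obligatory by-product of growth in this toy example.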

2.4 Results

We focus on the five families into which the amino acids are categorized: the pyruvate family, the 3-phosphoglycerate family, the α-ketoglutarate family, the aspartate family and the family of aromatic amino acids. For the cases where a single precursor exists to all the amino acids in the family, we concentrate on improving its yield over that in the wild-type E. coli network. For example, chorismate is the common intermediate to the three amino acids in the family of aromatic amino acids. Alternatively, when increasing the formation of the immediate reactant does not map one-to-one with the enhancement of the desired amino acids, we focus on the individual amino acids. For example, increasing the flux towards pyruvate does not necessarily lead to more alanine production due to the large number of reactions in which the former metabolite is involved. We therefore use OptKnock to detect the reactions whose removal leads to a stronger coupling between biomass formation and alanine overproduction. The E. coli K-12 stoichiometric model (Edwards and Palsson 2000), comprising 720 reactions, is used to describe the metabolic network.

Figure 2.3 depicts the central metabolic and the amino acid biosynthetic pathways considered here to achieve amino acid overproduction. The metabolic network is scaled by allowing it to take up a fixed amount of glucose (10 millimoles per gram dry weight of cell per hour, or 10 mmol/gDW.hr) through the phosphotransferase system, the glucokinase pathway or both. The OptKnock framework allows for aerobic or anaerobic metabolism by enabling or eliminating the oxygen transport reaction. The network is allowed to secrete all metabolites for which it possesses the mechanism of export. In all five cases tested, an aerobic mode of growth is found necessary for overproduction of amino acids.

2.4.1 Chorismate formation (Precursor to aromatic amino acids)

In the production of the aromatic amino acids (tyrosine, tryptophan and phenylalanine), chorismate is the key intermediate. The aromatic amino acid biosynthesis pathway divides at the chorismate branching point of the metabolic network (see Figure 2.3b) to ensure the formation of all three aromatic amino acids. Consequently, we use chorismate formation as the surrogate objective function in OptKnock for overproduction of aromatic amino acids. Figure 2.4 compares the chorismate formation rates obtained for different total numbers of reactions deleted from the network. These rates are also compared with the chorismate production limits of the wild-type E. coli network. As shown, while the flux towards chorismate is very small when only five reactions are eliminated, the six-reaction strategy offers a substantial improvement with negligible changes found for seven or more deletions. Table 2.1 describes the six-deletion mutant and Table 2.2 summarizes the current strategies for aromatic amino acid overproduction.

The precursors for chorismate are erythrose-4-phosphate (E4P) and phosphoenolpyruvate (PEP). Not surprisingly, the reactions catalyzed by transaldolase B

(talB) and phosphoenolpyruvate carboxylase (ppc) (Figure 2.3a), which lead to the consumption of these compounds, are selected by OptKnock to be eliminated.

Interestingly, the removal of the ppc gene, whose product converts PEP to oxaloacetate (OAA), was proposed originally by Miller et al. (1987) and later patented (Backman 1992) for increasing carbon flux towards aromatic amino acids. Subsequently, it was suggested (Patnaik et al. 1995) that the exclusion of this gene may not always have the desired effect of increasing the flux towards chorismate under all experimental set-ups.

Therefore, we decided to investigate the effect of the ppc deletion by OptKnock with this perspective in mind. We find that a ppc-deficient E. coli network by itself does not lead to increased flux towards aromatic amino acids at maximum biomass yield because PEP is preferably converted to pyruvate instead of OAA which is utilized for biomass formation in the TCA cycle. Specifically, it appears that the deletion of the ppc reaction in conjunction with five others is needed to redirect flux towards chorismate. The calculated flux distributions for this mutant are shown in Figure 2.5a. Three of these reactions, pyruvate oxidase (pox), pyruvate dehydrogenase (lpdA) and pyruvate formate lyase (pfl) (see Figure 2.3a), involve pyruvate as a reactant. PEP is converted to pyruvate during glycolysis and this carbon flux may be lost for chorismate production because recycling of pyruvate to PEP is energetically expensive (Patnaik and Liao 1994).

These eliminations are thus aimed at preventing the conversion of PEP to pyruvate and help redirect more carbon flux towards chorismate production. This strategy, though quite non-intuitive, is in agreement with the observation that increasing the fluxes from

PEP towards a specific pathway by the elimination of competing enzymatic activities is not very straightforward (Stephanopoulos and Vallino 1991). The removal of these three pyruvate-consuming reactions is similar in effect to increasing the expression of phosphoenolpyruvate synthase (pps), which recycles pyruvate to PEP (Patnaik and Liao 1994). The deletion of 6-phosphogluconolactonase (pgl), suggested by OptKnock, is also essential for attaining the predicted yield because the removal of pgl and talB has a synergistic effect of making more E4P available. This is comparable to the overexpression of transketolase (tktA)

(Draths et al. 1992), transaldolase (talB) or both (Lu and Liao 1997; Sprenger et al.

1998). It should be noted that in the strategy identified by OptKnock, no flux is observed through the oxidative part of the pentose phosphate pathway. Accordingly, the availability of E4P can be increased by eliminating talB. Interestingly, the identified knockout strategy utilizes a complex pathway of multiple reactions for linking flux from ribose-5-phosphate to glyceraldehyde-3-phosphate and acetate thus fueling the TCA cycle.

Figure 2.5b explores the performance limits of the wild-type E. coli network with respect to chorismate formation and compares them with those of the suggested mutant network. These trade-off curves have been obtained by solving the formulation

(max/min) for chorismate production subject to fixed rates of growth to identify its allowable production range. It is found that approximately 65% of the maximum theoretical yield of chorismate can be reached at maximum biomass formation (point A1).

In turn, most of this chorismate (about 88%) can be converted to the required aromatic amino acid by blocking the secretion of the other two aromatic amino acids. The remainder is converted to di- and tri-hydrofolates and to the two other amino acids in proportion to their utilization in the formation of biomass. In the following case study, we observe that the energy balance of the network and the fixing of the transport rates of carbon dioxide and ammonia have a key role in the production of the pyruvate family of amino acids.


2.4.2 Alanine overproduction (Pyruvate family)

The pyruvate family comprises three amino acids: alanine, valine and leucine. Alanine is formed by the transamination of pyruvate. The identified reaction eliminations for alanine secretion are listed in Table 2.3 and the flux distributions are illustrated in Figures 2.6a and 2.6b.

Mutant A involves the removal of three reactions which decouple alanine production and biomass formation. These deletions have a cumulative effect of reducing the carbon flux from pyruvate to other compounds and of transaminating more pyruvate towards alanine. The removal of pyruvate dehydrogenase (lpdA, aceEF) inhibits the conversion of pyruvate to acetyl-CoA (ACCOA) and that of pyruvate formate lyase (pfl) prevents pyruvate from converting to formate and ACCOA. The elimination of the

ATPase reaction (which interconverts ATP into ADP with exchange of a proton) for increasing alanine production exhibits the importance of manipulating the bioenergetics in the system for biochemical production. Experimentally, this deletion has been reported to augment the amount of glucose through the glycolysis pathway (Jensen et al. 1993) increasing the availability of the precursors for alanine overproduction. In mutant B, the phosphofructokinase reaction (pfk) which converts fructose 6-phosphate (F6P) to fructose

1,6-diphosphate (FDP) with the utilization of one ATP molecule, is also deleted, so that the network relies on the Entner-Doudoroff pathway for the formation of pyruvate. The removal of this extra reaction alters the balance in such a way that the flux through the pyruvate oxidase reaction (pox) is reduced. This implies that less pyruvate is converted to acetate and subsequently to ACCOA for production of biomass through the TCA cycle.

Interestingly, the predicted alanine yield of mutant B is approximately 24% higher than

that of mutant A and is equal to 94% of the maximum theoretical yield of alanine. This amounts to a yield of 91.5% based on the weight of the raw material (glucose), which is considerably higher than the 45-55% yield currently obtained in industry (Ikeda 2003).

We next investigated the degree of coupling between biomass production and alanine secretion by solving the max/min formulation for varying levels of biomass production. Figure 2.6c shows that even at maximum biomass, a wide range of phenotypic behaviors (the collection of points between the desired point D and the undesired point of zero alanine production U) is possible (dotted vertical lines corresponding to mutants A* and B*) for both mutant networks. This result is similar to the lactate example (Burgard et al. 2003) discussed previously. The availability of a number of alternative conversion routes gives the network the flexibility of not secreting alanine at maximum biomass yield even after the identified reactions are eliminated. In contrast to the chorismate study where the fluxes towards ammonia, sulfate, phosphate, oxygen and carbon dioxide transport are fully specified at maximum biomass formation, the mutant networks for alanine overproduction retain a range of values for these transport fluxes. In order to restore a stoichiometric coupling between alanine secretion and biomass production, we decided to explore the fixing of carbon dioxide and ammonia transport fluxes at the values identified by OptKnock for point D. These transport fluxes are therefore treated as target optimization variables for directing all predicted optimal growth phenotypes towards the desired point D in Figure 2.6c. The predicted values of these transport rates for the mutant networks are listed in Table 2.3. Importantly, secretion of D-alanine is blocked so that the network does not have any incentive to

convert L-alanine to D-alanine. This is consistent with the finding that the side reaction catalyzed by alanine racemase, which converts L-alanine to D-alanine, reduces the yield of

L-alanine significantly in E. coli (Kumagai 2000). Figure 2.6c also shows the alanine production limits as a function of biomass formation after the ammonia uptake and carbon dioxide secretion rates have been fixed and the D-alanine secretion mechanism is blocked (solid lines corresponding to mutants A and B). Note that the network now has to secrete alanine at all levels of biomass production.

The alanine study demonstrated that the energy interactions of the network and the transport rates of carbon dioxide and ammonia are instrumental for coupling alanine overproduction with biomass formation. The next example on serine overproduction further reinforces the importance of energetics in biochemical overproduction in addition to modulating the transport rates of carbon dioxide and ammonia.

2.4.3 Serine overproduction (3-phosphoglycerate family)

Serine is derived from the glycolytic intermediate 3-phosphoglycerate (3PG).

OptKnock predicts the removal of three reactions to force the network to secrete serine

(see Table 2.4). The theoretical yield based on the weight of the raw material glucose is 57.6%, which is significantly higher than the current industrial yield of 30-35% based on the weight of sugar (Ikeda 2003).

Figure 2.7a depicts the flux distribution of the mutant network and Table 2.4 provides the list of deleted reactions. These include the serine degradation reaction, serine deaminase. Commercial strains, not surprisingly, are also deficient in serine degradation

(Ikeda 2003). Also, OptKnock suggests that the reaction catalyzed by phosphoglycerate mutase (gpm) or enolase (eno) (see Figure 2.3a) should be removed. The elimination of

the former is intuitive because phosphoglycerate mutase consumes 3PG, the key precursor to the formation of serine.

The deletion of enolase (which converts 2PG to PEP) has an equivalent effect because it blocks the path by which 3PG is ultimately converted to PEP during glycolysis. The third deleted reaction suggested by OptKnock is the respiratory ATPase reaction whose removal alters the metabolism so that the production of serine is favored energetically.

As was the case with alanine overproduction, a tight coupling between serine secretion and biomass formation is obtained only after the ammonia and carbon dioxide transport rates are fixed to those determined from the bilevel optimization problem at maximum biomass production. The values of these rates for the serine-secreting mutant are listed in Table 2.4. The serine production limits of the mutant network, plotted as a function of biomass formation are shown in Figure 2.7b. While the wild-type E. coli network may not necessarily secrete serine (dotted lines), the resulting mutant network has to overproduce serine after the elimination of the suggested reactions and the fixation of the transport rates of ammonia and carbon dioxide (point A1). The next section elucidates the strategies for aspartate overproduction and demonstrates that not only carbon dioxide and ammonia transport rates, but the oxygen transport rate also plays a vital role in influencing the aspartate producing potential of the network.

2.4.4 Aspartate overproduction (Aspartate family)

Aspartate, asparagine, lysine (formed via the diaminopimelic acid pathway), methionine, threonine and isoleucine, are members of the aspartate family of amino acids. Aspartate is formed from the citric acid cycle intermediate oxaloacetate and is subsequently converted to the other five amino acids. However, oxaloacetate is involved in a number of different metabolic reactions and there is no guarantee that increasing the

production of this metabolite will eventually lead to the overproduction of any of the six amino acids derived from it. Therefore, in this study we modify the objective function of

OptKnock to be the direct maximization of aspartate which is the key precursor to all other amino acids in the family.

OptKnock identifies a mutant with four deleted reactions which is predicted to reach as high as 75.3% of the maximum theoretical yield of aspartate at maximum growth. Table 2.5 lists two different strategies for overproducing aspartate. The flux distribution for mutant A, where three reactions are deleted, is outlined in Figure 2.8a.

The removal of 2-ketoglutarate dehydrogenase (sucAB, lpdA) redirects more flux towards the formation of aspartate from oxaloacetate, which would otherwise have been utilized in the TCA cycle. The deletion of acetate kinase (ack) or alternatively phosphotransacetylase (pta) also allows for an increase in the formation of aspartate.

These reactions are reversible and the network tends to convert acetate to ACCOA so that the flux through TCA cycle is maximized leading to maximum biomass formation.

However, mutant A shows a reduced flux through the TCA cycle preventing consumption of oxaloacetate and forcing the flux through these reversible reactions to zero. The elimination of ATPase (atp) also favors the formation of aspartate. Mutant B involves the removal of pyruvate kinase (pyk) in addition to those already eliminated in mutant A and its flux distribution is shown in Figure 2.8b. Notably, the exclusion of this additional reaction increases the yield of aspartate in mutant B by approximately 89% over that in mutant A. This glycolytic reaction converts PEP to pyruvate which can be converted into a number of metabolites. Consequently, by its removal, the network can

channel the available PEP to oxaloacetate, the immediate precursor to aspartate, through the phosphoenolpyruvate carboxylase (ppc) reaction.

The coupling between aspartate secretion and biomass formation is a strong function of the transport rates of oxygen, carbon dioxide and ammonia. Figure 2.8c shows the production limits of the networks for mutants A and B after these transport rates have been fixed (solid lines) and compares them with those of the wild type E. coli network. While the latter network may not secrete aspartate, both the mutant networks are required to overproduce aspartate due to the reengineered network stoichiometry. The aspartate study manifests the significance of oxygen transport rate for ensuring the overproduction of aspartate from the network. Next, the glutamate study shows that not only reaction eliminations and fixing of transport rates, but also blocking the secretion of key metabolites is sometimes needed for achieving the desired overproduction.

2.4.5 Glutamate overproduction (α-ketoglutarate family)

The common precursor to the amino acids in this family is α-ketoglutarate, a compound participating in the TCA cycle. As in the case of oxaloacetate, this metabolite branches into a number of pathways. The reaction scheme for the formation of these amino acids is also similar to the one for the amino acids in the aspartate family, as α-ketoglutarate is transaminated to form glutamate, which is subsequently converted to glutamine, proline and arginine, the other three amino acids in the family. Therefore, we adopt a strategy similar to the one described in the previous example where the objective function is to directly maximize the secretion flux for glutamate.

The deletion mutants which lead to higher yields of glutamate are listed in Table

2.6. The yield predicted for the mutant B network is as high as 84% of the maximum

theoretical yield of glutamate, which corresponds to a yield of approximately 78% based on glucose weight. Current industrial yields of glutamic acid are in the range of 45-55% (Ikeda 2003). In mutant A, α-ketoglutarate dehydrogenase (sucAB, lpdA), phosphotransacetylase (pta) or acetate kinase (ack), and ATPase (atp) are deleted. The flux distribution for this mutant network is drawn in Figure 2.9a. The first reaction leads to the conversion of α-ketoglutarate (or 2-oxoglutarate), the reactant for glutamate formation, to succinate. Note that the 2-oxoglutarate dehydrogenase complex, which catalyzes this reaction, is reported to have very low activities in all strains of

Corynebacterium glutamicum (Kumagai 2000; Kimura 2003) which are used for the industrial production of glutamate. Glutamate-producing mutants of E. coli have also been reported with the metabolic pathway between 2-oxoglutarate and succinate blocked

(Kimura 2003). The deletion of phosphotransacetylase or acetate kinase prevents acetyl-CoA (ACCOA) from being converted to acetate. ACCOA is a key metabolite in the TCA cycle during which α-ketoglutarate is produced. ATPase removal once again favors the production of glutamate. Mutant B, whose distribution of fluxes can be obtained from

Figure 2.9b suggests the removal of the glycolytic reaction pyruvate kinase in addition to the eliminations discussed already for mutant A. This prevents the conversion of phosphoenolpyruvate to pyruvate. In mutant A, the network chooses to directly convert

PEP to oxaloacetate so that maximum flux is directed towards the TCA cycle, without leaking any considerable amount to pyruvate or pyruvate-derived products.

Interestingly, the reactions identified by OptKnock are exactly the same as those for the case of aspartate. However, the specific rate of oxygen transport suggested by

OptKnock is different for glutamate overproduction (see Table 2.6). Also, a strong

coupling between biomass formation and the secretion of glutamate can be obtained only after the export routes of D-alanine, acetate, lactate, ethanol, pyruvate, fumarate and malate are blocked. In mutant B, the transport reaction for acetate is left enabled. The extra functionality removed in this case compensates for the elimination of the acetate transport reaction from the network. Figure 2.9c contrasts the glutamate production limits of the wild-type E. coli network with those of the mutant networks. While the wild-type E. coli network does not have to produce any glutamate for any level of biomass formation, mutant A is “forced” to secrete glutamate when the biomass production reaches approximately 75% of its maximum value (solid red line). Moreover, for mutant B, the modified stoichiometric constraints on the network cause it to secrete glutamic acid for any rate of biomass formation, with the highest yield obtained when the latter rate is

0.057 per hour (solid dark blue line).

2.5 Conclusion

In this chapter, we modified and extended the bilevel optimization framework

OptKnock (Burgard et al. 2003) to predict the reactions whose elimination from the E. coli metabolic network (Edwards and Palsson 2000) enhances amino acid production at maximum biomass yield. The stoichiometric constraints ensured that the network secreted the desired amino acid or precursor as an obligatory by-product of growth. We focused on the five families into which the amino acids are categorized. The framework correctly predicted that an aerobic environment is required for overproducing all amino acids. Also, the crucial role that energy balance plays in amino acid formation was computationally verified. Specifically, we found that the removal of the ATPase reaction can potentially augment the network’s capacity to produce amino acids such as alanine,

serine, aspartate and glutamate. This deletion has also been observed to be favorable for the production of valine and leucine (Tomita et al. 1996; Ikeda 2003), the other two members of the pyruvate family besides alanine. The decrease in activity of ATPase is associated with a low energy status of the cell. This induces a higher uptake of glucose (Jensen et al. 1993) in experimental cultures and hence a higher rate of its metabolism by the glycolytic pathway. The rates of transport of oxygen, carbon dioxide and nitrogen (as ammonia) were also included as optimization variables in the formulation due to their importance in ensuring the elimination of any remaining degrees of freedom for the network that lead to the decoupling of biomass formation and biochemical overproduction. Notably, the transport rates of these compounds are important central variables in commercial bioreactors and are tightly regulated to maintain conditions which preclude the formation of undesirable side products (Kumagai 2000; Ikeda 2003).

OptKnock suggested not only straightforward strategies involving elimination of competing pathways, but also identified reactions whose removal leads to indirect channeling of more carbon flux towards amino acid production. A case in point is serine overproduction, where OptKnock suggested the removal of not only the serine degradation reaction, serine deaminase, but also the enolase and the ATPase functionalities. By considering the metabolic network in its entirety, OptKnock has the advantage of assessing the global impact of gene deletions and thus helps to reconcile the sometimes contradictory strategies suggested for different experimental set-ups. Specifically, we addressed the ambiguity associated with the channeling of flux from PEP to aromatic acid production by the removal of the ppc gene and offered a plausible explanation. We found that the removal of ppc can lead to the redirection of carbon flux into the formation of chorismate only if it is accompanied by the removal of the pyruvate oxidase, pyruvate dehydrogenase and pyruvate formate lyase functionalities.

Given the inherently incomplete nature of the employed models and the fact that the substrate uptake level (i.e., that of glucose) is fixed, it is important to interpret the OptKnock predictions carefully. For example, blocking a reaction is in many cases equivalent to overexpressing a competing pathway. Finally, the model assumes that all metabolites with export mechanisms can be secreted by the network. However, membrane permeability can be an important factor for excretion of some amino acids, as noted for threonine (Debabov 2003). Despite these limitations, OptKnock, as shown above, provides interesting suggestions for strain optimization and establishes a foundation for further modeling enhancements by the addition of kinetic and regulatory information.

Table 2.1: Reaction deletions for maximum formation of chorismate. The maximum theoretical yield of chorismate is 5.33 mmol/gDW/hr.

ID  No. of knockouts  Growth rate (1/hr)  Chorismate (mmol/gDW/hr)
A   6                 0.32                3.44

ID  No.  Reaction                               Enzyme
A   1    PYR + COA + NAD → NADH + CO2 + ACCOA   Pyruvate dehydrogenase
A   2    D6PGL → D6PGC                          6-Phosphogluconolactonase
A   3    T3P1 + S7P ↔ E4P + F6P                 Transaldolase B
A   4    PYR + COA → ACCOA + FOR                Pyruvate formate lyase 1
A   5    PEP + CO2 → OA + PI                    Phosphoenolpyruvate carboxylase
A   6    PYR + Q → AC + CO2 + QH2               Pyruvate oxidase

Table 2.2: Different approaches for maximizing flux towards aromatic acid production from the central metabolic pathway

No.  Approach                                               Effect                                                     Reference
1    Overexpression of transketolase gene tktA              Increases the availability of erythrose-4-phosphate (E4P)  Draths et al.
2    Inactivation of phosphoenolpyruvate carboxylase (ppc)  Prevents the consumption of phosphoenolpyruvate (PEP)      Miller et al., 1987; Backman, 1992
3    Amplification of phosphoenolpyruvate synthase          Recycles pyruvate to PEP                                   Patnaik and Liao, 1994
4    Using non-PTS sugar                                    Prevents the conversion of PEP to pyruvate                 Flores et al., 1996
5    Overexpression of transaldolase                        Increases the availability of E4P                          Sprenger et al., 1998; Lu and Liao, 1997

Table 2.3: The deletion mutants for overproducing alanine. The maximum theoretical yield of alanine is 19.77 mmol/gDW/hr. In all mutants, the secretion of D-alanine is blocked.

Uptake and secretion rates are in mmol/gDW/hr.

ID  No. of knockouts  Growth rate (1/hr)  O2 uptake  NH3 uptake  CO2 secretion  Alanine secretion
A   3                 0.23                5.58       17.18       6.13           14.95
B   4                 0.04                2.54       18.97       2.65           18.53

ID  No.  Reaction                               Enzyme
A   1    PYR + COA + NAD → NADH + CO2 + ACCOA   Pyruvate dehydrogenase
A   2    PYR + COA → ACCOA + FOR                Pyruvate formate lyase
A   3    ATP ↔ ADP + PI + 4 HEXT                ATPase
B   1    F6P + ATP → FDP + ADP                  Phosphofructokinase
B   2    PYR + COA + NAD → NADH + CO2 + ACCOA   Pyruvate dehydrogenase
B   3    PYR + COA → ACCOA + FOR                Pyruvate formate lyase
B   4    ATP ↔ ADP + PI + 4 HEXT                ATPase

Table 2.4: The deletion mutants for overproduction of serine. The maximum theoretical yield of serine is 20 mmol/gDW/hr.

Uptake and secretion rates are in mmol/gDW/hr.

ID  No. of knockouts  Growth rate (1/hr)  O2 uptake  NH3 uptake  CO2 secretion  Serine secretion
A   3                 0.04                9.86       10.27       0.12           9.87
B   4                 0.03                9.69       10.25       0.02           9.88

ID  No.  Reaction                   Enzyme
A   1    2PG ↔ PEP or 3PG ↔ 2PG     Enolase or Phosphoglycerate mutase
A   2    ATP ↔ ADP + PI + 4 HEXT    ATPase
A   3    SER → PYR + NH3            Serine deaminase 1
B   1    3PG ↔ 2PG or 2PG ↔ PEP     Enolase or Phosphoglycerate mutase
B   2    FUM ↔ MAL                  Fumarase
B   3    ATP ↔ ADP + PI + 4 HEXT    ATPase
B   4    SER → PYR + NH3            Serine deaminase

Table 2.5: The deletion mutants for overproduction of aspartate. The maximum theoretical yield of aspartate is 19.05 mmol/gDW/hr.

Uptake and secretion rates are in mmol/gDW/hr.

ID  No. of knockouts  Growth rate (1/hr)  O2 uptake  NH3 uptake  CO2 secretion  Aspartate secretion
A   3                 0.26                21.36      10.19       10             7.6
B   4                 0.05                9.57       14.9        9.55 (uptake)  14.34

ID  No.  Reaction                                           Enzyme
A   1    AKG + NAD + COA → CO2 + NADH + SUCCOA              2-Ketoglutarate dehydrogenase
A   2    ACCOA + PI ↔ ACTP + COA or ACTP + ADP ↔ ATP + AC   Phosphotransacetylase or Acetate kinase
A   3    ATP ↔ ADP + PI + 4 HEXT                            ATPase
B   1    PEP + ADP → PYR + ATP                              Pyruvate kinase
B   2    AKG + NAD + COA → CO2 + NADH + SUCCOA              2-Ketoglutarate dehydrogenase
B   3    ACCOA + PI ↔ ACTP + COA or ACTP + ADP ↔ ATP + AC   Phosphotransacetylase or Acetate kinase
B   4    ATP ↔ ADP + PI + 4 HEXT                            ATPase

Table 2.6: The deletion mutants for overproduction of glutamate. The maximum theoretical yield of glutamate is 11.36 mmol/gDW/hr.

Uptake and secretion rates are in mmol/gDW/hr.

ID  No. of knockouts  Growth rate (1/hr)  O2 uptake  Glutamate secretion
A   3                 0.26                16.75      5.14
B   4                 0.05                9.58       9.56

ID  No.  Reaction                                           Enzyme
A   1    AKG + NAD + COA → CO2 + NADH + SUCCOA              2-Ketoglutarate dehydrogenase
A   2    ACCOA + PI ↔ ACTP + COA or ACTP + ADP ↔ ATP + AC   Phosphotransacetylase or Acetate kinase
A   3    ATP ↔ ADP + PI + 4 HEXT                            ATPase
B   1    PEP + ADP → PYR + ATP                              Pyruvate kinase
B   2    AKG + NAD + COA → CO2 + NADH + SUCCOA              2-Ketoglutarate dehydrogenase
B   3    ACCOA + PI ↔ ACTP + COA or ACTP + ADP ↔ ATP + AC   Phosphotransacetylase or Acetate kinase
B   4    ATP ↔ ADP + PI + 4 HEXT                            ATPase

[Plot: lactate production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for the wild-type E. coli network and the lactate mutant, with points A and B marked.]

Figure 2.1: The limits of lactate production of the network.

maximize    biochemical production    (OptKnock)
(by reaction eliminations)
subject to
    maximize    biomass formation    (Primal)
    (over fluxes)
    subject to
        • fixed substrate uptake
        • network stoichiometry
        • blocked reactions identified by outer problem
        • bounds on O2, CO2 and NH3 transport rates
    Number of knockouts ≤ limit

Figure 2.2: The modified bilevel formulation of OptKnock

[Pathway maps: (a) central metabolism of E. coli with gene labels; (b) amino acid biosynthetic pathways branching from central metabolism.]

Figure 2.3: (a) The central metabolism in E. coli. (b) The amino acid biosynthetic pathways in E. coli.

[Bar chart: rate of chorismate formation (mmol/gDW/hr) versus number of deleted reactions (wild type, 5, 6, 7).]

Figure 2.4: Comparison of chorismate formation rates for different numbers of reaction eliminations in the network.

[Panel (a): flux map of central metabolism for mutant A. Panel (b): chorismate production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for E. coli and mutant A.]

Figure 2.5: (a) Flux distribution for mutant A (six reaction deletions) with enhanced chorismate formation rate. (b) Chorismate formation limits of mutant A.

[Panels (a) and (b): flux maps of central metabolism for mutants A and B. Panel (c): alanine production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for E. coli, mutant A, mutant B, mutant A* and mutant B*.]

Figure 2.6: (a) Flux distribution for mutant A (with three reaction eliminations) and (b) mutant B (with four reaction eliminations) overproducing alanine. (c) Alanine production limits of the mutant networks.

[Panel (a): flux map of central metabolism for mutant A. Panel (b): serine production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for E. coli, mutant A and mutant A*.]

Figure 2.7: (a) Flux distribution for the serine-overproducing mutant (three reactions removed). (b) Serine production abilities of the network for mutant A compared with those of the wild-type E. coli network.

[Panels (a) and (b): flux maps of central metabolism for mutants A and B. Panel (c): aspartate production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for E. coli, mutant A, mutant B, mutant A* and mutant B*.]

Figure 2.8: (a) Flux distribution for the aspartate-secreting mutant A (with three reactions eliminated) and (b) for mutant B (with four reactions eliminated). (c) Aspartate producing limits of the mutant networks compared with those of wild type E. coli network.

[Panels (a) and (b): flux maps of central metabolism for mutants A and B. Panel (c): glutamate production limits (mmol/gDW/hr) versus biomass formation rate (1/hr) for E. coli, mutant A, mutant B, mutant A* and mutant B*.]

Figure 2.9: (a) Flux distribution for the glutamate-secreting mutant A and (b) for mutant B. (c) Glutamate producing limits of the mutant networks compared with those of wild type E. coli network.

Chapter 3

Redesigning strains by reaction additions and deletions

3.1 Background

A fundamental goal in systems biology is to elucidate the complete “palette” of biotransformations accessible to nature in living systems. This goal parallels the continuing quest in biotechnology to construct microbial strains capable of accomplishing an ever-expanding array of desired biotransformations. These biotransformations are aimed at products that range from simple precursor chemicals (Nakamura and Whited 2003; Causey et al. 2004) or complex molecules such as carotenoids (Misawa et al. 1991), to electrons in biofuel cells (Liu et al. 2004) or batteries (Bond et al. 2002; Bond and Lovley 2003), to even microbes capable of precipitating heavy metal complexes in bioremediation applications (Finneran et al. 2002; Lovley 2003; Methe et al. 2003).

Recent developments in molecular biology and recombinant DNA technology have ushered in a new era in the ability to shape the gene content and expression levels for microbial production strains in a direct and targeted fashion (Stephanopoulos 2002). The astounding range and diversity of these newly acquired capabilities and the scope of biotechnological applications imply that now more than ever we need modeling and computational aids to a priori identify the optimal sets of genetic modifications for strain optimization projects.

The recent availability of genome-scale models of microbial organisms has provided the pathway reconstructions necessary for developing computational methods aimed at identifying strain engineering strategies (Bailey 2001). These models, already available for H. pylori (Schilling et al. 2002), E. coli (Edwards and Palsson 2000; Reed et al. 2003), S. cerevisiae (Forster et al. 2003) and other microorganisms (Van Dien and Lidstrom 2002; David et al. 2003; Valdes et al. 2003), provide successively refined abstractions of the microbial metabolic capabilities. An automated process to expedite the construction of stoichiometric models from annotated genomes (Segre et al. 2003) promises to further accelerate the metabolic reconstructions of several microbial organisms. At the same time, individual reactions are deposited in databases such as KEGG, EMP, MetaCyc, UM-BBD, and many more (Selkov et al. 1998; Overbeek et al. 2000; Karp et al. 2002; Ellis et al. 2003; Kanehisa et al. 2004; Krieger et al. 2004), forming encompassing and growing collections of the biotransformations for which we have direct or indirect evidence of existence in different species. Already many thousands of such reactions have been deposited; however, unlike organism-specific metabolic reconstructions (Edwards and Palsson 2000; Schilling et al. 2002; Forster et al. 2003; Reed et al. 2003), these compilations include reactions from not a single but many different species in a largely uncurated fashion. This means that currently there exists an ever-expanding collection of microbial models and at the same time ever more encompassing compilations of non-native functionalities. This newly acquired plethora of data has brought to the forefront a number of computational and modeling challenges which form the scope of this chapter. Specifically, how can we systematically select, from the thousands of functionalities catalogued in various biological databases, the appropriate set of pathways/genes to recombine into existing production systems such as E. coli so as to endow them with the desired new functionalities? Subsequently, how can we identify which competing functionalities to eliminate to ensure high product yield as well as viability?

Existing strategies and methods for accomplishing this goal include database queries to explore all feasible bioconversion routes from a substrate to a target compound from a given list of biochemical transformations (Seressiotis and Bailey 1988; Mavrovouniotis et al. 1990). More recently, elegant graph theoretic concepts (e.g., P-graphs (Fan et al. 2002) and the k-shortest paths algorithm (Eppstein 1994)) were pioneered to identify novel biotransformation pathways based on the tracing of atoms (Arita 2000; Arita 2004), enzyme function rules and thermodynamic feasibility constraints (Hatzimanikatis et al. 2003). Also, an interesting heuristic search approach that uses the enzymatic biochemical reactions found in the KEGG database (Kanehisa et al. 2004) to construct a connected graph linking the substrate and the product metabolites was recently proposed (McShan et al. 2003). Most of these approaches, however, generate linear paths that link substrates to final products without ensuring that the rest of the metabolic network is balanced and that metabolic imperatives on cofactor usage/generation and energy balances are met.

In this chapter, we introduce a hierarchical optimization-based framework, OptStrain, to identify stoichiometrically-balanced pathways to be generated upon recombination of non-native functionalities into a host organism to confer the desired phenotype. Candidate metabolic pathways are identified from an ever-expanding array of thousands (currently 5,738) of reactions pooled together from different stoichiometric models and publicly available databases such as KEGG (Kanehisa et al. 2004). Note that the identified pathways satisfy maximum yield considerations while the choice of substrates can be treated as optimization variables. Important information pertaining to the cofactor/energy requirements associated with each pathway is deduced, enabling the comparison of candidate pathways with respect to the aforementioned criteria. Production host selection is examined by successively minimizing the reliance on heterologous genes while satisfying the performance targets identified above. A gene set that encodes for all the enzymes needed to catalyze the identified non-native functionalities is then compiled by accounting for isozymes and multi-subunit enzymes. Subsequently, gene deletions are identified (Burgard et al. 2003; Pharkya et al. 2003) in the augmented host networks to improve product yields by removing competing functionalities which decouple biochemical production and growth objectives. The breadth and scope of OptStrain is demonstrated by addressing in detail two different product molecules (i.e., hydrogen and vanillin) which lie at the two extremes in terms of product molecule size. Briefly, computational results in some cases match existing strain designs and production practices, whereas in others they pinpoint novel engineering strategies.

3.2 The OptStrain Procedure

The first challenge addressed in this chapter is to develop a systematic computational framework to identify which functionalities to add to an organism-specific metabolic network (e.g., E. coli (Edwards and Palsson 2000; Reed et al. 2003), S. cerevisiae (Forster et al. 2003), C. acetobutylicum (Papoutsakis 1984; Desai et al. 1999), etc.) to enable a desired biotransformation. Specifically, we aim to pinpoint gene additions identified from a Universal database composed of more than 5,700 reactions as well as to investigate multiple hosts and substrate choices. Note that the gene additions are identified by fulfilling both criteria of maximal product yield and minimum usage of non-native reactions. Due to the extremely large size of the compiled database and the presence of multiple and sometimes conflicting objectives that need to be simultaneously satisfied, we developed the OptStrain procedure illustrated in Figure 3.1. Each step introduces different computational challenges arising from the specific structure and size of the optimization problems that need to be solved.

Step 1. Automated downloading and curation of the reactions in our Universal database to ensure stoichiometric balance;

Step 2. Calculation of the maximum theoretical yield of the product given a substrate choice without restrictions on the reaction origin (i.e., native or non-native);

Step 3. Identification of a stoichiometrically-balanced pathway(s) that minimizes the number of non-native functionalities in the examined production host given the maximum theoretical yield and the optimum substrate(s) found in Step 2. Alternative pathways that meet both criteria of maximum yield and minimum number of non-native reactions are generated along with comparisons between different host choices. Information pertaining to the cofactor/energy usage associated with each pathway is also determined at this stage. Finally, one or multiple gene sets that ensure the presence of the targeted biotransformations by encoding for the appropriate enzymes are derived at this stage;

Step 4. Incorporation of the identified non-native biotransformations into the stoichiometric models, if available, of the examined microbial production hosts. The OptKnock framework is next used (Burgard et al. 2003; Pharkya et al. 2003) on these augmented models to suggest gene deletions that ensure the production of the desired product becomes an obligatory byproduct of growth by “shaping” the connectivity of the metabolic network.

3.2.1 Curation of the database

The first step of the OptStrain procedure begins with the downloading and curation of reactions acquired from various sources in our Universal database. Specifically, given the fact that new reactions are incorporated in the KEGG database on a regular basis, we have developed customized scripts using Perl (Brown 1999) to automatically download all reactions in the database on a regular basis and convert them into a format readable by the GAMS (Brooke et al. 2003) optimization environment. A different script is then used to parse the number of atoms of each element in every compound. The number of atoms of each type among the reactants and products of every reaction is calculated, and reactions which are elementally unbalanced are excluded from consideration. In addition, compounds with an unspecified number of repeat units (e.g., trans-2-Enoyl-CoA represented by C25H39N7O17P3S(CH2)n) or unspecified alkyl groups R in their chemical formulae are removed from the downloaded sets. This step enables the automated downloading of reactions present in genomic databases and the subsequent verification of their elemental balance, forming large-scale sets of functionalities to be used as recombination targets.
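The elemental balance check described above can be illustrated with a minimal Python sketch. It is not the Perl script used in this work: the function names `parse_formula` and `is_balanced` and the small formula dictionary are hypothetical, only simple formulae without parentheses are handled, and charges are ignored.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C6", "H12", "O".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula(formula):
    """Return a Counter of atoms, or None if the formula is not fully specified."""
    if re.search(r"R(?![a-z])", formula):
        return None  # generic alkyl group R
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        return None  # parentheses or repeat units such as (CH2)n
    atoms = Counter()
    for element, count in _TOKEN.findall(formula):
        atoms[element] += int(count) if count else 1
    return atoms

def is_balanced(reactants, products, formulas):
    """reactants/products map compound name -> stoichiometric coefficient."""
    def side_total(side):
        total = Counter()
        for name, coeff in side.items():
            atoms = parse_formula(formulas[name])
            if atoms is None:
                return None  # unspecified compound: reaction cannot be verified
            for element, n in atoms.items():
                total[element] += coeff * n
        return total
    left, right = side_total(reactants), side_total(products)
    return left is not None and right is not None and left == right

# Illustrative compounds only.
formulas = {"glucose": "C6H12O6", "lactate": "C3H6O3", "pyruvate": "C3H4O3"}
print(is_balanced({"glucose": 1}, {"lactate": 2}, formulas))   # balanced
print(is_balanced({"glucose": 1}, {"pyruvate": 2}, formulas))  # hydrogen missing
```

A reaction involving a compound such as C25H39N7O17P3S(CH2)n is rejected because its formula cannot be parsed into fixed atom counts.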

3.2.2 Determination of the maximum yield

Once the reaction sets are determined, the second step is geared towards determining the maximum theoretical yield of the target product from a range of substrate choices, without restrictions on the number or origin of the reactions used. The maximum theoretical product yield is obtained for a unit uptake rate of substrate by maximizing the sum of all reaction fluxes producing minus those consuming the target metabolite, weighted by the stoichiometric coefficient of the target metabolite in these reactions. The maximization of this yield subject to stoichiometric constraints and transport conditions yields a Linear Programming (LP) problem (see Section 3.3 for mathematical formulation). Given the computational tractability of LP problems, even for many thousands of reactions, a large number of different substrate choices can be explored thoroughly here.

3.2.3 Identification of the minimum number of non-native reactions

The next step in OptStrain uses the knowledge of the maximum theoretical yield to determine the minimum number of non-native functionalities that need to be added into a specific host organism network. Mathematically, this is achieved by first introducing a set of binary variables $y_j$ that serve as switches to turn the associated reaction fluxes $v_j$ on or off:

$$v_j^{\min} \cdot y_j \le v_j \le v_j^{\max} \cdot y_j$$

Note that the binary variable $y_j$ assumes a value of one if reaction j is active and a value of zero if it is inactive. This constraint is imposed only on reactions associated with genes heterologous to the specified production host. The parameters $v_j^{\min}$ and $v_j^{\max}$ are set to very low and very high values, respectively, which the reaction flux $v_j$ cannot attain. This leads to a Mixed Integer Linear Programming (MILP) model for finding the minimum number of genes to be added into the host organism network while meeting the yield target for the desired product. This formulation, described in the next section, enables the exploration of tradeoffs between the required number of heterologous genes and the maximum theoretical product yield, and also the iterative identification of all alternate optimal solutions using integer cut constraints. The end result of this step is a set of distinct pathways and corresponding gene complements that provide a ranked list of all alternatives for the efficient conversion of the substrate(s) into the desired product.

3.2.4 Incorporating the non-native reactions into the host organism’s stoichiometric model

Upon identification of the appropriate host organism, the analysis proceeds with an organism-specific stoichiometric model augmented by the set of the identified non-native reactions. However, simply adding genes to a microbial production strain will not necessarily lead to the desired overproduction because microbial metabolism is primed to be as responsive as possible to the imposed selection pressures (e.g., the need to outgrow its competition). These survival objectives are typically in direct competition with the overproduction of targeted biochemicals. To combat this, we use our previously developed bilevel computational framework, OptKnock (Burgard et al. 2003; Pharkya et al. 2003), to eliminate those functionalities which uncouple the cellular fitness objective, typically exemplified as the biomass yield, from the maximum yield of the product of interest.

3.3 Mathematical formulation

The redesign of microbial metabolic networks to enable enhanced product yields by employing the OptStrain procedure requires the solution of multiple types of optimization problems. The first optimization task (Step 2) involves the determination of the maximum yield of the desired product in a metabolic network comprised of a set $\mathcal{N} = \{1, \ldots, N\}$ of metabolites and a set $\mathcal{M} = \{1, \ldots, M\}$ of reactions. The Linear Programming (LP) problem for maximizing the yield on a weight basis of a particular product $P$ (in the set $\mathcal{N}$) from a set $\mathcal{R}$ of substrates is formulated as:

$$\text{Maximize} \quad MW_i \cdot \sum_{j=1}^{M} S_{ij} v_j, \qquad i = P$$

$$\text{subject to} \quad \sum_{j=1}^{M} S_{ij} v_j \ge 0, \qquad \forall\, i \in \mathcal{N},\; i \notin \mathcal{R} \tag{1}$$

$$\sum_{i \in \mathcal{R}} \left( MW_i \cdot \sum_{j=1}^{M} S_{ij} v_j \right) = -1 \tag{2}$$

where $MW_i$ is the molecular weight of metabolite i, $v_j$ is the molar flux of reaction j, and $S_{ij}$ is the stoichiometric coefficient of metabolite i in reaction j. The metabolite set $\mathcal{N}$ is comprised of approximately 4,800 metabolites and the reaction set $\mathcal{M}$ consists of more than 5,700 reactions. The inequality in constraint (1) allows only for secretion and prevents the uptake of all metabolites in the network other than the substrates in $\mathcal{R}$. Constraint (2) scales the results for a total substrate uptake flux of one unit of mass. The reaction fluxes $v_j$ can either be irreversible (i.e., $v_j \ge 0$) or reversible, in which case they can assume either positive or negative values. Reactions which enable the uptake of essential-for-growth compounds such as oxygen, carbon dioxide, ammonia, sulfate and phosphate are also present.
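To make the structure of this LP concrete, the following is a minimal sketch on a hypothetical three-reaction toy network, solved with SciPy's `linprog` rather than the GAMS/CPLEX setup used in this work. The network, the molecular weights and the function name `max_yield` are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (hypothetical). Rows: A (substrate), B, P (product), W (waste).
# Columns (all irreversible): r1: A -> B, r2: B -> 2 P + W, r3: B -> P + 2 W.
S = np.array([
    [-1,  0,  0],   # A
    [ 1, -1, -1],   # B
    [ 0,  2,  1],   # P
    [ 0,  1,  2],   # W
], dtype=float)
MW = np.array([30.0, 30.0, 10.0, 10.0])  # assumed, mass-balanced molecular weights
substrates = [0]   # index set R
product = 2        # index of product P

def max_yield(S, MW, substrates, product):
    n_met, n_rxn = S.shape
    non_substrates = [i for i in range(n_met) if i not in substrates]
    # linprog minimizes, so negate the objective MW_P * sum_j S_Pj v_j.
    c = -MW[product] * S[product]
    # Constraint (1): sum_j S_ij v_j >= 0 for non-substrates, written as -S v <= 0.
    A_ub = -S[non_substrates]
    b_ub = np.zeros(len(non_substrates))
    # Constraint (2): total substrate consumption equals one unit of mass.
    A_eq = (MW[substrates] @ S[substrates]).reshape(1, -1)
    b_eq = np.array([-1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n_rxn)
    return -res.fun, res.x

yield_g_per_g, fluxes = max_yield(S, MW, substrates, product)
print(yield_g_per_g, fluxes)
```

In this toy case the optimum routes all of B through r2, giving 20 mass units of P per 30 mass units of A consumed.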

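In Step 3 of OptStrain, the minimum number of non-native reactions needed to meet the identified maximum yield from Step 2 is found. First, the Universal database reactions which are absent in the examined microbial host’s metabolic model are flagged as non-native. This gives rise to the following Mixed Integer Linear Programming (MILP) problem:

$$\text{Minimize} \quad \sum_{j \in \mathcal{M}_{\text{non-native}}} y_j$$

$$\text{subject to} \quad \sum_{j=1}^{M} S_{ij} v_j \ge 0, \qquad \forall\, i \in \mathcal{N},\; i \notin \mathcal{R} \tag{1}$$

$$\sum_{i \in \mathcal{R}} \left( MW_i \cdot \sum_{j=1}^{M} S_{ij} v_j \right) = -1 \tag{2}$$

$$MW_i \cdot \sum_{j=1}^{M} S_{ij} v_j \ge Yield^{\text{target}}, \qquad i = P \tag{3}$$

$$v_j \le v_j^{\max} \cdot y_j, \qquad \forall\, j \in \mathcal{M}_{\text{non-native}} \tag{4}$$

$$v_j \ge v_j^{\min} \cdot y_j, \qquad \forall\, j \in \mathcal{M}_{\text{non-native}} \tag{5}$$

$$y_j \in \{0, 1\}, \qquad \forall\, j \in \mathcal{M}_{\text{non-native}} \tag{6}$$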

The set $\mathcal{M}_{\text{non-native}}$ comprises the non-native reactions for the examined host and is a subset of the set $\mathcal{M}$. Constraints (1) and (2) are identical to those in the product yield maximization problem. Constraint (3) ensures that the product yield meets the maximum theoretical yield, $Yield^{\text{target}}$, calculated in Step 2. The binary variables $y_j$ in constraints (4) and (5) serve as switches to turn reactions on or off. A value of zero for $y_j$ forces the corresponding flux $v_j$ to be zero, while a value of one enables it to take on nonzero values. The parameters $v_j^{\min}$ and $v_j^{\max}$ can either assume very low and very high values, respectively, or they can be calculated by minimizing and maximizing every reaction flux $v_j$ subject to constraints (1-3).

Alternative pathways that satisfy both optimality criteria of maximum yield and minimum non-native reactions are obtained by the iterative solution of the MILP formulation upon the accumulation of additional constraints referred to as integer cuts. Integer cut constraints exclude from consideration all sets of reactions previously identified. For example, if a previously identified pathway utilizes reactions 1, 2, and 3, then the following constraint prevents the same reactions from being simultaneously considered in subsequent solutions: y1 + y2 + y3 ≤ 2. More details can be found in an earlier paper by Burgard and Maranas (2001).
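The iterative use of integer cuts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a brute-force search over reaction subsets stands in for the MILP solver, the feasibility test `completes_a_route` and the three routes are hypothetical, and the function names are not from the original work.

```python
from itertools import combinations

def smallest_feasible_set(candidates, is_feasible, cuts):
    """Brute-force stand-in for the MILP: smallest feasible set obeying all cuts."""
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # Integer cut for a previous solution T: sum_{j in T} y_j <= |T| - 1,
            # i.e. the new set may not contain every reaction of T.
            if any(cut <= chosen for cut in cuts):
                continue
            if is_feasible(chosen):
                return chosen
    return None

def enumerate_alternatives(candidates, is_feasible):
    """Collect all alternate optima (sets of minimum size) by accumulating cuts."""
    cuts, solutions = [], []
    while True:
        chosen = smallest_feasible_set(candidates, is_feasible, cuts)
        if chosen is None or (solutions and len(chosen) > len(solutions[0])):
            return solutions
        solutions.append(sorted(chosen))
        cuts.append(frozenset(chosen))

# Hypothetical toy problem: the yield target is met whenever the chosen
# non-native reactions contain one complete route.
ROUTES = [{1, 2, 3}, {1, 4, 5}, {2, 3, 6, 7}]

def completes_a_route(chosen):
    return any(route <= chosen for route in ROUTES)

alternatives = enumerate_alternatives(range(1, 8), completes_a_route)
print(alternatives)  # [[1, 2, 3], [1, 4, 5]]
```

After the set {1, 2, 3} is found, the cut y1 + y2 + y3 ≤ 2 removes it, and the search returns the second three-reaction alternative before stopping at the first larger set.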

Step 4 of OptStrain identifies which reactions to eliminate from the network augmented with the non-native functionalities, using the OptKnock framework developed previously (Burgard et al. 2003; Pharkya et al. 2003). The objective of this step is to constrain the phenotypic behavior of the network so that growth is coupled with the formation of the desired biochemical, thus curtailing byproduct formation. The envelope of allowable targeted product yields versus biomass yields is constructed by solving a series of linear optimization problems which maximize and then minimize biochemical production for various levels of biomass formation rates available to the network. More details on the optimization formulation can be found in Pharkya et al. (2003). All the optimization problems were solved on the order of minutes to hours using CPLEX 7.0 accessed via the GAMS (Brooke et al. 2003) modeling environment on an IBM RS6000-270 workstation.
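The construction of such an envelope can be sketched as follows. The example is a hypothetical six-reaction toy network solved with SciPy's `linprog`, not the genome-scale model or the GAMS/CPLEX setup used in this work; the reaction indices, the function name `production_envelope` and the choice of knockout are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network at steady state (S v = 0) with internal metabolites A and C.
# Reactions: 0 substrate uptake (-> A), 1 growth (A -> C + biomass),
# 2 A -> product, 3 C -> product, 4 C -> secreted byproduct, 5 A -> waste.
S = np.array([
    [1, -1, -1,  0,  0, -1],   # A
    [0,  1,  0, -1, -1,  0],   # C
], dtype=float)
BIOMASS = 1
product_flux = np.array([0, 0, 1, 1, 0, 0], dtype=float)  # total product secretion

def production_envelope(knockouts=(), uptake=10.0,
                        fractions=(0.0, 0.25, 0.5, 0.75, 0.99)):
    bounds = [(0.0, None)] * S.shape[1]
    bounds[0] = (uptake, uptake)            # fixed substrate uptake
    for j in knockouts:
        bounds[j] = (0.0, 0.0)              # deleted reactions carry no flux
    zero = np.zeros(S.shape[0])
    grow = np.zeros(S.shape[1])
    grow[BIOMASS] = -1.0                    # linprog minimizes, so negate
    max_growth = -linprog(grow, A_eq=S, b_eq=zero, bounds=bounds).fun
    envelope = []
    for f in fractions:
        level = f * max_growth
        b = list(bounds)
        b[BIOMASS] = (level, level)         # pin the biomass formation rate
        lo = linprog(product_flux, A_eq=S, b_eq=zero, bounds=b).fun
        hi = -linprog(-product_flux, A_eq=S, b_eq=zero, bounds=b).fun
        envelope.append((level, lo, hi))
    return envelope

wild_type = production_envelope()
mutant = production_envelope(knockouts=[4])  # block the competing byproduct route
for row in mutant:
    print(row)
```

In this toy case the wild-type network can grow at any rate without secreting product (lower limit zero), whereas blocking reaction 4 raises the lower limit to the growth rate itself, which is the coupling behavior the envelope is meant to reveal.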

3.4 Results

Computational results for microbial strain optimization focus on the production of hydrogen and vanillin. The hydrogen production case study underscores the importance of investigating multiple substrates and microbial hosts to pinpoint the optimal production environment as well as the need to eliminate competing functionalities. In contrast, in the vanillin study, identifying the smallest number of non-native reactions is found to be the key challenge for strain design. A common database of reactions, as outlined in Step 1, was constructed for both examples by pooling together metabolic pathways from the methylotroph Methylobacterium extorquens AM1 (Van Dien and Lidstrom 2002) and the KEGG database (Kanehisa et al. 2004) of reactions.

3.4.1 Hydrogen production case study

An efficient microbial hydrogen production strategy requires the selection of an optimal substrate and a microbial strain capable of forming hydrogen at high rates. First we solve the maximum yield LP formulation (Step 2) using all catalogued reactions which are balanced with respect to hydrogen, oxygen, nitrogen, sulfur, phosphorus and carbon (approximately 3,000 reactions) as recombination candidates. Note that OptStrain allows for different substrate choices such as pentose and hexose sugars as well as acetate, lactate, malate, glycerol, pyruvate, succinate and methanol. The highest hydrogen yield, 0.126 g/g substrate consumed, is obtained for methanol, which is not surprising given that the hydrogen to carbon ratio for methanol is the highest at four to one. A comparison of the yields for some of the more efficient substrates is shown in Figure 3.2. We decided to explore methanol and glucose further, motivated by the high yield on methanol and the favorable costs associated with the use of glucose.

The next step in the OptStrain procedure entails the determination of the minimum number of non-native functionalities for achieving the theoretical maximum yield in a host organism. We examine three different uptake scenarios: (i) glucose as the substrate in Escherichia coli (an established production system), (ii) glucose in Clostridium acetobutylicum (a known hydrogen producer), and (iii) methanol in Methylobacterium extorquens (a known methanol consumer).

3.4.1.1 Escherichia coli

The MILP framework (described in Step 3) correctly verifies that with glucose as the substrate no non-native functionalities are required by E. coli for hydrogen production. Interestingly, hydrogen production is possible through either the ferredoxin reaction (E.C.# 1.12.7.2) which reduces protons to form hydrogen or via the hydrogen dehydrogenase reaction (E.C.# 1.12.1.2) which converts NADH into NAD+ while forming hydrogen through proton association. Subsequently, the upper and lower limits of maximum hydrogen formation are explored for the E. coli stoichiometric model (Reed et al. 2003) as a function of biomass formation rate (i.e., growth rate) for both aerobic and anaerobic conditions and a basis glucose uptake rate of 10 mmol/gDW.hr (see Figure 3.3). Notably, the maximum theoretical hydrogen yield is higher under aerobic conditions. However, only under anaerobic conditions is hydrogen formed at maximum growth (see point A in Figure 3.3), leading to a growth-coupled production mode. Note that hydrogen production takes place through the formate hydrogen lyase reaction which converts formate into hydrogen and carbon dioxide under anaerobic conditions, in agreement with experimental observations (Nandi and Sengupta 1998).

Moving to phenotype restriction to curtail byproduct formation (Step 4), we explored whether the production of hydrogen in the wild type E. coli network could be enhanced by removing functionalities from the network that were in direct or indirect competition with hydrogen production. To this end, we employed the OptKnock framework to pinpoint gene deletion strategies that couple hydrogen production with growth. Here we highlight two of the identified strategies shown in Table 3.1. The first (double deletion) removes both enolase (E.C.# 4.2.1.11) and glucose-6-phosphate dehydrogenase (E.C.# 1.1.1.49). The removal of the enolase reaction strongly promotes hydrogen formation by directing the glycolytic flux towards the 3-phosphoglycerate branching point into the serine biosynthesis pathway. Subsequently, serine participates in a series of reactions in one-carbon metabolism to form 10-formyltetrahydrofolate which is eventually converted into formate and tetrahydrofolate. The dehydrogenase elimination prevents the shunting of glucose-6-phosphate flux into the pentose phosphate pathway.

The second strategy, a three-reaction deletion strategy, involves the removal of ATP synthase (E.C.# 3.6.3.14), alpha-ketoglutarate dehydrogenase, and acetate kinase (E.C.# 2.7.2.1). The removal of the first reaction enhances proton availability whereas the other two deletions ensure that maximum carbon flux is directed towards pyruvate which is then converted into formate through pyruvate formate lyase. Formate is catabolized into hydrogen and carbon dioxide through formate hydrogen lyase. Computationally derived (not measured) flux distributions for both strategies are shown in Figure 3.4.

A comparison of the hydrogen production limits as a function of growth rate for both the wild-type and mutant networks is shown in Figure 3.3. The transport rates of carbon dioxide for the mutant networks are fixed at the values suggested by OptKnock (see Table 3.1) thus setting the operational imperatives (Pharkya et al. 2003). Note that while the two-reaction deletion mutant has a theoretical hydrogen production rate of 22.7 mmol/gDW.hr (0.025 g/g glucose) at the maximum growth rate (Point B), the three-reaction deletion mutant produces a maximum of 29.5 mmol/gDW.hr (0.033 g/g glucose) (Point C) at the expense of a reduced maximum growth rate. Interestingly, in both mutant networks, maximum hydrogen production requires the uptake of oxygen. This is in contrast to the wild-type case where the lack of oxygen is preferred for hydrogen formation. Notably, it has been reported (Nandi and Sengupta 1996) that although formate hydrogen lyase can only be induced in the absence of oxygen, its catalytic activity is not affected in aerobic environments. This will have to be accounted for in any experimental study conducted on the basis of these results.
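The mass yields quoted for points B and C follow from the molar production rates by a simple unit conversion. The sketch below reproduces that arithmetic for the 10 mmol/gDW.hr basis glucose uptake rate. The molar masses (2.016 g/mol for H2, 180.16 g/mol for glucose) are standard values supplied here rather than taken from the thesis, and the function name `mass_yield` is an assumption of this example.

```python
H2_MOLAR_MASS = 2.016        # g/mol, standard value (assumption, not from the thesis)
GLUCOSE_MOLAR_MASS = 180.16  # g/mol, standard value (assumption, not from the thesis)


def mass_yield(product_rate, substrate_rate, product_mw, substrate_mw):
    """Convert molar rates (same units, e.g. mmol/gDW.hr) into g product per g substrate."""
    return (product_rate * product_mw) / (substrate_rate * substrate_mw)


# Hydrogen production rates reported in the text for a glucose uptake of 10 mmol/gDW.hr.
reported = [("two-reaction mutant (Point B)", 22.7),
            ("three-reaction mutant (Point C)", 29.5)]

for label, h2_rate in reported:
    y = mass_yield(h2_rate, 10.0, H2_MOLAR_MASS, GLUCOSE_MOLAR_MASS)
    print(f"{label}: {y:.3f} g H2 / g glucose")
```

The same function applies to any product and substrate pair once their molar masses are supplied.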

3.4.1.2 Clostridium acetobutylicum

Ample literature evidence has identified many of the organisms of the Clostridium genus as natural hydrogen production systems (Kataoka et al. 1997; Nandi and Sengupta 1998; Das and Veziroglu 2001; Chin et al. 2003). The reduction of protons into hydrogen through the hydrogenase reaction (E.C.# 1.12.7.2) is the key associated reaction. Not surprisingly, using OptStrain (Step 3), we verified that no non-native reactions were required for hydrogen production (Papoutsakis and Meyer 1985) in Clostridium acetobutylicum with glucose as a substrate. We next explored, as in the E. coli case, whether hydrogen production could be enhanced by judiciously removing competing functionalities using the OptKnock framework. To this end, we used the stoichiometric model for Clostridium acetobutylicum developed by Papoutsakis and coworkers (Papoutsakis 1984; Desai et al. 1999). OptKnock suggested the deletion of the acetate-forming and butyrate-transport reactions.

This deletion strategy is reasonable in hindsight upon considering the energetics of the entire network. Specifically, in the wild-type case the formation and secretion of each butyrate molecule requires the consumption of 2 NADH molecules, thus reducing the hydrogen production capacity of the network (see Figure 3.5). However, if butyrate is not secreted, but is instead recycled to form acetone and butyryl CoA, then butyryl CoA can again be converted to butyrate without any NADH consumption. This is evident in the flux distribution for the mutant network (see Figure 3.5). The double deletion mutant has a theoretical hydrogen yield of 3.17 mol/mol glucose (0.036 g/g glucose) at the expense of a slightly lower growth rate (point C in Figure 3.6). Notably, in this case, biomass formation and hydrogen production are tightly coupled, in contrast to the wild-type network where a range (1.38-2.96 mmol/gDW.hr) of hydrogen formation rates is possible (Line AB in Figure 3.6) at the maximum growth rate. Experimental results (Nandi and Sengupta 1998) indicate that only up to 2 mol of hydrogen can be produced per mol of glucose anaerobically in Clostridium. In fact, it has been reported that the direct inhibitory effect of butyrate on hydrogen production and the indirect effect of acetate through growth inhibition (Chin et al. 2003) are responsible for the observed low hydrogen yields. Interestingly, the suggested reaction eliminations directly circumvent these inhibition bottlenecks.

3.4.1.3 Methylobacterium extorquens AM1

Moving from glucose to methanol as the substrate, we next investigated hydrogen production in Methylobacterium extorquens AM1, a facultative methylotroph capable of surviving solely on methanol as a carbon and energy source (Van Dien and Lidstrom 2002). The organism has been well-studied (Anthony 1982; Chistoserdova et al. 1998; Korotkova et al. 2002; Van Dien et al. 2003; Chistoserdova et al. 2004) and recently, a stoichiometric model of its central metabolism was published (Van Dien and Lidstrom 2002). Using Step 3 of OptStrain, we identified that only a single reaction needs to be introduced into the metabolic network of M. extorquens to enable hydrogen production.

Two such candidates are hydrogenase (E.C.# 1.12.7.2), which reduces protons to hydrogen, or alternatively N5,N10-methenyltetrahydromethanopterin hydrogenase, which catalyzes the following transformation:

E.C.# 1.12.98.2: 5,10-Methylenetetrahydromethanopterin ↔ 5,10-Methenyltetrahydromethanopterin + H2.

The need for an additional reaction is expected because the central metabolic pathways in the methylotroph, as abstracted in Van Dien and Lidstrom (2002), do not include any reactions that convert protons into hydrogen such as those found in E. coli and the anaerobes of the Clostridium genus. Therefore, it is not surprising that, to the best of our knowledge, no one has achieved hydrogen production using methylotrophs such as Pseudomonas AM1 and P. methylica (Nandi and Sengupta 1998).

The identified reaction additions provide a plausible explanation for this outcome by pinpointing the lack of a mechanism to convert the generated protons to hydrogen.

3.4.2 Vanillin production case study

Vanillin is an important flavor and aroma molecule. The low yields of vanilla from cured vanilla pods have motivated efforts for its biotechnological production. In this case study, we identify metabolic network redesign strategies for the de novo production of vanillin from glucose in E. coli. Using OptStrain, we first determined the maximum theoretical yield of vanillin from glucose to be 0.63 g/g glucose by considering over 4,270 candidate reactions balanced with respect to all elements but hydrogen (Step 2). We next identified that the minimum number of non-native reactions that must be recombined into E. coli to endow it with the pathways necessary to achieve the maximum yield is three (Step 3). Numerous alternative pathways, differing only in their cofactor usage, which satisfy both the optimality criteria of yield and minimality of recombined reactions, were identified. We then calculated the maximum theoretical yields of each of these gene addition strategies upon their incorporation into the E. coli stoichiometric model (Reed et al. 2003). Notably, for all these strategies, the yields are almost identical even though the stoichiometric model enforces a global balancing on cofactor usage.

Therefore, for the sake of economy of presentation, only the following gene addition is discussed:

(i) E.C.# 1.2.1.46: Formate + NADH + H+ ↔ Formaldehyde + NAD+ + H2O,

(ii) E.C.# 1.2.3.12: 3,4-dihydroxybenzoate (or protocatechuate) + NAD+ + H2O + Formaldehyde ↔ Vanillate + O2 + NADH, and

(iii) E.C.# 1.2.1.67: Vanillate + NADH + H+ ↔ Vanillin + NAD+ + H2O.

Interestingly, these steps are essentially the same as those used in the experimental study by Li and Frost (1998) for the conversion of glucose into vanillin employing recombinant E. coli cells and the biocatalyst aryl aldehyde dehydrogenase extracted from Neurospora crassa, demonstrating that OptStrain can recover existing engineering strategies. Note, however, that the reported experimental yield of 0.15 g/g glucose is far from the maximum theoretical yield of the network (0.63 g/g glucose), indicating the potential for considerable improvement.

This motivates examining whether it is possible to reach higher yields of vanillin by systematically pruning the metabolic network using OptKnock (Step 4). Here the genome-scale model of E. coli metabolism, augmented with the three functionalities identified above, is integrated into the OptKnock framework to determine the set(s) of reactions whose deletion would force a strong coupling between growth and vanillin production. The highest vanillin-yielding single, double, and quadruple knockout strategies are discussed next for a basis glucose uptake rate of 10 mmol/gDW.hr. In all cases, anaerobic conditions are selected by OptKnock as the most favorable for vanillin production. Flux distributions corresponding to the proposed knockout strategies are shown in Figure 3.7. It is worth emphasizing that, in general, the deletion strategies identified by OptStrain are dependent upon the specific gene addition strategy fed into Step 4 of OptStrain. Accordingly, we tested whether alternative, and possibly better, deletion strategies would accompany some of the other candidate addition strategies alluded to above. For the vanillin case study, we found the deletion suggestions and anticipated vanillin yields at maximal growth to be quite similar regardless of the gene addition strategy employed.

The first deletion strategy identified by OptStrain suggests removing acetaldehyde dehydrogenase (E.C.# 1.2.1.10) to prevent the conversion of acetyl-CoA into ethanol.

Vanillin production in this network, at the maximum biomass production rate of 0.21 hr⁻¹, is 3.9 mmol/gDW.hr or 0.33 g/g glucose based on the assumed uptake rate of glucose. In this deletion strategy, flux is redirected through the vanillin precursor metabolites, phosphoenolpyruvate (PEP) and erythrose-4-phosphate (E4P), by blocking the loss of carbon through ethanol secretion. The second (double) deletion strategy involves the additional removal of glucose-6-phosphate isomerase (E.C.# 5.3.1.9), essentially blocking the upper half of glycolysis. These deletions cause the network to place a heavy reliance on the Entner-Doudoroff pathway to generate pyruvate and glyceraldehyde-3-phosphate (GAP) which undergoes further conversion into PEP in the lower half of glycolysis. Fructose-6-phosphate (F6P), produced through the non-oxidative part of the pentose phosphate pathway, is subsequently converted to E4P. Vanillin production, at the expense of a reduced maximum growth rate of 0.06 hr⁻¹, is increased to 4.78 mmol/gDW.hr or 0.40 g/g glucose. A substantially higher level of vanillin production is predicted in the four-reaction deletion mutant network without imposing a high penalty on the growth rate. This strategy leads to the production of 6.79 mmol/gDW.hr of vanillin or 0.57 g/g glucose at the maximum growth rate of 0.052 hr⁻¹. The OptKnock framework suggests the deletion of acetate kinase (E.C.# 2.7.2.1), pyruvate kinase (E.C.# 2.7.1.40), the PTS transport mechanism, and fructose 6-phosphate aldolase. The first three deletions prevent leakage of flux from PEP and redirect it instead to vanillin synthesis. The elimination of fructose 6-phosphate aldolase prevents the direct conversion of F6P into GAP and dihydroxyacetone (DHA). Note that both F6P and GAP are used to form E4P in the non-oxidative branch of the pentose phosphate pathway. DHA can be further reacted to form dihydroxyacetone phosphate (DHAP) with the consumption of a PEP molecule. Thus, elimination of fructose 6-phosphate aldolase prevents the utilization of both F6P and PEP which are required for vanillin synthesis. Furthermore, a surprising network flux redistribution involves the employment of a group of reactions from one-carbon metabolism to form 10-formyltetrahydrofolate, which is subsequently converted to formaldehyde. Figure 3.8 compares the vanillin production envelopes, obtained by maximizing and minimizing vanillin formation at different biomass production rates for the wild-type and mutant networks. These deletions endow the network with high levels of vanillin production under any growth conditions.

3.5 Conclusion

The OptStrain framework is aimed at systematically suggesting how to reshape whole genome-scale metabolic networks of microbial systems for the overproduction of not only small but also complex molecules. We have so far examined a number of different products (e.g., 1,3 propanediol, inositol, pyruvate, electron transfer, etc.) using a variety of hosts (i.e., E. coli, C. acetobutylicum, M. extorquens). The two case studies, hydrogen and vanillin, discussed earlier show that OptStrain can address the range of challenges associated with strain redesign allowing for the generation of multiple redesign strategies to be screened by experts and evaluated experimentally. At the same time, it is important to emphasize that the validity and relevance of the results obtained with the OptStrain framework are dependent on the level of completeness and accuracy of the reaction databases and microbial metabolic models considered. We have identified numerous instances of unbalanced reactions, especially with respect to hydrogen atoms, and ambiguous reaction directionality in the reaction databases that we mined. Careful curation of the downloaded reactions preceded all our case studies. Whenever the balanceability of a reaction could not be restored, the reaction was removed from consideration. We expect that this step will become less time-consuming as automated tools for reaction database testing and verification (Segre et al. 2003) are becoming available. Furthermore, the employed purely stoichiometric representation of metabolic pathways in microbial models can lead to unrealistic flux distributions by not accounting for kinetic barriers and regulatory interactions (e.g., ). Despite these simplifications, OptStrain has already provided useful insight into microbial host redesign in many cases and, more importantly, established for the first time an integrated framework open to future modeling improvements.

Table 3.1: Deletion mutants for enhanced hydrogen production. A basis glucose uptake rate of 10 mmol/gDW.hr is assumed.

No. of knockouts   ID   Enzyme                                  Growth rate   CO2 secretion    H2
                                                                (1/hr)        (mmol/gDW.hr)    (mmol/gDW.hr)
Two                A    1 Enolase                               0.23          32.7             22.8
                        2 Glucose-6-phosphate dehydrogenase
Three              B    1 ATP synthase                          0.17          40.9             29.5
                        2 Acetate kinase
                        3 2-oxoglutarate dehydrogenase

[Schematic omitted. Panel labels: Step 1 (KEGG, MetaCyc, EMP, UM-BBD, Universal Database); Step 2 (Host, Substrate, Product, U. Database); Step 3 (Host, Substrate, Product, U. Database); Step 4 (Product, Engineered Host, Substrate, Oxygen, Biomass).]

Figure 3.1: Pictorial representation of the OptStrain procedure. The red (X)’s pinpoint the deleted reactions.

[Plot omitted. x-axis: Substrates (acetate, methanol, glucose, glycerol, lactate); y-axis: Maximum hydrogen yield (on weight basis).]

Figure 3.2: Maximum hydrogen yield on a weight basis for different substrates.

[Plot omitted. x-axis: Biomass Production Rate (1/hr); y-axis: Hydrogen Yield (mmol/gDW/hr); curves: 3 Knockouts, 2 Knockouts, Wild type (Anaerobic), Wild type (Aerobic); marked points: A, B, C.]

Figure 3.3: Hydrogen production envelopes as a function of the biomass production rate of the wild-type E. coli network under aerobic and anaerobic conditions as well as the two-reaction and three-reaction deletion mutant networks.

[Flux maps omitted. Panels: (a) two-reaction deletion mutant, (b) three-reaction deletion mutant.]

Figure 3.4: Calculated flux distributions at the maximum growth rates in the (a) two and (b) three deletion E. coli mutant networks for overproducing hydrogen. A basis glucose uptake rate of 10 mmol/gDW.hr was assumed.

[Flux map omitted.]

Figure 3.5: Calculated flux distributions at the maximum growth rates for the wild-type (light gray) and the two-reaction deletion mutant (dark gray) C. acetobutylicum networks. The (X)’s denote reactions that were selected for elimination in the mutant network. The wild-type network flux values are for the minimum hydrogen production scenario, corresponding to point A in Figure 3.6.

[Plot omitted. x-axis: Biomass production rate (1/hr); y-axis: Hydrogen production rate (mmol/gDW/hr); curves: two knockout, no knockouts; marked points: A, B, C.]

Figure 3.6: Hydrogen formation limits of the wild-type (solid) and mutant (dotted) Clostridium acetobutylicum metabolic network for a basis glucose uptake rate of 1 mmol/gDW.hr.

[Flux maps omitted. Panels: (a) one-reaction, (b) two-reaction, and (c) four-reaction deletion mutants.]

Figure 3.7: Calculated flux distributions at the maximum growth rates in the (a) one, (b) two, and (c) four deletion E. coli mutant networks for overproducing vanillin. Non-native reactions are denoted by the thicker gray arrows.

[Plot omitted. x-axis: Biomass production rate (1/hr); y-axis: Vanillin production rate (mmol/gDW/hr); curves: Augmented, 1 Knockouts, 2 Knockouts, 4 Knockouts; marked points: A, B, C.]

Figure 3.8: Vanillin production envelope of the augmented E. coli metabolic network for a basis 10 mmol/gDW.hr uptake rate of glucose. An anaerobic mode of growth is suggested in all cases.

Chapter 4

Identifying Reaction Activation/Inhibition Candidates for Overproduction in Microbial Systems

4.1 Background

Metabolic engineering in microbial hosts for the production of renewable fuels and chemicals has received considerable attention in recent years (Stafford and Stephanopoulos 2001; Gross et al. 2003; Vera et al. 2003; Wyman 2003; Alper et al. 2005). This is because biotechnology offers an opportunity for unparalleled product diversity and is integral to achieving the goal of sustainable development. Furthermore, the potential of biocatalysts to produce very complex products of desired stereospecificity (Breuer et al. 2004) with possibly more favorable economics has motivated many large-scale efforts in engineering microbial production systems. Already, a number of compounds are being produced industrially using microbial production systems (Chotani et al. 2000; Nakamura and Whited 2003), and many efforts are ongoing for synthesizing several others using biological routes (Lee and Schmidt-Dannert 2002; Martin et al. 2003; Baez-Viveros et al. 2004).

In recent years, optimization-based frameworks have been developed to guide experimental metabolic engineering strategies by adopting a systems approach for anticipating the effect of genetic modifications on metabolism. Metabolism is described by adopting genome-scale metabolic models widely available for many organisms (Schilling et al. 2002; Van Dien and Lidstrom 2002; Forster et al. 2003; Reed et al. 2003; Duarte et al. 2004). In the previous chapters, we introduced the OptKnock and the OptStrain frameworks to predict reaction deletions and additions aimed at maximizing the secretion of biochemicals from metabolic networks. However, these modifications do not describe the entire range of genetic manipulation strategies available. The importance of tuning gene expression upward or downward, and consequently enzyme levels and corresponding flux rates, is being widely recognized in the metabolic engineering community (Jensen and Hammer 1998a; Koffas et al. 2003). A recent successful effort for producing 1,3 propanediol from E. coli employed the downregulation of glyceraldehyde-3-phosphate dehydrogenase as a key genetic modification (Nakamura and Whited 2003).

Also, in many cases, a gene deletion is lethal whereas its downregulation is not. For example, in an earlier paper (Pharkya et al. 2003) we predicted a strategy involving elimination of enolase to enhance the theoretical production of serine in the metabolic network of E. coli. However, the deletion of enolase is known to be lethal in E. coli due to regulatory interactions not accounted for in the considered metabolic reconstruction.

Consequently, the repression of enolase rather than its deletion appears to be a more appropriate strategy. Gene up or down regulation can be tuned by using widely available promoter libraries (de Ruyter et al. 1996; Jensen and Hammer 1998b) providing experimental strategies to implement predictions on desired upward or downward flux changes.

In this chapter, we describe the modeling and algorithmic changes required to extend OptKnock (Burgard et al. 2003) to allow for up and/or down regulation in addition to gene knock-outs to meet a bioproduction goal. Specifically, the objective here is to computationally identify which reactions should be modulated (i.e., repressed or activated) or knocked-out such that the biochemical of interest is overproduced. This extended computational framework termed OptReg uses the OptKnock formulation as a starting point. However, the breadth and complexity of the newly considered genetic manipulations introduce many new variables and nonlinearities requiring a new and non-trivial theoretical treatment for the generation of the single-level optimization problem.

This treatment is described in detail in Section 4.2. The computational difficulties arising due to hundreds of binary variables and bilinear products of binary and continuous variables make this a challenging problem to solve.

Conceptually, in the OptReg framework, reactions are referred to as activated or repressed when their fluxes are forced to be sufficiently higher or lower, respectively, than their corresponding steady-state fluxes. Parameter C, termed the regulation strength parameter, quantifies the threshold that needs to be overcome before a reaction is considered up or down regulated (see Figure 4.1). It can be assigned values between zero and one. At C equal to zero, even if the reaction flux is equal to its steady state value, the reaction is considered to be modulated. On the other hand, for C equal to one, a reaction flux must be equal to its upper or lower stoichiometric bound ($v_j^{max}$ or $v_j^{min}$) before it is deemed as up or down regulated respectively. It follows that the higher the value of C, the stronger the requirement imposed on a reaction when it is regulated (see Section 4.2 for further details).

Figure 4.1 graphically illustrates the imposed bounds as a consequence of activation, inhibition or elimination of a reaction. The figure also shows that not a single value but rather a range of values between $v^0_{j,L}$ and $v^0_{j,U}$ sometimes needs to be used to describe the original steady-state. This is because, as our group (Burgard et al. 2001) and others (Papin et al. 2002) have found, the maximization of biomass or any other cellular objective does not yield a unique solution (i.e., value) for a majority of reactions (especially for internal ones) at steady state. Instead, due to the high redundancy in the network, a range of flux values is typically identified corresponding to alternate but equivalent optima for biomass. Therefore, in OptReg we have to use a range of flux values rather than a single value to describe the base state of the network before any genetic manipulations are implemented. The following section highlights these modeling and algorithmic developments.

4.2 Modeling and computational protocol

4.2.1 Steady-state flux determination

In OptReg, up or down flux modulations are modeled as upward or downward departures respectively, from the wild-type steady-state values. Therefore, before employing the model it is necessary to establish estimates for the steady-state flux values (or range of values) for all reactions in the wild-type network of Escherichia coli (Reed et al. 2003). To this end, we use flux measurements (Fischer et al. 2004) to fix some of the reaction fluxes in central metabolism at values determined from comprehensive isotopomer balancing experiments performed on exponentially growing Escherichia coli cells in a bioreactor culture. These experimental results (shown in Figure 4.2) provide us with carbon flux partitioning values at key branch points in the central metabolism and an estimate for the biomass formation rate at steady state (0.81 hr⁻¹ as predicted in experimental studies).

The steady-state flux values or ranges for the remaining fluxes are estimated computationally. A linear programming formulation, referred to as the min/max problem (Burgard et al. 2001; Mahadevan and Schilling 2003), is solved for base flux estimation for reactions in the most recent genome-scale model of E. coli (Reed et al. 2003). Specifically, each flux is successively maximized and then minimized subject to predetermined experimental values of a few fluxes.

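$$
\begin{aligned}
\min/\max \quad & v_j, && \forall\, j \in M \\
\text{subject to} \quad & \sum_{j} S_{ij}\, v_j = 0, && \forall\, i \in N \\
& v_{atp} \ge v_{atp\_maint}, \quad v_{glc} = 10\ \text{mmol/gDW.hr} \\
& v_j = v_j^{exp}, && \forall\, j \in M_{exp} \\
& v_j \ge 0, && \forall\, j \in M
\end{aligned}
$$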

Here, M = {1, ..., M} denotes the set of reactions and N = {1, ..., N} is the set of metabolites. $S_{ij}$ is the stoichiometric coefficient of metabolite i in reaction j. The first constraint imposes a stoichiometric balance on the network. The glucose uptake rate $v_{glc}$ is fixed at 10 mmol/gDW·hr and a minimum amount of ATP formation ($v_{atp\_maint}$ = 7.6 mmol/gDW·hr) is imposed for maintenance. The subset $M_{exp}$ is comprised of the reactions whose fluxes are fixed at experimental values. In addition, all reversible reactions are split into their forward and backward counterparts to facilitate the modeling of regulation as described in the next section, increasing the size of the metabolic network to 1,470 one-directional positive-valued reactions. The identified minimum and maximum values for each flux through reaction j are denoted as $v^0_{j,L}$ and $v^0_{j,U}$ respectively. If $v^0_{j,L}$ is equal to $v^0_{j,U}$, then a unique value for the base steady-state value of reaction j is obtained. Otherwise, the range of values between $v^0_{j,L}$ and $v^0_{j,U}$ quantifies the ambiguity in assigning a flux value to reaction j.
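To make the origin of these ranges concrete, consider a hypothetical toy network (not taken from the E. coli model): an uptake reaction supplies a metabolite at a fixed rate of 10, and two parallel reactions, 1 and 2, both consume it. The numbers below, including the assumed measurement of 4 for reaction 2, are illustrative only.

```latex
\begin{aligned}
& v_{1} + v_{2} = 10, \qquad v_{1},\, v_{2} \ge 0 \\
& \min v_{1} = 0, \quad \max v_{1} = 10
  \;\Rightarrow\; v^{0}_{1,L} = 0,\quad v^{0}_{1,U} = 10 \\
& \text{with } v_{2} = v_{2}^{exp} = 4 \text{ fixed:} \quad
  v_{1} = 10 - 4 = 6
  \;\Rightarrow\; v^{0}_{1,L} = v^{0}_{1,U} = 6
\end{aligned}
```

Without a measurement, the stoichiometric balance fixes only the sum of the two parallel fluxes, so the base flux of reaction 1 is ambiguous over the whole interval. Fixing one of the parallel fluxes at a measured value collapses the range to a single point, which is the role played by the experimentally determined fluxes in the formulation above.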

4.2.2 Modeling of genetic manipulations

Three sets of binary variables for each reaction j∈M are introduced to model all possible combinations of knockouts, up and down regulations.

k 0 if reaction j is knocked out y j =  1 if reaction j is not knocked out

d 0 if reaction j is down regulated y j =  1 if reaction j is not down regulated

if reaction j is up regulated u 0 y j =  1 if reaction j is not up regulated

These binary variables then act as switches to ensure that fluxes are appropriately restricted in response to a deletion or an up/down regulation based on the following constraints:

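$$
v_j^{min} \le v_j \le \left[ (v^0_{j,L})(1-C) + (v_j^{min})(C) \right] (1 - y_j^d) + v_j^{max} \, y_j^d, \quad \forall j \in M \quad \text{(Down regulations)}
$$

$$
\left[ (v^0_{j,U})(1-C) + (v_j^{max})(C) \right] (1 - y_j^u) + v_j^{min} \, y_j^u \le v_j \le v_j^{max}, \quad \forall j \in M \quad \text{(Up regulations)}
$$

$$
v_j^{min} \, y_j^k \le v_j \le v_j^{max} \, y_j^k, \quad \forall j \in M \quad \text{(Knockouts)}
$$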

Here, $v_j^{min}$ and $v_j^{max}$ are the stoichiometric bounds on the fluxes determined by minimizing and maximizing each flux subject to (i) the stoichiometric network balances, (ii) a fixed glucose uptake rate, (iii) the fulfillment of the ATP maintenance requirement and (iv) the formation of at least 1% of the maximum theoretical biomass in the network.

Note that the stoichiometric lower bound $v_j^{min}$ for a flux j may be greater than zero if the specific reaction is required for biomass formation.

As described earlier, the regulation strength parameter C is assigned values between zero and one. This parameter determines the fraction of the range of flux values between the stoichiometric bounds (lower or upper) and the corresponding lower or upper steady-state flux values that is available to a regulated reaction (see Figure 4.1). We require that when a reaction is inhibited, the flux should vary between the stoichiometric lower bound $v_j^{min}$ and the point denoted by $(v^0_{j,L})(1-C) + (v_j^{min})(C)$. Similarly, the reaction flux should be greater than $(v^0_{j,U})(1-C) + (v_j^{max})(C)$ when it is upregulated. The parameter C is applied to the regulated fluxes to ensure a significant deviation of the fluxes from their steady-state values. The higher the value of C, the larger the departure from the steady-state values and, consequently, the “stronger” the regulation. Note that an appropriate estimate of the value of C can be made beforehand depending upon the strength of the promoter and the inhibitor (Jensen and Hammer 1998b). Finally, if a reaction is knocked out, then the corresponding flux is forced to zero. If a reaction is required for biomass formation, it has a non-zero lower bound $v_j^{min}$ and, as a consequence, it cannot be knocked out.
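The way the three binary switches restrict a single flux can be sketched in a few lines of Python. This is only an illustration of the down regulation, up regulation and knockout constraints given above: the function name `flux_bounds` and the numerical values in the usage note are hypothetical and do not come from the E. coli model.

```python
def flux_bounds(v_min, v_max, v0_L, v0_U, C, y_k=1, y_d=1, y_u=1):
    """Return the (lower, upper) bounds on one flux for given binary switches.

    A switch value of 0 means the manipulation is applied (knockout, down
    regulation or up regulation); 1 means it is not applied.
    """
    if not 0.0 <= C <= 1.0:
        raise ValueError("C must lie between zero and one")
    if (1 - y_k) + (1 - y_d) + (1 - y_u) > 1:
        raise ValueError("at most one manipulation per reaction")

    # Knockout constraint: v_min * y_k <= v <= v_max * y_k
    lower = v_min * y_k
    upper = v_max * y_k

    # Down regulation caps the flux between v_min and the base lower value.
    down_cap = (v0_L * (1 - C) + v_min * C) * (1 - y_d) + v_max * y_d
    upper = min(upper, down_cap)

    # Up regulation raises the floor between the base upper value and v_max.
    up_floor = (v0_U * (1 - C) + v_max * C) * (1 - y_u) + v_min * y_u
    lower = max(lower, up_floor)

    # lower > upper signals an infeasible choice, e.g. knocking out a
    # reaction whose stoichiometric lower bound v_min is positive.
    return lower, upper
```

With the hypothetical values v_min = 0, v_max = 10, base range [4, 6] and C = 0.5, calling `flux_bounds(0, 10, 4, 6, 0.5, y_d=0)` restricts the flux to the lower half of the interval between 0 and 4, while `y_u=0` restricts it to the upper half of the interval between 6 and 10.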

A reaction flux can be the target of at most a single type of genetic manipulation; thus:

$$
(1 - y_j^k) + (1 - y_j^d) + (1 - y_j^u) \le 1, \quad \forall j \in M
$$

Additionally, the constraints $\sum_j (1 - y_j^k) \le K$, $\sum_j (1 - y_j^d) \le D$, and $\sum_j (1 - y_j^u) \le U$ specify that the total numbers of reactions that can be deleted, down regulated and up regulated are K, D and U, respectively. Alternatively, a limit on the total number of regulated and deleted reactions is imposed as follows:

$$
\sum_j \left[ (1 - y_j^k) + (1 - y_j^d) + (1 - y_j^u) \right] \le L
$$

where L is the total number of reactions that can be modulated or knocked out. Recall that all reactions are one-directional and their fluxes are constrained to be greater than or equal to zero. A constraint $y_j^k = y_{j+1}^k$, $\forall j \in M_{rev}$, is imposed in the outer problem so that the forward and backward reactions mapping to a reversible reaction (listed in the set $M_{rev}$) can only be knocked out simultaneously. For the regulated reactions, either the forward or the backward flux can be modulated and this is imposed by the following set of constraints:

$$
y_j^d + y_{j+1}^d \ge 1 \quad \text{and} \quad y_j^u + y_{j+1}^u \ge 1, \quad \forall j \in M_{rev}
$$
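The combinatorial restrictions on the binary variables can be checked for a candidate design with a short routine. The sketch below assumes the total-limit form with L and represents each reversible reaction as a pair of indices (forward, backward); the function name and the data layout are illustrative choices, not part of the original formulation.

```python
def design_is_allowed(y_k, y_d, y_u, L, reversible_pairs=()):
    """Check the outer-problem restrictions on the binary variables.

    y_k, y_d, y_u: lists of 0/1 values, one entry per one-directional
    reaction (0 means the manipulation is applied).
    reversible_pairs: (forward, backward) index pairs, i.e. the set M_rev.
    """
    n = len(y_k)
    manipulations = [(1 - y_k[j]) + (1 - y_d[j]) + (1 - y_u[j]) for j in range(n)]

    # At most one type of manipulation per reaction.
    if any(m > 1 for m in manipulations):
        return False

    # Limit on the total number of modulated or deleted reactions.
    if sum(manipulations) > L:
        return False

    for fwd, bwd in reversible_pairs:
        # Both directions of a reversible reaction are knocked out together.
        if y_k[fwd] != y_k[bwd]:
            return False
        # Only one direction may be down regulated, and only one up regulated.
        if y_d[fwd] + y_d[bwd] < 1 or y_u[fwd] + y_u[bwd] < 1:
            return False
    return True
```

Note that, with the constraints exactly as written, knocking out a reversible reaction sets two binary variables to zero and therefore counts twice towards L in this check.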

4.2.3 OptReg framework

In analogy to OptKnock and based on the above variable definitions and constraints, the bilevel optimization formulation for OptReg is as follows:

$$
\begin{aligned}
\max_{y_j^k,\, y_j^u,\, y_j^d} \quad & v_{biochemical} && \text{(OptReg)} \\
\text{s.t.} \quad & \max_{v_j} \quad v_{biomass} - \varepsilon \sum_j v_j && \text{(Primal)} \\
& \quad \text{s.t.} \quad \sum_{j=1}^{M} S_{ij} v_j = 0, && \forall i \in N \\
& \quad\quad v_{atp} \ge v_{atp\_maint} \\
& \quad\quad v_{glc} = 10 \ \text{mmol/gDW·hr} \\
& \quad\quad v_{biomass} \ge (0.01) \cdot v_{biomass}^{max} \\
& \quad\quad v_j \le v_j^{max} \, y_j^k, \quad v_j \ge v_j^{min} \, y_j^k, && \forall j \in M \\
& \quad\quad v_j \le \left[ (v^0_{j,L})(1-C) + (v_j^{min})(C) \right] (1 - y_j^d) + v_j^{max} \, y_j^d, && \forall j \in M \\
& \quad\quad v_j \ge v_j^{min}, \quad v_j \le v_j^{max}, && \forall j \in M \\
& \quad\quad v_j \ge \left[ (v^0_{j,U})(1-C) + (v_j^{max})(C) \right] (1 - y_j^u) + v_j^{min} \, y_j^u, && \forall j \in M \\
& (1 - y_j^k) + (1 - y_j^d) + (1 - y_j^u) \le 1, && \forall j \in M \\
& y_j^k \in \{0,1\}; \quad y_j^d \in \{0,1\}; \quad y_j^u \in \{0,1\}, && \forall j \in M \\
& \sum_j \left[ (1 - y_j^k) + (1 - y_j^u) + (1 - y_j^d) \right] \le L \\
& y_j^k = y_{j+1}^k, \quad y_j^d + y_{j+1}^d \ge 1, \quad y_j^u + y_{j+1}^u \ge 1, && \forall j \in M_{rev}
\end{aligned}
$$

$v_{biochemical}$ refers to the flux towards the synthesis of the desired biochemical. The second term in the objective function of the inner problem ensures that the maximum biomass flux distribution with the minimum network “trafficking” (i.e. minimum sum of all fluxes) is chosen out of the set of alternative optima. Recall that reversible reactions have been divided into forward and backward fluxes in the network. Therefore, it is possible for arbitrarily large values to be assigned to the forward and backward fluxes of a reversible reaction while the net flux remains finite. This in turn can lead to an erroneous prediction of up regulation for a reaction. To safeguard against this occurrence, we introduce the second term in the objective function of the formulation (Primal). By minimizing the total flux circulation in the network, only one of the two unidirectional reactions forming a reversible reaction can carry a non-zero flux. Through a trial-and-error process, we have determined that ε has to be assigned values between 0.0001 and 0.001. If ε is greater than 0.001, then the second term starts affecting the solution for the maximization of biomass. If, on the other hand, ε is less than 0.0001, it is ineffective at preventing the presence of flux for both forward and backward directions of a reversible reaction.
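The effect of the ε term on a split reversible reaction can be made explicit with a short worked argument. The symbols $v_f$, $v_b$ and $t$ are introduced here only for illustration, and the sketch assumes that neither direction is forced to carry flux by its own bounds.

```latex
% v_f, v_b : forward and backward fluxes of one reversible reaction (illustrative names)
% t >= 0   : an arbitrary amount of flux circulating through both directions
\begin{aligned}
&\text{Net flux: } v_{net} = v_f - v_b .\\
&\text{Shifting both directions by } t \ge 0:\quad
  (v_f + t) - (v_b + t) = v_{net},\\
&\text{so every stoichiometric balance } \textstyle\sum_j S_{ij} v_j = 0
  \text{ is unaffected, because the two columns of } S \text{ are negatives of each other.}\\
&\text{The inner objective, however, changes by }
  -\varepsilon\,[(v_f + t) + (v_b + t)] + \varepsilon\,(v_f + v_b) = -2\,\varepsilon\, t \le 0 .\\
&\text{Hence, for } \varepsilon > 0, \text{ the optimum uses } t = 0, \text{ i.e. } \min(v_f, v_b) = 0 .
\end{aligned}
```

The same argument shows why ε cannot be too large: the penalty $\varepsilon \sum_j v_j$ applies to every flux in the network, so a large ε begins to compete with the biomass term itself.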

The first three constraints in the inner problem (Primal) have already been described previously. The next constraint imposes a minimum requirement on biomass formation and is set at 1% of the maximum theoretical biomass ($v_{biomass}^{max}$) feasible in the network. This bilevel problem is solved by first transforming it into a single level problem. To this end, we generate the dual of the Primal problem (Ignizio and Cavalier 1994) as follows:

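$$
\begin{aligned}
\min \quad & (v_{atp\_maint} \cdot \lambda_{atp}) + (0.01 \cdot v_{biomass}^{max} \cdot \lambda_{bio}) + \sum_j \left( q^k_{U,j} \cdot y_j^k \cdot v_j^{max} + q^k_{L,j} \cdot y_j^k \cdot v_j^{min} \right) \\
& + \sum_j \left\{ (v_j^{max} \cdot y_j^d \cdot q^d_{U,j}) + \left[ (v^0_{j,L})(1-C) + (v_j^{min})(C) \right] (1 - y_j^d) \cdot q^d_{U,j} + (q^d_{L,j})(v_j^{min}) \right\} \\
& + \sum_j \left\{ (q^u_{U,j} \cdot v_j^{max}) + (v_j^{min} \cdot y_j^u \cdot q^u_{L,j}) + \left[ (v^0_{j,U})(1-C) + (v_j^{max})(C) \right] (1 - y_j^u) \cdot q^u_{L,j} \right\} && \text{(Dual)}
\end{aligned}
$$

subject to:

$$
\begin{aligned}
& \sum_{i=1}^{N} \lambda_i S_{ij} + q^k_{U,j} + q^k_{L,j} + q^d_{U,j} + q^d_{L,j} + q^u_{U,j} + q^u_{L,j} \ge -\varepsilon, && \forall j \in M, \ j \ne atp, biomass \\
& \sum_{i=1}^{N} \lambda_i S_{i,biomass} + q^k_{U,biomass} + q^k_{L,biomass} + q^d_{U,biomass} + q^d_{L,biomass} + q^u_{U,biomass} + q^u_{L,biomass} + \lambda_{bio} \ge 1 - \varepsilon \\
& \sum_{i=1}^{N} \lambda_i S_{i,atp} + q^k_{U,atp} + q^k_{L,atp} + q^d_{U,atp} + q^d_{L,atp} + q^u_{U,atp} + q^u_{L,atp} + \lambda_{atp} \ge -\varepsilon \\
& q^k_{U,j}, \ q^d_{U,j}, \ q^u_{U,j} \ge 0; \quad q^k_{L,j}, \ q^d_{L,j}, \ q^u_{L,j} \le 0, && \forall j \in M \\
& \lambda_i \in \mathbb{R}, \ \forall i \in N; \quad \lambda_{atp} \le 0; \quad \lambda_{bio} \le 0
\end{aligned}
$$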

In the dual formulation, $\lambda_i$ are the dual variables corresponding to the stoichiometric constraints and are unrestricted in sign. The dual variables for the ATP maintenance and the biomass formation constraints are denoted as $\lambda_{atp}$ and $\lambda_{bio}$, respectively. Furthermore, $q^k_{L,j}$, $q^u_{L,j}$ and $q^d_{L,j}$ are the dual variables assigned to the constraints imposing lower bounds on the fluxes that become active if the binary variables $y_j^k$, $y_j^u$ and $y_j^d$ assume a value of zero, in this order. Similarly, $q^k_{U,j}$, $q^u_{U,j}$ and $q^d_{U,j}$ are the dual variables corresponding to the constraints imposing upper bounds on the fluxes when a reaction is knocked out, up regulated or down regulated, respectively. An additional complication, absent in OptKnock, is that the objective function involves terms where a binary variable is multiplied by a continuous variable. To recast these nonlinear terms into an equivalent linear form, we introduce three sets of additional variables (Glover 1975) as follows:

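$$
z^k_{U,j} = (q^k_{U,j}) \cdot (y_j^k), \quad z^k_{L,j} = (q^k_{L,j}) \cdot (y_j^k), \quad \forall j \in M
$$

$$
z^u_{L,j} = (q^u_{L,j}) \cdot (y_j^u), \quad z^d_{U,j} = (q^d_{U,j}) \cdot (y_j^d), \quad \forall j \in M
$$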

By imposing the following constraints, the equivalent linear transformation of the nonlinearities is accomplished:

$$
(q^k_{U,j})^{LB} \cdot y_j^k \le z^k_{U,j} \le (q^k_{U,j})^{UB} \cdot y_j^k, \quad \forall j \in M
$$

and

$$
q^k_{U,j} - (q^k_{U,j})^{UB} \cdot (1 - y_j^k) \le z^k_{U,j} \le q^k_{U,j} - (q^k_{U,j})^{LB} \cdot (1 - y_j^k), \quad \forall j \in M
$$

Similar constraints are imposed for the sets of variables $z^k_{L,j}$, $z^u_{L,j}$ and $z^d_{U,j}$. These variables assume non-negative or non-positive values according to the sign restrictions on the corresponding continuous dual variables. All other bounds (upper/lower) on the dual variables are calculated by maximizing or minimizing them subject to the constraints of the dual problem.
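That these linear constraints reproduce the product of a binary and a continuous variable can be verified numerically. The sketch below computes the interval of z values permitted by the two constraint pairs above for a fixed q and y; the function name and the sample bounds are hypothetical.

```python
def z_interval(q, y, q_lb, q_ub):
    """Interval of z allowed by the linearization constraints.

    q: value of the continuous dual variable, with q_lb <= q <= q_ub.
    y: value of the binary variable (0 or 1).
    Constraints: q_lb*y <= z <= q_ub*y  and
                 q - q_ub*(1 - y) <= z <= q - q_lb*(1 - y).
    """
    if y not in (0, 1):
        raise ValueError("y must be binary")
    if not q_lb <= q <= q_ub:
        raise ValueError("q must lie within its bounds")
    lower = max(q_lb * y, q - q_ub * (1 - y))
    upper = min(q_ub * y, q - q_lb * (1 - y))
    return lower, upper
```

For any admissible q the interval collapses to the single point z = q when y = 1 and to z = 0 when y = 0, which is exactly the product q·y. The tightness of the bounds does not affect this equivalence, only the strength of the resulting MILP relaxation.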

From the strong duality theory in linear programming, if the primal and the dual optimal solutions are bounded, then at optimality both objective function values must be equal (Ignizio and Cavalier 1994). This implies that the optimal objective value of the inner primal problem can be obtained by solving a system of equations encompassing an equality relation for the objective functions of both the primal and the dual problems and the accumulation of their respective constraints. The inner problem is thus transformed from an optimization problem to an equivalent set of equations (equalities and inequalities). This provides a single-level mixed-integer linear programming (MILP) formulation for OptReg that is solved using CPLEX (Brooke et al. 2005) accessed through GAMS (Brooke et al. 2003). The complete formulation of OptReg is provided in Appendix 4.1.
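The single-level transformation rests on a general property of linear programs, stated here for a generic primal-dual pair rather than for the OptReg inner problem itself. The symbols $c$, $A$, $b$, $x$ and $u$ are generic placeholders and do not correspond to specific quantities in the model.

```latex
% Generic LP pair; c, A, b, x, u are placeholder symbols.
\begin{aligned}
\text{(P)}\quad & \max_{x}\; c^{T}x \quad \text{s.t.}\quad Ax \le b,\; x \ge 0 \\
\text{(D)}\quad & \min_{u}\; b^{T}u \quad \text{s.t.}\quad A^{T}u \ge c,\; u \ge 0 \\[4pt]
\text{Weak duality:}\quad & c^{T}x \le b^{T}u \quad \text{for every feasible pair } (x,u) \\
\text{Strong duality:}\quad & c^{T}x^{*} = b^{T}u^{*} \quad \text{at optimality} \\[4pt]
\text{Therefore}\quad & \{\, Ax \le b,\; x \ge 0,\; A^{T}u \ge c,\; u \ge 0,\; c^{T}x = b^{T}u \,\}
\end{aligned}
```

Any pair (x, u) satisfying the last system is primal optimal and dual optimal, so the inner maximization can be replaced by these constraints inside the outer problem.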

The utility of the framework is next demonstrated by applying it to elucidate the optimal down/up regulation and knockout strategies for overproducing ethanol, an industrially important chemical. In the next section, we discuss and contrast strategies that involve (i) only reaction eliminations, (ii) only reaction modulations, i.e. activations/inhibitions, and (iii) both deletions and modulations for overproducing ethanol by manipulating the central metabolism in the Escherichia coli metabolic network as abstracted in Reed et al. (2003). The obtained results based on biomass maximization are also contrasted against the ones obtained based on the MOMA (Segre et al. 2002) criterion.

4.3 Strategies for overproducing ethanol

The OptReg formulation was employed to determine efficient reaction modification strategies for overproducing ethanol in the E. coli metabolic network comprised of more than 1,470 one-directional reactions. The glucose uptake rate was fixed at 10 mmol/gDW·hr and the network was allowed to take up a maximum of 20 mmol/gDW·hr of oxygen. All other nutrients available from minimal media in experimental studies, such as potassium and ammonia, were allowed in unlimited quantities and the network could secrete any metabolite. As expected, ethanol formation in the network takes place through acetaldehyde/alcohol dehydrogenase (adhE). It is worth mentioning that the predicted yields of ethanol in the network are strongly affected not only by the carbon availability but also by the abundance of NADH, which is a cofactor for adhE. Table 4.1 shows the two and three-reaction modification strategies predicted by the framework for overproducing ethanol. We first discuss the strategies predicted by the framework for a universal value of C = 0.5. The impact of C on these strategies will be briefly discussed in a separate subsection.

4.3.1 Two reaction modification strategies

A two reaction deletion mutant (mutant MA) involving the removal of oxygen transport and phosphotransacetylase (pta) (see Figure 4.3) was predicted by the framework. This double mutant network has a theoretical yield of 16.30 mmol/gDW·hr of ethanol. The predicted anaerobic (fermentative) environment for producing ethanol and the deletion of the competing pathway producing acetate are consistent with experimental evidence. The design for the two reaction modulation (i.e. repression and activation only) mutant involves the upregulation of pyruvate dehydrogenase (pdh) and the simultaneous downregulation of succinate dehydrogenase (sdh), leading to a theoretical yield of 16.72 mmol/gDW·hr (mutant MB). The upregulation of the pdh complex enables the conversion of most of the pyruvate into acetyl CoA with a concurrent generation of an extra NADH molecule per molecule of acetyl CoA formed. The decrease in flux through sdh reduces the activity in the citric acid cycle, preventing acetyl CoA from being channeled into it. The oxygen uptake in the network is very low, suggesting microaerobic conditions of growth.

Note that the model description does not include regulatory conditions which would signal adhE to be inactivated in the presence of oxygen. However, a study suggests that alcohol dehydrogenase is indeed active even in aerobic conditions concomitant to mutations in the adhE structural gene and in the promoter region (Holland-Staley et al. 2000). A recent publication reports successful overexpression of pdh to increase the carbon flux from pyruvate to acetyl CoA (Vadali et al. 2004).

Column 3 (row 1) of Table 4.1 shows a predicted design in which both activations/inhibitions and knockouts are allowed (mutant MC). Not surprisingly, oxygen uptake in the network is eliminated and pyruvate dehydrogenase (pdh) is upregulated. These two reaction manipulations, in tandem, lead to an ethanol production rate that is approximately 94% of the maximum theoretical yield of 19.87 mmol/gDW·hr. Interestingly, Dien et al. (2003) show that pyruvate formate lyase (pfl) is induced in anaerobic conditions and a majority of the pyruvate flux is directed through it. However, this fermentative pathway is unbalanced because one NADH and one proton are generated for each pyruvate made from sugars, and two NADH and two protons are required for converting pyruvate into ethanol (see Appendix 4.2 for the exact stoichiometry of the reactions). Therefore, E. coli balances fermentation by also producing acetate, lactate and succinate, which compromises the yield of ethanol (Dien et al. 2003). The proposed strategy (mutant MC) circumvents this problem by upregulating pdh, leading to a concurrent generation of NADH that alleviates cofactor imbalances in the fermentative pathway. This provides higher yields of ethanol by preventing carbon loss to other fermentative products. Note that experimental observations indicate considerable pdh activity in anaerobic conditions which can be further induced by the presence of pyruvate in the medium (Carlsson et al. 1985). Similar results with high flux through pdh were reported (Underwood et al. 2002) in experiments with pyruvate and acetaldehyde-supplemented media to optimize carbon partitioning at the acetyl CoA node to promote both ethanol and biomass yields in ethanologenic E. coli. The flux distribution through the mutant network MC shows a very small flux through the glycolytic pathway; instead the flux is diverted to the oxidative branch of the pentose phosphate pathway through which it enters the Entner-Doudoroff pathway. Notably, the ethanol-producing bacterium Zymomonas mobilis employs this pathway for glucose metabolism instead of the Embden-Meyerhof-Parnas glycolytic pathway. The net yield of ATP through this pathway is only one mole per mole of glucose consumed, which results in low cell mass allowing for higher ethanol yields (Jeffries 2005; Seo et al. 2005).
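The redox argument for preferring pdh over pfl can be checked with simple bookkeeping per pyruvate. The sketch below uses only the NADH counts stated in the text and in Appendix 4.2; the function and constant names are illustrative, and protons, ATP and all other reactions are ignored.

```python
# NADH generated per pyruvate made from sugars (as stated in the text).
NADH_PER_PYRUVATE_FROM_SUGAR = 1

# NADH generated when pyruvate is converted to acetyl CoA (Appendix 4.2).
NADH_FROM_PYRUVATE_TO_ACCOA = {
    "pfl": 0,  # CoA + pyruvate -> acCoA + formate
    "pdh": 1,  # CoA + NAD+ + pyruvate -> acCoA + CO2 + NADH
}

# NADH consumed when acetyl CoA is reduced to ethanol (Appendix 4.2).
NADH_FOR_ACCOA_TO_ETHANOL = 2


def nadh_balance_per_pyruvate(route):
    """Net NADH (generated minus consumed) per pyruvate sent to ethanol."""
    generated = NADH_PER_PYRUVATE_FROM_SUGAR + NADH_FROM_PYRUVATE_TO_ACCOA[route]
    return generated - NADH_FOR_ACCOA_TO_ETHANOL
```

Under these counts the pfl route leaves a deficit of one NADH per pyruvate, which is why other fermentation products are needed to balance the pathway, whereas the pdh route is exactly balanced.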

We next investigated the maximum and minimum biochemical production abilities of the mutant networks. To this end, we solved the Primal problem (see subsection 4.2.3) by first maximizing and subsequently minimizing ethanol production at all values of biomass feasible to the network. The biochemical production envelopes that we derived for the three mutant networks are shown in Figure 4.4. The maximum theoretical yield of the deletion mutant MA is denoted by point A1. However, the presence of the vertical line A1A2 at the maximum biomass formation rate implies the existence of alternative optimal solutions, which means that the theoretical yield of ethanol varies between points A1 and A2. Point B1 denotes the maximum theoretical yield attainable in the mutant network MB. Line B1B2 denotes the rightmost boundary of the allowable phenotypes. Note that point B2 corresponds to zero ethanol yield, implying no coupling between growth and ethanol production for this mutant. In contrast, the network for mutant MC exhibits ethanol production limits that are superior not only to those for mutant MB but also to those for the mutant network MA. Notably, even at small biomass formation rates, the network for mutant MC produces ethanol (point C3). Thus, the combination of an upregulation and a knockout in this case eliminates undesirable phenotypes to a large extent as demarcated by the lower boundary C1C3, ultimately providing a single point optimal solution of 18.64 mmol/gDW·hr at the maximum growth rate (point C1). Note that the cofactor balance in the network forces the increased acetyl CoA flux to be directed towards ethanol production and prevents the formation of other fermentative products such as acetate. The results allude to the synergistic effect of regulations and deletions on enhancing biochemical production.

4.3.2 Three reaction modification strategies

The identified strategies for overproducing ethanol which entail modification of three reactions are described here. The three reaction deletion mutant (mutant MD) involves the removal of pfl and phosphoglucoisomerase (pgi) in an anaerobic growth environment. As a consequence of the elimination of pgi, 99.8% of the glucose-6-phosphate flux is directed into the pentose phosphate pathway, which eventually enters the Entner-Doudoroff pathway. The expected ethanol yield for this mutant network is 18.74 mmol/gDW·hr. The consequence of the removal of pfl has already been discussed in the context of the upregulation of pdh for mutant MC. We next discuss the effect of allowing only activations and inhibitions of up to three reactions (mutant ME). OptReg suggested that oxygen uptake be downregulated along with the inhibition of sdh and upregulation of flux through pdh. The maximum theoretical yield of ethanol in this case is approximately 16.72 mmol/gDW·hr. Note that no improvement over the two reaction modulation strategy is shown by any of the more than twenty alternative optima that we determined by applying integer cuts (Pharkya et al. 2004).
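The integer cuts mentioned above are cited to Pharkya et al. (2004) and are not written out in this chapter. A common form, shown here as an assumption rather than as the exact constraint used in this work, excludes a previously found design by requiring that at least one of its manipulations be undone, i.e. that the binary variables that were zero in the earlier solution sum to at least one. The function name and the dictionary layout are hypothetical.

```python
def violates_integer_cut(design, previous_design):
    """Check a candidate design against an assumed integer cut.

    Both arguments map (reaction, kind) -> binary value, where kind is
    'k', 'd' or 'u' and a value of 0 means the manipulation is applied.
    Assumed cut: the sum of y over all manipulations applied in the
    previous design must be at least 1.
    """
    applied_before = [key for key, y in previous_design.items() if y == 0]
    return sum(design.get(key, 1) for key in applied_before) < 1
```

Adding one such cut per solution already found and re-solving the MILP enumerates alternative designs one at a time. In this form the cut also removes any design that contains the earlier one as a subset.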

Once again, allowing for both regulations and knockouts (mutant MF) leads to the best scenario for ethanol production. In addition to the elimination of pfl in an anaerobic environment, whose effects have already been discussed, the downregulation of phosphoglucomutase (pgm) is predicted to provide an ethanol formation rate of 19.83 mmol/gDW·hr. This is approximately 99.84% of the maximum theoretical yield. The computationally generated flux distribution diagram for the network of mutant MF is shown in Figure 4.5. A non-intuitive strategy for enhancing the availability of NADH to promote ethanol production through adhE is observed. Approximately 5.1 mmol/gDW·hr of 3-phosphoglycerate, an intermediate in the second half of glycolysis, is converted into serine, thus producing NADH (see Appendix 4.2). A majority of serine is subsequently converted into pyruvate through L-serine deaminase (sda in Figure 4.3) and is utilized for ethanol production. Note that L-serine deaminase can be produced and activated in E. coli grown in glucose minimal medium when appropriate mutations are introduced (Lan and Newman 2003). The enzyme can also be induced in a variety of ways including growth in an anaerobic environment (Su et al. 1989).

An interesting observation that can be made from the flux distribution in Figure 4.5 is that even though the serine deaminase reaction is not up regulated, the flux through it is substantially increased. This is because the modeling impact of the down regulation of the phosphoglucomutase reaction is to add a constraint on the network that forces the flux through the phosphoglucomutase reaction to a value lower than the reference steady-state value. Consequently, the solution of the inner problem, with this added constraint, causes the flux through a number of reactions, specifically serine deaminase, to depart from their corresponding reference values. Thus, the direct modulation of specific reactions may indirectly affect the flux through other reactions in the network even though they are not classified as up or down regulated.

The ethanol production envelopes of the networks are shown in Figure 4.6. The maximum theoretical yield of mutant MD is denoted by point D1. Although this network also exhibits alternative optimal solutions (line D1D2), the range of alternative optima at the maximum biomass formation rate is considerably smaller as compared to the one found for mutant MA. The network for mutant ME shows significant improvement in terms of the coupling between growth and ethanol production as compared to the network for mutant MB even though the maximum theoretical yields for both networks are the same. The range of ethanol production rates for mutant ME at the maximum growth rate is depicted by line E1E2. It can be seen that the network has to produce more than 50% of the maximum theoretical yield of ethanol (point E2) at the maximum rate of growth. The downregulation of pgm in conjunction with the elimination of pfl and oxygen availability to the network of mutant MF forces it to form a high amount of ethanol (point F1). The feasible solution space for this network is significantly reduced and the mutant is forced to secrete ethanol in the presence of growth.

Given that more than 99% of the theoretical yield of ethanol is achieved with three modulations and knockouts, further manipulations in the network do not produce discernible improvements.
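The relative performance of the six designs can be tabulated directly from the rates reported in Table 4.1 and the maximum theoretical yield of 19.87 mmol/gDW·hr quoted in the text. The short script below is only a convenience for that arithmetic; the function names are illustrative, and because the table entries are rounded, the computed percentages may differ in the last digit from those quoted in the text.

```python
# Maximum theoretical ethanol yield quoted in the text (mmol/gDW·hr).
MAX_THEORETICAL_YIELD = 19.87

# Ethanol production rates from Table 4.1 (mmol/gDW·hr), C = 0.5.
ETHANOL_RATE = {
    "MA": 16.30, "MB": 16.72, "MC": 18.64,
    "MD": 18.74, "ME": 16.72, "MF": 19.83,
}


def percent_of_maximum(rate, maximum=MAX_THEORETICAL_YIELD):
    """Express a production rate as a percentage of the theoretical maximum."""
    return 100.0 * rate / maximum


def rank_mutants(rates):
    """Return mutant names ordered from highest to lowest ethanol rate."""
    return sorted(rates, key=rates.get, reverse=True)


if __name__ == "__main__":
    for name in rank_mutants(ETHANOL_RATE):
        print(f"{name}: {percent_of_maximum(ETHANOL_RATE[name]):.1f}% of maximum")
```

The ranking places the combined deletion/modulation design MF first and the two reaction deletion design MA last, in line with the discussion above.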

4.3.3 Evaluating the mutant networks using an alternate objective: MOMA

MOMA (Segre et al. 2002) is an alternative criterion introduced recently to anticipate the behavior of microbial systems immediately after imposing genetic modifications. This criterion hypothesizes that microorganisms adjust to a perturbation by minimally adjusting their flux distribution to be as close as possible to that of the wild-type organism, which is no longer accessible. Here, we calculate the MOMA criterion for all six mutant networks identified by OptReg. The objective is to deduce whether the maximization of biomass criterion employed in OptReg yields mutants with predicted yields that are in agreement or disagreement with the predicted yields under the MOMA criterion. The base case flux distribution for the MOMA study is derived by obtaining a feasible set of fluxes that meets the steady-state biomass formation rate (0.81 hr⁻¹). Figure 4.7 shows that for five out of six of the predicted redesigns (mutants MA, MC, MD, ME, MF), both criteria predict very similar yields. The only outlier is the mutant network MB, for which MOMA predicts a much lower yield compared to the max biomass criterion.

This close agreement in the results obtained using biomass maximization and MOMA is not surprising in light of the fact that biomass maximization is used to estimate the original steady-state flux distribution needed for applying MOMA. These results are indicative, but not necessarily conclusive, that the strategies obtained from the OptReg framework are quite robust with respect to the choice for the cellular objective (i.e., max biomass or MOMA).
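The MOMA criterion can be illustrated with its distance measure. As formulated by Segre et al. (2002), MOMA minimizes the Euclidean distance between the mutant and wild-type flux vectors over the whole feasible flux space by quadratic programming. The sketch below only evaluates that distance and picks the closest vector from a short list of candidates, which is a simplification made here for illustration; the function names and the flux values in the usage note are hypothetical.

```python
import math


def moma_distance(v_mutant, v_wild_type):
    """Euclidean distance between a mutant and the wild-type flux vector."""
    if len(v_mutant) != len(v_wild_type):
        raise ValueError("flux vectors must have the same length")
    return math.sqrt(sum((m - w) ** 2 for m, w in zip(v_mutant, v_wild_type)))


def closest_to_wild_type(candidates, v_wild_type):
    """Pick the candidate flux vector with the smallest MOMA distance.

    A real MOMA calculation searches the entire feasible region of the
    mutant network, not a finite list of candidates.
    """
    return min(candidates, key=lambda v: moma_distance(v, v_wild_type))
```

For a hypothetical wild-type vector `[10, 6, 4]` and mutant candidates `[10, 0, 10]` and `[10, 5, 5]`, the second candidate is selected because it requires the smaller overall flux adjustment.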

4.3.4 Effect of the value of the regulation strength parameter, C

In this section we investigate the effect of different values of C on (i) ethanol formation rates and (ii) prediction of design strategies for overproducing ethanol.

With the first objective in mind, we applied four different values of C to the mutant networks identified with C equal to 0.5 and recalculated the ethanol yields. Table 4.2 shows the obtained results. Notably, for values of C below 0.5 and for mutants that involve both modulations and deletions, the theoretical ethanol yields are similar, though not identical, to the ones predicted for the mutant networks for C equal to 0.5. Not surprisingly, the influence of the ‘regulation strength parameter’ is more pronounced for strategies that involve only reaction modulations. However, it was somewhat surprising to find that the predicted flux strategies led to infeasible solutions for parameter values higher than 0.5. Upon further analysis, we found that higher values of C imply that the fluxes are constrained more tightly. As a result, the strategies suggested for higher C values cannot meet the cofactor requirements in the network and become infeasible.

Next, we revisited OptReg for values of C equal to 0.25 and 0.75 and identified how the predicted strategies change. When the parameter C is set at 0.75, the framework chooses to upregulate pfl and inhibit oxygen uptake. Contrary to the two-reaction modulation strategy for C equal to 0.5, pdh is not upregulated here. This is because a higher value of C implies that a higher range of fluxes is allowed for pdh. The enhanced pdh flux corresponds to an increased requirement of the oxidized cofactor NAD+. Due to the inhibition of oxygen uptake, the reduced cofactor NADH cannot be oxidized at the rate that is required for the pyruvate dehydrogenase flux to be upregulated at C equal to 0.75. As expected, this strain has a lower predicted yield of ethanol than mutant MB because of the unbalanced fermentative pathway through pfl. Thus, it appears that the cofactor balance in the network is a critical factor for deciding whether a particular reaction can be up or down regulated beyond a certain threshold.

For C equal to 0.25, all regulated fluxes are less constrained as compared to the case when C is equal to 0.5. Therefore, all the strategies suggested for C equal to 0.5 are feasible (see Table 4.2) though not necessarily optimal. For example, we found a three reaction deletion/modulation strategy that can lead to the production of 19.68 mmol/gDW·hr of ethanol in the network. The framework predicted the upregulation of malic enzyme (ME) in addition to the elimination of pfl in an anaerobic environment. To increase the flux through ME, the framework first redirects carbon flux from phosphoenolpyruvate (PEP) to oxaloacetate (OAA) through the phosphoenolpyruvate carboxylase (ppc) reaction. OAA is then transformed into malate through the reverse malate dehydrogenase (mdh) reaction, which finally gets converted into pyruvate through ME (see Figure 4.3). These results indicate that the value of C must be carefully chosen by taking into account the strength of the available promoters.

4.4 Conclusion

This chapter describes an integrated framework to identify optimal modulation and deletion strategies for biochemical overproduction. The strategies predicted for the specific case study (ethanol overproduction) undertaken in this work demonstrate that OptReg can allocate and manipulate the fluxes in the network to meet the carbon and the redox requirements for accomplishing the desired biotechnological goal. The critical role that cofactor availability plays in the accomplishment of the desired biotechnological goal has been reported extensively in the literature (San et al. 2002; Berrios-Rivera et al. 2004). Also, quite interestingly, results obtained from the OptReg framework are found to be quite robust with respect to the choice for the cellular objective (i.e., max biomass or MOMA).

It should be emphasized that although reaction deletions or modulations alone can be successful at enhancing the secretion of the desired biochemical, it is the synergistic effect of both kinds of manipulations that bears the maximum effect on the targeted overproduction. Specifically, for fermentative products such as ethanol where oxygen uptake needs to be eliminated, just downregulation of oxygen uptake may not generate the desired results. Similarly, reaction eliminations restrict the range of reactions that can be manipulated in a network because of lethality considerations. For example, in the network for mutant MF, the downregulation of pgm and the elimination of oxygen availability to the network in conjunction with the removal of pfl lead to high ethanol yields. If only reaction deletions are allowed in the framework, a strategy with simultaneous deletion of pgm and oxygen uptake is not feasible because the network cannot meet the non-growth associated ATP maintenance requirements under such a scenario.

It is worthwhile to note that the modulation strategies predicted by OptReg should be interpreted carefully because we separate the reversible reactions into their forward and backward counterparts. For example, in one case we found that OptReg suggested the upregulation of the backward flux of the aconitase reaction. However, when both the forward and backward fluxes of this reaction were analyzed, the net flux was found to be downregulated in the forward direction.
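The interpretation issue described above amounts to comparing net fluxes rather than the individual one-directional fluxes. A minimal sketch of that comparison follows; the function name, the labels and the numbers in the usage note are hypothetical and are not taken from the aconitase case.

```python
def net_flux_change(v_fwd, v_bwd, v0_fwd, v0_bwd):
    """Compare the net forward flux of a reversible reaction to its reference.

    v_fwd, v_bwd: forward and backward fluxes in the mutant network.
    v0_fwd, v0_bwd: forward and backward fluxes in the reference state.
    Returns (net, reference_net, label).
    """
    net = v_fwd - v_bwd
    reference_net = v0_fwd - v0_bwd
    if net > reference_net:
        label = "net forward flux increased"
    elif net < reference_net:
        label = "net forward flux decreased"
    else:
        label = "net forward flux unchanged"
    return net, reference_net, label
```

For instance, if the backward flux rises from 1 to 3 while the forward flux stays at 5, the backward direction is up regulated but the net forward flux falls from 4 to 2, so the reaction as a whole is down regulated in the forward direction.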

It is important to note that OptReg predicts gene modulation/deletion strategies by using stoichiometric models of metabolism. These models offer a genome-scale, though approximate, description of cellular metabolism and biochemistry (Schilling et al. 2002; Forster et al. 2003; Reed et al. 2003) allowing for the global assessment of the effect of energetics and cofactor balancing on overproduction. These models, used within the context of the Flux Balance Analysis (FBA) framework (Varma and Palsson 1994; Price et al. 2004), have been quite successful at elucidating the growth characteristics of E. coli cells in disparate environments (Edwards et al. 2001; Ibarra et al. 2002; Fong et al. 2003) and their responses to gene mutations (Fong and Palsson 2004). However, they do suffer from insensitivity to potential kinetic and regulatory barriers. Therefore, it is important to carefully interpret the results obtained from OptReg and contrast them with results obtained from complementary efforts using kinetic models and those employing the knowledge of qualitative regulatory interactions.

4.5 Appendix 4.1

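The single level OptReg formulation obtained by converting the inner level optimization problem into a set of equations by utilizing the strong duality theory is as follows:

$$
\begin{aligned}
\text{maximize} \quad & v_{biochemical} && \text{(OptReg)} \\
\text{s.t.} \quad & v_{biomass} - \varepsilon \sum_j v_j = (v_{atp\_maint} \cdot \lambda_{atp}) + (0.01 \cdot v_{biomass}^{max} \cdot \lambda_{bio}) + \sum_j \left( z^k_{U,j} \cdot v_j^{max} + z^k_{L,j} \cdot v_j^{min} \right) \\
& \quad + \sum_j \left\{ (v_j^{max} \cdot z^d_{U,j}) + \left[ (v^0_{j,L})(1-C) + (v_j^{min})(C) \right] (q^d_{U,j} - z^d_{U,j}) + (q^d_{L,j})(v_j^{min}) \right\} \\
& \quad + \sum_j \left\{ (q^u_{U,j} \cdot v_j^{max}) + (v_j^{min} \cdot z^u_{L,j}) + \left[ (v^0_{j,U})(1-C) + (v_j^{max})(C) \right] (q^u_{L,j} - z^u_{L,j}) \right\}
\end{aligned}
$$

$$
\begin{aligned}
& \sum_{j=1}^{M} S_{ij} v_j = 0, && \forall i \in N \\
& v_{atp} \ge v_{atp\_maint}, \quad v_{biomass} \ge (0.01) \cdot v_{biomass}^{max}, \quad v_{glc} = 10 \ \text{mmol/gDW·hr} \\
& v_j \le v_j^{max} \, y_j^k, && \forall j \in M \\
& v_j \ge v_j^{min} \, y_j^k, && \forall j \in M \\
& v_j^{min} \le v_j \le \left[ (v^0_{j,L})(1-C) + (v_j^{min})(C) \right] (1 - y_j^d) + v_j^{max} \, y_j^d, && \forall j \in M \\
& \left[ (v^0_{j,U})(1-C) + (v_j^{max})(C) \right] (1 - y_j^u) + v_j^{min} \, y_j^u \le v_j \le v_j^{max}, && \forall j \in M \\
& (1 - y_j^k) + (1 - y_j^d) + (1 - y_j^u) \le 1, && \forall j \in M \\
& y_j^k \in \{0,1\}; \quad y_j^d \in \{0,1\}; \quad y_j^u \in \{0,1\}, && \forall j \in M \\
& \sum_j \left[ (1 - y_j^k) + (1 - y_j^u) + (1 - y_j^d) \right] \le L \\
& y_j^k = y_{j+1}^k, \quad y_j^d + y_{j+1}^d \ge 1, \quad y_j^u + y_{j+1}^u \ge 1, && \forall j \in M_{rev}
\end{aligned}
$$

$$
\begin{aligned}
& \sum_{i=1}^{N} \lambda_i S_{ij} + q^k_{U,j} + q^k_{L,j} + q^d_{U,j} + q^d_{L,j} + q^u_{U,j} + q^u_{L,j} \ge -\varepsilon, && \forall j \in M, \ j \ne atp, biomass \\
& \sum_{i=1}^{N} \lambda_i S_{i,biomass} + q^k_{U,biomass} + q^k_{L,biomass} + q^d_{U,biomass} + q^d_{L,biomass} + q^u_{U,biomass} + q^u_{L,biomass} + \lambda_{bio} \ge 1 - \varepsilon \\
& \sum_{i=1}^{N} \lambda_i S_{i,atp} + q^k_{U,atp} + q^k_{L,atp} + q^d_{U,atp} + q^d_{L,atp} + q^u_{U,atp} + q^u_{L,atp} + \lambda_{atp} \ge -\varepsilon
\end{aligned}
$$

$$
\begin{aligned}
& 0 \le z^k_{U,j} \le (q^k_{U,j})^{UB} \cdot y_j^k, && \forall j \in M \\
& q^k_{U,j} - (q^k_{U,j})^{UB} \cdot (1 - y_j^k) \le z^k_{U,j} \le q^k_{U,j}, && \forall j \in M \\
& (q^k_{L,j})^{LB} \cdot y_j^k \le z^k_{L,j} \le 0, && \forall j \in M \\
& q^k_{L,j} \le z^k_{L,j} \le q^k_{L,j} - (q^k_{L,j})^{LB} \cdot (1 - y_j^k), && \forall j \in M \\
& 0 \le z^d_{U,j} \le (q^d_{U,j})^{UB} \cdot y_j^d, && \forall j \in M \\
& q^d_{U,j} - (q^d_{U,j})^{UB} \cdot (1 - y_j^d) \le z^d_{U,j} \le q^d_{U,j}, && \forall j \in M \\
& (q^u_{L,j})^{LB} \cdot y_j^u \le z^u_{L,j} \le 0, && \forall j \in M \\
& q^u_{L,j} \le z^u_{L,j} \le q^u_{L,j} - (q^u_{L,j})^{LB} \cdot (1 - y_j^u), && \forall j \in M \\
& q^k_{U,j}, \ q^d_{U,j}, \ q^u_{U,j} \ge 0; \quad q^k_{L,j}, \ q^d_{L,j}, \ q^u_{L,j} \le 0, && \forall j \in M \\
& \lambda_i \in \mathbb{R}, \ \forall i \in N; \quad \lambda_{atp} \le 0; \quad \lambda_{bio} \le 0
\end{aligned}
$$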

4.6 Appendix 4.2

Stoichiometry of selected reactions

Pyruvate formate lyase: CoA + pyruvate → acCoA + formate

Pyruvate dehydrogenase: CoA + NAD+ + pyruvate → acCoA + CO2 + NADH

Acetaldehyde dehydrogenase: acCoA + 2 H+ + 2 NADH → CoA + ethanol + 2 NAD+

Phosphoglycerate dehydrogenase (serA): 3-phosphoglycerate + NAD+ → NADH + H+ + 3-phosphohydroxypyruvate

L-Serine deaminase (sda): L-serine → NH4 + pyruvate

Table 4.1: Strategies for overproducing ethanol (for C = 0.5) listed along with the corresponding ethanol production and biomass formation rates. The upregulated reactions are denoted with the (↑) symbol. Conversely, downregulated reactions are denoted with the (↓) symbol and the deleted reactions with a (X) symbol. The numbers within parentheses show the relative values of the modified fluxes when compared to the steady state fluxes. Biomass formation rate is expressed on a per hour basis and all the other fluxes are expressed in units of mmol/gDW·hr.

| Mutant | Strategy type | Modifications | Ethanol | Biomass |
|---|---|---|---|---|
| MA | Deletions | Oxygen transport (X); Phosphotransacetylase (X) | 16.30 | 0.19 |
| MB | Regulations | Pyruvate dehydrogenase (↑) (1.75); Succinate dehydrogenase (↓) (0.33) | 16.72 | 0.17 |
| MC | Regulations/Deletions | Pyruvate dehydrogenase (↑) (1.75); Oxygen transport (X) | 18.64 | 0.09 |
| MD | Deletions | Glucose-6-phosphate isomerase (X); Pyruvate formate lyase (X); Oxygen transport (X) | 18.74 | 0.08 |
| ME | Regulations | Pyruvate dehydrogenase (↑) (1.75); Succinate dehydrogenase (↓) (0.33); Oxygen transport (↓) (0.14) | 16.72 | 0.17 |
| MF | Regulations/Deletions | Phosphoglucomutase (↓) (0.37); Pyruvate formate lyase (X); Oxygen transport (X) | 19.83 | 0.011 |

Table 4.2: Comparison of the theoretical yields of ethanol when the strategies obtained for a value of C equal to 0.5 are implemented on mutant networks with different values of C. M stands for modulation and M/D stands for both modulation and deletion.

| C | Two reaction M | Two reaction M/D | Three reaction M | Three reaction M/D |
|---|---|---|---|---|
| 0.25 | 4.19 | 16.4 | 4.19 | 18.64 |
| 0.4 | 13.99 | 17.59 | 13.99 | 19.1 |
| 0.5 | 16.72 (MB) | 18.64 (MC) | 16.72 (ME) | 19.83 (MF) |
| 0.6 | Infeasible | Infeasible | Infeasible | 16.95 |
| 0.75 | Infeasible | Infeasible | Infeasible | Infeasible |

Figure 4.1: A pictorial overview of the definitions of up/down regulations and deletions. (Flux-value axis marked at $v_j^{\min}$, $v_{j,L}^{0}$, $v_{j,U}^{0}$ and $v_j^{\max}$, showing the knockout, down regulation, steady state range and up regulation regions, with the fractions $C$ and $1-C$ indicated.)
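The definitions pictured in Figure 4.1 correspond to the flux bounds used in the OptReg constraints of Appendix 4.1. As a minimal sketch, assuming illustrative flux values that are not taken from the thesis, the allowed flux range for each type of manipulation can be computed as follows (the function name and dictionary keys are hypothetical):

```python
def regulation_bounds(v_min, v_max, v0_lo, v0_hi, C):
    """Allowed flux ranges for one reaction under each manipulation.

    v_min, v_max : minimum and maximum flux of the reaction in the network
    v0_lo, v0_hi : lower and upper ends of the steady state flux range
    C            : regulation strength parameter, 0 <= C <= 1
    """
    return {
        # Knockout: the flux is forced to zero.
        "knockout": (0.0, 0.0),
        # Down regulation: v_min <= v <= v0_lo*(1 - C) + v_min*C
        "down": (v_min, v0_lo * (1 - C) + v_min * C),
        # Up regulation: v0_hi*(1 - C) + v_max*C <= v <= v_max
        "up": (v0_hi * (1 - C) + v_max * C, v_max),
    }


if __name__ == "__main__":
    # Hypothetical reaction: flux range [0, 20], steady state range [4, 6].
    print(regulation_bounds(0.0, 20.0, 4.0, 6.0, C=0.5))
```

Larger values of C push the down regulated upper bound towards the minimum flux and the up regulated lower bound towards the maximum flux, that is, they demand a stronger departure from the steady state range.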

Figure 4.2: The flux values at steady-state fixed at experimental values extracted from (Fischer et al. 2004).

Figure 4.3: A pictorial representation of the central metabolic network of Escherichia coli and the associated genes to explain the proposed metabolic engineering strategies.

Figure 4.4: The biochemical production abilities of the mutant networks with two reactions knocked-out and/or modulated. (Axes: ethanol formation rate in mmol/gDW·hr versus biomass formation rate in 1/hr; curves for mutants MA, MB and MC.)

Figure 4.5: The flux distribution for the three reaction mutant network MF. The down regulated reaction is shown with a dashed arrow and the cross represents the reaction deletion.

Figure 4.6: The biochemical production abilities of the mutant networks with three reactions deleted and/or regulated. (Axes: ethanol formation rate in mmol/gDW·hr versus biomass formation rate in 1/hr; curves for mutants MD, ME and MF.)

Figure 4.7: The ethanol yields (in mmol/gDW·hr) of the six mutant networks obtained by using the OptReg framework (max biomass) contrasted against the yields when the minimal flux adjustment criterion (MOMA) is imposed on the networks. The yields from five of the six mutant networks are very close to the perfect correlation line (shown dotted). (Axes: MOMA-based yields versus biomass maximization-based yields.)

Chapter 5

Optimal enzyme selection using kinetic models

5.1 Introduction

The systematic development of optimal microbial strains in biotechnology and the identification of efficient therapeutic interventions in medicine still present a fundamental challenge for metabolic engineering in the postgenomic era (Cornish-Bowden and Cardenas 2000; Kholodenko and Westerhoff 2004). In this endeavor, mathematical models of cellular metabolism have become indispensable by providing a description of how a system's properties such as metabolic fluxes, concentrations, or cell growth respond to changes in the system's components and environment (i.e., gene knockouts, changes in gene expression or nutrient availability). Notably, genome-scale stoichiometric models have been widely used to devise genetic modification strategies for the targeted overproduction of useful biochemicals, based on cellular stoichiometry alone (Stephanopoulos et al. 1998; Palsson 2004). While successful in many instances, stoichiometric models cannot capture dynamic effects mediated by changes in metabolite concentrations, enzyme activities, or changes in the environment and genetic control. Therefore, to simulate and analyze the effects of perturbations in cellular systems, a number of research groups have undertaken the development of large-scale kinetic mechanistic models (Chassagnole et al. 2002). Prominent modeling projects include the ECell International Project (Tomita 2001), the minimal cell (Castellanos et al. 2004), and virtual cell models (Slepchenko et al. 2003).

Two mathematical approaches utilizing kinetic models of metabolism for rational analysis of cellular systems are Metabolic Control Analysis (MCA) (Kacser and Burns 1973; Heinrich and Rapoport 1974) and Biochemical Systems Analysis (BSA) (Savageau 1976). These approaches rely respectively on local linear and log-linear approximations of inherently nonlinear metabolic phenomena. Specifically, MCA aims to relate the variables of a metabolic system to its parameters (Stephanopoulos et al. 1998). Once the relationships are established, the sensitivity of a system variable, such as a flux or a specific metabolite concentration, to the system parameters, namely enzyme activities or effector concentrations, can be determined. These sensitivities are described by a set of coefficients, called the control coefficients, which apply only to the steady state conditions studied. The linearity of the MCA representation has been exploited before for rational design analysis leading to improvements in bioprocess performance by genetic modifications of metabolic control structures (Hatzimanikatis et al. 1996).

Motivated by the advent of relatively large-scale mechanistic models of cellular systems, the objective of this study is to introduce a general optimization framework to identify minimal enzyme sets leading to a significant overproduction potential. The representative study undertaken in this chapter investigates the optimal enzyme sets for increasing the serine synthesis flux and the flux through the phosphotransferase system in a kinetic model of Escherichia coli central metabolism (Chassagnole et al. 2002; Visser et al. 2004).

5.2 Modeling cellular kinetics and cellular machinery

A mathematical model (1) of relevant processes in metabolism and genetic control can be postulated as a set of kinetic mass balances coupled with equations describing the genetic machinery (i.e., ribosomes and RNA polymerases contents) (Schmid et al. 2004).

$$
\begin{aligned}
\frac{dC_i}{dt} &= \sum_{j=1}^{M} S_{ij} \cdot r_j(r_j^{\max}, \mathbf{C}, \mathbf{K}), && \forall\, i \in \mathcal{N} \\
\frac{de_j}{dt} &= r_j^{syn} - r_j^{deg}, && \forall\, j \in \mathcal{M}
\end{aligned}
\qquad (1)
$$

Here $C_i$ is the concentration of metabolite $i$, $i \in \mathcal{N}$, $S_{ij}$ is the stoichiometric coefficient of metabolite $i$ in reaction $j$, $j \in \mathcal{M}$, $r_j$ is the rate of reaction $j$, $\mathbf{C}$ is the vector of metabolite concentrations, $\mathbf{K}$ is the vector of kinetic parameters, and $r_j^{\max}$ is the maximal reaction rate determined by enzyme level $e_j$. Changes in enzyme levels $e_j$ are defined by the rates of enzyme synthesis $r_j^{syn}$ and degradation $r_j^{deg}$. $\mathcal{N} = \{1,\dots,N\}$ and $\mathcal{M} = \{1,\dots,M\}$ are the sets of metabolites and reactions, respectively.
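To make the metabolite balances in model (1) concrete, the sketch below evaluates $dC_i/dt = \sum_j S_{ij} r_j$ for a toy two-metabolite, three-reaction pathway. The network, the Michaelis-Menten rate laws, and all parameter values are illustrative assumptions and are not the E. coli model used in this chapter.

```python
# Toy pathway (hypothetical):  -> C1 -> C2 ->
# Rows are metabolites, columns are reactions.
S = [
    [1, -1, 0],   # C1: produced by reaction 1, consumed by reaction 2
    [0, 1, -1],   # C2: produced by reaction 2, consumed by reaction 3
]


def rates(r_max, C, K):
    """Reaction rates r_j(r_max, C, K) for the toy pathway."""
    return [
        r_max[0],                         # constant supply
        r_max[1] * C[0] / (K[0] + C[0]),  # Michaelis-Menten in C1
        r_max[2] * C[1] / (K[1] + C[1]),  # Michaelis-Menten in C2
    ]


def dC_dt(r_max, C, K):
    """Right-hand side of the metabolite balances: S times r."""
    r = rates(r_max, C, K)
    return [sum(S[i][j] * r[j] for j in range(len(r))) for i in range(len(S))]


if __name__ == "__main__":
    # With these assumed parameters, C = [1, 1] is a steady state.
    print(dC_dt(r_max=[1.0, 2.0, 2.0], C=[1.0, 1.0], K=[1.0, 1.0]))
```

Changing an entry of `r_max` mimics changing the corresponding enzyme level, which is the handle the optimization framework manipulates.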

Model (1) can be used to select optimal enzyme sets $E_L$ (i.e., $E_L = \{j_1,\dots,j_L\}$) and the corresponding enzyme levels (i.e., $e_{j_1},\dots,e_{j_L}$) such that the best possible reaction rate $r_{j_0}$ can be achieved for the overproduction of a specific biochemical of interest (i.e., a product of reaction $j_0$). Since detailed information for describing model (1) is rarely available, reasonable approximations are necessary (Young et al. 2004). In this study, we follow an approximation approach (Mauch et al. 2001) to account for important processes such as the redistribution of limited mRNA contents and homeostasis. Specifically, the following constraints are introduced to anticipate the effects of mutation. Constraint (2) ensures that an increase in certain enzyme levels is compensated by a decrease in the levels of the remaining ones,

$$\frac{1}{M} \sum_{j=1}^{M} \frac{r_j^{\max}}{r_j^{\max,0}} = 1. \qquad (2)$$

Since maximal reaction rates are proportional to enzyme levels (Stephanopoulos et al. 1998), each ratio in (2) can be interpreted as the ratio of enzyme levels $e_j$ and $e_j^{0}$ for engineered and reference organisms, respectively, i.e.,

$$\frac{r_j^{\max}}{r_j^{\max,0}} = \frac{e_j}{e_j^{0}}.$$

Cellular systems maintain homeostasis (Reich and Selkov 1981; Heinrich and Schuster 1996), meaning that any large increase in metabolite concentrations triggers the expression of specific genes responsible for the synthesis of enzymes counteracting the undesired changes. This can be mathematically captured by constraint (3), which enforces allowable concentration changes (i.e., within $\delta \cdot 100\%$) relative to the reference steady state concentrations $C^{0}$,

$$\frac{1}{N} \sum_{i=1}^{N} \frac{|C_i - C_i^{0}|}{C_i^{0}} \le \delta. \qquad (3)$$

Constraints (2) and (3) by themselves are still not sufficient to describe coordinated changes in all enzymes when only L enzymes are modulated. To account for such coordinated changes, the following constraint is introduced

$$\frac{r_{j'_1}^{\max}}{r_{j'_1}^{\max,0}} = \dots = \frac{r_{j'_K}^{\max}}{r_{j'_K}^{\max,0}} = \gamma. \qquad (4)$$

Here $j'_1, j'_2, \dots, j'_K$ are the indices of non-modulated enzymes and $K = M - L$. Condition (4) can be interpreted as maintaining the gene expression rates of non-modulated enzymes at ratios equal to the ones at the reference steady state.
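Combining constraints (2) and (4) fixes the common ratio $\gamma$ of the non-modulated enzymes once the ratios of the $L$ modulated enzymes are chosen, since $\sum_{\text{modulated}} r_j^{\max}/r_j^{\max,0} + K\gamma = M$. The following is a minimal sketch of that arithmetic together with the left-hand side of the homeostasis constraint (3). The function names and the example numbers are illustrative assumptions; the sign check on $\gamma$ is added here on the assumption that maximal rates cannot be negative.

```python
def gamma_for_unmodulated(modulated_ratios, M):
    """Common ratio gamma of the K = M - L non-modulated enzymes.

    modulated_ratios : r_max / r_max_0 for each of the L modulated enzymes
    M                : total number of enzymes in the model
    """
    K = M - len(modulated_ratios)
    if K <= 0:
        raise ValueError("at least one enzyme must remain non-modulated")
    gamma = (M - sum(modulated_ratios)) / K
    if gamma < 0:
        raise ValueError("modulated ratios exceed the total enzyme budget M")
    return gamma


def mean_relative_deviation(C, C0):
    """Left-hand side of constraint (3): mean of |C_i - C0_i| / C0_i."""
    return sum(abs(c - c0) / c0 for c, c0 in zip(C, C0)) / len(C0)


if __name__ == "__main__":
    # Hypothetical example: 30 enzymes, two of them raised 2-fold and 3-fold.
    print(gamma_for_unmodulated([2.0, 3.0], M=30))
    # Hypothetical concentrations deviating by 10% from the reference state.
    print(mean_relative_deviation([1.1, 0.9], [1.0, 1.0]) <= 0.1 + 1e-12)
```

In the example, raising two enzymes forces the remaining 28 to drop slightly below their reference levels (gamma = 25/28), which is the compensation expressed by constraint (2).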

5.3 Solution Method

5.3.1 Mixed Integer Non-linear Programming Problem (MINLP)

A mixed integer nonlinear programming problem (MINLP) described in (5) is solved to find small sets of modulated enzymes and the corresponding specific maximal rates ($r_{j_1}^{\max},\dots,r_{j_L}^{\max}$) such that the best possible reaction rate $r_{j_0}$ leading to the production of the target biochemical can be achieved. In formulation (5), both the enzyme choices $E_L$ and the specific maximal reaction rates $r_j^{\max}$ are design variables. Here, the first constraint describes the steady-state condition in model (1), and the second and third constraints result from combining constraints (2) and (4). In this paper, we devise and utilize a hybrid 'stochastic/deterministic' strategy to efficiently solve the MINLP formulation (5).

$$
\begin{aligned}
\underset{r_{j_1}^{\max},\dots,\,r_{j_L}^{\max}}{\text{maximize}} \quad & r_{j_0}(r_{j_0}^{\max}, \mathbf{C}, \mathbf{K}) \\
\text{subject to} \quad & \sum_{j=1}^{M} S_{ij} \cdot r_j(r_j^{\max}, \mathbf{C}, \mathbf{K}) = 0, \quad i \in \mathcal{N} \\
& \frac{r_{j_1}^{\max}}{r_{j_1}^{\max,0}} + \dots + \frac{r_{j_L}^{\max}}{r_{j_L}^{\max,0}} + K \cdot \gamma = M \\
& r_{j'_s}^{\max} = \gamma \cdot r_{j'_s}^{\max,0}, \quad s = 1,\dots,K \\
& \frac{1}{N} \sum_{i=1}^{N} \frac{|C_i - C_i^{0}|}{C_i^{0}} \le \delta
\end{aligned}
\qquad (5)
$$

5.3.2 Search for optimal enzyme sets and levels

A simulated annealing algorithm (Kirkpatrick et al. 1983) is implemented to navigate through the discrete space of enzyme sets (see Figure 5.1). Here, $E_L$ is a randomly chosen initial enzyme set of size $L$, $E_b$ is the set with the best rate $r_b$ found so far, $E_c$ is the currently investigated set with rate $r_c$, and $E_t$ is the trial set with rate $r_t$. Parameter $T$ is the 'annealing temperature', reduced by a factor $\alpha$ after every $J$ random moves, and MaxIter is the maximum number of allowable iterations. To generate $E_t$, the move class 'Select or Terminate' is implemented, where a random swap between two enzymes, one from $E_c$ and another one from $\mathcal{M} \setminus E_c$, is repeatedly performed until a new trial set $E_t$ is found. The search is terminated when all neighbors of $E_c$ have been evaluated or the maximum number of iterations (i.e., MaxIter) has been performed. Optimal enzyme levels for every trial set $E_t$ are computed by using standard gradient-based algorithms (i.e., SQP). The evaluation of the objective function (i.e., reaction rate $r_{j_0}$) relies on the calculation of the steady state concentrations $\mathbf{C}$ by a two-step prediction-correction procedure. At the prediction step, the kinetic equation in (1) is integrated over the time span $[0, T_{end}]$. The integration can be automatically terminated at an intermediate time $t \in [0, T_{end}]$ if

$$\max_i \left| \frac{dC_i}{dt} \right| \le \varepsilon.$$

Subsequently, at the correction step, the concentrations at the final integration time $t$ (i.e., $\mathbf{C}(t)$) are used as an initial guess for a Newton-based solver to find a solution of the nonlinear steady-state equation in (5). Finally, the stability of the corrected steady state $\mathbf{C}$ in (1) is investigated by computing the eigenvalues of the Jacobian matrix available from the Newton-based solver.
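The prediction-correction procedure can be illustrated on a toy two-metabolite pathway. The sketch below is an assumption-laden stand-in rather than the implementation used in this work: it replaces the ODE integrator with explicit Euler steps, uses a finite-difference Jacobian in the Newton iteration, and checks stability of the 2x2 Jacobian through its trace and determinant (both eigenvalues have negative real parts exactly when the trace is negative and the determinant is positive). All names and parameter values are hypothetical.

```python
S = [[1, -1, 0], [0, 1, -1]]  # toy stoichiometry:  -> C1 -> C2 ->


def f(C, r_max, K):
    """dC/dt for the toy pathway with Michaelis-Menten kinetics."""
    r = [r_max[0],
         r_max[1] * C[0] / (K[0] + C[0]),
         r_max[2] * C[1] / (K[1] + C[1])]
    return [sum(S[i][j] * r[j] for j in range(3)) for i in range(2)]


def predict(C0, r_max, K, t_end=1e3, dt=1e-2, eps=1e-3):
    """Prediction: integrate until max_i |dC_i/dt| <= eps or t reaches t_end."""
    C, t = list(C0), 0.0
    while t < t_end:
        dC = f(C, r_max, K)
        if max(abs(d) for d in dC) <= eps:
            break
        C = [c + dt * d for c, d in zip(C, dC)]  # explicit Euler step
        t += dt
    return C


def jacobian(C, r_max, K, h=1e-6):
    """Forward finite-difference Jacobian of f with respect to C."""
    f0 = f(C, r_max, K)
    J = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(2):
        Cp = list(C)
        Cp[k] += h
        fk = f(Cp, r_max, K)
        for i in range(2):
            J[i][k] = (fk[i] - f0[i]) / h
    return J


def correct(C, r_max, K, tol=1e-10, max_iter=50):
    """Correction: Newton iterations on f(C) = 0 from the predicted state."""
    for _ in range(max_iter):
        F = f(C, r_max, K)
        if max(abs(x) for x in F) <= tol:
            break
        J = jacobian(C, r_max, K)
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        # Solve J * step = -F by Cramer's rule (2x2 system).
        s0 = (-F[0] * J[1][1] + F[1] * J[0][1]) / det
        s1 = (-F[1] * J[0][0] + F[0] * J[1][0]) / det
        C = [C[0] + s0, C[1] + s1]
    return C


def is_stable(C, r_max, K):
    """2x2 stability test: trace < 0 and determinant > 0."""
    J = jacobian(C, r_max, K)
    trace = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return trace < 0 and det > 0


if __name__ == "__main__":
    r_max, K = [1.0, 2.0, 2.0], [1.0, 1.0]
    C_pred = predict([0.5, 0.5], r_max, K)
    C_ss = correct(C_pred, r_max, K)
    print(C_pred, C_ss, is_stable(C_ss, r_max, K))
```

The prediction step only needs to bring the state into the neighborhood of the steady state; the Newton correction then tightens the residual far below the integration tolerance at little cost.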

5.3.3 Computational implementation

The optimization framework is demonstrated on a model of central metabolism of E. coli (Chassagnole et al. 2002), comprised of 30 enzymes and 17 metabolites, with the objectives of maximizing the serine overproduction and the flux through the PTS transport system (see Figure 5.2). The following parameter values were selected: $\delta = 0.1$, $J = 25$, MaxIter $= 10^{3}$, $\alpha = 0.9$, $\varepsilon = 10^{-3}$, and $T_{end} = 10^{3}$. To ensure the robustness and the fast convergence of the algorithm, different values for the initial 'simulated annealing temperature' $T$ were used, $T_0 = [10^{-4} - 10^{-2}]$ for the serine production and $T_0 = [10^{-5} - 10^{-3}]$ for the PTS flux. These values account for 1% - 100% of the corresponding rates (mM/sec) in the reference state. The complete enumeration of all one- and two-enzyme sets was performed to test the ability of the algorithm to locate the global optima. Also, random multistarts were performed to check the robustness of the SQP search. The entire framework was implemented in Matlab on a Linux cluster with Intel CPU 3.06 GHz computers. Computational requirements were on the order of minutes for small and 10-30 hours for larger enzyme sets $E_L$.

5.4 Results and discussion

If all thirty enzymes in the model (Chassagnole et al. 2002; Visser et al. 2004) are allowed to vary their levels subject to the constraints in formulation (5), a 20-fold increase in the serine production and a 3-fold increase in the PTS flux can be achieved.

Substantial improvements, however, are predicted here by manipulating only small enzyme sets (see Figure 5.3). For example, the modulation of only three enzymes leads to a flux increase, which is almost 50% of the best theoretical predictions for fluxes leading to serine synthesis and through the phosphotransferase system respectively. The manipulation of six enzymes leads to a flux increase of about 80% of the maximum theoretical predictions for these fluxes. Importantly, by manipulating 10 enzymes the maximum overproduction capability in the network is reached.

To get insight into how a small set $E_L$ can be chosen to meet overproduction requirements, flux control coefficients (FCCs) calculated by formula (6) can be used,

$$C_e^{J} = \frac{d \ln J}{d \ln e} \approx \frac{e}{J} \cdot \frac{\Delta J}{\Delta e}. \qquad (6)$$

Here, $J$ is a pathway flux or the rate of a particular reaction, and $e$ is the enzyme's level.
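Formula (6) can be evaluated numerically by perturbing an enzyme level and recording the change in flux. The sketch below is an illustration under stated assumptions: it uses a central difference instead of the one-sided $\Delta J / \Delta e$ written in (6), and the saturating flux function is a hypothetical example, not a flux from the E. coli model.

```python
def flux_control_coefficient(flux, e, rel_step=1e-4):
    """Estimate C_e^J = d ln J / d ln e  ~  (e / J) * (dJ / de).

    flux     : callable returning the steady state flux J for enzyme level e
    e        : enzyme level at which the coefficient is evaluated
    rel_step : relative size of the perturbation in e
    """
    de = rel_step * e
    dJ = flux(e + de) - flux(e - de)  # central difference over 2*de
    return (e / flux(e)) * dJ / (2.0 * de)


def toy_flux(e):
    """Hypothetical saturating flux J(e) = e / (1 + e); its exact FCC is 1 / (1 + e)."""
    return e / (1.0 + e)


if __name__ == "__main__":
    print(flux_control_coefficient(toy_flux, 1.0))
```

For the toy flux the coefficient falls from nearly 1 at low enzyme levels towards 0 at high levels, which illustrates why control coefficients are local quantities tied to the steady state at which they are evaluated.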

Within the framework of Metabolic Control Analysis, it has been shown that flux control is distributed among several enzymes in the pathway. Specifically, calculations of FCCs for the model under investigation (Chassagnole et al. 2002; Visser et al. 2004) reveal several rate limiting enzymes with high control on the serine and PTS fluxes (see Figure 5.4). Interestingly, in both cases the same group of enzymes, namely, phosphofructokinase (PFK), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), pyruvate dehydrogenase (PDH), glucose-6-phosphate dehydrogenase (G6PDH) and the enzyme catalyzing the phosphotransferase system (PTS enzyme), exerts high control and, hence, these enzymes can be viewed as potential candidates for modulation. Since FCCs are also readily available from measurements, it is important to check whether the enzymes identified for modulation by the proposed optimization framework also have high FCCs.

To facilitate the comparison, the best enzyme sets leading to a substantial increase in the serine production have been listed in Table 5.1, where the indices highlighted in bold correspond to the enzymes exerting high control on serine biosynthesis (see Figure 5.4).

We find that the enzymes selected for modulation by our framework are in complete agreement with the MCA predictions in the form of FCCs only for the sets comprised of one and two enzymes. For example, PTS enzyme and SerSynth (the enzyme catalyzing serine synthesis) form the best two-enzyme set for modulation. These enzyme choices are intuitive as the PTS transport system supplies metabolism with the substrate, while SerSynth leads to serine production. Further analysis reveals that both FCC-based predictions and enzyme choices provided by the optimization framework lack the additivity property, in the sense that the best choices cannot simply be combined to improve the serine overproduction. For example, the triplet of the most important enzymes for metabolism and serine production (i.e., PTS enzyme, PFK, and SerSynth) exerts the highest total control and, yet, is absent from Table 5.1. These enzymes are, however, present in all larger enzyme sets.

Another important observation resulting from the analysis of Table 5.1 is that the best choices for large enzyme sets encompass enzymes with both high and low values of FCCs. Therefore, it cannot be deduced from MCA alone why some near-equilibrium enzymes, e.g., triosephosphate isomerase (TIS), phosphoglycerate kinase (PGK) and enolase (ENO), are present in the optimal enzyme sets while others such as phosphoglycerate mutase (PGM) are not. This can, however, be explained by other important factors such as complex interactions between demand and supply of cellular resources, regulation, and homeostasis. Specifically, due to homeostasis conditions as formulated in constraints (2) - (4), the concentration of the near-equilibrium PGM, located just below the branching point towards serine production (see Figure 5.2), is lowered to allocate transcriptional rates in favor of the synthesis of the enzymes whose levels are increased (e.g., the enzymes above the branching point including near-equilibrium TIS and PGK). In contrast, the higher level of ENO is enforced to increase the phosphoenolpyruvate concentration for supply to the enhanced PTS system (see Figure 5.2). Thus, certain enzymes with low flux control need to be considered for potential modulation to maintain metabolism at homeostasis and, hence, to prevent cellular systems from undesirable or even catastrophic changes due to targeted perturbations.

The results presented in Table 5.1 also show how the best enzyme choices emerge. Specifically, while the best choices lack the additivity property, the best smaller sets repeatedly enter the best larger sets. This means that control of flux in the pathway does not shift among different groups of enzymes due to the compensating effects of global regulation and homeostasis. Indeed, stability of distributed control is facilitated by many negative feedback loops, which significantly contribute toward the stabilization of cellular systems at homeostasis (Reich and Selkov 1981; Heinrich and Schuster 1996; Stephanopoulos et al. 1998). As a result, the FCCs are preserved around their original unperturbed values. The observed effect of stability of distributed control becomes even more pronounced after plotting high FCCs as a function of the enzyme set size L (see Figure 5.5). The absence of the shift in distributed control emphasizes the importance of the rate limiting enzymes (or pathway steps) with high values of FCCs, computed for the unperturbed reference organism (see Figure 5.4 and Table 5.1). Indeed, an increase in the serine demand transfers control from serine synthesis (i.e., SerSynth) to the supply block (i.e., PTS and PFK), and the pyruvate removal block in the form of pyruvate dehydrogenase (PDH).

Interestingly, the best ten enzyme set predicted by the framework encompasses eight glycolytic enzymes, PTS, PFK, ALDO, TIS, GAPDH, PGK, ENO, and glucose-6-phosphate isomerase (PGI), and two enzymes outside of glycolysis (i.e., PDH and SerSynth). This implies that the high values of FCCs correctly delineate the most important blocks of central metabolism for serine overproduction from less important pathways (i.e., PPP and other biosynthetic routes). Specifically, flux is increased through the PTS transport system, through PFK which is a committed enzyme in glycolysis, and through PDH to remove an excess of pyruvate accumulated through the enhanced PTS transport system.

5.5 Conclusion

A general hybrid stochastic/deterministic optimization framework for optimal selection of enzyme levels using large-scale mathematical models of cellular systems has been introduced and demonstrated on the model of central metabolism in Escherichia coli (Chassagnole et al. 2002; Visser et al. 2004). A simulated annealing algorithm is employed to navigate through the discrete space of enzyme sets, while general gradient-based search methods are used to estimate optimal enzyme levels. The proposed framework allows for the optimization of the cellular system: by systematically selecting small enzyme sets that are feasible for experimental implementation, many-fold flux improvements are predicted. The framework can also be used as a powerful tool for the validation of theoretical assumptions regarding the interconnections between distributed control, cellular economy of supply and demand, and negative feedback stabilization.

Alternatively, this framework can be utilized in biomedical studies to identify enzymes controlling undesired large metabolite concentrations and fluxes. Such enzymes can then be ranked as candidates for potential biomarkers of the underlying diseases or drug targets (Bandara et al. 2003).

Table 5.1: Best enzyme sets leading to increased serine production

| Size | Enzyme set | Flux ratio |
|---|---|---|
| 1 | SerSynth | 1.89 |
| 2 | PTS, SerSynth | 4.65 |
| 3 | PTS, GAPDH, SerSynth | 9.08 |
| 4 | PTS, PFK, GAPDH, SerSynth | 14.45 |
| 5 | PTS, PFK, PDH, GAPDH, SerSynth | 15.93 |
| 6 | PTS, PFK, TIS, GAPDH, PDH, SerSynth | 17.42 |
| 7 | PTS, PFK, ALDO, TIS, GAPDH, PDH, SerSynth | 19.08 |
| 8 | PTS, PFK, ALDO, TIS, GAPDH, PGK, PDH, SerSynth | 19.84 |
| 9 | PTS, PGI, PFK, ALDO, TIS, GAPDH, PGK, PDH, SerSynth | 20.54 |
| 10 | PTS, PGI, PFK, ALDO, TIS, GAPDH, PGK, ENO, PDH, SerSynth | 20.59 |

    Generate an initial enzyme set EL
    Set Eb = Ec = Et = EL
    rb = rc = rt = Optimize(Et)
    for i = 1:MaxIter
        Et = Select or Terminate(Ec)
        rt = Optimize(Et)
        if rt > rb
            Eb = Ec = Et
            rb = rc = rt
        else
            anneal = exp((rt − rc) / T)
            Generate a random d ∈ (0,1)
            if d < anneal
                Ec = Et
                rc = rt
            end if
        end if
        if mod(i, J) = 0
            T = α·T
        end if
    end loop

Figure 5.1: A simulated annealing pseudo-code
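The pseudo-code of Figure 5.1 can be turned into a short runnable sketch. The version below is a simplified illustration, not the Matlab implementation used in this work: the `optimize` argument is a hypothetical stand-in for the SQP-based evaluation of a trial set, the 'Select or Terminate' move class is reduced to a single random swap without bookkeeping of visited neighbors, and a trial set that is at least as good as the current one is accepted directly (equivalent to the test `d < anneal`, since `anneal` is then at least 1), which also avoids numerical overflow in the exponential.

```python
import math
import random


def simulated_annealing(all_enzymes, L, optimize, T=1.0, alpha=0.9, J=25,
                        max_iter=1000, seed=0):
    """Search for the size-L enzyme set with the largest optimize(set)."""
    pool = sorted(set(all_enzymes))
    if not 0 < L < len(pool):
        raise ValueError("L must satisfy 0 < L < number of enzymes")
    rng = random.Random(seed)

    E_c = set(rng.sample(pool, L))   # current set (random initial set E_L)
    r_c = optimize(E_c)
    E_b, r_b = set(E_c), r_c         # best set found so far

    for i in range(1, max_iter + 1):
        # Trial set: swap one enzyme in E_c with one enzyme outside E_c.
        leaving = rng.choice(sorted(E_c))
        entering = rng.choice(sorted(set(pool) - E_c))
        E_t = (E_c - {leaving}) | {entering}
        r_t = optimize(E_t)

        if r_t > r_b:
            E_b, r_b = set(E_t), r_t
            E_c, r_c = E_t, r_t
        elif r_t >= r_c or rng.random() < math.exp((r_t - r_c) / T):
            E_c, r_c = E_t, r_t      # accept a non-improving move

        if i % J == 0:
            T *= alpha               # cool down after every J moves
    return E_b, r_b


def toy_optimize(enzyme_set):
    """Hypothetical objective: each enzyme contributes a fixed gain."""
    return float(sum(enzyme_set))


if __name__ == "__main__":
    print(simulated_annealing(range(10), 3, toy_optimize))
```

In the actual framework each call to the objective involves a nonlinear optimization and a steady-state computation, so the number of iterations and the cooling schedule dominate the computational cost.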

Figure 5.2: Escherichia coli central metabolism

Figure 5.3: Ratios rL/r30 (%) are plotted as a function of the modulated enzymes set size L. Solid rhombi and white triangles correspond to serine production and PTS rates, respectively. (Axes: flux ratio in % versus the number of modulated enzymes.)

Figure 5.4: Flux control coefficients (FCCs) for the serine (solid bars) and PTS (white bars) fluxes. (Axes: flux control coefficients versus the 30 enzymes of the model.)

Figure 5.5: FCCs are plotted as a function of the enzyme set size L. (Axes: flux control coefficient versus the number of modulated enzymes L; legend: PTS, PFK, GAPDH, PDH, Serine.)

Chapter 6

Synopsis

Current developments in accumulating high throughput data have added new dimensions to our knowledge of the functioning of microbial organisms. Concurrent advances in recombinant DNA technology have opened up new possibilities for a metabolic engineer to “shape” the microbial circuitry for biotechnological and biomedical purposes. Specifically, the genome scale reconstruction of microbial metabolism has facilitated the development of computational methods for suggesting gene knockouts/knockins, and modulations in a targeted fashion. In this thesis, optimization based frameworks were developed to redesign microbial strains for efficient biochemical production.

We first proposed a framework in Chapter 1 to remove reactions from a stoichiometric network such that the growth or biomass formation in the network is accompanied by the production of biochemicals. The underlying hypothesis here was that existing microbial networks are a result of eons of evolution and are primed for maximum responsiveness to selective pressures. The competing microbial and biotechnological objectives can be reconciled if the functionalities uncoupling these two objectives can be eliminated. However, it was observed that for production of complex compounds, this measure by itself is not sufficient and it is important to control the transport rates of key metabolites such as oxygen, carbon dioxide and ammonia to meet the biochemical objectives. This conclusion was based on the strategies identified for overproducing amino acids in the E. coli network. Notably, a number of these strategies were consistent with experimental reports while others were non-intuitive strategies not straightforward to determine only from inspection of the network.

The next endeavor was to systematically select from the thousands of functionalities catalogued in multiple databases (Selkov et al. 1996; Selkov et al. 1997; Selkov et al. 1998; Karp et al. 2000; Overbeek et al. 2000; Kanehisa 2002; Kanehisa et al. 2002; Ellis et al. 2003; Kanehisa et al. 2004; Krieger et al. 2004), the appropriate sets of pathways/genes to recombine into existing production systems such as Escherichia coli so as to endow them with the desired new functionalities. To this end, we developed the four step OptStrain hierarchical procedure. This procedure allowed for the identification of non-native pathways to add into an existing metabolic network to enable a desired biotransformation while satisfying stoichiometry and requirements on the total number of added genes, maximum theoretical yield and cofactor/energy usage. The framework highlighted the efficacy of methanol as a substrate for hydrogen production and examined multiple host organisms, some of which are not known to be natural producers of hydrogen.

The scope of the method was further elucidated by the identification of additions and deletion strategies for production of vanillin, a significantly complex compound, in the E. coli network.

The next logical step in this research was to investigate how the biochemical production could be enhanced when reaction deletions were complemented with modulation (i.e., up and downregulation) of reactions. Not surprisingly, it was revealed that the synergistic effect of these manipulations was the most efficient in enhancing the theoretical yield of ethanol in the E. coli network. All the identified strategies were based on the allocation of fluxes so that both the carbon flux and cofactor requirements for producing the biochemical of choice were met. It was observed that the extent of regulation of reactions affects not only the theoretical yields of biochemicals but also the optimal strategy for target metabolite production. Also, interestingly, results obtained from the OptReg framework were found to be robust with respect to the choice of the cellular objective (i.e., max biomass or MOMA).

The significance of these methods becomes more obvious when equipped with the knowledge that microbial networks are very flexible and highly redundant, allowing them to circumvent the introduced genetic manipulations. Therefore, an ad hoc approach relying on individual pathway modifications is not sufficient. What is warranted is the systematic investigation of each introduced modification a priori. Notably, the accuracy of these in silico methods relies on the completeness and correctness of the stoichiometric models that are employed. The most desirable scenario would be to investigate detailed genome-scale kinetic models. However, the difficulty in obtaining kinetic parameters has hindered the formulation of network-scale kinetic models.

Working within this limitation, we decided to examine a central metabolic kinetic model of E. coli to determine the key enzymes that controlled a specific reaction flux and whose modulation could substantially improve that flux. The key results of this study were that (i) manipulation of even small enzyme sets could enhance the serine synthesis flux significantly, and (ii) that most of the identified enzymes that were predicted to be beneficial were known to have high control on the desired flux. However, it was also noted that enzymes which were very close to equilibrium and had lower flux control coefficients could be important in increasing a specific flux through the network.

In this research, we proposed tools to systematically identify strategies for manipulating microbial organisms to meet the desired biotechnological objectives.

6.1 Future work

The predictive power of the computational tools that were developed during the course of this study can be improved further. To this end, substantial work needs to be done at the level of network development as well as mathematical modeling.

The first task is to make the stoichiometric models more representative of microbial metabolism. This objective can be accomplished partially by (i) adding the missing functionalities to these models, and (ii) accounting for genetic interactions, in the form of transcriptional regulation, in the proposed computational frameworks. A straightforward approach to adding reactions to the existing stoichiometric models would be to rely on the functional assignments from gene annotations. An alternative solution would be to examine each organism-specific stoichiometric model thoroughly to determine (i) the reactions that are known to be active in vivo but are inactive in silico, and (ii) the metabolites that are experimentally reported to be produced in the organism but are not present in the stoichiometric model. Subsequently, this information can be used to “fill the gaps” in the model. For example, a reaction that is not active in the network in any growth environment will have zero flux through it because it is not connected to other reactions that could produce or consume the metabolites participating in it. To ‘activate’ the inactive fluxes, reactions from multiple biopathway databases, which serve as repositories for functionalities present in several organisms, can be added to the stoichiometric models. The sequences of the genes corresponding to these heterologous enzymatic reactions can then be searched with BLAST against the genome of the organism under investigation to verify the presence of the gene associated with the missing reaction. Analogously, reactions that produce and consume a metabolite detected by metabolic profiling studies can be added to the model and the presence of the corresponding genes verified.
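The connectivity argument above can be illustrated with a minimal sketch. The toy network, the reaction and metabolite names, and the helper `find_gaps` are all hypothetical and are not taken from the models used in this thesis. The sketch assumes that every reaction is irreversible and that the network is at steady state, so that a metabolite which is only produced or only consumed forces all reactions touching it to carry zero flux. It is a purely topological screen, not a substitute for a flux-based analysis.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy network: reaction -> {metabolite: stoichiometric coefficient}.
# Negative coefficients are consumed, positive are produced.
# Assumption: all reactions are irreversible (non-negative flux).
TOY_NETWORK: Dict[str, Dict[str, float]] = {
    "uptake_A": {"A": 1.0},             # exchange reaction supplying A
    "R1": {"A": -1.0, "B": 1.0},
    "R2": {"B": -1.0, "C": 1.0},
    "export_C": {"C": -1.0},            # exchange reaction removing C
    "R3": {"B": -1.0, "D": 1.0},        # D is never consumed, so R3 is a gap
}


def find_gaps(network: Dict[str, Dict[str, float]]) -> Tuple[Set[str], Set[str]]:
    """Return (dead-end metabolites, blocked reactions) by iterative pruning."""
    blocked: Set[str] = set()
    dead_ends: Set[str] = set()
    changed = True
    while changed:
        changed = False
        active = {r: s for r, s in network.items() if r not in blocked}
        produced = {m for s in active.values() for m, c in s.items() if c > 0}
        consumed = {m for s in active.values() for m, c in s.items() if c < 0}
        # Metabolites that are only produced or only consumed cannot balance.
        unbalanced = produced ^ consumed
        dead_ends |= unbalanced
        for rxn, stoich in active.items():
            if any(m in unbalanced for m in stoich):
                blocked.add(rxn)
                changed = True
    return dead_ends, blocked


if __name__ == "__main__":
    print(find_gaps(TOY_NETWORK))
    # Adding a hypothetical database reaction that consumes D fills the gap.
    filled = dict(TOY_NETWORK)
    filled["R4_candidate"] = {"D": -1.0, "C": 1.0}
    print(find_gaps(filled))
```

In this sketch, a candidate reaction drawn from a database is accepted only in the sense that it reconnects the network; verifying that the organism actually carries the corresponding gene remains a separate step, as described above.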

Accounting for the transcriptional regulatory events in the microbial networks will also enhance the accuracy of computational predictions. It has been demonstrated that transcriptional regulation can be included within the stoichiometric models using Boolean rules. These Boolean rules are then implemented mathematically by employing binary variables. However, a key challenge in the incorporation of transcriptional regulation within the frameworks proposed in this thesis is the computational complexity associated with handling a large number of binary variables. To eliminate the computational bottlenecks, we need to work on model formulation. Specifically, techniques inspired by disjunctive programming, such as the Big-M method and the Convex Hull method (Vecchietti and Grossmann 1997; Vecchietti et al. 2003), can be useful for modeling the 0-1 discrete decisions. These two methods propose alternative ways by which logical disjunctions can be recast into equivalent constraints. Depending upon the specifics of the model, one of these two approaches can prove to be useful.
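As a small illustration of how Boolean rules translate into constraints on binary variables, the sketch below checks by enumeration that the standard linear inequalities for AND and OR admit exactly the logically consistent 0-1 assignments. It also shows a Big-M style bound that ties a flux to its on/off variable. The function names and flux bounds are hypothetical choices made for this example, and the Convex Hull reformulation is not shown.

```python
from itertools import product
from typing import Callable


def and_constraints_hold(y1: int, y2: int, y3: int) -> bool:
    """Linear constraints intended to encode y3 = y1 AND y2 for binaries."""
    return y3 <= y1 and y3 <= y2 and y3 >= y1 + y2 - 1


def or_constraints_hold(y1: int, y2: int, y3: int) -> bool:
    """Linear constraints intended to encode y3 = y1 OR y2 for binaries."""
    return y3 >= y1 and y3 >= y2 and y3 <= y1 + y2


def encoding_is_exact(constraints: Callable[[int, int, int], bool],
                      logic: Callable[[int, int], int]) -> bool:
    """True if the constraints accept exactly the assignments with y3 = logic(y1, y2)."""
    return all(
        constraints(y1, y2, y3) == (y3 == int(logic(y1, y2)))
        for y1, y2, y3 in product((0, 1), repeat=3)
    )


def big_m_flux_ok(v: float, y: int, v_min: float, v_max: float) -> bool:
    """Big-M style coupling: y = 0 forces v = 0; y = 1 allows v_min <= v <= v_max.

    Assumes v_min <= 0 <= v_max so that the bounds act as the 'M' constants.
    """
    return v_min * y <= v <= v_max * y


if __name__ == "__main__":
    print(encoding_is_exact(and_constraints_hold, lambda a, b: a and b))
    print(encoding_is_exact(or_constraints_hold, lambda a, b: a or b))
    # Hypothetical flux bounds of -10 and 10 for a reaction switched off (y = 0).
    print(big_m_flux_ok(5.0, 0, -10.0, 10.0))
```

Each Boolean operator adds a handful of inequalities and one binary variable, which is why large regulatory rule sets quickly inflate the size of the mixed-integer problem and why the tightness of the chosen reformulation matters.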

The awe-inspiring complexity of microbial networks continues to be revealed as more and more efforts, each only partially successful, are made to understand their behavior. This complexity can most likely be rendered manageable through the synergistic use of intelligent computational tools and clever experimental techniques to decipher and analyze these systems.

Bibliography

Alper, H., Miyaoku, K. and Stephanopoulos, G., 2005. "Construction of lycopene-overproducing E. coli strains by combining systematic and combinatorial gene knockout targets." Nat Biotechnol.
Anthony, C. (1982). The Biochemistry of Methylotrophs, Academic Press.
Arita, M., 2000. "Metabolic construction using shortest paths." Simulation Practice and Theory. 8: 109-125.
Arita, M., 2004. "The metabolic world of Escherichia coli is not small." Proc Natl Acad Sci U S A. 101(6): 1543-7.
Backman, K. C. (1992). US Patent 5169 768.
Badarinarayana, V., Estep, P. W., 3rd, Shendure, J., Edwards, J., Tavazoie, S., Lam, F. and Church, G. M., 2001. "Selection analyses of insertional mutants using subgenic-resolution arrays." Nat Biotechnol. 19(11): 1060-5.
Baez-Viveros, J. L., Osuna, J., Hernandez-Chavez, G., Soberon, X., Bolivar, F. and Gosset, G., 2004. "Metabolic engineering and protein directed evolution increase the yield of L-phenylalanine synthesized from glucose in Escherichia coli." Biotechnology and Bioengineering. 87(4): 516-524.
Bailey, J. E., 2001. "Complex biology with no parameters." Nat Biotechnol. 19(6): 503-4.
Bandara, L. R., Kelly, M. D., Lock, E. A. and Kennedy, S., 2003. "A potential biomarker of kidney damage identified by proteomics: preliminary findings." Biomarkers. 8(3-4): 272-86.
Becker, S. A. and Palsson, B. O., 2005. "Genome-scale reconstruction of the metabolic network in Staphylococcus aureus N315: an initial draft to the two-dimensional annotation." BMC Microbiol. 5(1): 8.
Berrios-Rivera, S. J., Sanchez, A. M., Bennett, G. N. and San, K. Y., 2004. "Effect of different levels of NADH availability on metabolite distribution in Escherichia coli fermentation in minimal and complex media." Appl Microbiol Biotechnol. 65(4): 426-32.
Blattner, F. R., Plunkett, G., 3rd, Bloch, C. A., Perna, N. T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J. D., Rode, C. K., Mayhew, G. F., Gregor, J., Davis, N. W., Kirkpatrick, H. A., Goeden, M. A., Rose, D. J., Mau, B. and Shao, Y., 1997. "The complete genome sequence of Escherichia coli K-12." Science. 277(5331): 1453-74.
Bond, D. R., Holmes, D. E., Tender, L. M. and Lovley, D. R., 2002. "Electrode-reducing microorganisms that harvest energy from marine sediments." Science. 295(5554): 483-5.
Bond, D. R. and Lovley, D. R., 2003. "Electricity production by Geobacter sulfurreducens attached to electrodes." Appl Environ Microbiol. 69(3): 1548-55.
Breuer, M., Ditrich, K., Habicher, T., Hauer, B., Kesseler, M., Sturmer, R. and Zelinski, T., 2004. "Industrial methods for the production of optically active intermediates." Angew Chem Int Ed Engl. 43(7): 788-824.
Brooke, A., Kendrick, D., Meeraus, A. and Raman, R. (2003). GAMS: A user's guide. Washington D.C., GAMS Development Corporation.
Brooke, A., Kendrick, D., Meeraus, A. and Raman, R. (2005). GAMS: The Solver Manuals. Washington, DC, GAMS Development Corporation.
Brown, M. (1999). Perl programmer's reference. Berkeley, Calif., Osborne/McGraw-Hill.
Burgard, A. P. and Maranas, C. D., 2001. "Probing the performance limits of the Escherichia coli metabolic network subject to gene additions or deletions." Biotechnol Bioeng. 74(5): 364-75.
Burgard, A. P., Pharkya, P. and Maranas, C. D., 2003. "OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization." Biotechnol Bioeng. 84(6): 647-657.
Burgard, A. P., Pharkya, P. and Maranas, C. D., 2003. "OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization." Biotechnology and Bioengineering. 85(7).
Burgard, A. P., Vaidyaraman, S. and Maranas, C. D., 2001. "Minimal Reaction Sets for Escherichia coli Metabolism under Different Growth Requirements and Uptake Environments." Biotechnol Prog. 17: 791-797.
Burgard, A. P., Vaidyaraman, S. and Maranas, C. D., 2001. "Minimal reaction sets for Escherichia coli metabolism under different growth requirements and uptake environments." Biotechnol Prog. 17(5): 791-7.
Carlsson, J., Kujala, U. and Edlund, M. B., 1985. "Pyruvate dehydrogenase activity in Streptococcus mutans." Infect Immun. 49(3): 674-8.
Castellanos, M., Wilson, D. B. and Shuler, M. L., 2004. "A modular minimal cell model: purine and pyrimidine transport and metabolism." Proc Natl Acad Sci U S A. 101(17): 6681-6.
Causey, T. B., Shanmugam, K. T., Yomano, L. P. and Ingram, L. O., 2004. "Engineering Escherichia coli for efficient conversion of glucose to pyruvate." Proc Natl Acad Sci U S A. 101(8): 2235-40.
Chassagnole, C., Noisommit-Rizzi, N., Schmid, J. W., Mauch, K. and Reuss, M., 2002. "Dynamic modeling of the central carbon metabolism of Escherichia coli." Biotechnology and Bioengineering. 79(1): 53-73.
Chassagnole, C., Rizzi, N., Schmid, J. W., Mauch, K. and Reuss, M., 2002. "Dynamic metabolism of the central carbon metabolism of Escherichia coli." Biotechnology and Bioengineering. 79(1): 53-73.
Chin, H. L., Chen, Z. S. and Chou, C. P., 2003. "Fedbatch operation using Clostridium acetobutylicum suspension culture as biocatalyst for enhancing hydrogen production." Biotechnol Prog. 19(2): 383-8.
Chistoserdova, L., Laukel, M., Portais, J. C., Vorholt, J. A. and Lidstrom, M. E., 2004. "Multiple formate dehydrogenase enzymes in the facultative methylotroph Methylobacterium extorquens AM1 are dispensable for growth on methanol." J Bacteriol. 186(1): 22-8.
Chistoserdova, L., Vorholt, J. A., Thauer, R. K. and Lidstrom, M. E., 1998. "C1 transfer enzymes and coenzymes linking methylotrophic bacteria and methanogenic Archaea." Science. 281(5373): 99-102.
Chotani, G., Dodge, T., Hsu, A., Kumar, M., LaDuca, R., Trimbur, D., Weyler, W. and Sanford, K., 2000. "The commercial production of chemicals using pathway engineering." Biochim Biophys Acta. 1543(2): 434-455.
Cornish-Bowden, A. and Cardenas, M. L. (2000). Technological and Medical Implications of Metabolic Control Analysis. Boston, Kluwer Academic Publishers.
Covert, M. W., Famili, I. and Palsson, B. O., 2003. "Identifying constraints that govern cell behavior: a key to converting conceptual to computational models in biology?" Biotechnol Bioeng. 84(7): 763-72.
Covert, M. W., Knight, E. M., Reed, J. L., Herrgard, M. J. and Palsson, B. O., 2004. "Integrating high-throughput and computational data elucidates bacterial networks." Nature. 429(6987): 92-6.
Covert, M. W. and Palsson, B. O., 2002. "Transcriptional regulation in constraints-based metabolic models of Escherichia coli." J Biol Chem. 277(31): 28058-28064.
Covert, M. W. and Palsson, B. O., 2003. "Constraints-based models: regulation of gene expression reduces the steady-state solution space." J Theor Biol. 221(3): 309-25.
Das, D. and Veziroglu, T. N., 2001. "Hydrogen production by biological processes: a survey of literature." International Journal of Hydrogen Energy. 26: 13-28.
David, H., Akesson, M. and Nielsen, J., 2003. "Reconstruction of the central carbon metabolism of Aspergillus niger." Eur J Biochem. 270(21): 4243-53.
de Ruyter, P. G., Kuipers, O. P. and de Vos, W. M., 1996. "Controlled gene expression systems for Lactococcus lactis with the food-grade inducer nisin." Appl Environ Microbiol. 62(10): 3662-7.
Debabov, V. G., 2003. "The threonine story." Adv Biochem Eng Biotechnol. 79: 113-36.
Delgado, J., Mercane, J. and Liao, J. C., 1993. "Experimental determination of flux control distribution in biochemical systems: In vitro model to analyze transient metabolite concentrations." Biotechnol Bioeng. 41: 1121-1128.
Desai, R. P., Nielsen, L. K. and Papoutsakis, E. T., 1999. "Stoichiometric modeling of Clostridium acetobutylicum fermentations with non-linear constraints." J Biotechnol. 71(1-3): 191-205.
Dien, B. S., Cotta, M. A. and Jeffries, T. W., 2003. "Bacteria engineered for fuel ethanol production: current status." Appl Microbiol Biotechnol. 63(3): 258-66.
Draths, K. M., Pompliano, D. L., Conley, D. L., Frost, J. W., Berry, A., Disbrow, G. L., Staversky, R. J. and Lievense, J. C., 1992. "Biocatalytic synthesis of Aromatics from D-glucose: The Role of Transketolase." Journal of American Chemical Society. 114: 3956-3962.
Duarte, N. C., Herrgard, M. J. and Palsson, B. O., 2004. "Reconstruction and Validation of Saccharomyces cerevisiae iND750, a Fully Compartmentalized Genome-Scale Metabolic Model." Genome Res. 14: 1298-1309.
Edwards, J. S., Ibarra, R. U. and Palsson, B. O., 2001. "In silico predictions of Escherichia coli metabolic capabilities are consistent with experimental data." Nat Biotechnol. 19(2): 125-30.
Edwards, J. S. and Palsson, B. O., 1999. "Systems properties of the Haemophilus influenzae Rd metabolic genotype." J Biol Chem. 274(25): 17410-6.
Edwards, J. S. and Palsson, B. O., 2000. "The Escherichia coli MG1655 in silico metabolic genotype: its definition, characteristics, and capabilities." Proc Natl Acad Sci U S A. 97(10): 5528-33.
Ellis, L. B., Hou, B. K., Kang, W. and Wackett, L. P., 2003. "The University of Minnesota Biocatalysis/Biodegradation Database: post-genomic data mining." Nucleic Acids Res. 31(1): 262-5.
Eppstein, D. (1994). Finding the k shortest paths. In Proceedings of 35th Symp. Foundations of Computer Science, IEEE, Santa Fe.
Famili, I., Forster, J., Nielsen, J. and Palsson, B. O., 2003. "Saccharomyces cerevisiae phenotypes can be predicted by using constraint-based analysis of a genome-scale reconstructed metabolic network." Proc Natl Acad Sci U S A. 100(23): 13134-9.
Fan, L. T., Bertok, B. and Friedler, F., 2002. "A graph-theoretic method to identify candidate mechanisms for deriving the rate law of a catalytic reaction." Comput Chem. 26(3): 265-92.
Finneran, K. T., Housewright, M. E. and Lovley, D. R., 2002. "Multiple influences of nitrate on uranium solubility during bioremediation of uranium-contaminated subsurface sediments." Environ Microbiol. 4(9): 510-6.
Fischer, E., Zamboni, N. and Sauer, U., 2004. "High-throughput metabolic flux analysis based on gas chromatography-mass spectrometry derived 13C constraints." Anal Biochem. 325(2): 308-16.
Flores, N., Xiao, J., Berry, A., Bolivar, F. and Valle, F., 1996. "Pathway engineering for the production of aromatic compounds in Escherichia coli." Nat Biotechnol. 14(5): 620-3.
Fong, S. S., Marciniak, J. Y. and Palsson, B. O., 2003. "Description and interpretation of adaptive evolution of Escherichia coli K-12 MG1655 by using a genome-scale in silico metabolic model." J Bacteriol. 185(21): 6400-8.
Fong, S. S. and Palsson, B. O., 2004. "Metabolic gene-deletion strains of Escherichia coli evolve to computationally predicted growth phenotypes." Nat Genet. 36(10): 1056-8.
Forster, J., Famili, I., Fu, P., Palsson, B. O. and Nielsen, J., 2003. "Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network." Genome Res. 13(2): 244-53.
Forster, J., Famili, I., Fu, P. C., Palsson, B. and Nielsen, J., 2003. "Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network." Genome Research. 13(2): 244-253.
Galazzo, J. L. and Bailey, J. E., 1990. "Fermentative pathway kinetics and metabolic flux control in suspended and immobilized Saccharomyces cerevisiae." Enzyme and microbial technology. 12: 162-172.
Glover, F., 1975. "Improved linear integer programming formulations of nonlinear integer problems." Management Science. 22(4): 445.
Gross, R., Leach, M. and Bauen, A., 2003. "Progress in renewable energy." Environ Int. 29(1): 105-22.
Hatzimanikatis, V., Floudas, C. A. and Bailey, J., 1996. "Optimization of regulatory Architectures in metabolic reaction networks." Biotechnology and Bioengineering. 52: 485-500.
Hatzimanikatis, V., Li, C., Ionita, J. A. and Broadbelt, L. J., 2003. "A Computational Framework for the Discovery of Novel Biobased Chemicals." presented at Biochemical Engineering XIII Conference, Session 2. Boulder, CO.
Heinrich, R. and Rapoport, T. A., 1974. "A linear steady-state treatment of enzymatic chains." Eur. J. Biochem. 41: 89-95.
Heinrich, R. and Rapoport, T. A., 1974. "A linear steady-state treatment of enzymatic chains. General properties, control and effector strength." Eur J Biochem. 42(1): 89-95.
Heinrich, R. and Schuster, S. (1996). The Regulation of Cellular Systems. New York, Chapman & Hall.
Holland-Staley, C. A., Lee, K., Clark, D. P. and Cunningham, P. R., 2000. "Aerobic activity of Escherichia coli alcohol dehydrogenase is determined by a single amino acid." J Bacteriol. 182(21): 6049-54.
http://www.ilog.com/products/cplex/ Ilog Cplex.
Ibarra, R. U., Edwards, J. S. and Palsson, B. O., 2002. "Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth." Nature. 420(6912): 186-9.
Ibarra, R. U., Edwards, J. S. and Palsson, B. O., 2002. "Escherichia coli K-12 undergoes adaptive evolution to achieve in silico predicted optimal growth." Nature. 420(6912): 186-189.
Ignizio, J. P. and Cavalier, T. M. (1994). Linear programming. Englewood Cliffs, N.J., Prentice Hall.
Ikeda, M., 2003. "Amino acid production processes." Adv Biochem Eng Biotechnol. 79: 1-35.
Jamshidi, N., Edwards, J. S., Fahland, T., Church, G. M. and Palsson, B. O., 2001. "Dynamic simulation of the human red blood cell metabolic network." Bioinformatics. 17(3): 286-7.
Jeffries, T. W., 2005. "Ethanol fermentation on the move." Nat Biotechnol. 23(1): 40-1.
Jensen, P. R. and Hammer, K., 1998a. "The sequence of spacers between the consensus sequences modulates the strength of prokaryotic promoters." Appl Environ Microbiol. 64(1): 82-7.
Jensen, P. R. and Hammer, K., 1998b. "Artificial promoters for metabolic optimization." Biotechnol Bioeng. 58(2-3): 191-5.
Jensen, P. R., Michelsen, O. and Westerhoff, H. V., 1993. "Control analysis of the dependence of Escherichia coli physiology on the H(+)-ATPase." Proc Natl Acad Sci U S A. 90(17): 8068-72.
Joshi, A. and Palsson, B. O., 1989a. "Metabolic dynamics in the human red cell. Part I--A comprehensive kinetic model." J Theor Biol. 141(4): 515-28.
Joshi, A. and Palsson, B. O., 1989b. "Metabolic dynamics in the human red cell. Part II--Interactions with the environment." J Theor Biol. 141(4): 529-45.
Joshi, A. and Palsson, B. O., 1990a. "Metabolic dynamics in the human red cell. Part III--Metabolic reaction rates." J Theor Biol. 142(1): 41-68.
Joshi, A. and Palsson, B. O., 1990b. "Metabolic dynamics in the human red cell. Part IV--Data prediction and some model computations." J Theor Biol. 142(1): 69-85.
Kacser, H. and Burns, J. A., 1973. "The control of flux." Symp. Soc. Exp. Biol. 27: 65-104.
Kacser, H. and Burns, J. A., 1973. "The Control of flux." Symp Soc Exp Biol. 27: 65-104.
Kanehisa, M., 2002. "The KEGG database." Novartis Found Symp. 247: 91-101; discussion 101-3, 119-28, 244-52.
Kanehisa, M., Goto, S., Kawashima, S. and Nakaya, A., 2002. "The KEGG databases at GenomeNet." Nucleic Acids Res. 30(1): 42-6.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. and Hattori, M., 2004. "The KEGG resource for deciphering the genome." Nucleic Acids Res. 32 Database issue: D277-80.
Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Collado-Vides, J., Paley, S. M., Pellegrini-Toole, A., Bonavides, C. and Gama-Castro, S., 2002. "The EcoCyc Database." Nucleic Acids Res. 30(1): 56-8.
Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Paley, S. M. and Pellegrini-Toole, A., 2000. "The EcoCyc and MetaCyc databases." Nucleic Acids Res. 28(1): 56-9.
Kataoka, N., Miya, A. and Kiriyama, K., 1997. "Studies of hydrogen production by continuous culture system of hydrogen-producing anaerobic bacteria." Wat. Sci. Tech. 36(6-7): 41-47.
Kholodenko, B. N. and Westerhoff, H. V. (2004). Metabolic engineering in the post-genomics era. Wymondham, UK, Horizon Bioscience.
Kimura, E., 2003. "Metabolic engineering of glutamate production." Adv Biochem Eng Biotechnol. 79: 37-57.
Kirkpatrick, S., Gelatt, C. D. J. and Vecchi, M. P., 1983. "Optimization by simulated annealing." Science. 220(4598): 671-680.
Koffas, M. A., Jung, G. Y. and Stephanopoulos, G., 2003. "Engineering metabolism and product formation in Corynebacterium glutamicum by coordinated gene overexpression." Metab Eng. 5(1): 32-41.
Korotkova, N., Chistoserdova, L. and Lidstrom, M. E., 2002. "Poly-beta-hydroxybutyrate biosynthesis in the facultative methylotroph Methylobacterium extorquens AM1: identification and mutation of gap11, gap20, and phaR." J Bacteriol. 184(22): 6174-81.
Krieger, C. J., Zhang, P., Mueller, L. A., Wang, A., Paley, S., Arnaud, M., Pick, J., Rhee, S. Y. and Karp, P. D., 2004. "MetaCyc: a multiorganism database of metabolic pathways and enzymes." Nucleic Acids Res. 32 Database issue: D438-42.
Kumagai, H., 2000. "Microbial production of amino acids in Japan." Adv Biochem Eng Biotechnol. 69: 71-85.
Lan, J. and Newman, E. B., 2003. "A requirement for anaerobically induced redox functions during aerobic growth of Escherichia coli with serine, glycine and leucine as carbon source." Res Microbiol. 154(3): 191-7.
Lee, P. C. and Schmidt-Dannert, C., 2002. "Metabolic engineering towards biotechnological production of carotenoids in microorganisms." Appl Microbiol Biotechnol. 60(1-2): 1-11.
Li, K. and Frost, J. W., 1998. "Synthesis of vanillin from glucose." Journal of American Chemical Society. 120: 10545-10546.
Liu, H., Ramnarayanan, R. and Logan, B. E., 2004. "Production of electricity during wastewater treatment using a single chamber microbial fuel cell." Environmental Science and Technology. 38: 2281-2285.
Lovley, D. R., 2003. "Cleaning up with genomics: applying molecular biology to bioremediation." Nat Rev Microbiol. 1(1): 35-44.
Lu, J. L. and Liao, J. C., 1997. "Metabolic Engineering and Control Analysis for Production of Aromatics: Role of Transaldolase." Biotechnology and Bioengineering. 53: 132-138.
Mahadevan, R. and Schilling, C. H., 2003. "The effects of alternate optimal solutions in constraint-based genome-scale metabolic models." Metab Eng. 5(4): 264-76.
Majewski, R. A. and Domach, M. M., 1990. "Simple constrained optimization view of acetate overflow in Escherichia coli." Biotechnol Bioeng. 35: 732-738.
Majewski, R. A. and Domach, M. M., 1990. "Simple Constrained-Optimization View of Acetate Overflow in Escherichia-Coli." Biotechnology and Bioengineering. 35(7): 732-738.
Martin, V. J. J., Pitera, D. J., Withers, S. T., Newman, J. D. and Keasling, J. D., 2003. "Engineering the mevalonate pathway in Escherichia coli for production of terpenoids." Nat. Biotechnol. 21: 796-802.
Mauch, K., Buziol, S., Schmid, J. and Reuss, M. (2001). Computer-Aided Design of Metabolic Networks. Chemical Process Control-6 Conference, Tucson, Arizona.
Mavrovouniotis, M., Stephanopoulos, G. and Stephanopoulos, G., 1990. "Computer-Aided Synthesis of Biochemical Pathways." Biotechnol Bioeng. 36: 1119-1132.
McShan, D. C., Rao, S. and Shah, I., 2003. "PathMiner: predicting metabolic pathways by heuristic search." Bioinformatics. 19(13): 1692-8.
Methe, B. A., Nelson, K. E., Eisen, J. A., Paulsen, I. T., Nelson, W., Heidelberg, J. F., Wu, D., Wu, M., Ward, N., Beanan, M. J., Dodson, R. J., Madupu, R., Brinkac, L. M., Daugherty, S. C., DeBoy, R. T., Durkin, A. S., Gwinn, M., Kolonay, J. F., Sullivan, S. A., Haft, D. H., Selengut, J., Davidsen, T. M., Zafar, N., White, O., Tran, B., Romero, C., Forberger, H. A., Weidman, J., Khouri, H., Feldblyum, T. V., Utterback, T. R., Van Aken, S. E., Lovley, D. R. and Fraser, C. M., 2003. "Genome of Geobacter sulfurreducens: metal reduction in subsurface environments." Science. 302(5652): 1967-9.
Miller, J. E., Backman, K. C., O'Connor, J. M. and Hatch, T. R., 1987. "Production of phenylalanine and organic acids by phosphoenolpyruvate carboxylase-deficient mutants of Escherichia coli." J. Ind. Microbiol. 2: 143-149.
Misawa, N., Yamano, S. and Ikenaga, H., 1991. "Production of beta-carotene in Zymomonas mobilis and Agrobacterium tumefaciens by introduction of the biosynthesis genes from Erwinia uredovora." Appl Environ Microbiol. 57(6): 1847-9.
Nakamura, C. E. and Whited, G. M., 2003. "Metabolic engineering for the microbial production of 1,3-propanediol." Curr Opin Biotechnol. 14(5): 454-9.
Nandi, R. and Sengupta, S., 1996. "Involvement of anaerobic reductases in the spontaneous lysis of formate by immobilized cells of Escherichia coli." Enzyme and microbial technology. 19: 20-25.
Nandi, R. and Sengupta, S., 1998. "Microbial production of hydrogen: an overview." Crit Rev Microbiol. 24(1): 61-84.
Neidhardt, F. C. and Curtiss, R. (1996). Escherichia coli and Salmonella: cellular and molecular biology. Washington, D.C., ASM Press.
Overbeek, R., Larsen, N., Pusch, G. D., D'Souza, M., Selkov, E., Jr., Kyrpides, N., Fonstein, M., Maltsev, N. and Selkov, E., 2000. "WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction." Nucleic Acids Res. 28(1): 123-5.
Palsson, B. Ø., 2004. "In silico biotechnology: Era of reconstruction and interrogation." Current Opinion in Biotechnology. 15(1): 50-51.
Papin, J. A., Price, N. D., Edwards, J. S. and Palsson, B. O., 2002. "The genome-scale metabolic extreme pathway structure in Haemophilus influenzae shows significant network redundancy." J Theor Biol. 215(1): 67-82.
Papoutsakis, E., 1984. "Equations and calculations for fermentations of Butyric acid bacteria." Biotechnol Bioeng. 26: 174-187.
Papoutsakis, E. and Meyer, C., 1985. "Equations and calculations of product yields and preferred pathways for butanediol and mixed-acid fermentations." Biotechnol Bioeng. 27: 50-66.
Papp, B., Pal, C. and Hurst, L. D., 2004. "Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast." Nature. 429(6992): 661-4.
Patnaik, R. and Liao, J. C., 1994. "Engineering of Escherichia coli central metabolism for aromatic metabolite production with near theoretical yield." Appl Environ Microbiol. 60(11): 3903-8.
Patnaik, R., Spitzer, R. G. and Liao, J. C., 1995. "Pathway engineering for production of aromatics in Escherichia coli: Confirmation of stoichiometric analysis by independent modulation of AroG, TktA and Pps activities." Biotechnology and Bioengineering. 46: 361-370.
Pharkya, P., Burgard, A. P. and Maranas, C. D., 2003. "Exploring the overproduction of amino acids using the bilevel optimization framework OptKnock." Biotechnol Bioeng. 84(7): 887-899.
Pharkya, P., Burgard, A. P. and Maranas, C. D., 2004. "OptStrain: A Computational Framework for Redesign of Microbial Production Networks." Genome Res. 14(11): 2367-76.
Price, N. D., Reed, J. L. and Palsson, B. O., 2004. "Genome-scale models of microbial cells: evaluating the consequences of constraints." Nat Rev Microbiol. 2(11): 886-97.
Ramakrishna, R., Edwards, J. S., McCulloch, A. and Palsson, B. O., 2001. "Flux-balance analysis of mitochondrial energy metabolism: consequences of systemic stoichiometric constraints." Am J Physiol Regul Integr Comp Physiol. 280(3): R695-704.
Reed, J. L. and Palsson, B. O., 2003. "Thirteen years of building constraint-based in silico models of Escherichia coli." J Bacteriol. 185(9): 2692-9.
Reed, J. L., Vo, T. D., Schilling, C. H. and Palsson, B. O., 2003. "An expanded genome-scale model of Escherichia coli K-12 (iJR904 GSM/GPR)." Genome Biol. 4(9): R54.
Regan, L., Bogle, I. D. L. and Dunnill, P., 1993. "Simulation and optimization of metabolic pathways." Comput. Chem. Eng. 17: 627-637.
Reich, J. G. and Selkov, E. E. (1981). Energy Metabolism of the Cell. A Theoretical Treatise. New York, Academic Press.
Reuss, M., 1991. "Structured modeling of bioreactors. Recombinant DNA technology I." Annals of New York Academy of Sciences. 646: 284-299.
Rizzi, M., Baltes, M., Theobald, U. and Reuss, M., 1997. "In vivo analysis of metabolic dynamics in Saccharomyces cerevisiae: II. mathematical model." Biotechnol Bioeng. 55: 592-608.
San, K. Y., Bennett, G. N., Berrios-Rivera, S. J., Vadali, R. V., Yang, Y. T., Horton, E., Rudolph, F. B., Sariyar, B. and Blackwood, K., 2002. "Metabolic engineering through cofactor manipulation and its effects on metabolic flux redistribution in Escherichia coli." Metab Eng. 4(2): 182-92.
Savageau, M. A., 1969a. "Biochemical systems analysis. I. Some mathematical properties of the rate law for the component enzymatic reactions." J Theor Biol. 25(3): 365-9.
Savageau, M. A., 1969b. "Biochemical systems analysis. II. The steady-state solutions for an n-pool system using a power-law approximation." J Theor Biol. 25(3): 370-9.
Savageau, M. A., 1970. "Biochemical systems analysis, III: Dynamic solutions using a power-law approximation." J. Theor. Biol. 26: 215-226.
Savageau, M. A. (1976). Biochemical systems analysis. A study of function and design in molecular biology. New York, Addison-Wesley, Reading, MA.
Savinell, J. M. and Palsson, B. O., 1992. "Network analysis of intermediary metabolism using linear optimization. I. Development of mathematical formalism." J Theor Biol. 154(4): 421-54.
Schilling, C. H., Covert, M. W., Famili, I., Church, G. M., Edwards, J. S. and Palsson, B. O., 2002. "Genome-scale metabolic model of Helicobacter pylori 26695." J Bacteriol. 184(16): 4582-93.
Schmid, J. W., Mauch, K., Reuss, M., Gilles, E. D. and Kremling, A., 2004. "Metabolic design based on a coupled gene expression-metabolic network model of tryptophan production in Escherichia coli." Metab Eng. 6(4): 364-77.
Segre, D., Vitkup, D. and Church, G. M., 2002. "Analysis of optimality in natural and perturbed metabolic networks." Proc Natl Acad Sci U S A. 99(23): 15112-7.
Segre, D., Zucker, J., Katz, J., Lin, X., D'Haeseleer, P., Rindone, W. P., Kharchenko, P., Nguyen, D. H., Wright, M. A. and Church, G. M., 2003. "From annotated genomes to metabolic flux models and kinetic parameter fitting." Omics. 7(3): 301-16.
Selkov, E., Basmanova, S., Gaasterland, T., Goryanin, I., Gretchkin, Y., Maltsev, N., Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L., Selkov, E., Jr. and Yunus, I., 1996. "The metabolic pathway collection from EMP: the enzymes and metabolic pathways database." Nucleic Acids Res. 24(1): 26-8.
Selkov, E., Galimova, M., Goryanin, I., Gretchkin, Y., Ivanova, N., Komarov, Y., Maltsev, N., Mikhailova, N., Nenashev, V., Overbeek, R., Panyushkina, E., Pronevitch, L. and Selkov, E., Jr., 1997. "The metabolic pathway collection: an update." Nucleic Acids Res. 25(1): 37-8.
Selkov, E., Jr., Grechkin, Y., Mikhailova, N. and Selkov, E., 1998. "MPW: the Metabolic Pathways Database." Nucleic Acids Res. 26(1): 43-5.
Seo, J. S., Chong, H., Park, H. S., Yoon, K. O., Jung, C., Kim, J. J., Hong, J. H., Kim, H., Kim, J. H., Kil, J. I., Park, C. J., Oh, H. M., Lee, J. S., Jin, S. J., Um, H. W., Lee, H. J., Oh, S. J., Kim, J. Y., Kang, H. L., Lee, S. Y., Lee, K. J. and Kang, H. S., 2005. "The genome sequence of the ethanologenic bacterium Zymomonas mobilis ZM4." Nat Biotechnol. 23(1): 63-8.
Seressiotis, A. and Bailey, J. E., 1988. "MPS: An artificially intelligent software system for the analysis and synthesis of metabolic pathways." Biotechnol Bioeng. 31: 587-602.
Slepchenko, B. M., Schaff, J., Macara, I. G. and Loew, L. M., 2003. "Quantitative Cell Biology with the Virtual Cell." Trends in Cell Biology. 13: 570-576.
Sprenger, G., Siewe, R., Sahm, H., Karuts, M. and Sonke, T. (1998). Microbial preparation of substances from aromatic metabolism /II, WO9 818 936.
Stafford, D. E. and Stephanopoulos, G., 2001. "Metabolic engineering as an integrating platform for strain development." Curr Opin Microbiol. 4(3): 336-40.
Stephanopoulos, G., 2002. "Metabolic engineering by genome shuffling." Nat Biotechnol. 20(7): 666-8.
Stephanopoulos, G., Aristidou, A. A. and Nielsen, J. (1998). Metabolic engineering: principles and methodologies. San Diego, Academic Press.
Stephanopoulos, G. and Vallino, J. J., 1991. "Network rigidity and metabolic engineering in metabolite overproduction." Science. 252(5013): 1675-81.
Stephanopoulos, G. N., Aristidou, A. A. and Nielsen, J. (1998). Metabolic Engineering Principles and Methodologies. New York, Academic Press.
Stephanopoulos, G. N., Aristidou, A. A. and Nielsen, J. (1998). Metabolic Engineering. Principles and Methods. New York, Academic Press.
Su, H. S., Lang, B. F. and Newman, E. B., 1989. "L-serine degradation in Escherichia coli K-12: cloning and sequencing of the sdaA gene." J Bacteriol. 171(9): 5095-102.
Tomita, F., Yokota, A., Hashiguchi, K., Ishigooka, M. and Kurahashi, O. (1996). WO9 606 926.
Tomita, M., 2001. "Whole-cell simulation: a grand challenge of the 21st century." Trends Biotechnol. 19(6): 205-210.
Torres, N. V., Voit, E. O. and Gonzales-Alcon, C., 1996. "Optimization of nonlinear biotechnological processes with linear programming: application to citric acid production by Aspergillus niger." Biotechnol Bioeng. 49: 247-258.
Underwood, S. A., Zhou, S., Causey, T. B., Yomano, L. P., Shanmugam, K. T. and Ingram, L. O., 2002. "Genetic changes to optimize carbon partitioning between ethanol and biosynthesis in ethanologenic Escherichia coli." Appl Environ Microbiol. 68(12): 6263-72.
Vadali, R. V., Bennett, G. N. and San, K. Y., 2004. "Enhanced isoamyl acetate production upon manipulation of the acetyl-CoA node in Escherichia coli." Biotechnol Prog. 20(3): 692-7.
Valdes, J., Veloso, F., Jedlicki, E. and Holmes, D., 2003. "Metabolic reconstruction of sulfur assimilation in the extremophile Acidithiobacillus ferrooxidans based on genome analysis." BMC Genomics. 4(1): 51.
Van Dien, S. J. and Lidstrom, M. E., 2002. "Stoichiometric model for evaluating the metabolic capabilities of the facultative methylotroph Methylobacterium extorquens AM1, with application to reconstruction of C(3) and C(4) metabolism." Biotechnol Bioeng. 78(3): 296-312.
Van Dien, S. J., Strovas, T. and Lidstrom, M. E., 2003. "Quantification of central metabolic fluxes in the facultative methylotroph Methylobacterium extorquens AM1 using 13C-label tracing and mass spectrometry." Biotechnol Bioeng. 84(1): 45-55.
van Riel, N. A., Giuseppin, M. L., TerSchure, E. G. and Verrips, C. T., 1998. "A structured, minimal parameter model of the central nitrogen metabolism in Saccharomyces cerevisiae: the prediction of the behavior of mutants." J Theor Biol. 191(4): 397-414.
Varma, A., Boesch, B. W. and Palsson, B. O., 1993. "Biochemical Production Capabilities of Escherichia coli." Biotechnol Bioeng. 42: 59-73.
Varma, A., Boesch, B. W. and Palsson, B. O., 1993. "Stoichiometric interpretation of Escherichia coli glucose catabolism under various oxygenation rates." Appl Environ Microbiol. 59(8): 2465-73.
Varma, A. and Palsson, B. O., 1993. "Metabolic Capabilities of Escherichia-Coli .2. Optimal-Growth Patterns." Journal of Theoretical Biology. 165(4): 503-522.
Varma, A. and Palsson, B. O., 1994. "Metabolic Flux Balancing- Basic Concepts, Scientific and Practical Use." Bio-Technol. 12(10): 994-998.
Varma, A. and Palsson, B. O., 1994. "Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli W3110." Appl Environ Microbiol. 60(10): 3724-31.
Vaseghi, S., Baumeister, A., Rizzi, M. and Reuss, M., 1999. "In vivo dynamics of the pentose phosphate pathway in Saccharomyces cerevisiae." Metab Eng. 1(2): 128-40.
Vecchietti, A. and Grossmann, I. E., 1997. "LOGMIP: a disjunctive 0-1 nonlinear optimizer for process systems." Computers & Chemical Engineering. 21(Supplement 1): S427-S432.
Vecchietti, A., Lee, S. and Grossmann, I. E., 2003. "Modeling of discrete/continuous optimization problems: characterization and formulation of disjunctions and their relaxations." Computers & Chemical Engineering. 27(3): 433-448.
Vera, J., de Atauri, P., Cascante, M. and Torres, N. V., 2003. "Multicriteria optimization of biochemical systems by linear programming: application to production of ethanol by Saccharomyces cerevisiae." Biotechnol Bioeng. 83(3): 335-43.
Visser, D., Schmid, J. W., Mauch, K., Reuss, M. and Heijnen, J. J., 2004. "Optimal re-design of primary metabolism in Escherichia coli using linlog kinetics." Biotechnology and Bioengineering.
Voit, E. O., 1992. "Optimization of integrated biochemical systems." Biotechnol Bioeng. 40: 572-582.
Wyman, C. E., 2003. "Potential synergies and challenges in refining cellulosic biomass to fuels, chemicals, and power." Biotechnol Prog. 19(2): 254-62.
Young, J., Henne, K., Morgan, J., Konopka, A. and Ramkrishna, D., 2004. "Cybernetic modeling of metabolism: towards a framework for rational design of recombinant organisms." Chemical Engineering Science. 59(22-23): 5041-5049.

VITA

Priti Pharkya

Education

Pennsylvania State University, University Park
PhD in Chemical Engineering
Expected graduation: Fall 2005

Regional Engineering College, Jaipur, India
Bachelor's in Chemical Engineering, 1997
Graduated with Honors

PhD Thesis Topic

Optimization based redesign of microbial production systems

Honors

• Recipient of the Merck Best Student Poster Award at the Biochemical Engineering Conference XIV (2005).
• Recipient of the George and Ruth S. Graduate fellowship in Chemical Engineering (2004) for academic excellence.
• Recipient of the National Talent Search Scholarship (1990-1997), awarded to the top 0.1% of students in India.

Publications

• Pharkya, P. and C. D. Maranas (2005), “An optimization framework for identifying optimal reaction modulation/deletion candidates for biochemical overproduction”, Metabolic Engineering, in press.
• Nikolaev, E. V., Pharkya, P., Maranas, C. D. and Armaou, A. (2005), “Elucidation of optimal enzyme sets using large-scale kinetic models”, Proceedings of the 16th IFAC Congress.
• Pharkya, P., Burgard, A. P. and C. D. Maranas (2004), “OptStrain: A computational framework for redesign of microbial production systems”, Genome Research, 14(11), 2367-76.
• Pharkya, P., Burgard, A. P. and C. D. Maranas (2003), “Exploring the overproduction of amino acids using the bilevel optimization framework OptKnock”, Biotechnology and Bioengineering, 84(7), 887-99.
• Burgard, A. P., Pharkya, P. and C. D. Maranas (2003), “OptKnock: A bilevel programming framework for identifying gene knockout strategies for microbial strain optimization”, Biotechnology and Bioengineering, 84(6), 647-57.