The Pennsylvania State University


The Graduate School

College of Engineering

COMPUTATIONAL DESIGN AND OPTIMIZATION OF

METABOLIC PATHWAYS

A Dissertation in

Chemical Engineering

Chiam Yu Ng

© 2017 Chiam Yu Ng

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2017

The dissertation of Chiam Yu Ng was reviewed and approved* by the following:

Costas D. Maranas
Donald B. Broughton Professor of Chemical Engineering
Dissertation Advisor
Chair of Committee

Manish Kumar
Associate Professor of Chemical Engineering

Phillip E. Savage
Professor of Chemical Engineering
Head of the Department of Chemical Engineering

Cooduvalli S. Shashikant
Associate Professor of Molecular and Developmental Biology
Assistant Director of the Huck Institutes of the Life Sciences
Co-Director, Bioinformatics and Genomics Graduate Program Option

Janna K. Maranas
Professor of Chemical Engineering
Graduate Program Coordinator of Chemical Engineering

*Signatures are on file in the Graduate School

ABSTRACT

Microbial production of chemicals and fuels has often been cited as a highly viable solution to the current energy crisis. It has also been widely exploited to produce value-added products such as pharmaceuticals, nutraceuticals, food flavorings and fine chemicals. Knowledge gained from genome sequencing and functional annotations, as well as advances in genetic engineering techniques, enables us to re-code the genetic blueprint of a microbial cell to endow it with new functions such as a non-native metabolic pathway that makes a specific biochemical. As the cells are often not optimized to synthesize the target product, significant rewiring of their metabolic networks is required to re-apportion carbon flux towards the target product. This should be performed with careful consideration of cellular protein resource allocation as well as energy and redox balance. However, the task is often complicated by the intricate interplay of the cellular gene-protein-reaction network.

In this dissertation, we aim to address several challenges in designing and engineering metabolic pathways by combining techniques in computational biology, metabolic engineering, and synthetic biology. Each project relies extensively on optimization-based computational design, from genetic constructs to pathways and finally the entire metabolic network, to achieve the associated metabolic engineering objectives. In Chapter 1, we review the state-of-the-art metabolic engineering and synthetic biology tools for pathway and strain design. These tools lay the foundation for the development of the subsequent chapters, which deal in-depth with specific challenges in metabolic pathway design and strain optimization.

One of the key challenges in metabolic engineering is to ensure a consistent supply of a cofactor, which is often shared among more than 100 reactions in a cell, to the desired biosynthesis reactions. In Chapter 2, we describe the design and optimization of a synthetic metabolic pathway in E. coli to improve the availability of the essential redox cofactor NADPH. The synthetic metabolic pathway was derived from the highly active Entner-Doudoroff (ED) pathway of another prokaryote, Zymomonas mobilis. Upon optimizing the expression of the multi-enzyme pathway, we obtained strains with improved NADPH production when compared to the wild-type strain. In addition to NADPH production, the ED pathway also simultaneously generates pyruvate and glyceraldehyde-3-phosphate, which are the precursors for the methylerythritol phosphate (MEP) pathway. As a proof-of-concept, we demonstrated that combining the synthetic ED pathway with the MEP pathway improved the production of an NADPH-dependent terpenoid by up to 97%.

Motivated by the intriguing roles that glycolytic pathways play in a cell and the effects of their perturbation, in Chapter 3, we switch focus to uncover the reason why specific glycolytic pathways prevail in nature despite the presence of alternatives. Although the ED and the Embden-Meyerhof-Parnas (EMP) glycolytic pathways both convert glucose to pyruvate, they take different routes with different intermediates and yield one or two moles of ATP per mole of glucose, respectively.

Theoretically, one could construct a thermodynamically feasible (i.e., ΔrG° < 0) glucose utilization pathway to pyruvate with up to five moles of ATP production. This raises the question as to why nature preferably employs either EMP or ED glycolysis in spite of the potential availability of pathways with improved energy yield. By computationally designing and assessing over 10,000 possible routes between glucose and pyruvate, we attempt to decipher the design principles of the natural glycolytic pathways. The computational pipeline developed in this case study for pathway design and analysis can be applied to other important bioconversion pathways.

While the earlier chapters delve into the design and optimization of a pathway for redox and energy cofactor supply, generally the entire metabolic network of a microbial cell has to be rewired to drive sufficient carbon flux towards the production pathway and prevent the formation of competing by-products. Using the optimization-based strain design procedure OptForce (in Chapter 4), we combed the entire genome-scale metabolic network to pinpoint genetic interventions, including up-regulation, down-regulation and knockout, that could lead to the overproduction of the target biochemicals in yeast and cyanobacteria. We further demonstrated that our computational approach not only identified genetic manipulation strategies that recapitulate experimental results but also suggested novel interventions that improve experimental yields. In Chapter 5, we conclude the dissertation by summarizing current metabolic engineering efforts and discussing the remaining challenges as well as our future perspectives on the development of microbial cell factories.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGEMENTS

1. Chapter 1 ADVANCES IN DE NOVO STRAIN DESIGN USING INTEGRATED SYSTEMS AND SYNTHETIC BIOLOGY TOOLS

1.1. Introduction
1.2. Pathway prospecting for synthetic routes
1.3. Modeling-driven pathway engineering
1.4. Synthetic biology and genome engineering tools for implementation of pathway predictions
1.5. Future perspectives
1.6. References

2. Chapter 2 RATIONAL DESIGN OF A SYNTHETIC ENTNER-DOUDOROFF PATHWAY FOR IMPROVED AND CONTROLLABLE NADPH REGENERATION

2.1. Introduction
2.2. Materials and methods
    Design of synthetic operons and codon optimization
    Strain and plasmid construction
    Combining the RBS Library Calculator and MAGE for pathway optimization
    Chromosomal integration of Entner-Doudoroff pathway variants
    Measurement of intracellular NADPH and NADP+ levels
    Quantifying NADPH levels with a modified fluorescent reporter
    Measurement of NADPH-dependent carotenoid biosynthesis
2.3. Results
    Rational design and construction of a synthetic Entner-Doudoroff pathway
    Characterization of the synthetic Entner-Doudoroff pathway in ED1.0
    Efficient search for improved ED pathway variants in a 5-dimensional expression space
    Improving terpenoid biosynthesis using a synthetic Entner-Doudoroff pathway
2.4. Discussion
2.5. References

3. Chapter 3 THE PARETO OPTIMALITY EXPLANATION OF THE GLYCOLYTIC ALTERNATIVES IN NATURE

3.1. Introduction
3.2. Methods
    Update of the optStoic reaction database
    Designing pathways using the modified optStoic procedure
    Assessing the thermodynamic feasibility of a pathway
    Protein cost analysis
    Pathway visualization
3.3. Results
    Exhaustive enumeration of all glycolytic pathway variants using modified optStoic
    Imposing the thermodynamic feasibility test MDF and the effect of metabolite concentration ranges
    The Pareto frontier of the tradeoff between protein cost and ATP yield
    Pathways with a lower cost than the canonical glycolytic pathways
    Pathways generating higher ATP yield than the canonical glycolytic pathways
    The canonical glycolytic pathways are robust to changes in ATP/ADP concentration
3.4. Discussion
3.5. References

4. Chapter 4 COMPUTATIONAL STRAIN DESIGN FOR THE OVERPRODUCTION OF BIOCHEMICALS

4.1. Introduction
4.2. Case study I: Overproduction of shikimic acid and muconic acid in S. cerevisiae
    Objectives
    Methods
    Strategies for shikimic acid overproduction
    Strategies for muconic acid overproduction
4.3. Case study II: Overproduction of malonyl-CoA in S. cerevisiae
    Methods
    Strategies for malonyl-CoA overproduction
4.4. Case study III: Overproduction of isoprene in Synechocystis sp. PCC 6803
    Objective
    Methods
    Strategies for isoprene overproduction
4.5. References

5. Chapter 5 SYNOPSIS AND FUTURE PERSPECTIVES

5.1. Introduction
5.2. A survey of current metabolic engineering efforts
5.3. Outlook
5.4. References

Appendix A. Supplementary Information for Chapter 2

A.1. Supplementary notes
    A.1.1. Construction of pgi mutant with co-selection MAGE
    A.1.2. Improving tetA translation rate
    A.1.3. Modification of pQE-mBFP plasmid and measurement of mBFP production rate
A.2. Supplementary figures
A.3. Supplementary tables
A.4. References

Appendix B. Supplementary Information for Chapter 3

B.1. Supplementary notes
    B.1.1. Reducing the run time of optStoic/minFlux
B.2. Supplementary tables
B.3. Supplementary figures
B.4. Reference

Appendix C. Supplementary Information for Chapter 5

C.1. Supplementary notes
C.2. Supplementary table and figure
C.3. References


LIST OF TABLES

Table 1.1. Computational tools for pathway prediction, strain design and genetic circuit redesign.
Table 3.1. Number of unique pathways for 0 – 5 ATP identified using OptStoic
Table 4.1. Metabolic interventions predicted by OptForce for (a) shikimic acid (SA) and (b) muconic acid (MA) overproduction.
Table 4.2. Interventions identified using OptForce of malonyl-CoA overproduction.
Table 4.3. The list of interventions identified by OptForce for isoprene overproduction.
Table A.1. Strains and plasmids.
Table A.2. Oligonucleotides used for MAGE
Table A.3. RBS sequences in each RBS library and the corresponding translation rate (au) predicted by RBS Library Calculator
Table A.4. Translation rate for ED variants.
Table A.5. Net reaction for neurosporene biosynthesis from glyceraldehyde-3-phosphate and pyruvate.
Table A.6. Comparison of neurosporene production
Table B.1. Cofactors that were removed from the S matrix when generating the internal stoichiometric matrix
Table C.1. Experimental titers of biomolecules.


LIST OF FIGURES

Figure 1.1. Pictorial overview of computational and experimental techniques for strain development and pathway engineering.
Figure 2.1. EMP versus ED pathways.
Figure 2.2. Design and construction of ED1.0 and combinatorial ED variants.
Figure 2.3. Characterization of ED1.0.
Figure 2.4. Uniform sampling of the 5-dimensional expression space.
Figure 2.5. Characterization of ED-expressing genome variants.
Figure 2.6. The effects of changing ED enzyme expression levels on NADPH regeneration rates.
Figure 2.7. Improving terpenoid biosynthesis using the synthetic Entner-Doudoroff pathway.
Figure 3.1. Schematic overview of the workflow for design and analysis of glycolytic pathways.
Figure 3.2. The modified optStoic procedure.
Figure 3.3. The tradeoff of ATP yield and minimal protein cost of a pathway.
Figure 3.4. Identifying the key factors contributing to the protein cost of different ATP yielding pathways.
Figure 3.5. Glycolytic pathways designed using the modified minFlux procedure
Figure 3.6. Pathways generating 3 to 5 ATP designed using the modified minFlux procedure
Figure 3.7. The effect of ATP and ADP concentrations on the pathway thermodynamic feasibility and minimal enzyme cost of glycolytic pathway variants.
Figure 4.1. Metabolic interventions identified with OptForce for production of shikimic acid (SA).
Figure 4.2. Metabolic interventions for the overproduction of muconic acid (MA) identified through OptForce analysis.
Figure 4.3. Pathways for malonyl-CoA biosynthesis in S. cerevisiae.
Figure 4.4. Malonyl-CoA yields for strains designed using the OptForce procedure.
Figure 4.6. Flux distribution of the best strain for malonyl-CoA production.
Figure 4.7. The flux distribution under maximum biomass condition when MFA data is used to constrain the iSyn731 model.
Figure 4.8. Flux distribution under maximum isoprene formation condition.
Figure 4.9. The interventions that were identified by OptForce for isoprene overproduction.
Figure 5.1. Comparison of the maximum theoretical carbon yield
Figure 5.2. Highest experimental titer (g/l) extracted from our literature survey for each molecule.
Figure 5.3. Applications of genome-scale metabolic models
Figure A.1. Plasmid map of pCN-065 (pCN-LEDT).
Figure B.1. Statistics of the glycolytic pathways generated using the modified optStoic procedure.
Figure B.2. Distribution of absolute metabolite concentrations across different organisms
Figure B.3. The ATP yield versus minimal protein cost plot.
Figure C.1. All experimental titer (in g/L) extracted from our literature survey for each molecule.

ACKNOWLEDGEMENTS

The completion of my Ph.D. dissertation was made possible with the help of many people along the journey. First and foremost, I would like to thank my academic advisor Professor Costas D. Maranas for his thoughtful guidance, advice, and encouragement. Since I joined his lab, he has been very supportive by providing me with opportunities to work on diverse projects while also helping me to define the direction for my research. I have greatly benefited from his extensive knowledge and his genuine enthusiasm for the optimization of biological systems and metabolic engineering. I am also grateful to my committee members Professor Manish Kumar, Professor Phillip E. Savage and Professor Cooduvalli S. Shashikant for their insightful feedback and interest in my work. I would also like to thank Professor Howard M. Salis for his guidance in the synthetic biology project. In addition, I would like to extend my thanks to the graduate program coordinator Professor Janna K. Maranas and the friendly staff in the Department of Chemical Engineering for being extremely helpful throughout my years here.

I am very fortunate to have worked with many helpful and immensely talented colleagues and collaborators. Without them, I would not have accomplished this milestone. Special thanks to Dr. Iman Farasat and Dr. Anupam Chowdhury, for mentoring and teaching me various techniques as well as their philosophies in metabolic engineering, synthetic biology, and computational biology. To a highly motivated and inspiring group of scientists: Dr. Tian Tian, Dr. Manish Kushwaha, Dr. Amin Espah Borujeni and Long Chen, thanks for creating a friendly learning atmosphere in the lab and for helping me to tackle various challenges in my projects. Huge thanks to the Maranas lab members including Lin Wang, Satyakam Dash, Dr. Ali Khodayari, Dr. Ali R. Zomorrodi, Dr. Akhil Kumar, Dr. Joshua Chan, Dr. Rajib Saha, Mazharul Mohammad, Dr. Margaret Simons, Saratram Gopalakrishnan and Charles Foster for their support and constructive feedback on my work. In particular, Satya and Lin have been extremely reliable sources of knowledge on optimization and the modeling of biological systems. Thanks for patiently answering my questions all these years. I am also thankful to Professor Zengyi Shao and Dr. Miguel Suástegui for their hard work in the yeast metabolic engineering project.

My summer internship at Amyris Inc. was a truly enriching experience. I would like to sincerely thank Dr. Amoolya Singh for offering me the summer internship opportunity and for introducing me to brilliant mentors Dr. Joshua A. Lerman and Dr. John Hung. They patiently taught me techniques in data science, bioinformatics, strain engineering, and enzymology, which I could apply in my thesis work. I would also like to take this opportunity to acknowledge my undergraduate and master's thesis advisor, Professor Min-kyu Oh, for introducing me to the field of metabolic engineering.

Finally, I would like to thank my family and friends who encouraged me all along the journey. I owe my deepest gratitude to my lovely parents, who endured my absence all these years and provided unconditional love and support from half a globe away. Thanks to my siblings and family, Kai Ling, Kai Hee, Kai Hsiang, Linda, Hao Zhe and Sinya for their constant love, encouragement, and support. I am lucky to have met many great friends here: Chee Hau, Dan, Jhi Yong, Sin Yen, Ying Woei, Andrew, Angela, Astha, Peng, Husna, Sakina, Yee Voan, Fion, Deldrie, Kelvin and many others. Thanks for encouraging me through the toughest times and for making life in State College a memorable one. I would also like to thank my mentors, Hae Ra, Ji Woong, Hee-Young, Young-Bin and Youngmin, for sharing good advice through various stages of my Ph.D. and keeping my Korean polished during the weekly lunch gathering.


1. Chapter 1 ADVANCES IN DE NOVO STRAIN DESIGN USING INTEGRATED SYSTEMS AND SYNTHETIC BIOLOGY TOOLS

This chapter has been previously published in modified form in Current Opinion in Chemical Biology (Ng C.Y.*, Khodayari A.*, Chowdhury A.* and Maranas C.D. (2014), "Advances in de novo strain design using integrated systems and synthetic biology tools", Current Opinion in Chemical Biology, 28, 105-114) (*Authors contributed equally).

1.1. Introduction

Microbial production has the unique advantage over chemical catalysis in that it can co-opt thousands of enzymes finely tuned by nature and leverage the host’s biological processes for cofactor regeneration, catalytic machinery assembly/disassembly and housekeeping functions. Advances in metabolic engineering have increased the range of bio-based chemical products made in microbial hosts, including therapeutics such as artemisinin [1], bioplastic precursors such as 1,4-butanediol [2], and biodiesel fatty esters and fatty acids [3]. Despite several success stories, only a few metabolic engineering products achieve performance metrics that currently merit commercialization [4].

Increasing demands on maximizing production potential have highlighted the importance of developing tools that can identify more efficient pathways, both native and heterologous to a given host, from an (often given) substrate to the target chemical. Subsequently, metabolic intervention strategies are drawn up to reconfigure the host metabolism for channeling additional flux towards the selected pathway and eliminating carbon and redox losses towards undesirable products, followed by the construction and evaluation of the strains. However, there are several challenges to the successful implementation of this design-build-test-learn loop (Figure 1.1). Enzymes are sensitive to temperature and pH and cannot be universally expressed in all hosts with a controllable rate of expression [5]. Another challenge lies in the successful implementation of computational predictions due to incomplete/erroneous modeling descriptions as well as the inability to precisely modulate gene expression to match model predictions. Furthermore, the current capacity to generate combinatorial variants far exceeds the throughput of screening.

In this chapter, we focus on recent advances in systems and synthetic biology for synthetic metabolic pathway design and optimization. We first describe the recently developed computational tools for identifying de novo biosynthetic pathways. Next, we discuss computational stoichiometry-based and kinetic-based approaches for strain optimization. Finally, we discuss the recently developed synthetic biology and genome engineering techniques for synthetic pathway and network engineering.

Figure 1.1. Pictorial overview of computational and experimental techniques for strain development and pathway engineering.

1.2. Pathway prospecting for synthetic routes

Synthetic pathways from a source metabolite to a target chemical must satisfy a number of performance criteria such as (i) maximal use of native reactions [6] (Figure 1.1B, in orange), (ii) minimal number of reaction steps or equivalently total enzymatic load [7] (Figure 1.1B, in blue), (iii) maximization of product yield [8] (Figure 1.1B, in grey), (iv) cofactor balance in the overall pathway [6], and (v) thermodynamic feasibility of the overall pathway and individual steps [9]. A priori assessment of whether these criteria are met requires knowledge of the metabolism of the host organism (e.g., codified as a genome-scale network (Figure 1.1A)) and other database resources. Optimizing native pathways generally requires less effort as both catalytic components and regulatory structures are already in place [5]. In contrast, expression of non-native reactions is more complex but the upside is that they can in many cases significantly improve the yield of target products [10].

A number of pathway-prospecting tools have been developed recently (Table 1.1A) leveraging advances in computational power and availability of well-curated databases of metabolic reactions. Elementary Flux Mode (EFM) derived approaches relying on Linear Programming (LP) formulations can now be extended to genome-scale models and comprehensive reaction databases to search for de novo pathways. Variations of this approach have been implemented for reconfiguring novel amino acid synthesis pathways in E. coli [7], engineering hosts for biomass-coupled chemical production (e.g., SSDesign [11]) and designing novel pathways for CO2-fixation [12]. Alternatively, the computational intractability of EFMs in searching among thousands of reaction candidates [7] can be ameliorated by using graph-based tools for pathway design [13]. These tools can now rank pathways based on product yield [14] and cost of transcription and translation (e.g., DESHARKY [15]), as well as prevent selection of thermodynamically infeasible intermediate reactions (e.g., Metabolic Tinker [9]). In addition, atom-mapping information has also been incorporated to trace the fate of individual atoms (especially carbon), which filters out non-carbon-transferring paths (e.g., Carbon Flux Path [16]).
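The graph-based search idea mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical toy reaction list (the reaction and metabolite names are invented for illustration and do not come from any database or from the tools cited) and uses a plain breadth-first search to find one route with the fewest reaction steps. It does not include the yield, cost or thermodynamic ranking performed by the cited tools.

```python
from collections import deque

# Hypothetical toy reaction list: (reaction id, substrate, product).
# All names are illustrative assumptions, not real database entries.
REACTIONS = [
    ("R1", "A", "B"),
    ("R2", "B", "C"),
    ("R3", "A", "D"),
    ("R4", "D", "C"),
    ("R5", "C", "E"),
    ("R6", "B", "E"),
]


def shortest_pathway(reactions, source, target):
    """Breadth-first search for one route with the fewest reaction steps."""
    # Adjacency list: metabolite -> list of (reaction id, product).
    graph = {}
    for rid, substrate, product in reactions:
        graph.setdefault(substrate, []).append((rid, product))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        metabolite, path = queue.popleft()
        if metabolite == target:
            return path
        for rid, product in graph.get(metabolite, []):
            if product not in visited:
                visited.add(product)
                queue.append((product, path + [rid]))
    return None  # target is not reachable from source


route = shortest_pathway(REACTIONS, "A", "E")
print(route)
```

In this toy network the two-step route through B is returned in preference to the three-step routes through C; a practical tool would score candidate routes on additional criteria rather than step count alone.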

Recent tools in pathway design use a set of “reaction rules” instead of lists of reactions to predict novel pathways without being restricted to previously catalogued reactions in nature. A retrosynthetic algorithm selects the intermediate metabolites between source and target chemical by satisfying the rules of chemical transformation defined by the set of reaction operators (e.g., BNICE [17], XTMS [18]). For example, in the recent GEM-Path approach [6], Biochemical Reaction Operators (BROs) serve as reaction templates for conversion of metabolites. Using an iterative algorithm to trace back from the target molecule one reaction at a time, metabolites are assigned to the BROs using a scoring mechanism based on how similar a metabolite is to the existing host metabolome. A reaction is accepted if it is present in a curated database, or is similar enough to an existing enzyme in the database to catalyze the putative reaction. The algorithm proceeds to identify the previous reaction in the linear pathway terminating when a metabolite present in the host metabolome is identified.

Despite enormous progress over the past few years, available pathway design procedures are generally restricted to only (near) linear routes from the source to the target metabolite. Linear pathway designs generally miss cyclic networks with potential for higher efficiency (both carbon and energy) of production. In addition, by restricting the degrees of freedom to just the source and target metabolite, alternative co-reactant/co-product combinations are ignored. While post-processing efforts restore the stoichiometric balance of pathways [6], this may lead to designs with suboptimal carbon and energy efficiencies. Compatibility of a heterologous pathway with the metabolic host of interest is also often not adequately addressed at the design stage [19]. While some of the procedures minimize the number of heterologous enzymes [6, 20], or choose enzymes phylogenetically closest to the host [21], there is no guarantee that the synthetic pathway would be host compatible. In addition, existing computational procedures do not directly assess the toxicity potential of intermediate metabolites. As more toxicity data for model organisms is collected (e.g., PanDaTox [22]), toxicity prediction tools (e.g., EDGE [23]) would become increasingly commonplace in scoring synthetic pathways. Likewise, kinetic properties of the enzymes in the pathway would increasingly be queried to find the most active routes to the target chemical (e.g., DESHARKY [15]).

1.3. Modeling-driven pathway engineering

Once the designed pathway is introduced in the host strain, the metabolic fluxes need to be re-apportioned towards the target product (Figure 1.1C). Several optimization-based computational techniques have been developed to achieve this aim [24] (Table 1.1B). The scope of these approaches has been expanded through the use of synthetic biology tools. These predictive tools, which comprise stoichiometry-only approaches [25], kinetic models [26] or hybrid combinations thereof [27], make quantitative predictions of metabolism upon metabolic interventions.

A number of efforts have integrated non-native pathways into the production host metabolic model by simply expanding the stoichiometric matrix using Flux Balance Analysis (FBA) techniques. For example, Proportional Flux Forcing (PFF) [28] was developed to explore the effect of substrate competition upon insertion of non-native genes using the GDLS algorithm. This is achieved by forcing a fixed fraction of the flux passing through the substrate into the alternative heterologous pathways formed by the introduced genes. This procedure was used for enhancing free fatty acid production in E. coli. In another effort, Yim et al. [29] integrated a biopathway prediction algorithm with computational strain design protocols to identify synthetic pathways producing non-native products from common metabolic intermediates in E. coli. They first constructed an ensemble of 10,000 pathways producing 1,4-butanediol (14BDO) from mixed sugar streams. The best engineering strategies were then identified using the OptKnock algorithm [30], improving the yield of 14BDO production for the two best-ranked synthetic pathways.
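As a toy illustration of the two ideas in this paragraph, the sketch below appends a heterologous reaction column to a stoichiometric matrix and then splits the substrate flux so that a fixed fraction enters the heterologous route. The one-metabolite network, the function names and the flux values are all assumptions made for illustration. This is not the PFF formulation of [28], and no optimization is solved here; the code only checks the steady-state mass balance S·v = 0.

```python
# Toy network (hypothetical): one metabolite A and two native reactions.
#   v_up  :   -> A   (substrate uptake)
#   v_nat : A ->     (native consumption, e.g., towards biomass)
# Rows are metabolites; columns are reactions [v_up, v_nat].
NATIVE_S = [[1.0, -1.0]]


def add_reaction(S, column):
    """Return a copy of S with one extra reaction column appended."""
    return [row + [coeff] for row, coeff in zip(S, column)]


def mass_balance(S, v):
    """Compute S*v; every entry is zero at steady state."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]


def forced_fluxes(uptake, fraction):
    """Send `fraction` of the uptake flux into the heterologous reaction."""
    v_het = fraction * uptake
    v_nat = uptake - v_het
    return [uptake, v_nat, v_het]


# The heterologous reaction (A -> product) consumes A, hence the -1 entry.
S = add_reaction(NATIVE_S, [-1.0])
v = forced_fluxes(uptake=10.0, fraction=0.25)
print(v, mass_balance(S, v))
```

In a genome-scale setting the same constraint would be added to a linear program over thousands of reactions instead of being evaluated directly.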

Stoichiometry-based approaches are limited to steady-state conditions and are generally unable to describe the rate of reaction in terms of the underlying pool of metabolite concentrations and enzyme abundances. Therefore, the identified metabolic engineering strategies may not be implementable. For example, for a suggested up-regulation the corresponding enzymatic activity and metabolite concentrations may not be reachable and/or physiologically allowable. These shortcomings can potentially be addressed by kinetic models that directly track both enzyme levels and concentrations [31].

The application of kinetic-based models in synthetic biology, however, is still hampered by a number of challenges, chief among which are the paucity, in vivo applicability and universality of kinetic parameter data. In an effort to alleviate this problem, Farasat et al. proposed SEAMAPs to build a kinetic model for a given modular synthetic pathway [32]. They used the RBS Library Calculator to design a minimal number of experiments, which varied the expression of each enzyme in the pathway over 10,000-fold, to parameterize the kinetic model of the neurosporene production pathway in E. coli. In another effort, a kinetic model was developed to identify the rate-limiting step of an in vitro ATP-free synthetic pathway for production of hydrogen from pretreated biomass sugars [33].

Successful implementation of kinetic expressions to guide metabolic interventions requires that regulatory interactions at the substrate, transcriptional, translational and post-translational levels are adequately captured [34]. Using metabolite concentration and enzyme activity as model variables, significant progress in the integration of substrate level regulatory interactions in kinetic models has been achieved [35]. Implementation of transcription level regulatory interactions in stoichiometry-based models is limited to Boolean representations or introduction of ad hoc constraints to shrink the flux ranges in concert with transcriptomic and proteomic data. This posture generally assumes that a positive correlation exists between metabolic flux and gene expression levels, though there exist ample counter-examples [36]. Generally, the predictive accuracy of these approaches is highly condition dependent [37]. The scope of kinetic models can be further expanded to integrate transcription-level regulatory events. This can potentially be achieved using phenomenological Hill equations or partition functions to describe the rate of mRNA synthesis from a given promoter in terms of transcription factor (TF) activities [38]. This could ultimately enable the integration of transcriptional regulatory events with models of metabolism.
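A phenomenological Hill equation of the kind referred to above can be sketched as follows. This is a minimal example for a single activating TF; the function name, the parameter names (`k`, `n`, `v_max`, `basal`) and the numerical values are assumptions chosen for illustration and are not taken from [38].

```python
def hill_activation(tf, k, n, v_max, basal=0.0):
    """mRNA synthesis rate from a promoter activated by one TF.

    tf    : TF activity (same units as k)
    k     : TF activity giving half-maximal induction
    n     : Hill coefficient (steepness of the response)
    v_max : synthesis rate at saturating TF activity
    basal : synthesis rate in the absence of the TF
    """
    if tf < 0 or k <= 0:
        raise ValueError("tf must be non-negative and k must be positive")
    occupancy = tf**n / (k**n + tf**n)
    return basal + (v_max - basal) * occupancy


# Hypothetical parameter values, for illustration only.
for tf in (0.0, 2.0, 200.0):
    print(tf, hill_activation(tf, k=2.0, n=2, v_max=10.0))
```

The rate equals `basal` without the TF, reaches the midpoint between `basal` and `v_max` when `tf` equals `k`, and approaches `v_max` at saturation; a repressor can be described analogously by replacing the occupancy term with one minus that term.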

Kinetic-based modeling approaches show promise in capturing the dynamic behavior of metabolic pathways and regulatory interactions. Integration of such models with system-level omics data and computational metabolic engineering tools provides an avenue for understanding and subsequently optimizing the function of synthetic pathways. Robust model parameterization in response to genetic/environmental perturbations remains difficult. Many efforts are currently underway towards resolving this challenge by proposing more efficient optimization approaches [39], reducing parameters search space by structural analysis [40] and efficient sampling approaches [41].

1.4. Synthetic biology and genome engineering tools for implementation of pathway predictions

Ideally, the engineered strain should match as closely as possible the desired flux distribution predicted through metabolic modeling. This requires, among other considerations, precise and reliable control of gene transcription and mRNA translation. Various genetic parts including promoters, RBSs, terminators, TFs and small regulatory RNA (sRNA) have been extensively characterized, offering an unprecedented range of parts for engineering gene expression. However, the performance of genetic parts is often non-conserved across different contexts (e.g., host, genetic, media), thus requiring performance re-assessment in the desired conditions [42]. Alternatively, several computational techniques have been developed for designing context-specific sequences (e.g., RBS Calculator [43]) and selecting the optimum combination (e.g., OptCircuit [44]) of genetic parts (Table 1.1C).

Optimization of gene expression involves a vast sequence design space (i.e., 4^n, where n is the length of a transcription unit). In addition, several parameters can affect gene expression, such as codon usage, tRNA availability, secondary structures, the presence of RNase binding sites, internal Shine-Dalgarno sequences and repeats [45]. Some of these parameters conflict with one another, confounding the task of codon selection. For example, rare codons instead of common codons at the N-terminus tend to reduce secondary structures, thus increasing the rate of protein translation [46]. A number of gene design tools attempt to integrate all these sometimes conflicting requirements [45], albeit with limited experimental verification. An alternative approach is to rely on high-throughput gene synthesis technology to generate a large set of codon-usage variants for a particular protein and then screen for high-expression variants, thus circumventing the need to know all design rules [47].
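The scale of this design space can be illustrated with a short calculation. The Python sketch below computes the number of possible nucleotide sequences of length n (4^n) and, for a given protein, the number of synonymous coding sequences implied by the degeneracy of the standard genetic code. The function names and the example peptide are hypothetical and chosen only for illustration.

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODON_DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def nucleotide_space(n):
    """Number of distinct DNA sequences of length n."""
    return 4 ** n


def synonymous_variants(protein):
    """Number of coding sequences that translate to the given protein."""
    count = 1
    for amino_acid in protein:
        count *= CODON_DEGENERACY[amino_acid]
    return count


if __name__ == "__main__":
    # A 1,000-nucleotide transcription unit.
    print("log10(4^1000) =", round(1000 * math.log10(4), 1))
    # A hypothetical 10-residue peptide.
    print("synonymous variants:", synonymous_variants("MKLSTAGRVE"))
```

Even when the amino acid sequence is fixed, the count grows multiplicatively with protein length, which is why screening synthesized variant sets, as in [47], can only sample a small part of the space.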

Beyond ensuring proper expression of each gene, the expression level of the entire synthetic pathway must be carefully tuned to prevent imbalances in cellular resources (e.g., biomass precursors, proteins and redox equivalents [48]) and the accumulation of toxic metabolites [49]. Combinatorial approaches geared towards optimizing the expression of all enzymes use various techniques, ranging from targeting a rational selection of genes to random editing at a genome scale. Combinatorial DNA assembly techniques such as Gibson Assembly [50] and DNA Assembler [51] are commonly used to fuse genes or operons with libraries of regulatory parts (e.g., promoter [52-54], RBS [55] and copy number [55]). When the expression of a large number of genes must be manipulated, the genes are often partitioned into separate modules based on their functions to reduce the search space. Notably, Ajikumar et al. varied the expression of the methylerythritol-phosphate (MEP) pathway and taxadiene synthesis modules by changing their promoter and copy number [56]. Despite exploring just a small fraction of the entire combinatorial space, their combinatorial variants achieved up to a 15,000-fold change in taxadiene production [56]. On a larger scale, the Klebsiella oxytoca nitrogen fixation gene cluster (103 parts) was refactored using a combinatorial design and assembly approach [5].
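The effect of modular partitioning on the search space can be sketched in a few lines of Python. The example below compares the library size when every gene is tuned independently with the size when genes are grouped into modules that share a promoter and copy number, and it enumerates the module-level designs. The part names, the numbers of parts and the module labels are hypothetical and do not reproduce the libraries used in [56].

```python
from itertools import product


def library_size(n_units, n_promoters, n_rbs=1, n_copy_numbers=1):
    """Number of designs when each unit (gene or module) independently
    receives one promoter, one RBS and one copy number."""
    options_per_unit = n_promoters * n_rbs * n_copy_numbers
    return options_per_unit ** n_units


def enumerate_module_designs(modules, promoters, copy_numbers):
    """Yield every assignment of a (promoter, copy number) pair to each module."""
    options = list(product(promoters, copy_numbers))
    for combination in product(options, repeat=len(modules)):
        yield dict(zip(modules, combination))


if __name__ == "__main__":
    # Hypothetical case: 8 genes, 3 promoters and 3 copy numbers.
    print("gene-level designs:", library_size(8, 3, n_copy_numbers=3))
    # The same genes grouped into 2 modules.
    print("module-level designs:", library_size(2, 3, n_copy_numbers=3))
    designs = enumerate_module_designs(
        ["upstream", "downstream"], ["weak", "medium", "strong"], [1, 5, 10]
    )
    print(next(designs))
```

Grouping genes into modules shrinks the exponent of the library size, at the cost of assuming that genes within a module can share the same expression setting.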

Advances in high-throughput genome engineering techniques have accelerated the construction of large strain libraries (Figure 1.1E). Methods such as MAGE [57] and synthetic sRNAs [58] directly target tens of pre-selected genes with high specificity, whereas other techniques such as gTME [59], SCALEs [60] and TRMR [61] first generate libraries of strains with randomized genome-wide mutations and subsequently perform selection to identify the genotype conferring the desired traits. The two approaches complement one another, as the latter can be used to identify subsets of genes that require a wider range of expression tuning by the former. Recently, the CRISPR-Cas9 (or dCas9) system has emerged as a versatile tool for multiplex genome engineering [62]. This system requires the design of highly orthogonal guide RNA(s) with minimal off-target activity.
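A very simplified view of off-target screening for guide RNAs is sketched below in Python: every window of a target sequence is compared with the guide, and windows within a mismatch threshold are reported as potential binding sites. This is only an illustration under strong assumptions. It scans a single strand, ignores the PAM requirement and position-dependent mismatch tolerance, and the function names and sequences are hypothetical.

```python
def count_mismatches(a, b):
    """Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def near_matches(guide, genome, max_mismatches=3):
    """Return (position, mismatches) for every window of the genome that
    differs from the guide at no more than max_mismatches positions."""
    hits = []
    width = len(guide)
    for start in range(len(genome) - width + 1):
        window = genome[start:start + width]
        distance = count_mismatches(guide, window)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


if __name__ == "__main__":
    # Toy sequences; real guides are about 20 nucleotides long.
    print(near_matches("AAAA", "AAAACCCCAAAT", max_mismatches=1))
```

A guide would be considered more orthogonal when the only hit is its intended site; dedicated design tools apply far more detailed scoring than this mismatch count.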

Collectively, these approaches are already capable of rapidly generating large combinatorial libraries. However, the lack of high-throughput assays currently limits their application to phenotypes with colorimetric (e.g., carotenoid production) or growth-based assays (e.g., metabolite tolerance).

For other difficult-to-detect metabolites, intracellular biosensors have led to a number of success stories [63]. TF-based biosensors couple the expression of reporter proteins (typically a fluorescent protein or an antibiotic resistance marker) to the level of a metabolite of interest, enabling isolation of desired mutants via fluorescence-activated cell sorting (FACS) or positive growth selection [64, 65]. Raman et al. recently fine-tuned their biosensors so that only cells producing target chemicals above a certain threshold would survive [66]. They performed FBA to identify target genes for MAGE genome engineering and then employed their biosensors to select for high producers, resulting in 36-fold and 22-fold improvements in naringenin and glucaric acid production, respectively.

While it is important to map desired traits to their genotype, sequencing and characterizing entire combinatorial libraries remains cost-prohibitive. Microarray technology has previously been employed for parallel genotype-phenotype mapping of large gene knockout and overexpression libraries [60, 61]. More recently, the tracking combinatorial engineered libraries (TRACE) method combined DNA assembly and next-generation sequencing for simultaneous genotype mapping of individual cells within a large combinatorial population [67]. TRACE was then employed to efficiently track both the combinatorial diversity and the evolutionary trajectory of a MAGE population [67].

All of these techniques have significantly improved pathway engineering and the strain construction process. In addition, they provide useful datasets for validation and refinement of computational tools and stoichiometric and kinetic metabolic models. For example, multiplex gene-knockout techniques can rapidly generate multiple knockout mutants to validate synthetic lethality predictions [68]. The refined model can then be used for predicting genetic manipulation strategies with better accuracy.
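The notion of a synthetic lethality prediction can be illustrated with a toy screen in Python. The sketch below replaces the flux-based analysis of [68] with a much simpler reachability test: a reaction fires once all of its substrates are available, and a knockout is called lethal if the biomass metabolite can no longer be produced. The network, the reaction names and this reachability criterion are assumptions made for illustration only.

```python
from itertools import combinations


def can_produce(reactions, seeds, target, knocked_out=frozenset()):
    """Return True if the target is reachable from the seed metabolites
    when the knocked-out reactions are removed (network expansion)."""
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            if name in knocked_out:
                continue
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return target in available


def lethality_screen(reactions, seeds, target):
    """Return (essential reactions, synthetic lethal pairs)."""
    essential = {
        r for r in reactions if not can_produce(reactions, seeds, target, {r})
    }
    non_essential = sorted(set(reactions) - essential)
    synthetic_lethal = [
        pair for pair in combinations(non_essential, 2)
        if not can_produce(reactions, seeds, target, set(pair))
    ]
    return essential, synthetic_lethal


# Toy network: two parallel routes from g6p to pyr (hypothetical).
TOY_NETWORK = {
    "uptake": ({"glc_ext"}, {"g6p"}),
    "route_a": ({"g6p"}, {"pyr"}),
    "route_b": ({"g6p"}, {"pyr"}),
    "biomass": ({"pyr"}, {"biomass"}),
}

if __name__ == "__main__":
    print(lethality_screen(TOY_NETWORK, {"glc_ext"}, "biomass"))
```

Here neither parallel route is essential on its own, but removing both abolishes biomass formation, which is the pattern that multiplex knockout experiments can confirm or refute.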

1.5. Future perspectives

With an ever-expanding ability to construct, screen and characterize large mutant strain libraries, emphasis will likely shift to the ability to analyze large heterogeneous datasets to guide the discovery of improved variants. Both machine-learning-inspired approaches that look for patterns in “big data” and predictive frameworks that seamlessly integrate different layers of biological processes will be needed. Moving beyond the scope of biological functions catalogued in nature, prospecting tools such as BNICE [17] and GEM-Path [6] could proactively be used to pinpoint desirable changes in enzyme substrate and/or cofactor activity by harnessing enzyme plasticity. This has the potential to uncover more direct routes to target chemicals that bypass enzymatic or regulatory bottlenecks. Furthermore, the systematic discovery and proactive elimination of undesirable secondary enzymatic functions that could drain carbon flux away from the main product could help shorten the strain design cycle [69]. Both tasks require the reliable redesign of enzymes using either de novo [70] or evolutionary techniques [71]. Several computational design techniques have been proposed for improving enzyme turnover number and substrate specificity and for reducing allosteric inhibition, among other properties [72-74], but reliable protein design remains elusive [75]. Improvements in our ability to model and predict the outcome of biological processes and networks, coupled with falling DNA synthesis costs and efficient DNA assembly tools, are bringing closer to fruition the design and assembly of synthetic production hosts uniquely tailored for bio-product biosynthesis.

Table 1.1. Computational tools for pathway prediction, strain design and genetic circuit redesign.

A. Pathway prediction

Description (BNICE, XTMS, GEM-Path): These procedures define biochemical reaction operators (BRO) or reaction rules to suggest conversion of a metabolite to a product. Starting from the target product, they use an iterative algorithm to identify existing and de novo reactions to retrace back to a source metabolite (or a host metabolome).

BNICE [17]
- The reaction rules are constructed by generalizing the bond breakage/formation at the active site using bond-electron matrix (BEM) information
- Additional modules to curate the pathways based on thermodynamic analysis of the reactions

XTMS [18]
- Improves on an earlier RetroPath procedure by ranking pathways on a combined score based on pathway heterogeneity, enzyme promiscuity, and metabolite toxicity information
- Uses molecular signature information to establish a common core of reaction rules
- The generality of the reaction rule can be modified based on requirements of computational tractability

GEM-Path [6]
- The reaction rules are manually curated based on the first three entries of the EC classification
- Additional module identifies reaction removals in the host to rank the pathways on their growth-coupled yield

Description (Bar-Even et al., Bordbar et al., SSDesign): These procedures use an Elementary Flux Mode (EFM)-derived approach to convert a source metabolite to the target chemical. Using a linear programming optimization formulation, they identify a stoichiometry-balanced path that minimizes the sum of metabolic flux required to produce a non-zero amount of target product.

Bar-Even et al. [12]
- Explores a database of reactions (KEGG) to construct minimal paths of conversion of a substrate to a product
- Requires additional manual curation to cofactor-balance the pathways
- Not automated for identification of alternate paths

Bordbar et al. [7]
- Computationally tractable solution to find all sets of EFMs in genome-scale networks
- Pathways are not ranked based on any metric
- Computationally challenging for implementation on a database of reactions

SSDesign [11]
- Restricts the metabolic network to EFMs that ensure both growth-coupled and non-growth-coupled modes of production of the target chemical
- Computationally intractable for large networks as the procedure explores all the EFMs

Description (OptStrain, SimOptStrain): These procedures use a bilevel optimization algorithm to identify a minimum set of heterologous reactions that induces the production of a non-native desired chemical in a host organism.

OptStrain [8]
- All identified solutions ensure maximum theoretical yield of production of the target chemical
- Subsequent OptKnock step [30] identifies reaction deletions in the existing network to couple the target chemical with biomass production

SimOptStrain [20]
- Simultaneous identification of reaction deletions in host metabolism and addition of heterologous reactions combines both OptStrain steps
- Engineering strategies with higher product yield overlooked by OptStrain can be identified

Description (DESHARKY, Metabolic tinker, Carbon Flux Path): These methods use graph-search techniques to find a path between a source metabolite (or host metabolome) and a target metabolite by searching a database of metabolic reactions. These are fast techniques that generally cannot ensure stoichiometry-balanced paths and provide no yield information for the products.

DESHARKY [15]
- Pathways scored based on the transcriptional and translational cost of expressing heterologous genes in the host network

Metabolic tinker [9]
- Additional thermodynamic information of reactions is used to restrict infeasible pathway designs between a source and a target metabolite

Carbon Flux Path [16]
- Uses atom-transition information for reactions to limit pathways to those involving carbon transfer

B. Computational strain design

Description (OptKnock, GDLS, OptORF, OptForce, EMILiO, BiMOMA, RobustKnock, k-OptForce): These methods use a nested mixed-integer (non-)linear optimization problem to identify a minimal set of engineering strategies to ensure overproduction of a target product.

OptKnock [30]
- Identifies a minimal set of knock-out (KO) strategies
- Requires only a stoichiometric representation of the metabolism

GDLS [76]
- Reduces the run time through local search with multiple search paths
- Employs gene-protein-reaction (GPR) mapping to identify a set of genes for KO
- Growth-coupled production of a desired product

OptORF [77]
- Identifies gene regulatory KO strategies as well as metabolic gene KO and overexpression strategies
- Integrates gene-protein-reaction association and transcriptional regulation

OptForce [78]
- Identifies flux up/down modulation as well as knock-out

EMILiO [79]
- Identifies the optimal set of modified reactions and their optimal fluxes for overproduction of a target product
- Couples production of the target product with growth
- Reduces the run time through successive linear programming

BiMOMA [20]
- Identifies a set of KO strategies for overproduction of a target product
- Implements the homeostasis effect instead of optimal growth under perturbed conditions

RobustKnock [80]
- Identifies a set of KO strategies for overproduction of a target product
- Couples the target product (i.e., bioengineering) objective with the growth (i.e., biological) objective
- A three-level optimization problem (max-min of target product production in the outer problem and max of biomass production in the inner problem)

k-OptForce [81]
- Integrates the available kinetic information
- Metabolite concentrations and enzyme abundances are constrained within physiologically relevant ranges for reactions with available kinetic information

Description (Kamp and Klamt): Enumerates the smallest intervention strategies that ensure overproduction of a target product.

Kamp and Klamt [82]
- Uses duality between elementary modes (EMs) and minimal cut sets (MCSs)

C. Genetic parts design and selection

Description (OptCircuit, Huynh et al., Zomorrodi and Maranas): These procedures use a mixed-integer dynamic optimization [44], mixed-integer non-linear programming [83], or linear integer programming [84] framework to identify globally optimal circuit components and topologies that match a pre-specified behavior.

OptCircuit [44]
- Deterministic rate equations are used for modeling the interaction dynamics of genetic parts
- Can be used to optimize kinetic parameters of given circuit components

Huynh et al. [83]
- Uses a linear approximation of the non-linear model around a target steady-state point to improve computational efficiency

Zomorrodi and Maranas [84]
- Circuits are modeled using piecewise linear differential equations
- Less computationally intensive than other optimization frameworks that involve nonlinearity of kinetic models
- Useful for selecting optimal components of genetic circuits when detailed quantitative information on the parts is absent

Description (RBS Calculator): A program that employs a biophysical model of the translation initiation process and a simulated annealing algorithm for optimizing RBS sequences.

RBS Calculator [43]
- Design of synthetic RBS with a pre-specified translation rate
- Predicts the translation rate of a given mRNA

1.6. References

1. Paddon CJ, Westfall PJ, Pitera DJ, Benjamin K, Fisher K, McPhee D, Leavell MD, Tai A, Main A, Eng D et al: High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 2013, 496:528-532.
2. Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R et al: Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nature chemical biology 2011, 7:445-452.
3. Xu P, Li L, Zhang F, Stephanopoulos G, Koffas M: Improving fatty acids production by engineering dynamic pathway regulation and metabolic control. Proceedings of the National Academy of Sciences of the United States of America 2014, 111:1-6.
4. Van Dien S: From the first drop to the first truckload: commercialization of microbial processes for renewable chemicals. Current opinion in biotechnology 2013, 24(6):1061-1068.
5. Smanski MJ, Bhatia S, Zhao D, Park Y, Woodruff LBA, Giannoukos G, Ciulla D, Busby M, Calderon J, Nicol R et al: Functional optimization of gene clusters by combinatorial design and assembly. Nature biotechnology 2014, 32(12):1241-1249.
6. Campodonico MA, Andrews BA, Asenjo JA, Palsson BO, Feist AM: Generation of an atlas for commodity chemical production in Escherichia coli and a novel pathway prediction algorithm, GEM-Path. Metabolic engineering 2014, 25:140-158.
7. Bordbar A, Nagarajan H, Lewis NE, Latif H, Ebrahim A, Federowicz S, Schellenberger J, Palsson BO: Minimal metabolic pathway structure is consistent with associated biomolecular interactions. Molecular systems biology 2014, 10(7):737.
8. Pharkya P, Burgard AP, Maranas CD: OptStrain: a computational framework for redesign of microbial production systems. Genome research 2004, 14(11):2367-2376.
9. McClymont K, Soyer OS: Metabolic tinker: an online tool for guiding the design of synthetic metabolic pathways. Nucleic acids research 2013, 41(11):e113.
10. Lan EI, Liao JC: ATP drives direct photosynthetic production of 1-butanol in cyanobacteria. Proceedings of the National Academy of Sciences of the United States of America 2012, 109(16):6018-6023.
11. Toya Y, Shiraki T, Shimizu H: SSDesign: Computational metabolic pathway design based on flux variability using elementary flux modes. Biotechnology and bioengineering 2015, 112(4):759-768.
12. Bar-Even A, Noor E, Lewis NE, Milo R: Design and analysis of synthetic carbon fixation pathways. Proceedings of the National Academy of Sciences of the United States of America 2010, 107(19):8889-8894.
13. Liu F, Vilaca P, Rocha I, Rocha M: Development and application of efficient pathway enumeration algorithms for metabolic engineering applications. Computer methods and programs in biomedicine 2015, 118(2):134-146.
14. Yousofshahi M, Lee K, Hassoun S: Probabilistic pathway construction. Metabolic engineering 2011, 13(4):435-444.
15. Rodrigo G, Carrera J, Prather KJ, Jaramillo A: DESHARKY: automatic design of metabolic pathways for optimal cell growth. Bioinformatics 2008, 24(21):2554-2556.
16. Pey J, Planes FJ, Beasley JE: Refining carbon flux paths using atomic trace data. Bioinformatics 2014, 30(7):975-980.
17. Hatzimanikatis V, Li C, Ionita JA, Henry CS, Jankowski MD, Broadbelt LJ: Exploring the diversity of complex metabolic networks. Bioinformatics 2005, 21(8):1603-1609.
18. Carbonell P, Parutto P, Herisson J, Pandit SB, Faulon JL: XTMS: pathway design in an eXTended metabolic space. Nucleic acids research 2014, 42(Web Server issue):W389-394.
19. Chatsurachai S, Furusawa C, Shimizu H: An in silico platform for the design of heterologous pathways in nonnative metabolite production. BMC bioinformatics 2012, 13:93.
20. Kim J, Reed JL, Maravelias CT: Large-scale bi-level strain design approaches and mixed-integer programming solution techniques. PloS one 2011, 6(9):e24162.
21. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular biology and evolution 2011, 28(10):2731-2739.
22. Amitai G, Sorek R: PanDaTox: a tool for accelerated metabolic engineering. Bioengineered 2012, 3(4):218-221.
23. Wagner A, Zarecki R, Reshef L, Gochev C, Sorek R, Gophna U, Ruppin E: Computational evaluation of cellular metabolic costs successfully predicts genes whose expression is deleterious. Proceedings of the National Academy of Sciences of the United States of America 2013, 110(47):19166-19171.
24. Copeland WB, Bartley BA, Chandran D, Galdzicki M, Kim KH, Sleight SC, Maranas CD, Sauro HM: Computational tools for metabolic engineering. Metabolic engineering 2012, 14(3):270-280.
25. Zomorrodi AR, Suthers PF, Ranganathan S, Maranas CD: Mathematical optimization applications in metabolic networks. Metabolic engineering 2012, 14(6):672-686.
26. Almquist J, Cvijovic M, Hatzimanikatis V, Nielsen J, Jirstrand M: Kinetic models in industrial biotechnology - Improving cell factory performance. Metabolic engineering 2014, 24:38-60.
27. Khodayari A, Chowdhury A, Maranas CD: Succinate Overproduction: A Case Study of Computational Strain Design Using a Comprehensive Escherichia coli Kinetic Model. Frontiers in bioengineering and biotechnology 2014, 2:76.
28. Ip K, Donoghue N, Kim MK, Lun DS: Constraint-based modeling of heterologous pathways: application and experimental demonstration for overproduction of fatty acids in Escherichia coli. Biotechnology and bioengineering 2014, 111(10):2056-2066.
29. Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R et al: Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nature chemical biology 2011, 7(7):445-452.
30. Burgard AP, Pharkya P, Maranas CD: Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and bioengineering 2003, 84(6):647-657.
31. Matsuoka Y, Shimizu K: Current status and future perspectives of kinetic modeling for the cell metabolism with incorporation of the metabolic regulation mechanism. Bioresources and Bioprocessing 2015, 2(1):1-19.
32. Farasat I, Kushwaha M, Collens J, Easterbrook M, Guido M, Salis HM: Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Mol Syst Biol 2014, 10:731.
33. Rollin JA, Martin Del Campo J, Myung S, Sun F, You C, Bakovic A, Castro R, Chandrayan SK, Wu CH, Adams MW et al: High-yield hydrogen production from biomass by in vitro metabolic engineering: Mixed sugars coutilization and kinetic modeling. Proceedings of the National Academy of Sciences of the United States of America 2015.
34. Heinemann M, Sauer U: Systems biology of microbial metabolism. Current opinion in microbiology 2010, 13(3):337-343.
35. Khodayari A, Zomorrodi AR, Liao JC, Maranas CD: A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data. Metabolic engineering 2014, 25:50-62.
36. Kremling A, Bettenbrock K, Gilles ED: A feed-forward loop guarantees robust behavior in Escherichia coli carbohydrate uptake. Bioinformatics 2008, 24(5):704-710.
37. Machado D, Herrgard M: Systematic evaluation of methods for integration of transcriptomic data into constraint-based models of metabolism. PLoS computational biology 2014, 10(4):e1003580.
38. Garcia HG, Phillips R: Quantitative dissection of the simple repression input-output function. Proceedings of the National Academy of Sciences of the United States of America 2011, 108(29):12173-12178.
39. Pozo C, Miró A, Guillén-Gosálbez G, Sorribas A, Alves R, Jiménez L: Global optimization of hybrid kinetic/FBA models via outer-approximation. Computers & Chemical Engineering 2015, 72:325-333.
40. Lee Y, Lafontaine Rivera JG, Liao JC: Ensemble Modeling for Robustness Analysis in engineering non-native metabolic pathways. Metabolic engineering 2014, 25:63-71.
41. Tan Y, Liao JC: Metabolic ensemble modeling for strain engineers. Biotechnology journal 2012, 7(3):343-353.
42. Arkin AP: A wise consistency: engineering biology for conformity, reliability, predictability. Curr Opin Chem Biol 2013, 17(6):893-901.
43. Salis HM, Mirsky EA, Voigt CA: Automated design of synthetic ribosome binding sites to control protein expression. Nature biotechnology 2009, 27:946-950.
44. Dasika MS, Maranas CD: OptCircuit: an optimization based method for computational design of genetic circuits. BMC systems biology 2008, 2:24.
45. Gould N, Hendy O, Papamichail D: Computational tools and algorithms for designing customized synthetic genes. Frontiers in bioengineering and biotechnology 2014, 2:41.
46. Goodman DB, Church GM, Kosuri S: Causes and effects of N-terminal codon bias in bacterial genes. Science 2013, 342(6157):475-479.
47. Quan J, Saaem I, Tang N, Ma S, Negre N, Gong H, White KP, Tian J: Parallel on-chip gene synthesis and application to optimization of protein expression. Nature biotechnology 2011, 29:449-452.
48. Rollin JA, Martin del Campo J, Myung S, Sun F, You C, Bakovic A, Castro R, Chandrayan SK, Wu CH, Adams MW et al: High-yield hydrogen production from biomass by in vitro metabolic engineering: Mixed sugars coutilization and kinetic modeling. Proceedings of the National Academy of Sciences of the United States of America 2015, 112(16):4964-4969.
49. Zelcbuch L, Antonovsky N, Bar-Even A, Levin-Karp A, Barenholz U, Dayagi M, Liebermeister W, Flamholz A, Noor E, Amram S et al: Spanning high-dimensional expression space using ribosome-binding site combinatorics. Nucleic acids research 2013, 41(9):e98.
50. Gibson DG, Young L, Chuang R-Y, Venter JC, Hutchison CA, Smith HO: Enzymatic assembly of DNA molecules up to several hundred kilobases. Nature methods 2009, 6:343-345.
51. Shao Z, Zhao H, Zhao H: DNA assembler, an in vivo genetic method for rapid construction of biochemical pathways. Nucleic acids research 2009, 37(2):e16.
52. Lee ME, Aswani A, Han AS, Tomlin CJ, Dueber JE: Expression-level optimization of a multi-enzyme pathway in the absence of a high-throughput assay. Nucleic acids research 2013, 41(22):10668-10678.
53. Du J, Yuan Y, Si T, Lian J, Zhao H: Customized optimization of metabolic pathways by combinatorial transcriptional engineering. Nucleic acids research 2012, 40(18):e142.
54. Shao Z, Rao G, Li C, Abil Z, Luo Y, Zhao H: Refactoring the silent spectinabilin gene cluster using a plug-and-play scaffold. ACS synthetic biology 2013, 2(11):662-669.
55. Xu P, Gu Q, Wang W, Wong L, Bower AG, Collins CH, Koffas MA: Modular optimization of multi-gene pathways for fatty acids production in E. coli. Nature communications 2013, 4:1409.
56. Ajikumar PK, Xiao WH, Tyo KE, Wang Y, Simeon F, Leonard E, Mucha O, Phon TH, Pfeifer B, Stephanopoulos G: Isoprenoid pathway optimization for Taxol precursor overproduction in Escherichia coli. Science 2010, 330(6000):70-74.
57. Wang HH, Isaacs FJ, Carr PA, Sun ZZ, Xu G, Forest CR, Church GM: Programming cells by multiplex genome engineering and accelerated evolution. Nature 2009, 460:894-898.
58. Wang Q, Venkataramanan KP, Huang H, Papoutsakis ET, Wu CH: Transcription factors and genetic circuits orchestrating the complex, multilayered response of Clostridium acetobutylicum to butanol and butyrate stress. BMC systems biology 2013, 7:120.
59. Alper H, Stephanopoulos G: Global transcription machinery engineering: a new approach for improving cellular phenotype. Metabolic engineering 2007, 9(3):258-267.
60. Lynch MD, Warnecke T, Gill RT: SCALEs: multiscale analysis of library enrichment. Nat Methods 2007, 4(1):87-93.
61. Warner JR, Reeder PJ, Karimpour-Fard A, Woodruff LB, Gill RT: Rapid profiling of a microbial genome using mixtures of barcoded oligonucleotides. Nature biotechnology 2010, 28(8):856-862.
62. Jakočiūnas T, Bonde I, Herrgård M, Harrison SJ, Kristensen M, Pedersen LE, Jensen MK, Keasling JD: Multiplex metabolic pathway engineering using CRISPR/Cas9 in Saccharomyces cerevisiae. Metabolic Engineering 2015, 28:213-222.
63. Eggeling L, Bott M, Marienhagen J: Novel screening methods-biosensors. Current opinion in biotechnology 2015, 35C:30-36.
64. Dietrich JA, Shis DL, Alikhani A, Keasling JD: Transcription factor-based screens and synthetic selections for microbial small-molecule biosynthesis. ACS synthetic biology 2013, 2(1):47-58.
65. Binder S, Schendzielorz G, Stabler N, Krumbach K, Hoffmann K, Bott M, Eggeling L: A high-throughput approach to identify genomic variants of bacterial metabolite producers at the single-cell level. Genome biology 2012, 13(5):R40.
66. Raman S, Rogers JK, Taylor ND, Church GM: Evolution-guided optimization of biosynthetic pathways. Proceedings of the National Academy of Sciences of the United States of America 2014, 111(50):17803-17808.
67. Zeitoun RI, Garst AD, Degen GD, Pines G, Mansell TJ, Glebes TY, Boyle NR, Gill RT: Multiplexed tracking of combinatorial genomic mutations in engineered cell populations. Nature biotechnology 2015.
68. Suthers PF, Zomorrodi A, Maranas CD: Genome-scale gene/reaction essentiality and synthetic lethality analysis. Mol Syst Biol 2009, 5:301.
69. Guzman GI, Utrilla J, Nurk S, Brunk E, Monk JM, Ebrahim A, Palsson BO, Feist AM: Model-driven discovery of underground metabolic functions in Escherichia coli. Proceedings of the National Academy of Sciences of the United States of America 2015, 112(3):929-934.
70. Rajagopalan S, Wang C, Yu K, Kuzin AP, Richter F, Lew S, Miklos AE, Matthews ML, Seetharaman J, Su M et al: Design of activated serine-containing catalytic triads with atomic-level accuracy. Nat Chem Biol 2014, 10(5):386-391.
71. Smith MA, Romero PA, Wu T, Brustad EM, Arnold FH: Chimeragenesis of distantly-related proteins by noncontiguous recombination. Protein science : a publication of the Protein Society 2013, 22(2):231-238.
72. Privett HK, Kiss G, Lee TM, Blomberg R, Chica RA, Thomas LM, Hilvert D, Houk KN, Mayo SL: Iterative approach to computational enzyme design. Proceedings of the National Academy of Sciences of the United States of America 2012, 109(10):3790-3795.
73. Tinberg CE, Khare SD, Dou J, Doyle L, Nelson JW, Schena A, Jankowski W, Kalodimos CG, Johnsson K, Stoddard BL et al: Computational design of ligand-binding proteins with high affinity and selectivity. Nature 2013, 501(7466):212-216.
74. Grisewood MJ, Gifford NP, Pantazes RJ, Li Y, Cirino PC, Janik MJ, Maranas CD: OptZyme: computational enzyme redesign using transition state analogues. PloS one 2013, 8(10):e75358.
75. Pantazes RJ, Grisewood MJ, Maranas CD: Recent advances in computational protein design. Current opinion in structural biology 2011, 21(4):467-472.
76. Lun DS, Rockwell G, Guido NJ, Baym M, Kelner JA, Berger B, Galagan JE, Church GM: Large-scale identification of genetic design strategies using local search. Molecular systems biology 2009, 5:296.
77. Kim J, Reed JL: OptORF: Optimal metabolic and regulatory perturbations for metabolic engineering of microbial strains. BMC systems biology 2010, 4:53.
78. Ranganathan S, Suthers PF, Maranas CD: OptForce: an optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS computational biology 2010, 6(4):e1000744.
79. Yang L, Cluett WR, Mahadevan R: EMILiO: a fast algorithm for genome-scale strain design. Metabolic engineering 2011, 13(3):272-281.
80. Tepper N, Shlomi T: Predicting metabolic engineering knockout strategies for chemical production: accounting for competing pathways. Bioinformatics 2010, 26(4):536-543.
81. Chowdhury A, Zomorrodi AR, Maranas CD: k-OptForce: integrating kinetics with flux balance analysis for strain design. PLoS computational biology 2014, 10(2):e1003487.
82. von Kamp A, Klamt S: Enumeration of smallest intervention strategies in genome-scale metabolic networks. PLoS computational biology 2014, 10(1):e1003378.
83. Huynh L, Kececioglu J, Koppe M, Tagkopoulos I: Automatic design of synthetic gene circuits through mixed integer non-linear programming. PloS one 2012, 7(4):e35529.
84. Zomorrodi AR, Maranas CD: Coarse-grained optimization-driven design and piecewise linear modeling of synthetic genetic circuits. Eur J Oper Res 2014, 237(2):665-676.

Chapter 2

RATIONAL DESIGN OF A SYNTHETIC ENTNER-DOUDOROFF PATHWAY FOR IMPROVED AND CONTROLLABLE NADPH REGENERATION

This chapter has been previously published in modified form in Metabolic Engineering (Ng, C. Y., Farasat, I., Maranas, C. D., & Salis, H. M. (2015). “Rational Design of a Synthetic Entner-Doudoroff Pathway for Improved and Controllable NADPH Regeneration”. Metabolic Engineering, 29, 86–96).

2.1. Introduction

Most metabolic reactions that produce industrially important compounds depend on electron-carrying cofactors, such as NADH and NADPH. In particular, NADPH plays a vital role in the biosynthesis of drugs [1-3], chiral alcohols [4, 5], fatty acids and biopolymers [6-9], while also being required for lipid biosynthesis, biomass formation, and cell replication [10, 11]. As a result, the regeneration rate of NADPH is often the rate-limiting step for over-producing desired chemicals while maintaining robust cellular growth. Therefore, increasing NADPH regeneration rates can increase both pathway productivities and product yields [1, 2, 9, 12-16]. Here, our objective is to develop a modular, drop-in pathway that rapidly regenerates NADPH and provides control over redox supply levels, to increase the productivity of NADPH-dependent metabolic pathways.

In E. coli, the three major sources of NADPH regeneration are the pentose phosphate pathway (PPP), the tricarboxylic acid (TCA) cycle, and the transhydrogenase system [17]. To increase NADPH regeneration rates, a common strategy has been to redirect carbon flux through the PPP by deleting pgi or pfkA/pfkB [1, 18, 19] and by over-expressing glucose-6-phosphate dehydrogenase (zwf) or 6-phosphogluconate dehydrogenase (gnd) [7, 20, 21]. Following these approaches, titers of leucocyanidin and thymidine, both limited by NADPH availability, were improved by up to 3.8-fold [1] and 4.85-fold [3], respectively. However, the resulting release of carbon dioxide within the PPP [22] lowers product carbon yield, and the growth defect caused by a pgi deletion limits productivity [22-24]. To overcome this challenge, it is possible to redirect carbon flux through the Entner-Doudoroff (ED) pathway, which regenerates NADPH without a concomitant carbon loss.

The Entner-Doudoroff (ED) pathway combines the enzymes glucose-6-phosphate dehydrogenase (zwf), 6-phosphogluconolactonase (pgl), 6-phosphogluconate dehydratase (edd), and 2-keto-3-deoxygluconate-6-phosphate (KDPG) aldolase (eda) to convert glucose 6-phosphate into two units of pyruvate, while generating equimolar amounts of ATP, NADH, and NADPH (Figure 2.1). In contrast, the well-known Embden-Meyerhof-Parnas (EMP) glycolysis pathway performs the same conversion, but produces two moles each of ATP and NADH. There are several additional, and important, differences between these otherwise substitutable glycolytic pathways. First, the lower amount of ATP synthesis causes the ED pathway to become highly exergonic, favoring catalysis in the forward direction [25]. As a result, the ED pathway has been shown to require 3.5-fold less enzyme to achieve the same EMP pathway flux, implying a similar reduction in the cost of assembling the catalytic machinery. Second, bacterial strains that rely on the ED pathway to perform glycolysis generally produce more NADPH than their anabolic demand [25-27]. To supplement ATP synthesis, like other facultative organisms, ED-dependent bacteria carry out aerobic respiration and catabolize additional non-glycolytic substrates [12, 25]. Finally, when both the EMP and ED pathways are available in the same organism, the ED pathway often fulfills an alternative role. For example, the conditionally expressed ED pathway in E. coli evolved to carry out gluconate metabolism [19, 20, 28-32].

Engineering the natural E. coli ED pathway may not enable tunable control over its NADPH regeneration rate, due to endogenous layers of transcriptional, translational, and allosteric regulation. Instead, a promising strategy is to heterologously express a highly active version of the pathway from a different organism [33, 34]. We therefore selected the highly active ED pathway from Zymomonas mobilis. This organism relies solely on the ED pathway for glycolysis, has a high sugar uptake rate, has a high regeneration rate of ATP and NAD(P)H, and produces ethanol at levels that surpass those of many yeast strains [35, 36]. The high glycolytic flux of Z. mobilis has been attributed to the high turnover numbers, minimal allosteric control, and high expression levels of its ED enzymes [36-38]. Its glucose 6-phosphate dehydrogenase (zwf) enzyme is known to regenerate both NADH and NADPH, enabling autonomous redox balancing [27]. To the best of our knowledge, a complete Z. mobilis Entner-Doudoroff pathway has not yet been expressed in E. coli.

In this study, we designed, constructed, and systematically optimized a synthetic Entner-Doudoroff pathway as a drop-in module that significantly increases a bacterial host's NADPH regeneration rate. Using computational optimization and biophysical models, we rationally designed two operon sequences to heterologously express the four-enzyme ED pathway as well as phosphoglucose isomerase (pgi) to obtain maximum control over their expression levels (Figure 2.2). We constructed and assembled the resulting 8.9-kbp genetic system, and integrated it into the E. coli MG1655-derived genome. We then efficiently explored the 5-dimensional expression space by employing the RBS Library Calculator to design optimized genome mutations [39] together with multiplex automated genome engineering (MAGE) mutagenesis to implement the genome mutations [40], generating libraries of 10^6 ED pathway-genome variants. Using a NADPH-dependent fluorescent protein, we screened 624 ED pathway-genome variants for high NADPH regeneration rates, and then extensively characterized 22 re-integrated pathways by measuring in vivo NADPH regeneration rates and NADPH-dependent biosynthesis rates. As a result, an optimized ED pathway increased NADPH-dependent fluorescence by 25-fold and increased the production titer of an already optimized carotenoid biosynthesis pathway by 97%.

Figure 2.1. EMP versus ED pathways. The major glycolytic pathways in E. coli, showing the Embden-Meyerhof-Parnas (EMP) and Entner-Doudoroff (ED) pathways. Native E. coli genes are shown in light blue color. The optimized heterologous genes in our synthetic bacterial operons are shown highlighted in yellow boxes.

Figure 2.2. Design and construction of ED1.0 and combinatorial ED variants. The synthetic ED-tetAR operons were designed using the Operon Calculator, with the Zm-zwf, Zm-pgi, Zm-edd, Zm-eda and Zm-pgl genes from Zymomonas mobilis ZM4 as input. Zm-zwf and Zm-pgi were grouped into the first operon and Zm-edd, Zm-eda and Zm-pgl were grouped into the second operon. Both operons were under the control of promoter Ptac. This genetic system was integrated into the chromosome of the E. coli EcNR2 strain between the tonB and yciL loci, resulting in strain ED1.0. A 16-variant RBS library (dRBSi, i = 1, 2, …, 5) for each gene i was designed using the RBS Library Calculator and introduced into strain ED1.0 using MAGE mutagenesis, resulting in a large combinatorial population.

2.2. Materials and methods

Chemicals were obtained from Sigma-Aldrich Co. (St. Louis, MO) and VWR International (Radnor, PA). Enzymes were purchased from New England Biolabs Inc. (Ipswich, MA). E. coli TOP10 strain (Invitrogen), Pir116 strain (TransforMax™ EC100D™ pir-116) and E. coli K12 ER2267 (LacIq) strain (NEB) were used for plasmid construction and propagation. Plasmid pQE-mBFP [41] was obtained from Dr. Geun-Joong Kim’s lab (Chonnam National University, South Korea). The tetracycline-resistance gene cassette tetAR was obtained from Dr. John Roth’s lab (UC Davis, CA). DNA fragment and oligonucleotide synthesis were performed by Integrated DNA Technologies Inc. (Coralville, IA) and GeneArt (Regensburg, Germany). Gene sequencing was performed by QuintaraBio (Boston, MA) and the Penn State Genomics Core. Unless stated otherwise, M9 minimal media is 1x M9 salt (Sigma-Aldrich Co.), 2 mM magnesium sulfate, 100 µM calcium chloride, supplemented with 0.4% w/v glucose, 0.34 g/L thiamine and 0.05 g/L leucine, adjusted to a pH of 7.4. For EcNR2 or EcIF15 derived strains, biotin was added to a final concentration of 250 µg/L.

Design of synthetic operons and codon optimization

Starting from amino acid sequences, gene and operon designs were carried out using a recently developed design procedure for synthetic operons (Operon Calculator v0.50) (Unpublished data). Synonymous codon selections for each coding sequence were initially determined by using weighted random choice from a custom codon usage table, followed by multi-objective optimization according to several design criteria. The custom codon usage table heavily weighted synonymous codons from the 164 most highly translated E. coli MG1655 protein coding sequences, as predicted by the RBS Calculator v2.0 [42], while eliminating rare codons. Additional synonymous codon mutations were made according to the following criteria: (i) 5' untranslated regions and the beginnings of each protein coding sequence were co-optimized to achieve high translation rate capacities, i.e., a translation initiation rate of at least 100,000 on the RBS Calculator v1.1 proportional scale, (ii) the translation initiation rates of internal start codons were minimized, (iii) Shine-Dalgarno-like ribosomal pause sequences [43] were removed, (iv) repetitive sequences, inverted repeats, and selected restriction enzyme recognition sequences were removed, and (v) 5' and 3' untranslated regions had minimum necessary lengths. Optimized ribosome binding sites and protein coding sequences for the five enzymes were assembled into two bacterial operons. Transcription of both operons is initiated by an IPTG-inducible Ptac promoter, and terminated by BBa_B0021 and BBa_K780000 transcriptional terminators from the Registry of Standard Biological Parts.
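The initial codon selection step described above (weighted random choice from a codon usage table) can be illustrated with a minimal sketch. This is not the Operon Calculator itself: the table `CODON_WEIGHTS`, its numerical weights, and the function name `sample_coding_sequence` are hypothetical and chosen only for illustration, and the subsequent multi-objective optimization is not shown.

```python
import random

# Hypothetical codon weights for three amino acids (illustrative values only,
# NOT the custom usage table used in this work). A weight of 0.0 mimics the
# elimination of a rare codon.
CODON_WEIGHTS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "L": {"CTG": 0.80, "CTC": 0.10, "TTA": 0.10, "CTA": 0.0},
}

def sample_coding_sequence(protein, weights, rng):
    """Draw one synonymous codon per residue by weighted random choice."""
    codons = []
    for residue in protein:
        options = list(weights[residue].keys())
        option_weights = [weights[residue][codon] for codon in options]
        codons.append(rng.choices(options, weights=option_weights, k=1)[0])
    return "".join(codons)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = sample_coding_sequence("MKLLK", CODON_WEIGHTS, rng)
print(seq)
```

In a full design procedure, a sequence drawn this way would only be a starting point that is then mutated synonymously to satisfy criteria (i) to (v).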

Strain and plasmid construction

The parent strain for initial genome integrations and ED optimization is E. coli EcNR2, derived from E. coli MG1655 with modifications bioA/bioB::λ-Red-bla ΔmutS::cat [40]. The parent strains for re-integration and characterization of selected ED pathways are EcNR2 and EcIF15, which is a derivative of EcHW2f [40] with a mutated dxs ribosome binding site (Table A.1) [39]. The EcNR2 Δpgi strain was constructed by inserting two consecutive stop codons into the pgi coding sequence with co-selection MAGE (Supplementary notes A.1.1.). The synthetic operons described in section 2.1 were divided into three segments (ZmED1, ZmED2 and ZmED3) for gene synthesis. The first two segments were further divided into six gBlocks Gene Fragments, each with a length of about 500 bp. Both ZmED1 (2417 bp) and ZmED2 (2318 bp) were then assembled using Gibson Assembly [44]. The third segment (ZmED3) was synthesized by GeneArt (Life Technologies). Plasmid pCN-LED was constructed by assembling ZmED1, ZmED2 and ZmED3 into a separately constructed vector backbone (pCN-L) that contained an R6K origin, a chloramphenicol (Cm) resistance marker, and the lacI gene. Once assembled, pCN-LED contained the two ED bacterial operons flanked by 40 bp sequences and restriction sites to enable recombination into the tonB/yciL intergenic region within the E. coli MG1655 genome. Additionally, the tetAR operon was amplified from Salmonella typhimurium LT2 strain TT25401 and inserted into plasmid pCN-LED. After sequence verifying the resulting 11.8 kb plasmid pCN-LEDT (pCN-065), the recombination cassette (ED-tetAR) was amplified by PCR and integrated into the EcNR2 genome at the tonB/yciL intergenic region using a lambda Red recombination approach (see section 2.4). In order to increase tetA expression required for counter-selection with nickel salts or fusaric acid during co-selection MAGE [45], additional MAGE genome mutagenesis was employed to increase the translation initiation rate of the integrated tetA marker from 228 to 48,372 au, employing the RBS Library Calculator to design RBS genomic mutations (Supplementary notes A.1.2.). The resulting strain is henceforth denoted as ED1.0.

Combining the RBS Library Calculator and MAGE for pathway optimization

For each of the five enzyme coding sequences, we used the RBS Library Calculator in Search mode to design 16-variant degenerate RBS libraries with translation initiation rates that spanned a 4,500- to 61,000-fold range, while restricting degenerate nucleotides to within a 9 bp region [39] (Table A.3). RBS libraries were encoded within 90 bp degenerate oligonucleotides with four phosphorothioated bases at the 5' terminus (Table A.2). RBS libraries were incorporated into the ED1.0 chromosome by performing 40 manual rounds of MAGE using a pool size of 80 (16 x 5) oligonucleotides adjusted to a final concentration of 1 to 2 µM. Daily MAGE rounds were carried out by first growing ED1.0 or previously mutagenized strains in 5 mL of SOC broth, supplemented with 0.4% w/v glucose, for 12 to 16 hours at 30 °C and 250 RPM shaking, followed by dilution to an initial OD600 of 0.05 in 5 mL of SOC broth. The first MAGE round began by incubating cells at 30 °C at 250 to 300 RPM until their OD600 reached 0.5 to 0.7, followed by heat shock at 42 °C for 15 min to induce expression of the λ-prophage genes (bet, gam, exo), pelleting, and washing three times with cold sterile water to induce electrocompetency. 50 µL of the oligonucleotide mixture was then added to electrocompetent cells and electroporated at 1,800 V. To begin the second MAGE round, cells were recovered in pre-warmed SOC until their OD600 reached 0.4 to 0.6, followed by repetition of heat shock, recovery, and electroporation. Three to four MAGE rounds were performed daily. The resulting pool of ED-genome variants was then characterized using several assays.
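To make the library arithmetic concrete, the following minimal sketch expands a degenerate sequence written in IUPAC codes into its individual variants. The 9-nt region `ARGRSGWAA` is a made-up example, not one of the designed RBS libraries (those are listed in Table A.3), and the function name `expand_degenerate` is hypothetical. It only shows how four two-fold degenerate positions give 16 variants, and how 16 variants for each of 5 genes give the pool size of 80.

```python
from itertools import product

# Subset of IUPAC nucleotide codes sufficient for this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def expand_degenerate(seq):
    """Enumerate every concrete sequence encoded by a degenerate sequence."""
    return ["".join(bases) for bases in product(*(IUPAC[code] for code in seq))]

# Hypothetical 9-nt degenerate region: R, R, S and W are each two-fold
# degenerate, so the region encodes 2**4 = 16 variants.
region = "ARGRSGWAA"
variants = expand_degenerate(region)

n_genes = 5
pool_size = len(variants) * n_genes  # 16 variants x 5 genes

print(len(variants), pool_size)
```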

Chromosomal integration of Entner-Doudoroff pathway variants

ED-tetAR linear DNA cassettes were PCR amplified from selected variants and re-integrated into the chromosomes of strains EcNR2 or EcIF15 via homologous recombination using the following approach. Overnight cultures of isolated colonies were grown in SOC broth supplemented with 0.4% w/v glucose for 12 to 16 hours at 30 °C with aeration. Overnight cultures were then diluted to an initial OD600 of 0.05 to 0.1 in 5 mL of SOC broth and were grown at 30 °C to an OD600 of 0.4 to 0.6. Induction of the λ-Red recombination proteins was performed by shifting the cultures to a 42 °C water bath with shaking for 15 minutes. The cultures were subsequently chilled for 10 min, washed three times with cold sterile distilled water and finally resuspended in 150 µL of distilled water. For each transformation, 10 to 100 ng of linear DNA cassette was added to 50 µL of electrocompetent cells. After electroporation at 1800 V, electroporated cells were mixed with 1 mL of SOC and then incubated at 30 °C, 250 RPM before plating on LB agar plates containing tetracycline or the appropriate antibiotic. Colonies with correct insertion of linear DNA were verified by PCR analysis.

Measurement of intracellular NADPH and NADP+ levels

Intracellular levels of NADPH and NADP+ were determined according to a method described previously [20]. Overnight cultures inoculated from isolated colonies were grown in LB with 50 µg/mL Cm at 30 °C with 250 RPM shaking. 50 mL of M9 minimal media with 0.4% w/v glucose was then inoculated with overnight culture to an OD600 of 0.1 and induced with 0.5 mM IPTG. After 24 h of growth at 30 °C with 250 RPM shaking (OD600 of 1.0 to 3.0), cultures were chilled in an ice bath for 10 min, harvested by centrifugation at 4 °C, 10 min, 4750 RPM and resuspended in 0.5 mL to an OD600 of 30.

For NADPH (reduced form) analysis, cells were resuspended in 250 µL of 0.3 M NaOH, incubated at 60 °C for 7 min, then neutralized by 250 µL of 0.3 M HCl and 50 µL of Tricine-NaOH (pH 8.0). For NADP+ (oxidized form) analysis, cells were resuspended in 250 µL of 0.3 M HCl and 50 µL of Tricine-NaOH (pH 8.0), incubated at 60 °C for 7 min, then neutralized by 250 µL of 0.3 M NaOH. Neutralized samples were then centrifuged at 4 °C, 60 min, 4750 RPM and the resulting supernatant was transferred to a new microcentrifuge tube. 80 µL of reduced sample or 40 µL of oxidized sample mixed with 40 µL of 0.1 M NaCl was added to a 96-well microtiter plate for analysis. A 2x stock solution of reaction mixture was prepared by mixing equal volumes of 1.0 M Tricine-NaOH (pH 8.0), 4.2 mM thiazolyl blue tetrazolium bromide (MTT), 40 mM EDTA, 1.67 mM phenazine ethosulfate (PES), and 25 mM glucose-6-phosphate. Both reduced and oxidized samples were mixed with 80 µL of freshly prepared reaction mixture and then incubated at 37 °C for 15 min. 20 µL of 2.5 U/mL NADP+-specific glucose-6-phosphate dehydrogenase (G6PDH) from Saccharomyces cerevisiae was added to initiate the reaction, and the time-course formation of reduced MTT was measured at 570 nm, 37 °C with a Tecan microplate reader. The cofactor concentration was determined from the rate of change in absorbance when compared to a calibration curve constructed by measuring NADP+ standards on the same microplate.
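As a rough illustration of the last step, the sketch below fits a straight line to the absorbance rates of hypothetical NADP+ standards and inverts it to convert sample rates into concentrations. All numbers, units, and function names (`fit_line`, `rate_to_concentration`) are assumptions made for this example. A linear calibration is itself an assumption, and any correction for the different sample volumes loaded per well is left out.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical NADP+ standards measured on the same plate.
standard_conc_um = [0.0, 0.5, 1.0, 2.0]               # concentration, µM
standard_rate = [0.002, 0.012, 0.022, 0.042]          # dA570/min

slope, intercept = fit_line(standard_conc_um, standard_rate)

def rate_to_concentration(rate):
    """Invert the calibration line to estimate a cofactor concentration."""
    return (rate - intercept) / slope

# Hypothetical sample rates from the reduced (NADPH) and oxidized (NADP+) extracts.
nadph_um = rate_to_concentration(0.032)
nadp_um = rate_to_concentration(0.017)
redox_ratio = nadph_um / nadp_um

print(nadph_um, nadp_um, redox_ratio)
```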

Quantifying NADPH levels with a modified fluorescent reporter

Plasmid pQE-mBFP expresses a NADPH-dependent metagenomic blue fluorescent protein (mBFP) on a ColE1 vector [41]. The mBFP protein is a short chain dehydrogenase (SDR) that binds specifically to NADPH and emits fluorescence at 451 nm when excited at 395 nm, producing more fluorescence when supplied with more NADPH. We modified pQE-mBFP by replacing its original IPTG-inducible T5 promoter with a strong constitutive promoter and by increasing the translation initiation rate of mBFP to 356,786 au on the RBS Calculator proportional scale, resulting in pCN-mBFP. The pool of ED-genome strain variants, the strain ED1.0 and the EcNR2 strain were transformed with pCN-mBFP, followed by characterization of isogenic cultures using spectrophotometry (Infinite M1000, TECAN) to record mBFP fluorescence. Overnight cultures in LB with 50 µg/mL of kanamycin (Kan) were used to inoculate 200 µL M9 minimal media with 10 µg/mL Kan in microtiter wells. Cultures were then incubated at 30 °C with high orbital shaking for 6 to 9 hours. Cells were then serially diluted into fresh, pre-warmed M9 minimal media with 10 µg/mL Kan and 1 mM IPTG and grown similarly for another 12 hours. During the IPTG-induced growth period, cell densities (OD600) and mBFP fluorescence levels were recorded every 10 minutes. Specific mBFP fluorescence levels were determined by dividing each strain's background-corrected fluorescence levels by their corresponding OD600 values. Specific mBFP production rates were determined by the slope of specific mBFP fluorescence levels versus time over a region of linearly increasing fluorescence levels. For normalization, all specific mBFP production rates were divided by the specific mBFP production rate of the EcNR2 control strain. ED-genome variants were selected for sequencing according to their normalized specific mBFP production rates.
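The rate calculation described above can be sketched as follows. The time points, fluorescence values, background, and the choice of linear window are invented for illustration, and the helper names (`linear_slope`, `specific_production_rate`) are hypothetical. In practice the linear region would have to be chosen by inspecting each time course.

```python
def linear_slope(x, y):
    """Least-squares slope of y versus x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def specific_production_rate(times_h, fluorescence, od600, background, window):
    """Slope of background-corrected fluorescence per OD600 over a linear window."""
    start, stop = window
    specific = [(f - background) / od for f, od in zip(fluorescence, od600)]
    return linear_slope(times_h[start:stop], specific[start:stop])

# Hypothetical time courses (arbitrary fluorescence units).
times_h = [0.0, 1.0, 2.0, 3.0]
od600 = [0.2, 0.3, 0.4, 0.5]
background = 20.0
variant_fluorescence = [40.0, 65.0, 100.0, 145.0]
control_fluorescence = [40.0, 53.0, 68.0, 85.0]
window = (0, 4)  # indices of the assumed linear region

variant_rate = specific_production_rate(times_h, variant_fluorescence, od600, background, window)
control_rate = specific_production_rate(times_h, control_fluorescence, od600, background, window)

# Normalize to the control strain, as done for the EcNR2 reference.
normalized_rate = variant_rate / control_rate
print(variant_rate, control_rate, normalized_rate)
```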

Measurement of NADPH-dependent carotenoid biosynthesis

The enzymes encoded by crtEBI from Rhodobacter sphaeroides catalyze the conversion of isopentenyl diphosphate to neurosporene, a brown carotenoid pigment. We previously constructed plasmids that express an optimally balanced crtEBI operon to produce high levels of neurosporene [39]. The operon's expression is controlled by either an IPTG-inducible PlacO1 promoter (pIF-001C or pIF-001K) or an arabinose-inducible PBAD promoter (pIF-002). Selected ED-genome variants were transformed with pIF-001 or pIF-002, and cultured overnight in 25 mL of LB broth with 50 μg/mL Cm for EcIF15-derived strains or 50 μg/mL Kan for EcNR2-derived strains. Cultures were then diluted to a final OD600 of 0.1 in 25 mL 2X M9 minimal media supplemented with 0.4% w/v glucose, 10 μg/mL Cm and 0.2 mM IPTG. For strains transformed by pIF-002, 10 mM arabinose was added. Cultures were incubated at 37°C and 300 RPM shaking for 10 hours and pelleted by centrifugation. Cell pellets were washed, dried, and mixed with 1 mL acetone in pre-weighed microcentrifuge tubes, followed by incubation at 55°C and repeated vortexing for 20 minutes. The extraction solution was centrifuged again, and a 50 μL supernatant sample was transferred to 1 mL acetone (21-fold dilution) in a fresh microcentrifuge tube. The absorbance of neurosporene in acetone was recorded by a NanoDrop 2000C spectrophotometer at 470 nm, using pure acetone as a blank. Extracted cell pellets were dried open-capped in a 65°C oven for 2 days, equilibrated closed-capped at room temperature, followed by recording of cell pellet masses by comparison of microcentrifuge weights before and after extraction.
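The dilution arithmetic and a simple per-biomass normalization can be written out as a short sketch. The function names and the example absorbance and pellet mass are hypothetical. No extinction coefficient is applied, so the result is an absorbance per milligram of dry cell weight, not an absolute neurosporene content.

```python
def dilution_factor(sample_ul, diluent_ul):
    """Fold dilution when sample_ul is added to diluent_ul."""
    return (sample_ul + diluent_ul) / sample_ul

def absorbance_per_mg_dcw(a470_diluted, sample_ul, diluent_ul, dry_pellet_mg):
    """Dilution-corrected A470 normalized by dry cell pellet mass."""
    undiluted_a470 = a470_diluted * dilution_factor(sample_ul, diluent_ul)
    return undiluted_a470 / dry_pellet_mg

# 50 µL of extract transferred into 1 mL (1000 µL) of acetone.
factor = dilution_factor(50.0, 1000.0)

# Hypothetical reading: A470 of 0.30 after dilution, 6.3 mg dry pellet.
example = absorbance_per_mg_dcw(0.30, 50.0, 1000.0, 6.3)
print(factor, example)
```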

2.3. Results

Rational design and construction of a synthetic Entner-Doudoroff pathway

We selected five enzymes from Z. mobilis ZM4 for heterologous expression in E. coli: glucose-6-phosphate dehydrogenase (ZMO0367/Zm-zwf), 6-phosphogluconolactonase (ZMO1478/Zm-pgl), 6-phosphogluconate dehydratase (ZMO0368/Zm-edd), 2-keto-3-deoxygluconate-6-phosphate (KDPG) aldolase (ZMO0997/Zm-eda), and phosphoglucose isomerase (ZMO1212/Zm-pgi). The first four enzymes constitute the ED pathway that converts glucose 6-phosphate to pyruvate and glyceraldehyde-3-phosphate, while the fifth reversibly interconverts fructose-6-phosphate to glucose-6-phosphate (Figure 2.1). We included the enzyme Zm-pgi to regulate metabolic flux at the major glycolysis branch point, glucose-6-phosphate. Throughout the paper, the operons consisting of the five Z. mobilis enzymes are designated as the synthetic ED operons.

Protein expression levels are regulated by several genetic elements, including promoters, ribosome binding sites (RBSs), and protein coding sequences. In natural genetic systems, changes in transcription, translation, and mRNA stability collectively control a protein's expression level. As a key strategy to optimizing the ED pathway, we developed an optimization procedure, called the Operon Calculator, that designs bacterial operon sequences with the overall objective of concentrating expression control to the fewest number of short genetic parts, while eliminating undesired genetic elements that confound our ability to control protein expression. The Operon Calculator minimizes the number of undesired internal start codons, ribosome pause sequences, repetitive sequences, and restriction sites, while selecting 5' UTR sequences, synonymous codon sequences, and terminators for high translation rate capacities and termination efficiencies (Section 2.2.1). As a result, two synthetic bacterial operon sequences were designed grouping together the enzymes Zm-zwf and Zm-pgi into the first operon, and Zm-pgl, Zm-eda and Zm-edd into the second operon (Figure 2.2, Figure A.1). We selected an IPTG-inducible Ptac promoter to transcribe both operons. We also designed the initial ribosome binding site sequences for all five enzyme coding sequences to have translation initiation rates of about 1,000 au on the RBS Calculator v1.1 proportional scale, which is a moderate translation rate. Importantly, these RBS sequences were designed by the RBS Library Calculator [39] so that a small number of adjacent mutations could greatly vary the coding sequences' translation rates. In addition, we chose tetAR as our selection marker for genome integration.

The synthetic ED operons were first constructed and inserted into a pre-constructed vector (pCN-L) along with tetAR operon resulting in a 11.8 kb plasmid (pCN-LEDT) by combining DNA synthesis, DNA assembly, and molecular cloning. We then employed PCR amplification and homologous recombination to integrate the ED-tetAR operons (8.9 kb) into the EcNR2 chromosome (Section 2.2.2). We refer to this strain, harboring the first version of our synthetic Entner-Doudoroff pathway, as ED1.0 (Figure 2.2).

Characterization of the synthetic Entner-Doudoroff pathway in ED1.0

We first characterized the activity of the synthetic Entner-Doudoroff pathway in ED1.0, compared to its parent strain EcNR2, by measuring the NADPH/NADP+ intracellular redox ratio and by measuring the in vivo NADPH regeneration rate. Using a glucose 6-phosphate dehydrogenase assay on cell extract (Section 2.2.5), we found that the ED1.0 strain has a NADPH/NADP+ redox ratio that is 1.87-fold higher than its parent EcNR2 strain after both are cultured to the exponential growth phase using M9 minimal media (two-tailed, two-sample t-test, p-value = 0.037) (Figure 2.3A). We then selected mBFP, a NADPH-dependent fluorescent protein reporter, as a large consumption sink for NADPH that also serves as an observable readout for its in vivo regeneration rate. mBFP is a short-chain dehydrogenase that actively oxidizes NADPH and proportionally emits fluorescence [41].

As an initial control to validate the mBFP assay, we over-expressed the transhydrogenase PntAB on an R6K origin vector within E. coli Pir116 together with an IPTG-inducible mBFP expression plasmid to measure its effect on mBFP fluorescence (Supplementary notes A.1.3). For a comparison, we also created an R6K-origin plasmid that did not express any enzymes, and co-transformed this control plasmid with the mBFP expression plasmid in the same strain. We continuously measured mBFP fluorescence after IPTG induction during long-time cultures maintained in the exponential phase of growth using M9 minimal media. After a lag period, mBFP fluorescence increased linearly in time over a long duration, indicating that NADPH availability was a rate-limiting step to fluorescence emission (Figure A.2A). We quantified NADPH regeneration rate by calculating the first-derivative (slope) of the time-course mBFP fluorescence per OD600 over the linear regime, which we refer to as the mBFP fluorescence production rate (mBFP flu rate). Overexpression of PntAB together with induced mBFP expression, using 0.5 mM IPTG, increased the mBFP fluorescence production rate by 81%, compared to the control (93.6 ± 20.7 au/OD600.h vs 51.8 ± 28.9 au/OD600.h, two-tailed, two-sample t-test, p-value = 0.057) (Figure A.2B). Therefore, mBFP was found to be a quantitative reporter of the in vivo NADPH regeneration rate.

We then modified the mBFP expression plasmid to ensure that NADPH availability was the rate-limiting factor that controls mBFP fluorescence across a larger dynamic range. We first replaced the IPTG-inducible promoter controlling mBFP expression with a constitutive promoter. We then replaced its ribosome binding site sequence with a rationally designed one to substantially increase its translation rate. The resulting plasmid (pCN-mBFP) constitutively expresses mBFP with a very high expression level (Section 2.2.6).

We then employed the pCN-mBFP plasmid to measure the in vivo NADPH regeneration rate of the ED1.0 strain, compared to its parent EcNR2 strain. Transformed strains were grown in long-time cultures maintained in the exponential growth phase using M9 minimal media. mBFP fluorescence was monitored continuously. We found that the mBFP fluorescence production rate for the parent strain EcNR2 was relatively constant regardless of the addition of IPTG. In contrast, the mBFP fluorescence production rate for the strain ED1.0 was substantially higher and was further increased when adding IPTG to induce the ED pathway's expression. The highest mBFP fluorescence production rate was observed at an intermediate IPTG concentration (25 µM) and there was no statistically significant increase in mBFP production rate when additional IPTG was added (p-values > 0.1 for all pair-wise two-sample t-tests between 25 µM and 0.5 mM IPTG) (Figure 2.3B). In the absence of IPTG, the observed level of mBFP production could be explained by transcriptional leakiness of the Ptac promoter. Based on these results, the synthetic Entner-Doudoroff pathway in strain ED1.0 improves the NADPH regeneration rate by 4.8-fold when induced by 25 µM IPTG, compared to its parent strain. We expected that optimization of the ED pathway would be necessary to further increase its activity.

Figure 2.3. Characterization of ED1.0. (A) The NADPH/NADP+ redox ratios were measured for strains EcNR2 and ED1.0. Expression of the ED pathway was induced using 0.5 mM IPTG. ED1.0 had a significantly higher NADPH/NADP+ ratio than its parent strain (two-tailed, two-sample t-test, p-value = 0.037). Values and error bars represent the averages and s.d. of the ratios for four replicates. NADPH and NADP+ concentrations (μmol/g DCW) for EcNR2 and ED1.0 are also reported (mean ± s.d. for n = 4). (B) The mBFP fluorescence production rates were measured for strains (blue bar) EcNR2 and (red bar) ED1.0. Increasing amounts of IPTG were added in separate experiments. The expression of the ED pathway in ED1.0 is controlled by an IPTG-inducible Ptac promoter. The pound sign (#) indicates a near-background mBFP production rate for EcNR2 strain at 0 mM IPTG. Values and error bars represent the means and s.d. of two replicates. *P < 0.1; **P < 0.05.

Efficient search for improved ED pathway variants in a 5-dimensional expression space

We initially designed the synthetic operons in ED1.0 to express all the ED pathway's enzymes with translation initiation rates of about 1,000 au on RBS Calculator proportional scale. The initial values were chosen to match the typical translation initiation rates controlling the expression of enzymes found in the native glycolysis, PPP, and ED pathways of E. coli MG1655. We anticipated that our initial translation rate guesses may not lead to the highest possible NADPH regeneration rate, and during the initial design of the synthetic operons, employed our RBS Library Calculator algorithm to design ribosome binding site sequences that could be easily mutagenized to provide large changes in translation initiation rate.

Specifically, the RBS Library Calculator in Genome Editing mode was applied to design 16-variant RBS libraries that systematically varied the translation rates for each enzyme coding sequence from about 10 to 900,000 au on the RBS Calculator proportional scale [39] (Table A.3). The resulting optimized RBS libraries contained a small number of nearby degenerate nucleotides that could be readily incorporated into the E. coli genome using a site-directed genome mutagenesis technique. A key advantage of this approach is the compactness of the resulting combinatorial RBS library and the broad coverage of the 5-dimensional expression space. Complete combinatorial incorporation of the five 16-variant RBS libraries will create a library of 16^5 (1,048,576) variants that will uniformly sample a 5-dimensional hypercube with side lengths that span at least a 10,000-fold change in expression (Figure 2.4).
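The size of this combinatorial space follows directly from the numbers above, as the short sketch below shows. The evenly log-spaced grid of translation rates is an idealization assumed only for this example; the actual designed libraries (Table A.3) need not be spaced this way.

```python
# Library size for five genes with 16 RBS variants each.
n_variants_per_gene = 16
n_genes = 5
library_size = n_variants_per_gene ** n_genes

# Idealized, evenly log-spaced translation initiation rates between the
# approximate bounds quoted in the text (assumption for illustration).
low_au = 10.0
high_au = 900_000.0
fold_range = high_au / low_au
step = fold_range ** (1.0 / (n_variants_per_gene - 1))
levels = [low_au * step ** i for i in range(n_variants_per_gene)]

print(library_size)          # number of distinct five-gene combinations
print(round(fold_range))     # fold change spanned along each dimension
print(round(step, 2))        # fold change between neighboring levels
```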

We then applied oligo-mediated allelic replacement, also known as MAGE [40], to incorporate these ribosome binding site mutations directly into the E. coli genome. According to an allelic replacement efficiency calculation [46], we estimated that 40 MAGE cycles were required to generate 14% of genomes with at least 4 out of 5 RBS mutations and 2.3% of genomes with all 5 RBSs mutated. Over a span of 12 days, 40 MAGE cycles were conducted using mutagenic oligonucleotides that correspond to the optimized RBS library sequences (Section 2.3). Bulk sequencing of genome pools after the 12th, 30th, and 40th cycles revealed that RBS sequences became increasingly mutated at specifically targeted nucleotide positions. We then characterized the pool of genome variants from the 40th cycle of MAGE.
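To give a feel for how such an estimate behaves, the sketch below uses a deliberately simplified model in which each of the five sites is converted independently with the same per-cycle efficiency. This is not the calculation of [46], and the per-cycle efficiency used here is a hypothetical placeholder, so the printed values are not expected to reproduce the 14% and 2.3% figures quoted above.

```python
from math import comb

def site_conversion_probability(per_cycle_efficiency, n_cycles):
    """Probability that one targeted site has been mutated after n_cycles."""
    return 1.0 - (1.0 - per_cycle_efficiency) ** n_cycles

def prob_at_least(k, n_sites, p_site):
    """Binomial probability that at least k of n_sites independent sites are mutated."""
    return sum(
        comb(n_sites, j) * p_site ** j * (1.0 - p_site) ** (n_sites - j)
        for j in range(k, n_sites + 1)
    )

PER_CYCLE_EFFICIENCY = 0.02  # hypothetical per-site, per-cycle efficiency
N_CYCLES = 40
N_SITES = 5

p_site = site_conversion_probability(PER_CYCLE_EFFICIENCY, N_CYCLES)
p_all_five = prob_at_least(5, N_SITES, p_site)
p_four_or_more = prob_at_least(4, N_SITES, p_site)

print(f"per-site conversion after {N_CYCLES} cycles: {p_site:.3f}")
print(f"all five sites mutated: {p_all_five:.4f}")
print(f"at least four sites mutated: {p_four_or_more:.4f}")
```

Under this simplification, the fraction of fully mutated genomes scales as the fifth power of the per-site conversion probability, which is why many cycles are needed before a meaningful share of the pool carries mutations at most or all sites.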

We transformed the pool of ED-expressing genome variants with pCN-mBFP, our constitutively over-expressed mBFP plasmid, and then isolated 624 single colonies. We cultured them using M9 minimal media supplemented with 0.4% w/v glucose and 1 mM IPTG, and measured their mBFP fluorescence production rates. As controls, we also characterized the mBFP fluorescence production rates of strain ED1.0 and parent strain EcNR2 in each set of measurements. When varying their RBS sequences and enzyme expression levels, we expected that ED-expressing genome variants would have different NADPH regeneration rates and growth rates, due to changes in pathway flux and the accumulation of toxic intermediates (e.g. KDPG). 237 ED-expressing genome variants (38%) exhibited poor growth after induction of the ED pathway, indicating that some combinations of RBS sequences and enzyme expression levels resulted in imbalanced pathways. The remaining 387 ED-expressing genome variants displayed varied mBFP fluorescence production rates across a 710-fold range (from 0.045-fold to 32.12-fold), indicating that these optimized RBS sequence mutations greatly affected the ED pathway's enzyme expression levels, and correspondingly, its overall NADPH regeneration rate (Figure 2.5). 336 of these variants had lower mBFP-linked NADPH regeneration rates than the average of ED1.0, suggesting that our initially selected translation rates of 1,000 au were a suitable initial condition. Interestingly, there was no observed correlation between the growth rate of a genome variant and its mBFP-linked NADPH regeneration rate (R^2 = 0.04).

Next, we selected 22 ED-expressing genome variants, including 16 high mBFP-producing variants and 6 low mBFP-producing variants, for sequencing, re-integration into fresh E. coli EcNR2 genomes, and further characterization. We found that all 22 variants contained unique combinations of RBS sequences (Figure 2.6). 7 variants had four out of five modified RBS sequences. 4 variants had three modified RBSs, 6 variants had two modified RBSs, and 5 variants had a single modified RBS. None of the selected genome variants had all five of their RBS sequences modified. The RBS controlling Zm-pgl was modified in 73% of the selected genome variants, while only 36% of genome variants had RBS modifications controlling Zm-zwf expression. Otherwise, 41%, 50%, and 62% of variants had RBS modifications controlling Zm-eda, Zm-edd, and Zm-pgi expression. In one variant (ED9), there was a spontaneous, non-designed mutation to the RBS controlling Zm-pgi expression. The translation initiation rates for all RBS modifications were calculated by the RBS Calculator's biophysical model (Figure 2.6). Notably, the ED3 genome variant contained the most highly translated RBS for Zm-zwf, but also a frame-shift mutation in the protein coding sequence that abrogated its expression, indicating that Zm-zwf overexpression was highly toxic.

We re-characterized the mBFP fluorescence production rates of the 22 selected ED-expressing genome variants and ranked them (Figure 2.6, Table A.4). Strain ED2 consistently ranked first among the selected variants, with a 25-fold higher mBFP fluorescence production rate than the parent strain EcNR2. Notably, its translation rate profile is not significantly different from that of ED1.0, which ranked within the top three. The differences between ED2 and ED1.0 were a 3.7-fold higher translation rate for Zm-eda and a 10-fold lower translation rate for Zm-pgl, indicating that small changes in translation rate, and correspondingly enzyme expression level, can have a beneficial effect on NADPH regeneration rate. However, there remain ED-expressing genome variants in the larger pool with even higher mBFP fluorescence production rates and NADPH regeneration rates.

We examined the relationship between translation rates, enzyme expression levels, and NADPH regeneration rate for these 22 ED-expressing genome variants. Qualitatively, we note that higher enzyme expression levels did not always yield higher NADPH regeneration rates. However, in contrast to our previous study, in which we employed mass action kinetics to formulate such a quantitative relationship, we did not observe a quantitative pattern. Several factors could confound this analysis, including the activity of endogenous enzymes, competition for transcriptional or ribosomal resources, allosteric feedback control, insufficient sampling of the high-dimensional expression space, and variations in day-to-day measurements.

Figure 2.4. Uniform sampling of the 5-dimensional expression space. (A) For each enzyme, a degenerate RBS (dRBS) encoding a 16-variant RBS library was designed with the RBS Library Calculator. With just four degenerate nucleotides in the RBS sequence of Zm-zwf, the predicted translation rate can be varied uniformly across a 4500-fold range. (B) (Black dots) The predicted translation initiation rates for all possible combinations of the five optimized RBS libraries, showing two-dimensional slices of the 5-dimensional space. (Red circles) The predicted translation rates for the 22 selected ED-expressing genome variants and (blue squares) the initial ED1.0 strain, showing the sampling of the space after 40 cycles of MAGE genome engineering. Translation rates were predicted using the RBS Calculator v1.1.

Figure 2.5. Characterization of ED-expressing genome variants. (A) The normalized mBFP fluorescence production rates of 387 genome variants and 22 control strains (including 7 ED1.0 mBFP, 10 WT mBFP and 5 no mBFP control strains) were measured and ranked. Normalized mBFP fluorescence production rates were calculated by dividing all measurements by the average mBFP fluorescence production rate for the EcNR2 strain, which was 28.2 ± 9.4 au/OD600/h. Boxplots represent the ranking distribution of the ED1.0 mBFP, WT mBFP and no mBFP control strains. (B) The growth rates of each variant were measured and appear in the same order as in A. Bar colors are (blue) ED-expressing genome variants, (green) parent strain EcNR2, (red) strain ED1.0, and (grey) negative controls EcNR2 and ED1.0 without mBFP overexpression.

Figure 2.6. The effects of changing ED enzyme expression levels on NADPH regeneration rates. (A) The normalized mBFP fluorescence production rates were measured and ranked for 22 selected ED-expressing genome variants that were re-integrated into the genome of parent strain EcNR2. Values and error bars represent the means and s.d. of 2-12 replicates. (B) The translation initiation rates controlling the expression of the five ED enzymes are shown, comparing the (red) 22 modified ED variants to (blue) ED1.0. Translation initiation rates were predicted using the RBS Calculator v1.1.

Improving terpenoid biosynthesis using a synthetic Entner-Doudoroff pathway

In bacteria, the methyl erythritol phosphate (MEP) pathway synthesizes the terpenoid precursors isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP), consuming equimolar amounts of glyceraldehyde-3-phosphate (G3P) and pyruvate [47]. In E. coli, the enzymes within the MEP pathway primarily use NADPH as the source of reducing equivalents [48] (Table A.5), and the reaction catalyzed by the first enzyme in the MEP pathway, dxs, is a key rate-limiting step to precursor biosynthesis [40, 49-51]. Interestingly, the Entner-Doudoroff pathway synthesizes both the carbon and redox precursors to terpenoid biosynthesis [29, 52] (Figure 2.7A). As a consequence, we expected that a highly active ED pathway would increase the rate of terpenoid biosynthesis, particularly when the activities of the MEP and downstream terpenoid biosynthesis pathways were optimized such that NADPH availability became a greater rate-limiting factor. To test this hypothesis, we took advantage of our previous work that systematically optimized a carotenoid biosynthesis pathway from Rhodobacter sphaeroides, expressed within an engineered strain that had a substantially higher MEP pathway flux [39].

First, we transformed selected ED-expressing genome variants with a crtEBI-expressing plasmid, which produces the carotenoid neurosporene under control of an IPTG-inducible PlacO1 promoter. Cultures were grown in 2X M9 minimal media with 0.4% w/v glucose and 0.2 mM IPTG in short (10 hour) batches, followed by hot acetone extraction and measurement of their neurosporene content and dry cell mass. The strains ED13 and ED1.0 produced 79% and 43% higher neurosporene titers, respectively, compared to the parent EcNR2 strain (Figure 2.7B). To compare the use of the ED pathway to alternative approaches to increasing NADPH regeneration, we knocked out expression of pgi in the parent strain EcNR2 and found that it increased neurosporene production by 24% using the same growth conditions and measurements. In contrast, the strains ED2, ED5, and ED11 produced about the same amount of neurosporene as their parent EcNR2 strain (4% to 15% lower).

After these measurements, we questioned whether the relatively low neurosporene production rate was creating a sufficiently large burden on NADPH availability. For NADPH regeneration to become a rate-limiting step in neurosporene production, neurosporene must be produced at a high rate that creates a high demand for NADPH. We therefore selected a previously engineered strain, EcIF15, and transformed it with a plasmid that expresses the neurosporene biosynthesis pathway, now under control of an arabinose-inducible PBAD promoter, to enable orthogonal transcriptional control over both the ED and crtEBI pathways. EcIF15 was previously engineered to significantly over-express 1-deoxy-D-xylulose-5-phosphate synthase (dxs) and increase the biosynthesis of the precursors IPP and DMAPP [39]. As a result, with 10 mM arabinose induction, it produces 791% and 843% more neurosporene, compared to control strains E. coli MG1655 and EcNR2, respectively (Figure 2.7B). We also characterized the PBAD-crtEBI pathway in an EcIF15 strain expressing ED17’s ED-tetAR to confirm that the large titer change was not a result of swapping promoters (Table A.6).

We next introduced ED pathway variants into the EcIF15 strain and evaluated the synergistic effects of an increased NADPH regeneration rate together with an increased precursor biosynthesis rate. In contrast to the EcNR2 strain, the expression of any of the selected ED pathway variants (ED1.0, ED2, ED11, ED13, and ED17) in EcIF15 substantially increased neurosporene production. The variant ED11 yielded the highest improvement with a 1336% increase, compared to EcNR2, and a 97% increase, compared to the parent strain EcIF15 (Figure 2.7B). Interestingly, ED11 has higher translation rates than ED1.0 for three genes: Zm-zwf, Zm-edd and Zm-eda (Figure 2.7B). Based on these results, when the bottleneck in the MEP pathway was eliminated, expression of the synthetic ED pathway was able to produce more precursors and regenerate more NADPH, leading to large improvements in neurosporene production.
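The titers in this section are reported as percent increases over different reference strains. Converting each to a fold-change over its stated reference can make the comparisons easier to read. The sketch below performs only that conversion, using the percentages quoted in the text; the helper name is illustrative and not part of the original analysis.

```python
def percent_increase_to_fold(percent_increase):
    """Convert 'X% more than the reference' into a fold-change over that reference."""
    return 1.0 + percent_increase / 100.0


# (strain, reference strain, reported percent increase), as quoted in the text
reported = [
    ("EcIF15", "MG1655", 791),
    ("EcIF15", "EcNR2", 843),
    ("EcIF15 + ED11", "EcNR2", 1336),
    ("EcIF15 + ED11", "EcIF15", 97),
]

for strain, reference, pct in reported:
    fold = percent_increase_to_fold(pct)
    print(f"{strain} vs {reference}: +{pct}% = {fold:.2f}-fold")
```

Each fold-change is relative to its own reference measurement, so values from different comparisons should not be chained together without the underlying data.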

Figure 2.7. Improving terpenoid biosynthesis using the synthetic Entner-Doudoroff pathway. (A) The metabolic reactions that convert glucose 6-phosphate into the carotenoid neurosporene via the Entner-Doudoroff, MEP, and carotenoid biosynthesis pathways. The enzymes crtEBI from Rhodobacter sphaeroides produce the brown pigment neurosporene. (B) Neurosporene production was measured when combining selected ED-expressing genome variants and control strains with expression of the crtEBI operon. ED pathways were integrated into either EcNR2 or EcIF15. The strain EcIF15 has an engineered RBS that substantially increases dxs expression. Values and error bars represent the means and s.d. of 2 to 7 replicates.

2.4. Discussion

In the field of metabolic engineering, increasing the availability of NADPH has been a significant challenge, driven by the need to supply greater amounts of reducing equivalents towards the over-production of a wide range of chemical products. To solve this challenge, previous efforts have deleted or over-expressed selected genes, such as those encoding oxidoreductases, transhydrogenases, and NAD kinases [4, 7-9, 20, 21, 53-55]. In a recent effort, the E. coli NAD+-dependent glyceraldehyde-3-phosphate dehydrogenase (GAPDH) encoded by the gapA gene was replaced with a Clostridium acetobutylicum gapC gene encoding an NADP+-dependent GAPDH [14]. Another recent study also showed that replacing the promoters of the E. coli edd-eda operon and zwf gene with a constitutive promoter and a strong promoter, respectively, increased the intracellular NADPH/NADP+ ratio [56].

In this study, we engineered a synthetic version of the Entner-Doudoroff (ED) pathway to rapidly regenerate NADPH. The pathway combines five enzymes from Zymomonas mobilis, expressed together within two synthetic bacterial operons that were rationally designed to achieve maximum expression control. Starting from the first version of the pathway, we then carried out systematic optimization of the enzymes' expression levels to improve the pathway's activity, first employing a NADPH-dependent fluorescent protein reporter to measure NADPH regeneration rates, followed by measuring the ED pathway's effect on an NADPH-dependent terpenoid biosynthesis pathway. By combining MAGE genome mutagenesis with our RBS Library Calculator algorithm, we introduced targeted genome modifications to greatly vary the ED pathway's individual enzyme expression levels and to efficiently search its 5-dimensional expression space. In principle, continued MAGE cycling will generate up to a million genome variants, though it was only necessary to characterize a much smaller number to identify ED pathway variants with greatly improved NADPH regeneration rates. As a result, one of our ED pathway variants exhibited a 25-fold higher NADPH regeneration rate, as measured by the fluorescent protein reporter, and another variant increased terpenoid biosynthesis by 97%. The synthetic ED pathway exists as a drop-in module; in principle, it can be transferred and expressed in many different bacterial hosts to increase their NADPH regeneration rate and enhance the production of NADPH-dependent products.
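The "up to a million" figure follows from the library design: five RBS libraries of 16 variants each give 16^5 possible combinations. The sketch below makes this arithmetic explicit. It assumes that each of the four degenerate positions encodes two alternative nucleotides (so that 2^4 = 16), which is consistent with the 16-variant libraries described in Figure 2.4 but is an assumption about the degenerate code rather than a detail stated in the text. The example sequence is hypothetical and is not one of the RBS sequences used in the study.

```python
from itertools import product

# Standard IUPAC codes for single bases and two-fold degenerate bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

N_ENZYMES = 5  # Zm-zwf, Zm-pgl, Zm-edd, Zm-eda, Zm-pgi


def expand_degenerate(seq):
    """List every concrete sequence encoded by a degenerate sequence."""
    return ["".join(bases) for bases in product(*(IUPAC[base] for base in seq))]


# Hypothetical degenerate RBS fragment with four two-fold degenerate positions
# (R, W, Y, S); illustrative only, NOT a sequence from the study.
example_drbs = "AGRAGGWAYCS"

variants = expand_degenerate(example_drbs)
library_size = len(variants) ** N_ENZYMES

print(f"variants per RBS library: {len(variants)}")
print(f"combinations across {N_ENZYMES} enzymes: {library_size}")
```

Under this assumption the combinatorial space contains 16^5 = 1,048,576 genome variants, so the 624 colonies screened sample well under 0.1% of it.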

A novel and important aspect of our design approach was to commit, at an early stage, to integrating the ED pathway into the genome of our bacterial host before optimizing its expression levels. In comparison, pathway engineering efforts have traditionally relied upon multi-copy plasmids to over-express desired enzymes. Multi-copy plasmids can express more protein than expression cassettes inside genomes, but they require active selection (e.g. the addition of antibiotics) to maintain plasmid stability over long culture times, which is undesirable in industrial applications. When pathways are plasmid-encoded, any optimization of their expression levels may be problematic when the final version of the pathway must eventually be genome-integrated for industrial applications. Instead, the first version of our genome-integrated ED pathway (ED1.0) was remarkably stable over the course of 40 cycles of MAGE mutagenesis and during 2-day cultures. Taking advantage of genome engineering techniques, this design choice enabled us to rapidly insert, optimize, copy, and re-insert large genetic modules within and across genomes, while improving the modules' stabilities and maintaining the same copy number.

With our genome-centric strategy, we needed to ensure that our operons' transcription rates, translation rates, and mRNA stabilities could be sufficiently high to express high levels of enzyme even though the DNA copy number is very low (1 to 2 copies per cell). Our Operon Calculator applies several design criteria to ensure that the operons' mRNAs have fewer undesirable genetic elements and that the desired protein coding sequences have the potential to be translated at extremely high rates. Among our 22 sequenced ED pathway variants, Zm-pgl, Zm-edd, and Zm-zwf achieved extremely high translation initiation rates (500,000+ au on the RBS Calculator v1.1 proportional scale). If the ED enzymes had exhibited low turnover numbers, then such high translation rates would have been necessary to achieve high pathway activities.

In a previous study, applying a kinetic modeling formalism enabled us to determine a quantitative relationship between sequence, enzyme expression level, and pathway activity [39]. However, while our optimization strategy yielded highly productive pathway variants, we did not observe such a clear relationship for the synthetic ED pathway in the EcNR2 parent strain. Previous reports have shown that accumulation of KDPG is toxic [57, 58] and will therefore impact the organism's growth rate [59], which could confound any expected relationship. Notably, amongst the ED variants that both over-produced neurosporene and emitted large amounts of mBFP fluorescence, ED11 has a consistent increase in Zm-eda, Zm-edd, and Zm-zwf expression that should result in an overall increase in pathway flux, while ED2 has increased Zm-eda and decreased Zm-pgl expression, which should minimize the accumulation of KDPG. It appears that increased neurosporene production in EcIF15-derived strains also relies on a consistent flux through the entire ED pathway, including production of the MEP precursors pyruvate and G3P, and not only on the NADPH regeneration rate.

Finally, together with previous work [39, 49, 60-71], this study highlights the common challenges of performing expression optimization on multi-enzyme pathways in high-dimensional expression spaces. For enzymes with low turnover numbers, large changes in enzyme expression are needed to observe a significant change in pathway activity. In contrast, when enzymes have higher turnover numbers, as in the case of the ED pathway, smaller expression level changes can have a significant impact. Multiple enzymes work together to control a pathway's overall activity. As we observed in this study, synergistic changes in the individual enzyme expression levels are needed to create a more active and balanced pathway. Expression imbalances can affect growth rate, due to the accumulation of toxic intermediates.

Further, to avoid exhaustively searching high-dimensional expression spaces, new modeling formalisms will be needed to convert large sets of sequences and measurements into accurately predicted optimal expression levels. These challenges will continue as the field designs and optimizes longer pathways. New approaches will also be needed to combine pathway modules with predictably matched pathway fluxes, tailored to the organism's existing metabolic network. The success of these approaches together with advanced genome engineering techniques will accelerate the development of extremely large genetic systems, encoded within genomes, capable of radically redirecting carbon, energy, redox, uptake, and transport fluxes towards the production and secretion of desired products.

2.5. References

1. Chemler JA, Fowler ZL, McHugh KP, Koffas MAG: Improving NADPH availability for natural product biosynthesis in Escherichia coli by metabolic engineering. Metabolic engineering 2010, 12:96-104.
2. Gunnarsson N, Eliasson A, Nielsen J: Control of Fluxes Towards Antibiotics and the Role of Primary Metabolism in Production of Antibiotics. Advances in Biochemical Engineering/Biotechnology 2004, 88:137-178.
3. Lee HC, Kim JS, Jang W, Kim SY: High NADPH/NADP+ ratio improves thymidine production by a metabolically engineered Escherichia coli strain. Journal of biotechnology 2010, 149:24-32.
4. Bastian S, Liu X, Meyerowitz JT, Snow CD, Chen MMY, Arnold FH: Engineered ketol-acid reductoisomerase and alcohol dehydrogenase enable anaerobic 2-methylpropan-1-ol production at theoretical yield in Escherichia coli. Metabolic engineering 2011, 13:345-352.
5. Shen CR, Liao JC: Synergy as design principle for metabolic engineering of 1-propanol production in Escherichia coli. Metabolic engineering 2013, 17:12-22.
6. Hong SH, Park SJ, Moon SY, Park JP, Lee SY: In silico prediction and validation of the importance of the Entner-Doudoroff pathway in poly(3-hydroxybutyrate) production by metabolically engineered Escherichia coli. Biotechnology and bioengineering 2003, 83:854-863.
7. Lim S-J, Jung Y-M, Shin H-D, Lee Y-H: Amplification of the NADPH-related genes zwf and gnd for the oddball biosynthesis of PHB in an E. coli transformant harboring a cloned phbCAB operon. Journal of bioscience and bioengineering 2002, 93:543-549.
8. Rathnasingh C, Raj SM, Lee Y, Catherine C, Ashok S, Park S: Production of 3-hydroxypropionic acid via malonyl-CoA pathway using recombinant Escherichia coli strains. Journal of biotechnology 2012, 157:633-640.
9. Sanchez AM, Andrews J, Hussein I, Bennett GN, San K-Y: Effect of overexpression of a soluble pyridine nucleotide transhydrogenase (UdhA) on the production of poly(3-hydroxybutyrate) in Escherichia coli. Biotechnology Progress 2006, 22:420-425.
10. Alberts B, Alexander J, Lewis J, Raff M, Roberts K, Walter P: Molecular biology of the cell. 2002.
11. Smolke C: The Metabolic Pathway Engineering Handbook: Fundamentals. 2009.
12. Fuhrer T, Fischer E, Sauer U: Experimental identification and quantification of glucose metabolism in seven bacterial species. Journal of bacteriology 2005, 187:1581-1590.

13. Kabus A, Georgi T, Wendisch VF, Bott M: Expression of the Escherichia coli pntAB genes encoding a membrane-bound transhydrogenase in Corynebacterium glutamicum improves L-lysine formation. Applied microbiology and biotechnology 2007, 75:47-53.
14. Martínez I, Zhu J, Lin H, Bennett GN, San K-Y: Replacing Escherichia coli NAD-dependent glyceraldehyde 3-phosphate dehydrogenase (GAPDH) with a NADP-dependent enzyme from Clostridium acetobutylicum facilitates NADPH dependent pathways. Metabolic engineering 2008, 10:352-359.
15. Siedler S, Bringer S, Blank LM, Bott M: Engineering yield and rate of reductive biotransformation in Escherichia coli by partial cyclization of the pentose phosphate pathway and PTS-independent glucose transport. Applied microbiology and biotechnology 2012, 93:1459-1467.
16. Walton AZ, Stewart JD: Understanding and improving NADPH-dependent reactions by nongrowing Escherichia coli cells. Biotechnology progress 2004, 20:403-411.
17. Sauer U, Canonaco F, Heri S, Perrenoud A, Fischer E: The soluble and membrane-bound transhydrogenases UdhA and PntAB have divergent functions in NADPH metabolism of Escherichia coli. The Journal of biological chemistry 2004, 279:6613-6619.
18. Chin JW, Khankal R, Monroe Ca, Maranas CD, Cirino PC: Analysis of NADPH supply during xylitol production by engineered Escherichia coli. Biotechnology and bioengineering 2009, 102:209-220.
19. Siedler S, Bringer S, Bott M: Increased NADPH availability in Escherichia coli: improvement of the product per glucose ratio in reductive whole-cell biotransformation. Applied microbiology and biotechnology 2011, 92:929-937.
20. Chin JW, Cirino PC: Improved NADPH supply for xylitol production by engineered Escherichia coli with glycolytic mutations. Biotechnology progress 2011, 27:333-341.
21. Lee W-H, Park J-B, Park K, Kim M-D, Seo J-H: Enhanced production of epsilon-caprolactone by overexpression of NADPH-regenerating glucose 6-phosphate dehydrogenase in recombinant Escherichia coli harboring cyclohexanone monooxygenase gene. Applied microbiology and biotechnology 2007, 76:329-338.
22. Vital-Lopez FG, Armaou A, Nikolaev EV, Maranas CD: A computational procedure for optimal engineering interventions using kinetic models of metabolism. Biotechnology progress 2006, 22:1507-1517.
23. Charusanti P, Conrad TM, Knight EM, Venkataraman K, Fong NL, Xie B, Gao Y, Palsson BØ: Genetic basis of growth adaptation of Escherichia coli after deletion of pgi, a major metabolic gene. PLoS genetics 2010, 6:e1001186.
24. Fong SS, Nanchen A, Palsson BO, Sauer U: Latent pathway activation and increased pathway capacity enable Escherichia coli adaptation to loss of key metabolic enzymes. The Journal of biological chemistry 2006, 281:8024-8033.

25. Flamholz A, Noor E, Bar-Even A, Liebermeister W, Milo R: Glycolytic strategy as a tradeoff between energy yield and protein cost. Proceedings of the National Academy of Sciences of the United States of America 2013, 110:10039-10044.
26. Conway T: The Entner-Doudoroff pathway: history, physiology and molecular biology. FEMS microbiology reviews 1992, 9:1-27.
27. Fuhrer T, Sauer U: Different biochemical mechanisms ensure network-wide balancing of reducing equivalents in microbial metabolism. Journal of bacteriology 2009, 191:2112-2121.
28. Jiao Z, Baba T, Mori H, Shimizu K: Analysis of metabolic and physiological responses to gnd knockout in Escherichia coli by using C-13 tracer experiment and enzyme activity measurement. FEMS microbiology letters 2003, 220:295-301.
29. Liu H, Sun Y, Ramos KRM, Nisola GM, Valdehuesa KNG, Lee W-K, Park SJ, Chung W-J: Combination of Entner-Doudoroff Pathway with MEP Increases Isoprene Production in Engineered Escherichia coli. Plos One 2013, 8:e83290.
30. Matsushita K, Arents JC, Bader R, Yamada M, Adachi O, Postma PW: Escherichia coli is unable to produce pyrroloquinoline quinone (PQQ). Microbiology 1997, 143:3149-3156.
31. Peekhaus N, Conway T: What's for dinner?: Entner-Doudoroff metabolism in Escherichia coli. Journal of bacteriology 1998, 180:3495-3502.
32. Zhao J, Baba T, Mori H, Shimizu K: Global metabolic response of Escherichia coli to gnd or zwf gene-knockout, based on 13C-labeling experiments and the measurement of enzyme activities. Applied microbiology and biotechnology 2004, 64:91-98.
33. Alper H, Stephanopoulos G: Engineering for biofuels: exploiting innate microbial capacity or importing biosynthetic potential? Nature Reviews Microbiology 2009, 7:715-723.
34. Martin VJJ, Pitera DJ, Withers ST, Newman JD, Keasling JD: Engineering a mevalonate pathway in Escherichia coli for production of terpenoids. Nature biotechnology 2003, 21:796-802.
35. Kalnenieks U, Pentjuss A, Rutkis R, Stalidzans E, Fell Da: Modeling of Zymomonas mobilis central metabolism for novel metabolic engineering strategies. Frontiers in microbiology 2014, 5:42.
36. Sprenger G: Carbohydrate metabolism in Zymomonas mobilis: a catabolic highway with some scenic routes. FEMS microbiology letters 1996, 145:301-307.
37. Conway T, Fliege R, Jones-Kilpatrick D, Liu J, Barnell WO, Egan SE: Cloning, characterization and expression of the Zymomonas mobilis eda gene that encodes 2-keto-3-deoxy-6-phosphogluconate aldolase of the Entner-Doudoroff pathway. Molecular microbiology 1991, 5:2901-2911.
38. Kalnenieks U: Physiology of Zymomonas mobilis: some unanswered questions. Advances in microbial physiology 2006, 51:73-117.
39. Farasat I, Kushwaha M, Collens J, Easterbrook M, Guido M, Salis HM: Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Molecular systems biology 2014, 10:731.

40. Wang HH, Isaacs FJ, Carr Pa, Sun ZZ, Xu G, Forest CR, Church GM: Programming cells by multiplex genome engineering and accelerated evolution. Nature 2009, 460:894-898.
41. Hwang C-S, Choi E-S, Han S-S, Kim G-J: Screening of a highly soluble and oxygen-independent blue fluorescent protein from metagenome. Biochemical and Biophysical Research Communications 2012, 419:676-681.
42. Espah Borujeni A, Channarasappa AS, Salis HM: Translation rate is controlled by coupled trade-offs between site accessibility, selective RNA unfolding and sliding at upstream standby sites. Nucleic acids research 2014, 42:2646-2659.
43. Li G-W, Oh E, Weissman JS: The anti-Shine–Dalgarno sequence drives translational pausing and codon choice in bacteria. Nature 2012, 484:538-541.
44. Gibson DG, Young L, Chuang R-Y, Venter JC, Hutchison CA, Smith HO: Enzymatic assembly of DNA molecules up to several hundred kilobases. Nature methods 2009, 6:343-345.
45. Podolsky T, Fong ST, Lee BT: Direct selection of tetracycline-sensitive Escherichia coli cells using nickel salts. Plasmid 1996, 36:112-115.
46. Wang HH, Church GM: Multiplexed genome engineering and genotyping methods applications for synthetic biology and metabolic engineering. Methods in enzymology 2011, 498:409-426.
47. Farmer WR, Liao JC: Precursor balancing for metabolic engineering of lycopene production in Escherichia coli. Biotechnology progress 2001, 17:57-61.
48. Alper H, Jin Y-S, Moxley JF, Stephanopoulos G: Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli. Metabolic engineering 2005, 7:155-164.
49. Ajikumar PK, Xiao W-h, Tyo KEJ, Wang Y, Simeon F, Leonard E, Mucha O, Phon TH, Pfeifer B, Stephanopoulos G: Isoprenoid pathway optimization for Taxol precursor overproduction in Escherichia coli. Science (New York, NY) 2010, 330:70-74.
50. Yang J, Guo L: Biosynthesis of β-carotene in engineered E. coli using the MEP and MVA pathways. Microbial cell factories 2014, 13:160.
51. Yuan LZ, Rouvière PE, Larossa Ra, Suh W: Chromosomal promoter replacement of the isoprenoid pathway for enhancing carotenoid production in E. coli. Metabolic engineering 2006, 8:79-90.
52. Liu H, Wang Y, Tang Q, Kong W, Chung W-J, Lu T: MEP Pathway-mediated isopentenol production in metabolically engineered Escherichia coli. Microbial cell factories 2014, 13:135.
53. Lee WH, Kim JW, Park EH, Han NS, Kim MD, Seo JH: Effects of NADH kinase on NADPH-dependent biotransformation processes in Escherichia coli. Applied microbiology and biotechnology 2013, 97:1561-1569.
54. Lee W-H, Kim M-D, Jin Y-S, Seo J-H: Engineering of NADPH regenerators in Escherichia coli for enhanced biotransformation. Applied microbiology and biotechnology 2013.
55. Wang B, Wang P, Zheng E, Chen X, Zhao H, Song P, Su R, Li X, Zhu G: Biochemical properties and physiological roles of NADP-dependent malic enzyme in Escherichia coli. Journal of Microbiology 2011, 49:797-802.

56. Zhang Y, Lin Z, Liu Q, Li Y, Wang Z, Ma H: Engineering of Serine-Deamination pathway, Entner-Doudoroff pathway and pyruvate dehydrogenase complex to improve poly(3-hydroxybutyrate) production in Escherichia coli. 2014:1-11.
57. Barnell WO, Yi KC, Conway T: Sequence and genetic organization of a Zymomonas mobilis gene cluster that encodes several enzymes of glucose metabolism. Journal of bacteriology 1990, 172:7227-7240.
58. Egan SE, Fliege R, Tong S, Shibata A, Wolf RE, Conway T: Molecular characterization of the Entner-Doudoroff pathway in Escherichia coli: sequence analysis and localization of promoters for the edd-eda operon. Journal of bacteriology 1992, 174:4638-4646.
59. Fuhrman LK, Wanken A, Nickerson KW, Conway T: Rapid accumulation of intracellular 2-keto-3-deoxy-6-phosphogluconate in an Entner-Doudoroff aldolase mutant results in bacteriostasis. FEMS microbiology letters 1998, 159:261-266.
60. Alper H, Stephanopoulos G: Global transcription machinery engineering: a new approach for improving cellular phenotype. Metabolic engineering 2007, 9:258-267.
61. Biggs BW, De Paepe B, Santos CNS, De Mey M, Kumaran Ajikumar P: Multivariate modular metabolic engineering for pathway and strain optimization. Current opinion in biotechnology 2014, 29:156-162.
62. Du J, Yuan Y, Si T, Lian J, Zhao H: Customized optimization of metabolic pathways by combinatorial transcriptional engineering. 2012, 40:1-10.
63. Lee ME, Aswani A, Han AS, Tomlin CJ, Dueber JE: Expression-level optimization of a multi-enzyme pathway in the absence of a high-throughput assay. Nucleic acids research 2013:1-11.
64. Nowroozi FF, Baidoo EE, Ermakov S, Redding-Johanson AM, Batth TS, Petzold CJ, Keasling JD: Metabolic pathway optimization using ribosome binding site variants and combinatorial gene assembly. Appl Microbiol Biotechnol 2014, 98(4):1567-1581.
65. Oliver JW, Machado IM, Yoneda H, Atsumi S: Combinatorial optimization of cyanobacterial 2,3-butanediol production. Metabolic engineering 2014, 22:76-82.
66. Pfleger BF, Pitera DJ, Smolke CD, Keasling JD: Combinatorial engineering of intergenic regions in operons tunes expression of multiple genes. Nature biotechnology 2006, 24:1027-1032.
67. Smanski MJ, Bhatia S, Zhao D, Park Y, Woodruff LBA, Giannoukos G, Ciulla D, Busby M, Calderon J, Nicol R et al: Functional optimization of gene clusters by combinatorial design and assembly. Nature biotechnology 2014, 32.
68. Wu J, Du G, Zhou J, Chen J: Metabolic engineering of Escherichia coli for (2S)-pinocembrin production from glucose by a modular metabolic strategy. Metabolic engineering 2013, 16:48-55.
69. Xu P, Gu Q, Wang W, Wong L, Bower AG, Collins CH, Koffas MA: Modular optimization of multi-gene pathways for fatty acids production in E. coli. Nat Commun 2013, 4:1409.

70. Zelcbuch L, Antonovsky N, Bar-Even A, Levin-Karp A, Barenholz U, Dayagi M, Liebermeister W, Flamholz A, Noor E, Amram S et al: Spanning high-dimensional expression space using ribosome-binding site combinatorics. Nucleic acids research 2013, 41(9):e98.
71. Zhao J, Li Q, Sun T, Zhu X, Xu H, Tang J, Zhang X, Ma Y: Engineering central metabolic modules of Escherichia coli for improving β-carotene production. Metabolic engineering 2013, 17:42-50.

3. Chapter 3 THE PARETO OPTIMALITY EXPLANATION OF THE GLYCOLYTIC ALTERNATIVES IN NATURE

3.1. Introduction

Billions of years of evolution have led to highly genetically and phenotypically diverse organisms, yet most of them retain largely identical routes for sugar catabolism despite the myriad of ways in nature’s enzymatic repertoire for converting glucose to pyruvate [1]. Among them, the canonical Entner-Doudoroff (ED) and Embden-Meyerhof-Parnas (EMP) pathways are by far the most prevalent in nature. These two pathways differ in the first few enzymatic steps but share the remaining six. They both generate two moles of reduced redox cofactor NAD(P)H, which can be used for generating additional ATP through oxidative phosphorylation, but differ in the overall ATP yield. The ED pathway, often found in species living in carbon-, energy-, or oxygen-rich environments (e.g., Zymomonas mobilis, Acinetobacter sp. ADP1), sacrifices energy yield for a pathway with a larger driving force [1]. It was also recently found to be active in photosynthetic organisms, which typically rely on light rather than glucose as an ATP source [2]. In addition, the canonical ED pathway confers higher tolerance to oxidative stress as it generates NADPH, as opposed to the NADH generated by EMP [3, 4]. On the other hand, the EMP pathway is common in prokaryotes and eukaryotes with higher energy demands or those living in anoxic or low-energy environments [1]. The presence of the key EMP enzyme 6-phosphofructokinase (PFK) or the key ED enzymes 2-keto-3-deoxygluconate-6-phosphate (KDPG) aldolase and 6-phosphogluconate dehydratase is often used to identify whether a strain is capable of using either pathway [1, 2]. The two pathways are, however, not mutually exclusive and co-exist in many organisms [2, 5]. In particular, enteric bacteria such as Escherichia coli have been shown to switch between them in response to the availability of different substrates [5, 6].

Variants of the canonical ED and EMP pathways have also been discovered, especially in extremophiles [7]. Semi-phosphorylative and non-phosphorylative ED pathways were reported in anaerobic Clostridia and archaea, wherein the first ATP phosphorylation step is catalyzed by 2-dehydro-3-deoxy-D-gluconate (KDG) kinase or glycerate kinase, yielding one or zero ATP per glucose, respectively [8, 9]. Modified EMP pathways are found in (hyper)thermophilic archaea, which employ variants of glycolytic enzymes utilizing alternative cofactors, such as ADP-dependent glucokinase and PFK (in euryarchaeota), pyrophosphate (PPi)-dependent PFK (in Thermoproteus tenax [10]), non-phosphorylating glyceraldehyde-3-phosphate (GAP) dehydrogenase (in crenarchaeon Aeropyrum pernix and T. tenax) and GAP ferredoxin oxidoreductase (in microaerobe Pyrobaculum aerophilum) [8, 11, 12]. A recent study confirmed that Clostridium thermocellum operates a GTP and PPi-dependent glycolysis and predominantly employs a malate shunt to convert phosphoenolpyruvate (PEP) to pyruvate due to the absence of pyruvate kinase [13]. An additional ATP per glucose could potentially be gained by using a PPi-dependent PFK [9]; however, the source of PPi in C. thermocellum remains elusive [13]. Furthermore, a variant of EMP that relies on an NADP-dependent GAP dehydrogenase was found (e.g., encoded by the GDP1 gene of the eukaryote Kluyveromyces lactis) [14]. To improve NADPH availability in the model organism E. coli, the NADP-dependent C. acetobutylicum gapC was expressed heterologously to replace the native NAD-dependent GAP dehydrogenase [15].

While these pathways retain the structure of ED or EMP, other rare glucose processing pathways have been found in nature. The heterofermentative lactic acid bacterium Lactobacillus reuteri ATCC 55730 predominantly employs the phosphoketolase (PK) pathway, with the EMP pathway acting as a shunt for glycolysis, resulting in a lower energy yield [16]. Bifidobacteria utilize the bifid shunt for conversion of glucose to acetate: lactate: ATP in a 1.5: 1: 2.5 ratio [17]. A recent groundbreaking work by Bogorad et al. [18] designed and engineered the non-oxidative glycolysis (NOG), which operates cyclically to convert glucose to acetyl-CoA in a redox-independent and carbon neutral manner. The NOG pathway generates two ATPs and three acetate moieties per glucose molecule.

Previous studies have already tried to shed light on why the canonical glycolytic pathways are so uniquely prevalent despite the presence of alternative routes with often higher carbon and energy yields. A theoretical study by Melendez-Hevia et al. suggested that the canonical EMP pathway is an optimal series of chemical reactions that maximizes ATP yield at high kinetic efficiency [19]. By designing a series of shortest pathways connecting different pairs of central carbon metabolites using a set of 30 reaction rules (that act on carbohydrates), Noor et al. proposed that the canonical glycolytic pathway is the shortest pathway (in E. coli) that ensures the production of essential precursors of cellular biomass [20]. A recent biochemical analysis suggested that the glycolytic pathway is a result of a structural rearrangement of the substrate molecules to avoid toxic intermediates (e.g., methylglyoxal), along with phosphorylation of the intermediates to reduce metabolite leakage and to improve enzyme-substrate affinity [21]. Furthermore, a recent analysis put forth the hypothesis that the lower glycolytic pathway shared by both the ED and EMP pathways is able to sustain the highest flux when compared to other (in silico) synthetic alternatives [22]. Absolute quantitative measurement of intracellular metabolite concentrations and fluxes by Park et al. revealed that lower glycolysis has a higher overall driving force in terms of change in free energy (roughly six-fold higher) than currently stated in biochemistry textbooks [23].

Concomitant with the energy production and precursor synthesis hypotheses, recent studies have reaffirmed minimization of enzyme production as a key driver of optimizing resource allocation [24, 25]. For example, fast-growing cells have to invest more resources in the synthesis of growth-related proteins (i.e., proteins associated with translational and transcriptional machinery) [25]. Instead of simply making proportionally more protein to accommodate higher growth requirements, they often shift metabolism towards conversions requiring more modest catalytic resources per unit of growth at the expense of energy efficiency [25, 26]. Basan et al. verified that E. coli switches from respiration to fermentation (which requires less proteome) under high growth rates [27]. Cellular metabolism has been shaped by evolution to ensure that carbon catabolic pathways are carefully selected to be in tune with both growth rate requirements and resource availability. Optimal glycolytic pathways must thus be able to balance high ATP production capacity while generating important intermediates and redox molecules at minimal proteome cost. These requirements are in direct conflict with one another, requiring the establishment of Pareto optimal curves to decipher the relative "weights" between objectives that nature responds to when selecting different pathway designs.

In this study, we aim to systematically assess the relative importance between various objectives driving pathway selection by exhaustively generating 11,916 routes from glucose to pyruvate with varying energy production efficiency per mole of converted glucose and quantifying the corresponding total protein investment. The pathways were designed by combining annotated reactions from the entire chemical repertoire of organisms accessed from the KEGG [28] database using a modified implementation of the optStoic protocol [29] (Figure 3.1). The Gibbs free energy of all reactions at standard conditions (i.e., 25°C, pH 7, ionic strength of 0.1 M) was estimated using the Component Contribution method developed by Noor et al. [30]. Subsequently, the pathways were categorized by their net ATP yield, ranging from 1 to 5 moles of ATP per mole of glucose, and thereafter pruned to remove thermodynamically infeasible routes under physiologically relevant limits of intermediate metabolite concentrations. Minimal protein cost analysis on the feasible pathways revealed that the native ED and EMP pathways are indeed among the most protein cost-efficient pathways at physiological metabolite concentration ranges. Pathways with higher ATP yields were also identified (up to 5 ATP per glucose molecule). Driving thermodynamics closer to the limit (i.e., equilibrium) lowers the overall thermodynamic driving force, thereby demanding a higher protein cost to drive the same amount of flux through the pathway. High ATP yield pathways are also less tolerant to changes in metabolite concentration ranges and ATP/ADP ratios, which could explain why they are not commonly found in nature.

Figure 3.1. Schematic overview of the workflow for design and analysis of glycolytic pathways. (A) The reaction database (DB v1) obtained from the previous study [29] was curated and updated to DB v2. (B) Pathways generating 1 to 5 ATP were designed using the modified optStoic/minFlux procedure. Visualization of the designed pathways was automated (see Methods). (C) The thermodynamic feasibility of each pathway under physiological metabolite concentration ranges was assessed. (D) The minimal protein cost for operating each thermodynamically feasible pathway was predicted using the method developed by Flamholz et al. [1].

3.2. Methods

Update of the optStoic reaction database

The reaction database for the optStoic procedure [29] was curated to ensure that all reactions are elementally (i.e., C, O, N, P and S) balanced and updated with new reactions from the KEGG database [28]. The updated optStoic reaction database contains a total of 7,164 reactions and 5,969 metabolites.

Reactions that are incomplete, i.e., elementally imbalanced or containing generic stoichiometric coefficients (e.g., R05327: n C00043 + n C00167 <=> C00518 + 2n C00015), were removed. The standard transformed Gibbs free energy of all reactions ($\Delta_r G'^{\circ}_j$) at pH 7, 25°C and ionic strength of 0.1 M was estimated using the Component Contribution method [30]. For metabolites that were not present in the Component Contribution Python package, the chemical structure (i.e., Molfile) was retrieved from KEGG and converted into InChI using Open Babel [31] or ChemAxon (Marvin 16.7.18, 2016, ChemAxon (http://www.chemaxon.com)). The Gibbs free energy of formation of these compounds was then calculated using the same method and added to the database. The reaction directionality was determined by assessing whether the sign of the free energy change is affected when either the products or the reactants are depleted [32].

Step 1: $\Delta_r G'_{j,min} = \Delta_r G'^{\circ}_j + RT \ln Q_{min}$ is calculated at substrate concentrations of 0.1 M and product concentrations of 1 µM,

Step 2: $\Delta_r G'_{j,max} = \Delta_r G'^{\circ}_j + RT \ln Q_{max}$ is calculated at substrate concentrations of 1 µM and product concentrations of 0.1 M,

Step 3: A reaction is deemed (i) irreversible in the forward direction if both $\Delta_r G'_{j,min} < 0$ and $\Delta_r G'_{j,max} < 0$, (ii) irreversible in the reverse direction if both $\Delta_r G'_{j,min} > 0$ and $\Delta_r G'_{j,max} > 0$, and (iii) reversible if $\Delta_r G'_{j,min} < 0$ and $\Delta_r G'_{j,max} > 0$. If $\Delta_r G'^{\circ}_j$ cannot be approximated (e.g., due to the absence of standard Gibbs free energy of formation for at least one of the reactants), then the reaction is assumed to be reversible.
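The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the curation script used in this work: the function name `classify_direction` is hypothetical, and it assumes unit stoichiometric coefficients with all substrates (or all products) held at the same bounding concentration.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol*K)
T = 298.15    # 25 degrees C in K


def classify_direction(dG0, n_substrates, n_products, c_high=0.1, c_low=1e-6):
    """Classify a reaction from its standard free energy dG0 (kJ/mol).

    Assumption for illustration: every stoichiometric coefficient is 1.
    Pass dG0=None when the standard free energy cannot be estimated.
    """
    if dG0 is None:
        return "reversible"
    # Step 1: substrates at 0.1 M, products at 1 uM (most favorable case)
    ln_q_min = n_products * math.log(c_low) - n_substrates * math.log(c_high)
    # Step 2: substrates at 1 uM, products at 0.1 M (least favorable case)
    ln_q_max = n_products * math.log(c_high) - n_substrates * math.log(c_low)
    dG_min = dG0 + R * T * ln_q_min
    dG_max = dG0 + R * T * ln_q_max
    # Step 3: compare the signs at both extremes
    if dG_min < 0 and dG_max < 0:
        return "forward"
    if dG_min > 0 and dG_max > 0:
        return "reverse"
    return "reversible"
```

For a one-substrate, one-product reaction, the concentration term spans roughly ±28.5 kJ/mol, so only reactions whose standard free energy lies outside that window are classified as irreversible in this sketch.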

The directionality of 204 reactions, particularly those involving ATP, was manually curated [29]. Consequently, the updated database contains a total of 5,014 reversible reactions (out of which 1,898 reactions have undefined $\Delta_r G'^{\circ}_j$) and 2,150 irreversible reactions.

Designing pathways using the modified optStoic procedure

The overall stoichiometry of glycolysis allowing for a varying amount of produced ATP moles (i.e., n) per glucose mole is given by:

Glucose + 2 NAD(P)+ + n ADP + n Phosphate = 2 Pyruvate + 2 NAD(P)H + n ATP + n H2O + (4 – n) H+    (1)

This equation is henceforth denoted as the overall stoichiometry. We use the minFlux mixed-integer linear programming (MILP) formulation from the optStoic procedure [29] to identify the set of reactions that conform to the above glycolytic stoichiometry. The minFlux formulation is given by

$$\text{minimize} \quad \sum_{j \in \mathbf{J} \setminus \mathbf{J}_{exchange}} |v_j| \qquad \text{(minFlux)} \tag{2}$$

subject to

$$\sum_{j \in \mathbf{J}} S_{ij} v_j = 0, \quad \forall i \in \mathbf{I} \tag{3}$$

$$v_i^{EX} = q_i, \quad \forall i \in \mathbf{I}_{stoich} \tag{4}$$

$$LB_j \leq v_j \leq UB_j, \quad \forall j \in \mathbf{J} \tag{5}$$

$$v_j \in \mathbb{Z} \text{ or } \mathbb{R}, \quad v_i^{EX} \in \mathbb{Z} \text{ or } \mathbb{R}$$

where sets $\mathbf{I}$ and $\mathbf{J}$ represent metabolites and reactions, respectively. $\mathbf{I}_{stoich}$ is the set of metabolites that participate in the design equation. $S_{ij}$ is the stoichiometric matrix with each row representing a metabolite $i$ and each column representing a reaction $j$, $v_j$ is the flux of reaction $j$, and $v_i^{EX}$ is the flux of the exchange reaction for metabolite $i$. The set of all exchange reactions $v_i^{EX}$ is declared as $\mathbf{J}_{exchange}$. Both $v_j$ and $v_i^{EX}$ can either be integers ($\mathbb{Z}$) or real numbers ($\mathbb{R}$). Parameter $q_i$ is the stoichiometric coefficient of metabolite $i$ in the design equation. $LB_j$ and $UB_j$ indicate the lower and upper bounds on the flux of the respective reaction $j$. The objective function (equation (2)) ensures that the sum of the absolute flux through the entire network of reactions is minimized. Constraint (3) ensures stoichiometric (mass) balance for all metabolites $i$ in the network. Constraint (4) enforces that the flux through the exchange reaction of each metabolite given in the design equation equals its stoichiometric coefficient. Constraint (5) imposes upper and lower bounds on $v_j$ based on the reaction directionality as discussed in the previous section.
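To make the minFlux idea concrete, the sketch below solves a three-metabolite toy problem by exhaustive search over integer fluxes instead of calling an MILP solver. The network, the reaction names and the function `min_flux` are all hypothetical and are chosen only so the example runs with the Python standard library; exchange fluxes are folded into the balance as the right-hand side $q_i$.

```python
from itertools import product

metabolites = ["A", "B", "C"]

# Internal reactions: name -> (stoichiometry, lower bound, upper bound).
# The bounds encode directionality, as in constraint (5).
reactions = {
    "R1": ({"A": -1, "B": 1}, 0, 2),   # A -> B, irreversible
    "R2": ({"B": -1, "C": 1}, -2, 2),  # B <-> C, reversible
    "R3": ({"A": -1, "C": 1}, 0, 2),   # A -> C, irreversible shortcut
}

# Design equation A -> C: q_i is the net production of each metabolite.
design = {"A": -1, "C": 1}


def min_flux(reactions, metabolites, design):
    """Return the balanced integer flux vector with the smallest sum of |v_j|."""
    names = list(reactions)
    ranges = [range(reactions[n][1], reactions[n][2] + 1) for n in names]
    best, best_total = None, None
    for v in product(*ranges):
        # Mass balance: internal production must match the design coefficient.
        balanced = all(
            sum(reactions[n][0].get(m, 0) * vj for n, vj in zip(names, v))
            == design.get(m, 0)
            for m in metabolites
        )
        if not balanced:
            continue
        total = sum(abs(vj) for vj in v)  # minFlux objective
        if best_total is None or total < best_total:
            best, best_total = dict(zip(names, v)), total
    return best, best_total


best, total = min_flux(reactions, metabolites, design)
```

Two balanced solutions exist in this toy network (the direct step R3, or R1 followed by R2), and the objective selects the one carrying less total flux. Exhaustive search is only viable for such tiny examples; the database-scale problem requires the MILP formulation described above.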

The above formulation was recast as a combination of linear relations by introducing two non-negative real number (or integer) variables for each $v_j$ as follows:

$$v_j = v_j^f - v_j^r, \quad \forall j \in \mathbf{J}$$

$$|v_j| = v_j^f + v_j^r, \quad \forall j \in \mathbf{J} \tag{6}$$

$$v_j \in \mathbb{Z},\ v_j^f \in \mathbb{Z}_{\geq 0} \text{ and } v_j^r \in \mathbb{Z}_{\geq 0} \quad (\text{or } v_j \in \mathbb{R},\ v_j^f \in \mathbb{R}_{\geq 0} \text{ and } v_j^r \in \mathbb{R}_{\geq 0})$$

However, the original minFlux formulation does not prevent the identification of disjoint subnetworks that are connected to the primary carbon transfer pathway (glycolysis in this case) only through exchanges of energy (e.g., ATP), redox (e.g., NAD(P)H) or other cofactors (e.g., H2O). This often results in pathway designs where the entire driving force of the conversion is accomplished by futile cycles disconnected from the main metabolism (Figure 3.2A). For example, the pathway shown in Figure 3.2A contains two undesirable subnetworks, one of which operates in the direction of ATP generation. This issue was remedied here by using an approach similar to loopless-FBA [33] to eliminate the subnetworks.

Step 1: An internal stoichiometric matrix ($S_{int}$) was constructed by first removing all exchange reactions (columns). Subsequently, rows containing the selected cofactors (Table B.1) were removed, resulting in the $S_{red}$ matrix.

Step 2: The rational basis of the null space of the $S_{red}$ matrix is calculated, resulting in the $N_{red}$ matrix. Consequently, all loops of reactions whose net conversion results in only cofactor consumption or generation, such as (i) ATP hydrolysis, (ii) redox generation or (iii) water-splitting, can be represented by a linear combination of the null basis $N_{red}$.

Step 3: The following constraints are then appended to the minFlux formulation:

$$\sum_{j} N^{T}_{red,jl}\, G_j = 0, \quad \forall l \in \mathbf{L} \tag{7}$$

$$G_j \geq -M a_j + (1 - a_j), \quad \forall j \in \mathbf{J} \tag{8}$$

$$G_j \leq -a_j + M (1 - a_j), \quad \forall j \in \mathbf{J} \tag{9}$$

$$v_j \geq -M (1 - a_j), \quad \forall j \in \mathbf{J} \tag{10}$$

$$v_j \leq M a_j, \quad \forall j \in \mathbf{J} \tag{11}$$

$$G_j \in \mathbb{R};\quad a_j \in \{0, 1\}$$

where $\mathbf{L}$ is the set of all loops, $N^{T}_{red}$ is the transpose of the null basis $N_{red}$ with loop $l$ and reaction $j$, $M$ is a very large positive number (i.e., 1000), and the variable $G_j$ is the thermodynamic driving force equivalent of reaction $j$. Constraint (7) imposes the loop law [33], wherein the sum of $G_j$ for all reactions in a closed loop $l$ has to be zero, thereby preventing all the reactions in the loop from carrying flux simultaneously in a cyclical manner. Constraints (8) and (9) ensure that $G_j$ is strictly positive ($1 \leq G_j \leq M$) or negative ($-M \leq G_j \leq -1$) so that the solution $G_j = 0$ can be avoided; note that constraint (9) is an upper bound on $G_j$, as required for these ranges to hold. The binary variables $a_j$ are introduced in constraints (8) to (11) so that $v_j > 0$ is possible only when $G_j < 0$, and vice versa. By adding these constraints, a feasible solution that is a network devoid of the undesirable subnetwork can be identified (Figure 3.2C).
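The effect of the loop law on a single futile cycle can be checked with a short sketch. It assumes a two-reaction loop with null-space vector (1, 1), such as a phosphatase/kinase pair, and uses the ranges stated above ($1 \leq G_j \leq M$ when $a_j = 0$, $-M \leq G_j \leq -1$ when $a_j = 1$). The function names are hypothetical, and the interval test replaces the MILP solver for illustration only.

```python
from itertools import product

M = 1000.0  # big-M value, as in the text


def g_bounds(a):
    """Range of G_j permitted for a given binary a_j."""
    return (-M, -1.0) if a == 1 else (1.0, M)


def loop_law_feasible(null_vector, a):
    """True if some G within its bounds satisfies sum_j n_j * G_j = 0.

    The weighted sum is linear over a box, so its attainable values form
    an interval; the loop law can hold only if that interval contains zero.
    """
    lo = hi = 0.0
    for n, aj in zip(null_vector, a):
        g_lo, g_hi = g_bounds(aj)
        lo += min(n * g_lo, n * g_hi)
        hi += max(n * g_lo, n * g_hi)
    return lo <= 0.0 <= hi


loop = [1, 1]  # both reactions traverse the cycle in the same sense
allowed = [a for a in product([0, 1], repeat=2) if loop_law_feasible(loop, a)]
```

Only the mixed patterns survive: $a = (1, 1)$, which would let both reactions carry positive flux around the cycle, is excluded, and so is $a = (0, 0)$. A reaction with zero flux is compatible with either value of $a_j$, so the mixed patterns still permit one or both reactions to be inactive.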

Integer cut constraints are then introduced to exhaustively identify alternate optimal pathways that satisfy the design equation. Herein, we define $k$ as the iteration index for minFlux ($k \in \{1, 2, \ldots, \kappa\}$). To this end, binary variables $y_j^f$ and $y_j^r$ are defined as follows:

$$y_j^f = \begin{cases} 1, & \text{if reaction } j \text{ carries non-zero flux in the forward direction } (v_j^f > 0) \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

$$y_j^r = \begin{cases} 1, & \text{if reaction } j \text{ carries non-zero flux in the reverse direction } (v_j^r > 0) \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

Likewise, $y_j^{f,k}$ and $y_j^{r,k}$ are introduced as the values of the binary variables in the solution from the $k$-th iteration. At iteration $k = \kappa$, the following constraints are added to the minFlux formulation.

$$\sum_{j \in \mathbf{J} \,\mid\, y_j^{f,k} + y_j^{r,k} = 1} \left(1 - y_j^f - y_j^r\right) \geq 1, \quad \forall k = 1, 2, 3, \ldots, \kappa - 1 \tag{14}$$

$$y_j^f + y_j^r \leq 1 \tag{15}$$

$$y_j^f \varepsilon \leq v_j^f \leq y_j^f M \tag{16}$$

$$y_j^r \varepsilon \leq v_j^r \leq y_j^r M \tag{17}$$

Constraint (14) is the integer cut constraint that ensures that at least one reaction $j$ that was active in each previous iteration $k$ is inactive in the current iteration. Constraint (15) enforces that at most one of the binary variables (corresponding to the flux directions) for each reaction $j$ is active. Finally, constraints (16) and (17) restrict the flux (in the forward or reverse direction) to be strictly positive whenever the corresponding binary variable is active and to be zero otherwise. The parameter $\varepsilon$ is a user-defined small positive real number. The MILP problems were solved using the CPLEX v.12.6.1 solver accessed through the GAMS (v24.4.1) modeling system or Gurobi Optimizer v6.5.1 using Python 2.7.
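The logic of the integer cut can be illustrated without a solver. In the sketch below, a pathway is reduced to the set of reactions that carry non-zero flux in either direction, which is the information used by constraint (14); the function name and the reaction identifiers are hypothetical.

```python
def satisfies_integer_cuts(active, previous_solutions):
    """Check a candidate against the integer cuts of all earlier solutions.

    active: set of reactions with non-zero flux in the candidate.
    previous_solutions: list of such sets, one per earlier iteration.
    """
    for prev in previous_solutions:
        # Sum of (1 - y_f - y_r) over the reactions active in iteration k:
        # counts how many of those reactions are switched off now.
        switched_off = sum(1 for j in prev if j not in active)
        if switched_off < 1:
            return False
    return True


previous = [{"R1", "R2"}]
```

Note that a candidate containing every reaction of an earlier solution is rejected even if it adds new reactions, and that reversing the direction of a reaction does not count as switching it off.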

Assessing the thermodynamic feasibility of a pathway

The thermodynamic feasibility of each pathway under physiological concentration ranges is assessed using the max-min driving force (MDF) formulation [34]. In essence, the MDF formulation identifies a set of metabolite concentrations that makes the least favorable (i.e., largest) free energy change among the reactions in a pathway as negative as possible. If the objective value of MDF is positive, then the pathway is thermodynamically infeasible.

Step 1: The $\Delta_r G'^{\circ}$ for each reaction involved in a pathway ($j \in \mathbf{J}_{path}$) is estimated using the Component Contribution method [30] at pH 7, 25°C and ionic strength of 0.1 M [35, 36].

Step 2: The MDF problem is solved for each pathway, which minimizes the maximum $\Delta_r G'_j$ of a pathway by optimizing over the concentrations of all metabolites in the pathway. The formulation is given by:

$$\min_{c_i} \max_{j} \{\Delta_r G'_j\} \qquad \text{(MDF)} \tag{18}$$

subject to

$$\Delta_r G'_j = \Delta_r G'^{\circ}_j + RT \sum_{i \in \mathbf{I}_{path}} S^{T}_{ij} \ln c_i, \quad \forall j \in \mathbf{J}_{path} \tag{19}$$

$$\ln c_i^{min} \leq \ln c_i \leq \ln c_i^{max}, \quad \forall i \in \mathbf{I}_{path} \tag{20}$$

$$\ln r_p^{min} \leq \ln r_p \leq \ln r_p^{max}, \quad \forall p \in \mathbf{P} \tag{21}$$

where $\mathbf{I}_{path}$ is the set of all metabolites and $\mathbf{J}_{path}$ is the set of all reactions in a pathway, $c_i$ is the concentration of metabolite $i$, $R$ is the gas constant, $T$ is the temperature, $r_p$ is the concentration ratio for an ordered pair of metabolites $p$ (e.g., $p = (ATP, ADP)$, $r_{(ATP,ADP)} = c_{ATP}/c_{ADP}$), and $\mathbf{P}$ is a set of metabolite pairs (e.g., $\mathbf{P} = \{(ATP, ADP), (NADPH, NADP^+), (NADH, NAD^+)\}$). Note that the $S$ matrix here refers to the stoichiometric matrix of the pathway with $S \in \mathbb{R}^{|\mathbf{I}_{path}| \times |\mathbf{J}_{path}|}$.

Constraint (19) relates the Gibbs free energy of reaction ($\Delta_r G'_j$) to the standard Gibbs free energy of reaction ($\Delta_r G'^{\circ}_j$) and the mass action ratio. The concentrations of all metabolites are allowed to vary between 1 µM ($c_i^{min}$) and 100 mM ($c_i^{max}$) in constraint (20). Concentration ratios of common cofactor pairs (e.g., NADPH/NADP+, NADH/NAD+ and ATP/ADP) play an important role in a cell as they determine the driving force of a large number of biosynthesis reactions [37]. The concentration ratios of energy and redox cofactors are therefore allowed to vary within the maximum and minimum values found in the literature and the Bionumbers database [38] in constraint (21). Constraint (21) is optional depending on the case study. Herein, we assumed that the designed pathway operates at steady state and within a single compartment of a cell at a temperature ($T$) of 25°C, ionic strength of 0.1 M and pH 7.0. A pathway with a positive objective function value (MDF) is thermodynamically infeasible within the given physiological concentration (and ratio) ranges and is therefore omitted from the subsequent step. Importantly, the objective function of the enzyme cost minimization problem (see Methods) is convex only when all $\Delta_r G'_j < 0$ in a pathway. The MDF problem is solved using the Gurobi Optimizer v6.5.1 solver and a Python script modified from the Component Contribution Python package [30].
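As a rough illustration of the min-max problem, the sketch below evaluates a hypothetical two-step pathway A → B → C on a coarse logarithmic concentration grid rather than solving the linear program with Gurobi. The standard free energies (+5 and −10 kJ/mol) are invented for the example, the ratio constraint (21) is omitted, and the sign convention follows this chapter (a negative objective value means the pathway is feasible).

```python
import math
from itertools import product

R = 8.314e-3  # gas constant in kJ/(mol*K)
T = 298.15    # 25 degrees C in K


def reaction_dG(dG0, stoich, ln_c):
    """Constraint (19): dG' = dG'0 + RT * sum_i S_ij * ln(c_i)."""
    return dG0 + R * T * sum(s * ln_c[m] for m, s in stoich.items())


def mdf_grid(pathway, metabolites, c_min=1e-6, c_max=0.1, n_points=41):
    """Approximate min over concentrations of max_j dG'_j by grid search."""
    lo, hi = math.log(c_min), math.log(c_max)
    grid = [lo + (hi - lo) * k / (n_points - 1) for k in range(n_points)]
    best = math.inf
    for point in product(grid, repeat=len(metabolites)):
        ln_c = dict(zip(metabolites, point))
        worst = max(reaction_dG(dG0, st, ln_c) for dG0, st in pathway)
        best = min(best, worst)
    return best


# Hypothetical pathway: (standard dG in kJ/mol, stoichiometry)
pathway = [
    (5.0, {"A": -1, "B": 1}),    # A -> B, unfavorable at standard conditions
    (-10.0, {"B": -1, "C": 1}),  # B -> C
]
value = mdf_grid(pathway, ["A", "B", "C"])
```

For this toy case the optimum can also be reasoned out by hand: with A at its upper bound and C at its lower bound the two steps share about −33.5 kJ/mol, so the best achievable maximum is close to −16.8 kJ/mol, and the grid search lands within one grid step of that value. The first step is therefore feasible despite its positive standard free energy.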

Protein cost analysis

The minimal enzyme demand in units of mg protein/mmol glucose/h for each one of the thermodynamically feasible pathways is then estimated based on the enzyme cost minimization (ECM) method [1, 39]. The formulation is as follows:

$$\text{minimize} \quad PC = \frac{1}{v_{EX\_glc}} \sum_{j} M_{E,j}\, \lambda_{E,j} \qquad \text{(ECM)} \tag{22}$$

subject to

$$\lambda_{E,j} = \frac{v_j}{k^{+}_{cat,j}} \left(1 - \exp\left(\frac{\Delta_r G'_j}{RT}\right)\right)^{-1} \left(1 + \prod_{i \in \mathbf{I}_{re,j}} \left(\frac{K_{M,ij}}{c_i}\right)^{q^{+}_{ij}}\right), \quad \forall j \in \mathbf{J}_{path} \tag{23}$$

$$\Delta_r G'_j = \Delta_r G'^{\circ}_j + RT \ln \frac{\prod_{i \in \mathbf{I}_{pr,j}} c_i^{\,q^{-}_{ij}}}{\prod_{i \in \mathbf{I}_{re,j}} c_i^{\,q^{+}_{ij}}}, \quad \forall j \in \mathbf{J}_{path} \tag{24}$$

$$\Delta_r G'_j < 0, \quad \forall j \in \mathbf{J}_{path} \tag{25}$$

$$c_i^{min} \leq c_i \leq c_i^{max}, \quad \forall i \in \mathbf{I}_{path} \tag{26}$$

$$r_p^{min} \leq r_p \leq r_p^{max}, \quad \forall p \in \mathbf{P} \tag{27}$$

where $M_{E,j}$ is the molecular weight of enzyme per active site for reaction $j$, $v_{EX\_glc}$ is the glucose uptake flux (mmol Glucose/h), $\lambda_{E,j}$ is the enzyme level for reaction $j$, $v_j$ is the flux through reaction $j$, $k^{+}_{cat,j}$ is the turnover number of the reaction in the forward direction, $\mathbf{I}_{re,j}$ is the set of reactants in reaction $j$, $\mathbf{I}_{pr,j}$ is the set of products in reaction $j$, the set of all metabolites in the pathway $\mathbf{I}_{path}$ is the union of $\mathbf{I}_{re,j}$ and $\mathbf{I}_{pr,j}$ over all reactions, $K_{M,ij}$ is the Michaelis-Menten constant of the enzyme for reaction $j$ towards metabolite $i$, and $q^{+}_{ij}$ and $q^{-}_{ij}$ are the stoichiometric coefficients of metabolite $i$ in reaction $j$. $q^{+}_{ij} > 0$ if metabolite $i$ is a reactant in reaction $j$ and $q^{+}_{ij} = 0$ otherwise, whereas $q^{-}_{ij} > 0$ if metabolite $i$ is a product in reaction $j$ and $q^{-}_{ij} = 0$ otherwise. Note that in the preprocessing step, all the reactions are re-arranged such that the flux $v_j$ through each of them is strictly positive.

The objective function (equation (22)) involves the minimization of the sum of the enzymatic cost (µg Protein/ mmol Glucose/ h) for each reaction in the pathway normalized by the glucose uptake rate. Equation (23) defines the enzyme level for a reaction 푗 as a function derived from the reversible Michaelis-Menten kinetic equation [1]. Constraint (24) is the same as constraint (19) recast using concentrations. Constraint (25) ensures that all reactions have negative change in free energy and prevents division by zero in equation (23). Constraints (26) and (27) impose the bounds on the concentration ranges and concentration ratio ranges. The above formulation can be simplified by substituting the concentration variable 푐푖 with logarithmic concentrations 푥푖 = ln 퐶푖 and thus converting the product term into a summation.

The optimal concentrations of metabolites obtained from the MDF problem (see Methods) are used as the initial condition for the ECM problem, which is then solved using the sequential least squares programming (SLSQP) method (Python SciPy package). Due to the lack of experimentally measured kinetic parameters, we assumed generic values ($M_E$ = 40 kDa, $k_{cat}$ = 79 s$^{-1}$ and $K_M$ = 200 µM) [40] for all kinetic parameters as was carried out in the original study [1]. This implies that all enzymes were treated as equally fast in every pathway. The allowable metabolite concentration ranges are identical to those of the MDF analysis.
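The enzyme-level expression in the ECM formulation can be evaluated directly for fixed concentrations, which helps show why operating near equilibrium is costly. The sketch below is an illustration only: the function names are hypothetical, the optimization over concentrations is not performed, and the units (flux in mol/s, $k_{cat}$ in 1/s, $M_E$ in g/mol) are an assumption made here for dimensional simplicity rather than the units reported in this chapter.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol*K)
T = 298.15    # 25 degrees C in K


def enzyme_level(v, kcat, dG, substrates, KM=200e-6):
    """Enzyme demand for one reaction from the factorized rate law.

    v: flux (mol/s); kcat: forward turnover number (1/s);
    dG: reaction free energy (kJ/mol), must be negative;
    substrates: list of (concentration in M, stoichiometric coefficient).
    """
    if dG >= 0:
        raise ValueError("the reaction must have a negative free energy change")
    thermodynamic = 1.0 / (1.0 - math.exp(dG / (R * T)))
    ratio = 1.0
    for c, q in substrates:
        ratio *= (KM / c) ** q
    saturation = 1.0 + ratio
    return v / kcat * thermodynamic * saturation


def protein_cost(levels, v_glc, ME=40000.0):
    """Total enzyme mass (g) per unit glucose uptake flux."""
    return sum(ME * lam for lam in levels) / v_glc


far = enzyme_level(1.0, 79.0, -50.0, [(0.1, 1)])   # far from equilibrium
near = enzyme_level(1.0, 79.0, -1.0, [(0.1, 1)])   # close to equilibrium
```

With the generic parameters quoted above, a reaction held at −1 kJ/mol needs roughly three times as much enzyme as the same reaction far from equilibrium, because the thermodynamic factor $(1 - e^{\Delta_r G'/RT})^{-1}$ grows without bound as $\Delta_r G'$ approaches zero.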

Pathway visualization

To assist in the analysis of a large number of pathways designed using the modified optStoic approach, each pathway and reaction are represented as a Pathway Class object and Reaction Class object, respectively. A directed bipartite graph for each pathway is generated and rendered as SVG, PNG or JPEG format using the Graphviz software accessed through the Graphviz Python Package.
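A directed bipartite graph of this kind can be described in the Graphviz DOT language with metabolites and reactions as two node types. The sketch below only builds the DOT source text with the standard library; it does not reproduce the Pathway and Reaction classes used in this work, and the function name, input layout and node shapes are assumptions made for illustration.

```python
def pathway_to_dot(name, reactions):
    """Build DOT source for a bipartite metabolite/reaction graph.

    reactions: dict mapping reaction id -> (list of substrates, list of products).
    """
    lines = ['digraph "%s" {' % name, "  rankdir=LR;"]
    metabolites = sorted(
        {m for subs, prods in reactions.values() for m in list(subs) + list(prods)}
    )
    for m in metabolites:
        lines.append('  "%s" [shape=ellipse];' % m)   # metabolite nodes
    for rid, (subs, prods) in reactions.items():
        lines.append('  "%s" [shape=box];' % rid)     # reaction nodes
        for s in subs:
            lines.append('  "%s" -> "%s";' % (s, rid))
        for p in prods:
            lines.append('  "%s" -> "%s";' % (rid, p))
    lines.append("}")
    return "\n".join(lines)


dot_source = pathway_to_dot(
    "toy_pathway",
    {"R1": (["glucose", "ATP"], ["G6P", "ADP"])},
)
```

The resulting text can be written to a file and rendered with the Graphviz `dot` command-line tool (for example to SVG or PNG).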

Figure 3.2. The modified optStoic procedure. Pathways shown here perform the overall conversion defined below the panels. (A) Design A is a pathway with disjoint subnetworks that generate cofactors such as ATP and water. This design is obtained using the previous optStoic/minFlux formulation. (B) The $S_{int}$ matrix, which contains only internal reactions, was processed by removing rows containing cofactors. The basis of the null space of the resulting $S_{red}$ matrix is then obtained ($null(S_{red}) = N_{red}$). Each row of the $N_{red}$ matrix is an internal cycle that results in no net non-cofactor metabolite production. The loop law is imposed as $N_{red}^{T} G = 0$, which implies that flux could traverse only through one of the directions in a loop. Two cases are shown here for the loop involving reaction R1 (D-Fructose-1,6-bisphosphate + H2O → D-Fructose-6-phosphate + Pi) and R2 (D-Fructose-6-phosphate + ATP → D-Fructose-1,6-bisphosphate + ADP). In case (ii)(a), when reaction R1 is active ($v_{R1} > 0$), then reaction R2 can carry only zero flux or flux in the same direction as R1. (C) After adding the loop law constraints to the minFlux formulation, we found that ATP and redox generation occurs only on the main carbon transfer pathway.

3.3. Results

We first exhaustively traced pathways from glucose to pyruvate that conform to a general glycolysis stoichiometry while generating 1 to 5 ATP per glucose. We then filtered the pathways based on their thermodynamic feasibility and subsequently predicted the minimal protein cost of each pathway. We identified the Pareto optimal pathways for the tradeoff between protein cost and ATP yield and further determined the main factor(s) that affect the protein cost of a pathway. Several identified pathways with higher ATP yields are discussed and their lack of robustness to varying ATP/ADP ratios is demonstrated.

Exhaustive enumeration of all glycolytic pathway variants using modified optStoic

A glycolytic pathway is defined here as the conversion of glucose into pyruvate accompanied by the generation of energy cofactor ATP and redox cofactor NAD(P)H. This conversion can be described by two half-reactions:

(i) Oxidation of glucose to pyruvate: Glucose + 2 NAD(P)+ <=> 2 Pyruvate + 2 NAD(P)H + 4 H+ ($\Delta_r G'^{\circ}$ = -133.6 kJ/mol) and

(ii) Phosphorylation of ADP to form ATP: ADP + Phosphate + H+ <=> ATP + H2O ($\Delta_r G'^{\circ}$ = 26.4 kJ/mol).

The overall reaction allowing for a varying amount (denoted by coefficient $n$) of ATP produced is given by:

Glucose + 2 NAD(P)+ + n ADP + n Phosphate = 2 Pyruvate + 2 NAD(P)H + n ATP + n H2O + (4 – n) H+    (28)

The maximum number of ATP that can be generated while maintaining $\Delta_r G'^{\circ} \leq 0$ is therefore 133.6/26.4 = 5.06 mol/mol glucose. Although it is possible to generate additional ATPs through oxidative phosphorylation (e.g., 2.5 ATP/NADH [41]), only ATP production through substrate-level phosphorylation from the glycolytic pathway is considered. We employed the minFlux version of optStoic to prospect for pathways from a database of curated reactions derived from KEGG [28] that perform the requisite conversion (equation 1) while generating from $n$ = 1 to 5 mol ATP/mol glucose. The minFlux algorithm identifies minimal flux carrying networks that conform to the given stoichiometry (see Methods). Alternate pathways were identified by iteratively appending integer cuts (see Methods).
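The upper limit on the ATP yield follows from adding the two half-reactions, which the short check below reproduces using only the two standard free energies quoted above. The constant and function names are hypothetical.

```python
DG_OXIDATION = -133.6  # kJ/mol, glucose -> 2 pyruvate with 2 NAD(P)H
DG_ATP = 26.4          # kJ/mol, ADP + Pi -> ATP


def overall_dG0(n):
    """Standard free energy of the overall reaction with n ATP per glucose."""
    return DG_OXIDATION + n * DG_ATP


# Largest integer ATP yield that keeps the overall standard free energy <= 0
max_n = max(n for n in range(0, 10) if overall_dG0(n) <= 0)
margins = {n: round(overall_dG0(n), 1) for n in range(1, 7)}
```

At $n$ = 5 the overall conversion retains only about 1.6 kJ/mol of standard driving force, whereas $n$ = 6 would be uphill by about 24.8 kJ/mol, which is why the enumeration stops at five ATP per glucose.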

An important consideration for designing a glycolytic pathway is that ADP phosphorylation should be strictly coupled to it [19]. Directly using optStoic [29] led to a significant number of pathways containing disjoint subnetworks, with some of them generating ATP (see Methods and Figure 3.2A). Such disjoint subnetworks do not exchange carbon flux with the main glycolytic pathway chain. In extreme pathway analysis, a closed loop of reactions that exchanges only cofactors with other pathways is defined as a Type II extreme pathway, whereas a thermodynamically infeasible closed loop that does not exchange any cofactors with its surroundings is defined as a Type III extreme pathway [42]. We resolved this issue by appending the loopless-FBA constraints [33] to the minFlux formulation (see Methods and Figure 3.2B). In brief, the exchange reactions were first removed from the stoichiometric ($S$) matrix of the reaction database, resulting in the $S_{int}$ matrix, which contains only the internal reactions. The rows involving cofactors (listed in Table S1) were then removed to generate the $S_{red}$ matrix. This step differs from the original loopless-FBA procedure, as we want to eliminate both Type II and Type III internal cycles. The null basis ($N_{red}$) of the $S_{red}$ matrix is generated, with each row indicating a closed loop of reactions that results in no net non-cofactor metabolite production. A disjoint subnetwork (i.e., Type II pathway) that exchanges only cofactors with the main glycolytic chain is now a closed loop on its own. The constraints (Methods, equations 7 – 11) derived from the loop law [33] prevent net flux from traversing any such loop in a cyclic manner and essentially prevent the identification of the disjoint subnetwork. Therefore, ATP production can occur only on the main carbon transfer pathway (i.e., glycolytic pathway) (Figure 3.2C).

As a result, a total of 11,916 unique glycolytic routes generating between one and five ATP without the undesirable disjoint subnetworks were identified (Figure 3.3, SI Table 2). The ED and the EMP glycolytic pathways were among them. The Jaccard similarity coefficient was used to verify that all the pathways generating the same ATP yield are indeed distinct from one another. The statistics of all pathways with respect to ATP yields, total flux (i.e., the minimum sum of the absolute values of fluxes) and number of reactions are shown in Figure B.1A and Figure B.1B.
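A distinctness check based on the Jaccard coefficient might look like the sketch below. It assumes each pathway is compared as a set of reaction identifiers; the text does not specify which sets were compared, so this choice, the function names and the identifiers are all illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two reaction sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def all_distinct(pathways):
    """True if no two pathways share exactly the same set of reactions."""
    return all(
        jaccard(p, q) < 1.0
        for i, p in enumerate(pathways)
        for q in pathways[i + 1:]
    )


pathways = [
    {"R1", "R2", "R3"},
    {"R1", "R2", "R4"},
    {"R5", "R6"},
]
```

A coefficient of 1 indicates identical reaction sets, so requiring every pairwise value to be below 1 confirms that the enumerated pathways are unique.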

Imposing the thermodynamic feasibility test MDF and the effect of metabolite concentration ranges

A glycolytic pathway variant could potentially operate in a cell in the forward direction if and only if there is a positive thermodynamic driving force through each of its constituent reactions (i.e., $\Delta_r G'_j \leq 0, \forall j \in \mathbf{J}_{path}$, where $\mathbf{J}_{path}$ is the set of reactions in a pathway). Although all the glycolytic pathways designed above perform the overall conversion with a negative standard $\Delta_r G'^{\circ}$ (i.e., $\sum_{j \in \mathbf{J}_{path}} \Delta_r G'^{\circ}_j \leq 0$), this is not sufficient to ensure that all reaction steps $j$ within the pathway can indeed have a negative $\Delta_r G'_j$ for the intracellular metabolite concentration ranges. Consequently, we employed the max-min driving force (MDF) procedure [34] to search over the metabolite concentrations within the physiologically relevant ranges (1 µM to 100 mM, Figure B.2) to check whether all $\Delta_r G'_j$ can simultaneously be negative. If at least one of the reaction steps has an unavoidable positive $\Delta_r G'_j$, then the pathway is deemed thermodynamically infeasible and is excluded from consideration. As a result, we were able to narrow down the solution pool by 19.3% (see Table 3.1(i)). The imposition of the overall standard free energy of change negativity during pathway design seems to safeguard against thermodynamically infeasible designs, with only a small minority failing the more rigorous MDF test when concentration ranges were imposed. Even though one would have expected pathways that produce more ATP, and are thus closer to the thermodynamic limit, to involve a larger fraction of thermodynamically infeasible designs, we observed no such trend under a permissive condition (see Table 3.1(i)).

As expected, the fraction of pathways deemed thermodynamically feasible strongly depends on the imposed metabolite concentration bounds. Tighter concentration bounds reduced the number of feasible pathways (see Table 3.1). Pathways that produce more ATP are much more susceptible to the effect of concentration bound reduction. For example, when upper and lower limits obtained from measurement of intracellular metabolite concentrations [23, 43] were imposed on ATP and ADP, none of the 5 ATP pathways were feasible (Table 3.1(ii)). In addition, when the cofactor ratios (i.e., ATP/ADP, NADPH/NADP and NADH/NAD) were allowed to vary only within experimentally observed ranges (Table 3.1(iv)), 77% of the pathways designed were rejected, including all 5 ATP pathways. Finally, when the metabolite bounds were further restricted to between 1 µM and 10 mM (Table 3.1(v)), all of the 3 to 5 ATP yielding pathways were found to be thermodynamically infeasible, whereas both canonical glycolytic pathways (i.e., EMP and ED) remained thermodynamically feasible in all the cases (Table 3.1). This implies that, along with setting an optimal trade-off between ATP production and proteome cost, both EMP and ED seem to be selected for a particular concentration range of glycolytic metabolites and ATP/ADP ratio.

Table 3.1. Number of unique pathways for 0 – 5 ATP identified using OptStoic. Number of pathways that are thermodynamically feasible (MDF < 0) at physiological concentration ranges and ratio.

ATP yield per glucose                             1        2        3        4        5
Number of pathways designed using minFlux         5,739    3,430    1,873    659      215
Number of pathways with MDF < 0, condition (i)    4,550†   2,891‡   1,542    466      165
Number of pathways with MDF < 0, condition (ii)   2,549†   1,099‡   281      4        0
Number of pathways with MDF < 0, condition (iii)  2,525†   1,098‡   281      4        0
Number of pathways with MDF < 0, condition (iv)   1,824†   778‡     173      2        0
Number of pathways with MDF < 0, condition (v)    538†     105‡     0        0        0
Number of pathways with MDF < 0, condition (vi)   2,558†   1,099‡   281      4        0
Number of pathways with MDF < 0, condition (vii)  927†     304‡     83       2        0

(i) All metabolites are allowed to vary between 1 µM and 100 mM.

(ii) Same as (i), except that ATP and ADP concentrations are bounded based on Park et al. [23] (i.e., 1.66 mM ≤ $C_{ATP}$ ≤ 11.4 mM; 0.429 mM ≤ $C_{ADP}$ ≤ 0.715 mM).

(iii) Same as (ii), except that the concentration range of CO2 was bounded based on Park et al. [23] (i.e., 50 µM ≤ $C_{CO_2}$ ≤ 10 mM).

(iv) All metabolites are allowed to vary between 1 µM and 100 mM except CO2. The range of CO2 was obtained from Park et al. [23] (i.e., 50 µM ≤ $C_{CO_2}$ ≤ 10 mM). The ratio ranges for different cofactor pairs were imposed as follows: 0.2 ≤ $C_{ATP}/C_{ADP}$ ≤ 20, 0.2 ≤ $C_{NADPH}/C_{NADP}$ ≤ 100, 0.0005 ≤ $C_{NADH}/C_{NAD}$ ≤ 0.5, based on data collected from Bionumbers [38] and literature.

(v) Same as (iv), but all metabolites other than CO2 are allowed to vary between 1 µM and 10 mM. The range of CO2 was obtained from Park et al. [23] (i.e., 50 µM ≤ $C_{CO_2}$ ≤ 10 mM).

(vi) All metabolites are allowed to vary between 1 µM and 100 mM. The ratio range for the ATP/ADP cofactor pair was imposed as follows: 1 ≤ $C_{ATP}/C_{ADP}$ ≤ 10,000.

(vii) All metabolites are allowed to vary between 1 µM and 100 mM. ATP is set to 100 mM whereas ADP is set to 1 µM.

†The canonical ED pathway is thermodynamically feasible under the condition specified.

‡The canonical EMP pathway is thermodynamically feasible under the condition specified.
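The percentages quoted in the text can be recomputed directly from the counts in Table 3.1. The short sketch below does this bookkeeping; the variable and function names are illustrative only, and the numbers are transcribed from the table.

```python
# Counts transcribed from Table 3.1 (columns: 1, 2, 3, 4, 5 ATP per glucose).
designed = [5739, 3430, 1873, 659, 215]
feasible = {
    "i":   [4550, 2891, 1542, 466, 165],
    "ii":  [2549, 1099, 281, 4, 0],
    "iii": [2525, 1098, 281, 4, 0],
    "iv":  [1824, 778, 173, 2, 0],
    "v":   [538, 105, 0, 0, 0],
    "vi":  [2558, 1099, 281, 4, 0],
    "vii": [927, 304, 83, 2, 0],
}


def rejected_percent(condition):
    """Percentage of all designed pathways that fail the MDF test."""
    return 100.0 * (1.0 - sum(feasible[condition]) / sum(designed))


for name in feasible:
    print(f"condition ({name}): {rejected_percent(name):.1f}% rejected")
```

Summing over all ATP yields, this gives the 77% rejection under condition (iv), the 66.9% rejection under condition (vi) and the 11% of pathways that remain feasible under condition (vii).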

The Pareto frontier of the tradeoff between protein cost and ATP yield

A significant fraction (from 10% to 20%) of the total proteome is allocated to glycolytic pathways (e.g., 10% - 15% in E. coli [25, 26], 14% - 20% in yeast [44, 45] and 10% in cancer cells [46]) to ensure the production of many intermediate metabolites and redox equivalents. We computationally explored whether it is possible to identify a glycolytic pathway variant with a lower cost than the canonical glycolysis (ED and EMP). The absence of a lower cost pathway could support the cost-benefit theory that natural evolution converges toward parsimonious enzyme expression [47]. The enzyme cost of each pathway should, therefore, be determined to address this question. Although the sum of absolute fluxes through a pathway is generally assumed proportional to enzyme requirement [48], the actual enzyme demand relies on the enzyme catalytic efficiency (i.e., turnover number, kcat and Michaelis constant, KM) and the metabolite concentrations that affect both the enzyme saturation level and driving force of a reaction. However, the dearth of experimentally measured kinetic parameters, intracellular metabolite concentrations, and enzyme mechanisms hampers the development of a detailed mechanistic model for each pathway across different organisms.

To this end, we used the scalable convex optimization-based enzyme cost minimization (ECM) algorithm [1, 39] as a proxy for quantifying the effect of metabolite concentrations, kinetic parameters and Gibbs free energy on the enzyme demand per unit flux of a pathway. By recasting the reversible Michaelis-Menten equation as a separable rate law and integrating the Haldane relationship [39, 49], the kinetic parameters associated with the reverse direction can be approximated using the Gibbs free energy of reaction (which can be predicted using the Group Contribution method [50] or the Component Contribution method [30]). We, therefore, employed this computationally tractable approach to evaluate the minimal protein cost for operating any glycolytic pathway variant in a host cell-agnostic system by assuming that all enzymes are equally efficient (see Methods) and that metabolite concentrations are allowed to vary between 1 µM and 100 mM. Note that this analysis provides a lower-bound estimate of the actual enzyme cost [39]. Due to the large number of glycolytic variants, an automated pipeline was developed for the generation of the kinetic models of each thermodynamically feasible pathway (under the same condition) and the subsequent analysis of the minimal enzyme cost.
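For a single-substrate, single-product reaction S ⇌ P, the factorization can be sketched as follows. The notation is assumed here for illustration ($E$ for enzyme level, $s$ and $p$ for concentrations, $K_S$ and $K_P$ for Michaelis constants, $\eta^{sat}$ and $\eta^{th}$ for the saturation and thermodynamic factors); the general multi-substrate form is given in [39].

```latex
\begin{align}
v &= E\,\frac{k_{cat}^{+}\, s/K_S - k_{cat}^{-}\, p/K_P}{1 + s/K_S + p/K_P} \\
  &= E\,k_{cat}^{+}\,
     \underbrace{\frac{s/K_S}{1 + s/K_S + p/K_P}}_{\eta^{sat}}\,
     \underbrace{\left(1 - \frac{k_{cat}^{-}\,K_S}{k_{cat}^{+}\,K_P}\,\frac{p}{s}\right)}_{\eta^{th}} \\
K_{eq} &= \frac{k_{cat}^{+}\,K_P}{k_{cat}^{-}\,K_S}
  \quad\Rightarrow\quad
  \eta^{th} = 1 - \frac{p/s}{K_{eq}} = 1 - e^{\Delta_r G'/RT} \\
E &= \frac{v}{k_{cat}^{+}\,\eta^{sat}\,\eta^{th}}
\end{align}
```

In this form the reverse turnover number $k_{cat}^{-}$ no longer appears explicitly: the Haldane relationship replaces it with $K_{eq}$, and hence with the Gibbs free energy of reaction.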

The minimal protein cost of glycolytic pathway variants for each ATP yield spans a wide range regardless of the redox cofactor(s) generated (i.e., NADH or NADPH) (see Figure 3.3A and Figure B.3). By plotting the ATP yield (i.e., energetic objective) versus the minimal protein cost (i.e., operation cost) of all the glycolytic pathway variants (Figure 3.3B), we observed a tradeoff between the two competing objectives, which forms a Pareto frontier. The Pareto frontier is obtained by connecting the pathways with the least cost (i.e., Pareto-optimal points) for each ATP yield (Figure 3.3B). On the Pareto frontier, the ATP yield of a glycolytic pathway can only be increased through a higher investment of protein to operate the pathway.
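The construction of the frontier can be sketched in a few lines. The pathway names and cost values below are made up for illustration, and the function names are hypothetical. The extra dominance filter (dropping a least-cost point if a higher-yield pathway is at least as cheap) is a standard Pareto step; it changes nothing when cost rises with yield, as reported here.

```python
def least_cost_per_yield(pathways):
    """Map each ATP yield to the cheapest pathway found at that yield.

    pathways: iterable of (name, atp_yield, protein_cost) tuples.
    """
    best = {}
    for name, atp, cost in pathways:
        if atp not in best or cost < best[atp][1]:
            best[atp] = (name, cost)
    return best


def pareto_front(pathways):
    """Keep least-cost points that no higher-yield pathway matches or undercuts."""
    best = least_cost_per_yield(pathways)
    front = []
    cheapest_above = float("inf")
    for atp in sorted(best, reverse=True):  # scan from high to low ATP yield
        name, cost = best[atp]
        if cost < cheapest_above:
            front.append((atp, cost, name))
            cheapest_above = cost
    return sorted(front)


# Hypothetical data: (name, ATP per glucose, minimal protein cost).
pathways = [
    ("p1", 1, 1.0), ("p2", 1, 1.4),
    ("p3", 2, 1.6), ("p4", 2, 2.5),
    ("p5", 3, 4.0), ("p6", 3, 3.2),
]
front = pareto_front(pathways)
print(front)
```

The distance of any other pathway from the frontier can then be taken as its cost minus the frontier cost at the same ATP yield.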

Interestingly, the canonical ED and EMP pathways lie close to the Pareto frontier, suggesting that they are among the most protein cost-efficient pathways in their respective ATP yield categories. The distance between the canonical ED pathway and the Pareto frontier is 0.129 mg Protein/mmol Glc/h, whereas the distance between the EMP pathway and the Pareto frontier is slightly larger at 0.429 mg Protein/mmol Glc/h. In other words, out of the many possible ways to construct a glycolytic pathway, nature has found a way to organize proteins into two pathways with near-optimal protein resource requirements.

The minimal protein cost for a pathway is approximated under permissive concentration bounds (similar to Table 3.1, condition (i)). In reality, the concentration bounds may be more restrictive as they are organism- and compartment-dependent. Our results also hinge upon the assumption that all enzymes are equally fast and have the same Michaelis constant (KM) for their respective substrates. Under this assumption, the thermodynamic driving force of the most thermodynamically restricted reaction(s) of a pathway (i.e., as indicated by the MDF objective) is the major factor that affects the cost (Figure 3.4). Each reaction on a pathway can only operate feasibly within a specific range of metabolite concentrations determined by its $\Delta_r G_j'^{\circ}$ (i.e., the equilibrium constant) and the interplay between itself and other reactions (and participating metabolites) in the pathway. Therefore, the selection of which reactions form a pathway and the resulting structure of the pathway are crucial in determining its operational cost. Regardless of the ATP yield, the minimal protein cost rises steeply and approaches infinity as the MDF objective of a pathway approaches zero (Figure 3.4). If the pathway has a large negative MDF objective (i.e., the thermodynamic bottleneck of the pathway is insignificant), then the minimal protein cost required for the pathway is small (Figure 3.4). Overall, pathways generating higher ATP yield have a narrower range of MDF objective values than pathways with lower ATP yield, indicating the loss of driving force as the pathway has to retain more energy from glucose to produce more ATP. The canonical EMP pathway has the second lowest MDF value among the 2 ATP-yielding pathways, which explains its low protein cost. Although the ED pathway is 5.47 kJ/mol away from the 1 ATP pathway with minimal MDF, it still has a relatively low MDF when compared to most pathways (including EMP).
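The divergence of the cost near MDF = 0 follows from the thermodynamic term of the factorized rate law used in ECM [39], in which enzyme demand scales with 1/(1 - exp(ΔrG'/RT)). The sketch below evaluates only this term for a single reaction; the temperature (298.15 K), the sample ΔrG' values and the function name are assumptions made for illustration, and saturation effects are ignored.

```python
import math

RT = 8.314462618e-3 * 298.15  # kJ/mol, assuming T = 298.15 K


def thermodynamic_cost_factor(dg_prime):
    """Fold increase in enzyme demand caused by backward flux.

    dg_prime: reaction Gibbs energy in kJ/mol; must be negative.
    """
    if dg_prime >= 0:
        raise ValueError("reaction is infeasible in the forward direction")
    return 1.0 / (1.0 - math.exp(dg_prime / RT))


for dg in (-30.0, -10.0, -3.0, -1.0, -0.1):
    factor = thermodynamic_cost_factor(dg)
    print(f"dG' = {dg:6.1f} kJ/mol -> cost factor {factor:8.2f}")
```

Far from equilibrium the factor is essentially 1, so the bottleneck costs nothing extra; close to equilibrium it grows roughly as RT/|ΔrG'|, which is why a pathway whose bottleneck sits near zero driving force needs a very large amount of enzyme.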

Figure 3.3. The tradeoff of ATP yield and minimal protein cost of a pathway. (A) The distribution of minimal protein cost required to operate 1 to 5 ATP yielding glycolytic pathway variants. (B) The tradeoff plot between pathway ATP yield (mol ATP/mol glucose) and the minimal protein cost per unit glucose consumed. The grey line indicates the Pareto optimal of the tradeoff between ATP yields and protein cost of the glycolytic pathway. Pink and red stars indicate the ED and the EMP pathways, respectively. The number of data points (i.e., pathways) for each ATP yield category is described on the right of the plot. The lines and circles are color-coded based on the pathway ATP yield (see legend in (A)).

Figure 3.4. Identifying the key factors contributing to the protein cost of different ATP yielding pathways. The minimal protein cost correlates with the $\Delta_r G_j'$ of the thermodynamic bottleneck (i.e., MDF) of a pathway under the assumption that all enzymes are equally fast and 1 mmol/gDW/h of glucose is converted to pyruvate. The vertical and horizontal lines show the ranges of minimal protein cost and MDF, respectively, of all pathways in each ATP yield category color-coded based on the ATP yield per unit glucose. Note that the minimal protein cost function is convex only when MDF < 0 kJ/mol (proof provided in [1]).

Pathways with a lower cost than the canonical glycolytic pathways

There are, however, alternative pathways that operate at a lower cost than the canonical pathways. The canonical ED pathway generates one ATP molecule per molecule of glucose (Figure 3.5A). It is ranked among the top ten pathways with the least protein cost. The most protein cost-efficient 1 ATP-generating pathway has a structure similar to that of the ED pathway (Figure 3.5B). It, however, requires only 1% less protein than the ED pathway. It differs from the ED pathway in three reaction steps: (i) the glucose-6-phosphate dehydrogenase is NAD-dependent, and (ii) 1,3-bisphosphoglycerate (1,3-BPG) is converted into 2-phosphoglycerate (2-PG) through 2,3-bisphosphoglycerate (2,3-BPG) instead of 3-phosphoglycerate (3-PG) using a modified form of the Rapoport-Luebering (RL) bypass [51]. Although the Rapoport-Luebering shunt proposed by Cho et al. [51] comprises the dephosphorylation of 2,3-BPG to 2-PG, the 3-phosphate phosphatase reaction (catalyzed by multiple inositol polyphosphate phosphatase (MIPP1)) was not shown to be associated with ATP formation. The authors, however, suggested its potential in anaerobic ATP generation [51]. Other 2-PG kinases that could potentially interconvert 2,3-BPG and 2-PG, including those of Methanothermus fervidus [52] and Deinococcus radiodurans, have been demonstrated to catalyze only the phosphorylation of 2-PG to form 2,3-BPG but not the reverse [53]. Nevertheless, as the $\Delta_r G'$ of the reaction R02664 (2-PG + ATP ↔ 2,3-BPG + ADP) varies between -79 kJ/mol and 35 kJ/mol depending on the concentrations of participating metabolites (see section 1.3.1.), we consider this reaction as reversible. According to the distribution of the protein cost (Figure 3.5E and 3.5F), the 1% difference in protein cost between this pathway and the ED pathway arises because phosphoglycerate mutase in the ED pathway has a backward flux due to a less negative $\Delta_r G'$ (Figure 3.5I), which makes the enolase slightly less saturated (by its substrate 2-PG). The top design solved this problem by using an alternative reaction with a more negative $\Delta_r G'$ (Figure 3.5J).
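The reported range for R02664 is consistent with letting each participating metabolite vary independently between 1 µM and 100 mM. The sketch below reproduces it under explicit assumptions: a standard transformed Gibbs energy of about -22 kJ/mol (not stated in the text; back-calculated here as the midpoint of the reported range), T = 298.15 K, unit stoichiometric coefficients, and a hypothetical helper name.

```python
import math

RT = 8.314462618e-3 * 298.15  # kJ/mol, assuming T = 298.15 K


def dg_prime_range(dg0, n_substrates, n_products, lo=1e-6, hi=1e-1):
    """Return (min, max) of dG' = dG0 + RT ln Q over box concentration bounds (M).

    Assumes unit stoichiometric coefficients for every substrate and product.
    """
    # Q is smallest with products at lo and substrates at hi, largest the other way.
    ln_q_min = n_products * math.log(lo) - n_substrates * math.log(hi)
    ln_q_max = n_products * math.log(hi) - n_substrates * math.log(lo)
    return dg0 + RT * ln_q_min, dg0 + RT * ln_q_max


# 2-PG + ATP <-> 2,3-BPG + ADP: two substrates, two products.
assumed_dg0 = -22.0  # kJ/mol, back-calculated assumption (see lead-in)
dg_low, dg_high = dg_prime_range(assumed_dg0, n_substrates=2, n_products=2)
print(f"dG' ranges from {dg_low:.0f} to {dg_high:.0f} kJ/mol")
```

Because the upper end of the range is positive, the reaction can in principle run in either direction within these bounds, which is the basis for treating it as reversible.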

The EMP glycolysis generates two ATP per glucose (Figure 3.5C). It incurs 55% more protein cost than the ED pathway under the given concentration ranges (see Methods). A 2 ATP-generating pathway with a lower protein cost than the EMP pathway demanded just 80% of the protein cost required by EMP due to its ED-like pathway structure (Figure 3.5D). This is because the lower glycolysis of the EMP pathway requires twice as much enzyme to catalyze the doubled flux (Figure 3.5G). In this particular design, ATP usage and generation occur only in the lower glycolysis, while the upper glycolysis is derived from thermoacidophilic archaebacteria (e.g., Sulfolobus solfataricus and Thermoplasma acidophilum [54]). ATP is first invested in the phosphorylation of glyceraldehyde that is generated from the cleavage of KDG by KDG aldolase. Three ATPs are then produced downstream through the conversions of 1,3-BPG to 3-PG, 2,3-BPG to 2-PG and PEP to pyruvate. This is made possible by the investment of an inorganic phosphate to phosphorylate 3-PG to form the higher energy intermediate 2,3-BPG. Despite the lower protein cost, the lack of ATP-dependent phosphorylation of glucose in the first step, which would minimize the escape (i.e., diffusion) of glucose from the cell, likely makes this pathway unfavorable for most organisms. Nevertheless, the second step converts glucono-1,5-lactone into gluconic acid, a polar compound with reduced membrane permeability [19].

Figure 3.5. Glycolytic pathways designed using the modified minFlux procedure: (A) ED pathway, (B) a lower cost 1 ATP-generating pathway, (C) EMP pathway and (D) a lower cost 2 ATP-yielding pathway. The label beside each arrow represents the reaction ID and integer flux through each reaction. The pathway diagram is generated using the technique described in Methods. ATP and ADP cofactors are highlighted in red and pink, whereas other cofactors are highlighted in grey. (E-H) The distribution of protein cost through each pathway is displayed below each metabolic map. On each bar, the contribution of flux capacity, thermodynamics (i.e., high protein cost when a reaction is close to equilibrium, causing backward fluxes) and enzyme saturation level (i.e., high protein cost when substrate concentration < KM) to the protein cost is represented by blue, green and yellow stacked bars, respectively. Note that the y-axis is in log10-scale as the contribution of each term is multiplicative. For this calculation, we assumed that all enzymes are equally fast (kcat) and have the same kinetic properties (KM), and the arbitrary baseline enzyme is set to 20 µg Protein / (mmol Glc /h). (I-L) The thermodynamic profile of each pathway expressed as transformed standard Gibbs free energy (blue line), and as transformed Gibbs free energy of reaction, which accounts for the effect of metabolite concentrations, when the thermodynamic bottleneck (i.e., MDF) is minimized (green line) and when protein cost is minimized (red line). Regions shaded in grey highlight the reaction steps involving ATP.

Pathways generating higher ATP yield than the canonical glycolytic pathways

In this section, we discuss a few interesting pathways that generate more than two ATPs per glucose. To extract an extra ATP from a pathway resembling EMP, the carbon flux from GAP is split into two routes to generate 3-PG (Figure 3.6A). While one of the GAP molecules passes through the typical NAD+-dependent phosphorylating GAP dehydrogenase and the ATP-generating phosphoglycerate kinase, the other bypasses the ATP-forming phosphoglycerate kinase step by using a non-phosphorylating GAP dehydrogenase (GAPN). Both NAD+- and NADP+-dependent GAPN have previously been identified in the hyperthermophilic archaeon T. tenax as well as in photosynthetic higher eukaryotes [55]. The resulting 3-PG then goes through the modified Rapoport-Luebering shunt (described in the previous section) to form 2-PG, thereby generating two extra ATPs. As five ATPs are produced in the lower glycolysis to compensate for the two ATPs invested in the upper glycolysis, a net of three ATPs is produced.

A net of 4 ATPs can be generated from the pathway shown in Figure 3.6B. In this pathway, the upper glycolysis is similar to the EMP pathway, wherein 2 ATPs are consumed and GAP is converted to 1,3-BPG. Subsequently, the lower glycolysis generates 6 ATPs through a subnetwork similar to that of the 2 ATP-yielding pathway in Figure 3.5D but with double the flux.

Another interesting 4 ATP-yielding pathway design mimics the C. cellulolyticum [56] glycolysis by utilizing a PPi-dependent PFK, thereby bypassing an ATP investment upstream (Figure 3.6C). It then generates 5 ATPs in the lower glycolysis, resulting in a net of 4 ATPs. C. cellulolyticum, an obligate anaerobe, could generate up to 5 NTP molecules (i.e., ATP and GTP) per glucose molecule through its glycolytic pathway. This is possible as:

(i) it uses a reversible PPi-dependent PFK;

(ii) it generates 2 GTPs each from the reactions catalyzed by phosphoglycerate kinase (1,3-BPG + GDP → 3-PG + GTP) and a GTP-dependent PEP carboxykinase (EC 4.1.1.32: PEP + GDP + CO2 → oxaloacetate + GTP); and

(iii) it generates 2 ATPs from an ATP-dependent pyruvate carboxylase (oxaloacetate + ADP → pyruvate + ATP + CO2).

The PPi consumed by PFK is proposed to be regenerated through the conversion of sedoheptulose-1,7-bisphosphate (SBP) to sedoheptulose 7-phosphate (S7P), which also requires additional carbon input. If we account for the PPi usage, then overall this C. cellulolyticum pathway generates an equivalent of 4 ATPs. In the pathway designed using minFlux (Figure 3.6C), the PPi is recouped at the end of the pathway through the conversion of PEP to oxaloacetate. Additionally, the Clostridia pathway relies on GTP, which differs from this design. However, the designed pathway requires almost 10-fold more protein than the EMP pathway to operate at the same glucose conversion flux due to the large backward flux of the reaction involving phosphorylation of glycerate by organic phosphate to form 2-phosphoglycerate. Similarly, a 13C-MFA study on C. cellulolyticum showed that significant backward flux is observed in the upper glycolysis due to its utilization of PPi-dependent PFK instead of the irreversible ATP-dependent PFK, which possibly leads to a lower net forward flux through the pathway [56].

Finally, a 5 ATP pathway bypasses an ATP investment in the upper glycolysis by combining transaldolase from the non-oxidative pentose phosphate pathway with the reverse of the Calvin cycle. It first converts fructose-6-phosphate and erythrose-4-phosphate (E4P) into GAP and S7P. The latter is then phosphorylated by the reversible sedoheptulose bisphosphatase [57] to form SBP. SBP is subsequently converted into dihydroxyacetone phosphate (DHAP) and E4P by fructose-bisphosphate (FBP) aldolase. The E4P returns to the previous steps, while DHAP is channeled into the lower glycolysis. The lower glycolysis is similar to that of the 4 ATP pathway in Figure 3.6B, which generates 6 ATPs. Hence, a net of 5 ATPs is produced after subtracting the ATP invested for fructose phosphorylation. The sedoheptulose bisphosphatase operates with a significant backward flux, which leaves the FBP aldolase far from saturation. In this way, the pathway is able to retain more free energy for ATP production.
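The net yields of the four designs described above follow from simple bookkeeping of ATP generated in the lower glycolysis versus ATP invested upstream. The sketch below tabulates the figures given in the text; the single ATP invested in 4 ATP pathway B is inferred from its stated gross (5) and net (4) yields rather than stated directly, and the dictionary keys are informal labels.

```python
# (ATP generated in lower glycolysis, ATP invested upstream), per glucose.
designs = {
    "3 ATP": (5, 2),
    "4 ATP, pathway A": (6, 2),
    "4 ATP, pathway B": (5, 1),  # investment inferred from gross 5 and net 4
    "5 ATP": (6, 1),
}

net_yield = {name: made - spent for name, (made, spent) in designs.items()}
for name, net in net_yield.items():
    print(f"{name}: net {net} ATP per glucose")
```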

Comparison of the protein cost distribution between the high ATP-yielding (Figure 3.6E-H) and low ATP-yielding pathways (Figure 3.5E-H) revealed that pathways with high ATP yield generally comprise reactions with higher backward flux as well as a low saturation level (i.e., substrate << KM). A reaction close to equilibrium is often followed by a reaction that is substrate sub-saturated. Although the overall pathway $\Delta_r G'^{\circ}$ is closer to equilibrium for higher ATP-yielding pathways (Figure 3.5I-L and Figure 3.6I-L), the overall $\Delta_r G'$ became more negative upon optimizing the metabolite concentrations to minimize the $\Delta_r G'$ of the thermodynamic bottleneck of a pathway or to minimize the protein cost. Nevertheless, at higher ATP yield, the energy content of glucose is generally split over more reaction steps (in particular ATP-generating steps), thereby causing more reactions to undergo a smaller drop in $\Delta_r G'$ (Figure 3.6I-L). Our protein cost approximation is currently based on a constant glucose-to-pyruvate conversion flux. However, under typical cellular conditions, flux could vary depending on the thermodynamic driving force of a reaction. In this case, the smaller drop in $\Delta_r G'$ implies a slower flux through the pathway (i.e., low ATP production flux), which could be detrimental to the growth rate of a cell if glycolysis is the major ATP-generating mechanism. Furthermore, we found that higher ATP pathways became thermodynamically infeasible when the bounds on metabolite concentrations were more restrictive (Table 3.1). This implies that a higher ATP-yielding pathway has less room (i.e., fewer degrees of freedom) for improving its thermodynamic profile by modulating the concentrations of the participating metabolites. Therefore, even if a high ATP-yielding pathway is feasible in a slow-growing organism such as C. cellulolyticum, moving it to a different organism (implying a different intracellular metabolite pool) may render the pathway infeasible. Overall, a combination of higher protein cost, lower pathway driving force and sensitivity to metabolite pools probably prevents these pathways from becoming prevalent in nature despite their high energetic yield.

Figure 3.6. Pathways generating 3 to 5 ATP designed using the modified minFlux procedure: (A) 3 ATP, (B) 4 ATP pathway A, (C) 4 ATP pathway B and (D) 5 ATP. The label beside each arrow represents the reaction ID and integer flux through each reaction. The pathway diagram is generated using the technique described in Methods. ATP and ADP cofactors are highlighted in red and pink, whereas other cofactors are highlighted in grey. (E-H) The distribution of protein cost through each pathway is displayed below each metabolic map. On each bar, the contribution of flux capacity, thermodynamics (i.e., high protein cost when a reaction is close to equilibrium, causing backward fluxes) and enzyme saturation level (i.e., high protein cost when substrate concentration < KM) to the protein cost is represented by blue, green and yellow stacked bars, respectively. Note that the y-axis is in log10-scale as the contribution of each term is multiplicative. For this calculation, we assumed that all enzymes are equally fast (kcat) and have the same kinetic properties (KM), and the arbitrary baseline enzyme is set to 20 µg Protein / (mmol Glc /h). (I-L) The thermodynamic profile of each pathway expressed as transformed standard Gibbs free energy (blue line), and as transformed Gibbs free energy of reaction, which accounts for the effect of metabolite concentrations, when the thermodynamic bottleneck (i.e., MDF) is minimized (green line) and when protein cost is minimized (red line). Regions shaded in grey highlight the reaction steps involving ATP.

The canonical glycolytic pathways are robust to changes in ATP/ADP concentration

We explored the feasible region of the ATP/ADP concentration domain ($c_{ATP}, c_{ADP} \in \mathbb{R}$) during MDF and ECM analysis. To this end, we first uniformly sampled 400 pairs of values from the logarithmic concentration ranges of ATP and ADP. For each pair of ATP/ADP concentrations, we assessed the thermodynamic feasibility of the pathway as well as its minimal protein cost if it was thermodynamically feasible. Figure 3.7A and Figure 3.7C show that the canonical glycolytic pathways ED and EMP are thermodynamically feasible across all ATP/ADP concentrations, in contrast to the most cost-efficient pathways (Figure 3.7B and Figure 3.7D). In particular, the ED pathway maintains a low protein cost across the entire range of ATP/ADP concentrations. This is surprising, as we expected that most glycolytic pathways would not be feasible when the ATP level is much higher than the ADP level (upper left edge of all plots in Figure 3.7), which lowers the driving force of the pathway to synthesize ATP. Based on Figure 3.7, higher ATP-yielding pathways seem to operate only when the ATP/ADP ratio is below 1; however, the ATP/ADP ratios measured in mammalian cells, yeast and E. coli are all above unity [23, 43]. To reflect the conditions in these organisms, we repeated the MDF test with the ATP/ADP ratio set above unity (Table 3.1, condition (vi)) and found that all 5 ATP pathways were rejected and only four 4 ATP pathways and 15% of the 3 ATP pathways remained feasible. To test a more stringent condition, ATP and ADP levels were set to the maximum (100 mM) and minimum (1 µM) of the concentration bounds, respectively, while other metabolites were allowed to vary between them (Table 3.1, condition (vii)). Only 11% of all the pathways were feasible. The reactions that constitute a pathway, especially those involving ATP/ADP, and the placement of the ATP/ADP-involving steps play a key role in determining whether a pathway is robust to ATP/ADP concentrations. For example, the 2,3-BPG to 2-PG conversion by 2-PG kinase can operate in the direction of 2-PG generation only at a high level of ADP and a low level of ATP, which restricts the pathways comprising this reaction to operate feasibly only under specific conditions. The robustness of the ED and EMP pathways within such a wide range of ATP/ADP concentrations suggests that they could operate in the direction of glycolysis even with significant perturbation of the ATP/ADP pool (e.g., due to stress or across different organisms), which could explain why the pathways survived natural selection.
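The sampling step can be sketched as follows. Whether the 400 pairs were placed on a regular grid or drawn at random is not stated; a 20 x 20 log-spaced grid between 1 µM and 100 mM is assumed here, along with T = 298.15 K and hypothetical function names. The second helper shows why high ATP-yielding pathways are the most sensitive: each net ADP → ATP conversion adds RT ln([ATP]/[ADP]) to the overall pathway Gibbs energy, so the penalty scales with the ATP yield.

```python
import math

RT = 8.314462618e-3 * 298.15  # kJ/mol, assuming T = 298.15 K


def log_spaced(lo, hi, n):
    """n values evenly spaced in log10 between lo and hi (inclusive)."""
    step = (math.log10(hi) - math.log10(lo)) / (n - 1)
    return [10 ** (math.log10(lo) + i * step) for i in range(n)]


def atp_adp_grid(lo=1e-6, hi=1e-1, n=20):
    """All (ATP, ADP) concentration pairs (M) on an n x n log-spaced grid."""
    axis = log_spaced(lo, hi, n)
    return [(atp, adp) for atp in axis for adp in axis]


def atp_synthesis_penalty(atp, adp, n_atp):
    """Shift in overall pathway dG' (kJ/mol) relative to ATP/ADP = 1."""
    return n_atp * RT * math.log(atp / adp)


pairs = atp_adp_grid()
print(len(pairs), "ATP/ADP pairs sampled")
for n_atp in (1, 2, 5):
    worst = atp_synthesis_penalty(1e-1, 1e-6, n_atp)
    print(f"{n_atp} ATP pathway, ATP = 100 mM, ADP = 1 uM: +{worst:.1f} kJ/mol")
```

In a full analysis, each sampled pair would be fixed as an equality constraint before rerunning the MDF and ECM optimizations for every pathway.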

Figure 3.7. The effect of ATP and ADP concentrations on the pathway thermodynamic feasibility and minimal enzyme cost of glycolytic pathway variants.

400 pairs of ATP and ADP concentrations were sampled uniformly from the log concentration ranges. The MDF analysis and ECM analysis were performed on each pathway when ATP and ADP concentrations were constrained to the sampled values. Grey regions indicate that the pathway is thermodynamically infeasible. The color scale indicates the minimal protein cost. Values above 32 mg protein/mmol glucose/h are shown in dark red.

3.4. Discussion

Flamholz et al. previously proposed that there exists a tradeoff between ATP yield and protein cost by showing that the ED pathway requires a lower enzyme cost than the EMP pathway [1]. However, it is not known where the two pathways stand in terms of their cost efficiency when compared to glycolytic alternatives that yield the same number of ATP. Knowing this could explain why a particular set of reactions was selected out of the many ways in which nature could assemble a pathway. We, therefore, extended their approach by analyzing a large number of synthetic glycolytic variants generated computationally using the modified optStoic procedure. Our simulation results suggest that the canonical ED and EMP pathways have possibly evolved to minimize the overall protein cost required to operate them. Since a significant portion of the proteome is commonly invested in glycolysis, reducing the operational cost would allow the cell more flexibility to allocate its resources to other cellular mechanisms.

In addition, we found that the canonical pathways are able to operate feasibly in the glycolytic direction despite significant perturbations of the metabolite pools as well as of the ATP/ADP ratio. Through our simulation, we showed that not every glycolytic pathway can achieve such robustness even if it has a low protein cost. This further indicates that the selection of reactions is important in conferring multiple favorable traits on a pathway. Metabolite concentrations could vary across a wide range not only in different species but also in the same cell growing under different conditions. Therefore, a pathway that appears universally across species must be able to retain its functionality in multiple conditions. Furthermore, the ATP/ADP ratio is commonly maintained above unity in different organisms. The ability of an ATP-generating pathway to operate with a sufficiently high driving force even at a high ATP-to-ADP ratio is therefore crucial for a continuous supply of energy to the cell. With this criterion imposed, we found that 66.9% of the glycolytic alternatives were rejected.

Although there is insufficient evidence as to which glycolytic pathway predates the other, if most species obtained energy through alternate means in the prehistoric era [7], it is likely that an earlier version of glycolysis began with a low ATP yield (i.e., 1 ATP), as it was just a pathway for carbohydrate metabolism. With the transition of the habitat of certain species, which forced them to rely on sugar (e.g., glucose) as an energy source, glycolysis then assumed the main role of energy generation. It is, therefore, possible that these cells could gradually afford to overcome the protein cost barrier by evolving towards using pathways with a higher energy yield. Even though we have found that it is possible to construct pathways with an even better energy yield (> 2 ATP), what then prevents these pathways from becoming ubiquitous? Through our analysis, we found that glycolytic pathways with higher ATP yields require a higher protein cost due to the presence of a significant number of steps with backward flux or non-saturation kinetics. In addition, the higher ATP-yielding pathways are less able to accommodate increasingly restrictive metabolite concentration ranges (Table 3.1), which could prevent them from spreading across species with different metabolite compositions. Our result agrees with a previous study by Waddell et al., who proposed that the optimal glycolytic pathway yield is 2 ATP/glucose, as increasing beyond that reduces both the carbon flux towards lactate and the ATP production rate [58]. They, however, based their estimation only on the standard Gibbs free energy of reaction [58].

Nevertheless, a higher protein cost, a lower flux efficiency and a greater sensitivity to metabolite concentrations do not necessarily prevent a high ATP-yielding glycolysis from appearing in nature. Even though most facultative anaerobes can generate ATP through the electron transport chain as well as through substrate-level phosphorylation by various kinases (e.g., acetate kinase), glycolysis is the major source of ATP during the production of biochemicals under anaerobic conditions [59]. A high ATP-yielding glycolysis is necessary to sustain biomass production, cell maintenance, product biosynthesis and product secretion [59]. This could explain why obligate anaerobes such as C. cellulolyticum and C. thermocellum can operate a higher energy-yielding glycolytic pathway [13, 60]. By accepting a slower flux in exchange for a higher ATP yield, an otherwise suboptimal pathway became indispensable for their survival. Recent efforts have tried to leverage these organisms for the biosynthesis of various products [59, 61].

In this study, the optStoic procedure is updated and leveraged to exhaustively trace pathways from glucose to pyruvate with varying energy yield. We have also employed two highly scalable approaches (i.e., MDF analysis [34] and ECM analysis [1, 39]) to exhaustively evaluate a large number of pathway designs. This general procedure can be extended to study other bioconversion pathways.

3.5. References

1. Flamholz A, Noor E, Bar-Even A, Liebermeister W, Milo R: Glycolytic strategy as a tradeoff between energy yield and protein cost. Proceedings of the National Academy of Sciences of the United States of America 2013, 110(24):10039-10044. 2. Chen X, Schreiber K, Appel J, Makowka A, Fähnrich B, Roettger M, Hajirezaei MR, Sönnichsen FD, Schönheit P, Martin WF et al: The Entner–Doudoroff pathway is an overlooked glycolytic route in cyanobacteria and plants. Proceedings of the National Academy of Sciences of the United States of America 2016, 113(19):5441-5446. 3. Chavarria M, Nikel PI, Perez-Pantoja D, de Lorenzo V: The Entner-Doudoroff pathway empowers Pseudomonas putida KT2440 with a high tolerance to oxidative stress. Environmental microbiology 2013, 15(6):1772-1785. 4. Klingner A, Bartsch A, Dogs M, Wagner-Dobler I, Jahn D, Simon M, Brinkhoff T, Becker J, Wittmann C: Large-Scale 13C flux profiling reveals conservation of the Entner-Doudoroff pathway as a glycolytic strategy among marine bacteria that use glucose. Applied and environmental microbiology 2015, 81(7):2408-2422. 5. Conway T: The Entner-Doudoroff pathway: history, physiology and molecular biology. FEMS microbiology reviews 1992, 9:1-27. 6. Labhsetwar P, Cole JA, Roberts E, Price ND, Luthey-Schulten ZA: Heterogeneity in protein expression induces metabolic variability in a modeled Escherichia coli population. Proceedings of the National Academy of Sciences of the United States of America 2013, 110(34):14006-14011. 7. Romano AH, Conway T: Evolution of carbohydrate metabolic pathways. Research in Microbiology 1996, 147(6):448-455. 8. Siebers B, Schönheit P: Unusual pathways and enzymes of central carbohydrate metabolism in Archaea. Current Opinion in Microbiology 2005, 8(6):695-705. 9. Taillefer M, Sparling R: Glycolysis as the Central Core of Fermentation. In: Anaerobes in Biotechnology. Edited by Hatti-Kaul R, Mamo G, Mattiasson B. Cham: Springer International Publishing; 2016: 55-77.

10. Siebers B, Hensel R: Glucose catabolism of the hyperthermophilic archaeum Thermoproteus tenax. FEMS microbiology letters 1993, 111(1):1-7.
11. Reher M, Gebhard S, Schonheit P: Glyceraldehyde-3-phosphate ferredoxin oxidoreductase (GAPOR) and nonphosphorylating glyceraldehyde-3-phosphate dehydrogenase (GAPN), key enzymes of the respective modified Embden-Meyerhof pathways in the hyperthermophilic crenarchaeota Pyrobaculum aerophilum and Aeropyrum pernix. FEMS microbiology letters 2007, 273(2):196-205.
12. Verhees CH, Kengen SWM, Tuininga JE, Schut GJ, Adams MWW, De Vos WM, Van Der Oost J: The unique features of glycolytic pathways in Archaea. Biochemical Journal 2003, 375(Pt 2):231-246.
13. Zhou J, Olson DG, Argyros DA, Deng Y, van Gulik WM, van Dijken JP, Lynd LR: Atypical glycolysis in Clostridium thermocellum. Applied and environmental microbiology 2013, 79(9):3000-3008.
14. Verho R, Richard P, Jonson PH, Sundqvist L, Londesborough J, Penttila M: Identification of the first fungal NADP-GAPDH from Kluyveromyces lactis. Biochemistry 2002, 41(46):13833-13838.
15. Martínez I, Zhu J, Lin H, Bennett GN, San K-Y: Replacing Escherichia coli NAD-dependent glyceraldehyde 3-phosphate dehydrogenase (GAPDH) with a NADP-dependent enzyme from Clostridium acetobutylicum facilitates NADPH dependent pathways. Metabolic engineering 2008, 10:352-359.
16. Arskold E, Lohmeier-Vogel E, Cao R, Roos S, Radstrom P, van Niel EW: Phosphoketolase pathway dominates in Lactobacillus reuteri ATCC 55730 containing dual pathways for glycolysis. Journal of bacteriology 2008, 190(1):206-212.
17. Fushinobu S: Unique Sugar Metabolic Pathways of Bifidobacteria. Bioscience, Biotechnology, and Biochemistry 2010, 74(12):2374-2384.
18. Bogorad IW, Lin TS, Liao JC: Synthetic non-oxidative glycolysis enables complete carbon conservation. Nature 2013, 502(7473):693-697.
19. Melendez-Hevia E, Waddell TG, Heinrich R, Montero F: Theoretical approaches to the evolutionary optimization of glycolysis--chemical analysis. European journal of biochemistry 1997, 244(2):527-543.
20. Noor E, Eden E, Milo R, Alon U: Central carbon metabolism as a minimal biochemical walk between precursors for biomass and energy. Molecular cell 2010, 39(5):809-820.
21. Bar-Even A, Flamholz A, Noor E, Milo R: Rethinking glycolysis: on the biochemical logic of metabolic pathways. Nature chemical biology 2012, 8(6):509-517.
22. Court SJ, Waclaw B, Allen RJ: Lower glycolysis carries a higher flux than any biochemically possible alternative. Nature communications 2015, 6:8427.

23. Park JO, Rubin SA, Xu YF, Amador-Noguez D, Fan J, Shlomi T, Rabinowitz JD: Metabolite concentrations, fluxes and free energies imply efficient enzyme usage. Nature chemical biology 2016.
24. O'Brien EJ, Lerman JA, Chang RL, Hyduke DR, Palsson BO: Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Molecular systems biology 2013, 9:693.
25. Peebo K, Valgepea K, Maser A, Nahku R, Adamberg K, Vilu R: Proteome reallocation in Escherichia coli with increasing specific growth rate. Molecular bioSystems 2015, 11(4):1184-1193.
26. Schmidt A, Kochanowski K, Vedelaar S, Ahrne E, Volkmer B, Callipo L, Knoops K, Bauer M, Aebersold R, Heinemann M: The quantitative and condition-dependent Escherichia coli proteome. Nature biotechnology 2016, 34(1):104-110.
27. Basan M, Hui S, Okano H, Zhang Z, Shen Y, Williamson JR, Hwa T: Overflow metabolism in Escherichia coli results from efficient proteome allocation. Nature 2015, 528(7580):99-104.
28. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 2000, 28(1):27-30.
29. Chowdhury A, Maranas CD: Designing overall stoichiometric conversions and intervening metabolic reactions. Scientific reports 2015, 5:16009.
30. Noor E, Haraldsdottir HS, Milo R, Fleming RM: Consistent estimation of Gibbs energy using component contributions. PLoS computational biology 2013, 9(7):e1003098.
31. O'Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR: Open Babel: An open chemical toolbox. Journal of cheminformatics 2011, 3:33.
32. Fleming RM, Thiele I, Nasheuer HP: Quantitative assignment of reaction directionality in constraint-based models of metabolism: application to Escherichia coli. Biophysical chemistry 2009, 145(2-3):47-56.
33. Schellenberger J, Lewis NE, Palsson BO: Elimination of thermodynamically infeasible loops in steady-state metabolic models. Biophysical journal 2011, 100(3):544-553.
34. Noor E, Bar-Even A, Flamholz A, Reznik E, Liebermeister W, Milo R: Pathway thermodynamics highlights kinetic obstacles in central metabolism. PLoS computational biology 2014, 10(2):e1003483.
35. Biemans-Oldehinkel E, Mahmood NA, Poolman B: A sensor for intracellular ionic strength. Proceedings of the National Academy of Sciences of the United States of America 2006, 103(28):10624-10629.
36. Storey KB: Functional Metabolism: Regulation and Adaptation. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2004.
37. Zhang J, Pierick At, van Rossum HM, Maleki Seifar R, Ras C, Daran J-M, Heijnen JJ, Aljoscha Wahl S: Determination of the Cytosolic NADPH/NADP Ratio in Saccharomyces cerevisiae using Shikimate Dehydrogenase as Sensor Reaction. Scientific reports 2015, 5:12846.

38. Milo R, Jorgensen P, Moran U, Weber G, Springer M: BioNumbers--the database of key numbers in molecular and cell biology. Nucleic acids research 2010, 38(Database issue):D750-753.
39. Noor E, Flamholz A, Bar-Even A, Davidi D, Milo R, Liebermeister W: The Protein Cost of Metabolic Fluxes: Prediction from Enzymatic Rate Laws and Cost Minimization. PLoS computational biology 2016, 12(11):e1005167.
40. Bar-Even A, Noor E, Savir Y, Liebermeister W, Davidi D, Tawfik DS, Milo R: The moderately efficient enzyme: evolutionary and physicochemical trends shaping enzyme parameters. Biochemistry 2011, 50(21):4402-4410.
41. Lehninger AL, Nelson DL, Cox MM: Lehninger principles of biochemistry, 6th edn. New York: W.H. Freeman; 2013.
42. Price ND, Famili I, Beard DA, Palsson BØ: Extreme Pathways and Kirchhoff's Second Law. Biophysical journal 2002, 83(5):2879-2882.
43. Bennett BD, Kimball EH, Gao M, Osterhout R, Van Dien SJ, Rabinowitz JD: Absolute metabolite concentrations and implied enzyme active site occupancy in Escherichia coli. Nature chemical biology 2009, 5(8):593-599.
44. de Godoy LMF, Olsen JV, Cox J, Nielsen ML, Hubner NC, Frohlich F, Walther TC, Mann M: Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 2008, 455(7217):1251-1254.
45. Liebermeister W, Noor E, Flamholz A, Davidi D, Bernhardt J, Milo R: Visual account of protein investment in cellular functions. Proceedings of the National Academy of Sciences 2014, 111(23):8488-8493.
46. Madhukar NS, Warmoes MO, Locasale JW: Organization of enzyme concentration across the metabolic network in cancer cells. Plos One 2015, 10(1):e0117131.
47. Dekel E, Alon U: Optimality and evolutionary tuning of the expression level of a protein. Nature 2005, 436(7050):588-592.
48. Lewis NE, Hixson KK, Conrad TM, Lerman JA, Charusanti P, Polpitiya AD, Adkins JN, Schramm G, Purvine SO, Lopez-Ferrer D et al: Omic data from evolved E. coli are consistent with computed optimal growth from genome-scale models. Molecular systems biology 2010, 6:390.
49. Noor E, Flamholz A, Liebermeister W, Bar-Even A, Milo R: A note on the kinetics of enzyme action: a decomposition that highlights thermodynamic effects. FEBS Lett 2013, 587(17):2772-2777.
50. Jankowski MD, Henry CS, Broadbelt LJ, Hatzimanikatis V: Group contribution method for thermodynamic analysis of complex metabolic networks. Biophysical journal 2008, 95(3):1487-1499.
51. Cho J, King JS, Qian X, Harwood AJ, Shears SB: Dephosphorylation of 2,3-bisphosphoglycerate by MIPP expands the regulatory capacity of the Rapoport–Luebering glycolytic shunt. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(16):5998-6003.

52. Lehmacher A, Hensel R: Cloning, sequencing and expression of the gene encoding 2-phosphoglycerate kinase from Methanothermus fervidus. Molecular & general genetics : MGG 1994, 242(2):163-168.
53. Aravind L, Wolf YI, Koonin EV: The ATP-cone: an evolutionarily mobile, ATP-binding regulatory domain. Journal of molecular microbiology and biotechnology 2000, 2(2):191-194.
54. Budgen N, Danson MJ: Metabolism of glucose via a modified Entner-Doudoroff pathway in the thermoacidophilic archaebacterium Thermoplasma acidophilum. FEBS Letters 1986, 196(2):207-210.
55. Brunner NA, Brinkmann H, Siebers B, Hensel R: NAD+-dependent glyceraldehyde-3-phosphate dehydrogenase from Thermoproteus tenax: the first identified archaeal member of the aldehyde dehydrogenase superfamily is a glycolytic enzyme with unusual regulatory properties. Journal of Biological Chemistry 1998, 273(11):6149-6156.
56. Rabinowitz JD, Aristilde L, Amador-Noguez D: Metabolomics of Clostridial Biofuel Production. Washington, D.C.: United States. Dept. of Energy. Office of Science; 2015.
57. Stolzenberger J, Lindner SN, Persicke M, Brautaset T, Wendisch VF: Characterization of Fructose 1,6-Bisphosphatase and Sedoheptulose 1,7-Bisphosphatase from the Facultative Ribulose Monophosphate Cycle Methylotroph Bacillus methanolicus. Journal of bacteriology 2013, 195(22):5112-5122.
58. Waddell TG, Repovic P, Meléndez-Hevia E, Heinrich R, Montero F: Optimization of glycolysis: new discussions. Biochemical Education 1999, 27(1):12-13.
59. Cueto-Rojas HF, van Maris AJA, Wahl SA, Heijnen JJ: Thermodynamics-based design of microbial cell factories for anaerobic product formation. Trends in Biotechnology 2015, 33(9):534-546.
60. Dash S, Khodayari A, Zhou J, Holwerda EK, Olson DG, Lynd LR, Maranas CD: Development of a core Clostridium thermocellum kinetic metabolic model consistent with multiple genetic perturbations. Biotechnology for biofuels 2017, 10(1):108.
61. Tracy BP, Jones SW, Fast AG, Indurthi DC, Papoutsakis ET: Clostridia: the importance of their exceptional substrate and metabolite diversity for biofuel and biorefinery applications. Current opinion in biotechnology 2012, 23(3):364-381.

4. Chapter 4 COMPUTATIONAL STRAIN DESIGN FOR THE OVERPRODUCTION OF BIOCHEMICALS

Portions of this chapter have been previously published in modified form in Metabolic Engineering (Suástegui M., Ng C.Y., Chowdhury A., Sun W., Cao M., House E., Maranas C.D. and Shao Z. (2017), “Multilevel engineering of the upstream aromatic module in Saccharomyces cerevisiae for high production of polymer and drug precursors”, Metabolic Engineering, 42, 134-144).

4.1. Introduction

Advances in genome sequencing technology have resulted in an explosive growth of genome sequence data. Bioinformatics techniques such as gene and protein BLAST [1] enable quick annotation of genome sequences. The availability of annotated sequences has led to the development of genome-scale metabolic models [2], which represent all biochemical reactions encoded by the (metabolic) genes of an organism in a mathematical format (i.e., a stoichiometric matrix and Boolean gene-protein-reaction associations). A genome-scale metabolic model can be used to simulate the flux distribution (phenotype) upon genetic (genotype) or environmental perturbations; hence, it is a powerful tool for diverse applications [3, 4], especially in metabolic engineering. By exhaustively exploring the yield of single, double and triple knockouts of E. coli using flux balance analysis (FBA), Park et al. sequentially identified knockout strategies to improve L-valine production [5]. In contrast to such an exhaustive search approach, various optimization-based computational strain design algorithms have been developed [6] (see Chapter 1), which can pinpoint the top-ranking reaction and genetic interventions required to increase the production of a target metabolite. These computational strain design strategies have been successfully implemented in experimental studies for the production of various biochemicals including lactate, 1,4-butanediol, 2,3-butanediol, lycopene, tyrosine, fatty acids and malonyl-CoA [7-13].
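
The Boolean gene-protein-reaction (GPR) format mentioned above can be illustrated with a minimal sketch. The gene names, reaction names and rules below are hypothetical and are not taken from any published model; the sketch only assumes the usual convention that "and" denotes subunits of an enzyme complex and "or" denotes isozymes.

```python
# Hypothetical toy GPR rules (illustrative only, not from a real model).
# A rule is a nested tuple: ("gene", name), ("and", r1, ...), ("or", r1, ...).

def reaction_active(rule, deleted_genes):
    """Return True if the reaction can still be catalyzed after the deletions."""
    kind = rule[0]
    if kind == "gene":
        return rule[1] not in deleted_genes
    if kind == "and":  # enzyme complex: every subunit is required
        return all(reaction_active(r, deleted_genes) for r in rule[1:])
    if kind == "or":   # isozymes: any one of them is sufficient
        return any(reaction_active(r, deleted_genes) for r in rule[1:])
    raise ValueError(f"unknown rule type: {kind}")


def blocked_reactions(gpr, deleted_genes):
    """List the reactions whose GPR rule evaluates to False."""
    return sorted(
        rxn for rxn, rule in gpr.items()
        if not reaction_active(rule, deleted_genes)
    )


toy_gpr = {
    "R_isozymes": ("or", ("gene", "geneA"), ("gene", "geneB")),
    "R_complex": ("and", ("gene", "geneC"), ("gene", "geneD")),
}

print(blocked_reactions(toy_gpr, {"geneA"}))           # isozyme geneB still covers R_isozymes
print(blocked_reactions(toy_gpr, {"geneC"}))           # losing one subunit blocks R_complex
print(blocked_reactions(toy_gpr, {"geneA", "geneB"}))  # both isozymes deleted
```

In a simulation, the flux bounds of every blocked reaction would be set to zero before solving for the flux distribution, which is how a gene-level perturbation is translated into a reaction-level one.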

In particular, the OptForce procedure [14] is a computationally tractable and deterministic algorithm that can identify reaction-level knockout, overexpression or down-regulation strategies. The procedure incorporates experimental flux measurements in its prediction to better characterize the reference strain and has been successfully applied in several case studies [9, 11]. The OptForce procedure has the following advantages when compared to other algorithms: (i) It can identify not only knockout strategies but also overexpression and down-regulation strategies, whereas algorithms such as OptKnock [15], RobustKnock [16], OptGene [17] and Genetic Design by Local Search (GDLS) [18] can only identify knockout strategies. (ii) It incorporates experimental flux data from 13C-metabolic flux analysis to first constrain the reference strain (e.g., wild-type) flux distribution, leading to a more accurate representation of the reference phenotype. In this way, the procedure can identify flux changes that must be implemented on the reference strain to generate the product. (iii) When compared to OptReg [19], it has a lower computational time as it employs a flux variability analysis (FVA) step to limit the solution space when solving the MILP problem. (iv) As a deterministic algorithm, it prioritizes the most important set of manipulations needed to achieve the overproduction when searching for the solution. An evolutionary algorithm-based approach such as OptGene/GDLS generally identifies solutions that are locally optimal or near optimal (i.e., they may not be the best set of interventions).

In this chapter, the OptForce strain design procedure is used to identify reaction-level interventions for the overproduction of (i) shikimic acid and muconic acid, (ii) malonyl-CoA in Saccharomyces cerevisiae and (iii) isoprene in Synechocystis sp. PCC 6803.

4.2. Case study I: Overproduction of shikimic acid and muconic acid in S. cerevisiae

Objectives

The optimization-based OptForce algorithm [14] was implemented to identify target reaction-level interventions leading to the overproduction of shikimic acid (SA) and muconic acid (MA) individually. In particular, the objective of the computational study was to identify key motifs in the intervention strategies that were conserved, and those that differed due to cofactor or energy equivalent requirements, when attempting to overproduce two metabolites in the same branch of metabolism (i.e., the aromatic amino acid pathway). The genome-scale metabolic model iAZ900 [20] was used to simulate the metabolic flux profiles in S. cerevisiae for both target molecules. For the case of SA production, the base model was modified by adding the SA exchange and transport reactions. Likewise, for simulation of MA production, the MA production pathway was included in the model by adding the 3-dehydroshikimate dehydratase (SKHL), protocatechuate decarboxylase (PCCL) and catechol 1,2-dioxygenase (CATO) reactions, along with the corresponding exchange and transport reactions. All simulations were performed in aerobic minimal media with glucose as the sole carbon substrate, mimicking the experimental fermentation conditions. Details of the simulation conditions and regulation are described in section 4.2.2. The CPLEX optimization software was used to solve the mixed-integer programming problems to optimality, and was accessed through the General Algebraic Modeling System (GAMS) optimization package. Note that the reaction-level intervention strategies suggested by OptForce were converted to gene-level suggestions using the gene-protein-reaction relationships mined from the iAZ900 model, as well as from the most recent curation of S. cerevisiae metabolism reported in the literature [21].

Methods

In all simulations, the maximum glucose and oxygen uptake rates were set to 100 mmol gDW⁻¹ h⁻¹ and 200 mmol gDW⁻¹ h⁻¹, respectively. The regulation of tricarboxylic acid (TCA) cycle activity under aerobic glucose conditions (i.e., the Crabtree effect) was originally imposed in the iAZ900 model by limiting the oxygen uptake rate [20]. In this simulation, it is replaced by directly imposing an upper bound of 20 mmol gDW⁻¹ h⁻¹ on the mitochondrial cytochrome c oxidase (CYOOm) reaction flux. This modification preserved the phenotypes observed due to the Crabtree effect in the model without restricting the additional oxygen consumed in the MA pathway. Under minimal media conditions, the maximum theoretical values were 2.89 h⁻¹ for biomass formation, 64 mmol gDW⁻¹ h⁻¹ for SA, and 62.13 mmol gDW⁻¹ h⁻¹ for MA.

In brief, the procedure for identifying the reaction-level interventions is described as follows.

Step 1: 13C metabolic flux analysis (MFA) data for wild-type S. cerevisiae (Jacqueline Shanks, Iowa State University, personal communication) was used to constrain the flux ranges for reactions in the central metabolism. Flux variability analysis (FVA) is performed to identify the wild-type strain flux ranges ($[v_j^{ref,L}, v_j^{ref,U}]$).

Step 2: The flux ranges for overproducing strain ($[v_j^{OS,L}, v_j^{OS,U}]$) are identified by performing FVA on the metabolic network that is subjected to the desired overproduction target (90% of $v_{MA}^{max}$ or $v_{SA}^{max}$) and at least 10% of theoretical biomass yield ($v_{biomass}^{max}$).

Step 3: By superimposing the flux ranges for the wild-type and the overproducing strain, the sets of reactions that must be up-regulated (MUSTU), down-regulated (MUSTL) and knocked out (MUSTX) are then determined. Second-order MUST sets (i.e., MUSTUU, MUSTUL and MUSTLL) were also identified according to previous work [14].

Step 4: The minimal set of interventions (FORCE sets) that guarantees the yield of the target product under the worst-case scenario was then selected from the MUST sets. To this end, a bi-level mixed-integer optimization problem was formulated such that the objectives of the outer and the inner problems are the maximization and the minimization of the target product yield, respectively [22]. The lower bound of biomass formation is set to 10% of $v_{biomass}^{max}$ to ensure viability. The flux range for each reaction, excluding biomass formation and nutrient uptake, is restricted to $[\min\{v_j^{ref,L}, v_j^{OS,L}\}, \max\{v_j^{ref,U}, v_j^{OS,U}\}]$. Note that the minimum guaranteed production flux and yield for each of the mutants were calculated under the same conditions as Step 4. All optimization problems were solved in GAMS 24.4.1 with CPLEX Solver v.12.6.1.
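
The range comparison in Step 3 and the range restriction in Step 4 can be sketched as follows. This is a minimal illustration, not the GAMS implementation used in this work. It assumes the first-order criteria are simple non-overlap tests between the two FVA ranges, and that a reaction is assigned to MUSTX when its overproducer range collapses to zero while its wild-type range excludes zero. The reaction names and flux values are invented for the example.

```python
def classify_must(ref, over, tol=1e-9):
    """Compare wild-type (ref) and overproducer (over) flux ranges per reaction.

    ref and over map a reaction id to a (lower, upper) tuple.
    Returns three sorted lists: MUSTU, MUSTL, MUSTX.
    """
    must_u, must_l, must_x = [], [], []
    for rxn, (ref_lo, ref_hi) in ref.items():
        os_lo, os_hi = over[rxn]
        zero_in_os = abs(os_lo) <= tol and abs(os_hi) <= tol
        zero_in_ref = ref_lo <= tol and ref_hi >= -tol
        if zero_in_os and not zero_in_ref:
            must_x.append(rxn)   # flux must vanish in the overproducer
        elif os_lo > ref_hi + tol:
            must_u.append(rxn)   # overproducer range lies above the wild-type range
        elif os_hi < ref_lo - tol:
            must_l.append(rxn)   # overproducer range lies below the wild-type range
    return sorted(must_u), sorted(must_l), sorted(must_x)


def allowed_range(ref_range, over_range):
    """Range used in Step 4: [min of lower bounds, max of upper bounds]."""
    return (min(ref_range[0], over_range[0]), max(ref_range[1], over_range[1]))


# Invented flux ranges (mmol gDW^-1 h^-1) for four hypothetical reactions.
ref = {"R_up": (1.0, 2.0), "R_down": (5.0, 8.0), "R_ko": (0.5, 1.5), "R_same": (0.0, 10.0)}
over = {"R_up": (3.0, 6.0), "R_down": (1.0, 4.0), "R_ko": (0.0, 0.0), "R_same": (2.0, 5.0)}

must_u, must_l, must_x = classify_must(ref, over)
print(must_u, must_l, must_x)
print(allowed_range(ref["R_up"], over["R_up"]))
```

Reactions whose ranges overlap (R_same in the example) fall into none of the first-order sets; such reactions are the candidates examined by the second-order comparisons.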

Strategies for shikimic acid overproduction

Figure 4.1 shows the intervention strategies predicted by OptForce for improved SA overproduction. OptForce first suggested two interventions consistent with strategies that have been successfully implemented for producing aromatic amino acid pathway derivatives in previous studies [5, 6]. These include up-regulation of transketolase (TKL1) to rewire the pentose phosphate pathway (PPP) and of 3-dehydroquinate synthase (DHQS, by replacing the native penta-functional gene ARO1 with a mutant ARO1D920A [23]) to channel the carbon flux towards SA. In addition, a 2.9-fold down-regulation of pyruvate kinase (PYK) was identified as one of the single interventions that could improve SA production (Table 4.1). Down-regulation of PYK allows the precursor PEP to accumulate by slowing down its conversion to pyruvate. Note that while this intervention appeared to be detrimental towards up-regulating the aromatic amino acid pathway as observed in Gold et al. [24], we believe that this was because the intervention was combined with the deletion of ZWF1, causing an NADPH deficiency in the cell. In contrast to the prediction of GDLS [24], OptForce did not suggest the knockout of ZWF1 because SA production requires the cofactor NADPH. Knocking out ZWF1 in silico reduced the theoretical maximum yield of SA by 4.8% (from 64.0 mmol gDW⁻¹ h⁻¹ to 60.9 mmol gDW⁻¹ h⁻¹) because metabolic flux has to be drained towards competing metabolic pathways such as acetaldehyde dehydrogenase to supply the required NADPH.
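
The 4.8% reduction quoted for the in silico ZWF1 knockout follows directly from the two maximum SA fluxes:

```latex
\[
\frac{64.0 - 60.9}{64.0} \times 100\% \approx 4.8\%
\]
```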

OptForce has also identified four single-reaction interventions that were not reported previously for shikimate production. They include down-regulation of the reactions 3-phosphoglycerate kinase (PGK1) and glyceraldehyde-3-phosphate dehydrogenase (TDH1), knockout of phosphofructokinase 1 (PFK1), and overexpression of ribose-5-phosphate ketol-isomerase (RKI1) (Figure 4.1B, Table 4.1). Overall, the intention of all these interventions was to divert carbon flux from glycolysis towards the biosynthesis of the precursor erythrose-4-phosphate (E4P). The in silico overexpression of RKI1 resulted in the highest increase in SA among the four targets, achieving 83.45% of the theoretical yield. RKI1p catalyzes the interconversion of ribose-5-phosphate and ribulose-5-phosphate in the PPP (Figure 4.1A). According to OptForce, its overexpression could generate a higher flux into the aromatic amino acid pathway by maintaining a higher carbon pull into the non-oxidative PPP, i.e., by directing the flux towards the formation of E4P and preventing it from recirculating back into glycolysis (Figure 4.1A).

While RKI1 overexpression along with TKL1 overexpression and DHQS upregulation drives higher carbon flux towards shikimate biosynthesis (Figure 4.1C), it is important to prevent the depletion of shikimate into the downstream chorismate biosynthesis pathway. To this end, OptForce suggested the deletion of shikimate kinase (SHKK) which phosphorylates shikimate to form shikimate-3-phosphate. The combination of these four interventions improved shikimate yield up to 0.55 g SA/g glucose, representing 89.41% of the maximum theoretical yield (Figure 4.1C). These interventions were implemented experimentally by our collaborators [25] resulting in the highest titer of shikimate to date.

Figure 4.1. Metabolic interventions identified with OptForce for production of shikimic acid (SA). (A) Simplified map of central carbon metabolism depicting the upstream pathway (glycolysis and PPP) leading towards the aromatic amino acid pathway. The flux ranges (in mmol gDW⁻¹ h⁻¹) obtained through flux variability analysis are shown for the wild-type (top, purple) and the overproducer (bottom, blue) when glucose uptake is 100 mmol gDW⁻¹ h⁻¹. The sign of each flux value corresponds to the direction of the arrow (i.e., a negative value indicates that the net flux traverses in the reverse direction). Green, red and orange arrows represent overexpression, down-regulation and deletion of genes, respectively. (B) Maximum yield achievable by down-regulation, deletion, or overexpression of the selected novel genes. The values on top of each bar indicate the percentage of the theoretical maximum yield (i.e., 0.615 g SA g⁻¹). (C) In silico strain construction of the maximum SA producing strain. The overexpression of the genes RKI1, TKL1 and aro1D920A (DHQS), in combination with down-regulation of ARO1 (SHKK), led to a yield equivalent to 89.41% of the maximum theoretical yield. Green and red circles represent overexpression and down-regulation of genes, respectively. The maximum theoretical yield was determined after constraining the model with flux values from 13C labeling experiments (see section 4.2.2).

Strategies for muconic acid overproduction

Muconic acid (MA) can be produced from the intermediates of the aromatic amino acid biosynthesis pathway of S. cerevisiae by introducing three heterologous reactions (genes): (i) 3-dehydroshikimate dehydratase (SKHL encoded by AROZ), (ii) protocatechuate decarboxylase (PCCL encoded by AROY) and (iii) catechol 1,2-dioxygenase (CATO encoded by HQD2) (Figure 4.2A). SKHL converts 3-dehydroshikimate (DHS) into protocatechuate, which subsequently undergoes decarboxylation to form catechol. Catechol then undergoes an oxidative ring cleavage catalyzed by CATO to generate MA. To this end, we introduced all three non-native reactions into the iAZ900 model (see section 4.2.1) and then used OptForce to identify strategies leading to the overproduction of MA.

Both the MA and SA producing pathways compete for the precursor DHS, but only SA requires the redox cofactor NADPH (Figure 4.2). Interestingly, OptForce identifies the deletion of glucose-6-phosphate dehydrogenase encoded by ZWF1 as a strategy to reduce the availability of NADPH for SA production and thereby allow DHS to accumulate for the production of MA. Other interventions to increase precursors for MA production are similar to those for SA (see section 4.2.3). They include overexpression of transketolase (TKL1), ribose-5-phosphate isomerase (RKI1), and 3-dehydroquinate synthase (Figure 4.2, Table 4.1). A combination of these four interventions led to the highest yield of MA at 91.21% of the theoretical maximum yield (Figure 4.2).
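
The conserved and product-specific interventions can be made explicit with a simple set comparison. The sketch below is only illustrative: it takes the reaction-level interventions of the best SA and MA designs as listed in Table 4.1 (for MA, the four-intervention design that includes the G6PDH2 knockout), encoded with the hypothetical prefixes "up:", "down:" and "ko:".

```python
# Reaction-level interventions of the best designs, as listed in Table 4.1.
sa_best = {"up:TKT1", "up:TKT2", "up:DHQS", "down:SHKK", "up:RPI"}
ma_best = {"up:TKT1", "up:DHQS", "up:RPI", "ko:G6PDH2"}

conserved = sorted(sa_best & ma_best)  # shared by both products
sa_only = sorted(sa_best - ma_best)    # specific to shikimic acid
ma_only = sorted(ma_best - sa_best)    # specific to muconic acid

print("conserved:", conserved)
print("SA only:  ", sa_only)
print("MA only:  ", ma_only)
```

The shared interventions all act on precursor supply, whereas the product-specific ones act on the fate of shikimate (SHKK) or on NADPH availability (G6PDH2), in line with the cofactor argument made above.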

Figure 4.2. Metabolic interventions for the overproduction of muconic acid (MA) identified through OptForce analysis. Green, red, and orange arrows (and circles) represent overexpression, down-regulation and knockout of genes, respectively. A combination of ZWF1 knockout and TKL1, RKI1 and DHQS overexpression leads to 0.441 g MA g⁻¹ glucose, which is equivalent to 91.21% of the theoretical maximum yield.

Table 4.1. Metabolic interventions predicted by OptForce for (a) shikimic acid (SA) and (b) muconic acid (MA) overproduction. The theoretical maximum and minimum guaranteed fluxes are calculated under the conditions specified in Supplementary Methods. Although multiple isozymes may be associated with each reaction, only the major genes described in this study are shown in the gene-level interventions. Reaction abbreviations and the chemical equations written in the net flux direction are as follows: (i) DHQS, 3-dehydroquinate synthase (2dda7p => 3dhq + pi); (ii) G6PDH2, glucose-6-phosphate dehydrogenase (g6p + nadp => 6pgl + h + nadph); (iii) GAPD, glyceraldehyde-3-phosphate dehydrogenase (g3p + nad + pi <=> 13dpg + h + nadh); (iv) PFK, phosphofructokinase (atp + f6p => adp + fdp + h); (v) PGK, phosphoglycerate kinase (3pg + atp <=> 13dpg + adp); (vi) PYK, pyruvate kinase (adp + h + pep => atp + pyr); (vii) RPI, ribose-5-phosphate isomerase (ru5p <=> r5p); (viii) SHKK, shikimate kinase (atp + skm => adp + h + skm5p); (ix) SKHL, 3-dehydroshikimate dehydratase (3dhsk <=> h2o + pca); (x) TKT1, transketolase 1 (r5p + xu5p <=> g3p + s7p); (xi) TKT2, transketolase 2 (f6p + g3p <=> e4p + xu5p).

(a) Shikimic acid

| Metabolic interventions (Reaction level) | Metabolic interventions (Gene level) | Minimum guaranteed flux (mmol gDW⁻¹ h⁻¹) | Yield (g product g⁻¹ glucose) | % Theoretical maximum |
|---|---|---|---|---|
| ↓PGK | ↓PGK1 | 51.59 | 0.499 | 80.61% |
| ↓GAPD | ↓TDH1 | 51.59 | 0.499 | 80.61% |
| ΔPFK | ΔPFK1 | 51.85 | 0.501 | 81.02% |
| ↑RPI | ↑RKI1 | 53.41 | 0.516 | 83.45% |
| ↑TKT1 | ↑TKL1 | 54.71 | 0.526 | 85.48% |
| ↓PYK | ↓CDC19 | 55.06 | 0.529 | 86.03% |
| ↑TKT2 | ↑TKL1 | 55.52 | 0.537 | 86.75% |
| ↑TKT1 ↑TKT2 | ↑TKL1 | 56.19 | 0.543 | 87.80% |
| ↑TKT1 ↑TKT2 ↑RPI | ↑TKL1 ↑RKI1 | 56.19 | 0.543 | 87.80% |
| ↑TKT1 ↑TKT2 ↓SHKK | ↑TKL1 ↓ARO1 | 56.49 | 0.546 | 88.27% |
| ↑TKT1 ↑TKT2 ↑DHQS | ↑TKL1 ↑ARO1D920A | 57.13 | 0.552 | 89.27% |
| ↑TKT1 ↑TKT2 ↑DHQS ↓SHKK ↑RPI | ↑TKL1 ↑ARO1D920A ↓ARO1 ↑RKI1 | 57.22 | 0.553 | 89.41% |

(b) Muconic acid

| Metabolic interventions (Reaction level) | Metabolic interventions (Gene level) | Minimum guaranteed flux (mmol gDW⁻¹ h⁻¹) | Yield (g product g⁻¹ glucose) | % Theoretical maximum |
|---|---|---|---|---|
| ΔG6PDH2 | ΔZWF1 | 19.88 | 0.155 | 32.00% |
| ↑RPI | ↑RKI1 | 25.07 | 0.195 | 40.35% |
| ↑TKT1 | ↑TKL1 | 31.35 | 0.244 | 50.46% |
| ↑TKT1 ΔG6PDH2 | ↑TKL1 ΔZWF1 | 51.44 | 0.4 | 82.79% |
| ↑TKT2 | ↑TKL1 | 51.63 | 0.401 | 83.10% |
| ↑TKT1 ↑RPI ΔG6PDH2 | ↑TKL1 ↑RKI1 ΔZWF1 | 52.73 | 0.41 | 84.87% |
| ↑TKT2 ↑TKT1 | ↑TKL1 | 53.24 | 0.414 | 85.69% |
| ↑TKT1 ↑RPI ΔG6PDH2 ↑SKHL | ↑TKL1 ↑RKI1 ΔZWF1 ↑AROZ | 55.92 | 0.435 | 90.00% |
| ↑TKT2 ↑TKT1 ↑SKHL | ↑TKL1 ↑AROZ | 55.92 | 0.435 | 90.00% |
| ↑TKT1 ↑DHQS ↑RPI | ↑TKL1 ↑ARO1D920A ↑RKI1 | 56.67 | 0.441 | 91.21% |
| ↑TKT1 ↑TKT2 ↑DHQS ↑RPI | ↑TKL1 ↑ARO1D920A ↑RKI1 | 56.67 | 0.441 | 91.21% |
| ↑TKT1 ↑TKT2 ↑DHQS ↑RPI ΔG6PDH2 | ↑TKL1 ↑ARO1D920A ↑RKI1 ΔZWF1 | 56.67 | 0.441 | 91.21% |
| ↑TKT1 ↑DHQS ↑RPI ΔG6PDH2 | ↑TKL1 ↑ARO1D920A ↑RKI1 ΔZWF1 | 56.67 | 0.441 | 91.21% |
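
The last two columns of Table 4.1 can be recomputed from the minimum guaranteed flux, which serves as a quick consistency check. The sketch below assumes that the percentage is the ratio of the guaranteed flux to the maximum theoretical flux reported in the Methods (64 and 62.13 mmol gDW⁻¹ h⁻¹ for SA and MA), and that the SA mass yield uses molar masses of 174.15 g/mol for shikimic acid and 180.16 g/mol for glucose. These molar masses are not stated in the text and are an assumption of this example.

```python
GLUCOSE_UPTAKE = 100.0                  # mmol gDW^-1 h^-1, from the Methods
MAX_FLUX = {"SA": 64.0, "MA": 62.13}    # mmol gDW^-1 h^-1, from the Methods
MW_SHIKIMIC_ACID = 174.15               # g/mol, assumed for this sketch
MW_GLUCOSE = 180.16                     # g/mol, assumed for this sketch


def percent_of_max(flux, product):
    """Guaranteed flux as a percentage of the maximum theoretical flux."""
    return round(100.0 * flux / MAX_FLUX[product], 2)


def sa_mass_yield(flux):
    """Convert an SA flux into g SA per g glucose at the fixed glucose uptake."""
    return round(flux * MW_SHIKIMIC_ACID / (GLUCOSE_UPTAKE * MW_GLUCOSE), 3)


print(percent_of_max(51.59, "SA"), sa_mass_yield(51.59))  # row: PGK down-regulation
print(percent_of_max(57.22, "SA"), sa_mass_yield(57.22))  # row: best SA design
print(percent_of_max(56.67, "MA"))                        # row: best MA design
```

Under these assumptions the recomputed SA rows agree with the tabulated values to the printed precision. The MA mass yields are not recomputed here because the molar mass used for muconic acid is not given.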

4.3. Case study II: Overproduction of malonyl-CoA in S. cerevisiae

Methods

The genome-scale metabolic model of S. cerevisiae iAZ900 [20] was used for all the simulations in this case study. Unless stated otherwise, all steps are similar to the methods described in section 4.2.2. A transport reaction (MALCOAt: malcoa[c] + h[c] <=> malcoa[e] + h[e] + coa[c]) and an exchange reaction (EX_malcoa(e): malcoa[e] => ) for malonyl-CoA are added to the model as a sink for malonyl-CoA. The transport reaction is written such that coenzyme A does not leave the system along with the exported malonyl-CoA. All simulations are performed under aerobic glucose minimal media conditions with glucose and oxygen uptake rates of 100 mmol gDW⁻¹ h⁻¹ and 20 mmol gDW⁻¹ h⁻¹, respectively. Under minimal media conditions, the maximum biomass formation rate is 2.88 h⁻¹, whereas the maximum theoretical malonyl-CoA flux is 71 mmol gDW⁻¹ h⁻¹. Visualization of flux distribution is performed using Escher [26].
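
The effect of writing the transport reaction in this way can be checked by summing the stoichiometry of the transport and exchange reactions. The following minimal sketch represents each reaction as a dictionary of stoichiometric coefficients (negative for substrates), with the transport reaction taken in its forward direction; the helper function is hypothetical and not part of any modeling package.

```python
def net_reaction(*reactions):
    """Sum stoichiometric coefficients over reactions and drop cancelled species."""
    net = {}
    for rxn in reactions:
        for metabolite, coeff in rxn.items():
            net[metabolite] = net.get(metabolite, 0) + coeff
    return {m: c for m, c in net.items() if c != 0}


# MALCOAt: malcoa[c] + h[c] <=> malcoa[e] + h[e] + coa[c]
MALCOAt = {"malcoa[c]": -1, "h[c]": -1, "malcoa[e]": 1, "h[e]": 1, "coa[c]": 1}
# EX_malcoa(e): malcoa[e] =>
EX_malcoa = {"malcoa[e]": -1}

sink = net_reaction(MALCOAt, EX_malcoa)
print(sink)
```

The net conversion is malcoa[c] + h[c] => coa[c] + h[e]: every unit of malonyl-CoA drained from the cytosol returns one unit of free coenzyme A to the cytosolic pool, so the sink does not deplete the cofactor.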

Strategies for malonyl-CoA overproduction

At the acetaldehyde node of S. cerevisiae, the carbon flux is partitioned into either ethanol or acetate production (Figure 4.3A). Ethanol production oxidizes the NADH produced in glycolysis, whereas conversion of acetaldehyde into acetate (the precursor for acetyl-CoA and malonyl-CoA) generates NADPH. Therefore, in order to overproduce malonyl-CoA, overexpression of NAD(P)H-oxidizing reactions or knockout/down-regulation of NADPH-generating reactions is required to balance the NADPH generated by acetaldehyde dehydrogenase (ALD). According to the simulation, the diphosphate generated in the acetyl-CoA synthetase step can be recycled through the nucleotide salvage reactions (e.g., adenine phosphoribosyltransferase), while the bicarbonate required in the acetyl-CoA carboxylase step can be generated from the carbon dioxide–bicarbonate equilibrating reaction. OptForce predicted several interventions to overproduce malonyl-CoA (Table 4.2). Although one would intuitively knock out alcohol dehydrogenase to increase flux towards acetate, OptForce suggested that alcohol dehydrogenase is able to recycle the NADPH generated by aldehyde dehydrogenase. Instead, in all strain designs (Table 4.2), the acetyl-CoA carboxylase (ACCOACr, encoded by gene ACC1), which converts acetyl-CoA to malonyl-CoA, is overexpressed. This seems to be sufficient to exert a pull of carbon flux towards malonyl-CoA. At k = 2 interventions, the overexpression of ACC1 is accompanied by the overexpression of either aspartate semialdehyde dehydrogenase (ASADi) or aspartate kinase (ASPKi) (Table 4.2). The two enzymes are intermediate steps in a pathway that converts pyruvate into acetaldehyde through pyruvate carboxylase (PC), aspartate synthesis and threonine synthesis, instead of pyruvate decarboxylase (PDC) (Figure 4.3B). Overexpression of ASADi is considered the more favorable of the two as it could also consume the NADPH produced by ALD. Although overexpression of ASPKi could increase flux through this particular pathway, it could also compete for the ATP cofactor required for malonyl-CoA synthesis. Both strategies could guarantee at least 49% of the theoretical malonyl-CoA yield (Table 4.2 and Figure 4.4).

At k = 3 interventions, the reactions in the pentose phosphate (PP) pathway are selected for knockout or down-regulation in addition to the interventions suggested for k = 2 (Table 4.2). Deleting the glucose-6-phosphate dehydrogenase (G6PDH, encoded by ZWF1) reaction along with the overexpression of ACCOACr and ASADi/ASPKi leads to the highest yield for the k = 3 group, reaching 0.48 mol/mol glucose (68% of theoretical yield). G6PDH, the first step of the PP pathway, is also one of the major suppliers of cytosolic NADPH for the yeast cell. Deleting this reaction could enforce NADPH production through ALD, which becomes an essential reaction for NADPH supply [27], and is thus favorable for malonyl-CoA biosynthesis. Alternate k = 3 solutions identified by OptForce with slightly lower yield than this mutant strain include deletion of ribulose-5-phosphate 3-epimerase (RPE) or down-regulation of phosphogluconate dehydrogenase (GND, another NADPH-generating reaction) instead of G6PDH knockout. Both mutants reduce the flux through the PP pathway, but might not be as effective as the G6PDH knockout.

At k = 4 interventions, the best strain (#8) combines the overexpression of ACCOACr, knockout of G6PDH, down-regulation of glutamine synthetase (GLNS, encoded by GLN1) and knockout of glycerol-3-phosphatase (G3PT, encoded by GPP1/GPP2). GLNS competes for ATP with the malonyl-CoA biosynthesis reactions (ACS and ACCOACr). As GLN1 is an essential gene in the S. cerevisiae S288C strain, down-regulating it instead of knocking it out would probably reduce growth but would still benefit malonyl-CoA accumulation. The selection by OptForce of G3PT instead of glycerol-3-phosphate dehydrogenase (G3PD, a common target for metabolic engineering) is interesting. G3PD is capable of recycling NADH, whereas G3PT is redox-independent but generates phosphate. It is possible that deleting G3PT increases the dependency on phosphate regeneration through ACCOACr. Alternative solutions for k = 4 substitute the G6PDH knockout with other PP pathway down-regulation strategies. A novel target that was not identified in the aforementioned solutions is the down-regulation of ribose-5-phosphate isomerase (RPI).

For k = 5 interventions, the best strain (#13) requires the knockout of NADP-dependent isocitrate dehydrogenase (ICDHy, encoded by IDP2) on top of strain #8. As with the G6PDH knockout, this target is selected to eliminate a non-essential NADPH-generating reaction and thereby couple NADPH production with malonyl-CoA synthesis. This strain generates up to 0.55 mol malonyl-CoA / mol glucose (77.2% of the theoretical maximum). Overexpressing ASADi/ASPKi in this strain gives the highest yield obtained thus far (0.56 mol/mol glucose, 79.2% of the theoretical maximum). Note that a higher number of interventions can be pursued, but as the relative improvement in yield has declined, we terminate OptForce at k = 6 interventions. The list of interventions for the best strain and the flux distribution under the worst-case scenario (minimization of malonyl-CoA production) are shown in Figure 4.6.


Figure 4.3. Pathways for malonyl-CoA biosynthesis in S. cerevisiae. (A) Malonyl-CoA biosynthesis pathway from pyruvate. (B) Pyruvate flux is partitioned between pyruvate decarboxylase (red arrow) and pyruvate carboxylase (pink arrow) when malonyl-CoA flux is maximized. The latter redirects the flux through the aspartate – homoserine – threonine biosynthesis pathway, which finally generates acetaldehyde and glycine. Although this pathway does not lose carbon as carbon dioxide and is also capable of re-oxidizing two NAD(P)H cofactors, it consumes three ATP. Therefore, less flux traverses this pathway than the pyruvate decarboxylase route, possibly because of its larger energy requirement, which competes with the ATP requirement of malonyl-CoA synthesis. Reaction and metabolite abbreviations are based on the iAZ900 model. The value beside each reaction abbreviation indicates its flux.


Table 4.2. Interventions identified using OptForce for malonyl-CoA overproduction. (A) Description of the reaction abbreviations, corresponding reactions, genes, interventions and the reaction direction of the interventions.

| Reaction ID | Name | Equation | Direction | Intervention | Gene |
|---|---|---|---|---|---|
| ACCOACr | Acetyl-CoA carboxylase | Bicarbonate + Acetyl-CoA + ATP <=> ADP + H + Malonyl-CoA + Phosphate | Forward | Upregulation | YNR016C |
| ASADi | Aspartate semialdehyde dehydrogenase | H + 4-Phospho-L-aspartate + NADPH --> L-Aspartate-4-semialdehyde + Phosphate + NADP | Forward | Upregulation | YDR158W |
| ASPKi | Aspartate kinase | L-Aspartate + ATP --> 4-Phospho-L-aspartate + ADP | Forward | Upregulation | YER052C |
| CSNATm | Carnitine-O-acetyltransferase (mitochondrial) | Coenzyme-A + O-Acetylcarnitine <=> Acetyl-CoA + L-Carnitine | Forward | Knockout | YML042W |
| G3PT | Glycerol-3-phosphatase | H2O + Glycerol-3-phosphate --> Phosphate + Glycerol | Forward | Knockout | (YER062C or YIL053W) |
| G6PDH2 | Glucose-6-phosphate dehydrogenase | NADP + D-Glucose-6-phosphate --> H + 6-phospho-D-glucono-1-5-lactone + NADPH | Forward | Knockout | YNL241C |
| GLNS | Glutamine synthetase | ATP + Ammonium + L-Glutamate --> ADP + H + Phosphate + L-Glutamine | Forward | Downregulation | YPR035W |
| ICDHy | Isocitrate dehydrogenase (NADP) | NADP + Isocitrate --> 2-Oxoglutarate + NADPH + CO2 | Forward | Knockout | YLR174W |
| RPE | Ribulose-5-phosphate 3-epimerase | D-Ribulose-5-phosphate <=> D-Xylulose-5-phosphate | Forward | Knockout | YJL121C |
| GND | Phosphogluconate dehydrogenase | NADP + 6-Phospho-D-gluconate --> D-Ribulose-5-phosphate + NADPH + CO2 | Forward | Downregulation | (YGR256W or YHR183W) |
| RPI | Ribose-5-phosphate isomerase | alpha-D-Ribose-5-phosphate <=> D-Ribulose-5-phosphate | Reverse | Downregulation | YOR095C |


Table 4.2. (B) Intervention strategies identified using OptForce: overexpression (↑), down-regulation (↓) and knockout (Δ). The symbol (*) indicates the best strain(s) in each group of k interventions.

Intervention columns: ↑ACCOACr, ↑ASPKi, ↑ASADi, ΔG6PDH2, ΔRPE, ↓GND, ↓GLNS, ΔG3PT, ↓RPI, ΔICDHy, ΔCSNATm

| No. of interventions (k) | Strain | Interventions marked | Minimal guaranteed malonyl-CoA production flux (mmol/gDW/h) | Yield (mol/mol glucose) | % of theoretical maximum |
|---|---|---|---|---|---|
| 2 | 1* | o o | 34.66 | 0.35 | 48.8% |
| 2 | 2* | o o | 34.66 | 0.35 | 48.8% |
| 3 | 3* | o o o | 48.3 | 0.48 | 68.0% |
| 3 | 4* | o o o | 48.3 | 0.48 | 68.0% |
| 3 | 5 | o o o | 48.21 | 0.48 | 67.9% |
| 3 | 6 | o o o | 48.21 | 0.48 | 67.9% |
| 3 | 7 | o o o | 47.32 | 0.47 | 66.6% |
| 4 | 8* | o o o o | 53.89 | 0.54 | 75.9% |
| 4 | 9 | o o o o | 53.78 | 0.54 | 75.7% |
| 4 | 10 | o o o o | 52.54 | 0.53 | 74.0% |
| 4 | 11 | o o o o | 52.45 | 0.52 | 73.9% |
| 4 | 12 | o o o o | 50.54 | 0.51 | 71.2% |
| 5 | 13* | o o o o o | 54.81 | 0.55 | 77.2% |
| 5 | 14 | o o o o o | 54.69 | 0.55 | 77.0% |
| 5 | 15 | o o o o o | 54.6 | 0.55 | 76.9% |
| 5 | 16 | o o o o o | 54.6 | 0.55 | 76.9% |
| 5 | 17 | o o o o o | 54.26 | 0.54 | 76.4% |
| 5 | 18 | o o o o o | 54.26 | 0.54 | 76.4% |
| 6 | 19* | o o o o o o | 56.21 | 0.56 | 79.2% |
| 6 | 20* | o o o o o o | 56.21 | 0.56 | 79.2% |
| 6 | 21 | o o o o o o | 55.53 | 0.56 | 78.2% |
| 6 | 22 | o o o o o o | 54.92 | 0.55 | 77.4% |

[Figure 4.4: bar chart. y-axis: malonyl-CoA yield (mol/mol glucose), 0.25 to 0.60; x-axis: strain, 1 to 22.]

Figure 4.4. Malonyl-CoA yields for strains designed using the OptForce procedure. The strain description is presented in Table 4.2B. The colors of the bars indicate the number of interventions, k = 2 (blue), 3 (orange), 4 (gray), 5 (yellow) and 6 (green). The maximum theoretical yield is 0.71 mol malonyl-CoA/mol glucose.
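The percentages reported in Table 4.2B can be reproduced from the minimal guaranteed fluxes with a short calculation. The sketch below is illustrative only: the glucose uptake rate of 100 mmol/gDW/h is an assumption inferred from the ratio of the tabulated fluxes to the tabulated yields (it is not stated in this section), and the variable names are hypothetical.

```python
# Flux values are copied from Table 4.2B (best strain in each k group).
# GLUCOSE_UPTAKE is an ASSUMPTION inferred from the flux-to-yield ratio in the
# table; it is not a value stated in this section.
GLUCOSE_UPTAKE = 100.0   # mmol/gDW/h (assumed)
THEORETICAL_MAX = 0.71   # mol malonyl-CoA / mol glucose (Figure 4.4 caption)

# Minimal guaranteed malonyl-CoA flux (mmol/gDW/h) of the best strain per k
best_flux_by_k = {2: 34.66, 3: 48.30, 4: 53.89, 5: 54.81, 6: 56.21}


def percent_of_theoretical(flux):
    """Convert a production flux into percent of the maximum theoretical yield."""
    yield_mol_per_mol = flux / GLUCOSE_UPTAKE
    return 100.0 * yield_mol_per_mol / THEORETICAL_MAX


percents = {k: round(percent_of_theoretical(v), 1) for k, v in best_flux_by_k.items()}

# Gain (in percentage points) obtained by allowing one more intervention
gains = {k: round(percents[k] - percents[k - 1], 1)
         for k in sorted(percents) if k - 1 in percents}

print(percents)  # percent of theoretical maximum per k
print(gains)     # marginal gain per additional intervention
```

Under these assumptions the gains beyond k = 4 are about one to two percentage points, far smaller than the gain from the third intervention, which is consistent with stopping the search at k = 6.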


Figure 4.6. Flux distribution of the best strain for malonyl-CoA production. (A) The best intervention strategies are represented as Boolean logic. (B) The flux distribution for the best strain (#19) under the worst-case scenario (minimization of malonyl-CoA production). The star indicates the malonyl-CoA production flux.


4.4. Case study III: Overproduction of isoprene in Synechocystis sp. PCC 6803

Objective

Cyanobacteria are photosynthetic organisms that garner interest as metabolic engineering hosts due to their capability of producing biochemicals directly from CO2 and sunlight [28]. The cyanobacterium Synechocystis sp. PCC 6803 is one of the most widely studied cyanobacteria as it is amenable to genetic manipulation [29]. In this case study, we employed OptForce on the genome-scale metabolic model of Synechocystis 6803 iSyn731 developed by Saha et al.[30] to identify interventions for isoprene overproduction.

Methods

The FBA simulation with the iSyn731 model was performed by setting carbon uptake (through a combination of CO2 and bicarbonate exchange) to 100 mmol g DW⁻¹ h⁻¹ and photon uptake (through both photosystem I and II) to 1000 mmol g DW⁻¹ h⁻¹. The 13C-MFA flux data for Synechocystis 6803 growing under photoautotrophic conditions, measured by Young et al. (Supplementary Table III of [31]), were used to constrain the iSyn731 model. The lower and upper bounds of the flux through reactions with measured data were set to the 95% confidence bounds. However, the model became infeasible when biomass was maximized under these constraints. We therefore formulated a bilevel mixed-integer linear program (MILP) that maximizes the number of measured fluxes that can be used to constrain the wild-type flux bounds while ensuring that biomass production can still be maximized. As a result, 21 reactions were fixed to their respective MFA bounds and the maximum biomass yield ($v_{biomass}^{max}$) is 1.33 h⁻¹. Under the photoautotrophic condition (without a constraint on biomass yield), the maximum isoprene production flux is 17.46 mmol g DW⁻¹ h⁻¹. The OptForce procedure used is similar to Step 1 to Step 4 of section 4.2.2. In this case study, we imposed the constraint that at least 10% of the theoretical maximum biomass yield ($v_{biomass}^{max}$) has to be produced in the simulation to overproduce isoprene.
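The idea of retaining as many measured flux bounds as possible while keeping the model feasible can be illustrated on a toy problem. The sketch below is not the bilevel MILP used in this study: it swaps in brute-force enumeration on a hypothetical linear pathway in which steady state forces every reaction to carry the same flux, so a set of measured intervals is consistent only if the intervals share a common point. The reaction names, intervals and function name are invented for illustration.

```python
from itertools import combinations


def largest_consistent_subset(measured, model_bounds):
    """Return the largest set of measured intervals that is jointly feasible.

    Toy assumption: all reactions lie on one linear pathway, so at steady state
    they carry the same flux v. A subset of measured (lower, upper) intervals is
    feasible when its intersection with the model bounds is non-empty.
    """
    model_lo, model_hi = model_bounds
    names = list(measured)
    # Try the largest subsets first so the first feasible hit is a maximum.
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            lo = max([model_lo] + [measured[r][0] for r in subset])
            hi = min([model_hi] + [measured[r][1] for r in subset])
            if lo <= hi:
                return set(subset), (lo, hi)
    return set(), model_bounds


# Hypothetical 95% confidence intervals for three measured reactions
measured = {"r1": (8.0, 10.0), "r2": (9.0, 12.0), "r3": (3.0, 5.0)}
kept, flux_range = largest_consistent_subset(measured, (0.0, 100.0))
print(kept, flux_range)
```

In this toy case the interval for r3 conflicts with the other two, so only r1 and r2 are retained. Enumeration scales exponentially with the number of measurements, which is one reason an optimization formulation with binary variables is preferable at genome scale.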


Strategies for isoprene overproduction

Under photoautotrophic growth, the flux distribution upon maximization of biomass is shown in Figure 4.7. Isoprene is not produced under this condition. When isoprene production is maximized under the photoautotrophic condition, the carbon flux enters the 1-deoxy-D-xylulose-5-phosphate (DXP) pathway not through glyceraldehyde-3-phosphate and pyruvate, but directly from an intermediate (i.e., sugar phosphates) of the pentose phosphate pathway, as proposed by Ershov et al. [32] (Figure 4.8). Initially, during the OptForce simulation, we found that up-regulation of any one of the following reactions already leads to the maximum isoprene production flux (i.e., 15.7 mmol g DW⁻¹ h⁻¹ when 10% of $v_{biomass}^{max}$ is imposed on biomass yield):
(i) isoprene synthase (R08199): catalyzes the conversion of dimethylallyl diphosphate (DMAPP) to isoprene
(ii) 1-hydroxy-2-methyl-2-butenyl 4-diphosphate (HMBPP) reductase (rxn04996): catalyzes the conversion of HMBPP to DMAPP
(iii) Iptr: transport of isoprene to the extracellular space

All of these reactions belong either to the linear DXP pathway forming isoprene or to the isoprene secretion pathway. These reactions were then excluded from the list of interventions in order to identify other, non-intuitive strategies. At k = 1 to 3 interventions, no solution that produces isoprene can be found. At k = 4 interventions, three alternate solution sets were found, leading to an average of 4.3 mmol g DW⁻¹ h⁻¹ of isoprene (Table 4.3). Increasing the number of interventions led to higher isoprene production, but no further improvement is observed after k = 6 interventions. The minimal guaranteed isoprene flux at k = 6 interventions is 7.79 mmol g DW⁻¹ h⁻¹. The reaction perturbations suggested by OptForce include up-regulation of ribose-5-phosphate isomerase (rxn00777) and down-regulation of reactions directing flux towards the citric acid cycle, namely phosphoenolpyruvate carboxylase (rxn00251) and enolase (rxn00459), as well as down-regulation of DHAP-forming triose phosphate isomerase (rxn00747) and NADH-dependent alanine dehydrogenase (rxn00278) (Figure 4.9).


Figure 4.7. The flux distribution under the maximum biomass condition when MFA data are used to constrain the iSyn731 model. There was no isoprene formation under this condition. The metabolic map was drawn using Escher [26]. Darker blue indicates higher fluxes.


Figure 4.8. Flux distribution under the maximum isoprene formation condition. Carbon fluxes towards the TCA cycle decrease significantly, while isoprene is produced through a putative pathway instead of the DXP pathway. This result is consistent with a previous study [32], which suggested that isoprenoid biosynthesis in Synechocystis 6803 is initiated from compounds in the pentose phosphate pathway under photoautotrophic growth.


Table 4.3. The list of interventions identified by OptForce for isoprene overproduction. Reactions in the linear isoprene biosynthesis pathway and the transport of isoprene were excluded. The interventions are shown in Boolean format: (↑) indicates up-regulation and (↓) indicates down-regulation. The reaction abbreviations are based on the iSyn731 model [30].

| No. of interventions | Minimal guaranteed isoprene production flux (mmol/gDCW/h) | Standard deviation (mmol/gDCW/h) | Interventions |
|---|---|---|---|
| 1 | - | - | N/A |
| 2 | - | - | N/A |
| 3 | - | - | N/A |
| 4 | 4.304 | 1.7 | ↑rxn00777 and ↓rxn00278 and ((↓rxn00747 and ↑rxn01106) or (↓rxn00747 and ↓rxn00459) or (↓rxn01100 and ↓rxn00459)) |
| 5 | 7.317 | 0 | ↑rxn00777 and ↓rxn00278 and ↓rxn00747 and ((↓rxn13643 and ↑rxn01106) or (↓rxn13643 and ↓rxn00459) or (↓rxn00695 and ↓rxn00459)) |
| 6 | 7.792 | 0 | ↑rxn00777 and ↓rxn00251 and ↓rxn00278 and ↓rxn00459 and ↓rxn00747 and (↓rxn13643 or ↓rxn00695) |
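Each Boolean expression in Table 4.3 stands for several alternative intervention sets. As a minimal sketch, the expressions can be encoded as a list of fixed interventions plus a list of alternatives and then expanded into explicit sets. The data structure, the "+"/"-" prefixes (standing for ↑ and ↓) and the function name are choices made here for illustration, not part of OptForce.

```python
# Each strategy: (interventions present in every solution, alternative groups).
# "+" = up-regulation, "-" = down-regulation (notation chosen for this sketch).
strategies = {
    4: (["+rxn00777", "-rxn00278"],
        [["-rxn00747", "+rxn01106"],
         ["-rxn00747", "-rxn00459"],
         ["-rxn01100", "-rxn00459"]]),
    5: (["+rxn00777", "-rxn00278", "-rxn00747"],
        [["-rxn13643", "+rxn01106"],
         ["-rxn13643", "-rxn00459"],
         ["-rxn00695", "-rxn00459"]]),
    6: (["+rxn00777", "-rxn00251", "-rxn00278", "-rxn00459", "-rxn00747"],
        [["-rxn13643"],
         ["-rxn00695"]]),
}


def expand(fixed, alternatives):
    """Expand 'fixed and (alt1 or alt2 or ...)' into explicit intervention sets."""
    return [sorted(fixed + alt) for alt in alternatives]


for k, (fixed, alternatives) in strategies.items():
    solutions = expand(fixed, alternatives)
    # Every expanded solution should contain exactly k interventions.
    assert all(len(s) == k for s in solutions)
    print(f"k = {k}: {len(solutions)} alternative solution(s)")
    for s in solutions:
        print("   ", " and ".join(s))
```

Expanding the k = 4 row yields the three alternate solution sets mentioned in the text, each with exactly four interventions.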


Figure 4.9. The interventions identified by OptForce for isoprene overproduction. They include up-regulation of ribose-5-phosphate isomerase (rxn00777) and down-regulation of reactions directing flux towards the TCA cycle, namely phosphoenolpyruvate carboxylase (rxn00251) and enolase (rxn00459), as well as down-regulation of DHAP-forming triose phosphate isomerase (rxn00747) and NADH-dependent alanine dehydrogenase (rxn00278). Blue arrows represent flux-carrying reactions; green and red arrows indicate reactions identified for up-regulation and down-regulation, respectively.


4.5. References

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
2. Thiele I, Palsson BO: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat Protoc 2010, 5.
3. Kim TY, Sohn SB, Kim YB, Kim WJ, Lee SY: Recent advances in reconstruction and applications of genome-scale metabolic models. Current opinion in biotechnology 2012, 23(4):617-623.
4. Oberhardt MA, Palsson BO, Papin JA: Applications of genome-scale metabolic reconstructions. Molecular systems biology 2009, 5:320.
5. Park JH, Lee KH, Kim TY, Lee SY: Metabolic engineering of Escherichia coli for the production of l-valine based on transcriptome analysis and in silico gene knockout simulation. Proceedings of the National Academy of Sciences 2007, 104(19):7797-7802.
6. Zomorrodi AR, Suthers PF, Ranganathan S, Maranas CD: Mathematical optimization applications in metabolic networks. Metabolic engineering 2012, 14(6):672-686.
7. Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R: Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nature chemical biology 2011, 7.
8. Gold ND, Gowen CM, Lussier FX, Cautha SC, Mahadevan R, Martin VJ: Metabolic engineering of a tyrosine-overproducing yeast platform using targeted metabolomics. Microbial cell factories 2015, 14:73.
9. Xu P, Ranganathan S, Fowler ZL, Maranas CD, Koffas MA: Genome-scale metabolic network modeling results in minimal interventions that cooperatively force carbon flux towards malonyl-CoA. Metabolic engineering 2011, 13(5):578-587.
10. Ng C, Jung M-y, Lee J, Oh M-K: Production of 2,3-butanediol in Saccharomyces cerevisiae by in silico aided metabolic engineering. Microbial cell factories 2012, 11(1):68.
11. Ranganathan S, Tee TW, Chowdhury A, Zomorrodi AR, Yoon JM, Fu Y, Shanks JV, Maranas CD: An integrated computational and experimental study for overproducing fatty acids in Escherichia coli. Metabolic engineering 2012, 14(6):687-704.
12. Alper H, Jin Y-S, Moxley JF, Stephanopoulos G: Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli. Metabolic engineering 2005, 7:155-164.
13. Fong SS, Burgard AP, Herring CD, Knight EM, Blattner FR, Maranas CD, Palsson BO: In silico design and adaptive evolution of Escherichia coli for production of lactic acid. Biotechnology and bioengineering 2005, 91.
14. Ranganathan S, Suthers PF, Maranas CD: OptForce: an optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS computational biology 2010, 6(4):e1000744.
15. Burgard AP, Pharkya P, Maranas CD: Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and bioengineering 2003, 84(6):647-657.
16. Tepper N, Shlomi T: Predicting metabolic engineering knockout strategies for chemical production: accounting for competing pathways. Bioinformatics 2010, 26(4):536-543.
17. Patil KR, Rocha I, Förster J, Nielsen J: Evolutionary programming as a platform for in silico metabolic engineering. BMC bioinformatics 2005, 6(1):308.
18. Lun DS, Rockwell G, Guido NJ, Baym M, Kelner JA, Berger B, Galagan JE, Church GM: Large-scale identification of genetic design strategies using local search. Molecular systems biology 2009, 5:296.
19. Pharkya P, Maranas CD: An optimization framework for identifying reaction activation/inhibition or elimination candidates for overproduction in microbial systems. Metabolic engineering 2006, 8(1):1-13.
20. Zomorrodi AR, Maranas CD: Improving the iMM904 S. cerevisiae metabolic model using essentiality and synthetic lethality data. BMC Syst Biol 2010, 4:178.
21. Chowdhury R, Chowdhury A, Maranas CD: Using Gene Essentiality and Synthetic Lethality Information to Correct Yeast and CHO Cell Genome-Scale Models. Metabolites 2015, 5(4):536-570.
22. Chowdhury A, Zomorrodi AR, Maranas CD: Bilevel optimization techniques in computational strain design. Comput Chem Eng 2015, 72:363-372.
23. Suástegui M, Guo W, Feng X, Shao Z: Investigating strain dependency in the production of aromatic compounds in Saccharomyces cerevisiae. Biotechnology and bioengineering 2016, 113(12):2676-2685.
24. Gold ND, Gowen CM, Lussier F-X, Cautha SC, Mahadevan R, Martin VJJ: Metabolic engineering of a tyrosine-overproducing yeast platform using targeted metabolomics. Microb Cell Fact 2015, 14:73.
25. Suástegui M, Yu Ng C, Chowdhury A, Sun W, Cao M, House E, Maranas CD, Shao Z: Multilevel engineering of the upstream module of aromatic amino acid biosynthesis in Saccharomyces cerevisiae for high production of polymer and drug precursors. Metabolic engineering 2017, 42:134-144.
26. King ZA, Drager A, Ebrahim A, Sonnenschein N, Lewis NE, Palsson BO: Escher: A Web Application for Building, Sharing, and Embedding Data-Rich Visualizations of Biological Pathways. PLoS computational biology 2015, 11(8):e1004321.
27. Grabowska D, Chelstowska A: The ALD6 gene product is indispensable for providing NADPH in yeast cells lacking glucose-6-phosphate dehydrogenase activity. The Journal of biological chemistry 2003, 278(16):13984-13988.
28. Rabinovitch-Deere CA, Oliver JW, Rodriguez GM, Atsumi S: Synthetic biology and metabolic engineering approaches to produce biofuels. Chemical reviews 2013, 113(7):4611-4632.
29. Berla B, Saha R, Immethun C, Maranas C, Moon TS, Pakrasi H: Synthetic biology of cyanobacteria: unique challenges and opportunities. Frontiers in Microbiology 2013, 4(246).
30. Saha R, Verseput AT, Berla BM, Mueller TJ, Pakrasi HB, Maranas CD: Reconstruction and comparison of the metabolic potential of cyanobacteria Cyanothece sp. ATCC 51142 and Synechocystis sp. PCC 6803. Plos One 2012, 7(10):e48285.
31. Young JD, Shastri AA, Stephanopoulos G, Morgan JA: Mapping photoautotrophic metabolism with isotopically nonstationary (13)C flux analysis. Metabolic engineering 2011, 13(6):656-665.
32. Ershov YV, Gantt RR, Cunningham FX Jr, Gantt E: Isoprenoid Biosynthesis in Synechocystis sp. Strain PCC6803 Is Stimulated by Compounds of the Pentose Phosphate Cycle but Not by Pyruvate or Deoxyxylulose-5-Phosphate. Journal of bacteriology 2002, 184(18):5045-5051.


5. Chapter 5 SYNOPSIS AND FUTURE PERSPECTIVES

Portions of this chapter have been previously published in modified form in Nature Biotechnology (Ng C.Y., Chowdhury A., and Maranas C.D. (2016), “A microbial factory for diverse chemicals”, Nature Biotechnology, 34(5), 513–515) and FEMS Microbiology Letters (Dash S., Ng C.Y., and Maranas C.D. (2016), “Metabolic modeling of clostridia: current developments and applications”, FEMS Microbiology Letters, 363(4), fnw004).

5.1. Introduction

Most of the chemicals used in industry and commercial products are derived from petroleum. If these manufacturing processes could be replaced by cellular factories (engineered microbes or microbial consortia), renewable energy sources could be exploited to generate products. Today, microbial production is commercially viable for only a handful of chemicals, such as ethanol, lactate, succinate [1], farnesene [2], 1,3-propanediol [3], and 1,4-butanediol [4], owing to a combination of cheap crude oil and expensive raw materials. Rather than attempting to compete in the area of high-volume petroleum-derived chemicals such as fuels, metabolic engineers have begun to focus on high-value biochemicals with specialized functional groups (e.g., dicarboxylic and phenyl-alkanoic acids), which are used as food additives, pharmaceuticals, textile fibers, plasticizers and building blocks for other chemicals. Rapid progress over the past decade in microbial genome annotation, cellular modeling and metabolic engineering tools has enabled the generation of native and non-native molecules including fatty esters, terpenoids, polyketides, alkaloids and α-ketoacids [5-10]. Future success will depend crucially on the ability to improve titer, yield and productivity, as well as on further reductions in feedstock cost.


5.2. A survey of current metabolic engineering efforts

We have manually curated and stored experimental data from over 115 publications that describe engineered microbial strains designed for the production of biomolecules, and cataloged the current best yield and titer within the database. The pathway-independent maximum theoretical carbon yield provides an upper bound (i.e., chemical limit) on the bio-based production of each molecule (see Supplementary notes C.1). In practice, different biomolecules are produced at highly varied amounts, often well below the maximum theoretical yield, due to a variety of bottlenecks arising during bioconversion [11]. Using data from the open literature, we extracted the best yield (Figure 5.1) and titer (Figure 5.2) achieved thus far for each molecule and contrasted them against the corresponding maximum theoretical yields. The information cataloged in our database provides a rational means of deciding whether to target a molecule that can already be produced efficiently or to invest in a candidate molecule with a low experimental yield so far but a high maximum theoretical yield.

Among the molecules that we have surveyed, seven (ethanol [12], 1-butanol [13], isobutanol [14], 1,3-propanediol [3], 2,3-butanediol [15], lactic acid [16] and succinic acid [17]) were successfully produced at > 80% of the theoretical carbon yield when glucose was used as the sole carbon source (Figure 5.1). In another notable effort, a high yield of propanoic acid was obtained from glycerol (0.696 C-mol/C-mol glycerol) instead of glucose [18]. Such efficient conversion yields, regardless of substrate, suggest that the bioproduction pathways for these molecules have been significantly optimized in terms of pathway flux, redox and energetics [11]. However, most of these cases were achieved using model organisms such as E. coli, S. cerevisiae and B. subtilis. Advances in the genetic engineering of non-model organisms, including anaerobes, methanogens and cyanobacteria, could broaden the use of host organisms with advantageous traits (e.g., high product tolerance, ability to catabolize inexpensive substrates, and high ATP production).


Besides yield, titer and productivity are two important metrics for quantifying the performance of a bioconversion process. According to Figure 5.2, eight molecules have already achieved a titer of > 100 g/l. While several compounds are above the 1 g/l level, a significant number of them are below 1 g/l. Although our literature survey has not covered all currently available publications, it suffices to conclude that a large number of metabolic engineering-based products remain at the proof-of-concept stage and that significant efforts are still required to drive these products to commercially viable titers [11]. In general, molecules that can be produced at high yield can also be produced at sufficiently high titer. Long-chain molecules and esters currently suffer from low titers due to the complexity of their multi-step pathways. Although the highest titers for most of the biomolecules were attained using engineered E. coli strains (Table C.1 and Figure C.1), several Clostridia species are promising host organisms, as they outperformed E. coli in 1-butanol, butanoic acid, 1,2-propanediol and 1-hexanol titers (Figure C.1).


Figure 5.1. Comparison of the maximum theoretical carbon yield with glucose as substrate (red bars) calculated for each molecule and the current best experimental carbon yield obtained from the publications that we have surveyed. The experimental yields are separated into those obtained with glucose only (blue bars) and those obtained with any carbon substrate (e.g., glycerol and cellobiose) (purple bars) as the carbon source. Note that the maximum theoretical yield of products that are less reduced than glucose (e.g., succinic acid, itaconic acid and pyruvic acid) is above 1 C-mol/C-mol glucose because carbon dioxide is fixed as a carbon source in the theoretical conversion equation (e.g., $\text{Glucose} + \tfrac{6}{7}\,\mathrm{CO_2} \rightarrow \tfrac{12}{7}\,\text{Succinic acid} + \tfrac{6}{7}\,\mathrm{H_2O}$).
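The succinic acid example can be checked with a short element balance. The sketch below assumes the stoichiometric coefficients 6/7, 12/7 and 6/7 for CO2, succinic acid and H2O per mole of glucose, and standard molecular formulas; the variable names are hypothetical.

```python
from fractions import Fraction as F

# Elemental composition (C, H, O) of each species
composition = {
    "glucose":       (6, 12, 6),  # C6H12O6
    "CO2":           (1, 0, 2),
    "succinic acid": (4, 6, 4),   # C4H6O4
    "H2O":           (0, 2, 1),
}

# Signed coefficients: negative = consumed, positive = produced.
# Assumed equation: Glucose + 6/7 CO2 -> 12/7 Succinic acid + 6/7 H2O
coefficients = {
    "glucose": F(-1),
    "CO2": F(-6, 7),
    "succinic acid": F(12, 7),
    "H2O": F(6, 7),
}

# Net C, H and O across the reaction; all three must be zero if it is balanced.
balance = [sum(coefficients[s] * composition[s][i] for s in composition)
           for i in range(3)]

# Carbon yield in C-mol product per C-mol glucose (CO2 carbon is not counted
# as substrate carbon, which is why the yield can exceed 1).
carbon_yield = coefficients["succinic acid"] * composition["succinic acid"][0] / composition["glucose"][0]

print(balance)               # net C, H, O
print(float(carbon_yield))   # C-mol succinic acid per C-mol glucose
```

With these coefficients the carbon yield is 8/7, or about 1.14 C-mol/C-mol glucose, which is why the theoretical bar for succinic acid exceeds 1.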


Figure 5.2. Highest experimental titer (g/l) extracted from our literature survey for each molecule. The molecules on the y-axis are placed in descending order of their highest titer. Dashed blue and red reference lines indicate titers of 1 g/l and 100 g/l, respectively. Note that the x-axis is shown in log scale.


5.3. Outlook

By harnessing the versatility of microbial cell factories, recent efforts have demonstrated that it is possible to produce a plethora of useful compounds, including functionalized small molecules [9], opioids [8], taxadiene [19] and artemisinin [10], from relatively inexpensive feedstocks such as glycerol and glucose. Key drivers of these efforts include the successful identification and expression of heterologous genes/enzymes (including protein-engineered variants) that confer on the strains the ability to convert natively available intermediates to non-native products. There are, however, a large number of high-value and high-potential natural molecules catalogued in databases such as KEGG [20] and Chemical Entities of Biological Interest (ChEBI) [21] that remain to be explored. Computational tools such as BNICE [22] and GEM-Path [23] can be leveraged to identify reactions and reaction rules that form de novo pathways from existing molecules in a cell to a novel molecule. Moreover, a multitude of synthetic biology and genome engineering approaches (see Chapter 1 and 2) can be used to facilitate the optimization of these production pathways.

An optimized downstream pathway relies upon efficient upstream pathways that can supply pathway precursors and the relevant energy and redox cofactors in appropriate ratios. Recent constructions of a non-oxidative glycolytic cycle [24] and of nonphosphorylative pathways for lignocellulose feedstocks [25], both capable of disassembling substrates with zero carbon loss, provide exciting opportunities for achieving optimal production yields. Computational tools such as optStoic [26] (see Chapter 3) provide a systematic framework for designing the overall conversion stoichiometry and intervening reactions. Balancing the expression of upstream and downstream pathways will likely be another challenge. Approaches that partition the upstream and downstream pathways into several modules expressed at optimal levels, for example by introducing dynamic biosensors [27], could be employed to ensure a balanced supply of precursors and to prevent the accumulation of toxic intermediates. Alternatively, a microbial consortium approach (for example, co-cultivation of a production strain with a different microorganism that produces and secretes precursors from inexpensive carbon sources) could improve yield and productivity. It might also be prudent to select for high-performing strains in the engineered isogenic population (by weeding out cells that perform poorly because of nongenetic differences) [28].

Understanding and engineering microbial metabolism can be further facilitated by the use of genome-scale metabolic (GSM) models, as demonstrated in Chapter 4. Previous studies have shown that they are useful for the (1) elucidation of underlying biological processes, (2) prediction of cellular phenotypes, (3) interpretation of high-throughput data, (4) metabolic engineering and (5) analysis of interspecies interactions [29, 30] (Figure 5.3). In particular, GSM models enable system-wide analysis of genetic perturbations (i.e., gene overexpression, down-regulation and deletion) and have been shown to be able to catalog non-intuitive targets for manipulation (see Chapter 4) [31]. One way to enable context-specific prediction using GSM models is to confine the solution space using experimental data (e.g., gene expression, 13C-metabolic flux analysis, protein expression, metabolome, growth rate). To this end, numerous methods (e.g., CoreReg [32], PROM [33], arFBA [34]) have been developed to integrate different types of omics data into GSM models (refer to [35, 36] for reviews). Other approaches for capturing different layers of regulation (e.g., transcriptional and allosteric) in metabolism, such as kinetic models [37, 38] and ME-models [39, 40], are emerging as a solution to the lack of mechanistic detail in stoichiometric models.

In common with synthetic biologists, who use orthogonal synthetic logic gates to construct circuits, metabolic engineers would like to design off-the-shelf orthogonal pathways (e.g., a tunable Entner-Doudoroff pathway module in E. coli (see Chapter 2) [41]) that can be combined in a modular fashion in vivo to convert a substrate to the desired product in an automated engineering workflow. Full realization of this goal in vivo will require a more detailed understanding of complex metabolic, proteomic and transcriptomic interactions and compatibility of synthetic constructs with host metabolism.


Figure 5.3. Applications of genome-scale metabolic models include: (1) discovering new biological insights, (2) predicting cellular phenotype resulting from genetic or environmental perturbations, (3) contextualizing and exploring high-throughput datasets, (4) devising metabolic engineering strategies, and (5) understanding interspecies interactions (e.g., in consolidated bioprocessing).


5.4. References

1. Jullesson D, David F, Pfleger B, Nielsen J: Impact of synthetic biology and metabolic engineering on industrial production of fine chemicals. Biotechnology advances 2015, 33(7):1395-1402.
2. Meadows AL, Hawkins KM, Tsegaye Y, Antipov E, Kim Y, Raetz L, Dahl RH, Tai A, Mahatdejkul-Meadows T, Xu L et al: Rewriting yeast central carbon metabolism for industrial isoprenoid production. Nature 2016, 537(7622):694-697.
3. Nakamura CE, Whited GM: Metabolic engineering for the microbial production of 1,3-propanediol. Current opinion in biotechnology 2003, 14(5):454-459.
4. Yim H, Haselbeck R, Niu W, Pujol-Baxley C, Burgard A, Boldt J, Khandurina J, Trawick JD, Osterhout RE, Stephen R: Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nature chemical biology 2011, 7.
5. Dellomonaco C, Clomburg JM, Miller EN, Gonzalez R: Engineered reversal of the beta-oxidation cycle for the synthesis of fuels and chemicals. Nature 2011, 476(7360):355-359.
6. Steen EJ, Kang Y, Bokinsky G, Hu Z, Schirmer A, McClure A, Del Cardayre SB, Keasling JD: Microbial production of fatty-acid-derived fuels and chemicals from plant biomass. Nature 2010, 463(7280):559-562.
7. Zhang K, Sawaya MR, Eisenberg DS, Liao JC: Expanding metabolism for biosynthesis of nonnatural alcohols. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(52):20653-20658.
8. Galanie S, Thodey K, Trenchard IJ, Filsinger Interrante M, Smolke CD: Complete biosynthesis of opioids in yeast. Science 2015, 349(6252):1095-1100.
9. Cheong S, Clomburg JM, Gonzalez R: Energy- and carbon-efficient synthesis of functionalized small molecules in bacteria using non-decarboxylative Claisen condensation reactions. Nat Biotech 2016, 34(5):556-561.
10. Paddon CJ, Westfall PJ, Pitera DJ, Benjamin K, Fisher K, McPhee D, Leavell MD, Tai A, Main A, Eng D et al: High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 2013, 496(7446):528-532.
11. Van Dien S: From the first drop to the first truckload: commercialization of microbial processes for renewable chemicals. Current opinion in biotechnology 2013, 24(6):1061-1068.
12. Guadalupe Medina V, Almering MJ, van Maris AJ, Pronk JT: Elimination of glycerol production in anaerobic cultures of a Saccharomyces cerevisiae strain engineered to use acetic acid as an electron acceptor. Applied and environmental microbiology 2010, 76(1):190-195.
13. Shen CR, Lan EI, Dekishima Y, Baez A, Cho KM, Liao JC: Driving forces enable high-titer anaerobic 1-butanol synthesis in Escherichia coli. Applied and environmental microbiology 2011, 77(9):2905-2915.
14. Atsumi S, Hanai T, Liao JC: Non-fermentative pathways for synthesis of branched-chain higher alcohols as biofuels. Nature 2008, 451(7174):86-89.
15. Fu J, Huo G, Feng L, Mao Y, Wang Z, Ma H, Chen T, Zhao X: Metabolic engineering of Bacillus subtilis for chiral pure meso-2,3-butanediol production. Biotechnology for biofuels 2016, 9(1):90.
16. Zhou S, Shanmugam KT, Yomano LP, Grabar TB, Ingram LO: Fermentation of 12% (w/v) glucose to 1.2 M lactate by Escherichia coli strain SZ194 using mineral salts medium. Biotechnology letters 2006, 28(9):663-670.
17. Jantama K, Zhang X, Moore JC, Shanmugam KT, Svoronos SA, Ingram LO: Eliminating side products and increasing succinate yields in engineered strains of Escherichia coli C. Biotechnology and bioengineering 2008, 101(5):881-893.
18. Zhang A, Yang S-T: Propionic acid production from glycerol by metabolically engineered Propionibacterium acidipropionici. Process Biochemistry 2009, 44(12):1346-1351.
19. Ajikumar PK, Xiao W-h, Tyo KEJ, Wang Y, Simeon F, Leonard E, Mucha O, Phon TH, Pfeifer B, Stephanopoulos G: Isoprenoid pathway optimization for Taxol precursor overproduction in Escherichia coli. Science (New York, NY) 2010, 330:70-74.
20. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 2000, 28(1):27-30.
21. Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M et al: The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic acids research 2013, 41(Database issue):D456-463.
22. Hatzimanikatis V, Li C, Ionita JA, Henry CS, Jankowski MD, Broadbelt LJ: Exploring the diversity of complex metabolic networks. Bioinformatics 2005, 21(8):1603-1609.
23. Campodonico MA, Andrews BA, Asenjo JA, Palsson BO, Feist AM: Generation of an atlas for commodity chemical production in Escherichia coli and a novel pathway prediction algorithm, GEM-Path. Metabolic engineering 2014, 25:140-158.
24. 
Bogorad IW, Lin TS, Liao JC: Synthetic non-oxidative glycolysis enables complete carbon conservation. Nature 2013, 502(7473):693-697. 25. Tai YS, Xiong M, Jambunathan P, Wang J, Wang J, Stapleton C, Zhang K: Engineering nonphosphorylative metabolism to generate lignocellulose-derived products. Nature chemical biology 2016, 12(4):247-253. 26. Chowdhury A, Maranas CD: Designing overall stoichiometric conversions and intervening metabolic reactions. Scientific reports 2015, 5:16009. 27. Zhang F, Carothers JM, Keasling JD: Design of a dynamic sensor-regulator system for production of chemicals and fuels derived from fatty acids. Nature biotechnology 2012, 30(4):354-359. 28. Xiao Y, Bowen CH, Liu D, Zhang F: Exploiting nongenetic cell-to-cell variation for enhanced biosynthesis. Nature chemical biology 2016.

135

29. McCloskey D, Palsson BO, Feist AM: Basic and applied uses of genome-scale metabolic network reconstructions of Escherichia coli. Molecular systems biology 2013, 9:661. 30. Oberhardt MA, Palsson BO, Papin JA: Applications of genome-scale metabolic reconstructions. Molecular systems biology 2009, 5:320. 31. Burgard AP, Pharkya P, Maranas CD: Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and bioengineering 2003, 84(6):647-657. 32. Dash S, Mueller TJ, Venkataramanan KP, Papoutsakis ET, Maranas CD: Capturing the response of Clostridium acetobutylicum to chemical stressors using a regulated genome-scale metabolic model. Biotechnology for biofuels 2014, 7(1):144. 33. Chandrasekaran S, Price ND: Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis. Proceedings of the National Academy of Sciences of the United States of America 2010, 107(41):17845-17850. 34. Machado D, Herrgard MJ, Rocha I: Modeling the Contribution of Allosteric Regulation for Flux Control in the Central Carbon Metabolism of E. coli. Frontiers in bioengineering and biotechnology 2015, 3:154. 35. Topfer N, Kleessen S, Nikoloski Z: Integration of metabolomics data into metabolic networks. Frontiers in plant science 2015, 6:49. 36. Blazier AS, Papin JA: Integration of expression data in genome-scale metabolic network reconstructions. Frontiers in physiology 2012, 3:299. 37. Khodayari A, Maranas CD: A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains. Nature communications 2016, 7:13806. 38. Chowdhury A, Khodayari A, Maranas CD: Improving prediction fidelity of cellular metabolism with kinetic descriptions. Current opinion in biotechnology 2015, 36:57-64. 39. 
Lerman JA, Hyduke DR, Latif H, Portnoy VA, Lewis NE, Orth JD, Schrimpe-Rutledge AC, Smith RD, Adkins JN, Zengler K et al: In silico method for modelling metabolism and gene product expression at genome scale. Nature communications 2012, 3:929. 40. O'Brien EJ, Lerman JA, Chang RL, Hyduke DR, Palsson BO: Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Molecular systems biology 2013, 9:693. 41. Ng CY, Farasat I, Maranas CD, Salis HM: Rational design of a synthetic Entner-Doudoroff pathway for improved and controllable NADPH regeneration. Metabolic engineering 2015, 29:86-96.

136

Appendix A. Supplementary Information for Chapter 2

A.1. Supplementary notes

A.1.1. Construction of pgi mutant with co-selection MAGE

The pgi mutant was constructed by inserting two consecutive stop codons into the open reading frame of the native pgi gene of E. coli EcNR2 using the co-selection MAGE (CoS-MAGE) approach. First, the bla gene of the EcNR2 strain was inactivated by performing MAGE with the oligo bla_off, which introduces premature stop codons into bla. The pgi_KO oligo was then mixed with the bla_on oligo at a 50:1 ratio during electroporation of the EcNR2 bla- strain. Three to four rounds of CoS-MAGE were performed, and the cells were plated on LB plates supplemented with ampicillin. Colonies were screened for the genotype of interest by colony PCR with primers designed according to [1]. Colonies exhibiting the correct band on gel electrophoresis were confirmed by sequencing.

A.1.2. Improving tetA translation rate

The tetA gene confers resistance to tetracycline as well as sensitivity to nickel salts and fusaric acid. We intended to use tetA as a selection and counter-selection marker during the co-selection MAGE procedure [2]. A higher copy number of tetA confers nickel sensitivity more efficiently than a single or low copy number [3]. Since the tetA promoter overlaps with that of tetR [4], we increased the expression of tetA by modifying its RBS in the EcNR2 strain carrying the chromosomally integrated ED-tetAR cassette (the ED strain), as follows. The bla gene was inactivated in the ED strain using the bla_off oligo. An 8-variant RBS library was designed for the tetA gene using the Genome Editing Mode of the RBS Library Calculator. Then, co-selection MAGE (CoS-MAGE) [5] was performed with an oligonucleotide mixture of the tetA RBS library and the bla_on oligo, which restores the function of the inactivated bla. After three rounds of CoS-MAGE, the cells were plated on LB plates supplemented with ampicillin and chloramphenicol. Ninety-six of the resulting colonies were grown in LB media supplemented with 50 µg/mL of heat-inactivated chlortetracycline (cTc), 25 µg/mL of chloramphenicol, and nickel chloride. Expression of tetA in single copy at the original translation initiation rate of 228 au did not confer significant sensitivity to nickel and fusaric acid (unpublished data), whereas a high level of tetA expression has been found to cause sensitivity to both [3, 6, 7]. By screening growth rates in LB media containing nickel chloride, we isolated a colony that exhibited the lowest growth rate across four concentrations of nickel chloride (2.5 mM, 3.0 mM, 3.5 mM and 4.0 mM). Sequencing verified that it carries the RBS with the highest translation initiation rate for tetA in the 8-variant library. This ED strain, with a predicted tetA translation initiation rate of 48,372 au, is designated the ED 1.0 strain.

A.1.3. Modification of pQE-mBFP plasmid and measurement of mBFP production rate

The pQE-mBFP plasmid, which harbors the metagenomic blue fluorescent protein (mBFP) driven by an IPTG-inducible T5 promoter [8], was modified as follows. The cat gene, which confers chloramphenicol resistance, was removed by inverse PCR of the original plasmid, resulting in pQE-mBFP(Cat-). E. coli ER2267 (NEB) was transformed with pQE-mBFP and pQE-mBFP(Cat-), resulting in strains CYN014 and CYN019, respectively. These strains were initially constructed for in vivo measurement of mBFP production (Figure A.2).

To propagate plasmids with the R6K origin, E. coli Pir116 (TransforMax) was then used as the host strain. Pir116 competent cells were transformed separately with (i) pQE-mBFP(Cat-) and pCN-LA (R6K origin, expressing LacI constitutively and AmtR under the pTac promoter) and (ii) pQE-mBFP(Cat-) and pCN-LPab (R6K origin, expressing LacI constitutively and the PntAB transhydrogenase under the pTac promoter), resulting in strains (i) CYN024 and (ii) CYN027.

We characterized mBFP production of the ER2267- and Pir116-derived strains as follows. Isolated colonies were grown overnight at 37 °C in LB broth supplemented with the appropriate antibiotic. The overnight culture was diluted in M9 minimal media with 0.4% w/v glucose in culture tubes or microtiter wells. Cells were grown at 37 °C with shaking for 4 h and then diluted into 200 µL of fresh M9 minimal media with 0.4% w/v glucose in microtiter wells. IPTG was added between 1 and 3 h after the transfer to allow the cells to recover from lag phase. Cell growth (optical density at 600 nm) and blue fluorescence (excitation and emission wavelengths of 395 nm and 451 nm, respectively) were monitored continuously using a Tecan plate reader. The specific mBFP production rate was calculated as the slope of a straight line fitted to the specific mBFP fluorescence (fluorescence per unit OD600) versus time over the period of linear increase after IPTG addition.
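The rate calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis script used in this work: the function name `specific_production_rate`, its arguments, and the example readings are hypothetical, and the caller is assumed to pass only the time points that fall inside the linear window after IPTG addition.

```python
from typing import Sequence


def specific_production_rate(
    time_h: Sequence[float],
    fluorescence: Sequence[float],
    od600: Sequence[float],
) -> float:
    """Least-squares slope of (fluorescence / OD600) versus time.

    All three sequences must have the same length and should cover only
    the linear window after induction (an assumption of this sketch).
    """
    n = len(time_h)
    if n < 2 or len(fluorescence) != n or len(od600) != n:
        raise ValueError("need at least two matched time points")

    # Specific fluorescence: raw fluorescence normalized by cell density.
    specific = [f / od for f, od in zip(fluorescence, od600)]

    # Ordinary least-squares slope, written out to avoid extra dependencies.
    t_mean = sum(time_h) / n
    s_mean = sum(specific) / n
    sxx = sum((t - t_mean) ** 2 for t in time_h)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - t_mean) * (s - s_mean) for t, s in zip(time_h, specific))
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    t = [0.0, 1.0, 2.0, 3.0]
    od = [0.1, 0.2, 0.4, 0.8]
    fl = [10.0, 30.0, 80.0, 200.0]
    print(specific_production_rate(t, fl, od))
```

The resulting slope has units of fluorescence per OD600 per hour when time is given in hours. Selecting the linear window is left to the caller because it depends on the growth and induction profile of each well.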

A.2. Supplementary figures

Figure A.1. Plasmid map of pCN-065 (pCN-LEDT). Labelled arrows indicate genes. RBSs and terminators are represented by yellow boxes and dark red arrows, respectively. The plasmid map was created using ApE (A plasmid Editor) v2.0.46 and SnapGene® Viewer 2.3.5.

Figure A.2. Metagenomic blue fluorescent protein (mBFP) as an in vivo NADPH biosensor. (A) Time course of the specific mBFP fluorescence signal (per cell) and the corresponding specific mBFP production rate of strains ER2267 (wild-type without mBFP plasmid), CYN014 (ER2267 pQE-mBFP) and CYN019 (ER2267 pQE-mBFP(Cat-)) with and without IPTG addition. An increase in blue fluorescence per unit OD600 was detectable upon IPTG induction only for the mBFP-harboring strains CYN014 and CYN019, and not for the negative control strain ER2267. Values and error bars represent the means and s.d. of two repeats. (B) Measurement of the mBFP production rate of strain CYN027 (E. coli Pir116 co-expressing the transhydrogenase PntAB and mBFP on an R6K vector) and the control strain CYN024 (E. coli Pir116 co-expressing AmtR and mBFP on an R6K vector) at two different IPTG concentrations. CYN027, which expresses the PntAB transhydrogenase, exhibited a higher specific mBFP production rate than the control strain. Values and error bars represent the means and s.d. of 3-4 repeats. All experiments were conducted at 37 °C in M9 minimal glucose media (see Supplementary note A.1.3).

A.3. Supplementary tables

Table A.1. Strains and plasmids. (Chr) indicates chromosomal integration at the yciL-tonB intergenic region.

Strains or plasmids | Relevant characteristics | Source

E. coli strains
MG1655 | F-, lambda-, rph-1 | Thomas Wood's Lab
MG1655 crtEBI | MG1655 pIF-001C | This study
DH10B/TOP10 | F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80lacZΔM15 ΔlacX74 nupG recA1 araD139 Δ(ara-leu)7697 galE15 galK16 rpsL(StrR) endA1 λ- | Invitrogen
ER2267 | K12 F´ proA+B+ lacIq Δ(lacZ)M15 zzf::mini-Tn10 (KanR)/ Δ(argF-lacZ)U169 glnV44 e14-(McrA-) rfbD1? recA1 relA1? endA1 spoT1? thi-1 Δ(mcrC-mrr)114::IS10 | New England Biolabs Inc
CYN014 | Same as ER2267 but with pQE-mBFP | This study
CYN019 | Same as ER2267 but with pQE-mBFP(Cat-) | This study
Pir116 (TransforMax™ EC100D™ pir-116) | F- mcrA Δ(mrr-hsdRMS-mcrBC) φ80dlacZΔM15 ΔlacX74 recA1 endA1 araD139 Δ(ara, leu)7697 galU galK λ- rpsL (StrR) nupG pir-116(DHFR) | Epicentre
CYN024 | Same as Pir116 but with pCN-LA pQE-mBFP (Cat-) | This study
CYN027 | Same as Pir116 but with pCN-LPab pQE-mBFP (Cat-) | This study
EcNR2 | MG1655 bioA/bioB::λ-Red-bla ΔmutS::cat | [9]
EcNR2 mBFP | Same as EcNR2 but with pCN-mBFP | This study
EcNR2 crtEBI | Same as EcNR2 but with pIF-001K | This study
Δpgi | Same as EcNR2 but with nonsense inactivation of pgi (two stop codons (taa) replacing codon 27 and 28 of pgi) | This study
Δpgi crtEBI | Same as Δpgi but with pIF-001K | This study
EcIF15 | E_IF3 (MG1655 bioA/bioB::λ-Red-bla ΔmutS::kanR) with RBS of dxs: ACAATAAGTATAAGGAGGCCCCTG | [10]
EcIF15 crtEBI | Same as EcIF15 but with pIF-002 | [10]
ED1.0 | Same as EcNR2 but with ED-tetAR(Chr) | This study
ED1.0 mBFP | Same as ED 1.0 but with pCN-mBFP | This study
ED1.0 crtEBI | Same as ED 1.0 but with pIF-001K | This study
ED1.0MEP crtEBI | Same as EcIF15 but with ED-tetAR(Chr) and pIF-002 | This study
EDi, i = 2, …, 23 | Same as ED1.0 but with Zm-zwf, Zm-pgi, Zm-pgl, Zm-edd and Zm-eda genomic RBS libraries and pCN-mBFP | This study, refer to Table S4 for details
EDiR | Same as EcNR2 but with EDi's ED-tetAR(Chr) | This study
EDiR crtEBI | Same as EDiR but with pIF-001K | This study
EDiR,MEP crtEBI | Same as EcIF15 but with EDi-tetAR(Chr) and pIF-002 | This study

Plasmids
pQE-mBFP | T5 promoter with lac operator; mBFP; ColE1 origin; cmR; ampR | [8]
pQE-mBFP(Cat-) | T5 promoter with lac operator; mBFP; ColE1 origin; ampR | This study
pCN-mBFP | ColE1 origin; Insulated constitutive sigma 70 promoter (TTCTTGAGCACAGCTAACACCACGTCGTCCCTATCTGCTGCCCTAGGTCTATGAGTGGTTGCTGGATAACGACGTCTTTTTTTTGACAGCTAGCTCAGTCCTAGGTATAATATATTCAGGGAGACCACAGGATCCACGGTTTCCCTCTACAAATAATTTTGTTTAACTTTTACTAGAG); 356,786 au RBS (TAAAGACCCAGGACGATTTAAGGAGGAGAAAAGAC); mBFP; kanR | This study
pCN-MCSO | R6K origin; cmR | This study
pCN-L | R6K origin; constitutive promoter::LacI; cmR | This study
pCN-LA | R6K origin; constitutive promoter::LacI; pTac::AmtR; cmR | This study
pCN-LPab | R6K origin; constitutive promoter::LacI; pTac::PntA PntB; cmR | This study
pCN-LED (pCN-053) | R6K origin; pTac::Zm-zwf Zm-pgi(A123S); pTac::Zm-edd Zm-eda Zm-pgl; constitutive promoter::LacI; cmR | This study
pCN-LEDT (pCN-065) | R6K origin; pTac::Zm-zwf Zm-pgi(A123S); pTac::Zm-edd Zm-eda Zm-pgl; constitutive promoter::LacI; tetAR; cmR | This study
pIF-001C | ColE1 origin; IPTG inducible PlacO1 promoter; Rhodobacter sphaeroides crtEBI operons with optimized RBS (N14); cmR | [10]
pIF-001K | ColE1 origin; IPTG inducible PlacO1 promoter; Rhodobacter sphaeroides crtEBI operons with optimized RBS (N14); kanR | This study
pIF-002 | ColE1 origin; Arabinose inducible PBAD promoter; Rhodobacter sphaeroides crtEBI operons with optimized RBS (N14); cmR | [10]

Table A.2. Oligonucleotides used for MAGE. * indicates a phosphorothioated base.

Oligonucleotide | Sequence (5'→3')

MAGE to fix Zm-pgi mutation
prCY103 | T*G*CCAAAGAATACCATGCACGCATGCGCACCCTGATTGAAGCTATTGATGCTGGTGCATTTGGCGAAGTAAAACACCTGCTGCATATTGG

MAGE to introduce RBS libraries (degenerate oligos)
Zm-zwf_dRBS | T*G*T*G*AGCGCTCACAATTGCTAGCTAACTACATATAKWASGAGKTAACACATGACTAACACCGTATCCACGATGATTCTGTTTGGCTCCAC
Zm-kdpga_dRBS | G*C*T*A*TCTATGCGGGCGCGGGTATCTAATAGCAAACTAARCRTAASGMGGTTCCCCAATGCGTGATATTGATTCCGTGATGCGCTTAGCAC
Zm-pgd_dRBS | T*G*T*G*AGCGCTCACAATTGCCGCACACGRWAKAAGKAGGTACACATGACTGATTTACATTCTACCGTTGAAAAAGTAACTGCACGTGTGAT
Zm-pgi_dRBS | T*G*A*C*GGCGTGACTTGGTACGACTAATAGACCCGCCATAATTAAGGMGGKMGWCAATGGCACGTATCGCAAATAAAGCAGCAATTGACGCA
Zm-pgl_dRBS | T*T*T*A*AACGTGCAGCGGTGGCTTAATAGGATAGAAAGCAWTMAGSASGTAAGTTATGACGGAAGCTGAATGGTGGGAATTCGAAAACGTTG
Zm-tetA_dRBS | G*T*A*A*TTACCAATGCGATCTTTGTCGAACTATTCATTTCACTTWCCTCSWTCACTGATAGGGAGTGGTAAAATAACTCTATCAATGATAGA

Co-selection MAGE
bla_on | G*C*C*A*CATAGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAG
bla_off | G*C*C*A*CATAGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTATTAGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAG

Introduce two consecutive stop codons to knock out the pgi gene
pgi_KO | G*G*A*G*AACTTAGAAAAACGATCGCCGTCTTTAGCAAAAAGATCTTATTACGTAACGTCTTTCATTTCATCGAAGTGTTTCTGTAGTGCCTG

Table A.3. RBS sequences in each RBS library and the corresponding translation rates (au) predicted by the RBS Library Calculator [10].

No. | Sequence | Translation rate (au)

(a) Zm-zwf RBS library
dRBS | TAACTACATATAKWASGAGKTAACAC |
1 | TAACTACATATATAAGGAGGTAACAC | 983028.6
2 | TAACTACATATATTAGGAGGTAACAC | 254806.2
3 | TAACTACATATAGTAGGAGGTAACAC | 99028.6
4 | TAACTACATATATAACGAGGTAACAC | 72267.7
5 | TAACTACATATATAAGGAGTTAACAC | 63140.6
6 | TAACTACATATAGAAGGAGGTAACAC | 60362
7 | TAACTACATATATTACGAGGTAACAC | 20496.4
8 | TAACTACATATATTAGGAGTTAACAC | 18732.2
9 | TAACTACATATAGTACGAGGTAACAC | 13670.1
10 | TAACTACATATAGTAGGAGTTAACAC | 6360.7
11 | TAACTACATATAGAACGAGGTAACAC | 5312.8
12 | TAACTACATATATAACGAGTTAACAC | 3387.4
13 | TAACTACATATAGAAGGAGTTAACAC | 2585.8
14 | TAACTACATATATTACGAGTTAACAC | 2363.2
15 | TAACTACATATAGTACGAGTTAACAC | 802.5
16 | TAACTACATATAGAACGAGTTAACAC | 217.6

(b) Zm-pgi RBS library
dRBS | ACCCGCCATAATTAAGGMGGKMGWCA |
1 | ACCCGCCATAATTAAGGAGGTAGACA | 209974
2 | ACCCGCCATAATTAAGGAGGTCGACA | 44050.1
3 | ACCCGCCATAATTAAGGAGGTAGTCA | 23459.2
4 | ACCCGCCATAATTAAGGAGGGAGACA | 17907.8
5 | ACCCGCCATAATTAAGGAGGGCGACA | 17119.8
6 | ACCCGCCATAATTAAGGAGGTCGTCA | 4641.8
7 | ACCCGCCATAATTAAGGCGGTAGACA | 2363.2
8 | ACCCGCCATAATTAAGGAGGGAGTCA | 1973.9
9 | ACCCGCCATAATTAAGGAGGGCGTCA | 1506.8
10 | ACCCGCCATAATTAAGGCGGTCGACA | 1150.2
11 | ACCCGCCATAATTAAGGCGGTAGTCA | 260.5
12 | ACCCGCCATAATTAAGGCGGGAGACA | 181.7
13 | ACCCGCCATAATTAAGGCGGGCGACA | 173.7
14 | ACCCGCCATAATTAAGGCGGTCGTCA | 121.2
15 | ACCCGCCATAATTAAGGCGGGAGTCA | 20
16 | ACCCGCCATAATTAAGGCGGGCGTCA | 15.3

(c) Zm-pgl RBS library
dRBS | GATAGAAAGCAWTMAGSASGTAAGTT |
1 | GATAGAAAGCAATAAGGAGGTAAGTT | 821079.7
2 | GATAGAAAGCATTAAGGAGGTAAGTT | 599195.9
3 | GATAGAAAGCAATCAGGAGGTAAGTT | 82714.1
4 | GATAGAAAGCAATAAGCAGGTAAGTT | 22426.9
5 | GATAGAAAGCATTCAGGAGGTAAGTT | 21439.9
6 | GATAGAAAGCAATAAGGACGTAAGTT | 13068.5
7 | GATAGAAAGCATTAAGCAGGTAAGTT | 8332.5
8 | GATAGAAAGCATTAAGGACGTAAGTT | 4055.6
9 | GATAGAAAGCAATCAGCAGGTAAGTT | 1005
10 | GATAGAAAGCAATCAGGACGTAAGTT | 489.1
11 | GATAGAAAGCAATAAGCACGTAAGTT | 427.4
12 | GATAGAAAGCATTCAGCAGGTAAGTT | 190.1
13 | GATAGAAAGCATTAAGCACGTAAGTT | 132.6
14 | GATAGAAAGCATTCAGGACGTAAGTT | 92.5
15 | GATAGAAAGCAATCAGCACGTAAGTT | 17.5
16 | GATAGAAAGCATTCAGCACGTAAGTT | 3.2

(d) Zm-edd RBS library
dRBS | GCCGCACACGRWAKAAGKAGGTACAC |
1 | GCCGCACACGAAATAAGGAGGTACAC | 971887.9
2 | GCCGCACACGATATAAGGAGGTACAC | 783067.5
3 | GCCGCACACGGAATAAGGAGGTACAC | 304333.6
4 | GCCGCACACGAAAGAAGGAGGTACAC | 118277.1
5 | GCCGCACACGGTATAAGGAGGTACAC | 86314.6
6 | GCCGCACACGATAGAAGGAGGTACAC | 78884.9
7 | GCCGCACACGGAAGAAGGAGGTACAC | 40162
8 | GCCGCACACGAAATAAGTAGGTACAC | 32069.3
9 | GCCGCACACGATATAAGTAGGTACAC | 24480.4
10 | GCCGCACACGGAATAAGTAGGTACAC | 7946.7
11 | GCCGCACACGGTAGAAGGAGGTACAC | 6066.2
12 | GCCGCACACGAAAGAAGTAGGTACAC | 2059.8
13 | GCCGCACACGGTATAAGTAGGTACAC | 1255.6
14 | GCCGCACACGATAGAAGTAGGTACAC | 837.4
15 | GCCGCACACGGAAGAAGTAGGTACAC | 446
16 | GCCGCACACGGTAGAAGTAGGTACAC | 173.3

(e) Zm-eda RBS library
dRBS | CAAACTAARCRTAASGMGGTTCCCCA |
1 | CAAACTAAACATAAGGAGGTTCCCCA | 204564.1
2 | CAAACTAAACGTAAGGAGGTTCCCCA | 86990.2
3 | CAAACTAAACATAAGGCGGTTCCCCA | 26995.7
4 | CAAACTAAACATAACGAGGTTCCCCA | 20607.4
5 | CAAACTAAGCATAAGGAGGTTCCCCA | 19700.6
6 | CAAACTAAACGTAACGAGGTTCCCCA | 6997.4
7 | CAAACTAAGCGTAAGGAGGTTCCCCA | 4461.6
8 | CAAACTAAACGTAAGGCGGTTCCCCA | 3405.8
9 | CAAACTAAGCATAAGGCGGTTCCCCA | 2599.8
10 | CAAACTAAGCATAACGAGGTTCCCCA | 2076
11 | CAAACTAAACATAACGCGGTTCCCCA | 562.9
12 | CAAACTAAGCGTAACGAGGTTCCCCA | 375.4
13 | CAAACTAAGCGTAAGGCGGTTCCCCA | 174.7
14 | CAAACTAAACGTAACGCGGTTCCCCA | 62
15 | CAAACTAAGCATAACGCGGTTCCCCA | 49.5
16 | CAAACTAAGCGTAACGCGGTTCCCCA | 3.3
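Each 16-member library in Table A.3 is the full expansion of its degenerate sequence (dRBS) under the standard IUPAC nucleotide codes (for example, K = G or T, W = A or T, S = C or G, M = A or C, R = A or G). A minimal Python sketch of that expansion is given below. The function name `expand_degenerate` is illustrative only, and the code does not predict translation rates, which were obtained from the RBS Library Calculator [10].

```python
from itertools import product
from typing import List

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(sequence: str) -> List[str]:
    """Return every concrete sequence encoded by a degenerate sequence."""
    try:
        choices = [IUPAC[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unknown nucleotide code: {err.args[0]}") from None
    # Cartesian product over the allowed bases at each position.
    return ["".join(bases) for bases in product(*choices)]


if __name__ == "__main__":
    # Degenerate RBS of the Zm-zwf library, taken from Table A.3(a).
    variants = expand_degenerate("TAACTACATATAKWASGAGKTAACAC")
    print(len(variants))
```

Four two-fold degenerate positions give 2^4 = 16 variants, which matches the library size listed for each gene.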

Table A.4. Translation rate for ED variants. The Zm-pgi to Zm-eda columns give the predicted translation initiation rate (au).

Name | Zm-pgi | Zm-zwf | Zm-pgl | Zm-edd | Zm-eda | Mean normalized mBFP production rate
ED1.0 | 260 | 218 | 1005 | 2060 | 563 | 12.35
ED2 | 260 | 218 | 93 | 2060 | 2076 | 25.43
ED3 | 260 | 983029 | 13069 | 2060 | 563 | 17.12
ED4 | 17908 | 218 | 21440 | 2060 | 563 | 11.75
ED5 | 1507 | 18732 | 1005 | 2060 | 563 | 10.95
ED6 | 44050 | 218 | 18 | 2060 | 3 | 9.73
ED7 | 17120 | 218 | 18 | 173 | 563 | 9.61
ED8 | 20 | 218 | 1005 | 2060 | 563 | 9.16
ED9 | 3379 | 218 | 22427 | 118277 | 563 | 8.30
ED10 | 260 | 218 | 18 | 2060 | 563 | 7.62
ED11 | 260 | 6361 | 190 | 40162 | 26996 | 7.52
ED12 | 2363 | 218 | 4056 | 40162 | 563 | 5.67
ED13 | 1974 | 802 | 821080 | 118277 | 563 | 4.98
ED14 | 260 | 218 | 1005 | 1256 | 20607 | 4.96
ED15 | 260 | 218 | 427 | 2060 | 563 | 2.91
ED16 | 2363 | 218 | 22427 | 6066 | 175 | 2.86
ED17 | 1974 | 6361 | 821080 | 2060 | 19701 | 2.11
ED18 | 17908 | 802 | 22427 | 6066 | 563 | 1.37
ED19 | 20 | 218 | 1005 | 2060 | 563 | 1.03
ED20 | 17908 | 802 | 1005 | 971888 | 19701 | 1.02
ED21 | 260 | 218 | 13069 | 40162 | 20607 | 0.99
ED22 | 182 | 60362 | 1005 | 2060 | 563 | 0.34
ED23 | 260 | 218 | 13069 | 40162 | 20607 | 0.17

Table A.5. Net reaction for neurosporene biosynthesis from glyceraldehyde-3-phosphate and pyruvate [11-14].

Gene | Enzyme name | Equation
dxs | 1-deoxy-D-xylulose 5-phosphate synthase | g3p + h + pyr --> co2 + dxyl5p
dxr | 1-deoxy-D-xylulose reductoisomerase | dxyl5p + h + nadph --> 2me4p + nadp
ispD | 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase | 2me4p + ctp + h --> 4c2me + ppi
ispE | 4-(cytidine 5'-diphospho)-2-C-methyl-D-erythritol kinase | 4c2me + atp --> 2p4c2me + adp + h
ispF | 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase | 2p4c2me --> 2mecdp + cmp
ispG | 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase | 2mecdp + (2) flxr/nadph + h --> h2mb4p + h2o + (2) flxo/nadp
ispH/idi combined | 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate reductase and isopentenyl-diphosphate D-isomerase | h2mb4p + nad(p)h + h --> ipdp + h2o + nad(p)
idi | isopentenyl-diphosphate D-isomerase | ipdp <--> dmpp
ispA combined | farnesyl diphosphate synthase | (3) ipdp --> frdp + (2) ppi
crtE | geranylgeranyl diphosphate synthase | frdp + ipdp --> ggpp + ppi
crtB overall | phytoene synthase | (2) ggpp --> phytoene + (2) ppi
crtI overall | phytoene desaturase | phytoene + (3) fad --> neurosporene + (3) fadh2

Overall equation: (8) g3p + (24) h + (8) pyr + (8) nadph + (8) ctp + (8) atp + (16) flxr + (8) nad(p)h + (3) fad --> neurosporene + (3) fadh2 + (8) co2 + (8) nadp + (16) ppi + (8) adp + (8) cmp + (16) h2o + (16) flxo + (8) nad(p)

Abbreviation | Name | Formula
2me4p | 2-C-methyl-D-erythritol 4-phosphate | C5H11O7P
2mecdp | 2-C-methyl-D-erythritol 2,4-cyclodiphosphate | C5H10O9P2
2p4c2me | 2-phospho-4-(cytidine 5'-diphospho)-2-C-methyl-D-erythritol | C14H22N3O17P3
4c2me | 4-(cytidine 5'-diphospho)-2-C-methyl-D-erythritol | C14H23N3O14P2
adp | ADP | C10H15N5O10P2
atp | ATP | C10H12N5O13P3
co2 | carbon dioxide | CO2
ctp | cytidine-triphosphate | C9H12N3O14P3
dmpp | dimethylallyl diphosphate | C5H9O7P2
dxyl5p | 1-deoxy-D-xylulose 5-phosphate | C5H9O7P
fad | Flavin adenine dinucleotide oxidized | C27H33N9O15P2
fadh2 | Flavin adenine dinucleotide reduced | C27H35N9O15P2
flxo | flavodoxin (oxidized) | XH
flxr | flavodoxin (reduced) | XH2
frdp | farnesyl diphosphate | C15H25O7P2
g3p | D-glyceraldehyde 3-phosphate | C3H7O6P
ggpp | geranylgeranyl diphosphate | C20H36O7P2
grdp | geranyl diphosphate | C10H20O7P2
h2mb4p | 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate | C5H9O8P2
h2o | water | H2O
ipdp | isopentenyl diphosphate | C5H9O7P2
nadp | Nicotinamide adenine dinucleotide phosphate | C21H28N7O17P3
nadph | Nicotinamide adenine dinucleotide phosphate - reduced | C21H30N7O17P3
neurosporene | | C40H58
phytoene | | C40H64
ppi | diphosphate | HO7P2
pyr | pyruvate | C3H4O3

Table A.6. Comparison of neurosporene production from pIF-001C (PlacO1-crtEBI) and pIF-002 (PBAD-crtEBI) in ED17R,MEP. Values represent the average of two replicates.

Condition | Strain | Inducer | Average neurosporene content (µg/gDCW)
2x M9, 37°C, 200 RPM, 10 h culture | EcIF15 + ED17’s ED-tetAR + pIF-001C (PlacO1-crtEBI) | 0.2 mM IPTG | 3154.0
2x M9, 37°C, 200 RPM, 10 h culture | EcIF15 + ED17’s ED-tetAR + pIF-002 (PBAD-crtEBI) | 10 mM arabinose + 0.2 mM IPTG | 3714.1

A.4. References

1. Wang HH, Church GM: Multiplexed genome engineering and genotyping methods applications for synthetic biology and metabolic engineering. Methods in enzymology 2011, 498:409-426.
2. Carr PA, Wang HH, Sterling B, Isaacs FJ, Lajoie MJ, Xu G, Church GM, Jacobson JM: Enhanced multiplex genome engineering through co-operative oligonucleotide co-selection. Nucleic acids research 2012, 40:e132.
3. Podolsky T, Fong ST, Lee BT: Direct selection of tetracycline-sensitive Escherichia coli cells using nickel salts. Plasmid 1996, 36:112-115.
4. Bertrand KP, Postle K, Wray LV, Reznikoff WS: Overlapping divergent promoters control expression of Tn10 tetracycline resistance. Gene 1983, 23:149-156.
5. Wang HH, Kim H, Cong L, Jeong J, Bang D, Church GM: Genome-scale promoter engineering by coselection MAGE. Nature methods 2012, 9:591-593.
6. Bochner BR, Huang HC, Schieven GL, Ames BN: Positive selection for loss of tetracycline resistance. Journal of bacteriology 1980, 143:926-933.
7. Moyed HS, Bertrand KP: Mutations in multicopy Tn10 tet plasmids that confer resistance to inhibitory effects of inducers of tet gene expression. Journal of bacteriology 1983, 155:557-564.
8. Hwang C-S, Choi E-S, Han S-S, Kim G-J: Screening of a highly soluble and oxygen-independent blue fluorescent protein from metagenome. Biochemical and Biophysical Research Communications 2012, 419:676-681.
9. Wang HH, Isaacs FJ, Carr PA, Sun ZZ, Xu G, Forest CR, Church GM: Programming cells by multiplex genome engineering and accelerated evolution. Nature 2009, 460:894-898.
10. Farasat I, Kushwaha M, Collens J, Easterbrook M, Guido M, Salis HM: Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Molecular systems biology 2014, 10:731.
11. Alper H, Jin Y-S, Moxley JF, Stephanopoulos G: Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli. Metabolic engineering 2005, 7:155-164.
12. Feist AM, Henry CS, Reed JL, Krummenacker M, Joyce AR, Karp PD, Broadbelt LJ, Hatzimanikatis V, Palsson BØ: A genome-scale metabolic reconstruction for Escherichia coli K-12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Molecular systems biology 2007, 3:121.
13. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 2000, 28:27-30.
14. Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, Palsson BØ: A comprehensive genome-scale reconstruction of Escherichia coli metabolism--2011. Molecular systems biology 2011, 7:535.

Appendix B. Supplementary Information for Chapter 3

B.1. Supplementary notes

B.1.1. Reducing the run time of optStoic/minFlux

The run time of the modified minFlux procedure depends on the search space (database size), the allowable sum of flux through the pathway ($z^*$, which is equivalent to the total number of reactions in a pathway for the minRxn procedure), and the number of integer cut constraints. The following techniques were used to alleviate the run-time issue:

1. Prior to the minFlux step, blocked reactions (i.e., reactions incapable of carrying flux) were identified and excluded from the search space/solution space.

2. Integer cut constraints were imposed to enable identification of unique pathways. Increasing the number of integer cut constraints significantly increases the run time. In particular, when the following integer cut is used to identify new pathways with increasing sum of flux ($z^*$), the run time is affected by both the number of integer cuts and the expanding search space due to the higher sum of flux:

\[
\sum_{j \in \mathbf{J} \,\mid\, y_j^{f,k} + y_j^{r,k} = 1} \left(1 - y_j^{f} - y_j^{r}\right) + \sum_{j \in \mathbf{J} \,\mid\, y_j^{f,k} + y_j^{r,k} \neq 1} \left(y_j^{f} + y_j^{r}\right) \geq 1, \quad \forall k = 1, 2, 3, \ldots
\]

Here the superscript $k$ marks the values taken by the binary variables $y_j^{f}$ and $y_j^{r}$ in the $k$-th previously identified pathway.

To solve this issue, we imposed an equality constraint on the objective function as follows:

\[
\sum_{j \in \mathbf{J} \setminus \mathbf{J}^{exchange}} |v_j| = z^*
\]

and used only the following integer cut to prevent the same pathway from being identified again for the same total sum of flux $z^*$:

\[
\sum_{j \in \mathbf{J} \,\mid\, y_j^{f,k} + y_j^{r,k} = 1} \left(1 - y_j^{f} - y_j^{r}\right) \geq 1, \quad \forall k = 1, 2, 3, \ldots
\]
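To make the difference between the two integer cuts concrete, the following Python sketch evaluates their left-hand sides for a candidate pathway against a previously found one. It is an illustration only and is not the optimization code used in this work. The function names, the dictionary representation of a pathway (reaction ID mapped to a pair of forward and reverse binary values), and the reaction IDs R1 to R3 are hypothetical, and the sketch assumes that a reaction is never active in both directions at once.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# A pathway maps each reaction ID to (y_forward, y_reverse), both 0 or 1.
# Assumption of this sketch: y_forward + y_reverse <= 1 for every reaction.
Pathway = Dict[str, Tuple[int, int]]


def active_reactions(pathway: Pathway) -> Set[str]:
    """Reactions with y_f + y_r = 1 in the given pathway."""
    return {j for j, (yf, yr) in pathway.items() if yf + yr == 1}


def full_cut_lhs(candidate: Pathway, previous: Pathway) -> int:
    """Left-hand side of the two-term integer cut for one earlier pathway."""
    active = active_reactions(previous)
    lhs = 0
    for j in set(candidate) | set(previous):
        yf, yr = candidate.get(j, (0, 0))
        # Active in the earlier pathway: count if now switched off.
        # Inactive in the earlier pathway: count if now switched on.
        lhs += (1 - yf - yr) if j in active else (yf + yr)
    return lhs


def reduced_cut_lhs(candidate: Pathway, previous: Pathway) -> int:
    """Left-hand side of the one-term integer cut for one earlier pathway."""
    return sum(
        1 - sum(candidate.get(j, (0, 0))) for j in active_reactions(previous)
    )


def is_allowed(
    candidate: Pathway,
    previous_pathways: Iterable[Pathway],
    lhs: Callable[[Pathway, Pathway], int],
) -> bool:
    """True if the candidate satisfies the cut (lhs >= 1) for every k."""
    return all(lhs(candidate, prev) >= 1 for prev in previous_pathways)


if __name__ == "__main__":
    found = {"R1": (1, 0), "R2": (0, 1), "R3": (0, 0)}
    swapped = {"R1": (1, 0), "R2": (0, 0), "R3": (1, 0)}
    print(is_allowed(found, [found], reduced_cut_lhs))
    print(is_allowed(swapped, [found], reduced_cut_lhs))
```

One consequence visible in this sketch is that the one-term cut also rejects any candidate that contains every reaction of an earlier pathway plus additional ones, whereas the two-term cut only rejects an exact repeat.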

B.2. Supplementary tables

Table B1. Cofactors that were removed from the S matrix when generating the internal stoichiometric matrix ($S^*$).

KEGG ID | Description
C00001 | H2O
C00002 | ATP
C00003 | NAD+
C00004 | NADH
C00005 | NADPH
C00006 | NADP+
C00007 | Oxygen
C00008 | ADP
C00009 | Orthophosphate
C00010 | CoA
C00011 | CO2
C00013 | Diphosphate
C00015 | UDP
C00016 | FAD
C00020 | AMP
C00035 | GDP
C00044 | GTP
C00055 | CMP
C00063 | CTP
C00075 | UTP
C00080 | H+
C00081 | ITP
C00104 | IDP
C00105 | UMP
C00112 | CDP
C00131 | dATP
C00138 | Reduced ferredoxin
C00139 | Oxidized ferredoxin
C00144 | GMP
C00206 | dADP
C00286 | dGTP
C00360 | dAMP
C00361 | dGDP
C00362 | dGMP
C00363 | dTDP
C00364 | dTMP
C00365 | dUMP
C00390 | Ubiquinol
C00399 | Ubiquinone
C00458 | dCTP
C00459 | dTTP
C00460 | dUTP
C01352 | FADH2

B.3. Supplementary figures

Figure B.1. Statistics of the glycolytic pathways generated using the modified optStoic procedure. The distribution of the glycolytic pathway alternatives based on (A) total flux through a pathway, and (B) the number of reactions in a pathway. Note that the total flux through a pathway and the number of reactions are calculated without accounting for the exchange reactions. The colors represent the ATP yield per glucose (mol ATP/mol glucose) generated by a pathway at a fixed glucose uptake flux. Red dashed lines indicate the mean values, whereas blue dashed lines denote the median values.

Figure B.2. Distribution of absolute metabolite concentrations across different organisms [1]. The fractions of metabolites with concentrations between 1 µM and 100 mM are 97.1%, 97.7% and 97.5% for mammalian cells, yeast and E. coli, respectively. The fractions of metabolites with concentrations between 1 µM and 10 mM are 94.1%, 90.9% and 94.2% for mammalian cells, yeast and E. coli, respectively.

Figure B.3. The ATP yield versus minimal protein cost plot. Pathways are color-coded based on the type of redox cofactors produced: (Blue) 2 NADH, (Green) 1 NADH and 1 NADPH, and (Red) 2 NADPH.

B.4. Reference

1. Park JO, Rubin SA, Xu YF, Amador-Noguez D, Fan J, Shlomi T, Rabinowitz JD: Metabolite concentrations, fluxes and free energies imply efficient enzyme usage. Nature chemical biology 2016.

Appendix C. Supplementary Information for Chapter 5

C.1. Supplementary notes

The pathway-independent maximum theoretical carbon yield ($Y_C^{max}$) of each molecule starting from glucose (C-mol product/C-mol glucose) is calculated by assuming that water and carbon dioxide are the only other products. This yields the following overall conversion stoichiometry:

\[
\text{Glucose} \rightarrow v_1\,\text{Product} + v_2\,\text{Water} + v_3\,\text{Carbon dioxide}
\]

\[
\mathrm{C_6H_{12}O_6} \rightarrow v_1\,\mathrm{C}_x\mathrm{H}_y\mathrm{O}_z + v_2\,\mathrm{H_2O} + v_3\,\mathrm{CO_2} \tag{29}
\]

where $v_i$ is the stoichiometric coefficient of compound $i$. The maximum theoretical product yield ($Y_P^{max}$) is hence $v_1$ mol/mol glucose, and the maximum theoretical carbon yield is $Y_C^{max} = v_1 x / 6$ C-mol/C-mol glucose.
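As a worked illustration of Equation 29, the three coefficients follow directly from the carbon, hydrogen and oxygen balances: $v_1 = 24/(4x + y - 2z)$, $v_2 = 6 - v_1 y/2$ and $v_3 = 6 - v_1 x$. The Python sketch below evaluates these expressions with exact fractions. The function name `max_theoretical_yield` is hypothetical, and the sketch assumes a product that contains only C, H and O. A negative $v_2$ or $v_3$ returned by the function means that the balance places water or carbon dioxide on the reactant side.

```python
from fractions import Fraction
from typing import Tuple


def max_theoretical_yield(
    x: int, y: int, z: int
) -> Tuple[Fraction, Fraction, Fraction, Fraction]:
    """Solve C6H12O6 -> v1 CxHyOz + v2 H2O + v3 CO2 by element balance.

    Returns (v1, v2, v3, carbon_yield), where carbon_yield = v1 * x / 6
    in C-mol product per C-mol glucose.
    """
    denominator = 4 * x + y - 2 * z
    if x <= 0 or denominator <= 0:
        raise ValueError("formula does not give a valid balance")

    v1 = Fraction(24, denominator)   # from combining the C, H and O balances
    v2 = 6 - v1 * Fraction(y, 2)     # hydrogen balance: 12 = v1*y + 2*v2
    v3 = 6 - v1 * x                  # carbon balance:    6 = v1*x + v3
    carbon_yield = v1 * x / 6
    return v1, v2, v3, carbon_yield


if __name__ == "__main__":
    # Ethanol, C2H6O.
    v1, v2, v3, yc = max_theoretical_yield(2, 6, 1)
    print(v1, v2, v3, yc)
```

For ethanol the balance reduces to the familiar conversion of one glucose into two ethanol and two carbon dioxide, which corresponds to a carbon yield of 2/3 C-mol/C-mol glucose.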

C.2. Supplementary table and figure

Table C.1. Experimental titers of biomolecules. This table provides additional information (host organism and reference) for Figure 5.2.

| Biomolecule | Experimental titer (g/L) | Host organism | Reference |
|---|---|---|---|
| ethanol | 158 | Saccharomyces cerevisiae | [1] Stowers MD et al. (2009) |
| 2,3-butanediol | 152 | Enterobacter cloacae | [2] Li L et al. (2015) |
| succinic acid | 146.3 | Corynebacterium glutamicum R | [3] Okino S et al. (2008) |
| lactic acid | 135 | Lactobacillus delbrueckii NCIM 2365 | [4] Kadam SR et al. (2006) |
| 1,3-propanediol | 135 | Escherichia coli | [5] Nakamura CE et al. (2003) |
| beta-farnesene | 130 | Saccharomyces cerevisiae | [44] Meadows AL et al. (2016) |
| propanoic acid | 106 | Propionibacterium acidipropionici | [6] Zhang A et al. (2009) |
| acetoin | 100.1 | Saccharomyces cerevisiae | [7] Bae SJ et al. (2016) |
| 1,4-butanediol | 99 | Escherichia coli | [8] Burgard A et al. (2016) |
| butanoic acid | 62.8 | Clostridium tyrobutyricum CIP I-776 | [9] Fayolle F et al. (1990) |
| pyruvic acid | 62.6 | Escherichia coli | [10] Zelic B et al. (2003) |
| acetic acid | 52.73 | Escherichia coli W3110 | [11] Causey TB et al. (2003) |
| isopropanol | 40.1 | Escherichia coli | [12] Inokuma K et al. (2010) |
| isobutyl acetate | 36 | Escherichia coli | [13] Tai YS et al. (2015) |
| isobutyraldehyde | 35 | Escherichia coli | [14] Rodriguez GM et al. (2012) |
| itaconic acid | 32 | Escherichia coli | [15] Harder BJ et al. (2016) |
| 1-butanol | 30 | Escherichia coli | [16] Shen CR et al. (2011) |
| isobutanol | 22 | Escherichia coli | [17] Atsumi S et al. (2008) |
| 1-propanol | 10.8 | Escherichia coli | [18] Choi YJ et al. (2012) |
| 1,2-propanediol | 9.5 | Clostridium thermosaccharolyticum | [19] Sanchez-Riera F et al. (1987) |
| isopentanol | 9.5 | Escherichia coli | [20] Connor MR et al. (2010) |
| benzaldehyde | 5 | Pichia pastoris | [21] Craig T et al. (2012) |
| limonene | 2.7 | Escherichia coli | [22] Willrodt C et al. (2014) |
| 3-methyl-3-buten-1-ol | 2.23 | Escherichia coli | [23] George KW et al. (2015) |
| 1-pentanol | 2.22 | Escherichia coli | [24] Marcheschi RJ et al. (2012) |
| glutaric acid | 1.7 | Escherichia coli WL3110 | [25] Park SJ et al. (2013) |
| butyl acetate | 1.7 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| 2-methyl-1-butanol | 1.25 | Escherichia coli | [27] Cann AF et al. (2008) |
| 3-methyl-1-butene-1-ol | 1.088 | Escherichia coli | [28] Kang A et al. (2016) |
| 2-butanol | 1.03 | Klebsiella pneumoniae | [29] Chen Z et al. (2015) |
| alpha-bisabolene | 0.994 | Saccharomyces cerevisiae | [30] Peralta-Yahya PP et al. (2011) |
| alpha-pinene | 0.97 | Escherichia coli | [31] Yang J et al. (2013) |
| 1-hexanol | 0.94 | Clostridium carboxidivorans | [32] Phillips JR et al. (2015) |
| 3-methyl-1-pentanol | 0.7935 | Escherichia coli | [33] Zhang K et al. (2008) |
| 3-methylbutyl acetate | 0.78 | Escherichia coli | [13] Tai YS et al. (2015) |
| pentanoic acid | 0.4 | Escherichia coli | [34] Tseng HC et al. (2012) |
| ethyl acetate | 0.33 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| 2-phenylethyl acetate | 0.3 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| octanoic acid | 0.242 | Escherichia coli | [35] Torella JP et al. (2013) |
| isohexanol | 0.192 | Escherichia coli | [36] Sheppard MJ et al. (2014) |
| hexanoic acid | 0.154 | Kluyveromyces marxianus | [37] Cheon Y et al. (2014) |
| tetradecyl acetate | 0.137 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| ethyl butanoate | 0.134 | Escherichia coli | [38] Layton DS et al. (2014) |
| perillyl alcohol | 0.105 | Escherichia coli | [39] Alonso-Gutierrez J et al. (2013) |
| 1-heptanol | 0.08 | Escherichia coli | [24] Marcheschi RJ et al. (2012) |
| pentyl acetate | 0.0663 | Escherichia coli | [40] Layton DS et al. (2016a) |
| isobutyl pentanoate | 0.06471 | Escherichia coli | [40] Layton DS et al. (2016a) |
| pentyl pentanoate | 0.06263 | Escherichia coli | [40] Layton DS et al. (2016a) |
| 1-octanol | 0.06 | Escherichia coli | [41] Machado HB et al. (2012) |
| myrcene | 0.05819 | Escherichia coli | [42] Kim EM et al. (2015) |
| 4-methyl-1-hexanol | 0.0573 | Escherichia coli | [33] Zhang K et al. (2008) |
| isobutyl butanoate | 0.04125 | Escherichia coli | [40] Layton DS et al. (2016a) |
| ethyl pentanoate | 0.04086 | Escherichia coli | [43] Layton DS et al. (2016b) |
| 2-phenylethyl isobutanoate | 0.04 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| butyl butanoate | 0.03683 | Escherichia coli | [38] Layton DS et al. (2014) |
| hexyl acetate | 0.03117 | Escherichia coli | [40] Layton DS et al. (2016a) |
| isobutyl isobutanoate | 0.0271 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| 5-methyl-1-hexanol | 0.022 | Escherichia coli | [33] Zhang K et al. (2008) |
| 2-methylbutyl acetate | 0.02 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| propyl acetate | 0.02 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| ethyl propionate | 0.01588 | Escherichia coli | [43] Layton DS et al. (2016b) |
| 3-methylbutyl isobutanoate | 0.01 | Escherichia coli | [26] Rodriguez GM et al. (2014) |
| propyl butanoate | 0.00521 | Escherichia coli | [38] Layton DS et al. (2014) |
| isopropyl butanoate | 0.0044 | Escherichia coli | [38] Layton DS et al. (2014) |
| isobutyl propionate | 0.0036 | Escherichia coli | [40] Layton DS et al. (2016a) |
| ethyl hexanoate | 0.00335 | Escherichia coli | [43] Layton DS et al. (2016b) |
| propyl propionate | 0.00061 | Escherichia coli | [43] Layton DS et al. (2016b) |
| ethyl isobutanoate | 1.00E-07 | Escherichia coli | [26] Rodriguez GM et al. (2014) |


Figure C.1. All experimental titers (in g/L) extracted from our literature survey for each molecule. Each circle represents an experimental data point, and the color of the circle indicates the host organism as given in the legend.


C.3. References

1. Stowers MD: The U.S. Ethanol Industry. Federal Reserve Bank of St Louis Regional Economic Development 2009, 5(1):3-11.
2. Li L, Li K, Wang Y, Chen C, Xu Y, Zhang L, Han B, Gao C, Tao F, Ma C, Xu P: Metabolic engineering of Enterobacter cloacae for high-yield production of enantiopure (2R,3R)-2,3-butanediol from lignocellulose-derived sugars. Metabolic engineering 2015.
3. Okino S, Noburyu R, Suda M, Jojima T, Inui M, Yukawa H: An efficient succinic acid production process in a metabolically engineered Corynebacterium glutamicum strain. Applied microbiology and biotechnology 2008.
4. Kadam SR, Patil SS, Bastawde KB, Khire JM, Gokhale DV: Strain improvement of Lactobacillus delbrueckii NCIM 2365 for lactic acid production. Process Biochemistry 2006, 41(1):120-126.
5. Nakamura CE, Whited GM: Metabolic engineering for the microbial production of 1,3-propanediol. Current opinion in biotechnology 2004.
6. Zhang A, Yang S-T: Propionic acid production from glycerol by metabolically engineered Propionibacterium acidipropionici. Process Biochemistry 2009, 44(12):1346-1351.
7. Bae SJ, Kim S, Hahn JS: Efficient production of acetoin in Saccharomyces cerevisiae by disruption of 2,3-butanediol dehydrogenase and expression of NADH oxidase. Scientific reports 2016.
8. Burgard A, Burk MJ, Osterhout R, Van Dien S, Yim H: Development of a commercial scale process for production of 1,4-butanediol from sugar. Current opinion in biotechnology 2016.
9. Fayolle F, Marchal R, Ballerini D: Effect of controlled substrate feeding on butyric acid production by Clostridium tyrobutyricum. Journal of Industrial Microbiology 1990, 6(3):179-183.
10. Zelic B, Gostovic S, Vuorilehto K, Vasic-Racki D, Takors R: Process strategies to enhance pyruvate production with recombinant Escherichia coli: from repetitive fed-batch to in situ product recovery with fully integrated electrodialysis. Biotechnology and bioengineering 2004.
11. Causey TB, Zhou S, Shanmugam KT, Ingram LO: Engineering the metabolism of Escherichia coli W3110 for the conversion of sugar to redox-neutral and oxidized products: homoacetate production. Proceedings of the National Academy of Sciences of the United States of America 2003.
12. Inokuma K, Liao JC, Okamoto M, Hanai T: Improvement of isopropanol production by metabolically engineered Escherichia coli using gas stripping. Journal of bioscience and bioengineering 2011.
13. Tai YS, Xiong M, Zhang K: Engineered biosynthesis of medium-chain esters in Escherichia coli. Metabolic engineering 2015.
14. Rodriguez GM, Atsumi S: Isobutyraldehyde production from Escherichia coli by removing aldehyde reductase activity. Microbial cell factories 2013.
15. Harder BJ, Bettenbrock K, Klamt S: Model-based metabolic engineering enables high yield itaconic acid production by Escherichia coli. Metabolic engineering 2016.
16. Shen CR, Lan EI, Dekishima Y, Baez A, Cho KM, Liao JC: Driving forces enable high-titer anaerobic 1-butanol synthesis in Escherichia coli. Applied and environmental microbiology 2011.
17. Atsumi S, Hanai T, Liao JC: Non-fermentative pathways for synthesis of branched-chain higher alcohols as biofuels. Nature 2008.
18. Choi YJ, Park JH, Kim TY, Lee SY: Metabolic engineering of Escherichia coli for the production of 1-propanol. Metabolic engineering 2013.
19. Sánchez-Riera F, Cameron DC, Cooney CL: Influence of environmental factors in the production of R(−)-1,2-propanediol by clostridium thermosaccharolyticum. Biotechnology letters 1987, 9(7):449-454.
20. Connor MR, Cann AF, Liao JC: 3-Methyl-1-butanol production in Escherichia coli: random mutagenesis and two-phase fermentation. Applied microbiology and biotechnology 2010, 86(4):1155-1164.
21. Craig T, Daugulis AJ: Polymer characterization and optimization of conditions for the enhanced bioproduction of benzaldehyde by Pichia pastoris in a two-phase partitioning bioreactor. Biotechnology and bioengineering 2013.
22. Willrodt C, David C, Cornelissen S, Buhler B, Julsing MK, Schmid A: Engineering the productivity of recombinant Escherichia coli for limonene formation from glycerol in minimal media. Biotechnology journal 2015.
23. George KW, Thompson MG, Kang A, Baidoo E, Wang G, Chan LJ, Adams PD, Petzold CJ, Keasling JD, Lee TS: Metabolic engineering for the high-yield production of isoprenoid-based C5 alcohols in E. coli. Scientific reports 2016.
24. Marcheschi RJ, Li H, Zhang K, Noey EL, Kim S, Chaubey A, Houk KN, Liao JC: A synthetic recursive "+1" pathway for carbon chain elongation. ACS chemical biology 2012.
25. Park SJ, Kim EY, Noh W, Park HM, Oh YH, Lee SH, Song BK, Jegal J, Lee SY: Metabolic engineering of Escherichia coli for the production of 5-aminovalerate and glutarate as C5 platform chemicals. Metabolic engineering 2013.
26. Rodriguez GM, Tashiro Y, Atsumi S: Expanding ester biosynthesis in Escherichia coli. Nature chemical biology 2014.
27. Cann AF, Liao JC: Production of 2-methyl-1-butanol in engineered Escherichia coli. Applied microbiology and biotechnology 2008.
28. Kang A, George KW, Wang G, Baidoo E, Keasling JD, Lee TS: Isopentenyl diphosphate (IPP)-bypass mevalonate pathways for isopentenol production. Metabolic engineering 2016.
29. Chen Z, Wu Y, Huang J, Liu D: Metabolic engineering of Klebsiella pneumoniae for the de novo production of 2-butanol as a potential biofuel. Bioresource technology 2016.
30. Peralta-Yahya PP, Ouellet M, Chan R, Mukhopadhyay A, Keasling JD, Lee TS: Identification and microbial production of a terpene-based advanced biofuel. Nature communications 2012.
31. Yang J, Nie Q, Ren M, Feng H, Jiang X, Zheng Y, Liu M, Zhang H, Xian M: Metabolic engineering of Escherichia coli for the biosynthesis of alpha-pinene. Biotechnology for biofuels 2013.
32. Phillips JR, Atiyeh HK, Tanner RS, Torres JR, Saxena J, Wilkins MR, Huhnke RL: Butanol and hexanol production in Clostridium carboxidivorans syngas fermentation: Medium development and culture techniques. Bioresource technology 2016.
33. Zhang K, Sawaya MR, Eisenberg DS, Liao JC: Expanding metabolism for biosynthesis of nonnatural alcohols. Proceedings of the National Academy of Sciences of the United States of America 2009.
34. Tseng HC, Prather KL: Controlled biosynthesis of odd-chain fuels and chemicals via engineered modular metabolic pathways. Proceedings of the National Academy of Sciences of the United States of America 2013.
35. Torella JP, Ford TJ, Kim SN, Chen AM, Way JC, Silver PA: Tailored fatty acid synthesis via dynamic control of fatty acid elongation. Proceedings of the National Academy of Sciences of the United States of America 2013.
36. Sheppard MJ, Kunjapur AM, Wenck SJ, Prather KL: Retro-biosynthetic screening of a modular pathway design achieves selective route for microbial synthesis of 4-methyl-pentanol. Nature communications 2015.
37. Cheon Y, Kim JS, Park JB, Heo P, Lim JH, Jung GY, Seo JH, Park JH, Koo HM, Cho KM, Park JB, Ha SJ, Kweon DH: A biosynthetic pathway for hexanoic acid production in Kluyveromyces marxianus. Journal of biotechnology 2015.
38. Layton DS, Trinh CT: Engineering modular ester fermentative pathways in Escherichia coli. Metabolic engineering 2014.
39. Alonso-Gutierrez J, Chan R, Batth TS, Adams PD, Keasling JD, Petzold CJ, Lee TS: Metabolic engineering of Escherichia coli for limonene and perillyl alcohol production. Metabolic engineering 2014.
40. Layton DS, Trinh CT: Microbial synthesis of a branched-chain ester platform from organic waste carboxylates. Metabolic Engineering Communications 2016, 3:245-251.
41. Machado HB, Dekishima Y, Luo H, Lan EI, Liao JC: A selection platform for carbon chain elongation using the CoA-dependent pathway to produce linear higher alcohols. Metabolic engineering 2013.
42. Kim EM, Eom JH, Um Y, Kim Y, Woo HM: Microbial Synthesis of Myrcene by Metabolically Engineered Escherichia coli. Journal of agricultural and food chemistry 2015.
43. Layton DS, Trinh CT: Expanding the modular ester fermentative pathways for combinatorial biosynthesis of esters from volatile organic acids. Biotechnology and bioengineering 2016.
44. Meadows AL, Hawkins KM, Tsegaye Y, Antipov E, Kim Y, Raetz L, Dahl RH, Tai A, Mahatdejkul-Meadows T, Xu L et al: Rewriting yeast central carbon metabolism for industrial isoprenoid production. Nature 2016, 537(7622):694-697.


VITA

Chiam Yu Ng

Education

- Ph.D., Chemical Engineering, The Pennsylvania State University, USA, 2017
- M.Eng., Chemical and Biological Engineering, Korea University, Republic of Korea, 2012
- B.E., Chemical and Biological Engineering, Korea University, Republic of Korea, 2010

Honors and Awards

- Lagoa, Ray and Monkowski Award for Best Paper Presentation in Biomedical Science, 12th Annual College of Engineering Research Symposium at The Pennsylvania State University, April 2015
- Best Oral Presentation Award, Synthetic Biology Engineering Research Center (SYNBERC) 2014 Fall Retreat at MIT, Boston
- Chemical Engineering Graduate Program Best Candidacy Exam Award, The Pennsylvania State University, September 2013

Publications

- Ng, C.Y., Chowdhury, A. & Maranas, C.D. A Pareto optimality explanation of the glycolytic alternatives in nature. In preparation.
- Kumar, A., Wang, L., Ng, C.Y. & Maranas, C.D. Pathway design using de novo steps through uncharted biochemical spaces. Under review.
- Heirendt, L. et al. Creation and analysis of biochemical constraint-based models: the COBRA Toolbox v3.0. Under review.
- Lynd, L.R. et al. Biological Conversion of Lignocellulose to Fuels: A Strategic Perspective. In preparation.
- Suástegui, M. et al. Multilevel engineering of the upstream module of aromatic amino acid biosynthesis in Saccharomyces cerevisiae for high production of polymer and drug precursors. Metabolic Engineering 42, 134-144 (2017).
- Ng, C.Y., Chowdhury, A. & Maranas, C.D. A microbial factory for diverse chemicals. Nature Biotechnology 34, 513-515 (2016).
- Dash, S., Ng, C.Y. & Maranas, C.D. Metabolic modeling of clostridia: current developments and applications. FEMS Microbiology Letters 363 (2016).
- Ng, C.Y., Farasat, I., Maranas, C.D. & Salis, H.M. Rational design of a synthetic Entner-Doudoroff pathway for improved and controllable NADPH regeneration. Metabolic Engineering 29, 86-96 (2015).
- Ng, C.Y., Khodayari, A., Chowdhury, A. & Maranas, C.D. Advances in de novo strain design using integrated systems and synthetic biology tools. Current Opinion in Chemical Biology 28, 105-114 (2015).