DISSERTATION
GENOME-SCALE METABOLIC MODELING OF CYANOBACTERIA: NETWORK
STRUCTURE, INTERACTIONS, RECONSTRUCTION AND DYNAMICS
Submitted by
Chintan Jagdishchandra Joshi
Department of Chemical and Biological Engineering
In partial fulfilment of the requirements
For the Degree of Doctor of Philosophy
Colorado State University
Fort Collins, Colorado
Fall 2016
Doctoral Committee:
Advisor: Ashok Prasad
Christie A. M. Peebles
Kenneth Reardon
Graham Peers
Copyright by Chintan Jagdishchandra Joshi 2016
All Rights Reserved
ABSTRACT
GENOME-SCALE METABOLIC MODELING OF CYANOBACTERIA: NETWORK
STRUCTURE, INTERACTIONS, RECONSTRUCTION AND DYNAMICS
Metabolic network modeling, a field of systems biology and bioengineering, enhances the quantitative predictive understanding of cellular metabolism and thereby assists in the development of model-guided metabolic engineering strategies. Metabolic models use genome-scale network reconstructions and combine them with mathematical methods for quantitative prediction. Metabolic system reconstructions contain information on genes, enzymes, reactions, and metabolites, and are converted into two types of networks: (i) gene-enzyme-reaction, and (ii) reaction-metabolite. The former details the links between the genes that are known to code for metabolic enzymes and the reaction pathways that the enzymes participate in. The latter details the chemical transformation of metabolites, step by step, into biomass and energy. The latter network is transformed into a system of equations and simulated using different methods.
Prominent among these are constraint-based methods, especially Flux Balance Analysis, which utilizes linear programming tools to predict intracellular fluxes of single cells. Over the past 25 years, metabolic network modeling has had a range of applications in the fields of model-driven discovery, prediction of cellular phenotypes, analysis of biological network properties, multi-species interactions, engineering of microbes for product synthesis, and studying evolutionary processes. This thesis is concerned with the development and application of metabolic network modeling to cyanobacteria as well as E. coli.
Chapter 1 is a brief survey of the past, present, and future of constraint-based modeling using flux balance analysis in systems biology. It includes discussion of the (i) formulation, (ii) assumptions, (iii) variety, (iv) availability, and (v) future directions of constraint-based modeling.
Chapter 2 explores the enzyme-reaction networks of metabolic reconstructions belonging to various organisms, and finds that the distribution of the number of reactions an enzyme participates in, i.e. the enzyme-reaction distribution, is surprisingly similar across them. The role of this distribution in the robustness of the organism is also explored. Chapter 3 applies flux balance analysis to models of E. coli, Synechocystis sp. PCC6803, and C. reinhardtii to understand epistatic interactions between metabolic genes and pathways. We show that epistatic interactions are dependent on the environmental conditions, i.e. carbon source and carbon/oxygen ratio in E. coli, and light intensity in Synechocystis sp. PCC6803 and C. reinhardtii.
Cyanobacteria are photosynthetic organisms and have great potential for metabolic engineering to produce commercially important chemicals such as biofuels, pharmaceuticals, and nutraceuticals. Chapter 4 presents our new genome-scale reconstruction of the model cyanobacterium Synechocystis sp. PCC6803, called iCJ816. This reconstruction was analyzed and compared to experimental studies, and used for predicting the capacity of the organism for (i) carbon dioxide remediation, and (ii) production of intracellular chemical species. Chapter 5 uses our new model iCJ816 for dynamic analysis under diurnal growth simulations. We discuss predictions of different optimization schemes, and present a scheme that qualitatively matches observations.
ACKNOWLEDGEMENTS
I started my journey in the field of metabolic modeling about seven years ago, while I was a Professional Science Master’s (PSM) student at Oregon State University. Little did I know that modeling the MAPK pathway in bioprocess control systems, a class taught by Dr. Ganti Murthy, would send me down a path to pursue a doctorate a year later.
Now, after 6 years, as I come to the final steps of my doctorate, I want to take the opportunity to express my gratitude to my advisor, co-advisor, committee members, professors, colleagues, department secretaries, friends, and family. These 6 years would not have been as productive if it were not for these thoughtful, brilliant, dedicated, and hardworking people. I have truly come to understand the meaning of the phrase, “It takes a village…”
I express my heartfelt gratitude to my advisor, Dr. Ashok Prasad, for his consistent faith in me since the inception of our student-advisor relationship. I highly commend his scientific enthusiasm and his willingness to allow me free rein in the choice of projects, which a young student can only dream about. Though he challenged me to do my best work, he also made sure that I remained on track rather than pursuing tangents. I find myself indebted to his inexhaustible patience and support during these 6 years of my learning in matters both professional and personal.
I am grateful to my co-advisor, Dr. Christie Peebles, who enhanced my learning with regular discussions on the experimental biology of E. coli and cyanobacteria. Our collaborations on topics (included in this thesis) have greatly helped me think about my work from various aspects. I found my interactions with her during group and personal meetings very enlightening.
I would also like to thank my committee members Dr. Graham Peers and Dr. Kenneth
Reardon. Dr. Graham Peers has pushed my understanding of cyanobacterial photosynthesis, and challenged me to think at the interface of computational and experimental biology. Dr. Kenneth
Reardon has been of tremendous help in my learning of presenting scientific research, be it for a conference or in-class projects. His experience with scientific research, in academia and industry alike, provides an inspiration to aspiring scientists like me.
I would also like to acknowledge all the professors in the Department of Chemical and
Biological Engineering who played a crucial role in establishing my fundamentals of not only chemical engineering, but also of scientific research itself.
Sincere thanks are in order to my colleagues Katherine Schaumberg, Wenlong Xu, Elaheh
Alizade, Forrest Estep, Yi Ern Cheah, and Allison Zimont for helping me flesh out my manuscripts; and lending an ear for discussions ranging from experiments in cyanobacteria to systems biology. I would also like to extend my thanks to all other past and present members of
Prasad lab and Peebles lab.
I would also like to thank Mary Tracey, an undergraduate student from Peebles lab, who helped in the literature survey of cyanobacterial genes; and Aidan Ceney, a joint REU student from
Peebles and Prasad lab, for taking initiative for future work in thermodynamic calculations of cyanobacterial metabolic network.
Faculty and staff at Department of Chemical and Biological Engineering (CBE) and Scott
Engineering building including Claire Laville, Denise Morgan, and Marilyn Gross have had a special role for their support through my 6 years at Colorado State University.
A special note goes to my friends in Fort Collins, who were always there for support during this whole journey.
Lastly, and most importantly, I am grateful to my parents and my brother for their patience, encouragement, and love which gives me strength to do better work each day. Their support has made me who I am.
DEDICATION
To my parents and brother
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
DEDICATION
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1. CONSTRAINT BASED MODELING OF METABOLIC NETWORKS IN SYSTEMS BIOLOGY
  1. SYSTEMS BIOLOGY
    1.1. PARTS
    1.2. SUM OF ITS PARTS
    1.3. THE WHOLE
  2. METABOLIC MODELING LANDSCAPE
    2.1. METABOLIC NETWORK RECONSTRUCTIONS
    2.2. MATHEMATICAL MODEL
  3. CONSTRAINT BASED MODELING
    3.1. CONSTRAINTS
    3.2. SOLUTION SPACE
    3.3. CELLULAR OBJECTIVE
    3.4. SIMULATION ENVIRONMENT
  4. FBA PARADIGM
  5. DYNAMIC FLUX BALANCE ANALYSIS
  6. OTHER MODELING FRAMEWORKS
  7. APPLICATIONS
    7.1. APPLICATIONS IN NON-PHOTOSYNTHETIC BACTERIAL ORGANISMS
    7.2. APPLICATIONS IN MAMMALIAN ORGANISMS
    7.3. APPLICATIONS IN E. COLI AND S. CEREVISIAE
    7.4. APPLICATIONS IN PHOTOSYNTHETIC ORGANISMS
  8. THESIS OUTLINE
CHAPTER 2. STRUCTURE AND ROLE OF ENZYME-REACTION ASSOCIATION IN MICROBIAL METABOLISM
  1. SYNOPSIS
  2. INTRODUCTION
  3. MATERIALS AND METHODS
    3.1. MODEL PREPARATION
    3.2. ASSIGNING SUBSYSTEMS TO GENES AND COMPLEXES
    3.3. GENE ASSOCIATION WITH REACTIONS AND EFFECTIVE GENE DELETION OR SINGLE ENZYME DELETION
    3.4. POWER-LAW ANALYSIS
    3.5. DISTRIBUTION ANALYSIS
    3.6. FLUX BALANCE ANALYSIS (FBA)
    3.7. SIMULATION OF GROWTH CONDITIONS IN VARIOUS ORGANISMS
    3.8. SIMULATION OF ENZYME DELETIONS, AND ESSENTIAL ENZYMES
    3.9. LETHAL COMPARATIVE MAPPING ANALYSIS (LCMA) AMONGST DIFFERENT MODELS
  4. RESULTS
    4.1. THE NUMBER OF REACTIONS CATALYZED BY AN ENZYME FALLS OFF AS A POWER-LAW
    4.2. DELETION ANALYSIS SUGGESTS FITNESS BENEFITS OF MULTIFUNCTIONAL ENZYMES
    4.3. MULTIFUNCTIONAL ENZYMES ARE MORE ESSENTIAL IN Synechocystis sp. PCC6803
    4.4. COMPARATIVE LETHAL DELETIONS ANALYSIS SHOWS THAT E. coli HAS A GREATER DEGREE OF DISTRIBUTED CONTROL IN THE METABOLIC NETWORK COMPARED WITH Synechocystis sp. PCC6803
  5. DISCUSSION
CHAPTER 3. EPISTATIC INTERACTIONS AMONG METABOLIC GENES DEPEND UPON ENVIRONMENTAL CONDITIONS
  1. SYNOPSIS
  2. INTRODUCTION
  3. RESULTS AND DISCUSSION
    3.1. DIFFERENT CARBON SOURCES LEAD TO DIFFERENT PATTERNS OF FLUXES AND EPISTATIC INTERACTIONS
    3.2. POSITIVE EPISTASIS DOMINATES AEROBIC GROWTH OF E. coli AND Synechocystis sp. PCC6803
    3.3. MAXIMUM NUMBER OF POSITIVE INTERACTIONS CORRESPONDS TO MAXIMUM RESPIRATORY CAPACITY IN E. coli
    3.4. DOMINANCE OF NEGATIVE EPISTASIS UNDER HIGH LIGHT CONDITIONS IN Synechocystis sp. PCC6803
    3.5. EPISTATIC INTERACTIONS ARE DEPENDENT ON CARBON FLOW IN THE NETWORK
  4. EXPERIMENTAL
    4.1. FLUX BALANCE ANALYSIS (FBA)
    4.2. SIMULATION OF GROWTH CONDITIONS IN VARIOUS ORGANISMS
    4.3. RANKING OF FLUXES
    4.4. CALCULATION OF EPISTASIS
    4.5. MAPPING GENE PAIRS FROM ONE ORGANISM TO ANOTHER
    4.6. CALCULATION OF RMS DIFFERENCE BETWEEN INTERACTIONS
  5. CONCLUSIONS
CHAPTER 4. MODELING AND ANALYSIS OF BIOPRODUCT FORMATION IN Synechocystis sp. PCC6803 USING A NEW GENOME-SCALE METABOLIC NETWORK RECONSTRUCTION
  1. SYNOPSIS
  2. INTRODUCTION
  3. MATERIAL AND METHODS
    3.1. MODEL RECONSTRUCTION AND ENHANCEMENT
    3.2. MODELING OF IMPORTANT PHOTOSYNTHETIC REACTIONS
    3.3. THERMODYNAMICS – CALCULATION AND ADJUSTMENT OF $\Delta_r G'^{\circ}$ TO $\Delta_r G'^{m}$, $\Delta_r G'^{m}_{\min}$, AND $\Delta_r G'^{m}_{\max}$
    3.4. BIOMASS COMPOSITION
    3.5. LIGHT COMPOSITION
    3.6. FLUX BALANCE ANALYSIS (FBA)
    3.7. FLUX VARIABILITY ANALYSIS (FVA)
    3.8. GROWTH CONDITIONS AND SINGLE GENE DELETIONS
  4. RESULTS AND DISCUSSION
    4.1. IMPROVEMENTS IN NETWORK RECONSTRUCTION
    4.2. THERMODYNAMIC ANALYSIS CORRECTS REACTION DIRECTIONALITY AND IDENTIFIES UNFAVORABLE CYCLES
    4.3. ELECTRON TRANSFER IN THYLAKOID MEMBRANE
    4.4. RUBISCO OXYGENASE AND LIGHT-INDEPENDENT SERINE PRODUCTION
    4.5. MODEL PREDICTS THEORETICAL INCREASES IN METABOLIC LOADS AND CARBON FIXATION
    4.6. METABOLITE SECRETION
    4.7. FEATURES OF AUTOTROPHIC FLUX DISTRIBUTION
    4.8. HETEROTROPHIC FLUX DISTRIBUTION
    4.9. SINGLE GENE DELETION ANALYSIS (AUTOTROPHIC CONDITIONS)
  5. CONCLUSION
CHAPTER 5. LEXICOGRAPHIC ANALYSIS OF DYNAMIC FLUX BALANCE MODEL OF Synechocystis sp. PCC6803 METABOLIC NETWORK
  1. SYNOPSIS
  2. INTRODUCTION
    2.1. LEXICOGRAPHIC OPTIMIZATION
  3. METHODS
    3.1. STOICHIOMETRIC NETWORK
    3.2. FLUX BALANCE ANALYSIS (FBA)
    3.3. DYNAMIC FLUX BALANCE ANALYSIS (DFBA)
    3.4. LEXICOGRAPHIC OPTIMIZATION
  4. RESULTS AND DISCUSSION
    4.1. MODEL SETUP
    4.2. SCHEME 1
    4.3. SCHEME 2
    4.4. SCHEME 3
  5. CONCLUSION
BIBLIOGRAPHY
APPENDICES
  APPENDIX A: SUPPLEMENTARY MATERIAL FOR CHAPTER 1
  APPENDIX B: SUPPLEMENTARY MATERIAL FOR CHAPTER 2
  APPENDIX C: SUPPLEMENTARY MATERIAL FOR CHAPTER 3
  APPENDIX D: SUPPLEMENTARY MATERIAL FOR CHAPTER 4
LIST OF TABLES
TABLE 2.1: PARAMETERS OF THE POWER-LAW FITS OF THE FULL (NON-UNIQUE) ENZYME-REACTION ASSOCIATION OF THE ELEVEN MODELS TESTED, USING THE MAXIMUM LIKELIHOOD METHOD OF CLAUSET ET AL. 2009
TABLE 2.2: COMPARATIVE MAPPING OF LETHAL GENE DELETIONS FROM ONE ORGANISM TO ANOTHER
TABLE 3.1: CLASSIFICATION OF DIFFERENT RANGES OF UNSCALED AND SCALED EPISTASIS
TABLE 5.1: INITIAL CONCENTRATIONS AND PARAMETERS
TABLE 5.2: PRIORITY LIST ORDER USED FOR THE LEXICOGRAPHIC LP SCHEMES USED IN OUR SIMULATIONS
LIST OF FIGURES
FIGURE 1.1: A SIMPLISTIC VIEW OF REGULATION BY EXCHANGE OF INFORMATION
FIGURE 1.2: CATEGORIES OF MICROBIAL GROWTH MODELS
FIGURE 1.3: PARADIGM IN PREPARING GENOME-SCALE NETWORK RECONSTRUCTIONS
FIGURE 1.4: AN EXAMPLE OF A SOLUTION SPACE GIVEN BY THE PROBLEM ON THE RIGHT
FIGURE 2.1: THE DISTRIBUTION OF THE NUMBER OF REACTIONS CONSTRAINED BY ENZYME COMPLEXES
FIGURE 2.2: THE POWER-LAW FIT USING MAXIMUM LIKELIHOOD METHODS FOR NINE SPECIES NOT INCLUDED IN FURTHER ANALYSIS
FIGURE 2.3: COMPARISON BETWEEN ANY TWO MODELS USING TWO-SAMPLE KOLMOGOROV-SMIRNOV TEST
FIGURE 2.4: COMPARISON BETWEEN ESSENTIAL REACTIONS AND ESSENTIAL COMPLEXES, AND DISTRIBUTION OF COMPLEXES IN ENERGY METABOLISM
FIGURE 2.5: THE DISTRIBUTION OF ESSENTIAL/LETHAL GENE DELETIONS AND OF SPECIALIST AND GENERALIST ENZYME COMPLEXES WITHIN METABOLIC SUBSYSTEMS
FIGURE 2.6: THE DISTRIBUTION OF ESSENTIAL COMPLEXES AND LETHALITY OF BOTH ORGANISMS
FIGURE 2.7: A SNAPSHOT OF MULTIFUNCTIONAL ENZYMES IN ENERGY METABOLISM IN SYNECHOCYSTIS
FIGURE 3.1: FLUXES CHANGE DEPENDING UPON GROWTH CONDITIONS
FIGURE 3.2: EPISTASIS UNDER VARIOUS DIFFERENT CARBON SOURCES
FIGURE 3.3: EPISTATIC INTERACTIONS MAPS RELATIVE TO AEROBIC GROWTH ON GLUCOSE FOR SYNECHOCYSTIS SP. PCC6803 AND E. COLI
FIGURE 3.4: EPISTASIS UNDER VARYING GLUCOSE-TO-OXYGEN UPTAKE RATIOS
FIGURE 3.5: HISTOGRAMS OF SCALED EPISTASIS FOR PHOTOAUTOTROPHIC ORGANISMS UNDER LIMITED LIGHT AND HIGH LIGHT CONDITIONS
FIGURE 3.6: EPISTASIS INTERACTIONS AMONGST REACTIONS BELONGING TO THREE COMPARTMENTS GLYCOLYSIS, CITRATE CYCLE, AND PENTOSE PHOSPHATE PATHWAY FOR CELLS GROWN AEROBICALLY WITH GLUCOSE
FIGURE 4.1: PROPERTIES OF THE METABOLIC NETWORK RECONSTRUCTION OF SYNECHOCYSTIS SP. PCC6803
FIGURE 4.2: THERMODYNAMIC PROPERTIES OF THE REACTIONS FOR WHICH ΔrG’ WAS CALCULATED
FIGURE 4.3: EXAMPLES OF THERMODYNAMICALLY INFEASIBLE CYCLES OR FUTILE CYCLES IDENTIFIED BY OUR ANALYSIS
FIGURE 4.4: FLUX VARIABILITY OF SUCCINATE DEHYDROGENASE UNDER VARIOUS LIGHT UPTAKE CONDITIONS
FIGURE 4.5: SERINE PRODUCTION VIA LIGHT-INDEPENDENT PATHWAY AND PHOTORESPIRATORY PATHWAY
FIGURE 4.6: SECRETION OF VARIOUS METABOLITES MAY RESULT IN INCREASED (A) CO2 FIXATION OR DECREASE IN (B) PHOTORESPIRATION
FIGURE 4.7: PRINCIPAL COMPONENT ANALYSIS ON FLUX DISTRIBUTION DATA OBTAINED FROM SIMULATION OF SECRETION OF A METABOLITE AT 50% GROWTH-RATE TRADE-OFF
FIGURE 4.8: GENE DELETION ANALYSIS
FIGURE 5.1: CONCENTRATION IN REACTOR, WHEN LEXICOGRAPHIC SCHEME 1 WAS USED FOR 54h (LDLDL, 12h:12h)
FIGURE 5.2: CONCENTRATION IN REACTOR, WHEN LEXICOGRAPHIC SCHEME 2 WAS USED FOR 54h (LDLDL, 12h:12h)
FIGURE 5.3: CONCENTRATION IN REACTOR, WHEN LEXICOGRAPHIC SCHEME 3 WAS USED FOR 54h (LDLDL, 12h:12h)
CHAPTER 1. CONSTRAINT BASED MODELING OF METABOLIC NETWORKS IN
SYSTEMS BIOLOGY
1. SYSTEMS BIOLOGY
Systems biology is a field of study involving “the study of interaction between components of biological systems, and how these interactions give rise to the function and behavior of that system” (Snoep & Westerhoff, 2005). This field is based, in part, on the recognition that knowledge of the properties of single biological parts in isolation is insufficient to satisfactorily explain the behavior of the whole biological system they belong to. The emergence of this field was also a reaction against what was thought to be excessive reductionism in biology. Reductionism claims that all complex biological entities can be explained by the sum of their parts, while holism claims that complex biological entities are inherently greater than the sum of their parts (Gilbert & Sarkar, 2000). To capture a holistic view of the whole, systems biology prescribes an iterative approach: models are generated and then tested by conducting experiments, and a model is revised whenever its predictions do not match the experiments. This represents one of the most fundamental paradigm shifts from a reductionist to a holistic approach, where the knowledge flows from component to system analysis (Palsson, 2006). Therefore, systems biology studies the idea of holism (in the context of biology) best described by Aristotle: “The whole is greater than the sum of its parts.”
1.1. PARTS
In the context of systems biology, parts refer to cellular components such as: (i) genes; (ii) proteins, encoded by genes; (iii) intracellular chemical species, formed by reactions catalyzed by proteins/enzymes; (iv) cellular/intracellular membranes, which compartmentalize cellular functions; and (v) intracellular fluids, which provide a medium for cellular functions. During the late 20th century, biology was practiced with reductionist approaches, i.e. gaining knowledge about the properties of a cellular component (such as a gene or a protein) by isolating it from the cell (Palsson, 2006). It should be noted that earlier studies in systems biology were supported by data generated using these reductionist approaches. Therefore, these early reductionist studies played a vital role in understanding the importance of and need for systems biology.
1.2. SUM OF ITS PARTS
Isolating a single part and studying it outside the system can be slow and inefficient. During the mid-1990s, the first complete genome sequences of three organisms (Haemophilus influenzae, Saccharomyces cerevisiae, and Methanococcus jannaschii) belonging to the three domains of life were released (Bult et al., 1996; Fleischmann et al., 1995; Goffeau et al., 1996). These developments ushered in the age of “-omics”: genomics (DNA), transcriptomics (RNA), proteomics (proteins), and metabolomics (chemical species or metabolites). Using the “-omics” approaches, data on a single class of cellular component (DNA, RNA, protein, or chemical species) belonging to an organism can be gathered. New high-throughput “-omics” methods are being developed faster than ever (Gomez-Cabrero et al., 2014). As of this year, more than 9000 organisms have been completely sequenced and even more are underway (Reddy et al., 2015). The “-omics” data are being turned into genome-scale models which can be computationally simulated and analyzed. Currently, a large part of the field is involved in simulating individual “-omics” parts. Much effort is focused on understanding “the sum of its parts”, i.e. studying protein-protein interactions (Lv et al., 2015; Schoenrock et al., 2014; Wuchty et al., 2014), gene interactions (D’Souza, Waschina, Kaleta, & Kost, 2015; He, Qian, Wang, Li, & Zhang, 2010; Joshi & Prasad, 2014; Phillips, 2008), organismal metabolic networks (Chang et al., 2011; Feist et al., 2007; Förster, Famili, Fu, Palsson, & Nielsen, 2003; N Jamshidi, Edwards, Fahland, Church, & Palsson, 2001; Knoop et al., 2013), and genome-wide RNA expression levels (Camas & Poyatos, 2008; Kochanowski, Sauer, & Chubukov, 2013; Kopf et al., 2014). Not only have models been created, but modeling in systems biology has also facilitated the prediction of gaps in knowledge, which can later be identified (Satish Kumar, Dasika, & Maranas, 2007).
1.3. THE WHOLE
The whole, in systems biology, may refer to a set of proteins/genes/chemical species, a biological process, an organism, an ecosystem, or something as grand as biological life (Balaram, 2003; Xavier, Patil, & Rocha, 2014). The ability to model an organism in its entirety would involve analysis of all its “-omics” parts simultaneously. Systems biology is still in its infancy; taking all of its parts into consideration is computationally expensive and time consuming, and we are still learning about new parts (e.g. discovery of long noncoding RNAs) (Mattick & Rinn, 2015) and new functions of old parts (e.g. peroxisomes are involved in biotin biosynthesis) (Maruyama, Yamaoka, Matsuo, Tsutsumi, & Kitamoto, 2012). However, an increasing number of models with a multi-omics approach are being released or are underway (D. R. Hyduke, Lewis, & Palsson, 2013; Kim & Lun, 2014). Further, advances in systems biology have facilitated its interface with synthetic biology. In fact, design and synthesis of a minimal genome has been made possible by the understanding of essential cellular functions (Xavier et al., 2014). Although the genome has not been fully functionally characterized, the essentiality of all the 473 genes in M. mycoides JCVI-syn3.0 is qualitatively understood (Hutchison et al., 2016). This minimal genome can further facilitate understanding of the essentiality of cellular functions.
FIGURE 1.1: A SIMPLISTIC VIEW OF REGULATION BY EXCHANGE OF INFORMATION (A) Cartoon depicting information exchange between different layers of cellular operations. Black line represents metabolism, green line represents translation, red line represents transcription, and arrows represent exchange of information. (B) Cartoon depicting how various cellular operations exchange information. Purple trapezoids represent genes, red ovals represent mRNA, green ovals represent enzymes/proteins, black dots represent metabolites, yellow triangles represent transcription factors, blue rhombi represent signaling cascades, and light blue drops represent environmental stimuli.
A simplistic view of the structure of systems biology can be related to a simplistic view of intracellular regulation (Figure 1.1). Each layer of intracellular regulation not only interacts with other layers but also exchanges information with them. For example, genes (genomics) are transcribed to form RNA; RNA (transcriptomics) is then translated to form an enzyme; the enzyme (proteomics) may in turn catalyze a reaction, resulting in the conversion of one metabolite (metabolomics) to another and preparing the cell to divide and secrete metabolites into the environment (Figure 1.1B). However, cells from different species exhibit different phenotypes. These arise from the exchange of information between different layers via environmental stimuli, signaling cascades, and transcription factors (Figure 1.1A). Cellular metabolism acts as an important interface with the environment. Therefore, understanding organismal systems biology requires understanding metabolism as part of the system, the organism, which can best be accomplished by modeling organismal metabolism.
2. METABOLIC MODELING LANDSCAPE
Metabolism can be defined as the set of processes that allow the cell to maintain itself and to grow. Therefore, the two primary tasks of metabolism are to enable (i) the maintenance of energy, redox, and storage machinery and (ii) the growth of the cell, e.g. to produce metabolites and biomass required by daughter cells after division. These tasks are accomplished by a metabolic network, which is a set of chemical interconversions from various nutrient uptakes to cellular biomass and energy via a set of enzymes produced within a cell. Therefore, modeling of metabolism is primarily concerned with reconstructing and simulating the metabolic network of the organism for different environmental stimuli. Metabolic models are primarily growth models of the organism. Microbial growth models can be categorized at (i) the intracellular level as structured (multi-component system) or unstructured (single-component system), based on the treatment of intracellular molecules; and (ii) the multicellular level as segregated (heterogeneous) or unsegregated (homogeneous), based on the treatment of the cell population (Figure 1.2). Modeling of organismal growth began with simplified unstructured and unsegregated models like the Monod equation, which expressed the growth rate as a function of nutrient uptake without going into how the nutrient was assimilated into biomass and how growth was taking place (Monod, 1949). Though such models captured growth kinetics fairly well, they lacked any information on the intracellular state, mainly because this information was limited at best during that time. Therefore, development of metabolic modeling has largely been an effort to develop structured models which capture the intracellular states of various components (metabolites and reactions) within the cell. Most of the efforts have been driven towards unsegregated metabolic models. However, research is underway to capture heterogeneity in cellular biomass composition to make segregated metabolic models (personal communication with Dr. Maciek Antoniewicz at University of Delaware). We will discuss this in later sections.
FIGURE 1.2: CATEGORIES OF MICROBIAL GROWTH MODELS Cartoon depiction of categories of microbial growth models at (i) intracellular level – structured or unstructured, and (ii) multicellular level – segregated or unsegregated. Colorful shapes within the cyan circle (cell) depict intracellular molecules.
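As an illustration of the unstructured, unsegregated models discussed in this section, the Monod equation relates the specific growth rate to the concentration of a single limiting nutrient, mu = mu_max * S / (K_s + S). The following minimal Python sketch evaluates it. The function name, the units, and the parameter values are hypothetical and chosen only for illustration; they are not taken from this work.

```python
def monod_growth_rate(substrate, mu_max, k_s):
    """Specific growth rate from the Monod equation.

    substrate: limiting nutrient concentration (e.g. g/L), must be >= 0
    mu_max:    maximum specific growth rate (e.g. 1/h)
    k_s:       half-saturation constant, same units as substrate, must be > 0
    """
    if substrate < 0 or k_s <= 0:
        raise ValueError("substrate must be >= 0 and k_s must be > 0")
    return mu_max * substrate / (k_s + substrate)


# Illustrative (hypothetical) parameter values, not fitted to any organism.
MU_MAX = 1.0  # 1/h
K_S = 0.5     # g/L

for s in (0.0, 0.5, 5.0):
    mu = monod_growth_rate(s, MU_MAX, K_S)
    print(f"S = {s:4.1f} g/L -> mu = {mu:.3f} 1/h")
```

At S = K_s the rate is half of mu_max, and it saturates towards mu_max at high S. Nothing in the expression describes the intracellular state, which is the limitation of unstructured models noted in the text.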
Metabolic modeling has proved highly useful over the years. However, as mentioned earlier, the state of mathematical models of metabolism has been limited by the amount of data that could be (or has been) experimentally verified or measured. Another limitation of the field is the lack of kinetic information about the various intracellular reactions. However, more quantitative information about intracellular reactions/genes/enzymes is becoming available. This information proves highly useful in creating highly accurate models. Metabolism, as we know it, acts on different time scales, with different reaction rates, and shows different kinetic behavior at different levels of regulation. Hence, most of the progress has been made in the field of pseudo steady state models, which ignore kinetic information and regulation and are solved using steady state mass-balance equations, without taking time into consideration (Song & Ramkrishna, 2009b). However, new improvements have facilitated the integration of transcriptional information with genome-scale metabolic models.
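The pseudo steady state assumption described above can be written compactly. The notation below is an assumption introduced here for illustration, since it is not defined in the text up to this point: x is the vector of intracellular metabolite concentrations, S is the stoichiometric matrix of the network, and v is the vector of reaction fluxes.

```latex
\frac{d\mathbf{x}}{dt} = S\,\mathbf{v}
\quad\Longrightarrow\quad
S\,\mathbf{v} = \mathbf{0}
\quad \text{(pseudo steady state, } d\mathbf{x}/dt = \mathbf{0}\text{)}
```

Setting the time derivative to zero removes time and kinetic parameters from the problem and leaves a linear system in the fluxes alone.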
FIGURE 1.3: PARADIGM IN PREPARING GENOME-SCALE NETWORK RECONSTRUCTIONS The flow elaborates how genome-scale metabolic network reconstructions are built. It involves 5 parts: (i) draft, (ii) refinement, (iii) conversion to model, (iv) evaluation, and (v) assembly.
2.1. METABOLIC NETWORK RECONSTRUCTIONS
A precursor to all genome-scale metabolic models is a genome-scale metabolic network reconstruction. To date, genome-scale network reconstructions are available for 69 different organisms and strains (Table A1) (King et al., 2016). A wide array of tools and databases is available for preparing metabolic reconstructions. These include genome databases (Reddy et al., 2015), biochemical databases (Minoru Kanehisa et al., 2014; Scheer et al., 2011; Y. Wang et al., 2009), organism-specific databases (Keseler et al., 2009), protein localization databases (N. Y. Yu et al., 2010), reconstruction packages (Paley & Karp, 2006), simulation environments (D. Hyduke et al., 2011; Klamt, Saez-Rodriguez, & Gilles, 2007; Klamt, Stelling, Ginkel, & Gilles, 2003; R. Luo, Liao, Zeng, Li, & Luo, 2006), and visualization packages (Maarleveld, Boele, Bruggeman, & Teusink, 2014). A simplified version of the current paradigm in preparing reconstructions involves five steps: (i) drafting, (ii) refinement, (iii) conversion to mathematical model, (iv) evaluation, and (v) assembly and dissemination (Figure 1.3) (Thiele & Palsson, 2010). The reconstruction is subjected to iterations of refinement and evaluation (ii-iv) until predictions match well with the organism’s phenotype. Though many tools and databases are available, the process of reconstruction is semi-automated at best. This can be attributed to two main reasons: (i) the varied objectives of each reconstruction, and (ii) the availability of physiological data.
2.2. MATHEMATICAL MODEL
Once the reconstruction is built, the process of converting it into a mathematical model may be automated. Mathematical models are usually condition-specific, which involves invoking constraints and defining the system boundary (Thiele & Palsson, 2010). Within the mathematical model, the network itself is represented as a stoichiometric matrix of reactions and metabolites, which is derived from the reconstruction. Each element of the matrix represents the stoichiometry of a metabolite in a reaction. A negative value in the matrix represents consumption of a metabolite, while a positive value represents production of a metabolite. As mentioned earlier, the reconstruction, and hence the network, is only a representation of metabolism under balanced growth assumptions (Palsson, 2006). However, most biological networks are underdetermined; therefore, multiple flux solutions which satisfy the intracellular mass balance exist. A linear programming (LP) problem is formulated by removing the kinetics from the system and making it a time-invariant mass-balance problem. To reduce the solution space, we choose a cellular objective to optimize and apply constraints (Feist & Palsson, 2010; Price, Reed, & Palsson, 2004). Therefore, constraints are a crucial part of the mathematical model, making the models condition-specific, which is why these models are also known as constraint based models. The main assumption or hypothesis behind constraint based models is that the organism optimizes some cellular objective function. The fluxes predicted under this assumption are the solution of the metabolic model and can then be compared with experimental results. The constraints, the objective function, and the solution in the context of constraint based modeling are discussed in more detail below.
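The roles of the stoichiometric matrix, the mass balance, and the cellular objective can be illustrated on a very small example. The sketch below uses a hypothetical four-reaction toy network with made-up flux bounds; none of the names or numbers come from the reconstructions in this work. It only checks candidate flux vectors against the mass balance S·v = 0 and compares their objective values. It does not solve the linear program itself, which in practice would be handed to an LP solver.

```python
# Hypothetical toy network (illustration only):
#   R1: -> A  (nutrient uptake)    R2: A -> B
#   R3: B ->  (biomass drain)      R4: A ->  (by-product secretion)
REACTIONS = ["R1_uptake", "R2_conversion", "R3_biomass", "R4_secretion"]
METABOLITES = ["A", "B"]

# Rows are metabolites, columns are reactions.
# Negative entries consume a metabolite, positive entries produce it.
S = [
    [1, -1, 0, -1],  # A
    [0, 1, -1, 0],   # B
]

# (lower, upper) flux bounds; uptake is capped at 10 arbitrary units.
BOUNDS = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]


def mass_balance_residual(s_matrix, fluxes):
    """Return S*v, the net production rate of every metabolite."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in s_matrix]


def is_feasible(s_matrix, fluxes, bounds, tol=1e-9):
    """True if the fluxes satisfy S*v = 0 and lie within their bounds."""
    residual = mass_balance_residual(s_matrix, fluxes)
    balanced = all(abs(r) <= tol for r in residual)
    bounded = all(lo - tol <= v <= hi + tol for v, (lo, hi) in zip(fluxes, bounds))
    return balanced and bounded


def objective(fluxes, reaction="R3_biomass"):
    """Cellular objective: flux through the chosen reaction."""
    return fluxes[REACTIONS.index(reaction)]


# Two different flux distributions that both satisfy the mass balance.
v_growth = [10, 10, 10, 0]   # all uptake routed to biomass
v_wasteful = [10, 4, 4, 6]   # part of the uptake secreted as by-product

for name, v in (("growth", v_growth), ("wasteful", v_wasteful)):
    print(name, is_feasible(S, v, BOUNDS), objective(v))
```

Both candidate distributions are feasible, which is the underdetermination mentioned above. Choosing the biomass flux as the cellular objective is what singles out the first distribution over the second.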
3. CONSTRAINT BASED MODELING
The field of constraint based modeling, most notably flux balance analysis (FBA), has become one of the most important tools in genome-scale metabolic flux analyses. Currently, a wide array of FBA models is available for many organisms; amongst the most well-known are those of E. coli (Feist et al., 2007), S. cerevisiae (Förster et al., 2003), Clostridium thermocellum (Roberts, Gowen, Brooks, & Fong, 2010), Arabidopsis thaliana (Poolman, Miguet, Sweetlove, & Fell, 2009), Synechocystis sp. PCC6803 (Knoop et al., 2013; J. Nogales, Gudmundsson, Knight, Palsson, & Thiele, 2012), and C. reinhardtii (Chang et al., 2011).
3.1. CONSTRAINTS
In constraint based modeling, constraints belong to four different categories: physico-chemical, topobiological, regulatory, and environmental constraints; they are applied as bounds and balances (Price et al., 2004). Physico-chemical constraints depend on the free energies of biochemical reactions (Hamilton, Dwivedi, & Reed, 2013), diffusion rates (Weisz, 1973), enzyme turnover, and confinement of molecules (Lew & Bookchin, 1986). Topobiological constraints depend on molecular crowding and the number of molecules of each metabolite. Regulatory constraints depend on transcriptional, translational, or enzymatic regulation within the cell and are hypothesized to eliminate suboptimal cellular states. Lastly, environmental constraints depend on the concentration of nutrients and are subject to change as the metabolic network interacts with its environment. These constraints determine the solution space (all possible phenotypes) within which the solution lies. Constraints can be implemented in various ways: (i) reaction reversibility, that is, whether the reaction catalyzed by the enzyme is reversible, (ii) reaction bounds, and (iii) biomass composition. Reaction reversibility can be determined by checking whether the free energy of a given reaction is negative, positive, or zero. If negative, the reaction must be allowed only in the forward direction; if positive, the reaction can carry only negative flux; and if zero, the reaction must be allowed in both directions. Reaction bounds refer to upper, lower, or both bounds on the flux through a reaction and are applied when the diffusion rates of metabolites, the uptake of nutrients, or the regulatory control of an enzyme is known; for example, flux through Ribulose-1,5-bisphosphate oxygenase activity is shown to be approximately 3-5% of the total Ribulose-1,5-bisphosphate carboxylase activity (Timm & Bauwe, 2013; Vermaas, 2001). Biomass composition applies topobiological constraints to the model and is implemented by changing the equation representing the requirements of various metabolites for the cell to grow. As mentioned previously, the heterogeneity of the biomass equation for a given microorganism is still under active research.
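As a minimal sketch of how two of these rules might be encoded, the helper functions below translate a reaction free energy into flux bounds and a flux ratio (such as the 3-5% oxygenase-to-carboxylase ratio) into linear inequality rows. The function names, the default maximum flux of 1000, the numerical tolerance around zero, and the reaction indices are all assumptions made for illustration.

```python
def bounds_from_delta_g(delta_g, v_max=1000.0, tol=1e-9):
    """Map a reaction free energy to (lower, upper) flux bounds.

    v_max and tol are illustrative assumptions, not values from the text.
    """
    if delta_g < -tol:
        return (0.0, v_max)        # negative free energy: forward direction only
    if delta_g > tol:
        return (-v_max, 0.0)       # positive free energy: negative flux only
    return (-v_max, v_max)         # zero free energy: both directions allowed


def ratio_constraint_rows(n_reactions, i_num, i_den, low, high):
    """Return rows (a, b), each meaning a . v <= b, that together encode
    low * v[i_den] <= v[i_num] <= high * v[i_den]."""
    lower_row = [0.0] * n_reactions
    lower_row[i_den] = low         # low * v_den - v_num <= 0
    lower_row[i_num] = -1.0
    upper_row = [0.0] * n_reactions
    upper_row[i_num] = 1.0         # v_num - high * v_den <= 0
    upper_row[i_den] = -high
    return [(lower_row, 0.0), (upper_row, 0.0)]


# Hypothetical indexing: reaction 0 is the carboxylase, reaction 1 the oxygenase.
rubisco_rows = ratio_constraint_rows(3, i_num=1, i_den=0, low=0.03, high=0.05)
```

Rows produced this way can be passed to an LP solver as inequality constraints alongside the mass balances.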
Due to the large size of genome-scale metabolic models, a unique solution is unlikely even after the constraints are applied. Therefore, the solution often refers to the feasible solution space rather than a unique solution.
3.2. SOLUTION SPACE
The solution space refers to all the possible solutions to the problem of determining metabolic fluxes given the model and the constraints. Flux space can be visualized as a region in n-dimensional space, where the n coordinates of each point are the fluxes through the n reactions of the metabolic model. The space bounded only by the flux bounds discussed above forms a solid. This solid is further cut down by imposing the mass balance equations resulting from the stoichiometric network. Therefore, the true solution space is actually a polytope (surface) in n-dimensional space (Figure 1.4). Because every bound and mass balance is linear, this polytope is convex, which is what allows linear programming to locate a global optimum.
The polytope is characterized by the m intersection points of the stoichiometric network and all the constraints. It should be noted that the case referred to here is that of an underdetermined system. As the above description suggests, a unique solution may not exist. Therefore, the next step is to invoke an objective function for the metabolic network.
FIGURE 1.4: AN EXAMPLE OF A SOLUTION SPACE GIVEN BY THE PROBLEM ON THE RIGHT. The problem is defined on the right hand side. The plot on the left represents the solution of the problem. The color of each line corresponds to the equation of that color on the problem side. The shaded region indicates all possible values of the objective function (Z). In this case, the solution is (2, 2), which corresponds to the maximum value of Z and satisfies the bounds.
3.3. CELLULAR OBJECTIVE
To further constrain the solution space, it is necessary to hypothesize that the metabolic network maximizes or minimizes a cellular function. This is often referred to as the objective function. The purpose of the objective function is (i) to explore the solution space (phenotypic space), (ii) to determine the physiological state that best represents a physiological function of the organism such as growth or ATP production, and (iii) to determine the fitness of engineered strains, such as secretion of a desired product (Price et al., 2004). Exploration of the solution space and choosing a fitness function for engineered strains are interrogative purposes and vary based on the goal of the study. From the description of the solution space, an objective is described as a function of other intracellular fluxes; in an n-dimensional space, it can be visualized as intersecting with the polytope. The part of the solution space corresponding to the maximum or minimum value of the objective can be found by substituting allowable values of the various fluxes into the equation of the objective. Therefore, if an objective function is to be maximized or minimized, the solution lies on the vertices of the intersection of the polytope and the objective function.
The choice of the best representative physiological function for the organism has long been debated. However, as the size of metabolic networks increases, it is possible to obtain not a single unique solution but rather a number of solutions. Often, the entire range of flux values corresponding to the optimum value of the objective function is reported to facilitate analysis.
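Reporting the range of each flux at the optimum is commonly done by a procedure usually called flux variability analysis. The sketch below illustrates the idea on a hypothetical network with two parallel routes; the network, the bounds, and the small numerical slack on the objective are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network with two parallel routes from A to B:
#   R0: -> A (uptake <= 10), R1: A -> B, R2: A -> B, R3: B -> (objective)
S = np.array([
    [1.0, -1.0, -1.0,  0.0],   # A
    [0.0,  1.0,  1.0, -1.0],   # B
])
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0), (0.0, 1000.0)]
objective = np.array([0.0, 0.0, 0.0, 1.0])
b_eq = np.zeros(S.shape[0])

# Step 1: ordinary FBA to find the optimal objective value.
fba = linprog(-objective, A_eq=S, b_eq=b_eq, bounds=bounds, method="highs")
z_opt = -fba.fun

# Step 2: require objective . v >= z_opt (with a tiny slack for round-off),
# then minimize and maximize each flux in turn.
A_ub = -objective.reshape(1, -1)
b_ub = np.array([-z_opt + 1e-9])

flux_ranges = []
for j in range(S.shape[1]):
    e_j = np.zeros(S.shape[1])
    e_j[j] = 1.0
    lo = linprog(e_j, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                 bounds=bounds, method="highs")
    hi = linprog(-e_j, A_ub=A_ub, b_ub=b_ub, A_eq=S, b_eq=b_eq,
                 bounds=bounds, method="highs")
    flux_ranges.append((lo.fun, -hi.fun))
```

Here the uptake and objective fluxes are fixed at the optimum, while each of the two parallel routes can carry anything from none to all of the flux, which is exactly the kind of non-uniqueness described above.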
3.4. SIMULATION ENVIRONMENT
There are a number of interfaces and environments available in the field to solve the resulting linear equations; these include the General Algebraic Modeling System (GAMS), MATLAB (with various toolboxes such as SimBiology, COBRA (Schellenberger et al., 2011), and SBML (Keating, Bornstein, Finney, & Hucka, 2006)), and OptKnock (Burgard, Pharkya, & Maranas, 2003). All of these can optimize the objective function for the organism’s metabolic network formulated as a linear, quadratic, mixed-integer linear, or non-linear programming problem.
4. FBA PARADIGM
As mentioned previously, in FBA, a set of reactions is prepared that leads to the production and consumption of each of the chemical compounds within a metabolic network, constraints are imposed upon it, and a cellular objective is chosen (Jeremy S. Edwards, Covert, & Palsson, 2002; Feist & Palsson, 2010; K. J. Kauffman, Prakash, & Edwards, 2003; Orth, Thiele, & Palsson, 2010). On the basis of this hypothesis and the constraints, standard optimization techniques such as linear programming are applied, yielding a vector of fluxes that optimizes the objective and satisfies all of the constraints. Therefore, mathematical frameworks such as FBA make it possible to calculate and analyze the flow of metabolites through a metabolic network and to make predictions of growth and/or biotechnologically relevant products (Orth et al., 2010). Eventually, the cell metabolite pool is experimentally tested for various compounds, chosen from the set of chemical species involved in the model, to calculate intracellular fluxes. The steps to constructing an FBA model are: (i) defining the system, (ii) obtaining reaction stoichiometry, (iii) defining biologically relevant objective functions and adding constraints, and (iv) solving the resulting linear equations (Raman & Chandra, 2009). We have discussed all of these steps in detail except the choice of objective function.
An obvious question still remains as to what cellular objective to choose. A simple answer to this question would be to find the experimentally obtained solution within the phenotypic space and use mathematical programming to identify the network function that the experimentally obtained solution point maximizes. However, as previous studies have noted, this method only works for wild-type organisms (Robert Schuetz, Kuepfer, & Sauer, 2007; Segrè, Vitkup, & Church, 2002). The issue of the choice of objective function was also recognized in the first study conducted by Savinell and Palsson (Savinell & Palsson, 1992a, 1992b), where four different objective functions were systematically tested for E. coli metabolism: minimization of ATP production, minimization of nutrient uptake (in moles), minimization of nutrient uptake (in mass), and minimization of NADH production. The study revealed that no single objective function captured the cell behavior accurately. However, it was later realized that a growth objective performed best in predicting cell behavior. It should be noted that E. coli strains have been growing long enough to evolve in laboratory conditions and have acquired an optimal growth phenotype. This is also evident from experiments where, after 700 generations under growth selection pressure, E. coli growing on glycerol shifted from a sub-optimal growth rate to the optimal growth rate predicted by the in silico model (Ibarra, Edwards, & Palsson, 2002). This is a classic scenario in which environmental perturbations drive genetic perturbations such that the organism evolves to exhibit an optimal growth phenotype.
However, sometimes it could be of interest to learn the intermediate state while the organism is transitioning to optimality. For such cases, a different objective function, minimization of metabolic adjustment (MOMA), was introduced as an extension to FBA. MOMA hypothesizes that gene deletion mutants undergo a minimal metabolic adjustment with respect to the wild-type metabolic state (Segrè et al., 2002). Another objective function that models gene deletion mutants has also been introduced, regulatory on/off minimization (ROOM). It follows similar assumptions of minimal adjustment from the wild-type phenotypic state, but also hypothesizes that the new phenotypic state of the mutant is reached through transient metabolic changes driven by the regulatory network, and that these changes are minimized (Shlomi, Berkman, & Ruppin, 2005).
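The MOMA idea can be illustrated as a small quadratic program: among all flux vectors that are feasible for the mutant, pick the one closest (in Euclidean distance) to the wild-type flux vector. The network, the assumed wild-type state, and the use of SciPy's general-purpose SLSQP solver are illustrative assumptions; a dedicated QP solver would normally be used for genome-scale models.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy network with two parallel routes from A to B:
#   R0: -> A (uptake <= 10), R1: A -> B, R2: A -> B, R3: B -> (objective)
S = np.array([
    [1.0, -1.0, -1.0,  0.0],   # A
    [0.0,  1.0,  1.0, -1.0],   # B
])
upper = np.array([10.0, 1000.0, 1000.0, 1000.0])

# Assumed wild-type flux state: all flux routed through R1.
v_wt = np.array([10.0, 10.0, 0.0, 10.0])
knockout = 1   # delete the gene behind R1

# Remove the knocked-out reaction from the problem (its flux is fixed at zero).
keep = [j for j in range(S.shape[1]) if j != knockout]
S_ko = S[:, keep]
target = v_wt[keep]

# Minimize the squared distance to the wild-type fluxes subject to
# steady-state mass balance and the remaining flux bounds.
result = minimize(
    fun=lambda v: np.sum((v - target) ** 2),
    x0=np.zeros(len(keep)),
    jac=lambda v: 2.0 * (v - target),
    bounds=[(0.0, upper[j]) for j in keep],
    constraints=[{"type": "eq", "fun": lambda v: S_ko @ v, "jac": lambda v: S_ko}],
    method="SLSQP",
)

v_moma = np.zeros(S.shape[1])
v_moma[keep] = result.x
```

In this toy case the mutant could still reach the FBA optimum of 10 by rerouting everything through R2, but the minimal-adjustment solution carries a lower objective flux because a full rerouting lies far from the wild-type state.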
The search for a global objective function did not end there. Subsequently, many different objective functions have been tested by various groups. These include maximization of ATP per flux unit, minimization of overall flux, maximization of biomass per unit flux, minimization of reaction steps, maximization of ATP production (Robert Schuetz et al., 2007), and an objective function selector using a Bayesian-based technique (Knorr, Jain, & Srivastava, 2007). A multidimensional approach also achieved limited success in forming an objective function that was a combination of maximal biomass yield, ATP yield, and minimization of the sum of fluxes (R. Schuetz, Zamboni, Zampieri, Heinemann, & Sauer, 2012). In this approach, a wide variety of organisms growing under different environmental conditions were mapped onto a Pareto-optimal surface. This allowed the authors to make predictions and to show that evolution shapes metabolic fluxes in the microorganisms’ environmental context through (i) optimality of the flux distribution under a given condition, and (ii) minimization of the adjustment between any two conditions. Among the above-mentioned studies, there have been cases where an alternative objective, such as maximization of ATP per flux unit, was a better predictor of experimental data than biomass (Robert Schuetz et al., 2007).
It should be noted that most of the work mentioned above was done in E. coli or yeast.
The evidence so far suggests that the growth objective is the most consistent among all the ones evaluated. There could be conditions where the growth objective is not appropriate, such as for organisms in a nutrient limited environment or organisms undergoing physical stress. Most scientists who have critiqued this hypothesis in publications have argued that if growth rate is maximized, it is also necessary for the organism to maintain an appropriate level of expression for protein synthesis or ribosome expression (Bonven & Gulløv, 1979; Forchhammer & Lindahl, 1971).
5. DYNAMIC FLUX BALANCE ANALYSIS
FBA utilizes a static optimization framework, yielding a flux vector that does not change with time. However, metabolism is dynamic and changes with environmental conditions. There have been several attempts to incorporate dynamics within the FBA framework, collectively called dynamic FBA (dFBA). Advances in dFBA (Radhakrishnan Mahadevan, Edwards, & Doyle, 2002; Varma & Palsson, 1994) have shown that, given some insight into the substrate uptake kinetics, a time-variant problem can be solved for batch kinetics as a function of reaction rates. dFBA includes information about the dynamics of certain metabolites under batch kinetics or other time-dependent processes, allowing the metabolic network to interact with the environment (Jared L. Hjersted & Henson, 2006; Radhakrishnan Mahadevan et al., 2002). It provides a structured model of the biochemical process in which intracellular pathways interact with the environmental conditions, represented by the functional dependency on the substrate. The two most widely used versions of dFBA are the dynamic optimization approach (DOA), a non-linear programming problem that optimizes the fluxes over the entire time course, and the static optimization approach (SOA), a linear programming problem that optimizes instantaneously over small time intervals that together make up the entire time course, updating concentrations after each interval (J L Hjersted & Henson, 2009; Jared L. Hjersted & Henson, 2006; Radhakrishnan Mahadevan et al., 2002). In addition to environmental interactions, regulatory changes due to the environment can be included in the model. A third dynamic FBA method involves embedding a linear program within a system of kinetic equations representing the exchange fluxes (Gomez, Höffner, & Barton, 2014). dFBA is becoming increasingly efficient for larger models, moving from 43 metabolites and 38 reactions in a monoculture to more than 3000 reactions in a co-culture simulation of C. reinhardtii (iRC1080) and yeast (iND750) (Table A2) (Höffner, Harwood, & Barton, 2013).
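The static optimization approach can be sketched as a loop that solves one LP per time step and then updates the extracellular concentrations. Everything specific in the sketch below is an assumption for illustration: the two-reaction network, the Michaelis-Menten uptake parameters, the yield of 10 mmol substrate per gram of biomass, the initial conditions, and the explicit Euler update.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy model:
#   R0: -> A                (substrate uptake, mmol/gDW/h)
#   R1: 10 A -> biomass     (growth rate, 1/h)
S = np.array([[1.0, -10.0]])   # single internal metabolite A

V_MAX, K_M = 10.0, 0.5         # assumed Michaelis-Menten uptake parameters
dt, n_steps = 0.1, 100         # time step (h) and number of steps

substrate = [10.0]             # mmol/L
biomass = [0.1]                # gDW/L

for _ in range(n_steps):
    s, x = substrate[-1], biomass[-1]
    # Uptake bound from kinetics, capped so the substrate cannot be
    # over-consumed within a single time step.
    uptake_max = min(V_MAX * s / (K_M + s), s / (x * dt))
    # Instantaneous FBA: maximize growth subject to S v = 0 and the uptake bound.
    res = linprog([0.0, -1.0], A_eq=S, b_eq=[0.0],
                  bounds=[(0.0, uptake_max), (0.0, None)], method="highs")
    v_uptake, mu = res.x
    # Explicit Euler update of the extracellular state.
    substrate.append(max(s - v_uptake * x * dt, 0.0))
    biomass.append(x + mu * x * dt)
```

The resulting trajectories show growth while substrate is available and a plateau once it is exhausted; a production implementation would typically use an adaptive ODE integrator rather than a fixed Euler step.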
6. OTHER MODELING FRAMEWORKS
Cybernetic models of microbial growth take into account metabolic regulation, enzymatic regulation, and the uptake kinetics of a single substrate, and couple them with the metabolic network (Kompala, Ramkrishna, & Tsao, 1984). Unlike FBA, which requires uptake rates of multiple substrates to be specified, cybernetic modeling needs information on only one substrate, and the uptake rates of the other substrates are estimated. It was first applied to predictions of diauxic growth patterns in multiple-substrate bacterial cultures (Kompala, Ramkrishna, Jansen, & Tsao, 1986; Kompala et al., 1984). There have been some advances in cybernetic modeling as well (Song & Ramkrishna, 2011; J. Young, Henne, Morgan, Konopka, & Ramkrishna, 2004). However, little is known about its behavior in large complex networks. Cybernetic modeling has relied heavily upon lumping of metabolic pathways (Song & Ramkrishna, 2009a, 2011). Applying this technique to large metabolic networks is highly constrained by the lack of experimentally determined parameters. Further, due to the lumping of metabolic pathways, many interesting genetic changes cannot be predicted. However, recent successors of this technique such as Lumped Hybrid Cybernetic Models (L-HCM) and Lumped-Elementary Mode (L-EM) have been applied to E. coli and S. cerevisiae (Papin et al., 2004; Schwartz & Kanehisa, 2006; Song & Ramkrishna, 2011; J. Young et al., 2004). Other tools have evolved along similar lines and given us some insights into network structure, such as elementary mode analysis (EMA) (Zanghellini, Ruckerbauer, Hanscho, & Jungreuthmayer, 2013) and lumped kinetic modeling (LKM) (Nikolaev, 2010). The field of quantitative metabolic modeling has progressed steadily over the past 20 years and continues to grow. As the field develops, new methods are emerging in parameter identification, metabolite kinetics, and other areas that might involve more sophisticated model formulations.
7. APPLICATIONS
As mentioned previously, the number of constraint based genome scale models has been rising consistently. Our analysis of the Biochemical Genetic and Genomic (BiGG) database suggests that more than 60 models are in existence (King et al., 2016). Their usefulness in learning more about the phenotypic space has been elaborated in the previous sections. Therefore, here, we will focus on the biotechnological contributions that led (i) to a better understanding of systemic behavior, and (ii) to improved commercial outcomes (biofuels, pharmaceuticals, or nutraceuticals).
7.1. APPLICATIONS IN NON-PHOTOSYNTHETIC BACTERIAL ORGANISMS
Bacterial models have demonstrated successful applications to the production of industrially relevant chemicals. For example, production of lactate has been modeled in Lactobacillus plantarum, Lactococcus lactis, Streptococcus thermophilus, and Corynebacterium glutamicum. L. lactis was also used to predict genetic modifications for improving production of diacetyl, which is a flavor compound in dairy products (Oliveira, Nielsen, Förster, & Forster, 2005). It has also been used to predict genetic modifications for synthesis of recombinant protein (Oddone, Mills, & Block, 2009). The resultant strains enhanced production of GFP (a proxy for the recombinant protein) by 15%. Genetic modifications in Pseudomonas putida were investigated for production of polyhydroxyalkanoates (PHAs), which could be used to replace petrochemical-based plastics (Puchalka et al., 2008). The study demonstrated that pools of acetyl-CoA, a precursor to PHAs, were increased by up to 26%. E. coli (Feist et al., 2007) and C. acetobutylicum (J. Lee, Yun, Feist, Palsson, & Lee, 2008; Salimi, Mandal, Wishart, & Mahadevan, 2010) models were used in making predictions about acetone-butanol-ethanol production systems. Geobacter metallireducens reduces Fe(III) and is used in bioremediation of radioactive elements. Its model was used to show that it grows inefficiently with complex electron donors and acceptors (Sun et al., 2009).
7.2. APPLICATIONS IN MAMMALIAN ORGANISMS
The first human genome scale model, Human Recon 1 (Duarte et al., 2007), was used to identify biomarkers of inborn errors of metabolism (Shlomi, Cabili, & Ruppin, 2009). This revealed a set of 233 metabolites whose concentration is predicted to deviate as a result of 176 possible dysfunctional enzymes. Another genome scale model reconstruction revealed the importance of systems modeling in human metabolism to aid drug discovery. Simulations and predictions using genome scale models of NCI-60 cell lines have resulted in the identification of a new objective function, and have been used to study the Warburg effect and to identify metabolic targets for inhibiting cancer cell migration (Yizhak et al., 2014). Simulations of mAb production by a hybridoma cell line in a genome scale model of M. musculus predicted growth and the build-up of lactate and ammonia, byproducts known to cause cell death in mammalian cell culture (Sheikh, Forster, & Nielsen, 2005).
7.3. APPLICATIONS IN E. COLI AND S. CEREVISIAE
These are some of the best studied microbial species to date, and their genome scale models have been equally well studied. Therefore, their applications have been far wider than those of any of the organisms previously mentioned. Some of the important contributions of E. coli genome scale models include increasing the production of lycopene (Alper, Jin, Moxley, & Stephanopoulos, 2005; Jin & Stephanopoulos, 2007), lactate (Burgard et al., 2003; Fong et al., 2005; Ibarra et al., 2002), ethanol (Pharkya & Maranas, 2006), hydrogen (Jones, 2008; Pharkya & Maranas, 2006), vanillin (Pharkya & Maranas, 2006), and 1,3-propanediol (Burgard et al., 2003). Similarly, S. cerevisiae models have contributed to increasing production of succinate, glycerol, vanillin, and sesquiterpenes (Asadollahi et al., 2009; Patil, Rocha, Förster, & Nielsen, 2005).
7.4. APPLICATIONS IN PHOTOSYNTHETIC ORGANISMS
The most widely used and actively researched of all the photosynthetic organisms is Synechocystis sp., which can convert CO2 to carbon based products. A recent study of a genome scale metabolic network of Synechocystis sp. analyzed the trade-off between growth and the production of industrially relevant chemical compounds (Knoop & Steuer, 2015) and found that shifts in ATP/NADPH demand during autotrophic growth competed with product biosynthesis. A genome scale network reconstruction of Synechocystis sp. PCC6803 was also used to study epistatic maps within metabolic networks, showing that evolutionary adaptation is likely to be path dependent due to the strong effect of the environment on epistasis (Joshi & Prasad, 2014). Halobacterium salinarum can store energy using high potassium gradients. Its genome-scale metabolic network was used to investigate aerobic essential amino acid degradation, energy generation, nutrient utilization, and biomass production (Gonzalez et al., 2008). A genome scale model of C. reinhardtii (Chang et al., 2011), an alga, was used to examine the effects of co-culture with yeast (Gomez et al., 2014). It should be noted that research in modeling photosynthetic metabolism is still in its infancy compared to E. coli and yeast.
The applications listed here are by no means the only ones; the full list would require a separate article of its own. However, it should be noted that as more genome scale models are published, the level of detail within the reconstructions will also increase, making genome scale metabolic models even more widely applicable.
8. THESIS OUTLINE
Both FBA and dynamic FBA depend upon a detailed understanding of the underlying metabolic network. As knowledge of this network grows, the models improve. An interesting question that sometimes arises is whether FBA is an idea, a hypothesis, or a theory. In my view, our understanding of metabolism has progressed sufficiently that the underlying description of the metabolic network can be regarded as part of the biological theory of metabolism. FBA and dynamic FBA provide one method of estimating the internal fluxes of metabolites, based on the hypothesis of constrained optimality of some cellular objective. There is significant evidence to suggest that this hypothesis may be true, at least under some conditions. In our view, FBA is a self-consistent theoretical project that is a theory in the making: a theory of optimality in metabolism.
The metabolic network can be represented as a directed graph of nodes representing metabolites and directed edges representing reactions, along with an additional layer of complexity provided by the enzymes and transporters that participate in each reaction. In Chapter 2, we study the latter aspect of network structure, in particular the relationship between the number of enzymes and the number of reactions they participate in. We find that the distribution of the number of reactions an enzyme participates in, the enzyme-reaction distribution, is surprisingly similar across ten species. In six out of these ten species the distribution can be described by a power law with statistical significance. We use flux balance analysis (FBA) to study the effect of the enzyme-reaction distribution on the robustness of two microorganisms, E. coli and Synechocystis, and based on a detailed study of gene deletions in both organisms we show that the form of this distribution plays an important and hitherto unappreciated role in robustness. Despite the similarity of the overall distribution of reactions among enzymes, we also uncover many differences in the specific details of this distribution between the two microorganisms, arising from their specific environmental niches. In particular, we discover that multifunctional enzymes play a major role in conferring lethality to many loss-of-function mutations in the photoautotrophic metabolism. Our analysis suggests that multifunctional enzymes may be contributing some unknown fitness benefits to the organism, by virtue of being multifunctional, that offset their negative role in loss-of-function mutations, and that this may be especially important in photosynthetic metabolism. The similarity of the enzyme-reaction distribution between the ten species studied also strongly suggests the existence of a shared design principle or evolutionary process (Joshi & Prasad, Structure and role of enzyme-reaction association in microbial metabolism. In preparation).
When the effect of the state of one gene is dependent on the state of another gene in more than an additive or neutral way, the phenomenon is termed epistasis. In particular, positive epistasis signifies that the impact of the double deletion is less severe than the neutral combination, while negative epistasis signifies that the double deletion is more severe. Epistatic interactions between genes affect the fitness landscape of an organism in its environment and are believed to be important for the evolution of sex and the evolution of recombination. In Chapter 3, we use large-scale computational metabolic models of microorganisms to study epistasis computationally using Flux Balance Analysis (FBA). We ask what the effects of the environment are on epistatic interactions between metabolic genes in three different microorganisms: the model bacterium E. coli, the cyanobacterium Synechocystis PCC6803, and the model green alga C. reinhardtii. Prior studies had shown that in standard laboratory conditions epistatic interactions between metabolic genes are dominated by positive epistasis. We show here that epistatic interactions depend strongly upon environmental conditions, i.e. the source of carbon, the Carbon/Oxygen ratio, and, for photosynthetic organisms, the intensity of light. By a comparative analysis of flux distributions under different conditions, we show that whether epistatic interactions are positive or negative depends upon the topology of the carbon flow between the reactions affected by the pair of genes being considered. Thus, complex metabolic networks can show epistasis even without explicit interactions between genes, and the direction and scale of epistasis are dependent on network flows. Our results suggest that the path of evolutionary adaptation in fluctuating environments is likely to be very history dependent because of the strong effect of the environment on epistasis (Joshi & Prasad, 2014).
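The sign convention for epistasis described above can be made concrete with a short sketch. It assumes that fitness values are growth rates normalized to the wild type and that the "neutral combination" is the product of the two single-deletion fitnesses (a multiplicative null model); both are illustrative assumptions, as are the function names and the tolerance.

```python
def epistasis(w_x, w_y, w_xy):
    """Deviation of the double-deletion fitness from the neutral expectation.

    w_x, w_y : fitness of each single-deletion mutant relative to wild type
    w_xy     : fitness of the double-deletion mutant relative to wild type
    The neutral expectation is assumed to be multiplicative (w_x * w_y).
    """
    return w_xy - w_x * w_y


def classify(eps, tol=1e-6):
    """Label an epistasis value using the sign convention in the text."""
    if eps > tol:
        return "positive"    # double deletion less severe than expected
    if eps < -tol:
        return "negative"    # double deletion more severe than expected
    return "none"


# Hypothetical relative growth rates, e.g. obtained from FBA gene deletions.
buffered = classify(epistasis(0.8, 0.5, 0.6))
aggravated = classify(epistasis(0.8, 0.5, 0.1))
neutral = classify(epistasis(0.8, 0.5, 0.4))
```

In practice the three fitness values would come from FBA simulations of the two single deletions and the double deletion under the same environmental condition.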
Cyanobacteria are prokaryotes capable of performing oxygenic photosynthesis, making them attractive candidates for genetic engineering towards production of biofuel, pharmaceuticals, nutraceuticals, and other commercially important chemicals. In Chapter 4, we present and analyze a genome scale metabolic network reconstruction (iSynCJ816) of Synechocystis sp. PCC6803, the most widely studied cyanobacterium. This reconstruction consists of 816 genes, 1045 reactions, and 929 non-unique metabolites spanning 7 compartments (extracellular, cytosol, cytosolic membrane, carboxysome, periplasm, thylakoid, and thylakoid membrane). This updated model builds on previously published models and develops them further by integrating an unconstrained photo-respiratory reaction mechanism. The model also includes various molecular mechanisms of electron transfer in the three most important protein complexes of photosynthesis (photosystem I, photosystem II, and cytochrome b6/f complex). We used Flux Balance Analysis (FBA) to calculate the flux distribution within iSynCJ816 and compared in silico predictions with values obtained by previous in vivo metabolic flux analyses in Synechocystis sp. PCC6803. We performed gene deletion analysis and qualitatively compared gene deletions of 167 genes with experimental studies to find an accuracy rate of ~80%. We used the model to estimate the maximum theoretical yield of products using each metabolite as a precursor, as well as the feasibility of engineering Synechocystis to increase CO2 fixation. The model predicts that it may be possible to increase CO2 fixation by up to 35% from wild type levels (Joshi, Peebles, & Prasad, Modeling and analysis of bioproduct formation in Synechocystis sp. PCC6803 using a new genome-scale metabolic network reconstruction. Submitted).
To construct strains that not only grow optimally but are also efficient at the technology they are constructed for, it is important to understand intracellular metabolic regulation in these microorganisms in its full dynamic complexity. Photosynthetic organisms have an inherent dynamic complexity because their natural habitat has days and nights, seasons, and the consequent changes in light intensity and composition. A variety of sustainable and green applications of metabolic engineering of cyanobacteria are ultimately possible only when they translate to utilization of the energy given out by the sun. In Chapter 5, we apply a direct method of dynamic flux balance analysis, which involves embedding a linear programming problem within a set of kinetic equations and using hierarchical or “lexicographic” optimization, to study diurnal objective functions and the lexicographic priority of substrate exchange, biomass growth, ATP synthase, and ATP maintenance in Synechocystis sp. PCC6803.
CHAPTER 2. STRUCTURE AND ROLE OF ENZYME-REACTION ASSOCIATION IN
MICROBIAL METABOLISM
1. SYNOPSIS
The metabolic network can be represented as a directed graph of nodes representing
metabolites and directed edges representing reactions, along with an additional layer of complexity
provided by the enzymes and transporters that participate in each reaction. Here we study the latter
aspect of network structure, in particular the relationship between the number of enzymes and the
number of reactions they participate in. We find that the distribution of the number of reactions an
enzyme participates in, the enzyme-reaction distribution, is surprisingly similar across eighteen
species and resembles a power law. In fifteen out of these eighteen species the power law was
found to be statistically significant. We use Flux Balance Analysis to study the effect of the
enzyme-reaction distribution on the robustness of the metabolic models of two microorganisms,
E. coli and Synechocystis, and based on a detailed study of gene deletions in both models we show
that the form of this distribution plays an important and hitherto unappreciated role in robustness.
Despite the similarity of the overall distribution of reactions among enzymes, we also uncover
many differences in the specific details of this distribution. In particular we discover that
multifunctional enzymes play a major role in conferring lethality to many loss-of-function
mutations in the current model of photoautotrophic metabolism. Our analysis suggests that
multifunctional enzymes may be contributing some unknown fitness benefits to the organism, by
virtue of being multifunctional, that offsets their negative role in loss-of-function mutations, and
that this may be especially important in photosynthetic metabolism. The similarity of the enzyme-
reaction distribution between the eighteen species studied also strongly suggests the existence of
a shared design principle or evolutionary process.
2. INTRODUCTION
Studies in E. coli report mutation rates as high as 10⁻³ per genome per generation (Kibota & Lynch, 1996; H. Lee, Popodi, Tang, & Foster, 2012; Perfeito, Fernandes, Mota, & Gordo, 2007), underlining the importance of robustness against deleterious mutations for microorganisms. Indeed, studies report that the majority of mutations appear neither beneficial nor deleterious, and deleterious mutations have mostly small fitness effects (H. Lee et al., 2012). Robustness to genetic mutations can arise for a variety of reasons. For example, genes could be coding for proteins that perform an inessential function, or proteins that are partly or entirely redundant because other proteins can carry out their functions. For genes that code for metabolic enzymes, network structure is an additional source of robustness, since multiple pathways exist for the synthesis of metabolites. To compensate for a gene deletion, therefore, the organism can merely redistribute metabolic fluxes among the surviving pathways (Segrè et al., 2002). This is easily seen by visualizing the metabolic network as a graph.
The most common representation of the metabolic network as a graph represents the metabolites as nodes, linked together by enzymatic or transport reactions that constitute the edges
(Arita, 2004; Jeong, Tombor, Albert, Oltvai, & Barabási, 2000; Ravasz, Somera, Mongru, Oltvai,
& Barabási, 2002). This may be called a “reaction-edge” graph (Light & Kraulis, 2004), and a gene deletion in this network can be represented by the removal of the edges corresponding to the reactions catalyzed by the enzyme that the deleted gene coded for. With the removal of an edge, the organism survives if an alternative pathway exists for the synthesis of the constituents of biomass. While in some cases, the alternative pathway may be too expensive and the organism fails to survive, the existence of an alternative pathway after removal of an edge is a topological
or a graph-theoretic property. Thus, the contribution of metabolic network topology to metabolic robustness is of great theoretical and practical significance.
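The reaction-edge picture can be illustrated with a minimal sketch. The toy network, metabolite names and gene labels below are hypothetical and are not taken from any of the models analyzed here; the sketch only shows how one might remove the edges of a deleted gene and test whether a biomass constituent is still reachable. It ignores stoichiometry and pathway cost, in keeping with the purely topological argument.

```python
from collections import deque

# Hypothetical toy network: each edge is (substrate, product, gene).
EDGES = [
    ("glc", "g6p", "geneA"),
    ("g6p", "pyr", "geneB"),      # main route
    ("g6p", "x5p", "geneC"),      # alternative route, step 1
    ("x5p", "pyr", "geneD"),      # alternative route, step 2
    ("pyr", "biomass", "geneE"),
]


def reachable(edges, source, target):
    """Breadth-first search over directed metabolite-to-metabolite edges."""
    adjacency = {}
    for substrate, product, _gene in edges:
        adjacency.setdefault(substrate, []).append(product)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def survives_deletion(edges, gene, source="glc", target="biomass"):
    """Remove every edge catalyzed by `gene`, then test reachability of the target."""
    remaining = [edge for edge in edges if edge[2] != gene]
    return reachable(remaining, source, target)


genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
results = {gene: survives_deletion(EDGES, gene) for gene in genes}
```

In this toy example only the deletions of geneA and geneE disconnect the source from biomass, because no alternative path bypasses their edges.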
Edges in the metabolic network represent reactions catalyzed by enzymes or specific protein-mediated transport processes, and reaction-edge graphs therefore miss the contributions of these proteins to the metabolic network. Other representations of the metabolic network seek to correct this by including enzymes, in the form of “protein-centric” or “protein-vertex” graphs where the vertices are the proteins and the edges are the substrates that the proteins act on (Light & Kraulis,
2004). Alternatively, others have constructed “two-color” graphs where vertices represent reactions and there are two types of edges; one connects reactions that have a metabolite in common and the other is a weighted edge that represents genomic associations (Spirin, Gelfand,
Mironov, & Mirny, 2006). However, while both these studies are useful, they use very broad measures of protein or genomic association, and thereby miss some crucial properties of the network. For example, metabolites like ATP link together almost all enzymes. Such broad measures of association are unlikely to be very useful, since enzymes show high specificity for the reactions that they catalyze. In this study, we concentrate specifically on the role that these proteins play in the metabolic network, i.e. catalyzing reactions (and transporting metabolites).
A single enzyme does not necessarily catalyze only a single reaction (or edge) of the network. Some gene products are isozymes, which catalyze the same reaction. Other genes encode multifunctional enzymes that constrain more than one reaction in the network (Roy,
1999). In order to quantify the distribution of these two kinds of enzymes, we define the degree of multifunctionality, ke, of any enzyme. This is the number of reactions catalyzed by a particular enzyme, and we will call the distribution of ke the enzyme-reaction distribution. Note that ke encapsulates multiple kinds of enzyme promiscuity, including substrate promiscuity, i.e. being able to perform the same function on multiple substrates, and catalytic promiscuity, i.e. possessing multiple catalytic domains (Cheng et al., 2012).
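As an illustration of this definition, the following minimal sketch computes ke and the resulting enzyme-reaction distribution. The reaction-to-gene mapping, the reaction and gene names, and the function names are all hypothetical; in practice the mapping would be read from the gene-reaction associations of a reconstruction.

```python
from collections import Counter

# Hypothetical reaction -> genes mapping (the gene is used as a proxy for the enzyme).
REACTION_GENES = {
    "R1": ["g1"],
    "R2": ["g1", "g2"],
    "R3": ["g1"],
    "R4": ["g3"],
    "R5": ["g2"],
}


def degree_of_multifunctionality(reaction_genes):
    """Return {gene: ke}, where ke is the number of reactions associated with the gene."""
    ke = Counter()
    for genes in reaction_genes.values():
        for gene in set(genes):  # count each gene at most once per reaction
            ke[gene] += 1
    return dict(ke)


def enzyme_reaction_distribution(ke):
    """Return {k: number of enzymes with ke == k}, ordered by k."""
    counts = Counter(ke.values())
    return dict(sorted(counts.items()))


ke = degree_of_multifunctionality(REACTION_GENES)
distribution = enzyme_reaction_distribution(ke)
```

Here g1 is associated with three reactions, g2 with two and g3 with one, so each value of ke occurs exactly once in the toy distribution.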
Multifunctional enzymes have been studied for a long time, and it is believed that they may have played an important role in the evolution of life on the planet. It has been proposed on the basis of evolutionary arguments that precursor enzymes that catalyzed biochemical reactions when life emerged on earth are likely to have been multifunctional enzymes with broad substrate specificity, a hypothesis that has been called the patchwork hypothesis (Fani & Fondi, 2009;
Jensen, 1976). In support of this argument, recent work has found that in E. coli, specialist enzymes, i.e. enzymes that catalyze a single reaction, are more likely to be essential, carry greater flux and are regulated to a greater extent than generalist enzymes that catalyze more than one reaction (Nam et al., 2012). Multifunctional enzymes that catalyze a sequence of reactions have definite advantages due to substrate channeling. However, many multifunctional enzymes are involved in reactions that are not sequential, and their persistence remains to be explained.
Furthermore, it is not understood whether the distribution of the degree of multifunctionality (the enzyme-reaction distribution) plays a physiological role in metabolic networks.
Since the enzyme-reaction distribution forms a part of the structure or topology of metabolic networks, and given the role of multifunctional enzymes in the debates over the evolution of life on earth, we asked how the degree of multifunctionality, ke, is distributed in organisms. To calculate this, we relied on the genome-scale metabolic reconstructions that many groups have undertaken over the last decade (Oberhardt, Palsson, & Papin, 2009). We downloaded 21 publicly available models corresponding to 18 different species, including both eukaryotes and prokaryotes. These models include details about how specific genes that code for enzymes and transporters map on to specific reactions in the model. Using the gene as a proxy for the enzyme, we extracted this information and studied the enzyme-reaction distribution for each organism. For a more detailed analysis of the possible role of this distribution in fitness and other properties of the network, we focused on two organisms. One of these was E. coli, owing to its very well studied and relatively comprehensive metabolic network reconstruction. The other was the model cyanobacterium Synechocystis sp. PCC6803, as a representative of the metabolic niche of photosynthetic microorganisms.
We used a recently constructed genome-scale model, iJN678, of Synechocystis sp. PCC6803 (Nogales, Gudmundsson, Knight, Palsson, & Thiele, 2012) as well as a comprehensive metabolic model, iAF1260, of E. coli MG1655 (Covert, Knight, Reed, Herrgard, & Palsson, 2004), and used these models to calculate the distribution of ke, i.e. the enzyme-reaction distribution. In order to understand the role played by multifunctional enzymes, we used gene deletion analysis carried out using Flux Balance Analysis (FBA) (Orth et al., 2010).
For the metabolic network, FBA can predict flux redistributions on single gene deletions with good accuracy (Reed & Palsson, 2004; Segrè, Deluna, Church, & Kishony, 2005; Segrè et al., 2002); for example, genome-scale E. coli models predict gene lethality with an error of only about
8% (Covert et al., 2004).
We used large-scale metabolic models to ask whether there were any common patterns, or significant differences, in the structure of enzyme-reaction associations in both organisms. We used FBA to ask whether the observed enzyme-reaction distribution, which was similar for both organisms, could be contributing a fitness benefit to the organism. Finally, we used FBA to compare gene deletions between Synechocystis and E. coli.
3. MATERIALS AND METHODS
3.1. MODEL PREPARATION
For our analyses of enzyme-reaction distributions, we chose 21 genome-scale reconstructions of 18 organisms. The organism names, model names, references and links to the SBML files of all 21 models can be found in Table B1 in Supporting Information. For FBA of E. coli and Synechocystis sp., we also selected the growth condition under which each organism was most studied. Growth conditions included in the analyses were autotrophic, heterotrophic, mixotrophic, and aerobic. Detailed analysis of network structure differences was performed only for E. coli and Synechocystis.
3.2. ASSIGNING SUBSYSTEMS TO GENES AND COMPLEXES
For deeper analysis of E. coli and Synechocystis, we defined broad subsystems based on KEGG pathway analysis (Kanehisa & Goto, 2000; Kanehisa, Goto, Sato, Furumichi, & Tanabe, 2012); e.g. Oxidative phosphorylation, Photosynthesis, Carbon fixation pathways, Methane metabolism, Nitrogen metabolism, and Sulfur metabolism were all considered Energy metabolism. Enzymes were assigned to subsystems based on the reactions they catalyze. If an enzyme catalyzed reactions belonging to more than one subsystem, it was considered to be a part of all of those subsystems. This was done to recognize that enzyme complexes can be highly multifunctional and might catalyze reactions that are far apart in the metabolic network and belong to completely different coarse-grained subsystems. A detailed account of how the actual subsystems were distributed among the coarse-grained subsystems for both organisms is presented in Table S2 in Supporting Information. Note that since metabolic models usually report the genes associated with each reaction, the enzymes in our data are labeled by their gene names.
3.3. GENE ASSOCIATION WITH REACTIONS AND EFFECTIVE GENE DELETION OR
SINGLE ENZYME DELETION
Each metabolic reconstruction contains a matrix of gene-reaction associations. This matrix records which reactions are catalyzed by an enzyme that is coded for, wholly or partially, by a given gene. Many possible associations between a reaction, the enzymes involved and the genes that code for them exist, such as:
(i) Two or more proteins are required to make a single enzyme. Each protein will be coded by a different gene, and the genes share an “AND” relationship.
(ii) Two or more enzymes, each coded for by a different gene, catalyze exactly the same set of reactions. Here the genes share an “OR” relationship.
In order to distinguish between the roles of isozymes and multifunctional enzymes, we pick only one gene from any set of genes that are in either an “OR” or an “AND” relationship with each other. This ensures that all multi-subunit enzymes are treated as a single unit, and isozymes are treated separately. We call the list of genes remaining after those coding for isozymes have been removed the unique gene list, and the corresponding list of enzymes the unique enzymes. Only enzymes that show an exact overlap of catalyzed reactions are treated as isozymes. Enzymes that show only a partial overlap of reactions constrained are not treated as isozymes and are retained in the unique gene list. When a gene (or an enzyme) is deleted, we delete every reaction that it can catalyze except those that are also catalyzed by another enzyme. This strategy ensures that every reaction that has a gene associated with it is selected.
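One possible reading of this procedure is sketched below. The enzyme-to-reaction map, the gene labels and the function names are hypothetical, and the sketch assumes that each multi-subunit complex has already been merged into a single entry; it is not the code used for the analyses reported here.

```python
# Hypothetical enzyme -> reactions map. A multi-subunit complex ("AND") is
# assumed to be represented by a single key already.
ENZYME_REACTIONS = {
    "gA&gB": {"R1", "R2"},  # complex whose subunits share an "AND" relationship
    "gC": {"R3"},
    "gD": {"R3"},           # exact overlap with gC, so treated as an isozyme ("OR")
    "gE": {"R3", "R4"},     # partial overlap with gC, so retained
}


def unique_enzymes(enzyme_reactions):
    """Keep one representative per group of enzymes with identical reaction sets."""
    seen, unique = set(), {}
    for enzyme in sorted(enzyme_reactions):  # sorted for a deterministic choice
        key = frozenset(enzyme_reactions[enzyme])
        if key not in seen:
            seen.add(key)
            unique[enzyme] = set(enzyme_reactions[enzyme])
    return unique


def reactions_lost(unique, enzyme):
    """Reactions removed on deleting `enzyme`: those no other unique enzyme catalyzes."""
    backed_up = set()
    for other, reactions in unique.items():
        if other != enzyme:
            backed_up |= reactions
    return unique[enzyme] - backed_up


unique = unique_enzymes(ENZYME_REACTIONS)
lost = {enzyme: reactions_lost(unique, enzyme) for enzyme in unique}
```

In this toy example gD is dropped as an isozyme of gC, deleting gC removes no reaction because gE also catalyzes R3, and deleting gE removes only R4.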
3.4. POWER-LAW ANALYSIS
Power-law analysis was carried out by two methods. The first is a linear fit to the log-log plot using the built-in Matlab function polyfit (Figure B1). However, except for a few generic transporter proteins, proteins catalyze at most a few tens of reactions. With this small decadal span of the data, a linear fit on a log-log plot is very likely to give a false positive for a power law. We therefore analyzed the data using the maximum likelihood estimators (MLE) of Clauset et al. (Clauset, Shalizi, & Newman, 2009), using the Matlab script files made available by them. For the power law described by $p(x) \sim x^{-\alpha}$, valid for $x \ge x_{\min}$, we used the MLE and a goodness-of-fit measure to estimate the parameters $\alpha$ and $x_{\min}$. The plausibility of the power-law fit was then estimated with the publicly available code from Clauset et al. (2009), which repeatedly samples synthetic data sets from the fitted power-law distribution and measures the Kolmogorov-Smirnov (KS) statistic of each synthetic data set with respect to its own best power-law fit. The fraction of synthetic data sets whose KS statistic is larger than the KS statistic of the empirical data is the p-value, which measures how plausible it is that the empirical data come from the fitted power law. Note that a higher p-value represents support for the hypothesis. We follow the commonly used benchmark of assuming insufficient support for the power-law hypothesis when p ≤ 0.1. More detailed descriptions are provided in the reference cited.
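To make the estimation step concrete, the following minimal sketch implements the approximate closed-form MLE for discrete data given by Clauset et al. (2009), α ≈ 1 + n / Σ ln(x_i / (xmin − 1/2)), for a fixed xmin. It is a simplification and not the Matlab scripts used here: those scripts also select xmin and compute the p-value. The function name and the example data are hypothetical.

```python
import math


def alpha_mle_discrete_approx(data, xmin):
    """Approximate MLE of the power-law exponent for discrete data with x >= xmin."""
    tail = [x for x in data if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    # Sum of log-ratios relative to the shifted lower cut-off (xmin - 1/2).
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degrees of multifunctionality for eight enzymes.
example_ke = [1, 1, 1, 1, 2, 2, 3, 5]
alpha = alpha_mle_discrete_approx(example_ke, xmin=1)
```

The approximation is known to be less accurate for small xmin, which is one reason a full analysis relies on the exact discrete estimator and the accompanying goodness-of-fit test.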
3.5. DISTRIBUTION ANALYSIS
In addition to power-law analysis, we also used (i) the two-sample Kolmogorov-Smirnov test, and (ii) the Mann-Whitney U-test (a.k.a. Wilcoxon rank-sum test).
The two-sample Kolmogorov-Smirnov test is a non-parametric test of whether two underlying one-dimensional probability distributions differ. The KS statistic quantifies the distance between the empirical distribution functions of the two samples. It should be noted that for power-law analysis we used the KS statistic between a given organismal enzyme-reaction distribution and its corresponding power-law fit; here, however, we calculated the KS statistic between any two organismal enzyme-reaction distributions. The null hypothesis is that the data from both organisms belong to the same distribution, and p ≤ 0.05 suggests that the null hypothesis can be rejected. The statistic was calculated using the Matlab R2014b built-in function “kstest2”.
The Mann-Whitney U-test (a.k.a. Wilcoxon rank-sum test), like the KS test, is a non-parametric test of whether two organismal data sets come from the same distribution. Here, too, p ≤ 0.05 suggests that the null hypothesis can be rejected. The statistic was calculated using the Matlab R2014b built-in function “ranksum”.
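The two test statistics can be written down in a few lines. The sketch below computes only the statistics, not the p-values, and is not a re-implementation of “kstest2” or “ranksum” (the latter reports a rank sum rather than U); the function names and sample values are hypothetical.

```python
def ks_two_sample_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of samples a and b."""
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(1 for value in a if value <= x) / len(a)
        cdf_b = sum(1 for value in b if value <= x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def mann_whitney_u(a, b):
    """U statistic of sample a: pairs (x in a, y in b) with x > y, ties counted as 1/2."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical degrees of multifunctionality from two organisms.
sample_a = [1, 1, 2, 3]
sample_b = [1, 2, 2, 5]
ks_stat = ks_two_sample_statistic(sample_a, sample_b)
u_stat = mann_whitney_u(sample_a, sample_b)
```

Because ke takes integer values, ties are frequent in real enzyme-reaction data, which is why the sketch counts tied pairs as one half in the U statistic.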
3.6. FLUX BALANCE ANALYSIS (FBA)
Flux Balance Analysis (FBA) is a mathematical framework used to calculate the flow of the metabolites through the metabolic network at steady state (Orth et al., 2010). FBA was performed using COBRA Toolbox (Schellenberger et al., 2011) with Gurobi 4.6.1 on MATLAB
R2011b. Briefly, each metabolic reconstruction that we use provides an M-by-N stoichiometric matrix, S, for the metabolic reactions and a table of gene associations for each reaction (here M is the number of metabolites and N is the number of reactions). $S_{ij}$ represents the stoichiometric coefficient of the $i$-th metabolite in the $j$-th reaction. At steady state, a flux distribution for the organism is found under the condition that the flux through the growth rate reaction is maximized, which makes this a linear programming problem. Certain constraints are then imposed to find an optimal solution to the under-determined system. The most important constraint arises from the reaction network at steady state:
$$\sum_{j=1}^{N} S_{ij} v_j = 0 \tag{2.1}$$
Here, $v$ is the vector of reaction fluxes. The growth rate reaction is described as:
$$\sum_{i=1}^{M} c_i X_i \rightarrow \text{Biomass} \tag{2.2}$$
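The steady-state constraint of Equation 2.1 can be checked directly for a candidate flux vector. The sketch below does not perform FBA, which requires a linear programming solver such as the one used through the COBRA Toolbox; it only evaluates S v for a hypothetical three-reaction toy network, with all names chosen for illustration.

```python
# Hypothetical toy network. Rows are metabolites (A, B); columns are reactions
# (uptake -> A, A -> B, B -> growth).
S = [
    [1, -1, 0],
    [0, 1, -1],
]
GROWTH_INDEX = 2  # column of the growth rate reaction in this toy network


def steady_state_residual(S, v):
    """Return the vector S v, which must be zero for every metabolite at steady state."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True if every metabolite balance is satisfied within the tolerance."""
    return all(abs(r) <= tol for r in steady_state_residual(S, v))


balanced = [10.0, 10.0, 10.0]   # every metabolite is produced as fast as it is consumed
unbalanced = [10.0, 5.0, 5.0]   # metabolite A accumulates at a rate of 5

balanced_ok = is_steady_state(S, balanced)
unbalanced_ok = is_steady_state(S, unbalanced)
growth_flux = balanced[GROWTH_INDEX]
```

In this linear toy pathway the steady-state condition forces all three fluxes to be equal, so maximizing the growth flux amounts to pushing the uptake flux to its upper bound.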
Two other types of constraints arise:
1) Constraints on uptake and secretion rates of metabolites.
2) Limits on the upper and lower bounds of each reaction flux, i.e.
(2.3)