bioRxiv preprint doi: https://doi.org/10.1101/590992; this version posted April 15, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.
ETFL: A formulation for flux balance models accounting for expression, thermodynamics, and resource allocation constraints
Pierre Salvy1, Vassily Hatzimanikatis1,*
1 Laboratory of Computational Systems Biotechnology, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland * [email protected]
Abstract
Since the introduction of metabolic models and flux balance analysis (FBA) in systems biology, several attempts have been made to add expression data. However, directly accounting for enzyme and mRNA production in the mathematical programming formulation is challenging because of the dilution of macromolecules, which introduces a bilinear term in the mass-balance equations, making them harder to solve than linear formulations like FBA. Furthermore, there have been no attempts to include thermodynamic constraints in these formulations, which would yield an even more complex mixed-integer non-linear problem. We propose here a new framework, called Expression and Thermodynamics Flux (ETFL), as a new ME-model implementation. ETFL is a top-down model formulation, from metabolism to RNA synthesis, that simulates thermodynamic-compliant intracellular fluxes as well as enzyme and mRNA concentration levels. The formulation results in a mixed-integer linear problem (MILP) that enables both relative and absolute metabolite, protein, and mRNA concentration integration. The proposed formulation is compatible with mainstream MILP solvers and does not require a non-linear solver. It also accounts for growth-dependent parameters, such as relative protein or mRNA content. We present here the formulation of ETFL along with its validation using results obtained from a well-characterized E. coli model. We show that ETFL is able to reproduce proteome-limited growth, which FBA cannot. We also subject it to different analyses, including the prediction of feasible mRNA and enzyme concentrations in the cell, and propose ETFL-based adaptations of other common FBA-based procedures. The software is available on our public repository at https://github.com/EPFL-LCSB/etfl.
Author summary
Metabolic modeling is a useful tool for biochemists who want to tweak biological networks for the direct expression of key products, such as biofuels, specialty chemicals, or drug candidates. To provide more accurate models, several attempts have been made to account for protein expression and growth-dependent parameters, key components of biological networks, though this is computationally challenging, especially when also attempting to include thermodynamics. To the best of our knowledge, there are no published methods integrating these three types of constraints in one model. We propose here a transparent mathematical formulation to model both expression and metabolism of a cell, along with a reformulation that allows a computationally tractable inclusion of growth-dependent parameters and thermodynamics. We demonstrate good performance using community-standard software, and propose ways to adapt classical modeling studies to expression-enabled models. The incorporation of thermodynamics and growth-dependent variables provides a finer modeling of expression because it eliminates thermodynamically unfeasible solutions and considers phenotypic differences in different growth regimens, which are key for accurate modeling.
Introduction
Metabolic modeling, which helps make sense of the metabolism in a biological network, is an important tool for engineering biocatalysts, with applications in biofuels, drug design, microbial community analysis, and personalized medicine. Model accuracy is instrumental to the success of these applications through an efficient engineering of the host organisms. However, incorporating expression information into metabolic networks poses a significant challenge, and most current models do not even attempt it, effectively excluding an important network in biological systems that can drastically affect results. In metabolic engineering, strains are modified and controlled at the genome level through the transcriptome, and the effects are observed at the fluxome level, which accounts for the range of metabolic reactions in an organism. In between these two levels is the proteome that actually performs the biochemical transformations according to the genetic template, though it is this middle step in the process that cannot yet be robustly and efficiently incorporated into models of metabolic systems. Because of the complex interplay between these different layers of control, understanding expression and incorporating this into future models is key for improving metabolic engineering.
Classically, model-based strain design has relied on a tool that uses the DNA sequence of an organism to detail the network of metabolic reactions that happen inside a cell of that organism, which is called a genome-scale model (GEM). With current technologies and tools like metagenome sequencing [1], it is possible to generate GEMs for hundreds of different species at a time. GEMs are particularly amenable to flux balance analysis (FBA), which models metabolism at the fluxome level using linear optimization techniques. However, plain FBA has been known to predict biochemically unrealistic solutions like free high-flux cycles or thermodynamically infeasible pathways. It also scales growth linearly with carbon uptake, which is not observed at high uptake fluxes. FBA also fails to capture growth-dependent and protein-level effects, such as enzyme saturation or proteome-related limitations. Hence, several efforts have been made to supplement FBA with additional constraints to improve its predictive power. For example, thermodynamics-based flux analysis (TFA) [2, 3] uses thermodynamic constraints to enforce thermodynamically consistent reaction directionalities and to allow the integration of metabolomics. Resource balance models add a total proteome capacity constraint, as formulated in Beg et al. [4], to model the proteome-related limitations of the cell, as enzymes have to compete for the constrained total amount of cellular proteins. Frameworks like GECKO [5] further build on this resource balance idea and include flux constraints based on proteomics, such as $v \le V_{max} = k_{cat}[E]$, as well as a constraint on the total proteome mass. Finally, metabolism and expression models (ME-models) [6, 7] were the first to integrate the entirety of the expression mechanisms of the cell from the bottom-up, including mRNA and protein synthesis.
However, simultaneously accounting for all of these constraints is challenging because of the formulation of each method, as TFA models involve integer variables that yield a mixed-integer linear program (MILP), whereas ME-models involve bilinear constraints that require special optimization procedures and a dedicated solver [8–10]. Mixing these methods would require the inclusion of integers in ME-models, which is not straightforward and would lead to more complex mixed-integer non-linear programs (MINLP) that are computationally intensive to solve. Additionally, the amount of RNA and protein, the RNA and protein expression rates, and their stabilities are all growth dependent [11], and including accurate representations of these variables leads to even more complex, non-linear models. Meanwhile, although resource balance models such as GECKO could theoretically be integrated into TFA or ME-models in the current formulations, to the best of our knowledge, no link with TFA or ME-models has been proposed. Therefore, the metabolic engineering community needs a common formulation for these methodologies to build the most accurate models.
We investigated the development of such a framework and propose herein a unified formulation for Expression and Thermodynamics-enabled FLux models (ETFL) that can account for the above integration issues. In ETFL, we address the compatibility of the formulations by expressing the growth rate variable in bilinear products as a piecewise-constant function. This reformulation allows us to transform the problem into a MILP, which we can solve efficiently using widely available open source or commercial solvers. The resulting model is then effectively able to directly integrate thermodynamic constraints as well as expression constraints and growth-dependent parameters. In this model, metabolite, enzyme, and mRNA concentration levels are explicitly defined to enable fast and easy omics integration. Finally, we show an application of this framework to a well-characterized E. coli model, iJO1366 [12].
1 Results and Discussion
1.1 Formulation of the expression problem
To transparently account for expression mechanisms and increase the predictive power of our models, we needed to derive the equations that could bridge the biochemistry with the optimization problem that is ETFL. Here we present a summary of these equations, and detail their derivation in the section Materials and Methods. We derived these equations using assumptions similar to those used in the formulation of the GECKO [5] and ME-model [6–8] frameworks.
This formulation relies on derivations rooted in the biological mechanism of expression and depends on a number of biochemical parameters related to the cell. In particular, the mass balances of the macromolecules are expressed using concentration variables. Each mass balance will yield an equation where the concentrations of the macromolecules will be variables, thus effectively formulating a new constraint of the model and allowing us to calculate concentration values by solving the model.
We can write the quasi-steady state mass balance for macromolecules as follows:
$$v_{?}^{syn} - v_{?}^{deg} - \mu \cdot G_{?} = 0, \tag{1}$$
where $?$ represents the indexing of the macromolecule, $v_{?}^{syn}$ is the synthesis term, $v_{?}^{deg}$ is the degradation term, and $\mu \cdot G_{?}$ is the dilution term. The detail of the derivation is available in the Materials and Methods.
Using this formalism, for each macromolecule we can define and link together a synthesis flux, a degradation flux, and the macromolecule's concentration. Knowing enzyme concentrations allows us to bound the variables representing metabolic reaction fluxes with their maximum catalytic rate according to the classical equation:
$$v \le k_{cat} \cdot E, \tag{2}$$
where $k_{cat}$ is the catalytic rate constant of the enzyme $E$ with respect to flux $v$. In this same fashion, we can also constrain the synthesis flux for the peptides, which are then assembled into enzymes. Peptide synthesis is simply a metabolic reaction that consumes energy (in the form of GTP) and charged tRNAs and produces a peptide and uncharged tRNAs. The catalytic rate of the reaction is proportional to the maximum ribosomal catalytic rate divided by the length of the peptide to be synthesized. The same can be said about mRNA synthesis, which uses nucleoside triphosphates and is catalyzed by the RNA polymerase. The details of these constraints are explained in the Materials and Methods.
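The two bounds described above can be illustrated with a minimal Python sketch. The function names and the peptide length of 300 amino acids are hypothetical, chosen only for illustration; the ribosomal rate of 43 200 aa.h^-1 is the value quoted later in the Scale section. This is not the ETFL implementation, only the arithmetic behind Eq. (2) and the length-scaled translation rate.

```python
def max_flux(kcat_per_h: float, enzyme_conc: float) -> float:
    """Upper bound on a flux from Eq. (2): v <= kcat * E."""
    return kcat_per_h * enzyme_conc


def per_peptide_rate(k_rib_aa_per_h: float, length_aa: int) -> float:
    """Effective rate constant for one peptide:
    ribosomal elongation rate divided by the peptide length."""
    return k_rib_aa_per_h / length_aa


# Hypothetical peptide of 300 amino acids, ribosome at 43 200 aa/h.
k_tsl = per_peptide_rate(43200.0, 300)
# Hypothetical ribosome concentration assigned to this peptide (mmol/gDW).
v_tsl_max = max_flux(k_tsl, 1e-6)
```

Longer peptides therefore get a proportionally smaller rate constant, which is how the length term enters the translation constraint (TR2).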
The part of the matrix that has been added to the FBA problem to account for expression has been termed the expression problem (EP). Although this initial formulation is bilinear, we will detail in the following section how we cast it to a MILP.
1.1.1 Biomass reaction synthesis and mass balance
In FBA, the biomass reaction is an artificial, lumped reaction that represents the consumption of metabolites in proportion to the cell growth rate. This consumption reflects nucleoside triphosphate (NTP) requirements for mRNAs, amino acid requirements for proteins, lipid requirements for the cell wall, or metal ion needs. Biomass reaction inclusiveness depends on the modeling assumptions made during the model curation process and can vary significantly among models of the same species. The consumed amount of each metabolite is usually estimated experimentally by measuring the amounts of these metabolites in dried cell mass. Because the stoichiometric ratios of metabolites in the biomass reaction are fixed, the abundance of metabolites is the same for all growth rates. This simplifying assumption, necessary in FBA, goes against experimental evidence. Neidhardt and Curtis [11] report for instance that mRNA and protein mass ratios in the cell change with growth rate.
Because ETFL has explicit expression requirements through transcription, translation, and tRNA-charging reactions, it is possible to account for varying ratios of NTPs and amino acids as the growth rate changes, an effect that is captured in experiments [11]. In this context, the approximation made in FBA can be written using ETFL terms:
$$\forall aa_i, \quad \eta_{aa_i}^{v_{biomass}} \cdot \mu \approx v_{aa_i}^{charging}, \tag{3}$$
$$\forall NTP_i, \quad \eta_{NTP_i}^{v_{biomass}} \cdot \mu \approx \sum_j v_{NTP_i}^{tcr_j}, \tag{4}$$
where $v_{biomass}$ represents the biomass equation, and $\eta_{m_i}^{v_{biomass}}$ is the expression-related participation of metabolite $m_i$ in the biomass reaction. Hence, to avoid accounting for the expression requirements twice (once through the biomass equation, once through the EP), it is necessary to remove the participation of these metabolites linked to expression from the biomass reaction.
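The removal step described above can be sketched as a simple filter over the biomass stoichiometry. The metabolite identifiers and coefficients below are hypothetical placeholders, not values from any published model, and the function name is an illustrative assumption rather than part of the ETFL package.

```python
def strip_expression_metabolites(biomass: dict, expression_mets: set) -> dict:
    """Return a copy of the biomass stoichiometry without the metabolites
    whose demand is already covered by the expression problem (EP)."""
    return {met: coeff for met, coeff in biomass.items()
            if met not in expression_mets}


# Hypothetical, heavily truncated biomass reaction (negative = consumed).
biomass = {"ala__L_c": -0.5, "gtp_c": -0.2, "pe160_c": -0.05, "biomass": 1.0}
# Hypothetical set of amino acids and NTPs handled by charging/transcription.
expression_mets = {"ala__L_c", "gtp_c"}

reduced = strip_expression_metabolites(biomass, expression_mets)
```

Only the lipid and the biomass pseudo-metabolite remain in `reduced`, so the amino acid and NTP demand is counted once, through the EP.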
1.1.2 Summary
Here we show the formulation of the constraints of ETFL. For clarity, we use different indexing sets, each referring to a specific object in the model, detailed in Table 1. The variables are detailed in Table 2, and the parameters are in Table 3.
Catalytic constraints
$$v_j^{+} - k_{cat}^{j,+} E_j \le 0 \tag{FC$_j$}$$
$$v_j^{-} - k_{cat}^{j,-} E_j \le 0 \tag{BC$_j$}$$
Table 1. Indices used in the formulation.
Index letter | Indexed variables
i | Metabolite
j | Reaction/Flux/Enzyme
l | Gene/Peptide/mRNA
s | Binary coefficient for growth discretization
u | Binary coefficient for interpolation discretization

Table 2. Variables used in the formulation.
Symbol | Variable
$\mu$ | Growth rate
$v_j^{\pm}$ | $j$th net positive/negative biochemical flux
$E_j$ | Concentration of the $j$th enzyme
$F_l$ | Concentration of the $l$th mRNA
$P_l$ | Concentration of the RNA polymerase assigned to the $l$th mRNA
$R_l$ | Concentration of the ribosome assigned to the $l$th peptide
$T_{aa_i}^{u}$ | Concentration of the $i$th uncharged tRNA
$T_{aa_i}^{c}$ | Concentration of the $i$th charged tRNA
$v_l^{tsl}$ | Translation rate of the $l$th gene
$v_l^{tcr}$ | Transcription rate of the $l$th gene
$v_j^{asm}$ | Assembly rate of the $j$th enzyme
$v_j^{deg}$ | Degradation rate of the $j$th enzyme
$v_l^{deg}$ | Degradation rate of the $l$th mRNA
$v_{aa_i}^{charging}$ | Charging rate of the $i$th tRNA

Table 3. Parameters used in the formulation.
Symbol | Parameter
$k_{cat}^{j,\pm}$ | Forward/backward catalytic rate constant of the $j$th net biochemical flux
$k_{deg}^{j}$ | Degradation rate constant of the $j$th enzyme
$k_{deg}^{l}$ | Degradation rate constant of the $l$th mRNA
$\eta_l^{j}$ | Stoichiometry of the $l$th peptide in the $j$th enzyme
$\eta_{aa_i}^{l}$ | Stoichiometry of the $i$th amino acid in the $l$th peptide
$L_l^{aa}$ | Length in amino acids (aa) of the $l$th peptide
$L_l^{nt}$ | Length in nucleotides (nt) of the $l$th mRNA
$L_{rib}^{nt}$ | Ribosome footprint size on mRNA, in nucleotides
$\rho$ | Ribosome occupancy
$\pi$ | RNA polymerase occupancy
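Metabolite mass balance
$$S \cdot v = 0 \tag{FBA}$$
Expression mass balance
$$v_l^{tsl} - \sum_j \eta_l^{j} \cdot v_j^{asm} = 0 \tag{PB$_l$}$$
$$v_{rRNA_l}^{tcr} - v_{rib}^{asm} = 0 \tag{RB$_{rRNA_l}$}$$
$$v_j^{asm} - v_j^{deg} - \mu \cdot E_j = 0 \tag{EB$_j$}$$
$$v_l^{tcr} - v_l^{deg} - \mu \cdot F_l = 0 \tag{MB$_l$}$$
$$-v_{aa_i}^{charging} + \sum_l \eta_{aa_i}^{l} \cdot v_l^{tsl} - \mu \cdot T_{aa_i}^{u} = 0 \tag{TB$^{u}_{aa_i}$}$$
$$v_{aa_i}^{charging} - \sum_l \eta_{aa_i}^{l} \cdot v_l^{tsl} - \mu \cdot T_{aa_i}^{c} = 0 \tag{TB$^{c}_{aa_i}$}$$
Degradation fluxes
$$v_j^{deg} - k_{deg}^{j} \cdot E_j = 0 \tag{ED$_j$}$$
$$v_l^{deg} - k_{deg}^{l} \cdot F_l = 0 \tag{MD$_l$}$$
Expression constraints
$$v_l^{tcr} - \frac{k_{cat}^{RNAP}}{L_l^{nt}} P_l \le 0 \tag{TR1$_l$}$$
$$v_l^{tsl} - \frac{k_{cat}^{rib}}{L_l^{aa}} R_l \le 0 \tag{TR2$_l$}$$
$$R_l - \frac{L_l^{nt}}{L_{rib}^{nt}} F_l \le 0 \tag{EX$_l$}$$
Total capacity
$$\sum_l R_l + R_F - E_{rib} = 0 \tag{TC2}$$
$$\sum_l P_l + P_F - E_{RNAP} = 0 \tag{TC1}$$
$$R_F - (1 - \rho)\, E_{rib} = 0 \tag{RR}$$
$$P_F - (1 - \pi)\, E_{RNAP} = 0 \tag{PR}$$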
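As a small worked consequence of the constraints listed above, substituting the degradation flux (ED$_j$) into the enzyme mass balance (EB$_j$) gives the enzyme concentration sustained by a given assembly rate. This is only a rearrangement of those two constraints, valid when $k_{deg}^{j} + \mu > 0$, shown here to make the role of the dilution term explicit:

```latex
\begin{align}
v_j^{asm} - k_{deg}^{j} \cdot E_j - \mu \cdot E_j &= 0, \\
E_j &= \frac{v_j^{asm}}{k_{deg}^{j} + \mu}.
\end{align}
```

Because both $\mu$ and $E_j$ are optimization variables, the product $\mu \cdot E_j$ in the first line is what makes the expression problem bilinear.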
1.2 Reformulation
1.2.1 Bilinearity of the problem
The main issue with the EP formulation presented above lies in the continuous bilinear terms that describe the dilution of the macromolecules, $G_{?} \in \{E_j\} \cup \{F_l\} \cup \{tRNA_{aa_i}^{(un)charged}\}$. We use $?$ as a placeholder for the indexing of $G$. Using previous notations for the synthesis, degradation, and growth rate:
$$v_{?}^{syn} - v_{?}^{deg} - \mu \cdot G_{?} = 0. \tag{5}$$
In this state, the dilution term is bilinear, and the formulation requires a bilinear solver or potentially a mixed-integer bilinear solver if thermodynamics are to be added. The original ME-model formulation has terms similar to those we present here [6, 7]. As such, its recent adaptation in Lloyd et al. [8] uses the two-level algorithm SolveME [9] that requires a dedicated non-linear solver. We present instead a MILP approximation of the problem that makes it compatible and solvable with mainstream MILP solvers. We achieve this through the discretization and linearization of the bilinear products. This operation can be understood as locally approximating the bilinear problem by several linear subproblems and choosing the best approximation.
1.2.2 Approximation of the growth rate
In ETFL, we approximate the growth rate $\mu$ in bilinear products with a piecewise-constant function $\hat{\mu}$ (0th-order approximation) to transform the continuous bilinear terms into mixed (integer × continuous) bilinear terms. This simplifies the problem, as these mixed bilinear terms can be linearized in a MILP setting using the Petersen linearization scheme [13], a particular case of the Glover linearization scheme [14] that was previously used in metabolic engineering by Hatzimanikatis et al. [15, 16].
Let $\bar{\mu}$ be an upper bound to $\mu$, and $(p, N) \in \mathbb{N}^2$, $p \le N$. We can approximate $\mu$ with the following 0th-order approximation:
$$\forall \mu \in [0, \bar{\mu}], \quad \mu \approx \hat{\mu} = p \cdot \frac{\bar{\mu}}{N}. \tag{6}$$
With this notation, $\bar{\mu}/N$ is, in fact, the resolution of the approximation. We can then perform the binary expansion of $p$:
$$p = \sum_{s=0}^{\lceil \log_2 N \rceil} 2^{s} \cdot \delta_s, \tag{7}$$
where $\lceil \log_2 N \rceil$ denotes the smallest integer greater than or equal to $\log_2 N$, and $\delta_s \in \{0, 1\}$ is the $s$th digit from the right of the binary notation of $p$, counting from zero.
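As an example, let us consider modeling an organism whose growth rate does not exceed $\mu_{max} = 2.3\ \mathrm{h^{-1}}$. To do this, we can set $\bar{\mu} = 2.5 \ge \mu_{max}$. Let us choose a resolution of $0.25\ \mathrm{h^{-1}}$, which gives $N = 10$. Then, $\log_2 N \approx 3.32$, and $\lceil \log_2 N \rceil = 4$. A growth rate $\mu = 1.4$ will be approximated by:
$$\hat{\mu} = 1.5 = 6 \cdot \frac{\bar{\mu}}{10},$$
$$\hat{\mu} = \left(\delta_0 \times 2^0 + \delta_1 \times 2^1 + \delta_2 \times 2^2 + \delta_3 \times 2^3 + \delta_4 \times 2^4\right) \cdot \frac{\bar{\mu}}{10},$$
$$\hat{\mu} = \left(0 \times 1 + 1 \times 2 + 1 \times 4 + 0 \times 8 + 0 \times 16\right) \cdot \frac{\bar{\mu}}{10}.$$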
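The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of Eqs. (6) and (7), not the ETFL implementation; the function name is hypothetical, and picking $p$ by rounding to the nearest grid point is an assumption made for illustration.

```python
import math


def discretize_growth(mu: float, mu_bar: float, n_steps: int):
    """Approximate mu by p * mu_bar / n_steps and expand p in binary.

    Returns (p, bits, mu_hat), where bits[s] is the digit delta_s of Eq. (7).
    """
    step = mu_bar / n_steps                      # resolution mu_bar / N
    p = min(n_steps, max(0, round(mu / step)))   # nearest grid index (assumption)
    n_bits = math.ceil(math.log2(n_steps)) + 1   # s runs from 0 to ceil(log2 N)
    bits = [(p >> s) & 1 for s in range(n_bits)]
    return p, bits, p * step


p, bits, mu_hat = discretize_growth(1.4, 2.5, 10)
# p == 6, bits == [0, 1, 1, 0, 0], mu_hat == 1.5
```

The list `bits` is ordered from $\delta_0$ to $\delta_4$, matching the last line of the example.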
This example is illustrated in Fig. 1. Conceptually, a heuristic for solving this problem would be:
1. Solve the FBA for $\mu$
2. Select the corresponding, closest $\hat{\mu}$
3. Apply it to compute dilution values
4. Solve the EP with fixed dilution
5. Apply the catalytic constraints to the FBA
6. Recalculate the FBA under catalytic constraints
7. If $\mu \notin \left\{ \hat{\mu} \pm \frac{\bar{\mu}}{N} \right\}$, go back to 3, else, end.
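[Figure: step function $\hat{\mu}$ plotted against $\mu$ over $[0, 2.5]$, with the point $\mu = 1.4\ \mathrm{h^{-1}}$, $\hat{\mu} = 1.5\ \mathrm{h^{-1}}$ marked.]
Fig 1. Discretization of $\mu$ into $\hat{\mu}$. The step approximation transforms the continuous interval $[0, 2.5]$ into the discrete set $\{0, 0.25, \ldots, 2.5\}$.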
1.2.3 Linearizing the bilinearity
In the previous derivation, we replaced the growth rate variable by a discrete number of acceptable values. We can approximate the continuous product $\mu \cdot G_{?}$, which represents the dilution, as follows:
$$\mu \cdot G_{?} \approx \hat{\mu} \cdot G_{?}, \tag{8}$$
$$\hat{\mu} \cdot G_{?} = \sum_{s=0}^{\lceil \log_2 N \rceil} \frac{2^{s}\, \bar{\mu}}{N} \cdot \delta_s \cdot G_{?}. \tag{9}$$
The product $\delta_s \cdot G_{?}$ is then still bilinear, but one of its variables is binary. Assuming a constant $M > G_{?}$, we can use Petersen's linearization theorem [13, 14] to replace the product $\delta_s \cdot G_{?}$ with a single nonnegative variable $z_{?}^{s}$, as described in the Materials and Methods.
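To make the linearization step concrete, the sketch below uses the textbook big-M form of the constraints on $z = \delta \cdot G$ (namely $z \le M\delta$, $z \le G$, $z \ge G - M(1 - \delta)$, $z \ge 0$). The exact constraints used by the authors are in the Materials and Methods and may be written differently; the function names and the values of $G$ and $M$ are hypothetical. The code checks numerically that these four inequalities leave a single feasible value for $z$, then evaluates the sum of Eq. (9).

```python
def petersen_interval(delta: int, g: float, big_m: float):
    """Feasible interval for z under the big-M constraints
    z <= M*delta, z <= g, z >= g - M*(1 - delta), z >= 0."""
    lower = max(0.0, g - big_m * (1 - delta))
    upper = min(big_m * delta, g)
    return lower, upper


def linearized_dilution(bits, g, mu_bar, n_steps, big_m=1.0):
    """Evaluate Eq. (9) with each product delta_s * g replaced by z_s."""
    total = 0.0
    for s, delta in enumerate(bits):
        lower, upper = petersen_interval(delta, g, big_m)
        assert lower == upper            # z_s is pinned to delta_s * g
        total += (2 ** s) * mu_bar / n_steps * lower
    return total


# Bits of p = 6 from the example above; g is a hypothetical concentration.
dilution = linearized_dilution([0, 1, 1, 0, 0], g=0.3, mu_bar=2.5, n_steps=10)
```

With $\hat{\mu} = 1.5$ and $G = 0.3$, `dilution` evaluates to $\hat{\mu} \cdot G = 0.45$ up to floating-point rounding, using only linear constraints and binary variables.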
Because of the binary expansion, the complexity of the model grows only as $O(\log_2 N) = O\left(\log_2 \frac{1}{\varepsilon}\right)$, where $\varepsilon = 1/N$ is proportional to the resolution of the approximation (which is $\bar{\mu}/N$). This means that the linearization part of a model with a resolution of $0.01\ \mathrm{h^{-1}}$ is only around twofold bigger than that of a model with a resolution of $0.04\ \mathrm{h^{-1}}$, while resolution has been improved fourfold.
1.2.4 Discretization of growth-dependent parameters
mRNA and enzyme content. Since growth has been discretized, it is now possible to also directly discretize other growth-dependent parameters of the problem, regardless of whether they are in a linear or non-linear relationship with growth. This is a direct consequence of the formulation of ETFL, which allows some flexibility in the modeling assumptions of the user. As an example, we described the relationship between growth and protein and mRNA mass ratios, $P^{m}$ and $R^{m}$, in the cell as reported in Neidhardt et al. [11] by interpolating and discretizing the protein ratio and mRNA ratio as functions of the growth rate. We used type 1 special ordered set constraints (SOS1) to model these variables with a first-order (piecewise linear) approximation:
$$\sum_j MW_j \cdot E_j - \sum_u \lambda_u \cdot P_u^{m} = 0, \tag{IC1}$$
$$\sum_l MW_l \cdot F_l - \sum_u \lambda_u \cdot R_u^{m} = 0, \tag{IC2}$$
$$\sum_u \lambda_u = 1. \tag{SOS1}$$
$P_u^{m}$ and $R_u^{m}$ are growth-dependent, interpolated protein and RNA mass ratios (in $\mathrm{g.g^{-1}}$). Given a growth rate, they define the relative mass of the cell that is protein or RNA. $MW_{?}$ represents the molar weight of the corresponding enzyme or RNA. The first two constraints enforce equality between the interpolated data and the model production. The last line is the SOS1 constraint that forces only one of the $\lambda_u$ to be active.
Additionally, it is necessary to have the integer index of $\lambda_u$ equal to the index of the growth rate. This is obtained through the constraint:
$$\sum_u u \cdot \lambda_u - \sum_s 2^{s} \cdot \delta_s = 0. \tag{EQI}$$
The first term represents the growth integer index, and the second represents its binary expansion. However, imposing such mass ratios requires the addition of a dummy mRNA as well as a dummy protein to represent the part of the transcriptome/proteome that is either missing from the expression model or altogether unrelated to metabolic function. We use average amino acid frequencies and GC content to model this. Explicit interpolation functions can also be used, such as the growth-dependent functions given in Pramanik et al. [17].
The simultaneous use of catalytic constraints on metabolic reactions (Eq. FC$_j$, BC$_j$) and maximal enzyme load (Eq. IC1) effectively implements allocation constraints like in GECKO [5], although in ETFL, the enzyme concentrations are also directly linked to the metabolism. In GECKO, the metabolic cost of building the enzymes is not taken into account.
Fig. 2 shows an example piecewise linear interpolation of the growth-dependent protein mass ratio in E. coli according to Neidhardt et al. [11]. The reported values (red circles) are interpolated using a piecewise linear function (dashed line), which is then discretized (full line). Using the integer constraints described above, the model can be forced to display a protein content that corresponds to its growth. We apply the same techniques to mRNA and DNA content.
Fig 2. Example of piecewise linear interpolation and discretization of the protein mass ratio from Neidhardt et al. [11]. Red circles represent the values reported. The dashed line is the piecewise linear interpolation. The solid line is its discretization.
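The interpolate-then-discretize step described above can be sketched in plain Python. The data points below are hypothetical placeholders, not the values reported by Neidhardt et al. [11], and the function names are illustrative assumptions. The last function mimics what constraints (SOS1) and (EQI) enforce together: exactly one $\lambda_u$ is active, and its index equals the growth index $p$.

```python
def interpolate(x, xs, ys):
    """Piecewise-linear interpolation, clamped to the first/last data point."""
    if x <= xs[0]:
        return ys[0]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]


def discretized_ratios(xs, ys, mu_bar, n_steps):
    """Tabulate the interpolated ratio on the growth grid u * mu_bar / N."""
    return [interpolate(u * mu_bar / n_steps, xs, ys)
            for u in range(n_steps + 1)]


def selected_ratio(p, table):
    """Value of sum_u lambda_u * P_u when lambda is one-hot at u == p."""
    lambdas = [1 if u == p else 0 for u in range(len(table))]
    return sum(lam * ratio for lam, ratio in zip(lambdas, table))


# Hypothetical growth rates (1/h) and protein mass ratios (g/g).
growth_points = [0.0, 1.0, 2.0]
ratio_points = [0.5, 0.6, 0.7]
table = discretized_ratios(growth_points, ratio_points, mu_bar=2.0, n_steps=4)
protein_ratio = selected_ratio(2, table)   # growth index p = 2, i.e. mu_hat = 1.0
```

In the MILP, `selected_ratio` is not computed explicitly: the solver chooses $\lambda_u$, and (IC1) ties the total enzyme mass to the selected table entry.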
DNA content. To further increase the scope of macromolecules covered by the model, it is also possible to add growth-dependent DNA content. DNA mass ratios at specific growth rates are reported in Neidhardt et al. [11]. We model the DNA synthesis reaction as follows:
$$v_{DNA}^{synthesis}: \ (1 - \gamma)\, L_{DNA}^{bp}\, dATP + (1 - \gamma)\, L_{DNA}^{bp}\, dTTP + \gamma\, L_{DNA}^{bp}\, dGTP + \gamma\, L_{DNA}^{bp}\, dCTP \rightarrow DNA + 2\, L_{DNA}^{bp}\, PPi,$$
where $\gamma$ is the GC content of the cell, and $L_{DNA}^{bp}$ is the total length in base pairs of the DNA. As with $mRNA_l$ and $Enz_j$, DNA has a mass-balance equation of the following shape:
$$\frac{d[DNA]}{dt} = 0 = v_{DNA}^{synthesis} - v_{DNA}^{degradation} - v_{DNA}^{dilution}, \tag{10}$$
$$v_{DNA}^{synthesis} - v_{DNA}^{deg} - \mu \cdot DNA = 0. \tag{DB$_{DNA}$}$$
We consider that the DNA does not degrade, meaning that $k_{deg}^{DNA} = 0$ and the only source of DNA consumption is dilution caused by the growth of the cell. We then define the molar weight of DNA, $MW_{DNA}$, and enforce the DNA mass ratio $D^{m}$ as we did with both proteins and mRNA:
$$MW_{DNA} = (1 - \gamma)\, L_{DNA}^{bp} \left(MW_{dATP} + MW_{dTTP}\right) + \gamma\, L_{DNA}^{bp} \left(MW_{dGTP} + MW_{dCTP}\right), \tag{11}$$
$$MW_{DNA} \cdot DNA - \sum_u \lambda_u \cdot D_u^{m} = 0. \tag{IC3}$$
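The DNA synthesis stoichiometry and Eq. (11) can be written out as a short Python sketch. The function names, the GC content, and the genome length used below are hypothetical illustration values, and the molar weights are passed in by the caller rather than hard-coded. This is not the ETFL implementation.

```python
def dna_synthesis_stoichiometry(gc_content: float, length_bp: float) -> dict:
    """Stoichiometry of the lumped DNA synthesis reaction
    (negative = consumed, positive = produced)."""
    at = (1.0 - gc_content) * length_bp   # dATP and dTTP each
    gc = gc_content * length_bp           # dGTP and dCTP each
    return {"dATP": -at, "dTTP": -at, "dGTP": -gc, "dCTP": -gc,
            "DNA": 1.0, "PPi": 2.0 * length_bp}


def dna_molar_weight(gc_content: float, length_bp: float, mw: dict) -> float:
    """Molar weight of DNA as written in Eq. (11)."""
    return ((1.0 - gc_content) * length_bp * (mw["dATP"] + mw["dTTP"])
            + gc_content * length_bp * (mw["dGTP"] + mw["dCTP"]))


# Hypothetical 100 bp molecule with 50% GC content.
stoich = dna_synthesis_stoichiometry(0.5, 100)
consumed = -(stoich["dATP"] + stoich["dTTP"] + stoich["dGTP"] + stoich["dCTP"])
```

As a consistency check, `consumed` equals `stoich["PPi"]`: one pyrophosphate is released per incorporated nucleotide, two per base pair.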
1.2.5 Scale
A critical issue in the formulation of this problem is that the variables span different orders of magnitude. Fluxes are typically between $10^{-3}$ and $10^{1}\ \mathrm{mmol.gDW^{-1}.h^{-1}}$, whereas protein concentrations are around $10^{-6}$ to $10^{-3}\ \mathrm{mmol.gDW^{-1}}$ and mRNA concentrations are $10^{-10}$ to $10^{-6}\ \mathrm{mmol.gDW^{-1}}$. The relationship between these scales is given by the catalytic rate constant of enzymes and expression machinery, which spans from $10^{3}$ to $10^{6}\ \mathrm{h^{-1}}$. In particular, the ribosome rate constant for translation ($\sim 12\ \mathrm{aa.s^{-1}} = 43\,200\ \mathrm{aa.h^{-1}}$) as well as the RNA polymerase rate constant of transcription ($\sim 85\ \mathrm{nt.s^{-1}} = 306\,000\ \mathrm{nt.h^{-1}}$) are responsible for strong differences in the concentrations and fluxes between transcription- and translation-related parts of the problem. Consequently, the constraint matrix becomes ill-conditioned, and the solver has to operate close to, or sometimes beyond, its maximal solving accuracy (usually around $10^{-9}$ for commercial solvers such as ILOG CPLEX or Gurobi).
To circumvent these limitations, we scale the EP using nondimensionalization, which reduces the numerical difficulty of the problem. We create nondimensionalized variables by dividing the variables of the initial problem by an estimated upper bound. For example, by definition, macromolecule concentrations cannot exceed $1\ \mathrm{g.gDW^{-1}}$, and the following constrains the transformed macromolecule variables between 0 and 1:
$$\hat{X} = \frac{X}{\sigma_X}, \quad \sigma_X \ge \sup(X) \implies 0 \le \hat{X} \le 1. \tag{12}$$
In this scheme, $\sigma_X$ is an upper bound to $X$. In particular, if we consider $\sigma_X$ to be the concentration of $1\ \mathrm{g.gDW^{-1}}$:
$$\begin{aligned} \sigma_X &= 1\ \mathrm{g.gDW^{-1}}, \\ &= 1\ \mathrm{g.gDW^{-1}} \times \frac{1}{MW(X)}\ \mathrm{mmol.g^{-1}}, \\ &= \frac{1}{MW(X)}\ \mathrm{mmol.gDW^{-1}}, \end{aligned} \tag{13}$$
where $MW(X)$ denotes the molecular weight of the macromolecule in SI units ($\mathrm{kg.mol^{-1}} \equiv \mathrm{g.mmol^{-1}}$), and $\hat{X}$ represents the mass fraction of the molecule in the cell. We scale the fluxes using a method derived from this, detailed in the supporting file S1 Nondimensionalization. It is also possible to further refine this upper bound by performing a variation analysis on $X$ and re-generating a model using the newly estimated upper bound.
For the sake of clarity, all problem formulations will be kept in their dimensionalized form in the subsequent equations although the implementation is actually nondimensionalized. The nondimensionalized problem is described further in S1 Nondimensionalization.
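The scaling of Eqs. (12) and (13) amounts to a unit conversion, sketched below in Python. The function names and the enzyme values (a molar weight of 50 g.mmol^-1 and a concentration of 10^-3 mmol.gDW^-1) are hypothetical illustration values; the flux scaling described in S1 is not reproduced here.

```python
def scaling_factor(mw_g_per_mmol: float) -> float:
    """sigma_X of Eq. (13): 1 g/gDW expressed in mmol/gDW."""
    return 1.0 / mw_g_per_mmol


def nondimensionalize(conc_mmol_per_gdw: float, mw_g_per_mmol: float) -> float:
    """X_hat = X / sigma_X of Eq. (12), i.e. the mass fraction of X."""
    return conc_mmol_per_gdw / scaling_factor(mw_g_per_mmol)


# Hypothetical enzyme: 50 g/mmol at 1e-3 mmol/gDW.
mass_fraction = nondimensionalize(1e-3, 50.0)

# Unit conversions quoted in the text (per second to per hour).
ribosome_rate_per_h = 12 * 3600      # aa/h
polymerase_rate_per_h = 85 * 3600    # nt/h
```

The scaled variable lies between 0 and 1 as long as the macromolecule does not exceed the dry mass of the cell, which keeps the coefficients of the constraint matrix within a narrower range.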
1.2.6 Recovering the FBA problem
In the ETFL formulation, enzyme synthesis is driven by the coupling between FBA and EP through the catalytic constraints. To carry flux, the cell needs to produce enzymes whose production will also use the metabolic resources of the cell. If allocation constraints are enforced, the amount of protein and mRNA synthesized must meet predefined mass ratios for the problem to be feasible. Hence, the metabolic requirement terms for the expression machinery (amino acids and NTP) have been removed from the biomass reaction and are accounted for in the tRNA charging and transcription reactions. Thus, the FBA solutions can be recovered from the ETFL formulation by the following routine:
• Setting $\forall j,\ k_{cat}^{j,\pm} = +\infty$,
• Constraining $\forall aa_i,\ v_{aa_i}^{charging} = \eta_{aa_i}^{v_{biomass}} \cdot \mu$,
• Constraining $\forall NTP_i,\ \sum_j v_{NTP_i}^{transcription_j} = \eta_{NTP_i}^{v_{biomass}} \cdot \mu$,
• If applicable, relaxing the allocation constraints,
• If applicable, relaxing the thermodynamic coupling constraints.
1.3 Application: E. coli genome-scale model iJO1366
iJO1366 [12] is a well-curated and well-studied GEM of E. coli that is closely related to
the GEM used in developing both ME-models iOL1650-ME [7] and iJL1678b-ME [8].
Additionally, this model has been extensively applied in the literature and is aligned
with a variety of datasets that can be used for data integration. We wanted to subject
the model to classical studies that would highlight the power of ETFL, particularly as
pertains to proteome-limited growth, macromolecule concentration variability analysis,
and gene knock-out studies. We also wanted to assess the sensitivity of the model with
respect to the presence of thermodynamic constraints as well as growth-dependent
parameters.
Thus, we first experimented with four different models using ETFL with or without
thermodynamic constraints and growth-dependent protein/RNA/DNA allocation
following Table 2 as reported by Neidhardt et al. [11]. The following Table 4 details the
nomenclature used to refer to these different models. The features of the most
constrained model containing both thermodynamic and growth-dependent parameters,
vETFL, are detailed in Table 5. These four models were optimized for maximal growth
at increasing glucose uptake rates to assess their behavior with respect to excess
substrate, which will show the non-linearity of the relationship between growth and
glucose uptake at high uptake rates. A plateau in the growth rate was expected, which
indicates a proteome-limited phenotype that cannot be observed with FBA. We also
Table 4. Nomenclature of the models used in the study of E. coli iJO1366. EFL stands for Expression and FLuxes, ETFL for Expression, Thermodynamics, and FLuxes, and the v- prefix indicates the inclusion of growth-dependent parameters (see 1.2.4).

                      growth-independent parameters   growth-dependent parameters
  (-) thermodynamics  EFL                             vEFL
  (+) thermodynamics  ETFL                            vETFL
Table 5. Properties of the vETFL model generated from iJO1366.

  Growth upper bound $\hat{\mu}$                      3.5 h^-1
  Number of bins $N$                                  128
  Resolution $\hat{\mu}/N$                            0.0273 h^-1
  Number of constraints                               68304
  Number of variables                                 49207
  Number of species                                   3240
    - Metabolites                                     1806
    - Peptides                                        1431
    - rRNA                                            3
  Number of enzymes                                   562
  Number of reactions                                 8023
    - Metabolic                                       1543
    - Transport                                       733
    - Exchange flux                                   330
    - Transcription                                   1431
    - Translation                                     1431
    - Complexation                                    562
    - Degradation                                     1993
  Number of metabolites $\Delta_f G'^{\circ}$         1737
  Number of reactions $\Delta_r G'^{\circ}$           1787
  Percent of metabolites $\Delta_f G'^{\circ}$        93.9%
  Percent of reactions $\Delta_r G'^{\circ}$          68.6%
subsequently subject vETFL to a variability analysis and gene essentiality analysis,
which will respectively show us the flexibility of the model and its accuracy in
predicting gene knock-out behavior.
1.3.1 Growth rate prediction
To study the behavior of the model at different carbon uptake rates, we simulated
growth on a minimal medium with only glucose as a carbon source, unlimited oxygen,
and some essential inorganic compounds. This allows us to show that at a higher
carbon uptake, the model predicts limited growth, unlike FBA, which would
predict an unlimited linear increase.
Figure 3 shows the predicted growth rate of the different (v)E(T)FL models
described in Table 4 with respect to the glucose uptake of the cell. As expected and in
contrast to current FBA models, all four models plateau after a certain uptake rate,
which indicates a proteome-limited phenotype due to the limited capacity of the cells to
make more enzymes to metabolize the glucose. As discussed for the ME-models [7] and
GECKO [5] formulations, within the context of models accounting for protein usage,
this is caused by (i) the protein burden necessary to metabolize higher fluxes; (ii) the
increased demand in protein synthesis at higher growth rates; and (iii) for the models
with allocation constraints, the allowed protein and RNA mass ratio. We can see that
models featuring protein, RNA, and DNA allocation constraints (vE[T]FL) consistently
predict a lower growth rate than models without allocation constraints. This is
expected, as the data we input requires additional proteins and mRNA to account for
non-metabolism-related macromolecules. Models featuring thermodynamic constraints
([v]ETFL) also predict a lower growth rate, consistent with the fact that thermodynamics
constrains the model to valid solutions whose fluxes lie in a subspace of the FBA feasible
space. The most constrained model (vETFL) consequently has the lowest growth rate
at any glucose uptake. This is in accordance with published TFA results that eliminated
biologically infeasible flux profiles yielding unrealistically high growth rates [2].
We summarize the constraint matrix of the EP of vETFL in Fig. S2 Example EP
constraint matrix, where each line represents a type of constraint and each column
represents a type of variable. The blocks of the matrix that are non-zero are colored,
and these blocks directly reflect the involvement of the constrained variables.
1.3.2 Modeling missing enzymes
Although we initially focused on including only enzymes for which we had all the
necessary information (catalytic rate and peptide constitution), we wanted to assess the
behavior of our model when the missing enzymes were modeled as well as check our
model’s sensitivity to changes in the catalytic rate constants. Thus, we additionally
built three more models, based on vETFL, with the following properties: (i) all the
missing enzymes were estimated by averaging the properties of the known enzymes
based on the curation for the vETFL iJO1366 (333 amino acids long, average
$k_{cat}^{\pm} = 172\ \mathrm{s^{-1}}$); (ii) all the enzymes (including the missing enzymes) but the ribosome,
RNA polymerase, and ATP synthase were assumed to have a catalytic rate constant
$k_{cat}^{\pm} = 65\ \mathrm{s^{-1}}$; and, for comparison purposes, (iii) all the known enzymes of vETFL
except for the ribosome, RNA polymerase, and ATP synthase were assumed to have a
catalytic rate constant of $k_{cat}^{\pm} = 65\ \mathrm{s^{-1}}$. $k_{cat} = 65\ \mathrm{h^{-1}}$ is an assumption used in O’Brien
et al. [7] that we reproduce here as a sensitivity study on the quality of the $k_{cat}$
values. We observe that this value matches the median value of the catalytic rate
constants in the 562 vETFL enzymes. For clarity, we will refer to these models as (i) the
estimated mean enzymes model; (ii) the full median model; and (iii) the partial median
model. The ribosome, RNA polymerase, and ATP synthase were not modified, as their
catalytic rates directly and strongly affect the growth of the organism. Any drastic
change in these would make changes related to other enzymes negligible in comparison.
Figure 3b shows a comparison of the growth prediction for the estimated mean
enzymes (brown), full median (pink), and partial median (purple) models designed to
account for the missing enzymes. For a better comparison, we also reproduce the
vETFL results in red on the same graph. The partial median model (purple) shows a
higher predicted growth than the original vETFL model (red). This implies that
limiting enzymes in the original vETFL model have a $k_{cat}$ parameter lower than the
median value. Both models featuring inferred enzymes, the full median model (pink)
and estimated mean enzymes model (brown), show a lower growth rate at a given
uptake. This is expected, as fluxes that previously had no enzymes assigned in vETFL
are now subject to catalytic constraints, and thus the models are more constrained.
Additionally, we observe that the model with estimated mean enzymes (brown) is also
below the full median model (pink). Similarly to vETFL and the partial median model,
this shows that the limiting enzymes in the model with estimated mean enzymes have a
$k_{cat}$ parameter lower than the median value. Finally, we observe that the differences
between these four models only appear at glucose uptake rates higher than
$\approx 6\ \mathrm{mmol_{glc}.gDW^{-1}.h^{-1}}$, when the problem switches from being stoichiometry-limited to
proteome-limited. Thus, this experiment also illustrates the importance of well-curated
catalytic rate constants for modeling organisms grown in proteome-limited regimens.
[Figure 3: two panels, a) and b). x-axis: glucose uptake [mmol/(gDW.h)]; y-axis: growth rate [h^-1]. Legend of panel a): EFL, ETFL, vEFL, vETFL. Legend of panel b): vETFL; vETFL, mean inferred enzymes; vETFL, full median; vETFL, partial median.]
Fig 3. Growth rate with respect to glucose uptake for differently constrained models in the ETFL framework. a. Growth rate predictions using the [v]E[T]FL models with $k_{cat}$ values aggregated from the reaction database SabioRK [18]. b. Growth rate predictions accounting for missing enzymes using vETFL and models (i)-(iii) representing different initial enzyme assumptions, with $k_{cat}$ values obtained from SabioRK or $k_{cat} = 65\ \mathrm{h^{-1}}$, and with/without inferred enzymes.
These results demonstrate the capability of ETFL to predict different phenotypes
depending on growth rate. ETFL is also amenable to hypothesis testing, as evidenced
using the models that estimate the missing enzymes. In particular, we showed with
ETFL that an uptake increase does not yield a proportional growth rate increase as
with FBA and that ETFL provides a maximal uptake rate that is unmodeled in FBA,
thus more effectively modeling growth-dependent biomass yield in E. coli. This allows
for more realistic predictions for phenotypes that are limited by the expression
capabilities of the cell and captures the variability of the biomass composition in
different growth regimens.
1.3.3 Variability analysis
It is also possible to subject the model to a range of variability analyses. These are
routinely used in FBA to assess the flexibility of the system and in TFA to find the
ranges of allowed metabolite concentrations. We can extend their use in ETFL to
explore the allowed proteome and transcriptome. For example, we measured the
admissible extreme concentrations of each mRNA in aerobic growth conditions as
described in McCloskey et al. [19] by performing a variability analysis on the mRNA
concentration variables. Fig. 4 depicts the admissible peptide concentration upper and
lower bounds, sorted by average, for vETFL with a glucose uptake set to
$12.5\ \mathrm{mmol.gDW^{-1}.h^{-1}}$, which yields a proteome-limited phenotype, according to our
results in Fig 3a. It is important to note that all peptides with a non-zero minimal
concentration (most of the left of the figure) are, by definition, essential peptides: these
are always present at this uptake rate and are hence necessary for the cell to grow at an
optimum growth rate. The same study can be performed for enzyme concentrations or
even metabolite log-concentrations for models with thermodynamics. This type of study
is useful for comparing how the model performs in relation to actual proteomics,
transcriptomics, or metabolomics data. The method for running these other types of
variability analyses is exactly the same; only the variables subject to the variability
[Figure 4: y-axis: peptide concentration [mmol/(gDW.h)], logarithmic scale from 10^-9 to 10^0. Annotations: dummy peptide, ribosomal peptides, essential peptides.]
Fig 4. Concentration variability of peptide species, sorted by average peptide concentration (darker disc). Lower bounds that were 0 were set to the accuracy of the solver, $10^{-9}$. The horizontal line on the left side of the figure represents ribosomal peptides, which is narrow due to their instrumental role in making the tightly constrained amount of protein in the cell at a given growth rate. The vertical line in the middle represents the dummy peptide, which accounts for unmodeled peptides (non-metabolic proteins and enzymes with missing information) and therefore is used by the solver as a slack.
analysis are changed.
A specific usage of a variability analysis is the study of the allowed proteome (resp.
transcriptome) that is done by performing a variability analysis on the enzyme (mRNA)
concentration variables. This type of study can, for instance, be compared with
transcriptomics to check if the expression profile of an engineered strain corresponds to
what is expected in its corresponding model. A way to visualize the average allowed
proteome (transcriptome) is to use the average value of the variability of each enzyme
(mRNA) concentration as a feasible observation. Due to the convexity of the solution
space, it is a solution to the problem. This observation is then plotted on a finite area,
which can be done using the online software Proteomaps [20,21]. This method and
software are often used by biologists to represent protein abundances in the cell, and
using the data from ETFL, we can generate similar comparative graphs that can help
biologists analyze the variability in the different concentration variables using a
visualization they are familiar with.
Fig. 5 is an example of such a representation, graphed using the mRNA
concentrations corresponding to the solution represented by the dark dots in Fig 4 as an
input. In this figure, mRNAs are clustered using KEGG Gene Ontology (GO)
annotations. GO annotations form a tree describing the physiological role of genes,
ranging from the least specific (e.g. general metabolism) to most specific (e.g. araH
gene). The area of each (sub)cluster is proportional to the relative abundance of each
(sub)group of mRNAs.
We used the mean of the variability analysis as the observation rather than a single
optimal solution because the optimality principle in LP only guarantees a unique global
optimum value and not a unique optimal solution. Moreover, solver heuristics give
Fig 5. Proteomaps of the predicted abundances of mRNAs. Each colored patch represents a different mRNA in proportion to its relative abundance. Genes can be clustered using KEGG Gene Ontology (GO) terms.
sparse and extreme results (corners of the explored simplex), which do not accurately
represent the full extent of the considered solution space.
1.3.4 Essentiality analysis
The ETFL framework can also analyze the essentiality of specific genes by performing
single gene knockouts. The growth of models with knocked-out genes can then be
compared to experimental data to assess the quality of the model as a validation. We
performed this analysis and compared it to the results reported in the publication of
iJO1366 by Orth et al. [12]. We use the Matthews correlation coefficient (MCC) as a
metric for the quality of the prediction, which is preferred over accuracy as it is not
sensitive to the imbalance between the number of essential genes and non-essential genes.
The MCC reads like a usual correlation coefficient, with 1 being a perfect correlation, -1
perfect anti-correlation, and 0 no correlation. We used the essentiality data and
conventions given in the supplementary material of Orth et al. [12], as explained in
Fig. 6-a and Fig. 6-b. Knockouts in ETFL are simulated by setting the corresponding
[Figure 6: confusion matrices, rows = prediction, columns = experiment; color scale: % true prediction, from 0 to 100.]

  a) Conventions                    Experiment
     Prediction                     Essential   Non-essential
     Essential                      TN          FN
     Non-essential                  FP          TP

  b) iJO1366                        Essential   Non-essential
     Essential                      168         39
     Non-essential                  80          1079

  c) vETFL                          Essential   Non-essential
     Essential                      136         35
     Non-essential                  112         1083

  d) vETFL, kcat = 65 h^-1          Essential   Non-essential
     Essential                      130         31
     Non-essential                  118         1087

Fig 6. a. Conventions from Orth et al. [12] for gene essentiality. TN is True Negative. FN is False Negative. FP is False Positive. TP is True Positive. The color shading represents how good the classification is in the experimental class. Perfect classification should have a strict red first diagonal, as shown on this example. b. Gene essentiality prediction for FBA model iJO1366, yielding a Matthews correlation coefficient (MCC) of 0.69. c. Gene essentiality prediction for vETFL model, yielding a MCC of 0.60. d. Gene essentiality prediction for vETFL model with all $k_{cat} = 65\ \mathrm{h^{-1}}$, yielding a MCC of 0.59.
transcription reaction rate to 0, as opposed to FBA, where knockouts derive from
Gene Protein association Rules (GPR), which are Boolean rules that directly set
metabolic reaction rates to 0 if the knocked-out gene is essential to this reaction. The
difference is that in ETFL, the solver actually sees the knock-out as a part of the model,
while FBA changes the model to accommodate the knock-out. This feature can be used
in strain design strategies to optimize directly for knock-outs. The results are presented
in Fig. 6-c and 6-d, respectively for vETFL and the partial median model.
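The MCC values quoted in Fig 6 can be recomputed from the reported confusion matrices. The short Python sketch below uses the standard MCC formula with the counts of Fig 6-b to 6-d; the function and variable names are ours and are not part of the ETFL package.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # usual convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Counts from Fig 6, using the conventions of panel a
# (a "positive" is a non-essential gene, i.e. growth after knockout).
confusion = {
    "iJO1366": dict(tn=168, fn=39, fp=80, tp=1079),
    "vETFL": dict(tn=136, fn=35, fp=112, tp=1083),
    "vETFL, kcat = 65 h-1": dict(tn=130, fn=31, fp=118, tp=1087),
}

mcc = {name: matthews_corrcoef(**counts) for name, counts in confusion.items()}
for name, value in mcc.items():
    print(f"{name}: MCC = {value:.2f}")
```

Because the MCC uses all four cells of the matrix, the large number of correctly predicted non-essential genes does not mask the misclassified essential genes, as plain accuracy would.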
We observe that, compared to iJO1366, (i) ETFL presents more false positives
(experimentally essential genes predicted as non-essential); and (ii) ETFL predicts fewer
false negatives (experimentally non-essential genes predicted as essential). This first
point might indicate that ETFL is less constrained than iJO1366: the cell has more
genetic alternatives for growth. This is an artificial effect that likely stems from the
missing enzyme data, because if a reaction depends on a missing enzyme, by default
there will be no coupled expression and hence no impact if the related gene is knocked
out. As more enzyme data is added to the model, the false positive rate should decrease.
The second point indicates that ETFL better captures the genes that are not essential
for growth than iJO1366. This is an expected effect of explicit expression coupling, from
transcription to enzyme concentration, as opposed to Boolean gene reaction rules.
1.3.5 Sampling
Sampling the feasible solution space of FBA is a common way to study solution
robustness and variability. Since there are often multiple FBA solutions at the optimal
objective value, representative solutions are often sought, and sampling is one way to
obtain them. However, because ETFL contains integer variables, it is not compatible
with traditional sampling methods in its current formulation. It is possible, though, to
make the model convex, and hence amenable to sampling, by fixing the integers to their
values at a given growth rate and, if applicable, TFA directionality. This will block the
flux directions as well as the growth-dependent parameters if TFA is performed. The
resulting model is then solely linear, and sampling can be performed with traditional
techniques, such as artificially centered hit and run (ACHR) [22], gpSampler [23], or
optGpSampler [24]. Once it has converged, a sampling should provide a better
representation of the center of the solution space than the mean of the variability
analysis.
1.4 Performance
ETFL relies on solver-specific MILP algorithms and heuristics, which also means that
great variability in performance can be observed depending on the solver parameters.
We provide tuned presets for different tasks (gene knock-out, variability analysis,
growth maximization) with the package, and recommend that users run their own solver
tuning if long run times are observed. We observed up to a 10× increase in
performance using such tuning.
1.5 Adaptation of FBA-based methods to ETFL
The ETFL formulation is amenable to further kinds of analyses. Leveraging both the
explicit expression constraints and the MILP nature of the problem, we present several
possibilities for future studies using ETFL:
Growth-dependent parameters. It has been reported that several other
parameters, such as the ribosome transcription rate constant $k_{rib}$, are growth
dependent [11]. Although such dependency is not taken into account in the presented
results, it is possible to account for this by (i) discretizing $k_{rib}$ following the method
used in 1.2.4, and (ii) using Petersen’s linearization scheme from 1.2.3 on the product
$k_{rib} \ast E_{rib}$. Other parameters that can be transformed in this way include, but are not
limited to, the RNAP transcription rate constant $k_{trans}$, free ribosomes, and the RNAP
ratios $\rho$ and $\pi$.
Omics integration. Explicit mRNA and enzyme concentrations allow the direct
integration of absolute or relative proteomics and transcriptomics by changing the
bounds of the corresponding variables in the EP. An additional gauge constraint will be
needed for relative data. Previous transcriptomic integration methods, such as
REMI [25], iMAT [26], GIMME [27], or MINEA [28], can also be adequately
reformulated for ETFL. Metabolomics can still be integrated using TFA [2,3].
Minimization of adjustment. In the original paper, the hypothesis behind the
Minimization of Metabolic Adjustment (MOMA) method is that the metabolic fluxes of
an organism subject to a gene knock out show a minimal change compared to the
metabolic fluxes of the wild-type organism [29]. The underlying hypothesis is that the
enzyme distribution and assignments remain the same except for the knocked-out gene.
With ETFL, it is possible to directly compute a Minimization of Protein Adjustment
(MOPA) by reformulating the objective function as a Minimization of Expression
Adjustment (MOXA):
\[
\min \sum_j \left\| E_j - E_j^0 \right\|_p, \qquad p \in \{1, 2\} \tag{MOPA}
\]
where $\|\cdot\|_p$ is either the Manhattan norm ($p = 1$, $\ell_1$-norm) or the Euclidean norm ($p = 2$, $\ell_2$-norm), the latter of which will require a MIQP solver. In the same fashion, it is also possible to formulate a (weighted) Minimization of mRNA Adjustment (MORA) or even a Minimization of eXpression Adjustment (MOXA) using the following formulations:
\[
\min \sum_l \left\| F_l - F_l^0 \right\|_p, \qquad p \in \{1, 2\} \tag{MORA}
\]
\[
\min\ \theta \cdot \sum_j \left\| E_j - E_j^0 \right\|_p + (1 - \theta) \cdot \sum_l \left\| F_l - F_l^0 \right\|_p, \qquad p \in \{1, 2\},\ \theta \in [0, 1] \tag{MOXA}
\]
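To make the three objectives concrete, the following Python sketch evaluates the MOPA, MORA, and MOXA objective values for a given candidate solution. It only computes the objective and does not solve the optimization problem. We assume here that the norm is taken over the vector of concentration differences; the function names and the numerical concentrations are hypothetical.

```python
import math


def adjustment(values, reference, p=1):
    """l_p distance between a concentration vector and its reference (p = 1 or 2)."""
    if p not in (1, 2):
        raise ValueError("p must be 1 (Manhattan) or 2 (Euclidean)")
    if len(values) != len(reference):
        raise ValueError("values and reference must have the same length")
    diffs = [abs(x - x0) for x, x0 in zip(values, reference)]
    if p == 1:
        return sum(diffs)
    return math.sqrt(sum(d * d for d in diffs))


def moxa_objective(enzymes, enzymes_ref, mrnas, mrnas_ref, theta, p=1):
    """Weighted sum of the protein (MOPA) and mRNA (MORA) adjustments."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    mopa = adjustment(enzymes, enzymes_ref, p)
    mora = adjustment(mrnas, mrnas_ref, p)
    return theta * mopa + (1.0 - theta) * mora


# Hypothetical wild-type (reference) and knock-out concentrations.
E0, E = [1.0, 1.0, 0.5], [1.0, 2.0, 0.0]
F0, F = [0.2, 0.3], [0.1, 0.3]

mopa_l1 = adjustment(E, E0, p=1)
mopa_l2 = adjustment(E, E0, p=2)
moxa_l1 = moxa_objective(E, E0, F, F0, theta=0.5, p=1)
print(mopa_l1, mopa_l2, moxa_l1)
```

Setting theta to 1 recovers MOPA and setting it to 0 recovers MORA, so a single routine covers all three objectives.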
Parsimonious analysis. Parsimonious FBA (pFBA) [23] was developed to address
the high fluxes of some of the solutions given by FBA. Although this concern is
addressed in ETFL by the combined actions of the EP and thermodynamics, pFBA can
be adapted to ETFL to study an organism under parsimonious constraints. For
example, it is possible to reformulate it into a parsimonious expression problem to find
the minimal expression level required to meet a growth target using objective functions
similar to MOPA, MORA, and MOXA. It is also possible to turn the problem around to
consider the allowed enzyme amounts under minimal flux constraint obtained by pFBA
to assess the metabolic flexibility of an organism.
Dynamic ETFL (dETFL). Dynamic FBA (dFBA) [30] is a method that uses FBA
to predict the dynamics of a biological system represented with a stoichiometric model.
In its original static optimization approach (SOA) formulation, an FBA problem is solved
at each time step. The values of the boundary fluxes of the FBA problem are updated at
each iteration with values produced with a kinetic law, such as Michaelis-Menten
glucose uptake and oxygen diffusion. Because ETFL allows direct access to enzyme
concentrations, it is possible to use the latter to reformulate dFBA in its SOA. The
original SOA uses ad-hoc constraints on the absolute flux change at each time
step. However, in ETFL, it is possible to bound flux changes indirectly by bounding
enzyme and mRNA concentration changes in the EP. Effectively, this approach allows
the movement from ad-hoc constraints to physiological constraints.
Use in kinetic frameworks. Often, kinetic frameworks require a reference flux
distribution as an input. ETFL can provide such a distribution, with an increased
accuracy as compared to FBA.
2 Conclusions
ETFL is a framework that can implement expression and thermodynamic formalisms using
mainstream MILP solvers. This could not be previously accomplished using
state-of-the-art ME-models, which use specialized quad-precision solvers and do not
support integer variables. The formalism itself is based on the explicit and direct
relationship with the underlying biochemistry and provides a way to incorporate
growth-dependent variables using MILP linearization techniques. These new
growth-dependent variables provide a finer modeling of expression because they consider
phenotypic differences in different growth regimens, which is key for accurate modeling.
ETFL can also compute explicit mRNA and enzyme concentrations as well as perform
direct -omics data integration. In this, ETFL complements and extends FBA
capabilities by using explicit relationships in lieu of the typical assumptions on the
relationships between the transcriptome, proteome, and fluxome. This explicit
accounting of expression mechanisms provides a finer level of control and a more
relevant prediction of gene-editing outcomes. Because of this and its operational
similarity with classic FBA-related analyses, ETFL can be efficiently integrated in
standard model-based pipelines. For example, metagenome-based genome-scale
reconstructions such as those published by Magnúsdóttir et al. [1] can be directly fed to the
framework to generate one model for each of the 773 bacteria they identified. In a
more general way, ETFL can assess the allowed expression profiles of any biological
system amenable to genome-scale modeling, such as the metabolic engineering of
biocatalysts, microbial communities, drug design, or personalized medicine.
3 Materials and Methods
3.1 Formulation of the expression problem
Several parameter values were taken from the BioNumbers database [31], and when
used, their identification number as well as the original source from which the value was
reported are specified.
3.1.1 Preliminaries, Conventions, and Notations
We will write the mass balances for the macromolecules with included concentration
variables. Because we assume the cell is growing at a growth rate $\mu$, we must assume
that the volume in which the mass balance is calculated varies.
The time derivative of the concentration of a macromolecule $G$ of concentration $C_G$
in the volume $V$, for a total mass $m_G$ in the cell, produced at a rate $v_G^{syn}$ and degraded
at a rate $v_G^{deg}$, will be written:
\begin{align*}
\frac{dm_G}{dt} &= C_G \frac{dV}{dt} + V \frac{dC_G}{dt} \tag{14} \\
                &= v_G^{syn} \cdot V - v_G^{deg} \cdot V. \tag{15}
\end{align*}
We can combine equations 14 and 15 and divide by $V$ (necessarily non-zero) to write
the time derivative of the concentration $C_G$:
\[
\frac{dC_G}{dt} = v_G^{syn} - v_G^{deg} - \frac{1}{V}\frac{dV}{dt} \cdot C_G. \tag{16}
\]
By definition, $\frac{1}{V}\frac{dV}{dt} = \mu$ is the growth rate of the cell, and we call the term $\mu \cdot C_G$
the dilution term, or $v_G^{dil}$, as per Fredrickson’s work on formulating growth models [32].
It is a common assumption that the concentrations inside the cell remain time invariant
(quasi-steady state assumption), effectively yielding the constraint:
\[
v_G^{syn} - v_G^{deg} - \frac{1}{V}\frac{dV}{dt} \cdot C_G = 0. \tag{17}
\]
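The quasi-steady-state balance of Eq. 17 can be checked numerically. The sketch below, written with hypothetical function names and parameter values, computes the synthesis rate needed to hold a macromolecule concentration constant against degradation and growth dilution, and verifies that the time derivative of Eq. 16 then vanishes.

```python
def dilution_rate(mu, concentration):
    """Dilution term v_dil = mu * C_G caused by the growing cell volume."""
    return mu * concentration


def concentration_derivative(v_syn, v_deg, mu, concentration):
    """Right-hand side of Eq. 16: dC_G/dt = v_syn - v_deg - mu * C_G."""
    return v_syn - v_deg - dilution_rate(mu, concentration)


def steady_state_synthesis(v_deg, mu, concentration):
    """Synthesis rate that satisfies Eq. 17 (dC_G/dt = 0)."""
    return v_deg + dilution_rate(mu, concentration)


# Hypothetical values: mu in h^-1, concentration in mmol/gDW, rates in mmol/(gDW.h).
mu = 0.5
c_g = 2.0e-3
v_deg = 1.0e-4

v_syn = steady_state_synthesis(v_deg, mu, c_g)
residual = concentration_derivative(v_syn, v_deg, mu, c_g)
print(f"v_syn = {v_syn:.2e} mmol/(gDW.h), dC/dt = {residual:.1e}")
```

In this example the dilution term is ten times larger than the degradation term, which illustrates why dilution cannot be neglected for stable macromolecules in fast-growing cells.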
It is also understood from the formulation of the FBA that adding a new reaction to
the system, such as:
\[
v_j : \eta_A^j A \rightarrow \eta_B^j B, \tag{18}
\]
results in adding terms to the mass balances of $A$ and $B$:
\begin{align*}
\frac{d[A]}{dt} &= \ldots - \eta_A^j \cdot v_j, \tag{19} \\
\frac{d[B]}{dt} &= \ldots + \eta_B^j \cdot v_j. \tag{20}
\end{align*}
The further extension of this to reactions of $n$ reactants to $m$ products is trivial.
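As an illustration of Eqs. 18 to 20, the following Python sketch keeps the mass balances in a dictionary and shows how adding one reaction contributes a term to the balance of each species it involves. The data layout and function names are ours, chosen for readability rather than taken from any FBA package.

```python
from collections import defaultdict


def add_reaction(mass_balances, name, reactants, products):
    """Add the stoichiometric terms of one reaction to every species it touches.

    reactants and products map species names to positive coefficients eta.
    """
    for species, eta in reactants.items():
        mass_balances[species][name] = mass_balances[species].get(name, 0.0) - eta
    for species, eta in products.items():
        mass_balances[species][name] = mass_balances[species].get(name, 0.0) + eta


def time_derivatives(mass_balances, fluxes):
    """Evaluate d[species]/dt as the sum of eta * v over the reactions involved."""
    return {
        species: sum(eta * fluxes[rxn] for rxn, eta in terms.items())
        for species, terms in mass_balances.items()
    }


balances = defaultdict(dict)
# Reaction v_j: 2 A -> 1 B, carrying a flux of 3 (arbitrary units).
add_reaction(balances, "v_j", reactants={"A": 2.0}, products={"B": 1.0})
derivatives = time_derivatives(balances, {"v_j": 3.0})
print(derivatives)
```

Reactions with n reactants and m products are handled by the same two loops, since each species simply receives its own signed coefficient.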
Finally, we will represent products between a parameter value and a
variable by the symbol “ · ” and products between two variables by the symbol “ ∗ ”.
Hereafter, we propose a detailed top-down approach to formulate the constraints
being built for ETFL, starting from the metabolite network and moving down to RNA
synthesis.
3.1.2 Metabolic reactions
From FBA, the mass-balance relationship for metabolites can be written as:
$$S \cdot v = 0. \tag{FBA}$$
For the rest of the formulation, it is necessary to split the net flux $v_j$ of each reaction into its forward and backward components:
$$v_j = v_j^+ - v_j^-, \quad v_j^+, v_j^- \geq 0. \tag{21}$$
Biochemical reactions are catalyzed by enzymes. Each enzyme ($Enz_j$) of concentration $E_j$ can catalyze a flux $v_j$ subject to the enzyme capacity constraint, which is a function of its forward and backward catalytic rate constants $k_{cat}^{j,+}$ and $k_{cat}^{j,-}$:
$$0 \leq v_j^+ \leq k_{cat}^{j,+} E_j, \tag{22}$$
$$0 \leq v_j^- \leq k_{cat}^{j,-} E_j, \tag{23}$$
which gives the constraints:
$$v_j^+ - k_{cat}^{j,+} E_j \leq 0, \tag{$FC_j$}$$
$$v_j^- - k_{cat}^{j,-} E_j \leq 0. \tag{$BC_j$}$$
The distinction between the bounds of the forward and backward fluxes is important, as some enzymes have different catalytic activities depending on the direction of the flux.
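To make the role of the two catalytic constants concrete, the sketch below checks the constraints $FC_j$ and $BC_j$ for a given net flux. The function names and the numerical values are illustrative assumptions. The split used here is the minimal one, in which at most one of the two components is non-zero; Eq. 21 alone does not impose this.

```python
def split_flux(v_net):
    """Split a net flux into non-negative forward and backward parts (Eq. 21)."""
    return max(v_net, 0.0), max(-v_net, 0.0)


def satisfies_capacity(v_net, enzyme_conc, kcat_fwd, kcat_bwd):
    """Check the capacity constraints FC_j and BC_j for one reaction."""
    v_fwd, v_bwd = split_flux(v_net)
    return (v_fwd - kcat_fwd * enzyme_conc <= 0.0
            and v_bwd - kcat_bwd * enzyme_conc <= 0.0)


# Hypothetical enzyme that is 100 times faster in the forward direction.
forward_ok = satisfies_capacity(5.0, enzyme_conc=1e-3, kcat_fwd=1e4, kcat_bwd=1e2)
backward_ok = satisfies_capacity(-5.0, enzyme_conc=1e-3, kcat_fwd=1e4, kcat_bwd=1e2)
```

In this example the same flux magnitude is admissible in the forward direction but not in the backward direction, because the backward capacity is only 0.1 for the assumed enzyme concentration.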
3.1.3 Enzyme assembly
Each enzyme $Enz_j$ in concentration $E_j$ is subject to mass balance, which can be written:
$$\frac{dE_j}{dt} = v_j^{asm} - v_j^{deg} - v_j^{dil}, \tag{24}$$
which reads under quasi-steady state assumption (QSSA):
$$v_j^{asm} - v_j^{deg} - \mu \ast E_j = 0, \tag{$EB_j$}$$
where $v_j^{asm}$ is the formation rate of the enzyme by the assembly of its constituent peptides, $v_j^{deg}$ is the degradation rate, $v_j^{dil}$ is the dilution rate, and $\mu$ is the growth rate of the cell. The formation rate of the enzyme describes the assembly of free peptides, so it is necessary to add the peptide assembly reaction to the stoichiometric matrix:
$$v_j^{asm} : \sum_l \eta_l^j \cdot Pep_l \rightarrow Enz_j, \tag{25}$$
where $\eta_l^j$ is the stoichiometric coefficient of peptide $Pep_l$ for the formation of the complex of enzyme $Enz_j$.
3.1.4 Peptide synthesis
The synthesis of peptides consumes charged tRNAs, which are uncharged by the ribosome during peptide synthesis. The process consumes 2 GTP and releases 2 GDP and 2 Pi per amino acid:
$$v_l^{tsl} : \sum_i \eta_{aa_i}^l \cdot tRNA_{aa_i}^{charged} + 2 L_l^{aa} \cdot (GTP + H_2O) \rightarrow Pep_l + \sum_i \eta_{aa_i}^l \cdot tRNA_{aa_i}^{uncharged} + 2 L_l^{aa} \cdot (GDP + Pi + H^+), \tag{26}$$
where $aa_i$ denotes the $i$-th amino acid, $\eta_{aa_i}^l$ its relative stoichiometric coefficient (count) in the sequence of $Pep_l$, $tRNA_{aa_i}^{\ast}$ the (un)charged tRNAs for each amino acid, and $L_l^{aa} = \sum_i \eta_{aa_i}^l$ is the length of the amino acid sequence of $Pep_l$.
As explained in section 1.1, this reaction adds a supplementary term in the mass balances of the metabolites (GTP, GDP, Pi, H$_2$O, H$^+$), the peptide, and the tRNAs (see 3.1.9 for the latter). This term is what connects the expression requirements to the metabolic network defined in the FBA.
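The stoichiometry of Eq. 26 follows directly from the peptide sequence. The following Python sketch derives it from a one-letter amino acid string; the function name and the species identifiers (`gtp`, `trna_charged_K`, ...) are hypothetical placeholders rather than identifiers used in ETFL.

```python
from collections import Counter


def translation_stoichiometry(peptide_sequence):
    """Signed stoichiometry of the translation reaction (Eq. 26) for one peptide.

    Negative coefficients are consumed, positive ones are produced.
    """
    counts = Counter(peptide_sequence)   # eta^l_{aa_i}: count of each amino acid
    length = len(peptide_sequence)       # L^aa_l, the sum of all counts
    stoich = {
        "gtp": -2 * length, "h2o": -2 * length,            # 2 GTP + 2 H2O per amino acid
        "gdp": 2 * length, "pi": 2 * length, "h": 2 * length,
        "peptide": 1,
    }
    for amino_acid, count in counts.items():
        stoich["trna_charged_" + amino_acid] = -count      # charged tRNA consumed
        stoich["trna_uncharged_" + amino_acid] = count     # uncharged tRNA released
    return stoich


# Hypothetical four-residue peptide.
s = translation_stoichiometry("MKKL")
```

Each entry of the returned dictionary corresponds to one coefficient added to the column of $v_l^{tsl}$ in the stoichiometric matrix.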
In particular, the peptide concentrations obey the mass-balance equation:
$$\frac{d\,Pep_l}{dt} = v_l^{tsl} - \sum_j \eta_l^j \cdot v_j^{asm} - v_l^{deg} - v_l^{dil}. \tag{27}$$
We assume in the current model that the assembly rates are much faster than dilution and degradation, and thus simplify this mass balance to:
$$\frac{d\,Pep_l}{dt} = v_l^{tsl} - \sum_j \eta_l^j \cdot v_j^{asm}, \tag{28}$$
which, under QSSA, can be written:
$$v_l^{tsl} - \sum_j \eta_l^j \cdot v_j^{asm} = 0. \tag{$PB_l$}$$
In this context, the peptides are treated just like regular metabolites in the system. This assumption in $PB_l$ can be relaxed without a loss of generality by introducing a dilution and a degradation term, thus introducing a bilinearity.
Finally, we must take into account the actual degradation reactions. We add the ideal degradation reaction to the system:
$$v_j^{deg} : Enz_j \rightarrow \sum_i \eta_{aa_i}^j \cdot aa_i + L_j^{aa} \cdot H_2O. \tag{29}$$
For this, the degradation rate is known:
$$v_j^{deg} - k_{deg}^j \cdot E_j = 0, \tag{$ED_j$}$$
where $k_{deg}^j$ is the degradation rate constant of the enzyme.
3.1.5 Ribosome synthesis and utilization
The peptides are the product of a translation reaction that is catalyzed by a ribosome. As we did with the catalytic constraints for general biochemistry reactions, we can apply the ribosome maximum catalytic rate as an upper bound to its translation rate $v_l^{tsl}$:
$$v_l^{tsl} - \frac{k_{cat}^{rib}}{L_l^{aa}} R_l \leq 0, \tag{$TR2_l$}$$
where $k_{cat}^{rib}$ is the maximum ribosomal translation rate constant (10 to 12 aa·s$^{-1}$ for E. coli, BioNumbers ID [BNID] 100059 [33]), $L_l^{aa}$ is the amino acid length of the peptide $l$, and $R_l$ is the quantity of ribosomes assigned to the translation of this peptide. This way, the ratio $R_l/Pep_l$ is effectively the average polysome size translating the peptide $l$. Like any other enzyme, ribosomes satisfy the mass balance:
$$v_{rib}^{asm} - v_{rib}^{deg} - \mu \ast E_{rib} = 0. \tag{$EB_{rib}$}$$
$E_{rib}$ denotes the total quantity of ribosomes in a cell. It accounts for $R_l$, the ribosomes assigned to the translation of $Pep_l$, as well as the free ribosomes in the cell, $R_F$. We can then write the total ribosome capacity constraint:
$$\sum_l R_l + R_F - E_{rib} = 0. \tag{TC2}$$
If we know the fraction $\rho$ of occupied ribosomes, we can enforce it:
$$R_F - (1 - \rho) E_{rib} = 0. \tag{RR}$$
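The ribosome constraints can be evaluated by hand for a small example. The sketch below computes the translation bound of $TR2_l$ and the free ribosome pool implied by TC2; the function names and the numbers are illustrative assumptions, with $k_{cat}^{rib}$ chosen within the range quoted above.

```python
def max_translation_rate(kcat_rib, peptide_length, ribosomes_assigned):
    """Upper bound on v_l^tsl from constraint TR2_l: (kcat_rib / L_l^aa) * R_l."""
    return kcat_rib / peptide_length * ribosomes_assigned


def free_ribosomes(total_ribosomes, assigned):
    """R_F from the total capacity constraint TC2: sum_l R_l + R_F = E_rib."""
    free = total_ribosomes - sum(assigned)
    if free < 0:
        raise ValueError("more ribosomes assigned than available")
    return free


# Hypothetical case: 12 aa/s, a 300-residue peptide, 50 ribosome units assigned.
bound = max_translation_rate(kcat_rib=12.0, peptide_length=300, ribosomes_assigned=50.0)
r_free = free_ribosomes(total_ribosomes=100.0, assigned=[50.0, 30.0])
```

Longer peptides lower the bound proportionally, since each ribosome needs $L_l^{aa} / k_{cat}^{rib}$ seconds to complete one peptide.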
Finally, the ribosome differs from other enzymes in that it takes ribosomal peptides $rPep_l$ as well as ribosomal RNA $rRNA_l$ for its assembly. Hence, its assembly reaction is:
$$v_{rib}^{asm} : \sum_l \eta_{rPep_l}^{rib} \cdot rPep_l + \sum_l \eta_{rRNA_l}^{rib} \cdot rRNA_l \rightarrow Rib. \tag{30}$$
As explained earlier, the stoichiometric coefficients $\eta_{\star}^{rib}$ will appear in the mass balances of each of the compounds of the reaction.
3.1.6 RNA Polymerase
Transcription is catalyzed by RNA polymerase (RNAP). For the transcription of each mRNA, we can put an upper bound on the transcription rate $v_l^{tcr}$ in the same way as for translation:
$$v_l^{tcr} - \frac{k_{cat}^{RNAP}}{L_l^{nt}} P_l \leq 0, \tag{$TR1_l$}$$
where $L_l^{nt}$ is the length in nucleotides of the mRNA sequence, $k_{cat}^{RNAP}$ is the catalytic rate constant of RNAP (85 nt·s$^{-1}$ for E. coli, BNID 100060 [33]), and $P_l$ the concentration of RNAP assigned to the transcription of this mRNA.
RNAP is an enzyme, and hence it also satisfies mass balance:
$$v_{RNAP}^{asm} - v_{RNAP}^{deg} - \mu \ast E_{RNAP} = 0, \tag{$EB_{RNAP}$}$$
where $E_{RNAP}$ is the total amount of RNAP, which also accounts for free RNAP $P_F$, and follows:
$$\sum_l P_l + P_F - E_{RNAP} = 0. \tag{TC1}$$
As we did with the ribosomes, if we know the fraction of occupied RNAP, $\pi$, we can enforce it:
$$P_F - (1 - \pi) E_{RNAP} = 0. \tag{PR}$$
3.1.7 mRNA synthesis and mass balance
During translation, an mRNA is read to produce a peptide. mRNAs are subject to the same mass-balance constraints:
$$v_l^{tcr} - v_l^{deg} - \mu \ast F_l = 0, \tag{$MB_l$}$$
where $F_l$ is the total concentration of the $l$-th mRNA ($mRNA_l$), $v_l^{deg}$ is its degradation rate, and $v_l^{tcr}$ is its transcription (synthesis) rate. $F_l$ is to $mRNA_l$ what $E_j$ is to $Enz_j$: the concentration variable that represents the macromolecule in the EP. The transcription reaction is modeled as follows:
$$v_l^{tcr} : \eta_A^l \cdot ATP + \eta_U^l \cdot UTP + \eta_C^l \cdot CTP + \eta_G^l \cdot GTP \rightarrow \left(\eta_A^l + \eta_U^l + \eta_C^l + \eta_G^l\right) PPi + mRNA_l. \tag{31}$$
Again, the stoichiometric coefficients will appear in the mass balances of each of the metabolites and macromolecules involved.
We must also take into account the relationship between ribosome assignment and mRNA concentration. On each strand of $mRNA_l$, there can be only a finite number $\rho_l$ of ribosomes translating at the same time. This number is given by the ratio of the length of the mRNA strand $L_l^{nt}$ to the footprint size of the ribosome $L_{rib}^{nt}$:
$$\rho_l = \frac{L_l^{nt}}{L_{rib}^{nt}}. \tag{32}$$
For E. coli, $L_{rib}^{nt}$ is approximately 20 nm (BNID 102320 [34], 100121 [35]), which amounts to approximately 60 base pairs (the length of a nucleotide is approximately 0.3 nm; BNID 103777 [36]). From there we can get the additional constraint:
$$R_l \leq \rho_l \cdot F_l, \tag{33}$$
$$R_l - \frac{L_l^{nt}}{L_{rib}^{nt}} F_l \leq 0. \tag{$EX_l$}$$
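As a worked example of Eq. 32 and constraint $EX_l$, the sketch below computes the maximal ribosome load of an mRNA and checks the coupling constraint. The function names are hypothetical; the default footprint of 60 nucleotides is the approximate value quoted above.

```python
def max_ribosomes_per_mrna(mrna_length_nt, footprint_nt=60):
    """rho_l = L_l^nt / L_rib^nt (Eq. 32)."""
    return mrna_length_nt / footprint_nt


def satisfies_expression_coupling(ribosomes_assigned, mrna_conc,
                                  mrna_length_nt, footprint_nt=60):
    """Check constraint EX_l: R_l - rho_l * F_l <= 0."""
    rho = max_ribosomes_per_mrna(mrna_length_nt, footprint_nt)
    return ribosomes_assigned - rho * mrna_conc <= 0.0


# Hypothetical 900-nt mRNA: at most 15 ribosomes per transcript.
rho_l = max_ribosomes_per_mrna(900)
within_limit = satisfies_expression_coupling(10.0, mrna_conc=1.0, mrna_length_nt=900)
over_limit = satisfies_expression_coupling(20.0, mrna_conc=1.0, mrna_length_nt=900)
```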
Finally, we must take into account the actual degradation reactions. We consider perfect degradation for mRNAs:
$$v_l^{deg} : mRNA_l \rightarrow \eta_A^l \cdot AMP + \eta_U^l \cdot UMP + \eta_C^l \cdot CMP + \eta_G^l \cdot GMP. \tag{34}$$
And, again, we know the degradation rates:
$$v_l^{deg} - k_{deg}^l \cdot F_l = 0. \tag{$MD_l$}$$
3.1.8 rRNA synthesis and mass balance
rRNAs are used in the ribosome assembly reaction. According to the definition of $v_{rib}^{asm}$ in 3.1.5, their mass balance can be written:
$$\frac{d[rRNA_l]}{dt} = 0 = v_{rRNA_l}^{tcr} - v_{rib}^{asm} - v_{rRNA_l}^{deg} - v_{rRNA_l}^{dil}. \tag{35}$$
We neglect their dilution and degradation under the hypothesis that free rRNAs are scarce and stable [37]. Thus, their mass balance in the model reads:
$$v_{rRNA_l}^{tcr} - v_{rib}^{asm} = 0. \tag{$RB_{rRNA_l}$}$$
3.1.9 tRNA synthesis and mass balance
Charged tRNAs are produced by a charging reaction and consumed by peptide synthesis. We use the following charging reaction:
$$v_{aa_i}^{charging} : aa_i + tRNA_{aa_i}^{uncharged} + ATP + 2 H_2O \rightarrow tRNA_{aa_i}^{charged} + AMP + 2 H^+. \tag{36}$$
Once again, the stoichiometric coefficients of each reactant will appear in the stoichiometric matrix in the column corresponding to this reaction.
Since tRNAs are relatively stable molecules [37], we neglect their degradation. Let $T_{aa_i}^u$ (resp. $T_{aa_i}^c$) represent the concentration of $tRNA_{aa_i}^{uncharged}$ (resp. $tRNA_{aa_i}^{charged}$). Then, we can write the following mass balances:
$$\frac{dT_{aa_i}^u}{dt} = 0 = -v_{aa_i}^{charging} + \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \mu \ast T_{aa_i}^u, \tag{37}$$
$$\frac{dT_{aa_i}^c}{dt} = 0 = v_{aa_i}^{charging} - \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \mu \ast T_{aa_i}^c, \tag{38}$$
which give the constraints:
$$-v_{aa_i}^{charging} + \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \mu \ast T_{aa_i}^u = 0, \tag{$TB_{aa_i}^u$}$$
$$v_{aa_i}^{charging} - \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \mu \ast T_{aa_i}^c = 0. \tag{$TB_{aa_i}^c$}$$
3.2 Thermodynamics-based constraints
Thermodynamics flux analysis (TFA) [2, 3] imposes constraints on an FBA problem to couple reaction directionality to the standard free energy of reactions and metabolite concentrations. We also introduce constraints that couple the sign of the Gibbs energy of a reaction to its directionality through the use of integer variables and a mixed-integer linear coupling formulation. This framework reduces the feasible flux space and improves the predictive power of FBA by removing thermodynamically invalid flux profiles.
Considering that $c_i$ is the concentration of the $i$-th metabolite, we define $C_i$ as its scaled logarithm with respect to $c_0$, where $c_0 = 1$ M in standard conditions:
$$\forall i, \quad C_i = \ln\left(\frac{c_i}{c_0}\right). \tag{39}$$
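We use $\Delta_f G_i^{\prime\circ}$, the standard Gibbs energy of formation in solution of the $i$-th metabolite, to calculate $\Delta_r G_j^{\prime\circ}$, the standard Gibbs energy in solution of the $j$-th reaction, and we obtain the additional variables:
$$C_i^{min} \leq C_i \leq C_i^{max}, \tag{40}$$
$$\Delta_r G_{j,min}^{\prime\circ} \leq \Delta_r G_j^{\prime\circ} \leq \Delta_r G_{j,max}^{\prime\circ}, \tag{41}$$
$$\Delta_r G_{j,min}^{\prime} \leq \Delta_r G_j^{\prime} \leq \Delta_r G_{j,max}^{\prime}. \tag{42}$$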
The concentration variables are bounded by experimental measurements or physiological assumptions, and the standard Gibbs energies are bounded by the measurement or estimation error. Since the net flux of each reaction has already been split between forward flux ($v_j^+$) and backward flux ($v_j^-$) (see Eq. 21), we can directly add the constraints described in [2]:
$$\Delta_r G_j^{\prime} - RT \sum_{i=1}^{m} \eta_i^j C_i - \Delta_r G_j^{\prime\circ} = 0, \tag{43}$$
$$\Delta_r G_j^{\prime} - K + K \cdot b_j^+ \leq 0, \tag{44}$$
$$-\Delta_r G_j^{\prime} - K + K \cdot b_j^- \leq 0, \tag{45}$$
$$v_j^{\pm} - K \cdot b_j^{\pm} \leq 0, \tag{46}$$
$$b_j^+ + b_j^- \leq 1. \tag{47}$$
$R$ denotes the ideal gas constant, $T$ is the temperature in Kelvin, and $\eta_i^j$ represents the stoichiometry of the metabolite $i$ in the reaction $j$. $K$ is a big-M constant (bigger than all upper bounds), and $b_j^{\pm}$ are binary variables. Eq. 43 defines the actual Gibbs energy of the reaction as a function of its standard Gibbs energy and the scaled logarithms of metabolite concentrations. Eq. 44 and Eq. 45 ensure that $b_j^+ = 1 \Rightarrow \Delta_r G_j^{\prime} \leq 0$ and $b_j^- = 1 \Rightarrow \Delta_r G_j^{\prime} \geq 0$. These binary variables are used to block flux in Eq. 46 if the thermodynamics do not favor it. Finally, Eq. 47 is added to enforce that only one direction is chosen.
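The coupling between Gibbs energy and directionality can be checked by hand on a single reaction. The following Python sketch evaluates Eq. 43 and tests Eqs. 44 to 47 for a given choice of the binary variables. The function names, the value of the big-M constant, and the numerical inputs are illustrative assumptions; this is not the pyTFA implementation.

```python
import math

R = 8.314e-3  # ideal gas constant in kJ/(mol*K)


def reaction_gibbs_energy(dg0, stoichiometry, concentrations, temperature=298.15):
    """Eq. 43: dG' = dG'0 + R*T * sum_i eta_i * ln(c_i / c_0), with c_0 = 1 M.

    stoichiometry:  dict metabolite -> signed coefficient (negative = substrate)
    concentrations: dict metabolite -> concentration in mol/L
    """
    log_term = sum(eta * math.log(concentrations[m])
                   for m, eta in stoichiometry.items())
    return dg0 + R * temperature * log_term


def direction_is_allowed(dg, v_fwd, v_bwd, b_fwd, b_bwd, big_m=1e3):
    """Check Eqs. 44-47 for one reaction and one choice of the binaries."""
    return (dg - big_m + big_m * b_fwd <= 0        # Eq. 44
            and -dg - big_m + big_m * b_bwd <= 0   # Eq. 45
            and v_fwd - big_m * b_fwd <= 0         # Eq. 46, forward
            and v_bwd - big_m * b_bwd <= 0         # Eq. 46, backward
            and b_fwd + b_bwd <= 1)                # Eq. 47


# Hypothetical reaction A -> B with dG'0 = -5 kJ/mol and equal concentrations.
dg = reaction_gibbs_energy(-5.0, {"A": -1, "B": 1}, {"A": 1e-3, "B": 1e-3})
forward = direction_is_allowed(dg, v_fwd=10.0, v_bwd=0.0, b_fwd=1, b_bwd=0)
backward = direction_is_allowed(dg, v_fwd=0.0, v_bwd=10.0, b_fwd=0, b_bwd=1)

# With dG'0 = +5 kJ/mol, a large substrate excess still makes dG' negative.
dg_shifted = reaction_gibbs_energy(5.0, {"A": -1, "B": 1}, {"A": 1e-1, "B": 1e-4})
```

The second computation illustrates why concentrations are kept as variables: the sign of the Gibbs energy, and hence the admissible direction, depends on them and not only on the standard value.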
3.3 Petersen linearization
After discretization of the growth rate, the dilution term for the macromolecule $G_\star$ will consist of a sum of products of the binary variables $\delta_s$ and the continuous variable $G_\star$. We can use the Petersen linearization scheme [13] to transform each product into an equivalent system of one new variable and three new constraints:
$$z_\star^s = \delta_s \ast G_\star \iff \begin{cases} G_\star + M \cdot \delta_s - M \leq z_\star^s \leq M \cdot \delta_s, \\ z_\star^s \leq G_\star \end{cases} \iff \begin{cases} G_\star + M \cdot \delta_s - z_\star^s \leq M, \\ z_\star^s - M \cdot \delta_s \leq 0, \\ z_\star^s - G_\star \leq 0. \end{cases} \tag{48}$$
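The equivalence in Eq. 48 can be verified numerically by enumerating candidate values of $z$. The sketch below assumes, in addition to the three constraints, that $z \geq 0$ and $0 \leq G \leq M$ hold as variable bounds; the function names and the grid of candidates are illustrative choices.

```python
def petersen_feasible(z, delta, g, big_m):
    """True if (z, delta, g) satisfies the three linear constraints of Eq. 48."""
    return (g + big_m * delta - z <= big_m
            and z - big_m * delta <= 0
            and z - g <= 0)


def feasible_z_values(delta, g, big_m, candidates):
    """Candidate values of z that satisfy the constraints for fixed delta and g."""
    return [z for z in candidates if petersen_feasible(z, delta, g, big_m)]


# Candidate grid 0.0, 0.1, ..., 10.0 (assumed bound: z >= 0).
candidates = [x / 10 for x in range(0, 101)]
z_when_on = feasible_z_values(delta=1, g=3.0, big_m=10.0, candidates=candidates)
z_when_off = feasible_z_values(delta=0, g=3.0, big_m=10.0, candidates=candidates)
```

For $\delta = 1$ the only admissible value is $z = G$, and for $\delta = 0$ it is $z = 0$, which is exactly the product $\delta \ast G$.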
With this method, we can directly reformulate generalized mass balances as described in Eq. 5 for mRNAs, enzymes, uncharged tRNAs, and charged tRNAs:
$$v_j^{asm} - v_j^{deg} - \frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot z_j^s = 0, \tag{$EB'_j$}$$
$$v_l^{tcr} - v_l^{deg} - \frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot z_l^s = 0, \tag{$MB'_l$}$$
$$-v_{aa_i}^{charging} + \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot z_{aa_i}^{u,s} = 0, \tag{$TB_{aa_i}^{\prime u}$}$$
$$v_{aa_i}^{charging} - \sum_l \eta_{aa_i}^l \cdot v_l^{tsl} - \frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot z_{aa_i}^{c,s} = 0. \tag{$TB_{aa_i}^{\prime c}$}$$
And we get the additional linearization constraints:
$$\frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot \delta_s \leq \bar{\mu}, \tag{GR}$$
$$-\frac{\bar{\mu}}{2N} \leq \frac{\bar{\mu}}{N} \sum_{s=0}^{\lceil \log_2 N \rceil} 2^s \cdot \delta_s - \mu \leq \frac{\bar{\mu}}{2N}. \tag{GC}$$
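To illustrate how a binary expansion of the growth rate turns the bilinear dilution term into a sum of linearizable products, consider the following Python sketch. It assumes that the growth rate is discretized into $N$ intervals up to a maximal value (called `mu_max` here) and encoded with binary digits $\delta_s$; the function names and this exact encoding are our assumptions for illustration and may differ from the implementation in ETFL.

```python
import math


def binary_digits(index, n_bins):
    """Binary digits delta_s (least significant first) of a bin index in [0, n_bins]."""
    n_digits = math.ceil(math.log2(n_bins)) + 1   # s = 0 .. ceil(log2 N)
    return [(index >> s) & 1 for s in range(n_digits)]


def discretized_growth(deltas, mu_max, n_bins):
    """Growth rate encoded by the binaries: (mu_max / N) * sum_s 2^s * delta_s."""
    return mu_max / n_bins * sum(2 ** s * d for s, d in enumerate(deltas))


def discretized_dilution(deltas, conc, mu_max, n_bins):
    """Dilution term (mu_max / N) * sum_s 2^s * z^s, with z^s = delta_s * conc."""
    z = [d * conc for d in deltas]                # products handled by Eq. 48
    return mu_max / n_bins * sum(2 ** s * zs for s, zs in enumerate(z))


# Hypothetical setting: 8 bins up to mu_max = 2.0 1/h, growth rate in bin 3.
deltas = binary_digits(3, n_bins=8)
mu_hat = discretized_growth(deltas, mu_max=2.0, n_bins=8)
dilution = discretized_dilution(deltas, conc=4.0, mu_max=2.0, n_bins=8)
```

The dilution computed from the $z^s$ variables equals the discretized growth rate times the concentration, while only products of a binary and a continuous variable are involved.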
3.4 Data
3.4.1 Degradation rate constants $k_{deg}$ (proteins, mRNA)
mRNA degradation rates were taken from Bernstein et al. [38]. We converted the reported half-lives into rate constants using the classical relationship $k = \frac{\ln(2)}{t_{1/2}}$. Enzyme degradation rates were approximated to 5% per hour, following a report indicating that this rate lies between 2% and 7% per hour [39].
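The half-life conversion can be written as a one-line helper. The sketch below is illustrative: the function name and the 5-minute half-life are assumptions and are not values taken from the cited dataset.

```python
import math


def rate_constant_from_half_life(half_life):
    """First-order rate constant k = ln(2) / t_half, in the inverse unit of the input."""
    if half_life <= 0:
        raise ValueError("half-life must be positive")
    return math.log(2) / half_life


# Hypothetical mRNA with a 5-minute half-life, expressed in hours to obtain k in 1/h.
k_mrna = rate_constant_from_half_life(5 / 60)
```

Expressing the half-life in hours keeps the resulting constant in the same unit as the growth rate, so that degradation and dilution terms can be compared directly.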
3.4.2 Catalytic rate constants $k_{cat}$ (homomers, other complexes)
Catalytic rate constants $k_{cat}^j$ were obtained from Davidi et al. [40] for homomer enzymes. Complex formation reactions for non-homomer enzymes were taken from the supplementary information of O’Brien et al. [7] and Lloyd et al. [8]. EC numbers were obtained from BiGG [41] and the iJO1366 publication [12]. Their corresponding $k_{cat}$ values were assigned using conservative (max) values from SABIO-RK.
3.4.3 Peptide composition of enzymes
Homomer compositions were obtained from Davidi et al. [40]. Other peptide compositions of enzymes were taken from the supplementary information of O’Brien et al. [7] and Lloyd et al. [8]. Additional information was obtained from the MetaCyc/BioCyc database [42, 43] using specialized SmartTables queries [44].
3.5 Modeling
3.5.1 Model modifications
The initial model was subjected to minor changes to accommodate ETFL modeling. In particular, we added:
• Selenocysteine as a metabolite.
• Cysteine to selenocysteine conversion as a pseudo reaction.
• tRNA charging reactions.
We also modified the biomass reaction by removing its nucleotide and amino acid components, since they are already taken into account by the expression problem as explained in 1.1.1.
3.5.2 Enzyme estimation
Given a reaction in the model, if no enzyme is supplied but the reaction possesses a gene reaction rule, it is possible to infer an enzyme from it. The rule expression is expanded, and each term separated by an OR boolean operator is interpreted as an isozyme, while terms separated by an AND boolean operator are interpreted as unit peptide stoichiometric requirements. The enzyme is then assigned an average catalytic rate constant and degradation rate constant.
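The expansion of a gene reaction rule into isozymes can be sketched with a small recursive parser. The code below is a minimal illustration under our own assumptions (rule syntax with `and`, `or`, and parentheses; hypothetical function name); it is not the routine used in ETFL.

```python
import re
from collections import Counter
from itertools import product


def expand_rule(rule):
    """Expand a gene reaction rule into a list of isozymes.

    Each isozyme is a Counter mapping gene id -> number of peptide units.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", rule)
    pos = 0

    def parse_or():
        nonlocal pos
        isozymes = parse_and()
        while pos < len(tokens) and tokens[pos].lower() == "or":
            pos += 1
            isozymes = isozymes + parse_and()   # OR: alternatives are isozymes
        return isozymes

    def parse_and():
        nonlocal pos
        isozymes = parse_atom()
        while pos < len(tokens) and tokens[pos].lower() == "and":
            pos += 1
            right = parse_atom()
            # AND: distribute over OR, adding one unit of each required peptide
            isozymes = [a + b for a, b in product(isozymes, right)]
        return isozymes

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            inner = parse_or()
            pos += 1                            # skip the closing parenthesis
            return inner
        return [Counter({token: 1})]

    return parse_or()


# Hypothetical rule: a two-peptide complex, or a single-peptide isozyme.
isozymes = expand_rule("(b0001 and b0002) or b0003")
```

Each returned composition would then receive the average catalytic and degradation rate constants mentioned above.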
3.5.3 Essentiality analysis
For increased performance, the essentiality analysis was cast into a feasibility problem. We put a lower bound on growth equal to 10% of the predicted ETFL growth and set the objective to 0. With this method, knocking out an essential gene makes the problem infeasible, while knocking out a non-essential gene returns a feasible solution satisfying at least 10% of the growth. This method achieved up to a 5-fold reduction in solving time on the most complex models.
3.6 Computation
3.6.1 Implementation
The code has been implemented as a plug-in to pyTFA [45], a Python implementation of the TFA method. It uses COBRApy [46] and Optlang [47] as a backend to ensure compatibility with several open-source (GLPK, scipy, ...) as well as commercial (CPLEX, Gurobi, ...) solvers. The code is freely available under the Apache 2.0 license at https://github.com/EPFL-LCSB/etfl. We rely on the Python package Biopython [48] for transcribing and translating sequences of nucleotides and amino acids.
3.6.2 Hardware
Computations were done on a 64-bit Ubuntu 18.04.1 LTS (Bionic Beaver) machine with 2 × Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20 GHz (8 cores, 16 threads per socket) and 4 × 16 GB of RAM @ 2400 MHz. Code was run on Python 3.6 in Docker (18.09.0) containers based on the official python 3.6-stretch container, available on the ETFL GitHub.
Supporting information
S1 Nondimensionalization. Derivation and formulation. Details on the variables and constraint transformations to scale the model.
S2 Example EP constraint matrix. Representation of the constraint matrix of the EP for a vETFL of iJO1366. Colored cells represent non-zero blocks. Uncolored cells are zero blocks.
S3 SOP for creating an ETFL model. Tips and Prerequisites. List of required and optional inputs to transform a COBRA model into an ETFL model.
Acknowledgements
The authors would like to thank Kaycie Butler for her valuable input on the wording and structure of this manuscript. This work has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement No 722287.
References
1. Magnúsdóttir S, Heinken A, Kutt L, Ravcheev DA, Bauer E, Noronha A, et al. Generation of genome-scale metabolic reconstructions for 773 members of the human gut microbiota. Nature Biotechnology. 2017;35(1):81.
2. Henry CS, Broadbelt LJ, Hatzimanikatis V. Thermodynamics-based metabolic flux analysis. Biophysical Journal. 2007;92(5):1792–1805. Available from:
10. Ma D, Yang L, Fleming RM, Thiele I, Palsson BO, Saunders MA. Reliable and efficient solution of genome-scale models of Metabolism and macromolecular Expression. Scientific Reports. 2017;7:40863.
11. Neidhardt FC, Curtiss R. Escherichia coli and Salmonella: cellular and molecular biology. vol. 2. Washington, DC: ASM Press; 1999.
12. Orth JD, Conrad TM, Na J, Lerman JA, Nam H, Feist AM, et al. A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Molecular Systems Biology. 2011;7. Available from:
23. Schellenberger J, Que R, Fleming RMT, Thiele I, Orth JD, Feist AM, et al. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0. Nature Protocols. 2011;6(9):1290–1307. Available from:
37. Neidhardt FC. In: The regulation of RNA synthesis in bacteria. vol. 3. Elsevier; 1964. p. 145–181.
38. Bernstein JA, Khodursky AB, Lin PH, Lin-Chao S, Cohen SN. Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(15):9697–9702. Available from: