bioRxiv preprint doi: https://doi.org/10.1101/2020.12.31.425007; this version posted January 3, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Designing novel biochemical pathways to commodity chemicals using ReactPRED and RetroPath2.0

Authors and Affiliations

• Eleanor Vigrass
• M. Ahsanul Islam
• Department of Chemical Engineering, Loughborough University, Loughborough, Leicestershire, LE11 3TU, UK

Corresponding Author

• M. Ahsanul Islam ([email protected])

Abstract

Commodity chemicals are high-demand chemicals used by chemical industries to synthesise countless chemical products of daily use. For many of these chemicals, the main production process uses petroleum-based feedstocks. Concerns over these limited resources and their associated environmental problems, as well as mounting global pressure to reduce CO2 emissions, have motivated efforts to find biochemical pathways capable of producing these chemicals. Advances in metabolic engineering have led to the development of technologies capable of designing novel biochemical pathways to commodity chemicals. The computational software tools ReactPRED and RetroPath2.0 were used to design 49 novel pathways to produce benzene, phenol, and 1,2-propanediol, all industrially important chemicals with limited biochemical knowledge. A pragmatic methodology for pathway curation was developed to analyse the thousands to millions of pathways generated by the software. This method uses publicly accessible biological databases, including MetaNetX, PubChem, and MetaCyc, to analyse the generated outputs and assign EC numbers to the predicted reactions. The workflow described here for pathway generation and curation can be used to develop novel biochemical pathways to commodity chemicals from numerous starting compounds.

Key words: Biochemical pathways, cheminformatics tools, commodity chemicals, ReactPRED, RetroPath2.0, retrosynthesis

Introduction

Commodity chemicals, such as propylene, benzene, and phenol, are high-value chemicals used by industries to synthesise countless chemical products of daily use. Across sectors that include pharmaceuticals (Bengelsdorf and Dürre, 2017; Straathof, 2014), the global chemical turnover was valued at €3475 billion in 2017, and this demand is expected to rise further (Cefic, 2018). Both organic and inorganic commodities are mainly derived from fossil fuel-based petroleum feedstocks, a route that releases harmful direct and indirect greenhouse gases such as CO2 and CO into the atmosphere. Concerns over these limited fossil-fuel resources and increasing global pressure to reduce greenhouse gas emissions (UNEP, 2017) have led to an urgent need to find sustainable biochemical routes capable of producing these chemicals and satisfying the demand for them.

Biochemical routes involving fermentation and enzymatic methods have been widely discussed in the literature for the sustainable production of commodity chemicals (Saha, 2003; Siebert and Wendisch, 2015). Fermentation is a microbial process that uses microorganisms such as bacteria and yeast to produce enzymes (Renge et al., 2012), which then catalyse the biochemical reactions producing commodities from sugars and other biomass resources (Straathof, 2014). For example, the production of ethanol via the fermentation of syngas (Bengelsdorf et al., 2013; Bengelsdorf and Dürre, 2017) and the conversion of protein waste to cinnamic acid and β-alanine (Kumar et al., 2015) are microbially mediated fermentation processes. Enzymes are highly selective, but they can also catalyse numerous non-selective or non-specific reactions in addition to the specific reaction the enzyme has evolved for (Kumar et al., 2015; Straathof, 2014). This ability to catalyse non-specific reactions is known as ‘enzyme promiscuity’ (Tawfik, 2010), and is dependent on the substrates and cofactors involved in the reactions (Delépine et al., 2018; Shin et al., 2013; Tawfik, 2010). Although billions of years of evolution have enriched the repertoire of natural biochemical reaction networks of organisms, many chemical commodities cannot be produced ‘naturally’ because their synthesis lies beyond an organism’s natural capabilities (Wang et al., 2017). Additionally, there is a lack of knowledge on promiscuous enzyme activities, such as the number of promiscuous reactions that enzymes can take part in (Lee et al., 2012; Shin et al., 2013; Wang et al., 2017). These limitations prevent the discovery and implementation of potential biochemical pathways to high-value commodity chemicals.

Recent advances in cheminformatics and bioinformatics have enabled the design of novel (i.e., biologically unknown) biochemical pathways (Brunk et al., 2012; Medema et al., 2012), and have expanded our knowledge of promiscuous enzyme activities through the design and implementation of computational tools (Hadadi et al., 2019; Wang et al., 2017). Many of these state-of-the-art computational tools aid metabolic engineering efforts by designing novel pathways for numerous applications, including bioremediation of xenobiotics (Finley et al., 2009), novel drug discovery (Moura et al., 2016), and production of commodity chemicals (Islam et al., 2017; Yim et al., 2011). Widely used cheminformatics tools include From Metabolite to Metabolite (FMM) (http://fmm.mbc.nctu.edu.tw/), BNICE (Hatzimanikatis et al., 2005), DESHARKY (Rodrigo et al., 2008), PathPred (Moriya et al., 2010), and MRE (Kuwahara et al., 2016). These tools have been applied in numerous studies and have been extensively discussed elsewhere (Brunk et al., 2012; Henry et al., 2010; Islam et al., 2017; Medema et al., 2012; Wang et al., 2017).

Many of these computational tools ‘retrosynthetically’ generate biochemical pathways by iteratively applying ‘generalised reaction rules’ to transform and connect target compounds to the metabolites of interest (Hadadi et al., 2016; Medema et al., 2012; Wang et al., 2017). The generalised reaction rules are derived using the EC (Enzyme Commission) number information of known biochemical reactions assigned by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB, 1992). These tools can generate both novel and known biochemical reactions; however, a significant limitation of most tools is the combinatorial explosion of predicted pathways that results from using generalised reaction rules. The number of generated pathways can reach the thousands and, in some cases, the millions, which makes efficient post-processing of the generated pathways to find meaningful results a challenge (Islam et al., 2017). Although publications relevant to a specific software provide information on how to use the software and generate results, often there is no further guidance on how to curate these results to obtain useful pathways, a crucial need for practising metabolic engineers. This need leads to the development of individual curation methods that are mainly specific to a tool or software, as well as to the study conducted.
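The combinatorial growth described above can be illustrated with a toy model. The sketch below is not the algorithm of any of the cited tools: compounds are plain text labels, and each hypothetical "rule" (r1, r2, r3) is assumed to match every compound, which real bond-transformation rules do not. Under that assumption, r rules yield r^n candidate pathways of length n.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "generalised rules": each maps a compound label to a new
# label. Real rules encode bond transformations; these only tag the string.
Rule = Callable[[str], str]


def make_rule(tag: str) -> Rule:
    return lambda compound: f"{tag}({compound})"


def enumerate_pathways(
    start: str, rules: Dict[str, Rule], max_length: int
) -> Dict[int, List[Tuple[str, ...]]]:
    """Return, for each length, every chain of compounds built by applying rules."""
    by_length: Dict[int, List[Tuple[str, ...]]] = {0: [(start,)]}
    for length in range(1, max_length + 1):
        extended = []
        for path in by_length[length - 1]:
            for rule in rules.values():
                # Every rule is applied to the last compound of every path.
                extended.append(path + (rule(path[-1]),))
        by_length[length] = extended
    return by_length


rules = {name: make_rule(name) for name in ("r1", "r2", "r3")}
pathways = enumerate_pathways("target", rules, max_length=3)
counts = {length: len(paths) for length, paths in pathways.items()}
print(counts)  # {0: 1, 1: 3, 2: 9, 3: 27}
```

Even with three rules the count triples at every step, which is why a rule set with hundreds or thousands of entries calls for systematic post-processing.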

In this study, two powerful computational cheminformatics tools, ReactPRED (Sivakumar et al., 2016) and RetroPath2.0 (Delépine et al., 2018), were applied to design novel biochemical pathways to produce three commodity chemicals: benzene, phenol, and 1,2-propanediol. These target compounds were chosen based on their limited biochemical pathway knowledge (i.e., how many pathways are known in the current biological databases) and global demand. For example, the global demand for benzene in 2016 was estimated at 46 million tonnes (Pérez-Uresti et al., 2017). RetroPath2.0 and ReactPRED are relatively new, open source, and customisable cheminformatics tools. We chose these tools for their ability to predict novel retrosynthetic (i.e., transforming the target compounds to their simpler precursors) and synthetic (i.e., using simpler precursor compounds to construct target molecules) pathways by identifying the chemical bond transformations occurring in the reactions (Delépine et al., 2018; Sivakumar et al., 2016). After generating pathways automatically, a pathway curation method was developed for each tool to analyse millions of generated pathways and remove redundant results. Initially, pathways were examined for specific starting compounds, such as acetate, pyruvate, and glucose, as these compounds are abundantly available in cells and widely used in their metabolism. Next, the pathways containing these compounds were screened based on thermodynamic feasibility. The feasible pathways were further analysed to examine the compounds generated, and individual reactions were assigned an enzyme commission (EC) number, a numerical classification scheme for enzyme-catalysed reactions (Egelhofer et al., 2010). This task was accomplished by comparing the generated reactions to known reactions in the MetaNetX (Moretti et al., 2016), MetaCyc (Caspi et al., 2020), and KEGG (Kanehisa et al., 2020) databases. Finally, both software programmes were analysed to discuss their comparative advantages and limitations for finding novel biochemical pathways to target compounds.

Materials and methods

Automated generation of pathways

Novel biochemical pathways were constructed for the production of 1,2-propanediol, benzene, and phenol using the computational software programmes ReactPRED and RetroPath2.0. Detailed descriptions of both algorithms and their functionalities can be found elsewhere (Delépine et al., 2018; Sivakumar et al., 2016). Both software programmes require a number of inputs to generate biochemical pathways. In the case of ReactPRED, these inputs included a set of generalised reaction rules developed based on the EC numbers of biochemical reactions found in the MetaCyc (Caspi et al., 2020) database, cofactors (NAD, NADP), and the target (benzene, phenol, 1,2-propanediol) and source (glucose, pyruvate, acetate) compounds in the SMILES format (Weininger, 1988). For RetroPath2.0, the inputs included a set of reaction rules generated based on the biochemical reactions in the MetaNetX (Moretti et al., 2016) database and the reactions in the genome-scale E. coli metabolic model, iJO1366 (Orth et al., 2011), as well as the source, sink, and cofactor compounds, including benzene, phenol, NAD, and NADP. RetroPath2.0 then retrosynthetically, and ReactPRED synthetically, generated pathways by iteratively applying the generalised reaction rules to generate reactions connecting the target compounds to the metabolites present within the MetaCyc and MetaNetX databases. The thermodynamic feasibility of both the ReactPRED- and RetroPath2.0-generated pathways was analysed by estimating the standard Gibbs free energy of the generated reactions using the group contribution method (Jankowski et al., 2008; Noor et al., 2012).
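As an illustration of the SMILES inputs mentioned above, the sketch below lists commonly used SMILES strings for the target and source compounds. These strings are assumptions for illustration and are not necessarily the exact inputs used in the study: the acids are written in their neutral form and the stereochemistry of glucose is omitted. The helper `count_carbons` is a hypothetical sanity check that only works for strings containing C, H, and O.

```python
# Illustrative SMILES strings for the compounds named in the text
# (assumed representations, not taken from the study's input files).
TARGETS = {
    "benzene": "c1ccccc1",
    "phenol": "Oc1ccccc1",
    "1,2-propanediol": "CC(O)CO",
}
SOURCES = {
    "glucose": "OCC1OC(O)C(O)C(O)C1O",  # pyranose ring, stereochemistry omitted
    "pyruvate": "CC(=O)C(=O)O",         # written as neutral pyruvic acid
    "acetate": "CC(=O)O",               # written as neutral acetic acid
}


def count_carbons(smiles: str) -> int:
    """Count carbon atoms in a SMILES string that contains only C, H, and O."""
    # Aliphatic carbon is 'C', aromatic carbon is 'c'; this would miscount
    # two-letter elements such as Cl, which do not occur in these strings.
    return sum(1 for ch in smiles if ch in "Cc")


carbon_counts = {
    name: count_carbons(smiles) for name, smiles in {**TARGETS, **SOURCES}.items()
}
print(carbon_counts)
```

A quick carbon count of this kind is a cheap way to catch a mistyped input string before a long pathway-generation run.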

Manual curation of the generated pathways

The automatically generated pathways were analysed and manually curated based on the reactions and compounds involved, and the overall pathway feasibility. For RetroPath2.0, most of the generated reactions that were biologically known were automatically assigned an EC number. However, the unknown or novel reactions were examined and compared to similar reactions in the MetaNetX (Moretti et al., 2016) database. The generated compounds were also all examined to verify their identities by comparing them with the compounds in the MetaNetX and PubChem (Kim et al., 2019) databases. Compounds that were identified in the databases were assigned to corresponding reactions with an EC number, while unconfirmed compounds were discarded (Figure 1). ReactPRED generated compounds in the SMILES format (Weininger, 1988), which were examined and verified using the PubChem database. Unidentified compounds were discarded, while identified compounds were further assessed using the MetaCyc (Caspi et al., 2020) and KEGG (Kanehisa et al., 2020) databases to confirm whether the generated compounds were present in the biological databases. The identified compounds and corresponding reactions were then assigned an EC number based on the reaction rule and cofactor information used to generate those reactions. Moreover, the compounds not present in MetaCyc and KEGG were further analysed with the online CDK depicter tool (Willighagen et al., 2017) to confirm the bond transformations occurring within the proposed reactions (Figure 1). An EC number was then assigned to the reactions using the provided reaction rule and cofactor information.
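The decision logic of the ReactPRED curation steps described above can be summarised as a small sketch. It is an illustration only: the study performed these checks manually, and the sets `pubchem` and `biological` are hypothetical in-memory stand-ins for lookups against PubChem and against MetaCyc/KEGG. The label `PUBCHEM_ONLY_EXAMPLE` is a placeholder, not a real compound.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Compound:
    smiles: str


def curate_compound(compound: Compound, pubchem: Set[str], biological: Set[str]) -> str:
    """Classify a generated compound following the curation steps in the text."""
    if compound.smiles not in pubchem:
        return "discard"                    # unidentified compound
    if compound.smiles in biological:
        return "assign_ec_from_rule"        # also present in MetaCyc/KEGG
    return "check_bond_transformation"      # PubChem-only: inspect the reaction first


# Hypothetical stand-in database contents for the example.
pubchem = {"CC(O)CO", "CC(=O)CO", "PUBCHEM_ONLY_EXAMPLE"}
biological = {"CC(O)CO", "CC(=O)CO"}

decisions = {
    label: curate_compound(Compound(label), pubchem, biological)
    for label in ("CC(O)CO", "PUBCHEM_ONLY_EXAMPLE", "NOT_IN_ANY_DATABASE")
}
print(decisions)
```

In this sketch a pathway would be kept only if none of its compounds is classified as "discard".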

Results and discussion

Analysis of the generated pathways using ReactPRED

Reactions and pathways were generated using ReactPRED’s default reaction rule set, which included a total of 1462 reaction rules (Sivakumar et al., 2016), and the SMILES format of the starting compounds. Glucose, pyruvate, and acetate were used to generate synthetic pathways up to a pathway length of 3, while phenol, benzene, and 1,2-propanediol were used as starting compounds to retrosynthetically generate pathways up to a pathway length of 2.

Figure 2 illustrates the number of pathways generated with increasing pathway length. The number of generated pathways is linked to the number of potential bond transformations available in the starting compound. For example, the number of glucose pathways produced at each pathway length is greater than the number of acetate and pyruvate pathways produced (Figure 2A). Additionally, the number of phenol pathways increased significantly from 1952 at pathway length 1 to 2337181 at pathway length 3 (Figure 2B), further illustrating that more potential for bond transformations in an input compound generates more outputs. From the generated pathways, the thermodynamically feasible pathways to the target compounds, i.e., those with a negative standard Gibbs free energy, were examined. Table 1 shows the number of thermodynamically feasible pathways generated to the target compounds at each pathway length. No pathways were found to synthetically generate benzene. Table 2 shows the number of thermodynamically feasible retrosynthetic pathways generated at each pathway length.

Comparing the synthetic and retrosynthetic results, more pathways were generated retrosynthetically because there was more potential for bond transformations in the retrosynthetic starting compounds than in the compounds used for the synthetic analysis. Therefore, more reaction rules were automatically applied to these compounds, generating more outputs, i.e., reactions. Pathways were further analysed individually based on the identity of the compounds involved (Figure 3). Pathways were discarded if they included compounds unidentifiable in the PubChem database (Kim et al., 2019). For instance, many of the generated compounds contained carbon atoms with 5 or more bonds, which means these compounds would be unlikely to exist in nature. From the synthetic outputs, two pathways to 1,2-propanediol and one pathway to phenol were accepted (Figure 3A). Figure 3B shows the number of accepted and discarded pathways to each target compound for the retrosynthetic outputs: one acetate, ten pyruvate, and seven glucose to 1,2-propanediol pathways were accepted, while seven acetate, five pyruvate, and fifty-nine glucose to phenol pathways were accepted. No acetate to benzene or pyruvate to benzene pathways were accepted because every pathway in these categories included at least one unidentifiable compound; however, fifteen glucose to benzene pathways were accepted. The accepted reactions were assigned EC numbers of enzymes capable of catalysing them, following the procedure described in Materials and methods.
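The observation that some generated compounds contained carbon atoms with 5 or more bonds suggests a simple automated pre-filter. The sketch below is a hypothetical illustration, not part of the published workflow: a molecule is represented as a minimal graph of heavy atoms and bond orders, and implicit hydrogens are ignored. A cheminformatics toolkit would normally perform this valence check when parsing the structure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (atom_i, atom_j, bond_order); atom indices refer to keys of the atoms dict.
Bond = Tuple[int, int, int]


def overvalent_carbons(
    atoms: Dict[int, str], bonds: List[Bond], max_valence: int = 4
) -> List[int]:
    """Return indices of carbon atoms whose summed bond orders exceed max_valence."""
    valence: Dict[int, int] = defaultdict(int)
    for i, j, order in bonds:
        valence[i] += order
        valence[j] += order
    return sorted(
        idx
        for idx, element in atoms.items()
        if element == "C" and valence[idx] > max_valence
    )


# Acetic acid heavy atoms: CH3-C(=O)-OH. The carboxyl carbon has bond order sum 4.
acetic_acid_atoms = {0: "C", 1: "C", 2: "O", 3: "O"}
acetic_acid_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

# An implausible structure: one carbon singly bonded to five oxygens.
bad_atoms = {0: "C", 1: "O", 2: "O", 3: "O", 4: "O", 5: "O"}
bad_bonds = [(0, k, 1) for k in range(1, 6)]

print(overvalent_carbons(acetic_acid_atoms, acetic_acid_bonds))  # []
print(overvalent_carbons(bad_atoms, bad_bonds))                  # [0]
```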

Analysis of the generated pathways using RetroPath2.0

The RetroPath2.0 algorithm generated pathways using 14,302 reaction rules known to the E. coli metabolism, benzene as the starting compound, and a set of ‘sink’ compounds. The sink compounds are the native metabolites of the chassis organism, i.e., E. coli (Delépine et al., 2018). The algorithm converged after 7 iterations, as no new reaction was generated by further runs of the algorithm (Figure 4). Figure 4A illustrates the total number of generated reactions, the number of reactions assigned to EC numbers, and the number of compounds generated at each iteration. The number of generated compounds and reactions peaked at iteration 3, with thirty-seven reactions and eighty-nine compounds generated in total, while the total number of reactions and compounds decreased significantly to three and five after the third iteration. This reduction in compounds and reactions can be attributed to the handling of ‘sink’ compounds by the algorithm. Ordinarily, the total number of reactions should increase exponentially with each iteration of the algorithm. However, RetroPath2.0 removes all outputs in which the generated compounds match those in the sink set (Delépine et al., 2018), thus preventing further iterations of the algorithm on those generated compounds. Each reaction was further analysed, and reactions were discarded if they contained compounds unidentifiable in the PubChem (Kim et al., 2019) database. Thus, no reactions were accepted from those generated after the fourth iteration (Figure 4B). However, three, two, and twelve reactions were accepted from the reactions generated after the 1st, 2nd, and 3rd iterations of the algorithm, respectively (Figure 4B). RetroPath2.0 automatically assigns EC numbers to reactions; however, only one accepted reaction from the first iteration set and ten accepted reactions from the third iteration set were assigned EC numbers. The unassigned accepted reactions were then compared to the ReactPRED results to identify similar reactions. Additionally, the reaction rule and cofactor information were examined to manually assign an EC number to the unassigned reactions.
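The effect of the sink set on convergence can be shown with a toy network. The sketch below is a simplified illustration and not the RetroPath2.0 implementation: the dictionary `precursors` is a hypothetical stand-in for the outcome of applying reaction rules, and, unlike the behaviour described above, reactions that reach a sink compound are still counted here and only their further expansion is suppressed.

```python
from typing import Dict, List, Set


def expand_until_converged(
    start: str,
    precursors: Dict[str, List[str]],
    sink: Set[str],
    max_iterations: int = 10,
) -> List[int]:
    """Expand compounds iteratively, never expanding those found in the sink.

    Returns the number of reactions found at each iteration and stops at the
    first iteration that yields no reaction.
    """
    frontier = {start}
    seen = {start}
    reactions_per_iteration: List[int] = []
    for _ in range(max_iterations):
        new_reactions = []
        next_frontier = set()
        for product in sorted(frontier):
            for substrate in precursors.get(product, []):
                new_reactions.append((substrate, product))
                # Sink compounds and already-seen compounds are not expanded again.
                if substrate not in sink and substrate not in seen:
                    next_frontier.add(substrate)
                seen.add(substrate)
        if not new_reactions:
            break
        reactions_per_iteration.append(len(new_reactions))
        frontier = next_frontier
    return reactions_per_iteration


# Hypothetical toy network: T is the starting compound, S1 and S2 are sink metabolites.
toy_precursors = {"T": ["A", "B"], "A": ["C", "S1"], "B": ["S2"], "C": ["S1"], "S1": ["X"]}

with_sink = expand_until_converged("T", toy_precursors, sink={"S1", "S2"})
without_sink = expand_until_converged("T", toy_precursors, sink=set())
print(with_sink)     # [2, 3, 1]
print(without_sink)  # [2, 3, 2]
```

With the sink in place, the precursor of S1 is never explored, so the search stops earlier and produces fewer reactions in later iterations.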

No direct pathways to glucose, acetate, or pyruvate were found using RetroPath2.0. However, pathways were generated to connect compounds present within the E. coli metabolism, as well as compounds found in the PubChem database only. Additionally, generating pathways retrosynthetically using phenol as the starting compound produced the same results as benzene, but no pathways were generated using 1,2-propanediol as the starting compound.

Analysis of the accepted pathways

From the generated outputs of ReactPRED and RetroPath2.0, a total of 49 pathways consisting of 106 reactions connecting acetate, glucose, and pyruvate to phenol, benzene, and 1,2-propanediol were accepted. No 1-step pathway connecting benzene or 1,2-propanediol to the starting compounds of interest, i.e., glucose, acetate, and pyruvate, was identified. Each pathway contains at least one novel step, while 25 (51%) of the accepted pathways are entirely composed of novel reactions. No pathways were identified in which all reaction steps were known, i.e., found in the MetaCyc, MetaNetX, and KEGG databases. Many of the accepted pathways contained identical reaction rules, as well as compounds found in the PubChem database only. For example, 13 of the 26 (50%) glucose to phenol pathways (Supplementary data) contained compounds only identifiable in the PubChem database. Many of these compounds are synthetic, man-made compounds that are found only in the retrosynthetically generated pathways. Figure 5 shows the number of accepted pathways from glucose, pyruvate, and acetate to each commodity chemical.

The thermodynamic feasibility of the accepted pathways was analysed based on the overall standard Gibbs free energy of reaction (ΔrG′°) of each pathway. The standard Gibbs free energies of both the ReactPRED- and RetroPath2.0-generated reactions were estimated using the group contribution method (Jankowski et al., 2008; Noor et al., 2012). Figure 6 shows a few notable examples of the accepted pathways discussed in this section, while Figure 7 depicts their thermodynamic feasibility information. Additional information on the pathways discussed in this section can be found in the supplementary data provided.

Pathways 1-7 (Supplementary data) are the predicted pathways from acetate to 1,2-propanediol and phenol. Pathways 1, 3, and 4 share the same novel reaction in the 1st step, while pathway 2 includes only novel reactions (Figure 6). Each acetate to 1,2-propanediol pathway is thermodynamically feasible overall, although pathways 3 and 4 include reaction steps with positive ΔrG′° (Figure 7). Further comparison of pathways 1, 3, and 4 shows that the acetate to methylglyoxal reaction is thermodynamically more favourable than the acetate to hydroxyacetone reaction, resulting in a larger negative ΔrG′° for the corresponding pathways. The first reaction step in pathways 5, 6, and 7 is a novel reaction. This novel step involves the transfer of alkyl or aryl groups in pathways 5 and 6, while in pathway 7 it uses a haloacetate dehalogenase-catalysed reaction. All three pathways generate phenol through an arylesterase reaction in the final reaction step. Each pathway is thermodynamically feasible overall. However, pathway 5 was found to be the most thermodynamically favourable and pathway 7 the least of the 3 acetate to phenol pathways.

Pathways 8-21 (Supplementary data) are the predicted pathways connecting pyruvate to the target commodity chemicals. Pathways 8-12 are predicted to produce hydroxyacetone in the first reaction step using the same novel reaction. Pathways 13-17 predict the production of lactic acid from pyruvate in the first reaction step. This step is a novel reaction for pathways 13-16, while it is a biologically known reaction in pathway 17. Only pathways 9 and 10 include a known reaction in the 2nd reaction step. All pyruvate to 1,2-propanediol pathways are thermodynamically feasible overall, with each pathway having an overall negative ΔrG′°. However, further comparison of the reaction steps revealed that the pathways producing 1,2-propanediol via lactic acid were thermodynamically more favourable than the pathways through hydroxyacetone.

Pathways 18-21 produce phenol from pyruvate. The first reaction step in pathways 18 and 21 and the 2nd in pathway 20 are known reactions, while the other reactions in these pathways are novel. Pathway 19 is predicted to use phenyl phosphono hydrogen phosphate as a co-reactant (Supplementary data). Phenyl phosphono hydrogen phosphate was only identified in the PubChem database, indicating that it is not a natural biological compound. Examining the thermodynamic feasibility of the pyruvate to phenol pathways reveals that pathways 18, 19, and 20 have the same ΔrG′° values (-4.61 and -4.5 kcal/mol) for the first and second reaction steps, while only pathway 21 has different ΔrG′° values (-4.6 and -1.7 kcal/mol) for the two reactions. These estimates indicate that pathways 18, 19, and 20 are thermodynamically more favourable overall than pathway 21.
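The comparison of the pyruvate to phenol pathways can be reproduced as a short worked calculation. The sketch below assumes that the overall value for a pathway is the sum of the estimates for its reaction steps; the text does not state the aggregation explicitly, so this is an assumption. The step values are those quoted above.

```python
from typing import Dict, Tuple

# Step-wise standard Gibbs free energy estimates (kcal/mol) quoted in the text
# for the first and second reaction steps of each pyruvate to phenol pathway.
step_dg: Dict[int, Tuple[float, float]] = {
    18: (-4.61, -4.5),
    19: (-4.61, -4.5),
    20: (-4.61, -4.5),
    21: (-4.6, -1.7),
}


def overall_dg(steps: Tuple[float, ...]) -> float:
    """Overall pathway estimate, assumed here to be the sum of the step estimates."""
    return round(sum(steps), 2)


overall = {pathway: overall_dg(steps) for pathway, steps in step_dg.items()}
feasible = {pathway: value < 0 for pathway, value in overall.items()}
print(overall)   # {18: -9.11, 19: -9.11, 20: -9.11, 21: -6.3}
print(feasible)  # every pathway has a negative overall estimate
```

Under this assumption pathways 18-20 come out at -9.11 kcal/mol and pathway 21 at -6.3 kcal/mol, consistent with the ranking stated in the text.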

All predicted glucose to 1,2-propanediol pathways, i.e., pathways 22-25, were retrosynthetically generated using ReactPRED and are composed of novel reactions in both steps. Notably, all of these pathways contain a compound in the first reaction step that is only found in the PubChem database: 2-(hydroxymethyl)-6-(1-hydroxypropane-2-yloxy)oxane-3,4,5-triol (pathway 22 in Figure 6). The reactions in pathways 22-25 are all thermodynamically feasible, with pathway 22 estimated to be the most thermodynamically favourable glucose to 1,2-propanediol pathway and pathway 25 the least. The presence of NADP+ as a cofactor in the first reaction step of pathway 22 is likely to make it thermodynamically more favourable, as NADP+ works alongside enzymes to provide energy for cellular reactions (Xiao et al., 2018).

The only 1-step pathway from glucose to phenol (Pathway 26 in Figure 6) was generated retrosynthetically using ReactPRED. Pathways 27-32 include a known reaction in the first step, while the second reaction step in each of these pathways is novel (Supplementary data). All of these pathways use water produced in the first step as a reactant for the second reaction. Pathways 33-40 and 44 are predicted to consist of novel reactions in both reaction steps. Interestingly, pathways 33-40 generate phenol from phenyl-α-D-glucoside in the second step using the same reaction; this reaction was classified as α-galactosidase and was assigned EC 3.2.1.21. Pathways 41 and 42 include the same known reaction in the second reaction step, while the 2nd step in pathway 43 was identified as reaction R05626 using the KEGG database. Overall, each glucose to phenol pathway is thermodynamically feasible, even though some pathways include reactions with a positive ΔrG′°.

The second steps in pathways 45 and 46 were retrosynthetically generated by the RetroPath2.0 software in the first iteration (Figure 6). These pathways were compared to the ReactPRED-generated pathways to find similarities and construct novel reactions. Pathway 45 was confirmed by creating a customised reaction rule set and generating the reactions using ReactPRED’s pathway prediction system (Sivakumar et al., 2016), while pathway 46 was confirmed through comparison with ReactPRED’s retrosynthetic pathway results. Pathways 47-49 were all retrosynthetically generated by ReactPRED and consist of two novel reactions. Each glucose to benzene pathway is thermodynamically feasible overall. Pathway 48 was identified as the most thermodynamically favourable glucose to benzene pathway, with an estimated overall ΔrG′° of -105 kcal/mol, while pathway 46 was the least, with an overall ΔrG′° of -9.4 kcal/mol.

A closer examination of the accepted pathways further revealed similarities in the co-substrate use and bond transformation information, as well as in the estimated ∆rG° for multiple reaction steps. For example, 32 of the 80 accepted glucose pathways contain a reverse sucrose alpha-glucohydrolase reaction in the first reaction step to produce sucrose and water (Supplementary data). These pathways used water as the starting compound in the next reaction step and compounds only identified in the PubChem database as co-reactants. Similarly, pathways 22-25, 34, and 46-49 included co-reactants that were only identified in the PubChem database. The presence of a compound only in PubChem but not in other biological databases (MetaCyc, MetaNetX, KEGG) suggests that the compound is man-made or synthetic and may not be made by biological systems. 16 out of 28 glucose pathways contained one of these compounds, whilst none of the acetate or pyruvate pathways contained these compounds. Moreover, many other pathways, such as pathways 6-8, 23-26, and 28-30, produce target compounds using commodity chemicals as co-reactants or generate CO2 (Pathway 46 in Figure 6) in their reaction steps. Thus, the proposed pathways, although novel, may not necessarily be considered ‘green’. Finally, most of the retrosynthetically generated pathways, including pathways 1, 13-16, 19, 22-26, 33-40, and 43-45, are completely composed of novel reactions, indicating that retrosynthetic generation allows more potential novel reactions to be uncovered.
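The database check described above can be sketched as follows. This is a minimal illustration, assuming the database lookups have already been done and their results are held in a simple mapping; the function names, the mapping, and the compound `compound_X` are hypothetical and are not part of ReactPRED, RetroPath2.0, or any database API.

```python
# Databases treated as "biological" in the text; PubChem is the general chemical database.
BIOLOGICAL_DBS = frozenset({"MetaCyc", "MetaNetX", "KEGG"})


def is_pubchem_only(found_in):
    """Return True if a compound was found in PubChem but in none of the biological databases."""
    found = set(found_in)
    return "PubChem" in found and not (found & BIOLOGICAL_DBS)


def pubchem_only_compounds(pathway_compounds, db_hits):
    """List the compounds of one pathway that are flagged as PubChem-only.

    db_hits maps a compound name to the databases in which it was identified
    (hypothetical lookup results, e.g. collected during manual curation).
    """
    return [c for c in pathway_compounds if is_pubchem_only(db_hits.get(c, ()))]


# Hypothetical lookup results, for illustration only.
example_hits = {
    "glucose": {"PubChem", "KEGG", "MetaCyc", "MetaNetX"},
    "compound_X": {"PubChem"},
}
print(pubchem_only_compounds(["glucose", "compound_X"], example_hits))
```

A pathway for which this list is non-empty would be one of those noted above as relying on a possibly synthetic co-reactant.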

Comparative analysis of ReactPRED and RetroPath2.0

Both software tools, ReactPRED and RetroPath2.0, were applied to generate novel biochemical pathways to three industrially important commodity chemicals: benzene, phenol, and 1,2-propanediol. These cheminformatics tools were designed to be user-friendly and customisable to conduct user-specific pathway design tasks (Delépine et al., 2018; Sivakumar et al., 2016). Additionally, both tools have unique features that enable the design of biochemical pathways of various lengths from different starting compounds to different targets.

Both ReactPRED and RetroPath2.0 allow users to customise their inputs to generate reactions. For ReactPRED, the input reaction rules are completely customisable, and the software’s reaction rule creation system allows the user to generate tailored reaction rules for specific tasks (Sivakumar et al., 2016). However, this study only utilised ReactPRED’s default reaction rule set to generate outputs, i.e., compounds, reactions, and pathways. ReactPRED’s default reaction rule set was created using reactions present in the MetaCyc database (Caspi et al., 2020), and identical rules were merged, indicating that a novel reaction may be catalysed by more than one enzyme. Uniquely, ReactPRED estimates the overall Gibbs free energy change of the generated reactions, allowing users to assess the thermodynamic feasibility of the generated outputs. Further, ReactPRED allows users to view and search for pathways based on thermodynamic feasibility, molecular weight, and substructure through the user-friendly pathway analysis system.
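As a minimal sketch of how such estimates can be used to judge a whole pathway, assume (as described in the Figure 7 legend) that the overall value is obtained by combining the per-step estimates, here by simple summation, and that a negative overall value is taken as feasible. The function names and the step values below are hypothetical; they only illustrate that a pathway can be feasible overall even when one step has a positive estimate.

```python
def overall_delta_g(step_delta_gs):
    """Combine per-step Gibbs free energy estimates (kcal/mol) by summation."""
    return sum(step_delta_gs)


def is_feasible(step_delta_gs):
    """Treat a pathway as thermodynamically feasible if its overall estimate is negative."""
    return overall_delta_g(step_delta_gs) < 0


# Hypothetical two-step pathway: the first step is unfavourable on its own,
# but the second step makes the pathway favourable overall.
example_steps = [3.5, -12.0]
print(overall_delta_g(example_steps), is_feasible(example_steps))
```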

Comparatively, RetroPath2.0 allows the user to tailor not only the reaction rules but also the ‘sink’ compounds to find novel pathways in the context of a specific chassis organism. This study used the software’s default reaction rule set and sink compounds, which were developed based on the genome-scale metabolic model of E. coli, iJO1366 (Orth et al., 2011), and MetaNetX (Moretti et al., 2016), a meta-database consisting of reactions extracted from the KEGG, MetaCyc, Rhea (Lombardot et al., 2019), and Reactome (Jassal et al., 2020) databases. Additionally, RetroPath2.0 automatically assigns an EC number to each reaction within the chassis strain. Uniquely, RetroPath2.0 uses ‘sinks’ to prevent further iterations of the algorithm using compounds found within the chassis strain. This strategy not only shortens the execution time of the algorithm but also limits the combinatorial explosion of pathways that is usually generated with cheminformatics tools.
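The effect of sinks on the search can be illustrated with a toy retrosynthetic expansion. The sketch below is not RetroPath2.0's implementation; it assumes a hypothetical lookup table in place of real reaction rules and only shows that compounds in the sink set are never expanded further, which reduces the number of generated reactions.

```python
def retro_expand(target, rules, sinks, max_iterations):
    """Toy breadth-first retrosynthesis.

    rules maps a product to a list of precursor tuples (a hypothetical
    stand-in for reaction rules). Compounds in `sinks` are not expanded.
    Returns the generated reactions as (precursors, product) pairs.
    """
    frontier = {target}
    seen = {target}
    reactions = []
    for _ in range(max_iterations):
        next_frontier = set()
        for product in sorted(frontier):
            for precursors in rules.get(product, []):
                reactions.append((precursors, product))
                for compound in precursors:
                    # Sinks (and already visited compounds) end this branch.
                    if compound in sinks or compound in seen:
                        continue
                    seen.add(compound)
                    next_frontier.add(compound)
        if not next_frontier:
            break
        frontier = next_frontier
    return reactions


# Hypothetical rule table; "S" plays the role of a chassis metabolite.
toy_rules = {
    "target": [("A",), ("B",)],
    "A": [("S",)],
    "B": [("C",)],
    "C": [("S",)],
    "S": [("X",)],
}
with_sinks = retro_expand("target", toy_rules, sinks={"S"}, max_iterations=10)
without_sinks = retro_expand("target", toy_rules, sinks=set(), max_iterations=10)
print(len(with_sinks), len(without_sinks))
```

In this toy case the saving is a single reaction; with real rule sets, each unexpanded sink removes a whole subtree of further iterations.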

Although ReactPRED and RetroPath2.0 have benefits to aid biochemical pathway design, both software tools have limitations. For example, both pieces of software generate compounds that do not exist in nature, i.e., the compounds are unidentifiable in both biochemical and chemical databases. This is an important limitation, as these compounds and relevant reactions cannot be removed automatically from the generated results. Therefore, each pathway must be individually analysed to find and discard the reactions and pathways containing these compounds. Another major limitation of ReactPRED is its long execution time: generating the desired outputs can take anywhere from a few seconds to a week. The time taken for ReactPRED to generate outputs depends on the number of potential bond transformations available for the starting compound and the number of reaction rules used to predict reactions. For instance, larger input molecules with more bond transformation potential take longer to process. Furthermore, similar to many other cheminformatics tools, ReactPRED suffers from the combinatorial explosion of predicted reactions and pathways. As the pathway length increases, the number of predictions can reach the millions, so that greater effort is needed to sift through the data to find meaningful results. Additionally, assigning EC numbers to the ReactPRED predicted reactions is challenging and requires extensive analysis of the reactions, as discussed in this study, to assign a complete EC number.
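The growth in the number of predictions with pathway length can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every compound gives rise to the same number of predicted reactions at each step; the branching value of 100 is hypothetical and is not a measured property of ReactPRED.

```python
def pathway_count(branching, length):
    """Number of distinct pathways of a given length under constant branching."""
    return branching ** length


# Hypothetical branching factor of 100 predicted reactions per compound.
for length in (1, 2, 3):
    print(length, pathway_count(100, length))
```

Under this assumption, three steps are already enough to reach a million candidate pathways, which is the order of magnitude referred to above.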

A significant limitation of both pieces of software is that neither ReactPRED nor RetroPath2.0 can propose targeted pathways, i.e., automatically generate only the desired reactions connecting starting compounds to target compounds. Instead, both algorithms continue to generate reactions iteratively until they reach specific cut-off parameters such as pathway length and bond transformation diameter. Hence, a substantial amount of downstream pathway curation work is required to find the meaningful and novel results.

Conclusions

The cheminformatics tools ReactPRED and RetroPath2.0 were utilised to design novel biochemical pathways to produce three industrially important commodity chemicals with limited biochemical knowledge: benzene, phenol, and 1,2-propanediol. All of the 49 designed pathways from glucose, acetate, and pyruvate contained at least one novel step, i.e., a biologically unknown reaction, and all were found to be thermodynamically feasible. A novel methodology for curation of the thousands to millions of pathways generated by both software tools was developed, and this method can be used as a guide for designing biochemical pathways to produce not only commodity chemicals but also nutraceuticals and pharmaceuticals. RetroPath2.0 and ReactPRED were also comparatively assessed to provide further insight into their effectiveness as biochemical pathway design tools, as well as their advantages and limitations in the context of a specific design task. Although both software tools are user-friendly and help design novel pathways, these tools also produce thousands of pathways with compounds non-existent in nature. Hence, this study can be used to develop practical pathway curation strategies while using similar cheminformatics tools to design biochemical pathways. Moreover, the designed pathways can be used as valuable hypotheses for experimental implementation of the pathways in suitable chassis organisms for sustainable production of bio-based commodity chemicals.

References

Bengelsdorf, F.R., Dürre, P., 2017. Gas fermentation for commodity chemicals and fuels. Microb. Biotechnol. 10, 1167–1170. https://doi.org/10.1111/1751-7915.12763

Bengelsdorf, F.R., Straub, M., Dürre, P., 2013. Bacterial synthesis gas (syngas) fermentation. Environ. Technol. (United Kingdom). https://doi.org/10.1080/09593330.2013.827747

Brunk, E., Neri, M., Tavernelli, I., Hatzimanikatis, V., Rothlisberger, U., 2012. Integrating computational methods to retrofit enzymes to synthetic pathways. Biotechnol. Bioeng. https://doi.org/10.1002/bit.23334

Caspi, R., Billington, R., Keseler, I.M., Kothari, A., Krummenacker, M., Midford, P.E., Ong, W.K., Paley, S., Subhraveti, P., Karp, P.D., 2020. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz862

Cefic, 2018. Facts & Figures of the European chemical industry.

Delépine, B., Duigou, T., Carbonell, P., Faulon, J.L., 2018. RetroPath2.0: A retrosynthesis workflow for metabolic engineers. Metab. Eng. 45, 158–170. https://doi.org/10.1016/j.ymben.2017.12.002

Egelhofer, V., Schomburg, I., Schomburg, D., 2010. Automatic assignment of EC numbers. PLoS Comput. Biol. 6. https://doi.org/10.1371/journal.pcbi.1000661

Finley, S.D., Broadbelt, L.J., Hatzimanikatis, V., 2009. Computational framework for predictive biodegradation. Biotechnol. Bioeng. https://doi.org/10.1002/bit.22489

Hadadi, N., Hafner, J., Shajkofci, A., Zisaki, A., Hatzimanikatis, V., 2016. ATLAS of Biochemistry: A Repository of All Possible Biochemical Reactions for Synthetic Biology and Metabolic Engineering Studies. ACS Synth. Biol. https://doi.org/10.1021/acssynbio.6b00054

Hadadi, N., MohammadiPeyhani, H., Miskovic, L., Seijo, M., Hatzimanikatis, V., 2019. Enzyme annotation for orphan and novel reactions using knowledge of substrate reactive sites. Proc. Natl. Acad. Sci. 116, 7298–7307. https://doi.org/10.1073/pnas.1818877116

Hatzimanikatis, V., Li, C., Ionita, J.A., Henry, C.S., Jankowski, M.D., Broadbelt, L.J., 2005. Exploring the diversity of complex metabolic networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/bti213

Henry, C.S., Dejongh, M., Best, A.A., Frybarger, P.M., Linsay, B., Stevens, R.L., 2010. High-throughput generation, optimization and analysis of genome-scale metabolic models. Nat. Biotechnol. 28, 977–982. https://doi.org/10.1038/nbt.1672

Islam, M.A., Hadadi, N., Ataman, M., Hatzimanikatis, V., Stephanopoulos, G., 2017. Exploring biochemical pathways for mono-ethylene glycol (MEG) synthesis from synthesis gas. Metab. Eng. 41, 173–181. https://doi.org/10.1016/j.ymben.2017.04.005

Jankowski, M.D., Henry, C.S., Broadbelt, L.J., Hatzimanikatis, V., 2008. Group contribution method for thermodynamic analysis of complex metabolic networks. Biophys. J. 95, 1487–1499. https://doi.org/10.1529/biophysj.107.124784

Jassal, B., Matthews, L., Viteri, G., Gong, C., Lorente, P., Fabregat, A., Sidiropoulos, K., Cook, J., Gillespie, M., Haw, R., Loney, F., May, B., Milacic, M., Rothfels, K., Sevilla, C., Shamovsky, V., Shorser, S., Varusai, T., Weiser, J., Wu, G., Stein, L., Hermjakob, H., D’Eustachio, P., 2020. The reactome pathway knowledgebase. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz1031

Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., Tanabe, M., 2020. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. https://doi.org/10.1093/nar/gkaa970

Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., Li, Q., Shoemaker, B.A., Thiessen, P.A., Yu, B., Zaslavsky, L., Zhang, J., Bolton, E.E., 2019. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109. https://doi.org/10.1093/nar/gky1033

Kumar, M.B., Gao, Y., Shen, W., He, L., 2015. Valorisation of protein waste: An enzymatic approach to make commodity chemicals. Front. Chem. Sci. Eng. 307.

Kuwahara, H., Alazmi, M., Cui, X., Gao, X., 2016. MRE: a web tool to suggest foreign enzymes for the biosynthesis pathway design with competing endogenous reactions in mind. Nucleic Acids Res. https://doi.org/10.1093/nar/gkw342

Lee, J.W., Na, D., Park, J.M., Lee, J., Choi, S., Lee, S.Y., 2012. Systems metabolic engineering of microorganisms for natural and non-natural chemicals. Nat. Chem. Biol. 8, 536–546. https://doi.org/10.1038/nchembio.970

Lombardot, T., Morgat, A., Axelsen, K.B., Aimo, L., Hyka-Nouspikel, N., Niknejad, A., Ignatchenko, A., Xenarios, I., Coudert, E., Redaschi, N., Bridge, A., 2019. Updates in Rhea: SPARQLing biochemical reaction data. Nucleic Acids Res. https://doi.org/10.1093/nar/gky876

Medema, M.H., Van Raaphorst, R., Takano, E., Breitling, R., 2012. Computational tools for the synthetic design of biochemical pathways. Nat. Rev. Microbiol. https://doi.org/10.1038/nrmicro2717

Moretti, S., Martin, O., Van Du Tran, T., Bridge, A., Morgat, A., Pagni, M., 2016. MetaNetX/MNXref - Reconciliation of metabolites and biochemical reactions to bring together genome-scale metabolic networks. Nucleic Acids Res. 44, D523–D526. https://doi.org/10.1093/nar/gkv1117

Moriya, Y., Shigemizu, D., Hattori, M., Tokimatsu, T., Kotera, M., Goto, S., Kanehisa, M., 2010. PathPred: An enzyme-catalyzed metabolic pathway prediction server. Nucleic Acids Res. https://doi.org/10.1093/nar/gkq318

Moura, M., Finkle, J., Stainbrook, S., Greene, J., Broadbelt, L.J., Tyo, K.E.J., 2016. Evaluating enzymatic synthesis of small molecule drugs. Metab. Eng. https://doi.org/10.1016/j.ymben.2015.11.006

NC-IUBMB, 1992. Nomenclature committee of the international union of biochemistry and molecular biology. [WWW Document].

Noor, E., Bar-Even, A., Flamholz, A., Lubling, Y., Davidi, D., Milo, R., 2012. An integrated open framework for thermodynamics of reactions that combines accuracy and coverage. Bioinformatics. https://doi.org/10.1093/bioinformatics/bts317

Orth, J.D., Conrad, T.M., Na, J., Lerman, J.A., Nam, H., Feist, A.M., Palsson, B., 2011. A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Mol. Syst. Biol. https://doi.org/10.1038/msb.2011.65

Pérez-Uresti, S., Adrián-Mendiola, J., El-Halwagi, M., Jiménez-Gutiérrez, A., 2017. Techno-Economic Assessment of Benzene Production from Shale Gas. Processes 5, 33. https://doi.org/10.3390/pr5030033

Renge, V.C., Khedkar, S. V, Nandurkar, N.R., 2012. Enzyme Synthesis By Fermentation Method: a Review 2, 585–590.

Rodrigo, G., Carrera, J., Prather, K.J., Jaramillo, A., 2008. DESHARKY: Automatic design of metabolic pathways for optimal cell growth. Bioinformatics 24, 2554–2556. https://doi.org/10.1093/bioinformatics/btn471

Saha, B.C., 2003. Commodity chemicals production by fermentation: An overview. Ferment. Biotechnol. 862, 3–17. https://doi.org/10.1021/bk-2003-0862.ch001

Shin, J.H., Kim, H.U., Kim, D.I., Lee, S.Y., 2013. Production of bulk chemicals via novel metabolic pathways in microorganisms. Biotechnol. Adv. 31, 925–935. https://doi.org/10.1016/j.biotechadv.2012.12.008

Siebert, D., Wendisch, V.F., 2015. Metabolic pathway engineering for production of 1,2-propanediol and 1-propanol by Corynebacterium glutamicum. Biotechnol. Biofuels 8, 1–13. https://doi.org/10.1186/s13068-015-0269-0

Sivakumar, T.V., Giri, V., Park, J.H., Kim, T.Y., Bhaduri, A., 2016. ReactPRED: A tool to predict and analyze biochemical reactions. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw491

Straathof, A.J.J., 2014. Transformation of biomass into commodity chemicals using enzymes or cells. Chem. Rev. 114, 1871–1908. https://doi.org/10.1021/cr400309c

Tawfik, O.K. and D.S., 2010. Enzyme Promiscuity: A Mechanistic and Evolutionary Perspective. Annu. Rev. Biochem. 79, 471–505. https://doi.org/10.1146/annurev-biochem-030409-143718

UNEP, 2017. The Emissions Gap Report 2017.

Wang, L., Dash, S., Ng, C.Y., Maranas, C.D., 2017. A review of computational tools for design and reconstruction of metabolic pathways. Synth. Syst. Biotechnol. https://doi.org/10.1016/j.synbio.2017.11.002

Weininger, D., 1988. SMILES, a Chemical Language and Information System: 1: Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 28, 31–36. https://doi.org/10.1021/ci00057a005

Willighagen, E.L., Mayfield, J.W., Alvarsson, J., Berg, A., Carlsson, L., Jeliazkova, N., Kuhn, S., Pluskal, T., Rojas-Chertó, M., Spjuth, O., Torrance, G., Evelo, C.T., Guha, R., Steinbeck, C., 2017. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. https://doi.org/10.1186/s13321-017-0220-4

Xiao, W., Wang, R.-S., Handy, D.E., Loscalzo, J., 2018. NAD(H) and NADP(H) Redox Couples and Cellular Energy Metabolism. Antioxid. Redox Signal. 28, 251–272. https://doi.org/10.1089/ars.2017.7216

Yim, H., Haselbeck, R., Niu, W., Pujol-Baxley, C., Burgard, A., Boldt, J., Khandurina, J., Trawick, J.D., Osterhout, R.E., Stephen, R., Estadilla, J., Teisan, S., Schreyer, H.B., Andrae, S., Yang, T.H., Lee, S.Y., Burk, M.J., Van Dien, S., 2011. Metabolic engineering of Escherichia coli for direct production of 1,4-butanediol. Nat. Chem. Biol. https://doi.org/10.1038/nchembio.580

Figure legends

Figure 1: Schematic of the overall workflow followed to design biochemical pathways using RetroPath2.0 and ReactPRED. The input to both software tools included source or target compounds, sink compounds, and reaction rules developed from biochemical databases and the genome-scale E. coli metabolic model, iJO1366. The generated outputs in Excel (RetroPath2.0) and SMILES (ReactPRED) formats were manually curated and extensively analysed with the MetaNetX, PubChem, MetaCyc, and KEGG databases, as well as with the online CDK Depicter tool, to remove non-natural compounds and assign EC numbers to accepted reactions (see text for details).

Figure 2: Total number of pathways generated by the ReactPRED software. (A) shows the number of pathways generated synthetically using acetate, glucose, and pyruvate as starting compounds, and (B) shows the number of retrosynthetic pathways generated using 1,2-propanediol, phenol, and benzene as starting compounds. The total number of compounds generated at each pathway length is also illustrated.

Figure 3: The number of accepted and discarded pathways to the target chemicals obtained from synthetic and retrosynthetic runs of the ReactPRED software is shown in (A) and (B), respectively. The thermodynamically feasible pathways were subjected to further pathway pruning (see materials and methods). The accepted pathways are the ones that passed the curation criteria, while the discarded pathways failed to pass them (see materials and methods).

Figure 4: Illustration of the results generated by the RetroPath2.0 software. (A) shows the number of reactions and compounds generated, as well as the EC numbers assigned to generated reactions using this software. Notably, 76% of the total reactions were assigned an EC number at iteration 3 (represented by the green bar). (B) shows the number of accepted and discarded reactions generated at each iteration of RetroPath2.0. The predicted reactions were subjected to further pruning (see materials and methods). The accepted pathways are the ones that passed the curation criteria, while the discarded pathways failed to pass them (see materials and methods).

Figure 5: Number of accepted pathways to target commodity chemicals from glucose, pyruvate, and acetate. Different categories of accepted pathways to phenol, benzene, and 1,2-propanediol are shown from the starting compounds: glucose, pyruvate, and acetate. Each of these pathways is thermodynamically feasible and passed the pathway curation criteria.

Figure 6: Examples of a few predicted pathways with assigned EC numbers. Each pathway contains at least one novel step (blue), while pathways 1, 3, 6, and 18 contain biologically known steps (green) as well. Pathways 22, 26, 45, 46, and 49 include only novel steps.

Figure 7: Analysis of the thermodynamic feasibility of the pathways shown in Figure 6. Pathways 1-3, 6, and 18 are shown in (A), and pathways 22, 26, 45-46, and 49 are shown in (B). The overall thermodynamic feasibility of each pathway was evaluated by estimating the standard Gibbs free energy of reaction (∆rG°) for each step, followed by combining the ∆rG° values for all relevant reactions in a pathway (see text for details).

Tables

Table 1: Number of thermodynamically feasible and synthetically generated pathways to target chemicals (1,2-propanediol, benzene, phenol) from synthetic starting compounds (acetate, pyruvate, glucose)

                  Thermodynamically feasible pathways to
Pathway Length    1,2-Propanediol    Benzene    Phenol
1                 0                  0          1
2                 1                  0          21
3                 2                  0          164

Table 2: Number of thermodynamically feasible and retrosynthetically generated pathways to target compounds (acetate, glucose, pyruvate) from retrosynthetic inputs (1,2-propanediol, benzene, phenol)

                  Thermodynamically feasible pathways to
Pathway Length    Acetate    Glucose    Pyruvate
1                 0          1          0
2                 2220       3580       57

[Figure 1: workflow schematic. Inputs: source compounds and sink compounds. Tools: RetroPath2.0 (output in Excel format) and ReactPRED (output in SMILES plain text format). Curation stages shown: MetaNetX compound finder, PubChem database, MetaCyc and KEGG databases, manual curation, online CDK Depicter tool, and EC number assignment.]

[Figure 2: bar charts of the number of pathways against pathway length. (A) Acetate, glucose, and pyruvate outputs at pathway lengths 1-3. (B) 1,2-propanediol, phenol, and benzene outputs at pathway lengths 1-2.]

[Figure 3: bar charts of the number of accepted and discarded pathways. (A) Pathways to 1,2-propanediol and phenol. (B) Pathways for each pairing of glucose, acetate, or pyruvate with 1,2-propanediol, benzene, or phenol.]

[Figure 4: (A) Number of total reactions, assigned reactions, and total compounds at iterations 1-7. (B) Number of accepted and discarded reactions at iterations 1-6.]

[Figure 5: bar chart of the number of accepted pathways in each category: 1-step glucose to phenol, 2-step glucose to phenol, 2-step glucose to benzene, 2-step glucose to 1,2-propanediol, 2-step pyruvate to 1,2-propanediol, 2-step pyruvate to phenol, 2-step acetate to 1,2-propanediol, 3-step acetate to 1,2-propanediol, and 2-step acetate to phenol.]

[Figure 6: reaction schemes for pathways 1, 2, 3, 6, 18, 22, 26, 45, 46, and 49, with novel and known reactions marked and EC numbers assigned to each step.]

[Figure 7: ∆rG° (kcal/mol) against pathway length. (A) Pathways 1, 2, 3, 6, and 18. (B) Pathways 22, 26, 45, 46, and 49.]