<<

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

Article 1

Hierarchical Harmonization of Atom-Resolved Metabolic Reac- 2

tions Across Metabolic Databases 3

Huan Jin 1, and Hunter N. Moseley 2, 3, 4, 5, * 4

1 Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA; 5 [email protected] 6 2 Department of Molecular & Cellular Biochemistry, University of Kentucky, Lexington, KY 40536, USA 7 3 Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA 8 4 Superfund Research Center, University of Kentucky, 40506 Lexington, KY, USA 9 5 Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40536, USA 10 11 * Correspondence: [email protected]; Tel.: 859-218-2964 12

Abstract: Metabolic models have been proven to be useful tools in system biology and have been 13 successfully applied to various research fields in a wide range of organisms. A relatively complete 14 metabolic network is a prerequisite for deriving reliable metabolic models. The first step in con- 15 structing metabolic network is to harmonize compounds and reactions across different metabolic 16 databases. However, effectively integrating data from various sources still remains a big challenge. 17 Incomplete and inconsistent atomistic details in compound representations across databases is a 18 very important limiting factor. Here, we optimized a subgraph isomorphism detection algorithm to 19 validate generic compound pairs. Moreover, we defined a set of harmonization relationship types 20 between compounds to deal with inconsistent chemical details while successfully capturing atom- 21

level characteristics, enabling a more complete enabling compound harmonization across metabolic 22

databases. In total, 15,704 compound pairs across KEGG (Kyoto Encyclopedia of and Ge- 23 nomes) and MetaCyc databases were detected. Furthermore, utilizing the classification of com- 24 pound pairs and EC ( Commission) numbers of reactions, we established hierarchical rela- 25 Copyright: © 2021 by the authors. Sub- tionships between metabolic reactions, enabling the harmonization of 3,856 reaction pairs. In addi- 26 mitted for possible open access publica- tion, we created and used atom-specific identifiers to evaluate the consistency of atom mappings 27 tion under the terms and conditions of within and between harmonized reactions, detecting some consistency issues between the reaction 28 the Creative Commons Attribution (CC and compound descriptions in these metabolic databases. 29 BY) license (http://creativecom- mons.org/licenses/by/4.0/). Keywords: metabolite; compound harmonization; reaction harmonization; metabolic network; met- 30 abolic model; subgraph isomorphism. 31 32

1. Introduction 33 Metabolic models describe the inter-conversion of metabolites via biochemical reac- 34 tions catalyzed by , providing snapshots of the under a given genetic 35 or environmental condition [1,2]. Metabolic models of metabolism have proven to be an 36 important tool in studying systems biology and have been successfully applied to various 37 research fields, ranging from metabolic engineering to system medicine [3-7]. Advances in 38 analytical methodologies like mass spectroscopy and nuclear magnetic resonance greatly 39 improve the high-throughput detection of thousands of metabolites, enabling the genera- 40 tion of large volumes of high-quality metabolomics datasets [8, 9] that greatly facilitate 41 metabolic research. As a next major step, incorporating reaction atom-mappings into met- 42 abolic models enables metabolic flux analysis of isotope-labeled metabolomics datasets 43 [10-13], which will contribute to the large-scale characterization of metabolic flux molecu- 44 lar phenotypes and prediction of potential targets for manipulation [4]. Building 45

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

2 of 29

reliable metabolic models heavily depend on the completeness of metabolic network data- 46 bases. However, a relatively complete metabolic network, especially at an atom-resolved 47 level, is practically not available [14]. 48 Therefore, to construct an atom-resolved metabolic network, the very first major step 49 is to integrate metabolic data from various metabolic databases without redundancy [15], 50 which remains extreme labor-intensive. This is partially due to problems in the individual 51 databases [16]. Common issues include non-unique compound identifiers, reactions with 52 unbalanced atomic species, and enzyme catalyzing more than one reaction [17]. Moreover, 53 incompatibilities of data representations (like compound identifiers) and incomplete at- 54 omistic details (like the presence of R groups and lack of atom and bond stereochemistry) 55 across databases are key bottlenecks for the rapid construction of high-quality metabolic 56 networks [18]. Great efforts have been made to map different compound identifiers across 57 metabolic databases [19, 20]. Some algorithms use logistic regression to compute the simi- 58 larity between strings generated by concatenating a variety of compound features, which 59 requires careful selection of compound features that can well characterize a string pair by 60 capturing the similarity between different variations as well as underlining the difference 61 between descriptions which are not synonymous [6]. Alternatively, utilization of unique 62 chemical identifier independent from a particular database, like InChI [21, 22] or SMILES 63 [23], have been suggested as an important step in harmonizing metabolic databases [24]. 64 However, several tricky cases still remain unresolved. For example, InChI cannot handle 65 the compound entries that contain R-groups. 66 Our neighborhood-specific graph coloring method can derive atom identifiers for 67 every atom in a specific compound with consideration of molecular symmetry, facilitating 68 the construction of an atom-resolved metabolic network [25]. Furthermore, a unique com- 69 pound coloring identifier can be generated based on the atom identifiers, which can be 70 used for compound harmonization across metabolic databases. The results derived from 71 the compound coloring identifiers were quite promising. However, issues like incomplete 72 atomistic details were not completely handled in that prior work. 73 In this paper, we further optimized the subgraph isomorphism detection algorithm 74 CASS (Chemically Aware Substructure Search) [26] to aid in the validation of generic com- 75 pound pairs. In addition, we solved inconsistent atomistic characteristics across databases 76 by defining a set of harmonization relationship types between compounds, aiming to cap- 77 ture chemical details while maintain compound pairs at various levels. Furthermore, we 78 used the classification of compound pairs and EC (Enzyme Commission) numbers to har- 79 monize metabolic reactions across Kyoto Encyclopedia of Genes and Genomes (KEGG) 80 and MetaCyc metabolic pathway databases via establishing hierarchical harmonization re- 81 lationships between metabolic reactions. We further made use of the atom identifiers to 82 evaluate atom mapping consistency of these harmonized reactions. Through this analysis, 83 we detected some issues that cause the inconsistency of reaction atom mappings both 84 within and across databases. The generalization of metabolic reactions can be applied to 85 various interesting topics including but not limited to predicting biotransformation of 86 newly discovered metabolites[27], devising novel synthetic pathways of essential metabo- 87 lites[28], and bridging gaps in the current metabolic network[29]. Furthermore, expanding 88 the existing metabolic network by integrating other metabolic databases can be easily 89 achieved when the molfile representations [30] of compounds are provided. 90

2. Results 91 2.1. Overview of KEGG and MetaCyc databases 92 The compounds in the KEGG and MetaCyc databases are summarized in Table 1. 93 Based on the atomic composition, we divided compounds into two groups: specific com- 94 pounds (no R group) and generic compounds (with presence of R group(s)). About 8.02% 95 KEGG compounds and 21.72% MetaCyc compounds contain R groups. 96

Table 1. Summary of KEGG and MetaCyc compound databases. 97

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

3 of 29

Compound Type KEGG MetaCyc specific compounds 16529 (91.98%) 15859 (78.28%) generic compounds 1441 (8.02%) 4400 (21.72%) Total 17970 (100%) 20259 (100%) 98 According to the classification of compounds, we also categorized the atom-resolved 99 metabolic reactions into two sets: specific reactions where all compounds in the reaction are 100 specific compounds and generic reactions which contain at least one generic compound. Here, 101 we only considered reactions with relatively complete EC numbers [31, 32], since 102 consistent EC number is one essential component in reaction harmonization. From Table 103 2, we can see that about 10% KEGG reactions and 20% MetaCyc reactions are generic 104 reactions. 105

Table 2. Summary of KEGG and MetaCyc atom-resolved metabolic reaction databases. 106 Reaction Type KEGG MetaCyc specific reactions (4-leveled EC) 6780 (75.26%) 6397 (49.93%) specific reactions (3-leveled EC) 886 (9.83%) 2022 (15.78%) generic reactions (4-leveled EC) 1244 (13.81%) 3572 (27.88%) generic reactions (3-leveled EC) 99 (1.10%) 822 (6.42%) Total 9009 (100%) 12813 (100%) 107 2.2. Results of compound harmonization across KEGG and MetaCyc databases 108 With the loose compound coloring identifiers generated by the neighborhood- 109 specific graph coloring method, about 8865 compound pairs were detected, including 110 both generic and specific compound pairs [25]. However, some cases were not solved 111 perfectly by the loose compound coloring identifiers. First, chemical details like atom and 112 bond stereochemistry were ignored in the loose compound coloring identifies. Second, a 113 compound pair can involve a generic compound and a specific compound (Figure 1), 114 which can not be discovered by the loosing compound coloring identifiers. The methyl 115 group in KEGG compound C01042 can be a specification of the R group in MetaCyc 116 compound CPD-576. What makes things more complicated is that a compound pair can 117 be composed of two generic compounds with different atom composition. In Figure 2, 118 even though both compounds contain R group, the MetaCyc compound 3-Acyl-pyruvates 119 can be regarded as a subgroup of compounds belonging to KEGG compound C00060. In 120 addition, compound pairs with different structural representations, like tautomers, were 121 missed by the loose compound coloring identifiers. 122

123 124 KEGG: C01042 MetaCyc: CPD-576 125 Figure 1. Compound pair of generic and specific compounds. 126 127

128 KEGG: C00060 MetaCyc: 3-Acyl-pyruvates

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

4 of 29

129 Figure 2. Compound pair of generic compounds. 130

2.2.1. Harmonization of specific compounds. 131 We first incorporated the chemical details, including atom stereochemistry and bond 132 stereochemistry to evaluate the specific compound pairs detected by loose compound col- 133 oring identifiers. Incorporation of the chemical details can lead to three scenarios: 1) the 134 paired compounds have the same set of chemical details; 2) the chemical details of one 135 compound are the subset of the other compound; 3) the chemical details of the two com- 136 pounds cannot be fully matched. Based on the above cases, we decided to classify the 137 relationship between compound pairs as an equivalence relationship, a generic-specific 138 relationship, or a loose relationship. With this classification, a compound in one database 139 can be paired with multiple compounds in the other database with an appropriate rela- 140 tionship. With these improvements incorporated into specific compound harmonization 141 (Table 3), we can see that the majority of specific compound pairs have a loose relation- 142 ship, which is not surprising since the criteria for the loose relationship were less strict. 143 Another explanation is that the chemical details for the same compound can be incon- 144 sistent across databases. The MetaCyc compound CPD-399 has a direct KEGG compound 145 reference C03495 (Figure 3), but stereochemistry of some atoms in the two compound rep- 146 resentations are not the same. 147

Table 3. Harmonization of specific compound pairs. 148 Relationship Type Count equivalence 3636 (27.99%) generic-specific 1712 (13.18%) loose 7642 (58.83%) Total 12990 (100%) 149

150 151 KEGG: C03495 MetaCyc: CPD-399 Figure 3. Example harmonized compound pair of compounds with inconsistent chemical details. 152

2.2.2. Harmonization of generic compounds. 153 Generic compounds further complicate relationships between compounds. A generic 154 compound can be related to generic and/or specific compounds (Figure 1 & 2). For a com- 155 pound pair of two generic compounds with the same atom composition, we classify them 156 based on the same criteria of specific compounds. Harmonization of generic compound pairs 157 of compounds with different chemical formulas is much more complicated, involving de- 158 tection and validation steps. All chemical identifiers fail in detecting the possible pairs, 159 including the loose compound coloring identifiers. On the other hand, it will be very time- 160 consuming and unnecessary to do brute-force search of all compounds across databases. 161 Here, we made use of the metabolic reactions across databases to detect the possible 162 compound pairs with a different atom composition. We first extracted reaction pairs that 163 can contain at least one generic reaction and share at least one EC number. Next, com- 164 pounds with R group(s) in one reaction were paired with all the compounds in the other 165 reaction. The validation method is described in the Materials and Methods section. Results 166

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

5 of 29

of harmonization are summarized in Table 4. Most of the generic compound pairs have ge- 167 neric-specific relationships. This may be explained by the assumption that chemical de- 168 tails in a compound with less atoms are more likely to be included in the compound con- 169 taining more atoms. 170

Table 4. Harmonization of generic compounds. 171 Relationship Type Count equivalence 126 (4.72%) generic-specific 2543 (95.28%) loose 0 Total 2669 (100%) 172 2.2.3. Harmonization of compounds with changeable representations. 173 Harmonization of compounds with changeable representation (e.g. linear vs circular 174 sugar representations) also requires detection and validation. Again, metabolic reactions 175 were used to detect the possible compound pairs via an iterative approach (see Method 176 and Materials 4.4). Two criteria should be obeyed when extracting the reaction pairs: 1) 177 the two reactions should share at least one EC number; 2) at least a pair of compounds in 178 the two reactions can be matched. For those unmatched compounds with the same chem- 179 ical formula, they will be added to the possible list. The validation methods are described 180 in the Materials and Methods section. About 45 such compound pairs were discovered 181 after two rounds of iteration (Table 5). 182 183 Table 5. Harmonization of compounds with changeable representations. 184 Relationship Type Count equivalence 20 (44.44%) generic-specific 0 loose 25 (55.56%) Total 45 (100%) 185 2.2.4. Summary of compound harmonization. 186 All compound pairs detected above were summarized in Table 6. In total, 15,704 187 compound pairs were discovered, and more than 80% of them were specific compound 188 pairs, roughly in agreement with the proportion of generic compounds in the database. 189 More importantly, about 2,669 generic compound pairs were detected, which cannot be 190 achieved by any existing chemical identifier. 191 192 Table 6. Summary of compound harmonization between KEGG and MetaCyc. 193 Compound Pair Type Count specific compound pairs 12990 (82.72%) generic compound pairs 2669 (16.99%) changeable compound pairs 45 (0.29%) Total 15704 (100%) 194 2.3. Results of reaction harmonization across KEGG and MetaCyc databases 195 With the harmonized compounds, we performed reaction harmonization across 196 KEGG and MetaCyc databases. Two criteria should be followed in reaction harmoniza- 197 tion: 1) the two reactions should share at least an EC number; and 2) all compounds in the 198 two reactions should be paired unless one reaction has an extra compound entity, like H+. 199 Reaction pairs were further categorized into the following three relationship types based 200

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

6 of 29

on the classification of their compound pairs: 1) equivalence relationship when a reaction 201 pair included only equivalently paired compounds; 2) generic-specific relationship when 202 a reaction pair only included equivalently paired compounds and at least one generic- 203 specific compound pair that are consistently in the same general-to-specific direction; 3) 204 loose relationship when a reaction pair included loosely paired compounds or generic- 205 specific paired compounds with inconsistent general-to-specific direction. 206 We first harmonized the specific metabolic reactions where both reactions are specific 207 reaction (Table 7). We can see that reaction pairs in group 3 take up more than 70%, which 208 is quite consistent with the classification of specific compound pairs. About 60% of specific 209 compound pairs are loosely matched (Table 3), and a reaction pair only requires one 210 loosely matched compound pair to be classified into group 3. 211 212 Table 7. Harmonization of specific reactions between KEGG and MetaCyc. 213 Relationship Type Count equivalence 718 (24.00%) generic-specific 68 (2.27%) loose 2205 (73.72%) Total 2991 (100%) 214 We also analyzed the generic reaction pairs where at least one reaction is generic. 215 Above 70% generic reaction pairs are in group 2 (Table 8), which can also be well explained 216 by the previous result that around 95% generic compound pairs have a generic-specific 217 relationship. 218 219 Table 8. Harmonization of generic reactions between KEGG and MetaCyc. 220 Relationship Type Count equivalence 29 (6.03%) generic-specific 344 (71.51%) loose 108 (22.45%) Total 481 (100%) 221 Since the EC information is not very complete in both databases, some reaction pairs 222 can be ignored due to the mismatch or miss of the last level EC. To avoid missed pairs, we 223 relaxed the first criterion in reaction harmonization to “the two reactions should have at 224 lease a pair of EC numbers that share the first 3 levels”. The newly discovered reaction 225 pairs are summarized in Table 9, including both specific and generic reaction pairs. Either 226 mismatch or miss of last EC occur in some reaction pairs. Specific examples are shown in 227 S Figure 1 & 2. 228 229 Table 9. Loose harmonization of reactions between KEGG and MetaCyc. 230 Relationship Type Count equivalence 49 (12.76%) generic-specific 96 (25.00%) loose 239 (62.24%) Total 384 (100%) 231 The results of reaction harmonization are shown in Table 10. Overall, 3,856 reaction 232 pairs were detected via EC numbers and integrated compound pairs. The majority of 233 reaction pairs are specific. About 10% of reactions pairs can be missed due to incomplete 234 and inconsistent EC numbers. 235 236

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

7 of 29

Table 10. Summary of reaction harmonization between KEGG and MetaCyc. 237 Reaction Pair Type Count specific 2991 (77.57%) generic 481 (12.47%) loose EC 384 (9.96%) Total 3856 (100%) 238 2.4. Comparison of KEGG RCLASS and RPAIR data 239 For the KEGG database, the RCLASS data describes the chemical transformation of 240 - in the RDM pattern [33]. A RDM description can be divided into three 241 parts: reaction center (R), the different region (D), and the matched region (M). In order to 242 distinguish functional groups and microenvironment of atoms, KEGG classified atomic 243 species of C, N, O, S, and P into 68 types (KEGG atom types) [34], which are implemented 244 in the RDM description. As shown in Figure 4, The RCLASS entry RC0003 contains one 245 RDM description. The S atom is the reaction center, the C1a in the first substructure 246 belongs to the different region, and those C1b atoms are in the matched region. Based on 247 the RDM pattern, we derived the atom mappings for specific reactant-product compound 248 pairs based on a common graph isomorphism search between the two compounds limited 249 by RDM description. We successfully parsed atom mappings for 10212 (out of 10313) 250 compound pairs. For complicated compound pairs with multiple reaction centers, each 251 reaction center can be mapped to several different atoms, which in a few instances causes 252 a serious combinatorial issue that is impossible to address in a reasonable amount of time. 253 An example is shown in Figure 5. Roughly 1013 possible cases can be derived based on the 254 RDM descriptions. In total, 25 compound pairs cannot be processed owing to this 255 combinatorial problem (S Table 1). Furthermore, there are 76 compound pairs that cannot 256 be deciphered due to the incorrect or missing descriptions of reaction centers. KEGG used 257 to archive the atom mappings between the reactant-product compound pairs in the RPAIR 258 database, where the mapped atoms are specified by the atom numbering for a compound 259 pair. Here, we evaluated the atom mappings derived from RCLASS and an older version 260 of KEGG RPAIR. The majority (great than 85%) of atom mappings between RCLASS and 261 RPAIR are the same (Table 11). To further validate the results, we calculated the fraction 262 of atom mappings with changed local bonded chemical environment across the mapping 263 (i.e. atom mappings with changed one-bond atom color) in terms of the total number of 264 mapped atoms in the reaction. The expectation is that this fraction represents the fraction 265 of reaction center atoms present where a chemical bond is changed or broken. Then we 266 generated a scatter plot of changed local atom color fraction for KEGG RPAIR versus 267 RCLASS atom mappings. From Figure 6, we can see that more atom mappings derived 268 from RPAIR have a higher ratio of changed atoms. We figured out that the majority of the 269 inconsistency is due to the interchangeable mappings of resonant atoms, like the O atoms 270 in the carboxyl group (S Figure 3). After further curation (Table 12), about 95% compound 271 pairs have the same atom mappings. The remaining 422 inconsistent mappings appear to 272 come from two different issues. One, more than 40% of the remaining inconsistent 273 mappings (180 out of 422) are likely caused by the updating of the KCF (molfile like) files 274 or associated molfiles in KEGG database from continual curation. For example, the RDM 275 description for compound pair C01255_C02378 has been updated in the RCLASS (S Figure 276 4). Two, we found that a compound representation can vary across different compound 277 pairs. Therefore, we hypothesize that a lack of synchronization between compound and 278 compound_pair representations over time has caused the observed atom mapping 279 inconsistencies detected in most of the other 242 compound pairs. 280 281

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

8 of 29

282 Figure 4. Example of RCLASS entry (https://www.genome.jp/dbget-bin/www_bget?rc:RC00003) 283

284 A 285

286 287 B

288 Figure 5. RCLASS with combinatorial issue caused by multiple possible mappings of RDM descriptions. A) RDM 289 description of RCLASS RC02715 (https://www.genome.jp/dbget-bin/www_bget?rc:RC02715); B) Compound pair 290 C01054_C08626 that follows the RC02715 RDM pattern (https://www.genome.jp/Fig/reaction/R09910.gif). 291

292

Table 11. First-round evaluation of atom mappings of compound pairs between KEGG RCLASS 293 and RPAIR. 294 Condition Count same atom mappings 7540 (87.51%) inconsistent atom mappings 1251 (14.29%) Total 8755 (100%) 295

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

9 of 29

296 Figure 6. Scatter plot of changed one-bond atom color fraction for KEGG RCLASS versus RPAIR 297 atom mappings. 298

Table 12. Second-round evaluation of atom mappings of compound pairs between KEGG RCLASS 299 and RPAIR. 300 Condition Count same atom mappings 8333 (95.19%) inconsistent atom mappings 422 (4.8%) Total 8755 (100%) 301 2.5. Evaluation of atom mappings between KEGG and MetaCyc databases. 302 The atom mappings for each reaction in the MetaCyc database are specified based 303 on the atom numbering of each compound in their molfile representation. For the KEGG 304 database, we used the atom mappings for compound pairs parsed from the RCLASS 305 entries. Here, we evaluated the atom mappings in about 2500 specific reaction pairs with 306 the same compound representations (Table 13). About 87% of the reaction pairs have 307 consistent atom mappings between the two databases. A consistent example is shown in 308 Figure 7. 309 Table 13. Evaluation of atom mappings between KEGG and MetaCyc. 310 Condition Count same atom mappings 2143 (87.26%) inconsistent atom mappings 313 (12.74%) Total 2456 (100%) 311 312

313

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

10 of 29

4 4 6 5 6 5 7 3 1 3 7 2 1 2 9 8 9 8

314 315 Figure 7. Reaction pair MetaCyc 1.1.1.168-RXN (https://metacyc.org/META/NEW- 316 IMAGE?object=1.1.1.168-RXN&&redirect=T) and KEGG R03155 317 (https://www.genome.jp/entry/R03155) with the same atom mappings. 318 319 We also generated a scatter plot of changed atom color fraction between paired 320 KEGG and MetaCyc reactions to further evaluate the results (Figure 8). The number of 321 mapped atoms in the paired reactions are not always the same. For some reactions, only 322 part of the compounds are mapped in either database. In addition, KEGG RCLASS 323 database maps atoms at a compound level. For MetaCyc, atoms are normally mapped at a 324 reaction level. Therefore, the plot of the changed ratio can deviate from the diagonal. From 325 Figure 8, we can see that the MetaCyc reactions have a higher ratio of changed atoms. 326 Through these comparisons, we see that both databases can contain distinct issues 327 with their atom mappings. Some MetaCyc reactions can have incorrect atom mappings. 328 An example is shown in Figure 9. For some KEGG reactions with single compound 329 involving in several compound pairs, one atom can be mapped to multiple atoms and 330 leave some atoms unmatched. For the KEGG reaction R10579 shown in Figure 10A, based 331 on the RDM descriptions in the two compound pairs (Figure 10B & 10C), atom 1 in 332 compound C00251 is mapped to atom 1 in compound C00022 and atom 1 in compound 333 C00578, leaving atom 2 in C00251 unmapped. Compared with the corresponding MetaCyc 334 reaction RXN-14940 (Figure 11), the RDM description of KEGG RCLASS RC03212 appears 335 incorrect. The harmonized reactions with different atom mappings are shown in S Table 336 2. 337

338 Figure 8. Scatter plot of changed one-bond atom color fraction for KEGG versus Meta- 339 Cyc atom mappings. 340

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

11 of 29

341

342

Figure 9. Example of MetaCyc reaction with incorrect atom mappings 343 (https://metacyc.org/META/NEW-IMAGE?object=1.13.11.45-RXN&&redirect=T). 344

345 A

1

2 1

1

B 346

347 C

348

Figure 10. Example of KEGG reaction with incorrect mappings. A) KEGG reaction R10579 349 (https://www.kegg.jp/entry/R10597); B) KEGG RCLASS RC02148 (https://www.kegg.jp/en- 350 try/RC02148); C) KEGG RCLASS 03212 (https://www.kegg.jp/entry/RC03212). 351

352

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

12 of 29

353

Figure 11. Atom mappings of MetaCyc reaction RXN-14940 (https://metacyc.org/META/NEW- 354 IMAGE?object=RXN-14940&&redirect=T). 355

3. Discussion 356 Effective integration of compound and reaction from various sources is hard to 357 achieve due to incomplete and inconsistent atom-level and bond-level details, like R 358 groups and stereochemistry, across databases. First, we categorized compounds into spe- 359 cific and generic compounds based on the presence of R groups. Meanwhile, metabolic 360 reactions were classified into specific and generic reactions according to the presence of 361 generic compounds. To overcome inconsistent atomistic characteristics, a set of relation- 362 ships between compounds were defined to both keep chemical details and conserve com- 363 pound pairs at various levels. According to the degree of consistency, compound pair re- 364 lationships are classified into three types: equivalence, generic-specific, and loose relation- 365 ships. The majority (around 60%) of specific compound pairs have loose relationships, 366 confirming the inconsistent issues in the databases to some extent. To our knowledge, no 367 chemical identifiers can be used to harmonize generic compounds across databases. Here, 368 we further optimized a subgraph isomorphism detection algorithm to validate generic 369 compound pairs. We first made use of the metabolic reactions to discover possible generic 370 compound pairs. After validation, 2,669 generic compound pairs remained. In addition, 371 we developed pragmatic methods to validate tautomers and compounds with linear and 372 circular representations. We discovered 45 compound pairs of compounds with the same 373 but fundamentally different structures, for example linear versus circu- 374 larized chemical representations. In total, 15,704 harmonized compound pairs were de- 375 tected. Next, we mapped atom-resolved metabolic reactions across KEGG and MetaCyc 376 via compound pairs and EC numbers. Reaction pairs were also catalogued into hierar- 377 chical relationships in accordance with the classification of compound pairs. About 3,856 378 harmonized reaction pairs were detected, and 10% of them can be missed by mismatched 379 EC numbers, strongly suggesting that curation of EC numbers is of great importance in 380 reaction harmonization. 381 Furthermore, we made use of the atom identifiers derived from our neighborhood- 382 specific graph coloring method to evaluate the consistency of atom mappings across har- 383 monized reactions. About 87% of reaction pairs have consistent atom mappings. We also 384 determined that both databases contain issues leading to inconsistency. For example, at- 385 oms in some MetaCyc reactions are not mapped correctly. For KEGG, we detected unsyn- 386 chronized atom numbering in the older KEGG RPAIR representation, which is likely the 387 reason that KEGG removed RPAIRS from their public version of the database. In contrast, 388 the KEGG RCLASS provides a concise RDM representation of reaction atom mappings 389 between a reaction-product compound pair, which appears highly resistant to consistency 390 errors. This resistance to consistency error is due to a decoupling of the atom mappings 391 from the specific atom order in the molfile representations. This allows the molfile rep- 392 resentations to be minorly updated without having to update the RDM descriptions. 393 However, there are also some issues with RDM descriptions. About 76 compound pairs 394 cannot be parsed due to the incorrect description of reaction centers, and parsed com- 395 pound pairs can be unreasonable at reaction level. In addition, a few KEGG RCLASS en- 396 tries are computationally difficult to decipher due to a combinatorial issue caused by the 397 several factors: multiple reaction centers in a single reaction, symmetric compounds, and 398 reaction descriptions involving multistep reactions. This combination of factors 399

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

13 of 29

introduces a large number of possibilities with matching a list of RDM descriptions to 400 specific reaction center atoms. One way to prevent this combinatorial problem is to repre- 401 sent multiple reaction center atoms with their associated difference atoms and match at- 402 oms within a paired substructure representation instead of a list of RDM descriptions. 403 Figure 10B illustrates this paired substructure representation for the KEGG RCLASS 404 RC02148. This kind of paired RDM substructure representation would allow the use of an 405 efficient subgraph isomorphism detection method to derive the atom mappings and could 406 be represented as a pair of molfiles along with a mapping of atoms between the two mol- 407 files, all stored within a single sdfile. Also, our compound harmonization method for har- 408 monizing changeable compound pairs would be useful for updating the paired RDM sub- 409 structures when the compound representations dramatically change. 410 In addition, the methods we developed can be easily applied to integrate other met- 411 abolic databases that provide molfile representations of compounds, facilitating the ex- 412 pansion of the existing metabolic networks. Moreover, this hierarchical framework for 413 relating compounds and reactions is a possible first step towards creating a systematic 414 organization of all reaction descriptions at a desired chemical specificity to fit a given ap- 415 plication. Such a systematic organization of reaction descriptions would augment the 416 current system and be useful to a wide range of possible 417 applications from metabolic modeling, metabolite and reaction prediction, and network 418 incorporation of newly discovered metabolites. 419

4. Materials and Methods 420 4.1. Compound and Metabolic Reaction Data 421 All data were downloaded directly from KEGG and MetaCyc databases. MetaCyc 422 compound and reaction data downloaded from BioCyc is in version 23.0. The KEGG 423 COMPOUND, KEGG REACTION and KEGG RCLASS data is from the version available 424 from KEGG on April 2021 via its REST interface. KEGG RPAIR data was downloaded 425 from KEGG database in 2016. 426 427 4.2. Curation of molfile 428 The documentation of atom stereochemistry in the molfiles is not complete. We used 429 Open Babel [35] to curate the original molfiles and add stereospecific information. 430 431 4.3. Identification of double bond stereochemistry 432 We previously adopted a method for automated identification of double bond 433 stereochemistry [36]. One limitation of this method is that only double bonds between two 434 carbon atoms can be handled. For example, double bonds connected by heterogenous 435 atoms, like N=C, cannot be processed by the method. Here, we designed a new algorithm 436 to distinguish cis/trans stereoisomers. The same criteria are applied to assign priority to 437 each group attached to the double bond. If one side of the double bond only has one 438 group, this group will be prioritized. Next, the 2D plane of the compound representation 439 is divided into two parts with line crossing the double bond. If the prioritized groups of 440 both sides are on the same part of the divided plane, the double bond is labeled as cis; 441 otherwise, it is trans. 442 443 4.4. Flowchart of steps in the compound and reaction harmonization process 444 The flowchart of steps in compound and reaction harmonization is shown in Figure 445 12. The initial compound pair list is composed of compound pairs detected by the loose 446 compound coloring identifiers. Next, reaction harmonization is conducted with the 447 compound pair list. Two criteria are obeyed in reaction harmonization: the two reactions 448 should share at least one EC number and all compounds in the two reactions are paired 449 unless one reaction has an extra compound entity, like H+. Apart from valid reaction pairs, 450

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

14 of 29

reaction pairs with the same EC number and some unmatched compounds are also 451 extracted. We hypothesized that those unmatched compounds are likely to be compound 452 pairs. Validation is conducted for the unmatched compounds, and the valid compound 453 pairs are added to the compound pair list. Every time the compound pair list is updated, 454 the above process is repeated until no new compound pairs are discovered. 455 456

Compound pair list

Reaction harmonization

Valid reaction pairs: same EC Possible reaction pairs: same EC with with paired compounds. some unmatched compounds

Possible compound pairs

Validation of compound pairs 457 Figure 12. Flowchart of compound and reaction harmonization. 458

4.5. Validation of tautomers. 459 Most common form of tautomerization involves a hydrogen changing places with a 460 double bond. Based on this transformation, the following steps are performed to validate 461 if two compounds with same chemical formula are tatutomers. To eliminate the difference 462 caused by single and double bonds in the structural representation, all the double bonds 463 are converted into single bonds, and the subgraph isomorphism detection algorithm [26] 464 is used to check if two structural representations are the same after modification. Next, 465 double bonds at unmatched positions are examined. If all the mismatches are caused by 466 possible tautomerization, the compound pair is considered valid. Finally, other chemical 467 details not related to atoms in the changeable positions are compared to classify the 468 relationship between valid pairs. 469 470 4.6. Validation of generic compound pairs of compounds with different chemical formula. 471 For two compounds A and B, the subgraph isomorphism detection algorithm [26] is 472 used to verify if the graph representation of A (ignoring R and H) is contained in the graph 473 representation of B. Then, each unmatched branch in B is examined if it corresponds to an 474 R group in A. Compound pairs that meet both criteria are considered valid. Next, the 475 chemical details (atom and bond stereochemistry) in the two compounds are compared 476 for relationship type classification. If the chemical details of compound A are included in 477 compound B, then A has a generic-specific relationship to B; otherwise, A and B have a 478 loose relationship. 479 480 4.7. Validation of compound pairs with linear and circular representations. 481 The compound with changeable linear and circular structures are common in small 482 molecule carbohydrate metabolites, like . This conversion occurs due to the ability 483 of aldehydes and ketones to react with alcohols. To validate the compound pairs with 484 linear and circular representations, we first locate the bond in the circular structure that is 485 formed by connecting the C in the aldehyde (keto) group and O in the hydroxy group. 486

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

15 of 29

The following steps include breaking the newly formed bond and restoring the C=O bond 487 in the aldehyde (keto) group. Then, a new compound coloring identifier is generated for 488 the modified circular representation. If the updated compound coloring identifiers match, 489 the compound pair is considered valid. 490 491 4.8. Parse of KEGG RCLASS RDM patterns. 492 Based on the RDM patterns, we first identified the possible atoms that can be 493 mapped to each reaction center. Then we derived the possible combinations of atoms for 494 all the reaction centers for each compound. We paired cases in either compound, removed 495 changed bond in the compound according to different region, and detected the maximum 496 common subgraph of the remaining structures. We examined all the combinations and 497 derived the optimal mappings with the maximum number of mapped atoms and least 498 ratio of changed atoms. 499

Supplementary Materials: Figure S1: Reaction pair with mismatch of last EC number, Figure S2: 500 Reaction pair with missing 4th-level EC number designation, Figure S3: Example of atoms with in- 501 terchangeable mappings, Figure S4: Comparison of KEGG RCLASS and RPAIR description for com- 502 pound pair C01255 and C02378, Table S1: Hardly interpretable compound pairs, Table S2: Harmo- 503 nized reactions with inconsistent atom mappings. 504 Author Contributions: H.J. and H.N.M. worked together on the design of the experiments and the 505 analysis of the results. H.J. wrote the manuscript and H.N.M. revised. All authors have read and 506 agreed to the published version of the manuscript. 507 Funding: The work was supported by grant NSF 2020026 (PI Moseley). 508 Data Availability Statement: All data used and the results generated in the manuscript are availa- 509 ble on: https://doi.org/10.6084/m9.figshare.14703999.v2 510 Conflicts of Interest: The authors declare no conflict of interest. 511

References 512

1. Pham, N.; van Heck, R. G. A.; van Dam, J. C. J.; Schaap, P. J.; Saccenti, E.; Suarez-Diez, M., Consistency, Inconsistency, and 513 Ambiguity of Metabolite Names in Biochemical Databases Used for Genome-Scale Metabolic Modelling. Metabolites 2019, 9 (2). 514 2. Ryu, J. Y.; Kim, H. U.; Lee, S. Y., Reconstruction of genome-scale human metabolic models using omics data. Integr Biol (Camb) 515 2015, 7 (8), 859-68. 516 3. Contreras, A.; Ribbeck, M.; Gutiérrez, G. D.; Cañon, P. M.; Mendoza, S. N.; Agosin, E., Mapping the Physiological Response 517 of Oenococcus oeni to Stress Using an Extended Genome-Scale Metabolic Model. Frontiers in Microbiology 2018, 9 (291). 518 4. Lee, D. S.; Burd, H.; Liu, J.; Almaas, E.; Wiest, O.; Barabasi, A. L.; Oltvai, Z. N.; Kapatral, V., Comparative genome-scale 519 metabolic reconstruction and flux balance analysis of multiple Staphylococcus aureus genomes identify novel antimicrobial drug 520 targets. J Bacteriol 2009, 191 (12), 4015-24. 521 5. Patil, K. R.; Akesson, M.; Nielsen, J., Use of genome-scale microbial models for metabolic engineering. Curr Opin Biotechnol 522 2004, 15 (1), 64-9. 523 6. Radrich, K.; Tsuruoka, Y.; Dobson, P.; Gevorgyan, A.; Swainston, N.; Baart, G.; Schwartz, J. M., Integration of metabolic 524 databases for the reconstruction of genome-scale metabolic networks. BMC Syst Biol 2010, 4, 114. 525 7. Zhang, C.; Hua, Q., Applications of Genome-Scale Metabolic Models in Biotechnology and Systems Medicine. Front Physiol 526 2015, 6, 413. 527 8. Creek, D. J.; Chokkathukalam, A.; Jankevics, A.; Burgess, K. E.; Breitling, R.; Barrett, M. P., Stable isotope-assisted 528 metabolomics for network-wide metabolic pathway elucidation. Anal Chem 2012, 84 (20), 8442-7. 529 9. Fan, T. W.; Lorkiewicz, P. K.; Sellers, K.; Moseley, H. N.; Higashi, R. M.; Lane, A. N., Stable isotope-resolved metabolomics 530 and applications for drug development. Pharmacol Ther 2012, 133 (3), 366-91. 531

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

16 of 29

10. Antoniewicz, M. R., Methods and advances in metabolic flux analysis: a mini-review. J Ind Microbiol Biotechnol 2015, 42 (3), 532 317-25. 533 11. Jin, H.; Moseley, H. N. B., Moiety modeling framework for deriving moiety abundances from mass spectrometry measured 534 isotopologues. BMC Bioinformatics 2019, 20 (1), 524. 535 12. Jin, H.; Moseley, H. N. B., Robust Moiety Model Selection Using Mass Spectrometry Measured Isotopologues. Metabolites 2020, 536 10 (3). 537 13. Young, J. D., INCA: a computational platform for isotopically non-stationary metabolic flux analysis. Bioinformatics 2014, 30 (9), 538 1333-5. 539 14. Chokkathukalam, A.; Kim, D. H.; Barrett, M. P.; Breitling, R.; Creek, D. J., Stable isotope-labeling studies in metabolomics: 540 new insights into structure and dynamics of metabolic networks. Bioanalysis 2014, 6 (4), 511-24. 541 15. Altman, T.; Travers, M.; Kothari, A.; Caspi, R.; Karp, P. D., A systematic comparison of the MetaCyc and KEGG pathway 542 databases. BMC Bioinformatics 2013, 14 (1), 112. 543 16. Ginsburg, H., Caveat emptor: limitations of the automated reconstruction of metabolic pathways in Plasmodium. Trends 544 Parasitol 2009, 25 (1), 37-43. 545 17. Poolman, M. G.; Bonde, B. K.; Gevorgyan, A.; Patel, H. H.; Fell, D. A., Challenges to be faced in the reconstruction of 546 metabolic networks from public databases. Syst Biol (Stevenage) 2006, 153 (5), 379-84. 547 18. Saha, R.; Chowdhury, A.; Maranas, C. D., Recent advances in the reconstruction of metabolic models and integration of omics 548 data. Curr Opin Biotechnol 2014, 29, 39-45. 549 19. Qi, X.; Ozsoyoglu, Z. M.; Ozsoyoglu, G., Matching metabolites and reactions in different metabolic networks. Methods 2014, 550 69 (3), 282-97. 551 20. van Heck, R. G.; Ganter, M.; Martins Dos Santos, V. A.; Stelling, J., Efficient Reconstruction of Predictive Consensus Metabolic 552 Network Models. PLoS Comput Biol 2016, 12 (8), e1005085. 553 21. Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I., InChI - the worldwide chemical structure identifier standard. 554 J Cheminform 2013, 5 (1), 7. 555 22. Heller, S. R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D., InChI, the IUPAC International Chemical Identifier. J 556 Cheminform 2015, 7, 23. 557 23. Weininger, D.; Weininger, A.; Weininger, J. L., SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of 558 Chemical Information and Computer Sciences 1989, 29 (2), 97-101. 559 24. Lieven, C.; Beber, M. E.; Olivier, B. G.; Bergmann, F. T.; Ataman, M.; Babaei, P.; Bartell, J. A.; Blank, L. M.; Chauhan, S.; 560 Correia, K.; Diener, C.; Dräger, A.; Ebert, B. E.; Edirisinghe, J. N.; Faria, J. P.; Feist, A.; Fengos, G.; Fleming, R. M. T.; García- 561 Jiménez, B.; Hatzimanikatis, V.; van Helvoirt, W.; Henry, C. S.; Hermjakob, H.; Herrgård, M. J.; Kim, H. U.; King, Z.; Koehorst, 562 J. J.; Klamt, S.; Klipp, E.; Lakshmanan, M.; Le Novère, N.; Lee, D.-Y.; Lee, S. Y.; Lee, S.; Lewis, N. E.; Ma, H.; Machado, D.; 563 Mahadevan, R.; Maia, P.; Mardinoglu, A.; Medlock, G. L.; Monk, J. M.; Nielsen, J.; Nielsen, L. K.; Nogales, J.; Nookaew, I.; 564 Resendis-Antonio, O.; Palsson, B. O.; Papin, J. A.; Patil, K. R.; Poolman, M.; Price, N. D.; Richelle, A.; Rocha, I.; Sanchez, B. J.; 565 Schaap, P. J.; Malik Sheriff, R. S.; Shoaie, S.; Sonnenschein, N.; Teusink, B.; Vilaça, P.; Vik, J. O.; Wodke, J. A.; Xavier, J. C.; 566 Yuan, Q.; Zakhartsev, M.; Zhang, C., Memote: A community driven effort towards a standardized genome-scale metabolic model test 567 suite. bioRxiv 2018, 350991. 568 25. Jin, H.; Mitchell, J. M.; Moseley, H. N. B., Atom Identifiers Generated by a Neighborhood-Specific Graph Coloring Method 569 Enable Compound Harmonization across Metabolic Databases. Metabolites 2020, 10 (9). 570 26. Mitchell, J. M.; Fan, T. W.-M.; Lane, A. N.; Moseley, H. N. B., Development and in silico evaluation of large-scale metabolite 571 identification methods using functional group detection for metabolomics. Frontiers in Genetics 2014, 5 (237). 572

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

17 of 29

27. Jeffryes, J. G.; Colastani, R. L.; Elbadawi-Sidhu, M.; Kind, T.; Niehaus, T. D.; Broadbelt, L. J.; Hanson, A. D.; Fiehn, O.; 573 Tyo, K. E. J.; Henry, C. S., MINEs: open access databases of computationally predicted products for untargeted 574 metabolomics. Journal of Cheminformatics 2015, 7 (1), 44. 575 28. Hadadi, N.; Hafner, J.; Shajkofci, A.; Zisaki, A.; Hatzimanikatis, V., ATLAS of Biochemistry: A Repository of All Possible 576 Biochemical Reactions for Synthetic Biology and Metabolic Engineering Studies. ACS Synth Biol 2016, 5 (10), 1155-1166. 577 29. Frainay, C.; Schymanski, E. L.; Neumann, S.; Merlet, B.; Salek, R. M.; Jourdan, F.; Yanes, O., Mind the Gap: Mapping Mass 578 Spectral Databases in Genome-Scale Metabolic Networks Reveals Poorly Covered Areas. Metabolites 2018, 8 (3), 51. 579 30. Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J., Description of several 580 chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information 581 and Computer Sciences 1992, 32 (3), 244-255. 582 31. Barrett, A. J., Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme 583 Nomenclature. Recommendations 1992. Supplement 4: corrections and additions (1997). Eur J Biochem 1997, 250 (1), 1-6. 584 32. McDonald, A. G.; Boyce, S.; Tipton, K. F., ExplorEnz: the primary source of the IUBMB enzyme list. Nucleic Acids Research 2008, 585 37 (suppl_1), D593-D597. 586 33. Kotera, M.; Okuno, Y.; Hattori, M.; Goto, S.; Kanehisa, M., Computational Assignment of the EC Numbers for Genomic-Scale 587 Analysis of Enzymatic Reactions. Journal of the American Chemical Society 2004, 126 (50), 16487-16498. 588 34. Hattori, M.; Okuno, Y.; Goto, S.; Kanehisa, M., Development of a chemical structure comparison method for integrated 589 analysis of chemical and genomic information in the metabolic pathways. J Am Chem Soc 2003, 125 (39), 11853-65. 590 35. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R., Open Babel: An open chemical 591 toolbox. Journal of Cheminformatics 2011, 3 (1), 33. 592 36. Teixeira, A. L.; Leal, J. P.; Falc„o, A., Automated Identification and Classification of Stereochemistry: Chirality and Double Bond 593 Stereoisomerism. ArXiv 2013, abs/1303.1724. 594 595 596

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

18 of 29

Supporting Information for 597

Hierarchical Harmonization of Atom-Resolved Metabolic Reac- 598 tions Across Metabolic Databases 599

Huan Jin 1, and Hunter N. Moseley 2, 3, 4, 5, * 600

1 Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA; [email protected] 601 2 Department of Molecular & Cellular Biochemistry, University of Kentucky, Lexington, KY 40536, USA 602 3 Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA 603 4 Superfund Research Center, University of Kentucky, 40506 Lexington, KY, USA 604 5 Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40536, USA 605 606 * Correspondence: [email protected]; Tel.: 859-218-2964 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

19 of 29

638 639

A

640

641 B

642

S Figure 1. Reaction pair with mismatch of last EC number. A) MetaCyc reaction 6.2.1.34-RXN with EC number 643 6.2.1.34 (https://metacyc.org/META/NEW-IMAGE?object=6.2.1.34-RXN&&redirect=T); B) KEGG reaction R02194 with EC 644 number 6.2.1.12 (https://www.genome.jp/entry/R02194). 645

646 647 A

648

649 B

650

S Figure 2. Reaction pair with missing 4th-level EC number designation. A) MetaCyc reaction ACETCAPR-RXN 651 with EC number 2.6.1.- (https://metacyc.org/META/NEW-IMAGE?object=ACETCAPR-RXN&&redirect=T); B) KEGG reac- 652 tion R04029 with EC number 2.6.1.65 (https://www.genome.jp/entry/R04029). 653

654

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

20 of 29

655 656 657 658 659 660 S Table 1. Hardly interpretable compound pairs. 661 RCLASS Compound pair RC02715 C01054_C08626 RC00871 C01051_C02463 RC01850 C00751_C06309 RC01579 C00751_C06083 RC01851 C00751_C06310 RC01582 C01054_C01902 RC02163 C00751_C08627 RC02496 C12354_C18337 RC01862 C01054_C08615 RC01616 C05773_C05774 RC02632 C01054_C08637 RC03124 C01054_C08797 RC02708 C01054_C17966 RC01863 C01054_C08616 RC02603 C01054_C19819 RC01864 C01054_C08628 RC02714 C01054_C20188 RC02619 C01054_C19833 RC02620 C01054_C19801 RC02621 C00751_C19834 RC01901 C02094_C15943 RC02716 C01054_C20189 RC02717 C01054_C20191 RC02720 C01054_C20194 RC02722 C01054_C20200 662 663

1 2

2 1

664 665 S Figure 3. Example of atoms with interchangeable mappings. KEGG RPAIR maps atom 1 in C03618 to atom 2 in C06030 and atom 666 2 in C03618 to atom 1 in C06030. 667 668 669 670 671 672 673

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

21 of 29

A

674 675 B

676 677 C

678

679 S Figure 4. Comparison of KEGG RCLASS and RPAIR description for compound pair C01255 and C02378 680 (https://www.kegg.jp/kegg-bin/rpair_image?entry=RC00090&cpair=C01255_C02378). A) Compound C01255 and C02378; 681 B) KEGG RCLASS description for the compound pair (https://www.genome.jp/dbget-bin/www_bget?rc:RC00090); C) 682 KEGG RPAIR description for the compound pair along with the atom mappings. 683

684

685 S Table 2. Harmonized reactions with inconsistent atom mappings. MetaCyc KEGG 1.13.11.45-RXN ['R05718'] 1.14.11.18-RXN ['R05722'] 12-ALPHA-L-FUCOSIDASE-RXN ['R04270'] 2.1.1.21-RXN ['R01586'] 2.3.1.56-RXN ['R04271'] 2.3.1.90-RXN ['R00049'] 2.4.1.121-RXN ['R03094']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

22 of 29

2.4.1.230-RXN ['R07264', 'R11398'] 2.4.1.54-RXN ['R07257'] 2.5.1.41-RXN ['R04158'] 2.5.1.42-RXN ['R04520'] 2.5.1.67-RXN ['R08948'] 2.6.1.50-RXN ['R02781'] 2.6.1.80-RXN ['R07277'] 2.7.1.121-RXN ['R01012'] 2TRANSKETO-RXN ['R01067', 'R01830'] 3.2.1.48-RXN ['R00801', 'R00802'] 3.3.2.8-RXN ['R05784'] 3.4.13.22-RXN ['R07651'] 3.5.2.18-RXN ['R07984'] 4-HYDROXYGLUTAMATE-AMINOTRANSFERASE- RXN ['R03266', 'R05052'] 4-HYDROXYPHENYLPYRUVATE-DIOXYGENASE-RXN ['R02521'] 4.1.3.26-RXN ['R08090'] 4.2.1.100-RXN ['R05597'] 4.2.3.10-RXN ['R02004'] 4.2.3.19-RXN ['R05092'] 4.2.3.20-RXN ['R02013', 'R06120'] ['R00347', 'R05758', 4.3.1.16-RXN 'R09683'] ['R00347', 'R05758', 4.3.1.20-RXN 'R09683'] 4.6.1.11-RXN ['R02311', 'R09897'] ['R00838', 'R00839', 6-PHOSPHO-BETA-GLUCOSIDASE-RXN 'R03256'] ACETYL-COA-ACETYLTRANSFER-RXN ['R00238'] ADENYL-KIN-RXN ['R00127'] ALDOSE-6-PHOSPHATE-REDUCTASE-NADPH-RXN ['R00834', 'R01817'] -BETA-GLUCOSIDASE-RXN ['R02985'] ARISTOLOCHENE-SYNTHASE-RXN ['R02307', 'R09574'] CATECHOL-OXIDASE-DIMERIZING-RXN ['R00080'] CONIFERIN-BETA-GLUCOSIDASE-RXN ['R02595'] DXPREDISOM-RXN ['R05688'] ETHANOLAMINE-PHOSPHATE-PHOSPHO-- RXN ['R00748'] ['R01068', 'R01069', F16ALDOLASE-RXN 'R01070'] FARNESYLTRANSTRANSFERASE-RXN ['R02061', 'R05555'] FORMALDEHYDE-TRANSKETOLASE-RXN ['R01440']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

23 of 29

['R02003', 'R08400', FPPSYN-RXN 'R08528'] GALPMUT-RXN ['R00505'] GERANYL-DIPHOSPHATE-CYCLASE-RXN ['R02007', 'R10509'] GLUCOSE-1-PHOSPHATE-PHOSPHODISMUTASE- RXN ['R00960'] GLUTAMATESYN-RXN ['R00114'] GLYCYRRHIZINATE-BETA-GLUCURONIDASE-RXN ['R03906'] HEMEOSYN-RXN ['R07411'] ISOCHORMAT-RXN ['R03037'] KETOLACTOSE-RXN ['R04783'] ['R00838', 'R00839', LACTOSE6P-HYDROXY-RXN 'R03256'] LANOSTEROL-SYNTHASE-RXN ['R03199'] ['R00838', 'R00839', MALTOSE-6-PHOSPHATE-GLUCOSIDASE-RXN 'R03256'] MALTOSE-SYNTHASE-RXN ['R00957'] METHYLMALONYL-COA-MUT-RXN ['R00833'] NITRATE-REDUCTASE-NADPH-RXN ['R00796'] O-AMINOPHENOL-OXIDASE-RXN ['R00074'] OHMETHYLBILANESYN-RXN ['R00084'] OLIGOGALACTURONIDE-LYASE-RXN ['R04382'] OXALOMALATE-LYASE-RXN ['R00477'] PENTALENENE-SYNTHASE-RXN ['R02305'] PHOSPHOENOLPYRUVATE-PHOSPHATASE-RXN ['R00208'] PROPIOIN-SYNTHASE-RXN ['R00038'] -BETA-GLUCOSIDASE-RXN ['R02558'] RAUCAFFRICINE-BETA-GLUCOSIDASE-RXN ['R03703'] RXN-10005 ['R09620'] RXN-10053 ['R10049'] RXN-10482 ['R09607', 'R10732'] RXN-10568 ['R10734'] RXN-10600 ['R09629'] RXN-10632 ['R09123'] RXN-10635 ['R09788'] RXN-10636 ['R09526'] RXN-10637 ['R09789'] RXN-10640 ['R09630'] RXN-10685 ['R08546'] RXN-10768 ['R05759'] RXN-10775 ['R09686'] RXN-11023 ['R09249']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

24 of 29

RXN-11141 ['R10908'] RXN-11223 ['R08807'] RXN-11224 ['R08808'] RXN-11225 ['R08809'] RXN-11226 ['R08810'] RXN-11298 ['R09588'] RXN-11485 ['R09246'] RXN-11490 ['R00039'] RXN-11501 ['R03634'] RXN-11502 ['R01103'] RXN-11747 ['R07247'] RXN-11760 ['R09533'] ['R00347', 'R05758', RXN-11767 'R09683'] RXN-11905 ['R09615'] RXN-11910 ['R09606'] RXN-12263 ['R00702'] RXN-12326 ['R09916'] RXN-12329 ['R09689'] RXN-12338 ['R09743'] RXN-12494 ['R10734'] RXN-12573 ['R10936'] RXN-12593 ['R09903'] RXN-12625 ['R00022'] RXN-12627 ['R09942'] ['R08541', 'R09889', RXN-12774 'R10273'] RXN-12823 ['R09888'] RXN-12824 ['R09886', 'R09890'] RXN-12832 ['R02311', 'R09897'] RXN-12836 ['R09900'] RXN-12843 ['R09908'] RXN-12846 ['R09914'] RXN-12893 ['R10079'] RXN-12983 ['R06421', 'R09968'] RXN-13002 ['R02007', 'R10509'] RXN-13074 ['R10010'] RXN-13144 ['R10050'] RXN-13218 ['R08910'] RXN-13291 ['R10194'] RXN-13335 ['R10269'] RXN-13337 ['R10009']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

25 of 29

RXN-13338 ['R10585'] RXN-13374 ['R00032'] RXN-13642 ['R10557', 'R10559'] RXN-13643 ['R10558'] RXN-13724 ['R02872'] RXN-13761 ['R10271'] RXN-13762 ['R10272'] ['R08541', 'R09889', RXN-13763 'R10273'] RXN-13769 ['R10275'] RXN-14015 ['R10284'] RXN-14092 ['R10579'] RXN-14281 ['R03802'] RXN-14282 ['R05142'] RXN-14283 ['R05141'] RXN-14573 ['R10581'] RXN-14574 ['R10582'] RXN-14633 ['R10530'] RXN-14820 ['R10527', 'R10539'] RXN-14821 ['R10487', 'R10542'] RXN-14822 ['R10527', 'R10539'] RXN-14823 ['R10487', 'R10542'] RXN-14900 ['R12095', 'R12250'] RXN-14901 ['R12095'] RXN-14921 ['R10623'] RXN-14922 ['R10624'] RXN-14939 ['R10583'] RXN-14940 ['R10597'] RXN-14980 ['R10580'] RXN-15224 ['R08616'] RXN-15289 ['R08127'] RXN-15419 ['R10968'] RXN-15425 ['R11087'] RXN-15605 ['R10987'] RXN-15706 ['R10897'] RXN-15708 ['R10899'] RXN-15945 ['R11040'] RXN-15946 ['R10935'] RXN-15961 ['R10380'] RXN-15964 ['R10370'] RXN-17129 ['R11410'] RXN-17138 ['R11501']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

26 of 29

RXN-17139 ['R11503'] RXN-17175 ['R11578'] RXN-17373 ['R11327'] RXN-17501 ['R07264', 'R11398'] RXN-17506 ['R11544'] RXN-17772 ['R11332'] RXN-1781 ['R00015'] RXN-18377 ['R11719'] RXN-18378 ['R11718'] RXN-18379 ['R11607'] RXN-18395 ['R11705'] RXN-18396 ['R09801'] RXN-18397 ['R09800'] RXN-18398 ['R11619'] RXN-18432 ['R11593'] RXN-18496 ['R11603'] RXN-18506 ['R12114'] RXN-18592 ['R11721', 'R11722'] RXN-18593 ['R11721', 'R11722'] RXN-18659 ['R11785'] RXN-18789 ['R10004'] RXN-18791 ['R08691'] RXN-18823 ['R09788'] RXN-18824 ['R09891'] RXN-18852 ['R10004'] RXN-18854 ['R09891'] RXN-18857 ['R11942'] RXN-18975 ['R11821'] RXN-18978 ['R11853'] RXN-18979 ['R11854'] RXN-18980 ['R11855'] RXN-18999 ['R11959'] RXN-19023 ['R11862'] RXN-19024 ['R11861'] RXN-19138 ['R07543'] RXN-19150 ['R07856'] RXN-19597 ['R12087'] RXN-19768 ['R10809'] RXN-20044 ['R08691'] RXN-2543 ['R07502'] RXN-2561 ['R07503'] RXN-3962 ['R00059']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

27 of 29

RXN-4441 ['R02727', 'R07265'] RXN-4781 ['R02307', 'R09574'] RXN-4823 ['R06523'] RXN-4882 ['R09115'] RXN-5101 ['R06421', 'R09968'] RXN-5106 ['R05765', 'R09974'] RXN-5107 ['R05766', 'R09975'] RXN-5109 ['R09962'] RXN-5110 ['R02009'] RXN-5121 ['R09961'] RXN-5123 ['R09962'] ['R02011', 'R09971', RXN-5142 'R09972'] RXN-5341 ['R10040'] RXN-7929 ['R00354'] RXN-8036 ['R04998'] RXN-8046 ['R07630'] RXN-8059 ['R11626'] RXN-8147 ['R05126'] RXN-8149 ['R05127'] RXN-8151 ['R03698'] RXN-8152 ['R09402'] ['R08541', 'R09889', RXN-8414 'R10273'] RXN-8415 ['R08373'] RXN-8416 ['R10584'] RXN-8417 ['R09614'] RXN-8418 ['R10599'] RXN-8422 ['R08695'] RXN-8424 ['R10598'] RXN-8425 ['R10284'] RXN-8426 ['R09895'] RXN-8512 ['R06303'] RXN-8540 ['R08542'] RXN-8541 ['R08540'] RXN-8546 ['R09970'] RXN-8549 ['R09619'] RXN-8551 ['R10006'] RXN-8562 ['R07648', 'R09558'] RXN-8563 ['R05765', 'R09974'] RXN-8564 ['R05766', 'R09975'] RXN-8565 ['R09963']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

28 of 29

RXN-8572 ['R10726'] RXN-8574 ['R08696'] RXN-8575 ['R09963'] ['R02011', 'R09971', RXN-8576 'R09972'] RXN-8587 ['R09886', 'R09890'] ['R09610', 'R09892', RXN-8588 'R10005'] RXN-8589 ['R10004'] ['R09610', 'R09892', RXN-8591 'R10005'] ['R09610', 'R09892', RXN-8592 'R10005'] ['R09610', 'R09892', RXN-8593 'R10005'] RXN-8594 ['R02311', 'R09897'] RXN-8599 ['R10004'] RXN-8600 ['R10004'] RXN-8601 ['R10008'] RXN-8602 ['R10006'] RXN-8603 ['R10584'] RXN-8604 ['R10007'] RXN-8608 ['R08691'] RXN-8609 ['R09886', 'R09890'] RXN-8621 ['R09607', 'R10732'] RXN-8650 ['R07597'] RXN-8651 ['R07597'] RXN-8731 ['R05166'] RXN-8801 ['R10555'] RXN-8813 ['R07475', 'R08748'] RXN-8931 ['R08696'] RXN-8939 ['R07648', 'R09558'] RXN-8958 ['R09292'] RXN-8992 ['R09248'] RXN-9030 ['R11463'] RXN-9106 ['R09249'] RXN-9138 ['R06447'] RXN-9140 ['R10937'] RXN-9349 ['R07830'] RXN-9412 ['R05784'] RXN-9413 ['R05784'] RXN-9456 ['R09714']

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license.

29 of 29

RXN-9457 ['R09715'] RXN-9464 ['R05784'] RXN-9588 ['R10035'] RXN-9664 ['R09913'] RXN-9674 ['R10039'] RXN-9825 ['R10452'] RXN-9850 ['R04811'] RXN-9863 ['R05419'] RXN-9962 ['R09626'] RXN-9963 ['R09627'] RXN0-3521 ['R07661'] RXN0-4641 ['R08555'] RXN0-5180 ['R02061', 'R05555'] RXN0-5183 ['R01444'] RXN0-5297 ['R05134'] SERINE--PYRUVATE-AMINOTRANSFERASE-RXN ['R00585'] STERYL-BETA-GLUCOSIDASE-RXN ['R01460'] STRICTOSIDINE-BETA-GLUCOSIDASE-RXN ['R03820'] SUCROSE-PHOSPHORYLASE-RXN ['R00803'] ['R01068', 'R01069', TAGAALDOL-RXN 'R01070'] TECH1REDHAL-RXN ['R05402'] TRANS-HEXAPRENYLTRANSTRANSFERASE-RXN ['R09247'] TRANS-PENTAPRENYLTRANSFERASE-RXN ['R09245'] TRE6PHYDRO-RXN ['R00837'] TRICHODIENE-SYNTHASE-RXN ['R02306'] TRYPSYN-RXN ['R02722'] TYROSINE-3-MONOOXYGENASE-RXN ['R07212'] UDPNACETYLGLUCOSAMENOLPYRTRANS-RXN ['R00660'] URATE-RIBONUCLEOTIDE-PHOSPHORYLASE-RXN ['R02646'] UROGENIIISYN-RXN ['R03165'] VICIANIN-BETA-GLUCOSIDASE-RXN ['R03642'] X-METHYL-HIS-DIPEPTIDASE-RXN ['R03288'] 686 687 688