Downloaded Directly from KEGG and Metacyc Databases

bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. Article 1 Hierarchical Harmonization of Atom-Resolved Metabolic Reac- 2 tions Across Metabolic Databases 3 1 2, 3, 4, 5, Huan Jin , and Hunter N. Moseley * 4 1 Department of Toxicology and Cancer Biology, University of Kentucky, Lexington, KY 40536, USA; 5 [email protected] 6 2 Department of Molecular & Cellular Biochemistry, University of Kentucky, Lexington, KY 40536, USA 7 3 Markey Cancer Center, University of Kentucky, Lexington, KY 40536, USA 8 4 Superfund Research Center, University of Kentucky, 40506 Lexington, KY, USA 9 5 Institute for Biomedical Informatics, University of Kentucky, Lexington, KY 40536, USA 10 11 * Correspondence: [email protected]; Tel.: 859-218-2964 12 Abstract: Metabolic models have been proven to be useful tools in system biology and have been 13 successfully applied to various research fields in a wide range of organisms. A relatively complete 14 metabolic network is a prerequisite for deriving reliable metabolic models. The first step in con- 15 structing metabolic network is to harmonize compounds and reactions across different metabolic 16 databases. However, effectively integrating data from various sources still remains a big challenge. 17 Incomplete and inconsistent atomistic details in compound representations across databases is a 18 very important limiting factor. Here, we optimized a subgraph isomorphism detection algorithm to 19 validate generic compound pairs. Moreover, we defined a set of harmonization relationship types 20 between compounds to deal with inconsistent chemical details while successfully capturing atom- 21 level characteristics, enabling a more complete enabling compound harmonization across metabolic 22 databases. In total, 15,704 compound pairs across KEGG (Kyoto Encyclopedia of Genes and Ge- 23 nomes) and MetaCyc databases were detected. Furthermore, utilizing the classification of com- 24 pound pairs and EC (Enzyme Commission) numbers of reactions, we established hierarchical rela- 25 Copyright: © 2021 by the authors. Sub- tionships between metabolic reactions, enabling the harmonization of 3,856 reaction pairs. In addi- 26 mitted for possible open access publica- tion, we created and used atom-specific identifiers to evaluate the consistency of atom mappings 27 tion under the terms and conditions of within and between harmonized reactions, detecting some consistency issues between the reaction 28 the Creative Commons Attribution (CC and compound descriptions in these metabolic databases. 29 BY) license (http://creativecom- mons.org/licenses/by/4.0/). Keywords: metabolite; compound harmonization; reaction harmonization; metabolic network; met- 30 abolic model; subgraph isomorphism. 31 32 1. Introduction 33 Metabolic models describe the inter-conversion of metabolites via biochemical reac- 34 tions catalyzed by enzymes, providing snapshots of the metabolism under a given genetic 35 or environmental condition [1,2]. Metabolic models of metabolism have proven to be an 36 important tool in studying systems biology and have been successfully applied to various 37 research fields, ranging from metabolic engineering to system medicine [3-7]. Advances in 38 analytical methodologies like mass spectroscopy and nuclear magnetic resonance greatly 39 improve the high-throughput detection of thousands of metabolites, enabling the genera- 40 tion of large volumes of high-quality metabolomics datasets [8, 9] that greatly facilitate 41 metabolic research. As a next major step, incorporating reaction atom-mappings into met- 42 abolic models enables metabolic flux analysis of isotope-labeled metabolomics datasets 43 [10-13], which will contribute to the large-scale characterization of metabolic flux molecu- 44 lar phenotypes and prediction of potential targets for gene manipulation [4]. Building 45 bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 2 of 29 reliable metabolic models heavily depend on the completeness of metabolic network data- 46 bases. However, a relatively complete metabolic network, especially at an atom-resolved 47 level, is practically not available [14]. 48 Therefore, to construct an atom-resolved metabolic network, the very first major step 49 is to integrate metabolic data from various metabolic databases without redundancy [15], 50 which remains extreme labor-intensive. This is partially due to problems in the individual 51 databases [16]. Common issues include non-unique compound identifiers, reactions with 52 unbalanced atomic species, and enzyme catalyzing more than one reaction [17]. Moreover, 53 incompatibilities of data representations (like compound identifiers) and incomplete at- 54 omistic details (like the presence of R groups and lack of atom and bond stereochemistry) 55 across databases are key bottlenecks for the rapid construction of high-quality metabolic 56 networks [18]. Great efforts have been made to map different compound identifiers across 57 metabolic databases [19, 20]. Some algorithms use logistic regression to compute the simi- 58 larity between strings generated by concatenating a variety of compound features, which 59 requires careful selection of compound features that can well characterize a string pair by 60 capturing the similarity between different variations as well as underlining the difference 61 between descriptions which are not synonymous [6]. Alternatively, utilization of unique 62 chemical identifier independent from a particular database, like InChI [21, 22] or SMILES 63 [23], have been suggested as an important step in harmonizing metabolic databases [24]. 64 However, several tricky cases still remain unresolved. For example, InChI cannot handle 65 the compound entries that contain R-groups. 66 Our neighborhood-specific graph coloring method can derive atom identifiers for 67 every atom in a specific compound with consideration of molecular symmetry, facilitating 68 the construction of an atom-resolved metabolic network [25]. Furthermore, a unique com- 69 pound coloring identifier can be generated based on the atom identifiers, which can be 70 used for compound harmonization across metabolic databases. The results derived from 71 the compound coloring identifiers were quite promising. However, issues like incomplete 72 atomistic details were not completely handled in that prior work. 73 In this paper, we further optimized the subgraph isomorphism detection algorithm 74 CASS (Chemically Aware Substructure Search) [26] to aid in the validation of generic com- 75 pound pairs. In addition, we solved inconsistent atomistic characteristics across databases 76 by defining a set of harmonization relationship types between compounds, aiming to cap- 77 ture chemical details while maintain compound pairs at various levels. Furthermore, we 78 used the classification of compound pairs and EC (Enzyme Commission) numbers to har- 79 monize metabolic reactions across Kyoto Encyclopedia of Genes and Genomes (KEGG) 80 and MetaCyc metabolic pathway databases via establishing hierarchical harmonization re- 81 lationships between metabolic reactions. We further made use of the atom identifiers to 82 evaluate atom mapping consistency of these harmonized reactions. Through this analysis, 83 we detected some issues that cause the inconsistency of reaction atom mappings both 84 within and across databases. The generalization of metabolic reactions can be applied to 85 various interesting topics including but not limited to predicting biotransformation of 86 newly discovered metabolites[27], devising novel synthetic pathways of essential metabo- 87 lites[28], and bridging gaps in the current metabolic network[29]. Furthermore, expanding 88 the existing metabolic network by integrating other metabolic databases can be easily 89 achieved when the molfile representations [30] of compounds are provided. 90 2. Results 91 2.1. Overview of KEGG and MetaCyc databases 92 The compounds in the KEGG and MetaCyc databases are summarized in Table 1. 93 Based on the atomic composition, we divided compounds into two groups: specific com- 94 pounds (no R group) and generic compounds (with presence of R group(s)). About 8.02% 95 KEGG compounds and 21.72% MetaCyc compounds contain R groups. 96 Table 1. Summary of KEGG and MetaCyc compound databases. 97 bioRxiv preprint doi: https://doi.org/10.1101/2021.06.01.446673; this version posted June 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 3 of 29 Compound Type KEGG MetaCyc specific compounds 16529 (91.98%) 15859 (78.28%) generic compounds 1441 (8.02%) 4400 (21.72%) Total 17970 (100%) 20259 (100%) 98 According to the classification of compounds, we also categorized the atom-resolved 99 metabolic reactions into two sets: specific reactions where all compounds in the reaction are 100 specific compounds

Load more