BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification
Y. Djoumbou-Feunang, Jarlei Fiamoncini, A. Gil-De-La-Fuente, R. Greiner, Claudine Manach, D. S. Wishart

To cite this version:

Y. Djoumbou-Feunang, Jarlei Fiamoncini, A. Gil-De-La-Fuente, R. Greiner, Claudine Manach, et al. BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. Journal of Cheminformatics, Chemistry Central Ltd. and BioMed Central, 2019, 11, 10.1186/s13321-018-0324-5. hal-01997281

HAL Id: hal-01997281 https://hal.archives-ouvertes.fr/hal-01997281 Submitted on 28 Jan 2019


Distributed under a Creative Commons Attribution 4.0 International License
Djoumbou-Feunang et al. J Cheminform (2019) 11:2 https://doi.org/10.1186/s13321-018-0324-5
Journal of Cheminformatics

SOFTWARE Open Access
BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification
Yannick Djoumbou-Feunang¹, Jarlei Fiamoncini²,³, Alberto Gil-de-la-Fuente⁴, Russell Greiner⁵,⁶, Claudine Manach² and David S. Wishart¹,⁵*

Abstract
Background: A number of computational tools for metabolism prediction have been developed over the last 20 years to predict the structures of small molecules undergoing biological transformation or environmental degradation. These tools were largely developed to facilitate absorption, distribution, metabolism, excretion, and toxicity (ADMET) studies, although there is now a growing interest in using such tools to facilitate metabolomics and exposomics studies. However, their use and widespread adoption are still hampered by several factors, including their limited scope, breadth of coverage, availability, and performance.
Results: To address these limitations, we have developed BioTransformer, a freely available software package for accurate, rapid, and comprehensive in silico metabolism prediction and compound identification. BioTransformer combines a machine learning approach with a knowledge-based approach to predict small molecule metabolism in human tissues (e.g. liver tissue), the human gut as well as the environment (soil and water microbiota), via its metabolism prediction tool. A comprehensive evaluation of BioTransformer showed that it was able to outperform two state-of-the-art commercially available tools (Meteor Nexus and ADMET Predictor), with precision and recall values up to 7 times better than those obtained for Meteor Nexus or ADMET Predictor on the same sets of pharmaceuticals, pesticides, phytochemicals or endobiotics under similar or identical constraints. Furthermore, BioTransformer was able to reproduce 100% of the transformations and metabolites predicted by the EAWAG pathway prediction system. Using mass spectrometry data obtained from a rat experimental study with epicatechin supplementation, BioTransformer was also able to correctly identify 39 previously reported epicatechin metabolites via its metabolite identification tool, and suggest 28 potential metabolites, 17 of which matched nine monoisotopic masses for which no evidence of a previous report could be found.
Conclusion: BioTransformer can be used as an open access command-line tool, or a software library. It is freely available at https://bitbucket.org/djoumbou/biotransformerjar/. Moreover, it is also freely available as an open access RESTful application at www.biotransformer.ca, which allows users to manually or programmatically submit queries, and retrieve metabolism predictions or compound identification data.

*Correspondence: [email protected]
1 Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada
Full list of author information is available at the end of the article

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Keywords: Metabolism prediction, Metabolite identification, Biotransformation, Microbial degradation, Mass spectrometry, Machine learning, Knowledge-based system, Structure-based classification, Metabolic pathway, Enzyme-substrate specificity

Introduction
Metabolism is key to the production of energy (catabolism), the generation of cellular building blocks (anabolism) as well as the activation, detoxification, and elimination of metabolic by-products or xenobiotics. Over the past 100 years, considerable effort has gone into determining the precise molecular details of primary metabolism, i.e. the metabolic processes associated with the production and breakdown of essential metabolites (e.g. lipids, amino acids, and steroids) [1]. Unfortunately, somewhat less effort has been devoted to the characterization or understanding of non-essential or secondary metabolism and non-essential metabolites, partly due to their much higher number, and greater structural complexity, compared to primary metabolites.
Non-essential metabolites include metabolites generated through the activation, detoxification and elimination of metabolic by-products or xenobiotics. Xenobiotics are compounds such as pharmaceuticals and personal care products (PPCPs), pesticides, plant or food compounds, food additives, surfactants, solvents, and other man-made or biologically foreign substances. They constitute the largest portion of the human chemical exposome of which more than 95% remain unknown or largely uncharacterized [2, 3]. In many cases, non-essential metabolites are the products of promiscuous or non-specific enzymatic reactions [4, 5], microbial or gut metabolism [6, 7], liver-based phase I metabolism (oxidation, reduction or hydrolysis) or general phase II metabolism (conjugation). Metabolism is known to significantly influence the pharmacokinetics and pharmacodynamics of xenobiotics and their derivatives within a biological system [8] (Fig. 1). Moreover, given the diversity of biological systems that constitute our environment, it

Fig. 1 Effects of metabolism on the pharmacokinetics and pharmacodynamics of small molecules. This figure illustrates how metabolism of a xenobiotic can alter its pharmacodynamics (PD), including pharmacological activity (Act), and toxicological effects (Tox). Moreover, the nature of the resulting metabolites can influence their involvement in pharmacokinetic processes (i.e. ADME: absorption, distribution, metabolism, excretion). DETP: diethylthiophosphate

is clear that understanding xenobiotic metabolism is critical to accurately linking chemistry and biology, and understanding the interactions between those biological systems and the environment.
Figure 2 partially describes the “life cycle of a xenobiotic”, using pesticides as an example. Pesticides can be used to protect plants against insect pests, waterborne ailments, other plant competitors and parasites, thus enabling the production of larger amounts of high quality food products, while using less land [9]. In this regard, pesticides contribute to a healthier way of life. However, exposure to pesticides through inhalation (e.g. by farmworkers), skin contact, or ingestion of contaminated harvested products is known to cause harmful effects (Fig. 2). For instance, the organophosphate pesticide Chlorpyrifos (see Additional file 1) can be activated in humans to become the carcinogenic substance Chlorpyrifos-oxon, through CYP450-catalyzed desulfurization [10]. Moreover, exposure to Chlorpyrifos has been linked to a decrease in the population of probiotic Lactobacillus and Bifidobacterium species in the gut microbiota of rats [11]. Interestingly, human CYP450-catalyzed metabolism of Chlorpyrifos can also lead to the generation of the inactive metabolites 3,5,6-trichloro-2-pyridinol, and diethyl phosphorothioate (see Additional file 1), via O-dearylation [10].
Once released from the human body into the environment, the pool of xenobiotics and their derivatives often contaminates soil and water, where they are often further degraded by soil and/or aquatic microbes. The resulting metabolites, which are mostly unknown, can affect soil/water microbial diversity, and soil fertility [12] and even re-enter the food chain [13, 14] (Fig. 2). Such a metabolic “life cycle” is applicable to other chemicals, such as pharmaceuticals, food additives, and other man-made products, as highlighted by a steadily increasing number of independent studies [15, 16]. For these reasons, the characterization of xenobiotic metabolites, which has long been vitally important to the pharmaceutical industry [5], has become increasingly important to the pesticide industry [17] and to the fields of metabolomics [18], exposomics [3], and environmental sciences [19, 20].
The characterization or identification of xenobiotic metabolites from biological or environmental samples

Fig. 2 The life cycle of a xenobiotic: this figure partly illustrates the circulation, transformations, and effects of pesticides in humans and the environment. These substances can enhance crop protection, thereby increasing the yield of healthy foods. However, they can also contaminate soil and water, meaning that they can find their way into non-target organisms, including humans. Moreover, upon exposure to pesticides humans usually generate and excrete pesticide metabolites into the environment, which can also contaminate soil and aquatic environments. Some of these metabolites, and their microbial degradation products, have been isolated from water and food samples, showing that they can re-enter the human food chain [15, 16, 21]. This cycle is applicable to other types of xenobiotics, including pharmaceuticals, and personal care products, among others

is quite difficult and is not unlike natural product identification or dereplication [22]. It can take months or even years to purify and positively identify a metabolite using standard analytical techniques. As a result, there has been a growing focus on using in silico strategies to help with this process. Indeed, over the past two decades, a number of very effective computational tools have been developed to predict the metabolism of xenobiotics, especially drugs. These computer programs typically require a starting parent molecule and employ pattern recognition techniques along with hand-made rules or machine learning algorithms to identify: (1) a site of reaction or a site of metabolism (SoM) within the molecule; and/or (2) a set of chemical products resulting from a biotransformation at the specific SoM. Most in silico metabolism prediction tools are quite specific to certain classes of reactions or metabolic processes, such as phase I (only) or phase II (only) reactions. Some in silico metabolism predictors, such as SMARTCyp [23, 24] and isoCYP [25], are limited to predicting phase I metabolism (or a portion of phase I metabolism), while others are more comprehensive: for example, Meteor Nexus (Lhasa Limited, UK) [26] and SyGMa [27] cover a broad range of phase I and phase II biotransformations. Some programs are commercial, such as Meteor Nexus, MetabolExpert (CompuDrug, Bal Harbor, FL, USA) [28] and ADMET Predictor (Simulations Plus, Lancaster, CA, USA) [29], while others are freely available either as web services (e.g. XenoSite [30]) or as freely accessible standalone software packages (e.g. SMARTCyp). Most of these tools are focused on mammalian metabolism (e.g. Meteor Nexus). In comparison, a smaller number are targeted towards environmental microbial degradation. Such tools include enviPath, a complete redesign of the EAWAG-BBD/PPS, which in turn originates from the UM-BBD and UM-PPS systems [31–34]. The necessity for such tools, along with the aforementioned developments, has motivated certain mass spectrometry vendors to integrate metabolism prediction tools into their data processing systems [35, 36]. Such integration often simplifies the discovery of unknown metabolites, even at low concentration levels.
Unfortunately, even with the growing abundance of in silico metabolism prediction tools, there continue to be a number of significant limitations, especially with regard to their performance, their scope and their accessibility. In particular: (1) very few tools predict more than the SoMs; (2) none of the tools combine phase I, II, gut microbial metabolism, promiscuous enzymatic metabolism, and environmental microbial metabolism together; (3) many tools suffer from poor performance [37]; (4) almost all of the tools were developed and trained on drug molecules and were not adapted for non-drug xenobiotics; (5) only a small number of tools provide predicted structures in a downloadable or shareable format, and those that do place severe restrictions on their distribution; (6) almost none of the existing tools are open access or open source; and (7) very few of the tools make their databases or training sets available. These limitations have slowed the development of in silico metabolism prediction software and have also restricted the field to a tiny number of applications, mainly in the pharmaceutical industry.
Addressing these limitations and extending the capabilities of in silico metabolism prediction software could lead to substantial benefits in many other scientific disciplines including, but not limited to, analytical chemistry, natural product chemistry, agricultural and nutrition science, environmental chemistry, exposomics and metabolomics. Potential applications might include the in silico expansion of chemical databases of drugs (e.g. DrugBank [38]), food compounds (e.g. FooDB [39]), phytochemicals (e.g. PhytoHub [40]), environmental contaminants (e.g. ContaminantDB [41], T3DB [42], the CompTox Database [43]), organism-specific metabolites (e.g. HMDB [2], ECMDB [44], YMDB [45]), and other chemicals of biological interest (e.g. ChEBI [46], KEGG [47]). In fact, a notable effort carried out by Jeffryes et al. has led to the development of the Metabolic In silico Network Expansion (MINE) databases. The MINE databases contain close to 600,000 metabolites from compounds derived from KEGG [47], EcoCyc [48], and YMDB [45]. The metabolites were generated computationally using reaction rules based on the Enzyme Commission classification system [49], and the Biochemical Network Integrated Computational Explorer (BNICE) algorithm [50]. Jeffryes et al. reported that 93% of the computationally generated putative metabolites starting from KEGG compounds were not found in PubChem, the largest publicly accessible chemical database. Therefore, we anticipate that in silico expansions of the aforementioned databases using BioTransformer will lead to the discovery of new exposure biomarkers, new bioactive metabolites, and consequently to the development of better drugs and consumer products (e.g. food, household and cosmetic products). This may ultimately lead to improved toxicology assessment, and the advancement of precision medicine [51]. Moreover, the integration of predicted metabolites with their corresponding in silico predicted MS spectra could facilitate the identification of unknowns using metabolite identification tools such as CFM-ID [52–54], and MetFrag [55]. This would, in turn, help to further identify and characterize the so-called “dark matter” of the metabolome, which consists of the chemical signatures or molecules that remain uncharacterized or undiscovered [56].

Here, we present BioTransformer, an open access software tool and freely accessible web service for accurate and comprehensive in silico metabolism prediction and metabolite identification. It has been specifically designed to address essentially all of the shortcomings previously identified with existing in silico metabolism prediction tools. In particular, BioTransformer is freely available and furthermore its databases and predictions are free to download and use. It consists of two components: a metabolism prediction tool, and a metabolite identification tool. BioTransformer’s metabolism prediction tool (BMPT) generates predicted metabolite structures in standard electronic formats, and it provides comprehensive metabolite predictions. BMPT covers a wide range of molecular classes. In particular, BMPT combines a knowledge (or rule)-based approach with a machine learning approach to predict (1) human CYP450-catalyzed phase I metabolism of xenobiotics, (2) human gut microbial metabolism, (3) phase II metabolism, (4) promiscuous enzymatic metabolism, and (5) environmental microbial metabolism of endogenous and exogenous compounds. For the prediction of CYP450 metabolism, BioTransformer makes use of CypReact [57], a tool for CYP450 substrate specificity prediction. BioTransformer also implements a set of rules provided by the EAWAG-BBD/PPS system [33] to predict the products of environmental microbial degradation. BioTransformer’s Metabolite Identification Tool (BMIT) builds upon the metabolite prediction tool, and can be used to identify metabolites of a given molecule that match a given set of masses or molecular formulas.
In addition to providing a description of BioTransformer, we also provide a detailed analysis of its performance, including a number of comparative analyses of BioTransformer against Meteor Nexus [26] and ADMET Predictor [29]. These analyses were done using the results of published studies on experimentally determined metabolites identified after specific exposures to drugs, foods, pesticides, and other xenobiotics by various mammalian species. We also describe the freely available BioTransformer RESTful web service, which allows users to freely predict and identify metabolites of diverse types of compounds, including but not limited to PPCPs, food compounds, phytochemicals, environmental contaminants/pollutants, as well as endogenous and other exogenous compounds. BioTransformer is available as an open access Java library at https://bitbucket.org/djoumbou/biotransformerjar. The JAR library can either be run as a command-line executable, or used as an imported library within a project. The BioTransformer web service is also freely accessible at www.biotransformer.ca.

Methods
Structure and implementation of BioTransformer
BioTransformer consists of a metabolism prediction tool (BMPT), and a metabolite identification tool (BMIT). The BMPT consists of five independent prediction modules called “transformers”, namely: (1) the Enzyme Commission based (EC-based) transformer, (2) the CYP450 (phase I) transformer, (3) the phase II transformer, (4) the human gut microbial transformer, and (5) the environmental microbial transformer. For the prediction of metabolites, BioTransformer implements two approaches, a rule-based or knowledge-based approach, and a machine learning approach. BioTransformer’s knowledge-based system consists of three major components: (1) a biotransformation database (called MetXBioDB) containing detailed annotations of experimentally confirmed metabolic reactions, (2) a reaction knowledgebase containing generic biotransformation rules, preference rules, and other constraints for metabolism prediction, and (3) a reasoning engine that implements both generic and transformer-specific algorithms for metabolite prediction and selection. The BMPT machine learning system uses a set of random forest and ensemble prediction models for the prediction of CYP450 substrate selectivity, and for the Phase II filtering of molecules. BioTransformer’s Metabolite Identification Tool builds on the BMPT to identify specific metabolites using mass spectrometry (MS) data, namely accurate mass or chemical formula information.
In this section, we describe the structure, content, and implementation of MetXBioDB, the knowledgebase, the reasoning engine, the CYP450 metabolism and Phase II prediction systems, and the metabolite identification tool. Figure 3 gives a brief overview of each “transformer” module, their tasks, and the type of prediction approach they employ. Additional file 2: Figure S1 illustrates the design workflow for the aforementioned BioTransformer components. Finally, we will describe BioTransformer’s workflow, and the RESTful web service.

MetXBioDB: a database of metabolites and experimentally confirmed biotransformations and biodegradations
MetXBioDB is a database that consists of a manually curated collection of >2000 experimentally confirmed biotransformations derived from the literature. It was developed to help with: (1) the design of biotransformation rules, (2) the training and validation of machine learning metabolism prediction models, and (3) the design of preference rules. Each biotransformation in MetXBioDB includes a starting reactant (structure and

Fig. 3 An overview of BioTransformer’s five metabolism prediction modules: the Enzyme Commission based (EC-based), Cytochrome P450 (CYP450), phase II, human gut microbial, and environmental biotransformer modules. ML: machine learning-based, KB: knowledge-based, CYP: cytochrome (P450)

identifiers), a reaction product (structure and identifiers), the name or type of the enzyme catalyzing the biotransformation, the type of reaction, and one or more citations. For the purposes of this paper, a reactant is defined as a small molecule that binds to a specific enzyme and undergoes a metabolic transformation catalyzed by that enzyme. A biotransformation describes the chemical conversion or molecular transformation of a reactant to one or more products by a specific enzyme (or enzyme class) through a defined chemical reaction. Cytochrome P450 enzymes (CYP450s) are responsible for >90% of phase I oxidative reactions and >75% of [...] [58], while UDP-glucuronosyltransferases (UGTs) and sulfotransferases (SULTs) are responsible for the phase II metabolism of most xenobiotics [59, 60]. In the gut microbiota, enzymatic reactions are mostly reductive, and are carried out by anaerobic bacteria due to the very low concentration of oxygen.
The “starting” reactants in the current version (version 1.0) of MetXBioDB primarily consist of xenobiotics such as drugs, pesticides, toxins and phytochemicals. The database also includes a small number of sterol lipids and a selected set of mammalian primary metabolites. In assembling MetXBioDB we gathered reaction data from the existing literature (>100 references) along with data downloaded from publicly available databases such as DrugBank [38], PharmGKB [61], XMETDB [62], and SuperCYP [63]. These databases list over 1000 enzyme-substrate associations for the major CYP450s and UDP-glucuronosyltransferases (UGTs). Along with published scientific reports, PhenolExplorer [64] and PhytoHub [40] were also used to compile information about the metabolism of polyphenolic compounds in the gut.
The data curation process consisted of three phases: (1) the collection of biotransformation data, (2) the creation and annotation of biotransformation objects, and (3) data validation. This process was conducted collaboratively with a small team of chemistry experts. A detailed description of the data collection and curation process is provided in Additional file 2. Additional file 2: Figure S2 illustrates one entry in MetXBioDB, corresponding to the oxidation of acetaminophen to N-acetyl-p-benzoquinone imine (NAPQI). Overall, MetXBioDB contains >2000 biotransformations, which include the CYP450-catalyzed phase I reactions of ~800 unique starting reactants (and >1500 reaction products), the phase II reactions of >500 unique starting reactants (and >600 reaction products) and human gut microbial metabolism of >50 unique polyphenolic compounds.
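As a rough illustration of the record structure described in this section, the following Python sketch models one biotransformation entry with the fields listed in the text (reactant, product, enzyme, reaction type, citations). The class names, field names, and the `is_complete` check are hypothetical and are not MetXBioDB's actual schema. The enzyme assignment and the citation string are placeholders chosen for illustration; only the acetaminophen-to-NAPQI example itself is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Compound:
    """A reactant or product: a name, a structure, and optional identifiers."""
    name: str
    smiles: str  # structure encoded as a SMILES string
    identifiers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Biotransformation:
    """One curated reaction record (hypothetical field names)."""
    reactant: Compound
    products: List[Compound]
    enzyme: str          # name or type of the catalyzing enzyme
    reaction_type: str
    citations: List[str]

    def is_complete(self) -> bool:
        """Return True if every field listed in the text is filled in."""
        structures_ok = bool(self.reactant.smiles) and all(
            p.smiles for p in self.products
        )
        return bool(
            structures_ok
            and self.products
            and self.enzyme
            and self.reaction_type
            and self.citations
        )


acetaminophen = Compound("acetaminophen", "CC(=O)Nc1ccc(O)cc1")
napqi = Compound("NAPQI", "CC(=O)N=C1C=CC(=O)C=C1")

entry = Biotransformation(
    reactant=acetaminophen,
    products=[napqi],
    enzyme="CYP2E1",                       # illustrative assumption
    reaction_type="oxidation",
    citations=["<literature reference>"],  # placeholder, not a real citation
)

# A record without any citation fails the completeness check.
uncited = Biotransformation(acetaminophen, [napqi], "CYP2E1", "oxidation", citations=[])
```

A check of this kind only loosely mirrors the data validation phase mentioned in the text; the curation itself is described as a manual, collaborative process carried out by chemistry experts.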

The reaction knowledgebase
BioTransformer’s reaction knowledgebase contains chemical reaction descriptions and rules encoded by SMARTS [65] and SMIRKS [66] strings that are used by the reasoning engine to make biotransformation predictions. This knowledgebase encodes information about, and contains mapping data between, five different concepts: (1) the biosystem, (2) the metabolic enzyme, (3) the metabolic reaction, (4) the metabolic pathway, and (5) the chemical class (as determined by ClassyFire [67]). These concepts are defined as follows:

(1) A biosystem is a living organism or a community of living organisms within which the biotransformation reactions can occur. Currently, the implemented biosystems are: (a) the human organism, (b) the human gut microbiome, and (c) the environmental microbiome.
(2) A metabolic enzyme is an enzyme that catalyzes or accelerates a metabolic reaction.
(3) A metabolic reaction is a chemical reaction that modifies the structure of a molecule, leading to the generation of one or more products.
(4) A metabolic pathway is a linked series of chemical reactions that occur in a specific order in the cell or within an organism. A metabolic pathway is organism-specific as an enzyme can be expressed by some organisms but not by others.
(5) A chemical class refers to a group of chemicals that share a common structural feature or a group thereof as defined using ClassyFire [67].

The interrelationships between the different concepts are illustrated in Additional file 2: Figure S3. The construction of the reaction knowledgebase required data acquisition and aggregation from several sources, including the information captured in MetXBioDB. Additional reaction information was gathered from resources such as the SIB Bioinformatics Resource Portal (ExPASy) [68], the BRENDA enzyme database [69], various Cyc databases [70], the UniProt knowledgebase (UniProtKB) [71], the KEGG database [47], and enzyme nomenclature information provided by the International Union of Biochemistry and Molecular Biology (IUBMB) [49]. The collected data was used to: (1) design, test, and validate generic reaction/transformation rules, (2) add constraints and rules that would be used by the reasoning engine, and (3) map entities from different concepts. An example of the type of concept mapping done for the reaction knowledgebase is given here: phosphatidylcholines are a chemical class, the glycerophospholipid metabolism pathway is a metabolic pathway, a human is a biosystem, therefore phosphatidylcholines are mapped to the glycerophospholipid metabolism pathway in humans.
Based on the information gathered from the various resources, 423 associations could be established between the reaction knowledgebase’s enzymes and reactions. Priority was given to enzymes with wide substrate specificity such as the arylamine N-acetyltransferase (EC 2.3.1.5), as the aim was to predict the metabolism of small molecules partly based on generic biotransformation rules. Exceptions included, for example, serine palmitoyltransferase (EC 2.3.1.50), which is a specific enzyme that provides the sphingoid base 3-dehydrosphinganine needed for the biosynthesis of sphingolipids. All biotransformation rules in the knowledgebase were encoded in the SMIRKS language [66]. For each biotransformation rule, one or more structural constraints (e.g. the known enzyme substrates are restricted to short-chain fatty acyl chains) were encoded separately, either in the SMARTS language [65] or programmatically (by combining several rules based on the structural constraints and/or physicochemical properties). The reaction SMIRKS descriptions, and SMARTS-encoded constraints are freely available at https://bitbucket.org/djoumbou/biotransformerjar/.
The separate design of structural constraints was necessary for several reasons. First, structural constraints can sometimes be difficult or impossible to fully encode using the SMIRKS language alone, due to its limited expressivity. Second, the juxtaposition of constraints within a SMIRKS pattern can make it difficult to understand, and cumbersome to update. A typical reaction scheme encoded in the reaction knowledgebase is shown in Additional file 2: Figure S4. Once a reaction was encoded, several tests were performed to assess its correctness by applying the reaction to known substrates as well as to known non-substrates (i.e. chemicals that were known not to satisfy the various constraints). If the reaction passed all the tests, it was added to the database; if it failed, the reaction schema was subject to one or more iterations and tests until validated.
Some of the encoded reactions in the reaction knowledgebase apply to a very specific set of chemicals, and can be used to accurately predict the metabolism of compounds belonging to those classes. Such examples include the aforementioned conversion of diacyl-sn-glycero-3-phosphoethanolamines to diacyl-sn-glycero-3-phosphoserines, and the metabolism of several classes of lipids, which are known to follow classic primary metabolic pathways. Other reactions are so generic or non-specific that they would lead to a high number of false predictions if applied blindly. Some examples of highly non-specific reactions include aliphatic hydroxylation, N-dealkylation, and glucuronidation, among many others. These reactions are catalyzed by enzymes that have

broad substrate specificity, such as CYP450s and UGTs. To handle these situations, new reaction subtypes and constraints were defined, which focused on a specific subclass of compounds that fulfilled a defined set of structural constraints. The resulting manually generated rules were then subject to further testing and validation. An example of such a reaction/rule is the N-dealkylation of alicyclic tertiary amines catalyzed by CYP3A4, a well-studied bioactivation pathway of cyclic amines [72].

In addition to the core knowledge provided by textbooks, online databases and journal articles, the design of biotransformation rules for the reaction knowledgebase often required additional investigation. One approach consisted of selecting compounds (from MetXBioDB) that triggered a given reaction and labeling them based on whether their expected metabolites were reported or not. Further analysis of these reaction sets often suggested new reaction schemes or the addition of new constraints to existing reaction schemes. A similar process was previously used to generate more than 300 biotransformation rules for the prediction of environmental microbial metabolism [33, 73]. These rules were also encoded, tested, and added to BioTransformer’s reaction knowledgebase. Overall, a total of 797 biotransformation rules were encoded, tested, and eventually added to the reaction knowledgebase.

In addition to identifying the mechanisms involved in various metabolic reactions, and encoding of biotransformation rules, another challenge to building the reaction knowledgebase was determining the prioritization needed for specific metabolic reactions. For any compound that triggers several competing reactions, certain reactions are more likely to occur than others. Therefore, the metabolites resulting from these preferred reactions are more likely to be observed. Given a pair of metabolic reactions, a common approach to define precedence rules involves a detailed analysis of common putative and observed metabolites via NMR or mass spectrometry [73]. Another approach involves using NMR or mass spectrometry to perform time-course monitoring of biotransformations in order to elucidate the preferred metabolic pathways [74]. In this work, our construction of precedence rules between pairs of reactions was mostly based on data acquired from previously reported scientific studies, as well as observations published in previous studies.

For instance, when absorbed in the intestine, polyphenolic compounds must be deconjugated (via glycosidases or carboxylesterases) before undergoing any transformation [75, 76]. Recently, Burapan et al. [74] investigated the regioselectivity of O-demethylation of polyphenols by the human gut bacterium Blautia sp. MRG-PMF1, and concluded that O-demethylation of polymethoxyflavones occurs most preferably at the C-7 position, compared to the C-4′ and C-3 positions. Based on these observed patterns, kaempferol 7,4′-dimethyl ether 3-glucoside (see Additional file 1) would more likely undergo O-deglycosylation, followed by C-7 O-demethylation to give kaempferol 4′-methyl ether (see Additional file 1), which will then undergo further metabolism (Additional file 2: Fig. S5). In total, 190 precedence rules were created for 49 unique biotransformation rules that were encoded for the human and/or human gut microbial biosystems. These precedence rules were created based on observations reported in scientific articles, or personal communication with experts. In addition, 1960 precedence rules for 195 unique biotransformation rules were adopted from the EAWAG-BBD/PPS system (environmental microbial metabolism).

Not all reaction schemes in the reaction knowledgebase are fully specified. For instance, because relatively little is known about the biology and enzymology of the human gut microflora, a large number of encoded biotransformation rules were either assigned to an enzyme superfamily or to an “unspecified enzyme”. For the knowledgebase’s collection of environmental microbial reactions, the biotransformation rules were assigned to a single “unspecified enzyme”, as they are often consensus rules designed by combining patterns of reactions catalyzed by several enzymes. Overall, upon validation of the reactions and the addition of constraints, 1716 enzyme-based reaction associations were created.

The next step in constructing the reaction knowledgebase consisted of associating enzymes with metabolic pathways, and the corresponding biosystems. This step is very important for several reasons. First, many metabolic pathways are organism-dependent as different organisms express different enzymes or transporters (Additional file 2: Figure S3). Thus, as illustrated in Additional file 2: Figure S3, the metabolic route linking a compound to a metabolite could vary between organisms. While sphingomyelins can be directly converted into ceramide-1-phosphates in Aspergillus flavus, humans must convert sphingomyelins into ceramides first, which are then transformed into ceramide-1-phosphates. Second, the mapping also allows one to encode more constraints and exclusion rules for certain types of compounds. For instance, glycerophospholipids are transformed solely within the glycerophospholipid metabolism pathway, and do not undergo CYP450- or UGT-catalyzed metabolism. In total, seven metabolic pathways, 84 enzyme-pathway associations, and nine chemical class-pathway associations were created for the human biosystem. A summary of the numbers of rules and associations encoded in the reaction knowledgebase is shown in Table 1 for each of the five transformer modules

Table 1 Statistics for each of the five transformer modules: (1) EC-based module (Enzyme Commission-based metabolism); (2) CYP450 module (Cytochrome P450 metabolism); (3) human gut microbial module (Human gut microbial metabolism); (4) Phase II module (Phase II metabolism), and (5) environmental microbial module (Environmental microbial degradation)

                                    Number of  Number of                Number of                 Number of
Module (option)                     enzymes    biotransformation rules  enzyme-rule associations  covered biosystems
EC-based (ecbased)                  285        408                      459                       2
CYP450 (cyp450)                     9          163                      712                       1
Human gut microbial (hgut)          53         201                      204                       2
Phase II (phaseII)                  9          74                       81                        2
Environmental microbial (envmicro)  1          301                      301                       1

The codes/abbreviations in the first column are the names of options used programmatically to specify the module of interest
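The precedence rules described for the reaction knowledgebase can be illustrated with a minimal Python sketch. It is not BioTransformer's implementation: the function name `apply_precedence`, the rule set, and the reaction labels are hypothetical, and the only grounding is the polyphenol example in the text (deconjugation before other transformations, and C-7 O-demethylation preferred over C-4′ O-demethylation).

```python
from typing import Iterable, Set, Tuple

# Hypothetical precedence rules as (preferred, less_preferred) pairs.
# The labels are illustrative and are not identifiers from BioTransformer.
PRECEDENCE: Set[Tuple[str, str]] = {
    ("O-deglycosylation", "flavone 7-O-demethylation"),
    ("O-deglycosylation", "flavone 4'-O-demethylation"),
    ("flavone 7-O-demethylation", "flavone 4'-O-demethylation"),
}


def apply_precedence(triggered: Iterable[str],
                     precedence: Set[Tuple[str, str]]) -> Set[str]:
    """Drop every triggered reaction that is outranked by another triggered one."""
    triggered = set(triggered)
    suppressed = {
        low for high, low in precedence
        if high in triggered and low in triggered
    }
    return triggered - suppressed


# Step 1: a glycosylated, dimethylated flavonol triggers all three reactions,
# but only the deconjugation survives the precedence filter.
step1 = apply_precedence(
    {"O-deglycosylation",
     "flavone 7-O-demethylation",
     "flavone 4'-O-demethylation"},
    PRECEDENCE,
)
print(sorted(step1))  # ['O-deglycosylation']

# Step 2: once the sugar is removed, the two demethylations compete
# and the C-7 reaction is kept.
step2 = apply_precedence(
    {"flavone 7-O-demethylation", "flavone 4'-O-demethylation"},
    PRECEDENCE,
)
print(sorted(step2))  # ['flavone 7-O-demethylation']
```

A pairwise set keeps each rule independent, which mirrors the way the text describes precedence as being defined "between pairs of reactions"; a ranked list would be an alternative design if a total order were available.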

(EC-based, human CYP450, human gut microbial, phase II, and environmental microbial). The biotransformation rules and the list of enzymes cover all six enzyme classes EC1 through EC6 of the Enzyme Nomenclature, as defined by the IUBMB [49], with deeper focus on classes EC1 to EC4. The metabolic pathways are currently limited to lipid metabolism. The annotation and mapping of all enzymes, metabolic reactions, biosystems, metabolic pathways, and chemical classes are freely available at https://bitbucket.org/djoumbou/biotransformerjar/.

The reasoning engine

The BMPT’s Reasoning Engine uses the rules in the reaction knowledgebase to select the most likely of all applicable metabolic biotransformations or pathways. In general, two types of reasoning are used for the selection and ranking of predicted metabolites: absolute reasoning, and relative reasoning [77]. Absolute reasoning solely focuses on the likelihood of a biotransformation to occur, and is used to select the biotransformations with an occurrence ratio above a given threshold. Examples of biotransformation software using absolute reasoning include SyGMA and Meteor Nexus. Relative reasoning evaluates the comparative likelihood between two independent but competing reactions (e.g. flavone 7-O-demethylation is more likely to occur than flavone 4′-O-demethylation [74]). Examples of computational tools using relative reasoning include Meteor Nexus and the EAWAG-BBD/PPS system. Both absolute and relative reasoning have been implemented. However, in the current version of BioTransformer all reaction patterns have been assigned the same likelihood. The computation of more accurate reaction-specific scores requires a larger set of data, which is still being assembled and tested. We aim to provide more accurate reaction scores in a future version of BioTransformer that will be released in 2019.

Besides qualitative attributes (e.g. chemical class), reasoning engines often also use quantitative attributes (e.g. mass, LogP) to guide their predictions. BioTransformer’s reasoning engine uses both types of attributes. While chemical classification can help to select the most likely biotransformations or discard the unlikely ones, quantitative attributes such as the mass and LogP are used to predict the substrate specificity for various enzymes, or whether a known molecule is hydrophilic enough to be conjugated/eliminated. For the prediction of enzyme-substrate specificity, the current version of BioTransformer focuses on nine of the most “active” or best-studied CYP450 enzymes (CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4). The prediction of their specificity toward a given substrate is made by CypReact [57], a machine learning software tool for CYP450 reaction prediction that was recently developed by our team. To predict whether a compound is hydrophilic enough for conjugation/elimination, BioTransformer uses its internal, machine learning Phase II filter, which uses structural fingerprints and physicochemical properties (e.g. LogP, mass) to select likely Phase II candidates. CypReact and the Phase II filter will be briefly described in the next section.

With the reaction knowledgebase and the machine learning tools in hand, the Reasoning Engine was implemented programmatically for each of the five different transformer modules. The rationale behind this design was to have independent transformer modules that could be used separately. This way, one could focus on a specific type of metabolism (e.g. CYP450-catalyzed metabolism) or a specific type of biosystem (human). Among the five transformer modules, three rely solely on the application of rules and constraints from the reaction knowledgebase. These three are the EC-based transformer, the human gut transformer and the environmental transformer. The cytochrome P450 (Phase I)

transformer, which focuses on the metabolism of small molecules mediated by CYP450 enzymes, and the Phase II transformer, are the only transformers that implement a machine learning approach in combination with a knowledge-based approach. In addition to the five transformer modules, the Reasoning Engine is used by a combined human “super transformer”, which aims at simulating the metabolism of small molecules in humans (including the human gut), from their absorption to their excretion.

The CYP450 metabolism prediction system

Cytochrome P450 enzymes (CYP450s) constitute a superfamily of heme proteins, with over 50 isozymes identified in humans [78]. They are predominantly found in the liver, but also occur in other organs such as the lungs, the kidneys, the gut wall, and the small intestine. CYP450s are the major oxidative enzymes in the human body, and are responsible for the metabolism of a large number of compounds. Nine specific CYP450s have been identified as responsible for most of the Phase I metabolism of xenobiotics (e.g. drugs, food additives, and environmental contaminants) and a small number of endogenous compounds. These include the CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP2E1, and CYP3A4 isozymes. Because of their broad substrate specificity, a special CYP450-reactant specificity prediction was implemented, in order to predict metabolites for the more likely reactants. The enzyme-specificity is assessed by a program called CypReact [57].

CypReact is a software tool that uses a machine learning approach to predict whether a small molecule reacts with any of the nine major CYP450 isozymes. CypReact uses a random forest model for each of seven isozymes (CYP1A2, CYP2A6, CYP2B6, CYP2C8, CYP2C19, CYP2E1, CYP3A4), and ensemble models for two isozymes (CYP2C9, CYP2D6). Each of the models uses a set of physicochemical properties and structural features of a molecule for substrate specificity prediction. The substructure fingerprints were partly developed by including a subset of SMARTS pattern definitions from ClassyFire [67], and a set of SMARTS patterns known to trigger CYP450-catalyzed metabolism (e.g. p-substituted phenols, or N-substituted piperazine). These fingerprints encode other pattern definitions for key functional groups and structural features relevant to CYP450-catalyzed metabolism, which were obtained through data mining. In addition, the corresponding PubChem fingerprint [79] and the MACCS fingerprint [80] were added. Feature selection, parameter optimization, cost-sensitive learning, and cross-validation based evaluation were performed to design highly accurate models for each CYP450 model. Empirical results show that CypReact’s classifiers can achieve a very high performance, with AUROC scores ranging between 83% and 92%. Moreover, they were shown to significantly outperform SmartCyp [24], and ADMET Predictor [29]. For a more detailed description of the fingerprint generation, training process, and resulting models, the reader is referred to the CypReact paper [57]. In addition to the nine models, CypReact also uses a heuristic approach to filter candidates that are known to be out of scope for CYP450-mediated metabolism, based on their chemical structure and/or physicochemical properties. These include inorganic compounds, and several classes of glycero- and glycerophospholipids, among others. CypReact is freely available at https://bitbucket.org/Leon_Ti/cypreact/.

Given any small molecule, the CYP450 transformer uses CypReact to predict which of the nine CYP450s is likely to metabolize the molecule. Subsequently, it implements the constraints and biotransformation rules encoded within the reaction knowledgebase to predict the structures of the resulting metabolites. As for any other transformer module, the user can vary the parameters, including the number of transformation steps, and whether to use certain precedence rules.

The Phase II metabolism prediction system

Phase I reactions tend to render the lipophilic xenobiotics more reactive by adding or modifying functional groups, such as an amino-, hydroxyl-, or carboxyl group. Some examples of Phase I reactions include aliphatic hydroxylation, and epoxide hydrolysis. In Phase II, the more reactive metabolites are conjugated to cofactors, making them less toxic, more hydrophilic, and thus easier to eliminate. Some of the more common Phase II reactions include the conjugation of xenobiotics to glucuronic acid (glucuronidation), sulphate (sulfation), a methyl group (methylation), an N-acetyl group (N-acetylation), glutathione, taurine, and glycine. These reactions are catalyzed by the families of UDP-glucuronosyltransferases (UDP-GTs), sulfotransferases (SULTs), methyltransferases (MTs), N-acetyltransferases (NATs), glutathione S-transferases (GSTs), bile acid-CoA:amino acid N-acyltransferase (BACATs), and glycine transferases (GTs), respectively. While the presence of adequate attachment and functional groups is required for conjugation, the lipophilicity of a molecule is also significantly influenced by its shape, mass, and functional group composition, among other parameters. Therefore, a simple structure-based chemical classification would not be enough to predict whether a candidate molecule is suitable for Phase II. In order to provide an accurate prediction, we designed the Phase II Filter (P2F).
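As the following paragraphs explain, the P2F model is preceded by a simple rule-based pre-filter. The Python sketch below is only an orientation under stated assumptions: the `Compound` record, the `passes_prefilter` function, and the motif names are hypothetical, and the phrase "containing a limited set of 64 different structural motifs" is read here as "matching at least one motif from the allowed set". The 900 Da cut-off and the five excluded chemical classes are taken from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Chemical classes excluded before the machine learning model is applied
# (the five classes named in the text).
EXCLUDED_CLASSES: FrozenSet[str] = frozenset({
    "ether lipids",
    "glycerolipids",
    "glycerophospholipids",
    "sphingolipids",
    "acyl-CoA conjugates",
})

MAX_MOLECULAR_WEIGHT_DA = 900.0  # upper bound stated in the text


@dataclass(frozen=True)
class Compound:
    """Hypothetical record; in practice these fields would come from a
    cheminformatics toolkit and a chemical classifier."""
    name: str
    molecular_weight: float
    chemical_classes: FrozenSet[str]
    matched_motifs: FrozenSet[str]


def passes_prefilter(compound: Compound, allowed_motifs: FrozenSet[str]) -> bool:
    """Return True if the compound should be passed on to the trained model."""
    if compound.chemical_classes & EXCLUDED_CLASSES:
        return False
    if compound.molecular_weight > MAX_MOLECULAR_WEIGHT_DA:
        return False
    # Assumption: at least one conjugation-relevant motif must be present.
    return bool(compound.matched_motifs & allowed_motifs)


# Illustrative stand-in for the 64 motifs listed in Additional file 3: Table S2.
ALLOWED_MOTIFS = frozenset({"hydroxyl", "carboxyl", "amine"})

candidate = Compound("toy phenol", 94.11,
                     frozenset({"phenols"}), frozenset({"hydroxyl"}))
lipid = Compound("toy glycerophospholipid", 760.0,
                 frozenset({"glycerophospholipids"}), frozenset({"hydroxyl"}))
too_heavy = Compound("toy macrocycle", 1200.0,
                     frozenset({"macrolides"}), frozenset({"hydroxyl"}))

print(passes_prefilter(candidate, ALLOWED_MOTIFS))  # True
print(passes_prefilter(lipid, ALLOWED_MOTIFS))      # False
print(passes_prefilter(too_heavy, ALLOWED_MOTIFS))  # False
```

Running the cheap rule checks first keeps compounds that the training set never covered away from the random forest model, which is the motivation the text gives for the pre-filter.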

The Phase II Filter was designed as a simple machine learning model that takes physicochemical properties as well as structural features of a molecule to predict whether it is ready for Phase II metabolism. A compound is predicted as Phase II ‘ready’ if it can undergo one or more transformations catalyzed by any of the six aforementioned enzyme families. In contrast to CypReact, which combines nine independent predictors (one for each CYP450 isozyme), the P2F consists of a single machine learning model.

Because of the broad specificity of the aforementioned Phase II enzymes, especially UDP-GTs and SULTs, it was important to collect as structurally diverse a set as possible. Selected compounds included xenobiotics (e.g. pharmaceuticals, pesticides, food additives, toxins, phytochemicals), as well as endobiotics (e.g. steroids, bile acids, amino acids). A total of 1113 compounds were collected from several databases, including MetXBioDB, PubChem [79], BRENDA [69], and the Cyc databases [70], as well as the scientific literature. The training set contained 807 Phase II substrates, and 306 Phase II non-substrates. When unavailable from any of the sources, the structure of a compound was generated using ChemAxon’s MarvinSketch v.17.2.27.0 [81]. Standardization operations (e.g. removal of salts, and 3D structure generation) were also performed. Certain classes of compounds, such as glycerolipids, are known not to undergo conjugation by any of the Phase II enzymes. Since these compounds could be pre-filtered using a simple structure search, they were not included in the training set. Furthermore, compounds that do not contain adequate reaction sites (i.e. functional groups that could be attacked by Phase II enzymes) were not included. This is because such compounds could be easily filtered by structural pattern matching.

After the collection and standardization of our training set, a total of 32 molecular descriptors were calculated for each of the 1113 molecules. These included nine constitutional descriptors and molecular properties (e.g. the number of H-bonds, the mass, and the AlogP), as well as 23 structural features, such as amine groups (SMARTS = “[NX3+0,NX4+;!$([N]~[!#6]);!$([N]*~[#7,#8,#15,#16])]”), and carboxyl groups (SMARTS = “[#8;A;X2H1,X1-][#6]([#6,#1;A])=O”). The molecular descriptors were all computed with the CDK library. The structural features are represented as binary features in a custom chemical fingerprint to encode their absence (0) or presence (1) in the query molecule. A list of structural features and physicochemical parameters is available in Additional file 3: Table S1.

Feature selection was performed to select a set consisting of the features that are most significant in explaining the training data. This not only accelerated the training/prediction process but also reduced the likelihood of overfitting. Feature selection was performed on the Waikato Environment for Knowledge Analysis (WEKA) [82] using the information gain criteria and a ranker. Overall, 25 physicochemical properties and structural features were selected to build and evaluate several models (evaluated by 10-fold cross validation) using several different machine learning algorithms (i.e. decision trees, random forest, and naïve Bayes). Upon comparative evaluation of the F1-measure and ROC area, a random forest model was selected as the best predictor. The model achieved a weighted average F1-measure of 0.88, and a weighted average ROC area of 0.94.

Our training was limited to compounds possessing necessary structural motifs (e.g. functional groups) that are targeted by the aforementioned Phase II enzyme classes for conjugation. A number of chemical classes, including ether lipids, glycerolipids, glycerophospholipids, sphingolipids, and acyl-CoA conjugates were excluded from the training set, as such compounds are known either not to be transformed by any of the seven Phase II enzyme classes, or to be conjugated following a very specific metabolic pathway. In the latter case, the chemical class-to-pathway associations encoded in BioTransformer’s reaction knowledgebase would allow for a more accurate biotransformation prediction, if applicable. For these reasons, a simple rule-based filtering module was implemented to eliminate the most trivial non-candidates, before applying the trained model. The rule-based module excludes compounds from the five aforementioned chemical classes. Moreover, only compounds with a molecular weight lower than or equal to 900 Da (selected based on extensive internal analysis of our collected data), and containing a limited set of 64 different structural motifs (see Additional file 3: Table S2) are then passed to the machine learning filtering module.

The BioTransformer metabolite identification tool

Metabolite identification is one of the main tasks of untargeted metabolomics. The aim of untargeted metabolomics is to analyze biofluids (e.g. urine, blood) from an organ or organism and to attempt to identify novel metabolites that are characteristic of that organism’s response to an exposure to a chemical or other stimuli. Mass spectrometry (MS) is one of several analytical approaches used to perform this task. When coupled with (gas or liquid) chromatography, a mass spectrometer produces a set of spectra that contain features (e.g. mass-to-charge ratios, peak intensities, calculated molecular formulas) characteristic of metabolites or fragments thereof. While spectral searching is a method commonly used to identify metabolites, the lack of reference spectra for many metabolites is a bottleneck in rapid and accurate

compound identification. Therefore, the comparison of spectral features (e.g. mass, molecular formula) obtained from mass spectra with those obtained from metabolism prediction data could help to putatively identify known or unknown metabolites and validate predictions.

The BioTransformer metabolite identification tool (BMIT) is an additional module within BioTransformer that is designed to assist users in metabolite identification. It relies on the BMPT to find compounds of a specific mass (within a user-specified threshold) or chemical formula that are generated upon single- or multistep metabolism of a given parent molecule. BMIT takes the chemical structure of the starting molecule as input, as well as a list of neutral chemical masses or molecular formulas for the metabolites to be identified. BMIT is implemented to only support metabolite identification using the allHuman and superbio options (Human + Human Gut Microbiome), or the envmicro option (Environmental Microbiome). The search for metabolites is applied iteratively at each step, and stops when at least one metabolite has been identified for each given mass (± a mass tolerance) or given chemical formula or when the maximal number of steps has been reached. If applicable, the BMIT returns each matching metabolite, including its structure, its chemical formula, its molecular mass, and a pathway leading to it, starting from the query compound. The results are saved in a single SDF file in which each pathway is stored as an ordered list of chemical reactions (with reaction name, and a list of catalyzing enzymes).

BioTransformer’s input and workflow

BioTransformer was implemented in the Java programming language, and can be used as a command-line tool (on Linux, Mac OSX, and Windows) to perform metabolism prediction and metabolite identification of small molecules. Besides CypReact, described earlier, BioTransformer uses two other open source tools, namely the Chemistry Development Kit (CDK) [83] and the AMBIT library [84]. The CDK programming library is used for several operations, including the calculation of physicochemical properties, the execution of superstructure search operations, and the handling of chemical structures, among others. The AMBIT library is used for the application of biotransformation rules and structure generation.

The BioTransformer metabolism prediction tool’s workflow is illustrated in Fig. 4. As can be seen in this diagram, BioTransformer accepts molecules either in

Fig. 4 Workflow of BioTransformer’s metabolism prediction and metabolite identification tools. The BioTransformer metabolite prediction tool (BMPT) is used solely for metabolism prediction. For metabolite identification tasks, the BioTransformer metabolite identification tool (BMIT) makes use of those predictions to suggest putative metabolites of a compound that have a given neutral mass or molecular formula
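The iterative, mass-based search that the text attributes to the BMIT can be sketched in a few lines of Python. This is a simplified illustration and not the tool's code: `find_by_mass` and `predict_step` are hypothetical names, `predict_step` stands in for a single-step metabolism prediction, and the compound identifiers and masses are toy values. The stopping conditions and the defaults (one step, 0.01 Da tolerance) follow the description in the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A compound is modeled as (identifier, neutral monoisotopic mass in Da).
Compound = Tuple[str, float]


def find_by_mass(parent: Compound,
                 target_masses: Sequence[float],
                 predict_step: Callable[[Compound], List[Compound]],
                 tolerance: float = 0.01,
                 max_steps: int = 1) -> Dict[float, List[Compound]]:
    """Search predicted metabolites for each target mass, step by step.

    Stops as soon as every target mass has at least one match, or when
    max_steps single-step predictions have been applied.
    """
    hits: Dict[float, List[Compound]] = {mass: [] for mass in target_masses}
    frontier = [parent]
    for _ in range(max_steps):
        # Apply one metabolism step to every compound of the current level.
        products = [m for compound in frontier for m in predict_step(compound)]
        for metabolite in products:
            for mass in hits:
                if abs(metabolite[1] - mass) <= tolerance:
                    hits[mass].append(metabolite)
        if not products or all(hits[mass] for mass in hits):
            break
        frontier = products
    return hits


# Toy metabolic tree used in place of a real predictor (values are made up).
TOY_TREE = {
    "parent": [("M1", 316.0), ("M2", 286.0)],
    "M1": [("M3", 492.0)],
}


def toy_predict_step(compound: Compound) -> List[Compound]:
    return TOY_TREE.get(compound[0], [])


one_step = find_by_mass(("parent", 300.0), [316.005, 492.0], toy_predict_step)
two_steps = find_by_mass(("parent", 300.0), [316.005, 492.0],
                         toy_predict_step, max_steps=2)
print(one_step)   # only the 316.005 target is matched after one step
print(two_steps)  # both targets are matched after two steps
```

A formula-based search would follow the same loop with an exact string comparison in place of the tolerance check.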

SMILES (single molecule), InChI (single molecule), MOL (single molecule), or SDF (single or multiple compounds) format as input. Each molecule must be an organic molecule and it must not be a mixture or a salt. Once the input is parsed, the structures are subjected to chemical validation and standardization. The standardization process consists of removing charges from functional groups (with some exceptions, such as nitro groups), checking and validating bond types and adding explicit hydrogen atoms. Subsequently, BioTransformer predicts the biotransformations and the resulting metabolite structures for each query molecule separately (see Additional file 2: Fig. S6). In some cases, the structural representation of a molecule upon standardization can differ slightly from the original one. Therefore, we encourage users to provide identifiers (e.g. custom labels, names, etc.) in addition to the structural representation. This is even more relevant when a BioTransformer prediction is used as part of an automated workflow.

Each prediction can be run in the single module mode, where the user selects one of the five transformer modules (CYP450, EC-based, phase II, gut microbial, or environmental microbial). The BioTransformer options used to specify the modules are cyp450 (CYP450 metabolism module), ecbased (EC-based metabolism module), phaseII (Phase II metabolism module), hgut (Human gut microbial degradation module), and envmicro (Environment microbial degradation module). Alternately, a human “super transformer” has been implemented to mimic the metabolism of small molecules in the human “superorganism”, which also includes the gut microbiota. This super transformer integrates the CYP450, EC-based, phase II, and gut microbial transformers and covers a number of different reaction types, including hydrolysis, oxidation and reduction, and conjugation. The “super transformer” provides two options: (1) allHuman, which uses all four human-related transformers at each step of the prediction, or; (2) superbio, which uses all the human-related biotransformers in an ordered sequence of up to 12 steps, starting with the hydrolysis of the query molecule (if applicable), and ending with the conjugation of its metabolites.

After the metabolite prediction step is completed, the structures and biotransformations are annotated (Fig. 4). Based on the information from the predicted biotransformation(s), BioTransformer builds a metabolic tree by associating each metabolite with its parent(s). Moreover, each predicted metabolite is annotated with additional information that provides structural identification, reports its physicochemical properties, and explains its origin or provenance. The data includes: (1) three chemical identifiers (metabolite ID, InChI, InChI Key), (2) the molecular formula, (3) the monoisotopic mass, (4) the reaction type leading to the metabolite, (5) the biosystem that generated the molecule, (6) the parent compound identifiers (BioTransformer ID, InChIKey), (7) the parent monoisotopic mass, (8) the metabolite’s and parent’s AlogP, as well as (9) the metabolite’s and parent’s synonyms. The results are returned in an SDF or CSV file that contains the structure and annotation of the predicted metabolites. The returned information can be used separately to analyze metabolic pathways. It can also be used to compute neutral losses for MS-based analyses that can be used to experimentally detect each biotransformation.

BioTransformer’s metabolite identification tool (BMIT) builds on the metabolism prediction tool (BMPT). Given a starting molecule, a set of molecular masses and a mass tolerance threshold (in Da) or simply a set of molecular formulas, BMIT identifies potential metabolites for each valid mass or molecular formula, via single or multi-step metabolism, depending on the user input. For mass-based searches, the default number of steps, and mass tolerance are set to one, and 0.01, respectively. The user can select to explore the human and human gut microbiome environments (with the allHuman and superbio options), or the environmental microbial metabolism (with the option “env”). A metabolic pathway linking the starting structure and each of the metabolites is returned, based on the metabolic tree obtained upon metabolism prediction. Metadata include the structures, identifiers, reaction types, and enzymes.

The BioTransformer web service

The BioTransformer software package can be used as a command line tool or as a Java library. In order to further facilitate access to this tool, a RESTful web service was built using the JRuby on Rails framework. The BioTransformer web service is freely available at www.biotransformer.ca. The web service allows users to manually or programmatically submit queries, and retrieve the corresponding results using the workflow described in the previous section. In particular, the web service allows users to submit compounds in SMILES, InChI, and SDF formats (Additional file 2: Fig. S7). Query results can be returned as JSON, SDF, and CSV documents (Additional file 2: Fig. S8). Moreover, the web server provides information about each previously predicted single-step metabolic transformation of the compound, including the corresponding biosystem, reaction type, metabolizing enzymes, and transformation products. The web application offers several advantages compared to the command-line tool, namely: (1) it is easier to use than the stand-alone program; (2) users need not be programmers or need to install a local program to run the web service; (3) several queries can be processed simultaneously; (4)

the computation is faster, as previous prediction results are saved in a database to facilitate more rapid retrieval; and (5) metabolite prediction and identification data can be accessed manually or programmatically and downloaded in several formats. While the command-line executable does not benefit from the database of computed metabolites, it also does provide some advantages, namely: (1) it allows users to submit large sets of compounds; (2) it does not rely on an Internet connection, and; (3) queries are executed immediately and not put in a queue.

Evaluation of BioTransformer’s metabolism prediction and metabolite identification capabilities

In order to evaluate the performance of BioTransformer, we performed a comparative analysis with two popular in silico metabolism prediction tools, namely Meteor Nexus [26], and ADMET Predictor [29]. Moreover, we evaluated BioTransformer’s ability to replicate environmental microbial metabolism prediction from the EAWAG-BBD/PPS system [33, 34, 73]. We also tested BioTransformer’s ability to predict comprehensive human and gut metabolism of small molecules. Building on BioTransformer’s metabolism prediction ability, we also tested its metabolite identification capabilities with the BMIT module. For each of the tests, BioTransformer was run on a 2.7 GHz Intel Core i5 MacOSX with 16 GB (1867 MHz DDR3) of memory. The procedures and results are presented in the Results section.

Results

Comparative evaluation of BioTransformer and Meteor Nexus in the prediction of human single-step metabolism of small molecules

The first test involved a comparative assessment of the performance of BioTransformer and Meteor Nexus (v.3.0.1) [26] in predicting single-step human metabolism of 40 pharmaceuticals and pesticides, randomly selected from DrugBank [38] and T3DB [42]. This test set was limited to these compound classes because Meteor Nexus’ biotransformation dictionary and associated rule bases are specifically limited to pharmaceuticals and pesticides. Both BioTransformer and Meteor Nexus were set to use absolute/relative reasoning to prioritize the most likely biotransformations. In contrast to BioTransformer, Meteor Nexus clearly defines several levels of reasoning that express different levels of confidence. Therefore, Meteor Nexus’ predictions were computed for each of the equivocal (EQUI), plausible (PLAU), and probable (PRO) levels of confidence. For each compound, BioTransformer’s predictions were evaluated against a Meteor Nexus prediction obtained at each of the three confidence levels. The assessment was performed by comparing the precision (i.e. the fraction of true metabolites among the predicted ones) and recall (i.e. the fraction of true metabolites that were predicted over the total number of true metabolites) for each setting. For details about the evaluation, see Additional files 2 and 4.

BioTransformer’s average computation time was 3.55 s per compound whereas Meteor Nexus’ average computation time was 2.95 s per compound. A summary of the comparative assessment of BioTransformer and Meteor Nexus (Lhasa Limited, UK) is displayed in Table 2. When compared to the Meteor Nexus predictions obtained at the “Equivocal” level of reasoning, BioTransformer achieved higher precision (49% vs. 35%) and recall (88% vs. 71%). As an illustration, BioTransformer predicted 7 out of 8 true metabolites for 17-Ethinylestradiol, compared to 4, 1, and 0 by Meteor Nexus using its Equivocal, Plausible, and Probable levels of confidence, respectively (Fig. 5). On the other hand, Meteor Nexus predicted 3 out of 3 true metabolites for Efavirenz, compared to only 2 for BioTransformer (Fig. 5). Meteor Nexus achieved higher precision at the “Plausible” (56%) and “Probable” (59%) levels compared to BioTransformer. However, this caused a significant drop of the recall to 45% at the Plausible, and 13% at the Probable levels of confidence, respectively, compared to an 88% recall by BioTransformer (see Table 2).

Evaluation of BioTransformer’s prediction of human and human gut microbial single-step metabolism of small molecules

The second test involved an assessment of BioTransformer’s performance in predicting single-step human and human gut microbial metabolism of 20 well-studied pharmaceuticals, lipids, polyphenols, and other phytochemicals, from the HMDB [2] (none of which

Table 2 Comparative assessment of BioTransformer’s and Meteor Nexus’ predictions of human (not including gut microbiome) single-step metabolism for 40 pharmaceuticals and pesticides

                                             Meteor Nexus
                             BioTransformer  EQUI   PLAU   PRO
True positives               188             152    96     28
False positives              198             279    74     19
False negatives              26              62     118    186
Total no. of predictions     386             433    170    47
Precision                    0.49            0.35   0.56   0.59
Recall                       0.88            0.71   0.45   0.13
No. of reported metabolites  224

The different confidence levels implemented by Meteor Nexus (Lhasa Limited) are: EQUIVOCAL (EQUI), PLAUSIBLE (PLAU), and PROBABLE (PRO)

Fig. 5 Examples of predicted metabolites: a 16-hydroxy-17-ethinylestradiol, a reported metabolite of 17-Ethinylestradiol, was predicted by BioTransformer only. b Efavirenz N-glucuronide, a reported metabolite of Efavirenz, was predicted by Meteor Nexus only

was included in the first test set), using the super transformer’s allHuman option. This was done to assess BioTransformer’s performance in a task more related to metabolomic or exposomic studies, where the prediction of both endogenous and exogenous metabolites arising from human metabolism is highly desirable. To our knowledge, no commercial or publicly available tool is available that was implemented to perform this kind of diverse metabolite prediction, so no comparison could be done in a fair manner. BioTransformer’s average computation time for this (more comprehensive) analysis was 4.10 s per compound. Overall, BioTransformer achieved a precision of 69% and a recall of 87% (Table 3). Although the set is more chemically diverse, the performance of BioTransformer is actually better than what was achieved for the first test involving pesticides and pharmaceuticals (described above). Examples of predictions by BioTransformer are illustrated in Fig. 6. Details of the evaluation are available in the Additional file 5.

Table 3 Evaluation of BioTransformer’s performance in predicting human and human gut microbial metabolism of 20 small molecules

                               BioTransformer
True positives                 111
False positives                49
False negatives                17
Total no. of predictions       160
Precision                      0.69
Recall                         0.87
No. of reported metabolites    128

Comparative evaluation of BioTransformer and ADMET Predictor in the prediction of human single-step CYP450-mediated metabolism of small molecules

In our third test, the CYP450-catalyzed single-step metabolism of the 60 aforementioned molecules was predicted using ADMET Predictor (v.8.5.1.1) [29]. ADMET Predictor is a software tool that allows the prediction of sites of metabolism and the resulting metabolites

upon CYP450-catalyzed biotransformation. The set of nine CYP450 isoforms supported by ADMET Predictor (1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, and 3A4) is identical to the one covered by BioTransformer’s CYP450 metabolism prediction tool. The resulting metabolites were compared to those obtained from BioTransformer’s CYP450 metabolism prediction module, and a performance assessment was then carried out (Table 4).

BioTransformer’s predictions were computed in an average of 2.69 s per compound, while the ADMET Predictor predictions took an average of 0.45 s per compound. BioTransformer and ADMET Predictor had comparable levels of precision at 46% and 47% respectively. However, BioTransformer was able to predict 90% of all experimentally confirmed metabolites, which is significantly higher than the 61% predicted by ADMET Predictor.

Fig. 6 Examples of human (non gut microbial) metabolites predicted by BioTransformer. This figure illustrates the human hepatic metabolites of Atrazine, delta-9-tetrahydrocannabinol (Delta-9-THC), and phosphatidylcholine (16:0/16:0), as well as human gut microbial metabolites of L-DOPA and Epicatechin, correctly predicted by BioTransformer
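The precision and recall reported for the third test can be recomputed from the counts of true positives, false positives and false negatives listed in Table 4. The short Python sketch below is only an illustration of that arithmetic and is not part of BioTransformer; the function name `precision_recall` and the dictionary layout are assumptions made for this example.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true/false positive and false negative counts."""
    predicted = tp + fp  # total number of predictions made by the tool
    reported = tp + fn   # total number of reported (true) metabolites
    precision = tp / predicted if predicted else 0.0
    recall = tp / reported if reported else 0.0
    return precision, recall


# Counts copied from Table 4: (true positives, false positives, false negatives).
TABLE_4 = {
    "BioTransformer": (162, 188, 18),
    "ADMET Predictor": (110, 122, 70),
}

for tool, (tp, fp, fn) in TABLE_4.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool}: precision={p:.2f}, recall={r:.2f}")
```

With the Table 4 counts this gives a precision of 0.46 and a recall of 0.90 for BioTransformer, and 0.47 and 0.61 for ADMET Predictor, in line with the values quoted in the text.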

Table 4 Comparative assessment of BioTransformer and ADMET Predictor (Simulations Plus) in predicting single-step human CYP450 metabolism for 60 drugs, pesticides, phytochemicals, and other xenobiotics, as well as endobiotics (e.g. lipids)

                               BioTransformer   ADMET Predictor
True positives                 162              110
False positives                188              122
False negatives                18               70
Total no. of predictions       350              232
Precision                      0.46             0.47
Recall                         0.90             0.61
No. of reported metabolites    180

Figure 7 illustrates some examples of CYP450-generated metabolites predicted only by BioTransformer, and others predicted only by ADMET Predictor. Details of the evaluation are available in the Additional file 6.

Comparative evaluation of BioTransformer and the EAWAG BBD/PPS system in the prediction of environmental microbial metabolism

Meteor Nexus and ADMET Predictor are not capable of predicting environmental microbial metabolism/degradation. Therefore, in order to assess BioTransformer’s abilities to predict environmental microbial metabolism, we compared it to the EAWAG-BBD/PPS system using three test compounds, namely Ampicillin (an antibiotic), Nitroglycerin (a plasticizer, a drug), and Disulfoton (an insecticide), all of which (along with their metabolites) have been found in wastewater treatment plants [21, 85, 86]. The respective structures were retrieved from ContaminantDB [41]. Here, only BioTransformer’s environmental microbial transformer was used, and only a single biotransformation step was conducted for each compound. The aim of this comparison was to assess the ability of BioTransformer to reproduce the EAWAG-BBD/PPS predictions, since the rules applicable to environmental degradation were encoded using the freely accessible EAWAG Biodegradation and Biocatalysis database. Both BioTransformer and the EAWAG-BBD/PPS system were set to apply relative reasoning, and both were set to predict all microbial transformations (i.e. aerobic and anaerobic). BioTransformer was able to replicate all 15 biotransformations predicted by the EAWAG system, and to successfully predict all 18 metabolites predicted by EAWAG. In addition, BioTransformer predicted three more metabolites for the degradation of Disulfoton. All three metabolites resulted from the correctly used biotransformation rule (bt0259), which was applied at three different sites of metabolism, producing two metabolites in each case. Figure 8 displays the metabolites predicted by BioTransformer and the EAWAG system, and highlights the metabolites reported only by BioTransformer.

Fig. 7 Examples of metabolites predicted by BioTransformer and ADMET Predictor

Fig. 8 Environmental microbial metabolism of disulfoton, as predicted by BioTransformer and the EAWAG-BBD/PPS system. The metabolites BTM0004, BTM0006, and BTM0009 are reported by BioTransformer as by-products of the biotransformation bt0259 that generates BTM0003, BTM0005, and BTM0010. These by-products, which should be generated according to the rule bt0259 provided by the EAWAG-BBD/PPS, were not reported by the system
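The comparison summarized in Fig. 8 amounts to a set comparison between the metabolites reported by the two systems. The Python sketch below illustrates this for the bt0259 products only, using the identifiers given in the caption; it assumes, as the caption implies, that BTM0003, BTM0005, and BTM0010 were reported by both systems. The function name `compare_predictions` is hypothetical, and the sets do not cover the full list of 18 metabolites.

```python
from typing import Dict, List, Set


def compare_predictions(candidate: Set[str], reference: Set[str]) -> Dict[str, List[str]]:
    """Split the candidate predictions into reproduced, missed and extra items."""
    return {
        "reproduced": sorted(candidate & reference),  # predicted by both systems
        "missed": sorted(reference - candidate),      # reference only
        "extra": sorted(candidate - reference),       # candidate only
    }


# Identifiers from the Fig. 8 caption (products of rule bt0259 on disulfoton).
biotransformer = {"BTM0003", "BTM0004", "BTM0005", "BTM0006", "BTM0009", "BTM0010"}
eawag = {"BTM0003", "BTM0005", "BTM0010"}  # assumption: main products only

report = compare_predictions(biotransformer, eawag)
print("Reproduced:", report["reproduced"])
print("Missed:", report["missed"])
print("Reported only by BioTransformer:", report["extra"])
```

Under these assumptions nothing is missed, and the three by-products BTM0004, BTM0006, and BTM0009 are the only items reported by BioTransformer alone.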

Evaluation of BioTransformer’s metabolite identification tool

The final evaluation of BioTransformer consisted of simply identifying putative human/mammalian metabolites of epicatechin using the BioTransformer Metabolite Identification Tool (BMIT). This was designed to simulate a real case involving the MS-based experimental analysis of epicatechin metabolites produced by rats upon a five-day treatment with epicatechin, as done by two of the co-authors of this manuscript (CM and JF). Epicatechin is an important compound from the chemical class of flavan-3-ols, and is known to exhibit cardiovascular health benefits [85–87]. It is a major component from cocoa extracts, and is also abundant in apples, grapes, berries, and tea. Briefly, rats were fed for 5 days a standardized diet supplemented with epicatechin. Spot urines were sampled after the supplementation period and compared to the spot urines sampled under the same conditions after 9 days of the same diet without epicatechin. The samples were analysed by high-resolution mass spectrometry (UPLC-QToF; Bruker, Impact II), with the mass spectrometer operated in the positive ion mode. More detailed information about the specific experimental protocols, the treatment protocols and the mass spectral data extraction/analysis is provided in Additional file 2.

In order to identify the metabolites observed in our study, the BMIT module used a set of 260 neutral monoisotopic masses, derived from the [M + H]+ ions extracted from the experimental QToF MS data collected from the rat urine samples, ranging from 53.4896 to 969.8669 Da. Monoisotopic masses were generated by subtracting 1.00727 Da from the ions extracted from the MS dataset. Details regarding the data extraction process are provided in Additional file 2. These masses exhibited marked increases in intensity after epicatechin supplementation compared to baseline. The human supertransformer (option superbio) was used to facilitate putative compound identification. From the 260 monoisotopic masses that were extracted, BMIT identified 37 possible metabolites of epicatechin corresponding to 20 unique masses. These masses do not correspond to adducts or

isomers and may therefore be considered parent ions (Additional file 7: Table S1). These putative identifications will have to be further investigated with MS/MS experiments and validated against authentic standards for more definitive identification. In order to acquire additional support for the identity of the predicted metabolites, the scientific literature was searched manually to collect structural data regarding epicatechin metabolites reported in previous experimental studies of both humans and rats. A total of 56 single- and multi-step metabolites of epicatechin, corresponding to 37 monoisotopic masses, were identified (Additional file 7: Tables S1 and S2). Of the 37 predicted metabolites matching our experimental data, 22 matched 11 unique and previously reported monoisotopic masses. Among those, 18 compounds corresponded to previously reported metabolites. For the nine other experimental masses that had matches with BMIT predictions, 15 possible metabolites (never previously reported) were obtained. Figure 9 shows examples of the suggested epicatechin metabolites with their masses, as identified in our study. A complete list of predicted epicatechin metabolites, along with the metabolic pathways leading to each metabolite, is available in Additional file 8. Moreover, metadata (e.g. masses, retention times), and comparisons to previously reported data, can be found in Additional file 7: Table S1.

We also tested whether BMIT could identify any of the remaining 38 known metabolites (corresponding to 26 unique masses) previously reported, but not observed in our study, or not selected by our data treatment parameters. The 26 unique masses were provided to BioTransformer as input, and the identification was performed using the same mass tolerance as before (0.01). BMIT was able to suggest 28 molecules for 19 unique masses. Among those, 21 compounds corresponding to 18 unique monoisotopic masses had previously been reported as epicatechin metabolites (Additional file 7: Table S2). Figure 9 illustrates a number of epicatechin metabolites exclusively reported in previous studies, which were correctly identified by BMIT (Fig. 9b), as well as a previously reported metabolite that was not identified by BMIT (Fig. 9c). BMIT’s identification results are available in Additional file 9, and their comparison to previously reported data is available in Additional file 7: Table S2.

Overall, BMIT was able to suggest 39 epicatechin metabolites that were previously reported in the literature, 18 of which were observed in our study. Moreover, BMIT suggested 28 epicatechin metabolites that had not been reported in previous studies (17 corresponding to masses that do not match previously reported ones, and 11 extra structures matching previously known masses).

Fig. 9 Identification of predicted metabolites of Epicatechin in humans (which are assumed to be nearly identical for rats). The figure illustrates: a metabolites correctly identified by BMIT, and corresponding to masses (in Da) observed in our experimental study; b metabolites correctly identified by BMIT, and corresponding to masses observed exclusively in previous studies, and; c a previously reported metabolite of epicatechin not identified by BMIT. (2R)-2-(3,4-diOH-phenyl)-5,7-diOH-2,4-DBP stands for (2R)-2-(3,4-dihydroxyphenyl)-5,7-dihydroxy-2,4-dihydro-1-benzopyran-3-one

Discussion

BioTransformer’s design and implementation

BioTransformer is a software tool that combines both a knowledge-based approach and a machine learning approach to predict the metabolism of small molecules, and to assist in metabolite identification. The knowledge-based system consists of a biotransformation database (MetXBioDB), a knowledgebase (the reaction knowledgebase), and a reasoning engine. MetXBioDB is a unique resource that is freely available, and covers a wide range of enzymatic reactions that take place in human tissues, the human gut and the environment (soil and water microflora). In contrast to most publicly available databases, MetXBioDB provides detailed biological and chemical information about all of its biotransformations, including the catalyzing enzymes, the substrates, the products, and the biotransformation rule(s) that is/are applied. MetXBioDB describes the metabolism of > 2000 compounds catalyzed by ~ 15 enzyme families. For each biotransformation, at least one scientific source or reference is provided. MetXBioDB is stored as a JSON document, which can be easily parsed.

One potential application of MetXBioDB is in the design of biotransformation rules with narrow specificity, which can be used for in silico metabolism prediction. In fact, this resource has already been used (in addition to other data) to successfully design > 300 biotransformation rules, which were used to annotate the biotransformations in the database and predict metabolites via the BioTransformer Reasoning Engine. Despite the aforementioned strengths of MetXBioDB, the database still has a number of limitations. Although it covers a large number of enzymatic reactions, it is clear that more data is needed in order to cover an even larger set of reactions (e.g. oxidation reactions) catalyzed by enzymes other than CYP450s. It is also clear that there is a need to define more constraints and/or build additional models that would increase the quality of the predictions. Moreover, users could benefit from data about the different sites of metabolism for each specific biotransformation, as it would serve as a training set for the development of models for the prediction of sites of metabolism (SoMs). For the current version of MetXBioDB, the intent was simply to provide an easily readable and comprehensible data set. However, providing MetXBioDB in a database format that can be parsed and queried in a more sophisticated way (e.g. SQL) would make the database much more useful to a broader number of users. Efforts are underway to do so for the next release of MetXBioDB. We welcome and encourage contributions in regard to the curation, improvement, and expansion of this resource.

Evaluation of BioTransformer’s predictions

In our first test, BioTransformer was evaluated against Meteor Nexus (v.3.0.1). Meteor Nexus is a commercially available software tool that is considered to be the gold standard for predicting biotransformations of xenobiotics. While BioTransformer achieved a better precision (49%) and recall (88%) than Meteor Nexus at the equivocal level of confidence (35% precision, and 71% recall), Meteor Nexus’ precision improved significantly at the plausible (56%), and probable (59%) levels. The increase in Meteor Nexus’ precision matched our expectations, as the minimum likelihood threshold for metabolite selection increased, thus reducing its probability of selecting unconfirmed metabolites. However, the 68% relative increase in precision (from Equivocal to Probable, i.e. from 35% to 59%) led to an 82% relative decrease in recall (from 71% to 13%). As a consequence, while Meteor Nexus predicted a higher percentage of true metabolites at these levels, compared to BioTransformer, it returned a significantly lower number of true metabolites.

It is worth noting that BioTransformer heavily relies on the selective nature of the biotransformation rules and other structural constraints, in addition to its implementation of relative reasoning. On the other hand, Meteor Nexus combines the continuous absolute scoring of biotransformations with relative reasoning, providing binned data for different levels of reasoning through a more dynamic scoring system. Overall, the performance of BioTransformer suggests that the freely accessible BioTransformer tool could be used to assist scientists in various drug discovery and environmental safety studies.

In our second test, we evaluated BioTransformer’s performance in predicting single-step human and human gut microbial metabolism of 20 endobiotics and xenobiotics. Overall, 69% of BioTransformer predictions matched experimentally confirmed metabolites. Moreover, BioTransformer was able to predict 87% of all reported (and experimentally confirmed) metabolites. The better performance, compared to the first test, can be partly explained by the fact that some endobiotics, such as sphingo- and glycerophospholipids, follow very classical and well-known metabolic pathways (Additional file 2: Fig. S3), which were encoded in the reaction knowledgebase. However, these compounds represent only 15% of the second test set. Therefore, these results still show that BioTransformer was also able to accurately predict the metabolism of compounds with a more complex metabolism (Fig. 7). In fact, BioTransformer was able to correctly predict the human and human gut metabolism of polyphenols (e.g. Epicatechin), and pharmaceuticals (e.g. L-DOPA). This is very promising, as little is known about gut microbial metabolism of those classes of compounds. Even for the well-studied, and biologically relevant class of polyphenols, a lot of experimental work is needed to

validate the metabolic pathways for hundreds of known compounds. BioTransformer could be used to provide accurate suggestions about the identity of their metabolites and propose metabolic pathways, which could then in turn be validated experimentally.

The third evaluation involved the comparative assessment of BioTransformer’s and ADMET Predictor’s capabilities to accurately predict CYP450 metabolism of 60 pharmaceuticals, pesticides, food metabolites, and other endogenous and exogenous compounds. The comparable precision of BioTransformer and ADMET Predictor (46% and 47%, respectively) shows that on average, about half of their predictions matched experimentally confirmed metabolites. However, BioTransformer was able to predict 90% of all experimentally confirmed metabolites, which is significantly higher than the 61% predicted by ADMET Predictor.

Overall, the first three tests demonstrate BioTransformer’s ability to accurately predict human and human gut microbial metabolism for a very diverse set of metabolites, covering endogenous metabolites, pharmaceuticals and personal care products, food compounds, as well as other exogenous compounds. The comparative assessments of BioTransformer with Meteor Nexus and ADMET Predictor show that while BioTransformer is slightly slower, it consistently performs better, and it also addresses some of their shortcomings. In particular, BioTransformer is open access, and it covers a much wider range of chemical substrates and metabolic biotransformations.

In order to evaluate BioTransformer’s ability to predict environmental metabolism, we compared its prediction results with the EAWAG-BBD/PPS system. It is worth noting that the biotransformation and preference rules we encoded in BioTransformer were based on the same set of rules defined by the EAWAG-BBD/PPS. The key difference was that the rules were encoded in the same common SMIRKS/SMARTS format used by all of BioTransformer’s other transformer tools. Based on the sample tests provided in the Results section, it is clear that BioTransformer was able to accurately replicate the predictions provided by the EAWAG-BBD/PPS system. These results suggest that BioTransformer could also be used to accurately predict environmental microbial metabolism.

In a fourth test, we evaluated BioTransformer’s ability to identify metabolites using its BMIT module. This task tacitly relies on the metabolism prediction task, and BioTransformer was able to suggest 37 metabolites matching 20 masses from a list of 260 monoisotopic masses extracted from the MS analysis of urine samples collected after exposure to epicatechin (Additional file 7: Table S1). Of those, 18 metabolites were identified as previously known metabolites. Twenty-six monoisotopic masses matching 36 reported epicatechin metabolites were not observed in our experimental study. This variation in the observed metabolites may be caused by different experimental settings and analytical conditions (e.g. length of the treatment, species, gender, dietary background, sample preparation and analysis methods) in different studies. For example, rats are expected to perform less sulfonation of epicatechin than humans [87]. In a second run, BMIT was used to search for metabolites corresponding to monoisotopic masses that were observed in previous studies but not in our experimental dataset. In this test it was able to correctly identify another 21 known epicatechin metabolites. Overall, BMIT was able to predict 39 out of 56 previously reported compounds. The discrepancy between the number of metabolites suggested by BMIT and the number of previously reported metabolites could be explained by several factors. First, ten of the known epicatechin metabolites not predicted by BMIT (3 masses observed in our study) are products of a 2-step conjugation, but the superbio option simulates only one conjugation step, as it is often sufficient to make a molecule stable and hydrophilic enough for excretion (based on experimental data from MetXBioDB).

Second, in some cases (e.g. mass = 195.0532 Da), BMIT predicted two isobaric metabolites, but only one peak (retention time = 5.94 min.) was found in the spectra, indicating that only one metabolite was present in the sample or that the analytical conditions did not allow the resolution of isobaric compounds (Supplemental Table 1). Often, the same reaction (especially conjugations) can occur at several locations within a molecule, thus producing regioisomers. The opposite was seen in the case of mass = 314.064 Da, which corresponds to 3 predicted metabolites (glucuronic acid conjugates), with 5 observed peaks exclusively found in samples collected after exposure to epicatechin at 8, 11, 11.40, 11.64, 11.75 min. These examples illustrate a common problem with metabolism prediction in the identification of the correct sites of metabolism. We believe that increasing the number of true positives, as well as reducing the number of false positives, could be achieved by integrating models that more accurately predict sites of metabolism.

BMIT was able to identify metabolites such as (2R)-2-(3,4-diOH-phenyl)-5,7-diOH-2,4-DBP (Fig. 9a), and other conjugated metabolites corresponding to masses not previously reported. It is worth mentioning that these are only putative predicted metabolites, and that the results of the BMIT must be validated experimentally, through further MS-based investigations. However, it was beyond the scope of this particular experimental study to fully investigate the metabolism of epicatechin.

Indeed, we believe that complementary analytical platforms such as GC–MS would be necessary to cover the whole chemical space of epicatechin metabolites. Thorough identification of the observed metabolites using MS/MS or authentic (synthesized) standards was not performed in our assessment of the metabolites present in urine. Epicatechin is metabolized in the liver, and more extensively by the gut microbiome. The ability of BMIT to identify/predict both human and human microbial epicatechin metabolites suggests that this module would be a useful asset in elucidating the dark matter in host-microbiome metabolomics [88]. BMIT should also be a very useful tool for general metabolism prediction and metabolite identification using MS or MS/MS data. In addition, the predictions generated by BMPT could be very useful for suspect-screening analysis, and thereby permit faster non-targeted data analysis and more facile putative compound identification. Thanks to in silico MS/MS fragmentation tools such as CFM-ID, the computation of MS/MS-spectra for those metabolites could be used to provide additional evidence.

We believe the examples used here nicely demonstrate the ability of BioTransformer to accurately predict a wide range of metabolic reactions, for a number of different types of small molecules (endogenous and xenobiotic compounds) and a number of different biosystems (humans, microbial/environmental). BioTransformer is unique in its ability to cover almost all aspects of non-essential metabolism (drug/xenobiotic metabolism, endogenous compound metabolism, gut microbial metabolism, environmental metabolism). This makes it particularly useful for the wide-ranging applications seen in metabolomics and other small molecule studies. Furthermore, the accuracy, coverage, precision and recall of BioTransformer appear to be as good as, or even much better than some of the most highly regarded metabolic prediction systems now available. It is also notable that BioTransformer, unlike most of its competitors, is freely available.

Certainly a more extensive analysis of a much larger set of query compounds would likely better illustrate the strengths and weaknesses of BioTransformer. However, it is important to remember that there are relatively few experimentally validated, comprehensive sets of metabolic “biotransformation trees” and that the examples selected here (which required hundreds of hours to assemble, curate and validate) cover a good portion of the better known trees.

While there are a number of strengths to BioTransformer, we believe that certain improvements could still be made to the program. First, the addition of more biotransformation data would certainly provide additional reaction “fodder” to create more biotransformation rules. Additional biotransformation data would also provide further statistical evidence to fine tune the reaction preference rules (relative reasoning) and occurrence ratios for absolute/relative reasoning. In particular, adding an option for absolute reasoning would give BioTransformer the ability to select candidates with a set cut-off score. Currently BioTransformer’s biotransformation database (MetXBioDB) and its reaction knowledgebase cover only a small portion of gut microbial metabolism (i.e. metabolism of plant-derived polyphenols). As many xenobiotics as well as endogenous compounds are known to be metabolized in the gut [75, 89–92], it will be important to further expand the coverage of gut microbial metabolism in BioTransformer. We plan to make these improvements in upcoming versions of BioTransformer. Over the longer term we are hoping to integrate more machine learning prediction models (e.g. SoMs for CYP450 metabolism, and SoMs for phase II metabolism). This integration depends mostly on the amount of data available as machine learning hinges on having large and diverse training sets to optimize its performance. Given that the number of experimentally confirmed biotransformations is still quite low for the systems of interest, it is likely that this will take a number of years to complete.

Conclusion

In this work, we have presented BioTransformer, a freely available, open access software tool that supports the rapid, accurate, comprehensive prediction of metabolism of small molecules in both mammals and environmental microorganisms. BioTransformer can also assist in metabolite identification using experimental MS data. BioTransformer can be used either as a command-line tool or as an imported library. The Java executable and Java library are open access, and freely available at https://bitbucket.org/djoumbou/biotransformerjar/. Moreover, BioTransformer is also freely accessible as a web service at www.biotransformer.ca. The web service provides users with the possibility to manually or programmatically submit queries, and retrieve data generated by the BioTransformer software tool.

Within mammals, we have shown that BioTransformer was able to accurately predict single-step biotransformations for a diverse set of xenobiotics, including drugs, pesticides, and food compounds. The reactions that BioTransformer predicts cover Phase I and Phase II metabolism in mammals, as well as the human gut microbial metabolism. Overall, BioTransformer was shown to perform better than Meteor Nexus and ADMET Predictor, two highly regarded commercial software tools for in silico metabolism prediction. Unlike most other metabolic prediction tools, BioTransformer also supports the prediction

of metabolism of small molecules by environmental microbes. The integration of environmental metabolism with endogenous human and gut microbial metabolism allows BioTransformer to address many of the predictive metabolic needs of metabolomics or exposomics researchers, which tend to span a much wider range than, say, drug researchers, food chemists or environmental scientists.

Despite its strengths, BioTransformer is not without some limitations. Addressing these would certainly make the program much more flexible, more accurate, and more comprehensive. Obvious improvements for the current version of BioTransformer include: (1) the validation of BioTransformer’s predictions for a larger and more diverse test set of molecules; (2) the experimental validation of BioTransformer’s BMIT predictions for a larger set of molecules and experimental data; (3) the expansion of the reaction knowledgebase to cover more reactions, and (4) the addition of new options for metabolite prediction/ranking.

Author details
1 Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada. 2 INRA, Human Nutrition Unit, Université Clermont Auvergne, 63000 Clermont‑Ferrand, France. 3 Department of Food and Experimental Nutrition, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil. 4 Department of Information Technology, CEU San Pablo University, Madrid, Spain. 5 Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada. 6 Alberta Machine Intelligence Institute, University of Alberta, Edmonton, AB T6G 2E8, Canada.

Acknowledgements
We would like to thank Nazanin Assempour (NA), Ithayavani Iynkkaran (II), David Arndt (DA), Carin Li (CL), Xuan Cao (XC), Zachary Budinski (ZB), An Chi Guo (AG), and Hasan Bradan (HB) from the Wishart lab for their contributions. NA, and II helped coordinating early efforts in the development of MetXBioDB. DA, XC, ZB contributed in the curation of MetXBioDB. DA, XC, ZB, CL, HB, and AG contributed in improving the design and functionality of the webserver. We would also like to thank Kathrin Fenner from the Swiss Federal Institute of Aquatic Science and Technology (EAWAG) for answering some of our questions in regard to the EAWAG-BBD/PPS system.

Competing interests
The authors declare that they have no competitive interests.

Availability and requirements
Project name: BioTransformer.
Project home page: Server http://www.biotransformer.ca; Command-line tool/Library https://bitbucket.org/djoumbou/biotransformerjar.
Operating system(s): Web service: platform independent. Command-line tool/Library: Windows, Linux, MacOS.
Programming language: Java.
Other requirements: Java 1.8.
Any restrictions to use by non-academics: No login requirement for running or accessing the results using the web service. Permission of the authors is required for use in commercial applications.
License: GPLv2.1.

Additional files

Additional file 1. Cited structures.
Additional file 2. Additional-Notes-Introduction-Methods-Evaluation.
Additional file 3. Phase-II-Filter-Features.
Additional file 4. Predictions: BioTransformer vs. Meteor Nexus.
Additional file 5. BioTransformer: human and human gut microbial metabolism.
Additional file 6. BioTransformer vs. ADMET Predictor.
Additional file 7. Epicatechin metabolite identification tables.
Additional file 8. Epicatechin metabolites identification part 1.
Additional file 9. Epicatechin metabolites identification part 2.

Funding
This work was supported by grants from Alberta Innovates (the Collaborative Research and Innovation Opportunity Fund), Genome Alberta (a division of Genome Canada), the Canadian Institutes of Health Research (CIHR), and the Agence Nationale de la Recherche (#ANR-14-HDHL-0002-02) for the FoodBAll project (JPI HDHL). JF was an AgreenSkills fellow (app. ID 1007).

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 17 September 2018 Accepted: 22 December 2018

Abbreviations
ADMET: absorption distribution metabolism excretion toxicology; BMIT: BioTransformer metabolite identification tool; BMPT: BioTransformer metabolism prediction tool; CYP450: cytochrome P450; CSV: comma-separated values; EC: enzyme classification; JSON: JavaScript object notation; KB: knowledgebase; PPC: pharmaceutical and personal care product; SDF: structure data file; SMILES: simplified molecular-input line-entry system; InChI: international chemical identifier; SULT: sulfotransferase; UGT: UDP-glucuronosyltransferase.

Authors’ contributions
DSW conceived, initiated and supervised the project. RG provided feedback for the conceptualization of the machine learning system. YDF conceptualized the project, developed the knowledgebase and machine learning systems, designed the prediction algorithms, implemented the algorithms and engines, created the JAR library and Java software, the Rails API, and performed iterative test and evaluations. JF and CM provided expertise in the generation of validation of rules for the gut microbial biotransformation of polyphenols. They also provided expertise and experimental data for the evaluation of BioTransformer’s metabolite identification tool. YDF and AG collaborated in the configuration and optimization of the web service. Every co-author provided significant feedback in the editing of this manuscript, and approved it. All authors read and approved the final manuscript.
