<<

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Taxonomically informed scoring enhances confidence in natural 2 products annotation

3

4 Adriano Rutz1,∆, Miwa Dounoue-Kubo1,2,∆, Simon Ollivier1, Jonathan Bisson3, Mohsen 5 Bagheri1,4, Tongchai Saesong5, Samad Nejad Ebrahimi4, Kornkanok Ingkaninan5, Jean-Luc 6 Wolfender1 and Pierre-Marie Allard1,*

7 ∆equal contribution

8 1School of Pharmaceutical Sciences, EPGL, University of Geneva, University of Lausanne, CMU - Rue Michel 9 Servet, 1 1211 Geneva 4 – Switzerland

10 2Faculty of Pharmaceutical Sciences, Tokushima Bunri University, 180 Yamashiro-cho, Tokushima, 770-8514 – 11 Japan

12 3Center for Natural Product Technologies, Program for Collaborative Research in the Pharmaceutical Sciences 13 (PCRPS) and Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, 833 14 South Wood Street, Chicago, Illinois 60612, United States

15 4Department of Phytochemistry, Medicinal and Drugs Research Institute, Shahid Beheshti University, G.C., 16 Evin, Tehran, Iran.

17 5Department of Pharmaceutical Chemistry and Pharmacognosy, Faculty of Pharmaceutical Sciences and Center of 18 Excellence for Innovation in Chemistry, Naresuan University, Phitsanulok, Thailand, 65000

19 * Correspondence: 20 Corresponding Author 21 [email protected]

22

23 Keywords: metabolite annotation, chemotaxonomy, taxonomically informed scoring system, 24 natural products chemistry, metabolomics, taxonomic distance.

25 Abstract

26 Mass spectrometry (MS) hyphenated to liquid chromatography (LC)-MS offers unrivalled sensitivity 27 for metabolite profiling of complex biological matrices encountered in natural products (NP) research. 28 With advanced platforms LC, MS/MS spectra are acquired in an untargeted manner on most detected 29 features. This generates massive and complex sets of spectral data that provide valuable structural 30 information on most analytes. To interpret such datasets, computational methods are mandatory. To 31 this extent, computerized annotation of metabolites links spectral data to candidate structures. When 32 profiling complex extracts spectra are often organized in clusters by similarity via Molecular 33 Networking (MN). A spectral matching score is usually established between the acquired data and 34 experimental or theoretical spectral databases (DB). The process leads to various candidate structures 35 for each MS features. At this stage, obtaining high annotation confidence level remains a challenge 36 notably due to the high chemodiversity of specialized metabolomes. bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

37 The integration of additional information in a meta-score is a way to capture complementary 38 experimental attributes and improve the annotation process. Here we show that integrating 39 unambiguous taxonomic position of analyzed samples and candidate structures enhances confidence 40 in metabolite annotation. A script is proposed to automatically input such information at various 41 granularity levels (species, genus, and family) and weight the score obtained between experimental 42 spectral data and output of available computational metabolite annotation tools (ISDB-DNP, MS- 43 Finder, Sirius). In all cases, the consideration of the taxonomic distance allowed an efficient re-ranking 44 of the candidate structures leading to a systematic enhancement of the recall and precision rates of the 45 tools (1.5 to 7-fold increase in the F1 score). Our results clearly demonstrate the importance of 46 considering taxonomic information in the process of specialized metabolites’ annotation. This requires 47 to access structural data systematically documented with biological origin, both for new and previously 48 reported NPs. In this respect, the establishment of an open structural DB of specialized metabolites and 49 their associated metadata (particularly biological sources) is timely and critical for the NP research 50 community.

51

52 1 Introduction

53 Specialized metabolites define the chemical signature of a living organism. Sessile organisms such as 54 plants, sponges and corals, but also microorganisms (bacteria and fungi), are known to biosynthesize 55 a wealth of such chemicals. These molecules can play a role as defense or communication agents 56 (Brunetti et al., 2018). However, the functional role of these metabolites for the producer is not always 57 fully understood. The comprehension of these functions is one of the goals of chemical ecology and 58 recent research illustrates the importance of specialized metabolism (Hoffmann et al., 2018; Schmitt et 59 al., 1995). Throughout history, humans have been relying on derived products for a variety of 60 purposes: housing, feeding, clothing and, especially, medication. Our therapeutic arsenal is in fact 61 deeply dependent on the chemistry of natural products (NPs) whether they are used in mixtures, 62 purified forms or for hemi-synthetic drug development. After a period of disregard by the 63 pharmaceutical industry, NPs are the object of a renewed interest (Shen, 2015). Today, technological 64 and methodological advances allow to establish more precise views of the chemo-biologic aspects of 65 life. In the field of NPs, advances in genomics and metagenomics now allow the anticipation of the 66 chemical output directly from environmental DNA (Alanjary et al., 2019; Craig et al., 2009). 67 Developments in metabolite profiling by mass spectrometry (MS) grant access to large volumes of 68 high-quality spectral data from minimal amount of samples and appropriate data analysis workflows 69 allow to efficiently mine such data (Wolfender et al., 2019). Initiatives such as the Global Natural 70 Products Social (GNPS) molecular networking (MN) (Wang et al., 2016) project offer both a living 71 MS repository and the possibility to establish MN organizing MS data. Despite such advancements, 72 metabolite identification remains a major challenge for both NP research and metabolomics (Kind et 73 al., 2018). Metabolite identification of a novel compound requires physical isolation of the analyte 74 followed by complete NMR acquisition and three-dimensional structural establishment via X-ray 75 diffraction or chiroptical techniques. For previously described compounds, metabolite identification 76 implies complete matching of physicochemical properties between the analyte and a standard 77 compound (including chiroptical properties). Metabolite identification is thus a tedious and labor- 78 intensive process, which should ideally be reserved to novel metabolites’ description. Any less 79 complete process should be defined as metabolite annotation. By definition, metabolite annotation 80 can be applied at a higher throughput and offers an effective proxy for the chemical characterization 81 of complex matrices. This process includes dereplication (the annotation of previously described

2 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

82 molecules prior to any physical isolation process) and allows focusing isolation and metabolite 83 identification efforts on potentially novel compounds only (Newman, 2017).

84 Given its sensitivity, selectivity and structural determination potential, MS is a tool of choice for 85 metabolite annotation in complex mixtures. High resolution mass spectrometers (HRMS) affording 86 high mass accuracy (ppm range) are now routinely used (Eliuk and Makarov, 2015; Lommen et al., 87 2011; Olsen et al., 2005). As a result of this evolution, unambiguous molecular formula (MF) can be 88 assigned in most cases. However, the isomeric nature of many NPs indicates that MF determination is 89 a necessary but non-sufficient step in the metabolite annotation process. Acquisition of fragmentation 90 spectra (tandem MS (MS/MS) or MSn) offers a way to gain structural insights on analytes. Such 91 analytical approach (physically breaking down the analyte in constitutive building blocks to infer its 92 general structure) is analog to DNA (genomics) or protein sequencing (proteomics). However, 93 compared to those two fields, a notable difference lies within the non-polymeric nature of small 94 metabolites, which partly explains the difficulties encountered in linking a fragmentation spectrum to 95 a chemical structure in metabolomics. The dependence of the fragmentation behavior of analytes on 96 the mass spectrometer geometry adds further complexity to the interpretation of MS data (Johnson and 97 Carlson, 2015). To interpret such data, a common approach consists in comparing experimental spectra 98 to established spectral libraries. In the specific case of GC-MS, the reproducibility of the electron 99 ionization (EI) mediated gas-phase fragmentation process allowed to establish huge spectral libraries 100 (e.g. NIST) comparable across laboratories worldwide. The situation is quite different for soft 101 ionization electrospray (ESI)-MS spectra acquired in LC-MS. The lack of standardization, 102 fragmentation variability and difficult access to standards has complicated the establishment of robust 103 experimental libraries. As a result, it is estimated that no more than 25,000 unique compounds are 104 present in such libraries (Dührkop et al., 2015). Various computational MS solutions have appeared to 105 overcome such limitations and link experimental spectra to chemical structures. They can be classified 106 into experimental rule-based strategies (MassHunter, Agilent Technologies), combinatorial 107 fragmentation strategies (MetFrag, (Ruttkies et al., 2016)), machine learning based approaches using 108 stochastic Markov modelling (CFM-ID, (Allen et al., 2014; Djoumbou-Feunang et al., 2019)) or 109 predicting fragmentation trees (Sirius) (Böcker et al., 2009; Dührkop et al., 2019). Computationally 110 demanding ab initio calculations, modeling the gas-phase fragmentation process, have also been 111 proposed (Bauer and Grimme, 2016). The output of such tools is, in general, a list of candidate 112 molecules ranked according to a score. Such score can be based on a single measure (e.g. spectral 113 similarity in CFM-annotate) (Allen et al., 2014) or integrate combined parameters (MS-Finder, Sirius) 114 (Dührkop et al., 2019; Tsugawa et al., 2016, 2019).

115 In our view, increased confidence in specialized metabolite annotation can be achieved by the 116 establishment of a meta-scoring system capturing the similarity of diverse attributes shared by the 117 queried analytes and candidate structures (Allard et al., 2017). Such meta-score could for example 118 consider 1) spectral similarity 2) taxonomic distance between the producer of the candidate compound 119 and the annotated biological matrix 3) structural consistency within a cluster and 4) physico-chemical 120 consistency. See Fig 1 for a conceptual overview of such metascore. The rationale behind 121 comprehensive scoring systems is that orthogonal information (not directly related to spectral 122 comparison) should strengthen the metabolite annotation process. This has been illustrated by using 123 additional information, such as the number of literature references related to a candidate structure and 124 basic retention time scoring based on logP in MetFrag 2.2 (Ruttkies et al., 2016). Recently, retention 125 order prediction has been integrated to an MS/MS prediction tool and also provided increased 126 performance in metabolite annotation (Bach et al., 2018). Another example is the Network Annotation 127 Propagation (NAP) approach, which takes advantage of the topology of a MN to proceed to a re-

3

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

128 ranking of annotated candidates within a cluster where structural consistency is expected (da Silva et 129 al., 2018).

130 To the best of our knowledge, the automated inclusion of the taxonomic dimension has not been 131 considered in current metabolite annotation strategies. The central hypothesis of this work is directly 132 inferred from the characteristics of the specialized metabolome. Unlike primary metabolites, which are 133 mostly ubiquitous compounds central to cell functioning, specialized metabolites are, by definition, 134 strongly linked to the taxonomic position of the producing organisms. Such principles were already 135 formulated in 1816 by de Candolle (de Candolle, 1828) who postulated that 1) Plant would 136 be the most useful guide to man in his search for new industrial and medicinal plants and 2) Chemical 137 characteristics of plants will be most valuable to plant taxonomy in the future. More than 200 years 138 later both principles have been largely verified, as recently illustrated (Hoffmann et al., 2018; Kang et 139 al., 2019).

140 It thus appears desirable to consider taxonomic information when describing the chemistry of an 141 organism. A taxonomic filtering process could be used to limit a DB to compounds previously isolated 142 in organisms situated within a given taxonomic distance from the biological source of the analyte to 143 annotate. However, results of chemotaxonomic studies also highlight the presence of broadly 144 distributed metabolites. For example, liriodenine (MUMCCPUVOAUBAN-UHFFFAOYSA-N) is a 145 widely distributed alkaloid produced by more than 50 distinct biological sources, it is found in over 30 146 genus belonging to 13 botanical families. Convergent biosynthetic pathways offer intriguing example 147 of unrelated species, shaped by evolution, that end up producing similar classes of compounds 148 (Pichersky and Lewinsohn, 2011). To adjust for the annotation of such compounds, a more tolerant 149 weighted scoring system allowing, both, to consider spectral similarity and taxonomic information 150 while conserving the independence of the individual resulting scores appears as a better solution.

151 In the frame of this study we propose such taxonomically informed scoring system and benchmark the 152 impact of taxonomic distance consideration on a set of 2107 identified molecules using three different 153 computational mass spectrometry metabolite annotation tools (ISDB-DNP, MS-Finder and Sirius).

4 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

154 155 Figure 1 Conceptual overview of a possible meta-scoring system for specialized metabolite annotation 156 incorporating 1) spectral similarity 2) taxonomic distance between the biological source of the queried spectra 157 and candidate annotations in the database (DB) 3) structural consistency within a cluster (see (da Silva et al., 158 2018)) and 4) physico-chemical consistency (see (Ruttkies et al., 2016) and (Bach et al., 2018)). A factor (w) 159 should allow to attribute relative weight to individual scores and modulate their contribution to the overall score.

160

161 2 Results

162 2.1 Conception of the taxonomically informed scoring system 163 The constituents of specialized metabolomes, as expression products of the genome, should reflect the 164 taxonomical position of the producing organism. Our initial working hypothesis is thus that the 165 attribution of a weight, inversely proportional to the taxonomic distance between the biological source 166 of the queried analyte and the one of the candidate structures, should be a valuable input into the 167 metabolite annotation process.

168 The proposal of the taxonomically informed scoring system is thus to complement the initial score 169 given to candidate structures by existing metabolite annotation tool by such a weight. To this end, the 170 initial score should first be normalized. Then, weights, inversely proportional to the taxa level 171 difference (family < genus < species) are given. Such weights are attributed when an exact match is 172 observed between biological source denominations at the following taxa levels: family, genus and 173 species. The weight corresponding to the shortest taxonomic distance is then added to the initial score. 174 Candidates are further re-ranked according to the newly weighted score. The general outline of the 175 taxonomically informed scoring system is presented in Fig. 2. In this study, no phylogenetic distances 176 within taxa (e.g. family, genus or species) were considered.

5

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

MS1 MS2

metabolite annotation

A. Spectral Query Spectra ID biological source 1

taxonomically informed scoring B. Query Spectral Candidate Spectra Biological Score ID structures source 1 XKTZVOMYVJKQTH 0.55 JGPRUCFGUSXLES 0.53 RQWMGMIZGOCPMP 0.50

ALLQMEDAYDKMIO 0.45 JUTRIMPZQCIOOF 0.42 1 VNBUMBNLPGLBML 0.41 2 … 3

Query Candidate C. Spectral Candidate Taxonomical Combined Spectra Biological Score Biological ID structures Weight Score source source 1 VNBUMBNLPGLBML 0.41 3 3.41 XKTZVOMYVJKQTH 0.55 2 2.55 ALLQMEDAYDKMIO 0.45 1 1.45 JUTRIMPZQCIOOF 0.42 1 1.42 JGPRUCFGUSXLES 0.53 0 0.53

RQWMGMIZGOCPMP 0.50 0 0.50 … 177 178 Figure 2 General outline of the taxonomically informed scoring system. Candidates structures are 179 complemented with their biological sources at the family, genus and species level, when available. A weight, 180 inversely proportional to the taxonomic distance between the biological source of the standard compound and 181 the one of the candidate compounds is given when the biological source of the candidate structures matches the

6 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

182 biological source of the standard at the family, genus and species level, respectively. The maximal weight for 183 each candidate is then added to its spectral score to yield a weighted spectral score. Finally, candidates are re- 184 ranked according to the weighted spectral score.

185

186 In order to apply the taxonomically informed scoring system in a generic manner, the initial scores 187 given by the metabolite annotation tools were rescaled to obtain values ranging from 0 (worst 188 candidate) to 1 (best candidate). The weights, given according to the taxonomic distance between the 189 biological source of the queried spectra and the one of the candidate compounds, were integrated in 190 the final score by a sum. This choice allows to keep independence between individual components of 191 the metascore (see Fig. 1). Since the boundaries of the candidates’ normalized score in a given dataset 192 are defined (0 to 1), the minimal weight to be applied to the worst candidate for it to be ranked at the 193 first position after weighting is 1. Following our initial hypothesis, a weight of 1 was thus given if a 194 match between biological sources was found at the family taxa level. In the case were the initial 195 maximal score (1) would be given to a candidate and added to a weight corresponding to a match at 196 the family level (1), a weight of at least 2 should be given for a candidate having the worst score to be 197 ranked above. A weight of 2 was thus given if a match between biological sources was found at the 198 genus taxa level. Following the same logic, a weight of 3 was given for matches between biological 199 sources at the species level.

200 2.2 Benchmarking the influence of taxonomic information consideration in metabolite 201 annotation 202 2.2.1 Establishment of a benchmarking dataset 203 In order to establish the importance of considering taxonomic information during the metabolite 204 annotation we needed to construct a dataset constituted by molecular structures, their MS/MS 205 fragmentation spectra acquired under various experimental conditions and their biological sources in 206 the form of a fully resolved taxonomical hierarchy. This dataset, denominated hereafter “benchmarking 207 dataset”, was built by combining a curated structural/biological sources dataset (obtained from the 208 DNP) and a curated structural/spectral dataset (obtained from GNPS libraries) as described in Material 209 and Methods. Below are the results of each step.

210 2.2.1.1 Structural and biological sources dataset 211 The prerequisite to apply a taxonomically informed scoring in a metabolite annotation process is to 212 dispose of the biological source information of 1) the queried spectra and 2) the candidate structures. 213 To the best of our knowledge, there is currently no freely available database (DB) compiling NP 214 structures and their biological sources down to the species level. For this study, we thus exploited the 215 DNP that is commercially available and allows export of structures and biological sources as associated 216 metadata. For example, the output for the biological source field of pulsaquinone, “Constit. of the roots 217 of Pulsatilla koreana.”, is Plantae | Tracheophyta | Magnoliopsida | | Ranunculaceae | 218 Pulsatilla | Pulsatilla cernua. Within the Global Names index, biological sources resolved against the 219 Catalogue of Life source were kept resulting in 219,800 entries with accepted scientific names and full, 220 homogeneous, taxonomy up to the kingdom level. See Material and Methods for example and details 221 concerning the curation process.

222 2.2.1.2 Structural and spectral dataset 223 The GNPS libraries agglomerate a wide and publicly available ensemble of MS/MS spectra coming 224 from various analytical platforms and thus having different levels of quality (Wang et al., 2016). These

7

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

225 spectral libraries were used as representative source of diverse experimental MS/MS spectra to evaluate 226 the annotation improvement that could be obtained by applying taxonomically informed scoring 227 system. For this purpose, all GNPS libraries and publicly accessible third-party libraries were retrieved 228 online (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and concatenated as a single spectral file 229 containing 66646 individual entries. The pretreatment described in corresponding Material and 230 Methods section, yielded a dataset of 40138 structures with their experimental associated MS/MS 231 acquired on different platforms.

232 2.2.1.3 Structural, spectral and biological sources dataset (benchmarking set) 233 To implement the taxonomically informed metabolite annotation process, it is required that both 1) the 234 queried spectra and 2) the candidate structures biological sources are equally resolved (i.e. using the 235 accepted denomination) at the taxonomic level. It was thus necessary to build a spectral dataset for 236 which each entry had a unique structure and a properly documented biological source, which 237 constituted the benchmarking set. For this, the structural and spectral datasets was matched against the 238 structural and biological source dataset, following the procedure detailed in Material and Methods. The 239 full processing resulted in a dataset of 2107 individual entries (characterized NPs with no stereoisomers 240 distinction and a unique biological source associated), which was used for the rest of this study.

241 2.2.2 Evaluation of the improvement of metabolite annotation on the benchmarking set 242 In order to assess the importance of considering taxonomic information in the annotation process, the 243 output of three different computational mass spectrometry-based metabolite annotation solutions were 244 submitted to the taxonomically informed scoring. The 2107 spectra of the benchmarking set were 245 queried using these three tools according to parameters detailed in the Material and Methods section. 246 This resulted in three different outputs constituted by a list of candidates for each entry of the 247 benchmarking set.

248

249 2.2.2.1 Metabolite annotation tools used 250 • ISDB-DNP

251 The first tool, denominated hereafter ISDB-DNP (In Silico DataBase – Dictionary of Natural Products) 252 is metabolite annotation strategy that we previously developed (Allard et al., 2016). This approach is 253 focused on specialized metabolites annotation and is constituted by a pre-fragmented theoretical 254 spectral DB version of the DNP. The in silico fragmentation was performed by CFM-ID, a software 255 using a probabilistic generative model for the fragmentation process, and a machine learning approach 256 for learning parameters for this model from MS/MS data (Allen et al., 2015). CFM is, to our 257 knowledge, the only computational solution currently available able to generate a theoretical spectrum 258 with prediction of fragment intensity. The matching phase between experimental spectra and the 259 theoretical DB is based on a simple spectral similarity measure (cosine score) performed using Tremolo 260 as a spectral library search tool (Wang and Bandeira, 2013). The scores are reported from 0 (worst 261 candidate) to 1 (best candidate).

262 • MS-Finder

263 The second tool is MS-Finder. This in silico fragmentation approach considers multiple parameters 264 such as bond dissociation energies, mass accuracies, fragment linkages and various hydrogen

8 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

265 rearrangement rules at the candidate ranking phase (Tsugawa et al., 2016). The resulting scoring system 266 range from 1 (worst candidate) to 10 (best candidate).

267 • Sirius

268 The third tool to be used is Sirius 4.0. It is considered as a state-of-the-art metabolite annotation 269 solution, which combines molecular formula calculation and the prediction of a molecular fingerprint 270 of a query from its fragmentation tree and spectrum (Dührkop et al., 2019). Sirius uses a DB of 271 73,444,774 unique structures for its annotations. The resulting score is a probabilistic measure ranging 272 between negative infinity and 0 (best candidate).

273 2.2.2.2 Computation of the taxonomically informed scoring system 274

275 R scripts were written to perform 1) cleaning and standardization of the outputs, 2) taxonomically 276 informed scoring and re-ranking and 3) summary of the resulting annotations. First, the outputs were 277 standardized to a table containing on each row: the unique spectral identifier (CCCMSLIB N°) of the 278 queried spectra, the short InChIKey of the candidate structures, the score of the candidates (within the 279 scoring system of the used metabolite annotation tool), the biological source of the standard compound 280 and the biological source of the candidate structures. As described in section 2.1, a weight, inversely 281 proportional to the taxonomic distance between the biological source of the annotated compound and 282 the biological source of the candidate structure, was given when both matched at the family (weight of 283 1), genus (weight of 2) or species level(s) (weight of 3). A sum of this weight (1-3) and the original 284 score (0 to 1) yielded the taxonomically weighted score. This taxonomically weighted score was then 285 used to re-rank the candidates.

286 Taxonomically unweighted results

287 Using each tools’ initial scoring system, on the 2107 experimental MS/MS spectra constituting the 288 benchmarking set, the ISDB-DNP returned 214 (10.2 %) correct annotations at rank 1, Sirius 975 (46.3 289 %) and MS-Finder 180 (8.5 %). The total number of unique correct annotations ranked first covered 290 by ISDB-DNP, Sirius and MS-Finder prior to taxonomically informed scoring reached 1110 or 52.7 % 291 of the benchmarked library. Out of these, 29 (less than 1.4 %), were common to all three tools, 292 indicating the interest of considering various annotation tools when proceeding to metabolite 293 annotation. See Venn diagram in Fig. 3 illustrating the complementarity of the returned annotations. 294 Within the list of all candidates, the ISDB-DNP returned 1750 correct annotations, Sirius 1589 and 295 MS-Finder 574. The ROC curves (Supplementary Material S4) outline the number of correct hits 296 outside first rank and indicate remaining improvement potential.

9

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

A. B. 33 146 53 123 99 9 29 73 889 69 754 376

ISDB-DNP Sirius 205 38 MS-Finder 297 298 Figure 3. Venn diagrams representing common and unique correct annotations of each tool at rank 1 before (A.) 299 and after (B.) taxonomically informed scoring.

300

301 Taxonomically weighted results

302 After weighting and reranking with the taxonomically informed score, the number of correct 303 annotations at rank 1 increased to 1510, 1508 and 546, respectively for ISDB-DNP, Sirius and MS- 304 Finder. The annotation complementarity between the tools is highlighted in Fig. 4. The total number 305 of correct annotations covered by all ISDB-DNP, Sirius and MS-Finder after taxonomically informed 306 scoring reached 1786 or 84.8 % of the benchmarked library. Interestingly, a more than 10-fold increase 307 after taxonomically informed scoring was also observed for the correctly annotated metabolite 308 commonly returned by the three tools 376 (17 %). It is to be noted that no stereoisomer distinction 309 could be performed since all correct matches were assessed by short InChIKey comparison.

310 In order to evaluate the impact of the taxonomically informed scoring system, the F1 score was used. 311 More details can be found in Material and Methods. The results of this treatment on the outputs of the 312 three metabolite annotation tools are displayed on Fig. 4. The taxonomically informed scoring stage 313 led to a systematic increase of the F1 score for the benchmarked tools. This increase was of 7 fold 314 (ISDB-DNP), 1.5 (Sirius) and 3 fold (MS-Finder).

10 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

100 Before taxonomically informed scoring After taxonomically informed scoring

81.6 75 78.3

50 52.80

F1 Score [%] F1 Score 40.7

25

13.40 11.10

0 ISDB-DNP Sirius MS-Finder 315 316 Figure 4. Influence of the taxonomically informed scoring on the F1 score of each metabolite annotation tool.

317

318 2.3 Optimization of weights combination for the taxonomically informed scoring 319 In order to verify our initial hypothesis and define the optimal weights combination (at the family, 320 genus and species taxa level) to be applied for taxonomically informed scoring we proceeded to a 321 global optimization of the taxonomically informed scoring function.

322 To this end, the taxonomic information related to candidate’s annotation was artificially degraded. This 323 step allowed to mimic a “real life” case in which correct candidate annotation’s taxonomic metadata 324 are not necessarily complete down to the species level. Using the procedure detailed in the 325 corresponding Material and Methods section, the algorithm was applied four times on the four 326 randomized datasets. It quickly converged (100 iterations) towards a global maximum (max 1126 hits, 327 see Fig 5.). The global optimal parameters were found to be 0.81, 1.62 and 2.55 for the family, genus 328 and species taxa level, respectively. Such scores are dependent on the nature and completeness of the 329 employed taxonomic metadata. However, the results obtained when applying the Bayesian 330 optimization on the annotation sets (for which taxonomic metadata was randomly degraded) indicated

11

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

331 that optimal results were systematically obtained when the attributed weights were inversely 332 proportional to the taxa hierarchical position, thus confirming our initial hypothesis.

1. 2.

Number of correct hits ranked 1

sp sp

gen fam gen fam

3. 4.

sp sp

gen fam gen fam

333 334 Figure 5. Results of the Bayesian optimization converge toward the optimal weights combination required for 335 a maximal number of correct annotation ranked at the first position. This is observed for four randomly degraded 336 training sets (first optimization round displayed). The results confirm that the applied weights should be 337 inversely proportional to the taxonomic distance between the biological source associated to the queried spectra 338 and the biological source of the candidate structures.

339

340 2.4 Application of the taxonomically informed scoring to the annotation of metabolites from 341 Glaucium sp. 342 The impact of the taxonomically informed scoring was evaluated in a real case example for the 343 annotation of metabolites from Glaucium species ( family). Three species, G. 344 grandiflorum, G. fimberilligerum and G. corniculatum were studied. The ethyl acetate and methanolic 345 extracts of the three species were profiled by UHPLC-HRMS in positive ionization mode using a data- 346 dependent MS2 acquisition. After appropriate data treatment and molecular networking (see 347 corresponding Material and Methods section), the taxonomically informed scoring was used to proceed 348 to the re-ranking of the candidates according to the respective taxonomic position of their biological

12 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

349 sources and return the top 5 hits. We especially focused on the two major compounds (MS signal 350 intensity) of G. grandiflorum. These were feature m/z 342.1670 at 1.42 min and m/z 356.1860 at 1.83 351 min. A weight of 0.81, 1.62 and 2.55 was thus given to candidates for which the biological source was 352 found to be Papaveraceae at the family level, Glaucium at the genus level and G. grandiflorum at the 353 species level, and respectively. The results of the taxonomically informed scoring annotation for 354 feature m/z 342.1670 at 1.42 min are presented in Table 1. (see SI for annotation results concerning 355 feature m/z 356.1860 at 1.83 min). Both features were targeted within the extract and, after isolation, 356 the structure of their corresponding compound was determined by 1D and 2D NMR measurements (see 357 spectra in Supplementary Material S1 and S2). The NMR spectra matched the literature reported 358 spectra for predicentrine (Guinaudeau et al., 1979) in case of feature m/z 342.1670 at 1.42 min and 359 glaucine (Huang et al., 2004) for feature m/z 356.1860 at 1.83 min. In both cases, the candidate structure 360 proposed via the taxonomically informed scoring annotation at rank 1 was found to be correct. With 361 the classical spectral matching process, the correct candidates were initially ranked at position 9 and 7 362 for predicentrine and glaucine, respectively (see Table 1 and Supplementary Material S5 and S6).

Weighted Cluster Family Genus Species Max Spectral Normalized Rank Rank Structure Short IK Molecule Name Family Genus Species Spectral ID Weight Weight Weight Weight Score Score Initial Weighted Score OH O

CH3 H3 C O 1772 OUTYMWDDJORZOH Predicentrine Papaveraceae Glaucium Glaucium oxylobum 0.81 1.62 0 1.62 0.36 0.23 1.85 9 1 O N

CH3 CH3

CH3

CH3 O O OH 1772 O QELDJEKNFOQJOY Isocorydine Papaveraceae Glaucium NA 0.81 1.62 0 1.62 0.34 0.22 1.84 H C 3 11 2 N

CH3

OH

O

H 3 C 1772 KDFKJOFJHSVROC Isocorypalmine Papaveraceae Glaucium Glaucium fimbrilligerum 0.81 1.62 0 1.62 0.3 0.14 1.76 N O 28 3

O CH 3

CH 3

H C 3 NH

Secosarcocapnidine 1772 JADHMUPTWPBTMT Papaveraceae Sarcocapnos Sarcocapnos crassifolia Me ether, N-De-Me 0.81 0 0 0.81 0.4 0.32 1.13 O 1 4 CH O 3 O O

H3 C CH3

H C 3 NH

CH3 O Secocularidine 1772 WNBUTZHPPULVTP Papaveraceae Ceratocapnos claviculata Me ether, N-de-Me 0.81 0 0 0.81 0.39 0.29 1.1 O 2 5 O O H C H C 3 363 3 364 Table 1. Output of the taxonomically informed scoring annotation using ISDB-DNP for feature m/z 342.1670 365 at 1.42 min. Predicentrine, which is the correct annotation and was initially ranked at the 9th position, is now 366 ranked at the first position.

367

368 3 Discussion

369 Given the complexity of natural extracts, the isolation of specialized metabolites thereof is a long and 370 tedious process which should ideally be carried on NPs of high value only (e.g. novel or bioactive 371 compounds). On the other hand, metabolite annotation in complex extracts is also required for 372 metabolomics studies, compositional assessment of phyto-preparation or chemotaxonomical studies. 373 The annotation process is thus key for many aspects of NP research. This process can be boiled down 374 to the comparison of attributes (e.g. exact mass, MF, fragmentation spectra) of the queried analyte to 375 attributes of candidate structures present in a DB. When HRMS and appropriate heuristic filters are 376 used, the establishment of the molecular formula (MF) of the analyte is relatively straightforward (Kind 377 and Fiehn, 2007). However, this is not sufficient to proceed to metabolite annotation given the isomeric 378 nature of numerous NP. Over all the compounds reported in the DNP, less than 10 % (9.46 %) have a 379 unique chemical formula. For example, 147 occurrences correspond to the chemical formula C16H14O4. 380 With a MS1 analysis relying on exact mass only, no ranking between those isomers is possible. By

13

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

381 filtering the biological sources, the number of possible structures can be significantly reduced, but can 382 still lead to multiple isomers. In the case of Bauhinia manca (synonym of Bauhinia guianensis), there 383 are 5 entries corresponding to this formula. Taking both MS/MS fragmentation and taxonomic position 384 into account, the confidence into the annotation process is notably enhanced leading in this case to the 385 correct annotation at rank 1 (see Supplementary Material S3). Comparison of spectra to experimental 386 or in silico fragmented DBs, allows to attribute a score to candidate structures and, thus, to discriminate 387 isomeric molecules. However, MF and fragmentation spectra are not the only attributes which can be 388 compared in the metabolite annotation process. Specialized metabolites, as products of biosynthetic 389 clusters themselves part of the genome, are tightly linked to the taxonomic position of the producing 390 organisms (Ernst et al., 2019b; Hoffmann et al., 2018). Here, we demonstrate that the biological source 391 and, more precisely, the taxonomic distance between the biological source of the queried compound 392 and the biological source of the candidate structures is also a valuable attribute to integrate into the 393 metabolite annotation process. We show that such information can be considered in a taxonomically 394 informed scoring system and automatically applied to the outputs of different computational metabolite 395 annotation programs. The consideration of taxonomic information was shown to systematically 396 improve the F1 score of the evaluated solutions (ISDB-DNP, Sirius, MS-Finder) with a 1.5 to 7-fold 397 increase. The advantage of considering such information in the metabolite annotation process are thus 398 observed independently of the tools and their associated structural DBs. This benchmarking was carried 399 to evaluate the importance of considering taxonomical information during the metabolite annotation 400 process. It is not meant to compare the performances of the tools. Indeed, all compounds of the 401 benchmarking set were present in the DNP, and the ISDB-DNP tool, which is by definition backed by 402 the same database is thus favored. Furthermore filters (for the selection of [M+H]+ adducts and for the 403 filtering MS/MS spectra for the 500 most intense peaks) were applied to meet restriction of the ISDB- 404 DNP and MS-Finder, respectively. Finally, a number of entries (197) of the benchmarking dataset were 405 found to have large mass difference (> 0.01 Da) between their experimental parent ion mass and their 406 calculated exact mass. For example, cevadine [M+H]+ (CCMSLIB00004689734) had an experimental 407 parent ion mass of 632.386 Da, while its calculated exact mass is 591.3407 Da (C32H49NO9). Of course, 408 such erroneous entries cannot be identified by the computational metabolite annotation tools. The list 409 of these problematic entries is available online (Problematic_entries.csv). It is also important to keep 410 in mind that such metabolite annotation tools are incapable of discriminating stereoisomers.

411 In the two cases studied in this work (annotation of standard compounds and annotation of analytes 412 originating from a single biological sources), the attribution of the biological source to the queried 413 features is straightforward. However, a more complicated configuration can appear: the annotation of 414 aligned MS data coming from an extract library. In this particular case a feature can belong to multiple 415 biological sources. It is thus necessary to link features to multiple extracts (biological sources) for 416 example via MS signal intensity.

417 The results of the optimization on datasets for which taxonomical information had been randomly 418 degraded at multiple taxa level indicated, for the ISDB-DNP results, that the optimal combination of 419 weights was 0.81, 1.62 and 2.55 (family, genus and species taxa level, respectively). Such results 420 should however be taken with caution, and not as absolute optimal values, as such optimization process 421 are heavily dependent on the learning sets. The optimization however indicates that the best results 422 were achieved when the assigned weights were inversely proportional to the taxonomic distance 423 between the biological sources of, both, the queried spectra and the candidate structures.

424 It is important to note that such taxonomically informed scoring system will mostly benefit the 425 annotation process of specialized metabolites and not ubiquitous molecules (e.g. coming from the 426 primary metabolism) for obvious reasons. Furthermore, it heavily depends on the availability and

14 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

427 quality of DBs compiling structures and their biological sources reported as a fully and homogeneously 428 resolved taxonomy. To the best of our knowledge, such DBs are not publicly available at the moment. 429 The NPAtlas (https://www.npatlas.org/) is an interesting initiative of the Linington lab, however 430 biological sources information down to the species level is only accessible in query mode and the DB 431 is limited to 20440 metabolites of microbial origin only. The Dictionary of Natural Products (DNP), 432 which we used in this study is the widest compilation of structure/biological sources pairs, but is only 433 available commercially. Furthermore, the biological sources are reported as a free text field (codes are 434 available only for the family taxa levels and above), thus requiring tedious standardization and name 435 resolving. It is therefore important for the community to start the systematic reporting of biological 436 sources, together with spectral and structural information, when documenting novel metabolites. In 437 fact, the reporting of newly described biological occurrence should be encouraged even for previously 438 described metabolites. However, the policy of most journals in NP research is to accept for publication 439 only description of novel and bioactive structures, which hinders these potentially informative reports. 440 The GNPS spectral libraries (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and MassIVE 441 repositories (https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) appear as the optimal place, at 442 the moment, to compile and share NP spectral and structural information. But, if free text comments 443 can complement the documentation of an entry, no standardized fields are available to report the 444 biological sources of the uploaded spectra. The creation of such a feature, ideally directly linking the 445 entered biological sources to existing taxonomy backbones such as GBIF (https://www.gbif.org) or 446 Catalogue of Life (http://www.catalogueoflife.org), would be extremely useful. A recent initiative, the 447 Pharmacognosy Ontology (PHO), that builds on the 50 years of development of NAPRALERT 448 (https://www.napralert.org) is aimed at providing a Free and Open resource that will link taxonomical, 449 chemical and biological data (http://ceur-ws.org/Vol-1747/IP12_ICBO2016.pdf).

450 The taxonomically informed scoring presented here only takes into account the identity between the 451 biological sources, at different taxa level, of the query compounds and the ones of the candidates 452 structures. Taking into account a more precise phylogenetic position within or across taxa, for example 453 via the calculation of taxonomic distinctiveness indexes (Clarke and Warwick, 1998; Weikard et al., 454 2006), could offer a more accurate distance and eventually improve such taxonomically informed 455 metabolite annotation process. Of course, and in addition to correct and systematic biological sources 456 occurrence reporting in dedicated DBs, it is of utmost importance to count on the expert knowledge of 457 trained taxonomists specialized in the classification of biodiversity. But it appears that today, sadly, 458 these people are missing (Ajmal Ali and Choudhary, 2011; Drew, 2011)

459 As proposed in the introduction (see also Fig. 1), an ideal metabolite annotation process should 460 consider a maximal number of attributes when matching the queried analyte to candidate structures 461 present in a DB. It is clear that MS1 based annotation is not sufficient when annotating NP given their 462 isomeric nature, and MS/MS based workflow are known the standard way to proceed to their 463 annotation. We have shown here that complementing the MS/MS comparison with taxonomical 464 information is beneficial for the metabolite annotation process. Others groups have demonstrated the 465 advantages of integrating information such as retention order (Bach et al., 2018), the topology of a 466 molecular network (relatedness within a cluster) in order to rerank candidates proposed by MS/MS 467 based annotation (da Silva et al. 2018). Efforts remains to be done towards the establishment of a global 468 meta-score (see. Fig 1) and problematics such as the individual weights to attribute to each individual 469 component of a metascore will likely appear, but integrating the maximal number of metadata available 470 when proceeding to metabolite annotation should only be beneficial for such process. Overall, 471 considering the relations across attributes by contextualization approaches (molecular networking, 472 biosynthetic coherence, chemotaxonomic relationships) should be particularly efficient to annotate the 473 chemistry of living systems (Allard et al., 2018). Various approaches have been proposed to exploit

15

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

474 structural (or biosynthetic) relationships among metabolites and further organize the producing extracts 475 (Junker, 2018; Liu et al., 2017) and interesting developments will appear once robust metabolite 476 annotation solution are coupled to comprehensive DBs compiling structures and their biological 477 sources. Indeed, specialized metabolome annotation could be a novel way to infer the taxonomic 478 position of an unknown sample, just as valid as a genetic sequencing. This could offer new possibilities 479 for taxonomists and application in the field of phyto-pharmaceutical quality control for the 480 identification of adulterations in a similar way to MALDI-TOF based approaches now identifying 481 bacterial strains in mixtures (Yang et al., 2018).

482

483 4 Conclusion

484

485 Efficient characterization of specialized metabolomes is a key challenge in metabolomics and NP 486 chemistry. Now, technical advances allow to access large amounts of informative data, requiring ad 487 hoc computational solutions for their interpretation. The metabolite annotation process, can be resumed 488 to the comparison of attributes of the queried features against attributes of the candidate structures. 489 This process can benefit from the information complementary to the classically used MS fragmentation 490 fingerprints. Ideally, the quantification of multiples attributes similarity (or dissimilarity) should be 491 integrated within a metascoring system (see Fig 1). Here, we demonstrate that the consideration of the 492 taxonomic distance separating the biological sources of both, the queried analytes, and the candidate 493 structures, can drastically improve the efficiency (recall and precision rates) of existing computational 494 metabolite annotation solutions. Metabolite annotation is crucial to guide chemical ecology research 495 or drug discovery projects. We therefore go in the direction of the first of De Candolle’s assumptions 496 (“Plant taxonomy would be the most useful guide to man in his search for new industrial and medicinal 497 plants”). His correlated postulate (“Chemical characteristics of plants will be most valuable to plant 498 taxonomy in the future”) has been validated and exciting research in this direction is being carried at 499 the moment (MSXXX). Metabolite annotation can benefit from taxonomy and taxonomical 500 relationships can be inferred from precise metabolite characterization. Efforts in both directions should 501 thus fuel a virtuous cycle of research aiming to better understand Life and its chemistry.

502 These progresses will however depend on the establishment of open data repositories systematically 503 compiling structures, spectra and associated biological sources but also capturing organization of these 504 objects at various levels (structural and spectral similarity, taxonomical relationships). The 505 Pharmacognosy Ontology project should provide a valuable framework to document and contextualize 506 such valuable data (http://ceur-ws.org/Vol-1747/IP12_ICBO2016.pdf).

507

508 5 Material and Methods

509 5.1 Outline and implementation of the taxonomically informed scoring system 510

511 To evaluate the importance of considering taxonomic information in the annotation process, three 512 different computational mass spectrometry-based metabolite annotation tools were used (namely, 513 ISDB-DNP, MS-Finder and Sirius). This resulted in three different outputs constituted by a list of

16 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

514 candidates returned by each tool for the entries of the benchmarking dataset. These candidates were 515 ranked according to the scoring system of each tool. R scripts in the form of markdown notebooks were 516 written to perform 1) cleaning and standardization of the outputs (TaxoCleaneR.Rmd) 2) 517 taxonomically informed scoring and re-ranking (TaxoWeighter.Rmd) and 3) analysis of the results 518 (TaxoDesigner.Rmd). First, the outputs were standardized to a table containing on each row: a unique 519 spectral identifier (CCMSLIB N°) of the queried spectra, the short InChIKey of the candidate 520 structures, the score of the candidates (within the scoring system of the used metabolite annotation 521 tool), the biological source of the standard compound and the biological source of the candidate 522 structures. As described in section 2.1, a weight, inversely proportional to the taxonomic distance 523 between the biological source of the annotated compound and the one biological source of the 524 candidate structure, was given when an exact match was found between both biological sources at the 525 family, genus or/and species level(s). A sum of this weight and the original score yielded the 526 taxonomically weighted score. This taxonomically weighted score was then used to re-rank the 527 candidates. See Fig. 2 for a schematic overview of the taxonomically informed scoring process.

528 5.2 Dataset preparation

529 • Structural and biological sources dataset

530 In the Dictionary of Natural Products (v 27.1), taxonomic information appears in two fields. The 531 Biological Source field, which is constituted by a free text field reporting occurrence of a specific 532 compound and the Compound Type field which reports various codes corresponding to molecule 533 classes or taxonomic position at the family level. As an example, for the entry corresponding to larictrin 534 3-glucoside (ODXINVOINFDDDD-UHFFFAOYSA-N), the Biological Source field indicates “Isol. 535 from Larix spp., Cedrus sp. and other plant spp. Constit. of Vitis vinifera cv. Petit Verdot grapes and 536 Abies amabilis.” and the Compound Type field indicates “V.K.52600 W.I.40000 W.I.35000 537 Z.N.50000 Z.Q.71600” suggesting that biological sources are found in the Phyllocladaceae 538 (Z.N.50000) and Vitaceae family (Z.Q.71600). The biological source information is reported in a non- 539 homogeneous way and multiple biological sources are reported in the same row. In order to extract 540 taxonomic information out of the free text contents, we used the gnfinder program 541 (https://github.com/gnames/gnfinder). Gnfinder takes UTF8-encoded text as inputs and returns back 542 JSON-formatted output that contains detected scientific names. It automatically detects the language 543 of the text and uses complementary heuristic and natural language processing algorithms to detect 544 patterns corresponding to scientific binomial or uninomial denomination. We used gnfinder forcing for 545 English language detection. In addition to scientific denomination extraction, gnfinder allows to match 546 the detected names against the Global Names index services (https://index.globalnames.org). The 547 preferred taxonomy backbone was set to be Catalogue of Life. This last step allowed to return the full 548 taxonomy down to the entered taxa level. It also allows to resolve synonymy. Since gnfinder is 549 designed to mine free texts, the JSON formatted output indicates the position of the detected name in 550 the original input by character position. A python script was written to output a .csv file with the found 551 name and taxonomy in front of the corresponding input. When multiple biological sources were found 552 for an entry, this one was duplicated in order to obtain a unique structure/biological source pair per 553 row. The script is available online (gnfinder_field_scrapper.py).

554 • Structural and spectral dataset

555 All GNPS libraries and publicly accessible third-party libraries were retrieved online 556 (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and concatenated as a single spectral file 557 (Full_GNPS_lib.mgf) in the .mgf format. A python Jupyter notebook (MGF_lib_filterer.ipynb) was

17

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

558 created to filter .mgf spectral file according to specific parameters: maximum and minimum number 559 of fragments per spectrum and defined spectral ID (e.g. CCMLIB N°). All parameters are optional. 560 The spectral file was filtered to retain only entries having at least 6 fragments. For spectra containing 561 more than 500 fragments, only the 500 most intense were kept. A second python Jupyter notebook 562 (GNPS_lib_parser_cleaner.ipynb) was written to proceed to 1) extraction of relevant metadata (parent 563 ion mass, Smiles, InChI, library origin, source instrument, molecule name and individual spectrum id 564 value (CCMSLIB N°) 2) filtering entries having at least one structural information associated (Smiles 565 and/or InChI) and corresponding to protonated adducts and 3) converting structures to their InChIKey, 566 a 27-character hashed version of the full InChI. The InChIKey conversion was realized using the RDKit 567 2019.03.1 framework (RDKit: Open-source cheminformatics; http://www.rdkit.org). This resulted in 568 a structural dataset (GNPS_lib_structural.tsv) of 40138 entries. The library was further filtered to keep 569 entries which parent masses were comprised between 100 and 1500 Da. Duplicate structures and 570 stereoisomers were removed by keeping distinct InChIKey according to the first layer (first 14 571 characters) of the hash code. This spectral dataset encompasses spectra acquired on a variety of MS 572 platforms. The codes, input and output data are available on OSF at the following address 573 (https://osf.io/bvs6x/).

574

575 • Structural, spectral and benchmarking dataset (benchmarking dataset)

576 Once the structural and biological sources dataset and the structural and spectral datasets were prepared 577 (as described above), both were joined in order to attribute a biological source to each spectrum. The 578 scripts used to proceed to the merging step are part of the python Jupyter notebook 579 (GNPS_lib_parser_cleaner.ipynb). Since in most cases it is not expected to differentiate stereoisomers 580 based on their MS spectra, the combination of both datasets was made using the short InChIKey (first 581 14 characters of the InChIKey) as a common key. In this merging process only, entries having 582 biological source information resolved against the Catalogue of Life and complete down to the species 583 level were retained. However, this merging implies that for a given biological source the information 584 on the 3D aspects of the structure is lost. While this is not an issue for the benchmarking objective of 585 this work the resulting dataset doesn’t constitute a reliable occurrence dataset for annotation that needs 586 stereoisomers to be differentiated. The resulting dataset containing structural, spectral and biological 587 sources information was constituted by 2107 distinct entries. This constituted the benchmarking 588 dataset. The scripts allowing to generate the benchmarking dataset, the associated spectral data 589 (Benchmark_dataset_spectral.mgf), and associated metadata (Benchmark_dataset_metadata.tsv) are 590 available at the following address (https://osf.io/bvs6x/).

591

592 5.3 Computational metabolite annotation tools

593 • ISDB-DNP

594 The ISDB-DNP (In Silico DataBase – Dictionary of Natural Products) is an approach that we 595 previously developed (Allard et al., 2016). A version using the freely available Universal Natural 596 Products Database (ISDB-UNPD) is available online (http://oolonek.github.io/ISDB/). This approach 597 is focused on specialized metabolites annotation and is constituted by a pre-fragmented theoretical 598 spectral DB version of the DNP. The in silico fragmentation was performed using CFM-ID, a software 599 using a probabilistic generative model for the fragmentation process, and a machine learning approach 600 for learning parameters for this model from MS/MS data (Allen et al., 2015). CFM, is, to the best of

18 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

601 our knowledge, the only solution available at the moment allowing to output a spectrum with fragment 602 intensity prediction. The matching phase between experimental spectra and the theoretical DB is based 603 on a plain spectral similarity computation performed using Tremolo as a spectral library search tool 604 (Wang and Bandeira, 2013). The parameters used to proceed to the benchmarking dataset analysis were 605 the following: parent mass tolerance 0.05 Da, minimum cosine score 0.1, no limits for the returned 606 candidates numbers.

607

608 • MS-Finder

609 This in silico fragmentation approach considers multiple parameters such as bond dissociation 610 energies, mass accuracies, fragment linkages and various hydrogen rearrangement rules at the 611 candidate ranking phase (Tsugawa et al., 2016). The resulting scoring system range from 1 to 10. The 612 parameters used to proceed to the benchmarking dataset analysis were the following: mass tolerance 613 setting: 0.1 Da (MS1), 0.1 Da (MS2); relative abundance cut off: 5 % formula finder settings: LEWIS 614 and SENIOR check (yes), isotopic ratio tolerance: 20 %, element probability check (yes), element 615 selection (O, N, P, S, Cl, Br). Structure Finder setting: tree depth: 2, maximum reported number: 100, 616 data sources (all except MINEs DBs. Total number of structures, 321617.) MS-Finder v. 3.22 was 617 used, it is available at the following address: http://prime.psc.riken.jp/Metabolomics_Software/MS- 618 FINDER/

619 • Sirius

620 Sirius 4.0.1 is considered as a state-of-the-art metabolite annotation solution, which combines 621 molecular formula calculation and the prediction of a molecular fingerprint of a query compound from 622 its fragmentation tree and spectrum (Dührkop et al., 2019). Sirius uses a DB of 73,444,774 unique 623 structures for its annotations. The parameters used to proceed to the benchmarking dataset analysis 624 were the following for Sirius molecular formula calculation: possible ionization [M+H]+, instrument: 625 Q-TOF, ppm tolerance 50 ppm, Top molecular formula candidates: 3, filter :formulas from biological 626 DBs. For the CSI:FingerID step, the parameters were the following: possible adducts: [M+H]+, filter: 627 compounds present in biological DB, maximal number of returned candidates: unlimited. Sirius 4.0.1 628 is available at the following address: https://bio.informatik.uni-jena.de/software/sirius/

629

630 5.4 Results analysis 631

632 The F1 score was calculated for each evaluated metabolite annotation tool before and after the 633 weighting step. The F1 score is the harmonic mean of the recall (True Positive / (True Positive + False 634 Negative)) and precision rate (True Positive / (True Positive + False Positive)) of a tool. The True 635 Positive (TP) corresponds to the number of correct candidate annotations at rank 1, the False Positive 636 (FP) to the number of wrong candidate annotations at rank 1, and the False Negative (FN) to the number 637 of correct annotations at rank > 1. The F1 score is then calculated as follows:

(Recall × Precision) 638 �1 = 2 × (Recall + Precision)

19

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

639 R notebook to analyze the results of the taxonomically informed scoring process and plot the Venn 640 diagrams presented in Fig. 4 are available online (TaxoDesigner.Rmd).

641 5.5 Optimization of the weight combination for the taxonomically informed scoring 642 In order to establish the optimal weights to be applied for each of the taxonomic distances (family, 643 genus and species), the information related to candidates’ annotations was artificially degraded. For 644 this, the annotation set returned by the ISDB-DNP approach against the benchmarking dataset was 645 randomized. The randomized annotation set was then split into four equal blocks. For the first three 646 blocks, the biological source information was deleted, respectively, at the species level; at the genus 647 and species level; and, finally, at the family, genus and species levels. The fourth block was not 648 modified. Finally, the four blocks were merged back to a unique dataset. The process was repeated four 649 times yielding four datasets with randomly incomplete biological sources. The taxonomic distance 650 informed scoring process was compiled to a unique function taking three arguments (weights given 651 when a match was found at the family, genus and species level, respectively) and outputting the number 652 of correct hits ranked at the first position. A parallelizable Bayesian optimization algorithm 653 (https://github.com/AnotherSamWilson/ParBayesianOptimization) was then used, being particularly 654 suited for the optimization of black box functions for which no formal representation is available 655 (arXiv:1807.02811). The bounds were set between 0 and 3 for the exploration of the three parameters 656 of the function. Number of initial points was set to 10 and the number of iterations to 100. Parameter 657 kappa (κ) was set to 5.152, to force the algorithm to explore unknown areas. The chosen acquisition 658 function was set to Expected Improvement (ei). Epsilon parameter (ε, eps) was set to 0. The whole 659 procedure was run 4 times on the 4 randomized datasets. Best set of parameters were then averaged 660 across the 16 results set. All codes required for this optimization step are available online 661 (TaxoOptimizeR.Rmd).

662 5.6 Chemical analysis and isolation of compounds from Glaucium extract 663 5.6.1 Plant material 664 The aerial flowering parts of three Glaucium species were collected in May and June of 2015 from the 665 northern part of Iran including Mazandaran and Tehran provinces. The samples were identified by Dr. 666 Ali Sonboli, Medicinal Plants and Drugs Research Institute, Shahid Beheshti University, Tehran, Iran. 667 Voucher specimens (MPH-2351 for G. grandiflorum, MPH-2352 for G. fimberilligerum and MPH- 668 2353 for G. corniculatum) have been deposited at the Herbarium of Medicinal Plants and Drugs 669 Research Institute (HMPDRI), Shahid Beheshti University, Tehran, Iran.

670 5.6.2 Mass Spectrometry analysis 671 Chromatographic separation was performed on a Waters Acquity UPLC system interfaced to a Q- 672 Exactive Focus mass spectrometer (Thermo Scientific, Bremen, ), using a heated electrospray 673 ionization (HESI-II) source. Thermo Scientific Xcalibur 3.1 software was used for instrument control. 674 The LC conditions were as follows: column, Waters BEH C18 50 × 2.1 mm, 1.7 µm; mobile phase, 675 (A) water with 0.1 % formic acid; (B) acetonitrile with 0.1 % formic acid; flow rate, 600 µL.min−1; 676 injection volume, 6 µL; gradient, linear gradient of 5−100 % B over 7 min and isocratic at 100 % B for 677 1 min. The optimized HESI-II parameters were as follows: source voltage, 3.5 kV (pos); sheath gas 678 flow rate (N2), 55 units; auxiliary gas flow rate, 15 units; spare gas flow rate, 3.0; capillary temperature, 679 350.00 °C, S-Lens RF Level, 45. The mass analyzer was calibrated using a mixture of caffeine, 680 methionine − arginine − phenylalanine − alanine − acetate (MRFA), sodium dodecyl sulfate, sodium 681 taurocholate, and Ultramark 1621 in an acetonitrile/methanol/water solution containing 1% formic acid 682 by direct injection. The data-dependent MS/MS events were performed on the three most intense ions

20 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

683 detected in full scan MS (Top3 experiment). The MS/MS isolation window width was 1 Da, and the 684 stepped normalized collision energy (NCE) was set to 15, 30 and 45 units. In data-dependent MS/MS 685 experiments, full scans were acquired at a resolution of 35 000 FWHM (at m/z 200) and MS/MS scans 686 at 17 500 FWHM both with an automatically determined maximum injection time. After being 687 acquired in a MS/MS scan, parent ions were placed in a dynamic exclusion list for 2.0 s.

688 5.6.3 MS data pretreatment 689 The MS data were converted from the .RAW (Thermo) standard data format to .mzXML format using 690 the MSConvert software, part of the ProteoWizard package (Chambers et al., 2012). The converted 691 files were treated using the MzMine software suite v. 2.38 (Pluskal et al., 2010).

692 The parameters were adjusted as following: the centroid mass detector was used for mass detection 693 with the noise level set to 1.0E6 for MS level set to 1, and to 0 for MS level set to 2. The ADAP 694 chromatogram builder was used and set to a minimum group size of scans of 5, minimum group 695 intensity threshold of 1.0E5, minimum highest intensity of 1.0E5 and m/z tolerance of 8.0 ppm. For 696 chromatogram deconvolution, the algorithm used was the wavelets (ADAP). The intensity window 697 S/N was used as S/N estimator with a signal to noise ratio set at 25, a minimum feature height at 10000, 698 a coefficient area threshold at 100, a peak duration ranges from 0.02 to 0.9 min and the RT wavelet 699 range from 0.02 to 0.05 min. Isotopes were detected using the isotopes peaks grouper with a m/z 700 tolerance of 5.0 ppm, a RT tolerance of 0.02min (absolute), the maximum charge set at 2 and the + + + + + 701 representative isotope used was the most intense. An adduct (Na , K , NH4 , CH3CN , CH3OH , + + 702 C3H8O (IPA )) search was performed with the RT tolerance set at 0.1 min and the maximum relative 703 peak height at 500%. A complex search was also performed using [M+H]+ for ESI positive mode, with 704 the RT tolerance set at 0.1 min and the maximum relative peak height at 500%. Peak alignment was 705 performed using the join aligner method (m/z tolerance at 8 ppm), absolute RT tolerance 0.065 min, 706 weight for m/z at 10 and weight for RT at 10. The peak list was gap-filled with the same RT and m/z 707 range gap filler (m/z tolerance at 8 ppm). Eventually the resulting aligned peaklist was filtered using 708 the peak-list rows filter option in order to keep only features associated with MS2 scans.

709 5.6.4 Molecular networks generation 710 In order to keep the retention time, the exact mass information and to allow for the separation of 711 isomers, a feature-based MN (https://bix- 712 lab.ucsd.edu/display/Public/Feature+Based+Molecular+Networking) was created using the .mgf file 713 resulting from the MzMine pretreatment step detailed above. Spectral data was uploaded on the GNPS 714 molecular networking platform. A network was then created where edges were filtered to have a cosine 715 score above 0.7 and more than 6 matched peaks. Further edges between two nodes were kept in the 716 network if and only if each of the nodes appeared in each other's respective top 10 most similar nodes. 717 The spectra in the network were then searched against GNPS' spectral libraries. All matches kept 718 between network spectra and library spectra were required to have a score above 0.7 and at least 6 719 matched peaks. The output was visualized using Cytoscape 3.6 software (Shannon et al., 2003). The 720 GNPS job parameters and resulting data are available at the following address 721 (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a475a78d9ae8484b904bcad7a16abd1f).

722 5.6.5 Taxonomically informed metabolite annotation 723 The spectral file (.mgf) and attributes metadata (.clustersummary) obtained after the MN step were 724 annotated using the DNP-ISDB with the following parameters: parent mass tolerance 0.005 Da, 725 minimum cosine score 0.2, maximal number of returned candidates: 50. An R script was written to

21

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

726 proceed to the taxonomically informed scoring on GNPS outputs and return an attribute table which 727 can be directly loaded in Cytoscape. The script is available here (Taxo_WeighteR_UseR.Rmd).

728

729 5.6.6 Isolation of predicentrine and glaucine from G. grandiflorum 730 The air-dried, ground and powdered plant materials (500 g) was successively extracted by solvents of 731 increasing polarities (hexane, ethyl acetate and methanol), 4 × 5.0 L of each solvent (48 h). An aliquot 732 of each ethyl acetate and methanolic extract was submitted to C18 SPE (eluted with 100% MeOH), 733 dried under nitrogen flow and redissolved at 5 mg/ml in MeOH for LC-MS analysis. The methanolic 734 extract of G. grandiflorum was concentrated under reduced pressure, then dried with a nitrogen flow 735 until complete evaporation of the residual solvent yielding 50 g of extract. An aliquot (5 g) was 736 subjected to a VLC in order to eliminate sugars and other very polar compounds. A 250 mL sintered- 737 glass Buchner funnel connected to a vacuum line was packed with a C18 reverse phase LiChroprep 738 40–63 µm (Lobar Merck, Darmstadt, Germany). After conditioning the stationary phase with methanol 739 (4 × 250 mL, 0.1% formic acid) and distilled water (4 × 250 mL, 0.1% formic acid), 5 g of methanolic 740 extract was dissolved in water and the mixture was deposited on the stationary phase. Elution of the 741 sample was conducted using water (4 × 250 mL, 0.1% formic acid) in the first step and followed by 742 methanol (4 × 250 mL, 0.1% formic acid) in the second step. This process yielded 1.4 g of processed 743 methanolic extract. After condition optimisation at the analytical level, 50 mg of the extract were 744 solubilized in 500 µL DMSO and injected using a Rheodyne® valve (1 mL loop). Semi-preparative 745 HPLC-UV purification was performed on a Shimadzu system equipped with: LC20A module elution 746 pumps, an SPD-20A UV/VIS detector, a 7725I Rheodyne® injection valve, and a FRC-10A fraction 747 collector (Shimadzu, Kyoto, Japan). The HPLC system was controlled by the LabSolutions software. 748 The HPLC conditions were selected as follows: Waters X-Bridge C18 column (250 × 19 mm i.d., 5 749 µm) equipped with a Waters C18 pre-column cartridge holder (10 × 19 mm i.d.); solvent system: ACN 750 (2mM TEA) (B) and H2O (2mM TEA & 2mM ammonium acetate) (A). Optimized separation 751 condition from the analytical was transferred to semi-preparative scale by a geometric gradient transfer 752 software (Guillarme et al. 2008). The separation was conducted in gradient elution mode as follows: 753 5% B in 0-5 min, 12% B in 5-10 min, 30% B in 10-30 min, 60% B in 30-55 min, 100% B in 55-65 754 min. The column was reconditioned by equilibration with 5% of B in 15 min. Flow rate was equal to 17 755 mL/min and UV traces were recorded at 210 nm and 280 nm. The separation procedure yielded 0.3 mg 756 of predicentrine and 3.4 mg of glaucine. Spectra for predicentrine (CCMSLIB00005436122) and 757 glaucine (CCMSLIB00005436123) were deposited on GNPS servers.

758 5.6.7 NMR analysis 759 The NMR spectra of each isolated compound was recorded on a Bruker BioSpin 600 MHz 760 spectrometer (Avance Neo 600). Chemical shifts (δ) were recorded in parts per million in methanol‒ 761 d4 with TMS as an internal standard. NMR data are available as Supplementary Material S1 and S2.

762

763

764

765

766

22 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

767

768

769 Conflict of Interest

770 The authors declare that the research was conducted in the absence of any commercial or financial 771 relationships that could be construed as a potential conflict of interest.

772 Author Contributions

773 P-MA, AR, MDK and J-LW designed the study. JB wrote the python script for gnfinder output 774 formatting. P-MA and AR wrote the scripts for dataset preparation, taxonomically informed scoring 775 and results analysis. MDK, SO, MB and TS used the scripts for metabolite annotation and provided 776 feedback. MB and SNE collected the Glaucium species. P-MA and AR performed the LCMS analysis 777 on the Glaucium extracts. MB analyzed profiling data, isolated the compounds of Glaucium and 778 established their structures. P-MA wrote the manuscript together with AR and J-LW. All authors 779 discussed the results and commented on the manuscript.

780 Funding

781 Jonathan Bisson gratefully acknowledge the support of this work by grant U41 AT008706 from 782 NCCIH and ODS. The School of Pharmaceutical Sciences of the University of Geneva JLW is thankful 783 to the Swiss National Science Foundation for the support in the acquisition of the NMR 600 MHz (SNF 784 R’Equip grant 316030_164095).

785 Acknowledgments

786 MDK acknowledges gratefully the support by Yamada Science Foundation and The Nagai Foundation 787 Tokyo. The authors acknowledge Ali Bakiri for fruitful discussions on the optimization of the weights.

788 Supplementary Material

789 See attached file

790 Code and Data Availability Statement

791 Scripts and datasets generated and analyzed for this study can be found at the following OSF repository: 792 https://osf.io/bvs6x/ (DOI 10.17605/OSF.IO/BVS6X)

793

794

795

796

797

798

23

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

799

800

24 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

801 References

802

803 Ajmal Ali, M., and Choudhary, R. K. (2011). India needs more plant taxonomists. Nature 471, 37. 804 doi:10.1038/471037d.

805 Alanjary, M., Cano-Prieto, C., Gross, H., and Medema, M. H. (2019). Computer-aided re-engineering 806 of nonribosomal peptide and polyketide biosynthetic assembly lines. Nat. Prod. Rep. 807 doi:10.1039/c9np00021f.

808 Allard, P.-M., Bisson, J., Azzollini, A., Pauli, G. F., Cordell, G. A., and Wolfender, J.-L. (2018). 809 Pharmacognosy in the digital era: shifting to contextualized metabolomics. Curr. Opin. Biotechnol. 54, 810 57–64. doi:10.1016/j.copbio.2018.02.010.

811 Allard, P.-M., Genta-Jouve, G., and Wolfender, J.-L. (2017). Deep metabolome annotation in natural 812 products research: towards a virtuous cycle in metabolite identification. Curr. Opin. Chem. Biol. 36, 813 40–49. doi:10.1016/j.cbpa.2016.12.022.

814 Allard, P.-M., Péresse, T., Bisson, J., Gindro, K., Marcourt, L., Pham, V. C., et al. (2016). Integration 815 of Molecular Networking and In-Silico MS/MS Fragmentation for Natural Products Dereplication. 816 Anal. Chem. 88, 3317–3323. doi:10.1021/acs.analchem.5b04804.

817 Allen, F., Greiner, R., and Wishart, D. (2015). Competitive fragmentation modeling of ESI-MS/MS 818 spectra for putative metabolite identification. Metabolomics 11, 98–110. doi:10.1007/s11306-014- 819 0676-4.

820 Allen, F., Pon, A., Wilson, M., Greiner, R., and Wishart, D. (2014). CFM-ID: a web server for 821 annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids 822 Res. 42, W94–9. doi:10.1093/nar/gku436.

823 Bach, E., Szedmak, S., Brouard, C., Böcker, S., and Rousu, J. (2018). Liquid-chromatography retention 824 order prediction for metabolite identification. Bioinformatics 34, i875–i883. 825 doi:10.1093/bioinformatics/bty590.

826 Bauer, C. A., and Grimme, S. (2016). How to Compute Electron Ionization Mass Spectra from First 827 Principles. J. Phys. Chem. A 120, 3755–3766. doi:10.1021/acs.jpca.6b02907.

828 Beran, F., Köllner, T. G., Gershenzon, J., and Tholl, D. (2019). Chemical convergence between plants 829 and insects: biosynthetic origins and functions of common secondary metabolites. New Phytol. 223, 830 52–67. doi:10.1111/nph.15718.

831 Berlinck, R. G. S., Monteiro, A. F., Bertonha, A. F., Bernardi, D. I., Gubiani, J. R., Slivinski, J., et al. 832 (2019). Approaches for the isolation and identification of hydrophilic, light-sensitive, volatile and 833 minor natural products. Nat. Prod. Rep. doi:10.1039/c9np00009g.

834 Böcker, S., Letzel, M. C., Lipták, Z., and Pervukhin, A. (2009). SIRIUS: decomposing isotope patterns 835 for metabolite identification. Bioinformatics 25, 218–224. doi:10.1093/bioinformatics/btn603.

25

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

836 Böttger, A., Vothknecht, U., Bolle, C., and Wolf, A. (2018). “Plant Secondary Metabolites and Their 837 General Function in Plants,” in Lessons on Caffeine, Cannabis & Co: Plant-derived Drugs and their 838 Interaction with Human Receptors, eds. A. Böttger, U. Vothknecht, C. Bolle, and A. Wolf (Cham: 839 Springer International Publishing), 3–17. doi:10.1007/978-3-319-99546-5_1.

840 Bouwmeester, R., Martens, L., and Degroeve, S. (2019). Comprehensive and Empirical Evaluation of 841 Machine Learning Algorithms for Small Molecule LC Retention Time Prediction. Analytical 842 Chemistry 91, 3694–3703. doi:10.1021/acs.analchem.8b05820.

843 Brejnrod, A. D., Ernst, M., Dworzynski, P., Rasmussen, L. B., Dorrestein, P., van der Hooft, J., et al. 844 (2019). Implementations of the chemical structural and compositional similarity metric in R and 845 Python. bioRxiv, 546150. doi:10.1101/546150.

846 Brunetti, A. E., Carnevale Neto, F., Vera, M. C., Taboada, C., Pavarini, D. P., Bauermeister, A., et al. 847 (2018). An integrative omics perspective for the analysis of chemical signals in ecological interactions. 848 Chem. Soc. Rev. 47, 1574–1591. doi:10.1039/c7cs00368d.

849 Caesar, L. K., Kellogg, J. J., Kvalheim, O. M., and Cech, N. B. (2019). Opportunities and Limitations 850 for Untargeted Mass Spectrometry Metabolomics to Identify Biologically Active Constituents in 851 Complex Natural Product Mixtures. J. Nat. Prod. 82, 469–484. doi:10.1021/acs.jnatprod.9b00176.

852 Chambers, M. C., Maclean, B., Burke, R., Amodei, D., Ruderman, D. L., Neumann, S., et al. (2012). 853 A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920. 854 doi:10.1038/nbt.2377.

855 Chiang, Y.-M., Chang, S.-L., Oakley, B. R., and Wang, C. C. C. (2011). Recent advances in awakening 856 silent biosynthetic gene clusters and linking orphan clusters to natural products in microorganisms. 857 Curr. Opin. Chem. Biol. 15, 137–143. doi:10.1016/j.cbpa.2010.10.011.

858 Cimermancic, P., Medema, M. H., Claesen, J., Kurita, K., Wieland Brown, L. C., Mavrommatis, K., et 859 al. (2014). Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene 860 clusters. Cell 158, 412–421. doi:10.1016/j.cell.2014.06.034.

861 Clarke, K. R., and Warwick, R. M. (1998). A taxonomic distinctness index and its statistical properties. 862 J. Appl. Ecol. 35, 523–531. doi:10.1046/j.1365-2664.1998.3540523.x.

863 Corley, D. G., and Durley, R. C. (1994). Strategies for Database Dereplication of Natural Products. J. 864 Nat. Prod. 57, 1484–1490. doi:10.1021/np50113a002.

865 Craig, J. W., Chang, F.-Y., and Brady, S. F. (2009). Natural products from environmental DNA hosted 866 in Ralstonia metallidurans. ACS Chem. Biol. 4, 23–28. doi:10.1021/cb8002754.

867 da Silva, R. R., Wang, M., Nothias, L.-F., van der Hooft, J. J. J., Caraballo-Rodríguez, A. M., Fox, E., 868 et al. (2018). Propagating annotations of molecular networks using in silico fragmentation. PLoS 869 Comput. Biol. 14, e1006089. doi:10.1371/journal.pcbi.1006089.

870 de Candolle, A. P. (1828). Mémoire sur la famille des Crassulacées. Available at: 871 https://market.android.com/details?id=book-T8dAAAAAcAAJ.

26 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

872 Depke, T., Franke, R., and Brönstrup, M. (2017). Clustering of MS2 spectra using unsupervised 873 methods to aid the identification of secondary metabolites from Pseudomonas aeruginosa. J. 874 Chromatogr. B Analyt. Technol. Biomed. Life Sci. 1071, 19–28. doi:10.1016/j.jchromb.2017.06.002.

875 Depke, T., Franke, R., and Brönstrup, M. (2019). CluMSID: an R package for similarity-based 876 clustering of tandem mass spectra to aid feature annotation in metabolomics. Bioinformatics. 877 doi:10.1093/bioinformatics/btz005.

878 Djoumbou Feunang, Y., Eisner, R., Knox, C., Chepelev, L., Hastings, J., Owen, G., et al. (2016). 879 ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. 880 Cheminform. 8, 61. doi:10.1186/s13321-016-0174-y.

881 Djoumbou-Feunang, Y., Pon, A., Karu, N., Zheng, J., Li, C., Arndt, D., et al. (2019). CFM-ID 3.0: 882 Significantly Improved ESI-MS/MS Prediction and Compound Identification. Metabolites 9, 72. 883 doi:10.3390/metabo9040072.

884 Drew, L. W. (2011). Are We Losing the Science of Taxonomy? BioScience 61, 942–946. 885 doi:10.1525/bio.2011.61.12.4.

886 Dührkop, K., Fleischauer, M., Ludwig, M., Aksenov, A. A., Melnik, A. V., Meusel, M., et al. (2019). 887 SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. 888 Methods 16, 299–302. doi:10.1038/s41592-019-0344-8.

889 Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure 890 databases with tandem mass spectra using CSI:FingerID. Proc. Natl. Acad. Sci. U. S. A. 112, 12580– 891 12585. doi:10.1073/pnas.1509788112.

892 Eliuk, S., and Makarov, A. (2015). Evolution of Orbitrap Mass Spectrometry Instrumentation. Annu. 893 Rev. Anal. Chem. 8, 61–80. doi:10.1146/annurev-anchem-071114-040325.

894 Ernst, M., Kang, K. B., and Caraballo-Rodríguez, A. M. (2019a). MolNetEnhancer: enhanced 895 molecular networks by integrating metabolome mining and annotation tools. BioRxiv. Available at: 896 https://www.biorxiv.org/content/10.1101/654459v1.abstract.

897 Ernst, M., Nothias, L.-F., van der Hooft, J. J. J., Silva, R. R., Saslis-Lagoudakis, C. H., Grace, O. M., 898 et al. (2019b). Assessing Specialized Metabolite Diversity in the Cosmopolitan Plant Genus Euphorbia 899 L. Front. Plant Sci. 10, 846. doi:10.3389/fpls.2019.00846.

900 Guinaudeau, H., Leboeuf, M., and Cavé, A. (1979). Aporphine Alkaloids. II. Journal of Natural 901 Products 42, 325–360. doi:10.1021/np50004a001.

902 Hinchliff, C. E., Smith, S. A., Allman, J. F., Burleigh, J. G., Chaudhary, R., Coghill, L. M., et al. 903 (2015). Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc. Natl. Acad. Sci. 904 U. S. A. 112, 12764–12769. doi:10.1073/pnas.1423041112.

905 Hoffmann, T., Krug, D., Bozkurt, N., Duddela, S., Jansen, R., Garcia, R., et al. (2018). Correlating 906 chemical diversity with taxonomic distance for discovery of natural products in myxobacteria. Nat. 907 Commun. 9, 803. doi:10.1038/s41467-018-03184-1.

27

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

908 Horsman, M. E., and Boddy, C. N. (2016). Natural products: Mapping an amazing thicket. Nat. Chem. 909 Biol. 13, 6–7. doi:10.1038/nchembio.2265.

910 Huang, W.-J., Singh, O. V., Chen, C.-H., and Lee, S.-S. (2004). Synthesis of (. -.)-Glaucine and (. -.)- 911 Neospirodienone via a One-Pot Bischler—Napieralski Reaction and Oxidative Coupling by a 912 Hypervalent Iodine Reagent. ChemInform 35. doi:10.1002/chin.200425199.

913 Izsak, C., and Price, A. R. G. (2001). Measuring b-diversity using a taxonomic similarity index, and 914 its relation to spatial scale. Mar. Ecol. Prog. Ser. 215, 69–77. doi:10.3354/meps215069.

915 Ji, H., Xu, Y., Lu, H., and Zhang, Z. (2019). Deep MS/MS-Aided Structural-similarity Scoring for 916 Unknown Metabolites Identification. Anal. Chem. Available at: 917 https://pubs.acs.org/doi/abs/10.1021/acs.analchem.8b05405.

918 Johnson, A. R., and Carlson, E. E. (2015). Collision-Induced Dissociation Mass Spectrometry: A 919 Powerful Tool for Natural Product Structure Elucidation. Anal. Chem. 87, 10668–10678. 920 doi:10.1021/acs.analchem.5b01543.

921 Junker, R. R. (2018). A biosynthetically informed distance measure to compare secondary metabolite 922 profiles. Chemoecology 28, 29–37. doi:10.1007/s00049-017-0250-4.

923 Kang, K. B., Ernst, M., Hooft, J. J. J., Silva, R. R., Park, J., Medema, M. H., et al. (2019). 924 Comprehensive mass spectrometry-guided phenotyping of plant specialized metabolites reveals 925 metabolic diversity in the cosmopolitan plant family Rhamnaceae. The Plant Journal. 926 doi:10.1111/tpj.14292.

927 Kind, T., and Fiehn, O. (2007). Seven Golden Rules for heuristic filtering of molecular formulas 928 obtained by accurate mass spectrometry. BMC Bioinformatics 8, 105. doi:10.1186/1471-2105-8-105.

929 Kind, T., and Fiehn, O. (2017). Strategies for dereplication of natural compounds using high-resolution 930 tandem mass spectrometry. Phytochem. Lett. 21, 313–319. doi:10.1016/j.phytol.2016.11.006.

931 Kind, T., Tsugawa, H., Cajka, T., Ma, Y., Lai, Z., Mehta, S. S., et al. (2018). Identification of small 932 molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532. 933 doi:10.1002/mas.21535.

934 Koistinen, V. M., da Silva, A. B., Abrankó, L., Low, D., Villalba, R. G., Barberán, F. T., et al. (2018). 935 Interlaboratory Coverage Test on Plant Food Bioactive Compounds and their Metabolites by Mass 936 Spectrometry-Based Untargeted Metabolomics. Metabolites 8. doi:10.3390/metabo8030046.

937 Liu, K., Abdullah, A. A., Huang, M., Nishioka, T., Altaf-Ul-Amin, M., and Kanaya, S. (2017). Novel 938 Approach to Classify Plants Based on Metabolite-Content Similarity. Biomed Res. Int. 2017, 5296729. 939 doi:10.1155/2017/5296729.

940 Lommen, A., Gerssen, A., Oosterink, J. E., Kools, H. J., Ruiz-Aracama, A., Peters, R. J. B., et al. 941 (2011). Ultra-fast searching assists in evaluating sub-ppm mass accuracy enhancement in U- 942 HPLC/Orbitrap MS data. Metabolomics 7, 15–24. doi:10.1007/s11306-010-0230-y.

943 Medema, M. H., Blin, K., Cimermancic, P., de Jager, V., Zakrzewski, P., Fischbach, M. A., et al. 944 (2011). antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis

28 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

945 gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–46. 946 doi:10.1093/nar/gkr466.

947 Mohimani, H., Gurevich, A., Shlemov, A., Mikheenko, A., Korobeynikov, A., Cao, L., et al. (2018). 948 Dereplication of microbial metabolites through database search of mass spectra. Nat. Commun. 9, 949 4035. doi:10.1038/s41467-018-06082-8.

950 Newman, D. (2017). Faculty of 1000 evaluation for Dereplication: racing to speed up the natural 951 products discovery process. F1000 - Post-publication peer review of the biomedical literature. 952 doi:10.3410/f.725736500.793531725.

953 Olsen, J. V., de Godoy, L. M. F., Li, G., Macek, B., Mortensen, P., Pesch, R., et al. (2005). Parts per 954 million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. 955 Cell. Proteomics 4, 2010–2021. doi:10.1074/mcp.T500030-MCP200.

956 Pichersky, E., and Gang, D. R. (2000). Genetics and biochemistry of secondary metabolites in plants: 957 an evolutionary perspective. Trends Plant Sci. 5, 439–445. Available at: 958 https://www.ncbi.nlm.nih.gov/pubmed/11044721.

959 Pichersky, E., and Lewinsohn, E. (2011). Convergent Evolution in Plant Specialized Metabolism. 960 Annual Review of Plant Biology 62, 549–566. doi:10.1146/annurev-arplant-042110-103814.

961 Plante, P.-L., Francovic-Fontaine, É., May, J. C., McLean, J. A., Baker, E. S., Laviolette, F., et al. 962 (2019). Predicting Ion Mobility Collision Cross-Sections Using a Deep Neural Network: DeepCCS. 963 Anal. Chem. 91, 5191–5199. doi:10.1021/acs.analchem.8b05821.

964 Pluskal, T., Castillo, S., Villar-Briones, A., and Orešič, M. (2010). MZmine 2: Modular framework for 965 processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC 966 Bioinformatics 11. doi:10.1186/1471-2105-11-395.

967 Rutledge, P. J., and Challis, G. L. (2015). Discovery of microbial natural products by activation of 968 silent biosynthetic gene clusters. Nat. Rev. Microbiol. 13, 509–523. doi:10.1038/nrmicro3496.

969 Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J., and Neumann, S. (2016). MetFrag relaunched: 970 incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3. doi:10.1186/s13321-016- 971 0115-9.

972 Samaraweera, M. A., Hall, L. M., Hill, D. W., and Grant, D. F. (2018). Evaluation of an Artificial 973 Neural Network Retention Index Model for Chemical Structure Identification in Nontargeted 974 Metabolomics. Anal. Chem. 90, 12752–12760. doi:10.1021/acs.analchem.8b03118.

975 Schmitt, T. M., Hay, M. E., and Lindquist, N. (1995). Constraints on chemically mediated coevolution: 976 multiple functions for seaweed secondary metabolites. Ecology 76, 107–123. Available at: 977 https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1940635.

978 Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., et al. (2003). Cytoscape: 979 a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 980 2498–2504. doi:10.1101/gr.1239303.

29

bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomically license. informed metabolite annotation

981 Shen, B. (2015). A New Golden Age of Natural Products Drug Discovery. Cell 163, 1297–1300. 982 doi:10.1016/j.cell.2015.11.031.

983 Sokal, R. R. (1961). Distance as a measure of taxonomic similarity. Syst. Zool. 10, 70–79. Available 984 at: https://www.jstor.org/stable/2411724?casa_token=T7r- 985 RgHCTXYAAAAA:cCNsjjXWGh0Yl1EmzXOzHxmEwB4G6PH13tZ5WMAuocaGgFUau5- 986 hR9y1l7kn5hEsQAfN-8gr0h5N69d2tKuRQSpJxhnK09V0KWvmOowwjQ7zJ5_c.

987 Soldatou, S., Eldjarn, G. H., Huerta-Uribe, A., Rogers, S., and Duncan, K. R. (2019). Linking 988 biosynthetic and chemical space to accelerate microbial secondary metabolite discovery. FEMS 989 Microbiol. Lett. doi:10.1093/femsle/fnz142.

990 Tsugawa, H., Kind, T., Nakabayashi, R., Yukihira, D., Tanaka, W., Cajka, T., et al. (2016). Hydrogen 991 Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS- 992 FINDER Software. Anal. Chem. 88, 7946–7958. doi:10.1021/acs.analchem.6b00770.

993 Tsugawa, H., Nakabayashi, R., Mori, T., Yamada, Y., Takahashi, M., Rai, A., et al. (2019). A 994 cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat. 995 Methods 16, 295–298. doi:10.1038/s41592-019-0358-2.

996 Tudhope, D., Taylor, C., and Beynon-Davies, P. (1995). Taxonomic Distance-Classification and 997 Navigation. in ICHIM, Multimedia Computing and Museums, 322–334.

998 van der Hooft, J. J. J., Wandy, J., Young, F., Padmanabhan, S., Gerasimidis, K., Burgess, K. E. V., et 999 al. (2017). Unsupervised Discovery and Comparison of Structural Families Across Multiple Samples 1000 in Untargeted Metabolomics. Anal. Chem. 89, 7569–7577. doi:10.1021/acs.analchem.7b01391.

1001 Wallwey, C., and Li, S.-M. (2011). Ergot alkaloids: structure diversity, biosynthetic gene clusters and 1002 functional proof of biosynthetic genes. Nat. Prod. Rep. 28, 496–510. doi:10.1039/c0np00060d.

1003 Wang, M., and Bandeira, N. (2013). Spectral library generating function for assessing spectrum- 1004 spectrum match significance. J. Proteome Res. 12, 3944–3951. doi:10.1021/pr400230p.

1005 Wang, M., Carver, J. J., Phelan, V. V., Sanchez, L. M., Garg, N., Peng, Y., et al. (2016). Sharing and 1006 community curation of mass spectrometry data with Global Natural Products Social Molecular 1007 Networking. Nat. Biotechnol. 34, 828–837. doi:10.1038/nbt.3597.

1008 Weber, T., Blin, K., Duddela, S., Krug, D., Kim, H. U., Bruccoleri, R., et al. (2015). antiSMASH 3.0— 1009 a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res. 43, 1010 W237–W243. doi:10.1093/nar/gkv437.

1011 Weikard, H.-P., Punt, M., and Wesseler, J. (2006). Diversity measurement combining relative 1012 abundances and taxonomic distinctiveness of species. Divers. Distrib. 12, 215–217. 1013 doi:10.1111/j.1366-9516.2006.00234.x.

1014 Wolfender, J.-L., Nuzillard, J.-M., van der Hooft, J. J. J., Renault, J.-H., and Bertrand, S. (2019). 1015 Accelerating Metabolite Identification in Natural Product Research: Toward an Ideal Combination of 1016 Liquid Chromatography-High-Resolution Tandem Mass Spectrometry and NMR Profiling, in Silico 1017 Databases, and Chemometrics. Anal. Chem. 91, 704–742. doi:10.1021/acs.analchem.8b05112.

30 This is a provisional file, not the final typeset article bioRxiv preprint doi: https://doi.org/10.1101/702308; this version posted July 14, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International Taxonomicall license. y informed metabolite annotation

1018 Yang, Y., Lin, Y., and Qiao, L. (2018). Direct MALDI-TOF MS Identification of Bacterial Mixtures. 1019 Anal. Chem. 90, 10400–10408. doi:10.1021/acs.analchem.8b02258.

1020 Zani, C. L., and Carroll, A. R. (2017). Database for Rapid Dereplication of Known Natural Products 1021 Using Data from MS and Fast NMR Experiments. J. Nat. Prod. 80, 1758–1766. 1022 doi:10.1021/acs.jnatprod.6b01093.

1023 Zidorn, C. (2019). Plant chemophenetics − A new term for plant chemosystematics/plant 1024 chemotaxonomy in the macro-molecular era. Phytochemistry 163, 147–148. 1025 doi:10.1016/j.phytochem.2019.02.013.

1026

1027

1028

31