COMMENTARY

Illuminating the dark matter in metabolomics

Ricardo R. da Silva(a,b), Pieter C. Dorrestein(a,c,1), and Robert A. Quinn(a)

(a) Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA 92093; (b) Núcleo de Pesquisa em Produtos Naturais e Sintéticos, Departamento de Física e Química, Faculdade Ciências Farmacêuticas de Ribeirão Preto, Universidade de São Paulo, São Paulo 14040-903, Brazil; and (c) Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, La Jolla, CA 92037

Author contributions: R.R.d.S., P.C.D., and R.A.Q. wrote the paper.
The authors declare no conflict of interest.
See companion article on page 12580.
(1) To whom correspondence should be addressed. Email: pdorrestein@ucsd.edu.

Despite the over 100-y history of mass spectrometry, it remains challenging to link the large volume of known chemical structures to the data obtained with mass spectrometers. Presently, only 1.8% of spectra in an untargeted metabolomics experiment can be annotated. This means that the vast majority of information collected by metabolomics is "dark matter," chemical signatures that remain uncharacterized (Fig. 1). For a genomic comparison, 80% of predicted genes in the genome of Escherichia coli are known. In a viral metagenome, a well-known frontier of biological dark matter, the amount of known genes is 1–30%, depending on the sample (1). Thus, one could argue that we know more about the genetics of uncultured phage than we do about the chemistry within our own bodies. Much of the chemical dark matter may include known structures, but they remain undiscovered because the reference spectra are not available in mass spectrometry databases. The only way to overcome this challenge is through the development of computational solutions. In PNAS, Dührkop et al. describe the development of such a computational tool, called CSI (compound structure identification):FingerID (2). The tool is designed to aid in the annotation of chemistries that can be observed by mass spectrometry. CSI:FingerID uses fragmentation trees to connect tandem MS (MS/MS) data to chemical structures found in public chemistry databases. Tools such as this can allow metabolomics with mass spectrometry to become as commonly used and scientifically productive as sequencing technologies have in the field of genomics.

Fig. 1. Millions of MS/MS spectra can be generated on a natural sample, such as this coral reef, but the vast majority of spectra are from unknown molecules. CSI:FingerID can help illuminate the chemical dark matter.

There are >60 million molecules in PubChem, yet only 220,000 MS/MS spectra, representing about 20,000 molecules, are accessible for untargeted metabolomics experiments (3). Chemists and biologists attempting to identify a mass spectrum without a match in a reference database, such as GNPS, Metlin, NIST, MassBank, and others, must often resort to Googling the parent mass or manually entering it into PubChem or similar chemical databases, hoping to find a match (3–5). The alternative is complete structure elucidation de novo, an even more laborious task, requiring years of work with high-level expertise to isolate and determine the structure of a single molecule. To put this in perspective, a modern-day metabolomics experiment with hundreds to thousands of independent samples can easily contain 1 million unique spectra. Assuming that spectral matching takes a trained eye approximately 10 min per spectrum, a gross underestimate, it would take 19 y of nonstop data analysis for a single project. This is obviously an unrealistic endeavor, especially considering that mass spectrometers will become even faster and more sensitive in the future.

The method presented by Dührkop et al. (2) is divided into three phases. In the first phase, called the learning phase, a tandem mass spectra database of reference compounds is used to train a set of predictors for known molecular properties (the fingerprint). Using the data from these reference spectra, the method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. The tree assigns molecular formulas to the corresponding fragment peaks in the MS/MS spectrum, and fragments are connected by the assumed losses. The algorithm then tries to recover the identity and connectivity of the atoms in a molecular structure. With the predicted structure from the fragmentation tree, the method searches for multiple similarity measures for molecular structural comparisons (called kernels) to improve the performance of molecular fingerprint prediction. A molecular fingerprint is based on its molecular properties retrieved from the publicly available known structures (e.g., in PubChem or the literature). In the second phase, a Support Vector Machine classifier is trained using the kernel similarities to separate molecular structures into a class that contains the molecular property and one that does not. Such classification is repeated for all molecular properties present in the fingerprint. With the classifier carefully built in the previous step, the method proceeds to the Prediction phase. Here, given the MS/MS spectra of an unknown compound, the task is to calculate its kernel similarities against all compounds in the reference dataset.
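The idea of training one binary classifier per fingerprint property on kernel similarities, and then predicting a fingerprint for an unknown spectrum, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy "spectra" (sets of nominal fragment masses), the two hypothetical fingerprint bits, and the shared-peak kernel are invented, and a kernel perceptron stands in for the Support Vector Machine so that the sketch needs only the Python standard library. It is not the implementation of Dührkop et al. (2), whose kernels are computed on spectra and fragmentation trees.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Toy stand-in for an MS/MS spectrum: a set of nominal fragment masses (hypothetical).
Spectrum = FrozenSet[int]
# One model per fingerprint bit: (+1/-1 labels, per-reference mistake counts).
Model = Tuple[List[int], List[int]]


def kernel(a: Spectrum, b: Spectrum) -> float:
    """Toy similarity: number of shared fragment masses, plus 1 acting as a bias term."""
    return float(len(a & b)) + 1.0


def decision(alpha: Sequence[int], labels: Sequence[int],
             refs: Sequence[Spectrum], x: Spectrum) -> float:
    """Kernel-weighted vote of the reference spectra for spectrum x."""
    return sum(a * y * kernel(r, x) for a, y, r in zip(alpha, labels, refs))


def train_property(refs: Sequence[Spectrum], labels: Sequence[int],
                   max_epochs: int = 50) -> List[int]:
    """Kernel perceptron (used here in place of an SVM) for one molecular property."""
    alpha = [0] * len(refs)
    for _ in range(max_epochs):
        mistakes = 0
        for i, x in enumerate(refs):
            # A non-positive margin counts as a mistake and strengthens reference i.
            if labels[i] * decision(alpha, labels, refs, x) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every reference spectrum is classified correctly
            break
    return alpha


def train_fingerprint(refs: Sequence[Spectrum],
                      fingerprints: Sequence[Tuple[int, ...]]) -> List[Model]:
    """Learning phase: repeat the binary classification for every fingerprint bit."""
    models = []
    for bit in range(len(fingerprints[0])):
        labels = [1 if fp[bit] else -1 for fp in fingerprints]
        models.append((labels, train_property(refs, labels)))
    return models


def predict_fingerprint(models: Sequence[Model], refs: Sequence[Spectrum],
                        query: Spectrum) -> Tuple[int, ...]:
    """Prediction phase: kernel similarities of the query against all references."""
    return tuple(1 if decision(alpha, labels, refs, query) > 0 else 0
                 for labels, alpha in models)


# Hypothetical reference library: four toy spectra with two made-up property bits.
REFS = [frozenset({91, 77, 65}), frozenset({91, 77, 45}),
        frozenset({45, 59, 73}), frozenset({57, 71, 85})]
FINGERPRINTS = [(1, 0), (1, 1), (0, 1), (0, 0)]

models = train_fingerprint(REFS, FINGERPRINTS)
query = frozenset({91, 77, 45, 59})  # "unknown" spectrum, not in the library
print(predict_fingerprint(models, REFS, query))  # prints (1, 1)
```

Only kernel values between spectra enter the training and prediction steps, so a richer similarity measure could replace the toy `kernel` function without changing the rest of the sketch.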

A learning tree is again built and the result is a predicted fingerprint of the unknown compound. Dührkop et al. (2) point out that the machine-learning basis of the method allows for improvement in performance with additional reference MS/MS data. In the metabolomics and natural product community, the benefit from publicly available annotated reference spectra is becoming increasingly evident. One such resource is a part of the Global Natural Product Social Molecular Networking effort at gnps.ucsd.edu, which the authors extensively used for the development of CSI:FingerID. Such reference collections are crucial for the development of search tools, because machine-learning methods perform better with more comprehensive training sets. Studies such as this one will hopefully stimulate groups that isolate and characterize specific molecules to share their data. Data-sharing will facilitate the prediction and detection of new structures within the same molecular class, which will be enormously beneficial to both the mass spectrometry and sciences community.

Dührkop et al. (2) refer to the use of spectral orthogonal information (retention time, infrared and UV spectroscopy, and so forth) as a way to manually refine the "best" spectral match. There are several automated methods for using such orthogonal information, but most of them are limited to a specific experimental setup (6, 7). The availability of datasets covering different and experimental procedures will allow the use of the full informational content of a mass spectrum, resulting in improved identification scores.

The final stage of CSI:FingerID is the Scoring phase. With the predicted fingerprint of an unknown molecule, one can retrieve all structures matching the same molecular formula from a structure database. For each candidate molecular structure, its fingerprint is scored against the predicted fingerprint. Dührkop et al. (2) benchmarked their tool and found an enormous improvement on the scoring function compared with similar algorithms (8, 9). In the last few years, computational methods for structural assessment of metabolomics data have seen significant development. For the two large-scale MS/MS datasets that were tested, the method achieved more correct identifications than the next-best available search algorithm. Dührkop et al.'s (2) method also provides fivefold more unique and correct identifications. The CSI:FingerID tool is available as a web server providing an easy-to-use tool for wet laboratory scientists. The next step of the tool's evolution will be the ability to process multiple spectra at the same time in a batch process and to provide a standalone version to run on the user's own computer. These options will speed the analysis workflow of complex metabolomics datasets. The method has the potential to improve identification in metabolomics experiments by expanding the search space outside of that available in spectral libraries. Dührkop et al. (2) also point to the potential to search databases containing hypothetical simulated compounds, expanding the search space by an order of millions (10). Matching spectra in a metabolomics experiment to molecules whose structure has not yet been elucidated may well be in reach within the next few years.

As tools such as CSI:FingerID begin to illuminate more of the chemical dark matter, some form of a chemical ontology must be agreed upon to better classify and bin structures into groups of related compounds. A classification hierarchy will allow the research community to link metabolites to their associated biological processes, whether or not the specific metabolite in question is biologically characterized. Such ontology would greatly benefit from biological information about where a particular molecule or molecular family comes from and what it does. Many compounds in structure databases are chemically synthesized and not produced naturally. Although these compounds broaden the molecular space of these databases, they are most often not clearly differentiated from natural products. For CSI:FingerID, Dührkop et al. (2) enrich the molecular property information with molecules that have known biological activity (11) and weight these signatures with higher scores in their identifications. This is crucial to avoid convoluting the search with synthetic compounds (12), as strategies to differentiate signatures of metabolites and synthetic compounds improve the quality of results from search tools (13, 14). Databases with biologically relevant chemical information are becoming available, including the ChEBI database, Kyoto Encyclopedia of Genes and Genomes, and others (15, 16). These will be extremely useful to chemists and biologists as they apply computational tools to more complex systems with increasingly complex chemistry.

The field of genomics was made possible by the development of algorithms for comparing sequences to identify relatedness in genetic information. In the late 1980s and early 1990s hundreds of these algorithms were developed, including the basic local alignment search tool (BLAST) (17). Since then the field has exploded and technologies for sequencing millions of nucleic acids have been developed to capture the genetic information in our biological world. In metabolomics, the technological advances are already in place. Mass spectrometers are incredible machines capable of identifying the mass of molecules to unprecedented accuracy, on a massive scale, in timeframes of less than a second. However, computational resources analogous to BLAST and the NCBI's GenBank database are only in their infancy. This year is the 25th anniversary of the release of BLAST. CSI:FingerID is an example of the type of tools required to expand the power of metabolomics and catch up to the successes of genomics. These tools are fundamental to harnessing mass spectral information and similar in their synthesis and setting to the early tools developed for genomics that revolutionized the field of biology. CSI:FingerID and other algorithms will help catch up to the field of genomic bioinformatics, despite its 25-y head start, and begin to illuminate the diverse chemistry in our biological world.

ACKNOWLEDGMENTS. R.R.d.S. is supported by the São Paulo Research Foundation (FAPESP-2015/03348-3).

1 Mokili JL, Rohwer F, Dutilh BE (2012) Metagenomics and future perspectives in virus discovery. Curr Opin Virol 2(1):63–77.
2 Dührkop K, Shen H, Meusel M, Rousu J, Böcker S (2015) Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci USA 112:12580–12585.
3 Johnson SR, Lange BM (2015) Open-access metabolomics databases for natural product research: Present capabilities and future potential. Front Bioeng Biotechnol 3:22.
4 Vaniya A, Fiehn O (2015) Using fragmentation trees and mass spectral trees for identifying unknown compounds in metabolomics. Trends Analyt Chem 69:52–61.
5 Bouslimani A, Sanchez LM, Garg N, Dorrestein PC (2014) Mass spectrometry of natural products: Current, emerging and future technologies. Nat Prod Rep 31(6):718–729.
6 Pluskal T, Uehara T, Yanagida M (2012) Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal Chem 84(10):4396–4403.
7 Stanstrup J, Gerlich M, Dragsted LO, Neumann S (2013) Metabolite profiling and beyond: Approaches for the rapid processing and annotation of human blood serum mass spectrometry data. Anal Bioanal Chem 405(15):5037–5048.
8 Heinonen M, Shen H, Zamboni N, Rousu J (2012) Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics 28(18):2333–2341.
9 Shen Y, Yin C, Su M, Tu J (2010) Rapid, sensitive and selective liquid chromatography-tandem mass spectrometry (LC-MS/MS) method for the quantification of topically applied azithromycin in rabbit conjunctiva tissues. J Pharm Biomed Anal 52(1):99–104.
10 Kind T, Fiehn O (2010) Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev 2(1-4):23–60.
11 Klekota J, Roth FP (2008) Chemical substructures that enrich for biological activity. Bioinformatics 24(21):2518–2525.
12 Allen F, Greiner R, Wishart D (2014) Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11(1):98–110.
13 Peironcely JE, Reijmers T, Coulier L, Bender A, Hankemeier T (2011) Understanding and classifying metabolite space and metabolite-likeness. PLoS One 6(12):e28966.
14 Ruttkies C, Gerlich M, Neumann S (2013) Tackling CASMI 2012: Solutions from MetFrag and MetFusion. Metabolites 3(3):623–636.
15 Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28(1):27–30.
16 Hastings J, et al. (2013) The ChEBI reference database and ontology for biologically relevant chemistry: Enhancements for 2013. Nucleic Acids Res 41(Database issue):D456–D463.
17 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410.

www.pnas.org/cgi/doi/10.1073/pnas.1516878112 PNAS | October 13, 2015 | vol. 112 | no. 41 | 12549–12550
