Structure Elucidation and Identification of Known and Unknown
Metabolites by Nuclear Magnetic Resonance Spectroscopy
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of
Philosophy in the Graduate School of The Ohio State University
By
Cheng Wang
Graduate Program in Chemistry
The Ohio State University
2019
Dissertation Committee
Rafael Brüschweiler, Advisor
James Coe
Marcos Sotomayor
Copyrighted by
Cheng Wang
2019
Abstract
Identification of metabolites is one of the main challenges in metabolomics. Since metabolite identities and their concentrations are often directly linked to the phenotype, such information can be used to map biochemical pathways and understand their role in health and disease. A very large number of metabolites are still unknown, i.e., their spectroscopic signatures do not match those in existing databases, which makes the identification of unknown molecules both imperative and challenging.
This dissertation describes new methods that combine nuclear magnetic resonance (NMR), mass spectrometry (MS) and combinatorial chemoinformatics tools to identify the structures of known and unknown metabolites, as well as the development of a 2D NMR hydrophobic metabolite database to identify lipids in lipidomics mixtures. Chapter 1 presents a general introduction to metabolomics in systems biology and an overview of current NMR and MS-based metabolite identification. Chapter 2 focuses on the development of the SUMMIT MS/NMR approach for the identification of metabolites in a model mixture and an E. coli cell extract. It combines 2D and 3D NMR experiments with Fourier transform ion cyclotron resonance MS and MS/MS data to yield the complete structures or fragments that match those in the complex mixture, independent of any spectroscopic database information. SUMMIT MS/NMR greatly assists the targeted or untargeted analysis of unknown compounds in complex mixtures. Chapter 3 describes an efficient motif-based method that first identifies molecular motifs from NMR spectra and then the complete structures of unknown metabolites. Two databases are assembled, namely COLMAR MSMMDB and pNMR MSMMDB. The motif-based identification method was applied to the hydrophilic extract of mouse bile fluid and two unknown metabolites were successfully identified. The final chapter illustrates the development of a 2D NMR database of hydrophobic compounds and the application of high-resolution non-uniform sampling 2D real-time pure shift HSQC spectra to metabolomics mixtures for accurate lipid identification. The methods and databases introduced here permit applications to a wide range of metabolomics mixtures for accurate identification of both known and unknown metabolites.
Dedication
To my parents.
Acknowledgments
I would like to express my great gratitude to my advisor, Prof. Rafael Brüschweiler, for his invaluable supervision and guidance throughout my PhD journey. His dedication to science and his wisdom deeply motivated me and helped me overcome numerous problems in research. In particular, his ideas and advice inspired me to discover new directions in multiple research projects.
I would like to thank my dissertation committee members, Profs. James Coe and Marcos Sotomayor, for their continuous support and guidance. They gave me valuable advice on my candidacy proposal, research progress and dissertation. I also thank Profs. Mark Foster and Philip Grandinetti for their help at various stages during my PhD.
I greatly appreciate the Campus Chemical Instrument Center, not only for its state-of-the-art high-field NMR and mass spectrometers, but more importantly for the excellent expertise and helpfulness of its staff scientists. Drs. Alexandar Hansen and Chunhua Yuan taught me how to set up NMR experiments. Dr. Lei Bruschweiler-Li expertly guided me in the preparation of high-quality NMR samples. Dr. Da-Wei Li helped me advance the analysis of multiple research projects with intelligent computational methods. Dr. Árpád Somogyi helped me collect high-quality ultrahigh-resolution MS data.
I also express sincere thanks to the Brüschweiler Lab members. In particular, I am grateful to Drs. Bo Zhang and István Timári for their help and discussions on motif-based unknown metabolite identification, Drs. Mouzh Xie and Jiaqi Yuan for their tutorials and discussions on fundamental NMR tools and experiments, and Drs. Yina Gu, Helena Zacharias, and Gilson C. Santos Jr. for their useful advice on how to enhance daily research efficiency.
Vita
1991 Born in Heze, Shandong, China, P.R.
2013 B.S. in Applied Chemistry, China University of Petroleum
2014 - Graduate Teaching and Research Associate, The Ohio State University
Publications
First authored:
1. Wang, C., Zhang, B., Timári, I., Somogyi, A., Li, D. W., Adcox, H. E., Gunn, J. S.,
Bruschweiler-Li, L., & Brüschweiler, R. (2019). Accurate and Efficient Determination
of Unknown Metabolites in Metabolomics by NMR-Based Molecular Motif
Identification. Analytical chemistry. (Accepted)
2. Leggett, A.,* Wang, C.,* Li, D. W., Somogyi, A., Bruschweiler-Li, L., & Brüschweiler,
R. (2019). Identification of Unknown Metabolomics Mixture Compounds by
Combining NMR, MS, and Cheminformatics. Methods in enzymology, 615, 407-422.
3. Wang, C.,* He, L.,* Li, D. W.,* Bruschweiler-Li, L., Marshall, A. G., & Brüschweiler, R.
(2017). Accurate identification of unknown and known metabolic mixture
components by combining 3D NMR with fourier transform ion cyclotron resonance
tandem mass spectrometry. Journal of proteome research, 16(10), 3774-3786.
(*co-first author)
Co-authored:
4. Knobloch, T. J., Ryan, N. M., Bruschweiler-Li, L., Wang, C., Bernier, M. C., Somogyi,
A., Yan, P. S., Cooperstone, J. L., Mo, X., Brüschweiler, R., Weghorst, C. M., &
Oghumu, S. (2019). Metabolic Regulation of Glycolysis and AMP Activated Protein
Kinase Pathways during Black Raspberry-Mediated Oral Cancer Chemoprevention.
Metabolites, 9(7), 140.
5. Timári, I., Wang, C., Hansen, A. L., Costa dos Santos, G., Yoon, S. O., Bruschweiler-Li,
L., & Brüschweiler, R. (2019). Real-Time Pure Shift HSQC NMR for Untargeted
Metabolomics. Analytical chemistry, 91(3), 2304-2311
6. Yuan, J., Zhang, B., Wang, C., & Brüschweiler, R. (2018). Carbohydrate background
removal in metabolomics samples. Analytical chemistry, 90(24), 14100-14104.
7. Hansen, A. L., Li, D.W., Wang, C., & Brüschweiler, R. (2017). Absolute Minimal
Sampling of Homonuclear 2D NMR TOCSY Spectra for High-Throughput
Applications of Complex Mixtures. Angewandte Chemie (International ed. in English),
129(28), 8261-8264.
8. Li, D. W., Wang, C., & Brüschweiler, R. (2017). Maximal clique method for the
automated analysis of NMR TOCSY spectra of complex mixtures. Journal of
biomolecular NMR, 68(3), 195-202.
Fields of Study
Major Field: Chemistry
Table of Contents
Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Introduction
1.1 Metabolomics in Systems Biology
1.2 NMR-Based Metabolomics
1.2.1 Basics of NMR
1.2.2 One-dimensional and Multidimensional NMR
1.2.3 NMR Metabolomics Database
1.3 Overview of Identification of Metabolites
1.3.1 Identification of Metabolites by Mass Spectrometry
1.3.2 Identification of Metabolites by NMR
1.3.3 MS and NMR Hybrid Approaches for Identification of Metabolites
Accurate Identification of Unknown and Known Metabolic Mixture Components by Combining 3D NMR with Fourier Transform Ion Cyclotron Resonance Tandem Mass Spectrometry
2.1 Introduction
2.2 Materials and Methods
2.2.1 Identification of Unique Elemental Composition from Ultrahigh-Resolution FT-ICR Mass Spectra
2.2.2 Identification of Individual Spin Systems of a Mixture from a 3D 13C-1H HSQC-TOCSY NMR Spectrum
2.2.3 Structure Manifold Generation and 2D HSQC NMR Spectra Prediction
2.2.4 Weighted Matching between Experimental and Predicted NMR Spectra
2.2.5 Molecular Structure Motif Identification of Compounds
2.2.6 Sample Preparation
2.2.7 NMR Experiments and Data Processing
2.2.8 FT-ICR MS Experiments and Processing
2.3 Results and Discussion
2.3.1 Demonstration of SUMMIT MS/NMR to Identify Metabolites
2.3.2 Application to a 25-Compound Model Mixture
2.3.3 Application to E. coli Cell Lysate
2.4 Conclusion
Accurate and Efficient Determination of Unknown Metabolites in Metabolomics by NMR-Based Molecular Motif Identification
3.1 Introduction
3.2 Materials and Methods
3.2.1 Definition of spin systems, molecular structural motifs, and COLMAR MSMMDB curation
3.2.2 Curation of the empirically predicted pNMR MSMMDB
3.2.3 Single and multiple spin system analysis from 2D and 3D NMR experiments
3.2.4 Workflow of MSM-based metabolite identification
3.2.5 Sample Preparation
3.2.6 Experiments and Data Processing
3.2.7 Classification of hydrophilic metabolites based on lipophilicity logP
3.2.8 Spin system matching and scoring
3.2.9 Quantitative metric on evaluation of the MSM identification
3.3 Results
3.3.1 Evaluation of COLMAR and pNMR MSMMDB in identification of known molecules in bile and E. coli extracts
3.3.2 Structure elucidation of unknown metabolites
3.3.3 Coverage of COLMAR motifs of HMDB
3.4 Discussion
3.5 Conclusion
COLMAR Lipids Database for 2D NMR-Based Lipidomics
4.1 Introduction
4.2 Methods and Experimental Section
4.2.1 Integration of Hydrophobic Compound 2D 13C-1H HSQC Database
4.2.2 Sample Preparation
4.2.3 NMR Experiments and Processing
4.3 Results and Discussion
4.3.1 COLMAR Lipids Web Server with Hydrophobic Metabolite 2D HSQC Database
4.3.2 Disambiguation of Lipid Identification Using Complementary Information
4.3.3 Accurate Lipid Identification by High-resolution 2D NUS Real-time Pure Shift HSQC
4.4 Conclusion
Bibliography
List of Tables
Table 1.1. The advantages and limitations of NMR spectroscopy as an analytical tool for metabolomics research in comparison with MS spectrometry.
Table 2.1. Example of chemical shift (c.s.) matching results for valine.
Table 2.2. SUMMIT results for 25-compound model mixture.
Table 2.3. Effect of cutoff threshold of mass peak amplitudes on the rank of SUMMIT results for 25-compound model mixture.
Table 2.4. Metabolites identified in E. coli cell lysate and verified by COLMAR web server and SUMMIT MS/NMR.
Table 2.5. Quantum-chemical calculations of NMR chemical shifts for two selected compounds, together with matching results from experimental spectra.
Table 3.1. Categorization of COLMAR and HMDB hydrophilic compounds according to their molecular structure motifs.
Table 3.2. Examples of molecular structural motif identification of bile metabolites by COLMAR and pNMR MSMMDB queries.
Table 3.3. True and false positive top hits of E. coli metabolites with various RMSD thresholds.
Table 3.4. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in COLMAR MSMMDB.
Table 3.5. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in HMDB.
Table 4.1. Composition of artificial micelles incubated with Caco-2 human intestinal cells for 6 hours prior to experiment termination.
Table 4.2. Hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.
Table 4.3. Full hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.
Table 4.4. Hydrophobic metabolite identification of lung tissue using COLMAR Lipids web server.
Table 4.5. Different ratios of fatty acids utilized in the micelles with which Caco-2 human intestinal cells were incubated for 6 h.
List of Figures
Figure 1.1. Correlation between main omics strategies in systems biology.
Figure 1.2. 1D 1H NMR spectrum collected from NIST SRM-1950 human serum at a 700 MHz NMR spectrometer.
Figure 1.3. Screenshot of COLMARm web server with the identification of isoleucine in human serum.
Figure 2.1. Extraction of spin systems of individual mixture compounds from 3D 13C-1H HSQC-TOCSY.
Figure 2.2. Illustration of the spin system refinement procedure based on 2D HSQC, 2D TOCSY, and 2D/3D HSQC-TOCSY NMR spectra.
Figure 2.3. Predicted chemical shifts compared with their experimental chemical shifts of 25-compound model mixture.
Figure 2.4. 2D 13C-1H HSQC of L-glutamine, glutathione, and L-glutamic acid mixture.
Figure 2.5. Putatively annotated valine spin system in a 25-metabolite model mixture extracted from 3D HSQC-TOCSY spectrum and confirmed by 2D TOCSY and 2D HSQC-TOCSY.
Figure 2.6. Molecular structural motifs identified by SUMMIT from 122 different compound candidates that all match the spin system of valine.
Figure 2.7. Positive electrospray ionization 9.4 T FT-ICR broadband mass spectrum of E. coli cell lysate.
Figure 2.8. Identification of best matching metabolites in a 25-compound model mixture by SUMMIT MS/NMR.
Figure 2.9. FT-ICR MS/MS of glutamine and lysine in 25 metabolite mixture.
Figure 2.10. Infrared multiphoton dissociation positive electrospray ionization 9.4 T FT-ICR product ion mass spectrum of arginine and ornithine.
Figure 2.11. Collision-induced dissociation (normalized collision energy 22) Velos Pro product ion mass spectrum of valine.
Figure 2.12. Spin system of an unknown compound from an E. coli cell lysate extracted from 3D HSQC-TOCSY and verified by 2D TOCSY and 2D HSQC-TOCSY.
Figure 3.1. Definition of 1st and 2nd shell molecular motifs based on a spin system.
Figure 3.2. Examples of metabolites with identical molecular structural motifs (in color).
Figure 3.3. Box plots of chemical shift errors of MSMs.
Figure 3.4. Histogram of ALOGP values of hydrophilic and hydrophobic metabolites.
Figure 3.5. Workflow of molecular structural motif (MSM) based unknown metabolite identification.
Figure 3.6. ROC curve with AUC of 0.851 of true and false positive top hits of E. coli metabolites with various RMSD thresholds.
Figure 3.7. Identification of L-alanyl-L-valine metabolite in mouse bile extracts.
Figure 3.8. Identification of two unknown spin systems of taurocholic acid in mouse bile extracts.
Figure 3.9. Identification of taurocholic acid in mouse bile extracts.
Figure 3.10. Graph-theoretical representation of molecular motif clustering of current COLMAR MSMMDB molecules.
Figure 3.11. Distribution of the number of molecules in the 50 most common motifs of hydrophilic compounds of the HMDB.
Figure 4.1. 1D 1H and 2D 13C-1H HSQC NMR spectra of Caco-2 cell extract.
Figure 4.2. Identification of hydrophobic metabolites in Caco-2 cell extract with COLMAR Lipids web server.
Figure 4.3. Comparison of 2D 13C-1H HSQC NMR spectra and LC-MS/MS spectra pairs of structural isomers.
Figure 4.4. Comparison of performance of different 2D 13C-1H HSQC experiments in the CH=CH region for lipids identification in a model mixture.
Figure 4.5. Identification of eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) in different 2D 13C-1H HSQC experiments in Caco-2 cell extract.
Figure 4.6. Differentiation of Caco-2 cells by NMR metabolomics analysis.
Figure 4.7. Spectral comparison of crowded regions of different types of 2D 13C-1H HSQC experiments in Caco-2 cell extract and lung tissue.
Figure 4.8. Spectral comparison of 6.25% NUS HSQC NMR spectrum (red) and US HSQC spectrum (blue) of mouse lung tissue extracts.
Introduction
Metabolomics, which is defined as the comprehensive and quantitative analysis of all metabolites in biological systems, has been demonstrated to possess high potential in disease diagnosis and the understanding of the physiological state in cells, tissues and biological fluids. As an essential part of metabolomics, metabolite identification plays a key role in extending the knowledge of biological systems. However, identification of metabolites is still a major bottleneck in metabolomics due to the huge number of potentially interesting metabolites that remain unknown. Two leading analytical platforms, Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR), have been extensively deployed to identify known and unknown metabolites in complex mixtures in a high-throughput manner with numerous approaches. Specifically, NMR allows for structural characterization and increases the efficiency of unambiguous identification of metabolites. This chapter gives an overview of metabolomics in systems biology and metabolite identification by NMR and MS.
1.1 Metabolomics in Systems Biology
Systems biology refers to the system-level understanding of a biological system, which focuses on the structure and dynamics of cellular and organismal function.1-4 The development of omics strategies in systems biology provides comprehensive data (genomics, transcriptomics, proteomics, and metabolomics) towards understanding the organizational principles of cellular functions at different levels.5, 6 Omics strategies aim at identifying the entirety of gene products (transcripts, proteins, and metabolites) in biological tissues, cells, fluids or organisms. Figure 1.1 shows how the various omics approaches in systems biology are correlated with each other.7 Metabolomics is defined as the comprehensive and quantitative analysis of all metabolites (< 1,500 Da) present within a tissue, cell, fluid or entire organism.8-10 Unlike other “omics” approaches, metabolomics connects environmental factors with gene-related outcomes that directly reflect the underlying biochemical activity and the state of cells or tissues. This provides fingerprints toward the understanding of the structure and dynamics of system networks through metabolic pathway analysis.
Besides metabolomics, several related terms have been introduced in the past decades. Metabonomics, proposed by Nicholson et al.,11, 12 is defined as the quantitative measurement of the multiparametric metabolic responses of living systems to pathophysiological stimuli or genetic modification, which includes not only the intracellular molecules but also the components of extracellular biofluids. The term metabolome was first introduced by Oliver et al.13 and describes the entire complement of small molecules (< 1,500 Da) in a biological system such as an organism, organ, tissue or cell. Metabolic profiling describes the analysis of a small number of metabolites for the investigation of selected biochemical pathways in a biological system. As a subdivision of metabolomics, lipidomics is defined as “the full characterization of lipid molecular species and biological roles with respect to expression of proteins involved in lipid metabolism and function, including gene regulation.”14-16 One focus of lipidomics is to identify fingerprints of lipid molecular species in diseases for therapeutic development.
1.2 NMR-Based Metabolomics
Metabolomics methodologies fall into two main categories, termed targeted and untargeted metabolomics. Targeted metabolomics refers to the measurement of targeted groups of chemically characterized and biochemically known metabolites in a biological system.17, 18 By taking advantage of a comprehensive understanding of known metabolites, metabolic enzymes, and pathways, accurate identification and quantitation of a set of targeted metabolites in biological samples can be achieved on a variety of analytical platforms such as gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS) and nuclear magnetic resonance spectroscopy. Untargeted metabolomics is the comprehensive analysis of all detectable signals in a biological sample set, followed by the assignment of many of these signals to metabolites by screening against metabolomics databases.19-21 This section gives a brief introduction to NMR-based methodologies for both targeted and untargeted metabolomics.
1.2.1 Basics of NMR
Nuclear magnetic resonance (NMR) can be traced back to 1938, when Isidor Rabi accurately measured nuclear magnetic moments using magnetic resonance absorption in molecular beams. In 1946, Felix Bloch and Edward Mills Purcell successfully demonstrated NMR on liquids and solid materials. With the discovery of the “chemical shift” for nuclei such as 1H and 31P, NMR spectroscopy became a tool for chemists for the structure elucidation of molecules.
Among the various types of spectroscopic methods, NMR spectroscopy uses the lowest irradiation energy for excitation (radio frequencies, on the order of $10^{8}$ Hz at typical high-field strengths). The proton has the intrinsic property of spin 1/2. The spin of a proton generates a magnetic field, and the associated magnetic dipole is called the nuclear magnetic moment. When it is placed in a large external magnetic field, the spin can adopt a low-energy state, a high-energy state, or any quantum-mechanical superposition thereof. Such a two-state description applies to all nuclei with the nuclear spin quantum number I = 1/2, such as 1H, 13C, 15N, and 31P. The energy difference between the two states is:

$$\Delta E = \frac{h \gamma B_0}{2\pi} \tag{1.1}$$

where $h$ is Planck's constant ($6.63 \times 10^{-27}$ erg s), $\gamma$ is the gyromagnetic ratio and $B_0$ is the external magnetic field. Combined with the Bohr condition ($\Delta E = h\nu$), the above equation can be written as:

$$\nu = \frac{\gamma B_0}{2\pi} \tag{1.2}$$

where $\nu$ is the transition or resonance frequency between the two energy states. The energy released when nuclear spins flip from the higher to the lower energy state induces a voltage that can be detected. The most critical perturbation of the NMR frequency is the “shielding” effect by the surrounding shells of electrons, which gives rise to atom-specific chemical shifts (see below). This shift in the NMR frequency depends on the chemical structure of the molecule, as well as on ring current effects, hydrogen bonds, and solvent accessibility. The NMR frequencies are also proportional to the magnetic field strength.
To define the NMR frequency without specifying the external magnetic field strength, a reference compound (tetramethylsilane, TMS) is used to calculate a relative frequency, the so-called chemical shift. The chemical shift is calculated based on the following equation:

$$\delta = \frac{\nu - \nu_0}{\nu_0} \tag{1.3}$$

where $\delta$ is the chemical shift (expressed in parts per million, ppm), $\nu$ is the frequency of interest, and $\nu_0$ is the frequency of TMS.
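As a numerical illustration of Equations 1.2 and 1.3, the following minimal Python sketch computes the 1H resonance frequency at a given field strength and converts a frequency offset into a chemical shift. The field value of 16.44 T, the 700 Hz offset, and the function names are assumptions chosen for illustration. The factor of 10^6 expresses the ratio of Equation 1.3 in ppm, and the bare-proton frequency is used as a stand-in for the TMS reference frequency.

```python
import math

# Proton gyromagnetic ratio in rad s^-1 T^-1 (CODATA value).
GAMMA_1H = 2.6752218744e8


def larmor_frequency_hz(gamma: float, b0_tesla: float) -> float:
    """Resonance frequency nu = gamma * B0 / (2 * pi), as in Eq. 1.2."""
    return gamma * b0_tesla / (2.0 * math.pi)


def chemical_shift_ppm(nu_hz: float, nu_ref_hz: float) -> float:
    """Relative frequency (nu - nu0) / nu0 of Eq. 1.3, scaled to ppm."""
    return (nu_hz - nu_ref_hz) / nu_ref_hz * 1.0e6


# Assumed field strength of a nominal "700 MHz" magnet.
b0 = 16.44  # tesla
nu_ref = larmor_frequency_hz(GAMMA_1H, b0)

# A hypothetical resonance 700 Hz away from the reference frequency.
nu_signal = nu_ref + 700.0
delta = chemical_shift_ppm(nu_signal, nu_ref)

print(f"reference frequency: {nu_ref / 1e6:.2f} MHz")
print(f"chemical shift: {delta:.3f} ppm")
```

Because both the offset and the reference frequency scale linearly with the field, the same resonance would give the same value in ppm at any field strength, which is the reason for reporting chemical shifts rather than absolute frequencies.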
The observable NMR signal, generated by excited nuclear spins with a net magnetic moment perpendicular to the external magnetic field, oscillates and decays due to relaxation and is called the free induction decay (FID). To convert the time-domain data to the frequency domain, a Fourier transform is applied to the FID, followed by phase correction. Sensitivity and resolution are key criteria when evaluating the quality of an NMR spectrum before deeper spectral analysis. The digital resolution of the spectrum can be slightly improved by simply appending zeros to the tail of the FID before Fourier transformation (zero filling). In addition, multiplying the FID by a window function before Fourier transformation, so-called apodization, can increase either the sensitivity or the resolution. For instance, an exponential decay function can increase sensitivity and a shifted Gaussian function can enhance the resolution.
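The processing chain described above (apodization, zero filling, Fourier transformation) can be sketched with NumPy as follows. This is a minimal illustration on a synthetic, noise-free FID and not the processing pipeline used in this work. The function name `process_fid`, the 1 Hz line broadening, the zero-fill factor, and the parameters of the synthetic signal are all assumptions, and phase correction is omitted because the synthetic FID has zero phase.

```python
import numpy as np


def process_fid(fid, dwell_time, lb_hz=1.0, zero_fill_factor=2):
    """Apodize, zero-fill, and Fourier transform a complex FID."""
    n = fid.size
    t = np.arange(n) * dwell_time
    # Exponential window: damps the tail of the FID, which improves
    # sensitivity at the cost of lb_hz of additional line broadening.
    apodized = fid * np.exp(-np.pi * lb_hz * t)
    # Zero filling: append zeros so the spectrum is sampled on a finer grid.
    padded = np.zeros(n * zero_fill_factor, dtype=complex)
    padded[:n] = apodized
    # Fourier transform and reorder so that frequencies run from - to +.
    spectrum = np.fft.fftshift(np.fft.fft(padded))
    freqs = np.fft.fftshift(np.fft.fftfreq(padded.size, d=dwell_time))
    return freqs, spectrum


# Synthetic FID: a single resonance at +100 Hz decaying with T2 = 0.2 s.
dwell = 1.0e-3  # seconds per point, i.e. a 1000 Hz spectral width
t = np.arange(1024) * dwell
fid = np.exp(2j * np.pi * 100.0 * t) * np.exp(-t / 0.2)

freqs, spectrum = process_fid(fid, dwell)
peak_hz = freqs[np.argmax(spectrum.real)]
print(f"peak found at {peak_hz:.2f} Hz")
```

Replacing the exponential window with a shifted Gaussian would trade sensitivity for resolution instead, as noted in the text.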
1.2.2 One-dimensional and Multidimensional NMR
The 1D 1H NMR experiment is the most prevalent experiment in NMR-based metabolomics studies.22-27 This is because 1H, with its high (~99%) natural isotopic abundance, is more sensitive than other nuclei (13C, 15N). More importantly, protons are present in most metabolites. Figure 1.2 shows an example of a 1D 1H NMR spectrum of human serum.28 Hundreds of metabolites with unique peaks can be identified from a single 1D 1H NMR spectrum. The exceptionally high resolution is achieved due to the narrow linewidths (often <1 Hz). 1D 1H NMR is also fast, typically requiring only 5-15 minutes per sample, which enables high-throughput sample analysis with automated 1D 1H NMR approaches.
Signals in a 1D 1H NMR spectrum tend to overlap when the number of species in a complex metabolomics mixture exceeds a few hundred, which severely hampers unambiguous metabolite identification. Multidimensional NMR spectroscopy is often applied to overcome this issue by adding dimensions (e.g., 13C) to spread out the peaks. The two-dimensional heteronuclear 13C,1H single quantum coherence (2D 13C-1H HSQC) experiment measures a 2D heteronuclear chemical shift correlation between directly bonded 13C and 1H, which provides one cross-peak for each C-H pair in a spectrum. Over the decades, a number of customized 2D HSQC experiments have been developed for metabolomics studies.29-31 For instance, extrapolated time-zero HSQC (HSQC0) enables simultaneous quantification and identification of compounds in metabolite mixtures.32 The approach starts with a series of HSQC spectra acquired with incremented repetition times, which can be extrapolated back to zero time to yield a time-zero 2D 13C-1H HSQC spectrum (HSQC0). The signal intensities in the HSQC0 spectrum are proportional to the concentrations of the individual metabolites and can therefore be used to determine their relative concentrations. By clustering the HSQC0 cross-peaks based on normalized intensities, the peaks of each cluster can be assigned to specific compounds. The multiplicity-edited 2D 13C-1H HSQC NMR spectrum shows the cross-peaks of 13C-1H correlations in a phase-sensitive manner. The methyl (CH3) and methine (CH) moieties show the opposite phase to that of methylene groups (CH2), which is very helpful for assigning signals in crowded regions of the NMR spectra of biological samples.
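The time-zero extrapolation behind HSQC0 can be illustrated with a short Python sketch. It assumes that the intensity of a given cross-peak is attenuated by a constant factor with each increment of the repetition time, so that the logarithm of the intensity is linear in the increment index and a straight-line fit can be evaluated at zero. The function name, the attenuation factor of 0.8, and the intensities are hypothetical values for illustration only.

```python
import numpy as np


def extrapolate_time_zero(intensities):
    """Estimate the time-zero intensity of one cross-peak.

    `intensities` holds the peak intensity measured in spectra with
    increment index i = 1, 2, 3, ... A straight line is fitted to
    ln(intensity) versus i and evaluated at i = 0.
    """
    index = np.arange(1, len(intensities) + 1)
    slope, intercept = np.polyfit(index, np.log(intensities), 1)
    return float(np.exp(intercept))


# Hypothetical peak: true time-zero intensity 100, attenuated by a
# factor of 0.8 per increment (both values are assumptions).
a0_true, attenuation = 100.0, 0.8
measured = [a0_true * attenuation**k for k in (1, 2, 3)]

a0_est = extrapolate_time_zero(measured)
print(f"extrapolated time-zero intensity: {a0_est:.3f}")
```

Applying the same fit to every cross-peak yields attenuation-corrected intensities, which can then be compared between peaks of different compounds.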
2D total correlation spectroscopy (TOCSY) measures the chemical shift of a given nucleus and the correlated chemical shifts of other nuclei within the same spin system of a given molecule. With a sufficiently long isotropic mixing time (> 80 ms), 2D 1H-1H TOCSY yields complete spin-spin connectivity information in molecular spin systems.33, 34 The 2D heteronuclear single quantum coherence-total correlation spectroscopy (HSQC-TOCSY) experiment is a hybrid experiment that combines 2D HSQC and 2D TOCSY to generate through-bond correlations between a 13C-attached 1H and all other coupled 1H in a spin system. The experiment is extremely powerful when severe overlap occurs in 2D TOCSY.35 The 2D heteronuclear multiple bond correlation (HMBC) experiment correlates the chemical shifts of two types of nuclei (1H, 13C) that are separated from each other by two or more bonds. 2D HMBC enables the detection of 1H, 13C correlations across quaternary carbons or heteroatoms (e.g., N, O), which connects isolated spin systems in a metabolite. Although relatively insensitive, the 2D HMBC experiment serves as a supplementary experiment to help connect one detected spin system to additional spin systems.36
1.2.3 NMR Metabolomics Database
Over the decades, a number of spectral databases have been established along with the development of NMR-based metabolomics approaches, such as the Human Metabolome Database (HMDB), the BioMagResBank (BMRB) and the COLMAR database.37-42 The Human Metabolome Database is currently the world’s largest and most comprehensive organism-specific metabolomics database.37, 38 As a standard metabolomics resource for human metabolic studies, HMDB not only offers information about the roles of a given metabolite in its metabolic context but also provides NMR and MS spectral information and links to other public chemical databases. HMDB has been updated to version 4.0, which includes a large number of predicted MS/MS and GC–MS reference spectral data and predicted metabolite structures to facilitate novel metabolite identification.38 As a central repository for experimental NMR spectral data, the BioMagResBank (BMRB) includes a comprehensive metabolomics database, containing an archive of experimental NMR spectral data of metabolomic compounds such as 1D 1H, 2D 13C-1H HSQC, 2D 1H-1H TOCSY and 2D 13C-1H HMBC.40 All the experimental NMR spectral data in BMRB are publicly accessible and downloadable for direct comparison and referencing. Complex Mixture Analysis by NMR (COLMAR), developed in the Brüschweiler laboratory, is a suite of NMR web servers that allows users to upload experimental 2D NMR data and query them against the COLMAR database.42-45 The COLMAR HSQC database unifies the NMR spectroscopic information from HMDB and BMRB and adds isomer information that distinguishes isomers given a set of chemical shifts.42 COLMAR HSQC increases the accuracy of metabolite identification by more than 37% and decreases the false positive identification rate by more than 82%, compared with other existing 13C-1H HSQC metabolomics databases. The COLMAR TOCCATA database integrates spin system information with the chemical shifts.43, 45 The chemical shifts of each compound are subdivided into isomeric states, followed by separation into the individual spin systems. Groups of C-H spin pairs are considered separate spin systems if they are separated by at least one non-carbon atom or quaternary carbon. The COLMAR TOCCATA database allows the unambiguous identification of spin systems and isomeric states of metabolites in complex metabolomics mixtures.
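The spin system definition given above can be expressed as a graph problem: carbons that carry at least one hydrogen are the nodes, bonds between two such carbons are the edges, and each connected component is one spin system. The following Python sketch applies this idea to ethyl acetate, whose acetyl CH3 group is separated from the ethyl group by a quaternary carbon and an oxygen. The data layout, the function name, and the choice of molecule are assumptions for illustration and do not represent the COLMAR implementation.

```python
from collections import defaultdict


def spin_systems(atoms, bonds):
    """Group protonated carbons into spin systems.

    atoms: dict mapping atom index -> (element, number of attached H)
    bonds: iterable of (index, index) pairs between heavy atoms
    Returns a list of sorted lists of carbon indices, one per spin system.
    """
    # Only carbons bearing at least one hydrogen give C-H spin pairs.
    ch_carbons = {i for i, (el, n_h) in atoms.items() if el == "C" and n_h > 0}

    # Keep only bonds that directly connect two protonated carbons;
    # heteroatoms and quaternary carbons therefore break the chain.
    neighbors = defaultdict(set)
    for a, b in bonds:
        if a in ch_carbons and b in ch_carbons:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Depth-first search for connected components.
    seen, systems = set(), []
    for start in sorted(ch_carbons):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        systems.append(sorted(component))
    return systems


# Ethyl acetate, CH3-C(=O)-O-CH2-CH3 (heavy atoms only).
atoms = {0: ("C", 3), 1: ("C", 0), 2: ("O", 0), 3: ("O", 0), 4: ("C", 2), 5: ("C", 3)}
bonds = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]

systems = spin_systems(atoms, bonds)
print(systems)
```

For this molecule the sketch returns two spin systems, the isolated acetyl methyl group and the ethyl CH2-CH3 pair.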
1.3 Overview of Identification of Metabolites
Identification of metabolites is one of the main challenges in metabolomics. Because it is a prerequisite for both targeted and untargeted metabolomics, numerous methods of metabolite identification have been developed.46-49 Mass spectrometry and NMR are the two leading analytical platforms that allow the detection of hundreds or thousands of signals in a single biological sample.50-55 MS and NMR have distinct advantages and disadvantages in terms of identification of metabolites. A brief comparison between MS and NMR is summarized in Table 1.1.56 The typical process of metabolite identification consists of two essential steps. First, the signals detected by MS or NMR are screened against existing metabolomics databanks such as HMDB or METLIN to identify the maximal number of known metabolites.37-39, 57 Second, the signals that are not identified are attributed to unknown metabolites, which are further analyzed by computational approaches or supplemental experiments. The Metabolomics Standards Initiative (MSI) recognizes four levels of metabolite identification.58 Level one, known as “identified compound”, requires a minimum of two independent and orthogonal types of data (such as retention time and mass spectrum) compared directly to an authentic reference standard. Level two, known as “putatively annotated compound”, refers to a compound that is identified by spectral analysis and/or similarity to data in a public database but without direct comparison to a reference standard as for level one. Level three, known as “putatively characterized compound class”, denotes a metabolite that is not identified per se but whose data allow it to be placed in a certain compound class. Level four, known as “unknown compound”, is defined as unidentified or unclassified but characterized by spectral data (LC-MS, NMR).
In the following, current advances in metabolite identification based on MS, NMR or MS/NMR hybrid approaches are discussed.
1.3.1 Identification of Metabolites by Mass Spectrometry
Mass spectrometry (MS) offers high sensitivity, selectivity and a wide dynamic range for the analysis of metabolomics samples.49, 59 MS instrumentation has undergone significant development in recent years. The availability of various ionization methods (e.g., electrospray ionization (ESI)) in both positive and negative modes enables the ionization of various metabolite classes and facilitates metabolite identification in a wide range of metabolomics samples. In addition, versatile mass analyzers configured in tandem mass spectrometry instruments improve metabolite identification by acquiring highly resolved tandem MS spectra. Coupling gas chromatography (GC) or liquid chromatography (LC) to MS further improves metabolite identification by reducing the complexity of the sample via separation. With the benefit of simple sample preparation and broad coverage of detectable metabolites, GC-MS and LC-MS are two of the most favored approaches for metabolite identification in MS-based metabolomics. The m/z value of the molecular ion is queried against database(s) such as METLIN and all possible chemical structures that are consistent with the mass are returned as candidates.50, 57, 60 The experimental retention time and MS/MS spectrum of the ion of interest are then compared with the retention time and MS/MS spectrum of each candidate compound to confirm or reject that candidate. As chemical space has expanded exponentially, mass spectral databases, the key component in metabolite identification from experimental MS/MS spectra, have grown considerably over the past years.61-65 Current public and commercial mass spectral databases contain around 1–2 million spectra of one million unique compounds, which are derived from authentic experimental spectra of reference compounds. However, over 100 million known compounds that have been discovered or synthesized are cataloged in PubChem and ChemSpider, most of which have no MS spectra in current databases.66 Therefore, a number of approaches for generating in silico mass spectra, especially MS/MS spectra, have been developed using quantum chemistry, machine learning and chemical reaction-based methods.67-72
Moreover, new MS-based metabolite identification frameworks that integrate the analysis of MS spectra and metabolite annotations facilitate the process of MS spectra analysis and enhance the efficiency of metabolite identification.73-75 For instance, the integration of BinVestigate, MS-DIAL 2.0 and MS-FINDER 2.0 enables identification of novel metabolites that are distinct from canonical pathways.76 Specifically, BinVestigate searches for biological metadata by querying BinBase, a GC-MS-based untargeted metabolomics database containing 1,561 studies with 114,795 samples for various species, organs, matrices, and experimental conditions. MS-DIAL 2.0 enables the processing of both GC-MS and LC-MS data. MS-FINDER 2.0 aims to identify structures of known and unknown metabolites based on a combination of 14 metabolome databases and an enzyme promiscuity library. The integrated framework is demonstrated in mammalian samples for the discovery of new methylation products and in microbial cells for the identification of plant-specific metabolites and transformations of exposome compounds.
1.3.2 Identification of Metabolites by NMR
NMR spectroscopy has been widely used in metabolomics studies.77-84 Hundreds of metabolites can be detected and measured simultaneously without the need for elaborate sample preparation or fractionation. NMR spectra are highly reproducible, minimizing the spectrometer influence on measurements of biological samples. The majority of NMR-based metabolite identification approaches start by analyzing NMR data against existing metabolomics databases to assign signals to known metabolites, followed by computational strategies to identify unassigned signals. 1D 1H NMR is the most widely used NMR approach in metabolite identification. Software such as Chenomx NMR Suite simplifies and facilitates the analysis of 1D 1H NMR. Chenomx provides a user-friendly interactive visualization interface and embeds a spectral library of over 600 metabolites to automatically identify metabolites in 1H NMR spectra of biological samples.
When peaks are excessively overlapped in 1D 1H NMR spectra of a complex mixture, 2D NMR is often used to aid unambiguous metabolite identification. 2D 13C-1H
HSQC is the primary experiment used in 2D NMR-based metabolite identification. By querying peak lists (13C, 1H) against NMR spectral databases such as HMDB and COLMAR, the majority of cross-peaks can be assigned to known metabolites.37-39, 85 For instance,
COLMAR HSQC web server allows users to upload peak lists of HSQC and run automatic referencing based on an internal standard such as sodium trimethylsilylpropanesulfonate
(DSS). The query step only takes a few seconds, after which the metabolite identification result is generated with quantitative metrics such as RMSD, matching ratio and uniqueness. However, it is possible that two HSQC peaks that are from different compounds can be assigned to another compound due to the similarity of chemical shifts.
Therefore, validation of this metabolite identification by additional experiments is needed.
The COLMARm web server was developed to integrate 2D HSQC, 2D TOCSY and 2D HSQC-TOCSY for metabolite identification and validation. COLMARm allows for the simultaneous spectral analysis based on 2D HSQC, 2D TOCSY and 2D HSQC-TOCSY on a publicly accessible web server (http://spin.ccic.ohio-state.edu/index.php/colmarm/index).86 Users can upload up to three NMR spectra of a metabolomics sample, which include a required 2D HSQC and optional 2D TOCSY and 2D HSQC-TOCSY spectra. The COLMARm web server queries the experimental HSQC spectrum against the COLMAR HSQC metabolomics database and returns a list of metabolite candidates. For each candidate metabolite, the corresponding 2D TOCSY and 2D HSQC-TOCSY spectra are generated based on the COLMAR TOCCATA database and compared with the experimental TOCSY and HSQC-TOCSY spectra. Depending on the agreement between the database-derived and the experimental TOCSY and HSQC-TOCSY spectra, the candidate metabolites can either be confirmed or excluded. The COLMARm web server was first applied to a human serum sample, in which 62 metabolites were identified, 14 of them for the first time by NMR, with high accuracy. Figure 1.3 shows an example of the COLMARm web server with the identification of isoleucine in human serum.
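The HSQC database query described above can be illustrated with a minimal sketch. It is not the COLMAR implementation: the function name `query_entry`, the tolerances, the greedy nearest-peak pairing, and the way the matching ratio and RMSD are computed are all assumptions chosen for illustration, and the peak values in the usage note are made-up numbers.

```python
from math import sqrt

def query_entry(exp_peaks, ref_peaks, tol_c=0.3, tol_h=0.03):
    """Compare one database entry with an experimental HSQC peak list.

    exp_peaks, ref_peaks: lists of (13C ppm, 1H ppm) tuples.
    tol_c, tol_h: assumed matching tolerances in ppm (hypothetical values).
    Returns (matching_ratio, rmsd); rmsd is None when nothing matches.
    """
    used = set()      # experimental peaks already paired
    sq_dev = []       # squared deviation of each matched pair
    for ref_c, ref_h in ref_peaks:
        best = None
        for j, (exp_c, exp_h) in enumerate(exp_peaks):
            if j in used:
                continue
            d_c, d_h = exp_c - ref_c, exp_h - ref_h
            if abs(d_c) <= tol_c and abs(d_h) <= tol_h:
                # 1H deviations are weighted by 10 (an assumed weighting)
                dev = d_c ** 2 + (10.0 * d_h) ** 2
                if best is None or dev < best[0]:
                    best = (dev, j)
        if best is not None:
            used.add(best[1])
            sq_dev.append(best[0])
    ratio = len(sq_dev) / len(ref_peaks)
    rmsd = sqrt(sum(sq_dev) / (2 * len(sq_dev))) if sq_dev else None
    return ratio, rmsd
```

For example, with a four-peak reference entry of which three peaks have an experimental counterpart within tolerance, `query_entry` returns a matching ratio of 0.75. A candidate with a high matching ratio and low RMSD would then be passed on to TOCSY-based validation as described in the text.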
1.3.3 MS and NMR Hybrid Approaches for Identification of Metabolites
A number of novel hybrid approaches combining MS and NMR have been developed recently to offer high-accuracy metabolite identification and quantification.87-90 MS/NMR hybrid approaches capitalize on the complementary strengths of MS and NMR. The
NMR/MS Translator approach combines NMR and accurate MS data of the same metabolomics sample, which provides fast and highly accurate identification of known metabolites.91 The NMR/MS Translator first identifies metabolite candidates from NMR spectra based on NMR database query (e.g., COLMAR web server), followed by simulation of the mass spectrum based on the determination of the masses (m/z) of all of the likely ions, possible adducts, and characteristic isotope distributions. The simulated mass spectrum is then compared with the experimental mass spectrum for the direct assignment of those signals that correspond to the metabolites identified in the NMR spectra. Since both experimental chemical shifts and accurate mass data of the same biological sample are analyzed, it increases the accuracy of metabolite identification compared with those methods based on NMR or MS alone. The NMR/MS Translator was applied to human urine and identified up to 88 metabolites with consistent signals in both
NMR and MS spectra. It should be noted that the NMR/MS Translator is designed to improve the accuracy and efficiency of known metabolite identification, which is limited by the metabolomics spectral database. To identify unknown metabolites that correspond
to unassigned signals in NMR or MS spectra without extensive isolation via time-consuming purification, a number of purification-free metabolite identification MS/NMR hybrid strategies have been developed. LC-MS-SPE-NMR has emerged as a very powerful yet less labor-intensive analytical platform to obtain spectral data for structural elucidation of metabolites.92-98 The integrated work platform enables scientists to selectively purify and concentrate metabolites of interest in complex biological samples, followed by structure elucidation based on 1D or 2D NMR spectra. This approach has been applied to a variety of complex biological samples to identify secondary metabolites.
With the integrated UHPLC-QTOF-MS/MS-SPE-NMR approach (a variant approach of
LC-MS-SPE-NMR),96 more than 100 previously known as well as unknown triterpene and flavonoid glycosides are identified in root tissue of Medicago truncatula. Recently,
the SUMMIT MS/NMR approach was developed to combine NMR and MS data to identify structures of interest in the mixture without depending on sample separation and experimental spectral databases.99 SUMMIT MS/NMR first extracts accurate masses from ultrahigh-resolution mass spectra and generates all possible structures based on the derived chemical formulas. The comparison of the predicted NMR spectra of all candidate structures with experimental NMR spectra of the same sample promotes the identification of unknown metabolites. Another novel approach, termed ISEL NMR/MS2 (Integrated Structure ELucidation by NMR/MS2), combines in silico MS/MS and NMR prediction to improve the identification accuracy of unknown metabolites.100 It starts with unknown features from the LC-MS spectra of the sample, followed by the comparison of experimental MS/MS and NMR spectra with predicted MS/MS and NMR spectra to identify the structure of unknowns that are not present in experimental NMR or MS metabolomics databases.
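The mass-simulation step of the NMR/MS Translator described above can be sketched as follows. This is a simplified illustration and not the published implementation: only four common singly charged adducts are considered, isotope distributions are omitted, and the function names and the 5 ppm default tolerance are assumptions. The monoisotopic masses are standard values rounded to eight decimals.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307400, "O": 15.99491462}

# Mass shift (Da) of a few singly charged ions relative to the neutral molecule.
# Cation shifts are the atom mass minus one electron mass.
ADDUCT_SHIFT = {
    "[M+H]+": 1.00727647,
    "[M+Na]+": 22.98922070,
    "[M+K]+": 38.96315791,
    "[M-H]-": -1.00727647,
}

def monoisotopic_mass(formula):
    """formula: dict element -> count, e.g. {"C": 5, "H": 11, "N": 1, "O": 2}."""
    return sum(MONO[element] * count for element, count in formula.items())

def adduct_mz(formula):
    """Expected m/z of each adduct (charge magnitude 1 assumed)."""
    mass = monoisotopic_mass(formula)
    return {name: mass + shift for name, shift in ADDUCT_SHIFT.items()}

def assign_peaks(expected, observed_mz, tol_ppm=5.0):
    """Return {adduct: observed m/z} for expected ions found within tol_ppm."""
    hits = {}
    for name, mz in expected.items():
        for obs in observed_mz:
            if abs(obs - mz) / mz * 1e6 <= tol_ppm:
                hits[name] = obs
                break
    return hits
```

For a metabolite identified by NMR, `adduct_mz` gives the ions to look for and `assign_peaks` reports which of them are present in the experimental peak list; agreement in both spectra supports the identification.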
Table 1.1. The advantages and limitations of NMR spectroscopy as an analytical tool for metabolomics research in comparison with mass spectrometry. Adapted with permission from reference (56).
Property | NMR | Mass spectrometry
Sensitivity | Low but can be improved with higher field strength, cryo and microprobes and dynamic nuclear polarization | High, but can suffer from ion suppression in complex and salty mixtures
Sample measurement | The entire sample analyzed in one measurement | Usually need different chromatography techniques for different classes of metabolites
Sample recovery | Nondestructive; sample can be recovered and stored for a long time, several analyses can be carried out on the same sample | Destructive technique but need a small amount of sample
Reproducibility | Very high | Moderate
Sample preparation | Need minimal sample preparation | More demanding; needs different LC columns and optimization of ionization conditions
Experimental time | 5-15 min for 1D proton NMR | Less than 3 min for direct infusion but more than 10 min for simplest analysis by GC MS or LC MS
Tissue samples | Yes, using HRMAS NMR tissue samples analyzed directly | No, requires tissue extraction; MS can be used to identify metabolites in tissues using MALDI-MS
Number of detectable metabolites in urine sample | 40–100 depending on spectral resolution | Can be several hundreds
Target analysis | Inferior for targeted analysis | Superior for targeted analysis
In-vivo studies | Yes, widely used for 1H magnetic resonance spectroscopy (and to a lesser degree 31P and 13C) | No, although suggestion that DESI may be a useful way to sample tissues minimally invasively during surgery
Molecular dynamic, molecular diffusion | NMR can be used to probe the molecular diffusion and dynamics | No
Direct structure analysis of unknowns | Yes | No
Figure 1.1. Correlation between main omics strategies in systems biology. Adapted with permission from reference (7).
Figure 1.2. 1D 1H NMR spectrum collected from NIST SRM-1950 human serum at a 700
MHz NMR spectrometer. Adapted with permission from reference (28).
Figure 1.3. Screenshot of COLMARm web server with the identification of isoleucine in human serum. Panel A: 2D HSQC, panel B: 2D TOCSY and panel C: 2D HSQC–TOCSY.
Green and red circles show experimental and database cross-peaks of isoleucine, respectively. Magenta circles indicate the expected isoleucine peaks from the COLMAR
TOCCATA database. The good agreement between green and red circles indicates that isoleucine is a confident candidate. This was further validated by the close match between magenta peaks with the experimental TOCSY cross-peaks observed in the TOCSY and
HSQC–TOCSY spectra. Adapted with permission from reference (85).
Accurate Identification of Unknown and Known
Metabolic Mixture Components by Combining 3D NMR with
Fourier Transform Ion Cyclotron Resonance Tandem Mass
Spectrometry
Metabolite identification in metabolomics samples is a key step that critically impacts downstream analysis. The recently introduced SUMMIT NMR/mass spectrometry (MS) hybrid approach allows the identification of the molecular structure of unknown metabolites, based on the combination of NMR, MS, and combinatorial cheminformatics.
In this chapter, we demonstrate the feasibility of the approach for an untargeted analysis of both a model mixture and E. coli cell lysate, based on 2D/3D NMR experiments in combination with Fourier transform ion cyclotron resonance MS and MS/MS data. For 19 of the 25 model metabolites SUMMIT yielded complete structures that matched those in the mixture independent of database information. Of those, 7 top-ranked structures matched those in the mixture, and 4 of those were further validated by positive ion
MS/MS. For 5 metabolites, correct molecular structural motifs were identified. For E. coli,
SUMMIT MS/NMR identified 20 previously known metabolites with 3 or more 1H spins independent of database information. Moreover, for 15 unknown metabolites, molecular structural fragments were determined consistent with their spin systems and chemical shifts. By providing structural information for entire metabolites or molecular fragments,
SUMMIT MS/NMR greatly assists the targeted or untargeted analysis of complex mixtures of unknown compounds.
2.1 Introduction
The large number of different metabolites found in living organisms offers important clues about the chemical underpinning of life, which is the subject of the field of metabolomics. It has been estimated that the human body alone contains over 100,000 different metabolites.101 Despite ongoing progress in the development of larger metabolomics databases, the identification of unknown metabolites remains a major bottleneck. Traditional approaches for natural product analysis, which are based on the complete physical separation of the compound of interest, are very time-consuming and, hence, impractical for routine and high-throughput applications. Alternatively, the two primary analytical techniques in metabolomics, namely mass spectrometry (MS)102-104 and nuclear magnetic resonance (NMR), have been applied separately.
Recently, new approaches have been proposed for the analysis of complex mixtures, based on combining MS and NMR. Finding ways to synergistically apply the two methods to the same problem has been a challenge due to the high complementarity of their information content. One strategy focuses on subsets of spectroscopic signals that are highly correlated or interdependent with respect to each other across a large number of samples and, hence, may stem from the same molecule.105-108 Such correlation analysis can be carried out either for NMR data or direct infusion MS data alone or between the two methods.109 Groups of signals that have been identified in this way can then be used to deduce information about the molecular structure. These statistical methods are applicable under two conditions, namely that a potentially large pool of samples is available so that statistically meaningful results can be obtained, and that the compound of interest shows relatively large modulations of its concentration relative to other metabolites so that the correlations between signals of the compound are sufficiently large. For applications with smaller sample pools, which can consist of as few as a single sample, alternative approaches have been proposed. For uniformly 13C-labeled mixtures, 2D 13C-13C TOCSY or INADEQUATE experiments permit the tracing of the backbone topology of individual metabolites, thereby providing useful information toward the elucidation of their structure.110, 111 3D-(H)CCH-TOCSY and COSY spectra of a 60% 13C-labeled rhododendron shrub were used together with quantum-chemical calculations to identify catalogued as well as several uncatalogued metabolites.112
We recently introduced approaches that synergistically use NMR and MS for a single sample of a complex mixture at 13C natural abundance for the validation of known compounds and the determination of unknowns. The first approach is the NMR/MS
Translator, which translates the metabolites identified from 1D or 2D NMR by database query to accurate masses that are then directly compared with MS of the same sample, thereby providing a methodical validation of metabolites by both NMR and MS.91, 113 The second approach, termed SUMMIT MS/NMR (for “Structure of Unknown Metabolomic
Mixture components by MS/NMR”),99 is more complex and more ambitious than some of the other approaches listed, because it aims at the determination of the structure of unknown metabolites without the use of NMR or MS databases. Based on accurate masses from MS, it generates a pool of possible molecular structures for which NMR chemical shifts are computed and compared directly with the 2D experimental chemical shift data of the mixture spectrum. As a proof-of-principle, it was demonstrated how SUMMIT could determine the correct structures for a finite, well-defined subset of metabolites previously known to exist in E. coli cell lysate.99
Here, we generalize SUMMIT for the untargeted identification of both known and unknown metabolites by combining ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS)114 to assign unique elemental compositions
(CcHhNnOoPp…) with 3D NMR complemented by 2D NMR experiments as the primary source of information for spin-system identification and validation by tandem MS
(MS/MS). The NMR information is first queried against the COLMAR NMR database,44, 85,
115 by use of the COLMARm86 database to identify a maximal number of known metabolites and thereby assign as many cross-peaks as possible. This step is then followed
by SUMMIT, combining MS with NMR data, for the identification of unknowns from the remaining cross-peaks. The approach is first demonstrated here for a model mixture consisting of 25 metabolites. Compared to the original SUMMIT experiments, for which mass measurement was based on time-of-flight mass analysis with an average root-mean-square (rms) mass error of ~5 ppm (thus limited to metabolites of up to ~300 Da in mass), the present 9.4 T FT-ICR mass measurements achieve 25 times higher mass accuracy (~200 ppb rms mass error), and thus allow a more reliable determination of elemental composition for metabolites up to at least 1 kDa in mass.116 By combining the results with
MS/MS analysis, this approach can provide additional disambiguation of the top hits produced by SUMMIT.
2.2 Materials and Methods
2.2.1 Identification of Unique Elemental Composition from Ultrahigh-Resolution FT-ICR Mass Spectra
The SUMMIT approach begins with the identification of metabolite elemental composition.
It has been shown that mass measurement accuracy of ~200 ppb enables identification of unique elemental composition for organic molecules up to ~1 kDa in mass.117 Such mass measurement accuracy for complex mixtures is routinely achieved by 9.4 T FT-ICR MS: e.g., resolution and elemental composition assignment for more than 125,000 peaks in a single mass spectrum of a volcanic asphalt sample.118 Although limited structural information may be derived from elemental composition alone (e.g., Kendrick mass,119 van Krevelen diagrams,120 double bond equivalents vs. carbon number for individual heteroatom classes,121 etc.), the most definitive structural information relies on NMR (see below).
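The effect of mass accuracy on elemental composition assignment can be explored with a small enumeration sketch. It is a deliberately naive illustration, not the procedure used in this work: it considers only C, H, N and O, applies no chemical plausibility rules (valence, nitrogen rule, isotope patterns), and the function name and the upper bounds on atom counts are assumptions.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
M_C, M_H, M_N, M_O = 12.0, 1.00782503, 14.00307400, 15.99491462

def candidate_formulas(neutral_mass, tol_ppm, max_h=120, max_n=10, max_o=20):
    """Enumerate (C, H, N, O) counts whose monoisotopic mass lies within
    tol_ppm of neutral_mass. No chemical validity filtering is applied."""
    tol = neutral_mass * tol_ppm * 1e-6   # absolute tolerance in Da
    hits = []
    for c in range(int(neutral_mass // M_C) + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                rest = neutral_mass - c * M_C - n * M_N - o * M_O
                if rest < -tol:
                    break                 # adding more oxygen only overshoots
                h = round(rest / M_H)     # hydrogen count closest to the remainder
                if 0 <= h <= max_h and abs(rest - h * M_H) <= tol:
                    hits.append((c, h, n, o))
    return hits
```

Calling `candidate_formulas(m, 5.0)` and `candidate_formulas(m, 0.2)` for the same mass shows how the candidate list shrinks when the tolerance is tightened from ~5 ppm to ~200 ppb; the tighter list is always a subset of the wider one, and the difference is expected to grow with increasing mass.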
2.2.2 Identification of Individual Spin Systems of a Mixture from a 3D 13C-1H HSQC-TOCSY NMR Spectrum
The NMR portion of SUMMIT focuses on the identification of spin systems of individual compounds based on multidimensional 13C-1H and 1H-1H cross-peaks. In particular, the
2D 13C-1H HSQC experiment provides 13C-1H chemical shift correlations between directly bonded 1H and 13C nuclei and 2D 13C-1H HSQC-TOCSY provides 13C-1H and 1H-1H bond connectivity information. If peak overlaps are absent or rare, the combination of 2D HSQC and 2D HSQC-TOCSY enables the unambiguous extraction of the spin systems of the various mixture compounds. However, in practice the presence of peak overlap will interfere with the accuracy and reliability of spin system extraction. Because the 3D 13C-
1H HSQC-TOCSY spectrum is much less prone to peak overlap than its 2D variant, we extract the spin systems directly from the 3D 13C-1H HSQC-TOCSY NMR spectrum. The
3D 13C-1H HSQC-TOCSY experiment provides 13C(ω1), 1H(ω2), 1H(ω3) correlations and resolves overlap of cross-peaks in the 2D 13C(ω1)-1H(ω2) plane by spreading the resonances along the orthogonal 1H(ω3) dimension, which is the direct 1H detection dimension.122, 123
Comparison of the 1H-1H correlations along ω3 for each pair of 13C-1H cross-peaks enables one to determine whether or not two 13C-1H cross-peaks belong to the same molecule, thereby drastically reducing the possibility of false spin system identification. In practice, the main concern is the relatively low resolution along the two indirect dimensions that provide the 13C and 1H correlation information, which is accepted in order to keep the measurement time reasonably short. This problem can be addressed in part by measuring an additional high-resolution 2D 13C-1H HSQC spectrum to complement the 13C and 1H correlation information from the 3D experiment, as done here, or by the use of non-uniform sampling methods.123, 124
The 13C(ω1)−1H(ω2) plane of the 3D HSQC-TOCSY spectrum depicts single bond 13C-1H correlations of all molecules in the mixture. To distinguish 13C(ω1)−1H(ω2) cross-peaks from different molecules and be able to cluster these cross-peaks into individual spin systems, the analysis of 1H-1H TOCSY transfers along the ω3 dimension permits one to correlate pairs of 13C(ω1)-1H(ω2) cross-peaks and assign them to the same spin system. A prerequisite is that the cross-peaks share the same cross-peaks along ω3. Specifically, such a pair of 2D cross-peaks, (ω1′, ω2′) and (ω1″, ω2″), must then share 3D cross-peaks at positions (ω1, ω2, ω3) = (ω1′, ω2′, ω2′), (ω1′, ω2′, ω2″), (ω1″, ω2″, ω2″), (ω1″, ω2″, ω2′). The goal is to find all pairs of 2D cross-peaks that are connected in this manner. These cross-peaks can be considered as edges of a mathematical graph in which the nodes correspond to directly bonded 13C-1H spin pairs. Such a graph can then be analyzed in terms of a “maximal clique” analysis, which we recently developed to automatically extract all possible spin systems from TOCSY-type spectra.125
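The graph construction and clique search described above can be sketched in a few lines. This is a generic illustration using the textbook Bron–Kerbosch enumeration, which is not necessarily the algorithm of reference 125; the function names and the peak-matching tolerances are assumptions.

```python
def has_3d_peak(peaks3d, c, h_indirect, h_direct, tol_c=0.3, tol_h=0.03):
    """True if a 3D peak exists near (13C, indirect 1H, direct 1H), all in ppm."""
    return any(abs(p[0] - c) <= tol_c
               and abs(p[1] - h_indirect) <= tol_h
               and abs(p[2] - h_direct) <= tol_h
               for p in peaks3d)

def build_graph(hsqc, peaks3d):
    """Nodes are 2D (13C, 1H) cross-peaks; two nodes are connected only if all
    four 3D cross-peaks required by the text are observed."""
    adj = {i: set() for i in range(len(hsqc))}
    for i, (c1, h1) in enumerate(hsqc):
        for j in range(i + 1, len(hsqc)):
            c2, h2 = hsqc[j]
            needed = [(c1, h1, h1), (c1, h1, h2), (c2, h2, h2), (c2, h2, h1)]
            if all(has_3d_peak(peaks3d, *q) for q in needed):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration of all maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

Each clique with two or more nodes is a candidate spin system; single-node cliques correspond to isolated 13C-1H pairs, which cannot be validated through shared TOCSY cross-peaks.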
Figure 2.1 depicts a schematic diagram for the generation of spin systems from the
3D 13C-1H HSQC-TOCSY NMR spectrum. It should be noted that for spin systems with only one 13C-1H pair, the method does not work, because the spin system is fully characterized by a single cross-peak in the 3D spectrum. Therefore, peak assignments of one-spin systems need to be performed manually. Similarly, because 2-spin systems contain no redundant information, they are more prone to false positives and were considered only for the model mixture but not for E. coli cell extract.
After spin systems of individual compounds are extracted, they are refined in order to minimize the occurrence of false positives. Spin system refinement includes three consecutive steps. First, extracted spin systems are validated by visually checking 2D 13C-
1H HSQC, 2D 1H-1H TOCSY and 2D 13C-1H HSQC-TOCSY NMR spectra. If the extracted spin system was incomplete, expected peaks that are unambiguously observed by visual inspection, but that were missed by the automated procedure, are manually added until the spin system is complete. Second, 1H peak doublets and nearly degenerate 1H resonances are combined into a single chemical shift. For example, CH2 groups can have two separate proton chemical shifts belonging to the same 13C, but sometimes it is difficult to determine whether two separate peaks stem from a single CH2 group or from two
separate CH groups. Therefore, those spin systems that contain two cross-peaks with the same 13C frequency in the HSQC are merged into a single cross-peak (with a chemical shift taken as the mean of the two proton resonances) for the generation of an alternative spin system candidate, which is added to the list of spin systems. Third, potential extra spins are manually identified and added after comparison of 1D 1H traces along ω3. For example, for an automatically generated 3-spin system, if an additional resonance shows unambiguous connectivities to all three spins, but has not yet been included in the clique, then this spin is manually added, resulting in a new 4-spin system. An example for the refinement of spin systems is provided in Figure 2.2.
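The second refinement step, in which two cross-peaks at the same 13C frequency are merged to generate an alternative spin system, can be illustrated as follows. The function name and the 13C tolerance used to decide that two frequencies are "the same" are assumptions for this sketch.

```python
def merged_alternatives(spin_system, tol_c=0.1):
    """Generate alternative spin systems by merging pairs of cross-peaks that
    share a 13C frequency (within tol_c ppm, an assumed threshold).

    spin_system: list of (13C ppm, 1H ppm) tuples.
    Returns one alternative per mergeable pair; the original is left unchanged.
    """
    alternatives = []
    n = len(spin_system)
    for i in range(n):
        for j in range(i + 1, n):
            (c1, h1), (c2, h2) = spin_system[i], spin_system[j]
            if abs(c1 - c2) <= tol_c:
                # Single cross-peak at the mean of the two proton resonances
                merged = ((c1 + c2) / 2.0, (h1 + h2) / 2.0)
                rest = [p for k, p in enumerate(spin_system) if k not in (i, j)]
                alternatives.append(sorted([merged] + rest))
    return alternatives
```

Both the original spin system and each returned alternative would be kept in the list of candidates, so that a CH2 group and two separate CH groups are both considered in the subsequent matching.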
2.2.3 Structure Manifold Generation and 2D HSQC NMR Spectra Prediction
Each accurate mass derived from an experimental FT-ICR mass spectrum was compared to the METLIN database to identify the closest matching molecular formula (note that
METLIN was used only to search for molecular formulas that are consistent with the FT-
ICR-based mass information, but not for molecular structures). Because each molecular formula could correspond to any of several isomers, we then searched the ChemSpider database126,127 for all structures corresponding to a given molecular formula.
For all molecular structures, 2D 13C-1H HSQC spectra are predicted by use of the empirical chemical shift predictor by Modgraph implemented in MestReNova 10.0.1
(Mestrelab Research). HSQC prediction for each molecule takes about 3-10 s with a desktop computer. The 13C chemical shift prediction utilizes a HOSE code algorithm, whereas the 1H chemical shift prediction is based on functional groups which were individually parametrized.128 Because NMR chemical shift prediction plays a critical role in SUMMIT for identifying the correct compound from a large compound pool, we examined the prediction accuracy for amino acids, organic acids, and carbohydrates contained in a 25-compound model mixture. We compared the predicted NMR chemical shifts with the NMR chemical shifts contained in the COLMAR database.44, 85, 115 For a total of 179 13C-1H moieties, the average prediction errors for carbon and proton chemical shifts are 2.903 ppm and 0.292 ppm, respectively. The comparison between predicted and experimental chemical shifts is shown in Figure 2.3.
2.2.4 Weighted Matching between Experimental and Predicted NMR Spectra
After 2D 13C-1H HSQC NMR spectra were predicted for all chemical compound candidates, the weighted matching algorithm by Munkres was applied to match the 2D
13C-1H HSQC spectra extracted for individual mixture compounds with the predicted 2D
13C-1H HSQC spectra.129 The use of a weighted matching algorithm is motivated by the goal to find the closest matching peak pairs for each experimental spin system to each predicted spin system, provided that the total number of spins is the same. The matching results are ranked according to the chemical shift root-mean-square deviation (RMSD)
(Equation 1) between the experimental and predicted chemical shifts:
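\[
\mathrm{RMSD} = \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left[ \left( C_{i,\mathrm{exp}} - C_{i,\mathrm{pred}} \right)^{2} + \left( \left( H_{i,\mathrm{exp}} - H_{i,\mathrm{pred}} \right) \times 10 \right)^{2} \right] \right\}^{1/2} \qquad (1)
\]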
in which Xexp are the experimental chemical shifts, Xpred are the predicted chemical shifts, and N is the number of peaks in the spin system. A scaling factor of 10 is used to normalize the effects of 13C and 1H chemical shifts on the overall RMSD by correcting for the different chemical shift ranges of these nuclei. Table 2.1 shows the matching result for valine in the 25-compound model mixture.
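The matching and scoring step can be sketched as follows. To keep the example self-contained, the optimal assignment is found by exhaustive search over all peak permutations rather than by the Munkres algorithm used in this work; both return the assignment with the minimum summed squared deviation, but the exhaustive search is only practical for small spin systems. Function names and the example values in the usage note are illustrative assumptions.

```python
from itertools import permutations
from math import sqrt

def rmsd(exp, pred):
    """Equation 1 for two equally long, already paired lists of (13C, 1H) shifts."""
    n = len(exp)
    total = sum((ce - cp) ** 2 + ((he - hp) * 10.0) ** 2
                for (ce, he), (cp, hp) in zip(exp, pred))
    return sqrt(total / (2 * n))

def best_match(exp, pred):
    """Lowest RMSD over all one-to-one assignments of predicted to experimental
    peaks; returns None if the number of spins differs."""
    if len(exp) != len(pred):
        return None
    best = min(permutations(pred), key=lambda perm: rmsd(exp, perm))
    return rmsd(exp, best), list(best)

def rank_candidates(exp, candidates):
    """candidates: dict name -> predicted peak list. Returns (RMSD, name) pairs
    sorted by increasing RMSD, skipping candidates with a different spin count."""
    scored = []
    for name, pred in candidates.items():
        match = best_match(exp, pred)
        if match is not None:
            scored.append((match[0], name))
    return sorted(scored)
```

For instance, an experimental two-spin system [(20.0, 1.0), (30.0, 2.0)] matched against the prediction [(31.0, 2.0), (20.0, 1.1)] is paired in swapped order and gives an RMSD of about 0.71 ppm, where the 0.1 ppm proton deviation contributes as much as the 1 ppm carbon deviation because of the scaling factor of 10.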
2.2.5 Molecular Structure Motif Identification of Compounds
Matching and rank-ordering of the predicted NMR spectra for a potentially large number of candidate compounds derived from FT-ICR MS against an experimentally extracted spin system generally yielded a large number of hits with a reasonably low chemical shift RMSD cutoff (< 5 ppm). Although candidate compounds with lower RMSDs generally are more likely to be the true compound, it cannot be excluded that the true compound has a lower rank due to the limited NMR prediction accuracy or molecular structure degeneracy. Therefore, to simplify and speed up the identification of the true compound among the hundreds or even thousands of hits, we propose the following approach referred to as “molecular structural motif identification of chemical compounds” or
MSMIC. Because the experimentally extracted spin system corresponds to a structural motif consisting only of carbons and protons of the true compound, the goal of MSMIC is to find all possible molecular structural motifs that correspond to the experimentally extracted spin system among all of the compound candidates. The common molecular structural motif among different compounds will generate similar chemical shifts because additional atoms and functional groups typically have only little influence on the NMR chemical shift prediction of spins that are part of the molecular structure motif. For instance, L-glutamine and glutathione share the common molecular structural motif
(HOOCCH(NH2)CH2CH2CONH-) and hence have similar chemical shifts for this motif
(Figure 2.4). All hits (compound candidates) are sorted into groups according to their
MSMICs by use of the nearest neighbor heavy atom for discrimination between different
MSMICs. In a next step, molecular representatives of all high scoring MSMICs are selected for NMR experiments and used for quantum chemical calculations of their chemical shifts for the more accurate ranking of MSMICs. The best matching molecules are then either purchased or synthesized for NMR spiking experiments. This approach was implemented by use of in-house python scripts. Examples of MSMICs will be discussed below. In chemometrics, the maximum common substructure (MCS) approach is very efficient in identifying local structural similarities among large structural manifolds (>750,000).130, 131
Unfortunately, MCS is not able to identify the common motif that corresponds to a given spin system because it is not based on substructures connected by scalar J-couplings.
2.2.6 Sample Preparation
A 25-compound metabolite mixture contained adenosine, alanine, arginine, carnitine, citrulline, cysteine, fructose, galactose, glucose, glutamine, histidine, inosine, isoleucine, lactose, leucine, lysine, methionine, ornithine, proline, ribose, serine, shikimate, sucrose, threonine, and valine. For the NMR experiments, the final concentration of each metabolite was 1 mM in 600 μL D2O with 20 mM phosphate buffer and 0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing. The same 25 compounds were used for the MS sample, which was prepared in 50%/50% (v/v)
ACN/H2O solution with 0.1% formic acid. The final concentration of each metabolite for
MS was 10 μM. All chemicals and solvents were obtained from Sigma-Aldrich and Fisher
Scientific Corporation.
E. coli BL21(DE3) cells were cultured at 37 °C with shaking at 250 rpm in M9 minimal medium with glucose (natural abundance, 5 g/L) added as the sole carbon source. One liter of culture at OD 1 was centrifuged at 5000 × g for 20 min at 4 °C, and the cell pellet was resuspended in 50 mL of 50 mM phosphate buffer at pH 7.0. The cell suspension was then centrifuged again to collect the cell pellet. The cell pellet was resuspended in 10 mL of ice-cold water and freeze-thawed three times. The sample was centrifuged at 20,000 × g at 4 °C for 15 min to remove cell debris. Prechilled methanol and chloroform were sequentially added to the supernatant under vigorous vortexing at an H2O/methanol/chloroform ratio of 1:1:1 (v/v/v). The mixture was then left at −20 °C overnight for phase separation. Next, it was centrifuged at 4000 × g for 20 min at 4 °C, and the clear upper hydrophilic phase was collected and subjected to rotary evaporation to reduce the methanol content. Finally, the sample was lyophilized. The dry sample was then divided into two parts: one for MS and one for NMR analysis. The NMR sample was prepared by dissolving the material in 200 μL of D2O with 20 mM phosphate buffer and
0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing, and then transferred to a 3 mm NMR tube. 1.5-2 mg of E. coli cell lysate was dissolved in
200 μL of H2O, and 10 μL of that was diluted 10-fold with 50%/50% (v/v) ACN/H2O with
0.1% formic acid. The resulting solution was centrifuged at 13,000 rpm at 4 °C for 5 min, and the supernatant was used for direct-infusion MS.
2.2.7 NMR Experiments and Data Processing
2D 13C-1H HSQC, 2D 1H-1H TOCSY, 2D 13C-1H HSQC-TOCSY, and 3D 13C-1H HSQC-
TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected.
All NMR spectra of the 25-compound model mixture and E. coli cell lysate were acquired with a Bruker AVANCE solution-state NMR spectrometer equipped with a cryogenically cooled TCI probe at 850 MHz proton frequency at 298 K. The 2D 13C−1H HSQC spectra of the 25-compound model mixture and E. coli cell lysate were collected with 256 t1 and 1024 t2 complex points. The measurement time was ~2 h. The spectral width along the indirect and the direct dimensions was 34205.6 and 10204.1 Hz. The number of acquisitions per t1 increment was 8. The transmitter frequency offset was 80 ppm in the 13C dimension and
4.7 ppm in the 1H dimension. The 2D 1H−1H TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 512 t1 and 1024 t2 complex points. The measurement time was ~4 h. The spectral width along the indirect and direct dimensions was set to 10204.1 Hz. The number of acquisitions per t1 increment was 8. The transmitter frequency offset was 4.7 ppm in both 1H dimensions. The TOCSY mixing time was set to
120 ms after optimization of the isotropic mixing time. 2D 13C−1H HSQC-TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 512 t1 and
2048 t2 complex points. The measurement time was ~8.5 h. The spectral width along the indirect and the direct dimensions was 34205.6 and 10204.1 Hz. The TOCSY mixing time for 2D 13C−1H HSQC-TOCSY was set to 120 ms. The number of acquisitions per t1 increment was 16. The transmitter frequency offset was 80 ppm in the 13C dimension and
4.7 ppm in the 1H dimension. 3D 13C−1H HSQC-TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 64 t1, 128 t2, and 2048 t3 complex points.
The measurement time was ~113 h. The spectral width along the indirect and the direct dimensions was 34205.6, 10204.1, and 10204.1 Hz. The number of scans per t1 increment was 8. The transmitter frequency offset was 80 ppm in the 13C dimension and 4.7 ppm in the 1H dimension. The data were zero-filled two-fold along the 13C dimension, Fourier transformed, and phase- and baseline-corrected by use of NMRPipe.132 Sparky was used for peak-picking in all spectra.133 All spectra were converted to MATLAB format for maximal clique analysis.
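As a quick consistency check on the acquisition parameters, the spectral widths quoted in Hz can be converted to ppm. The sketch below assumes a nominal 850.0 MHz proton frequency and an approximate 13C/1H frequency ratio of 0.25145; the exact spectrometer frequencies are not given in the text, so the results are approximate.

```python
# Assumed values (not stated in the text): nominal 1H frequency and 13C/1H ratio.
PROTON_MHZ = 850.0
RATIO_13C_1H = 0.25145
CARBON_MHZ = PROTON_MHZ * RATIO_13C_1H

def sweep_width_ppm(sw_hz, carrier_mhz):
    """Convert a spectral width in Hz to ppm (Hz divided by MHz)."""
    return sw_hz / carrier_mhz

def window(center_ppm, width_ppm):
    """Return the (low, high) ppm limits covered around the transmitter offset."""
    return (center_ppm - width_ppm / 2, center_ppm + width_ppm / 2)

sw_1h_ppm = sweep_width_ppm(10204.1, PROTON_MHZ)    # about 12 ppm
sw_13c_ppm = sweep_width_ppm(34205.6, CARBON_MHZ)   # about 160 ppm

print(window(4.7, sw_1h_ppm))    # 1H window centered at 4.7 ppm
print(window(80.0, sw_13c_ppm))  # 13C window centered at 80 ppm
```

Under these assumptions the 1H dimension spans roughly 12 ppm around 4.7 ppm and the 13C dimension roughly 160 ppm around 80 ppm.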
2.2.8 FT-ICR MS Experiments and Processing
A custom-built 9.4 T Fourier transform ion cyclotron resonance mass spectrometer was used for sample analysis.116 A 25-metabolite mixture (10 µM) and an E. coli extract sample (in 50% ACN, 50% water, and 0.1% formic acid) were ionized by positive or negative nanoelectrospray at a flow rate of 0.3 µL/min and accumulated in an external linear quadrupole ion trap. Ions were then transferred through an octopole ion guide to the ICR cell for broadband and tandem mass spectra acquisition. The transfer time was set to 0.35 ms for lower mass range analysis (m/z 107-270) and 0.55 ms for higher mass range analysis
(m/z 265-400). MS/MS fragmentations for glutamine, lysine, arginine, and ornithine were performed by infrared multiphoton dissociation (IRMPD), and precursor ions were isolated externally with a quadrupole mass filter and internally by stored waveform inverse Fourier transform (SWIFT) excitation in the ICR cell.134 IRMPD was performed with a 40 W, 10.6 µm CO2 laser (Synrad, Mukilteo, WA, USA), fitted with a 2.5× beam expander. The laser beam was directed to the center of the cell through an off-axis BaF2 window. Photon irradiation was for 500 ms at 40–90% laser power (16–36 W). MS/MS fragmentation for valine was performed with a Velos Pro ion trap mass spectrometer with normalized collisional energy 22 (ThermoFisher parameter setting). Broadband and tandem mass spectra were acquired from m/z 107–2,000, with a 6 s time-domain acquisition period. 500 time-domain transients were digitized and signal-averaged. All data were stored as .DAT files. All time-domain data were Hanning apodized, zero-filled, and fast Fourier transformed to yield magnitude-mode mass spectra. Frequency-to-m/z conversion was performed with a two-term calibration equation.135, 136 Mass calibration was performed by dual spray spanning m/z 112-410. For positive ESI, the custom-prepared standard mix included cytosine, caffeine, biotin, adenosine, Val-Ala-Pro-Gly, and des-Tyr1-methionine enkephalinamide. For negative ESI, Agilent ESI-L Low
Concentration Tuning Mix (Agilent, Santa Clara, CA) was used for calibration. After dual spray calibration, high magnitude peaks (6 peaks for ESI positive mode and 5 peaks for
ESI negative mode) in m/z range 112-410 were chosen as internal standards and used for calibration during sample direct infusion into the mass spectrometer. Data were manually interpreted by use of Predator Analysis (version 4.1.9) software.
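The frequency-to-m/z conversion can be illustrated with a short sketch. The text does not write out the calibration equation, so the widely used two-term form m/z = A/f + B/f² is assumed here. In practice both coefficients are fitted to calibrant ions; in this sketch A is simply set to its ideal value for an unperturbed 9.4 T field and B defaults to zero.

```python
import math

E_CHARGE = 1.602176634e-19    # elementary charge, C
AMU = 1.66053906660e-27       # unified atomic mass unit, kg
B_FIELD_T = 9.4               # magnet field strength from the text, T

# Ideal first calibration coefficient: f * (m/z) for an unperturbed ion, in Hz * Da.
A_IDEAL = E_CHARGE * B_FIELD_T / (2 * math.pi * AMU)

def cyclotron_frequency(mz):
    """Unperturbed cyclotron frequency (Hz) of an ion with the given m/z."""
    return A_IDEAL / mz

def mz_from_frequency(f_hz, a=A_IDEAL, b=0.0):
    """Assumed two-term calibration: m/z = a/f + b/f**2 (b would be fitted in practice)."""
    return a / f_hz + b / f_hz ** 2

for mz in (107.0, 400.0):
    print(mz, round(cyclotron_frequency(mz) / 1e3, 1), "kHz")
```

With these assumptions, ions across the m/z 107–400 window orbit at roughly 1.35 MHz down to about 0.36 MHz.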
2.3 Results and Discussion
2.3.1 Demonstration of SUMMIT MS/NMR to Identify Metabolites
First, we illustrate SUMMIT MS/NMR for the identification of metabolites based on the example of a spin system extracted from the 3D HSQC-TOCSY NMR spectrum of the 25-compound model mixture. The spin system with chemical shifts (δH, δC) of (3.597, 63.000), (2.267, 31.690), (1.034, 20.560), and (0.983, 19.230) ppm was used to match the predicted 2D HSQC NMR spectra of 57,881 compounds derived from elemental compositions corresponding to all above-threshold FT-ICR mass spectral peaks (see below). The spin-spin connectivity information is manifested in both 2D HSQC-TOCSY and 3D HSQC-TOCSY, which confirms that the four peaks indeed belong to the same spin system (Figure 2.5). The experimental chemical shifts were matched against the predicted chemical shifts of the 57,881 compounds by use of the weighted matching algorithm. Based on an RMSD cutoff of 5.0 ppm, 122 RMSD rank-ordered hits were returned. The top hit was valine with an RMSD of 0.93 ppm. Because valine was known to be one of the 25 compounds in the model mixture (which was independently confirmed by querying the chemical shifts of this spin system against the COLMAR 1H(13C)-TOCCATA database,44 yielding a low RMSD of 0.08 ppm for the database entry of valine), SUMMIT MS/NMR was successful in identifying (and verifying) this mixture component without depending on either a spectral NMR or MS database.
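The valine score can be reproduced from the shifts listed in Table 2.1. The sketch below is an illustration under stated assumptions, not the authors' implementation: proton differences are weighted by a factor of 10, the squared deviations are averaged over all 2N shifts, and the peak assignment is found by brute force over permutations. The weighting factor and normalization are inferred because they reproduce the tabulated RMSDs; they are not stated in this section.

```python
from itertools import permutations
from math import sqrt

# (1H, 13C) chemical shifts in ppm for valine, taken from Table 2.1.
predicted = [(0.910, 19.32), (0.960, 19.32), (2.220, 31.07), (3.440, 61.90)]
experimental = [(1.034, 20.56), (0.983, 19.23), (2.267, 31.69), (3.597, 63.00)]

H_WEIGHT = 10.0  # assumed scaling of 1H differences relative to 13C

def weighted_rmsd(pairs):
    """RMSD over 2N shifts, with 1H deviations scaled by H_WEIGHT."""
    total = 0.0
    for (h_pred, c_pred), (h_exp, c_exp) in pairs:
        total += (H_WEIGHT * (h_exp - h_pred)) ** 2 + (c_exp - c_pred) ** 2
    return sqrt(total / (2 * len(pairs)))

def best_match(pred, expt):
    """Lowest RMSD over all one-to-one assignments of predicted to experimental peaks."""
    return min(weighted_rmsd(list(zip(perm, expt))) for perm in permutations(pred))

best = best_match(predicted, experimental)
row_by_row = weighted_rmsd(list(zip(predicted, experimental)))
print(round(best, 2), round(row_by_row, 2))
```

Brute-force permutation is only practical for small spin systems; an assignment solver such as the Hungarian algorithm would scale better for larger ones.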
How should one proceed if the true compound does not exist in an NMR metabolomics database for validation, and how can one verify the true compound among dozens or hundreds of candidates returned by SUMMIT MS/NMR? Here, the MSMIC approach described above comes into play. By first identifying the molecular structural motif of the true compound, it helps elucidate the complete structure of the true compound in a second step. Again, the spin system that was eventually identified as valine is used as an example to demonstrate the strategy. For an unknown spin system with hundreds of compound candidates, the hits with lower RMSDs are more likely to correspond to the true compound. To verify the identity of the true compound beyond the limited NMR database information, quantum-chemical calculations followed by an NMR spiking experiment were adopted as the “gold standard” for compound verification.58 Before proceeding to verify the true compound by labor-intensive NMR spiking experiments for the 122 hits, the molecular structural motifs that reflect the possible substructures of the unknown spin systems are extracted, which both simplifies and speeds up the verification process. Examples of molecular structural motifs identified among the 122 hits are shown in Figure 2.6. As compounds with a common motif are expected to have similar chemical shifts, the next step is to select one or two compounds with low RMSD by quantum-chemical calculation in each cluster and perform NMR spiking experiments. Hence, the 122 initial hits were further reduced to fewer than 10 compound candidates for the verification of the molecular structural motif of the true compound. After confirmation of the MSMIC that belongs to an unknown spin system, further validation steps are performed to verify the entire compound as explained below.
2.3.2 Application to a 25-Compound Model Mixture
NMR and FT-ICR MS Data-Derived Information
Based on the maximal clique approach to automatically extract spin systems (see Methods section), 49 spin systems were extracted from the 3D 13C-1H HSQC-TOCSY NMR spectrum of the 25-compound model mixture. All extracted spin systems included 2 or more spins
(one-spin systems were not included, see Methods section). 26 spin systems were identified by SUMMIT MS/NMR and verified based on COLMAR 1H(13C)-TOCCATA database query; 2 unknown spin systems could not be annotated and were classified as false positives, because each resonance in these spin systems belongs to other spin systems as determined by visual inspection of the 2D TOCSY and HSQC-TOCSY spectra. 21 spin systems were identified as partially or fully overlapping spin systems after spin system refinement as described in the Methods section. 80 neutral molecular formulas (rms mass error 0.07 ppm) were obtained from the 100 highest magnitude FT-ICR mass spectral peaks by identifying elemental compositions matched to the METLIN database (see above) with <0.15 ppm mass error threshold (Figure 2.7).127 (Peaks not originating from the 25 metabolites likely belong to impurities in the purchased compounds.) For each mass peak, [M+H]+, [M+Na]+, [M+K]+, [M+ACN+H]+, [M+ACN+Na]+ and [M+2Na-H]+ (in which M is the metabolite or its derivative) were considered as possible adducts. There were 57,881 molecular structures corresponding to the 80 molecular formulas according to the ChemSpider database (presently containing over 58 million molecular structures).126
By comparison, if the mass error threshold was set to 1.0 ppm, 92 distinct molecular formulas were obtained with rms mass error 0.17 ppm, corresponding to 68,173 structures.
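The adduct handling described above amounts to subtracting each adduct's mass shift from an observed m/z to obtain candidate neutral masses, which are then searched against a formula database. The following sketch computes the shifts from standard monoisotopic masses; the function name and the example m/z (the theoretical [M+H]+ of valine) are illustrative choices, not values from the study.

```python
# Standard monoisotopic masses (Da); ELECTRON is the electron rest mass.
H, C, N = 1.00782503, 12.0, 14.00307401
NA, K = 22.98976928, 38.96370649
ELECTRON = 0.00054858
ACN = 2 * C + 3 * H + N  # acetonitrile, C2H3N

# Mass added to the neutral molecule M for each singly charged positive adduct.
ADDUCT_SHIFTS = {
    "[M+H]+": H - ELECTRON,
    "[M+Na]+": NA - ELECTRON,
    "[M+K]+": K - ELECTRON,
    "[M+ACN+H]+": ACN + H - ELECTRON,
    "[M+ACN+Na]+": ACN + NA - ELECTRON,
    "[M+2Na-H]+": 2 * NA - H - ELECTRON,
}

def candidate_neutral_masses(observed_mz):
    """Return the neutral monoisotopic mass implied by each adduct hypothesis."""
    return {adduct: observed_mz - shift for adduct, shift in ADDUCT_SHIFTS.items()}

masses = candidate_neutral_masses(118.086255)  # theoretical [M+H]+ of valine, C5H11NO2
for adduct, mass in masses.items():
    print(f"{adduct:13s} {mass:.6f}")
```

Each candidate neutral mass would then be matched to elemental compositions within the chosen ppm tolerance.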
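In mixture analysis by MS, it is possible that intra- and inter-dimers and multimers may be generated. For example, ESI MS can yield both [M+H]+ and [2M+2H]2+ ions, which have the same monoisotopic mass-to-charge ratio, i.e., $[{}^{12}\mathrm{C}_{c}\,{}^{1}\mathrm{H}_{h+1}\,{}^{14}\mathrm{N}_{n}\,{}^{16}\mathrm{O}_{o}\,{}^{31}\mathrm{P}_{p}\,{}^{32}\mathrm{S}_{s}]^{+}$ and $[{}^{12}\mathrm{C}_{2c}\,{}^{1}\mathrm{H}_{2h+2}\,{}^{14}\mathrm{N}_{2n}\,{}^{16}\mathrm{O}_{2o}\,{}^{31}\mathrm{P}_{2p}\,{}^{32}\mathrm{S}_{2s}]^{2+}$. However, the dimer is readily recognized by an m/z separation of 0.5 between the $[{}^{12}\mathrm{C}_{2c}]^{2+}$ peak and its $[{}^{12}\mathrm{C}_{2c-1}\,{}^{13}\mathrm{C}_{1}]^{2+}$ isotope peak. We did not observe any multimers of the reported metabolites.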
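The dimer check rests on simple arithmetic: the spacing between the monoisotopic peak and its single-13C isotope peak equals the 13C−12C mass difference divided by the charge. A minimal sketch, with illustrative function names:

```python
C13_MINUS_C12 = 1.0033548  # mass difference between 13C and 12C, Da

def isotope_spacing(charge):
    """Expected m/z spacing between the monoisotopic and single-13C peaks."""
    return C13_MINUS_C12 / abs(charge)

def infer_charge(observed_spacing):
    """Infer the charge state from an observed isotope spacing in m/z."""
    return round(C13_MINUS_C12 / observed_spacing)

print(isotope_spacing(1))  # about 1.0 for [M+H]+
print(isotope_spacing(2))  # about 0.5 for [2M+2H]2+
```

A spacing near 0.5 therefore flags a doubly charged species even when its monoisotopic m/z coincides with that of the singly charged monomer.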
Identification of Metabolites in the 25-Compound Model Mixture
After application of the weighted matching algorithm to match the 26 spin systems with the predicted 2D NMR HSQC spectra, the hits for each spin system were sorted according to the best matching chemical shift RMSD. Figure 2.8 shows the weighted matching scheme for the identification of metabolites by SUMMIT MS/NMR. 7 mixture compounds were returned as the top hits, namely valine, lysine, glutamine, isoleucine, arginine, ornithine, and leucine, and 4 compounds ranked between 2 and 10 (including fructose, ranked 4, one of a total of 6 carbohydrates). An additional 6 compounds ranked between 11 and 50, including adenosine (vide infra). Histidine and sucrose ranked 52nd and 61st.
Shikimic acid was not returned in the top 100 hits. However, a compound that has the same MSMIC as shikimic acid is the top hit. 4 of the 6 carbohydrates were not in the top
100 hits due to high structural degeneracy and limited NMR chemical shift prediction accuracy. Finally, although the molecular weight of alanine (89.09320 Da) falls below the low-m/z limit of the FT-ICR MS, it can easily be identified by low resolution MS (e.g., quadrupole ion trap). Table 2.2 shows the matching results for the 25-compound model mixture.
Table 2.3 shows the effect of the (+) ESI FT-ICR mass spectral peak magnitude threshold on the number of possible ChemSpider structures and the hit rank for the 25-metabolite mixture. For example, the number of possible structures drops from 57,881 to 16,030 or 9,162 when the MS peak magnitude threshold is raised from the top 100 highest-magnitude peaks to just the top 30 or 20 highest-magnitude peaks.
Although adenosine in the 25-compound mixture was low-ranked (rank 44 among 92 hits), it shares the ribose ring as a common molecular structural motif with all 43 higher-ranked hits. Shikimic acid, galactose, glucose, lactose and ribose were not identified (because their chemical shift RMSDs exceeded 5.0 ppm), but their molecular structural motifs were correctly identified by SUMMIT MS/NMR. These results show the ability of this approach to correctly identify structural motifs belonging to spin systems identified by NMR. The remaining ambiguity among molecules with the correct structural motif is due to limitations of the empirical chemical shift predictor, which can be alleviated in part by applying more accurate, but also more expensive, quantum-chemistry based chemical shift calculations to the top hits.137
To reduce the number of false-positive compounds returned by weighted matching, we optimized the mass error threshold for molecular formula determination to 0.15 ppm, which significantly reduced the number of false compounds and improved the rankings of the true compounds. As an example, mixture compound leucine ranked
7th of a total of 29 hits if a 1.0 ppm mass error threshold was used, but it ranked as the top hit of 13 returned hits after lowering the mass error threshold to 0.15 ppm. Note that the average mass error for the true 25 compounds is 30 ppb and the maximum mass error is less than 120 ppb, which demonstrates the high mass accuracy that can be achieved from an ultrahigh-resolution FT-ICR mass spectrum for a metabolic mixture. Therefore, for an unknown metabolite in a metabolomics mixture, focusing on the low ppm mass accuracy molecular formulas will lead to fewer molecular formulas and thereby facilitate the true compound identification. It should be noted that the low ppm mass accuracy cutoff varies from sample to sample, but it can be determined for each sample by the identification of known abundant metabolites in the mixture.
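The effect of the mass error threshold can be made concrete with the usual ppm definition. The sketch below is illustrative: the example m/z values are chosen for demonstration (the theoretical [M+H]+ of valine and two hypothetical observations), and the function names are not from the original scripts.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def within_threshold(observed_mz, theoretical_mz, threshold_ppm=0.15):
    """True if the absolute mass error is below the chosen ppm threshold."""
    return abs(ppm_error(observed_mz, theoretical_mz)) < threshold_ppm

theoretical = 118.086255   # [M+H]+ of valine (C5H11NO2)
close_peak = 118.086250    # hypothetical observation, about -0.04 ppm
far_peak = 118.086355      # hypothetical observation, about +0.85 ppm

print(round(ppm_error(close_peak, theoretical), 3))
print(within_threshold(far_peak, theoretical, 0.15),
      within_threshold(far_peak, theoretical, 1.0))
```

At m/z 118, a 0.15 ppm window corresponds to less than 0.02 mDa, which is why tightening the threshold from 1.0 to 0.15 ppm removes many candidate formulas.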
Validation of Putative Metabolite Structure by FT-ICR MS/MS
MS/MS can serve to validate the top-ranked structures. As a demonstration, FT-ICR infrared multiphoton dissociation (IRMPD) was performed for the isolated precursor ions corresponding to SUMMIT MS/NMR top-ranked glutamine, lysine, arginine, and ornithine (Figures 2.9 and 2.10). Glutamine and lysine MS/MS yielded mass differences (between the precursor ion and product ion) of 17.02656 Da and 17.02658 Da (i.e., 0.01 mDa and 0.03 mDa deviation from the calculated mass of 17.02655 Da), corresponding to loss of ammonia. Arginine and ornithine MS/MS yielded loss of ammonia (0.06 mDa and 0.05 mDa deviation from the calculated mass of 17.02655 Da) and loss of water (0.06 mDa and 0.06 mDa deviation from the calculated mass of 18.01056 Da).
Collision-induced dissociation (CID) in a linear quadrupole ion trap yielded loss of ammonia, water, and carbon monoxide from valine precursor ion (Figure 2.11). Therefore, the product ion mass spectrum further supports the highest ranked SUMMIT-based structures. Although the information content of MS/MS fragment analysis varies from metabolite to metabolite, MS/MS is expected to be most helpful for SUMMIT
MS/MS/NMR for the identification of larger molecules, such as secondary metabolites.
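The neutral-loss assignments above can be checked by comparing each precursor-to-product mass difference with the monoisotopic masses of common small losses. This is a minimal sketch using standard atomic masses; the 0.1 mDa tolerance is an assumption chosen for illustration.

```python
# Standard monoisotopic atomic masses (Da).
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.00307401, "O": 15.99491462}

NEUTRAL_LOSSES = {
    "NH3": MONO["N"] + 3 * MONO["H"],
    "H2O": 2 * MONO["H"] + MONO["O"],
    "CO": MONO["C"] + MONO["O"],
}

def assign_loss(delta_da, tol_mda=0.1):
    """Return (loss name or None, deviation in mDa) for the closest neutral loss."""
    name, mass = min(NEUTRAL_LOSSES.items(), key=lambda kv: abs(delta_da - kv[1]))
    deviation_mda = (delta_da - mass) * 1000.0
    return (name if abs(deviation_mda) <= tol_mda else None, deviation_mda)

for delta in (17.02656, 17.02658):  # glutamine and lysine mass differences from the text
    name, dev = assign_loss(delta)
    print(name, round(dev, 2), "mDa")
```

Both measured differences fall within a few hundredths of a mDa of the ammonia mass, in line with the deviations quoted in the text.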
2.3.3 Application to E. coli Cell Lysate
NMR and FT-ICR MS Data-Derived Information
397 potential spin systems were extracted from the 3D HSQC-TOCSY NMR spectrum of the E. coli cell lysate by applying the maximal clique approach in full analogy to the model mixture. All extracted spin systems included 3 or more spins. Besides one-spin systems, two-spin systems were also excluded to avoid the generation of a potentially large number of false positives. We obtained a total of 1095 molecular formulas by searching the FT-ICR broadband accurate masses against the METLIN database (see above), leading to the generation of 914,947 candidate molecular structures by screening the ChemSpider database.
Identification of Known Metabolites in E. coli
In the 25-compound model mixture, all of the compounds are known and they are contained in the NMR database, thereby enabling testing of the SUMMIT MS/NMR method. Here, we first apply SUMMIT MS/NMR to identify known metabolites in E. coli to test the power and limitations of the method by comparing the results with those obtained by querying the spectra directly against the COLMAR web server. The recently developed COLMARm web server module provides simultaneous analysis of 2D HSQC,
2D TOCSY, and 2D HSQC-TOCSY NMR spectra and was used to identify metabolites.
Metabolites were first identified by querying the 2D HSQC against the COLMAR database, and subsequently verified by 2D TOCSY and 2D HSQC-TOCSY by use of
COLMARm. 41 metabolites could be identified with high confidence by COLMARm (2D
HSQC cross-peak matching ratio >0.8 and more than 50% spin-spin connectivities showing up in the 2D TOCSY and 2D HSQC-TOCSY spectra). The 41 metabolites were treated as “putatively annotated metabolites” to be verified by SUMMIT MS/NMR. When implementing SUMMIT MS/NMR to verify the metabolites identified by COLMAR, we compared the identified metabolites with the matching results returned for each extracted spin system. Verification results are reported in Table 2.4 for metabolites that fulfill the following conditions: they are ranked among the top 200 hits if the total number of hits with a chemical shift RMSD < 5.0 ppm was 400 or less, or they are ranked in the top 50% if the total number of hits with RMSD < 5.0 ppm exceeded 400. These criteria ensure that the most likely candidates are retained without making the pool unrealistically large. Based on cross-platform analytical methods to verify compounds, the identification and verification results by COLMAR and SUMMIT MS/NMR achieved level 2 confidence according to the Metabolomics Standards Initiative.58
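The reporting criteria stated above reduce to a small decision rule. The sketch below encodes it directly; the function and parameter names are illustrative, and the example (rank, total hits) pairs are taken from Table 2.4 except for the last one, which is hypothetical.

```python
def is_reported(rank, total_hits, top_n=200, pool_limit=400, top_fraction=0.5):
    """Apply the reporting rule: top 200 for pools of 400 hits or fewer, else top 50%."""
    if total_hits <= pool_limit:
        return rank <= top_n
    return rank <= top_fraction * total_hits

examples = {
    "D-glucose": (170, 659),
    "L-methionine": (417, 4982),
    "sucrose": (136, 168),
    "hypothetical low-ranked hit": (400, 641),
}
for name, (rank, total) in examples.items():
    print(name, is_reported(rank, total))
```

The first three entries pass the rule, whereas the hypothetical hit at rank 400 of 641 would fall outside the top 50% and not be reported.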
The following 13 known metabolites were successfully verified by SUMMIT MS/NMR: L-glutamine, L-valine, maltose, cellobiose, N-acetyl-putrescine, L-glutamic acid, D-glucose, spermidine, L-phenylalanine, L-tyrosine, N-α-acetyl-L-lysine, L-glutathione-reduced, and L-methionine. Adenosine, inosine, L-proline, leucine, pyridoxamine-5-phosphate-1 and guanosine could not be verified because not all of their cross-peaks showed up due to the relatively low abundance of these metabolites and the limited sensitivity of HSQC-TOCSY. However, by manually checking 2D HSQC-TOCSY and 3D HSQC-TOCSY, partial spin systems of these metabolites (covering 50% or more of the expected cross-peaks) could be identified. When implementing SUMMIT MS/NMR, we set the matching ratio cutoff to 1 to increase the identification accuracy when matching with FT-ICR MS-derived NMR spectra. For instance, if a compound contains a 5-spin system, but only a 4-spin sub-system could be reconstructed from 3D HSQC-TOCSY (e.g., because a resonance is very weak), the true (5-spin system) compound would not be returned as a hit by matching the experimental 4-spin system with the FT-ICR MS-derived
NMR spectra.
Therefore, SUMMIT MS/NMR will verify only the metabolites that are detectable by both analytical methods, providing high validation confidence across platforms. In addition, off-line LC fractionation can be applied prior to MS/NMR analysis to decrease the complexity of the metabolite mixture and increase the chance of identifying more metabolites common to MS and NMR. Those metabolites that are identified by only one of the two analytical methods need to be further validated by other analytical methods, e.g., HPLC-MS. In any case, HPLC retention time (especially when calibrated by spiking with the putative metabolite) can help further validate any metabolite assignments based on NMR, MS, or a combination of the two. Finally, metabolites that are not detected by positive ESI can often be detected by negative ESI. For example, the MS1 accurate masses for acetyl-L-glutamine, DL-α-glycerol-phosphoric acid, D-glucuronic acid, methyl-uridine, deoxythymidine monophosphate, uridine monophosphate, and cystathionine were detected by use of negative ESI (rms mass error 0.18 ppm), and those identities were confirmed by NMR.
Identification of Unknown Metabolites in E. coli by SUMMIT MS/NMR
The primary aim of SUMMIT MS/NMR is to identify unknown compounds that are not catalogued in current NMR and MS metabolomics databases. Unknown spin systems show self-consistent spin-spin connectivities in the 3D HSQC-TOCSY spectrum, but do not match any compound in the NMR database. Here, SUMMIT MS/NMR identified up to 15 unknown spin systems (compounds) in E. coli cell lysate. For instance, the spin system with chemical shifts (δH, δC) identified as (1.167, 19.542), (3.618, 59.207), (3.632, 75.944), (3.767, 73.296), (4.032, 70.725), and (5.573, 98.048) ppm shows high confidence as a true positive spin system (Figure 2.12), but could not be assigned to any known compound after querying against the COLMAR database. However, after matching against 914,947 predicted NMR spectra, 12 hits (RMSD < 5.0 ppm) were returned and 4 molecular structural motifs were identified (Figure 2.13). As a proof of principle for the identification of true compounds, we applied quantum-chemical calculations to two selected compounds among the 12 hits, namely L-fucosamine and 6-desoxy-D-glucosamine. 6-Desoxy-D-glucosamine was the top hit among the 12 hits. L-Fucosamine has been found as a constituent of mucopolysaccharides of certain enteric bacteria (e.g., Citrobacter freundii), but the existence of L-fucosamine in E. coli was previously unknown.138
Quantum-chemical calculations of NMR chemical shifts for these two compounds return a lower RMSD for L-fucosamine and, hence, L-fucosamine is more likely to be the true compound than 6-desoxy-D-glucosamine, consistent with the literature (Table 2.5).
Although the true identity of this spin system remains uncertain, SUMMIT MS/NMR provides a small list of likely candidates, which represents actionable information for the identification of the true compound.
Pyroglutamic acid, which at the outset of this study was not part of the COLMAR
1H(13C)-TOCCATA database, represents another instructive example of the SUMMIT approach. SUMMIT MS/NMR successfully extracted the spin system and returned pyroglutamic acid as the 116th hit (Figure 2.14). Independently and at about the same time, the COLMAR 1H(13C)-TOCCATA database increased by 284 compounds,115 including
pyroglutamic acid, thereby enabling the identification of pyroglutamic acid as a known metabolite, confirming the SUMMIT results. For the E. coli cell lysate, SUMMIT returned 15 unknown spin systems along with their candidate compounds. To maximize the confidence of the unknown spin systems, we included all of the pairwise-connected spins that appear along the 1D ω3 (1H) trace, and none of the peaks in the spin system matched the NMR database.
Finally, we note that some unknown spin systems have multiple candidate compounds, whereas others do not match any candidate compounds based on our metrics. Extending the mass range of FT-ICR MS will be helpful to incorporate all possible compounds and find the compound candidates for these unknown spin systems.
2.4 Conclusion
SUMMIT MS/NMR provides powerful fingerprints, based on spin system information, molecular formulas, and compound candidates in complex biological mixtures, thereby greatly assisting the analysis of complex metabolomics mixtures whose compositions are only partially known, without being limited to spectroscopic databases. SUMMIT is expected to find fruitful applications to support key objectives of contemporary metabolomics research, including the discovery of new biochemical pathways and biomarkers.
Table 2.1. Example of chemical shift (c.s.) matching results for valine.

| Functional group | Predicted 1H c.s. (ppm)^a | Predicted 13C c.s. (ppm)^a | Expt. 1H c.s. (ppm) | Expt. 13C c.s. (ppm) |
|---|---|---|---|---|
| -CγH3 | 0.910 | 19.32 | 1.034 | 20.56 |
| -CγH3 | 0.960 | 19.32 | 0.983 | 19.23 |
| -CβH | 2.220 | 31.07 | 2.267 | 31.69 |
| -CαH | 3.440 | 61.90 | 3.597 | 63.00 |
| RMSD (ppm) | 0.93 | | | |

^a The empirical chemical shift prediction was obtained by use of the NMR predictor by Modgraph embedded in the MestReNova software.
Table 2.2. SUMMIT results for 25-compound model mixture (mass error cutoff of 0.15 ppm).

| Compound | Rank | Total hits (c.s. RMSD<5.0 ppm) | Percentile = Rank/Total number of hits | c.s. RMSD (ppm) | Mass error (ppm) |
|---|---|---|---|---|---|
| valine | 1 | 122 | 0.8% | 0.93 | -0.04 |
| lysine | 1 | 39 | 2.6% | 1.54 | 0.04 |
| glutamine | 1 | 140 | 0.7% | 1.36 | 0.07 |
| isoleucine | 1 | 16 | 6.3% | 1.77 | 0 |
| arginine | 1 | 81 | 1.2% | 2.02 | 0.10 |
| ornithine | 1 | 62 | 1.6% | 2.05 | 0.05 |
| leucine | 1 | 13 | 7.7% | 2.06 | 0 |
| threonine | 3 | 79 | 3.8% | 1.71 | -0.08 |
| fructose | 4 | 116 | 3.4% | 1.32 | 0.05 |
| carnitine | 4 | 22 | 18.2% | 3.32 | 0.12 |
| cysteine | 10 | 142 | 7.0% | 2.17 | -0.05 |
| inosine | 13 | 99 | 13.1% | 2.27 | -0.02 |
| citrulline | 23 | 65 | 35.4% | 2.76 | 0.12 |
| methionine | 29 | 133 | 21.8% | 2.53 | 0.03 |
| serine | 30 | 514 | 5.8% | 2.2 | 0.07 |
| proline | 30 | 91 | 33.0% | 2.55 | 0 |
| adenosine | 44 | 92 | 47.8% | 2.74 | 0 |
| histidine | 52 | 181 | 28.7% | 2.56 | 0.11 |
| sucrose | 61 | 117 | 52.1% | 2.06 | 0.04 |
Table 2.3. Effect of cutoff threshold of mass peak amplitudes on the rank of SUMMIT results for 25-compound model mixture (0.15 ppm mass error cutoff).

| Index | Compound | Rank: Top 30 Mass Peaks | Rank: Top 20 Mass Peaks | Rank: All Mass Peaks | c.s. RMSD (ppm) |
|---|---|---|---|---|---|
| 1 | valine | 1 | 1 | 1 | 0.83 |
| 2 | lysine | 1 | 1 | 1 | 1.44 |
| 3 | glutamine | 1 | 1 | 1 | 1.35 |
| 4 | isoleucine | 1 | 1 | 1 | 1.86 |
| 5 | arginine | 1 | 1 | 1 | 1.95 |
| 6 | ornithine | 1 | 1 | 1 | 1.92 |
| 7 | leucine | 1 | 1 | 1 | 1.84 |
| 8 | threonine | 2 | N/A | 3 | 1.55 |
| 9 | fructose | 4 | 4 | 4 | 1.36 |
| 10 | carnitine | 4 | 3 | 4 | 3.14 |
| 11 | methionine | 6 | 3 | 29 | 2.18 |
| 12 | proline | 7 | 5 | 30 | 2.54 |
| 13 | cysteine | 7 | N/A | 10 | 1.93 |
| 14 | citrulline | 11 | 10 | 23 | 2.40 |
| 15 | inosine | 13 | 13 | 13 | 2.26 |
| 16 | serine | 17 | 11 | 30 | 2.07 |
| 17 | histidine | 18 | 9 | 52 | 2.58 |
| 18 | adenosine | 39 | 38 | 44 | 2.63 |
| 19 | sucrose | 61 | 61 | 61 | 1.77 |
| | Structure manifolds | 16,030 | 9,162 | 57,881 | |
Table 2.4. Metabolites identified in E. coli cell lysate and verified by COLMAR web server and SUMMIT MS/NMR. (Highlighted metabolites in green were consistent with identification by COLMARm (based on HSQC for query and TOCSY + HSQC-TOCSY for validation); highlighted metabolites in blue were identified based on the COLMAR HSQC database alone.)

| Compound | Rank | Total hits (RMSD<5.0 ppm) | Mass error (ppm) | RMSD (ppm) |
|---|---|---|---|---|
| L-glutamine | 1 | 6268 | 0.14 | 1.27 |
| L-valine | 5 | 5089 | 0.13 | 0.99 |
| maltose | 14 | 641 | 0.08 | 1.52 |
| cellobiose | 29 | 641 | 0.08 | 1.73 |
| N-acetyl-putrescine | 45 | 1697 | 0.08 | 1.87 |
| L-glutamic acid | 79 | 5549 | 0.14 | 1.58 |
| D-glucose | 170 | 659 | -0.07 | 2.17 |
| spermidine | 213 | 1697 | 0.14 | 2.82 |
| L-phenylalanine | 230 | 7685 | 0.06 | 2.06 |
| L-tyrosine | 251 | 7685 | 0.11 | 2.07 |
| N-alpha-acetyl-L-lysine | 278 | 576 | 0.11 | 2.94 |
| L-glutathione-reduced | 313 | 6268 | 0.10 | 2.55 |
| L-methionine | 417 | 4982 | 0.13 | 2.49 |
| D-xylose | 1 | 96 | 0.09 | 1.85 |
| D-sorbose | 3 | 285 | -0.07 | 1.59 |
| lactose | 6 | 744 | 0.08 | 1.71 |
| N-alpha-acetyl-ornithine | 7 | 1552 | 0.18 | 2.18 |
| AMP | 19 | 36 | 0.12 | 2.69 |
| D-ribose | 26 | 142 | 0.09 | 2.51 |
| D-galactose | 33 | 659 | -0.07 | 1.74 |
| 2-pyrrolidinone-5-carboxylate | 275 | 5549 | 0.08 | 2.02 |
| lysine | 281 | 576 | 0.20 | 2.95 |
| isovalerylglutamic acid | 289 | 3773 | 0.13 | 1.8 |
| S-adenosyl-L-homocysteine | 332 | 4982 | 0.07 | 2.3 |
| gamma-glutamylcysteine | 367 | 6268 | -0.05 | 2.55 |
| D-glucosamine | 9 | 144 | 0.12 | 2.46 |
| xylulose | 10 | 1629 | 0.09 | 1.69 |
| L-arabinose | 25 | 96 | 0.09 | 2.58 |
| D-tagatose | 26 | 285 | -0.07 | 1.84 |
| muramic acid | 32 | 81 | 0.01 | 3.09 |
| melibiose | 36 | 99 | 0.08 | 2.22 |
| isomaltose | 38 | 99 | 0.08 | 2.41 |
| D-salicin | 80 | 744 | -0.06 | 2.29 |
| sucrose | 136 | 168 | 0.08 | 3.26 |
| beta-gentiobiose | 345 | 956 | 0.08 | 2.48 |
| D-trehalose | 378 | 1139 | 0.08 | 2.42 |
| N-acetyl-lactosamine | 427 | 1079 | 0.09 | 2.45 |
Table 2.5. Quantum-chemical calculations of NMR chemical shifts for two selected compounds, together with matching results from experimental spectra. The quantum-chemical calculations were performed according to our MOSS-DFT protocol published recently.3

| Compound | Peak index | Predicted 1H c.s. (ppm)^a | Predicted 13C c.s. (ppm)^a | Expt. 1H c.s. (ppm) | Expt. 13C c.s. (ppm) |
|---|---|---|---|---|---|
| L-fucosamine | 1 | 1.342 | 17.506 | 1.167 | 19.542 |
| | 2 | 3.252 | 60.607 | 3.618 | 59.207 |
| | 3 | 3.835 | 71.816 | 4.032 | 70.725 |
| | 4 | 3.893 | 73.883 | 3.632 | 75.944 |
| | 5 | 3.901 | 72.945 | 3.767 | 73.296 |
| | 6 | 4.887 | 94.099 | 5.573 | 98.048 |
| | RMSD (ppm) | 2.9335 | | | |
| 6-desoxy-D-glucosamine | 1 | 1.337 | 18.311 | 1.167 | 19.542 |
| | 2 | 3.207 | 61.619 | 3.618 | 59.207 |
| | 3 | 3.737 | 73.449 | 4.032 | 70.725 |
| | 4 | 3.326 | 77.943 | 3.632 | 75.944 |
| | 5 | 3.565 | 74.916 | 3.767 | 73.296 |
| | 6 | 4.949 | 93.834 | 5.573 | 98.048 |
| | RMSD (ppm) | 3.1622 | | | |
[Figure 2.1 graphic not reproduced. Panel A: 2D HSQC and 3D HSQC-TOCSY spectra with cross-peaks labeled 1-4; axes ω1(13C), ω2(1H), and ω3(1H). Panel B: 1D traces 1-4 plotted against ω3(1H) (ppm).]
Figure 2.1. Extraction of spin systems of individual mixture compounds from 3D 13C-1H
HSQC-TOCSY. Panel A shows the relationship between cross-peaks from the 2D 13C-1H
HSQC spectrum (left) and the 3D 13C-1H HSQC-TOCSY spectrum (right). Panel B illustrates how 1D cross-sections along ω3 (1H) of the 3D HSQC-TOCSY spectrum of Panel A yield spin system information, which is extracted by use of a maximal clique approach.
Traces 1, 2, 3 show high similarity because they belong to the same spin system consisting of 3 protons, whereas trace 4 belongs to a separate spin system with a single proton.
Figure 2.2. Illustration of the spin system refinement procedure based on 2D HSQC, 2D
TOCSY, and 2D/3D HSQC-TOCSY NMR spectra. The following steps are depicted: a) merging of two separate 1H peaks into one peak; b) identification of extra spins (indicated by arrows in (i)) by 1D ω3 (1H) trace comparison.
[Figure 2.3 graphic not reproduced. Linear fits shown in the plots: y = 0.911x + 0.287 (R² = 0.9604) and y = 0.982x + 1.976 (R² = 0.9889).]
Figure 2.3. Predicted versus experimental chemical shifts for the 25-compound model mixture. The RMSDs between predicted and experimental chemical shifts are 0.292 ppm for protons and 2.903 ppm for carbons.
[Figure 2.4 graphic not reproduced: 2D 13C-1H HSQC contour plot with peaks labeled L-glutamic acid, L-glutamine, and glutathione; axes 1H (ppm), 2.0 to 4.5, and 13C (ppm), 30 to 55.]
Figure 2.4. 2D 13C-1H HSQC of L-glutamine, glutathione, and L-glutamic acid mixture.
Peaks with the common motif (HOOCCH(NH2)CH2CH2CONH-) are highlighted by symbols. The spectrum illustrates the similarity of chemical shifts of identical fragments
(colored in magenta) that are part of different molecules. Only cross-peaks that belong to the motif are labeled.
Figure 2.5. Putatively annotated valine spin system in a 25-metabolite model mixture extracted from 3D HSQC-TOCSY spectrum and confirmed by 2D TOCSY and 2D HSQC-TOCSY. Panel A: four single bond C-H cross-peaks (blue) of valine in the 2D HSQC (left) and 2D HSQC-TOCSY (right) spectra. The expected relay HSQC-TOCSY cross-peaks of the spin system are highlighted in red. Panel B: four different ω1-ω2 planes of the 3D HSQC-TOCSY spectrum belonging to valine. Blue peaks obey ω2 = ω3, and the red peaks are the other expected 3D cross-peaks of the valine spin system.
Figure 2.6. Molecular structural motifs identified by SUMMIT from 122 different compound candidates that all match the spin system of valine. The hits were sorted into different groups according to their common molecular motif that represents the NMR-derived spin system.
Figure 2.7. Positive electrospray ionization 9.4 T FT-ICR broadband mass spectrum of E. coli cell lysate.
[Figure 2.8 graphic: panels A (experimental spin system chemical shifts), B (predicted spin system chemical shifts, 57,881 in total), and C (top hit) for arginine, glutamine, isoleucine, lysine, ornithine, valine, and leucine; axes ω1(13C) (ppm) and ω2(1H) (ppm).]
Figure 2.8. Identification of best matching metabolites in a 25-compound model mixture by SUMMIT MS/NMR. A) Experimental 2D HSQC NMR spectra of metabolites extracted from 3D HSQC-TOCSY. Each spectrum is a collection of enlarged spectral regions
(separated by dotted lines) that contain the corresponding cross-peaks. B) Predicted 2D
HSQC NMR spectra from FT-ICR MS-derived molecular structures (57,881 in total). Each experimental HSQC spectrum was compared with all 57,881 predicted HSQC spectra by maximum weighted bipartite matching. All returned hits were ranked according to their
chemical shift RMSD. C) Top hit compounds that belong to true compounds in the model mixture. Molecular substructures highlighted in magenta correspond to the molecular structural motifs (MSMIC) of the matched spin system. The spectra of panel B) were sorted so that each experimental spectrum of (A) is adjacent to its top hit (B). Each edge of the graph connecting (A) and (B) is associated with a chemical shift RMSD.
Figure 2.9. FT-ICR MS/MS of glutamine and lysine in the 25-metabolite mixture. Glutamine and lysine MS/MS yield mass differences (between the precursor ion and product ion) of
17.02656 Da and 17.02658 Da (i.e., 0.01 mDa and 0.03 mDa deviation from the calculated mass 17.02655 Da) corresponding to loss of ammonia.
Figure 2.10. Infrared multiphoton dissociation positive electrospray ionization 9.4 T FT-
ICR product ion mass spectrum of arginine and ornithine.
Figure 2.11. Collision-induced dissociation (normalized collision energy 22) Velos Pro product ion mass spectrum of valine.
Figure 2.12. Spin system of an unknown compound from an E. coli cell lysate extracted from 3D HSQC-TOCSY and verified by 2D TOCSY and 2D HSQC-TOCSY (2D TOCSY is not shown). A) Cross-peaks of the unknown compound shown in 2D HSQC (left, blue cross-peaks) and 2D HSQC-TOCSY (right, blue and red cross-peaks) spectra. B) Six cross-peaks of the unknown compound depicted in 6 different 2D slices ( ) at fixed frequency of the 3D HSQC-TOCSY spectrum (blue symbols: diagonal peaks; red symbols: cross-peaks).
Figure 2.13. Motif identification of compound candidates for the spin system of an unknown compound from E. coli cell lysate. Four different motifs were identified, highlighted in red, green, blue and magenta, which are consistent with the NMR data.
Figure 2.14. Spin system of pyroglutamic acid extracted from 3D HSQC-TOCSY and confirmed by 2D TOCSY and 2D HSQC-TOCSY. A) Cross-peaks of pyroglutamic acid shown in the 2D HSQC (blue peaks) and 2D TOCSY spectra (blue and red). B) Four cross-peaks of pyroglutamic acid shown in multiple 2D slices of 3D HSQC-TOCSY. Cross-peaks (δH, δC) at (2.04, 28.0) ppm and (2.51, 28.0) ppm are two separate peaks of the same CH2 group.
Accurate and Efficient Determination of Unknown
Metabolites in Metabolomics by NMR-Based Molecular Motif
Identification
Knowledge of the chemical identity of metabolite molecules is critical for the understanding of the complex biological systems they belong to. Since metabolite identities and their concentrations are often directly linked to the phenotype, such information can be used to map biochemical pathways and understand their role in health and disease. A very large number of metabolites however are still unknown, i.e. their spectroscopic signatures do not match those in existing databases, suggesting unknown molecule identification is both imperative and challenging. Although metabolites are structurally highly diverse, the majority shares a rather limited number of structural motifs, which are defined by sets of 1H, 13C chemical shifts of the same spin system. This allows one to characterize unknown metabolites by a divide-and-conquer strategy that identifies their structural motifs first. Here we present the structural motif-based approach “SUMMIT Motif” for the de novo identification of unknown molecular structures in complex mixtures, without the need for extensive purification, using NMR in tandem with two newly curated NMR molecular structural motif metabolomics databases
(MSMMDB). In identifying structural motif(s), first the 1H and 13C chemical shifts of all the individual spin systems are extracted from 2D and 3D NMR spectra of the complex mixture. Next, the molecular structural motifs are identified by querying these chemical
shifts against the new MSMMDBs. One database, COLMAR MSMMDB, was derived from experimental NMR chemical shifts of known metabolites taken from the COLMAR metabolomics database, while the other MSMMDB, pNMR MSMMDB, is based on empirically predicted chemical shifts of metabolites of several existing large metabolomics databases. For molecules consisting of multiple spin systems, spin systems are connected via a long-range scalar J-coupling NMR experiment. When this motif-based identification method was applied to the hydrophilic extract of mouse bile fluid, two unknown metabolites could be successfully identified. This approach is both accurate and efficient for the identification of unknown metabolites and hence contributes to the understanding of human health and disease.
3.1 Introduction
The chemical complexity of living organisms is reflected in the large number of different metabolites they are composed of. The human body alone may contain over 100,000 different metabolites, but the majority still needs to be identified and characterized.139
Such identification is critical to identify potential biomarkers and study new biochemical pathways for the better understanding of biological processes involved in health and disease. The system-wide study of metabolites and pathways, also in relation to the phenotype, is the subject of the field of metabolomics.6, 8, 11
Structure determination of novel organic molecules is a standard task in synthetic organic chemistry and natural product research. Analytical methods, such as infra-red
(IR) and UV/Vis spectroscopy, high-resolution mass spectrometry (HRMS) and 1H and 13C
NMR spectroscopy are routinely used.140 However, traditional synthetic or natural product characterization requires that a compound has been purified and isolated. In metabolomics, this is often impractical as it can be hard to efficiently isolate a compound at sufficient concentration among hundreds of molecular species. With respect to NMR, important methodological advances now allow routine characterization of known metabolites in a wide range of different complex mixtures with little or no purification.28,
139
Over the recent past, several approaches have been introduced that aim at the de novo characterization and structure determination of metabolites directly in the complex mixture environment.106, 141-146 We recently introduced a protocol for the identification of unknown metabolites in metabolomics samples without the need for purification that combines MS, NMR, and cheminformatics.99, 147 The approach, named SUMMIT MS/NMR, uses accurate mass information, e.g. from Fourier transform ion cyclotron resonance (FT-
ICR), to determine the elemental composition of the metabolites present in the sample. A
large pool of chemical compounds is then generated, which are consistent with the MS-derived molecular formulas. The candidate compounds are then filtered against multidimensional NMR data, in particular 1H and 13C chemical shifts that belong to individual spin systems. All candidate compounds are rank-ordered by comparing their predicted NMR chemical shifts with experimentally determined chemical shifts of the unknown compounds in the mixture.
SUMMIT MS/NMR requires that two key conditions are fulfilled: (i) the unknown compound must be present in the pool of candidate structures and (ii) the accuracy of the chemical shift predictor must be sufficiently high to identify the correct compound from potentially many others. In practice, both conditions pose specific challenges. For condition (i), the requirement that the unknown metabolite is present in the pool of structures can be difficult to meet depending on the unknown and how the pool of structures has been generated. Instead, it can be easier and less ambiguous to first establish more general molecular structural properties of the unknown, such as its structural motif(s), before attempting to characterize its full molecular structure. For condition (ii), empirical chemical shift predictors, such as Modgraph/Mnova,148, 149 are fast, but their accuracy is limited. The average root-mean-square deviations (RMSD) of empirically predicted chemical shifts for a set of representative metabolites are around 0.292 ppm (1H) and 2.90 ppm (13C), and can be improved to 0.154 ppm (1H) and 1.93 ppm (13C) by quantum-chemical calculations with multiple scaling150 at the cost of substantially increased computation time. Due to the currently limited accuracy of the chemical shift prediction, the SUMMIT
MS/NMR approach returns a potentially large number of compounds as viable candidates, which makes their final verification in terms of their purchase or chemical synthesis followed by spiking experiments in the complex mixture both time-consuming and expensive.
Here, we present an alternative approach, named SUMMIT Motif, for the de novo determination of molecular structures of unknowns by first focusing on the determination of molecular structural motif(s) (MSM) without the need for any mass spectrometry data.
Such information represents a key step toward the determination of the full structure. The approach starts out with the identification of 1H and 13C NMR spin systems to define an unknown metabolite’s backbone or contiguous parts thereof. This step is achieved by querying the experimental chemical shifts of the unknown spin system against those of molecular structural motifs (MSM) in both an experimental and a synthetic MSM chemical shift database. It is shown that this approach is highly effective for MSM identification provided that the MSM of the unknown compound is in fact present in the molecular structural motif metabolomics databases (MSMMDB). As is shown here, this requirement is much more easily fulfilled than condition (i) of SUMMIT MS/NMR described above. The power of the approach is demonstrated by determining unknowns present in mouse bile fluid.
3.2 Materials and Methods
3.2.1 Definition of spin systems, molecular structural motifs, and COLMAR MSMMDB curation
For any given molecular structure, molecular structural motifs (MSM) are defined using spin systems as a starting point. As customarily defined, for each proton spin system consisting of NH protons each 1H spin is connected to another 1H spin through no more than 3 bonds. A basic molecular structural motif is then defined by the 1H spins together with up to NC carbons (13C or 12C) where each carbon is directly attached to at least one of the 1H atoms (Figure 3.1). This “0th shell molecular motif” can then be systematically expanded by including additional atoms (N, O, S, P) that are not directly observable in 1H
and 13C NMR experiments. When including additional heavy atoms that are exactly one bond away one obtains the “1st shell molecular structural motif” and when heavy atoms are included that are up to two bonds away the “2nd shell molecular structural motif” is obtained (Figure 3.1). The shell order is analogous to the HOSE code used for the empirical prediction of chemical shifts of small molecules.151 The higher the shell order, the more chemically distinct the structural motif is and the more unique its 1H and 13C chemical shifts are, but the lower the chance that one or several molecules with the same structural motif already exist in a current metabolomics NMR database.
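The shell construction can be illustrated as a breadth-first expansion over a heavy-atom bond graph. The following is a minimal sketch and not the curation code used for the databases: the function name `expand_motif`, the adjacency-dictionary representation, and the alanine-like toy graph are assumptions made only for this example.

```python
from collections import deque

def expand_motif(bonds, core, shell):
    """Return all atoms within `shell` bonds of the 0th-shell core atoms."""
    distance = {atom: 0 for atom in core}
    queue = deque(core)
    while queue:
        atom = queue.popleft()
        if distance[atom] == shell:
            continue  # do not grow beyond the requested shell
        for neighbor in bonds[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)

# Hypothetical heavy-atom graph of an alanine-like molecule (hydrogens implicit).
bonds = {
    "CA": ["N", "CB", "C"],
    "CB": ["CA"],
    "N": ["CA"],
    "C": ["CA", "O1", "O2"],
    "O1": ["C"],
    "O2": ["C"],
}
core = ["CA", "CB"]  # carbons that carry the protons of the spin system

shell0 = expand_motif(bonds, core, 0)  # 0th shell: proton-bearing carbons only
shell1 = expand_motif(bonds, core, 1)  # adds heavy atoms one bond away
shell2 = expand_motif(bonds, core, 2)  # adds heavy atoms up to two bonds away
```

In this toy case the 1st shell adds the amine nitrogen and the carboxyl carbon, and the 2nd shell adds the two carboxyl oxygens.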
In order to recognize molecular structural motifs that are part of an unknown metabolite, a 1st and 2nd shell molecular structural motif database was generated from the
COLMAR small molecule database (experimentally measured in aqueous solution) along with 1H, 13C chemical shifts of each motif that were assigned to specific groups within each motif. It is called COLMAR MSM Metabolomics Database or COLMAR MSMMDB.
COLMAR MSMMDB stores the molecular structures of 1st and 2nd shell MSMs and the parent metabolites of each MSM, along with their chemical shifts and the corresponding averages and standard deviations. When only motifs of spin systems with NC > 1 are considered, the 623 metabolites in the parent COLMAR metabolomics database share 180 unique 1st shell and 397 unique 2nd shell molecular structural motifs (Table 3.1). Examples of 2nd shell molecular structural motifs are depicted in Figure 3.2, along with 1H, 13C chemical shift standard deviations of individual atoms.
The standard deviations of chemical shifts in 2nd shell MSMs are generally well below the average chemical shift errors of NMR chemical shift prediction programs (see
Figure 3.3). This demonstrates that the chemical shifts of 2nd shell MSMs are in most cases considerably more accurate than computationally predicted chemical shifts, which is the reason why 2nd shell MSMs have a better chance to be successfully identified from experimental chemical shifts of unknown metabolites.
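The per-atom averages and standard deviations stored for each motif can be sketched as follows. The shift values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev

# Hypothetical 13C shifts (ppm) of two motif carbons, each observed in three
# parent metabolites that share the same 2nd shell motif.
shifts_by_atom = {
    "C1": [57.1, 57.4, 56.8],
    "C2": [29.0, 29.5, 28.5],
}

def summarize(shifts):
    """Return {atom: (average, sample standard deviation)} for each motif atom."""
    return {atom: (mean(values), stdev(values)) for atom, values in shifts.items()}

summary = summarize(shifts_by_atom)
```

With these illustrative numbers the spreads are 0.3 and 0.5 ppm, well below the roughly 2.9 ppm 13C RMSD quoted above for empirical prediction, which is the kind of comparison the paragraph describes.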
3.2.2 Curation of the empirically predicted pNMR MSMMDB
Since the COLMAR MSMMDB features only a subset of MSMs (currently 180 1st shell and
397 2nd shell MSMs) of all possible metabolite MSMs, it is possible that an unknown spin system does not have a good match. For such cases, an MSM database has been built that covers a larger range of MSMs from empirically predicted, rather than experimental, chemical shifts. This database, termed pNMR molecular structural motif metabolomics database (pNMR MSMMDB), consists of MSMs that were extracted from molecules in the
HMDB, the Chemical Entities of Biological Interest (ChEBI) database, and the Kyoto
Encyclopedia of Genes and Genomes (KEGG) database.38, 152, 153 The HMDB is currently the most comprehensive, organism-specific metabolomics database and is the largest collection of human metabolites with their chemical structures and biological roles annotated. ChEBI covers both metabolites produced in biological systems and synthetic products that can intervene with living organisms. The KEGG database is one of the most widely used biochemical pathway databases, containing metabolites involved in human diseases and molecular interactions in various organisms. The pNMR MSMMDB, which currently covers 23,697 metabolites with a molecular weight below 800 Da, focuses on motifs of hydrophilic metabolites, which are defined as metabolites with a predicted lipophilicity logP value smaller than 3.0 (as computed by ALOGP,154, 155 Figure 3.4). The pNMR MSMMDB contains motifs that overlap with those of COLMAR MSMMDB, but with their chemical shifts predicted rather than experimentally determined.
1H, 13C chemical shifts were computed and stored for each compound in the pNMR
MSMMDB using the empirical chemical shift predictor by Modgraph implemented in
MestReNova 10.0.1 (Mestrelab Research). The 1H chemical shift prediction is based on the effects of functional groups that were individually parameterized, whereas the 13C chemical shift prediction is achieved with a HOSE code algorithm. The predicted 1H, 13C chemical shifts have been sorted into individual spin systems belonging to unique MSMs so that they can be compared directly with experimental 1H, 13C chemical shifts extracted from experiments. For each MSM, predicted chemical shifts from multiple metabolites are stored separately.
3.2.3 Single and multiple spin system analysis from 2D and 3D NMR experiments
Entire spin systems of individual molecules are extracted from 2D 1H-1H TOCSY, 2D 13C-
1H HSQC, and 3D 13C-1H HSQC-TOCSY spectra directly applied to the complex mixture of interest. This is automatically achieved by applying methods developed previously in our lab using graph theory and maximal clique analysis.125 Specifically, each directly bonded 13C-1H cross-peak in a 2D HSQC is defined as a node of a mathematical graph.
Edges between the nodes correspond to connectivities between pairs of 13C-1H cross-peaks observed in TOCSY-type spectra (2D TOCSY, 2D HSQC-TOCSY and 3D HSQC-TOCSY).
The graph is then subjected to maximal clique analysis using the Bron-Kerbosch algorithm where individual cliques correspond to separate spin systems.125, 147
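The clique step can be sketched with the basic (non-pivoting) Bron-Kerbosch recursion. This is an illustration under simplifying assumptions, not the implementation of refs. 125 and 147: the peak labels and the toy connectivity graph are hypothetical, chosen to mirror the case of three mutually connected cross-peaks plus one isolated cross-peak.

```python
def bron_kerbosch(r, p, x, graph, cliques):
    """Collect all maximal cliques of `graph` (basic Bron-Kerbosch, no pivoting)."""
    if not p and not x:
        cliques.append(sorted(r))  # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & graph[v], x & graph[v], graph, cliques)
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude v from later cliques at this level

# Hypothetical graph: nodes are HSQC cross-peaks, edges are connectivities
# observed in TOCSY-type spectra.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2},
    4: set(),
}

cliques = []
bron_kerbosch(set(), set(graph), set(), graph, cliques)
spin_systems = sorted(cliques)
```

Each maximal clique is interpreted as one spin system; here peaks 1, 2, and 3 form one spin system and peak 4 forms a separate single-peak spin system.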
Considering that molecules can be composed of multiple motifs, once individual motifs have been identified, the connectivity of multiple MSMs can be further explored by additional NMR experiments. The 2D heteronuclear multiple bond correlation (HMBC) experiment is widely used to measure long-range heteronuclear coupling constants for the stereochemical and conformational analysis of biologically active natural products,156 and it is also a promising tool for the structure elucidation of metabolites and natural products in complex mixtures.157, 158 The heteronuclear single quantum multiple bond correlation (HSQMBC) experiment159 detects long-range heteronuclear correlations through nJ(CH) (n>1) couplings (~ 2 - 10 Hz), which makes the identification of quaternary carbons possible as well. Connectivity information between separate spin systems within
a molecule can hence be retrieved via the 2D 13C-1H PIP-HSQMBC spectrum.160 Therefore, after extracting individual spin systems of each compound from 2D and 3D TOCSY-type spectra, a 2D 13C-1H HSQMBC experiment is performed to identify connectivities between different spin systems to establish whether they belong to the same compound.
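Grouping spin systems that belong to the same compound can be sketched as a union-find pass over long-range correlations. This is a simplified illustration: it assumes that each HSQMBC correlation has already been mapped to a pair of peaks belonging to known spin systems, which skips the handling of quaternary carbons, and all names and peak labels are hypothetical.

```python
def link_spin_systems(spin_systems, long_range_pairs):
    """Group spin systems joined by at least one long-range correlation."""
    parent = {name: name for name in spin_systems}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    # Map every peak to the spin system that contains it.
    owner = {peak: name for name, peaks in spin_systems.items() for peak in peaks}
    for peak_a, peak_b in long_range_pairs:
        if peak_a in owner and peak_b in owner:
            parent[find(owner[peak_a])] = find(owner[peak_b])

    groups = {}
    for name in spin_systems:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical spin systems and one long-range correlation between A and B.
spin_systems = {"A": {"a1", "a2"}, "B": {"b1"}, "C": {"c1", "c2"}}
long_range_pairs = [("a2", "b1")]

compounds = link_spin_systems(spin_systems, long_range_pairs)
```

In this toy input, spin systems A and B are assigned to one compound and C to another.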
3.2.4 Workflow of MSM-based metabolite identification
The total workflow of metabolite identification based on molecular structural motifs extracted from NMR spin systems is depicted in Figure 3.5. After identification of a H-C spin system from 2D/3D TOCSY spectra, it is queried against COLMAR MSMMDB. If no hits are returned with RMSD < 2.5 ppm, the spin system information is then queried against pNMR MSMMDB. This two-pronged SUMMIT Motif approach focuses first on the query against the more accurate experimental COLMAR MSMMDB before turning to the larger, but less accurate pNMR MSMMDB. The MSM hits returned by COLMAR
MSMMDB (with RMSD < 2.5 ppm) and the top 15 MSM hits returned by pNMR
MSMMDB (with RMSD < 5.0 ppm) are then subjected to structure determination of the unknown metabolites by spiking NMR experiments and/or additional experiments. Molecular structural motif queries based on COLMAR MSMMDB and pNMR MSMMDB are publicly accessible via the COLMAR suite of web servers (http://spin.ccic.ohio-state.edu/index.php/motif).
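The two-stage query logic of this workflow can be sketched as below. The cutoffs and the top-15 limit are taken from the text; the function names, the dictionary-based databases, and the toy scoring function are assumptions for illustration and do not reflect the COLMAR web server code.

```python
COLMAR_CUTOFF = 2.5  # ppm, from the text
PNMR_CUTOFF = 5.0    # ppm, from the text
PNMR_TOP_N = 15      # from the text

def ranked_hits(query, database, score, cutoff):
    """Score every database motif against the query; keep hits below the cutoff."""
    hits = sorted((score(query, shifts), name) for name, shifts in database.items())
    return [(value, name) for value, name in hits if value < cutoff]

def summit_motif_query(query, colmar_db, pnmr_db, score):
    """Query the experimental database first; fall back to the predicted one."""
    hits = ranked_hits(query, colmar_db, score, COLMAR_CUTOFF)
    if hits:
        return "COLMAR MSMMDB", hits
    return "pNMR MSMMDB", ranked_hits(query, pnmr_db, score, PNMR_CUTOFF)[:PNMR_TOP_N]

# Toy stand-ins: one number per "spin system" and an absolute difference as
# the score. A real query would use the chemical shift RMSD instead.
def toy_score(query, shifts):
    return abs(query - shifts)

colmar_db = {"motif_m1": 10.0}
pnmr_db = {"motif_p1": 3.0, "motif_p2": 9.0}

fallback = summit_motif_query(1.0, colmar_db, pnmr_db, toy_score)
direct = summit_motif_query(9.0, colmar_db, pnmr_db, toy_score)
```

The first toy query finds no experimental hit below 2.5 and falls back to the predicted database, whereas the second is answered by the experimental database directly.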
3.2.5 Sample Preparation
S. Typhimurium was used as a mouse model of S. Typhi infection in vivo, as currently no animal model for the human-specific S. Typhi exists. Naturally resistant
NRAMP1(SLC11A1)+/+ 129x1/SvJ mice were fed a lithogenic diet (1% cholesterol and 0.5% cholic acid; Envigo/Harlan Laboratory, IN) for 8 weeks to induce gallstone formation.
After completing the diet, mice were intraperitoneally infected with 100 µL of PBS containing 10^4 S. Typhimurium or 100 µL of PBS alone. 15 (14) mice were sacrificed 10 days post-infection (PI) for metabolite analysis. About 30 µL pooled bile was collected from the uninfected mice (PBS) and about 65 µL pooled bile was collected from the infected mice
(S. Typhimurium). Both samples were subjected to an immediate metabolite extraction procedure. The freshly collected bile was (sequentially) mixed with 260 µL ice-cold methanol, 260 µL ice-cold chloroform, and 195 µL of ice-cold water and vortexing was applied after each solvent addition. The mixture was then placed on ice for 30 minutes, and centrifuged at 5000 x g for 30 min at 4 °C for phase separation. The polar phase was then lyophilized and re-dissolved in ice-cold water for subsequent ultra-filtration to remove residual macromolecules. The ultra-filtration step was carried out with an
Amicon Ultra 0.5 mL centrifuge filter (MWCO 3 kDa). The filtrate was lyophilized and the powder was reconstituted in 600 µL D2O for NMR measurements with 0.1 mM DSS. To prepare the MS sample, 1.5 mg of the dried sample was re-suspended in 200 μL of H2O.
10 µL of the sample was transferred to a new tube followed by 10-fold dilution with
50%/50% (v/v) ACN/H2O containing 0.1% formic acid. Identification of metabolites was primarily performed on the infected bile samples.
E. coli BL21(DE3) cells were cultured at 37 °C while shaking at 250 rpm in M9 minimal medium with glucose (natural abundance, 5 g/L) added as the sole carbon source. One liter of culture at OD 1 was centrifuged at 5000 x g for 20 min at 4 °C, and the cell pellet was resuspended in 50 mL of 50 mM phosphate buffer at pH 7.0. The cell suspension was then subjected to centrifugation for cell pellet collection. The cell pellet was resuspended in 10 mL of ice-cold water and freeze-thawed three times. The sample was centrifuged at 20,000 x g at 4 °C for 15 min to remove cell debris. Prechilled methanol and chloroform were sequentially added to the supernatant under vigorous vortexing at an H2O/methanol/chloroform ratio of 1:1:1 (v/v/v). The mixture was then left at −20 °C
overnight for phase separation. Next, it was centrifuged at 4000 x g for 20 min at 4 °C, and the clear upper hydrophilic phase was collected and subjected to rotary evaporation to reduce the methanol content. Finally, the sample was lyophilized. The NMR sample was prepared by dissolving the dry sample in 200 μL of D2O with 20 mM phosphate buffer and 0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing, and then transferred to a 3 mm NMR tube.
3.2.6 Experiments and Data Processing
All NMR spectra were collected on a Bruker Avance III 850 MHz spectrometer equipped with a cryogenically cooled TCI probe at 298 K. The 2D 13C-1H HSQC spectra of mouse bile samples and E. coli cell extracts were collected with 512 x 1024 (N1 x N2) complex points along the two dimensions with 32 scans per increment. The spectral widths along the 13C and 1H dimensions were 34206.23 and 10204.08 Hz, respectively, and the transmitter frequency offsets were 75.00 and 4.70 ppm, respectively.
The 2D 13C-1H HSQC-TOCSY spectra were collected with 512 x 2048 (N1 x N2) complex points along the two dimensions with 32 scans per increment. The spectral widths along the 13C and 1H dimensions were 34206.23 and 10204.08 Hz, respectively. The transmitter frequency offsets were 75.00 and 4.70 ppm, respectively.
The 2D 1H-1H TOCSY spectra were collected with 512 x 2048 (N1 x N2) complex points along the two dimensions with 8 scans per increment. The spectral widths along indirect 1H and direct 1H dimensions were 10201.97 and 10204.08 Hz, respectively, and the transmitter frequency offset was 4.70 ppm.
The 3D 13C-1H HSQC-TOCSY spectra were collected with 64 x 100 x 1048 (N1 x N2 x N3) complex points along the three dimensions with 8 scans per increment. The spectral widths along the indirect 13C, indirect 1H and direct 1H dimensions were 34204.76, 10204.09 and 10204.08 Hz, respectively, and the transmitter frequency offsets were 75.00, 4.70 and 4.70 ppm, respectively.
The isotropic mixing times for 2D 13C-1H HSQC-TOCSY, 2D 1H-1H TOCSY and 3D
13C-1H HSQC-TOCSY were 120, 117 and 117 ms, respectively. The relaxation delay (d1) was 1.5 s.
The 2D 13C-1H HSQMBC spectra were collected with 320 x 2048 (N1 x N2) complex points along the two dimensions with 64 scans per increment. The spectral widths along the 13C and 1H dimensions were 40621.32 and 10204.08 Hz, respectively and the transmitter frequency offsets were 90.00 and 4.70 ppm, respectively. The multiple-bond coupling constant was set to 6 Hz. All spectra were zero-filled, Fourier-transformed, and phase- and baseline-corrected using NMRPipe.
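For orientation, the spectral widths quoted in Hz can be converted to ppm by dividing by the Larmor frequency in MHz. The sketch below assumes a 1H frequency of exactly 850.0 MHz and an approximate 13C/1H frequency ratio of 0.25145; the actual spectrometer frequencies are not given in the text, so the results are approximate.

```python
GAMMA_RATIO_13C_1H = 0.25145  # approximate ratio of 13C to 1H Larmor frequencies

def spectral_width_ppm(width_hz, larmor_mhz):
    """Convert a spectral width from Hz to ppm at the given Larmor frequency (MHz)."""
    return width_hz / larmor_mhz

proton_mhz = 850.0                            # nominal value, assumed exact here
carbon_mhz = proton_mhz * GAMMA_RATIO_13C_1H  # about 213.7 MHz

sw_1h_ppm = spectral_width_ppm(10204.08, proton_mhz)
sw_13c_ppm = spectral_width_ppm(34206.23, carbon_mhz)
```

Under these assumptions the HSQC spectral widths correspond to roughly 12 ppm along 1H and roughly 160 ppm along 13C.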
FT-ICR MS spectra were collected on a 15 Tesla FT-ICR mass spectrometer using methods established previously.147 Briefly, after initial calibration by a standard amino acid mixture, electrospray ionization (ESI) mode was selected and the mass range (m/z) was set to 50 - 3000. After tuning, acquisition and calibration, both positive and negative ion mode mass spectra of the background and metabolite mixture were collected.
The FT-ICR mass spectra were calibrated and analyzed based on common compounds in the metabolite mixture. The mass peak list (m/z) for the mass range 100-1000 was generated with the signal-to-noise ratio threshold set to 10. The background mass peaks were removed. Each remaining mass peak (m/z) was then converted to accurate masses by considering possible adducts.
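The conversion of an m/z value into candidate neutral masses can be sketched as follows. The adduct list is an assumption, since the text does not specify which adducts were considered; the mass shifts are standard monoisotopic values corrected for the electron mass, and only singly charged ions are handled.

```python
# Mass shifts (Da) of common singly charged adduct ions (assumed adduct set).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
}

def neutral_masses(mz, mode="positive"):
    """Return {adduct: candidate neutral mass} for a singly charged ion at `mz`."""
    sign = "+" if mode == "positive" else "-"
    return {
        adduct: mz - shift
        for adduct, shift in ADDUCT_SHIFTS.items()
        if adduct.endswith(sign)
    }

# Example: m/z of protonated glutamine (C5H10N2O3, monoisotopic mass 146.06914 Da).
candidates = neutral_masses(147.07642)
```

Each candidate neutral mass would then be matched against molecular formulas; in the glutamine example the `[M+H]+` interpretation recovers 146.06914 Da.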
3.2.7 Classification of hydrophilic metabolites based on lipophilicity logP
Molecular structural motif clustering was applied to hydrophilic metabolites contained in the HMDB. The selection criterion for hydrophilic vs. hydrophobic metabolites was based on the quantitative molecular hydrophobicity (lipophilicity) measure log10(P) (or logP) where P is the 1-octanol/water partition coefficient of a given metabolite.161 For this purpose, lipophilicity was predicted by ALOGP, which is a widely used computational estimator of logP, for 933 known metabolites with both hydrophilic and hydrophobic character taken from the COLMAR database, HMDB and BMRB. The 723 compounds whose
NMR data were measured in D2O were pre-classified as hydrophilic compounds and the 210 compounds whose NMR data were measured in CDCl3 were pre-classified as hydrophobic compounds. Based on the ALOGP distribution (Figure 3.4) an ALOGP criterion < 3.0 was chosen for hydrophilic metabolites.
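The selection rule can be written as a simple filter. The cutoff of 3.0 and the 800 Da molecular weight limit come from the text; the record layout, names, and numerical values below are hypothetical and serve only to illustrate the rule.

```python
LOGP_CUTOFF = 3.0  # ALOGP threshold stated in the text
MW_LIMIT = 800.0   # Da, molecular weight limit stated for the pNMR MSMMDB

def classify(alogp):
    """Label a metabolite as hydrophilic when its predicted logP is below the cutoff."""
    return "hydrophilic" if alogp < LOGP_CUTOFF else "hydrophobic"

# Hypothetical (name, ALOGP, molecular weight in Da) records.
records = [
    ("metabolite_a", -3.2, 146.1),
    ("metabolite_b", 1.4, 420.5),
    ("metabolite_c", 6.8, 386.7),
    ("metabolite_d", 0.5, 912.0),
]

# Keep hydrophilic metabolites below the molecular weight limit.
selected = [
    name
    for name, alogp, mw in records
    if classify(alogp) == "hydrophilic" and mw < MW_LIMIT
]
```

Here `metabolite_c` is rejected as hydrophobic and `metabolite_d` is rejected because of its molecular weight.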
3.2.8 Spin system matching and scoring
The 1H, 13C chemical shifts of each compound in COLMAR MSMMDB and pNMR
MSMMDB were compared with each experimental spin system with the same number of spins. The weighted matching algorithm, known as the Hungarian method using the
Munkres assignment algorithm, was applied to find the closest matching peak pairs between the experimental and predicted spin systems.162 The corresponding chemical shift root-mean-square deviation (RMSD) was calculated between each experimentally determined spin system and each candidate compound according to:
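\[
\mathrm{RMSD} = \left\{ \sum_{i=1}^{N} \left[ \left( C_{i,\mathrm{exp}} - C_{i,\mathrm{pred}} \right)^{2} + \left( \left( H_{i,\mathrm{exp}} - H_{i,\mathrm{pred}} \right) \times 10 \right)^{2} \right] \Big/ 2N \right\}^{1/2} \qquad (1)
\]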
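The matching and scoring step can be sketched as below. To keep the example dependency-free, the optimal one-to-one assignment is found by brute force over all permutations instead of the Hungarian method named in the text; for the small spin systems considered here both give the same optimum, but the brute-force search scales factorially and is only a stand-in. The function names and chemical shift values are hypothetical.

```python
from itertools import permutations
from math import sqrt

def pair_cost(exp_peak, pred_peak):
    """Squared deviation of one (13C, 1H) pair; 1H differences are scaled by 10."""
    (c_exp, h_exp), (c_pred, h_pred) = exp_peak, pred_peak
    return (c_exp - c_pred) ** 2 + ((h_exp - h_pred) * 10.0) ** 2

def best_rmsd(exp_peaks, pred_peaks):
    """Minimum RMSD of Eq. (1) over all one-to-one peak assignments."""
    if len(exp_peaks) != len(pred_peaks):
        raise ValueError("spin systems must contain the same number of spins")
    n = len(exp_peaks)
    best = min(
        sum(pair_cost(e, p) for e, p in zip(exp_peaks, perm))
        for perm in permutations(pred_peaks)
    )
    return sqrt(best / (2 * n))

# Hypothetical (13C, 1H) chemical shifts in ppm, listed in arbitrary order.
experimental = [(60.0, 3.6), (30.0, 2.2), (18.0, 1.0)]
predicted = [(19.0, 0.9), (61.0, 3.5), (30.0, 2.2)]

score = best_rmsd(experimental, predicted)
```

The factor of 10 places 1H deviations on a scale comparable to 13C deviations, so that a 0.1 ppm proton mismatch contributes as much as a 1 ppm carbon mismatch.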