Structure Elucidation and Identification of Known and Unknown Metabolites by Nuclear Magnetic Resonance Spectroscopy

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Cheng Wang

Graduate Program in Chemistry

The Ohio State University

2019

Dissertation Committee

Rafael Brüschweiler, Advisor

James Coe

Marcos Sotomayor


Copyrighted by

Cheng Wang

2019

Abstract

Identification of metabolites is one of the main challenges in metabolomics. Since metabolite identities and their concentrations are often directly linked to the phenotype, such information can be used to map biochemical pathways and understand their role in health and disease. A very large number of metabolites are still unknown, i.e., their spectroscopic signatures do not match those in existing databases, which makes the identification of unknown molecules both imperative and challenging.

This dissertation describes new methods that combine nuclear magnetic resonance (NMR), mass spectrometry (MS), and combinatorial chemoinformatics tools to identify the structures of known and unknown metabolites, and the development of a 2D NMR database of hydrophobic metabolites for their identification in mixtures. Chapter 1 presents a general introduction to metabolomics in systems biology and an overview of current NMR- and MS-based metabolite identification. Chapter 2 focuses on the development of the SUMMIT MS/NMR approach for the identification of metabolites in a model mixture and an E. coli cell extract. It combines 2D and 3D NMR experiments with Fourier transform ion cyclotron resonance MS and MS/MS data to yield the complete structures or fragments that match those in the complex mixture, independent of any spectroscopic database information. SUMMIT MS/NMR greatly assists the targeted or untargeted analysis of unknown compounds in complex mixtures. Chapter 3 describes an efficient motif-based method that identifies molecular motifs from NMR spectra, followed by identification of the complete structures of unknown metabolites. Two databases are assembled, namely COLMAR MSMMDB and pNMR MSMMDB. The motif-based identification method was applied to the hydrophilic extract of mouse bile fluid, and two unknown metabolites were successfully identified. The final chapter illustrates the development of a 2D NMR database of hydrophobic compounds and the application of high-resolution non-uniform sampling 2D real-time pure shift HSQC spectra to metabolomics mixtures for accurate identification. The methods and databases introduced here permit applications to a wide range of metabolomics mixtures for accurate identification of both known and unknown metabolites.

Dedication

To my parents.

Acknowledgments

I would like to express my great gratitude to my advisor, Prof. Rafael Brüschweiler, for his invaluable supervision and guidance along my PhD journey. His dedication to science with superior wisdom deeply motivated and helped me overcome numerous problems in research. Particularly, his intelligent ideas and advice inspired me to discover new directions in multiple research projects.

I would like to thank my dissertation committee members, Profs. James Coe and Marcos Sotomayor, for their continuous support and guidance. They gave me valuable advice for my candidacy proposal, research progress and dissertation. I also thank Profs. Mark Foster and Philip Grandinetti for their help at various stages during my PhD.

I greatly appreciate the Campus Chemical Instrument Center, not only for its state-of-the-art high-field NMR and mass spectrometers, but more importantly for the excellent expertise and helpfulness of its staff scientists. Drs. Alexandar Hansen and Chunhua Yuan mentored me in setting up NMR experiments. Dr. Lei Bruschweiler-Li expertly guided me in the preparation of high-quality NMR samples. Dr. Da-Wei Li helped me advance the analysis of multiple research projects with intelligent computational methods. Dr. Árpád Somogyi helped me collect high-quality ultrahigh-resolution MS data.

I also express sincere thanks to the Brüschweiler Lab members. Particularly, I am grateful to Drs. Bo Zhang and István Timári for their help and discussions in the motif-based unknown metabolite identification, Drs. Mouzh Xie and Jiaqi Yuan for their tutorials and discussions on fundamental NMR tools and experiments, and Drs. Yina Gu, Helena Zacharias, and Gilson C. Santos Jr. for their useful advice on how to enhance the efficiency of daily PhD research.

Vita

1991 Born in Heze, Shandong, China, P.R.

2013 B.S. in Applied Chemistry, China University of

2014 - Graduate Teaching and Research Associate, The Ohio State University

Publications

First authored:

1. Wang, C., Zhang, B., Timári, I., Somogyi, A., Li, D. W., Adcox, H. E., Gunn, J. S., Bruschweiler-Li, L., & Brüschweiler, R. (2019). Accurate and Efficient Determination of Unknown Metabolites in Metabolomics by NMR-Based Molecular Motif Identification. Analytical chemistry. (Accepted)

2. Leggett, A.,* Wang, C.,* Li, D. W., Somogyi, A., Bruschweiler-Li, L., & Brüschweiler, R. (2019). Identification of Unknown Metabolomics Mixture Compounds by Combining NMR, MS, and Cheminformatics. Methods in enzymology, 615, 407-422.

3. Wang, C.,* He, L.,* Li, D. W.,* Bruschweiler-Li, L., Marshall, A. G., & Brüschweiler, R. (2017). Accurate identification of unknown and known metabolic mixture components by combining 3D NMR with Fourier transform ion cyclotron resonance tandem mass spectrometry. Journal of proteome research, 16(10), 3774-3786.

(*co-first author)

Co-authored:

4. Knobloch, T. J., Ryan, N. M., Bruschweiler-Li, L., Wang, C., Bernier, M. C., Somogyi, A., Yan, P. S., Cooperstone, J. L., Mo, X., Brüschweiler, R., Weghorst, C. M., & Oghumu, S. (2019). Metabolic Regulation of Glycolysis and AMP Activated Protein Kinase Pathways during Black Raspberry-Mediated Oral Cancer Chemoprevention. Metabolites, 9(7), 140.

5. Timári, I., Wang, C., Hansen, A. L., Costa dos Santos, G., Yoon, S. O., Bruschweiler-Li, L., & Brüschweiler, R. (2019). Real-Time Pure Shift HSQC NMR for Untargeted Metabolomics. Analytical chemistry, 91(3), 2304-2311.

6. Yuan, J., Zhang, B., Wang, C., & Brüschweiler, R. (2018). Carbohydrate background removal in metabolomics samples. Analytical chemistry, 90(24), 14100-14104.

7. Hansen, A. L., Li, D. W., Wang, C., & Brüschweiler, R. (2017). Absolute Minimal Sampling of Homonuclear 2D NMR TOCSY Spectra for High-Throughput Applications of Complex Mixtures. Angewandte Chemie (International ed. in English), 129(28), 8261-8264.

8. Li, D. W., Wang, C., & Brüschweiler, R. (2017). Maximal clique method for the automated analysis of NMR TOCSY spectra of complex mixtures. Journal of biomolecular NMR, 68(3), 195-202.

Fields of Study

Major Field: Chemistry

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures

Introduction
1.1 Metabolomics in Systems Biology
1.2 NMR-Based Metabolomics
1.2.1 Basics of NMR
1.2.2 One-dimensional and Multidimensional NMR
1.2.3 NMR Metabolomics Database
1.3 Overview of Identification of Metabolites
1.3.1 Identification of Metabolites by Mass Spectrometry
1.3.2 Identification of Metabolites by NMR
1.3.3 MS and NMR Hybrid Approaches for Identification of Metabolites

Accurate Identification of Unknown and Known Metabolic Mixture Components by Combining 3D NMR with Fourier Transform Ion Cyclotron Resonance Tandem Mass Spectrometry
2.1 Introduction
2.2 Materials and Methods
2.2.1 Identification of Unique Elemental Composition from Ultrahigh-Resolution FT-ICR Mass Spectra
2.2.2 Identification of Individual Spin Systems of a Mixture from a 3D 13C-1H HSQC-TOCSY NMR Spectrum
2.2.3 Structure Manifold Generation and 2D HSQC NMR Spectra Prediction
2.2.4 Weighted Matching between Experimental and Predicted NMR Spectra
2.2.5 Molecular Structure Motif Identification of Compounds
2.2.6 Sample Preparation
2.2.7 NMR Experiments and Data Processing
2.2.8 FT-ICR MS Experiments and Processing
2.3 Results and Discussion
2.3.1 Demonstration of SUMMIT MS/NMR to Identify Metabolites
2.3.2 Application to a 25-Compound Model Mixture
2.3.3 Application to E. coli Cell Lysate
2.4 Conclusion

Accurate and Efficient Determination of Unknown Metabolites in Metabolomics by NMR-Based Molecular Motif Identification
3.1 Introduction
3.2 Materials and Methods
3.2.1 Definition of spin systems, molecular structural motifs, and COLMAR MSMMDB curation
3.2.2 Curation of the empirically predicted pNMR MSMMDB
3.2.3 Single and multiple spin system analysis from 2D and 3D NMR experiments
3.2.4 Workflow of MSM-based metabolite identification
3.2.5 Sample Preparation
3.2.6 Experiments and Data Processing
3.2.7 Classification of hydrophilic metabolites based on lipophilicity logP
3.2.8 Spin system matching and scoring
3.2.9 Quantitative metric on evaluation of the MSM identification
3.3 Results
3.3.1 Evaluation of COLMAR and pNMR MSMMDB in identification of known molecules in bile and E. coli extracts
3.3.2 Structure elucidation of unknown metabolites
3.3.3 Coverage of COLMAR motifs of HMDB
3.4 Discussion
3.5 Conclusion

COLMAR Lipids Database for 2D NMR-Based Lipidomics
4.1 Introduction
4.2 Methods and Experimental Section
4.2.1 Integration of Hydrophobic Compound 2D 13C-1H HSQC Database
4.2.2 Sample Preparation
4.2.3 NMR Experiments and Processing
4.3 Results and Discussion
4.3.1 COLMAR Lipids Web Server with Hydrophobic Metabolite 2D HSQC Database
4.3.2 Disambiguation of Lipid Identification Using Complementary Information
4.3.3 Accurate Lipid Identification by High-resolution 2D NUS Real-time Pure Shift HSQC
4.4 Conclusion

Bibliography

List of Tables

Table 1.1. The advantages and limitations of NMR spectroscopy as an analytical tool for metabolomics research in comparison with mass spectrometry.
Table 2.1. Example of chemical shift (c.s.) matching results for valine.
Table 2.2. SUMMIT results for 25-compound model mixture.
Table 2.3. Effect of cutoff threshold of mass peak amplitudes on the rank of SUMMIT results for 25-compound model mixture.
Table 2.4. Metabolites identified in E. coli cell lysate and verified by COLMAR web server and SUMMIT MS/NMR.
Table 2.5. Quantum-chemical calculations of NMR chemical shifts for two selected compounds, together with matching results from experimental spectra.
Table 3.1. Categorization of COLMAR and HMDB hydrophilic compounds according to their molecular structure motifs.
Table 3.2. Examples of molecular structural motif identification of bile metabolites by COLMAR and pNMR MSMMDB queries.
Table 3.3. True and false positive top hits of E. coli metabolites with various RMSD thresholds.
Table 3.4. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in COLMAR MSMMDB.
Table 3.5. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in HMDB.
Table 4.1. Composition of artificial micelles incubated with Caco-2 human intestinal cells for 6 hours prior to experiment termination.
Table 4.2. Hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.
Table 4.3. Full hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.
Table 4.4. Hydrophobic metabolite identification of lung tissue using COLMAR Lipids web server.
Table 4.5. Different ratios of fatty acids utilized in the micelles with which Caco-2 human intestinal cells were incubated for 6 h.

List of Figures

Figure 1.1. Correlation between main strategies in systems biology.
Figure 1.2. 1D 1H NMR spectrum collected from NIST SRM-1950 human serum at a 700 MHz NMR spectrometer.
Figure 1.3. Screenshot of COLMARm web server with the identification of isoleucine in human serum.
Figure 2.1. Extraction of spin systems of individual mixture compounds from 3D 13C-1H HSQC-TOCSY.
Figure 2.2. Illustration of the spin system refinement procedure based on 2D HSQC, 2D TOCSY, and 2D/3D HSQC-TOCSY NMR spectra.
Figure 2.3. Predicted chemical shifts compared with their experimental chemical shifts of 25-compound model mixture.
Figure 2.4. 2D 13C-1H HSQC of L-glutamine, glutathione, and L-glutamic acid mixture.
Figure 2.5. Putatively annotated valine spin system in a 25-metabolite model mixture extracted from 3D HSQC-TOCSY spectrum and confirmed by 2D TOCSY and 2D HSQC-TOCSY.
Figure 2.6. Molecular structural motifs identified by SUMMIT from 122 different compound candidates that all match the spin system of valine.
Figure 2.7. Positive 9.4 T FT-ICR broadband mass spectrum of E. coli cell lysate.
Figure 2.8. Identification of best matching metabolites in a 25-compound model mixture by SUMMIT MS/NMR.
Figure 2.9. FT-ICR MS/MS of glutamine and lysine in 25 metabolite mixture.
Figure 2.10. Infrared multiphoton dissociation positive electrospray ionization 9.4 T FT-ICR product ion mass spectrum of arginine and ornithine.
Figure 2.11. Collision-induced dissociation (normalized collision energy 22) Velos Pro product ion mass spectrum of valine.
Figure 2.12. Spin system of an unknown compound from an E. coli cell lysate extracted from 3D HSQC-TOCSY and verified by 2D TOCSY and 2D HSQC-TOCSY.
Figure 3.1. Definition of 1st and 2nd shell molecular motifs based on a spin system.
Figure 3.2. Examples of metabolites with identical molecular structural motifs (in color).
Figure 3.3. Box plots of chemical shift errors of MSMs.
Figure 3.4. Histogram of ALOGP values of hydrophilic and hydrophobic metabolites.
Figure 3.5. Workflow of molecular structural motif (MSM) based unknown metabolite identification.
Figure 3.6. ROC curve with AUC of 0.851 of true and false positive top hits of E. coli metabolites with various RMSD thresholds.
Figure 3.7. Identification of L-alanyl-L-valine metabolite in mouse bile extracts.
Figure 3.8. Identification of two unknown spin systems of taurocholic acid in mouse bile extracts.
Figure 3.9. Identification of taurocholic acid in mouse bile extracts.
Figure 3.10. Graph-theoretical representation of molecular motif clustering of current COLMAR MSMMDB molecules.
Figure 3.11. Distribution of the number of molecules in the 50 most common motifs of hydrophilic compounds of the HMDB.
Figure 4.1. 1D 1H and 2D 13C-1H HSQC NMR spectra of Caco-2 cell extract.
Figure 4.2. Identification of hydrophobic metabolites in Caco-2 cell extract with COLMAR Lipids web server.
Figure 4.3. Comparison of 2D 13C-1H HSQC NMR spectra and LC-MS/MS spectra pairs of structural isomers.
Figure 4.4. Comparison of performance of different 2D 13C-1H HSQC experiments in the CH=CH region for lipids identification in a model mixture.
Figure 4.5. Identification of eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) in different 2D 13C-1H HSQC experiments in Caco-2 cell extract.
Figure 4.6. Differentiation of Caco-2 cells by NMR metabolomics analysis.
Figure 4.7. Spectral comparison of crowded regions of different types of 2D 13C-1H HSQC experiments in Caco-2 cell extract and lung tissue.
Figure 4.8. Spectral comparison of 6.25% NUS HSQC NMR spectrum (red) and US HSQC spectrum (blue) of mouse lung tissue extracts.

Introduction

Metabolomics, which is defined as the comprehensive and quantitative analysis of all metabolites in biological systems, has been demonstrated to possess high potential in disease diagnosis and the understanding of the physiological state in cells, tissues and biological fluids. As an essential part of metabolomics, metabolite identification plays a key role in extending the knowledge of biological systems. However, identification of metabolites is still a major bottleneck in metabolomics due to the huge number of potentially interesting metabolites that remain unknown. Two leading analytical platforms, Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR), have been extensively deployed to identify known and unknown metabolites in complex mixtures in a high-throughput manner with numerous approaches. Specifically, NMR allows for structural characterization and increases the efficiency of unambiguous identification of metabolites. This chapter gives an overview of metabolomics in systems biology and metabolite identification by NMR and MS.

1.1 Metabolomics in Systems Biology

Systems biology refers to the system-level understanding of a biological system, which focuses on the structure and dynamics of cellular and organismal function.1-4 The development of omics strategies in systems biology provides comprehensive data (genomics, transcriptomics, proteomics, and metabolomics) towards understanding the organizational principles of cellular functions at different levels.5, 6 Omics strategies aim at identifying the entire set of gene products (transcripts, proteins, and metabolites) in biological tissues, cells, fluids or organisms. Figure 1.1 shows how the various omics approaches in systems biology are correlated with each other.7 Metabolomics is defined as the comprehensive and quantitative analysis of all metabolites (< 1,500 Da) present within a tissue, cell, fluid or entire organism.8-10 Unlike other “omics” approaches, metabolomics connects environmental factors with gene-related outcomes that directly reflect the underlying biochemical activity and the state of cells or tissues. This provides fingerprints toward the understanding of the structure and dynamics of system networks through metabolic pathway analysis. Besides metabolomics, several related concepts have been introduced in the past decades, such as metabonomics. Proposed by Nicholson et al.,11, 12 metabonomics is defined as the quantitative measurement of the multiparametric metabolic responses of living systems to pathophysiological stimuli or genetic modification, which includes not only the intracellular molecules but also the components of extracellular biofluids. The term metabolome, first introduced by Oliver et al.,13 describes the entire complement of small molecules (< 1,500 Da) in a biological system such as an organism, organ, tissue or cell. Metabolic profiling describes the analysis of a small number of metabolites for the investigation of selected biochemical pathways in a biological system. As a subdivision of metabolomics, lipidomics is defined as “the full characterization of lipid molecular species and biological roles with respect to expression of proteins involved in lipid metabolism and function, including gene regulation.”14-16 One focus of lipidomics is to identify fingerprints of lipid molecular species in diseases for therapeutic development.

1.2 NMR-Based Metabolomics

Metabolomics methodologies fall into two main categories, termed targeted and untargeted metabolomics. Targeted metabolomics refers to the measurement of targeted groups of chemically characterized and biochemically known metabolites in a biological system.17, 18 By taking advantage of the comprehensive understanding of known metabolites and metabolic pathways, accurate identification and quantitation of a set of targeted metabolites in biological samples can be achieved on a variety of analytical platforms such as gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS) and nuclear magnetic resonance spectroscopy. Untargeted metabolomics is the comprehensive analysis of all detectable signals in a biological sample set, followed by the assignment of many of these signals to metabolites by screening against metabolomics databases.19-21 This section gives a brief introduction to NMR-based methodologies for both targeted and untargeted metabolomics.

1.2.1 Basics of NMR

Nuclear magnetic resonance (NMR) can be traced back to 1938, when Isidor Rabi accurately measured nuclear magnetic moments using magnetic resonance absorption of molecular beams. In 1946, Felix Bloch and Edward Mills Purcell successfully demonstrated NMR on liquids and solid materials. With the discovery of the “chemical shift” of nuclei such as 1H and 31P, NMR spectroscopy has become a tool for chemists for the structure elucidation of molecules.

Among the various types of spectroscopic methods, NMR spectroscopy uses the lowest irradiation energy ($\sim 10^{5}$ Hz) for excitation. The proton has the intrinsic property of spin 1/2. This spin gives rise to a magnetic moment, which is called the nuclear magnetic moment. When placed in a large external magnetic field, the spin can adopt a low-energy state, a high-energy state, or any quantum-mechanical superposition thereof. Such a two-state description applies to all nuclei with the nuclear spin quantum number I = 1/2, such as 1H, 13C, 15N, and 31P. The energy difference between the two states is:

$$\Delta E = \frac{h \gamma B_0}{2\pi} \qquad (1.1)$$

where $h$ is Planck's constant ($6.63 \times 10^{-27}$ erg s), $\gamma$ is the gyromagnetic ratio, and $B_0$ is the external magnetic field. Combined with the Bohr condition ($\Delta E = h\nu$), the above equation can be written as:

$$\nu = \frac{\gamma B_0}{2\pi} \qquad (1.2)$$

where $\nu$ is the transition or resonance frequency between the two energy states. The transition of nuclear spins from the higher energy state to the lower energy state induces a voltage that can be detected. The most critical perturbation of the NMR frequency is the “shielding” effect by the surrounding shells of electrons, which gives rise to atom-specific chemical shifts (see below). This shift in the NMR frequency is dependent on the chemical structure of the molecule, as well as ring current effects, hydrogen bonds, and solvent accessibility. The NMR frequencies are also proportional to the magnetic field strength.
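As a rough numerical illustration of Equations 1.1 and 1.2, the following minimal Python sketch evaluates the resonance frequency and the energy gap for a given nucleus and field. The gyromagnetic ratios are rounded literature values supplied here as assumptions (they are not taken from this dissertation), and the function names are hypothetical.

```python
import math

# Approximate gyromagnetic ratios in rad s^-1 T^-1 (rounded literature
# values, supplied as assumptions for this illustration).
GAMMA = {
    "1H": 267.522e6,
    "13C": 67.283e6,
    "15N": -27.116e6,
    "31P": 108.29e6,
}

PLANCK_ERG_S = 6.626e-27  # Planck's constant in erg s


def larmor_frequency_hz(nucleus: str, b0_tesla: float) -> float:
    """Magnitude of the resonance frequency, nu = gamma * B0 / (2 * pi)."""
    return abs(GAMMA[nucleus]) * b0_tesla / (2.0 * math.pi)


def energy_gap_erg(nucleus: str, b0_tesla: float) -> float:
    """Energy difference between the two spin states, Delta E = h * nu."""
    return PLANCK_ERG_S * larmor_frequency_hz(nucleus, b0_tesla)


if __name__ == "__main__":
    b0 = 16.44  # tesla; roughly the field of a 700 MHz (1H) spectrometer
    for nucleus in ("1H", "13C"):
        nu = larmor_frequency_hz(nucleus, b0)
        print(f"{nucleus}: {nu / 1e6:.1f} MHz, gap {energy_gap_erg(nucleus, b0):.3e} erg")
```

Because the frequency scales linearly with the field, doubling `b0` doubles both the frequency and the energy gap.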

To define the NMR frequency without specifying the external magnetic field strength, a reference compound (tetramethylsilane, TMS) is used to calculate a relative frequency, the so-called chemical shift. The chemical shift is calculated based on the following equation:

$$\delta = \frac{\nu - \nu_0}{\nu_0} \qquad (1.3)$$

where $\delta$ is the chemical shift (expressed in parts per million, ppm), $\nu$ is the frequency of interest, and $\nu_0$ is the frequency of TMS.
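The field independence of the chemical shift in Equation 1.3 can be checked with a short Python sketch. The function names and the example frequencies are hypothetical; the factor of 10^6 simply expresses the dimensionless ratio in ppm.

```python
def chemical_shift_ppm(nu_hz: float, nu_ref_hz: float) -> float:
    """Chemical shift delta = (nu - nu_ref) / nu_ref, expressed in ppm."""
    return (nu_hz - nu_ref_hz) / nu_ref_hz * 1e6


def offset_hz(delta_ppm: float, nu_ref_hz: float) -> float:
    """Inverse relation: frequency offset from the reference in Hz."""
    return delta_ppm * 1e-6 * nu_ref_hz


if __name__ == "__main__":
    # The same nucleus observed at two field strengths: the offset in Hz
    # differs, but the chemical shift in ppm is identical.
    for nu_ref in (500e6, 700e6):
        nu = nu_ref + offset_hz(1.0, nu_ref)
        print(f"reference {nu_ref / 1e6:.0f} MHz: "
              f"offset {nu - nu_ref:.0f} Hz, shift {chemical_shift_ppm(nu, nu_ref):.3f} ppm")
```

A signal 500 Hz from the reference at 500 MHz and one 700 Hz from the reference at 700 MHz both correspond to 1 ppm, which is why shifts rather than raw frequencies are tabulated in spectral databases.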

The observable NMR signal, generated by excited nuclear spins with a net magnetic moment perpendicular to the external magnetic field, oscillates and decays due to relaxation and is called the free induction decay (FID). To convert the time-domain data to frequency-domain data, a Fourier transform is applied to the FID, followed by phase correction. Sensitivity and resolution are key criteria when evaluating the quality of an NMR spectrum before deeper spectral analysis. The digital resolution of the spectrum can be slightly improved by simply appending zeros to the tail of the FID before Fourier transformation (zero filling). In addition, multiplying the FID by a window function before Fourier transformation, so-called apodization, can increase either the sensitivity or the resolution. For instance, an exponential decay function can increase sensitivity and a shifted Gaussian function can enhance the resolution.
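The processing chain described above (apodization, zero filling, Fourier transformation) can be sketched on a synthetic FID. This is a toy example using only the Python standard library, with a naive discrete Fourier transform in place of an FFT; all parameter values and function names are illustrative assumptions, and phase correction is omitted because the synthetic signal starts with zero phase.

```python
import cmath
import math


def synth_fid(freq_hz, t2_s, dwell_s, n_points):
    """Single decaying complex sinusoid, used as a toy free induction decay."""
    return [
        cmath.exp(2j * math.pi * freq_hz * k * dwell_s) * math.exp(-k * dwell_s / t2_s)
        for k in range(n_points)
    ]


def exponential_window(fid, dwell_s, lb_hz):
    """Exponential apodization with line broadening lb_hz (favors sensitivity)."""
    return [s * math.exp(-math.pi * lb_hz * k * dwell_s) for k, s in enumerate(fid)]


def zero_fill(fid, n_total):
    """Append zeros to the tail of the FID up to n_total points."""
    return list(fid) + [0j] * (n_total - len(fid))


def dft(signal):
    """Naive O(N^2) discrete Fourier transform; adequate for a small demo."""
    n = len(signal)
    return [
        sum(s * cmath.exp(-2j * math.pi * j * k / n) for k, s in enumerate(signal))
        for j in range(n)
    ]


def peak_frequency_hz(spectrum, dwell_s):
    """Frequency of the most intense point of the magnitude spectrum."""
    n = len(spectrum)
    j_max = max(range(n), key=lambda j: abs(spectrum[j]))
    if j_max >= n // 2:  # upper half of the bins holds negative frequencies
        j_max -= n
    return j_max / (n * dwell_s)


dwell = 0.001  # 1 ms dwell time, i.e. 1000 Hz spectral width
fid = synth_fid(freq_hz=125.0, t2_s=0.05, dwell_s=dwell, n_points=128)
processed = zero_fill(exponential_window(fid, dwell, lb_hz=2.0), 256)
spectrum = dft(processed)
peak = peak_frequency_hz(spectrum, dwell)
spacing_before = 1.0 / (len(fid) * dwell)      # Hz per point without zero filling
spacing_after = 1.0 / (len(spectrum) * dwell)  # Hz per point after zero filling
```

Zero filling from 128 to 256 points halves the spacing between spectral points, while the exponential window trades a slightly broader line for reduced noise in the tail of the FID.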

1.2.2 One-dimensional and Multidimensional NMR

The 1D 1H NMR experiment is the most prevalent 1D NMR experiment in the vast majority of NMR-based metabolomics studies.22-27 This is because 1H, with its high (~99%) natural isotopic abundance, is more sensitive than other nuclei (13C, 15N). More importantly, protons are present in most metabolites. Figure 1.2 shows an example of a 1D 1H NMR spectrum of human serum.28 Hundreds of metabolites with unique peaks can be identified from a single 1D 1H NMR spectrum. The exceptionally high resolution is achieved due to the narrow linewidths (often <1 Hz). 1D 1H NMR is fast, typically requiring only 5-15 minutes per sample. This enables high-throughput sample analysis with automated 1D 1H NMR approaches.

Signals in a 1D 1H NMR spectrum tend to overlap when the number of species in a complex metabolomics mixture exceeds a few hundred, which severely hampers unambiguous metabolite identification. Multidimensional NMR spectroscopy is often applied to overcome this issue by adding dimensions (e.g., 13C) to spread out the peaks. The two-dimensional heteronuclear 13C-1H single quantum coherence spectroscopy (2D 13C-1H HSQC) experiment measures a 2D heteronuclear chemical shift correlation between directly bonded 13C and 1H, which provides one cross-peak for each C-H pair in a spectrum. Over the decades, a number of customized 2D HSQC experiments have been developed for metabolomics studies.29-31 For instance, extrapolated time-zero HSQC (HSQC0) enables simultaneous quantification and identification of compounds in metabolite mixtures.32 The approach starts with a series of HSQC spectra acquired with incremented repetition times, which can be extrapolated back to zero time to yield a time-zero 2D 13C-1H HSQC spectrum (HSQC0). The signal intensities are proportional to the concentrations of individual metabolites and can therefore be used to determine their relative concentrations. By clustering the HSQC0 cross-peaks based on normalized intensities, the peaks of each cluster can be assigned to specific compounds. The multiplicity-edited 2D 13C-1H HSQC NMR spectrum shows the cross-peaks of 13C-1H correlations in a phase-sensitive manner. The methyl (CH3) and methine (CH) moieties show the opposite phase to that of methylene groups (CH2), which provides tremendous power for assigning signals in crowded regions in the NMR spectra of biological samples.
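The time-zero extrapolation behind HSQC0 can be illustrated with a small Python sketch. It assumes, as a simplification not spelled out in the text above, that each additional repetition attenuates a cross-peak by a constant factor f, so that the logarithm of the intensity is linear in the repetition index and can be extrapolated back to index zero. The function name and the synthetic intensities are hypothetical.

```python
import math


def extrapolate_time_zero(repetitions, intensities):
    """Least-squares fit of ln(A_i) = ln(A_0) + i * ln(f); returns (A_0, f)."""
    xs = [float(i) for i in repetitions]
    ys = [math.log(a) for a in intensities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), math.exp(slope)


# Synthetic cross-peak intensities for three spectra (not measured data).
true_a0, true_f = 100.0, 0.8
reps = [1, 2, 3]
measured = [true_a0 * true_f ** i for i in reps]

a0, f = extrapolate_time_zero(reps, measured)
print(f"extrapolated time-zero intensity: {a0:.2f}, attenuation factor: {f:.3f}")
```

Under this assumption, the extrapolated intensity removes peak-specific attenuation, which is what allows intensities of different cross-peaks to be compared as concentrations.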

2D total correlation spectroscopy (TOCSY) measures the chemical shift of a given nucleus and the correlated chemical shifts of other nuclei within the total spin system of a given molecule. With a sufficiently long isotropic mixing time (>80 ms), 2D 1H-1H TOCSY yields complete spin-spin connectivity information in molecular spin systems.33, 34 The 2D Heteronuclear Single Quantum Coherence-Total Correlation Spectroscopy (HSQC-TOCSY) experiment is a hybrid experiment that combines 2D HSQC and 2D TOCSY to generate through-bond correlations between a 13C-attached 1H and all other coupled 1H in a spin system. The experiment is extremely powerful when severe overlap occurs in 2D TOCSY.35 The 2D Heteronuclear Multiple Bond Correlation (HMBC) experiment correlates chemical shifts of two types of nuclei (1H, 13C) separated from each other by two or more bonds. 2D HMBC enables detection of 1H and 13C nuclei that are separated by quaternary carbons or heteroatoms (e.g., N, O), which connects isolated spin systems in a metabolite. Although relatively insensitive, the 2D HMBC experiment serves as a supplementary experiment to help connect one detected spin system to additional spin systems.36

1.2.3 NMR Metabolomics Database

Over the decades, a number of spectral databases have been established along with the development of NMR-based metabolomics approaches, such as the Human Metabolome Database (HMDB), BioMagResBank (BMRB) and the COLMAR database.37-42 The Human Metabolome Database is currently the world’s largest and most comprehensive organism-specific metabolomics database.37, 38 As a standard metabolomics resource for human metabolic studies, HMDB not only offers information about the roles of a given metabolite in its metabolic context but also provides NMR and MS spectral information and links to other public chemical databases. HMDB has been updated to version 4.0, which includes a large number of predicted MS/MS and GC–MS reference spectral data and predicted metabolite structures to facilitate novel metabolite identification.38 As a central repository for experimental NMR spectral data, BioMagResBank (BMRB) includes a comprehensive metabolomics database, containing an archive of experimental NMR spectral data of metabolomic compounds such as 1D 1H, 2D 13C-1H HSQC, 2D 1H-1H TOCSY and 2D 13C-1H HMBC.40 All the experimental NMR spectral data in BMRB are publicly accessible and downloadable for direct comparison and referencing. Complex Mixture Analysis by NMR (COLMAR), developed in the Brüschweiler laboratory, is a suite of NMR web servers that allows a user to upload experimental 2D NMR data and query them against the COLMAR database.42-45 The COLMAR HSQC database unifies the NMR spectroscopic information from HMDB and BMRB and adds isomer information that distinguishes isomers given a set of chemical shifts.42 COLMAR HSQC increases the accuracy of metabolite identification by more than 37% and decreases the false positive identification rate by more than 82%, compared with other existing 13C-1H HSQC metabolomics databases. The COLMAR TOCCATA database integrates spin system information with the chemical shifts.43, 45 The chemical shifts of each compound are subdivided into isomeric states followed by separation into the individual spin systems. Groups of C-H spin-pairs are considered as separate spin systems if they are separated by at least one noncarbon atom or quaternary carbon. The COLMAR TOCCATA database allows the unambiguous identification of spin systems and isomeric states of metabolites in complex metabolomics mixtures.
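The general idea of querying experimental HSQC cross-peaks against a reference database can be sketched as follows. This is a generic tolerance-based matching illustration and is not the scoring scheme of COLMAR or any other server named above; the compound names, chemical shifts, tolerances and cutoff are all hypothetical placeholders.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (13C shift in ppm, 1H shift in ppm)

# Hypothetical reference entries; the shift values are illustrative only.
REFERENCE_DB: Dict[str, List[Peak]] = {
    "compound_A": [(19.0, 1.00), (31.5, 2.25), (63.0, 3.60)],
    "compound_B": [(19.0, 1.00), (42.0, 2.80)],
}


def matching_ratio(query: List[Peak], reference: List[Peak],
                   tol_c: float = 0.3, tol_h: float = 0.03) -> float:
    """Fraction of reference peaks with a query peak within both tolerances."""
    matched = 0
    for ref_c, ref_h in reference:
        if any(abs(ref_c - q_c) <= tol_c and abs(ref_h - q_h) <= tol_h
               for q_c, q_h in query):
            matched += 1
    return matched / len(reference)


def rank_candidates(query: List[Peak], db: Dict[str, List[Peak]],
                    min_ratio: float = 0.6) -> List[Tuple[str, float]]:
    """Candidates whose matching ratio reaches min_ratio, best first."""
    scored = [(name, matching_ratio(query, peaks)) for name, peaks in db.items()]
    hits = [(name, ratio) for name, ratio in scored if ratio >= min_ratio]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Peaks picked from a hypothetical mixture spectrum.
query_peaks = [(19.1, 1.01), (31.4, 2.26), (63.2, 3.59), (72.0, 4.10)]
ranking = rank_candidates(query_peaks, REFERENCE_DB)
print(ranking)
```

In this toy case only compound_A is returned, since every one of its reference peaks has a counterpart in the query, whereas compound_B is only half matched. Storing shifts per isomeric state or per spin system, as described above, would amount to keeping several such peak lists per compound.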

1.3 Overview of Identification of Metabolites

Identification of metabolites is one of the main challenges in metabolomics. As a prerequisite for both targeted and untargeted metabolomics, numerous methods of metabolite identification have been developed.46-49 Mass spectrometry and NMR are two leading analytical platforms that allow detecting hundreds or thousands of signals in a single biological sample.50-55 MS and NMR have distinct advantages and disadvantages in terms of identification of metabolites. A brief comparison between MS and NMR is summarized in Table 1.1.56 The typical process of metabolite identification consists of two essential steps. First, the signals detected by MS or NMR are screened against existing metabolomics databanks such as HMDB or METLIN to identify the maximal number of known metabolites.37-39, 57 Second, the signals that are not identified belong to unknown metabolites, which are further analyzed by computational approaches or supplemental experiments. The Metabolomics Standards Initiative (MSI) recognizes 4 levels of metabolite identification.58 Level one, known as “identified compound”, requires a minimum of two independent and orthogonal data (such as retention time and mass spectrum) compared directly to an authentic reference standard. Level two, known as

“putatively annotated compound”, refers to a compound that is identified by spectral analysis and/or similarity to data in a public database but without direct comparison to a reference standard as for Level one. Level three, known as “putatively characterized compound class”, denotes unidentified per se but the data allows the metabolite to be 8 placed in a certain compound class. Level four, known as “unknown compound”, is defined as unidentified or unclassified but characterized by spectral data (LC-MS, NMR).

In the following sections, current advances in metabolite identification based on MS, NMR, or MS/NMR hybrid approaches are discussed.

1.3.1 Identification of Metabolites by Mass Spectrometry

Mass spectrometry (MS) offers high sensitivity, selectivity and a wide dynamic range for the analysis of metabolomics samples.49, 59 MS instrumentation has undergone significant development in recent years. The availability of various ionization methods (e.g., electrospray ionization (ESI)) in both positive and negative modes enables the ionization of various metabolite classes and facilitates the applications to metabolite identification in various metabolomics samples. In addition, versatile mass analyzers configured in tandem mass spectrometry instruments improve metabolite identification by acquiring highly resolved tandem MS spectra. The integration of gas chromatography or liquid chromatography (LC) to MS further improves the metabolite identification by reducing the complexity of the sample via separation. With the benefit of simple sample preparation and broad coverage of detectable metabolites, GC-MS and LC-MS are two of the most favored approaches for metabolite identification in MS-based metabolomics. The m/z value of molecular ion is queried against database(s) such as METLIN and all possible chemical structures that are consistent with the mass are returned as candidates.50, 57, 60 The experimental retention time and MS/MS spectrum of the ion of interest are compared with the retention time and MS/MS spectra of each compound to confirm or reject the metabolite candidate. With exponentially increasing chemical space, the mass spectral databases, the key component in metabolites identification with experimental MS/MS spectra, have been developed considerably over the past years in MS-based metabolite identification.61-65 Current public and commercial mass spectral databases contain around

1–2 million spectra of one million unique compounds, which are derived from authentic experimental spectra of reference compounds. However, more than 100 million known compounds that have been discovered or synthesized are cataloged in PubChem and ChemSpider, most of which lack MS spectra in current databases.66 Therefore, a number of approaches for generating in silico mass spectra, especially MS/MS spectra, have been developed using quantum chemistry, machine learning and chemical reaction-based methods.67-72

Moreover, new MS-based metabolite identification frameworks that integrate the analysis of MS spectra and metabolite annotations facilitate the process of MS spectra analysis and enhance the efficiency of metabolite identification.73-75 For instance, the integration of BinVestigate, MS-DIAL 2.0 and MS-FINDER 2.0 enables the identification of novel metabolites that are distinct from canonical pathways.76 Specifically, BinVestigate searches for biological metadata by querying BinBase, a GC-MS-based untargeted metabolomics database containing 1,561 studies with 114,795 samples for various species, organs, matrices, and experimental conditions. MS-DIAL 2.0 enables the processing of both GC-MS and LC-MS data. MS-FINDER 2.0 aims to identify structures of known and unknown metabolites based on a combination of 14 metabolome databases and an enzyme promiscuity library. The integrated framework is demonstrated in mammalian samples for the discovery of new methylation products and microbial cells for the identification of plant-specific metabolites and transformations of exposome compounds.

1.3.2 Identification of Metabolites by NMR

NMR spectroscopy has been widely used in metabolomics studies.77-84 Hundreds of metabolites can be detected and measured simultaneously without the need for elaborate sample preparation or fractionation. NMR spectra are highly reproducible, minimizing the spectrometer influence on measurements of biological samples. The majority of NMR-based metabolite identification approaches start by analyzing NMR data against existing metabolomics databases to assign signals to known metabolites, followed by computational strategies to identify unassigned signals. 1D 1H NMR is the most widely used NMR approach in metabolite identification. Software such as Chenomx NMR Suite simplifies and facilitates the analysis of 1D 1H NMR. Chenomx provides a user-friendly interactive visualization interface and embeds a spectral library of over 600 metabolites to automatically identify metabolites in 1H NMR spectra of biological samples.

When peaks are excessively overlapped in 1D 1H NMR spectra of a complex mixture, 2D NMR is often used to aid unambiguous metabolite identification. 2D 13C-1H HSQC is primarily implemented in 2D NMR based metabolite identification. By querying peak lists (13C, 1H) against NMR spectral databases such as HMDB and COLMAR, the majority of cross-peaks can be assigned to known metabolites.37-39, 85 For instance, the COLMAR HSQC web server allows users to upload peak lists of HSQC and run automatic referencing based on an internal standard such as sodium trimethylsilylpropanesulfonate (DSS). The query step only takes a few seconds, after which the metabolite identification result is generated with quantitative metrics such as RMSD, matching ratio and uniqueness. However, it is possible that two HSQC peaks that are from different compounds can be assigned to another compound due to the similarity of chemical shifts. Therefore, validation of this metabolite identification by additional experiments is needed.

The COLMARm web server was developed to integrate 2D HSQC, 2D TOCSY and 2D HSQC-TOCSY for metabolite identification and validation. COLMARm allows for the simultaneous spectral analysis based on 2D HSQC, 2D TOCSY and 2D HSQC-TOCSY on a publicly accessible web server (http://spin.ccic.ohio-state.edu/index.php/colmarm/index).86 Users can upload up to three NMR spectra of a metabolomics sample, which include a required 2D HSQC and optional 2D TOCSY and 2D HSQC-TOCSY spectra. The COLMARm web server queries the experimental HSQC spectrum against the COLMAR HSQC metabolomics database and returns a list of metabolite candidates. For each candidate metabolite, the corresponding 2D TOCSY and 2D HSQC-TOCSY spectra are generated based on the COLMAR TOCCATA database and are then compared with the experimental TOCSY and HSQC-TOCSY spectra. Depending on the agreement of the database-based TOCSY and HSQC-TOCSY with the experimental TOCSY and HSQC-TOCSY, the candidate metabolites can either be confirmed or excluded. The COLMARm web server was first applied to a human serum sample and identified 62 metabolites, of which 14 were identified by NMR for the first time with high accuracy. Figure 1.3 shows an example of the COLMARm web server with the identification of isoleucine in human serum.
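The peak-list query described above can be illustrated with a minimal sketch. It is not the COLMAR implementation: the function names, the chemical shift tolerances, the acceptance threshold and the two reference entries are hypothetical, and a simple greedy matching is assumed. Only a matching ratio is computed here; RMSD and uniqueness are omitted.

```python
# Minimal sketch of an HSQC peak-list query (hypothetical data and tolerances).

def matching_ratio(query, reference, tol_c=0.3, tol_h=0.03):
    """Fraction of reference (13C, 1H) peaks found in the query peak list.

    Each query peak may be used only once (greedy assignment).
    """
    used = set()
    matched = 0
    for c_ref, h_ref in reference:
        for i, (c_q, h_q) in enumerate(query):
            if i in used:
                continue
            if abs(c_q - c_ref) <= tol_c and abs(h_q - h_ref) <= tol_h:
                used.add(i)
                matched += 1
                break
    return matched / len(reference)


def screen(query, database, min_ratio=0.75):
    """Return (ratio, name) pairs above the threshold, best first."""
    hits = [(matching_ratio(query, ref), name) for name, ref in database.items()]
    return sorted((h for h in hits if h[0] >= min_ratio), reverse=True)


# Illustrative peak lists in ppm; these are not real database entries.
query_peaks = [(20.00, 1.00), (35.00, 2.00), (60.00, 3.50), (72.00, 4.10)]
database = {
    "compound_A": [(20.1, 1.01), (35.1, 2.01), (60.1, 3.49)],
    "compound_B": [(20.1, 1.01), (50.0, 2.50)],
}
result = screen(query_peaks, database)
```

In this toy case all three peaks of compound_A are found, whereas compound_B is rejected because only one of its two peaks matches. A candidate accepted in this way would still require validation by TOCSY-type spectra, as described above.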

1.3.3 MS and NMR Hybrid Approaches for Identification of Metabolites

A number of novel hybrid approaches combining MS and NMR have been developed recently to offer high-accuracy metabolite identification and quantification.87-90 MS/NMR hybrid approaches capitalize on the complementary strengths of MS and NMR. The

NMR/MS Translator approach combines NMR and accurate MS data of the same metabolomics sample, which provides fast and highly accurate identification of known metabolites.91 The NMR/MS Translator first identifies metabolite candidates from NMR spectra based on NMR database query (e.g., COLMAR web server), followed by simulation of the mass spectrum based on the determination of the masses (m/z) of all of the likely ions, possible adducts, and characteristic isotope distributions. The simulated mass spectrum is then compared with the experimental mass spectrum for the direct assignment of those signals that correspond to the metabolites identified in the NMR spectra. Since both experimental chemical shifts and accurate mass data of the same biological sample are analyzed, it increases the accuracy of metabolite identification compared with those methods based on NMR or MS alone. The NMR/MS Translator was applied to human urine and identified up to 88 metabolites with consistent signals in both

NMR and MS spectra. It should be noted that the NMR/MS Translator is designed to improve the accuracy and efficiency of known metabolite identification, which is limited by the metabolomics spectral database. To identify unknown metabolites that correspond to unassigned signals in NMR or MS spectra without extensive isolation via time-consuming purification, a number of purification-free metabolite identification MS/NMR hybrid strategies have been developed. LC-MS-SPE-NMR has emerged as a very powerful yet less labor-intensive analytical platform to obtain spectral data for structural elucidation of metabolites.92-98 The integrated work platform enables scientists to selectively purify and concentrate metabolites of interest in complex biological samples, followed by structure elucidation based on 1D or 2D NMR spectra. This approach has been applied to a variety of complex biological samples to identify secondary metabolites.

With the integrated UHPLC-QTOF-MS/MS-SPE-NMR approach (a variant of LC-MS-SPE-NMR),96 more than 100 previously known as well as unknown triterpene and flavonoid glycosides were identified in root tissue of Medicago truncatula. Recently, the SUMMIT MS/NMR approach was developed to combine NMR and MS data to identify structures of interest in the mixture without depending on sample separation and experimental spectral databases.99 SUMMIT MS/NMR first extracts accurate masses from ultrahigh-resolution mass spectra and generates all possible structures based on the derived chemical formulas. The comparison of the predicted NMR spectra of all candidate structures with experimental NMR spectra of the same sample promotes the identification of unknown metabolites. Another novel approach, termed ISEL NMR/MS2 (Integrated Structure ELucidation by NMR/MS2), combines in silico MS/MS and NMR prediction to improve the identification accuracy of unknown metabolites.100 It starts with unknown features from the LC-MS spectra of the sample, followed by the comparison of experimental MS/MS and NMR with predicted MS/MS and NMR spectra to identify the structure of unknowns that are not present in experimental NMR or MS metabolomics databases.
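The mass-simulation step of the NMR/MS Translator described above can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the adduct list, the 1 ppm tolerance and the function names are hypothetical, isotope distributions are omitted, and positive and negative ions are pooled in one table for brevity. The mass shifts are standard values for the listed ions.

```python
# Sketch: translate an NMR-identified metabolite into expected m/z values
# and look for them in an observed peak list (hypothetical example data).

# Mass shift (Da) added to the neutral monoisotopic mass for each ion.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M-H]-": -1.007276,
}


def expected_mz(mono_mass):
    """Expected m/z of each singly charged ion for a neutral mass."""
    return {adduct: mono_mass + shift for adduct, shift in ADDUCT_SHIFTS.items()}


def confirm_by_ms(mono_mass, observed_mz, tol_ppm=1.0):
    """Return (adduct, observed m/z) pairs that agree within tol_ppm."""
    confirmed = []
    for adduct, mz_theo in expected_mz(mono_mass).items():
        for mz_obs in observed_mz:
            if abs(mz_obs - mz_theo) / mz_theo * 1e6 <= tol_ppm:
                confirmed.append((adduct, mz_obs))
    return confirmed


# Glucose (C6H12O6), neutral monoisotopic mass 180.063388 Da.
hits = confirm_by_ms(180.063388, [203.0526, 150.0])
```

Here the observed signal at m/z 203.0526 is consistent with the sodium adduct of glucose, so an NMR-based assignment of glucose would be supported by the MS data of the same sample.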

Table 1.1. The advantages and limitations of NMR spectroscopy as an analytical tool for metabolomics research in comparison with MS spectrometry. Adapted with permission from reference (56).

| | NMR | Mass spectrometry |
|---|---|---|
| Sensitivity | Low but can be improved with higher field strength, cryo and microprobes and dynamic nuclear polarization | High, but can suffer from ion suppression in complex and salty mixtures |
| Sample measurement | The entire sample analyzed in one measurement | Usually need different chromatography techniques for different classes of metabolites |
| Sample recovery | Nondestructive; sample can be recovered and stored for a long time, several analyses can be carried out on the same sample | Destructive technique but need a small amount of sample |
| Reproducibility | Very high | Moderate |
| Sample preparation | Need minimal sample preparation | More demanding; needs different LC columns and optimization of ionization conditions |
| Experimental time | 5-15 min for 1D proton NMR | Less than 3 min for direct infusion but more than 10 min for simplest analysis by GC MS or LC MS |
| Tissue samples | Yes, using HRMAS NMR tissue samples analyzed directly | No, requires tissue extraction; MS can be used to identify metabolites in tissues using MALDI-MS |
| Number of detectable metabolites in urine sample | 40–100 depending on spectral resolution | Can be several hundreds |
| Target analysis | Inferior for targeted analysis | Superior for targeted analysis |
| In-vivo studies | Yes; widely used for 1H magnetic resonance spectroscopy (and to a lesser degree 31P and 13C) | No; although suggestion that DESI may be a useful way to sample tissues minimally invasively during surgery |
| Molecular dynamic, molecular diffusion | NMR can be used to probe the molecular diffusion and dynamics | No |
| Direct structure analysis of unknowns | Yes | No |

Figure 1.1. Correlation between main omics strategies in systems biology. Adapted with permission from reference (7).

Figure 1.2. 1D 1H NMR spectrum collected from NIST SRM-1950 human serum at a 700 MHz NMR spectrometer. Adapted with permission from reference (28).

Figure 1.3. Screenshot of COLMARm web server with the identification of isoleucine in human serum. Panel A: 2D HSQC, panel B: 2D TOCSY and panel C: 2D HSQC–TOCSY. Green and red circles show experimental and database cross-peaks of isoleucine, respectively. Magenta circles indicate the expected isoleucine peaks from the COLMAR TOCCATA database. The good agreement between green and red circles indicates that isoleucine is a confident candidate. This was further validated by the close match between magenta peaks with the experimental TOCSY cross-peaks observed in the TOCSY and HSQC–TOCSY spectra. Adapted with permission from reference (85).

Accurate Identification of Unknown and Known Metabolic Mixture Components by Combining 3D NMR with Fourier Transform Ion Cyclotron Resonance Tandem Mass Spectrometry

Metabolite identification in metabolomics samples is a key step that critically impacts downstream analysis. The recently introduced SUMMIT NMR/mass spectrometry (MS) hybrid approach allows the identification of the molecular structure of unknown metabolites, based on the combination of NMR, MS, and combinatorial cheminformatics. In this chapter, we demonstrate the feasibility of the approach for an untargeted analysis of both a model mixture and E. coli cell lysate, based on 2D/3D NMR experiments in combination with Fourier transform ion cyclotron resonance MS and MS/MS data. For 19 of the 25 model metabolites SUMMIT yielded complete structures that matched those in the mixture independent of database information. Of those, 7 top-ranked structures matched those in the mixture, and 4 of those were further validated by positive ion MS/MS. For 5 metabolites, correct molecular structural motifs were identified. For E. coli, SUMMIT MS/NMR identified 20 previously known metabolites with 3 or more 1H spins independent of database information. Moreover, for 15 unknown metabolites, molecular structural fragments were determined consistent with their spin systems and chemical shifts. By providing structural information for entire metabolites or molecular fragments, SUMMIT MS/NMR greatly assists the targeted or untargeted analysis of complex mixtures of unknown compounds.

2.1 Introduction

The large number of different metabolites found in living organisms offers important clues about the chemical underpinning of life, which is the subject of the field of metabolomics. It has been estimated that the human body alone contains over 100,000 different metabolites.101 Despite ongoing progress in the development of larger metabolomics databases, the identification of unknown metabolites remains a major bottleneck. Traditional approaches for natural product analysis, which are based on the complete physical separation of the compound of interest, are very time-consuming and, hence, impractical for routine and high-throughput applications. Alternatively, the two primary analytical techniques in metabolomics, namely mass spectrometry (MS)102-104 and nuclear magnetic resonance (NMR), have been applied separately.

Recently, new approaches have been proposed for the analysis of complex mixture, based on combining MS and NMR. Finding ways to synergistically apply the two methods to the same problem has been a challenge due to the high complementarity of their information content. One strategy focuses on subsets of spectroscopic signals that are highly correlated or interdependent with respect to each other across a large number of samples and, hence, may stem from the same molecule.105-108 Such correlation analysis can be carried out either for NMR data or direct infusion MS data alone or between the two methods.109 Groups of signals that have been identified in this way can then be used to deduce information about the molecular structure. These statistical methods are applicable under two conditions, namely that a potentially large pool of samples is available so that statistically meaningful results can be obtained, and that the compound of interest shows relatively large modulations of its concentration relative to other metabolites so that the correlations between signals of the compound are sufficiently large. For applications with smaller sample pools, which can consist of as few as a single sample, alternative approaches have been proposed. For uniformly 13C-labeled mixtures,

2D 13C-13C TOCSY or INADEQUATE experiments permit the tracing of the backbone topology of individual metabolites, thereby providing useful information toward the elucidation of their structure.110, 111 3D-(H)CCH-TOCSY and COSY spectra of a 60% 13C-labeled rhododendron shrub were used together with quantum-chemical calculations to identify catalogued as well as several uncatalogued metabolites.112

We recently introduced approaches that synergistically use NMR and MS for a single sample of a complex mixture at 13C natural abundance for the validation of known compounds and the determination of unknowns. The first approach is the NMR/MS

Translator, which translates the metabolites identified from 1D or 2D NMR by database query to accurate masses that are then directly compared with MS of the same sample, thereby providing a methodical validation of metabolites by both NMR and MS.91, 113 The second approach, termed SUMMIT MS/NMR (for “Structure of Unknown Metabolomic

Mixture components by MS/NMR”),99 is more complex and more ambitious than some of the other approaches listed, because it aims at the determination of the structure of unknown metabolites without the use of NMR or MS databases. Based on accurate masses from MS, it generates a pool of possible molecular structures for which NMR chemical shifts are computed and compared directly with the 2D experimental chemical shift data of the mixture spectrum. As a proof-of-principle, it was demonstrated how SUMMIT could determine the correct structures for a finite, well-defined subset of metabolites previously known to exist in E. coli cell lysate.99

Here, we generalize SUMMIT for the untargeted identification of both known and unknown metabolites by combining ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS)114 to assign unique elemental compositions (CcHhNnOoPp…) with 3D NMR complemented by 2D NMR experiments as the primary source of information for spin-system identification and validation by tandem MS (MS/MS). The NMR information is first queried against the COLMAR NMR database,44, 85, 115 by use of the COLMARm86 database to identify a maximal number of known metabolites and thereby assign as many cross-peaks as possible. This step is then followed by SUMMIT, combining MS with NMR data, for the identification of unknowns from the remaining cross-peaks. The approach is first demonstrated here for a model mixture consisting of 25 metabolites. Compared to the original SUMMIT experiments, for which mass measurement was based on time-of-flight mass analysis with an average root-mean-square (rms) mass error of ~5 ppm (thus limited to metabolites of up to ~300 Da in mass), the present 9.4 T FT-ICR mass measurements achieve 25 times higher mass accuracy (~200 ppb rms mass error), and thus allow a more reliable determination of elemental composition for metabolites up to at least 1 kDa in mass.116 By combining the results with MS/MS analysis, this approach can provide additional disambiguation of the top hits produced by SUMMIT.

2.2 Materials and Methods

2.2.1 Identification of Unique Elemental Composition from Ultrahigh-Resolution FT-ICR Mass Spectra

The SUMMIT approach begins with the identification of the metabolite elemental composition.

It has been shown that mass measurement accuracy of ~200 ppb enables identification of unique elemental composition for organic molecules up to ~1 kDa in mass.117 Such mass measurement accuracy for complex mixtures is routinely achieved by 9.4 T FT-ICR MS: e.g., resolution and elemental composition assignment for more than 125,000 peaks in a single mass spectrum of a volcanic asphalt sample.118 Although limited structural information may be derived from elemental composition alone (e.g., ,119 van Krevelen diagrams,120 double bond equivalents vs. carbon number for individual heteroatom classes,121 etc.), the most definitive structural information relies on NMR (see below).
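The dependence of elemental composition assignment on mass accuracy can be illustrated with a brute-force sketch. It is not the assignment software used in this work: the function name and the element count limits are hypothetical, only C, H, N and O are considered, and no chemical validity rules (e.g., valence or nitrogen rule) are applied.

```python
# Sketch: enumerate CcHhNnOo compositions consistent with a neutral
# monoisotopic mass at a given mass tolerance (hypothetical limits).

MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307400, "O": 15.99491462}


def formulas_within(neutral_mass, tol_ppm, max_counts=(20, 40, 5, 10)):
    """Return (c, h, n, o) tuples whose mass lies within tol_ppm."""
    tol = neutral_mass * tol_ppm * 1e-6
    max_c, max_h, max_n, max_o = max_counts
    hits = []
    for c in range(1, max_c + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                # Mass left over for hydrogens after placing C, N and O.
                rest = neutral_mass - (c * MONO["C"] + n * MONO["N"] + o * MONO["O"])
                if rest < -tol:
                    break  # adding more oxygens only overshoots further
                h = round(rest / MONO["H"])
                if 0 <= h <= max_h and abs(rest - h * MONO["H"]) <= tol:
                    hits.append((c, h, n, o))
    return hits


# Valine, C5H11NO2, neutral monoisotopic mass 117.078979 Da.
narrow = formulas_within(117.078979, tol_ppm=0.2)   # FT-ICR-like accuracy
wide = formulas_within(117.078979, tol_ppm=5.0)     # TOF-like accuracy
```

Every composition accepted at 0.2 ppm is also accepted at 5 ppm, so the tighter tolerance can only shrink the candidate list. The benefit grows with mass, because the number of compositions within a fixed ppm window increases rapidly for heavier molecules.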

2.2.2 Identification of Individual Spin Systems of a Mixture from a 3D 13C-1H HSQC-TOCSY NMR Spectrum

The NMR portion of SUMMIT focuses on the identification of spin systems of individual compounds based on multidimensional 13C-1H and 1H-1H cross-peaks. In particular, the 2D 13C-1H HSQC experiment provides 13C-1H chemical shift correlations between directly bonded 1H and 13C nuclei and 2D 13C-1H HSQC-TOCSY provides 13C-1H and 1H-1H bond connectivity information. If peak overlaps are absent or rare, the combination of 2D HSQC and 2D HSQC-TOCSY enables the unambiguous extraction of the spin systems of the various mixture compounds. However, in practice the presence of peak overlap will interfere with the accuracy and reliability of spin system extraction. Because the 3D 13C-1H HSQC-TOCSY spectrum is much less prone to peak overlap than its 2D variant, we extract the spin systems directly from the 3D 13C-1H HSQC-TOCSY NMR spectrum. The 3D 13C-1H HSQC-TOCSY experiment provides 13C(ω1), 1H(ω2), 1H(ω3) correlations and resolves overlap of cross-peaks in the 2D 13C(ω1)-1H(ω2) plane by spreading the resonances along the orthogonal 1H(ω3) dimension, which is the direct 1H detection dimension.122, 123 Comparison of the 1H-1H correlations along ω3 for each pair of 13C-1H cross-peaks enables one to determine whether two 13C-1H cross-peaks belong to the same molecule, thereby drastically reducing the possibility of false spin system identification. In practice, the main concern is the relatively low resolution along the two indirect dimensions that provide the 13C and 1H correlation information in order to keep the measurement time reasonably short. This problem can be addressed in part by measuring an additional high-resolution 2D 13C-1H HSQC spectrum to complement the 13C and 1H correlation information from the 3D experiment, as done here, or the use of non-uniform sampling methods.123, 124

The 13C(ω1)-1H(ω2) plane of the 3D HSQC-TOCSY spectrum depicts single bond 13C-1H correlations of all molecules in the mixture. To distinguish 13C(ω1)-1H(ω2) cross-peaks from different molecules and be able to cluster these cross-peaks into individual spin systems, the analysis of 1H-1H TOCSY transfers along the ω3 dimension permits one to correlate pairs of 13C(ω1)-1H(ω2) cross-peaks and assign them to the same spin system. A prerequisite is that the cross-peaks share the same cross-peaks along ω3. Specifically, such a pair of 2D cross-peaks, (ω1′,ω2′) and (ω1″,ω2″), must then share 3D cross-peaks at positions (ω1,ω2,ω3) = (ω1′,ω2′,ω2′), (ω1′,ω2′,ω2″), (ω1″,ω2″,ω2″), (ω1″,ω2″,ω2′). The goal is to find all pairs of 2D cross-peaks that are connected in this manner. These cross-peaks can be considered as edges of a mathematical graph in which the nodes correspond to directly bonded 13C-1H spin pairs. Such a graph can then be analyzed in terms of a “maximal clique” analysis, which we recently developed to automatically extract all possible spin systems from TOCSY-type spectra.125
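The graph construction and clique search described above can be sketched in a few lines. This is a generic illustration and not the implementation of reference 125: the function names are hypothetical, the standard Bron-Kerbosch algorithm without pivoting is assumed, and 3D cross-peaks are represented symbolically as pairs (a, b), meaning a peak at the 13C and 1H shifts of spin pair a along ω1 and ω2 and at the 1H shift of spin pair b along ω3, so that no chemical shift tolerances are needed.

```python
# Sketch: spin systems as maximal cliques of a connectivity graph.
from itertools import combinations


def connectivity_graph(nodes, peaks3d):
    """Connect two 13C-1H spin pairs if all four required 3D peaks exist."""
    adj = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        required = {(a, a), (a, b), (b, b), (b, a)}
        if required <= peaks3d:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy mixture: three spin pairs of molecule A and two of molecule B.
nodes = ["A1", "A2", "A3", "B1", "B2"]
peaks3d = {(a, b) for a in nodes for b in nodes if a[0] == b[0]}
spin_systems = maximal_cliques(connectivity_graph(nodes, peaks3d))
```

In this toy example the two cliques recover the two molecules. With real spectra, missing or overlapping 3D cross-peaks remove or add edges, which is why the extracted spin systems are refined afterwards.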

Figure 2.1 depicts a schematic diagram for the generation of spin systems from the

3D 13C-1H HSQC-TOCSY NMR spectrum. It should be noted that for spin systems with only one 13C-1H pair, the method does not work, because the spin system is fully characterized by a single cross-peak in the 3D spectrum. Therefore, peak assignments of one-spin systems need to be performed manually. Similarly, because 2-spin systems contain no redundant information, they are more prone to false positives and were considered only for the model mixture but not for E. coli cell extract.

After spin systems of individual compounds are extracted, they are refined in order to minimize the occurrence of false positives. Spin system refinement includes three consecutive steps. First, extracted spin systems are validated by visually checking 2D 13C-1H HSQC, 2D 1H-1H TOCSY and 2D 13C-1H HSQC-TOCSY NMR spectra. If the extracted spin system was incomplete, expected peaks that are unambiguously observed by visual inspection, but that were missed by the automated procedure, are manually added until the spin system is complete. Second, 1H peak doublets and nearly degenerate 1H resonances are combined into a single chemical shift. For example, CH2 groups can have two separate proton chemical shifts belonging to the same 13C, but sometimes it is difficult to determine whether two separate peaks stem from a single CH2 group or from two separate CH groups. Therefore, those spin systems that contain two cross-peaks with the same 13C frequency in the HSQC are merged into a single cross-peak (with a chemical shift taken as the mean of the two proton resonances) for the generation of an alternative spin system candidate, which is added to the list of spin systems. Third, potential extra spins are manually identified and added after comparison of 1D 1H traces along ω3. For example, for an automatically generated 3-spin system, if an additional resonance shows unambiguous connectivities to all three spins, but has not yet been included in the clique, then this spin is manually added, resulting in a new 4-spin system. An example for the refinement of spin systems is provided in Figure 2.2.
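The second refinement step, the merging of cross-peaks that share a 13C frequency, can be sketched as follows. The function name, the 13C tolerance and the example shifts are hypothetical; the original spin system is left unchanged and the merged version is returned as the alternative candidate.

```python
# Sketch: build the alternative spin system in which cross-peaks with the
# same 13C shift (e.g., a CH2 group) are merged into one peak.

def merge_same_carbon(spin_system, tol_c=0.05):
    """Merge (13C, 1H) peaks whose 13C shifts agree within tol_c (ppm).

    The merged peak keeps the 13C shift of the first peak in the group
    and takes the mean of the proton shifts.
    """
    groups = []
    for c, h in sorted(spin_system):
        if groups and abs(c - groups[-1][0]) <= tol_c:
            groups[-1][1].append(h)
        else:
            groups.append((c, [h]))
    return [(c, sum(hs) / len(hs)) for c, hs in groups]


# Illustrative spin system in ppm with a possible CH2 group at 30.0 ppm.
original = [(55.0, 3.75), (30.0, 1.90), (30.0, 2.10)]
alternative = merge_same_carbon(original)
```

Both the original 3-peak and the merged 2-peak spin systems would then be kept as candidates for the subsequent matching against predicted spectra.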

2.2.3 Structure Manifold Generation and 2D HSQC NMR Spectra Prediction

Each accurate mass derived from an experimental FT-ICR mass spectrum was compared to the METLIN database to identify the closest matching molecular formula (note that METLIN was used only to search for molecular formulas that are consistent with the FT-ICR-based mass information, but not for molecular structures). Because each molecular formula could correspond to any of several isomers, we then searched the ChemSpider database126,127 for all structures corresponding to a given molecular formula.

For all molecular structures, 2D 13C-1H HSQC spectra are predicted by use of the empirical chemical shift predictor by Modgraph implemented in MestReNova 10.0.1 (Mestrelab Research). HSQC prediction for each molecule takes about 3-10 s with a desktop computer. The 13C chemical shift prediction utilizes a HOSE code algorithm, whereas the 1H chemical shift prediction is based on functional groups which were individually parametrized.128 Because NMR chemical shift prediction plays a critical role in SUMMIT for identifying the correct compound from a large compound pool, we examined the prediction accuracy for amino acids, organic acids, and carbohydrates contained in a 25-compound model mixture. We compared the predicted NMR chemical shifts with the NMR chemical shifts contained in the COLMAR database.44, 85, 115 For a total of 179 13C-1H moieties, the average prediction errors for carbon and proton chemical shifts are 2.903 ppm and 0.292 ppm, respectively. The comparison between predicted and experimental chemical shifts is shown in Figure 2.3.

2.2.4 Weighted Matching between Experimental and Predicted NMR Spectra

After 2D 13C-1H HSQC NMR spectra were predicted for all chemical compound candidates, the weighted matching algorithm by Munkres was applied to match the 2D 13C-1H HSQC spectra extracted for individual mixture compounds with the predicted 2D 13C-1H HSQC spectra.129 The use of a weighted matching algorithm is motivated by the goal to find the closest matching peak pairs for each experimental spin system to each predicted spin system, provided that the total number of spins is the same. The matching results are ranked according to the chemical shift root-mean-square deviation (RMSD) (Equation 1) between the experimental and predicted chemical shifts:

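$$\mathrm{RMSD} = \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left[ \left(C_{i,\mathrm{exp}} - C_{i,\mathrm{pred}}\right)^{2} + \left( \left(H_{i,\mathrm{exp}} - H_{i,\mathrm{pred}}\right) \times 10 \right)^{2} \right] \right\}^{1/2} \qquad (1)$$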

in which Xexp are the experimental chemical shifts, Xpred are the predicted chemical shifts, and N is the number of peaks in the spin system. A scaling factor of 10 is used to normalize the effects of 13C and 1H chemical shifts on the overall RMSD by correcting for the different chemical shift ranges of these nuclei. Table 2.1 shows the matching result for valine in the 25-compound model mixture.
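The matching and ranking step can be sketched as follows, with the RMSD computed as in Equation 1. For brevity, an exhaustive search over peak permutations stands in for the Munkres algorithm; it yields the same optimal assignment but is only practical for small spin systems. The function names and the example chemical shifts are hypothetical.

```python
# Sketch: optimal peak assignment and RMSD ranking of candidate structures.
from itertools import permutations
from math import sqrt


def rmsd(exp_peaks, pred_peaks, h_scale=10.0):
    """Equation 1 for two equally long, already paired (13C, 1H) peak lists."""
    n = len(exp_peaks)
    total = sum(
        (ce - cp) ** 2 + ((he - hp) * h_scale) ** 2
        for (ce, he), (cp, hp) in zip(exp_peaks, pred_peaks)
    )
    return sqrt(total / (2 * n))


def best_match(exp_peaks, pred_peaks):
    """Lowest RMSD over all peak assignments, or None if sizes differ."""
    if len(exp_peaks) != len(pred_peaks):
        return None
    return min(rmsd(exp_peaks, perm) for perm in permutations(pred_peaks))


def rank_candidates(exp_peaks, candidates):
    """Return (RMSD, name) pairs sorted from best to worst match."""
    scored = []
    for name, pred in candidates.items():
        score = best_match(exp_peaks, pred)
        if score is not None:
            scored.append((score, name))
    return sorted(scored)


# Illustrative data in ppm (not taken from the model mixture).
experimental = [(20.0, 1.0), (60.0, 3.5)]
candidates = {
    "good": [(61.0, 3.6), (21.0, 1.1)],
    "bad": [(30.0, 2.0), (70.0, 4.5)],
    "wrong_size": [(20.0, 1.0)],
}
ranking = rank_candidates(experimental, candidates)
```

For the candidate labeled "good", each 13C shift deviates by 1 ppm and each 1H shift by 0.1 ppm, which after scaling by 10 contributes equally and gives an RMSD of 1 ppm. The candidate with a different number of peaks is skipped, in line with the requirement that the total number of spins is the same.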

2.2.5 Molecular Structure Motif Identification of Compounds

Matching and rank-ordering the predicted NMR spectra of a potentially large number of candidate compounds derived from FT-ICR MS against an experimentally extracted spin system generally yielded a large number of hits below a reasonably low chemical shift RMSD cutoff (< 5 ppm). Although candidate compounds with lower RMSDs generally are more likely to be the true compound, it cannot be excluded that the true compound has a lower rank due to the limited NMR prediction accuracy or molecular structure degeneracy. Therefore, to simplify and speed up the identification of the true compound among the hundreds or even thousands of hits, we propose the following approach referred to as “molecular structural motif identification of chemical compounds” or MSMIC. Because the experimentally extracted spin system corresponds to a structural motif consisting only of carbons and protons of the true compound, the goal of MSMIC is to find all possible molecular structural motifs that correspond to the experimentally extracted spin system among all of the compound candidates. The common molecular structural motif among different compounds will generate similar chemical shifts because additional atoms and functional groups typically have only little influence on the NMR chemical shift prediction of spins that are part of the molecular structure motif. For instance, L-glutamine and glutathione share the common molecular structural motif (HOOCCH(NH2)CH2CH2CONH-) and hence have similar chemical shifts for this motif (Figure 2.4). All hits (compound candidates) are sorted into groups according to their MSMICs by use of the nearest neighbor heavy atom for discrimination between different MSMICs. In a next step, molecular representatives of all high scoring MSMICs are selected for NMR experiments and used for quantum chemical calculations of their chemical shifts for the more accurate ranking of MSMICs. The best matching molecules are then either purchased or synthesized for NMR spiking experiments. This approach was implemented by use of in-house python scripts. Examples of MSMICs will be discussed below. In chemometrics, the maximum common substructure (MCS) approach is very efficient in identifying local structural similarities among large structural manifolds (>750,000).130, 131 Unfortunately, MCS is not able to identify the common motif that corresponds to a given spin system because it is not based on substructures connected by scalar J-couplings.

2.2.6 Sample Preparation

A 25-compound metabolite mixture contained adenosine, alanine, arginine, carnitine, citrulline, cysteine, fructose, galactose, glucose, glutamine, histidine, inosine, isoleucine, lactose, leucine, lysine, methionine, ornithine, proline, ribose, serine, shikimate, sucrose, threonine, and valine. For the NMR experiments, the final concentration of each metabolite was 1 mM in 600 μL D2O with 20 mM phosphate buffer and 0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing. The same 25 compounds were used for the MS sample, which was prepared in 50%/50% (v/v) ACN/H2O solution with 0.1% formic acid. The final concentration of each metabolite for MS was 10 μM. All chemicals and solvents were obtained from Sigma-Aldrich and Fisher Scientific Corporation.

E. coli BL21(DE3) cells were cultured at 37 °C with shaking at 250 rpm in M9 minimal medium with glucose (natural abundance, 5 g/L) added as the sole carbon source. One liter of culture at OD 1 was centrifuged at 5000 x g for 20 min at 4 °C, and the cell pellet was resuspended in 50 mL of 50 mM phosphate buffer at pH 7.0. The cell suspension was then subjected to centrifugation for cell pellet collection. The cell pellet was resuspended in 10 mL of ice-cold water and freeze-thawed three times. The sample was centrifuged at 20,000 x g at 4 °C for 15 min to remove cell debris. Prechilled methanol and chloroform were sequentially added to the supernatant under vigorous vortexing at an H2O/methanol/chloroform ratio of 1:1:1 (v/v/v). The mixture was then left at −20 °C overnight for phase separation. Next, it was centrifuged at 4000 x g for 20 min at 4 °C, and the clear upper hydrophilic phase was collected and subjected to rotary evaporation to reduce the methanol content. Finally, the sample was lyophilized. The dry sample was then divided into two parts: one for MS and one for NMR analysis. The NMR sample was prepared by dissolving the material in 200 μL of D2O with 20 mM phosphate buffer and 0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing, and then transferred to a 3 mm NMR tube. For the MS sample, 1.5-2 mg of E. coli cell lysate was dissolved in 200 μL of H2O, and 10 μL of that solution was diluted 10-fold with 50%/50% (v/v) ACN/H2O with 0.1% formic acid. The resulting solution was centrifuged at 13,000 rpm at 4 °C for 5 min, and the supernatant was used for direct-infusion MS.

2.2.7 NMR Experiments and Data Processing

2D 13C-1H HSQC, 2D 1H-1H TOCSY, 2D 13C-1H HSQC-TOCSY, and 3D 13C-1H HSQC-TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected.

All NMR spectra of the 25-compound model mixture and E. coli cell lysate were acquired with a Bruker AVANCE solution-state NMR spectrometer equipped with a cryogenically cooled TCI probe at 850 MHz proton frequency at 298 K. The 2D 13C-1H HSQC spectra of the 25-compound model mixture and E. coli cell lysate were collected with 256 t1 and 1024 t2 complex points. The measurement time was ~2 h. The spectral widths along the indirect and direct dimensions were 34205.6 and 10204.1 Hz, respectively. The number of acquisitions per t1 increment was 8. The transmitter frequency offset was 80 ppm in the 13C dimension and 4.7 ppm in the 1H dimension. The 2D 1H-1H TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 512 t1 and 1024 t2 complex points. The measurement time was ~4 h. The spectral width along both the indirect and direct dimensions was set to 10204.1 Hz. The number of acquisitions per t1 increment was 8. The transmitter frequency offset was 4.7 ppm in both 1H dimensions. The TOCSY mixing time was set to 120 ms after optimization of the isotropic mixing time. 2D 13C-1H HSQC-TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 512 t1 and 2048 t2 complex points. The measurement time was ~8.5 h. The spectral widths along the indirect and direct dimensions were 34205.6 and 10204.1 Hz, respectively. The TOCSY mixing time for 2D 13C-1H HSQC-TOCSY was set to 120 ms. The number of acquisitions per t1 increment was 16. The transmitter frequency offset was 80 ppm in the 13C dimension and 4.7 ppm in the 1H dimension. 3D 13C-1H HSQC-TOCSY spectra of the 25-compound model mixture and E. coli cell lysate were collected with 64 t1, 128 t2, and 2048 t3 complex points. The measurement time was ~113 h. The spectral widths along the two indirect dimensions and the direct dimension were 34205.6, 10204.1, and 10204.1 Hz, respectively. The number of scans per t1 increment was 8. The transmitter frequency offset was 80 ppm in the 13C dimension and 4.7 ppm in the 1H dimension. The data were zero-filled two-fold along the 13C dimension, Fourier transformed, and phase- and baseline-corrected by use of NMRPipe.132 Sparky was used for peak-picking in all spectra.133 All spectra were converted to MATLAB format for maximal clique analysis.
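The spectral widths quoted in Hz can be cross-checked against the ppm windows they cover. The following is a minimal sketch that assumes a nominal 1H frequency of exactly 850 MHz and the standard 13C/1H frequency ratio of about 0.2515; the exact spectrometer frequency is not stated in the text, so the results are approximate.

```python
# Frequency ratio 13C/1H (standard Xi value for 13C, about 25.145 %).
XI_13C = 0.25145020

def sw_ppm(sw_hz: float, carrier_mhz: float) -> float:
    """Convert a spectral width in Hz to ppm for a nucleus observed at carrier_mhz."""
    return sw_hz / carrier_mhz

H1_MHZ = 850.0               # assumption: nominal 1H frequency
C13_MHZ = H1_MHZ * XI_13C    # about 213.7 MHz

sw_1h = sw_ppm(10204.1, H1_MHZ)     # 1H dimensions
sw_13c = sw_ppm(34205.6, C13_MHZ)   # 13C dimension
print(f"1H: {sw_1h:.1f} ppm, 13C: {sw_13c:.1f} ppm")
```

Under these assumptions the widths correspond to roughly 12 ppm for 1H and 160 ppm for 13C, centered on the stated transmitter offsets of 4.7 and 80 ppm.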

2.2.8 FT-ICR MS Experiments and Processing

A custom-built 9.4 T Fourier transform ion cyclotron resonance mass spectrometer was used for sample analysis.116 The 25-metabolite mixture (10 µM) and the E. coli extract sample (in 50% ACN, 50% water, and 0.1% formic acid) were ionized by positive or negative nanoelectrospray at a flow rate of 0.3 µL/min and accumulated in an external linear quadrupole ion trap. Ions were then transferred through an octopole ion guide to the ICR cell for broadband and tandem mass spectra acquisition. The transfer time was set to 0.35 ms for lower mass range analysis (m/z 107-270) and 0.55 ms for higher mass range analysis (m/z 265-400). MS/MS fragmentations for glutamine, lysine, arginine, and ornithine were performed by infrared multiphoton dissociation (IRMPD), and precursor ions were isolated externally with a quadrupole mass filter and internally by stored waveform inverse Fourier transform (SWIFT) excitation in the ICR cell.134 IRMPD was performed with a 40 W, 10.6 µm CO2 laser (Synrad, Mukilteo, WA, USA), fitted with a 2.5x beam expander. The laser beam was directed to the center of the cell through an off-axis BaF2 window. Photon irradiation was for 500 ms at 40–90% laser power (16–36 W). MS/MS fragmentation for valine was performed with a Velos Pro ion trap mass spectrometer with a normalized collision energy of 22 (ThermoFisher parameter setting). Broadband and tandem mass spectra were acquired from m/z 107–2,000, with a 6 s time-domain acquisition period. 500 time-domain transients were digitized and signal-averaged. All data were stored as .DAT files. All time-domain data were Hanning apodized, zero-filled, and fast Fourier transformed to yield magnitude-mode mass spectra. Frequency-to-m/z conversion was performed with a two-term calibration equation.135, 136 Mass calibration was performed by dual spray spanning m/z 112-410. For positive ESI, the custom-prepared standard mix included cytosine, caffeine, , adenosine, Val-Ala-Pro-Gly, and des-Tyr1-methionine enkephalinamide. For negative ESI, Agilent ESI-L Low Concentration Tuning Mix (Agilent, Santa Clara, CA) was used for calibration. After dual spray calibration, high-magnitude peaks (6 peaks for ESI positive mode and 5 peaks for ESI negative mode) in the m/z range 112-410 were chosen as internal standards and used for calibration during sample direct infusion into the mass spectrometer. Data were manually interpreted by use of Predator Analysis (version 4.1.9) software.

2.3 Results and Discussion

2.3.1 Demonstration of SUMMIT MS/NMR to Identify Metabolites

First, we illustrate SUMMIT MS/NMR for the identification of metabolites based on the example of a spin system extracted from the 3D HSQC-TOCSY NMR spectrum of the 25-compound model mixture. The spin system with chemical shifts (δH, δC) of (3.597, 63.000), (2.267, 31.690), (1.034, 20.560), and (0.983, 19.230) ppm was used to match the predicted 2D HSQC NMR spectra of 57,881 compounds derived from elemental compositions corresponding to all above-threshold FT-ICR mass spectral peaks (see below). The spin-spin connectivity information is manifested in both 2D HSQC-TOCSY and 3D HSQC-TOCSY, which confirms that the four peaks indeed belong to the same spin system (Figure 2.5). The experimental chemical shifts were matched against the predicted chemical shifts of the 57,881 compounds by use of the weighted matching algorithm. Based on an RMSD cutoff of 5.0 ppm, 122 RMSD rank-ordered hits were returned. The top hit was valine with an RMSD of 0.93 ppm. Because valine was known to be one of the 25 compounds in the model mixture (which was independently confirmed by querying the chemical shifts of this spin system against the COLMAR 1H(13C)-TOCCATA database,44 yielding a low RMSD of 0.08 ppm for the database entry of valine), SUMMIT MS/NMR was successful in identifying (and verifying) this mixture component without depending on either a spectral NMR or MS database.
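The valine RMSD can be illustrated with a minimal sketch of a weighted peak assignment. Two points are assumptions rather than statements from the text: the 1H differences are scaled by a factor of 10 relative to 13C, and the sum of squares is normalized by 2N (N cross-peaks). This form reproduces the RMSD values reported in Tables 2.1 and 2.5. A brute-force search over permutations stands in for the weighted matching algorithm; it is only practical for small spin systems.

```python
from itertools import permutations
from math import sqrt

H_WEIGHT = 10.0  # assumed scaling of 1H differences relative to 13C

def pair_cost(expt, pred):
    """Squared, weighted distance between two (1H, 13C) cross-peaks."""
    d_h = H_WEIGHT * (expt[0] - pred[0])
    d_c = expt[1] - pred[1]
    return d_h * d_h + d_c * d_c

def best_match_rmsd(expt_peaks, pred_peaks):
    """Minimum-cost one-to-one assignment (brute force), returned as an RMSD."""
    n = len(expt_peaks)
    best = min(
        sum(pair_cost(e, pred_peaks[j]) for e, j in zip(expt_peaks, perm))
        for perm in permutations(range(len(pred_peaks)), n)
    )
    return sqrt(best / (2 * n))

# Experimental valine spin system and empirically predicted shifts (Table 2.1).
expt = [(3.597, 63.00), (2.267, 31.69), (1.034, 20.56), (0.983, 19.23)]
pred = [(3.440, 61.90), (2.220, 31.07), (0.960, 19.32), (0.910, 19.32)]

valine_rmsd = best_match_rmsd(expt, pred)
print(round(valine_rmsd, 2))  # 0.93
```

The lowest-cost assignment pairs the two methyl cross-peaks in the opposite order from the row order of Table 2.1, which is why an assignment step is needed before the RMSD is computed.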

How should one proceed if the true compound does not exist in an NMR metabolomics database for validation, and how can one verify the true compound among dozens or hundreds of candidates returned by SUMMIT MS/NMR? Here, the MSMIC approach described above comes to fruition. By first identifying the molecular structural motif of the true compound, it helps elucidate the complete structure of the true compound in a second step. Again, the spin system that was eventually identified as valine is used as an example to demonstrate the strategy. For an unknown spin system with hundreds of compound candidates, the hits with lower RMSDs are more likely to correspond to the true compound. To verify the identity of the true compound beyond the limited NMR database information, quantum-chemical calculations followed by an NMR spiking experiment were adopted as the “gold standard” for compound verification.58 Before proceeding to verify the true compound by labor-intensive NMR spiking experiments for the 122 hits, the molecular structural motifs that reflect the possible substructures of the unknown spin systems are extracted, which both simplifies and speeds up the verification process. Examples of molecular structural motifs identified among the 122 hits are shown in Figure 2.6. As compounds with a common motif are expected to have similar chemical shifts, the next step is to select one or two compounds with low RMSD by quantum-chemical calculation in each cluster and perform NMR spiking experiments. Hence, the 122 initial hits were further reduced to fewer than 10 compound candidates for the verification of the molecular structural motif of the true compound. After confirmation of the MSMIC that belongs to an unknown spin system, further validation steps are performed to verify the entire compound as explained below.
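The reduction from many hits to a few representatives per motif can be sketched as a simple grouping step. The compound labels, motif labels, and RMSD values below are hypothetical placeholders, and the function name `pick_representatives` is illustrative; how motifs are actually derived from the structures is not covered here.

```python
from collections import defaultdict

def pick_representatives(hits, per_cluster=2):
    """Group hits by motif label and keep the lowest-RMSD members of each group."""
    clusters = defaultdict(list)
    for name, motif, rmsd in hits:
        clusters[motif].append((rmsd, name))
    return {
        motif: [name for _, name in sorted(members)[:per_cluster]]
        for motif, members in clusters.items()
    }

# Hypothetical hits: (compound label, motif label, chemical shift RMSD in ppm).
hits = [
    ("cand_A", "motif_1", 0.93),
    ("cand_B", "motif_1", 1.40),
    ("cand_C", "motif_1", 2.10),
    ("cand_D", "motif_2", 1.75),
    ("cand_E", "motif_3", 3.20),
    ("cand_F", "motif_3", 2.95),
]

reps = pick_representatives(hits)
for motif, names in reps.items():
    print(motif, names)
```

Only the selected representatives would then be passed on to quantum-chemical calculations and spiking experiments, which keeps the expensive steps proportional to the number of motifs rather than the number of hits.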


2.3.2 Application to a 25-Compound Model Mixture

NMR and FT-ICR MS Data-Derived Information

Based on the maximal clique approach to automatically extract spin systems (see Methods section), 49 spin systems were extracted from the 3D 13C-1H HSQC-TOCSY NMR spectrum of the 25-compound model mixture. All extracted spin systems included 2 or more spins (one-spin systems were not included, see Methods section). 26 spin systems were identified by SUMMIT MS/NMR and verified based on COLMAR 1H(13C)-TOCCATA database query; 2 unknown spin systems could not be annotated and were classified as false positives, because each resonance in these spin systems belongs to other spin systems as determined by visual inspection of the 2D TOCSY and HSQC-TOCSY spectra. 21 spin systems were identified as partially or fully overlapping spin systems after spin system refinement as described in the Methods section. 80 neutral molecular formulas (rms mass error 0.07 ppm) were obtained from the 100 highest-magnitude FT-ICR mass spectral peaks by identifying elemental compositions matched to the METLIN database (see above) with a <0.15 ppm mass error threshold (Figure 2.7).127 (Peaks not originating from the 25 metabolites likely belong to impurities in the purchased compounds.) For each mass peak, [M+H]+, [M+Na]+, [M+K]+, [M+ACN+H]+, [M+ACN+Na]+, and [M+2Na-H]+ (in which M is the metabolite or its derivative) were considered as possible adducts. There were 57,881 molecular structures corresponding to the 80 molecular formulas according to the ChemSpider database (presently containing over 58 million molecular structures).126 By comparison, if the mass error threshold was set to 1.0 ppm, 92 distinct molecular formulas were obtained with rms mass error 0.17 ppm, corresponding to 68,173 structures.
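The adduct handling and the ppm mass error threshold can be illustrated with a short sketch. The monoisotopic masses are standard values rounded to eight decimals; the function names and the valine example are illustrative assumptions, and the formula lookup against METLIN is not reproduced here.

```python
ELECTRON = 0.00054858  # electron mass in Da
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.00307401, "O": 15.99491462,
        "Na": 22.98976928, "K": 38.96370649}
ACN = 2 * MONO["C"] + 3 * MONO["H"] + MONO["N"]  # acetonitrile, C2H3N

# Mass added to the neutral molecule M for each singly charged adduct.
ADDUCT_SHIFT = {
    "[M+H]+": MONO["H"] - ELECTRON,
    "[M+Na]+": MONO["Na"] - ELECTRON,
    "[M+K]+": MONO["K"] - ELECTRON,
    "[M+ACN+H]+": ACN + MONO["H"] - ELECTRON,
    "[M+ACN+Na]+": ACN + MONO["Na"] - ELECTRON,
    "[M+2Na-H]+": 2 * MONO["Na"] - MONO["H"] - ELECTRON,
}

def neutral_mass_candidates(mz):
    """Candidate neutral masses for one observed m/z, one per adduct type."""
    return {adduct: mz - shift for adduct, shift in ADDUCT_SHIFT.items()}

def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6

def within_threshold(measured, theoretical, threshold_ppm=0.15):
    return abs(ppm_error(measured, theoretical)) <= threshold_ppm

# Valine, C5H11NO2, as a worked example.
valine = 5 * MONO["C"] + 11 * MONO["H"] + MONO["N"] + 2 * MONO["O"]
valine_mh = valine + ADDUCT_SHIFT["[M+H]+"]
print(f"M = {valine:.5f} Da, [M+H]+ = {valine_mh:.5f}")
```

Each candidate neutral mass would then be compared with the theoretical masses of database formulas, and only formulas passing `within_threshold` would be retained.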

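In mixture analysis by MS, it is possible that intra- and inter-dimers and multimers may be generated. For example, ESI MS can yield both [M+H]+ and [2M+2H]2+ ions, which have the same monoisotopic mass-to-charge ratio, i.e., $[{}^{12}\mathrm{C}_{c}\,{}^{1}\mathrm{H}_{h+1}\,{}^{14}\mathrm{N}_{n}\,{}^{16}\mathrm{O}_{o}\,{}^{31}\mathrm{P}_{p}\,{}^{32}\mathrm{S}_{s}]^{+}$ and $[{}^{12}\mathrm{C}_{2c}\,{}^{1}\mathrm{H}_{2h+2}\,{}^{14}\mathrm{N}_{2n}\,{}^{16}\mathrm{O}_{2o}\,{}^{31}\mathrm{P}_{2p}\,{}^{32}\mathrm{S}_{2s}]^{2+}$. However, the dimer is readily recognized by an m/z separation of 0.5 between $[{}^{12}\mathrm{C}_{2c}]^{2+}$ and its $[{}^{12}\mathrm{C}_{2c-1}\,{}^{13}\mathrm{C}_{1}]^{2+}$ isotope peak. We did not observe any multimers of the reported metabolites.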
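The charge-state argument follows from the 13C/12C mass difference divided by the charge. A minimal sketch, with illustrative function names and example m/z values chosen only to show the arithmetic:

```python
C13_C12_DIFF = 1.0033548  # mass difference between 13C and 12C in Da

def isotope_spacing(charge: int) -> float:
    """Expected m/z spacing between the monoisotopic and first 13C isotope peak."""
    return C13_C12_DIFF / abs(charge)

def infer_charge(mz_mono: float, mz_first_isotope: float) -> int:
    """Estimate the charge state from the observed isotope spacing."""
    return round(C13_C12_DIFF / (mz_first_isotope - mz_mono))

print(f"z=1: {isotope_spacing(1):.4f}, z=2: {isotope_spacing(2):.4f}")
```

A singly charged monomer shows a spacing of about 1.003, whereas a doubly charged dimer at the same nominal m/z shows a spacing of about 0.502.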

Identification of Metabolites in the 25-Compound Model Mixture

After application of the weighted matching algorithm to match the 26 spin systems with the predicted 2D NMR HSQC spectra, the hits for each spin system were sorted according to the best matching chemical shift RMSD. Figure 2.8 shows the weighted matching scheme for the identification of metabolites by SUMMIT MS/NMR. Seven mixture compounds were returned as the top hits, namely valine, lysine, glutamine, isoleucine, arginine, ornithine, and leucine, and 4 compounds ranked between 2 and 10 (including fructose (ranked 4), one of a total of 6 carbohydrates). An additional 6 compounds ranked between 11 and 50, including adenosine (vide infra). Histidine and sucrose ranked 52nd and 61st, respectively. Shikimic acid was not returned in the top 100 hits. However, a compound that has the same MSMIC as shikimic acid is the top hit. 4 of the 6 carbohydrates were not in the top 100 hits due to high structural degeneracy and limited NMR chemical shift prediction accuracy. Finally, although the molecular weight of alanine (89.09320 Da) falls below the low-m/z limit of the FT-ICR MS, it can easily be identified by low-resolution MS (e.g., quadrupole ion trap). Table 2.2 shows the matching results for the 25-compound model mixture.

Table 2.3 shows the effect of the (+) ESI FT-ICR mass spectral peak magnitude threshold on the number of possible ChemSpider structures and hit rank for the 25-metabolite mixture. For example, the number of possible structures drops from 57,881 to 16,030 or 9,162 when the MS peak magnitude threshold is raised from the top 100 highest-magnitude peaks to just the top 30 or 20 highest-magnitude peaks.

Although adenosine in the 25-compound mixture was low-ranked (rank 44 among 92 hits), it shares the ribose ring as a common molecular structural motif with all 43 higher-ranked hits. Shikimic acid, galactose, glucose, lactose, and ribose were not identified (because their chemical shift RMSDs exceeded 5.0 ppm), but their molecular structural motifs were correctly identified by SUMMIT MS/NMR. These results show the ability of this approach to correctly identify structural motifs belonging to spin systems identified by NMR. The remaining ambiguity among molecules with the correct structural motif is due to limitations of the empirical chemical shift predictor, which can be alleviated in part by applying more accurate, but also more expensive, quantum-chemistry based chemical shift calculations to the top hits.137

To reduce the number of false-positive compounds returned by weighted matching, we optimized the mass error threshold for molecular formula determination to 0.15 ppm, which significantly reduced the number of false compounds and improved the rankings of the true compounds. As an example, the mixture compound leucine ranked 7th of a total of 29 hits if a 1.0 ppm mass error threshold was used, but it ranked as the top hit of 13 returned hits after lowering the mass error threshold to 0.15 ppm. Note that the average mass error for the true 25 compounds is 30 ppb and the maximum mass error is less than 120 ppb, which demonstrates the high mass accuracy that can be achieved from an ultrahigh-resolution FT-ICR mass spectrum for a metabolic mixture. Therefore, for an unknown metabolite in a metabolomics mixture, focusing on the low ppm mass accuracy molecular formulas will lead to fewer molecular formulas and thereby facilitate the true compound identification. It should be noted that the low ppm mass accuracy cutoff varies from sample to sample, but it can be determined for each sample by the identification of known abundant metabolites in the mixture.

Validation of Putative Metabolite Structure by FT-ICR MS/MS

MS/MS can serve to validate the top-ranked structures. As a demonstration, FT-ICR infrared multiphoton dissociation (IRMPD) was performed for the isolated precursor ions corresponding to SUMMIT MS/NMR top-ranked glutamine, lysine, arginine, and ornithine (Figures 2.9 and 2.10). Glutamine and lysine MS/MS yielded mass differences (between the precursor ion and product ion) of 17.02656 Da and 17.02658 Da (i.e., 0.01 mDa and 0.03 mDa deviation from the calculated mass 17.02655 Da) corresponding to loss of ammonia. Arginine and ornithine MS/MS yielded loss of ammonia (0.06 mDa and 0.05 mDa deviation from the calculated mass of 17.02655 Da) and loss of water (0.06 mDa and 0.06 mDa deviation from the calculated mass of 18.01056 Da). Collision-induced dissociation (CID) in a linear quadrupole ion trap yielded loss of ammonia, water, and carbon monoxide from the valine precursor ion (Figure 2.11). Therefore, the product ion mass spectrum further supports the highest ranked SUMMIT-based structures. Although the information content of MS/MS fragment analysis varies from metabolite to metabolite, MS/MS is expected to be most helpful for SUMMIT MS/MS/NMR for the identification of larger molecules, such as secondary metabolites.
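The neutral-loss assignments can be checked with a small sketch that compares an observed precursor-to-product mass difference with calculated neutral-loss masses. The 0.1 mDa default tolerance and the function name are assumptions chosen for illustration, not parameters taken from the text.

```python
MONO = {"H": 1.00782503, "C": 12.0, "N": 14.00307401, "O": 15.99491462}

# Calculated monoisotopic masses of the neutral losses discussed above.
NEUTRAL_LOSSES = {
    "NH3": MONO["N"] + 3 * MONO["H"],   # about 17.02655 Da
    "H2O": 2 * MONO["H"] + MONO["O"],   # about 18.01056 Da
    "CO": MONO["C"] + MONO["O"],        # about 27.99491 Da
}

def assign_neutral_loss(mass_difference, tol_mda=0.1):
    """Return (loss name, deviation in mDa) for the closest loss within tolerance."""
    best = None
    for name, mass in NEUTRAL_LOSSES.items():
        deviation = abs(mass_difference - mass) * 1000.0
        if deviation <= tol_mda and (best is None or deviation < best[1]):
            best = (name, deviation)
    return best

for observed in (17.02656, 17.02658):
    name, deviation = assign_neutral_loss(observed)
    print(f"{observed:.5f} Da -> {name}, deviation {deviation:.2f} mDa")
```

For the glutamine and lysine mass differences quoted above, this returns ammonia with deviations of about 0.01 and 0.03 mDa.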

2.3.3 Application to E. coli Cell Lysate

NMR and FT-ICR MS Data-Derived Information

397 potential spin systems were extracted from the 3D HSQC-TOCSY NMR spectrum of the E. coli cell lysate by applying the maximal clique approach in full analogy to the model mixture. All extracted spin systems included 3 or more spins. Besides one-spin systems, two-spin systems were also excluded to avoid the generation of a potentially large number of false positives. We obtained a total of 1095 molecular formulas by searching the FT-ICR broadband accurate masses against the METLIN database (see above), leading to the generation of 914,947 candidate molecular structures by screening the ChemSpider database.

Identification of Known Metabolites in E. coli

In the 25-compound model mixture, all of the compounds are known and they are contained in the NMR database, thereby enabling testing of the SUMMIT MS/NMR method. Here, we first apply SUMMIT MS/NMR to identify known metabolites in E. coli to test the power and limitations of the method by comparing the results with those obtained by querying the spectra directly against the COLMAR web server. The recently developed COLMARm web server module provides simultaneous analysis of 2D HSQC, 2D TOCSY, and 2D HSQC-TOCSY NMR spectra and was used to identify metabolites. Metabolites were first identified by querying the 2D HSQC against the COLMAR database, and subsequently verified by 2D TOCSY and 2D HSQC-TOCSY by use of COLMARm. 41 metabolites could be identified with high confidence by COLMARm (2D HSQC cross-peak matching ratio >0.8 and more than 50% of spin-spin connectivities showing up in the 2D TOCSY and 2D HSQC-TOCSY spectra). The 41 metabolites were treated as “putatively annotated metabolites” to be verified by SUMMIT MS/NMR. When implementing SUMMIT MS/NMR to verify the metabolites identified by COLMAR, we compared the identified metabolites with the matching results returned for each extracted spin system. Verification results are reported in Table 2.4 for metabolites that fulfill the following conditions: they are ranked among the top 200 hits if the total number of hits with a chemical shift RMSD < 5.0 ppm was 400 or less, or they are ranked within the top 50% of hits if the total number of hits with RMSD < 5.0 ppm exceeded 400. These criteria ensure that the most likely candidates are retained without making the pool unrealistically large. Based on cross-platform analytical methods to verify compounds, the identification and verification results by COLMAR and SUMMIT MS/NMR achieved level 2 confidence according to the Metabolomics Standards Initiative.58
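The reporting criteria for Table 2.4 amount to a two-branch rule. A minimal sketch, with an illustrative function name and the thresholds taken from the criteria stated above:

```python
def is_reported(rank: int, total_hits: int,
                small_pool: int = 400, top_n: int = 200,
                top_fraction: float = 0.5) -> bool:
    """Apply the reporting rule: top 200 for pools of up to 400 hits, else top 50%."""
    if total_hits <= small_pool:
        return rank <= top_n
    return rank <= top_fraction * total_hits

# Rank and total hits for three entries of Table 2.4.
print(is_reported(417, 4982))   # L-methionine
print(is_reported(136, 168))    # sucrose
print(is_reported(427, 1079))   # N-acetyl-lactosamine
```

All three example entries satisfy the rule, whereas a hypothetical rank of 300 among 500 hits would not.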

The following 13 known metabolites were successfully verified by SUMMIT MS/NMR: L-glutamine, L-valine, maltose, cellobiose, N-acetyl-putrescine, L-glutamic acid, D-glucose, spermidine, L-phenylalanine, L-tyrosine, N-α-acetyl-L-lysine, L-glutathione-reduced, and L-methionine. Adenosine, inosine, L-proline, leucine, pyridoxamine-5-phosphate-1, and guanosine could not be verified because not all of their cross-peaks showed up due to the relatively low abundance of these metabolites and the limited sensitivity of HSQC-TOCSY. However, by manually checking 2D HSQC-TOCSY and 3D HSQC-TOCSY, partial spin systems of these metabolites (covering 50% or more of the expected cross-peaks) could be identified. When implementing SUMMIT MS/NMR, we set the matching ratio cutoff to 1 to increase the identification accuracy when matching with FT-ICR MS-derived NMR spectra. For instance, if a compound contains a 5-spin system, but only a 4-spin sub-system could be reconstructed from 3D HSQC-TOCSY (e.g., because a resonance is very weak), the true (5-spin system) compound would not be returned as a hit by matching the experimental 4-spin system with the FT-ICR MS-derived NMR spectra.

Therefore, SUMMIT MS/NMR will verify only the metabolites that are detectable by both analytical methods, providing high validation confidence across platforms. In addition, off-line LC fractionation can be applied prior to MS/NMR analysis to decrease the complexity of the metabolite mixture and increase the chance of identifying more metabolites common to MS and NMR. Those metabolites that are identified by only one of the two analytical methods need to be further validated by other analytical methods, e.g., HPLC-MS. In any case, HPLC retention time (especially when calibrated by spiking with the putative metabolite) can help further validate any metabolite assignments based on NMR, MS, or a combination of the two. Finally, metabolites that are not detected by positive ESI can often be detected by negative ESI. For example, the MS1 accurate masses for acetyl-L-glutamine, DL- -glycerol-phosphoric acid, D-glucuronic acid, methyl-uridine, deoxythymidine monophosphate, uridine monophosphate, and cystathionine were detected by use of negative ESI (rms mass error 0.18 ppm), and those identities were confirmed by NMR.

Identification of Unknown Metabolites in E. coli by SUMMIT MS/NMR

The primary aim of SUMMIT MS/NMR is to identify unknown compounds that are not catalogued in current NMR and MS metabolomics databases. Unknown spin systems show self-consistent spin-spin connectivities in the 3D HSQC-TOCSY spectrum, but do not match any compound in the NMR database. Here, SUMMIT MS/NMR identified up to 15 unknown spin systems (compounds) in E. coli cell lysate. For instance, the spin system with chemical shifts (δH, δC) identified as (1.167, 19.542), (3.618, 59.207), (3.632, 75.944), (3.767, 73.296), (4.032, 70.725), and (5.573, 98.048) ppm shows high confidence as a true positive spin system (Figure 2.12), but could not be assigned to any known compound after querying against the COLMAR database. However, after matching against 914,947 predicted NMR spectra, 12 hits (RMSD < 5.0 ppm) were returned and 4 molecular structural motifs were identified (Figure 2.13). As a proof of principle for the identification of true compounds, we applied quantum-chemical calculations to two selected compounds among the 12 hits, namely L-fucosamine and 6-desoxy-D-glucosamine. 6-Desoxy-D-glucosamine was the top hit among the 12 hits. L-Fucosamine has been found as a constituent of mucopolysaccharides of certain enteric bacteria (e.g., Citrobacter freundii), but the existence of L-fucosamine in E. coli was previously unknown.138 Quantum-chemical calculations of NMR chemical shifts for these two compounds return a lower RMSD for L-fucosamine and, hence, L-fucosamine is more likely to be the true compound than 6-desoxy-D-glucosamine, consistent with the literature (Table 2.5). Although the true identity of this spin system remains uncertain, SUMMIT MS/NMR provides a small list of likely candidates, which represents actionable information for the identification of the true compound.
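The ranking of the two candidates can be reproduced from the shifts listed in Table 2.5. The sketch below assumes that peaks are paired by peak index as in the table, that 1H differences are scaled by a factor of 10, and that the sum of squares is normalized by 2N; these assumptions reproduce the tabulated RMSDs but are not stated explicitly in the text.

```python
from math import sqrt

H_WEIGHT = 10.0  # assumed scaling of 1H differences relative to 13C

def weighted_rmsd(pred, expt):
    """RMSD over paired (1H, 13C) peaks, with 1H differences scaled by H_WEIGHT."""
    total = sum(
        (H_WEIGHT * (ph - eh)) ** 2 + (pc - ec) ** 2
        for (ph, pc), (eh, ec) in zip(pred, expt)
    )
    return sqrt(total / (2 * len(expt)))

# Experimental spin system, ordered by peak index as in Table 2.5.
expt = [(1.167, 19.542), (3.618, 59.207), (4.032, 70.725),
        (3.632, 75.944), (3.767, 73.296), (5.573, 98.048)]

candidates = {
    "L-fucosamine": [(1.342, 17.506), (3.252, 60.607), (3.835, 71.816),
                     (3.893, 73.883), (3.901, 72.945), (4.887, 94.099)],
    "6-desoxy-D-glucosamine": [(1.337, 18.311), (3.207, 61.619), (3.737, 73.449),
                               (3.326, 77.943), (3.565, 74.916), (4.949, 93.834)],
}

ranking = sorted((weighted_rmsd(p, expt), name) for name, p in candidates.items())
for rmsd, name in ranking:
    print(f"{name}: {rmsd:.4f} ppm")
```

The difference between the two RMSDs is small (about 0.23 ppm), which is consistent with the cautious conclusion that the identity of this spin system remains uncertain.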

Pyroglutamic acid, which at the outset of this study was not part of the COLMAR 1H(13C)-TOCCATA database, represents another instructive example of the SUMMIT approach. SUMMIT MS/NMR successfully extracted the spin system and returned pyroglutamic acid as the 116th hit (Figure 2.14). Independently and at about the same time, the COLMAR 1H(13C)-TOCCATA database increased by 284 compounds,115 including pyroglutamic acid, thereby enabling the identification of pyroglutamic acid as a known metabolite, confirming the SUMMIT results. For the E. coli cell lysate, SUMMIT returned 15 unknown spin systems along with their candidate compounds. To maximize the confidence of the unknown spin systems, we included all of the pairwise-connected spins that appear along the 1D ω3 (1H) trace, and none of the peaks in the spin system matched the NMR database.

Finally, we note that some unknown spin systems have multiple candidate compounds, whereas others do not match any candidate compounds based on our metrics. Extending the mass range of FT-ICR MS will be helpful to incorporate all possible compounds and find the compound candidates for these unknown spin systems.

2.4 Conclusion

SUMMIT MS/NMR provides powerful fingerprints, based on spin system information, molecular formulas, and compound candidates in complex biological mixtures, thereby greatly assisting the analysis of complex metabolomics mixtures whose compositions are only partially known, without being limited to spectroscopic databases. SUMMIT is expected to find fruitful applications to support key objectives of contemporary metabolomics research, including the discovery of new biochemical pathways and biomarkers.

Table 2.1. Example of chemical shift (c.s.) matching results for valine.

| Functional group | Predicted 1H c.s. (ppm)a | Predicted 13C c.s. (ppm)a | Expt. 1H c.s. (ppm) | Expt. 13C c.s. (ppm) |
|---|---|---|---|---|
| -CγH3 | 0.910 | 19.32 | 1.034 | 20.56 |
| -CγH3 | 0.960 | 19.32 | 0.983 | 19.23 |
| -CβH | 2.220 | 31.07 | 2.267 | 31.69 |
| -CαH | 3.440 | 61.90 | 3.597 | 63.00 |

RMSD (ppm): 0.93

a The empirical chemical shift prediction was obtained by use of the NMR predictor by Modgraph embedded in the MestReNova software.

Table 2.2. SUMMIT results for 25-compound model mixture (mass error cutoff of 0.15 ppm).

| Compound | Rank | Total hits (c.s. RMSD<5.0 ppm) | Percentile = Rank/Total number of hits | c.s. RMSD (ppm) | Mass error (ppm) |
|---|---|---|---|---|---|
| valine | 1 | 122 | 0.8% | 0.93 | -0.04 |
| lysine | 1 | 39 | 2.6% | 1.54 | 0.04 |
| glutamine | 1 | 140 | 0.7% | 1.36 | 0.07 |
| isoleucine | 1 | 16 | 6.3% | 1.77 | 0 |
| arginine | 1 | 81 | 1.2% | 2.02 | 0.10 |
| ornithine | 1 | 62 | 1.6% | 2.05 | 0.05 |
| leucine | 1 | 13 | 7.7% | 2.06 | 0 |
| threonine | 3 | 79 | 3.8% | 1.71 | -0.08 |
| fructose | 4 | 116 | 3.4% | 1.32 | 0.05 |
| carnitine | 4 | 22 | 18.2% | 3.32 | 0.12 |
| cysteine | 10 | 142 | 7.0% | 2.17 | -0.05 |
| inosine | 13 | 99 | 13.1% | 2.27 | -0.02 |
| citrulline | 23 | 65 | 35.4% | 2.76 | 0.12 |
| methionine | 29 | 133 | 21.8% | 2.53 | 0.03 |
| serine | 30 | 514 | 5.8% | 2.2 | 0.07 |
| proline | 30 | 91 | 33.0% | 2.55 | 0 |
| adenosine | 44 | 92 | 47.8% | 2.74 | 0 |
| histidine | 52 | 181 | 28.7% | 2.56 | 0.11 |
| sucrose | 61 | 117 | 52.1% | 2.06 | 0.04 |

Table 2.3. Effect of cutoff threshold of mass peak amplitudes on the rank of SUMMIT results for 25-compound model mixture (0.15 ppm mass error cutoff).

| Index | Compound | Rank: Top 30 Mass Peaks | Rank: Top 20 Mass Peaks | Rank: All Mass Peaks | c.s. RMSD (ppm) |
|---|---|---|---|---|---|
| 1 | valine | 1 | 1 | 1 | 0.83 |
| 2 | lysine | 1 | 1 | 1 | 1.44 |
| 3 | glutamine | 1 | 1 | 1 | 1.35 |
| 4 | isoleucine | 1 | 1 | 1 | 1.86 |
| 5 | arginine | 1 | 1 | 1 | 1.95 |
| 6 | ornithine | 1 | 1 | 1 | 1.92 |
| 7 | leucine | 1 | 1 | 1 | 1.84 |
| 8 | threonine | 2 | N/A | 3 | 1.55 |
| 9 | fructose | 4 | 4 | 4 | 1.36 |
| 10 | carnitine | 4 | 3 | 4 | 3.14 |
| 11 | methionine | 6 | 3 | 29 | 2.18 |
| 12 | proline | 7 | 5 | 30 | 2.54 |
| 13 | cysteine | 7 | N/A | 10 | 1.93 |
| 14 | citrulline | 11 | 10 | 23 | 2.40 |
| 15 | inosine | 13 | 13 | 13 | 2.26 |
| 16 | serine | 17 | 11 | 30 | 2.07 |
| 17 | histidine | 18 | 9 | 52 | 2.58 |
| 18 | adenosine | 39 | 38 | 44 | 2.63 |
| 19 | sucrose | 61 | 61 | 61 | 1.77 |
| | Structure manifolds | 16,030 | 9,162 | 57,881 | |

Table 2.4. Metabolites identified in E. coli cell lysate and verified by COLMAR web server and SUMMIT MS/NMR. (Highlighted metabolites in green were consistent with identification by COLMARm (based on HSQC for query and TOCSY + HSQC-TOCSY for validation); highlighted metabolites in blue were identified based on the COLMAR HSQC database alone.)

| Compound | Rank | Total hits (RMSD<5.0 ppm) | Mass error (ppm) | RMSD (ppm) |
|---|---|---|---|---|
| L-glutamine | 1 | 6268 | 0.14 | 1.27 |
| L-valine | 5 | 5089 | 0.13 | 0.99 |
| maltose | 14 | 641 | 0.08 | 1.52 |
| cellobiose | 29 | 641 | 0.08 | 1.73 |
| N-acetyl-putrescine | 45 | 1697 | 0.08 | 1.87 |
| L-glutamic acid | 79 | 5549 | 0.14 | 1.58 |
| D-glucose | 170 | 659 | -0.07 | 2.17 |
| spermidine | 213 | 1697 | 0.14 | 2.82 |
| L-phenylalanine | 230 | 7685 | 0.06 | 2.06 |
| L-tyrosine | 251 | 7685 | 0.11 | 2.07 |
| N-alpha-acetyl-L-lysine | 278 | 576 | 0.11 | 2.94 |
| L-glutathione-reduced | 313 | 6268 | 0.10 | 2.55 |
| L-methionine | 417 | 4982 | 0.13 | 2.49 |
| D-xylose | 1 | 96 | 0.09 | 1.85 |
| D-sorbose | 3 | 285 | -0.07 | 1.59 |
| lactose | 6 | 744 | 0.08 | 1.71 |
| N-alpha-acetyl-ornithine | 7 | 1552 | 0.18 | 2.18 |
| AMP | 19 | 36 | 0.12 | 2.69 |
| D-ribose | 26 | 142 | 0.09 | 2.51 |
| D-galactose | 33 | 659 | -0.07 | 1.74 |
| 2-pyrrolidinone-5-carboxylate | 275 | 5549 | 0.08 | 2.02 |
| lysine | 281 | 576 | 0.20 | 2.95 |
| isovalerylglutamic acid | 289 | 3773 | 0.13 | 1.8 |
| S-adenosyl-L-homocysteine | 332 | 4982 | 0.07 | 2.3 |
| gamma-glutamylcysteine | 367 | 6268 | -0.05 | 2.55 |
| D-glucosamine | 9 | 144 | 0.12 | 2.46 |
| xylulose | 10 | 1629 | 0.09 | 1.69 |
| L-arabinose | 25 | 96 | 0.09 | 2.58 |
| D-tagatose | 26 | 285 | -0.07 | 1.84 |
| muramic acid | 32 | 81 | 0.01 | 3.09 |
| melibiose | 36 | 99 | 0.08 | 2.22 |
| isomaltose | 38 | 99 | 0.08 | 2.41 |
| D-salicin | 80 | 744 | -0.06 | 2.29 |
| sucrose | 136 | 168 | 0.08 | 3.26 |
| beta-gentiobiose | 345 | 956 | 0.08 | 2.48 |
| D-trehalose | 378 | 1139 | 0.08 | 2.42 |
| N-acetyl-lactosamine | 427 | 1079 | 0.09 | 2.45 |

Table 2.5. Quantum-chemical calculations of NMR chemical shifts for two selected compounds, together with matching results from experimental spectra. The quantum-chemical calculations were performed according to our MOSS-DFT protocol published recently.3

| Compound | Peak index | Predicted 1H c.s. (ppm)a | Predicted 13C c.s. (ppm)a | Expt. 1H c.s. (ppm) | Expt. 13C c.s. (ppm) |
|---|---|---|---|---|---|
| L-fucosamine | 1 | 1.342 | 17.506 | 1.167 | 19.542 |
| | 2 | 3.252 | 60.607 | 3.618 | 59.207 |
| | 3 | 3.835 | 71.816 | 4.032 | 70.725 |
| | 4 | 3.893 | 73.883 | 3.632 | 75.944 |
| | 5 | 3.901 | 72.945 | 3.767 | 73.296 |
| | 6 | 4.887 | 94.099 | 5.573 | 98.048 |
| | RMSD (ppm) | 2.9335 | | | |
| 6-desoxy-D-glucosamine | 1 | 1.337 | 18.311 | 1.167 | 19.542 |
| | 2 | 3.207 | 61.619 | 3.618 | 59.207 |
| | 3 | 3.737 | 73.449 | 4.032 | 70.725 |
| | 4 | 3.326 | 77.943 | 3.632 | 75.944 |
| | 5 | 3.565 | 74.916 | 3.767 | 73.296 |
| | 6 | 4.949 | 93.834 | 5.573 | 98.048 |
| | RMSD (ppm) | 3.1622 | | | |

[Figure 2.1 image: panel A, 2D HSQC and 3D HSQC-TOCSY spectra with axes ω1 (13C), ω2 (1H), and ω3 (1H); panel B, 1D traces 1-4 plotted against ω3 (1H) (ppm).]

Figure 2.1. Extraction of spin systems of individual mixture compounds from 3D 13C-1H HSQC-TOCSY. Panel A shows the relationship between cross-peaks from the 2D 13C-1H HSQC spectrum (left) and the 3D 13C-1H HSQC-TOCSY spectrum (right). Panel B illustrates how 1D cross-sections along ω3 (1H) of the 3D HSQC-TOCSY spectrum of (A) yield spin system information, which is extracted by use of a maximal clique approach. Traces 1, 2, 3 show high similarity because they belong to the same spin system consisting of 3 protons, whereas trace 4 belongs to a separate spin system with a single proton.

Figure 2.2. Illustration of the spin system refinement procedure based on 2D HSQC, 2D TOCSY, and 2D/3D HSQC-TOCSY NMR spectra. The following steps are depicted: a) merging of two separate 1H peaks into one peak; b) identification of extra spins (indicated by arrows in (i)) by 1D ω3 (1H) trace comparison.

[Figure 2.3 image: plot annotations give the linear fits y = 0.911x + 0.287 (R² = 0.9604) and y = 0.982x + 1.976 (R² = 0.9889).]

Figure 2.3. Predicted chemical shifts compared with the experimental chemical shifts of the 25-compound model mixture. The RMSDs between predicted and experimental proton and carbon chemical shifts are 0.292 ppm and 2.903 ppm, respectively.

[Figure 2.4 graphic. 2D 13C-1H HSQC contour plot with labeled cross-peaks of L-glutamic acid, L-glutamine, and glutathione. Axes: 1H (ppm), 2.0 to 4.5, and 13C (ppm), 30 to 55.]

Figure 2.4. 2D 13C-1H HSQC of a mixture of L-glutamine, glutathione, and L-glutamic acid. Peaks with the common motif (HOOCCH(NH2)CH2CH2CONH-) are highlighted by symbols. The spectrum illustrates the similarity of chemical shifts of identical fragments (colored in magenta) that are part of different molecules. Only cross-peaks that belong to the motif are labeled.


Figure 2.5. Putatively annotated valine spin system in a 25-metabolite model mixture extracted from the 3D HSQC-TOCSY spectrum and confirmed by 2D TOCSY and 2D HSQC-TOCSY. Panel A: four single-bond C-H cross-peaks (blue) of valine in the 2D HSQC (left) and 2D HSQC-TOCSY (right) spectra. The expected relay HSQC-TOCSY cross-peaks of the spin system are highlighted in red. Panel B: four different ω1-ω2 planes of the 3D HSQC-TOCSY spectrum belonging to valine. Blue peaks obey ω2 = ω3, and the red peaks are the other expected 3D cross-peaks of the valine spin system.


Figure 2.6. Molecular structural motifs identified by SUMMIT from 122 different compound candidates that all match the spin system of valine. The hits were sorted into different groups according to their common molecular motif that represents the NMR-derived spin system.

Figure 2.7. Positive electrospray ionization 9.4 T FT-ICR broadband mass spectrum of E. coli cell lysate.


[Figure 2.8 graphic. Columns: A) experimental spin system chemical shifts, B) predicted spin system chemical shifts (57,881 in total), C) top hit. Rows: arginine, glutamine, isoleucine, lysine, ornithine, valine, leucine. Axes: ω1 (13C) (ppm) and ω2 (1H) (ppm).]

Figure 2.8. Identification of best matching metabolites in a 25-compound model mixture by SUMMIT MS/NMR. A) Experimental 2D HSQC NMR spectra of metabolites extracted from 3D HSQC-TOCSY. Each spectrum is a collection of enlarged spectral regions (separated by dotted lines) that contain the corresponding cross-peaks. B) Predicted 2D HSQC NMR spectra from FT-ICR MS-derived molecular structures (57,881 in total). Each experimental HSQC spectrum was compared with all 57,881 predicted HSQC spectra by maximum weighted bipartite matching. All returned hits were ranked according to their chemical shift RMSD. C) Top hit compounds that belong to true compounds in the model mixture. Molecular substructures highlighted in magenta correspond to the molecular structural motifs (MSMIC) of the matched spin system. The spectra of panel B) were sorted so that each experimental spectrum of (A) is adjacent to its top hit (B). Each edge of the graph connecting (A) and (B) carries a chemical shift RMSD.


Figure 2.9. FT-ICR MS/MS of glutamine and lysine in the 25-metabolite mixture. Glutamine and lysine MS/MS yield mass differences (between the precursor ion and product ion) of 17.02656 Da and 17.02658 Da (i.e., 0.01 mDa and 0.03 mDa deviation from the calculated mass of 17.02655 Da), corresponding to loss of ammonia.
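The deviations quoted in this caption can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming standard monoisotopic masses for 14N and 1H; the function name is hypothetical and introduced only for illustration.

```python
# Monoisotopic masses in Da (standard values for 14N and 1H, rounded).
MASS_N14 = 14.0030740
MASS_H1 = 1.0078250

def deviation_mda(observed_da, calculated_da):
    """Deviation of an observed neutral loss from a calculated mass, in mDa."""
    return (observed_da - calculated_da) * 1000.0

# Calculated mass of NH3, rounded to the five decimals used in the caption.
nh3_mass = round(MASS_N14 + 3 * MASS_H1, 5)
print(f"NH3: {nh3_mass:.5f} Da")

# Observed precursor-to-product mass differences taken from the caption.
for name, loss in [("glutamine", 17.02656), ("lysine", 17.02658)]:
    print(f"{name}: {deviation_mda(loss, nh3_mass):.2f} mDa")
```

Deviations of this size are far smaller than the mass difference to other plausible neutral losses, which is why the assignment to ammonia is unambiguous.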


Figure 2.10. Infrared multiphoton dissociation positive electrospray ionization 9.4 T FT-ICR product ion mass spectrum of arginine and ornithine.


Figure 2.11. Collision-induced dissociation (normalized collision energy 22) Velos Pro product ion mass spectrum of valine.


Figure 2.12. Spin system of an unknown compound from an E. coli cell lysate extracted from 3D HSQC-TOCSY and verified by 2D TOCSY and 2D HSQC-TOCSY (2D TOCSY is not shown). A) Cross-peaks of the unknown compound shown in 2D HSQC (left, blue cross-peaks) and 2D HSQC-TOCSY (right, blue and red cross-peaks) spectra. B) Six cross-peaks of the unknown compound depicted in 6 different 2D slices () at fixed frequency of the 3D HSQC-TOCSY spectrum (blue symbols: diagonal peaks; red symbols: cross-peaks).


Figure 2.13. Motif identification of compound candidates for the spin system of an unknown compound from E. coli cell lysate. Four different motifs were identified, highlighted in red, green, blue and magenta, which are consistent with the NMR data.


Figure 2.14. Spin system of pyroglutamic acid extracted from 3D HSQC-TOCSY and confirmed by 2D TOCSY and 2D HSQC-TOCSY. a) Cross-peaks of pyroglutamic acid shown in the 2D HSQC (blue peaks) and 2D TOCSY spectra (blue and red). b) Four cross-peaks of pyroglutamic acid shown in multiple 2D slices of the 3D HSQC-TOCSY. Cross-peaks (δH, δC) at (2.04, 28.0) ppm and (2.51, 28.0) ppm are two separate peaks of the same CH2 group.

Accurate and Efficient Determination of Unknown Metabolites in Metabolomics by NMR-Based Molecular Motif Identification

Knowledge of the chemical identity of metabolite molecules is critical for the understanding of the complex biological systems they belong to. Since metabolite identities and their concentrations are often directly linked to the phenotype, such information can be used to map biochemical pathways and understand their role in health and disease. A very large number of metabolites, however, are still unknown, i.e. their spectroscopic signatures do not match those in existing databases, suggesting unknown molecule identification is both imperative and challenging. Although metabolites are structurally highly diverse, the majority share a rather limited number of structural motifs, which are defined by sets of 1H, 13C chemical shifts of the same spin system. This allows one to characterize unknown metabolites by a divide-and-conquer strategy that identifies their structural motifs first. Here we present the structural motif-based approach “SUMMIT Motif” for the de novo identification of unknown molecular structures in complex mixtures, without the need for extensive purification, using NMR in tandem with two newly curated NMR molecular structural motif metabolomics databases (MSMMDB). To identify the structural motif(s), the 1H and 13C chemical shifts of all individual spin systems are first extracted from 2D and 3D NMR spectra of the complex mixture. Next, the molecular structural motifs are identified by querying these chemical shifts against the new MSMMDBs. One database, COLMAR MSMMDB, was derived from experimental NMR chemical shifts of known metabolites taken from the COLMAR metabolomics database, while the other MSMMDB, pNMR MSMMDB, is based on empirically predicted chemical shifts of metabolites of several existing large metabolomics databases. For molecules consisting of multiple spin systems, the spin systems are connected via a long-range scalar J-coupling NMR experiment. When this motif-based identification method was applied to the hydrophilic extract of mouse bile fluid, two unknown metabolites could be successfully identified. This approach is both accurate and efficient for the identification of unknown metabolites and hence contributes to the understanding of human health and disease.


3.1 Introduction

The chemical complexity of living organisms is reflected in the large number of different metabolites they are composed of. The human body alone may contain over 100,000 different metabolites, but the majority still needs to be identified and characterized.139 Such identification is critical to identify potential biomarkers and study new biochemical pathways for the better understanding of biological processes involved in health and disease. The system-wide study of metabolites and pathways, also in relation to the phenotype, is the subject of the field of metabolomics.6, 8, 11

Structure determination of novel organic molecules is a standard task in synthetic organic chemistry and natural product research. Analytical methods, such as infra-red (IR) and UV/Vis spectroscopy, high-resolution mass spectrometry (HRMS) and 1H and 13C NMR spectroscopy are routinely used.140 However, traditional synthetic or natural product characterization requires that a compound has been purified and isolated. In metabolomics, this is often impractical as it can be hard to efficiently isolate a compound at sufficient concentration among hundreds of molecular species. With respect to NMR, important methodological advances now allow routine characterization of known metabolites in a wide range of different complex mixtures with little or no purification.28, 139

Over the recent past, several approaches have been introduced that aim at the de novo characterization and structure determination of metabolites directly in the complex mixture environment.106, 141-146 We recently introduced a protocol for the identification of unknown metabolites in metabolomics samples without the need for purification that combines MS, NMR, and cheminformatics.99, 147 The approach, named SUMMIT MS/NMR, uses accurate mass information, e.g. from Fourier transform ion cyclotron resonance (FT-ICR), to determine the elemental composition of the metabolites present in the sample. A large pool of chemical compounds is then generated, which are consistent with the MS-derived molecular formulas. The candidate compounds are then filtered against multidimensional NMR data, in particular 1H and 13C chemical shifts that belong to individual spin systems. All candidate compounds are rank-ordered by comparing their predicted NMR chemical shifts with experimentally determined chemical shifts of the unknown compounds in the mixture.

SUMMIT MS/NMR requires that two key conditions are fulfilled: (i) the unknown compound must be present in the pool of candidate structures and (ii) the accuracy of the chemical shift predictor must be sufficiently high to identify the correct compound from potentially many others. In practice, both conditions pose specific challenges. For condition (i), the presence of the unknown metabolite in the pool of structures can be difficult to meet depending on the unknown and how the pool of structures has been generated. Instead, it can be easier and less ambiguous to first establish more general molecular structural properties of the unknown, such as its structural motif(s), before attempting to characterize its full molecular structure. For condition (ii), empirical chemical shift predictors, such as Modgraph/Mnova,148, 149 are fast, but their accuracy is limited. The average root-mean-square deviations (RMSD) of empirically predicted chemical shifts for a set of representative metabolites are around 0.292 ppm (1H) and 2.90 ppm (13C), and can be improved to 0.154 ppm (1H) and 1.93 ppm (13C) by quantum-chemical calculations with multiple scaling150 at the cost of largely increased computation time. Due to the currently limited accuracy of the chemical shift prediction, the SUMMIT MS/NMR approach returns a potentially large number of compounds as viable candidates, which makes their final verification in terms of their purchase or chemical synthesis followed by spiking experiments in the complex mixture both time-consuming and expensive.


Here, we present an alternative approach, named SUMMIT Motif, for the de novo determination of molecular structures of unknowns by first focusing on the determination of molecular structural motif(s) (MSM) without the need for any mass spectrometry data. Such information represents a key step toward the determination of the full structure. The approach starts out with the identification of 1H and 13C NMR spin systems to define an unknown metabolite’s backbone or contiguous parts thereof. The MSM is then identified by querying the experimental chemical shifts of the unknown spin system against those of MSMs in both an experimental and a synthetic MSM chemical shift database. It is shown that this approach is highly effective for MSM identification provided that the MSM of the unknown compound is in fact present in the molecular structural motif metabolomics databases (MSMMDB). As is shown here, this requirement is much more easily fulfilled than condition (i) of SUMMIT MS/NMR described above. The power of the approach is demonstrated by determining unknowns present in mouse bile fluid.

3.2 Materials and Methods

3.2.1 Definition of spin systems, molecular structural motifs, and COLMAR MSMMDB curation

For any given molecular structure, molecular structural motifs (MSM) are defined using spin systems as a starting point. As customarily defined, for each proton spin system consisting of NH protons, each 1H spin is connected to another 1H spin through no more than 3 bonds. A basic molecular structural motif is then defined by the 1H spins together with up to NC carbons (13C or 12C) where each carbon is directly attached to at least one of the 1H atoms (Figure 3.1). This “0th shell molecular motif” can then be systematically expanded by including additional atoms (N, O, S, P) that are not directly observable in 1H and 13C NMR experiments. When including additional heavy atoms that are exactly one bond away one obtains the “1st shell molecular structural motif” and when heavy atoms are included that are up to two bonds away the “2nd shell molecular structural motif” is obtained (Figure 3.1). The shell order is analogous to the HOSE code used for the empirical prediction of chemical shifts of small molecules.151 The higher the shell order, the more chemically distinct the structural motif is and the more unique its 1H and 13C chemical shifts are, but the lower the chance that one or several molecules with the same structural motif already exist in a current metabolomics NMR database.
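The shell expansion described above amounts to a breadth-first walk over the heavy-atom graph, starting from the proton-bearing carbons of the spin system. The following is a minimal sketch under that reading; the function name, the atom labels, and the choice of alanine as an example are assumptions made for illustration and do not describe the software used to build the databases.

```python
from collections import deque

def shell_motif(bonds, core_atoms, shell_order):
    """Return the heavy atoms within `shell_order` bonds of the 0th shell.

    bonds: dict mapping each heavy atom label to the set of bonded heavy atoms.
    core_atoms: carbons that carry the protons of the spin system (0th shell).
    """
    distance = {atom: 0 for atom in core_atoms}
    queue = deque(core_atoms)
    while queue:
        atom = queue.popleft()
        if distance[atom] == shell_order:
            continue  # do not grow beyond the requested shell
        for neighbour in bonds[atom]:
            if neighbour not in distance:
                distance[neighbour] = distance[atom] + 1
                queue.append(neighbour)
    return set(distance)

# Heavy-atom graph of alanine, CH3-CH(NH2)-COOH (hypothetical atom labels).
alanine = {
    "CB": {"CA"},
    "CA": {"CB", "N", "C"},
    "N": {"CA"},
    "C": {"CA", "O1", "O2"},
    "O1": {"C"},
    "O2": {"C"},
}
core = ["CA", "CB"]  # carbons bearing the H-alpha and methyl protons

for order in (0, 1, 2):
    print(order, sorted(shell_motif(alanine, core, order)))
```

In this toy case the 1st shell adds the amine nitrogen and the carboxyl carbon, and the 2nd shell adds the two carboxyl oxygens, which illustrates why higher shells are more specific but less likely to recur in other molecules.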

In order to recognize molecular structural motifs that are part of an unknown metabolite, a 1st and 2nd shell molecular structural motif database was generated from the COLMAR small molecule database (experimentally measured in aqueous solution) along with the 1H, 13C chemical shifts of each motif, which were assigned to specific groups within each motif. It is called COLMAR MSM Metabolomics Database or COLMAR MSMMDB. COLMAR MSMMDB stores the molecular structures of 1st and 2nd shell MSMs and the parent metabolites of each MSM along with their chemical shifts, their averages, and their standard deviations. By considering motifs with NC > 1 spin systems only, 623 metabolites in the parent COLMAR metabolomics database share 180 unique 1st shell and 397 unique 2nd shell molecular structural motifs (Table 3.1). Examples of 2nd shell molecular structural motifs are depicted in Figure 3.2, along with 1H, 13C chemical shift standard deviations of individual atoms.

The standard deviations of chemical shifts in 2nd shell MSMs are generally well below the average chemical shift errors of NMR chemical shift prediction programs (see Figure 3.3). This demonstrates that the chemical shifts of 2nd shell MSMs are in most cases considerably more accurate than computationally predicted chemical shifts, which is the reason why 2nd shell MSMs have a better chance to be successfully identified from experimental chemical shifts of unknown metabolites.


3.2.2 Curation of the empirically predicted pNMR MSMMDB

Since the COLMAR MSMMDB features only a subset of MSMs (currently 180 1st shell and 397 2nd shell MSMs) of all possible metabolite MSMs, it is possible that an unknown spin system does not have a good match. For such cases, an MSM database has been built that covers a larger range of MSMs from empirically predicted, rather than experimental chemical shifts. This database, termed pNMR molecular structural motif metabolomics database (pNMR MSMMDB), consists of MSMs that were extracted from molecules in the HMDB, the Chemical Entities of Biological Interest (ChEBI) database, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database.38, 152, 153 The HMDB is currently the most comprehensive, organism-specific metabolomics database and is the largest collection of human metabolites with their chemical structures and biological roles annotated. ChEBI covers both metabolites produced in biological systems and synthetic products that can intervene with living organisms. The KEGG database is one of the most widely used biochemical pathway databases, containing metabolites involved in human diseases and molecular interactions in various organisms. The pNMR MSMMDB, which currently covers 23,697 metabolites with a molecular weight below 800 Da, focuses on motifs of hydrophilic metabolites, which are defined as metabolites with a predicted lipophilicity logP value smaller than 3.0 (as computed by ALOGP,154, 155 Figure 3.4). The pNMR MSMMDB contains motifs that overlap with those of COLMAR MSMMDB, but with their chemical shifts predicted rather than experimentally determined.
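The two selection criteria stated above (molecular weight below 800 Da and ALOGP below 3.0) can be expressed as a simple filter. The following is a minimal sketch; the `Candidate` record, the function name, and all compound entries are hypothetical placeholders, and only the two cutoffs are taken from the text.

```python
from dataclasses import dataclass

MAX_MW_DA = 800.0  # molecular weight limit quoted in the text
MAX_ALOGP = 3.0    # ALOGP cutoff for hydrophilic metabolites quoted in the text

@dataclass
class Candidate:
    name: str
    mol_weight: float  # Da
    alogp: float       # predicted lipophilicity

def is_pnmr_candidate(candidate):
    """True if the compound passes both the weight and lipophilicity filters."""
    return candidate.mol_weight < MAX_MW_DA and candidate.alogp < MAX_ALOGP

# Placeholder entries with made-up property values, for illustration only.
pool = [
    Candidate("hypothetical_sugar", 180.2, -2.9),
    Candidate("hypothetical_lipid", 760.0, 8.5),
    Candidate("hypothetical_peptide", 912.4, -1.2),
]
selected = [c.name for c in pool if is_pnmr_candidate(c)]
print(selected)
```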

1H, 13C chemical shifts were computed and stored for each compound in the pNMR MSMMDB using the empirical chemical shift predictor by Modgraph implemented in MestReNova 10.0.1 (Mestrelab Research). The 1H chemical shift prediction is based on the effects of functional groups that were individually parameterized, whereas the 13C chemical shift prediction is achieved with a HOSE code algorithm. The predicted 1H, 13C chemical shifts have been sorted into individual spin systems belonging to unique MSMs so that they can be compared directly with experimental 1H, 13C chemical shifts extracted from experiments. For each MSM, predicted chemical shifts from multiple metabolites are stored separately.

3.2.3 Single and multiple spin system analysis from 2D and 3D NMR experiments

Entire spin systems of individual molecules are extracted from 2D 1H-1H TOCSY, 2D 13C-1H HSQC, and 3D 13C-1H HSQC-TOCSY spectra directly applied to the complex mixture of interest. This is automatically achieved by applying methods developed previously in our lab using graph theory and maximal clique analysis.125 Specifically, each directly bonded 13C-1H cross-peak in a 2D HSQC is defined as a node of a mathematical graph. Edges between the nodes correspond to connectivities between pairs of 13C-1H cross-peaks observed in TOCSY-type spectra (2D TOCSY, 2D HSQC-TOCSY and 3D HSQC-TOCSY). The graph is then subjected to maximal clique analysis using the Bron-Kerbosch algorithm where individual cliques correspond to separate spin systems.125, 147
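The graph construction and clique search described above can be sketched compactly. The example below is a minimal illustration using the basic Bron-Kerbosch recursion without pivoting; the function names and the toy peak list (modelled on the four traces of Figure 2.1) are assumptions, and the published implementation may differ in how edges are derived from the spectra.

```python
def bron_kerbosch(r, p, x, adjacency, cliques):
    """Basic Bron-Kerbosch recursion (no pivoting) collecting maximal cliques."""
    if not p and not x:
        cliques.append(sorted(r))
        return
    for node in list(p):
        bron_kerbosch(r | {node}, p & adjacency[node], x & adjacency[node],
                      adjacency, cliques)
        p = p - {node}  # node has been fully explored
        x = x | {node}  # exclude it from later cliques at this level

def spin_systems(nodes, edges):
    """Nodes are HSQC cross-peaks; edges are TOCSY-type connectivities."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(nodes), set(), adjacency, cliques)
    return sorted(cliques)

# Toy input: peaks 1-3 are mutually connected, peak 4 has no connectivity.
peaks = [1, 2, 3, 4]
connectivities = [(1, 2), (1, 3), (2, 3)]
print(spin_systems(peaks, connectivities))
```

Requiring a clique, rather than a connected component, means that every pair of cross-peaks in a spin system must be mutually connected, which guards against accidental merging of spin systems through a single spurious edge.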

Considering that molecules can be composed of multiple motifs, once individual motifs have been identified, the connectivity of multiple MSMs can be further explored by additional NMR experiments. The 2D heteronuclear multiple bond correlation (HMBC) experiment is widely used to measure long-range heteronuclear coupling constants for the stereochemical and conformational analysis of biologically active natural products,156 and it is also a promising tool for the structure elucidation of metabolites and natural products in complex mixtures.157, 158 The heteronuclear single quantum multiple bond correlation (HSQMBC) experiment159 detects long-range heteronuclear correlations through nJ(CH) (n>1) couplings (~ 2 - 10 Hz), which makes the identification of quaternary carbons possible as well. Connectivity information between separate spin systems within a molecule can hence be retrieved via the 2D 13C-1H PIP-HSQMBC spectrum.160 Therefore, after extracting individual spin systems of each compound from 2D and 3D TOCSY-type spectra, a 2D 13C-1H HSQMBC experiment is performed to identify connectivities between different spin systems to establish whether they belong to the same compound.
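Grouping spin systems that are linked by long-range correlations is a connected-components problem. The following is a minimal sketch using a union-find structure; the function name, atom labels, and peak list are hypothetical, and the sketch ignores correlations to quaternary carbons that do not belong to any extracted spin system.

```python
def merge_spin_systems(spin_systems, long_range_peaks):
    """Group spin systems linked by long-range 1H-13C correlations.

    spin_systems: dict mapping a spin system name to its set of atom labels.
    long_range_peaks: list of (proton_label, carbon_label) correlations.
    """
    parent = {name: name for name in spin_systems}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    owner = {atom: name for name, atoms in spin_systems.items() for atom in atoms}
    for proton, carbon in long_range_peaks:
        if proton in owner and carbon in owner:
            root_a, root_b = find(owner[proton]), find(owner[carbon])
            if root_a != root_b:
                parent[root_b] = root_a  # the peak links two spin systems

    groups = {}
    for name in spin_systems:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical spin systems and a single inter-system correlation.
systems = {
    "A": {"H1", "C1", "H2", "C2"},
    "B": {"H5", "C5"},
    "C": {"H9", "C9"},
}
peaks = [("H2", "C5"), ("H1", "C2")]
print(merge_spin_systems(systems, peaks))
```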

3.2.4 Workflow of MSM-based metabolite identification

The total workflow of metabolite identification based on molecular structural motifs extracted from NMR spin systems is depicted in Figure 3.5. After identification of an H-C spin system from 2D/3D TOCSY spectra, it is queried against COLMAR MSMMDB. If no hits are returned with RMSD < 2.5 ppm, the spin system information is then queried against pNMR MSMMDB. This two-pronged SUMMIT Motif approach focuses first on the query against the more accurate experimental COLMAR MSMMDB before turning to the larger, but less accurate pNMR MSMMDB. The MSM hits returned by COLMAR MSMMDB (with RMSD < 2.5 ppm) and the top 15 MSM hits returned by pNMR MSMMDB (with RMSD < 5.0 ppm) are subject to structure determination of unknown metabolites with spiking NMR experiments and/or additional experiments. Molecular structural motif queries based on COLMAR MSMMDB and pNMR MSMMDB are publicly accessible via the COLMAR suite of web servers (http://spin.ccic.ohio-state.edu/index.php/motif).
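The two-stage decision logic of this workflow can be written down in a few lines. The sketch below takes precomputed hit lists as input; the function name, the tuple layout, and the motif identifiers are hypothetical, and only the cutoffs (2.5 ppm, 5.0 ppm, top 15) come from the text. It does not represent the code behind the COLMAR web server.

```python
COLMAR_CUTOFF_PPM = 2.5  # RMSD cutoff for COLMAR MSMMDB hits
PNMR_CUTOFF_PPM = 5.0    # RMSD cutoff for pNMR MSMMDB hits
PNMR_MAX_HITS = 15       # number of pNMR hits carried forward

def summit_motif_query(colmar_hits, pnmr_hits):
    """Return (database name, ranked hits) following the two-stage workflow.

    colmar_hits, pnmr_hits: lists of (motif_id, rmsd_ppm) tuples.
    """
    accepted = sorted((h for h in colmar_hits if h[1] < COLMAR_CUTOFF_PPM),
                      key=lambda h: h[1])
    if accepted:
        return "COLMAR MSMMDB", accepted
    # No sufficiently close experimental motif: fall back to predicted shifts.
    fallback = sorted((h for h in pnmr_hits if h[1] < PNMR_CUTOFF_PPM),
                      key=lambda h: h[1])
    return "pNMR MSMMDB", fallback[:PNMR_MAX_HITS]

# Hypothetical hit lists for one unknown spin system.
colmar = [("motif_17", 3.2), ("motif_4", 2.9)]
pnmr = [("p_motif_b", 4.1), ("p_motif_a", 1.9), ("p_motif_c", 6.3)]
print(summit_motif_query(colmar, pnmr))
```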

3.2.5 Sample Preparation

S. Typhimurium was used as a mouse model of S. Typhi infection in vivo, as currently no animal model for the human-specific S. Typhi exists. Naturally resistant NRAMP1(SLC11A1)+/+ 129x1/SvJ mice were fed a lithogenic diet (1% cholesterol and 0.5% cholic acid; Envigo/Harlan Laboratory, IN) for 8 weeks to induce gallstone formation.


After completing the diet, mice were intraperitoneally infected with 100 µL of PBS containing 10^4 S. Typhimurium or 100 µL of PBS alone. 15 (14) mice were sacrificed 10 days post-infection (PI) for metabolite analysis. About 30 µL pooled bile was collected from the uninfected mice (PBS) and about 65 µL pooled bile was collected from the infected mice (S. Typhimurium). Both samples were subjected to an immediate metabolite extraction procedure. The freshly collected bile was sequentially mixed with 260 µL ice-cold methanol, 260 µL ice-cold chloroform, and 195 µL of ice-cold water, and vortexing was applied after each solvent addition. The mixture was then placed on ice for 30 minutes, and centrifuged at 5000 x g for 30 min at 4 °C for phase separation. The polar phase was then lyophilized and re-dissolved in ice-cold water for subsequent ultra-filtration to remove residual macromolecules. The ultra-filtration step was carried out with an Amicon Ultra 0.5 mL centrifuge filter (MWCO 3 kDa). The filtrate was lyophilized and the powder was reconstituted in 600 µL D2O for NMR measurements with 0.1 mM DSS. To prepare the MS sample, 1.5 mg of the dried sample was re-suspended in 200 µL of H2O. 10 µL of the sample was transferred to a new tube followed by 10-fold dilution with 50%/50% (v/v) ACN/H2O containing 0.1% formic acid. Identification of metabolites was primarily performed on the infected bile samples.

E. coli BL21(DE3) cells were cultured at 37 °C while shaking at 250 rpm in M9 minimum medium with glucose (natural abundance, 5 g/L) added as the sole carbon source. One liter of culture at OD 1 was centrifuged at 5000 x g for 20 min at 4 °C, and the cell pellet was resuspended in 50 mL of 50 mM phosphate buffer at pH 7.0. The cell suspension was then subjected to centrifugation for cell pellet collection. The cell pellet was resuspended in 10 mL of ice-cold water and freeze-thawed three times. The sample was centrifuged at 20,000 x g at 4 °C for 15 min to remove cell debris. Prechilled methanol and chloroform were sequentially added to the supernatant under vigorous vortexing at an H2O/methanol/chloroform ratio of 1:1:1 (v/v/v). The mixture was then left at −20 °C overnight for phase separation. Next, it was centrifuged at 4000 x g for 20 min at 4 °C, and the clear upper hydrophilic phase was collected and subjected to rotary evaporation to reduce the methanol content. Finally, the sample was lyophilized. The NMR sample was prepared by dissolving the dry sample in 200 μL of D2O with 20 mM phosphate buffer and 0.1 mM DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) for chemical shift referencing, and then transferred to a 3 mm NMR tube.

3.2.6 Experiments and Data Processing

All NMR spectra were collected on a Bruker Avance III 850 MHz spectrometer equipped with a cryogenically cooled TCI probe at 298 K. The 2D 13C-1H HSQC spectra of mouse bile samples and E. coli cell extracts were collected with 512 x 1024 (N1 x N2) complex points along the two dimensions with 32 scans per increment. The spectral widths along the 13C and 1H dimensions were 34206.23 and 10204.08 Hz, respectively, and the transmitter frequency offsets were 75.00 and 4.70 ppm, respectively.

The 2D 13C-1H HSQC-TOCSY spectra were collected with 512 x 2048 (N1 x N2) complex points along the two dimensions with 32 scans per increment. The spectral widths along the 13C and 1H dimensions were 34206.23 and 10204.08 Hz, respectively. The transmitter frequency offsets were 75.00 and 4.70 ppm, respectively.

The 2D 1H-1H TOCSY spectra were collected with 512 x 2048 (N1 x N2) complex points along the two dimensions with 8 scans per increment. The spectral widths along indirect 1H and direct 1H dimensions were 10201.97 and 10204.08 Hz, respectively, and the transmitter frequency offset was 4.70 ppm.

The 3D 13C-1H HSQC-TOCSY spectra were collected with 64 x 100 x 1048 (N1 x N2 x N3) complex points along the three dimensions with 8 scans per increment. The spectral widths along the indirect 13C, indirect 1H and direct 1H dimensions were 34204.76, 10204.09 and 10204.08 Hz, respectively, and the transmitter frequency offsets were 75.00, 4.70 and 4.70 ppm, respectively.

The isotropic mixing times for 2D 13C-1H HSQC-TOCSY, 2D 1H-1H TOCSY and 3D 13C-1H HSQC-TOCSY were 120, 117 and 117 ms, respectively. The relaxation delay (d1) was 1.5 s.

The 2D 13C-1H HSQMBC spectra were collected with 320 x 2048 (N1 x N2) complex points along the two dimensions with 64 scans per increment. The spectral widths along the 13C and 1H dimensions were 40621.32 and 10204.08 Hz, respectively and the transmitter frequency offsets were 90.00 and 4.70 ppm, respectively. The multiple-bond coupling constant was set to 6 Hz. All spectra were zero-filled, Fourier-transformed, and phase- and baseline-corrected using NMRPipe.
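For orientation, the spectral widths quoted in Hz for the HSQC-type experiments can be converted to ppm windows around the transmitter offsets. The following is a rough sketch that assumes a nominal 1H frequency of exactly 850 MHz and an approximate 13C/1H frequency ratio of 0.25145; the true spectrometer frequencies differ slightly, so the results are approximate.

```python
PROTON_MHZ = 850.0            # nominal 1H frequency (assumption)
RATIO_13C_TO_1H = 0.25145     # approximate 13C/1H resonance frequency ratio

def width_ppm(width_hz, carrier_mhz):
    """Convert a spectral width in Hz to ppm at the given carrier frequency."""
    return width_hz / carrier_mhz

def window_ppm(offset_ppm, width_hz, carrier_mhz):
    """Return the (low, high) ppm limits of a spectral window."""
    half = width_ppm(width_hz, carrier_mhz) / 2.0
    return offset_ppm - half, offset_ppm + half

carbon_mhz = PROTON_MHZ * RATIO_13C_TO_1H
sw_h = width_ppm(10204.08, PROTON_MHZ)   # 1H width of the HSQC
sw_c = width_ppm(34206.23, carbon_mhz)   # 13C width of the HSQC
print(f"1H: {sw_h:.1f} ppm wide, window {window_ppm(4.70, 10204.08, PROTON_MHZ)}")
print(f"13C: {sw_c:.1f} ppm wide, window {window_ppm(75.00, 34206.23, carbon_mhz)}")
```

Under these assumptions the windows are roughly 12 ppm wide in 1H and 160 ppm wide in 13C.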

FT-ICR MS spectra were collected on a 15 Tesla FT-ICR MS instrument using methods established previously.147 Briefly, after initial calibration by a standard mixture, electrospray ionization (ESI) mode was selected and the mass range (m/z) was set to 50 - 3000. After tuning, acquisition and calibration, both positive and negative ion mode mass spectra of the background and metabolite mixture were collected.

The FT-ICR mass spectra were calibrated and analyzed based on common compounds in the metabolite mixture. The mass peak list (m/z) for the mass range 100-1000 was generated with the signal-to-noise ratio threshold set to 10. The background mass peaks were removed. All remaining mass peaks (m/z) were converted to accurate masses by considering possible adducts.
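Converting an observed m/z value to candidate neutral masses means subtracting the mass shift of each adduct under consideration. The following is a minimal sketch for singly charged positive-mode ions; the adduct list is an assumption (the text does not specify which adducts were considered), and the example m/z is the calculated [M+H]+ value of glutamine rather than a measured peak.

```python
# Mass shifts (Da) of common singly charged positive-mode adducts.
# Each value is the atomic monoisotopic mass minus one electron mass.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989221,
    "[M+K]+": 38.963158,
}

def neutral_mass_candidates(mz):
    """Return one candidate neutral monoisotopic mass per adduct type."""
    return {adduct: round(mz - shift, 6) for adduct, shift in ADDUCT_SHIFTS.items()}

# Calculated [M+H]+ of glutamine (C5H10N2O3, monoisotopic mass 146.069142 Da).
candidates = neutral_mass_candidates(147.076418)
for adduct, mass in candidates.items():
    print(f"{adduct}: {mass:.6f} Da")
```

Each candidate mass would then be matched against elemental compositions within the instrument's mass tolerance.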


3.2.7 Classification of hydrophilic metabolites based on lipophilicity logP

Molecular structural motif clustering was applied to hydrophilic metabolites contained in the HMDB. The selection criterion for hydrophilic vs. hydrophobic metabolites was based on the quantitative molecular hydrophobicity (lipophilicity) measure log10(P) (or logP) where P is the 1-octanol/water partition coefficient of a given metabolite.161 For this purpose, lipophilicity was predicted by ALOGP, which is a widely used computational estimator of logP, for 933 known metabolites with both hydrophilic and hydrophobic character taken from the COLMAR database, HMDB and BMRB. The 723 compounds whose NMR data were measured in D2O were pre-classified as hydrophilic compounds and the 210 compounds whose NMR data were measured in CDCl3 were pre-classified as hydrophobic compounds. Based on the ALOGP distribution (Figure 3.4) an ALOGP criterion < 3.0 was chosen for hydrophilic metabolites.

3.2.8 Spin system matching and scoring

The 1H, 13C chemical shifts of each compound in COLMAR MSMMDB and pNMR MSMMDB were compared with each experimental spin system with the same number of spins. The weighted matching algorithm, known as the Hungarian method using the Munkres assignment algorithm, was applied to find the closest matching peak pairs between the experimental and predicted spin systems.162 The corresponding chemical shift root-mean-square deviation (RMSD) was calculated between each experimentally determined spin system and each candidate compound according to:

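$$\mathrm{RMSD} = \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left[ \left(C_{i,\mathrm{exp}} - C_{i,\mathrm{pred}}\right)^{2} + \left( \left(H_{i,\mathrm{exp}} - H_{i,\mathrm{pred}}\right) \times 10 \right)^{2} \right] \right\}^{1/2} \tag{1}$$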

Here, the subscript exp denotes the experimental chemical shifts, the subscript pred denotes the predicted chemical shifts, and N is the number of HSQC cross-peaks of the spin system. A scaling factor of 10 is used to normalize the effects of 13C and 1H chemical shifts on the overall RMSD by correcting for the different chemical shift ranges of 13C vs. 1H nuclei. For each experimentally determined spin system, all compounds that fulfill the chemical shift RMSD cutoff < 5 ppm are rank-ordered with the smallest RMSD appearing first.
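The matching and scoring step can be illustrated with the six peak pairs tabulated earlier for 6-desoxy-D-glucosamine, for which an RMSD of 3.1622 ppm is listed. The sketch below is an assumption-laden illustration: it replaces the Munkres algorithm with an exhaustive search over permutations (equivalent in result, but only practical for small spin systems), the function names are hypothetical, and the assignment of the table columns to candidate versus experimental shifts is our reading (the RMSD is symmetric, so the result does not depend on it).

```python
from itertools import permutations
from math import sqrt

H_SCALE = 10.0  # 1H weighting factor from Eq. (1)

def pair_cost(peak_a, peak_b):
    """Squared, scaled distance between two (1H, 13C) cross-peaks."""
    (h_a, c_a), (h_b, c_b) = peak_a, peak_b
    return (c_a - c_b) ** 2 + ((h_a - h_b) * H_SCALE) ** 2

def best_match_rmsd(exp_peaks, pred_peaks):
    """RMSD of Eq. (1) for the optimal one-to-one peak assignment.

    Brute-force stand-in for the Hungarian method; fine for a few peaks.
    """
    if len(exp_peaks) != len(pred_peaks):
        raise ValueError("spin systems must have the same number of peaks")
    n = len(exp_peaks)
    best = min(
        sum(pair_cost(exp_peaks[i], pred_peaks[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return sqrt(best / (2 * n))

# (1H, 13C) shifts in ppm from the 6-desoxy-D-glucosamine table rows.
candidate = [(1.337, 18.311), (3.207, 61.619), (3.737, 73.449),
             (3.326, 77.943), (3.565, 74.916), (4.949, 93.834)]
experimental = [(1.167, 19.542), (3.618, 59.207), (4.032, 70.725),
                (3.632, 75.944), (3.767, 73.296), (5.573, 98.048)]
rmsd = best_match_rmsd(experimental, candidate)
print(f"RMSD = {rmsd:.4f} ppm")
```

The result agrees with the tabulated value to within rounding, which supports the reading of Eq. (1) with a divisor of 2N.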

3.2.9 Quantitative metric on evaluation of the MSM identification

To quantitatively evaluate the MSM identification result, the true/false positive/negative results are described here. A true positive (TP) is defined as a top hit returned by COLMAR MSMMDB that contains the true metabolite MSM, while a false positive (FP) does not contain the true metabolite MSM. A true negative (TN) results when the true metabolite MSM does not exist in COLMAR MSMMDB and no hit is returned by COLMAR MSMMDB. A false negative (FN) results when no hit is returned by COLMAR MSMMDB, while the true metabolite MSM exists in COLMAR MSMMDB. The true positive rate (TPR) is calculated based on TP/(TP + FN). The false positive rate (FPR) is calculated based on FP/(FP + TN).
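The two rates defined above translate directly into code. The following is a minimal sketch; the function name is hypothetical and the counts in the example are made up for illustration, not taken from the results.

```python
def tpr_fpr(tp, fp, tn, fn):
    """Return (TPR, FPR) = (TP / (TP + FN), FP / (FP + TN)).

    A rate is reported as NaN when its denominator is zero.
    """
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr

# Made-up counts for a single RMSD threshold.
tpr, fpr = tpr_fpr(tp=80, fp=10, tn=30, fn=20)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Evaluating both rates over a range of RMSD thresholds gives the points of a receiver operating characteristic curve.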

3.3 Results

3.3.1 Evaluation of COLMAR and pNMR MSMMDB in identification of known molecules in bile and E. coli extracts

The strategy for the identification of molecular structural motifs (1st or 2nd shell) in unknown metabolites was first tested on known metabolites in bile and E. coli cell extracts. There are 26 metabolites in bile and 111 metabolites in E. coli cell extract, which could be identified and verified by previously established methods, i.e. 2D HSQC, 2D TOCSY and 2D HSQC-TOCSY via COLMARm web server. Each of these known metabolites contains at least one spin system with two or more 1H spins and their directly attached 13C spins.

The 1H, 13C chemical shifts of each spin system were queried against the COLMAR and pNMR MSMMDB. To be able to treat these spin systems like real unknowns, their true metabolite motifs were intentionally removed from COLMAR MSMMDB. After querying experimental spin systems against COLMAR MSMMDB, both 1st shell and 2nd shell motifs are returned if hits exist within an RMSD cutoff of 5.0 ppm. The motifs with the lowest RMSDs are prioritized for further evaluation. For bile metabolites, the 1H, 13C chemical shifts of 19 metabolites matched the correct 1st or 2nd shell MSMs from the COLMAR MSMMDB as the top hit. In some cases, a clear distinction between 1st and 2nd shell hits is difficult. For instance, when querying an unknown spin system with the motif of taurine (SO3HCH2CH2NH2) against the COLMAR MSMMDB, the identified MSM (SO3HCH2CH2NH-CO-) is partially 2nd shell. Overall, the RMSDs for the correct hits range between 0.03 and 2.11 ppm (mean value of 0.97 ppm). The RMSD of MSMs of this and other (known) molecules provides a useful benchmark for the range of RMSD values that belong to the correct MSMs. Six metabolites (glycerol, valine, isoleucine, N-alpha-acetyl-L-lysine, leucine, alpha,epsilon-diaminopimelic acid) did not return any good hits, because no other COLMAR metabolite contains the same 1st or 2nd shell MSM as these six metabolites. The only spin system whose MSM was misidentified belonged to L-serine, where the (incorrect) top hit had an RMSD of 3.21 ppm, confirming that higher RMSDs are generally associated with lower confidence in the returned MSMs. Similarly, after querying the 26 experimental spin systems against the pNMR MSMMDB, the motif of the true metabolites ranks as follows: 1st hits for 12 spin systems, 2nd hits for 4 spin systems, 3rd hits for 4 spin systems, 4th hits for 1 spin system, 5th – 15th top hits for 5 spin systems. The RMSDs of correct hits range between 0.57 and 2.78 ppm (mean value of 1.78 ppm), which is significantly higher than the RMSDs of hits returned by COLMAR MSMMDB.

Examples of MSM identification and representative metabolites with the same MSM returned by COLMAR and pNMR MSMMDB are listed in Table 3.2. Since pNMR MSMMDB covers a much larger pool of metabolites and MSMs and the chemical shift prediction is less accurate than for COLMAR MSMMDB, the top 15 different MSM hits are considered here as viable candidates when identifying an unknown motif.
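The query step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the COLMAR implementation: the toy database `MOTIF_DB`, its shift values, and the function names are hypothetical, and the scoring rule (13C shifts only, peaks paired after sorting) is an assumed simplification, since the exact RMSD definition is not restated in this section. Only the 5.0 ppm cutoff and the shifts of spin system A are taken from the text.

```python
import math

# Hypothetical motif database: label -> list of (1H, 13C) shifts in ppm.
# The values are illustrative placeholders, not actual COLMAR entries.
MOTIF_DB = {
    "motif_1": [(0.95, 19.5), (1.00, 20.8), (2.20, 32.0), (3.60, 63.0)],
    "motif_2": [(0.93, 11.0), (1.30, 27.0), (1.65, 38.0), (3.70, 55.0)],
    "motif_3": [(3.25, 50.3), (3.42, 38.2)],
}

def carbon_rmsd(query, reference):
    """RMSD (ppm) over 13C shifts; peaks are paired after sorting (assumed rule)."""
    q = sorted(c for _, c in query)
    r = sorted(c for _, c in reference)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)) / len(q))

def query_motifs(query, db, cutoff=5.0):
    """Return (motif, RMSD) pairs within the cutoff, lowest RMSD first."""
    hits = []
    for name, reference in db.items():
        if len(reference) != len(query):  # compare equally sized spin systems only
            continue
        rmsd = carbon_rmsd(query, reference)
        if rmsd <= cutoff:
            hits.append((name, rmsd))
    return sorted(hits, key=lambda hit: hit[1])

# Spin system A from Section 3.3.2, (1H, 13C) in ppm.
spin_system_a = [(0.883, 20.032), (0.906, 21.543), (2.090, 33.293), (4.048, 63.569)]
hits = query_motifs(spin_system_a, MOTIF_DB)
for name, rmsd in hits:
    print(f"{name}: RMSD = {rmsd:.2f} ppm")
```

Sorting the hits by RMSD mirrors the prioritization of low-RMSD motifs; a production query would also need a more careful peak assignment than simple sorting.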

For E. coli metabolites, the molecular motif of the top hit is 100% correct up to 1st shell atoms when using an RMSD threshold of 2.1 ppm. With respect to 2nd shell atoms, 89 out of 111 top hits contain the correct 2nd shell MSMs (true positives) for an RMSD threshold of 5 ppm, whereas 22 top hits contain incorrect 2nd shell molecular motifs (false positives). The number of true and false positive top hits of E. coli metabolites with various RMSD thresholds is summarized in Table 3.3 with the receiver operating characteristic (ROC) curve shown in Figure 3.6 and an Area Under Curve (AUC) of 0.851. The number of false positives was reduced when setting the RMSD threshold to 2.1 ppm with 81 true positive 2nd shell MSMs and only 9 false positives. Among the 9 false positives, the true 2nd shell MSMs either ranked as high as 2nd or were not returned at all, because COLMAR MSMMDB did not contain an entry with the same MSM as the true metabolite.
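The relation between the threshold counts in Table 3.3 and the ROC curve can be illustrated with a short sketch. The counts are copied from Table 3.3; the normalization by the totals at the largest threshold (89 true and 22 false positives), the added origin point, and the trapezoidal integration are assumptions, so the result (about 0.85) need not reproduce the reported AUC of 0.851 to the last digit.

```python
# (RMSD threshold in ppm, true positives, false positives), from Table 3.3.
TABLE_3_3 = [
    (0.3, 30, 0), (0.6, 39, 2), (0.9, 50, 4), (1.2, 62, 5), (1.5, 68, 5),
    (1.8, 77, 7), (2.1, 81, 9), (2.4, 88, 11), (2.7, 89, 13), (3.0, 89, 16),
    (3.3, 89, 19), (3.6, 89, 21), (3.9, 89, 22),
]

def roc_points(rows):
    """Convert cumulative counts into (FPR, TPR) points, starting at the origin."""
    total_tp = rows[-1][1]  # assumed total number of positives
    total_fp = rows[-1][2]  # assumed total number of negatives
    points = [(0.0, 0.0)]
    points += [(fp / total_fp, tp / total_tp) for _, tp, fp in rows]
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

auc = trapezoid_auc(roc_points(TABLE_3_3))
print(f"AUC = {auc:.3f}")
```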

3.3.2 Structure elucidation of unknown metabolites

The determination of MSMs represents a critical step toward the structure elucidation of unknown metabolites, which is demonstrated here for gallbladder bile fluid. We focused on three experimentally extracted spin systems with unknown identity, designated as spin systems A, B, and C. The MSM identification method was applied to identify the unknown molecular motifs belonging to these spin systems, followed by identification of the structures of these unknown metabolites. Unknown spin system A has 4 cross-peaks with chemical shifts (δH, δC) of (0.883, 20.032), (0.906, 21.543), (2.090, 33.293) and (4.048, 63.569) ppm. After querying against the COLMAR MSMMDB, it best matches the valine-like MSM with structure CH3(CH3)CHCH(COOH)N–. When querying against the pNMR MSMMDB, the same valine-like motif was returned, which is shared by 32 metabolites. By selecting the top 6 hits with the same molecular motif in the return list and performing spiking experiments, L-alanyl-L-valine was found to accurately match the unknown spin system (Figure 3.7). L-alanyl-L-valine is a dipeptide composed of alanine and valine, which was not previously identified in human tissues or biofluids. Although most dipeptides are relatively short-lived intermediates, some dipeptides are known to have physiological effects, for example in cell signaling. The identification of L-alanyl-L-valine in mouse bile provides new information toward the understanding of amino-acid specific pathways for further biological interpretation. The identification and validation of L-alanyl-L-valine is depicted in Figure 3.7.

Unknown spin system B contains two peaks with chemical shifts (δH, δC) of (3.068, 52.507) and (3.558, 37.715) ppm (Figure 3.8a, b). After querying against the COLMAR MSMMDB, it matches the MSM –NHCH2CH2SO3H, which is also found in taurine. When querying against the pNMR MSMMDB, the same taurine-like MSM is returned, which is shared by 96 metabolites. Unknown spin system C contains 6 peaks with chemical shifts (δH, δC) of (0.943, 20.558), (1.339, 34.342), (1.735, 34.345), (1.413, 37.477), (2.199, 35.428) and (2.296, 35.435) ppm (Figure 3.8c, d). After querying against COLMAR MSMMDB, no hit was returned, indicating that this particular molecular motif does not exist in the COLMAR MSMMDB. By comparison, when querying against pNMR MSMMDB, 33 MSMs were returned that are shared by 49 metabolites. However, the RMSDs of all hits exceeded 2.8 ppm, suggesting that none of the MSM candidates may include the true MSM. The 2D 13C-1H HSQMBC spectrum showed that the two unknown spin systems B and C are part of the same molecule, connected via a quaternary carbon peak at 180.820 ppm (Figure 3.8e). This indicates that the unknown compound has at least two spin systems, and the number of potential unknown compound candidates is lowered from 96

to 25 compounds. The 25 compounds were further separated into four groups based on their second MSMs in addition to the MSM –NHCH2CH2SO3H. By selecting and purchasing the individual, isolated compounds of the top hit in each MSM group with the lowest RMSD (total of 4 hits) and performing NMR spiking experiments, taurocholic acid was found to precisely match both unknown spin systems B and C, confirming that they belong to taurocholic acid (Figure 3.9). The reason that neither COLMAR nor pNMR MSMMDB returned a match for spin system C is that spin system C represents a sub-spin system of a much larger spin system of taurocholic acid, which contains 19 carbons together with their attached hydrogens. During 120 ms TOCSY mixing, magnetization transfer was incomplete and, in addition, a number of cross-peaks overlapped with those of other molecules in the mixture. Therefore, spin system C, which was returned by the maximal clique method, did not correspond to the full spin system, which prevented identification of this motif when querying against pNMR MSMMDB. Although taurocholic acid is a known metabolite of bile, NMR database information for taurocholic acid is available only in DMSO. Because of the dependence of chemical shifts on the solvent (chemical shifts in DMSO are substantially different from those in aqueous solution), taurocholic acid was not part of COLMAR MSMMDB. These examples illustrate how the molecular structural motif-based method for the identification of unknown metabolites works successfully for rather large, real-world metabolites.

3.3.3 Coverage of COLMAR motifs of HMDB

The COLMAR MSMMDB, although established with the motif information from only 632 metabolites, performed remarkably well. The strong performance can be rationalized when comparing the motifs in COLMAR MSMMDB and motifs extracted from the much larger HMDB database. Because a large fraction of the metabolites in the HMDB are hydrophobic metabolites, we focus here on the hydrophilic subset, which includes 13,138 metabolites with a lipophilicity logP < 3 as predicted by ALOGP software.154 All MSMs with 2 or more carbons per spin system were extracted from the hydrophilic HMDB and COLMAR metabolites. The MSM identification results are summarized in Table 3.1 with 180 COLMAR vs. 1924 HMDB 1st shell MSMs and 397 COLMAR vs. 4912 HMDB 2nd shell MSMs. The frequency of COLMAR MSMs (nodes and their sizes) is depicted as a network graph in Figure 3.10, which reflects that MSMs are notably unevenly distributed among metabolites. The most common MSMs are unsaturated carbon-carbon bonds that are part of aromatic ring structures. Many MSMs are frequently found in the same molecule together with other MSMs, as is indicated by the many edges of the graph. The top 10 most abundant motifs in the COLMAR and HMDB databases are listed in Table 3.4 and Table 3.5. For all 1st shell motifs found in the HMDB database, 37 out of the most frequent 50 motifs are covered by COLMAR (Figure 3.11). Importantly, the 1st shell COLMAR MSMs cover 10,728 out of 12,506 (85.8%) hydrophilic compounds (with NC > 1 spins for each spin system) of the HMDB, which shows that, despite its much smaller size, COLMAR NMR provides very good coverage of the MSMs of HMDB metabolites. 1778 hydrophilic compounds of the HMDB that contain 1042 different 1st shell MSMs are not covered by COLMAR MSMMDB. However, among the 1042 motifs, only 92 MSMs are present in more than 10 compounds, another 92 MSMs are present in 5 to 10 compounds, and 858 MSMs are present in fewer than 5 compounds. This motif analysis shows that the vast majority of hydrophilic metabolite motifs not covered by COLMAR MSMMDB are rare motifs.
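A coverage analysis of this kind can be sketched with plain set operations. The function name, the toy motif labels, and the coverage criterion (a compound counts as covered only if all of its motifs occur in the reference set) are assumptions made for illustration; the text does not spell out the exact criterion. The last lines only re-check the arithmetic of the counts quoted above.

```python
def covered_compounds(compound_motifs, reference_motifs):
    """Return names of compounds whose motifs all occur in the reference set."""
    return {
        name
        for name, motifs in compound_motifs.items()
        if motifs and motifs <= reference_motifs
    }

# Toy data with placeholder motif labels (not actual database entries).
reference = {"m1", "m2", "m3"}
compounds = {
    "compound_a": {"m1"},
    "compound_b": {"m1", "m3"},
    "compound_c": {"m2", "m9"},  # m9 is absent from the reference set
}
covered = covered_compounds(compounds, reference)
coverage = len(covered) / len(compounds)

# Consistency check of the counts quoted in the text.
hmdb_total, hmdb_covered = 12506, 10728
uncovered = hmdb_total - hmdb_covered
percent_covered = 100 * hmdb_covered / hmdb_total
rare_motif_bins = 92 + 92 + 858  # >10, 5 to 10, and <5 compounds per motif
print(uncovered, round(percent_covered, 1), rare_motif_bins)
```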

3.4 Discussion

We have established a motif-based method to identify unknown metabolites. The introduction and curation of motif databases is of central importance, whereby the accuracy and precision of the experimental database entries are crucial for the successful identification of MSMs of unknowns. The 1st and 2nd shell atoms of a structural motif sensitively influence the 1H and 13C chemical shifts and, conversely, experimental 1H, 13C chemical shifts can be used for the determination of molecular structural motifs. This information offers a path toward the determination of the structure of unknown metabolites. Based on the finding presented here that 1st and especially 2nd shell MSMs generally have chemical shifts that are remarkably well conserved, an MSM NMR database termed COLMAR MSMMDB was established with experimental chemical shifts measured in aqueous solution. The chemical shift accuracy in COLMAR MSMMDB is much better than that of both quantum-chemical150 and empirical chemical shift prediction used for pNMR MSMMDB. The true MSMs are often found with RMSD < 2.5 ppm and false MSMs typically have RMSDs > 3.0 ppm. If the RMSD of an MSM is between 2.5 ppm and 3.0 ppm, application of the pNMR MSMMDB to check whether the same MSM is returned as one of the top 15 hits further increases confidence. Finally, for MSMs (1st shell or 2nd shell) not present in the COLMAR MSMMDB, the more comprehensive, but less accurate pNMR MSMMDB consisting of 3512 1st shell MSMs and 7874 2nd shell MSMs with computationally predicted chemical shifts can be used to identify unknown MSMs.
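The RMSD guidelines stated above can be written as a small decision rule. The thresholds (2.5 ppm, 3.0 ppm, top 15 pNMR hits) come from the text; the function name, the returned labels, and the treatment of values exactly at the thresholds are hypothetical choices for this sketch.

```python
def assess_colmar_hit(colmar_rmsd, pnmr_rank=None, low=2.5, high=3.0, top_n=15):
    """Classify a COLMAR MSMMDB hit by its RMSD (ppm).

    pnmr_rank is the rank of the same MSM in the pNMR MSMMDB return list,
    or None if the MSM was not returned there.
    """
    if colmar_rmsd < low:
        return "likely correct"
    if colmar_rmsd > high:
        return "likely incorrect"
    # Intermediate range: use the pNMR MSMMDB as an independent check.
    if pnmr_rank is not None and pnmr_rank <= top_n:
        return "supported by pNMR"
    return "uncertain"

print(assess_colmar_hit(0.97))               # mean RMSD of correct bile hits
print(assess_colmar_hit(3.21))               # RMSD of the misidentified L-serine hit
print(assess_colmar_hit(2.7, pnmr_rank=4))   # intermediate RMSD, confirmed by pNMR
```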

Although the present COLMAR MSMMDB only includes 397 2nd shell MSMs, these MSMs cover a remarkably large number of metabolites. In particular, they represent 10,728 out of 12,506 (85.8%) hydrophilic compounds (logP < 3.0 with NC > 1) of the much larger HMDB, which contains many more metabolites, including many metabolites without experimental NMR chemical shifts and many expected, but still unconfirmed metabolites. An interesting question is how many additional MSMs with chemical shifts would need to be added in order to cover all metabolites in the current HMDB. When limiting the MSMs to hydrophilic metabolites only (logP < 3), this can be accomplished with the addition of another 1042 unique MSMs. This is a relatively small number considering that this would cover the MSMs of an additional 1778 HMDB metabolites that are currently not covered by the COLMAR MSMMDB.

The best motifs identified by COLMAR MSMMDB query can be further developed into complete metabolite candidates. For the examples presented here, a database of potential metabolites was created using a wealth of information from existing databases, such as KEGG, ChEBI, and HMDB. The subset of metabolites that emerge with the correct MSMs and good predicted chemical shift scores represent the candidate molecules that can be purchased (or synthesized) for spiking experiments to confirm their authenticity in the mixture. This SUMMIT Motif approach was successfully illustrated for the identification of two “unknown” metabolites, L-alanyl-L-valine and taurocholic acid, in mouse bile fluid. The feasibility of the approach depends on the total concentration of the unknown metabolites, which should exceed ~50 - 100 μM; otherwise the sample needs to be concentrated first. Because the approach is not designed for high-throughput analysis, it is best applied to a few representative samples selected, for example, from a large cohort of samples. Also, for unknown compounds that have titratable groups with pKa ~ 7, even small changes in pH can cause significant changes in chemical shifts that might adversely interfere with motif identification. In principle, NMR pH titration experiments (e.g. via 13C-1H HSQC) can help identify such unknowns.

The MSM identification approach presented here is “NMR spin system centric” in the sense that 1H, 13C spin systems form 0th shell MSMs, which are then extended to 1st and 2nd shell heavy atoms. As a consequence, the MSMs are only indirectly identifiable by techniques other than NMR. Still, other types of experimental molecular fragment information can be used, such as metabolite fragments produced by tandem mass spectrometry (MS/MS or MS2), which are routinely used in targeted metabolomics. For the identification of unknown metabolites, MS/MS fragments can be predicted for candidate structures as an additional scoring criterion, thereby further limiting the number of potential metabolites.100 Moreover, the molecular formula of the parent ions, determined from their accurate mass, can serve as an additional filter to narrow down viable candidate compounds that contain the correct MSMs.99, 147 For instance, if mass information (Figure 3.7f) were used in addition, L-alanyl-L-valine would emerge as the only candidate among the six top hits to be further verified by spiking NMR experiments, thereby further speeding up the verification of this unknown metabolite.
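The accurate-mass filter mentioned above can be illustrated for L-alanyl-L-valine. The sketch computes the monoisotopic m/z of the (M+H)+ ion from the neutral formula C8H16N2O3 and compares it with the observed m/z 189.12337 from Figure 3.7f. The function names and the simple formula parser are hypothetical, and the resulting ppm error depends on the precision of the atomic masses used, so it is not expected to reproduce the quoted mass error exactly.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, and the electron mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
ELECTRON = 0.00054858

def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C8H16N2O3'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass

def protonated_mz(formula):
    """m/z of the singly charged (M+H)+ ion: add a hydrogen atom, remove an electron."""
    return monoisotopic_mass(formula) + MONOISOTOPIC["H"] - ELECTRON

def ppm_error(observed, theoretical):
    return 1e6 * (observed - theoretical) / theoretical

mz_theory = protonated_mz("C8H16N2O3")   # neutral L-alanyl-L-valine
error = ppm_error(189.12337, mz_theory)  # observed value from Figure 3.7f
print(f"theoretical m/z = {mz_theory:.5f}, error = {error:.3f} ppm")
```

A candidate list could then be filtered by keeping only formulas whose predicted m/z lies within a chosen ppm tolerance of an observed peak.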

3.5 Conclusion

The accurate determination of molecular motifs of unknown metabolites presented here is made possible because of the high quality of NMR chemical shifts of the COLMAR metabolomics database, which was customized primarily using data from the BMRB163 and HMDB.38 Our results suggest that, with the future addition of a modest number of suitably chosen metabolites, the coverage of COLMAR MSMMDB can be substantially broadened to include most real and putative hydrophilic metabolites of the HMDB.

The new COLMAR MSMMDB and pNMR MSMMDB could pave the way for the systematic and efficient determination of motifs and their associated unknown metabolites in a wide range of metabolomics samples. This information will help fill in critical gaps in our understanding of molecular conversions along new metabolomics pathways and their modulations upon internal and external perturbations. The characterization of metabolites with novel MSMs will not immediately benefit from the new COLMAR MSMMDB. It remains a challenge requiring a traditional and, hence, much slower approach to the determination of new natural products relying on extensive purification and comprehensive characterization by the combination of many different analytical techniques. From an NMR perspective, a long-term goal is a universal chemical shift predictor with a performance similar to that of COLMAR MSMMDB, as it would permit the determination of unknowns at a rate that is comparable to the identification of molecular motifs by COLMAR MSMMDB. An empirical predictor with this property would need to be trained on experimental small-molecule chemical shift databases that are much larger than those currently available or require substantially improved quantum-chemical methods for the calculation of chemical shifts. Until then, the systematic addition of high-quality experimental chemical shifts and spin system information of strategically chosen metabolites to NMR metabolomics databases, such as COLMAR MSMMDB, is a practically feasible, although time-consuming, undertaking.


Table 3.1. Categorization of COLMAR and HMDB hydrophilic compounds according to their molecular structure motifs.

Compound source                    COLMAR                                               HMDB
Total number of compounds          720                                                  13138 (a)
Minimal spin system size           > 1 1H-13C spin pair per spin system (i.e. NC > 1)   > 1 1H-13C spin pair per spin system (i.e. NC > 1)
Number of compounds categorized    623                                                  12506
Number of 1st shell motifs         180                                                  1924
Number of 2nd shell motifs         397                                                  4912

(a) The compounds were retrieved from HMDB version 4.0. All metabolites (detected and expected metabolites) with ALOGP < 3.0 were included (see Figure 3.4); predicted HMDB metabolites were not included.


Table 3.2. Examples of molecular structural motif identification of bile metabolites by COLMAR and pNMR MSMMDB queries. Each hit is given with (hit rank, type of identified true motif, RMSD (ppm)).

True metabolite    Hit that contains true motif returned by COLMAR MSMMDB    Hit that contains true motif returned by pNMR MSMMDB
Uridine            5-Methyluridine (1, 2nd shell, 0.16)                      -D-3-Ribofuranosyluric acid (2, 2nd shell, 2.08)
L-tyrosine         3-Chlorotyrosine (1, 2nd shell, 0.20)                     N-(1-Deoxy-1-fructosyl)-tyrosine (1, 2nd shell, 1.42)
Xylitol            Adonitol (1, 2nd shell, 0.79)                             Xylitol (1, 2nd shell, 1.80)
Aspartyl-lysine    Biocytin (1, 1st shell, 1.29)                             N6-L-homocysteinyl-N2-L-valyl-L-lysine (3, *2nd shell, 1.29)

*For these molecules, a partial 2nd shell MSM was identified, since in the true molecule a hydrogen terminates the 1st shell MSM.


Table 3.3. True and false positive top hits of E. coli metabolites with various RMSD thresholds (see ROC plot, Figure 3.6).

RMSD threshold (ppm)    True positive    False positive
0.3                     30               0
0.6                     39               2
0.9                     50               4
1.2                     62               5
1.5                     68               5
1.8                     77               7
2.1                     81               9
2.4                     88               11
2.7                     89               13
3.0                     89               16
3.3                     89               19
3.6                     89               21
3.9                     89               22


Table 3.4. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in COLMAR MSMMDB.

Rank    Number of molecules with the same motif
1       78
2       41
3       38
4       34
5       32
6       27
7       22
8       22
9       18
10      18


Table 3.5. Molecular structures of top 10 most abundant molecular structural motifs (yellow) of all hydrophilic molecules contained in HMDB.

Rank    Number of molecules with the same motif
1       4556
2       3375
3       1600
4       1075
5       851
6       776
7       772
8       654
9       642
10      623


Figure 3.1. Definition of 1st and 2nd shell molecular motifs based on a spin system. The proton spin system together with its directly bonded carbon atoms defines the molecular spin system or “0th shell molecular motif” (red). The “1st shell molecular motif” (dashed orange box) is obtained after inclusion of the heavy atoms that are directly bonded to the spin system (orange). The additional inclusion of heavy atoms that are up to two bonds away from the spin system yields the “2nd shell molecular motif” (dashed green box).
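The shell definition in Figure 3.1 amounts to a breadth-first expansion over the heavy-atom bond graph, which can be sketched as follows. The atom labels, the adjacency-dictionary representation, and the function name are hypothetical; valine is used as the example because its spin system (the alpha, beta, and two gamma carbons) is discussed in this chapter.

```python
def shell_atoms(bonds, spin_system, n_shells):
    """Heavy atoms of the n-th shell motif: the spin system carbons plus all
    heavy atoms within n_shells bonds of any spin system carbon."""
    shell = set(spin_system)
    frontier = set(spin_system)
    for _ in range(n_shells):
        # Step one bond outward from the current frontier.
        frontier = {nb for atom in frontier for nb in bonds[atom]} - shell
        shell |= frontier
    return shell

# Heavy-atom bond graph of valine (hydrogens omitted); labels are illustrative.
VALINE = {
    "N": ["CA"],
    "CA": ["N", "C", "CB"],
    "C": ["CA", "O1", "O2"],
    "O1": ["C"],
    "O2": ["C"],
    "CB": ["CA", "CG1", "CG2"],
    "CG1": ["CB"],
    "CG2": ["CB"],
}
spin_system = {"CA", "CB", "CG1", "CG2"}  # 0th shell: carbons carrying the coupled protons

first_shell = shell_atoms(VALINE, spin_system, 1)
second_shell = shell_atoms(VALINE, spin_system, 2)
print(sorted(first_shell))
print(sorted(second_shell))
```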


Figure 3.2. Examples of metabolites with identical molecular structural motifs (in color).

The three 2nd shell molecular motifs on the left are highlighted in color. Each row depicts examples of molecules with the same molecular structural motif as the MSM furthest on the left. The experimental NMR chemical shift root-mean-square deviations (RMSDs) at the same C-H positions for all molecules containing the same motif are indicated where the first (second) number is the 1H (13C) RMSD in units of ppm.


Figure 3.3. Box plots of chemical shift errors of MSMs. The chemical shift errors of MSMs are calculated based on the average chemical shift of each COLMAR MSM (with occurrence more than 3) versus the true chemical shift of each COLMAR MSM in each compound. The chemical shift errors of prediction are calculated based on the predicted chemical shift of each COLMAR MSM (with occurrence > 3) versus the true chemical shift of each COLMAR MSM in each compound. The statistics of 1H, 13C chemical shift errors (ppm) of 1st shell, 2nd shell MSM and predictions are summarized below. The averages are (0.158, 2.03), (0.115, 1.32) and (0.269, 2.29). The medians are (0.143, 1.60), (0.085, 1.11) and (0.211, 2.08). The first quartiles (Q1) are (0.062, 0.90), (0.038, 0.36) and (0.126, 1.40). The third quartiles (Q3) are (0.224, 3.10), (0.176, 1.94) and (0.352, 2.92). The interquartile ranges (IQR, Q3 - Q1) are (0.162, 2.20), (0.138, 1.58) and (0.226, 1.52). The “minimum” (lower bound, Q1 - 1.5 x IQR) are (0.003, 0.089), (0.003, 0.006) and (0.011, 0.089). The “maximum” (upper bound, Q3 + 1.5 x IQR) are (0.47, 5.80), (0.384, 4.31) and (0.692, 5.20).


Figure 3.4. Histogram of ALOGP values of hydrophilic and hydrophobic metabolites. The purpose of this figure is to determine a lipophilicity criterion to accurately classify metabolites without experimental NMR data into hydrophilic and hydrophobic metabolites. This allows one to limit the pool of potential metabolites found in aqueous metabolomics samples. 933 hydrophilic and hydrophobic metabolites from the COLMAR, HMDB and BMRB databases were classified according to their predicted lipophilicity ALOGP values, which are estimates of the experimental log10(P) values. Metabolites whose NMR spectra were measured in aqueous solution are indicated in blue (“hydrophilic metabolites”) and metabolites measured in organic CDCl3 solvent are indicated in red (“hydrophobic metabolites”). When setting the threshold for ALOGP values to 3.0, 710 out of 723 (98.2%) hydrophilic compounds are classified as hydrophilic and 86 out of 210 (41.0%) hydrophobic compounds as hydrophobic. For pNMR MSMMDB only compounds with ALOGP < 3.0 were included from the parent databases (HMDB, KEGG, ChEBI).


Figure 3.5. Workflow of molecular structural motif (MSM) based unknown metabolite identification.


Figure 3.6. ROC curve with AUC of 0.851 of true and false positive top hits of E. coli metabolites with various RMSD thresholds.


Figure 3.7. Identification of L-alanyl-L-valine metabolite in mouse bile extracts. Panels a and b: 2D 13C-1H HSQC and 2D 1H-1H TOCSY of the unknown spin system A. Panels c and d: overlay of 2D NMR spectra of L-alanyl-L-valine (blue peaks) and bile extracts (gray peaks). Chemical shift agreement confirms the presence of L-alanyl-L-valine in the mouse bile mixture. Panel e: the chemical structure of L-alanyl-L-valine. Panel f: partial FT-ICR spectrum of the mouse bile extracts. The red peak with m/z 189.12337 of the (M+H)+ adduct is consistent with molecular formula C8H17N2O3 of L-alanyl-L-valine (mass error: 16 ppb).


Figure 3.8. Identification of two unknown spin systems of taurocholic acid in mouse bile extracts. Panels a and b: 2D 13C-1H HSQC and 2D 1H-1H TOCSY of spin system B. Panels c and d: 2D 13C-1H HSQC and 2D 1H-1H TOCSY of spin system C. Panel e: 2D 13C-1H HSQMBC indicates spin system B and C are connected via the quaternary carbon that has a signal at 180.82 ppm, suggesting that the two spin systems belong to the same unknown compound.


Figure 3.9. Identification of taurocholic acid in mouse bile extracts. Panels a, b, c, d: overlay of 13C-1H HSQC spectra of taurine-like motifs –NHCH2CH2SO3H present in the top 4 hit molecules (color coded in red, magenta, green and cyan) and spin system B in the bile extracts (2 grey peaks underneath the colored peaks). Since taurocholic acid also matched spin system C of the unknown metabolite (Figure 3.8), a reference spectrum of taurocholic acid was measured, which matched the HSQC cross-peaks well, thereby confirming the presence of taurocholic acid in bile. Panel e: overlay of 2D 13C-1H HSQC spectra of taurocholic acid (magenta peaks) and the mouse bile extracts (grey peaks).


Figure 3.10. Graph-theoretical representation of molecular motif clustering of current COLMAR MSMMDB molecules. Each node denotes a molecular motif where the node area is proportional to the number of molecules containing the motif. For molecules containing two molecular motifs, their nodes are connected by an edge. The edge thickness (weight) is proportional to the number of molecules that contain both motifs. Representative molecular structures of some of the most abundant molecular motifs are depicted in the graph, and the spin system backbone is highlighted in yellow.


Figure 3.11. Distribution of the number of molecules in the 50 most common motifs of hydrophilic compounds of the HMDB. A blue bar denotes that the motif exists both in the COLMAR MSMMDB and HMDB of hydrophilic compounds, whereas a red bar denotes that the motif only exists in the HMDB database of hydrophilic compounds.


COLMAR Lipids Database for 2D NMR-Based Lipidomics

Accurate identification of lipids in biological samples is a key step in lipidomics studies. Multidimensional NMR is a powerful tool as it provides comprehensive structural information on lipid composition at atomic resolution. However, the interpretation of NMR spectra of complex lipid mixtures is currently limited by NMR spectral resolution and the availability of comprehensive NMR lipid databases along with user-friendly spectral analysis tools. In this chapter, the recently curated 2D HSQC hydrophobic metabolite database is introduced, which contains 501 compounds with accurate experimental 2D 13C-1H HSQC chemical shift data measured in CDCl3. A new module in the public COLMAR suite of NMR web servers was developed for the semi-automated analysis of complex lipidomics mixtures. To obtain 2D HSQC spectra with the necessary high spectral resolution along both 13C and 1H dimensions, non-uniform sampling (NUS) in combination with pure shift spectroscopy was applied, allowing the extraction of an abundance of unique cross-peaks belonging to hydrophobic compounds in complex lipidomics mixtures. This information promotes the unambiguous identification of underlying lipid molecules by means of the new COLMAR Lipids web server, as is demonstrated for Caco-2 cell and lung tissue cell extracts.


4.1 Introduction

Lipids constitute structural cell membrane components of all living cells, provide energy storage, protection from physical and chemical influences, and play critical roles in cellular functions such as cellular communication.14 Over the years, lipidomics has emerged as a systems biology approach toward a comprehensive understanding of the role of lipids in physiological processes,15, 16 including the metabolism and cellular functions of lipids in diseases, such as cancer, cardiovascular disorders, neurodegenerative diseases, obesity, and diabetes. Mass spectrometry combined with chromatography has become a popular platform for lipidomics studies to elucidate lipid composition in many different biological systems.164 Nuclear magnetic resonance (NMR) spectroscopy is also used for the detailed structure characterization of lipid species.165 1D 1H NMR and 13C NMR have proven useful for lipid profiling to characterize different classes of lipids in lipidomics mixtures.166-170 For instance, the composition of unsaturated fatty acyl residues and allylic and divinyl carbons in rapeseed oil and soybean oil can be determined by 1H and 13C NMR spectra for quantitative analysis.166 LipSpin has been introduced for the semiautomatic 1H NMR spectra-based profiling of lipid extracts, providing quantitative information of major lipid classes in lipidomics mixtures, such as fatty acids, triglycerides, phospholipids, and cholesterols.170 The composition and positions of acyl chains in the triacylglycerols of oil from Moringa oleifera were determined by 13C NMR, demonstrating the detection of oleate, vaccenate, and eicosenoate chains.167

Multidimensional high-resolution solution NMR spectroscopy can provide unique fingerprints of chemical compounds in complex chemical mixtures without the need for significant sample purification/separation.171-173 2D 13C-1H HSQC has been applied for lipid profiling in a variety of lipid cell extracts.174-176 Mycobacterial cell wall lipids were analyzed by 2D 13C-1H HSQC to monitor unique biomarker cross-peaks and test for the presence of certain lipid species to improve our understanding of the role of lipids in mycobacteria pathogenesis.174 2D 13C-1H HSQC NMR was also used for phospholipid profiling of thermophilic Geobacillus sp. strain GWE1 isolated from sterilization ovens, which confirmed the 1H and 13C NMR assignments of the glycerol backbone position sn-1 (ref. 177) and the CH2CH2NH2 head group.176 However, application of 2D HSQC to lipidomics mixtures for untargeted lipid identification and quantification is still very limited, which is mostly due to a lack of comprehensive lipid 2D NMR databases and user-friendly platforms for NMR spectral analysis.

Moreover, certain lipids within the same lipid class share common substructures that tend to give rise to similar chemical shifts, requiring NMR spectra with maximal resolution for their unambiguous discrimination (Figure 4.1). Non-uniform sampling (NUS) enables the efficient acquisition of high-resolution 2D NMR spectra along the indirect dimension, thereby greatly benefitting metabolomics studies.178, 179 In addition, to boost the resolution along the direct proton dimension, broadband homonuclear decoupled ‘pure shift’ methods have been adopted where peak multiplets collapse into singlets as the result of 1H-1H J-coupling removal.180, 181 Moreover, the real-time pure shift (BIRD) HSQC not only improves spectral resolution, but also affords a simultaneous gain in sensitivity.182-186 Because the real-time pure shift technique works independently of the chosen NUS schedule along the indirect t1 dimension, we combine it here with non-uniform sampling to simultaneously increase the resolution along both dimensions in 2D HSQC experiments of lipidomics samples.

We also introduce a new 2D HSQC hydrophobic compound database, which currently contains experimental 2D 13C-1H HSQC chemical shift data of 501 hydrophobic compounds, in tandem with a user-friendly NMR spectral analysis platform for the identification of lipids. This is implemented as a COLMAR Lipids web server, which is a new module of our public COLMAR suite of NMR web servers (http://spin.ccic.osu.edu/index.php/colmarm/index2). We demonstrate how the spectral resolution gain obtained with non-uniformly sampled 2D real-time pure shift HSQC spectra serving as input for COLMAR Lipids enables the automated and unambiguous identification of lipid molecules in complex lipidomics mixtures. The approach is first tested for a lipid model mixture and then applied to hydrophobic extracts from Caco-2 human intestinal cells and mouse lung tissue.

4.2 Methods and Experimental Section

4.2.1 Integration of Hydrophobic Compound 2D 13C-1H HSQC Database

The hydrophobic compound 2D 13C-1H HSQC chemical shifts were retrieved from the HMDB, BMRB and NMRShiftDB databases.38, 41, 163 Only compounds with chemical shifts measured in 100% deuterated chloroform (CDCl3) were used from these databases, with either 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) or tetramethylsilane (TMS) used for chemical shift referencing. An in-house Python script was developed to automatically read the source data document and obtain the compound name, CAS registry number (if applicable), chemical shifts (1H and 13C), SMILES strings, and molecular structures. This hydrophobic compound database contains lipids from a wide range of lipid classes, such as fatty acids, cholesterols, triglycerides, and phospholipids, and is currently the most comprehensive hydrophobic metabolite 2D 13C-1H HSQC database. For each compound, the chemical shifts were further subdivided into spin systems based on their molecular structure, allowing compound verification based on 2D TOCSY or 2D HSQC-TOCSY via our standard NMR metabolomics protocol implemented in the COLMARm web server.86

Each database entry was manually inspected and compounds with incomplete data, such as a lack of 13C chemical shifts, were removed. Currently, 501 hydrophobic compounds are stored in the COLMAR Lipids HSQC database, which are publicly accessible via the COLMAR Lipids web server (http://spin.ccic.osu.edu/index.php/colmarm/index2).
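The selection criteria described in this section (CDCl3 as solvent, complete 1H and 13C shift data) can be sketched as a simple filter. This is not the in-house script mentioned above; the record layout, field names, entry names, and shift values are hypothetical placeholders chosen only to illustrate the filtering logic.

```python
# Hypothetical parsed records; field names and values are placeholders.
entries = [
    {"name": "entry_1", "solvent": "CDCl3",
     "shifts_1h": [0.88, 1.26, 2.35], "shifts_13c": [14.1, 29.7, 34.0]},
    {"name": "entry_2", "solvent": "D2O",
     "shifts_1h": [3.25, 3.42], "shifts_13c": [50.3, 38.2]},
    {"name": "entry_3", "solvent": "CDCl3",
     "shifts_1h": [0.68, 1.01], "shifts_13c": []},  # missing 13C shifts
]

def is_usable(entry, solvent="CDCl3"):
    """Keep entries measured in the given solvent with paired 1H and 13C shifts."""
    h = entry.get("shifts_1h") or []
    c = entry.get("shifts_13c") or []
    return entry.get("solvent") == solvent and len(h) > 0 and len(h) == len(c)

def hsqc_peaks(entry):
    """Pair 1H and 13C shifts into HSQC cross-peaks (1H, 13C) in ppm."""
    return list(zip(entry["shifts_1h"], entry["shifts_13c"]))

database = {e["name"]: hsqc_peaks(e) for e in entries if is_usable(e)}
print(sorted(database))
```

Automated filtering of this kind would only complement, not replace, the manual inspection of each entry described in the text.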

4.2.2 Sample Preparation

Lipid mixture samples were prepared to test and apply the COLMAR Lipids web server on NUS 2D real-time pure shift HSQC NMR spectra. A lipid model mixture was prepared containing 4 lipids: cholesterol (CAS: 57-88-5), linoleic acid (CAS: 60-33-3), arachidonic acid (CAS: 506-32-1), and docosahexaenoic acid (CAS: 6217-54-5). The final concentration of each compound was 2 mM in 600 μL deuterated chloroform (CDCl3) containing 0.03% (v/v) tetramethylsilane (TMS) for chemical shift referencing. The model mixture was transferred to a 5 mm NMR tube for NMR data collection.

Lung tissue was extracted from wild-type C57BL/6 mice and snap frozen in liquid nitrogen. Lung samples (~150 mg) were cut with a clean blade and added to a 1.5 mL Safe-Lock microcentrifuge tube (Eppendorf) with 600 μL 1:1 methanol/double distilled H2O (ddH2O) and 300 μL of 1.4 mm stainless steel beads (SSB14B). The sample tube was inserted in a Bullet Blender (24 Gold BB24-AU by Next Advance) and homogenized at speed 10 for 12 min at 4°C. Then 400 μL 1:1 methanol/ddH2O was added and the sample was centrifuged at 12,000 x g for 20 min at 4°C to remove the beads and any remaining solid debris. The supernatant was collected and 3.5 mL methanol, 3.5 mL ddH2O, and 4 mL chloroform were added to give a final methanol/ddH2O/chloroform 1:1:1 (v/v/v) mixture.

The sample was vortexed and centrifuged at 5,000 x g for 20 min at 4°C for separation of the aqueous and organic phases. The organic phase, which contains the lipids, was collected and dried using compressed nitrogen gas and stored at -80°C until measurements were performed. The NMR sample was prepared by dissolving the dried sample in 600 μL CDCl3 with 0.03% (v/v) tetramethylsilane (TMS) for chemical shift referencing. The sample was then transferred to a 5 mm NMR tube with a Teflon cap and sealed with parafilm for NMR measurements.

The Caco-2 cell model was utilized as previously described187, 188 with minor modifications as delineated below. Briefly, Caco-2 human intestinal cells (ATCC, #HTB-37, Rockville, MD) were grown as monolayers on 6-well plates and incubated at 37°C under a modified atmosphere (i.e., 95% air, 5% CO2) in a controlled humidity chamber (FormaSeriesII, Thermo Scientific). The media was changed every second day and consisted of DMEM with 15% fetal bovine serum (FBS) until confluence was reached, at which time FBS was reduced to 7.5% of the medium. After cell differentiation (passage 27, 11-14 days post-confluence), cells were treated with artificial micelles diluted with fresh DMEM (1:4, v/v). Artificial micelles were prepared as previously described,189 and their composition, including eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA), is listed in Table 4.1. Incubation occurred at 37°C for 6 hours, followed by several washes with 2 mL PBS buffer + albumin and then with PBS buffer alone, after which the monolayer was harvested using a cell scraper. Cells were frozen at -80°C for less than 4 weeks prior to analysis.

Lipids were extracted from cells using a modified Bligh-Dyer method, as described previously.190 The extracts obtained from 4 wells were pooled to optimize metabolite concentration for NMR analysis. Extracts were also analyzed using ultra-high performance liquid chromatography-high resolution mass spectrometry (UHPLC-MS) using a 1290 UHPLC system interfaced with a Q-Tof 6545 (Agilent, Inc.) with an electrospray probe operated in positive ion mode. A C18 column (Hypersil gold, 2.1 mm x 100 mm, 3 µm particle size) was employed using a water/acetonitrile gradient as previously described with the addition of 0.1% ammonium formate (w/v) as a solvent modifier.191 The MS source settings were as follows: capillary voltage = +3000 V, nozzle voltage = 35 V, gas temperature = 300°C, drying gas flow = 10 L/min, nebulizer = 25 psi, sheath gas temperature = 350°C, sheath gas flow = 12 L/min. The MS monitored m/z 50-1700 with an acquisition rate of 8119 transients per spectrum and 1 spectrum per second.

4.2.3 NMR Experiments and Processing

Uniformly sampled (US) and non-uniformly sampled (NUS) 2D 13C-1H HSQC spectra were collected for the lipid model mixture and the lipid cell extracts with and without real-time broadband homonuclear BIRD decoupling pure shift methodology. All NMR spectra were acquired on a Bruker AVANCE III HD solution-state NMR spectrometer equipped with a cryogenically cooled TCI probe at 850 MHz proton frequency at 298 K. The standard US HSQC spectra and real-time pure shift US HSQC spectra of the lipid model mixture and lipid cell extracts were collected with 512 complex t1 and 1024 complex t2 points. The measurement time for each experiment was around 4 hours. The spectral widths along the indirect and the direct dimensions were 34205.6 and 10204.1 Hz, respectively. The number of scans per t1 increment was 8. The transmitter frequency offset was 80 ppm in the 13C dimension and 4.7 ppm in the 1H dimension. The NUS 2D HSQC spectra of the lipid model mixture and lipid cell extracts, with and without real-time pure shift, were collected with 12.5% NUS (i.e., 512 NUS complex points out of 4096 uniformly sampled points in t1) with a measurement time of around 4 hours, using the same number of scans per t1 increment and the same transmitter frequency offsets as for the US spectra. Along the indirect carbon dimension, the NUS spectra were reconstructed with the iterative soft thresholding (IST) algorithm implemented in the hmsIST software.192 For all NUS spectra, the t1 increments were selected from a Poisson-gap distribution using the hmsIST software's schedule generator with default parameters (sinusoidal weight is 2 with random seed generator).193 Both the NUS and US NMR data were zero-filled, Fourier-transformed, and phase-corrected to a final digital resolution of 4096 (ω2) x 8192 (ω1) real points using Bruker Topspin 3.5 software. Chemical shift database query and identification of metabolites from the 2D NMR spectra were performed using the COLMAR NMR webserver (http://spin.ccic.ohio-state.edu/index.php/colmar).
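To make the idea of a Poisson-gap sampling schedule concrete, the sketch below draws 512 of 4096 t1 increments with gaps that grow along t1. It is a simplified, standard-library illustration written in the spirit of published Poisson-gap schemes and is not the hmsIST schedule generator itself. The function names, the 2% adjustment step, the seed, and the assumption that a quarter-sine (0 to π/2) weighting corresponds to the "sinusoidal weight 2" setting are all choices made here for illustration.

```python
import math
import random


def _poisson(lam, rng):
    """Draw one Poisson-distributed integer with mean lam (Knuth's method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poisson_gap_schedule(n_grid, n_keep, seed=0, max_tries=100000):
    """Pick n_keep increasing indices from range(n_grid) with sine-weighted Poisson gaps."""
    rng = random.Random(seed)
    adj = 2.0 * (n_grid / n_keep - 1.0)   # initial guess for the gap scale
    for _ in range(max_tries):
        picks, i = [], 0
        while i < n_grid:
            picks.append(i)
            i += 1
            # Quarter-sine weighting: small gaps early in t1, larger gaps late.
            lam = adj * math.sin((i + 0.5) / (n_grid + 1) * math.pi / 2)
            i += _poisson(lam, rng)
        if len(picks) == n_keep:
            return picks
        # Too many points means gaps are too small, so enlarge the scale (and vice versa).
        adj = adj * 1.02 if len(picks) > n_keep else adj / 1.02
    raise RuntimeError("no schedule with the requested number of points found")


schedule = poisson_gap_schedule(4096, 512, seed=1)
print(len(schedule), schedule[:10])
```

The first increment is always retained and the early part of t1, where the signal is strongest, is sampled most densely, which is the design rationale usually given for this kind of schedule.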

4.3 Results and Discussion

4.3.1 COLMAR Lipids Web Server with Hydrophobic Metabolite 2D HSQC Database

The COLMAR Lipids web server was implemented as a new module in the COLMAR suite of NMR web servers, and it is designed for the automated analysis of 2D 13C-1H HSQC spectra of lipidomics mixtures. All spectra need to be properly referenced before being queried against the lipids database. The COLMAR Lipids web server allows users to reference the chemical shifts based on an internal standard, such as tetramethylsilane (TMS), or a lipid with known chemical shifts present in the sample.

After referencing the HSQC spectrum, the query algorithm embedded in the COLMAR web server performs an automated matching with database compounds. The default error tolerances for the 1H and 13C chemical shifts are 0.03 ppm and 0.3 ppm, respectively, and can be changed by the user. Figure 4.2 is a screenshot of lipid compound identification in Caco-2 cell extract by the new COLMAR Lipids web server together with examples of matched peaks and spectral regions of identified compounds. It demonstrates the high spectral resolution achievable for lipidomics samples, which is necessary for the accurate identification of lipid metabolites. Table 4.2 lists the lipids identified in Caco-2 cell extract that have uniqueness > 0 and precursor ion(s) (MS1) detected in UHPLC-MS. Displayed are quantitative metrics based on a standard 2D 13C-1H HSQC spectrum, including matching ratios (number of cross-peaks identified for a given metabolite divided by the number of cross-peaks expected for this compound), chemical shift RMSDs (for both 13C and 1H), uniqueness (number of cross-peaks uniquely assigned to a given compound) and the compound class of the lipid. The "uniqueness" metric reflects the number of unique peaks of a compound present in the spectrum, which is useful for determining whether at least one cross-peak of an identified compound does not overlap with peaks of other database compounds. The complete lipid identification results for Caco-2 cell extract and lung tissue are provided in Table 4.3 and Table 4.4, respectively. As can be seen in these tables, a number of lipids have uniqueness ≥ 1, which means that one or more distinct peaks were found that belong uniquely to each of them. These peaks have the potential to serve as diagnostic biomarker(s) to confirm the presence or absence of a particular lipid molecule and allow one to elucidate the cellular origins of a particular lipid.
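The metrics defined above (matching ratio, chemical shift RMSD, uniqueness) can be illustrated with a minimal sketch. This is not the COLMAR query algorithm, whose internals are not described here; it is a simplified nearest-peak matcher under the default tolerances of 0.03 ppm (1H) and 0.3 ppm (13C). The function names and the toy chemical shifts are hypothetical.

```python
import math


def match_compound(db_peaks, query, tol_h=0.03, tol_c=0.3):
    """Map each database peak index to the closest query peak index within tolerance."""
    hits = {}
    for i, (h, c) in enumerate(db_peaks):
        candidates = [
            (abs(qh - h) / tol_h + abs(qc - c) / tol_c, j)
            for j, (qh, qc) in enumerate(query)
            if abs(qh - h) <= tol_h and abs(qc - c) <= tol_c
        ]
        if candidates:
            hits[i] = min(candidates)[1]
    return hits


def score(database, query, tol_h=0.03, tol_c=0.3):
    """Return matching ratio, RMSDs and uniqueness for every database compound."""
    hits = {n: match_compound(p, query, tol_h, tol_c) for n, p in database.items()}
    claims = {}  # query peak index -> number of compounds that matched it
    for h in hits.values():
        for j in set(h.values()):
            claims[j] = claims.get(j, 0) + 1
    report = {}
    for name, peaks in database.items():
        pairs = [(peaks[i], query[j]) for i, j in hits[name].items()]
        n = len(pairs)
        report[name] = {
            "matching_ratio": n / len(peaks),
            "rmsd_h": math.sqrt(sum((p[0] - q[0]) ** 2 for p, q in pairs) / n) if n else None,
            "rmsd_c": math.sqrt(sum((p[1] - q[1]) ** 2 for p, q in pairs) / n) if n else None,
            "uniqueness": sum(1 for j in set(hits[name].values()) if claims[j] == 1),
        }
    return report


# Toy data (1H ppm, 13C ppm); the values are made up for illustration.
database = {
    "A": [(1.00, 20.0), (2.00, 30.0)],
    "B": [(1.01, 20.1), (5.00, 100.0)],
}
query = [(1.005, 20.05), (2.01, 30.1)]
report = score(database, query)
print(report)
```

In this toy case compound A matches both of its peaks, one of which is also claimed by B, so A has a matching ratio of 1.0 and a uniqueness of 1, while B has a matching ratio of 0.5 and a uniqueness of 0.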

A benefit of the hydrophobic metabolite 2D HSQC NMR database is that it allows identification of structural isomers that may not be easily distinguished by mass spectrometry (MS or MS/MS) alone (Figure 4.3). For example, isomers decanal and menthol generate highly similar MS/MS spectra, whereas the 2D HSQC spectra of decanal and menthol give rise to very different non-overlapping cross-peaks, which makes it straightforward to distinguish between these isomers.

4.3.2 Disambiguation of Lipid Identification Using Complementary Information

COLMAR Lipids generates a list of hydrophobic compound candidates based on a single 2D HSQC spectrum. However, due to the high similarity of certain lipids and of their chemical shifts, some HSQC peaks are assigned to multiple compounds simultaneously, which can reduce the uniqueness scores all the way to zero (Tables 4.3 and 4.4), thereby reducing the confidence in the identification of these compounds. For instance, COLMAR Lipids returns TG(16:0/16:0/16:0) and TG(10:0/10:0/10:0) with matching ratios of 1.0 (10 out of 10 database peaks are matched) and 0.91 (10/11 database peaks are matched), respectively, while the uniqueness scores of both compounds are zero. The two compounds have very similar chemical shifts and share four cross-peaks. Hence, complementary experimental information about the lipidomics sample, such as mass information, is needed to cross-validate and disambiguate the identified compounds. By combining COLMAR Lipids results of the Caco-2 cell extract with LC-MS data, a number of lipids could be confirmed and others excluded, thereby reducing the false positive rate (Table 4.3).
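The MS1 cross-validation can be sketched as a simple precursor-ion lookup: for each NMR candidate, the expected m/z of common positive-mode adducts is computed and compared with the observed peak list, here using the 0.005 Da window stated in the footnote of Table 4.3. The adduct mass shifts are standard monoisotopic values rounded to five decimals, the cholesterol mass is the monoisotopic mass of C27H46O, and the observed peak list and function name are hypothetical.

```python
# Mass added to the neutral monoisotopic mass M for singly charged adducts (Da).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.00728,
    "[M+NH4]+": 18.03383,
    "[M+Na]+": 22.98922,
    "[M+ACN+H]+": 42.03383,
}


def detected_adducts(mono_mass, observed_mz, tol_da=0.005):
    """Return the adducts whose predicted m/z lies within tol_da of an observed peak."""
    found = []
    for adduct, shift in ADDUCT_SHIFTS.items():
        predicted = mono_mass + shift
        if any(abs(mz - predicted) <= tol_da for mz in observed_mz):
            found.append(adduct)
    return found


cholesterol_mass = 386.35487        # monoisotopic mass of C27H46O
observed = [404.3890, 512.5000]     # made-up MS1 peak list for illustration
print(detected_adducts(cholesterol_mass, observed))
```

A candidate for which this function returns an empty list would correspond to an "N.D." entry, which by itself lowers confidence in the NMR-based assignment but does not strictly exclude the compound, since ionization efficiency varies between lipid classes.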

Additional opportunities exist in leveraging the COLMAR approach, in combination with empirically derived LC-MS and LC-MS/MS spectra obtained from the same samples, to conclusively identify more challenging lipids. MS/MS certainly confers some advantages in lipid identification, e.g., fatty acyl positional isomers of di- and triglycerides can be established by studying fragmentation patterns, with loss in the outer position (sn-1 or sn-3) favored over the sn-2 position.194 However, accurate annotation of geometric (i.e., cis-trans) isomers by LC-MS has not been possible in the absence of authentic standards until recently. Ion mobility spectrometry (IMS) has made some notable progress in this arena,195, 196 but ion transmission remains a major limitation because it reduces sensitivity, which is a challenge when identifying lipids at low concentrations in samples of biological origin. Likewise, accurate identification of double bond positions remains cumbersome by LC-MS, requiring an online ozone source197 or in-line acetone photo-derivatization via the Paternò-Büchi reaction.198 In contrast, utilizing COLMAR in combination with LC fractionation, exact mass and fragmentation data will open up alternative strategies for the accurate annotation of such types of complex lipids.


4.3.3 Accurate Lipid Identification by High-resolution 2D NUS Real-time Pure Shift HSQC

For some lipids and fatty acids that share common substructures, e.g., linoleic acid (LA) and arachidonic acid (AA), the chemical shifts can be very similar. As a result, cross-peaks from common substructures tend to overlap in a standard HSQC spectrum, e.g., one with a digital resolution of 4096 (ω2) x 1024 (ω1) real data points. By contrast, resolution enhancement along the indirect 13C dimension by NUS and along the direct 1H dimension by pure shift NMR resolves many of the overlaps without increasing the measurement time over the standard HSQC experiment. This permits the unique identification of hydrophobic compounds, as is demonstrated for both a model mixture and two real-world lipid mixtures. Figure 4.4 shows a spectral comparison between US and NUS 2D HSQC spectra with and without the real-time pure shift technique for the lipid model mixture. The US HSQC spectra were acquired with 512 complex points along the indirect 13C dimension. The NUS HSQC spectra were acquired with 512 NUS points out of 4096 complex points (12.5% sampling) along the indirect 13C dimension. Hence, the NUS increments were selected from an 8 times larger indirect evolution time t1 range than in the US spectrum, resulting in ultrahigh resolution along ω1. All spectra were processed or reconstructed to a final digital resolution of 4096 (ω2) x 8192 (ω1) real data points. In the alkene carbon region, where structurally similar fatty acids tend to generate similar chemical shifts, the NUS spectra not only provide substantially higher resolution than the US spectra, allowing easier identification of individual cross-peaks with more precise peak positions, but also show a larger number of peaks than the US spectra. In particular, the cross-peaks of arachidonic acid, docosahexaenoic acid and linoleic acid are clearly separated in the NUS HSQC spectrum (Figure 4.4). As indicated by the black square box in Figure 4.4e, arachidonic acid, docosahexaenoic acid and linoleic acid have a similar proton and carbon chemical shift distribution (1H: 5.38 ppm – 5.41 ppm, 13C: 127.8 ppm – 128.2 ppm). For example, the 13C chemical shift difference of the two closest peaks from arachidonic acid and linoleic acid is only 0.0966 ppm (i.e., 20.65 Hz on an 850 MHz NMR spectrometer).

To achieve a digital resolution that distinguishes these two peaks in a traditional US HSQC spectrum, at least 1502 complex points along t1 are needed according to the Nyquist sampling theorem (with a carbon spectral width of 145 ppm). Traditional data processing techniques, such as zero-filling and linear prediction, have a very limited effect on the digital resolution, as shown in Figure 4.4c. To boost the resolution to that of a US HSQC with 4096 complex t1 points at the same measurement time as a US HSQC with 512 complex t1 points, NUS can be employed and combined with real-time pure shift techniques along t2 (Figure 4.4b,d). The latter provide broadband homonuclear proton-proton decoupling, which collapses the multiplet signals into singlets along the 1H dimension. This further separates cross-peaks, simplifies the spectra and enhances sensitivity. On the other hand, because the longest t1 increment in the NUS experiment is up to 8 times longer than in the standard HSQC, 13C T2 relaxation causes some sensitivity loss, which remains minor owing to the relatively long 13C T2 relaxation times.
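The numbers quoted above can be checked with a few lines of arithmetic. The sketch assumes that the digital resolution of a uniformly sampled spectrum equals the spectral width divided by the number of complex t1 points, and uses the standard 13C/1H frequency ratio of 0.25145020 to convert ppm to Hz; the function and constant names are chosen here for illustration.

```python
import math

FREQ_RATIO_13C_1H = 0.25145020   # standard 13C/1H resonance frequency ratio


def carbon_frequency_mhz(proton_mhz):
    """13C Larmor frequency (MHz) for a given 1H spectrometer frequency."""
    return proton_mhz * FREQ_RATIO_13C_1H


def min_complex_points(sw_ppm, separation_ppm):
    """Smallest N for which SW/N does not exceed the peak separation."""
    return math.ceil(sw_ppm / separation_ppm)


f_c = carbon_frequency_mhz(850.0)            # about 213.7 MHz
separation_hz = 0.0966 * f_c                 # ppm times MHz gives Hz
n_points = min_complex_points(145.0, 0.0966)
print(round(separation_hz, 2), n_points)
```

Because both the spectral width and the separation are expressed in ppm, the required number of points is independent of the magnetic field strength; only the separation in Hz scales with the field.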

The spectral resolution benefits for lipid metabolism studies are demonstrated by applying the high-resolution NUS pure shift NMR strategy to Caco-2 cell extracts to study their cellular uptake of eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA). Both EPA and DHA are long-chain omega-3 fatty acids found primarily in the human diet in cold-water fish oil.199 EPA and DHA serve as key precursors to important metabolites that play roles in inflammatory signaling,200 as well as visual and neural development.201 Caco-2 cells were incubated with artificial micelle formulations with varying ratios of EPA, DHA and OA (Table 4.1). Due to the high structural similarity of EPA and DHA (Tanimoto similarity coefficient of 0.704), their chemical shifts tend to be very similar, necessitating ultrahigh-resolution NMR to distinguish between the two molecules. 2D HSQC and 2D NUS HSQC with and without pure shift methodology were applied. In Figure 4.5, the unambiguous identification of EPA and DHA in complex Caco-2 cell extract is demonstrated by using NUS HSQC spectra with and without the real-time pure shift technique. Specifically, the alkene carbon CH=CH peaks of DHA and EPA are clearly separated in the 2D NUS HSQC, while they severely overlap in the normal 2D HSQC. The uniquely identified cross-peaks of the CH=CH groups help identify and quantify EPA and DHA in this complex mixture. Caco-2 cells were incubated with micelles containing six different formulations of EPA, DHA and OA (Table 4.5). These can be clearly differentiated and identified by high-resolution NUS 2D standard HSQC spectra (Figure 4.6), but are barely distinguishable by 1D 1H NMR or standard 2D HSQC spectra. Furthermore, this information can be utilized to quantitate the cellular uptake of EPA, DHA, and OA from the micelles.
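For readers unfamiliar with the Tanimoto coefficient quoted above, the sketch below shows its definition on sets of structural features: the number of shared features divided by the number of features present in either molecule. The fingerprint type behind the value of 0.704 is not stated in the text, so the feature sets used here are arbitrary placeholders and do not reproduce that number.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity of two feature sets, between 0 and 1."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 1.0   # two empty fingerprints are treated as identical
    return len(a & b) / len(union)


# Placeholder "on" bits of two hypothetical fingerprints.
fingerprint_1 = {1, 2, 3, 4}
fingerprint_2 = {3, 4, 5}
print(tanimoto(fingerprint_1, fingerprint_2))
```

With real molecules, the feature sets would typically be the "on" bits of a substructure fingerprint computed by a cheminformatics toolkit.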

A significant improvement of resolution in the NUS HSQC experiment can also be seen in other crowded regions. Figure 4.7 shows a comparison between the uniformly sampled HSQC and the NUS HSQC of Caco-2 cell extract and lung tissue, each acquired with 512 complex points along the indirect 13C dimension. All spectra were processed or reconstructed to a final digital resolution of 4096 (ω2) x 8192 (ω1) real data points. The resolution along the carbon dimension in the NUS HSQC is much higher than that in the standard HSQC spectrum, with a digital resolution of 7.58 Hz (NUS) vs. 60.66 Hz (US) on an 850 MHz NMR spectrometer. As a direct consequence, NUS HSQC spectra produce numerous well-isolated peaks in crowded regions, especially when combined with pure shift methodology along the proton dimension, as indicated in Figure 4.7 (Panels B, D, F, H).

It should be noted that for the majority of hydrophobic compounds in the COLMAR Lipids database, the (1H, 13C) chemical shift information was measured using regular HSQCs with 512 complex points along the indirect 13C dimension. As shown in Figure 4.5, structurally similar lipid compounds such as EPA and DHA share 13 out of 18 cross-peaks at virtually identical positions, which may hamper the unambiguous identification of lipids in complex lipidomics mixtures. Therefore, for targeted lipid identification and analysis by high-resolution NUS real-time pure shift HSQC, it is recommended that in the future the reference lipid target compounds are also measured using the same types of high-resolution HSQC experiments.

In addition, even an extremely low sampling density of the NUS schedule (e.g., as low as 6.25%, 256 NUS complex points along t1) hardly affects the spectral reconstruction quality for lipidomics mixture applications. Figure 4.8 shows a comparison of a NUS HSQC NMR spectrum of lung tissue with 6.25% sampling density with the corresponding US NMR spectrum with the same number of t1 increments. The reconstructed NUS HSQC and US HSQC spectra display almost the same peak profile. This demonstrates the feasibility and benefits of up to 16-fold improved spectral resolution along the indirect dimension for this real-world lipidomics mixture without a concomitant increase of measurement time.
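The relation between NUS sampling density and resolution gain used in this section can be written down directly. The sketch assumes that a NUS spectrum is compared with a US spectrum recorded with the same number of t1 increments and the same spectral width, so that the nominal resolution gain equals the ratio of the NUS grid size to the number of sampled points; the function name is chosen here for illustration.

```python
def nus_summary(n_sampled, n_grid, us_resolution_hz):
    """Return (sampling density, resolution gain, NUS digital resolution in Hz)."""
    density = n_sampled / n_grid
    gain = n_grid / n_sampled
    return density, gain, us_resolution_hz / gain


# 512 of 4096 points: 12.5% density, 8-fold gain, 60.66 Hz becomes about 7.58 Hz.
density_8x, gain_8x, res_8x = nus_summary(512, 4096, 60.66)
# 256 of 4096 points: 6.25% density, 16-fold gain.
density_16x, gain_16x, _ = nus_summary(256, 4096, 60.66)
print(density_8x, gain_8x, res_8x, density_16x, gain_16x)
```

The measurement time scales with the number of sampled increments, not with the grid size, which is why the gain comes without a longer experiment.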

4.4 Conclusion

The integration of the new 2D 13C-1H HSQC lipids database into COLMAR greatly facilitates the unique identification of lipids present in complex lipidomics mixtures. To maximize the number of identified compounds and minimize false positives, the highest possible NMR spectral resolution is critically important. As demonstrated here, it can be achieved by performing the NMR experiments at high magnetic fields in combination with the use of non-uniformly sampled (NUS) pure shift methodology to simultaneously optimize resolution along both the 13C and 1H dimensions. Highly crowded HSQC regions, such as the alkene region, can thereby be resolved into individual cross-peaks, which dramatically improves identification accuracy of lipids in complex lipidomics mixtures. We anticipate that the new relational 2D 13C-1H HSQC hydrophobic compound database that can be queried via the COLMAR web server will significantly facilitate and promote application of 2D NMR-based lipidomics for a variety of hydrophobic biological mixtures.


Table 4.1. Composition of artificial micelles incubated with Caco-2 human intestinal cells for 6 hours prior to experiment termination.

Lipid species | Molarity (mM) in final solution
Taurocholate | 0.5
Monoolein | 500
Lyso-phosphatidylcholine (lyso-PC) | 100
Phosphatidylcholine (PC) | 100
Cholesterol | 10
Glycerol | 50
α-Tocopherol | 25

Table 4.2. Hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.[a]

Metabolite name or ID | Matching ratio[b] | 13C RMSD (ppm) | 1H RMSD (ppm) | Uniqueness[c] | Compound class
Cholesterol | 1.00 | 0.05 | 0.005 | 2 | Cholesterols and derivatives
Octanoic acid | 1.00 | 0.15 | 0.004 | 1 | Medium-chain fatty acids
Tetracosanoic acid | 1.00 | 0.12 | 0.009 | 1 | Long-chain fatty acids
Linoleic acid | 1.00 | 0.13 | 0.012 | 1 | Omega-6 fatty acids
Docosahexaenoic acid | 0.95 | 0.08 | 0.008 | 4 | Very long-chain fatty acids
Arachidonic acid | 0.94 | 0.10 | 0.011 | 1 | Long-chain fatty acids
Eicosapentaenoic acid | 0.94 | 0.10 | 0.008 | 1 | Long-chain fatty acids
1-Hexadecanol | 0.94 | 0.11 | 0.006 | 2 | Long-chain fatty alcohols
β-Sitosterol | 0.90 | 0.08 | 0.013 | 6 | Phytosterols
CE(16:0) | 0.79 | 0.10 | 0.006 | 4 | Cholesterols and derivatives
PC(16:0/16:0) | 0.71 | 0.12 | 0.006 | 4 | Glycerophospholipids
Stigmasterol | 0.65 | 0.10 | 0.010 | 5 | Phytosterols
5α-Cholestanol | 0.61 | 0.16 | 0.009 | 9 | Cholesterols and derivatives

Abbreviations: TG = triglyceride, PC = phosphatidylcholine, CE = cholesterol ester.
[a] Identified lipids with uniqueness >= 1 and precursor ion(s) detected in UHPLC-MS.
[b] The ratio of the matched number of peaks to the total number of peaks.
[c] Number of cross-peaks in the HSQC spectrum of the mixture that are uniquely assigned to a certain metabolite.

Table 4.3. Full hydrophobic metabolite identification of Caco-2 cell extracts using COLMAR Lipids web server.

Metabolite name or ID | Matching ratio[a] | Uniqueness[b] | Precursor ion(s) detected (MS1)[c] | Compound class
Cholesterol | 1.00 | 2/29 | [M+NH4]+ | Cholesterols and derivatives
Nonadecane | 1.00 | 0/19 | N.D. | Alkanes
Myristic acid | 1.00 | 0/13 | N.D. | Long-chain fatty acids
Cyclohexane | 1.00 | 0/6 | N.D. | Cycloalkanes
Dodecanoic acid | 1.00 | 0/11 | [M+NH4]+ | Medium-chain fatty acids
Octanoic acid | 1.00 | 1/7 | [M+NH4]+ | Medium-chain fatty acids
Nonanoic acid | 1.00 | 0/8 | N.D. | Medium-chain fatty acids
Octane | 1.00 | 0/8 | N.D. | Alkanes
Decanoic acid | 1.00 | 0/9 | N.D. | Medium-chain fatty acids
TG(16:0/16:0/16:0) | 1.00 | 0/10 | N.D. | Monoacid triglyceride
Palmitic acid | 1.00 | 0/6 | [M+Na]+ | Long-chain fatty acids
Pentacosanoic acid | 1.00 | 0/8 | [M+ACN+H]+ | Long-chain fatty acids
Hexacosanoic acid | 1.00 | 0/7 | [M+ACN+H]+ | Long-chain fatty acids
Octacosanoic acid | 1.00 | 0/7 | [M+ACN+H]+ | Long-chain fatty acids
Heneicosanoic acid | 1.00 | 0/7 | [M+NH4]+ | Long-chain fatty acids
Tetracosanoic acid | 1.00 | 1/7 | [M+NH4]+ | Long-chain fatty acids
Stearic acid | 1.00 | 0/7 | N.D. | Long-chain fatty acids
Nonadecanoic acid | 1.00 | 0/7 | N.D. | Long-chain fatty acids
Linoleic acid | 1.00 | 1/12 | [M+H]+ | Omega-6 fatty acids
Elaidic acid | 1.00 | 0/9 | [M+NH4]+ | Long-chain fatty acids
[...] | 1.00 | 0/7 | N.D. | Medium-chain fatty acids
Oleic acid | 1.00 | 0/9 | [M+ACN+H]+ | Long-chain fatty acids
Docosahexaenoic acid | 0.95 | 4/18 | [M+NH4]+ | Very long-chain fatty acids
Arachidonic acid | 0.94 | 1/17 | [M+ACN+H]+ | Long-chain fatty acids
Eicosapentaenoic acid | 0.94 | 1/17 | [M+ACN+H]+ | Long-chain fatty acids
1-Hexadecanol | 0.94 | 2/15 | [M+Na]+ | Long-chain fatty alcohols
TG(10:0/10:0/10:0) | 0.91 | 0/10 | [M+NH4]+ | Monoacid triglyceride
Undecanoic acid | 0.90 | 2/9 | N.D. | Medium-chain fatty acids
cis-Vaccenic acid | 0.90 | 0/9 | [M+NH4]+ | Omega-7 fatty acids
β-Sitosterol | 0.90 | 6/26 | [M+H]+ | Phytosterols
Vaccenic acid | 0.90 | 0/9 | [M+NH4]+ | Omega-7 fatty acids
1-Octanol | 0.88 | 1/7 | N.D. | Fatty alcohol
Octadecanol | 0.88 | 1/7 | N.D. | Fatty alcohol
Palmitoleic acid | 0.87 | 0/13 | [M+ACN+H]+ | Long-chain fatty acids
Heptadecanoic acid | 0.86 | 0/6 | [M+ACN+H]+ | Long-chain fatty acids
Tricosanoic acid | 0.86 | 0/6 | N.D. | Long-chain fatty acids
α-Linolenic acid | 0.85 | 1/11 | [M+H]+ | Omega-3 fatty acids
8,11,14-Eicosatrienoic acid | 0.80 | 0/8 | N.D. | Long-chain fatty acids
[...] | 0.80 | 0/8 | N.D. | Long-chain unsaturated fatty acids
CE(16:0) | 0.79 | 4/30 | [M+NH4]+ | Cholesterols and derivatives
5Z-Dodecenoic acid | 0.73 | 2/8 | N.D. | Long-chain unsaturated fatty acids
PC(16:0/16:0) | 0.71 | 4/10 | [M+ACN+H]+ | Glycerophospholipids
3-oxo-Undecanoic acid | 0.67 | 0/6 | N.D. | Medium-chain keto acids
Stigmasterol | 0.65 | 5/20 | [M+ACN+H]+ | Phytosterols
Decanal | 0.63 | 0/5 | [M+Na]+ | Medium-chain aldehyde
5α-Cholestanol | 0.61 | 9/20 | [M+NH4]+, [M+Na]+ | Cholesterols and derivatives

Abbreviations: CE = cholesterol ester, PC = phosphatidylcholine, TG = triglyceride, N.D. = not detected.
[a] The ratio of the matched number of peaks to the total number of peaks.
[b] Number of cross-peaks in the HSQC spectrum of the mixture that are uniquely assigned to the metabolite.
[c] Precursor ion species detected via UHPLC-QTof with ESI positive, with m/z within 0.005 Da of the predicted precursor mass.

Table 4.4. Hydrophobic metabolite identification of lung tissue using COLMAR Lipids web server.

Metabolite name or ID | Matching ratio[a] | Uniqueness[b] | Compound class
Capric acid | 1.00 | 1/7 | Medium-chain fatty acids
Cholesterol | 1.00 | 3/29 | Cholesterols and derivatives
Docosahexaenoic acid | 1.00 | 4/19 | Very long-chain fatty acids
Dodecanoic acid | 1.00 | 0/11 | Medium-chain fatty acids
Elaidic acid | 1.00 | 0/9 | Long-chain fatty acids
Heneicosanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Heptadecanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Hexacosanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Linoleic acid | 1.00 | 1/12 | Omega-6 fatty acids
Myristic acid | 1.00 | 0/6 | Long-chain fatty acids
Nervonic acid | 1.00 | 0/10 | Long-chain unsaturated fatty acids
Nonadecanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Nonanoic acid | 1.00 | 0/8 | Medium-chain fatty acids
Octacosanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Octane | 1.00 | 0/8 | Alkanes
Octanoic acid | 1.00 | 1/7 | Medium-chain fatty acids
Oleic acid | 1.00 | 0/9 | Long-chain fatty acids
PC(16:0/16:0) | 1.00 | 8/14 | Glycerophospholipids
Palmitic acid | 1.00 | 0/6 | Long-chain fatty acids
Stearic acid | 1.00 | 0/7 | Long-chain fatty acids
TG(16:0/16:0/16:0) | 1.00 | 1/10 | Monoacid triglyceride
Tricosanoic acid | 1.00 | 0/7 | Long-chain fatty acids
Vaccenic acid | 1.00 | 0/10 | Omega-7 fatty acids
cis-Vaccenic acid | 1.00 | 0/10 | Omega-7 fatty acids
Cyclohexane | 1.00 | 0/6 | Cycloalkanes
Decanoic acid | 1.00 | 0/9 | Medium-chain fatty acids
Nonadecane | 1.00 | 0/19 | Alkanes
Palmitoleic acid | 1.00 | 2/15 | Long-chain fatty acids
β-Sitosterol | 0.93 | 6/27 | Phytosterols
α-Linolenic acid | 0.92 | 1/12 | Omega-3 fatty acids
TG(10:0/10:0/10:0) | 0.91 | 1/10 | Monoacid triglyceride
Diphenylacetic acid | 0.91 | 0/10 | Diphenylmethanes
8,11,14-Eicosatrienoic acid | 0.90 | 2/9 | Long-chain fatty acids
Undecanoic acid | 0.90 | 1/9 | Medium-chain fatty acids
Arachidonic acid | 0.89 | 1/16 | Long-chain fatty acids
Eicosapentaenoic acid | 0.89 | 0/16 | Long-chain fatty acids
Pentacosanoic acid | 0.88 | 0/7 | Long-chain fatty acids
CE(16:0) | 0.87 | 6/33 | Cholesterols and derivatives
Tetracosanoic acid | 0.86 | 0/6 | Long-chain fatty acids
11-cis-Retinol | 0.85 | 0/11 | Retinoids
13-cis-Retinol | 0.85 | 0/11 | Retinoids
9-cis-Retinol | 0.85 | 0/11 | Retinoids
Vitamin A | 0.85 | 0/11 | Retinoids
1-Hexadecanol | 0.81 | 0/13 | Long-chain fatty alcohols
p-Cymene | 0.75 | 0/6 | Aromatic monoterpenoids
Stearoylcarnitine | 0.73 | 1/8 | Acyl carnitines
Pentadecanoic acid | 0.71 | 0/10 | Long-chain fatty acids
5β-Coprostanol | 0.68 | 4/23 | Cholesterols and derivatives
Stigmasterol | 0.68 | 4/21 | Triterpenoids
2-Indanone | 0.67 | 0/4 | Indanones
2-Methylbenzyl alcohol | 0.67 | 4/4 | Benzyl alcohols
3-oxo-Undecanoic acid | 0.67 | 0/6 | Medium-chain keto acids
3-Methylbenzyl alcohol | 0.67 | 3/4 | Benzyl alcohols
5α-Cholestanol | 0.67 | 7/22 | Cholesterols and derivatives
o-Xylene | 0.67 | 0/4 | m-Xylenes
p-Xylene | 0.67 | 0/4 | m-Xylenes
5α-Cholestanone | 0.66 | 2/21 | Cholesterols and derivatives
Epi-coprostanol | 0.65 | 6/22 | Cholesterols and derivatives
-Tocopherol | 0.65 | 5/11 | Tocopherols
5Z-Dodecenoic acid | 0.64 | 1/7 | Long-chain unsaturated fatty acids
1-Octanol | 0.63 | 0/5 | Fatty alcohol
Decanal | 0.63 | 0/5 | Medium-chain aldehyde
Octadecanol | 0.63 | 0/5 | Fatty alcohol
(1R,2S,5R)-5-Methyl-2-propan-2-ylcyclohexan-1-ol | 0.60 | 3/6 | Menthane monoterpenoids
4-Methylvaleric acid | 0.60 | 1/3 | Methyl-branched fatty acids
Cholestenone | 0.60 | 0/18 | Cholesterols and derivatives

Abbreviations: TG = triglyceride, PC = phosphatidylcholine, CE = cholesterol ester.
[a] The ratio of the matched number of peaks to the total number of peaks.
[b] Number of cross-peaks in the HSQC spectrum of the mixture that are uniquely assigned to a certain metabolite.

Table 4.5. Different ratios of fatty acids utilized in the micelles with which Caco-2 human intestinal cells were incubated for 6 h.

Sample | Sodium oleate / oleic acid (OA) (mg) | Eicosapentaenoic acid (EPA) (mg) | Docosahexaenoic acid (DHA) (mg)
#1 w/OA | 6.08 | 0 | 0
#2 w/EPA/OA | 2.43 | 14.52 | 0
#3 w/DHA/OA | 2.43 | 0 | 15.768
#4 w/EPA | 0 | 24.2 | 0
#5 w/DHA | 0 | 0 | 26.28
#6 w/EPA+DHA | 0 | 14.52 | 10.512

Figure 4.1. 1D 1H and 2D 13C-1H HSQC NMR spectra of Caco-2 cell extract. Panels A and B show up- and down-field regions. The uniformly sampled HSQC spectrum was acquired with 2048 (t2) x 512 (t1) complex points and was processed to a digital resolution of 4096 (ω2) x 2048 (ω1). Congested peak areas are marked by red boxes and putatively identified hydrophobic metabolites are listed. The corresponding functional groups (in red) of identified hydrophobic metabolites are annotated in the representative structures.

Figure 4.2. Identification of hydrophobic metabolites in Caco-2 cell extract with COLMAR Lipids web server. Panel A: screenshot of annotated 13C-1H HSQC spectrum (above) with list of hydrophobic metabolites identified by the web server (below). Each type of symbol marks the cross-peaks of a different hydrophobic metabolite in the COLMAR Lipids database. Panel B: illustration of spectral matching of HSQC cross-peaks (red ellipses) by COLMAR Lipids web server of stearic acid (left) and cholesterol (right) allowing the identification of these metabolites in Caco-2 cell extract.

Figure 4.3. Comparison of 2D 13C-1H HSQC NMR spectra and LC-MS/MS spectra pairs of structural isomers. The chemical shifts and MS data were taken from the HMDB. This demonstrates the complementarity of the two methods and the high sensitivity of NMR to structural differences.


Figure 4.4. Comparison of performance of different 2D 13C-1H HSQC experiments in the CH=CH region for lipid identification in a model mixture. Panel A: uniformly sampled standard HSQC (2048 (t2) x 512 (t1) complex points). Panel B: non-uniformly sampled (NUS) HSQC (12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points). Panel C: uniformly sampled real-time pure shift HSQC (2048 (t2) x 512 (t1) complex points). Panel D: non-uniformly sampled real-time pure shift HSQC (12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points). Panel E shows lipid identification in the standard 2D HSQC, 2D NUS HSQC and 2D NUS pure shift HSQC spectra of the model mixture. All spectra were processed to a final spectral digital resolution of 4096 (ω2) x 8192 (ω1).

Figure 4.5. Identification of eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) in different 2D 13C-1H HSQC experiments in Caco-2 cell extract. Panel A shows the identification of EPA and DHA in a uniformly sampled standard HSQC (blue, 2048 (t2) x 512 (t1) complex points) and a non-uniformly sampled standard HSQC (red, 12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points). Panel B shows the identification of EPA and DHA in a uniformly sampled real-time pure shift HSQC (blue, 2048 (t2) x 512 (t1) complex points) and a non-uniformly sampled real-time pure shift HSQC (red, 12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points). All spectra were processed to a final spectral digital resolution of 4096 (ω2) x 8192 (ω1). Unique cross-peaks that belong to DHA, EPA and oleic acid (OA) are labeled with green dots.


Figure 4.6. Differentiation of Caco-2 cells by NMR metabolomics analysis. NMR spectra of Caco-2 cells incubated with micelles containing different concentrations of docosahexaenoic acid (DHA), eicosapentaenoic acid (EPA) and oleic acid (OA) are provided in each column (the details of each micelle formulation are provided in Table 4.5). Panel A shows the alkene carbon region of the standard uniformly sampled HSQC (2048 (t2) x 512 (t1) complex points). Panel B shows the alkene carbon region of the standard non-uniformly sampled HSQC (12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points). Panel C shows the alkene region of the 1D 1H NMR spectrum. In the rightmost columns, DHA, EPA and OA peaks are annotated. The red box denotes peaks that belong to both DHA and EPA, the green box denotes peaks that belong to EPA only, the yellow box denotes peaks that belong to DHA only, and the black box denotes peaks that belong to OA only.


Figure 4.7. Spectral comparison of crowded regions of different types of 2D 13C-1H HSQC experiments in Caco-2 cell extract and lung tissue. Panels A, C, E, G show a spectral comparison of US HSQC (blue, 2048 (t2) x 512 (t1) complex points) vs. NUS HSQC (red, 12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points) in Caco-2 cell extract (A, C) and lung tissue (E, G). Panels B, D, F, H show a spectral comparison of US pure shift HSQC (blue, 2048 (t2) x 512 (t1) complex points) vs. NUS pure shift HSQC (red, 12.5% NUS and 512 NUS points, 2048 (t2) x 4096 (t1) complex points) in Caco-2 cell extract (B, D) and lung tissue (F, H). High-resolution NUS spectra resolve substantially more peaks than US spectra in crowded regions. All spectra were processed to a final spectral digital resolution of 4096 (ω2) x 8192 (ω1).


Figure 4.8. Spectral comparison of a 6.25% NUS HSQC NMR spectrum (red) and a US HSQC spectrum (blue) of mouse lung tissue extracts. The NUS spectrum samples only 6.25% of the t1 grid, i.e. 256 NUS points out of 4096 complex points along t1. The US spectrum uses 512 complex points along t1. Both spectra were processed to a final spectral digital resolution of 4096 (ω2) x 8192 (ω1).
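The NUS percentages quoted in these captions follow from the ratio of acquired t1 increments to the size of the full t1 grid. The short Python sketch below reproduces that arithmetic. The function names are hypothetical and introduced only for illustration, and the second function assumes that the US and NUS experiments share the same t1 dwell time (spectral width), which the captions do not state.

```python
def nus_sampling_density(nus_points: int, grid_points: int) -> float:
    """Fraction of the full t1 grid that is actually acquired."""
    if not 0 < nus_points <= grid_points:
        raise ValueError("nus_points must lie between 1 and grid_points")
    return nus_points / grid_points


def t1_extension_factor(grid_points: int, us_points: int) -> float:
    """Ratio of the maximum t1 evolution time on the NUS grid to that of the
    uniformly sampled experiment (assumption: identical t1 dwell time)."""
    if grid_points <= 0 or us_points <= 0:
        raise ValueError("point counts must be positive")
    return grid_points / us_points


if __name__ == "__main__":
    # 512 NUS points on a 4096-point t1 grid (Figures 4.4 to 4.7)
    print(nus_sampling_density(512, 4096))   # 0.125, i.e. 12.5%
    # 256 NUS points on a 4096-point t1 grid (Figure 4.8)
    print(nus_sampling_density(256, 4096))   # 0.0625, i.e. 6.25%
    # 4096-point NUS grid vs. 512 uniformly sampled t1 points
    print(t1_extension_factor(4096, 512))    # 8.0
```

Under the equal-dwell-time assumption, the 4096-point grid extends the maximum t1 evolution time eightfold relative to 512 uniformly sampled points, while the 12.5% schedule acquires the same number of t1 increments (512) as the US experiment.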


Bibliography

(1) Weckwerth, W., Metabolomics in systems biology. Annu. Rev. Plant Biol. 2003, 54, 669-89. (2) Kell, D. B., Metabolomics and systems biology: making sense of the soup. Curr. Opin. Microbiol. 2004, 7, 296-307. (3) Hood, L.; Heath, J. R.; Phelps, M. E.; Lin, B., Systems biology and new technologies enable predictive and preventative medicine. Science 2004, 306, 640-3. (4) Bruggeman, F. J.; Westerhoff, H. V., The nature of systems biology. Trends Microbiol. 2007, 15, 45-50. (5) Fukushima, A.; Kusano, M.; Redestig, H.; Arita, M.; Saito, K., Integrated omics approaches in plant systems biology. Curr. Opin. Chem. Biol. 2009, 13, 532-8. (6) Patti, G. J.; Yanes, O.; Siuzdak, G., Innovation: Metabolomics: the apogee of the omics trilogy. Nat. Rev. Mol. Cell Biol. 2012, 13, 263-9. (7) Vignoli, A.; Ghini, V.; Meoni, G.; Licari, C.; Takis, P. G.; Tenori, L.; Turano, P.; Luchinat, C., High-Throughput Metabolomics by 1D NMR. Angew. Chem. Int. Ed. Engl. 2019, 58, 968-994. (8) Fiehn, O., Metabolomics--the link between genotypes and phenotypes. Plant Mol. Biol. 2002, 48, 155-71. (9) Hollywood, K.; Brison, D. R.; Goodacre, R., Metabolomics: current technologies and future trends. Proteomics 2006, 6, 4716-23. (10) Shulaev, V., Metabolomics technology and . Brief Bioinform 2006, 7, 128-39. (11) Nicholson, J. K.; Lindon, J. C.; Holmes, E., 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 1999, 29, 1181-9. (12) Nicholson, J. K.; Lindon, J. C., Systems biology: Metabonomics. Nature 2008, 455, 1054-6. (13) Oliver, S. G.; Winson, M. K.; Kell, D. B.; Baganz, F., Systematic functional analysis of the yeast genome. Trends Biotechnol. 1998, 16, 373-8. (14) Wenk, M. R., The emerging field of lipidomics. Nat Rev Drug Discov 2005, 4, 594-610. (15) Dennis, E. A., Lipidomics joins the omics evolution. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 2089-90. 
(16) Wenk, M. R., Lipidomics: new tools and applications. Cell 2010, 143, 888-95.

137

(17) Griffiths, W. J.; Koal, T.; Wang, Y.; Kohl, M.; Enot, D. P.; Deigner, H. P., Targeted metabolomics for biomarker discovery. Angew. Chem. Int. Ed. Engl. 2010, 49, 5426-45. (18) Roberts, L. D.; Souza, A. L.; Gerszten, R. E.; Clish, C. B., Targeted metabolomics. Curr. Protoc. Mol. Biol. 2012, Chapter 30, Unit 30 2 1-24. (19) Vinayavekhin, N.; Saghatelian, A., Untargeted metabolomics. Curr. Protoc. Mol. Biol. 2010, Chapter 30, Unit 30 1 1-24. (20) Tautenhahn, R.; Cho, K.; Uritboonthai, W.; Zhu, Z.; Patti, G. J.; Siuzdak, G., An accelerated workflow for untargeted metabolomics using the METLIN database. Nat. Biotechnol. 2012, 30, 826- 8. (21) Schrimpe-Rutledge, A. C.; Codreanu, S. G.; Sherrod, S. D.; McLean, J. A., Untargeted Metabolomics Strategies-Challenges and Emerging Directions. J. Am. Soc. Mass Spectrom. 2016, 27, 1897-1905. (22) Akoka, S.; Barantin, L.; Trierweiler, M., Concentration Measurement by Proton NMR Using the ERETIC Method. Anal. Chem. 1999, 71, 2554-7. (23) Pauli, G. F.; Jaki, B. U.; Lankin, D. C., Quantitative 1H NMR: development and potential of a method for natural products analysis. J. Nat. Prod. 2005, 68, 133-49. (24) Weljie, A. M.; Newton, J.; Mercier, P.; Carlson, E.; Slupsky, C. M., Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal. Chem. 2006, 78, 4430-42. (25) Holmes, E.; Nicholls, A. W.; Lindon, J. C.; Connor, S. C.; Connelly, J. C.; Haselden, J. N.; Damment, S. J.; Spraul, M.; Neidig, P.; Nicholson, J. K., Chemometric models for toxicity classification based on NMR spectra of biofluids. Chem. Res. Toxicol. 2000, 13, 471-8. (26) Hao, J.; Astle, W.; De Iorio, M.; Ebbels, T. M., BATMAN--an R package for the automated quantification of metabolites from nuclear magnetic resonance spectra using a Bayesian model. Bioinformatics 2012, 28, 2088-90. (27) Dona, A. C.; Jimenez, B.; Schafer, H.; Humpfer, E.; Spraul, M.; Lewis, M. R.; Pearce, J. T.; Holmes, E.; Lindon, J. C.; Nicholson, J. 
K., Precision high-throughput proton NMR spectroscopy of human urine, serum, and plasma for large-scale metabolic phenotyping. Anal. Chem. 2014, 86, 9887- 94. (28) Emwas, A. H.; Roy, R.; McKay, R. T.; Tenori, L.; Saccenti, E.; Gowda, G. A. N.; Raftery, D.; Alahmari, F.; Jaremko, L.; Jaremko, M.; Wishart, D. S., NMR Spectroscopy for Metabolomics Research. Metabolites 2019, 9. (29) Lewis, I. A.; Schommer, S. C.; Hodis, B.; Robb, K. A.; Tonelli, M.; Westler, W. M.; Sussman, M. R.; Markley, J. L., Method for determining molar concentrations of metabolites in complex solutions from two-dimensional 1H-13C NMR spectra. Anal. Chem. 2007, 79, 9385-90. (30) Rai, R. K.; Tripathi, P.; Sinha, N., Quantification of metabolites from two-dimensional nuclear magnetic resonance spectroscopy: application to human urine samples. Anal. Chem. 2009, 81, 10232- 8.

138

(31) Oman, T.; Tessem, M. B.; Bathen, T. F.; Bertilsson, H.; Angelsen, A.; Hedenstrom, M.; Andreassen, T., Identification of metabolites from 2D (1)H-(13)C HSQC NMR using peak correlation plots. BMC Bioinformatics 2014, 15, 413. (32) Hu, K.; Westler, W. M.; Markley, J. L., Simultaneous quantification and identification of individual chemicals in metabolite mixtures by two-dimensional extrapolated time-zero (1)H- (13)C HSQC (HSQC(0)). J. Am. Chem. Soc. 2011, 133, 1662-5. (33) Sandusky, P.; Raftery, D., Use of semiselective TOCSY and the pearson correlation for the metabonomic analysis of biofluid mixtures: application to urine. Anal. Chem. 2005, 77, 7717-23. (34) MacKinnon, N.; While, P. T.; Korvink, J. G., Novel selective TOCSY method enables NMR spectral elucidation of metabolomic mixtures. J. Magn. Reson. 2016, 272, 147-157. (35) Singh, A.; Dubey, A.; Adiga, S. K.; Atreya, H. S., Phase modulated 2D HSQC-TOCSY for unambiguous assignment of overlapping spin systems. J. Magn. Reson. 2018, 286, 10-16. (36) Bernini, P.; Bertini, I.; Luchinat, C.; Nepi, S.; Saccenti, E.; Schafer, H.; Schutz, B.; Spraul, M.; Tenori, L., Individual human phenotypes in metabolic space and time. J. Proteome Res. 2009, 8, 4264- 71. (37) Wishart, D. S.; Knox, C.; Guo, A. C.; Eisner, R.; Young, N.; Gautam, B.; Hau, D. D.; Psychogios, N.; Dong, E.; Bouatra, S.; Mandal, R.; Sinelnikov, I.; Xia, J.; Jia, L.; Cruz, J. A.; Lim, E.; Sobsey, C. A.; Shrivastava, S.; Huang, P.; Liu, P.; Fang, L.; Peng, J.; Fradette, R.; Cheng, D.; Tzur, D.; Clements, M.; Lewis, A.; De Souza, A.; Zuniga, A.; Dawe, M.; Xiong, Y.; Clive, D.; Greiner, R.; Nazyrova, A.; Shaykhutdinov, R.; Li, L.; Vogel, H. J.; Forsythe, I., HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 2009, 37, D603-10. (38) Wishart, D. S.; Feunang, Y. D.; Marcu, A.; Guo, A. 
C.; Liang, K.; Vazquez-Fresno, R.; Sajed, T.; Johnson, D.; Li, C.; Karu, N.; Sayeeda, Z.; Lo, E.; Assempour, N.; Berjanskii, M.; Singhal, S.; Arndt, D.; Liang, Y.; Badran, H.; Grant, J.; Serra-Cayuela, A.; Liu, Y.; Mandal, R.; Neveu, V.; Pon, A.; Knox, C.; Wilson, M.; Manach, C.; Scalbert, A., HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 2018, 46, D608-D617. (39) Wishart, D. S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A. C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S.; Fung, C.; Nikolai, L.; Lewis, M.; Coutouly, M. A.; Forsythe, I.; Tang, P.; Shrivastava, S.; Jeroncic, K.; Stothard, P.; Amegbey, G.; Block, D.; Hau, D. D.; Wagner, J.; Miniaci, J.; Clements, M.; Gebremedhin, M.; Guo, N.; Zhang, Y.; Duggan, G. E.; Macinnis, G. D.; Weljie, A. M.; Dowlatabadi, R.; Bamforth, F.; Clive, D.; Greiner, R.; Li, L.; Marrie, T.; Sykes, B. D.; Vogel, H. J.; Querengesser, L., HMDB: the Human Metabolome Database. Nucleic Acids Res. 2007, 35, D521-6. (40) Ulrich, E. L.; Akutsu, H.; Doreleijers, J. F.; Harano, Y.; Ioannidis, Y. E.; Lin, J.; Livny, M.; Mading, S.; Maziuk, D.; Miller, Z.; Nakatani, E.; Schulte, C. F.; Tolmie, D. E.; Kent Wenger, R.; Yao, H.; Markley, J. L., BioMagResBank. Nucleic Acids Res. 2008, 36, D402-8. (41) Kuhn, S.; Schlorer, N. E., Facilitating quality control for spectra assignments of small organic molecules: nmrshiftdb2--a free in-house NMR database with integrated LIMS for academic service laboratories. Magn. Reson. Chem. 2015, 53, 582-9.

139

(42) Bingol, K.; Li, D. W.; Bruschweiler-Li, L.; Cabrera, O. A.; Megraw, T.; Zhang, F.; Bruschweiler, R., Unified and isomer-specific NMR metabolomics database for the accurate analysis of (13)C- (1)H HSQC spectra. ACS Chem. Biol. 2015, 10, 452-9. (43) Bingol, K.; Li, D. W.; Zhang, B.; Bruschweiler, R., Comprehensive Metabolite Identification Strategy Using Multiple Two-Dimensional NMR Spectra of a Complex Mixture Implemented in the COLMARm Web Server. Anal. Chem. 2016, 88, 12411-12418. (44) Bingol, K.; Bruschweiler-Li, L.; Li, D. W.; Bruschweiler, R., Customized metabolomics database for the analysis of NMR (1)H-(1)H TOCSY and (1)(3)C-(1)H HSQC-TOCSY spectra of complex mixtures. Anal. Chem. 2014, 86, 5494-501. (45) Bingol, K.; Zhang, F.; Bruschweiler-Li, L.; Bruschweiler, R., TOCCATA: a customized carbon total correlation spectroscopy NMR metabolomics database. Anal. Chem. 2012, 84, 9395-401. (46) Courant, F.; Royer, A. L.; Chereau, S.; Morvan, M. L.; Monteau, F.; Antignac, J. P.; Le Bizec, B., Implementation of a semi-automated strategy for the annotation of metabolomic fingerprints generated by liquid chromatography-high resolution mass spectrometry from biological samples. 2012, 137, 4958-67. (47) Dunn, W. B.; Broadhurst, D.; Begley, P.; Zelena, E.; Francis-McIntyre, S.; Anderson, N.; Brown, M.; Knowles, J. D.; Halsall, A.; Haselden, J. N.; Nicholls, A. W.; Wilson, I. D.; Kell, D. B.; Goodacre, R.; Human Serum Metabolome, C., Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat. Protoc. 2011, 6, 1060-83. (48) Brown, M.; Dunn, W. B.; Dobson, P.; Patel, Y.; Winder, C. L.; Francis-McIntyre, S.; Begley, P.; Carroll, K.; Broadhurst, D.; Tseng, A.; Swainston, N.; Spasic, I.; Goodacre, R.; Kell, D. B., Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. Analyst 2009, 134, 1322-32. (49) Dunn, W. B.; Broadhurst, D. 
I.; Atherton, H. J.; Goodacre, R.; Griffin, J. L., Systems level studies of mammalian metabolomes: the roles of mass spectrometry and nuclear magnetic resonance spectroscopy. Chem. Soc. Rev. 2011, 40, 387-426. (50) Cui, Q.; Lewis, I. A.; Hegeman, A. D.; Anderson, M. E.; Li, J.; Schulte, C. F.; Westler, W. M.; Eghbalnia, H. R.; Sussman, M. R.; Markley, J. L., Metabolite identification via the Madison Metabolomics Consortium Database. Nat. Biotechnol. 2008, 26, 162-4. (51) Fonville, J. M.; Maher, A. D.; Coen, M.; Holmes, E.; Lindon, J. C.; Nicholson, J. K., Evaluation of full-resolution J-resolved 1H NMR projections of biofluids for metabonomics information retrieval and biomarker identification. Anal. Chem. 2010, 82, 1811-21. (52) Go, E. P., Database resources in metabolomics: an overview. J. Neuroimmune Pharmacol. 2010, 5, 18-30. (53) Tulpan, D.; Leger, S.; Belliveau, L.; Culf, A.; Cuperlovic-Culf, M., MetaboHunter: an automatic approach for identification of metabolites from 1H-NMR spectra of complex mixtures. BMC Bioinformatics 2011, 12, 400.

140

(54) Wishart, D. S., Advances in metabolite identification. Bioanalysis 2011, 3, 1769-82. (55) Watson, D. G., A rough guide to metabolite identification using high resolution liquid chromatography mass spectrometry in metabolomic profiling in metazoans. Comput Struct Biotechnol J 2013, 4, e201301005. (56) Emwas, A. H., The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics research. Methods Mol. Biol. 2015, 1277, 161-93. (57) Smith, C. A.; O'Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.; Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G., METLIN: a metabolite mass spectral database. Ther. Drug Monit. 2005, 27, 747-51. (58) Sumner, L. W.; Amberg, A.; Barrett, D.; Beale, M. H.; Beger, R.; Daykin, C. A.; Fan, T. W.; Fiehn, O.; Goodacre, R.; Griffin, J. L.; Hankemeier, T.; Hardy, N.; Harnly, J.; Higashi, R.; Kopka, J.; Lane, A. N.; Lindon, J. C.; Marriott, P.; Nicholls, A. W.; Reily, M. D.; Thaden, J. J.; Viant, M. R., Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics 2007, 3, 211-221. (59) Want, E. J.; Wilson, I. D.; Gika, H.; Theodoridis, G.; Plumb, R. S.; Shockcor, J.; Holmes, E.; Nicholson, J. K., Global metabolic profiling procedures for urine using UPLC-MS. Nat. Protoc. 2010, 5, 1005-18. (60) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T., MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 2010, 45, 703-14. (61) Crawford, R. W.; Brand, H. R.; Wong, C. M.; Gregg, H. R.; Hoffman, P. 
A.; Enke, C. G., Instrument database system and application to mass spectrometry/mass spectrometry. Anal. Chem. 1984, 56, 1121-7. (62) McLafferty, F. W., Tandem mass spectrometry. Science 1981, 214, 280-7. (63) Stein, S., Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 2012, 84, 7274-82. (64) Johnson, S. R.; Lange, B. M., Open-access metabolomics databases for natural product research: present capabilities and future potential. Front Bioeng Biotechnol 2015, 3, 22. (65) Kapp, E.; Schutz, F., Overview of tandem mass spectrometry (MS/MS) database search algorithms. Curr Protoc Protein Sci 2007, Chapter 25, Unit25 2. (66) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; Wang, J.; Yu, B.; Zhang, J.; Bryant, S. H., PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202-13.

141

(67) Huan, T.; Tang, C.; Li, R.; Shi, Y.; Lin, G.; Li, L., MyCompoundID MS/MS Search: Metabolite Identification Using a Library of Predicted Fragment-Ion-Spectra of 383,830 Possible Human Metabolites. Anal. Chem. 2015, 87, 10619-26. (68) Wolf, S.; Schmidt, S.; Muller-Hannemann, M.; Neumann, S., In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 2010, 11, 148. (69) Heinonen, M.; Rantanen, A.; Mielikainen, T.; Kokkonen, J.; Kiuru, J.; Ketola, R. A.; Rousu, J., FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Commun. Mass Spectrom. 2008, 22, 3043-52. (70) Scheubert, K.; Hufsky, F.; Bocker, S., Computational mass spectrometry for small molecules. J. Cheminform. 2013, 5, 12. (71) Tsugawa, H.; Kind, T.; Nakabayashi, R.; Yukihira, D.; Tanaka, W.; Cajka, T.; Saito, K.; Fiehn, O.; Arita, M., Hydrogen Rearrangement Rules: Computational MS/MS Fragmentation and Structure Elucidation Using MS-FINDER Software. Anal. Chem. 2016, 88, 7946-58. (72) Ruttkies, C.; Schymanski, E. L.; Wolf, S.; Hollender, J.; Neumann, S., MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 2016, 8, 3. (73) Tsugawa, H.; Cajka, T.; Kind, T.; Ma, Y.; Higgins, B.; Ikeda, K.; Kanazawa, M.; VanderGheynst, J.; Fiehn, O.; Arita, M., MS-DIAL: data-independent MS/MS deconvolution for comprehensive metabolome analysis. Nat Methods 2015, 12, 523-6. (74) Rost, H. L.; Sachsenberg, T.; Aiche, S.; Bielow, C.; Weisser, H.; Aicheler, F.; Andreotti, S.; Ehrlich, H. C.; Gutenbrunner, P.; Kenar, E.; Liang, X.; Nahnsen, S.; Nilse, L.; Pfeuffer, J.; Rosenberger, G.; Rurik, M.; Schmitt, U.; Veit, J.; Walzer, M.; Wojnar, D.; Wolski, W. E.; Schilling, O.; Choudhary, J. S.; Malmstrom, L.; Aebersold, R.; Reinert, K.; Kohlbacher, O., OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat Methods 2016, 13, 741-8. (75) Mahieu, N. 
G.; Genenbacher, J. L.; Patti, G. J., A roadmap for the XCMS family of software solutions in metabolomics. Curr. Opin. Chem. Biol. 2016, 30, 87-93. (76) Lai, Z.; Tsugawa, H.; Wohlgemuth, G.; Mehta, S.; Mueller, M.; Zheng, Y.; Ogiwara, A.; Meissen, J.; Showalter, M.; Takeuchi, K.; Kind, T.; Beal, P.; Arita, M.; Fiehn, O., Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics. Nat Methods 2018, 15, 53-56. (77) Evanochko, W. T.; Sakai, T. T.; Ng, T. C.; Krishna, N. R.; Kim, H. D.; Zeidler, R. B.; Ghanta, V. K.; Brockman, R. W.; Schiffer, L. M.; Braunschweiger, P. G.; et al., NMR study of in vivo RIF-1 tumors. Analysis of perchloric acid extracts and identification of 1H, 31P and 13C resonances. Biochim. Biophys. Acta 1984, 805, 104-16. (78) Bales, J. R.; Higham, D. P.; Howe, I.; Nicholson, J. K.; Sadler, P. J., Use of high-resolution proton nuclear magnetic resonance spectroscopy for rapid multi-component analysis of urine. Clin. Chem. 1984, 30, 426-32. (79) Nicholson, J. K.; Buckingham, M. J.; Sadler, P. J., High resolution 1H n.m.r. studies of vertebrate blood and plasma. Biochem. J. 1983, 211, 605-15.

142

(80) Dewar, B. J.; Keshari, K.; Jeffries, R.; Dzeja, P.; Graves, L. M.; Macdonald, J. M., Metabolic assessment of a novel chronic myelogenous leukemic cell line and an imatinib resistant subline by H NMR spectroscopy. Metabolomics 2010, 6, 439-450. (81) Robinette, S. L.; Lindon, J. C.; Nicholson, J. K., Statistical spectroscopic tools for biomarker discovery and systems medicine. Anal. Chem. 2013, 85, 5297-303. (82) Chen, C.; Deng, L.; Wei, S.; Nagana Gowda, G. A.; Gu, H.; Chiorean, E. G.; Abu Zaid, M.; Harrison, M. L.; Pekny, J. F.; Loehrer, P. J.; Zhang, D.; Zhang, M.; Raftery, D., Exploring Metabolic Profile Differences between Colorectal Polyp Patients and Controls Using Seemingly Unrelated Regression. J. Proteome Res. 2015, 14, 2492-9. (83) Kaddurah-Daouk, R.; Weinshilboum, R.; Pharmacometabolomics Research, N., Metabolomic Signatures for Drug Response Phenotypes: Pharmacometabolomics Enables Precision Medicine. Clin. Pharmacol. Ther. 2015, 98, 71-5. (84) Fan, T. W.; Lorkiewicz, P. K.; Sellers, K.; Moseley, H. N.; Higashi, R. M.; Lane, A. N., Stable isotope-resolved metabolomics and applications for drug development. Pharmacol. Ther. 2012, 133, 366-91. (85) Bingol, K.; Li, D. W.; Bruschweiler-Li, L.; Cabrera, O. A.; Megraw, T.; Zhang, F. L.; Bruschweiler, R., Unified and Isomer-Specific NMR Metabolomics Database for the Accurate Analysis of C-13-H-1 HSQC Spectra. ACS Chem. Biol. 2015, 10, 452-459. (86) Bingol, K.; Li, D.-W.; Zhang, B.; Brüschweiler, R., Comprehensive Metabolite Identification Strategy Using Multiple Two-Dimensional NMR Spectra of a Complex Mixture Implemented in the COLMARm Web Server. Anal. Chem. 2016, 88, 12411-12418. (87) Pan, Z.; Raftery, D., Comparing and combining NMR spectroscopy and mass spectrometry in metabolomics. Anal. Bioanal. Chem. 2007, 387, 525-7. (88) Crockford, D. J.; Holmes, E.; Lindon, J. C.; Plumb, R. S.; Zirah, S.; Bruce, S. J.; Rainville, P.; Stumpf, C. L.; Nicholson, J. 
K., Statistical heterospectroscopy, an approach to the integrated analysis of NMR and UPLC-MS data sets: application in metabonomic toxicology studies. Anal. Chem. 2006, 78, 363-71. (89) Pan, Z.; Gu, H.; Talaty, N.; Chen, H.; Shanaiah, N.; Hainline, B. E.; Cooks, R. G.; Raftery, D., Principal component analysis of urine metabolites detected by NMR and DESI-MS in patients with inborn errors of metabolism. Anal. Bioanal. Chem. 2007, 387, 539-49. (90) Crockford, D. J.; Maher, A. D.; Ahmadi, K. R.; Barrett, A.; Plumb, R. S.; Wilson, I. D.; Nicholson, J. K., 1H NMR and UPLC-MS(E) statistical heterospectroscopy: characterization of drug metabolites (xenometabolome) in epidemiological studies. Anal. Chem. 2008, 80, 6835-44. (91) Bingol, K.; Bruschweiler, R., NMR/MS Translator for the Enhanced Simultaneous Analysis of Metabolomics Mixtures by NMR Spectroscopy and Mass Spectrometry: Application to Human Urine. J. Proteome Res. 2015, 14, 2642-2648.

143

(92) van der Hooft, J. J.; Mihaleva, V.; de Vos, R. C.; Bino, R. J.; Vervoort, J., A strategy for fast structural elucidation of metabolites in small volume plant extracts using automated MS-guided LC-MS-SPE-NMR. Magn. Reson. Chem. 2011, 49 Suppl 1, S55-60. (93) Sorensen, D.; Raditsis, A.; Trimble, L. A.; Blackwell, B. A.; Sumarah, M. W.; Miller, J. D., Isolation and structure elucidation by LC-MS-SPE/NMR: PR toxin- and cuspidatol-related eremophilane sesquiterpenes from Penicillium roqueforti. J. Nat. Prod. 2007, 70, 121-3. (94) Castro, A.; Moco, S.; Coll, J.; Vervoort, J., LC-MS-SPE-NMR for the isolation and characterization of neo-clerodane diterpenoids from Teucrium luteum subsp. flavovirens (perpendicular). J. Nat. Prod. 2010, 73, 962-5. (95) Schlotterbeck, G.; Ceccarelli, S. M., LC-SPE-NMR-MS: a total analysis system for bioanalysis. Bioanalysis 2009, 1, 549-59. (96) Bhatia, A.; Sarma, S. J.; Lei, Z.; Sumner, L. W., UHPLC-QTOF-MS/MS-SPE-NMR: A Solution to the Metabolomics Grand Challenge of Higher-Throughput, Confident Metabolite Identifications. Methods Mol. Biol. 2019, 2037, 113-133. (97) Gathungu, R. M.; Kautz, R.; Kristal, B. S.; Bird, S. S.; Vouros, P., The integration of LC-MS and NMR for the analysis of low molecular weight trace analytes in complex matrices. Mass Spectrom. Rev. 2018. (98) Liu, F.; Wang, Y. N.; Li, Y.; Ma, S. G.; Qu, J.; Liu, Y. B.; Niu, C. S.; Tang, Z. H.; Zhang, T. T.; Li, Y. H.; Li, L.; Yu, S. S., Rhodoterpenoids AC, Three New Rearranged Triterpenoids from Rhododendron latoucheae by HPLCMSSPENMR. Sci. Rep. 2017, 7, 7944. (99) Bingol, K.; Bruschweiler-Li, L.; Yu, C.; Somogyi, A.; Zhang, F.; Brüschweiler, R., Metabolomics beyond spectroscopic databases: a combined MS/NMR strategy for the rapid identification of new metabolites in complex mixtures. Anal. Chem. 2015, 87, 3864-70. (100) Boiteau, R. M.; Hoyt, D. W.; Nicora, C. D.; Kinmonth-Schultz, H. A.; Ward, J. 
K.; Bingol, K., Structure Elucidation of Unknown Metabolites in Metabolomics by Combined NMR and MS/MS Prediction. Metabolites 2018, 8. (101) Markley, J. L.; Bruschweiler, R.; Edison, A. S.; Eghbalnia, H. R.; Powers, R.; Raftery, D.; Wishart, D. S., The future of NMR-based metabolomics. Curr. Opin. Biotechnol. 2016, 43, 34-40. (102) Huan, T.; Tang, C. Q.; Li, R. H.; Shi, Y.; Lin, G. H.; Li, L., MyCompoundID MS/MS Search: Metabolite Identification Using a Library of Predicted Fragment-Ion-Spectra of 383,830 Possible Human Metabolites. Anal. Chem. 2015, 87, 10619-10626. (103) Yilmaz, A.; Rudolph, H. L.; Hurst, J. J.; Wood, T. D., High-Throughput Metabolic Profiling of Soybean Leaves by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Anal. Chem. 2016, 88, 1188-1194. (104) Rathahao-Paris, E.; Alves, S.; Junot, C.; Tabet, J. C., High resolution mass spectrometry for structural identification of metabolites in metabolomics. Metabolomics 2016, 12. (105) Cloarec, O.; Dumas, M. E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J., Statistical total correlation spectroscopy: An 144 exploratory approach for latent biomarker identification from metabolic H-1 NMR data sets. Anal. Chem. 2005, 77, 1282-1289. (106) Hao, J.; Liebeke, M.; Sommer, U.; Viant, M. R.; Bundy, J. G.; Ebbels, T. M. D., Statistical Correlations between NMR Spectroscopy and Direct Infusion FT-ICR Mass Spectrometry Aid Annotation of Unknowns in Metabolomics. Anal. Chem. 2016, 88, 2583-2589. (107) Wei, S. W.; Zhang, J.; Liu, L. Y.; Ye, T.; Nagana Gowda, G. A.; Tayyari, F.; Raftery, D., Ratio Analysis Nuclear Magnetic Resonance Spectroscopy for Selective Metabolite Identification in Complex Samples. Anal. Chem. 2011, 83, 7616-7623. (108) Gu, H. W.; Nagana Gowda, G. A.; Neto, F. C.; Opp, M. R.; Raftery, D., RAMSY: Ratio Analysis of Mass Spectrometry to Improve Compound Identification. Anal. Chem. 2013, 85, 10771-10779. 
(109) Crockford, D. J.; Holmes, E.; Lindon, J. C.; Plumb, R. S.; Zirah, S.; Bruce, S. J.; Rainville, P.; Stumpf, C. L.; Nicholson, J. K., Statistical heterospectroscopy, an approach to the integrated analysis of NMR and UPLC-MS data sets: Application in metabonomic toxicology studies. Anal. Chem. 2006, 78, 363-371. (110) Bingol, K.; Zhang, F.; Bruschweiler-Li, L.; Brüschweiler, R., Carbon backbone topology of the metabolome of a cell. J. Am. Chem. Soc. 2012, 134, 9006-11. (111) Clendinen, C. S.; Pasquel, C.; Ajredini, R.; Edison, A. S., C-13 NMR Metabolomics: INADEQUATE Network Analysis. Anal. Chem. 2015, 87, 5698-5706. (112) Komatsu, T.; Ohishi, R.; Shino, A.; Kikuchi, J., Structure and Metabolic-Flow Analysis of Molecular Complexity in a (13) C-Labeled Tree by 2D and 3D NMR. Angew. Chem. Int. Ed. Engl. 2016, 55, 6000-3. (113) Walker, L. R.; Hoyt, D. W.; Walker, S. M., 2nd; Ward, J. K.; Nicora, C. D.; Bingol, K., Unambiguous metabolite identification in high-throughput metabolomics by hybrid 1D (1) H NMR/ESI MS(1) approach. Magn. Reson. Chem. 2016, 54, 998-1003. (114) Marshall, A. G.; Hendrickson, C. L.; Jackson, G. S., Fourier transform ion cyclotron resonance mass spectrometry: A primer. Mass Spectrom. Rev. 1998, 17, 1-35. (115) Bingol, K.; Zhang, F. L.; Bruschweiler-Li, L.; Bruschweiler, R., TOCCATA: A Customized Carbon Total Correlation Spectroscopy NMR Metabolomics Database. Anal. Chem. 2012, 84, 9395- 9401. (116) Kaiser, N. K.; Quinn, J. P.; Blakney, G. T.; Hendrickson, C. L.; Marshall, A. G., A novel 9.4 tesla FTICR mass spectrometer with improved sensitivity, mass resolution, and mass range. J. Am. Soc. Mass Spectrom. 2011, 22, 1343-51. (117) Kim, S.; Rodgers, R. P.; Marshall, A. G., Truly "exact" mass: Elemental composition can be determined uniquely from molecular mass measurement at similar to 0.1 mDa accuracy for molecules up to similar to 500 Da. Int. J. Mass spectrom. 2006, 251, 260-265. (118) Marshall, A. G. K., L. C.; Enke, C. G.; Hendrickson, C. 
L., Dynamic Range Extension in FT- ICR Mass Spectrometry by Spectral Segmentation. In 63rd Amer. Soc. for Mass Spectrometry Conference on Mass Spectrometry and Allied Topics, 2015. 145

(119) Hughey, C. A.; Hendrickson, C. L.; Rodgers, R. P.; Marshall, A. G.; Qian, K., Kendrick mass defect spectrum: a compact visual analysis for ultrahigh-resolution broadband mass spectra. Anal. Chem. 2001, 73, 4676-81. (120) Wu, Z.; Rodgers, R. P.; Marshall, A. G., Two- and three-dimensional van krevelen diagrams: a graphical analysis complementary to the kendrick mass plot for sorting elemental compositions of complex organic mixtures based on ultrahigh-resolution broadband fourier transform ion cyclotron resonance mass measurements. Anal. Chem. 2004, 76, 2511-6. (121) Marshall, A. G.; Rodgers, R. P., Petroleomics: chemistry of the underworld. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 18090-5. (122) Misiak, M.; Kozminski, W., Determination of heteronuclear coupling constants from 3D HSQC-TOCSY experiment with optimized random sampling of evolution time space. Magn. Reson. Chem. 2009, 47, 205-209. (123) Reardon, P. N.; Marean-Reardon, C. L.; Bukovec, M. A.; Coggins, B. E.; Isern, N. G., 3d Tocsy- Hsqc Nmr for Metabolic Flux Analysis Using Non-Uniform Sampling. Anal. Chem. 2016, 88, 2825- 2831. (124) Kazimierczuk, K.; Orekhov, V., Non-uniform sampling: post-Fourier era of NMR data collection and processing. Magn. Reson. Chem. 2015, 53, 921-926. (125) Li, D. W.; Wang, C.; Brüschweiler, R., Maximal clique method for the automated analysis of NMR TOCSY spectra of complex mixtures. J. Biomol. NMR 2017, 68, 195-202. (126) Pence, H. E.; Williams, A., ChemSpider: An Online Chemical Information Resource. J. Chem. Educ. 2010, 87, 1123-1124. (127) Smith, C. A.; O'Maille, G.; Want, E. J.; Qin, C.; Trauger, S. A.; Brandon, T. R.; Custodio, D. E.; Abagyan, R.; Siuzdak, G., Metlin - a Metabolite Mass Spectral Database. Ther. Drug Monit. 2005, 27, 747-751. (128) Abraham, R. J.; Reid, M., Proton chemical shifts in NMR. Part 16. Proton chemical shifts in acetylenes and the anisotropic and steric effects of the acetylene group. Journal of the - 2 2001, 1195-1204. 
(129) Munkres, J., Algorithms for the Assignment and Transportation Problems. Journal of the Society for Industrial and Applied Mathematics 1957, 5, 32-38. (130) Stahl, M.; Mauser, H., Database clustering with a combination of fingerprint and maximum common substructure methods. J. Chem. Inf. Model. 2005, 45, 542-548. (131) Raymond, J. W.; Willett, P., Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J. Comput. Aided Mol. Des. 2002, 16, 521-533. (132) Delaglio, F.; Grzesiek, S.; Vuister, G. W.; Zhu, G.; Pfeifer, J.; Bax, A., Nmrpipe - a Multidimensional Spectral Processing System Based on Unix Pipes. J. Biomol. NMR 1995, 6, 277- 293.

146

(133) Kneller, D. G.; Kuntz, I. D., UCSF Sparky - an NMR Display, Annotation and Assignment Tool. J. Cell. Biochem. 1993, 254-254.

(134) Guan, S. H.; Marshall, A. G., Stored waveform inverse Fourier transform (SWIFT) ion excitation in trapped-ion mass spectometry: Theory and applications. Int. J. Mass spectrom. 1996, 157, 5-37.

(135) Ledford, E. B.; Rempel, D. L.; Gross, M. L., Space-Charge Effects in Fourier-Transform Mass-Spectrometry - Mass Calibration. Anal. Chem. 1984, 56, 2744-2748.

(136) Shi, S. D. H.; Drader, J. J.; Freitas, M. A.; Hendrickson, C. L.; Marshall, A. G., Comparison and interconversion of the two most common frequency-to-mass calibration functions for Fourier transform ion cyclotron resonance mass spectrometry. Int. J. Mass spectrom. 2000, 195, 591-598.

(137) Hoffmann, F.; Li, D. W.; Sebastiani, D.; Brüschweiler, R., Improved Quantum Chemical NMR Chemical Shift Prediction of Metabolites in Aqueous Solution toward the Validation of Unknowns. J. Phys. Chem. A 2017, 121, 3071-3078.

(138) Barry, G. T.; Roark, E., L-Fucosamine + 4-Oxo-Norleucine as Constituents in Mucopolysaccharides of Certain Enteric Bacteria. Nature 1964, 202, 493-&.

(139) Markley, J. L.; Brüschweiler, R.; Edison, A. S.; Eghbalnia, H. R.; Powers, R.; Raftery, D.; Wishart, D. S., The future of NMR-based metabolomics. Curr. Opin. Biotechnol. 2017, 43, 34-40.

(140) Pretsch, E.; Bühlmann, P.; Badertscher, M., Structure determination of organic compounds: tables of spectral data. Springer: Berlin Heidelberg, 2009.

(141) Wolfender, J. L.; Nuzillard, J. M.; van der Hooft, J. J. J.; Renault, J. H.; Bertrand, S., Accelerating Metabolite Identification in Natural Product Research: Toward an Ideal Combination of Liquid Chromatography-High-Resolution Tandem Mass Spectrometry and NMR Profiling, in Silico Databases, and Chemometrics. Anal. Chem. 2019, 91, 704-742.

(142) Djoumbou-Feunang, Y.; Fiamoncini, J.; Gil-de-la-Fuente, A.; Greiner, R.; Manach, C.; Wishart, D. S., BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J. Cheminform. 2019, 11, 2.

(143) Paudel, L.; Nagana Gowda, G. A.; Raftery, D., Extractive Ratio Analysis NMR Spectroscopy for Metabolite Identification in Complex Biological Mixtures. Anal. Chem. 2019, 91, 7373-7378.

(144) Shen, X.; Wang, R.; Xiong, X.; Yin, Y.; Cai, Y.; Ma, Z.; Liu, N.; Zhu, Z. J., Metabolic reaction network-based recursive metabolite annotation for untargeted metabolomics. Nat Commun 2019, 10, 1516.

(145) Chekmeneva, E.; Dos Santos Correia, G.; Gomez-Romero, M.; Stamler, J.; Chan, Q.; Elliott, P.; Nicholson, J. K.; Holmes, E., Ultra-Performance Liquid Chromatography-High-Resolution Mass Spectrometry and Direct Infusion-High-Resolution Mass Spectrometry for Combined Exploratory and Targeted Metabolic Profiling of Human Urine. J. Proteome Res. 2018, 17, 3492-3502.

(146) Sanchon-Lopez, B.; Everett, J. R., New Methodology for Known Metabolite Identification in Metabonomics/Metabolomics: Topological Metabolite Identification Carbon Efficiency (tMICE). J. Proteome Res. 2016, 15, 3405-19.

(147) Wang, C.; He, L.; Li, D. W.; Bruschweiler-Li, L.; Marshall, A. G.; Brüschweiler, R., Accurate Identification of Unknown and Known Metabolic Mixture Components by Combining 3D NMR with Fourier Transform Ion Cyclotron Resonance Tandem Mass Spectrometry. J. Proteome Res. 2017, 16, 3774-3786.

(148) MestreLab Mnova NMRPredict. http://mestrelab.com/software/mnova-nmrpredict-desktop/

(149) Modgraph NMRPredict. http://www.modgraph.co.uk/product_nmr.htm

(150) Hoffmann, F.; Li, D. W.; Sebastiani, D.; Brüschweiler, R., Improved Quantum Chemical NMR Chemical Shift Prediction of Metabolites in Aqueous Solution toward the Validation of Unknowns. J. Phys. Chem. A 2017, 121, 3071-3078.

(151) Bremser, W., HOSE - a novel substructure code. Anal. Chim. Acta 1978, 103, 355-365.

(152) Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcantara, R.; Darsow, M.; Guedj, M.; Ashburner, M., ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2008, 36, D344-50.

(153) Kanehisa, M.; Goto, S., KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27-30.

(154) Thompson, S. J.; Hattotuwagama, C. K.; Holliday, J. D.; Flower, D. R., On the hydrophobicity of peptides: Comparing empirical predictions of peptide log P values. Bioinformation 2006, 1, 237-41.

(155) Tetko, I. V.; Tanchuk, V. Y., Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. J. Chem. Inf. Comput. Sci. 2002, 42, 1136-45.

(156) Parella, T.; Espinosa, J. F., Long-range proton-carbon coupling constants: NMR methods and applications. Prog. Nucl. Magn. Reson. Spectrosc. 2013, 73, 17-55.

(157) Bakiri, A.; Hubert, J.; Reynaud, R.; Lambert, C.; Martinez, A.; Renault, J. H.; Nuzillard, J. M., Reconstruction of HMBC Correlation Networks: A Novel NMR-Based Contribution to Metabolite Mixture Analysis. J. Chem. Inf. Model. 2018, 58, 262-270.

(158) Dumas, M. E.; Canlet, C.; Andre, F.; Vercauteren, J.; Paris, A., Metabonomic assessment of physiological disruptions using 1H-13C HMBC-NMR spectroscopy combined with pattern recognition procedures performed on filtered variables. Anal. Chem. 2002, 74, 2261-73.

(159) Williamson, R. T.; Márquez, B. L.; Gerwick, W. H.; Kövér, K. E., One- and two-dimensional gradient-selected HSQMBC NMR experiments for the efficient analysis of long-range heteronuclear coupling constants. Magn. Reson. Chem. 2000, 38, 265-273.

(160) Castanar, L.; Sauri, J.; Williamson, R. T.; Virgili, A.; Parella, T., Pure in-phase heteronuclear correlation NMR experiments. Angew. Chem. Int. Ed. Engl. 2014, 53, 8379-82.

(161) Leo, A.; Hansch, C.; Elkins, D., Partition coefficients and their uses. Chem. Rev. 1971, 71, 525-616.

(162) Kuhn, H. W., The Hungarian method for the assignment problem. Naval research logistics quarterly 1955, 2, 83-97.

(163) Ulrich, E. L.; Akutsu, H.; Doreleijers, J. F.; Harano, Y.; Ioannidis, Y. E.; Lin, J.; Livny, M.; Mading, S.; Maziuk, D.; Miller, Z.; Nakatani, E.; Schulte, C. F.; Tolmie, D. E.; Wenger, R. K.; Yao, H. Y.; Markley, J. L., BioMagResBank. Nucleic Acids Res. 2008, 36, D402-D408.

(164) Cajka, T.; Fiehn, O., Toward Merging Untargeted and Targeted Methods in Mass Spectrometry-Based Metabolomics and Lipidomics. Anal. Chem. 2016, 88, 524-45.

(165) Li, J.; Vosegaard, T.; Guo, Z., Applications of nuclear magnetic resonance in lipid analyses: An emerging powerful tool for lipidomics studies. Prog. Lipid Res. 2017, 68, 37-56.

(166) Miyake, Y.; Yokomizo, K.; Matsuzaki, N., Determination of unsaturated fatty acid composition by high-resolution nuclear magnetic resonance spectroscopy. Journal of the American Oil Chemists' Society 1998, 75, 1091-1094.

(167) Vlahov, G.; Chepkwony, P. K.; Ndalut, P. K., 13C NMR Characterization of Triacylglycerols of Moringa oleifera Seed Oil: An “Oleic-Vaccenic Acid” Oil. J. Agric. Food Chem. 2002, 50, 970-975.

(168) Tugnoli, V.; Bottura, G.; Fini, G.; Reggiani, A.; Tinti, A.; Trinchero, A.; Tosi, M. R., 1H-NMR and 13C-NMR lipid profiles of human renal tissues. Biopolymers 2003, 72, 86-95.

(169) Barison, A.; da Silva, C. W.; Campos, F. R.; Simonelli, F.; Lenz, C. A.; Ferreira, A. G., A simple methodology for the determination of fatty acid composition in edible oils through 1H NMR spectroscopy. Magn. Reson. Chem. 2010, 48, 642-50.

(170) Barrilero, R.; Gil, M.; Amigo, N.; Dias, C. B.; Wood, L. G.; Garg, M. L.; Ribalta, J.; Heras, M.; Vinaixa, M.; Correig, X., LipSpin: A New Bioinformatics Tool for Quantitative (1)H NMR Lipid Profiling. Anal. Chem. 2018, 90, 2031-2040.

(171) Bingol, K.; Brüschweiler, R., Multidimensional approaches to NMR-based metabolomics. Anal. Chem. 2014, 86, 47-57.

(172) Marchand, J.; Martineau, E.; Guitton, Y.; Dervilly-Pinel, G.; Giraudeau, P., Multidimensional NMR approaches towards highly resolved, sensitive and high-throughput quantitative metabolomics. Curr. Opin. Biotechnol. 2017, 43, 49-55.

(173) Alexandri, E.; Ahmed, R.; Siddiqui, H.; Choudhary, M. I.; Tsiafoulis, C. G.; Gerothanassis, I. P., High Resolution NMR Spectroscopy as a Structural and Analytical Tool for Unsaturated Lipids in Solution. Molecules 2017, 22.

(174) Mahrous, E. A.; Lee, R. B.; Lee, R. E., A rapid approach to lipid profiling of mycobacteria using 2D HSQC NMR maps. J. Lipid Res. 2008, 49, 455-63.

(175) Hatzakis, E.; Agiomyrgianaki, A.; Kostidis, S.; Dais, P., High-Resolution NMR Spectroscopy: An Alternative Fast Tool for Qualitative and Quantitative Analysis of Diacylglycerol (DAG) Oil. Journal of the American Oil Chemists' Society 2011, 88, 1695-1708.

(176) Shah, S. P.; Jansen, S. A.; Taylor, L. J.; Chong, P. L.; Correa-Llanten, D. N.; Blamey, J. M., Lipid composition of thermophilic Geobacillus sp. strain GWE1, isolated from sterilization oven. Chem. Phys. Lipids 2014, 180, 61-71.

(177) Bittman, R., Glycerolipids: Chemistry. In Encyclopedia of Biophysics, Roberts, G. C. K., Ed. Springer Berlin Heidelberg: Berlin, Heidelberg, 2013; pp 907-914.

(178) Le Guennec, A.; Dumez, J. N.; Giraudeau, P.; Caldarelli, S., Resolution-enhanced 2D NMR of complex mixtures by non-uniform sampling. Magn. Reson. Chem. 2015, 53, 913-20.

(179) Li, D.; Hansen, A. L.; Bruschweiler-Li, L.; Brüschweiler, R., Non-Uniform and Absolute Minimal Sampling for High-Throughput Multidimensional NMR Applications. Chemistry (Easton) 2018, 24, 11535-11544.

(180) Castanar, L.; Parella, T., Broadband 1H homodecoupled NMR experiments: recent developments, methods and applications. Magn. Reson. Chem. 2015, 53, 399-426.

(181) Zangger, K., Pure shift NMR. Prog. Nucl. Magn. Reson. Spectrosc. 2015, 86-87, 1-20.

(182) Paudel, L.; Adams, R. W.; Kiraly, P.; Aguilar, J. A.; Foroozandeh, M.; Cliff, M. J.; Nilsson, M.; Sandor, P.; Waltho, J. P.; Morris, G. A., Simultaneously enhancing spectral resolution and sensitivity in heteronuclear correlation NMR spectroscopy. Angew. Chem. Int. Ed. Engl. 2013, 52, 11616-9.

(183) Perez-Trujillo, M.; Castanar, L.; Monteagudo, E.; Kuhn, L. T.; Nolis, P.; Virgili, A.; Williamson, R. T.; Parella, T., Simultaneous (1)H and (13)C NMR enantiodifferentiation from highly-resolved pure shift HSQC spectra. Chem. Commun. (Camb.) 2014, 50, 10214-7.

(184) Farjon, J.; Milande, C.; Martineau, E.; Akoka, S.; Giraudeau, P., The FAQUIRE Approach: FAst, QUantitative, hIghly Resolved and sEnsitivity Enhanced (1)H, (13)C Data. Anal. Chem. 2018, 90, 1845-1851.

(185) Kiraly, P.; Nilsson, M.; Morris, G. A., Practical aspects of real-time pure shift HSQC experiments. Magn. Reson. Chem. 2018, 56, 993-1005.

(186) Timári, I.; Wang, C.; Hansen, A. L.; Costa Dos Santos, G.; Yoon, S. O.; Bruschweiler-Li, L.; Brüschweiler, R., Real-Time Pure Shift HSQC NMR for Untargeted Metabolomics. Anal. Chem. 2019, 91, 2304-2311.

(187) Garrett, D. A.; Failla, M. L.; Sarama, R. J., Development of an in vitro digestion method to assess carotenoid bioavailability from meals. J. Agric. Food Chem. 1999, 47, 4301-9.

(188) Zhong, S.; Vendrell-Pacheco, M.; Heskitt, B.; Chitchumroonchokchai, C.; Failla, M.; Sastry, S. K.; Francis, D. M.; Martin-Belloso, O.; Elez-Martinez, P.; Kopec, R. E., Novel Processing Technologies as Compared to Thermal Treatment on the Bioaccessibility and Caco-2 Cell Uptake of Carotenoids from Tomato and Kale-Based Juices. J. Agric. Food Chem. 2019, 67, 10185-10194.

(189) Chitchumroonchokchai, C.; Schwartz, S. J.; Failla, M. L., Assessment of lutein bioavailability from meals and a supplement using simulated digestion and Caco-2 human intestinal cells. J. Nutr. 2004, 134, 2280-6.

(190) Kopec, R. E.; Caris-Veyrat, C.; Nowicki, M.; Bernard, J. P.; Morange, S.; Chitchumroonchokchai, C.; Gleize, B.; Borel, P., The Effect of an Iron Supplement on Lycopene Metabolism and Absorption During Digestion in Healthy Humans. Mol Nutr Food Res 2019, e1900644.

(191) Grison, S.; Fave, G.; Maillot, M.; Manens, L.; Delissen, O.; Blanchardon, E.; Dublineau, I.; Aigueperse, J.; Bohand, S.; Martin, J. C.; Souidi, M., Metabolomics reveals dose effects of low-dose chronic exposure to uranium in rats: identification of candidate biomarkers in urine samples. Metabolomics 2016, 12, 154.

(192) Hyberts, S. G.; Milbradt, A. G.; Wagner, A. B.; Arthanari, H.; Wagner, G., Application of iterative soft thresholding for fast reconstruction of NMR data non-uniformly sampled with multidimensional Poisson Gap scheduling. J. Biomol. NMR 2012, 52, 315-327.

(193) Hyberts, S. G.; Takeuchi, K.; Wagner, G., Poisson-gap sampling and forward maximum entropy reconstruction for enhancing the resolution and sensitivity of protein NMR data. J. Am. Chem. Soc. 2010, 132, 2145-7.

(194) Herrera, L. C.; Potvin, M. A.; Melanson, J. E., Quantitative analysis of positional isomers of triacylglycerols via electrospray ionization tandem mass spectrometry of sodiated adducts. Rapid Commun. Mass Spectrom. 2010, 24, 2745-52.

(195) Kyle, J. E.; Zhang, X.; Weitz, K. K.; Monroe, M. E.; Ibrahim, Y. M.; Moore, R. J.; Cha, J.; Sun, X.; Lovelace, E. S.; Wagoner, J.; Polyak, S. J.; Metz, T. O.; Dey, S. K.; Smith, R. D.; Burnum-Johnson, K. E.; Baker, E. S., Uncovering biologically significant lipid isomers with liquid chromatography, ion mobility spectrometry and mass spectrometry. Analyst 2016, 141, 1649-59.

(196) Wojcik, R.; Webb, I. K.; Deng, L.; Garimella, S. V.; Prost, S. A.; Ibrahim, Y. M.; Baker, E. S.; Smith, R. D., Lipid and Glycolipid Isomer Analyses Using Ultra-High Resolution Ion Mobility Spectrometry Separations. Int J Mol Sci 2017, 18.

(197) Thomas, M. C.; Mitchell, T. W.; Harman, D. G.; Deeley, J. M.; Murphy, R. C.; Blanksby, S. J., Elucidation of double bond position in unsaturated lipids by ozone electrospray ionization mass spectrometry. Anal. Chem. 2007, 79, 5013-22.

(198) Ma, X.; Chong, L.; Tian, R.; Shi, R.; Hu, T. Y.; Ouyang, Z.; Xia, Y., Identification and quantitation of lipid C=C location isomers: A shotgun lipidomics approach enabled by photochemical reaction. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, 2573-8.

(199) Wall, R.; Ross, R. P.; Fitzgerald, G. F.; Stanton, C., Fatty acids from fish: the anti-inflammatory potential of long-chain omega-3 fatty acids. Nutr. Rev. 2010, 68, 280-9.

(200) Serhan, C. N.; Chiang, N.; Van Dyke, T. E., Resolving inflammation: dual anti-inflammatory and pro-resolution lipid mediators. Nat. Rev. Immunol. 2008, 8, 349-61.

(201) Innis, S. M., Dietary omega 3 fatty acids and the developing brain. Brain Res. 2008, 1237, 35-43.
