Automated Annotation of Structurally Uncharacterized Metabolites in Human Metabolomics Studies by Systems Biology Models
Technische Universität München
Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt
Lehrstuhl für Experimentelle Bioinformatik

Jan-Dominik Bernd Quell
2019

Complete copy of the dissertation approved by the Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Prof. Dr. Dmitrij Frischmann
Examiners of the dissertation:
1. Prof. Dr. Hans-Werner Mewes
2. Prof. Dr. Bernhard Küster

The dissertation was submitted to the Technische Universität München on 28.11.2018 and accepted by the Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt on 28.01.2019.

"Man must persist in the belief that the incomprehensible is comprehensible; otherwise he would not inquire."
Johann Wolfgang von Goethe

Acknowledgments

First of all, I would like to express my special thanks to my doctoral supervisor (Hans-)Werner Mewes, who not only gave me the opportunity to work in research and teaching at his institute at the Helmholtz Zentrum München and at his chair at the Technische Universität München, but also decisively shaped my career, both professionally and personally, from the first day of my studies.

My special thanks also go to Gabi Kastenmüller. I would like to thank her for many scientific ideas, discussions, and inspirations that set important directions and fed into my projects.

I sincerely thank Werner Römisch-Margl for scientific discussions, especially on biochemical topics, and for his support with mass spectrometric measurements to verify candidate molecules in the laboratory.

Because this work uses data from numerous cohort studies (including SUMMIT, KORA, and HuMet), I would also like to thank my collaboration partners, such as Anna Köttgen (GCKD) and Karsten Suhre (QMDiab).

I would also like to thank Bernhard Küster for his scientific engagement with my work and for his willingness to act as second examiner.

Of course, thanks also go to my colleagues, friends, and family, who supported me in many different ways. The completion of this thesis marks the end of a chapter in my life; I now look forward to new professional challenges.

Finally, and not merely by tradition, my deepest thanks go to my wife Christina. I would like to thank her for her many forms of support in my work, for her help in realizing private ideas and goals, and for the life plans we have forged together, which are enriched only through her.

Abstract

Metabolomics studies are commonly performed in research on health and disease. They capture metabolite levels as intermediate phenotypes that bridge interacting biochemical levels from the genome to clinically relevant phenotypes and reflect environmental influences. While defined sets of metabolites are quantified by targeted mass spectrometry (MS) to test hypotheses, non-targeted MS also measures metabolites with initially unknown chemical structures and enables hypothesis-free testing of unexpected associations. Although metabolites without any chemical characterization occur only in non-targeted settings, targeted measurements can lead to ambiguous annotations too.
The outcome of metabolomics studies strongly depends on the insights that can be obtained from measured data: Functional or intermediate metabolite features influence measurements of highly sensitive mass spectrometers, which differ in their mass resolution (e.g. ± 0.3 or ± 0.01) and in their metabolite separation capacity. Although separation of molecules can be improved by coupled chromatography or multiple reaction monitoring (MRM), some molecules that share similar chemical or physical properties are still not separated analytically and are labeled ambiguously (measurement of metabolite sums, e.g. hexoses, steroids, or complex lipids). Unknown or ambiguously labeled metabolites mark dead ends in metabolomics research, because they cannot be interpreted in concrete biochemical contexts or used further in clinical applications. In consequence, unknown or ambiguously labeled metabolites need to be characterized precisely. Experimental approaches in wet laboratories are successful in identifying individual unknown metabolites, but they are comparatively time-consuming and laboratory-intensive. For metabolite annotation on a large scale, high-throughput in silico approaches that mostly analyze fragmentation spectra have been introduced in recent years. Since some commercial MS platforms do not provide spectra, and the development of MS devices leads to faster measurements, larger data sets, and increasing numbers of metabolites (including those with unknown structural identities), characterization of unknown metabolites will remain a long-term challenge.

Here, I propose two novel automatic approaches for (i) determination of the main constituents of measured metabolite sums and (ii) characterization of unknown metabolites, including automated selection of candidate molecules, relying mostly on non-spectral data.
In the first part, I provide an automatic workflow that determines the main constituents of ambiguously annotated metabolites based on a platform comparison in a small data (sub-)set and imputes concentrations at a higher precision level for a (large) number of samples. Phosphatidylcholines (PCs) consist of a polar head group and two non-polar fatty acid chains. Most established lipidomics platforms measure the total length of the side chains including their desaturation but cannot distinguish the distribution to the sn-1 or sn-2 positions (lipid species level). Recent platforms identify the length and desaturation of both side chains separately, but do not provide their ordering or the position of double bonds (higher precision; fatty acid level). To this end, I used 76 PC concentrations measured by AbsoluteIDQ™, which has frequently been used in large cohort studies for targeted quantification of amino acids and lipids on the lipid species level, and 109 PC concentrations measured by the more precise Lipidyzer™ platform (fatty acid level), in the same 223 samples of healthy young men. First, I determined 150 to 456 theoretically possible constituents (fatty acid level) for 76 measured PC sums (lipid species level) by systematically distributing their fatty acid chains and desaturation to the sn-1 and sn-2 positions. 69 of 109 PCs were identified as constituents of 38 PC sums, of which 10 compositions contained one clear main constituent with a contribution of at least 80%. A further 17 compositions consisted of one or two measured main constituents with contributions between 20 and 80%. In one case, a PC with one odd-chain fatty acid was identified as the main constituent, which was an unexpected finding, because in human matrices even-chain fatty acids are much more abundant than odd-chain fatty acids.
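The systematic distribution of chain length and desaturation to two side chains can be illustrated with a minimal sketch. It is not the workflow of the thesis: the function name `enumerate_constituents` and the bounds `min_len`, `max_len`, and `max_db` are hypothetical choices made for this illustration, and the two chains are treated as unordered because the fatty acid level does not resolve the sn-1/sn-2 ordering.

```python
from typing import List, Tuple

Chain = Tuple[int, int]  # (number of carbon atoms, number of double bonds)


def enumerate_constituents(
    total_carbons: int,
    total_double_bonds: int,
    min_len: int = 14,   # hypothetical lower bound on chain length
    max_len: int = 24,   # hypothetical upper bound on chain length
    max_db: int = 6,     # hypothetical maximum of double bonds per chain
) -> List[Tuple[Chain, Chain]]:
    """List unordered pairs of fatty acid chains whose carbon atoms and
    double bonds add up to the totals of a lipid-species-level PC sum."""
    pairs = set()
    for c1 in range(min_len, max_len + 1):
        c2 = total_carbons - c1
        if not (min_len <= c2 <= max_len):
            continue
        for d1 in range(0, min(max_db, total_double_bonds) + 1):
            d2 = total_double_bonds - d1
            if d2 > max_db:
                continue
            # Sort the two chains so that (a, b) and (b, a) count once.
            pair = tuple(sorted([(c1, d1), (c2, d2)]))
            pairs.add(pair)
    return sorted(pairs)


# Example: a PC sum with 36 carbon atoms and 2 double bonds in total,
# restricted here to chains of 16 to 20 carbon atoms.
candidates = enumerate_constituents(36, 2, min_len=16, max_len=20)
for (c1, d1), (c2, d2) in candidates:
    print(f"PC {c1}:{d1}_{c2}:{d2}")
```

Under these assumed bounds, the example yields candidates such as 18:1 with 18:1 or 16:0 with 20:2. Which of the candidates are actually measured, and how much each contributes to the sum, would then have to be determined from the platform comparison.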
Since 13 and 19 quantitative compositions were replicated directly and via imputation of metabolite concentrations, respectively, in samples of a second independent cohort, the determined factors can be used for imputation of PC concentrations (lipid species level → fatty acid level). Although PCs measured on the fatty acid level are not equal to chemical structures, they allow more precise interpretations in their biochemical context.

In the second part, I introduce a systems biology model that approaches the characterization of unknown metabolites, measured by non-targeted MS, from a holistic perspective. While current methods focus on transforming and linking fragment spectra of unknown structures to database compounds, my approach identifies the biochemical context of unknown metabolites before selecting specific candidate molecules. For this purpose, I use 637 concentrations (388 known, 249 unknown) measured in 2 279 samples to estimate pairwise partial correlations, which are known to reconstruct biochemical pathways. The automatic data integration module merges significant associations with a published Gaussian graphical model (GGM) to obtain 1 040 edges as the backbone of my network model. By mapping and adding metabolites, I integrated 299 metabolite-gene associations of a published mGWAS, as well as 1 152 reactions and 495 functional metabolite-gene relations of the public database Recon. Several metabolite characterization modules that access the network model and transfer information from known to unknown metabolites predicted pathways for 180 unknown metabolites and selected 21 reaction types for 100 pairs of known and unknown metabolites. 12 reactions were substantiated by a second criterion on the retention time. I replicated 18 of 26 pathway annotations that were independently predicted in two data sets, based on high-resolution metabolite levels measured in 1 209 samples of a second independent cohort.
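The pairwise partial correlations behind a GGM can be illustrated with a small sketch. It assumes the standard route via the inverse covariance (precision) matrix, which the abstract does not spell out, and it omits the significance testing needed to select edges. The helper names `invert` and `partial_correlations` are hypothetical, and the pure-Python matrix inversion only keeps the example self-contained.

```python
import math
from typing import List

Matrix = List[List[float]]


def invert(matrix: Matrix) -> Matrix:
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov: Matrix) -> Matrix:
    """Partial correlation of each pair of variables given all others:
    rho_ij = -P_ij / sqrt(P_ii * P_jj), with P the inverse covariance."""
    prec = invert(cov)
    n = len(prec)
    return [
        [
            1.0 if i == j
            else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy covariance of a chain x -> y -> z (y = x + noise, z = y + noise):
# x and z are correlated, but only through y.
cov = [[1.0, 1.0, 1.0],
       [1.0, 2.0, 2.0],
       [1.0, 2.0, 3.0]]
pcor = partial_correlations(cov)
```

In this toy chain, x and z have a non-zero covariance, yet their partial correlation given y vanishes, so a GGM would draw the edges x–y and y–z but no edge x–z. This is the property that lets partial correlations recover direct, pathway-like links.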
For 139 of 315 unknown high-resolution metabolites that are connected to known metabolites in the network model, I automatically