Bioinformatics Tools for Mass Spectrometry, Phylogenetic Footprinting, and the Integration of Biological Data

Bioinformatics tools for mass spectrometry, phylogenetic footprinting, and the integration of biological data Dissertation zur Erlangung des akademischen Grades Doktor der Naturwissenschaften (Dr.rer.nat.) der Naturwissenschaftliche Fakultät III Agrar- und Ernährungswissenschaften, Geowissenschaften und Informatik der Martin-Luther-Universität Halle-Wittenberg vorgelegt von Hendrik Treutler Geb. am 10.03.1985 in Karl–Marx–Stadt Gutachter: 1. Prof. Dr. Ivo Grosse 2. Prof. Dr. Burkhard Morgenstern 3. Dr. Steffen Neumann Tag der Verteidigung: 16. November 2017 ii Acknowledgements My first thanks belong to my beloved wife Anna-Maria Treutler. Thanks to you I can submit this thesis before our fifth wedding anniversary. You encourage me all the time and you gave me three wonderful children. Many thanks also belong to my parents who formed my roots in this world and made it possible for me to fulfill my wishes. I thank my instructors Ivo Grosse, Steffen Neumann, Stefan Posch, and Falk Schreiber for guiding me to this thesis. You formed my roots in the scientific world and your advice on technical, scientific, and political issues lead to own experiences which will be indispensable for me in the future. Martin Nettling earns special thanks. The exchange of thoughts with you concerning technical, scientific, and private issues changed and widened my mind countless times. With him on your side things will basically work out. I thank Gerd Balcke for accompanying me in the world of mass spectrometry. Your constructive ideas and your eagerness are an inspiration for me and lead to a great collaboration. I thank Karin Gorzolka for her passion to share thoughts in many positive discussions. I thank Susann Mönchgesang, Christoph Ruttkies, Daniel Schober, and Sarah Scharfenberg for their great support in technical and scientific issues and for their friendship. I thank Jesus Cerquides, Tobias Czauderna, Jan Grau, Eva Grafahrend-Belau, Anja Hartmann, Astrid Junker, Jens Keilwagen, Matthias Klapperstück, Matthias Lange, Christian Klukas, Hendrik Rohn, and Uwe Scholz for fruitful discussions and professional teamwork. Last but not least I thank Kristian Peters for technical support with MetFamily. You do a really great job and I enjoy your presence for many reasons. iv Contents 1 Summary 1 1.1 English version . .1 1.2 German version . .4 2 Introduction 7 2.1 Biological background . .9 2.1.1 The central dogma of molecular biology . .9 2.1.2 Regulation of gene expression and transcriptional initiation . 11 2.1.3 Metabolism and small molecules . 11 2.2 Computer science background . 13 2.2.1 The Java programming language . 14 2.2.2 The R programming language . 14 2.2.3 Databases . 15 2.2.4 Data integration and Ontologies . 16 2.3 Bioinformatics background . 16 2.3.1 Integration of biological data . 16 2.3.2 Phylogenetic footprinting and phylogenetic shadowing . 17 2.3.3 Mass spectrometry . 18 2.4 Research objectives . 19 3 Paper summary 23 3.1 The presented works in context . 23 3.2 Integration of biological data . 26 3.2.1 VANTED v2 – A framework for systems biology applications . 26 3.2.2 DBE2 – Management of experimental data for the VANTED system 28 3.3 Phylogenetic footprinting . 29 3.3.1 Detecting and correcting the binding–affinity bias in ChIP-seq data using inter–species information . 30 3.3.2 DiffLogo: a comparative visualization of sequence motifs . 31 3.4 Data Processing and Interpretation of Mass Spectrometry Data . 33 v CONTENTS 3.4.1 Discovering Regulated Metabolite Families in Untargeted Metabolomics Studies . 33 3.4.2 Prediction, Detection, and Validation of Isotope Clusters in Mass Spectrometry Data . 35 4 Integration of biological data 49 4.1 VANTED v2: a framework for systems biology applications . 49 4.2 DBE2 – Management of experimental data for the VANTED system . 63 5 Phylogenetic footprinting 75 5.1 Detecting and correcting the binding-affinity bias in ChIP-seq data using inter-species information . 75 5.2 DiffLogo: a comparative visualization of sequence motifs . 86 6 Data Processing and Interpretation of Mass Spectrometry Data 97 6.1 Discovering Regulated Metabolite Families in Untargeted Metabolomics Stud- ies ......................................... 97 6.2 Prediction, Detection, and Validation of Isotope Clusters in Mass Spectrom- etry Data . 107 vi 1 Summary 1.1 English version Complex biological processes such as RNA splicing, protein degradation, and metabolic switches are a prerequisite for organisms to adapt to environmental stimuli, respond to pathogens, and absorb nutrients. These biological processes are studied in molecular biology and there are powerful techniques for the study of proteins, DNA sequences, and small molecules such as nuclear magnetic resonance (NMR) spectroscopy, high–throughput se- quencing technologies, and mass spectrometry. Depending on the technique the amount of biological raw data can be in the order of hundreds of gigabytes and specialized tools are needed to process this data and extract interpretable information. A deep understanding of both biology and computer science is a prerequisite for the development of such tools and bioinformatics is an interdisciplinary field which emerged from this demand. Bioin- formatics research involves the development of tools for the study of biological processes using methods from computer science. Systems biology is a bioinformatics field which aims at an understanding of biological systems rather than investigating each biological process individually. Systems biology approaches are able to translate various kinds of biological data to parameters of mathe- matical models enabling, amongst others, the understanding of the complex interplay of different biological entities, the reproduction of emergent properties of metabolic pathways, and the prediction of the behaviour of cells under hypothetical conditions. Research in systems biology necessitates insights into the regulation of gene expression, the metabolism, and the integration of biological data from different sources. The publications in this cumu- lative thesis are rooted in these three fields, namely (i) “Phylogenetic footprinting” for the study of the regulation of transcriptional initiation, (ii) “Data Processing and Interpretation of Mass Spectrometry Data” for the study of metabolic fingerprints, and (iii) the “Integra- tion of biological data” for the combined analysis of different data. Two publications reside in each of these fields and I will summarize these six publications subsequently. Decoding the regulation of gene expression is essential for the understanding of life and 1 1. SUMMARY the transcriptional initiation is a vital sub–process. Transcriptional initiation is governed by the concerted binding of transcription factors (TFs) to transcription factor binding sites (TFBSs) and de novo motif discovery with “Phylogenetic footprinting” on basis of ChIP–seq data gives deep insights into this sub–process. Unfortunately, ChIP–seq data is subject to the ubiquitous binding–affinity bias which leads to the prediction of distorted sequence motifs and biased sets of TFBSs. My colleagues and I developed a phylogenetic footprinting model which is capable of estimating and correcting the binding–affinity bias in ChIP–seq data. We found that the proposed phylogenetic footprinting model improves de novo motif discovery and that the corrected sequence motifs are softer than the uncorrected sequence motifs. The increasing availability of sequence data and algorithms for de novo motif discovery re- sults in the demand to compare different sequence motifs. Sequence motifs which have been extracted from, e.g., different species, tissues, and samples are often similar which makes the comparison challenging. My colleagues and I developed the R–based tool DiffLogo for the comparative visualization of sequence motifs in DNA sequences, RNA sequences, and protein domains. DiffLogo allows the detection of even small motif differences between two sequence motifs and also supports the pair–wise comparison of more than two sequence motifs. We demonstrated the utility of DiffLogo using sequence motifs of one transcription factor from different cell lines, sequence motifs from different transcription factors, and sequence motifs in protein domains from different phylogenetic kingdoms. DiffLogo was also used to show the effect of the binding–affinity bias in ChIP–seq data. The study of small molecules in the cell gives insights into the chemical fingerprints of different biological processes. Metabolite profiles provide a snapshot of these chemical fingerprints in the samples of interest and Mass Spectrometry (MS) is an essential technol- ogy to prepare these metabolite profiles. The “Data Processing and Interpretation of Mass Spectrometry Data” is mandatory in order to gain biological insights from MS data and the identification and precise quantification of metabolites remains a major challenge. Isotope clusters are sets of related signals in MS data and it has been shown that isotope clusters can be utilized to improve the identification and quantification of metabolites. My colleagues and I developed an approach for the prediction, detection, and validation of isotope clusters in MS data and we integrated this approach into the popular R–based tools xcms and CAMERA. We found that using the proposed approach it is possible to extract 37% more isotope signals from Arabidopsis thaliana measurements and in a mix of standard compounds the correct molecular formula could be predicted in 92% of the cases in the top three ranks. The structural elucidation

Bioinformatics Tools for Mass Spectrometry, Phylogenetic Footprinting, and the Integration of Biological Data

A Method to Infer Changed Activity of Metabolic Function from Transcript Proﬁles

The Kyoto Encyclopedia of Genes and Genomes (KEGG)

3 13437143.Pdf

Open Data, Open Source, and Open Standards in Chemistry: the Blue Obelisk Five Years On" Journal of Cheminformatics Vol

Article Reference

Open Data, Open Source and Open Standards in Chemistry: the Blue Obelisk ﬁve Years On

Genbank by Walter B

Integrative Annotation of 21,037 Human Genes Validated by Full-Length Cdna Clones

General Assembly and Consortium Meeting 2020

2003 Mulder Nucl Acids Res {22

ACCEPTED PAPERS – ABSTRACTS (As of May 5, 2006) Comparative

MPGM: Scalable and Accurate Multiple Network Alignment