Informatics for Tandem Mass Spectrometry-based Metabolomics
Dipl.-Ing. (FH) Stephan A. Beisken, M.Res.
European Molecular Biology Laboratory, European Bioinformatics Institute
University of Cambridge, Gonville & Caius College
A thesis submitted on April 10, 2014 for the Degree of Doctor of Philosophy
“The demand upon a resource tends to expand to match the supply of the resource. The reverse is not true.” - Generalization of Parkinson’s law -
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.
This dissertation is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university, and no part has already been, or is currently being submitted for any degree, diploma or other qualification.
This dissertation does not exceed the specified length limit of 300 pages as defined by the Biology Degree Committee.
April 10, 2014 Stephan A. Beisken
Acknowledgements
First, I would like to express my gratitude to my supervisor, Dr. Christoph Steinbeck, who has supported me throughout my thesis. His knowledge and life experience helped me to overcome the various hurdles encountered.
I want to thank my Thesis Advisory Committee members, Dr. Jeroen Krijgsveld, Dr. Jules Griffin, Dr. John Marioni, and Dr. Mark Seymour, for their advice and guidance. My special thanks go to Dr. Mark Seymour for providing me with insights into the needs of the industry sector.
I also wish to thank the members of the Steinbeck group and, in particular, the research team. Their assistance and friendship helped me to enjoy my work and push beyond my own limits.
Special thanks to Mr. Mark Earll, Dr. David Portwood, Dr. Reza Salek, and Dr. Michael Eiden for their helpful discussions and suggestions. Their input inspired and helped me to find my way around the data analysis landscape and opened up many glorious opportunities.
Syngenta AG, in collaboration with the European Bioinformatics Institute, provided the funding, support, and data I needed to produce and complete my thesis.
Finally, I wish to thank Ms. Evelyn Lim and my family for their continuous support throughout the programme, ultimately resulting in the creation of this dissertation.
Abstract
Metabolomics is a rapidly expanding field with applications in areas such as medicine, agriculture, and food safety. Tandem mass spectrometry (MSn) is one of the main technologies driving the field forward. Optionally coupled to a chromatographic element, MSn can capture detailed snapshots of an organism’s metabolome. The resulting data sets are complex and difficult to analyse due to the multitude of external, biologically irrelevant influences. In particular, metabolite identification – the ultimate goal of MSn metabolomics – is a highly challenging exercise with inherently uncertain results.
We have developed the data processing tool MassCascade to rapidly analyse and visualise chromatography MSn data. MassCascade features methods for data (pre-)processing from initial file input to the compilation of the final result matrix. To simplify use and break down the complex analysis process, the tool has been made available in the form of a plug-in for the workflow platform KNIME: MassCascade-KNIME offers a visual representation of each processing function that can be utilised following the concept of visual programming. To further support metabolomics data analysis, cheminformatics methods from the Chemistry Development Kit have been added separately to the workflow platform to enable digital small molecule handling, essential for semi-automated metabolite identification.
To demonstrate the MSn analysis process and test MassCascade and its plug-in, two scenarios typical in metabolomics were chosen: spectral fingerprinting and metabolite identification. A set of tomato samples from a long-term metabolomics study of chromatography MSn system stability was processed and interpreted. Distinct trends and clustering could be extracted and explained, verifying correct processing by the tool. Metabolite identification of spectral features was applied to a study of tomato ripening. To that end, features differentiating the ripening of four tomato genotypes were singled out. The implemented information-driven identification methodology enabled the selection of putative metabolite identifications from large lists of chemical compounds.
Contents

Contents
List of Figures
Nomenclature

1 General Introduction
  1.1 Metabolomics
    1.1.1 Experimental Methods in Metabolomics
    1.1.2 Applications of Metabolomics
    1.1.3 Mass spectrometry
    1.1.4 Data Pre-Processing
    1.1.5 Data Post-Processing
    1.1.6 Identification of Metabolites
    1.1.7 Software
  1.2 Cheminformatics support for Metabolomics
    1.2.1 Small Molecule Library Management
    1.2.2 Representation of Small Molecules
    1.2.3 Properties of Small Molecules
    1.2.4 Workflow Environments for Cheminformatics
  1.3 Aim of this Thesis

2 Informatics for LC-MSn Analysis
  2.1 Introduction
  2.2 MassCascade’s Implementation
  2.3 MassCascade’s Functionality
    2.3.1 Data Pre-Processing
    2.3.2 Data Processing
    2.3.3 Data Post-Processing
  2.4 MassCascade for KNIME
    2.4.1 Structure
    2.4.2 Node Types
    2.4.3 Node Interactions
  2.5 Evaluation
    2.5.1 Spectral Fingerprinting of Tomato Samples
    2.5.2 Materials and Methods
    2.5.3 Results
    2.5.4 Performance and Scaling of the Core Library
  2.6 Technical Validation
    2.6.1 Methods
    2.6.2 Results & Discussion
  2.7 Conclusion
  2.8 Software Availability
    2.8.1 Update Site
    2.8.2 Extensions
    2.8.3 Example Workflows

3 Knowledge-based Compound Identification
  3.1 Introduction
    3.1.1 Open Data
  3.2 Metabolite Identification
    3.2.1 Identification Factors
    3.2.2 Scoring Schemes
  3.3 Materials and Methods
    3.3.1 Tomato Cultivars
    3.3.2 Sample preparation
    3.3.3 Chromatography
    3.3.4 Mass Spectrometry
    3.3.5 Reference Standards
    3.3.6 Data Deposition
  3.4 Data Processing and Transformation
    3.4.1 Known Identification
    3.4.2 Known Unknown Identification
    3.4.3 Unknown Identification
  3.5 Results
    3.5.1 Analysis of the Quality Controls
    3.5.2 Analysis of the Tomato Samples
    3.5.3 Identification
  3.6 Discussion
  3.7 Conclusion
  3.8 Technical Validation
    3.8.1 Methods
    3.8.2 Results & Discussion

4 Workflows for Cheminformatics
  4.1 Introduction
  4.2 KNIME-CDK’s Implementation
    4.2.1 Structure
    4.2.2 Persistence
  4.3 KNIME-CDK’s Functionality
    4.3.1 Input/Output
    4.3.2 Processing
    4.3.3 Visualisation
  4.4 Evaluation
    4.4.1 Round Tripping
    4.4.2 Test Workflows
    4.4.3 Performance and Scalability
  4.5 Conclusion
  4.6 Software Availability
    4.6.1 Update Site
    4.6.2 Extensions
    4.6.3 Example Workflows

5 Summary and Discussion

Appendix

References
List of Figures

1.1 Schematic of a mass spectrometry pipeline and data landscape
1.2 Schematic of a mass chromatogram and a spectrum
1.3 Schematic of the mass spectrometry data analysis process
1.4 Summary of components contributing to signal distortions
1.5 Comparison of graphical representations for chlorophyll f
1.6 Schematic of the principle mechanism of workflow platforms
1.7 Screenshot of the KNIME workbench

2.1 MassCascade data types and their representations
2.2 UML diagram of MassCascade’s data structure
2.3 Example of JavaDoc documentation
2.4 Comparison of centroiding methods
2.5 Illustration of noise removal pre-processing methods
2.6 Illustration of the feature extraction process
2.7 Illustration of the TopHat algorithm
2.8 Illustration of Durbin-Watson filtering and pre-processing summary
2.9 Illustration of deconvolution, alignment, and feature set methods
2.10 Illustration of the modified Biemann algorithm
2.11 Illustration of annotation and identification methods
2.12 Linear regression for molecular masses vs. isotope abundances
2.13 Schematic of the MassCascade-KNIME architecture and node model
2.14 Screenshot of a complex MassCascade-KNIME workflow
2.15 Schematic of the node architecture and interactions
2.16 Screenshot of configuration dialogues
2.17 Screenshot of the Spectrum Viewer data view
2.18 Schematic of LC-MS data processing for metabolomics fingerprinting
2.19 Cross-sample total ion currents and chromatograms
2.20 Line plot of time vs time deviation of aligned samples
2.21 Line plot of time vs time drift of aligned samples by group
2.22 Analysis of interferent intensities and distribution
2.23 Stepwise analysis of feature missingness
2.24 Principal component analysis for all standard aliquots
2.25 Principal component analysis for filtered standard aliquots
2.26 Analysis of standard aliquots from 2010-09-21
2.27 Performance charts of the core library
2.28 F-scores for feature isolation

3.1 Workflow for known and known unknown metabolite identification
3.2 Processing workflow for metabolite identification
3.3 Annotated correlation heatmap of tomato study features
3.4 Overview of the tomato cultivars data set
3.5 PCA model for the tomato cultivars data set
3.6 OPLS model for the tomato cultivars data set
3.7 Pairwise loadings of OPLS genotype models
3.8 Univariate statistics for features 118.086 and 130.05
3.9 Univariate statistics for features 133.061 and 176.103
3.10 Univariate statistics for features 197.096 and 327.118

4.1 Schematic of the KNIME-CDK architecture and node model
4.2 Screenshot of a KNIME-CDK workflow
4.3 Screenshot of KNIME-CDK visualisation preferences
4.4 Pairwise comparison of measured and calculated logP values
4.5 Execution times per molecule by different cheminformatics plug-ins
Nomenclature
Symbols
Fm/z   Feature, Fm/z = (t, I, rt)
FS     Feature Set, FSt = {F1, F2, ..., Fn}
I      Intensity
ma     Mass Accuracy
mcq    Mass Chromatographic Quality
m/z    Mass-to-Charge Ratio
R      Resolution, R = m/z1/∆m/z
rt     Retention Time
si     Signal, si = (m/z, I)
S      Scan, St = {s1, s2, ..., sm}
t      Time
Acronyms
API    Application Programming Interface
CDK    Chemistry Development Kit
CODA   Component Detection Algorithm
DW     Durbin-Watson Criterion
EI     Electron Impact
ESI    Electrospray Ionisation
FWHM   Full Width at Half Maximum
GC     Gas Chromatography
GCxGC  Two-dimensional Gas Chromatography
KNIME  Konstanz Information Miner
LC     Liquid Chromatography
MSI    Metabolomics Standards Initiative
MS     Mass Spectrometry
MSn    Tandem Mass Spectrometry
NMR    Nuclear Magnetic Resonance
OPLS   Orthogonal Partial Least Squares
PCA    Principal Component Analysis
PSI    Proteomics Standards Initiative
S/N    Signal-to-Noise Ratio
CHAPTER 1
General Introduction
1.1 Metabolomics
Metabolomics is defined as the study of the total small molecule complement of an organism. It is a highly data-generating and knowledge-driven science. The study of the small molecule complement creates a large amount of information-rich data that provides unprecedented insights into an organism’s biology at different biological levels, such as the tissue, cell type, or compartment level. The metabolome is the dynamic system composed of the small molecules and their interactions. In the context of an abstract, all-encompassing metabolome, the metabolome can be considered the ultimate expression of the genome. It provides insights into direct and indirect control and regulation mechanisms of systems. Compared to other omics such as transcriptomics or proteomics, the metabolome is the closest measurable representation of the phenotype currently available, making its potential incalculable[1].
The concept of metabolomics – the word itself is derived from the Greek word for change (µεταβολή) – was described by C. H. Waddington in 1942: he referred to the study of the causal relationships between genotype and phenotype as epigenetics [2].
Today, Waddington’s definition of epigenetics describes multiple omics disciplines, of which metabolomics forms a part. In contrast to other omics, metabolomics has several unique characteristics that make its study particularly demanding: chemical diversity, chemical dynamic range, and time resolution[3].
Chemical diversity: Metabolomics studies small molecules within a molecular mass range of 50 to 1500 Da. Chemical classes include amino acids, sugars, alkaloids, phenolic compounds, lipids, and many more. Each class has dramatically different physicochemical properties and biological functions. A simple exchange of a functional group – the smallest functional unit – of a molecular species can change its biological function entirely. A change in stereochemistry can have the same effect. Furthermore, metabolites are not only chemically diverse, they are also hard to enumerate because of the lack of sensible structural and biological constraints.
Wherever enumeration is possible, it is applied. For instance, lipids encompass well-defined chemical classes with discrete building blocks. Enumerating biologically relevant lipid species is a heavily studied exercise[4,5]. Beyond the well-defined lipids, in vivo phase I and II reactions, in addition to other catabolic and anabolic reactions, produce chemical diversity that is challenging to manage[6].
Chemical dynamic range refers to the concentration range at which metabolites occur in vivo. Depending on the chemical class and location of a metabolite, these can easily span three orders of magnitude or more, e.g. from µmol/L (hormones) to high mmol/L (sugars) concentrations[7]. The co-occurrence of metabolites with a 1,000-fold difference in concentration makes their simultaneous detection demanding.
Time resolution relates to the kinetics and dynamics of metabolites, e.g. the rate at which metabolites degrade over time. Concentrations of metabolites can change rapidly over time. For example, in blood plasma, catecholamines have a half-life in the order of minutes, whereas the thyroid hormone can have a half-life in the order of hours[8,9]. Consequently, any multiparametric responses measured over time can vary significantly from one another, thus constraining the reproducibility of studies.
The characteristics outlined above explain why it is inaccurate to talk about the metabolome of an organism when referring to discrete biological functions. The notion of a single metabolome is further countered by more recent studies on genome mosaicism[10]. Metabolomics experiments acquire snapshots of a metabolome, whose properties depend on the sample type. The snapshots can reflect a spatially or temporally constrained aspect of an organism’s state under partially defined conditions.
The aim of metabolomics studies is typically to characterise biological samples or identify metabolites or both based on metabolomics snapshots[11,12]. Depending on the study, the set-up can either be targeted (hypothesis-driven) or untargeted (data-driven)[13]. The four principal approaches are[14,15]:
• fingerprinting, spectral pattern recognition for clustering or identification;
• profiling, description of known chemical classes;
• target analysis, measurement of specific compounds;
• metabolomics, identification of all molecular species in a sample.
The total size of a metabolome is hard to estimate. Any estimation depends on criteria such as the molecular mass cut-off of included metabolites or whether exogenous molecules are included.
Estimates have been attempted for the total number of metabolites for whole kingdoms, e.g. 200,000 for the plant kingdom[14], and for individual organisms, e.g. 9,000 for Homo sapiens [16]. Given our limited understanding of the chemical rules that define observed biological subsets in chemical space, these numbers should be considered with caution. At the moment, chemical and metabolomics databases do not contain sufficient information to comprehensively retrieve all known metabolites[17]. In a recent Nature Review, the authors concluded that “an astounding number of metabolites remain uncharacterized with respect to their structure and function...”[18].
Metabolomics, following in the footsteps of proteomics, is a rapidly growing field[19]. The Metabolomics Standards Initiative (MSI)[20] was founded in 2007 to address issues related to reporting standards and consolidating community efforts – much like the efforts carried out by the Proteomics Standards Initiative (PSI) earlier[21]. Experimental studies to characterise different metabolomes on various biological levels have been undertaken, slowly increasing the available knowledge base[22]. These efforts have been supplemented by computational approaches for in silico metabolite generation and database design to consolidate and stratify collected data on an organism-specific level.
Recent efforts include the development of cross-species metabolomics resources such as MetaboLights[23], the Plant Metabolomics Resource[24], and MeltDB[25]. These resources attempt to capture all evidence from a study. This includes, inter alia, metabolite structures and their reference spectra, biological roles, locations and concentrations, as well as experimental data. With the aggregation of metabolomics data, the study of data fusion, e.g. from proteomics and metabolomics studies, has gained more attention[26], pushing towards a more integrative and systemic view of analytical sciences.
1.1.1 Experimental Methods in Metabolomics
Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) are the two principal methods used in metabolomics experiments. These highly accurate and sensitive methods are often combined with a chromatographic technique such as High-Performance Liquid Chromatography (HPLC) or Gas Chromatography (GC) to increase spatial resolution (separate molecular species), adding a time dimension to the already complex signal landscape. Information-dense MS and NMR data sets range in size from several to hundreds of gigabytes.
Chromatography is an important step in metabolomics experiments to separate individual molecular species in a mixture. Advances in chromatographic technology enable the separation of complex mixtures under a variety of experimental conditions[27,28]. Denser and more orderly packed columns, in combination with higher pressures, have the potential to produce sharper signals and shorten run time while maintaining appropriate resolution. In gas chromatography, two-dimensional approaches have gained acceptance, yielding unparalleled separation of complex mixtures[29].
NMR and MS technologies are complementary, detecting different chemical classes or molecular species. In general, NMR requires less time-consuming sample preparation, is reproducible, and is quantitative. MS is more sensitive and higher-throughput in comparison[30]. These technologies have been shown to produce similar results for high-level applications such as fingerprinting[31]. Combinations of technologies result in a great number of systems, each suited to individual studies. For reviews, please see Aliferis et al. and Shulaev et al.[13,32].
Here, we focus on liquid chromatography coupled to tandem mass spectrometry (LC-MSn), a routine technique used to investigate the small molecule complement of organisms. Modern LC-MSn systems can detect more mass traces than ever before thanks to high mass accuracy (ma ≤ 2 ppm[33]) and high resolution (R ≥ 100,000[34]), producing complex, information-rich data for every sample. LC-MSn has been applied across different fields in biology[11,35]. The diversity of available instrumental platforms and configurations[36,37] reflects that no single platform or method can cover the whole metabolome[12].
Nevertheless, LC-MSn can be applied to the study of lipids and core metabolism in combination with different approaches[38]. In environmental science[39] and plant science[40], non-targeted metabolomics has been predicted to become of particular importance owing to LC-MSn’s ability to resolve a huge range of semi-polar compounds. The basics for the measurement of small molecules are captured in best practice guides such as that published by Webb et al.[41]. Efforts on quantitative MS are mostly limited to GC-MS due to the higher reproducibility of results, which is important for studies involving instrument calibration[42].
1.1.2 Applications of Metabolomics
Metabolomics has been applied to a wide variety of areas. For example, mass spectrometry methods have been used in medical diagnostics[43], studies about cancer[44,45] and neurological disease[46]. Fluids commonly studied by metabolomics in the context of medical studies include urine[47], plasma (serum)[48,49], and cerebrospinal fluid[50]. Lipidomics – part of metabolomics but, due to its complexity, considered a separate field – has drawn attention from the pharmaceutical sector because of its relevance to diseases like diabetes or obesity[51,52]. Metabolomics’ non-invasive nature and extremely high time resolution make it an ideal tool for the pharmaceutical industry.
Metabolomics has also been used in the characterisation and identification of bacterial strains[53], serving as an early-detection system in clinical environments[54]. In addition to these specific applications, metabolomics is used in systems biology as one of the many omics disciplines that this field tries to combine[11].
Plant metabolomics is particularly interesting because of the range and functions of primary and secondary metabolites in plants[55]. About 300 distinct metabolites could be routinely identified a decade ago, a number that has not changed much over time[56]. Applications of plant metabolomics include basic research (untargeted approaches[57,58]), environmental studies[59], targeted studies[60], profiling of varieties of cultivars[61,62], plant lipidomics[63], and untargeted chemical identification of plants[64].
1.1.3 Mass spectrometry
Mass spectrometry is an analytical technique to measure small molecules, either directly injected into the MS or via an interfaced chromatographic technology. The analytes are ionised at an ion source before they can be detected in a coupled mass detector. The resulting data consists of mass-to-charge (m/z), time, and intensity triplets that describe for every detected ion mass the strength of the ion beam and the time it is detected (Figure 1.1).
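The (m/z, time, intensity) triplet structure described above can be sketched with a minimal data model. This is an illustrative Python sketch, not MassCascade's actual (Java) data structures; the masses and intensities are invented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Signal:
    """A single detected ion: mass-to-charge ratio and ion-beam intensity."""
    mz: float
    intensity: float

@dataclass
class Scan:
    """All signals recorded by the detector at one discrete time point."""
    time: float            # retention time in seconds
    signals: List[Signal]

# A run is an ordered list of scans; every (m/z, time, intensity)
# triplet is recoverable by iterating over the scans and their signals.
run = [
    Scan(time=60.0, signals=[Signal(mz=181.0712, intensity=4.2e5)]),
    Scan(time=60.5, signals=[Signal(mz=181.0712, intensity=6.8e5)]),
]
triplets = [(s.mz, scan.time, s.intensity) for scan in run for s in scan.signals]
```

This mirrors the nomenclature used throughout the thesis: a signal si = (m/z, I) and a scan St = {s1, ..., sm}.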
The most common chromatographic technologies used in mass spectrometry are gas and liquid chromatography, distinguished by the state of their mobile phase. These technologies are not as high-throughput as direct infusion techniques but suffer less from ion suppression and unresolved isobaric compounds[65]. Chromatography adds an additional dimension to the MS data landscape. Through interactions of analytes with a mobile and stationary phase, compounds are retarded and elute off a chromatographic column at different time points according to their physicochemical properties. This allows isobaric species to be resolved.
(a) Flow diagram of a typical mass spectrometry pipeline
(b) Three dimensional mass spectrometry data landscape
Figure 1.1: Schematic of a typical mass spectrometry pipeline and three dimensional data landscape. (a) Flow diagram of a typical mass spectrometry pipeline from design of experiments to the final interpretation of results. A mass spectrometer can be interfaced with a chromatographic technique or used via direct infusion. The dotted rectangle shows the building blocks of a mass spectrometer. The central parts, ion source, mass analyser, and detector, are a separate unit under high vacuum. (b) Data landscape of chromatography-interfaced MS data. The detector scans over a mass range at discrete time intervals, picking up the mass-to-charge ratios (m/z) of ions arriving at the detector.
Ionisation techniques are grouped into hard and soft. Hard ionisation, such as electron impact ionisation (EI), heavily fragments a compound by creating high-energy electrons that interact with an analyte. In contrast, soft ionisation techniques, such as electrospray ionisation (ESI), ionise a compound but create only few fragments, for example, based on the principle of Coulomb repulsion. These techniques can be used separately or in combination[66].
Generated ions are separated by their mass-to-charge ratio (m/z) in the mass analyser. For simplicity, the charge is often assumed to equal one; consequently, a mass-to-charge ratio approximately equals the molecular mass of an ion. All mass analysers exploit the mass and electrical charge properties of ions but use different separation methods and vary in performance[67]. Finally, separated ions are captured by a mass detector that scans a pre-defined mass range at close intervals. The chromatographic profile of an ion, i.e. the generated continuous ion beam, is captured across multiple scans at discrete time intervals. For a review of LC-MS technologies in metabolomics, see Forcisi[68] and Draper et al.[69].
Mass spectrometers can be operated in tandem, with two (MS/MS) or more (MSn) spectrometers working in sequence, fragmenting selected ions further in collision chambers between individual mass spectrometers. Ions are selected for fragmentation in a data-dependent manner based on the scan mode, e.g. parent ion scan or product ion scan[70]. The resulting data does not vary in its structure but has a greater depth: parent ions from MS1 have associated scans in MS2, MS3, et cetera. In addition, mass spectrometers can run in positive and negative ion mode, where the mass analyser filters for positive and negative ions respectively. Compounds show different fragmentation patterns for each ion mode. Instruments in positive ion mode have been shown to create more fragments than instruments run in negative ion mode[47].
The resulting partially convoluted, densely populated signal landscape contains systematic and random noise amongst true signals of varying intensity and shape. Due to fragmentation and the inevitable presence of contaminants and interferents, compounds are represented by many signals[71]. The following provides a breakdown of different signal sources, other than fragmentation, stemming from the same compound. With soft ionisation techniques, main ions are formed through addition or loss of a hydrogen ([M+H]+ or [M−H]−). Adducts can form through interaction with other molecular species such as sodium: [M+Na]+. Clusters result from aggregation of the compound under investigation with itself: [2M+H]+. In addition, charge is not restricted to one. Species with higher charges, such as [M+2H]2+, can be observed.
Properties of Mass Spectrometers
The type and configuration of mass spectrometers dramatically influence the quality of the resulting data in all three dimensions: mass-to-charge (m/z), time, and intensity. This section introduces common terms that describe instrumental parameters and characteristics of data acquired by mass spectrometers coupled to a chromatographic method. Depending on the quality of the data landscape, data processing and analysis parameters have to be adjusted, e.g. to account for poor resolution in the m/z dimension. Therefore, it is essential to understand these descriptors. For an overview and in-depth summary of terms relating to mass spectrometers please see Moco[36] and Price et al.[72].
A chromatographic component adds a time domain to the signal landscape, which increases the resolution of isobaric compounds. Peaks in chromatograms ideally follow a Gaussian distribution. Chromatograms of a single ion as detected by a mass spectrometer are also known as mass chromatograms or extracted ion chromatograms (Figure 1.2a). They are defined by a characteristic retention time (rt), measured at the apex of the Gaussian-distributed peak, and a maximum peak height. If one chemical species elutes at two different time points, the second peak is referred to as a shadow peak. If two compounds of similar or identical mass elute at a similar time point, the two chromatograms are said to be convoluted, i.e. they overlap. The two compounds can either be resolved via peak picking (deconvolution) on the data processing side or by increased mass or chromatographic resolution on the instrumental side. In addition to increased resolution, chromatography also reduces ion suppression in the ion source[73]. Ion suppression prevents low-abundance species from being ionised; consequently, these species cannot be detected. For a review on ion suppression, please see Furey[74] and Annesley et al.[75].

Sensitivity refers to the change in ion current for a compound against the background. It is described by the signal-to-noise ratio (S/N). Higher sensitivity enables the detection of more signals from compounds of lower concentration, and less strict background filtering should be considered for data processing. As established previously, mass detectors scan a given mass range at discrete intervals. These intervals are defined by the scan rate. Higher scan rates yield more data points per chromatographic signal.
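Extracting a mass chromatogram reduces to collecting, for each scan, the intensity near a target m/z. A minimal illustrative sketch, with a hypothetical function name and invented data (not MassCascade's implementation):

```python
def extract_ion_chromatogram(scans, target_mz, tol_mz=0.01):
    """Extracted ion chromatogram: for each (time, signals) scan, sum the
    intensity of all signals within +/- tol_mz of the target m/z."""
    eic = []
    for time, signals in scans:
        intensity = sum(i for mz, i in signals if abs(mz - target_mz) <= tol_mz)
        eic.append((time, intensity))
    return eic

scans = [
    (10.0, [(181.071, 1.0e4), (203.052, 2.0e3)]),
    (10.5, [(181.072, 5.0e4)]),
    (11.0, [(181.070, 1.2e4), (181.090, 9.0e3)]),  # 181.090 lies outside tolerance
]
eic = extract_ion_chromatogram(scans, target_mz=181.071)
rt = max(eic, key=lambda p: p[1])[0]  # retention time = apex of the trace
```

The retention time falls out as the time of the most intense point of the trace, matching the definition of rt at the peak apex given above.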
Narrow signals, e.g. chromatographic traces that consist of fewer than four data points, carry less significance than signals with more data points that follow a well-behaved Gaussian shape[76]. Scan rate inversely affects mass resolution or mass resolving power. Both terms refer to the ability to distinguish two overlapping signals. Following IUPAC’s Gold Book recommendations[77], the mass resolving power R of two overlapping signals is defined as R = m/z1/∆m/z, where m/z1 is the mass of the indexed signal and ∆m/z equals m/z1 − m/z2. The extent to which the signals overlap must be indicated by either a percentage (10%) or by FWHM (full-width-at-half-maximum, 50%) of the signal height where the overlap occurs (Figure 1.2b). Closely related, mass accuracy describes how precisely a known mass (m) can be measured. Deviations from the exact value are specified in parts per million (ppm)[78]. Consequently, mass tolerance or mass error, i.e. the allowed m/z wobble of an ion trace over time, is typically defined in ppm. Mass accuracy of instruments decreases with increasing mass. The parts-per notation is particularly useful because it describes a dimensionless fraction. A mass tolerance of 10 ppm gives 0.001 Da tolerance for a compound of mass 100 Da and 0.008 Da tolerance for a compound of mass 800 Da.
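Both definitions translate directly into arithmetic, sketched here for illustration with hypothetical helper names:

```python
def resolving_power(mz1: float, mz2: float) -> float:
    """R = m/z1 / delta(m/z) for two just-resolved signals,
    with the overlap measured e.g. at FWHM."""
    return mz1 / abs(mz1 - mz2)

def ppm_tolerance_da(mz: float, ppm: float) -> float:
    """Absolute mass window (Da) for a relative tolerance given in ppm."""
    return mz * ppm / 1e6

# The 10 ppm examples from the text:
assert abs(ppm_tolerance_da(100.0, 10) - 0.001) < 1e-9
assert abs(ppm_tolerance_da(800.0, 10) - 0.008) < 1e-9

# Two signals 0.005 Da apart at m/z 500 require R of about 100,000 to separate:
R = resolving_power(500.000, 500.005)
```

The sketch also illustrates why a fixed Da tolerance is a poor substitute for a ppm tolerance: the absolute window widens proportionally with mass.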
For data processing and exchange, it is convenient to collapse signals of multiple scans into a single spectrum. A spectrum refers to a collection of signals that can originate from multiple scans. Here, the technically more correct term scan will be used interchangeably with the term spectrum.
Figure 1.2: Schematic of a mass chromatogram and a mass spectrum. (a) Example chromatogram of two partially overlapping ion species. Each species has a characteristic retention time (rt) and peak height measured from the baseline. Subsequent peaks of the same ion (m/z 1) are referred to as shadow peaks. (b) Example spectrum of two overlapping m/z signals. Mass resolution (R) can be measured at full-width-at-half-maximum (FWHM) via R = m/z1 / ∆m/z, where ∆m/z = m/z1 − m/z2.
Trends in Mass Spectrometry Metabolomics
A diverse array of mass spectrometers exists, each type with unique advantages[67]. This section outlines current trends in instrumentation. In-depth reviews and comparisons of existing platforms can be found in the literature[79,80].
Almost all properties of mass spectrometers have improved over the last decade, including mass accuracy, scan rate, and resolution[65,81]. Two-dimensional gas chromatography (GCxGC) has continued to grow in popularity over recent years, offering increased separation capacity and thus selectivity[39,82]. With the fundamental issues addressed, the field is moving into tandem mass spectrometry, catered for by MS vendors[83,84].
While instrument hardware is constantly improving, mass spectrometry-based metabolomics lags behind proteomics with regard to software[85] and analysis standards[86], which are only slowly emerging. Most notably, improved instrumentation has enabled advances in untargeted metabolomics[87,88] and quantification[89]. Cross-sample retention time stability and consistent analyte ionisation, paired with high resolution, have simplified calibration procedures and increased system stability.
Tandem mass spectrometry refers to the combination of mass spectrometers in sequence (MSn). Selected ions from one mass spectrometer are fragmented further, e.g. in a collision chamber, before entering the next mass spectrometer. The number of spectrometers in sequence (n) is limited by the increasing engineering complexity and the diminishing signal, i.e. ion concentration, with every appended mass spectrometer[87]. Proof-of-principle studies have employed systems of up to MS level four[90]. Four common data-driven methods for ion selection exist that allow study-based control over MSn spectra generation, where all MSn spectra follow the precursor/fragment relationship. The precursor isolation window determines the purity of the detected MSn spectra. Narrow isolation windows reduce contaminating and interfering ions through increased selectivity but also remove relevant information such as isotope patterns[70].
Tandem mass spectrometry is gaining popularity for the elucidation of unknown compounds and in data-driven untargeted metabolomics. However, decreasing mass accuracy at MSn levels greater than one and the complex fragmentation behaviour of small molecules make the interpretation of MSn spectra difficult and computationally expensive[91]. To complement these technological and computational advances, standard reference materials are under development to facilitate efforts in metabolite identification and quantification[92].
1.1.4 Data Pre-Processing
LC-MSn data processing includes many steps, most of which modify or remove raw data. Consequently, it is important to establish a good understanding of the steps involved[93]. The endpoint of mass spectrometry-based metabolomics studies is an annotated feature matrix extracted from a set of samples (raw data). A feature is defined as a signal (an m/z-intensity value pair) that is believed to represent an ion. Multiple features that represent different ions can belong to the same compound due to fragmentation, different ionisation states, adduct formation, or clustering. In contrast, signals originating from noise are not considered features.
Figure 1.3: Schematic of the mass spectrometry data analysis process. A set of raw data files is read after file conversion to non-proprietary formats. Data cleaning prepares raw data for feature extraction through noise reduction and background correction. Feature extraction isolates ion traces from raw data that are believed to represent a compound, before cross-sample alignment is carried out to compile a feature matrix for statistical analysis. Additionally, features can be identified using spectral and chemical compound databases.
To compile the feature matrix, noise reduction and background correction are applied before feature extraction to clean up the data. Extracted features of individual samples are then aligned across samples to compensate for retention time drifts introduced by the chromatographic component (Figure 1.3). Subsequently, the aligned features can be aggregated in a feature matrix, where each feature's characteristic mass serves as column header and the samples serve as row identifiers.
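This aggregation step can be sketched in a few lines of Python; the two-sample input structure below is hypothetical:

```python
# Minimal sketch of compiling a feature matrix after alignment: features
# (identified by a characteristic m/z) become columns, samples become rows.

aligned = {
    "sample_1": {"146.0591": 1.2e5, "203.0824": 3.4e4},
    "sample_2": {"146.0591": 9.8e4},  # feature 203.0824 missing in this run
}

# Column headers: the union of all feature masses across samples.
features = sorted({mz for signals in aligned.values() for mz in signals})
# One intensity row per sample; missing features are filled with 0.0.
matrix = [[aligned[s].get(mz, 0.0) for mz in features] for s in sorted(aligned)]

print(features)  # ['146.0591', '203.0824']
print(matrix)    # [[120000.0, 34000.0], [98000.0, 0.0]]
```

Handling of missing values (here a plain zero) is itself a pre-processing decision; imputation strategies are often preferred in practice.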
File Formats and Conversion
Mass spectrometry data is stored in a file-based manner where one file typically represents one MS run. Vendor software that operates MS instruments uses proprietary file formats that are rarely supported by non-proprietary software tools. Exceptions may occur when (a) the vendor offers an intelligible application programming interface and (b) the implementation effort is low. In any case, closed proprietary formats impede data exchange and isolate tools that can only implement a limited number of those formats.
Open file formats such as mzXML[94] and mzData[95] were developed to address this issue. Originally developed for proteomics, these markup-based standards have since been adopted by metabolomics. The newer HUPO PSI mzML 1.1.0[96] format has become the de facto standard, superseding the older formats[97]. Notably, netCDF[98], a generic common data format, is still used in MS. For an update on the efforts of the HUPO PSI, please see Orchard et al.[99].
File format conversion tools bridge the gap between closed proprietary and open formats, partially relying on vendor libraries for accurate conversion. They allow software developed for MS to ignore the plethora of vendor formats by taking over the responsibility of format conversion, facilitated by the accepted open data standards outlined above[100].
Data Cleaning
Data cleaning is important to remove irrelevant signals and reduce data size. It includes an array of processes that manipulate the raw data and should therefore be applied with care. Baseline drift is a common problem in LC-MSn, where the gradient of the mobile phase causes the chromatographic baseline to trend upwards or downwards. This complicates analysis because of the baseline's effect on chromatographic peak shapes, introducing fronting or tailing. Distorted peak shapes complicate peak detection and feature extraction (Figure 1.4). Background correction methods have been developed to address this problem[101–105]. These algorithms reduce systematic background drift by subtracting either a reference or an estimated background intensity value from the sample chromatogram.
Background correction methods account for systematic errors in the data but do not remove random noise. Random noise produces signal spikes and discontinuous data that could be mistaken for meaningful data. In order to distinguish random noise from meaningful signals, criteria have been developed to evaluate chromatographic signal traces. These include the Component Detection Algorithm (CODA), which measures the mass chromatographic quality (MCQ)[106], and the Durbin-Watson (DW) criterion, which quantifies randomness[107]. Chromatographic traces that fail a given quality threshold are considered noise and are removed. Data smoothing forms part of the noise removal process. Smoothing algorithms remove spikes from traces, for example by polynomial regression[108,109]. While peak smoothing simplifies feature detection and extraction by modelling chromatographic traces into ideal shapes, smoothing algorithms can also mask noise by modelling random signals into apparently real ones.
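One common formulation of the Durbin-Watson statistic for a trace can be sketched as follows; this is a simplified illustration of the randomness criterion, not the published algorithm:

```python
# Hedged sketch of the Durbin-Watson statistic applied to a chromatographic
# trace: DW = sum((x[t] - x[t-1])^2) / sum(x[t]^2). Uncorrelated white noise
# yields DW near 2 (anti-correlated noise approaches 4), while a smooth,
# slowly varying peak yields a value close to 0.

def durbin_watson(trace):
    num = sum((b - a) ** 2 for a, b in zip(trace, trace[1:]))
    den = sum(x ** 2 for x in trace)
    return num / den if den else 0.0

smooth_peak = [i * (100 - i) for i in range(101)]  # parabolic "peak"
alternating = [1, -1] * 50                          # spiky, random-like trace

print(durbin_watson(smooth_peak) < 0.1)   # True: smooth trace, low DW
print(durbin_watson(alternating) > 2.0)   # True: noisy trace, high DW
```

A threshold on DW then separates candidate signal traces from noise traces, mirroring the thresholding step described above.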
Additional data cleaning operations include simple m/z, time, and intensity filters, which crop the raw data and remove irrelevant parts of the data landscape. The removal of traces of known contaminants and interfering ions, such as acetonitrile and methanol products, is also used to reduce background noise[110].
Figure 1.4: Summary of components contributing to signal distortions. (a) Random noise adds variation to a signal around mean zero. (b) Systematic noise, e.g. baseline drifts, introduces a systematic drift or bias in the data that needs to be removed before data analysis. Systematic noise can impact heavily on signal intensities and derived signal areas. (c) The actual signal follows – in theory – a Gaussian distribution. Deviations from this distribution reflect external factors. (d) Overlay of components (a), (b), and (c), and the resulting “measured” signal (black).
Feature Detection
Feature detection and deconvolution describe the process of isolating the chromatographic traces of individual ions and splitting these traces into separate peaks[111]. A trace is the chromatographic profile of a single ion. A single chromatographic trace with multiple peaks can result from a single compound – eluting off the column at different time points due to matrix effects – or from multiple compounds. Hence, peaks in the same trace need to be distinguished through deconvolution in case they overlap. For a review of feature detection algorithms, please see Zhang et al.[112].
Many methods for feature detection of varying complexity have been published. These range from simple procedural approaches[113–115] to model-based[116] and more abstract approaches using signal segmentation[117] or self-modelling curve resolution[118]. Routinely applied tools use simple detection methods because of their robustness, speed, and ease-of-use (see section 1.1.7). For a single feature, a mass detector's inability to reduce the m/z measurement error to zero, i.e. to achieve a mass accuracy of zero ppm, results in a range of detected m/z values across multiple scans. Because higher-intensity signals yield better mass accuracy than lower-intensity signals due to instrumental limitations, the detected m/z values are intensity-weighted to determine the most precise mass-to-charge ratio of a feature. A feature's representative retention time and intensity are then taken from its apex. An m/z search window – defined by a mass tolerance, typically given in parts per million – limits the maximum allowed m/z deviation of a trace.
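The derivation of a feature's representative values can be sketched as follows; the (rt, m/z, intensity) scan readings of the trace are hypothetical:

```python
# Sketch of a feature's representative values: intensity-weighted mean m/z
# across scans, with retention time and intensity taken from the apex scan.

trace = [(12.1, 146.0590, 1.0e3), (12.2, 146.0592, 8.0e4),
         (12.3, 146.0591, 1.2e5), (12.4, 146.0593, 6.0e4)]

total = sum(i for _, _, i in trace)
# High-intensity scans dominate the weighted mean, as they are more accurate.
weighted_mz = sum(mz * i for _, mz, i in trace) / total
apex_rt, _, apex_intensity = max(trace, key=lambda p: p[2])

print(round(weighted_mz, 4))  # close to the apex reading of 146.0591
print(apex_rt)                # 12.3
```

The low-intensity first scan barely shifts the weighted mean, illustrating why weighting improves the precision of the reported mass-to-charge ratio.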
Overlapping chromatographic traces, i.e. traces of individual ions that lie at or below the mass resolution limit, need to be flagged. The flagged traces are then separated by assigning individual data points to the most likely feature or, if the traces are indistinguishable, via deconvolution. Deconvolution can either be a separate step or part of the feature detection step. Algorithms working on the shape of chromatographic traces try to identify individual features either by finding local maxima[113] or by modelling and fitting (ideal) peak shapes[119].
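A minimal local-maxima peak picker illustrates the simplest of these shape-based strategies; the threshold and data are illustrative:

```python
# Minimal local-maxima peak detection on a chromatographic trace: a point
# is an apex if it exceeds both neighbours and a noise threshold.

def local_maxima(trace, threshold=0.0):
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > trace[i - 1]
            and trace[i] > trace[i + 1]
            and trace[i] > threshold]

# Two partially overlapping peaks with a valley in between:
signal = [0, 2, 9, 4, 3, 7, 12, 5, 1]
print(local_maxima(signal, threshold=1))  # [2, 6]
```

Even this toy example shows why the approach is popular: it is fast and robust, but it resolves overlapping peaks only when a valley separates their apices, which is exactly where model-fitting methods take over.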
Sample Alignment
Retention times of compounds vary from sample to sample due to matrix effects, altered column conditions, pressure differences, and additional technical limitations. These retention time drifts can range from a few to several seconds and pose a major obstacle for cross-sample comparisons of features[120]. Experimentally, retention time drifts can be reduced through column conditioning. Initial column conditioning and between-run column equilibration to the original conditions ensure that column performance remains as constant as possible.
Computationally, algorithms for time warping have predominantly been developed for spectroscopy applications in general. However, the same algorithms can be used for metabolomics LC-MS data. They work on either raw data or extracted features and group signals/features across samples by correlation, accounting for the non-linear nature of retention time deviations. Existing methods are based on time warping[121–123], clustering followed by time corrections[124,125], or variance-based approaches[126].
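The clustering-then-correction idea can be illustrated with a landmark-based, piecewise-linear retention time correction; this is a schematic sketch, not one of the cited algorithms, and the landmark pairs are hypothetical:

```python
# Hedged sketch of landmark-based retention time correction: features shared
# across samples provide (observed_rt, reference_rt) pairs; other retention
# times are corrected by piecewise-linear interpolation between landmarks.

def correct_rt(rt, landmarks):
    """landmarks: (observed_rt, reference_rt) pairs, sorted by observed_rt."""
    (x0, y0), *rest = landmarks
    if rt <= x0:                      # before first landmark: constant shift
        return rt + (y0 - x0)
    for x1, y1 in rest:
        if rt <= x1:                  # interpolate between two landmarks
            frac = (rt - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
        x0, y0 = x1, y1
    return rt + (y0 - x0)             # after last landmark: constant shift

pairs = [(10.0, 10.4), (20.0, 20.2), (30.0, 29.9)]
print(correct_rt(15.0, pairs))  # about 15.3: between the +0.4 and +0.2 shifts
```

Because the shift varies along the run, the correction is non-linear overall, mirroring the non-linear nature of the drifts described above.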
Spectrum Extraction
An extracted spectrum consists of a set of correlated signals. In the ideal case, all signals result from the same molecular species, i.e. the spectrum may contain signals from fragments and adducts as well as ion clusters of that species. Such a spectrum is called a compound spectrum or feature set and can be used in identification[127].
The primary criterion for correlation is retention time. Ions that arrive at the detector simultaneously have either eluted off the chromatographic column together or have formed during the ionisation process. At high chromatographic resolution, these sets of signals result from only a few molecular species. Consequently, the dominant approach to spectrum extraction is the aggregation of signals across individual scans around a given retention time.
More elaborate methods use additional criteria such as the shape of an ion chro- matogram. Signals are only grouped together into a compound spectrum if the
retention time and the elution profile are similar. This enables the separation of co-eluting compounds. Adduct information can also be used for correlation. These methods yield cleaner, less noisy compound spectra[128,129].
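The basic retention-time-based aggregation can be sketched as follows; profile-shape correlation, as used by the more elaborate methods, is omitted for brevity and the (rt, m/z) feature list is hypothetical:

```python
# Sketch of spectrum extraction: aggregate features whose apex retention
# times fall within a small window, treating each group as one compound
# spectrum (candidate fragments, adducts, and clusters of one species).

features = [(12.30, 146.0591), (12.31, 147.0624), (12.33, 168.0410),
            (15.80, 203.0824)]  # (rt, mz)

def group_by_rt(feats, tol=0.1):
    groups, current = [], [feats[0]]
    for f in sorted(feats)[1:]:
        if f[0] - current[-1][0] <= tol:  # close enough to the running group
            current.append(f)
        else:
            groups.append(current)
            current = [f]
    groups.append(current)
    return groups

spectra = group_by_rt(features)
print(len(spectra))     # 2 compound spectra
print(len(spectra[0]))  # the first spectrum contains 3 correlated signals
```

In this toy input, the three signals around rt 12.3 (a putative ion, its isotope, and a sodium adduct pattern, chosen purely for illustration) collapse into one compound spectrum, while the rt 15.8 feature forms its own.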
1.1.5 Data Post-Processing
Data post-processing refers to the statistical analysis and interpretation of processed data. Extracted and aligned features (or compound spectra) can be collected in a feature matrix, where a feature's characteristic mass serves as column header and the samples serve as row identifiers. The values at the sample-feature intersections are intensities. Analysis of the matrix includes data normalization, the annotation of related features and, ultimately, the interpretation of the results[130,131]. Feature annotation includes identification, as discussed in the following subsections.
Statistical Analysis
Statistical analysis methods can be grouped into univariate and multivariate approaches, each offering unique insights into the data. Multivariate analysis works on a matrix of variables and highlights characteristics based on the relationships between all variables. Univariate analysis takes only one variable into account at a time and therefore weights the data differently.
The goal of statistical analysis is the categorisation and prediction of sample properties through generation of models that capture the information contained in data matrices. In mass spectrometry, the m/z-ratio and signal intensity are the two most important variables[132].
Without venturing far into the area of chemometrics, principal component analysis (PCA) and (orthogonal) partial least squares (PLS) are established methods for the multivariate analysis of mass spectrometry data. These methods extract latent variables by maximum variance and by maximum covariance to the dependent variable respectively. These dimensionality-reduction methods can be used in classification, regression, and prediction exercises[133,134]. The quality of statistical
models built from the data depends significantly on data pre-processing as well as scaling and normalization. This requires careful investigation of multiple models for consensus building[135,136].
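A minimal PCA sketch on a mean-centred feature matrix, using numpy's singular value decomposition, illustrates the variance-maximising latent variables; the tiny 3 × 4 matrix is hypothetical:

```python
# Minimal PCA via SVD: scores are the projections of the (centred) samples
# onto the latent variables of maximum variance.
import numpy as np

X = np.array([[1.0, 2.0, 0.5, 3.0],
              [1.2, 1.8, 0.4, 3.1],
              [4.0, 0.2, 2.5, 0.9]])

Xc = X - X.mean(axis=0)          # mean-centring is essential before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T               # sample coordinates in component space
explained = s**2 / np.sum(s**2)  # fraction of variance per component

print(explained.round(3))        # the first component dominates here
```

Two of the three samples are deliberately similar, so the first component captures nearly all variance, which is the typical picture when a single strong effect (e.g. treatment vs control) separates the samples. Scaling choices (unit variance, Pareto) would change this decomposition, which is exactly the sensitivity to pre-processing noted above.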
1.1.6 Identification of Metabolites
Metabolite identification of signals or compound spectra is an important goal of metabolomics mass spectrometry experiments. Identified metabolites yield in-depth biological insight in addition to the information retrieved from spectral fingerprinting. The challenge of metabolite identification lies in the vastness of the tangible chemical space and the limitations of available reference data. As little as 10% of extracted features may be of true biological origin[110], where non-biological features result from adduct formation, clustering, interferents, and noise. The available chemical solution space covers most of those irrelevant features, which increases the chance of false identifications. Even for signals of biological origin, multiple identification results are feasible for a single feature, complicating the ranking of these results. The possibility of narrowing down the solution space through experimental reference data is hampered by the limited amount of reference data and by issues related to cross-comparisons of reference spectra from different instruments or methods.
The Identification Process
The identification process starts from features and compound spectra that are queried against databases containing relevant metabolites and reference spectra. In the case of a single feature, a characteristic m/z-value is used as the query criterion for which, within a given mass tolerance, matching chemical structures are retrieved. Stereoisomers cannot be resolved by mass spectrometry alone because identification methods are mass based. For compound spectra, spectral queries provide a powerful and less generic way to retrieve putative metabolite identifications. Instead of querying a single m/z-value, the complete spectral vector that characterizes a metabolite is used for the search. Spectral queries depend
on databases that contain reference spectra from identical or similar instruments with similar configurations to be reliable, dramatically reducing the available query space. The problem of reference data is more relevant for LC-based than GC-based metabolomics because of the more consistent GC retention times and GC-MS fragmentation patterns[137]. Thus, rich databases are cornerstones for metabolite identification[138]. Queries can return zero to many results – possibly already ranked by similarity – that need to be re-ranked based on additional information and interpreted in a biological context before a single compound can confidently be chosen as the identity of the query feature or compound spectrum. Additional information includes fragmentation spectra, isotope patterns, or orthogonal information such as time-of-flight or retention time. These help to narrow down a list of potential metabolite identifications based on molecule-specific properties. Biological information, e.g. through the utilization of pathway maps or modelling, provides the necessary context to increase the confidence in identifications.
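The single-feature query step can be sketched as a mass lookup within a ppm tolerance against a toy in-memory database; real queries run against resources such as PubChem, and the mass here is assumed to be an already-neutralised monoisotopic mass:

```python
# Sketch of mass-based candidate retrieval: return all database entries
# whose monoisotopic mass lies within a ppm tolerance of the query mass.

database = {
    "glucose":  180.0634,
    "fructose": 180.0634,  # isomers cannot be distinguished by mass alone
    "citrate":  192.0270,
}

def candidates(mass, ppm=10.0):
    tol = mass * ppm / 1e6
    return sorted(name for name, m in database.items()
                  if abs(m - mass) <= tol)

print(candidates(180.0634))  # ['fructose', 'glucose']
```

The two-hit result for a single mass illustrates the ranking problem described above: mass alone cannot decide between hexose isomers, and orthogonal information (retention time, fragmentation) is needed to break the tie.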
Reporting Standards
Capturing the minimum set of information needed to reproduce a metabolomics study is of paramount importance to simplify data exchange and, ultimately, to guarantee high-quality work. Reporting standards outlining the information required for particular technologies such as MS[139] or NMR[140] are under development. These frameworks need to be adopted by the community and consumed by software tools to be effective[141]. To this end, the mzML file format has already started to replace older file formats, and the mzTab file format has been developed for the reporting of identification results. The mzTab file format is still undergoing review within the PSI at the time of this writing. In parallel, an increasing number of software tools support the new file formats, and projects have been launched to collect metabolomics data adhering to minimum reporting standards and to harmonise existing standards further[142,143].
1.1.7 Software
A variety of software tools have been developed for MSn data processing and analysis. Given the complexity of the task, the majority of released software packages focus on individual steps, e.g. feature alignment or noise reduction, while some offer an all-in-one approach. These include processing algorithms as well as methods for metabolite identification or statistical analysis (Table 1.1).
However, even all-in-one tools cannot offer all of the functionality needed because of the heterogeneous nature of metabolomics data and unforeseeable advances in the field. Consequently, pipelines concatenating existing tools are constantly being built[171–174]. These, in turn, act as guides for the development of the next generation of expert all-in-one tools.
Both proprietary and free software libraries can be grouped into three categories: command-line, stand-alone graphical user interface (GUI), and web-based tools. Each offers unique advantages, which are frequently reviewed and discussed in the literature[70,175,176]. A further distinction must be made with regard to the chromatographic method being used. Independent of the actual experimental method, data properties vary for different instruments. Consequently, most tools are optimized for either gas or liquid chromatography, or are even more specific to one technology, e.g. capillary electrochromatography or time-of-flight mass spectrometry. This, in combination with the continuous increase in mass accuracy and throughput of modern machines[177], also acts as a driver for software development in MS. Leaps in technology, such as the advent of two-dimensional gas chromatography instruments (GCxGC), followed by the rise of GCxGC software, e.g. for alignment[178], illustrate this point nicely[179,180].
Table 1.1: Non-commercial mass spectrometry software for metabolomics, divided column-wise into command line (CLI), graphical user interface (GUI), and website-based (Web) tools. The All-in-One group includes software that offers functionality for data processing and analysis or identification; the tools in the Processing and Identification groups focus exclusively on data (pre-)processing and signal identification respectively. Individual groups are sorted in descending order by the number of citations. The citation numbers ('Cited') for the referenced articles are taken from Google Scholar. No distinction has been made between software for MS in tandem with gas or liquid chromatography. The listed tools include XCMS, TargetSearch, X13CMS, MolFind, Metab, MetSign, PyMS, eMZed, AMDORAP, PrepMS, MaSDA, MSeasy, AMDIS, MzMine2, MetAlign, OpenMS, Met-IDEA, MetaboliteDetector, MAVEN, mzMatch, MAIT, TracMass2, MetaboAnalyst, XCMS Web, MZedDB, MetFusion, MetiTree, and TOFSIMS-P.
1.2 Cheminformatics support for Metabolomics
Increasing computational power has enabled the rise of cheminformatics[181]. The principles of cheminformatics – the fusion of computer science and chemistry – were first described in the 1970s and 1980s, but only attracted wide recognition with the dawn of powerful personal computers a couple of decades ago[182]. Similar to bioinformatics, cheminformatics has developed into a separate field of study that penetrates into many areas of modern life science[183].
Data-driven metabolomics inherently depends on cheminformatics[184]. Experimental methods generate information-rich data that at their core describe molecular structures. For example, chemistry databanks such as PubChem[185] and ChemSpider[186], with over 47 million and 29 million structures respectively, serve as back-ends for metabolomics applications. Querying those databases in a semi- or fully-automated fashion and analysing the results is at the very core of cheminformatics. Model building for clustering, prediction of chemicals or chemical properties, and pathway modelling for biological interpretation are further use cases of cheminformatics in metabolomics. Cheminformatics toolkits are in high demand to enable small molecule library management and processing[187]. To this end, many cheminformatics toolkits and scripting frameworks[188,189] have been developed, such as chemf[190], RDKit[191], CDK[192], and OpenBabel[193].
1.2.1 Small Molecule Library Management
The management of a small molecule library comprises the conversion, canonicalization, and normalization of molecular structures as well as the application of search and descriptive algorithms to filter and characterize small molecule libraries[194]. Functionality includes, inter alia, the removal of mixtures, inorganics, and salts, tautomer normalization, pH calculations, substructure searches, and descriptor calculations.
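Two of these filters can be illustrated with crude SMILES string heuristics; a real pipeline would use a toolkit such as the CDK or RDKit rather than string matching, and the SMILES strings below are purely illustrative:

```python
# Hedged sketch of two routine library-management filters on SMILES strings:
# flagging mixtures/salts (disconnected components are separated by '.') and
# flagging inorganics (no carbon atom). These string heuristics are crude and
# stand in for proper toolkit-based perception.

def is_mixture(smiles: str) -> bool:
    return "." in smiles

def has_carbon(smiles: str) -> bool:
    # Crude check: an aromatic 'c', or a 'C' not starting Cl, Ca, Cu, etc.
    for i, ch in enumerate(smiles):
        if ch == "c":
            return True
        if ch == "C" and smiles[i + 1:i + 2] not in ("l", "a", "u", "d", "r", "o", "s", "n"):
            return True
    return False

library = ["CCO", "[Na+].[Cl-]", "O=S(=O)(O)O", "c1ccccc1"]
organics = [s for s in library if has_carbon(s) and not is_mixture(s)]
print(organics)  # ['CCO', 'c1ccccc1']
```

Ethanol and benzene survive the filter; the sodium chloride salt and sulfuric acid are removed, mirroring the mixture/inorganic removal step described above.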
Cheminformatics libraries, such as the aforementioned Chemistry Development Kit (CDK), offer functionality for the bulk of cheminformatics tasks and are
consumed by front-end tools for user interaction[195]. In addition, specialised services exist that focus on single steps such as parsing IUPAC names (OPSIN[196]) or cross-reference and identifier tracking (UniChem[197]). Because the library management process involves many different steps and software, expert tools that aggregate different software have become popular, reducing the number of tools and steps a cheminformatician has to deal with[198,199].
1.2.2 Representation of Small Molecules
Representation concerns the storage, transfer, and visualization of small molecular structures. Over the last 60 years, various systems have been proposed, some of which have become accepted community standards[200]. Representations should encode all relevant information about a chemical structure while being as concise as possible and while remaining efficiently readable by humans or computers. The choice of representation affects the speed, resource requirements, and data handling of cheminformatics tools.
Notations and Conventions
File formats represent small molecular structures in a precisely defined way and serve as the smallest unit for structure storage and transfer. The most fundamental chemical file formats are described below.
The Simplified Molecular Input Line Entry System (SMILES) is one of the most commonly used line notations. Older notations include the less prominent Wiswesser Line Notation (WLN)[201] and the DARC system[202]. SMILES encodes a molecular structure in a single sequential character string that is both human- and machine-readable. In contrast to the other notations mentioned herein, no comprehensive and formal specification of the line notation has ever been published. For this reason, different implementations of SMILES can differ in functionality, making SMILES unreliable when used across different cheminformatics toolkits. This – taking into account that no official canonicalisation model exists
either – is the biggest weakness of SMILES. Issues around SMILES are being addressed by developments in the community sector (OpenSMILES as part of the Blue Obelisk group[203]) and academia[204].
An MDL molfile contains a redundant connection table that stores atom and bond connectivity in an atom block and a bond block respectively. The blocks contain all relevant information about the structure, such as charge, atom stereo parity, and valence. Structure-Data files (SDfiles) extend MDL molfiles, accommodating any number of molecules in a single file.
InChI, the IUPAC International Chemical Identifier, is a standardized open source line notation that uses layers to represent different levels of chemical structure information[205]. These layers encompass constitution (atoms and bonds), charge, stereochemistry, isotopes, fixed hydrogens, and reconnections, i.e. reconnected atoms such as coordinated metal atoms.
The line notation comes in two flavours: the InChI itself and a 27-character hashed representation called the InChIKey – a more condensed representation of the full InChI targeted at database queries. The standardized IUPAC InChI guarantees proper interoperability across different platforms. Its application is currently limited by the range of structures it does not support, e.g. polymers, Markush structures, mixtures, conformers, and topological isomers[206].
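The 27-character layout of the InChIKey (a 14-character skeleton block, a 10-character block covering the remaining layers, and a single protonation character, separated by hyphens) can be checked with a simple regular expression; the first example below is the standard InChIKey of aspirin:

```python
# Sketch of validating the 27-character InChIKey layout: three hyphen-
# separated blocks of 14, 10, and 1 uppercase letters (14 + 1 + 10 + 1 + 1 = 27).
import re

INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

def looks_like_inchikey(key: str) -> bool:
    return len(key) == 27 and INCHIKEY.match(key) is not None

print(looks_like_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))  # True
print(looks_like_inchikey("not-an-inchikey"))              # False
```

Note that this only checks the layout: a well-formed key is not guaranteed to correspond to any real structure, since the hash is not invertible.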
The Chemical Markup Language (CML) is an XML-based file format that encodes a chemical structure in a highly human-readable way[207]. Extending XML, CML is customisable to accommodate any additional information about the structure in a pre-defined manner. This makes CML useful for problems encountered in the area of data persistence.
Graphical Representation
Graphical representations of small molecules act as the interface between the internal digital representation of a molecular structure and the user. The graphical representation therefore needs to depict the molecule correctly and in its entirety. This holds in most cases for the depiction of molecular graphs with
regard to connectivity. In the case of coordination bonds, depictions already start to vary from one toolkit to another. More subtle problems occur with the depiction of stereochemistry and aromaticity, where representations can be misleading for the unprepared user (Figure 1.5).

Figure 1.5: Comparison of graphical representations of chlorophyll f [CHEBI:61290] by (a) Marvin, (b) KNIME-CDK, and (c) RDKit. The aliphatic ester chain is not shown for depiction purposes. The coordination bond shown in Marvin is ignored in KNIME-CDK and drawn as a covalent bond in RDKit. Aromaticity is not visually indicated by Marvin but, partially, in KNIME-CDK and RDKit using circles and dashed bonds respectively.
Whereas most graphical representations of organic molecules are correct and efforts focus on the visualisation of molecule clouds[208] or chemical space[209,210], it is important to be aware of current limitations to efficiently deal with single small molecules in cheminformatics.
1.2.3 Properties of Small Molecules
Descriptors are used to study the physicochemical properties of small molecules. Applied in modelling, these quantitative structure-property relationships (QSPR) enable the clustering and prediction of molecular properties[194]. Descriptors can be grouped into four classes:
• topological descriptors describe properties of the molecular graph in 2D
• geometrical descriptors describe properties of the molecular structure in space (3D)
• electronic descriptors describe the energy and charge state of molecular structures
• hybrid descriptors are combinations of the other classes of descriptors
Hundreds of descriptors exist across different toolkits and are often used in combination for model building[211]. Subtle differences in algorithm implementations and differing molecular representations give QSPR descriptors an involuntary cross-toolkit complementarity. Because descriptors act on the internal representation of a molecular structure to describe its physicochemical properties, a cheminformatics toolkit needs to exercise great care when converting structures into its own molecular representation in order to configure them correctly for QSPR descriptor calculations. For example, using the same aromaticity model across all molecules in a library is essential for consistent results.
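As a concrete example of a topological descriptor, the classic Wiener index – the sum of shortest-path distances between all atom pairs in the hydrogen-suppressed molecular graph – can be computed with Floyd-Warshall:

```python
# Sketch of the Wiener index, a topological (2D) descriptor: the sum of
# shortest-path distances between all pairs of atoms in the hydrogen-
# suppressed molecular graph, computed here with Floyd-Warshall.

def wiener_index(n, bonds):
    INF = float("inf")
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for a, b in bonds:                      # each bond has graph distance 1
        d[a][b] = d[b][a] = 1
    for k in range(n):                      # Floyd-Warshall shortest paths
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return sum(d[i][j] for i in range(n) for j in range(i + 1, n))

# n-butane as a 4-atom chain: pairwise distances 1+2+3 + 1+2 + 1 = 10
print(wiener_index(4, [(0, 1), (1, 2), (2, 3)]))  # 10
```

Note that the result depends only on connectivity, not on coordinates, which is what distinguishes topological from geometrical descriptors in the classification above.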
1.2.4 Workflow Environments for Cheminformatics
Workflow environments have become increasingly popular over the last decade, promising to simplify the integration and coordination of different software packages[212,213]. Researchers often face the challenge of processing and analysing complex data sets. This involves the use of various tools, frequent saving and loading of data in different formats, and data transformation or manipulation[131]. In addition, these activities should be recorded to ensure reproducibility and extensibility. Workflow tools address the problem of orchestrating these processes and offer a potential all-in-one solution, bringing together numerical, textual, chemical, and biological data[214]. The concept behind those tools can be understood as visual programming[215].
This introduction concentrates on platforms that support bioinformatics and cheminformatics tasks, not on generic workflow environments for business intelligence. The first group comprises Galaxy[216], Taverna[217], KNIME[218], and
Pipeline Pilot[219]. Galaxy is a web-based platform that focuses on genomic data and offers only rudimentary cheminformatics functionality, e.g. in the form of “ballaxy”[220]. Taverna, KNIME, and Pipeline Pilot are desktop applications, with the latter being the de facto standard for cheminformatics. Pipeline Pilot, developed by Accelrys, specifically targets bio- and cheminformatics needs in the life sciences. In contrast, the free-of-charge open-source workflow management systems Taverna and KNIME are targeted more generically at workflow generation for data transformation, largely relying on contributions from the scientific community in the form of plug-ins and shared workflows[221].
Most workflow platforms follow the same principle (Figure 1.6). Tasks or processes are carried out by discrete entities that – depending on the platform – are
Figure 1.6: Schematic of the principle mechanism of workflow platforms. Data is either loaded from external sources or created within the workflow environment. Loaded data is then sequentially passed on to individual entities that carry out their tasks (Task 1, Task 2, et cetera). Workflows can include branches as indicated with Task 3.I and Task 3.II, where different intermediate results are generated. Task results are stored for persistence or usage outside the workflow environment. The middle and lower part of the figure depicts workflows that follow the same pattern from Taverna v2.3 and KNIME v2.7 respectively.
called nodes, workers, components, et cetera. These entities either create input, e.g. by reading a file or querying a database, or take input from another entity. Entities can also output data – typically after an operation has been applied to the data – or remove data by storing it outside the platform. The way in which data is transferred from one entity to another depends on the platform; for example, the transfer can be file-based or tabular. The parameters of a specific function are typically set through an entity's configuration dialogue and define the behaviour of the function for that entity. A workflow is made up of individual entities that are connected to each other under the constraints of their input and output requirements. Workflows are intrinsically linear, i.e. execution flows from an input to an output operation, but allow for branching. More advanced structures, such as loops, are supported in some environments, rendering the pipeline design process more flexible. On workflow execution, entities in a pipeline execute either sequentially – one after another – or in batch, where pieces of data are processed downstream as they become available. Intermediate results can be stored for inspection.
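The execution principle described above can be sketched as a minimal, hypothetical workflow engine. Entities are chained, each consumes the previous entity's output, and every intermediate result is cached for inspection; the class names and the list-of-doubles "table" are illustrative simplifications, not any real platform's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Schematic sketch of sequential workflow execution: chained entities
// ("nodes") each transform the previous output, and the engine caches every
// intermediate result. All names are illustrative.
public class MiniWorkflow {
    private final List<UnaryOperator<List<Double>>> nodes = new ArrayList<>();
    private final List<List<Double>> intermediates = new ArrayList<>();

    /** Appends an entity to the pipeline. */
    public MiniWorkflow then(UnaryOperator<List<Double>> node) {
        nodes.add(node);
        return this;
    }

    /** Executes nodes one after another, caching each intermediate result. */
    public List<Double> run(List<Double> input) {
        List<Double> data = input;
        for (UnaryOperator<List<Double>> node : nodes) {
            data = node.apply(data);
            intermediates.add(data); // snapshot after every execution step
        }
        return data;
    }

    /** Cached snapshots, one per executed entity. */
    public List<List<Double>> intermediates() { return intermediates; }
}
```

A pipeline is then composed declaratively, e.g. `new MiniWorkflow().then(scale).then(filter).run(data)`, mirroring how nodes are wired together on the canvas of a workflow environment.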
In summary, workflow environments enable scientists to build complex data processing and analysis pipelines, record how the data is processed, and share their workflows, all through the concept of visual programming. Whereas properties like user-friendliness, ease of debugging, and the ability to inspect intermediate results are distinct advantages of workflow platforms, disadvantages include the limited functionality offered, which forces scientists to go beyond their pipelines, the need to cache data, which impedes scalability, and the initial overhead of implementing new entities[214].
Konstanz Information Miner
The Konstanz Information Miner (KNIME)[218] is a Java-based open-source workflow platform that supports a wide range of functionality. It has an active developer community with plug-ins for bio- and cheminformatics[150,222–224]. For a detailed description of the KNIME data analysis platform see [225].
Figure 1.7: Screenshot of the KNIME workbench with an opened project. The Editor in the middle serves as work pane where workflows can be built via addition of nodes from the Node Repository via drag-and-drop. Supporting information about a selected node is displayed in the Node Description pane. The KNIME Explorer and Console hold the list of workflows and system messages respectively.
The KNIME workbench follows a simple layout, as can be seen in Figure 1.7, comprising the following components: A KNIME Explorer holding all workflows, a Node Repository listing all available entities (nodes), a Console that displays system messages, a Node Description that displays additional information about any selected node, and an Editor where individual nodes can be put together to build a workflow. Nodes can be added to the editor pane via drag-and-drop from the Node Repository.
KNIME follows a sequential execution pattern. Nodes in a pipeline are executed one after another after previous nodes have finished execution. Intermediate results are cached and, depending on their size, serialized to disk. Nodes exchange
data in tabular format with rows and columns, where every column must have a data type specification. Rows consist of one to many cells that span one to many columns. A cell is an atomic unit of information that cannot be split further. Each cell has a data type that defines its content based on the column it is in. It is important to note that KNIME nodes can only deal with cell types that match their input requirements.
Internal KNIME plug-ins are available via the standard built-in repository. External third-party plug-ins can be added either from an archive file or from an update site. An official community repository and forum are available to support the development of third-party plug-ins.
KNIME is developed in Java version 1.7 and is based on the Eclipse framework, an integrated development environment. Consequently, individual nodes and full plug-ins can be developed against the Eclipse framework and the Java Software Development Kit. The development process is guided by a node extension wizard that creates four Java classes (Table 1.2). These four Java classes determine the node behaviour and are sufficient for node development. More advanced structures are documented in the KNIME Application Programming Interface.
Java class – Responsibilities
NodeModel.java – validates dialog and node port parameters, saves/loads internal settings, and defines the main function body
NodeDialog.java – defines the settings window and node parameters
NodeView.java – defines a custom view of the stored data
NodeFactory.java – orchestrates instantiation of the model, dialog, and view
Node.xml – markup-language based node description
Table 1.2: Java classes for KNIME node development. The KNIME node extension wizard facilitates the development of external nodes. The four base classes created by the wizard are listed including their main responsibilities, i.e. the behaviour that can be defined. In addition, a markup language file is created that contains meta information about the node.
1.3 Aim of this Thesis
The aim of this thesis is to provide effective yet usable software tools that simplify data analysis and exploration for the general scientific community. By narrowing the gap between metabolomics data and metabolomics data analysis, I attempt to bridge life science and computer science. Informed data processing and analysis is a difficult problem that demands expert knowledge and time-intensive manual labour. The plethora of data available and the accelerating pace at which new data is acquired increase the demand for semi- or fully-automated pipelines that make the data analysis process more manageable and efficient.
The non-linear, multi-parametric metabolic responses captured in metabolomic snapshots – typically taken from two or more sample groups – contain crucial information that could be used in early-stage detection of disease. However, the quality of the available data depends on the design of a study, its sample set-up, instrumental configuration, et cetera. These factors demand attention to detail from an analyst, but also require adequate software support[3].
To achieve the aim of narrowing the gap between metabolomics data and metabolomics data analysis, the objectives are to develop a modular library for processing of LC-MSn data and to integrate this library as an external plug-in in a workflow environment. Methods for LC-MSn-based metabolite identification are to be implemented to broaden the scope of the application and address the need for identification frameworks that go beyond current metabolomics databases[17,38]. Visual programming is to be explored as the method of choice to make these data processing functions accessible to analysts without programming experience and to ensure reproducibility of analysis[226]: the building-block approach of the chosen workflow platform, KNIME, is to serve as platform to combine the tool's functionality with more generic, already existing methods. In addition, cheminformatics methods are to be explored to enable small molecule management, ultimately feeding back into LC-MSn analysis.
CHAPTER 2
Informatics for LC-MSn Analysis
2.1 Introduction
Metabolomics studies aim to characterise biological samples and identify metabolites[11,12]. To investigate the small molecule complement of organisms, liquid chromatography coupled to tandem mass spectrometry (LC-MSn) is a routinely used technique. LC-MSn is applied in profiling, fingerprinting, or untargeted mode[14] in a variety of areas including environmental[39], plant[58,63], and biomedical research[227]. Modern LC-MSn systems can detect more mass traces than ever before thanks to high mass accuracy (ma ≤ 2 ppm[33]) and resolution (R ≥ 100,000[34]), producing complex, information-rich data for every sample.
Data typically consists of mass-to-charge (m/z), time, and intensity triplets that describe for every detected ion mass the strength of the ion beam and the time it is detected by the spectrometer. Processing and interpreting these data matrices is extraordinarily difficult because of the high dynamic range, chemical diversity, and metabolite numbers typically found in metabolome samples. The partially
convoluted, densely populated signal landscape contains systematic and random noise amongst true signals of varying intensity and shape. Additionally, the formation of ion clusters and adducts, as well as fragmentation, implies that many of the extracted peaks or features can belong to the same compound.
Both proprietary and free libraries addressing the data processing problem have been developed and are routinely applied in LC-MSn metabolomics[37,175]. Due to the inherent nature of metabolomics data, processing and analysis require complex workflows, bundling different programs, traversing parameter space, pulling in additional information from databases, and performing multivariate statistical analysis. Consequently, pipelines have been built concatenating existing tools[93,155,171,228].
Workflow platforms such as the Konstanz Information Miner (KNIME)[218] offer the potential for an all-in-one solution. OpenMS[150], a library for LC-MS data management and analyses, primarily geared towards proteomics, has already been added to KNIME’s bioinformatics suite. Workflow-based data processing can be described as visual programming. It has the advantage of ease-of-use for computational and experimental scientists alike and enables rapid development of complex pipelines while maintaining flexibility due to modularity.
We have developed MassCascade and its plug-in MassCascade-KNIME, a library and node suite for stepwise LC-MSn metabolomics data processing. In the following sections, we give an overview of the architecture of the library and plug-in, summarise the implemented features, and demonstrate the performance and advantages of a unified workflow environment for fingerprinting.
2.2 MassCascade’s Implementation
The library MassCascade comprises various methods for data processing, visualisation, and feature identification. Each method is implemented using Java's concurrency framework[229] for multi-threading to increase execution speed. The library is thread-safe and can be executed in a server environment. MassCascade has been developed in the Java programming language, version 1.7.
Structure
MassCascade is built around a set of essential instances that represent abstract mass spectrometry entities. These core instances are passed between processing methods (Figure 2.1). The MS Data type contains raw data taken from individual scans, the Feature Data type contains extracted features and associated annotations, and the Feature Set type contains a collection of correlated features, i.e. compound spectra. They are defined as follows: S (scan) is a set of m/z-intensity value pairs s_i at a given acquisition time t:
S_t = \{s_1, s_2, s_3, \dots, s_m\} \quad (2.1a)
s_i = (m/z, I) \quad (2.1b)
For a given m/z, F (feature) is a triplet containing a time vector t, an intensity vector I, and a retention time rt. The time and intensity vectors are of identical length and represent the chromatographic profile of the feature. The retention time is the characteristic time of the feature, typically indicating the apex of the chromatographic profile. FS (feature set) is a set of features F at a time point t, which is the consensus ‘retention time’ of the complete feature set:
F_{m/z} = (\vec{t}, \vec{I}, rt) \quad (2.2)
FS_t = \{F_{m/z_1}, F_{m/z_2}, F_{m/z_3}, \dots, F_{m/z_n}\} \quad (2.3)
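Under the definitions above, the three core data types can be sketched in plain Java. The record-based classes below are illustrative simplifications (they require a modern Java version with records, unlike the library's Java 1.7 target) and are not MassCascade's actual classes:

```java
import java.util.List;

// Plain-Java sketch of the three core data types defined in eqs. 2.1-2.3.
// All class and field names are illustrative simplifications.
public class CoreTypes {
    /** One m/z-intensity pair s_i within a scan (eq. 2.1b). */
    public record SignalPoint(double mz, double intensity) {}

    /** S_t: the set of signal points acquired at time t (eq. 2.1a). */
    public record Scan(double time, List<SignalPoint> points) {}

    /** F: time vector, intensity vector, and characteristic rt (eq. 2.2). */
    public record Feature(double mz, double[] times, double[] intensities, double rt) {}

    /** FS_t: correlated features sharing a consensus retention time (eq. 2.3). */
    public record FeatureSet(double rt, List<Feature> features) {}
}
```

Because the types are pure data containers, any processing method can be modelled as a function from one container to another, which is exactly the property the library exploits for serialisation between steps.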
Each method takes a set of parameters including one or many MS instances, applies the method's function to the instance, and returns a new MS instance. These instances are essentially data containers. This way, a snapshot of the data can be serialised to disk after any processing step if required. That is essential for workflow environments, where intermediate results need to be accessible after every execution step.
Frequent disk input and output operations are not desirable, however, because they slow down execution. Unless constrained by computer memory, the library can be run in memory mode. All MS instances can be either file- or memory-based, i.e. their underlying data is serialized to external files or kept in memory
Figure 2.1: Overview of MassCascade data types and their representations. The MS Data type contains raw data taken from individual scans: S (scan) is a set of m/z-intensity value pairs s_i at a given scan time t. The Feature Data type contains extracted features and associated annotations: for a given m/z, F (feature) is a triplet containing a time vector t, an intensity vector I, and a retention time rt. The Feature Set type contains feature sets FS: a set of features F at a time point t.
(Figure 2.2). The latter mode is desirable for server-side applications because of its increased execution speed. To facilitate file- or memory-mode selection, respective ContainerBuilder implementations automatically propagate the correct mode from one method to another when a new container is generated. The builders only need to be explicitly utilized once, at the beginning of the execution script.
Core Framework
The core library contains methods for LC-MSn data processing. As outlined above, every method takes an instance of a data type and returns another instance of the same or a different data type. For example, the ProfileBuilder method accepts an MS Data instance, extracts features, and returns a Feature Data instance. Each method is defined as a predefined constant (enum type) in CoreTasks that defines the method's Java class and unique identifier. By default, each method also extends CallableTask. This interface represents the
Figure 2.2: UML diagram of MassCascade's data structure. The library can run in either file- or memory-mode, determining whether the data of the employed data containers is serialized to disk or kept in memory. For every data type, a file- and memory-container exists. The respective FileContainerBuilder and MemoryContainerBuilder ensure automatic propagation of the selected mode when different methods of the library are used in succession.
contract that all methods must follow within the core framework. This design pattern guarantees seamless usage and integration of the tool in various environments.
Input and output parameters are clearly defined in the methods' annotations, provided in the form of JavaDoc for the core library. Parameters can be passed to methods via a ParameterMap as Parameter-value pairs (Figure 2.3). This highly verbose way of adding parameters to methods facilitates easy usage and readability of written code, essential for maintainability. Parameters are centrally defined in the Parameter class with a short description and – where applicable – default values.
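The parameter-passing style described above can be sketched as follows. The enum constants and class names are illustrative stand-ins, not MassCascade's actual Parameter or ParameterMap API:

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch of verbose parameter passing: parameters are centrally defined
// constants, optionally with defaults, and are handed to a method as explicit
// parameter-value pairs. All names are illustrative.
public class ParameterMapDemo {
    /** Centrally defined parameter constants (stand-in for a Parameter class). */
    public enum Param { DATA_FILE, MZ_TOLERANCE, MIN_INTENSITY }

    public static class ParameterMap {
        private final Map<Param, Object> values = new EnumMap<>(Param.class);

        public ParameterMap put(Param p, Object value) {
            values.put(p, value);
            return this;
        }

        @SuppressWarnings("unchecked")
        public <T> T get(Param p, T defaultValue) {
            return values.containsKey(p) ? (T) values.get(p) : defaultValue;
        }
    }

    /** A task reads its settings from the map, falling back to defaults. */
    public static double resolveTolerance(ParameterMap params) {
        return params.get(Param.MZ_TOLERANCE, 10.0); // hypothetical default
    }
}
```

The explicit `put(Param.MZ_TOLERANCE, 5.0)` call sites make the configuration of every processing step readable directly from the calling code, which is the maintainability benefit the text refers to.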
Visualisation Framework
The visualisation framework enables data inspection for data type instances. It comprises a spectrum viewer to visualise spectra and chromatograms. Textual annotations can be added on demand. Data traces can be displayed with various styles: impulse, line, polyline, spline, and point. The viewer also offers common interactive viewing functions: zooming, highlighting, and coordinate labels.
Figure 2.3: Example of JavaDoc documentation for the PsiMzmlReader method (left) and minimalistic source code to read a data file in mzML file format (right). The parameter requirements of the reader method are defined in its documentation. A memory-based instance is chosen to serve as container for the read raw file. The file and data type instance are passed to the reader method via a ParameterMap. The map uses pre-defined parameters that correspond to the parameters defined in the documentation.
In addition, the visualisation framework contains a set of pre-defined tables for block-like use in software applications that consume the library. Caching and lazy loading have been used to enable responsive user-table interactions, e.g. with large numbers of features loaded.
Identification Framework
The identification framework contains methods for metabolite identification, including web-based and local services for metabolite and spectra databases. Extracted features can be queried either via m/z lookups or via spectral queries. Pseudo spectra contained in the Feature Set type serve as input for the latter query type.
After database lookup, putative metabolite identifications can be ranked based on isotope, adduct, and tandem mass spectrometry information using a custom scoring scheme. Aggregated information is used to filter out irrelevant metabolites and narrow down the search space. Identified metabolites and signal lists can be exported for further analysis with other programs.
2.3 MassCascade’s Functionality
MassCascade's methods are grouped into pre-processing, processing, and post-processing. Pre-processing deals with the manipulation of raw data up to the point of feature extraction. Processing addresses the extracted features and ends with post-processing, the annotation and identification of signals or compound spectra. Throughout development, care was taken to incorporate adaptive approaches that work close to the raw data and avoid binned data. Binned data offers many advantages, inter alia increased execution speed and lower memory costs, but introduces the danger of lowering the granularity of high-resolution data beyond informative thresholds, thus deleting information. Wherever binned data is an algorithmic prerequisite, equidistant binning with non-flexible boundaries is utilized. Equidistant binning creates equally spaced bins; values that fall within a particular bin are integrated and assigned to that bin. Default parameters are provided for all implemented methods, but these may need refinement based on the quality of the data.
2.3.1 Data Pre-Processing
The library supports the HUPO PSI mzML 1.1.0 specification[96], superseding the older mzXML[94] and mzData[95] formats, and Thermo Scientific's RAW file format. The methods have been optimised for centroid data. Input data should be centroided with one of the many available file converters, such as ProteoWizard[230], or by using the implemented wavelet-based centroider after profile data has been read in. The wavelet method uses a Ricker wavelet of the form:
\psi(t) = \frac{2}{\sqrt{3\sigma}\,\pi^{1/4}} \left(1 - \frac{t^2}{\sigma^2}\right) e^{-t^2/(2\sigma^2)} \quad (2.4)
ψ denotes the wavelet with offset t and spread σ. The Ricker wavelet is the negative normalized second derivative of a Gaussian function. This wavelet is convolved over every scan to find centroids in the profile signals, assuming the ideal case where these follow a Gaussian distribution[231].
Figure 2.4: Overlay of a section of a scan in profile mode (blue, with envelope), wavelet centroided (red), and vendor centroided (green). The vendor method picks up two additional centroids (highlighted) and – for the largest centroid – sits at the apex of the profile envelope. The wavelet algorithm finds only one centroid and chooses the closest discrete signal of the profile set, resulting in an m/z deviation of 0.0002.
The resulting centroid of a single profile does not represent an interpolated cen- troid from the apex of the distribution but the closest discrete signal taken from the profile (Figure 2.4). This can result in a small m/z error in the region of ∆m/z ≈ 0.0001.
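A minimal sketch of the wavelet centroiding step, assuming a single peak per window: the Ricker wavelet of eq. 2.4 is convolved over the profile intensities and the discrete point with the strongest response is selected, mirroring the closest-discrete-signal behaviour described above. All names are illustrative:

```java
// Sketch of wavelet-based centroiding: the Ricker wavelet (eq. 2.4) is
// convolved across a profile signal and the discrete point with the strongest
// response is taken as the centroid. Illustrative single-peak version.
public class RickerCentroid {

    /** Ricker ("Mexican hat") wavelet, eq. 2.4. */
    public static double ricker(double t, double sigma) {
        double a = 2.0 / (Math.sqrt(3.0 * sigma) * Math.pow(Math.PI, 0.25));
        double r = (t * t) / (sigma * sigma);
        return a * (1.0 - r) * Math.exp(-r / 2.0);
    }

    /** Index of the profile point with the maximum wavelet response. */
    public static int centroidIndex(double[] intensity, double sigma, int halfWidth) {
        int best = -1;
        double bestResponse = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < intensity.length; i++) {
            double response = 0;
            for (int k = -halfWidth; k <= halfWidth; k++) {
                int j = i + k;
                if (j < 0 || j >= intensity.length) continue;
                response += intensity[j] * ricker(k, sigma);
            }
            if (response > bestResponse) { bestResponse = response; best = i; }
        }
        return best;
    }
}
```

Because the maximum response is taken at a discrete profile index, the returned centroid is the closest measured signal rather than an interpolated apex, which explains the small ∆m/z deviation noted above.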
Systematic and random noise reduction are supported following an approach outlined by Zhu et al.[102]. The algorithm consists of two parts:
1. Systematic noise (background) can be subtracted through a blank reference sample containing only solvent. Instrument-dependent drifts and background ion traces, e.g. from the solvent or ubiquitous contaminants, are represented in these blank injections. Subtraction of a blank reference from a sample effectively removes these spurious traces. This method assumes that systematic noise or background is linearly added to the desired signals and can thus be removed via subtraction of a blank reference. For every
data point s_i in every scan of a sample, a two-dimensional control area is created in the blank reference for a given time range ∆t and m/z tolerance range ∆m/z (Figure 2.5a). A data point is considered noise and removed
if a similar data point s_j is present in the control area of the reference:
(m/z)_{s_i} - \Delta m/z \le (m/z)_{s_j} < (m/z)_{s_i} + \Delta m/z \quad (2.5)
t_{s_i} - \Delta t \le t_{s_j} < t_{s_i} + \Delta t \quad (2.6)
2. Data points resulting from random noise are removed by inspecting adjacent scans in the same sample for similar data points (Figure 2.5b). A data point s_i is considered noise and removed if it does not have a chain of adjacent neighbours S = {s_j, s_j+1, ..., s_j+n} within a given m/z tolerance ∆m/z whose total length, i.e. the length of the putative ion trace, reaches a scan number threshold n. In addition, the data point or at least one of its adjacent neighbours has to exceed a minimum intensity threshold I_min. A data point is thus retained if:

(m/z)_{s_i} - \Delta m/z \le (m/z)_{s_j} < (m/z)_{s_i} + \Delta m/z \quad (2.7)
\mathrm{length}(S) \ge n \quad (2.8)
I_{max}(S) \ge I_{min} \quad (2.9)
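The random-noise criterion of eqs. 2.7-2.9 can be sketched as follows, taking a putative ion trace (a chain of m/z-matched data points in adjacent scans) as input. Method and parameter names are illustrative simplifications:

```java
// Sketch of the random-noise criterion (eqs. 2.7-2.9): a chain of m/z-matched
// data points in adjacent scans is kept only if it spans at least n scans and
// at least one of its points exceeds the minimum intensity threshold.
// Illustrative names and simplified inputs.
public class RandomNoiseFilter {

    /** m/z-match test between two data points (eq. 2.7). */
    public static boolean withinTolerance(double mzI, double mzJ, double tolerance) {
        return mzJ >= mzI - tolerance && mzJ < mzI + tolerance;
    }

    /** Returns true if the chain of adjacent neighbours qualifies as signal. */
    public static boolean isSignal(double[] chainIntensities, int minScans, double minIntensity) {
        if (chainIntensities.length < minScans) return false; // eq. 2.8 violated
        double max = Double.NEGATIVE_INFINITY;                // I_max(S)
        for (double intensity : chainIntensities) max = Math.max(max, intensity);
        return max >= minIntensity;                           // eq. 2.9
    }
}
```

For example, a five-scan chain peaking at 90 counts passes a threshold of n = 5 and I_min = 50, while a two-scan spike of the same height does not.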
Data filters operating in all three domains – time, m/z, and intensity – allow raw data trimming. The chromatographic element of an LC-MS system can be significantly inconsistent at the beginning and end of each run due to column properties and the mobile phase mixture typically used during those phases. Regions above or below a certain m/z value may not be of interest experimentally, or signals below a certain minimum intensity that have not been filtered out by the instrument may be considered noise a priori. Data filters crop the data based on threshold values or ranges and remove unwanted or unreliable data segments.
Feature extraction refers to the detection and isolation of distinct ion traces. Features are built using a simple adaptive incremental approach. Scans are visited in chronological sequence. Every data point in a scan is either assigned to a previously created ion trace – originating at an earlier scan – or becomes the source of a new ion trace. Only one data point per scan can be assigned to an ion trace.
(a) Systematic noise removal (b) Random noise removal
(c) Total ion chromatograms before and after noise removal
Figure 2.5: Illustration of noise removal pre-processing methods. (a) Time versus m/z plot of signals pre (blue) and post (red, covering blue) systematic noise removal. Blue and green bands indicate the m/z tolerance window of the blue and green trace. An ion trace of the background reference (green) falls within the window of the upper trace around 119.085. Consequently, the sample ion trace is removed. (b) Time versus m/z plot of signals pre (blue) and post (red, covering blue) random noise removal. Traces shorter than eight successive scans are removed (circled). (c) Total ion chromatograms of the pre- (blue), post-processed (red) sample and background reference (green). Noise reduction dramatically reduces baseline distortions while leaving large features intact.
Assignment is conditional on the m/z value of the data point being closest to the weighted average of all m/z values captured by the ion trace from previous scans and falling within the given m/z tolerance calculated around the trace’s weighted average (Figure 2.6). The weighted average is updated on every assignment based
Figure 2.6: Illustration of the feature extraction process. Data points of the ion trace m/z = 119.0984 are shown with regard to their individual m/z values (red) and intensity (blue). The red points demonstrate the m/z measurement error. The course of the characteristic m/z value determined by a weighted average is shown in green. After the trace's maximum, the value is set to 119.0984. Subsequent low-intensity values only introduce marginal deviations.
on the data point's intensity, placing higher emphasis on high-intensity, i.e. more accurate, m/z values. Finally, an ion trace qualifies as a feature if it exceeds a given minimum length (number of adjacent data points) and the distribution's apex is above a minimum intensity threshold.
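The incremental trace-building rule described above can be sketched as follows: each new data point updates the trace's characteristic m/z as an intensity-weighted average, so high-intensity (more accurate) points dominate. The class is an illustrative simplification, not the library's implementation:

```java
// Sketch of incremental ion-trace building: the trace's characteristic m/z is
// an intensity-weighted average, updated on every assignment, so high-intensity
// points dominate. Illustrative names.
public class IonTrace {
    private double weightedMzSum = 0; // running sum of m/z * intensity
    private double intensitySum = 0;

    /** Assigns a data point to this trace and updates the weighted average. */
    public void add(double mz, double intensity) {
        weightedMzSum += mz * intensity;
        intensitySum += intensity;
    }

    /** Intensity-weighted average m/z of the trace so far. */
    public double centerMz() {
        return weightedMzSum / intensitySum;
    }

    /** True if a candidate point falls within the tolerance around the centre. */
    public boolean accepts(double mz, double tolerance) {
        return Math.abs(mz - centerMz()) <= tolerance;
    }
}
```

A low-intensity point with a poor m/z measurement barely moves the centre once a high-intensity apex has been assigned, which is exactly the behaviour shown in Figure 2.6.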
To eliminate baseline drifts in the absence of a blank reference, a morphological Top Hat filter acting on individual features has been implemented. The specific implementation is described by Herk et al.[232]. Top Hat by opening is applied, i.e. erosion followed by dilation, with a structuring element B on a feature f (Figure 2.7). Erosion shaves off the peaks and reduces their width. Dilation then widens the flattened peaks, reconstructing the shape without the shaved-off peaks. By applying dilation after erosion, i.e. opening, the baseline under the peaks is first estimated and can subsequently be subtracted from the original feature. The structuring element is characterised by a window width in the time domain where the baseline drift occurs. The element itself can be understood as a simple window that defines the granularity of the method: a wider window is more coarse. Peaks within the window width are flattened and the resulting baseline estimate is more conservative.
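A sketch of Top Hat baseline correction by opening with a flat structuring element follows; it uses a naive sliding min/max window rather than the fast scheme cited in the text, and all names are illustrative:

```java
// Sketch of Top Hat baseline correction by opening (erosion followed by
// dilation, eqs. 2.10-2.13) with a flat structuring element of the given
// window width. Naive sliding-window version for clarity.
public class TopHatFilter {

    static double[] erode(double[] f, int width) {
        return slide(f, width, true);
    }

    static double[] dilate(double[] f, int width) {
        return slide(f, width, false);
    }

    // Sliding min (erosion) or max (dilation) over a centred window.
    private static double[] slide(double[] f, int width, boolean min) {
        int half = width / 2;
        double[] out = new double[f.length];
        for (int i = 0; i < f.length; i++) {
            double v = min ? Double.POSITIVE_INFINITY : Double.NEGATIVE_INFINITY;
            for (int j = Math.max(0, i - half); j <= Math.min(f.length - 1, i + half); j++) {
                v = min ? Math.min(v, f[j]) : Math.max(v, f[j]);
            }
            out[i] = v;
        }
        return out;
    }

    /** f' = f - (f ∘ B): subtract the opening, i.e. the baseline estimate. */
    public static double[] topHat(double[] f, int width) {
        double[] baseline = dilate(erode(f, width), width);
        double[] corrected = new double[f.length];
        for (int i = 0; i < f.length; i++) corrected[i] = f[i] - baseline[i];
        return corrected;
    }
}
```

A narrow peak on a flat offset of 10 counts is returned on a zero baseline with its height reduced by exactly that offset, while the peak shape itself is untouched.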
Figure 2.7: Illustration of the TopHat algorithm. The input chromatogram (1.Input) is eroded with a structuring element of length six resulting in the ‘peak-shaved’ green line (2.Erosion). Dilation takes the eroded data and reconstructs the original shapes without the peaks (3.Dilation). Subtraction of the estimated baseline (3.Dilation) from the original input (1.Input) yields the baseline-corrected chromatogram (4.Output).
\text{Dilation:}\; (f \oplus B)(x) = \max\{f(x - x') \mid x' \in B\} \quad (2.10)
\text{Erosion:}\; (f \ominus B)(x) = \min\{f(x + x') \mid x' \in B\} \quad (2.11)
\text{Opening:}\; f \circ B = (f \ominus B) \oplus B \quad (2.12)
\text{TopHat:}\; f' = f - (f \circ B) \quad (2.13)
Note that ⊕ and ⊖ indicate operations that should be understood as additions and subtractions in a non-numeric context. Here, the application of the structuring element widens and reduces the peak shapes. The symbol ‘◦’ indicates function composition. The baseline subtracted from the original feature gives the new baseline-corrected feature.
The remaining data, after noise reduction, feature extraction, and – if required – baseline correction, still contain many biologically irrelevant features. To further narrow down the solution space and extract only high-quality ion traces, two criteria have been implemented: the mass chromatographic quality from the Component Detection Algorithm (CODA) and the Durbin Watson score, based on a combination of CODA with a Durbin Watson statistic[233].
The mass chromatographic quality ‘mcq’ describes the similarity between a raw (length-scaled x(λ)_i) and a smoothed (auto-scaled x(r, s)_i) version of a feature. x_i denotes the intensity value at position i in the feature's intensity vector. λ, s, and r indicate that the value is length-scaled, standardised, or smoothed. Higher-quality features that are less likely to be random variations have a score closer to one: the smoothed and raw versions of the signal region are almost identical. Noisy features show greater discrepancy and give a score closer to zero. The window width w defines the granularity of the method in the time domain and n corresponds to the total number of data points of the feature.
mcq = \frac{1}{\sqrt{n - w}} \sum_{i=1}^{n-w+1} x(\lambda)_i \, x(w, s)_i \qquad \{mcq \in \mathbb{R} \mid 0 \le mcq \le 1\} \quad (2.14a)
x(\lambda)_i = \frac{x_i}{\sqrt{\sum_{i=1}^{n} x_i^2}} \quad (2.14b)
x(r)_i = \frac{1}{w} \sum_{k=i}^{i+w-1} x_k \quad (2.14c)
x(s)_i = \frac{x_i - \mu}{\sigma} \quad (2.14d)
The mass chromatographic quality gives good results for baseline-free ion traces. If solvent signals or baseline artefacts are present, the mcq score becomes disproportionately worse because the smoothed version of the feature appears to contain more random signal. To reduce this problem, the Durbin Watson score ‘dw’ has been implemented as an alternative (Figure 2.8). It acts on elements x′ of the first derivative of a feature and summarizes the relative change in the feature through the normalized sum of squared intensities. For noisy features, the relative change gives higher dw values. The window width w defines the granularity of the method in the time domain.
dw = \frac{\sum_{i=2}^{w} (x'_i - x'_{i-1})^2}{\sum_{i=1}^{w} x'^2_i} \qquad \{dw \in \mathbb{R} \mid 0 \le dw \le \infty\} \quad (2.15a)
x'_i = x_{i+1} - x_i \quad (2.15b)
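The Durbin Watson score of eqs. 2.15a/b can be sketched as follows, with the window spanning the whole feature for brevity. Names are illustrative:

```java
// Sketch of the Durbin Watson feature-quality score (eqs. 2.15a/b) computed
// on the first derivative of a feature's intensity vector; smooth traces give
// low scores, noisy traces high ones. Window spans the whole feature here.
public class DurbinWatson {

    public static double score(double[] intensity) {
        int n = intensity.length - 1;
        double[] d = new double[n];              // x'_i = x_{i+1} - x_i (eq. 2.15b)
        for (int i = 0; i < n; i++) d[i] = intensity[i + 1] - intensity[i];
        double num = 0, den = 0;
        for (int i = 1; i < n; i++) num += Math.pow(d[i] - d[i - 1], 2);
        for (int i = 0; i < n; i++) den += d[i] * d[i];
        return num / den;                        // eq. 2.15a
    }
}
```

A smooth triangular trace scores well below a rapidly alternating one, which is the property the dw filter exploits to discard irregular features.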
(a) Total ion chromatograms before and after Durbin Watson filtering
(b) (Total) ion traces pre-filter (c) (Total) ion traces post-filter
(d) Bar chart summarising pre-processing steps (number of features remaining after each stage: Raw, Systematic Noise, Scan Filter, Random Noise, Feature Extraction, Durbin Watson)
Figure 2.8: Illustration of Durbin Watson filtering and bar chart summary of pre-processing steps. (a) Total ion chromatograms of the pre- (blue) and post-filtered (red) sample using a strict Durbin Watson filter (dw = 1.5). The green circles highlight areas of difference. (b) Close-up of the unfiltered area around 224 s. (c) Close-up of the filtered area around 224 s. In (b) and (c) the total ion trace is shown in black, individual features are shown coloured. Irregular features are removed by the Durbin Watson filter, leaving only well-behaved features. (d) Bar chart summarising the decrease of the number of features during pre-processing. The 260-fold reduction demonstrates the necessity of pre-processing for efficient data handling.
The threshold values for the CODA and Durbin-Watson filters depend on the individual sample sets and need to be adjusted accordingly. For data exploration, traversing percentiles of the total score distribution is recommended: this way, the impact of lower-quality signals, which may still be relevant, can be evaluated on the final model.
A Savitzky-Golay filter has been implemented for feature smoothing, increasing the signal-to-noise ratio without distorting the ion trace. For data with a low signal-to-noise ratio, smoothing can recover ion traces that would otherwise be filtered out as noise by other pre-processing methods. The data points are assumed to be equally spaced (h = 1), allowing the use of convolution coefficient (a) lookup tables. A smoothing window of length n is convolved over a feature and a polynomial Y of low degree k is iteratively fitted through the data points that lie within the smoothing window. The midpoint x of the smoothing window is replaced by the value derived from the fitted function. The window indices are defined by z, where the midpoint x equals zero. Afterwards, the intensities of smoothed features are rescaled to the magnitude of the raw feature to avoid distortion.
\[
Y = a_0 + a_1 z + a_2 z^2 + a_3 z^3 + \ldots + a_k z^k \tag{2.16a}
\]
\[
z = \frac{x_i - x}{h} \tag{2.16b}
\]
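Because the points are equally spaced, the polynomial fit reduces to a fixed convolution. As an illustration, the tabulated quadratic/cubic coefficients for a five-point window can be applied directly; the window length and degree used in practice are configurable, so this is only a sketch:

```python
# Savitzky-Golay convolution coefficients for a quadratic/cubic fit,
# window length n = 5, equally spaced points (h = 1), taken from the
# standard lookup tables; the normalisation factor is 35.
SG5 = (-3, 12, 17, 12, -3)

def savgol_smooth(trace):
    """Replace each interior point by the value of the locally fitted
    polynomial; edge points are left untouched in this sketch."""
    out = list(trace)
    for i in range(2, len(trace) - 2):
        out[i] = sum(c * trace[i + j - 2] for j, c in enumerate(SG5)) / 35
    return out
```

A useful property to verify the coefficients: a trace that is exactly quadratic is reproduced unchanged at the interior points.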
To remove features that are believed to be genuine but not of experimental interest, e.g. contaminants and interferents, features can be filtered by their exact ion mass, further trimming the data set.
2.3.2 Data Processing
Data processing addresses extracted features and refines them intra- and inter-sample. Features need to be deconvoluted and peak-picked to isolate compounds of similar or identical mass that eluted at different time points (Figure 2.9a). A popular second-derivative polynomial approach and a more robust modified Bieman[113] approach have been implemented. For the first approach, a higher-order polynomial is fitted through coefficient lookup. The second derivative of
the function determines the boundaries (zero intercepts) and maximum intensities (local minima/maxima) of individual peaks in the trace. The method is sensitive to the quality of the initial polynomial fit and was found to be less effective for diverse peak shapes. The modified Bieman algorithm for peak picking is purely shape-based: it finds local minima (boundaries) and maxima (peak apices) through iterative data point traversal from the ion trace’s global maximum. The trace is split into individual peaks based on the slope of the declining trace (see Stein et al.[113]).
For both deconvolution methods, intensity-based thresholds are in place to remove low-level, i.e. noisy, signals. For the polynomial approach, the second derivative needs to exceed a pre-defined threshold for a signal to qualify as a separate peak. The Bieman algorithm carries out a least-squares background estimation from the lower half of the total set of background-subtracted ion trace data points.
To remove retention time shifts across samples, Obiwarp – a tool for ordered bijective interpolated warping – can be utilized[121]. The parameters required by Obiwarp are extensively documented on its official web page and are mostly self-explanatory, e.g. the parameters for gap initiation and extension penalties. Before a collection of features of a sample is aligned to the feature collection of a reference, i.e. before the feature landscapes are adjusted, these collections need to be binned in the time and m/z domain to create a regular grid that is shared by all samples and the reference. Equidistant binning is used for both domains. The smaller the bin size in either domain, the higher the resolution and the more fine-grained the resulting alignment; bin sizes that are too small, however, result in poor alignment due to sparseness. Because MassCascade and Obiwarp store MS data differently, binned grids are stored temporarily in the Obiwarp-compatible ASCII ‘lmat’ format for every sample and the reference. The lmat files can subsequently be used as input for Obiwarp, spawned in parallel in separate threads. Obiwarp returns a vector of adjusted time values of equal length to the number of bins in the time domain of the input grid. Nearest-point linear interpolation is used to translate these adjusted time values into corrections for the unbinned features in the original feature collection (Figure 2.9b).
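The last step, mapping the per-bin time corrections back onto unbinned features, amounts to interpolating between the two nearest bin centres. A minimal sketch, with illustrative names rather than MassCascade’s actual API:

```python
def warp_time(rt, bin_times, adjusted_times):
    """Translate an unbinned feature's retention time onto the warped
    time axis by linear interpolation between the adjusted times of
    the two nearest time bins."""
    if rt <= bin_times[0]:
        return adjusted_times[0]
    if rt >= bin_times[-1]:
        return adjusted_times[-1]
    for i in range(1, len(bin_times)):
        if rt <= bin_times[i]:
            frac = (rt - bin_times[i - 1]) / (bin_times[i] - bin_times[i - 1])
            return adjusted_times[i - 1] + frac * (
                adjusted_times[i] - adjusted_times[i - 1])
```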
(a) Bieman-based deconvolution (b) Obiwarp bin size response landscape
(c) Feature set: signals (d) Feature set: ion traces
Figure 2.9: Illustration of deconvolution, alignment, and feature set methods. (a) Example of Bieman-based deconvolution. The outer ion trace (blue, scaled up) represents the chromatogram of a single ion. The green and red traces result from the deconvolution process. The original ion trace has been truncated at its local minima. (b) Response landscape of the mean difference of target-to-reference distances before ($\Delta t$) and after ($\Delta t'$) alignment with Obiwarp: $\Delta T = \frac{1}{n} \sum_{i=1}^{n} (|\Delta t_i| - |\Delta t'_i|)$, where n equals the number of aligned features; the times $\Delta t$ and $\Delta t'$ correspond to the differences in retention time $t_r$: $\Delta t^{(\prime)} = t_r^S - t_r^R$, with S and R being the target and reference. Bin widths in the time and m/z domain have been increased in steps of 0.5 from 0.1 to 3.0. Lower $\Delta T$ values indicate better overall alignment. The graph shows that the best alignment results are achieved with larger bin sizes for the time domain, which should approximate the time interval between scans. Obiwarp was run with default settings. (c) Spectrum and (d) ion traces of a feature set representing Cytidine ([M+H]+ = 244.092) grouped via cosine similarity.
Atypical for LC-MS software, MassCascade does not compile fragmentation mass spectra by default. One or more fragmentation spectra, i.e. MS2+, are typically generated for selected precursor ions. For these ions, every MS1 scan results in one or more MS2+ scans that are recorded. These can be collapsed into a single representative fragmentation spectrum, similar to ion extraction in MS1 (see paragraph 2.3.1). Representative fragmentation spectra of MS depth n are subsequently assigned to the extracted feature (ion trace) of the precursor ion in recursive fashion, e.g. for structure fragment assignment. Only selected features with experimentally acquired MS2+ scans can have associated consensus spectra; these collapsed spectra contain only those signals present in all fragmentation scans of a parent ion. This process has been kept separate in line with the principle of modularity.
A collection of features is in itself highly redundant: multiple signals can result from a single molecular compound. Cosine similarity and a modified Bieman approach have been implemented to group related features into feature sets (see section 2.2).
The cosine similarity function creates a pairwise similarity matrix from all features of a sample. To reduce memory load, only features within a pre-defined time interval are considered simultaneously. Non-overlapping ion traces are excluded by default. The spectral vectors, i.e. the features’ ion trace intensity vectors $\vec{I}_a$ and $\vec{I}_b$, are used to calculate their similarity (angle $\theta$) via the dot product and vector magnitudes:
\[
\mathrm{sim} = \cos(\theta) = \frac{\vec{I}_a \cdot \vec{I}_b}{\lVert \vec{I}_a \rVert\, \lVert \vec{I}_b \rVert} \qquad \{\mathrm{sim} \in \mathbb{R} \mid 0 \le \mathrm{sim} \le 1\} \tag{2.17a}
\]
Correlated features show similarities greater than 0.95. Any features below the defined similarity threshold are not clustered together in a single feature set (Figures 2.9c and 2.9d). If multiple features exist that have overlapping but gradually shifting ion traces, the algorithm will follow these shifts in the time domain and group the features together. Higher similarity thresholds restrict that behaviour automatically.
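A minimal sketch of the similarity measure itself, assuming the two traces have already been resampled onto a common time grid: co-eluting features with proportional intensities score close to one regardless of their absolute abundance, which is exactly why the measure suits redundancy grouping.

```python
import math

def cosine_similarity(ia, ib):
    """Cosine of the angle between two ion trace intensity vectors
    (eq. 2.17a)."""
    dot = sum(a * b for a, b in zip(ia, ib))
    na = math.sqrt(sum(a * a for a in ia))
    nb = math.sqrt(sum(b * b for b in ib))
    return dot / (na * nb) if na and nb else 0.0
```

Features would then be grouped whenever the score exceeds the chosen threshold, e.g. 0.95.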
Figure 2.10: Illustration of the modified Bieman algorithm to group related features. Overlapping features are binned in the time domain using a bin size of 10. Three features are shown in gray, yellow, and brown. Individual features are binned by their sharpness value s, i.e. a value describing the steepest slope from their apex to a neighbouring data point (indicated by the red line). To estimate the uncertainty range, the number of bins is divided by the sum of the sharpness values of the populated neighbouring bins: 10/(9+8) ≈ 0.6. If no bin within the uncertainty range contains a higher sharpness value than the maximum value in the bins that gave rise to the uncertainty range, the features captured in those bins are believed to be correlated (the yellow and brown peaks). The figure is adapted from Stein et al.[113].
The modified Bieman approach uses a shape-based approach described by Stein et al.[113] (Figure 2.10). The granularity of the clustering can be adjusted by the number of bins. A single bin is defined by a given length (parameter A) divided by the given number of bins (parameter B). For this method, the length should approximate the distance of two adjacent scans for fine resolution. Higher length values will yield larger feature sets with less similar features grouped together.
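A minimal reading of the sharpness value from Figure 2.10 can be sketched as follows; note that this is an interpretation of the figure, and the exact definition in MassCascade or Stein et al.[113] may normalise the slope differently:

```python
def sharpness(intensities, apex):
    """Sharpness of a feature: the steepest slope from its apex to a
    neighbouring data point (the value by which features are binned
    in the modified Bieman grouping)."""
    slopes = []
    if apex > 0:
        slopes.append(intensities[apex] - intensities[apex - 1])
    if apex < len(intensities) - 1:
        slopes.append(intensities[apex] - intensities[apex + 1])
    return max(slopes)
```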
2.3.3 Data Post-Processing
The annotation and identification of features or feature sets forms the last part of an LC-MSn analysis. The identification of metabolites is discussed in detail in the next chapter. This section deals with annotation and low-confidence identification only. MassCascade stores metadata extracted from raw files, such as a sample’s ion mode, internally to simplify usage. Thus, the number of parameters for annotation methods has been reduced as far as possible because information about MSn levels or ion modes does not need explicit input.
Adducts, neutral losses, and any other relations that are based on differences of m/z signals within a feature set can be annotated: pairwise signals for which a relationship evaluates to ‘true’ are annotated with the corresponding feature identifiers and the type of relationship (Figure 2.11a). For example, an m/z difference of 21.9819 represents the replacement of a hydrogen atom with a sodium atom in positive ion mode: [M+H]+ → [M+Na]+. Feature sets can be queried with any two-column list of annotations and m/z differences. The m/z differences are matched for every feature pair within a given ppm tolerance. Comprehensive adduct and neutral loss lists have been published that can be used as templates[110].
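The pairwise matching can be sketched as below; the two-column relation list mirrors the format described above, while the function and tolerance handling are illustrative assumptions rather than MassCascade’s implementation:

```python
def annotate_relations(features, relations, ppm=5.0):
    """features: list of (feature_id, mz); relations: two-column list
    of (label, delta_mz), e.g. ('[M+H]+ -> [M+Na]+', 21.9819).
    Returns (lighter_id, heavier_id, label) for every feature pair
    whose m/z difference matches a relation within the ppm tolerance."""
    ordered = sorted(features, key=lambda f: f[1])
    hits = []
    for i, (id_a, mz_a) in enumerate(ordered):
        for id_b, mz_b in ordered[i + 1:]:
            diff = mz_b - mz_a
            tol = mz_b * ppm / 1e6  # tolerance relative to the heavier ion
            for label, delta in relations:
                if abs(diff - delta) <= tol:
                    hits.append((id_a, id_b, label))
    return hits
```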
Isotopes are detected using a combinatorial approach with a quality check for final assignment based on a heuristic model. The average isotope difference is set to ∆mz = 1.0033[234]. Isotopic signals can be detected up to a distance of three, i.e. [M+3]. Abundances of single-step isotopic signals with distances greater than three are considered to be of too low intensity to be relevant: with an isotopic abundance of 13C = 1.10%, C66 would result in a signal at [M+4] that equals 1% of the intensity of the main signal. Elements that have ‘irregular’ isotopic patterns, e.g. 32S = 95.02%, 33S = 0.75%, 34S = 4.21%, are not fully detected. An m/z tolerance can be defined in ppm to account for mass defects. The algorithm sorts all features of a feature set in ascending m/z order before traversing the list in recursive fashion, pairwise matching all features lower in the list to the current index. If an m/z difference equals the theoretical isotope difference, the relative intensity of the isotope signal, i.e. its intensity divided by the main signal intensity, is matched to the theoretical relative intensity of an isotopic signal within the corresponding m/z range in which the signal falls. A positive match results in isotope annotation.
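The candidate-generation step can be sketched as follows; the subsequent relative-intensity check against the heuristic model is omitted here, and the function shape is an assumption for illustration:

```python
ISOTOPE_DELTA = 1.0033  # average isotope spacing in m/z[234]

def isotope_candidates(features, ppm=5.0, max_distance=3):
    """features: list of (mz, intensity) sorted ascending by m/z.
    Returns (parent_index, isotope_index, distance) for every pair
    whose m/z difference matches k * 1.0033 (k = 1..3) within the
    ppm tolerance."""
    hits = []
    for i, (mz_parent, _) in enumerate(features):
        for j in range(i + 1, len(features)):
            mz_iso = features[j][0]
            for k in range(1, max_distance + 1):
                if abs(mz_iso - mz_parent - k * ISOTOPE_DELTA) <= mz_iso * ppm / 1e6:
                    hits.append((i, j, k))
    return hits
```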
The heuristic model for the theoretical relative intensity of m/z ranges was derived in the following way: small molecules were downloaded in bulk in SDFile format from the metabolomics databases HMDB[235], Golm[236], and MassBank[237]. Molecular formulas were extracted and filtered for subsets containing only the elements CHNOPS + F + Cl. Bromine was excluded because of its atypical isotopic abundance pattern, 79Br = 50.69%, 81Br = 49.31%, which would significantly skew the relative intensity distribution for isotopic distances. Sulfur, 32S = 94.93%, 33S = 0.76%, 34S = 4.29%, and chlorine, 35Cl = 75.78%, 37Cl = 24.22%, did not impact significantly on the intensity distribution. Their isotopic abundance patterns’ ‘irregularities’ were absorbed in the overall measurement error. The representative, non-unique set contained 47,849 molecular formulas. The theoretical isotopic envelopes were calculated for the whole set and a linear regression applied for the relative intensities of every isotopic ratio (Figure 2.12). The resulting linear formulas of the form $I = a \cdot m/z + b$ describe the deduced intensity I for any m/z with the coefficients a and b.

(a) Isotope and identity annotations for Cytidine
(b) Diagram of feature identification methods and their mode of action

Figure 2.11: Illustration of annotation and identification methods. (a) Feature set (compound spectrum) of Cytidine (structure shown) with annotations for isotopes as perceived by MassCascade. Relations such as the Cytidine cluster (Cyt-2) or its sugar-removed fragment (Cyt-Sugar) were annotated through a list with m/z differences. (b) Diagram of feature or feature set identification methods including their mode of action. ChemSpider, Metlin, and MassBank are online databases. Reference library and m/z list queries are local operations. Whereas a ‘Spectrum match: MS1+’ matches the feature set at MS1 as well as relevant fragmentation spectra (MS2+), the ‘Parent ion & MS2+’ match by Metlin only matches a single feature in MS1 and subsequently proceeds to its fragmentation spectra. Matching features are annotated with a small molecule’s database identifier, chemical name, line notation (SMILES or InChI if provided), and query score.

Figure 2.12: Linear regression for molecular masses vs. isotope abundances. Data series for isotope ratios 1 (red), 2 (green), and 3 (blue) are shown. The molecular masses and theoretical isotope patterns were calculated from the molecule collections of HMDB, Golm, and MassBank.
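The per-ratio fit is an ordinary least-squares regression of relative intensity against molecular mass; a minimal sketch (one call per isotope ratio series), with illustrative names:

```python
def fit_isotope_ratio(masses, ratios):
    """Ordinary least-squares fit of the linear model I = a * mz + b
    for one isotopic ratio series (cf. Figure 2.12)."""
    n = len(masses)
    mean_x = sum(masses) / n
    mean_y = sum(ratios) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(masses, ratios)) \
        / sum((x - mean_x) ** 2 for x in masses)
    b = mean_y - a * mean_x
    return a, b
```

The fitted (a, b) pair then yields the expected relative intensity for any observed m/z, against which a putative isotope signal is checked.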
Putative feature or feature set identifications, i.e. metabolite annotations, are facilitated by methods for direct ion identification, local library queries, and web-based searches (Figure 2.11b). Direct ion identification refers to a simple m/z lookup in a reference list that can be fed directly into MassCascade. The reference list has to contain exact molecular masses and molecule labels. The molecular masses are automatically (de-)protonated based on the ion mode.
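The lookup itself reduces to adding or subtracting a proton mass before the tolerance comparison; a minimal sketch, with an illustrative function name and reference list format:

```python
PROTON_MASS = 1.007276  # mass of a proton in u

def direct_lookup(mz, reference, positive_mode=True, ppm=5.0):
    """reference: list of (label, neutral_mass). Returns the labels
    of reference molecules whose (de-)protonated mass matches the
    observed m/z within the ppm tolerance."""
    matches = []
    for label, mass in reference:
        ion_mass = mass + PROTON_MASS if positive_mode else mass - PROTON_MASS
        if abs(mz - ion_mass) <= mz * ppm / 1e6:
            matches.append(label)
    return matches
```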
Local library queries depend on the creation of a ReferenceContainer instance, containing ReferenceSpectrum instances. The reference spectra have an m/z-intensity list for each defined MS level as well as compound information such as the name, line notation, molecular mass, and formula. Fragmentation spectra are traversed if defined. Local queries are dramatically faster than web-based searches.
Web-based searches include the databases ChemSpider[186], MassBank[237], and Metlin[238]. ChemSpider and Metlin queries require a security token issued by the individual websites. For ChemSpider, a subset of the available databases can be selected for simple m/z lookups. The MassBank and Metlin integrations offer feature set (MS1) and fragmentation spectra matching (MS2+) in addition to simple m/z lookups, resulting in more reliable feature annotation. Input parameters correspond to those of the web interfaces. Results from local or web-based queries are assigned to the respective features or feature sets, including the chemical name, line notation (SMILES or InChI), query score, and database identifier.
2.4 MassCascade for KNIME
A plug-in for the workflow environment KNIME has been developed. The plug-in, MassCascade-KNIME, wraps the functionality of the MassCascade core library and exposes the functions as graphical nodes. A node can be considered an atomic unit of interaction. Manipulator nodes take one to many inputs and produce one to many outputs. Source nodes take no input from other nodes, whereas sink nodes do not pass any output further. Similar to a scripted pipeline, individual functions, i.e. nodes, can be inserted or deleted at any point in the workflow.
2.4.1 Structure
The plug-in serves as interface between the open-source platform KNIME and the MassCascade library. It manages node-user and node-node interactions and delegates jobs to the task classes of its back-end. These jobs are grouped into three
(a) Plug-in architecture (b) KNIME node model
Figure 2.13: Overview of the MassCascade-KNIME architecture and KNIME’s node model. (a) The plug-in forms part of the KNIME Desktop and serves as interface between the three frameworks of the MassCascade library and the KNIME workbench. (b) The adapted KNIME node model distinguishes between source, sink, and manipulator nodes. These node types have distinct input and output ports and facilitate intuitive usage of the plug-in. The ‘traffic light’ colour scheme indicates the state of a node: executed nodes are green, configured nodes are yellow, and unconfigured nodes are red.

categories based on their purpose: jobs that address the core framework (data processing), the visualisation framework (plots and charts), or the identification framework (signal or spectrum identification, rationalisation of results) (Figure 2.13a). Each category is handled differently in the plug-in (section 2.4.3).
In contrast to the internal design, nodes are grouped by the kind of interaction they provide: Source nodes read data from files or convert external tabular data to a format that can be processed by the plug-in, manipulator nodes change the data, sink nodes write data to files or convert internal tabular data to generic data types, and visualisation nodes provide views of data in tabular or graphical form (Figure 2.13b).
Underneath the nodes, data processing jobs are executed concurrently using Java’s concurrency framework[229]. Each thread deals with a single sample, i.e. reads the data container, applies a function, and writes the data out into a new container. The number of threads available depends on the global configuration of KNIME. Raw data is serialized to disk by the MassCascade library in a temporary directory specified by the user, whereas access pointers to the
raw data are stored through KNIME’s persistence framework. Employing this split serialisation scheme guarantees efficient data handling: the KNIME plug-in only deals with small XML-based files that contain pointers to large data chunks. These small files can be modified and swapped back and forth between nodes at low cost, e.g. in memory. For operations on the LC-MSn data, the files are converted to their respective container types and passed on to the main task class. The main task class reads the data on demand and returns a new container, as described in section 2.2. The returned container is again serialised to XML for persistence.
2.4.2 Node Types
Nodes of different types can be concatenated to build a workflow as depicted in Figure 2.14. Every MassCascade-KNIME workflow must start with a source node and typically ends with a sink node to output results. Manipulator nodes in between a source and sink node are for data processing. Visualisation nodes can be employed at any step in the workflow to inspect the data at that node. In contrast to all other node types, visualisation node types do not require any input parameters.
Figure 2.14: Typical MassCascade-KNIME workflow demonstrating how different node types can be used in combination. LC-MSn files are read in via the File Reader node before the extraction of meta information. Next, random noise is eliminated and features are extracted. The features are converted to a generic tabular KNIME format that can be used with generic KNIME nodes. The Spectrum Viewer allows interactive inspection of the noise-reduced samples.
2.4.3 Node Interactions
Two types of node interactions are distinguished: node-node and node-user interactions. Both types of interactions are defined by the plug-in. To render the node coding process more efficient, abstract custom models have been built on top of the four key parts of a default KNIME node model (see section 1.2.4). These abstract classes are extended by all MassCascade-KNIME nodes, enforcing standardised behaviour across all interaction levels and reducing the time spent developing nodes (Figure 2.15). Because of the common base classes, any new node implementation requires only the implementation of a model and – if required – a view, in addition to the definition of node parameters.
Node-Node Interactions
Node-node interactions are based on KNIME’s tabular data model (see section 1.2.4). Three distinct data cell types have been developed to represent the MS, profile, and spectrum data containers of the core library (see section 2.2). Instances of these cell types are serialized to disk for persistence by the KNIME environment. Node-node interactions are only possible when a required input cell type of one node matches an output cell type of a preceding node. This compatibility of cell types is enforced by an automated mechanism in the DefaultModel class. Input or output of multiple data tables must be specifically provided in the model implementation but follows the same principle. MassCascade cell types depict their content in KNIME tables: MS data cells sketch a total ion chromatogram, profile data cells depict a histogram of time bins (ascending) to the number of binned profiles, and spectrum data cells show a plot of time versus m/z values, where every point represents a found profile.
Node-User Interactions
Node-user interactions are provided through node dialogues, depictions, input-output tables, and data views. MassCascade-KNIME provides interactive parameter dialogues with parameter validation, default values, and auto-configuration
Figure 2.15: Schematic of the default node architecture and interactions. The generic KNIME classes have been extended to create abstract node classes that simplify the development process. The DefaultModel class provides parameter validation and handles node execution: it takes a set of input tables and produces a set of output tables. The DefaultSettings class handles saving and loading of all pre-defined node parameters. The DefaultDialog class is a template that builds the configuration dialogue on-the-fly based on pre-defined building blocks; it loads/saves node parameters to/from the default settings class. The DefaultView class visualises data exposed through the node model in the form of graphical charts. The DefaultViewModel class links the configuration dialogue and node model.
(a) Noise Reduction node dialogue (b) Profile Converter node dialogue
Figure 2.16: Configuration dialogues of the (a) Noise Reduction and (b) Profile Converter nodes. The dialogues take a set of input parameters in the ‘Parameters’ section and define the required input cell type in the ‘Source’ section of the tab.
(Figure 2.16). Erroneous parameters are not accepted and render the node disabled. Node dialogues are loaded with default parameters. Auto-configuration checks whether the input table(s) contain the required data types. The outcome of the node configuration is reflected in the traffic-light representation of the node’s execution status.
Data views visualise data cells of a particular type. Execution of the visualisation nodes does not change the data in the respective data cells. All visualisation nodes follow the same layout: input data cells are listed on the right of the visualisation window, optionally with an added section for further listings for loading data. The main part of the window is taken up by an interactive two-dimensional plot. Depending on the individual node, plots offer different data representations, accessible via the file menu (Figure 2.17).
Figure 2.17: Demonstration of the Spectrum Viewer node. The central area of the window shows two spectra selected via the spectrum list in the bottom right. The data cells for the spectrum list are shown in the top right.
2.5 Evaluation
2.5.1 Spectral Fingerprinting of Tomato Samples
Metabolomic fingerprinting is one of the key applications of LC-MSn. It is based on a sample-by-feature matrix, where features have reliably been found across samples. These features are believed to be characteristic of the samples and can thus be used for multivariate clustering and the subsequent generation of predictive models. Fingerprinting relies on the extraction of real features, i.e. m/z-intensity value pairs that represent ions stemming from metabolites, and on the correct alignment of features across samples to generate the feature matrix. The quality of the resulting matrix depends heavily on the feature extraction and alignment process, as well as on the overall data pre-processing. Insufficient or incorrect data pre-processing could introduce noise to the matrix in the form of irrelevant features, e.g. m/z-intensity value pairs that stem from background ions or impurities, or incorrect features, e.g. value pairs that do not exist or are not present in all samples indicated. Such a matrix would show a higher level of unexplainable variance and could lead to incorrect clustering.
MassCascade was tested for correct fingerprinting using a data set of aliquots of a pooled standard tomato control with known clustering. The test allows for exploration of various aspects of the tool, including feature extraction, deconvolution, and alignment, and provides statistical measures for the quality of the generated feature matrix.
2.5.2 Materials and Methods
The data set was provided by the Syngenta AG. The data was acquired using aliquots of a polar extract from wild-type tomato (Solanum lycopersicum), which was used as pooled in-house control. The 113 samples were used as controls in a long-term study at the Syngenta AG (section 2.1). Approximately 30 mg of dried tomato tissue was extracted with ethanol:water 20:80 (v/v) and diluted 10:1 (v/v) with water for injection. A mixture of citric acid-d4 [CID 16213286], L-alanine-d4 [CID 12205373], glutamic acid-d5 [CID 56845948], and L-phenylalanine-d5 [CID 13000995] was added for internal calibration. The samples were measured on a Thermo Velos Orbitrap coupled to a Waters Acquity UPLC with a HSS T3 150 x 2 mm, 1.7 µm, Acquity UPLC column. The solvents used for the assay consisted of 0.2% formic acid (solvent A) and 98/2/0.2 acetonitrile/water/formic acid (solvent B). The gradient started at 100% A (hold 2.5 minutes) at 0.25 mL/min, followed by a ramp to 10% B after 7.5 minutes, increasing the flow rate to 0.4 mL/min; then to 100% B after 10 minutes, hold 2 minutes, before equilibrating back to starting conditions after 18 minutes. The samples were detected in positive ESI mode at a resolving power of 30,000 FWHM with a scan range from 85-900 Da. MS2 spectra were obtained in a data-dependent manner: the two most intense mass spectral peaks detected in each scan were fragmented to give MS2 spectra. Full scan data was acquired in profile mode, MS2 spectra were acquired in centroid mode.
The data files were converted from Thermo Scientific’s RAW format to mzML using ProteoWizard v2.2[230] with vendor-based peak picking enabled for MS1. The data were processed using the MassCascade plug-in for KNIME, following the steps outlined in Figure 2.18. Feature extraction was carried out with 10
September          October            January            March
Date        Count  Date        Count  Date        Count  Date        Count
2010-09-17  4      2010-10-04  1      2011-01-18  6      2011-03-22  4
2010-09-18  9      2010-10-05  8      2011-01-19  7
2010-09-20  7      2010-10-06  15
2010-09-21  14     2010-10-07  10
2010-09-22  9      2010-10-08  15
                   2010-10-09  4

Table 2.1: Overview of the number of wild-type tomato samples measured by date. The samples were measured over a total of four months.

ppm mass accuracy. A minimum width of six scans and a minimum intensity of 100,000 units were used for thresholding of features. In addition, a Durbin-Watson statistic was used to filter out features that were assigned a score above the third quartile of the normal score distribution taken from all features, here $Q_3^{dw} = 2.38$. Deconvolution was applied with a signal-to-noise ratio of two using the modified Bieman algorithm. Obiwarp was used for cross-sample alignment with default parameters, and the aligned features were subsequently filtered and selected for the presence of isotopic peaks. Features associated with common interferents were removed.
Figure 2.18: Flow diagram depicting the LC-MS data processing steps for metabolomic fingerprinting. After random noise reduction, features are extracted and filtered using a statistic describing expected chromatographic behaviour. The filtered features are smoothed in the time domain and deconvoluted to resolve overlapping isobaric features. Found features are matched across all samples of a group, filtered for the presence of isotopic peaks for selectivity, and written out in a feature matrix. Subsequently, intensity gaps in the matrix are filled before multivariate analysis.
The aligned features were written out in a feature matrix and gap filling was applied: gaps were filled with intensity values from the baseline of the individual samples at the features’ retention times and m/z values. If no baseline value could be found, gaps were assigned a default intensity of 10,000 units. A total of 11 result matrices were compiled with allowed missingness ranging from 0-100% in 10% increments. The matrices were subjected to multivariate analysis in the statistical computing environment R.
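The gap-filling rule described above can be sketched as follows; the dictionary layout and function name are illustrative assumptions, not the actual matrix representation:

```python
def fill_gaps(matrix, baselines, default_intensity=10_000):
    """matrix: {sample: {feature: intensity or None}};
    baselines: {sample: {feature: baseline intensity}} (may be sparse).
    Missing intensities are filled from the sample's baseline at the
    feature's position, falling back to the default intensity."""
    for sample, row in matrix.items():
        for feature, value in row.items():
            if value is None:
                row[feature] = baselines.get(sample, {}).get(
                    feature, default_intensity)
    return matrix
```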
2.5.3 Results
Initial inspection of the total ion chromatograms of all samples revealed two failed injections for samples 2010-10-09 12 and 2010-10-09 13. Both samples contain only baseline noise with no apparent features and were removed from the analysis. Cross-sample comparison of mean-centered total ion currents showed significantly higher intensities for samples 2011-03-22 01, 2011-03-22 02, 2011-03-22 03, and 2011-03-22 04, likely due to incorrect sample preparation or over-injection (Figure 2.19). All four March samples were flagged as outliers and kept.
Sample 2010-09-21 61 was chosen as reference for the feature-based alignment with Obiwarp based on its representative total ion current and the shape of its total ion chromatogram, minimising the overall time shift. The result of the non-linear alignment is depicted in Figure 2.20a. The stability of the column, measured by the sample-to-reference time deviation, appears to depend on the acquisition date. Time shifts of features grow larger as the time gap between sample and reference increases. Samples acquired on the same date show the least variation in time shifts, followed by variation between groups acquired in the same month. This was to be expected: even though the same instrument was used for all measurements, it was not used exclusively for the study. Other experiments were run on the Orbitrap in between the months. Thus, column modifications such as degradation or insufficient equilibration become more apparent over time. All samples show uncorrelated shifts before about 80 s and after 750 s, coinciding with the dead time of the system and the end point of the solvent gradient, respectively.
66 2. INFORMATICS FOR LC-MSn ANALYSIS
(a) Total ion currents
(b) Total ion chromatograms
Figure 2.19: Summary of the sample intensity domain. (a) The mean-centered total ion current for each sample is shown in red. The blue line indicates zero intensity. Samples 16, 51, 53, and 62 represent the four March samples and show significant deviation from the trend. Samples 43 and 108 refer to 2010-10-09 12 and 2010-10-09 13 and contain baseline noise only, explaining the strong negative deviation. (b) Overlay of all total ion chromatograms coloured by date. All chromatograms have similar shape. The four March samples are clearly distinct with dramatically higher intensities. The samples show slight time deviation and increasing baseline drift from 570 s.
(a) Time - shift regression
(b) m/z - shift regression
Figure 2.20: Summary of the feature alignment. The data points and regression curves are coloured by acquisition date. (a) Plot of the elution time versus time shift, i.e. the time difference before and after alignment, with a polynomial regression fit. Samples measured on the same date cluster together. Shifts before about 80 s and after 750 s appear to be uncorrelated. (b) Plot of the m/z values versus time shift with a polynomial regression fit. The linear trends show that the time shift is almost constant for all m/z values, indicating no correlation between mass-to-charge ratio and time shift.
Figure 2.21: Overview of the feature alignment groups. The matrix shows the individual regression plots for all sample groups, one panel per acquisition date (2010-09-17 to 2011-03-22). The elution time [s] versus time shift [s] is plotted with a polynomial regression fit.
The individual groups are shown in Figure 2.21. The mass-to-charge values appear uncorrelated with the time shifts and have no impact on the alignment (Figure 2.20b).
Removal of common interferents (see appendix section 5) in LC-ESI-MS reduced the number of features by an average of 212 per sample. The matched features show no significant differences in their distributions compared to the remaining features (Figure 2.22a). Table 2.2 lists the five most frequently found contaminants. N-methyl-2-pyrrolidone, oleamide, and polyethylene glycol are believed to come from laboratory equipment used in sample preparation. The acetonitrile/acetic acid species stems from the mobile phase. Its putative sodium adduct shows a higher mass deviation of 19.0 ppm. The mass chromatograms of both species were compared to verify that they are related. They co-elute at various time points during the runs, with the acetonitrile/acetic acid species present in all samples and the sodium adduct absent in up to 11 samples, consistent with the assumption that the adduct is detected less often than the main species. Comparison of the shapes of the feature mass chromatograms at 333 s supports this finding of a main species to sodium adduct relationship (Figure 2.22b).
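The co-elution check described above — comparing the shapes of two extracted mass chromatograms — can be sketched as a correlation of their elution profiles. The intensity traces below are hypothetical, and Pearson correlation is one simple choice of similarity measure, not necessarily the one used in the study.

```python
def pearson(x, y):
    """Pearson correlation of two equally sampled elution profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# hypothetical intensity traces of a main species and its putative Na adduct
main = [10, 50, 400, 900, 450, 60, 12]
adduct = [2, 11, 85, 190, 96, 13, 3]
# a high correlation of the profile shapes suggests a related species
similar = pearson(main, adduct) > 0.95
```

A proportionally scaled-down trace, as expected for an adduct of the same compound, correlates strongly with the main species even though its absolute intensities differ.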
(a) Intensity boxplot
(b) Mass chromatograms of two interferents
Figure 2.22: Summary of the interferents analysis. (a) Boxplot of the intensity distribution of features versus filtered-out interferents. No outliers are shown. The distributions are not significantly different. (b) Mass chromatograms of the acetonitrile/acetic acid species (blue) and its sodium adduct (red) at 333 s. Both species co-elute and have similar elution profiles. The high-intensity chromatograms are artefacts from the erroneous March samples.
Name                      Sum Formula           Mass Deviation   Ion Type
N-Methyl-2-pyrrolidone    C5H10NO               0.8 ppm          [M+H]+
Acetonitrile/Acetic acid  (CH3CN)n(CH3COOH)m    0.5 ppm          [A1B1+H]+
Acetonitrile/Acetic acid  (CH3CN)n(CH3COOH)m    19.0 ppm         [A1B1+Na]+
Oleamide                  C18H35NO              0.5 ppm          [M+H]+
Polyethylene glycol       [C2H4O]nH2O           0.6 ppm          [A2B+H]+
Table 2.2: Overview of the five most common interferents. The table shows the com- pound name, sum formula, mass deviation in ppm between the exact theoretical and found ion mass, and ion type found.
After alignment, features before 80 s and after 750 s were filtered out before isotope detection and compilation of the feature matrix for multivariate analysis. The data set was split into two. In the first data set (‘filtered’), all features without isotope information were filtered out. The second data set contains all features (‘unfiltered’). Isotopic envelopes serve as additional orthogonal information to improve confidence in extracted features. A data set filtered for isotopes can serve as a sanity check for the data analysis because features with isotopic envelopes are less likely to be irrelevant artefacts. Both data sets were used to investigate how incremental missingness affects the result matrix. Missingness is defined as the percentage of absence of a feature across samples, i.e. the number of gaps. Higher missingness allows more features to be included in the matrix but may distort downstream analysis. Zero missingness is ideal but impractical because of imperfections of the instrument and in data processing. The filtered and unfiltered data sets behave identically with regard to missingness (Figure 2.23a). The number of aligned features increases with allowed missingness, with a steep increase from 90% to 100%. Maximum missingness includes features that are only present in one sample, which should be considered unreliable. Stepping up to 100% missingness includes the bulk of features from unfiltered noise, hence the increase in aligned features. At 0% missingness, a total of 23 and 165 features are observed in the filtered and unfiltered data respectively. The number of gaps doubles with every 10% increase in missingness.
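The missingness threshold can be sketched as a simple filter over the feature matrix. The data layout below (one intensity list per feature, `None` marking a gap) is a hypothetical simplification of the actual result matrix.

```python
def filter_by_missingness(matrix, max_missing=0.1):
    """Keep features whose fraction of gaps (None values) across samples
    does not exceed max_missing.

    matrix: dict mapping feature_id -> list of per-sample intensities,
            with None marking a gap.
    """
    kept = {}
    for fid, values in matrix.items():
        missing = sum(v is None for v in values) / len(values)
        if missing <= max_missing:
            kept[fid] = values
    return kept
```

Sweeping `max_missing` from 0.0 to 1.0 in steps of 0.1 over the same matrix reproduces the eleven-matrix compilation described in the methods.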
(a) Missingness bubble plot (missingness [%] versus features [%]; point area: gaps [%]; groups: filtered, unfiltered)
(b) Compound intensity plot (sample index versus abundance [units] for Citric acid-d4, Glutamic acid-d5, and L-phenyl-alanine-d5)
Figure 2.23: Overview of the sample by feature missingness analysis. (a) Bubble plot of the missingness versus number of features, expressed in percent relative to the number of features at 100% missingness. The plot shows a steep increase from 90% to 100%, where features only present in one sample are accepted into the matrix. At 0% missingness, 165 features can be observed in the unfiltered data. The area of the data points corresponds to the number of gaps (percent relative to the total number of gaps at 100% missingness). The number of gaps doubles with every 10% increment in missingness. (b) Line plot of the sample index versus intensity of Citric acid-d4, Glutamic acid-d5, and L-phenyl-alanine-d5. Intensity spikes can be observed for all March samples at index 16, 51, 53, and 62. No spike is present for L-phenyl-alanine-d5 at position 16 due to misalignment.
Name                  CID        Sum Formula   Exact Mass   [M+H]+
Citric acid-d4        16213286   C6H4D4O7      196.05211    197.05939
L-alanine-d4          12205373   C3H3D4NO2     93.07279     94.08006
Glutamic acid-d5      56845948   C5H4D5NO4     152.08454    153.09182
L-phenyl-alanine-d5   13000995   C9H6D5NO2     170.11036    171.11764
Table 2.3: Overview of the deuterated internal standards. The table shows the com- pound name, PubChem compound identifier (CID), molecular formula, exact mass for the deuterated compound, and exact mass for the main ion in positive ion mode.
Gaps are highly undesirable and typically need to be filled for most multivariate methods. This introduces an element of uncertainty about why the signal is absent in the sample and about the method to be used for gap filling. Gaps were filled by reverse lookup of background signals in the raw data or, in the absence of a background signal in the vicinity of the feature’s m/z value and retention time, a default value of 100,000 was set.
Given the identical behaviour of the filtered and unfiltered data sets with regard to missingness, the features in the unfiltered data set were assumed to be representative for the samples. The filtered data set was discarded. A maximum missingness of 10% enabled the detection of the internal standards (Figure 2.23b), resulting in a matrix with 680 features, and was chosen as the optimum value. The standards listed in Table 2.3 were identified within a mass accuracy of 10 ppm. Spikes in abundance can be noted for the internal standards in all four March samples, similar to what is shown in Figure 2.19a, with the exception of sample 2011-03-22 04 at position 16, where no feature was found for L-phenyl-alanine-d5 and gap filling was applied. The missing feature was incorrectly aligned and was skipped by the semi-automated analysis process. L-alanine-d4 was removed from the analysis because of its retention time of 85 s, which is close to the 80 s time filter boundary. The compound was only detected in 60% of all samples in this inconsistent region.
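The identification of the internal standards within a ppm tolerance can be sketched as follows. The [M+H]+ masses come from Table 2.3; the function names and the first-match strategy are illustrative simplifications rather than the actual matching logic.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Relative mass deviation in parts per million."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6

def match_standard(observed_mz, standards, tolerance_ppm=10.0):
    """Return the first internal standard whose [M+H]+ mass lies within
    the ppm tolerance of the observed m/z, or None if nothing matches."""
    for name, mz in standards.items():
        if ppm_error(observed_mz, mz) <= tolerance_ppm:
            return name
    return None

# [M+H]+ masses of the deuterated internal standards (Table 2.3)
STANDARDS = {
    "Citric acid-d4": 197.05939,
    "Glutamic acid-d5": 153.09182,
    "L-phenyl-alanine-d5": 171.11764,
}
```

An observed feature at m/z 197.0596, for instance, deviates by roughly 1 ppm from Citric acid-d4 and would be accepted, while a feature tens of ppm away would be rejected.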
The resulting feature matrix was used for principal component analysis (PCA). The matrix containing 111 samples and 680 features was autoscaled before it was subjected to PCA. The result is shown in Figure 2.24a, where the first two components explain 60% of the variance (Figure 2.24b). The four March samples are distinct outliers and distort the analysis because of their high feature intensities. The samples were removed before re-analysis. The corrected matrix is shown in Figure 2.25a. The explained variance is still 60% but with a greater contribution from the second principal component (Figure 2.25b). The PCA reveals distinct clustering of samples by date. Ideally, all samples should be indistinguishable because of the lack of biological variation. The differences could result from sample preparation, data acquisition, or data processing. Given that the whole data set was batch processed and the variation is not random, invalid data processing can be ruled out. The strong clustering by sample date indicates instrument-introduced variation under the assumption that the minor errors during sample preparation were random. The observed inter-aliquot variation disappears when biologically different samples are introduced (see section 3.5, Figure 3.4).
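The autoscale-then-PCA step can be sketched in a few lines. This mirrors, rather than reproduces, the analysis run in R; the SVD route is one standard way to compute principal components, and the toy matrix in the test stands in for the 111 x 680 feature matrix.

```python
import numpy as np

def autoscale(X):
    """Mean-centre each feature column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X, n_components=2):
    """PCA via SVD of the autoscaled matrix. Returns the sample scores and
    the fraction of variance explained by each retained component."""
    Xs = autoscale(X)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    scores = U[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]
```

Autoscaling gives every feature equal weight regardless of its absolute intensity, which is why the high-intensity March samples appear as score outliers rather than dominating individual loadings.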
A drift can be noted for five samples acquired on 2010-09-21. Isolation of that sample group and analysis of the intensity distributions reveals that samples acquired earlier in the day (index 2, 3, 9, 15, and 20) have distributions shifted towards higher intensities (Figure 2.26). Even though the shift is not significant, it is picked up in this analysis due to the absence of larger sources of variation, e.g., biological variation. This shift is responsible for the drift observed in the PCA.
In this case study, MassCascade was used to process LC-MSn data of 113 quality control samples from wild-type Solanum lycopersicum. Because of the lack of biological variation, variation introduced by sample preparation and instrumentation could be picked up. The clusters seen in the PCA are non-random and could be explained, ruling out significant errors in the actual data processing. This demonstrates that MassCascade can be used for fingerprinting experiments and data exploration.
(a) Principal component analysis (PC1 versus PC2; samples coloured by acquisition date, 2010-09-17 to 2011-03-22; ellipse: Hotelling’s T2 (95%))
(b) Variance explained by PCs (bar chart of explained variance [%] for principal components 1-15)
Figure 2.24: (a) Principal components analysis of the aligned features of all samples. The samples are coloured by acquisition date. A confidence region is shown as ellipse using Hotelling’s T2 statistic. The March samples are strong outliers, obfuscating the plot. (b) Bar chart of the explained variance by principal components in percent. The first two principal components explain 60% of the overall variation.
(a) Principal component analysis (PC1 versus PC2; samples coloured by acquisition date; ellipse: Hotelling’s T2 (95%))
(b) Variance explained by PCs (bar chart of explained variance [%] for principal components 1-15)
Figure 2.25: (a) Principal components analysis of the filtered features of all samples. The samples are coloured by acquisition date. A confidence region is shown as ellipse using Hotelling’s T2 statistic. The samples cluster by acquisition date. Five samples of the group from 2010-09-21 exhibit a drift away from the main cluster. (b) Bar chart of the explained variance by principal components in percent. The first two principal components explain 60% of the overall variation.
(a) Principal component analysis (PC1 versus PC2; samples from 2010-09-21 coloured by run index; ellipse: Hotelling’s T2 (95%))
(b) Intensity boxplots (intensity [units] per sample, 2010-09-21 run indices 02 to 73)
Figure 2.26: (a) Principal components analysis of the samples from 2010-09-21. The samples are coloured by run index. A confidence region is shown as ellipse using Hotelling’s T2 statistic. Five samples drift away from the main cluster. (b) Corresponding boxplot of the intensity distributions of the samples. The same colour code is used. Samples at index 2, 3, 9, 15, and 20 show a marginal shift in distribution to higher intensities. This shift maps to the drift observed in the principal component analysis.
2.5.4 Performance and Scaling of the Core Library
In contrast to the core library, the KNIME plug-in is tightly bound to the KNIME framework and its performance varies greatly with architecture and available resources. The plug-in is configured to serialize all data to disk, ensuring scalability within the workflow environment as long as sufficient memory is available for single-sample operations. Execution speed is primarily limited by data read-write cycles; other factors are negligible. Consequently, only the core library was systematically benchmarked. The desktop specifications and run time for the study employed in this chapter are listed at the end as a rough guide only.
The core library can be run in file or memory mode. In file mode, execution speed is limited by data read-write cycles, as in the plug-in. It is assumed that the plug-in is used for local data processing. Hence, the performance of the MassCascade core library was tested in a server environment with a varying number of threads, from 2 to 20, and 16 GB of memory. Every thread deals with one sample. The data processing steps of the fingerprinting case study were reproduced for this test.
The execution time for the program decreases with the number of allocated threads following a power law. The log-log plot defines the relationship further as monomial, y = ax^k (Figure 2.27). MassCascade uses threads as the atomic unit for sample processing. Thus, it scales well in server environments that deal efficiently with threading. Given that the total execution time does not decrease linearly with an increasing number of threads, it can be assumed that MassCascade is primarily I/O or memory bound, i.e. the memory access cost for frequent loading of non-aligned data is the limiting factor.
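The monomial characterisation y = ax^k follows from a linear fit in log-log space, which can be sketched as follows. The thread counts and run times below are synthetic illustrations, not the measured benchmark values.

```python
import math

def fit_monomial(threads, times):
    """Fit y = a * x^k by linear least squares in log-log space,
    since log y = log a + k * log x."""
    lx = [math.log(x) for x in threads]
    ly = [math.log(y) for y in times]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # slope k and intercept log a of the log-log regression line
    k = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    a = math.exp(my - k * mx)
    return a, k

# hypothetical run times [s] for 2-20 threads following y = 1200 * x^-0.8
threads = [2, 4, 8, 12, 16, 20]
times = [1200 * x ** -0.8 for x in threads]
a, k = fit_monomial(threads, times)
```

A fitted exponent k between -1 and 0 is exactly the sub-linear scaling described above: more threads help, but less than proportionally, consistent with an I/O- or memory-bound workload.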
The evaluation (section 2.5) was run in KNIME using the MassCascade plug-in on a desktop, Core i7 740Q @ 1.73 GHz, with 3 GB memory allocated to the KNIME environment. The plug-in took 8 hours to run the complete workflow for the centroided dataset (total file size: 4.6 GB).
(a) Performance line plot
(b) Log-log plot
Figure 2.27: (a) Line plot of the thread number versus the total execution time. An increasing number of threads reduces the execution time following a power law. (b) Log-log plot of (a). The depicted linear relationship can be described as monomial.
2.6 Technical Validation
A small, well-described case-control study is used to assess the reliability and performance of the tool after its use had been demonstrated in the analysis of tomato samples (section 2.5).
Technical assessment is typically carried out against well-described biological data sets with known outcome[107,128,147,158,163], mixture samples[155], synthetic data sets[101,104], or combinations of these[156,231]. Here, a data set from a dilution series of Arabidopsis thaliana seed and leaf extracts is used. The data set is freely available as part of the publication by Tautenhahn et al.[231].
The performance of MassCascade is evaluated by its ability to isolate relevant features from the data set using its data processing methods. Following the approach by Tautenhahn et al., feature isolation is treated as an information retrieval problem and can thus be evaluated by values for precision and recall (sensitivity). The centWave method for feature isolation – discussed in the aforementioned publication – is used as a reference point.
Precision P, the ratio of the number of real features TP to the total number of features N, and recall R, the fraction of the number of real features TP to the total number of real features NP, are combined into an F-score F for interpretability. An F-score of 100% equals perfect precision and recall. False positives and false negatives reduce the F-score[239].
\[ P = \frac{TP}{N} \qquad (2.18) \]
\[ R = \frac{TP}{NP} \qquad (2.19) \]
\[ F = \frac{2 \cdot R \cdot P}{R + P} \qquad (2.20) \]
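The three measures can be computed directly from the counts defined above; a minimal sketch, with argument names standing in for TP, N, and NP:

```python
def f_score(true_positives, reported, real):
    """Precision P = TP/N, recall R = TP/NP, and their harmonic mean F,
    following equations 2.18-2.20.

    true_positives: number of matched real features (TP)
    reported: total number of reported features (N)
    real: total number of real features in the ground truth (NP)
    """
    p = true_positives / reported
    r = true_positives / real
    return 2 * r * p / (r + p)
```

For example, a tool that reports 100 features of which 50 match a 50-feature ground truth has perfect recall but 50% precision, giving an F-score of about 67%.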
The following section outlines the design of the technical validation, including a brief description of the data set and the compilation of a ‘ground truth’ to determine the total number of real features[231].
2.6.1 Methods
The data set consists of eight samples. Each sample is measured in ten technical replications. The dilution series contains mixtures of solvent and either seed or leaf extracts in varying proportions (solvent/seed/leaf [v/v/v]): 0/100/0, 25/75/0, 50/50/0, 75/25/0, and 0/0/100, 25/0/75, 50/0/50, 75/0/25.
The pure seed and leaf samples (0/100/0 and 0/0/100) are separately processed and the ten technical replications aligned for each case using Obiwarp with default parameters. Removing features present in less than seven out of the ten technical replicates yields the ‘ground truth’ for the subsequent analysis of the seed and leaf dilution series respectively, i.e. the total number of real features.
Because MassCascade is compared to XCMS’s centWave method, the ‘ground truth’ is also estimated using XCMS centWave. The intersection of both ‘ground truths’, i.e. matching features within a 75 ppm m/z tolerance window, is used as the combined ‘ground truth’ for the following analysis. The tolerance window as well as the parameters for centWave are derived from the original publication.
Subsequently, the diluted samples are processed by each tool separately and evaluated against the combined ‘ground truth’. The samples for the 25%, 50%, and 75% dilutions of the seed and leaf samples are processed identically to the pure samples – noise reduction, Durbin-Watson filtering, feature set generation – with the exception of feature alignment. Isolated features are then matched to the total number of real features. Matching features are considered true positives, all other features false positives. Feature alignment is not required because the preprocessing and processing algorithms, which work on a sample-by-sample basis to isolate relevant features, are evaluated in this scenario.
The 80 samples (the stock solution plus three dilutions per extract with ten replications each) are converted from mzXML format to mzML using ProteoWizard[230] before they are processed using the MassCascade KNIME plug-in. Default parameters are used throughout, with intensity thresholds lowered to 10 units.
2.6.2 Results & Discussion
The combined ‘ground truth’ for the seed and leaf extracts contained 1331 and 619 features respectively. The centWave method picked up 1.5 times more features than MassCascade after filtering. The centWave method was developed for peak picking and no further filtering, e.g. based on a Durbin-Watson statistic, had been applied to its output.
It should be noted that the centWave method was run with the recommended parameters from the publication by Tautenhahn et al.[231]. The algorithm has been modified and improved since 2008, and the F-score values obtained in this analysis should be understood as reference points only.
Figure 2.28 summarises the results. The MassCascade pipeline performs on par with the centWave method in the case of the leaf extracts and is less successful, by approximately 10%, in the case of the seed extracts.