Openms – a Framework for Computational Mass Spectrometry

Total Page:16

File Type:pdf, Size:1020Kb

Openms – a Framework for Computational Mass Spectrometry OpenMS { A framework for computational mass spectrometry Dissertation der Fakult¨atf¨urInformations- und Kognitionswissenschaften der Eberhard-Karls-Universit¨atT¨ubingen zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) vorgelegt von Dipl.-Inform. Marc Sturm aus Saarbr¨ucken T¨ubingen 2010 Tag der m¨undlichen Qualifikation: 07.07.2010 Dekan: Prof. Dr. Oliver Kohlbacher 1. Berichterstatter: Prof. Dr. Oliver Kohlbacher 2. Berichterstatter: Prof. Dr. Knut Reinert Acknowledgments I am tremendously thankful to Oliver Kohlbacher, who aroused my interest in the field of computational proteomics and gave me the opportunity to write this thesis. Our discussions where always fruitful|no matter if scientific or technical. Furthermore, he provided an enjoyable working environment for me and all the other staff of the working group. OpenMS would not have been possible without the joint effort of many people. My thanks go to all core developers and students who contributed to OpenMS and suffered from the pedantic testing rules. I especially thank Eva Lange, Andreas Bertsch, Chris Bielow and Clemens Gr¨oplfor the tight cooperation and nice evenings together. Of course, I'm especially grateful to my parents and family for their support through- out my whole life. Finally, I thank Bettina for her patience and understanding while I wrote this thesis. iii Abstract Mass spectrometry coupled to liquid chromatography (LC-MS) is an analytical technique becoming increasingly popular in biomedical research. Especially in high-throughput proteomics and metabolomics mass spectrometry is widely used because it provides both qualitative and quantitative information about analytes. The standard protocol is that complex analyte mixtures are first separated in liquid chromatography and then analyzed using mass spectrometry. Finally, computational tools extract all relevant information from the large amounts of data produced. This thesis aims at improving computational analysis of LC-MS data|we present two novel computational methods and a software framework for the development of LC-MS data analysis tools. In the first part of this thesis we present a quantitation algorithm for peptide signals in isotope-resolved LC-MS data. Exact quantitation of all peptide signals (so-called peptide features) is an essential step in most LC-MS data analysis pipelines. Our algorithm detects and quantifies peptide features in centroided peak maps using a multi-phase approach: First, putative feature centroid peaks, so-called seeds, are determined based on signal properties that are typical for peptide features. In the second phase, the seeds are extended to feature regions, which are compared to a theoretical feature model in the third phase. Features that show a high correlation between measured data and the theoretical model are added to a feature candidate list. In a last phase, contradicting feature candidates are detected and contradictions are resolved. In a comparative study, we show that our algorithm outperforms several state-of-the-art algorithms, especially on complex datasets with many overlapping peaks. The second part of this thesis introduces a novel machine learning approach for model- ing chromatographic retention of DNA in ion-pair reverse-phase liquid chromatography. The retention time of DNA is of interest for many biological applications, e.g., for quality control of DNA synthesis and DNA amplification. Most existing models use only the base composition to model chromatographic retention of DNA. Our model complements the base composition with secondary structure information to improve the prediction performance. A second difference to previous models is the use of a support vector re- gression model instead of simple linear or logarithmic models. In a thorough evaluation, we show that these changes significantly improve the prediction performance, especially at temperatures below 60◦C. As a by-product, our approach allows the creation of a temperature-independent model, which can predict DNA retention times not only for a fixed temperature, but for all temperatures within the temperature range of the training data. Finally, we present OpenMS { a framework for computational mass spectrometry. OpenMS provides data structures and algorithms for the rapid development of mass spectrometry data analysis software. Rapid software prototyping is especially important in this area of research because both instrumentation and experimental procedures are quickly evolving. Thus, new analysis tools have to be developed frequently. OpenMS facilitates software development for mass spectrometry by providing a rich functionality ranging from support for many file formats, over customizable data structures and data visualization, to sophisticated algorithms for all major data analysis steps. The peptide feature quantitation algorithm presented in the first part of this thesis is one of many algorithms provided by OpenMS. We demonstrate the benefits of using OpenMS by the development of TOPP { The OpenMS Proteomics Pipeline. TOPP is a collection of command line tools which each perform one atomic data analysis step|typically one of the OpenMS data analysis algo- iv rithms. The individual TOPP tools are used as building blocks for customized analysis pipelines. This kind of flexibility and a graphical user interface for the visual creation of analysis pipelines make TOPP a versatile instrument for LC-MS data analysis. Kurzzusammenfassung Die Kopplung von Massenspektrometrie (MS) und Fl¨ussigchromatographie (LC) gewinnt immer mehr Bedeutung als analytische Technik in der biomedizinischen Forschung. Vor allem in der Hochdurchsatzproteomik und -metabolomik ist Massenspektrometrie weit verbreitet, weil sie sowohl qualitative als auch quantitative Information ¨uber Analyten liefert. Komplexe Stoffmischungen werden normalerweise mit Fl¨ussigchromatographie aufgetrennt bevor sie mittels Massenspektrometrie analysiert werden. Danach werden alle relevanten Informationen mithilfe spezieller Computerprogramme extrahiert, da die produzierten Datenmengen sehr groß sind. Diese Arbeit hat das Ziel die computer- gest¨utzteAnalyse von LC-MS Daten zu verbessern. Wir stellen zwei neue Methoden zur Datenanalyse und eine Softwarebibliothek zur Entwicklung von Analyseprogrammen vor. Der erste Teil dieser Arbeit besch¨aftigtsich mit einem Algorithmus zur Quantifizierung von Peptidsignalen (sogenannten Peptid-Features) in LC-MS Daten. Die exakte Quan- tifizierung aller Peptid-Features ist ein wichtiger Verarbeitungsschritt der meisten LC- MS Analyse-Pipelines. Unser Algorithmus detektiert und quantifiziert Peptide-Features in Peakdaten durch ein mehrstufiges Verfahren: Zuerst werden potenzielle Signalmit- telpunkte gesucht anhand der f¨urPeptidsignale typischen Eigenschaften. In einem zweiten Schritt werden die gefundenen Mittelpunkte zu Signalregionen vergr¨oßert,die im dritten Schritt mit einem theoretischen zweidimensionalen Modell verglichen wer- den. Signalregionen die eine hohe Ubereinstimmung¨ zwischen Messdaten und Modell aufweisen werden dann in eine Kandidatenliste von potenziellen Peptid-Features einge- tragen. Im letzten Schritt werden Widerspr¨uche in der Kandidatenliste gesucht und diese behoben. In einer Vergleichsstudie auf komplexen Daten mit vielen ¨uberlappenden Signalen konnten wir zeigen, dass unser Algorithmus mehreren modernen Algorithmen ¨uberlegen ist. Im zweiten Teil der Arbeit stellen wir einen neues maschinelles Lernverfahren zur Vorhersage von DNA Retentionszeiten in der Umkehrphasen-Chromatographie vor. Die Retentionszeit von DNA ist f¨urviele biologische Anwendungen von Interesse, zum Beispiel f¨urdie Qualit¨atskontrolle der DNA-Synthese und der DNA-Amplifikation. Die meisten existierenden Verfahren benutzen nur die Basenzusammensetzung der DNA um die Re- tentionszeit zu modellieren. Unser Modell beruht auch auf der Basenzusammensetzung, bezieht aber Sekund¨arstrukturinformationein, um die Vorhersageleistung zu verbessern. Ein weiterer Unterschied zu bisherigen Methoden ist die Verwendung von Support Vector Regression anstelle einfacher linearer und logarithmischer Modelle. In einer Vergleichs- studie zeigen wir, dass diese Neuerungen die Vorhersageleistung signifikant erh¨ohen, vor allem bei Temperaturen unter 60◦C. Außerdem erlaubt unsere Methode die Erstellung temperaturunabh¨angiger Modelle, die Retentionszeiten nicht nur f¨ureine feste Temper- atur, sondern f¨urden gesamten von den Trainingsdaten abgedeckt Temperaturbereich vorhersagen k¨onnen. Schließlich stellen wir OpenMS, eine Bibliothek zur Entwicklung von Software f¨ur die Massenspektrometrie, vor. OpenMS bietet alle erforderliche Datenstrukturen und viele Algorithmen zur schnellen Entwicklung von Analysesoftware. Die schnelle Entwick- lung von Softwareprototypen ist gerade in der Massenspektrometrie besonders wichtig, da sowohl die Instrumente als auch die experimentellen Protokolle sehr schnell weiter- v entwickelt werden. Daher m¨ussenregelm¨aßigneue Softwarel¨osungenzur Analyse der Daten entwickelt werden. OpenMS stellt eine umfangreiche Infrastruktur zur Verf¨ugung und vereinfacht so die Entwicklung dieser Analysesoftware. Die Funktionalit¨atvon OpenMS reicht von der Unterst¨utzungf¨urweit verbreitete Dateiformate, ¨uber anpass- bare Datenstrukturen and Datenvisualisierung, bis hin zu modernen Analysealgorithmen f¨uralle Hauptanalyseschritte. Die Vorteile die sich aus der Benutzung von OpenMS ergeben zeigen wir anhand der Entwicklung von TOPP { The OpenMS Proteomics Pipeline. TOPP ist eine
Recommended publications
  • Introduction to Label-Free Quantification
    SeqAn and OpenMS Integration Workshop Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn The Center for Integrative Bioinformatics (CIBI) Mass-spectrometry data analysis in KNIME Julianus Pfeuffer, Alexander Fillbrunn OpenMS • OpenMS – an open-source C++ framework for computational mass spectrometry • Jointly developed at ETH Zürich, FU Berlin, University of Tübingen • Open source: BSD 3-clause license • Portable: available on Windows, OSX, Linux • Vendor-independent: supports all standard formats and vendor-formats through proteowizard • OpenMS TOPP tools – The OpenMS Proteomics Pipeline tools – Building blocks: One application for each analysis step – All applications share identical user interfaces – Uses PSI standard formats • Can be integrated in various workflow systems – Galaxy – WS-PGRADE/gUSE – KNIME Kohlbacher et al., Bioinformatics (2007), 23:e191 OpenMS Tools in KNIME • Wrapping of OpenMS tools in KNIME via GenericKNIMENodes (GKN) • Every tool writes its CommonToolDescription (CTD) via its command line parser • GKN generates Java source code for nodes to show up in KNIME • Wraps C++ executables and provides file handling nodes Installation of the OpenMS plugin • Community-contributions update site (stable & trunk) – Bioinformatics & NGS • provides > 180 OpenMS TOPP tools as Community nodes – SILAC, iTRAQ, TMT, label-free, SWATH, SIP, … – Search engines: OMSSA, MASCOT, X!TANDEM, MSGFplus, … – Protein inference: FIDO Data Flow in Shotgun Proteomics Sample HPLC/MS Raw Data 100 GB Sig. Proc. Peak 50 MB Maps Data Reduction 1
    [Show full text]
  • June 5, 2017 Bioinformatics MS Interest Group Your Hosts
    Open Source Software Packages: Using and Making your conbributions June 5, 2017 Bioinformatics MS Interest Group Your hosts Meena Choi Samuel Payne Post doc. Scientist Northeastern University Pacific Northwest National Lab Statistical methods for Integrative Omics quantitative proteomics Outline • General Intro – Meena Choi • mzRefinery/proteowizard - Sam Payne • openMS - Oliver Kohlbacher • Skyline - Brendan MacLean • General discussion on open source • Ask questions for the General Discussion http://bit.ly/2qNZVBU • Shout-out for Open Source tool http://bit.ly/2qVHVo7 Oliver Kohlbacher • The chair of Applied Bioinformatics at University of Tübingen & fellow at the Max Plank Institute • OpenMS ( openms.de ) Brendan MacLean • Principal developer for Skyline ( skyline.ms ) • University of Washington Ask questions or comments : http://bit.ly/2qNZVBU • Why have open source? • What are the advantages and disadvantages between open source and private closed-source software? • How should a developer consider the question of making a project open source or not? • What is appropriate level of guide/documentation to help new developers? • How to incentivize people to contribute to open source software? Bioconductor.org biocViews search R package development • Provide the framework for developing package : basic structure, requirements… • Requirements : 1. pass check or BiocCheck on all supported platforms (their own checking system) 2. Documents • DESCRIPTION, NAMESPACE, vignette, help file, NEWS 3. Review process (2-5 weeks) • submit a GitHub repository • a reviewer will be assigned and a detailed package review is returned. • the process is repeated until the package is accepted to Bioconductor. • Maintaining the packages across release cycles (twice a year) + deprecate packages • Import or depend on other packages in Bioconductor or CRAN R package as software • Easy to make open source software for new method development.
    [Show full text]
  • Computational Proteomics and Metabolomics
    COMPUTATIONAL PROTEOMICS AND METABOLOMICS Oliver Kohlbacher, Sven Nahnsen, Knut Reinert 0. Introducon and Overview This work is licensed under a Creative Commons Attribution 4.0 International License. LU 0B – OPENMS AND KNIME • Workflows - defini/on • Conceptual ideas behind OpenMS and TOPP • Installaon of KNIME and OpenMS extensions • Overview of KNIME • Simple workflows in KNIME • LoadinG tabular data, manipulanG rows, columns • Visualizaon of data • PreparinG simple reports • EmbeddinG R scripts • Simple OpenMS ID workflow: findinG all proteins in a sample This work is licensed under a Creative Commons Attribution 4.0 International License. High-Throughput Proteomics • AnalyzinG one sample is usually not a biG deal • AnalyzinG 20 can be resome • AnalyzinG 100 is a really biG deal • High-throughput experiments require high- throughput analysis • Compute power scales much beer than manpower Pipelines and Workflows pipeline |ˈpīpˌlīn| noun 1. a lonG pipe, typically underGround, for conveyinG oil, Gas, etc., over lonG distances. […] 2. Compu,ng a linear sequence of specialized modules used for pipelining. 3. (in surfing) the hollow formed by the breakinG of a larGe wave. workflow |ˈwərkˌflō| noun • the sequence of industrial, administrave, or other processes throuGh which a piece of work passes from ini/aon to comple/on. http://oxforddictionaries.com/definition/american_english/pipeline http:// oxforddictionaries.com/definition/american_english/workflow Bioinformacs – The Holy Grail KNIME and OpenMS • Construc/nG workflows requires • Tools – makinG
    [Show full text]
  • Automated SWATH Data Analysis Using Targeted Extraction of Ion Chromatograms
    bioRxiv preprint doi: https://doi.org/10.1101/044552; this version posted March 19, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Automated SWATH Data Analysis Using Targeted Extraction of Ion Chromatograms Hannes L. Röst1,2, Ruedi Aebersold1,3 and Olga T. Schubert1,4 1 Institute of Molecular Systems Biology, ETH Zurich, CH-8093 Zurich, Switzerland 2 Department of Genetics, Stanford University, Stanford, CA 94305, USA 3 Faculty of Science, University of Zurich, CH-8057 Zurich, Switzerland 4 Department of Human Genetics, University of California Los Angeles, Los Angeles, CA 90095, USA Corresponding authors: HR ([email protected]), RA ([email protected]) and OS ([email protected]) Running head: SWATH Data Analysis 1 bioRxiv preprint doi: https://doi.org/10.1101/044552; this version posted March 19, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Summary Targeted mass spectrometry comprises a set of methods able to quantify protein analytes in complex mixtures with high accuracy and sensitivity. These methods, e.g., Selected Reaction Monitoring (SRM) and SWATH MS, use specific mass spectrometric coordinates (assays) for reproducible detection and quantification of proteins. In this protocol, we describe how to analyze in a targeted manner data from a SWATH MS experiment aimed at monitoring thousands of proteins reproducibly over many samples.
    [Show full text]
  • Introduction to Label-Free Quantification
    Analysis of Mass Spectrometry and Sequence Data with KNIME Alexander Fillbrunn, Julianus Pfeuffer, Jeanette Prinz The Center for Integrative Bioinformatics (CIBI) and KNIME Schedule • About us • Generic KNIME Nodes • Mass-spectrometry data analysis in KNIME with OpenMS • Introduction and theory of label-free quantification • Demo of a label-free quantification workflow • Analysis of high-throughput sequencing data with KNIME and SeqAn • Introduction and theory of variant calling • Demo of a variant calling workflow German Network for Bioinformatics Infrastructure de.NBI Mission Statement • The 'German Network for Bioinformatics Infrastructure' provides comprehensive first-class bioinformatics services to users in life sciences research, industry and medicine. The de.NBI program coordinates bioinformatics training and education and the cooperation of the German bioinformatics community with international bioinformatics network structures. Center for Integrative Bioinformatics (CIBI) • … provides cutting-edge and integrative tools for proteomics, metabolomics, NGS and image data analysis as well as a workflow engine to integrate tools into coherent solutions for reproducible analysis of large-scale biological data. Generic KNIME Nodes • Wrapping of command line tools (OpenMS, SeqAn,…) in KNIME via GenericKNIMENodes (GKN) • Every OpenMS and SeqAn tool writes its Common Tool Description (CTD) via its command line parser • GKN generates Java source code (static) or an XML representation (dynamic) for nodes to show up in KNIME • Wraps C++ (&more)
    [Show full text]
  • Pyopenms Documentation Release 2.7.0
    pyOpenMS Documentation Release 2.7.0 OpenMS Team Sep 21, 2021 Installation 1 pyOpenMS Installation 3 1.1 Binaries..................................................3 1.2 Source..................................................5 1.3 Wrap Classes...............................................5 2 Getting Started 7 2.1 Import pyopenms.............................................7 2.2 Getting help...............................................7 2.3 First look at data.............................................8 3 Reading Raw MS data 13 3.1 mzML files in memory.......................................... 13 3.2 indexed mzML files........................................... 14 3.3 mzML files as streams.......................................... 14 3.4 cached mzML files............................................ 16 4 Other MS data formats 19 4.1 Identification data (idXML, mzIdentML, pepXML, protXML)..................... 19 4.2 Quantiative data (featureXML, consensusXML)............................ 20 4.3 Transition data (TraML)......................................... 20 5 MS Data 23 5.1 Spectrum................................................. 23 5.2 Chromatogram.............................................. 26 5.3 LC-MS/MS Experiment......................................... 29 6 Chemistry 35 6.1 Constants................................................. 35 6.2 Elements................................................. 35 6.3 Molecular Formulae........................................... 39 6.4 Isotopic Distributions.........................................
    [Show full text]
  • Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis
    International Journal of Molecular Sciences Review Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis Chen Chen 1, Jie Hou 2,3, John J. Tanner 4 and Jianlin Cheng 1,* 1 Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA; [email protected] 2 Department of Computer Science, Saint Louis University, St. Louis, MO 63103, USA; [email protected] 3 Program in Bioinformatics & Computational Biology, Saint Louis University, St. Louis, MO 63103, USA 4 Departments of Biochemistry and Chemistry, University of Missouri, Columbia, MO 65211, USA; [email protected] * Correspondence: [email protected] Received: 18 March 2020; Accepted: 18 April 2020; Published: 20 April 2020 Abstract: Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks. Keywords: bioinformatics analysis; computational proteomics; machine learning 1. Introduction Proteins are essential molecules in nearly all biological processes.
    [Show full text]
  • Openms Tutorial Handouts
    User Tutorial The OpenMS Developers Creative Commons Attribution 4.0 International (CC BY 4.0) Contents 1 General remarks 5 2 Getting started 6 2.1 Installation .................................... 6 2.1.1 Installation from the OpenMS USB stick .............. 6 2.1.2 Installation from the internet .................... 7 2.2 Data conversion ................................. 7 2.2.1 MSConvertGUI ............................. 8 2.2.2 msconvert ................................ 8 2.3 Data visualization using TOPPView ...................... 9 2.4 Introduction to KNIME / OpenMS ...................... 12 2.4.1 Plugin and dependency installation ................. 12 2.4.2 KNIME concepts ............................ 14 2.4.3 Overview of the graphical user interface .............. 16 2.4.4 Creating workflows .......................... 18 2.4.5 Sharing workflows ........................... 18 2.4.6 Duplicating workflows ......................... 18 2.4.7 A minimal workflow .......................... 19 2.4.8 Digression: Working with chemical structures ........... 21 2.4.9 Advanced topic: Meta nodes ..................... 22 2.4.10 Advanced topic: R integration .................... 24 3 Label-free quantification of peptides 26 3.1 Introduction ................................... 26 3.2 Peptide Identification ............................. 26 3.2.1 Bonus task: identification using several search engines ..... 29 3.3 Quantification .................................. 31 3.4 Combining quantitative information across several label-free experi- ments ......................................
    [Show full text]
  • Cloudy with a Chance of Peptides: Accessibility, Scalability, and Reproducibility with Cloud-Hosted Environments Benjamin A
    pubs.acs.org/jpr Technical Note Cloudy with a Chance of Peptides: Accessibility, Scalability, and Reproducibility with Cloud-Hosted Environments Benjamin A. Neely* Cite This: J. Proteome Res. 2021, 20, 2076−2082 Read Online ACCESS Metrics & More Article Recommendations ABSTRACT: Cloud-hosted environments offer known benefits when computational needs outstrip affordable local workstations, enabling high-performance computation without a physical cluster. What has been less apparent, especially to novice users, is the transformative potential for cloud-hosted environments to bridge the digital divide that exists between poorly funded and well- resourced laboratories, and to empower modern research groups with remote personnel and trainees. Using cloud-based proteomic bioinformatic pipelines is not predicated on analyzing thousands of files, but instead can be used to improve accessibility during remote work, extreme weather, or working with under-resourced remote trainees. The general benefits of cloud-hosted environ- ments also allow for scalability and encourage reproducibility. Since one possible hurdle to adoption is awareness, this paper is written with the nonexpert in mind. The benefits and possibilities of using a cloud-hosted environment are emphasized by describing how to setup an example workflow to analyze a previously published label-free data-dependent acquisition mass spectrometry data set of mammalian urine. Cost and time of analysis are compared using different computational tiers, and important practical considerations are described.
    [Show full text]
  • A Python-Based Pipeline for Preprocessing LC–MS Data for Untargeted Metabolomics Workflows
    H OH metabolites OH Article A Python-Based Pipeline for Preprocessing LC–MS Data for Untargeted Metabolomics Workflows Gabriel Riquelme 1,2 , Nicolás Zabalegui 1,2, Pablo Marchi 3, Christina M. Jones 4 and María Eugenia Monge 1,* 1 Centro de Investigaciones en Bionanociencias (CIBION), Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Godoy Cruz 2390, Ciudad de Buenos Aires C1425FQD, Argentina; [email protected] (G.R.); [email protected] (N.Z.) 2 Departamento de Química Inorgánica Analítica y Química Física, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria, Buenos Aires C1428EGA, Argentina 3 Facultad de Ingeniería, Universidad de Buenos Aires, Paseo Colón 850, Ciudad de Buenos Aires C1063ACV, Argentina; pmarchi@fi.uba.ar 4 Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899-8392, USA; [email protected] * Correspondence: [email protected]; Tel.: +54-11-4899-5500 (ext. 5614) Received: 12 September 2020; Accepted: 14 October 2020; Published: 16 October 2020 Abstract: Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography–mass spectrometry (LC–MS) involves the removal of biologically non-relevant features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces TidyMS, a package for the Python programming language for preprocessing LC–MS data for quality control (QC) procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application.
    [Show full text]
  • Openms Tutorial Handouts
    User Tutorial The OpenMS Developers September 9, 2021 Creative Commons Attribution 4.0 International (CC BY 4.0) Contents 1 General remarks 7 2 Getting started 8 2.1 Installation .................................... 8 2.1.1 Installation from the OpenMS USB stick .............. 8 2.1.2 Installation from the internet .................... 9 2.2 Data conversion ................................. 9 2.2.1 MSConvertGUI ............................. 10 2.2.2 msconvert ................................ 10 2.2.3 ThermoRawFileParser ......................... 11 2.3 Data visualization using TOPPView ...................... 11 2.4 Introduction to KNIME / OpenMS ...................... 15 2.4.1 Plugin and dependency installation ................. 15 2.4.2 KNIME concepts ............................ 18 2.4.3 Overview of the graphical user interface .............. 19 2.4.4 Creating workflows .......................... 21 2.4.5 Sharing workflows ........................... 22 2.4.6 Duplicating workflows ......................... 22 2.4.7 A minimal workflow .......................... 22 2.4.8 Digression: Working with chemical structures ........... 25 2.4.9 Advanced topic: Metanodes ..................... 27 2.4.10 Advanced topic: R integration .................... 27 3 Label-free quantification of peptides 30 3.1 Introduction ................................... 30 3.2 Peptide Identification ............................. 30 3.2.1 Bonus task: identification using several search engines ..... 33 3.3 Quantification .................................. 35 3.4 Combining
    [Show full text]
  • Pyopenms Documentation Release 2.6.0-Alpha
    pyOpenMS Documentation Release 2.6.0-alpha OpenMS Team Apr 02, 2020 Installation 1 pyOpenMS Installation 3 1.1 Binaries..................................................3 1.2 Source..................................................5 1.3 Wrap Classes...............................................5 2 Getting Started 7 2.1 Import pyopenms.............................................7 2.2 Getting help...............................................7 2.3 First look at data.............................................8 3 Reading Raw MS data 11 3.1 mzML files in memory.......................................... 11 3.2 indexed mzML files........................................... 12 3.3 mzML files as streams.......................................... 12 3.4 cached mzML files............................................ 14 4 mzML files 17 4.1 Binary encoding............................................. 17 4.2 Base64 encoding............................................. 18 4.3 numpress encoding............................................ 19 5 Other MS data formats 21 5.1 Identification data (idXML, mzIdentML, pepXML, protXML)..................... 21 5.2 Quantiative data (featureXML, consensusXML)............................ 22 5.3 Transition data (TraML)......................................... 23 6 MS Data 25 6.1 Spectrum................................................. 25 6.2 LC-MS/MS Experiment......................................... 27 6.3 Chromatogram.............................................. 29 7 Chemistry 31 7.1 Constants................................................
    [Show full text]