Using KNIME for metabolomics

Visual Programming for Metabolomics

Stephan Beisken (PhD)
Reza Salek (PhD)
Cheminformatics and metabolism
The European Bioinformatics Institute (EMBL-EBI)
Email: [email protected]

Data standards efforts

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4648992/

Supported by PhenoMeNal

Data analysis and Audit trails

• Capturing the different steps of data processing
• Relationship and reproducibility

[Figure: DIMS data processing workflow. Steps and intermediate data products, in order:]

• Samples and QCs: QC, C5, S3, S7, C1, C10, QC, S1, C3, S5, C7, S6, QC, ..
• Technical triplicates: C5, C5’, C5’’, S3, S3’, S3’’, ..
• DIMS data collection
• Instrument .RAW files: IRFC5, IRFC5’, IRFC5’’, IRFS3, IRFS3’, IRFS3’’, ..
• Averaged transients: AT, AT’, AT’’ for each sample
• Apodisation, zero-filling and FFT; TIC filtering
• Frequency spectra: FSC5, FSC5’, FSC5’’, FSS3, FSS3’, FSS3’’, ..
• Mass calibration (using a calibrant list) and SIM-stitching
• Stitched peak lists: SPLC5, SPLC5’, SPLC5’’, SPLS3, SPLS3’, SPLS3’’, ..
• Replicate filtering
• Replicate filtered peak lists (RFPL): C5, S3, .., blank
• Sample filtering and blank filtering
• Sample filtered peak matrix (SFPM)
• Missing-value filtering
• PQN normalisation, batch correction, spectral cleaning
• Resulting matrices: SFPM PQN; SFPM PQN + BATCH; SFPM PQN + BATCH + CLEAN
• Impute missing values using KNN
• Resulting matrices: SFPM PQN + KNN; SFPM PQN + BATCH + KNN; SFPM PQN + BATCH + CLEAN + KNN
• Glog transformation
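The PQN normalisation and glog transformation steps named above can be illustrated with a minimal pure-Python sketch. It is not the implementation behind the slide's pipeline. It assumes the reference spectrum is the per-peak median across samples, skips the total-intensity scaling that often precedes PQN, and uses one common form of the generalised log with a placeholder `lam` value, which in practice would be estimated from the data. The function names and the `peaks` matrix are hypothetical.

```python
import math
from statistics import median


def pqn_normalise(matrix):
    """Probabilistic quotient normalisation (PQN) of a samples x peaks matrix.

    Assumptions for this sketch: intensities are positive, missing values are
    None, the reference spectrum is the per-peak median over all samples, and
    every sample shares at least one peak with the reference.
    """
    n_peaks = len(matrix[0])
    reference = []
    for j in range(n_peaks):
        column = [row[j] for row in matrix if row[j] is not None]
        reference.append(median(column) if column else None)

    normalised = []
    for row in matrix:
        # Quotient of each observed peak against the reference spectrum.
        quotients = [
            value / ref
            for value, ref in zip(row, reference)
            if value is not None and ref
        ]
        factor = median(quotients)  # estimated dilution factor of the sample
        normalised.append([None if v is None else v / factor for v in row])
    return normalised


def glog(value, lam=1.0):
    """One common form of the generalised log: ln(y + sqrt(y^2 + lambda))."""
    return math.log(value + math.sqrt(value * value + lam))


# Hypothetical peak matrix: three samples that differ only by dilution.
peaks = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
    [4.0, 8.0, 16.0],
]
normalised = pqn_normalise(peaks)
# Glog is applied to a complete matrix, i.e. after missing values are imputed.
transformed = [[glog(v) for v in row] for row in normalised]
```

In this toy matrix the three samples differ only by a dilution factor, so PQN maps all of them onto the same row. The glog step then compresses the intensity range while staying defined at zero.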

• Final matrices after glog transformation: SFPM PQN + KNN + GLOG; SFPM PQN + BATCH + KNN + GLOG; SFPM PQN + BATCH + CLEAN + KNN + GLOG

Visual Programming

• “Visual programming languages enable physicians and other computer users with little knowledge of programming to develop computer software. The physician uses a visual paradigm to "draw" the computer interface and then attaches short segments of computer code to buttons, menus, and list boxes.”

Ebell, M. H. (1993). Visual programming languages. M.D. Computing: Computers in Medical Practice, 10(5), 305–11.

Motivation

• Simplify your (working) life

• Data processing and analysis require various tools to work together in sequence
  • Data input and output
    • Spreadsheets
  • Data transformation
    • Transposition, aggregation, string manipulation
  • ISAcreator
    • Formatting of tables
  • Submission to MetaboLights

Disclaimer

• Workflows are great
• It does not have to be KNIME; there are many other solutions
• Every method that captures information in a consistent manner and enables reproducibility is great
  • Transparency
  • Ability to share data and ‘everything’ that was done to the data

Introduction

• KNIME: Konstanz Information Miner
  • http://www.knime.org/
• Developed at the University of Konstanz in Germany
• Desktop version available free of charge (open source)
• Modular platform for building and executing workflows using predefined components: nodes
• Core functionality available for tasks such as data analysis and manipulation
• Extra features and functionality available in KNIME through extensions from various groups (community) and vendors
• Written in Java, based on the Eclipse SDK platform

Workflow Concepts

• Workflow execution
  • Can execute complex, multi-step operations on input data
  • Can be run by “non-experts” using predefined parameter templates ensuring optimal results
  • Can be set up for specific measurement systems
  • Can be shared across researchers

Functionality

• Data manipulation and analysis
  • File & database I/O, sorting, filtering, grouping, joining, pivoting
  • Data mining, KNIME, interactive plotting
• Cheminformatics
  • Conversions, similarity, clustering, (Q)SAR analysis, etc.
• Scripting integration
  • R, Perl, Python, Matlab, Octave, Groovy
• Reporting and much more
  • Bioinformatics, HTS & image analysis, networks
  • Marketing, big data and business analytics

Modules (Community Extensions)

• http://tech.knime.org/community
• Chemoinformatics
  • CDK (EMBL-EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics)
  • ChEMBL and ChEBI (EMBL-EBI)
• Bioinformatics
  • OpenMS (Tübingen, ETH Zurich)
  • MassCascade (EMBL-EBI)
  • HCS (MPI), NGS (Konstanz), Image analysis
• Integration
  • Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis), REST and SOAP web service support

Workflow Platforms

Applications of metabolomics EMBO course

Applications metabolite reporting frequency

Applications MVDA

Applications Stat Regression Calibration

Advantages Disadvantages

Advantages:

• Intuitive to use
• No or little programming experience required
• Good for prototyping
• Lots of functionality
• Very modular and flexible
• Active community
• Extensible
• Visual feedback

Disadvantages:

• Steep learning curve
• Resource greedy
• No (free) server edition
• Slower execution than standalone scripts

Workbench

[Figure: KNIME workbench screenshot. Labelled elements: Auto-layout, Execute, Execute all nodes, node description, tabs, workflow projects, favorite nodes, public server, workflow editor, node repository, outline, console.]

Nodes

• Node: basic processing unit of a workflow
  • Performs a particular task

[Figure: anatomy of a node. Labelled elements:]

• Title
• Icon
• Input port(s): on the left of the icon
• Output port(s): on the right of the icon
• Sequence number
• Status display (‘traffic lights’)
  • Red (not ready)
  • Amber (ready)
  • Green (executed)
  • Blue bar during execution (with percentage or flashing)
• Right-click menu: to configure and execute the node, display the output views, edit the node, and display data for the ports

Dialogs

• Double-click opens configuration dialogs

• Explicit column types

MassCascade

https://bitbucket.org/sbeisken/masscascadeknime/wiki/ExampleWorkflows

XCMS

http://www.bioconductor.org/packages/devel/data/experiment/manuals/faahKO/man/faahKO.pdf

A great example: OpenMS

http://ftp.mi.fu-berlin.de/OpenMS/release-documentation/OpenMS_tutorial.pdf

[Figure: panel labels: Feature Filter, Deconvolution, Sample Alignment, Compound Spectra, Extracted Mass Chromatograms.]

Workflow Parameter Tuning

Applications Regression Calibration

Data Sources

Final Remarks

• Workflows can make exploratory or repetitive data tasks easier and save time
  • Extensive data pre-processing functionality
  • Extensions for statistics, machine learning, bio-, and cheminformatics
  • Integration of R (XCMS) and spectrometry extensions can help you to build elaborate pipelines and share work

• Can help to organize one’s thoughts.
• It’s actually quite a bit of fun.

Resources

• KNIME Forum
  • http://www.knime.org/
• KNIME Learning Hub
  • http://www.knime.org/learning-hub
• Quickstart Guide
  • http://tech.knime.org/files/KNIME_quickstart.pdf

• Happy to Help
  • [email protected]