Analysis of and Sequence Data with KNIME

Alexander Fillbrunn, Julianus Pfeuffer, Jeanette Prinz
The Center for Integrative Bioinformatics (CIBI) and KNIME

Schedule

• About us
• Generic KNIME Nodes
• Mass-spectrometry in KNIME with OpenMS
• Introduction and theory of label-free quantification
• Demo of a label-free quantification workflow
• Analysis of high-throughput sequencing data with KNIME and SeqAn
• Introduction and theory of variant calling
• Demo of a variant calling workflow

German Network for Bioinformatics Infrastructure

de.NBI Mission Statement

• The 'German Network for Bioinformatics Infrastructure' provides comprehensive first-class bioinformatics services to users in life sciences research, industry and medicine. The de.NBI program coordinates bioinformatics training and education and the cooperation of the German bioinformatics community with international bioinformatics network structures.

Center for Integrative Bioinformatics (CIBI)

• … provides cutting-edge and integrative tools for proteomics, metabolomics, NGS and image data analysis as well as a workflow engine to integrate tools into coherent solutions for reproducible analysis of large-scale biological data.

Generic KNIME Nodes

• Wrapping of command line tools (OpenMS, SeqAn, …) in KNIME via GenericKNIMENodes (GKN)
• Every OpenMS and SeqAn tool writes its Common Tool Description (CTD) via its command line parser
• GKN generates source code (static) or an XML representation (dynamic) for nodes to show up in KNIME
• Wraps C++ (and more) executables and provides additional file handling nodes

Mass-spectrometry data analysis in KNIME

Julianus Pfeuffer, Alexander Fillbrunn

OpenMS

• OpenMS – an open-source C++ framework for computational mass spectrometry
• Jointly developed at ETH Zürich, FU Berlin, University of Tübingen
• Open source: BSD 3-clause license
• Portable: available on Windows, OSX, Linux
• Vendor-independent: supports all standard formats and vendor formats through ProteoWizard
• OpenMS TOPP tools
  – The OpenMS Proteomics Pipeline tools
  – Building blocks: one application for each analysis step
  – All applications share identical user interfaces
  – Uses PSI standard formats
• Can be integrated in various workflow systems
  – Galaxy
  – WS-PGRADE/gUSE
  – KNIME

Kohlbacher et al., Bioinformatics (2007), 23:e191

Installation of the OpenMS plugin

• Community-contributions update site (stable & trunk)
  – Bioinformatics & NGS
• Provides > 180 OpenMS TOPP tools as Community nodes
  – SILAC, iTRAQ, TMT, label-free, SWATH, SIP, …
  – Search engines: OMSSA, MASCOT, X!TANDEM, MSGFplus, …
  – Protein inference: FIDO

A Mass Spectrum Data Flow in

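[Figure: data-flow diagram from sample to differential quantification. Recoverable labels: Sample, HPLC/MS, Raw Data (100 GB), Sig. Proc., Peak Maps, Data Reduction, Identification, Annotated Maps, Diff. Quant., Differentially Expressed. Data sizes shown along the flow: 100 GB, 1 GB, 50 MB, 50 MB, 50 kB.]

Quantification Strategies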

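[Figure: tree of quantification strategies. Quantitative Proteomics divides into Relative Quantification and Absolute Quantification (AQUA, SISCAPA). Relative Quantification divides into Labeled (In vivo: 14N/15N, SILAC; In vitro: iTRAQ, TMT, 16O/18O) and Label-Free (Spectral Counting, Feature-Based). MRM appears as a further leaf.]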

After: Lau et al., Proteomics, 2007, 7, 2787

Quantitative Data – LC-MS Maps

• Spectra are acquired at rates of up to dozens per second
• Stacking the spectra yields maps
• Resolution:
  – Up to millions of points per spectrum
  – Tens of thousands of spectra per LC run
• Huge 2D datasets of up to hundreds of GB per sample
• MS intensity follows the chromatographic concentration

LC-MS Data (Map)

Quantification (15 nmol/µl, 3x over-expressed, …)
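How stacked spectra form a map can be sketched in a few lines of Python. This is only an illustration: the data layout, the function names `stack_spectra` and `chromatogram`, and all values are assumptions made for the example, not part of OpenMS or KNIME.

```python
from typing import List, Tuple

# One spectrum: (retention time, [(m/z, intensity), ...]). Hypothetical layout.
Spectrum = Tuple[float, List[Tuple[float, float]]]

def stack_spectra(spectra: List[Spectrum]) -> List[Tuple[float, float, float]]:
    """Flatten spectra into a 2D map of (rt, m/z, intensity) points, sorted by rt then m/z."""
    points = [(rt, mz, inten) for rt, peaks in spectra for mz, inten in peaks]
    return sorted(points)

def chromatogram(lcms_map, mz, tol=0.01):
    """Intensity over retention time for one m/z value (the elution profile)."""
    return [(rt, inten) for rt, m, inten in lcms_map if abs(m - mz) <= tol]

# Toy data (invented): one signal at m/z 569.24 that rises and falls over time.
spectra = [
    (10.0, [(569.24, 100.0), (610.24, 5.0)]),
    (11.0, [(569.24, 400.0)]),
    (12.0, [(569.24, 150.0), (610.24, 4.0)]),
]
lcms_map = stack_spectra(spectra)
profile = chromatogram(lcms_map, 569.24)
```

The `profile` list is the intensity trace along the retention-time axis, which is the quantity that follows the chromatographic concentration.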

Label-Free Quantification (LFQ)

• Label-free quantification is probably the most natural way of quantifying
  – No labeling required, removing further sources of error, no restriction on sample generation, cheap
  – Data on different samples are acquired in different measurements, so higher reproducibility is needed
  – Manual analysis difficult
  – Scales very well with the number of samples, basically no limit, no difference in the analysis between 2 or 100 samples

LFQ – Analysis Strategy

1. Find features in all maps
2. Align maps
3. Link corresponding features
4. Identify features
5. Quantify
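The final step, quantification, comes down to comparing the intensities of linked features across maps. A minimal sketch, assuming one linked feature with one intensity per map: the peptide label is the one shown on the slide, while the intensities and the helper `relative_ratios` are invented for illustration.

```python
def relative_ratios(intensities, reference_index=0):
    """Express each map's feature intensity relative to a chosen reference map."""
    ref = intensities[reference_index]
    if ref <= 0:
        raise ValueError("reference intensity must be positive")
    return [round(x / ref, 2) for x in intensities]

# Hypothetical linked feature observed in three maps (arbitrary intensity units).
consensus_feature = {"peptide": "GDAFFGMSCK", "intensities": [2.0e6, 2.4e6, 1.0e6]}

ratios = relative_ratios(consensus_feature["intensities"])
print(consensus_feature["peptide"], " : ".join(f"{r:.1f}" for r in ratios))
```

Because every map is treated the same way, the calculation is identical for 2 or 100 samples.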

[Slide annotations: identified peptide GDAFFGMSCK (step 4); intensity ratio across maps 1.0 : 1.2 : 0.5 (step 5)]

Feature Finding

• Identify all peaks belonging to one peptide
• Key idea:
  – Identify candidate regions (seeds, e.g. the highest peaks)
  – Fit a model to that region and identify peaks explained by it
• Extension: collect all data points close to the seed
• Refinement: remove peaks that are not consistent with the model
• Fit an optimal model for the reduced set of peaks
• Iterate this until no further improvement can be achieved

Feature-Based Alignment

• LC-MS maps can contain millions of peaks
• Retention time of peptides (or metabolites) can shift between experiments
• In label-free quantification, maps thus need to be aligned in order to identify corresponding features
• Alignment can be done on the raw maps (where it is usually called ‘dewarping’) or on already identified features
• The latter is simpler, as it does not require the alignment of millions of peaks, but just of tens of thousands of features
• Disadvantage: it relies on accurate feature finding
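Since feature-based alignment depends on the feature finder, a small sketch of the seed, extension, refinement and refit loop from the Feature Finding slides may help. It is a deliberately simplified, one-dimensional version (a single elution profile with a Gaussian model fitted by moments). It does not reproduce the OpenMS algorithm, and the function names, the tolerance and the toy trace are all assumptions.

```python
import math

def gaussian(x, height, center, sigma):
    """Model intensity at position x."""
    return height * math.exp(-0.5 * ((x - center) / sigma) ** 2)

def fit_by_moments(points):
    """Crude model fit: intensity-weighted mean and spread, height = tallest point."""
    total = sum(i for _, i in points)
    center = sum(x * i for x, i in points) / total
    var = sum(i * (x - center) ** 2 for x, i in points) / total
    return max(i for _, i in points), center, max(math.sqrt(var), 1e-6)

def find_feature(trace, window=3.0, tol=0.4, max_iter=20):
    seed_x, seed_i = max(trace, key=lambda p: p[1])               # seed: highest peak
    region = [p for p in trace if abs(p[0] - seed_x) <= window]   # extension
    params = (seed_i, seed_x, window / 2)                         # initial model guess
    kept = None
    for _ in range(max_iter):
        # refinement: keep only the points the current model explains
        consistent = [(x, i) for x, i in region
                      if abs(i - gaussian(x, *params)) <= tol * params[0]]
        if not consistent or consistent == kept:                  # no further change
            break
        kept = consistent
        params = fit_by_moments(kept)                             # refit on reduced set
    return params, kept

# Toy trace (invented): a Gaussian peak at x = 10 plus one interfering point at x = 13.
trace = [(float(x), 100.0 * math.exp(-0.5 * (x - 10) ** 2)) for x in range(7, 13)]
trace.append((13.0, 80.0))

(height, center, sigma), feature_points = find_feature(trace)
```

In this toy case the interfering point is rejected in the first refinement pass and the refit converges on the remaining six points. Real feature finders work in both retention time and m/z.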

[Figure: example map with ~350,000 peaks and ~700 features]

Multiple Alignment

• Dewarp k maps onto a comparable coordinate system
• Choose one map (usually the one with the largest number of features) as reference map (here: map 2 -> T2 = 1)
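The reference-map strategy can be sketched as follows. This is a minimal illustration under strong assumptions: features are plain (rt, m/z) pairs, correspondences are found by nearest m/z, and each transformation Tk is a straight line fitted by least squares. Real aligners use more robust matching and more flexible transformations; `fit_line`, `align_maps` and the toy maps are hypothetical.

```python
def fit_line(pairs):
    """Least-squares fit of y = a * x + b to (x, y) pairs (needs >= 2 distinct x)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def align_maps(maps, mz_tol=0.01):
    """Return the reference index and one (slope, offset) rt transform per map."""
    # Reference: the map with the largest number of features.
    ref_idx = max(range(len(maps)), key=lambda k: len(maps[k]))
    reference = maps[ref_idx]
    transforms = []
    for k, features in enumerate(maps):
        if k == ref_idx:
            transforms.append((1.0, 0.0))        # identity for the reference map
            continue
        pairs = []
        for rt, mz in features:
            ref_rt, ref_mz = min(reference, key=lambda f: abs(f[1] - mz))
            if abs(ref_mz - mz) <= mz_tol:       # accept only close m/z matches
                pairs.append((rt, ref_rt))
        transforms.append(fit_line(pairs))
    return ref_idx, transforms

# Toy feature maps (invented): (rt, m/z) per feature; the second map has most features.
maps = [
    [(90.0, 500.1), (190.0, 600.2), (290.0, 700.3)],
    [(100.0, 500.1), (200.0, 600.2), (300.0, 700.3), (400.0, 800.4)],
    [(55.0, 500.1), (105.0, 600.2), (205.0, 800.4)],
]
ref_idx, transforms = align_maps(maps)
```

Applying each (slope, offset) pair to the retention times of its map places all features on the reference time axis, after which corresponding features can be linked.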

[Figure: maps 1 … k (axes: rt, m/z) are mapped by transformations T1 … Tk onto a consensus map]

Peptide Identification

[Figure: database search workflow]

• LC-MS/MS experiment → experimental spectra → fragment m/z values (569.24, 572.33, 580.30, 581.46, 582.63, 606.32, 610.24, 616.14)
• Sequence database (example entry: Q9NSC5|HOME3_HUMAN Homer protein homolog 3 - Homo sapiens (Human)) → theoretical fragment m/z values from suitable peptides → theoretical spectra
• Compare experimental and theoretical spectra, then score hits:

  Rank  Peptide       Score
  1     QRESTATDILQK  18.77
  2     EIEEDSLEGLKK  14.78
  3     GIEDDLMDLIKK  12.63
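The compare-and-score idea can be illustrated with a small sketch that computes singly charged b- and y-ion m/z values for candidate peptides and counts how many experimental peaks each candidate explains. The shared-peak count used here is a stand-in, not the scoring function of any search engine named earlier, and the "experimental" spectrum is simulated from one candidate rather than taken from the slide.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def fragment_mz(peptide):
    """Singly charged b- and y-ion m/z values of a peptide, sorted ascending."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for k in range(1, len(peptide)):
        ions.append(sum(masses[:k]) + PROTON)            # b ion: N-terminal prefix
        ions.append(sum(masses[k:]) + WATER + PROTON)    # y ion: C-terminal suffix
    return sorted(ions)

def shared_peak_count(experimental, peptide, tol=0.02):
    """Number of experimental peaks matched by a theoretical fragment within tol."""
    theoretical = fragment_mz(peptide)
    return sum(any(abs(e - t) <= tol for t in theoretical) for e in experimental)

candidates = ["QRESTATDILQK", "EIEEDSLEGLKK", "GIEDDLMDLIKK"]

# Simulated experimental spectrum: every second fragment of one candidate.
experimental = fragment_mz("EIEEDSLEGLKK")[::2]

ranked = sorted(candidates, key=lambda p: shared_peak_count(experimental, p), reverse=True)
```

Candidates that share C-terminal residues (here the "KK" endings) still match a few peaks by coincidence, which is why search engines use more discriminating scores than a raw count.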

Sven Nahnsen

Variant Calling/Annotation with SeqAn and KNIME

Jeanette Prinz, Julianus Pfeuffer, Alexander Fillbrunn, René Rahn

© 2019 KNIME AG. All rights reserved.

Next Generation Sequencing (NGS)

• DNA sequencing: process of determining the nucleic acid sequence – the order of nucleotides (A, T, G, C) in DNA
• NGS platforms perform sequencing of millions of small fragments (reads) of DNA in parallel => fast, cheap
• Bioinformatics is used to map the individual reads to a reference genome
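What mapping a read to a reference means can be shown with a naive sketch: slide the read along the reference and keep the position with the fewest mismatches. Real read mappers rely on index data structures instead of this brute-force scan; the function names and sequences below are made up for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None if none qualifies."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best

# Toy reference and reads (invented sequences).
reference = "ACGTTAGCCGATAGGCTTAACG"
reads = ["TAGCCG", "GATTGG", "TTTTTT"]

placements = {read: map_read(read, reference) for read in reads}
```

The second read maps with a single mismatch; differences of this kind, when supported by many overlapping reads, are what a variant caller evaluates.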

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841808/

NGS Application Areas

• Hereditary Diseases
• Agriculture
• Cancer
• Metagenomics

Taken from: http://www.nfcr.org/sites/default/files/images/GenomicProfiling2.jpg, http://ecx.images-amazon.com/images/I/51tztcMqIRL._SS500_.jpg

SeqAn - a Bioinformatics Resource

SeqAn Library Features

• Efficient index data structures
• Efficient algorithms
• Fully parallelized and vectorized pairwise alignment algorithms
• Search schemes for index searches
• Fast I/O: SAM/BAM, FastA/FastQ, VCF, …

SeqAn: NGS Tools

Installation: KNIME Community Contributions -> Bioinformatics & NGS -> SeqAn NGS ToolBox

Variant Calling

https://www.ebi.ac.uk/training/online/course/human-genetic-variation-i-introduction/variant-identification-and-analysis/what-variant

© 2019 KNIME AG. All rights reserved. 37 Variant Annotation

© 2019 KNIME AG. All rights reserved. 38 Variant Consequences

http://www.ensembl.info/2012/08/06/variation-consequences/

Variant Calling/Annotation with KNIME

Demo

Summary

• SeqAn KNIME nodes for mapping to a reference genome and variant calling
• KNIME nodes for variant annotation and visualization of results
• Workflow can be easily extended and adjusted to your own needs, e.g. to include quality control

The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States. KNIME® is also registered in Germany.