Workflows for Automated Downstream Data Analysis

Supplementary Information

Workflows for automated downstream data analysis and visualization in large-scale computational mass spectrometry

Stephan Aiche1,*, Timo Sachsenberg2, Erhan Kenar3, Mathias Walzer2, Bernd Wiswedel4, Theresa Kristl5, Matthew Boyles6, Albert Duschl6, Christian G. Huber5, Michael R. Berthold7, Knut Reinert1, and Oliver Kohlbacher2

1 Department of Mathematics and Computer Science, Freie Universitaet Berlin, Germany
2 Applied Bioinformatics, Center for Bioinformatics, Quantitative Biology Center, and Dept. of Computer Science, University of Tübingen, Germany
3 Quantitative Biology Center (QBiC), University of Tübingen, Germany
4 KNIME.com AG, Zurich, 8005, Switzerland
5 Division of Chemistry and Bioanalytics, Department of Molecular Biology, University of Salzburg, Austria
6 Division of Allergy and Immunology, Department of Molecular Biology, University of Salzburg, Austria
7 Chair for Bioinformatics and Information Mining, Department of Computer and Information Science, University of Konstanz, Germany

*Corresponding Author: Stephan Aiche, Takustr. 9, 14195 Berlin, Germany, tel.: +49 30 838 75137, fax: +49 30 838 75218, email: [email protected]

Example 1 (TMT Quantitation)

Experimental procedure: Human lung adenocarcinoma epithelial cells (A549) were treated with 10 µg/ml (2.1 µg/cm² exposure surface) nano-copper oxide (CuO) or left untreated (three biological replicates). Cells were harvested at six time points (0, 1, 3, 6, 12, and 24 hours after exposure to CuO or medium-only controls) and subsequently lysed, reduced (4.55 mM tris(2-carboxyethyl)phosphine hydrochloride solution, Sigma Aldrich, St. Louis, MO, USA), and alkylated (8.70 mM iodoacetamide, Sigma Aldrich). The tryptic peptides (Trypsin, Promega, Madison, WI, USA) were labeled with 6-plex isobaric tags (TMT 6-plex, Thermo Fisher Scientific) and afterwards separated by capillary ion-pair reversed-phase HPLC (U3000 nano HPLC Unit from Dionex, Germering, Germany) in a 150 × 0.20 mm i.d. monolithic column (produced in-house according to ref. [1], commercially available as ProSwift™ columns from Thermo Scientific) with a 5 h gradient of 0-40% acetonitrile at 1.0 µL/min, followed by linear ion trap-Orbitrap mass spectrometry (LTQ Orbitrap XL from Thermo Scientific). Each sample was measured three times, with two exclusion lists used so that more proteins could be identified than in a single measurement.

Workflow details: The input for the workflow consists of the sequence database (UniProt, human) including the reversed sequences for decoy search, a text file containing the experimental layout in CSV (comma-separated values) format, and the 18 mzML files containing the unprocessed MS data. For each experiment, the layout file gives the condition (control or treated), the replicate number, and the association between the TMT channels and the respective time points. The workflow starts by analyzing all mzML files independently, performing identification and TMT reporter ion extraction. Initially, each file is separated into two smaller files, one containing only the precursor and CID scans, the other containing only the precursor and HCD (higher-energy collisional dissociation) scans. The CID scans are used for identification with OMSSA [2] and X!Tandem [3]. The HCD scans are used for reporter ion extraction (including an isotopic impurity correction as described in [4]) and for identification, again with OMSSA [2] and X!Tandem [3]. The four individual identification results are subsequently combined by the ConsensusID [5] approach. Afterwards, the identification results are mapped to the quantitative information extracted from the TMT reporter channels. The resulting data is converted into a format readable by the R package isobar [6]. The workflow then uses the KNIME R nodes to apply isobar to the data extracted from the MS data with respect to the given experimental design. After initial quantitation of the individual experiments, the data is combined at the protein level (i.e., results from the different experiments are grouped by protein) and different plots are generated to allow visual inspection of the results and quality control.
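
The experimental layout file described above can be illustrated with a small Python sketch. The original text does not specify the column names or file names, so `file`, `condition`, `replicate`, the channel columns, and the example rows are assumptions made for illustration only; the actual layout file used in the workflow may differ. The sketch reads one row per MS run and records which time point (in hours) each TMT 6-plex reporter channel represents.

```python
import csv
import io

# Nominal reporter ion masses of the TMT 6-plex channels.
TMT6_CHANNELS = ["126", "127", "128", "129", "130", "131"]

# Hypothetical layout file: column names, file names, and the
# channel-to-time-point assignment are illustrative assumptions.
EXAMPLE_LAYOUT = """\
file,condition,replicate,126,127,128,129,130,131
run01.mzML,control,1,0,1,3,6,12,24
run02.mzML,treated,1,0,1,3,6,12,24
"""


def read_layout(handle):
    """Return one record per MS run with its channel -> time point (hours) map."""
    records = []
    for row in csv.DictReader(handle):
        if row["condition"] not in ("control", "treated"):
            raise ValueError(f"unknown condition: {row['condition']!r}")
        records.append({
            "file": row["file"],
            "condition": row["condition"],
            "replicate": int(row["replicate"]),
            "time_points": {ch: int(row[ch]) for ch in TMT6_CHANNELS},
        })
    return records


layout = read_layout(io.StringIO(EXAMPLE_LAYOUT))
for record in layout:
    print(record["file"], record["condition"], record["time_points"])
```

Keeping the channel-to-time-point map per run, rather than globally, allows a layout in which different runs assign channels differently.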

Example 2 (Label-free quantitation of metabolites): The label-free quantitation workflow for metabolites demonstrates differential analysis of small molecules as used in biomarker discovery. Two spike-in conditions from a previously characterized dilution series of isotopically labeled compounds (0.5 mg/l and 10.0 mg/l against a male blood background) were measured using UPLC-MS. For details on the sample, sample processing, and data acquisition, we refer to [6]. The input of the workflow comprised six mzML files (two conditions, each measured in triplicate). Mass trace detection of eluting small molecules was performed with the FeatureFinderMetabo [6] node. In the subsequent nodes, retention time shifts between measurements were corrected (MapAlignerPoseClustering [7]) and features occurring at the same mass-to-charge ratio and retention time were linked between MS runs (FeatureLinkerUnlabeledQT [8]). Identifications were assigned by AccurateMassSearch, querying the HMDB database (extended with the isotopically labeled spike-in compounds). The results were converted into the mzTab data format [7], which was parsed using the SmallMoleculeMzTabReader node to convert it into a KNIME table. The column containing the chemical formula was then converted into a KNIME-compatible molecule type (Molecule Type Cast) to allow interoperability with other KNIME extensions (i.e., cheminformatics packages). In the lower part of the workflow, statistical analysis was performed with the integrated R framework. First, quantification values were quantile normalized [9]. Multiple hypothesis testing (two-sided t-test) and a minimum fold change of two were used to determine differentially quantified features. P-values were corrected for FDR control using the Benjamini-Hochberg procedure. The quantification results passing the FDR threshold of 5% were then joined with the identification results into a table for manual inspection.
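
The statistical steps named above (quantile normalization, Benjamini-Hochberg correction, and the fold-change filter) can be sketched as follows. The workflow itself runs these steps in R; Python is used here only for illustration, and this is not the authors' code. The function names and input shapes are assumptions, the t-test is not reimplemented (p-values are taken as given), and the fold-change criterion is assumed to apply in both directions.

```python
from math import log2


def quantile_normalize(runs):
    """Quantile-normalize intensities; `runs` holds one equal-length list per MS run.

    Ties are broken by position, which is a simplification.
    """
    n_features = len(runs[0])
    sorted_runs = [sorted(run) for run in runs]
    # Mean intensity at each rank across all runs.
    rank_means = [sum(run[r] for run in sorted_runs) / len(runs)
                  for r in range(n_features)]
    normalized = []
    for run in runs:
        order = sorted(range(n_features), key=lambda i: run[i])
        out = [0.0] * n_features
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(fold_changes, p_values, fdr=0.05, min_fold_change=2.0):
    """Indices of features passing both the FDR and the fold-change threshold.

    `fold_changes` are positive ratios between the two conditions.
    """
    adjusted = benjamini_hochberg(p_values)
    threshold = log2(min_fold_change)
    return [i for i, (fc, q) in enumerate(zip(fold_changes, adjusted))
            if q <= fdr and abs(log2(fc)) >= threshold]


# Hypothetical values for four features.
print(select_differential([4.0, 1.2, 0.25, 3.0], [0.001, 0.002, 0.04, 0.5]))
```

Working on the log2 scale treats a two-fold increase and a two-fold decrease symmetrically, which is one common way to read a "minimum fold change of two".
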
As chemical formulas were converted to a KNIME-compatible molecule type, their molecular structure could be easily visualized with cheminformatics packages (e.g., CDK).

Example 3 (Quality Control): The workflow shown is a comprehensive qcML workflow that uses many of the more advanced features of qcML. In addition to the qcML file, it also generates a report in PDF format. The workflow starts with two input nodes, providing the mzML files to be analyzed and the sequence database used for identification. The first step in the workflow is the preprocessing node, which performs feature detection and identification, each resulting in a file that can be used to assess the quality of the experiment. The QCCalculator takes the two generated files and the original mzML file to calculate basic statistics and aggregate the quality data needed later to apply more advanced quality metrics. All of this is stored in a qcML file, which makes the needed data easy to access. More data is added before the extended file is handed over to the next step.
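
One of the quality metrics collected in this workflow is the mass accuracy of the identified spectra, summarized as a median deviation. A minimal sketch of such a calculation is given here, assuming the deviation is expressed in parts per million (ppm) between observed and theoretical precursor m/z; the unit, the function names, and the example values are assumptions and are not taken from the workflow.

```python
from statistics import median


def ppm_error(observed_mz, theoretical_mz):
    """Signed deviation of an observed m/z from its theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def median_mass_deviation(matches):
    """Median ppm deviation over (observed m/z, theoretical m/z) pairs."""
    return median(ppm_error(obs, theo) for obs, theo in matches)


# Hypothetical identifications: (observed m/z, theoretical m/z).
matches = [(500.0005, 500.0), (1000.002, 1000.0), (750.0, 750.0)]
print(median_mass_deviation(matches))
```

The median is used rather than the mean so that a few badly matched spectra do not dominate the summary value.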

The ID Ratio Meta-node creates a plot of the measured spectra vs. the identified spectra in an m/z vs. RT map. The plot is included in the qcML file and also sent to the reporting tool. In the Mass Accuracy Meta-node, the accuracy of the measurement is analyzed with reference to the identifications. The calculated median deviation and the corresponding plot are added to the qcML file. In addition, the mass accuracy over the elution time is plotted. In the following Fractional Mass Meta-node, an external reference file (theoretical_masses.txt) of theoretical masses is used to plot the experimentally acquired fractional masses on top of the theoretically possible ones. The last Meta-node plots the total ion current of the experiment over time. If additional data is available in tabular format, it can also be added to the qcML file. In this case, the injection times of the instrument are added to the qcML file. Subsequently, some of the computed quality metrics are extracted from the qcML file in the following Meta-node (Extraction of QP details) to make them accessible to the KNIME reporting engine. Finally, the QCShrinker removes verbose and redundant data. In addition to the qcML file, the workflow exports numerous plots and quality metrics to the KNIME reporting engine. After the workflow has been executed, one can open the report view in KNIME to create a PDF file containing the plots and quality values.

References

[1] Premstaller, A., Oberacher, H., Huber, C. G., High-performance liquid chromatography-electrospray ionization mass spectrometry of single- and double-stranded nucleic acids using monolithic capillary columns. Anal Chem 2000, 72, 4386-4393.
[2] Geer, L. Y., Markey, S. P., Kowalak, J. A., Wagner, L., et al., Open mass spectrometry search algorithm. J Proteome Res 2004, 3, 958-964.
[3] Craig, R., Beavis, R. C., TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466-1467.
[4] Bielow, C., 2012, p. 147 S.
[5] Nahnsen, S., Bertsch, A., Rahnenführer, J., Nordheim, A., Kohlbacher, O., Probabilistic consensus scoring improves tandem mass spectrometry peptide identification. J Proteome Res 2011, 10, 3332-3343.
[6] Breitwieser, F. P., Muller, A., Dayon, L., Kocher, T., et al., General statistical modeling of data from protein relative expression isobaric tags. J Proteome Res 2011, 10, 2758-2766.
[7] Griss, J., Jones, A. R., Sachsenberg, T., Walzer, M., et al., The mzTab Data Exchange Format: communicating MS-based proteomics and metabolomics experimental results to a wider audience. Mol Cell Proteomics 2014.