Vol. 25 no. 15 2009, pages 1980–1981 APPLICATIONS NOTE doi:10.1093/bioinformatics/btp301

Gene expression

A toolbox for validation of mass spectrometry peptides identification and generation of database: IRMa

Véronique Dupierris1, Christophe Masselon2, Magali Court2, Sylvie Kieffer-Jaquinod2 and Christophe Bruley2,∗

1Fondation Rhône-Alpes Futur, 89 rue Bellecombe, 69003 Lyon and 2CEA, DSV, iRTSV, Laboratoire d’Etude de la Dynamique des Protéomes, INSERM, U880, Université Joseph Fourier, Grenoble F-38054, France

Received on December 5, 2008; revised and accepted on April 30, 2009

Advance Access publication May 6, 2009

Associate Editor: John Quackenbush

ABSTRACT

Summary: The IRMa toolbox provides an interactive application to assist in the validation of Mascot search results. It allows automatic filtering of Mascot identification results as well as manual confirmation or rejection of individual PSMs (a PSM is a match between a fragmentation mass spectrum and a peptide). Dynamic grouping and coherence of information are maintained by the software in real time. Validated results can be exported in various forms, including an identification database (MSIdb). This allows biologists to compile search results from a whole study in a single repository in order to provide a summarized view of their project. IRMa also features a fully automated version that can be used in a high-throughput pipeline. Given filter parameters, it can delete hits with no significant PSM, regroup hits identified by the same peptide(s) and export the result to the specified format without user intervention.

Availability: http://biodev.extra.cea.fr/docs/irma (Java 1.5 or higher needed)

Contact: [email protected]

∗To whom correspondence should be addressed.

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

1 INTRODUCTION

The Mascot search engine uses mass spectrometry data to identify proteins by matching experimental and theoretical peptide mass spectra (Aebersold and Mann, 2003). However, it is necessary to eliminate false matches (Baldwin, 2004; Nesvizhskii and Aebersold, 2004) in order to avoid incorrect protein identification.

The Mascot html output is not designed for this kind of validation, and exporting the data to Excel for subsequent validation results in loss of data consistency. Moreover, since validation is the first step in data interpretation, subsequent data mining often requires compilation of numerous search results. IRMa is a validation toolbox for Mascot identification results that allows users to validate the identified proteins and their PSMs and to export them to various formats. Its main originality is to filter matches rather than identified proteins.

2 VALIDATING HITS VERSUS PSM

For each submitted spectrum (query), Mascot attempts to match up to 10 peptides from a sequence database. A score, based on the absolute probability that the observed match is a random event, is assigned to each peptide spectrum match (PSM). For each query, PSMs are ranked according to their score. Mascot then groups PSMs into hits. A hit contains not only all proteins covered by the same set of PSMs but also all proteins covered by a subset of these PSMs. A hit score, based on its matching PSM scores, is assigned to each protein.

While existing tools (Bouyssié et al., 2007) validate whole hits, selecting confident PSMs and rejecting dubious ones undoubtedly allows more accurate identification. When validating whole hits, false PSMs can still be considered in the result. On the other hand, validation at the PSM level affects the grouping consistency (and hit properties such as score and coverage): rejecting a PSM without updating the grouping leads to redundancy in the protein hit list. IRMa allows validation at the PSM level, while circumventing the associated difficulties.

3 FEATURES

IRMa uses the Mascot Parser, distributed free of charge by Matrix Science, to build an identification result from the native ‘.dat’ files generated by the Mascot server. The parser can be invoked with different report parameters, such as the number of hits or significance thresholds. At this stage, the same information and grouping shown in the Mascot result page are available but displayed in a structured and accessible layout by IRMa (see Fig. 1). PSMs relevant for protein identifications are flagged as significant. A PSM that has the same sequence as a significant one but is associated with another query and has a lower score is flagged as duplicated.

Fig. 1. IRMa snapshot. The user can browse the submitted spectra (queries) (1) or the identified protein hits and PSMs (2). The main component of the interface displays query statistics or, as shown here, hit information. In this case, the upper part (3) shows the master protein description and properties together with the lists of same-set and sub-set proteins. PSMs classified in significant, ambiguous and duplicated categories are displayed for the current hit (4).

For validation purposes, in addition to the significant and duplicated categories, IRMa introduces a third category in which ambiguous PSMs are classified. This third category regroups PSMs that are rejected by filtering and do not contribute to protein identification. While the classification of PSMs into these categories can be done manually, several filters applied to the Mascot parser results are proposed. They provide rules to assign PSMs to the significant or ambiguous categories and can be based on PSM ranks, on scores, on mandatory post-translational modifications or on expressions involving PSM properties. This automatic classification can always be adjusted; the user can restore PSMs from the ambiguous group to the significant group or vice versa. Nevertheless, IRMa will ensure that the exact same sequence cannot appear twice in the significant category.

Dynamic classification has an impact on protein hits: when a PSM is declared ambiguous, it becomes irrelevant to protein identification. IRMa takes these changes into account to ensure consistency of information such as protein coverage and identification score, and dynamically proposes new protein hit sets and sub-sets based on significant PSMs. For instance, two proteins of the same hit that initially shared only a sub-set of PSMs can become indistinguishable (all significant PSMs are shared) if specific PSMs are judged ambiguous. Another possibility is the merging of two distinct hits into a single one if discriminating PSMs are flagged as ambiguous. For example, given the following two hits, ACTG_HUMAN {pep_63, pep_133, pep_167, pep_254, pep_311} and Q53G99_HUMAN {pep_63, pep_133, pep_139, pep_254, pep_311}, if PSM pep_139 is declared ambiguous, IRMa will propose grouping Q53G99_HUMAN and ACTG_HUMAN as the same hit.

The user can validate a new grouping proposed by the application by merging hits that become sub-sets or same sets of another and by deleting hits that no longer contain any significant PSM. This implies that hits are reordered according to their scores.

The validated result can finally be exported in various forms: to the widely used Microsoft Excel format; to a PDF format including graphical representation of annotated spectra (see footnote 1); or to MSIdb (see footnote 2). The information recorded in all exports can be customized. For instance, in Microsoft Excel format, the user can choose which worksheets (protein summary, similar proteins, peptide list, etc.) and which columns will be present. Information on Mascot search parameters or automatic filtering is also reported in the output file. The MSIdb stores this information as well as that on grouping and matches between proteins and PSMs. Exporting to MSIdb is especially suitable for large-scale studies.

4 LARGE-SCALE STUDIES

Biological studies often require the comparison of numerous samples. Furthermore, each sample can be fractionated before analysis to generate several aliquots, which in turn results in as many Mascot searches. One of the major challenges facing the biologist resides in the high redundancy in identification lists. Compiling multiple search results through multiple Excel files or html results pages is impractical. In order to overcome this difficulty, the creation of a single database, containing the identification results of a whole project, allows the user to access information on all analyzed samples at once.

IRMa was used in our laboratory to compile data from a large-scale proteomics study of A. thaliana chloroplasts. The MSIdb was compiled in a fully unattended mode after providing the software with filter parameters and an output path. The database was populated using search results from ∼585 000 Mascot queries generated from ∼500 LC–MS/MS analyses, with minimal user intervention.

5 PERSPECTIVES

IRMa allows generation of an MSIdb for a given project. To enhance data mining of a comprehensive identification catalog, it would therefore be helpful to integrate identification databases with information from other data sources (e.g. web-based public repositories). We are currently working on a tool to achieve this aim and to further investigate identified proteins. Grouping identifications at the whole database level to remove redundancy among samples or comparing multiple experiments are some of the available operations.

ACKNOWLEDGEMENTS

We thank the Matrix Science technical support team for their timely answers to our requests, the IRMa workgroup for their help in application specification and design, and all EDyP Laboratory team members for testing and feedback.

Funding: CEA, INSERM and the Rhône-Alpes Génopôle.

Conflict of Interest: none declared.

REFERENCES

Aebersold,R. and Mann,M. (2003) Mass spectrometry-based proteomics. Nature, 422, 198–207.
Baldwin,M. (2004) Protein identification by mass spectrometry: issues to be considered. Mol. Cell. Proteomics, 3, 1–9.
Bouyssié,D. et al. (2007) MFPaQ, a new software to parse, validate, and quantify proteomic data generated by ICAT and SILAC mass spectrometric analyses: application to the proteomic study of membrane proteins from primary human endothelial cells. Mol. Cell. Proteomics, 6, 1621–1637.
Nesvizhskii,A.I. and Aebersold,R. (2004) Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov. Today, 9, 173–181.

1 MCP guideline compliant, for single-peptide hits and PTM peptide hits.
2 For database structure see supplementary material.
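
Illustrative sketch. The PSM classification and dynamic hit regrouping described in Section 3 can be pictured with the short Python sketch below. It is not IRMa's code (IRMa is a Java application) and does not use the Mascot Parser; the names PSM, classify and regroup, the score threshold, the rank cut-off and the peptide sequences are all assumptions made for this example. Only the two protein hits and their peptide lists come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    query: int       # spectrum (query) number
    sequence: str    # matched peptide sequence (hypothetical values below)
    score: float     # search-engine score, higher is better
    rank: int        # rank of the match within its query (1 = best)


def classify(psms, min_score, max_rank=1):
    """Assign each PSM to 'significant', 'duplicated' or 'ambiguous'."""
    categories = {}
    seen_sequences = set()
    # Visit the best-scoring PSMs first so that, for a given sequence,
    # only the top-scoring match can be kept as significant.
    for psm in sorted(psms, key=lambda p: p.score, reverse=True):
        if psm.score < min_score or psm.rank > max_rank:
            categories[psm] = "ambiguous"    # rejected by the filter
        elif psm.sequence in seen_sequences:
            categories[psm] = "duplicated"   # same sequence, lower score
        else:
            seen_sequences.add(psm.sequence)
            categories[psm] = "significant"
    return categories


def regroup(hits, ambiguous):
    """Propose hit groups from the peptides that remain significant.

    hits: dict mapping protein accession -> set of peptide identifiers.
    ambiguous: set of peptide identifiers declared ambiguous.
    Returns (groups, deleted), where deleted lists proteins left with
    no significant peptide.
    """
    significant = {prot: peps - ambiguous for prot, peps in hits.items()}
    deleted = sorted(p for p, peps in significant.items() if not peps)
    remaining = {p: peps for p, peps in significant.items() if peps}
    groups = []
    # Largest peptide sets first, so that masters are created before
    # the proteins they cover. Simplification: a protein is attached to
    # the first master that covers it, even if several masters do.
    for prot in sorted(remaining, key=lambda p: (-len(remaining[p]), p)):
        peps = remaining[prot]
        for group in groups:
            master_peps = remaining[group["master"]]
            if peps == master_peps:
                group["same_set"].append(prot)
                break
            if peps < master_peps:           # strict subset
                group["sub_set"].append(prot)
                break
        else:
            groups.append({"master": prot, "same_set": [], "sub_set": []})
    return groups, deleted


# The two hits quoted in Section 3.
hits = {
    "ACTG_HUMAN": {"pep_63", "pep_133", "pep_167", "pep_254", "pep_311"},
    "Q53G99_HUMAN": {"pep_63", "pep_133", "pep_139", "pep_254", "pep_311"},
}
before, _ = regroup(hits, ambiguous=set())
after, _ = regroup(hits, ambiguous={"pep_139"})
print(len(before), "hits before;", len(after), "hit after")
```

With pep_139 declared ambiguous, Q53G99_HUMAN keeps only peptides that ACTG_HUMAN also has, so the sketch attaches it to the ACTG_HUMAN hit as a sub-set protein, which mirrors the merge IRMa is described as proposing. A real implementation would also have to recompute hit scores and coverage, which this sketch leaves out.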

