The HUPO Proteomics Standards Initiative Meeting: Towards Common Standards for Exchanging Proteomics Data Hinxton, Cambridge, UK, 19–20 October 2002

Comparative and Functional Genomics
Comp Funct Genom 2003; 4: 16–19.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.232

Feature

Meeting Review

Sandra Orchard, Paul Kersey, Henning Hermjakob* and Rolf Apweiler
EMBL Outstation–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

*Correspondence to: Henning Hermjakob, EMBL Outstation–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. E-mail: [email protected]

Received: 14 November 2002
Accepted: 14 November 2002

Abstract

The Proteomics Standards Initiative (PSI) aims to define community standards for data representation in proteomics and to facilitate data comparison, exchange and verification. Initially the fields of protein–protein interactions (PPI) and mass spectroscopy have been targeted and the inaugural meeting of the PSI addressed the questions of data storage and exchange in both of these areas. The PPI group rapidly reached consensus as to the minimum requirements for a data exchange model; an XML draft is now being produced. The mass spectroscopy group have achieved major advances in the definition of a required data model and working groups are currently taking these discussions further. A further meeting is planned in January 2003 to advance both these projects. Copyright 2003 John Wiley & Sons, Ltd.

Keywords: proteomics; spectroscopy; protein–protein interactions

Introduction

The Proteomics Standards Initiative was established following a meeting in April 2002, jointly organized by HUPO and NAS, at which the urgent need for standardization of proteomics data was recognized. Rolf Apweiler (Sequence Database Group, European Bioinformatics Institute) opened the proceedings by explaining that a decision had been made to address these issues initially in the fields of mass spectroscopy and protein–protein interactions (PPI). This inaugural meeting of the Proteomics Standards Initiative brought together representatives from the database producer, user and software producer communities, who were seen as essential in establishing and maintaining the required standards and who were jointly charged over the 2 days of the meeting with laying the groundwork that would enable these objectives to be met.

The delegates listened to a short presentation by Alvis Brazma (EBI), outlining the successful standardization of microarray data in the MGED process, before splitting into two working parties to address the issues facing their respective fields.

Protein–protein interactions (PPI) group

The session commenced with a brief introduction from each of the PPI databases represented at the meeting as to the ethos and coverage of their particular product. This included presentations by representatives from Hybrigenics SA, DIP, BIND, MINT, GIN-DB, PPID and IntAct, a public repository of PPI data that will be launched by the EBI early in 2003. The meeting was then thrown open to address a number of key issues.

Is there a requirement for a community standard?

Data exchange is essential for the purposes of data comparison, benchmarking and quality control, all of which are particularly important in a field like protein–protein interaction, where the standard high-throughput methods are known to yield high false-positive and false-negative rates. A community standard should allow simple access to core protein interaction data, while being extensible to exchange data with a high level of detail. Many users will require only simple indexing and interface systems; larger organizations will have requirements that are more complex but will have the infrastructure to develop much of this themselves. The confidentiality of data could be seen as an issue that might inhibit organizations from contributing; however, this question has already been addressed by the various sequence databases, where entries can be flagged and retained by the parent database until permission is given for release. It was recognized early on in the discussions that a minimum standard for data exchange needed to be developed and a formal mechanism for monitoring and maintaining this standard put in place. Valuable lessons can be learned in this area from MGED's experience of defining a minimal standard for the exchange of microarray data.

Definition of use cases

The potential use of the data has to be understood before the minimum common standard can be defined. Most of the groups represented at the meeting were interested in making graphical representations of PPIs and in making interspecies comparisons based on sequence or structural homology. To compare data from different systems, a correct description of the source systems is essential, including details of species, strain and, in some cases, tissue, cell type and disease state. Domain identification and the dynamic properties of PPIs were also common requirements, whilst the functional outcome of PPIs and the effects of sequence variations and posttranslational modifications were seen as desirables. Some users have a requirement for in-depth experimental detail; however, this was felt to be beyond the scope of a data exchange format and would have to be retrieved from the literature. Links to public databases were seen as essential when available but would not be made mandatory, since this would compromise the transfer of unpublished data between collaborating laboratories.

Outline data structure

The need for a multi-level approach was soon recognized, with Level 1 designed to fulfil basic requirements and be suitable for rapid implementation, whilst subsequent levels will contain more features, yet remain backwards compatible. The interchange format will need to be able to represent both binary and n-ary (complex) interactions. The topology of the latter would then be described within each set.

Each Interchange Format Record will report one or more interactions supported by one or more experiments. Predicted interactions are allowed and will be clearly flagged. Wherever the sequence of the interactors is available in public databases, appropriate cross-references should be given. The sequence should be given in the interaction record when it is not available from public databases, and may always be given.

Each entry will need to contain the accession number of its parent database. Parent databases will be identified by a prefix. This will require a registry service, which will have to be recognized and maintained. It is proposed to use PSI/HUPO as the authority for this and a host site will have to be established, which can be accessed by databases wishing to submit data.

The standardization of experimental design presents a particularly complex set of issues for the field of PPI, in which researchers use a host of diverse techniques and practices. Level 1 of the standard will not attempt to provide a full description of the experimental design, but will provide the means to clearly classify the experiments through hierarchical controlled vocabularies.

A work group has been set up to develop common controlled vocabularies for experimental methods and other attributes of protein interaction data. These will be used by the interaction data standard and will be made available via the Global Open Biological Ontologies (GOBO) website.

To capture a larger part of the interaction data that is generated worldwide, the support of major biochemistry and proteomics journals in this process is seen as crucial. It is proposed that, once a PSI PPI level 1 standard has been established, the major public database providers will collectively approach journals and funding agencies to request that deposition of published interaction data in public databases be strongly encouraged as part of the publication process. This would be similar to the deposition requirement for nucleotide sequence data, and the current encouragement to deposit DNA microarray data.

PPI molecular interaction interchange format record structure

The structure of an Interchange Format record defining both mandatory and optional fields was discussed in great detail and a draft document was produced. A small working party was formed to produce an XML draft of this consensus, which will then be further refined and finally presented to members of the PSI at a meeting in January. The PPI group aim to have a publicly available version of the level 1 format available by Spring 2003.

Mass spectrometry

This session discussed two questions: the use of standards in the field of mass spectrometry and the potential use of a public data repository for mass spectrometry data.

Following presentations on various aspects of mass spectrometry by Alexey Nesvizhskii (ISB, Seattle, WA), Arkadiusz Nawrocki (CPA, Odense, Denmark) and Rulin Zhang (SynX […]

Mass spectrometry data exists at many levels, from raw data, through peak lists and peptide identification, to protein identification; on top of this is the desire to mine data. Huge amounts of variation (and manual interpretation) exist in the processes that effect these transformations. The following specific points were discussed in more detail.

The purpose of new repositories

One projected use was to provide an audit trail for publications, so that the producers of bulk or complex data would be able to fully describe (and be held to account for) methodologies that could not appear in print medium; this would require the cooperation of journals. Another purpose could be to allow the user to explore/mine the data, preferably in a biological context. Important concepts here are 'the minimal description of the experiment' and 'validation criteria'.

How many repositories?

A component-based approach, with different repositories for different types of proteomics experiment, was considered, but fears were expressed that this would disrupt the audit trail, or make biological interpretation of the data impossible.
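As an illustration only, the Level 1 record rules described under 'Outline data structure' for the PPI group could be sketched in Python as follows. This is not the PSI format, which at the time of the meeting was still an XML draft in preparation. All class names, field names and the registry prefixes (`EX1`, `EX2`) are hypothetical, and the sketch assumes that every interaction, including a predicted one, lists at least one supporting experiment entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical prefix registry. The report proposes PSI/HUPO as the registry
# authority but defines no prefixes, so these entries are invented.
REGISTRY: Dict[str, str] = {
    "EX1": "Example database one",
    "EX2": "Example database two",
}


@dataclass
class Interactor:
    name: str
    xrefs: List[str] = field(default_factory=list)  # public sequence database cross-references
    sequence: Optional[str] = None  # needed when no cross-reference exists; may always be given


@dataclass
class Experiment:
    method: str  # placeholder for a controlled-vocabulary term


@dataclass
class Interaction:
    participants: List[Interactor]  # two for binary, more for n-ary (complex)
    experiments: List[Experiment]
    predicted: bool = False  # predicted interactions are allowed but flagged

    @property
    def is_binary(self) -> bool:
        return len(self.participants) == 2


@dataclass
class Record:
    accession: str  # assumed layout: "<database prefix>:<local identifier>"
    interactions: List[Interaction]


def validate(record: Record) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    problems: List[str] = []
    prefix, sep, local_id = record.accession.partition(":")
    if not sep or not local_id or prefix not in REGISTRY:
        problems.append("accession lacks a registered database prefix")
    if not record.interactions:
        problems.append("record reports no interactions")
    for n, interaction in enumerate(record.interactions):
        if not interaction.experiments:
            problems.append(f"interaction {n} has no supporting experiment")
        for p in interaction.participants:
            if not p.xrefs and p.sequence is None:
                problems.append(
                    f"interaction {n}: '{p.name}' has neither cross-reference nor sequence"
                )
    return problems


a = Interactor("protein A", xrefs=["EXDB:0001"])  # hypothetical cross-reference
b = Interactor("protein B", sequence="MKT")       # unpublished, so sequence is inline
good = Record("EX1:0001", [Interaction([a, b], [Experiment("two hybrid")])])
print(validate(good))  # []
```

Keeping the sequence optional while cross-references are a list mirrors the stated rule that links to public databases are expected when available but are not mandatory, so unpublished data can still be exchanged between collaborating laboratories.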
