Comparative and Functional Genomics
Comp Funct Genom 2003; 4: 16–19.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.232

Feature

Meeting Review: The HUPO Proteomics Standards Initiative meeting: towards common standards for exchanging proteomics data

Hinxton, Cambridge, UK, 19–20 October 2002

Sandra Orchard, Paul Kersey, Henning Hermjakob* and Rolf Apweiler
EMBL Outstation–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

*Correspondence to: Henning Hermjakob, EMBL Outstation–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. E-mail: [email protected]

Received: 14 November 2002
Accepted: 14 November 2002

Abstract

The Proteomics Standards Initiative (PSI) aims to define community standards for data representation in proteomics and to facilitate data comparison, exchange and verification. Initially the fields of protein–protein interactions (PPI) and mass spectroscopy have been targeted and the inaugural meeting of the PSI addressed the questions of data storage and exchange in both of these areas. The PPI group rapidly reached consensus as to the minimum requirements for a data exchange model; an XML draft is now being produced. The mass spectroscopy group have achieved major advances in the definition of a required data model and working groups are currently taking these discussions further. A further meeting is planned in January 2003 to advance both these projects. Copyright 2003 John Wiley & Sons, Ltd.

Keywords: proteomics; mass spectroscopy; protein–protein interactions

Introduction

The Proteomics Standards Initiative was established following a meeting in April 2002, jointly organized by HUPO and NAS, at which the urgent need for standardization of proteomics data was recognized. Rolf Apweiler (Sequence Database Group, European Bioinformatics Institute) opened the proceedings by explaining that a decision had been made to address these issues initially in the fields of mass spectroscopy and protein–protein interactions (PPI). This inaugural meeting of the Proteomics Standards Initiative brought together representatives from the database producer, user and software producer communities, who were seen as essential in establishing and maintaining the required standards and who were jointly charged over the 2 days of the meeting with laying the groundwork that would enable these objectives to be met.

The delegates listened to a short presentation by Alvis Brazma (EBI), outlining the successful standardization of microarray data in the MGED process, before splitting into two working parties to address the issues facing their respective fields.

Protein–protein interactions (PPI) group

The session commenced with a brief introduction from each of the PPI databases represented at the meeting as to the ethos and coverage of their particular product. This included presentations by representatives from Hybrigenics SA, DIP, BIND, MINT, GIN-DB, PPID and IntAct, a public repository of PPI data that will be launched by the EBI early in 2003. The meeting was then thrown open to address a number of key issues.

Is there a requirement for a community standard?

Data exchange is essential for the purposes of data comparison, benchmarking and quality control, all of which are particularly important in a field like protein–protein interaction, where the
standard high-throughput methods are known to yield high false-positive and false-negative rates. A community standard should allow simple access to core protein interaction data, while being extensible to exchange data with a high level of detail. Many users will require only simple indexing and interface systems; larger organizations will have requirements that are more complex but will have the infrastructure to develop much of this themselves. The confidentiality of data could be seen as an issue that might inhibit organizations from contributing; however, this question has already been addressed by the various sequence databases, where entries can be flagged and retained by the parent database until permission is given for release. It was recognized early on in the discussions that a minimum standard for data exchange needed to be developed and a formal mechanism for monitoring and maintaining this standard put in place. Valuable lessons can be learned in this area from MGED’s experience of defining a minimal standard for the exchange of microarray data.

Definition of use cases

The potential use of the data has to be understood before the minimum common standard can be defined. Most of the groups represented at the meeting were interested in making graphical representations of PPIs and in making interspecies comparisons based on sequence or structural homology. To compare data from different systems, a correct description of the source systems is essential, including details of species, strain and, in some cases, tissue, cell type and disease state. Domain identification and the dynamic properties of PPIs were also common requirements, whilst the functional outcome of PPIs and the effects of sequence variations and posttranslational modifications were seen as desirable. Some users have a requirement for in-depth experimental detail; however, this was felt to be beyond the scope of a data exchange format and would have to be retrieved from the literature. Links to public databases were seen as essential when available but would not be made mandatory, since this would compromise the transfer of unpublished data between collaborating laboratories.

Outline data structure

The need for a multi-level approach was soon recognized, with Level 1 designed to fulfil basic requirements and be suitable for rapid implementation, whilst subsequent levels will contain more features, yet remain backwards compatible. The interchange format will need to be able to represent both binary and n-ary (complex) interactions. The topology of the latter would then be described within each set.

Each Interchange Format Record will report one or more interactions supported by one or more experiments. Predicted interactions are allowed and will be clearly flagged. Wherever the sequence of the interactors is available in public databases, appropriate cross-references should be given. The sequence should be given in the interaction record when it is not available from public databases, and may always be given.

Each entry will need to contain the accession number of its parent database. Parent databases will be identified by a prefix. This will require a registry service, which will have to be recognized and maintained. It is proposed to use PSI/HUPO as the authority for this and a host site will have to be established, which can be accessed by databases wishing to submit data.

The standardization of experimental design provides a particularly complex set of issues for the field of PPI, in which researchers use a host of diverse techniques and practices. Level 1 of the standard will not attempt to provide a full description of the experimental design, but will provide the means to clearly classify the experiments through hierarchical controlled vocabularies.

A work group has been set up to develop common controlled vocabularies for experimental methods and other attributes of protein interaction data. These will be used by the interaction data standard and will be made available via the Global Open Biological Ontologies (GOBO) website.

To capture a larger part of the interaction data that is generated worldwide, the support of major and proteomics journals in this process is seen as crucial. It is proposed that, once a PSI PPI Level 1 standard has been established, the major public database providers will collectively approach journals and funding agencies to request that deposition of published interaction data in public databases will be strongly encouraged as
part of the publication process. This would be similar to the deposition requirement for nucleotide sequence data, and the current encouragement to deposit DNA microarray data.

PPI molecular interaction interchange format record structure

The structure of an Interchange Format record defining both mandatory and optional fields was discussed in great detail and a draft document was produced. A small working party was formed to produce an XML draft of this consensus, which will then be further refined and finally presented to members of the PSI at a meeting in January. The PPI group aim to have a version of the Level 1 format publicly available by spring 2003.

Mass spectrometry

This session discussed two questions: the use of standards in the field of mass spectrometry and the potential use of a public data repository for mass spectrometry data.

Following presentations on various aspects of mass spectrometry by Alexey Nesvizhskii (ISB, Seattle, WA), Arkadiusz Nawrocki (CPA, Odense, Denmark) and Rulin Zhang (SynX Pharma, Toronto, Canada), the group received a demonstration of PEDRo, a tool developed at the University of Manchester to capture data and metadata from proteomics experiments that include mass spectrometry as one component. PEDRo has been designed according to the MGED guidelines and has a similar scope to the microarray data model, capturing the complete process of scientific experiment from hypothesis formation through to peak identification. A consideration of PEDRo led to the discussion as to whether a single repository would encompass the diverse needs of mass spectrometry in the context of proteomics or whether separate standards for each type of experiment, with separate repositories for each type of data, would be required. As the issues became apparent, questions of feasibility were also raised. Examples were given of ambitious plans to design software that supported data from all types of proteomics experiments, which had eventually been replaced by projects aimed at capturing only one particular workflow.

Mass spectrometry data exists at many levels, from raw data, through peak lists and peptide identification, to protein identification; on top of this is the desire to mine data. Huge amounts of variation (and manual interpretation) exist in the processes that effect these transformations. The following specific points were discussed in more detail.

The purpose of new repositories

One projected use was to provide an audit trail for publications, so that the producers of bulk or complex data would be able to fully describe (and be held to account for) methodologies that could not appear in print medium; this would require the cooperation of journals. Another purpose could be to allow the user to explore/mine the data, preferably in a biological context. Important concepts here are ‘the minimal description of the experiment’ and ‘validation criteria’.

How many repositories?

A component-based approach, with different repositories for different types of proteomics experiment, was considered, but fears were expressed that this would disrupt the audit trail, or make biological interpretation of the data impossible. How to go about capturing the meaningful results of an experiment that resulted in the conclusion that two proteins interact, without a wasteful overlap with PPI databases, was discussed at intervals throughout the meeting.

Would the users enter all the data?

The hope was expressed that if a standard could be produced, LIMS systems might automatically produce compliant output. However, proteomics is often not fully automated and many data points might be missing.

Participation of equipment manufacturers and other parties

The view was expressed that the participation of equipment manufacturers was essential to the ultimate success of any new standard. In areas such as hypothesis description and preliminary sample preparation, substantial opportunities for overlap
with other groups involved in standardization were perceived, and enthusiasm was expressed for taking these forward.

Error rates

Potential users of the data have little awareness of problems such as estimating error rates and the statistical complexity of producing the final protein identifications. A need to raise community awareness of these issues was recognized.

Three work groups have now been established:

• Group 1 will work on the definition of mass spectrometry data, and the subsequent data analysis, as far as protein identification. A draft model has been produced, which includes the facility for recursive analysis and refinement of the peak list.
• Group 2 are modelling the process of sample preparation, considering the overall workflow of proteomics experiments in which mass spectrometry is one component, up to the point where a sample is ready to be loaded into the spectrometer. Again, a recursive model has been used, whereby a sample could undergo many cycles of preparative steps.
• Group 3 are considering likely user demands of any implemented system. The interests of both expert mass spectrometrists and biological users are being considered. A system should support the ability to query with peak lists, and with known sample compositions, against the results of previous experiments; and should also allow users to query across experiments to observe the concomitant changes in identified species.

The findings of these working groups will be presented during the HUPO conference in November 2002 and the way forward can then be discussed with input from the wider proteomics community, who will be attending that meeting.

Conclusions

There was a remarkable consensus among delegates attending the PSI meeting to the effect that valuable data would be lost without public repositories and common interchange formats making information accessible to the scientific community. Major progress was made in the field of protein–protein interactions, with a draft exchange format being produced and work on an XML version in progress. The mass spectroscopy group has to undertake more groundwork, to establish common needs and requirements, to identify what data is appropriate for public access and the degree of supplementary information which is required to be stored alongside, but important advances have been made and it is hoped that this group will have preliminary results by early 2003.

All such efforts require support from the user community and from the scientific press and funding agencies. Members of the PSI will be actively canvassing such collaboration, but input is welcome from any quarter. Anyone wishing to become involved is invited to visit http://psidev.sf.net, to participate in the discussion groups listed, and to contribute to the further development of community standards for proteomics data. A further meeting of the PSI is planned for 22–24 January 2003 in Hinxton, Cambridge, UK. Details will be published via the website.

Related websites

BIND: http://bind.ca/
DIP: http://dip.doe-mbi.ucla.edu/
Hybrigenics: http://www.hybrigenics.fr
IntAct Project: http://www.ebi.ac.uk/intact/
MINT: http://cbm.bio.uniroma2.it/mint/
MGED: http://www.mged.org
PPID: http://www.anc.ed.ac.uk/mscs/PPID
PSI: http://psidev.sf.net/
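
To make the Level 1 record outlined under ‘Outline data structure’ more concrete, the following is a minimal Python sketch of the stated requirements; it is not the PSI format itself, which is being drafted in XML. It models a record that carries a prefixed parent-database accession, one or more interactions (binary or n-ary) supported by experiments, a flag for predicted interactions, and interactors that carry either a public cross-reference or an inline sequence. All class and field names, the ‘PREFIX:identifier’ accession layout, the DEMO registry and the example values are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Interactor:
    name: str
    xrefs: List[str] = field(default_factory=list)  # public database cross-references
    sequence: Optional[str] = None  # may always be given; needed if xrefs is empty


@dataclass
class Interaction:
    participants: List[Interactor]  # 2 = binary, more than 2 = n-ary (complex)
    experiment_terms: List[str]  # controlled-vocabulary terms (placeholder strings)
    predicted: bool = False  # predicted interactions are allowed but flagged


@dataclass
class Record:
    accession: str  # hypothetical "PREFIX:identifier" layout
    interactions: List[Interaction]


def check_record(record: Record, registry: Set[str]) -> List[str]:
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    prefix = record.accession.split(":", 1)[0]
    if prefix not in registry:
        problems.append(f"unregistered database prefix: {prefix}")
    if not record.interactions:
        problems.append("record reports no interactions")
    for number, interaction in enumerate(record.interactions, start=1):
        if not interaction.experiment_terms:
            problems.append(f"interaction {number}: no supporting experiment")
        for interactor in interaction.participants:
            if not interactor.xrefs and interactor.sequence is None:
                problems.append(
                    f"interaction {number}: {interactor.name} has neither "
                    "cross-reference nor sequence"
                )
    return problems


registry = {"DEMO"}  # stand-in for the proposed PSI/HUPO prefix registry
bait = Interactor("bait protein", xrefs=["EXAMPLE-DB:0001"])
prey = Interactor("prey protein", sequence="MKT")  # no public entry, so sequence is inline
record = Record("DEMO:42", [Interaction([bait, prey], ["two hybrid"])])
print(check_record(record, registry))
```

Keeping the checks in a separate function rather than in the classes mirrors the multi-level idea: stricter levels could add further checks without changing the basic record shape.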

The Meeting Reviews of Comparative and Functional Genomics aim to present a commentary on the topical issues in genomics studies presented at a conference. The Meeting Reviews are invited; they represent personal critical analyses of the current reports and aim at providing implications for future genomics studies.
