InterPro in 2011: new developments in the family and domain prediction database

To cite this version: Sarah Hunter, Philip Jones, Alex Mitchell, Rolf Apweiler, Teresa K. Attwood, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Research, Oxford University Press, 2012, 40 (D1), pp. D306-D312. 10.1093/nar/gkr948. hal-02652188

HAL Id: hal-02652188
https://hal.inrae.fr/hal-02652188
Submitted on 29 May 2020

Nucleic Acids Research, 2012, Vol. 40, Database issue, D306–D312
Published online 16 November 2011
doi:10.1093/nar/gkr948

Sarah Hunter1,*, Philip Jones1,*, Alex Mitchell1,*, Rolf Apweiler1, Teresa K. Attwood2, Alex Bateman3, Thomas Bernard4, David Binns1, Peer Bork5, Sarah Burge1, Edouard de Castro6, Penny Coggill3, Matthew Corbett1, Ujjwal Das1, Louise Daugherty1, Lauranne Duquenne4, Robert D. Finn3, Matthew Fraser1, Julian Gough7, Daniel Haft8, Nicolas Hulo6, Daniel Kahn4, Elizabeth Kelly9, Ivica Letunic5, David Lonsdale1, Rodrigo Lopez1, Martin Madera7, John Maslen1, Craig McAnulla1, Jennifer McDowall1, Conor McMenamin1, Huaiyu Mi10, Prudence Mutowo-Muellenet1, Nicola Mulder9, Darren Natale11, Christine Orengo12, Sebastien Pesseat1, Marco Punta3, Antony F. Quinn1, Catherine Rivoire6, Amaia Sangrador-Vegas1, Jeremy D. Selengut8, Christian J. A. Sigrist6, Maxim Scheremetjew1, John Tate3, Manjulapramila Thimmajanarthanan1, Paul D. Thomas10, Cathy H. Wu12, Corin Yeats12 and Siew-Yit Yong1

1 EMBL Outstation European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge
2 Faculty of Life Science and School of Computer Science, The University of Manchester, M13 9PL, Manchester
3 The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
4 Pôle Rhône-Alpin de Bio-Informatique (PRABI) and Laboratoire de Biométrie et Biologie Evolutive; CNRS; INRA; Université de Lyon; Université Lyon 1, 69622 Villeurbanne, France
5 European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, 69117 Heidelberg, Germany
6 Swiss Institute of Bioinformatics (SIB), CMU - Rue Michel-Servet 1, 1211 Geneva 4, Switzerland
7 Department of Computer Science, University of Bristol, Woodland Road, Bristol, BS8 1UB, UK
8 J. Craig Venter Institute (JCVI), 9704 Medical Center Drive, Rockville, MD 20850, USA
9 Computational Biology Unit, Institute of Infectious Disease and Molecular Medicine, University of Cape Town Health Sciences Campus, Anzio Road, Observatory 7925, South Africa
10 University of Southern California, Los Angeles, CA 90089, USA
11 Protein Information Resource (PIR), Georgetown University Medical Center, 3300 Whitehaven Street, NW, Suite 1200, Washington, D.C. 20007, USA
12 Structural and Molecular Biology Department, University College London, University of London, WC1E 6BT, UK

Received October 4, 2011; Revised October 10, 2011; Accepted October 12, 2011

*To whom correspondence should be addressed. Tel: +44 (0) 1223 494 481; Fax: +44 (0) 1223 494 468; Email: [email protected]
Correspondence may also be addressed to Alex Mitchell. Email: [email protected]
Correspondence may also be addressed to Philip Jones. Tel: +44 (0) 1223 492610; Fax: +44 (0) 1223 494484; Email: [email protected]

The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.

© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.

INTRODUCTION

The InterPro database integrates predictive models, or signatures, from multiple, diverse source repositories: Pfam (1), PRINTS (2), PROSITE (3), SMART (4), ProDom (5), PIRSF (6), SUPERFAMILY (7), PANTHER (8), CATH-Gene3D (9), TIGRFAMs (10) and HAMAP (11). Each source has its own distinct biological focus and/or methodology of signature production. The aim of InterPro is to combine their individual strengths to provide a single resource through which scientists can access comprehensive information about protein families, domains and functional sites.

Member database signatures are integrated into InterPro manually. Curators combine signatures representing the same protein family, domain or site into single database entries, and, where possible, trace biological relationships between the constituent signatures. They check the biological accuracy of the individual signatures and add pertinent information, including consistent names, descriptive abstracts (with links to original publications) and Gene Ontology (GO) (12) terms. Semi-automatic procedures create and maintain links to an array of other databases, including the protease resource MEROPS (13), the protein interaction database IntAct (14), the ENZYME database (15) and the 3D structure database PDB (16).

InterPro signature matches to the UniProt Knowledgebase (UniProtKB) (17) and the UniParc protein sequence archive are calculated using the InterProScan software package (18). This information is made available to the public in XML files as well as through Web interfaces, where users can search with either a protein sequence or a protein identifier. These data are also used to aid UniProtKB curators in their annotation of Swiss-Prot proteins and are utilized by the automatic system that adds annotation to UniProtKB/TrEMBL.

InterProScan can also be used to perform automated analysis of protein sequences. The software is available (i) as a browser-based tool for analysing single protein sequences (http://www.ebi.ac.uk/Tools/pfa/iprscan/);

2009; the latest release (version 34.0) contains 31 685 member database signatures integrated into 22 245 InterPro entries, and provides matches to 80% of the sequences in UniProtKB (Table 1).

Changes to terminology and data structure

InterPro entries are classified according to the type of signature they group together. In order to make it clear to end users what can be inferred from a match to a particular entry in the database, the different entry types have been reviewed and terminology has been standardized. Entry types now comprise families, domains, repeats, post-translational modifications, active sites, binding sites and conserved sites, with formal definitions for each type clearly stated on the InterPro Web site.

The relationships between different entry types have also been revised. Family and domain entries continue to be organized into hierarchies, with top-level entries describing broad families or domains that share higher level structure and/or function, and entries further down the hierarchy describing more specific functional subfamilies or structural/functional subclasses of domains. However, family and domain entries are no longer permitted to occur within the same hierarchy, and are now classified into distinct hierarchies that relate domain architectures to protein families. These data structures are presented on the Web interface in an intuitive tree view, as well as being available on the FTP site in a flat-file (ftp://ftp.ebi.ac.uk/pub/databases/interpro/ParentChildTreeFile.txt).

Integration of the HAMAP database

A new member database, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), was added in 2009 (InterPro release 22.0). The HAMAP database contains over 1600 signatures, specifically describing archaeal, bacterial and plastid-encoded protein families. It is based upon weighted-matrix signatures, of the type used in the PROSITE profiles database. The integration of HAMAP into InterPro has helped