D482–D489 Nucleic Acids Research, 2019, Vol. 47, Database issue Published online 16 November 2018 doi: 10.1093/nar/gky1114 SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins Jose M. Dana1, Aleksandras Gutmanas 1, Nidhi Tyagi2, Guoying Qi2, Claire O’Donovan3, Maria Martin2 and Sameer Velankar1,*
1Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2Protein Function Development, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK and 3Metabolomics, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received September 07, 2018; Revised October 15, 2018; Editorial Decision October 18, 2018; Accepted October 22, 2018
ABSTRACT source for sequence and functional information pertain- ing to proteins (1). It currently contains over 500 000 The Structure Integration with Function, Taxonomy manually annotated sequences (UniProtKB/Swiss-Prot) and Sequences resource (SIFTS; http://pdbe.org/ and over 120 million computationally annotated ones sifts/) was established in 2002 and continues to oper- (UniProtKB/TrEMBL) despite a near 50% reduction of the ate as a collaboration between the Protein Data Bank size of the holdings in 2015 to remove high sequence re- in Europe (PDBe; http://pdbe.org) and the UniProt dundancy. This increase is set to continue and likely to ac- Knowledgebase (UniProtKB; http://uniprot.org). The celerate even further with the growing appreciation of the resource is instrumental in the transfer of anno- role microbiome plays in health and disease. Most of these tations between protein structure and protein se- protein sequences are unlikely to be experimentally charac- quence resources through provision of up-to-date terised and, therefore, they will not be targeted for manual residue-level mappings between entries from the curation. In order to annotate this large protein space, the UniProt team has developed a rule-based prediction system PDB and from UniProtKB. SIFTS also incorporates (UniRule) to automatically enrich UniProtKB/TrEMBL residue-level annotations from other biological re- proteins with functional annotations. The rules in the sources, currently comprising the NCBI taxonomy UniRule system are manually annotated based on Inter- database, IntEnz, GO, Pfam, InterPro, SCOP, CATH, Pro family classification and experimental annotation in PubMed, Ensembl, Homologene and automatic Pfam UniProtKB/Swiss-Prot, and then computationally applied domain assignments based on HMM profiles. The to annotate millions of protein sequences in the database recently released implementation of SIFTS includes (1). Knowledge of protein structure can help elucidate func- support for multiple cross-references for proteins in tion, and thus enhance computational (and manual) anno- the PDB, allowing mappings to UniProtKB isoforms tations available in UniProtKB. and UniRef90 cluster members. This development In parallel to the growth in sequencing data, structural makes structure data in the PDB readily available to biology has undergone revolutionary changes over the past decade, ranging from dramatic improvements in electron over 1.8 million UniProtKB accessions. microscopy to wider accessibility and near complete au- tomation of crystallographic techniques. The Protein Data INTRODUCTION Bank (PDB) is the single global archive of experimen- The rapid evolution in genetic sequencing over the past tally determined three-dimensional (3D) biomacromolec- decades is leading to an unprecedented growth in the num- ular structures and associated experimental data (2). It is ber of protein sequences available in the UniProt Knowl- managed by the Worldwide PDB (wwPDB; http://wwpdb. edgebase (UniProtKB, http://uniprot.org)––a universal re- org)(3), an international consortium, of which the Protein
*To whom correspondence should be addressed. Tel: +44 1223 494646; Fax: +44 1223 494468; Email: [email protected]