D482–D489 Nucleic Acids Research, 2019, Vol. 47, Database issue
Published online 16 November 2018
doi: 10.1093/nar/gky1114

SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins

Jose M. Dana1, Aleksandras Gutmanas1, Nidhi Tyagi2, Guoying Qi2, Claire O’Donovan3, Maria Martin2 and Sameer Velankar1,*

1Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
2Protein Function Development, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
3Metabolomics, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received September 07, 2018; Revised October 15, 2018; Editorial Decision October 18, 2018; Accepted October 22, 2018

*To whom correspondence should be addressed. Tel: +44 1223 494646; Fax: +44 1223 494468; Email: [email protected]

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The Structure Integration with Function, Taxonomy and Sequences resource (SIFTS; http://pdbe.org/sifts/) was established in 2002 and continues to operate as a collaboration between the Protein Data Bank in Europe (PDBe; http://pdbe.org) and the UniProt Knowledgebase (UniProtKB; http://uniprot.org). The resource is instrumental in the transfer of annotations between protein structure and protein sequence resources through provision of up-to-date residue-level mappings between entries from the PDB and from UniProtKB. SIFTS also incorporates residue-level annotations from other biological resources, currently comprising the NCBI taxonomy database, IntEnz, GO, Pfam, InterPro, SCOP, CATH, PubMed, Ensembl, Homologene and automatic Pfam domain assignments based on HMM profiles. The recently released implementation of SIFTS includes support for multiple cross-references for proteins in the PDB, allowing mappings to UniProtKB isoforms and UniRef90 cluster members. This development makes structure data in the PDB readily available to over 1.8 million UniProtKB accessions.

INTRODUCTION

The rapid evolution in genetic sequencing over the past decades is leading to an unprecedented growth in the number of protein sequences available in the UniProt Knowledgebase (UniProtKB, http://uniprot.org), a universal resource for sequence and functional information pertaining to proteins (1). It currently contains over 500 000 manually annotated sequences (UniProtKB/Swiss-Prot) and over 120 million computationally annotated ones (UniProtKB/TrEMBL) despite a near 50% reduction of the size of the holdings in 2015 to remove high sequence redundancy. This increase is set to continue and likely to accelerate even further with the growing appreciation of the role microbiome plays in health and disease. Most of these protein sequences are unlikely to be experimentally characterised and, therefore, they will not be targeted for manual curation. In order to annotate this large protein space, the UniProt team has developed a rule-based prediction system (UniRule) to automatically enrich UniProtKB/TrEMBL proteins with functional annotations. The rules in the UniRule system are manually annotated based on InterPro family classification and experimental annotation in UniProtKB/Swiss-Prot, and then computationally applied to annotate millions of protein sequences in the database (1). Knowledge of protein structure can help elucidate function, and thus enhance computational (and manual) annotations available in UniProtKB.

In parallel to the growth in sequencing data, structural biology has undergone revolutionary changes over the past decade, ranging from dramatic improvements in electron microscopy to wider accessibility and near complete automation of crystallographic techniques. The Protein Data Bank (PDB) is the single global archive of experimentally determined three-dimensional (3D) biomacromolecular structures and associated experimental data (2). It is managed by the Worldwide PDB (wwPDB; http://wwpdb.org) (3), an international consortium, of which the Protein Data Bank in Europe (PDBe; http://pdbe.org) (4) is one of the founding members. PDB receives an increasing number of depositions (over 13 000 in 2017) of ever increasing complexity, yet the pace of growth of the PDB is by necessity slower than that of sequence resources, with increases in coverage of the sequence space proportionate to the increase in the number of PDB entries: from 28 000 unique UniProtKB accessions referenced by 84 000 PDB entries in early 2013 (5) to over 45 000 UniProtKB accessions referenced by over 145 000 PDB entries at present. Robust mechanisms of data discovery and of linking biological contexts pertaining to proteins are essential. A number of resources utilise the structure data from the PDB to annotate protein sequences within related families and superfamilies of sequences (6).

Both the PDBe and UniProtKB are core resources at the European Bioinformatics Institute (EMBL-EBI; http://www.ebi.ac.uk) (7) and within the context of the ELIXIR infrastructure (http://elixir-europe.org) (8). Facilitated by their co-location at EMBL-EBI, the PDBe and UniProt teams developed the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource (9), which allows for transfer of value-added annotations between the protein sequences and the protein structures, helping to understand mechanisms of protein interactions and function. SIFTS provides residue-level cross-references between protein sequences in UniProtKB and 3D atomic models of those proteins within PDB entries. The resource also collates and distributes residue-level annotations from Pfam (10), InterPro (11), SCOP (12) and CATH (13), and whole sequence level cross-references from IntEnz (14), GOA (15), PubMed (16), and NCBI taxonomy (17), all of which have been part of the SIFTS process as described previously (9). The most recent update added cross-references from Homologene (https://www.ncbi.nlm.nih.gov/homologene) (18) and Ensembl (19), and automatic Pfam assignments based on HMM profiles (20,21). In order to enhance the possibility of transfer of annotations between protein sequences and structures, the underlying SIFTS pipeline was also re-engineered to support multiple cross-references between UniProtKB and PDB, as described below.

The pipeline underlies many features of the PDBe website and REST API (4). Many other bioinformatics resources such as UniProt (1), RCSB PDB (22), PDBj (23), PDBsum (24), Reactome (25), Pfam (10), SCOP2 (26), MobiDB (27) and InterPro (11) rely on the SIFTS resource to establish

The original procedure was limited to cross-referencing the polypeptide sequence in a given PDB entry to a single UniProtKB accession. This limitation was overcome in the most recent SIFTS infrastructure update by organising the PDB-UniProtKB cross-references into three categories: (i) mapping to a UniProt canonical protein sequence, unchanged compared to the previous implementation, (ii) mapping to all alternative isoforms of the canonical sequence and (iii) mapping to sequences in UniRef90 clusters. The latter two categories will be discussed below.

Mappings to isoforms

It is thought that alternative splicing is implicated in a number of diseases, and that nearly all multi-exon protein-coding genes in humans may undergo alternative splicing, giving rise to different isoform protein products (28). One of these products - usually the most prevalent - is termed a ‘canonical’ entry in UniProtKB, and was previously the only option for SIFTS cross-references to protein sequences in the PDB. In order to overcome this limitation, the SIFTS process was updated as follows (Figure 1A and B):

a) For each polypeptide sequence in the PDB (the query sequence), retrieve the existing manually annotated cross-reference provided by either the UniProtKB or by the PDB, as described previously (9).
b) Expand the set of UniProtKB sequences to be analysed with all the isoforms of the accession from (a), unless the query sequence is identified as a chimeric construct. In the latter case, the set of accessions is not expanded beyond the manually annotated ones.
c) Calculate sequence alignments and sequence identity between the query sequence and each UniProtKB accession from the set defined in (b). For canonical UniProtKB sequences, coverage by the PDB sequence is also calculated.
d) Annotate the best sequence alignment from (c). Currently, the best alignment is defined as the one with the highest sequence identity with a preference for the canonical accession in the case of a tie.
e) Cross-references from Pfam, IntEnz and Homologene are added on the basis of the mappings to the canonical UniProtKB accessions, as these resources do not consider isoform data, while those from Ensembl are added based on the isoform information. Cross-references from GOA, InterPro and preliminary Pfam assignments