The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver (2019)
Pandurangan, A. P., Stahlhacke, J., Oates, M. E., Smithers, B., & Gough, J. (2019). The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic Acids Research, 47(D1), D490-D494. https://doi.org/10.1093/nar/gky1130

Publisher's PDF, also known as Version of record. License: CC BY. This is the final published version of the article (version of record). It first appeared online via Oxford University Press at https://academic.oup.com/nar/article/47/D1/D490/5184710. Please cite only the published version using the reference above.

Nucleic Acids Research, 2019, Vol. 47, Database issue, D490–D494. Published online 16 November 2018. doi: 10.1093/nar/gky1130

Arun Prasad Pandurangan 1,*, Jonathan Stahlhacke 2, Matt E. Oates 2, Ben Smithers 2 and Julian Gough 1

1 MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK and 2 Computer Science, University of Bristol, Bristol BS8 1UB, UK

* To whom correspondence should be addressed. Tel: +44 122 2267822; Email: [email protected]

Received September 24, 2018; Revised October 23, 2018; Editorial Decision October 23, 2018; Accepted October 25, 2018

© The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Here, we present a major update to the SUPERFAMILY database and the webserver. We describe the addition of a new SUPERFAMILY 2.0 profile HMM library containing a total of 27 623 HMMs. The database now includes Superfamily domain annotations for millions of protein sequences taken from the Universal Protein Resource Knowledgebase (UniProtKB) and the National Center for Biotechnology Information (NCBI). This addition constitutes about 51 and 45 million distinct protein sequences obtained from UniProtKB and NCBI respectively. Currently, the database contains annotations for 63 244 and 102 151 complete genomes taken from UniProtKB and NCBI respectively. The current sequence collection and genome update is the biggest so far in the history of SUPERFAMILY updates. In order to deal with the massive wealth of information, here we introduce a new SUPERFAMILY 2.0 webserver (http://supfam.org). Currently, the webserver mainly focuses on the search, retrieval and display of Superfamily annotation for the entire sequence and genome collection in the database.

INTRODUCTION

SUPERFAMILY 1.75 (1) uses a library of 15 438 expert-curated profile hidden Markov models (HMMs) representing protein domains of known structure to predict the presence of structural domains in amino acid sequences. The domain sequences were obtained from the Structural Classification of Proteins database (SCOP) (2). SCOP classifies protein domains into Class, Fold, Superfamily and Family levels to capture the structural, functional and evolutionary relationships between protein structural domains. Domains at the Superfamily level in SCOP share structural and functional properties that imply a common evolutionary origin despite low sequence identity. At the Family level, by contrast, most homologous proteins cluster together with high sequence similarity, suggesting a clear evolutionary relationship and functional consistency (3). The SUPERFAMILY database provides domain annotations at both Superfamily and Family levels (4).

SUPERFAMILY provides various analysis tools to facilitate better analysis and interpretation of the database content. They include the identification of under- and overrepresentation of domains between genomes (5), construction of phylogenetic trees (6), analysis of the domain distribution of superfamilies and families across the tree of life (7), as well as ontology based annotations for SUPERFAMILY domains and architectures (8,9).

Here we present the development of the new SUPERFAMILY 2.0 HMM library along with a major database update that includes the addition of SUPERFAMILY annotations for all the protein sequences from the UniProtKB (10) and NCBI reference genome collections (11). We also introduce a newly developed webserver that focuses mainly on the annotation of exponentially growing sequence data and is intended to facilitate future integration with the SUPERFAMILY sister resources, including dcGO (8) and D2P2 (12), to capture the combined information representing structure, disorder and domain centric ontologies in a single platform.

In the following section, we discuss the development of the new SUPERFAMILY 2.0 profile HMM library. Later, we discuss the annotation statistics for UniProtKB sequences and the NCBI reference genome collection, followed by the introduction of the new webserver and its basic functionalities. Finally, we discuss the future directions for the SUPERFAMILY resource.

SUMMARY OF UPDATES

SUPERFAMILY 2.0 profile HMM model library

In this update, we have created a new profile HMM library using sequences taken from the structural domain databases SCOPe (13), CATH (14), ECOD (15) and PDB (16). Initially, we built the new HMMs for SCOPe domain sequences by filtering them at 95% sequence identity against the sequences present in the HMM library 1.75. The filtering and model building procedure was repeated for CATH and ECOD domain sequences, followed by the full length protein chain sequences downloaded from PDB (16). For the purpose of building new HMMs, we used the HMMER package version 3.1b2 (17). For each new domain sequence, the program jackhmmer from the HMMER package was used to iteratively search for remote homologs to produce multiple sequence alignments (MSAs). The number of iterative jackhmmer search steps was set to 5. The MSAs were used to generate HMMs using the Sequence Alignment and Modeling (SAM) package version 3.5 (18). The generated HMMs were converted to the HMMER 3.1b2 format. The newly generated HMMs were carefully checked against each other and all models producing cross hits were removed. The new library contained 12 185 HMMs, representing 10 668, 504, 279 and 734 models from SCOPe, CATH, ECOD and PDB sequences respectively. Finally, a new SUPERFAMILY 2.0 HMM library containing a total of 27 623 models was created by merging the new library with the existing 1.75 HMM library. Through the SCOP hierarchy page (http://supfam.org/scop), the user can browse full details of all the available domain sequences (including SCOPe, CATH, ECOD and PDB) used for building the SUPERFAMILY 2.0 profile HMM library.

UniProtKB sequence collection

Protein sequences were downloaded from the UniProtKB (ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase dated 29/03/2018) (10). It contained ∼112 million protein sequences classified into 63 244 complete genomes. The complete genomes represent 70%, 27%, 2% and 1% of Viruses, Bacteria, Eukaryotes and Archaea respectively. In UniProtKB, viral genomes are the most commonly found.

NCBI complete genome collection

Protein sequences were downloaded from the NCBI Reference Sequence Database (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq dated 26/08/2017) (11). The genome collection has about 93 million protein sequences from 102 151 complete genomes. It contained 91%, 7%, 1% and 1% of Bacteria, Viruses, Eukaryotes and Archaea respectively, with bacterial genomes being the most common (Table 1). Eukaryotes, Archaea, Bacteria and Viruses had 67%, 63%, 67% and 21% of proteins with at least one Superfamily domain annotation (Table 1). Overall, the complete proteome in NCBI contains 90 495 662 unique proteins, of which 67% have Superfamily domain annotations, and 62% of amino acids have been mapped to Superfamily domains (Table 1).

After the major update, the SUPERFAMILY database contains 50 604 320 and 44 765 365 distinct protein sequences from UniProtKB and NCBI respectively. About 50% of the protein sequences (45 730 297) are common between the UniProtKB and NCBI sequence collections. It is worth noting that the annotations for UniProtKB and NCBI sequences were performed using the SUPERFAMILY 1.75 HMM library.

Table 1. SUPERFAMILY annotation statistics for the UniProtKB and NCBI protein sequence collection

                    No. of proteomes       No. of proteins              Proteins with assignments (%)   Amino acid coverage (%)
                    UniProtKB   NCBI       UniProtKB     NCBI           UniProtKB   NCBI                UniProtKB   NCBI
Eukaryota           1272        781        19 481 055    17 857 765     56          67                  38          39
Archaea             793         671        2 136 652     1 822 967      62          63                  59          60
Bacteria            17 277      93 480     66 475 668    346 500 943    67          67                  62          64
Viruses             43 902      7194       1 025 062     303 337        39          21                  39          31
Complete proteome   63 244      102 151    89 118 437    90 495 662     64          67                  55          62

New webserver - SUPERFAMILY 2.0

The wealth of proteome sequence information continues to increase manifold with the recent advancement of sequencing technologies. In order to meet the challenges involved in the analysis and interpretation of large proteome datasets, we have developed a new webserver (http://supfam.org). The webserver is built using a Perl based real-time web application framework called Mojolicious (https://mojolicious.org). In this new development, we have predominantly focused on the search, retrieval and display of Superfamily domain annotations present in the database.