D490–D494 Nucleic Acids Research, 2019, Vol. 47, Database issue Published online 16 November 2018 doi: 10.1093/nar/gky1130 The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver Arun Prasad Pandurangan 1,*, Jonathan Stahlhacke2, Matt E. Oates2, Ben Smithers 2 and Julian Gough1
1MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK and 2Computer Science, University of Bristol, Bristol BS8 1UB, UK
Received September 24, 2018; Revised October 23, 2018; Editorial Decision October 23, 2018; Accepted October 25, 2018
ABSTRACT level, most homologous proteins cluster together with high sequence similarity suggesting clear evolutionary relation- Here, we present a major update to the SUPERFAM- ship and functional consistency (3). The SUPERFAMILY ILY database and the webserver. We describe the ad- database provides domain annotations at both Superfamily dition of new SUPERFAMILY 2.0 profile HMM library and Family levels (4). containing a total of 27 623 HMMs. The database SUPERFAMILYprovides various analysis tools to facil- now includes Superfamily domain annotations for itate better analysis and interpretation of the database con- millions of protein sequences taken from the Uni- tent. They include the identification of under- and overrep- versal Protein Recourse Knowledgebase (UniPro- resentation of domains between genomes (5), construction tKB) and the National Center for Biotechnology In- of phylogenetic trees (6), analysis of the domain distribution formation (NCBI). This addition constitutes about 51 of superfamilies and families across the tree of life (7)aswell and 45 million distinct protein sequences obtained as providing ontology based annotations for SUPERFAM- ILY domains and architectures (8,9). from UniProtKB and NCBI respectively. Currently, Here we present the development of new SUPERFAM- the database contains annotations for 63 244 and ILY 2.0 HMM library along with a major database update 102 151 complete genomes taken from UniProtKB that includes the addition of SUPERFAMILY annotations and NCBI respectively. The current sequence col- for the all the protein sequences from the UniProtKB (10) lection and genome update is the biggest so far and NCBI reference genome collections (11). We also intro- in the history of SUPERFAMILY updates. In order duce a newly developed webserver to mainly focus on the to the deal with the massive wealth of information, annotation of exponentially growing sequence data as well here we introduce a new SUPERFAMILY 2.0 web- as to facilitate future integration with the SUPERFAMILY 2 2 server (http://supfam.org). Currently, the webserver sister resources including dcGO (8) and D P (12)tocap- mainly focuses on the search, retrieval and display of ture the combined information representing structure, dis- Superfamily annotation for the entire sequence and order and domain centric ontologies in a single platform. In the following section, we discuss the development of new genome collection in the database. SUPERFAMILY 2.0 profile HMM library. Later, we dis- cuss the annotation statistics for UniProtKB sequences and INTRODUCTION NCBI reference genome collection followed by the intro- duction of the new webserver and its basis functionalities. SUPERFAMILY 1.75 (1) uses a library of 15 438 expert- Finally, we discuss the future directions for the SUPER- curated profile hidden Markov models (HMMs) represent- FAMILY resource. ing protein domains of known structure to predict the pres- ence of structural domains in amino acid sequences. The domain sequences were obtained from the Structural Clas- SUMMARY OF UPDATES sification of Protein database (SCOP) (2). SCOP classifies SUPERFAMILY 2.0 profile HMM model library protein domains into Class, Fold, Superfamily and Family level to understand structural, functional and evolutionary In this update, we have created a new profile HMM library relationship between protein structural domains. The Su- using sequences taken from the structural domain database perfamily level domains in SCOP share structural and func- SCOPe (13), CATH (14), ECOD (15)andPDB(16). Ini- tional properties that infer common evolutionary origin de- tially, we built the new HMMs for SCOPe domain se- spite sharing low sequence identity. Whereas at the Family quences by filtering it at 95% sequence identity against the
*To whom correspondence should be addressed. Tel: +44 122 2267822; Email: [email protected]