Gene3D: Multi-Domain Annotations for Protein Sequence and Comparative Genome Analysis

Nucleic Acids Research Advance Access published November 21, 2013
Nucleic Acids Research, 2013, 1–6
doi:10.1093/nar/gkt1205

Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis

Jonathan G. Lees1,*, David Lee1, Romain A. Studer1, Natalie L. Dawson1, Ian Sillitoe1, Sayoni Das1, Corin Yeats2, Benoit H. Dessailly1, Robert Rentzsch3 and Christine A. Orengo1

1Division of Biosciences, Institute of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
2Department of Infectious Disease Epidemiology, Imperial College London, St Mary's Campus, Norfolk Place, London W2 1PG, UK
3Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany

Received October 1, 2013; Revised November 2, 2013; Accepted November 4, 2013

*To whom correspondence should be addressed. Tel: +44 2076793890; Fax: +44 2076797193; Email: [email protected]
The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.
© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.

INTRODUCTION

Proteins commonly contain one or more discrete, independently folding domains. Similar to these folded domains, it is becoming increasingly clear that many proteins can also contain long disordered sections that are also of functional importance. The CATH database, which focuses on the folded domain, classifies structures in the PDB into their constituent domains, and each domain is subsequently assigned to a single superfamily by homology (1,2). The classified domain structure sequences are used in a pipeline to build superfamily-specific domain HMMs that are then used to identify domains in structurally uncharacterized protein sequences. As some CATH superfamilies can be large and functionally diverse, we recently introduced protocols for subdividing each superfamily into functional sub-families known as FunFams (2,3). Owing to the speed and scale at which domain annotations can be assigned to sequences, the computational assignment of functional family domains to proteins is a powerful tool for bridging the gap in functional coverage of the ever-expanding genome and sequence databases. While more development is needed to improve accuracy, the CATH-FunFams were the top performing domain-based method for assigning protein functions in a recent international competition (4). After the domains have been assigned to a protein, DomainFinder (5) is used to resolve issues of overlapping domain assignments to derive a single combination of domains known as the multi-domain architecture (MDA). To increase the utility of the domain annotations in Gene3D, we integrate many other complementary data sources including UniProt (6), UniProt-GO (7), NCBI taxonomy (8), DrugBank (9) and OMIM (10).

Other domain annotation resources include Pfam (11) and SUPERFAMILY (12). SUPERFAMILY is more similar to Gene3D in that it makes use of the structural domains from SCOP as its starting source of established domains. SUPERFAMILY adds a small percentage of annotations for structural superfamilies classified in SCOP but not in CATH. A new resource, Genome3D (13), provides a consensus assignment showing where Gene3D and SUPERFAMILY agree on matches to a particular domain within a protein sequence.

As a summary of the principal updates for this article, we have expanded Gene3D domain assignments considerably by using CATH version 4.0 (updated from CATH version 3.5). As CATH version 4.0 contains nearly 50% more domain structures, this has resulted in greater sequence coverage. We have also updated our core sequence and genome databases, increasing the number of unique sequences to >20 million. The FunFams have been updated using the new CATH version 4.0 superfamily assignments for Gene3D version 12. To provide 3D models for selected model organisms, we have developed a homology modelling pipeline that exploits the FunFam multiple sequence alignments to improve accuracy and coverage. For the first time in Gene3D, we have fully integrated Pfam (11) and SUPERFAMILY (12) domain family assignments by using DomainFinder to give more complete MDA assignments. We have also included a comprehensive data set of post-translational modifications (PTMs) (14). For an overview of the Gene3D data pipeline see Figure 1. The website has also been updated to provide improved analysis tools, including a sequence search tool to find proteins with similar MDAs.

Figure 1. Workflow of the Gene3D pipeline from data input from external resources (parallelogram-shaped boxes) to useful functions for the user (diamond-shaped boxes). Rectangular boxes represent data processing steps.

New data and features in Gene3D

Significant increase in numbers of sequences annotated with at least one domain

The number of sequences in the database has increased by 45% since our last publication, with >6000 cellular genomes now present. We report a significant increase in domain coverage of sequences in Gene3D over the previous NAR database publication (15). The new release, Gene3D version 12, has 25 615 754 CATH domains assigned for 21 662 155 distinct protein sequences, belonging to 2738 CATH superfamilies. On the set of Ensembl genome sequences common to the previous and current releases, there is a substantial increase in domain coverage from 60 to 64% of sequences assigned at least one CATH domain. Some branches of life show greater increases, with metazoan species showing an 8.7% increase in the proportion of sequences assigned at least one domain. The increase in coverage can partly be ascribed to the increase in PDB domains assigned by the CATH domain assignment protocol and partly to changing the CATH domain HMM model building method from target2K to jackhmmer. We now include domains from Pfam-A, Pfam-B and SUPERFAMILY, which are integrated using the DomainFinder algorithm (5). This method uses graph theory to identify the optimal combination of domain family matches, having acceptable scores, to maximize coverage of the sequence and remove domain overlaps.
After integration, this produces 35 253 067 domain assignments with no sequence overlaps between them. On the Ensembl genomes sequence set, the addition of Pfam-A (11), Pfam-B and SUPERFAMILY (12) increases the coverage from 64 to 84%, showing that the different domain-based resources are highly complementary.

In our previous Gene3D publication (15), we introduced more specific domain assignments based on functional sub-classification of CATH superfamily assignments (FunFams). FunFams are built by agglomerative clustering of Gene3D domain sequences using the E-value returned from profile–profile comparisons (16). Since >50% of sequences in Gene3D belong to functionally diverse superfamilies, and since the FunFams are functionally cohesive families, using FunFam-based assignments allows us to provide more reliable functional annotation than using assignments from broad, highly diverse CATH superfamilies. The FunFams have been benchmarked against manually curated, experimentally annotated functional classifications, such as the SFLD (17), and more recently have been shown to be highly competitive in the international CAFA functional annotation assessment (4).

New MDA comparison method

Figure 2. The MDA alignment method in Gene3D uses the Needleman–Wunsch (NW) algorithm. (A) Domain matches in the substitution matrix for the NW algorithm can take place at multiple levels. The highest scoring match is the FunFam level (FunFam Match). The next highest scoring match is between different FunFams from the same superfamily scored by their similarity in a hierarchical tree of FunFams built from profile–profile comparisons (FunFam-Tree Match). The next highest scoring match is at the homologous superfamily
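The tiered scoring described in the Figure 2 caption can be sketched as a Needleman–Wunsch alignment in which each "letter" is a domain rather than a residue. This is a minimal illustration under stated assumptions, not the Gene3D implementation: the score values and linear gap penalty are invented, the tree-based similarity between FunFams of one superfamily is collapsed into a single constant, and the names `Domain`, `domain_score` and `align_mdas` are hypothetical.

```python
from typing import List, Tuple

# A domain is (superfamily id, FunFam id); the ids used below are made-up labels.
Domain = Tuple[str, str]

# Illustrative values only; none of these numbers come from Gene3D.
FUNFAM_MATCH = 3       # same superfamily and same FunFam
SUPERFAMILY_MATCH = 1  # same superfamily, different FunFam (stand-in for tree similarity)
MISMATCH = -1          # different superfamilies
GAP = -1               # linear gap penalty per inserted or deleted domain


def domain_score(a: Domain, b: Domain) -> int:
    """Substitution score for a pair of domains, highest for the most specific match."""
    if a[0] != b[0]:
        return MISMATCH
    return FUNFAM_MATCH if a[1] == b[1] else SUPERFAMILY_MATCH


def align_mdas(x: List[Domain], y: List[Domain]) -> int:
    """Global (Needleman-Wunsch) alignment score of two multi-domain architectures."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against nothing costs one gap per domain.
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + domain_score(x[i - 1], y[j - 1]),  # align two domains
                dp[i - 1][j] + GAP,                                   # gap in y
                dp[i][j - 1] + GAP,                                   # gap in x
            )
    return dp[n][m]


query = [("sf1", "ff1"), ("sf2", "ff7")]
related = [("sf1", "ff1"), ("sf2", "ff9")]               # second domain: other FunFam
inserted = [("sf1", "ff1"), ("sf3", "ff2"), ("sf2", "ff7")]  # one extra domain

print(align_mdas(query, query))     # 6
print(align_mdas(query, related))   # 4
print(align_mdas(query, inserted))  # 5
```

Ranking candidate proteins by this score would place identical architectures first, then architectures that differ by an inserted domain or by a FunFam change within the same superfamily, which is the kind of graded similarity the multi-level substitution matrix is meant to capture.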

