Dcgo: Database of Domain-Centric Ontologies on Functions, Phenotypes, Diseases and More Hai Fang* and Julian Gough*

Nucleic Acids Research Advance Access published November 17, 2012 Nucleic Acids Research, 2012, 1–9 doi:10.1093/nar/gks1080 dcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more Hai Fang* and Julian Gough* Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK Received August 15, 2012; Revised September 28, 2012; Accepted October 16, 2012 ABSTRACT higher-order knowledge on function (1), phenotype (2) and even human disease (3). We present ‘dcGO’ (http://supfam.org/ The domain-centric Gene Ontology (dcGO) database at SUPERFAMILY/dcGO), a comprehensive ontology http://supfam.org/SUPERFAMILY/dcGO is a compre- Downloaded from database for protein domains. Domains are often hensive ontology resource that contributes to the afore- the functional units of proteins, thus instead of mentioned challenge through a new domain-centric associating ontological terms only with full-length strategy. Our method, dcGO (4), annotates protein proteins, it sometimes makes more sense to asso- domains with ontological terms. Ontologies are hierarch- ically organized controlled vocabularies/terms defined to ciate terms with individual domains. Domain-centric http://nar.oxfordjournals.org/ GO, ‘dcGO’, provides associations between onto- categorize a particular sphere of knowledge (5). For logical terms and protein domains at the superfam- example, ‘Gene Ontology’ (GO) was created to describe ily and family levels. Some functional units consist functions of proteins (6). Ontological labels are already available for full-length proteins, derived from experimen- of more than one domain acting together or acting tal data for that protein. The dcGO approach takes the at an interface between domains; therefore, onto- terms attached to full-length sequences, and combines logical terms associated with pairs of domains, them with the domain composition of the sequences on triplets and longer supra-domains are also a large scale, to statistically infer for each term which provided. At the time of writing the ontologies in domain is the functional unit responsible for it. The at University Library on November 19, 2012 dcGO include the Gene Ontology (GO); Enzyme method has been formulated in a general way, enabling Commission (EC) numbers; pathways from it to be applied to numerous ontologies. The dcGO UniPathway; human phenotype ontology and database now contains a panel of ontologies from a phenotype ontologies from five model organisms, variety of contexts: functions such as GO (6,7), enzymes including plants; anatomy ontologies from three or- (8), pathways (9) and keywords used by UniProt (10); ganisms; human disease ontology and drugs from phenotype and anatomy ontologies across major model organisms, including mouse (11), worm (12), yeast (13), DrugBank. All ontological terms have probabilistic fly (14), zebrafish (15), Xenopus (16) and Arabidopsis (17); scores for their associations. In addition to associ- human phenotypes (18), diseases (19) and drugs (20). In ations to domains and supra-domains, the onto- addition to complete sets of ontological terms, a collapsed logical terms have been transferred to proteins, subset (slim version) is also provided for each ontology. through homology, providing annotations of >80 The automatically generated slim version of each ontology million sequences covering 2414 complete is based on annotation frequency (4), and it provides the genomes, hundreds of meta-genomes, thousands user with a manageable and more coarse-grained list. This of viruses and so forth. The dcGO database is is analogous to the ‘GO slim’ provided by the Gene updated fortnightly, and its website provides down- Ontology consortium, which has proven useful for enrich- loads, search, browse, phylogenetic context and ment analyses (21). other data-mining facilities. The domain definitions used in dcGO are taken from the structural classification of proteins (SCOP) (22) classified at both the superfamily and family levels. SCOP groups domains at the superfamily level if there is struc- INTRODUCTION ture, sequence and function evidence for a common evo- Scientists are increasingly confronted with the grand chal- lutionary ancestor. Some superfamilies are sub-divided lenge: how to convert sequenced genome information into into families, which often share a higher sequence *To whom correspondence should be addressed. Tel: +44 792 9104281; Fax: +44 117 9545208; Email: [email protected] Correspondence may also be addressed to Julian Gough. Tel: +44 117 3315221; Fax: +44 117 9545208; Email: [email protected] ß The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]. 2 Nucleic Acids Research, 2012 similarity and a related function. In addition to individual matrix, we then use Fisher’s exact test to infer associations domains at these two different levels, dcGO also offers between the rows and columns. On top of this, we take annotations for combinations of domains. We use the advantage of the true path rule of the directed acyclic concept of supra-domains to describe combinations of graph of the GO to determine the optimal level at which two or more successive domains of known structure. In to make an association. We achieve this by comparing the addition to providing ontology for SCOP domains, the significance of each term using two different backgrounds, generality of the method has enabled us to also include one background using all analysable UniProt proteins Pfam (23) domains in dcGO. (being annotatable by the GO), and one background Our domain-centric ontology derived from proteins using only those UniProt proteins annotated to direct with experimental evidence can be turned and used as parents of the term. If a GO term and its parent term a predictor on proteins of unknown function but where are both significantly associated with a domain/ the domain content is known. The ‘dcGO Predictor’ supra-domain using the first background, and if the term provides pre-computed functional annotation [using is not significantly different from the parent term using the SUPERFAMILY hidden Markov models (24)] of all se- second background, then it is desirable to only associate quences in UniProt (25), 2414 completely sequenced the parent term. As a result of these dual constraints, only genomes, thousands of viral genomes and hundreds of the most significant GO term associations to domains/ Downloaded from meta-genomes. The dcGO website also has a facility for supra-domains will be retained. the user to submit their own sequences for function pre- The significance of association is assessed by the method diction. The dcGO Predictor took part in the recent of false discovery rate (FDR) to account for multiple Critical Assessment of Functional Annotation experiment hypothesis tests, whereas the strength of association is (http://biofunctionprediction.org); hence, for comparison measured by a hypergeometric distribution-based score. with other non-domain-centric predictors, we refer the For a domain/supra-domain, the associated GO terms http://nar.oxfordjournals.org/ reader to this independent evaluation of its performance. (i.e. direct annotations) are propagated to all ancestor A key result from the experiment, however, was that terms (i.e. inherited annotations); together they constitute dcGO performs significantly better than the most a complete GO annotation profile. Based on the informa- commonly used method for GO annotation, Basic Local tion content of a GO term (i.e. negative logarithmic trans- Alignment Search Tool searches against UniProt (25). formation of frequency of domains/supra-domains In the main body of the article later we describe the annotated to that term), a search procedure is applied to database contents in detail, doing so separately for GO partition the directed acyclic graph structure of the GO, and for other biomedical ontologies (collectively denoted each partition reflecting the same or similar specificity but at University Library on November 19, 2012 as ‘BO’ hereinafter). Then, we provide an overview of located in distinct paths. With four seeds of increasing various utilities available through the website that may information content, the procedure produces the ‘GO interest users. Finally, we conclude with planned future slim’ that contains GO terms classified into four levels of developments. increasing granularity. These are highly general, general, specific and highly specific (Supplementary Table S1). The use of information content (as a measure of how specific DATABASE CONTENTS and informative a term is) adds great value to the existing Algorithm summary GO hierarchy for the user. The GO was created for annotating proteins, so some parts of GO structure are To fully understand the content of the dcGO database less valid for annotating domains/supra-domains than (Table 1), it is necessary to give a description of the algo- others. Rather than merely relying on the ontology rithm that is used to build it. Without loss of generality, graph depth to define the term specificity, our approach we take the GO as an exclusive example, but in principle, has taken into account actual usage of terms when the applications

Dcgo: Database of Domain-Centric Ontologies on Functions, Phenotypes, Diseases and More Hai Fang* and Julian Gough*

The SUPERFAMILY 2.0 Database: a Signiﬁcant Proteome Update and a New Webserver Arun Prasad Pandurangan 1,*, Jonathan Stahlhacke2, Matt E

Tum1 Is Involved in the Metabolism of Sterol Esters in Saccharomyces Cerevisiae Katja Uršič1,4, Mojca Ogrizović1,Dušan Kordiš1, Klaus Natter2 and Uroš Petrovič1,3*

Smurflite: Combining Simplified Markov Random Fields With

Renaissance SUPERFAMILY in the Decade Since Its Launch, a Repository of Information About Proteins in Genomes Has Developed Into a Primary Reference

Specialized Hidden Markov Model Databases for Microbial Genomics

Interpro: the Integrative Protein Signature Database

Article Reference

Assignment of Homology to Genome

GI-Edition Proceedings

Package 'Dcgor'

HMMER Web Server Documentation Release 1.0

Sequences and Topology: Disorder, Modularity, and Post/Pre Translation Modiﬁcation