Renaissance SUPERFAMILY
In the decade since its launch, a repository of information about proteins in genomes has developed into a primary reference. Dr Julian Gough, its creator, describes current work enhancing the scope, content and functionality of this key service.

Could you begin by outlining the reasons for creating SUPERFAMILY?

SUPERFAMILY was originally created to better understand molecular evolution, initially by enabling comparison of the repertoire of proteins and domains across the genomes of different species. The starting point was the Structural Classification of Proteins (SCOP), and the most basic purpose of SUPERFAMILY is to detect and classify these domains with known structural representatives in the protein sequences of genomes. There are tens of thousands of known protein structures, each the result of a costly three-dimensional atomic resolution structure determination by experiment, usually X-ray crystallography or nuclear magnetic resonance. There are tens of millions of protein sequences of unknown structure, more cheaply determined by automated sequencing machines. Using the principle of sequence homology, SUPERFAMILY takes the known structural domains and maps them to sequences. Since structure reveals evolutionary relationships, the structural classification of domains, mapped to genome sequences, enables the evolutionary study of complete genomes.

What types of protein does SUPERFAMILY detect and classify?

It includes any protein which contains a domain superfamily for which there is a known structural representative, so approximately 70 per cent of proteins in animal genomes and a higher percentage in bacteria. Now that we include intrinsically-disordered regions of proteins via the D2P2 sister database, we have some annotation for almost all human proteins, leaving only 17-27 per cent of the amino acids in the human genome with no structure/disorder annotation.

What are some of the major features of SUPERFAMILY?

SUPERFAMILY contains the world’s most complete and up-to-date collection of proteomes and includes many other sequence sets, such as hundreds of meta-genomes, viral genomes, plasmids, pseudo-gene collections, Protein Data Bank (PDB) sequences (updated weekly) and Universal Protein Resource (UniProt) (updated monthly). It provides the only fully-resolved species tree of weekly-updated completely-sequenced genomes, and reconstructed ancestral genomes of eukaryotes. SUPERFAMILY has the best and most complete collection of functional and other ontologies for protein domains, annotated on all the genomes, and the most comprehensive collection of disorder prediction for genomes. It also contains web-based comparative genomics tools for comparing superfamilies, families, domain architectures, Gene Ontology (GO) and other ontologies between genomes and/or clades of evolutionarily-related genomes.

Could you briefly highlight the contents of SUPERFAMILY?

The library consists of approximately 15,000 hidden Markov models (HMMs) representing about 2,000 superfamilies which can be downloaded and used in conjunction with software we provide to replicate the annotations in the database. However, we pre-calculate results on every public sequence we can reasonably obtain, totalling (as of August 2012) over 75 million sequences and including 2,414 completely sequenced genomes.

How do HMMs feed into the service?

HMMs are profiles which represent multiple sequence alignments of homologous proteins in a rigorous statistical framework. They can be used to classify sequences based on homology and to create sequence alignments. We use sequences of domains of known structure via iterative search procedures on large background sequence databases to build alignments and, subsequently, models representing domains of known structure at the superfamily level. These models are then searched against genome sequences to detect and classify the structural domains.

Finally, what is the current stage of development of the SUPERFAMILY resource?

Annotation of genomes with domains of known structure is achieved to a high standard, although maintaining an ever-expanding resource is a significant task. After some years maturing and consolidating the basic infrastructure, the project is now in a phase of developing in many new and exciting directions. Since SUPERFAMILY has the world’s most complete collection of proteomes, we have worked to provide a fully-resolved reference species tree of all organisms that have had their genome completely sequenced; this provides phylogenetic context to the data. We have recently added extensive functional annotation to the genomes via our own domain-centric ontology mappings (dcGO), including GO and 14 other ontologies such as disease, phenotype, anatomy, pathway and drug ontologies. We have just released D2P2 which adds the perfect complement to the domains of known structure, by adding annotations of intrinsically-disordered regions using a battery of nine predictors. We have also just published a tool, FATHMM, for analysing mutations in human and other organisms. Looking to the future, we are working to incorporate nucleotide and transcript/expression data, including a cloud computing solution to run SUPERFAMILY directly on unassembled next-generation sequencing data.
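The idea Gough describes in the interview, building a statistical profile from an alignment of homologous sequences and then searching it against new sequences, can be illustrated with a much simpler stand-in. The Python sketch below is not the SUPERFAMILY software and is not a true hidden Markov model: it uses an ungapped position-specific scoring matrix with no insert or delete states, a uniform background frequency and a fixed pseudocount, all of which are simplifying assumptions. The alignments, the family names and the function names are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Turn an ungapped alignment (equal-length strings) into per-column log-odds scores."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start offset)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        # Residues outside the 20 standard amino acids score 0 (neutral).
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        best = max(best, (score, start))
    return best


def classify(profiles, sequence, threshold=0.0):
    """Return (name, score, start) for the best profile scoring above threshold, else None."""
    hits = []
    for name, profile in profiles.items():
        score, start = best_hit(profile, sequence)
        if score > threshold:
            hits.append((score, name, start))
    if not hits:
        return None
    score, name, start = max(hits)
    return name, score, start


# Hypothetical toy alignments standing in for two domain superfamilies.
PROFILES = {
    "toy_family_A": build_profile(["GDSGG", "GDSGA", "GDSGG"]),
    "toy_family_B": build_profile(["HWKPL", "HWRPL", "HWKPI"]),
}

if __name__ == "__main__":
    print(classify(PROFILES, "MKTAGDSGGPLV"))
```

With these toy alignments the query is assigned to toy_family_A at offset 4, where the motif GDSGG sits. A real profile HMM additionally models insertions and deletions and reports statistical significance, which is what makes it usable on distantly related sequences.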
INTERNATIONAL INNOVATION

The new developments are opening up lots of new possibilities, such as for phenotype prediction, evolution of intrinsically disordered proteins and mutation analysis

Scaling up a core service

A five-year programme at the University of Bristol, UK is updating and augmenting the SUPERFAMILY protein domain resource on which molecular biologists depend for reliable, curated information about proteins and genomes.

Bioinformatics as a discipline, incorporating yet remaining distinct from biology, computer science and IT, developed in the 1980s and has since transformed knowledge of biological entities in terms of their relationships, organisations, functions and structures. Bioinformatics approaches have made it possible to extract and extrapolate information from unprecedentedly large volumes of data about genomes and proteomes. Leveraging considerable computing power, massive databases and sophisticated processing rules, algorithms and logical arguments for interrogations and predictions, bioinformatics enables biologists to quickly establish whether hypotheses and propositions are correct or likely to be worth further exploration.

THE SUCCESS OF SUPERFAMILY

Approximately 10 years ago, Professor Julian Gough, while working with Dr Cyrus Chothia, designed the SUPERFAMILY database and services to enhance knowledge of the evolution of protein domains and their repertoires within genomes. SUPERFAMILY’s main purposes were therefore to support genome annotation, structural genomics, gene prediction and domain-centred genomic investigations.

Gough based the SUPERFAMILY data catalogue on the Structural Classification of Proteins (SCOP) superfamily level that groups proteins

protein information could be organised were incorporated in a library of hidden Markov models (HMMs) based on the SCOP superfamily domain definitions for all known proteins. Gough then launched SUPERFAMILY as a free, open access resource accessible via the web.

SUPERFAMILY immediately made a big impact on the biological world and has been used extensively as the primary source of protein evolutionary information ever since: “SUPERFAMILY is the best resource for annotating protein sequences and genomes with SCOP structure domains on a large scale,” asserts Gough. Today, SUPERFAMILY attracts in the region of 3-4 million hits per month, or an average of more than one hit per second. SUPERFAMILY has also been cited in more than 1,300 scientific publications: “Surpassing 1,000 citations for SUPERFAMILY was a landmark moment,” reflects Gough.

SUPERFAMILY was designed to require a minimum of maintenance. For Gough, this conserved SUPERFAMILY’s role as a prime service: “SUPERFAMILY was under-resourced from the beginning so it was imbued, by necessity, with a drive to develop only features that can be automatically updated at reasonable computational cost. This enabled sustainability”. Gough and associates continued to maintain the service but the upsurge in sequence and protein

In 2010, the Biotechnology and Biological Sciences Research Council sponsored a five-year programme of improvements to scale up SUPERFAMILY and increase its robustness. Some additional funding was also obtained from Amazon and Google for computing resources. Gough himself is the Principal Investigator on the improvement programme, which is making solid progress.

SUPERFAMILY BASIC CAPABILITY

SUPERFAMILY’s main body of users is biologists without substantial computing resources at their disposal, though it is assumed that they will be au fait with requisite technological and analytical techniques: “In this day and age, it is no longer possible for even the most traditional laboratory biologists to carry out their work in ignorance of the massive amount of high-throughput biological data available. It is resources like ours that make this data accessible to those studying biological questions,” explains Gough.

SUPERFAMILY can provide information about all the proteins in any completely sequenced genome. The SUPERFAMILY database, models and associated scripts can be downloaded from the web as required; users can submit sequences for SCOP domain classification and keyword searches by superfamily, family, organism name, model and sequence identifier; find over- and