Protein Family and Domain Databases Database Evolution Clare Samson
Regulars Cyberbiochemist

The topic of protein family and domain databases is one that we have covered in the Cyberbiochemist on several occasions before. However, it is one that deserves regular coverage, as it is a rapidly moving field and one that no biochemist, whether bench scientist or not, can ignore. It is also one that is clearly underpinned by evolution, and so particularly appropriate for an issue celebrating Darwin’s discoveries.

Generically, protein family databases hold information about the sequences, structures and functions of protein domains, divided into groups according to homology: in all databases, there is at least one subdivision in which all members of each grouping are assumed to have evolved from a common ancestor. Some databases, particularly structural ones, also have wider groupings containing sequences where similarity is too low for a common origin to be more than guessed at, or where similarity of structure and function has been assumed to derive from convergent evolution.

It is possible to think of the databases themselves as evolving, becoming more complex and more divergent and new ones appearing regularly (perhaps in the equivalent of ‘speciation’ events). In this analogy, the common ancestor is the venerable ProSite, first published in 1989. Its first release contained about 60 entries, each describing one or more sequence patterns characteristic of a family of proteins; the latest version contains about 1500, characterized by either patterns or profiles derived from sequence alignments. ProSite has always been praised for the richness and reliability of its documentation. This is derived manually, with the assistance of an expert panel of researchers who still make themselves available to deal with emailed queries concerning particular protein families. The latest release also contains a new section, ProRule, a series of manually developed ‘rules’ that enhance the discrimination of ProSite patterns and profiles with information about the nature and role of particular amino acids.

ProSite is derived from, and based at the same website as, the manually curated Swiss-Prot sequence database, now, of course, incorporated into UniProt. Both of these databases are exceptionally reliable, but they are also, by bioinformatics database standards, exceptionally small. Manual annotation simply cannot keep up with the rate at which proteome sequences are being produced from what is now a torrent of genomic data; automatically and semi-automatically generated databases can be much larger. For example, the latest version of one of the best-known family databases, Pfam, contains over 10,000 entries, divided into Pfam-A and Pfam-B. The larger Pfam-A comprises alignments of protein domains that were generated semi-automatically, using hidden Markov models, and then annotated manually with functional and structural information; Pfam-B families are generated automatically from simple sequence alignment. A Pfam-B family may simply consist of a group of sequences of unknown function that look similar enough to be possibly derived from a common ancestor. The database entry for each Pfam-A family, however, contains information about the different domain distributions, or architectures, in which that particular domain is found, its phylogenetic distribution, protein–protein interactions that it is known to be involved in and, where possible, its structure.

Figure: Structure of a heterotrimeric G-protein that consists of a chimaeric αt/αi subunit (blue) and the βγ subunit (red, green). (Public domain image from The RCSB Protein Data Bank)

Protein family and domain databases that work in a similar way to Pfam include ProDom, SMART and TIGRFAM (which takes its name from The Institute for Genomic Research where it is based). There is, obviously enough, a very high degree of overlap between the data held in these databases. However, they are not completely equivalent; each contains families that are not found in the others and the coverage of each database and the options available reflect, to some extent, the history and research interests of the group that created it. ProDom has two special sections: ProDom-CG, which only contains sequences from complete genomes, and ProDom-SG, for structural genomics targets. Currently (in December 2008), ProDom-CG includes data derived from 19 archaeal, 97 bacterial and nine eukaryotic genomes. SMART, which also has a genomic mode, is particularly rich in domain families involved in signal transduction, and TIGRFAM, which contains a useful list of families organized hierarchically by ‘role category’ (with top-level terms including ‘amino acid biosynthesis’, ‘transcription’ and ‘transport and binding proteins’) focuses solely on microbial proteins.

The data held in these databases, and in many others including structural databases CATH, SCOP and MODBASE, are combined in a single resource called InterPro, held at the EBI. There is now no need to keep the URLs of the individual resources, as each is linked directly from the InterPro homepage; however, a search for information about a particular protein is likely to begin with a single search of InterPro, producing a summary of matches found in all the databases. About 75% of well over 6 million proteins in the whole UniProt resource now match at least one entry in one or more of the databases that make up InterPro. That is the same as saying that there are over 4.5 million protein sequences in the public domain with at least something known about their evolution, function, structure or mechanism: that’s an enormous resource for all biochemists. Bookmark www.ebi.ac.uk/interpro and enjoy!

52 February 2009 © 2009 The Biochemical Society.
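For readers curious about what one of the ‘sequence patterns’ behind a ProSite entry looks like in practice, here is a minimal sketch in Python. It is an illustration only, not ProSite’s own scanning software. It assumes the commonly documented ProSite pattern notation (elements separated by hyphens, x for any residue, square brackets for allowed residues, curly brackets for excluded residues, a repeat count in parentheses, and < or > to anchor the pattern to a sequence end) and converts a simplified subset of it into a regular expression. The function names, the example pattern (the widely quoted N-glycosylation motif N-{P}-[ST]-{P}) and the test sequence are all assumptions introduced here; none of them comes from the article.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified ProSite-style pattern into a Python regex.

    Hypothetical helper for illustration; it handles only the basic
    syntax described in the lead-in, not every ProSite feature.
    """
    pattern = pattern.strip().rstrip(".")  # patterns are usually written with a final period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored to the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat, e.g. x(2,4).
        parsed = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if parsed is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = parsed.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # Bracketed sets such as [ST] and single letters are already valid regex.
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_matches(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched residues) for every hit."""
    # Wrapping the regex in a lookahead lets overlapping hits be reported too.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Made-up sequence, used only to show the call.
    for start, residues in find_matches("N-{P}-[ST]-{P}.", "MNGSANPTLNCTQ"):
        print(start, residues)
```

A short pattern like this will match many sequences purely by chance, which is part of the reason the text above describes ProSite as supplementing patterns with profiles derived from alignments and with the manually developed ProRule section.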