Assignment of Homology to Genome
Total Page:16
File Type:pdf, Size:1020Kb
doi:10.1006/jmbi.2001.5080availableonlineathttp://www.idealibrary.comon J. Mol. Biol. (2001) 313, 903±919 AssignmentofHomologytoGenomeSequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure JulianGough1*,KevinKarplus2,RichardHughey2andCyrusChothia1 1MRC, Laboratory of Molecular Of the sequence comparison methods, pro®le-based methods perform Biology, Hills Road, Cambridge with greater selectively than those that use pairwise comparisons. Of the CB2 2QH, UK pro®le methods, hidden Markov models (HMMs) are apparently the best. The ®rst part of this paper describes calculations that (i) improve 2Department of Computer the performance of HMMs and (ii) determine a good procedure for creat- Engineering, Jack Baskin School ing HMMs for sequences of proteins of known structure. For a family of of Engineering, University of related proteins, more homologues are detected using multiple models California, Santa Cruz built from diverse single seed sequences than from one model built from CA 95064, USA a good alignment of those sequences. A new procedure is described for detecting and correcting those errors that arise at the model-building stage of the procedure. These two improvements greatly increase selectiv- ity and coverage. The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represent essentially all proteins of known structure. The sequences of the domains in proteins of known structure, that have identities less than 95 %, are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35 % of eukaryotic genomes and 45 % of bacterial genomes. On average roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY.Thisserveralsoenables users to match their own sequences against the SUPERFAMILY model library. # 2001 Academic Press Keywords: genome; superfamily; hidden Markov model; structure; *Corresponding author homology Introduction analysis for many years, and with the growth in the experimental determination of new sequences Protein structure prediction, to discover the fold following Moore's law (largely due to the genome and hence information about the probable function projects), they are of increasing value. There are of the sequence of a gene about which nothing is more than 50 completely sequenced genomes at known, is possible via homology to a sequence of the time of writing, including ®ve eukaryotes. known structure. Protein homology searching They comprise approximately a quarter of a methods have been the central tool in sequence million sequences. Improvements in the speed of homology methods enables larger-scale studies, Abbreviations used: HMM, hidden Markov model; and improvements in their selectivity leads to dis- FIM, free insertion module; PDB, Protein Data Bank. covery of novel relationships. The system pre- E-mail address of the corresponding author: sented here probably offers the best selectivity [email protected] currently available for genomic-scale studies. 0022-2836/01/040903±17 $35.00/0 # 2001 Academic Press 904 Sequence Homology Assignment Of the sequence comparison methods available, SCOP database pairwise searches perform much less selectively 10 than pro®le-based, of which hidden Markov Thisdatabase containsastructuralandevol- 11 models1±4(HMMs)areapparentlythebest.The utionaryclassi®cationoftheproteinsinthePDB workbyParketal.5whichisthebasisofthisasser- and usually keeps up to date to within two to six tion,issupportedbyourmorerecentcalculations.6 months. In SCOP, a multi-domain protein is split TheSAMHMMpackage(http://www.cse.ucsc. up into its constituent domains, which are then edu/research/compbio/sam.html)includesan considered separately. These protein domains are iterative model building procedure called T99,7 evolutionary units in that for a protein to be which improves remote homology detection. Of divided into domains they must be observed in the HMM procedures available, SAM T99 is the isolation or in different combinations in nature. most effective. The fundamental units used in the work described The implementation of HMMs is not entirely here are the protein domains as classi®ed by straightforward. In the ®rst part of this paper we SCOP. Note that a domain which has another describe calculations that (i) improve the perform- domain inserted within it will comprise multiple ance of SAM HMMs and (ii) determine a good pro- chains or regions of sequence. cedure for creating SAM HMM for sequences of SCOP is a hierarchical classi®cation, and the proteins of known structure. level of classi®cation relevant to this work is the In the second part of the paper we describe the ``superfamily''. A superfamily is de®ned as a construction of a library of HMMs called SUPER- group of domains which have structural and func- FAMILY, that represent essentially all proteins of tional evidence for their descent from a common known structure. This library of models extends evolutionary ancestor. The level below superfamily both aspects of performance: speed and selectivity. is the ``family'' level which groups together those A sequence can be searched against it at over an domains that have clear sequence similarities. The order of magnitude faster than building a model level above superfamily is the ``fold'' which groups from a query sequence and searching an equivalent domains that have the same major secondary struc- database of target sequences. Expert creation and ture with the same chain topology. Superfamilies selection of the models, coupled with consensus clustered at this level have either no evidence to information, has greatly improved the selectivity of suggest an evolutionary relationship or only very the library. weak evidence that requires conformation. In the third part of the paper we describe the The superfamily level contains the most distantly use of the SUPERFAMILY library to annotate the related domains and so is the highest level for use- sequences of over 50 genomes. For each genome, ful remote homology detection. Proteins in the matches are made to sequences that form roughly same superfamily often have the same function, half the cytoplasmic proteins. These annotations and usually but not always have related functions. areavailablefromapublicwebserverat:http:// Since SCOP classi®es protein domains separately, a stash.mrc-lmb.cam.ac.uk/SUPERFAMILY/cgi-bin/ multi-domain protein may have contributions to gen_list.cgi. it's overall function from the different domains. This server also enables users to match their own sequences again the SUPERFAMILY model ASTRAL database library. Filesinthisdatabase12providesequencescorre- sponding to the SCOP domain de®nitions and are derived from the SEQRES entries in PDB ®les. Sequences, Domains and Homologies Domains which are non-contiguous in sequence, i.e. parts of the domain separated by the insertion of the Proteins of Known Structure of another domain, are treated as a whole. Their Many small proteins contain single domains, sequences are marked with separators between the whereas nearly all large proteins contain two or fragments representing regions belonging to other more that have linked by recombination and domains. All PDB entries with sequences shorter which occur in other proteins in isolation, in than 20 residues, or with poor quality or no combination with different partners, or in both sequence information are omitted. thesestates.8±10Thismeansthattosearchfor ASTRAL also has available sequence ®les ®ltered homologies in large sets of sequences, it is more to different levels of residue identity. The ASTRAL effective to use HMMs that represent protein entries are generated entirely automatically and domains (whole small proteins or the evolution- hence have a small number of documented errors. ary units of large proteins). Thus to build HMMs representing all proteins of known struc- SUPERFAMILY database ture we must ®rst have the sequences of repre- sentative sets of proteins, the de®nition of their The sequences which are used for the work pre- domain structure and a description of their hom- sented here are available from the SUPERFAMILY ologies. These data are available from the SCOP server, described in a later section, and are gener- and ASTRAL databases. ated from the ASTRAL sequences yet differ in the Sequence Homology Assignment 905 following ways: (1) SUPERFAMILY sequence ®les Performance of a model from an alignment of have any sequence shorter than 30 residues multiple homologous sequences versus removed rather than 20 in ASTRAL. Domains multiple models from single which are split across more than one chain had homologous sequences separate entries in ASTRAL which had to be joined to make a single entry in SUPERFAMILY. The Previouspracticebythisgroup14andothers15 ordering of the chains was obtained from NRDB has been to build one HMM model for each pro- andNRDB90.13SubsequentreleasesofASTRAL tein family or superfamily using an accurate align- however now include these joined