A Comparison of Profile Hidden Markov Model Procedures

ã 2002 Oxford University Press Nucleic Acids Research, 2002, Vol. 30 No. 19 4321±4328

A comparison of pro®le hidden Markov model procedures for remote homology detection

Martin Madera* and Julian Gough

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK

Received April 29, 2002; Revised and Accepted August 7, 2002

ABSTRACT information about many of the large and rapidly growing number of protein sequences. Pro®le hidden Markov models (HMMs) are amongst Many related proteins have similar sequences, making their the most successful procedures for detecting relationship easy to detect, but equally as many have diverged remote homology between proteins. There are two to a point where their structural and functional similarity is popular pro®le HMM programs, HMMER and SAM. hard to detect from purely sequence-based data. Homology Little is known about their performance relative to detection methods vary in their ability to detect some of these each other and to the recently improved version of more distant relationships. It has been shown by Park et al. (1) PSI-BLAST. Here we compare the two programs to that pro®le-based methods, which consider pro®les of protein each other and to non-HMM methods, to determine families, perform much better than pairwise methods, which consider individual protein sequences, and that of the pro®le- their relative performance and the features that are based methods hidden Markov models (HMMs) (2,3) perform important for their success. The quality of the best. A more recent study by Lindahl and Elofsson (4) multiple sequence alignments used to build models con®rmed the relative performance of pairwise and pro®le was the most important factor affecting the overall methods and showed, further, that at the family and super- performance of pro®le HMMs. The SAM T99 proced- family levels pro®le methods are also superior to threading ure is needed to produce high quality alignments methods as represented by THREADER (5). automatically, and the lack of an equivalent Currently there are two popular pro®le HMM software component in HMMER makes it less complete as a packages: HMMER (6) and SAM (7,8). Little is known about package. Using the default options and parameters their relative performance or the features that are important for as would be expected of an inexpert user, it was their success. The work described here focuses on these issues. Because of the large number of parameters affecting the found that from identical alignments SAM consist- performance of pro®le HMMs the approach taken was to use ently produces better models than HMMER and that the default settings wherever possible, making the results the relative performance of the model-scoring representative of what an informed but inexpert user might components varies. On average, HMMER was found achieve. It should be noted, however, that an expert user may to be between one and three times faster than SAM be able to obtain an improved performance with either method when searching databases larger than 2000 by ®ne tuning the parameters and conditions (9). Great care sequences, SAM being faster on smaller ones. Both was taken to compare the methods without bias. methods were shown to have effective low complex- Our results differ from those of other studies (4,10) that ity and repeat sequence masking using their null have used pro®le HMMs. This is discussed in detail in a separate section towards the end of this paper. models, and the accuracy of their E-values was The general organisation of this paper is as follows. First we comparable. It was found that the SAM T99 iterative introduce the pro®le HMM procedure and the two packages, database search procedure performs better than the HMMER and SAM. Then we describe the tests we chose to most recent version of PSI-BLAST, but that scoring measure their performance and explain the technical aspects of PSI-BLAST pro®les is more than 30 times faster of the comparison. Next we discuss the results and put the than scoring of SAM models. performance of pro®le HMMs into a broader context of other sequence-based methods, including PSI-BLAST. Finally, we compare our results to those of the previous studies, as mentioned above. INTRODUCTION Protein sequence homology detection is a central tool in THE PROFILE HMM PROCEDURE genomics. The ability to infer a relationship between two proteins of known amino acid sequence using computers The operation of pairwise sequence comparison methods and without laboratory experimentation gives valuable (e.g. simple BLAST, http://www.ncbi.nlm.nih.gov/BLAST,

*To whom correspondence should be addressed. Tel: +44 1223 402479; Fax: +44 1223 213556; Email: [email protected]

The authors wish it to be known that, in their opinion, the ®rst two authors should be regarded as joint First Authors 4322 Nucleic Acids Research, 2002, Vol. 30 No. 19 http://blast.wustl.edu) essentially consists of a single step: the model scoring the relevant program was used directly. Unlike program takes two sequences and calculates a score for their HMMER, the package does not include a model-calibration optimal alignment; this score may then be used to decide program. The SAM model-scoring program calculates whether the two sequences are related. The use of pro®le E-values directly using a theoretical function that takes as its HMMs in homology detection is more complicated because of argument the difference between raw scores of the query the need to ®rst construct the pro®le HMM. The procedure sequence and its reverse. consists of three steps. (i) A multiple sequence alignment is Perhaps the most important component of the SAM package made of known members of a given protein family. The is the target99 script (8) commonly known as T99, which quality of the alignment and the number and diversity of the automatically generates a multiple sequence alignment suit- sequences it contains are crucial for the eventual success of the able for model building. The script takes as its input a single whole procedure. (ii) A pro®le HMM of the family is built seed sequence or an initial alignment and iteratively searches a from the multiple sequence alignment. The model-building sequence database in a manner similar to PSI-BLAST (11). program uses information derived from the alignment together T99 is not under direct assessment here due to the lack of an with its prior knowledge of the general nature of proteins. equivalent component in the HMMER package. (iii) Finally, a model-scoring program is used to assign a score Version 3.2 of SAM was used, released in July 2000. with respect to the model to any sequence of interest; the better the score, the higher the chance that the query sequence is a member (homologue) of the protein family represented by the REMOTE HOMOLOGY DETECTION TESTS model. In this way each sequence in a database can be scored We subjected the pro®le HMM methods to two tests, which to ®nd the members of the family present in the database. were to some extent complementary in their nature. We ®rst This assessment compares the performance of the model- give an overview of the tests, highlighting the key points and building (step 2) and model-scoring (step 3) programs in the contrasting the differences, and then describe them in two packages. The performance is measured by the ability to technical detail. The two tests were as follows. (i) A test detect members of a protein family in sequence databases against nrdb90 (12), a non-redundant database of all known given a multiple sequence alignment for that family. A variety protein sequences. The assessment was restricted to just two of alignments were used in this work, some of which might be protein families (globins and cupredoxins) because of the need expected to be more suitable for one package, some for the to classify all hits by hand, but in return we were able to other, and some about which there were no expectations. A explore a wide variety of multiple sequence alignments from detailed explanation of our tests is provided later in the paper. which to build the HMMs, ranging from fully automated to expert curated and from purely sequence-based to structural. Because only two protein families were investigated, the large THE TWO PROFILE HMM PACKAGES size of the database (in particular the fact that it contained In this section we brie¯y introduce the two pro®le HMM many homologues of the two families) was crucial for packages assessed in this work. The technical issues associ- statistical signi®cance of the results. (ii) An all-against-all ated with their comparison are discussed later. match using a database of approximately 3000 proteins of known structure. For each member of the database an HMMER alignment of homologous proteins was created using fully Developed chie¯y by Sean Eddy, the HMMER package (6; automated methods with single sequence inputs, and models http://hmmer.wustl.edu) is freely available under the GNU created from these alignments were scored against the original General Public License and includes the necessary model- database. Evolutionary relationships of all proteins in the test building and model-scoring programs relevant to homology set were provided by the SCOP (13) database, allowing for detection. In addition, the package contains a program that automatic classi®cation of the results. Although limited to a calibrates a model by scoring it against a set of random small number of alignment methods, this test was very sequences and ®tting an extreme value distribution to the comprehensive, as representatives from every protein family resultant raw scores; the parameters of this distribution are of known structure were included in the database. then used to calculate accurate E-values for sequences of We also brie¯y investigated the low complexity masking interest. All HMMER models used in this study were capabilities of the two methods and their computing time calibrated in this way. requirements. The protocols for these are described together In this comparison version 2.2g of the package was used, with our ®ndings in the Results section. which was released in August 2001. Next we describe the two remote homology detection tests in detail. SAM Developed by the bioinformatics group at the University Test 1: The globin and cupredoxin families of California, Santa Cruz, the SAM package (7,8; A variety of models representing the globin and cupredoxin http://www.cse.ucsc.edu/research/compbio/sam.html) is not families, as de®ned by the SCOP (13) family `Globins' and the open source, but it is free for academic use and the authors superfamily `Cupredoxins', were searched against the nrdb90 retain no rights over the models produced with the software. database (12). All hits were then classi®ed as either true, false The package contains the necessary model-building and or uncertain based on the following criteria. (i) Database model-scoring programs as well as several scripts for running annotation. Globins and cupredoxins were chosen for this test them. In particular, the fw0.7 script recommended in the because they are well known families with reliable annota- documentation was used for all SAM model building; for tions in the databases. (ii) Pairwise comparisons with well Nucleic Acids Research, 2002, Vol. 30 No. 19 4323 annotated homologues. For poorly annotated sequences with as either true, false or uncertain according to their SCOP highly signi®cant pairwise matches to well annotated homo- classi®cation. Hits to the same superfamily as the model were logues the annotations from the well annotated homologues classi®ed as true, hits to the same fold as uncertain and hits to a were used as above. (iii) Structural understanding of the two different fold as false. Two signi®cant exceptions to this families. Members of our group have carried out detailed general scheme were that: cross-hits between the Rossmann structural and key residue analyses of the two families (14; and Rossmann-like folds (3.2, 3.3 and 3.4 in the 1.50 J.Gough, unpublished results) and this knowledge was used in classi®cation) were considered uncertain, and hits to the a number of cases. (iv) All hits about which clear-cut decisions seed sequence from which the particular model was built were could not be reached were classi®ed as uncertain. The ignored altogether as too easy. Using these criteria there was a classi®cation is available on our website (http://stash. total of 36 612 possible true and 8 173 744 false pairwise mrc-lmb.cam.ac.uk/HMMER-SAM/). relationships. The models used in this test were built from a large number of alignments listed below. Each alignment was used with both packages and the results were compared; the T99 alignments were either re-formatted or re-aligned beforehand TECHNICAL ASPECTS OF THE COMPARISON (see the subsection on input alignments later in the paper). The manual alignments used in test 1 were: (i) a revised There are two technical aspects of the comparison between the version of the alignment (15) of selected members of the two packages that are important for understanding the results: globin family based on a detailed structural analysis (denoted our model conversion procedure and a difference in the nature G-STR); (ii) the full PFAM globin alignment (16) (denoted of multiple sequence alignments used for model building. A G-PFAM); (iii) an alignment of the cupredoxin family further aspect, the search mode, is needed for reproducibility (J.Gough, unpublished results) similar to Bashford et al. (15) of our results. This section is devoted to an explanation of (denoted C-STR). Here G stands for globins and C for these issues. cupredoxins, STR for structural and PFAM for a PFAM alignment. Each of these alignments was also used as input for Input alignments: an important difference the SAM T99 producedure; the resultant T99 alignments are The SAM model-building program distinguishes between two called G-STR-T99, G-PFAM-T99 and C-STR-T99. types of alignment columns: the aligned residues, marked by In addition, a single representative member of each family upper case letters or - characters for deletions, and unaligned (1rse for globins and 1plc for cupredoxins) was used as a seed insertions, with lower case letters or . characters (see Table 1 for T99; the resulting T99 alignments are denoted G-1S-T99 for illustration). The program strictly follows this convention or C-1S-T99, where 1S stands for single sequence input. and the number of aligned upper case columns in the multiple Finally, the same two sequences were used in a procedure sequence alignment is therefore always equal to the number of comprised of a WU-BLAST (http://blast.wustl.edu, version 2.0a19) search of nrdb90, followed by a ClustalW alignment segments in the ®nal model. In contrast, in a HMMER-style of hits with a P-value score better than 1 3 10±5; these are alignment all columns are supposed to be aligned, though the denoted G-1S-BL&CLW and C-1S-BL&CLW. HMMER model-building program treats the most divergent regions as insertions. With the important exception of Test 2: the SCOP all-against-all alignments produced by the SAM T99 procedure, none of The SCOP database (13; http://scop.mrc-lmb.cam.ac.uk/scop/ the alignments used in this assessment differentiated between index.html) provides an accurate and reliable classi®cation of aligned and unaligned regions. all proteins of known structure based on structural, functional When dealing with T99 alignments, which follow the SAM and sequence evidence. As many of the relationships between convention, our principal objective was to deny SAM any proteins that are clear at the structural and functional levels are information not available to HMMER. This was achieved in dif®cult to detect at the sequence level it represents a hard test one of two ways: by simply converting all letters to upper case set for assessment of sequence comparison methods. Because and . characters to - (these alignments are called T99-UC) or it covers all proteins of known structure it is also very by completely realigning the sequences with ClustalW (19), a comprehensive. Starting with Brenner et al. (17) it has indeed popular multiple sequence aligment program (T99-CLW). (To been used in a number of previous studies of this type (1,4). counteract poisoning due to inserted domains we removed all The sequence set ®ltered to 40% sequence identity (which insertions longer than 30 residues prior to realignment with ensures that only remote homologues are present and hence ClustalW.) The situation was therefore entirely analogous to that the test is suitably dif®cult) is provided by the ASTRAL that for other alignments. compedium (18). Version 1.50 of the database was used and We also used the intact T99 alignments with SAM so that the test set consisted of 2873 sequences. the effects, if any, of the above conversions could be seen (the Taking each sequence in the set as a seed, two automatic alignments are called simply T99). Finally, it is possible to methods were used to create multiple sequence alignments: pass the SAM aligned±unaligned distinction to the HMMER T99 (1S-T99) and a WU-BLAST search of nrdb90 followed model-building program in a non-default way, via the `--hand' by a ClustalW alignment (1S-BL&CLW). These were iden- option in combination with the SELEX format. We wrote a tical to the corresponding methods used in test 1, in particular program (a2m2selex.pl) to facilitate the conversion, available T99 was iterated on nrdb90. Models generated from these on our website (http://stash.mrc-lmb.cam.ac.uk/HMMER- alignments were then scored against all seed sequences. Hits SAM/). This method was only used in test 2, and T99 from all models were pooled, sorted by E-value and classi®ed alignments used in this way are called T99HAND. 4324 Nucleic Acids Research, 2002, Vol. 30 No. 19

Table 1. A comparison of alignment conventions HMMER alignment convention 1aac ---EAALKGPMMKKEQAY--SLTFTE----AG-TYDYHCTP--H--PFMR 1bqk ---DGAEA.FKSKINENY--KVTFTA----PG.VYGVKCTP--HYGMGMV 2cbp ---STCNTPAGAK---VY--TSGRDQIKLPKGQSY-FICNFPGHCQSGMK 1nwp VIAHTKVIGAGEK--DSV--TFDVSKLA--AGEKYGFFCSFPGHI-SMMK 1rcy -GTGFSPVPKDGK--FGYTDFTWHPT----AG-TYYYVCQIPGHAATGMF SAM alignment convention 1aac .--EAALKGPMMKKEQAY..SLTFTE....AG.TYDYHCTP..H..PFMR 1bqk .--DGAEA-FKSKINENY..KVTFTA....PG.VYGVKCTP..HygMGMV 2cbp .--STCNTPAGAK---VY..TSGRDQiklpKGqSY-FICNFpgHcqSGMK 1nwp vIAHTKVIGAGEK--DSV..TFDVSKla..AGeKYGFFCSFpgHi.SMMK 1rcy .GTGFSPVPKDGK--FGYtdFTWHPT....AG.TYYYVCQIpgHaaTGMF

Note that SAM distinguishes between two types of columns: aligned (upper case residue or - for deletions) and unaligned (lower case residues and .). A SAM-style alignment therefore contains more information than a HMMER-style one.

Conversion between HMMER and SAM models PROFILE HMM RESULTS In order to be able to compare the model-building and model- Model building scoring components of the two packages separately, we wrote In all the cases we investigated SAM models performed better a program (convert.pl) to convert between HMMER and than HMMER models when both were scored with the same SAM model ®les, available from our website (http://stash.mrc- program, i.e. SAM models converted to the HMMER format lmb.cam.ac.uk/HMMER-SAM/). Each alignment used in this (SH) were better than native HMMER models built from the assessment therefore gave rise to four sets of results: two using same alignment (HH), and HMMER models converted to the the default procedures for each package (denoted HH for SAM format (HS) were worse than native SAM models (SS) HMMER and SS for SAM) and two where models built by one (see Table 2a and Fig. 1 for illustration). package were converted using our program and scored by the Furthermore, in all cases bar one this was true across all other package (denoted HS for HMMER-built models con- error rates. The single exception was the case of WU-BLAST verted to the SAM format and scored by SAM, and vice versa followed by ClustalW (1S-BL&CLW in test 2; Fig. 1A) where for SH). In this way models produced by each program were HMMER-built models performed better at very low error rates scored independently by both model-scoring programs and, (<0.5%). This behaviour was caused by a small number of conversely, both model-scoring programs were assessed on `poisoned' alignments, in which some of the sequences essentially the same models. contained parts of domains from other superfamilies. When Due to a difference in model topologies there was a small these alignments (<1% of the total) were removed from the loss of information when converting from SAM models to test set SAM performed better across all error rates. HMMER HMMER ones, but no information was lost the other way; in model building was less susceptible to this type of error particular, using the script to convert from the HMMER because the columns that contained poisoning sequences also format to the SAM format and back again resulted in a ®le tended to be poorly aligned and thus were averaged as identical to the initial one. See our website for further insertions (see the earlier subsection on alignments). information. Overall, this means that SAM consistently produces better models than HMMER, even though it is less robust with The local/local search mode was used respect to poisoned alignments that do not follow its There are two popular ways of using pro®le HMMs: by forcing distinction between aligned and unaligned columns. a global alignment to the model and a local one to the query sequence (the domain or global/local mode) or by using the Model scoring local mode for both. (By global alignment we mean the mode SAM scoring (HS and SS) performed better on high quality, whereby a match is forced to each segment of the model or diverse alignments; it did so overwhelmingly on both manual every residue of the query sequence, as opposed to the local globins alignments (structural and PFAM, or G-STR and mode where only the best scoring region is considered.) G-PFAM; data not shown for G-STR) and up to a certain error Which approach is better in a real application is debatable rate also on the T99 alignments in test 2 (1S-T99 and and dependent on the problem at hand. In our experience the 1ST99HAND), but on our structural cupredoxin alignment local/local mode is the one better suited for genome annota- (C-STR) HMMER scoring performed better. Similarly, while tions because it is more robust with respect to inserted HMMER scoring was better for the hand curated PFAM domains and gene prediction errors. In test 2 each sequence in alignments run through T99 and realigned with ClustalW the test set is hand curated to represent a single complete (G-PFAM-T99-CLW), SAM scoring was better when the domain in the corresponding protein structure. As a result, the same procedure was applied to our hand curated structural global/local mode performs signi®cantly better on this test, but alignment (G-STR-T99-CLW; data not shown) (see Table 2a this is not representative of real sequence databases. and Fig. 1 for illustration). In order to compare both methods under realistic condi- As the above dif®culties illustrate, we have not managed to tions, we used the local/local mode for both methods in all of extract any de®nitive rules governing the relative performance our tests. of the model-scoring programs, even though it was almost Nucleic Acids Research, 2002, Vol. 30 No. 19 4325

Table 2a. The numbers of true homologues found in tests 1 and 2, at 1% error rate, for a selected number of methods, concentrating on comparisons of model building and model scoring Alignment Model building and scoring method HH HS SH SS

(i) C-STR 117 113 129 116 G-PFAM 557 590 560 593 G-PFAM-T99-CLW 554 549 560 554 (ii) 1S-BL&CLW 5536 5247 5929 5965 1S-T99-CLW 6620 7205 7559 8128 1S-T99, 1S-T99HAND 7414 7813 8283 8784

(iii) Number of nrdb90 iterations 0330

PSI-BLAST 4330 7897 7462

(i) Results for test 1; (ii) for test 2; (iii) results of test 2 applied to PSI-BLAST (the case with zero nrdb90 iterations being pairwise NCBI BLAST). G stands for globins, C for cupredoxins, 1S for a single-sequence seed; PFAM indicates a PFAM alignment, STR a structural alignment, G-1S is the sequence of 1rse; T99 is the SAM T99 procedure, either default (T99), with all columns in the resultant alignment converted to upper case (T99-UC), or realigned with ClustalW (T99-CLW). Finally, HH and SS are the default procedures for HMMER and SAM, respectively, HS indicates a HMMER model converted to the SAM format and scored by SAM, and vice versa for SH. See text for further explanations.

always the case that one of them performed better than the other on both sets of models.

Multiple sequence alignments The previous two subsections show that SAM builds better models than HMMER and that the relative performance of the model-scoring components varies. However, often the most important factor affecting the overall performance was not the particular pro®le HMM program, but rather the input alignment used. As can be seen in Table 2b, the best-performing alignment for globins was the PFAM alignment (G-PFAM), but the fully automated SAM-T99 procedure seeded from a single member of the family (G-1S-T99) with 433 sequences in the ®nal alignment came second and performed better than the manual structural alignment consisting of only 12 sequences (G-STR). Running the PFAM alignment through the T99 procedure (G- PFAM-T99) worsened performance, though for the structural alignment this did lead to a marginal improvement at larger error rates (G-STR-T99). Realigning the T99 alignment with ClustalW resulted in a loss of performance in both cases (G- PFAM-T99-CLW and G-STR-T99-CLW) and this loss was greater than that from merely removing the aligned±unaligned distinction (G-PFAM-T99-UC and G-STR-T99-UC). We nevertheless chose ClustalW for test 2 as the difference between the HMMER and SAM models was smaller using it. The results for cupredoxins were similar. In test 2, using ClustalW to realign T99 alignments (1S-T99-CLW) again resulted in a ~10% drop in performance. To sum up, it is clear that a good alignment for use with pro®le HMM remote homology detection procedures needs Figure 1. Sensitivity plots for the SCOP all-against-all (test 2). The input to include a large number of diverse sequences, correctly alignments were: (A) the results of a WU-BLAST search of nrdb90 aligned aligned and with unalignable portions either excised or with ClustalW; (B) T99 alignments realigned with ClustalW. In both ®gures, HH and SS are the default procedures for HMMER and SAM, appropriately marked. The only way of producing such respectively, HS indicates a HMMER model converted to the SAM format alignments is via an iterated database search, so the lack of and scored by SAM, and vice versa for SH. 4326 Nucleic Acids Research, 2002, Vol. 30 No. 19

Table 2b. The numbers of true homologues found in tests 1 and 2, at 1% error rate, for a selected number of methods, comparing suitability for model building of globin alignments in test 1

Method Number of hits

G-PFAM-SS 593 G-1S-T99-SS 582 G-PFAM-T99-SS 573 G-STR-SS 567 G-STR-T99-SS 567 G-PFAM-T99-UC-SS 565 G-STR-T99-UC-SS 561 G-PFAM-T99-CLW-SS 554 G-STR-T99-CLW-SS 548

G stands for globins, C for cupredoxins, 1S for a single-sequence seed; PFAM indicates a PFAM alignment, STR a structural alignment, G-1S is the sequence of 1rse; T99 is the SAM T99 procedure, either default (T99), with all columns in the resultant alignment converted to upper case (T99-UC) or re-aligned with ClustalW (T99-CLW). See text for further Figure 2. Distribution of E-values E of ®rst false positives in test 2. The explanations. probability density is with respect to the log10(E) x-axis. The experimental curves are smoothed (each model was added as a Gaussian of standard deviation 0.1 and area 2873±1), the theoretical curve is ln(10) E exp(±E). a T99 equivalent in the HMMER package makes it less useful 1S-BL&CLW is the result of a WU-BLAST search of nrdb90 aligned with for an inexpert user. ClustalW, 1S-T99 the alignment produced by the T99 procedure; HH and SS are the default procedures for HMMER and SAM, respectively. E-value accuracy The E-value is de®ned as the expected number of errors per Low complexity masking query. In the context of test 2 this means that one would expect on average E false positives per model with an E-value score One aspect insuf®ciently covered by the comprehensive test 2 better than E to occur by chance. For the 2873 models used in is the ability of the methods to deal with low complexity test 2 one would therefore expect about 29 false positives with sequences, because of their absence from the PDB (14). As an E-value score better than 0.01 when results for all models low complexity sequences are known to cause problems for are pooled together. sequence comparison methods, and because the HMMER For pairwise methods with accurate E-value scoring `null2' and the SAM `reverse null' null models are very schemes, e.g. NCBI BLAST, this is indeed what one gets different, we chose 100 models from different SCOP folds at (results not shown). Unfortunately matters are more compli- random and scored them against three databases of low cated for pro®le methods, because of model poisoning already complexity sequences with roughly 2000 sequences each. The alluded to in the subsection on model building (also discussed databases were constructed as follows: (i) the program seg in 9). Poisoned models systematically assign highly signi®cant (20) was used to extract long sections of sequences from scores to sequences from poisoning superfamilies and thus nrdb90; (ii) PDB sequences ®ltered to 30% identity were all greatly in¯ate false positive counts at low E-values. As a reversed; (iii) PDB sequences ®ltered to 30% identity were all result, the customary plots of false positive counts against randomly shuf¯ed. E-value (1,17) are dominated by poisoning and therefore say WU-BLAST was also included in this test, to provide a little about E-value accuracy for unpoisoned models. comparison with pairwise methods. The results are summar- To circumvent this problem we plotted the distribution of ized in Table 3; it is clear that unlike WU-BLAST, both pro®le E-values of the ®rst false hit for each model (Fig. 2). It can be HMM methods make very few hits to either database and seen that the distributions for both pro®le HMM methods are therefore possess in their null models effective low complexity in reasonably good agreement with the theoretical curve, masking systems. although HMMER scoring is somewhat conservative and Computer time SAM scoring somewhat optimistic. HMMER E-values are more accurate in the crucial region around E = 0.01, To measure the speed of pro®le HMM programs, we recorded presumably because the extreme value distribution function the times taken to build and calibrate 100 models and to score is being ®tted there, whereas SAM E-values (based on a them against our test 2 dataset of 2873 sequences. It should be theoretical calculation) only become accurate in the high-E noted that each sequence in this dataset is a single protein limit. domain and that the sequences are therefore signi®cantly The numbers of models with ®rst false positive below the shorter than those in other databases. The test was performed 1 3 10±5 E-value level, indicative of the overall level of model on a computer with a single 1.3 GHz AMD Athlon processor poisoning for the particular alignment procedure, were 15±35 and 768 MB of RAM running the Linux operating system for WU-BLAST followed by ClustalW (1S-BL&CLW) and (kernel version 2.4.2-2). 100±120 for T99 (1S-T99-Sx, 1S-T99HAND-Hx and 1S-T99- As shown in Table 4, model building is quick using both CLW). All four combinations of model building and model methods, except for HMMER model calibration, which is scoring for the particular alignment procedure ®tted within slow. Model scoring was on average 3.4 times faster using each range. HMMER, but the additional need to calibrate all models Nucleic Acids Research, 2002, Vol. 30 No. 19 4327

Table 3. Hits to low complexity databases Database Number of hits by each method HMMER SAM WU-BLAST

1. seg 0 0 33 2. reversed 1 1 11 3. shuf¯ed 1 2 0

Table 4. Computer times per model against our test 2 dataset Task Average time (s)

HMMER model building 0.4 HMMER model calibration 52.6 SAM model building 1.1 HMMER model scoring 17.1 SAM model scoring 58.9 Figure 3. Sensitivity plots for the SCOP all-against-all. SCOP version 1.50 PSI-BLAST pro®le scoring 1.6 was used, ®ltered down to 2873 sequences of less than 40% sequence identity, with a total of 36 612 possible true pairwise relationships. See the text for further details. means that HMMER is only faster for databases of more than approximately 2000 sequences; for smaller databases SAM is the entire database, but only a restricted set of WU-BLAST faster. This difference, although signi®cant, will not affect the hits to the seed sequence (see 8 for details).] feasibility of large-scale experiments. E-values produced by PSI-BLAST appear to be more accurate than those of pro®le HMM methods (see Fig. 2).

COMPARISON TO OTHER METHODS In order to see the performance of HMMER and SAM in the DISCUSSION OF PREVIOUS ASSESSMENTS context of other sequence comparison methods, we re-ran There are two previous studies (4,10) that include some test 2 on PSI-BLAST (11), the most popular iterative pro®le comparison of the two pro®le HMM packages. Lindahl and method, and WU-BLAST (http://sapiens.wustl.edu/blast), one Elofsson (4) is a systematic benchmark, while Rehmsmeier of the best pairwise methods in Brenner et al. (17). Version and Vingron (10) use the two packages to demonstrate an 2.2.1 of PSI-BLAST was used, last updated in August 2001. improvement due to a novel approach. This version included the improvements described in Schaefer Lindahl and Elofsson (4) found SAM to be better than et al. (21). HMMER at the superfamily level and vice versa at the family PSI-BLAST was ®rst iterated on nrdb90 using each level. While these ®ndings are in agreement with our results at sequence in the test set as a seed, with a maximum of either the superfamily level, the authors used the global/global mode three or 30 iterations. The resulting models were saved and run for HMMER and a local/local one for SAM (A.Elofsson, against our test set in a single iteration. In addition, we used personal communication; see also our discussion earlier in the one iteration of the PSI-BLAST binary (blastpgp) directly on paper). It should be noted that the SCOP test strongly and our test set. This run is referred to as NCBI BLAST, since the arti®cially favours methods using the global alignment mode, underlying algorithm is a simple pairwise gapped BLAST. because each sequence in the test set is hand curated to As shown in Figure 3, pro®le methods perform considerably represent a single complete domain. Lindahl and Elofsson (4) better than pairwise methods (as represented by WU-BLAST also experienced dif®culties running SAM and were often and NCBI BLAST) and SAM-T99 is better than PSI-BLAST. unable to produce error rates of <5%. These factors lead us to The relative performance has remained approximately the believe that the true performance of the SAM package at both same since the work of Park et al. (1), which was carried out family and superfamily levels is the same or better than that on a smaller dataset, but the overall coverage has dropped: in suggested (4), while the true HMMER performance is almost our study, SAM T99 found 24% of the total number of certainly worse. possible true hits at the 1% error rate; in Park et al. (1) the Rehmsmeier and Vingron (10) found that HMMER per- corresponding ®gure for SAM-T98 was 34%. forms better than SAM. They tested the two methods under Although SAM T99 detects ~10% more true homologues substantially different conditions, which are less realistic for than PSI-BLAST, pro®le HMM methods in general and SAM database searching: they considered the numbers of false in particular are considerably slower (see Table 4): HMMER positives for each model before the ®rst true hit, regardless of model scoring is 11 times, and SAM model scoring is 37 times E-value, and examined cases with up to 100 false positives. slower than PSI-BLAST pro®le scoring. [For large databases Under these circumstances we do not think a comparison PSI-BLAST appears to be even faster, by up to an order of between the two sets of results should be attempted. magnitude, presumably due to indexing, which the pro®le In summary, we have reasons to believe that neither of the HMM methods do not use. On the other hand, the SAM T99 previous comparisons provides an accurate picture of the procedure iterated on a large database does not actually search relative performance of HMMER and SAM. 4328 Nucleic Acids Research, 2002, Vol. 30 No. 19

2. Krogh,A., Brown,M., Mian,S., Sjolander,K. and Haussler,D. (1994) CONCLUSIONS Hidden Markov models in computational biology. J. Mol. Biol., 235, 1501±1531. We have examined the performance of two pro®le HMM 3. Eddy,S.R. (1995) Hidden Markov models. Curr. Opin. Struct. Biol., 6, packages for detection of remote protein homologues, 361±365. HMMER and SAM. It is clear from our results that the most 4. Lindahl,E. and Elofsson,A. (2000) Identi®cation of related proteins on signi®cant distinction between the two packages is the SAM family, superfamily and fold level. J. Mol. Biol., 295, 613±625. T99 script. Not only do both packages require a multiple 5. Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) A new approach to protein fold recognition. Nature, 358, 86±89. sequence alignment from which to build a model, but their 6. Eddy,S.R. (1998) Pro®le hidden Markov models. Bioinformatics, 14, performance is directly related to the quality of that alignment. 755±763. The T99 script automatically produces high quality multiple 7. Hughey,R. and Krogh,A. (1996) Hidden Markov models for sequence alignments well suited for building HMMs, but so far there is analysis. Extension and analysis of the basic method. Comput. Appl. no equivalent in the HMMER package. Biosci., 12, 95±107. 8. Karplus,K., Barrett,C. and Hughey,R. (1999) Hidden Markov models for In order to compare the model-building and model-scoring detecting remote protein homologies. Bioinformatics, 14, 846±856. components of the two packages independently, we wrote a 9. Gough,J., Karplus,K., Hughey,R. and Chothia,C. (2001) Assignment of program to convert between the model formats used by the homology to genome sequences using a library of hidden Markov models two packages. We found that SAM consistently produces that represent all proteins of known structure. J. Mol. Biol., 313, better models than HMMER, i.e. the SAM model built from a 903±919. 10. Rehmsmeier,M. and Vingron,M. (2001) Phylogenetic information given alignment and converted to the HMMER format using improves homology detection. Proteins, 45, 360±371. our script performs better than the native HMMER model built 11. Altschul,S.F., Madden,T.L., Schaeffer,A.A., Zhang,J., Zhang,Z., from the same alignment. The relative performance of the Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a model-scoring components varies. new generation of protein database search programs. Nucleic Acids Res., 25, 3389±3402. The E-value scores produced by both methods have similar 12. Holm,L. and Sander,C. (1998) Removing near-neighbour redundancy reliability. If HMMER model calibration is included, then from large protein sequence collections. Bioinformatics, 14, 423±429. HMMER scoring is faster for searches of more than 2000 13. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: sequences, SAM being faster for smaller ones. This difference a structural classi®cation of proteins database for the investigation of does not affect the feasibility of large-scale studies. sequences and structures. J. Mol. Biol., 247, 536±540. 14. Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Comparing the performance of HMMs to other methods, the Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data relative results on a SCOP all-against-all test (our test 2) are Bank. Nucleic Acids Res., 28, 235±242. similar to those obtained by Park et al. (1), namely SAM T99 15. Bashford,D., Chothia,C. and Lesk,A.M. (1987) Determinants of a protein is somewhat better than PSI-BLAST and the pairwise methods fold; unique features of the globin amino acid sequences. J. Mol. Biol., perform poorly in comparison. We found PSI-BLAST scoring 196, 199±216. 16. Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and to be more than 10 times faster than HMMER scoring Sonnhammer,E.L. (2000) The Pfam protein families database. Nucleic (excluding model calibration). Acids Res., 28, 263±266. 17. Brenner,S.E., Chothia,C. and Hubbard,T. (1998) Assessing sequence comparison methods with reliable structurally identi®ed distant ACKNOWLEDGEMENTS evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073±6078. 18. Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium We would like to thank Cyrus Chothia for many helpful for sequence and structure analysis. Nucleic Acids Res., 28, 254±256. discussions, and Bernard de Bono, Rajkumar Sasidharan, 19. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: Cyrus Chothia and an anonymous referee for many helpful improving the sensitivity of progressive multiple sequence alignment comments on the manuscript. through sequence weighting, position-speci®c gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673±4680. 20. Wootton,J.C. and Federhen,S. (1993) Statistics of local complexity in REFERENCES amino acid sequences and sequence databases. Comput. Chem., 17, 149±163. 1. Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and 21. Schaefer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Chothia,C. (1998) Sequence comparisons using multiple sequences Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy detect three times as many remote homologues as pairwise methods. of PSI-BLAST protein database searches with composition-based J. Mol. Biol., 284, 1201±1210. statistics and other re®nements. Nucleic Acids Res., 29, 2994±3005.