
BMC Bioinformatics (BioMed Central)
Methodology article

Identification and characterization of subfamily-specific signatures in a large protein superfamily by a hidden Markov model approach

Kevin Truong and Mitsuhiko Ikura*

Address: Division of Molecular and Structural Biology, Ontario Cancer Institute and Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
E-mail: Kevin Truong - [email protected]; Mitsuhiko Ikura* - [email protected]
*Corresponding author

Published: 10 January 2002
Received: 17 August 2001
Accepted: 10 January 2002

BMC Bioinformatics 2002, 3:1
This article is available from: http://www.biomedcentral.com/1471-2105/3/1

© 2002 Truong and Ikura; licensee BioMed Central Ltd. Verbatim copying and redistribution of this article are permitted in any medium for any non-commercial purpose, provided this notice is preserved along with the article's original URL. For commercial use, contact [email protected]

Abstract

Background: Most profile and motif databases strive to classify protein sequences into a broad spectrum of protein families. The next step of such database studies should include the development of classification systems capable of distinguishing between subfamilies within a structurally and functionally diverse superfamily. This would be helpful in elucidating sequence-structure-function relationships of proteins.

Results: Here, we present a method to diagnose sequences into subfamilies by employing hidden Markov models (HMMs) to find windows of residues that are distinct among subfamilies (called signatures). The method starts with a multiple sequence alignment (MSA) of the subfamily. Then, we build an HMM database representing all sliding windows of the MSA of a fixed size. Finally, we construct an HMM histogram of the matches of each sliding window in the entire superfamily.
To illustrate the efficacy of the method, we have applied the analysis to find subfamily signatures in two well-studied superfamilies: the cadherin and the EF-hand protein superfamilies. As a corollary, the HMM histograms of the analyzed subfamilies revealed information about their Ca2+ binding sites and loops.

Conclusions: The method is used to create HMM databases to diagnose subfamilies of protein superfamilies that complement broad profile and motif databases such as BLOCKS, PROSITE, Pfam, SMART, PRINTS and InterPro.

Background

The biological function of a protein can often be inferred from its similarity to sequences of known function in sequence databases using single-sequence similarity algorithms such as BLAST [1] or FASTA [2]. Such algorithms are suitable for determining highly similar sequences, but are not sensitive enough to capture highly divergent sequences. Therefore, many members of an evolutionarily diverse family of proteins may be overlooked. Within the last decade, the sensitivity of sequence searching techniques has been improved by profile- or motif-based analysis, which uses information derived from MSAs to construct and search for sequence patterns [3–6]. Unlike single-sequence similarity, a profile or motif can exploit additional information, such as the position and identity of residues that are conserved throughout the family, as well as variable insertion and deletion probabilities.

Figure 1: Flow diagram of the method. First, filter a primary database using a profile or motif database for a subset of sequences that will comprise the protein superfamily database. Then, partition the protein superfamily database into subfamilies depending on the criterion for a subfamily. Then, build an MSA for each subfamily and build HMMs of all windows of width w of the MSA. Finally, tabulate matches with an e-value under 100 to identify subfamily signatures for the HMM database of the superfamily and tabulate matches with an e-value under 0.1 to identify potentially significant functional regions in the subfamily.

Currently, the most widely used profile and motif databases are: BLOCKS [4], which stores ungapped MSAs corresponding to the most conserved regions of protein families; PROSITE [3], which uses single consensus patterns and profiles to characterize each family of sequences; Pfam [7] and SMART [8], which use profile hidden Markov models (HMMs) to find commonly occurring protein domains; and PRINTS-S [9], which is a database similar to both PROSITE and BLOCKS, except that it uses "fingerprints" composed of more than one pattern to characterize a protein family. Recently, a new profile and motif database, InterPro [10,11], consisting of an amalgamation of PROSITE, ProDom [12], Pfam and the PRINTS fingerprint database, was used in the automatic annotation of complete proteomes including fly [13] and human [14]. Most of these databases strive to classify protein sequences into broad families, with the exception of the PRINTS-S fingerprint database, which has both family- and subfamily-specific fingerprints [9]. The ability to classify query proteins into subfamilies within superfamilies is useful in providing more specific functional annotations. Therefore, we propose a method based on HMMs to find windows of residues that are distinct in protein subfamilies. Although HMMs are expensive, both in terms of memory and computation time, they provide a solid statistical foundation for the modeling of information in an MSA. Our method works by constructing an HMM database representing a sliding window of residues for the MSA of each subfamily and then comparing the HMM database across an entire sequence database of the protein superfamily (Fig. 1). To demonstrate the utility of our approach, we have applied it to two well-studied protein superfamilies: the cadherin superfamily [15] and the EF-hand superfamily [16].

Results and discussion

Subfamily partitioning

The purpose of subfamily partitioning is to create an MSA of each subfamily; however, if quality MSAs of subfamilies already exist, it is possible to commence with the analysis at that point, as is done for PRINTS-S [9]. This section outlines a simple procedure for partitioning; however, other methods exist which may be preferable [17–21]. Many methods, like the one described herein, use a tree clustering approach based on sequence distance or identity.

The members of a protein family can be identified by collecting the matching sequences to profile or motif databases such as the ones described in the Background. [...] researcher who defines it. Subfamilies can be partitioned based on sequence or function, and while function-based methods are valid, sequence-based methods can be automated.

To divide the sequences into subfamilies, construct a square similarity matrix, S, of dimensions n_T by n_T. S_{i,j} is the percent similarity between sequence i and sequence j. The alignment between a pair of sequences is determined in CLUSTALW by performing a global alignment [22] with an opening gap penalty of 10, an extension gap penalty of 0.1 and a Gonnet scoring matrix [23,24]. The percent similarity is estimated by dividing the alignment score of the pair by the larger of the two scores obtained when each sequence is aligned to itself:

$$S_{i,j} = \frac{\mathrm{alignment\_score}(i,j)}{\max\bigl(\mathrm{alignment\_score}(i,i),\ \mathrm{alignment\_score}(j,j)\bigr)}$$

The similarity matrix is used to build a tree by the UPGMA (unweighted pair group method using arithmetic averages) clustering algorithm [25] for the purpose of partitioning sequences based on sequence similarity. At this point, Sjolander [20] pointed out that any partition of the tree may be meaningful. Indeed, there is no partitioning criterion that is impartially better than another. In the end, the biologist must decide the most appropriate partitioning criterion from their perspective given their experience with the protein superfamily. Therefore, the introduction of complementary methods may be important for consistent and reproducible analysis.

Our aim is to achieve a high-quality MSA of each subfamily. A benchmark of the quality of an MSA is how well it reflects the structural alignment. Comparative homology modeling allows us to predict the three-dimensional structure of a target protein based on its alignment to one or more proteins with a known template structure [26]. It has been observed that as the sequence identity between the target sequence and the template increases, the average structural similarity between the template and the target also increases, and for closely related protein sequences with identity over 40%, the alignment is almost always correct [27]. Therefore, if a similarity threshold greater than 40% is used for partitioning, the resulting MSAs should be of reasonably high quality and well correlated with the structure. Since Dayhoff used a 60% identity for the threshold for a subfamily [17], we adopt a 60% uni-

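The windowing step of the method ("create all possible w-sized windows of the alignment" in Fig. 1) can be sketched in the same hedged spirit. The function name `alignment_windows` and the toy alignment are hypothetical, and the sketch stops before HMM construction, which would be delegated to a profile-HMM package applied to each window.

```python
def alignment_windows(msa, w):
    """Return every window of w consecutive columns of an MSA.

    `msa` is a list of equal-length aligned sequences (gaps as '-').
    An alignment of length L yields L - w + 1 windows, each a list of
    row slices that could be passed to an HMM-building tool.
    """
    if not msa:
        raise ValueError("empty alignment")
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned rows must have the same length")
    if not 1 <= w <= length:
        raise ValueError("window width must be between 1 and alignment length")
    return [[row[start:start + w] for row in msa]
            for start in range(length - w + 1)]


# Invented toy alignment of three sequences, six columns wide.
toy_msa = ["ACDE-G", "ACDEFG", "AC-EFG"]
windows = alignment_windows(toy_msa, w=4)
```

Here a six-column alignment with w = 4 gives three overlapping windows; in the method, each such window would become one HMM in the subfamily's HMM database.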