Developing and using special purpose databases

Tutorial Session, ISMB 2005. Martin Gollery, Center for Bioinformatics, University of Nevada, Reno

Intro to HMMs

In the broadest sense, a hidden Markov model (HMM) may be seen as a probabilistic approach to signal recognition. Long before they were used in sequence analysis, HMMs were being utilized in speech recognition. The goal there was to identify words in an audio signal, picking out the correct word even when the tempo, tone and pronunciation varied considerably. These variations are quite similar to the variations in a biological sequence family! The experience with speech recognition demonstrated that HMMs can excel at identifying relationships in a weak signal. Overall, HMMs have been shown to be useful in areas where other systems fail to detect anything, yet they may not be the best solution in areas of high similarity.

But what is an HMM?

An HMM is a Markov chain that provides a probabilistic model for a sequence family, where the specific states are hidden. A Markov chain is simply a collection of states, with a certain transition probability between each pair of states. Any sequence can be traced through the model from one state to the next through these transitions. The model is referred to as 'hidden' because you do not know which series of states produced a given sequence. An HMM that models a protein will have one match state for every position in the sequence, plus insert and delete states; each match state has over twenty possible outcomes, one for each amino acid. A sequence may be scored by tracing it through the model, passing from one state to the next and accumulating the probabilities for each transition. Since each transition has a finite probability, the chain is more likely to emit sequences with certain outcomes at each position. HMMs may therefore be seen as position-specific profiles that represent a specific sequence family.

It is important to note that the transition probabilities depend only on the current state, so each step is independent of all earlier and later states. Higher-order Markov models are possible, but are too computationally intensive to be useful for biological problems at this time.
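To make the idea of position-specific scoring concrete, here is a minimal sketch in Python of a toy three-position model over a nucleotide alphabet. Real profile HMMs add insert and delete states and transition probabilities; the emission numbers here are invented purely for illustration.

```python
import math

# A toy profile: one match state per position, each with emission
# probabilities over the alphabet. Real Plan 7 HMMs also include
# insert/delete states and transitions; this shows only the core
# idea of position-specific scoring.
model = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},  # position 1: highly conserved A
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # position 2: unconserved
    {"A": 0.02, "C": 0.03, "G": 0.05, "T": 0.90},  # position 3: highly conserved T
]

def log_odds(seq, model, background=0.25):
    """Sum of per-position log-odds scores: emission vs. background."""
    return sum(math.log2(model[i][res] / background)
               for i, res in enumerate(seq))

good = log_odds("AAT", model)   # matches the conserved positions: high score
bad  = log_odds("GAC", model)   # violates the conserved positions: low score
```

A sequence that respects the conserved positions scores well above the background, while one that violates them scores far below it, even though both are "25% identical" to the consensus at the unconserved position.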

A group of aligned sequences from a particular family can tell us much more than a single representative sequence. For example, some of the positions in that family will be highly conserved, while others show little or no conservation whatsoever. This shows that the conserved positions were important enough through evolutionary time that any change at those locations disrupted the function of the protein. Positions that demonstrate a high level of substitution or gapping may be considered less important functionally. The reason behind this usually becomes clear if the structures are modeled, but the crucial point for most purposes is not why the positions are conserved, but simply to what levels, and what substitutions are commonly seen at each position.

Scoring in the standard pairwise algorithms may be seen as rather generic in comparison to a hidden Markov model. The PAM or BLOSUM scoring matrices assign a fixed score to a particular substitution regardless of its location in the protein. That substitution, however, could have a considerable effect on the function of the protein in some cases, and no effect at all in others. The commonly quoted 'percent identities' of pairwise alignments can be quite deceiving: a match with a relatively high percent identity may still be missing a crucial motif that is a characteristic signature of the protein family.

The Plan 7 model

The Plan 7 model is the architecture used by the popular HMMer (pronounced 'hammer', as in 'more precise than a BLAST') program suite. The boxes in this schematic represent match states, the diamonds along the top represent insert states, and the lower circles represent delete states. The arrows represent transitions between these states, and each transition is assigned a probability. The J state in the bottom diamond allows the model to be reset, which can be very useful for multi-domain proteins.

Earlier versions of the HMMer suite were called 'Plan 9' in the code, and allowed transitions between insert and delete states. Since such transitions should not really occur in a protein alignment, the code was simplified in later versions of the software.

Calculating Transition Probabilities

When a protein family is aligned, there are regions that cannot be aligned well, and many insertions will be seen in the MSA. There are many reasons why this may occur: for example, the proteins may not have identical structure in the loops, or the length of the loops may vary considerably from one protein to the next. In this situation the areas of conserved secondary structure align well, but the loop regions do not.

The scoring of the well-aligned areas will reflect relatively high scores for the match states, but very poor scores for the insert or delete states. On the other hand, the regions of poor alignment will show higher scores for the insert states, and that probability will increase with the number of sequences with inserts at that position. The longer the poorly-aligned region is, the higher the transition probability from an insert state to itself. Similarly, the transition to the delete state will have a higher probability as the number of sequences with no residues in that position increases.

When assigning the transition probabilities, it is important to account for background rates, that is, the possibility of transitions that are not observed in the training alignment. This leads us to the use of prior information. Prior information about protein sequences in general can help us to construct a model that recognizes remote homologs more accurately than a model built only from the training set. For example, insertions and deletions are possible anywhere in a sequence, even if we have not previously observed one at that location, so no transition should ever be set to a zero probability. Using prior knowledge of insertion and deletion rates from a large number of sequences can help us to set the background rates. These rates are then combined with information from the alignment; the weighting between background and alignment information depends upon the significance of the observations, which is based on the number of sequences in the alignment.

State Probabilities

There is some probability, however small, that any amino acid may be found at any location in a protein. If a model is to be capable of matching new and distant members of the family, it will have to take this into account. Therefore, the model must allow for the possibility of amino acids appearing in positions where none are found in the training alignment. These probabilities may be derived from standard matrices such as the BLOSUM series, which give the likelihood of one amino acid being substituted for another. In general, these probabilities are dependent upon properties such as relative size, hydrophobicity, and charge. The substitution matrix is typically used to generate pseudocounts based on the amino acids observed at each location in the alignment.
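A sketch of substitution-matrix pseudocounts follows, using a toy three-letter alphabet standing in for the twenty amino acids. The substitution table values are invented for illustration, not real BLOSUM-derived probabilities.

```python
def emission_probs(column_counts, subst, pseudo_weight=2.0):
    """Emission probabilities for one match state, with pseudocounts
    derived from a substitution probability table.

    column_counts: observed residue counts in one alignment column
    subst: subst[a][b] = probability of residue b given residue a
           (a toy stand-in for values derived from e.g. BLOSUM62)
    """
    total = sum(column_counts.values())
    probs = {}
    for b in subst:
        # Pseudocount for b: expected substitutions into b from the
        # residues actually observed in the column.
        pseudo = sum(column_counts[a] / total * subst[a][b]
                     for a in column_counts)
        probs[b] = ((column_counts.get(b, 0) + pseudo_weight * pseudo)
                    / (total + pseudo_weight))
    return probs

# Invented substitution probabilities: L and I interchange readily
# (both hydrophobic), K does not.
subst = {
    "L": {"L": 0.70, "I": 0.25, "K": 0.05},
    "I": {"L": 0.25, "I": 0.70, "K": 0.05},
    "K": {"L": 0.05, "I": 0.05, "K": 0.90},
}
column = {"L": 9, "I": 1}   # residues observed at one alignment position
probs = emission_probs(column, subst)
# 'K' was never observed, yet still receives a small nonzero probability.
```

The key property is that unobserved residues get probability in proportion to how readily the observed residues substitute for them, rather than a flat floor.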

Building a better model

When building a model, it is best to start with a large group of related sequences that are as diverse as possible. When these sequences are aligned, there will typically be some groups of closely related sequences that are more distantly related to other groups. If we simply give equal weight to each sequence when assigning probabilities to the model, then the largest subgroup will have the largest representation, and the resulting model will not properly represent the true diversity of the family. Weighting the sequences captures the information in the subgroups while retaining the valuable information found in the differences between closely related members of each subgroup.
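One concrete weighting scheme, the position-based method of Henikoff and Henikoff (1994), can be sketched as follows; the four-sequence alignment is invented for illustration.

```python
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff & Henikoff, 1994).

    Each column contributes 1/(r*s) to a sequence's weight, where r
    is the number of distinct residues in the column and s is how
    many sequences share that sequence's residue. Members of an
    overrepresented subgroup split their share; divergent sequences
    get more weight.
    """
    n = len(alignment)
    weights = [0.0] * n
    for col in zip(*alignment):          # iterate over columns
        counts = Counter(col)
        r = len(counts)
        for i, res in enumerate(col):
            weights[i] += 1.0 / (r * counts[res])
    total = sum(weights)
    return [w / total for w in weights]  # normalize to sum to 1

aln = ["ACDE", "ACDE", "ACDE", "GCDF"]   # redundant subgroup + one outlier
w = henikoff_weights(aln)
# The divergent fourth sequence receives more weight than any single
# member of the redundant subgroup.
```

When these weights multiply the counts fed into the transition and emission estimates, the three near-identical sequences no longer dominate the model.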

The scheme that is used by most systems is the maximum entropy model. This involves making the statistical spread of the model as broad as possible. This creates a more general model, and the model will then be best able to find remote members of the family.

Scoring of search results

Scoring the comparisons of these models may be done at either the global or the local level.

Global

If the model and the sequence are of similar length, then global alignments may be applied. Global scoring forces the beginning of the model to match the beginning of the sequence, and the transitions along the path generate the sequence until the end of the model reaches the end of the sequence. This method is of limited use.

Semi-Global

Biological Applications of Hidden Markov Models

We will be discussing the use of HMMs for remote protein detection. Searches may be performed by comparing an HMM to a database of sequences (hmmsearch), or by comparing a sequence to a database of HMMs (hmmpfam). The latter case is what we will be emphasizing in this tutorial.

It is important to note, however, that the biological applications of HMMs go far beyond remote protein detection. They can also be used for:
- detection of CpG islands
- signaling region identification
- prediction of transmembrane helices
- gene prediction
- nucleic acid homology

HMMs and other profile methods, such as PSI-BLAST and meta-MEME, are methods of detecting relationships between a sequence and a sequence family. Since they utilize all the information in that family, they are much better at identifying remote relationships than the pairwise methods that simply compare one sequence to another. Pairwise methods, however, can be much less computationally expensive, particularly in the case of BLAST, and are extremely useful when the relationships are more obvious.

Available HMM programs

There are several programs available that use Hidden Markov Models or related profile methods. The HMMer package (.wustl.edu) is perhaps the most commonly used, and the one we will focus on during this tutorial. The source code and documentation are available, along with executables for a wide range of operating systems.

SledgeHMMer (sledgehmmer.sdsc.edu/) is an optimized version of HMMer. The search is spread over several nodes through the use of the Unix file-locking function. The speedup is around an order of magnitude over the standard HMMer code (which is itself much faster than it used to be); I have seen benchmark numbers that show SledgeHMMer to be ~12 times faster than the standard code.

The Wise tools (www.ebi.ac.uk/Wise2/) from the EBI add some functionality to the searches, such as frameshift tolerance, but the format of the models is still HMMer-style.

The Sequence Alignment and Modeling system (SAM) (www.cse.ucsc.edu/research/compbio/sam) has a completely separate format from HMMer. Many people prefer SAM to HMMer, and it is said to be more sensitive. In addition, SAM has the T02 program, an extremely powerful and useful script for finding remote homologues. It would be nice if one of you would port this script to HMMer! A conversion script to convert from SAM format to HMMer format is available at the SAM website.

MEME finds motifs in groups of related DNA or protein sequences. These motifs are gapless: MEME will simply split a motif in two rather than add gaps. Meta-MEME (metameme.sdsc.edu/) takes these motifs and the related sequences and builds them into models. The area between motifs is modeled imprecisely, providing a reduction in parameter space; therefore, Meta-MEME can provide accurate models with a smaller number of training sequences than other systems.

SPSpfam (www.spsoft.com), a commercial program produced by Southwest Parallel Software, has been shown to provide a speedup of 3-60 times over the HMMer package. Shorter models provide a greater relative speedup than longer models. I have not yet tested the latest version at the time of this writing, but will have some benchmarks before the tutorial.

LDHMMer is another commercial package, this one from Logical Depth (www.logicaldepth.com). It includes both LDhmmsearch and LDhmmpfam, rewritten to provide a speedup of many times over the open-source code. I was not allowed to test this one, but if you have a cluster, you might want to do your own comparison of LDhmmer, SPSpfam and SledgeHMMer to see which one works best for you. All three are optimizations of the original HMMer code.

DeCypherHMM is a hardware/software combination from TimeLogic (www.timelogic.com). This is also based on the HMMer package; some non-accelerated components, such as HMMcalibrate, are simply ports to the operating system that the system is running: Windows, Solaris or Linux. The accelerated code, however, runs in custom FPGA hardware for a speedup of hundreds to thousands of times. I have a two-board DeCypher system that I have compared to an SGI Onyx 3800. On the test that I ran, the accelerated system was equivalent to about 1800 CPUs on the Onyx, if one assumes linear speedup. (Of course, there are faster CPUs out there now, but then, upgrading the DeCypher is a simple matter as well.) The software also provides extra functionality over the open-source code, and is very easy to pipeline.

What exactly do you want to do?

In this tutorial we will be focusing primarily on the HMMer system from Sean Eddy at Washington University. This is simply because it is the most widely used and most databases support this format.

Within the HMMer package, there are several programs for specific purposes. If you have a single model and wish to find that family or domain in a large flatfile database of sequences, use hmmsearch. On the other hand, if you have a group of sequences that you would like to run against a database of HMMs, use hmmpfam. This is perhaps the most commonly used program in the HMMer package, and the one that takes the most time. If you frequently need to analyze thousands of sequences with thousands of models, you should consider spreading the problem over a cluster using optimized software such as SledgeHMMer, LDhmmer, or SPSpfam. If this is still insufficient, or if your cluster is already oversubscribed, then an accelerator is called for.

Figure 1. HMMer programs

HMM databases

These are the HMM databases that we will be discussing during the course of this tutorial. The judicious use of the appropriate database can turn your sequences from a seemingly random collection of letters into a knowledge base that can greatly assist your discovery efforts. These are simply flatfile collections of Hidden Markov Models in ASCII format, and may be easily edited, added to, or modified, although in practice they are most commonly used 'as-is'.

The Conserved Domain Database (CDD) is included here even though it contains PSSMs (Position-Specific Scoring Matrices), not HMMs, because of the close conceptual relationship between the two types, and because the CDD is built from other popular HMM databases.

PFAM
TIGRFAM
Superfamily
SMART
PANTHER
PRED-GPCR
CDD (the Conserved Domain Database)
COG
The NVFAM series
KinFam
HydroHMMer

PFAM

PFAM is a joint project of the Sanger Centre, Washington University, the Karolinska Institute, and INRA, although the main location is the Sanger Centre. Version 17 has 7868 families, domains and motifs. In addition, there are two versions of PFAM, built with two different methodologies: PFAM-A is hand-curated from custom multiple alignments, while PFAM-B is generated automatically from ProDom, which is built from the SP/TrEMBL database using the Domainer algorithm.

Approximately 75% of sequences have a match to PFAM-A. Another 19% have a match to PFAM-B.

Furthermore, there are two versions of each of these two databases. PFAM-ls is designed for global alignments, and PFAM-fs is optimized for local alignments, so that matches may include only part of the model. It is important to note that both the -fs and the -ls versions are local with respect to the sequence.

PFAM benefits from the fact that it is extremely widely used. People with expert knowledge of various protein domains will offer their opinions on the appropriate PFAM records, making the collection more useful for all of us.

PFAM models represent Families, Domains, Repeats and Motifs. The type of model may be found in the annotation on the TP line.

Closely related to PFAM is iPFAM, which describes domain-domain interactions that are observed in PDB entries. These domains are defined in PFAM entries.

Superfamily

The Superfamily database is maintained by Martin Madera, formerly at the MRC Laboratory of Molecular Biology and now at RIKEN's Genome Sciences Center. A webserver providing access to the SuperFamily database is at www.supfam.org. The models may also be downloaded for non-commercial use.

Superfamily is built from the Structural Classification of Proteins (SCOP) database, which is in turn built from the Protein Data Bank (PDB). There are 1447 SCOP superfamilies, each represented by a group of models, for a total of over 8500 models. In addition, the authors provide a useful spreadsheet that compares the SuperFamily designations to PFAM, InterProScan and Gene Ontology.

The authors of Superfamily have run a wide range of genome data against it, and present the results sorted by complete genomes, strains and others. This table may be sorted by any one of a number of different parameters, depending on interest.

One of the interesting aspects of the SuperFamily models is that they are available in SAM, HMMer and RPS-BLAST formats, so that users may choose their favorite application.

SMART

The Simple Modular Architecture Research Tool (SMART) provides a database of models developed with an emphasis on mobile eukaryotic domains.

SMART is available at smart.embl-heidelberg.de, and is annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Use SMART for signaling domains or extracellular domains. Note that SMART may be used in either Normal or Genomic mode. The difference between these two is the underlying databases. The Normal mode uses Swiss-Prot and TrEMBL data along with stable Ensembl proteomes. Genomic mode, on the other hand, utilizes only data from complete proteomes.

SMART can now be queried using Gene Ontology terms.

The authors of SMART have used it as a tool for discovery, and some of the output is available for download. For example, the entire Yeast genome has been checked for signaling domains, and the output is retrievable.

Version 3.4 contains 654 HMMs.

Figure 2. SMART representation of human enteropeptidase precursor

Figure 3. SMART domain assignments

PANTHER

The PANTHER (Protein ANalysis THrough Evolutionary Relationships) database is a newly available resource from Applied Biosystems. PANTHER categorizes proteins into families and subfamilies: families are evolutionarily related proteins, while subfamilies are related proteins with the same function. PANTHER is also annotated as to molecular function, biological process and pathway. These are defined as follows:

Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase.

Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, such as mitosis.

Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.

The total size of the PANTHER database is impressive: 6683 protein families include 31,705 functionally distinct protein subfamilies. Because of this size, searches can be slow, so the PANTHER administrators have developed a clever method of speeding them up. First, sequences are BLASTed using relaxed criteria against consensus sequences that represent the models. Then, searches are made only against the models represented by those hits. This method drastically reduces the total search time.
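The two-stage idea can be sketched abstractly. Both scoring callables below are hypothetical toys: the cheap pass stands in for the relaxed BLAST against consensus sequences, and the expensive pass stands in for the full HMM search.

```python
def prefiltered_search(query, models, quick_score, full_score, keep=2):
    """Two-stage search: a cheap scoring pass selects candidate
    models, and only those candidates are scored with the expensive
    method. quick_score and full_score are placeholder callables,
    not PANTHER's actual tools."""
    # Stage 1: rank every model with the cheap score; keep the best few.
    candidates = sorted(models, key=lambda m: quick_score(query, m),
                        reverse=True)[:keep]
    # Stage 2: run the expensive scorer only on the survivors.
    return max(candidates, key=lambda m: full_score(query, m))

# Toy demonstration: "models" are consensus strings, the cheap score
# counts shared characters, the "expensive" score counts positional
# matches.
models = ["ABXX", "ABCX", "ZZZZ", "ABCD"]
cheap = lambda q, m: len(set(q) & set(m))
exact = lambda q, m: sum(a == b for a, b in zip(q, m))
best = prefiltered_search("ABCD", models, cheap, exact)
```

The total cost drops from (expensive × all models) to (cheap × all models) + (expensive × a handful), which is exactly why the PANTHER prefilter pays off on a database of tens of thousands of models.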

Of course, total database size should not be considered a measure of overall quality or usefulness. In order to test whether PANTHER could be used to annotate proteins that were otherwise difficult to match, 3451 protein sequences from Arabidopsis thaliana were tested against the PANTHER database. These sequences had no matches to PFAM, SuperFamily, SMART or TIGRfam. 160 of these sequences had significant matches to PANTHER, demonstrating that PANTHER is a useful tool for discovering the function of heretofore ‘unknown’ genes.

TIGRFAM

The Institute for Genomic Research (TIGR) has produced the TIGRfam database of Hidden Markov Models. These models are organized by functional role. TIGRfams are based on the concept of equivalogs, a set of homologous proteins that are conserved with respect to function since their last common ancestor. Similarly, equivalog domains are domains of conserved function.

Version 4.1 of TIGRfams contains 2453 models. These models are designed to be complementary to PFAM, so run them together. TIGRfams are part of the Comprehensive Microbial Resource (CMR).

Figure 4. TIGRfam and PFAM alignments for Pyruvate carboxylase. The thin line represents the sequence. The bars represent hit regions.

PRED-GPCR

The PRED-GPCR system, from Papasaikas et al. at the University of Athens, is designed to predict G-protein coupled receptors. The classification scheme is based upon the TiPS pharmacological classification; as a result, 67 GPCR families are represented by 265 HMMs. These signatures are regularly updated and are available on the website at http://bioinformatics.biol.uoa.gr/PRED-GPCR/. The search includes an optional low-complexity pre-processing step using the CAST algorithm. The entire system is regenerated annually.

Figure 5. The PRED-GPCR Webserver system

COGhmm

The Clusters of Orthologous Groups of proteins (COG) project at the NCBI is a phylogenetic classification of proteins encoded in complete genomes. The COGs were determined by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages, and thus is believed to represent an ancient conserved domain.

The COGs database builds 138,458 proteins into 4873 clusters. This data is available from the NCBI at http://www.ncbi.nlm.nih.gov/COG in sequence format, and may be found in RPS-BLAST format at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. I have built this data into a Hidden Markov Model database, which may be obtained by sending me an email at [email protected].

KOGs and TWOGs

Two more recent projects along these lines are the KOGs and the TWOGs. KOGs are built from purely eukaryotic data, and involve a number of additional steps beyond the standard COG process. While it may be correct to consider the KOG database a eukaryotic COG, careful masking of repetitive domains must be carried out, or overclustering will occur. The TWOGs, on the other hand, are clusters from only two lineages, and may be considered potential COGs.

CDD

The Conserved Domain Database (CDD) from the NCBI is not an HMM database, but rather a database of Position-Specific Scoring Matrices, or PSSMs. These are closely related to HMMs, but are searched either with Impala or, more commonly, with RPS-BLAST. This is a very fast way to analyze signature databases. The CDD curators take data from PFAM, SMART and COG, eliminate the redundancies, and add some models of their own, to come up with nearly 11000 models (as of version 2.03).

The CDD database is searched as a default when running protein BLAST at the NCBI website. Links are provided between the CDD and 3D data where possible.

NCBI runs CDD searches on their data on a regular basis. The pre-computed results are then available very quickly when a search is made. A related tool called CDART is used to search for proteins with similar domain architectures.

Since the CDD is based on PSSMs, not HMMs, one might wonder whether its results are somewhat lower in quality. In particular, the RPS-BLAST program, being word-hit based, might be less sensitive than hmmpfam. I would expect this to be the case, particularly on shorter models, due to the heuristic nature of the RPS-BLAST search, but I have never seen a study investigating it. The obvious choice of database for such a test would be SuperFamily, since it is available in PSSM, HMMer and SAM formats.

KinFam

Various classes of Kinase are represented by the KinFam database models. Searches against the KinFam database will assign the candidate sequence to the class and group of kinase based upon the Hanks classification scheme.

RANK  SCORE   QF  TARGET|ACCESSION    E_VALUE   DESCRIPTION
1     852.93  1   KinFam||ptkgrp15    9.3e-256  Fibroblast GF recep
2     479.14  1   KinFam||ptkgrp14    3.1e-143  Platelet derived GF
3     423.33  1   KinFam||ptkother    1.9e-126  Other membrane-span

Figure 6. Sample output from the KinFam database

HydroHMMer

The HydroHMMer database is a recently developed special purpose system for finding Late Embryogenesis Abundant (LEA) proteins and some other classes of Hydrophilins. Since LEA proteins are frequently hard to categorize, having a relatively fast method for finding new LEAs might be very useful. HydroHMMer is a very small database, so searches are quite fast, and yet it has more detailed information on LEAs than other databases.

The NVFAM series

Hidden Markov Models are usually designed with a broad range of training data in order to maximize the ability to find remote homologues: the more general the model, the wider the range of data in which it can find matches. The NVFAM data sets are designed to turn this idea around completely. The idea behind the NVFAM system is that since HMMs reflect their training data, a training set specific to one type of organism provides better results when the model is used on that same type of organism. So, this series uses Archaeal data as a training set to build models that will be used to study new archaeons, fungal data to study fungi, and so on.

Not all proteins are well represented in all types of organisms, although the rate of data generation is now so high that this is changing rapidly. Some datasets may have only one or a very few representatives of a particular protein family or domain. Rather than lower the stringency of the gathering step and allow more questionable data into the training set, it was decided that such data would simply be dropped from the NVFAM series. This is because the NVFAMs (like the TIGRfams) are designed to be used in conjunction with PFAM, not instead of it.

An earlier version of the NVFAM prokaryote database was used to study a bacterial proteome that was newly released, and was not included in the training dataset. NVFAM showed higher scores and better alignments than PFAM for this data, but PFAM had more total hits, demonstrating the need to use PFAM along with NVFAM, instead of running NVFAM in stand-alone mode. P. falciparum, a eukaryote, was used as a negative control for the prokaryotic database. In this case, PFAM showed better scores and alignments, as predicted.

The NVFAM series is available at bioinformatics.unr.edu.

INTERPRO

InterPro (www.ebi.ac.uk/) is a combination of a number of signature databases. Curated at the EBI, it is built from PFAM, PRINTS, PROSITE, ProDom, SMART, SuperFamily, TIGRFAMs, PANTHER, PIRSF, Gene3D and SP/TrEMBL. This is an astonishing tool that is free to download. Version 10.0 has nearly 12000 entries.

While InterPro is a useful database, the website only allows the analysis of a single sequence at a time. To analyze large amounts of data (say, an entire proteome) one needs to install InterProScan, the set of programs and scripts needed to compare your sequences against InterPro on your own server. InterProScan is essentially a free analysis pipeline that is flexible, extensible and very powerful. Large analysis jobs are split up, spread over a cluster, and reassembled at the end; SGE, PBS and LSF are supported. Output can be generated in text, tab-delimited, XML or HTML format. A web interface is provided, and GO mappings can be automatically assigned.

The heavy use of HMM databases tends to make searches very slow. Analyzing a genome of 30,000 genes can take two months of CPU time! Changing out the HMMpfam executable for SledgeHMMer or one of the other optimized versions might drop this down to a few days, and would most likely be much cheaper than expanding the server farm. Moving the HMMpfam search to an accelerator could reduce the time down to a few hours, which would more than compensate for the time to write the scripts.

While InterPro already incorporates an impressive array of HMM databases, you may wish to add your own models, or even entire collections of models, to the mix. There are at least two ways to do this. Since InterProScan is written in Perl, it is a relatively simple matter to edit the scripts to include your database in the search. Unless you can provide links from your database to IPR numbers, however, you may not get the complete output that you expect.

A simpler method of including your data in InterProScan is to simply concatenate it into one of the existing databases, such as PFAM. This is especially useful if you only have a few models to add.

An alternate scenario is also possible. If you are searching with the intent to find only a certain subset of the data, not the full selection, then why wait while InterProScan searches the entire database? Either edit the script so that not all databases are searched, or edit the databases themselves to remove the models that are not of interest. The resulting search will be faster in proportion to the amount of reduction.
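Because the model databases are plain-text flatfiles, trimming them is straightforward. Here is a sketch that assumes the HMMer2-style layout: each model is a block of lines containing a 'NAME' line and terminated by a '//' line. Verify the layout against your own files before relying on it.

```python
def filter_hmm_flatfile(in_path, out_path, keep_names):
    """Copy only the models whose NAME field is in keep_names.

    Assumes HMMer2-style plain text: one model per record, each with
    a 'NAME  <id>' line, each record ending with a '//' line.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        record = []
        for line in fin:
            record.append(line)
            if line.strip() == "//":                 # end of one model
                fields = [l.split()[1] for l in record
                          if l.startswith("NAME") and len(l.split()) > 1]
                if fields and fields[0] in keep_names:
                    fout.writelines(record)
                record = []

# Example (hypothetical file names):
# filter_hmm_flatfile("Pfam", "Pfam.subset", {"Globin", "SH2"})
```

Remember to run the trimmed database through the same indexing steps (e.g. hmmindex) that the original received, if your pipeline uses them.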

InterPro Content

Release 9.0 contains 11605 entries, representing 2982 domains, 8373 families, 222 repeats, 27 active sites, 21 binding sites and 20 post-translational modification sites. Overall, there are 6,781,584 InterPro hits from 1,387,363 UniProt protein sequences. A complete list is available from the ftp site.

DATABASE             VERSION  ENTRIES
SWISS-PROT           46.0     170140
PRINTS               37.0     1850
TrEMBL               29.0     1614107
Pfam                 15.0     7426
PROSITE patterns     18.36    1748
PROSITE preprofiles  N/A      125
ProDom               2004.1   1522
InterPro             9.0      11605
Smart                4.0      663
TIGRFAMs             4.0      2251
PIR SuperFamily      2.52     962
PANTHER              5.0      438
Superfamily          1.65     1160
Gene3D               3.0      62
GO Classification    N/A      18486

Figure 7. InterPro Entries

Figure 8. InterPro Detailed View. Domain architecture is at the top. The graphical match view represents the protein as a number of lines for each signature hit. The different databases are represented by different colored bars. Structural domains from SCOP and CATH are shown below as white-striped bars.

Building Your Own HMM Database

Before building your own database, it is important to have a clear objective in mind. Why build a custom HMM collection when there are so many excellent resources available already? There are several possible reasons why this might be desirable. The first is that you may have one or more groups of related proteins that are not yet represented in the existing databases. Alternatively, you may have expert knowledge about some types of proteins and can cluster them in a way that has not been done before.

On the other hand, you may find that you want to produce models with a greater specificity than the public databases, which tend to be more general. This is similar to what has been done with PRED-GPCR. It is easy to identify that a sequence is a GPCR through a BLAST search, but the type of GPCR is not so clear. PRED-GPCR can identify this easily because it is focused on this one particular aspect of GPCR classification.

Once the database is built, a side benefit is that, being more focused, it is usually smaller, so searches against it will run much faster.

There are several approaches that one may take to create a special-purpose HMM database. The process that follows is a simplified example that is given for illustrative purposes, and may not suit your particular needs. See the documentation of the databases already mentioned to get additional tips and techniques.

Figure 9. An overview of the database building process

First, BLAST the sequences that you will use for your ‘seeds’.

Figure 10. BLAST step

Select the hits that you will use for your model. Notice that simply grabbing the top hits would not make for a very general model.

Figure 11. Graphical representation of the output

Build the Multiple Sequence Alignment (MSA)

Figure 12. ClustalW output

Using hmmbuild, create the model, then calibrate it with hmmcalibrate.

Figure 13. A Hidden Markov Model

Using that model, search the sequence database with hmmsearch. This should find more hits, which may then be incorporated into the model. This step may be iterated until no new hits are found.

Figure 14. Graphic output from the second iteration
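Putting the steps above together, the build loop looks roughly like this as a command sketch (HMMER 2, legacy NCBI BLAST, and ClustalW command names; the file names and option values here are hypothetical and will need adjusting for your own data):

```sh
blastall -p blastp -i seeds.fa -d uniprot.fa -e 1e-5 -o seeds.blast   # BLAST the seeds
# ...select hits from seeds.blast into hits.fa...
clustalw -infile=hits.fa -outfile=hits.aln                            # build the MSA
hmmbuild mymodel.hmm hits.aln                                         # create the model
hmmcalibrate mymodel.hmm                                              # calibrate it
hmmsearch mymodel.hmm uniprot.fa > mymodel.out                        # search for new hits
# add any new hits to hits.fa and repeat until no new hits are found
```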

Calibration of the Model

The scoring of the model is not optimal until the model is calibrated. hmmcalibrate does this by running a large amount of random sequence data through the model. A histogram is generated from those scores, and an extreme value distribution (EVD) is fitted to that histogram. The model is then saved with the fitted EVD parameters.
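In terms of the fitted EVD parameters, location mu and slope lambda, the standard formulas are as follows (a generic Gumbel parameterization; the exact form used by any particular HMMER release should be checked against its documentation):

```latex
% Probability that a random sequence scores at least x under the
% fitted Gumbel (extreme value) distribution:
P(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)

% E-value: the expected number of random hits scoring >= x when
% searching a database of N sequences:
E(x) = N \cdot P(S \ge x)
```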

hmmcalibrate is slow, typically taking several minutes per model, and as far as I know there is no optimized or accelerated version of it. It is best to concatenate your models into one big flatfile database, then calibrate them all overnight, or over the weekend, depending on the size of the database and the speed of your computer.

SAM models do not require calibration.

Building PSSM databases

Building a database of PSSMs is generally considered to be faster and easier than constructing an equivalent collection of HMMs. Certainly the resulting searches are much faster than running hmmpfam! Despite the potential loss in sensitivity, this will be the best choice for many applications, and should be considered by anyone beginning this type of project.

Start by running PSI-BLAST (blastpgp) with your example sequences against an appropriate database using the -C option, which saves the scoring profiles with a .chk extension. makemat is then run on these binary profiles to convert them to a portable ASCII format, and copymat is run as a secondary preprocessor, creating one large .rps file. Be sure you have plenty of RAM for this process! Finally, the profiles must be formatted with formatdb before they can be searched with RPS-BLAST.
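As a command sketch, the whole pipeline might look like this (legacy NCBI toolkit commands as named above; the profile-list files and exact options are assumptions that should be checked against the IMPALA/RPS-BLAST documentation):

```sh
blastpgp -i seed.fa -d nr -j 3 -C seed.chk   # PSI-BLAST, save checkpoint profile
ls *.chk > mydb.pn                           # list of binary profiles
ls *.fa  > mydb.sn                           # matching list of seed sequences
makemat -P mydb                              # binary .chk -> portable ASCII matrices
copymat -P mydb                              # build the large .rps file (needs RAM!)
formatdb -i mydb -p T                        # format for RPS-BLAST
rpsblast -i query.fa -d mydb                 # search the finished PSSM database
```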

Speed

Analyzing a large amount of data with SAM or HMMer can tax even the largest server farm. If you are looking for a particular group of HMMs, consider simply extracting those models from the Pfam database to make a smaller search target.
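For example, HMMER 2 ships an hmmfetch utility that extracts a single named model from a flatfile, so a smaller target can be assembled like this (the family names and file names are only illustrative):

```sh
hmmfetch Pfam_ls Pkinase  >  mini_pfam.hmm   # pull one family from the Pfam flatfile
hmmfetch Pfam_ls 7tm_1   >>  mini_pfam.hmm   # append another
hmmpfam mini_pfam.hmm queries.fa             # search the much smaller database
```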

Additionally, you might be able to use what is sometimes referred to as a ‘data distillation column’. This involves running BLAST or some other fast, heuristic algorithm to make a first pass at the data. With the bulk of the data weeded out, only a relatively small subset is submitted to the hmmpfam search.
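A minimal sketch of the distillation idea, using awk to filter BLAST tabular (-m 8) output by E-value (column 11) so that only promising queries go on to hmmpfam. The inline hit lines and the 1e-5 cutoff are invented for illustration:

```shell
# Keep the names of queries with at least one hit better than the cutoff.
# Only these sequences would be submitted to the slow hmmpfam step.
survivors=$(awk '$11 < 1e-5 { print $1 }' <<'EOF' | sort -u
q1 s1 98.2 120 2 0 1 120 1 120 1e-40 160
q2 s2 35.0 80 40 3 1 80 5 84 0.5 30
q3 s3 60.1 100 30 1 1 100 1 100 1e-12 90
EOF
)
echo "$survivors"
# q2 (E-value 0.5) is weeded out; q1 and q3 survive
# hmmpfam Pfam_ls survivors.fa   # then search only the surviving sequences
```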

Using the Conserved Domain Database can also speed up your search. RPS-BLAST is many times faster than hmmpfam, although it may change your results somewhat.

An accelerator can help tremendously in this area. Right now, TimeLogic is the only maker of HMM accelerators, but there may be others coming out soon. My two-board DeCypher system provides throughput that is regularly as fast as 800-1760 CPUs of conventional computing. Unfortunately, the number of accelerated algorithms is small, and the price is pretty high (although one can now buy just the algorithms one needs, making the cost of entry much easier to bear). Starbridge Systems promises an easy way to port large numbers of algorithms to FPGAs, but they have not yet managed to port HMMer. Their prices are higher than those of other accelerator vendors.

Southwest Parallel Software and Logical Depth both produce optimized versions of HMMer code that they sell commercially. Southwest Parallel allowed me to test their SPSpfam program. I found a speedup of 3-60 times over the standard HMMer code, with the most common speedup being around 10-20X. If this were running on all nodes of a server farm, you would have very high throughput, at a much lower price than upgrading the server farm. SledgeHMMer may provide good results as well.

Nobody has accelerated or optimized SAM, except for the Kestrel project, which is not really designed for broad distribution. Considering the excellent performance demonstrated by SAM, this is a bit puzzling! I have been told that licensing is an issue.

Acknowledgements

Other members and supporters of the Nevada Center For Bioinformatics include Brian W. Beck, Taliah Mittler, John Cushman, and our outstanding Undergrads Barrett Abel and Nick Lyle.

Special thanks and a large debt of gratitude are owed to Mr. Garrett Taylor, who provided the scripts and much additional help in the production of the NVFAM database series.

Martin Gollery is supported by the Nevada NCRR BRIN grant (Lee Weber, PI). The DeCypher server was provided through the NSF-EPSCoR Ring-True II grant.

References

• S. Altschul, et al. Basic Local Alignment Search Tool. JMB, 215:403–410, 1990.

• C. Barrett, et al. Scoring hidden Markov models. CABIOS, 13(2):191–199, 1997.

• S. R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.

• W. N. Grundy, et al. Meta-MEME: Motif-based hidden Markov models of protein families. CABIOS, 13(4):397–406, 1997.

• M. Gribskov, et al. Profile analysis: Detection of distantly related proteins. PNAS, 84:4355–4358, July 1987.

• S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915–10919, November 1992.

• S. Henikoff and J. G. Henikoff. Position-based sequence weights. JMB, 243(4):574–578, November 1994.

• J. D. Hirschberg, et al. Kestrel: A programmable array for sequence analysis. In Application-Specific Array Processors, pages 25–34, Los Alamitos, CA, July 1996. IEEE Computer Society.

• R. Hughey and A. Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS, 12(2):95–107, 1996.

• T. Hubbard, et al. SCOP: a structural classification of proteins database. NAR, 25(1):236–239, January 1997.

• L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. NAR, 25:231–234, January 1997.

• K. Karplus, et al. Predicting protein structure using only sequence information. Proteins: Structure, Function, and Genetics.

• K. Karplus, et al. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846–856, 1998.

• A. Krogh, et al. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501–1531, February 1994.

• K. Karplus, et al. Predicting protein structure using hidden Markov models. Proteins: Structure, Function, and Genetics, Suppl. 1:134–139, 1997.

• C. A. Orengo, et al. CATH - a hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, August 1997.

• J. Park, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. JMB, 284(4):1201–1210, 1998.

• E. L. L. Sonnhammer, et al. Pfam: A comprehensive database of protein families. Proteins, 28:405–420, 1997.

• K. Sjölander, et al. Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology. CABIOS, 12(4):327–345, August 1996.

• R. Schneider and C. Sander. The HSSP database of protein structure-sequence alignments. NAR, 24(1):201–205, January 1996.

• G. Chukkapalli, C. Guda, and S. Subramaniam. SledgeHMMER: A web server for batch searching the Pfam database. NAR, 32:W542–W544, 2004.

• A. A. Schaffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul. IMPALA: Matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics.

• P. K. Papasaikas, P. G. Bagos, Z. I. Litou, V. J. Promponas, and S. J. Hamodrakas. PRED-GPCR: GPCR recognition and family classification server. NAR, 32(Web Server issue):W380–W382, 2004; doi:10.1093/nar/gkh431.