Hidden Markov Models in Bioinformatics

Current Bioinformatics, 2007, 2, 49-61 49 Hidden Markov Models in Bioinformatics Valeria De Fonzo1, Filippo Aluffi-Pentini2 and Valerio Parisi*,3 1EuroBioPark, Università di Roma “Tor Vergata”, Via della Ricerca Scientifica 1, 00133 Roma, Italy 2Dipartimento Metodi e Modelli Matematici, Università di Roma “La Sapienza”, Via A. Scarpa 16, 00161 Roma, Italy 3Dipartimento di Medicina Sperimentale e Patologia, Università di Roma “La Sapienza”, Viale Regina Elena 324, 00161 Roma, Italy Abstract: Hidden Markov Models (HMMs) became recently important and popular among bioinformatics researchers, and many software tools are based on them. In this survey, we first consider in some detail the mathematical foundations of HMMs, we describe the most important algorithms, and provide useful comparisons, pointing out advantages and drawbacks. We then consider the major bioinformatics applications, such as alignment, labeling, and profiling of sequences, protein structure prediction, and pattern recognition. We finally provide a critical appraisal of the use and perspectives of HMMs in bioinformatics. Keywords: Hidden markov model, HMM, dynamical programming, labeling, sequence profiling, structure prediction. INTRODUCTION this case, the time evolution of the internal states can be induced only through the sequence of the observed output A Markov process is a particular case of stochastic states. process, where the state at every time belongs to a finite set, the evolution occurs in a discrete time and the probability If the number of internal states is N, the transition distribution of a state at a given time is explicitly dependent probability law is described by a matrix with N times N only on the last states and not on all the others. values; if the number of emissions is M, the emission probability law is described by a matrix with N times M A Markov chain is a first-order Markov process for values. A model is considered defined once given these two which the probability distribution of a state at a given time is matrices and the initial distribution of the internal states. explicitly dependent only on the previous state and not on all the others. In other words, the probability of the next The paper by Rabiner [1] is widely well appreciated for (“future”) state is directly dependent only on the present clarity in explaining HMMs. state and the preceding (“past”) states are irrelevant once the present state is given. More specifically there is a finite set of SOME NOTATIONS possible states, and the transitions among them are governed For the sake of simplicity, in the following notations we by a set of conditional probabilities of the next state given consider only one sequence of internal states and one the present one, called transition probabilities. The transition sequence of associated emissions, even if in some cases, as probabilities are implicitly (unless declared otherwise) we shall see later, more than one sequence is to be independent of the time and then one speaks of homo- considered. geneous, or stationary, Markov chains. Note that the independent variable along the sequence is conventionally called Here are the notations: “time” also when this is completely inappropriate; for U the set of all the N example for a DNA sequence, the “time” means the position possible internal states along the sequence. the set of all the Starting from a given initial state, the consecutive trans- X M possible external states itions from a state to the next one produce a time-evolution of the chain that is therefore completely represented by a L the length of the sequence sequence of states that a priori are to be considered random. k a time instant, where A Hidden Markov Model is a generalization of a Markov k 1,,L chain, in which each (“internal”) state is not directly observ- [] able (hence the term hidden) but produces (“emits”) an obs- sk internal state at time k , ervable random output (“external”) state, also called “emi- where s U ssion”, according to a given stationary probability law. In k a sequence of L internal S ()s1,s2,s3,…,sL states *Address correspondence to this author at the Dipartimento di Medicina Sperimentale e Patologia, Università di Roma “La Sapienza”, Viale Regina e emission at time k , where k Elena 324, 00161 Roma, Italy; Tel: +39 06 4991 0787; Fax: +39 338 ek X 09981736; E-mail: [email protected] 1574-8936/07 $50.00+.00 © 2007 Bentham Science Publishers Ltd. 50 Current Bioinformatics, 2007, Vol. 2, No. 1 De Fonzo et al. E e ,e ,e ,…,e a sequence of L external The most used algorithms are: ()1 2 3 L states 1. the forward algorithm: find the probability of emi- the probabilities of a trans- ssion distribution (given a model) starting from the au,v = Ps()k = v | sk1 = u ition to the state v from beginning of the sequence. the state u 2. the backward algorithm: find the probability of A the N N matrix of emission distribution (given a model) starting from the end of the sequence. elements au,v B) Decoding problem: given a model and a sequence of the probabilities of the bu ()x = Pe()k = x | sk = u observations, induce the most likely hidden states. emission x from the state u More specifically: 1. find the sequence of internal states that has, as a B the N M matrix of ele- whole, the highest probability. The most used ments bu ()x algorithm is the Viterbi algorithm. the probability of the initial u Ps()1 = u 2. find for each position the internal state that has the state u highest probability. The most used algorithm is the posterior decoding algorithm. the N -vector of elements u C) Learning problem: given a sequence of observations, find an optimal model. = (A,B,) the definition of the HMM model The most used algorithms start from an initial guessed model and iteratively adjust the model parameters. More A SIMPLE EXAMPLE specifically: We propose an oversimplified biological example of an 1. find the optimal model based on the most probable HMM (Fig. 1), inspired by the toy example in Eddy [2] with sequences (as in problem B1). The most used only two internal states but with exponential complexity. algorithm is the Viterbi training (that uses recursively The model is detailed in Fig. 1a. the Viterbi algorithm in B1). The set of internal states is U {}'c','n' where 'c' and 2. find the optimal model based on the sequences of 'n' stand for the coding and non-coding internal states and most probable internal states (as in problem B2). The the set of emissions is the set of the four DNA bases: most used algorithm is the Baum-Welch algorithm (that uses recursively the posterior decoding X {}' A','T','C','G' algorithm in B2). As emitted sequence, we consider a sequence of 65 bases A) THE EVALUATION PROBLEM (Fig. 1b). The probability of observing a sequence E of emissions It is important to note that in most cases of HMM use in given an HMM (likelihood function of ), is given by bioinformatics a fictitious inversion occurs between causes and effects when dealing with emissions. For example, one P(E | ) = P(E | S;) P(S | ) can synthesise a (known) polymer sequence that can have S different (unknown) features along the sequence. In an HMM one must choose as emissions the monomers of the We note that the logarithm of the likelihood function sequence, because they are the only known data, and as (log-likelihood) is more often used. internal states the features to be estimated. In this way, one The above sum must be computed over all the hypothesises that the sequence is the effect and the features N L possible sequences S (of length L ) of internal states are the cause, while obviously the reverse is true. An and therefore the direct computation is too expensive; excellent case is provided by the polypeptides, for which it is fortunately there exist some algorithms which have a just the amino acid sequence that causes the secondary considerably lower complexity, for example the forward and structures, while in an HMM the amino acids are assumed as the backward algorithms (of complexity O(N 2L), see emissions and the secondary structures are assumed as below). internal states. A1) The Forward Algorithm MAIN TYPES OF PROBLEMS This method introduces auxiliary variables k (called The main types of problems occurring in the use of forward variables), where Hidden Markov Models are: u = Pe,…,e ;s = u | is the probability of A) Evaluation problem (Direct problem): compute the k () ()1 k k observing a partial sequence of emissions e …e and a probability that a given model generates a given 1 k sequence of observations. state sk = u at time k . Hidden Markov Models in Bioinformatics Current Bioinformatics, 2007, Vol. 2, No. 1 51 Fig. (1). An example of HMM. (a) The square boxes represent the internal states 'c' (coding) and 'n' (non coding), inside the boxes there are the probabilities of each emission ( ' A', 'T', 'C' and 'G') for each state; outside the boxes four arrows are labelled with the corresponding transition probability. (b) The first row is a sample sequence of 65 observed emissions and the second row is one of the likely sequences of internal states. The boxed part is dealt with in (c) and (d). (c) The right-hand side column represents the boxed tract of bases in (b). The other columns represent, for each circled base, the two possible alternatives for the internal state ( 'c' or 'n') that emitted the base. Each row refers to the same position along the sequence. The arrows represent all possible transitions and the emissions.

Hidden Markov Models in Bioinformatics

Modeling Dependence in Data: Options Pricing and Random Walks

Estimation of Viterbi Path in Bayesian Hidden Markov Models

A Study of Hidden Markov Model

Sequence Models

Entropy Rate

Markov Decision Process Example

1. Markov Models

Viterbi Extraction Tutorial with Hidden Markov Toolkit

Markov Chains and Hidden Markov Models

N-Best Hypotheses

Notes on Markov Models for 16.410 and 16.413 1 Markov Chains

A HMM Approach to Identifying Distinct DNA Methylation Patterns