MARKOV CHAIN AND HIDDEN MARKOV MODEL
JIAN ZHANG
[email protected]

Markov chains and hidden Markov models are probably the simplest models that can be used to model sequential data, i.e. data samples that are not independent of each other.

Markov Chain

Let $I$ be a countable set. Each $i \in I$ is called a state and $I$ is called the state space. Without loss of generality we assume $I = \{1, 2, \ldots\}$; in most cases $I$ is a finite set and we use the notation $I = \{1, 2, \ldots, k\}$ or $I = \{S_1, S_2, \ldots, S_k\}$. $\lambda$ is said to be a distribution on $I$ if $0 \le \lambda_i < \infty$ and $\sum_{i \in I} \lambda_i = 1$.

Definition 1.1. A matrix $T \in \mathbb{R}^{k \times k}$ is stochastic if each row of $T$ is a probability distribution.

One example of a stochastic matrix is
$$T = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$
with $\alpha, \beta \in [0, 1]$. Figure 1 shows another example of a transition matrix on $I = \{S_1, S_2, S_3\}$, drawn as a finite state machine.

Figure 1. Finite state machine for a Markov chain $X_0 \to X_1 \to X_2 \to \cdots \to X_n$ where the random variables $X_i$ take values in $I = \{S_1, S_2, S_3\}$. The numbers $T(i,j)$ on the arrows are the transition probabilities, such that $T_{ij} = P(X_{t+1} = S_j \mid X_t = S_i)$.

Definition 1.2. We say that $(X_n)_{n \ge 0}$ is a Markov chain with initial distribution $\lambda$ and transition matrix $T$ if
(i) $X_0$ has distribution $\lambda$;
(ii) for $n \ge 0$, conditional on $X_n = i$, $X_{n+1}$ has distribution $(T_{ij} : j \in I)$ and is independent of $X_0, \ldots, X_{n-1}$.

By the Markov property we have
\begin{align}
P(X_0, \ldots, X_n) &= P(X_0)\, P(X_1 \mid X_0) \cdots P(X_n \mid X_0, \ldots, X_{n-1}) \tag{1} \\
&= P(X_0) \prod_{t=1}^{n} P(X_t \mid X_{t-1}) \tag{2}
\end{align}
which greatly simplifies the joint distribution of $X_0, \ldots, X_n$. Note also that in our definition the process is homogeneous, i.e. $P(X_t = S_j \mid X_{t-1} = S_i) = T_{ij}$, which does not depend on $t$. Assume that $X$ takes values in $\mathcal{X} = \{S_1, \ldots, S_k\}$; the behavior of the process can then be described by a transition matrix $T \in \mathbb{R}^{k \times k}$ with $T_{ij} = P(X_t = S_j \mid X_{t-1} = S_i)$. The set of parameters $\Theta$ for a Markov chain is $\Theta = \{\lambda, T\}$.

Graphical Model for Markov Chain.
The Markov chain $X_0, \ldots, X_n$ can be represented as a graphical model, where each node represents a random variable and the edges indicate the conditional dependence structure. Graphical models are a very useful tool for visualizing probabilistic models as well as for designing efficient inference algorithms.

Figure 2. Graphical Model for Markov Chain

Random Walk on Graphs. The behavior of a Markov chain can also be described as a random walk on the graph shown in Figure 1. Initially a vertex is chosen according to the initial distribution $\lambda$ and is denoted $S_{X_0}$; at time $t$ the current position is $S_{X_t}$ and the next vertex is chosen according to the probabilities $T_{X_t, \cdot}$, the $X_t$-th row of the transition matrix $T$.

Many properties of a Markov chain can be identified by studying $\lambda$ and $T$. For example, the distribution of $X_0$ is determined by $\lambda$, while the distribution of $X_1$ is determined by $\lambda T$ (we assume $\lambda \in \mathbb{R}^{1 \times k}$ to be a row vector), etc.

Hidden Markov Model

A hidden Markov model is an extension of a Markov chain which is able to capture the sequential relations among hidden variables. Formally we have $Z_t = (X_t, Y_t)$ for $t = 0, 1, \ldots, n$ with $X_t \in I$ and $Y_t \in O = \{O_1, \ldots, O_l\}$ such that the joint probability of $Z_0, \ldots, Z_n$ can be factorized as:
\begin{align}
P(Z_0, \ldots, Z_n) &= \left[ P(X_0)\, P(Y_0 \mid X_0) \right] \prod_{t=1}^{n} \left[ P(X_t \mid X_{t-1})\, P(Y_t \mid X_t) \right] \tag{3} \\
&= \left[ P(X_0) \prod_{t=1}^{n} P(X_t \mid X_{t-1}) \right] \left[ \prod_{t=0}^{n} P(Y_t \mid X_t) \right]. \tag{4}
\end{align}
In other words, $X_0, \ldots, X_n$ is a Markov chain and $Y_t$ is independent of all other variables given $X_t$. The set of parameters for an HMM is $\Theta = \{\lambda, T, \Gamma\}$, where $\Gamma \in \mathbb{R}^{k \times l}$ is defined by $\Gamma_{ij} = P(Y_t = O_j \mid X_t = S_i)$. If $P(Y_t \mid X_t)$ is assumed to be a multinomial distribution, then the total number of parameters for an HMM is $(k - 1) + k(k - 1) + k(l - 1)$. Figure 3 shows the graphical model for the HMM, from which we can easily see the conditional independence structure of all variables $(X_0, Y_0), \ldots, (X_n, Y_n)$.

Figure 3. Graphical Model for Hidden Markov Model
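The random-walk description and the factorization (3)-(4) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0-based indexing of states and symbols, and the two-state example parameters `LAM`, `T_EX`, `GAMMA_EX` are assumptions made here and do not come from the text.

```python
import random

# Hypothetical two-state, two-symbol example (values chosen for illustration).
LAM = [0.6, 0.4]                      # initial distribution lambda (row vector)
T_EX = [[0.7, 0.3], [0.2, 0.8]]       # T[i][j] = P(X_{t+1} = j | X_t = i)
GAMMA_EX = [[0.9, 0.1], [0.3, 0.7]]   # Gamma[i][j] = P(Y_t = j | X_t = i)


def sample_hmm(lam, T, Gamma, n, rng):
    """Draw (x_0..x_n, y_0..y_n) following the factorization (3)-(4)."""
    k, l = len(T), len(Gamma[0])
    xs, ys = [], []
    x = rng.choices(range(k), weights=lam)[0]            # X_0 ~ lambda
    for t in range(n + 1):
        if t > 0:
            x = rng.choices(range(k), weights=T[x])[0]   # X_t ~ row X_{t-1} of T
        y = rng.choices(range(l), weights=Gamma[x])[0]   # Y_t ~ row X_t of Gamma
        xs.append(x)
        ys.append(y)
    return xs, ys


def state_marginal(lam, T, t):
    """Distribution of X_t, computed as the row vector lam T^t."""
    k = len(T)
    dist = list(lam)
    for _ in range(t):
        dist = [sum(dist[i] * T[i][j] for i in range(k)) for j in range(k)]
    return dist


def num_free_parameters(k, l):
    """Parameter count (k-1) + k(k-1) + k(l-1) for a multinomial HMM."""
    return (k - 1) + k * (k - 1) + k * (l - 1)
```

With these example values, `state_marginal(LAM, T_EX, 1)` evaluates $\lambda T$ and gives approximately `[0.5, 0.5]`, and `num_free_parameters(2, 2)` returns 5. Dropping the emission step from `sample_hmm` leaves exactly the random walk on the graph of a plain Markov chain.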
An HMM is suitable for situations where the observed sequence $Y_0, \ldots, Y_n$ is influenced by a hidden Markov chain $X_0, \ldots, X_n$. For example, in speech recognition we observe the phoneme sequence $Y_0, \ldots, Y_n$, which can be thought of as noisy observations of the underlying words $X_0, \ldots, X_n$. In this case, we would like to infer the unknown words from the observation sequence $Y_0, \ldots, Y_n$.

Three Fundamental Problems in HMM

There are three basic problems of interest for the hidden Markov model:
• Problem 1: Given an observation sequence $y_0 y_1 \ldots y_n$ and the model parameters $\Theta = \{\lambda, T, \Gamma\}$, how can we efficiently compute $P(Y = y \mid \Theta) = P(Y_0 = y_0, \ldots, Y_n = y_n \mid \Theta)$, the probability of the observation sequence given the model?
• Problem 2: Given an observation sequence $y_0 y_1 \ldots y_n$ and the model parameters $\Theta = \{\lambda, T, \Gamma\}$, how can we find the optimal sequence of states $x_0 x_1 \ldots x_n$ in the sense of maximizing $P(X = x \mid \Theta, Y = y) = P(X_0 = x_0, \ldots, X_n = x_n \mid \Theta, Y_0 = y_0, \ldots, Y_n = y_n)$?
• Problem 3: How can we estimate the model parameters $\Theta = \{\lambda, T, \Gamma\}$ by maximizing $P(Y = y \mid \Theta)$?

Forward-Backward Algorithm. The solution of Problem 1 can be computed as
\begin{align}
P(Y = y \mid \Theta) &= \sum_{x} P(X = x \mid \Theta)\, P(Y = y \mid \Theta, X = x) \notag \\
&= \sum_{x_0} \sum_{x_1} \cdots \sum_{x_n} \left[ P(X_0 = x_0) \prod_{t=1}^{n} P(X_t = x_t \mid X_{t-1} = x_{t-1}) \prod_{t=0}^{n} P(Y_t = y_t \mid X_t = x_t) \right]. \tag{5}
\end{align}
However, the total number of possible hidden sequences $x$ is large ($k^{n+1}$ for $k$ states), and thus direct computation is very expensive. Intuitively, we want to move some of the sums inside the product to reduce the computation.

The basic idea of the forward algorithm is as follows. First, the forward variable $\alpha_t(i)$ is defined by
$$\alpha_t(i) = P(y_0, \ldots, y_t, X_t = S_i), \tag{6}$$
which is the probability of observing the partial sequence $y_0, \ldots, y_t$ and ending up in state $S_i$.
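To make the cost of the direct sum (5) concrete, the following sketch enumerates every hidden sequence explicitly. It is meant only as a reference implementation for very short sequences; the function name, the 0-based indices, and the example parameters are assumptions of this sketch rather than anything given in the text.

```python
from itertools import product

# Hypothetical two-state, two-symbol example (values chosen for illustration).
LAM_BF = [0.6, 0.4]                     # lambda
T_BF = [[0.7, 0.3], [0.2, 0.8]]         # T[i][j] = P(X_t = j | X_{t-1} = i)
GAMMA_BF = [[0.9, 0.1], [0.3, 0.7]]     # Gamma[i][j] = P(Y_t = j | X_t = i)


def brute_force_likelihood(lam, T, Gamma, ys):
    """Evaluate (5) by summing over all k^(n+1) hidden state sequences."""
    k = len(lam)
    total = 0.0
    for xs in product(range(k), repeat=len(ys)):
        # P(X_0 = x_0) P(Y_0 = y_0 | X_0 = x_0)
        p = lam[xs[0]] * Gamma[xs[0]][ys[0]]
        # Remaining transition and emission factors for t = 1..n
        for t in range(1, len(ys)):
            p *= T[xs[t - 1]][xs[t]] * Gamma[xs[t]][ys[t]]
        total += p
    return total
```

For the example values, `brute_force_likelihood(LAM_BF, T_BF, GAMMA_BF, [0, 1])` is about 0.2208. The outer loop runs $k^{n+1}$ times, which is the exponential cost that the forward variable is designed to avoid.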
We have
\begin{align}
\alpha_{t+1}(i) &= P(y_0, \ldots, y_{t+1}, X_{t+1} = S_i) \tag{7} \\
&= P(X_{t+1} = S_i)\, P(y_0, \ldots, y_{t+1} \mid X_{t+1} = S_i) \tag{8} \\
&= P(X_{t+1} = S_i)\, P(y_{t+1} \mid X_{t+1} = S_i)\, P(y_0, \ldots, y_t \mid X_{t+1} = S_i) \tag{9} \\
&= P(y_{t+1} \mid X_{t+1} = S_i)\, P(y_0, \ldots, y_t, X_{t+1} = S_i) \tag{10} \\
&= P(y_{t+1} \mid X_{t+1} = S_i) \sum_{x_t} P(y_0, \ldots, y_t, X_t = x_t, X_{t+1} = S_i) \tag{11} \\
&= P(y_{t+1} \mid X_{t+1} = S_i) \sum_{x_t} P(X_{t+1} = S_i \mid X_t = x_t)\, P(y_0, \ldots, y_t, X_t = x_t) \tag{12} \\
&= \Gamma_{i, y_{t+1}} \sum_{j=1}^{k} T_{j,i}\, \alpha_t(j). \tag{13}
\end{align}
Initially we have $\alpha_0(i) = \lambda_i \Gamma_{i, y_0}$, and the final solution is
$$P(Y = y \mid \Theta) = \sum_{i=1}^{k} \alpha_n(i). \tag{14}$$
The backward algorithm can be constructed similarly by defining the backward variable $\beta_t(i) = P(y_{t+1}, \ldots, y_n \mid X_t = S_i)$.

Viterbi Algorithm. The solution of Problem 2 can be written as
\begin{align}
x^* &= \operatorname*{argmax}_{x} P(X = x \mid Y = y, \Theta) \tag{15} \\
&= \operatorname*{argmax}_{x} P(X = x, Y = y \mid \Theta), \tag{16}
\end{align}
where the second line holds because $P(Y = y \mid \Theta)$ does not depend on $x$. A formal technique for finding the best state sequence $x^*$ based on dynamic programming is known as the Viterbi algorithm. Define the quantity
$$\delta_t(i) = \max_{x_0, \ldots, x_{t-1}} P(x_0, \ldots, x_{t-1}, X_t = S_i, y_0, \ldots, y_t \mid \Theta), \tag{17}$$
which is the highest probability along a single path at time $t$ ending at state $S_i$. We have
\begin{align}
\delta_{t+1}(j) &= \max_{i} \left\{ \delta_t(i)\, P(X_{t+1} = S_j \mid X_t = S_i)\, P(Y_{t+1} = y_{t+1} \mid X_{t+1} = S_j) \right\} \tag{18} \\
&= \max_{i}\, \delta_t(i)\, T_{ij}\, \Gamma_{j, y_{t+1}}. \tag{19}
\end{align}
Initially we have $\delta_0(i) = \lambda_i \Gamma_{i, y_0}$, and the final highest probability is $P^* = \max_{S_i \in I} \delta_n(i)$. To find the optimal sequence $x^*$ we need to define auxiliary variables $\psi_{t+1}(j)$ which store the optimal path:
$$\psi_{t+1}(j) = \operatorname*{argmax}_{i}\, \delta_t(i)\, T_{ij}\, \Gamma_{j, y_{t+1}} = \operatorname*{argmax}_{i} \left\{ \delta_t(i)\, T_{ij} \right\}, \tag{20}$$
for $t = 0, 1, \ldots, n-1$. The final optimal path can be traced back by using $x_n^* = \operatorname*{argmax}_i \delta_n(i)$ and $x_t^* = \psi_{t+1}(x_{t+1}^*)$ for $t = n-1, \ldots, 0$.

Baum-Welch Algorithm. Let $\Theta = (\lambda, T, \Gamma)$ represent all of the parameters of the HMM. Given $m$ observation sequences $y^1, \ldots, y^m$, where $y^l = y_0^l \ldots y_{n_l}^l$, the parameters can be estimated by maximizing the (log-)likelihood:
\begin{align}
\hat{\Theta} &= \operatorname*{argmax}_{\Theta} \prod_{l=1}^{m} p(Y = y^l \mid \Theta) \tag{21} \\
&= \operatorname*{argmax}_{\Theta} \sum_{l=1}^{m} \log p(Y = y^l \mid \Theta) \tag{22} \\
&= \operatorname*{argmax}_{\Theta} \sum_{l=1}^{m} \log \sum_{x_0} \cdots \sum_{x_{n_l}} \lambda_{x_0} \prod_{t=1}^{n_l} T_{x_{t-1}, x_t} \prod_{t=0}^{n_l} \Gamma_{x_t, y_t^l}. \tag{23}
\end{align}
In principle, the above objective can be maximized using standard numerical optimization methods to find $\hat{\Theta}$. In practice, the estimation is often solved by the well-known Baum-Welch algorithm, which is a special case of the Expectation Maximization (EM) algorithm. Details will be discussed after we introduce the EM algorithm.

Learning with (x, y). There are often cases where we are able to know both the state sequences and the observation sequences.
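The text has not stated an estimator at this point, so the following is only a sketch under an assumption: when every state sequence is observed and $P(Y_t \mid X_t)$ is multinomial, the maximum-likelihood estimates of $\lambda$, $T$ and $\Gamma$ are normalized counts. The function name, the 0-based indices, the input layout, and the uniform fallback for unseen states are all choices made here for illustration.

```python
def estimate_from_labeled(sequences, k, l):
    """Count-based estimates of (lambda, T, Gamma) from labeled data.

    sequences: list of (xs, ys) pairs, where xs holds state indices in
    0..k-1 and ys holds observation indices in 0..l-1, of equal length.
    """
    init = [0.0] * k
    trans = [[0.0] * k for _ in range(k)]
    emit = [[0.0] * l for _ in range(k)]
    for xs, ys in sequences:
        init[xs[0]] += 1                    # count of X_0 = i
        for a, b in zip(xs, xs[1:]):
            trans[a][b] += 1                # count of transitions i -> j
        for x, y in zip(xs, ys):
            emit[x][y] += 1                 # count of symbol j emitted in state i

    def normalize(row):
        s = sum(row)
        # A row with no counts is undetermined by the data; this sketch
        # falls back to a uniform distribution (an arbitrary choice).
        return [c / s for c in row] if s > 0 else [1.0 / len(row)] * len(row)

    lam = normalize(init)
    T = [normalize(r) for r in trans]
    Gamma = [normalize(r) for r in emit]
    return lam, T, Gamma
```

For example, with the two labeled sequences `([0, 0, 1], [0, 0, 1])` and `([1, 0, 0], [1, 0, 1])` and `k = l = 2`, state 0 is followed by state 0 twice and by state 1 once, so the first row of the estimated $T$ is $(2/3, 1/3)$.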