<<

AND HIDDEN

JIAN ZHANG [email protected]

Markov chain and are probably the simplest models which can be used to model sequential data, i.e. data samples which are not independent from each other. Markov Chain

Let I be a . Each i ∈ I is called a state and I is called the state-. Without loss of generality we assume I = {1, 2,...}, and in most cases we have I a finite set and use the notation I = {1, 2,...,k} or I = {S1,S2,...,Sk}. λ is said to be a distribution on I if 0 ≤ λi < ∞ and i∈I λi =1. Definition 1.1. A matrix T ∈ Rk×k is if each row of T is a distribution.P One example of a stochastic is 1 − α α T = β 1 − β   with α, β ∈ [0, 1]. Figure 1 shows another example of a transition matrix on I = {S1,S2,S3} using finite state machine.

Figure 1. Finite state machine for a Markov chain X0 → X1 → X2 →···→ Xn where the random variables Xi’s take values from I = {S1,S2,S3}. The T (i, j)’s on the arrows are the transition such that Tij = P (Xt+1 = Sj|Xt = Si).

Definition 1.2. We say that (Xn)n≥0 is a Markov chain with initial distribution λ and transition matrix T if (i) X0 has distribution λ; (ii) for n ≥ 0, conditional on Xn = i, Xn+1 has distribution (Tij : j ∈ I) and is independent of X0,...,Xn−1. By the we have

(1) P (X0,...,Xn) = P (X0)P (X1|X0) · · · P (Xn|X0,...,Xn−1) n (2) = P (X0) P (Xt|Xt−1) t=1 Y1 englishMARKOV CHAIN AND HIDDEN MARKOV MODEL 2 which greatly simplifies the joint distribution of X0,...,Xn. Note also that in our definition the process is homogeneous, i.e. we have P (Xt = Sj |Xt−1 = Si)= Tij which does not depend on t. Assume that X takes values from X = {S1,...,Sk}, the behavior of the process can then be described k×k by a transition matrix T ∈ R where we have Tij = P (Xt = Sj |Xt−1 = Si). The set of parameters Θ for a Markov chain is Θ= {λ, T }. for Markov Chain. The Markov chain X0,...,Xn can be represented in terms of a graphical model, where each node represents a , and the edges indicate conditional dependence structure. Graphical model is a very useful tool to visualize probabilistic models as well as to design efficient inference algorithms.

Figure 2. Graphical Model for Markov Chain

Random Walk on Graphs. The behavior of a Markov chain can also be described as a on the graph shown in Figure 1.

Initially a vertice is chosen according to the initial distribution λ and is denoted as SX0 ; at time t the current position is SXt and the next vertice is chosen with respect to the probability TXt,., the Xt-th row of the transition matrix T . Many properties of Markov chain can be identified by studying λ and T . For example, the distribution 1 of X0 is determined by λ, while the distribution of X1 is determined by λT , etc. Hidden Markov Model

A hidden Markov model is an extension of a Markov chain which is able to capture the sequential relations among hidden variables. Formally we have Zt = (Xt, Yt) for t = 0, 1,...,n with Xt ∈ I and Yt ∈ O = {O1,...,Ol} such that the joint probability of Z0,...,Zn can be factorized as: n (3) P (Z0,...,Zn) = [P (X0)P (Y0|X0)] [P (Xt|Xt−1)P (Yt|Xt)] t=1 n Y n (4) = P (X0) P (Xt|Xt−1) P (Yt|Xt) . " t=1 # "t=0 # Y Y In other words, the X0,...,Xn is a Markov chain and Yt is independent of all other variables given Xt. The k×l set of parameters for a HMM Θ = {λ, T, Γ} where Γ ∈ R is defined as Γij = P (Yt = Oj |Xt = Si). If P (Yt|Xt) is assumed to be a Multinomial distribution, then the total number of parameters for a HMM is (k − 1) + k(k − 1) + k(l − 1). Figure 3 shows the graphical model for HMM, from which we can easily see the structure of all variables (X0, Y0),..., (Xn, Yn).

Figure 3. Graphical Model for Hidden Markov Model

1We assume λ ∈ R1×k to be a row vector. englishMARKOV CHAIN AND HIDDEN MARKOV MODEL 3

HMM is suitable for situations where the observed Y0,...,Yn are influenced by a hidden Markov chain X0,...,Xn. For example, in , we observe the phoneme sequences Y0,...,Yn. The of Y0,...,Yn can be thought as noisy observations of the underlying words X0,...,Xn. In this case, we would like to infer the unknown words based on the observation sequence Y0,...,Yn.

Three Fundamental Problems in HMM

There are three basic problems of interest for the hidden Markov model:

• Problem 1 : Given an observation sequence y0y1 ...yn and the model parameters Θ= {λ, T, Γ}, how to efficiently compute P (Y = y|Θ) = P (Y0 = y0,...,Yn = yn|Θ), the probability of the observation sequence given the model? • Problem 2 : Given an observation sequence y0y1 ...yn and the model parameters Θ= {λ, T, Γ}, how to find the optimal sequence of states x0x1 ...xn in the sense of maximizing P (X = x|Θ, Y = y)= P (X0 = x0,...,Xn = xn|Θ, Y0 = y0,...,Yn = yn)? • Problem 3 : How to estimate the model parameters Θ= {λ, T, Γ} by maximizing P (Y = y|Θ)?

Forward-Backward Algorithm. The solution of problem 1 can be computed as

P (Y = y|Θ) = P (X = x|Θ)P (Y = y|Θ, X = x) x X n n (5) = · · · P (X0 = x0) P (Xt = xt|Xt−1 = xt−1) P (Yt = yt|Xt = xt) x x xn " t=1 t=0 # X0 X1 X Y Y However, the total number of possible hidden sequences x is large and thus direct computation is very expensive. Intuitively, we want to move some of the sums inside the product to reduce the computation. The basic idea of the forward algorithm is as follows. First, the forward variable αt(i) is defined by

(6) αt(i)= P (y0,...,yt,Xt = Si) is the probability of observing a partial sequence y0,...,yt and ending up in state Si. We have

(7) αt+1(i) = P (y0,...,yt+1,Xt+1 = Si)

(8) = P (Xt+1 = Si)P (y0,...,yt+1|Xt+1 = Si)

(9) = P (Xt+1 = Si)P (yt+1|Xt+1 = Si)P (y0,...,yt|Xt+1 = Si)

(10) = P (yt+1|Xt+1 = Si)P (y0,...,yt,Xt+1 = Si)

(11) = P (yt+1|Xt+1 = Si) P (y0,...,yt,Xt = xt,Xt+1 = Si) xt X (12) = P (yt+1|Xt+1 = Si) P (Xt+1 = Si|Xt = xt)P (y0,...,yt,Xt = xt) xt X k

(13) = Γi,yt+1 Tj,iαt(j). j=1 X

Initially we have α0(i)= λiΓi,y0 and the final solution is

k (14) P (Y = y|Θ) = αn(i). i=1 X The backward algorithm can be constructed similarly by defining the backward variable

βt(i)= P (yt+1,...,yn|Xt = Si). englishMARKOV CHAIN AND HIDDEN MARKOV MODEL 4

Viterbi Algorithm. The solution of problem 2 can be written as (15) x∗ = argmax P (X = x|Y = y, Θ) x (16) = argmax P (X = x, Y = y, Θ). x A formal technique for finding the best state sequence x∗ based on is known as the . Define the quantity

(17) δt(i)= max P (x0,...,xt−1,Xt = Si,y0,...,yt|Θ), x0,...,xt−1 which is the highest probability along a single path at time t ending at state Si. We have

(18) δt+1(j) = max {δt(i)P (Xt+1 = Sj |Xt = Si)P (Yt+1 = yt+1|Xt+1 = Sj )} i

(19) = max δt(i)Tij Γj,yt . i +1 ∗ Initially we have δ0(i)= λiΓi,y0 and the final highest probability is P = maxSi∈I δn(i). To find the optimal ∗ sequence x we need to define some auxiliary variables ψt+1(j) which stores the optimal path:

(20) ψt+1(j) = argmax δt(i)Tij Γj,yt+1 = argmax {δt(i)Tij } , i i  ∗ ∗ for t = 1, 2,...,n. The final optimal path can be traced back by using xn = argmaxi δn(i) and xt = ∗ ψt+1(xt+1) for t = n − 1,..., 0. Baum-Welch Algorithm. Let Θ = (λ, T, Γ) represent all of the parameters of the HMM model. Given m observation sequences y1,..., ym, the parameters can be estimated by maximizing the (log)-likelihood: m (21) Θˆ = argmax p(Y = yl|Θ) Θ l=1 Ym (22) = argmax log p(Y = yl|Θ) Θ l=1 X m nl nl

l (23) = argmax log · · · λx0 Txt,xt+1 Γxt,yt . Θ   l=1 x0 xn t=1 t=0 X X Xl Y Y  In principle, the above equation can be maximized using standard numerical optimization methods to find   Θˆ . In practice, the above estimation is often solved by the well-known Baum-Welch algorithm, which is a special case of the Expectation Maximization (EM) algorithm. Details will be discussed after we introduce the EM algorithm. Learning with (x, y). There are often cases where we are able to know both the state sequences and the observation sequences. That is, given m pairs of sequences (x1, y1),..., (xm, ym), we want to estimate parameters λ, T and Γ. Since the state sequences are observed (and thus the summation over x is not needed any more), the maximum likelihood estimation Θˆ can be computed easily in this case: m (24) Θˆ = argmax log p(Y = yl, X = xl|Θ) Θ l=1 X m nl nl λ l T l l l l (25) = argmax log x xt,xt Γxt,yt Θ 0 +1 l=1 ( t=1 t=0 ) X Y Y m m nl m nl

l l l l l (26) = argmax log λx + log Txt,x + logΓxt,yt Θ 0 t+1 ( l=1 l=1 t=1 l=1 t=0 ) X X X X X which is straightforward to solve by adding the constraints that λ and each row of T and Γ are probability distributions. englishMARKOV CHAIN AND HIDDEN MARKOV MODEL 5

Discussion

HMM has been applied to many applications such as speech recognition, robotics, bio-informatics, etc, and it is also the simplest example of what is known as Dynamic Bayesian Networks (DBN) or directed graphical models. More complicated models (generalizations of HMM) include: HMM, HMM decision trees, etc. Other related models include Conditional (CRF) which is a member of undirected graphical models. References

[1] J. Norris. Markov Chains. Cambridge University Press, 1997. [2] M. Jordan. An Introduction to Graphical Models. unpublished manuscript, 2001. [3] L. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 1989.