Markov Chains and Hidden Markov Models
Total Page:16
File Type:pdf, Size:1020Kb
COMP 182 Algorithmic Thinking Luay Nakhleh Markov Chains and Computer Science Hidden Markov Models Rice University ❖ What is p(01110000)? ❖ Assume: ❖ 8 independent Bernoulli trials with success probability of α? ❖ Answer: ❖ (1-α)5α3 ❖ However, what if the assumption of independence doesn’t hold? ❖ That is, what if the outcome in a Bernoulli trial depends on the outcomes of the trials the preceded it? Markov Chains ❖ Given a sequence of observations X1,X2,…,XT ❖ The basic idea behind a Markov chain (or, Markov model) is to assume that Xt captures all the relevant information for predicting the future. ❖ In this case: T p(X1X2 ...XT )=p(X1)p(X2 X1)p(X3 X2) p(XT XT 1)=p(X1) p(Xt Xt 1) | | ··· | − | − t=2 Y Markov Chains ❖ When Xt is discrete, so Xt∈{1,…,K}, the conditional distribution p(Xt|Xt-1) can be written as a K×K matrix, known as the transition matrix A, where Aij=p(Xt=j|Xt-1=i) is the probability of going from state i to state j. ❖ Each row of the matrix sums to one, so this is called a stochastic matrix. 590 Chapter 17. Markov and hidden Markov models Markov Chains 1 α 1 β − α − A11 A22 A33 1 2 A12 A23 1 2 3 β ❖ A finite-state Markov chain is equivalent(a) (b) Figure 17.1 State transition diagrams for some simple Markov chains. Left: a 2-state chain. Right: a to a stochastic automaton3-state left-to-right. chain. ❖ One way to represent aAstationary,finite-stateMarkovchainisequivalenttoa Markov chain is stochastic automaton.Itiscommon 590 to visualizeChapter such 17. automataMarkov and by hidden drawing Markov a directed models graph, where nodes represent states and arrows through a state transitionrepresent legaldiagram transitions, i.e., non-zero elements of A.Thisisknownasastate transition diagram.Theweightsassociatedwiththearcsaretheprobabilities.Forexample,thefollowing 2-state chain 1 α 1 β − α A11 A22 A33 − 1 αα A = A12 A23 (17.2) 1 2 −β 1 β ! 1 − 2 " 3 β is illustrated in Figure 17.1(left). The following 3-state chain (a) A11 A12 (b) 0 A = 0 A22 A23 (17.3) Figure 17.1 State transition diagrams for some simple⎛ Markov001 chains. Left: a⎞ 2-state chain. Right: a 3-state left-to-right chain. is illustrated⎝ in Figure 17.1(right).⎠ This is called a left-to-right transition matrix,andiscom- monly used in speech recognition (Section 17.6.2). Astationary,finite-stateMarkovchainisequivalenttoastochastic automaton.Itiscommon The A element of the transition matrix specifies the probability of getting from i to j in to visualize such automata by drawing a directed graph,ij where nodes represent states and arrows represent legal transitions, i.e., non-zero elementsone step. of TheA.Thisisknownasan-step transition matrixstate transitionA(n) is defined as diagram.Theweightsassociatedwiththearcsaretheprobabilities.Forexample,thefollowing Aij(n) ! p(Xt+n = j Xt = i) (17.4) 2-state chain | 1 αα which is the probability of getting from i to j in exactly n steps. Obviously A(1) = A.The A = (17.2) −β 1 β Chapman-Kolmogorov equations state that ! − " is illustrated in Figure 17.1(left). The following 3-state chain K Aij(m + n)= Aik(m)Akj(n) (17.5) A11 A12 0 k=1 A = 0 A22 A23 ' (17.3) ⎛ 001⎞ In words, the probability of getting from i to j in m + n steps is just the probability of getting is illustrated⎝ in Figure 17.1(right).⎠ This is calledfrom ai left-to-rightto k in m steps, transition and then matrix from,andiscom-k to j in n steps, summed up over all k.Wecanwrite monly used in speech recognition (Sectionthe 17.6.2). above as a matrix multiplication The Aij element of the transition matrix specifiesA(m + then)= probabilityA(m)A of(n getting) from i to j in (17.6) one step. The n-step transition matrix A(n) is defined as Hence A (n) p(X = j X = i) (17.4) ij ! t+n | t A(n)=AA(n 1) = AAA(n 2) = = An (17.7) which is the probability of getting from i to j in exactly n steps.− Obviously A(1) =−A.The··· Chapman-Kolmogorov equations state thatThus we can simulate multiple steps of a Markov chain by “powering up” the transition matrix. K Aij(m + n)= Aik(m)Akj(n) (17.5) k'=1 In words, the probability of getting from i to j in m + n steps is just the probability of getting from i to k in m steps, and then from k to j in n steps, summed up over all k.Wecanwrite the above as a matrix multiplication A(m + n)=A(m)A(n) (17.6) Hence A(n)=AA(n 1) = AAA(n 2) = = An (17.7) − − ··· Thus we can simulate multiple steps of a Markov chain by “powering up” the transition matrix. Markov Chains ❖ The Aij element of the transition matrix specifies the probability of getting from i to j in one step. ❖ The n-step transition matrix A(n) is defined as A (n)=p(X = j X = i) ij t+n | t Markov Chains ❖ Obviously, A(1)=A. ❖ The Chapman-Kolmogorov equations state that K Aij(m + n)= Aik(m)Akj(n) kX=1 Markov Chains ❖ Therefore, we have A(m+n)=A(m)A(n). ❖ Hence, A(n)=AA(n-1)=AAA(n-2)=…=An. ❖ Thus, we can simulate multiple steps of a Markov chain by “powering up” the transition matrix. Language Modeling ❖ One important application of Markov models is to make statistical language models, which are probability distributions over sequences of words. ❖ We define the state space to be all the words in English (or, the language of interest). ❖ The probabilities p(Xt=k) are called unigram statistics. ❖ If we use a first-order Markov model, then p(Xt=k|Xt-1=j) is called a bigram model. 592 Language ModelingChapter 17. Markov and hidden Markov models Bigrams Unigrams _ a b c d e f g h i j k l m n o p q r s t u v w x y z 1 0.16098 _ _ 2 0.06687 a a 3 0.01414 b b 4 0.02938 c c 5 0.03107 d d 6 0.11055 e e 7 0.02325 f f 8 0.01530 g g 9 0.04174 h h 10 0.06233 i i 11 0.00060 j j 12 0.00309 k k 13 0.03515 l l 14 0.02107 m m 15 0.06007 n n 16 0.06066 o o 17 0.01594 p p 18 0.00077 q q 19 0.05265 r r 20 0.05761 s s 21 0.07566 t t 22 0.02149 u u 23 0.00993 v v 24 0.01341 w w 25 0.00208 x x 26 0.01381 y y 27 0.00039 z z Unigram and bigram counts for Darwin’s On the Origin of Species Figure 17.2 Unigram and bigram counts from Darwin’s On The Origin Of Species.The2Dpictureonthe right is a Hinton diagram of the joint distribution. The size of the white squares is proportional to the value of the entry in the corresponding vector/ matrix. Based on (MacKay 2003, p22). Figure generated by ngramPlot. 17.2.2.1 MLE for Markov language models We now discuss a simple way to estimate the transition matrix from training data. The proba- bility of any particular sequence of length T is given by p(x1:T θ)=π(x1)A(x1,x2 ) ...A(xT 1,xT ) (17.8) | − K T K K I(x1=j) I(xt=k,xt 1=j) = (πj ) (Ajk) − (17.9) j=1 t=2 j=1 ! ! ! k!=1 Hence the log-likelihood of a set of sequences =(x ,...,x ),wherex =(x ,...,x ) D 1 N i i1 i,Ti is a sequence of length Ti,isgivenby N logp( θ)= logp(x θ)= N 1 logπ + N logA (17.10) D| i| j j jk jk i=1 j j " " " "k where we define the following counts: N N T 1 i− 1 Nj ! I(xi1 = j),Njk ! I(xi,t = j, xi,t+1 = k) (17.11) i=1 i=1 t=1 " " " Parameter Estimation for Markov Chains ❖ The parameters of a Markov chain, denoted by θ, consist of the transition matrix (A) and the distribution on the initial states (#). ❖ We want to estimate these parameters from a training data set. ❖ Such a data set consists of sequences X1,X2,…,Xm. ❖ Sequence Xk has length Lk. Parameter Estimation for Markov Chains ❖ The maximum likelihood estimate (MLE) of the parameters is easy to obtain from the data: m m L 1 i− 1 i i i Nj = I(X1 = j) Njk = I(Xt = j, Xt+1 = k) i=1 i=1 t=1 X X X N 1 N j Aˆ = jk ⇡ˆj = 1 jk Njk i Ni k0 0 P P Parameter Estimation for Markov Chains ❖ The maximum likelihood estimate (MLE) of the parameters is easy to obtain from the data: m m L 1 i− 1 i i i Nj = I(X1 = j) Njk = I(Xt = j, Xt+1 = k) i=1 i=1 t=1 X X X N 1 N j Aˆ = jk ⇡ˆj = 1 jk Njk i Ni k0 0 1ife is true Indicator function:P (e)= P I 0ife is false ⇢ Parameter Estimation for Markov Chains ❖ It is very important to handle zero counts properly (this is called smoothing). Hidden Markov Models ❖ A hidden Markov model, or HMM, consists of a discrete-time, discrete state Markov chain, with hidden states Zt∈{1,…,K}, plus an observation model p(Xt|Zt) (emission probabilities).