COMP 182 Algorithmic Thinking
Luay Nakhleh, Rice University
Markov Chains and Computer Science; Hidden Markov Models

❖ What is p(01110000)?
❖ Assume:
❖ 8 independent Bernoulli trials with success probability α?
❖ Answer:
❖ (1−α)^5 α^3
❖ However, what if the assumption of independence doesn’t hold?
❖ That is, what if the outcome in a Bernoulli trial depends on the outcomes of the trials that preceded it?

Markov Chains
❖ Given a sequence of observations X1,X2,…,XT
❖ The basic idea behind a Markov chain (or, Markov model) is to assume that Xt captures all the relevant information for predicting the future.
❖ In this case:

p(X1X2…XT) = p(X1) p(X2|X1) p(X3|X2) ⋯ p(XT|XT−1) = p(X1) ∏_{t=2}^{T} p(Xt|Xt−1)

Markov Chains
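The factorization above can be sketched directly in code. The initial distribution `pi` and transition matrix `A` below are illustrative values, not parameters given in the lecture; as a sanity check, when all rows of A are identical the chain reduces to the independent Bernoulli trials from the opening slide.

```python
# Sketch: probability of a binary string under a first-order Markov chain,
# p(x1...xT) = p(x1) * prod_{t=2..T} p(x_t | x_{t-1}).

def markov_prob(seq, pi, A):
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

seq = [0, 1, 1, 1, 0, 0, 0, 0]          # the string 01110000

# Dependent trials: staying in the same state is more likely.
pi = [0.5, 0.5]
A = [[0.8, 0.2],                        # row i: p(X_t = j | X_{t-1} = i)
     [0.3, 0.7]]
print(markov_prob(seq, pi, A))

# Sanity check: with identical rows, the chain reduces to independent
# Bernoulli trials, giving (1 - alpha)^5 * alpha^3.
alpha = 0.4
pi_ind = [1 - alpha, alpha]
A_ind = [pi_ind, pi_ind]
print(markov_prob(seq, pi_ind, A_ind))
```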
❖ When Xt is discrete, so Xt∈{1,…,K}, the conditional distribution p(Xt|Xt-1) can be written as a K×K matrix, known as the transition matrix A, where Aij=p(Xt=j|Xt-1=i) is the probability of going from state i to state j.
❖ Each row of the matrix sums to one, so this is called a stochastic matrix.

Chapter 17. Markov and hidden Markov models

[Figure 17.1: State transition diagrams for some simple Markov chains. Left: a 2-state chain. Right: a 3-state left-to-right chain.]

Markov Chains

❖ A stationary, finite-state Markov chain is equivalent to a stochastic automaton.
❖ One way to represent a Markov chain is through a state transition diagram: a directed graph where nodes represent states and arrows represent legal transitions, i.e., non-zero elements of A. The weights associated with the arcs are the probabilities.
❖ For example, the 2-state chain

A = | 1−α    α  |    (17.2)
    |  β    1−β |

is illustrated in Figure 17.1 (left), and the 3-state chain

A = | A11  A12   0  |    (17.3)
    |  0   A22  A23 |
    |  0    0    1  |

is illustrated in Figure 17.1 (right). This is called a left-to-right transition matrix, and is commonly used in speech recognition (Section 17.6.2).

Markov Chains
❖ The Aij element of the transition matrix specifies the probability of getting from i to j in one step.
❖ The n-step transition matrix A(n) is defined as

Aij(n) = p(Xt+n = j | Xt = i)

Markov Chains
❖ Obviously, A(1)=A.
❖ The Chapman-Kolmogorov equations state that

Aij(m + n) = ∑_{k=1}^{K} Aik(m) Akj(n)

❖ In words: the probability of getting from i to j in m + n steps is the probability of getting from i to k in m steps, and then from k to j in n steps, summed up over all k.

Markov Chains
❖ Therefore, we have A(m+n)=A(m)A(n).
❖ Hence, A(n)=AA(n−1)=AAA(n−2)=…=A^n.
❖ Thus, we can simulate multiple steps of a Markov chain by “powering up” the transition matrix.

Language Modeling
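As a quick aside before language modeling: the “powering up” computation can be sketched with numpy. The values of α and β below are illustrative, not from the lecture.

```python
# Sketch: n-step transition probabilities via matrix powers, A(n) = A^n,
# for the 2-state chain with illustrative parameters alpha and beta.
import numpy as np

alpha, beta = 0.1, 0.2
A = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

A10 = np.linalg.matrix_power(A, 10)   # A(10): the 10-step transition matrix
print(A10)

# Each row of A^n is still a probability distribution (numerically).
print(A10.sum(axis=1))
```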
❖ One important application of Markov models is to make statistical language models, which are probability distributions over sequences of words. ❖ We define the state space to be all the words in English (or, the language of interest).
❖ The probabilities p(Xt=k) are called unigram statistics.
❖ If we use a first-order Markov model, then p(Xt=k|Xt-1=j) is called a bigram model.

Language Modeling
[Figure 17.2: Unigram and bigram counts from Darwin’s On The Origin Of Species. The 2D picture on the right is a Hinton diagram of the joint distribution. The size of the white squares is proportional to the value of the entry in the corresponding vector/matrix. Based on (MacKay 2003, p22). Figure generated by ngramPlot.]
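In the same spirit as the Darwin figure, unigram and bigram statistics can be counted from a corpus. A minimal character-level sketch (the corpus here is a toy string, not the actual book):

```python
# Sketch: unigram and bigram counts over characters.
from collections import Counter

text = "on the origin of species"
unigrams = Counter(text)
bigrams = Counter(zip(text, text[1:]))

# Unigram statistics: estimate of p(X_t = k), e.g. for the letter 'e'.
total = sum(unigrams.values())
p_e = unigrams["e"] / total

# Bigram statistics: estimate of p(X_t = k | X_{t-1} = j), here p('h' | 't').
t_total = sum(c for (a, _), c in bigrams.items() if a == "t")
p_h_given_t = bigrams[("t", "h")] / t_total
print(p_e, p_h_given_t)
```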
17.2.2.1 MLE for Markov language models

We now discuss a simple way to estimate the transition matrix from training data. The probability of any particular sequence of length T is given by
p(x1:T|θ) = π(x1) A(x1,x2) ⋯ A(xT−1,xT)    (17.8)
          = ∏_{j=1}^{K} (πj)^{I(x1=j)} ∏_{t=2}^{T} ∏_{j=1}^{K} ∏_{k=1}^{K} (Ajk)^{I(xt=k, xt−1=j)}    (17.9)

Hence the log-likelihood of a set of sequences D = (x1,…,xN), where xi = (xi1,…,xi,Ti) is a sequence of length Ti, is given by
log p(D|θ) = ∑_{i=1}^{N} log p(xi|θ) = ∑_j Nj^1 log πj + ∑_j ∑_k Njk log Ajk    (17.10)

where we define the following counts:
Nj^1 ≜ ∑_{i=1}^{N} I(xi1 = j),    Njk ≜ ∑_{i=1}^{N} ∑_{t=1}^{Ti−1} I(xi,t = j, xi,t+1 = k)    (17.11)

Parameter Estimation for Markov Chains
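The log-likelihood above can be evaluated directly from a parameterized chain. A minimal sketch, where π and A are illustrative values rather than estimates from any particular corpus:

```python
# Sketch: log-likelihood of a set of sequences under a Markov chain,
# log p(D|theta) = sum_i [ log pi[x_{i1}] + sum_t log A[x_{i,t}, x_{i,t+1}] ].
import numpy as np

def loglik(seqs, pi, A):
    ll = 0.0
    for s in seqs:
        ll += np.log(pi[s[0]])
        for j, k in zip(s, s[1:]):
            ll += np.log(A[j, k])
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])
print(loglik([[0, 0, 1], [1, 0]], pi, A))
```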
❖ The parameters of a Markov chain, denoted by θ, consist of the transition matrix (A) and the distribution on the initial states (π).
❖ We want to estimate these parameters from a training data set.
❖ Such a data set consists of sequences X1,X2,…,Xm.
❖ Sequence Xk has length Lk. Parameter Estimation for Markov Chains
❖ The maximum likelihood estimate (MLE) of the parameters is easy to obtain from the data:
Nj^1 = ∑_{i=1}^{m} I(X1^i = j),    Njk = ∑_{i=1}^{m} ∑_{t=1}^{Li−1} I(Xt^i = j, Xt+1^i = k)
π̂j = Nj^1 / ∑_i Ni^1,    Âjk = Njk / ∑_{k′} Njk′

Parameter Estimation for Markov Chains
❖ The maximum likelihood estimate (MLE) of the parameters is easy to obtain from the data:

Nj^1 = ∑_{i=1}^{m} I(X1^i = j),    Njk = ∑_{i=1}^{m} ∑_{t=1}^{Li−1} I(Xt^i = j, Xt+1^i = k)

π̂j = Nj^1 / ∑_i Ni^1,    Âjk = Njk / ∑_{k′} Njk′

Indicator function: I(e) = 1 if e is true, 0 if e is false

Parameter Estimation for Markov Chains
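The counting and normalization steps can be sketched as follows. The `smooth` parameter (add-one/Laplace smoothing) and the example sequences are illustrative additions, not part of the lecture’s formulas:

```python
# Sketch: MLE of a Markov chain's parameters (pi, A) from sequences of
# states in {0,...,K-1}, with optional add-one smoothing for zero counts.
import numpy as np

def fit_markov(seqs, K, smooth=1.0):
    N1 = np.zeros(K)             # initial-state counts N_j^1
    N = np.zeros((K, K))         # transition counts N_jk
    for s in seqs:
        N1[s[0]] += 1
        for j, k in zip(s, s[1:]):
            N[j, k] += 1
    pi = (N1 + smooth) / (N1 + smooth).sum()
    A = (N + smooth) / (N + smooth).sum(axis=1, keepdims=True)
    return pi, A

seqs = [[0, 0, 1, 0], [1, 1, 0, 0]]
pi, A = fit_markov(seqs, K=2, smooth=0.0)   # raw MLE, no smoothing
print(pi)
print(A)
```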
❖ It is very important to handle zero counts properly (this is called smoothing). Hidden Markov Models
❖ A hidden Markov model, or HMM, consists of a discrete-time, discrete state Markov chain, with hidden states Zt∈{1,…,K}, plus an observation model p(Xt|Zt) (emission probabilities).
❖ The corresponding joint distribution has the form

p(Z1:T, X1:T) = [ p(Z1) ∏_{t=2}^{T} p(Zt|Zt−1) ] [ ∏_{t=1}^{T} p(Xt|Zt) ]

Hidden Markov Models
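A minimal sketch of this joint factorization. The parameters π, A, and the emission matrix B are made-up illustrative values (binary observations, two hidden states):

```python
# Sketch: joint probability p(z_{1:T}, x_{1:T}) of an HMM,
# p(z,x) = pi[z1] * B[z1,x1] * prod_{t>=2} A[z_{t-1},z_t] * B[z_t,x_t].
import numpy as np

def hmm_joint(z, x, pi, A, B):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, len(z)):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    return p

pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],       # emission probabilities p(x_t | z_t)
              [0.1, 0.9]])

print(hmm_joint([0, 0, 1], [1, 0, 1], pi, A, B))
```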
❖ Example: The “occasionally dishonest casino”
❖ Most of the time, the casino uses a fair die (Z=1), but occasionally it switches for a short period to a loaded die (Z=2) that is skewed towards face 6.
related tasks.

• Gene finding. Here xt represents the DNA nucleotides (A,C,G,T), and zt represents whether we are inside a gene-coding region or not. See e.g., (Schweikerta et al. 2009) for details.

• Protein sequence alignment. Here xt represents an amino acid, and zt represents whether this matches the latent consensus sequence at this location. This model is called a profile HMM and is illustrated in Figure 17.8. The HMM has 3 states, called match, insert and delete. If zt is a match state, then xt is equal to the t’th value of the consensus. If zt is an insert state, then xt is generated from a uniform distribution that is unrelated to the consensus sequence. If zt is a delete state, then xt = −. In this way, we can generate noisy copies of the consensus sequence of different lengths. In Figure 17.8(a), the consensus is “AGC”, and we see various versions of this below. A path through the state transition diagram, shown in Figure 17.8(b), specifies how to align a sequence to the consensus, e.g., for the gnat, the most probable path is D, D, I, I, I, M. This means we delete the A and G parts of the consensus sequence, we insert 3 A’s, and then we match the final C. We can estimate the model parameters by counting the number of such transitions, and the number of emissions from each kind of state, as shown in Figure 17.8(c). See Section 17.5 for more information on training an HMM, and (Durbin et al. 1998) for details on profile HMMs.
Note that for some of these tasks, conditional random fields, which are essentially discriminative versions of HMMs, may be more suitable; see Chapter 19 for details.
17.4 Inference in HMMs
We now discuss how to infer the hidden state sequence of an HMM, assuming the parameters are known. Exactly the same algorithms apply to other chain-structured graphical models, such as chain CRFs (see Section 19.6.1). In Chapter 20, we generalize these methods to arbitrary graphs. And in Section 17.5.2, we show how we can use the output of inference in the context of parameter estimation.

Hidden Markov Models
❖ Example: The “occasionally dishonest casino”
❖ Most of the time, the casino uses a fair die (Z=1), but occasionally it switches for a short period to a loaded die (Z=2) that is skewed towards face 6.

17.4.1 Types of inference problems for temporal models

There are several different kinds of inferential tasks for an HMM (and SSM in general). To illustrate the differences, we will consider an example called the occasionally dishonest casino, from (Durbin et al. 1998). In this model, xt ∈ {1, 2, …, 6} represents which die face shows up, and zt represents the identity of the die that is being used. Most of the time the casino uses a fair die, z = 1, but occasionally it switches to a loaded die, z = 2, for a short period. If z = 1 the observation distribution is a uniform multinoulli over the symbols {1, …, 6}. If z = 2, the observation distribution is skewed towards face 6 (see Figure 17.9). If we sample from this model, we may observe data such as the following:
Listing 17.1 Example output of casinoDemo

Rolls: 664153216162115234653214356634261655234232315142464156663246
Die:   LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL

Here “rolls” refers to the observed symbol and “die” refers to the hidden state (L is loaded and F is fair). Thus we see that the model generates a sequence of symbols, but the statistics of the

Hidden Markov Models
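A sampler in the spirit of casinoDemo can be sketched as follows. The transition and emission probabilities are illustrative assumptions, not the parameters used to generate Listing 17.1:

```python
# Sketch: sampling rolls and hidden die states from the occasionally
# dishonest casino HMM, with illustrative switching/loading probabilities.
import random

random.seed(0)

A = {"F": {"F": 0.95, "L": 0.05},       # transitions between fair/loaded
     "L": {"F": 0.10, "L": 0.90}}
emit = {"F": [1 / 6] * 6,               # fair die: uniform over faces 1..6
        "L": [0.1] * 5 + [0.5]}         # loaded die: skewed towards face 6

def sample_casino(T):
    z, rolls, dice = "F", [], []
    for _ in range(T):
        rolls.append(random.choices(range(1, 7), weights=emit[z])[0])
        dice.append(z)
        z = random.choices(["F", "L"], weights=[A[z]["F"], A[z]["L"]])[0]
    return rolls, dice

rolls, dice = sample_casino(60)
print("Rolls:", "".join(map(str, rolls)))
print("Die:  ", "".join(dice))
```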