Hidden Markov Model: Tutorial
Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Fakhri Karray [email protected] Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada
Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada
Abstract

This is a tutorial paper for the Hidden Markov Model (HMM). First, we briefly review the background on Expectation Maximization (EM), the Lagrange multiplier, the factor graph, the sum-product algorithm, the max-product algorithm, and belief propagation by the forward-backward procedure. Then, we introduce probabilistic graphical models including the Markov random field and the Bayesian network. The Markov property and the Discrete Time Markov Chain (DTMC) are also introduced. We then explain likelihood estimation and EM in HMM in technical detail. We explain evaluation in HMM, where direct calculation and forward-backward belief propagation are both covered. Afterwards, estimation in HMM is covered, where both the greedy approach and the Viterbi algorithm are detailed. Then, we explain how to train HMM using EM and the Baum-Welch algorithm. We also explain how to use HMM in some applications such as speech and action recognition.

1. Introduction

Assume we have a time series of hidden random variables. We name the hidden random variables states (Ghahramani, 2001). Every hidden random variable can generate (emit) an observation symbol or output. Therefore, we have a set of states and a set of observation symbols. A Hidden Markov Model (HMM) has some hidden states and some emitted observation symbols (Rabiner, 1989). HMM models the probability density of a sequence of observed symbols (Roweis, 2003).

This paper is a technical tutorial for evaluation, estimation, and training in HMM. We also explain some applications of HMM. There exist many different applications for HMM. One of the most famous applications is in speech recognition (Rabiner, 1989; Gales et al., 2008), where an HMM is trained for every word in the speech (Rabiner & Juang, 1986). Another application of HMM is in action recognition (Yamato et al., 1992), because an action can be seen as a sequence or time series of poses (Ghojogh et al., 2017; Mokari et al., 2018). An interesting application of HMM, which is not trivial at first glance, is in face recognition (Samaria, 1994). The face can be seen as a sequence of organs modeled by HMM (Nefian & Hayes, 1998). Usage of HMM in DNA analysis is another application of HMM (Eddy, 2004).

The remainder of the paper is as follows. In Section 2, we review some necessary background for the paper. Section 3 introduces probabilistic graphical models. Expectation maximization in HMM is explained in Section 4. Afterwards, evaluation, estimation, and training in HMM are explained in Sections 5, 6, and 7, respectively. Some applications are introduced in more detail in Section 8. Finally, Section 9 concludes the paper.

2. Background

Some background required for better understanding of the HMM algorithms is mentioned in this section.

2.1. Expectation Maximization

This section is taken from our previous paper (Ghojogh et al., 2019). Sometimes, the data are not fully observable. For example, the data are known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease, but for the convenience of the patients who participated in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as X_i > 0 does not tell us whether X_i = 2 or X_i = 1000.

In this case, Maximum Likelihood Estimation (MLE) cannot be directly applied, as we do not have access to complete information and some data are missing. In this case, Expectation Maximization (EM) (Moon, 1996) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data are missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...

Figure 1. (a) An example factor graph, and (b) its representation as a bipartite graph.
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!

Assume D^(obs) and D^(miss) denote the observed data (the X_i's = 0 in the above example) and the missing data (the X_i's > 0 in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.

Let Θ be the parameter which is the variable for likelihood estimation. In the E-step, the expectation of the log-likelihood ℓ(Θ) is taken with respect to the missing data D^(miss) in order to have a mean estimation of it. Let Q(Θ) denote the expectation of the likelihood with respect to D^(miss):

Q(Θ) := E_{D^(miss) | D^(obs), Θ} [ℓ(Θ)].   (1)

Note that in the above expectation, D^(obs) and Θ are conditioned on, so they are treated as constants and not random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., Q(Θ); therefore:

Θ̂ = arg max_Θ Q(Θ).   (2)

These two steps are iteratively repeated until convergence of the estimated parameters Θ̂.

2.2. Lagrange Multiplier

Suppose we have a multivariate function Q(Θ_1, ..., Θ_K) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained and its constraint is the equality P(Θ_1, ..., Θ_K) = c where c is a constant. So, the constrained optimization problem is:

maximize_{Θ_1,...,Θ_K} Q(Θ_1, ..., Θ_K),
subject to P(Θ_1, ..., Θ_K) = c.   (3)

For solving this problem, we can introduce a new variable η which is called the "Lagrange multiplier". Also, a new function L(Θ_1, ..., Θ_K, η), called the "Lagrangian", is introduced:

L(Θ_1, ..., Θ_K, η) = Q(Θ_1, ..., Θ_K) − η (P(Θ_1, ..., Θ_K) − c).   (4)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

∇_{Θ_1,...,Θ_K,η} L = 0,   (5)

which gives us:

∇_{Θ_1,...,Θ_K} L = 0 ⟹ ∇_{Θ_1,...,Θ_K} Q = η ∇_{Θ_1,...,Θ_K} P,
∇_η L = 0 ⟹ P(Θ_1, ..., Θ_K) = c.

2.3. Factor Graph, the Sum-Product and Max-Product Algorithms, and the Forward-Backward Procedure

2.3.1. Factor Graph

A factor graph is a bipartite graph where the two partitions of the graph include the functions (or factors) f_j, ∀j, and the variables x_i, ∀i (Kschischang et al., 2001; Loeliger, 2004). Every factor is a function of two or more variables. The variables of a factor are connected to it by edges. Hence, we have a bipartite graph. Figure 1 shows an example factor graph where the factor and variable nodes are represented by circles and squares, respectively.

2.3.2. The Sum-Product Algorithm

Assume we have a factor graph with some variable nodes and factor nodes. One of the nodes is considered as the root of the tree. The result of the algorithm does not depend on which node is taken as the root (Bishop, 2006). Usually, the first variable or the last one is considered as the root.
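As a concrete check of the Lagrange-multiplier recipe of Section 2.2 (the same pattern reappears in the M-step of Section 4.3), consider maximizing Q(p) = Σ_i c_i log(p_i) subject to Σ_i p_i = 1. Setting the Lagrangian gradient to zero gives c_i / p_i = η, and the constraint gives η = Σ_i c_i, so p_i = c_i / Σ_j c_j. The sketch below, with made-up weights c, verifies this numerically; it is an illustrative assumption, not part of the paper.

```python
import numpy as np

# Maximize Q(p) = sum_i c_i log(p_i) subject to sum_i p_i = 1 via the
# Lagrangian L = Q - eta * (sum_i p_i - 1). The stationarity condition
# c_i / p_i = eta plus the constraint yields p_i = c_i / sum_j c_j.
c = np.array([3.0, 1.0, 6.0])      # made-up positive weights
p_star = c / c.sum()               # closed-form solution from the Lagrangian

def Q(p):
    return np.sum(c * np.log(p))

# No other point on the probability simplex does better (Gibbs' inequality).
rng = np.random.default_rng(0)
for _ in range(200):
    p = rng.dirichlet(np.ones(len(c)))
    assert Q(p) <= Q(p_star) + 1e-9
```

The same normalization-by-a-multiplier argument is what later turns expected counts into the HMM parameter updates.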
Then, some messages from the leaf (or leaves) are sent up the tree to the root (Bishop, 2006). The messages can be initialized to any fixed constant values (see Eq. (8)). The message from the variable node x_i to its neighbor factor node f_j is:

m_{x_i → f_j} = ∏_{f ∈ N(x_i) \ f_j} m_{f → x_i},   (6)

where f ∈ N(x_i) \ f_j means all the neighbor factor nodes of the variable x_i except f_j. Also, m_{x_i → f_j} and m_{f → x_i} denote the message from the variable x_i to the factor f_j and the message from the factor f to the variable x_i, respectively. The message from the factor node f_j to its neighbor variable node x_i is:

m_{f_j → x_i} = Σ_{values of x ∈ N(f_j) \ x_i} f_j ∏_{x ∈ N(f_j) \ x_i} m_{x → f_j}.   (7)

Eq. (7) includes both a sum and a product. This is why this procedure is named the sum-product algorithm (Kschischang et al., 2001; Bishop, 2006). In the factor graph, if the variable x_i has degree one, the message is:

m_{x_i → f_j} = [1, 1, ..., 1]^T ∈ R^k,   (8)

where k is the number of possible values that the variables can take. Moreover, if the variable x_i has degree two, connected to f_j and f_ℓ, the message is:

m_{x_i → f_j} = m_{f_ℓ → x_i}.   (9)

A good example for better understanding of Eqs. (6) and (7) exists in chapter 8 of (Bishop, 2006).

1  Initialize the messages to [1, 1, ..., 1]^T
2  for time t from 1 to τ do
3      Forward pass: do the sum-product algorithm
4      Backward pass: do the sum-product algorithm
5      belief = forward message × backward message
6      if max(belief change) is small enough then
7          break the loop
8  Return beliefs
Algorithm 1: The forward-backward algorithm using the sum-product sub-algorithm

2.3.3. The Max-Product Algorithm

The max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) is similar to the sum-product algorithm, where the summation operator is replaced by the maximum operator. In this algorithm, the messages from the variable nodes to the factor nodes and vice versa are:

m_{x_i → f_j} = ∏_{f ∈ N(x_i) \ f_j} m_{f → x_i},   (10)

m_{f_j → x_i} = max_{values of x ∈ N(f_j) \ x_i} f_j ∏_{x ∈ N(f_j) \ x_i} m_{x → f_j}.   (11)

2.3.4. Belief Propagation with the Forward-Backward Procedure

In order to learn the beliefs over the variable and factor nodes, belief propagation can be applied using a forward-backward procedure (see chapter 8 in (Bishop, 2006)). The forward-backward algorithm using the sum-product sub-algorithm is shown in Algorithm 1. In the forward-backward procedure, the belief over a random variable x_i is the product of the forward and backward messages.

For the exact convergence (and inference) of belief propagation, the graph should be a tree, i.e., it should be cycle-free (Kschischang et al., 2001). If the factor graph has cycles, the inference is approximate and the algorithm is stopped manually after a while. Belief propagation in such graphs is called loopy belief propagation (Murphy et al., 1999; Bishop, 2006). Note that Markov chains are cycle-free, so exact belief propagation can be applied to them.

It is also noteworthy that, as every variable which is connected to only two factor nodes merely passes the message through (see Eq. (9)), a new graphical model, named the normal graph (Forney, 2001), was proposed, which states every variable as an edge rather than a node (vertex). More details about the factor graph and the sum-product algorithm can be found in the references (Kschischang et al., 2001; Loeliger, 2004) and chapter 8 of (Bishop, 2006). Also note that some alternatives to the sum-product algorithm are min-product, max-product, and min-sum (Bishop, 2006).

3. Probabilistic Graphical Models

3.1. Markov and Bayesian Networks

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high dimensional space (Koller & Friedman, 2009). In other words, a PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. There exist edges between two variables which interact with one another in terms of probability. Different conditional probabilities can be represented by a PGM.

There exist two types of PGM, which are the Markov network (also called the Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and the Bayesian network, the edges of the graph are undirected and directed, respectively.
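Before moving on, the sum-product rules of Eqs. (6) and (7) in Section 2.3.2 can be checked numerically on a toy factor graph. The sketch below uses made-up factor tables (two variables, one pairwise factor, one unary factor, k = 2 values per variable) and is an illustrative assumption, not part of the paper.

```python
import numpy as np

# Minimal sum-product sketch on a tiny chain factor graph:
#   x1 -- f(x1, x2) -- x2, with a unary factor g(x1).
g = np.array([0.6, 0.4])              # g(x1), made-up table
f = np.array([[0.7, 0.3],
              [0.2, 0.8]])            # f(x1, x2), made-up table

# Message from x1 to f: product of messages into x1 (only g here), Eq. (6).
m_x1_to_f = g
# Message from f to x2: sum over x1 of f(x1, x2) * m_x1_to_f(x1), Eq. (7).
m_f_to_x2 = f.T @ m_x1_to_f
# The normalized belief over x2 equals the exact marginal of
# p(x1, x2) proportional to g(x1) f(x1, x2).
belief_x2 = m_f_to_x2 / m_f_to_x2.sum()

joint = g[:, None] * f
exact_marginal = joint.sum(axis=0) / joint.sum()
assert np.allclose(belief_x2, exact_marginal)
```

On a tree-structured graph this message passing is exact, which is the property the forward-backward procedure for HMM exploits later.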
3.2. The Markov Property

Consider a time series of random variables X_1, X_2, ..., X_n. In general, the joint probability of these random variables can be written as:

P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) ... P(X_n | X_{n−1}, ..., X_2, X_1),   (12)

according to the chain (or multiplication) rule in probability. The (first order) Markov property is an assumption which states that in a time series of random variables X_1, X_2, ..., X_n, every random variable is merely dependent on the latest previous random variable and not the others. In other words:

P(X_i | X_{i−1}, X_{i−2}, ..., X_2, X_1) = P(X_i | X_{i−1}).   (13)

Hence, with the Markov property, the chain rule is simplified to:

P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_n | X_{n−1}).   (14)

The Markov property can be of any order. For example, in a second order Markov property, a random variable is dependent on the latest and one-to-latest variables. Usually, the default Markov property is of order one. A stochastic process which has the Markov property is called a Markovian process (or Markov process).

3.3. Discrete Time Markov Chain

A Markov chain is a PGM which has the Markov property. The Markov chain can be either directed or undirected. Usually, the Markov chain is a Bayesian network where the edges are directed. It is important not to confuse the Markov chain with the Markov network.

There are two types of Markov chain, which are the Discrete Time Markov Chain (DTMC) (Ross, 2014) and the Continuous Time Markov Chain (CTMC) (Lawler, 2018). As is obvious from their names, in the DTMC and the CTMC, the time of transitions from one random variable to another is and is not partitioned into discrete slots, respectively.

If the variables in a DTMC are considered as states, the DTMC can be viewed as a Finite-State Machine (FSM) or a Finite-State Automaton (FSA) (Booth, 1967). Also, note that the DTMC can be viewed as a Sub-Shift of Finite Type in modeling dynamic systems (Brin & Stuck, 2002).

3.4. Hidden Markov Model (HMM)

HMM is a DTMC which contains a sequence of hidden variables (named states) in addition to a sequence of emitted observation symbols (outputs). We have an observation sequence of length τ, where t ∈ {1, ..., τ} indexes the clock times. Let n and m denote the number of states and observation symbols, respectively. We show the sets of states and possible observation symbols by S = {s_1, ..., s_n} and O = {o_1, ..., o_m}, respectively. We show being in state s_i and in observation symbol o_i at time t by s_i(t) and o_i(t), respectively.

Let R^{n×n} ∋ A = [a_{i,j}] be the state Transition Probability Matrix (TPM), where:

a_{i,j} := P(s_j(t+1) | s_i(t)).   (15)

We have:

Σ_{j=1}^{n} a_{i,j} = 1.   (16)

The Emission Probability Matrix (EPM) is denoted by R^{n×m} ∋ B = [b_{i,j}], where:

b_{i,j} := P(o_j(t) | s_i(t)),   (17)

which is the probability of emission of the observation symbols from the states. We have:

Σ_{j=1}^{m} b_{i,j} = 1.   (18)

Let the initial state distribution be denoted by the vector R^n ∋ π = [π_1, ..., π_n]^T, where:

π_i := P(s_i(1)),   (19)

and:

Σ_{i=1}^{n} π_i = 1,   (20)

to satisfy the probability properties. An HMM model is denoted by the tuple λ = (π, A, B).

Assume that a sequence of states is generated by the HMM according to the TPM. We denote this generated sequence of states by S^g := s^g(1), ..., s^g(τ), where s^g(t) ∈ S, ∀t. Likewise, a sequence of outputs (observations) is generated by the HMM according to the EPM. We denote this generated sequence of output symbols by O^g := o^g(1), ..., o^g(τ), where o^g(t) ∈ O, ∀t.

We denote the probability of transition from state s^g(t) to s^g(t+1) by a_{s^g(t),s^g(t+1)}. So:

a_{s^g(t),s^g(t+1)} := P(s^g(t+1) | s^g(t)).   (21)

Note that s^g(t) ∈ S and s^g(t+1) ∈ S. We also denote:

π_{s^g(1)} := P(s^g(1)).   (22)

Likewise, we denote the probability of state s^g(t) emitting the observation o^g(t) by b_{s^g(t),o^g(t)}. So:

b_{s^g(t),o^g(t)} := P(o^g(t) | s^g(t)).   (23)

Note that s^g(t) ∈ S and o^g(t) ∈ O. Figure 2 depicts the structure of an HMM model.
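The generative story above (draw s^g(1) from π, then alternate emissions from B and transitions from A) can be sketched as follows. All numbers (n = 2 states, m = 3 symbols) and the sampler itself are illustrative assumptions, not part of the paper.

```python
import numpy as np

# A toy HMM lambda = (pi, A, B); made-up parameters for illustration.
pi = np.array([0.6, 0.4])              # initial distribution, Eq. (19)
A  = np.array([[0.7, 0.3],             # TPM, rows sum to one (Eq. (16))
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],        # EPM, rows sum to one (Eq. (18))
               [0.1, 0.3, 0.6]])
rng = np.random.default_rng(0)

def sample(tau):
    """Generate a state sequence S^g and an observation sequence O^g of length tau."""
    states, obs = [], []
    s = rng.choice(len(pi), p=pi)          # draw s^g(1) from pi
    for _ in range(tau):
        states.append(s)
        obs.append(rng.choice(B.shape[1], p=B[s]))   # emit a symbol from row s of B
        s = rng.choice(len(pi), p=A[s])               # transition via row s of A
    return states, obs

S_g, O_g = sample(5)
assert len(S_g) == len(O_g) == 5
```

Only O^g is observed in practice; S^g stays hidden, which is what makes evaluation and estimation non-trivial.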
Figure 2. A Hidden Markov Model.

4. Likelihood and Expectation Maximization in HMM

The EM algorithm can be used for analysis and training in HMM (Moon, 1996). In the following, we explain the details of EM for HMM.

4.1. Likelihood

According to Fig. 2, the likelihood of occurrence of the state sequence S^g and the observation sequence O^g is (Ghahramani, 2001):

L = P(S^g, O^g)
  = P(s^g(1)) P(next states | previous states) P(O^g | S^g)
  = P(s^g(1)) ∏_{t=1}^{τ−1} P(s^g(t+1) | s^g(t)) ∏_{t=1}^{τ} P(o^g(t) | s^g(t))
  = π_{s^g(1)} ∏_{t=1}^{τ−1} a_{s^g(t),s^g(t+1)} ∏_{t=1}^{τ} b_{s^g(t),o^g(t)}.   (24)

The log-likelihood is:

ℓ = log(L) = log π_{s^g(1)} + Σ_{t=1}^{τ−1} log(a_{s^g(t),s^g(t+1)}) + Σ_{t=1}^{τ} log(b_{s^g(t),o^g(t)}).   (25)

Let 1_i be a vector with entry one at index i, i.e., 1_i := [0, 0, ..., 0, 1, 0, ..., 0]^T. Also, 1_{s^g(1)} means the vector with entry one at the index of the first state in the sequence S^g. For example, if there are three possible states and a sequence of length three with s^g(1) = 2, s^g(2) = 1, s^g(3) = 3, we have 1_{s^g(1)} = [0, 1, 0]^T.

The terms in this log-likelihood are:

log π_{s^g(1)} = 1_{s^g(1)}^T log π,   (26)

a_{s^g(t),s^g(t+1)} = ∏_{i=1}^{n} ∏_{j=1}^{n} (a_{i,j})^{1_{s^g(t)}[i] 1_{s^g(t+1)}[j]}
⟹ log(a_{s^g(t),s^g(t+1)}) = Σ_{i=1}^{n} Σ_{j=1}^{n} 1_{s^g(t)}[i] 1_{s^g(t+1)}[j] log(a_{i,j}) = 1_{s^g(t)}^T (log A) 1_{s^g(t+1)},   (27)

b_{s^g(t),o^g(t)} = ∏_{i=1}^{n} ∏_{j=1}^{m} (b_{i,j})^{1_{s^g(t)}[i] 1_{o^g(t)}[j]}
⟹ log(b_{s^g(t),o^g(t)}) = Σ_{i=1}^{n} Σ_{j=1}^{m} 1_{s^g(t)}[i] 1_{o^g(t)}[j] log(b_{i,j}) = 1_{s^g(t)}^T (log B) 1_{o^g(t)},   (28)

where the exponent 1_{s^g(t)}[i] 1_{s^g(t+1)}[j] (respectively 1_{s^g(t)}[i] 1_{o^g(t)}[j]) is one only for the indices actually visited and is zero otherwise. Hence, we can write the log-likelihood as:

ℓ = 1_{s^g(1)}^T log π + Σ_{t=1}^{τ−1} 1_{s^g(t)}^T (log A) 1_{s^g(t+1)} + Σ_{t=1}^{τ} 1_{s^g(t)}^T (log B) 1_{o^g(t)}.   (29)

4.2. E-step in EM

The missing variables in the log-likelihood are 1_{s^g(1)}, 1_{s^g(t)}, 1_{s^g(t+1)}, and 1_{o^g(t)}. The expectation of the log-likelihood with respect to the missing variables is:

Q(π, A, B) = E(ℓ) = E(1_{s^g(1)}^T log π) + Σ_{t=1}^{τ−1} E(1_{s^g(t)}^T (log A) 1_{s^g(t+1)}) + Σ_{t=1}^{τ} E(1_{s^g(t)}^T (log B) 1_{o^g(t)}).   (30)

4.3. M-step in EM

We maximize Q(π, A, B) with respect to the parameters π_i, a_{i,j}, and b_{i,j}:

maximize Q(π, A, B)
subject to Σ_{i=1}^{n} π_i = 1,
           Σ_{j=1}^{n} a_{i,j} = 1, ∀i ∈ {1, ..., n},
           Σ_{j=1}^{m} b_{i,j} = 1, ∀i ∈ {1, ..., n},   (31)

where the constraints ensure that the probabilities in the initial states, the transition matrix, and the emission matrix add to one.
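The complete-data log-likelihood of Eq. (25) is easy to evaluate for a given (S^g, O^g) pair and can be checked against the product form of Eq. (24). The sketch below uses made-up parameters and sequences as illustrative assumptions.

```python
import numpy as np

# A toy HMM and a toy (state, observation) pair; all numbers are made up.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.5], [0.2, 0.8]])
S  = [0, 0, 1]   # state indices s^g(1..tau)
O  = [1, 0, 1]   # symbol indices o^g(1..tau)
tau = len(S)

# Eq. (25): log pi term + sum of log transitions + sum of log emissions.
ell = np.log(pi[S[0]])
ell += sum(np.log(A[S[t], S[t + 1]]) for t in range(tau - 1))
ell += sum(np.log(B[S[t], O[t]]) for t in range(tau))

# Eq. (24): the same likelihood as a raw product.
L = pi[S[0]]
for t in range(tau - 1):
    L *= A[S[t], S[t + 1]]
for t in range(tau):
    L *= B[S[t], O[t]]

assert np.isclose(ell, np.log(L))
```

Working in log space, as Eq. (25) does, is also what keeps long sequences numerically stable.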
The Lagrangian (Boyd & Vandenberghe, 2004) for this optimization problem is:

L = E(1_{s^g(1)}^T log π) + Σ_{t=1}^{τ−1} E(1_{s^g(t)}^T (log A) 1_{s^g(t+1)}) + Σ_{t=1}^{τ} E(1_{s^g(t)}^T (log B) 1_{o^g(t)})
  − η_1 (Σ_{i=1}^{n} π_i − 1) − η_2 (Σ_{j=1}^{n} a_{i,j} − 1) − η_3 (Σ_{j=1}^{m} b_{i,j} − 1),   (32)

where η_1, η_2, and η_3 are the Lagrange multipliers.

The first term in the Lagrangian is simplified to E(1_{s^g(1)}[1] log π_1 + ... + 1_{s^g(1)}[n] log π_n) = E(1_{s^g(1)}[1] log π_1) + ... + E(1_{s^g(1)}[n] log π_n); therefore, we have:

∂L/∂π_i = E(1_{s^g(1)}[i]) / π_i − η_1 = 0 ⟹ π_i = (1/η_1) E(1_{s^g(1)}[i]).   (33)

Σ_{i=1}^{n} π_i = 1 ⟹ (1/η_1) [E(1_{s^g(1)}[1]) + ... + E(1_{s^g(1)}[i]) + ... + E(1_{s^g(1)}[n])] = (1/η_1) E(0 + ... + 1 + ... + 0) = 1 ⟹ η_1 = 1.   (34)

∴ π_i = E(1_{s^g(1)}[i]).   (35)

Similarly, we have:

∂L/∂a_{i,j} = Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i] 1_{s^g(t+1)}[j]) / a_{i,j} − η_2 = 0
⟹ a_{i,j} = (1/η_2) Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i] 1_{s^g(t+1)}[j]).   (36)

Σ_{j=1}^{n} a_{i,j} = 1 ⟹ (1/η_2) Σ_{j=1}^{n} Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i] 1_{s^g(t+1)}[j])
= (1/η_2) Σ_{t=1}^{τ−1} [E(1_{s^g(t)}[i] × 0) + ... + E(1_{s^g(t)}[i] × 1) + ... + E(1_{s^g(t)}[i] × 0)]   (one at the j-th element)
= (1/η_2) Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i]) = 1 ⟹ η_2 = Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i]).   (37)

∴ a_{i,j} = Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i] 1_{s^g(t+1)}[j]) / Σ_{t=1}^{τ−1} E(1_{s^g(t)}[i]).   (38)

Likewise, we have:

∂L/∂b_{i,j} = Σ_{t=1}^{τ} E(1_{s^g(t)}[i] 1_{o^g(t)}[j]) / b_{i,j} − η_3 = 0
⟹ b_{i,j} = (1/η_3) Σ_{t=1}^{τ} E(1_{s^g(t)}[i] 1_{o^g(t)}[j]).   (39)

Σ_{j=1}^{m} b_{i,j} = 1 ⟹ (1/η_3) Σ_{j=1}^{m} Σ_{t=1}^{τ} E(1_{s^g(t)}[i] 1_{o^g(t)}[j])
= (1/η_3) Σ_{t=1}^{τ} [E(1_{s^g(t)}[i] × 0) + ... + E(1_{s^g(t)}[i] × 1) + ... + E(1_{s^g(t)}[i] × 0)]   (one at the j-th element)
= (1/η_3) Σ_{t=1}^{τ} E(1_{s^g(t)}[i]) = 1 ⟹ η_3 = Σ_{t=1}^{τ} E(1_{s^g(t)}[i]).   (40)

∴ b_{i,j} = Σ_{t=1}^{τ} E(1_{s^g(t)}[i] 1_{o^g(t)}[j]) / Σ_{t=1}^{τ} E(1_{s^g(t)}[i]).   (41)

In Section 7.1, we will simplify the statements for π_i, a_{i,j}, and b_{i,j}.

5. Evaluation in HMM

Evaluation in HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): given the observation sequence O^g = o^g(1), ..., o^g(τ) and the HMM model λ = (π, A, B), we want to compute P(O^g | λ), i.e., the probability of the generated observation sequence. In summary:

given: O^g, λ ⟹ P(O^g | λ) = ?   (42)

Note that P(O^g | λ) can also be denoted by P(O^g; λ). The quantity P(O^g | λ) is sometimes referred to as the likelihood.
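The closed-form M-step solutions of Eqs. (35), (38), and (41) above are all "expected count divided by normalizer" updates. The sketch below feeds made-up expected indicator values (standing in for E(1_{s^g(t)}[i]) and E(1_{s^g(t)}[i] 1_{s^g(t+1)}[j])) through those three formulas; the numbers and array names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Made-up expected pairwise indicators xi[t, i, j] ~ E(1_{s(t)}[i] 1_{s(t+1)}[j]);
# each xi[t] sums to one over (i, j), as a joint distribution should.
tau, n, m = 4, 2, 3
rng = np.random.default_rng(1)
xi = rng.dirichlet(np.ones(n * n), size=tau - 1).reshape(tau - 1, n, n)
# Marginal expected indicators gamma[t, i] ~ E(1_{s(t)}[i]), consistent with xi.
gamma = np.vstack([xi.sum(axis=2), xi[-1].sum(axis=0)[None, :]])
obs = [0, 2, 1, 2]   # made-up o^g(1..tau) as symbol indices

pi_new = gamma[0]                                          # Eq. (35)
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Eq. (38)
B_num = np.zeros((n, m))                                   # Eq. (41) numerator:
for t, o in enumerate(obs):                                # only times emitting
    B_num[:, o] += gamma[t]                                # symbol j contribute
B_new = B_num / gamma.sum(axis=0)[:, None]

# The updates automatically satisfy the constraints of Eq. (31).
assert np.allclose(pi_new.sum(), 1.0)
assert np.allclose(A_new.sum(axis=1), 1.0)
assert np.allclose(B_new.sum(axis=1), 1.0)
```

This normalization structure is exactly what the Lagrange multipliers η_1, η_2, η_3 enforce.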
5.1. Direct Calculation

Assume that the state sequence S^g = s^g(1), ..., s^g(τ) has caused the observation sequence O^g = o^g(1), ..., o^g(τ). Hence, we have:

P(O^g | S^g, λ) = b_{s^g(1),o^g(1)} b_{s^g(2),o^g(2)} ... b_{s^g(τ),o^g(τ)}.   (43)

On the other hand, the probability of the state sequence S^g = s^g(1), ..., s^g(τ) is:

P(S^g | λ) = π_{s^g(1)} a_{s^g(1),s^g(2)} ... a_{s^g(τ−1),s^g(τ)}.   (44)

According to the chain rule, we have:

P(O^g, S^g | λ) = P(O^g | S^g, λ) P(S^g | λ)   (45)
= π_{s^g(1)} b_{s^g(1),o^g(1)} a_{s^g(1),s^g(2)} b_{s^g(2),o^g(2)} ... a_{s^g(τ−1),s^g(τ)} b_{s^g(τ),o^g(τ)},   (46)

which is the probability of occurrence of both the observation sequence O^g and the state sequence S^g.

Any state sequence may have caused the observation sequence O^g. Therefore, according to the law of total probability, we have:

P(O^g | λ) = Σ_{∀S^g} P(O^g, S^g | λ)
= Σ_{∀S^g} P(O^g | S^g, λ) P(S^g | λ)
= Σ_{∀s^g(1)} Σ_{∀s^g(2)} ... Σ_{∀s^g(τ)} π_{s^g(1)} b_{s^g(1),o^g(1)} a_{s^g(1),s^g(2)} b_{s^g(2),o^g(2)} ... a_{s^g(τ−1),s^g(τ)} b_{s^g(τ),o^g(τ)},   (47)

which means that we start with the first state, then output the first observation, and then go to the next state. This procedure is repeated until the last state. The summations are over all possible states in the state sequence.

The time complexity of this direct calculation of P(O^g | λ) is in the order of O(2τ n^τ), because at every clock time t ∈ {1, ..., τ}, there are n possible states to go through (Rabiner & Juang, 1986). Because of the n^τ factor, this is very inefficient, especially for long sequences (large τ).

5.2. The Forward-Backward Procedure

A more efficient algorithm for evaluation in HMM is the forward-backward procedure (Rabiner & Juang, 1986). The forward-backward procedure includes two stages, i.e., the forward and backward belief propagation stages.

5.2.1. The Forward Belief Propagation

Similar to the belief propagation procedure, we define the forward message until time t as:

α_i(t) := P(o^g(1), o^g(2), ..., o^g(t), s^g(t) = s_i | λ),   (48)

which is the probability of the partial observation sequence until time t and being in state s_i at time t.

Algorithm 2 shows the forward belief propagation from time one to time τ. In this algorithm, α_i(t) is solved inductively. The initial forward message is:

α_i(1) = π_i b_{i,o^g(1)}, ∀i ∈ {1, ..., n},   (49)

which is the probability of occurrence of the initial state s_i and the observation symbol o^g(1). The next forward messages are calculated as:

α_j(t+1) = [Σ_{i=1}^{n} α_i(t) a_{i,j}] b_{j,o^g(t+1)},   (50)

which is the probability of occurrence of the observation sequence o^g(1), ..., o^g(t), being in state s_i at time t, going to state j at time t+1, and the observation symbol o^g(t+1). The summation is because, at time t, the state s^g(t) can be any state, so we should use the law of total probability.

1  Input: λ = (π, A, B)
2  α_i(1) = π_i b_{i,o^g(1)}, ∀i ∈ {1, ..., n}
3  for time t from 1 to (τ−1) do
4      for state j from 1 to n do
5          α_j(t+1) = [Σ_{i=1}^{n} α_i(t) a_{i,j}] b_{j,o^g(t+1)}
6  P(O^g | λ) = Σ_{i=1}^{n} α_i(τ)
7  Return P(O^g | λ), ∀i, ∀t: α_i(t)
Algorithm 2: The forward belief propagation in the forward-backward procedure

Proposition 1. Eq. (50) can be interpreted as the sum-product algorithm (see Section 2.3.2).

Proof. Algorithm 2 has iterations over states indexed by j. Also, Eq. (50) has a sum over states indexed by i. Consider all states indexed by i and a specific state s_j (see Fig. 3). We can consider every two successive states as a factor node in the factor graph. Hence, a state indexed by i and the state s_j form a factor node which we denote by f^{i,j}. The observation symbol o^g(t+1) emitted from s_j is considered as the variable node in the factor graph.

The message α_i(t) is the message received at the factor node f^{i,j} so far. Therefore, in the sum-product algorithm (see Eq. (6)), we have:

m_{o^g(t) → f^{i,j}} = α_i(t).   (51)

The message a_{i,j} b_{j,o^g(t+1)} is the message received from the factor nodes f^{i,j}, ∀i, at the variable node o^g(t+1). Therefore, in the sum-product algorithm (see Eq. (7)), we have:

m_{f^{i,j} → o^g(t+1)} = a_{i,j} b_{j,o^g(t+1)}.   (52)

Hence:

m_{f → o^g(t+1)} = Σ_{i=1}^{n} m_{f^{i,j} → o^g(t+1)} m_{o^g(t) → f^{i,j}_next}   (53)
= Σ_{i=1}^{n} a_{i,j} b_{j,o^g(t+1)} α_i(t) = [Σ_{i=1}^{n} α_i(t) a_{i,j}] b_{j,o^g(t+1)},   (54)

where f is the set of all factor nodes {f^{i,j}}_{i=1}^{n}, and f_next is the factor f^{i,j} in the next time slot. ∎

Note that if we consider Fig. 3 for all states indexed by j, a lattice network is formed.
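The forward recursion of Eqs. (49), (50), and (55) can be implemented in a few lines and checked against the O(2τ n^τ) direct calculation of Eq. (47). All parameters and the observation sequence below are made-up illustrative assumptions.

```python
import numpy as np
from itertools import product

# Forward belief propagation (Algorithm 2) on a toy HMM.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
O  = [0, 2, 1, 2]                 # o^g(1..tau) as symbol indices
n, tau = len(pi), len(O)

alpha = np.zeros((tau, n))
alpha[0] = pi * B[:, O[0]]                      # Eq. (49)
for t in range(tau - 1):                        # Eq. (50)
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
p_forward = alpha[-1].sum()                     # Eq. (55): P(O^g | lambda)

# Check against the direct enumeration over all n^tau state sequences, Eq. (47).
p_direct = 0.0
for S in product(range(n), repeat=tau):
    p = pi[S[0]] * B[S[0], O[0]]
    for t in range(1, tau):
        p *= A[S[t - 1], S[t]] * B[S[t], O[t]]
    p_direct += p
assert np.isclose(p_forward, p_direct)
```

The recursion touches each (state, state, time) triple once, which is the O(τn²) cost discussed in Section 5.2.3, versus the exponential cost of the enumeration.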
Finally, using the law of total probability, we have:

P(O^g | λ) = Σ_{i=1}^{n} P(O^g, s^g(τ) = s_i | λ) = Σ_{i=1}^{n} α_i(τ),   (55)

which is the desired probability in the evaluation for HMM. Hence, the forward belief propagation suffices for evaluation. In the following, we explain the backward belief propagation, which is required for other sections of the paper.

Figure 3. Modeling forward belief propagation for HMM as a sum-product algorithm in a factor graph.

5.2.2. The Backward Belief Propagation

Again similar to the belief propagation procedure, we define the backward message from time τ down to t+1 as the probability of the partial observation sequence after time t, given being in state s_i at time t. Algorithm 3 shows the backward belief propagation. The initial backward message is β_i(τ) = 1, ∀i ∈ {1, ..., n}, and the earlier backward messages are calculated as:

β_i(t) = Σ_{j=1}^{n} a_{i,j} b_{j,o^g(t+1)} β_j(t+1).

The summation is because the state s^g(t+1) can be any state, so we should use the law of total probability.

1  Input: λ = (π, A, B)
2  β_i(τ) = 1, ∀i ∈ {1, ..., n}
3  for time t from (τ−1) down to 1 do
4      for state i from 1 to n do
5          β_i(t) = Σ_{j=1}^{n} a_{i,j} b_{j,o^g(t+1)} β_j(t+1)
6  Return ∀i, ∀t: β_i(t)
Algorithm 3: The backward belief propagation in the forward-backward procedure

It is noteworthy that for very long sequences, the α_i(t) and β_i(t) become extremely small, recursively (see Algorithms 2 and 3). Hence, some people normalize them at every iteration of the algorithm (Ghahramani, 2001):

α_i(t) ← α_i(t) / Σ_{j=1}^{n} α_j(t),   (59)

β_i(t) ← β_i(t) / Σ_{j=1}^{n} β_j(t),   (60)

in order to sum to one. However, note that if this normalization is done, we will have P(O^g | λ) = 1.

5.2.3. Time Complexity

The time complexity of the forward belief propagation is in the order of O(τn²), because we have O(nτ) loops, each of which includes a summation over n states (Rabiner & Juang, 1986). This is much more efficient than the complexity of the direct calculation of P(O^g | λ), which was O(2τ n^τ). Similarly, the time complexity of the backward belief propagation is in the order of O(τn²) (Rabiner & Juang, 1986).
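The backward recursion of Algorithm 3 can be sketched alongside the forward pass; a well-known consistency check (not stated in this section, but standard for the forward-backward procedure) is that Σ_i α_i(t) β_i(t) = P(O^g | λ) holds at every t. All numbers below are made-up illustrative assumptions.

```python
import numpy as np

# Forward and backward belief propagation on a toy HMM.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
O  = [1, 0, 2]
n, tau = len(pi), len(O)

alpha = np.zeros((tau, n))
alpha[0] = pi * B[:, O[0]]                      # Eq. (49)
for t in range(tau - 1):                        # Eq. (50)
    alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]

beta = np.ones((tau, n))                        # beta_i(tau) = 1
for t in range(tau - 2, -1, -1):                # Algorithm 3, line 5
    beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])

p = alpha[-1].sum()                             # P(O^g | lambda), Eq. (55)
for t in range(tau):
    assert np.isclose((alpha[t] * beta[t]).sum(), p)
```

Both passes cost O(τn²), matching Section 5.2.3, and their product at each time step is the belief used later for estimation and training.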