
Hidden Markov Model: Tutorial

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray [email protected] Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This is a tutorial paper for Hidden Markov Model (HMM). First, we briefly review the background on Expectation Maximization (EM), Lagrange multiplier, factor graph, the sum-product algorithm, the max-product algorithm, and belief propagation by the forward-backward procedure. Then, we introduce probabilistic graphical models including Markov random field and Bayesian network. The Markov property and the Discrete Time Markov Chain (DTMC) are also introduced. We then explain likelihood estimation and EM in HMM in technical detail. We explain evaluation in HMM, where both direct calculation and the forward-backward procedure are covered. Afterwards, estimation in HMM is covered, where both the greedy approach and the Viterbi algorithm are detailed. Then, we explain how to train HMM using EM and the Baum-Welch algorithm. We also explain how to use HMM in some applications such as speech and action recognition.

1. Introduction

Assume we have a set of hidden random variables. We name the hidden random variables as states (Ghahramani, 2001). Every hidden random variable can generate (emit) an observation symbol or output. Therefore, we have a set of states and a set of observation symbols. A Hidden Markov Model (HMM) has some hidden states and some emitted observation symbols (Rabiner, 1989). HMM models the probability density of a sequence of observed symbols (Roweis, 2003).

This paper is a technical tutorial for evaluation, estimation, and training in HMM. We also explain some applications of HMM. There exist many different applications for HMM. One of the most famous applications is in speech recognition (Rabiner, 1989; Gales et al., 2008) where an HMM is trained for every word in the speech (Rabiner & Juang, 1986). Another application of HMM is in action recognition (Yamato et al., 1992) because an action can be seen as a sequence or time series of poses (Ghojogh et al., 2017; Mokari et al., 2018). An interesting application of HMM, which is not trivial at first glance, is in face recognition (Samaria, 1994). The face can be seen as a sequence of organs being modeled by HMM (Nefian & Hayes, 1998). Usage of HMM in DNA analysis is another application of HMM (Eddy, 2004).

The remainder of the paper is as follows. In Section 2, we review some necessary background for the paper. Section 3 introduces probabilistic graphical models. Expectation maximization in HMM is explained in Section 4. Afterwards, evaluation, estimation, and training in HMM are explained in Sections 5, 6, and 7, respectively. Some applications are introduced in more detail in Section 8. Finally, Section 9 concludes the paper.

2. Background

Some background required for better understanding of the HMM algorithms is mentioned in this section.

2.1. Expectation Maximization

This section is taken from our previous paper (Ghojogh et al., 2019). Sometimes, the data are not fully observable. For example, the data are known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease but, for convenience of the patients participating in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as $X_i > 0$ does not tell us whether $X_i = 2$ or $X_i = 1000$. In this case, MLE cannot be directly applied because we do not have access to the complete information and some data are missing. In this case, Expectation Maximization (EM) (Moon, 1996) is useful.
The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data are missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!

Assume $D^{(obs)}$ and $D^{(miss)}$ denote the observed data (the $X_i$'s $= 0$ in the above example) and the missing data (the $X_i$'s $> 0$ in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.

Let $\Theta$ be the parameter which is the variable for likelihood estimation. In the E-step, the expectation of the log-likelihood $\ell(\Theta)$ is taken with respect to the missing data $D^{(miss)}$ in order to have a mean estimation of it. Let $Q(\Theta)$ denote the expectation of the likelihood with respect to $D^{(miss)}$:

$$Q(\Theta) := \mathbb{E}_{D^{(miss)} \,|\, D^{(obs)}, \Theta}\big[\ell(\Theta)\big]. \quad (1)$$

Note that in the above expectation, $D^{(obs)}$ and $\Theta$ are conditioned on, so they are treated as constants and not as random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., $Q(\Theta)$; therefore:

$$\widehat{\Theta} = \arg\max_{\Theta} Q(\Theta). \quad (2)$$

These two steps are iteratively repeated until convergence of the estimated parameters $\widehat{\Theta}$.

2.2. Lagrange Multiplier

Suppose we have a multi-variate function $Q(\Theta_1, \dots, \Theta_K)$ (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained and its constraint is the equality $P(\Theta_1, \dots, \Theta_K) = c$ where $c$ is a constant. So, the constrained optimization problem is:

$$\text{maximize}_{\Theta_1, \dots, \Theta_K} \ Q(\Theta_1, \dots, \Theta_K), \quad \text{subject to} \ P(\Theta_1, \dots, \Theta_K) = c. \quad (3)$$

For solving this problem, we can introduce a new variable $\eta$, called the "Lagrange multiplier". Also, a new function $\mathcal{L}(\Theta_1, \dots, \Theta_K, \eta)$, called the "Lagrangian", is introduced:

$$\mathcal{L}(\Theta_1, \dots, \Theta_K, \eta) = Q(\Theta_1, \dots, \Theta_K) - \eta \big( P(\Theta_1, \dots, \Theta_K) - c \big). \quad (4)$$

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

$$\nabla_{\Theta_1, \dots, \Theta_K, \eta}\, \mathcal{L} \overset{\text{set}}{=} 0, \quad (5)$$

which gives us:

$$\nabla_{\Theta_1, \dots, \Theta_K}\, \mathcal{L} \overset{\text{set}}{=} 0 \implies \nabla_{\Theta_1, \dots, \Theta_K}\, Q = \eta\, \nabla_{\Theta_1, \dots, \Theta_K}\, P,$$
$$\nabla_{\eta}\, \mathcal{L} \overset{\text{set}}{=} 0 \implies P(\Theta_1, \dots, \Theta_K) = c.$$

2.3. Factor Graph, the Sum-Product and Max-Product Algorithms, and the Forward-Backward Procedure

2.3.1. FACTOR GRAPH

A factor graph is a bipartite graph where the two partitions of the graph include the functions (or factors) $f_j, \forall j$ and the variables $x_i, \forall i$ (Kschischang et al., 2001; Loeliger, 2004). Every factor is a function of two or more variables. The variables of a factor are connected to it by edges. Hence, we have a bipartite graph. Figure 1 shows an example factor graph, where the factor and variable nodes are represented by circles and squares, respectively.

Figure 1. (a) An example factor graph, and (b) its representation as a bipartite graph.

2.3.2. THE SUM-PRODUCT ALGORITHM

Assume we have a factor graph with some variable nodes and factor nodes. For this algorithm, one of the nodes is considered as the root of the tree. The result of the algorithm does not depend on which node is taken as the root (Bishop, 2006). Usually, the first variable or the last one is considered as the root. Then, some messages from the leaf (or leaves) are sent up the tree to the root (Bishop, 2006). The messages can be initialized to any fixed constant values (see Eq. (8)). The message from the variable node $x_i$ to its neighbor factor node $f_j$ is:

$$m_{x_i \to f_j} = \prod_{f \in \mathcal{N}(x_i) \setminus f_j} m_{f \to x_i}, \quad (6)$$

where $f \in \mathcal{N}(x_i) \setminus f_j$ denotes all the neighbor factor nodes of the variable $x_i$ except $f_j$. Also, $m_{x_i \to f_j}$ and $m_{f \to x_i}$ denote the message from the variable $x_i$ to the factor $f_j$ and the message from the factor $f$ to the variable $x_i$, respectively. The message from a factor node $f_j$ to its neighbor variable node $x_i$ is:

$$m_{f_j \to x_i} = \sum_{\text{values of } x \in \mathcal{N}(f_j) \setminus x_i} f_j \prod_{x \in \mathcal{N}(f_j) \setminus x_i} m_{x \to f_j}. \quad (7)$$

Eq. (7) includes both a sum and a product; this is why this procedure is named the sum-product algorithm (Kschischang et al., 2001; Bishop, 2006). In the factor graph, if the variable $x_i$ has degree one, the message is:

$$m_{x_i \to f_j} = [1, 1, \dots, 1]^\top \in \mathbb{R}^k, \quad (8)$$

where $k$ is the number of possible values that the variables can take. Moreover, if the variable $x_i$ has degree two, connected to $f_j$ and $f_\ell$, the message is:

$$m_{x_i \to f_j} = m_{f_\ell \to x_i}. \quad (9)$$

A good example for better understanding of Eqs. (6) and (7) exists in chapter 8 of (Bishop, 2006).

For the exact convergence (and inference) of belief propagation, the graph should be a tree, i.e., it should be cycle-free (Kschischang et al., 2001). If the factor graph has cycles, the inference is approximate and the algorithm is stopped manually after a while. Belief propagation in such graphs is called loopy belief propagation (Murphy et al., 1999; Bishop, 2006). Note that Markov chains are cycle-free, so exact belief propagation can be applied to them.

It is also noteworthy that, as every variable which is connected to only two factor nodes merely passes the message through (see Eq. (9)), a new graphical model, named the normal graph (Forney, 2001), was proposed, which states every variable as an edge rather than a node (vertex). More details about the factor graph and the sum-product algorithm can be found in (Kschischang et al., 2001; Loeliger, 2004) and chapter 8 of (Bishop, 2006). Also note that some alternatives to the sum-product algorithm are the min-product, max-product, and min-sum algorithms (Bishop, 2006).

2.3.3. THE MAX-PRODUCT ALGORITHM

The max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) is similar to the sum-product algorithm, where the summation operator is replaced by the maximum operator. In this algorithm, the messages from the variable nodes to the factor nodes and vice versa are:

$$m_{x_i \to f_j} = \prod_{f \in \mathcal{N}(x_i) \setminus f_j} m_{f \to x_i}, \quad (10)$$
$$m_{f_j \to x_i} = \max_{\text{values of } x \in \mathcal{N}(f_j) \setminus x_i} f_j \prod_{x \in \mathcal{N}(f_j) \setminus x_i} m_{x \to f_j}. \quad (11)$$

2.3.4. BELIEF PROPAGATION WITH THE FORWARD-BACKWARD PROCEDURE

In order to learn the beliefs over the variable and factor nodes, belief propagation can be applied using a forward-backward procedure (see chapter 8 in (Bishop, 2006)). The forward-backward algorithm using the sum-product sub-algorithm is shown in Algorithm 1. In the forward-backward procedure, the belief over a random variable $x_i$ is the product of the forward and backward messages.

Algorithm 1: The forward-backward algorithm using the sum-product sub-algorithm
1 Initialize the messages to $[1, 1, \dots, 1]^\top$
2 for time $t$ from 1 to $\tau$ do
3   Forward pass: do the sum-product algorithm
4   Backward pass: do the sum-product algorithm
5   belief = forward message × backward message
6   if max(belief change) is small then
7     break the loop
8 Return beliefs

3. Probabilistic Graphical Models

3.1. Markov and Bayesian Networks

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high dimensional space (Koller & Friedman, 2009). In other words, PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. There exist edges between two variables which have interaction with one another in terms of probability. Different conditional probabilities can be represented by a PGM.

There exist two types of PGM, which are the Markov network (also called Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and the Bayesian network, the edges of the graph are undirected and directed, respectively.

3.2. The Markov Property

Consider a time series of random variables $X_1, X_2, \dots, X_n$. In general, the joint probability of these random variables can be written as:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 | X_1)\, P(X_3 | X_2, X_1) \cdots P(X_n | X_{n-1}, \dots, X_2, X_1), \quad (12)$$

according to the chain (or multiplication) rule in probability. The (first order) Markov property is an assumption which states that in a time series of random variables $X_1, X_2, \dots, X_n$, every random variable is merely dependent on the latest previous random variable and not on the others. In other words:

$$P(X_i | X_{i-1}, X_{i-2}, \dots, X_2, X_1) = P(X_i | X_{i-1}). \quad (13)$$

Hence, with the Markov property, the chain rule is simplified to:

$$P(X_1, X_2, \dots, X_n) = P(X_1)\, P(X_2 | X_1)\, P(X_3 | X_2) \cdots P(X_n | X_{n-1}). \quad (14)$$

The Markov property can be of any order. For example, in a second order Markov property, a random variable depends on the latest and second-to-latest variables. Usually, the default Markov property is of order one. A stochastic process which has the Markov property is called a Markovian process (or Markov process).

3.3. Discrete Time Markov Chain

A Markov chain is a PGM which has the Markov property. A Markov chain can be either directed or undirected. Usually, a Markov chain is a Bayesian network where the edges are directed. It is important not to confuse the Markov chain with the Markov network.

There are two types of Markov chain: the Discrete Time Markov Chain (DTMC) (Ross, 2014) and the Continuous Time Markov Chain (CTMC) (Lawler, 2018). As is obvious from their names, in DTMC and CTMC, the time of transitions from one random variable to another is and is not partitioned into discrete slots, respectively.

If the variables in a DTMC are considered as states, the DTMC can be viewed as a Finite-State Machine (FSM) or a Finite-State Automaton (FSA) (Booth, 1967). Also, note that the DTMC can be viewed as a Sub-Shift of Finite Type in modeling dynamical systems (Brin & Stuck, 2002).

3.4. Hidden Markov Model (HMM)

HMM is a DTMC which contains a sequence of hidden variables (named states) in addition to a sequence of emitted observation symbols (outputs).

We have an observation sequence of length $\tau$, which is the number of clock times, $t \in \{1, \dots, \tau\}$. Let $n$ and $m$ denote the number of states and observation symbols, respectively. We show the sets of states and possible observation symbols by $S = \{s_1, \dots, s_n\}$ and $O = \{o_1, \dots, o_m\}$, respectively. We show being in state $s_i$ and in observation symbol $o_i$ at time $t$ by $s_i(t)$ and $o_i(t)$, respectively. Let $\mathbb{R}^{n \times n} \ni A = [a_{i,j}]$ be the state Transition Probability Matrix (TPM), where:

$$a_{i,j} := P\big(s_j(t+1) \,|\, s_i(t)\big). \quad (15)$$

We have:

$$\sum_{j=1}^{n} a_{i,j} = 1. \quad (16)$$

The Emission Probability Matrix (EPM) is denoted by $\mathbb{R}^{n \times m} \ni B = [b_{i,j}]$, where:

$$b_{i,j} := P\big(o_j(t) \,|\, s_i(t)\big), \quad (17)$$

which is the probability of emission of the observation symbols from the states. We have:

$$\sum_{j=1}^{m} b_{i,j} = 1. \quad (18)$$

Let the initial state distribution be denoted by the vector $\mathbb{R}^{n} \ni \pi = [\pi_1, \dots, \pi_n]^\top$, where:

$$\pi_i := P\big(s_i(1)\big), \quad (19)$$

and:

$$\sum_{i=1}^{n} \pi_i = 1, \quad (20)$$

to satisfy the probability properties. An HMM model is denoted by the tuple $\lambda = (\pi, A, B)$.

Assume that a sequence of states is generated by the HMM according to the TPM. We denote this generated sequence of states by $S^g := s^g(1), \dots, s^g(\tau)$ where $s^g(t) \in S, \forall t$. Likewise, a sequence of outputs (observations) is generated by the HMM according to the EPM. We denote this generated sequence of output symbols by $O^g := o^g(1), \dots, o^g(\tau)$ where $o^g(t) \in O, \forall t$.

We denote the probability of transition from state $s^g(t)$ to $s^g(t+1)$ by $a_{s^g(t), s^g(t+1)}$. So:

$$a_{s^g(t), s^g(t+1)} := P\big(s^g(t+1) \,|\, s^g(t)\big). \quad (21)$$

Note that $s^g(t) \in S$ and $s^g(t+1) \in S$. We also denote:

$$\pi_{s^g(1)} := P\big(s^g(1)\big). \quad (22)$$

Likewise, we denote the probability of state $s^g(t)$ emitting the observation $o^g(t)$ by $b_{s^g(t), o^g(t)}$. So:

$$b_{s^g(t), o^g(t)} := P\big(o^g(t) \,|\, s^g(t)\big). \quad (23)$$

Note that $s^g(t) \in S$ and $o^g(t) \in O$. Figure 2 depicts the structure of an HMM model.
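To make the notation concrete, the following is a minimal NumPy sketch (not from the original paper) that samples a state sequence $S^g$ and an observation sequence $O^g$ from a toy model $\lambda = (\pi, A, B)$; the particular numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy HMM lambda = (pi, A, B) with n = 2 states and m = 3 observation symbols.
pi = np.array([0.6, 0.4])                      # initial state distribution, Eq. (19)
A = np.array([[0.7, 0.3],                      # transition probability matrix, Eq. (15)
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],                 # emission probability matrix, Eq. (17)
              [0.1, 0.3, 0.6]])

def sample_hmm(pi, A, B, tau):
    """Sample (S^g, O^g) of length tau from the HMM, following the TPM and EPM."""
    n, m = B.shape
    states, observations = np.empty(tau, dtype=int), np.empty(tau, dtype=int)
    states[0] = rng.choice(n, p=pi)                       # s^g(1) ~ pi
    observations[0] = rng.choice(m, p=B[states[0]])       # o^g(1) ~ b_{s^g(1), .}
    for t in range(1, tau):
        states[t] = rng.choice(n, p=A[states[t - 1]])     # s^g(t) ~ a_{s^g(t-1), .}
        observations[t] = rng.choice(m, p=B[states[t]])   # o^g(t) ~ b_{s^g(t), .}
    return states, observations

S_g, O_g = sample_hmm(pi, A, B, tau=10)
print(S_g, O_g)
```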

Figure 2. A Hidden Markov Model

4. Likelihood and Expectation Maximization in HMM

The EM algorithm can be used for analysis and training in HMM (Moon, 1996). In the following, we explain the details of EM for HMM.

4.1. Likelihood

According to Fig. 2, the likelihood of occurrence of the state sequence $S^g$ and the observation sequence $O^g$ is (Ghahramani, 2001):

$$L = P(S^g, O^g) = P\big(s^g(1)\big)\, P(\text{next states} \,|\, \text{previous states})\, P(O^g | S^g) = P\big(s^g(1)\big) \prod_{t=1}^{\tau-1} P\big(s^g(t+1) | s^g(t)\big) \prod_{t=1}^{\tau} P\big(o^g(t) | s^g(t)\big) = \pi_{s^g(1)} \prod_{t=1}^{\tau-1} a_{s^g(t), s^g(t+1)} \prod_{t=1}^{\tau} b_{s^g(t), o^g(t)}. \quad (24)$$

The log-likelihood is:

$$\ell = \log(L) = \log \pi_{s^g(1)} + \sum_{t=1}^{\tau-1} \log\big(a_{s^g(t), s^g(t+1)}\big) + \sum_{t=1}^{\tau} \log\big(b_{s^g(t), o^g(t)}\big). \quad (25)$$

Let $\mathbf{1}_i$ be a vector with entry one at index $i$, i.e., $\mathbf{1}_i := [0, 0, \dots, 0, 1, 0, \dots, 0]^\top$. Also, $\mathbf{1}_{s^g(1)}$ means the vector with entry one at the index of the first state in the sequence $S^g$. For example, if there are three possible states and a sequence of length three with $s^g(1) = 2$, $s^g(2) = 1$, $s^g(3) = 3$, we have $\mathbf{1}_{s^g(1)} = [0, 1, 0]^\top$.

The terms in this log-likelihood are:

$$\log \pi_{s^g(1)} = \mathbf{1}_{s^g(1)}^\top \log \pi, \quad (26)$$

$$a_{s^g(t), s^g(t+1)} = \prod_{i=1}^{n} \prod_{j=1}^{n} (a_{i,j})^{\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]} \implies \log\big(a_{s^g(t), s^g(t+1)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j] \log(a_{i,j}) = \mathbf{1}_{s^g(t)}^\top (\log A)\, \mathbf{1}_{s^g(t+1)}, \quad (27)$$

$$b_{s^g(t), o^g(t)} = \prod_{i=1}^{n} \prod_{j=1}^{m} (b_{i,j})^{\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]} \implies \log\big(b_{s^g(t), o^g(t)}\big) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j] \log(b_{i,j}) = \mathbf{1}_{s^g(t)}^\top (\log B)\, \mathbf{1}_{o^g(t)}, \quad (28)$$

where the indicator vectors simply pick out the entry corresponding to the visited state and emitted symbol. Hence, we can write the log-likelihood as:

$$\ell = \mathbf{1}_{s^g(1)}^\top \log \pi + \sum_{t=1}^{\tau-1} \mathbf{1}_{s^g(t)}^\top (\log A)\, \mathbf{1}_{s^g(t+1)} + \sum_{t=1}^{\tau} \mathbf{1}_{s^g(t)}^\top (\log B)\, \mathbf{1}_{o^g(t)}. \quad (29)$$

4.2. E-step in EM

The missing variables in the log-likelihood are $\mathbf{1}_{s^g(1)}$, $\mathbf{1}_{s^g(t)}$, $\mathbf{1}_{s^g(t+1)}$, and $\mathbf{1}_{o^g(t)}$. The expectation of the log-likelihood with respect to the missing variables is:

$$Q(\pi, A, B) = \mathbb{E}(\ell) = \mathbb{E}\big(\mathbf{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}^\top (\log A)\, \mathbf{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}^\top (\log B)\, \mathbf{1}_{o^g(t)}\big). \quad (30)$$

4.3. M-step in EM

We maximize $Q(\pi, A, B)$ with respect to the parameters $\pi_i$, $a_{i,j}$, and $b_{i,j}$:

$$\text{maximize}_{\pi, A, B} \ Q(\pi, A, B), \quad \text{subject to} \ \sum_{i=1}^{n} \pi_i = 1, \ \sum_{j=1}^{n} a_{i,j} = 1 \ \forall i \in \{1, \dots, n\}, \ \sum_{j=1}^{m} b_{i,j} = 1 \ \forall i \in \{1, \dots, n\}, \quad (31)$$

where the constraints ensure that the probabilities in the initial states, the transition matrix, and the emission matrix add to one.
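As a quick numerical illustration of the log-likelihood in Eqs. (25) and (29), the following minimal NumPy sketch (not part of the original paper; the model and sequences are the hypothetical ones from the earlier sampling sketch) computes $\ell$ both directly and through the indicator-vector form.

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

S_g = np.array([0, 0, 1, 1])        # a hypothetical state sequence s^g(1..tau)
O_g = np.array([1, 0, 2, 2])        # a hypothetical observation sequence o^g(1..tau)

# Direct form, Eq. (25): log pi + sum of log transition terms + sum of log emission terms.
ell = np.log(pi[S_g[0]])
ell += sum(np.log(A[S_g[t], S_g[t + 1]]) for t in range(len(S_g) - 1))
ell += sum(np.log(B[S_g[t], O_g[t]]) for t in range(len(S_g)))

# Indicator form, Eq. (29): one-hot vectors picking entries of log pi, log A, log B.
def one_hot(index, length):
    v = np.zeros(length)
    v[index] = 1.0
    return v

ell_indicator = one_hot(S_g[0], 2) @ np.log(pi)
ell_indicator += sum(one_hot(S_g[t], 2) @ np.log(A) @ one_hot(S_g[t + 1], 2)
                     for t in range(len(S_g) - 1))
ell_indicator += sum(one_hot(S_g[t], 2) @ np.log(B) @ one_hot(O_g[t], 3)
                     for t in range(len(S_g)))

assert np.isclose(ell, ell_indicator)
print(ell)
```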

The Lagrangian (Boyd & Vandenberghe, 2004) for this optimization problem is:

$$\mathcal{L} = \mathbb{E}\big(\mathbf{1}_{s^g(1)}^\top \log \pi\big) + \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}^\top (\log A)\, \mathbf{1}_{s^g(t+1)}\big) + \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}^\top (\log B)\, \mathbf{1}_{o^g(t)}\big) - \eta_1 \Big(\sum_{i=1}^{n} \pi_i - 1\Big) - \eta_2 \Big(\sum_{j=1}^{n} a_{i,j} - 1\Big) - \eta_3 \Big(\sum_{j=1}^{m} b_{i,j} - 1\Big), \quad (32)$$

where $\eta_1$, $\eta_2$, and $\eta_3$ are the Lagrange multipliers.

The first term in the Lagrangian is simplified to $\mathbb{E}\big(\mathbf{1}_{s^g(1)}[1] \log \pi_1 + \cdots + \mathbf{1}_{s^g(1)}[n] \log \pi_n\big) = \mathbb{E}\big(\mathbf{1}_{s^g(1)}[1]\big) \log \pi_1 + \cdots + \mathbb{E}\big(\mathbf{1}_{s^g(1)}[n]\big) \log \pi_n$; therefore, we have:

$$\frac{\partial \mathcal{L}}{\partial \pi_i} = \frac{1}{\pi_i}\, \mathbb{E}\big(\mathbf{1}_{s^g(1)}[i]\big) - \eta_1 \overset{\text{set}}{=} 0 \implies \pi_i = \frac{1}{\eta_1}\, \mathbb{E}\big(\mathbf{1}_{s^g(1)}[i]\big). \quad (33)$$

$$\sum_{i=1}^{n} \pi_i = 1 \overset{(33)}{\implies} \frac{1}{\eta_1} \Big( \mathbb{E}\big(\mathbf{1}_{s^g(1)}[1]\big) + \cdots + \mathbb{E}\big(\mathbf{1}_{s^g(1)}[n]\big) \Big) = \frac{1}{\eta_1} \overset{\text{set}}{=} 1 \implies \eta_1 = 1, \quad (34)$$

because exactly one entry of $\mathbf{1}_{s^g(1)}$ is one, so the expectations sum to one. Therefore:

$$\therefore \quad \pi_i = \mathbb{E}\big(\mathbf{1}_{s^g(1)}[i]\big). \quad (35)$$

Similarly, we have:

$$\frac{\partial \mathcal{L}}{\partial a_{i,j}} = \frac{1}{a_{i,j}} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big) - \eta_2 \overset{\text{set}}{=} 0 \implies a_{i,j} = \frac{1}{\eta_2} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big). \quad (36)$$

$$\sum_{j=1}^{n} a_{i,j} = 1 \overset{(36)}{\implies} \frac{1}{\eta_2} \sum_{j=1}^{n} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big) = \frac{1}{\eta_2} \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big) \overset{\text{set}}{=} 1 \implies \eta_2 = \sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big), \quad (37)$$

because summing $\mathbf{1}_{s^g(t+1)}[j]$ over $j$ gives one (only the $j$-th element equals one). Therefore:

$$\therefore \quad a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big)}. \quad (38)$$

Likewise, we have:

$$\frac{\partial \mathcal{L}}{\partial b_{i,j}} = \frac{1}{b_{i,j}} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big) - \eta_3 \overset{\text{set}}{=} 0 \implies b_{i,j} = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big). \quad (39)$$

$$\sum_{j=1}^{m} b_{i,j} = 1 \overset{(39)}{\implies} \frac{1}{\eta_3} \sum_{j=1}^{m} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big) = \frac{1}{\eta_3} \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big) \overset{\text{set}}{=} 1 \implies \eta_3 = \sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big). \quad (40)$$

$$\therefore \quad b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big)}. \quad (41)$$

In Section 7.1, we will simplify the statements for $\pi_i$, $a_{i,j}$, and $b_{i,j}$.

5. Evaluation in HMM

Evaluation in HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): Given the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$ and the HMM model $\lambda = (\pi, A, B)$, we want to compute $P(O^g | \lambda)$, i.e., the probability of the generated observation sequence. In summary:

$$\text{given: } O^g, \lambda \implies P(O^g | \lambda) = \;? \quad (42)$$

Note that $P(O^g | \lambda)$ can also be denoted by $P(O^g; \lambda)$. The $P(O^g | \lambda)$ is sometimes referred to as the likelihood.
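To make Eq. (42) concrete, here is a naive sketch (not from the paper; toy numbers assumed) that sums the joint probability over every possible state sequence — exactly the direct calculation formalized in Section 5.1 below.

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
O_g = [1, 0, 2, 2]                               # a hypothetical observation sequence

def joint_probability(states, O_g, pi, A, B):
    """P(O^g, S^g | lambda) for one candidate state sequence, as in Eqs. (45)-(46)."""
    p = pi[states[0]] * B[states[0], O_g[0]]
    for t in range(1, len(O_g)):
        p *= A[states[t - 1], states[t]] * B[states[t], O_g[t]]
    return p

# Law of total probability: sum over all n^tau state sequences (exponential cost).
n, tau = len(pi), len(O_g)
prob = sum(joint_probability(states, O_g, pi, A, B)
           for states in itertools.product(range(n), repeat=tau))
print(prob)    # P(O^g | lambda), feasible only for tiny n and tau
```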

5.1. Direct Calculation

Assume that the state sequence $S^g = s^g(1), \dots, s^g(\tau)$ has caused the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$. Hence, we have:

$$P(O^g | S^g, \lambda) = b_{s^g(1), o^g(1)}\, b_{s^g(2), o^g(2)} \cdots b_{s^g(\tau), o^g(\tau)}. \quad (43)$$

On the other hand, the probability of the state sequence $S^g = s^g(1), \dots, s^g(\tau)$ is:

$$P(S^g | \lambda) = \pi_{s^g(1)}\, a_{s^g(1), s^g(2)} \cdots a_{s^g(\tau-1), s^g(\tau)}. \quad (44)$$

According to the chain rule, we have:

$$P(O^g, S^g | \lambda) = P(O^g | S^g, \lambda)\, P(S^g | \lambda) \quad (45)$$
$$= \pi_{s^g(1)}\, b_{s^g(1), o^g(1)}\, a_{s^g(1), s^g(2)}\, b_{s^g(2), o^g(2)} \cdots a_{s^g(\tau-1), s^g(\tau)}\, b_{s^g(\tau), o^g(\tau)}, \quad (46)$$

which is the probability of occurrence of both the observation sequence $O^g$ and the state sequence $S^g$.

Any state sequence may have caused the observation sequence $O^g$. Therefore, according to the law of total probability, we have:

$$P(O^g | \lambda) = \sum_{\forall S^g} P(O^g, S^g | \lambda) \overset{(45)}{=} \sum_{\forall S^g} P(O^g | S^g, \lambda)\, P(S^g | \lambda) \overset{(46)}{=} \sum_{\forall s^g(1)} \sum_{\forall s^g(2)} \cdots \sum_{\forall s^g(\tau)} \pi_{s^g(1)}\, b_{s^g(1), o^g(1)}\, a_{s^g(1), s^g(2)}\, b_{s^g(2), o^g(2)} \cdots a_{s^g(\tau-1), s^g(\tau)}\, b_{s^g(\tau), o^g(\tau)}, \quad (47)$$

which means that we start with the first state, then output the first observation, and then go to the next state. This procedure is repeated until the last state. The summations are over all possible states in the state sequence.

The time complexity of this direct calculation of $P(O^g | \lambda)$ is in the order of $O(2\tau n^{\tau})$, because at every time clock $t \in \{1, \dots, \tau\}$ there are $n$ possible states to go through (Rabiner & Juang, 1986). Because of the $n^{\tau}$ factor, this is very inefficient, especially for long sequences (large $\tau$).

5.2. The Forward-Backward Procedure

A more efficient algorithm for evaluation in HMM is the forward-backward procedure (Rabiner & Juang, 1986). The forward-backward procedure includes two stages, i.e., the forward and backward belief propagation stages.

5.2.1. THE FORWARD BELIEF PROPAGATION

Similar to the belief propagation procedure, we define the forward message until time $t$ as:

$$\alpha_i(t) := P\big(o^g(1), o^g(2), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big), \quad (48)$$

which is the probability of the partial observation sequence until time $t$ and being in state $s_i$ at time $t$.

Algorithm 2 shows the forward belief propagation from state one to state $\tau$. In this algorithm, $\alpha_i(t)$ is solved inductively. The initial forward message is:

$$\alpha_i(1) = \pi_i\, b_{i, o^g(1)}, \quad \forall i \in \{1, \dots, n\}, \quad (49)$$

which is the probability of occurrence of the initial state $s_i$ and the observation symbol $o^g(1)$. The next forward messages are calculated as:

$$\alpha_j(t+1) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (50)$$

which is the probability of occurrence of the observation sequence $o^g(1), \dots, o^g(t)$, being in state $s_i$ at time $t$, going to state $s_j$ at time $t+1$, and the observation symbol $o^g(t+1)$. The summation is because, at time $t$, the state $s^g(t)$ can be any state, so we should use the law of total probability.

Algorithm 2: The forward belief propagation in the forward-backward procedure
1 Input: $\lambda = (\pi, A, B)$
2 $\alpha_i(1) = \pi_i\, b_{i, o^g(1)}, \ \forall i \in \{1, \dots, n\}$
3 for time $t$ from 1 to $(\tau - 1)$ do
4   for state $j$ from 1 to $n$ do
5     $\alpha_j(t+1) = \big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\big]\, b_{j, o^g(t+1)}$
6 $P(O^g | \lambda) = \sum_{i=1}^{n} \alpha_i(\tau)$
7 Return $P(O^g | \lambda)$ and $\alpha_i(t), \forall i, \forall t$

Proposition 1. Eq. (50) can be interpreted as the sum-product algorithm (see Section 2.3.2).

Proof. Algorithm 2 has iterations over states indexed by $j$. Also, Eq. (50) has a sum over states indexed by $i$. Consider all states indexed by $i$ and a specific state $s_j$ (see Fig. 3). We can consider every two successive states as a factor node in the factor graph. Hence, a state indexed by $i$ and the state $s_j$ form a factor node which we denote by $f^{i,j}$. The observation symbol $o^g(t+1)$ emitted from $s_j$ is considered as the variable node in the factor graph.

The message $\alpha_i(t)$ is the message received by the factor node $f^{i,j}$ so far. Therefore, in the sum-product algorithm (see Eq. (6)), we have:

$$m_{o^g(t) \to f^{i,j}} = \alpha_i(t). \quad (51)$$

The message $a_{i,j}\, b_{j, o^g(t+1)}$ is the message received from the factor nodes $f^{i,j}, \forall i$ by the variable node $o^g(t+1)$. Therefore, in the sum-product algorithm (see Eq. (7)), we have:

$$m_{f^{i,j} \to o^g(t+1)} = a_{i,j}\, b_{j, o^g(t+1)}. \quad (52)$$

Hence:

$$m_{f \to o^g(t+1)} = \sum_{i=1}^{n} m_{f^{i,j} \to o^g(t+1)}\, m_{o^g(t) \to f^{i,j}_{\text{next}}} \quad (53)$$
$$= \sum_{i=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \alpha_i(t) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j, o^g(t+1)}, \quad (54)$$

where $f$ is the set of all factor nodes $\{f^{i,j}\}_{i=1}^{n}$, and $f^{i,j}_{\text{next}}$ is the factor $f^{i,j}$ in the next time slot.

Note that if we consider Fig. 3 for all states indexed by $j$, a lattice network is formed.
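A minimal NumPy sketch of Algorithm 2 follows (not from the paper; it assumes observations are encoded as integer indices into the columns of B):

```python
import numpy as np

def forward(pi, A, B, O_g):
    """Forward belief propagation (Algorithm 2): returns P(O^g | lambda) and all alpha_i(t)."""
    n, tau = len(pi), len(O_g)
    alpha = np.zeros((tau, n))
    alpha[0] = pi * B[:, O_g[0]]                       # Eq. (49): alpha_i(1) = pi_i b_{i, o^g(1)}
    for t in range(tau - 1):
        # Eq. (50): alpha_j(t+1) = [sum_i alpha_i(t) a_{i,j}] * b_{j, o^g(t+1)}
        alpha[t + 1] = (alpha[t] @ A) * B[:, O_g[t + 1]]
    return alpha[-1].sum(), alpha                      # line 6 of Algorithm 2

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
likelihood, alpha = forward(pi, A, B, [1, 0, 2, 2])
print(likelihood)      # matches the brute-force sum over all state sequences
```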

Figure 3. Modeling forward belief propagation for HMM as a sum-product algorithm in a factor graph.

Finally, using the law of total probability, we have:

$$P(O^g | \lambda) = \sum_{i=1}^{n} P\big(O^g, s^g(\tau) = s_i \,|\, \lambda\big) = \sum_{i=1}^{n} \alpha_i(\tau), \quad (55)$$

which is the desired probability in the evaluation for HMM. Hence, the forward belief propagation suffices for evaluation. In the following, we explain the backward belief propagation, which is required for other sections of the paper.

5.2.2. THE BACKWARD BELIEF PROPAGATION

Again, similar to the belief propagation procedure, we define the backward message from time $\tau$ down to $t+1$ as:

$$\beta_i(t) := P\big(o^g(t+1), o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t) = s_i, \lambda\big), \quad (56)$$

which is the probability of the partial observation sequence from $t+1$ until the end time $\tau$, given being in state $s_i$ at time $t$.

Algorithm 3 shows the backward belief propagation. In this algorithm, the initial backward message is:

$$\beta_i(\tau) = 1, \quad \forall i \in \{1, \dots, n\}. \quad (57)$$

The next backward messages are calculated as:

$$\beta_i(t) = \sum_{j=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1), \quad (58)$$

which is the probability of being in state $s_i$ at time $t$, going to state $s_j$ at time $t+1$, emitting the observation symbol $o^g(t+1)$, and then observing the rest of the sequence. The summation is because, at time $t+1$, the state $s^g(t+1)$ can be any state, so we should use the law of total probability.

Algorithm 3: The backward belief propagation in the forward-backward procedure
1 Input: $\lambda = (\pi, A, B)$
2 $\beta_i(\tau) = 1, \ \forall i \in \{1, \dots, n\}$
3 for time $t$ from $(\tau - 1)$ down to 1 do
4   for state $i$ from 1 to $n$ do
5     $\beta_i(t) = \sum_{j=1}^{n} a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)$
6 Return $\beta_i(t), \forall i, \forall t$

It is noteworthy that for very long sequences, the $\alpha_i(t)$ and $\beta_i(t)$ become extremely small, recursively (see Algorithms 2 and 3). Hence, some people normalize them at every iteration of the algorithm (Ghahramani, 2001):

$$\alpha_i(t) \leftarrow \frac{\alpha_i(t)}{\sum_{j=1}^{n} \alpha_j(t)}, \quad (59)$$
$$\beta_i(t) \leftarrow \frac{\beta_i(t)}{\sum_{j=1}^{n} \beta_j(t)}, \quad (60)$$

in order to make them sum to one. However, note that if this normalization is done, we will have $P(O^g | \lambda) = 1$.

5.2.3. TIME COMPLEXITY

The time complexity of the forward belief propagation is in the order of $O(\tau n^2)$ because we have $O(n\tau)$ loops, each of which includes a summation over $n$ states (Rabiner & Juang, 1986). This is much more efficient than the complexity of the direct calculation of $P(O^g | \lambda)$, which was $O(2\tau n^{\tau})$. Similarly, the time complexity of the backward belief propagation is in the order of $O(\tau n^2)$ (Rabiner & Juang, 1986).
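A minimal NumPy sketch of Algorithm 3 follows (not from the paper), with the optional per-step normalization of Eqs. (59)-(60) controlled by a flag; variable names are assumptions:

```python
import numpy as np

def backward(pi, A, B, O_g, normalize=False):
    """Backward belief propagation (Algorithm 3): returns all beta_i(t)."""
    n, tau = len(pi), len(O_g)
    beta = np.zeros((tau, n))
    beta[-1] = 1.0                                     # Eq. (57): beta_i(tau) = 1
    for t in range(tau - 2, -1, -1):
        # Eq. (58): beta_i(t) = sum_j a_{i,j} b_{j, o^g(t+1)} beta_j(t+1)
        beta[t] = A @ (B[:, O_g[t + 1]] * beta[t + 1])
        if normalize:                                  # Eq. (60): rescale to avoid underflow
            beta[t] /= beta[t].sum()
    return beta

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(backward(pi, A, B, [1, 0, 2, 2]))
```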

6. Estimation in HMM

Estimation in HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): Given the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$ and the HMM model $\lambda = (\pi, A, B)$, we want to compute or estimate $P(S^g | O^g, \lambda)$, i.e., the probability of a state sequence given an observation sequence. In summary:

$$\text{find } S^g, \ \text{given: } O^g, \lambda \implies P(S^g | O^g, \lambda) = \;? \quad (61)$$

6.1. Greedy Approach

Let the probability of being in state $s_i$ at time $t$, given the observation sequence $O^g$ and the HMM model $\lambda$, be denoted by:

$$\gamma_i(t) := P\big(s^g(t) = s_i \,|\, O^g, \lambda\big). \quad (62)$$

We can say:

$$\gamma_i(t)\, P(O^g | \lambda) \overset{(a)}{=} P\big(s^g(t) = s_i, O^g \,|\, \lambda\big) \overset{(b)}{=} \alpha_i(t)\, \beta_i(t)$$
$$\implies \gamma_i(t) = \frac{\alpha_i(t)\, \beta_i(t)}{P(O^g | \lambda)} \quad (63)$$
$$= \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{n} \alpha_j(t)\, \beta_j(t)}, \quad (64)$$

where (a) is because of the chain rule in probability and (b) is because:

$$\alpha_i(t)\, \beta_i(t) \overset{(48),(56)}{=} P\big(o^g(1), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big) \times P\big(o^g(t+1), \dots, o^g(\tau) \,|\, s^g(t) = s_i, \lambda\big) = P\big(o^g(1), \dots, o^g(\tau), s^g(t) = s_i \,|\, \lambda\big) = P\big(O^g, s^g(t) = s_i \,|\, \lambda\big).$$

The reason for (b) can also be interpreted in this way: it is because of the forward-backward procedure, which states the belief over a variable as the product of the forward and backward messages (see Section 2.3.4). Note that we have $\sum_{i=1}^{n} \gamma_i(t) = 1$. Also, note that $P(O^g | \lambda)$ in Eq. (63) can be obtained from either the denominator of Eq. (64) or line 6 in Algorithm 2 (i.e., Eq. (55)).

In the greedy approach, at every time $t$, we select the state with maximum probability of occurrence without considering the other states in the sequence. Therefore, we have:

$$s^g(t) = \arg\max_{1 \le i \le n} \gamma_i(t), \quad \forall t \in \{1, \dots, \tau\}. \quad (65)$$

6.2. The Viterbi Algorithm

The greedy approach does not optimize over the whole path but greedily chooses the best state at every time step. Another approach is to find the best state sequence which has the highest probability of occurrence, i.e., maximizing $P(O^g, S^g | \lambda)$ (Rabiner & Juang, 1986). The Viterbi algorithm (Viterbi, 1967; Forney, 1973) can be used to find this path of states (Blunsom, 2004). Different works, such as (He, 1988), have worked on using the Viterbi algorithm for HMM.

Algorithm 4 shows the Viterbi algorithm for estimation in HMM (Rabiner & Juang, 1986). In this algorithm, we have the variable $\delta_j(t)$:

$$\delta_j(t) = \max_{1 \le i \le n} \big[\delta_i(t-1)\, a_{i,j}\big]\, b_{j, o^g(t)}, \quad (66)$$

which is similar to $\alpha_j(t)$ defined in Eq. (50), except that $\alpha_j(t)$ in the forward belief propagation uses the sum-product algorithm (Kschischang et al., 2001) while $\delta_j(t)$ in the Viterbi algorithm uses the max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) (see Sections 2.3.2 and 2.3.3).

Algorithm 4: The Viterbi algorithm for estimation in HMM
1 Input: $\lambda = (\pi, A, B)$
2 // Initialization:
3 $\delta_i(1) = \pi_i\, b_{i, o^g(1)}, \ \forall i \in \{1, \dots, n\}$
4 $\psi_i(1) = 0, \ \forall i \in \{1, \dots, n\}$
5 // Recursion:
6 for time $t$ from 2 to $\tau$ do
7   for state $j$ from 1 to $n$ do
8     $\delta_j(t) = \max_{1 \le i \le n} \big[\delta_i(t-1)\, a_{i,j}\big]\, b_{j, o^g(t)}$
9     $\psi_j(t) = \arg\max_{1 \le i \le n} \big[\delta_i(t-1)\, a_{i,j}\big]$
10 // Termination:
11 $p^* = \max_{1 \le i \le n} \delta_i(\tau)$
12 $s^*(\tau) = \arg\max_{1 \le i \le n} \delta_i(\tau)$
13 // Backtracking:
14 for time $t$ from $\tau - 1$ down to 1 do
15   $s^*(t) = \psi_{s^*(t+1)}(t+1)$
16 $P(O^g, S^g | \lambda) = p^*$
17 Return $P(O^g, S^g | \lambda)$, $S^g = s^*(1), \dots, s^*(\tau)$

Proposition 2. Eq. (66) can be interpreted as the max-product algorithm (see Section 2.3.3).

Proof. Algorithm 4 has iterations over states indexed by $j$. Also, Eq. (66) has a maximum operator over states indexed by $i$. Consider all states indexed by $i$ and a specific state $s_j$ (see Fig. 4). The definitions of the factor node and variable node in the factor graph are similar to the proof of Proposition 1.

The message $\delta_i(t-1)$ is the message received by the factor node $f^{i,j}$ so far. Therefore, in the max-product algorithm (see Eq. (10)), we have:

$$m_{o^g(t-1) \to f^{i,j}} = \delta_i(t-1). \quad (67)$$

The message $a_{i,j}\, b_{j, o^g(t)}$ is the message received from the factor nodes $f^{i,j}, \forall i$ by the variable node $o^g(t)$. Therefore, in the max-product algorithm (see Eq. (11)), we have:

$$m_{f^{i,j} \to o^g(t)} = a_{i,j}\, b_{j, o^g(t)}. \quad (68)$$

Hence:

$$m_{f \to o^g(t)} = \max_{1 \le i \le n} m_{f^{i,j} \to o^g(t)}\, m_{o^g(t) \to f^{i,j}_{\text{next}}} \quad (69)$$
$$= \max_{1 \le i \le n} a_{i,j}\, b_{j, o^g(t)}\, \delta_i(t-1) = \max_{1 \le i \le n} \big[\delta_i(t-1)\, a_{i,j}\big]\, b_{j, o^g(t)}, \quad (70)$$

where $f$ is the set of all factor nodes $\{f^{i,j}\}_{i=1}^{n}$, and $f^{i,j}_{\text{next}}$ is the factor $f^{i,j}$ in the next time slot.

Note that if we consider Fig. 4 for all states indexed by $j$, a lattice network is formed, which is common in the Viterbi algorithm (see (Rabiner & Juang, 1986)).

Figure 4. Modeling the Viterbi algorithm for HMM as a max-product algorithm in a factor graph. This figure is very similar to Fig. 3 where the difference is in the index of time.

Similar to Eq. (49), the initial $\delta_i(1)$ is:

$$\delta_i(1) = \pi_i\, b_{i, o^g(1)}, \quad \forall i \in \{1, \dots, n\}. \quad (71)$$

We define the maximizing index in Eq. (66) as:

$$\psi_j(t) = \arg\max_{1 \le i \le n} \big[\delta_i(t-1)\, a_{i,j}\big]. \quad (72)$$

Then, a backward analysis is done starting from the end of the state sequence:

$$p^* = \max_{1 \le i \le n} \delta_i(\tau), \quad (73)$$
$$s^*(\tau) = \arg\max_{1 \le i \le n} \delta_i(\tau), \quad (74)$$

and the other states in the sequence are backtracked as:

$$s^*(t) = \psi_{s^*(t+1)}(t+1). \quad (75)$$

The states $S^g = s^*(1), s^*(2), \dots, s^*(\tau)$ are the desired state sequence in the estimation. Therefore, the states in the state sequence maximize the forward belief propagation in a max-product setting.

The probability of this path of states with maximum probability of occurrence is:

$$P(O^g, S^g | \lambda) = p^*. \quad (76)$$

Note that the Viterbi algorithm can be visualized using a trellis structure (see Appendix A in (Jurafsky & Martin, 2019)).
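A minimal NumPy sketch of Algorithm 4 follows (not from the paper; observations are again assumed to be integer symbol indices):

```python
import numpy as np

def viterbi(pi, A, B, O_g):
    """Viterbi algorithm (Algorithm 4): best state path and its joint probability p*."""
    n, tau = len(pi), len(O_g)
    delta = np.zeros((tau, n))
    psi = np.zeros((tau, n), dtype=int)
    delta[0] = pi * B[:, O_g[0]]                         # Eq. (71): delta_i(1) = pi_i b_{i,o^g(1)}
    for t in range(1, tau):
        scores = delta[t - 1][:, None] * A               # scores[i, j] = delta_i(t-1) a_{i,j}
        psi[t] = scores.argmax(axis=0)                   # Eq. (72): best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, O_g[t]]     # Eq. (66): max-product recursion
    path = np.zeros(tau, dtype=int)
    p_star = delta[-1].max()                             # Eq. (73)
    path[-1] = delta[-1].argmax()                        # Eq. (74)
    for t in range(tau - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]                # Eq. (75): backtracking
    return path, p_star

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [1, 0, 2, 2]))
```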

7. Training the HMM

Training the HMM means the following (Rabiner & Juang, 1986; Rabiner, 1989): Given the observation sequence $O^g = o^g(1), \dots, o^g(\tau)$, we want to adjust the HMM model parameters $\lambda = (\pi, A, B)$ in order to maximize $P(O^g | \lambda)$. In summary:

$$\text{given: } O^g \implies \lambda = \arg\max_{\lambda} P(O^g | \lambda). \quad (77)$$

7.1. The Baum-Welch Algorithm

We can solve Eq. (77) using maximum likelihood estimation with Expectation Maximization (EM). The Baum-Welch algorithm (Baum et al., 1970) is the most well-known method for training HMM. It makes use of the EM results. We define the probability of occurrence of a path being in states $s_i$ and $s_j$, respectively, at times $t$ and $t+1$ by:

$$\xi_{i,j}(t) := P\big(s^g(t) = s_i, s^g(t+1) = s_j \,|\, O^g, \lambda\big). \quad (78)$$

We can say:

$$\xi_{i,j}(t)\, P(O^g | \lambda) \overset{(a)}{=} P\big(s^g(t) = s_i, s^g(t+1) = s_j, O^g \,|\, \lambda\big) \overset{(b)}{=} \alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)$$
$$\implies \xi_{i,j}(t) = \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{P(O^g | \lambda)} \quad (79)$$
$$= \frac{\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1)}{\sum_{r=1}^{n} \sum_{\ell=1}^{n} \alpha_r(t)\, a_{r,\ell}\, b_{\ell, o^g(t+1)}\, \beta_\ell(t+1)}, \quad (80)$$

where (a) is because of the chain rule in probability and (b) is because of the following. According to Eqs. (48), (15), (17), and (56), we have:

$$\alpha_i(t) = P\big(o^g(1), \dots, o^g(t), s^g(t) = s_i \,|\, \lambda\big),$$
$$a_{i,j} = P\big(s_j(t+1) \,|\, s_i(t)\big),$$
$$b_{j, o^g(t+1)} = P\big(o^g(t+1) \,|\, s_j(t+1)\big),$$
$$\beta_j(t+1) = P\big(o^g(t+2), \dots, o^g(\tau) \,|\, s^g(t+1) = s_j, \lambda\big).$$

Therefore, we have:

$$\alpha_i(t)\, a_{i,j}\, b_{j, o^g(t+1)}\, \beta_j(t+1) = P\big(s^g(t) = s_i, s^g(t+1) = s_j, O^g \,|\, \lambda\big).$$

Note that we have $\sum_{i=1}^{n} \sum_{j=1}^{n} \xi_{i,j}(t) = 1$. Also, note that $P(O^g | \lambda)$ in Eq. (79) can be obtained from either the denominator of Eq. (80) or line 6 in Algorithm 2 (i.e., Eq. (55)).

In Eq. (79), the terms $\alpha_i(t)$, $a_{i,j}$, $b_{j, o^g(t+1)}$, and $\beta_j(t+1)$ stand for the probability of the first $t$ observations ending in state $s_i$ at time $t$, the probability of transitioning from state $s_i$ (at time $t$) to state $s_j$ (at time $t+1$), the probability of observing $o^g(t+1)$ from state $s_j$ at time $t+1$, and the probability of the remainder of the observation sequence, respectively.

Now, recall Eqs. (35), (38), and (41) from the EM algorithm for HMM. We write these equations again for convenience of the reader:

$$\pi_i = \mathbb{E}\big(\mathbf{1}_{s^g(1)}[i]\big), \quad a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big)}{\sum_{t=1}^{\tau-1} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big)}, \quad b_{i,j} = \frac{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big)}{\sum_{t=1}^{\tau} \mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big)}.$$

On the other hand, recall Eqs. (62) and (78), which are repeated here:

$$\gamma_i(t) = P\big(s^g(t) = s_i \,|\, O^g, \lambda\big), \quad \xi_{i,j}(t) = P\big(s^g(t) = s_i, s^g(t+1) = s_j \,|\, O^g, \lambda\big).$$

Therefore, we can say:

$$\mathbb{E}\big(\mathbf{1}_{s^g(1)}[i]\big) = \gamma_i(1), \quad (81)$$
$$\mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\big) = \gamma_i(t), \quad (82)$$
$$\mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{s^g(t+1)}[j]\big) = \xi_{i,j}(t), \quad (83)$$
$$\mathbb{E}\big(\mathbf{1}_{s^g(t)}[i]\, \mathbf{1}_{o^g(t)}[j]\big) = \gamma_i(t), \ \text{where } o^g(t) = j \text{ in } \gamma_i(t). \quad (84)$$

Hence:

$$\pi_i = \gamma_i(1) \overset{(62)}{=} P\big(s^g(1) = s_i \,|\, O^g, \lambda\big), \quad \forall i \in \{1, \dots, n\}, \quad (85)$$

$$a_{i,j} = \frac{\sum_{t=1}^{\tau-1} \xi_{i,j}(t)}{\sum_{t=1}^{\tau-1} \gamma_i(t)}, \quad \forall i, j \in \{1, \dots, n\}, \quad (86)$$

$$b_{i,j} = \frac{\sum_{t=1,\, o^g(t)=j}^{\tau} \gamma_i(t)}{\sum_{t=1}^{\tau} \gamma_i(t)}, \quad \forall i \in \{1, \dots, n\}, \ \forall j \in \{1, \dots, m\}.$$

With a change of variable, we have:

$$b_{j,k} = \frac{\sum_{t=1,\, o^g(t)=k}^{\tau} \gamma_j(t)}{\sum_{t=1}^{\tau} \gamma_j(t)}, \quad \forall j \in \{1, \dots, n\}, \ \forall k \in \{1, \dots, m\}. \quad (87)$$

The algorithm is shown in Algorithm 5 (Rabiner & Juang, 1986). In this algorithm, the initial probability of being in state $s_i$ at time $t = 1$ is set according to Eq. (85). The $a_{i,j}$ is then calculated using Eq. (86). According to counting in probability, it can also be interpreted as the ratio of the expected number of transitions from state $s_i$ to $s_j$ over the expected number of transitions out of state $s_i$. Finally, the $b_{j,k}$ is calculated using Eq. (87). Likewise, using counting in probability, the $b_{j,k}$ can be interpreted as the ratio of the expected number of times being in state $s_j$ and seeing observation $o_k$ over the expected number of times being in state $s_j$.

Algorithm 5: The Baum-Welch algorithm for training the HMM
1 Input: $\gamma_i(t), \xi_{i,j}(t), \forall i, \forall j, \forall t$
2 $\pi_i = \gamma_i(1), \ \forall i \in \{1, \dots, n\}$
3 for state $i$ from 1 to $n$ do
4   for state $j$ from 1 to $n$ do
5     $a_{i,j} = \sum_{t=1}^{\tau-1} \xi_{i,j}(t) \,/\, \sum_{t=1}^{\tau-1} \gamma_i(t)$
6 for state $j$ from 1 to $n$ do
7   for observation $k$ from 1 to $m$ do
8     $b_{j,k} = \sum_{t=1,\, o^g(t)=k}^{\tau} \gamma_j(t) \,/\, \sum_{t=1}^{\tau} \gamma_j(t)$
9 // Normalization, for computer error corrections:
10 $\pi_i \leftarrow \pi_i / \sum_{j=1}^{n} \pi_j$
11 $a_{i,j} \leftarrow a_{i,j} / \sum_{\ell=1}^{n} a_{i,\ell}$
12 $b_{j,k} \leftarrow b_{j,k} / \sum_{\ell=1}^{m} b_{j,\ell}$
13 $\pi = [\pi_1, \dots, \pi_n]^\top$, $A = [a_{i,j}]$, $B = [b_{j,k}]$, $\forall i, j \in \{1, \dots, n\}, \forall k \in \{1, \dots, m\}$
14 Return $\lambda = (\pi, A, B)$

Note that in the numerator of Eq. (87), we have $o^g(t) = k$ in $\gamma_j(t)$, i.e., according to Eq. (62), we have $\gamma_j(t) = P\big(s^g(t) = s_j \,|\, o^g(1), \dots, o^g(t) = k, \dots, o^g(\tau), \lambda\big)$. This means that, according to Eq. (63), we set $o^g(t) = k$ in Eqs. (50) and (58), or in Algorithms 2 and 3, for the calculation of $\gamma_j(t)$ in Eq. (87). On the other hand, in line 8 of Algorithm 5, we iterate over all the time slots $t \in \{1, \dots, \tau\}$. Hence, in the numerator of Eq. (87), we will have $\gamma_j(t) = P\big(s^g(t) = s_j \,|\, o^g(1) = k, \dots, o^g(t) = k, \dots, o^g(\tau) = k, \lambda\big)$. Therefore:

• For calculating the numerator of Eq. (87), we use Eq. (64) where:

$$\alpha_j(t+1) = \Big[\sum_{i=1}^{n} \alpha_i(t)\, a_{i,j}\Big]\, b_{j,k}, \quad (88)$$
$$\beta_i(t) = \sum_{j=1}^{n} a_{i,j}\, b_{j,k}\, \beta_j(t+1), \quad (89)$$

in line 5 of Algorithm 2 and line 5 of Algorithm 3, respectively. We use the obtained $\alpha_i(t), \forall i, \forall t$ and $\beta_i(t), \forall i, \forall t$ for calculating the numerator of Eq. (64).

• For calculating the denominator of Eq. (87), we use Eq. (64) where Algorithms 2 and 3 are used for calculating $\alpha_i(t)$ and $\beta_i(t)$.

It is noteworthy that, because of possible numerical errors caused by the computer, it is better to normalize the obtained HMM model parameters (see lines 10 to 12 in Algorithm 5) in order to make sure that Eqs. (20), (16), and (18) are satisfied.

7.2. Training the HMM

We know that Algorithm 5 requires $\gamma_i(t)$ and $\xi_{i,j}(t)$. According to Eqs. (63) and (79), the $\gamma_i(t)$ and $\xi_{i,j}(t)$ also require $\alpha_i(t)$ and $\beta_i(t)$. The $\alpha_i(t)$ and $\beta_i(t)$ can be computed using Algorithms 2 and 3, respectively. Moreover, note that in calculating $\alpha_i(t)$, $\beta_i(t)$, and $\xi_{i,j}(t)$, we need some $\pi_i$, $a_{i,j}$, and $b_{i,j}$ from $\pi$, $A$, and $B$, respectively. Therefore, we need an initial $\lambda = (\pi, A, B)$ to compute another $\lambda' = (\pi', A', B')$ at the end according to Algorithm 5. Thus, we should fine-tune $\lambda = (\pi, A, B)$ iteratively. The $\lambda'$ obtained from the previous $\lambda$ satisfies $P(O^g | \lambda') \ge P(O^g | \lambda)$ (Rabiner & Juang, 1986) (see (Baum & Eagon, 1967) or (Baum et al., 1970) for a proof), so we have progress toward convergence (note that $P(O^g | \lambda)$ is obtained in each iteration in line 6 of Algorithm 6). If we have several training sequences $Q$, indexed by $q \in \{1, \dots, |Q|\}$, we use one of the sequences at the first iteration, the second sequence at the second iteration, and so on. We repeat this procedure until convergence, which is reached when there is no significant change in $\lambda$. Algorithm 6 shows how to train an HMM.

Algorithm 6: Training the HMM
1 Input: $\{O^g = o^g(1), \dots, o^g(\tau)\}_{q=1}^{|Q|}$
2 Initialize $\lambda = (\pi, A, B)$ where $\sum_{i=1}^{n} \pi_i = 1$, $\sum_{j=1}^{n} a_{i,j} = 1$, $\sum_{j=1}^{m} b_{i,j} = 1$
3 while not Convergence do
4   for each $q$ from 1 to $|Q|$ do
5     Consider the $q$-th training sequence
6     $P(O^g | \lambda), \forall i, \forall t: \alpha_i(t) \leftarrow$ Do Algorithm 2 [input: $\lambda$]
7     $\forall i, \forall t: \beta_i(t) \leftarrow$ Do Algorithm 3 [input: $\lambda$]
8     $\forall i, \forall t: \gamma_i(t) \leftarrow$ Eq. (63) [input: $\alpha$, $\beta$]
9     $\forall i, \forall j, \forall t: \xi_{i,j}(t) \leftarrow$ Eq. (79) [input: $\alpha$, $\beta$, $\lambda$]
10    $\lambda \leftarrow$ Do Algorithm 5 [input: $\gamma$, $\xi$]
11    if change of $\lambda$ is small then
12      Convergence $\leftarrow$ True
13 Return $\lambda = (\pi, A, B)$

8. Usage of HMM in Applications

In Section 1, we introduced some applications of HMM in speech recognition (Rabiner & Juang, 1986; Rabiner, 1989; Gales et al., 2008), action recognition (Yamato et al., 1992; Ghojogh et al., 2017), and face recognition (Nefian & Hayes, 1998; Samaria, 1994). Here, we explain the applications in speech and action recognition in more detail.

8.1. Application in Speech Recognition

Assume we have a dictionary of words consisting of $|W|$ words. For every word indexed by $w \in \{1, \dots, |W|\}$, we have $|Q_w|$ training instances spoken by one or several people. The training instances for a word are indexed by $q$, where $q \in \{1, \dots, |Q_w|\}$. Every training instance is a sequence of observation symbols obtained from formants (Titze & Martin, 1998). We consider an HMM model for every word in the dictionary. Training the HMMs is as follows (Rabiner & Juang, 1986; Rabiner, 1989):

1. For every word $w$, consider its training sequences $O^g_q$, indexed by $q \in \{1, \dots, |Q_w|\}$.
2. Train the HMM for the $w$-th word using Algorithm 6 to obtain $\lambda_w$.

For an unknown test word with observation sequence $O^g_t$, we recognize the word as follows (Rabiner & Juang, 1986; Rabiner, 1989):

1. Calculate $P(O^g_t | \lambda_w)$ for all $w \in \{1, \dots, |W|\}$ using the forward belief propagation, i.e., Algorithm 2 (see Eq. (55)).
2. The test word is recognized as:

$$w^* = \arg\max_{w} P(O^g_t | \lambda_w). \quad (90)$$

So, the test word is recognized as the $w^*$-th word in the dictionary.

In the test phase for speech recognition, usually, the Viterbi algorithm is used (Rabiner, 1989). Hence, another approach for recognition of the test word $O^g_t$ is:

1. Calculate $P(O^g_t, S^g_t | \lambda_w)$ for all $w \in \{1, \dots, |W|\}$ using the Viterbi algorithm, i.e., Algorithm 4 (see Eq. (76)).
2. The test word is recognized as:

$$w^* = \arg\max_{w} P(O^g_t, S^g_t | \lambda_w). \quad (91)$$

So, the test word is recognized as the $w^*$-th word in the dictionary.

Note that the words are pronounced with different lengths (fast or slowly) by different people. As HMM is robust to different repetitions of states, the recognition of words with different pacing is possible.
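As an illustrative sketch (not from the paper), isolated-word recognition with per-word HMMs reduces to evaluating Eq. (90); the `forward` helper is the hypothetical one sketched earlier and the per-word models below are made up.

```python
import numpy as np

def recognize_word(O_test, word_models):
    """Pick the word whose HMM gives the highest P(O^g_t | lambda_w), Eq. (90).

    word_models: dict mapping a word to its trained lambda_w = (pi, A, B).
    O_test: the test observation sequence, as integer symbol indices.
    """
    scores = {word: forward(pi, A, B, O_test)[0]        # P(O^g_t | lambda_w) via Algorithm 2
              for word, (pi, A, B) in word_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical models for a two-word dictionary (in practice, each is trained with Algorithm 6).
word_models = {
    "yes": (np.array([0.7, 0.3]),
            np.array([[0.8, 0.2], [0.3, 0.7]]),
            np.array([[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])),
    "no":  (np.array([0.4, 0.6]),
            np.array([[0.5, 0.5], [0.1, 0.9]]),
            np.array([[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])),
}
print(recognize_word([0, 1, 2, 2], word_models)[0])
```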
8.2. Application in Action Recognition

In action recognition, every action can be seen as a sequence of poses where every pose may be repeated for several frames (Ghojogh et al., 2017). Hence, HMM can be used for action recognition (Yamato et al., 1992).

Action  Sequence  t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9    t=10   t=11   t=12
Sit     1         stand  stand  stand  sit    sit    sit    ×      ×      ×      ×      ×      ×
Sit     2         stand  stand  sit    sit    sit    sit    sit    sit    ×      ×      ×      ×
Sit     3         stand  stand  stand  stand  stand  sit    sit    ×      ×      ×      ×      ×
Sit     4         stand  stand  stand  stand  stand  sit    sit    sit    sit    sit    ×      ×
Sit     5         stand  stand  stand  stand  stand  sit    sit    sit    sit    stand  sit    sit
Stand   1         sit    sit    sit    stand  stand  stand  ×      ×      ×      ×      ×      ×
Stand   2         sit    sit    sit    sit    sit    sit    stand  stand  ×      ×      ×      ×
Stand   3         sit    sit    stand  stand  stand  stand  stand  ×      ×      ×      ×      ×
Stand   4         sit    sit    sit    sit    stand  stand  stand  stand  stand  ×      ×      ×
Stand   5         sit    sit    sit    sit    stand  stand  stand  sit    stand  stand  stand  ×
Turn    1         stand  stand  stand  tilt   tilt   tilt   ×      ×      ×      ×      ×      ×
Turn    2         stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×      ×      ×
Turn    3         stand  stand  stand  stand  stand  tilt   tilt   ×      ×      ×      ×      ×
Turn    4         stand  stand  stand  stand  tilt   tilt   tilt   tilt   tilt   tilt   ×      ×
Turn    5         stand  stand  stand  stand  tilt   tilt   tilt   stand  tilt   tilt   tilt   ×

Table 1. An example training dataset for action recognition

Action t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 t = 10 t = 11 Sit stand stand stand sit sit sit sit × × × × Stand sit sit sit sit stand stand stand × × × × Turn stand stand stand stand tilt tilt stand stand tilt tilt tilt

Table 2. An example test dataset for action recognition

Assume we have a set of actions denoted by $W$, where the actions are indexed by $w \in \{1, \dots, |W|\}$. We have $|Q_w|$ training instances for every action, where the training instances are indexed by $q \in \{1, \dots, |Q_w|\}$. Every training instance is a sequence of observation symbols where the symbols are the poses, e.g., sitting, standing, etc. An HMM is trained for every action (Ghojogh et al., 2017; Mokari et al., 2018). Training and testing HMMs for action recognition are the same as the training and test phases explained for speech recognition.

The actions are performed with different sequence lengths (fast or slowly) by different people. As HMM is robust to different repetitions of states, the recognition of actions with different pacing is possible.

In action recognition, we have a dataset of actions consisting of several defined poses (Ghojogh et al., 2017; Mokari et al., 2018). For example, if the dataset includes the three actions sit, stand, and turn, the format of the actions is as follows:

• Action sit: stand, stand, ..., stand, sit, sit, ..., sit
• Action stand: sit, sit, ..., sit, stand, stand, ..., stand
• Action turn: stand, stand, ..., stand, tilt, tilt, ..., tilt

where the actions are modeled as sequences of some poses, i.e., stand, sit, and tilt. The actions can have different lengths or pacing. An example training dataset with its instances is shown in Table 1. In some sequences of the dataset, there are some noisy poses in the middle of the sequences of correct poses, to make difficult instances. An example test dataset is also shown in Table 2. The three test sequences are different from the training sequences in order to check the generalizability of the HMM models. Three HMM models can be trained for the three actions in this dataset, and then the test action sequences can be fed to the HMM models to be recognized.

9. Conclusion

In this paper, we explained the theory of HMM for evaluation, estimation, and training. We started with some required background, i.e., EM, factor graphs, the sum-product and max-product algorithms, forward-backward propagation, Markov and Bayesian networks, the Markov property, and DTMC. We then introduced HMM and detailed EM in HMM. Evaluation in HMM was explained through both direct calculation and the forward-backward procedure. We introduced estimation in HMM using the greedy approach and the Viterbi algorithm. Training the HMM was also covered using the Baum-Welch algorithm based on the EM algorithm. We also introduced speech and action recognition as two popular applications of HMM.

Acknowledgment

The authors hugely thank Prof. Mehdi Molkaraie, Prof. Kevin Granville, and Prof. Mu Zhu, whose courses partly covered the materials mentioned in this tutorial paper.

References

Baum, Leonard E and Eagon, John Alonzo. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73(3):360–363, 1967.

Baum, Leonard E, Petrie, Ted, Soules, George, and Weiss, Norman. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970.

Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.

Blunsom, Phil. Hidden Markov models. Technical report, 2004.

Booth, Taylor L. Sequential machines and automata theory. New York, NY: Wiley, 1967.

Boyd, Stephen and Vandenberghe, Lieven. Convex optimization. Cambridge University Press, 2004.

Brin, Michael and Stuck, Garrett. Introduction to dynamical systems. Cambridge University Press, 2002.

Eddy, Sean R. What is a hidden Markov model? Nature Biotechnology, 22(10):1315, 2004.

Forney, G David. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.

Forney, G David. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.

Gales, Mark, Young, Steve, et al. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195–304, 2008.

Ghahramani, Zoubin. An introduction to hidden Markov models and Bayesian networks. In Hidden Markov models: applications in computer vision, pp. 9–41. World Scientific, 2001.

Ghojogh, Benyamin, Mohammadzade, Hoda, and Mokari, Mozhgan. Fisherposes for human action recognition using kinect sensor data. IEEE Sensors Journal, 18(4):1612–1627, 2017.

Ghojogh, Benyamin, Ghojogh, Aydin, Crowley, Mark, and Karray, Fakhri. Fitting a mixture distribution to data: tutorial. arXiv preprint arXiv:1901.06708, 2019.

He, Yang. Extended Viterbi algorithm for second order hidden Markov process. In [1988 Proceedings] 9th International Conference on Pattern Recognition, pp. 718–720. IEEE, 1988.

Jurafsky, Daniel and Martin, James H. Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing. Pearson Prentice Hall, 2019.

Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Kschischang, Frank R, Frey, Brendan J, and Loeliger, Hans-Andrea. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

Lawler, Gregory F. Introduction to stochastic processes. Chapman and Hall/CRC, 2018.

Loeliger, H-A. An introduction to factor graphs. IEEE Signal Processing Magazine, 21(1):28–41, 2004.

Mokari, Mozhgan, Mohammadzade, Hoda, and Ghojogh, Benyamin. Recognizing involuntary actions from 3d skeleton data using body states. Scientia Iranica, 2018.

Moon, Todd K. The expectation-maximization algorithm. IEEE Signal Processing Magazine, 13(6):47–60, 1996.

Murphy, Kevin P, Weiss, Yair, and Jordan, Michael I. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475. Morgan Kaufmann Publishers Inc., 1999.

Nefian, Ara V and Hayes, Monson H. Hidden Markov models for face recognition. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (Cat. No. 98CH36181), volume 5, pp. 2721–2724. IEEE, 1998.

Pearl, Judea. Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.

Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Rabiner, Lawrence R and Juang, Biing-Hwang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.

Ross, Sheldon M. Introduction to probability models. Academic Press, 2014.

Roweis, Sam. Hidden Markov models. Technical report, University of Toronto, 2003.

Samaria, Ferdinando Silvestro. Face recognition using hidden Markov models. PhD thesis, University of Cambridge, Cambridge, UK, 1994.

Titze, Ingo R and Martin, Daniel W. Principles of voice production. Acoustical Society of America, 1998.

Viterbi, Andrew. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.

Weiss, Yair and Freeman, William T. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):736–744, 2001.

Yamato, Junji, Ohya, Jun, and Ishii, Kenichiro. Recognizing human action in time-sequential images using hidden Markov model. In Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 379–385. IEEE, 1992.