Hidden Markov Model: Tutorial
Benyamin Ghojogh [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray [email protected]
Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This is a tutorial paper for the Hidden Markov Model (HMM). First, we briefly review the background on Expectation Maximization (EM), the Lagrange multiplier, the factor graph, the sum-product algorithm, the max-product algorithm, and belief propagation by the forward-backward procedure. Then, we introduce probabilistic graphical models including the Markov random field and the Bayesian network. The Markov property and the Discrete Time Markov Chain (DTMC) are also introduced. We then explain likelihood estimation and EM in HMM in technical detail. We explain evaluation in HMM, where both direct calculation and forward-backward belief propagation are covered. Afterwards, estimation in HMM is covered, where both the greedy approach and the Viterbi algorithm are detailed. Then, we explain how to train HMM using EM and the Baum-Welch algorithm. We also explain how to use HMM in some applications such as speech and action recognition.

1. Introduction

Assume we have a time series of hidden random variables. We name the hidden random variables states (Ghahramani, 2001). Every hidden random variable can generate (emit) an observation symbol or output. Therefore, we have a set of states and a set of observation symbols. A Hidden Markov Model (HMM) has some hidden states and some emitted observation symbols (Rabiner, 1989). HMM models the probability density of a sequence of observed symbols (Roweis, 2003).

This paper is a technical tutorial for evaluation, estimation, and training in HMM. We also explain some applications of HMM. There exist many different applications for HMM. One of the most famous applications is in speech recognition (Rabiner, 1989; Gales et al., 2008), where an HMM is trained for every word in the speech (Rabiner & Juang, 1986). Another application of HMM is in action recognition (Yamato et al., 1992), because an action can be seen as a sequence or time series of poses (Ghojogh et al., 2017; Mokari et al., 2018). An interesting application of HMM, which is not trivial at first glance, is in face recognition (Samaria, 1994). The face can be seen as a sequence of organs being modeled by HMM (Nefian & Hayes, 1998). Usage of HMM in DNA analysis is another application (Eddy, 2004).

The remainder of the paper is as follows. In Section 2, we review some necessary background for the paper. Section 3 introduces probabilistic graphical models. Expectation maximization in HMM is explained in Section 4. Afterwards, evaluation, estimation, and training in HMM are explained in Sections 5, 6, and 7, respectively. Some applications are introduced in more detail in Section 8. Finally, Section 9 concludes the paper.

2. Background

Some background required for better understanding of the HMM algorithms is mentioned in this section.

2.1. Expectation Maximization

This section is taken from our previous paper (Ghojogh et al., 2019). Sometimes, the data are not fully observable.
For example, the data are known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease, but for the convenience of the patients who participated in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as X_i > 0 does not tell us whether X_i = 2 or X_i = 1000.

In this case, MLE cannot be directly applied, as we do not have access to complete information and some data are missing. In this case, Expectation Maximization (EM) (Moon, 1996) is useful. The main idea of EM can be summarized in this short friendly conversation:
– What shall we do? The data are missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!

Assume D^{(obs)} and D^{(miss)} denote the observed data (the X_i's = 0 in the above example) and the missing data (the X_i's > 0 in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.

Let Θ be the parameter which is the variable for likelihood estimation. In the E-step, the expectation of the log-likelihood ℓ(Θ) is taken with respect to the missing data D^{(miss)} in order to have a mean estimation of it. Let Q(Θ) denote the expectation of the likelihood with respect to D^{(miss)}:

    Q(\Theta) := \mathbb{E}_{D^{(\text{miss})} \mid D^{(\text{obs})}, \Theta}[\ell(\Theta)].    (1)

Note that in the above expectation, D^{(obs)} and Θ are conditioned on, so they are treated as constants and not random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., Q(Θ); therefore:

    \widehat{\Theta} = \arg\max_{\Theta} Q(\Theta).    (2)

These two steps are iteratively repeated until convergence of the estimated parameters \widehat{\Theta}.

2.2. Lagrange Multiplier

Suppose we have a multivariate function Q(Θ_1, ..., Θ_K) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained, and its constraint is the equality P(Θ_1, ..., Θ_K) = c, where c is a constant. So, the constrained optimization problem is:

    \underset{\Theta_1, \ldots, \Theta_K}{\text{maximize}} \quad Q(\Theta_1, \ldots, \Theta_K)
    \text{subject to} \quad P(\Theta_1, \ldots, \Theta_K) = c.    (3)

For solving this problem, we can introduce a new variable η, which is called the "Lagrange multiplier". Also, a new function \mathcal{L}(\Theta_1, \ldots, \Theta_K, \eta), called the "Lagrangian", is introduced:

    \mathcal{L}(\Theta_1, \ldots, \Theta_K, \eta) = Q(\Theta_1, \ldots, \Theta_K) - \eta \big( P(\Theta_1, \ldots, \Theta_K) - c \big).    (4)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

    \nabla_{\Theta_1, \ldots, \Theta_K, \eta} \mathcal{L} \overset{\text{set}}{=} 0,    (5)

which gives us:

    \nabla_{\Theta_1, \ldots, \Theta_K} \mathcal{L} \overset{\text{set}}{=} 0 \implies \nabla_{\Theta_1, \ldots, \Theta_K} Q = \eta \, \nabla_{\Theta_1, \ldots, \Theta_K} P,
    \nabla_{\eta} \mathcal{L} \overset{\text{set}}{=} 0 \implies P(\Theta_1, \ldots, \Theta_K) = c.
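As a small worked example (chosen here only for illustration), consider maximizing Q(\theta_1, \theta_2) = -\theta_1^2 - \theta_2^2 subject to P(\theta_1, \theta_2) = \theta_1 + \theta_2 = 1. The Lagrangian is

    \mathcal{L}(\theta_1, \theta_2, \eta) = -\theta_1^2 - \theta_2^2 - \eta (\theta_1 + \theta_2 - 1).

Setting its gradient to zero gives -2\theta_1 - \eta = 0, -2\theta_2 - \eta = 0, and \theta_1 + \theta_2 = 1, which is exactly \nabla Q = \eta \nabla P together with the constraint. Solving these three equations yields \eta = -1 and \theta_1 = \theta_2 = 1/2, the constrained maximizer.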
2.3. Factor Graph, the Sum-Product and Max-Product Algorithms, and the Forward-Backward Procedure

2.3.1. FACTOR GRAPH

A factor graph is a bipartite graph where the two partitions of the graph include the functions (or factors) f_j, ∀j, and the variables x_i, ∀i (Kschischang et al., 2001; Loeliger, 2004). Every factor is a function of two or more variables. The variables of a factor are connected by edges to it. Hence, we have a bipartite graph. Figure 1 shows an example factor graph where the factor and variable nodes are represented by circles and squares, respectively.

Figure 1. (a) An example factor graph, and (b) its representation as a bipartite graph.

2.3.2. THE SUM-PRODUCT ALGORITHM

Assume we have a factor graph with some variable nodes and factor nodes. First, one of the nodes is considered as the root of the tree. The result of the algorithm does not depend on which node is taken as the root (Bishop, 2006). Usually, the first variable or the last one is considered as the root. Then, some messages from the leaf (or leaves) are sent up the tree to the root (Bishop, 2006). The messages can be initialized to any fixed constant values (see Eq. (8)). The message from the variable node x_i to its neighbor factor node f_j is:

    m_{x_i \rightarrow f_j} = \prod_{f \in \mathcal{N}(x_i) \setminus f_j} m_{f \rightarrow x_i},    (6)

where f ∈ N(x_i) \ f_j means all the neighbor factor nodes of the variable x_i except f_j. Also, m_{x_i → f_j} and m_{f → x_i} denote the message from the variable x_i to the factor f_j and the message from the factor f to the variable x_i, respectively. The message from the factor node f_j to its neighbor variable x_i is:

    m_{f_j \rightarrow x_i} = \sum_{\text{values of } x \in \mathcal{N}(f_j) \setminus x_i} f_j \prod_{x \in \mathcal{N}(f_j) \setminus x_i} m_{x \rightarrow f_j}.    (7)

Eq. (7) includes both a sum and a product. This is why this procedure is named the sum-product algorithm (Kschischang et al., 2001; Bishop, 2006). In the factor graph, if the variable x_i has degree one, the message is:

    m_{x_i \rightarrow f_j} = [1, 1, \ldots, 1]^\top \in \mathbb{R}^k,    (8)

where k is the number of possible values that the variables can take.

Algorithm 1: The forward-backward algorithm using the sum-product sub-algorithm
1  Initialize the messages to [1, 1, ..., 1]^T
2  for time t from 1 to τ do
3      Forward pass: do the sum-product algorithm
4      Backward pass: do the sum-product algorithm
5      belief := forward message × backward message
6      if max(belief change) is below a tolerance then
7          break the loop
8  Return the beliefs

2.3.3. THE MAX-PRODUCT ALGORITHM

The max-product algorithm (Weiss & Freeman, 2001; Pearl, 2014) is similar to the sum-product algorithm, where the summation operator is replaced by the maximum operator. In this algorithm, the messages from the variable nodes to the factor nodes and vice versa are:

    m_{x_i \rightarrow f_j} = \prod_{f \in \mathcal{N}(x_i) \setminus f_j} m_{f \rightarrow x_i},    (10)

    m_{f_j \rightarrow x_i} = \max_{\text{values of } x \in \mathcal{N}(f_j) \setminus x_i} f_j \prod_{x \in \mathcal{N}(f_j) \setminus x_i} m_{x \rightarrow f_j}.    (11)

2.3.4. BELIEF PROPAGATION WITH THE FORWARD-BACKWARD PROCEDURE
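To make the message-passing rules concrete, the following is a minimal sketch in Python (with NumPy) of sum-product forward-backward belief propagation on a chain-structured factor graph. The chain length, the number of states k, and the random factor tables are hypothetical choices made only for illustration; the sketch follows Eqs. (6)-(8) and the spirit of Algorithm 1 rather than reproducing any implementation from the referenced works.

import numpy as np

# Hypothetical toy chain x1 - x2 - x3 - x4 with k-state variables,
# unary factors u_t(x_t) and pairwise factors p_t(x_t, x_{t+1}).
k = 3
rng = np.random.default_rng(0)
unary = [rng.random(k) for _ in range(4)]        # u_1, ..., u_4
pair = [rng.random((k, k)) for _ in range(3)]    # p_1, p_2, p_3

# Forward pass, from the first variable towards the last one.
forward = [np.ones(k)]                           # leaf message [1, ..., 1]^T, as in Eq. (8)
for t in range(3):
    m_var = forward[t] * unary[t]                # Eq. (6): product of the other incoming factor messages
    forward.append(pair[t].T @ m_var)            # Eq. (7): multiply by the pairwise factor, sum out the variable

# Backward pass, the same procedure from the last variable towards the first one.
backward = [np.ones(k)]
for t in range(3):
    m_var = backward[t] * unary[3 - t]           # Eq. (6) applied from the other end of the chain
    backward.append(pair[2 - t] @ m_var)         # Eq. (7): sum out the right-hand variable
backward = backward[::-1]                        # backward[t] is now the message arriving at x_{t+1} from the right

# Belief of each variable = forward message * backward message * its unary factor,
# normalized to a probability distribution over the k states.
for t in range(4):
    b = forward[t] * backward[t] * unary[t]
    print(f"marginal of x{t + 1}:", b / b.sum())

Since a chain is a tree, a single forward sweep and a single backward sweep already give the exact beliefs, so the convergence check of Algorithm 1 is not needed in this special case.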