Introduction to Hidden Markov Models
Alperen Degirmenci

This document contains derivations and algorithms for implementing Hidden Markov Models. The content presented here is a collection of my notes and personal insights from two seminal papers on HMMs, by Rabiner in 1989 [2] and Ghahramani in 2001 [1], and also from Kevin Murphy's book [3]. This is an excerpt from my project report for the MIT 6.867 Machine Learning class taught in Fall 2014.

(The author is with the School of Engineering and Applied Sciences at Harvard University, Cambridge, MA 02138 USA. [email protected]. © 2014 Alperen Degirmenci.)

I. HIDDEN MARKOV MODELS (HMMS)

HMMs have been widely used in many applications, such as speech recognition, activity recognition from video, gene finding, and gesture tracking. In this section, we will explain what HMMs are, how they are used for machine learning, their advantages and disadvantages, and how we implemented our own HMM algorithm.

A. Definition

A hidden Markov model is a tool for representing probability distributions over sequences of observations [1]. In this model, an observation X_t at time t is produced by a stochastic process, but the state Z_t of this process cannot be directly observed, i.e., it is hidden [2]. This hidden process is assumed to satisfy the Markov property: the state Z_t at time t depends only on the previous state, Z_{t-1} at time t-1. This is, in fact, called the first-order Markov model. (An nth-order Markov model depends on the n previous states.) Fig. 1 shows a Bayesian network representing the first-order HMM, where the hidden states are shaded in gray.

Fig. 1. A Bayesian network representing a first-order HMM, with hidden states Z_1, Z_2, ..., Z_t, ..., Z_N (shaded in gray) and observations X_1, X_2, ..., X_t, ..., X_N.

We should note that even though we talk about "time" to indicate that observations occur at discrete "time steps", "time" could also refer to locations within a sequence [3].

The joint distribution of a sequence of states and observations for the first-order HMM can be written as

    P(Z_{1:N}, X_{1:N}) = P(Z_1) P(X_1 | Z_1) \prod_{t=2}^{N} P(Z_t | Z_{t-1}) P(X_t | Z_t),        (1)

where the notation Z_{1:N} is used as a shorthand for Z_1, ..., Z_N. Notice that Eq. 1 can also be written as

    P(X_{1:N}, Z_{1:N}) = P(Z_1) \prod_{t=2}^{N} P(Z_t | Z_{t-1}) \prod_{t=1}^{N} P(X_t | Z_t),        (2)

which is the same as the expression given in the lecture notes.

There are five elements that characterize a hidden Markov model:

1) Number of states in the model, K: This is the number of states that the underlying hidden Markov process has. The states often have some relation to the phenomena being modeled. For example, if an HMM is being used for gesture recognition, each state may be a different gesture, or a part of the gesture. States are represented as integers 1, ..., K. We will encode the state Z_t at time t as a K × 1 vector of binary numbers, where the only non-zero element is the k-th element (i.e., Z_{tk} = 1), corresponding to state k ∈ {1, ..., K} at time t. While this may seem contrived, it will help us later in our computations. (Note that [2] uses N instead of K.)

2) Number of distinct observations, Ω: Observations are represented as integers 1, ..., Ω. We will encode the observation X_t at time t as an Ω × 1 vector of binary numbers, where the only non-zero element is the l-th element (i.e., X_{tl} = 1), corresponding to observation l ∈ {1, ..., Ω} at time t. While this may seem contrived, it will help us later in our computations. (Note that [2] uses M instead of Ω, and [1] uses D. We decided to use Ω since this agrees with the lecture notes.)

3) State transition model, A: Also called the state transition probability distribution [2] or the transition matrix [3], this is a K × K matrix whose elements A_{ij} describe the probability of transitioning from state Z_{t-1,i} to Z_{t,j} in one time step, where i, j ∈ {1, ..., K}. This can be written as

    A_{ij} = P(Z_{t,j} = 1 | Z_{t-1,i} = 1).        (3)

Each row of A sums to 1, \sum_j A_{ij} = 1, and therefore A is called a stochastic matrix. If any state can reach any other state in a single step (fully connected), then A_{ij} > 0 for all i, j; otherwise A will have some zero-valued elements. Fig. 2 shows state transition diagrams for a 2-state and a 3-state first-order Markov chain. For these diagrams, the state transition models are

    A_{(a)} = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}, \qquad A_{(b)} = \begin{bmatrix} A_{11} & A_{12} & 0 \\ A_{21} & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{bmatrix}.        (4)

Fig. 2. A state transition diagram for (a) a 2-state, and (b) a 3-state ergodic Markov chain. For a chain to be ergodic, any state should be reachable from any other state in a finite amount of time.

The conditional probability can be written as

    P(Z_t | Z_{t-1}) = \prod_{i=1}^{K} \prod_{j=1}^{K} A_{ij}^{Z_{t-1,i} Z_{t,j}}.        (5)

Taking the logarithm, we can write this as

    \log P(Z_t | Z_{t-1}) = \sum_{i=1}^{K} \sum_{j=1}^{K} Z_{t-1,i} Z_{t,j} \log A_{ij}        (6)
                          = Z_{t-1}^\top \log(A) Z_t.        (7)

4) Observation model, B: Also called the emission probabilities, B is an Ω × K matrix whose elements B_{kj} describe the probability of making observation X_{t,k} given state Z_{t,j}. This can be written as

    B_{kj} = P(X_t = k | Z_t = j).        (8)

The conditional probability can be written as

    P(X_t | Z_t) = \prod_{j=1}^{K} \prod_{k=1}^{Ω} B_{kj}^{Z_{t,j} X_{t,k}}.        (9)

Taking the logarithm, we can write this as

    \log P(X_t | Z_t) = \sum_{j=1}^{K} \sum_{k=1}^{Ω} Z_{t,j} X_{t,k} \log B_{kj}        (10)
                      = X_t^\top \log(B) Z_t.        (11)

5) Initial state distribution, π: This is a K × 1 vector of probabilities π_i = P(Z_{1,i} = 1). The conditional probability can be written as

    P(Z_1 | π) = \prod_{i=1}^{K} π_i^{Z_{1,i}}.        (12)

Given these five parameters, an HMM can be completely specified. In the literature, this often gets abbreviated as

    λ = (A, B, π).        (13)

B. Three Problems of Interest

In [2], Rabiner states that for the HMM to be useful in real-world applications, the following three problems must be solved:

• Problem 1: Given observations X_1, ..., X_N and a model λ = (A, B, π), how do we efficiently compute P(X_{1:N} | λ), the probability of the observations given the model? This is part of the exact inference problem presented in the lecture notes, and can be solved using forward filtering.

• Problem 2: Given observations X_1, ..., X_N and the model λ, how do we find the "correct" hidden state sequence Z_1, ..., Z_N that best "explains" the observations? This corresponds to finding the most probable sequence of hidden states from the lecture notes, and can be solved using the Viterbi algorithm. A related problem is calculating the probability of being in state k at time t given all the observations, P(Z_t = k | X_{1:N}), which can be calculated using the forward-backward algorithm.

• Problem 3: How do we adjust the model parameters λ = (A, B, π) to maximize P(X_{1:N} | λ)? This corresponds to the learning problem presented in the lecture notes, and can be solved using the Expectation-Maximization (EM) algorithm (in the case of HMMs, this is called the Baum-Welch algorithm).

C. The Forward-Backward Algorithm

The forward-backward algorithm is a dynamic programming algorithm that makes use of message passing (belief propagation). It allows us to compute the filtered and smoothed marginals, which can then be used to perform inference, MAP estimation, sequence classification, anomaly detection, and model-based clustering. We will follow the derivation presented in Murphy [3].

Fig. 3. Factor graph for a slice of the HMM at time t.

1) The Forward Algorithm: In this part, we compute the filtered marginals, P(Z_t | X_{1:t}), using the predict-update cycle. The prediction step calculates the one-step-ahead predictive density,

    P(Z_t = j | X_{1:t-1}) = \sum_{i=1}^{K} P(Z_t = j | Z_{t-1} = i) P(Z_{t-1} = i | X_{1:t-1}),        (14)

which acts as the new prior for time t. In the update step, the observed data from time t is absorbed using Bayes' rule:

    α_t(j) ≜ P(Z_t = j | X_{1:t})
           = P(Z_t = j | X_t, X_{1:t-1})
           = \frac{P(X_t | Z_t = j, X_{1:t-1}) P(Z_t = j | X_{1:t-1})}{\sum_{j'} P(X_t | Z_t = j', X_{1:t-1}) P(Z_t = j' | X_{1:t-1})}
           = \frac{1}{C_t} P(X_t | Z_t = j) P(Z_t = j | X_{1:t-1}),        (15)

where C_t = \sum_j P(X_t | Z_t = j) P(Z_t = j | X_{1:t-1}) is the normalization constant, and the conditioning on X_{1:t-1} in the emission term drops because X_t is conditionally independent of past observations given Z_t.
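To make the five-element parameterization λ = (A, B, π) and the factorization of Eq. (1) concrete, here is a minimal pure-Python sketch that evaluates the joint probability of a fully observed state/observation sequence. The function name and all parameter values are hypothetical toy numbers, not from the text; states and observations are 0-indexed integers.

```python
# Evaluating Eq. (1) for a fully observed (Z, X) sequence.
# Hypothetical toy model: K = 2 states, Omega = 2 observation symbols.
pi = [0.6, 0.4]            # initial state distribution, pi_i = P(Z_1 = i)
A = [[0.7, 0.3],           # transition matrix, A[i][j] = P(Z_t = j | Z_{t-1} = i)
     [0.4, 0.6]]           # (each row sums to 1: a stochastic matrix)
B = [[0.9, 0.2],           # emission matrix, B[k][j] = P(X_t = k | Z_t = j)
     [0.1, 0.8]]           # (each column sums to 1)

def joint_prob(states, obs):
    """P(Z_{1:N}, X_{1:N}) from Eq. (1), for 0-indexed integer sequences."""
    p = pi[states[0]] * B[obs[0]][states[0]]      # P(Z_1) P(X_1 | Z_1)
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]]          # P(Z_t | Z_{t-1})
        p *= B[obs[t]][states[t]]                 # P(X_t | Z_t)
    return p
```

For example, `joint_prob([0, 1], [0, 1])` multiplies 0.6 · 0.9 · 0.3 · 0.8. Summing this quantity over all K^N possible state sequences gives P(X_{1:N} | λ), which is exactly what forward filtering computes without the exponential enumeration.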
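The claim that the one-hot encoding of states "will help us later in our computations" can be checked directly on Eq. (6): because the binary vectors select exactly one row index i and one column index j, the double sum collapses to the single entry log A_{ij}. A small sketch, with a hypothetical 2 × 2 transition matrix:

```python
import math

K = 2
# Elementwise log of a hypothetical 2x2 transition matrix A (rows of A sum to 1)
logA = [[math.log(0.7), math.log(0.3)],
        [math.log(0.4), math.log(0.6)]]

def one_hot(k, size):
    """Encode state k as a length-`size` binary vector with the k-th entry set to 1."""
    v = [0] * size
    v[k] = 1
    return v

def bilinear_log_prob(z_prev, z_curr, logA):
    """Eq. (6): sum_i sum_j Z_{t-1,i} Z_{t,j} log A_{ij}."""
    return sum(z_prev[i] * z_curr[j] * logA[i][j]
               for i in range(K) for j in range(K))

# The one-hot vectors zero out all but one term, selecting log A_{ij}:
# bilinear_log_prob(one_hot(0, K), one_hot(1, K), logA) -> log(0.3)
```

This is the bilinear form of Eq. (7) written out as the explicit double sum; the same collapse applies to the emission term in Eq. (10).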
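The predict-update cycle of Eqs. (14)-(15) can be sketched in pure Python as follows. Normalizing at each update step also accumulates log P(X_{1:N} | λ), so the same loop answers Problem 1 via forward filtering. This is a sketch under the conventions above (rows of A sum to 1, B[k][j] = P(X_t = k | Z_t = j)); the function name and toy parameter values are hypothetical.

```python
import math

def forward(obs, pi, A, B):
    """Filtered marginals alpha_t(j) = P(Z_t = j | X_{1:t}) via Eqs. (14)-(15).

    Returns (alphas, loglik), where loglik = log P(X_{1:N} | lambda).
    """
    K = len(pi)
    alphas, loglik = [], 0.0
    for t, x in enumerate(obs):
        if t == 0:
            prior = pi                                # P(Z_1) acts as the first prior
        else:
            # Prediction step, Eq. (14): one-step-ahead predictive density
            prior = [sum(A[i][j] * alphas[-1][i] for i in range(K))
                     for j in range(K)]
        # Update step, Eq. (15): absorb X_t via Bayes' rule, then normalize
        unnorm = [B[x][j] * prior[j] for j in range(K)]
        C = sum(unnorm)                               # normalizer C_t = P(X_t | X_{1:t-1})
        loglik += math.log(C)
        alphas.append([u / C for u in unnorm])
    return alphas, loglik

# Hypothetical toy model for trying the function out
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.2], [0.1, 0.8]]
```

Each α_t sums to one, and exponentiating `loglik` recovers P(X_{1:N} | λ) in O(NK²) time, in contrast to the O(K^N) cost of summing the joint over every state path.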
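Problem 2 names the Viterbi algorithm without deriving it in this excerpt. For completeness, here is a minimal sketch of the standard dynamic-programming recursion it refers to, replacing the sum in Eq. (14) with a max and keeping backpointers; the function name and toy parameters are hypothetical, and zero-probability entries would need special handling before taking logs.

```python
import math

def viterbi(obs, pi, A, B):
    """Most probable hidden state path, arg max_Z P(Z_{1:N}, X_{1:N}) (Problem 2)."""
    K = len(pi)
    # delta[j]: max log-probability over paths ending in state j at the current t
    delta = [math.log(pi[j]) + math.log(B[obs[0]][j]) for j in range(K)]
    back = []                                  # backpointers for t = 2, ..., N
    for x in obs[1:]:
        # For each state j, the best predecessor i and the new path score
        ptr = [max(range(K), key=lambda i: delta[i] + math.log(A[i][j]))
               for j in range(K)]
        delta = [delta[ptr[j]] + math.log(A[ptr[j]][j]) + math.log(B[x][j])
                 for j in range(K)]
        back.append(ptr)
    # Backtrack from the best final state
    path = [max(range(K), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy model for trying the function out
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.2], [0.1, 0.8]]
```

Working in log space avoids numerical underflow on long sequences, the same reason the forward algorithm is normalized at each step.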