Variational Bayesian Inference for Hidden Markov Models With Multivariate Gaussian Output Distributions

Christian Gruhl, Bernhard Sick

C. Gruhl and B. Sick are with the University of Kassel, Department of Electrical Engineering and Computer Science, Wilhelmshoeher Allee 73, 34121 Kassel, Germany (email: cgruhl,[email protected]).

arXiv:1605.08618v1 [cs.LG] 27 May 2016

Abstract—Hidden Markov Models (HMM) have been used for several years in many time series analysis or pattern recognition tasks. HMM are often trained by means of the Baum-Welch algorithm, which can be seen as a special variant of an expectation maximization (EM) algorithm. Second-order training techniques such as Variational Bayesian Inference (VI) for probabilistic models regard the parameters of the probabilistic models as random variables and define distributions over these distribution parameters, hence the name of this technique. VI can also be regarded as a special case of an EM algorithm. In this article, we bring both together and train HMM with multivariate Gaussian output distributions with VI. The article defines the new training technique for HMM. An evaluation based on some case studies and a comparison to related approaches is part of our ongoing work.

Index Terms—Variational Bayesian Inference, Hidden Markov Models, Gaussian-Wishart distribution

1 INTRODUCTION

Hidden Markov Models (HMM) are a standard technique in time series analysis or pattern recognition. Given a (set of) time series sample data, they are typically trained by means of a special variant of an expectation maximization (EM) algorithm, the Baum-Welch algorithm. HMM are used for gesture recognition or machine tool monitoring, for instance.

Second-order techniques are used to find values for parameters of probabilistic models from sample data. The parameters are regarded as random variables, and distributions are defined over these variables. The type of these second-order distributions depends on the type of the underlying probabilistic model. Typically, so-called conjugate distributions are chosen, e.g., a Gaussian-Wishart distribution for an underlying Gaussian for which mean and precision matrix have to be determined. Second-order techniques have some advantages over conventional approaches, e.g.,
• the uncertainty associated with the determination of the parameters can be numerically expressed and used later,
• prior knowledge about parameters can be considered in the parameter estimation process,
• the parameter estimation (i.e., training) process can more easily be controlled (e.g., to avoid singularities), and
• the training process can easily be extended to automate the search for an appropriate number of model components in a mixture density model (e.g., a Gaussian mixture model).

If point estimates for parameters are needed, they can be derived from the second-order distributions in a maximum a posteriori (MAP) approach or by taking the expectation of the second-order distributions. Variational Bayesian Inference (VI), which can also be seen as a special variant of an expectation maximization (EM) algorithm, is a typical second-order approach [1].

Although the idea to combine VI and HMM is not completely new and there were already approaches to perform the HMM training in a variational framework (cf. [2]), typically only models with univariate output distributions (i.e., scalar values) are considered.

In this article, we bring these two ideas together and propose VI for HMM with multivariate Gaussian output distributions. The article defines the algorithm. An in-depth analysis of its properties, an experimental evaluation, and a comparison to related work are part of our current research.

Section 2 introduces the model and the notation we use in our work. Section 3 introduces VI for HMM. Finally, Section 4 concludes the article with a summary of the key results and a brief outlook.

2 MODEL AND NOTATION

We assume a GMM where each Gaussian is the output distribution of a hidden state. This is not as simple as it seems at first sight, especially when the Gaussians are overlapping: it is then not clear which observation was generated by which Gaussian (or by which state).

The GMM can be interpreted as a special instance of a HMM, namely a HMM with a transition matrix in which the transition probabilities from each state to every other state are the same and equal to the initial state distribution.
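To make the relation between a GMM and this special HMM concrete, the following short Python sketch builds such a transition matrix; it is only an illustration of the statement above, and the mixture weights are arbitrary assumed values.

import numpy as np

# Assumed mixing coefficients of a GMM with J = 3 components.
weights = np.array([0.5, 0.3, 0.2])

# Initial state distribution of the equivalent HMM.
pi = weights.copy()

# Transition matrix: every row equals the initial distribution, i.e.,
# the next state is drawn independently of the current state.
A = np.tile(pi, (len(pi), 1))

print(A)               # each row is [0.5 0.3 0.2]
print(A.sum(axis=1))   # every row sums to 1

With such a transition matrix the state at time step n is independent of the state at time step n − 1, which is exactly the sampling process of a GMM.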

Fig. 1. Graphical model of the approach (nodes Π, zn, Λ, µ, and xn, with a plate over the N observations; the E-step and M-step are indicated). Observations xn depend on the latent variables zn, which are the estimate of the state, as well as on the GMM parameters Λ (precision matrices) and µ (mean vectors, which also depend on the precision matrices) for the J components. The latent variables zn have an additional dependency on the transition matrix Π.

The coefficients estimated for the GMM are similar to the starting probabilities of the HMM.

In the remainder of the article, we use the following notation:
• E[x] is the expectation of the random variable x,
• vectors are denoted with a bold, lowercase symbol, e.g., x,
• matrices are denoted with a bold, uppercase symbol, e.g., X,
• X is the sequence of observations xn, with 1 ≤ n ≤ N and N = |X|,
• Z is the set of latent variables zn,j, with 1 ≤ n ≤ N and N = |X|, 1 ≤ j ≤ J, and J being the number of states (which is equal to the number of components of the GMM); here we use a 1-out-of-K coding,
• Θ is the parameter vector/matrix containing all model parameters (including transition probabilities π as well as output parameters µ, Λ),
• L is the likelihood or its lower bound approximation,
• Π is the transition matrix with rows πi,
• πi are the transition probabilities for state i, with 1 ≤ i ≤ J and elements πi,j,
• πi,j is the transition probability to move from state i to state j, with 1 ≤ i, j ≤ J, and
• πj is the probability to start in state j.

2.1 Hidden Markov Model

A Hidden Markov Model (HMM) thus consists of the transition matrix Π with rows πi, the starting probabilities πj, and, in our case, one multivariate Gaussian output distribution with parameters µj and Λj per state j.

3 VARIATIONAL INFERENCE

The direct optimization of p(X|Θ) is difficult, but the optimization of the complete-data likelihood function p(X, Z|Θ) is significantly easier. We introduce a distribution q(Z) defined over the latent variables, and we observe that, for any choice of q(Z), the following decomposition holds:

ln p(X|Θ) = L(q, Θ) + KL(q||p)    (1)

where we define

L(q, Θ) = ∫ q(Z) ln [ p(X, Z|Θ) / q(Z) ] dZ    (2)

KL(q||p) = − ∫ q(Z) ln [ p(Z|X, Θ) / q(Z) ] dZ    (3)

The latent variables Z absorbed the model parameters Θ, which are also random variables in this setup. To obtain an optimal model we are interested in maximizing the lower bound with respect to our variational distribution q:

argmax_q L(q)    (4)

which is the same as minimizing Eq. (3). Therefore, the optimum is reached when the variational distribution q(Z) matches the conditional (posterior) distribution p(Z|X). In that case the divergence KL(q||p) vanishes, since ln(1) = 0.

The factorization of the real distribution p is (see Fig. 1; n.b., only the values of the samples xn are observed):

p(X, Z, Π, µ, Λ) = p(X|Z, µ, Λ) p(Z|Π) p(Π) p(µ|Λ) p(Λ)    (5)

We assume that a factorization of the variational distribution q is possible as follows:

q(Z, Π, µ, Λ) = q(Z) q(Π, µ, Λ)    (6)
             = q(Z) q(Π) q(µ, Λ)    (7)
             = q(Z) ∏_{i=1}^{J} q(πi) ∏_{j=1}^{J} q(µj, Λj)    (8)

3.1 Choice of Distributions

For each state j, we assign an independent Dirichlet prior distribution for the transition probabilities, so that

p(Π) = ∏_{j=1}^{J} Dir(πj | αj^(0))    (9)

αj^(0) = {αj,1^(0), . . . , αj,J^(0)}    (10)

The variational posterior distributions for the model parameters turn out to have the following form:

q(Π) = ∏_{j=1}^{J} Dir(πj | αj)    (11)

αj = {αj,1, . . . , αj,J}    (12)
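As an illustration of these choices, the following Python sketch collects the hyper parameters of the Dirichlet prior of Eqs. (9)-(10) and of the Gaussian-Wishart prior defined in Eqs. (13)-(14) below. It is only a minimal sketch; the variable names and the concrete values (J, D, the uniform pseudo-counts) are illustrative assumptions, not part of the algorithm definition.

import numpy as np

J = 3   # number of states (assumed)
D = 2   # dimensionality of the observations (assumed)

# Dirichlet prior on each row of the transition matrix, Eqs. (9)-(10):
# one row of pseudo-counts alpha^(0)_j per state j.
alpha0 = np.ones((J, J))

# Gaussian-Wishart prior on the output parameters, Eqs. (13)-(14):
m0 = np.zeros(D)      # prior mean
beta0 = 1.0           # scaling of the precision of the mean
W0 = np.eye(D)        # scale matrix of the Wishart distribution
nu0 = float(D)        # degrees of freedom (must be at least D)

# The variational posteriors q(Pi) and q(mu_j, Lambda_j) keep exactly this
# form; training only updates alpha, m, beta, W, and nu per state.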

The means are assigned independent Gaussian prior distributions, conditional on the precisions. The precisions themselves are assigned independent Wishart prior distributions:

p(µ, Λ) = p(µ|Λ) p(Λ)    (13)
        = ∏_{j=1}^{J} N(µj | m0, (β0 Λj)^{-1}) · W(Λj | W0, ν0)    (14)

The variational posterior distributions for the model parameters are as follows (application of Bayes' theorem):

q(µj, Λj) = N(µj | mj, (βj Λj)^{-1}) · W(Λj | Wj, νj)    (15)

The variational posterior for q(Z) will have the form

q(Z) ∝ ∏_{n=1}^{M} ∏_{j=1}^{J} (bn,j)^{zn,j} · ∏_{n=1}^{N} ∏_{j=1}^{J} ∏_{s=1}^{J} (aj,s)^{zn,j · z(n+1),s}    (16)

which is identical to the one given by McGrory et al. in [2]. The expected logarithm for Eq. (16) can be derived as:

E[ln q(Z)] = Σ_{n=1}^{N} Σ_{j=1}^{J} γ(zn,j) E[ln p(xn | µj, Λj^{-1})]
           + Σ_{n=1}^{N−1} Σ_{j=1}^{J} Σ_{s=1}^{J} ξ(zn,j, zn+1,s) E[ln π̃j,s]    (17)

The distribution over Z given the transition table Π expands to

p(Z|Π) = p(z1|π) ∏_{n=2}^{N} (p(zn|zn−1))^{zn−1,j · zn,s}    (18)
       = π ∏_{n=2}^{N} (an−1,n)^{zn−1,j · zn,s}    (19)

and its expected value to:

E[ln p(Z|Π)] = Σ_{j=1}^{J} πj + Σ_{n=2}^{N} Σ_{j=1}^{J} Σ_{s=1}^{J} ξ(zn−1,j, zn,s) · E[ln π̃j,s]    (20)

For Eq. (19) confer Eq. (26) and Eq. (36).

3.2 E-Step

The expectation step (E-step) uses the Baum-Welch algorithm to estimate the latent variables for all observations in the sequence.

The latent variables γ(zn,j) denote the probability that the observation at time step n was generated by the j-th component of the model:

γ(zn) = E[zn] = p(zn|X) = υ(zn) ω(zn) / Σ_{z∈Z} υ(z) ω(z)    (21)

γ(zn,j) = E[zn,j] = υ(zn,j) ω(zn,j) / Σ_{k=1}^{J} υ(zn,k) ω(zn,k)    (22)

The transition probabilities ξ(zn−1,j, zn,s) express the uncertainty of how likely it is that a transition from state j to s has happened if observation xn−1 was generated by the j-th component and the n-th observation by the s-th component:

ξ(zn−1, zn) ∝ υ(zn−1) p(zn|zn−1) p(xn|zn) ω(zn)    (23)

ξ(zn−1,j, zn,s) ∝ υ(zn−1,j) aj,s bn,s ω(zn,s)    (24)

ξ(zn−1,j, zn,s) = υ(zn−1,j) aj,s bn,s ω(zn,s) / ( Σ_{k=1}^{J} Σ_{l=1}^{J} υ(zn−1,k) ak,l bn,l ω(zn,l) )    (25)

with

aj,s = exp{ E[ln π̃j,s] }    (26)

bn,j = exp{ E[ln p(xn | µj, Λj)] }    (27)

and the popular Baum-Welch algorithm:

υ(z1,j) = πj b1,j    (28)

υ(zn,j) = bn,j Σ_{k=1}^{J} υ(zn−1,k) ak,j    (29)

ω(zN,j) = 1    (30)

ω(zn,j) = Σ_{s=1}^{J} ω(zn+1,s) · aj,s · bn+1,s    (31)

Note that we substituted the common function names α → υ and β → ω, since those symbols are already occupied by other definitions.

The expected logarithm of the output density in Eq. (27) is given by

E[ln p(xn | µj, Λj^{-1})] = (1/2) E_Λj[ln |Λj|] − (D/2) ln(2π)
                          − (1/2) E_{µ,Λ}[ (xn − µj)^T Λj (xn − µj) ]    (32)

E_Λj[ln |Λj|] = Σ_{d=1}^{D} ψ( (νj + 1 − d)/2 ) + D ln 2 + ln |Wj|    (33)

E_{µj,Λj}[ (xn − µj)^T Λj (xn − µj) ] = D βj^{-1} + νj (xn − mj)^T Wj (xn − mj)    (34)

Eq. (33) and Eq. (34) are taken from [1]. Here, D = |x| is the dimensionality of the observed data.
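The quantities aj,s and bn,j from Eqs. (26)-(27) can be computed directly from the variational parameters via Eqs. (32)-(34) and Eq. (51). The following Python sketch shows this computation; it is only an illustrative sketch (the function names are not part of the paper) and assumes SciPy is available.

import numpy as np
from scipy.special import digamma

def quasi_transitions(alpha):
    """a[j, s] = exp(E[ln pi~_{j,s}]), Eq. (26) combined with Eq. (51)."""
    return np.exp(digamma(alpha) - digamma(alpha.sum(axis=1, keepdims=True)))

def expected_log_gauss(x, m, beta, W, nu):
    """E[ln p(x | mu_j, Lambda_j)] for a single state, Eqs. (32)-(34)."""
    D = x.shape[0]
    # Eq. (33): expected log determinant of the precision matrix.
    e_logdet = (digamma(0.5 * (nu + 1 - np.arange(1, D + 1))).sum()
                + D * np.log(2.0) + np.log(np.linalg.det(W)))
    # Eq. (34): expected quadratic form.
    diff = x - m
    e_quad = D / beta + nu * diff @ W @ diff
    # Eq. (32).
    return 0.5 * e_logdet - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * e_quad

# Eq. (27): b[n, j] = exp(expected_log_gauss(X[n], m[j], beta[j], W[j], nu[j]))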

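Putting the recursions of Eqs. (28)-(31) and the normalizations of Eqs. (22) and (25) together, a minimal E-step could be sketched in Python as below. This is only a sketch under the notation above; in practice the forward and backward variables should additionally be rescaled per time step to avoid numerical underflow.

import numpy as np

def e_step(pi, a, b):
    """Forward-backward pass (Eqs. 28-31) and responsibilities (Eqs. 22, 25).
    pi[j]: starting probabilities, a[j, s]: quasi transitions (Eq. 26),
    b[n, j]: quasi output values (Eq. 27)."""
    N, J = b.shape
    ups = np.zeros((N, J))     # forward variables (upsilon)
    omega = np.zeros((N, J))   # backward variables

    ups[0] = pi * b[0]                           # Eq. (28)
    for n in range(1, N):
        ups[n] = b[n] * (ups[n - 1] @ a)         # Eq. (29)

    omega[N - 1] = 1.0                           # Eq. (30)
    for n in range(N - 2, -1, -1):
        omega[n] = a @ (omega[n + 1] * b[n + 1])  # Eq. (31)

    gamma = ups * omega                          # Eq. (22)
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((N - 1, J, J))                 # Eq. (25)
    for n in range(1, N):
        unnorm = ups[n - 1][:, None] * a * (b[n] * omega[n])[None, :]
        xi[n - 1] = unnorm / unnorm.sum()
    return gamma, xi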
The conditional probability that a given observation xn is generated by the j-th state (that means the probability of zn,j = 1) is given by:

γ(zn,j) = E[zn,j] = Σ_{z} γ(z) · zn,j    (35)

Estimating the initial probabilities π that the model starts in state j:

πj = γ(z1,j) / Σ_{k=1}^{J} γ(z1,k) = γ(z1,j = 1)    (36)

At the beginning of the sequence there is no predecessor state, thus we can directly use the estimated γ(z1,j) for the j-th state.

3.3 M-Step

The maximization step (M-step) updates the parameters of the variational posterior distributions:

Nj = Σ_{n=1}^{N} γ(zn,j)    (37)

x̄j = (1/Nj) Σ_{n=1}^{N} γ(zn,j) xn    (38)

Sj = (1/Nj) Σ_{n=1}^{N} γ(zn,j) (xn − x̄j)(xn − x̄j)^T    (39)

βj = β0 + Nj    (40)

νj = ν0 + Nj    (41)

mj = (1/βj) (β0 m0 + Nj x̄j)    (42)

Wj^{-1} = W0^{-1} + Nj Sj + (β0 Nj / (β0 + Nj)) (x̄j − m0)(x̄j − m0)^T    (43)

All equations are based on the variational mixture of Gaussians [1] and adjusted for our hidden Markov model.

Maximizing the hyper distribution for the transition matrix:

αj,s = αj,s^(0) + Σ_{n=1}^{N−1} ξ(zn,j, zn+1,s)    (44)

for 1 ≤ j, s ≤ J and αj^(0) being the prior values for state j. The hyper parameters for the initial starting probabilities are maximized as follows:

αj = α0 + Nj    (45)

3.4 Variational Lower Bound

To decide whether the model has converged, we consult the change of the likelihood function of the model. If the model does not change anymore (or only in very small steps), the likelihood of multiple successive training iterations will be nearly identical. The calculation of the actual likelihood is too hard (if it is possible at all) and too cumbersome to be practicable; thus we use a lower bound approximation for the likelihood.

For the variational mixture of Gaussians, the lower bound is given by

L = Σ_{Z} ∫∫∫ q(Z, Π, µ, Λ) ln [ p(X, Z, Π, µ, Λ) / q(Z, Π, µ, Λ) ] dΠ dµ dΛ    (46)

which in our case is:

L = E[ln p(X, Z, Π, µ, Λ)] − E[ln q(Z, Π, µ, Λ)]    (47)
  = E[ln p(X|Z, µ, Λ)] + E[ln p(Z|Π)] + E[ln p(Π)] + E[ln p(µ, Λ)]
  − E[ln q(Z)] − E[ln q(Π)] − E[ln q(µ, Λ)]    (48)

The individual expectations expand to

E[ln p(Π)] = Σ_{j=1}^{J} { ln C(αj^(0)) + Σ_{s=1}^{J} (αj,s^(0) − 1) · E[ln π̃j,s] }    (49)

E[ln q(Π)] = Σ_{j=1}^{J} { ln C(αj) + Σ_{s=1}^{J} (αj,s − 1) · E[ln π̃j,s] }    (50)

E[ln π̃j,s] = ψ(αj,s) − ψ( Σ_{k=1}^{J} αj,k )    (51)

E[ln p(X|Z, µ, Λ)] = (1/2) Σ_{j=1}^{J} Nj { E[ln |Λj|] − D βj^{-1} − νj Tr(Sj Wj)
                     − νj (x̄j − mj)^T Wj (x̄j − mj) − D ln(2π) }    (52)

E[ln p(µ, Λ)] = (1/2) Σ_{j=1}^{J} { D ln(β0 / (2π)) + E[ln |Λj|] − D β0 / βj
                − β0 νj (mj − m0)^T Wj (mj − m0) }
                + J ln B(W0, ν0) + ((ν0 − D − 1)/2) Σ_{j=1}^{J} E[ln |Λj|]
                − (1/2) Σ_{j=1}^{J} νj Tr(W0^{-1} Wj)    (53)

E[ln q(µ, Λ)] = Σ_{j=1}^{J} { (1/2) E[ln |Λj|] + (D/2) ln(βj / (2π)) − D/2 − H[q(Λj)] }    (54)

H[q(Λj)] = − ln B(Wj, νj) − ((νj − D − 1)/2) E[ln |Λj|] + νj D / 2    (55)

The remaining terms E[ln p(Z|Π)] and E[ln q(Z)] are given by Eq. (20) and Eq. (17), respectively. The lower bound L is used to detect convergence of the model, i.e., the approaching of the best (real) parameters of the underlying distribution.
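A compact sketch of the M-step update equations of Section 3.3, Eqs. (37)-(44), is given below in Python. The function name and its interface are illustrative assumptions under the notation above, not the authors' implementation.

import numpy as np

def m_step(X, gamma, xi, m0, beta0, W0, nu0, alpha0):
    """Variational M-step: statistics (37)-(39) and updates (40)-(44)."""
    N, D = X.shape
    J = gamma.shape[1]

    Nj = gamma.sum(axis=0)                              # Eq. (37)
    xbar = (gamma.T @ X) / Nj[:, None]                  # Eq. (38)

    S = np.zeros((J, D, D))
    for j in range(J):                                  # Eq. (39)
        diff = X - xbar[j]
        S[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]

    beta = beta0 + Nj                                   # Eq. (40)
    nu = nu0 + Nj                                       # Eq. (41)
    m = (beta0 * m0 + Nj[:, None] * xbar) / beta[:, None]   # Eq. (42)

    W = np.zeros((J, D, D))
    W0_inv = np.linalg.inv(W0)
    for j in range(J):                                  # Eq. (43), inverted at the end
        d = (xbar[j] - m0)[:, None]
        W_inv = W0_inv + Nj[j] * S[j] + (beta0 * Nj[j] / (beta0 + Nj[j])) * (d @ d.T)
        W[j] = np.linalg.inv(W_inv)

    alpha = alpha0 + xi.sum(axis=0)                     # Eq. (44)
    return dict(N=Nj, xbar=xbar, S=S, beta=beta, nu=nu, m=m, W=W, alpha=alpha)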

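The convergence check of Section 3.4 then leads to a simple training loop. The following skeleton only sketches how the pieces could be combined; the callables e_step, m_step, and lower_bound stand in for the computations of Sections 3.2-3.4, and the tolerance is an arbitrary assumed value.

import numpy as np

def train(X, init_params, e_step, m_step, lower_bound, tol=1e-4, max_iter=200):
    """Iterate E- and M-step until the change of the lower bound L
    (Eqs. 46-48) falls below tol."""
    params = init_params
    L_old = -np.inf
    for _ in range(max_iter):
        gamma, xi = e_step(X, params)          # Section 3.2
        params = m_step(X, gamma, xi, params)  # Section 3.3
        L_new = lower_bound(X, gamma, xi, params)  # Section 3.4
        if abs(L_new - L_old) < tol:           # convergence criterion
            break
        L_old = L_new
    return params, L_old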
3.5 Choice of Hyper Parameters

We can either choose the priors αj,s^(0) for the transitions at random (with a seed) or make some assumptions and use them for initialization, e.g.:

αj,s^(0) = 0.5 if j = s, and 1/(2J) otherwise    (56)

which means that the (prior) probability to stay in a state is always 0.5 and transitions to the other states are equally likely. The prior for the starting states αj^(0) = α^(0) is the same for all states.

Do we need to include actual prior knowledge when using VI approaches? No, the main advantage of relying on a training that is based on variational Bayesian inference is that the introduction of the distributions over the model parameters prevents us from running into local minima, especially those that arise when a component (or a state) collapses over a single observation (or multiple observations with identical characteristics). In that case the variance approaches 0 (and the density at the mean approaches ∞, since ∫ p(x) dx = 1), which would normally dramatically increase the likelihood for the model – this is also known as a singularity (and one of the known drawbacks of standard EM). Using the 2nd order approaches prevents this by a low density in the parameter space where the variance approaches 0 (therefore p(x|σ) → ∞ is attenuated by a low density for p(σ)).

4 CONCLUSION AND OUTLOOK

In this article, we adapted the concept of second-order training techniques to Hidden Markov Models. A training algorithm has been defined following the ideas of Variational Bayesian Inference and the Baum-Welch algorithm.

Part of our ongoing research is the evaluation of the new training algorithm using various benchmark data, the analysis of the computational complexity of the algorithm as well as the actual run-times on these benchmark data, and a comparison to a standard Baum-Welch approach.

In our future work we will extend the approach further by allowing different discrete and continuous distributions in different dimensions of an output distribution. We will also use the HMM trained with the new algorithm for anomaly detection (in particular for the detection of temporal anomalies). For that purpose, we will extend the technique we have proposed for GMM in [3].

ACKNOWLEDGMENT

This work was supported by the German Research Foundation (DFG) under grant number SI 674/9-1 (project CYPHOC).

APPENDIX

Dirichlet

Dir(µ|α) = C(α) ∏_{k=1}^{K} µk^{αk − 1}    (57)

with constraints

Σ_{k=1}^{K} µk = 1    (58)

0 ≤ µk ≤ 1    (59)

|µ| = |α| = K    (60)

α̂ = Σ_{k=1}^{K} αk    (61)

C(α) = Γ(α̂) / ∏_{k=1}^{K} Γ(αk)    (62)

H[µ] = − Σ_{k=1}^{K} (αk − 1) { ψ(αk) − ψ(α̂) } − ln C(α)    (63)

Digamma function:

ψ(a) ≡ (d/da) ln Γ(a)    (64)

Gamma function:

Γ(x) ≡ ∫_0^∞ u^{x−1} e^{−u} du    (65)

Normal and Wishart

N(x|µ, Λ) = (1 / (2π)^{D/2}) (1 / |Λ^{-1}|^{1/2}) exp{ −(1/2) (x − µ)^T Λ (x − µ) }    (66)

W(Λ|W, ν) = B(W, ν) |Λ|^{(ν−D−1)/2} exp{ −(1/2) Tr(W^{-1} Λ) }    (67)

where

B(W, ν) = |W|^{−ν/2} ( 2^{νD/2} π^{D(D−1)/4} ∏_{i=1}^{D} Γ((ν + 1 − i)/2) )^{−1}    (68)
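When the lower bound terms are evaluated numerically, the normalization constants C(α) of Eq. (62) and B(W, ν) of Eq. (68) are best computed in log space to avoid overflow of the Gamma function. A small Python sketch (the function names are illustrative, assuming SciPy):

import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    """ln C(alpha) of the Dirichlet distribution, Eqs. (61)-(62)."""
    return gammaln(alpha.sum()) - gammaln(alpha).sum()

def log_B(W, nu):
    """ln B(W, nu) of the Wishart distribution, Eq. (68)."""
    D = W.shape[0]
    i = np.arange(1, D + 1)
    return (-0.5 * nu * np.log(np.linalg.det(W))
            - 0.5 * nu * D * np.log(2.0)
            - 0.25 * D * (D - 1) * np.log(np.pi)
            - gammaln(0.5 * (nu + 1 - i)).sum())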

Gaussian-Wishart

p(µ, Λ | µ0, β, W, ν) = N(µ | µ0, (βΛ)^{-1}) W(Λ | W, ν)    (69)

This is the conjugate prior distribution for a multivariate Gaussian N(x|µ, Λ) in which both the mean µ and the precision matrix Λ are unknown, and is also called the Normal-Wishart distribution.

Christian Gruhl received his M.Sc. in Computer Science in 2014 at the University of Kassel with Honors. Currently he is working towards his Ph.D. as a research assistant at the Intelligent Embedded Systems lab of Prof. Sick. His research interests focus on 2nd order training algorithms and their applications in cyber physical systems.

Bernhard Sick received a diploma (1992, M.Sc. equivalent), a Ph.D. degree (1999), and a "Habilitation" degree (2004), all in computer science, from the University of Passau, Germany. Currently, he is full professor for Intelligent Embedded Systems at the Faculty for Electrical Engineering and Computer Science of the University of Kassel, Germany. There, he is conducting research in the areas Autonomic and Organic Computing and Technical Data Analytics with applications in biometrics, intrusion detection, energy management, automotive engineering, and others. Bernhard Sick authored more than 100 peer-reviewed publications in these areas. He is a member of IEEE (Systems, Man, and Cybernetics Society, Computer Society, and Computational Intelligence Society) and GI (Gesellschaft fuer Informatik). Bernhard Sick is associate editor of the IEEE Transactions on Cybernetics; he holds one patent and received several thesis, best paper, teaching, and inventor awards.

REFERENCES

[1] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[2] C. A. McGrory and D. M. Titterington, "Variational Bayesian analysis for hidden Markov models," Australian & New Zealand Journal of Statistics, vol. 51, no. 2, pp. 227–244, 2009.
[3] C. Gruhl and B. Sick, "Detecting Novel Processes with CANDIES – An Holistic Novelty Detection Technique based on Probabilistic Models," arXiv:1605.05628, 2016, (last access: 25.05.2016). [Online]. Available: https://arxiv.org/abs/1605.05628