Variational Bayesian Inference for Hidden Markov Models With Multivariate Gaussian Output Distributions

Christian Gruhl, Bernhard Sick

C. Gruhl and B. Sick are with the University of Kassel, Department of Electrical Engineering and Computer Science, Wilhelmshoeher Allee 73, 34121 Kassel, Germany (email: cgruhl,[email protected]).

arXiv:1605.08618v1 [cs.LG] 27 May 2016

Abstract—Hidden Markov Models (HMM) have been used for several years in many time series analysis or pattern recognition tasks. HMM are often trained by means of the Baum-Welch algorithm, which can be seen as a special variant of an expectation maximization (EM) algorithm. Second-order training techniques such as Variational Bayesian Inference (VI) for probabilistic models regard the parameters of the probabilistic models as random variables and define distributions over these distribution parameters, hence the name of this technique. VI can also be regarded as a special case of an EM algorithm. In this article, we bring both together and train HMM with multivariate Gaussian output distributions with VI. The article defines the new training technique for HMM. An evaluation based on some case studies and a comparison to related approaches is part of our ongoing work.

Index Terms—Variational Bayesian Inference, Hidden Markov Models, Gaussian-Wishart distribution

1 INTRODUCTION

Hidden Markov Models (HMM) are a standard technique in time series analysis or pattern recognition. Given a (set of) time series sample data, they are typically trained by means of a special variant of an expectation maximization (EM) algorithm, the Baum-Welch algorithm. HMM are used for gesture recognition or machine tool monitoring, for instance.

Second-order techniques are used to find values for parameters of probabilistic models from sample data. The parameters are regarded as random variables, and distributions are defined over these variables. The type of these second-order distributions depends on the type of the underlying probabilistic model. Typically, so-called conjugate distributions are chosen, e.g., a Gaussian-Wishart distribution for an underlying Gaussian for which mean and precision matrix have to be determined. Second-order techniques have some advantages over conventional approaches, e.g.,
• the uncertainty associated with the determination of the parameters can be numerically expressed and used later,
• prior knowledge about parameters can be considered in the parameter estimation process,
• the parameter estimation (i.e., training) process can more easily be controlled (e.g., to avoid singularities), and
• the training process can easily be extended to automate the search for an appropriate number of model components in a mixture density model (e.g., a Gaussian mixture model).

If point estimates for parameters are needed, they can be derived from the second-order distributions in a maximum a posteriori (MAP) approach or by taking the expectation of the second-order distributions. Variational Bayesian Inference (VI), which can also be seen as a special variant of an expectation maximization (EM) algorithm, is a typical second-order approach [1].

Although the idea to combine VI and HMM is not completely new and there were already approaches to perform the HMM training in a variational framework (cf. [2]), typically only models with univariate output distributions (i.e., scalar values) are considered.

In this article, we bring these two ideas together and propose VI for HMM with multivariate Gaussian output distributions. The article defines the algorithm. An in-depth analysis of its properties, an experimental evaluation, and a comparison to related work are part of our current research.

Section 2 introduces the model and the notation we use in our work. Section 3 introduces VI for HMM. Finally, Section 4 concludes the article with a summary of the key results and a brief outlook.

2 MODEL AND NOTATION

We assume a GMM where each Gaussian is the output distribution of a hidden state. This is not as simple as it seems at first sight, especially when the Gaussians are overlapping: it is then not clear which observation was generated by which Gaussian (or by which state).

The GMM can be interpreted as a special instance of a HMM, namely a HMM with a transition matrix in which the transition probabilities from each state to every other state are the same and equal to the initial state distribution.
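To make the relation between a GMM and this special HMM concrete, the following short Python sketch builds such a transition matrix; it is only an illustration of the statement above, and the mixture weights are arbitrary assumed values.

import numpy as np

# Assumed mixing coefficients of a GMM with J = 3 components.
weights = np.array([0.5, 0.3, 0.2])

# Initial state distribution of the equivalent HMM.
pi = weights.copy()

# Transition matrix: every row equals the initial distribution, i.e.,
# the next state is drawn independently of the current state.
A = np.tile(pi, (len(pi), 1))

print(A)               # each row is [0.5 0.3 0.2]
print(A.sum(axis=1))   # every row sums to 1

With such a transition matrix the state at time step n is independent of the state at time step n − 1, which is exactly the sampling process of a GMM.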

Fig. 1. Graphical model of the approach (nodes Π, zn, Λ, µ, and xn, with a plate over the N observations; the E-step and M-step are indicated). Observations xn depend on the latent variables zn, which are the estimate of the state, as well as on the GMM parameters Λ (precision matrices) and µ (mean vectors, which also depend on the precision matrices) for the J components. The latent variables zn have an additional dependency on the transition matrix Π.

The coefficients estimated for the GMM are similar to the starting probabilities of the HMM.

In the remainder of the article, we use the following notation:
• E[x] is the expectation of the random variable x,
• vectors are denoted with a bold, lowercase symbol, e.g., x,
• matrices are denoted with a bold, uppercase symbol, e.g., X,
• X is the sequence of observations xn, with 1 ≤ n ≤ N and N = |X|,
• Z is the set of latent variables zn,j, with 1 ≤ n ≤ N and N = |X|, 1 ≤ j ≤ J, and J being the number of states (which is equal to the number of components of the GMM); here we use a 1-out-of-K coding,
• Θ is the parameter vector/matrix containing all model parameters (including transition probabilities π as well as output parameters µ, Λ),
• L is the likelihood or its lower bound approximation,
• Π is the transition matrix with rows πi,
• πi are the transition probabilities for state i, with 1 ≤ i ≤ J and elements πi,j,
• πi,j is the transition probability to move from state i to state j, with 1 ≤ i, j ≤ J, and
• πj is the probability to start in state j.

2.1 Hidden Markov Model

A Hidden Markov Model (HMM) thus consists of the transition matrix Π with rows πi, the starting probabilities πj, and, in our case, one multivariate Gaussian output distribution with parameters µj and Λj per state j.

3 VARIATIONAL INFERENCE

The direct optimization of p(X|Θ) is difficult, but the optimization of the complete-data likelihood function p(X, Z|Θ) is significantly easier. We introduce a distribution q(Z) defined over the latent variables, and we observe that, for any choice of q(Z), the following decomposition holds:

ln p(X|Θ) = L(q, Θ) + KL(q||p)    (1)

where we define

L(q, Θ) = ∫ q(Z) ln [ p(X, Z|Θ) / q(Z) ] dZ    (2)

KL(q||p) = − ∫ q(Z) ln [ p(Z|X, Θ) / q(Z) ] dZ    (3)

The latent variables Z absorbed the model parameters Θ, which are also random variables in this setup. To obtain an optimal model we are interested in maximizing the lower bound with respect to our variational distribution q:

argmax_q L(q)    (4)

which is the same as minimizing Eq. (3). Therefore, the optimum is reached when the variational distribution q(Z) matches the conditional (posterior) distribution p(Z|X). In that case the divergence KL(q||p) vanishes, since ln(1) = 0.

The factorization of the real distribution p is (see Fig. 1; n.b., only the values of the samples xn are observed):

p(X, Z, Π, µ, Λ) = p(X|Z, µ, Λ) p(Z|Π) p(Π) p(µ|Λ) p(Λ)    (5)

We assume that a factorization of the variational distribution q is possible as follows:

q(Z, Π, µ, Λ) = q(Z) q(Π, µ, Λ)    (6)
             = q(Z) q(Π) q(µ, Λ)    (7)
             = q(Z) ∏_{i=1}^{J} q(πi) ∏_{j=1}^{J} q(µj, Λj)    (8)

3.1 Choice of Distributions

For each state j, we assign an independent Dirichlet prior distribution for the transition probabilities, so that

p(Π) = ∏_{j=1}^{J} Dir(πj | αj^(0))    (9)

αj^(0) = {αj,1^(0), . . . , αj,J^(0)}    (10)

The variational posterior distributions for the model parameters turn out to have the following form:

q(Π) = ∏_{j=1}^{J} Dir(πj | αj)    (11)

αj = {αj,1, . . . , αj,J}    (12)
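As an illustration of these choices, the following Python sketch collects the hyper parameters of the Dirichlet prior of Eqs. (9)-(10) and of the Gaussian-Wishart prior defined in Eqs. (13)-(14) below. It is only a minimal sketch; the variable names and the concrete values (J, D, the uniform pseudo-counts) are illustrative assumptions, not part of the algorithm definition.

import numpy as np

J = 3   # number of states (assumed)
D = 2   # dimensionality of the observations (assumed)

# Dirichlet prior on each row of the transition matrix, Eqs. (9)-(10):
# one row of pseudo-counts alpha^(0)_j per state j.
alpha0 = np.ones((J, J))

# Gaussian-Wishart prior on the output parameters, Eqs. (13)-(14):
m0 = np.zeros(D)      # prior mean
beta0 = 1.0           # scaling of the precision of the mean
W0 = np.eye(D)        # scale matrix of the Wishart distribution
nu0 = float(D)        # degrees of freedom (must be at least D)

# The variational posteriors q(Pi) and q(mu_j, Lambda_j) keep exactly this
# form; training only updates alpha, m, beta, W, and nu per state.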

The means are assigned independent Gaussian prior distributions, conditional on the precisions. The precisions themselves are assigned independent Wishart prior distributions:

p(µ, Λ) = p(µ|Λ) p(Λ)    (13)
        = ∏_{j=1}^{J} N(µj | m0, (β0 Λj)^{-1}) · W(Λj | W0, ν0)    (14)

The variational posterior distributions for the model parameters are as follows (application of Bayes' theorem):

q(µj, Λj) = N(µj | mj, (βj Λj)^{-1}) · W(Λj | Wj, νj)    (15)

The variational posterior for q(Z) will have the form

q(Z) ∝ ∏_{n=1}^{M} ∏_{j=1}^{J} (bn,j)^{zn,j} · ∏_{n=1}^{N} ∏_{j=1}^{J} ∏_{s=1}^{J} (aj,s)^{zn,j · z(n+1),s}    (16)

which is identical to the one given by McGrory et al. in [2]. The expected logarithm for Eq. (16) can be derived as:

E[ln q(Z)] = Σ_{n=1}^{N} Σ_{j=1}^{J} γ(zn,j) E[ln p(xn | µj, Λj^{-1})]
           + Σ_{n=1}^{N−1} Σ_{j=1}^{J} Σ_{s=1}^{J} ξ(zn,j, zn+1,s) E[ln π̃j,s]    (17)

The distribution over Z given the transition table Π expands to

p(Z|Π) = p(z1|π) ∏_{n=2}^{N} (p(zn|zn−1))^{zn−1,j · zn,s}    (18)
       = π ∏_{n=2}^{N} (an−1,n)^{zn−1,j · zn,s}    (19)

and its expected value to:

E[ln p(Z|Π)] = Σ_{j=1}^{J} πj + Σ_{n=2}^{N} Σ_{j=1}^{J} Σ_{s=1}^{J} ξ(zn−1,j, zn,s) · E[ln π̃j,s]    (20)

For Eq. (19) confer Eq. (26) and Eq. (36).

3.2 E-Step

The expectation step (E-step) uses the Baum-Welch algorithm to estimate the latent variables for all observations in the sequence.

The latent variables γ(zn,j) denote the probability that the observation at time step n was generated by the j-th component of the model:

γ(zn) = E[zn] = p(zn|X) = υ(zn) ω(zn) / Σ_{z∈Z} υ(z) ω(z)    (21)

γ(zn,j) = E[zn,j] = υ(zn,j) ω(zn,j) / Σ_{k=1}^{J} υ(zn,k) ω(zn,k)    (22)

The transition probabilities ξ(zn−1,j, zn,s) express the uncertainty of how likely it is that a transition from state j to s has happened if observation xn−1 was generated by the j-th component and the n-th observation by the s-th component:

ξ(zn−1, zn) ∝ υ(zn−1) p(zn|zn−1) p(xn|zn) ω(zn)    (23)

ξ(zn−1,j, zn,s) ∝ υ(zn−1,j) aj,s bn,s ω(zn,s)    (24)

ξ(zn−1,j, zn,s) = υ(zn−1,j) aj,s bn,s ω(zn,s) / ( Σ_{k=1}^{J} Σ_{l=1}^{J} υ(zn−1,k) ak,l bn,l ω(zn,l) )    (25)

with

aj,s = exp{ E[ln π̃j,s] }    (26)

bn,j = exp{ E[ln p(xn | µj, Λj)] }    (27)

and the popular Baum-Welch algorithm:

υ(z1,j) = πj b1,j    (28)

υ(zn,j) = bn,j Σ_{k=1}^{J} υ(zn−1,k) ak,j    (29)

ω(zN,j) = 1    (30)

ω(zn,j) = Σ_{s=1}^{J} ω(zn+1,s) · aj,s · bn+1,s    (31)

Note that we substituted the common function names α → υ and β → ω, since those symbols are already occupied by other definitions.

The expected logarithm of the output density in Eq. (27) is given by

E[ln p(xn | µj, Λj^{-1})] = (1/2) E_Λj[ln |Λj|] − (D/2) ln(2π)
                          − (1/2) E_{µ,Λ}[ (xn − µj)^T Λj (xn − µj) ]    (32)

E_Λj[ln |Λj|] = Σ_{d=1}^{D} ψ( (νj + 1 − d)/2 ) + D ln 2 + ln |Wj|    (33)

E_{µj,Λj}[ (xn − µj)^T Λj (xn − µj) ] = D βj^{-1} + νj (xn − mj)^T Wj (xn − mj)    (34)

Eq. (33) and Eq. (34) are taken from [1]. Here, D = |x| is the dimensionality of the observed data.
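The quantities aj,s and bn,j from Eqs. (26)-(27) can be computed directly from the variational parameters via Eqs. (32)-(34) and Eq. (51). The following Python sketch shows this computation; it is only an illustrative sketch (the function names are not part of the paper) and assumes SciPy is available.

import numpy as np
from scipy.special import digamma

def quasi_transitions(alpha):
    """a[j, s] = exp(E[ln pi~_{j,s}]), Eq. (26) combined with Eq. (51)."""
    return np.exp(digamma(alpha) - digamma(alpha.sum(axis=1, keepdims=True)))

def expected_log_gauss(x, m, beta, W, nu):
    """E[ln p(x | mu_j, Lambda_j)] for a single state, Eqs. (32)-(34)."""
    D = x.shape[0]
    # Eq. (33): expected log determinant of the precision matrix.
    e_logdet = (digamma(0.5 * (nu + 1 - np.arange(1, D + 1))).sum()
                + D * np.log(2.0) + np.log(np.linalg.det(W)))
    # Eq. (34): expected quadratic form.
    diff = x - m
    e_quad = D / beta + nu * diff @ W @ diff
    # Eq. (32).
    return 0.5 * e_logdet - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * e_quad

# Eq. (27): b[n, j] = exp(expected_log_gauss(X[n], m[j], beta[j], W[j], nu[j]))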

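Putting the recursions of Eqs. (28)-(31) and the normalizations of Eqs. (22) and (25) together, a minimal E-step could be sketched in Python as below. This is only a sketch under the notation above; in practice the forward and backward variables should additionally be rescaled per time step to avoid numerical underflow.

import numpy as np

def e_step(pi, a, b):
    """Forward-backward pass (Eqs. 28-31) and responsibilities (Eqs. 22, 25).
    pi[j]: starting probabilities, a[j, s]: quasi transitions (Eq. 26),
    b[n, j]: quasi output values (Eq. 27)."""
    N, J = b.shape
    ups = np.zeros((N, J))     # forward variables (upsilon)
    omega = np.zeros((N, J))   # backward variables

    ups[0] = pi * b[0]                           # Eq. (28)
    for n in range(1, N):
        ups[n] = b[n] * (ups[n - 1] @ a)         # Eq. (29)

    omega[N - 1] = 1.0                           # Eq. (30)
    for n in range(N - 2, -1, -1):
        omega[n] = a @ (omega[n + 1] * b[n + 1])  # Eq. (31)

    gamma = ups * omega                          # Eq. (22)
    gamma /= gamma.sum(axis=1, keepdims=True)

    xi = np.zeros((N - 1, J, J))                 # Eq. (25)
    for n in range(1, N):
        unnorm = ups[n - 1][:, None] * a * (b[n] * omega[n])[None, :]
        xi[n - 1] = unnorm / unnorm.sum()
    return gamma, xi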
The conditional probability that a given observation xn is generated by the j-th state (that means the probability of zn,j = 1) is given by:

γ(zn,j) = E[zn,j] = Σ_{z} γ(z) · zn,j    (35)

Estimating the initial probabilities π that the model starts in state j:

πj = γ(z1,j) / Σ_{k=1}^{J} γ(z1,k) = γ(z1,j = 1)    (36)

At the beginning of the sequence there is no predecessor state, thus we can directly use the estimated γ(z1,j) for the j-th state.

3.3 M-Step

The maximization step (M-step) updates the parameters of the variational posterior distributions:

Nj = Σ_{n=1}^{N} γ(zn,j)    (37)

x̄j = (1/Nj) Σ_{n=1}^{N} γ(zn,j) xn    (38)

Sj = (1/Nj) Σ_{n=1}^{N} γ(zn,j) (xn − x̄j)(xn − x̄j)^T    (39)

βj = β0 + Nj    (40)

νj = ν0 + Nj    (41)

mj = (1/βj) (β0 m0 + Nj x̄j)    (42)

Wj^{-1} = W0^{-1} + Nj Sj + (β0 Nj / (β0 + Nj)) (x̄j − m0)(x̄j − m0)^T    (43)

All equations are based on the variational mixture of Gaussians [1] and adjusted for our hidden Markov model.

Maximizing the hyper distribution for the transition matrix:

αj,s = αj,s^(0) + Σ_{n=1}^{N−1} ξ(zn,j, zn+1,s)    (44)

for 1 ≤ j, s ≤ J and αj^(0) being the prior values for state j. The hyper parameters for the initial starting probabilities are maximized as follows:

αj = α0 + Nj    (45)

3.4 Variational Lower Bound

To decide whether the model has converged, we consult the change of the likelihood function of the model. If the model does not change anymore (or only in very small steps), the likelihood of multiple successive training iterations will be nearly identical. The calculation of the actual likelihood is too hard (if it is possible at all) and too cumbersome to be practicable; thus we use a lower bound approximation for the likelihood.

For the variational mixture of Gaussians, the lower bound is given by

L = Σ_{Z} ∫∫∫ q(Z, Π, µ, Λ) ln [ p(X, Z, Π, µ, Λ) / q(Z, Π, µ, Λ) ] dΠ dµ dΛ    (46)

which in our case is:

L = E[ln p(X, Z, Π, µ, Λ)] − E[ln q(Z, Π, µ, Λ)]    (47)
  = E[ln p(X|Z, µ, Λ)] + E[ln p(Z|Π)] + E[ln p(Π)] + E[ln p(µ, Λ)]
  − E[ln q(Z)] − E[ln q(Π)] − E[ln q(µ, Λ)]    (48)

The individual expectations expand to

E[ln p(Π)] = Σ_{j=1}^{J} { ln C(αj^(0)) + Σ_{s=1}^{J} (αj,s^(0) − 1) · E[ln π̃j,s] }    (49)

E[ln q(Π)] = Σ_{j=1}^{J} { ln C(αj) + Σ_{s=1}^{J} (αj,s − 1) · E[ln π̃j,s] }    (50)

E[ln π̃j,s] = ψ(αj,s) − ψ( Σ_{k=1}^{J} αj,k )    (51)

E[ln p(X|Z, µ, Λ)] = (1/2) Σ_{j=1}^{J} Nj { E[ln |Λj|] − D βj^{-1} − νj Tr(Sj Wj)
                     − νj (x̄j − mj)^T Wj (x̄j − mj) − D ln(2π) }    (52)

E[ln p(µ, Λ)] = (1/2) Σ_{j=1}^{J} { D ln(β0 / (2π)) + E[ln |Λj|] − D β0 / βj
                − β0 νj (mj − m0)^T Wj (mj − m0) }
                + J ln B(W0, ν0) + ((ν0 − D − 1)/2) Σ_{j=1}^{J} E[ln |Λj|]
                − (1/2) Σ_{j=1}^{J} νj Tr(W0^{-1} Wj)    (53)

E[ln q(µ, Λ)] = Σ_{j=1}^{J} { (1/2) E[ln |Λj|] + (D/2) ln(βj / (2π)) − D/2 − H[q(Λj)] }    (54)

H[q(Λj)] = − ln B(Wj, νj) − ((νj − D − 1)/2) E[ln |Λj|] + νj D / 2    (55)

The remaining terms E[ln p(Z|Π)] and E[ln q(Z)] are given by Eq. (20) and Eq. (17), respectively. The lower bound L is used to detect convergence of the model, i.e., the approaching of the best (real) parameters of the underlying distribution.
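A compact sketch of the M-step update equations of Section 3.3, Eqs. (37)-(44), is given below in Python. The function name and its interface are illustrative assumptions under the notation above, not the authors' implementation.

import numpy as np

def m_step(X, gamma, xi, m0, beta0, W0, nu0, alpha0):
    """Variational M-step: statistics (37)-(39) and updates (40)-(44)."""
    N, D = X.shape
    J = gamma.shape[1]

    Nj = gamma.sum(axis=0)                              # Eq. (37)
    xbar = (gamma.T @ X) / Nj[:, None]                  # Eq. (38)

    S = np.zeros((J, D, D))
    for j in range(J):                                  # Eq. (39)
        diff = X - xbar[j]
        S[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j]

    beta = beta0 + Nj                                   # Eq. (40)
    nu = nu0 + Nj                                       # Eq. (41)
    m = (beta0 * m0 + Nj[:, None] * xbar) / beta[:, None]   # Eq. (42)

    W = np.zeros((J, D, D))
    W0_inv = np.linalg.inv(W0)
    for j in range(J):                                  # Eq. (43), inverted at the end
        d = (xbar[j] - m0)[:, None]
        W_inv = W0_inv + Nj[j] * S[j] + (beta0 * Nj[j] / (beta0 + Nj[j])) * (d @ d.T)
        W[j] = np.linalg.inv(W_inv)

    alpha = alpha0 + xi.sum(axis=0)                     # Eq. (44)
    return dict(N=Nj, xbar=xbar, S=S, beta=beta, nu=nu, m=m, W=W, alpha=alpha)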

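The convergence check of Section 3.4 then leads to a simple training loop. The following skeleton only sketches how the pieces could be combined; the callables e_step, m_step, and lower_bound stand in for the computations of Sections 3.2-3.4, and the tolerance is an arbitrary assumed value.

import numpy as np

def train(X, init_params, e_step, m_step, lower_bound, tol=1e-4, max_iter=200):
    """Iterate E- and M-step until the change of the lower bound L
    (Eqs. 46-48) falls below tol."""
    params = init_params
    L_old = -np.inf
    for _ in range(max_iter):
        gamma, xi = e_step(X, params)          # Section 3.2
        params = m_step(X, gamma, xi, params)  # Section 3.3
        L_new = lower_bound(X, gamma, xi, params)  # Section 3.4
        if abs(L_new - L_old) < tol:           # convergence criterion
            break
        L_old = L_new
    return params, L_old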
3.5 Choice of Hyper Parameters

We can either choose the priors αj,s^(0) for the transitions at random (with a seed) or make some assumptions and use them for initialization, e.g.:

αj,s^(0) = 0.5 if j = s, and 1/(2J) otherwise    (56)

which means that the (prior) probability to stay in a state is always 0.5 and transitions to the other states are equally likely. The prior for the starting states αj^(0) = α^(0) is the same for all states.

Do we need to include actual prior knowledge when using VI approaches? No, the main advantage of relying on a training that is based on variational Bayesian inference is that the introduction of the distributions over the model parameters prevents us from running into local minima, especially those that arise when a component (or a state) collapses over a single observation (or multiple observations with identical characteristics). In that case the variance approaches 0 (and the density at the mean approaches ∞, since ∫ p(x) dx = 1), which would normally dramatically increase the likelihood for the model – this is also known as a singularity (and one of the known drawbacks of standard EM). Using the 2nd order approaches prevents this by a low density in the parameter space where the variance approaches 0 (therefore p(x|σ) → ∞ is attenuated by a low density for p(σ)).

4 CONCLUSION AND OUTLOOK

In this article, we adapted the concept of second-order training techniques to Hidden Markov Models. A training algorithm has been defined following the ideas of Variational Bayesian Inference and the Baum-Welch algorithm.

Part of our ongoing research is the evaluation of the new training algorithm using various benchmark data, the analysis of the computational complexity of the algorithm as well as the actual run-times on these benchmark data, and a comparison to a standard Baum-Welch approach.

In our future work we will extend the approach further by allowing different discrete and continuous distributions in different dimensions of an output distribution. We will also use the HMM trained with the new algorithm for anomaly detection (in particular for the detection of temporal anomalies). For that purpose, we will extend the technique we have proposed for GMM in [3].

ACKNOWLEDGMENT

This work was supported by the German Research Foundation (DFG) under grant number SI 674/9-1 (project CYPHOC).

APPENDIX

Dirichlet

Dir(µ|α) = C(α) ∏_{k=1}^{K} µk^{αk − 1}    (57)

with constraints

Σ_{k=1}^{K} µk = 1    (58)

0 ≤ µk ≤ 1    (59)

|µ| = |α| = K    (60)

α̂ = Σ_{k=1}^{K} αk    (61)

C(α) = Γ(α̂) / ∏_{k=1}^{K} Γ(αk)    (62)

H[µ] = − Σ_{k=1}^{K} (αk − 1) { ψ(αk) − ψ(α̂) } − ln C(α)    (63)

Digamma function:

ψ(a) ≡ (d/da) ln Γ(a)    (64)

Gamma function:

Γ(x) ≡ ∫_0^∞ u^{x−1} e^{−u} du    (65)

Normal and Wishart

N(x|µ, Λ) = (1 / (2π)^{D/2}) (1 / |Λ^{-1}|^{1/2}) exp{ −(1/2) (x − µ)^T Λ (x − µ) }    (66)

W(Λ|W, ν) = B(W, ν) |Λ|^{(ν−D−1)/2} exp{ −(1/2) Tr(W^{-1} Λ) }    (67)

where

B(W, ν) = |W|^{−ν/2} ( 2^{νD/2} π^{D(D−1)/4} ∏_{i=1}^{D} Γ((ν + 1 − i)/2) )^{−1}    (68)
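When the lower bound terms are evaluated numerically, the normalization constants C(α) of Eq. (62) and B(W, ν) of Eq. (68) are best computed in log space to avoid overflow of the Gamma function. A small Python sketch (the function names are illustrative, assuming SciPy):

import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    """ln C(alpha) of the Dirichlet distribution, Eqs. (61)-(62)."""
    return gammaln(alpha.sum()) - gammaln(alpha).sum()

def log_B(W, nu):
    """ln B(W, nu) of the Wishart distribution, Eq. (68)."""
    D = W.shape[0]
    i = np.arange(1, D + 1)
    return (-0.5 * nu * np.log(np.linalg.det(W))
            - 0.5 * nu * D * np.log(2.0)
            - 0.25 * D * (D - 1) * np.log(np.pi)
            - gammaln(0.5 * (nu + 1 - i)).sum())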

Gaussian-Wishart

p(µ, Λ | µ0, β, W, ν) = N(µ | µ0, (βΛ)^{-1}) W(Λ | W, ν)    (69)

This is the conjugate prior distribution for a multivariate Gaussian N(x|µ, Λ) in which both the mean µ and the precision matrix Λ are unknown, and is also called the Normal-Wishart distribution.

Christian Gruhl received his M.Sc. in Computer Science in 2014 at the University of Kassel with Honors. Currently he is working towards his Ph.D. as a research assistant at the Intelligent Embedded Systems lab of Prof. Sick. His research interests focus on 2nd order training algorithms and their applications in cyber physical systems.

Bernhard Sick received a diploma (1992, M.Sc. equivalent), a Ph.D. degree (1999), and a "Habilitation" degree (2004), all in computer science, from the University of Passau, Germany. Currently, he is full professor for Intelligent Embedded Systems at the Faculty for Electrical Engineering and Computer Science of the University of Kassel, Germany. There, he is conducting research in the areas Autonomic and Organic Computing and Technical Data Analytics with applications in biometrics, intrusion detection, energy management, automotive engineering, and others. Bernhard Sick authored more than 100 peer-reviewed publications in these areas. He is a member of IEEE (Systems, Man, and Cybernetics Society, Computer Society, and Computational Intelligence Society) and GI (Gesellschaft fuer Informatik). Bernhard Sick is associate editor of the IEEE Transactions on Cybernetics; he holds one patent and received several thesis, best paper, teaching, and inventor awards.

REFERENCES

[1] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[2] C. A. McGrory and D. M. Titterington, "Variational Bayesian analysis for hidden Markov models," Australian & New Zealand Journal of Statistics, vol. 51, no. 2, pp. 227–244, 2009.
[3] C. Gruhl and B. Sick, "Detecting Novel Processes with CANDIES – An Holistic Novelty Detection Technique based on Probabilistic Models," arXiv:1605.05628, 2016, (last access: 25.05.2016). [Online]. Available: https://arxiv.org/abs/1605.05628