Stochastic Variational Inference for Bayesian Time Series Models
Matthew James Johnson [email protected] Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA USA
Alan S. Willsky [email protected] Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA USA
Abstract

Bayesian models provide powerful tools for analyzing complex time series data, but performing inference with large datasets is a challenge. Stochastic variational inference (SVI) provides a new framework for approximating model posteriors with only a small number of passes through the data, enabling such models to be fit at scale. However, its application to time series models has not been studied. In this paper we develop SVI algorithms for several common Bayesian time series models, namely the hidden Markov model (HMM), the hidden semi-Markov model (HSMM), and the nonparametric HDP-HMM and HDP-HSMM. In addition, because HSMM inference can be expensive even in the minibatch setting of SVI, we develop fast approximate updates for HSMMs with duration distributions that are negative binomials or mixtures of negative binomials.

1. Introduction

Bayesian time series models can be applied to complex data in many domains, including data arising from behavior and motion (Fox et al., 2010; 2011), home energy consumption (Johnson & Willsky, 2013), physiological signals (Lehman et al., 2012), single-molecule biophysics (Lindén et al., 2013), brain-machine interfaces (Hudson, 2008), and natural language and text (Griffiths et al., 2004; Liang et al., 2007). However, scaling inference in these models to large datasets is a challenge.

Many Bayesian inference algorithms require a complete pass over the data in each iteration and thus do not scale well. In contrast, some recent Bayesian inference methods require only a small number of passes and can even operate in the single-pass or streaming settings (Broderick et al., 2013). In particular, stochastic variational inference (SVI) (Hoffman et al., 2013) provides a general framework for scalable inference based on mean field and stochastic gradient optimization. However, while SVI has been studied extensively for topic models (Hoffman et al., 2010; Wang et al., 2011; Bryant & Sudderth, 2012; Wang & Blei, 2012; Ranganath et al., 2013; Hoffman et al., 2013), it has not been applied to time series.

In this paper, we develop SVI algorithms for the core Bayesian time series models based on the hidden Markov model (HMM), namely the Bayesian HMM and hidden semi-Markov model (HSMM), as well as their nonparametric extensions based on the hierarchical Dirichlet process (HDP), the HDP-HMM and HDP-HSMM. Both the HMM and HDP-HMM are ubiquitous in time series modeling, and so the SVI algorithms developed in Sections 3 and 4 are widely applicable.

The HSMM and HDP-HSMM extend their HMM counterparts by allowing explicit modeling of state durations with arbitrary distributions. However, HSMM inference subroutines have time complexity that scales quadratically with the observation sequence length, which can be expensive even in the minibatch setting of SVI. To address this shortcoming, in Section 5 we develop a new method for Bayesian inference in (HDP-)HSMMs with negative binomial durations that allows approximate SVI updates with time complexity that scales only linearly with sequence length. The methods in this paper also provide the first batch mean field algorithm for HDP-HSMMs.

Our code is available at github.com/mattjj/pyhsmm.

2. Background

Here we review the key ingredients of SVI, namely stochastic gradient algorithms, the mean field variational inference problem, and natural gradients of the mean field objective for models with complete-data conjugacy.
2.1. Stochastic gradient ascent

Consider the optimization problem

$$\max_\phi f(\phi, \bar y) \quad \text{where} \quad f(\phi, \bar y) = \sum_{k=1}^K g(\phi, \bar y^{(k)})$$

and where $\bar y = \{\bar y^{(k)}\}_{k=1}^K$ is fixed. Then if $\hat k$ is sampled uniformly over $\{1, 2, \ldots, K\}$, we have

$$\nabla_\phi f(\phi, \bar y) = K \cdot \mathbb{E}_{\hat k}\big[\nabla_\phi g(\phi, \bar y^{(\hat k)})\big].$$

Thus we can generate approximate gradients of the objective using only one $\bar y^{(k)}$ at a time. A stochastic gradient algorithm for a sequence of stepsizes $\rho^{(t)}$ and a sequence of positive definite matrices $G^{(t)}$ is given in Algorithm 1. From standard results (Robbins & Monro, 1951; Bottou, 1998), if $\sum_{t=1}^\infty \rho^{(t)} = \infty$ and $\sum_{t=1}^\infty (\rho^{(t)})^2 < \infty$ and the $G^{(t)}$ have uniformly bounded eigenvalues, then the algorithm converges to a stationary point, i.e. $\phi^* \triangleq \lim_{t\to\infty} \phi^{(t)}$ satisfies $\nabla_\phi f(\phi^*, \bar y) = 0$.

Algorithm 1 Stochastic gradient ascent
  Initialize $\phi^{(0)}$
  for $t = 1, 2, \ldots$ do
    $\hat k^{(t)} \leftarrow \text{Uniform}(\{1, 2, \ldots, K\})$
    $\phi^{(t)} \leftarrow \phi^{(t-1)} + \rho^{(t)} K G^{(t)} \nabla_\phi g(\phi^{(t-1)}, \bar y^{(\hat k^{(t)})})$

Since each update in the stochastic gradient ascent algorithm only operates on one $\bar y^{(k)}$, or minibatch, at a time, it can scale to the case where $\bar y$ is large.

2.2. Stochastic variational inference

Given a probabilistic model

$$p(\phi, z, y) = p(\phi) \prod_{k=1}^K p(z^{(k)}|\phi)\, p(y^{(k)}|z^{(k)}, \phi)$$

that includes global latent variables $\phi$, local latent variables $z = \{z^{(k)}\}_{k=1}^K$, and observations $y = \{y^{(k)}\}_{k=1}^K$, the mean field problem is to approximate the posterior $p(\phi, z|\bar y)$ for fixed data $\bar y$ with a distribution of the form $q(\phi)q(z) = q(\phi) \prod_k q(z^{(k)})$ by finding a local minimum of the KL divergence from the approximating distribution to the posterior or, equivalently, finding a local maximum of the marginal likelihood lower bound

$$\mathcal{L} \triangleq \mathbb{E}_{q(\phi)q(z)} \ln \frac{p(\phi, z, \bar y)}{q(\phi)q(z)} \;\le\; \ln p(\bar y). \tag{1}$$

SVI optimizes the objective (1) using a stochastic natural gradient ascent algorithm over the global factors $q(\phi)$.

Natural gradients of $\mathcal{L}$ with respect to the parameters of $q(\phi)$ have a convenient form if the prior $p(\phi)$ and each complete-data likelihood $p(z^{(k)}, y^{(k)}|\phi)$ are a conjugate pair of exponential family distributions. That is, if

$$\ln p(\phi) = \langle \eta_\phi,\, t_\phi(\phi) \rangle - A_\phi(\eta_\phi)$$
$$\ln p(z^{(k)}, y^{(k)}|\phi) = \langle \eta_{zy}(\phi),\, t_{zy}(z^{(k)}, y^{(k)}) \rangle - A_{zy}(\eta_{zy}(\phi))$$

then conjugacy (Bernardo & Smith, 2009, Proposition 5.4) implies that $t_\phi(\phi) = (\eta_{zy}(\phi), -A_{zy}(\eta_{zy}(\phi)))$, so that

$$p(\phi|z^{(k)}, \bar y^{(k)}) \propto \exp\{\langle \eta_\phi + (t_{zy}(z^{(k)}, \bar y^{(k)}), 1),\, t_\phi(\phi) \rangle\}.$$

Conjugacy also implies the optimal $q(\phi)$ is in the same family, i.e. $q(\phi) = \exp\{\langle \tilde\eta_\phi, t_\phi(\phi) \rangle - A_\phi(\tilde\eta_\phi)\}$ for some parameter $\tilde\eta_\phi$ (Bishop & Nasrabadi, 2006, Section 10.4.1).

With this structure, there is a simple expression for the gradient of $\mathcal{L}$ with respect to $\tilde\eta_\phi$. To simplify notation, we write $t(z, \bar y) \triangleq \sum_{k=1}^K (t_{zy}(z^{(k)}, \bar y^{(k)}), 1)$, $\tilde\eta \triangleq \tilde\eta_\phi$, $\eta \triangleq \eta_\phi$, and $A \triangleq A_\phi$. Then, dropping terms constant over $\tilde\eta$, we have

$$\mathcal{L} = \mathbb{E}_{q(\phi)q(z)}\big[\ln p(\phi|z, \bar y) - \ln q(\phi)\big] = \big\langle \eta + \mathbb{E}_{q(z)}[t(z, \bar y)],\, \nabla A(\tilde\eta) \big\rangle - \big(\langle \tilde\eta, \nabla A(\tilde\eta) \rangle - A(\tilde\eta)\big)$$

where we have used the exponential family identity $\mathbb{E}_{q(\phi)}[t_\phi(\phi)] = \nabla A(\tilde\eta)$. Differentiating over $\tilde\eta$, we have

$$\nabla_{\tilde\eta} \mathcal{L} = \nabla^2 A(\tilde\eta)\big(\eta + \mathbb{E}_{q(z)}[t(z, \bar y)] - \tilde\eta\big).$$

The natural gradient $\widetilde\nabla_{\tilde\eta}$ is defined (Hoffman et al., 2013) as $\widetilde\nabla_{\tilde\eta} \triangleq (\nabla^2 A(\tilde\eta))^{-1} \nabla_{\tilde\eta}$, and so expanding we have

$$\widetilde\nabla_{\tilde\eta} \mathcal{L} = \eta + \sum_{k=1}^K \mathbb{E}_{q(z^{(k)})}\big[(t_{zy}(z^{(k)}, \bar y^{(k)}), 1)\big] - \tilde\eta.$$

Therefore a stochastic natural gradient ascent algorithm on the global variational parameter $\tilde\eta_\phi$ proceeds at iteration $t$ by sampling a minibatch $\bar y^{(k)}$ and taking a step of some size $\rho^{(t)}$ in an approximate natural gradient direction via

$$\tilde\eta_\phi \leftarrow (1 - \rho^{(t)})\,\tilde\eta_\phi + \rho^{(t)}\big(\eta_\phi + s \cdot \mathbb{E}_{q^*(z^{(k)})}[t(z^{(k)}, \bar y^{(k)})]\big)$$

where $s \triangleq |\bar y|/|\bar y^{(k)}|$ scales the minibatch statistics to represent the full dataset. In each step we find the optimal local factor $q^*(z^{(k)})$ with standard mean field updates given the current value of $q(\phi)$. There are automatic methods to tune the sequence of stepsizes (Snoek et al., 2012; Ranganath et al., 2013), though we do not explore them here.
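To make the generic update concrete, here is a minimal numpy sketch of the conjugate SVI step $\tilde\eta \leftarrow (1-\rho)\tilde\eta + \rho(\eta + s\cdot\hat t)$ run on a toy Dirichlet-categorical model; the toy model, the stepsize schedule, and all function names are illustrative assumptions of ours, not anything taken from the paper or its released code.

```python
import numpy as np

def svi_step(eta_tilde, eta_prior, t_hat, rho, s):
    """One stochastic natural gradient step on a conjugate global factor:
    eta_tilde <- (1 - rho) * eta_tilde + rho * (eta_prior + s * t_hat)."""
    return (1.0 - rho) * eta_tilde + rho * (eta_prior + s * t_hat)

# Toy example: Dirichlet global factor over a categorical with 4 outcomes.
rng = np.random.default_rng(0)
eta_prior = np.ones(4)                       # Dir(1, 1, 1, 1) prior counts
eta_tilde = eta_prior.copy()
data = rng.integers(0, 4, size=10000)        # the full dataset
for t, batch in enumerate(np.array_split(data, 100), start=1):
    rho = t ** -0.6                          # satisfies the Robbins-Monro conditions
    t_hat = np.bincount(batch, minlength=4)  # sufficient statistics of the minibatch
    s = len(data) / len(batch)               # s = |y| / |y^(k)| rescales to the full dataset
    eta_tilde = svi_step(eta_tilde, eta_prior, t_hat, rho, s)
print(eta_tilde / eta_tilde.sum())           # approximately the empirical frequencies
```

The same convex-combination form reappears in every update of Sections 3-5; only the sufficient statistics change.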
2.3. Hidden Markov Models

A Bayesian hidden Markov model (HMM) on $N$ states includes priors on the model parameters, namely the initial state distribution and transition matrix rows $\pi = \{\pi^{(i)}\}_{i=0}^N$ and the observation parameters $\theta = \{\theta^{(i)}\}_{i=1}^N$. The full generative model over the parameters, a state sequence $x_{1:T}$ of length $T$, and an observation sequence $y_{1:T}$ is

$$\theta^{(i)} \overset{\text{iid}}{\sim} p(\theta), \qquad \pi^{(i)} \sim \mathrm{Dir}(\alpha^{(i)}), \qquad A \triangleq \begin{pmatrix} \pi^{(1)} \\ \vdots \\ \pi^{(N)} \end{pmatrix},$$

$$x_1 \sim \pi^{(0)}, \qquad x_{t+1} \sim \pi^{(x_t)}, \qquad y_t \sim p(y_t \mid \theta^{(x_t)}),$$

where we abuse notation slightly and use $p(\theta)$ and $p(y_t|\theta)$ to denote the prior distribution over $\theta$ and the conditional observation distribution, respectively. When convenient, we collect the transition rows $\{\pi^{(i)}\}_{i=1}^N$ into the transition matrix $A$ and write $q(A) \triangleq \prod_{i=1}^N q(\pi^{(i)})$.

Conditioned on the model parameters $(\pi, \theta)$ and a fixed observation sequence $\bar y_{1:T}$, the distribution of $x_{1:T}$ is Markov on a chain graph. Defining likelihood potentials $L$ by $L_{t,i} \triangleq p(\bar y_t|\theta^{(i)})$, the density $p(x_{1:T}|\bar y_{1:T}, \pi, \theta)$ is

$$\exp\Big\{ \ln \pi^{(0)}_{x_1} + \sum_{t=1}^{T-1} \ln A_{x_t, x_{t+1}} + \sum_{t=1}^{T} \ln L_{t, x_t} - Z \Big\} \tag{2}$$

where $Z$ is the normalizing constant for the distribution. We say $p(x_{1:T}|\bar y_{1:T}, \pi, \theta) = \mathrm{HMM}(A, \pi^{(0)}, L)$.

In mean field inference for HMMs (Beal, 2003), we approximate the full posterior $p(\pi, \theta, x_{1:T}|\bar y_{1:T})$ with a mean field variational family $q(\pi)q(\theta)q(x_{1:T})$ and update each variational factor in turn while fixing the others. When updating $q(x_{1:T})$, by taking expectations of the log of (2) with respect to the variational distribution over parameters, we see the update sets $q(x_{1:T}) = \mathrm{HMM}(\widetilde A, \tilde\pi^{(0)}, \widetilde L)$ with

$$\widetilde A_{i,j} \triangleq \exp\{\mathbb{E}_{q(\pi)} \ln A_{i,j}\}, \qquad \tilde\pi^{(0)}_i \triangleq \exp\{\mathbb{E}_{q(\pi^{(0)})} \ln \pi^{(0)}_i\}, \qquad \widetilde L_{t,i} \triangleq \exp\{\mathbb{E}_{q(\theta^{(i)})} \ln p(\bar y_t | x_t = i)\}. \tag{3}$$

We can compute the expectations with respect to $q(x_{1:T})$ necessary for the other factors' updates using the standard HMM message passing recursions for forward messages $F$ and backward messages $B$ with these HMM parameters:

$$F_{t,i} \triangleq \sum_{j=1}^N F_{t-1,j}\, \widetilde A_{j,i}\, \widetilde L_{t,i}, \qquad F_{1,i} \triangleq \tilde\pi^{(0)}_i \widetilde L_{1,i}, \tag{4}$$

$$B_{t,i} \triangleq \sum_{j=1}^N \widetilde A_{i,j}\, \widetilde L_{t+1,j}\, B_{t+1,j}, \qquad B_{T,i} \triangleq 1. \tag{5}$$

The messages can be computed in $O(TN^2)$ time.
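For reference, a minimal numpy implementation of the recursions (4)-(5) might look as follows; it operates directly on the potentials $\widetilde A$, $\tilde\pi^{(0)}$, $\widetilde L$ and omits the per-step rescaling needed for numerical stability on long sequences, so it is a sketch rather than production code.

```python
import numpy as np

def hmm_messages(A_tilde, pi_tilde, L_tilde):
    """Forward/backward recursions of Eqs. (4)-(5) in O(T N^2) time.
    A_tilde: (N, N) potentials exp E[ln A]; pi_tilde: (N,); L_tilde: (T, N)."""
    T, N = L_tilde.shape
    F, B = np.zeros((T, N)), np.zeros((T, N))
    F[0] = pi_tilde * L_tilde[0]                      # F_{1,i}
    for t in range(1, T):
        F[t] = (F[t - 1] @ A_tilde) * L_tilde[t]      # Eq. (4)
    B[-1] = 1.0                                       # B_{T,i}
    for t in range(T - 2, -1, -1):
        B[t] = A_tilde @ (L_tilde[t + 1] * B[t + 1])  # Eq. (5)
    Z = F[-1].sum()                                   # normalizer Z = sum_i F_{T,i}
    return F, B, Z
```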
2.4. Hidden semi-Markov Models

The hidden semi-Markov model (HSMM) (Murphy, 2002; Hudson, 2008; Johnson & Willsky, 2013) augments the generative process of the HMM by sampling a duration from a state-specific duration distribution each time a state is entered. That is, if state $i$ is entered at time $t$, a duration $d$ is sampled as $d \sim p(d|\vartheta^{(i)})$ for some parameter $\vartheta^{(i)}$, and the state stays fixed until $x_{t+d-1}$, when a Markov transition step selects a new state for $x_{t+d}$. For identifiability, self-transitions are often ruled out: the transition matrix $A$ is constrained via $A_{i,i} = 0$ and the Dirichlet prior on each row is placed on the off-diagonal entries. The parameters $\pi^{(0)}$ and $\theta$ are treated as in the HMM.

Analogous to the HMM case, for a fixed observation sequence $\bar y_{1:T}$ we define likelihood potentials $L$ by $L_{t,i} \triangleq p(\bar y_t|\theta^{(i)})$ and now also duration potentials $D$ via $D_{d,i} \triangleq p(d|\vartheta^{(i)})$, and we say $p(x_{1:T}|\bar y_{1:T}, \pi, \theta, \vartheta) = \mathrm{HSMM}(A, \pi^{(0)}, L, D)$.

In mean field inference for HSMMs, as developed in (Hudson, 2008), we approximate the posterior $p(\theta, \vartheta, \pi, x_{1:T}|\bar y_{1:T})$ with a variational family $q(\theta)q(\vartheta)q(\pi)q(x_{1:T})$. When updating the factor $q(x_{1:T})$ we have $q(x_{1:T}) = \mathrm{HSMM}(\widetilde A, \tilde\pi^{(0)}, \widetilde L, \widetilde D)$ using the definitions in (3) and

$$\widetilde D_{d,i} \triangleq \exp\big\{\mathbb{E}_{q(\vartheta^{(i)})} \ln p(d|\vartheta^{(i)})\big\}.$$

Expectations with respect to $q(x_{1:T})$ can be computed in terms of the standard HSMM forward messages $(F, F^*)$ and backward messages $(B, B^*)$ via the recursions (Murphy, 2002):

$$F_{t,i} \triangleq \sum_{d=1}^{t-1} F^*_{t-d,i}\, \widetilde D_{d,i}\, \widetilde L_{t-d+1:t,i}, \qquad F^*_{t,i} \triangleq \sum_{j=1}^N \widetilde A_{j,i}\, F_{t,j}, \tag{6}$$

$$B_{t,i} \triangleq \sum_{d=1}^{T-t} B^*_{t+d,i}\, \widetilde D_{d,i}\, \widetilde L_{t+1:t+d,i}, \qquad B^*_{t,i} \triangleq \sum_{j=1}^N B_{t,j}\, \widetilde A_{i,j}, \tag{7}$$

with $F^*_{1,i} \triangleq \tilde\pi^{(0)}_i$ and $B_{T,i} \triangleq 1$, where $\widetilde L_{t:t',i} \triangleq \prod_{\tau=t}^{t'} \widetilde L_{\tau,i}$ denotes the product of likelihood potentials over a segment. These messages require $O(T^2 N + T N^2)$ time to compute for general duration distributions.
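The quadratic cost is easiest to see in code. Below is a sketch of the forward half of (6) under one concrete choice of index conventions (these conventions vary across references, so treat the off-by-one details as our assumptions); the inner loop over durations is what makes the recursion $O(T^2 N)$ on top of the $O(TN^2)$ transition steps, and as above we omit numerical rescaling.

```python
import numpy as np

def hsmm_forward(A_tilde, pi_tilde, L_tilde, D_tilde):
    """Forward pass in the style of Eq. (6). L_tilde: (T, N) likelihood
    potentials; D_tilde[d-1, i] is the potential of duration d in state i.
    Fstar[t, i] scores a new segment of state i starting at time t."""
    T, N = L_tilde.shape
    cumlogL = np.vstack([np.zeros(N), np.cumsum(np.log(L_tilde), axis=0)])
    F, Fstar = np.zeros((T, N)), np.zeros((T, N))
    Fstar[0] = pi_tilde
    for t in range(T):
        for d in range(1, t + 2):              # all possible durations: O(T) per step
            seg_lik = np.exp(cumlogL[t + 1] - cumlogL[t + 1 - d])
            F[t] += Fstar[t + 1 - d] * D_tilde[d - 1] * seg_lik
        if t + 1 < T:
            Fstar[t + 1] = A_tilde.T @ F[t]    # Markov transition between segments
    return F, Fstar
```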
3. SVI for HMMs and HSMMs

In this section we apply SVI to both HMMs and HSMMs and express the SVI updates in terms of HMM and HSMM messages. For notational simplicity, we consider a dataset of $K$ sequences each of length $T$, written $\bar y = \{\bar y^{(k)}_{1:T}\}_{k=1}^K$, and take each minibatch to be a single sequence, which we write without the superscript index as $\bar y_{1:T}$.

3.1. SVI update for HMMs

In terms of the notation in Section 2.2, the global variables are the HMM parameters and the local variables are the hidden states; that is, $\phi = (A, \pi^{(0)}, \theta)$ and $z = x_{1:T}$. To make the updates explicit, we assume the observation parameter priors $p(\theta^{(i)})$ and likelihoods $p(y_t|\theta^{(i)})$ are conjugate pairs of exponential family distributions for each $i$, so that the conditionals have the form

$$p(\theta^{(i)}|y) \propto \exp\big\{\big\langle \eta^{(i)}_\theta + (t^{(i)}_y(y), 1),\; t^{(i)}_\theta(\theta^{(i)}) \big\rangle\big\}.$$

At each iteration of the SVI algorithm we sample a sequence $\bar y_{1:T}$ from the dataset and perform a stochastic gradient update on $q(A)$, $q(\pi^{(0)})$, and $q(\theta)$. Writing the priors and mean field factors as

$$p(\pi^{(i)}) = \mathrm{Dir}(\alpha^{(i)}), \qquad p(\theta^{(i)}) \propto \exp\big\{\langle \eta^{(i)}_\theta, t^{(i)}_\theta(\theta^{(i)}) \rangle\big\},$$
$$q(\pi^{(i)}) = \mathrm{Dir}(\tilde\alpha^{(i)}), \qquad q(\theta^{(i)}) \propto \exp\big\{\langle \tilde\eta^{(i)}_\theta, t^{(i)}_\theta(\theta^{(i)}) \rangle\big\},$$

and using the messages $F$ and $B$ as in (4) and (5), we define

$$\hat t^{(i)}_y \triangleq \mathbb{E}_{q(x_{1:T})} \sum_{t=1}^T \mathbb{I}[x_t = i]\, t^{(i)}_y(\bar y_t) = \sum_{t=1}^T F_{t,i} B_{t,i} \cdot (t^{(i)}_y(\bar y_t), 1)/Z \tag{8}$$

$$(\hat t^{(i)}_{\mathrm{trans}})_j \triangleq \mathbb{E}_{q(x_{1:T})} \sum_{t=1}^{T-1} \mathbb{I}[x_t = i, x_{t+1} = j] = \sum_{t=1}^{T-1} F_{t,i}\, \widetilde A_{i,j}\, \widetilde L_{t+1,j}\, B_{t+1,j}/Z \tag{9}$$

$$(\hat t_{\mathrm{init}})_i \triangleq \mathbb{E}_{q(x_{1:T})} \mathbb{I}[x_1 = i] = \tilde\pi^{(0)}_i \widetilde L_{1,i} B_{1,i}/Z$$

where $\mathbb{I}[\cdot]$ is 1 if its argument is true and 0 otherwise, and $Z$ is the normalizing constant $Z \triangleq \sum_{i=1}^N F_{T,i}$.

With these expected statistics, taking a natural gradient step of size $\rho$ in the parameters of $q(A)$, $q(\pi^{(0)})$, and $q(\theta)$ is

$$\tilde\eta^{(i)}_\theta \leftarrow (1-\rho)\tilde\eta^{(i)}_\theta + \rho(\eta^{(i)}_\theta + s \cdot \hat t^{(i)}_y) \tag{10}$$
$$\tilde\alpha^{(i)} \leftarrow (1-\rho)\tilde\alpha^{(i)} + \rho(\alpha^{(i)} + s \cdot \hat t^{(i)}_{\mathrm{trans}}) \tag{11}$$
$$\tilde\alpha^{(0)} \leftarrow (1-\rho)\tilde\alpha^{(0)} + \rho(\alpha^{(0)} + s \cdot \hat t_{\mathrm{init}}) \tag{12}$$

where $s \triangleq |\bar y|/|\bar y_{1:T}|$ as in Section 2.2.

3.2. SVI update for HSMMs

The SVI updates for the HSMM are very similar to those for the HMM with the addition of a duration update, but writing the expectations in terms of the HSMM messages is substantially different. The form of these expected statistics follows from the standard HSMM E-step (Murphy, 2002; Hudson, 2008).

Using the HSMM messages $(F, F^*)$ and $(B, B^*)$ defined in (6)-(7), to write the expected statistics in terms of the messages the state indicators $\mathbb{I}[x_t = i]$ must first be expanded into segment-boundary events,

$$\mathbb{I}[x_t = i] = \mathbb{I}[x_1 = i] + \sum_{\tau=1}^{t-1}\big(\mathbb{I}[x_{\tau+1} = i \neq x_\tau] - \mathbb{I}[x_\tau = i \neq x_{\tau+1}]\big),$$

whose expectations are available from the messages, e.g.

$$\mathbb{E}_{q(x_{1:T})}\, \mathbb{I}[x_t = i \neq x_{t+1}] = F_{t,i} B^*_{t,i}/Z,$$

where $Z$ is the normalizer $Z \triangleq \sum_{i=1}^N \tilde\pi^{(0)}_i B_{0,i}$, with $B_{0,i}$ given by the recursion in (7) evaluated at $t = 0$. From these we can compute $\mathbb{E}_{q(x_{1:T})}\mathbb{I}[x_t = i]$, which we use in the definition of $\hat t^{(i)}_y$ given in (8).

Finally, we compute the expected duration statistics as indicators on every possible duration $d = 1, 2, \ldots, T$ via

$$(\hat t^{(i)}_{\mathrm{dur}})_d \triangleq \sum_t \mathbb{E}\, \mathbb{I}[x_t \neq x_{t+1},\, x_{t+1:t+d} = i,\, x_{t+d+1} \neq i] = \sum_{t=1}^{T-d} F^*_{t,i}\, \widetilde D_{d,i}\, \widetilde L_{t+1:t+d,i}\, B^*_{t+d,i}\,/\,Z. \tag{13}$$

Note that this step alone requires $O(T^2 N)$ time even after the messages have been computed.

If the priors and mean field factors over duration parameters are $p(\vartheta^{(i)}) \propto \exp\{\langle \eta^{(i)}_\vartheta, t^{(i)}_\vartheta(\vartheta^{(i)}) \rangle\}$ and $q(\vartheta^{(i)}) \propto \exp\{\langle \tilde\eta^{(i)}_\vartheta, t^{(i)}_\vartheta(\vartheta^{(i)}) \rangle\}$, and the duration likelihood is $p(d|\vartheta^{(i)}) = \exp\{\langle t^{(i)}_\vartheta(\vartheta^{(i)}), (t_d(d), 1) \rangle\}$, then the duration factor update is

$$\tilde\eta^{(i)}_\vartheta \leftarrow (1-\rho)\tilde\eta^{(i)}_\vartheta + \rho\Big(\eta^{(i)}_\vartheta + s \sum_{d=1}^T (\hat t^{(i)}_{\mathrm{dur}})_d \cdot (t_d(d), 1)\Big).$$
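Putting the pieces of Section 3.1 together, a sketch of the expected-statistics computation for one HMM minibatch update, built on the messages of (4)-(5), could be organized as follows; the function names are ours, and the observation statistic of Eq. (8) is left to the caller since it depends on the chosen emission family.

```python
import numpy as np

def hmm_expected_stats(F, B, A_tilde, L_tilde, Z):
    """Expected statistics of Eqs. (8)-(9) plus the initial-state statistic."""
    state_probs = F * B / Z                       # E I[x_t = i]: weights for Eq. (8)
    t_trans = np.einsum('ti,ij,tj,tj->ij',
                        F[:-1], A_tilde, L_tilde[1:], B[1:]) / Z   # Eq. (9)
    t_init = state_probs[0]
    return state_probs, t_trans, t_init

def natural_gradient_step(param_tilde, param_prior, t_hat, rho, s):
    """Eqs. (10)-(12): the same convex-combination step for every factor."""
    return (1.0 - rho) * param_tilde + rho * (param_prior + s * t_hat)
```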
4. SVI for HDP-HMMs and HDP-HSMMs

In this section we extend our methods to the Bayesian nonparametric versions of these models, the HDP-HMM and the HDP-HSMM. The generative model for the HDP-HMM with scalar concentration parameters $\alpha, \gamma > 0$ is

$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi^{(i)} \overset{\text{iid}}{\sim} \mathrm{DP}(\alpha\beta), \qquad \theta^{(i)} \overset{\text{iid}}{\sim} p(\theta^{(i)}),$$
$$x_1 \sim \pi^{(0)}, \qquad x_{t+1} \sim \pi^{(x_t)}, \qquad y_t \sim p(y_t|\theta^{(x_t)}),$$

where $\beta \sim \mathrm{GEM}(\gamma)$ denotes sampling from a stick-breaking distribution defined by

$$v_j \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \gamma), \qquad \beta_k = v_k \prod_{j<k} (1 - v_j).$$

To perform mean field inference in HDP models, we approximate the posterior with a truncated variational distribution. While a common truncation is to limit the two stick-breaking distributions in the definition of the HDP (Hoffman et al., 2013), a more convenient truncation for our models is the "direct assignment" truncation, used in (Liang et al., 2007) for batch mean field with the HDP-HMM and in (Bryant & Sudderth, 2012) in an SVI algorithm for LDA. The direct assignment truncation limits the support of $q(x_{1:T})$ to the finite set $\{1, 2, \ldots, K\}^T$ for a truncation parameter $K$, i.e. fixing $q(x_{1:T}) = 0$ when any $x_t > K$. Thus the other factors, namely $q(\pi)$, $q(\beta)$, and $q(\theta)$, only differ from their priors in their distribution over the first $K$ components. As opposed to standard truncation, this family of approximations is nested over $K$, enabling a search procedure over the truncation parameter as developed in (Bryant & Sudderth, 2012). A similar search procedure can be used with the HDP-HMM and HDP-HSMM algorithms in this paper, though we do not explore it here.

A disadvantage of the direct assignment truncation is that the update to $q(\beta)$ is not conjugate given the other factors, as it is in Hoffman et al. (2013). Following Liang et al. (2007), to simplify the update we use a point estimate by writing $q(\beta) = \delta_{\beta^*}(\beta)$. Since the main effect of $\beta$ is to enforce shared sparsity among the $\pi^{(i)}$, it is reasonable to expect that a point approximation for $q(\beta)$ will suffice.

The updates to the factors $q(\theta)$ and $q(x_{1:T})$ are identical to those derived in the previous sections. To derive the SVI update for $q(\pi)$, we write the relevant part of the untruncated model and truncated variational factors as

$$p((\pi^{(i)}_{1:K}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\alpha \cdot (\beta_{1:K}, \beta_{\mathrm{rest}})), \qquad q((\pi^{(i)}_{1:K}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\tilde\alpha^{(i)}),$$

respectively, where $\pi^{(i)}_{\mathrm{rest}} \triangleq 1 - \sum_{k=1}^K \pi^{(i)}_k$ and $\beta_{\mathrm{rest}} \triangleq 1 - \sum_{k=1}^K \beta_k$ for $i = 1, \ldots, K$. Therefore the updates to $q(\pi^{(i)})$ are identical to those in (11), except that the number of variational parameters is $K + 1$ and the prior hyperparameters are replaced with $\alpha \cdot (\beta_{1:K}, \beta_{\mathrm{rest}})$.

The gradient of the variational objective with respect to $\beta^*$ is given by

$$\nabla_{\beta^*} \mathcal{L} = \nabla_{\beta^*}\, \mathbb{E}_{q(\pi)} \ln \frac{p(\beta, \pi)}{q(\beta)q(\pi)} = \nabla_{\beta^*} \Big[ \ln p(\beta^*) + \sum_{i=1}^K \mathbb{E}_{q(\pi^{(i)})} \ln p(\pi^{(i)}|\beta^*) \Big]$$

with

$$\frac{\partial}{\partial \beta^*_k} \ln p(\beta^*) = \sum_{i=k+1}^{K} \frac{1}{1-\sum_{j<i}\beta^*_j} \;-\; \frac{\gamma-1}{1-\sum_{j\le K}\beta^*_j},$$

$$\frac{\partial}{\partial \beta^*_k}\, \mathbb{E}_{q(\pi)}[\ln p(\pi^{(i)}|\beta^*)] = \alpha\psi(\tilde\alpha^{(i)}_k) - \alpha\psi(\tilde\alpha^{(i)}_{K+1}) + \alpha\psi(\alpha\beta^*_{\mathrm{rest}}) - \alpha\psi(\alpha\beta^*_k).$$

We use this gradient expression to take a truncated gradient step on $\beta^*$ during each SVI update, using a backtracking line search to ensure the updated value satisfies $\beta^* \geq 0$.

The updates for $q(\pi)$ and $q(\beta)$ in the HDP-HSMM differ only in that the variational lower bound expression changes slightly, because the support of each $q(\pi^{(i)})$ is restricted to the off-diagonal (and renormalized). We can adapt $q(\pi^{(i)})$ by simply dropping the $i$th component from the representation, writing

$$q((\pi^{(i)}_{1:K\setminus i}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\tilde\alpha^{(i)}_{\setminus i}),$$

and we change the second term in the gradient for $\beta^*$ to

$$\frac{\partial}{\partial \beta^*_k}\, \mathbb{E}_{q(\pi)}[\ln p(\pi^{(i)}|\beta^*)] = \alpha\psi(\tilde\alpha^{(i)}_k) - \alpha\psi(\tilde\alpha^{(i)}_{K+1}) + \alpha\psi\Big(\alpha\sum_{j\neq i}\beta^*_j\Big) - \alpha\psi(\alpha\beta^*_k)$$

when $k \neq i$; otherwise the partial derivative is 0.

Using these gradient expressions for $\beta^*$ and a suitable gradient-based batch optimization procedure, we can also perform batch mean field updates for the HDP-HSMM.

5. Fast updates for negative binomial HSMMs

General HSMM inference is much more expensive than HMM inference, having runtime $O(T^2 N + T N^2)$ compared to just $O(TN^2)$ for $N$ states and a sequence of length $T$. The quadratic dependence on $T$ can be severely limiting even in the minibatch setting of SVI, since minibatches often must be sufficiently large for good performance (Hoffman et al., 2013; Broderick et al., 2013).

A common approach to mitigating HSMM computational complexity (Hudson, 2008; Johnson & Willsky, 2013) is to limit the support of the duration distributions, either in the model or as an approximation in the message passing computation, and thus limit the number of terms in the sums of (6) and (7). This truncation approach can be readily applied to the algorithms presented in this paper. However, truncation can be undesirable or ineffective if states have long durations. In this section, we develop approximate updates for a particular class of duration distributions with unbounded support for which the computational complexity is only linear in $T$.

5.1. HMM embeddings of negative binomial HSMMs

The negative binomial family of discrete distributions is parameterized by $(r, p)$, where $0 < p < 1$ and $r$ is a positive integer. Its probability mass function (PMF) for $k = 1, 2, \ldots$ is

$$p(k|r, p) = \binom{k+r-2}{k-1} \exp\{(k-1)\ln p + r\ln(1-p)\}. \tag{14}$$

Fixing $r$, the family of distributions parameterized by $p$ is an exponential family and admits a conjugate Beta prior. However, as a family over $(r, p)$ it is not exponential because of the binomial coefficient base measure term, which depends on $r$. When $r = 1$ the distribution is geometric, and so the class of HSMMs with negative binomial durations includes HMMs. By varying $r$ and $p$, the mean and variance can be controlled separately, making the negative binomial a popular choice for duration distributions (Bulla & Bulla, 2006; Fearnhead, 2006).
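As a small illustration of (14) and the base measure issue, the following snippet (ours, using scipy's log-gamma function) evaluates the log PMF and checks the geometric special case:

```python
import numpy as np
from scipy.special import gammaln

def nb_log_pmf(k, r, p):
    """Log of Eq. (14) on k = 1, 2, ...; the binomial coefficient is the
    base measure term depending on r that breaks conjugacy over (r, p)."""
    k = np.asarray(k)
    log_base = gammaln(k + r - 1) - gammaln(k) - gammaln(r)  # ln C(k+r-2, k-1)
    return log_base + (k - 1) * np.log(p) + r * np.log(1 - p)

ks = np.arange(1, 2000)
print(np.exp(nb_log_pmf(ks, r=4, p=0.3)).sum())              # ~1.0: a valid PMF
print(np.allclose(np.exp(nb_log_pmf(ks, r=1, p=0.3)),
                  0.7 * 0.3 ** (ks - 1)))                    # r = 1 is geometric
```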
A negative binomial random variable can be represented as a sum of $r$ geometric random variables: if $x \sim \mathrm{NB}(r, p)$ and $y = 1 + \sum_{i=1}^r z_i$ with $p(z_i = k) = p^k(1 - p)$, then $x$ and $y$ have the same distribution. Therefore, given an HSMM in which the durations of state $i$ are distributed as $\mathrm{NB}(r^{(i)}, p^{(i)})$, we can construct an HMM on $\sum_{i=1}^N r^{(i)}$ states that encodes the same process, where HSMM state $i$ corresponds to $r^{(i)}$ states in the HMM. We call this construction an HMM embedding of the HSMM, and the resulting HMM transition matrix $\bar A$ has the block structure

$$\bar A = \begin{pmatrix} C_1 & \bar A_{12} & \cdots \\ \bar A_{21} & C_2 & \\ \vdots & & \ddots \end{pmatrix}, \qquad C_i \triangleq \begin{pmatrix} p^{(i)} & 1-p^{(i)} & & \\ & \ddots & \ddots & \\ & & p^{(i)} & 1-p^{(i)} \\ & & & p^{(i)} \end{pmatrix},$$

where each diagonal block $C_i$ is $r^{(i)} \times r^{(i)}$ and each off-diagonal block $\bar A_{ij}$ routes transitions out of HSMM state $i$ into the pseudo-states of HSMM state $j$ with entries of the form $A_{ij}\bar p^{(ij)}_1, \ldots, A_{ij}\bar p^{(ij)}_{r^{(j)}}$, where $\bar p^{(ij)}$ is defined in the supplementary materials. We write the HMM embedding state sequence as $\bar x_{1:T}$, where each $\bar x_t$ decomposes as $\bar x_t = (x_t, k)$ for $k = 1, 2, \ldots, r^{(x_t)}$ according to the block structure of $\bar A$. If we define $R \triangleq \frac{1}{N}\sum_{i=1}^N r^{(i)}$, then passing messages in this structured HMM embedding can be done in $O(TNR + TN^2)$ time.

This construction draws on ideas from HSMMs with "parametric macro-states" (Guédon, 2005, Section 3) and on expanded-state HMMs (ESHMMs) (Russell & Moore, 1985; Russell & Cook, 1987; Johnson, 2005). However, this precise construction for negative binomial durations does not appear in those works. Furthermore, we extend these ideas by applying Bayesian inference as well as methods to fit (a posterior over) the $r^{(i)}$ parameters, as we discuss in the next subsection.

If every $r^{(i)}$ is fixed and the $p^{(i)}$ are the only duration parameters, we can use the HMM embedding to perform efficient conjugate SVI (or batch mean field) updates to the duration factors $q(p^{(i)})$. We write the duration prior and mean field factors as

$$p(p^{(i)}) = \mathrm{Beta}(a^{(i)}, b^{(i)}), \qquad q(p^{(i)}) = \mathrm{Beta}(\tilde a^{(i)}, \tilde b^{(i)}).$$

The embedding allows us to write the variational lower bound for the HSMM as an equivalent HMM variational lower bound with effective transition matrix $\mathbb{E}_{q(\pi)q(p)q(\theta)} \ln \bar A$. Defining $q(\bar x_{1:T})$ as the corresponding distribution over HMM states, we can write the expected sufficient statistics $\hat t^{(i)}_d \triangleq (\hat t^{(i)}_{d,1}, \hat t^{(i)}_{d,0})$ for the duration factors in terms of the expected transition statistics in the HMM embedding:

$$\hat t^{(i)}_{d,1} \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = \bar x_{t+1} = (i, k)],$$

$$\hat t^{(i)}_{d,0} \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = (i, k),\ \bar x_{t+1} \neq \bar x_t].$$

We can compute these expected transition statistics efficiently from the HMM messages using (9). The SVI update to the duration factors is then of the form

$$\tilde a^{(i)} \leftarrow (1-\rho)\tilde a^{(i)} + \rho(a^{(i)} + s \cdot \hat t^{(i)}_{d,1}) \tag{15}$$
$$\tilde b^{(i)} \leftarrow (1-\rho)\tilde b^{(i)} + \rho(b^{(i)} + s \cdot \hat t^{(i)}_{d,0}) \tag{16}$$

for some stepsize $\rho$ and minibatch scaling $s$. We can similarly write the transition, initial state, and observation statistics for the other HSMM mean field factors in terms of the embedding:

$$\hat t^{(i)}_y \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = (i, k)]\, t^{(i)}_y(\bar y_t),$$

$$(\hat t^{(i)}_{\mathrm{trans}})_j \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(j)}} \mathbb{I}[\bar x_t = (i, r^{(i)}),\ \bar x_{t+1} = (j, k)],$$

$$(\hat t_{\mathrm{init}})_i \triangleq \mathbb{E}_{q(\bar x_{1:T})}\, \mathbb{I}[\bar x_1 = (i, 1)].$$

Using these statistics we perform SVI updates to the corresponding mean field factors as in (10)-(12).

We have shown that by working with an efficient HMM embedding representation we can compute updates to the HSMM mean field factors in $O(TNR + TN^2)$ time when durations are negative binomially distributed with fixed $r$. In the next subsection we extend these fast updates to include a variational representation of the posterior over $r$.
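A sketch of the embedding construction is below. The entry weights $\bar p^{(ij)}$ are defined in the paper's supplementary materials, so here `p_bar` is a hypothetical stand-in assumed to be normalized; under that assumption the sketch produces a row-stochastic $\bar A$ with the stated block structure.

```python
import numpy as np

def embed_hsmm(A, r, p, p_bar):
    """Construct the embedded HMM transition matrix A_bar of Section 5.1.
    HSMM state i becomes a chain of r[i] pseudo-states advancing with
    probability 1 - p[i] (the C_i blocks); the last pseudo-state exits into
    state j's pseudo-states with weights A[i, j] * p_bar[i][j]. Here p_bar
    is a hypothetical stand-in for the entry distributions (defined in the
    paper's supplement); we assume each p_bar[i][j] sums to one so that the
    rows of A_bar stay normalized given A_ii = 0 and stochastic rows of A."""
    N = len(r)
    starts = np.concatenate([[0], np.cumsum(r)[:-1]])
    A_bar = np.zeros((sum(r), sum(r)))
    for i in range(N):
        sl = slice(starts[i], starts[i] + r[i])
        A_bar[sl, sl] = p[i] * np.eye(r[i]) + (1 - p[i]) * np.eye(r[i], k=1)
        last = starts[i] + r[i] - 1
        for j in range(N):
            if j != i:
                A_bar[last, starts[j]:starts[j] + r[j]] = \
                    (1 - p[i]) * A[i, j] * np.asarray(p_bar[i][j])
    return A_bar
```

Because each pseudo-state row contains only a self-transition, a chain-advance entry, and (for the last pseudo-state of each block) the exit weights, HMM message passing against $\bar A$ can exploit this sparsity, which is the source of the $O(TNR + TN^2)$ bound.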
5.2. Approximate updates for fitting q(r, p)

By learning $r$ as well as $p$, negative binomial HSMMs can learn to be HMMs when appropriate and generally provide a much more flexible class of duration distributions. In this subsection, we derive an exact SVI update step for fitting both $r$ and $p$, explain its computational difficulties, and propose a fast approximate alternative based on sampling.

We define mean field factors $q(r^{(i)})q(p^{(i)}|r^{(i)})$, where each $q(r^{(i)})$ is a categorical distribution with finite support taken to be $\{1, 2, \ldots, r_{\max}\}$ and each $q(p^{(i)}|r^{(i)})$ is Beta, i.e.

$$q(r^{(i)}) \propto \exp\{\langle \tilde\nu^{(i)}, \mathbb{I}_{r^{(i)}} \rangle\}, \qquad q(p^{(i)}|r^{(i)}) = \mathrm{Beta}(\tilde a^{(i,r)}, \tilde b^{(i,r)}),$$

where $\mathbb{I}_{r^{(i)}}$ is an indicator vector with the $r^{(i)}$th entry set to 1 and the others set to zero. The priors are defined similarly. To simplify notation, in this section we often drop the superscript $i$.

Two challenges arise in computing updates to $q(r, p)$. First, the optimal variational factor on the HSMM states is

$$q(x_{1:T}) \propto \exp\big\{\mathbb{E}_{q(\pi)q(r,p)q(\theta)} \ln p(x_{1:T}|\bar y_{1:T}, \pi, r, p, \theta)\big\}.$$

Due to the expectation over $q(r)$, this factor does not have the form required to use the efficient HMM embedding of Section 5.1, and so the corresponding general HSMM messages require $O(T^2 N + T N^2)$ time to compute. Second, as we show next, due to the base measure term in (14), computing an exact update to $q(r, p)$ requires the expected duration indicator statistics of (13), a computation which itself requires $O(T^2 N)$ time even after the HSMM messages have been computed.

First, we show that the update to $q(p|r)$ is straightforward. To derive it, we write the relevant part of the variational lower bound as

$$\mathcal{L} \triangleq \mathbb{E}_{q(r,p)q(x_{1:T})} \ln \frac{p(r, p, D)}{q(r, p)} \tag{17}$$
$$= \mathbb{E}_{q(r)} \ln \frac{p(r)}{q(r)} + \mathbb{E}_{q(r)q(x_{1:T})}\, h(r, D) + \mathbb{E}_{q(r)}\Big[ \mathbb{E}_{q(p|r)} \ln \frac{p(p)\,\bar p(D|r, p)}{q(p|r)} \Big]$$

where $D$ is the set of relevant durations in $x_{1:T}$, $h(r, D) \triangleq \sum_{k\in D} \ln \binom{r+k-2}{k-1}$ arises from the negative binomial base measure term, and $\ln \bar p(D|r, p) \triangleq \sum_{k\in D} (k-1)\ln p + r\ln(1-p)$ collects the negative binomial PMF terms excluding the base measure. The only terms in (17) that depend on $q(p|r)$ are in the final bracketed term. Furthermore, each of these terms corresponds to the variational lower bound for the fixed-$r$ case described in Section 5.1, and so each $q(p|r)$ factor can be updated with Eqs. (15)-(16).

To compute an update for $q(r)$, we note that since it is a distribution with finite support, we can write its complete-data conditional in exponential family form trivially via

$$p(r|p, D) \propto \exp\{\langle \nu + t_r(p, D), \mathbb{I}_r \rangle\}, \qquad t_r(p, D)_j \triangleq \sum_{k\in D} \ln p(p|k, r = j) + \ln h(j, k),$$

and so from the results in Section 2.2 the natural gradient of (17) with respect to the parameters of $q(r)$ is

$$\widetilde\nabla_{\tilde\nu} \mathcal{L} = \nu + \mathbb{E}_{q(p|r)q(x_{1:T})}\, t_r(p, D) - \tilde\nu. \tag{18}$$

Due to the base measure term $h(j, k)$, computing this update requires evaluating the expected statistics of Eq. (13), which require $O(T^2 N)$ time.

Therefore computing an exact SVI update on the $q(r, p)$ factors is expensive both because the HSMM messages cannot be computed using the methods of Section 5.1 and because, given the messages, the required statistics are expensive to compute. To achieve an update running time that is linear in $T$, we propose to use a sample approximation to $q(x_{1:T})$ inspired by the method developed in (Wang & Blei, 2012). That is, we sample negative binomial HSMM models from the distribution $q(\pi)q(\theta)q(r, p)$ and use the embedding to generate a sample of $x_{1:T}$ under each model. Using the message passing methods of Section 5.1, the first state sequence sample for each model can be collected in $O(TNR + TN^2)$ time, and additional state sequence samples from the same model can be collected in $O(TN)$ time.

As discussed in Wang & Blei (2012), this sampling approximation does not optimize the variational lower bound over $q(x_{1:T})$, and so it should yield an inferior objective value. Indeed, while the optimal mean field update sets $q(x_{1:T}) \propto \exp\{\mathbb{E}_q[\ln p(\pi, \theta, \vartheta, x_{1:T}, \bar y_{1:T})]\}$, this update approximates $q(x_{1:T}) \propto \mathbb{E}_q[p(\pi, \theta, \vartheta, x_{1:T}, \bar y_{1:T})]$. However, Wang & Blei (2012) found this approximate update yielded better predictive performance in some topic models, and provided an interpretation of it as an approximate expectation propagation (EP) update.

With the collected samples $\mathcal{S} \triangleq \{\hat x^{(k)}_{1:T}\}_{k=1}^S$, we set the factor

$$\hat q(x_{1:T}) = \frac{1}{S} \sum_{\hat x_{1:T} \in \mathcal{S}} \delta_{\hat x_{1:T}}(x_{1:T}),$$

where we use the notation $\hat q(x_{1:T})$ to emphasize that it is a sample-based representation. It is straightforward to compute the expectation over states in (18) by plugging in the sampled durations. The update to the parameters of $q(r^{(i)}, p^{(i)})$ is

$$\tilde\nu^{(i)} \leftarrow (1-\rho)\tilde\nu^{(i)} + \rho(\nu^{(i)} + s \cdot \hat t^{(i)}_r)$$
$$\tilde a^{(i,r)} \leftarrow (1-\rho)\tilde a^{(i,r)} + \rho(a^{(i)} + s \cdot \hat t^{(i,r)}_a)$$
$$\tilde b^{(i,r)} \leftarrow (1-\rho)\tilde b^{(i,r)} + \rho(b^{(i)} + s \cdot \hat t^{(i,r)}_b)$$

with

$$\hat t^{(i,r)}_a \triangleq \frac{1}{S} \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} (d-1), \qquad \hat t^{(i,r)}_b \triangleq \frac{1}{S} \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} r,$$

$$(\hat t^{(i)}_r)_r \triangleq (\tilde a^{(i,r)} + \hat t^{(i,r)}_a - 1)\, \mathbb{E}_{q(p|r)}[\ln p^{(i,r)}] + (\tilde b^{(i,r)} + \hat t^{(i,r)}_b - 1)\, \mathbb{E}_{q(p|r)}[\ln(1 - p^{(i,r)})] + \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} \ln \binom{d+r-2}{d-1},$$

where $D^{(i)}(\hat x_{1:T})$ denotes the set of durations of state $i$ in the sequence $\hat x_{1:T}$. The updates to the other factors are as before but with expectations taken over $\hat q(x_{1:T})$.
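To make the sample-based statistics concrete, here is a small sketch (ours, with hypothetical helper names) that run-length encodes sampled state sequences and accumulates $\hat t_a$ and $\hat t_b$ for one state; it ignores boundary effects such as censored first and last segments, which a careful implementation would treat explicitly.

```python
import numpy as np

def duration_stats(samples, state, r):
    """Average t_hat_a = sum(d - 1) and t_hat_b = r * (number of segments)
    over sampled state sequences, for the approximate q(r, p) update."""
    t_a = t_b = 0.0
    for x in samples:                             # each x: integer state sequence
        x = np.asarray(x)
        change = np.flatnonzero(np.diff(x)) + 1   # segment boundaries
        starts = np.concatenate([[0], change])
        ends = np.concatenate([change, [len(x)]])
        durs = (ends - starts)[x[starts] == state]
        t_a += np.sum(durs - 1)
        t_b += r * len(durs)
    S = len(samples)
    return t_a / S, t_b / S

# Example: two sampled sequences, statistics for state 0 with r = 3.
print(duration_stats([[0, 0, 1, 0], [1, 0, 0, 0, 2]], state=0, r=3))
```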
The methods presented in this section for HSMMs with negative binomial durations can be extended in several ways. In particular, one can use similar methods to perform efficient updates when state durations are modeled as mixtures of negative binomial distributions. Since each negative binomial component can separately parameterize mean and variance, in this way one can generate a flexible and convenient family of duration distributions analogous to the Gaussian mixture model pervasive in density modeling.

6. Experiments

We conclude with a numerical study to validate these algorithms and, in particular, to measure the effectiveness of the approximate updates proposed in Section 5.2. As a performance metric, we evaluate an approximate posterior predictive density on held-out data, writing

$$p(\bar y_{\mathrm{test}}|\bar y_{\mathrm{train}}) = \iint p(\bar y_{\mathrm{test}}|\pi, \theta)\, p(\pi, \theta|\bar y_{\mathrm{train}})\, d\pi\, d\theta \approx \mathbb{E}_{q(\pi)q(\theta)}\, p(\bar y_{\mathrm{test}}|\pi, \theta)$$

and approximating the expectation by sampling models from the variational distribution. To reproduce these figures, see the code in the supplementary materials.

[Figure 1. Synthetic numerical experiments, plotting held-out predictive likelihood against wall-clock time in seconds: (a) SVI vs. batch for the HDP-HMM; (b) SVI vs. batch for the HDP-HSMM; (c) exact vs. approximate HSMM SVI for several numbers of samples S.]
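Concretely, the posterior predictive estimate above can be computed by averaging test likelihoods under models drawn from $q$, as in the following sketch; `sample_model` and `log_likelihood` are caller-supplied hypothetical hooks, not functions from the paper's code.

```python
import numpy as np

def predictive_log_likelihood(sample_model, log_likelihood, test_seqs, n_samples=10):
    """Monte Carlo estimate of E_{q(pi)q(theta)} p(y_test | pi, theta):
    draw (pi, theta) from the variational distribution n_samples times and
    average the resulting test-set likelihoods per held-out sequence."""
    lls = np.array([[log_likelihood(seq, *sample_model()) for seq in test_seqs]
                    for _ in range(n_samples)])          # (n_samples, n_seqs)
    per_seq = np.logaddexp.reduce(lls, axis=0) - np.log(n_samples)
    return per_seq.sum()
```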
First, we compare the performance of the SVI and batch mean field algorithms for the HDP-HMM. We sampled a 10-state HMM with 2-dimensional Gaussian emissions and generated a dataset of 100 observation sequences of length 3000 each. We chose a random subset of 95% of the sequences as training sequences and held out 5% as test sequences. We repeated the fitting procedures 5 times with identical initializations drawn from the prior, and we report the median performance with standard deviation error bars. The SVI procedure made only one pass through the training set. Figure 1(a) shows that the SVI algorithm produces fits that are comparable in performance in the time it takes the batch algorithm to complete a single iteration.

Similarly, we compare the SVI and batch mean field algorithms for the HDP-HSMM with Poisson durations. Due to the much greater computational complexity of HSMM inference, we generated a set of 30 sequences of length 2000 each and used 90% of the sequences in the training set. Figure 1(b) again demonstrates that the SVI algorithm can fit such models in the time it takes the batch algorithm to complete a single iteration.

Finally, we compare the performance of the exact SVI update for the HSMM with that of the fast approximate update proposed in Section 5. We again generated data using Poisson duration distributions, but we train models using negative binomial durations where p ∼ Beta(1, 1) and r ∼ Uniform({1, 2, ..., 10}). We generated 55 observation sequences of length 3000 and used 90% of the sequences in the training set. We compare the sampling algorithm's performance for several numbers of samples S. Figure 1(c) shows that the approximate update from Section 5 results in higher predictive performance than that of the model trained with the exact update, even using a single sample. This performance is likely dataset-dependent, but the experiment demonstrates that the approximate update may be very effective in some cases.

7. Conclusion

This paper develops scalable SVI-based inference for HMMs, HSMMs, HDP-HMMs, and HDP-HSMMs, and provides a technique to make Bayesian inference in negative binomial HSMMs much more practical. These models are widely applicable in time series inference, so these algorithms and our code may be immediately useful to the community.

Acknowledgements

This research was supported in part under AFOSR Grant FA9550-12-1-0287.

References

Beal, Matthew James. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

Bernardo, José M and Smith, Adrian FM. Bayesian Theory, volume 405. Wiley, 2009.

Bishop, Christopher M and Nasrabadi, Nasser M. Pattern Recognition and Machine Learning, volume 1. Springer New York, 2006.

Bottou, Léon. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17:9, 1998.

Broderick, Tamara, Boyd, Nicholas, Wibisono, Andre, Wilson, Ashia C, and Jordan, Michael. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.

Bryant, Michael and Sudderth, Erik B. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pp. 2708–2716, 2012.

Bulla, Jan and Bulla, Ingo. Stylized facts of financial time series and hidden semi-Markov models. Computational Statistics & Data Analysis, 51(4):2192–2209, 2006.

Fearnhead, Paul. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213, 2006.

Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press, 2010.

Fox, Emily, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.

Griffiths, Thomas L, Steyvers, Mark, Blei, David M, and Tenenbaum, Joshua B. Integrating topics and syntax. In Advances in Neural Information Processing Systems, pp. 537–544, 2004.

Guédon, Yann. Hidden hybrid Markov/semi-Markov chains. Computational Statistics & Data Analysis, 49(3):663–688, 2005.

Hoffman, Matthew, Bach, Francis R, and Blei, David M. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pp. 856–864, 2010.

Hoffman, M, Blei, D, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

Hudson, Nicolas H. Inference in hybrid systems with applications in neural prosthetics. PhD thesis, California Institute of Technology, 2008.

Johnson, Matthew J. and Willsky, Alan S. Bayesian nonparametric hidden semi-Markov models. Journal of Machine Learning Research, 14:673–701, February 2013.

Johnson, Michael T. Capacity and complexity of HMM duration modeling techniques. IEEE Signal Processing Letters, 12(5):407–410, 2005.

Lehman, Li-wei H, Nemati, Shamim, Adams, Ryan P, and Mark, Roger G. Discovering shared dynamics in physiological signals: Application to patient monitoring in ICU. In Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, pp. 5939–5942. IEEE, 2012.

Liang, Percy, Petrov, Slav, Jordan, Michael I, and Klein, Dan. The infinite PCFG using hierarchical Dirichlet processes. In EMNLP-CoNLL, pp. 688–697, 2007.

Lindén, Martin, Johnson, Stephanie, van de Meent, Jan-Willem, Phillips, Rob, and Wiggins, Chris H. Analysis of DNA looping kinetics in tethered particle motion experiments using hidden Markov models. Biophysical Journal, 104(2):418A–418A, 2013.

Murphy, K. Hidden semi-Markov models (segment models). Technical report, November 2002. URL http://www.cs.ubc.ca/~murphyk/Papers/segment.pdf.

Ranganath, Rajesh, Wang, Chong, Blei, David, and Xing, Eric. An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 298–306, 2013.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Russell, Martin and Moore, Roger. Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85, volume 10, pp. 5–8. IEEE, 1985.

Russell, Martin J and Cook, A. Experimental evaluation of duration modelling techniques for automatic speech recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '87, volume 12, pp. 2376–2379. IEEE, 1987.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. arXiv preprint arXiv:1206.2944, 2012.

Wang, Chong and Blei, David. Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25, pp. 422–430, 2012.

Wang, Chong, Paisley, John W, and Blei, David M. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics, pp. 752–760, 2011.