Stochastic Variational Inference for Bayesian Time Series Models
Matthew James Johnson [email protected] Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA USA
Alan S. Willsky [email protected] Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA USA
Abstract

Bayesian models provide powerful tools for analyzing complex time series data, but performing inference with large datasets is a challenge. Stochastic variational inference (SVI) provides a new framework for approximating model posteriors with only a small number of passes through the data, enabling such models to be fit at scale. However, its application to time series models has not been studied. In this paper we develop SVI algorithms for several common Bayesian time series models, namely the hidden Markov model (HMM), the hidden semi-Markov model (HSMM), and the nonparametric HDP-HMM and HDP-HSMM. In addition, because HSMM inference can be expensive even in the minibatch setting of SVI, we develop fast approximate updates for HSMMs with duration distributions that are negative binomials or mixtures of negative binomials.

1. Introduction

Bayesian time series models can be applied to complex data in many domains, including data arising from behavior and motion (Fox et al., 2010; 2011), home energy consumption (Johnson & Willsky, 2013), physiological signals (Lehman et al., 2012), single-molecule biophysics (Lindén et al., 2013), brain-machine interfaces (Hudson, 2008), and natural language and text (Griffiths et al., 2004; Liang et al., 2007). However, scaling inference in these models to large datasets is a challenge.

Many Bayesian inference algorithms require a complete pass over the data in each iteration and thus do not scale well. In contrast, some recent Bayesian inference methods require only a small number of passes and can even operate in the single-pass or streaming settings (Broderick et al., 2013). In particular, stochastic variational inference (SVI) (Hoffman et al., 2013) provides a general framework for scalable inference based on mean field and stochastic gradient optimization. However, while SVI has been studied extensively for topic models (Hoffman et al., 2010; Wang et al., 2011; Bryant & Sudderth, 2012; Wang & Blei, 2012; Ranganath et al., 2013; Hoffman et al., 2013), it has not been applied to time series.

In this paper, we develop SVI algorithms for the core Bayesian time series models based on the hidden Markov model (HMM), namely the Bayesian HMM and hidden semi-Markov model (HSMM), as well as their nonparametric extensions based on the hierarchical Dirichlet process (HDP), the HDP-HMM and HDP-HSMM. Both the HMM and HDP-HMM are ubiquitous in time series modeling, and so the SVI algorithms developed in Sections 3 and 4 are widely applicable.

The HSMM and HDP-HSMM extend their HMM counterparts by allowing explicit modeling of state durations with arbitrary distributions. However, HSMM inference subroutines have time complexity that scales quadratically with the observation sequence length, which can be expensive even in the minibatch setting of SVI. To address this shortcoming, in Section 5 we develop a new method for Bayesian inference in (HDP-)HSMMs with negative binomial durations that allows approximate SVI updates with time complexity that scales only linearly with sequence length. The methods in this paper also provide the first batch mean field algorithm for HDP-HSMMs.

Our code is available at github.com/mattjj/pyhsmm.

2. Background

Here we review the key ingredients of SVI, namely stochastic gradient algorithms, the mean field variational inference problem, and natural gradients of the mean field objective for models with complete-data conjugacy.
2.1. Stochastic gradient ascent

Consider the optimization problem

$$\max_\phi f(\phi, \bar y) \quad \text{where} \quad f(\phi, \bar y) = \sum_{k=1}^K g(\phi, \bar y^{(k)})$$

and where $\bar y = \{\bar y^{(k)}\}_{k=1}^K$ is fixed. Then if $\hat k$ is sampled uniformly over $\{1, 2, \ldots, K\}$, we have

$$\nabla_\phi f(\phi, \bar y) = K \cdot \mathbb{E}_{\hat k}\big[\nabla_\phi g(\phi, \bar y^{(\hat k)})\big].$$

Thus we can generate approximate gradients of the objective using only one $\bar y^{(k)}$ at a time. A stochastic gradient algorithm for a sequence of stepsizes $\rho^{(t)}$ and a sequence of positive definite matrices $G^{(t)}$ is given in Algorithm 1. From standard results (Robbins & Monro, 1951; Bottou, 1998), if $\sum_{t=1}^\infty \rho^{(t)} = \infty$ and $\sum_{t=1}^\infty (\rho^{(t)})^2 < \infty$ and the $G^{(t)}$ have uniformly bounded eigenvalues, then the algorithm converges to a stationary point, i.e. $\phi^* \triangleq \lim_{t\to\infty} \phi^{(t)}$ satisfies $\nabla_\phi f(\phi^*, \bar y) = 0$.

Algorithm 1 Stochastic gradient ascent
  Initialize $\phi^{(0)}$
  for $t = 1, 2, \ldots$ do
    $\hat k^{(t)} \leftarrow \text{Uniform}(\{1, 2, \ldots, K\})$
    $\phi^{(t)} \leftarrow \phi^{(t-1)} + \rho^{(t)} K G^{(t)} \nabla_\phi g(\phi^{(t-1)}, \bar y^{(\hat k^{(t)})})$

Since each update in the stochastic gradient ascent algorithm only operates on one $\bar y^{(k)}$, or minibatch, at a time, it can scale to the case where $\bar y$ is large.

2.2. Stochastic variational inference

Given a probabilistic model

$$p(\phi, z, y) = p(\phi) \prod_{k=1}^K p(z^{(k)}|\phi)\, p(y^{(k)}|z^{(k)}, \phi)$$

that includes global latent variables $\phi$, local latent variables $z = \{z^{(k)}\}_{k=1}^K$, and observations $y = \{y^{(k)}\}_{k=1}^K$, the mean field problem is to approximate the posterior $p(\phi, z|\bar y)$ for fixed data $\bar y$ with a distribution of the form $q(\phi)q(z) = q(\phi) \prod_k q(z^{(k)})$ by finding a local minimum of the KL divergence from the approximating distribution to the posterior or, equivalently, finding a local maximum of the marginal likelihood lower bound

$$\mathcal{L} \triangleq \mathbb{E}_{q(\phi)q(z)} \ln \frac{p(\phi, z, \bar y)}{q(\phi)q(z)} \;\le\; \ln p(\bar y). \tag{1}$$

SVI optimizes the objective (1) using a stochastic natural gradient ascent algorithm over the global factors $q(\phi)$.

Natural gradients of $\mathcal{L}$ with respect to the parameters of $q(\phi)$ have a convenient form if the prior $p(\phi)$ and each complete-data likelihood $p(z^{(k)}, y^{(k)}|\phi)$ are a conjugate pair of exponential family distributions. That is, if

$$\ln p(\phi) = \langle \eta_\phi,\, t_\phi(\phi) \rangle - A_\phi(\eta_\phi)$$
$$\ln p(z^{(k)}, y^{(k)}|\phi) = \langle \eta_{zy}(\phi),\, t_{zy}(z^{(k)}, y^{(k)}) \rangle - A_{zy}(\eta_{zy}(\phi))$$

then conjugacy (Bernardo & Smith, 2009, Proposition 5.4) implies that $t_\phi(\phi) = (\eta_{zy}(\phi), -A_{zy}(\eta_{zy}(\phi)))$, so that

$$p(\phi|z^{(k)}, \bar y^{(k)}) \propto \exp\{\langle \eta_\phi + (t_{zy}(z^{(k)}, \bar y^{(k)}), 1),\, t_\phi(\phi) \rangle\}.$$

Conjugacy also implies the optimal $q(\phi)$ is in the same family, i.e. $q(\phi) = \exp\{\langle \tilde\eta_\phi, t_\phi(\phi) \rangle - A_\phi(\tilde\eta_\phi)\}$ for some parameter $\tilde\eta_\phi$ (Bishop & Nasrabadi, 2006, Section 10.4.1).

With this structure, there is a simple expression for the gradient of $\mathcal{L}$ with respect to $\tilde\eta_\phi$. To simplify notation, we write $t(z, \bar y) \triangleq \sum_{k=1}^K (t_{zy}(z^{(k)}, \bar y^{(k)}), 1)$, $\tilde\eta \triangleq \tilde\eta_\phi$, $\eta \triangleq \eta_\phi$, and $A \triangleq A_\phi$. Then, dropping terms constant over $\tilde\eta$, we have

$$\mathcal{L} = \mathbb{E}_{q(\phi)q(z)}\big[\ln p(\phi|z, \bar y) - \ln q(\phi)\big] = \big\langle \eta + \mathbb{E}_{q(z)}[t(z, \bar y)],\, \nabla A(\tilde\eta) \big\rangle - \big(\langle \tilde\eta, \nabla A(\tilde\eta) \rangle - A(\tilde\eta)\big)$$

where we have used the exponential family identity $\mathbb{E}_{q(\phi)}[t_\phi(\phi)] = \nabla A(\tilde\eta)$. Differentiating over $\tilde\eta$, we have

$$\nabla_{\tilde\eta} \mathcal{L} = \nabla^2 A(\tilde\eta)\big(\eta + \mathbb{E}_{q(z)}[t(z, \bar y)] - \tilde\eta\big).$$

The natural gradient $\widetilde\nabla_{\tilde\eta}$ is defined (Hoffman et al., 2013) as $\widetilde\nabla_{\tilde\eta} \triangleq (\nabla^2 A(\tilde\eta))^{-1} \nabla_{\tilde\eta}$, and so expanding we have

$$\widetilde\nabla_{\tilde\eta} \mathcal{L} = \eta + \sum_{k=1}^K \mathbb{E}_{q(z^{(k)})}\big[(t_{zy}(z^{(k)}, \bar y^{(k)}), 1)\big] - \tilde\eta.$$

Therefore a stochastic natural gradient ascent algorithm on the global variational parameter $\tilde\eta_\phi$ proceeds at iteration $t$ by sampling a minibatch $\bar y^{(k)}$ and taking a step of some size $\rho^{(t)}$ in an approximate natural gradient direction via

$$\tilde\eta_\phi \leftarrow (1 - \rho^{(t)})\,\tilde\eta_\phi + \rho^{(t)}\big(\eta_\phi + s \cdot \mathbb{E}_{q^*(z^{(k)})}[t(z^{(k)}, \bar y^{(k)})]\big)$$

where $s \triangleq |\bar y|/|\bar y^{(k)}|$ scales the minibatch statistics to represent the full dataset. In each step we find the optimal local factor $q^*(z^{(k)})$ with standard mean field updates given the current value of $q(\phi)$. There are automatic methods to tune the sequence of stepsizes (Snoek et al., 2012; Ranganath et al., 2013), though we do not explore them here.
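To make the generic update concrete, here is a minimal numpy sketch of the conjugate SVI step $\tilde\eta \leftarrow (1-\rho)\tilde\eta + \rho(\eta + s\cdot\hat t)$ run on a toy Dirichlet-categorical model; the toy model, the stepsize schedule, and all function names are illustrative assumptions of ours, not anything taken from the paper or its released code.

```python
import numpy as np

def svi_step(eta_tilde, eta_prior, t_hat, rho, s):
    """One stochastic natural gradient step on a conjugate global factor:
    eta_tilde <- (1 - rho) * eta_tilde + rho * (eta_prior + s * t_hat)."""
    return (1.0 - rho) * eta_tilde + rho * (eta_prior + s * t_hat)

# Toy example: Dirichlet global factor over a categorical with 4 outcomes.
rng = np.random.default_rng(0)
eta_prior = np.ones(4)                       # Dir(1, 1, 1, 1) prior counts
eta_tilde = eta_prior.copy()
data = rng.integers(0, 4, size=10000)        # the full dataset
for t, batch in enumerate(np.array_split(data, 100), start=1):
    rho = t ** -0.6                          # satisfies the Robbins-Monro conditions
    t_hat = np.bincount(batch, minlength=4)  # sufficient statistics of the minibatch
    s = len(data) / len(batch)               # s = |y| / |y^(k)| rescales to the full dataset
    eta_tilde = svi_step(eta_tilde, eta_prior, t_hat, rho, s)
print(eta_tilde / eta_tilde.sum())           # approximately the empirical frequencies
```

The same convex-combination form reappears in every update of Sections 3-5; only the sufficient statistics change.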
2.3. Hidden Markov Models

A Bayesian hidden Markov model (HMM) on $N$ states includes priors on the model parameters, namely the initial state distribution and transition matrix rows $\pi = \{\pi^{(i)}\}_{i=0}^N$ and the observation parameters $\theta = \{\theta^{(i)}\}_{i=1}^N$. The full generative model over the parameters, a state sequence $x_{1:T}$ of length $T$, and an observation sequence $y_{1:T}$ is

$$\theta^{(i)} \overset{\text{iid}}{\sim} p(\theta), \qquad \pi^{(i)} \sim \mathrm{Dir}(\alpha^{(i)}), \qquad A \triangleq \begin{pmatrix} \pi^{(1)} \\ \vdots \\ \pi^{(N)} \end{pmatrix},$$

$$x_1 \sim \pi^{(0)}, \qquad x_{t+1} \sim \pi^{(x_t)}, \qquad y_t \sim p(y_t \mid \theta^{(x_t)}),$$

where we abuse notation slightly and use $p(\theta)$ and $p(y_t|\theta)$ to denote the prior distribution over $\theta$ and the conditional observation distribution, respectively. When convenient, we collect the transition rows $\{\pi^{(i)}\}_{i=1}^N$ into the transition matrix $A$ and write $q(A) \triangleq \prod_{i=1}^N q(\pi^{(i)})$.

Conditioned on the model parameters $(\pi, \theta)$ and a fixed observation sequence $\bar y_{1:T}$, the distribution of $x_{1:T}$ is Markov on a chain graph. Defining likelihood potentials $L$ by $L_{t,i} \triangleq p(\bar y_t|\theta^{(i)})$, the density $p(x_{1:T}|\bar y_{1:T}, \pi, \theta)$ is

$$\exp\Big\{ \ln \pi^{(0)}_{x_1} + \sum_{t=1}^{T-1} \ln A_{x_t, x_{t+1}} + \sum_{t=1}^{T} \ln L_{t, x_t} - Z \Big\} \tag{2}$$

where $Z$ is the normalizing constant for the distribution. We say $p(x_{1:T}|\bar y_{1:T}, \pi, \theta) = \mathrm{HMM}(A, \pi^{(0)}, L)$.

In mean field inference for HMMs (Beal, 2003), we approximate the full posterior $p(\pi, \theta, x_{1:T}|\bar y_{1:T})$ with a mean field variational family $q(\pi)q(\theta)q(x_{1:T})$ and update each variational factor in turn while fixing the others. When updating $q(x_{1:T})$, by taking expectations of the log of (2) with respect to the variational distribution over parameters, we see the update sets $q(x_{1:T}) = \mathrm{HMM}(\widetilde A, \tilde\pi^{(0)}, \widetilde L)$ with

$$\widetilde A_{i,j} \triangleq \exp\{\mathbb{E}_{q(\pi)} \ln A_{i,j}\}, \qquad \tilde\pi^{(0)}_i \triangleq \exp\{\mathbb{E}_{q(\pi^{(0)})} \ln \pi^{(0)}_i\}, \qquad \widetilde L_{t,i} \triangleq \exp\{\mathbb{E}_{q(\theta^{(i)})} \ln p(\bar y_t | x_t = i)\}. \tag{3}$$

We can compute the expectations with respect to $q(x_{1:T})$ necessary for the other factors' updates using the standard HMM message passing recursions for forward messages $F$ and backward messages $B$ with these HMM parameters:

$$F_{t,i} \triangleq \sum_{j=1}^N F_{t-1,j}\, \widetilde A_{j,i}\, \widetilde L_{t,i}, \qquad F_{1,i} \triangleq \tilde\pi^{(0)}_i \widetilde L_{1,i}, \tag{4}$$

$$B_{t,i} \triangleq \sum_{j=1}^N \widetilde A_{i,j}\, \widetilde L_{t+1,j}\, B_{t+1,j}, \qquad B_{T,i} \triangleq 1. \tag{5}$$

The messages can be computed in $O(TN^2)$ time.
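For reference, a minimal numpy implementation of the recursions (4)-(5) might look as follows; it operates directly on the potentials $\widetilde A$, $\tilde\pi^{(0)}$, $\widetilde L$ and omits the per-step rescaling needed for numerical stability on long sequences, so it is a sketch rather than production code.

```python
import numpy as np

def hmm_messages(A_tilde, pi_tilde, L_tilde):
    """Forward/backward recursions of Eqs. (4)-(5) in O(T N^2) time.
    A_tilde: (N, N) potentials exp E[ln A]; pi_tilde: (N,); L_tilde: (T, N)."""
    T, N = L_tilde.shape
    F, B = np.zeros((T, N)), np.zeros((T, N))
    F[0] = pi_tilde * L_tilde[0]                      # F_{1,i}
    for t in range(1, T):
        F[t] = (F[t - 1] @ A_tilde) * L_tilde[t]      # Eq. (4)
    B[-1] = 1.0                                       # B_{T,i}
    for t in range(T - 2, -1, -1):
        B[t] = A_tilde @ (L_tilde[t + 1] * B[t + 1])  # Eq. (5)
    Z = F[-1].sum()                                   # normalizer Z = sum_i F_{T,i}
    return F, B, Z
```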
2.4. Hidden semi-Markov Models

The hidden semi-Markov model (HSMM) (Murphy, 2002; Hudson, 2008; Johnson & Willsky, 2013) augments the generative process of the HMM by sampling a duration from a state-specific duration distribution each time a state is entered. That is, if state $i$ is entered at time $t$, a duration $d$ is sampled as $d \sim p(d|\vartheta^{(i)})$ for some parameter $\vartheta^{(i)}$, and the state stays fixed until $x_{t+d-1}$, when a Markov transition step selects a new state for $x_{t+d}$. For identifiability, self-transitions are often ruled out: the transition matrix $A$ is constrained via $A_{i,i} = 0$ and the Dirichlet prior on each row is placed on the off-diagonal entries. The parameters $\pi^{(0)}$ and $\theta$ are treated as in the HMM.

Analogous to the HMM case, for a fixed observation sequence $\bar y_{1:T}$ we define likelihood potentials $L$ by $L_{t,i} \triangleq p(\bar y_t|\theta^{(i)})$ and now also duration potentials $D$ via $D_{d,i} \triangleq p(d|\vartheta^{(i)})$, and we say $p(x_{1:T}|\bar y_{1:T}, \pi, \theta, \vartheta) = \mathrm{HSMM}(A, \pi^{(0)}, L, D)$.

In mean field inference for HSMMs, as developed in (Hudson, 2008), we approximate the posterior $p(\theta, \vartheta, \pi, x_{1:T}|\bar y_{1:T})$ with a variational family $q(\theta)q(\vartheta)q(\pi)q(x_{1:T})$. When updating the factor $q(x_{1:T})$ we have $q(x_{1:T}) = \mathrm{HSMM}(\widetilde A, \tilde\pi^{(0)}, \widetilde L, \widetilde D)$ using the definitions in (3) and

$$\widetilde D_{d,i} \triangleq \exp\big\{\mathbb{E}_{q(\vartheta^{(i)})} \ln p(d|\vartheta^{(i)})\big\}.$$

Expectations with respect to $q(x_{1:T})$ can be computed in terms of the standard HSMM forward messages $(F, F^*)$ and backward messages $(B, B^*)$ via the recursions (Murphy, 2002):

$$F_{t,i} \triangleq \sum_{d=1}^{t-1} F^*_{t-d,i}\, \widetilde D_{d,i}\, \widetilde L_{t-d+1:t,i}, \qquad F^*_{t,i} \triangleq \sum_{j=1}^N \widetilde A_{j,i}\, F_{t,j}, \tag{6}$$

$$B_{t,i} \triangleq \sum_{d=1}^{T-t} B^*_{t+d,i}\, \widetilde D_{d,i}\, \widetilde L_{t+1:t+d,i}, \qquad B^*_{t,i} \triangleq \sum_{j=1}^N B_{t,j}\, \widetilde A_{i,j}, \tag{7}$$

with $F^*_{1,i} \triangleq \tilde\pi^{(0)}_i$ and $B_{T,i} \triangleq 1$, where $\widetilde L_{t:t',i} \triangleq \prod_{\tau=t}^{t'} \widetilde L_{\tau,i}$ denotes the product of likelihood potentials over a segment. These messages require $O(T^2 N + T N^2)$ time to compute for general duration distributions.
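The quadratic cost is easiest to see in code. Below is a sketch of the forward half of (6) under one concrete choice of index conventions (these conventions vary across references, so treat the off-by-one details as our assumptions); the inner loop over durations is what makes the recursion $O(T^2 N)$ on top of the $O(TN^2)$ transition steps, and as above we omit numerical rescaling.

```python
import numpy as np

def hsmm_forward(A_tilde, pi_tilde, L_tilde, D_tilde):
    """Forward pass in the style of Eq. (6). L_tilde: (T, N) likelihood
    potentials; D_tilde[d-1, i] is the potential of duration d in state i.
    Fstar[t, i] scores a new segment of state i starting at time t."""
    T, N = L_tilde.shape
    cumlogL = np.vstack([np.zeros(N), np.cumsum(np.log(L_tilde), axis=0)])
    F, Fstar = np.zeros((T, N)), np.zeros((T, N))
    Fstar[0] = pi_tilde
    for t in range(T):
        for d in range(1, t + 2):              # all possible durations: O(T) per step
            seg_lik = np.exp(cumlogL[t + 1] - cumlogL[t + 1 - d])
            F[t] += Fstar[t + 1 - d] * D_tilde[d - 1] * seg_lik
        if t + 1 < T:
            Fstar[t + 1] = A_tilde.T @ F[t]    # Markov transition between segments
    return F, Fstar
```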
3. SVI for HMMs and HSMMs

In this section we apply SVI to both HMMs and HSMMs and express the SVI updates in terms of HMM and HSMM messages. For notational simplicity, we consider a dataset of $K$ sequences each of length $T$, written $\bar y = \{\bar y^{(k)}_{1:T}\}_{k=1}^K$, and take each minibatch to be a single sequence, which we write without the superscript index as $\bar y_{1:T}$.

3.1. SVI update for HMMs

In terms of the notation in Section 2.2, the global variables are the HMM parameters and the local variables are the hidden states; that is, $\phi = (A, \pi^{(0)}, \theta)$ and $z = x_{1:T}$. To make the updates explicit, we assume the observation parameter priors $p(\theta^{(i)})$ and likelihoods $p(y_t|\theta^{(i)})$ are conjugate pairs of exponential family distributions for each $i$, so that the conditionals have the form

$$p(\theta^{(i)}|y) \propto \exp\big\{\big\langle \eta^{(i)}_\theta + (t^{(i)}_y(y), 1),\; t^{(i)}_\theta(\theta^{(i)}) \big\rangle\big\}.$$

At each iteration of the SVI algorithm we sample a sequence $\bar y_{1:T}$ from the dataset and perform a stochastic gradient update on $q(A)$, $q(\pi^{(0)})$, and $q(\theta)$. Writing the priors and mean field factors as

$$p(\pi^{(i)}) = \mathrm{Dir}(\alpha^{(i)}), \qquad p(\theta^{(i)}) \propto \exp\big\{\langle \eta^{(i)}_\theta, t^{(i)}_\theta(\theta^{(i)}) \rangle\big\},$$
$$q(\pi^{(i)}) = \mathrm{Dir}(\tilde\alpha^{(i)}), \qquad q(\theta^{(i)}) \propto \exp\big\{\langle \tilde\eta^{(i)}_\theta, t^{(i)}_\theta(\theta^{(i)}) \rangle\big\},$$

and using the messages $F$ and $B$ as in (4) and (5), we define

$$\hat t^{(i)}_y \triangleq \mathbb{E}_{q(x_{1:T})} \sum_{t=1}^T \mathbb{I}[x_t = i]\, t^{(i)}_y(\bar y_t) = \sum_{t=1}^T F_{t,i} B_{t,i} \cdot (t^{(i)}_y(\bar y_t), 1)/Z \tag{8}$$

$$(\hat t^{(i)}_{\mathrm{trans}})_j \triangleq \mathbb{E}_{q(x_{1:T})} \sum_{t=1}^{T-1} \mathbb{I}[x_t = i, x_{t+1} = j] = \sum_{t=1}^{T-1} F_{t,i}\, \widetilde A_{i,j}\, \widetilde L_{t+1,j}\, B_{t+1,j}/Z \tag{9}$$

$$(\hat t_{\mathrm{init}})_i \triangleq \mathbb{E}_{q(x_{1:T})} \mathbb{I}[x_1 = i] = \tilde\pi^{(0)}_i \widetilde L_{1,i} B_{1,i}/Z$$

where $\mathbb{I}[\cdot]$ is 1 if its argument is true and 0 otherwise, and $Z$ is the normalizing constant $Z \triangleq \sum_{i=1}^N F_{T,i}$.

With these expected statistics, taking a natural gradient step of size $\rho$ in the parameters of $q(A)$, $q(\pi^{(0)})$, and $q(\theta)$ is

$$\tilde\eta^{(i)}_\theta \leftarrow (1-\rho)\tilde\eta^{(i)}_\theta + \rho(\eta^{(i)}_\theta + s \cdot \hat t^{(i)}_y) \tag{10}$$
$$\tilde\alpha^{(i)} \leftarrow (1-\rho)\tilde\alpha^{(i)} + \rho(\alpha^{(i)} + s \cdot \hat t^{(i)}_{\mathrm{trans}}) \tag{11}$$
$$\tilde\alpha^{(0)} \leftarrow (1-\rho)\tilde\alpha^{(0)} + \rho(\alpha^{(0)} + s \cdot \hat t_{\mathrm{init}}) \tag{12}$$

where $s \triangleq |\bar y|/|\bar y_{1:T}|$ as in Section 2.2.

3.2. SVI update for HSMMs

The SVI updates for the HSMM are very similar to those for the HMM with the addition of a duration update, but writing the expectations in terms of the HSMM messages is substantially different. The form of these expected statistics follows from the standard HSMM E-step (Murphy, 2002; Hudson, 2008).

Using the HSMM messages $(F, F^*)$ and $(B, B^*)$ defined in (6)-(7), to write the expected statistics in terms of the messages the state indicators $\mathbb{I}[x_t = i]$ must first be expanded into segment-boundary events,

$$\mathbb{I}[x_t = i] = \mathbb{I}[x_1 = i] + \sum_{\tau=1}^{t-1}\big(\mathbb{I}[x_{\tau+1} = i \neq x_\tau] - \mathbb{I}[x_\tau = i \neq x_{\tau+1}]\big),$$

whose expectations are available from the messages, e.g.

$$\mathbb{E}_{q(x_{1:T})}\, \mathbb{I}[x_t = i \neq x_{t+1}] = F_{t,i} B^*_{t,i}/Z,$$

where $Z$ is the normalizer $Z \triangleq \sum_{i=1}^N \tilde\pi^{(0)}_i B_{0,i}$, with $B_{0,i}$ given by the recursion in (7) evaluated at $t = 0$. From these we can compute $\mathbb{E}_{q(x_{1:T})}\mathbb{I}[x_t = i]$, which we use in the definition of $\hat t^{(i)}_y$ given in (8).

Finally, we compute the expected duration statistics as indicators on every possible duration $d = 1, 2, \ldots, T$ via

$$(\hat t^{(i)}_{\mathrm{dur}})_d \triangleq \sum_t \mathbb{E}\, \mathbb{I}[x_t \neq x_{t+1},\, x_{t+1:t+d} = i,\, x_{t+d+1} \neq i] = \sum_{t=1}^{T-d} F^*_{t,i}\, \widetilde D_{d,i}\, \widetilde L_{t+1:t+d,i}\, B^*_{t+d,i}\,/\,Z. \tag{13}$$

Note that this step alone requires $O(T^2 N)$ time even after the messages have been computed.

If the priors and mean field factors over duration parameters are $p(\vartheta^{(i)}) \propto \exp\{\langle \eta^{(i)}_\vartheta, t^{(i)}_\vartheta(\vartheta^{(i)}) \rangle\}$ and $q(\vartheta^{(i)}) \propto \exp\{\langle \tilde\eta^{(i)}_\vartheta, t^{(i)}_\vartheta(\vartheta^{(i)}) \rangle\}$, and the duration likelihood is $p(d|\vartheta^{(i)}) = \exp\{\langle t^{(i)}_\vartheta(\vartheta^{(i)}), (t_d(d), 1) \rangle\}$, then the duration factor update is

$$\tilde\eta^{(i)}_\vartheta \leftarrow (1-\rho)\tilde\eta^{(i)}_\vartheta + \rho\Big(\eta^{(i)}_\vartheta + s \sum_{d=1}^T (\hat t^{(i)}_{\mathrm{dur}})_d \cdot (t_d(d), 1)\Big).$$
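Putting the pieces of Section 3.1 together, a sketch of the expected-statistics computation for one HMM minibatch update, built on the messages of (4)-(5), could be organized as follows; the function names are ours, and the observation statistic of Eq. (8) is left to the caller since it depends on the chosen emission family.

```python
import numpy as np

def hmm_expected_stats(F, B, A_tilde, L_tilde, Z):
    """Expected statistics of Eqs. (8)-(9) plus the initial-state statistic."""
    state_probs = F * B / Z                       # E I[x_t = i]: weights for Eq. (8)
    t_trans = np.einsum('ti,ij,tj,tj->ij',
                        F[:-1], A_tilde, L_tilde[1:], B[1:]) / Z   # Eq. (9)
    t_init = state_probs[0]
    return state_probs, t_trans, t_init

def natural_gradient_step(param_tilde, param_prior, t_hat, rho, s):
    """Eqs. (10)-(12): the same convex-combination step for every factor."""
    return (1.0 - rho) * param_tilde + rho * (param_prior + s * t_hat)
```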
4. SVI for HDP-HMMs and HDP-HSMMs

In this section we extend our methods to the Bayesian nonparametric versions of these models, the HDP-HMM and the HDP-HSMM. The generative model for the HDP-HMM with scalar concentration parameters $\alpha, \gamma > 0$ is

$$\beta \sim \mathrm{GEM}(\gamma), \qquad \pi^{(i)} \overset{\text{iid}}{\sim} \mathrm{DP}(\alpha\beta), \qquad \theta^{(i)} \overset{\text{iid}}{\sim} p(\theta^{(i)}),$$
$$x_1 \sim \pi^{(0)}, \qquad x_{t+1} \sim \pi^{(x_t)}, \qquad y_t \sim p(y_t|\theta^{(x_t)}),$$

where $\beta \sim \mathrm{GEM}(\gamma)$ denotes sampling from a stick-breaking distribution defined by

$$v_j \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \gamma), \qquad \beta_k = v_k \prod_{j<k} (1 - v_j).$$

To perform mean field inference in HDP models, we approximate the posterior with a truncated variational distribution. While a common truncation is to limit the two stick-breaking distributions in the definition of the HDP (Hoffman et al., 2013), a more convenient truncation for our models is the "direct assignment" truncation, used in (Liang et al., 2007) for batch mean field with the HDP-HMM and in (Bryant & Sudderth, 2012) in an SVI algorithm for LDA. The direct assignment truncation limits the support of $q(x_{1:T})$ to the finite set $\{1, 2, \ldots, K\}^T$ for a truncation parameter $K$, i.e. fixing $q(x_{1:T}) = 0$ when any $x_t > K$. Thus the other factors, namely $q(\pi)$, $q(\beta)$, and $q(\theta)$, only differ from their priors in their distribution over the first $K$ components. As opposed to standard truncation, this family of approximations is nested over $K$, enabling a search procedure over the truncation parameter as developed in (Bryant & Sudderth, 2012). A similar search procedure can be used with the HDP-HMM and HDP-HSMM algorithms in this paper, though we do not explore it here.

A disadvantage of the direct assignment truncation is that the update to $q(\beta)$ is not conjugate given the other factors, as it is in Hoffman et al. (2013). Following Liang et al. (2007), to simplify the update we use a point estimate by writing $q(\beta) = \delta_{\beta^*}(\beta)$. Since the main effect of $\beta$ is to enforce shared sparsity among the $\pi^{(i)}$, it is reasonable to expect that a point approximation for $q(\beta)$ will suffice.

The updates to the factors $q(\theta)$ and $q(x_{1:T})$ are identical to those derived in the previous sections. To derive the SVI update for $q(\pi)$, we write the relevant part of the untruncated model and truncated variational factors as

$$p((\pi^{(i)}_{1:K}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\alpha \cdot (\beta_{1:K}, \beta_{\mathrm{rest}})), \qquad q((\pi^{(i)}_{1:K}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\tilde\alpha^{(i)}),$$

respectively, where $\pi^{(i)}_{\mathrm{rest}} \triangleq 1 - \sum_{k=1}^K \pi^{(i)}_k$ and $\beta_{\mathrm{rest}} \triangleq 1 - \sum_{k=1}^K \beta_k$ for $i = 1, \ldots, K$. Therefore the updates to $q(\pi^{(i)})$ are identical to those in (11), except that the number of variational parameters is $K + 1$ and the prior hyperparameters are replaced with $\alpha \cdot (\beta_{1:K}, \beta_{\mathrm{rest}})$.

The gradient of the variational objective with respect to $\beta^*$ is given by

$$\nabla_{\beta^*} \mathcal{L} = \nabla_{\beta^*}\, \mathbb{E}_{q(\pi)} \ln \frac{p(\beta, \pi)}{q(\beta)q(\pi)} = \nabla_{\beta^*} \Big[ \ln p(\beta^*) + \sum_{i=1}^K \mathbb{E}_{q(\pi^{(i)})} \ln p(\pi^{(i)}|\beta^*) \Big]$$

with

$$\frac{\partial}{\partial \beta^*_k} \ln p(\beta^*) = \sum_{i=k+1}^{K} \frac{1}{1-\sum_{j<i}\beta^*_j} \;-\; \frac{\gamma-1}{1-\sum_{j\le K}\beta^*_j},$$

$$\frac{\partial}{\partial \beta^*_k}\, \mathbb{E}_{q(\pi)}[\ln p(\pi^{(i)}|\beta^*)] = \alpha\psi(\tilde\alpha^{(i)}_k) - \alpha\psi(\tilde\alpha^{(i)}_{K+1}) + \alpha\psi(\alpha\beta^*_{\mathrm{rest}}) - \alpha\psi(\alpha\beta^*_k).$$

We use this gradient expression to take a truncated gradient step on $\beta^*$ during each SVI update, using a backtracking line search to ensure the updated value satisfies $\beta^* \geq 0$.

The updates for $q(\pi)$ and $q(\beta)$ in the HDP-HSMM differ only in that the variational lower bound expression changes slightly, because the support of each $q(\pi^{(i)})$ is restricted to the off-diagonal (and renormalized). We can adapt $q(\pi^{(i)})$ by simply dropping the $i$th component from the representation, writing

$$q((\pi^{(i)}_{1:K\setminus i}, \pi^{(i)}_{\mathrm{rest}})) = \mathrm{Dir}(\tilde\alpha^{(i)}_{\setminus i}),$$

and we change the second term in the gradient for $\beta^*$ to

$$\frac{\partial}{\partial \beta^*_k}\, \mathbb{E}_{q(\pi)}[\ln p(\pi^{(i)}|\beta^*)] = \alpha\psi(\tilde\alpha^{(i)}_k) - \alpha\psi(\tilde\alpha^{(i)}_{K+1}) + \alpha\psi\Big(\alpha\sum_{j\neq i}\beta^*_j\Big) - \alpha\psi(\alpha\beta^*_k)$$

when $k \neq i$; otherwise the partial derivative is 0.

Using these gradient expressions for $\beta^*$ and a suitable gradient-based batch optimization procedure, we can also perform batch mean field updates for the HDP-HSMM.

5. Fast updates for negative binomial HSMMs

General HSMM inference is much more expensive than HMM inference, having runtime $O(T^2 N + T N^2)$ compared to just $O(TN^2)$ for $N$ states and a sequence of length $T$. The quadratic dependence on $T$ can be severely limiting even in the minibatch setting of SVI, since minibatches often must be sufficiently large for good performance (Hoffman et al., 2013; Broderick et al., 2013).

A common approach to mitigating HSMM computational complexity (Hudson, 2008; Johnson & Willsky, 2013) is to limit the support of the duration distributions, either in the model or as an approximation in the message passing computation, and thus limit the number of terms in the sums of (6) and (7). This truncation approach can be readily applied to the algorithms presented in this paper. However, truncation can be undesirable or ineffective if states have long durations. In this section, we develop approximate updates for a particular class of duration distributions with unbounded support for which the computational complexity is only linear in $T$.

5.1. HMM embeddings of negative binomial HSMMs

The negative binomial family of discrete distributions is parameterized by $(r, p)$, where $0 < p < 1$ and $r$ is a positive integer. Its probability mass function (PMF) for $k = 1, 2, \ldots$ is

$$p(k|r, p) = \binom{k+r-2}{k-1} \exp\{(k-1)\ln p + r\ln(1-p)\}. \tag{14}$$

Fixing $r$, the family of distributions parameterized by $p$ is an exponential family and admits a conjugate Beta prior. However, as a family over $(r, p)$ it is not exponential because of the binomial coefficient base measure term, which depends on $r$. When $r = 1$ the distribution is geometric, and so the class of HSMMs with negative binomial durations includes HMMs. By varying $r$ and $p$, the mean and variance can be controlled separately, making the negative binomial a popular choice for duration distributions (Bulla & Bulla, 2006; Fearnhead, 2006).
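As a small illustration of (14) and the base measure issue, the following snippet (ours, using scipy's log-gamma function) evaluates the log PMF and checks the geometric special case:

```python
import numpy as np
from scipy.special import gammaln

def nb_log_pmf(k, r, p):
    """Log of Eq. (14) on k = 1, 2, ...; the binomial coefficient is the
    base measure term depending on r that breaks conjugacy over (r, p)."""
    k = np.asarray(k)
    log_base = gammaln(k + r - 1) - gammaln(k) - gammaln(r)  # ln C(k+r-2, k-1)
    return log_base + (k - 1) * np.log(p) + r * np.log(1 - p)

ks = np.arange(1, 2000)
print(np.exp(nb_log_pmf(ks, r=4, p=0.3)).sum())              # ~1.0: a valid PMF
print(np.allclose(np.exp(nb_log_pmf(ks, r=1, p=0.3)),
                  0.7 * 0.3 ** (ks - 1)))                    # r = 1 is geometric
```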
A negative binomial random variable can be represented as a sum of $r$ geometric random variables: if $x \sim \mathrm{NB}(r, p)$ and $y = 1 + \sum_{i=1}^r z_i$ with $p(z_i = k) = p^k(1 - p)$, then $x$ and $y$ have the same distribution. Therefore, given an HSMM in which the durations of state $i$ are distributed as $\mathrm{NB}(r^{(i)}, p^{(i)})$, we can construct an HMM on $\sum_{i=1}^N r^{(i)}$ states that encodes the same process, where HSMM state $i$ corresponds to $r^{(i)}$ states in the HMM. We call this construction an HMM embedding of the HSMM, and the resulting HMM transition matrix $\bar A$ has the block structure

$$\bar A = \begin{pmatrix} C_1 & \bar A_{12} & \cdots \\ \bar A_{21} & C_2 & \\ \vdots & & \ddots \end{pmatrix}, \qquad C_i \triangleq \begin{pmatrix} p^{(i)} & 1-p^{(i)} & & \\ & \ddots & \ddots & \\ & & p^{(i)} & 1-p^{(i)} \\ & & & p^{(i)} \end{pmatrix},$$

where each diagonal block $C_i$ is $r^{(i)} \times r^{(i)}$ and each off-diagonal block $\bar A_{ij}$ routes transitions out of HSMM state $i$ into the pseudo-states of HSMM state $j$ with entries of the form $A_{ij}\bar p^{(ij)}_1, \ldots, A_{ij}\bar p^{(ij)}_{r^{(j)}}$, where $\bar p^{(ij)}$ is defined in the supplementary materials. We write the HMM embedding state sequence as $\bar x_{1:T}$, where each $\bar x_t$ decomposes as $\bar x_t = (x_t, k)$ for $k = 1, 2, \ldots, r^{(x_t)}$ according to the block structure of $\bar A$. If we define $R \triangleq \frac{1}{N}\sum_{i=1}^N r^{(i)}$, then passing messages in this structured HMM embedding can be done in $O(TNR + TN^2)$ time.

This construction draws on ideas from HSMMs with "parametric macro-states" (Guédon, 2005, Section 3) and on expanded-state HMMs (ESHMMs) (Russell & Moore, 1985; Russell & Cook, 1987; Johnson, 2005). However, this precise construction for negative binomial durations does not appear in those works. Furthermore, we extend these ideas by applying Bayesian inference as well as methods to fit (a posterior over) the $r^{(i)}$ parameters, as we discuss in the next subsection.

If every $r^{(i)}$ is fixed and the $p^{(i)}$ are the only duration parameters, we can use the HMM embedding to perform efficient conjugate SVI (or batch mean field) updates to the duration factors $q(p^{(i)})$. We write the duration prior and mean field factors as

$$p(p^{(i)}) = \mathrm{Beta}(a^{(i)}, b^{(i)}), \qquad q(p^{(i)}) = \mathrm{Beta}(\tilde a^{(i)}, \tilde b^{(i)}).$$

The embedding allows us to write the variational lower bound for the HSMM as an equivalent HMM variational lower bound with effective transition matrix $\mathbb{E}_{q(\pi)q(p)q(\theta)} \ln \bar A$. Defining $q(\bar x_{1:T})$ as the corresponding distribution over HMM states, we can write the expected sufficient statistics $\hat t^{(i)}_d \triangleq (\hat t^{(i)}_{d,1}, \hat t^{(i)}_{d,0})$ for the duration factors in terms of the expected transition statistics in the HMM embedding:

$$\hat t^{(i)}_{d,1} \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = \bar x_{t+1} = (i, k)],$$

$$\hat t^{(i)}_{d,0} \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = (i, k),\ \bar x_{t+1} \neq \bar x_t].$$

We can compute these expected transition statistics efficiently from the HMM messages using (9). The SVI update to the duration factors is then of the form

$$\tilde a^{(i)} \leftarrow (1-\rho)\tilde a^{(i)} + \rho(a^{(i)} + s \cdot \hat t^{(i)}_{d,1}) \tag{15}$$
$$\tilde b^{(i)} \leftarrow (1-\rho)\tilde b^{(i)} + \rho(b^{(i)} + s \cdot \hat t^{(i)}_{d,0}) \tag{16}$$

for some stepsize $\rho$ and minibatch scaling $s$. We can similarly write the transition, initial state, and observation statistics for the other HSMM mean field factors in terms of the embedding:

$$\hat t^{(i)}_y \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T} \sum_{k=1}^{r^{(i)}} \mathbb{I}[\bar x_t = (i, k)]\, t^{(i)}_y(\bar y_t),$$

$$(\hat t^{(i)}_{\mathrm{trans}})_j \triangleq \mathbb{E}_{q(\bar x_{1:T})} \sum_{t=1}^{T-1} \sum_{k=1}^{r^{(j)}} \mathbb{I}[\bar x_t = (i, r^{(i)}),\ \bar x_{t+1} = (j, k)],$$

$$(\hat t_{\mathrm{init}})_i \triangleq \mathbb{E}_{q(\bar x_{1:T})}\, \mathbb{I}[\bar x_1 = (i, 1)].$$

Using these statistics we perform SVI updates to the corresponding mean field factors as in (10)-(12).

We have shown that by working with an efficient HMM embedding representation we can compute updates to the HSMM mean field factors in $O(TNR + TN^2)$ time when durations are negative binomially distributed with fixed $r$. In the next subsection we extend these fast updates to include a variational representation of the posterior over $r$.
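A sketch of the embedding construction is below. The entry weights $\bar p^{(ij)}$ are defined in the paper's supplementary materials, so here `p_bar` is a hypothetical stand-in assumed to be normalized; under that assumption the sketch produces a row-stochastic $\bar A$ with the stated block structure.

```python
import numpy as np

def embed_hsmm(A, r, p, p_bar):
    """Construct the embedded HMM transition matrix A_bar of Section 5.1.
    HSMM state i becomes a chain of r[i] pseudo-states advancing with
    probability 1 - p[i] (the C_i blocks); the last pseudo-state exits into
    state j's pseudo-states with weights A[i, j] * p_bar[i][j]. Here p_bar
    is a hypothetical stand-in for the entry distributions (defined in the
    paper's supplement); we assume each p_bar[i][j] sums to one so that the
    rows of A_bar stay normalized given A_ii = 0 and stochastic rows of A."""
    N = len(r)
    starts = np.concatenate([[0], np.cumsum(r)[:-1]])
    A_bar = np.zeros((sum(r), sum(r)))
    for i in range(N):
        sl = slice(starts[i], starts[i] + r[i])
        A_bar[sl, sl] = p[i] * np.eye(r[i]) + (1 - p[i]) * np.eye(r[i], k=1)
        last = starts[i] + r[i] - 1
        for j in range(N):
            if j != i:
                A_bar[last, starts[j]:starts[j] + r[j]] = \
                    (1 - p[i]) * A[i, j] * np.asarray(p_bar[i][j])
    return A_bar
```

Because each pseudo-state row contains only a self-transition, a chain-advance entry, and (for the last pseudo-state of each block) the exit weights, HMM message passing against $\bar A$ can exploit this sparsity, which is the source of the $O(TNR + TN^2)$ bound.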
5.2. Approximate updates for fitting q(r, p)

By learning $r$ as well as $p$, negative binomial HSMMs can learn to be HMMs when appropriate and generally provide a much more flexible class of duration distributions. In this subsection, we derive an exact SVI update step for fitting both $r$ and $p$, explain its computational difficulties, and propose a fast approximate alternative based on sampling.

We define mean field factors $q(r^{(i)})q(p^{(i)}|r^{(i)})$, where each $q(r^{(i)})$ is a categorical distribution with finite support taken to be $\{1, 2, \ldots, r_{\max}\}$ and each $q(p^{(i)}|r^{(i)})$ is Beta, i.e.

$$q(r^{(i)}) \propto \exp\{\langle \tilde\nu^{(i)}, \mathbb{I}_{r^{(i)}} \rangle\}, \qquad q(p^{(i)}|r^{(i)}) = \mathrm{Beta}(\tilde a^{(i,r)}, \tilde b^{(i,r)}),$$

where $\mathbb{I}_{r^{(i)}}$ is an indicator vector with the $r^{(i)}$th entry set to 1 and the others set to zero. The priors are defined similarly. To simplify notation, in this section we often drop the superscript $i$.

Two challenges arise in computing updates to $q(r, p)$. First, the optimal variational factor on the HSMM states is

$$q(x_{1:T}) \propto \exp\big\{\mathbb{E}_{q(\pi)q(r,p)q(\theta)} \ln p(x_{1:T}|\bar y_{1:T}, \pi, r, p, \theta)\big\}.$$

Due to the expectation over $q(r)$, this factor does not have the form required to use the efficient HMM embedding of Section 5.1, and so the corresponding general HSMM messages require $O(T^2 N + T N^2)$ time to compute. Second, as we show next, due to the base measure term in (14), computing an exact update to $q(r, p)$ requires the expected duration indicator statistics of (13), a computation which itself requires $O(T^2 N)$ time even after the HSMM messages have been computed.

First, we show that the update to $q(p|r)$ is straightforward. To derive it, we write the relevant part of the variational lower bound as

$$\mathcal{L} \triangleq \mathbb{E}_{q(r,p)q(x_{1:T})} \ln \frac{p(r, p, D)}{q(r, p)} \tag{17}$$
$$= \mathbb{E}_{q(r)} \ln \frac{p(r)}{q(r)} + \mathbb{E}_{q(r)q(x_{1:T})}\, h(r, D) + \mathbb{E}_{q(r)}\Big[ \mathbb{E}_{q(p|r)} \ln \frac{p(p)\,\bar p(D|r, p)}{q(p|r)} \Big]$$

where $D$ is the set of relevant durations in $x_{1:T}$, $h(r, D) \triangleq \sum_{k\in D} \ln \binom{r+k-2}{k-1}$ arises from the negative binomial base measure term, and $\ln \bar p(D|r, p) \triangleq \sum_{k\in D} (k-1)\ln p + r\ln(1-p)$ collects the negative binomial PMF terms excluding the base measure. The only terms in (17) that depend on $q(p|r)$ are in the final bracketed term. Furthermore, each of these terms corresponds to the variational lower bound for the fixed-$r$ case described in Section 5.1, and so each $q(p|r)$ factor can be updated with Eqs. (15)-(16).

To compute an update for $q(r)$, we note that since it is a distribution with finite support, we can write its complete-data conditional in exponential family form trivially via

$$p(r|p, D) \propto \exp\{\langle \nu + t_r(p, D), \mathbb{I}_r \rangle\}, \qquad t_r(p, D)_j \triangleq \sum_{k\in D} \ln p(p|k, r = j) + \ln h(j, k),$$

and so from the results in Section 2.2 the natural gradient of (17) with respect to the parameters of $q(r)$ is

$$\widetilde\nabla_{\tilde\nu} \mathcal{L} = \nu + \mathbb{E}_{q(p|r)q(x_{1:T})}\, t_r(p, D) - \tilde\nu. \tag{18}$$

Due to the base measure term $h(j, k)$, computing this update requires evaluating the expected statistics of Eq. (13), which require $O(T^2 N)$ time.

Therefore computing an exact SVI update on the $q(r, p)$ factors is expensive both because the HSMM messages cannot be computed using the methods of Section 5.1 and because, given the messages, the required statistics are expensive to compute. To achieve an update running time that is linear in $T$, we propose to use a sample approximation to $q(x_{1:T})$ inspired by the method developed in (Wang & Blei, 2012). That is, we sample negative binomial HSMM models from the distribution $q(\pi)q(\theta)q(r, p)$ and use the embedding to generate a sample of $x_{1:T}$ under each model. Using the message passing methods of Section 5.1, the first state sequence sample for each model can be collected in $O(TNR + TN^2)$ time, and additional state sequence samples from the same model can be collected in $O(TN)$ time.

As discussed in Wang & Blei (2012), this sampling approximation does not optimize the variational lower bound over $q(x_{1:T})$, and so it should yield an inferior objective value. Indeed, while the optimal mean field update sets $q(x_{1:T}) \propto \exp\{\mathbb{E}_q[\ln p(\pi, \theta, \vartheta, x_{1:T}, \bar y_{1:T})]\}$, this update approximates $q(x_{1:T}) \propto \mathbb{E}_q[p(\pi, \theta, \vartheta, x_{1:T}, \bar y_{1:T})]$. However, Wang & Blei (2012) found this approximate update yielded better predictive performance in some topic models, and provided an interpretation of it as an approximate expectation propagation (EP) update.

With the collected samples $\mathcal{S} \triangleq \{\hat x^{(k)}_{1:T}\}_{k=1}^S$, we set the factor

$$\hat q(x_{1:T}) = \frac{1}{S} \sum_{\hat x_{1:T} \in \mathcal{S}} \delta_{\hat x_{1:T}}(x_{1:T}),$$

where we use the notation $\hat q(x_{1:T})$ to emphasize that it is a sample-based representation. It is straightforward to compute the expectation over states in (18) by plugging in the sampled durations. The update to the parameters of $q(r^{(i)}, p^{(i)})$ is

$$\tilde\nu^{(i)} \leftarrow (1-\rho)\tilde\nu^{(i)} + \rho(\nu^{(i)} + s \cdot \hat t^{(i)}_r)$$
$$\tilde a^{(i,r)} \leftarrow (1-\rho)\tilde a^{(i,r)} + \rho(a^{(i)} + s \cdot \hat t^{(i,r)}_a)$$
$$\tilde b^{(i,r)} \leftarrow (1-\rho)\tilde b^{(i,r)} + \rho(b^{(i)} + s \cdot \hat t^{(i,r)}_b)$$

with

$$\hat t^{(i,r)}_a \triangleq \frac{1}{S} \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} (d-1), \qquad \hat t^{(i,r)}_b \triangleq \frac{1}{S} \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} r,$$

$$(\hat t^{(i)}_r)_r \triangleq (\tilde a^{(i,r)} + \hat t^{(i,r)}_a - 1)\, \mathbb{E}_{q(p|r)}[\ln p^{(i,r)}] + (\tilde b^{(i,r)} + \hat t^{(i,r)}_b - 1)\, \mathbb{E}_{q(p|r)}[\ln(1 - p^{(i,r)})] + \sum_{\hat x \in \mathcal{S}} \sum_{d \in D^{(i)}(\hat x)} \ln \binom{d+r-2}{d-1},$$

where $D^{(i)}(\hat x_{1:T})$ denotes the set of durations of state $i$ in the sequence $\hat x_{1:T}$. The updates to the other factors are as before but with expectations taken over $\hat q(x_{1:T})$.
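To make the sample-based statistics concrete, here is a small sketch (ours, with hypothetical helper names) that run-length encodes sampled state sequences and accumulates $\hat t_a$ and $\hat t_b$ for one state; it ignores boundary effects such as censored first and last segments, which a careful implementation would treat explicitly.

```python
import numpy as np

def duration_stats(samples, state, r):
    """Average t_hat_a = sum(d - 1) and t_hat_b = r * (number of segments)
    over sampled state sequences, for the approximate q(r, p) update."""
    t_a = t_b = 0.0
    for x in samples:                             # each x: integer state sequence
        x = np.asarray(x)
        change = np.flatnonzero(np.diff(x)) + 1   # segment boundaries
        starts = np.concatenate([[0], change])
        ends = np.concatenate([change, [len(x)]])
        durs = (ends - starts)[x[starts] == state]
        t_a += np.sum(durs - 1)
        t_b += r * len(durs)
    S = len(samples)
    return t_a / S, t_b / S

# Example: two sampled sequences, statistics for state 0 with r = 3.
print(duration_stats([[0, 0, 1, 0], [1, 0, 0, 0, 2]], state=0, r=3))
```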
The methods presented in this section for HSMMs with negative binomial durations can be extended in several ways. In particular, one can use similar methods to perform efficient updates when state durations are modeled as mixtures of negative binomial distributions. Since each negative binomial component can separately parameterize mean and variance, in this way one can generate a flexible and convenient family of duration distributions analogous to the Gaussian mixture model pervasive in density modeling.

6. Experiments

We conclude with a numerical study to validate these algorithms and, in particular, to measure the effectiveness of the approximate updates proposed in Section 5.2. As a performance metric, we evaluate an approximate posterior predictive density on held-out data, writing

$$p(\bar y_{\mathrm{test}}|\bar y_{\mathrm{train}}) = \iint p(\bar y_{\mathrm{test}}|\pi, \theta)\, p(\pi, \theta|\bar y_{\mathrm{train}})\, d\pi\, d\theta \approx \mathbb{E}_{q(\pi)q(\theta)}\, p(\bar y_{\mathrm{test}}|\pi, \theta)$$

and approximating the expectation by sampling models from the variational distribution. To reproduce these figures, see the code in the supplementary materials.

[Figure 1. Synthetic numerical experiments, plotting held-out predictive likelihood against wall-clock time in seconds: (a) SVI vs. batch for the HDP-HMM; (b) SVI vs. batch for the HDP-HSMM; (c) exact vs. approximate HSMM SVI for several numbers of samples S.]
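Concretely, the posterior predictive estimate above can be computed by averaging test likelihoods under models drawn from $q$, as in the following sketch; `sample_model` and `log_likelihood` are caller-supplied hypothetical hooks, not functions from the paper's code.

```python
import numpy as np

def predictive_log_likelihood(sample_model, log_likelihood, test_seqs, n_samples=10):
    """Monte Carlo estimate of E_{q(pi)q(theta)} p(y_test | pi, theta):
    draw (pi, theta) from the variational distribution n_samples times and
    average the resulting test-set likelihoods per held-out sequence."""
    lls = np.array([[log_likelihood(seq, *sample_model()) for seq in test_seqs]
                    for _ in range(n_samples)])          # (n_samples, n_seqs)
    per_seq = np.logaddexp.reduce(lls, axis=0) - np.log(n_samples)
    return per_seq.sum()
```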
First, we compare the performance of the SVI and batch mean field algorithms for the HDP-HMM. We sampled a 10-state HMM with 2-dimensional Gaussian emissions and generated a dataset of 100 observation sequences of length 3000 each. We chose a random subset of 95% of the sequences as training sequences and held out 5% as test sequences. We repeated the fitting procedures 5 times with identical initializations drawn from the prior, and we report the median performance with standard deviation error bars. The SVI procedure made only one pass through the training set. Figure 1(a) shows that the SVI algorithm produces fits that are comparable in performance in the time it takes the batch algorithm to complete a single iteration.

Similarly, we compare the SVI and batch mean field algorithms for the HDP-HSMM with Poisson durations. Due to the much greater computational complexity of HSMM inference, we generated a set of 30 sequences of length 2000 each and used 90% of the sequences in the training set. Figure 1(b) again demonstrates that the SVI algorithm can fit such models in the time it takes the batch algorithm to complete a single iteration.

Finally, we compare the performance of the exact SVI update for the HSMM with that of the fast approximate update proposed in Section 5. We again generated data using Poisson duration distributions, but we train models using negative binomial durations where p ∼ Beta(1, 1) and r ∼ Uniform({1, 2, ..., 10}). We generated 55 observation sequences of length 3000 and used 90% of the sequences in the training set. We compare the sampling algorithm's performance for several numbers of samples S. Figure 1(c) shows that the approximate update from Section 5 results in higher predictive performance than that of the model trained with the exact update, even using a single sample. This performance is likely dataset-dependent, but the experiment demonstrates that the approximate update may be very effective in some cases.

7. Conclusion

This paper develops scalable SVI-based inference for HMMs, HSMMs, HDP-HMMs, and HDP-HSMMs, and provides a technique to make Bayesian inference in negative binomial HSMMs much more practical. These models are widely applicable in time series inference, so these algorithms and our code may be immediately useful to the community.

Acknowledgements

This research was supported in part under AFOSR Grant FA9550-12-1-0287.

References

Beal, Matthew James. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.

Bernardo, José M and Smith, Adrian FM. Bayesian Theory, volume 405. Wiley, 2009.

Bishop, Christopher M and Nasrabadi, Nasser M. Pattern Recognition and Machine Learning, volume 1. Springer New York, 2006.

Bottou, Léon. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17:9, 1998.

Broderick, Tamara, Boyd, Nicholas, Wibisono, Andre, Wilson, Ashia C, and Jordan, Michael. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.

Bryant, Michael and Sudderth, Erik B. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pp. 2708–2716, 2012.

Bulla, Jan and Bulla, Ingo. Stylized facts of financial time series and hidden semi-Markov models. Computational Statistics & Data Analysis, 51(4):2192–2209, 2006.

Fearnhead, Paul. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213, 2006.

Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press, 2010.

Fox, Emily, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.

Griffiths, Thomas L, Steyvers, Mark, Blei, David M, and Tenenbaum, Joshua B. Integrating topics and syntax. In Advances in Neural Information Processing Systems, pp. 537–544, 2004.

Guédon, Yann. Hidden hybrid Markov/semi-Markov chains. Computational Statistics & Data Analysis, 49(3):663–688, 2005.

Hoffman, Matthew, Bach, Francis R, and Blei, David M. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pp. 856–864, 2010.

Hoffman, M, Blei, D, Wang, Chong, and Paisley, John. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

Hudson, Nicolas H. Inference in hybrid systems with applications in neural prosthetics. PhD thesis, California Institute of Technology, 2008.

Johnson, Matthew J. and Willsky, Alan S. Bayesian nonparametric hidden semi-Markov models. Journal of Machine Learning Research, 14:673–701, February 2013.

Johnson, Michael T. Capacity and complexity of HMM duration modeling techniques. IEEE Signal Processing Letters, 12(5):407–410, 2005.

Lehman, Li-wei H, Nemati, Shamim, Adams, Ryan P, and Mark, Roger G. Discovering shared dynamics in physiological signals: Application to patient monitoring in ICU. In Engineering in Medicine and Biology Society (EMBC), 2012 Annual International Conference of the IEEE, pp. 5939–5942. IEEE, 2012.

Liang, Percy, Petrov, Slav, Jordan, Michael I, and Klein, Dan. The infinite PCFG using hierarchical Dirichlet processes. In EMNLP-CoNLL, pp. 688–697, 2007.

Lindén, Martin, Johnson, Stephanie, van de Meent, Jan-Willem, Phillips, Rob, and Wiggins, Chris H. Analysis of DNA looping kinetics in tethered particle motion experiments using hidden Markov models. Biophysical Journal, 104(2):418A–418A, 2013.

Murphy, K. Hidden semi-Markov models (segment models). Technical report, November 2002. URL http://www.cs.ubc.ca/~murphyk/Papers/segment.pdf.

Ranganath, Rajesh, Wang, Chong, Blei, David, and Xing, Eric. An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 298–306, 2013.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Russell, Martin and Moore, Roger. Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '85, volume 10, pp. 5–8. IEEE, 1985.

Russell, Martin J and Cook, A. Experimental evaluation of duration modelling techniques for automatic speech recognition. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '87, volume 12, pp. 2376–2379. IEEE, 1987.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. arXiv preprint arXiv:1206.2944, 2012.

Wang, Chong and Blei, David. Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems 25, pp. 422–430, 2012.

Wang, Chong, Paisley, John W, and Blei, David M. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics, pp. 752–760, 2011.