Developing a Tempered HDP-HMM for Systems with State Persistence

MIT LABORATORY FOR INFORMATION & DECISION SYSTEMS TECHNICAL REPORT P-2777, NOVEMBER 2007

Emily B. Fox∗, Erik B. Sudderth†, Michael I. Jordan† and Alan S. Willsky∗
Department of Electrical Engineering and Computer Science
∗Massachusetts Institute of Technology, †University of California, Berkeley
[email protected], [email protected], [email protected], [email protected]

I. INTRODUCTION

Many real-world processes, as diverse as speech signals, the human genome, and financial time series, can be modeled via a hidden Markov model (HMM). The HMM assumes that observations are generated by a hidden, discrete-valued Markov process representing the system's state evolution. An extension to the HMM is the switching linear dynamic system (SLDS), which allows for more complicated dynamics generating the observations, but still follows the Markov state-switching of the HMM. For both the HMM and the SLDS, the state sequence's Markov structure accounts for the temporal persistence of certain regimes of operation.

Recently, the hierarchical Dirichlet process (HDP) [1] has been applied to the problem of learning hidden Markov models with unknown state space cardinality; the resulting model is referred to as an HDP-HMM. A Dirichlet process is a distribution over random probability measures on infinite parameter spaces. This process provides a practical, data-driven prior that favors models whose complexity grows as more data is observed. A specific hierarchical layering of these Dirichlet processes results in the HDP. When applied as a prior on the parameters of an HMM, the Dirichlet process encourages simple models of state dynamics, but allows additional states to be created as new behaviors are observed. The hierarchical structure allows for consistent learning of temporal state dependencies. In addition, the HDP has a number of properties that allow for computationally efficient learning algorithms, even on large datasets.

The original HDP-HMM addresses the statistical issue of coping with an unknown and potentially infinite state space, but allows for learning models with unrealistically rapid dynamics. For many applications, the state sequence's Markov structure is an approximation to a system with more complex temporal behavior, perhaps better approximated as semi-Markov with some non-exponentially distributed state duration. Setting a high probability of self-transition is a common approach to modeling states that persist over lengthy periods of time. One of the main limitations of the original HDP-HMM formulation is that it cannot be biased towards learning transition densities that favor such self-transitions. This results in a large sensitivity to noise, since the HDP-HMM can explain the data by rapidly switching among redundant states. Although the Dirichlet process induces a weak bias towards simple explanations employing fewer model components, when state-switching probabilities are unconstrained there can be significant posterior uncertainty in the underlying model.

Existing learning algorithms for HDP-HMMs are based on Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, with an implementation that sequentially samples the state for each time step [1]. This sequential sampler leads to a slow mixing rate, since global assignment changes are constrained to occur coordinate by coordinate, making it difficult to transition between different modes of the posterior.
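To make the notion of temporal state persistence concrete, the short Python/NumPy sketch below samples a state sequence and observations from a simple three-state HMM whose transition matrix places high probability on self-transitions. It is a minimal illustration only: the number of states, the self-transition probability of 0.95, and the Gaussian emission parameters are arbitrary choices for this example and are not taken from the model developed in this report.

    import numpy as np

    # Illustrative 3-state HMM with strong self-transitions.
    # All numerical values are assumptions for this sketch, not from the report.
    rng = np.random.default_rng(0)

    K = 3                      # number of hidden states
    p_self = 0.95              # probability of remaining in the current state
    pi = np.full((K, K), (1.0 - p_self) / (K - 1))
    np.fill_diagonal(pi, p_self)          # transition matrix with persistent states

    means = np.array([-2.0, 0.0, 3.0])    # Gaussian emission means, one per state
    sigma = 0.5                           # shared emission standard deviation

    T = 200
    z = np.zeros(T, dtype=int)            # hidden state sequence
    y = np.zeros(T)                       # observations
    z[0] = rng.integers(K)
    y[0] = rng.normal(means[z[0]], sigma)
    for t in range(1, T):
        z[t] = rng.choice(K, p=pi[z[t - 1]])      # Markov state evolution
        y[t] = rng.normal(means[z[t]], sigma)     # conditionally independent emission

    # With p_self = 0.95, state durations are geometric with mean 20 time steps,
    # so the sampled sequence exhibits the temporal persistence discussed above.
    print("average run length:", T / (1 + np.count_nonzero(np.diff(z))))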
Existing HMM algorithms, such as the forward-backward algorithm [2], provide efficient methods for jointly sampling the entire state sequence conditioned on the observations and model parameters. While the original MCMC algorithm marginalized out the infinite set of infinite-dimensional transition densities, we explore the use of truncated approximations to the Dirichlet process to make joint sampling tractable.

In this paper we revisit the HDP-HMM and develop methods which allow more efficient and effective learning from realistic time series. In Sec. II, we begin by presenting some of the theoretical background of Dirichlet processes. Then, in Sec. III, we briefly describe the hierarchical Dirichlet process and, in Sec. IV, how it relates to learning HMMs. The revised formulation is described in Sec. V, while Sec. V-C outlines the procedure for the blocked resampling of the state sequence. In Sec. VI, we offer a model and inference algorithm for an HDP-HMM with non-standard emission distributions. We present results from simulated datasets in Sec. VII.

II. DIRICHLET PROCESSES

A Dirichlet process defines a distribution over probability measures on a parameter space Θ, which might be countably or uncountably infinite. This stochastic process is uniquely defined by a concentration parameter, α, and a base measure, H, on the parameter space Θ; we denote it by DP(α, H). Consider a random probability measure G ∼ DP(α, H). The Dirichlet process is formally defined by the property that for any finite partition¹ {A_1, ..., A_K} of the parameter space Θ,

    (G(A_1), \ldots, G(A_K)) \sim \mathrm{Dir}(\alpha H(A_1), \ldots, \alpha H(A_K)).    (1)

That is, the measure of a random probability distribution G ∼ DP(α, H) on every finite partition of the parameter space Θ follows a specific Dirichlet distribution.

¹ A partition of a set A is a set of disjoint, non-empty subsets of A such that every element of A is contained in exactly one of these subsets. More formally, {A_k}_{k=1}^K is a partition of A if ∪_k A_k = A and A_k ∩ A_j = ∅ for each j ≠ k.

The Dirichlet process was first introduced by Ferguson [3] using Kolmogorov's consistency conditions. A more practically insightful definition of the Dirichlet process was given by Sethuraman [4]. Consider a probability mass function (pmf) {π_k}_{k=1}^∞ on a countably infinite set, where the discrete probabilities are constructively defined as follows:

    \beta'_k \sim \mathrm{Beta}(1, \alpha), \quad k = 1, 2, \ldots
    \pi_k = \beta'_k \prod_{\ell=1}^{k-1} (1 - \beta'_\ell), \quad k = 1, 2, \ldots.    (2)

In effect, we have divided a unit-length stick by the weights π_k. The kth weight is a random proportion β'_k of the remaining stick after the previous (k − 1) weights have been defined. This stick-breaking construction is typically denoted by π ∼ GEM(α). Sethuraman showed that with probability one, a random draw G ∼ DP(α, H) can be expressed as

    G(\theta) = \sum_{k=1}^{\infty} \pi_k \delta(\theta - \theta_k), \quad \theta_k \sim H, \quad k = 1, 2, \ldots,    (3)

where the notation δ(θ − θ_k) indicates a Dirac delta at θ = θ_k. From this definition, we see that the Dirichlet process actually defines a distribution over discrete probability measures. The stick-breaking construction also gives us insight into how the concentration parameter α controls the relative proportion of the mixture weights π_k, and thus determines the model complexity in terms of the expected number of components with significant probability mass.²

² If the value of α is unknown, the model may be augmented with a gamma prior distribution on α, so that the parameter is learned from the data [5]. See Section V-D.

The Dirichlet process has a number of properties which make inference using this nonparametric prior computationally tractable. Because random probability measures drawn from a Dirichlet process are discrete, there is a strictly positive probability of multiple observations θ̄_i ∼ G taking identical values. For each observation θ̄_i ∼ G, let z_i be an indicator random variable for the unique values θ_k, such that θ̄_i = θ_{z_i}.
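The stick-breaking construction of Eqs. (2)-(3) is straightforward to simulate. The Python/NumPy sketch below is a minimal illustration, assuming a finite truncation level and a standard normal base measure H (both choices are ours, made only for this example): it draws truncated weights π and atoms θ_k, and then samples observations θ̄_i ∼ G, whose repeated values illustrate the discreteness of G and the role of the indicator variables z_i.

    import numpy as np

    def truncated_dp_draw(alpha, base_sampler, K=500, rng=None):
        """Approximate draw G ~ DP(alpha, H) via a truncated stick-breaking
        construction, Eqs. (2)-(3): returns weights pi_k and atoms theta_k."""
        rng = rng or np.random.default_rng()
        beta = rng.beta(1.0, alpha, size=K)            # beta'_k ~ Beta(1, alpha)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
        pi = beta * remaining                          # pi_k = beta'_k * prod_{l<k} (1 - beta'_l)
        theta = base_sampler(K, rng)                   # theta_k ~ H, i.i.d.
        return pi, theta

    rng = np.random.default_rng(1)
    alpha = 1.0
    # Assumed base measure H: standard normal on the real line.
    pi, theta = truncated_dp_draw(alpha, lambda K, r: r.normal(0.0, 1.0, size=K), rng=rng)

    # Sample N observations theta_bar_i ~ G; because G is discrete, ties occur
    # with positive probability.  The weights are renormalized because the
    # truncation discards a small amount of tail mass.
    N = 50
    z = rng.choice(len(pi), size=N, p=pi / pi.sum())   # indicator variables z_i
    theta_bar = theta[z]
    print("unique atoms used by", N, "draws:", len(np.unique(z)))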
Blackwell and MacQueen [6] introduced a Pólya urn representation of the Dirichlet process, which can be equivalently described by the following predictive distribution on these indicator random variables:

    p(z_{N+1} = z \mid z_{1:N}, \alpha) = \frac{\alpha}{\alpha + N}\,\delta(z, \tilde{k}) + \sum_{k=1}^{K} \frac{N_k}{\alpha + N}\,\delta(z, k).    (4)

Here, N_k is the number of indicator random variables taking the value k, and k̃ is a previously unseen value. We use the notation δ(z, k) to indicate the Kronecker delta. This representation can be used to sample observations from a Dirichlet process without explicitly constructing the countably infinite random probability measure G ∼ DP(α, H).

The predictive distribution of Eq. (4) is commonly referred to as the Chinese restaurant process. The analogy is as follows. Take θ̄_i to be a customer entering a restaurant with infinitely many tables, each serving a unique dish θ_k. Each arriving customer chooses a table, indicated by z_i, in proportion to how many customers are currently sitting at that table. With some positive probability proportional to α, the customer starts a new, previously unoccupied table k̃. From the Chinese restaurant process, we see that the Dirichlet process has a reinforcement property that leads to favoring simpler models.

We have shown that if z_i ∼ π and π ∼ GEM(α), then we can integrate out π to determine the predictive likelihood of z_i. Another important distribution is that over the number K of unique values of z_i drawn from π, given the total number N of draws. When π is distributed according to a stick-breaking construction with concentration parameter α, this distribution is given by [7]:

    p(K \mid N, \alpha) = s(N, K)\,\alpha^{K} \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)},    (5)

where s(n, m) are unsigned Stirling numbers of the first kind.

The Dirichlet process is most commonly used as a prior distribution on the parameters of a mixture model when the number of mixture components is unknown a priori. Such a model is called a Dirichlet process mixture model and is depicted by the graphs of Fig. 1(a)-(b). The parameter with which an observation is associated implicitly partitions, or clusters, the data. In addition, the Chinese restaurant process representation indicates that the Dirichlet process provides a prior that makes it more likely to associate an observation with a parameter to which other observations have already been associated.
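The clustering behavior just described is easy to see by simulating the Chinese restaurant process of Eq. (4) directly. The Python/NumPy sketch below is an illustrative simulation with arbitrary choices of α and N: each customer joins an occupied table with probability proportional to its count N_k, or starts a new table with probability proportional to α. Repeating the simulation gives an empirical picture of the distribution over the number of occupied tables K characterized by Eq. (5).

    import numpy as np

    def chinese_restaurant_process(N, alpha, rng):
        """Sequentially sample table indicators z_1, ..., z_N via Eq. (4)."""
        z = np.zeros(N, dtype=int)
        counts = []                          # counts[k] = N_k, customers at table k
        for i in range(N):
            probs = np.array(counts + [alpha], dtype=float)
            probs /= probs.sum()             # normalize (N_1, ..., N_K, alpha) / (alpha + i)
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):             # previously unseen table k-tilde
                counts.append(1)
            else:
                counts[k] += 1
            z[i] = k
        return z, len(counts)

    rng = np.random.default_rng(2)
    alpha, N = 1.0, 100
    Ks = [chinese_restaurant_process(N, alpha, rng)[1] for _ in range(1000)]
    # The histogram of Ks approximates p(K | N, alpha) in Eq. (5); for alpha = 1
    # and N = 100 the number of occupied tables is typically on the order of log N,
    # reflecting the reinforcement property that favors reusing existing clusters.
    print("mean number of tables:", np.mean(Ks))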
