An Infinite Hidden Markov Model with Similarity-Biased Transitions
Colin Reimer Dawson¹, Chaofan Huang¹, Clayton T. Morrison²

¹Oberlin College, Oberlin, OH, USA. ²The University of Arizona, Tucson, AZ, USA. Correspondence to: Colin Reimer Dawson <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We describe a generalization of the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) which is able to encode prior information that state transitions are more likely between "nearby" states. This is accomplished by defining a similarity function on the state space and scaling transition probabilities by pairwise similarities, thereby inducing correlations among the transition distributions. We present an augmented data representation of the model as a Markov Jump Process in which: (1) some jump attempts fail, and (2) the probability of success is proportional to the similarity between the source and destination states. This augmentation restores conditional conjugacy and admits a simple Gibbs sampler. We evaluate the model and inference method on a speaker diarization task and a "harmonic parsing" task using four-part chorale data, as well as on several synthetic datasets, achieving favorable comparisons to existing models.

1. Introduction and Background

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) (Beal et al., 2001; Teh et al., 2006) is a Bayesian model for time series data that generalizes the conventional hidden Markov model to allow a countably infinite state space. The hierarchical structure ensures that, despite the infinite state space, a common set of destination states will be reachable with positive probability from each source state. The HDP-HMM can be characterized by the following generative process.

Each state, indexed by j, has parameters, θ_j, drawn from a base measure, H. A top-level sequence of state weights, β = (β_1, β_2, ...), is drawn by iteratively breaking a "stick" off of the remaining weight according to a Beta(1, γ) distribution. The parameter γ > 0 is known as the concentration parameter and governs how quickly the weights tend to decay, with large γ corresponding to slow decay, and hence more weights needed before a given cumulative weight is reached. This stick-breaking process is denoted by GEM (Ewens, 1990; Sethuraman, 1994), for Griffiths, Engen and McCloskey. We thus have a discrete probability measure, G_0, with weights β_j at locations θ_j, j = 1, 2, ..., defined by

    θ_j ~ H i.i.d.,    β ~ GEM(γ).    (1)

G_0 drawn in this way is a Dirichlet Process (DP) random measure with concentration γ and base measure H.

The actual transition distribution, π_j, from state j, is drawn from another DP with concentration α and base measure G_0:

    π_j ~ DP(αG_0) i.i.d.,    j = 0, 1, 2, ...    (2)

where π_0 represents the initial distribution. The hidden state sequence, z_1, z_2, ..., z_T, is then generated according to z_1 | π_0 ~ Cat(π_0), and

    z_t | z_{t−1}, π_{z_{t−1}} ~ Cat(π_{z_{t−1}}),    t = 2, ..., T.    (3)

Finally, the emission distribution for state j is a function of θ_j, so that observation y_t is drawn according to

    y_t | z_t, θ_{z_t} ~ F(θ_{z_t}).    (4)
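To make the generative process in (1)-(4) concrete, the following is a minimal sketch that samples from the model under the standard weak-limit (finite truncation) approximation. The truncation level K, the Gaussian base measure H, and the Gaussian emission family F are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def sample_hdp_hmm(T=200, K=20, gamma=2.0, alpha=4.0, seed=0):
    """Sample from a weak-limit (K-state truncation) HDP-HMM.

    Gaussian base measure and emissions stand in for the generic
    H and F(theta) of Eqs. (1) and (4).
    """
    rng = np.random.default_rng(seed)

    # Eq. (1): truncated stick-breaking beta ~ GEM(gamma),
    # and state parameters theta_j ~ H (here H = N(0, 5^2)).
    v = rng.beta(1.0, gamma, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()                # fold leftover stick mass into the truncation
    theta = rng.normal(0.0, 5.0, size=K)

    # Eq. (2): at this truncation, pi_j ~ DP(alpha * G0) becomes
    # pi_j ~ Dirichlet(alpha * beta); row 0 is the initial distribution.
    pi0 = rng.dirichlet(alpha * beta)
    Pi = rng.dirichlet(alpha * beta, size=K)

    # Eqs. (3)-(4): hidden chain and emissions.
    z = np.empty(T, dtype=int)
    z[0] = rng.choice(K, p=pi0)
    for t in range(1, T):
        z[t] = rng.choice(K, p=Pi[z[t - 1]])
    y = rng.normal(theta[z], 1.0)
    return z, y
```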
A shortcoming of the HDP prior on the transition matrix is that it does not use the fact that the source and destination states are the same set: that is, each π_j has a special element which corresponds to a self-transition. In the HDP-HMM, however, self-transitions are no more likely a priori than transitions to any other state. The Sticky HDP-HMM (Fox et al., 2008) addresses this issue by adding an extra mass κ at location j to the base measure of the DP that generates π_j. That is, (2) is replaced by

    π_j ~ DP(αG_0 + κδ_{θ_j}).    (5)

An alternative approach that treats self-transitions as special is the HDP Hidden Semi-Markov Model (HDP-HSMM; Johnson & Willsky (2013)), wherein state duration distributions are modeled separately, and ordinary self-transitions are ruled out. However, while both of these models have the ability to privilege self-transitions, they contain no notion of similarity for pairs of states that are not identical: in both cases, when the transition matrix is integrated out, the prior probability of transitioning to state j′ depends only on the top-level stick weight associated with state j′, and not on the identity or parameters of the previous state j.

The two main contributions of this paper are (1) a generalization of the HDP-HMM, which we call the HDP-HMM with local transitions (HDP-HMM-LT), that allows for a geometric structure to be defined on the latent state space, so that "nearby" states are a priori more likely to have transitions between them, and (2) a simple Gibbs sampling algorithm for this model. The "LT" property is introduced by elementwise rescaling and renormalization of the HDP transition matrix (a small numerical sketch of this operation follows below). Two versions of the similarity structure are illustrated: in one case, two states are similar to the extent that their emission distributions are similar; in the other, the similarity structure is inferred separately. In both cases, we give augmented data representations that restore conditional conjugacy and thus allow a simple Gibbs sampling algorithm to be used for inference.
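Before turning to related constructions, here is a minimal numerical sketch of that rescale-and-renormalize operation on a finite transition matrix. The Gaussian similarity kernel on the emission parameters θ and its scale lam are illustrative assumptions; the paper's actual similarity functions are developed in Section 2.

```python
import numpy as np

def lt_rescale(Pi, theta, lam=1.0):
    """Bias a row-stochastic transition matrix Pi toward 'local' transitions.

    phi[j, k] = exp(-lam * (theta_j - theta_k)^2) is one illustrative
    similarity in (0, 1]; rows are renormalized after elementwise rescaling.
    """
    phi = np.exp(-lam * (theta[:, None] - theta[None, :]) ** 2)
    tilde = Pi * phi                                  # elementwise rescaling
    return tilde / tilde.sum(axis=1, keepdims=True)   # renormalize each row

# Example: starting from uniform rows, transitions between the two
# states with nearby theta values are upweighted relative to the third.
Pi = np.full((3, 3), 1.0 / 3.0)
print(lt_rescale(Pi, np.array([0.0, 0.1, 5.0])))
```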
A rescaling and renormalization approach similar to the one used in the HDP-HMM-LT is used by Paisley et al. (2012) to define their Discrete Infinite Logistic Normal (DILN) model, an instance of a correlated random measure (Ranganath & Blei, 2016), in the setting of topic modeling. There, however, the contexts and the mixture components (topics) are distinct sets, and there is no notion of temporal dependence. Zhu et al. (2016) developed an HMM based directly on the DILN model.¹ Both Paisley et al. and Zhu et al. employ variational approximations, whereas we present a Gibbs sampler, which converges asymptotically to the true posterior. We discuss additional differences between our model and the DILN-HMM in Sec. 2.2.

¹ We thank an anonymous ICML reviewer for bringing this paper to our attention.

One class of application in which it is useful to incorporate a notion of locality occurs when the latent state sequence consists of several parallel chains, so that the global state changes incrementally, but where these increments are not independent across chains. Factorial HMMs (Ghahramani et al., 1997) are commonly used in this setting, but this ignores dependence among chains, and hence may do poorly when some combinations of states are much more probable than suggested by the chain-wise dynamics.

Another setting where the LT property is useful is when there is a notion of state geometry that licenses syllogisms: e.g., if A frequently leads to B and C, and B frequently leads to D and E, then it may be sensible to infer that A and C may lead to D and E as well. This property is arguably present in musical harmony, where consecutive chords are often (near-)neighbors in the "circle of fifths", and small steps along the circle are more common than large ones.

The paper is structured as follows. In Section 2 we define the model. In Section 3, we develop a Gibbs sampling algorithm based on an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT). In Section 4 we test two versions of the model: one on a speaker diarization task in which the speakers are inter-dependent, and another on a four-part chorale corpus, demonstrating performance improvements over state-of-the-art models when "local transitions" are more common in the data. Using synthetic data from an HDP-HMM, we show that the LT variant can learn not to use its similarity bias when the data does not support it. Finally, in Section 5, we conclude and discuss the relationships between the HDP-HMM-LT and existing HMM variants. Code and additional details are available at http://colindawson.net/hdp-hmm-lt/

2. An HDP-HMM With Local Transitions

We wish to add to the transition model the concept of a transition to a "nearby" state, where transitions between states j and j′ are more likely a priori to the extent that they are "nearby" in some similarity space. In order to accomplish this, we first consider an alternative construction of the transition distributions, based on the Normalized Gamma Process representation of the DP (Ishwaran & Zarepour, 2002; Ferguson, 1973).

2.1. A Normalized Gamma Process representation of the HDP-HMM

The Dirichlet Process is an instance of a normalized completely random measure (Kingman, 1967; Ferguson, 1973), that can be defined as G = Σ_{k=1}^∞ π̃_k δ_{θ_k}, where

    π_k ~ Gamma(αβ_k, 1) ind.,    T = Σ_{k=1}^∞ π_k,    π̃_k = π_k / T,    (6)

δ_θ is a measure assigning 1 to sets if they contain θ and 0 otherwise, subject to the constraints that Σ_{k≥1} β_k = 1 and 0 < α < ∞. It has been shown (Ferguson, 1973; Paisley et al., 2012; Favaro et al., 2013) that the normalization constant T is positive and finite almost surely, and that G is distributed as a DP with base measure G_0 = Σ_{k=1}^∞ β_k δ_{θ_k}. If we draw β = (β_1, β_2, ...) from the GEM(γ) stick-breaking process, draw an i.i.d. sequence of θ_k from a base measure H, and then draw an i.i.d. sequence of random measures, {G_j}, j = 1, 2, ..., from the above process, this defines a Hierarchical Dirichlet Process (HDP).
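The construction in (6) is easy to check empirically under a finite truncation: independent Gamma(αβ_k, 1) draws, once normalized, are jointly Dirichlet(αβ), the finite-dimensional analogue of DP(αG_0). The truncated weight vector below is an illustrative assumption.

```python
import numpy as np

def normalized_gamma_dp(beta, alpha, rng):
    """One draw from Eq. (6) at a finite truncation of beta.

    pi_k ~ Gamma(alpha * beta_k, 1) independently; dividing by
    T = sum_k pi_k gives a normalized weight vector distributed
    as Dirichlet(alpha * beta).
    """
    pi = rng.gamma(shape=alpha * beta, scale=1.0)
    return pi / pi.sum()

rng = np.random.default_rng(1)
beta = np.array([0.5, 0.3, 0.15, 0.05])   # illustrative truncated top-level weights
draws = np.array([normalized_gamma_dp(beta, 4.0, rng) for _ in range(5000)])
print(draws.mean(axis=0))                 # approximately beta: the mean of the DP is G0
```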