
An Infinite HMM With Similarity-Biased Transitions

Colin Reimer Dawson¹  Chaofan Huang¹  Clayton T. Morrison²

¹Oberlin College, Oberlin, OH, USA.  ²The University of Arizona, Tucson, AZ, USA. Correspondence to: Colin Reimer Dawson.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We describe a generalization of the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) which is able to encode prior information that state transitions are more likely between "nearby" states. This is accomplished by defining a similarity function on the state space and scaling transition probabilities by pairwise similarities, thereby inducing correlations among the transition distributions. We present an augmented data representation of the model as a Markov Jump Process in which: (1) some jump attempts fail, and (2) the probability of success is proportional to the similarity between the source and destination states. This augmentation restores conditional conjugacy and admits a simple Gibbs sampler. We evaluate the model and inference method on a speaker diarization task and a "harmonic parsing" task using four-part chorale data, as well as on several synthetic datasets, achieving favorable comparisons to existing models.

1. Introduction and Background

The hierarchical Dirichlet process hidden Markov model (HDP-HMM) (Beal et al., 2001; Teh et al., 2006) is a Bayesian model for sequential data that generalizes the conventional hidden Markov Model to allow a countably infinite state space. The hierarchical structure ensures that, despite the infinite state space, a common set of destination states will be reachable with positive probability from each source state. The HDP-HMM can be characterized by the following generative process.

Each state, indexed by j, has parameters, θ_j, drawn from a base measure, H. A top-level sequence of state weights, β = (β_1, β_2, ...), is drawn by iteratively breaking a "stick" off of the remaining weight according to a Beta(1, γ) distribution. The parameter γ > 0 is known as the concentration parameter and governs how quickly the weights tend to decay, with large γ corresponding to slow decay, and hence more weights needed before a given cumulative weight is reached. This stick-breaking process is denoted GEM (Ewens, 1990; Sethuraman, 1994) for Griffiths, Engen and McCloskey. We thus have a discrete measure, G_0, with weights β_j at locations θ_j, j = 1, 2, ..., defined by

    θ_j ~ H  i.i.d.,    β ~ GEM(γ).    (1)

G_0 drawn in this way is a Dirichlet Process (DP) random measure with concentration γ and base measure H.

The actual transition distribution, π_j, from state j, is drawn from another DP with concentration α and base measure G_0:

    π_j ~ DP(αG_0)  i.i.d.,    j = 0, 1, 2, ...    (2)

where π_0 represents the initial distribution. The hidden state sequence, z_1, z_2, ..., z_T, is then generated according to z_1 | π_0 ~ Cat(π_0), and

    z_t | z_{t−1}, π_{z_{t−1}} ~ Cat(π_{z_{t−1}}),    t = 2, ..., T.    (3)

Finally, the emission distribution for state j is a function of θ_j, so that observation y_t is drawn according to

    y_t | z_t, θ_{z_t} ~ F(θ_{z_t}).    (4)

A shortcoming of the HDP prior on the transition matrix is that it does not use the fact that the source and destination states are the same set: that is, each π_j has a special element which corresponds to a self-transition. In the HDP-HMM, however, self-transitions are no more likely a priori than transitions to any other state. The Sticky HDP-HMM (Fox et al., 2008) addresses this issue by adding an extra mass κ at location θ_j to the base measure of the DP that generates π_j. That is, (2) is replaced by

    π_j ~ DP(αG_0 + κδ_{θ_j}).    (5)

An alternative approach that treats self-transitions as special is the HDP Hidden Semi-Markov Model (HDP-HSMM; Johnson & Willsky (2013)), wherein state duration distributions are modeled separately, and ordinary self-transitions are ruled out. However, while both of these models have the ability to privilege self-transitions, they contain no notion of similarity for pairs of states that are not identical: in both cases, when the transition matrix is integrated out, the prior probability of transitioning to state j′ depends only on the top-level stick weight associated with state j′, and not on the identity or parameters of the previous state j.
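To make the generative process in (1)–(4) concrete, the following minimal sketch simulates from the HDP-HMM under a finite weak-limit truncation with J states and univariate Gaussian emissions. The truncation level, the Gaussian choices of H and F, and all variable names are illustrative assumptions, not part of the model specification above.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, J):
    """Truncated GEM(gamma): J sticks, with leftover mass folded into the last one."""
    v = rng.beta(1.0, gamma, size=J)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta[-1] = 1.0 - beta[:-1].sum()
    return beta

def sample_hdp_hmm(T, J=20, alpha=5.0, gamma=5.0):
    beta = stick_breaking(gamma, J)               # top-level weights, Eq. (1)
    theta = rng.normal(0.0, 3.0, size=J)          # theta_j ~ H (illustratively, H = Normal)
    pi = rng.dirichlet(alpha * beta, size=J)      # truncated analogue of Eq. (2)
    pi0 = rng.dirichlet(alpha * beta)             # initial distribution
    z = np.empty(T, dtype=int)
    y = np.empty(T)
    z[0] = rng.choice(J, p=pi0)
    y[0] = rng.normal(theta[z[0]], 1.0)           # Eq. (4) with F = Normal(theta_j, 1)
    for t in range(1, T):
        z[t] = rng.choice(J, p=pi[z[t - 1]])      # Eq. (3)
        y[t] = rng.normal(theta[z[t]], 1.0)
    return z, y

z, y = sample_hdp_hmm(T=500)
```

The truncation replaces the infinite stick-breaking construction with J sticks whose leftover mass is absorbed into the last weight; this is the same weak-limit device we later use for inference in Section 3.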
The two main contributions of this paper are (1) a generalization of the HDP-HMM, which we call the HDP-HMM with local transitions (HDP-HMM-LT), that allows a geometric structure to be defined on the latent state space, so that "nearby" states are a priori more likely to have transitions between them, and (2) a simple Gibbs sampling algorithm for this model. The "LT" property is introduced by elementwise rescaling and then renormalizing of the HDP transition matrix. Two versions of the similarity structure are illustrated: in one case, two states are similar to the extent that their emission distributions are similar; in the other, the similarity structure is inferred separately. In both cases, we give augmented data representations that restore conditional conjugacy and thus allow a simple Gibbs sampling algorithm to be used for inference.

A rescaling and renormalization approach similar to the one used in the HDP-HMM-LT is used by Paisley et al. (2012) to define their Discrete Infinite Logistic Normal (DILN) model, an instance of a correlated random measure (Ranganath & Blei, 2016), in the setting of topic modeling. There, however, the contexts and the mixture components (topics) are distinct sets, and there is no notion of temporal dependence. Zhu et al. (2016) developed an HMM based directly on the DILN model. (We thank an anonymous ICML reviewer for bringing this paper to our attention.) Both Paisley et al. and Zhu et al. employ variational approximations, whereas we present a Gibbs sampler, which converges asymptotically to the true posterior. We discuss additional differences between our model and the DILN-HMM in Sec. 2.2.

One class of applications in which it is useful to incorporate a notion of locality occurs when the latent state sequence consists of several parallel chains, so that the global state changes incrementally, but where these increments are not independent across chains. Factorial HMMs (Ghahramani et al., 1997) are commonly used in this setting, but this ignores dependence among chains, and hence may do poorly when some combinations of states are much more probable than suggested by the chain-wise dynamics.

Another setting where the LT property is useful is when there is a notion of state geometry that licenses syllogisms: e.g., if A frequently leads to B and C, and B frequently leads to D and E, then it may be sensible to infer that A and C may lead to D and E as well. This property is arguably present in musical harmony, where consecutive chords are often (near-)neighbors in the "circle of fifths", and small steps along the circle are more common than large ones.

The paper is structured as follows. In Section 2 we define the model. In Section 3, we develop a Gibbs sampling algorithm based on an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT). In Section 4 we test two versions of the model: one on a speaker diarization task in which the speakers are inter-dependent, and another on a four-part chorale corpus, demonstrating performance improvements over state-of-the-art models when "local transitions" are more common in the data. Using synthetic data from an HDP-HMM, we show that the LT variant can learn not to use its similarity bias when the data does not support it. Finally, in Section 5, we conclude and discuss the relationships between the HDP-HMM-LT and existing HMM variants. Code and additional details are available at http://colindawson.net/hdp-hmm-lt/

2. An HDP-HMM With Local Transitions

We wish to add to the transition model the concept of a transition to a "nearby" state, where transitions between states j and j′ are more likely a priori to the extent that they are "nearby" in some similarity space. In order to accomplish this, we first consider an alternative construction of the transition distributions, based on the Normalized Gamma Process representation of the DP (Ishwaran & Zarepour, 2002; Ferguson, 1973).

2.1. A Normalized Gamma Process representation of the HDP-HMM

The Dirichlet Process is an instance of a normalized completely random measure (Kingman, 1967; Ferguson, 1973), and can be defined as G = Σ_{k=1}^∞ π̃_k δ_{θ_k}, where

    π_k ~ Gamma(αβ_k, 1)  ind.,    T = Σ_{k=1}^∞ π_k,    π̃_k = π_k / T,    (6)

δ_θ is a measure assigning 1 to sets if they contain θ and 0 otherwise, subject to the constraint that Σ_{k≥1} β_k = 1 and 0 < α < ∞. It has been shown (Ferguson, 1973; Paisley et al., 2012; Favaro et al., 2013) that the normalization constant T is positive and finite almost surely, and that G is distributed as a DP with base measure G_0 = Σ_{k=1}^∞ β_k δ_{θ_k}.

If we draw β = (β_1, β_2, ...) from the GEM(γ) stick-breaking process, draw an i.i.d. sequence of θ_k from a base measure H, and then draw an i.i.d. sequence of random measures, {G_j}, j = 1, 2, ..., from the above process, this defines a Hierarchical Dirichlet Process (HDP). If each G_j is associated with the hidden states of an HMM, π is the infinite matrix whose entry π_{jj′} is the j′th mass associated with the jth random measure, and T_j is the sum of row j, then we obtain the prior for the HDP-HMM, where

    p(z_t | z_{t−1}, π) = π̃_{z_{t−1} z_t} = π_{jj′} / T_j    for j = z_{t−1}, j′ = z_t.    (7)

2.2. Promoting "Local" Transitions

In the HDP prior, the rows of the transition matrix are conditionally independent. We wish to relax this assumption, to incorporate possible prior knowledge that certain pairs of states are "nearby" in some sense and thus more likely than others to produce large transition weights between them (in both directions); that is, transitions are likely to be "local". We accomplish this by associating each latent state j with a location ℓ_j in some space Ω, introducing a "similarity function" φ : Ω × Ω → (0, 1], and scaling each element π_{jj′} by φ_{jj′} = φ(ℓ_j, ℓ_{j′}). For example, we might wish to define a (possibly asymmetric) divergence function d : Ω × Ω → [0, ∞) and set φ(ℓ_j, ℓ_{j′}) = exp{−d(ℓ_j, ℓ_{j′})}, so that transitions are less likely the farther apart two states are. By setting φ ≡ 1, we obtain the standard HDP-HMM.

The DILN-HMM (Zhu et al., 2016) employs a similar rescaling of transition probabilities via an exponentiated Gaussian process, following Paisley et al. (2012), but there the scaling function must be positive semi-definite, and in particular symmetric, whereas in the HDP-HMM-LT, φ need only take values in (0, 1]. Moreover, the DILN-HMM does not allow the scales to be tied to other state parameters, and hence encode an independent notion of similarity.

Letting ℓ = (ℓ_1, ℓ_2, ...), we can replace (6) for j ≥ 1 by

    π_{jj′} | β, ℓ ~ Gamma(αβ_{j′}, 1),    T_j = Σ_{j′=1}^∞ π_{jj′} φ_{jj′},
    π̃_{jj′} = π_{jj′} φ_{jj′} / T_j,    p(z_t | z_{t−1}, π, ℓ) = π̃_{z_{t−1} z_t}.    (8)
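As an illustration of (8), the sketch below assembles the unnormalized rate matrix π, a similarity matrix φ computed from state locations, and the effective transition matrix π̃ obtained by elementwise rescaling and row renormalization. The finite truncation, the Laplacian similarity on an L1 distance, and all names are illustrative assumptions; setting λ = 0 gives φ ≡ 1 and recovers the ordinary HDP-HMM rows.

```python
import numpy as np

rng = np.random.default_rng(1)

def lt_transition_matrix(beta, locations, alpha=5.0, lam=1.0):
    """Effective transition matrix of Eq. (8) at a truncation of J states.

    beta      : (J,) top-level weights
    locations : (J, D) state locations ell_j
    lam       : decay rate of the (illustrative) Laplacian similarity
    """
    J = len(beta)
    # pi_{jj'} ~ Gamma(alpha * beta_{j'}, 1), independently over j and j'
    pi = rng.gamma(np.tile(alpha * beta, (J, 1)), 1.0)
    # phi_{jj'} = exp(-lam * d(ell_j, ell_j')) with an L1 (Hamming-like) distance
    d = np.abs(locations[:, None, :] - locations[None, :, :]).sum(axis=-1)
    phi = np.exp(-lam * d)
    rates = pi * phi                               # successful-jump rates pi * phi
    T_row = rates.sum(axis=1, keepdims=True)       # T_j of Eq. (8)
    return rates / T_row, pi, phi

# example: ten states whose locations are binary vectors of length four
J, D = 10, 4
beta = np.full(J, 1.0 / J)
locs = rng.integers(0, 2, size=(J, D)).astype(float)
pi_tilde, pi, phi = lt_transition_matrix(beta, locs)
assert np.allclose(pi_tilde.sum(axis=1), 1.0)
```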
Since the φ_{jj′} are positive and bounded above by 1,

    0 < π_{j1} φ_{j1} ≤ T_j ≤ Σ_{j′} π_{jj′} < ∞    (9)

almost surely, where the last inequality carries over from the original HDP. The prior means of the unnormalized transition distributions, π_j, are then proportional (for each j) to the elementwise product αβφ_j, where φ_j = (φ_{j1}, φ_{j2}, ...).

The distribution of the latent state sequence z given π and ℓ is now

    p(z | π, ℓ) = Π_{t=1}^T π_{z_{t−1} z_t} φ_{z_{t−1} z_t} T_{z_{t−1}}^{−1} = Π_{j=1}^∞ T_j^{−n_{j·}} Π_{j′=1}^∞ π_{jj′}^{n_{jj′}} φ_{jj′}^{n_{jj′}},    (10)

where n_{jj′} = Σ_{t=1}^T I(z_{t−1} = j, z_t = j′) is the number of transitions from state j to state j′ in the sequence z, and n_{j·} = Σ_{j′} n_{jj′} is the total number of visits to state j. Since T_j is a sum over products of π_{jj′} and φ_{jj′} terms, the posterior for π is no longer a DP. However, conditional conjugacy can be restored by a data-augmentation process with a natural interpretation, which is described next.

2.3. The HDP-HMM-LT as the Marginalization of a Markov Jump Process with "Failed" Transitions

In this section, we define a generative process that we call the Markov Jump Process with Failed Transitions (MJP-FT), from which we obtain the HDP-HMM-LT by marginalizing over some of the variables. By reinstating these auxiliary variables, we obtain a simple Gibbs sampling algorithm over the full MJP-FT, which can be used to sample from the marginal posterior of the variables used by the HDP-HMM-LT.

Let β, π, ℓ and T_j, j = 1, 2, ..., be defined as in the last section. Consider a continuous-time Markov Process over the states j = 1, 2, ..., and suppose that if the process makes a jump to state z_t at time τ_t, the next jump, which is to state z_{t+1}, occurs at time τ_t + ũ_t, where ũ_t ~ Exp(Σ_{j′} π_{jj′}), and p(z_{t+1} = j′ | z_t = j) ∝ π_{jj′}, independent of ũ_t. Note that in this formulation, unlike in standard formulations of Markov Jump Processes, we are assuming that self-jumps are possible.

If we only observe the jump sequence z and not the holding times ũ_t, this is an ordinary Markov chain with transition matrix row-proportional to π. If we do not observe the jumps directly, but instead an observation is generated once per jump from a distribution that depends on the state being jumped to, then we have an ordinary HMM whose transition matrix is obtained by normalizing π; that is, we have the HDP-HMM.

We modify this process as follows. Suppose each jump attempt from state j to state j′ has probability (1 − φ_{jj′}) of failing, in which case no transition occurs and no observation is generated. Assuming independent failures, the rates of successful and failed jumps from j to j′ are π_{jj′} φ_{jj′} and π_{jj′}(1 − φ_{jj′}), respectively. The probability that the first successful jump is to state j′ (that is, that z_{t+1} = j′) is proportional to the rate of successful jump attempts to j′, which is π_{jj′} φ_{jj′}. Conditioned on z_t, the holding time, ũ_t, is independent of z_{t+1} and is distributed as Exp(T_{z_t}). We denote the total time spent in state j by u_j = Σ_{t: z_t = j} ũ_t, where, as a sum of i.i.d. Exponentials,

    u_j | z, π, θ ~ Gamma(n_{j·}, T_j)  ind.    (11)

During this period there will be q_{jj′} failed attempts to jump to state j′, where q_{jj′} ~ Poisson(u_j π_{jj′}(1 − φ_{jj′})) are independent. This data augmentation bears some conceptual similarity to the Geometrically distributed ρ auxiliary variables introduced to the HDP-HSMM (Johnson & Willsky, 2013) to restore conditional conjugacy.

However, there are key differences: first, ρ measures how many steps the chain would have remained in state j under Markovian dynamics, whereas our u represents putative continuous holding times between transitions, and second, ρ allows for the restoration of a zeroed-out entry in each row, whereas u allows us to work with unnormalized π entries, avoiding the need to restore zeroed-out entries in the HSMM-LT.

Incorporating u = {u_j} and Q = {q_{jj′}} as augmented data simplifies the likelihood for π, yielding

    p(z, u, Q | π) = p(z | π) p(u | z, π) p(Q | u, π),    (12)

where dependence on ℓ has been omitted for conciseness. After grouping terms and omitting terms that do not depend on π, this is proportional (as a function of π) to

    Π_j Π_{j′} π_{jj′}^{n_{jj′} + q_{jj′}} φ_{jj′}^{n_{jj′}} (1 − φ_{jj′})^{q_{jj′}} e^{−π_{jj′} u_j}.    (13)

Conveniently, the T_j have canceled, and the exponential terms involving π_{jj′} and φ_{jj′} in the Gamma and Poisson distributions of u_j and q_{jj′} combine so that φ_{jj′} vanishes from the exponential factor.

2.4. Sticky and Semi-Markov Generalizations

We note that the local transition property of the HDP-HMM-LT can be combined with the Sticky property of the Sticky HDP-HMM (Fox et al., 2008), or with the non-geometric duration distributions of the HDP-HSMM (Johnson & Willsky, 2013), to add additional prior weight on self-transitions. In the former case, no changes to inference are needed; one can simply add the extra mass κ to the shape parameter of the Gamma prior on the π_{jj}, and employ the same auxiliary variable method used by Fox et al. to distinguish "Sticky" from "regular" self-transitions. For the semi-Markov case, we can fix the diagonal elements of π to zero, allow D_t observations to be emitted i.i.d. according to a state-specific duration distribution, and sample the latent state sequence using a suitable semi-Markov message passing algorithm (Johnson & Willsky, 2013). Inference for the φ matrix is not affected, since the diagonal elements are assumed to be 1. Unlike in the original representation of the HDP-HSMM, no further data augmentation is needed, as the (continuous) durations u already account for the normalization of π.

2.5. Obtaining the Factorial HMM as a Limiting Case

One setting in which a local transition property is desirable is the case where the latent states encode multiple hidden features at time t as a vector of categories. Such problems are often modeled using factorial HMMs (Ghahramani et al., 1997). In fact, the HDP-HMM-LT yields the factorial HMM in the limit as α, γ → ∞, which fixes each row of π to be uniform with probability 1, so that the dynamics are controlled entirely by φ. If A^{(d)} is the transition matrix for chain d, then setting φ(ℓ_j, ℓ_{j′}) = exp{−d(ℓ_j, ℓ_{j′})} with asymmetric "divergences" d(ℓ_j, ℓ_{j′}) = −Σ_d log(A^{(d)}_{ℓ_{jd}, ℓ_{j′d}}) yields the factorial transition model.

2.6. An Infinite Factorial HDP-HMM-LT

Nonparametric extensions of the factorial HMM, such as the infinite factorial hidden Markov model (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), have been developed in recent years by making use of the Indian Buffet Process (Ghahramani & Griffiths, 2005) as a state prior. It would be conceptually straightforward to combine the IBP state prior with the similarity bias of the LT model, provided the chosen similarity function is uniformly bounded above on the space of infinite-length binary vectors (for example, take φ(u, v) to be the exponentiated negative Hamming distance between u and v). Since the number of differences between two draws from the IBP is finite with probability 1, this yields a reasonable similarity metric.

3. Inference

We develop a Gibbs sampling algorithm based on the MJP-FT representation described in Sec. 2.3, augmenting the data with the duration variables u and the failed jump attempt count matrix, Q, as well as additional auxiliary variables which we will define below. In this representation the transition matrix is not represented directly, but is a deterministic function of the unscaled transition "rate" matrix, π, and the similarity matrix, φ. The full set of variables is partitioned into blocks: {γ, α, β, π}, {z, u, Q, Λ}, {θ, ℓ}, and {ξ}, where Λ represents a set of auxiliary variables that will be introduced below, θ represents the emission parameters (which may be further blocked depending on the specific choice of model), and ξ represents additional parameters such as any free parameters of the similarity function, φ, and any hyperparameters of the emission distribution.

3.1. Sampling Transition Parameters and Hyperparameters

The joint posterior over γ, α, β and π given the augmented data D = (z, u, Q, Λ) factors as

    p(γ, α, β, π | D) = p(γ | D) p(α | D) p(β | γ, D) p(π | α, β, D).    (14)

We describe these four factors in reverse order.

Sampling π.  Having used data augmentation to simplify the likelihood for π to the factored conjugate form in (13), the individual π_{jj′} are a posteriori independent, distributed as Gamma(αβ_{j′} + n_{jj′} + q_{jj′}, 1 + u_j).
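The augmentation of Sec. 2.3 and the conjugate update just described can be combined into a single sweep, sketched below at a fixed truncation J: given the current z, π and φ, draw the holding times u and failed-attempt counts Q, then resample π from the Gamma posterior implied by (13). The handling of unvisited states and all function names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def transition_counts(z, J):
    """n_{jj'}: number of j -> j' transitions in the state sequence z."""
    n = np.zeros((J, J), dtype=int)
    np.add.at(n, (z[:-1], z[1:]), 1)
    return n

def gibbs_sweep_transitions(z, pi, phi, alpha, beta):
    """One conjugate sweep: draw (u, Q) given (z, pi, phi), then pi given (u, Q)."""
    J = pi.shape[0]
    n = transition_counts(z, J)
    n_row = n.sum(axis=1)                              # n_{j.}
    T_row = (pi * phi).sum(axis=1)                     # T_j of Eq. (8)
    # Eq. (11): total holding time in each visited state (set to 0 for unvisited states)
    u = np.where(n_row > 0,
                 rng.gamma(np.maximum(n_row, 1e-12), 1.0 / T_row),
                 0.0)
    # failed attempts: q_{jj'} ~ Poisson(u_j * pi_{jj'} * (1 - phi_{jj'}))
    Q = rng.poisson(u[:, None] * pi * (1.0 - phi))
    # Eq. (13) gives pi_{jj'} | ... ~ Gamma(alpha*beta_{j'} + n_{jj'} + q_{jj'}, rate 1 + u_j)
    pi_new = rng.gamma(alpha * beta[None, :] + n + Q, 1.0 / (1.0 + u[:, None]))
    return pi_new, u, Q
```

Note that NumPy's Gamma sampler is parameterized by scale, so the rate 1 + u_j appears as 1 / (1 + u_j).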

Sampling β.  To enable joint sampling of z, we employ a weak limit approximation to the HDP (Johnson & Willsky, 2013), approximating the stick-breaking process for β using a finite Dirichlet distribution with J components, where J is larger than we expect to need. Due to the product-of-Gammas form, we can integrate out π analytically to obtain the marginal likelihood:

    p(β | γ) = Γ(γ) / Γ(γ/J)^J  Π_{j=1}^J β_j^{γ/J − 1},    (15)

    p(D | β, α) ∝ Π_{j=1}^J (1 + u_j)^{−α} Π_{j′} Γ(αβ_{j′} + n_{jj′} + q_{jj′}) / Γ(αβ_{j′}),

where we have used the fact that the β_j sum to 1 to pull out terms of the form (1 + u_j)^{−αβ_{j′}} from the inner product in the likelihood. Following Teh et al. (2006), we can introduce auxiliary variables M = {m_{jj′}}, with

    p(m_{jj′} | β_{j′}, α, D) ∝ s_{n_{jj′} + q_{jj′}, m_{jj′}} α^{m_{jj′}} β_{j′}^{m_{jj′}}    (16)

for integer m_{jj′} ranging between 0 and n_{jj′} + q_{jj′}, where s_{n,m} is an unsigned Stirling number of the first kind. The normalizing constant in this distribution cancels the ratio of Gamma functions in the β likelihood, so, letting m_{·j′} = Σ_j m_{jj′} and m_{··} = Σ_{j′} m_{·j′}, the posterior for the truncated β is a Dirichlet distribution whose jth mass parameter is γ/J + m_{·j}.
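The distribution in (16) is that of the number of occupied tables when n_{jj′} + q_{jj′} customers enter a Chinese restaurant process with concentration αβ_{j′}, so each m_{jj′} can be drawn with a short sequence of Bernoulli draws, after which the truncated β is resampled from its Dirichlet posterior. A sketch under these assumptions (function names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_crt(count, conc):
    """Draw from Eq. (16): the number of occupied tables when `count` customers
    enter a Chinese restaurant process with concentration `conc`."""
    if count == 0:
        return 0
    i = np.arange(count)
    return int((rng.random(count) < conc / (conc + i)).sum())

def resample_beta(n, Q, alpha, beta, gamma):
    """Draw M = {m_{jj'}} given the counts, then beta | M ~ Dirichlet(gamma/J + m_{.j})."""
    J = n.shape[0]
    M = np.zeros((J, J), dtype=int)
    for j in range(J):
        for jp in range(J):
            M[j, jp] = sample_crt(n[j, jp] + Q[j, jp], alpha * beta[jp])
    m_col = M.sum(axis=0)                      # m_{.j'}
    beta_new = rng.dirichlet(gamma / J + m_col)
    return beta_new, M
```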

Sampling Concentration Parameters.  Incorporating M into D, we can integrate out β to obtain

    p(D | α, γ) ∝ α^{m_{··}} e^{−[Σ_{j′′} log(1 + u_{j′′})] α} × Γ(γ) / Γ(γ + m_{··})  Π_j Γ(γ/J + m_{·j}) / Γ(γ/J).    (17)

Assuming that α and γ have Gamma priors with shape and rate parameters a_α, b_α and a_γ, b_γ, then

    α | D ~ Gamma(a_α + m_{··}, b_α + Σ_j log(1 + u_j)).    (18)

To simplify the likelihood for γ, we can introduce a final set of auxiliary variables, r = (r_1, ..., r_J), r_{j′} ∈ {0, ..., m_{·j′}}, and w ∈ (0, 1), with the following distributions:

    p(r_{j′} = r | m_{·j′}, γ) ∝ s(m_{·j′}, r) (γ/J)^r    (19)
    p(w | m_{··}, γ) ∝ w^{γ−1} (1 − w)^{m_{··}−1}.    (20)

The normalizing constants are ratios of Gamma functions, which cancel those in (17), so that

    γ | D, r, w ~ Gamma(a_γ + r_·, b_γ − log(w)).    (21)

3.2. Sampling z and the auxiliary variables

We sample the hidden state sequence, z, jointly with the auxiliary variables, which consist of u, Q, M, r and w. The joint conditional distribution of these variables is defined directly by the generative model:

    p(D) = p(z) p(u | z) p(Q | u) p(M | z, Q) p(r | M) p(w | M).

Since we are conditioning on the transition matrix, we can sample the entire sequence z jointly with the forward-backward algorithm, as in an ordinary HMM. Since we are sampling the labels jointly, this step requires O(TJ²) computation per iteration, which is the bottleneck of the inference algorithm for reasonably large T or J (other updates are constant in T or in J). Having done this, we can sample u, Q, M, r and w from their forward distributions. It is also possible to employ a variant on beam sampling (Van Gael et al., 2008) to speed up each iteration, at the cost of slower mixing, but we did not use this variant here.

3.3. Sampling state and emission parameters

Depending on the application, the locations ℓ may or may not depend on the emission parameters, θ. If not, sampling θ conditional on z is unchanged from the HDP-HMM. There is no general-purpose method for sampling ℓ, or for sampling θ in the dependent case, due to the dependence on the form of φ and on the emission model, but specific instances are illustrated in the experiments below.
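The joint update of z in Sec. 3.2 is a standard forward-filter, backward-sample pass through the chain using the effective transition matrix π̃. The sketch below assumes a precomputed T × J matrix of per-state log likelihoods standing in for the application-specific emission model; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(4)

def ffbs(pi_tilde, pi0, loglik):
    """Forward-filter, backward-sample for z (the O(T J^2) step of Sec. 3.2).

    pi_tilde : (J, J) effective transition matrix (rows sum to 1)
    pi0      : (J,) initial distribution
    loglik   : (T, J) log p(y_t | state j) under the current emission parameters
    """
    T, J = loglik.shape
    logA = np.log(pi_tilde)
    log_alpha = np.empty((T, J))
    log_alpha[0] = np.log(pi0) + loglik[0]
    for t in range(1, T):
        m = log_alpha[t - 1][:, None] + logA              # (from, to)
        log_alpha[t] = loglik[t] + np.logaddexp.reduce(m, axis=0)
    z = np.empty(T, dtype=int)
    w = np.exp(log_alpha[-1] - np.logaddexp.reduce(log_alpha[-1]))
    z[-1] = rng.choice(J, p=w / w.sum())
    for t in range(T - 2, -1, -1):
        logw = log_alpha[t] + logA[:, z[t + 1]]
        w = np.exp(logw - np.logaddexp.reduce(logw))
        z[t] = rng.choice(J, p=w / w.sum())
    return z
```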

4. Experiments

The parameter space for the hidden states, the associated prior H on θ, and the similarity function φ, is application-specific; we consider here two cases. The first is a speaker-diarization task, where each state consists of a finite D-dimensional binary vector whose entries indicate which speakers are currently speaking. In this experiment, the state vectors both determine the pairwise similarities and partially determine the emission distributions via a linear-Gaussian model. In the second experiment, the data consists of Bach chorales, and the latent states can be thought of as harmonic contexts. There, the components of the states that govern similarities are modeled as independent of the emission distributions, which are categorical distributions over four-voice chords.

4.1. Cocktail Party

The Data.  The data was constructed using audio signals collected from the PASCAL 1st Speech Separation Challenge (http://laslab.org/SpeechSeparationChallenge/). The underlying signal consisted of D = 16 speaker channels recorded at each of T = 2000 time steps, with the resulting T × D signal matrix, denoted by θ*, mapped to K = 12 microphone channels via a weight matrix, W. The 16 speakers were grouped into 4 conversational groups of 4, where speakers within a conversation took turns speaking (see Fig. 2). In such a task, there are naively 2^D possible states (here, 65536). However, due to the conversational grouping, if at most one speaker in a conversation is speaking at any given time, the state space is constrained, with only Π_c (s_c + 1) states possible, where s_c is the number of speakers in conversation c (in this case s_c ≡ 4, for a total of 625 possible states).

Each "turn" within a conversation consisted of a single sentence (average duration ∼3s), and turn orders within a conversation were randomly generated, with random pauses distributed as N(1/4 s, (1/4 s)²) inserted between sentences. Every time a speaker has a turn, the sentence is drawn randomly from the 500 sentences uttered by that speaker in the data. The conversations continued for 40s, and the signal was down-sampled to length 2000. The 'on' portions of each speaker's signal were normalized to have amplitudes with mean 1 and standard deviation 0.5. An additional column of 1s was added to the speaker signal matrix, θ*, representing background noise. The resulting signal matrix, θ*, was thus 2000 × 17 and the weight matrix, W, was 17 × 12. Following Gael et al. (2009) and Valera et al. (2015), the weights were drawn independently from a Unif(0, 1) distribution, and independent N(0, 0.3²) noise was added to each entry of the observation matrix.

The Model.  The latent states, θ_j, are the D-dimensional binary vectors whose dth entry indicates whether or not speaker d is speaking. The locations ℓ_j are identified with the binary vectors, ℓ_j := θ_j. We use a Laplacian similarity function on Hamming distance, d_0, so that φ_{jj′} := exp(−λ d_0(ℓ_j, ℓ_{j′})), λ ≥ 0. The emission model is linear-Gaussian as in the data, with (D + 1) × K weight matrix W and T × (D + 1) signal matrix θ* whose tth row is θ*_t := (1, θ_{z_t}^T), so that y_t | z ~ N(W^T θ*_t, Σ). For the experiments discussed here, we assume that Σ is independent of j, but this assumption is easily relaxed if appropriate.

For finite-length binary vector states, the set of possible states is finite, and so it may seem that a nonparametric model is unnecessary. However, if D is reasonably large, most of the 2^D possible states are likely vanishingly unlikely (and the number of observations may well be less than 2^D), and so we would like to encourage the selection of a sparse set of states. Moreover, there could be more than one state with the same emission parameters, but with different transition dynamics. Next we describe the additional inference steps needed for this version of the model.

Sampling θ / ℓ.  Since θ_j and ℓ_j are identified, influencing both the transition matrix and the emission distributions, both the state sequence z and the observation matrix Y are used in the update. We put independent Beta-Bernoulli priors on each coordinate of θ, and Gibbs sample each coordinate θ_{jd} conditioned on all the others and the coordinate-wise prior means, {µ_d}, which we sample in turn conditioned on θ. Details are in the supplement.

Sampling λ.  The λ parameter of the similarity function governs the connection between ℓ and φ. Substituting the definition of φ into (13) yields

    p(z, Q | ℓ, λ) ∝ Π_j Π_{j′} e^{−λ d_{jj′} n_{jj′}} (1 − e^{−λ d_{jj′}})^{q_{jj′}}.    (22)

We put an Exp(b_λ) prior on λ, which yields the posterior density

    p(λ | z, Q, ℓ) ∝ e^{−(b_λ + Σ_j Σ_{j′} d_{jj′} n_{jj′}) λ}  Π_j Π_{j′} (1 − e^{−λ d_{jj′}})^{q_{jj′}}.    (23)

This density is log-concave, and so we use Adaptive Rejection Sampling (Gilks & Wild, 1992) to sample from it.
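For reference, the log of the density in (23) is cheap to evaluate; the sketch below computes it for a given λ. The distance matrix, the counts, and the prior rate b_λ are assumed given, and the commented grid evaluation at the end is only an illustrative stand-in for the adaptive rejection sampler actually used.

```python
import numpy as np

def lambda_log_post(lam, d, n, Q, b_lambda=1.0):
    """Unnormalized log p(lambda | z, Q, ell) from Eq. (23).

    d : (J, J) pairwise Hamming distances between state locations
    n : (J, J) transition counts,  Q : (J, J) failed-attempt counts
    b_lambda : rate of the Exp prior on lambda (an assumed value)
    """
    linear = -(b_lambda + (d * n).sum()) * lam     # pairs with d = 0 contribute nothing here
    mask = d > 0                                   # pairs with d = 0 have phi = 1 and cannot fail
    log1m = np.log1p(-np.exp(-lam * d[mask]))      # log(1 - e^{-lam d})
    return linear + (Q[mask] * log1m).sum()

# crude illustration: evaluate on a grid (the paper instead samples the
# log-concave density with adaptive rejection sampling)
# grid = np.linspace(1e-3, 5.0, 500)
# curve = [lambda_log_post(l, d, n, Q) for l in grid]
```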
Sampling W and Σ.  Conditioned on Y and θ*, W and Σ can be sampled as in Bayesian linear regression. If each column of W has a multivariate Normal prior, then the columns are a posteriori independent multivariate Normals. For the experiments reported here, we fix W to its ground truth value so that θ* can be compared directly with the ground truth signal matrix, and we constrain Σ to be diagonal, with Inverse Gamma priors on the variances, resulting in conjugate updates.

Results.  We attempted to infer the binary speaker matrices using five models: (1) a binary-state Factorial HMM (Ghahramani et al., 1997), where individual binary speaker sequences are modeled as independent, (2) an ordinary HDP-HMM without local transitions (Teh et al., 2006), where the latent states are binary vectors, (3) a Sticky HDP-HMM (Fox et al., 2008), (4) our HDP-HMM-LT model, and (5) a model that combines the Sticky and LT properties. (We attempted to add a comparison to the DILN-HMM (Zhu et al., 2016) as well, but code could not be obtained, and the paper did not provide enough detail to reproduce their inference algorithm.) For all models, all concentration and noise precision parameters are given Gamma(0.1, 0.1) priors. For the Sticky models, the ratio κ/(α + κ) is given a Unif(0, 1) prior. We evaluated the models at each iteration using both the Hamming distance between inferred and ground truth state matrices and the F1 score. We also plot the inferred decay rate λ, and the number of states used by the LT and Sticky-LT models. The results for the five models are in Fig. 1. In Fig. 2, we plot the ground truth state matrix against the average state matrix, η*, averaged over runs and post-burn-in iterations.

Figure 1. Top: F1 score for inferred relative to ground truth binary speaker matrices on cocktail party data, evaluated every 50th Gibbs iteration after the first 2000, aggregating across 5 runs of each model. Middle: Inferred λ for the LT and Sticky-LT models by Gibbs iteration, averaged over 5 runs. Bottom: Number of states used, n_·, by each model in the training set. Error bands are 99% confidence intervals of the mean per iteration.

Figure 2. Binary speaker matrices for the cocktail data, with time on the horizontal axis and speaker on the vertical axis. White is 1, black is 0. The ground truth matrix is at the top, followed by the inferred speaker matrices for the Sticky HDP-HMM-LT, HDP-HMM-LT, binary factorial, Sticky HDP-HMM, and "vanilla" HDP-HMM. All inferred matrices are averaged over 5 runs of 5000 Gibbs iterations each, with the first 2000 iterations discarded as burn-in.

The LT and Sticky-LT models outperform the others, while the regular Sticky model exhibits only a small advantage over the vanilla HDP-HMM. Both LT variants converge on a non-negligible λ value of about 1.6 (see Fig. 1), suggesting that the local transition structure explains the data well. The LT models also use more states than the non-LT models, perhaps owing to the fact that the weaker transition prior of the non-LT model is more likely to explain nearby similar observations as a single persisting state, whereas the LT model places a higher probability on transitioning to a new state with a similar latent vector.

4.2. Synthetic Data Without Local Transitions

We generated data directly from the ordinary HDP-HMM used in the cocktail experiment as a sanity check, to examine the performance of the LT model in the absence of a similarity bias. The results are in Fig. 3. When the λ parameter is large, the LT model has worse performance than the non-LT model on this data; however, the λ parameter settles near zero as the model learns that local transitions are not more probable. When λ = 0, the HDP-HMM-LT is an ordinary HDP-HMM. The LT model does not make entirely the same inferences as the non-LT model, however; in particular, the α concentration parameter is larger. To some extent, α and λ trade off: sparsity of the transition matrix can be achieved either by beginning with a sparse rate matrix prior to rescaling (α small), or by beginning with a less sparse rate matrix which becomes sparser through rescaling (larger α and non-zero λ).

4.3. Bach Chorales

To test a version of the HDP-HMM-LT model in which the components of the latent state governing similarity are unrelated to the emission distributions, we used our model to do unsupervised "grammar" learning from a corpus of Bach chorales. The data was a corpus of 217 four-voice major-key chorales by J.S. Bach from music21 (http://web.mit.edu/music21), 200 of which were randomly selected as a training set, with the other 17 used as a test set to evaluate surprisal (marginal log likelihood per observation) by the trained models. All chorales were transposed to C major, and each distinct four-voice chord (with voices ordered) was encoded as a single integer. In total there were 3307 distinct chord types and 20401 chord tokens in the 217 chorales, with 3165 types and 18818 tokens in the 200 training chorales, and 143 chord types that were unique to the test set.

Modifications to Model and Inference.  Since the chords were encoded as integers, the emission distribution for each state is Cat(θ_j). We use a symmetric Dirichlet prior for each θ_j, resulting in conjugate updates to θ conditioned on the latent state sequence, z. In this experiment, the locations, ℓ_j, are independent of the θ_j, with N(0, I) priors. We use a Gaussian similarity function, φ_{jj′} := exp{−λ d_2(ℓ_j, ℓ_{j′})²}, where d_2 is Euclidean distance. Since the latent locations are continuous, we use a Hamiltonian Monte Carlo (HMC) update (Duane et al., 1987; Neal et al., 2011) to update the ℓ_j simultaneously, conditioned on z and π (see the supplement for details).


Figure 4. Training set and test set log marginal likelihoods for Bach chorale data under the four HDP-based models: HDP-HMM-LT, HDP-HMM, Sticky HDP-HMM, and Sticky HDP-HMM-LT.

Figure 3. Top: F1 score for inferred relative to ground truth binary speaker matrices on synthetic data generated from the vanilla HDP-HMM model. Middle: Learned similarity parameter, λ, for the LT model by Gibbs iteration, averaged over 5 runs. Bottom: Number of states used, n_·, by each model in the training set. Error bands are 99% confidence intervals of the mean per iteration. The first 100 iterations are omitted.

Results.  We ran 5 Gibbs chains for 10,000 iterations each using the HDP-HMM-LT, Sticky-HDP-HMM-LT, HDP-HMM and Sticky-HDP-HMM models on the 200 training chorales, which were modeled as conditionally independent of one another. We evaluated the marginal log likelihood on the 17 test chorales (integrating out z) at every 50th iteration. The training and test log likelihoods are in Fig. 4. Although the LT model does not achieve as close a fit to the training data, its generalization performance is better, suggesting that the vanilla HDP-HMM is overfitting. This is perhaps counterintuitive, since the LT model is more flexible, and might be expected to be more prone to overfitting. However, the similarity bias induces greater information sharing across parameters, as in a hierarchical model: instead of each entry of the transition matrix being informed mainly by transitions directly involving the corresponding states, it is informed to some extent by all transitions, as they all inform the similarity structure.

5. Discussion

We have defined a new probabilistic model, the Hierarchical Dirichlet Process Hidden Markov Model with Local Transitions (HDP-HMM-LT), which generalizes the HDP-HMM by allowing state space geometry to be represented via a similarity kernel, making transitions between "nearby" pairs of states ("local" transitions) more likely a priori. By introducing an augmented data representation, which we call the Markov Jump Process with Failed Transitions (MJP-FT), we obtain a Gibbs sampling algorithm that simplifies inference in both the LT and ordinary HDP-HMM. When multiple latent chains are interdependent, as in speaker diarization, the HDP-HMM-LT model combines the HDP-HMM's capacity to discover a small set of joint states with the Factorial HMM's ability to encode the property that most transitions involve a small number of chains. The HDP-HMM-LT outperforms both, as well as outperforming the Sticky HDP-HMM, on a speaker diarization task in which speakers form conversational groups. Despite the addition of the similarity kernel, the HDP-HMM-LT is able to suppress its local transition prior when the data does not support it, achieving identical performance to the HDP-HMM on data generated directly from the latter.

The local transition property is particularly clear when transitions occur at different times for different latent features, as with binary vector-valued states in the cocktail party setting, but the model can be used with any state space equipped with a suitable similarity kernel. Similarities need not be defined in terms of emission parameters; state "locations" can be represented and inferred separately, which we demonstrate using Bach chorale data. There, the LT model achieves better predictive performance on a held-out test set, while the ordinary HDP-HMM overfits the training set: the LT property here acts to encourage a concise harmonic representation in which chord contexts are arranged in bidirectional functional relationships.

We focused on fixed-dimension binary vectors for the cocktail party and synthetic data experiments, but it would be straightforward to add the LT property to a model with nonparametric latent states, such as the iFHMM (Gael et al., 2009) and the infinite factorial dynamic model (Valera et al., 2015), both of which use the Indian Buffet Process (IBP) (Ghahramani & Griffiths, 2005) as a state prior. The similarity function used here could be employed without changes: since only finitely many coordinates are non-zero in the IBP, the distance between any two states is finite.

ACKNOWLEDGMENTS

This work was funded in part by DARPA grant W911NF-14-1-0395 under the Big Mechanism Program and DARPA grant W911NF-16-1-0567 under the Communicating with Computers Program.

References

Beal, Matthew J, Ghahramani, Zoubin, and Rasmussen, Carl E. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pp. 577–584, 2001.

Duane, Simon, Kennedy, Anthony D, Pendleton, Brian J, and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Ewens, Warren John. Population genetics theory – the past and the future. In Mathematical and Statistical Developments of Evolutionary Theory, pp. 177–227. Springer, 1990.

Favaro, Stefano, Teh, Yee Whye, et al. MCMC for normalized random measure mixture models. Statistical Science, 28(3):335–359, 2013.

Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209–230, 1973.

Fox, Emily B, Sudderth, Erik B, Jordan, Michael I, and Willsky, Alan S. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pp. 312–319. ACM, 2008.

Gael, Jurgen V, Teh, Yee W, and Ghahramani, Zoubin. The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, pp. 1697–1704, 2009.

Ghahramani, Zoubin and Griffiths, Thomas L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.

Ghahramani, Zoubin, Jordan, Michael I, and Smyth, Padhraic. Factorial hidden Markov models. Machine Learning, 29(2-3):245–273, 1997.

Gilks, Walter R and Wild, Pascal. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pp. 337–348, 1992.

Ishwaran, Hemant and Zarepour, Mahmoud. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.

Johnson, Matthew J and Willsky, Alan S. Bayesian nonparametric hidden semi-Markov models. The Journal of Machine Learning Research, 14(1):673–701, 2013.

Kingman, John. Completely random measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.

Neal, Radford M et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

Paisley, John, Wang, Chong, and Blei, David M. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(4):997–1034, 2012.

Ranganath, Rajesh and Blei, David M. Correlated random measures. Journal of the American Statistical Association, 2016.

Sethuraman, Jayaram. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

Teh, Yee Whye, Jordan, Michael I, Beal, Matthew J, and Blei, David M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

Valera, Isabel, Ruiz, Francisco, Svensson, Lennart, and Perez-Cruz, Fernando. Infinite factorial dynamical model. In Advances in Neural Information Processing Systems, pp. 1657–1665, 2015.

Van Gael, Jurgen, Saatci, Yunus, Teh, Yee Whye, and Ghahramani, Zoubin. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, pp. 1088–1095. ACM, 2008.

Zhu, Hao, Hu, Jinsong, and Leung, Henry. Hidden Markov models with discrete infinite logistic normal distribution priors. In Information Fusion (FUSION), 2016 19th International Conference on, pp. 429–433. IEEE, 2016.