
HMM MIXTURE MODELS WITH APPLICATION TO MUSIC ANALYSIS

Yuting Qi, John William Paisley and Lawrence Carin

Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708-0291

ABSTRACT

A hidden Markov mixture model is developed using a Dirichlet process (DP) prior, to represent the statistics of sequential data for which a single hidden Markov model (HMM) may not be sufficient. The DP prior has an intrinsic clustering property that encourages parameter sharing, naturally revealing the proper number of mixture components. The evaluation of posterior distributions for all model parameters is achieved via a variational Bayes formulation. We focus on exploring music similarities as an important application, highlighting the effectiveness of the HMM mixture model. Experimental results are presented from classical music clips.

Index Terms— Dirichlet Process, HMM mixture, Music, Variational Bayes.

1. INTRODUCTION

Music recognition, including music classification, retrieval, browsing, and recommendation systems, has been of significant recent interest. Correspondingly, ideas from statistical modeling have attracted growing interest in the music-analysis community. For example, Gaussian mixture models have been used to represent the distribution of the MFCCs over all frames of an individual song [1][4]. However, no dynamic behavior of music is taken into account in these works. Since "the brain dynamically links a multitude of short events which cannot always be separated" [2], temporal cues are critical and contain information that should not be ignored. Therefore, music is treated as time-series data, and hidden Markov models (HMMs), which can accurately represent the statistics of sequential data [8], have been introduced to model the overall music in [2][9] and more recently for music genre classification [10][12].

Building a single HMM for a song performs well when the music's "movement pattern" is relatively simple and thus the structure is of modest complexity (e.g., the number of states is few). However, most real music is a complicated signal, which may have more than one "movement pattern" across the entire piece. Therefore an HMM mixture model is proposed in this paper to describe multiple "movement patterns" in music, with each pattern characterized by a single mixture component (an HMM). The work reported here develops an HMM mixture model in a Bayesian setting using a nonparametric Dirichlet process (DP) as a common prior distribution on the parameters of the individual HMMs. It has been proven that the DP is rich enough to model parameters of individual components with arbitrarily high complexity, and flexible enough to fit them well without any assumptions about the functional form of the prior distribution [6][13]. Importantly, the number of mixture components need not be set a priori in the DP HMM mixture model. A variational Bayes [5] approach is considered to perform DP-based mixture modeling for efficient computation. In this paper we focus on HMM mixture models based on discrete observations; our method is applicable to any sequential discrete data set containing multiple underlying patterns.

The remainder of the paper is organized as follows. Section 2 provides an introduction to the Dirichlet process and its application to HMM mixture models. A variational Bayes inference method is developed in Section 3. Section 4 describes the music application as well as experimental results. Section 5 concludes the work.

2. DP-BASED HIDDEN MARKOV MIXTURE MODEL

2.1. Hidden Markov Mixture Model

The hidden Markov mixture model with K* mixture components may be written as

    p(\mathbf{x} \mid a_1, \ldots, a_{K^*}, \Theta_1, \ldots, \Theta_{K^*}) = \sum_{k=1}^{K^*} a_k \, p(\mathbf{x} \mid \Theta_k),    (1)

where x = {x_t}_{t=1,T} is a sequence of observations, p(x|Θ_k) represents the kth HMM component with associated parameters Θ_k, and a_k represents the weight for the kth HMM, with \sum_{k=1}^{K^*} a_k = 1.

We assume a set X = {x_n}_{n=1,N} of N sequences of data. Each data sequence x_n is assumed to be drawn from an associated HMM with parameters Θ_n = {A_n, B_n, π_n}, i.e., x_n ~ H(Θ_n), where H(Θ) represents the HMM. The set of associated parameters {Θ_n}_{n=1,N} are drawn i.i.d. from a shared prior G, i.e., Θ_n | G ~ G. The distribution G is itself drawn from a distribution, in particular a Dirichlet process. The prior G encourages the clustering of the parameters {Θ_n}_{n=1,N}, and each such cluster corresponds to an HMM mixture component in (1). The algorithm automatically determines an appropriate number of mixture components, balancing the DP-generated desire to cluster with the likelihood's desire to choose parameters that match the data X well. This balance between the likelihood and the DP prior is manifested in the posterior density function for the parameters {Θ_n}_{n=1,N}.
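As a concrete illustration of (1) for discrete observations, the following minimal sketch evaluates the mixture likelihood of a symbol sequence with the standard scaled forward algorithm. It is an illustrative sketch, not the authors' implementation; all names (A, B, pi, a) are assumptions for this example.

import numpy as np

def hmm_loglike(x, A, B, pi):
    # Scaled forward algorithm: log p(x | Theta) for a discrete-observation HMM.
    # x: symbol indices (length T); A: (I, I) transitions; B: (I, M) emissions; pi: (I,) initial.
    alpha = pi * B[:, x[0]]
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]
        c = alpha.sum()
        ll += np.log(c)
        alpha = alpha / c
    return ll

def mixture_loglike(x, a, hmms):
    # Log of Eq. (1): log sum_k a_k p(x | Theta_k), computed stably with log-sum-exp.
    logs = np.array([np.log(a_k) + hmm_loglike(x, *theta) for a_k, theta in zip(a, hmms)])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())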


2.2. Dirichlet Process

The Dirichlet process, denoted as DP(α, G_0), is a random measure on measures and is parameterized by a positive scaling parameter α, often termed the "innovation parameter", and a base distribution G_0. Assume we have N random variables {Θ_n}_{n=1,N} distributed according to G, and G itself is a random measure drawn from a Dirichlet process,

    \Theta_n \mid G \sim G, \quad n = 1, \ldots, N,
    G \sim \mathrm{DP}(\alpha, G_0),

where G_0 is the expectation of G, E[G] = G_0. Define Θ^{-n} = {Θ_1, ..., Θ_{n−1}, Θ_{n+1}, ..., Θ_N}, let {Θ*_k}_{k=1,K*} be the distinct values taken by {Θ_n}_{n=1,N}, and let n_k^{-n} be the number of values in Θ^{-n} that equal Θ*_k. Integrating out G, the conditional distribution of Θ_n given Θ^{-n} follows a Pólya urn scheme and has the following form [13]:

    \Theta_n \mid \Theta^{-n}, \alpha, G_0 \sim \frac{1}{\alpha + N - 1}\left(\alpha G_0 + \sum_{k=1}^{K^*} n_k^{-n} \, \delta_{\Theta_k^*}\right),    (2)

where δ_{Θ_i} denotes the distribution concentrated at the point Θ_i.

Equation (2) shows that when considering Θ_n given all other observations Θ^{-n}, this new sample is either drawn from the base distribution G_0, with probability α/(α + N − 1), or is selected from the existing draws Θ*_k according to a multinomial allocation, with probabilities proportional to the existing group sizes n_k^{-n}. Sethuraman [11] provides an explicit characterization of G in terms of a stick-breaking construction,

    G = \sum_{k=1}^{\infty} p_k \, \delta_{\Theta_k^*},    (3)

with

    p_k = v_k \prod_{i=1}^{k-1} (1 - v_i),    (4)

where v_k | α ~ Beta(1, α) and Θ*_k | G_0 ~ G_0. This representation shows that the support of G consists of an infinite set of atoms located at the Θ*_k, drawn independently from G_0. The mixing weight p_k for atom Θ*_k is obtained by successively breaking a unit-length "stick" into an infinite number of pieces [11], with 0 ≤ p_k ≤ 1 and \sum_{k=1}^{\infty} p_k = 1.

2.3. HMM Mixture Models with DP Prior

Given the observed data X = {x_n}_{n=1,N}, each x_n is assumed to be drawn from its own HMM H(Θ_n), parameterized by Θ_n with underlying state sequence s_n. The common prior G on all Θ_n is given by (3). Since G is discrete, different Θ_n may share the same value Θ*_k, taking the value Θ*_k with probability p_k. Introducing an indicator variable c = {c_n}_{n=1,N} and letting c_n = k indicate that Θ_n takes the value of Θ*_k, the hidden Markov mixture model with DP prior can be expressed as

    x_n \mid c_n, \{\Theta_k^*\}_{k=1}^{\infty} \sim H(\Theta_{c_n}^*),
    c_n \mid \mathbf{p} \sim \mathrm{Mult}(\mathbf{p}),
    v_k \mid \alpha \sim \mathrm{Beta}(1, \alpha),
    \Theta_k^* \mid G_0 \sim G_0,    (5)

where p = {p_k}_{k=1,∞} is given by (4) and Mult(p) is the multinomial distribution with parameter p.

Assuming A, B and π are independent of each other, the base distribution G_0 is represented as G_0 = p(A)p(B)p(π). For computational convenience (use of appropriate conjugate priors), we have the following prior distributions:

    p(A \mid u^A) = \prod_{i=1}^{I} \mathrm{Dir}(\{a_{i1}, \ldots, a_{iI}\}; u^A),    (6)
    p(B \mid u^B) = \prod_{i=1}^{I} \mathrm{Dir}(\{b_{i1}, \ldots, b_{iM}\}; u^B),    (7)
    p(\pi \mid u^\pi) = \mathrm{Dir}(\{\pi_1, \ldots, \pi_I\}; u^\pi),    (8)

where u^A = {u_i^A}_{i=1,I}, u^B = {u_m^B}_{m=1,M}, and u^π = {u_i^π}_{i=1,I} are the parameters of the Dirichlet distributions. To learn α from the data, we place a prior distribution on it,

    p(\alpha) = \mathrm{Ga}(\alpha; \gamma_{01}, \gamma_{02}),    (9)

where Ga(α; γ_01, γ_02) is the Gamma distribution with selected parameters γ_01 and γ_02.
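To make the stick-breaking construction (3)-(4) and the generative model (5) concrete, the sketch below samples truncated stick-breaking weights, draws component HMMs from symmetric Dirichlet priors in the spirit of (6)-(8), and generates discrete sequences. It is a didactic sketch under assumed hyperparameter values, not the released code of [7].

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K):
    # Truncated stick-breaking weights p_k, Eqs. (3)-(4); the last stick absorbs the remaining mass.
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0
    rest = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * rest

def sample_hmm(I, M, u=1.0):
    # One HMM atom Theta* = (A, B, pi) from symmetric Dirichlet priors, in the spirit of (6)-(8).
    A = rng.dirichlet(u * np.ones(I), size=I)   # I x I transition matrix
    B = rng.dirichlet(u * np.ones(M), size=I)   # I x M emission matrix
    pi = rng.dirichlet(u * np.ones(I))          # initial-state distribution
    return A, B, pi

def generate(N=20, T=40, alpha=1.0, K=50, I=8, M=64):
    # Generative model (5): pick an atom for each sequence, then run that HMM for T steps.
    p = stick_breaking(alpha, K)
    atoms = [sample_hmm(I, M) for _ in range(K)]
    c = rng.choice(K, size=N, p=p)
    data = []
    for n in range(N):
        A, B, pi = atoms[c[n]]
        s = rng.choice(I, p=pi)
        x = [rng.choice(M, p=B[s])]
        for _ in range(T - 1):
            s = rng.choice(I, p=A[s])
            x.append(rng.choice(M, p=B[s]))
        data.append(np.array(x))
    return data, c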

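Complementary to the stick-breaking view, the Pólya urn scheme underlying (2) can be simulated sequentially. The sketch below tracks only the cluster assignments (the atoms Θ*_k drawn from G_0 are left abstract) and is purely illustrative.

import numpy as np

rng = np.random.default_rng(1)

def polya_urn_assignments(N, alpha):
    # Each new draw joins an existing cluster with probability proportional to its size,
    # or starts a new cluster (a fresh atom from G_0) with probability proportional to alpha.
    counts, labels = [], []
    for n in range(N):
        weights = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            counts.append(1)        # new cluster
        else:
            counts[k] += 1          # existing cluster grows ("rich get richer")
        labels.append(k)
    return labels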

3. VARIATIONAL INFERENCE

Considering computational complexity in the infinite stick-breaking model, in practice we select an appropriate truncation level K (i.e., finite sticks) that leads to a model virtually indistinguishable from the infinite DP model [5]. Since {Θ_n}_{n=1,N} may only take a subset of values from {Θ*_k}_{k=1,K}, the utilized number of mixture components K* may be less than K (and the clustering properties of the DP almost always yield fewer than K mixture components, unless α is very large) [7]. From Bayes' rule, we have

    p(\Phi \mid X, \Psi) = \frac{p(X \mid \Phi)\, p(\Phi \mid \Psi)}{\int p(X \mid \Phi)\, p(\Phi \mid \Psi)\, d\Phi},    (10)

where Φ = {A*, B*, π*, v, α, S, c} are the hidden variables of interest and Ψ = {u^A, u^B, u^π, γ_01, γ_02} are fixed parameters. The integration in the denominator of (10), the marginal likelihood, is generally intractable analytically. Variational methods are thus introduced to seek a distribution q(Φ) that approximates the true posterior distribution p(Φ|X, Ψ). Consider the log marginal likelihood

    \log p(X \mid \Psi) = L(q(\Phi)) + D_{\mathrm{KL}}(q(\Phi) \,\|\, p(\Phi \mid X, \Psi)),    (11)

where

    L(q(\Phi)) = \int q(\Phi) \log \frac{p(X \mid \Phi)\, p(\Phi \mid \Psi)}{q(\Phi)}\, d\Phi \le \log p(X \mid \Psi),    (12)

and D_KL(q||p) is the KL divergence between q and p. The approximation of p(Φ|X, Ψ) by q(Φ) can be achieved by minimizing D_KL(q(Φ)||p(Φ|X, Ψ)), which is equivalent to maximizing L(q(Φ)).

For the HMM mixture model proposed, we assume the factorized form

    q(\Phi) = q(\alpha)\, q(\mathbf{v}) \prod_{k=1}^{K} \left[ q(A_k^*)\, q(B_k^*)\, q(\pi_k^*) \right] \cdot \prod_{n=1}^{N} \prod_{c_n=1}^{K} \left[ q(c_n)\, q(\mathbf{s}_{n c_n}) \right],    (13)

where q(A*_k), q(B*_k), q(π*_k) have the same form as in (6)-(8), respectively, but with different parameters, q(v) = \prod_{k=1}^{K-1} q(v_k) with q(v_k) = Beta(v_k; β_1k, β_2k), and q(α) = Ga(α; γ_1, γ_2). Once we learn the parameters of these variational distributions from the data, we obtain the approximation of p(Φ|X, Ψ) by q(Φ). The joint distribution of Φ and the observations X is given as

    p(X, \Phi \mid \Psi) = p(\alpha)\, p(\mathbf{v} \mid \alpha) \prod_{k=1}^{K} \left[ p(A_k^*)\, p(B_k^*)\, p(\pi_k^*) \right] \cdot \prod_{n=1}^{N} \prod_{c_n=1}^{K} \left[ p(c_n \mid \mathbf{v})\, p(\mathbf{x}_n, \mathbf{s}_{n c_n} \mid A^*, B^*, \pi^*, c_n) \right],    (14)

where the priors p(A*_k), p(B*_k), p(π*_k), and p(α) are given in (6)-(9), respectively, and p(v|α) = \prod_{k=1}^{K-1} p(v_k|α) with p(v_k|α) = Beta(v_k; 1, α).

The term L(q) can be obtained by substituting (13) and (14) into (12). The optimization of the lower bound L(q) is realized by taking functional derivatives with respect to each of the q(·) distributions [3]. The update equations for the variational posteriors can be found in [7] and are omitted here for brevity.

The local maximum of the lower bound L(q) is achieved by iteratively updating the parameters of q(·) according to the update equations. We terminate the algorithm when the change in L(q) is negligibly small. Assuming that the states and the model parameters are independent, and that the model can be evaluated at the mean (or mode) of the variational posterior as suggested in [3], the prediction for a new observation sequence y can be easily obtained.
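Before turning to the experiments, the decomposition (11)-(12) can be checked numerically on a toy discrete latent variable. The short snippet below is illustrative only (the names and numbers are not from the paper); it verifies that the log evidence equals the lower bound plus the KL divergence for an arbitrary q.

import numpy as np

rng = np.random.default_rng(0)

K = 4
joint = 0.1 * rng.random(K)                 # toy unnormalized joint p(x, z) over z = 1,...,K
log_evidence = np.log(joint.sum())          # log p(x)
posterior = joint / joint.sum()             # exact posterior p(z | x)

q = rng.random(K)
q = q / q.sum()                             # an arbitrary variational distribution q(z)

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # L(q), Eq. (12)
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # D_KL(q || p(z|x))

assert np.isclose(elbo + kl, log_evidence)        # Eq. (11)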

4. MUSIC EXPERIMENTS

The music clips are sampled at 22 kHz and we divide each clip into 25 ms non-overlapping frames. A 10-dimensional MFCC feature vector is extracted for each frame and then quantized into discrete symbols with the LBG algorithm. For our experiments, we use sequences of 1 second. This transforms the music into a collection of sequences, with each sequence assumed to originate from an HMM. All data and [7] can be found at http://www.ee.duke.edu/~jwp4/HMMMix.

4.1. Music Similarity Measure

Music similarity is computed based on the distance between the respective HMM mixture models. Let M_g be the learned HMM mixture model for music g, and M_h for music h. We draw a sample set S_g from M_g and S_h from M_h. The distance between any two HMM mixture models is defined as

    D(M_g, M_h) = \frac{1}{2}\left[ L(M_g \mid M_h) + L(M_h \mid M_g) \right],    (15)

where L(M_a | M_b) = log p(S_b | M_a) − log p(S_b | M_b) is a measure of how well model M_a matches observations generated by model M_b, relative to how well M_b matches the observations generated by itself. The similarity Sim(g, h) of music g and music h is defined by a kernel function as

    \mathrm{Sim}(g, h) = \exp\!\left( -\frac{|D(M_g, M_h)|^2}{\sigma^2} \right),    (16)

where σ is a fixed parameter; we note that σ does not change the ordering of the similarities.

4.2. Results

We explore music similarity within the classical genre with two experiments. For comparison, we also model each piece of music as a DP Gaussian mixture model (DP GMM) [5][13], where the 10-dimensional MFCC feature vector of a frame corresponds to one data point in the feature space.

For our first experiment, we choose four 3-minute violin concerto clips from two different composers. Clips 1 and 2 are from Bach and are considered similar; clips 3 and 4 are from Stravinsky and are also considered similar. The two pairs are considered different from each other. All four music clips are played with the same instruments, but their styles vary, indicating a high overlap in feature space but significantly different movement. We built an HMM mixture model for each clip with the truncation level set to K = 50 and the number of states to I = 8. The truncation level of the DP GMM was set to 50 as well. Fig. 1 shows the computed similarity between each pair of clips for both HMM mixture and GMM modeling using a Hinton diagram, in which the size of a block is proportional to the value of the corresponding matrix element. HMM mixture modeling produces results that fit with our intuition. However, our GMM results do not catch the connection between clips 3 and 4 and, proportionally, do not contrast clips 1 and 2 from 3 and 4 as well. The improved similarity recognition can be attributed to the temporal consideration given by the HMM mixture model while the feature spaces are highly overlapped.

[Fig. 1. Hinton diagram for the similarity matrix for 4 violin clips: (a) by DP HMM mixture models; (b) by DP GMMs. Clips: 1: Bach, Violin Concerto BWV 1041, Mvt I; 2: Bach, Violin Concerto BWV 1042, Mvt I; 3: Stravinsky, Violin Concerto, Mvt I; 4: Stravinsky, Violin Concerto, Mvt IV.]
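Assuming each learned mixture model exposes a per-sequence log-likelihood and a set of sampled sequences, the distance (15) and kernel similarity (16) can be computed as in the sketch below; the function names are hypothetical and not part of the released code.

import numpy as np

def cross_term(loglike_a, loglike_b, samples_b):
    # L(M_a | M_b) = log p(S_b | M_a) - log p(S_b | M_b), summed over the sample set S_b.
    return sum(loglike_a(x) for x in samples_b) - sum(loglike_b(x) for x in samples_b)

def model_distance(loglike_g, loglike_h, samples_g, samples_h):
    # Eq. (15): symmetrized distance between the two learned mixture models.
    return 0.5 * (cross_term(loglike_g, loglike_h, samples_h) +
                  cross_term(loglike_h, loglike_g, samples_g))

def similarity(loglike_g, loglike_h, samples_g, samples_h, sigma=1.0):
    # Eq. (16): kernel similarity; sigma rescales the values but preserves their ordering.
    d = model_distance(loglike_g, loglike_h, samples_g, samples_h)
    return float(np.exp(-abs(d) ** 2 / sigma ** 2))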

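For completeness, the front end described at the beginning of this section (25 ms frames, 10-dimensional MFCCs, vector quantization into discrete symbols, 1-second sequences) might be approximated as follows. librosa and scikit-learn are assumed here as convenient stand-ins, k-means replaces the LBG codebook design, and the codebook size is illustrative; in practice the codebook would be trained on frames pooled from all clips.

import numpy as np
import librosa
from sklearn.cluster import KMeans

def clip_to_sequences(path, sr=22050, frame_ms=25, n_mfcc=10, codebook_size=64, seq_sec=1.0):
    # Turn one clip into a list of discrete-symbol sequences, one per second of audio.
    y, sr = librosa.load(path, sr=sr)
    hop = int(sr * frame_ms / 1000)                            # 25 ms non-overlapping frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=hop, hop_length=hop).T   # (n_frames, n_mfcc)
    # Vector quantization of the frames; k-means stands in for LBG, and the
    # codebook size is an assumption (the paper does not state it).
    codes = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit_predict(mfcc)
    frames_per_seq = int(seq_sec * 1000 / frame_ms)            # 40 frames per 1-second sequence
    n_seq = len(codes) // frames_per_seq
    return [codes[i * frames_per_seq:(i + 1) * frames_per_seq] for i in range(n_seq)]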

[Fig. 2. Clip 4: (a) mixing weights (expected mixing weight vs. component index); (b) memberships (index of mixture component vs. time in seconds).]

Fig. 2(a) shows the mixing weights of the DP HMM mixture model for clip 4 as an example. Although the number of components is initially high, the algorithm automatically reduces this number by suppressing the superfluous components to the number necessary to model each clip: the expected mixing weights for these unused HMMs are near zero with high confidence, indicated by the small variance of the mixing weights. The posterior membership (which is arg max_k q(c_n = k)) for clip 4 is displayed in Fig. 2(b), where those parts having similar styles should be drawn from the same HMM. The fact that the first 20 seconds of this clip are repeated during the last 20 can be seen in their similar membership patterns.

For our second experiment, we compute the similarities between ten 3-minute clips, which were chosen deliberately with the following intended clustering: 1) clip 1 is unique in style and instrumentation; 2) clips 2 and 3, 4 and 5, 6 and 7, and 9 and 10 are intended to be paired together; 3) clip 8 is also unique, but has the same format (instrumentation) as clips 6 and 7. The Hinton diagrams of the corresponding similarity matrices are shown in Fig. 3. Again, our intuition is matched by the HMM mixture modeling in this experiment, but less accurately by the GMM modeling. Though the GMM model does not contradict our intuition, the similarities are not as stark as in the HMM mixture, especially in the case of clip 1, which was selected to be unique.

[Fig. 3. Hinton diagram for the similarity matrix for 10 music clips: (a) by DP HMM mixture models; (b) by DP GMMs. Clips: 1: Beethoven, Consecration of the House; 2: Chopin, Etudes Op. 10 No. 1; 3: Rachmaninov, Preludes Op. 23 No. 2; 4: Scarlatti, Sonata K. 135; 5: Scarlatti, Sonata K. 380; 6: Debussy, String Quartet Op. 10, Mvt II; 7: Ravel, String Quartet in F, Mvt II; 8: Shostakovich, String Quartet No. 8, Mvt II; 9: Bach, Violin Concerto BWV 1041, Mvt I; 10: Bach, Violin Concerto BWV 1042, Mvt I.]

5. CONCLUSION

We have developed a discrete HMM mixture model in a Bayesian setting using DP priors, which has the advantage of avoiding the need to select the number of mixture components, through the encouragement of parameter sharing. A VB approach is employed for inference. The performance of HMM mixture modeling was demonstrated on music data sets and compared to the GMM, computing similarities between pieces of music as a measure of performance. In our experiments, HMM mixture modeling outperforms the GMM.

6. REFERENCES

[1] J.-J. Aucouturier and F. Pachet, "Improving timbre similarity: How high's the sky?" Journal of Negative Results in Speech and Audio Sciences, vol. 1, no. 1, 2004.
[2] J.-J. Aucouturier and M. Sandler, "Segmentation of musical signals using hidden Markov models," in Proceedings of the 110th Convention of the Audio Engineering Society, May 2001.
[3] M. J. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College London, 2003.
[4] A. Berenzweig, B. Logan, D. P. Ellis, and B. Whitman, "A large-scale evaluation of acoustic and subjective music similarity measures," Computer Music Journal, vol. 28, no. 2, pp. 63–76, 2004.
[5] D. Blei and M. Jordan, "Variational methods for the Dirichlet process," in ICML, 2004.
[6] T. S. Ferguson, "A Bayesian analysis of some nonparametric problems," Annals of Statistics, vol. 1, pp. 209–230, 1973.
[7] Y. Qi, J. W. Paisley, and L. Carin, "Hidden Markov mixture models with Dirichlet process priors," Technical report, ECE Department, Duke University, 2006.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[9] C. Raphael, "Automatic segmentation of acoustic musical signals using hidden Markov models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, pp. 360–370, 1999.
[10] N. Scaringella and G. Zoia, "On the modeling of time information for automatic genre recognition systems in audio signals," in Proceedings of the 6th ISMIR, pp. 666–671, 2005.
[11] J. Sethuraman, "A constructive definition of Dirichlet priors," Statistica Sinica, vol. 4, pp. 639–650, 1994.
[12] X. Shao, C. Xu, and M. Kankanhalli, "Unsupervised classification of musical genre using hidden Markov model," in ICME, pp. 2023–2026, 2004.
[13] M. West, P. Müller, and M. Escobar, "Hierarchical priors and mixture models with applications in regression and density estimation," in Aspects of Uncertainty, P. R. Freeman and A. F. M. Smith, Eds. John Wiley, 1994, pp. 363–386.

