
DIRICHLET PROCESS HMM MIXTURE MODELS WITH APPLICATION TO MUSIC ANALYSIS

Yuting Qi, John William Paisley and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708-0291

ABSTRACT

A hidden Markov mixture model is developed using a Dirichlet process (DP) prior, to represent the statistics of sequential data for which a single hidden Markov model (HMM) may not be sufficient. The DP prior has an intrinsic clustering property that encourages parameter sharing, naturally revealing the proper number of mixture components. The evaluation of posterior distributions for all model parameters is achieved via a variational Bayes formulation. We focus on exploring music similarities as an important application, highlighting the effectiveness of the HMM mixture model. Experimental results are presented from classical music clips.

Index Terms— Dirichlet Process, HMM mixture, Music, Variational Bayes.

1. INTRODUCTION

Music recognition, including music classification, retrieval, browsing, and recommendation systems, has been of significant recent interest. Correspondingly, ideas from statistical machine learning have attracted growing interest in the music-analysis community. For example, Gaussian mixture models have been used to represent the distribution of the MFCCs over all frames of an individual song [1][4]. However, no dynamic behavior of music is taken into account in these works. Since "the brain dynamically links a multitude of short events which cannot always be separated" [2], temporal cues are critical and contain information that should not be ignored. Therefore, music is treated here as time-series data, and hidden Markov models (HMMs), which can accurately represent the statistics of sequential data [8], have been introduced to model overall music in [2][9] and, more recently, for music genre classification [10][12].

Building a single HMM for a song performs well when the music's "movement pattern" is relatively simple and the structure is thus of modest complexity (e.g., the number of states is small). However, most real music is a complicated signal, which may have more than one "movement pattern" across the entire piece. Therefore an HMM mixture model is proposed in this paper to describe multiple "movement patterns" in music, with each pattern characterized by a single mixture component (an HMM). The work reported here develops an HMM mixture model in a Bayesian setting, using a nonparametric Dirichlet process (DP) as a common prior distribution on the parameters of the individual HMMs. It has been proven that the DP is rich enough to model parameters of individual components with arbitrarily high complexity, and flexible enough to fit them well without any assumptions about the functional form of the prior distribution [6][13]. Importantly, the number of mixture components need not be set a priori in the DP HMM mixture model. A variational Bayes [5] approach is considered to perform DP-based mixture modeling for efficient computation. In this paper we focus on HMM mixture models based on discrete observations; our method is applicable to any sequential discrete data set containing multiple underlying patterns.

The remainder of the paper is organized as follows. Section 2 provides an introduction to the Dirichlet process and its application to HMM mixture models. A variational Bayes inference method is developed in Section 3. Section 4 describes the music application as well as experimental results. Section 5 concludes the work.

2. DP-BASED HIDDEN MARKOV MIXTURE MODEL

2.1. Hidden Markov Mixture Model

The hidden Markov mixture model with K* mixture components may be written as

p(x \mid a_1, \cdots, a_{K^*}, \Theta_1, \cdots, \Theta_{K^*}) = \sum_{k=1}^{K^*} a_k \, p(x \mid \Theta_k),    (1)

where x = {x_t}_{t=1,T} is a sequence of observations, p(x | Θ_k) represents the kth HMM component with associated parameters Θ_k, and a_k represents the mixing weight for the kth HMM, with \sum_{k=1}^{K^*} a_k = 1.

We assume a set X = {x_n}_{n=1,N} of N sequences of data. Each data sequence x_n is assumed to be drawn from an associated HMM with parameters Θ_n = {A_n, B_n, π_n}, i.e., x_n ~ H(Θ_n), where H(Θ) represents the HMM. The set of associated parameters {Θ_n}_{n=1,N} are drawn i.i.d. from a shared prior G, i.e., Θ_n | G ~ G. The distribution G is itself drawn from a distribution, in particular a Dirichlet process. The prior G encourages clustering of the parameters {Θ_n}_{n=1,N}, and each such cluster corresponds to an HMM mixture component in (1). The algorithm automatically determines an appropriate number of mixture components, balancing the DP-generated desire to cluster with the likelihood's desire to choose parameters that match the data X well. This balance between the likelihood and the DP prior is manifested in the posterior density function for the parameters {Θ_n}_{n=1,N}.
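As a concrete illustration of the mixture likelihood in (1), the following minimal Python/NumPy sketch evaluates p(x | Θ_k) for a discrete-observation HMM with the standard forward algorithm and then forms the weighted sum over components. The function and variable names are ours, not the paper's, and no numerical scaling is applied, so it is only suitable for short sequences.

```python
import numpy as np

def hmm_likelihood(x, A, B, pi):
    """Forward algorithm: p(x | Theta) for a discrete-observation HMM with
    transition matrix A (I x I), emission matrix B (I x M), and initial
    state distribution pi (I,). No scaling: intended for short sequences."""
    alpha = pi * B[:, x[0]]                 # initialize forward messages
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]    # propagate through A, then emit x_t
    return alpha.sum()

def mixture_likelihood(x, weights, thetas):
    """Equation (1): p(x) = sum_k a_k p(x | Theta_k)."""
    return sum(a * hmm_likelihood(x, *theta)
               for a, theta in zip(weights, thetas))
```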
2.2. Dirichlet Process

The Dirichlet process, denoted DP(α, G_0), is a random measure on measures, parameterized by a positive scaling parameter α, often termed the "innovation parameter", and a base distribution G_0. Assume we have N random variables {Θ_n}_{n=1,N} distributed according to G, and G itself is a random measure drawn from a Dirichlet process,

\Theta_n \mid G \sim G, \quad n = 1, \cdots, N, \qquad G \sim DP(\alpha, G_0),

where G_0 is the expectation of G, E[G] = G_0. Define Θ^{-n} = {Θ_1, ..., Θ_{n-1}, Θ_{n+1}, ..., Θ_N}, let {Θ*_k}_{k=1,K*} be the distinct values taken by {Θ_n}_{n=1,N}, and let n_k^{-n} be the number of values in Θ^{-n} that equal Θ*_k. Integrating out G, the conditional distribution of Θ_n given Θ^{-n} follows a Pólya urn scheme and has the following form [13]:

\Theta_n \mid \Theta^{-n}, \alpha, G_0 \sim \frac{1}{\alpha + N - 1} \Big( \alpha G_0 + \sum_{k=1}^{K^*} n_k^{-n} \, \delta_{\Theta_k^*} \Big),    (2)

where δ_{Θ_i} denotes the distribution concentrated at the point Θ_i. Equation (2) shows that when considering Θ_n given all other observations Θ^{-n}, this new sample is either drawn from the base distribution G_0 with probability α/(α + N − 1), or is selected from the existing draws Θ*_k according to a multinomial allocation, with probabilities proportional to the existing group sizes n_k^{-n}. Sethuraman [11] provides an explicit characterization of G in terms of a stick-breaking construction,

G = \sum_{k=1}^{\infty} p_k \, \delta_{\Theta_k^*},    (3)

with

p_k = v_k \prod_{i=1}^{k-1} (1 - v_i),    (4)

where v_k | α ~ Beta(1, α) and Θ*_k | G_0 ~ G_0. This representation shows that the support of G consists of an infinite set of atoms located at the Θ*_k, drawn independently from G_0. The mixing weights p_k for the atoms Θ*_k are obtained by successively breaking a unit-length "stick" into an infinite number of pieces [11], with 0 ≤ p_k ≤ 1 and \sum_{k=1}^{\infty} p_k = 1.

2.3. HMM Mixture Models with DP Prior

Given the observed data X = {x_n}_{n=1,N}, each x_n is assumed to be drawn from its own HMM H(Θ_n) parameterized by Θ_n, with underlying state sequence s_n. The common prior G on all Θ_n is given by (3). Since G is discrete, different Θ_n may share the same value Θ*_k, taking the value Θ*_k with probability p_k. Introducing indicator variables c = {c_n}_{n=1,N}, and letting c_n = k indicate that Θ_n takes the value of Θ*_k, the hidden Markov mixture model with DP prior can be expressed as

x_n \mid \{\Theta_k^*\}_{k=1,\infty}, c_n \sim \mathcal{H}(\Theta_{c_n}^*), \qquad c_n \mid p \sim \mathrm{Mult}(p),    (5)

where p = {p_k}_{k=1,∞} is given by (4) and Mult(p) is the multinomial distribution with parameter p.

Assuming A, B and π are independent of each other, the base distribution G_0 is represented as G_0 = p(A) p(B) p(π). For computational convenience (use of appropriate conjugate priors), we place the following prior distributions

p(A \mid u^A) = \prod_{i=1}^{I} \mathrm{Dir}(\{a_{i1}, \cdots, a_{iI}\}; u^A),    (6)

p(B \mid u^B) = \prod_{i=1}^{I} \mathrm{Dir}(\{b_{i1}, \cdots, b_{iM}\}; u^B),    (7)

p(\pi \mid u^\pi) = \mathrm{Dir}(\{\pi_1, \cdots, \pi_I\}; u^\pi),    (8)

where u^A = {u^A_i}_{i=1,I}, u^B = {u^B_m}_{m=1,M}, and u^π = {u^π_i}_{i=1,I} are parameters of the Dirichlet distributions. To learn α from the data, we place a prior distribution on it,

p(\alpha) = \mathrm{Ga}(\alpha; \gamma_{01}, \gamma_{02}),    (9)

where Ga(α; γ_01, γ_02) is the Gamma distribution with selected parameters γ_01 and γ_02.
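To make the stick-breaking construction of (3)–(4) and the indicator-based mixture of (5) concrete, the short Python/NumPy sketch below (our own illustrative code, not the paper's implementation) draws the mixing weights p from a stick-breaking process truncated at a finite K, as in Section 3, with the last stick forced to absorb the remaining mass, and then samples the indicators c_n. The sizes alpha, K and N are arbitrary choices for illustration; the output shows that typically only a subset of the K atoms is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, K):
    """Truncated stick-breaking (Eqs. 3-4): v_k ~ Beta(1, alpha),
    p_k = v_k * prod_{i<k} (1 - v_i), with v_K set to 1 so the
    truncated weights sum exactly to one."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                                          # absorb leftover mass at the truncation level
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

# Generative view of the DP HMM mixture indicators (Eq. 5), with hypothetical sizes.
alpha, K, N = 1.0, 20, 50          # innovation parameter, truncation level, number of sequences
p = stick_breaking_weights(alpha, K)
c = rng.choice(K, size=N, p=p)     # c_n | p ~ Mult(p): which atom Theta*_k each sequence uses
print("distinct atoms actually used:", len(np.unique(c)))
```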
3. VARIATIONAL INFERENCE

Considering the computational complexity of the infinite stick-breaking model, in practice we select an appropriate truncation level K (i.e., a finite number of sticks) that yields a model virtually indistinguishable from the infinite DP model [5]. Since {Θ_n}_{n=1,N} may take only a subset of the values {Θ*_k}_{k=1,K}, the utilized number of mixture components K* may be less than K (and the clustering properties of the DP almost always yield fewer than K mixture components, unless α is very large) [7]. From Bayes' rule, we have

p(\Phi \mid X, \Psi) = \frac{p(X \mid \Phi) \, p(\Phi \mid \Psi)}{\int p(X \mid \Phi) \, p(\Phi \mid \Psi) \, d\Phi},    (10)

where Φ = {A*, B*, π*, v, α, S, c} are the hidden variables of interest and Ψ = {u^A, u^B, u^π, γ_01, γ_02} are fixed parameters. The integration in the denominator of (10), the marginal likelihood, is generally intractable analytically. Variational methods are therefore introduced to seek a distribution q(Φ) that approximates the true posterior distribution p(Φ | X, Ψ). Consider the log marginal likelihood

\log p(X \mid \Psi) = \mathcal{L}(q(\Phi)) + D_{KL}\big(q(\Phi) \,\|\, p(\Phi \mid X, \Psi)\big),    (11)

where

\mathcal{L}(q(\Phi)) = \int q(\Phi) \log \frac{p(X \mid \Phi) \, p(\Phi \mid \Psi)}{q(\Phi)} \, d\Phi \le \log p(X \mid \Psi),    (12)

and D_{KL}(q || p) is the KL divergence between q and p.
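The decomposition in (11)–(12) can be checked numerically in a toy discrete setting where the evidence is computable exactly. The sketch below is our own illustrative example, not the paper's model: for a single hidden variable with two states and an arbitrary variational distribution q, the lower bound L(q) plus the KL divergence equals the log marginal likelihood, which is why maximizing L(q) over q drives q toward the true posterior.

```python
import numpy as np

# Toy discrete check of Eq. (11): log p(X) = L(q) + KL(q || p(Z|X)).
# All numbers below are illustrative choices, not the paper's model.
prior = np.array([0.3, 0.7])           # p(z) over two hidden states
lik   = np.array([0.9, 0.2])           # p(x | z) for one observed x
joint = prior * lik                    # p(x, z)
evidence = joint.sum()                 # p(x), the marginal likelihood
posterior = joint / evidence           # p(z | x), the exact posterior

q = np.array([0.5, 0.5])               # an arbitrary variational distribution q(z)
elbo = np.sum(q * (np.log(joint) - np.log(q)))       # L(q), Eq. (12)
kl = np.sum(q * (np.log(q) - np.log(posterior)))     # KL(q || p(z|x))

print(np.log(evidence), elbo + kl)     # the two quantities agree
assert np.isclose(np.log(evidence), elbo + kl)
```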