Nonstationary Latent Dirichlet Allocation for Speech Recognition
Chuang-Hua Chueh and Jen-Tzung Chien
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan, ROC
{chchueh,chien}@chien.csie.ncku.edu.tw

Abstract
Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents, and the words in a document are generated from the same topic distribution. In real-world documents, however, the usage of words varies across paragraphs and is accompanied by different writing styles. This study extends LDA to cope with the variation of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain that detects the stylistic segments in a document, where each segment corresponds to a particular style in the composition of a document. NLDA exploits the topic information between documents as well as the word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure, and develop a language model adaptation scheme using NLDA for speech recognition. Experimental results show improvement of NLDA over LDA in terms of perplexity and word error rate.

Index Terms: latent Dirichlet allocation, variational Bayesian, nonstationary process, speech recognition

1. Introduction
Topic modeling plays an important role in natural language processing. Beyond word-level text modeling, we are interested in topic modeling for document representation. Latent semantic analysis (LSA) performs a matrix decomposition and extracts the latent semantics from words and documents. Each document is projected into the LSA space and is represented by a low-dimensional vector, with each entry corresponding to a weight on a specific topic. Hofmann [5] presented the probabilistic LSA (PLSA), where the topic information is built within a probabilistic framework using the maximum likelihood method. However, PLSA cannot represent unseen documents, and the number of parameters grows significantly with a large collection of documents. To deal with this problem, Blei et al. [3] presented latent Dirichlet allocation (LDA) by incorporating Dirichlet priors for the topic mixtures of documents. Seen and unseen documents are generated consistently by the LDA parameters, which are estimated through variational inference and the expectation-maximization (EM) algorithm.

LDA has been shown useful in document classification [3], information retrieval [13] and speech recognition [4][9]. Recently, many studies have been proposed to improve LDA. In [11], the bigram LDA was presented by collecting bigram events from a document rather than unigram events, so that each document was treated as a bag of bigrams. In [1], a correlated topic model was presented by replacing the Dirichlet distribution with a logistic normal distribution; the topic correlations were modeled in the covariance matrix. A tree structure of topics was built in the Pachinko allocation model [8] as well as in the latent Dirichlet tree allocation [10]. In [2], a dynamic topic model was presented for document representation with time evolution, where the current parameters served as the prior information for estimating the parameters of the next time period. Furthermore, a continuous-time dynamic topic model [12] was developed by incorporating Brownian motion into the dynamic topic model, so that continuous-time topic evolution was fulfilled; sparse variational inference was used to reduce the computational complexity. In [6][7], LDA was merged with the hidden Markov model (HMM) as the HMM-LDA model, where the Markov states were used to discover the syntactic structure of a document. Each word was generated either from a topic or from a syntactic state, so that syntactic words and content words were modeled separately.

This study also exploits the strength of the HMM in expressing the time-varying word statistics of document data. We build the nonstationary LDA (NLDA) as a new document model. The usage of words varies across the paragraphs of a document even though the words belong to the same topic. For example, in a sports news document, the first paragraph usually summarizes the game result and mentions the team names, while the following paragraphs describe the game details and report the players' names. Words appearing at different temporal or segmental positions should therefore be modeled separately. In this paper, we adopt a Markov chain in LDA to automatically detect the stylistically similar segments in a document. Each segment represents a specific writing or speaking style in the composition of a text or spoken document, so that the document structure is discovered. The proposed NLDA exploits the topic information across documents as well as the word variations inside a document. A Viterbi variational inference algorithm is presented for estimation of the NLDA model. Language model adaptation using NLDA is also introduced for the application of speech recognition. We evaluate the performance of NLDA in document modeling and speech recognition using the broadcast news documents in the Wall Street Journal corpus.

2. Latent Dirichlet Allocation
LDA extends the PLSA model by treating the latent topic mixture of each document as a random variable. The number of parameters is controlled even when the number of training documents increases significantly, and the LDA model is capable of calculating the likelihood of unseen documents. LDA is a generative probabilistic model for a text collection. Documents are represented by latent topics, which are characterized by distributions over words. The LDA parameters consist of {α, β}, where α = [α_1, ..., α_K]^T are the Dirichlet parameters of the K latent topic mixtures, and β is a matrix with multinomial entries β_{k,w} = p(w | z = k). Using LDA, the probability of an N-word document w = [w_1, ..., w_N]^T is calculated by the following procedure. First, a topic mixture vector θ = [θ_1, ..., θ_K]^T is drawn from the Dirichlet distribution with parameter α. The topic sequence z = [z_1, ..., z_N]^T is then generated by the multinomial distribution with parameter θ at the document level, and each word w_n is generated from the distribution p(w_n | z_n, β). The joint probability of θ, the topic assignment z and the document w is given by

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) .    (1)

By integrating (1) over θ and summing over z, we obtain the marginal probability of document w as

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \, d\theta .    (2)

The LDA parameters {α, β} are then estimated by maximizing the marginal likelihood of the training documents. The estimation problem is solved by approximate inference algorithms including the Laplace approximation, variational inference, and sampling methods [3]. Using variational inference, variational parameters are adopted to calculate a lower bound on the marginal likelihood. The LDA parameters are estimated by maximizing this lower bound, or equivalently by minimizing the Kullback-Leibler divergence between the variational distribution and the true posterior p(θ, z | w, α, β).
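To make the generative procedure behind (1)-(2) concrete, the following is a minimal sketch (not from the paper) that draws a document from LDA and estimates the marginal probability in (2) by Monte Carlo integration over θ. The vocabulary size, number of topics, and all parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy LDA parameters (illustrative assumptions, not from the paper).
K, V, N = 3, 5, 8                      # topics, vocabulary size, document length
alpha = np.array([0.5, 0.8, 0.3])      # Dirichlet parameters of the K topic mixtures
beta = rng.dirichlet(np.ones(V), K)    # K x V matrix, beta[k, w] = p(w | z = k)

# Generative procedure of Eqs. (1)-(2):
# theta ~ Dir(alpha), z_n ~ Mult(theta), w_n ~ Mult(beta[z_n]).
theta = rng.dirichlet(alpha)
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])
print("sampled topics:", z, "sampled words:", w)

# Monte Carlo estimate of p(w | alpha, beta) in Eq. (2):
# average over draws of theta, summing the topics out exactly for each word.
num_samples = 2000
probs = np.empty(num_samples)
for i in range(num_samples):
    th = rng.dirichlet(alpha)
    # For each word position: sum_k p(z_n = k | theta) p(w_n | z_n = k, beta).
    word_probs = (th[:, None] * beta).sum(axis=0)[w]
    probs[i] = word_probs.prod()
print("Monte Carlo estimate of p(w | alpha, beta):", probs.mean())

The Monte Carlo average converges to the integral in (2) as the number of θ samples grows; it is used here only to illustrate what the marginal likelihood measures.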
3. Nonstationary LDA
In general, LDA treats a document as a bag of words and generates all words by the same topic-dependent parameters. LDA is thus viewed as a stationary process in which the word probabilities do not change with the temporal position. In the real world, the words in different paragraphs vary due to the composition style and the document structure. In addition to the topic information, the temporal positions of words are therefore also important in natural language. In what follows, we present the nonstationary LDA model and its variational inference algorithm.

3.1. Model construction
Figure 1 displays the graphical model for document generation using NLDA, where a Markov chain is merged into LDA to characterize the dynamics of words in different segments or paragraphs. The topic mixture vector θ is drawn from a Dirichlet distribution with parameter α,

p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1} .    (3)

The topic sequence z is generated by a multinomial distribution with parameter θ. The state sequence s = [s_1, ..., s_N]^T is generated by a Markov chain with initial state parameter π and the S × S state transition probability matrix A = {a_ij}. Each word w_n is then generated from the state- and topic-dependent distribution p(w_n | z_n, s_n, B). The marginal probability of an N-word document under NLDA becomes

p(w \mid \alpha, \pi, A, B) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} \sum_{s_n=1}^{S} p(z_n \mid \theta) \, p(s_n \mid s_{n-1}, \pi, A) \, p(w_n \mid z_n, s_n, B) \, d\theta .    (5)

Comparing the marginal likelihoods of LDA in (2) and NLDA in (5), NLDA additionally introduces the hidden states, so that the dynamics of words at different positions are captured. NLDA calculates the word probability associated with both the topic z and the state s.
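A corresponding sketch of the generative process implied by (3) and (5) is given below: the segment states follow a Markov chain with initial probabilities π and transition matrix A, and each word is drawn from a topic- and state-dependent distribution. All sizes and parameter values are illustrative assumptions, and the array layout of B follows the reconstruction above rather than a verified implementation of the authors' model.

import numpy as np

rng = np.random.default_rng(1)

# Toy NLDA parameters (illustrative assumptions).
K, S, V, N = 3, 2, 5, 10               # topics, states (segments), vocabulary, document length
alpha = np.array([0.5, 0.8, 0.3])      # Dirichlet parameters
pi = np.array([0.7, 0.3])              # initial state probabilities
A = np.array([[0.9, 0.1],              # S x S transitions, A[i, j] = p(s_n = j | s_{n-1} = i)
              [0.2, 0.8]])
B = rng.dirichlet(np.ones(V), (K, S))  # B[k, s] is the word distribution p(w | z = k, s)

# Generative process behind Eq. (5).
theta = rng.dirichlet(alpha)           # document-level topic mixture, Eq. (3)
words, states, topics = [], [], []
s = rng.choice(S, p=pi)                # s_1 ~ pi
for n in range(N):
    if n > 0:
        s = rng.choice(S, p=A[s])      # s_n ~ A[s_{n-1}], the Markov chain over segments
    z = rng.choice(K, p=theta)         # z_n ~ Mult(theta)
    w = rng.choice(V, p=B[z, s])       # w_n ~ p(w | z_n, s_n, B)
    states.append(s); topics.append(z); words.append(w)

print("states :", states)
print("topics :", topics)
print("words  :", words)

The state path segments the document into stylistically homogeneous regions, while the topic mixture is still shared across the whole document, which is exactly the separation NLDA is designed to capture.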
[Figure 1: Graphical representation of NLDA.]

3.2. Model inference
The NLDA parameters {α, π, A, B} are estimated by maximizing the logarithm of the marginal likelihood accumulated over M training documents,

\sum_{d=1}^{M} \log p(w_d \mid \alpha, \pi, A, B) .    (6)

However, it is intractable to optimize (6) directly due to the coupling between θ and B in the summation over topics and states. Accordingly, we maximize a lower bound of (6) for each individual training document w_d and approximate the true posterior p(θ, z, s | w_d, α, π, A, B) by a variational distribution q_d(θ, z, s), which is factorized as

q_d(\theta, z, s) = q(\theta \mid \gamma_d) \left[ \prod_{n=1}^{N} q(z_n \mid \phi_{dn}) \right] \left[ q(s_1 \mid \psi_{d1}) \prod_{n=2}^{N} q(s_n \mid s_{n-1}, \psi_{dn}) \right] .    (7)

The variational parameter γ_d of the Dirichlet distribution and the topic-based multinomial parameters φ_dn are introduced, while the parameters ψ_d = {ψ_d1, ψ_dn} are associated with the multinomial distributions of {π, A}. Based on Jensen's inequality, the lower bound of the log likelihood is obtained as

L(\alpha, \pi, A, B; \gamma, \phi, \psi) = \sum_{d=1}^{M} \left\{ E_{q_d}[\log p(\theta \mid \alpha)] + E_{q_d}[\log p(z \mid \theta)] + \cdots \right.
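The factorization in (7) keeps a first-order Markov dependency among the segment states, so the most likely state (segment) path can be recovered with a standard Viterbi recursion once per-position state scores are available, in the spirit of the Viterbi-based variational procedure mentioned in the introduction. The sketch below is a generic log-domain Viterbi decoder, not the authors' exact update; the emission scores are an assumed input (e.g. expected word log-likelihoods under the variational topic posteriors).

import numpy as np

def viterbi(log_pi, log_A, log_emit):
    """Most likely state path for a first-order Markov chain.

    log_pi:   (S,)    log initial state probabilities
    log_A:    (S, S)  log transitions, log_A[i, j] = log p(s_n = j | s_{n-1} = i)
    log_emit: (N, S)  per-position log scores of each state (assumed given)
    """
    N, S = log_emit.shape
    delta = np.empty((N, S))            # best log score of a path ending in state s at position n
    back = np.zeros((N, S), dtype=int)  # backpointers
    delta[0] = log_pi + log_emit[0]
    for n in range(1, N):
        scores = delta[n - 1][:, None] + log_A   # (S, S): previous state x current state
        back[n] = scores.argmax(axis=0)
        delta[n] = scores.max(axis=0) + log_emit[n]
    # Trace back the best path.
    path = np.empty(N, dtype=int)
    path[-1] = delta[-1].argmax()
    for n in range(N - 2, -1, -1):
        path[n] = back[n + 1, path[n + 1]]
    return path

# Tiny usage example with made-up scores for S = 2 states and N = 5 positions.
log_pi = np.log([0.7, 0.3])
log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_emit = np.log([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
print(viterbi(log_pi, log_A, log_emit))   # -> [0 0 1 1 1]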