Nonstationary Latent Dirichlet Allocation for Speech Recognition

Chuang-Hua Chueh and Jen-Tzung Chien
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC
{chchueh,chien}@chien.csie.ncku.edu.tw

Abstract

Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents, and the words in a document are generated by the same topic distribution. However, in real-world documents, the usage of words varies across paragraphs and is accompanied by different writing styles. This study extends LDA to cope with the variations of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which is used to detect the stylistic segments in a document. Each segment corresponds to a particular style in the composition of a document. NLDA can exploit the topic information between documents as well as the word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure. A language model adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvement of NLDA over LDA in terms of perplexity and word error rate.

Index Terms: latent Dirichlet allocation, variational Bayesian, nonstationary process, speech recognition

1. Introduction

Topic modeling plays an important role in natural language processing. Beyond word-level text modeling, we are interested in topic modeling for document representation. Latent semantic analysis (LSA) performs a matrix decomposition and extracts the latent semantics from words and documents. Each document is projected into the LSA space and is represented by a low-dimensional vector, with each entry corresponding to a weight on a specific topic. Hofmann [5] presented the probabilistic LSA (PLSA), where the topic information was built within a probabilistic framework using the maximum likelihood method. However, PLSA cannot represent unseen documents, and the number of parameters grows significantly with a large collection of documents. To deal with this problem, Blei et al. [3] presented latent Dirichlet allocation (LDA) by incorporating Dirichlet priors for the topic mixtures of documents. Seen and unseen documents are generated consistently by LDA parameters estimated with variational inference and the expectation-maximization (EM) algorithm. LDA has been shown useful in document modeling [3], information retrieval [13] and speech recognition [4][9].

Recently, many studies have been proposed to improve LDA. In [11], LDA was extended by collecting bigram events from a document rather than unigram events, so that each document was treated as a bag of bigrams. In [1], a correlated topic model was presented by replacing the Dirichlet distribution with a logistic normal distribution; the topic correlation was modeled in the covariance matrix. Also, a tree structure of topics was built in the Pachinko allocation [8] as well as the latent Dirichlet tree allocation [10]. In [2], a dynamic topic model was presented for document representation with time evolution, where the current parameters served as prior information to estimate new parameters for the next time period. Furthermore, a continuous-time dynamic topic model [12] was developed by incorporating Brownian motion into the dynamic topic model, so that continuous-time topic evolution was achieved; sparse variational inference was used to reduce the computational complexity. In [6][7], LDA was merged with the hidden Markov model (HMM) as the HMM-LDA model, where the Markov states were used to discover the syntactic structure of a document. Each word was generated either from a topic or from a syntactic state, and the syntactic words and content words were modeled separately.

This study also considers the strength of the HMM in expressing the time-varying word statistics of document data. We build the nonstationary LDA (NLDA) as a new document model. The usage of words varies across the paragraphs of a document even though the words belong to the same topic. For example, in a sports news document, the first paragraph usually summarizes the game result and gives the team names, while the following paragraphs describe the game details and report the players' names. Words appearing at different temporal or segmental positions should therefore be modeled separately. In this paper, we adopt a Markov chain in LDA and automatically detect the stylistically similar segments in a document. Each segment represents a specific writing or speaking style in the composition of a text or spoken document, so that the document structure is discovered. The proposed NLDA exploits the topic information across documents as well as the word variations inside a document. A Viterbi variational inference algorithm is presented for estimation of the NLDA model. Language model adaptation using NLDA is also introduced for speech recognition. We evaluate the performance of NLDA in document modeling and speech recognition using the broadcast news documents in the Wall Street Journal corpus.

2. Latent Dirichlet Allocation

LDA extends the PLSA model by treating the latent topic mixture of each document as a random variable. The number of parameters remains under control even though the number of training documents grows significantly, and the model is capable of calculating the likelihood of unseen documents. LDA is a generative probabilistic model for a text collection: documents are represented by latent topics, which are characterized by distributions over words. The LDA parameters consist of {\alpha, \beta}, where \alpha = [\alpha_1, \ldots, \alpha_K]^T are the Dirichlet parameters of the K latent topic mixtures and \beta is a K x V matrix with multinomial entries \beta_{k,w} = p(w | z = k). Using LDA, the probability of an N-word document w = [w_1, \ldots, w_N]^T is calculated by the following procedure. First, a topic mixture vector \theta = [\theta_1, \ldots, \theta_K]^T is drawn from the Dirichlet distribution with parameter \alpha. The topic sequence z = [z_1, \ldots, z_N]^T is then generated by the multinomial distribution with parameter \theta at the document level, and each word w_n is generated by the distribution p(w_n | z_n, \beta). The joint probability of \theta, the topic assignment z and the document w is given by

p(\theta, z, w | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) p(w_n | z_n, \beta).   (1)

By integrating (1) over \theta and summing over z, we obtain the marginal probability of document w,

p(w | \alpha, \beta) = \int p(\theta | \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n | \theta) p(w_n | z_n, \beta) \, d\theta.   (2)

The LDA parameters {\alpha, \beta} are then estimated by maximizing the marginal likelihood of the training documents. This estimation problem has been solved by approximate inference algorithms including the Laplace approximation, variational inference and resampling methods [3]. Using variational inference, variational parameters are adopted to calculate a lower bound on the marginal likelihood. The LDA parameters are estimated by maximizing this lower bound, or equivalently by minimizing the Kullback-Leibler divergence between the variational distribution and the true posterior p(\theta, z | w, \alpha, \beta).
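To make the generative procedure above concrete, the following minimal NumPy sketch samples a document from LDA and evaluates the log of the joint probability in (1); the dimensions, random seed and variable names are illustrative rather than taken from the paper.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

K, V, N = 4, 1000, 50                 # topics, vocabulary size, document length (illustrative values)
alpha = np.full(K, 0.5)               # Dirichlet parameters of the topic mixture
beta = rng.dirichlet(np.ones(V), K)   # K x V topic-word probabilities, beta[k, w] = p(w | z = k)

theta = rng.dirichlet(alpha)                          # theta ~ Dir(alpha)
z = rng.choice(K, size=N, p=theta)                    # z_n ~ Mult(theta)
w = np.array([rng.choice(V, p=beta[k]) for k in z])   # w_n ~ Mult(beta[z_n])

# log of the joint probability p(theta, z, w | alpha, beta) in Eq. (1)
log_p_theta = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1.0) * np.log(theta)).sum()
log_joint = log_p_theta + np.log(theta[z]).sum() + np.log(beta[z, w]).sum()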

3. Nonstationary LDA

In general, LDA treats a document as a bag of words and generates all words from the same topic-dependent parameters. LDA is thus a stationary process in which the word probabilities do not change with the position in the document. In the real world, the words in different paragraphs vary with the composing style and the document structure. In addition to the topic information, the temporal positions of words are therefore important in natural language. In what follows, we present the nonstationary LDA model and its variational inference algorithm.

3.1. Model construction

Figure 1 displays the graphical model for document generation using NLDA, where a Markov chain is merged into LDA to characterize the dynamics of words in different segments or paragraphs. The topic mixture vector \theta is drawn from a Dirichlet distribution with parameter \alpha,

p(\theta | \alpha) = \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}.   (3)

The topic sequence z is generated by a multinomial distribution with parameter \theta. The state sequence s = [s_1, \ldots, s_N]^T is generated by a Markov chain with initial state parameter \pi and the S x S state transition probability matrix A = \{a_{s_{n-1} s_n}\}. The joint probability p(w, \theta, z, s | \alpha, \pi, A, B) is then calculated by

p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) p(s_n | s_{n-1}, \pi, A) p(w_n | z_n, s_n, B).   (4)

The parameter set B = \{b_{s_n z_n w_n}\} contains the observation probability of word w_n given latent topic z_n and state s_n. By integrating (4) over \theta and summing over z and s, we obtain the marginal probability of a document conditioned on the model parameters,

p(w | \alpha, \pi, A, B) = \int p(\theta | \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} \sum_{s_n=1}^{S} p(z_n | \theta) p(s_n | s_{n-1}, \pi, A) p(w_n | z_n, s_n, B) \, d\theta.   (5)

Comparing the marginal likelihoods of LDA in (2) and NLDA in (5), NLDA additionally introduces the hidden states, so that the dynamics of words at different positions are captured. NLDA calculates the word probability associated with both the topic z and the state s.

Figure 1: Graphical representation of NLDA.
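Under the same conventions, a minimal sketch of the NLDA generative process in (3)-(5) is given below; the number of states and all parameter values are illustrative assumptions, not settings from the paper.

import numpy as np

rng = np.random.default_rng(0)

K, S, V, N = 4, 3, 1000, 50              # topics, Markov states, vocabulary size, document length
alpha = np.full(K, 0.5)                  # Dirichlet parameters
pi = np.full(S, 1.0 / S)                 # initial state probabilities
A = rng.dirichlet(np.ones(S), S)         # S x S state transition matrix, rows sum to one
B = rng.dirichlet(np.ones(V), (S, K))    # B[s, k, w] = p(w | z = k, state = s)

theta = rng.dirichlet(alpha)             # topic mixture for the document
z = rng.choice(K, size=N, p=theta)       # topic assignments drawn at the document level

s = np.empty(N, dtype=int)               # state sequence from the Markov chain
s[0] = rng.choice(S, p=pi)
for n in range(1, N):
    s[n] = rng.choice(S, p=A[s[n - 1]])

# words drawn from the state- and topic-dependent multinomials
w = np.array([rng.choice(V, p=B[s[n], z[n]]) for n in range(N)])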
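Two of the expectations in (8), those involving the Dirichlet distributions, have the same closed forms as in standard LDA [3]. A small SciPy sketch (function name chosen for illustration) computes their contribution for one document:

import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_bound_terms(alpha, gamma):
    """E_q[log p(theta | alpha)] - E_q[log q(theta | gamma)] for one document,
    using the Dirichlet expectation E_q[log theta_k] = digamma(gamma_k) - digamma(sum(gamma))."""
    e_log_theta = digamma(gamma) - digamma(gamma.sum())
    e_log_p = gammaln(alpha.sum()) - gammaln(alpha).sum() + ((alpha - 1.0) * e_log_theta).sum()
    e_log_q = gammaln(gamma.sum()) - gammaln(gamma).sum() + ((gamma - 1.0) * e_log_theta).sum()
    return e_log_p - e_log_q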

The variational Bayesian EM (VB-EM) procedure is performed to estimate the variational parameters {\gamma, \phi, \psi} and the model parameters {\alpha, \pi, A, B}. First, in the VB-E step, we approximate the expectation function in (8) by finding the optimal variational parameters as follows:

\hat{\gamma}_{dk} = \alpha_k + \sum_{n=1}^{N} \phi_{dkn}   (9)

\hat{\phi}_{dk1} = \frac{\exp(\sum_{s=1}^{S} \psi_{ds1} \log b_{sk1} + \Psi(\gamma_{dk}))}{\sum_{l=1}^{K} \exp(\sum_{s=1}^{S} \psi_{ds1} \log b_{sl1} + \Psi(\gamma_{dl}))}   (10)

\hat{\phi}_{dkn} = \frac{\exp(\sum_{i=1}^{S} \sum_{j=1}^{S} \psi_{dijn} \log b_{jkn} + \Psi(\gamma_{dk}))}{\sum_{l=1}^{K} \exp(\sum_{i=1}^{S} \sum_{j=1}^{S} \psi_{dijn} \log b_{jln} + \Psi(\gamma_{dl}))}   (11)

\hat{\psi}_{di1} = \frac{\pi_i \exp(\sum_{k=1}^{K} \phi_{dk1} \log b_{ik1})}{\sum_{s=1}^{S} \pi_s \exp(\sum_{k=1}^{K} \phi_{dk1} \log b_{sk1})}   (12)

\hat{\psi}_{dijn} = \frac{a_{ij} \exp(\sum_{k=1}^{K} \phi_{dkn} \log b_{jkn})}{\sum_{s=1}^{S} a_{is} \exp(\sum_{k=1}^{K} \phi_{dkn} \log b_{skn})}   (13)

where \Psi(\cdot) is the digamma function and its derivative is denoted by \Psi'(\cdot). In the VB-M step, we maximize the expectation function L(\alpha, \pi, A, B; \hat{\gamma}, \hat{\phi}, \hat{\psi}) using the estimated parameters {\hat{\gamma}, \hat{\phi}, \hat{\psi}} and obtain the closed-form solutions for {\pi, A, B}:

\hat{\pi}_i = \frac{\sum_{d=1}^{M} \psi_{di1}}{\sum_{s=1}^{S} \sum_{d=1}^{M} \psi_{ds1}}   (14)

\hat{a}_{ij} = \frac{\sum_{d=1}^{M} \sum_{n=2}^{N_d} \psi_{dijn}}{\sum_{s=1}^{S} \sum_{d=1}^{M} \sum_{n=2}^{N_d} \psi_{disn}}   (15)

\hat{b}_{jkv} = \frac{\sum_{d=1}^{M} [\, \phi_{dk1} \psi_{dj1} \delta(w_v, w_1) + \sum_{n=2}^{N_d} \sum_{s=1}^{S} \phi_{dkn} \psi_{dsjn} \delta(w_v, w_n) \,]}{\sum_{m=1}^{V} \sum_{d=1}^{M} \{\, \phi_{dk1} \psi_{dj1} \delta(w_m, w_1) + \sum_{n=2}^{N_d} \sum_{s=1}^{S} \phi_{dkn} \psi_{dsjn} \delta(w_m, w_n) \,\}}   (16)

where V is the vocabulary size and \delta(w_v, w_n) is the Kronecker delta function that returns one when word w_v is identical to w_n and zero otherwise. The Newton-Raphson algorithm [3] is used to estimate the optimal Dirichlet parameter \hat{\alpha} with the (k,l)-th entry of the Hessian matrix in the form

\frac{\partial^2 L(\alpha, \pi, A, B; \hat{\gamma}, \hat{\phi}, \hat{\psi})}{\partial \alpha_k \, \partial \alpha_l} = \delta(k, l) M \Psi'(\alpha_k) - M \Psi'\Big(\sum_{l=1}^{K} \alpha_l\Big).   (17)
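As an illustration of the VB-E step, the following sketch implements the updates (9)-(11) for a single document; it assumes the state posteriors psi are given from the previous iteration, and the array names and shapes are our own conventions rather than the paper's.

import numpy as np
from scipy.special import digamma

def vb_e_step(alpha, B, w, psi1, psi, n_iter=20):
    """Update gamma (Eq. 9) and phi (Eqs. 10-11) for one document.
    alpha: (K,) Dirichlet parameters; B: (S, K, V) word probabilities;
    w: (N,) word ids; psi1: (S,) variational posterior of the first state;
    psi: (N, S, S) variational posteriors of state transitions, psi[n, i, j] for n >= 1."""
    K, N = alpha.shape[0], w.shape[0]
    gamma = alpha + N / K                         # initialization as in Table 1
    phi = np.full((N, K), 1.0 / K)
    log_b = np.log(B[:, :, w])                    # (S, K, N): log b_{s,k,w_n}
    for _ in range(n_iter):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # Eq. (10): the first word uses the initial-state posterior psi1
        log_phi0 = psi1 @ log_b[:, :, 0] + e_log_theta
        # Eq. (11): later words use the transition posteriors psi[n]
        log_phin = np.einsum('nij,jkn->nk', psi[1:], log_b[:, :, 1:]) + e_log_theta
        log_phi0 -= log_phi0.max()
        phi[0] = np.exp(log_phi0)
        phi[0] /= phi[0].sum()
        log_phin -= log_phin.max(axis=1, keepdims=True)
        phi[1:] = np.exp(log_phin)
        phi[1:] /= phi[1:].sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)           # Eq. (9)
    return gamma, phi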

3.3. Implementation of NLDA

Using (9)-(17), we can iteratively update the variational parameters and model parameters in the VB-EM procedure. However, this procedure suffers from a high computational cost because all possible state sequences are considered in (10) and (11). Here, we build a Viterbi-based VB-EM algorithm and estimate the single best state sequence \hat{s} for model training. Given \hat{s}, the formulas in (10), (11) and (16) are simplified to

\hat{\phi}_{dk1} = \frac{\exp(\log b_{\hat{s}_1 k 1} + \Psi(\gamma_{dk}))}{\sum_{l=1}^{K} \exp(\log b_{\hat{s}_1 l 1} + \Psi(\gamma_{dl}))}   (18)

\hat{\phi}_{dkn} = \frac{\exp(\log b_{\hat{s}_n k n} + \Psi(\gamma_{dk}))}{\sum_{l=1}^{K} \exp(\log b_{\hat{s}_n l n} + \Psi(\gamma_{dl}))}   (19)

\hat{b}_{jkv} = \frac{\sum_{d=1}^{M} [\, \phi_{dk1} \delta(s_j, \hat{s}_1) \delta(w_v, w_1) + \sum_{n=2}^{N_d} \phi_{dkn} \delta(s_j, \hat{s}_n) \delta(w_v, w_n) \,]}{\sum_{m=1}^{V} \sum_{d=1}^{M} \{\, \phi_{dk1} \delta(s_j, \hat{s}_1) \delta(w_m, w_1) + \sum_{n=2}^{N_d} \phi_{dkn} \delta(s_j, \hat{s}_n) \delta(w_m, w_n) \,\}}.   (20)

In (18)-(20), the variational parameters \psi_{d, \hat{s}_1, 1} and \psi_{d, \hat{s}_{n-1} \hat{s}_n, n} no longer appear owing to the Viterbi approximation. A Viterbi VB-EM procedure for implementing NLDA is shown in Table 1. In the initialization, the state alignment is presumed by uniform partition and the parameters {\alpha, \pi, A, B} are specified. The variational parameters are then updated according to the current state sequence \hat{s} in the VB-E step. Next, we perform Viterbi decoding and realign all training documents. The variational parameter \psi_{dn} represents the posterior probability of state s_n given document w_d and is calculated by combining the state transition probability and the word output probability. Accordingly, we accumulate the best score G_n(s_n) and record the most likely preceding state R_n(s_n) for each word w_n. The score G_n(s_n) approximates the lower bound of the marginal likelihood. The optimal state sequence \hat{s} is updated in the Viterbi decoding stage. When \hat{s} converges, the model parameters {\hat{\alpha}, \hat{\pi}, \hat{A}, \hat{B}} are estimated in the VB-M step. NLDA model inference is completed by this Viterbi VB-EM algorithm, which finally outputs the model parameters.

Table 1: Viterbi VB-EM algorithm for implementing NLDA

Initialize \alpha = \{\alpha_1, \ldots, \alpha_K\} randomly.
Initialize \pi = \{\pi_i = 1/S, 1 \le i \le S\}.
Initialize B = \{b_{jkv}, 1 \le j \le S, 1 \le k \le K, 1 \le v \le V\} randomly.
Initialize \hat{s} by uniform partition.
1. VB-E step (update of variational parameters):
   Initialize \gamma_d = \{\gamma_k = \alpha_k + N/K, 1 \le k \le K\}.
   Update \phi_d by (18)(19) based on the current \hat{s}.
   Update \gamma_d by (9) and \psi_d by (12)(13).
2. Viterbi decoding step (finding \hat{s} for each document w_d):
   For each word w_n (and its topic k) and state s_n, calculate the best accumulated score and the backtracking record by
      G_n(s_n) = \max_{s_{n-1}} G_{n-1}(s_{n-1}) \, \delta_{d, s_{n-1} s_n, n}
      R_n(s_n) = \arg\max_{s_{n-1}} G_{n-1}(s_{n-1}) \, \delta_{d, s_{n-1} s_n, n}
   End
   Backtrack \hat{s}_n = R_{n+1}(\hat{s}_{n+1}), n = N_d - 1, \ldots, 1 and estimate \hat{s}.
   If \hat{s} does not converge, go to step 1.
3. Viterbi VB-M step (update of model parameters):
   Update \pi by (14), A by (15) and B by (20).
   Update \alpha by the Newton-Raphson algorithm using (17).
4. Go to step 1 and terminate when the lower bound of the marginal log likelihood converges.
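The decoding step in Table 1 is a standard Viterbi recursion. A sketch in the log domain is given below; here the per-position score combines the state transition probability with the expected word log-probability under the current phi, which is one plausible reading of the score term abbreviated in Table 1.

import numpy as np

def viterbi_states(pi, A, B, w, phi):
    """Find the best state sequence s_hat for one document.
    pi: (S,) initial state probabilities; A: (S, S) transitions;
    B: (S, K, V) word probabilities; w: (N,) word ids; phi: (N, K) topic posteriors."""
    N, S = w.shape[0], A.shape[0]
    # expected log emission per position and state: sum_k phi[n, k] * log b_{s,k,w_n}
    emit = np.einsum('nk,skn->ns', phi, np.log(B[:, :, w]))
    G = np.empty((N, S))                                      # best accumulated log scores G_n(s_n)
    R = np.zeros((N, S), dtype=int)                           # backtracking records R_n(s_n)
    G[0] = np.log(pi) + emit[0]
    for n in range(1, N):
        cand = G[n - 1][:, None] + np.log(A) + emit[n]        # score of s_{n-1} -> s_n at position n
        R[n] = cand.argmax(axis=0)
        G[n] = cand.max(axis=0)
    s_hat = np.empty(N, dtype=int)
    s_hat[-1] = G[-1].argmax()
    for n in range(N - 2, -1, -1):                            # backtrack s_hat_n = R_{n+1}(s_hat_{n+1})
        s_hat[n] = R[n + 1, s_hat[n + 1]]
    return s_hat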
3.4. Comparison of LDA, HMM-LDA and NLDA

We investigate the relations among LDA [3], HMM-LDA [6][7] and the proposed NLDA. LDA captures the semantic information of words in a document and calculates the document likelihood using the derived topic statistics. However, a word either serves a syntactic function or conveys a semantic message, and words with these different properties are generated differently in natural language. HMM-LDA provides an unsupervised approach to combining semantic and syntactic information. Each word in a document is generated by a semantic topic or a syntactic state based on its contextual information, so that function words and content words are modeled separately. An HMM is adopted to obtain the state sequence and to judge whether a word is emitted from a topic or from a state, and the Markov chain Monte Carlo (MCMC) method is used for model estimation. Rather than considering syntactic regularities, we explore the regularity of temporal variations of words in the generation of a document. We claim that, even under the same topic, the word distributions in different segments or paragraphs should vary with the composition of the document. The proposed NLDA therefore builds a nonstationary process in which the word distribution of a topic is adapted to different positions of a document, and a Markov chain is used to characterize such variations. The documents are automatically segmented according to their word occurrences, and a word is generated by its semantics as well as its position in a document. We derive the NLDA solution by using the Viterbi VB-EM procedure.

3.5. NLDA for speech recognition

Beyond document modeling, NLDA can be applied to dynamically adapt the n-gram model for speech recognition. Similar to LDA language model adaptation [9], we first perform the Viterbi VB-EM algorithm on the test sentence or test document. Using the best state sequence \hat{s} and the estimated variational parameter \hat{\gamma}, we calculate the NLDA unigram by

p_{\hat{s}}(w) = \int \sum_{k=1}^{K} p(w | k, \hat{s}, B) \, p(k | \theta) \, p(\theta | \hat{\gamma}) \, d\theta = \int \sum_{k=1}^{K} b_{\hat{s}kw} \, \theta_k \, q(\theta | \hat{\gamma}) \, d\theta = \sum_{k=1}^{K} b_{\hat{s}kw} E_q[\theta_k | \hat{\gamma}] = \frac{\sum_{k=1}^{K} b_{\hat{s}kw} \hat{\gamma}_k}{\sum_{l=1}^{K} \hat{\gamma}_l}.   (21)

Next, for each segment of the test sentence, the NLDA unigram is applied to adapt the background n-gram model by linear interpolation with weight \lambda,

\hat{p}(w | h) = \lambda \, p_{n\text{-gram}}(w | h) + (1 - \lambda) \, p_{\hat{s}}(w)   (22)

where h is the history of the n-gram event. The adapted language model is employed in language model rescoring for speech recognition. However, in our experiments, the benchmark test data in WSJ consist of individual utterances rather than whole spoken documents. To illustrate the performance of NLDA in broadcast news transcription, we modified the Viterbi VB-EM algorithm to search for the best state sequences of test utterances. Since the position of a test utterance within a document is unknown, we manipulate the search as illustrated in Figure 2. The bar denotes a document with three segments. The black, red and blue arrow lines denote utterances located in different positions of a document, covering three, two and one NLDA states, respectively. In the implementation of NLDA, we relaxed the restriction on the starting state and the ending state when searching for the best state sequence. We specified equal initial state probabilities so that different states can serve as the starting state in the calculation of the sentence probability.

Figure 2: Utterances with different NLDA state coverage.
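A sketch of the adaptation in (21)-(22) is given below; background_ngram_prob is a placeholder for whatever n-gram implementation is used, and the interpolation weight is illustrative.

import numpy as np

def nlda_unigram(B, gamma_hat, s_hat):
    """Eq. (21): p_s(w) = sum_k b_{s,k,w} * E[theta_k | gamma_hat] for the decoded state s_hat."""
    e_theta = gamma_hat / gamma_hat.sum()     # E[theta | gamma_hat] under the variational Dirichlet
    return B[s_hat].T @ e_theta               # (V,) vector of unigram probabilities

def adapted_prob(word, history, s_hat, B, gamma_hat, background_ngram_prob, lam=0.5):
    """Eq. (22): linear interpolation of the background n-gram with the NLDA unigram."""
    p_uni = nlda_unigram(B, gamma_hat, s_hat)
    return lam * background_ngram_prob(word, history) + (1.0 - lam) * p_uni[word]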
4. Experiments

The Wall Street Journal (WSJ) corpus was used to evaluate the proposed NLDA. We performed the evaluation by following the November 1992 ARPA continuous speech recognition benchmark. The SI-84 training set was used for HMM estimation with 39-dimensional MFCC feature vectors. The 1987-1989 WSJ text data with about eighty thousand documents were used to train the baseline Kneser-Ney trigram model and the NLDA model. The lexicon was the 5K non-verbalized punctuation, closed vocabulary. We trained NLDA with 100 topics and 3 states by following the VB-EM procedure shown in Table 1. An LDA model [3][9] with 100 topics was built for comparison. The benchmark test set consisted of 330 utterances. The baseline trigram was used to generate the 100-best lists, and the adapted language model was used for rescoring. The perplexities and word error rates (WERs) are reported in Table 2. The interpolation weights for combining the baseline trigram with LDA and with NLDA were determined empirically. Both LDA and NLDA outperformed the baseline in terms of perplexity and WER. The WER improvement of NLDA over LDA was relatively small, which should be attributed to testing on isolated utterances rather than whole documents. We will evaluate NLDA in future experiments on spoken document transcription.

Table 2: Comparison of perplexity and WER

               Baseline   LDA    NLDA
Perplexity     46.6       45.1   43.3
WER (%)        5.38       5.17   5.14
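For reference, the evaluation pipeline can be summarized by the hypothetical rescoring and perplexity routines below; the hypothesis format and the adapted_lm_logprob callback are placeholders rather than the toolkit actually used in the experiments.

import numpy as np

def rescore_nbest(nbest, adapted_lm_logprob, lm_weight=1.0):
    """Pick the best hypothesis from an N-best list by combining acoustic and adapted LM scores.
    nbest: list of (words, acoustic_logprob) pairs; adapted_lm_logprob(words) -> float."""
    scored = [(ac + lm_weight * adapted_lm_logprob(words), words) for words, ac in nbest]
    return max(scored, key=lambda t: t[0])[1]

def perplexity(test_sentences, adapted_lm_logprob):
    """Perplexity = exp(-average natural-log probability per word) over the test set."""
    total_logprob, total_words = 0.0, 0
    for words in test_sentences:
        total_logprob += adapted_lm_logprob(words)
        total_words += len(words)
    return np.exp(-total_logprob / total_words)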

5. Conclusions

In natural language, the word distribution of each topic varies across the paragraphs of a document. This study proposed a nonstationary LDA in which a Markov chain was embedded to detect the stylistically similar segments in a document. Each segment corresponds to a composition unit of the document, and the words in a segment share the same parameters. We derived the variational solutions by maximizing the lower bound of the log likelihood of the training data, and presented a Viterbi VB-EM procedure for a simplified implementation. When applying NLDA to language modeling, we obtained improvements in perplexity and recognition accuracy over the baseline trigram and LDA, and NLDA effectively extracted the topic information and document structure. In the future, we will model the length of each segment in a document by incorporating duration information. It is also interesting to build an HMM-NLDA by considering syntactic regularities in NLDA. Additional experiments on a large-scale corpus will be conducted.

6. References

[1] D. M. Blei and J. D. Lafferty, "Correlated topic models", Advances in Neural Information Processing Systems, vol. 18, MIT Press, Cambridge, MA, 2006.
[2] D. M. Blei and J. D. Lafferty, "Dynamic topic models", in Proc. of ICML, vol. 148, pp. 113-120, 2006.
[3] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[4] J.-T. Chien and C.-H. Chueh, "Latent Dirichlet language model for speech recognition", in Proc. of IEEE SLT, pp. 201-204, 2008.
[5] T. Hofmann, "Probabilistic latent semantic indexing", in Proc. of ACM SIGIR, pp. 50-57, 1999.
[6] B.-J. Hsu and J. Glass, "Style & topic language model adaptation using HMM-LDA", in Proc. of EMNLP, pp. 373-381, 2006.
[7] T. Griffiths, M. Steyvers, D. Blei and J. Tenenbaum, "Integrating topics and syntax", Advances in Neural Information Processing Systems, vol. 17, pp. 537-544, 2004.
[8] W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations", in Proc. of ICML, pp. 577-584, 2006.
[9] Y.-C. Tam and T. Schultz, "Unsupervised language model adaptation using latent semantic marginals", in Proc. of Interspeech, pp. 2206-2209, 2006.
[10] Y.-C. Tam and T. Schultz, "Correlated latent semantic model for unsupervised LM adaptation", in Proc. of ICASSP, vol. 4, pp. 41-44, 2007.
[11] H. M. Wallach, "Topic modeling: beyond bag-of-words", in Proc. of ICML, pp. 977-984, 2006.
[12] C. Wang, D. Blei and D. Heckerman, "Continuous time dynamic topic models", in Proc. of Uncertainty in Artificial Intelligence, 2008.
[13] X. Wei and W. B. Croft, "LDA-based document models for ad-hoc retrieval", in Proc. of ACM SIGIR, pp. 178-185, 2006.
