Language Model Adaptation Using Latent Dirichlet Allocation and an Efficient Topic Inference Algorithm

Aaron Heidel1, Hung-an Chang2, and Lin-shan Lee1

1 Dept. of Computer Science & Information Engineering, National Taiwan University, Taipei, Taiwan, Republic of China
2 Spoken Language Systems Group, CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

[email protected], hung [email protected], [email protected]

Abstract

We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpolation with a background language model during language model adaptation. We also present a novel iterative algorithm for LDA topic inference. Very encouraging results were obtained in preliminary experiments with broadcast news in Mandarin Chinese.

Index Terms: language model, unsupervised adaptation, topic modeling

1. Introduction

Statistical n-gram-based language models have long been used to estimate probabilities for the next word given the preceding word history. Although they have proven to work extremely well, researchers quickly noticed signs that their performance improvements given increasingly large corpora were beginning to asymptote [1]. Many attempts, therefore, have been made to compensate for their most notable weakness: no consideration of long-distance dependencies, and no understanding of semantics.

One of the earliest such attempts was the cache-based technique, which took advantage of the generally observed "burstiness" of words by increasing the probability of words in the history when predicting the next word [2]. This technique was then generalized using trigger pairs, in which the observation of certain "trigger" words increases the probability of seeing correlated words; furthermore, a maximum entropy approach was used to combine the probability estimates of multiple triggers into a reasonable joint estimate [3].

Another well-known approach is the sentence-level mixture model, which used topics identified from a heterogeneous training corpus by automatic clustering [4]. Improvements were demonstrated in both perplexity and recognition accuracy over an unadapted trigram language model.

In this paper, we propose an improved language model adaptation scheme that utilizes topic information obtained from first-pass recognition results using a novel topic inference algorithm for latent Dirichlet allocation (LDA) [5]. A mix of language models is constructed that best models this topic information; that is, a set of significant topic-specific language models (topic LMs) is selected, and their corresponding weights when interpolated with a background LM are set. The resulting mix is then used to find an adapted hypothesis. The principal difference between this and a recently proposed approach [6] is that where they adapt the background LM according to the LDA-inferred unigram distribution, the proposed method decomposes the background LM into topic LMs using utterance-level n-gram counts. In fact, the proposed method differs from all such approaches that directly manipulate the background LM according to some unigram distribution based on the adaptation text. This approach is also conceptually simpler than recent work using HMM-LDA [7], for example, in that no distinction is made between syntactic and semantic states.

In the next section, we describe the LDA framework for topic modeling and present the proposed inference algorithm, and in section 3 we outline the proposed language modeling scheme. Experimental results are presented in section 4, and our concluding remarks follow in section 5.
2. Latent Dirichlet Allocation

2.1. LDA Model Representation

LDA [5] is a generative, probabilistic model characterized by the two sets of parameters α and β, where α = [α_1 α_2 ··· α_k] represents the Dirichlet parameters for the k latent topics of the model, and β is a k × V matrix where each entry β_ij represents the unigram probability of the jth word in the V-word vocabulary under the ith latent topic.

Under the LDA modeling framework, the generative process for a document of N words can be summarized as follows. A vector for the topic mixture θ = [θ_1 θ_2 ··· θ_k] is first drawn from the Dirichlet distribution with probability

$$p(\theta|\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\,\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}, \qquad (1)$$

where Γ(x) denotes the gamma function. Then, a topic sequence z = [z_1 z_2 ··· z_N] is generated multinomially by θ with p(z_n = i|θ) = θ_i. Finally, each word w_n is chosen according to the probability p(w_n|z_n, β), which is equal to β_ij, where z_n = i and w_n is the jth word in the vocabulary. Figure 1 shows the generative process.

Figure 1: LDA topic sequence generation. The gray circles represent model parameters, while the white circles represent latent parameters.

The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w can thus be calculated by

$$p(\theta, z, w|\alpha, \beta) = p(\theta|\alpha)\prod_{n=1}^{N} p(z_n|\theta)\,p(w_n|z_n, \beta). \qquad (2)$$

Integrating over θ and summing over z, we obtain the marginal distribution of w:

$$p(w|\alpha, \beta) = \int p(\theta|\alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right) d\theta. \qquad (3)$$

2.2. LDA Topic Inference

We now illustrate how to infer the topic mixture θ given a set of words w. The inference requires the calculation of the marginal probability p(w|α, β). Unfortunately, the integration in eq. 3 is intractable due to the coupling between θ and w. Although a variational inference algorithm has been proposed to solve this problem [5], it still involves computationally expensive calculations such as the digamma function. To make topic inference faster and more efficient, we here propose an iterative algorithm to find the most suitable topic mixture θ̂ for the set of words w under the mean square error (MSE) criterion.

The probability p(w|α, β) in eq. 3 can be further interpreted as an expectation over θ:

$$p(w|\alpha, \beta) = E_\theta\!\left[\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right)\right]. \qquad (4)$$

Under the MSE criterion, the most suitable mixture θ̂ is the one that makes the probability of w equal to the expectation in eq. 4. That is, θ̂ is chosen such that

$$\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\hat{\theta})\,p(w_n|z_n, \beta)\right) = E_\theta\!\left[\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right)\right]. \qquad (5)$$

Because the Dirichlet distribution is continuous, we are guaranteed that such a θ̂ exists. Hence, we can calculate θ̂ using the following iteration.

First, we set the initial vector according to the prior information such that

$$\theta^{[0]} = \left[\frac{\alpha_1}{\alpha_{\mathrm{sum}}}\ \frac{\alpha_2}{\alpha_{\mathrm{sum}}}\ \cdots\ \frac{\alpha_k}{\alpha_{\mathrm{sum}}}\right], \qquad (6)$$

where α_sum = Σ_{i=1}^{k} α_i and α_i/α_sum = E[θ_i|α]. Let θ^[t] denote the mixture after t iterations. The posterior probability λ_ni that word w_n is drawn from topic i can be derived by

$$\lambda_{ni} = \frac{\theta_i^{[t]}\beta_{iw_n}}{\sum_{j=1}^{k}\theta_j^{[t]}\beta_{jw_n}}, \qquad (7)$$

where θ_i^[t] β_iw_n = p(z_n = i | θ^[t]) p(w_n|z_n, β). Because each word is an independent and equally reliable observation under the LDA model, the posterior probability of each word has equal weight in determining the topic mixture. Thus, the new vector θ^[t+1] can be calculated by

$$\theta_i^{[t+1]} = \frac{1}{N}\sum_{n=1}^{N}\lambda_{ni}. \qquad (8)$$

The iteration is continued until the mixture vector converges, resulting in the most suitable topic mixture for the word sequence w.
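To make the iteration in eqs. 6–8 concrete, the following is a minimal NumPy sketch of the proposed inference. The function and variable names (infer_topic_mixture, word_ids, tol) and the convergence threshold are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def infer_topic_mixture(word_ids, alpha, beta, max_iter=100, tol=1e-6):
    """Iterative LDA topic inference (eqs. 6-8).
    word_ids: array of vocabulary indices for the N observed words
    alpha:    Dirichlet parameters, shape (k,)
    beta:     topic-word probabilities, shape (k, V)
    Returns the converged topic mixture theta-hat, shape (k,)."""
    theta = alpha / alpha.sum()                     # eq. 6: start at the Dirichlet mean E[theta_i | alpha]
    for _ in range(max_iter):
        # eq. 7: lam[n, i] = theta_i * beta_{i,w_n} / sum_j theta_j * beta_{j,w_n}
        lam = theta[None, :] * beta[:, word_ids].T  # shape (N, k)
        lam /= lam.sum(axis=1, keepdims=True)
        new_theta = lam.mean(axis=0)                # eq. 8: average the per-word posteriors
        if np.abs(new_theta - theta).max() < tol:   # stop once the mixture has converged
            return new_theta
        theta = new_theta
    return theta

# Toy usage: 3 topics over a 5-word vocabulary.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 1.5])
beta = rng.dirichlet(np.ones(5), size=3)            # each row is one topic's unigram distribution
print(infer_topic_mixture(np.array([0, 3, 3, 1]), alpha, beta))
```

Each pass over the utterance costs only a handful of multiplications per word and topic, which is why this iteration avoids the digamma evaluations of the variational alternative.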

3. Language Modeling

3.1. Topic Corpus Construction and LM Training

Since LDA requires labelled training data, we first use PLSA to identify k latent topics in our training corpus. Given these topic labels, we train our LDA model, after which we proceed to assign each individual utterance in the training corpus to one of k topic corpora as follows: for each utterance, we infer the topic mixture θ, from which we choose the topic with the maximum weight, and append the utterance to this topic's corpus. We then use the resulting k topic-specific corpora to train each topic's trigram language model (the background trigram language model is trained on the entire training corpus). In our experiments, Good-Turing smoothing was used for all language models, and the SRI Language Modeling toolkit was used for all language model training and interpolation [8].

3.2. Language Model Adaptation

We perform utterance-level adaptation by inferring from each utterance u the LDA topic mixture θ_u and interpolating the m topic LMs corresponding to the top m weights in θ_u with the background LM. The background LM weight C_B is set to that which leads to the lowest overall perplexity on test data, and where θ_ui is topic i's weight in θ_u, the interpolation weight C_ti for topic i's LM is set to

$$C_{t_i} = (1 - C_B)\,\frac{\theta_{u_i}}{\sum_{j=1}^{m}\theta_{u_j}}.$$
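As an illustration of the corpus decomposition in section 3.1 and the weighting scheme in section 3.2, the following sketch reuses infer_topic_mixture from the section 2.2 example. The function names and the use of in-memory lists for topic corpora are our own assumptions; in the paper, the actual LM training and interpolation are carried out with SRILM [8].

```python
import numpy as np

# Assumes infer_topic_mixture(word_ids, alpha, beta) from the section 2.2 sketch is in scope.

def assign_to_topic_corpora(utterances, alpha, beta):
    """Sec. 3.1: append each training utterance to the corpus of the topic
    with the maximum weight in its inferred mixture."""
    corpora = [[] for _ in range(len(alpha))]
    for word_ids in utterances:
        theta = infer_topic_mixture(np.asarray(word_ids), alpha, beta)
        corpora[int(np.argmax(theta))].append(word_ids)
    # Each corpora[i] would then be written out and used to train topic i's
    # trigram LM (Good-Turing smoothed, via SRILM in the paper).
    return corpora

def adaptation_weights(theta_u, m, c_background):
    """Sec. 3.2: interpolation weights for the top-m topic LMs and the
    background LM, C_ti = (1 - C_B) * theta_ui / sum_{j<=m} theta_uj."""
    top = np.argsort(theta_u)[::-1][:m]            # indices of the m largest mixture weights
    c_topics = (1.0 - c_background) * theta_u[top] / theta_u[top].sum()
    return list(zip(top.tolist(), c_topics.tolist())), c_background
```

In this reading, the adapted model for utterance u is the linear interpolation C_B p_B(w|h) + Σ_i C_ti p_i(w|h) over the background and selected topic LMs.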
4. Experiments

For our training corpus, we used 20 months of text news supplied by Taiwan's Central News Agency (CNA), from January 2000 through August 2001. This corpus contains 245,417 stories, comprising 11,431,402 utterances (an utterance contains 5.8 words on average). For perplexity experiments, we used another 4 months of CNA text news from September through December 2001; this corpus contains 52,014 stories and 2,318,630 utterances. For recognition experiments, we used a random selection of 30 CNA-broadcast news stories from August and September 2002, comprising 261 utterances.

4.1. Topic LM Mixture Size Experiments

The results of our initial experiments are shown in Figure 2. Here, random sets of 6000 and 5000 utterances were selected from the training and testing corpora, respectively, for perplexity evaluation. The experiment was conducted as follows: for each utterance u, the LDA topic mixture θ_u was inferred, and then per-utterance perplexity was calculated for interpolated mixtures of the top m topics' language models, from m = 1 to m = 5. In these experiments, the background trigram LM was not included in the mixtures. The per-utterance perplexity of the entire set given the background trigram model was also calculated: 112.3 and 272.7 for training and testing, respectively. In addition, the experiment was conducted with two different sets of topics: 64 topics derived from those identified using PLSA, and 22 topics derived from those into which the Central News Agency categorized their news stories. Also, perplexities were calculated in a similar fashion but at the story level, to validate our utterance-based approach (the 64-topic PLSA-initiated per-story perplexity experiment was not run on the test set due to time constraints).

Figure 2: Perplexity for LDA-derived LM given topic mix size m. 'B' stands for the background trigram LM; 22 and 64 are topic counts; 'h' and 'P' stand for human- and PLSA-initiated; and 'utterance' and 'story' indicate per-utterance and per-story perplexities, respectively.

The results shown for the training set in Figure 2 seem to demonstrate the topical impurity of stories and the topical purity of utterances; that is, when topic mixtures are calculated by story, perplexities show a clear decreasing trend as the number of mixed-in language models (and thus the number of topics that are assumed to be represented by the story) increases. Conversely, when topic mixtures are calculated by utterance, perplexities show a clear rising trend as the number of mixed-in language models (and the number of topics that the utterance is assumed to represent) increases.

Results for the testing set, however, show not an increasing trend for per-utterance perplexities but a decreasing one: this could be evidence of overfitting. Still, there is a clear separation between per-story and per-utterance perplexities, as well as a large improvement for LDA-derived models with topic mixtures over the background model, and this improvement is consistent over both training and test sets.

In addition, the results shown for both sets seem to demonstrate the superiority of the 64 PLSA-initiated topics over the 22 human-initiated topics. The question here is whether this is an issue of PLSA-initiated versus human-initiated topics, or of topic granularity.
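For clarity, the per-utterance perplexity under a linearly interpolated mixture used in these experiments can be computed as in the sketch below. The lm.prob(word, history) interface is an assumption made for illustration, not the actual API of SRILM [8], which the paper uses in practice.

```python
import math

def mixture_perplexity(utterance, lms, weights):
    """Per-utterance perplexity under a linear mixture of language models.
    `lms` is a list of objects exposing prob(word, history) -> p(word | history);
    `weights` are the corresponding interpolation weights (summing to 1),
    e.g. the background and top-m topic weights from the earlier sketch."""
    log_prob = 0.0
    history = []
    for word in utterance:
        # For trigram LMs only the last two words of the history matter.
        p = sum(w * lm.prob(word, tuple(history[-2:])) for lm, w in zip(lms, weights))
        log_prob += math.log(p)
        history.append(word)
    return math.exp(-log_prob / len(utterance))
```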
4.2. Interpolation Weight Experiments

Figures 3 through 5 show the results of experiments to find the optimal weights when interpolating the background model with the mix of topic models.

Figure 3: Perplexity given background model weight C_B when combined with the top topic LM.

Figure 4: Optimal background model weight C_B given top topic mixture weight θ_i.

Figure 5: Optimal background model weight C_B given topic mix size m.

Figure 3 shows detailed perplexity results when mixing the background model with the single topic with the highest weight, and indicates that a little background goes a long way; setting the background model weight C_B at 0.3 results in a 39% improvement in perplexity over the background model alone (C_B = 1.0), and a 26% improvement over the topic model alone (C_B = 0).

Figure 4 shows the variation of the optimal background model weight C_B with the top topic's mixture weight θ_i obtained from LDA as discussed in section 2. Here we can see that the top topic's mixture weight θ_i can be viewed as a measure of confidence, or alternately of utility, for this particular inference: when θ_i is high, the background model is not needed as much as when θ_i is low.

Finally, Figure 5 shows the optimal background model weight C_B for 1-, 2-, 3- and 4-topic mix sizes. The first bar, for m = 1, is the result of C_B = 0.3 in Figure 3. The trend here is again that the more topics we add, the less we need the background model. Perplexity improvements over using only the background model in these cases ranged from 44% to 48%, although these results were obtained with a relatively small number of utterances and thus are less statistically trustworthy than the results for the single-topic case.

4.3. Speech Recognition Experiments

Figure 6 shows the best results (measured in character error rate, CER) of initial recognition tests in a two-pass framework performed on a development set of 261 utterances. First-pass results were used to infer topics, which were then used as the basis for language model adaptation in the second pass, as described in section 3.2.

Figure 6: Recognition rates given background LM weight C_B.

The 4-topic oracle experiment, in which reference transcripts were used for topic inference, yielded a 12.09% CER, a 13.0% relative improvement over the baseline (13.89% CER). However, for the 3-topic ASR experiment, where erroneous hypotheses were used for topic inference, we see an obvious mismatch between the optimal C_B as predicted by the perplexity experiments described in section 4.2 and that found when performing actual recognition experiments. Here, setting C_B to the recommended 0.15 resulted in a 14.51% CER, a 4.5% relative degradation, whereas setting C_B to 0.7 or 0.8 resulted in a 13.36% CER, a 3.8% relative improvement.

We attribute these results to the effect of ASR errors distorting the results of topic inference. Intuitively, when topic inferences are distorted, we should rely more on the background LM and less on the topic LMs dictated by the topic inference. Thus it makes sense that while oracle experiments show the best results at more aggressive (lower) C_B values, real ASR experiments show better results when using more conservative (higher) C_B values. These experiments show great potential for improvements to ASR results if ways can be found to improve robustness when inferring topic mixtures.

5. Conclusions and Future Work

We have described a high-performance, unsupervised mechanism for topic mixture-based language model adaptation using LDA, and have proposed a novel decomposition of the training corpus for fine-grained topic LM estimation that results in improved perplexity and recognition accuracy. We have also presented a novel topic inference algorithm for LDA that is faster than the variational alternative.

In the future, our efforts will focus on improving robustness to ASR errors by performing topic inference on n-best lists instead of just 1-best hypotheses, by using adaptation segments larger than a single utterance, and by finding run-time indicators for optimal C_B values instead of using a static C_B for all utterances.

6. References

[1] R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go From Here?," Proceedings of the IEEE, 2000, pp. 1270–1278.

[2] R. Kuhn and R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 570–583, 1990.

[3] R. Rosenfeld, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling," Computer Speech and Language, pp. 187–228, 1996.

[4] R. Iyer and M. Ostendorf, "Modeling Long Distance Dependency in Language: Topic Mixtures vs. Dynamic Cache Models," in Proceedings of ICSLP, 1996, pp. 236–239.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, pp. 993–1022, 2003.

[6] Y.-C. Tam and T. Schultz, "Unsupervised Language Model Adaptation Using Latent Semantic Marginals," in Proceedings of ICSLP, 2006, pp. 2206–2209.

[7] B. J. Hsu and J. Glass, "Style & Topic Language Model Adaptation Using HMM-LDA," in Proceedings of EMNLP, 2006.

[8] A. Stolcke, "SRILM – An Extensible Language Modeling Toolkit," in Proceedings of ICSLP, 2002, pp. 901–904.