Language Model Adaptation Using Latent Dirichlet Allocation and an Efficient Topic Inference Algorithm

Aaron Heidel1, Hung-an Chang2, and Lin-shan Lee1

1 Dept. of Computer Science & Information Engineering, National Taiwan University, Taipei, Taiwan, Republic of China
2 Spoken Language Systems Group, CSAIL, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

[email protected], hung [email protected], [email protected]

Abstract

We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpolation with a background language model during language model adaptation. We also present a novel iterative algorithm for LDA topic inference. Very encouraging results were obtained in preliminary experiments with broadcast news in Mandarin Chinese.

Index Terms: language model, unsupervised adaptation, topic modeling

1. Introduction

Statistical n-gram-based language models have long been used to estimate probabilities for the next word given the preceding word history. Although they have proven to work extremely well, researchers quickly noticed signs that their performance improvements given increasingly large corpora were beginning to asymptote [1]. Many attempts, therefore, have been made to compensate for their most notable weakness: no consideration of long-distance dependencies, and no understanding of semantics.

One of the earliest such attempts was the cache-based technique, which took advantage of the generally observed "burstiness" of words by increasing the probability of words in the history when predicting the next word [2]. This technique was then generalized using trigger pairs, in which the observation of certain "trigger" words increases the probability of seeing correlated words; furthermore, a maximum entropy approach was used to combine the probability estimates of multiple triggers into a reasonable joint estimate [3].

Another well-known approach is the sentence-level mixture model, which used topics identified from a heterogeneous training corpus by automatic clustering [4]. Improvements were demonstrated in both perplexity and recognition accuracy over an unadapted trigram language model.

In this paper, we propose an improved language model adaptation scheme that utilizes topic information obtained from first-pass recognition results using a novel topic inference algorithm for latent Dirichlet allocation (LDA) [5]. A mix of language models is constructed that best models this topic information; that is, a set of significant topic-specific language models (topic LMs) is selected, and their corresponding weights when interpolated with a background LM are set. The resulting mix is then used to find an adapted hypothesis. The principal difference between this and a recently proposed approach [6] is that where they adapt the background LM according to the LDA-inferred unigram distribution, the proposed method decomposes the background LM into topic LMs using utterance-level n-gram counts. In fact, the proposed method differs from all such approaches that directly manipulate the background LM according to some unigram distribution based on the adaptation text. This approach is also conceptually simpler than recent work using HMM-LDA [7], for example, in that no distinction is made between syntactic and semantic states.

In the next section, we describe the LDA framework for topic modeling and present the proposed inference algorithm, and in section 3 we outline the proposed language modeling scheme. Experimental results are presented in section 4, and our concluding remarks follow in section 5.
2. Latent Dirichlet Allocation

2.1. LDA Model Representation

LDA [5] is a generative, probabilistic model characterized by the two sets of parameters α and β, where α = [α_1 α_2 ··· α_k] represents the Dirichlet parameters for the k latent topics of the model, and β is a k × V matrix where each entry β_ij represents the unigram probability of the jth word in the V-word vocabulary under the ith latent topic.

Under the LDA modeling framework, the generative process for a document of N words can be summarized as follows. A vector for the topic mixture θ = [θ_1 θ_2 ··· θ_k] is first drawn from the Dirichlet distribution with probability

$$p(\theta|\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\,\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}, \qquad (1)$$

where Γ(x) denotes the gamma function. Then, a topic sequence z = [z_1 z_2 ··· z_N] is generated multinomially by θ with p(z_n = i|θ) = θ_i. Finally, each word w_n is chosen according to the probability p(w_n|z_n, β), which is equal to β_ij, where z_n = i and w_n is the jth word in the vocabulary. Figure 1 shows the generative process.

Figure 1: LDA topic sequence generation. The gray circles represent model parameters, while the white circles represent latent parameters.

The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w can thus be calculated by

$$p(\theta, z, w|\alpha, \beta) = p(\theta|\alpha)\prod_{n=1}^{N} p(z_n|\theta)\,p(w_n|z_n, \beta). \qquad (2)$$

Integrating over θ and summing over z, we obtain the marginal distribution of w:

$$p(w|\alpha, \beta) = \int p(\theta|\alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right) d\theta. \qquad (3)$$

2.2. LDA Topic Inference

We now illustrate how to infer the topic mixture θ given a set of words w. The inference requires the calculation of the marginal probability p(w|α, β). Unfortunately, the integration in eq. 3 is intractable due to the coupling between θ and w. Although a variational inference algorithm has been proposed to solve this problem [5], it still involves computationally expensive calculations such as the digamma function. To make topic inference faster and more efficient, we here propose an iterative algorithm to find the most suitable topic mixture θ̂ for the set of words w under the mean square error (MSE) criterion.

The probability p(w|α, β) in eq. 3 can be further interpreted as an expectation over θ:

$$p(w|\alpha, \beta) = E_\theta\!\left[\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right)\right]. \qquad (4)$$

Under the MSE criterion, the most suitable mixture θ̂ is the one that makes the probability of w equal to the expectation in eq. 4. That is, θ̂ is chosen such that

$$\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\hat{\theta})\,p(w_n|z_n, \beta)\right) = E_\theta\!\left[\prod_{n=1}^{N}\left(\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\right)\right]. \qquad (5)$$

Because the Dirichlet distribution is continuous, we are guaranteed that such a θ̂ exists. Hence, we can calculate θ̂ using the following iteration.

First, we set the initial vector according to the prior information such that

$$\theta^{[0]} = \left[\frac{\alpha_1}{\alpha_{\mathrm{sum}}}\ \frac{\alpha_2}{\alpha_{\mathrm{sum}}}\ \cdots\ \frac{\alpha_k}{\alpha_{\mathrm{sum}}}\right], \qquad (6)$$

where α_sum = Σ_{i=1}^{k} α_i and α_i/α_sum = E[θ_i|α]. Let θ^[t] denote the mixture after t iterations. The posterior probability λ_ni that word w_n is drawn from topic i can be derived by

$$\lambda_{ni} = \frac{\theta_i^{[t]}\beta_{iw_n}}{\sum_{j=1}^{k}\theta_j^{[t]}\beta_{jw_n}}, \qquad (7)$$

where θ_i^[t] β_iw_n = p(z_n = i | θ^[t]) p(w_n|z_n, β). Because each word is an independent and equally reliable observation under the LDA model, the posterior probability of each word has equal weight in determining the topic mixture. Thus, the new vector θ^[t+1] can be calculated by

$$\theta_i^{[t+1]} = \frac{1}{N}\sum_{n=1}^{N}\lambda_{ni}. \qquad (8)$$

The iteration is continued until the mixture vector converges, resulting in the most suitable topic mixture for the word sequence w.
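To make the iteration in eqs. 6–8 concrete, the following is a minimal NumPy sketch of the proposed inference. The function and variable names (infer_topic_mixture, word_ids, tol) and the convergence threshold are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def infer_topic_mixture(word_ids, alpha, beta, max_iter=100, tol=1e-6):
    """Iterative LDA topic inference (eqs. 6-8).
    word_ids: array of vocabulary indices for the N observed words
    alpha:    Dirichlet parameters, shape (k,)
    beta:     topic-word probabilities, shape (k, V)
    Returns the converged topic mixture theta-hat, shape (k,)."""
    theta = alpha / alpha.sum()                     # eq. 6: start at the Dirichlet mean E[theta_i | alpha]
    for _ in range(max_iter):
        # eq. 7: lam[n, i] = theta_i * beta_{i,w_n} / sum_j theta_j * beta_{j,w_n}
        lam = theta[None, :] * beta[:, word_ids].T  # shape (N, k)
        lam /= lam.sum(axis=1, keepdims=True)
        new_theta = lam.mean(axis=0)                # eq. 8: average the per-word posteriors
        if np.abs(new_theta - theta).max() < tol:   # stop once the mixture has converged
            return new_theta
        theta = new_theta
    return theta

# Toy usage: 3 topics over a 5-word vocabulary.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 1.5])
beta = rng.dirichlet(np.ones(5), size=3)            # each row is one topic's unigram distribution
print(infer_topic_mixture(np.array([0, 3, 3, 1]), alpha, beta))
```

Each pass over the utterance costs only a handful of multiplications per word and topic, which is why this iteration avoids the digamma evaluations of the variational alternative.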

3. Language Modeling

3.1. Topic Corpus Construction and LM Training

Since LDA requires labelled training data, we first use PLSA to identify k latent topics in our training corpus. Given these topic labels, we train our LDA model, after which we proceed to assign each individual utterance in the training corpus to one of k topic corpora as follows: for each utterance, we infer the topic mixture θ, from which we choose the topic with the maximum weight, and append the utterance to this topic's corpus. We then use the resulting k topic-specific corpora to train each topic's trigram language model (the background trigram language model is trained on the entire training corpus). In our experiments, Good-Turing smoothing was used for all language models, and the SRI Language Modeling toolkit was used for all language model training and interpolation [8].

3.2. Language Model Adaptation

We perform utterance-level adaptation by inferring from each utterance u the LDA topic mixture θ_u and interpolating the m topic LMs corresponding to the top m weights in θ_u with the background LM. The background LM weight C_B is set to that which leads to the lowest overall perplexity on test data, and where θ_ui is topic i's weight in θ_u, the interpolation weight C_ti for topic i's LM is set to

$$C_{t_i} = (1 - C_B)\,\frac{\theta_{u_i}}{\sum_{j=1}^{m}\theta_{u_j}}.$$
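As an illustration of the corpus decomposition in section 3.1 and the weighting scheme in section 3.2, the following sketch reuses infer_topic_mixture from the section 2.2 example. The function names and the use of in-memory lists for topic corpora are our own assumptions; in the paper, the actual LM training and interpolation are carried out with SRILM [8].

```python
import numpy as np

# Assumes infer_topic_mixture(word_ids, alpha, beta) from the section 2.2 sketch is in scope.

def assign_to_topic_corpora(utterances, alpha, beta):
    """Sec. 3.1: append each training utterance to the corpus of the topic
    with the maximum weight in its inferred mixture."""
    corpora = [[] for _ in range(len(alpha))]
    for word_ids in utterances:
        theta = infer_topic_mixture(np.asarray(word_ids), alpha, beta)
        corpora[int(np.argmax(theta))].append(word_ids)
    # Each corpora[i] would then be written out and used to train topic i's
    # trigram LM (Good-Turing smoothed, via SRILM in the paper).
    return corpora

def adaptation_weights(theta_u, m, c_background):
    """Sec. 3.2: interpolation weights for the top-m topic LMs and the
    background LM, C_ti = (1 - C_B) * theta_ui / sum_{j<=m} theta_uj."""
    top = np.argsort(theta_u)[::-1][:m]            # indices of the m largest mixture weights
    c_topics = (1.0 - c_background) * theta_u[top] / theta_u[top].sum()
    return list(zip(top.tolist(), c_topics.tolist())), c_background
```

In this reading, the adapted model for utterance u is the linear interpolation C_B p_B(w|h) + Σ_i C_ti p_i(w|h) over the background and selected topic LMs.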
4. Experiments

For our training corpus, we used 20 months of text news supplied by Taiwan's Central News Agency (CNA), from January 2000 through August 2001. This corpus contains 245,417 stories, comprising 11,431,402 utterances (an utterance contains 5.8 words on average). For perplexity experiments, we used another 4 months of CNA text news from September through December 2001; this corpus contains 52,014 stories and 2,318,630 utterances. For recognition experiments, we used a random selection of 30 CNA-broadcast news stories from August and September 2002, comprising 261 utterances.

4.1. Topic LM Mixture Size Experiments

The results of our initial experiments are shown in Figure 2. Here, random sets of 6000 and 5000 utterances were selected from the training and testing corpora, respectively, for perplexity evaluation. The experiment was conducted as follows: for each utterance u, the LDA topic mixture θ_u was inferred, and then per-utterance perplexity was calculated for interpolated mixtures of the top m topics' language models, from m = 1 to m = 5. In these experiments, the background trigram LM was not included in the mixtures. The per-utterance perplexity of the entire set given the background trigram model was also calculated: 112.3 and 272.7 for training and testing, respectively. In addition, the experiment was conducted with two different sets of topics: 64 topics derived from those identified using PLSA, and 22 topics derived from those into which the Central News Agency categorized their news stories. Also, perplexities were calculated in a similar fashion but at the story level, to validate our utterance-based approach (the 64-topic PLSA-initiated per-story perplexity experiment was not run on the test set due to time constraints).

Figure 2: Perplexity for LDA-derived LM given topic mix size m. 'B' stands for the background trigram LM; 22 and 64 are topic counts; 'h' and 'P' stand for human- and PLSA-initiated; and 'utterance' and 'story' indicate per-utterance and per-story perplexities, respectively.

The results shown for the training set in Figure 2 seem to demonstrate the topical impurity of stories and the topical purity of utterances; that is, when topic mixtures are calculated by story, perplexities show a clear decreasing trend as the number of mixed-in language models (and thus the number of topics that are assumed to be represented by the story) increases. Conversely, when topic mixtures are calculated by utterance, perplexities show a clear rising trend as the number of mixed-in language models (and the number of topics that the utterance is assumed to represent) increases.

Results for the testing set, however, show not an increasing trend for per-utterance perplexities but a decreasing one: this could be evidence of overfitting. Still, there is a clear separation between per-story and per-utterance perplexities, as well as a large improvement for LDA-derived models with topic mixtures over the background model, and this improvement is consistent over both training and test sets.

In addition, the results shown for both sets seem to demonstrate the superiority of the 64 PLSA-initiated topics over the 22 human-initiated topics. The question here is whether this is an issue of PLSA-initiated versus human-initiated topics, or of topic granularity.
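For clarity, the per-utterance perplexity under a linearly interpolated mixture used in these experiments can be computed as in the sketch below. The lm.prob(word, history) interface is an assumption made for illustration, not the actual API of SRILM [8], which the paper uses in practice.

```python
import math

def mixture_perplexity(utterance, lms, weights):
    """Per-utterance perplexity under a linear mixture of language models.
    `lms` is a list of objects exposing prob(word, history) -> p(word | history);
    `weights` are the corresponding interpolation weights (summing to 1),
    e.g. the background and top-m topic weights from the earlier sketch."""
    log_prob = 0.0
    history = []
    for word in utterance:
        # For trigram LMs only the last two words of the history matter.
        p = sum(w * lm.prob(word, tuple(history[-2:])) for lm, w in zip(lms, weights))
        log_prob += math.log(p)
        history.append(word)
    return math.exp(-log_prob / len(utterance))
```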
4.2. Interpolation Weight Experiments

Figures 3 through 5 show the results of experiments to find the optimal weights when interpolating the background model with the mix of topic models.

Figure 3: Perplexity given background model weight C_B when combined with the top topic LM.

Figure 4: Optimal background model weight C_B given top topic mixture weight θ_i.

Figure 5: Optimal background model weight C_B given topic mix size m.

Figure 3 shows detailed perplexity results when mixing the background model with the single topic with the highest weight, and indicates that a little background goes a long way; setting the background model weight C_B at 0.3 results in a 39% improvement in perplexity over the background model alone (C_B = 1.0), and a 26% improvement over the topic model alone (C_B = 0).

Figure 4 shows the variation of the optimal background model weight C_B with the top topic's mixture weight θ_i obtained from LDA as discussed in section 2. Here we can see that the top topic's mixture weight θ_i can be viewed as a measure of confidence, or alternately of utility, for this particular inference: when θ_i is high, the background model is not needed as much as when θ_i is low.

Finally, Figure 5 shows the optimal background model weight C_B for 1-, 2-, 3- and 4-topic mix sizes. The first bar, for m = 1, is the result of C_B = 0.3 in Figure 3. The trend here is again that the more topics we add, the less we need the background model. Perplexity improvements over using only the background model in these cases ranged from 44% to 48%, although these results were obtained with a relatively small number of utterances and thus are less statistically trustworthy than the results for the single-topic case.

4.3. Speech Recognition Experiments

Figure 6 shows the best results (measured in character error rate, CER) of initial recognition tests in a two-pass framework performed on a development set of 261 utterances. First-pass results were used to infer topics, which were then used as the basis for language model adaptation in the second pass, as described in section 3.2.

Figure 6: Recognition rates given background LM weight C_B.

The 4-topic oracle experiment, in which reference transcripts were used for topic inference, yielded a 12.09% CER, a 13.0% relative improvement over the baseline (13.89% CER). However, for the 3-topic ASR experiment, where erroneous hypotheses were used for topic inference, we see an obvious mismatch between the optimal C_B as predicted by the perplexity experiments described in section 4.2 and that found when performing actual recognition experiments. Here, setting C_B to the recommended 0.15 resulted in a 14.51% CER, a 4.5% relative degradation, whereas setting C_B to 0.7 or 0.8 resulted in a 13.36% CER, a 3.8% relative improvement.

We attribute these results to the effect of ASR errors distorting the results of topic inference. Intuitively, when topic inferences are distorted, we should rely more on the background LM and less on the topic LMs dictated by the topic inference. Thus it makes sense that while oracle experiments show the best results at more aggressive (lower) C_B values, real ASR experiments show better results when using more conservative (higher) C_B values. These experiments show great potential for improvements to ASR results if ways can be found to improve robustness when inferring topic mixtures.

5. Conclusions and Future Work

We have described a high-performance, unsupervised mechanism for topic mixture-based language model adaptation using LDA, and have proposed a novel decomposition of the training corpus for fine-grained topic LM estimation that results in improved perplexity and recognition accuracy. We have also presented a novel topic inference algorithm for LDA that is faster than the variational alternative.

In the future, our efforts will focus on improving robustness to ASR errors by performing topic inference on n-best lists instead of just 1-best hypotheses, by using adaptation segments larger than a single utterance, and by finding run-time indicators for optimal C_B values instead of using a static C_B for all utterances.

6. References

[1] R. Rosenfeld, "Two Decades of Statistical Language Modeling: Where Do We Go From Here?," Proceedings of the IEEE, 2000, pp. 1270–1278.

[2] R. Kuhn and R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 570–583, 1990.

[3] R. Rosenfeld, "A Maximum Entropy Approach to Adaptive Statistical Language Modeling," Computer Speech and Language, pp. 187–228, 1996.

[4] R. Iyer and M. Ostendorf, "Modeling Long Distance Dependency in Language: Topic Mixtures vs. Dynamic Cache Models," in Proceedings of ICSLP, 1996, pp. 236–239.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, pp. 993–1022, 2003.

[6] Y.-C. Tam and T. Schultz, "Unsupervised Language Model Adaptation Using Latent Semantic Marginals," in Proceedings of ICSLP, 2006, pp. 2206–2209.

[7] B. J. Hsu and J. Glass, "Style & Topic Language Model Adaptation Using HMM-LDA," in Proceedings of EMNLP, 2006.

[8] A. Stolcke, "SRILM – An Extensible Language Modeling Toolkit," in Proceedings of ICSLP, 2002, pp. 901–904.