
PLSA ENHANCED WITH A LONG-DISTANCE BIGRAM FOR SPEECH RECOGNITION

Md. Akmal Haidar and Douglas O'Shaughnessy

INRS-EMT, 6900-800 de la Gauchetiere Ouest, Montreal (Quebec), H5A 1K6, Canada

ABSTRACT

We propose a language modeling (LM) approach using background n-grams and interpolated distanced n-grams for speech recognition using an enhanced probabilistic latent semantic analysis (EPLSA) derivation. PLSA is a bag-of-words model that exploits topic information at the document level, which is inconsistent with language modeling for speech recognition. In this paper, we consider the word sequence in modeling the EPLSA model. Here, the predicted word of an n-gram event is drawn from a topic that is chosen from the topic distribution of the (n-1) history words. The EPLSA model cannot capture long-range topic information from outside of the n-gram event. The distanced n-grams are incorporated in interpolated form (IEPLSA) to cover the long-range information. A cache-based LM that models re-occurring words is also incorporated through unigram scaling into the EPLSA and IEPLSA models, which model the topical words. Our proposed approaches yield significant reductions in perplexity and word error rate (WER) over a PLSA-based LM approach on the Wall Street Journal (WSJ) corpus.

Index Terms— language model, topic model, speech recognition, cache-based LM, long-distance n-grams

1. INTRODUCTION

Statistical n-gram LMs play a vital role in speech recognition and many other applications. They use local context information by modeling text as a Markovian sequence. However, n-gram LMs suffer from a shortage of long-range information, which degrades performance. The cache-based LM was one of the earliest efforts to capture long-range information: the model increases the probability of words that appear earlier in a document when predicting the next word [1]. Recently, latent topic modeling techniques have been used broadly for topic-based language modeling to compensate for the weaknesses of n-gram LMs. Several techniques such as Latent Semantic Analysis (LSA) [2], PLSA [3], and latent Dirichlet allocation (LDA) [4] have been studied to extract latent semantic information from a training corpus. All these methods are based on a bag-of-words assumption. LSA performs word-document matrix decomposition to extract the semantic information for different words and documents. In PLSA and LDA, the semantic properties of words and documents are represented as probabilistic topics. The PLSA latent topic parameters are trained by maximizing the likelihood of the training data using an expectation-maximization (EM) procedure, and the model has been used successfully for speech recognition [3, 5]. The LDA model has likewise been used successfully in recent work on LM adaptation [6, 7]. A bigram LDA topic model, in which the word probabilities are conditioned on their preceding context and the topic probabilities are conditioned on the documents, was investigated recently [8]. A similar model in the PLSA framework, called a bigram PLSA model, was also introduced recently [9]. An updated bigram PLSA model was proposed in [10], where the topic is further conditioned on the bigram history context. A topic-based language model was proposed in which the topic information is obtained from the n-gram history through a Dirichlet distribution [11] and from the long-distance history (topic cache) through multinomial distributions [12].

In [13], a PLSA technique enhanced with long-distance bigrams was used to incorporate long-term word dependencies in determining word clusters. This motivates us to present LM approaches for speech recognition using distanced n-grams. In this paper, we use the default n-grams with an enhanced PLSA derivation to form the EPLSA n-gram model. Here, the observed n-gram events contain the history words and the predicted word. The EPLSA model extracts topic information from the history words, and the current word is then predicted based on the topic information of those history words.
However, the EPLSA model does not capture topic information from outside of the n-gram events. We propose interpolated distanced n-grams (IEPLSA) and a cache-based model to capture long-term word dependencies in the EPLSA model. The n-gram probabilities of the IEPLSA model are computed by mixing the component distanced word probabilities for the topics and the interpolated topic information for the histories. Furthermore, a cache-based LM is incorporated into the EPLSA and IEPLSA models, as the cache-based LM models a different part of the language than the EPLSA/IEPLSA models.

The rest of this paper is organized as follows. Section 2 reviews PLSA. The proposed EPLSA and IEPLSA models are described in section 3. In section 4, a comparison of the PLSA, bigram PLSA, EPLSA and IEPLSA models is presented. The unigram scaling of the cache-based model into the topic models is explained in section 5. Section 6 describes the experiments. Finally, conclusions are given in section 7.

2. PLSA MODEL

The PLSA model [3] can be described by the following procedure. First a document D_j (j = 1, 2, ..., N) is selected with probability P(D_j). A topic z_k (k = 1, 2, ..., K) is then chosen with probability P(z_k|D_j), and finally a word w_i (i = 1, 2, ..., V) is generated with probability P(w_i|z_k). Here, the observed variables are w_i and D_j, whereas the unobserved variable is z_k. The joint distribution of the observed data can be described as:

P(D_j, w_i) = P(D_j) P(w_i|D_j) = P(D_j) \sum_{k=1}^{K} P(z_k|D_j) P(w_i|z_k),   (1)

where the word probability P(w_i|D_j) can be computed as:

P(w_i|D_j) = \sum_{k=1}^{K} P(z_k|D_j) P(w_i|z_k).   (2)

The model parameters P(w_i|z_k) and P(z_k|D_j) are computed by using the EM algorithm [3].
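
As an illustration of Eqs. (1)-(2) and the EM training mentioned above, the following sketch performs one PLSA EM iteration on a word-document count matrix. It is a minimal NumPy reconstruction of the standard updates, not the authors' implementation (the paper's models were trained in MATLAB); all variable names are our own.

```python
import numpy as np

def plsa_em_step(counts, P_w_given_z, P_z_given_d):
    """One EM iteration of PLSA (cf. Eqs. 1-2); a minimal sketch, not the authors' code.

    counts:       (N, V) word-document counts n(D_j, w_i)
    P_w_given_z:  (K, V) topic-word probabilities P(w_i | z_k)
    P_z_given_d:  (N, K) document-topic probabilities P(z_k | D_j)
    """
    # E-step: posterior P(z_k | D_j, w_i), normalized over topics
    joint = P_z_given_d[:, :, None] * P_w_given_z[None, :, :]            # (N, K, V)
    posterior = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)

    # M-step: re-estimate both distributions from the expected counts
    expected = counts[:, None, :] * posterior                            # (N, K, V)
    P_w_given_z = expected.sum(axis=0)
    P_w_given_z /= np.maximum(P_w_given_z.sum(axis=1, keepdims=True), 1e-12)
    P_z_given_d = expected.sum(axis=2)
    P_z_given_d /= np.maximum(P_z_given_d.sum(axis=1, keepdims=True), 1e-12)
    return P_w_given_z, P_z_given_d
```

Iterating this step until the training likelihood converges yields the parameters P(w_i|z_k) and P(z_k|D_j) used in Eq. (2).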
3. PROPOSED EPLSA AND IEPLSA MODELS

3.1. EPLSA

Representing a document D_j as a sequence of words, the joint distribution of the document and the previous (n-1) history words h of the current word w_i can be described as [13]:

P(D_j, h) = P(h) \prod_{w_i \in D_j} P_d(w_i|h),   (3)

where P_d(w_i|h) is the distanced n-gram model. Here, d represents the distance between the words in the n-grams. Therefore, the probability P_d(w_i|h) can be computed similar to the PLSA derivation [3, 13]. For d = 1, P_d(w_i|h) is the default background n-gram and we define it as the enhanced PLSA (EPLSA) model. The graphical model of the EPLSA model is shown in Figure 1.

Fig. 1. The graphical model of the EPLSA model (h -> z -> w_i). The shaded circle represents the observed variables. H and V describe the number of histories and the size of the vocabulary.

The equations for the EPLSA model are:

P_{EPLSA}(w_i|h) = \sum_{k=1}^{K} P(w_i|z_k) P(z_k|h).   (4)

The parameters of the model are computed using the EM algorithm as:

E-step:

P(z_k|h, w_i) = \frac{P(w_i|z_k) P(z_k|h)}{\sum_{k'=1}^{K} P(w_i|z_{k'}) P(z_{k'}|h)},   (5)

M-step:

P(w_i|z_k) = \frac{\sum_h n(h, w_i) P(z_k|h, w_i)}{\sum_{i'} \sum_h n(h, w_{i'}) P(z_k|h, w_{i'})},   (6)

P(z_k|h) = \frac{\sum_{i'} n(h, w_{i'}) P(z_k|h, w_{i'})}{\sum_{k'} \sum_{i'} n(h, w_{i'}) P(z_{k'}|h, w_{i'})}.   (7)
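
To make Eq. (4) concrete, the sketch below evaluates the EPLSA word distribution for one history as a topic mixture; the EM updates in Eqs. (5)-(7) have the same shape as the PLSA updates above, with documents replaced by (n-1)-word histories. The matrix layout and names are our own illustrative assumptions.

```python
import numpy as np

def eplsa_word_distribution(hist_id, P_w_given_z, P_z_given_h):
    """P_EPLSA(. | h) for one history (Eq. 4): a mixture of topic-word distributions.

    P_w_given_z: (K, V) topic-word probabilities P(w_i | z_k)
    P_z_given_h: (H, K) history-topic probabilities P(z_k | h)
    """
    # (K,) @ (K, V) -> (V,): sum_k P(z_k | h) * P(w_i | z_k)
    return P_z_given_h[hist_id] @ P_w_given_z

# Toy check: a mixture of distributions is itself a distribution over the vocabulary.
K, V, H = 3, 5, 4
rng = np.random.default_rng(0)
P_w_given_z = rng.dirichlet(np.ones(V), size=K)   # each row sums to 1
P_z_given_h = rng.dirichlet(np.ones(K), size=H)   # each row sums to 1
assert abs(eplsa_word_distribution(2, P_w_given_z, P_z_given_h).sum() - 1.0) < 1e-9
```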

3.2. IEPLSA

The EPLSA model does not capture the long-distance information. To incorporate the long-range characteristics, we used the distanced n-grams in the EPLSA model. Incorporating the interpolated distanced n-grams in the EPLSA, the model can be written as [13]:

P_{IEPLSA}(w_i|h) = \sum_{k=1}^{K} \Big[ \sum_d \lambda_d P_d(w_i|z_k) \Big] P(z_k|h),   (8)

where the \lambda_d are the weights for each component probability, estimated on held-out data using the EM algorithm, and P_d(w_i|z_k) are the word probabilities for topic z_k obtained by using the distanced n-grams in the IEPLSA training. d represents the distance between words in the n-gram events; d = 1 describes the default n-grams. For example, the distanced n-grams of the phrase "Speech in Life Sciences and Human Societies" are described in Table 1 for the distances d = 1, 2.

Table 1. Distanced n-grams for the phrase "Speech in Life Sciences and Human Societies"

Distance   Bigrams                               Trigrams
d=1        Speech in, in Life, Life Sciences,    Speech in Life, in Life Sciences,
           Sciences and, and Human,              Life Sciences and, Sciences and
           Human Societies                       Human, and Human Societies
d=2        Speech Life, in Sciences, Life and,   Speech Life and, in Sciences
           Sciences Human, and Societies         Human, Life and Societies
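
The distance-d n-gram extraction behind Table 1 can be sketched as follows; this is our reading of the distanced n-grams of [13] (successive words of an event taken d positions apart), with hypothetical function names, and it reproduces the rows of Table 1.

```python
def distanced_ngrams(words, n, d):
    """Distance-d n-grams as in Table 1: successive event words are d positions apart.

    d = 1 gives the default n-grams.
    """
    span = (n - 1) * d      # offset between the first and last word of one event
    return [tuple(words[i + j * d] for j in range(n))
            for i in range(len(words) - span)]

phrase = "Speech in Life Sciences and Human Societies".split()
print(distanced_ngrams(phrase, 2, 1))   # Speech in, in Life, ..., Human Societies
print(distanced_ngrams(phrase, 2, 2))   # Speech Life, in Sciences, ..., and Societies
print(distanced_ngrams(phrase, 3, 2))   # Speech Life and, in Sciences Human, Life and Societies
```

Counting such events gives the statistics n_d(h, w_i) that appear in the IEPLSA EM updates below.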
The parameters of the IEPLSA model can be computed as:

E-step:

P_d(z_k|h, w_i) = \frac{P_d(w_i|z_k) P(z_k|h)}{\sum_{k'=1}^{K} P_d(w_i|z_{k'}) P(z_{k'}|h)},   (9)

M-step:

P_d(w_i|z_k) = \frac{\sum_h n_d(h, w_i) P_d(z_k|h, w_i)}{\sum_{i'} \sum_h n_d(h, w_{i'}) P_d(z_k|h, w_{i'})},   (10)

P(z_k|h) = \frac{\sum_{i'} \sum_d \lambda_d n_d(h, w_{i'}) P_d(z_k|h, w_{i'})}{\sum_{k'} \sum_{i'} \sum_d \lambda_d n_d(h, w_{i'}) P_d(z_{k'}|h, w_{i'})}.   (11)

4. COMPARISON OF PLSA, BIGRAM PLSA AND EPLSA/IEPLSA

PLSA [3] is a bag-of-words model where the document probability is computed by using the topic structure at the document level. This is inappropriate for the language model in speech recognition. Bigram PLSA models were introduced in which the bigram probabilities for each topic are modeled and the topic is conditioned on the document [9] or on the bigram history and the document [10]. In either approach, the models require V distributions for each topic, where V is the size of the vocabulary; therefore, the number of parameters grows exponentially with increasing n-gram order. In contrast, the EPLSA/IEPLSA models build the word distributions given the history words. The history information is used to form the topic distributions, and the probability of the predicted word is then computed given the topic information of the histories. Therefore, the number of parameters grows linearly with V [11].

5. INCORPORATING THE CACHE MODEL THROUGH UNIGRAM SCALING

A cache-based language model is used to increase the probability of words appearing in a document that are likely to re-occur in the same document. The unigram cache model for a given history h_c = w_{i-M}, ..., w_i, where M is the cache size, is defined as:

P_{cache}(w_i) = \frac{n(w_i, h_c)}{n(h_c)},   (12)

where n(w_i, h_c) is the number of occurrences of the word w_i within h_c and n(h_c) <= M is the number of words within h_c that belong to the vocabulary V [14].

The EPLSA/IEPLSA models capture topical words. The models are therefore interpolated with a background n-gram model to capture the local lexical regularities as:

P_L(w_i|h) = (1 - \gamma) P_{EPLSA/IEPLSA}(w_i|h) + \gamma P_{Background}(w_i|h).   (13)

As the cache-based LM (which models re-occurring words) captures a different part of the language than the background model (which models short-range information) and the EPLSA/IEPLSA models (which model topical words), we can integrate the cache model to adapt P_L(w_i|h) through unigram scaling as [15, 16]:

P_{Adapt}(w_i|h) = \frac{P_L(w_i|h) \delta(w_i)}{Z(h)},   (14)

with

Z(h) = \sum_{w_i} \delta(w_i) P_L(w_i|h),   (15)

where Z(h) is a normalization term that guarantees that the total probability sums to unity, P_L(w_i|h) is the interpolated model of the background and the EPLSA/IEPLSA model, and \delta(w_i) is a scaling factor that is usually approximated as:

\delta(w_i) \approx \left( \frac{\alpha P_{cache}(w_i) + (1 - \alpha) P_{Background}(w_i)}{P_{Background}(w_i)} \right)^{\beta},   (16)

where \beta is a tuning factor between 0 and 1; in our experiments we set \beta = 1. We used the same procedure as [15] to compute the normalization term. To do this, an additional constraint is employed so that the total probability of the observed transitions is unchanged:

\sum_{w_i: (h, w_i) observed} P_{Adapt}(w_i|h) = \sum_{w_i: (h, w_i) observed} P_L(w_i|h).

The model P_L(w_i|h) has the standard back-off structure and satisfies the above constraint, so the model P_{Adapt}(w_i|h) has the following recursive formula:

P_{Adapt}(w_i|h) = \frac{\delta(w_i)}{Z_o(h)} P_L(w_i|h)   if (h, w_i) exists,   (17)
P_{Adapt}(w_i|h) = b(h) P_{Adapt}(w_i|\hat{h})   otherwise,   (18)

where

Z_o(h) = \frac{\sum_{w_i: (h, w_i) observed} \delta(w_i) P_L(w_i|h)}{\sum_{w_i: (h, w_i) observed} P_L(w_i|h)}   (19)

and

b(h) = \frac{1 - \sum_{w_i: (h, w_i) observed} P_L(w_i|h)}{1 - \sum_{w_i: (h, w_i) observed} P_{Adapt}(w_i|\hat{h})},   (20)

where b(h) is the back-off weight of the context h that ensures that P_{Adapt}(w_i|h) sums to unity, and \hat{h} is the reduced word history of h. The term Z_o(h) performs a normalization similar to Equation 15, except that the summation is restricted to the observed alternative words with the same word history h in the LM [6].
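
The cache unigram of Eq. (12) and the scaling of Eqs. (14)-(16) could be sketched as below over an explicit vocabulary, using the global normalization Z(h) rather than the back-off refinement of Eqs. (17)-(20); names and array shapes are our own assumptions.

```python
import numpy as np

def cache_unigram(history_words, vocab_index, cache_size=400):
    """Unigram cache model of Eq. (12) over the last cache_size in-vocabulary words."""
    cache = [w for w in history_words[-cache_size:] if w in vocab_index]
    counts = np.zeros(len(vocab_index))
    for w in cache:
        counts[vocab_index[w]] += 1.0
    return counts / max(len(cache), 1)

def unigram_scaling(P_L_given_h, P_cache, P_bg_unigram, alpha, beta=1.0):
    """Cache adaptation by unigram scaling, Eqs. (14)-(16), with explicit normalization.

    P_L_given_h:  (V,) interpolated EPLSA/IEPLSA + background probabilities P_L(w_i | h)
    P_cache:      (V,) cache unigram P_cache(w_i)
    P_bg_unigram: (V,) background unigram P_Background(w_i)
    """
    delta = ((alpha * P_cache + (1.0 - alpha) * P_bg_unigram) / P_bg_unigram) ** beta
    scaled = delta * P_L_given_h
    return scaled / scaled.sum()        # divide by Z(h), Eq. (15)
```

Equations (17)-(20) refine this by renormalizing only over the n-grams observed in the back-off LM, so that the total probability of the observed transitions is preserved.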
The Background+IEPLSA 55.12 55.10 19 acoustic model is trained by using all WSJ and TIMIT [21] (Background+EPLSA)*CACHE 57.98 55.06 20 training data, the 40 phone set of the CMU dictionary [22], (Background+IEPLSA)*CACHE 50.71 50.69 21 approximately 10000 tied-states, 32 Gaussians per state and 22 64 Gaussians per silence state. The acoustic waveforms are 23 parameterized into a 39-dimensional feature vector consisting Table 2, we can note that all the models outperform the back- 24 of 12 cepstral coefficients plus the 0th cepstral, delta and delta ground model and the performances are better with increasing 25 delta coefficients, normalized using cepstral mean subtraction topics. The proposed EPLSA and IEPLSA models outper- 26 (MFCC0−D−A−Z ). We evaluated the cross-word models. form the PLSA models in every form (stand-alone, interpo- 27 The values of the word insertion penalty, beam width, and the lated, unigram scaling). 28 language model scale factor are -4.0, 350.0, and 15.0 respec- We evaluated the WER experiments using lattice rescor- 29 tively [20]. The development and the evaluation test sets are ing. In the first pass, we used the back-off trigram back- 30 the si dt 05.odd (248 sentences from 10 speakers) and the ground language model for lattice generation. In the sec- 31 Nov’93 Hub 2 5K test data from the ARPA November 1993 ond pass, we applied the LM adaptation approaches for lat- 32 WSJ evaluation (215 sentences from 10 speakers) [17, 23]. tice rescoring. The experimental results are explained in Fig- 33 The interpolation weights λd, γ and α are computed using the ure 2. From Figure 2, we can note that the proposed EPLSA 34 compute-best-mix program from the SRILM toolkit. They model yields significant WER reductions of about 10.93% 35 are tuned on the development test set. The results are noted (7.59% to 6.76%) and 8.64% (7.40% to 6.76%) for 40 top- 36 on the evaluation test set. ics, and about 15.41% (7.59% to 6.42%) and 13.00% (7.38% 37 to 6.41%) for 80 topics, over the background model and the 38 6.2. Experimental Results PLSA [3] approaches respectively. For the IEPLSA mod- 39 els, the WER reductions are about 19.50% (7.59% to 6.11), 40 We used the folding-in procedure [3] to compute the PLSA, 17.43% (7.40% to 6.11%), and 9.61% (6.76% to 6.11%) for 41 EPLSA and IEPLSA model probabilities. We keep the uni- 40 topics and about 20.28% (7.59% to 6.05), 18.02% (7.38% 42 gram (Equations 2, 4 and 8) probabilities for topics of PLSA, to 6.05%), and 5.76% (6.42% to 6.05%) for 80 topics, over the 43 EPLSA and IEPLSA, and λd of component probabilities for background model, PLSA [3] and EPLSA approaches respec- 44 IEPLSA unchanged, and used them to compute P (zk|D) for tively. The integration of cache based models improves the 45 the test document D of the PLSA model and P (zk|h) for the performance as it carries different information (captures the 46 test document histories of the EPLSA and IEPLSA models. dynamics of word occurrences in a cache) than the EPLSA 47 The language models for PLSA, EPLSA and IEPLSA are then and IEPLSA approaches. The cache unigram scaling of the 48 computed using (Equations 2, 4 and 8). The remaining zero IEPLSA approach gives 6.74% and 1.63% WER reductions 49 probabilities of the obtained matrix PEP LSA/IEP LSA(wi|h) over the cache unigram scaling of the EPLSA approach for 40 50 are computed by using back-off smoothing. The EPLSA and 80 topics respectively. 
We tested the proposed approaches for various numbers of topics. The perplexity results are given in Table 2.

Table 2. Perplexity results of the language models

Language Model               40 Topics   80 Topics
Background                   70.26       70.26
PLSA                         517.77      514.78
EPLSA                        192.91      123.32
IEPLSA                       101.19      93.02
Background*PLSA              66.63       66.50
Background+EPLSA             62.92       59.74
Background+IEPLSA            55.12       55.10
(Background+EPLSA)*CACHE     57.98       55.06
(Background+IEPLSA)*CACHE    50.71       50.69

From Table 2, we can note that all the models outperform the background model and that the performance improves with more topics. The proposed EPLSA and IEPLSA models outperform the PLSA models in every form (stand-alone, interpolated, unigram scaling).

We evaluated the WER using lattice rescoring. In the first pass, we used the back-off trigram background language model for lattice generation. In the second pass, we applied the LM adaptation approaches for lattice rescoring. The results are shown in Figure 2. From Figure 2, we can note that the proposed EPLSA model yields significant WER reductions of about 10.93% (7.59% to 6.76%) and 8.64% (7.40% to 6.76%) for 40 topics, and about 15.41% (7.59% to 6.42%) and 13.00% (7.38% to 6.42%) for 80 topics, over the background model and the PLSA [3] approach respectively. For the IEPLSA models, the WER reductions are about 19.50% (7.59% to 6.11%), 17.43% (7.40% to 6.11%), and 9.61% (6.76% to 6.11%) for 40 topics, and about 20.28% (7.59% to 6.05%), 18.02% (7.38% to 6.05%), and 5.76% (6.42% to 6.05%) for 80 topics, over the background model, the PLSA [3] approach and the EPLSA approach respectively. The integration of the cache-based model improves the performance, as it carries different information (it captures the dynamics of word occurrences in a cache) than the EPLSA and IEPLSA approaches. The cache unigram scaling of the IEPLSA approach gives 6.74% and 1.63% WER reductions over the cache unigram scaling of the EPLSA approach for 40 and 80 topics respectively. We can note that the addition of the cache model improves the performance of EPLSA (6.76% to 6.52% for 40 topics and 6.42% to 6.13% for 80 topics) more than that of IEPLSA (6.11% to 6.08% for 40 topics and 6.05% to 6.03% for 80 topics). This might be due to the fact that the IEPLSA approach already captures long-range information using the interpolated distanced bigrams.

Fig. 2. WER results (%) of the language models for 40 and 80 topics: Background 7.59/7.59, Background*PLSA 7.40/7.38, Background+EPLSA 6.76/6.42, (Background+EPLSA)*CACHE 6.52/6.13, Background+IEPLSA 6.11/6.05, (Background+IEPLSA)*CACHE 6.08/6.03.

7. CONCLUSIONS

In this paper, the background n-grams and the interpolated distanced n-grams are used to derive the EPLSA and IEPLSA models, respectively, for speech recognition. The EPLSA model extracts the topic information from the (n-1) history words in calculating the n-gram probabilities. However, it does not capture the long-range semantic information from outside of the n-gram events. The IEPLSA model overcomes this shortcoming of EPLSA by using interpolated long-distance n-grams that capture the long-term word dependencies. In the IEPLSA, the topic information for the histories is trained using the interpolated distanced n-grams, and the model probabilities are computed by weighting the component word probabilities for topics and the interpolated topic information for the histories. We have seen that the proposed EPLSA and IEPLSA approaches yield significant perplexity and WER reductions over the PLSA-based LM approach on the WSJ corpus. Moreover, we incorporated a cache-based model into the EPLSA and IEPLSA models using unigram scaling for adaptation and observed further improvements. However, cache unigram scaling improves the EPLSA much more than it improves the IEPLSA, which indicates that the IEPLSA approach already captures long-range information of the language.
8. REFERENCES

[1] R. Kuhn and R. De Mori, "A Cache-Based Natural Language Model for Speech Recognition", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 12, no. 6, pp. 570-583, 1990.

[2] J. R. Bellegarda, "Exploiting Latent Semantic Information in Statistical Language Modeling", Proc. of the IEEE, vol. 88, no. 8, pp. 1279-1296, 2000.

[3] D. Gildea and T. Hofmann, "Topic-Based Language Models Using EM", in Proc. of EUROSPEECH, pp. 2167-2170, 1999.

[4] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[5] D. Mrva and P. C. Woodland, "A PLSA-based Language Model for Conversational Telephone Speech", in Proc. of ICSLP, pp. 2257-2260, 2004.

[6] Y.-C. Tam and T. Schultz, "Unsupervised Language Model Adaptation Using Latent Semantic Marginals", in Proc. of INTERSPEECH, pp. 2206-2209, 2006.

[7] M. A. Haidar and D. O'Shaughnessy, "Unsupervised Language Model Adaptation Using Latent Dirichlet Allocation and Dynamic Marginals", in Proc. of EUSIPCO, pp. 1480-1484, 2011.

[8] H. M. Wallach, "Topic Modeling: Beyond Bag-of-Words", in Proc. of the 23rd International Conference on Machine Learning (ICML'06), pp. 977-984, 2006.

[9] J. Nie, R. Li, D. Luo, and X. Wu, "Refine Bigram PLSA Model by Assigning Latent Topics Unevenly", in Proc. of the IEEE Workshop on ASRU, pp. 141-146, 2007.

[10] M. Bahrani and H. Sameti, "A New Bigram PLSA Language Model for Speech Recognition", EURASIP Journal on Signal Processing, pp. 1-8, 2010.

[11] J.-T. Chien and C.-H. Chueh, "Latent Dirichlet Language Model for Speech Recognition", in Proc. of the IEEE SLT Workshop, pp. 201-204, 2008.

[12] C.-H. Chueh and J.-T. Chien, "Topic Cache Language Model for Speech Recognition", in Proc. of ICASSP, pp. 5194-5197, 2010.

[13] N. Bassiou and C. Kotropoulos, "Word Clustering PLSA Enhanced with Long Distance Bigrams", in Proc. of the International Conference on Pattern Recognition (ICPR), pp. 4226-4229, 2010.

[14] A. Vaiciunas and G. Raskinis, "Cache-Based Statistical Language Models of English and Highly Inflected Lithuanian", Informatica, vol. 17, no. 1, pp. 111-124, 2006.

[15] R. Kneser, J. Peters, and D. Klakow, "Language Model Adaptation Using Dynamic Marginals", in Proc. of EUROSPEECH, pp. 1971-1974, 1997.

[16] W. Naptali, M. Tsuchiya, and S. Nakagawa, "Topic-Dependent Class-Based N-gram Language Model", IEEE Trans. on Audio, Speech and Language Processing, vol. 20, no. 5, pp. 1513-1525, 2012.

[17] "CSR-II (WSJ1) Complete", Linguistic Data Consortium, Philadelphia, 1994.

[18] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit", in Proc. of ICSLP, vol. 2, pp. 901-904, 2002.

[19] S. Young, P. Woodland, G. Evermann, and M. Gales, "The HTK Toolkit 3.4.1", Cambridge University Engineering Department, http://htk.eng.cam.ac.uk/.

[20] K. Vertanen, "HTK Wall Street Journal Training Recipe", http://www.inference.phy.cam.ac.uk/kv227/htk/.

[21] J. S. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus", Linguistic Data Consortium, Philadelphia, 1993.

[22] "The Carnegie Mellon University (CMU) Pronouncing Dictionary", http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

[23] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, "Large Vocabulary Continuous Speech Recognition Using HTK", in Proc. of ICASSP, pp. II:125-128, 1994.