Eusipco 2013 1569741253

EUSIPCO 2013 1569741253 1 2 3 4 PLSA ENHANCED WITH A LONG-DISTANCE BIGRAM LANGUAGE MODEL FOR SPEECH 5 RECOGNITION 6 7 Md. Akmal Haidar and Douglas O’Shaughnessy 8 9 INRS-EMT, 6900-800 de la Gauchetiere Ouest, Montreal (Quebec), H5A 1K6, Canada 10 11 12 13 ABSTRACT sition to extract the semantic information for different words and documents. In PLSA and LDA, semantic properties of 14 We propose a language modeling (LM) approach using back- words and documents can be shown in probabilistic topics. 15 ground n-grams and interpolated distanced n-grams for The PLSA latent topic parameters are trained by maximizing 16 speech recognition using an enhanced probabilistic latent the likelihood of the training data using an expectation max- 17 semantic analysis (EPLSA) derivation. PLSA is a bag-of- imization (EM) procedure and have been successfully used 18 words model that exploits the topic information at the docu- for speech recognition [3, 5]. The LDA model has been used 19 ment level, which is inconsistent for the language modeling successfully in recent research work for LM adaptation [6, 7]. 20 in speech recognition. In this paper, we consider the word A bigram LDA topic model, where the word probabilities are 21 sequence in modeling the EPLSA model. Here, the predicted conditioned on their preceding context and the topic proba- 22 word of an n-gram event is drawn from a topic that is cho- bilities are conditioned on the documents, has been recently 23 sen from the topic distribution of the (n-1) history words. investigated [8]. A similar model but in the PLSA framework, 24 The EPLSA model cannot capture the long-range topic in- called a bigram PLSA model, was introduced recently [9]. 25 formation from outside of the n-gram event. The distanced An updated bigram PLSA model was proposed in [10] where 26 n-grams are incorporated into interpolated form (IEPLSA) the topic is further conditioned on the bigram history con- 27 to cover the long-range information. A cache-based LM that text. A topic-based language model was proposed where the 28 models the re-occurring words is also incorporated through topic information was obtained from n-gram history through 29 unigram scaling to the EPLSA and IEPLSA models, which Dirichlet distribution [11] and from long-distance history 30 models the topical words. We have seen that our proposed ap- (topic cache) through multinomial distributions [12]. 31 proaches yield significant reductions in perplexity and word 32 error rate (WER) over a PLSA based LM approach using the In [13], a PLSA technique enhanced with long-distance 33 Wall Street Journal (WSJ) corpus. bigrams was used to incorporate the long-term word de- 34 pendencies in determining word clusters. This motivates 35 Index Terms— language model, topic model, speech us to present LM approaches for speech recognition using 36 recognition, cache-based LM, long-distance n-grams distanced n-grams. In this paper, we use default n-grams 37 using enhanced PLSA derivation to form the EPLSA n-gram 38 1. INTRODUCTION model. Here, the observed n-gram events contain the history 39 words and the predicted word. The EPLSA model extracts 40 Statistical n-gram LMs play a vital role for speech recogni- the topic information from history words and the current 41 tion and many other applications. They use the local context word is then predicted based on the topic information of the 42 information by modeling text as a Markovian Sequence. history words. However, the EPLSA model does not cap- 43 However, the n-gram LMs suffer from shortages of long- ture the topic information from outside of the n-gram events. 44 range information, which degrade performance. Cache-based We propose interpolated distanced n-grams (IEPLSA) and 45 LM was one of the earliest efforts to capture the long-range cache based models to capture the long-term word depen- 46 information. Here, the model increases the probability of the dencies into the EPLSA model. The n-gram probabilities of 47 words that appear earlier in a document when predicting the the IEPLSA model are computed by mixing the component 48 next word [1]. Recently, latent topic modeling techniques distanced word probabilities for topics and the interpolated 49 have been used broadly for topic based language modeling topic information for histories. Furthermore, a cache-based 50 to compensate for the weaknesses of the n-gram LMs. Sev- LM is incorporated into the EPLSA and IEPLSA models as 51 eral techniques such as Latent Semantic Analysis (LSA) [2], the cache-based LM models a different part of the language 52 PLSA [3], and latent Dirichlet allocation (LDA) [4] have been than EPLSA/IEPLSA models. 53 studied to extract the latent semantic information from a train- The rest of this paper is organized as follows. Section 2 is 54 ing corpus. All these methods are based on a bag-of-words used to review the PLSA. The proposed EPLSA and IEPLSA 55 assumption. LSA performs word-document matrix decompo- models are described in section 3. In section 4, a comparison 56 57 60 1 61 1 2 of PLSA, bigram PLSA, EPLSA and IEPLSA models is illus- 3 trated. The unigram scaling of the cache-based model to the topic models is explained in section 5. Section 6 is used to de- 4 h z wi 5 scribe the experiments. Finally, the conclusions are explained 6 in section 7. V 7 H 8 2. PLSA MODEL 9 10 The PLSA model [3] can be described in the following pro- Fig. 1. The graphical model of the EPLSA model. The shaded 11 cedure. First a document Dj (j = 1; 2;:::;N) is selected circle represents the observed variables. H and V describe 12 with probability P (Dj). A topic zk (k = 1; 2;:::;K) is the number of histories and the size of vocabulary. 13 then chosen with probability P (zkjDj), and finally a word 14 wi (i = 1; 2;:::;V ) is generated with probability P (wijzk). 15 Here, the observed variables are wi and Dj whereas the un- M-step: 16 observed variable is zk. The joint distribution of the observed P data can be described as: h n(h; wi)P (zkjh; wi) 17 P (wijzk) = P P ; (6) n(h; w 0 )P (z jh; w 0 ) 18 K i0 h i k i X 19 P (Dj; wi) = P (Dj)P (wijDj) = P (Dj) P (zkjDj)P (wijzk); P n(h; w 0 )P (z jh; w 0 ) 20 k=1 i0 i k i P (zkjh) = P P : (7) 21 (1) k0 i0 n(h; wi0 )P (zk0 jh; wi0 ) 22 where the word probability P (wijDj) can be computed as: 23 K 3.2. IEPLSA X 24 P (w jD ) = P (z jD )P (w jz ): (2) i j k j i k The EPLSA model does not capture the long-distance in- 25 k=1 26 formation. To incorporate the long-range characteristics, we 27 The model parameters P (wijzk) and P (zkjDj) are com- used the distanced n-grams in the EPLSA model. Incorpo- 28 puted by using the EM algorithm [3]. rating the interpolated distance n-grams in the EPLSA, the 29 model can be written as [13]: 30 3. PROPOSED EPLSA AND IEPLSA MODELS K 31 X X P (w jh) = [ λ P (w jz )]P (z jh); (8) 32 3.1. EPLSA IEP LSA i d d i k k k=1 d 33 Representing a document Dj as a sequence of words, the joint 34 distribution of the document and the previous (n-1) history where λd are the weights for each component probability 35 words h of the current word wi can be described as [13]: estimated on the held-out data using the EM algorithm and 36 P (w jz ) is the word probabilities for topic z obtained by Y d i k k 37 P (Dj; h) = P (h) Pd(wijh); (3) using the distanced n-grams in the IEPLSA training. d repre- 38 wiDj sents the distance between words in the n-gram events. d = 1 39 describes the default n-grams. For example, the distanced where P (w jh) is the distanced n-gram model. Here, d rep- 40 d i n-grams of the phrase “Speech in Life Sciences and Human resents the distance between the words in the n-grams. There- 41 Societies” are described in Table 1 for the distance d = 1; 2. 42 fore, the probability Pd(wijh) can be computed similar to the 43 PLSA derivation [3, 13]. For d = 1, Pd(wijh) is the default 44 background n-gram and we define it as the enhanced PLSA Table 1. Distanced n-grams for the phrase “Speech in Life 45 (EPLSA) model. The graphical model of the EPLSA model Sciences and Human Societies” Distance Bigrams Trigrams 46 can be described in Figure 1. The equations for the EPLSA d=1 Speech in, in Life, Speech in Life, in Life 47 model are: Life Sciences, Sci- Sciences, Life Sci- 48 K X ences and, and Hu- ences and, Sciences 49 PEP LSA(wijh) = P (wijzk)P (zkjh); (4) man, Human Soci- and Human, and Hu- 50 k=1 eties man Societies 51 The parameters of the model are computed using the EM al- d=2 Speech Life, in Sci- Speech Life and, in 52 gorithm as: E-step: ences, Life and, Sci- Sciences Human, 53 P (w jz )P (z jh) ences Human, and Life and Societies 54 P (z jh; w ) = i k k ; (5) k i PK Societies 55 k0=1 P (wijzk0 )P (zk0 jh) 56 57 60 2 61 1 2 The parameters of the IEPLSA model can be computed As the cache-based LM (i.e., models re-occurring words) 3 as: E-step: is different from the background model (i.e., models short- range information), EPLSA and IEPLSA models (i.e., model 4 P (w jz )P (z jh) P (z jh; w ) = d i k k ; (9) topical words), we can integrate the cache model to adapt the 5 d k i PK P (w jz 0 )P (z 0 jh) 6 k0=1 d i k k PL(wijh) through unigram scaling as [15, 16]: 7 M-step: P (w jh)δ(w ) 8 L i i P PAdapt(wijh) = ; (14) 9 h nd(h; wi)Pd(zkjh; wi) Z(h) Pd(wijzk) = P P ; (10) 10 i0 h nd(h; wi0 )Pd(zkjh; wi0 ) with 11 P P X λ n (h; w 0 )P (z jh; w 0 ) Z(h) = δ(wi):PL(wijh): (15) 12 P (z jh) = i0 d d d i d k i : (11) k P P P wi 13 k0 i0 d λdnd(h; wi0 )Pd(zkjh; wi0 ) 14 where Z(h) is a normalization term, which guarantees that the 15 4.

Eusipco 2013 1569741253

Improving Neural Language Models with A

Formatting Instructions for NIPS -17

A Study on Neural Network Language Modeling

Master's Thesis Preparation Report

Chapter 4 N-Grams

Modeling Vocabulary for Big Code Machine Learning

A Dynamic Language Model for Speech Recognition

Language Model Techniques in Machine Translation Diplomarbeit

A Neurophysiologically-Inspired Statistical Language Model

Arxiv:1910.05493V1 [Cs.LG] 12 Oct 2019 1

Context-Sensitive Code Completion

Are Deep Neural Networks the Best Choice for Modeling Source Code?