Optimization of Latent Semantic Analysis based Model Interpolation for Meeting Recognition

Michael Pucher†‡, Yan Huang∗, Özgür Çetin∗

‡Telecommunications Research Center Vienna, Austria [email protected]

†Speech and Signal Processing Lab, TU Graz, Graz, Austria

∗International Computer Science Institute, Berkeley, USA [email protected], [email protected]

Abstract

Latent Semantic Analysis (LSA) defines a space of semantic similarity using a training corpus. This semantic similarity can be used for dealing with long distance dependencies, which are an inherent problem for traditional word-based n-gram models. This paper presents an analysis of interpolated LSA models that are applied to meeting recognition. For this task it is necessary to combine meeting and background models. Here we show the optimization of the LSA model parameters necessary for the interpolation of multiple LSA models. The comparison of LSA and cache-based models furthermore shows that the former contain more semantic information than is contained in the repetition of word forms.


1. Introduction

Word-based n-gram models are a popular and fairly successful paradigm in language modeling. With these models it is however difficult to model long distance dependencies which are present in natural language (Chelba and Jelinek, 1998).

LSA maps a corpus of documents onto a semantic vector space. Long distance dependencies are modeled by representing the context or history of a word and the word itself as vectors in this space. The similarity between these two vectors is used to predict a word given a context. Since LSA treats the context as a bag of words, it has to be combined with n-gram models to include word-order statistics of the short-span history. Language models that combine word-based n-gram models with LSA models have been successfully applied to conversational speech recognition and to the Wall Street Journal recognition task (Bellegarda, 2000b; Deng and Khudanpur, 2003).

We conjecture that LSA-based language models can also help to improve speech recognition of recorded meetings, because meetings have clear topics and LSA models adapt dynamically to topics. Due to the sparseness of available data for language modeling for meetings it is important to combine meeting LSA models that are trained on relatively small corpora with background LSA models which are trained on larger corpora.

LSA-based language models have several parameters, influencing for example the length of the history or the similarity function, that need to be optimized. The interpolation of multiple LSA models leads to additional parameters that regulate the impact of the different models on a word and model basis.

2. LSA-based Language Models

2.1. Constructing the LSA models

In LSA the training corpus is first encoded as a word–document co-occurrence matrix W (using weighted term frequencies). This matrix has high dimension and is highly sparse. Let V be the vocabulary with |V| = M and T be a text collection containing n documents. Let c_ij be the number of occurrences of word i in document j, c_i the number of occurrences of word i in the whole corpus, i.e. c_i = \sum_{j=1}^{n} c_{ij}, and c_j the number of words in document j. The elements of W are given by

[W]_{ij} = (1 - \epsilon_{w_i}) \frac{c_{ij}}{c_j}   (1)

where \epsilon_{w_i} is defined as

\epsilon_{w_i} = -\frac{1}{\log n} \sum_{j=1}^{n} \frac{c_{ij}}{c_i} \log \frac{c_{ij}}{c_i}.   (2)

\epsilon_w will be used as a short-hand for \epsilon_{w_i}. Informative words will have a low value of \epsilon_w. Then a semantic space of much lower dimension is constructed using Singular Value Decomposition (SVD) (Deerwester et al., 1990):

W \approx \hat{W} = U \times S \times V^T   (3)

For some order r \ll \min(M, n), U is an M × r left singular matrix, S is an r × r diagonal matrix that contains the r largest singular values, and V is an n × r right singular matrix. The vector u_i S represents word w_i, and v_j S represents document d_j.
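As an illustration, the following sketch builds the entropy-weighted word–document matrix of Eqs. (1)–(2) and truncates its SVD as in Eq. (3). It assumes a small in-memory corpus; the function and variable names are ours, not taken from the paper's implementation.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def build_lsa_space(docs, vocab, r=100):
    """docs: list of token lists, vocab: list of words, r: truncation order (r << min(M, n))."""
    M, n = len(vocab), len(docs)
    word_idx = {w: i for i, w in enumerate(vocab)}
    C = lil_matrix((M, n))                                   # raw counts c_ij
    for j, doc in enumerate(docs):
        for w in doc:
            if w in word_idx:
                C[word_idx[w], j] += 1
    C = C.tocsr()
    c_i = np.asarray(C.sum(axis=1)).ravel()                  # corpus count of each word
    c_j = np.maximum(np.asarray(C.sum(axis=0)).ravel(), 1)   # document lengths

    eps = np.zeros(M)                                        # normalized word entropy, Eq. (2)
    for i in range(M):
        p = C[i, :].toarray().ravel() / max(c_i[i], 1)
        p = p[p > 0]
        eps[i] = -(p * np.log(p)).sum() / np.log(n)

    # Entropy-weighted term frequencies, Eq. (1): (1 - eps_w) * c_ij / c_j
    W = C.multiply(1.0 / c_j[None, :]).multiply((1.0 - eps)[:, None]).tocsc()

    U, S, Vt = svds(W, k=r)                                  # truncated SVD, Eq. (3)
    return U, np.diag(S), Vt.T, eps                          # word space U, singular values S, doc space V
```

The rows of U·S and V·S can then be used as the word and document vectors u_i S and v_j S.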

2.2. LSA Probability

In this semantic space the cosine similarity between words and documents is defined as

K_{sim}(w_i, d_j) \triangleq \frac{u_i S v_j^T}{\|u_i S^{1/2}\| \cdot \|v_j S^{1/2}\|}   (4)

Since we need a probability for the integration with the n-gram models, the similarity is converted into a probability by normalizing it. According to (Coccaro and Jurafsky, 1998), we extend the small dynamic range of the similarity function by introducing a temperature parameter γ.

We also have to define the vector representation of a pseudo-document d̃_{t−1} using the word vectors of all words preceding w_t, i.e. w_1, ..., w_{t−1}. This is needed because the model is used to compare words with documents that have not been seen so far. In the construction of the pseudo-document we also include a decay parameter δ < 1 that is multiplied with the preceding pseudo-document vector and renders words closer in the history more significant.

The conditional probability of a word w_t given a pseudo-document d̃_{t−1} is defined as

P_{LSA}(w_t \mid \tilde{d}_{t-1}) \triangleq \frac{[K_{sim}(w_t, \tilde{d}_{t-1}) - K_{min}(\tilde{d}_{t-1})]^{\gamma}}{\sum_{w}[K_{sim}(w, \tilde{d}_{t-1}) - K_{min}(\tilde{d}_{t-1})]^{\gamma}}   (5)

where K_{min}(\tilde{d}_{t-1}) = \min_w K_{sim}(w, \tilde{d}_{t-1}) is subtracted to make the resulting similarities nonnegative (Deng and Khudanpur, 2003).

2.3. Combining LSA and n-gram Models

For the interpolation of the word-based n-gram models and the LSA models we used the methods defined in Table 1. λ is a fixed constant interpolation weight, and ∝ denotes that the result is normalized by the sum over the whole vocabulary. λ_w is a word-dependent parameter defined as

\lambda_w \triangleq \frac{1 - \epsilon_w}{2}.   (6)

This definition ensures that the n-gram model gets at least half of the weight. λ_w is higher for more informative words. We used two different methods for the interpolation of n-gram models and LSA models: the information weighted geometric mean and simple linear interpolation. The information weighted geometric mean interpolation represents a log-linear interpolation of normalized LSA probabilities and the standard n-gram.

Model                                                       Definition
n-gram (baseline)                                           P_{n-gram}
Linear interpolation (LIN)                                  λ P_{LSA} + (1 − λ) P_{n-gram}
Information weighted geometric mean interpolation (INFG)    ∝ P_{LSA}^{λ_w} P_{n-gram}^{1−λ_w}

Table 1: Interpolation methods.
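A compact sketch of Eqs. (4)–(6): turning similarities into P_LSA with the temperature γ, and combining a single LSA model with an n-gram model by the INFG rule from Table 1. U, S, and eps are as returned by the sketch above, `d_tilde` is the pseudo-document vector, and `p_ngram` is a hypothetical per-word n-gram distribution; the default γ = 8 mirrors the value used later for Figure 3.

```python
import numpy as np

def lsa_probs(U, S, d_tilde, gamma=8.0):
    """P_LSA(w | pseudo-document) over the whole vocabulary, Eqs. (4)-(5)."""
    s = np.diag(S)                                    # singular values as a vector
    num = (U * s) @ d_tilde                           # u_i S d~^T for every word i
    denom = (np.linalg.norm(U * np.sqrt(s), axis=1)
             * np.linalg.norm(d_tilde * np.sqrt(s)) + 1e-12)
    sims = num / denom                                # cosine similarity, Eq. (4)
    shifted = (sims - sims.min()) ** gamma            # subtract K_min, apply temperature
    return shifted / shifted.sum()

def infg_combine(p_lsa, p_ngram, eps):
    """INFG interpolation (Table 1) with lambda_w = (1 - eps_w) / 2, Eq. (6)."""
    lam = (1.0 - eps) / 2.0
    unnorm = p_lsa ** lam * p_ngram ** (1.0 - lam)
    return unnorm / unnorm.sum()                      # normalize over the vocabulary
```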
2.4. Combining LSA Models

For the combination of multiple LSA models we tried two different approaches. The first approach was the linear interpolation of LSA models with optimized λ_i, where λ_{n+1} = 1 − (λ_1 + ... + λ_n):

P_{lin} \triangleq \lambda_1 P_{Lsa_1} + \ldots + \lambda_n P_{Lsa_n} + \lambda_{n+1} P_{n-gram}   (7)

Our second approach was the INFG interpolation with optimized θ_i, where λ_w^{(n+1)} = 1 − (λ_w^{(1)} + ... + λ_w^{(n)}):

P_{infg} \propto P_{Lsa_1}^{\lambda_w^{(1)} \theta_1} \cdots P_{Lsa_n}^{\lambda_w^{(n)} \theta_n} P_{n-gram}^{\lambda_w^{(n+1)} \theta_{n+1}}   (8)

The parameters θ_i have to be optimized since the λ_w^{(k)} depend on the corpus, so that a certain corpus can get a higher weight because of a content-word-like distribution of w, although the whole data does not fit the meeting domain well. In general we saw that the λ_w values were higher for the background domain models than for the meeting models. But, taking the n-gram mixtures as an example, the meeting models should get a higher weight than the background models. For this reason the λ_w of the background models have to be lowered using θ.

To ensure that the n-gram model gets a certain part α of the distribution, we define λ_w^{(k)} for word w and LSA model Lsa_k as

\lambda_w^{(k)} \triangleq \frac{1 - \epsilon_w^{(k)}}{n / (1 - \alpha)}   (9)

where ε_w^{(k)} is the uninformativeness of word w in LSA model Lsa_k as defined in (2) and n is the number of LSA models. This is a generalization of definition (6). Through the generalization it is also possible to train α, the minimum weight of the n-gram model.

For the INFG interpolation we had to optimize the model parameters θ_i, the n-gram share α, and the γ exponent for each LSA model.
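A sketch of the multi-model INFG combination of Eqs. (8)–(9). The argument names are illustrative; the default α = 0.5 mirrors Eq. (6)'s guarantee of at least half of the weight for the n-gram model.

```python
import numpy as np

def multi_infg(p_lsas, p_ngram, eps_list, thetas, theta_ngram=1.0, alpha=0.5):
    """INFG interpolation of several LSA models with one n-gram model, Eqs. (8)-(9).
    p_lsas: per-vocabulary LSA distributions, eps_list: matching eps_w^(k) arrays,
    thetas: per-model scaling factors, alpha: minimum share reserved for the n-gram."""
    n = len(p_lsas)
    lams = [(1.0 - eps) / (n / (1.0 - alpha)) for eps in eps_list]  # lambda_w^(k), Eq. (9)
    lam_ngram = 1.0 - sum(lams)                                     # lambda_w^(n+1) >= alpha
    unnorm = p_ngram ** (lam_ngram * theta_ngram)
    for p_lsa, lam, theta in zip(p_lsas, lams, thetas):
        unnorm = unnorm * p_lsa ** (lam * theta)                    # Eq. (8), unnormalized
    return unnorm / unnorm.sum()                                    # normalize over the vocabulary
```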

3. Analysis of the models

To gain a deeper understanding of our models we analyzed the effects of the model parameters and compared our models with other similar models. For this analysis we used meeting heldout data, containing four ICSI, four CMU and four NIST meetings. The perplexities and similarities were estimated using LSA and 4-gram models trained on the Fisher conversational speech data (Bulyko et al., 2003) and on the meeting data (Table 2) minus the meeting heldout data. The models were interpolated using the INFG interpolation method (Table 1).

Training Source    # of words (×10³)
Fisher             23357
Meeting            880

Table 2: Training data sources.

3.1. Perplexity Space of Combined LSA Models

Figure 1 shows the perplexities for the meeting and the Fisher LSA model that were interpolated with an n-gram model using linear interpolation (Definition 7), where λ1 and λ2 are the corresponding LSA model weights. Zeros are plotted where the interpolation is not defined, i.e. where λ1 + λ2 ≥ 1, which would mean that the n-gram model gets zero weight.

This figure shows that the minimum perplexity is reached with λ1 = λ2 = 0. Furthermore we can see that the graph gets very steep for higher values of λ. This is beneficial for the gradient descent optimization since we always know where to go to reach the minimum perplexity. The minimum perplexity is however reached when we do not use the LSA models and rely solely on the n-gram.

[Figure 1: Perplexity space for 2 linearly interpolated LSA models. Axes: Meeting model λ1, Fisher model λ2; perplexity range 0–1000.]

Figure 2 shows the perplexity space of the INFG interpolation (Definition 8) for the meeting and the Fisher model, which is much flatter than the linear interpolation space. We can estimate the difference in steepness by looking at the perplexity scale, which is [67, 72] for the INFG interpolation compared to [0, 1000] for the linearly interpolated models. Therefore the parameter optimization is harder and slower for this interpolation.

[Figure 2: Perplexity space for 2 INFG interpolated LSA models. Axes: Meeting model θ1, Fisher model θ2; perplexity range 67–72.]

On the other hand we can achieve an improvement over the n-gram model when using this interpolation. The optimum perplexity is not reached by giving both LSA models θi = 0, but by setting the parameter for the Fisher model to θ2 = 0 and the meeting model parameter to θ1 = 1. The θi's have only the function of boolean model selectors in this 2-model case. But there is still the word entropy that varies the interpolation weight between LSA and n-gram model.

When we conducted word-error-rate experiments with combinations of more than two LSA models (Pucher et al., 2006) we used gradient-descent optimization to optimize all the interpolation parameters together. Here we used a brute-force approach to get a picture of the whole perplexity space.
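The brute-force mapping of the linear-interpolation surface can be sketched as a simple grid search; `heldout_perplexity` is a hypothetical helper that evaluates Eq. (7) on the heldout meetings.

```python
import numpy as np

def perplexity_grid(heldout_perplexity, step=0.1):
    """Evaluate the (lambda1, lambda2) surface of Eq. (7) by brute force."""
    lams = np.arange(0.0, 1.0 + 1e-9, step)
    grid = np.zeros((lams.size, lams.size))
    for i, l1 in enumerate(lams):
        for j, l2 in enumerate(lams):
            if l1 + l2 < 1.0:                       # n-gram weight 1 - l1 - l2 stays positive
                grid[i, j] = heldout_perplexity(l1, l2)
            else:
                grid[i, j] = 0.0                    # undefined region, plotted as zero
    return lams, grid
```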

3.2. The Repetition Effect: LSA Models and Cache Models

Some improvements of LSA-based language models over n-gram models are surely due to the redundant nature of language and speech. A lot of words that pop up in a meeting, for example, are likely to pop up again within a short window of context. A word will be highly similar to a context when the word appears in the context. A cache-based language model can exploit this fact by keeping a cache of words that have already been seen and giving them higher probability (Kuhn and De Mori, 1990). To test whether the performance of LSA-based models rests only on this cache-effect we checked the word probabilities of the models.
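For comparison, a toy cache model in the spirit of Kuhn and De Mori (1990): a unigram over the words seen so far, linearly interpolated with the n-gram distribution. The cache weight `mu` is an assumed illustration value, not one used in the paper.

```python
import numpy as np
from collections import Counter

def cache_interpolate(p_ngram, history, vocab_index, mu=0.1):
    """Boost words that already occurred in the history (the cache-effect)."""
    counts = Counter(w for w in history if w in vocab_index)
    total = sum(counts.values())
    if total == 0:
        return p_ngram                               # empty cache: fall back to the n-gram
    p_cache = np.zeros_like(p_ngram)
    for w, c in counts.items():
        p_cache[vocab_index[w]] = c / total          # cache unigram
    return (1.0 - mu) * p_ngram + mu * p_cache       # linear interpolation with the n-gram
```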

                       + Meet   + Fish   − Meet   − Fish
Word in hist.          60%      63%      5%       6%
Word out of hist.      8%       7%       27%      24%
Total                  68%      70%      32%      30%

Table 3: Number of improved LSA word probabilities.

Table 3 shows the number of improved word probabilities for the meeting and the Fisher model on the heldout data. '+' means that the probability of the LSA model was higher than the n-gram model probability, '−' means that it was lower. The end-of-sentence event is not included.

For the meeting model 60% of the improvements are due to the cache-effect, where the word appears in the history. This value is so high because we use the decay parameter: a word disappears from the pseudo-document, but it still stays in our cache for the whole meeting, which increases the cache-effect. So a certain amount of this improvement is actually due to the semantics of the LSA model, because the word vector is decayed in the pseudo-document while the word stays in the cache for the whole meeting. The percentage of the class +/Word not in hist. has to be increased by this amount.

We can estimate this amount by assuming that each meeting contains ≈ 7500 (90455/12 meetings) words, and that the last 100 words are present in the pseudo-document. We know that 60% of the words fall under the category +/Word in hist. (≈ 4500 words). But this is only true if we assume the history to be the whole preceding meeting and not just the last 100 words. The mean length of the history for a document of length k is given by the arithmetic mean (0 + 1 + ... + (k − 1))/k = (k − 1)/2. In our case the mean length of the history is ≈ 3700.
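As a quick sanity check of this estimate (our arithmetic, using the rounded figure of k ≈ 7500 words per meeting):

\bar{h} = \frac{0 + 1 + \cdots + (k-1)}{k} = \frac{k-1}{2}, \qquad k \approx 7500 \;\Rightarrow\; \bar{h} \approx 3750,

which is consistent with the ≈ 3700 quoted above.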
So we know that given a mean history length of around 3700, 60% of the words fall under the former class, but given a mean history length of around 100, some improvement also falls into the class +/Word not in hist., which must therefore be significantly higher than 7%. The same reasoning applies to the Fisher model, where the performance is even better.

According to two t-tests for paired samples the differences between LSA and n-gram models for the following classes are significant: +/Word in hist., −/Word in hist., and −/Word not in hist., for both the meeting and the Fisher model (p < 0.05). The difference within the class +/Word not in hist. is however not significant, but as already mentioned the true size of this class is bigger than the estimated size.

This analysis shows that LSA-based models cannot simply be replaced by cache-based models. Although the repetition effect is important for LSA models, they also cover other semantic information.

3.3. The Temperature Effect: γ Exponent Optimization

The temperature parameter γ (Definition 5) is used to extend the small dynamic range of the LSA similarity (Coccaro and Jurafsky, 1998). Here we want to optimize this parameter and show how it changes the LSA similarities. The similarities were scaled by using the minimum similarity given the history, as in Definition 5. Otherwise the exponent would make negative similarities positive.

[Figure 3: Similarities for γ = 8. Histogram of similarity values (x-axis: Similarity; y-axis: Number of Words in Bin).]

Figure 3 shows the similarity distribution for a γ value of 8 for the Fisher model on the heldout data. This distribution expanded the similarity range and assembles a lot of similarities around zero.

In Figure 3 all similarities < 1.0 get pruned in comparison to γ = 1. This is due to the nature of the exponentiation, where all values in [0, 1] get smaller when exponentiated. To change this one can add an offset β ∈ [0, 1] to the similarities to avoid pruning of similarities in the interval [1 − β, 1]. For β = 1 there is no pruning since all similarities are then bigger than or equal to 1, and the similarity distribution gets flatter. We also optimized β to find the effect of values that are smaller than 1.
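A sketch of how the offset β changes the probability computation of Eq. (5); this is a variant of `lsa_probs` above, and the parameter names are our own.

```python
import numpy as np

def lsa_probs_offset(sims, gamma=8.0, beta=0.0):
    """Turn raw similarities into P_LSA as in Eq. (5), with an additional offset beta.
    Values in [0, 1] shrink when exponentiated; beta > 0 lifts them so they are not pruned."""
    shifted = sims - sims.min() + beta         # subtract K_min, then add the offset
    powered = shifted ** gamma                 # temperature exponentiation
    return powered / powered.sum()             # normalize over the vocabulary
```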

For our work it is interesting to see which γ values optimize the perplexity on the heldout data. Figure 4 shows perplexities of the Fisher model on the heldout data for different values of γ and β.

[Figure 4: Perplexities for the Fisher LSA model with different γ and β values (β ∈ {0, 0.1, 0.25, 0.5, 1.0}; x-axis: similarity exponent γ, 4-gram baseline and 1–40; y-axis: perplexity, 128–142).]

[Figure 5: Perplexities for the meeting LSA model with different γ and β values (β ∈ {0, 0.1, 0.25, 0.5, 1.0}; x-axis: similarity exponent γ, 4-gram baseline and 1–25; y-axis: perplexity, 110–125).]

One can see that the lowest perplexity for all β values is nearly the same, while only the exponent shifts. It can also be seen that all LSA models outperform the 4-gram model even for γ = 1. The optimal γ value for the meetings is smaller than for the Fisher model for all β (Figure 5). One generalization we can make from experiments with other models is that the optimal γ value is in general higher for bigger models, i.e. models that are trained on larger corpora. This is also reflected in the relation between the meeting model and the Fisher model, which can be seen from Figures 4 and 5.

With the first approach one comes up with a much smaller exponent than with the second. We conjecture that the different values of exponents found in the literature, ranging from 7 (Coccaro and Jurafsky, 1998) to 20 (Deng and Khudanpur, 2003), are due to the usage of different values of β. Since we do not see a difference in perplexity, we conclude that it does not matter which approach one chooses.

The temperature parameter was optimized independently from the interpolation parameters. We found that this value is stable over different test data sets.
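The γ/β optimization itself can be sketched as a simple sweep over a small grid; `heldout_perplexity` is again a hypothetical helper that rescores the heldout meetings for given γ and β.

```python
import itertools

def optimize_gamma_beta(heldout_perplexity,
                        gammas=(1, 2, 4, 8, 12, 16, 20, 25),
                        betas=(0.0, 0.1, 0.25, 0.5, 1.0)):
    """Brute-force search for the similarity exponent gamma and offset beta."""
    best = None
    for gamma, beta in itertools.product(gammas, betas):
        ppl = heldout_perplexity(gamma=gamma, beta=beta)
        if best is None or ppl < best[0]:
            best = (ppl, gamma, beta)
    return best                                  # (perplexity, gamma, beta)
```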

3.4. The History Effect: δ Decay Optimization

Here we show how the decay parameter δ influences the perplexity. The perplexities of the 4-gram Fisher and meeting models are again our baselines. As a test set we again use the meeting heldout data. The idea of the decay parameter is to update the pseudo-document in such a way that recently seen words get a higher weight than words in a more distant history. Eventually the words that are far away from the actual word are forgotten and have no more influence on the prediction of the actual word. (Bellegarda, 2000a) finds a value around 0.98 to be optimal for the decay parameter δ.

[Figure 6: Perplexities for a 4-gram and LSA models with different decays δ (Meetings and Fisher; x-axis: decay, 4-gram baseline and 0.1–1.0; y-axis: perplexity, 110–140).]

Figure 6 shows that there is a constant drop in perplexity as we increase the length of the history, i.e. the value of δ. There is however not much difference between the decay values 0.98 and 1.0, the latter meaning that no words are forgotten. But even the shortest history, with a decay of 0.05, has a lower perplexity than the 4-gram for the Fisher and the meeting model on the heldout data. We can conclude that it is beneficial for our models not to forget too fast, but there is no big difference between forgetting very slowly and never forgetting. In our experiments we nevertheless used a decay of 0.98 because it still has the best performance concerning perplexity and because it was also found to be optimal by others (Bellegarda, 2000a).

The decay parameter was optimized independently of the interpolation parameters and the temperature parameter.
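The pseudo-document update with decay can be sketched as follows. This is our formulation of the update described in Section 2.2; the exact weighting of the newly added word is not spelled out in the paper, so the (1 − ε_w) factor is an assumption following Bellegarda (2000a).

```python
import numpy as np

def update_pseudo_doc(d_tilde, word_id, U, S, eps, delta=0.98):
    """Decay the previous pseudo-document vector and add the current word.
    The (1 - eps_w) weighting of the new word is an assumption, not given in the paper."""
    s = np.diag(S)
    word_vec = U[word_id] * s                    # u_i S, the word's position in the LSA space
    return delta * d_tilde + (1.0 - eps[word_id]) * word_vec
```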
4. Conclusion

We showed how to optimize the parameters for interpolated LSA-based language models and saw that simple linear interpolation did not achieve any improvements. With the INFG interpolation we achieved an improvement and a model selection.

The comparison between LSA and cache-based models showed that a large amount of the improvement is due to the repetition of words, but there is also an improvement that relies on other features of the LSA-based models. So cache-based models cannot simply replace LSA-based models.

We also presented the optimization of the similarity exponent and offset and saw the relation between the offset selection and the similarity exponent.

The optimization of the decay parameter showed that it makes little difference whether the decay is close to or equal to one, but a bigger difference when the value gets close to zero.

We can conclude that the optimization of parameters is crucial for outperforming word-based n-gram language models with interpolated LSA models.

5. References

J.R. Bellegarda. 2000a. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8):1279–1296, August.
J.R. Bellegarda. 2000b. Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84, January.
I. Bulyko, M. Ostendorf, and A. Stolcke. 2003. Class-dependent interpolation for estimating language models from multiple text sources. Technical Report UWEETR-2003-0000, University of Washington, EE Department.
C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for language modeling. In Proceedings of COLING-ACL, pages 225–231, San Francisco, California.
N. Coccaro and D. Jurafsky. 1998. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of ICSLP-98, volume 6, pages 2403–2406, Sydney.
S. Deerwester, S.T. Dumais, G.W. Furnas, and T.K. Landauer. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Y. Deng and S. Khudanpur. 2003. Latent semantic information in maximum entropy language models for conversational speech recognition. In Proceedings of HLT-NAACL, pages 56–63, Edmonton.
R. Kuhn and R. De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583.
M. Pucher, Y. Huang, and Ö. Çetin. 2006. Combination of latent semantic analysis based language models for meeting recognition. In Computational Intelligence 2006, San Francisco, USA.