Cache-Augmented Latent Topic Language Models for Speech Retrieval

Jonathan Wintrode
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD
[email protected]

Abstract

We aim to improve speech retrieval performance by augmenting traditional N-gram language models with different types of topic context. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze our proposed model both for its ability to capture word repetition via the cache and for its suitability as a language model for speech recognition and retrieval. We show that this model, augmented with the cache, captures intuitive repetition behavior across languages and exhibits lower perplexity than regular LDA on held-out data in multiple languages. Lastly, we show that our joint model improves speech retrieval performance beyond N-grams or latent topics alone when applied to a term detection task in all languages considered.

1 Introduction

The availability of spoken digital media continues to expand at an astounding pace. According to YouTube's publicly released statistics, between August 2013 and February 2015 content upload rates tripled from 100 to 300 hours of video per minute (YouTube, 2015). Yet the information content therein, while accessible via links, tags, or other user-supplied metadata, is largely inaccessible via content search within the speech itself. Speech retrieval systems typically rely on Large Vocabulary Continuous Speech Recognition (LVCSR) to generate a lattice of word hypotheses for each document, indexed for fast search (Miller and others, 2007). However, for sites like YouTube, localized in over 60 languages (YouTube, 2015), the likelihood of high-accuracy recognition in most languages is quite low.

Our proposed solution is to focus on topic information in spoken language as a means of dealing with errorful speech recognition output in many languages. It has been repeatedly shown that a task like topic classification is robust to high (40-60%) word error rate systems (Peskin, 1996; Wintrode, 2014b). We would leverage the topic signal's strength for retrieval in a high-volume, multilingual digital media processing environment.

The English word topic, defined as a particular 'subject of discourse' (Houghton-Mifflin, 1997), arises from the Greek root τόπος, meaning a physical 'place' or 'location'. However, the semantic concepts of a particular subject are not disjoint from the physical location of the words themselves.

The goal of this work is to jointly model two aspects of topic information, local context (repetition) and broad context (subject matter), which we previously treated in an ad hoc manner (Wintrode and Khudanpur, 2014), within a latent topic framework. We show that in doing so we can achieve better word retrieval performance than language models with only N-gram context on a diverse set of spoken languages.

2 Related Work

Both repetition and broad topic context have been exploited in a variety of ways by the speech recognition and retrieval communities. Cache-based or adaptive language models were some of the first approaches to incorporate information beyond a short N-gram history (where N is typically 3-4 words).


Cache-based models assume the probability of a word in a document d is influenced both by the global frequency and N-gram context of that word and by the N-gram frequencies within d (or a preceding cache of K words). Although most words are rare at the corpus level, when they do occur, they occur in bursts. Thus a local estimate, from the cache, may be more reliable than the global estimate. Jelinek (1991) and Kuhn (1990) both successfully applied these types of models to speech recognition, and Rosenfeld (1994), using what he referred to as 'trigger pairs', also realized significant gains in WER. More recently, recurrent neural network language models (RNNLMs) have been introduced to capture more of these "long-term dependencies" (Mikolov et al., 2010). In terms of speech retrieval, recent efforts have looked at exploiting repeated keywords at search time, without directly modifying the recognizer (Chiu and Rudnicky, 2013; Wintrode, 2014a).

Work within the information retrieval (IR) community connects topicality with retrieval. Hearst and Plaunt (1993) reported that the "subtopic structuring" of documents can improve full-document retrieval. Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) or Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) are used to augment the document-specific language model in probabilistic, language-model based IR (Wei and Croft, 2006; Chen, 2009; Liu and Croft, 2004; Chemudugunta et al., 2007). In all these cases, topic information was helpful in boosting retrieval performance above baseline vector space or N-gram models.

Our proposed model closely resembles that of Chemudugunta et al. (2007), with our notions of broad and local context corresponding to their "general and specific" aspects. The unigram cache case of our model should correspond to their "special words" model; however, we do not constrain our cache component to only unigrams.

With respect to speech recognition, Florian and Yarowsky (1999) and Khudanpur and Wu (1999) use vector-space clustering techniques to approximate the topic content of documents and augment a baseline N-gram model with topic-specific N-gram counts. Clarkson and Robinson (1997) proposed a similar application of cache and mixture models, but only demonstrate small perplexity improvements. Similar approaches use latent topic models to infer a topic mixture of the test document (soft clustering), with significant recognition error reductions (Heidel et al., 2007; Hsu and Glass, 2006; Liu and Liu, 2008; Huang and Renals, 2008). Instead of interpolating with a traditional backoff model, Chien and Chueh (2011) use topic models with and without a dynamic cache to good effect as a class-based language model.

We build on the cluster-oriented results, particularly Khudanpur and Wu (1999) and Wintrode and Khudanpur (2014), but within an explicit framework, jointly capturing both types of topic information that many have leveraged individually.

3 Cache-augmented Topic Model

We propose a straightforward extension of the LDA topic model (Blei et al., 2003; Steyvers and Griffiths, 2007), allowing words to be generated either from a latent topic or from a document-level cache. At each word position we flip a biased coin. Based on the outcome we either generate a latent topic and then the observed word, or we pick a new word directly from the cache of already observed words. Thus we would jointly learn the underlying topics and the tendency towards repetition.

Algorithm 1: Cache-augmented generative process

    for all t ∈ T do
        draw φ(t) ∼ Dirichlet(β)
    for all d ∈ D do
        draw θ(d) ∼ Dirichlet(α)
        draw κ(d) ∼ Beta(ν0, ν1)
        for wd,i, 1 ≤ i ≤ |d| do
            draw kd,i ∼ Bernoulli(κ(d))
            if kd,i = 0 then
                draw zd,i ∼ θ(d)
                draw wd,i ∼ φ(t = zd,i)
            else
                draw wd,i ∼ Cache(d, W−i)
            end if
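For concreteness, the generative story of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's Mallet-based implementation; the corpus sizes, hyperparameter values, and the uniform draw over previously generated tokens (equivalent to a unigram cache) are placeholder assumptions.

    import numpy as np

    def generate_corpus(num_topics, vocab_size, doc_lengths,
                        alpha=0.1, beta=0.01, nu=(1.0, 1.0), seed=0):
        """Sample documents from the cache-augmented topic model (Algorithm 1)."""
        rng = np.random.default_rng(seed)
        # Topic-word multinomials phi^(t) ~ Dirichlet(beta)
        phi = rng.dirichlet([beta] * vocab_size, size=num_topics)
        corpus = []
        for length in doc_lengths:
            theta = rng.dirichlet([alpha] * num_topics)   # theta^(d) ~ Dirichlet(alpha)
            kappa = rng.beta(nu[0], nu[1])                # kappa^(d) ~ Beta(nu0, nu1)
            words, cache = [], []
            for i in range(length):
                # k_{d,i} ~ Bernoulli(kappa^(d)); the cache draw is only available
                # once at least one word of d has been observed.
                if cache and rng.random() < kappa:
                    w = cache[rng.integers(len(cache))]   # w_{d,i} ~ Cache(d, W_{-i})
                else:
                    z = rng.choice(num_topics, p=theta)   # z_{d,i} ~ theta^(d)
                    w = rng.choice(vocab_size, p=phi[z])  # w_{d,i} ~ phi^(z_{d,i})
                words.append(w)
                cache.append(w)
            corpus.append(words)
        return corpus, phi

    # Example: four short documents over a 500-word vocabulary and 10 topics.
    docs, topics = generate_corpus(num_topics=10, vocab_size=500,
                                   doc_lengths=[80, 120, 60, 100])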

As with LDA, we assume each corpus is drawn from T latent topics. Each topic is denoted φ(t), a multinomial random variable over the vocabulary, where φv(t) is the probability P(wv | t). For each document we draw θ(d), where θt(d) is the probability P(t | d).

We introduce two additional sets of variables, κ(d) and kd,i. The state kd,i is a Bernoulli variable indicating whether the word wd,i is drawn from the cache or from the latent topic state. κ(d) is the document-specific prior on the cache state kd,i.

Algorithm 1 gives the generative process explicitly. We choose a Beta prior on κ(d) for the Bernoulli variables kd,i. As with the Dirichlet priors, this allows for a straightforward formulation of the joint probability P(W, Z, K, Φ, Θ, κ), from which we derive densities for Gibbs sampling. A plate diagram is provided in Figure 1, illustrating the dependence both on the latent variables and on the cache of previous observations.

[Figure 1: Cache-augmented model plate diagram.]

We implement our model as a collapsed Gibbs sampler extending Java classes from the Mallet topic modeling toolkit (McCallum, 2002). We use the Gibbs sampler for parameter estimation (training data) and inference (held-out data). We also leverage Mallet's hyperparameter re-estimation (Wallach et al., 2009), which we apply to α, β, and ν.

4 Language Modeling

Our primary goal in constructing this model is to apply it to language models for speech recognition and retrieval. Given an LVCSR system with a standard N-gram language model (LM), we now describe how we incorporate the inferred topic and cache model parameters of a new document into the base LM for subsequent recognition tasks on that specific document.

We begin by estimating model parameters on a training corpus: the topics φ(t), the cache proportions κ(d), and the hyperparameters α, β, and ν (the Beta hyperparameter). In our experiments we restrict the training set to the LVCSR acoustic and language model training data. This restriction is required by the Babel task, not the model; using other corpora or text resources certainly should be considered for other tasks.

To apply the model during KWS, we first decode a new audio document d with the base LM PL and extract the most likely observed word sequence W for inference. The inference process gives us the estimates for θ(d) and κ(d), which we then use to compute document-specific and cache-augmented language models.

From a language modeling perspective we treat the multinomials φ(t) as unigram LMs and use the inferred topic proportions θ(d) as a set of mixture weights. From these we compute the document-specific unigram model for d (Eqn. 1). This serves to capture what we have referred to as the broad topic context.

We incorporate both Pd and the cache Pc (local context) into the base model PL using linear interpolation of probabilities. Word histories are denoted hi for brevity. For our experiments we first combine Pd with the N-gram model (Eqn. 2). We then interpolate with the cache model to get a joint topic and cache language model (Eqn. 4).

    Pd(wi)      = Σ_{t=1}^{T} θt(d) · φwi(t)                       (1)
    PLd(wi)     = λ · Pd(wi) + (1 − λ) · PL(wi)                    (2)
    Pdc(wi)     = κ(d) · Pc(wi) + (1 − κ(d)) · Pd(wi)              (3)
    PLdc(wi|hi) = κ(d) · Pc(wi|hi) + (1 − κ(d)) · PLd(wi|hi)       (4)

We expect the inferred document cache probability κ(d) to serve as a natural interpolation weight when combining the document-specific unigram model Pd with the cache. We consider alternatives to the per-document κ(d) as part of the speech retrieval evaluation (Section 6) and show that our model's estimate is indeed effective.
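To make the interpolation scheme concrete, the sketch below evaluates Eqns. 1-4 for a single word. It assumes a unigram cache and a generic base_lm callable standing in for the N-gram model PL; the weight λ and all toy inputs are illustrative assumptions, not values from the paper.

    import numpy as np
    from collections import Counter

    def doc_topic_unigram(theta_d, phi):
        """Eqn. 1: Pd(w) = sum_t theta_t(d) * phi_w(t), as a vocabulary-sized vector."""
        return theta_d @ phi                      # (T,) @ (T, V) -> (V,)

    def cache_unigram(cache_words, vocab_size, smoothing=1e-6):
        """Unigram cache distribution Pc from the words observed so far in d."""
        counts = np.full(vocab_size, smoothing)
        for w, c in Counter(cache_words).items():
            counts[w] += c
        return counts / counts.sum()

    def p_Ldc(word, history, kappa_d, p_d, p_c, base_lm, lam=0.5):
        """Eqns. 2-4: cache- and topic-augmented probability of `word` given `history`."""
        p_Ld = lam * p_d[word] + (1.0 - lam) * base_lm(word, history)   # Eqn. 2
        return kappa_d * p_c[word] + (1.0 - kappa_d) * p_Ld             # Eqn. 4

    # Usage with toy inputs (10 topics, 500-word vocabulary):
    rng = np.random.default_rng(1)
    phi = rng.dirichlet([0.01] * 500, size=10)
    theta_d = rng.dirichlet([0.1] * 10)
    p_d = doc_topic_unigram(theta_d, phi)
    p_c = cache_unigram([3, 7, 7, 42], vocab_size=500)
    uniform_lm = lambda w, h: 1.0 / 500           # stand-in for the 3-gram model PL
    prob = p_Ldc(7, history=(3,), kappa_d=0.3, p_d=p_d, p_c=p_c, base_lm=uniform_lm)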

Language      50t    100t   150t   200t
Tagalog       0.41   0.29   0.22   0.16
Vietnamese    0.51   0.39   0.29   0.22
Zulu          0.33   0.26   0.21   0.16
Tamil         0.36   0.27   0.18   0.14

Table 1: Mean κ(d) inferred from the 10 hour development data, by number of latent topics.

5 Model Analysis

Before looking at the model in terms of retrieval performance (Section 6), we first examine how well our model captures the repetition in each corpus and how well it functions as a language model (cf. Equation 3) in terms of perplexity.

To focus on language models for speech retrieval in the limited-resource setting, we build and evaluate our model under the IARPA Babel Limited Language Pack (LP), No Target Audio Reuse (NTAR) condition (Harper, 2011). We selected the Tagalog, Vietnamese, Zulu, and Tamil corpora¹ to expose our model to as diverse a set of languages as possible (in terms of morphology, phonology, language family, etc., in line with the Babel program goals).

¹ Releases babel106b-v0.2g, babel107b-v0.7, babel206b-v0.1e, and babel204b-v1.1b, respectively.

The Limited LP includes a 10 hour training set (audio and transcripts) which we use for building acoustic and language models. We also estimate the parameters for our topic model from the same training data. The Babel corpora contain spontaneous conversational telephone speech, but without the constrained topic prompts of LDC's Fisher collections we would expect a sparse collection of topics. Yet for retrieval we are nonetheless able to leverage this information.

We estimate the parameters φ(t), κ(d), α, β, and ν on the training transcripts in each language, then use these parameters to infer θ(d) (topic proportions) and κ(d) (cache usage) for each document in the held-out set. We use the inferred κ(d) and θ(d) to perform the language model interpolation (Eqns. 3, 4). In addition, the mean of the inferred κ(d) values for a corpus ought to provide a snapshot of the amount of repetition within it.

Two trends emerge when we examine the mean over κ(d) by language. First, as shown in Table 1, the more latent topics are used, the lower the inferred κ values. Regardless of the absolute value, we see that κ for Vietnamese is consistently higher than for the other languages. This fits our intuition about the languages, given that the Vietnamese transcripts use syllable-level word units and we would expect to see more repetition.

Secondly, we consider which words are drawn from the cache versus the topics during the inference process. Examining the final sampling state, we count how often each word in the vocabulary is drawn from the cache (where kd,i = 1). Intuitively, this count is highly correlated (ρ > 0.95) with the corpus frequency of each word (cf. Figure 2). That is, cache states are assigned to the word types most likely to repeat.

[Figure 2: Cache and corpus frequencies for each word type in the Vietnamese and Zulu training corpora.]

5.1 Perplexity

While our measurements of cache usage correspond to intuition, our primary goal is to construct useful language models. After estimating parameters on the training corpora, we infer κ(d) and θ(d) and then measure perplexity using document-specific language models on the development set. We compute perplexity on the topic unigram mixtures according to Pd and Pdc (Eqns. 1 & 3). Here we do not interpolate with the base N-gram LM, so as to compare only unigram mixtures. Table 2 gives the perplexity for standard LDA (Pd only) and for our model without and with the cache added (κLDA0 and κLDA, respectively).
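For reference, perplexity of the unigram mixtures Pd and Pdc can be computed as sketched below, using the usual definition (the exponential of the average negative log-probability per token). This is a generic sketch, not the exact evaluation script behind Table 2, and the commented usage names are hypothetical.

    import numpy as np

    def unigram_perplexity(docs, doc_probs):
        """Perplexity of per-document unigram distributions on held-out documents.

        docs:      list of documents, each a list of integer word ids
        doc_probs: list of vocabulary-sized arrays, e.g. Pd (Eqn. 1) or Pdc (Eqn. 3)
        """
        log_prob, num_tokens = 0.0, 0
        for words, p in zip(docs, doc_probs):
            log_prob += np.sum(np.log(np.maximum(p[words], 1e-12)))  # floor unseen mass
            num_tokens += len(words)
        return float(np.exp(-log_prob / num_tokens))

    # Example: compare the topic-only mixture Pd with the cache-augmented Pdc.
    # ppl_lda  = unigram_perplexity(heldout_docs, heldout_p_d)
    # ppl_klda = unigram_perplexity(heldout_docs, heldout_p_dc)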

Language     T     LDA      κLDA0    κLDA
Tagalog      50    142.90   163.30   134.43
             100   136.63   153.99   132.35
             150   139.76   146.08   130.47
             200   128.05   141.12   129.94
Vietnamese   50    257.94   283.52   217.30
             100   243.51   263.03   210.05
             150   232.60   245.75   205.59
             200   223.82   234.44   204.25
Zulu         50    183.53   251.52   203.56
             100   179.44   267.42   217.11
             150   174.79   269.01   223.90
             200   175.65   252.03   217.89
Tamil        50    273.08   356.40   283.82
             100   369.18   297.68   265.02
             150   259.42   361.79   301.92
             200   236.30   341.32   298.26

Table 2: Perplexities of topic unigram mixtures on held-out data, with and without the cache.

With respect to perplexity, interpolating with the cache (κLDA) provides a significant improvement over the cache-free mixture (κLDA0) for all languages and values of T. In general, perplexity decreases as the number of latent topics increases, excepting certain Zulu and Tamil models. For Tagalog and Vietnamese our cache-augmented model also outperforms the standard LDA model in terms of perplexity. However, as we will see in the next section, the lowest-perplexity models are not necessarily the best in terms of retrieval performance.

6 Speech Retrieval

We evaluate the utility of our topic language model for speech retrieval via the term detection, or keyword search (KWS), task. Term detection accuracy is the primary evaluation metric for the Babel program. We use the topic- and cache-augmented language models (Eqn. 4) to improve the speech recognition stage of the term detection pipeline, increasing overall search accuracy by 0.5 to 1.7% absolute over a typical N-gram language model.

The term detection task is this: given a corpus of audio documents and a list of terms (words or phrases), locate all occurrences of the key terms in the audio. The resulting list of detections is scored using the Term Weighted Value (TWV) metric. TWV is a cost-value trade-off between the miss probability, P(miss), and the false alarm probability, P(FA), averaged over all keywords (NIST, 2006). For comparison with previously published results, we score against the IARPA-supplied evaluation keywords.

We train acoustic and language models (LMs) on the 10 hour training set using the Kaldi toolkit (Povey and others, 2011), according to the training recipe described in detail by Trmal et al. (2014). While Kaldi produces different flavors of acoustic models, we report results using the hybrid HMM-DNN (deep neural network) acoustic models, trained with a minimum phone error (MPE) criterion and based on PLP (perceptual linear prediction) features augmented with pitch. All results use 3-gram LMs with Good-Turing (Tagalog, Zulu, Tamil) or modified Kneser-Ney (Vietnamese) smoothing. This AM/LM combination (our baseline) has consistently demonstrated state-of-the-art performance for a single system on the Babel task.

As described, we estimate our model parameters φ(t), κ(d), α, β, and ν from the training transcripts. We decode the development corpus with the baseline models, then infer θ(d) and κ(d) from the first-pass output. In principle we simply compute PLdc for each document, re-score the first-pass output, and then search for keywords.

Practical considerations for cache language models include, for example, how big the cache should be, or whether it should decay, so that words further from the current word are discounted proportionally. In the Kaldi framework, speech is processed in segments (i.e. conversation turns), and current tools do not allow one to vary the language model dynamically within a particular segment. With that in mind, our KWS experiments construct a different language model (PLdc) for each segment, where Pc is computed from all other segments in the current document except the one being processed.
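A minimal sketch of the per-segment rescoring LM just described: for each segment, the cache distribution Pc is built from all other segments of the same conversation and combined with the topic-adapted model using the inferred κ(d), as in Eqn. 4. It assumes a unigram cache, and the function and variable names are illustrative rather than Kaldi's actual rescoring interface.

    from collections import Counter

    def leave_one_out_cache(segments, seg_index, vocab_size, smoothing=1e-6):
        """Unigram cache Pc for one segment, counted over all *other* segments of d."""
        counts = Counter()
        for j, seg in enumerate(segments):
            if j != seg_index:
                counts.update(seg)
        total = sum(counts.values()) + smoothing * vocab_size
        return lambda w: (counts[w] + smoothing) / total

    def segment_lm(segments, seg_index, kappa_d, p_Ld, vocab_size):
        """PLdc for one segment (Eqn. 4), with p_Ld the topic-interpolated base model."""
        p_c = leave_one_out_cache(segments, seg_index, vocab_size)
        return lambda w, h: kappa_d * p_c(w) + (1.0 - kappa_d) * p_Ld(w, h)

    # Usage: given first-pass word ids per segment of one conversation, rescore each
    # segment's lattice with its own segment_lm(...) before keyword search.
    # lm_for_seg0 = segment_lm(first_pass_segments, 0, kappa_d, p_Ld, vocab_size=5000)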
6.1 Results

We can show, by re-scoring LVCSR output with a cache-augmented topic LM, that the document-specific topic (Pd) and cache (Pc) information together improve our overall KWS performance in each language, by up to 1.7% absolute.

Figure 3 illustrates search accuracy (TWV) for each language under various settings of T. It also captures alternatives to using κ(d) as the interpolation weight for the cached unigrams. To illustrate this contrast we substituted the training mean κtrain for κ(d) as the interpolation weight when computing PLdc (Eqn. 4). Except for Zulu, the inferred κ(d) were more effective, but not hugely so.
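For completeness, the sketch below scores a detection list with a simplified version of TWV, following the NIST STD 2006 definition cited above with β = 999.9 and non-target trials approximated as one per second of speech. The keyword counts are invented for illustration, and details such as the alignment of detections to reference occurrences are omitted.

    def term_weighted_value(detections, references, speech_seconds, beta=999.9):
        """Simplified TWV: 1 - average over keywords of (P_miss + beta * P_FA).

        detections:     dict keyword -> (num_correct_detections, num_false_alarms)
        references:     dict keyword -> number of true occurrences in the audio
        speech_seconds: total duration of the searched speech, in seconds
        """
        costs = []
        for kw, n_true in references.items():
            if n_true == 0:
                continue                               # keywords with no occurrences are excluded
            n_correct, n_fa = detections.get(kw, (0, 0))
            p_miss = 1.0 - n_correct / n_true
            p_fa = n_fa / (speech_seconds - n_true)    # non-target trials ~ 1 per second
            costs.append(p_miss + beta * p_fa)
        return 1.0 - sum(costs) / len(costs)

    # Example: two hypothetical keywords scored over ~10 hours (36,000 s) of audio.
    twv = term_weighted_value({"kahapon": (3, 1), "balita": (5, 0)},
                              {"kahapon": 4, "balita": 6},
                              speech_seconds=36000.0)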

Language     T     3-gram   κLDA0    κLDA
Tagalog      50    0.244    0.247    0.261
Vietnamese   50    0.254    0.254    0.259
Zulu         100   0.270    0.274    0.278
Tamil        200   0.216    0.228    0.227

Table 3: Best KWS accuracy (TWV) in each language.

[Figure 3: KWS accuracy for different choices of T.]

The effect of the number of latent topics T on search accuracy also varies by language, as does the overall effect of incorporating the cache in addition to latent topics (κLDA0 vs. κLDA). For example, in Tagalog we observe most of the improvement over N-grams from the cache information, whereas in Tamil the cache provided no additional information over latent topics.

The search accuracies for the best systems from Figure 3 are shown in Table 3 with the corresponding choice of T. Effects on WER were mixed under the cache model, improving Zulu from 67.8 to 67.6% and degrading Tagalog from 60.8 to 61.1%, with Vietnamese and Tamil unchanged.

7 Conclusions and Future Work

With our initial effort in formulating a model combining latent topics with a cache-based language model, we believe we have presented a model that estimates both informative and useful parameters from the data and supports improved speech retrieval performance. The results presented here reinforce the conclusion that topics and repetition, broad and local context, are complementary sources of information for speech language modeling tasks.

We hope to address two particular limitations of our model in the near future. First, all of our improvements are obtained by adding unigram probabilities to a 3-gram language model. We would naturally want to extend our model to explicitly capture the cache and topic behavior of N-grams.

Secondly, our models are restricted by the first-pass output of the LVCSR system. Keywords not present in the first pass cannot be recalled by a re-scoring-only approach. An alternative would be to use our model to re-decode the audio and realize subsequently larger gains. Given that our re-scoring model worked sufficiently well across four fundamentally different languages, we are optimistic this would be the case.

Acknowledgements

This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.

We would also like to thank all of the reviewers for their insightful and helpful comments, and above all their time.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. In JMLR, volume 3, pages 993–1022. JMLR.org.

Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2007. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 241. MIT Press.

Berlin Chen. 2009. Latent Topic Modelling of Word Co-occurrence Information for Spoken Document Retrieval. In Proc. of ICASSP, pages 3961–3964. IEEE.

Jen-Tzung Chien and Chuang-Hua Chueh. 2011. Dirichlet Class Language Models for Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):482–495.

Justin Chiu and Alexander Rudnicky. 2013. Using Conversational Word Bursts in Spoken Term Detection. In Proc. of Interspeech, pages 2247–2251. ISCA.

Kenneth Ward Church and William A. Gale. 1995. Poisson Mixtures. Natural Language Engineering, 1(2):163–190.

Philip R. Clarkson and Anthony J. Robinson. 1997. Language Model Adaptation Using Mixtures and an Exponentially Decaying Cache. In Proc. of ICASSP, volume 2, pages 799–802. IEEE.

Radu Florian and David Yarowsky. 1999. Dynamic Nonlocal Language Modeling via Hierarchical Topic-based Adaptation. In Proc. of ACL, pages 167–174. ACL.

Mary Harper. 2011. Babel BAA. http://www.iarpa.gov/index.php/research-programs/babel/baa.

Marti A. Hearst and Christian Plaunt. 1993. Subtopic Structuring for Full-length Document Access. In Proc. of SIGIR, pages 59–68. ACM.

Aaron Heidel, Hung-an Chang, and Lin-shan Lee. 2007. Language Model Adaptation Using Latent Dirichlet Allocation and an Efficient Topic Inference Algorithm. In Proc. of Interspeech. ISCA.

Thomas Hofmann. 2001. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1):177–196.

Houghton-Mifflin. 1997. The American Heritage College Dictionary. Houghton Mifflin.

Bo-June Paul Hsu and James Glass. 2006. Style & Topic Language Model Adaptation Using HMM-LDA. In Proc. of EMNLP, pages 373–381. ACL.

Songfang Huang and Steve Renals. 2008. Unsupervised Language Model Adaptation Based on Topic and Role Information in Multiparty Meetings. In Proc. of Interspeech. ISCA.

Frederick Jelinek, Bernard Merialdo, Salim Roukos, and Martin Strauss. 1991. A Dynamic Language Model for Speech Recognition. HLT, 91:293–295.

Sanjeev Khudanpur and Jun Wu. 1999. A Maximum Entropy Language Model Integrating N-grams and Topic Dependencies for Conversational Speech Recognition. In Proc. of ICASSP, volume 1, pages 553–556. IEEE.

Roland Kuhn and Renato De Mori. 1990. A Cache-based Natural Language Model for Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583.

Xiaoyong Liu and W. Bruce Croft. 2004. Cluster-based Retrieval Using Language Models. In Proc. of SIGIR, pages 186–193. ACM.

Yang Liu and Feifan Liu. 2008. Unsupervised Language Model Adaptation via Topic Modeling Based on Named Entity Hypotheses. In Proc. of ICASSP, pages 4921–4924. IEEE.

Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Proc. of Interspeech. ISCA.

David Miller et al. 2007. Rapid and Accurate Spoken Term Detection. In Proc. of Interspeech. ISCA.

NIST. 2006. The Spoken Term Detection (STD) 2006 Evaluation Plan. http://www.itl.nist.gov/iad/mig/tests/std/2006/docs/std06-evalplan-v10.pdf. [Online; accessed 28-Feb-2013].

Barbara Peskin et al. 1996. Improvements in Switchboard Recognition and Topic Identification. In Proc. of ICASSP, volume 1, pages 303–306. IEEE.

Daniel Povey et al. 2011. The Kaldi Speech Recognition Toolkit. In Proc. of the ASRU Workshop. IEEE.

Ronald Rosenfeld. 1994. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, CMU.

Mark Steyvers and Tom Griffiths. 2007. Probabilistic Topic Models. Handbook of Latent Semantic Analysis, 427(7):424–440.

Jan Trmal et al. 2014. A Keyword Search System Using Open Source Software. In Proc. of the Spoken Language Technology Workshop. IEEE.

Hanna M. Wallach, David M. Mimno, and Andrew McCallum. 2009. Rethinking LDA: Why Priors Matter. In Proc. of NIPS, volume 22, pages 1973–1981.

Xing Wei and W. Bruce Croft. 2006. LDA-based Document Models for Ad-hoc Retrieval. In Proc. of SIGIR, pages 178–185. ACM.

Jonathan Wintrode and Sanjeev Khudanpur. 2014. Combining Local and Broad Topic Context to Improve Term Detection. In Proc. of the Spoken Language Technology Workshop. IEEE.

Jonathan Wintrode. 2014a. Can You Repeat That? Using Word Repetition to Improve Spoken Term Detection. In Proc. of ACL. ACL.

Jonathan Wintrode. 2014b. Limited Resource Term Detection for Effective Topic Identification of Speech. In Proc. of ICASSP. IEEE.

YouTube. 2015. Statistics - YouTube. http://www.youtube.com/yt/press/statistics.html, February.
