Topic Compositional Neural Language Model

Wenlin Wang¹, Zhe Gan¹, Wenqi Wang³, Dinghan Shen¹, Jiaji Huang², Wei Ping², Sanjeev Satheesh², Lawrence Carin¹
¹Duke University  ²Baidu Silicon Valley AI Lab  ³Purdue University

Abstract

We propose a Topic Compositional Neural Language Model (TCNLM), a novel method designed to simultaneously capture both the global semantic meaning and the local word-ordering structure in a document. The TCNLM learns the global semantic coherence of a document via a neural topic model, and the probability of each learned latent topic is further used to build a Mixture-of-Experts (MoE) language model, where each expert (corresponding to one topic) is a recurrent neural network (RNN) that accounts for learning the local structure of a word sequence. In order to train the MoE model efficiently, a matrix factorization method is applied, by extending each weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. The degree to which each member of the ensemble is used is tied to the document-dependent probability of the corresponding topics. Experimental results on several corpora show that the proposed approach outperforms both a pure RNN-based model and other topic-guided language models. Further, our model yields sensible topics, and also has the capacity to generate meaningful sentences conditioned on given topics.

1 Introduction

A language model is a fundamental component of natural language processing (NLP). It plays a key role in many traditional NLP tasks, ranging from speech recognition (Mikolov et al., 2010; Arisoy et al., 2012; Sriram et al., 2017) and machine translation (Schwenk et al., 2012; Vaswani et al., 2013) to image captioning (Mao et al., 2014; Devlin et al., 2015). Training a good language model often improves the underlying metrics of these applications, e.g., word error rates for speech recognition and BLEU scores (Papineni et al., 2002) for machine translation. Hence, learning a powerful language model has become a central task in NLP. Typically, the primary goal of a language model is to predict distributions over words, which has to encode both the semantic knowledge and the grammatical structure in the documents. RNN-based neural language models have yielded state-of-the-art performance (Jozefowicz et al., 2016; Shazeer et al., 2017). However, they are typically applied only at the sentence level, without access to the broad document context. Such models may consequently fail to capture long-term dependencies of a document (Dieng et al., 2016).

Fortunately, such broader context information is of a semantic nature, and can be captured by a topic model. Topic models have been studied for decades and have become a powerful tool for extracting high-level semantic structure of document collections, by inferring latent topics. The classical Latent Dirichlet Allocation (LDA) method (Blei et al., 2003) and its variants, including recent work on neural topic models (Wan et al., 2012; Cao et al., 2015; Miao et al., 2017), have been useful for a plethora of applications in NLP.

Although language models that leverage topics have shown promise, they also have several limitations. For example, some of the existing methods use only pre-trained topic models (Mikolov and Zweig, 2012), without considering the word-sequence prediction task of interest. Another key limitation of the existing methods lies in the integration of the learned topics into the language model, e.g., either through concatenating the topic vector as an additional feature of RNNs (Mikolov and Zweig, 2012; Lau et al., 2017), or re-scoring the predicted distribution over words using the topic vector (Dieng et al., 2016). The former requires a balance between the number of RNN hidden units and the number of topics, while the latter has to carefully design the vocabulary of the topic model.

[Figure 1 diagram: a Neural Topic Model (left) that infers a topic proportion vector t from the document bag-of-words d, and a Neural Language Model (right, an LSTM) that predicts the word sequence conditioned on t.]

Figure 1: The overall architecture of the proposed model.

Motivated by the aforementioned goals and limitations of existing approaches, we propose the Topic Compositional Neural Language Model (TCNLM), a new approach to simultaneously learn a neural topic model and a neural language model. As depicted in Figure 1, TCNLM learns the latent topics within a variational autoencoder (Kingma and Welling, 2013) framework, and the designed latent code t quantifies the probability of topic usage within a document. Latent code t is further used in a Mixture-of-Experts model (Hu et al., 1997), where each latent topic has a corresponding language model (expert). A combination of these “experts,” weighted by the topic-usage probabilities, results in our prediction for the sentences. A matrix factorization approach is further utilized to reduce computational cost as well as prevent overfitting. The entire model is trained end-to-end by maximizing the variational lower bound. Through a comprehensive set of experiments, we demonstrate that the proposed model is able to significantly reduce the perplexity of a language model and effectively assemble the meaning of topics to generate meaningful sentences. Both quantitative and qualitative comparisons are provided to verify the superiority of our model.

2 Preliminaries

We briefly review RNN-based language models and traditional probabilistic topic models.

Language Model  A language model aims to learn a probability distribution over a sequence of words in a pre-defined vocabulary. We denote $\mathcal{V}$ as the vocabulary set and $\{y_1, \ldots, y_M\}$ to be a sequence of words, with each $y_m \in \mathcal{V}$. A language model defines the likelihood of the sequence through a joint probability distribution

$$p(y_1, \ldots, y_M) = p(y_1) \prod_{m=2}^{M} p(y_m \mid y_{1:m-1}) \,. \quad (1)$$

RNN-based language models define the conditional probability of each word $y_m$ given all the previous words $y_{1:m-1}$ through the hidden state $h_m$:

$$p(y_m \mid y_{1:m-1}) = p(y_m \mid h_m) \,, \quad (2)$$
$$h_m = f(h_{m-1}, x_m) \,. \quad (3)$$

The function $f(\cdot)$ is typically implemented as a basic RNN cell, a Long Short-Term Memory (LSTM) cell (Hochreiter and Schmidhuber, 1997), or a Gated Recurrent Unit (GRU) cell (Cho et al., 2014). The input and output words are related via the relation $x_m = y_{m-1}$.
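
To make the factorization in (1) and the recursion in (2)-(3) concrete, below is a minimal NumPy sketch of a vanilla RNN language model scoring a word sequence; the dimensions, the zero initial state, and the zero input for the first word are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def rnn_lm_log_likelihood(y, emb, Wx, Wh, V):
    """Log-likelihood of a word sequence y (list of token ids) under a vanilla RNN LM.

    Implements p(y_m | y_{1:m-1}) = p(y_m | h_m), Eq. (2), with the recursion
    h_m = f(h_{m-1}, x_m) of Eq. (3), where f is tanh and x_m embeds y_{m-1}.
    """
    nh = Wh.shape[0]
    h = np.zeros(nh)                           # initial hidden state h_0
    x = np.zeros(emb.shape[1])                 # no previous word before y_1
    log_lik = 0.0
    for y_m in y:
        h = np.tanh(Wx @ x + Wh @ h)           # Eq. (3)
        p = softmax(V @ h)                     # distribution over the vocabulary
        log_lik += np.log(p[y_m])              # accumulates the log of Eq. (1)
        x = emb[y_m]                           # next input is the embedding of y_m
    return log_lik

# Example with random parameters (vocabulary of 20 words, 8-dim embeddings, 16 hidden units):
rng = np.random.default_rng(0)
emb, Wx, Wh, V = (rng.standard_normal(s) * 0.1 for s in [(20, 8), (16, 8), (16, 16), (20, 16)])
print(rnn_lm_log_likelihood([3, 7, 1], emb, Wx, Wh, V))
```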

Topic Model  A topic model is a probabilistic graphical representation for uncovering the underlying semantic structure of a document collection. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), for example, provides a robust and scalable approach for document modeling, by introducing latent variables for each token, indicating its topic assignment. Specifically, let $t$ denote the topic proportion for document $d$, and $z_n$ represent the topic assignment for word $w_n$. The Dirichlet distribution is employed as the prior of $t$. The generative process of LDA may be summarized as:

$$t \sim \mathrm{Dir}(\alpha_0) \,, \quad z_n \sim \mathrm{Discrete}(t) \,, \quad w_n \sim \mathrm{Discrete}(\beta_{z_n}) \,,$$

where $\beta_{z_n}$ represents the distribution over words for topic $z_n$, $\alpha_0$ is the hyper-parameter of the Dirichlet prior, $n \in [1, N_d]$, and $N_d$ is the number of words in document $d$. The marginal likelihood for document $d$ can be expressed as

$$p(d \mid \alpha_0, \beta) = \int_t p(t \mid \alpha_0) \prod_n \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t)\, dt \,.$$
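
As a concrete illustration of the LDA generative process just described (not code from the paper), one document can be sampled as follows; the number of topics, vocabulary size, document length, and Dirichlet parameters are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, N_d, alpha0 = 5, 1000, 50, 0.1          # topics, vocabulary size, words per doc, prior

beta = rng.dirichlet(np.ones(D), size=T)      # beta_k: word distribution of topic k, shape (T, D)
t = rng.dirichlet(alpha0 * np.ones(T))        # t ~ Dir(alpha_0): topic proportions of document d

doc = []
for _ in range(N_d):
    z_n = rng.choice(T, p=t)                  # z_n ~ Discrete(t)
    w_n = rng.choice(D, p=beta[z_n])          # w_n ~ Discrete(beta_{z_n})
    doc.append(w_n)
```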

3 Topic Compositional Neural Language Model

We describe the proposed TCNLM, as illustrated in Figure 1.

Our model consists of two key components: (i) a neural topic model (NTM), and (ii) a neural language model (NLM). The NTM aims to capture the long-range semantic meanings across the document, while the NLM is designed to learn the local semantic and syntactic relationships between words.

3.1 Neural Topic Model

Let $d \in \mathbb{Z}_+^D$ denote the bag-of-words representation of a document, with $\mathbb{Z}_+$ denoting nonnegative integers. $D$ is the vocabulary size, and each element of $d$ reflects a count of the number of times the corresponding word occurs in the document. Distinct from LDA (Blei et al., 2003), we pass a Gaussian random vector through a softmax function to parameterize the multinomial document topic distributions (Miao et al., 2017). Specifically, the generative process of the NTM is

$$\theta \sim \mathcal{N}(\mu_0, \sigma_0^2) \,, \quad t = g(\theta) \,, \quad z_n \sim \mathrm{Discrete}(t) \,, \quad w_n \sim \mathrm{Discrete}(\beta_{z_n}) \,, \quad (4)$$

where $\mathcal{N}(\mu_0, \sigma_0^2)$ is an isotropic Gaussian distribution, with mean $\mu_0$ and variance $\sigma_0^2$ in each dimension; $g(\cdot)$ is a transformation function that maps sample $\theta$ to the topic embedding $t$, defined here as $g(\theta) = \mathrm{softmax}(\hat{W}\theta + \hat{b})$, where $\hat{W}$ and $\hat{b}$ are trainable parameters.

The marginal likelihood for document $d$ is:

$$p(d \mid \mu_0, \sigma_0, \beta) = \int_t p(t \mid \mu_0, \sigma_0^2) \prod_n \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t)\, dt$$
$$= \int_t p(t \mid \mu_0, \sigma_0^2) \prod_n p(w_n \mid \beta, t)\, dt$$
$$= \int_t p(t \mid \mu_0, \sigma_0^2)\, p(d \mid \beta, t)\, dt \,. \quad (5)$$

The second equality in (5) holds because we can readily marginalize out the sampled topic words $z_n$ by

$$p(w_n \mid \beta, t) = \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid t) = \beta t \,. \quad (6)$$

$\beta = \{\beta_1, \beta_2, \ldots, \beta_T\}$ is the transition matrix from the topic distribution to the word distribution, which are trainable parameters of the decoder; $T$ is the number of topics and $\beta_i \in \mathbb{R}^D$ is the topic distribution over words (all elements of $\beta_i$ are nonnegative, and they sum to one).

The re-parameterization trick (Kingma and Welling, 2013) can be applied to build an unbiased and low-variance gradient estimator for the variational distribution. The parameter updates can still be derived directly from the variational lower bound, as discussed in Section 3.3.

Diversity Regularizer  Redundancy among inferred topics is a common issue in general topic models. In order to address this issue, it is straightforward to regularize the row-wise distance between each pair of topics to diversify the topics. Following Xie et al. (2015); Miao et al. (2017), we apply a topic diversity regularization while carrying out the inference. Specifically, the distance between a pair of topics is measured by their cosine distance $a(\beta_i, \beta_j) = \arccos\left(\frac{|\beta_i \cdot \beta_j|}{\|\beta_i\|_2 \, \|\beta_j\|_2}\right)$. The mean angle of all pairs of topics is $\phi = \frac{1}{T^2}\sum_i \sum_j a(\beta_i, \beta_j)$, and the variance is $\nu = \frac{1}{T^2}\sum_i \sum_j \big(a(\beta_i, \beta_j) - \phi\big)^2$. Finally, the topic diversity regularization is defined as $R = \phi - \nu$.
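
The sketch below (NumPy, illustrative only) implements one re-parameterized pass of the neural topic model in (4)-(6) together with the diversity regularizer $R = \phi - \nu$; the standard-normal prior used in the KL term and all variable names are simplifying assumptions, and in the paper $\mu$ and $\log\sigma^2$ are produced by the inference network $q(t \mid d)$ described in the experimental setup.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ntm_forward(d, mu, log_var, W_hat, b_hat, beta, rng):
    """One re-parameterized sample of the neural topic model, Eqs. (4)-(6).

    d            : (D,) bag-of-words counts of the document
    mu, log_var  : (H,) Gaussian parameters for theta (from the inference network q(t|d))
    W_hat, b_hat : parameters of g(theta) = softmax(W_hat theta + b_hat)
    beta         : (T, D) topic-word matrix; each row is a distribution over words
    """
    eps = rng.standard_normal(mu.shape)
    theta = mu + np.exp(0.5 * log_var) * eps          # theta ~ N(mu, sigma^2), re-parameterized
    t = softmax(W_hat @ theta + b_hat)                # Eq. (4): t = g(theta)
    p_w = t @ beta                                    # Eq. (6): word probabilities, shape (D,)
    log_p_d = d @ np.log(p_w + 1e-10)                 # log p(d | beta, t) for the bag-of-words
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL(q || N(0, I)), simplified prior
    return t, log_p_d, kl

def diversity_regularizer(beta):
    """Topic diversity regularizer R = phi - nu over the pairwise angles a(beta_i, beta_j)."""
    unit = beta / np.linalg.norm(beta, axis=1, keepdims=True)
    angles = np.arccos(np.clip(np.abs(unit @ unit.T), 0.0, 1.0))  # a(beta_i, beta_j) for all pairs
    phi = angles.mean()                               # mean angle over all T^2 pairs
    nu = ((angles - phi) ** 2).mean()                 # variance of the angles
    return phi - nu
```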

3.2 Neural Language Model

We propose a Mixture-of-Experts (MoE) language model, which consists of a set of “expert networks”, i.e., $E_1, E_2, \ldots, E_T$. Each expert is itself an RNN with its own parameters corresponding to a latent topic.

Without loss of generality, we begin by discussing an RNN with a simple transition function, which is then generalized to the LSTM. Specifically, we define two weight tensors $\mathcal{W} \in \mathbb{R}^{n_h \times n_x \times T}$ and $\mathcal{U} \in \mathbb{R}^{n_h \times n_h \times T}$, where $n_h$ is the number of hidden units and $n_x$ is the dimension of the word embedding. Each expert $E_k$ corresponds to a set of parameters $\mathcal{W}[k]$ and $\mathcal{U}[k]$, which denote the $k$-th 2D “slice” of $\mathcal{W}$ and $\mathcal{U}$, respectively. All $T$ experts work cooperatively to generate an output $y_m$. Specifically,

$$p(y_m) = \sum_{k=1}^{T} t_k \cdot \mathrm{softmax}\big(V h_m^{(k)}\big) \,, \quad (7)$$
$$h_m^{(k)} = \sigma\big(\mathcal{W}[k]\, x_m + \mathcal{U}[k]\, h_{m-1}\big) \,, \quad (8)$$

where $t_k$ is the usage of topic $k$ (component $k$ of $t$), and $\sigma(\cdot)$ is a sigmoid function; $V$ is the weight matrix connecting the RNN's hidden state, used for computing a distribution over words. Bias terms are omitted for simplicity.

However, such an MoE module is computationally prohibitive and storage excessive. The training process is inefficient and even infeasible in practice. To remedy this, instead of ensembling the output of the $T$ experts as in (7), we extend the weight matrix of the RNN to be an ensemble of topic-dependent weight matrices. Specifically, the $T$ experts work together as follows:

$$p(y_m) = \mathrm{softmax}(V h_m) \,, \quad (9)$$
$$h_m = \sigma\big(W(t)\, x_m + U(t)\, h_{m-1}\big) \,, \quad (10)$$
$$W(t) = \sum_{k=1}^{T} t_k \cdot \mathcal{W}[k] \,, \quad U(t) = \sum_{k=1}^{T} t_k \cdot \mathcal{U}[k] \,. \quad (11)$$

In order to reduce the number of model parameters, motivated by Gan et al. (2016); Song et al. (2016), instead of implementing a tensor as in (11), we decompose $W(t)$ into a multiplication of three terms $W_a \in \mathbb{R}^{n_h \times n_f}$, $W_b \in \mathbb{R}^{n_f \times T}$ and $W_c \in \mathbb{R}^{n_f \times n_x}$, where $n_f$ is the number of factors. Specifically,

$$W(t) = W_a \cdot \mathrm{diag}(W_b t) \cdot W_c = W_a \cdot (W_b t \odot W_c) \,, \quad (12)$$

where $\odot$ represents the Hadamard operator. $W_a$ and $W_c$ are shared parameters across all topics, to capture the common linguistic patterns. $W_b$ contains the factors, which are weighted by the learned topic embedding $t$. The same factorization is also applied for $U(t)$. The topic distribution $t$ affects the RNN parameters associated with the document when predicting the succeeding words, which implicitly defines an ensemble of $T$ language models. In this factorized model, the RNN weight matrices that correspond to each topic share “structure”.
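
The following NumPy sketch contrasts the explicit ensemble of per-topic weight matrices in (11) with the factorized form in (12); the dimensions are placeholder values, and the final check is only there to confirm that the diag form and the Hadamard form coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
T, nh, nx, nf = 10, 8, 6, 4               # topics, hidden size, input size, factors

t = rng.dirichlet(np.ones(T))             # topic proportions for one document

# Eq. (11): explicit ensemble -- one weight slice per topic (T * nh * nx parameters).
W_slices = rng.standard_normal((T, nh, nx))
W_t_naive = np.tensordot(t, W_slices, axes=1)        # sum_k t_k W[k], shape (nh, nx)

# Eq. (12): factorized form -- W_a diag(W_b t) W_c, only nh*nf + nf*T + nf*nx parameters.
W_a = rng.standard_normal((nh, nf))
W_b = rng.standard_normal((nf, T))
W_c = rng.standard_normal((nf, nx))
W_t_fact = W_a @ np.diag(W_b @ t) @ W_c
# Equivalent Hadamard form: W_a ((W_b t) * W_c), scaling the rows of W_c by the factor weights.
assert np.allclose(W_t_fact, W_a @ ((W_b @ t)[:, None] * W_c))
```
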
Now we generalize the above analysis by using LSTM units. Specifically, we summarize the new topic compositional LSTM cell as:

$$i_m = \sigma\big(W_{ia}\tilde{x}_{i,m-1} + U_{ia}\tilde{h}_{i,m-1}\big) \,,$$
$$f_m = \sigma\big(W_{fa}\tilde{x}_{f,m-1} + U_{fa}\tilde{h}_{f,m-1}\big) \,,$$
$$o_m = \sigma\big(W_{oa}\tilde{x}_{o,m-1} + U_{oa}\tilde{h}_{o,m-1}\big) \,,$$
$$\tilde{c}_m = \sigma\big(W_{ca}\tilde{x}_{c,m-1} + U_{ca}\tilde{h}_{c,m-1}\big) \,,$$
$$c_m = i_m \odot \tilde{c}_m + f_m \odot c_{m-1} \,,$$
$$h_m = o_m \odot \tanh(c_m) \,. \quad (13)$$

For $* = i, f, o, c$, we define

$$\tilde{x}_{*,m-1} = W_{*b}\, t \odot W_{*c}\, x_{m-1} \,, \quad (14)$$
$$\tilde{h}_{*,m-1} = U_{*b}\, t \odot U_{*c}\, h_{m-1} \,. \quad (15)$$

Compared with a standard LSTM cell, our LSTM unit has a total number of parameters of size $4 n_f \cdot (n_x + 2T + 3 n_h)$, and the additional computational cost comes from (14) and (15). Further, an empirical comparison is conducted in Section 5.4 to verify that our proposed model is superior to using the naive MoE implementation as in (7).
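Below is a minimal NumPy sketch of a single step of the topic-compositional LSTM cell in (13)-(15); parameter shapes follow the factorization above, bias terms are omitted as in the text, and the gate-parameter dictionary layout is an assumption made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tc_lstm_step(x_prev, h_prev, c_prev, t, params):
    """One step of the topic-compositional LSTM cell, Eqs. (13)-(15).

    params[g] = (W_ga, W_gb, W_gc, U_ga, U_gb, U_gc) for each gate g in {'i', 'f', 'o', 'c'},
    with W_ga: (nh, nf), W_gb: (nf, T), W_gc: (nf, nx), U_ga: (nh, nf), U_gb: (nf, T), U_gc: (nf, nh).
    """
    pre = {}
    for g, (W_ga, W_gb, W_gc, U_ga, U_gb, U_gc) in params.items():
        x_tilde = (W_gb @ t) * (W_gc @ x_prev)     # Eq. (14): topic-modulated input factor
        h_tilde = (U_gb @ t) * (U_gc @ h_prev)     # Eq. (15): topic-modulated recurrent factor
        pre[g] = W_ga @ x_tilde + U_ga @ h_tilde
    i = sigmoid(pre['i'])
    f = sigmoid(pre['f'])
    o = sigmoid(pre['o'])
    c_hat = sigmoid(pre['c'])                      # Eq. (13) uses sigma for the cell candidate
    c = i * c_hat + f * c_prev
    h = o * np.tanh(c)
    return h, c
```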

3.3 Model Inference

The proposed model (see Figure 1) follows the variational autoencoder (Kingma and Welling, 2013) framework, which takes the bag-of-words as input and embeds a document into the topic vector. This vector is then used to reconstruct the bag-of-words input, and also to learn an ensemble of RNNs for predicting a sequence of words in the document.

The joint marginal likelihood can be written as:

$$p(y_{1:M}, d \mid \mu_0, \sigma_0^2, \beta) = \int_t p(t \mid \mu_0, \sigma_0^2)\, p(d \mid \beta, t) \prod_{m=1}^{M} p(y_m \mid y_{1:m-1}, t)\, dt \,. \quad (16)$$

Since the direct optimization of (16) is intractable, we employ variational inference (Jordan et al., 1999). We denote $q(t \mid d)$ to be the variational distribution for $t$. Hence, we construct the variational objective function, also called the evidence lower bound (ELBO), as

$$\mathcal{L} = \underbrace{\mathbb{E}_{q(t|d)}\big[\log p(d \mid t)\big] - \mathrm{KL}\big(q(t \mid d)\,\|\,p(t \mid \mu_0, \sigma_0^2)\big)}_{\text{neural topic model}} + \underbrace{\mathbb{E}_{q(t|d)}\Big[\sum_{m=1}^{M} \log p(y_m \mid y_{1:m-1}, t)\Big]}_{\text{neural language model}} \;\leq\; \log p(y_{1:M}, d \mid \mu_0, \sigma_0^2, \beta) \,. \quad (17)$$

More details can be found in the Supplementary Material. In experiments, we optimize the ELBO together with the diversity regularization:

$$\mathcal{J} = \mathcal{L} + \lambda \cdot R \,. \quad (18)$$
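To make the training objective explicit, here is a small helper, a sketch under the assumption that its inputs come from forward passes such as those sketched in Sections 3.1 and 3.2; in the paper the objective is optimized end-to-end with automatic differentiation in TensorFlow.

```python
import numpy as np

def tcnlm_objective(log_p_d, kl, word_log_probs, R, lam=0.1):
    """Single-sample estimate of J = L + lambda * R, Eqs. (17)-(18).

    log_p_d        : log p(d | t), reconstruction of the bag-of-words by the NTM decoder
    kl             : KL(q(t | d) || p(t)), the Gaussian KL term
    word_log_probs : per-word log p(y_m | y_{1:m-1}, t) from the topic-compositional LSTM
    R              : topic diversity regularizer R = phi - nu
    """
    elbo = log_p_d - kl + float(np.sum(word_log_probs))   # one Monte Carlo sample of Eq. (17)
    return elbo + lam * R                                  # Eq. (18)
```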

4 Related Work

Topic Model  Topic models have been studied for a variety of applications in document modeling. Beyond LDA (Blei et al., 2003), significant extensions have been proposed, including capturing topic correlations (Blei and Lafferty, 2007), modeling temporal dependencies (Blei and Lafferty, 2006), discovering an unbounded number of topics (Teh et al., 2005), and learning deep architectures (Henao et al., 2015; Zhou et al., 2015), among many others. Recently, neural topic models have attracted much attention, building upon the successful usage of restricted Boltzmann machines (Hinton and Salakhutdinov, 2009), auto-regressive models (Larochelle and Lauly, 2012), sigmoid belief networks (Gan et al., 2015), and variational autoencoders (Miao et al., 2016).

Variational inference has been successfully applied in a variety of applications (Pu et al., 2016; Wang et al., 2017; Chen et al., 2017). The recent work of Miao et al. (2017) employs variational inference to train topic models, and is closely related to our work. Their model follows the original LDA formulation and extends it by parameterizing the multinomial distribution with neural networks. In contrast, our model enforces the neural network not only to model documents as bag-of-words, but also to transfer the inferred topic knowledge to a language model for word-sequence generation.

Language Model  Neural language models have recently achieved remarkable advances (Mikolov et al., 2010). The RNN-based language model (RNNLM) is superior for its ability to model longer-term temporal dependencies without imposing a strong conditional independence assumption; it has recently been shown to outperform carefully-tuned traditional n-gram-based language models (Jozefowicz et al., 2016).

An RNNLM can be further improved by utilizing the broad document context (Mikolov and Zweig, 2012). Such models typically extract latent topics via a topic model, and then send the topic vector to a language model for sentence generation. Important work in this direction includes Mikolov and Zweig (2012); Dieng et al. (2016); Lau et al. (2017); Ahn et al. (2016). The key differences among these methods lie in either the topic model itself or the method of integrating the topic vector into the language model. In terms of the topic model, Mikolov and Zweig (2012) uses a pre-trained LDA model; Dieng et al. (2016) uses a variational autoencoder; Lau et al. (2017) introduces an attention-based convolutional neural network to extract semantic topics; and Ahn et al. (2016) utilizes the topics associated with the fact pairs derived from a knowledge graph (Vinyals and Le, 2015).

Concerning the method of incorporating the topic vector into the language model, Mikolov and Zweig (2012) and Lau et al. (2017) extend the RNN cell with additional topic features. Dieng et al. (2016) and Ahn et al. (2016) use a hybrid model combining the predicted word distributions given by both a topic model and a standard RNNLM. Distinct from these approaches, our model learns the topic model and the language model jointly under the VAE framework, allowing an efficient end-to-end training process. Further, the topic information is used as guidance for a Mixture-of-Experts (MoE) model design. Under our factorization method, the model can yield boosted performance efficiently (as corroborated in the experiments).

Recently, Shazeer et al. (2017) proposes a MoE model for large-scale language modeling. Different from ours, they introduce a MoE layer, in which each expert stands for a small feed-forward neural network on the previous output of the LSTM layer. Therefore, it yields a significant quantity of additional parameters and computational cost, which is infeasible to train on a single GPU machine. Moreover, they provide no semantic meanings for each expert, and all experts are treated equally; in contrast, the proposed model can generate meaningful sentences conditioned on given topics.

Our TCNLM is similar to Gan et al. (2016). However, Gan et al. (2016) uses a two-step pipeline, first learning a multi-label classifier on a group of pre-defined image tags, and then generating image captions conditioned on them. In comparison, our model jointly learns a topic model and a language model, and focuses on the language modeling task.

5 Experiments

Datasets  We present experimental results on three publicly available corpora: APNEWS, IMDB and BNC. APNEWS¹ is a collection of Associated Press news articles from 2009 to 2016. IMDB is a set of movie reviews collected by Maas et al. (2011), and BNC (BNC Consortium, 2007) is the written portion of the British National Corpus, which contains excerpts from journals, books, letters, essays, memoranda, news and other types of text. These three datasets can be downloaded from GitHub².

We follow the preprocessing steps in Lau et al. (2017). Specifically, words and sentences are tokenized using Stanford CoreNLP (Manning et al., 2014). We lowercase all word tokens, and filter out word tokens that occur less than 10 times. For topic modeling, we additionally remove stopwords³ in the documents and exclude the top 0.1% most frequent words and also words that appear in less than 100 documents. All these datasets are divided into training, development and testing sets. Summary statistics of these datasets are provided in Table 1.

| Dataset | Vocab (LM) | Vocab (TM) | Train docs | Train sents | Train tokens | Dev docs | Dev sents | Dev tokens | Test docs | Test sents | Test tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| APNEWS | 32,400 | 7,790 | 50K | 0.7M | 15M | 2K | 27.4K | 0.6M | 2K | 26.3K | 0.6M |
| IMDB | 34,256 | 8,713 | 75K | 0.9M | 20M | 12.5K | 0.2M | 0.3M | 12.5K | 0.2M | 0.3M |
| BNC | 41,370 | 9,741 | 15K | 0.8M | 18M | 1K | 44K | 1M | 1K | 52K | 1M |

Table 1: Summary statistics for the datasets used in the experiments.

¹ https://www.ap.org/en-gb/
² https://github.com/jhlau/topically-driven-language-model
³ We use the following stopwords list: https://github.com/mimno/Mallet/blob/master/stoplists/en.txt

| Dataset | LSTM type | basic-LSTM | LDA+LSTM (50) | LDA+LSTM (100) | LDA+LSTM (150) | LCLM | Topic-RNN (50) | Topic-RNN (100) | Topic-RNN (150) | TCNLM (50) | TCNLM (100) | TCNLM (150) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| APNEWS | small | 64.13 | 57.05 | 55.52 | 54.83 | 54.18 | 56.77 | 54.54 | 54.12 | 52.75 | 52.63 | 52.59 |
| APNEWS | large | 58.89 | 52.72 | 50.75 | 50.17 | 50.63 | 53.19 | 50.24 | 50.01 | 48.07 | 47.81 | 47.74 |
| IMDB | small | 72.14 | 69.58 | 69.64 | 69.62 | 67.78 | 68.74 | 67.83 | 66.45 | 63.98 | 62.64 | 62.59 |
| IMDB | large | 66.47 | 63.48 | 63.04 | 62.78 | 67.86 | 63.02 | 61.59 | 60.14 | 57.06 | 56.38 | 56.12 |
| BNC | small | 102.89 | 96.42 | 96.50 | 96.38 | 87.47 | 94.66 | 93.57 | 93.55 | 87.98 | 86.44 | 86.21 |
| BNC | large | 94.23 | 88.42 | 87.77 | 87.28 | 80.68 | 85.90 | 84.62 | 84.12 | 80.29 | 80.14 | 80.12 |

Table 2: Test perplexities of different language models on APNEWS, IMDB and BNC.

Setup  For the NTM part, we consider a 2-layer feed-forward neural network to model $q(t \mid d)$, with 256 hidden units in each layer; ReLU (Nair and Hinton, 2010) is used as the activation function. The hyper-parameter $\lambda$ for the diversity regularizer is fixed to 0.1 across all the experiments. All the sentences in a paragraph, excluding the one being predicted, are used to obtain the bag-of-words document representation $d$. The maximum number of words in a paragraph is set to 300.

In terms of the NLM part, we consider two settings: (i) a small 1-layer LSTM model with 600 hidden units, and (ii) a large 2-layer LSTM model with 900 hidden units in each layer. The sequence length is fixed to 30. In order to alleviate overfitting, dropout with a rate of 0.4 is used in each LSTM layer. In addition, adaptive softmax (Grave et al., 2016) is used to speed up the training process.

During training, the NTM and NLM parameters are jointly learned using Adam (Kingma and Ba, 2014). All the hyper-parameters are tuned based on the performance on the development set. We empirically find that the optimal settings are fairly robust across the 3 datasets. All the experiments were conducted using TensorFlow and trained on an NVIDIA GTX TITAN X with 3072 cores and 12GB global memory.

5.1 Language Model Evaluation

Perplexity is used as the metric to evaluate the performance of the language model (a minimal sketch of this computation follows the baseline list below). In order to demonstrate the advantage of the proposed model, we compare TCNLM with the following baselines:

• basic-LSTM: A baseline LSTM-based language model, using the same architecture and hyper-parameters as TCNLM wherever applicable.

• LDA+LSTM: A topic-enrolled LSTM-based language model. We first pretrain an LDA model (Blei et al., 2003) to learn 50/100/150 topics for APNEWS, IMDB and BNC. Given a document, the LDA topic distribution is incorporated by concatenating it with the output of the hidden states to predict the next word.

• LCLM (Wang and Cho, 2016): A context-based language model, which incorporates context information from preceding sentences. The preceding sentences are treated as bag-of-words, and an attention mechanism is used when predicting the next word. All hyper-parameters are set to be the same as in our TCNLM. The number of preceding sentences is tuned on the development set (4 in general).

• Topic-RNN (Dieng et al., 2016): A joint learning framework that learns a topic model and a language model simultaneously. The topic information is incorporated through a linear transformation to rescore the prediction of the next word.
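As referenced above, the perplexity reported in Table 2 is the exponentiated average negative log-likelihood per token; a minimal sketch of the standard definition (not code from the paper):

```python
import numpy as np

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    return float(np.exp(-np.mean(token_log_probs)))

# Example: three tokens predicted with probabilities 0.1, 0.02 and 0.3
print(perplexity(np.log([0.1, 0.02, 0.3])))   # about 11.9
```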

Results are presented in Table 2. We highlight some observations. (i) All the topic-enrolled methods outperform the basic-LSTM model, indicating the effectiveness of incorporating global semantic topic information. (ii) Our TCNLM performs the best across all datasets, and the trend keeps improving with the increase of topic numbers. (iii) The improved performance of TCNLM over LCLM implies that encoding the document context into meaningful topics provides a better way to improve the language model compared with using the extra context words directly. (iv) The margin between LDA+LSTM/Topic-RNN and our TCNLM indicates that our model supplies a more efficient way to utilize the topic information, through the joint variational learning framework that implicitly trains an ensemble model.

5.2 Topic Model Evaluation

We evaluate the topic model by inspecting the coherence of inferred topics (Chang et al., 2009; Newman et al., 2010; Mimno et al., 2011). Following Lau et al. (2014), we compute topic coherence using normalized PMI (NPMI). Given the top n words of a topic, the coherence is calculated based on the sum of pairwise NPMI scores between topic words, where the word probabilities used in the NPMI calculation are based on co-occurrence statistics mined from English Wikipedia with a sliding window. In practice, we average topic coherence over the top 5/10/15/20 topic words. To aggregate the topic coherence score for a trained model, we then further average the coherence scores over topics.
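To make the coherence measure concrete, the sketch below computes NPMI-based coherence for one topic from word and word-pair window counts; the count inputs, smoothing constant, and aggregation comment are illustrative assumptions rather than the exact evaluation tooling used in the paper.

```python
import numpy as np
from itertools import combinations

def npmi_coherence(top_words, word_count, pair_count, num_windows, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.

    word_count[w]      : number of reference windows (e.g. from Wikipedia) containing word w
    pair_count[(w, v)] : number of windows containing both w and v
    num_windows        : total number of reference windows
    """
    scores = []
    for w, v in combinations(top_words, 2):
        p_w = word_count[w] / num_windows
        p_v = word_count[v] / num_windows
        p_wv = pair_count.get((w, v), pair_count.get((v, w), 0)) / num_windows
        pmi = np.log((p_wv + eps) / (p_w * p_v))
        scores.append(pmi / -np.log(p_wv + eps))          # normalized PMI in [-1, 1]
    return float(np.mean(scores))

# Model-level score: average npmi_coherence over the top 5/10/15/20 words of each topic,
# then average over all topics.
```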

| Dataset | Topic | Top 5 words |
|---|---|---|
| APNEWS | army | afghanistan, veterans, soldiers, brigade, infantry |
| APNEWS | animal | animals, dogs, zoo, bear, wildlife |
| APNEWS | medical | patients, drug, fda, disease, virus |
| APNEWS | market | zacks, cents, earnings, keywords, share |
| APNEWS | lottery | casino, mega, lottery, gambling, jackpot |
| APNEWS | terrorism | syria, iran, militants, al-qaida, korea |
| APNEWS | law | lawsuit, damages, plaintiffs, filed, suit |
| APNEWS | art | album, music, film, songs, comedy |
| APNEWS | transportation | airlines, fraud, scheme, conspiracy, flights |
| APNEWS | education | students, math, schools, education, teachers |
| IMDB | horror | zombie, slasher, massacre, chainsaw, gore |
| IMDB | action | martial, kung, li, chan, fu |
| IMDB | family | rampling, relationship, binoche, marie, mother |
| IMDB | children | kids, snoopy, santa, cartoon, parents |
| IMDB | war | war, che, documentary, muslims, jews |
| IMDB | detective | eyre, rochester, book, austen, holmes |
| IMDB | sci-fi | alien, godzilla, tarzan, planet, aliens |
| IMDB | negative | awful, unfunny, sex, poor, worst |
| IMDB | ethic | gay, school, girls, women, sex |
| IMDB | episode | season, episodes, series, columbo, batman |
| BNC | environment | pollution, emissions, nuclear, waste, environmental |
| BNC | education | courses, training, students, medau, education |
| BNC | politics | elections, economic, minister, political, democratic |
| BNC | business | corp, turnover, unix, net, profits |
| BNC | facilities | bedrooms, hotel, garden, situated, rooms |
| BNC | sports | goal, score, cup, ball, season |
| BNC | art | album, band, guitar, music, film |
| BNC | award | john, award, research, darlington, speaker |
| BNC | expression | eye, looked, hair, lips, stared |
| BNC | crime | police, murder, killed, jury, trail |

Table 3: 10 topics learned from our TCNLM on APNEWS, IMDB and BNC.

For comparison, we use the following baseline topic models:

• LDA: LDA (Blei et al., 2003) is used as a baseline topic model. We use LDA to learn the topic distributions for LDA+LSTM.

• NTM: We evaluate the neural topic model proposed in Cao et al. (2015). The document-topic and topic-word multinomials are expressed using neural networks. N-gram embeddings are incorporated as inputs of the model.

• Topic-RNN (Dieng et al., 2016): The same model as used in the language model evaluation.

Results are summarized in Table 4. Our TCNLM achieves promising results. Specifically, (i) we achieve the best coherence performance over APNEWS and IMDB, and are relatively competitive with LDA on BNC. (ii) We also observe that a larger model may result in a slightly worse coherence performance. One possible explanation is that a larger language model may have more impact on the topic model, and the inherited stronger sequential information may be harmful to the coherence measurement. (iii) Additionally, the advantage of our TCNLM over Topic-RNN indicates that our TCNLM supplies a more powerful topic guidance.

| # Topics | Model | APNEWS | IMDB | BNC |
|---|---|---|---|---|
| 50 | LDA | 0.125 | 0.084 | 0.106 |
| 50 | NTM | 0.075 | 0.064 | 0.081 |
| 50 | Topic-RNN(s) | 0.134 | 0.103 | 0.102 |
| 50 | Topic-RNN(l) | 0.127 | 0.096 | 0.100 |
| 50 | TCNLM(s) | 0.159 | 0.106 | 0.114 |
| 50 | TCNLM(l) | 0.152 | 0.100 | 0.101 |
| 100 | LDA | 0.136 | 0.092 | 0.119 |
| 100 | NTM | 0.085 | 0.071 | 0.070 |
| 100 | Topic-RNN(s) | 0.158 | 0.096 | 0.108 |
| 100 | Topic-RNN(l) | 0.143 | 0.093 | 0.105 |
| 100 | TCNLM(s) | 0.160 | 0.101 | 0.111 |
| 100 | TCNLM(l) | 0.152 | 0.098 | 0.104 |
| 150 | LDA | 0.134 | 0.094 | 0.119 |
| 150 | NTM | 0.078 | 0.075 | 0.072 |
| 150 | Topic-RNN(s) | 0.146 | 0.089 | 0.102 |
| 150 | Topic-RNN(l) | 0.137 | 0.092 | 0.097 |
| 150 | TCNLM(s) | 0.153 | 0.096 | 0.107 |
| 150 | TCNLM(l) | 0.155 | 0.093 | 0.102 |

Table 4: Topic coherence scores of different models on APNEWS, IMDB and BNC. (s) and (l) indicate the small and large model, respectively.

In order to better understand the topic model, we provide the top 5 words for 10 randomly chosen topics on each dataset (the boldface word is the topic name summarized by us), as shown in Table 3. These results correspond to the small network with 100 topics. We also present some inferred topic distributions for several documents from our TCNLM in Figure 2. The topic usage for a specific document is sparse, demonstrating the effectiveness of our NTM. More inferred topic distribution examples are provided in the Supplementary Material.

5.3 Sentence Generation

Another advantage of our TCNLM is its capacity to generate meaningful sentences conditioned on given topics. Given topic $i$, we construct an LSTM generator by using only the $i$-th factor of $W_b$ and $U_b$. Then we start from a zero hidden state, and greedily sample words until an end token occurs. Table 5 shows the generated sentences from our TCNLM learned with 50 topics using the small network. Most of the sentences are strongly correlated with the given topics. More interestingly, we can also generate reasonable sentences conditioned on a mixed combination of topics, even if the topic pairs are divergent, e.g., “animal” and “lottery” for APNEWS. More examples are provided in the Supplementary Material. This shows that our TCNLM is able to generate topic-related sentences, providing an interpretable way to understand the topic model and the language model simultaneously. These qualitative analyses further demonstrate that our model effectively assembles the meaning of topics to generate sentences.
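A schematic sketch of the generation procedure just described; it reuses the tc_lstm_step sketch from Section 3.2, emulates “using only the i-th factor of Wb and Ub” with a one-hot topic vector, and the embedding matrix, output matrix V, end-token id, and maximum length are illustrative assumptions.

```python
import numpy as np

def generate_from_topic(topic_i, T, params, V, emb, end_id, max_len=30):
    """Greedy generation conditioned on a single topic.

    A one-hot topic vector t selects the topic_i-th column of every W_*b and U_*b,
    so only the i-th factor of Wb and Ub influences the LSTM, as described above.
    """
    t = np.zeros(T)
    t[topic_i] = 1.0
    nh = V.shape[1]
    h, c = np.zeros(nh), np.zeros(nh)              # start from a zero hidden state
    x = np.zeros(emb.shape[1])                     # no previous word yet
    tokens = []
    for _ in range(max_len):
        h, c = tc_lstm_step(x, h, c, t, params)    # topic-compositional LSTM step (Section 3.2 sketch)
        y = int(np.argmax(V @ h))                  # greedy choice of the next word
        tokens.append(y)
        if y == end_id:                            # stop when the end token occurs
            break
        x = emb[y]
    return tokens
```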

| Dataset | Topic | Generated sentence |
|---|---|---|
| APNEWS | army | a female sergeant , serving in the fort worth , has served as she served in the military in iraq . |
| APNEWS | animal | most of the bear will have stumbled to the lake . |
| APNEWS | medical | physicians seeking help in utah and the nih has had any solutions to using the policy and uses offline to be fitted with a testing or body . |
| APNEWS | market | the company said it expects revenue of $ million to $ million in the third quarter . |
| APNEWS | lottery | where the winning numbers drawn up for a mega ball was sold . |
| APNEWS | army+terrorism | the taliban ’s presence has earned a degree from the 1950-53 korean war in pakistan ’s historic life since 1964 , with two example of soldiers from wounded iraqi army shootings and bahrain in the eastern army . |
| APNEWS | animal+lottery | she told the newspaper that she was concerned that the buyer was in a neighborhood last year and had a gray wolf . |
| IMDB | horror | the killer is a guy who is n’t even a zombie . |
| IMDB | action | the action is a bit too much , but the action is n’t very good . |
| IMDB | family | the film is also the story of a young woman whose and and very yet ultimately sympathetic , relationship , , and palestine being equal , and the old man , a . |
| IMDB | children | i consider this movie to be a children ’s film for kids . |
| IMDB | war | the documentary is a documentary about the war and the of the war . |
| IMDB | horror+negative | if this movie was indeed a horrible movie i think i will be better off the film . |
| IMDB | sci-fi+children | paul thinks him has to make up when the eugene discovers defeat in order to take too much time without resorting to mortal bugs , and then finds his wife and boys . |
| BNC | environment | environmentalists immediate base calls to defend the world . |
| BNC | education | the school has recently been founded by a of the next generation for two years . |
| BNC | politics | a new economy in which privatization was announced on july 4 . |
| BNC | business | net earnings per share rose % to $ in the quarter , and $ m , on turnover that rose % to $ m. |
| BNC | facilities | all rooms have excellent amenities . |
| BNC | environment+politics | the commission ’s report on oct. 2 , 1990 , on jan. 7 denied the government ’s grant to ” the national level of water ” . |
| BNC | art+crime | as well as 36, he is returning freelance into the red army of drama where he has finally been struck for their premiere . |

Table 5: Generated sentences from given topics. More examples are provided in the Supplementary Material.

[Figure 2: three bar plots (APNEWS, IMDB, BNC), each showing the inferred topic proportion (vertical axis, 0 to 0.4) over the topic indices (horizontal axis, 0 to 100).]

Figure 2: Inferred topic distributions on one sample document in each dataset. Content of the three documents is provided in the Supplementary Material.

| Dataset | basic-LSTM | naive MoE | TCNLM |
|---|---|---|---|
| APNEWS | 101.62 | 85.87 | 82.67 |
| IMDB | 105.29 | 96.16 | 94.64 |
| BNC | 146.50 | 130.01 | 125.09 |

Table 6: Test perplexity comparison between the naive MoE implementation and our TCNLM on APNEWS, IMDB and BNC.

5.4 Empirical Comparison with Naive MoE

We explore the usage of a naive MoE language model as in (7). In order to fit the model on a single GPU machine, we train an NTM with 30 topics, and each NLM of the MoE is a 1-layer LSTM with 100 hidden units. Results are summarized in Table 6. Both the naive MoE and our TCNLM provide better performance than the basic LSTM. Interestingly, though requiring less computational cost and storage usage, our TCNLM outperforms the naive MoE by a non-trivial margin. We attribute this boosted performance to the “structure” design of our matrix factorization method. The inherent topic-guided factor control significantly prevents overfitting, and yields efficient training, demonstrating the advantage of our model for transferring semantic knowledge learned from the topic model to the language model.

6 Conclusion

We have presented the Topic Compositional Neural Language Model (TCNLM), a new method to learn a topic model and a language model simultaneously. The topic model part captures the global semantic meaning in a document, while the language model part learns the local semantic and syntactic relationships between words. The inferred topic information is incorporated into the language model through a Mixture-of-Experts model design. Experiments conducted on three corpora validate the superiority of the proposed approach. Further, our model infers sensible topics, and has the capacity to generate meaningful sentences conditioned on given topics. One possible future direction is to extend the TCNLM to a conditional model and apply it to the machine translation task.

References

S. Ahn, H. Choi, T. Pärnamaa, and Y. Bengio. A neural knowledge language model. arXiv preprint arXiv:1608.00318, 2016.
E. Arisoy, T. N. Sainath, B. Kingsbury, and B. Ramabhadran. Deep neural network language models. In NAACL-HLT Workshop, 2012.
D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, 2006.
D. M. Blei and J. D. Lafferty. A correlated topic model of science. The Annals of Applied Statistics, 2007.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 2003.
BNC Consortium. The British National Corpus, version 3 (BNC XML edition). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/, 2007.
Z. Cao, S. Li, Y. Liu, W. Li, and H. Ji. A novel neural topic model and its supervised extension. In AAAI, 2015.
J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.
C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin. Continuous-time flows for deep generative models. arXiv preprint arXiv:1709.01179, 2017.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.
A. B. Dieng, C. Wang, J. Gao, and J. Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.
Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep poisson factor analysis for topic modeling. In ICML, 2015.
Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. arXiv preprint arXiv:1611.08002, 2016.
É. Grave, A. Joulin, M. Cissé, D. Grangier, and H. Jégou. Efficient softmax approximation for GPUs. arXiv preprint arXiv:1609.04309, 2016.
R. Henao, Z. Gan, J. Lu, and L. Carin. Deep poisson factor modeling. In NIPS, 2015.
G. E. Hinton and R. R. Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, 2009.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
Y. H. Hu, S. Palreddy, and W. J. Tompkins. A patient-adaptable ECG beat classifier using a mixture of experts approach. IEEE Transactions on Biomedical Engineering, 1997.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 1999.
R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
H. Larochelle and S. Lauly. A neural autoregressive topic model. In NIPS, 2012.
J. H. Lau, D. Newman, and T. Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, 2014.
J. H. Lau, T. Baldwin, and T. Cohn. Topically driven neural language model. arXiv preprint arXiv:1704.08012, 2017.
A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In ACL, 2011.
C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL, 2014.
J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
Y. Miao, L. Yu, and P. Blunsom. Neural variational inference for text processing. In ICML, 2016.
Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.
T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. In SLT, 2012.
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, 2010.
D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In EMNLP, 2011.
V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL, 2010.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, 2002.
Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.
H. Schwenk, A. Rousseau, and M. Attik. Large, pruned or continuous space language models on a GPU for statistical machine translation. In NAACL-HLT Workshop, 2012.
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.
A. Sriram, H. Jun, S. Satheesh, and A. Coates. Cold fusion: Training seq2seq models together with language models. arXiv preprint arXiv:1708.06426, 2017.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters among related groups: Hierarchical dirichlet processes. In NIPS, 2005.
A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang. Decoding with large-scale neural language models improves translation. In EMNLP, 2013.
O. Vinyals and Q. Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
L. Wan, L. Zhu, and R. Fergus. A hybrid neural network-latent topic model. In AISTATS, 2012.
T. Wang and H. Cho. Larger-context language modelling with recurrent neural network. In ACL, 2016.
W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin. Zero-shot learning via class-conditioned deep generative models. arXiv preprint arXiv:1711.05820, 2017.
P. Xie, Y. Deng, and E. Xing. Diversifying restricted boltzmann machine for document modeling. In KDD, 2015.
M. Zhou, Y. Cong, and B. Chen. The poisson gamma belief network. In NIPS, 2015.