Topic Modeling with Wasserstein Autoencoders
Feng Nan†, Ran Ding∗‡, Ramesh Nallapati†, Bing Xiang†
Amazon Web Services†, Compass Inc.‡
{nanfen, rnallapa, [email protected], [email protected]
∗ This work was done when the author was with Amazon.

arXiv:1907.12374v2 [cs.IR] 6 Dec 2019

Abstract

We propose a novel neural topic model in the Wasserstein autoencoders (WAE) framework. Unlike existing variational autoencoder based models, we directly enforce the Dirichlet prior on the latent document-topic vectors. We exploit the structure of the latent space and apply a suitable kernel in minimizing the Maximum Mean Discrepancy (MMD) to perform distribution matching. We discover that MMD performs much better than the Generative Adversarial Network (GAN) in matching a high dimensional Dirichlet distribution. We further discover that incorporating randomness in the encoder output during training leads to significantly more coherent topics. To measure the diversity of the produced topics, we propose a simple topic uniqueness metric. Together with the widely used coherence measure NPMI, we offer a more wholistic evaluation of topic quality. Experiments on several real datasets show that our model produces significantly better topics than existing topic models.

1 Introduction

Probabilistic topic models (Hoffman et al., 2010) have been widely used to explore large collections of documents in an unsupervised manner. They can discover the underlying themes and organize the documents accordingly. The most popular probabilistic topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), where the authors developed a variational Bayesian (VB) algorithm to perform approximate inference; subsequently, (Griffiths and Steyvers, 2004) proposed an alternative inference method using collapsed Gibbs sampling.

More recently, deep neural networks have been successfully used for such probabilistic models with the emergence of variational autoencoders (VAE) (Kingma and Welling, 2013). The key advantage of such neural network based models is that inference can be carried out easily via a forward pass of the recognition network, without the need for an expensive iterative inference scheme per example as in VB and collapsed Gibbs sampling. Topic models that fall in this framework include NVDM (Miao et al., 2016), ProdLDA (Srivastava and Sutton, 2017) and NTM-R (Ding et al., 2018). At a high level, these models consist of an encoder network that maps the Bag-of-Words (BoW) input to a latent document-topic vector and a decoder network that maps the document-topic vector to a discrete distribution over the words in the vocabulary. They are autoencoders in the sense that the output of the decoder aims to reconstruct the word distribution of the input BoW representation. Besides the reconstruction loss, VAE-based methods also minimize a KL-divergence term between the prior and the posterior of the latent vector distributions.

Despite their popularity, these VAE-based topic models suffer from several conceptual and practical challenges. First, the Auto-Encoding Variational Bayes (Kingma and Welling, 2013) framework of VAE relies on a reparameterization trick that only works with the "location-scale" family of distributions. Unfortunately, the Dirichlet distribution, which largely accounted for the modeling success of LDA, does not belong to this family. The Dirichlet prior on the latent document-topic vector nicely captures the intuition that a document typically belongs to a sparse subset of topics. The VAE-based topic models have to resort to various Gaussian approximations toward this effect. For example, NVDM and NTM-R simply use a Gaussian instead of a Dirichlet prior; ProdLDA uses a Laplace approximation of the Dirichlet distribution in the softmax basis as its prior. Second, the KL divergence term in the VAE objective forces the posterior distributions of all examples to match the prior, essentially making the encoder output independent of the input. This leads to the problem commonly known as posterior collapse (He et al., 2019). Although various heuristics such as KL-annealing (Bowman et al., 2016) have been proposed to address this problem, they are shown to be ineffective in more complex datasets (Kim et al., 2018).

In this work we leverage the expressive power and efficiency of neural networks and propose a novel neural topic model to address the above difficulties. Our neural topic model belongs to a broader family of Wasserstein autoencoders (WAE) (Tolstikhin et al., 2017). We name our neural topic model W-LDA¹ to emphasize the connection with WAE. Compared to the VAE-based topic models, our model has a few advantages. First, we encourage the latent document-topic vectors to follow the Dirichlet prior directly via distribution matching, without any Gaussian approximation; by preserving the Dirichlet prior, our model represents a much more faithful generalization of LDA to neural network based topic models. Second, our model matches the aggregated posterior to the prior. As a result, the latent codes of different examples get to stay away from each other, promoting a better reconstruction (Tolstikhin et al., 2017). We are thus able to avoid the problem of posterior collapse.

To evaluate the quality of the topics from W-LDA and other models, we measure the coherence of the representative words of the topics using the widely accepted Normalized Pointwise Mutual Information (NPMI) (Aletras and Stevenson, 2013) score, which is shown to closely match human judgments (Lau et al., 2014). While NPMI captures topic coherence, it is also important that the discovered topics are diverse (not repetitive). Yet such a measure has been missing in the topic model literature.² We therefore propose a simple Topic Uniqueness (TU) measure for this purpose. Given a set of representative words from all the topics, the TU score is inversely proportional to the number of times each word is repeated in the set. A high TU score means the representative words are rarely repeated and the topics are unique to each other. Using both TU and NPMI, we are able to provide a more wholistic measure of topic quality.
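To make the TU computation concrete, the following is a minimal Python sketch of one plausible formalization, assuming each topic is represented by a fixed number of top words; the reciprocal-count averaging matches the "inversely proportional" description above, but the function name and the exact normalization are illustrative rather than a definitive specification.

```python
from collections import Counter

def topic_uniqueness(top_words_per_topic):
    """Topic Uniqueness (TU): for every representative word, take the
    reciprocal of the number of topic word lists that contain it, then
    average within each topic and across topics. TU = 1.0 means no word
    is shared between topics; lower values mean more repetition."""
    counts = Counter(w for words in top_words_per_topic for w in words)
    per_topic = [sum(1.0 / counts[w] for w in words) / len(words)
                 for words in top_words_per_topic]
    return sum(per_topic) / len(per_topic)

# Toy example with 3 topics and 4 representative words each;
# "price" is shared by two topics, so TU drops below 1.0.
topics = [["market", "stock", "trade", "price"],
          ["game", "team", "season", "price"],
          ["dna", "gene", "protein", "cell"]]
print(topic_uniqueness(topics))
```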
To summarize our main contributions:

• We introduce a uniqueness measure to evaluate topic quality more wholistically.

• W-LDA produces significantly better quality topics than existing topic models in terms of topic coherence and uniqueness.

• We experiment with both the WAE-GAN and WAE-MMD variants (Tolstikhin et al., 2017) for distribution matching and demonstrate a key performance advantage of the latter with a carefully chosen kernel, especially in high dimensional settings.

• We discover a novel technique of adding noise to W-LDA to significantly boost topic coherence. This technique can potentially be applied to WAE in general and is of independent interest.

¹ Code available at https://github.com/awslabs/w-lda
² Most papers on topic modeling only present a selected small subset of non-repetitive topics for qualitative evaluation. The diversity among the topics is not measured.

2 Related Work

Adversarial Autoencoder (AAE) (Makhzani et al., 2015) was proposed as an alternative to VAE. The main difference is that AAE regularizes the aggregated posterior to be close to a prior distribution whereas VAE regularizes the posterior to be close to the prior. Wasserstein autoencoders (WAE) (Tolstikhin et al., 2017) provides justification for AAE from the Wasserstein distance minimization point of view. In addition to the adversarial training used in AAE, the authors also suggested using Maximum Mean Discrepancy (MMD) for distribution matching. Compared to VAE, AAE/WAEs are shown to produce better quality samples.
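Because MMD only needs samples from the two distributions and a kernel, distribution matching can be carried out without a discriminator network. As a rough illustration (not the kernel, bandwidth, or estimator used by W-LDA, which relies on a kernel chosen for the Dirichlet latent space), a plain RBF-kernel estimate of squared MMD between two sample sets looks as follows:

```python
import numpy as np

def rbf_kernel(x, y, bandwidth):
    # Pairwise squared Euclidean distances between rows of x and rows of y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=0.5):
    """Biased sample estimate of squared MMD between the distributions
    that generated the sample sets x and y (one sample per row)."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

# Toy check: two batches from the same sparse Dirichlet give a small estimate,
# while a batch from a dense Dirichlet is clearly separated.
rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.full(10, 0.1), size=256)
p2 = rng.dirichlet(np.full(10, 0.1), size=256)
q = rng.dirichlet(np.full(10, 5.0), size=256)
print(mmd2(p1, p2), mmd2(p1, q))
```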
AAE has been applied to the tasks of unaligned text style transfer and semi-supervised natural language inference by ARAE (Kim et al., 2017). To the best of our knowledge, W-LDA is the first topic model based on the WAE framework. Recently, the Adversarial Topic model (ATM) (Wang et al., 2018) proposes using a GAN with a Dirichlet prior to learn topics. The generator takes in samples from the Dirichlet distribution and maps them to a document-word distribution layer to form the fake samples. The discriminator tries to distinguish the real documents from the fake documents. It also pre-processes the BoW representation of documents using TF-IDF. The evaluation is limited to topic coherence. A critical difference between W-LDA and ATM is that ATM tries to perform distribution matching in the vocabulary space whereas W-LDA does so in the latent document-topic space. Since the vocabulary space has a much higher dimension (the size of the vocabulary) than the latent document-topic space (the number of topics), we believe it is much more challenging for ATM to train and perform well compared to W-LDA.

Our work is also related to the topic of learning disentangled representations. A disentangled representation can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors (Bengio et al., 2013). In topic modeling, such disentanglement means that the learned topics are coherent and distinct. (Rubenstein et al., 2018) demonstrated that WAE learns better disentangled representations than VAE.

Given a document w, the inference task is to determine the conditional distribution p(θ|w).

3.2 Wasserstein Auto-encoder

The latent variable generative model posits that a target domain example (e.g. a document w) is generated by first sampling a latent code θ from a prior distribution P_Θ and then passing it through a decoder network. The resulting distribution in the target domain is P_dec with density

  p_dec(w) = ∫_θ p_dec(w | θ) p(θ) dθ.   (1)

The key result of (Tolstikhin et al., 2017) is that minimizing the optimal transport distance between P_dec and the target distribution P_w is equivalent to minimizing an objective that combines a reconstruction cost with a penalty matching the aggregated posterior over the latent codes to the prior P_Θ.
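This penalty can be realized adversarially (WAE-GAN) or with MMD (WAE-MMD). As a rough illustration of how such an objective can be optimized for a topic model, the following PyTorch sketch performs a single training step; the network sizes, RBF kernel, bandwidth, penalty weight LAM, and Dirichlet concentration are illustrative assumptions and not the configuration used by W-LDA.

```python
import torch
import torch.nn.functional as F
from torch import nn

VOCAB, TOPICS, LAM = 2000, 50, 10.0   # illustrative sizes and penalty weight

encoder = nn.Sequential(nn.Linear(VOCAB, 256), nn.ReLU(), nn.Linear(256, TOPICS))
decoder = nn.Linear(TOPICS, VOCAB)    # maps document-topic vectors to word logits
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
prior = torch.distributions.Dirichlet(torch.full((TOPICS,), 0.1))  # sparse Dirichlet prior

def mmd2(x, y, bandwidth=0.5):
    # Biased squared-MMD estimate with an RBF kernel, written with squared
    # distances (no sqrt) so gradients are well behaved at zero distance.
    def k(a, b):
        d2 = (a * a).sum(-1, keepdim=True) + (b * b).sum(-1) - 2.0 * a @ b.T
        return torch.exp(-d2.clamp_min(0.0) / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def train_step(bow):
    """One WAE-MMD-style update on a (batch, VOCAB) batch of BoW counts."""
    theta = F.softmax(encoder(bow), dim=-1)               # document-topic vectors on the simplex
    rec = -(bow * F.log_softmax(decoder(theta), dim=-1)).sum(-1).mean()  # BoW reconstruction loss
    penalty = mmd2(theta, prior.sample((bow.shape[0],)))  # match aggregated posterior to the prior
    loss = rec + LAM * penalty
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One step on a random toy batch of word counts.
print(train_step(torch.randint(0, 3, (32, VOCAB)).float()))
```

The sketch also omits the randomness added to the encoder output during training, which is reported above to significantly improve topic coherence.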