Published as a workshop paper at 2nd Learning with Limited Data (LLD) Workshop, ICLR 2019

WEAKLY SEMI-SUPERVISED NEURAL TOPIC MODELS

Ian Gemp∗
DeepMind, London, UK
imgemp@.com

Ramesh Nallapati, Ran Ding, Feng Nan, Bing Xiang
Amazon Web Services, New York, NY 10001, USA
{rnallapa,rding,nanfen,bxiang}@amazon.com

ABSTRACT

We consider the problem of topic modeling in a weakly semi-supervised setting. In this scenario, we assume that the user knows a priori a subset of the topics she wants the model to learn and is able to provide a few exemplar documents for those topics. In addition, while each document may typically consist of multiple topics, we do not assume that the user will identify all its topics exhaustively. Recent state-of-the-art topic models such as NVDM, referred to herein as Neural Topic Models (NTMs), fall under the variational autoencoder framework. We extend NTMs to the weakly semi-supervised setting by using informative priors in the training objective. After analyzing the effect of informative priors, we propose a simple modification of the NVDM model using a logit-normal posterior that we show achieves better alignment to user-desired topics versus other NTM models.

1 INTRODUCTION

Topic models are probabilistic models of data that assume an abstract set of topics underlies the data generating process (Blei, 2012). These abstract topics are often not only useful as feature representations for downstream tasks, but also for exploring and analyzing a corpus. Topic models are used to explore natural scenes in images (Fei-Fei and Perona, 2005; Mimno, 2012), genetics data (Rogers et al., 2005), and numerous text corpora (Newman et al., 2006; Mimno and McCallum, 2007). While latent Dirichlet allocation (LDA) serves as the classical benchmark for topic models, recent state-of-the-art topic models such as NVDM (Miao et al., 2016) fall under the variational autoencoder (VAE) framework (Kingma and Welling, 2013); we refer to these as Neural Topic Models (NTMs). NTMs leverage the flexibility of neural networks to fit an approximate posterior using variational inference. This posterior can then be used to efficiently predict the topics contained in a document. NTMs have been shown to model documents well, as well as associate a set of meaningful top words with each topic (Miao et al., 2017).

Often, the top words associated with each extracted topic only approximately match the user's intuition. Therefore, the user may want to guide the model towards learning topics that better align with natural semantics by providing example documents. To our knowledge, supervision has been explored in more classical LDA models, but has not yet been explored in the NTM literature. Labeling the existence of every topic in each document across a corpus is prohibitively expensive. Hence, we focus on a weak form of supervision. Specifically, we assume a user may identify the existence of a single topic in a document. Furthermore, if a user does not specify the existence of a topic, it does not mean the topic does not appear in the document. The main contribution of our work is an NTM with the ability to leverage minimal user supervision to better align topics to desired semantics.

2 BACKGROUND: NEURAL TOPIC MODELS

Topic models describe documents (typically represented as bags-of-words) as being generated by a mixture of underlying abstract topics. Each topic is represented as a distribution over the words in

∗Work done while interning at Amazon NYC in Summer 2018.


a vocabulary so that each document can be expressed as a mixture of different topics and its words drawn from the associated mixture distribution. A Neural Topic Model is a topic model constructed according to the variational autoencoder (VAE) paradigm. The generative process for a document x given a latent topic mixture υ is υ ∼ p(υ), x ∼ p_θ(x|υ), where p(υ) is the prior and θ are learned parameters. The marginal likelihood p_θ(x) and posterior p_θ(υ|x) are in general intractable, so a standard approach is to introduce an approximate posterior, q_φ(υ|x), and seek maximization of the evidence lower bound (ELBO):

$$\log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(\upsilon|x)}\big[\log p_\theta(x|\upsilon)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(\upsilon|x)\,\|\,p(\upsilon)\big)}_{\mathrm{ELBO}(x;\theta,\phi)}. \tag{1}$$

The reparameterization trick (Kingma and Welling, 2013) lets us train this model via stochastic optimization. The most commonly used reparameterization is for sampling from a Gaussian with diagonal covariance, e.g., υ = µ(x) + σ(x) ⊙ ε, ε ∼ N(0, I), where µ(x) and σ(x) are neural networks.
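To make the training objective concrete, the following PyTorch sketch computes a one-sample Monte Carlo estimate of the ELBO in Equation (1) using the Gaussian reparameterization above. The function and variable names (`elbo`, `enc_mu`, `dec_W`, etc.) are our own illustration, and a standard-normal prior is assumed here; Section 4 replaces it with informative priors.

```python
import torch
import torch.nn.functional as F

def elbo(x_bow, enc_mu, enc_logvar, dec_W, dec_b):
    """One-sample estimate of the ELBO in Eq. (1) for one document.

    x_bow              : (|V|,) bag-of-words count vector
    enc_mu, enc_logvar : (K,) encoder outputs parameterizing q(v|x)
    dec_W, dec_b       : (|V|, K) decoder weights and (|V|,) bias
    """
    # Reparameterization trick: v = mu + sigma * eps, eps ~ N(0, I)
    sigma = torch.exp(0.5 * enc_logvar)
    eps = torch.randn_like(enc_mu)
    v = enc_mu + sigma * eps

    # Decoder: multinomial log-probabilities over the vocabulary
    log_probs = F.log_softmax(dec_W @ v + dec_b, dim=-1)
    # Reconstruction term: expected log-likelihood of the observed word counts
    log_px_given_v = (x_bow * log_probs).sum()

    # Analytic KL between q = N(mu, sigma^2 I) and a standard-normal prior
    kl = 0.5 * (sigma ** 2 + enc_mu ** 2 - 1.0 - enc_logvar).sum()

    return log_px_given_v - kl
```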

3 WEAKLY SEMI-SUPERVISED TOPIC MODELING

As motivated in the introduction, the topics extracted from unsupervised topic modeling can be challenging to interpret for a user (see Table 1). It is often the case that a user has a set of "ground truth" topics in mind that they would like the model to align with. We focus on the setting where a user is presented a subset of documents and identifies at least a single topic for each of the documents.

(U)  rf harvesting energy chains documents consumption compression shortest texts retrieval
(S)  recommendation retrieval item document items filtering documents collaborative preferences engine

Table 1: Upon scanning topics extracted by an unsupervised (U) NVDM for the AAPD ArXiv dataset, the user recognizes a few words relevant to the topic “information retrieval”. Supervision (S) leveraging exemplar documents containing the topic helps the model to better align the topic to the desired semantics.

We emphasize two challenging components in our setting. First, it is semi-supervised in the sense that only a small subset of documents is labeled by the user. Second, the supervision is weak in the sense that the user does not necessarily identify all the topics that are present in a document.

We define "ground truth" word rankings for each topic using a relative chi-square association metric, a common measure of word association. Specifically, we compute the chi-square statistic between the occurrence of each word and the occurrence of each document class using the ground truth topic labels. We then divide the chi-square statistic for each word by the average of its chi-square statistics over all labeled topics, giving us a relative importance that is discriminative across topics. The top-10 words for each topic are then the words with highest relative importance. To evaluate alignment of topics extracted by an NTM to these "ground truth" topics, we compute the average normalized point-wise mutual information (NPMI) (Bouma, 2009) between each word in the NTM's topics and each word in the "ground truth" topics; a higher score indicates that the NTM's extracted words often co-occur with the ground truth's across Wikipedia. We refer to this score as pairwise-NPMI (P-NPMI). We additionally report the average NPMI computed between all word pairs within each NTM topic (NPMI) as a measure of topic coherence, as well as NPMI separately averaged over just supervised topics (S-NPMI) and unsupervised topics (U-NPMI).
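As an illustration of the evaluation setup, the numpy sketch below computes the relative chi-square ranking described above from a binary word-occurrence matrix and binary topic labels. The function and variable names are hypothetical, and the exact preprocessing used in the paper may differ.

```python
import numpy as np

def ground_truth_top_words(X, Y, top_n=10, eps=1e-12):
    """Rank words per topic by relative chi-square association.

    X : (D, V) binary matrix, X[d, v] = 1 if word v occurs in document d
    Y : (D, K) binary matrix, Y[d, k] = 1 if document d has topic label k
    Returns a (K, top_n) array of word indices.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    D = X.shape[0]

    # 2x2 contingency counts for every (topic, word) pair
    a = Y.T @ X                      # word present, topic present
    b = X.sum(0)[None, :] - a        # word present, topic absent
    c = Y.sum(0)[:, None] - a        # word absent,  topic present
    d = D - a - b - c                # word absent,  topic absent

    chi2 = D * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d) + eps)

    # Relative importance: divide each word's statistic by its mean over topics
    rel = chi2 / (chi2.mean(axis=0, keepdims=True) + eps)
    return np.argsort(-rel, axis=1)[:, :top_n]
```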

4 OUR APPROACH

Our approach focuses on extending the NTM to the semi-supervised setting. Our method is based on modifying the prior, p_i(υ_i), of a VAE to incorporate user label information, where p_i(υ_i) is the prior distribution over the latent variables for the ith document. As discussed above, we assume the user provides partial binary labels indicating the existence of topics in a document (1=exists, 0=unlabeled). We would like to encourage the posterior samples of the model, υ_i ∼ q_φ(υ_i|x_i), to match the true presence of topics in a document.

In the case of a Gaussian posterior (NVDM), a natural interpretation of υ_i is as the logits of the probabilities that each topic exists. Therefore, to recover probabilities, we simply sigmoid the outputs


of our encoder. Given this interpretation for semi-supervision, we set p_i(υ_ik) to be N(µ_ik, 1), where

$$\mu_{ik} = \begin{cases} \mathrm{logit}(0.9999) & \text{if document } i \text{ is labeled to contain topic } k, \\ \mathrm{logit}(p_{k+}) & \text{else if topic } k \text{ is supervised}, \\ \mathrm{logit}(0.5) = 0 & \text{otherwise}, \end{cases} \tag{2}$$

and p_{k+} is the class-prior probability, i.e., the probability that any given document contains topic k. Note that p_{k+} can be estimated from PU data (Jain et al., 2016). In our experiments, we instead use the mean class-prior probability over all topics (p̄_+) and are still able to obtain high performance, suggesting this approach is robust to estimation error.
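A minimal sketch of how the prior means of Equation (2) could be assembled for a mini-batch, assuming the user labels arrive as a partial binary matrix; the function names and the default value of p̄_+ are our own and purely illustrative.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def prior_means(labels, supervised_topics, p_bar=0.1):
    """Prior means mu_{ik} of Eq. (2) for one mini-batch.

    labels            : (B, K) binary array; 1 = user labeled topic k in doc i
                        (0 only means "unlabeled", not "absent")
    supervised_topics : (K,) boolean mask of topics receiving any supervision
    p_bar             : mean class-prior probability used in place of p_{k+}
    """
    B, K = labels.shape
    mu = np.full((B, K), logit(0.5))          # = 0, uninformative default
    mu[:, supervised_topics] = logit(p_bar)   # supervised topic, doc unlabeled
    mu[labels == 1] = logit(0.9999)           # labeled as present
    return mu
```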

In order to extract topics from the model, we condition the generative distribution, p_θ(x|υ), on the logits of each of the possible approximate one-hot vectors, e.g., υ^1 = logit([0.9999, 0.0001, ...]) for topic 1. After conditioning on υ^k, we observe the mean of the conditional distribution, ξ_k. We then subtract from each ξ_k the mean of ξ over the topics, i.e., ξ̃_k = ξ_k − ξ̄. The indices corresponding to the top-10 values of ξ̃_k indicate the top-10 words associated with each topic k.

We also explore using two other posteriors. One is the Concrete / Gumbel-softmax approximation to the Bernoulli (Jang et al., 2016; Maddison et al., 2016), which more closely matches our goal of modeling the existence of topics in documents. The differentiable posterior q_φ(υ|x) = Concrete(π, τ) can be sampled with Equations (3), (4), and (5):

$$u \sim \mathcal{U}(0, 1) \tag{3}$$
$$\eta = \mathrm{logit}(\pi) + \mathrm{logit}(u) \tag{4}$$
$$\upsilon = \mathrm{sigmoid}(\eta/\tau) \tag{5}$$

$$\pi_{ik} = \begin{cases} 0.9999 & \text{if document } i \text{ is labeled to contain topic } k, \\ p_{k+} & \text{else if topic } k \text{ is supervised}, \\ 0.5 & \text{otherwise}, \end{cases} \tag{6}$$

where τ is a temperature hyperparameter such that υ ∼ Ber(π) as τ → 0, sigmoid(r) = (1 + e^{−r})^{−1}, and logit(p) = sigmoid^{−1}(p) = log(p/(1 − p)). The prior is p(υ_ik) ∼ Concrete(π_ik, τ), where π_ik is the prior probability defined in Equation (6). The log-density of Concrete(π, τ) is given in Section C.3.2 of (Jang et al., 2016), and we use the Bernoulli D_KL as an approximation. The other posterior we consider is the logit-normal distribution, for which logit(υ) is normally distributed; samples are easily obtained by taking the sigmoid of samples from a normal distribution.
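For concreteness, the following sketch draws a differentiable sample according to Equations (3)-(5); it is a straightforward transcription of the equations rather than the authors' implementation, and the names are ours.

```python
import torch

def sample_binary_concrete(logit_pi, tau=1.0, eps=1e-8):
    """Differentiable sample from Concrete(pi, tau), Eqs. (3)-(5).

    logit_pi : encoder output interpreted as logit(pi), shape (B, K)
    tau      : temperature; the sample approaches Ber(pi) as tau -> 0
    """
    u = torch.rand_like(logit_pi).clamp(eps, 1.0 - eps)  # Eq. (3): u ~ U(0, 1)
    eta = logit_pi + torch.log(u) - torch.log(1.0 - u)   # Eq. (4): logit(pi) + logit(u)
    return torch.sigmoid(eta / tau)                       # Eq. (5)
```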

COMPARISON OF KULLBACK-LEIBLER DIVERGENCES

Our technique of incorporating supervision via informative priors provides a supervised training signal via the D_KL term in Equation (1). Figure 1 compares the Bernoulli and Gaussian (NVDM) D_KL terms against a standard cross-entropy loss. Note that D_KL is the same for the Gaussian and logit-normal posteriors.

In Figure 1, we fix the prior distributions to Ber(π_p = 0.9999) for the Bernoulli model and N(µ_p = logit(0.9999), σ_p = 1) for NVDM to signify the presence of a topic with high certainty. The x-axis, p, denotes the posterior probability: π_q for the Bernoulli model and sigmoid(µ_q) for NVDM/logit-normal. The cross-entropy loss is −π_p log p − (1 − π_p) log(1 − p), where π_p is the prior probability. The posterior variance of the NVDM is assumed to match the prior variance in this figure. It is clear that the NVDM D_KL strictly provides the strongest loss signal for topic prediction.

On the other hand, the Bernoulli model generally achieves better topic coherence (NPMI). We expect this is due to the cleaner decoding process relative to NVDM. For example, the Bernoulli decoding computes the word distribution for the first out of K topics as the softmax of 1·W_1 + ... + 0·W_K + b, where W ∈ R^{|V|×K} and b ∈ R^{|V|} are the decoder weight matrix and bias respectively, |V| is the size of the vocabulary, and W_k refers to the kth column of W. In contrast, NVDM decodes with 9.2·W_1 + ... − 9.2·W_K + b, where 9.2 ≈ logit(0.9999) = −logit(0.0001).

Figure 1: The Gaussian D_KL provides the strongest topic encoding signal. The Bernoulli D_KL is approximately linear with finite divergence at p = 0. The log loss is very shallow except where it diverges at p = 0.
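The three curves in Figure 1 can be reproduced from the closed forms discussed above. The numpy sketch below encodes our reading of that setup (prior probability 0.9999, NVDM posterior variance matched to the prior variance) and is not the authors' plotting code.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def kl_bernoulli(q, pi_p=0.9999):
    """KL( Ber(q) || Ber(pi_p) ): approximately linear in q, finite at q = 0."""
    return q * np.log(q / pi_p) + (1 - q) * np.log((1 - q) / (1 - pi_p))

def kl_gaussian(p, pi_p=0.9999):
    """KL( N(logit(p), 1) || N(logit(pi_p), 1) ) = (logit(p) - logit(pi_p))^2 / 2.
    Grows quadratically in logit space as p -> 0: the strongest loss signal."""
    return 0.5 * (logit(p) - logit(pi_p)) ** 2

def cross_entropy(p, pi_p=0.9999):
    """Log loss -pi_p log p - (1 - pi_p) log(1 - p): shallow except near p = 0."""
    return -pi_p * np.log(p) - (1 - pi_p) * np.log(1 - p)

p = np.linspace(1e-4, 1 - 1e-4, 200)
curves = {"Bernoulli KL": kl_bernoulli(p),
          "Gaussian KL": kl_gaussian(p),
          "cross-entropy": cross_entropy(p)}
```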


Therefore, we propose NVDM-σ with a logit-normal posterior. The intuition is to retain both the strong supervision signal of NVDM as well as the clean decoding used by the Bernoulli model.
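A minimal sketch of the resulting NVDM-σ forward pass, assuming PyTorch and our own tensor-shape conventions: the Gaussian sample is squashed through a sigmoid before decoding, so the KL term matches NVDM while the decoder mixes columns of W with weights in [0, 1], as in the Bernoulli model.

```python
import torch
import torch.nn.functional as F

def nvdm_sigma_decode(mu, logvar, dec_W, dec_b):
    """Logit-normal posterior sample and 'clean' decoding for NVDM-sigma.

    mu, logvar : (K,) encoder outputs; logit(v) ~ N(mu, sigma^2)
    dec_W      : (|V|, K) decoder weights, dec_b : (|V|,) bias
    """
    eps = torch.randn_like(mu)
    v = torch.sigmoid(mu + torch.exp(0.5 * logvar) * eps)  # sample lies in [0, 1]^K
    # The decoder mixes columns of dec_W with weights in [0, 1] rather than with
    # raw logits (roughly +/- 9.2 for NVDM), mirroring the Bernoulli decoding.
    return F.log_softmax(dec_W @ v + dec_b, dim=-1)
```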

5 EXPERIMENTS

We train and evaluate these models on three multi-label datasets: BibTeX, a tagging dataset containing BibTeX metadata; Delicious, a document classification dataset; and AAPD, a dataset extracted from the abstracts and subjects of computer science ArXiv papers. We vary three experimental variables. First, we consider the percentage of topics supervised to be either 10%, 50%, or 100%. For example, for Delicious with 20 known topics, we consider supervising 2, 10, and all 20 dimensions of υ. Second, we consider the number of labeled documents provided per supervised topic. For example, in one experiment, when supervising 10% of topics for Delicious, we consider providing 3 or 10 example documents per topic. Lastly, we vary the number of labels given per document. As described in Section 3, the user is not required to provide all the topics contained in each document. We consider documents with only 1 label, up to half the maximum number of labels (5 for Delicious), and fully-labeled (up to 11). Using 10% topic supervision, 10 labeled documents per topic, and 1 label per document as an experimental baseline, we vary each experimental variable independently, resulting in six different settings for each of the three datasets. Note that each setting is still weakly semi-supervised.

In general, all models see a rise in NPMI from informative priors. We observe that NPMI rises on average by 12% relative to the unsupervised setting, and by a further 15% in the fully supervised case. This demonstrates that weak labels are capable of providing a significant boost in performance.

NVDM (0.230)
  Top-10 Words (First Iteration): transformation software attributes np necessary correlation author posterior oracle alternating
  Top-10 Words (Fully Trained):   word allocation string greedy distance jointly tree sub trees latent
Bernoulli (0.219)
  Top-10 Words (First Iteration): provides received game white numbers distinct increasingly effectiveness loss gives
  Top-10 Words (Fully Trained):   greedy performance procedure allocation distance jointly speed report existing learn
NVDM-σ (0.239)
  Top-10 Words (First Iteration): rf outperforms produces availability minimal correcting increased formalism centrality correction
  Top-10 Words (Fully Trained):   equation allocation distance string greedy jointly word sub tree learn

Table 2: The supervised topic for AAPD is Computation and Language and the P-NPMI scores are under the model names. Only 3 documents are provided for 10% of topics with 1 label per document.

Table 2 shows successful alignment for supervised topics. Only three documents are labeled for each topic; this positive result is important as labeling numerous documents is prohibitively expensive for commercial applications. Table 3 shows NVDM-σ does best in terms of all NPMI metrics.

Metric    NVDM           NVDM-σ         Bernoulli
P-NPMI    0.852 (0.099)  0.978 (0.048)  0.892 (0.126)
NPMI      0.888 (0.100)  0.960 (0.052)  0.947 (0.049)
U-NPMI    0.915 (0.102)  0.951 (0.055)  0.935 (0.060)
S-NPMI    0.878 (0.094)  0.953 (0.045)  0.951 (0.057)

Table 3: Average performance (with standard deviation in parentheses) relative to the best score in each experimental setting. Specifically, for each experimental setting and each metric, we choose the best score among the 3 models and normalize the scores according to it. We then average these relative scores over the experimental settings. Note that NVDM-σ achieves high mean performance with relatively low variance.

6 CONCLUSION

In this work, we proposed incorporating weak supervision into Neural Topic Models via informative priors and explored a variety of model posteriors. A careful analysis of their KL divergences and decoding mechanisms led us to an NTM with a logit-normal posterior that best aligned extracted topics to desired user semantics.


REFERENCES

David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

Gerlof Bouma. Normalized pointwise mutual information in collocation extraction. Proceedings of GSCL, pages 31–40, 2009.

Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 524–531. IEEE, 2005.

Shantanu Jain, Martha White, and Predrag Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems, pages 2693–2701, 2016.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736, 2016.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359, 2017.

David Mimno. Reconstructing Pompeian households. arXiv preprint arXiv:1202.3747, 2012.

David Mimno and Andrew McCallum. Organizing the OCA: Learning faceted subjects from a library of digital books. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 376–385. ACM, 2007.

David Newman, Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Analyzing entities and topics in news articles using statistical topic models. In International Conference on Intelligence and Security Informatics, pages 93–104. Springer, 2006.

Simon Rogers, Mark Girolami, Colin Campbell, and Rainer Breitling. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2):143–156, 2005.


A MODEL ARCHITECTURE

We use a feedforward neural network with 2 hidden layers followed by a linear transformation for the encoder. The first layer is 3K neurons wide and the second is 2K neurons wide, where K is the size given to the latent space. Each hidden layer uses sigmoid nonlinearities. The final layer is linear and outputs a vector of size 2K for the parameters of the Gaussian posterior. The mean is obtained from the first K entries of this vector and the standard deviation, σ, is obtained from the last K entries by applying a softplus, log(1 + e^x). Instead of directly modeling the probability π in Equation (5) for the Bernoulli posterior, we interpret the output of the final linear layer of the encoder as logit(π). We set the dimensionality of the latent space υ equal to 50 for 20NG and to the number of ground truth topics for all other datasets. The decoder always consists of a linear layer followed by a softmax to obtain the multinomial probabilities for each word in the vocabulary. We set τ = 1 for the Gumbel-softmax sampler. We train with Adadelta, rescale gradients by a factor of 0.01, and use batch sizes of 256.
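A sketch of the encoder just described, under the assumption of standard PyTorch layers; the class and attribute names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Encoder of Appendix A: two sigmoid hidden layers (3K and 2K wide)
    followed by a linear layer producing 2K Gaussian posterior parameters."""

    def __init__(self, vocab_size, K):
        super().__init__()
        self.fc1 = nn.Linear(vocab_size, 3 * K)
        self.fc2 = nn.Linear(3 * K, 2 * K)
        self.out = nn.Linear(2 * K, 2 * K)
        self.K = K

    def forward(self, x_bow):
        h = torch.sigmoid(self.fc1(x_bow))
        h = torch.sigmoid(self.fc2(h))
        params = self.out(h)
        mu = params[..., :self.K]                 # first K entries: mean
        sigma = F.softplus(params[..., self.K:])  # last K entries: std via softplus
        return mu, sigma
```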

B SEMI-SUPERVISED TOPIC MODEL TEST PERFORMANCE

Metric          Model      Bibtex       Delicious    ArXiv
PPX             NVDM       15835        1955         1456
                Bernoulli  858          1683         1126
                NVDM+sig   19217        2175         1035
−log(p(x|υ))    NVDM       450          894          428
                Bernoulli  441          900          446
                NVDM+sig   442          925          402
P-NPMI          NVDM       0.194        0.074        0.180
                Bernoulli  0.234        0.061        0.231
                NVDM+sig   0.204        0.061        0.223
NPMI (Avg/S)    NVDM       0.230/0.260  0.149/0.236  0.214/0.183
                Bernoulli  0.233/0.267  0.153/0.217  0.211/0.234
                NVDM+sig   0.238/0.259  0.180/0.211  0.196/0.205

Table 4: 10 Documents provided for 10% of Topics with 1 label per document.

Metric          Model      Bibtex       Delicious    ArXiv
PPX             NVDM       19 × 10^9    3149         9119
                Bernoulli  7864         1943         992
                NVDM+sig   7.08 × 10^9  4204         12332
−log(p(x|υ))    NVDM       448          905          422
                Bernoulli  447          919          420
                NVDM+sig   443          918          431
P-NPMI          NVDM       0.214        0.058        0.211
                Bernoulli  0.197        0.060        0.228
                NVDM+sig   0.219        0.069        0.239
NPMI (Avg/S)    NVDM       0.240/0.219  0.124/0.114  0.217/0.190
                Bernoulli  0.222/0.225  0.140/0.134  0.215/0.216
                NVDM+sig   0.251/0.249  0.137/0.138  0.234/0.224

Table 5: 10 Documents provided for 50% of Topics with 1 label per document.


Metric          Model      Bibtex        Delicious    ArXiv
PPX             NVDM       92.5 × 10^12  17877        290359
                Bernoulli  24479         2042         1358
                NVDM+sig   18.6 × 10^13  12876        256033
−log(p(x|υ))    NVDM       458           940          409
                Bernoulli  447           915          422
                NVDM+sig   447           904          415
P-NPMI          NVDM       0.206         0.048        0.215
                Bernoulli  0.201         0.051        0.235
                NVDM+sig   0.230         0.067        0.235
NPMI (Avg/S)    NVDM       0.216/0.216   0.104/0.104  0.184/0.184
                Bernoulli  0.226/0.226   0.139/0.139  0.244/0.244
                NVDM+sig   0.252/0.252   0.145/0.145  0.217/0.217

Table 6: 10 Documents provided for 100% of Topics with 1 label per document.

Metric          Model      Bibtex       Delicious    ArXiv
PPX             NVDM       19069        1731         1582
                Bernoulli  1035         1987         1161
                NVDM+sig   21515        2266         1039
−log(p(x|υ))    NVDM       442          884          433
                Bernoulli  442          922          446
                NVDM+sig   445          928          406
P-NPMI          NVDM       0.181        0.043        0.204
                Bernoulli  0.209        0.061        0.229
                NVDM+sig   0.202        0.060        0.239
NPMI (Avg/S)    NVDM       0.212/0.226  0.138/0.131  0.232/0.216
                Bernoulli  0.217/0.251  0.181/0.152  0.239/0.239
                NVDM+sig   0.232/0.248  0.174/0.184  0.193/0.219

Table 7: 3 Documents provided for 10% of Topics with 1 label per document.

Metric          Model      Bibtex       Delicious    ArXiv
PPX             NVDM       16839        1994         1436
                Bernoulli  946          2072         1000
                NVDM+sig   20810        2034         1044
−log(p(x|υ))    NVDM       453          888          428
                Bernoulli  438          923          435
                NVDM+sig   452          912          401
P-NPMI          NVDM       0.198        0.083        0.206
                Bernoulli  0.206        0.073        0.235
                NVDM+sig   0.229        0.094        0.237
NPMI (Avg/S)    NVDM       0.240/0.243  0.141/0.190  0.216/0.206
                Bernoulli  0.218/0.224  0.165/0.232  0.226/0.216
                NVDM+sig   0.226/0.222  0.180/0.213  0.206/0.203

Table 8: 10 Documents provided for 10% of Topics with 50% labels per document.


Metric          Model      Bibtex       Delicious    ArXiv
PPX             NVDM       18254        1743         1262
                Bernoulli  965          1752         1139
                NVDM+sig   20860        2204         1052
−log(p(x|υ))    NVDM       450          883          428
                Bernoulli  440          902          447
                NVDM+sig   451          924          402
P-NPMI          NVDM       0.189        0.062        0.232
                Bernoulli  0.191        0.049        0.210
                NVDM+sig   0.218        0.103        0.233
NPMI (Avg/S)    NVDM       0.239/0.244  0.132/0.168  0.217/0.222
                Bernoulli  0.224/0.229  0.161/0.153  0.219/0.226
                NVDM+sig   0.230/0.222  0.183/0.186  0.201/0.211

Table 9: 10 Documents provided for 10% of Topics with 100% labels per document.
