Variational Inference In Pachinko Allocation Machines

Akash Srivastava
University of Edinburgh
[email protected]

Charles Sutton
University of Edinburgh, The Alan Turing Institute, Google Brain
[email protected]

Abstract

The Pachinko Allocation Machine (PAM) is a deep topic model that allows representing rich correlation structures among topics by a directed acyclic graph over topics. Because of the flexibility of the model, however, approximate inference is very difficult. Perhaps for this reason, only a small number of potential PAM architectures have been explored in the literature. In this paper we present an efficient and flexible amortized variational inference method for PAM, using a deep inference network to parameterize the approximate posterior distribution in a manner similar to the variational autoencoder. Our inference method produces more coherent topics than state-of-the-art inference methods for PAM while being an order of magnitude faster, which allows exploration of a wider range of PAM architectures than have previously been studied.

1 Introduction

Topic models are widely used tools for exploring and visualizing document collections. Simpler topic models, like latent Dirichlet allocation (LDA) (Blei et al., 2003), capture correlations among words but do not capture correlations among topics. This limits the model's ability to discover finer-grained hierarchical latent structure in the data. For example, we expect that very specific topics, such as those pertaining to individual sports teams, are likely to co-occur more often than more general topics, such as a generic "politics" topic with a generic "sports" topic.

A popular extension to LDA that captures topic correlations is the Pachinko Allocation Machine (PAM) (Li and McCallum, 2006). PAM is essentially "deep LDA". It is defined by a directed acyclic graph (DAG) in which each leaf node denotes a word in the vocabulary, and each internal node is associated with a distribution over its children. The document is generated by sampling, for each word, a path from the root of the DAG to a leaf. Thus the internal nodes can represent distributions over topics, so-called "super-topics", and so on, thereby representing correlations among topics.

Unfortunately, PAM introduces many latent variables: for each word in the document, the path in the DAG that generated the word is latent. Therefore, traditional inference methods, such as Gibbs sampling and decoupled mean-field variational inference, become significantly more expensive. This not only affects the scale of data sets that can be considered, but, more fundamentally, the computational cost of inference makes it difficult to explore the space of possible architectures for PAM. As a result, to date only relatively simple architectures have been studied in the literature (Li and McCallum, 2006; Mimno et al., 2007; Li et al., 2012).

We present what is, to the best of our knowledge, the first variational inference method for PAM, which we call dnPAM. Unlike collapsed Gibbs, dnPAM can be generically applied to any PAM architecture without the need to derive a new inference algorithm, allowing much more rapid exploration of the space of possible model architectures. dnPAM is an amortized inference method following the learning principle of variational autoencoders, which means that the variational distribution is parameterized by a deep network that is trained to perform accurate inference. We find that dnPAM is not only an order of magnitude faster than collapsed Gibbs, but even returns topics with comparable or greater coherence. The dramatic speedup in inference time comes from amortizing the learning cost by training a neural network to produce the posterior parameters instead of learning these parameters directly. This efficiency in inference enables exploration of more complex and deeper PAM models than have previously been possible.

As a demonstration of this, as our second contribution we introduce a mixture of PAMs model, where each component distribution of the mixture is represented by a PAM. By mixing PAMs with varying numbers of topics, this model captures the latent structure in the data at many different levels of granularity, decoupling general broad topics from the more specific ones.

Like other variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), our model also suffers from posterior collapse (van den Oord et al., 2017), sometimes referred to as component collapsing (Dinh and Dumoulin, 2016), and from slow training due to low learning rates. We present an analysis of these issues in the context of topic modeling and propose a normalization-based solution to alleviate them.

2 Latent Dirichlet Allocation

LDA represents each document w in a collection as a mixture of topics. Each topic vector β_k is a distribution over the vocabulary, that is, a vector of probabilities, and β = (β_1 ... β_K) is the matrix of the K topics. Every document is then modeled as an admixture of the topics. The generative process is to first sample a proportion vector θ ∼ Dirichlet(α), then for each word at position n, sample a topic indicator z_n ∈ {1, ..., K} as z_n ∼ Categorical(θ), and finally sample the word index w_n ∼ Categorical(β_{z_n}).
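To make this generative process concrete, here is a minimal NumPy sketch that samples one synthetic document from LDA. The sizes K, V and N and the hyperparameter values are illustrative choices, and the symmetric Dirichlet prior on the topics (eta) is an assumption, since the section above leaves the prior on β unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 10, 2000, 100      # topics, vocabulary size, words per document (illustrative)
alpha, eta = 0.1, 0.01       # Dirichlet hyperparameters (eta for the topics is an assumption)

beta = rng.dirichlet(eta * np.ones(V), size=K)          # K topic-word distributions, shape (K, V)
theta = rng.dirichlet(alpha * np.ones(K))               # per-document topic proportions theta
z = rng.choice(K, size=N, p=theta)                      # topic indicator z_n for every word position
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])   # word indices w_n
```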

2.1 Deep LDA: Pachinko Allocation Machine

PAM is a class of topic models that extends LDA by modeling correlations among topics. A particular instance of a PAM represents the correlation structure among topics by a DAG in which the leaf nodes represent words in the vocabulary and the internal nodes represent topics. Each node s in the DAG is associated with a distribution θ_s over its children, which has a Dirichlet prior. There is no need to differentiate between nodes in the graph and the distributions θ_s, so we will simply take {θ_s} to be the node set of the graph. To generate a document in PAM, for each word we sample a path from the root to a leaf, and output the word associated with that leaf.

More formally, we present the special case of 4-PAM, in which the DAG is a 4-partite graph; it will be clear how to generalize this discussion to arbitrary DAGs. In 4-PAM, the DAG consists of a root node θ_r which is connected to children θ_1 ... θ_S called super-topics. Each super-topic θ_s is connected to the same set of children β_1 ... β_K called sub-topics, each of which is fully connected to the vocabulary items 1 ... V in the leaves.

A document is generated in 4-PAM as follows. First, a single matrix of sub-topics β is generated for the entire corpus as β_k ∼ Dirichlet(α_0). Then, to sample a document w, we sample child distributions for each remaining internal node in the DAG. For the root node, θ_r is drawn from a Dirichlet prior θ_r ∼ Dirichlet(α_r), and similarly for each super-topic s ∈ {1 ... S}, the super-topic θ_s is drawn as θ_s ∼ Dirichlet(α_s). Finally, for each word w_n, a path is sampled from the root to a leaf. From the root, we sample the index of a super-topic z_{n1} ∈ {1 ... S} as z_{n1} ∼ Categorical(θ_r), followed by a sub-topic index z_{n2} ∈ {1 ... K} sampled as z_{n2} ∼ Categorical(θ_{z_{n1}}), and finally the word is sampled as w_n ∼ Categorical(β_{z_{n2}}). This process can be written as a density

P(w, z, θ | α, β) = p(θ_r | α_r) ∏_{s=1}^{S} p(θ_s | α_s) ∏_n p(z_{n1} | θ_r) p(z_{n2} | θ_{z_{n1}}) p(w_n | β_{z_{n2}}).   (1)

It should be easy to see how this process can be extended to arbitrary ℓ-partite graphs, yielding the ℓ-PAM model, in which case LDA exactly corresponds to 3-PAM, and also to arbitrary DAGs.
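The 4-PAM generative story above can be simulated directly. The following NumPy sketch samples a single synthetic document; it is an illustration of the generative process, not the paper's implementation, and the sizes S, K, V, N and the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

S, K, V, N = 2, 50, 2000, 100            # super-topics, sub-topics, vocabulary, words (illustrative)
alpha_0, alpha_r, alpha_s = 0.1, 1.0, 1.0

# Corpus-level sub-topics: K distributions over the vocabulary.
beta = rng.dirichlet(alpha_0 * np.ones(V), size=K)         # shape (K, V)

def sample_document():
    # Document-level distributions for the remaining internal nodes of the DAG.
    theta_r = rng.dirichlet(alpha_r * np.ones(S))           # root: distribution over super-topics
    theta_s = rng.dirichlet(alpha_s * np.ones(K), size=S)   # each super-topic: distribution over sub-topics
    words = np.empty(N, dtype=int)
    for n in range(N):
        z1 = rng.choice(S, p=theta_r)         # super-topic index z_{n1}
        z2 = rng.choice(K, p=theta_s[z1])     # sub-topic index z_{n2}
        words[n] = rng.choice(V, p=beta[z2])  # word w_n
    return words

doc = sample_document()
```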

3 Mixture of PAMs

The main advantage of the inference framework we propose is that it allows easily exploring the design space of possible structures for PAM. As a demonstration of this, we present a word-level mixture of PAMs that allows learning finer-grained topics than a single PAM, as some mixture components learn topics that capture the more general, global topics so that other mixture components can focus on finer-grained topics.

We describe a word-level mixture of M PAMs P_1 ... P_M, each of which can have a different number of topics or even a completely different DAG structure. To generate a document under this model, first we sample an M-dimensional document-level mixing proportion θ_r ∼ Dirichlet(α_r). Then, for each word w_n in the document, we choose one of the PAM models by sampling m ∼ Categorical(θ_r) and finally sample a word by sampling a path through P_m as described in the previous section. This model can be expressed as a general PAM model in which the root node θ_r is connected to the root nodes of each of the M mixture components. If each of the mixture components is a 3-PAM model, that is, LDA, then we call the resulting model a mixture of LDA models (MoLDA).[1]

[1] It would perhaps be more proper to call this model an admixture of LDA models.

The advantage of this model is that if we choose to incorporate different mixture components with different numbers of topics, we find that the components with fewer topics explain the coarse-grained structure in the data, freeing up the other components to learn finer-grained topics. For example, the Omniglot dataset contains 28x28 images of handwritten alphabets from artificial scripts. In Figure 1, panels (C) and (D) are visualizations of the latent topics generated using vanilla LDA with 10 and 50 topics, respectively. Because we are modelling image data, each topic can also be visualized as an image. Panels (A) and (B) show the topics from a single MoLDA with two components, one with 10 topics and one with 50 topics. It is apparent that the MoLDA topics are sharper, indicating that each individual topic captures more information about the data. The mixture allows the two LDAs being mixed to focus exclusively on higher-level (for 10 topics) and lower-level (for 50 topics) features while modeling the images. The topics in the vanilla LDA, on the other hand, need to account for all the variability in the dataset using just 10 (or 50) topics and are therefore fuzzier.

Figure 1: Top: (A) and (B) show randomly sampled topics from MoLDA (10:50). Bottom: (C) and (D) show randomly sampled topics from LDA with 10 topics and with 50 topics on Omniglot. Notice that by using a mixture, the MoLDA can decouple the higher-level structure (A) from the lower-level details (B).

4 Inference

Probabilistic inference in topic models is the task of computing posterior distributions p(z|w, α, β) over the topic assignments for words, or the posterior p(θ|w, α, β) over topic proportions for documents. For all practical topic models, this task is intractable. Commonly used methods include Gibbs sampling (Li and McCallum, 2006; Blei et al., 2004), which can be slow to converge, and variational inference methods such as mean field (Blei et al., 2003; Blei and Lafferty, 2006), which sometimes sacrifice topic quality for computational efficiency. More fundamentally, these families of approximate inference algorithms tend to be model specific and require extensive mathematical sophistication on the practitioner's part, since even slight changes in model assumptions may require substantial adjustments to the inference. The time required to derive new approximate inference algorithms dramatically slows exploration of the space of possible models.

In this work we present a generic, amortized approximate inference method, dnPAM, for learning in the PAM family of models that is extremely fast, flexible and accurate. The inference method is flexible in the sense that it can be generically applied to any DAG structure for PAM, without the need to derive new variational updates. The main idea is that we approximate the posterior distribution p(θ_s|w, α, β) for each super-topic θ_s by a variational distribution q(θ_s|w). Unlike standard mean-field approaches, in which q(θ_s|w) has an independent set of variational parameters for each document in the corpus, the parameters of q(θ_s|w) are computed by an inference network, a neural network that takes the document w as input and outputs the parameters of the variational distribution. This is motivated by the observation that similar documents can be described well by similar posterior parameters.
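As a minimal illustration of what such an inference network computes, the sketch below maps a bag-of-words vector to the mean and log-variance of a Gaussian in the softmax basis for one node's child distribution, which is the parameterization detailed in the next section. The single hidden layer, its size, and all parameter names here are hypothetical and not taken from the paper.

```python
import numpy as np

def inference_network(w_bow, W1, b1, W_mu, b_mu, W_u, b_u):
    """Map a bag-of-words vector to variational parameters (mu, u = log sigma^2)
    of a Gaussian in the softmax basis for one node's child distribution."""
    h = np.maximum(0.0, W1 @ w_bow + b1)      # single hidden layer with ReLU (hypothetical choice)
    return W_mu @ h + b_mu, W_u @ h + b_u

# Illustrative shapes: vocabulary of 2000 words, 100 hidden units, K = 50 sub-topics.
V, H, K = 2000, 100, 50
rng = np.random.default_rng(0)
shapes = [(H, V), (H,), (K, H), (K,), (K, H), (K,)]
params = [rng.normal(0.0, 0.01, s) for s in shapes]
w_bow = rng.integers(0, 3, V).astype(float)   # toy word-count vector for one document
mu, u = inference_network(w_bow, *params)
```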
In dnPAM, we seek to approximate the posterior distribution p(θ|w, α, β); that is, the paths z_n for each word are integrated out. Note that this is in contrast to previous collapsed Gibbs methods for PAM (Li and McCallum, 2006), which integrate out θ using conjugacy. To simplify notation, we describe dnPAM for the special case of 4-PAM, but it will be clear how to generalize this discussion to arbitrary DAGs. So for 4-PAM, we have θ = (θ_r, θ_1 ... θ_S). We introduce a variational distribution q(θ|w) = q(θ_r|w) q(θ_1|w) ... q(θ_S|w).

To choose the best approximation q(θ|w), we construct a lower bound on the evidence (ELBO) using Jensen's inequality, as is standard in variational inference. For example, the log-likelihood function log p(w|α, β) for the 4-PAM model (1) can be lower bounded by

L = −KL[q(θ_r|w) || p(θ_r|α_r)] − ∑_{s=1}^{S} KL[q(θ_s|w) || p(θ_s|α_s)] + E[ ∑_n log p(w_n | θ, β) ],   (2)

where the expectation is with respect to the variational posterior q(θ|w). dnPAM uses stochastic gradient descent to maximize this ELBO, inferring the variational parameters and learning the model parameters. To finish describing the method, we must describe how q is parameterized, which we do next. For the sub-topic parameters β, we learn these using variational EM; that is, we maximize L with respect to β. It would be a simple extension to add a variational distribution over β if this were desired.

Re-parameterizing the Dirichlet Distribution: The expectation term in equation (2) is in general intractable, and we therefore approximate it using a Monte Carlo (MC) method (Kingma and Welling, 2013; Rezende and Mohamed, 2015) that employs the re-parametrization trick (Williams, 1992) for sampling from the variational posterior. But this MC estimate requires q(θ|w) to belong to the location-scale family, which excludes the Dirichlet distribution. Recently, some progress has been made in the re-parametrization of distributions like the Dirichlet (Ruiz et al., 2016), but in this work, following Srivastava and Sutton (2017), we approximate the posterior with a logistic normal distribution. First, we construct a Laplace approximation of the Dirichlet prior in the softmax basis, which allows us to approximate the posterior distribution using a Gaussian that is in the location-scale family. Then, in order to sample θ's from the posterior in the simplex basis, we apply the softmax transform to the Gaussian samples. Using this Laplace approximation trick also allows handling different prior assumptions, including other non-location-scale family distributions.

Amortizing Super-Topics: As mentioned above, in PAM the super-topics need to be sampled for each document in the corpus. This presents a bottleneck when speeding up posterior inference via Gibbs sampling or DMFVI, as the number of parameters to be learned increases drastically with data compared to a typical LDA. We design our inference method to tackle this bottleneck such that the number of posterior parameters to be learned does not directly depend on the number of documents in the corpus.

dnPAM Inference Network: Recently, Srivastava and Sutton (2017) amortized the cost of learning posterior parameters in LDA by using a feedforward multi-layer perceptron (MLP) to generate the parameters of the posterior distribution over the topic proportion vector θ. Like them, we model the posterior q(θ_r|w) as LN(θ; f_µ(w), f_u(w)), where f_µ and f_u are neural networks that generate the parameters of the logistic normal distribution. But since their inference network does not allow sampling topics, they assumed the topics to be fixed model parameters. We now introduce an alternative inference network architecture that is designed to efficiently sample all the posterior parameters that need to be inferred in PAMs, including the topics.

For generality, we assume that the sub-topics at the lowest level are sampled only once for the corpus. To generate the parameters of the variational posterior distributions at each of the levels above, we use one MLP per level. For example, in the case of 4-PAM, the parameters of the variational posteriors (q(θ_1|w), ..., q(θ_S|w)) over the super-topics are generated from a single MLP. These MLPs are trained using the VAE-based variational learning principle for topic models (Srivastava and Sutton, 2017) and then sampled from, using the process for the Dirichlet distribution described above, to generate the θ's and the super-topics.
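The following sketch illustrates the "Re-parameterizing the Dirichlet Distribution" step above: a Laplace approximation of a symmetric Dirichlet prior in the softmax basis, followed by a reparameterized Gaussian sample that is pushed back onto the simplex with a softmax. The closed-form moment expressions are recalled from Srivastava and Sutton (2017) and should be treated as an assumption rather than something stated in this paper.

```python
import numpy as np

def dirichlet_to_softmax_gaussian(alpha):
    """Laplace approximation of Dirichlet(alpha) in the softmax basis: returns (mu, diagonal variance)."""
    K = alpha.shape[0]
    mu = np.log(alpha) - np.log(alpha).mean()
    var = (1.0 / alpha) * (1.0 - 2.0 / K) + np.sum(1.0 / alpha) / K**2
    return mu, var

def sample_theta(mu, log_var, rng):
    """Reparameterized sample: Gaussian in the softmax basis, then softmax back onto the simplex."""
    h = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    e = np.exp(h - h.max())
    return e / e.sum()

rng = np.random.default_rng(0)
prior_mu, prior_var = dirichlet_to_softmax_gaussian(np.full(50, 0.1))   # symmetric Dirichlet(0.1) prior
theta = sample_theta(prior_mu, np.log(prior_var), rng)                  # a point on the 50-simplex
```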

Note that the Dirichlet distribution is a conjugate prior to the multinomial distribution. This fact can be used to leverage modern GPU-based computation to generate the posterior parameters for the nodes at the next lower level, since it only involves a dot product. We therefore stack all the topic vectors (each sampled from its respective variational posterior) into a 3-D tensor, and using a custom implementation of this dot product[2] we gain a significant reduction in training time. We want to point out that the result of the above process can also be seen as constructing MLPs on the fly, by sampling Dirichlet vectors from our inference networks and stacking them to form the weight matrices of the MLPs.

[2] TensorFlow requires that the rank of the matrices in tf.matmul be the same.
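A minimal NumPy sketch of the stacking idea described above: topic vectors sampled from their variational posteriors for a mini-batch are stacked into a 3-D tensor, and a single batched dot product produces the mixing distributions for the next lower level. The shapes and the exact composition of the two products are illustrative assumptions, standing in for the custom TensorFlow implementation mentioned in the footnote.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, K, V = 200, 2, 50, 2000    # documents in the batch, super-topics, sub-topics, vocabulary

theta_r = rng.dirichlet(np.ones(S), size=B)        # (B, S)    root proportions per document
theta_s = rng.dirichlet(np.ones(K), size=(B, S))   # (B, S, K) sampled super-topic vectors, stacked
beta = rng.dirichlet(np.ones(V), size=K)           # (K, V)    sub-topics shared across the corpus

# Mix over super-topics, then over sub-topics, as two batched dot products.
doc_subtopic_mix = np.einsum('bs,bsk->bk', theta_r, theta_s)   # (B, K)
word_probs = doc_subtopic_mix @ beta                           # (B, V) per-document word distributions
```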

The decoder in the case of PAM is just a dot product between the sample θ from the output distribution of the inference network and the sub-topic matrix β. This makes the entire class of PAM-type mixed-membership models permeable to deep learning while remaining Bayesian about the latent beliefs. Though in our experiments we always use MLPs to encode the posterior and decode the output, other architectures like CNNs and RNNs can easily be used to replace the MLPs if required. As mentioned before, dnPAM can work with non-Dirichlet priors by using the Laplace approximation trick. It can also handle full-covariance Gaussians as well as logistic normals by simply using the Cholesky decomposition, and can therefore be used to learn the Correlated Topic Model (CTM) (Blei and Lafferty, 2006).

At first, the use of an inference network seems strange, as coupling the variational parameters across documents guarantees that the variational bound will not be as tight. But the advantage of an inference network is that after its weights have been learned on training documents, we can obtain an approximate posterior distribution for a new test document simply by evaluating the inference network, without needing to carry out any variational optimization. This is the reason for the term amortized inference: the computational cost of training the inference network is amortized across future test documents.

4.1 Learning Issues in VAE

Trained with stochastic variational inference, like other VAEs, our PAM models suffer primarily from two learning problems: slow convergence and component collapse.

Slow Learning

Training PAM models even at the recommended learning rate of 0.001 for the ADAM optimizer (Kingma and Ba, 2014) generally causes the gradients to diverge early on in training. Therefore, in practice, fairly low learning rates have been used in VAE-based generative models of text, which significantly delays learning in such models. In this section we first explain one of the reasons for the diverging behavior of the gradients and then propose a solution that stabilizes them and therefore allows training VAEs at high learning rates, which speeds up learning.

Consider a VAE for a model p(x, z) where z is a latent Gaussian variable and x is a categorical variable distributed as p_Θd(x|z) = Multinomial(f_d(z, Θ_d)), where the function f_d() is a decoder MLP with parameters Θ_d whose outputs lie in the unit simplex. Suppose we define a variational distribution q_Θe(z|x) = N(µ, exp(u)), where µ = f_µ(x, Θ_µ) and u = f_u(x, Θ_u) are MLPs with parameters Θ_e = {Θ_µ, Θ_u}, and u is the logarithm of the diagonal of the covariance matrix. The VAE objective function is then

ELBO(Θ) = −KL[q_Θe(z|x) || p(z)] + E[log p_Θd(x|z)].   (3)

Notice that the first term, the KL divergence, interacts only with the encoder parameters. The gradient of this term, L = KL[q_Θe(z|x) || p(z)], with respect to u is

∇_u L = (1/2)(exp(u) − 1).   (4)

One explanation for the diverging behavior of the gradients lies in the exponential curvature of this gradient: L is sensitive to small changes in u, which makes it difficult to optimize with respect to Θ_e at high learning rates. The instability of the gradient with respect to u demands an adaptive learning rate for the encoder parameters Θ_u that can adapt to sudden large changes in ∇_u L.
We now propose that this adaptive learning rate can be achieved by applying the BatchNorm (BN) transformation (Ioffe and Szegedy, 2015) to f_u. The BN transformation for an incoming mini-batch {u_i}_{i=1}^m is given by

u_BN = γ (u − µ_batch) / √(σ²_batch + ε) + b.   (5)

Here, µ_batch = (1/m) ∑_{i=1}^m u_i, σ²_batch = (1/m) ∑_{i=1}^m (u_i − µ_batch)², γ is the gain parameter, and b is the shift parameter. We are specifically interested in the scaling factor γ / √(σ²_batch + ε), because the sample variance grows and shrinks with large changes in the norm of the mini-batch, allowing the scaling factor to approximately dictate the norm of the activations. Let L be defined as before; the posterior q is now a function of u_BN. The gradients with respect to u and the gain parameter γ are

∇_u L = γ / √(σ²_batch + ε) · P_u ∇_{u_BN} L,   (6)
∇_γ L = (u − µ_batch) / √(σ²_batch + ε) · ∇_{u_BN} L,   (7)

where P_u is a projection matrix. If ∇_{u_BN} L is large with respect to the outgoing u_BN, the scaling term brings it down. Therefore, the scaling term works like an adaptive learning rate that grows and shrinks in response to changes in the norm of the batch of u's caused by large gradient updates to the weights, thus resolving the issue of diverging gradients. As shown in Figure 3, after applying BN to u (red), the KL term is minimized fairly slowly compared to the case (blue) when no BN is applied to f_u. This was also observed by Srivastava and Sutton (2017). We experimentally found that at this point the topics start to improve when the learning rate is ≥ 0.001.

Figure 3: In optimization without any BatchNorm, the average KL gets minimized fairly early in training. With BatchNorm applied to the encoder unit that produces log σ², the KL minimization is slow, and slower still if BatchNorm is also applied to each of the topics in the decoder.

In order to establish that the improvement in training comes from the adaptive learning rate property of the gain parameter, we replace the divisor in the BN transformation with the ℓ2 norm of the activation. We neither center the activations nor apply any shift to them. This normalization performs equivalently to, and occasionally better than, BN, confirming our hypothesis. It also removes any dependency on batch-level statistics, which might be a requirement in models that make i.i.d. assumptions.

Component Collapsing

Another well-known issue in VAEs such as dnPAM is the problem of component collapsing (Dinh and Dumoulin, 2016; van den Oord et al., 2017). In the context of topic models, component collapsing is a bad local minimum of VAEs in which the model only learns a small number of topics out of K (Srivastava and Sutton, 2017). For example, suppose we train a 3-PAM model on the Omniglot dataset (Lake et al., 2015) using the stochastic variational inference of Kingma and Welling (2013). Figure 2 shows nine randomly sampled topics from this model, reshaped to the Omniglot image dimensions. All the topics look exactly the same, with a few exceptions. This is clearly not a useful set of topics.

Figure 2: Nine randomly sampled "topics" from the Omniglot dataset folded back to the original image dimensions; an example of how the topics look if component collapsing occurs.

When trained without applying BN to u, the KL terms across most of the latent dimensions (components of z) vanish to zero. We call these collapsed dimensions, since the posterior along them has collapsed to the prior. As a result, the decoder only receives sampling noise along such collapsed dimensions, and in order to minimize the noise in the output, it makes the weights corresponding to these collapsed components very small. In practice this means that these weights do not participate in learning and therefore do not represent any meaningful topic.

Following Srivastava and Sutton (2017), we also found that the topic coherence increases drastically when BN is also applied to the topic matrix prior to the softmax nonlinearity, along with f_u. Besides preventing the softmax units from saturating, this slows down the KL minimization further, as shown by the green curve in Figure 3.
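The sketch below illustrates both observations numerically: the gradient of the KL term with respect to u = log σ² (Equation 4) blows up exponentially as u grows, and rescaling the encoder's u activations by their ℓ2 norm with a learned gain, as proposed above, keeps that gradient at a moderate scale. The gain value and the example activations are arbitrary.

```python
import numpy as np

def kl_grad_wrt_u(u):
    """Gradient of KL[N(mu, diag(exp(u))) || N(0, I)] with respect to u (Equation 4)."""
    return 0.5 * (np.exp(u) - 1.0)

def l2_normalize_with_gain(u, gain, eps=1e-5):
    """Batch-statistics-free variant described above: rescale u by its norm times a learned gain."""
    return gain * u / (np.linalg.norm(u) + eps)

u = np.array([0.1, 2.0, 6.0])                                # raw log-variance activations (arbitrary example)
print(kl_grad_wrt_u(u))                                      # roughly [0.05, 3.2, 201]: the large activation dominates
print(kl_grad_wrt_u(l2_normalize_with_gain(u, gain=1.0)))    # after rescaling, all components stay below 1
```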
5 Experiments and Results

We evaluate how dnPAM inference performs for different architectures of PAM models when compared to state-of-the-art collapsed Gibbs inference. To this end we evaluate three different PAM architectures, 4-PAM, 5-PAM and MoLDA, on two different datasets, 20 Newsgroups and NIPS abstracts (Lichman, 2013). We use these two data sets because they represent two extreme settings: 20 Newsgroups is a large dataset (12,000 documents) but with a more restricted vocabulary (2,000 words), whereas the NIPS dataset is smaller in size (1,500 abstracts) but has a considerably larger vocabulary (12,419 words). We compare inference methods both on the time required for training and on topic quality. As a measure of topic quality, we use the topic coherence metric (normalized pointwise mutual information), which, as shown in Lau et al. (2014), corresponds very well with human judgment of the quality of topics. We do not report perplexity of the models because it has been repeatedly shown to not be a good measure of topic coherence, and even to be negatively correlated with topic quality in some cases (Lau et al., 2014; Chang et al., 2009; Srivastava and Sutton, 2017).

We start by comparing the topic coherence across the different topic models on the 20 Newsgroups dataset. We train an LDA model using both collapsed Gibbs sampling[3] (Griffiths and Steyvers, 2004) and Decoupled Mean-Field Variational Inference (DMFVI)[4] (Blei et al., 2003). Using Mallet, we train a 4-PAM model using 10,000 iterations of collapsed Gibbs sampling, and using dnPAM we train a 4-PAM, a 5-PAM, a MoLDA and a correlated topic model (CTM). In this experiment we use 50 sub-topics for all models[5]. For 4-PAM and 5-PAM, we use two super-topics following Li and McCallum (2006), and two additional super-duper-topics for 5-PAM. As shown in Table 1, all PAM models perform better than LDA-type models, showing that more complex PAM architectures do improve the quality of the topics. Additionally, the 4-PAM and 5-PAM models trained with dnPAM beat all the LDA models on topic quality. MoLDA and CTM trained using dnPAM also perform competitively with the LDA models, but the CTM model falls significantly behind the PAM models on topic coherence.

Next, to study the effect of increasing the number of super- and sub-topics in PAM models on topic quality, we increase the number of super-duper-topics to 10, super-topics to 50 and sub-topics to 100, and re-run the previous experiment but only on the 4-PAM or deeper models. Table 2 shows the topic coherence for each of these models and also the training time. Not only does our inference method produce better topics, it is also an order of magnitude faster than the state-of-the-art Gibbs sampling based inference for 4-PAM. Note that we run the sampler for a total of 3,000 iterations with the burn-in parameter set to 2,000 iterations.

For the smaller NIPS dataset, we repeat the same experiments only for the PAM models, again under the exact same settings as described above. Reported in Table 3 are the topic coherence scores for the smaller PAM models with 50 sub-topics. Again, we allowed 10,000 Gibbs iterations, which took more than a day to finish but did not beat the dnPAM-trained models on topic quality. For the bigger PAM models we replicated the experiments from the original paper. As reported in Table 4, we found that while the collapsed Gibbs based 4-PAM model produced the best topics, it did so in 15 hours. On the other hand, dnPAM-trained models produced topics of comparable quality and took only a fraction of the inference time of Gibbs, by being able to leverage the GPU architecture for computing dot products very efficiently. We are not aware of GPU-based implementations of other inference methods for PAMs.
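For reference, here is a minimal sketch of the NPMI-based topic coherence used as the evaluation metric. The paper follows Lau et al. (2014); the choices below (pairs of a topic's top words, document-level co-occurrence counts, a small smoothing constant) are assumptions about that setup rather than a description of the exact evaluation code.

```python
import numpy as np

def topic_npmi(top_words, doc_word, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.
    doc_word: (D, V) binary matrix of word occurrence in documents."""
    p = doc_word.mean(axis=0)                          # marginal word probabilities
    score, pairs = 0.0, 0
    for i, wi in enumerate(top_words):
        for wj in top_words[i + 1:]:
            p_ij = (doc_word[:, wi] * doc_word[:, wj]).mean() + eps
            pmi = np.log(p_ij / (p[wi] * p[wj] + eps))
            score += pmi / -np.log(p_ij)               # normalize PMI into [-1, 1]
            pairs += 1
    return score / max(pairs, 1)

# A model's coherence is the mean of topic_npmi over all topics,
# with top_words taken as the highest-probability words of each topic.
```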

[3] We used the Mallet implementation (McCallum, 2002).
[4] We used the scikit-learn implementation (Pedregosa et al., 2011).
[5] For MoLDA we use 10 and 50 topics.

Table 1: Topic coherence on 20 Newsgroups for 50 topics. PAM models use two super-topics at each level. For MoLDA, we report the coherence separately for each component in the admixture.

|                 | LDA Gibbs | LDA DMFVI | 4-PAM Gibbs | MoLDA (10 / 50) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM | CTM dnPAM |
| Topic Coherence | 0.17      | 0.11      | 0.20        | 0.24 / 0.24           | 0.14        | 0.29        | 0.21      |

Table 2: Topic coherence for models trained on the 20 Newsgroups dataset for 100 topics with 50 super-topics.

|                     | 4-PAM Gibbs | MoLDA (50 / 100) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence     | 0.19        | 0.22 / 0.21            | 0.24        | 0.21        |
| Training Time (min) | 594         | 11                     | 16          | 16          |

Table 3: Topic coherence for models trained on the NIPS dataset for 50 topics with 2 super-topics.

|                 | 4-PAM Gibbs | MoLDA (10 / 50) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence | 0.033       | 0.042 / 0.039         | 0.036       | 0.024       |

Table 4: Topic coherence for models trained on the NIPS dataset for 100 topics with 50 super-topics.

|                     | 4-PAM Gibbs | MoLDA (50 / 100) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence     | 0.047       | 0.041 / 0.045          | 0.025       | 0.024       |
| Training Time (min) | 892         | 19                     | 26          | 11          |

5.1 Hyper-Parameter Tuning

For the experiments in this section we did not conduct extensive hyper-parameter tuning. We did a grid search to set the encoder capacity while moving between the two datasets. As a general guideline for PAM models, the encoder capacity should grow with the vocabulary size. For the learning rate, we used the default setting of 1e-3 for the Adam optimizer for all the models. We used a batch size of 200 for the 20 Newsgroups dataset, as used in Srivastava and Sutton (2017), and 50 for the NIPS dataset. We found, though, that the topic coherence, especially for the smaller NIPS dataset, is sensitive to the batch-size setting and the initialization. For certain settings we were able to achieve higher topic coherence than the average topic coherence reported in this section.

6 Related Work

Topic models have been explored extensively via directed models (Blei et al., 2003; Li and McCallum, 2006; Blei and Lafferty, 2006; Blei et al., 2004) as well as undirected models or restricted Boltzmann machines (Larochelle and Lauly, 2012; Hinton and Salakhutdinov, 2009). Hierarchical extensions to these models have received special attention since they allow capturing the correlations between topics and provide meaningful interpretation of the latent structures in the data.

Recent advances in blackbox-type inference methods (Kucukelbir et al., 2016; Ranganath et al., 2014; Mnih and Gregor, 2014) have made it easier to try newer models without the need to derive model-specific inference algorithms.

7 Conclusion

In this work we introduced dnPAM, which generalizes the idea of amortized variational inference that VAEs provide to a large class of deep topic models.

References

David Blei and John Lafferty. 2006. Correlated topic models. Advances in Neural Information Processing Systems 18:147.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Wei Li, David Blei, and Andrew McCallum. 2012. Nonparametric Bayes pachinko allocation. arXiv preprint arXiv:1206.5270.

Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, pages 577–584.

D.M. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference.

Jonathan Chang, Jordan L Boyd-Graber, Sean Gerrish, Chong Wang, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS. volume 31, pages 1–9.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets.

Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1):5228–5235.

Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems. pages 1607–1614.

M. Lichman. 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning. ACM, pages 633–640.

Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. pages 448–456.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. 2016. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788.

Rajesh Ranganath, Sean Gerrish, and David M Blei. 2014. Black box variational inference. In AISTATS. pages 814–822.

Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning. pages 1530–1538.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning. pages 1278–1286.

Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. 2016. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems. pages 460–468.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.

Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems. pages 2708–2716.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL. pages 530–539.

Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. International Conference on Learning Representations (ICLR).

Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems. pages 6297–6306.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.