Variational Inference In Pachinko Allocation Machines

Akash Srivastava
University of Edinburgh
[email protected]

Charles Sutton
University of Edinburgh, The Alan Turing Institute, Google Brain
[email protected]

Abstract

The Pachinko Allocation Machine (PAM) is a deep topic model that allows representing rich correlation structures among topics by a directed acyclic graph over topics. Because of the flexibility of the model, however, approximate inference is very difficult. Perhaps for this reason, only a small number of potential PAM architectures have been explored in the literature. In this paper we present an efficient and flexible amortized variational inference method for PAM, using a deep inference network to parameterize the approximate posterior distribution in a manner similar to the variational autoencoder. Our inference method produces more coherent topics than state-of-the-art inference methods for PAM while being an order of magnitude faster, which allows exploration of a wider range of PAM architectures than have previously been studied.

1 Introduction

Topic models are widely used tools for exploring and visualizing document collections. Simpler topic models, like latent Dirichlet allocation (LDA) (Blei et al., 2003), capture correlations among words but do not capture correlations among topics. This limits the model's ability to discover finer-grained hierarchical latent structure in the data. For example, we expect that very specific topics, such as those pertaining to individual sports teams, are likely to co-occur more often than more general topics, such as a generic "politics" topic with a generic "sports" topic.

A popular extension to LDA that captures topic correlations is the Pachinko Allocation Machine (PAM) (Li and McCallum, 2006). PAM is essentially "deep LDA". It is defined by a directed acyclic graph (DAG) in which each leaf node denotes a word in the vocabulary, and each internal node is associated with a distribution over its children. The document is generated by sampling, for each word, a path from the root of the DAG to a leaf. Thus the internal nodes can represent distributions over topics, so-called "super-topics", and so on, thereby representing correlations among topics.

Unfortunately, PAM introduces many latent variables: for each word in the document, the path in the DAG that generated the word is latent. Therefore, traditional inference methods, such as Gibbs sampling and decoupled mean-field variational inference, become significantly more expensive. This not only affects the scale of data sets that can be considered, but, more fundamentally, the computational cost of inference makes it difficult to explore the space of possible architectures for PAM. As a result, to date only relatively simple architectures have been studied in the literature (Li and McCallum, 2006; Mimno et al., 2007; Li et al., 2012).

We present what is, to the best of our knowledge, the first variational inference method for PAM, which we call dnPAM. Unlike collapsed Gibbs, dnPAM can be generically applied to any PAM architecture without the need to derive a new inference algorithm, allowing much more rapid exploration of the space of possible model architectures. dnPAM is an amortized inference method following the learning principle of variational autoencoders, which means that the variational distribution is parameterized by a deep network that is trained to perform accurate inference. We find that dnPAM is not only an order of magnitude faster than collapsed Gibbs, but even returns topics with comparable or greater coherence. The dramatic speedup in inference time comes from amortizing the learning cost by training a neural network to produce the posterior parameters instead of learning these parameters directly. This efficiency in inference enables exploration of more complex and deeper PAM models than have previously been possible.

As a demonstration of this, as our second contribution we introduce a mixture of PAMs model, where each component distribution of the mixture is represented by a PAM. By mixing PAMs with varying numbers of topics, this model captures the latent structure in the data at many different levels of granularity, decoupling general broad topics from the more specific ones.

Like other variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al., 2014), our model also suffers from posterior collapse (van den Oord et al., 2017), sometimes referred to as component collapsing (Dinh and Dumoulin, 2016), and from slow training due to low learning rates. We present an analysis of these issues in the context of topic modeling and propose a normalization-based solution to alleviate them.

2 Latent Dirichlet Allocation

LDA represents each document w in a collection as a mixture of topics. Each topic vector β_k is a distribution over the vocabulary, that is, a vector of probabilities, and β = (β_1 ... β_K) is the matrix of the K topics. Every document is then modeled as an admixture of the topics. The generative process is to first sample a proportion vector θ ∼ Dirichlet(α), then for each word at position n, sample a topic indicator z_n ∈ {1, ..., K} as z_n ∼ Categorical(θ), and finally sample the word index w_n ∼ Categorical(β_{z_n}).
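To make this generative process concrete, here is a minimal NumPy sketch that samples one synthetic document from LDA. The sizes K, V and N and the hyperparameter values are illustrative choices, and the symmetric Dirichlet prior on the topics (eta) is an assumption, since the section above leaves the prior on β unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 10, 2000, 100      # topics, vocabulary size, words per document (illustrative)
alpha, eta = 0.1, 0.01       # Dirichlet hyperparameters (eta for the topics is an assumption)

beta = rng.dirichlet(eta * np.ones(V), size=K)          # K topic-word distributions, shape (K, V)
theta = rng.dirichlet(alpha * np.ones(K))               # per-document topic proportions theta
z = rng.choice(K, size=N, p=theta)                      # topic indicator z_n for every word position
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])   # word indices w_n
```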

2.1 Deep LDA: Pachinko Allocation Machine

PAM is a class of topic models that extends LDA by modeling correlations among topics. A particular instance of a PAM represents the correlation structure among topics by a DAG in which the leaf nodes represent words in the vocabulary and the internal nodes represent topics. Each node s in the DAG is associated with a distribution θ_s over its children, which has a Dirichlet prior. There is no need to differentiate between nodes in the graph and the distributions θ_s, so we will simply take {θ_s} to be the node set of the graph. To generate a document in PAM, for each word we sample a path from the root to a leaf, and output the word associated with that leaf.

More formally, we present the special case of 4-PAM, in which the DAG is a 4-partite graph; it will be clear how to generalize this discussion to arbitrary DAGs. In 4-PAM, the DAG consists of a root node θ_r which is connected to children θ_1 ... θ_S called super-topics. Each super-topic θ_s is connected to the same set of children β_1 ... β_K called sub-topics, each of which is fully connected to the vocabulary items 1 ... V in the leaves.

A document is generated in 4-PAM as follows. First, a single matrix of sub-topics β is generated for the entire corpus as β_k ∼ Dirichlet(α_0). Then, to sample a document w, we sample child distributions for each remaining internal node in the DAG. For the root node, θ_r is drawn from a Dirichlet prior θ_r ∼ Dirichlet(α_r), and similarly for each super-topic s ∈ {1 ... S}, the super-topic θ_s is drawn as θ_s ∼ Dirichlet(α_s). Finally, for each word w_n, a path is sampled from the root to a leaf. From the root, we sample the index of a super-topic z_{n1} ∈ {1 ... S} as z_{n1} ∼ Categorical(θ_r), followed by a sub-topic index z_{n2} ∈ {1 ... K} sampled as z_{n2} ∼ Categorical(θ_{z_{n1}}), and finally the word is sampled as w_n ∼ Categorical(β_{z_{n2}}). This process can be written as a density

P(w, z, θ | α, β) = p(θ_r | α_r) ∏_{s=1}^{S} p(θ_s | α_s) ∏_n p(z_{n1} | θ_r) p(z_{n2} | θ_{z_{n1}}) p(w_n | β_{z_{n2}}).   (1)

It should be easy to see how this process can be extended to arbitrary ℓ-partite graphs, yielding the ℓ-PAM model, in which case LDA exactly corresponds to 3-PAM, and also to arbitrary DAGs.
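The 4-PAM generative story above can be simulated directly. The following NumPy sketch samples a single synthetic document; it is an illustration of the generative process, not the paper's implementation, and the sizes S, K, V, N and the hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

S, K, V, N = 2, 50, 2000, 100            # super-topics, sub-topics, vocabulary, words (illustrative)
alpha_0, alpha_r, alpha_s = 0.1, 1.0, 1.0

# Corpus-level sub-topics: K distributions over the vocabulary.
beta = rng.dirichlet(alpha_0 * np.ones(V), size=K)         # shape (K, V)

def sample_document():
    # Document-level distributions for the remaining internal nodes of the DAG.
    theta_r = rng.dirichlet(alpha_r * np.ones(S))           # root: distribution over super-topics
    theta_s = rng.dirichlet(alpha_s * np.ones(K), size=S)   # each super-topic: distribution over sub-topics
    words = np.empty(N, dtype=int)
    for n in range(N):
        z1 = rng.choice(S, p=theta_r)         # super-topic index z_{n1}
        z2 = rng.choice(K, p=theta_s[z1])     # sub-topic index z_{n2}
        words[n] = rng.choice(V, p=beta[z2])  # word w_n
    return words

doc = sample_document()
```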

3 Mixture of PAMs

The main advantage of the inference framework we propose is that it allows easily exploring the design space of possible structures for PAM. As a demonstration of this, we present a word-level mixture of PAMs that allows learning finer-grained topics than a single PAM, as some mixture components learn topics that capture the more general, global topics so that other mixture components can focus on finer-grained topics.

We describe a word-level mixture of M PAMs P_1 ... P_M, each of which can have a different number of topics or even a completely different DAG structure. To generate a document under this model, first we sample an M-dimensional document-level mixing proportion θ_r ∼ Dirichlet(α_r). Then, for each word w_n in the document, we choose one of the PAM models by sampling m ∼ Categorical(θ_r) and finally sample a word by sampling a path through P_m as described in the previous section. This model can be expressed as a general PAM model in which the root node θ_r is connected to the root nodes of each of the M mixture components. If each of the mixture components is a 3-PAM model, that is, LDA, then we call the resulting model a mixture of LDA models (MoLDA).[1]

[1] It would perhaps be more proper to call this model an admixture of LDA models.

The advantage of this model is that if we choose to incorporate different mixture components with different numbers of topics, we find that the components with fewer topics explain the coarse-grained structure in the data, freeing up the other components to learn finer-grained topics. For example, the Omniglot dataset contains 28x28 images of handwritten alphabets from artificial scripts. In Figure 1, panels (C) and (D) are visualizations of the latent topics generated using vanilla LDA with 10 and 50 topics, respectively. Because we are modelling image data, each topic can also be visualized as an image. Panels (A) and (B) show the topics from a single MoLDA with two components, one with 10 topics and one with 50 topics. It is apparent that the MoLDA topics are sharper, indicating that each individual topic captures more information about the data. The mixture allows the two LDAs being mixed to focus exclusively on higher-level (for 10 topics) and lower-level (for 50 topics) features while modeling the images. The topics in the vanilla LDA, on the other hand, need to account for all the variability in the dataset using just 10 (or 50) topics and are therefore fuzzier.

Figure 1: Top: (A) and (B) show randomly sampled topics from MoLDA (10:50). Bottom: (C) and (D) show randomly sampled topics from LDA with 10 topics and with 50 topics on Omniglot. Notice that by using a mixture, the MoLDA can decouple the higher-level structure (A) from the lower-level details (B).

4 Inference

Probabilistic inference in topic models is the task of computing posterior distributions p(z|w, α, β) over the topic assignments for words, or the posterior p(θ|w, α, β) over topic proportions for documents. For all practical topic models, this task is intractable. Commonly used methods include Gibbs sampling (Li and McCallum, 2006; Blei et al., 2004), which can be slow to converge, and variational inference methods such as mean field (Blei et al., 2003; Blei and Lafferty, 2006), which sometimes sacrifice topic quality for computational efficiency. More fundamentally, these families of approximate inference algorithms tend to be model specific and require extensive mathematical sophistication on the practitioner's part, since even slight changes in model assumptions may require substantial adjustments to the inference. The time required to derive new approximate inference algorithms dramatically slows exploration of the space of possible models.

In this work we present a generic, amortized approximate inference method, dnPAM, for learning in the PAM family of models that is extremely fast, flexible and accurate. The inference method is flexible in the sense that it can be generically applied to any DAG structure for PAM, without the need to derive new variational updates. The main idea is that we approximate the posterior distribution p(θ_s|w, α, β) for each super-topic θ_s by a variational distribution q(θ_s|w). Unlike standard mean-field approaches, in which q(θ_s|w) has an independent set of variational parameters for each document in the corpus, the parameters of q(θ_s|w) are computed by an inference network, a neural network that takes the document w as input and outputs the parameters of the variational distribution. This is motivated by the observation that similar documents can be described well by similar posterior parameters.
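As a minimal illustration of what such an inference network computes, the sketch below maps a bag-of-words vector to the mean and log-variance of a Gaussian in the softmax basis for one node's child distribution, which is the parameterization detailed in the next section. The single hidden layer, its size, and all parameter names here are hypothetical and not taken from the paper.

```python
import numpy as np

def inference_network(w_bow, W1, b1, W_mu, b_mu, W_u, b_u):
    """Map a bag-of-words vector to variational parameters (mu, u = log sigma^2)
    of a Gaussian in the softmax basis for one node's child distribution."""
    h = np.maximum(0.0, W1 @ w_bow + b1)      # single hidden layer with ReLU (hypothetical choice)
    return W_mu @ h + b_mu, W_u @ h + b_u

# Illustrative shapes: vocabulary of 2000 words, 100 hidden units, K = 50 sub-topics.
V, H, K = 2000, 100, 50
rng = np.random.default_rng(0)
shapes = [(H, V), (H,), (K, H), (K,), (K, H), (K,)]
params = [rng.normal(0.0, 0.01, s) for s in shapes]
w_bow = rng.integers(0, 3, V).astype(float)   # toy word-count vector for one document
mu, u = inference_network(w_bow, *params)
```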
In dnPAM, we seek to approximate the posterior distribution p(θ|w, α, β); that is, the paths z_n for each word are integrated out. Note that this is in contrast to previous collapsed Gibbs methods for PAM (Li and McCallum, 2006), which integrate out θ using conjugacy. To simplify notation, we describe dnPAM for the special case of 4-PAM, but it will be clear how to generalize this discussion to arbitrary DAGs. So for 4-PAM, we have θ = (θ_r, θ_1 ... θ_S). We introduce a variational distribution q(θ|w) = q(θ_r|w) q(θ_1|w) ... q(θ_S|w).

To choose the best approximation q(θ|w), we construct a lower bound on the evidence (ELBO) using Jensen's inequality, as is standard in variational inference. For example, the log-likelihood function log p(w|α, β) for the 4-PAM model (1) can be lower bounded by

L = −KL[q(θ_r|w) || p(θ_r|α_r)] − ∑_{s=1}^{S} KL[q(θ_s|w) || p(θ_s|α_s)] + E[ ∑_n log p(w_n | θ, β) ],   (2)

where the expectation is with respect to the variational posterior q(θ|w). dnPAM uses stochastic gradient descent to maximize this ELBO, inferring the variational parameters and learning the model parameters. To finish describing the method, we must describe how q is parameterized, which we do next. For the sub-topic parameters β, we learn these using variational EM; that is, we maximize L with respect to β. It would be a simple extension to add a variational distribution over β if this were desired.

Re-parameterizing the Dirichlet Distribution: The expectation term in equation (2) is in general intractable, and we therefore approximate it using a Monte Carlo (MC) method (Kingma and Welling, 2013; Rezende and Mohamed, 2015) that employs the re-parametrization trick (Williams, 1992) for sampling from the variational posterior. But this MC estimate requires q(θ|w) to belong to the location-scale family, which excludes the Dirichlet distribution. Recently, some progress has been made in the re-parametrization of distributions like the Dirichlet (Ruiz et al., 2016), but in this work, following Srivastava and Sutton (2017), we approximate the posterior with a logistic normal distribution. First, we construct a Laplace approximation of the Dirichlet prior in the softmax basis, which allows us to approximate the posterior distribution using a Gaussian that is in the location-scale family. Then, in order to sample θ's from the posterior in the simplex basis, we apply the softmax transform to the Gaussian samples. Using this Laplace approximation trick also allows handling different prior assumptions, including other non-location-scale family distributions.

Amortizing Super-Topics: As mentioned above, in PAM the super-topics need to be sampled for each document in the corpus. This presents a bottleneck when speeding up posterior inference via Gibbs sampling or DMFVI, as the number of parameters to be learned increases drastically with data compared to a typical LDA. We design our inference method to tackle this bottleneck such that the number of posterior parameters to be learned does not directly depend on the number of documents in the corpus.

dnPAM Inference Network: Recently, Srivastava and Sutton (2017) amortized the cost of learning posterior parameters in LDA by using a feedforward multi-layer perceptron (MLP) to generate the parameters of the posterior distribution over the topic proportion vector θ. Like them, we model the posterior q(θ_r|w) as LN(θ; f_µ(w), f_u(w)), where f_µ and f_u are neural networks that generate the parameters of the logistic normal distribution. But since their inference network does not allow sampling topics, they assumed the topics to be fixed model parameters. We now introduce an alternative inference network architecture that is designed to efficiently sample all the posterior parameters that need to be inferred in PAMs, including the topics.

For generality, we assume that the sub-topics at the lowest level are sampled only once for the corpus. To generate the parameters of the variational posterior distributions at each of the levels above, we use one MLP per level. For example, in the case of 4-PAM, the parameters of the variational posteriors (q(θ_1|w), ..., q(θ_S|w)) over the super-topics are generated from a single MLP. These MLPs are trained using the VAE-based variational learning principle for topic models (Srivastava and Sutton, 2017) and then sampled from, using the process for the Dirichlet distribution described above, to generate the θ's and the super-topics.
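The following sketch illustrates the "Re-parameterizing the Dirichlet Distribution" step above: a Laplace approximation of a symmetric Dirichlet prior in the softmax basis, followed by a reparameterized Gaussian sample that is pushed back onto the simplex with a softmax. The closed-form moment expressions are recalled from Srivastava and Sutton (2017) and should be treated as an assumption rather than something stated in this paper.

```python
import numpy as np

def dirichlet_to_softmax_gaussian(alpha):
    """Laplace approximation of Dirichlet(alpha) in the softmax basis: returns (mu, diagonal variance)."""
    K = alpha.shape[0]
    mu = np.log(alpha) - np.log(alpha).mean()
    var = (1.0 / alpha) * (1.0 - 2.0 / K) + np.sum(1.0 / alpha) / K**2
    return mu, var

def sample_theta(mu, log_var, rng):
    """Reparameterized sample: Gaussian in the softmax basis, then softmax back onto the simplex."""
    h = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    e = np.exp(h - h.max())
    return e / e.sum()

rng = np.random.default_rng(0)
prior_mu, prior_var = dirichlet_to_softmax_gaussian(np.full(50, 0.1))   # symmetric Dirichlet(0.1) prior
theta = sample_theta(prior_mu, np.log(prior_var), rng)                  # a point on the 50-simplex
```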

Note that the Dirichlet distribution is a conjugate prior to the multinomial distribution. This fact can be used to leverage modern GPU-based computation to generate the posterior parameters for the nodes at the next lower level, since it only involves a dot product. We therefore stack all the topic vectors (each sampled from its respective variational posterior) into a 3-D tensor, and using a custom implementation of this dot product[2] we gain a significant reduction in training time. We want to point out that the result of the above process can also be seen as constructing MLPs on the fly, by sampling Dirichlet vectors from our inference networks and stacking them to form the weight matrices of the MLPs.

[2] TensorFlow requires that the rank of the matrices in tf.matmul be the same.
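A minimal NumPy sketch of the stacking idea described above: topic vectors sampled from their variational posteriors for a mini-batch are stacked into a 3-D tensor, and a single batched dot product produces the mixing distributions for the next lower level. The shapes and the exact composition of the two products are illustrative assumptions, standing in for the custom TensorFlow implementation mentioned in the footnote.

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, K, V = 200, 2, 50, 2000    # documents in the batch, super-topics, sub-topics, vocabulary

theta_r = rng.dirichlet(np.ones(S), size=B)        # (B, S)    root proportions per document
theta_s = rng.dirichlet(np.ones(K), size=(B, S))   # (B, S, K) sampled super-topic vectors, stacked
beta = rng.dirichlet(np.ones(V), size=K)           # (K, V)    sub-topics shared across the corpus

# Mix over super-topics, then over sub-topics, as two batched dot products.
doc_subtopic_mix = np.einsum('bs,bsk->bk', theta_r, theta_s)   # (B, K)
word_probs = doc_subtopic_mix @ beta                           # (B, V) per-document word distributions
```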

The decoder in the case of PAM is just a dot product between the sample θ from the output distribution of the inference network and the sub-topic matrix β. This makes the entire class of PAM-type mixed-membership models permeable to deep learning while remaining Bayesian about the latent beliefs. Though in our experiments we always use MLPs to encode the posterior and decode the output, other architectures like CNNs and RNNs can easily be used to replace the MLPs if required. As mentioned before, dnPAM can work with non-Dirichlet priors by using the Laplace approximation trick. It can also handle full-covariance Gaussians as well as logistic normals by simply using the Cholesky decomposition, and can therefore be used to learn the Correlated Topic Model (CTM) (Blei and Lafferty, 2006).

At first, the use of an inference network seems strange, as coupling the variational parameters across documents guarantees that the variational bound will not be as tight. But the advantage of an inference network is that after its weights have been learned on training documents, we can obtain an approximate posterior distribution for a new test document simply by evaluating the inference network, without needing to carry out any variational optimization. This is the reason for the term amortized inference: the computational cost of training the inference network is amortized across future test documents.

4.1 Learning Issues in VAE

Trained with stochastic variational inference, like other VAEs, our PAM models suffer primarily from two learning problems: slow convergence and component collapse.

Slow Learning

Training PAM models even at the recommended learning rate of 0.001 for the ADAM optimizer (Kingma and Ba, 2014) generally causes the gradients to diverge early on in training. Therefore, in practice, fairly low learning rates have been used in VAE-based generative models of text, which significantly delays learning in such models. In this section we first explain one of the reasons for the diverging behavior of the gradients and then propose a solution that stabilizes them and therefore allows training VAEs at high learning rates, which speeds up learning.

Consider a VAE for a model p(x, z) where z is a latent Gaussian variable and x is a categorical variable distributed as p_Θd(x|z) = Multinomial(f_d(z, Θ_d)), where the function f_d() is a decoder MLP with parameters Θ_d whose outputs lie in the unit simplex. Suppose we define a variational distribution q_Θe(z|x) = N(µ, exp(u)), where µ = f_µ(x, Θ_µ) and u = f_u(x, Θ_u) are MLPs with parameters Θ_e = {Θ_µ, Θ_u}, and u is the logarithm of the diagonal of the covariance matrix. The VAE objective function is then

ELBO(Θ) = −KL[q_Θe(z|x) || p(z)] + E[log p_Θd(x|z)].   (3)

Notice that the first term, the KL divergence, interacts only with the encoder parameters. The gradient of this term, L = KL[q_Θe(z|x) || p(z)], with respect to u is

∇_u L = (1/2)(exp(u) − 1).   (4)

One explanation for the diverging behavior of the gradients lies in the exponential curvature of this gradient: L is sensitive to small changes in u, which makes it difficult to optimize with respect to Θ_e at high learning rates. The instability of the gradient with respect to u demands an adaptive learning rate for the encoder parameters Θ_u that can adapt to sudden large changes in ∇_u L.
We now propose that this adaptive learning rate can be achieved by applying the BatchNorm (BN) transformation (Ioffe and Szegedy, 2015) to f_u. The BN transformation for an incoming mini-batch {u_i}_{i=1}^m is given by

u_BN = γ (u − µ_batch) / √(σ²_batch + ε) + b.   (5)

Here, µ_batch = (1/m) ∑_{i=1}^m u_i, σ²_batch = (1/m) ∑_{i=1}^m (u_i − µ_batch)², γ is the gain parameter, and b is the shift parameter. We are specifically interested in the scaling factor γ / √(σ²_batch + ε), because the sample variance grows and shrinks with large changes in the norm of the mini-batch, allowing the scaling factor to approximately dictate the norm of the activations. Let L be defined as before; the posterior q is now a function of u_BN. The gradients with respect to u and the gain parameter γ are

∇_u L = γ / √(σ²_batch + ε) · P_u ∇_{u_BN} L,   (6)
∇_γ L = (u − µ_batch) / √(σ²_batch + ε) · ∇_{u_BN} L,   (7)

where P_u is a projection matrix. If ∇_{u_BN} L is large with respect to the outgoing u_BN, the scaling term brings it down. Therefore, the scaling term works like an adaptive learning rate that grows and shrinks in response to changes in the norm of the batch of u's caused by large gradient updates to the weights, thus resolving the issue of diverging gradients. As shown in Figure 3, after applying BN to u (red), the KL term is minimized fairly slowly compared to the case (blue) when no BN is applied to f_u. This was also observed by Srivastava and Sutton (2017). We experimentally found that at this point the topics start to improve when the learning rate is ≥ 0.001.

Figure 3: In optimization without any BatchNorm, the average KL gets minimized fairly early in training. With BatchNorm applied to the encoder unit that produces log σ², the KL minimization is slow, and slower still if BatchNorm is also applied to each of the topics in the decoder.

In order to establish that the improvement in training comes from the adaptive learning rate property of the gain parameter, we replace the divisor in the BN transformation with the ℓ2 norm of the activation. We neither center the activations nor apply any shift to them. This normalization performs equivalently to, and occasionally better than, BN, confirming our hypothesis. It also removes any dependency on batch-level statistics, which might be a requirement in models that make i.i.d. assumptions.

Component Collapsing

Another well-known issue in VAEs such as dnPAM is the problem of component collapsing (Dinh and Dumoulin, 2016; van den Oord et al., 2017). In the context of topic models, component collapsing is a bad local minimum of VAEs in which the model only learns a small number of topics out of K (Srivastava and Sutton, 2017). For example, suppose we train a 3-PAM model on the Omniglot dataset (Lake et al., 2015) using the stochastic variational inference of Kingma and Welling (2013). Figure 2 shows nine randomly sampled topics from this model, reshaped to the Omniglot image dimensions. All the topics look exactly the same, with a few exceptions. This is clearly not a useful set of topics.

Figure 2: Nine randomly sampled "topics" from the Omniglot dataset folded back to the original image dimensions; an example of how the topics look if component collapsing occurs.

When trained without applying BN to u, the KL terms across most of the latent dimensions (components of z) vanish to zero. We call these collapsed dimensions, since the posterior along them has collapsed to the prior. As a result, the decoder only receives sampling noise along such collapsed dimensions, and in order to minimize the noise in the output, it makes the weights corresponding to these collapsed components very small. In practice this means that these weights do not participate in learning and therefore do not represent any meaningful topic.

Following Srivastava and Sutton (2017), we also found that the topic coherence increases drastically when BN is also applied to the topic matrix prior to the softmax nonlinearity, along with f_u. Besides preventing the softmax units from saturating, this slows down the KL minimization further, as shown by the green curve in Figure 3.
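The sketch below illustrates both observations numerically: the gradient of the KL term with respect to u = log σ² (Equation 4) blows up exponentially as u grows, and rescaling the encoder's u activations by their ℓ2 norm with a learned gain, as proposed above, keeps that gradient at a moderate scale. The gain value and the example activations are arbitrary.

```python
import numpy as np

def kl_grad_wrt_u(u):
    """Gradient of KL[N(mu, diag(exp(u))) || N(0, I)] with respect to u (Equation 4)."""
    return 0.5 * (np.exp(u) - 1.0)

def l2_normalize_with_gain(u, gain, eps=1e-5):
    """Batch-statistics-free variant described above: rescale u by its norm times a learned gain."""
    return gain * u / (np.linalg.norm(u) + eps)

u = np.array([0.1, 2.0, 6.0])                                # raw log-variance activations (arbitrary example)
print(kl_grad_wrt_u(u))                                      # roughly [0.05, 3.2, 201]: the large activation dominates
print(kl_grad_wrt_u(l2_normalize_with_gain(u, gain=1.0)))    # after rescaling, all components stay below 1
```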
5 Experiments and Results

We evaluate how dnPAM inference performs for different architectures of PAM models when compared to state-of-the-art collapsed Gibbs inference. To this end we evaluate three different PAM architectures, 4-PAM, 5-PAM and MoLDA, on two different datasets, 20 Newsgroups and NIPS abstracts (Lichman, 2013). We use these two data sets because they represent two extreme settings: 20 Newsgroups is a large dataset (12,000 documents) but with a more restricted vocabulary (2,000 words), whereas the NIPS dataset is smaller in size (1,500 abstracts) but has a considerably larger vocabulary (12,419 words). We compare inference methods both on the time required for training and on topic quality. As a measure of topic quality, we use the topic coherence metric (normalized pointwise mutual information), which, as shown in Lau et al. (2014), corresponds very well with human judgment of the quality of topics. We do not report perplexity of the models because it has been repeatedly shown to not be a good measure of topic coherence, and even to be negatively correlated with topic quality in some cases (Lau et al., 2014; Chang et al., 2009; Srivastava and Sutton, 2017).

We start by comparing the topic coherence across the different topic models on the 20 Newsgroups dataset. We train an LDA model using both collapsed Gibbs sampling[3] (Griffiths and Steyvers, 2004) and Decoupled Mean-Field Variational Inference (DMFVI)[4] (Blei et al., 2003). Using Mallet, we train a 4-PAM model using 10,000 iterations of collapsed Gibbs sampling, and using dnPAM we train a 4-PAM, a 5-PAM, a MoLDA and a correlated topic model (CTM). In this experiment we use 50 sub-topics for all models[5]. For 4-PAM and 5-PAM, we use two super-topics following Li and McCallum (2006), and two additional super-duper-topics for 5-PAM. As shown in Table 1, all PAM models perform better than LDA-type models, showing that more complex PAM architectures do improve the quality of the topics. Additionally, the 4-PAM and 5-PAM models trained with dnPAM beat all the LDA models on topic quality. MoLDA and CTM trained using dnPAM also perform competitively with the LDA models, but the CTM model falls significantly behind the PAM models on topic coherence.

Next, to study the effect of increasing the number of super- and sub-topics in PAM models on topic quality, we increase the number of super-duper-topics to 10, super-topics to 50 and sub-topics to 100, and re-run the previous experiment but only on the 4-PAM or deeper models. Table 2 shows the topic coherence for each of these models and also the training time. Not only does our inference method produce better topics, it is also an order of magnitude faster than the state-of-the-art Gibbs sampling based inference for 4-PAM. Note that we run the sampler for a total of 3,000 iterations with the burn-in parameter set to 2,000 iterations.

For the smaller NIPS dataset, we repeat the same experiments only for the PAM models, again under the exact same settings as described above. Reported in Table 3 are the topic coherence scores for the smaller PAM models with 50 sub-topics. Again, we allowed 10,000 Gibbs iterations, which took more than a day to finish but did not beat the dnPAM-trained models on topic quality. For the bigger PAM models we replicated the experiments from the original paper. As reported in Table 4, we found that while the collapsed Gibbs based 4-PAM model produced the best topics, it did so in 15 hours. On the other hand, dnPAM-trained models produced topics of comparable quality and took only a fraction of the inference time of Gibbs, by being able to leverage the GPU architecture for computing dot products very efficiently. We are not aware of GPU-based implementations of other inference methods for PAMs.
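For reference, here is a minimal sketch of the NPMI-based topic coherence used as the evaluation metric. The paper follows Lau et al. (2014); the choices below (pairs of a topic's top words, document-level co-occurrence counts, a small smoothing constant) are assumptions about that setup rather than a description of the exact evaluation code.

```python
import numpy as np

def topic_npmi(top_words, doc_word, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words.
    doc_word: (D, V) binary matrix of word occurrence in documents."""
    p = doc_word.mean(axis=0)                          # marginal word probabilities
    score, pairs = 0.0, 0
    for i, wi in enumerate(top_words):
        for wj in top_words[i + 1:]:
            p_ij = (doc_word[:, wi] * doc_word[:, wj]).mean() + eps
            pmi = np.log(p_ij / (p[wi] * p[wj] + eps))
            score += pmi / -np.log(p_ij)               # normalize PMI into [-1, 1]
            pairs += 1
    return score / max(pairs, 1)

# A model's coherence is the mean of topic_npmi over all topics,
# with top_words taken as the highest-probability words of each topic.
```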

[3] We used the Mallet implementation (McCallum, 2002).
[4] We used the scikit-learn implementation (Pedregosa et al., 2011).
[5] For MoLDA we use 10 and 50 topics.

Table 1: Topic coherence on 20 Newsgroups for 50 topics. PAM models use two super-topics at each level. For MoLDA, we report the coherence separately for each component in the admixture.

|                 | LDA Gibbs | LDA DMFVI | 4-PAM Gibbs | MoLDA (10 / 50) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM | CTM dnPAM |
| Topic Coherence | 0.17      | 0.11      | 0.20        | 0.24 / 0.24           | 0.14        | 0.29        | 0.21      |

Table 2: Topic coherence for models trained on the 20 Newsgroups dataset for 100 topics with 50 super-topics.

|                     | 4-PAM Gibbs | MoLDA (50 / 100) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence     | 0.19        | 0.22 / 0.21            | 0.24        | 0.21        |
| Training Time (min) | 594         | 11                     | 16          | 16          |

Table 3: Topic coherence for models trained on the NIPS dataset for 50 topics with 2 super-topics.

|                 | 4-PAM Gibbs | MoLDA (10 / 50) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence | 0.033       | 0.042 / 0.039         | 0.036       | 0.024       |

Table 4: Topic coherence for models trained on the NIPS dataset for 100 topics with 50 super-topics.

|                     | 4-PAM Gibbs | MoLDA (50 / 100) dnPAM | 4-PAM dnPAM | 5-PAM dnPAM |
| Topic Coherence     | 0.047       | 0.041 / 0.045          | 0.025       | 0.024       |
| Training Time (min) | 892         | 19                     | 26          | 11          |

5.1 Hyper-Parameter Tuning

For the experiments in this section we did not conduct extensive hyper-parameter tuning. We did a grid search to set the encoder capacity while moving between the two datasets. As a general guideline for PAM models, the encoder capacity should grow with the vocabulary size. For the learning rate, we used the default setting of 1e-3 for the Adam optimizer for all the models. We used a batch size of 200 for the 20 Newsgroups dataset, as used in Srivastava and Sutton (2017), and 50 for the NIPS dataset. We found, though, that the topic coherence, especially for the smaller NIPS dataset, is sensitive to the batch-size setting and the initialization. For certain settings we were able to achieve higher topic coherence than the average topic coherence reported in this section.

6 Related Work

Topic models have been explored extensively via directed models (Blei et al., 2003; Li and McCallum, 2006; Blei and Lafferty, 2006; Blei et al., 2004) as well as undirected models or restricted Boltzmann machines (Larochelle and Lauly, 2012; Hinton and Salakhutdinov, 2009). Hierarchical extensions to these models have received special attention since they allow capturing the correlations between topics and provide meaningful interpretation of the latent structures in the data.

Recent advances in blackbox-type inference methods (Kucukelbir et al., 2016; Ranganath et al., 2014; Mnih and Gregor, 2014) have made it easier to try newer models without the need to derive model-specific inference algorithms.

7 Conclusion

In this work we introduced dnPAM, which generalizes the idea of amortized variational inference that VAEs provide to a large class of deep topic models.

References

David Blei and John Lafferty. 2006. Correlated topic models. Advances in Neural Information Processing Systems 18:147.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Wei Li, David Blei, and Andrew McCallum. 2012. Nonparametric Bayes pachinko allocation. arXiv preprint arXiv:1206.5270.

Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, pages 577–584.

D.M. Blei, T.L. Griffiths, M.I. Jordan, and J.B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference.

Jonathan Chang, Jordan L Boyd-Graber, Sean Gerrish, Chong Wang, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS. volume 31, pages 1–9.

Laurent Dinh and Vincent Dumoulin. 2016. Training neural Bayesian nets.

Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1):5228–5235.

Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems. pages 1607–1614.

M. Lichman. 2013. UCI machine learning repository. http://archive.ics.uci.edu/ml.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning. ACM, pages 633–640.

Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning. pages 448–456.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. 2016. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788.

Rajesh Ranganath, Sean Gerrish, and David M Blei. 2014. Black box variational inference. In AISTATS. pages 814–822.

Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning. pages 1530–1538.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning. pages 1278–1286.

Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. 2016. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems. pages 460–468.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.

Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems. pages 2708–2716.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL. pages 530–539.

Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. International Conference on Learning Representations (ICLR).

Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems. pages 6297–6306.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.