Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Wei Li [email protected]
Andrew McCallum [email protected]
University of Massachusetts, Dept. of Computer Science

Abstract

Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. However, LDA does not capture correlations between topics. In this paper, we introduce the pachinko allocation model (PAM), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). PAM provides a flexible alternative to recent work by Blei and Lafferty (2006), which captures correlations only between pairs of topics. Using text data from newsgroups, historic NIPS proceedings and other research paper corpora, we show improved performance of PAM in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence.

1. Introduction

Statistical topic models have been successfully used to analyze large amounts of textual information in many tasks, including language modeling, document classification, information retrieval, document summarization and data mining. Given a collection of textual documents, parameter estimation in these models discovers a low-dimensional set of multinomial word distributions called "topics". Mixtures of these topics give high likelihood to the training data, and the highest-probability words in each topic provide keywords that briefly summarize the themes in the text collection. In addition to textual data (including news articles, research papers and email), topic models have also been applied to images, biological findings and other non-textual multi-dimensional discrete data.

Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a widely-used topic model, often applied to textual data, and the basis for many variants. LDA represents each document as a mixture of topics, where each topic is a multinomial distribution over words in a vocabulary. To generate a document, LDA first samples a per-document multinomial distribution over topics from a Dirichlet distribution. Then it repeatedly samples a topic from this multinomial and samples a word from the topic.

The topics discovered by LDA capture correlations among words, but LDA does not explicitly model correlations among topics. This limitation arises because the topic proportions in each document are sampled from a single Dirichlet distribution. As a result, LDA has difficulty modeling data in which some topics co-occur more frequently than others. However, topic correlations are common in real-world text data, and ignoring these correlations limits LDA's ability to predict new data with high likelihood. Ignoring topic correlations also hampers LDA's ability to discover a large number of fine-grained, tightly-coherent topics: because LDA can combine arbitrary sets of topics, it is reluctant to form highly specific topics, for which some combinations would be "nonsensical".

Motivated by the desire to build accurate models that can discover large numbers of fine-grained topics, we are interested in topic models that capture topic correlations.

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).


Figure 1. Model structures for four topic models. (a) Dirichlet Multinomial: For each document, a multinomial distribution over words is sampled from a single Dirichlet. (b) LDA: This model samples a multinomial over topics for each document, and then generates words from the topics. (c) Four-Level PAM: A four-level hierarchy consisting of a root, a set of super-topics, a set of sub-topics and a word vocabulary. Both the root and the super-topics are associated with Dirichlet distributions, from which we sample multinomials over their children for each document. (d) PAM: An arbitrary DAG structure to encode the topic correlations. Each interior node is considered a topic and associated with a Dirichlet distribution.

Teh et al. (2005) propose hierarchical Dirichlet processes (HDP) to model groups of data that have a pre-defined hierarchical structure. Each pre-defined group is associated with a Dirichlet process, whose base measure is sampled from a higher-level Dirichlet process. HDP can capture topic correlations defined by this nested data structure; however, it does not automatically discover such correlations from unstructured data. A simple version of HDP does not use a hierarchy over pre-defined groups of data, but can be viewed as an extension to LDA that integrates over (or alternatively selects) the appropriate number of topics.

An alternative model that not only represents topic correlations, but also learns them, is the correlated topic model (CTM) (Blei & Lafferty, 2006). It is similar to LDA, except that rather than drawing topic mixture proportions from a Dirichlet, it does so from a logistic normal distribution, whose parameters include a covariance matrix in which each entry specifies the correlation between a pair of topics. Thus in CTM topics are not independent; however, note that only pairwise correlations are modeled, and the number of parameters in the covariance matrix grows as the square of the number of topics.

In this paper, we introduce the pachinko allocation model (PAM), which uses a directed acyclic graph (DAG) structure to represent and learn arbitrary-arity, nested, and possibly sparse topic correlations. In PAM, the concept of a topic is extended to a distribution not only over words, but also over other topics. The model structure consists of an arbitrary DAG, in which each leaf node is associated with a word in the vocabulary, and each non-leaf "interior" node corresponds to a topic, having a distribution over its children. An interior node whose children are all leaves would correspond to a traditional LDA topic. But some interior nodes may also have children that are other topics, thus representing a mixture over topics. With many such nodes, PAM therefore captures not only correlations among words (as in LDA), but also correlations among topics themselves.

For example, consider a document collection that discusses four topics: cooking, health, insurance and drugs. The cooking topic co-occurs often with health, while health, insurance and drugs are often discussed together. A DAG can describe this kind of correlation. Four nodes for the four topics form one level that is directly connected to the words. There are two additional nodes at a higher level, where one is the parent of cooking and health, and the other is the parent of health, insurance and drugs.

In PAM each interior node's distribution over its children could be parameterized arbitrarily. In the remainder of this paper, however, as in LDA, we use a Dirichlet, parameterized by a vector with the same dimension as the number of children. Thus, here, a PAM model consists of a DAG, with each interior node containing a Dirichlet distribution over its children. To generate a document from this model, we first sample a multinomial from each Dirichlet. Then, to generate each word of the document, we begin at the root of the DAG, sampling one of its children according to its multinomial, and so on sampling children down the DAG until we reach a leaf, which yields a word. The model is named for pachinko machines—a game popular in Japan, in which metal balls bounce down around a complex collection of pins until they land in various bins at the bottom.

Note that the DAG structure in PAM is extremely flexible. It could be a simple tree (hierarchy), or an arbitrary DAG, with cross-connected edges, and edges skipping levels. The nodes can be fully or sparsely connected. The structure could be fixed beforehand or learned from the data. It is easy to see that LDA can be viewed as a special case of PAM: the DAG corresponding to LDA is a three-level hierarchy consisting of one root at the top, a set of topics in the middle and a word vocabulary at the bottom. The root is fully connected to all the topics, and each topic is fully connected to all the words. (LDA represents topics as multinomial distributions over words, which can be seen as Dirichlet distributions with variance 0.)

In this paper we present experimental results demonstrating PAM's improved performance over LDA in three different tasks, including topical word coherence assessed by human judges, likelihood on held-out test data, and document classification accuracy. We also show a favorable comparison versus CTM and HDP on held-out data likelihood.

2. The Model

In this section, we detail the pachinko allocation model (PAM), and describe its generative process, inference algorithm and parameter estimation method. We begin with a brief review of latent Dirichlet allocation.

Latent Dirichlet allocation (LDA) (Blei et al., 2003) can be understood as a special case of PAM with a three-level hierarchy. Beginning from the bottom, it includes: $V = \{x_1, x_2, \ldots, x_v\}$, a vocabulary over words; $T = \{t_1, t_2, \ldots, t_s\}$, a set of topics; and $r$, the root, which is the parent of all topic nodes and is associated with a Dirichlet distribution $g(\alpha)$. Each topic is represented as a multinomial distribution over words and sampled from a given Dirichlet distribution $g(\beta)$. The model structure of LDA is shown in Figure 1(b). To generate a document, LDA samples a multinomial distribution over topics from $g(\alpha)$, then repeatedly samples a topic from this multinomial, and a word from the topic.

Now we introduce notation for the pachinko allocation model. PAM connects words in $V$ and topics in $T$ with an arbitrary DAG, where topic nodes occupy the interior levels and the leaves are words. Two possible model structures are shown in Figure 1(c) and (d). Each topic $t_i$ is associated with a Dirichlet distribution $g_i(\alpha_i)$, where $\alpha_i$ is a vector with the same dimension as the number of children in $t_i$. In general, $g_i$ is not restricted to be Dirichlet. It could be any distribution over discrete children, such as logistic normal. But in this paper, we focus only on Dirichlet and derive the inference algorithm under this assumption.

To generate a document d, we follow a two-step process (a code sketch follows the list below):

1. Sample $\theta^{(d)}_{t_1}, \theta^{(d)}_{t_2}, \ldots, \theta^{(d)}_{t_s}$ from $g_1(\alpha_1), g_2(\alpha_2), \ldots, g_s(\alpha_s)$, where $\theta^{(d)}_{t_i}$ is a multinomial distribution of topic $t_i$ over its children.

2. For each word w in the document,
   • Sample a topic path $z_w$ of length $L_w$: $\langle z_{w1}, z_{w2}, \ldots, z_{wL_w} \rangle$. $z_{w1}$ is always the root, and $z_{w2}$ through $z_{wL_w}$ are topic nodes in $T$. $z_{wi}$ is a child of $z_{w(i-1)}$ and is sampled according to the multinomial distribution $\theta^{(d)}_{z_{w(i-1)}}$.
   • Sample word w from $\theta^{(d)}_{z_{wL_w}}$.
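The following is a minimal, illustrative sketch of this generative process for an arbitrary PAM DAG; it is not the authors' implementation. The DAG encoding (a dict mapping each interior node to its children and Dirichlet parameter vector), the function name, and the toy example are our own assumptions for illustration.

```python
import numpy as np

def generate_document(dag, root, num_words, rng=np.random.default_rng(0)):
    """Sketch of PAM's generative process for one document.

    `dag` maps each interior node to (children, alpha), where `children`
    is a list of child node ids (topics or words) and `alpha` is a
    Dirichlet parameter vector of the same length.  Leaves (words) do
    not appear as keys in `dag`.
    """
    # Step 1: sample a multinomial over children for every interior node.
    theta = {node: rng.dirichlet(alpha)
             for node, (children, alpha) in dag.items()}

    words, paths = [], []
    for _ in range(num_words):
        # Step 2: walk from the root down the DAG until a leaf is reached.
        node, path = root, [root]
        while node in dag:                      # interior node -> keep descending
            children, _ = dag[node]
            node = children[rng.choice(len(children), p=theta[node])]
            path.append(node)
        words.append(node)                      # leaf = word
        paths.append(path)                      # topic path z_w (root ... word)
    return words, paths

# Toy example: root -> two topics -> a small word vocabulary.
toy_dag = {
    "root":    (["topic_a", "topic_b"], np.array([1.0, 1.0])),
    "topic_a": (["cook", "oven", "drug"], np.array([0.5, 0.5, 0.1])),
    "topic_b": (["drug", "insure", "cook"], np.array([0.5, 0.5, 0.1])),
}
print(generate_document(toy_dag, "root", num_words=5))
```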

Following this process, the joint probability of generating a document d, the topic assignments $z^{(d)}$ and the multinomial distributions $\theta^{(d)}$ is

$$P(d, z^{(d)}, \theta^{(d)} \mid \alpha) = \prod_{i=1}^{s} P(\theta^{(d)}_{t_i} \mid \alpha_i) \times \prod_{w} \Big( \prod_{i=2}^{L_w} P(z_{wi} \mid \theta^{(d)}_{z_{w(i-1)}}) \, P(w \mid \theta^{(d)}_{z_{wL_w}}) \Big).$$

Integrating out $\theta^{(d)}$ and summing over $z^{(d)}$, we calculate the marginal probability of a document as:

$$P(d \mid \alpha) = \int \prod_{i=1}^{s} P(\theta^{(d)}_{t_i} \mid \alpha_i) \prod_{w} \sum_{z_w} \Big( \prod_{i=2}^{L_w} P(z_{wi} \mid \theta^{(d)}_{z_{w(i-1)}}) \, P(w \mid \theta^{(d)}_{z_{wL_w}}) \Big) \, d\theta^{(d)}.$$

Finally, the probability of generating a whole corpus is the product of the probability for every document:

$$P(D \mid \alpha) = \prod_{d} P(d \mid \alpha).$$

2.1. Four-Level PAM

While PAM allows arbitrary DAGs to model the topic correlations, in this paper we focus on one special structure in our experiments. It is a four-level hierarchy consisting of one root topic, $s_1$ topics at the second level, $s_2$ topics at the third level and words at the bottom. We call the topics at the second level super-topics and the ones at the third level sub-topics. The root is connected to all super-topics, super-topics are fully connected to sub-topics and sub-topics are fully connected to words (Figure 1(c)). We also make a simplification similar to LDA: the multinomial distributions for sub-topics are sampled once for the whole corpus, from a single Dirichlet distribution $g(\beta)$. The multinomials for the root and super-topics are still sampled individually for each document. As we can see, both the model structure and generative process for this special setting are similar to LDA. The major difference is that it has one additional layer of super-topics modeled with Dirichlet distributions, which is the key component capturing topic correlations here. We present the corresponding graphical models for LDA and PAM in Figure 2.
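As a concrete illustration of this special structure, here is a small sketch of the four-level generative process under the simplification above (document-specific multinomials for the root and super-topics, corpus-wide multinomials for the sub-topics). The variable names and toy dimensions are ours, chosen for illustration only, and the super-topic Dirichlet parameters are simply initialized to ones rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

V, S1, S2 = 1000, 50, 100          # vocabulary size, #super-topics, #sub-topics
alpha_root = np.full(S1, 0.01)     # fixed Dirichlet for the root
alpha_super = np.ones((S1, S2))    # per-super-topic Dirichlets (learned in practice)
beta = np.full(V, 0.01)            # Dirichlet over words for the sub-topics

# Corpus-level: one multinomial over words per sub-topic (as in LDA).
phi = rng.dirichlet(beta, size=S2)                               # shape (S2, V)

def generate_document(num_words):
    # Document-level: multinomials for the root and for each super-topic.
    theta_root = rng.dirichlet(alpha_root)                       # over super-topics
    theta_super = np.array([rng.dirichlet(a) for a in alpha_super])  # (S1, S2)

    words = []
    for _ in range(num_words):
        k = rng.choice(S1, p=theta_root)        # super-topic assignment z_w2
        p = rng.choice(S2, p=theta_super[k])    # sub-topic assignment   z_w3
        w = rng.choice(V, p=phi[p])             # word drawn from the sub-topic
        words.append((k, p, w))
    return words

print(generate_document(5)[:2])
```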

Figure 2. Graphical models for (a) LDA and (b) four-level PAM.

2.2. Inference and Parameter Estimation

The hidden variables in PAM include the sampled multinomial distributions $\Theta$ and topic assignments $\mathbf{z}$. Furthermore, we need to learn the parameters in the Dirichlet distributions $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_s\}$. We could apply the Expectation-Maximization (EM) algorithm for inference, which is often used to estimate parameters for models involving hidden variables. However, EM has been shown to perform poorly for topic models due to many local maxima.

Instead, we apply Gibbs sampling to perform inference and parameter learning. For an arbitrary DAG, we need to sample a topic path for each word given the other variable assignments, enumerating all possible paths and calculating their conditional probabilities. In our special four-level PAM structure, each path contains the root, a super-topic and a sub-topic. Since the root is fixed, we only need to jointly sample the super-topic and sub-topic assignments for each word, based on their conditional probability given observations and other assignments, integrating out the multinomial distributions $\Theta$ (thus the time for each sample scales with the number of possible paths). The following equation shows the joint probability of a super-topic and a sub-topic. For word w in document d, we have:

$$P(z_{w2}=t_k, z_{w3}=t_p \mid D, \mathbf{z}_{-w}, \alpha, \beta) \propto \frac{n^{(d)}_{1k} + \alpha_{1k}}{n^{(d)}_{1} + \sum_{k'} \alpha_{1k'}} \times \frac{n^{(d)}_{kp} + \alpha_{kp}}{n^{(d)}_{k} + \sum_{p'} \alpha_{kp'}} \times \frac{n_{pw} + \beta_{w}}{n_{p} + \sum_{m} \beta_{m}}.$$

Here we assume that the root topic is $t_1$. $z_{w2}$ and $z_{w3}$ correspond to the super-topic and sub-topic assignments respectively. $\mathbf{z}_{-w}$ is the topic assignments for all other words. Excluding the current token, $n^{(d)}_{x}$ is the number of occurrences of topic $t_x$ in document d; $n^{(d)}_{xy}$ is the number of times topic $t_y$ is sampled from its parent $t_x$ in document d; $n_x$ is the number of occurrences of sub-topic $t_x$ in the whole corpus and $n_{xw}$ is the number of occurrences of word w in sub-topic $t_x$. Furthermore, $\alpha_{xy}$ is the yth component in $\alpha_x$ and $\beta_w$ is the component for word w in $\beta$.
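To make the sampling step concrete, here is a small sketch of how this conditional could be computed and sampled for a single word token. It is an illustrative implementation under our own array-naming assumptions (n1k, nkp, npw, and so on), not the authors' code, and the count arrays are assumed to already exclude the current token.

```python
import numpy as np

def sample_super_sub(w, n1k, nkp, npw, np_, alpha_root, alpha_super, beta, rng):
    """Jointly sample (super-topic, sub-topic) for one word token.

    n1k  : (S1,)    counts of each super-topic sampled from the root in this document
    nkp  : (S1, S2) counts of each sub-topic sampled from each super-topic in this document
    npw  : (S2, V)  counts of each word generated by each sub-topic (corpus-wide)
    np_  : (S2,)    total tokens assigned to each sub-topic (corpus-wide)
    All counts are assumed to already exclude the current token.
    """
    root_term = (n1k + alpha_root) / (n1k.sum() + alpha_root.sum())              # (S1,)
    super_term = (nkp + alpha_super) / (
        nkp.sum(axis=1, keepdims=True) + alpha_super.sum(axis=1, keepdims=True))  # (S1, S2)
    word_term = (npw[:, w] + beta[w]) / (np_ + beta.sum())                       # (S2,)

    # Unnormalized joint over all (super-topic, sub-topic) pairs, then normalize.
    joint = root_term[:, None] * super_term * word_term[None, :]                 # (S1, S2)
    joint /= joint.sum()
    flat = rng.choice(joint.size, p=joint.ravel())
    return np.unravel_index(flat, joint.shape)   # (super-topic k, sub-topic p)
```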

Note that in the Gibbs sampling equation, we assume that the Dirichlet parameters $\alpha$ are given. While LDA can produce reasonable results with a simple uniform Dirichlet, we have to learn these parameters for the super-topics in PAM since they capture different correlations among sub-topics. As for the root, we assume a fixed Dirichlet parameter. To learn $\alpha$, we could use maximum likelihood or maximum a posteriori estimation. However, since there are no closed-form solutions for these methods and we wish to avoid iterative methods for the sake of simplicity and speed, we approximate it by moment matching. In each iteration of Gibbs sampling, we update

$$\text{mean}_{xy} = \frac{1}{N} \sum_{d} \frac{n^{(d)}_{xy}}{n^{(d)}_{x}};$$

$$\text{var}_{xy} = \frac{1}{N} \sum_{d} \Big( \frac{n^{(d)}_{xy}}{n^{(d)}_{x}} - \text{mean}_{xy} \Big)^2;$$

$$m_{xy} = \frac{\text{mean}_{xy} \times (1 - \text{mean}_{xy})}{\text{var}_{xy}} - 1;$$

$$\alpha_{xy} \propto \text{mean}_{xy};$$

$$\sum_{y} \alpha_{xy} = \frac{1}{5} \times \exp\Big( \frac{\sum_{y} \log(m_{xy})}{s_2 - 1} \Big).$$

For each super-topic x and sub-topic y, we first calculate the sample mean $\text{mean}_{xy}$ and sample variance $\text{var}_{xy}$. $n^{(d)}_{xy}$ and $n^{(d)}_{x}$ are the same as defined above. Then we estimate $\alpha_{xy}$, the yth component in $\alpha_x$, from the sample mean and variance. $N$ is the number of documents and $s_2$ is the number of sub-topics.

Smoothing is important when we estimate the Dirichlet parameters with moment matching. From the equations above, we can see that when one sub-topic y does not get sampled from super-topic x in one iteration, $\alpha_{xy}$ will become 0. Furthermore, from the Gibbs sampling equation, we know that this sub-topic will never have the chance to be sampled again by this super-topic. We introduce a prior in the calculation of sample means so that $\text{mean}_{xy}$ will not be 0 even if $n^{(d)}_{xy}$ is 0 for every document d.
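A compact sketch of this moment-matching update is shown below. It follows the equations above directly; the small constant added to the per-document proportions as a smoothing prior, and the guards against zero variance, are our own illustrative choices, since the paper does not state their exact values.

```python
import numpy as np

def moment_matching_update(nkp_docs, smoothing=1e-5):
    """Re-estimate the super-topic Dirichlet parameters alpha[x, y].

    nkp_docs : (N, S1, S2) array; nkp_docs[d, x, y] is the number of times
               sub-topic y was sampled from super-topic x in document d
               during the current Gibbs iteration.
    """
    N, S1, S2 = nkp_docs.shape
    nx = nkp_docs.sum(axis=2, keepdims=True)                 # (N, S1, 1) = n_x^(d)
    # Smoothed per-document proportions n_xy^(d) / n_x^(d).
    prop = (nkp_docs + smoothing) / (nx + S2 * smoothing)

    mean = prop.mean(axis=0)                                 # (S1, S2)
    var = prop.var(axis=0)                                   # (S1, S2)
    m = mean * (1.0 - mean) / np.maximum(var, 1e-12) - 1.0

    # alpha_xy is proportional to mean_xy, with each row's scale set by
    # sum_y alpha_xy = (1/5) * exp( sum_y log(m_xy) / (s2 - 1) ).
    scale = 0.2 * np.exp(np.log(np.maximum(m, 1e-12)).sum(axis=1) / (S2 - 1))  # (S1,)
    alpha = mean / mean.sum(axis=1, keepdims=True) * scale[:, None]
    return alpha
```
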
3. Experimental Results

In this section, we present example topics that PAM discovers from real-world text data and evaluate against LDA using three measures: topic clarity by human judgement, likelihood of held-out test data, and document classification accuracy. We also compare held-out data likelihood with CTM and HDP.

In the experiments we discuss below, we use a fixed four-level hierarchical structure for PAM, which includes a root, a set of super-topics, a set of sub-topics and a word vocabulary. For the root, we always assume a fixed Dirichlet distribution with parameter 0.01. We can change this parameter to adjust the variance in the sampled multinomial distributions. We choose a small value so that the variance is high and each document contains only a small number of super-topics, which tends to make the super-topics more interpretable. We treat the sub-topics in the same way as LDA and assume they are sampled once for the whole corpus from a given Dirichlet with parameter 0.01. So the only parameters we need to learn are the Dirichlet parameters for the super-topics, and the multinomial parameters for the sub-topics.

In Gibbs sampling for both PAM and LDA we use 2000 burn-in iterations, and then draw a total of 10 samples in the following 1000 iterations. The total training time for the NIPS dataset (as described in Section 3.2) is approximately 20 hours on a 2.4 GHz Opteron machine with 2GB memory.

3.1. Topic Examples

Our first test dataset comes from Rexa, a search engine over research papers (http://Rexa.info). We randomly choose a subset of abstracts from its large collection. In this dataset, there are 4000 documents, 278438 word tokens and 25597 unique words. Figure 3 shows a subset of super-topics in the data, and how they capture correlations among sub-topics. For each super-topic x, we rank the sub-topics {y} based on the learned Dirichlet parameter $\alpha_{xy}$. In Figure 3, each circle corresponds to one super-topic and links to a set of sub-topics as shown in the boxes, which are selected from its top-10 list. The numbers on the edges are the corresponding $\alpha$ values. As we can see, all the super-topics share the same sub-topic in the middle, which is a subset of stopwords in this corpus. Some super-topics also share the same content sub-topics. For example, the topics about scheduling and tasks co-occur with the topic about agents and also the topic about distributed systems. Another example is information retrieval. It is discussed along with both the data mining topic and the web, network topic.

3.2. Human Judgement

We provided each of five human evaluators a set of topic pairs, one each from PAM and LDA, anonymized and in random order. Evaluators were asked to choose which one has a stronger sense of semantic coherence and specificity.

These topics were generated using the NIPS abstract dataset (NIPS00-12), which includes 1647 documents, a vocabulary of 11708 words and 114142 word tokens. We use 100 topics for LDA, and 50 super-topics and 100 sub-topics for PAM. The topic pairs are created based on similarity. For each sub-topic in PAM, we find its most similar topic in LDA and present them as a pair. We also find the most similar sub-topic in PAM for each LDA topic. Similarity is measured by the KL-divergence between topic distributions over words (a sketch of this pairing step follows the tables below). After removing redundant pairs and dissimilar pairs that share fewer than 5 of their top 20 words, we provide the evaluators with a total of 25 pairs. We present four example topic pairs in Table 1. There are 5 PAM topics that every evaluator agrees to be the better ones in their pairs, while LDA has none. And out of 25 pairs, 19 topics from PAM are chosen by the majority (≥ 3 votes). We show the full evaluation results in Table 2.

  PAM          LDA          PAM            LDA
  control      control      motion         image
  systems      systems      image          motion
  robot        based        detection      images
  adaptive     adaptive     images         multiple
  environment  direct       scene          local
  goal         con          vision         generated
  state        controller   texture        noisy
  controller   change       segmentation   optical
  5 votes      0 vote       4 votes        1 vote

  PAM          LDA          PAM            LDA
  signals      signal       algorithm      algorithm
  source       signals      learning       algorithms
  separation   single       algorithms     gradient
  eeg          time         gradient       convergence
  sources      low          convergence    stochastic
  blind        source       function       line
  single       temporal     stochastic     descent
  event        processing   weight         converge
  4 votes      1 vote       1 vote         4 votes

Table 1. Example topic pairs in human judgement

              LDA   PAM
  5 votes      0     5
  ≥ 4 votes    3     8
  ≥ 3 votes    9    16

Table 2. Human judgement results. For all the categories, 5 votes, ≥ 4 votes and ≥ 3 votes, PAM has more topics judged better than LDA.
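As an illustration of the pairing step described above, the following sketch matches each PAM sub-topic to its closest LDA topic by KL-divergence between their word distributions. The array names, the direction of the divergence, and the use of a small smoothing constant are our own assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete word distributions of equal length."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def pair_topics(pam_topics, lda_topics):
    """For each PAM sub-topic (a row of pam_topics, a distribution over the
    vocabulary), return the index of the most similar LDA topic."""
    pairs = []
    for i, p in enumerate(pam_topics):
        divs = [kl_divergence(p, q) for q in lda_topics]
        pairs.append((i, int(np.argmin(divs))))
    return pairs

# Toy usage with random word distributions over a 20-word vocabulary.
rng = np.random.default_rng(0)
pam = rng.dirichlet(np.ones(20), size=5)
lda = rng.dirichlet(np.ones(20), size=5)
print(pair_topics(pam, lda))
```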


Figure 3. Topic correlation in PAM. Each circle corresponds to a super-topic and each box corresponds to a sub-topic. One super-topic can connect to several sub-topics and capture their correlation. The numbers on the edges are the corresponding α values for the (super-topic, sub-topic) pair.

3.3. Likelihood Comparison

In addition to human evaluation of topics, we also provide quantitative measurements to compare PAM with LDA, CTM and HDP. In this experiment, we use the same NIPS dataset and split it into two subsets with 75% and 25% of the data respectively. Then we learn the models from the larger set and calculate likelihood for the smaller set. We use 50 super-topics for PAM, and the number of sub-topics varies from 20 to 180.

To calculate the likelihood of held-out data, we must integrate out the sampled multinomials and sum over all possible topic assignments. This problem has no closed-form solution. Previous work that uses Gibbs sampling for inference approximates the likelihood of a document d by the harmonic mean of a set of conditional probabilities $P(d \mid z^{(d)})$, where the samples are generated using Gibbs sampling (Griffiths & Steyvers, 2004). However, this approach has been shown to be unstable because the inverse likelihood does not have finite variance (Chib, 1995) and has been widely criticized (e.g., the discussion of Newton & Raftery, 1994).

In our experiments, we employ a more robust alternative in the family of non-parametric likelihood estimates—specifically an approach based on empirical likelihood (EL), e.g. (Diggle & Gratton, 1984). In these methods one samples data from the model, and calculates the empirical distribution from the samples. In cases where the samples are sparse, a kernel may be employed. We first randomly generate 1000 documents from the trained model, based on its own generative process. Then from each sample we estimate a multinomial distribution (directly from the sub-topic mixture). The probability of a test document is then calculated as its average probability from each multinomial, just as in a simple mixture model. Unlike in Gibbs sampling, the samples are unconditionally generated; therefore, they are not restricted to the topic co-occurrences observed in the held-out data, as they are in the harmonic mean method.

We show the log-likelihood on the test data in Figure 4, averaging over all the samples in 10 different Gibbs sampling runs. Compared to LDA, PAM always produces higher likelihood for different numbers of sub-topics. The advantage is especially obvious for large numbers of topics. LDA performance peaks at 40 topics and decreases as the number of topics increases. On the other hand, PAM supports larger numbers of topics and has its best performance at 160 sub-topics. When the number of topics is small, CTM exhibits better performance than both LDA and PAM. However, as we use more and more topics, its likelihood starts to decrease. The peak value for CTM is at 60 topics and it is slightly worse than the best performance of PAM.

We also apply HDP to this dataset. Since there is no pre-defined data structure, HDP does not model any topic correlations but automatically learns the number of topics. Therefore, the result of HDP does not change with the number of topics and it is similar to the best result of LDA.
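The empirical-likelihood estimate described earlier in this subsection can be sketched as follows: sample documents from the trained model, keep each sample's sub-topic mixture as a multinomial over words, and score a held-out document by its average probability under those multinomials. This is an illustrative sketch under our own interface assumptions (the `sample_subtopic_mixture` callable standing in for a draw from the trained model's generative process), not the authors' code.

```python
import numpy as np

def empirical_log_likelihood(test_doc_counts, phi, sample_subtopic_mixture,
                             num_samples=1000, rng=np.random.default_rng(0)):
    """Empirical-likelihood estimate of log P(test document | trained model).

    test_doc_counts         : (V,) word-count vector of the held-out document
    phi                     : (S2, V) sub-topic word distributions from training
    sample_subtopic_mixture : callable returning a (S2,) mixture over sub-topics,
                              drawn from the trained model's generative process
    """
    doc_log_probs = []
    for _ in range(num_samples):
        mix = sample_subtopic_mixture(rng)       # one pseudo-document's mixture
        word_probs = mix @ phi                   # (V,) multinomial over words
        # Log probability of the test document under this single multinomial.
        doc_log_probs.append(np.dot(test_doc_counts, np.log(word_probs + 1e-12)))
    # Average the *probabilities* (not the log probabilities) over samples.
    doc_log_probs = np.array(doc_log_probs)
    m = doc_log_probs.max()
    return m + np.log(np.mean(np.exp(doc_log_probs - m)))   # log-mean-exp
```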

  class       # docs   LDA     PAM
  graphics    243      83.95   86.83
  os          239      81.59   84.10
  pc          245      83.67   88.16
  mac         239      86.61   89.54
  windows.x   243      88.07   92.20
  total       1209     84.70   87.34

Table 3. Document classification accuracy (%)

Figure 4. Likelihood comparison with different numbers of topics: the results are averages over all samples in 10 different Gibbs sampling runs and the maximum standard error is 113.75.

We also present the likelihood for different numbers of training documents in Figure 5. The results are all based on 160 topics, except for HDP. As we can see, the performance of CTM is noticeably worse than the other three when there is a limited amount of training data. One possible reason is that CTM has a large number of parameters to learn, especially when the number of topics is large.

3.4. Document Classification

Another evaluation comparing PAM with LDA is document classification. We conduct a 5-way classification on the comp subset of the 20 newsgroups dataset. This contains 4836 documents with a vocabulary size of 35567 words. Each class of documents is divided into 75% training and 25% test data. We train a model for each class and calculate the likelihood for the test data. A test document is considered correctly classified if its corresponding model produces the highest likelihood (a sketch of this procedure appears below). We present the classification accuracy for both PAM and LDA in Table 3. According to the sign test, the improvement of PAM over LDA is statistically significant with a p-value < 0.05.
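A minimal sketch of this classification scheme is shown below, assuming we already have a per-class likelihood function (for example, the empirical-likelihood estimator sketched in Section 3.3). The interface and function names are our own illustrative assumptions.

```python
import numpy as np

def classify(test_doc_counts, class_log_likelihoods):
    """Assign a test document to the class whose trained model scores it highest.

    class_log_likelihoods: dict mapping class name -> callable that returns
    log P(document | model trained on that class's training data).
    """
    scores = {c: ll(test_doc_counts) for c, ll in class_log_likelihoods.items()}
    return max(scores, key=scores.get)

def accuracy(test_docs, test_labels, class_log_likelihoods):
    """Fraction of held-out documents assigned to their true class."""
    predictions = [classify(doc, class_log_likelihoods) for doc in test_docs]
    return float(np.mean([p == y for p, y in zip(predictions, test_labels)]))
```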

Figure 5. Likelihood comparison with different amounts of training data: the results are averages over all samples in 10 different Gibbs sampling runs and the maximum standard error is 171.72.

4. Related Work

Previous work in document summarization has explored topic hierarchies built with a probabilistic language model (Lawrie et al., 2001). The dependence between a topic and its children is captured by relative entropy and encoded in a graph of conditional probabilities. Unlike PAM, which simultaneously learns all the topic correlations at different levels, this model incrementally builds the hierarchy by identifying topic terms for individual levels using a greedy approximation to the Dominating Set Problem.

Hierarchical LDA (hLDA) (Blei et al., 2004) is a variation of LDA that assumes a hierarchical structure among topics. Topics at higher levels are more general, such as stopwords, while the more specific words are organized into topics at lower levels. To generate a document, it samples a topic path from the hierarchy and then samples every word from those topics. Thus hLDA can well explain a document that discusses a mixture of computer science, artificial intelligence and robotics. However, for example, the document cannot cover both robotics and natural language processing under the more general topic artificial intelligence. This is because a document is sampled from only one topic path in the hierarchy. Compared to hLDA, PAM provides more flexibility because it samples a topic path for each word instead of each document. Note that it is possible to create a DAG structure in PAM that would capture hierarchically nested word distributions, and obtain the advantages of both models.

Another model that captures the correlations among topics is the correlated topic model introduced by Blei and Lafferty (2006). The basic idea is to replace the Dirichlet distribution in LDA with a logistic normal distribution. Under a single Dirichlet, the topic mixture components for every document are sampled almost independently from each other. Instead, a logistic normal distribution can capture the pairwise correlations between topics based on a covariance matrix. Although CTM and PAM are both trying to model topic correlations directly, PAM takes a more flexible approach that can capture n-ary and nested correlations. In fact, CTM is very similar to a special-case structure of PAM, where we create one super-topic for every pair of sub-topics.
Not only is CTM limited to pairwise correlations, it must also estimate parameters for each possible pair in the covariance matrix, which grows as the square of the number of topics. In contrast, with PAM we do not need to model every pair of topics but only sparse mixtures of correlations, as determined by the number of super-topics.

In this paper, we have described PAM operating on fixed DAG structures. It would be interesting to explore methods for learning the number of (super- and sub-) topics and their nested connectivity. This extension is closely related to hierarchical Dirichlet processes (HDP) (Teh et al., 2005), where the per-document mixture proportions over topics are generated from Dirichlet processes with an infinite number of mixture components. These models have been used to estimate the number of topics in LDA. In addition, when the data is pre-organized into nested groups, HDP can capture different topic correlations within these groups by using a nested hierarchy of Dirichlet processes. In future work, we plan to use Dirichlet processes to learn the numbers of topics at different levels, as well as their connectivity. Note that unlike HDP, PAM does not rely on pre-defined hierarchical data, but automatically discovers topic structures.

5. Conclusion

In this paper, we have presented pachinko allocation, a mixture model that uses a DAG structure to capture arbitrary topic correlations. Each leaf in the DAG is associated with a word in the vocabulary, and each interior node corresponds to a topic that models the correlation among its children, where topics can be parents not only of words, but also of other topics. The DAG structure is completely general, and some topic models such as LDA can be represented as special cases of PAM. Compared to other approaches that capture topic correlations, such as hierarchical LDA and the correlated topic model, PAM provides more expressive power to support complicated topic structures and adopts more realistic assumptions for generating documents.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010, and under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor. We also thank Charles Sutton and Matthew Beal for helpful discussions, David Blei and Yee Whye Teh for advice about a Dirichlet process version, Sam Roweis for discussions about alternate structure-learning methods, and Michael Jordan for help naming the model.

References

Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16.

Blei, D., & Lafferty, J. (2006). Correlated topic models. In Advances in Neural Information Processing Systems 18.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association.

Diggle, P., & Gratton, R. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society.

Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences (pp. 5228–5235).

Lawrie, D., Croft, W., & Rosenberg, A. (2001). Finding topic words for hierarchical summarization. Proceedings of SIGIR'01 (pp. 349–357).

Newton, M., & Raftery, A. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society.

Teh, Y., Jordan, M., Beal, M., & Blei, D. (2005). Hierarchical Dirichlet processes. Journal of the American Statistical Association.