Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Wei Li and Andrew McCallum
{weili,mccallum}@cs.umass.edu
University of Massachusetts, Dept. of Computer Science

Abstract

Latent Dirichlet allocation (LDA) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. LDA does not capture correlations between topics, however. Recently Blei and Lafferty have proposed the correlated topic model (CTM), in which the off-diagonal covariance structure in a logistic normal distribution captures pairwise correlations between topics. In this paper we introduce the Pachinko Allocation model (PAM), which uses a directed acyclic graph (DAG) to capture arbitrary, nested, and possibly sparse correlations. The leaves of the DAG represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). Using text data from UseNet, historic NIPS proceedings, and other research paper corpora, we show that PAM improves over LDA in document classification, likelihood of heldout data, ability to support finer-grained topics, and topical keyword coherence.

1. Introduction

Statistical topic models have been successfully used in many areas to analyze large amounts of textual information, including language modeling, document classification, information retrieval, automated document summarization and data mining. In addition to textual data such as newswire articles, research papers and personal emails, topic models have also been applied to images, biological information and other multi-dimensional data.

Latent Dirichlet allocation (LDA) (Blei et al., 2003) is one of the most popular models for extracting topic information from large text corpora. It represents each document as a mixture of topics, where each topic is a multinomial distribution over a word vocabulary. To generate a document, LDA first samples a multinomial distribution over topics from a Dirichlet distribution. Then it repeatedly samples a topic from the multinomial and samples a word from the topic. By applying LDA to a text collection, we are able to organize words into a set of semantically coherent clusters.

Several LDA variations have been proposed to deal with more complicated data structures. For example, the hierarchical LDA model (hLDA) (Blei et al., 2004) assumes a given hierarchical structure among topics. To generate a document, it first samples a topic path from the hierarchy and then samples words from those topics. The advantage of hLDA is the ability to discover topics with various levels of granularity. Another variation of LDA is the HMMLDA model (Griffiths et al., 2005), which combines a hidden Markov model (HMM) with LDA to extract word clusters from sequential data. It distinguishes between syntactic words and semantic words, and simultaneously organizes them into different clusters. HMMLDA has been successfully applied to sequence modeling tasks such as part-of-speech tagging and Chinese word segmentation (Li & McCallum, 2005). Another work related to LDA is the author-topic model by Rosen-Zvi et al. (2004), which associates each author with a mixture of topics. It can be applied to text collections where author information is available, such as research papers.

Topic models like LDA can automatically organize words into different clusters that capture their correlations in the text collection. However, LDA does not directly model the correlations among topics themselves.
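To make the LDA generative process described above concrete, here is a minimal, hedged sketch (not code from the paper): a per-document multinomial over topics is drawn from a Dirichlet, and each word is produced by first sampling a topic and then sampling a word from that topic's distribution. The toy vocabulary, topic count and hyperparameter values are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cooking", "health", "insurance", "drugs", "market", "price"]  # toy vocabulary
num_topics = 3
alpha = np.full(num_topics, 0.1)      # Dirichlet prior over per-document topic proportions
# Topic-word multinomials; in LDA these are themselves drawn from a Dirichlet g(beta).
topics = rng.dirichlet(np.full(len(vocab), 0.01), size=num_topics)

def generate_document(length=8):
    """Generate one document the LDA way: sample topic proportions, then each word."""
    theta = rng.dirichlet(alpha)                  # per-document multinomial over topics
    words = []
    for _ in range(length):
        z = rng.choice(num_topics, p=theta)       # sample a topic from the multinomial
        w = rng.choice(len(vocab), p=topics[z])   # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_document())
```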

Figure 1. Model Structures for Four Topic Models. (a) Dirichlet Multinomial: For each document, a multinomial distribution over words is sampled from a single Dirichlet. (b) LDA: This model samples a multinomial over topics for each document, and then generates words from the topics. (c) Four-Level PAM: A four-level hierarchy consisting of a root, a set of super-topics, a set of sub-topics and a word vocabulary. Both the root and the super-topics are associated with Dirichlet distributions, from which we sample multinomials over their children for each document. (d) PAM: An arbitrary DAG structure to encode the topic correlations. Each interior node is considered a topic and associated with a Dirichlet distribution.

This limitation comes from the assumption that the topic proportions in each document are all sampled from a single Dirichlet distribution. As a result, LDA has difficulty describing a scenario in which some topics are more likely to co-occur and some other topics are rarely found in the same document. However, we believe such correlations are common in real-world text data and are interested in topic models that can account for them.

Blei and Lafferty (2006) propose the correlated topic model (CTM) to address this problem. Its main difference from LDA is that for each document, CTM randomly draws the topic mixture proportions from a logistic normal distribution instead of a Dirichlet. The logistic normal distribution is parameterized by a covariance matrix, in which each entry specifies the correlation between a pair of topics. While topics are no longer independent in CTM, only pairwise correlations are modeled. Additionally, the number of parameters in the covariance matrix grows as the square of the number of topics.

In this paper, we introduce the Pachinko Allocation model (PAM). As we will see later, LDA can be viewed as a special case of PAM. In our model, we extend the concept of topics to be not only distributions over words, but also distributions over other topics. We assume an arbitrary DAG structure, in which each leaf node is associated with a word in the vocabulary and each interior node corresponds to a topic. There is one root in the DAG, which has no incoming links. Interior nodes can be linked to both leaves and other interior nodes. Therefore, we can capture both correlations among words, as LDA does, and correlations among topics themselves.

For example, consider a document collection that discusses four topics: cooking, health, insurance and drugs. Cooking only co-occurs often with health, while health, insurance and drugs are often discussed together. We can build a DAG to describe this kind of correlation. The four topics form one level that is directly connected to the words. Then there are two more nodes at a higher level, where one of them is the parent of cooking and health, and the other is the parent of health, insurance and drugs.

In PAM, we still use Dirichlet distributions to model topic correlations. But unlike LDA, where there is only one Dirichlet to sample all the topic mixture components, we associate each interior node with a Dirichlet, parameterized by a vector with the same dimension as the number of children. To generate a document, we first sample a multinomial from each Dirichlet. Then, based on these multinomials, the Pachinko machine samples a path for each word, starting from the root and ending at a leaf node.

The DAG structure in PAM is completely flexible. It can be as simple as a tree, a hierarchy, or an arbitrary DAG with edges skipping levels. The nodes can be fully or sparsely connected. We can either fix the structure beforehand or learn it from the data. It is easy to see that LDA can be viewed as a special case of PAM; the DAG corresponding to LDA is a three-level hierarchy consisting of one root at the top, a set of topics in the middle and a word vocabulary at the bottom. The root is fully connected with all the topics and each topic is fully connected with all the words. Furthermore, LDA represents topics as multinomial distributions over words, which can be seen as Dirichlet distributions with variance 0.
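As a concrete illustration of the cooking/health/insurance/drugs example above, the following sketch encodes that correlation pattern as a PAM-style DAG: leaves are words, the four topics sit one level above them, and two higher-level nodes capture the two co-occurrence patterns. The node names and the word lists under each topic are invented for the example; this is not code from the paper.

```python
# A PAM-style DAG as a map from each interior node to its children.
# Keys are topic nodes; any name that never appears as a key is a leaf (a word).
dag = {
    "root": ["cooking+health", "health+insurance+drugs"],
    "cooking+health": ["cooking", "health"],
    "health+insurance+drugs": ["health", "insurance", "drugs"],
    "cooking":   ["recipe", "oven", "flour"],        # illustrative word lists
    "health":    ["diet", "exercise", "doctor"],
    "insurance": ["policy", "premium", "claim"],
    "drugs":     ["dosage", "prescription", "pharmacy"],
}

def words_under(node):
    """Collect the vocabulary words reachable from a node, i.e. the topic's support."""
    children = dag.get(node)
    if children is None:            # not an interior node: it is a word
        return {node}
    support = set()
    for child in children:
        support |= words_under(child)
    return support

# The shared child "health" is what lets the two higher-level nodes express
# the two co-occurrence patterns described in the text.
print(sorted(words_under("health+insurance+drugs")))
```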

We present improved performance of PAM over LDA in three different experiments, including topical word coherence by human judgement, likelihood on heldout test data and accuracy of document classification. A preliminary favorable comparison with CTM is also presented.

2. The Model

In this section, we define the Pachinko allocation model (PAM) and describe its generative process, inference algorithm and parameter estimation method. To provide a better understanding of PAM, we first give a brief review of latent Dirichlet allocation.

Latent Dirichlet allocation (LDA) (Blei et al., 2003) is a generative probabilistic model based on a three-level hierarchy including:

V = {x_1, x_2, ..., x_v}: a vocabulary of words.

S = {t_1, t_2, ..., t_s}: a set of topics. Each topic is represented as a multinomial distribution over words and sampled from a given Dirichlet distribution g(β).

r: the root, which is the parent of all topic nodes and is associated with a Dirichlet distribution g(α).

The model structure of LDA is shown in Figure 1(b). To generate a document, LDA samples a multinomial distribution over topics from g(α), and then repeatedly samples a topic from this multinomial and a word from the topic.

Now we introduce the notation for the Pachinko allocation model:

V = {x_1, x_2, ..., x_v}: a word vocabulary.

S = {t_1, t_2, ..., t_s}: a set of topics, including the root. Each of them captures some correlation among words or topics.

D: an arbitrary DAG that consists of nodes in V and S. The topic nodes occupy the interior levels and the leaves are words.

G = {g_1(α_1), g_2(α_2), ..., g_s(α_s)}: g_i, parameterized by α_i, is a Dirichlet distribution associated with topic t_i. α_i is a vector with the same dimension as the number of children of t_i. It specifies the correlation among its children. In the more general case, g_i is not restricted to be Dirichlet; it could be any distribution over discrete children, such as the logistic normal. In this paper, however, we focus only on Dirichlet distributions and derive the inference algorithm under this assumption.

As we can see, PAM can be viewed as an extension to LDA. It incorporates all the topics and words into an arbitrary DAG structure, while LDA is limited to a special three-level hierarchy. Two possible model structures of PAM are shown in Figure 1(c) and (d).

To generate a document d, we follow a two-step process:

1. Sample θ_{t_1}^{(d)}, θ_{t_2}^{(d)}, ..., θ_{t_s}^{(d)} from g_1(α_1), g_2(α_2), ..., g_s(α_s), where θ_{t_i}^{(d)} is a multinomial distribution of topic t_i over its children.

2. For each word w in the document,

   • Sample a topic path z_w of length L_w: <z_{w1}, z_{w2}, ..., z_{wL_w}>. z_{w1} is always the root and z_{w2} through z_{wL_w} are topic nodes in S. z_{wi} is a child of z_{w(i-1)} and is sampled according to the multinomial distribution θ_{z_{w(i-1)}}^{(d)}.

   • Sample word w from θ_{z_{wL_w}}^{(d)}.

Following this process, the joint probability of generating a document d, the topic assignments z^{(d)} and the multinomial distributions θ^{(d)} is

P(d, z^{(d)}, \theta^{(d)} \mid \alpha) = \prod_{i=1}^{s} P(\theta_{t_i}^{(d)} \mid \alpha_i) \times \prod_{w} \Big( \prod_{i=2}^{L_w} P(z_{wi} \mid \theta_{z_{w(i-1)}}^{(d)}) \Big) P(w \mid \theta_{z_{wL_w}}^{(d)})

Integrating out θ^{(d)} and summing over z^{(d)}, we calculate the marginal probability of a document as:

P(d \mid \alpha) = \int \prod_{i=1}^{s} P(\theta_{t_i}^{(d)} \mid \alpha_i) \prod_{w} \sum_{z_w} \Big( \prod_{i=2}^{L_w} P(z_{wi} \mid \theta_{z_{w(i-1)}}^{(d)}) \Big) P(w \mid \theta_{z_{wL_w}}^{(d)}) \, d\theta^{(d)}

Finally, the probability of generating a whole corpus is the product of the probabilities of every document:

P(D \mid \alpha) = \prod_{d} P(d \mid \alpha)
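The two-step generative process above can be sketched directly. The code below is a hedged illustration, not the authors' implementation: it assumes the DAG and the Dirichlet parameter vectors are given, draws one multinomial per interior node for the document, and then samples a root-to-leaf path for every word.

```python
import numpy as np

rng = np.random.default_rng(1)

# Children of each interior node; names that are not keys are leaves (words).
children = {
    "root": ["t1", "t2"],
    "t1":   ["apple", "banana", "cherry"],
    "t2":   ["car", "train", "plane"],
}
# One Dirichlet parameter vector per interior node, with one entry per child.
alpha = {node: np.full(len(kids), 0.5) for node, kids in children.items()}

def generate_document(num_words=6):
    # Step 1: sample a multinomial over children from each node's Dirichlet.
    theta = {node: rng.dirichlet(a) for node, a in alpha.items()}
    words = []
    for _ in range(num_words):
        # Step 2: sample a topic path, starting at the root and ending at a leaf.
        node = "root"
        while node in children:
            idx = rng.choice(len(children[node]), p=theta[node])
            node = children[node][idx]
        words.append(node)   # the leaf reached is the sampled word
    return words

print(generate_document())
```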
2.1. Four-Level PAM

While PAM allows arbitrary DAGs to model the topic correlations, in this paper we focus on one special structure in our experiments. It is a four-level hierarchy consisting of one root topic, s_1 topics at the second level, s_2 topics at the third level and words at the bottom. We call the topics at the second level super-topics and the ones at the third level sub-topics.

The root is connected to all super-topics, super-topics are fully connected to sub-topics and sub-topics are fully connected to words. We also make a simplification similar to LDA: the multinomial distributions for sub-topics are sampled once for the whole corpus, from a single Dirichlet distribution g(β). The multinomials for the root and super-topics are still sampled individually for each document. As we can see, both the model structure and the generative process for this special setting are similar to LDA. The major difference is that it has one additional layer of super-topics modeled with Dirichlet distributions, as the root is, and this layer is the key component capturing topic correlations here. We show the model structure of this simplified PAM in Figure 1(c). We also present the corresponding graphical models for LDA and PAM in Figure 2.

Figure 2. Graphical Models for (a) LDA and (b) Four-Level PAM

2.2. Inference and Parameter Estimation

The hidden variables in PAM include the sampled multinomial distributions θ and topic assignments z. Furthermore, we need to learn the parameters in the Dirichlet distributions α = {α_1, α_2, ..., α_s}. We could apply the Expectation-Maximization (EM) algorithm for inference, which is often used to estimate parameters for models involving hidden variables. However, EM has been shown to perform poorly for topic models due to many local maxima.

Instead, we apply Gibbs sampling to perform inference and parameter learning. For an arbitrary DAG, we need to sample a topic path for each word given all other variable assignments, enumerating all possible paths and calculating their conditional probabilities. In our special four-level PAM structure, each path contains the root, a super-topic and a sub-topic. Since the root is fixed, we only need to jointly sample the super-topic and sub-topic assignments for each word, based on their conditional probability given the observations and all other assignments, integrating out the multinomial distributions θ. For word w in document d, the joint probability of its super-topic and sub-topic assignments is:

P(z_{w2} = t_k, z_{w3} = t_p \mid D, z_{-w}, \alpha, \beta) \propto \frac{n_{1k}^{(d)} + \alpha_{1k}}{n_1^{(d)} + \sum_{k'} \alpha_{1k'}} \times \frac{n_{kp}^{(d)} + \alpha_{kp}}{n_k^{(d)} + \sum_{p'} \alpha_{kp'}} \times \frac{n_{pw} + \beta_w}{n_p + \sum_{m} \beta_m}

Here we assume that the root topic is t_1. z_{w2} and z_{w3} correspond to the super-topic and sub-topic assignments respectively, and z_{-w} denotes the topic assignments for all other words. n_x^{(d)} is the number of occurrences of topic t_x in document d; n_{xy}^{(d)} is the number of times topic t_y is sampled from its parent t_x in document d; n_x is the number of occurrences of sub-topic t_x in the whole corpus and n_{xw} is the number of occurrences of word w in sub-topic t_x. Furthermore, α_{xy} is the yth component in α_x and β_w is the component for word w in β.
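The conditional above maps directly onto one Gibbs step per word. The sketch below is a hedged illustration of that step using plain count arrays (following the n and α notation just defined, with the current word's assignment already removed from the counts); it is not the authors' implementation, and the surrounding count bookkeeping is omitted.

```python
import numpy as np

def sample_super_sub(rng, n1k_d, nkp_d, npw, alpha1, alpha_kp, beta, word):
    """Jointly sample (super-topic, sub-topic) for one word from the conditional above.

    n1k_d:    count of each super-topic in the current document, shape [S1]
    nkp_d:    count of sub-topic p under super-topic k in the document, shape [S1, S2]
    npw:      count of word w in sub-topic p over the whole corpus, shape [S2, V]
    alpha1:   Dirichlet parameters of the root over super-topics, shape [S1]
    alpha_kp: Dirichlet parameters of each super-topic over sub-topics, shape [S1, S2]
    beta:     symmetric Dirichlet parameter for the sub-topic word distributions (scalar)
    """
    V = npw.shape[1]
    # (n_1k + alpha_1k) / (n_1 + sum_k' alpha_1k')
    super_part = (n1k_d + alpha1) / (n1k_d.sum() + alpha1.sum())
    # (n_kp + alpha_kp) / (n_k + sum_p' alpha_kp')
    sub_part = (nkp_d + alpha_kp) / (
        nkp_d.sum(axis=1, keepdims=True) + alpha_kp.sum(axis=1, keepdims=True))
    # (n_pw + beta_w) / (n_p + sum_m beta_m)
    word_part = (npw[:, word] + beta) / (npw.sum(axis=1) + beta * V)
    # Unnormalized joint probability over every (super-topic, sub-topic) pair.
    prob = super_part[:, None] * sub_part * word_part[None, :]
    prob = prob / prob.sum()
    flat = rng.choice(prob.size, p=prob.ravel())
    k, p = np.unravel_index(flat, prob.shape)
    return int(k), int(p)
```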

Note that in the Gibbs sampling equation, we assume that the Dirichlet parameters α are given. While LDA can produce reasonable results with a simple uniform Dirichlet, we have to learn these parameters for the super-topics in PAM since they capture different correlations among sub-topics. As for the root, we assume a fixed Dirichlet parameter. To learn α, we could use maximum likelihood or maximum a posteriori estimation. However, since there are no closed-form solutions for these methods and we wish to avoid iterative methods for the sake of simplicity and speed, we approximate it by moment matching. In each iteration of Gibbs sampling, we update

\mathrm{mean}_{xy} = \frac{1}{N} \sum_{d} \frac{n_{xy}^{(d)}}{n_x^{(d)}};

\mathrm{var}_{xy} = \frac{1}{N} \sum_{d} \Big( \frac{n_{xy}^{(d)}}{n_x^{(d)}} - \mathrm{mean}_{xy} \Big)^2;

m_{xy} = \frac{\mathrm{mean}_{xy} (1 - \mathrm{mean}_{xy})}{\mathrm{var}_{xy}} - 1;

\alpha_{xy} \propto \mathrm{mean}_{xy};

\sum_{y} \alpha_{xy} = \frac{1}{5} \exp\Big( \frac{\sum_{y} \log(m_{xy})}{s_2 - 1} \Big).

For each super-topic x and sub-topic y, we first calculate the sample mean mean_{xy} and sample variance var_{xy}; n_{xy}^{(d)} and n_x^{(d)} are the same as defined above. Then we estimate α_{xy}, the yth component in α_x, from the sample mean and variance. N is the number of documents and s_2 is the number of sub-topics.
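Read as code, the moment-matching update for one super-topic x looks roughly like the hedged sketch below (not the authors' implementation). The small prior added to the sample proportions anticipates the smoothing note that follows and is an assumption of this sketch, as are the numerical guards.

```python
import numpy as np

def moment_match_alpha(nxy_docs, nx_docs, prior=1e-4):
    """Estimate alpha_x for one super-topic x by moment matching.

    nxy_docs: per-document counts of each sub-topic y under x, shape [N, S2]
    nx_docs:  per-document totals of super-topic x, shape [N]
    """
    N, s2 = nxy_docs.shape
    # Per-document proportions n_xy / n_x, with a small prior so no mean is exactly 0.
    props = (nxy_docs + prior) / (nx_docs[:, None] + prior * s2)
    mean = props.mean(axis=0)                                  # mean_xy
    var = ((props - mean) ** 2).mean(axis=0)                   # var_xy
    m = mean * (1.0 - mean) / np.maximum(var, 1e-12) - 1.0     # m_xy (guarded)
    # alpha_xy is proportional to mean_xy; the scale is set by the m_xy values:
    # sum_y alpha_xy = (1/5) * exp( sum_y log(m_xy) / (s2 - 1) ).
    scale = 0.2 * np.exp(np.log(np.maximum(m, 1e-12)).sum() / (s2 - 1))
    return mean * scale

# Toy usage with invented counts: one super-topic, 3 sub-topics, 4 documents.
nxy = np.array([[3., 1., 0.], [2., 2., 1.], [0., 4., 1.], [1., 1., 3.]])
print(moment_match_alpha(nxy, nxy.sum(axis=1)))
```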
Smoothing is important when we estimate the Dirichlet parameters with moment matching. From the equations above, we can see that when one sub-topic y does not get sampled from super-topic x in one iteration, α_{xy} will become 0. Furthermore, from the Gibbs sampling equation, we know that this sub-topic will then never have the chance to be sampled again by this super-topic. We therefore introduce a prior in the calculation of the sample means so that mean_{xy} will not be 0 even if n_{xy}^{(d)} is 0 for every document d.

3. Experimental Results

In this section, we present example topics that PAM discovers from real-world text data and evaluate against LDA using three measures: topic clarity by human judgement, likelihood of heldout test data and accuracy of document classification.

In the experiments we discuss below, we use a fixed four-level hierarchical structure for PAM, which includes a root, a set of super-topics, a set of sub-topics and a word vocabulary. For the root, we always assume a fixed Dirichlet distribution with parameter 0.01. We can change this parameter to adjust the variance in the sampled multinomial distributions. We choose a small value so that the variance is high and each document contains only a small number of super-topics, which tends to make the super-topics more interpretable. We treat the sub-topics in the same way as LDA, i.e., we assume they are sampled once for the whole corpus from a given Dirichlet with parameter 0.01. So the only parameters we need to learn are the Dirichlet parameters for the super-topics.

For both PAM and LDA, we perform Gibbs sampling to train the model. We start with 2000 burn-in iterations, and then draw a total of 10 samples in the following 1000 iterations. The total training time for the NIPS dataset (as described in Section 3.2) is approximately 20 hours on a 2.4 GHz Opteron machine with 2GB of memory.
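For reference, the experimental settings described above can be collected into a single configuration sketch; the field names are this sketch's own, and the topic counts are the values used in the experiments below.

```python
# Settings described in the text (field names are illustrative, not from the paper).
PAM_EXPERIMENT = {
    "num_super_topics": 50,         # used in Sections 3.1 and 3.2
    "num_sub_topics": 100,          # varied from 20 to 180 in Section 3.3
    "root_dirichlet": 0.01,         # fixed Dirichlet parameter for the root
    "sub_topic_dirichlet": 0.01,    # fixed Dirichlet parameter over words
    "burn_in_iterations": 2000,
    "post_burn_in_iterations": 1000,
    "num_samples_drawn": 10,
}
```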

3.1. Topic Examples

The first dataset we use comes from Rexa, a search engine over research papers. We randomly choose a subset of abstracts from its large collection. In this dataset, there are 4000 documents, 278438 word tokens and 25597 unique words. In Table 1 and Figure 3, we show some of the topics PAM discovers using 50 super-topics and 100 sub-topics. Table 1 displays five sub-topic examples. Each column corresponds to one sub-topic and lists the top 10 words and their probabilities. Figure 3 shows a subset of super-topics in the data, and how they capture correlations among sub-topics. For each super-topic x, we rank the sub-topics {y} based on the learned Dirichlet parameters α_{xy}. In Figure 3, each circle corresponds to one super-topic and links to a set of sub-topics as shown in the boxes, which are selected from its top-10 list. The numbers on the edges are the corresponding α values. As we can see, all the super-topics share the same sub-topic in the middle, which is a subset of stopwords in this corpus. Some super-topics also share the same content sub-topics. For example, the topics about scheduling and tasks co-occur with the topic about agents and also with the topic about distributed systems. Another example is information retrieval, which is discussed along with both the data mining topic and the web and network topic.

speech 0.0694        agents 0.0909          market 0.0281        students 0.0619      performance 0.0684
recognition 0.0562   agent 0.0810           price 0.0218         education 0.0445     parallel 0.0582
text 0.0441          plan 0.0364            risk 0.0191          learning 0.0332      memory 0.0438
word 0.0315          actions 0.0336         find 0.0145          training 0.0309      processors 0.0211
words 0.0289         planning 0.0260        markets 0.0138       children 0.0281      cache 0.0207
system 0.0194        communication 0.0246   information 0.0126   teaching 0.0197      parallelism 0.0169
algorithm 0.0194     world 0.0198           prices 0.0123        school 0.0185        execution 0.0169
task 0.0183          decisions 0.0194       equilibrium 0.0116   student 0.0180       programs 0.0156
acoustic 0.0183      situation 0.0165       financial 0.0116     educational 0.0146   machine 0.0150
training 0.0173      decision 0.0151        evidence 0.0111      quality 0.0129       message 0.0147

Table 1. Example Topics in PAM from Rexa dataset

Figure 3. Topic Correlation in PAM. Each circle corresponds to a super-topic and each box corresponds to a sub-topic. One super-topic can connect to several sub-topics and capture their correlation. The numbers on the edges are the corresponding α values for the (super-topic, sub-topic) pair.

3.2. Human Judgement

We provided each of five independent evaluators a set of topic pairs. Each pair contains one topic from PAM and another from LDA. Evaluators were asked to choose the one with the stronger sense of semantic coherence and specificity.

These topics were generated using the NIPS abstract dataset (NIPS00-12), which includes 1647 documents, a vocabulary of 11708 words and 114142 word tokens. We use 100 topics for LDA, and 50 super-topics and 100 sub-topics for PAM. The topic pairs are created based on similarity. For each sub-topic in PAM, we find its most similar topic in LDA and present them as a pair. We also find the most similar sub-topic in PAM for each LDA topic. Similarity is measured by the KL-divergence between topic distributions over words. After removing redundant pairs and dissimilar pairs that share fewer than 5 of their top 20 words, we provide the evaluators with a total of 25 pairs. We present four example topic pairs in Table 2. There are 5 PAM topics that every evaluator agrees to be the better ones in their pairs, while LDA has none. Out of 25 pairs, 19 topics from PAM are chosen by the majority (≥ 3 votes). We show the full evaluation results in Table 3.
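The pairing step described above can be sketched as computing the KL-divergence between topic-word distributions and matching each PAM sub-topic with its closest LDA topic. The code below is an assumed illustration of that step, not the authors' code; the smoothing constant is an assumption used to keep the divergence finite.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL divergence between two topic-word multinomials."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def nearest_topics(pam_topics, lda_topics):
    """For each PAM sub-topic, return the index of its most similar LDA topic."""
    pairs = []
    for i, p in enumerate(pam_topics):
        j = int(np.argmin([kl_divergence(p, q) for q in lda_topics]))
        pairs.append((i, j))
    return pairs

# Toy usage with random topic-word distributions over a 6-word vocabulary.
rng = np.random.default_rng(2)
pam_topics = rng.dirichlet(np.ones(6), size=3)
lda_topics = rng.dirichlet(np.ones(6), size=4)
print(nearest_topics(pam_topics, lda_topics))
```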

PAM           LDA            PAM            LDA
control       control        motion         image
systems       systems        image          motion
robot         based          detection      images
adaptive      adaptive       images         multiple
environment   direct         scene          local
goal          con            vision         generated
state         controller     texture        noisy
controller    change         segmentation   optical
5 votes       0 votes        4 votes        1 vote

PAM           LDA            PAM            LDA
signals       signal         algorithm      algorithm
source        signals        learning       algorithms
separation    single         algorithms     gradient
eeg           time           gradient       convergence
sources       low            convergence    stochastic
blind         source         function       line
single        temporal       stochastic     descent
event         processing     weight         converge
4 votes       1 vote         1 vote         4 votes

Table 2. Example Topic Pairs in Human Judgement

             LDA    PAM
5 votes        0      5
≥ 4 votes      3      8
≥ 3 votes      9     16

Table 3. Human Judgement Results. For all the categories (5 votes, ≥ 4 votes and ≥ 3 votes), PAM has more topics that are considered better than LDA.

3.3. Likelihood Comparison

Besides human evaluation of topics, we also provide quantitative measurements to compare PAM with LDA. In this experiment, we use the same NIPS dataset and split it into two subsets with 75% and 25% of the data respectively. Then we learn the models from the larger set and calculate the likelihood of the smaller set. We still use 50 super-topics for PAM, but the number of sub-topics varies from 20 to 180.

In order to calculate the heldout data likelihood accurately, we need to integrate out the sampled multinomials and sum over all possible topic assignments. This has no closed-form solution. Previous work that uses Gibbs sampling for inference approximates the likelihood of a document d by taking the harmonic mean of a set of conditional probabilities P(d | z^{(d)}), where the samples are generated using Gibbs sampling (Griffiths & Steyvers, 2004). However, this approach is a poor approximation across different models because it measures only the best topic assignments, not the full distribution over topic assignments.

In our experiment, we calculate the data likelihood with a more robust method based on extensive sampling. The first step is to randomly generate 1000 documents from the trained model, based on its own generative process. Then we estimate a new simple mixture model from the synthetic data, where each topic perfectly explains one document. Finally, we calculate the probability of the test data under the new model. Since the mixture model assumes only one topic for each test document, we can easily calculate the exact probability by summing over all possible topics.
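As a hedged sketch of this evaluation procedure (not the authors' code), the synthetic documents generated from a trained model can be turned into a simple mixture, under which the test likelihood has a closed form; the smoothing constant and document representation are assumptions of the sketch.

```python
import numpy as np

def mixture_from_synthetic(synthetic_docs, vocab_size, smooth=0.01):
    """Build a simple mixture where each component explains one synthetic document."""
    components = np.full((len(synthetic_docs), vocab_size), smooth)
    for k, doc in enumerate(synthetic_docs):
        for w in doc:                       # doc is a sequence of word indices
            components[k, w] += 1.0
    return components / components.sum(axis=1, keepdims=True)

def test_log_likelihood(components, test_docs):
    """Exact log-likelihood of the test documents under a uniform mixture."""
    log_comp = np.log(components)
    total = 0.0
    for doc in test_docs:
        per_component = np.array([log_comp[k, doc].sum() for k in range(len(components))])
        # log P(doc) = logsumexp_k log P(doc | k) - log K   (uniform mixture weights)
        total += np.logaddexp.reduce(per_component) - np.log(len(components))
    return total

# Toy usage: 'synthetic' would come from the trained model's own generative process.
rng = np.random.default_rng(3)
synthetic = [rng.integers(0, 20, size=50) for _ in range(1000)]
test_docs = [rng.integers(0, 20, size=30) for _ in range(5)]
print(test_log_likelihood(mixture_from_synthetic(synthetic, vocab_size=20), test_docs))
```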

We show the log-likelihood of the test data in Figure 4, including the means and standard errors over 10 samples. As we can see, PAM always produces higher likelihood for different numbers of sub-topics. The advantage is especially obvious for large numbers of topics. LDA performance peaks at 40 topics and decreases as the number of topics increases. On the other hand, PAM supports larger numbers of topics and has its best performance at 160 sub-topics. We also present the likelihood for different numbers of training documents in Figure 5. Since PAM has more parameters to estimate than LDA, it does not perform as well when there is a limited amount of training data.

Blei and Lafferty also compared CTM with LDA using likelihood on heldout test data. While we both used the NIPS dataset, the result of PAM is not directly comparable to CTM because of different likelihood estimation procedures (their variational and our sampling-based method, each used on both LDA and the new model). However, we can make an indirect comparison. As reported by Blei and Lafferty (2006), CTM obtains a 0.40% increase over the best result of LDA in log-likelihood. On the other hand, PAM increases log-likelihood over the best LDA by as much as 1.18%. This comparison is admittedly preliminary, and more parallel comparisons are forthcoming.

Figure 4. Likelihood Comparison with Different Numbers of Topics

3.4. Document Classification

Another evaluation comparing PAM with LDA is document classification. We conduct a 5-way classification on the comp subset of the 20 newsgroups dataset, which contains 4836 documents with a vocabulary size of 35567 words. Each class of documents is divided into 75% training and 25% test data. We train a model for each class and calculate the likelihood of the test data. A test document is considered correctly classified if its corresponding model produces the highest likelihood. We present the classification accuracies for both PAM and LDA in Table 4. According to the sign test, the improvement of PAM over LDA is statistically significant with a p-value < 0.05.

class        # docs    LDA      PAM
graphics        243    83.95    86.83
os              239    81.59    84.10
pc              245    83.67    88.16
mac             239    86.61    89.54
windows.x       243    88.07    92.20
total          1209    84.70    87.34

Table 4. Document Classification Accuracies (%)
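The classification rule just described, one trained model per class and assignment by highest likelihood, can be sketched as follows; the scoring functions here are stand-ins for per-class topic models and are an assumption of the sketch.

```python
def classify(doc, class_models):
    """Assign a document to the class whose model gives it the highest log-likelihood.

    class_models maps a class name to a function doc -> log-likelihood, e.g. the
    heldout-likelihood estimator for a PAM or LDA model trained on that class.
    """
    return max(class_models, key=lambda name: class_models[name](doc))

# Toy usage with placeholder scorers; real ones would wrap trained topic models.
models = {
    "graphics": lambda d: -5.0 * len(d),
    "os":       lambda d: -4.5 * len(d),
}
print(classify(["window", "process", "kernel"], models))
```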

4. Related Work

Previous work in document summarization has explored the possibility of building a topic hierarchy with a probabilistic language model (Lawrie et al., 2001). They capture the dependence between a topic and its children with relative entropy and represent it in a graph of conditional probabilities. Unlike PAM, which simultaneously learns all the topic correlations at different levels, this model incrementally builds the hierarchy by identifying topic terms for individual levels using a greedy approximation to the Dominating Set Problem.

Hierarchical LDA (hLDA) is a variation of LDA that assumes a hierarchical structure among topics. Topics at higher levels are more general, such as stopwords, while the more specific words are organized into topics at lower levels. To generate a document, it samples a topic path from the hierarchy and then samples every word from those topics. Thus hLDA can well explain a document that discusses a mixture of computer science, artificial intelligence and robotics. However, the document cannot, for example, cover both robotics and natural language processing under the more general topic artificial intelligence. This is because a document is sampled from only one topic path in the hierarchy. Compared to hLDA, PAM provides more flexibility because it samples a topic path for each word instead of each document. Note that it is possible to create a DAG structure in PAM that would capture hierarchically nested word distributions, and thus obtain the advantages of both models.

Figure 5. Likelihood Comparison with Different Amounts of Training Data

Another model that captures the correlations among topics is the correlated topic model introduced by Blei and Lafferty (2006). The basic idea is to replace the Dirichlet distribution in LDA with a logistic normal distribution. Under a single Dirichlet, the topic mixture components for every document are sampled almost independently from each other. Instead, a logistic normal distribution can capture the pairwise correlations between topics based on a covariance matrix. Although CTM and PAM both try to model topic correlations directly, PAM takes a more general approach. In fact, CTM is very similar to a special-case structure of PAM, where we create one super-topic for every pair of sub-topics. Not only is CTM limited to pairwise correlations, it must also estimate parameters for each possible pair in the covariance matrix, which grows as the square of the number of topics. In contrast, with PAM we do not need to model every pair of topics but only sparse mixtures of correlations, as determined by the number of super-topics.

In this paper, we have only described PAM with fixed DAG structures. However, we will explore the possibility of learning the structure in the future. This work is closely related to the hierarchical Dirichlet process (HDP) (Teh et al., 2005). A Dirichlet prior is a distribution over parameter vectors of multinomial distributions. The Dirichlet process can be viewed as an extension in which there is an infinite number of mixture components, and it has been used to learn the number of topics in LDA. However, the Dirichlet prior is not restricted to parameters of multinomials only; it can also be used to sample parameter vectors for a Dirichlet distribution. Therefore, we can construct a hierarchical structure among the Dirichlets. One planned area of future work is to use Dirichlet processes to help learn the number of topics at different levels. But unlike HDP, where the documents are pre-divided into different subsets according to the hierarchical structure, we will organize them automatically in PAM.

5. Conclusion

In this paper, we have presented Pachinko allocation, a mixture model that uses a DAG structure to capture arbitrary topic correlations. Each leaf in the DAG is associated with a word in the vocabulary, and each interior node corresponds to a topic that represents a correlation among its children. A topic can be the parent not only of words, but also of other topics. The DAG structure is completely general, and some topic models such as LDA can be represented as special cases of PAM. Compared to other approaches that capture topic correlations, such as hierarchical LDA and the correlated topic model, PAM provides more expressive power to support complicated topic structures and adopts more realistic assumptions for generating documents.

References

Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process.

Blei, D., & Lafferty, J. (2006). Correlated topic models. In Advances in Neural Information Processing Systems 18.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences (pp. 5228–5235).

Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, 537–544. Cambridge, MA: MIT Press.

Lawrie, D., Croft, W., & Rosenberg, A. (2001). Finding topic words for hierarchical summarization. Proceedings of SIGIR'01 (pp. 349–357).

Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. Proceedings of AAAI'05.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Banff, Alberta, Canada.

Teh, Y., Jordan, M., Beal, M., & Blei, D. (2005). Hierarchical Dirichlet processes. Journal of the American Statistical Association.