Nonparametric Bayes Pachinko Allocation

LI ET AL. 243 Nonparametric Bayes Pachinko Allocation Wei Li David Blei Andrew McCallum Department of Computer Science Computer Science Department Department of Computer Science University of Massachusetts Princeton University University of Massachusetts Amherst, MA 01003 Princeton, NJ 08540 Amherst, MA 01003 Abstract One example is the correlated topic model (CTM) (Blei & Lafferty, 2006). CTM represents each docu- Recent advances in topic models have ex- ment as a mixture of topics, where the mixture pro- plored complicated structured distributions portion is sampled from a logistic normal distribution. to represent topic correlation. For example, The parameters of this prior include a covariance ma- the pachinko allocation model (PAM) cap- trix in which each entry specifies the covariance be- tures arbitrary, nested, and possibly sparse tween a pair of topics. Therefore, topic occurrences correlations between topics using a directed are no longer independent from each other. However, acyclic graph (DAG). While PAM provides CTM is limited to pairwise correlations only, and the more flexibility and greater expressive power number of parameters in the covariance matrix grows than previous models like latent Dirichlet al- as the square of the number of topics. location (LDA), it is also more difficult to de- An alternative model that provides more flexibility to termine the appropriate topic structure for a describe correlations among topics is the pachinko al- specific dataset. In this paper, we propose a location model (PAM) (Li & McCallum, 2006), which nonparametric Bayesian prior for PAM based uses a directed acyclic graph (DAG) structure to rep- on a variant of the hierarchical Dirichlet pro- resent and learn arbitrary-arity, nested, and possibly cess (HDP). Although the HDP can cap- sparse topic correlations. Each leaf node in the DAG ture topic correlations defined by nested data represents a word in the vocabulary, and each interior structure, it does not automatically discover node corresponds to a topic. A topic in PAM can be such correlations from unstructured data. By not only a distribution over words, but also a distribu- assuming an HDP-based prior for PAM, we tion over other topics. Therefore it captures inter-topic are able to learn both the number of topics correlations as well as word correlations. The distribu- and how the topics are correlated. We eval- tion of a topic over its children could be parameterized uate our model on synthetic and real-world arbitrarily. One example is to use a Dirichlet distri- text datasets, and show that nonparamet- bution, from which a multinomial distribution over its ric PAM achieves performance matching the children is sampled on a per-document-basis. To gen- best of PAM without manually tuning the erate a word in a document in PAM, we start from the number of topics. root, samples a topic path based on these multinomials and finally samples the word. 1 Introduction The DAG structure in PAM is extremely flexible. It could be a simple tree (hierarchy), or an arbitrary DAG, with cross-connected edges, and edges skipping Statistical topic models such as latent Dirichlet alloca- levels. Some other models can be viewed as special tion (LDA) (Blei et al., 2003) have been shown to be cases of PAM. For example, LDA corresponds to a effective tools in topic extraction and analysis. These three-level hierarchy consisting of one root at the top, models can capture word correlations in a collection of a set of topics in the middle and a word vocabulary at textual documents with a low-dimensional set of multi- the bottom. The root is fully connected to all the top- nomial distributions. Recent work in this area has in- ics, and each topic is fully connected to all the words. vestigated richer structures to also describe inter-topic The structure of CTM can be described with a four- correlations, and led to discovery of large numbers of level DAG in PAM, in which there are two levels of more accurate, fine-grained topics. 244 LI ET AL. topic nodes between the root and words. The lower- 2 The Model level consists of traditional LDA-style topics. In the upper level there is one node for every topic pair, cap- 2.1 Four-Level PAM turing pairwise correlations. Pachinko allocation captures topic correlations with a While PAM provides a powerful means to describe directed acyclic graph (DAG), where each leaf node is inter-topic correlations and extract large numbers of associated with a word and each interior node corre- fine-grained topics, it has the same practical difficulty sponds to a topic, having a distribution over its chil- as many other topic models, i.e. how to determine dren. An interior node whose children are all leaves the number of topics. It can be estimated using cross- would correspond to a traditional LDA topic. But validation, but this method is not efficient even for some interior nodes may also have children that are simple topic models like LDA. Since PAM has a more other topics, thus representing a mixture over topics. complex topic structure, it is more difficult to evaluate all possibilities and select the best one. While PAM allows arbitrary DAGs to model topic correlations, in this paper we focus on one special class Another approach to this problem is to automatically of structures. Consider a four-level hierarchy consist- learn the number of topics with a nonparametric prior ing of one root topic, s2 topics at the second level, such as the hierarchical Dirichlet process (HDP) (Teh s3 topics at the third level, and words at the bot- et al., 2005). HDPs are intended to model groups tom. We call the topics at the second level super-topics of data that have a pre-defined hierarchical structure. and the ones at the third level sub-topics. The root Each pre-defined group is associated with a Dirichlet is connected to all super-topics, super-topics are fully process whose base measure is sampled from a higher- connected to sub-topics and sub-topics are fully con- level Dirichlet process. Note that HDP cannot au- nected to words. For the root and each super-topic, tomatically discover topic correlations from unstruc- we assume a Dirichlet distribution parameterized by tured data. However, it has been applied to LDA a vector with the same dimension as the number of where it integrates over (or alternatively selects) the children. Each sub-topic is associated with a multi- appropriate number of topics. nomial distribution over words, sampled once for the In this paper, we propose a nonparametric Bayesian whole corpus from a single Dirichlet distribution. To prior for pachinko allocation based on a variant of generate a document, we first draw a set of multino- the hierarchical Dirichlet process. We assume that mials from the Dirichlet distributions at the root and the topics in PAM are organized into multiple levels super-topics. Then for each word in the document, we and each level is modeled with an HDP to capture un- sample a topic path consisting of the root, super-topic certainty in the number of topics. Unlike a standard and sub-topic based on these multinomials. Finally, HDP mixture model, where the data has a pre-defined we sample a word from the sub-topic. nested structure, we build HDPs based on dynamic As with many other topic models, we need to spec- groupings of data according to topic assignments. The ify the number of topics in advance for the four- nonparametric PAM can be viewed as an extension to level PAM. It is inefficient to manually examine every fixed-structure PAM where the number of topics at (s2, s3) pair in order to find the appropriate structure. each level is taken to infinity. To generate a docu- Therefore, we develop a nonparametric Bayesian prior ment, we first sample multinomial distributions over based on the Dirichlet process to automatically deter- topics from corresponding HDPs. Then we repeatedly mine the numbers of topics. As a side effect, this also sample a topic path according to the multinomials for discovers a sparse connectivity between super-topics each word in the document. and sub-topics. We present our model in terms of Chi- The rest of the paper is organized as follows. We de- nese restaurant process. tail the nonparametric pachinko allocation model in Section 2, describing its generative process and infer- 2.2 Chinese Restaurant Process ence algorithm. Section 3 presents experimental re- sults on synthetic data and real-world text data. We The Chinese restaurant process (CRP) is a distribu- compare discovered topic structures with true struc- tion on partitions of integers. It assumes a Chinese tures for various synthetic settings, and likelihood on restaurant with an infinite number of tables. When a held-out test data with PAM, HDP and hierarchical customer comes, he sits at a table with the following LDA (Blei et al., 2004) on two datasets. Section 4 re- probabilities: views related work, followed by conclusions and future C(t) work in Section 5. • P (an occupied table t) = P 0 , t0 C(t )+α α • P (an unoccupied table) = P 0 , t0 C(t )+α LI ET AL. 245 Name Description 1. He chooses the kth entryway ejk in the restau- rj the jth restaurant. rant from CRP({C(j, k)}k, α0), where C(j, k) is ejk the kth entryway in the jth restaurant. the number of customers that entered the kth en- cl the lth category. Each entryway ejk is associated tryway before in this restaurant. with a category. 2. If e is a new entryway, a category c is assigned t the nth table in the jth restaurant that has jk l jln to it from CRP({P C(l, j0)} , γ ), where C(l, j0) category c .

Load more