Topic Models for Taxonomies

Anton Bakalov, Andrew McCallum, David Mimno, Hanna Wallach

Bakalov, McCallum, and Wallach: Dept. of Computer Science, University of Massachusetts, Amherst, MA 01003; {abakalov,mccallum,wallach}@cs.umass.edu
Mimno: Dept. of Computer Science, Princeton University, Princeton, NJ 08540; [email protected]

ABSTRACT

Concept taxonomies such as MeSH, the ACM Computing Classification System, and the NY Times Subject Headings are frequently used to help organize data. They typically consist of a set of concept names organized in a hierarchy. However, these names and structure are often not sufficient to fully capture the intended meaning of a taxonomy node, and particularly non-experts may have difficulty navigating and placing data into the taxonomy. This paper introduces two semi-supervised topic models that automatically augment a given taxonomy with many additional keywords by leveraging a corpus of multi-labeled documents. Our experiments show that users find the topics beneficial for taxonomy interpretation, substantially increasing their cataloging accuracy. Furthermore, the models provide a better information rate compared to Labeled LDA [7].

Categories and Subject Descriptors

I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis; H.3.7 [Information Systems]: Digital Libraries; G.3 [Mathematics of Computing]: Probability and Statistics—Statistical computing

Keywords

Topic modeling, Taxonomy annotation, Taxonomy browsing

1. INTRODUCTION

Many organizations such as the Association for Computing Machinery (ACM) use taxonomies of classes as structured metadata to facilitate browsing of their document libraries. Users browse and search these libraries by navigating a hierarchy of named concept nodes. When new documents are added to the library they must also be assigned one or more concept names. This task is often performed by the document authors themselves, not trained catalogers.

Unfortunately, the concept node names often do not describe the concept in sufficient detail for unfamiliar users to fully understand the topics a node is intended to capture. We present a user study (Section 4.3) confirming that substantial inaccuracies arise when asking computer science graduate students to assign a research paper to nodes of the ACM taxonomy when given its node names and hierarchical structure. Users could gain a better understanding by reading titles of papers that have previously been accurately assigned, but this would be time consuming. Taxonomy builders could augment each concept name with a list of keywords that delineate the concept, but this task would be burdensome to perform manually for large taxonomies, and it would furthermore need to be redone frequently since new ideas and topics within a concept arise over time.

This paper presents two semi-supervised topic models that automatically discover lists of relevant keywords for taxonomic concepts. The models, termed Labeled Pachinko Allocation (LPAM) and LPAM-List, take as input an existing taxonomy as well as documents with their concept assignments. They then run inference in a latent-Dirichlet-allocation-like manner in which there is one topic per concept node, and the set of topics allowable in a document is restricted to its taxonomic concept labels, augmented with their ancestors in the hierarchy (a small sketch of this label-augmentation step is given at the end of this section). Document terms are assigned to the topics, and the highest weighted words in each topic become the concept keywords. Notably, even though most documents are labeled with multiple concepts and all concepts pull in their ancestors, the model is able to partition the keywords into their appropriate concepts. Furthermore, incorporating ancestor nodes enables our models (a) to discover keywords for more abstract concepts located close to the root of the taxonomy, and (b) to separate more generic words out of the taxonomy leaves.

Multiple previous papers have focused on learning topic hierarchies (e.g., [2, 4]). We address the complementary problem of leveraging a given human-defined hierarchy such as ACM's Computing Classification System. Like our work, some other methods sample paths to nodes in a given taxonomy (e.g., the Hierarchical Concept Topic Model (HCTM) [9], Hierarchical Pachinko Allocation (HPAM) [5], and Multilingual Supervised LDA (ML-SLDA) [3]). However, LPAM leverages the available labels to select a subtree of the taxonomy from which to generate a given document. Furthermore, LPAM differs from HCTM in that each node has a distribution over the whole vocabulary, not over a pre-specified subset.

The second model we propose, LPAM-List, represents a document as a mixture of the same topic nodes as LPAM, but does not use the tree structure in its generative process. This makes it more similar to Labeled LDA [7], which also constrains the choice of topics to those associated with the document's assigned concepts. However, LPAM-List additionally incorporates the taxonomic ancestors of the assigned nodes. This simple change enables it to learn about both leaves and interior nodes of the taxonomy. Similarly to Labeled LDA, Newman et al. [6] also use a non-hierarchical set of concept labels, but rather than discovering a distribution over words for each concept, they instead estimate a distribution over topics for each concept; this is accomplished by using document labels as authors in the author-topic model [8].

Our experiments show that leveraging the internal structure of taxonomies gives our methods two advantages. First, we obtain a better information rate compared to Labeled LDA (Section 4.2), indicating that our models more precisely capture characteristics of the data. Second, we obtain a list of keywords for each concept in the taxonomy, including the high-level interior concepts that do not usually appear as labels. Our user study shows that our concept description keywords help people interpret a taxonomy, as measured by the accuracy of concept label assignments (Section 4.3). In contrast, Labeled LDA [7] constrains the topics only to those associated with the document labels.
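As a concrete illustration of the label-augmentation step described above, the following minimal Python sketch builds the restricted node set C_d for one document from its concept labels and a child-to-parent map of the taxonomy. The function and variable names (restricted_nodes, parent_of) are our own illustrative choices, not part of the authors' implementation.

```python
def restricted_nodes(labels, parent_of, root="Root"):
    """Return C_d: the document's concept labels plus all of their taxonomy ancestors."""
    nodes = set()
    for label in labels:
        node = label
        while node not in nodes:
            nodes.add(node)          # add the node, then walk up toward the root
            if node == root:
                break
            node = parent_of[node]
    return nodes

# Example: the labeled document shown in Figure 1.
parent_of = {
    "Info. Search and Retrieval": "Info. Storage and Retrieval",
    "Info. Storage and Retrieval": "Info. Systems",
    "Info. Systems": "Root",
}
print(restricted_nodes({"Info. Search and Retrieval"}, parent_of))
# {'Info. Search and Retrieval', 'Info. Storage and Retrieval', 'Info. Systems', 'Root'}
```

Both models use this set: LPAM restricts the paths it samples to nodes in C_d, and LPAM-List treats C_d as the allowed topic set for the document.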

2. MODELS

2.1 Labeled Pachinko Allocation

LPAM is a semi-supervised topic model for documents labeled with nodes from a taxonomy. It builds on latent Dirichlet allocation (LDA) [1] and hierarchical pachinko allocation [5] by leveraging additional structure, as described in this section. For each document d, it considers a restricted set of taxonomy nodes C_d consisting of the document labels augmented with their ancestors in the hierarchy. For example, the document in Figure 1 is a mixture of the nodes Info. Search and Retrieval, Info. Storage and Retrieval, Info. Systems, and Root. Each node c ∈ C_d defines a multinomial distribution θ^(d,c) over its child nodes in C_d plus an additional one termed the exit. Also, each node r in the taxonomy is associated with a multinomial distribution φ^(r) over the vocabulary. Like some other models such as HPAM, LPAM represents a distribution over paths using multinomials over child nodes; however, LPAM additionally enforces path constraints based on labels.

[Figure 1: The first two sentences from the abstract of a paper about information retrieval [11] ("Dealing with verbose (or long) queries poses a new challenge for information retrieval. Selecting a subset of the original query (a 'sub-query') has been shown to be an effective method ..."), shown beneath a fragment of the ACM taxonomy: Root; Info. Systems; Software; Info. Storage and Retrieval; Database Management; Software Engineering; Info. Search and Retrieval; Systems; Metrics. The publication is labeled with the shaded node from the ACM taxonomy; some of the neighboring nodes are also displayed.]

LPAM's generative process operates as follows:

1. For each node r, draw φ^(r) ∼ Dir(β).
2. For each document d:
   (a) For each node c ∈ C_d, draw θ^(d,c) ∼ Dir(α^(d,c)).
   (b) For each word w:
       - Draw r ∼ Mult(θ^(d,root)).
       - While r is not an exit, draw r ∼ Mult(θ^(d,r)).
       - Draw w ∼ Mult(φ^(a)), where a is r's parent.
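The following minimal Python sketch illustrates the generative process above for a single document. It assumes the restricted node set C_d, the per-node child distributions θ^(d,c) (each including an extra "exit" outcome), and the per-node word distributions φ have already been drawn from their Dirichlet priors; all names (generate_document, draw, etc.) are illustrative rather than taken from the authors' code.

```python
import random

def draw(dist):
    """Sample a key from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def generate_document(num_words, root, theta, phi):
    """Generate one document under LPAM's generative process (sketch).

    root  -- name of the taxonomy root node
    theta -- theta[node]: distribution over node's children in C_d plus "exit"
    phi   -- phi[node]: distribution over the vocabulary for that node
    """
    words = []
    for _ in range(num_words):
        parent, node = None, root
        # Walk down the restricted subtree until the "exit" outcome is drawn.
        while node != "exit":
            parent, node = node, draw(theta[node])
        # The node that drew "exit" (the exit's parent) emits the word.
        words.append(draw(phi[parent]))
    return words
```

Inference inverts this process: for every token, the Gibbs sampler below resamples the emitting node (equivalently, the unique path to it through C_d) conditioned on the current counts.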

We train the models by Gibbs sampling. The probability of choosing node c as the topic of the i'th token (w_i), given the remaining topic assignments (z_{\setminus i}) and all words (w), is

    P(z_i = c \mid z_{\setminus i}, w, U) \propto \prod_{j=2}^{k} \frac{n^{(d)}_{c_{j-1},c_j} + \alpha^{(d,c_{j-1})}_{c_j}}{\sum_{r} \bigl( n^{(d)}_{c_{j-1},r} + \alpha^{(d,c_{j-1})}_{r} \bigr)} \cdot \frac{n^{(w)}_{c} + \beta_w}{\sum_{m} \bigl( n^{(m)}_{c} + \beta_m \bigr)}

where U is the set of hyperparameters; n^{(d)}_{c_{j-1},c_j} is the number of times node c_j is visited from its parent c_{j-1} for document d; n^{(w)}_c is the number of times word w is assigned to node c; and k is the level at which the exit is located, i.e., c_1 is the root, c_{k-1} is the node generating the word, and c_k is the exit. The contribution of the token being sampled is removed from these counts.

The first term on the right-hand side is the probability of traversing the path to node c and choosing to emit from it. The second term is the probability of generating word w from the word distribution associated with concept c.
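Because the taxonomy is a tree, each candidate node c ∈ C_d corresponds to exactly one root-to-c path, so the update above can be evaluated by a simple product over that path. The sketch below computes the unnormalized weights for one token under stated assumptions; the data structures (nested count dictionaries, a precomputed path_to map) and all names are illustrative, not the authors' implementation.

```python
def lpam_token_weights(word, candidates, path_to, doc_trans, alpha,
                       word_counts, node_totals, beta, beta_sum):
    """Unnormalized P(z_i = c | ...) for one token of one document (sketch).

    path_to     -- path_to[c]: list of nodes from the root down to c
    doc_trans   -- doc_trans[parent][child]: n^(d) transition counts for this document,
                   with the current token's own contribution already subtracted
    alpha       -- alpha[parent][child]: Dirichlet pseudo-counts, including an "exit" entry
    word_counts -- word_counts[c][word]: n^(w)_c
    node_totals -- node_totals[c]: total tokens currently assigned to node c
    beta, beta_sum -- symmetric word prior and its sum over the vocabulary
    """
    weights = {}
    for c in candidates:
        steps = path_to[c] + ["exit"]            # root, ..., c, exit
        prob = 1.0
        for parent, child in zip(steps, steps[1:]):
            numer = doc_trans[parent].get(child, 0) + alpha[parent][child]
            denom = sum(doc_trans[parent].get(r, 0) + alpha[parent][r]
                        for r in alpha[parent])
            prob *= numer / denom
        prob *= (word_counts[c].get(word, 0) + beta) / (node_totals[c] + beta_sum)
        weights[c] = prob
    return weights
```

A full sampler would normalize these weights, draw a new node for the token, and then add the token's contribution back into the counts along the sampled path.
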
2.2 Labeled Pachinko Allocation - List

LPAM-List considers the same restricted set of topics C_d as LPAM, but sampling a path is not part of the generative process. If ψ^(d) is a multinomial distribution over the nodes in C_d, then the generative storyline is:

1. For each node r, draw φ^(r) ∼ Dir(β).
2. For each document d:
   (a) Draw ψ^(d) ∼ Dir(α^(d)).
   (b) For each word w, draw c ∼ Mult(ψ^(d)) and w ∼ Mult(φ^(c)).

LPAM-List's sampling equation is

    P(z_i = c \mid z_{\setminus i}, w, \alpha^{(d)}, \beta) \propto \bigl( n^{(d)}_{c} + \alpha^{(d)}_{c} \bigr) \frac{n^{(w)}_{c} + \beta_w}{\sum_{m} \bigl( n^{(m)}_{c} + \beta_m \bigr)}

where z_i is the topic of the i'th token w_i; z_{\setminus i} are the topic assignments of the remaining tokens; c ∈ C_d; n^{(d)}_c is the number of times node c is sampled in document d; and n^{(w)}_c is the number of times word w is assigned to node c. As in LPAM's sampling equation, the contribution of the token being sampled is removed from these counts.
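The corresponding Gibbs update is even simpler than LPAM's, since there is no path product. A minimal sketch follows, with the same caveats as above (illustrative names and data structures, not the authors' code; α^(d) is treated as a single symmetric pseudo-count for brevity):

```python
def lpam_list_token_weights(word, candidates, doc_node_counts, alpha,
                            word_counts, node_totals, beta, beta_sum):
    """Unnormalized LPAM-List weights P(z_i = c | ...) for one token (sketch).

    doc_node_counts -- n^(d)_c: tokens in this document currently assigned to node c
    alpha           -- symmetric document-level Dirichlet pseudo-count
    word_counts     -- word_counts[c][word]: n^(w)_c
    node_totals     -- total tokens currently assigned to node c
    """
    return {
        c: (doc_node_counts.get(c, 0) + alpha)
           * (word_counts[c].get(word, 0) + beta) / (node_totals[c] + beta_sum)
        for c in candidates
    }
```
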

3. DATASETS AND PREPROCESSING

We perform experiments on two datasets: (1) abstracts of scientific publications from the ACM digital library, and (2) the first 300 tokens of articles from the New York Times (NYT) dataset. We remove stopwords and tokens that appear fewer than five times in the corresponding corpus.

Documents from each dataset are labeled with node(s) from a dataset-specific taxonomy. These taxonomies have tree structures; however, our models can be trivially extended to handle directed acyclic graphs as well. ACM documents are tagged with at least one leaf node. We ignore the additional descriptors beyond the leaf-node labels. If a document is tagged with a node called "General", that implies the paper is relevant to most of the nodes in that subtree. The models we propose and Labeled LDA require a specific set of document labels, so we discard papers tagged with "General." These documents constitute only 4% of the dataset, and our goal of attaining concept keywords can be achieved without them.

NYT articles can be tagged with multiple leaf or internal nodes. For 98% of the documents, an ancestor of a document label is also used to tag the article. However, in merely 0.6% of the documents all the ancestors are labels. Table 1 contains more statistics.

Table 1: Dataset statistics.

Statistics            | ACM       | NYT
Taxonomy size         | 268       | 580
Taxonomy depth        | 4         | 6
Labels/doc            | 3.13      | 5.98
Nodes/doc             | 8.2       | 8.94
Branching nodes/doc   | 1.65      | 2.11
Number of docs        | 20,527    | 15,493
Vocabulary size       | 13,906    | 27,544
Num tokens            | 1,499,878 | 3,323,851
Tokens/doc            | 73.07     | 214.54

Taxonomy size is the number of nodes in the taxonomy. (We exclude nodes that are not used to label any document in our dataset and are not ancestors of label nodes; these concepts do not affect sampling and it is impossible to discover keywords for them.) Labels/doc is the average number of document tags. Nodes/doc is the average number of restricted nodes, i.e., label nodes and their ancestors. A branching node is a restricted node that has at least two child nodes. Tokens/doc is the number of tokens per document after removing stopwords.

4. EXPERIMENTAL RESULTS

Let 1^(N) be a vector of size N containing 1's. The parameters β, α^(d), and α^(d,c) are set, respectively, to 0.01 × 1^(V), 1^(|C_d|), and 1^(K), where V is the vocabulary size, C_d is the set of restricted nodes for document d, and K is the number of c's child nodes that are in C_d plus 1 (because of the exit).

4.1 Topic examples

Figure 2 shows the topics that LPAM and LPAM-List learn for a few ACM nodes: Database Applications and its ancestors. Labeled LDA considers only labels, so it produces topics only for leaf nodes such as Database Applications. We can see that the higher in the taxonomy the concepts are located, the more general the corresponding keywords are. The reason is that those concepts take part in the generation of many documents, so domain-specific stopwords (e.g., "paper" and "based", highlighted in bold in Figure 2) common to all these documents are driven higher up the tree. Labeled LDA does not use internal nodes; as a result, general words are prominent in the distributions corresponding to the leaf nodes.

Root:
  (LPAM) paper based system systems approach data time results analysis present
  (LPAM-List) paper based approach time systems data system work results analysis
Info. Systems:
  (LPAM) data information system paper based web user users management systems
  (LPAM-List) information system paper data based model systems user management users
Database Management:
  (LPAM) data query queries database databases problem view tree processing xml
  (LPAM-List) data query queries database databases problem view efficient algorithm algorithms
Database Applications:
  (LPAM) mining data spatial patterns gis algorithms temporal analysis discovery time
  (LPAM-List) data mining spatial knowledge patterns temporal gis complex time analysis
  (LLDA) data mining information paper spatial knowledge system based analysis time

Figure 2: Four taxonomy nodes along with the corresponding top 10 words learned by LPAM, LPAM-List and Labeled LDA (LLDA). Note that LLDA provides no keywords for the internal nodes and that the new models successfully pull domain-specific words to the top of the hierarchy (see the examples shown in bold).

4.2 Information rate

We set aside 10% of the data for testing (w^test) and 90% of the data for training (w^train). We use the "left-to-right" algorithm [10] to compute P(w_d^test | w^train, z^train, U), where U is the set of hyperparameters and w_d^test denotes the words in the d'th test document. This probability is averaged across 5 runs to calculate the estimate P̃_d, and the information rate is −ln P̃_d / N_d, where N_d is the number of tokens in the test document.
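As a small worked example of this metric, the sketch below turns several per-run log-likelihood estimates for one test document (e.g., the five left-to-right estimates) into the per-token information rate; averaging the probabilities rather than the log-probabilities is done in log space for numerical stability. This is our own illustrative reading of the formula, not the authors' evaluation code.

```python
import math

def information_rate(log_likelihoods, num_tokens):
    """Per-token information rate -ln(P~_d) / N_d for one test document (sketch).

    log_likelihoods -- ln P(w_d^test | ...) from each of several sampling runs
    num_tokens      -- N_d, the number of tokens in the test document
    """
    # P~_d is the average of the probabilities across runs; compute it via log-sum-exp.
    m = max(log_likelihoods)
    log_p_tilde = m + math.log(sum(math.exp(ll - m) for ll in log_likelihoods))
    log_p_tilde -= math.log(len(log_likelihoods))
    return -log_p_tilde / num_tokens
```

Lower values indicate that the model assigns higher probability to held-out text.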

Because of limited space, we present in Figure 3 the average information rate obtained using one of the datasets, the ACM corpus. We compare the models using a two-tailed paired t-test. LPAM consistently outperforms LPAM-List, which in turn is better than Labeled LDA. The differences between the models are statistically highly significant (p < 0.001) for varying amounts of training data.

[Figure 3: Information rate (y-axis) calculated on the ACM dataset as a function of the percent of data used for training (x-axis, 10 to 90), for LLDA, LPAM-List, and LPAM. Note that lower is better. The difference between the models is statistically significant (p < 0.001).]

On the NYT dataset LPAM achieves a slightly worse information rate compared to LPAM-List. However, both of these models outperform LLDA. As we point out in Section 3, if a NYT document is labeled with a given concept, then in many cases some of the ancestor nodes are also among the document tags. As a result, the margin between Labeled LDA and the other two models is smaller compared to the results we obtain with the ACM dataset. In future work, we plan to examine the differences between LPAM and LPAM-List.

4.3 Dataset navigation

We would like to know whether the concept keywords help taxonomy interpretation. We focus on the ACM dataset and use the keywords discovered by LPAM-List.

To address this question we designed and issued a survey to 15 graduate student volunteers at the Computer Science Department at the University of Massachusetts Amherst. In our survey each participant is presented with a randomized list of abstracts and has to complete the following four tasks:

Step 1: Select four ACM abstracts that are about topics with which they are not significantly familiar. Since we have 15 participants, the total number of labeled abstracts is 60.

Step 2: For a given abstract, we present the user with the subtrees rooted at the second level (the level under the top node) that contain the document's labels. In this way, we save the survey participants from needing to browse subtrees that are obviously irrelevant. The users are asked to label each abstract with at least one relevant tag among the leaves of the presented subtrees. The keywords that our models discovered are not presented.

Step 3: The users are shown the top 20 keywords associated with each concept and are asked to classify the same four abstracts they selected in Step 1. If the users change any of the tags, they have to explain what prompted the change.

Step 4: The users comment on whether they find it useful to have LPAM-List's topic keywords at their disposal.

Note that although all participants are experts in computer science, they are not experts in the particular research sub-areas of the test documents. We analyze the node additions and deletions the participants made in Step 3. An addition can be correct (the added label is in the set of true labels), approximate (the added label shares a parent with a true label), or incorrect (neither of the previous two cases). A deletion can be correct (the removed label is not in the set of true labels) or incorrect (the removed label is correct).

The results are summarized in Figure 4. The majority of the actions (36 out of 43) are beneficial. More specifically, the fraction of correctly removed labels is significantly greater than the fraction of incorrectly removed ones. This result indicates that keywords are helpful for identifying errors. Similarly, the fraction of correct and approximate additions is greater than the fraction of incorrect ones, which leads us to the conclusion that concept keywords help users interpret taxonomies. Furthermore, in their responses in Step 4, all but one of the surveyed students say that the keywords are helpful. We conclude that LPAM-List can play an important role in document classification.

Figure 4: User study results: actions taken after viewing discovered keywords.

Action   | Correct | Approx. | Incorrect
Addition | 10      | 5       | 2
Deletion | 21      | -       | 5

5. CONCLUSIONS AND FUTURE WORK

We present two topic models that automatically build lists of clarifying keywords for each node in a given taxonomy. Our user study shows that presenting a few of the highest weighted keywords as concept summaries is beneficial for taxonomy interpretation. Also, the information rate produced by LPAM and LPAM-List is significantly better than Labeled LDA's.

In future work, we plan to extend LPAM so that it can suggest edits to taxonomies. This is an important problem because new ideas emerge over time and the accompanying taxonomies should be adjusted accordingly. In fact, the output of LPAM was provided to the committee that edited the ACM Computing Classification System in 2011. We also plan to apply LPAM and LPAM-List to the problem of taxonomy alignment.

6. ACKNOWLEDGMENTS

The authors would like to thank Laura Dietz for useful discussions and ACM for providing metadata from their digital library. This work was supported in part by the Center for Intelligent Information Retrieval, in part by NIH award #HHSN271201000758P, and in part by NSF grant #CNS-0958392.

7. REFERENCES

[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. In JMLR, 2003.
[2] D. M. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2003.
[3] J. Boyd-Graber and P. Resnik. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In EMNLP, 2010.
[4] W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In UAI, 2007.
[5] D. Mimno, W. Li, and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In ICML, 2007.
[6] D. Newman, S. Karimi, and L. Cavedon. Using topic models to interpret MEDLINE's medical subject headings. In Advances in Artificial Intelligence, 2009.
[7] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, 2009.
[8] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, 2004.
[9] M. Steyvers, P. Smyth, and C. Chemudugunta. Combining background knowledge and learned topics. In Topics in Cognitive Science, 2010.
[10] H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML, 2009.
[11] X. Xue, S. Huston, and W. B. Croft. Improving verbose queries using subset distribution. In CIKM, 2010.