Discriminative Topic Mining via Category-Name Guided Text Embedding

Yu Meng1∗, Jiaxin Huang1∗, Guangyuan Wang1, Zihan Wang1, Chao Zhang2, Yu Zhang1, Jiawei Han1
1Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
2College of Computing, Georgia Institute of Technology, GA, USA
1{yumeng5, jiaxinh3, gwang10, zihanw2, yuz9, hanj}@illinois.edu, [email protected]

ABSTRACT

Mining a set of meaningful and distinctive topics automatically from massive text corpora has broad applications. Existing topic models, however, typically work in a purely unsupervised way, which often generate topics that do not fit users' particular needs and yield suboptimal performance on downstream tasks. We propose a new task, discriminative topic mining, which leverages a set of user-provided category names to mine discriminative topics from text corpora. This new task not only helps a user understand clearly and distinctively the topics he/she is most interested in, but also directly benefits keyword-driven classification tasks. We develop CatE, a novel category-name guided text embedding method for discriminative topic mining, which effectively leverages minimal user guidance to learn a discriminative embedding space and discover category representative terms in an iterative manner. We conduct a comprehensive set of experiments to show that CatE mines high-quality sets of topics guided by category names only, and benefits a variety of downstream applications including weakly-supervised classification and lexical entailment direction identification1.

CCS CONCEPTS

• Information systems → Data mining; Document topic models; Clustering and classification;

KEYWORDS

Topic Mining, Discriminative Analysis, Text Embedding, Text Classification

1 INTRODUCTION

To help users effectively and efficiently comprehend a large set of text documents, it is of great interest to generate a set of meaningful and coherent topics automatically from a given corpus. Topic models [6, 19] are such unsupervised statistical tools that discover latent topics from text corpora. Due to their effectiveness in uncovering hidden semantic structure in text collections, topic models are widely used in text mining [15, 29] and information retrieval tasks [14, 52].

Despite their effectiveness, traditional topic models suffer from two noteworthy limitations: (1) Failure to incorporate user guidance. Topic models tend to retrieve the most general and prominent topics from a text collection, which may not be of a user's particular interest, or provide a skewed and biased summarization of the corpus. (2) Failure to enforce distinctiveness among retrieved topics. Concepts are most effectively interpreted via their uniquely defining features. For example, Egypt is known for pyramids and China is known for the Great Wall. Topic models, however, do not impose discriminative constraints, resulting in vague interpretations of the retrieved topics. Table 1 demonstrates three retrieved topics from the New York Times (NYT) annotated corpus [42] via LDA [6]. We can see that it is difficult to clearly define the meaning of the three topics due to an overlap of their semantics (e.g., the term "united states" appears in all three topics).

Table 1: LDA retrieved topics on NYT dataset. The meanings of the retrieved topics have overlap with each other.

Topic 1: canada, united states, canadian, economy
Topic 2: sports, united states, olympic, games
Topic 3: united states, iraq, government, president

ACM Reference Format:
Yu Meng, Jiaxin Huang, Guangyuan Wang, Zihan Wang, Chao Zhang, Yu Zhang, and Jiawei Han. 2020. Discriminative Topic Mining via Category-Name Guided Text Embedding.
In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3366423.3380278

arXiv:1908.07162v2 [cs.CL] 27 Jan 2020

1Source code can be found at https://github.com/yumeng5/CatE.
∗Equal Contribution.

This paper is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '20, April 20–24, 2020, Taipei, Taiwan
© 2020 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License. ACM ISBN 978-1-4503-7023-3/20/04. https://doi.org/10.1145/3366423.3380278

In order to incorporate user knowledge or preference into topic discovery for mining distinctive topics from a text corpus, we propose a new task, Discriminative Topic Mining, which takes only a set of category names as user guidance, and aims to retrieve a set of representative and discriminative terms under each provided category. In many cases, a user may have a specific set of topics of interest in mind, or have prior knowledge about the potential topics in a corpus. Such user interest or prior knowledge may come naturally in the form of a set of category names that could be used to guide the topic discovery process, resulting in more desirable results that better cater to a user's need and fit specific downstream applications. For example, a user may provide several country names and rely on discriminative topic mining to retrieve each country's provinces, cities, currency, etc. from a text corpus. We will show that this new task not only helps the user clearly and distinctively understand his/her topics of interest, but also benefits keyword-driven classification tasks.

Given the category names, discriminative topic mining retrieves a set of terms Si from the corpus D for each category ci such that each term in Si semantically belongs to and only belongs to category ci.
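As a toy illustration of this input/output contract, the sketch below shows the shape of the task's input and desired output; all retrieved terms here are invented for illustration and are not produced by any real system.

```python
# Hypothetical input and output of discriminative topic mining (toy data).
category_names = ["The United States", "France", "Canada"]

# Desired output: representative terms, each belonging to exactly one category.
retrieved_terms = {
    "The United States": ["california", "congress", "dollar"],
    "France": ["provence", "paris", "euro"],
    "Canada": ["ontario", "toronto", "loonie"],
}

# The discriminative requirement: no term may appear under two categories.
all_terms = [t for terms in retrieved_terms.values() for t in terms]
assert len(all_terms) == len(set(all_terms))
print("discriminative:", len(all_terms) == len(set(all_terms)))
```

A term like "north america" would be rejected under every category, since it belongs to none of them exclusively.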
Example 1. Given a set of country names, c1: "The United States", c2: "France" and c3: "Canada", it is correct to retrieve "Ontario" as an element in S3, because Ontario is a province in Canada and exclusively belongs to Canada semantically. However, it is incorrect to retrieve "North America" as an element in S3, because North America is a continent and does not belong to any country. It is also incorrect to retrieve "English" as an element in S3, because English is also the national language of the United States.

The differences between discriminative topic mining and standard topic modeling are mainly two-fold: (1) Discriminative topic mining requires a set of user-provided category names and only focuses on retrieving terms belonging to the given categories. (2) Discriminative topic mining imposes strong discriminative requirements that each retrieved term under the corresponding category must belong to and only belong to that category semantically.

There exist previous studies that attempt to incorporate prior knowledge into topic models. Along one line of work, supervised topic models such as Supervised LDA [5] and DiscLDA [23] guide the model to predict category labels based on document-level training data. While they do improve the discriminative power of unsupervised topic models on classification tasks, they rely on massive hand-labeled documents, which may be difficult to obtain in practical applications. Along another line of work that is more similar to our setting, users are asked to provide a set of seed words to guide the topic discovery process, which is referred to as seed-guided topic modeling [1, 21]. However, these methods still do not impose requirements on the distinctiveness of the retrieved topics and thus are not optimized for discriminative topic presentation and other applications such as keyword-driven classification.

We develop a novel category-name guided text embedding method, CatE, for discriminative topic mining.
CatE consists of two modules: (1) a category-name guided text embedding learning module that takes a set of category names to learn category-distinctive word embeddings by modeling the text generative process conditioned on the user-provided categories, and (2) a category representative word retrieval module that selects category representative words based on both word embedding similarity and word distributional specificity. The two modules collaborate in an iterative way: at each iteration, the former refines word embeddings and category embeddings for accurate representative word retrieval, and the latter selects representative words that will be used by the former at the next iteration.

Our contributions can be summarized as follows.

(1) We propose discriminative topic mining, a new task for topic discovery from text corpora with a set of category names as the only supervision. We show qualitatively and quantitatively that this new task helps users obtain a clear and distinctive understanding of interested topics, and directly benefits keyword-driven classification tasks.

3 CATEGORY-NAME GUIDED EMBEDDING

In this section, we first formulate a text generative process under user guidance, and then cast the learning of the generative process as a category-name guided text embedding model. Words, documents and categories are jointly embedded into a shared space where embeddings are not only learned according to the corpus generative assumption, but also encouraged to incorporate category-distinctive information.

3.1 Motivation

Traditional topic models like LDA [6] use document-topic and topic-word distributions to model the text generation process, where an obvious defect exists due to the bag-of-words generation assumption: each word is drawn independently from the topic-word distribution without considering the correlations between adjacent words. In addition, topic models make explicit probabilistic assumptions re-
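The iterative collaboration between the two modules can be sketched as a toy loop. Everything below is illustrative only: the random vectors and specificity scores stand in for quantities that CatE actually learns from a corpus, and the function names are invented, so the resulting grouping is arbitrary.

```python
import numpy as np

# Toy stand-ins for learned quantities (invented, not CatE's real output).
rng = np.random.default_rng(0)
vocab = ["canada", "ontario", "toronto", "quebec",
         "france", "paris", "lyon", "euro",
         "united_states", "california", "congress", "dollar"]
vec = {w: rng.normal(size=8) for w in vocab}             # word embeddings
specificity = {w: rng.uniform(0.5, 1.0) for w in vocab}  # distributional specificity
topics = {"canada": ["canada"], "france": ["france"],
          "united_states": ["united_states"]}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for _ in range(3):
    # Module (1) stand-in: refine each category embedding from its current
    # representative words (here simply their mean vector).
    cat_emb = {c: np.mean([vec[w] for w in ws], axis=0)
               for c, ws in topics.items()}
    # Module (2): retrieve one new representative word per category, ranked
    # by embedding similarity weighted by distributional specificity.
    assigned = {w for ws in topics.values() for w in ws}
    for c, ws in topics.items():
        candidates = [w for w in vocab if w not in assigned]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda w: cos(vec[w], cat_emb[c]) * specificity[w])
        ws.append(best)
        assigned.add(best)

print(topics)  # with random vectors the grouping is arbitrary
```

The key property the sketch preserves is mutual exclusivity: each word is assigned to at most one category, mirroring the discriminative requirement of the task.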