Learning to Map Wikidata Entities to Predefined Topics
Preeti Bhargava* (Demandbase, San Francisco, CA)
Nemanja Spasojevic (YouTube, San Bruno, CA)
Sarah Ellinger (Juul Labs, San Francisco, CA)
Adithya Rao (Magic Leap, Sunnyvale, CA)
Abhinand Menon (Beeswax, New York, NY)
Saul Fuhrmann (Lime Bikes, San Francisco, CA)
Guoning Hu (Amazon, Sunnyvale, CA)

* This work was done when all the authors were employees of Lithium Technologies | Klout.

ABSTRACT

Recently, much progress has been made in entity disambiguation and linking (EDL) systems. Given a piece of text, EDL links words and phrases to entities in a knowledge base, where each entity defines a specific concept. Although extracted entities are informative, they are often too specific to be used directly by many applications. These applications usually require text content to be represented with a smaller set of predefined concepts or topics, belonging to a topical taxonomy, that matches their exact needs. In this study, we aim to build a system that maps Wikidata entities to such predefined topics. We explore a wide range of methods that map entities to topics, including GloVe similarity, Wikidata predicates, Wikipedia entity definitions, and entity-topic co-occurrences. These methods often predict entity-topic mappings that are reliable, i.e., have high precision, but tend to miss most of the mappings, i.e., have low recall. Therefore, we propose an ensemble system that effectively combines the individual methods and yields much better performance, comparable with human annotators.

[Figure 1: Topic extraction using Entity Disambiguation and Linking (EDL) together with entity-to-topic mapping. The text "Curry is a USA NBA player from Golden State Warriors." is linked by EDL to the entities United_States_Of_America (0.572), Stephen_Curry (0.269), National_Basketball_Association (0.850), and Golden_State_Warriors (0.621), which are in turn mapped to the topics usa (0.246), north-america (0.411), stephen-curry (0.321), golden-state-warriors (0.996), sports (0.702), nba (1.587), and basketball (2.836).]

KEYWORDS

entity topic mapping; entity topic assignment; natural language processing; knowledge base; wikipedia; wikidata

ACM Reference Format:
Preeti Bhargava, Nemanja Spasojevic, Sarah Ellinger, Adithya Rao, Abhinand Menon, Saul Fuhrmann, and Guoning Hu. 2019. Learning to Map Wikidata Entities To Predefined Topics. In Companion Proceedings of the 2019 World Wide Web Conference (WWW '19 Companion), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3308560.3316749

1 INTRODUCTION

There have been many efforts to extract the rich information available in various types of user-generated text, such as webpages, blog posts, and tweets, and to represent it as a set of concepts that can then be used by applications such as content search, personalization, and user profile modeling. This can be achieved by understanding the topics in which users are interested or are experts [8, 16, 21, 22], i.e., by categorizing the user-generated text into a finite set of topics or categories.

Traditionally, statistical topic models such as LDA [7] have been used for topical categorization of text. These models are based on the idea that individual documents are made up of one or more topics, where each topic is a distribution over words. Many applications have shown the power of these models on a variety of text documents (e.g., Enron emails, CiteSeer abstracts, Web pages). While LDA is a powerful tool for finding topic clusters within a document, it may miss implicit topics that are better suited for document categorization.
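To make the topic-as-distribution idea concrete, the following is a minimal sketch of fitting LDA with scikit-learn; the toy corpus and the choice of two topics are our own illustration, not part of the paper's experiments.

```python
# A minimal LDA sketch; toy corpus and topic count are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "curry scored thirty points as the warriors won",
    "the warriors lead the nba western conference",
    "the senate passed the new budget bill",
    "congress is debating the proposed tax bill",
]

# Bag-of-words counts; LDA operates on term counts, not raw text.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixtures

# Each topic is a distribution over words; print its top terms.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[::-1][:3]]
    print(f"topic {i}: {top}")
```

Note that the resulting topics are anonymous word clusters; nothing ties "topic 0" to a named concept such as basketball, which is exactly the gap the entity-based approach below addresses.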
Recently, tremendous advances have been made in entity disambiguation and linking (EDL) [1, 4, 9, 13, 14, 19]. Many EDL API services are now available to the public, including Google NLP (https://cloud.google.com/natural-language/), the Watson Natural Language Understanding API, and Rosette Text Analytics. These advanced EDL technologies and services make it practically elementary to extract a set of entities from a piece of text.

Unlike LDA-generated topics, entities are well-defined concepts described in a knowledge base (KB), e.g., the entities in the Wikidata KB. Modern KBs contain hundreds of thousands of entities or more. Some entities are quite broad, but more often they are very specific. When such narrow entities are extracted from text, they are somewhat informative, but they may be too specific and too numerous for the needs of a given application. Moreover, entities enable a syntactic rather than a semantic understanding of the text. For example, in a search application where the query is "Golden State Warriors", documents indexed by the entity "Stephen Curry" are highly relevant, but may not be returned.

To address these challenges and meet the needs of general applications, a topical taxonomy, or hierarchical structure of topics, can be introduced. The primary advantage of using such a taxonomy rather than directly applying entities is to support product and business requirements such as:

(1) Limiting topics to a given knowledge domain.
(2) Imposing an editorial style or controlling the language used in describing topics (e.g., by imposing a character limit on topic names).
(3) Limiting the topic set in size so that an application user can better interact with the available topics. Topic taxonomy cardinality is orders of magnitude smaller than the number of entities within the KB.
(4) Preventing unsuitable concepts from being represented as topics. These may include concepts that are:
    (a) Offensive or controversial (e.g., Pornography).
    (b) Either too general (e.g., Life) or too specific (e.g., Australian Desert Raisin).
    (c) Redundant with one another (e.g., Obamacare and Affordable Care Act).

For example, Klout.com used a custom taxonomy [10] which was modeled around capturing topics of social media content in order to build topical user profiles [21, 22]. Another example is Google AdWords, which uses a small, human-readable taxonomy to allow advertisers to target personalized ads to users based on topical interests.

Thus, to categorize text into topics, one can take advantage of mature EDL systems by mapping the entities extracted from the text to topics in a topical taxonomy. An example of using EDL to extract topics is shown in Figure 1. Although there have been studies that touch upon aspects of the entity-topic mapping problem, either while modeling the relationships among entities or while modeling the concepts of entities, no systematic study of this particular task exists. We find that an ensemble of selected models is able to yield very good results.

Our main contributions in this paper are:

• We propose a system that maps entities in a KB (derived from Wikidata) to topics in a taxonomy. Together with EDL, our system allows one to extract, from a given text, the concepts that best meet specific application needs.
• We study multiple popular models that explore the relationships among entities from various perspectives, including co-occurrence, word embeddings, and Wikipedia content. We find that each of them performs reasonably well on mapping entities to topics.
• We investigate multiple approaches to combining the above models into a stacked ensemble model and obtain much better results. We find that the best performance is achieved by an SVM meta-model (AUC: 0.874 and F1: 0.786), which yields results comparable to human annotators.
• We show that although our system is developed with a specific topical taxonomy, one can easily adapt it for use with other taxonomies.
• Open data: we make our label set publicly available.

2 PROBLEM SETTING

In this work, we attempt to build a system that maps entities in an entity set to topics in a topic set.
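Before detailing the two sets, a minimal sketch of the intended interface may help; it mirrors the Figure 1 pipeline, in which per-entity EDL confidences are aggregated into topic scores. The mapping table and the additive scoring rule below are simplifying assumptions for illustration, not the models studied later in the paper.

```python
# Sketch of the Figure 1 interface: EDL output (entities with confidence
# scores) is turned into scored topics. The entity-to-topic weights and
# the additive scoring rule are hypothetical, illustrative values.
from collections import defaultdict

ENTITY_TO_TOPICS = {
    "Stephen_Curry": {"stephen-curry": 1.0, "nba": 0.9, "basketball": 0.8},
    "Golden_State_Warriors": {"golden-state-warriors": 1.0, "nba": 0.9,
                              "basketball": 0.8, "sports": 0.5},
    "United_States_Of_America": {"usa": 1.0, "north-america": 0.7},
}

def map_entities_to_topics(edl_output):
    """Aggregate per-entity EDL confidences into per-topic scores."""
    topic_scores = defaultdict(float)
    for entity, confidence in edl_output.items():
        for topic, weight in ENTITY_TO_TOPICS.get(entity, {}).items():
            topic_scores[topic] += confidence * weight
    return dict(topic_scores)

# EDL confidences taken from the Figure 1 example.
edl = {"United_States_Of_America": 0.572, "Stephen_Curry": 0.269,
       "Golden_State_Warriors": 0.621}
print(map_entities_to_topics(edl))
```

Under this toy scoring, a topic shared by several extracted entities (e.g., nba) accumulates a higher score than any single entity's confidence, which is the qualitative behavior visible in Figure 1.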
2.1 Entity set

Wikidata is the largest free and open KB, acting as a central structured data store of all Wikimedia content. Entities are the atomic building blocks of Wikidata. Information about a Wikidata entity is organized using named predicates, many of which annotate relations between entities.

We derived our entity set from Wikidata for the following reasons:

• Wikidata entities are widely used in the community, allowing our system to benefit a large audience.
• Wikidata contains more than 43M entities, covering the majority of concepts that people care about. These entities include people, places, locations, organizations, etc. There are entities for broad concepts, such as Sports, and entities for very specific concepts, such as 5th Album of The Beatles.
• Wikidata entities come with rich annotations that can be utilized for this problem (see the sketch at the end of this section). For example:
    – Every entity is linked to a corresponding word or phrase in multiple languages.
    – Millions of predicates describe special relations among entities (see Section 5.2.1 for more details).
• There are datasets associated with Wikidata that provide useful information. In this work, we leverage Wikipedia pages (see Section 5.3.1 for more details) and DAWT [20], an extension dataset to Wikidata.

2.2 Topic set

The topics that entities are mapped to are application-specific. In this study, we use a topic set from the Klout Topic Ontology (KTO) [10], which itself is a subset of Wikidata. KTO contains about 8K topics.
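As a concrete illustration of the annotations listed in Section 2.1, the sketch below reads one entity's multilingual labels and named predicates from Wikidata's public Special:EntityData endpoint; the choice of entity (Q42, Douglas Adams) is an arbitrary stand-in, and any Wikidata id works the same way.

```python
# Reading a Wikidata entity's annotations: multilingual labels and
# claims (named predicates). Q42 is only an example entity id.
import requests

qid = "Q42"
url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
entity = requests.get(url, timeout=10).json()["entities"][qid]

# Every entity is linked to a word or phrase in multiple languages.
print(entity["labels"]["en"]["value"])   # "Douglas Adams"
print(sorted(entity["labels"]))          # language codes with a label

# Claims relate entities to one another via named predicates, e.g.
# P31 ("instance of") points at other entity ids such as Q5 (human).
for claim in entity["claims"].get("P31", []):
    print(claim["mainsnak"]["datavalue"]["value"]["id"])
```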