Automatic Synonym Discovery with Knowledge Bases
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Synonym Discovery with Knowledge Bases Meng Xiang Ren Jiawei Han University of Illinois University of Illinois University of Illinois at Urbana-Champaign at Urbana-Champaign at Urbana-Champaign [email protected] [email protected] [email protected] ABSTRACT Text Corpus Synonym Seeds Recognizing entity synonyms from text has become a crucial task in ID Sentence Cancer 1 Washington is a state in the Pacific Northwest region. Cancer, Cancers many entity-leveraging applications. However, discovering entity 2 Washington served as the first President of the US. e.g. Washington State synonyms from domain-specic text corpora ( , news articles, 3 The exact cause of leukemia is unknown. Washington State, State of Washington scientic papers) is rather challenging. Current systems take an 4 Cancer involves abnormal cell growth. entity name string as input to nd out other names that are syn- onymous, ignoring the fact that oen times a name string can refer to multiple entities (e.g., “apple” could refer to both Apple Inc and Cancer George Washington Washington State Leukemia the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised- learning systems. In this paper, we study the problem of automatic Entities Knowledge Bases synonym discovery with knowledge bases, that is, identifying syn- Figure 1: Distant supervision for synonym discovery. We onyms for knowledge base entities in a given domain-specic corpus. link entity mentions in text corpus to knowledge base enti- e manually-curated synonyms for each entity stored in a knowl- ties, and collect training seeds from knowledge bases. edge base not only form a set of name strings to disambiguate the meaning for each other, but also can serve as “distant” supervision to help determine important features for the task. We propose a we can leverage entity synonyms to enhance the process of query novel framework, called DPE, to integrate two kinds of mutually- expansion, and thus improve the retrieval performances. complementing signals for synonym discovery, i.e., distributional One straightforward approach for obtaining entity synonyms is features based on corpus-level statistics and textual paerns based to leverage publicly available knowledge bases such as Freebase and on local contexts. In particular, DPE jointly optimizes the two kinds WordNet, in which popular synonyms for the entities are manually of signals in conjunction with distant supervision, so that they can curated by human crowds. However, the coverage of knowledge mutually enhance each other in the training stage. At the inference bases can be rather limited, especially on some newly emerging stage, both signals will be utilized to discover synonyms for the entities, as the manual curation process entails high costs and is given entities. Experimental results prove the eectiveness of the not scalable. For example, the entities in Freebase have only 1.1 proposed framework. synonyms on average. To increase the synonym coverage, we expect to automatically extract more synonyms that are not in ACM Reference format: knowledge bases from massive, domain-specic text corpora. Many Meng , Xiang Ren, and Jiawei Han. 2017. Automatic Synonym Discovery approaches address this problem through supervised [19, 27, 29] with Knowledge Bases. In Proceedings of KDD’17, August 13-17, 2017, Halifax, NS, Canada, , 9 pages. or weakly supervised learning [11, 20], which treat some manually DOI: 10.1145/3097983.3098185 labeled synonyms as seeds to train a synonym classier or detect some local paerns for synonym discovery. ough quite eective 1 INTRODUCTION in practice, such approaches still rely on careful seed selections by humans. People oen have a variety of ways to refer to the same real-world To retrieve training seeds automatically, recently there is a grow- entity, forming dierent synonyms for the entity (e.g., entity United ing interest in the distant supervision strategy, which aims to auto- States can be referred using “America” and “USA”). Automatic syn- matically collect training seeds from existing knowledge bases. e arXiv:1706.08186v1 [cs.CL] 25 Jun 2017 onym discovery is an important task in text analysis and under- typical workow is: i) detect entity mentions from the given corpus, standing, as the extracted synonyms (i.e. the alternative ways to ii) map the detected entity mentions to the entities in a given knowl- refer to the same entity) can benet many downstream applica- edge base, iii) collect training seeds from the knowledge base. Such tions [1, 32, 33, 37]. For example, by forcing synonyms of an entity techniques have been proved eective in a variety of applications, to be assigned in the same topic category, one can constrain the such as relation extraction [10], entity typing [17] and emotion topic modeling process and yield topic representations with higher classication [14]. Inspired by such strategy, a promising direction quality [33]. Another example is in document retrieval [26], where for automatic synonym discovery could be collecting training seeds Permission to make digital or hard copies of all or part of this work for personal or (i.e., a set of synonymous strings) from knowledge bases. classroom use is granted without fee provided that copies are not made or distributed Although distant supervision helps collect training seeds au- for prot or commercial advantage and that copies bear this notice and the full citation tomatically, it also poses a challenge due to the string ambiguity on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permied. To copy otherwise, or republish, problem, that is, the same entity surface strings can be mapped to post on servers or to redistribute to lists, requires prior specic permission and/or a to dierent entities in knowledge bases. For example, consider fee. Request permissions from [email protected]. the string “Washington” in Figure 1. e “Washington” in the rst KDD’17, August 13-17, 2017, Halifax, NS, Canada © 2017 ACM. 978-1-4503-4887-4/17/08...$15.00 sentence represents a state of the United States; while in the second DOI: 10.1145/3097983.3098185 sentence it refers to a person. As some strings like “Washington” have ambiguous meanings, directly inferring synonyms for such Pattern Based Approaches strings may lead to a set of synonyms for multiple entities. For Target Strings Local Patterns: example, the synonyms of entity Washington returned by current America USA USA ; known as ; America systems may contain both the state names and person names, which USA ; ( ) ; America is not desirable. To address the challenge, instead of using ambigu- ID Sentence 1 The USA is also known as America . ous strings as queries, a beer way is using some specic concepts Distributional Based Approaches 2 The USA ( America ) is a country of 50 states . as queries to disambiguate, such as entities in knowledge bases. Country State The 3 The USA is a highly developed country . is motivated us to dene a new task: automatic synonym America 1 1 2 4 Canada is adjacent to America . discovery for entities with knowledge bases. Given a domain-specic USA 2 1 2 corpus, we aim to collect existing name strings of entities from knowledge bases as seeds. For each query entity, the existing name Figure 2: Comparison of the distributional based and pat- strings of that entity can disambiguate the meaning for each other, tern based approaches. To predict the relation of two tar- and we will let them vote to decide whether a given candidate string get strings, the distributional based approaches will ana- is a synonym of the query entity. Based on that, the key task for lyze their distributional features, while the pattern based this problem is to predict whether a pair of strings are synonymous approaches will analyze the local patterns extracted from or not. For this task, the collected seeds can serve as supervision to sentences mentioning both of the target strings. help determine the important features. However, as the synonym seeds from knowledge bases are usually quite limited, how to use semantic meanings of strings. During training, both modules will them eectively becomes a major challenge. ere are broadly treat the embeddings as features for synonym prediction, and in two kinds of eorts towards exploiting a limited number of seed turn update the embeddings based on the supervision from syn- examples. onym seeds. e string embeddings are shared across the modules, e distributional based approaches [9, 13, 19, 27, 29] consider and therefore each module can leverage the knowledge discovered the corpus-level statistics, and they assume strings which oen by the other module to improve the learning process. appear in similar contexts are likely to be synonyms. For exam- To discover missing synonyms for an entity, one may directly ple, the strings “USA” and “United States” are usually mentioned in rank all candidate strings with both modules. However, such strat- similar contexts, and they are the synonyms of the country USA. egy can have high time costs, as the paern module needs to extract Based on the assumption, the distributional based approaches usu- and analyze all sentences mentioning a pair of given strings when ally represent strings with their distributional features, and treat predicting their relation. To speed up synonym discoveries, our the synonym seeds as labels to train a classier, which predicts framework will rst utilize the distributional module to rank all whether a given pair of strings are synonymous or not. Since candidate strings, and extract a set of top ranked candidates as most synonymous strings will appear in similar contexts, such ap- high-potential ones. Aer that, we will re-rank the high-potential proaches usually have high recall. However, such strategy also candidates with both modules, and treat the top ranked candidates brings some noise, since some non-synonymous strings may also as the discovered synonyms.