Arxiv:2009.13827V1 [Cs.CL] 29 Sep 2020 to Cluster All Vocabulary Terms Into Synsets
Total Page:16
File Type:pdf, Size:1020Kb
SynSetExpan: An Iterative Framework for Joint Entity Set Expansion and Synonym Discovery Jiaming Shen1?, Wenda Qiu1?, Jingbo Shang2, Michelle Vanni3, Xiang Ren4, Jiawei Han1 1University of Illinois Urbana-Champaign, IL, USA, 2University of California San Diego, CA, USA 3U.S. Army Research Laboratory, MD, USA, 4University of Southern California, CA, USA 1{js2, qiuwenda, hanj}@illinois.edu [email protected] [email protected] [email protected] User Provided Vocabulary V Text Corpus Discovered Synsets VC Abstract D S Seed Synsets derives of Semantic Class C S Illinois IL part of Wisconsin WI Land of Lincoln America’s Dairyland Entity set expansion and synonym discovery inputs inputs outputs are two critical NLP tasks. Previous studies Texas TX SynSetExpan Framework Washington WA inputs accomplish them separately, without exploring Lone Star State Evergreen State their interdependences. In this work, we hy- pothesize that these two tasks are tightly cou- California CA Connecticut CT Golden State Set Expansion Synonym Discovery …… …… …… pled because two synonymous entities tend to have similar likelihoods of belonging to var- Figure 1: An illustrative example of joint entity set ious semantic classes. This motivates us to expansion and synonym discovery. design SynSetExpan, a novel framework that enables two tasks to mutually enhance each et al., 2017), taxonomy construction (Shen et al., other. SynSetExpan uses a synonym discovery 2018a), and online education (Yu et al., 2019a). model to include popular entities’ infrequent Previous studies regard ESE and ESD as two synonyms into the set, which boosts the set expansion recall. Meanwhile, the set expan- independent tasks. Many ESE methods (Mamou sion model, being able to determine whether et al., 2018b; Yan et al., 2019; Huang et al., 2020; an entity belongs to a semantic class, can gen- Zhang et al., 2020; Zhu et al., 2020) are developed erate pseudo training data to fine-tune the syn- to iteratively select and add the most confident en- onym discovery model towards better accuracy. tities into the set. A core challenge for ESE is to To facilitate the research on studying the in- find those infrequent long-tail entities in the target terplays of these two tasks, we create the first semantic class (e.g., “Lone Star State” in the class large-scale Synonym-Enhanced Set Expansion US_States (SE2) dataset via crowdsourcing. Extensive ) while filtering out false positive en- experiments on the SE2 dataset and previous tities from other related classes (e.g., “Austin” and benchmarks demonstrate the effectiveness of “Dallas” in the class City) as they will cause se- SynSetExpan for both entity set expansion and mantic shift to the set. Meanwhile, various ESD synonym discovery tasks. methods (Qu et al., 2017; Ustalov et al., 2017a; Wang et al., 2019; Shen et al., 2019) combine string- 1 Introduction level features with embedding features to find a query term’s synonyms from a given vocabulary or Entity set expansion (ESE) aims to expand a arXiv:2009.13827v1 [cs.CL] 29 Sep 2020 to cluster all vocabulary terms into synsets. A ma- small set of seed entities (e.g., {“United States”, jor challenge here is to combine those features with “Canada”}) into a larger set of entities that belong limited supervisions in a way that works for enti- to the same semantic class (i.e., Country). En- ties from all semantic classes. Another challenge tity synonym discovery (ESD) intends to group all is how to scale a ESD method to a large, extensive terms in a vocabulary that refer to the same real- vocabulary that contains terms of varied qualities. world entity (e.g., “America” and “USA” refer to To address the above challenges, we hypothe- the same country) into a synonym set (hence called size that ESE and ESD are two tightly coupled a synset). Those discovered entities and synsets in- tasks and can mutually enhance each other because clude rich knowledge and can benefit many down- two synonymous entities tend to have similar like- stream applications such as semantic search (Xiong lihoods of belonging to various semantic classes ∗Equal Contributions. and vice versa. This hypothesis implies that (1) knowing the class membership of one entity en- terms, and 1200 seed queries from 60 semantic ables us to infer the class membership of all its classes of 6 different types (e.g., Person, Location, synonyms, and (2) two entities can be synonyms Organization, etc.). only if they belong to the same semantic class. For Contributions. In summary, this study makes the example, we may expand the US_States class following contributions: (1) we hypothesize that from a seed set {“Illinois”, “Texas”, “California”}. ESE and ESD can mutually enhance each other An ESE model can find frequent state full names and propose a novel framework SynSetExpan to (e.g., “Wisconsin”, “Connecticut”) but may miss jointly conduct two tasks; (2) we construct a new those infrequent entities (e.g., “Lone Star State” large-scale dataset SE2 that supports fair compari- and “Golden State”). However, an ESD model may son across different methods and facilitates future predict “Lone Star State” is the synonym of “Texas” research on both tasks; and (3) we conduct exten- and “Golden State” is synonymous to “California” sive experiments to verify our hypothesis and show and directly adds them into the expanded set, which the effectiveness of SynSetExpan on both tasks. shows synonym information help set expansion. Meanwhile, from the ESE model outputs, we may 2 Problem Formulation infer “Wisconsin”, “WI” is a synonymous pair We first introduce important concepts in this work, whileh “Connecticut”, “SCi” is not, and use them and then present our problem formulation. A term to fine-tuneh an ESD model oni the fly. This relieves is a string (i.e., a word or a phrase) that refers to the burden of using one single ESD model for all a real-world entity2. An entity synset is a set of semantic classes and improves the ESD model’s terms that can be used to refer to the same real- inference efficiency because we refine the synonym world entity. For example, both “USA” and “Amer- search space from the entire vocabulary to only the ica” can refer to the entity United States and thus ESE model outputs. compose an entity synset. We allow the singleton In this study, we propose SynSetExpan, a synset and a term may locate in multiple synsets novel framework jointly conducting two tasks (c.f. due to its ambiguity. A semantic class is a set of Fig.1). To better leverage the limited supervision entities that share a common characteristic and a signals in seeds, we design SynSetExpan as an iter- vocabulary is a term list that can be either provided ative framework consisting of two components: (1) by users or derived from a corpus. a ESE model that ranks entities based on their prob- Problem Formulation. Given (1) a text corpus , abilities of belonging to the target semantic class, (2) a vocabulary derived from , and (3) a seedD and (2) a ESD model that returns the probability of set of user-providedV entity synonymD sets that two entities being synonyms. In each iteration, we 0 belong to the same semantic class , weS aim to first apply the ESE model to obtain an entity rank C (1) select a subset of entities C from that all list from which we derive a set of pseudo training V V belong to C; and (2) clusters all terms in C into data to fine-tune the ESD model. Then, we use this V entity synsets V where the union of all clusters fine-tuned model to find synonyms of entities in the S C is equal to C . In other words, we expand the seed currently expanded set and adjust the above rank V set 0 into a more complete set of entity synsets list. Finally, we add top-ranked entities in the ad- S that belong to the same semantic class C. justed rank list into the currently expanded set and 0 VC AS concrete[S example is presented in Fig.1. start the next iteration. After the iterative process ends, we construct a synonym graph from the last 3 The SynSetExpan Framework iteration’s output and extract entity synsets (includ- ing singleton synsets) as graph communities. In this study, we hypothesize that entity set expan- As previous ESE datasets are too small and con- sion and synonym discovery are two tightly cou- tain no synonym information for evaluating our pled tasks and can mutually enhance each other. hypothesis, we create the first Synonym Enhanced Hypothesis 1. Two synonymous entities tend to Set Expansion (SE2) benchmark dataset via crowd- have similar likelihoods of belonging to various sourcing. This new dataset1 is one magnitude larger semantic classes and vice versa. than previous benchmarks. It contains a corpus of The above hypothesis has two implications. the entire Wikipedia, a vocabulary of 1.5 million First, if two entities ei and ej are synonyms and 1http://bit.ly/SE2-dataset. 2In this work, we use “term” and “entity” interchangeably. Vocabulary V Knowledge Base Vocabulary V Positive Negative Set Expansion Output L K Examples Examples se string match Entity Score Pseudo Training Data Dpl g1( ) · Wisconsin 0.98 Term Pair Label Current Set E x2 South Carolina 0.94 <Wisconsin, WI> 1 Distant Supervision Illinois x WI 0.91 randomly 1 predict <Wisconsin, South Korea> 0 IL sample Term Pair Label learn g2( ) onV SC 0.85 derive · <WI, Golden State> 0 Texas <United States, U.S.> 1 x2 Chicago 0.78 <South Carolina, SC> 1 <United States, United Kingdom> 0 TX …… …… x1 <South Carolina, South Korea> 0 <Kobe Bryant, Black Mamba> 1 California … … South Korea 0.62 <SC, Lone Star State> 0 <Kobe Bryant, NBA> 0 … Golden State 0.58 gT ( ) · … … … … Lone Star State 0.57 x2 Add top- …… …… ranked x1 fine-tune pretrain entities Synonym Discovery Output Lsy Set Expansion Model merge Fitted Trees on Dpl Pretrained Class-Agnoistic Model 0 Entity Score M Lone Star State 0.99 predict Rank 1 2 3 4 5 6 7 … 200 201 … merge Land of Lincoln 0.97 on V … … Golden State 0.96 South Lone Star Golden Land of South …… …… Entity Wisconsin WI SC … Chicago … Carolina State State Lincoln Korea South Korea 0.02 Chicago 0.01 Synonym Discovery Model Final Merged Entity Rank List …… …… Mc Figure 2: Overview of one iteration in our proposed SynSetExpan framework.