Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP
Anthony Chen∗   Pallavi Gudipati   Shayne Longpre   Xiao Ling   Sameer Singh
University of California, Irvine        Apple
∗ Work started during an internship at Apple.

arXiv:2106.06830v1 [cs.CL] 12 Jun 2021

Abstract

Retrieval is a core component for open-domain NLP tasks. In open-domain tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of these retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets act as a challenging test of entity disambiguation. We create AmbER sets for three popular open-domain tasks: fact checking, slot filling, and question answering, and evaluate a diverse set of retrievers. We find that the retrievers exhibit popularity bias, significantly under-performing on rarer entities that share a name, e.g., they are twice as likely to retrieve erroneous documents on queries for the less popular entity under the same name. These experiments on AmbER sets show their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.¹

¹ The AmbER sets used in this paper and the code to generate them are available at https://github.com/anthonywchen/AmbER-Sets.

Q: Which battle did Abe Lincoln fight in?
A: World War II
Wikipedia Documents Ranked by BLINK:
1. Abraham Lincoln
2. Abraham Lincoln in the Black Hawk War
3. Abraham Lincoln (captain)
4. Benjamin Lincoln
5. Lincoln, Nebraska
6. Lincoln, England

Q: What musical instrument does Abe Lincoln play?
A: Trombone
Wikipedia Documents Ranked by BLINK:
1. Abraham Lincoln
2. John Wilkes Booth
3. Abe (musical)
4. Nebraska
5. Lincoln, Nebraska
6. Abe Lincoln (musician)

Figure 1: Queries for two entities (president & musician) with the name "Abe Lincoln". Retrieving the gold document involves disambiguating which "Abe Lincoln" each query is asking about. BLINK performs sub-optimally on the second query, as it ranks the document of the president over the gold document.

1 Introduction

Substantial progress in NLP has been made on "closed" tasks, where queries are paired with relevant documents (Rajpurkar et al., 2016; Dua et al., 2019). However, there is growing interest in "open-domain" tasks, where relevant documents need to be retrieved from a knowledge source before an NLP system can perform reasoning and produce an answer (Chen et al., 2017; Petroni et al., 2021). The open-domain setting better reflects real-world usage for tasks where relevant information is generally not provided (e.g., fact checking). Because success hinges on finding relevant documents, open-domain progress has been closely tied to improvements in retrieval systems² (Lee et al., 2019; Karpukhin et al., 2020; Lewis et al., 2020b).

² For example, replacing the BM25 retriever with DPR on Natural Questions increases exact match by 15 points.

A crucial challenge when interacting with a large knowledge source (e.g., Wikipedia) is entity ambiguity, the phenomenon where a single name can map to multiple entities. Resolving this ambiguity is referred to as entity disambiguation and is an important step for effective retrieval. For example, given the query "What musical instrument does Abe Lincoln play?", documents about the musician should rank higher than other entities with the same name (Figure 1). Although entity disambiguation has been extensively studied in entity linking (Hoffart et al., 2011; Rao et al., 2013; Sevgili et al., 2020) and search (Balog et al., 2010, 2011), in the context of open-domain NLP, it is unclear how good retrieval systems are when faced with queries with ambiguous entities.
Evaluating entity ambiguity is challenging because the popularity of entities follows a long tail (Figure 2) and rare entities are seldom covered in naturally-occurring datasets.

Figure 2: The Long Tail of Entity Popularity: Graph of the Wikipedia pageviews (in October 2019) for each Wikidata entity, ranked by popularity. Gray points are 100k randomly sampled entities, while the red/blue points are entities with the name "Abe Lincoln".

In this paper we introduce AmbER sets, a benchmark for evaluating the entity disambiguation capabilities of retrievers across multiple NLP tasks. Each AmbER set is a collection of Wikidata entities that share a name, and their corresponding queries for specific NLP tasks. For each set, we define the head entity as the most popular entity and the tail entities as the less popular ones. By creating queries for multiple entities that share a name, AmbER sets provide an accurate test of the entity disambiguation capabilities of retrievers and help assess the role of entity popularity in disambiguation. We show examples of AmbER sets for the question answering task in Table 1. We automatically create AmbER sets by mining the Wikidata knowledge graph (Vrandečić and Krötzsch, 2014) for relevant names and entities, and leveraging task-specific templates to generate inputs for three tasks: fact checking, slot filling, and question answering (Figure 3). In total, our AmbER sets contain 80k task-specific queries which we align to the Wikipedia snapshot from KILT (Petroni et al., 2021).

We use AmbER sets to conduct a systematic study of various retrieval systems that operate under different principles, such as token overlap and dense embedding similarity. Retrievers perform very differently on AmbER sets in terms of absolute retrieval numbers, with Bootleg (Orr et al., 2020), an entity-linking-based retriever, performing best. Despite these differences, all retrievers exhibit a large degree of popularity bias, under-performing on inputs concerning tail entities. TF-IDF, a token-based retriever, performs about four times worse on tail entity inputs compared to head entity inputs. Even with Bootleg, the best performing retriever, performance on tail entities is still 1.5 times lower than on head entities. Our results on AmbER sets demonstrate that there is significant work to be done on making retrievers robust in handling entity disambiguation.
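To make the head/tail comparison concrete, the following is a minimal sketch of how popularity bias could be measured on AmbER sets; it is not the paper's evaluation code, and the query fields (is_head, gold_title) and the retrieve callback are illustrative assumptions about how queries and a retriever might be represented.

```python
from typing import Callable, Dict, List

def topk_accuracy_by_popularity(
    queries: List[Dict],                        # each: {"input": str, "gold_title": str, "is_head": bool}
    retrieve: Callable[[str, int], List[str]],  # maps (query, k) to a ranked list of document titles
    k: int = 1,
) -> Dict[str, float]:
    """Top-k retrieval accuracy on AmbER queries, split by head vs. tail entities."""
    hits = {"head": 0, "tail": 0}
    totals = {"head": 0, "tail": 0}
    for query in queries:
        split = "head" if query["is_head"] else "tail"
        totals[split] += 1
        if query["gold_title"] in retrieve(query["input"], k):
            hits[split] += 1
    accuracy = {split: hits[split] / max(totals[split], 1) for split in ("head", "tail")}
    # Popularity bias: how much better the retriever does on head entities than on tail entities.
    accuracy["head_to_tail_ratio"] = accuracy["head"] / max(accuracy["tail"], 1e-9)
    return accuracy
```

Under this framing, the numbers above correspond to head-to-tail accuracy ratios of roughly 4 for TF-IDF and roughly 1.5 for Bootleg.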
2 AmbER Sets

Retrieving relevant documents from large knowledge sources such as Wikipedia is an important first step in the open-domain pipeline. An inherent problem in working with such sources is entity disambiguation: resolving a name (mention) to an entity in the knowledge source. Entity disambiguation can be challenging because many entities share a name, and the popularity of entities follows a long-tail distribution (Figure 2). Despite the importance of entity disambiguation, it remains an understudied problem for open-domain NLP. We introduce AmbER sets for evaluating the entity disambiguation capabilities of retrievers and analyze the role of entity popularity in disambiguation.

Task | Input Instance | Output
FC | John Mayer plays music. | True
SF | Nike [SEP] country | USA
QA | Whose face is on the $100 bill? | Benjamin Franklin

Table 2: Examples for each open-domain NLP task.

2.1 What is an AmbER Set?

We first provide an intuition for an AmbER set before concretely defining one. Consider two entities, a president and a musician, both of which have the name "Abe Lincoln" (Figure 1). Now, consider the query "Which battle did Abe Lincoln fight in?" and assume a retriever correctly returns the article about the president for this query. Simply because the correct document was retrieved does not mean the retriever has the ability to disambiguate between the president and the musician, as the president is much more popular. We should only be confident in its ability to disambiguate entities if we also pose a query about the less popular musician and the retriever again returns the correct document (as opposed to the document about the president).

Based on this intuition, we define an AmbER set as a collection of queries that satisfy the following criteria:

• Criteria 1: Polysemous Name: The queries in an AmbER set are all about entities that share a common name (e.g., Abe Lincoln).

• Criteria 2: Disparity in Popularity: An AmbER set contains queries about both the most popular entity for a name (the head entity), e.g., the president, and the less popular entities (the tail entities), e.g., the musician.

• Criteria 3: Resolvable Ambiguity: The content of the query should be sufficient to resolve to the correct entity. The query "Which battle did Abe Lincoln fight in?" satisfies this criterion, because there is only one Abe Lincoln that fought in a war, while "Where was Abe Lincoln born?" does not, since more than one Abe Lincoln has a place of birth.

Set | QID | Input | Answer | Gold Document
AmbER-H | Q517 | What wars did Napoleon participate in? | Napoleonic Wars | Napoleon
AmbER-H | Q3335909 | What sport does Napoleon play? | Rugby | Napolioni Nalaga
AmbER-H | Q3335909 | Which team does Napoleon play for? | Fiji National | Napolioni Nalaga
AmbER-H | Q117012 | What movement did Yoko Ono participate in? | Fluxus | Yoko Ono
AmbER-H | Q16264827 | Which sport does Yoko Ono participate in? | Judo | Yoko Ono (judoka)
AmbER-N | Q312 | Which industry is Apple in? | Electronics | Apple Inc.
AmbER-N | Q532100 | What is the record label of Apple? | Page One | Apple (band)
AmbER-N | Q7714007 | Who acted in Apple? | Ray Shell | The Apple (1980 film)
AmbER-N | Q788822 | Who is a cast member on Her? | Steve Zissis | Her (film)
AmbER-N | Q788822 | Who is Her's screenwriter? | Spike Jonze | Her (film)
AmbER-N | Q28441308 | Who performed Her? | Aaron Tippin | Her (song)

Table 1: Examples of QA AmbER sets. An AmbER set is a collection of entities that share a name, with instantiated queries for each entity. In this work, we use Wikidata to collect entities (QID). We also create queries for two more tasks, fact checking and slot filling (omitted from this table).

One option for building such sets is to manually identify polysemous names and related entities before manually writing queries for those entities. Instead, we present a pipeline for automatically creating AmbER sets using the Wikidata knowledge graph (Vrandečić and Krötzsch, 2014).
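To make the automatic construction concrete, here is a minimal sketch of the idea described in this section rather than the released pipeline itself; the entity dictionaries, the popularity field, and the QA templates below are illustrative assumptions about how Wikidata aliases, pageview counts, and property-value pairs might be represented, and the Criteria 3 check is deliberately simplified.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative QA templates keyed by Wikidata property ID (a hypothetical subset).
QA_TEMPLATES = {
    "P607": "Which battle did {name} fight in?",           # conflict
    "P641": "What sport does {name} play?",                # sport
    "P1303": "What musical instrument does {name} play?",  # instrument
}

def build_amber_sets(entities: List[Dict]) -> Dict[str, List[Dict]]:
    """Group entities that share an alias and instantiate queries for head and tail entities.

    Each entity dict is assumed to look like:
      {"qid": "Q91", "aliases": ["Abe Lincoln"], "popularity": 120000,
       "properties": {"P607": "Black Hawk War"}}
    """
    by_alias = defaultdict(list)
    for entity in entities:
        for alias in entity["aliases"]:
            by_alias[alias].append(entity)

    amber_sets = {}
    for alias, group in by_alias.items():
        if len(group) < 2:  # Criteria 1: the name must be polysemous.
            continue
        group = sorted(group, key=lambda e: e["popularity"], reverse=True)
        queries = []
        for rank, entity in enumerate(group):  # rank 0 is the head entity (Criteria 2).
            for pid, value in entity["properties"].items():
                if pid not in QA_TEMPLATES:
                    continue
                # Criteria 3 (simplified): keep a property only if no other entity in the
                # set shares it, so the query content resolves to a single entity.
                if any(pid in other["properties"] for other in group if other is not entity):
                    continue
                queries.append({
                    "qid": entity["qid"],
                    "is_head": rank == 0,
                    "input": QA_TEMPLATES[pid].format(name=alias),
                    "answer": value,
                })
        if queries:
            amber_sets[alias] = queries
    return amber_sets
```

The actual pipeline described above additionally instantiates fact-checking and slot-filling templates and aligns each gold document to the Wikipedia snapshot from KILT.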