Ranking Related Entities: Components and Analyses

Marc Bron, Krisztian Balog, Maarten de Rijke
ISLA, University of Amsterdam, Science Park 904, 1098 XH Amsterdam

ABSTRACT
Related entity finding is the task of returning a ranked list of homepages of relevant entities of a specified type that need to engage in a given relationship with a given source entity. We propose a framework for addressing this task and perform a detailed analysis of four core components: co-occurrence models, type filtering, context modeling and homepage finding. Our initial focus is on recall. We analyze the performance of a model that only uses co-occurrence statistics. While it identifies a set of related entities, it fails to rank them effectively. Two types of error emerge: (1) entities of the wrong type pollute the ranking and (2) while somehow associated with the source entity, some retrieved entities do not engage in the right relation with it. To address (1), we add type filtering based on category information available in Wikipedia. To correct for (2), we add contextual information, represented as language models derived from documents in which source and target entities co-occur. To complete the pipeline, we find homepages of top ranked entities by combining a language modeling approach with heuristics based on Wikipedia's external links. Our method achieves very high recall scores on the end-to-end task, providing a solid starting point for expanding our focus to improve precision; additional heuristics lead to state-of-the-art performance.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval

General Terms
Algorithms, Experimentation, Measurement, Performance

Keywords
Entity search, Language modeling, Wikipedia

1. INTRODUCTION
Over the past decade, increasing attention has been devoted to retrieval technology aimed at identifying entities relevant to an information need. The area received a big boost with the arrival of the TREC Question Answering track in 1999, where much research has focused on fact-based questions such as "Who invented the paper clip?" Such questions can be answered by named entities such as locations, dates, etc. [35]. The expert finding task, studied at the TREC Enterprise track (2005–2008), focused on a single type of entity: people [7]. The INEX Entity Ranking task (2007–2009) broadened the task to include other types and required systems to return ranked lists of entities given a textual description ("Countries where one can pay with the euro") and type information ("countries") [9]. Next, the TREC 2009 Entity track introduced the related entity finding (REF) task: given a source entity, a relation and a target type, identify homepages of target entities that enjoy the specified relation with the source entity and that satisfy the target type constraint [3]. E.g., for a source entity ("Michael Schumacher"), a relation ("His teammates while he was racing in Formula 1") and a target type ("people"), return entities such as "Eddie Irvine" and "Felipe Massa." REF aims at making arbitrary relations between entities searchable; it provides a way of searching for information through entities, previously only possible via (implicitly) manually annotated links such as those in social networks.

We start with an idealized entity retrieval architecture, see Fig. 1. Computations take place at two levels: the entity repository is built off-line, using tools and techniques for named entity recognition and normalization. Queries are processed online, through a retrieval pipeline. This pipeline resembles a question answering architecture, where first candidate answers are generated, followed by type filtering and the final ranking (scoring) steps. Candidate generation is a recall-oriented step, while the subsequent two blocks aim to improve precision. Our work sets out the challenge of adapting this general architecture to the REF task, and addresses the issue of balancing precision and recall when executing a search.

[Figure 1: Components of an idealized entity finding system. Solid arrows indicate control flow, dashed arrows data flow. A query passes through candidate selection, filtering and reranking components; the entity repository is built from the corpus by named entity recognition and normalization.]

When building a system to perform a task such as REF, the most important evaluation is on the end-to-end task. The TREC Entity track will play an important role in advancing REF technology, but its end-to-end focus means that it is difficult to disentangle the performance contributions of individual components. This effect is reinforced in the case of a new task such as REF, where a canonical architecture has yet to emerge. In this paper we go through a series of ablation studies and contrastive runs so as to obtain an understanding of each of the components that play a role and the impact they have on precision and recall.

Specifically, we address the REF task as defined at TREC 2009 and consider a particular instantiation of the idealized entity finding system, shown in Fig. 2. Our focus is on retrieval and ranking rather than on named entity recognition and normalization; to simplify matters we use Wikipedia as a repository of (normalized) known entities. While the restriction to entities in Wikipedia is a limitation in terms of the number of entities considered, it provides us with high-quality data, including names, unique identifiers and type information, for millions of entities. Our framework is generic and conceptually independent of this particular usage of Wikipedia.

[Figure 2: Components of our REF system. A query is processed by co-occurrence modeling, type filtering and context modeling, followed by homepage finding; the entity repository (unique ID, name, type, homepage) is built from the corpus, Wikipedia and a web crawl.]

Given our focus on entities in Wikipedia, it is natural to address the REF task in two phases. In the first we build up our retrieval pipeline (the blocks at the bottom of Fig. 2), working only with Wikipedia pages of entities; in the second we map entities to their homepages. In phase one we use a generative framework to combine the components. The first component is a co-occurrence-based model that selects candidate entities. While, by itself, a co-occurrence-based model can be effective in identifying the potential set of related entities, it fails to rank them effectively. Our failure analysis reveals two types of error that affect precision: (1) entities of the wrong type pollute the ranking and (2) entities are retrieved that are associated with the source entity without engaging in the right relation with it. To address (1), we add a type filtering component based on category information in Wikipedia. To correct for (2), we complement the pipeline with contextual information, represented as statistical language models derived from documents in which the source and target entities co-occur. The addition of context proves beneficial for both recall and precision. A final improvement in this phase is obtained by employing a large corpus to correct for sparseness issues.

In phase two, we conform to the official TREC definition of the REF task by adding a homepage finding component that maps entities represented by Wikipedia pages to homepages. We show that our approach achieves competitive performance on the official task, especially in terms of recall. We demonstrate the generalizability of our framework by expanding it with two heuristics: one aimed at improving type filtering, the other at co-occurrence. We find that these methods have a very positive impact on all measures.

The main contribution of this paper is two-fold. First, we propose a transparent architecture for addressing the REF task. Second, we provide a detailed analysis of the effectiveness of its components and estimation methods, shedding light on the balance between precision and recall in the context of the REF task.

Below, we discuss related work in §2. In §3 we detail our approach to the REF task. Our experimental setup is described in §4. In §5 we analyze the effectiveness of a pure co-occurrence model. Type filtering is considered in §6 and contextual information is added in §7. Improved estimations of co-occurrence and context models are considered in §8. We address the (sub)task of homepage finding (mapping entities to their homepages) in §9. In §10 we discuss TREC Entity results as well as additional heuristics that can be incorporated into our framework. We conclude in §11.

2. RELATED WORK
Entity retrieval. The roots of entity retrieval go back to natural language processing, specifically to information extraction (IE). The goal in IE is to find all entities of a certain class, for example "cities." The general approach uses context based patterns to extract entities, e.g., "cities such as * and *", either learned from examples [28] or created manually [16]. At the intersection of natural language processing and IR lies question answering (QA), which combines IE and IR and is investigated at the TREC QA track [35]. What sets QA apart from entity retrieval? One difference is a matter of technology: many QA systems considered at TREC have a knowledge-intensive pipeline that is not compatible with efficient processing of very large volumes of data. Another is a difference in task: while the "list" subtask at the QA track does indeed resemble the REF task, the two differ in important ways: (i) QA list queries do not always contain an entity [34], e.g., "What are 12 types of clams," and (ii) REF queries impose a more specific (elaborate) relation between the source entity and the target entities, e.g., "Airlines that currently use Boeing 747 planes."

A particularly relevant paper on the interface of QA and REF is [26], wherein entity language models are processed using a probabilistic representation of the language used to describe a named entity (person, organization or location). The model is constructed from snippets of text surrounding mentions of an entity. Unsurprisingly, we model the language model of an individual entity in the same manner, but have more complex models of pairs of entities.

Our main concern is with the precision and recall aspects of our approach to the REF task. We initially focus on recall and then apply techniques to boost precision; one of these techniques is type filtering, aimed at demoting entities that are not of the required type. Previously, type filtering has been considered in the setting of QA, where candidate answers are filtered by employing surface patterns [27] or by a mix of top-down and bottom-up approaches [29]. We apply type filtering based on Wikipedia category assignments and category structure in the context of the REF task.

The expert finding task, which was run at the TREC Enterprise track [7], focuses on a single type ("person") and relation ("expert in"). In a language modeling approach to the task, experts are found either by modeling an expert's knowledge by its associated documents ("Model 1") or by first collecting topic related documents and then modeling experts ("Model 2") [1, 2]. Additionally, kernels have been used to emphasize terms occurring in close proximity to experts (entities) [25]. A two stage language modeling approach, consisting of a relevance and a co-occurrence model, has been considered in [4]; the relevance model determines if a document is relevant to a query, while the co-occurrence model determines the association between an expert and query in a document. A generative probabilistic framework was proposed in [11], with two families of approaches: candidate and topic generation models.

The novelty of our approach is that we use co-occurrence and context to model entity-entity associations, instead of the entity-document and document-query associations seen in most expert finding systems. Co-occurrence-based methods are widely used to determine the strength of association between terms. These methods come in many flavors, e.g., as global co-occurrence [18] (determined on an entire corpus) or local co-occurrence [41] (determined on a relevant set of documents) for query expansion. Hasegawa et al. [15] use the context of entity co-occurrence pairs for relation extraction; entity pairs that occur in the same sentence are clustered based on terms between them, and relations are characterized by frequent words in the cluster. Maximum likelihood estimation (MLE) is an obvious choice for determining co-occurrence; a problem is that some words co-occur frequently just by chance. In [21] a number of hypothesis testing methods are listed that determine whether the co-occurrence of two entities is more than mere chance; these include statistical tests, likelihood ratios and pointwise mutual information. In §5 we determine their value for REF.

The INEX Entity Ranking track [9] broadened expert search by moving from searching for a single type of entity (people) to any type, using Wikipedia categories. The corpus also changed to Wikipedia, so that each entity corresponds to a Wikipedia page. Two tasks were introduced: entity ranking, return a list of Wikipedia pages (entities) for a query and category; and list completion, return a list of Wikipedia pages for a query, category and example entities. The fact that each entity is represented by a Wikipedia page allows for using standard document retrieval to obtain a list of relevant entities for a query. The approaches differ in the way they combine this with category and example entity information: using a language modeling framework to combine initial retrieval scores with category information [17, 39], or a linear combination of document, category and link based component scores [8, 33, 45]. To derive a score for the category component most approaches use set overlap between entity categories and topic target categories; others use the topic category label as a query to an index of category labels [8, 45]. Another commonly used technique is category expansion, based either on the Wikipedia category structure [32, 39] or on lexical similarity between category labels and the query topic [33].

TREC 2009 Entity track. A recent development in evaluating entity-oriented search was the introduction of the Entity track at TREC in 2009, with the aim to perform entity-oriented search tasks on the web [3]. The first edition featured the related entity finding (REF) task. One approach to this task is to directly obtain homepages by submitting the REF query (source entity and relation) to a search engine [24]. Other approaches first collect text snippets from documents relevant to the REF query, next obtain entities by performing named entity recognition on the snippets, implement some sort of ranking step and finally find homepages, usually by using entity names as queries [37]. Several language modeling approaches were employed to rank entities, where the entity model is constructed from snippets containing the entity and the relation is used as a query [40, 42, 44]. The two stage retrieval model from the Enterprise track is adapted in [38]. Fang et al. [12] use a hierarchical relevance retrieval model and improve their model by exploiting list structures, training regression models for type filtering and applying heuristic filtering and pattern matching rules. Zhai et al. [43] propose a probabilistic framework to estimate the probability of an entity given a REF query, with two components: the probability of the relation given an entity and source entity, and the probability of an entity given the source entity and target type. While this model is the closest to our approach, it differs in the assumptions made about the dependencies between the query components, see §3. The approaches further differ in the way they estimate co-occurrence and construct entity and relation models. A number of approaches rely heavily on Wikipedia: as a repository of entity names, to perform entity type filtering based on categories, and to find homepages through external links [19, 22, 30].

3. APPROACH
The goal of the REF task is to return a ranked list of relevant entities $e$ for a query, where a query consists of a source entity ($E$), target type ($T$) and a relation ($R$) [3]. We formalize REF as the task of estimating the probability $P(e|E,T,R)$. This probability is difficult to estimate, due to the lack of training material, exacerbated by the fact that relations do not come from a closed vocabulary. Also, the model should capture a particular relation conditioned on the two entities involved. To address these concerns we turn to a generative model. First, we apply Bayes' Theorem and rewrite $P(e|E,T,R)$ to:
$$P(e|E,T,R) = \frac{P(E,T,R|e) \cdot P(e)}{P(E,T,R)}. \qquad (1)$$
Next, we drop the denominator as it does not influence the ranking of entities, and derive our final ranking formula as follows:
$$\begin{aligned}
P(E,T,R|e) \cdot P(e) &\propto P(E,R|e) \cdot P(T|e) \cdot P(e) & (2)\\
&= P(E,R,e) \cdot P(T|e)\\
&= P(R|E,e) \cdot P(E,e) \cdot P(T|e)\\
&= P(R|E,e) \cdot P(e|E) \cdot P(E) \cdot P(T|e) & (3)\\
&\stackrel{\mathrm{rank}}{=} P(R|E,e) \cdot P(e|E) \cdot P(T|e). & (4)
\end{aligned}$$
In (2) we assume that type $T$ is independent of source entity $E$ and relation $R$. We rewrite $P(E,R|e)$ to $P(R|E,e)$ so that it expresses the probability that $R$ is generated by two (co-occurring) entities ($e$ and $E$). Finally, we rewrite $P(E,e)$ to $P(e|E) \cdot P(E)$ in (3) as the latter is more convenient for estimation. We drop $P(E)$ in (4) as it is assumed to be uniform and thus does not influence the ranking.

[Figure: the generative model as a graphical model over the variables E, R, e and T: E generates e, E and e together generate R, and e generates T.]

The generative model, shown above, functions as follows. The input entity $E$ is chosen with probability $P(E)$, which generates a target entity $e$ with probability $P(e|E)$. The input and target entities together generate a relation $R$ with probability $P(R|E,e)$. Finally, the target entity generates a type $T$ with probability $P(T|e)$.

Assuming that input entities are chosen from a uniform distribution, we are left with the following components: (i) the pure co-occurrence model ($P(e|E)$), (ii) type filtering ($P(T|e)$) and (iii) contextual information ($P(R|E,e)$). We summarize the developments to come in the figure below; we analyze a pure co-occurrence model and its performance on the REF task in §5. We then add type filtering and contextual information to the pipeline; these are introduced and examined in §6 and §7, respectively. The components are combined using Eq. 4.

[Figure: query → co-occurrence modeling (§5) → type filtering (§6) → context modeling (§7) → results.]
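To make the factorization concrete, the sketch below (ours, not part of the original paper) turns Eq. 4 into a ranking function; the three component estimators are assumed to be supplied by the modules developed in §5–§7.

```python
def score_candidate(e, E, T, R, p_cooc, p_type, p_context):
    """Score candidate entity e for source E, target type T and relation R,
    following Eq. 4: P(R|E,e) * P(e|E) * P(T|e). P(E) is dropped (uniform)."""
    return p_context(R, E, e) * p_cooc(e, E) * p_type(T, e)


def rank_candidates(candidates, E, T, R, p_cooc, p_type, p_context):
    # p_cooc, p_type and p_context stand for the co-occurrence model (Sec. 5),
    # the type filter (Sec. 6) and the context model (Sec. 7), respectively.
    return sorted(candidates,
                  key=lambda e: score_candidate(e, E, T, R,
                                                p_cooc, p_type, p_context),
                  reverse=True)
```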
4. EXPERIMENTAL SETUP
Research questions. We address the official REF task, REF on a web corpus, by first solving it on a smaller, less noisy corpus: Wikipedia, where entities are identified by their Wikipedia page. In this setting we consider three research questions. (1) How do different measures for computing co-occurrence affect the recall of a pure co-occurrence based REF model? (2) Can a basic category based type filtering approach successfully be applied to REF to improve precision without hurting recall? (3) Can recall and precision be enhanced by adding context to co-occurrences, to ensure that source and target entities engage in the right relation? We then look at the REF task in the setting of a large web corpus. To conform to the official REF task we map the Wikipedia entity representation to a representation that identifies entities by homepages and consider three additional research questions. (4) Does the use of a larger corpus improve estimations of co-occurrence and context models? (5) Is the initial focus on Wikipedia a sensible approach; can it achieve comparable performance to other approaches? (6) Can our basic framework effectively incorporate additional heuristics in order to be competitive with other state-of-the-art approaches?

Document collection. Our document collection is the ClueWeb09 Category B subset [5] ("CW-B" for short), with about 50 million documents, including English Wikipedia. We use the Wikipedia part of ClueWeb09 and filtered out duplicate pages, page-not-found errors and non-English pages. This left us with about 5M documents, 2.6M of which correspond to unique entities. The total number of unique entity occurrences in Wikipedia documents (i.e., each unique entity occurring in a document counts only once, independent of the actual number of occurrences) is 373M.

Entity recognition and normalization. While named entity recognition and normalization are not our focus, they are key pre-processing steps. We use Wikipedia as a repository of known (normalized) entities. We handle named entity recognition (NER) in Wikipedia by considering only anchor texts as entity occurrences. We obtain an entity's name by removing the Wikipedia prefix from the anchor URL. For named entity normalization (NEN) we map URLs to a single entity variant. Here we make use of Wikipedia redirects that map common alternative spellings or references (e.g., "Schumacher," "Schumi" and "M. Schumacher") to the main variant of an entity ("Michael Schumacher"). Below, when using the full CW-B subset, we use the entity names as queries to an index of this collection. This effectively bypasses NER, as the resulting document lists identify in which documents the entities occur. In this case we do not perform NEN; while this potentially introduces noise, we believe that the amount of data partly compensates for it.
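As an illustration of these two steps, the sketch below (ours; the `redirects` mapping is a hypothetical pre-built table of Wikipedia redirects) normalizes an anchor URL to the main variant of an entity.

```python
from urllib.parse import unquote

def entity_from_anchor(anchor_url, redirects):
    """NER + NEN on a Wikipedia anchor: strip the URL prefix to obtain the
    entity name, then collapse spelling variants onto the main variant via a
    redirect table (assumed to be built from Wikipedia redirect pages)."""
    title = unquote(anchor_url.rsplit("/wiki/", 1)[-1]).replace("_", " ")
    return redirects.get(title, title)

redirects = {"Schumi": "Michael Schumacher", "M. Schumacher": "Michael Schumacher"}
print(entity_from_anchor("http://en.wikipedia.org/wiki/Schumi", redirects))
# -> Michael Schumacher
```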
Test topics. We base our test set on the TREC 2009 Entity topics. A topic consists of a source entity (E), a target entity type (T) and the desired relation (R) described in free text. Since we are restricting ourselves to entities in Wikipedia, we are not able to use all 20 TREC Entity topics, but only 15 of them. Specifically, we exclude three topics (#3, #8 and #13) without relevant results in Wikipedia and another two (#2 and #16) with source entities without a Wikipedia page. For the remaining topics we manually mapped the source entity to a Wikipedia page; this is the only manual intervention in the pipeline. The topics are listed in Table 1.

Table 1: Description of our 15 test topics. Target entity types are ORG=organization, PER=person and PROD=product.
ID | Source entity (E) | Relation (R) | Type (T)
1 | Blackberry | Carriers that Blackberry makes phones for. | ORG
4 | Philadelphia, PA | Professional sports teams in Philadelphia. | ORG
5 | Medimmune, Inc. | Products of Medimmune, Inc. | PROD
6 | Nobel Prize | Organizations that award Nobel prizes. | ORG
7 | Boeing 747 | Airlines that currently use Boeing 747 planes. | ORG
9 | The Beaux Arts Trio | Members of The Beaux Arts Trio. | PER
10 | Indiana University | Campuses of Indiana University. | ORG
11 | Home Depot Foundation | Donors to the Home Depot Foundation. | ORG
12 | Air Canada | Airlines that Air Canada has code share flights with. | ORG
14 | Bouchercon 2007 | Authors awarded an Anthony Award at Bouchercon in 2007. | PER
15 | SEC conference | Universities that are members of the SEC conference for football. | ORG
17 | The Food Network | Chefs with a show on the Food Network. | PER
18 | Jefferson Airplane | Members of the band Jefferson Airplane. | PER
19 | John L. Hennessy | Companies that John Hennessy serves on the board of. | ORG
20 | Isle of Islay | Scotch whisky distilleries on the island of Islay. | ORG

We perform two types of evaluation. First, throughout §5–§8 we focus on finding entities as represented by their Wikipedia page. We establish ground truth by extracting all primary Wikipedia pages from the TREC 2009 Entity qrels. We handle Wikipedia redirects and duplicates in our evaluation; a Wikipedia page returned for any of the variants of a relevant entity is considered to be correct, but once found, other variants of that page are ignored. This setup constitutes a change to the original TREC REF task, arguably making it easier; therefore the reported numbers are not directly comparable with those of the TREC 2009 Entity track [3]. Our second type of evaluation, on the original TREC REF task, is performed in §10, where we compare our scores with those of TREC Entity participants; based on their original submissions, we also compute their Wikipedia-based evaluation scores.¹

¹Evaluation script, qrels and additional resources are made publicly available at http://ilps.science.uva.nl/resources/cikm2010-entity.

Evaluation metrics. We focus on two measures: precision and recall. Specifically, we use R-Precision (R-prec), where R is the number of relevant entities for a topic, and recall at rank N (R@N), where we take N to be 100, 2000 and "All" (i.e., considering all returned entities). In Table 10 we also report on the metrics used at the TREC 2009 Entity track: P@10, NDCG@R, and the number of primary and relevant entity homepages retrieved. We forego significance testing as we do not have the minimal number of topics (25) recommended [36].
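Both measures are straightforward to compute per topic; a small sketch (ours), where `ranked` is a system ranking of entities and `relevant` the set of relevant entities from the qrels:

```python
def r_precision(ranked, relevant):
    """Precision at rank R, with R the number of relevant entities for the topic."""
    R = len(relevant)
    return sum(1 for e in ranked[:R] if e in relevant) / R if R else 0.0

def recall_at(ranked, relevant, n=None):
    """R@N: fraction of relevant entities found in the top N (all if n is None)."""
    top = ranked if n is None else ranked[:n]
    return sum(1 for e in relevant if e in top) / len(relevant) if relevant else 0.0
```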

5. CO-OCCURRENCE MODELING
The pure co-occurrence component is the first building block of our retrieval pipeline. It can produce a ranking of entities on its own.

[Figure: query → co-occurrence modeling → results.]

Since we are planning on expanding this pipeline with additional components (that will build on the set of entities identified in this step), our main focus throughout this section will be on recall.

Estimation. The pure co-occurrence component ($P(e|E)$) expresses the association between entities based on occurrences in documents, independent of context (i.e., the actual content of documents). To express the strength of co-occurrence between $e$ and $E$ we use a function $\mathrm{cooc}(e, E)$ and estimate $P(e|E)$ as follows:
$$P(e|E) = \frac{\mathrm{cooc}(e, E)}{\sum_{e'} \mathrm{cooc}(e', E)}.$$
We consider four settings of $\mathrm{cooc}(e, E)$: (i) maximum likelihood estimate, (ii) $\chi^2$ hypothesis test, (iii) pointwise mutual information and (iv) log likelihood ratio; we briefly recall their details [21].

(i) The maximum likelihood estimate (MLE) uses the relative frequency of co-occurrences between $e$ and $E$ to determine the strength of their association:
$$\mathrm{cooc}_{\mathrm{MLE}}(e, E) = c(e, E)/c(E),$$
where $c(e, E)$ is the number of documents in which $e$ and $E$ co-occur and $c(E)$ is the number of documents in which $E$ occurs.

(ii) The $\chi^2$ hypothesis test determines if the co-occurrence of two entities is more likely than just by chance. A $\chi^2$ test is given by:
$$\mathrm{cooc}_{\chi^2}(e, E) = \frac{N \cdot \left( c(e, E) \cdot c(\bar{e}, \bar{E}) - c(e, \bar{E}) \cdot c(\bar{e}, E) \right)^2}{c(e) \cdot c(E) \cdot (N - c(e)) \cdot (N - c(E))},$$
where $N$ is the total number of documents, and $\bar{e}$, $\bar{E}$ indicate that $e$, $E$ do not occur, respectively (i.e., $c(\bar{e}, \bar{E})$ is the number of documents in which neither $e$ nor $E$ occurs).

(iii) Pointwise mutual information (PMI) determines the amount of information we gain if we observe $e$ and $E$ together. It is useful to determine independence between entities, but of less value to determine how dependent two entities are. PMI is given by:
$$\mathrm{cooc}_{\mathrm{PMI}}(e, E) = \log \frac{c(e, E)}{c(e) \cdot c(E)}.$$

(iv) The log likelihood ratio is another measure that determines dependence and is more reliable than PMI [10]. It is defined as:
$$\mathrm{cooc}_{\mathrm{LLR}}(e, E) = 2\left(L(p_1, k_1, n_1) + L(p_2, k_2, n_2) - L(p, k_1, n_1) - L(p, k_2, n_2)\right),$$
where $k_1 = c(e, E)$, $k_2 = c(e, \bar{E})$, $n_1 = c(E)$, $n_2 = N - c(E)$, $p_1 = k_1/n_1$, $p_2 = k_2/n_2$ and $p = (k_1 + k_2)/(n_1 + n_2)$, while $L(p, k, n) = k \log p + (n - k) \log(1 - p)$.
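All four estimators can be computed from the same document counts: c(e), c(E), c(e, E) and the collection size N. The sketch below (ours) transcribes the formulas directly; degenerate counts (zeros) would need guarding or smoothing in practice.

```python
import math

# c_e  = number of documents containing e
# c_E  = number of documents containing E
# c_eE = number of documents containing both
# N    = total number of documents

def cooc_mle(c_eE, c_e, c_E, N):
    return c_eE / c_E

def cooc_chi2(c_eE, c_e, c_E, N):
    c_e_only = c_e - c_eE             # e without E
    c_E_only = c_E - c_eE             # E without e
    c_neither = N - c_e - c_E + c_eE  # neither e nor E
    num = N * (c_eE * c_neither - c_e_only * c_E_only) ** 2
    return num / (c_e * c_E * (N - c_e) * (N - c_E))

def cooc_pmi(c_eE, c_e, c_E, N):
    return math.log(c_eE / (c_e * c_E))

def _L(p, k, n):
    return k * math.log(p) + (n - k) * math.log(1 - p)

def cooc_llr(c_eE, c_e, c_E, N):
    k1, n1 = c_eE, c_E            # occurrences of e inside documents with E
    k2, n2 = c_e - c_eE, N - c_E  # occurrences of e outside documents with E
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (_L(p1, k1, n1) + _L(p2, k2, n2) - _L(p, k1, n1) - _L(p, k2, n2))

def p_e_given_E(raw_scores):
    """Normalize raw cooc(e, E) scores into P(e|E) over the candidate set."""
    total = sum(raw_scores.values())
    return {e: (s / total if total else 0.0) for e, s in raw_scores.items()}
```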

Results. Table 2 shows the results of the different estimation methods for the pure co-occurrence model. Out of the four methods, $\chi^2$ is a clear winner while PMI performs worst on all metrics. MLE and LLR deliver very similar scores; their recall is comparable to that of $\chi^2$, but they achieve much lower R-precision. All estimators return entities that co-occur at least once with the source entity, hence R@All is the same for all, just over 93%.

Table 2: Results of the pure co-occurrence models.
Co-occ. | R-prec | R@100 | R@2000 | R@All
MLE | .0399 | .2957 | .7501 | .9311
χ² | .1099 | .3268 | .8273 | .9311
PMI | .0244 | .0981 | .4888 | .9311
LLR | .0399 | .2957 | .7184 | .9311

Analysis. The numbers presented in Table 2 demonstrate that simple co-occurrence statistics can achieve reasonable recall and can be used to obtain a candidate set of entities (e.g., top 2000) that can then be further examined by subsequent components in the pipeline. Fig. 3 (Top) shows the R@2000 scores of the methods per topic. For most topics at least one of the methods achieves high recall, with the exception of topic #4.

[Figure 3: Per topic R@2000 (Top) and R-precision (Bottom) scores for the pure co-occurrence estimation methods (MLE, χ², PMI, LLR).]

Unlike recall, R-precision scores are very low, suggesting that pure co-occurrence is not enough to solve the REF task. Fig. 3 (Bottom) shows that all methods score zero on R-precision on all but 4 topics. To identify the types of errors made, we take topic #17 (cf. Table 1) as an example and list the top 10 entities produced by our co-occurrence methods in Table 3.² Clearly, χ² finds relevant entities (bold) mixed with non-relevant entities that are not of the target type T (roman). The other methods suffer more heavily from this type of error and fail to return any relevant entities in the top 10. We also see another type of error: entities that are of the right type, but do not satisfy the target relation with the source entity. Note that one of the entities (indicated by †) is relevant, but not identified as such, as its Wikipedia page does not occur in the qrels.

Table 3: Top 10 entities for topic #17. Relevant entities in bold, entities of the wrong type in roman, and entities of the right type but in the wrong relation in italics. MLE and LLR have the same top 10 ranking and are not displayed separately.
Rank | PMI | MLE/LLR | χ²
1 | Y'all (magazine) | Charitable organization | America
2 | Wayne Harley Brachman | United States | Paula Deen
3 | Veniero's | |
4 | The Hungry Detective | 2007 | Alton Brown†
5 | The FN Dish | 2006 | Fine Living
6 | Super Suppers | NBC |
7 | Sugar Sugar, Inc | 2008 |
8 | Stonewall Kitchen | Website | Unwrapped
9 | Raul Musibay | Internet Movie Database |
10 | Party Line with The Hearty Boys | American Broadcasting Company | Real Age

Different co-occurrence methods display distinct characteristics in what they consider strongly associated. MLE and LLR focus on popular entities; the top ranking entity, "Charitable organization," occurs 5,271,075 times. The other extreme is demonstrated by PMI, which favors rare entities: its top ranking entity occurs 2 times and exclusively with the source entity. Finally, χ² performs well when entities co-occur frequently with the source entity and less with others; its top ranked entity occurs in 327 documents, in 187 cases together with the source entity; for the second best entity these numbers are 148 and 106, respectively. As all methods and topics suffer from entities of the wrong type polluting the rankings, we address this next.

²We use topic #17 as a running example throughout the paper, to illustrate the impact of additional ranking and filtering components.
6. TYPE FILTERING
To combat the problem that results produced by the pure co-occurrence model are polluted by entities of the wrong type, we add a type filtering component on top of the pure co-occurrence model; this is indicated by the thick box in the figure below.

[Figure: query → co-occurrence modeling → type filtering (thick box) → results.]

The challenge will be to maintain the high recall levels attained by the pure co-occurrence model while improving precision. Recall from (4) that type filtering is formalized as $P(T|e)$. The entity type filtering component $P(T|e)$ expresses the probability that an entity $e$ is of the target type. Combined with the pure co-occurrence model, it yields the following model for ranking entities (we omit details of the derivation for brevity; it goes analogously to §3):
$$P(e|E,T) \propto P(e|E) \cdot P(T|e). \qquad (5)$$

Estimation. In order to perform type filtering we exploit category information available in Wikipedia. We map each of the (input) entity types ($T \in \{\mathrm{PER}, \mathrm{ORG}, \mathrm{PROD}\}$) to a set of Wikipedia categories ($\mathrm{cat}(T)$) and we create a similar mapping from entities to categories ($\mathrm{cat}(e)$). The former is created manually, while the latter is given to us in the form of page-category assignments in Wikipedia (recall that Wikipedia pages correspond to entities). With these two mappings we estimate $P(T|e)$ as follows:
$$P(T|e) = \begin{cases} 1 & \text{if } \mathrm{cat}(e) \cap \mathrm{cat}_{L_n}(T) \neq \emptyset \\ 0 & \text{otherwise.} \end{cases}$$
Since the Wikipedia category structure is not a strict hierarchy and the category assignments are imperfect [9], we (optionally) expand the set of categories assigned to each target entity type $T$, hence write $\mathrm{cat}_{L_n}(T)$, where $L_n$ is the chosen level of expansion.

For the initial mapping of types to categories, $\mathrm{cat}_{L_1}(T)$, we manually assign a number of categories to each type as in [19]. To the person type we map categories that end with "birth" or "death," categories that start with "People," and the category "Living People." To the organization type we map categories that start with "Organizations" or "Companies," and to the product type we map categories starting with "Products" or ending with "introductions." Next, we use the Wikipedia category hierarchy to expand this set by adding all direct child categories of the categories in $L_1$, to obtain our first expansion set $L_2$. We continue expanding the categories this way, one level at a time, until no new categories are added.

While this particular form of type filtering is specific and tailored to Wikipedia, it is reasonable to assume that a named entity recognizer would provide us with high-level type information; therefore, it is not a limitation of the generalizability of our framework.
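A sketch (ours) of the expansion and the resulting binary filter; `child_categories` stands in for the Wikipedia category graph and `entity_categories` for the page-category assignments, both assumed to be available.

```python
def expand_categories(seed_categories, child_categories, level):
    """Expand a seed category set down `level` levels of the Wikipedia
    category hierarchy (level 1 = the manually assigned seed set)."""
    expanded = set(seed_categories)
    frontier = set(seed_categories)
    for _ in range(level - 1):
        frontier = {c for cat in frontier for c in child_categories.get(cat, [])}
        frontier -= expanded
        if not frontier:          # stop once no new categories are added
            break
        expanded |= frontier
    return expanded

def p_type_given_entity(T, e, cat_T_expanded, entity_categories):
    """Binary type filter P(T|e): 1 if the entity's categories overlap the
    (expanded) category set of target type T, else 0."""
    return 1.0 if entity_categories.get(e, set()) & cat_T_expanded[T] else 0.0
```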

Results. By varying the expansion levels, we can optimize type filtering in two ways: for (R-)precision and for recall (R@2000). We first investigate the optimal levels of expansion for R-precision. Fig. 4 (Left) shows that R-precision increases when moving from level 0 (no filtering, shown on the right end of the plot) to level 2 expansion, but drops as the level of category expansion is further increased. This is in line with our expectation that an increasing number of categories allows more entities of the wrong type; because of the imperfection of the Wikipedia category structure, expansion results in the addition of many irrelevant categories.

[Figure 4: R-precision (Left) and R@2000 (Right) at increasing levels of category expansion (levels 0–9), for MLE/LLR, PMI and χ².]

As to recall, Fig. 4 (Right) shows R@2000 vs. the level of expansion. R@2000 first increases and then decreases (PMI) or remains the same (MLE, LLR, and χ²) as categories are expanded. At level 6 or beyond, the number of non-relevant entities allowed into the ranking is large enough to push relevant entities out of the top 2000. Uniformly applying category expansion down to the same level for all types is not necessarily optimal; some relevant entities of type organization are removed at expansion levels smaller than 6, while those of type person are only filtered out at level 1.

Table 4 shows the results of applying type filtering to the pure co-occurrence model optimized for precision (top rows) and for recall (bottom rows); relative changes are given w.r.t. the results in Table 2. We see an increase in R-precision for all methods. The best results when optimized for R-precision are achieved with χ², but we see large relative improvements for all methods in R-precision. Type filtering causes recall to drop sharply at low ranks, achieving at most 60% R@2000 as opposed to 83% without filtering (cf. Table 2). The best R-precision scores averaged over all topics are achieved with type filtering at level 2 (L2); this is the setting we will use when reporting scores optimized for R-precision.

Table 4: Results of type filtering with the optimal level of filtering.
Co-occ. | R-Prec | R@100 | R@2000 | R@All
Optimized for Precision
MLE | .1196 (+200%) | .3827 (+29%) | .5924 (-27%) | .5977 (-56%)
χ² | .1753 (+60%) | .3976 (+22%) | .5977 (-38%) | .5977 (-56%)
PMI | .0316 (+30%) | .1920 (+96%) | .5910 (+21%) | .5977 (-56%)
LLR | .1196 (+200%) | .3827 (+29%) | .5857 (-23%) | .5977 (-56%)
Optimized for Recall
MLE | .0791 (+98%) | .3915 (+32%) | .7740 (+3%) | .8667 (-7%)
χ² | .1338 (+22%) | .5012 (+53%) | .7881 (-5%) | .8667 (-7%)
PMI | .0298 (+22%) | .1344 (+37%) | .7065 (+45%) | .8667 (-7%)
LLR | .0791 (+98%) | .3915 (+32%) | .7740 (+8%) | .8667 (-7%)

As to the results optimized for recall, we find, again, that all methods improve both R-precision and R@100. The R@100 and R@2000 results suggest that χ² ranks relevant entities closer to the top 100 than the other methods. We find 79% of the relevant entities in the top 2000 and in total only 7% of the relevant entities are lost by type filtering. We achieve the best recall scores with type filtering at level 6 (L6); this is the value used for recall-optimized settings reported in the remainder of the paper.

By varying the level of expansion we can effectively aim either for R-precision or for R@2000, without hurting the other. This decision is likely to be made depending on whether this is the last component of the pipeline or results will be passed along for downstream processing. Optimizing category expansion levels for precision and recall carries the risk of overfitting, especially on a small topic set. Our aim with this tuning, however, is not to squeeze out the last bit of performance, but to demonstrate that type filtering can effectively be used to balance precision and recall. Two reasons reduce the risk of overfitting: (i) the target types are of a high level, causing the granularity of category expansion to be of a coarse nature, and (ii) the level of expansion is the same for all types.

Analysis. Table 5 shows the top 10 results for topic #17 after type filtering. We see that type filtering effectively removes entities of the wrong type from the ranking: all remaining entities are of type PER and no relevant entities were removed. Another type of error, entities of the right type but not engaged in the required relation R to the source entity E ("Chefs with a show on the Food Network"), is now more prominent (see, e.g., Oprah Winfrey and George W. Bush). In §7 we address this type of error by adding context to the co-occurrence model and only admitting co-occurrences in contexts that display evidence of the required relation.

Table 5: Top 10 entities for topic #17 with type filtering (L2).
Rank | PMI | MLE/LLR | χ²
1 | Wayne Harley Brachman | Alton Brown† | Paula Deen
2 | Kerry Vincent | Rachael Ray | Bobby Flay
3 | Jacqui Malouf | Bobby Flay | Alton Brown
4 | Glenn Lindgren | Chef | Rachael Ray
5 | Geof Manthorne | Paula Deen | Giada De Laurentiis
6 | Anna Pump | Mario Batali |
7 | Alexandra Guarnaschelli | Oprah Winfrey | Guy Fieri
8 | Kenji Fukui | George W. Bush |
9 | Warren Brown | Giada De Laurentiis |
10 | Tatsuo Itoh | Charles Scripps |

7. ADDING CONTEXT
To suppress entities that are of the right type T but that do not engage in the required relation R, we add an additional component, modeling contextual information (the thick box below):

[Figure: query → co-occurrence modeling → type filtering → context modeling (thick box) → results.]

Recall from (4) that the context of a co-occurrence model is captured as $P(R|E,e)$. Putting things together, this is how we rank (§3):
$$P(e|E,T,R) \propto P(R|E,e) \cdot P(e|E) \cdot P(T|e). \qquad (6)$$

Estimation. The $P(R|E,e)$ component is the probability that a relation is generated from ("observable in") the context of a source and candidate entity pair. We represent the relation between a pair of entities by a co-occurrence language model ($\theta_{Ee}$), a distribution over terms taken from documents in which the source and candidate entities co-occur. By assuming independence between the terms in the relation $R$ we arrive at the following estimation:
$$P(R|E,e) = P(R|\theta_{Ee}) = \prod_{t \in R} P(t|\theta_{Ee})^{n(t,R)}, \qquad (7)$$
where $n(t,R)$ is the number of times $t$ occurs in $R$. To estimate the co-occurrence language model $\theta_{Ee}$ we aggregate term probabilities from documents in which the two entities co-occur:
$$P(t|\theta_{Ee}) = \frac{1}{|D_{Ee}|} \sum_{d \in D_{Ee}} P(t|\theta_d), \qquad (8)$$
where $D_{Ee}$ denotes the set of documents in which $E$ and $e$ co-occur and $|D_{Ee}|$ is the number of these documents. $P(t|\theta_d)$ is the probability of term $t$ within the language model of document $d$:
$$P(t|\theta_d) = \frac{n(t,d) + \mu \cdot P(t)}{\sum_{t'} n(t',d) + \mu}, \qquad (9)$$
where $n(t,d)$ is the number of times $t$ appears in document $d$, $P(t)$ is the collection language model, and $\mu$ is the Dirichlet smoothing parameter, set to the average document length in the collection [20].
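A sketch (ours) of Eqs. 7–9 over bag-of-words documents; `docs_with_both` plays the role of D_Ee and `p_collection` of the collection language model P(t), both assumed to be given.

```python
import math
from collections import Counter

def p_term_doc(t, doc_tokens, p_collection, mu):
    """Dirichlet-smoothed document language model, Eq. 9."""
    tf = Counter(doc_tokens)
    return (tf[t] + mu * p_collection.get(t, 1e-9)) / (len(doc_tokens) + mu)

def p_term_pair(t, docs_with_both, p_collection, mu):
    """Co-occurrence language model theta_Ee, Eq. 8: average over D_Ee."""
    return sum(p_term_doc(t, d, p_collection, mu)
               for d in docs_with_both) / len(docs_with_both)

def p_relation_given_pair(relation_terms, docs_with_both, p_collection, mu):
    """P(R|E,e), Eq. 7: query likelihood of the relation terms under theta_Ee
    (repeated terms in the list account for n(t,R); computed in log space)."""
    log_p = sum(math.log(p_term_pair(t, docs_with_both, p_collection, mu))
                for t in relation_terms)
    return math.exp(log_p)
```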
Results. Table 6 shows the results of the context dependent model (including type filtering), optimized for precision (Top) and recall (Bottom); relative changes are w.r.t. the corresponding cells in Table 4. In both cases, R-precision and R@100 are substantially improved, while R@2000 and R@All remain the same or slightly improve. The best performing method across the board is MLE, but there is only a slight difference with the LLR and χ² scores. PMI achieved the largest relative improvements, but it still lags behind the other three methods for both R-precision and R@100.

Table 6: Results of the context dependent model.
Co-occ. | R-Prec | R@100 | R@2000 | R@All
Optimized for Precision
MLE | .2099 (+76%) | .4929 (+29%) | .5950 (0%) | .5977 (0%)
χ² | .2094 (+19%) | .4631 (+16%) | .5977 (0%) | .5977 (0%)
PMI | .0678 (+115%) | .2715 (+41%) | .5889 (-1%) | .5977 (0%)
LLR | .2032 (+70%) | .4955 (+29%) | .5950 (+2%) | .5977 (0%)
Optimized for Recall
MLE | .1905 (+140%) | .6221 (+60%) | .8344 (+8%) | .8667 (0%)
χ² | .1798 (+34%) | .5708 (+14%) | .8459 (+7%) | .8667 (0%)
PMI | .0678 (+127%) | .3313 (+147%) | .8315 (+18%) | .8667 (0%)
LLR | .1705 (+115%) | .5997 (+53%) | .8316 (+7%) | .8667 (0%)

Analysis. Looking at Table 7 we see that several entities have been replaced with others, "fresh" ones. Some that were in the "wrong" relation (i.e., Oprah and Bush, cf. Table 5) have been removed. For both MLE and LLR, Chef and Celebrity are now returned at the top ranks; these entities are observed very frequently together with relation terms (and type filtering erroneously recognizes them as people). Some entities occur only in a handful of documents (<10), as a consequence of which very little evidence of the relation R can be found in their contexts (examples from the qrels include Alexandra Guarnaschelli, Aida Mollenkamp, Daisy Martinez). We observe a larger performance gain for the MLE and LLR based models than for χ². By introducing context, the result lists (consisting of frequent entities, favored by these models) are supplemented with entities that occur in suitable contexts. The entities found by the χ² model show a large overlap with those identified on the basis of context, hence limiting the performance gain.

Table 7: Top 10 entities for topic #17 after adding context.
Rank | PMI | MLE | LLR | χ²
1 | Gennaro Contaldo | Chef | Chef | Bobby Flay
2 | Asako Kishi | Celebrity | Celebrity |
3 | Yutaka Ishinabe | B. Smith | B. Smith |
4 | Karine Bakhoum | Bobby Flay | Bobby Flay | Tyler Florence
5 | Masahiko Kobe | Mario Batali | Mario Batali | Aaron McCargo, Jr.
6 | Tamio Kageyama | Tyler Florence | Bravo | Mario Batali
7 | Toshiro Kandagawa | Bravo | Rachael Ray | Sunny Anderson†
8 | Alpana Singh | Rachael Ray | Tyler Florence | Guy Fieri
9 | Katie Lee Joel | Robert Irvine | Paula Deen | Giada De Laurentiis
10 | Kazuko Hosoki | Anne Burrell | Alton Brown† | Kevin Brauch

These observations point to two issues with using Wikipedia as a corpus: (1) estimates for the pure co-occurrence models are unreliable and (2) the corpus is too small for constructing accurate context models, i.e., there is simply not enough textual material for certain entities. In §8 we address these problems by considering a larger corpus to improve our estimations of the pure co-occurrence model and to gather contexts for more robust context models.
8. IMPROVED ESTIMATIONS
We investigate how using a large corpus (CW-B, §4) for estimating our models can overcome the issue that for some entities their co-occurrences are limited to a small set of pages and that for some there is not enough context to be able to derive a robust language model. These changes affect two components of our pipeline:

[Figure: query → co-occurrence modeling → type filtering → context modeling → results; the co-occurrence and context modeling components (thick boxes) are the ones affected.]

Estimation. Using a large corpus for REF presents two challenges: NER on the entire corpus is time consuming, and the sheer number of entities becomes prohibitively large for any but the simplest of methods. To deal with these issues, we limit ourselves to a "working entity set" consisting of the top 2000 entities produced by the context dependent co-occurrence model (estimated on Wikipedia). We chose the entities returned for PMI without filtering as this produced the highest R@2000 (i.e., 87%). For our pure co-occurrence model we need, for each source-candidate entity pair, the number of documents in which they occur separately and the number of documents in which they co-occur (§5). We estimate these numbers by submitting the top 2000 entities as queries to an indexed version of CW-B, which returns the document IDs. We do the same for the source entities and then compare the document ID lists to find documents with co-occurrences. In order to estimate the context dependent model we consider only documents containing the source entity. We then create the co-occurrence model for a source-candidate entity pair by using the candidate as a query, effectively collecting all documents in which they co-occur.

Results. Table 8 shows the results for the co-occurrence models with estimates obtained from CW-B; relative changes in columns 2 and 3 are w.r.t. Table 4, those in columns 4 and 5 w.r.t. Table 6. In the top left quadrant, R-precision and R@100 of the pure co-occurrence model (optimized for precision) both improve over the same model using Wikipedia-based estimates for all methods: adding data solves the issue of sparse co-occurrences.

In the top right quadrant we see that the addition of context, using CW-B documents, further improves the χ² results, similar to what we saw when adding context in §7. In this case, however, R-precision is worse than that achieved by the Wikipedia-based model for MLE, PMI, and LLR. In contrast, χ² shows a 25% improvement when adding CW-B documents. The models optimized for recall demonstrate similar behavior; the pure co-occurrence model (bottom left) improves over the Wikipedia-based model, while the context dependent one does not, except for χ². For the χ² method, we seem to have reached a good balance between precision and recall, continuing to improve R-precision with improvements or little effect on R@100. For the other methods, the picture is more diverse, especially for recall-optimized type filtering.

Table 8: Results for the context dependent model with filtering and estimations using the CW-B corpus.
Co-occ. | Pure Co-Occurrence R-Prec | Pure Co-Occurrence R@100 | Context Dependent R-Prec | Context Dependent R@100
Optimized for Precision
MLE | .1512 (+26%) | .5423 (+42%) | .1898 (-11%) | .5423 (+10%)
χ² | .2382 (+36%) | .4891 (+23%) | .2623 (+25%) | .4747 (+3%)
PMI | .1363 (+331%) | .3545 (+85%) | .0649 (-8%) | .3137 (+16%)
LLR | .1540 (+29%) | .4947 (+29%) | .1767 (-15%) | .4873 (-2%)
Optimized for Recall
MLE | .0799 (+1%) | .5821 (+49%) | .0966 (-97%) | .6982 (+12%)
χ² | .2281 (+70%) | .5474 (+9%) | .2399 (+33%) | .5418 (-5%)
PMI | .0966 (+224%) | .3748 (+179%) | .0577 (-18%) | .3308 (0%)
LLR | .0793 (0%) | .5655 (+44%) | .0988 (-73%) | .6469 (+8%)

[Figure 5: Differences in R-precision per topic; context dependent model using CW-B vs. Wikipedia. A negative score indicates greater precision for the Wikipedia-based model.]

Analysis. Fig. 5 shows the difference per topic in R-precision of the context dependent model using either CW-B or Wikipedia; a negative score indicates higher R-precision for the Wikipedia-based model. Using Wikipedia documents greatly improves precision scores for three of the topics for MLE and LLR. As we look into these topics we see that the Wikipedia page of each source entity contains a full list of all the relevant entities (e.g., "members of the Beaux Arts Trio" and "members of Jefferson Airplane"), making them relatively easy, with external evidence likely to generate noise. However, the CW-B based model improves R-precision scores on a number of topics, which suggests that we can effectively use a larger corpus to handle a more diverse set of topics. In our running example (cf. Table 9) we now achieve a near perfect ranking for χ², MLE and LLR; PMI still finds only rare entities.

Table 9: Top 10 entities with improved estimations for topic #17; some names truncated for layout reasons.
Rank | PMI | MLE | LLR | χ²
1 | Tamio Kageyama | Alton Brown† | Alton Brown† | Bobby Flay
2 | Kazuko Hosoki | Rachael Ray | Rachael Ray | Paula Deen
3 | Toshiro Kandagawa | Bobby Flay | Bobby Flay | Alton Brown†
4 | Ron Siegel | Mario Batali | Paula Deen | Michael Symon
5 | Mayuko Takata | Paula Deen | Mario Batali | Giada De Laure.
6 | Asako Kishi | Chef | Chef | Rachael Ray
7 | David Evangelista | Cat Cora | Cat Cora | Mario Batali
8 | Dave Spector | Emeril Lagasse | Emeril Lagasse | Cat Cora
9 | Kazushige Nagash. | Michael Symon | Giada De Laure. | Guy Fieri
10 | Chua Lam | Giada De Laure. | Michael Symon | Kenji Fuku
9. HOMEPAGE FINDING
Up to this point in the retrieval pipeline entities have been identified by their Wikipedia page. However, according to the TREC Entity track, an entity is uniquely identified by its homepage; therefore we now focus on the homepage finding component in our architecture:

[Figure: the homepage finding component links entities in the repository (Wikipedia: unique ID, name, type) to their homepages in the web crawl.]

Approach. The 2009 Entity track allows up to three homepages and a Wikipedia page to be returned for each entity and judges pages as either primary³, relevant or non-relevant. In this paper, we define homepage finding as the task of returning the primary homepage for an entity. Our approach combines language modeling based homepage finding and link-based approaches (see below), as a linear mixture with equal weights on the components.

³A primary homepage is the main page about, and in control of, the entity, whereas a relevant page merely mentions the entity.

Ranking homepages. We address homepage finding as a document retrieval problem, and employ a standard language modeling approach with uniform priors [31]; it ranks homepages according to the query likelihood: here, we use the name of the entity $e_n$ as a query, $P(q = e_n|d)$. Successful approaches to named page and homepage finding use a combination of multiple document fields to represent documents [6, 23]. Following [23], we estimate $P(e_n|d)$ as a linear mixture of four components, constructed from the body, title, header and inlink fields. The parameters of the model are estimated empirically, see below.

Ranking links. Since our REF system identifies entities by their Wikipedia pages, it is natural to use the information on those pages for homepage finding; external links often contain a link to the entity's homepage [19, 30]. We, again, view this as a ranking problem and estimate the probability that document $d$ is the homepage of entity $e$ given a link $e_{wl}$ on the entity's Wikipedia page: $P(d|e_{wl})$. We set this probability proportional to the position of the link among all external links on the Wikipedia page ($\mathrm{pos}(e_{wl})$). Since we have to return "valid" homepages (i.e., ones that are present in CW-B), we perform an additional filtering step and exclude URLs from our ranking that do not exist in CW-B.

We also employ a method based on DBpedia, which provides a list of entities with the URL of their homepage.⁴ While these homepages may be more reliable than those found through the external links strategy, the coverage of this method is limited. We set the probability of a homepage given a DBpedia URL, $P(d|e_{dl})$, to 1 if the URL exists in CW-B, and to 0 otherwise. To take advantage of the high quality, but sparse, data in DBpedia, while maintaining high coverage through external links in Wikipedia, we combine the external link and DBpedia strategies using a mixture model; for the sake of simplicity, we set equal weights on both components.

⁴Available at http://wiki.dbpedia.org/Downloads33.
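One possible reading of this combination is sketched below (ours; `lm_score`, `external_links`, `dbpedia_homepage` and `in_cwb` are hypothetical stand-ins for the field-based language model, the ordered external links of the entity's Wikipedia page, the DBpedia homepage lookup and the CW-B membership test). The position-based link score and the exact nesting of the two equal-weight mixtures are assumptions, not the paper's specification.

```python
def link_score(url, external_links):
    """Score a URL by its position among the entity's external links; earlier
    links score higher (one way to make the score proportional to position)."""
    if url not in external_links:
        return 0.0
    weights = [1.0 / (i + 1) for i in range(len(external_links))]
    return (1.0 / (external_links.index(url) + 1)) / sum(weights)

def homepage_score(url, entity_name, external_links, dbpedia_homepage,
                   lm_score, in_cwb):
    """Equal-weight mixture of the language-model and link-based components;
    URLs that are not present in CW-B are filtered out."""
    if not in_cwb(url):
        return 0.0
    p_lm = lm_score(entity_name, url)                          # P(e_n | d), Sec. 9
    p_link = 0.5 * link_score(url, external_links)             # Wikipedia external links
    p_link += 0.5 * (1.0 if url == dbpedia_homepage else 0.0)  # DBpedia URL
    return 0.5 * p_lm + 0.5 * p_link
```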

Evaluation. Before incorporating the homepage finding component into the end-to-end retrieval process, we evaluate its performance on its own. For this purpose we created a test set of homepage finding topics from the TREC 2009 Entity qrels; we consider each entity with a primary homepage as a topic, and take the homepage as the relevant document; topics and qrels are made available, see Fn. 1.

Parameter estimation (for the weights of the document fields in the mixture model) is done in two ways. The first uses the TREC 2002 Web track data [6]; while performance on the Web track's topics (MRR 0.69) is comparable to the best approaches at TREC 2002, this setting does not perform very well on CW-B (MRR 0.35). Our second parameter estimation method utilizes Wikipedia, using the page title as a name for the entity and considering external links with "official website" in their anchor text as homepages of that entity; this leads to a more acceptable performance of the mixture model on CW-B (MRR 0.47).

Turning to an evaluation of the homepage finding method, using the dedicated topics and judgments derived from the TREC 2009 Entity qrels (Fn. 1), we find that, by itself, the DBpedia-based method results in very low scores (MRR 0.08), due to its low coverage. Using external links from Wikipedia pages of entities we achieve a more acceptable score (MRR 0.44). Combining links from DBpedia and Wikipedia results only in minor improvements (MRR 0.45); this is not surprising given that DBpedia is extracted from Wikipedia. The best performance overall is achieved when the language modeling based approach is combined with the two link-based approaches (MRR 0.62); this is the method that we use in the rest of the paper.

10. DISCUSSION
We now perform an end-to-end evaluation on the task specified at the TREC Entity track. According to the track's definition, up to 3 homepages and a Wikipedia page may be returned for each entity; each is judged on a 3-point scale (non-relevant, relevant or primary). We combine the pipeline developed in §5–§8 with the homepage finding component developed in §9. Table 10 presents the results. The Baseline row corresponds to our best performing run on CW-B (cf. §8). Note that we still only consider the 15 topics described in §4. Observe that our recall-oriented model (r1) outperforms other Entity track approaches in terms of the total number of primary pages found (#pri), while the precision-oriented model (p1) is in the top 6 in terms of precision (P@10).

Table 10: Comparison of our best runs and TREC results. Wikipedia pages are counted as primary homepages. P@10, #pri, nDCG@R and #rel use the TREC evaluation; R-prec and R@100 use the Wikipedia-based (WP) evaluation.
Method | P@10 | #pri | nDCG@R | #rel | R-prec | R@100
Optimized for Precision (χ²)
(p1) Baseline (§8) | .2100 | 121 | .1198 | 54 | .2623 | .5423
(p2) Improved type filtering | .2350 | 157 | .1399 | 62 | .2959 | .6017
(p3) Anchor based co-oc. | .3000 | 174 | .1562 | 76 | .3473 | .6667
(p4) Adjusted judgements | .3900 | 186 | .1966 | 78 | .4759 | .6869
Optimized for Recall (MLE)
(r1) Baseline (§8) | .0800 | 171 | .0880 | 105 | .0966 | .6982
(r2) Improved type filtering | .1000 | 177 | .1012 | 102 | .1408 | .7422
(r3) Anchor based co-oc. | .1950 | 187 | .1444 | 143 | .2730 | .7496
(r4) Adjusted judgements | .3450 | 214 | .2207 | 156 | .4134 | .8057
Best runs from the TREC Entity track
KMR1PU [12] | .4450 | 137 | .2210 | 115 | .5494 | .5755
ICTZHRun1 [43] | .3100 | 124 | .1525 | 69 | .3182 | .4638
NiCTm2 [40] | .3050 | 124 | .1689 | 98 | .2721 | .3820
tudpwkntop [30] | .2600 | 144 | .1506 | 128 | .2705 | .5721
uogTrEpr [22] | .2350 | 135 | .1760 | 311 | .2945 | .4536

Next we take a look at how competitive our results are when we apply heuristic methods that were popular at the Entity track to our model. We experiment with two additional techniques.

Improved type filtering. Serdyukov and de Vries [30] use the high quality type definitions provided by the DBpedia ontology⁵ to perform type filtering. We follow this approach and map the ontology categories "Person" and "Organization" to their respective topic target types (PER and ORG). We associate the class "Resource" with the product target type (PROD), as there is no specific product category in the ontology. In case an entity does not occur in the ontology, we fall back to our Wikipedia-based filtering (either precision- or recall-oriented), as described in §6. We incorporate this in the type filtering component of our model as follows:
$$P(T|e) = \begin{cases} 1 & \text{if } \mathrm{ont}(e) \neq \emptyset \wedge T \cap \mathrm{ont}(e) \neq \emptyset \\ 1 & \text{if } \mathrm{ont}(e) = \emptyset \wedge \mathrm{cat}(e) \cap \mathrm{cat}_{L_n}(T) \neq \emptyset \\ 0 & \text{otherwise,} \end{cases}$$
where $T \in \{\mathrm{PER}, \mathrm{ORG}, \mathrm{PROD}\}$ and $\mathrm{ont}(e)$ returns the set of types for an entity in the DBpedia ontology.

The combination of our category based filtering approach with DBpedia based filtering has a positive effect on both precision and recall, Table 10 (p2 and r2). The two approaches complement each other, as the category based filtering covers all entities but is imprecise, while filtering based on the DBpedia ontology is precise but only covers some of the entities in Wikipedia.

⁵Obtained from http://wiki.dbpedia.org/Ontology.
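A sketch (ours) of this combined filter; `ontology_types` stands in for the DBpedia ontology lookup ont(e) and `category_filter` for the Wikipedia-category filter of §6.

```python
def p_type_with_ontology(T, e, ontology_types, category_filter):
    """Type filter backed by the DBpedia ontology, falling back to the
    Wikipedia-category filter of Section 6 when the entity is not covered."""
    types = ontology_types(e)   # set of DBpedia ontology types for e (may be empty)
    if types:
        return 1.0 if T in types else 0.0
    return category_filter(T, e)
```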
Anchor-based co-occurrence. Another approach employed at the Entity track [30, 40] is to only consider entities that link to, or are linked from, the Wikipedia page of the source entity. We view this as a special case of co-occurrence; its strength is proportional to the number of times source and target entities cross-link to each other on their corresponding Wikipedia pages. We estimate this anchor based co-occurrence as follows:

P_{anc}(e|E) = \lambda_a \cdot \frac{c(e, E_a)}{\sum_{e'} c(e', E_a)} + (1 - \lambda_a) \cdot \frac{c(E, e_a)}{\sum_{E'} c(E', e_a)},

where c(e, E_a) is the number of times the candidate entity e occurs in the anchor text on the Wikipedia page of the input entity E, and c(E, e_a) is the other way around. We incorporate this into the pure co-occurrence component (§5) as a sum with equal weights.

With the addition of the anchor-based co-occurrence we further improve our precision and recall scores; see Table 10 (p3 and r3). Anchor-based co-occurrence works well in this setting as for most topics the relevant entities occur as anchor texts on the page of the source entity and vice versa (e.g., topics #9 and #20).
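A minimal sketch of this estimate is given below, assuming anchor counts have been pre-extracted from the Wikipedia dump into a nested mapping anchor_counts[page][anchored_entity]; both the data structure and the default λ_a = 0.5 are illustrative assumptions rather than our exact implementation.

def anchor_cooccurrence(candidate, source, anchor_counts, lambda_a=0.5):
    """Sketch of P_anc(e|E) built from cross-linking anchor-text counts.

    anchor_counts: page entity -> {anchored entity -> number of anchor occurrences}
    """
    # c(e, E_a) / sum_e' c(e', E_a): candidate as anchor text on the source entity's page.
    on_source = anchor_counts.get(source, {})
    p_fwd = on_source.get(candidate, 0) / sum(on_source.values()) if on_source else 0.0
    # c(E, e_a) / sum_E' c(E', e_a): source as anchor text on the candidate's page.
    on_candidate = anchor_counts.get(candidate, {})
    p_bwd = on_candidate.get(source, 0) / sum(on_candidate.values()) if on_candidate else 0.0
    return lambda_a * p_fwd + (1.0 - lambda_a) * p_bwd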
Wikipedia-based evaluation. Another way to compare our model and those of other TREC participants is to use the Wikipedia based evaluation employed throughout the paper. From each participant’s run we extract the Wikipedia fields and evaluate the number of primary Wikipedia pages for each topic in terms of R-precision and Recall@100. We observe that we outperform all but one of the other approaches in terms of R-precision (p3) and all approaches in terms of Recall@100 (r1, r2 and r3). The high precision achieved by the best performing team is due to their extensive use of heuristics, e.g., using a web search engine to collect relevant pages, crafting extraction patterns and exploiting lists and tables [12].

Adjusted judgements. Finally, the runs produced by our models are not official TREC runs and as such were not included in the assessment procedure; this might leave us with sparse judgments. Following standard TREC practice, non-judged documents are considered non-relevant; the resulting scores could therefore be an underestimation of our actual retrieval performance. To investigate how this affects our results we remove all entities for which there is no judgment available at all (neither primary, relevant, nor non-relevant, for either the homepage or Wikipedia fields). We observe that only considering judged entities has a big effect on the precision and recall of our model (extended with anchor based co-occurrence and improved type filtering), see Table 10 (p4 and r4). In the precision oriented model 763 of the 6184 pages are judged (186 primary, 78 relevant). In the recall oriented model 1119 of the 6172 pages are judged (214 primary, 156 relevant). These numbers show that many of the returned entities have not been judged, impeding an assessment of the full potential of our models.
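The snippet below shows one way to restrict a run to judged entities before scoring, in the spirit of this adjusted evaluation; the qrels representation (topic -> entity -> judgment) and the function name are assumptions rather than the exact tooling we used.

def restrict_to_judged(run, qrels):
    """Sketch: drop entities without any judgment (primary, relevant or
    non-relevant, on either the homepage or Wikipedia field) before scoring.

    run:   topic id -> ranked list of entities
    qrels: topic id -> {entity -> judgment label}
    """
    return {topic: [entity for entity in ranking if entity in qrels.get(topic, {})]
            for topic, ranking in run.items()}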

11. CONCLUSION

We examined an architecture for addressing the related entity finding (REF) task on a web corpus, where we focused on four core components: pure co-occurrence, type filtering, contextual information, and homepage finding. Initially we investigated the task on a smaller, less noisy corpus, using Wikipedia pages to uniquely identify entities. To identify a potential set of related entities we looked at four measures for computing co-occurrence and found that χ2 performed best. An analysis showed that rankings of all methods were polluted by entities of the wrong type. We found that even a basic category based type filtering approach is very effective and that the level of category expansion can be tuned towards precision or recall. Furthermore, adding context improves both recall and precision by ensuring that source and target entities engage in the right relation. We then looked at the REF task in the setting of a web corpus and found that using a larger corpus improves the estimations of both co-occurrence and context models. To conform to the official REF task we used a homepage finding component to map the Wikipedia entity representation to a homepage and found that our framework achieves decent precision and very high recall scores compared to other approaches on the official task. Finally, we found that our model can effectively incorporate additional heuristics that lead to state-of-the-art performance.

Acknowledgements. This research was supported by the European Union’s ICT Policy Support Programme as part of the Competitiveness and Innovation Framework Programme, CIP ICT-PSP under grant agreement nr 250430, by the DuOMAn project carried out within the STEVIN programme which is funded by the Dutch and Flemish Governments under project nr STE-09-12, by the Netherlands Organisation for Scientific Research (NWO) under project nrs 612.066.512, 612.061.814, 612.061.815, 640.004.802, and partially by the Center for Creation, Content and Technology (CCCT).
12. REFERENCES

[1] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In SIGIR '06, pages 43–50, 2006.
[2] K. Balog, L. Azzopardi, and M. de Rijke. A language modeling framework for expert finding. Inf. Proc. and Man., 45(1):1–19, 2009.
[3] K. Balog, A. P. de Vries, P. Serdyukov, P. Thomas, and T. Westerveld. Overview of the TREC 2009 entity track. In TREC '09, 2009.
[4] Y. Cao, J. Liu, S. Bao, and H. Li. Research on expert search at enterprise track of TREC 2005. In TREC '05, 2005.
[5] ClueWeb09. The ClueWeb09 dataset, 2009. URL: http://boston.lti.cs.cmu.edu/Data/clueweb09/.
[6] N. Craswell and D. Hawking. Overview of the TREC-2002 Web Track. In TREC '02, pages 86–95, 2002.
[7] N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC 2005 Enterprise Track. In TREC '05, 2005.
[8] N. Craswell, G. Demartini, J. Gaugaz, and T. Iofciu. L3S at INEX 2008. In Geva et al. [14], pages 253–263.
[9] A. de Vries et al. Overview of the INEX 2007 Entity Ranking Track. In Fuhr et al. [13], pages 245–251.
[10] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Comp. Ling., 19(1):61–74, 1993.
[11] H. Fang and C. Zhai. Probabilistic models for expert finding. In ECIR '07, pages 418–430, 2007.
[12] Y. Fang et al. Entity retrieval with hierarchical relevance model. In TREC '09, 2009.
[13] N. Fuhr, J. Kamps, M. Lalmas, and A. Trotman, editors. Focused access to XML documents: INEX 2007. Springer, 2008.
[14] S. Geva, J. Kamps, and A. Trotman, editors. Advances in Focused Retrieval: INEX 2008. Springer, 2009.
[15] T. Hasegawa, S. Sekine, and R. Grishman. Discovering relations among named entities from large corpora. In ACL '04, 2004.
[16] M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Comp. Ling. '92, pages 539–545, 1992.
[17] J. Jiang et al. Adapting language modeling methods for expert search to rank Wikipedia entities. In Geva et al. [14], pages 264–272.
[18] Y. Jing and W. B. Croft. An association thesaurus for information retrieval. In RIAO '94, pages 146–160, 1994.
[19] R. Kaptein, M. Koolen, and J. Kamps. Result diversity and entity ranking experiments. In TREC '09, 2009.
[20] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR '01, 2001.
[21] C. D. Manning and H. Schuetze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[22] M. McCreadie, C. Macdonald, I. Ounis, J. Peng, and R. L. T. Santos. University of Glasgow at TREC 2009. In TREC '09, 2009.
[23] P. Ogilvie and J. P. Callan. Combining document representations for known-item search. In SIGIR '03, pages 143–150, 2003.
[24] J. Pamarthi, G. Zhou, and C. Bayrak. A journey in entity related retrieval for TREC 2009. In TREC '09, 2009.
[25] D. Petkova and W. B. Croft. Proximity-based document representation for named entity retrieval. In CIKM '07, pages 731–740, 2007.
[26] H. Raghavan, J. Allan, and A. McCallum. An exploration of entity models, collective classification and relation description. In LinkKDD '04, 2004.
[27] D. Ravichandran and E. H. Hovy. Learning surface text patterns for a question answering system. In ACL '02, pages 41–47, 2002.
[28] E. Riloff. Automatically generating extraction patterns from untagged text. In AAAI, Vol. 2, pages 1044–1049, 1996.
[29] S. Schlobach, M. Olsthoorn, and M. de Rijke. Type checking in open-domain question answering. In ECAI '04, 2004.
[30] P. Serdyukov and A. de Vries. Delft University at the TREC 2009 Entity Track: Ranking Wikipedia entities. In TREC '09, 2009.
[31] F. Song and W. B. Croft. A general language model for information retrieval. In CIKM '99, pages 316–321, 1999.
[32] T. Tsikrika et al. Structured document retrieval, multimedia retrieval, and entity ranking using PF/Tijah. In Fuhr et al. [13], pages 306–320.
[33] A.-M. Vercoustre, J. Pehcevski, and J. A. Thom. Using Wikipedia categories and links in entity ranking. In Fuhr et al. [13].
[34] E. M. Voorhees. Overview of the TREC 2002 Question Answering Track. In TREC '02, pages 115–123, 2002.
[35] E. M. Voorhees. The TREC-8 question answering track report. In TREC '99, pages 77–82, 1999.
[36] E. M. Voorhees and C. Buckley. The effect of topic set size on retrieval experiment error. In SIGIR '02, pages 316–323, 2002.
[37] V. G. V. Vydiswaran, K. Ganesan, Y. Lv, J. He, and C. Zhai. Finding related entities by retrieving relations. In TREC '09, 2009.
[38] Z. Wang, D. Liu, W. Xu, G. Chen, and J. Guo. BUPT at TREC 2009: Entity Track. In TREC '09, 2009.
[39] W. Weerkamp, J. He, K. Balog, and E. Meij. A generative language modeling approach for ranking entities. In Geva et al. [14].
[40] Y. Wu and H. Kashioka. NiCT at TREC 2009: Employing three models for Entity Ranking Track. In TREC '09, 2009.
[41] J. Xu and W. B. Croft. Improving the effectiveness of information retrieval with local context analysis. ACM TOIS, 18:79–112, 2000.
[42] Q. Yang, P. Jiang, C. Zhang, and Z. Niu. Experiments on related entity finding track at TREC 2009. In TREC '09, 2009.
[43] H. Zhai, X. Cheng, J. Guo, H. Xu, and Y. Liu. A novel framework for related entities finding: ICTNet at TREC 2009. In TREC '09, 2009.
[44] W. Zheng, S. Gottipati, J. Jiang, and H. Fang. UDEL/SMU at TREC 2009 Entity Track. In TREC '09, 2009.
[45] J. Zhu, D. Song, and S. Rüger. Integrating document features for entity ranking. In Fuhr et al. [13], pages 336–347.
