Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference

Robert L. Logan IV∗♦   Andrew McCallum♠♥   Sameer Singh♦   Daniel Bikel♠
♦University of California, Irvine   ♠Google Research   ♥University of Massachusetts, Amherst
{rlogan, sameer}@uci.edu   {mccallum, dbikel}@google.com

Abstract

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models.1 We investigate: how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.

1 Introduction

The ability to disambiguate mentions of named entities in text is a central task in the field of information extraction, and is crucial to topic tracking, knowledge base induction, and question answering. Recent work on this problem has focused almost solely on entity linking–based approaches, i.e., models that link mentions to a fixed set of known entities. While significant strides have been made on this front—with systems that can be trained end-to-end (Kolitsas et al., 2018), trained on millions of entities (Ling et al., 2020), and that link to entities using only their textual descriptions (Logeswaran et al., 2019)—all entity linking systems suffer from the significant limitation that they are restricted to linking to a curated list of entities that is fixed at inference time. Thus they are of limited use when processing data streams where new entities regularly appear, such as research publications, social media feeds, and news articles.

In contrast, the alternative approach of cross-document entity coreference (CDC) (Bagga and Baldwin, 1998; Gooi and Allan, 2004; Singh et al., 2011; Dutta and Weikum, 2015), which disambiguates mentions via clustering, does not suffer from this shortcoming. Instead, most CDC algorithms suffer from a different failure mode: lack of scalability. Since they run expensive clustering routines over the entire set of mentions, they are not well suited to applications where mentions arrive one at a time. There is, however, a subset of streaming CDC methods that avoids this issue by clustering mentions incrementally (Figure 1). Unfortunately, despite such methods' apparent fitness for streaming data scenarios, this area of research has received little attention from the NLP community. To our knowledge there are only two existing works on the task (Rao et al., 2010; Shrimpton et al., 2015), and only the latter evaluates truly streaming systems, i.e., systems that process new mentions in constant time with constant memory.

One crucial factor limiting research on this topic is a lack of free, publicly accessible benchmark datasets; datasets used in existing works are either small and impossible to reproduce (e.g., the dataset collected by Shrimpton et al. (2015) only contains a few hundred unique entities, and many of the annotated tweets are no longer available for download) or lack the necessary canonical ordering and are expensive to procure (e.g., the ACE 2008 and TAC-KBP 2009 corpora used by Rao et al. (2010)).

∗ Work done during an internship at Google Research.
1 Code and data available at: https://github.com/rloganiv/streaming-cdc




(a) New mentions arrive over time.

(b) Mentions are encoded as points in a vector space and incrementally clustered. As the space grows some points are removed to ensure that the amount of memory used does not exceed a given threshold.

Figure 1: Streaming Cross-Document Coreference.

To remedy this, we compile a benchmark of three datasets for evaluating English streaming CDC systems, along with a canonical ordering in which evaluation data should be processed. These datasets are derived from existing datasets that cover diverse subject matter: biomedical texts (Mohan and Li, 2019), news articles (Hoffart et al., 2011), and Wikia fandoms (Logeswaran et al., 2019).

We evaluate a number of novel and existing streaming CDC systems on this benchmark. Our systems utilize a two-step approach where: 1) each mention is encoded using a neural or feature-based model, and 2) the mention is then clustered with existing mentions using an incremental clustering algorithm. We investigate the performance of different mention encoders (existing feature-based methods, pretrained LMs, and encoders from entity linkers such as RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020)) and incremental clustering algorithms (greedy nearest-neighbors clustering, and a recently introduced online agglomerative clustering algorithm, GRINCH (Monath et al., 2019)). Since GRINCH does not use bounded memory, which is required for scalability in the streaming setting, we introduce a novel bounded memory variant that prunes nodes from the cluster tree when the number of leaves exceeds a given size, and compare its performance to existing bounded memory approaches.

Our results show that the relative performance of different mention encoders and clustering algorithms varies across different domains. We find that existing approaches for streaming CDC (e.g., feature-based mention encoding with greedy nearest-neighbors clustering) outperform neural approaches on two of three datasets (+1–3% abs. improvement in CoNLL F1), while a RELIC-based encoder with GRINCH performs better on the last dataset (+9% abs. improvement in CoNLL F1). In cases where existing approaches perform well, we also find that better performance can be obtained by using a combination of neural and feature-based mention encoders. Lastly, we observe that by using relatively simple memory management policies, e.g., removing old and redundant mentions from the mention cache, bounded memory models can achieve performance near on-par with unbounded models while storing only a fraction of the mentions (in one case we observe a 2% abs. drop in CoNLL F1 caching only 10% of the mentions).

2 Streaming Cross-Document Entity Coreference (CDC)

2.1 Task Overview

The key goal of cross-document entity coreference (CDC) is to identify mentions that refer to the same entity. Formally, let M = {m_1, . . . , m_|M|} denote a corpus of mentions, where each mention consists of a surface text m.surface (e.g., the colored text in Figure 1a), as well as its surrounding context m.context (e.g., the text in black). Provided M as an input, a CDC system produces a disjoint clustering over the mentions C = {C_1, . . . , C_|C|}, |C| ≤ |M|, as the output, where each cluster C_e = {m ∈ M | m.entity = e} is the set of mentions that refer to the same entity.

In streaming CDC, there are two additional requirements: 1) mentions arrive in a fixed order

(M is a list) and are clustered incrementally, and 2) memory is constrained so that only a fixed number of mentions can be stored. This can be formulated in terms of the above notation by adding a time index t, so that M_T = {m_t ∈ M | t ≤ T} is the set of all mentions observed at or before time T, M̃_T ⊆ M_T is a subset of "active" mentions whose size does not exceed a fixed memory bound k, i.e., |M̃_T| ≤ k, and C_T is comprised of clusters that only contain mentions in M̃_T. Due to the streaming nature, M̃_T − {m_T} ⊂ M̃_{T−1}, i.e., a mention cannot be added back to M̃_T if it was previously removed. When the memory bound is reached, mentions are removed from M̃ according to a memory management policy Φ.

An illustrative example is provided in Figure 1. Mentions arrive in left-to-right order (Figure 1a), with the clustering process depicted in Figure 1b (memory bound k = 3). At time T = 4, the mention m_1 is removed from M̃_4. Note that, even though m_1 is removed, it is still possible to disambiguate mentions of all previously observed entities, whereas this would not be possible had m_3 or m_4 been removed. This illustrates the effect the memory management policy can have on performance.
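To make the interface concrete, the following is a minimal sketch of the streaming loop just described, assuming a generic encoder Enc, a similarity threshold τ, and an eviction policy Φ. The function names, the cosine similarity, and the oldest-first default policy are illustrative assumptions of this sketch, not details of our released implementation.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v) / (float(np.linalg.norm(u) * np.linalg.norm(v)) + 1e-12)


def stream_cdc(mentions, encode, tau=0.8, k=1000, evict=None):
    """Incrementally cluster an ordered stream of mentions.

    mentions: iterable of raw mentions in canonical order.
    encode:   maps a mention to a vector (Enc in Section 3.1).
    tau:      similarity threshold for linking to active mentions.
    k:        memory bound on the number of active mentions (|M-tilde| <= k).
    evict:    memory management policy Phi; None means oldest-first (window).
    """
    active = []        # bounded list of (mention index, vector): M-tilde
    cluster_of = {}    # mention index -> cluster id (the clustering C_t)
    next_id = 0
    for t, mention in enumerate(mentions):
        vec = encode(mention)
        # Link to ALL sufficiently similar active mentions, which may
        # merge previously separate clusters (see Section 3.2).
        hits = [i for i, (_, v) in enumerate(active) if cosine(vec, v) > tau]
        if hits:
            target = min(cluster_of[active[i][0]] for i in hits)
            stale = {cluster_of[active[i][0]] for i in hits}
            for idx, c in cluster_of.items():
                if c in stale:
                    cluster_of[idx] = target
            cluster_of[t] = target
        else:
            cluster_of[t] = next_id
            next_id += 1
        active.append((t, vec))
        if len(active) > k:          # memory bound reached: evict one mention;
            victim = evict(active) if evict else 0
            active.pop(victim)       # once removed, it can never return
    return cluster_of
```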
2.2 Background and Motivation

Cross Document Entity Coreference  As we show later, we employ a two-stage CDC pipeline where mentions are first encoded as vectors and subsequently clustered. This approach is used in most existing work on CDC (Bagga and Baldwin, 1998; Mann and Yarowsky, 2003; Gooi and Allan, 2004). In the past decade, research on CDC has mainly focused on improving scalability (Singh et al., 2011) and on jointly learning to perform CDC with other tasks such as entity linking (Dutta and Weikum, 2015) and event coreference (discussed in the next paragraph). This work similarly investigates whether entity linking is beneficial for CDC; however, we use entity linkers that are pretrained separately and kept fixed during inference.

Recently, there has been a renewed interest in performing CDC jointly with cross-document event coreference (Barhom et al., 2019; Meged et al., 2020; Cattan et al., 2020; Caciularu et al., 2021) on the ECB+ dataset (Cybulska and Vossen, 2014). Although we do not evaluate methods from this line of research in this work, we hope that the benchmark we compile will be useful for future evaluation of these systems.

Streaming Cross Document Coreference  The methods mentioned in the previous paragraphs disambiguate mentions all at once, and are thus unsuitable for applications where a large number of mentions appear over time. Rao et al. (2010) propose to address this issue using an incremental clustering approach where each new mention is either placed into one of a number of candidate clusters, or into a new cluster if its similarity does not exceed a given threshold (Allaway et al. (2021) use a similar approach for joint entity and event coreference). Shrimpton et al. (2015) note that this incremental clustering does not process mentions in constant time/memory, and thus is not "truly streaming". They present the only truly streaming approach for CDC by introducing a number of memory management policies that limit the size of M̃, which we describe in more detail in Section 3.3.

One of the key problems inhibiting further research on streaming CDC is a lack of suitable evaluation datasets for measuring system performance. The datasets used in Rao et al. (2010) are either small in size (a few hundred mentions), contain few annotated entities, or are expensive to procure. Additionally, they do not include any canonical ordering of the mentions, which precludes consistent evaluation of streaming systems. Meanwhile, the tweets annotated by Shrimpton et al. (2015) only cover two surface texts (Roger and Jessica) and are no longer accessible via the Twitter API.2 To address this, we collect a new evaluation benchmark comprised of 3 existing publicly available datasets, covering a diverse collection of topics (news, biomedical articles, Wikias) with natural orderings (e.g., chronological, categorical). This benchmark is described in detail in Section 4.1.

2 At this time, only 56 of the first 100 tweets were available.

Entity Linking  CDC is similar to the task of entity linking (EL; Mihalcea and Csomai (2007)), which also addresses the problem of named entity disambiguation, with the key distinction that EL is formulated as a supervised classification problem (the list of entities is known at training and test time), while CDC is an unsupervised clustering problem. In particular, CDC is similar to time-aware EL (Agarwal et al., 2018)—where temporal context is used to help disambiguate mentions—and zero-shot EL (Zeshel; Logeswaran et al. (2019))—where the set of entities linked to during evaluation does not overlap with the set of entities observed during training.
Streaming CDC can also be considered a method for time/order-aware, zero-shot named entity disambiguation; however, it is strictly more challenging, as it does not assume access to a curated list of entities at prediction time, or any supervised training data.

Although CDC is formulated as a strictly unsupervised clustering task, this does not preclude the usage of labeled data for transfer learning. One of the primary goals in this work is to investigate whether the mention encoders learned by entity linking systems provide useful representations in the first step of the CDC pipeline. Specifically, we consider mention encoders for two state-of-the-art entity linking architectures: RELIC (Ling et al., 2020) and the BLINK bi-encoder (Wu et al., 2020).

Emerging Entity Detection  Streaming CDC is also related to the task of emerging entity detection (EED; Färber et al. (2016)), which, given a mention that cannot be linked, seeks to predict whether it should produce a new KB entry. Although both tasks share similar motivations, they adopt different approaches (EED is formulated as a binary classification task), and CDC does not require deciding which entities should and should not be added to a knowledge base. However, in many practical applications, it may make sense to apply streaming CDC only to emerging entities.

3 Building Streaming CDC Systems

Following previous work, we adopt a two-step approach to performing streaming cross-document coreference. In the first step, an encoder is used to produce a vector representation of the incoming mention: m_t = Enc(m_t). In the second step, these vectors are input into an incremental clustering algorithm to update the predicted clustering: C_t = Clust(C_{t−1}, m_t). In the following sections we describe in detail the mention encoders and clustering algorithms used in this work.

3.1 Mention Encoders

The primary goal of mention encoders Enc(m_t) is to produce a compact representation of the mention, including both the surface and the context text.

Feature-Based Encoders  Existing models for streaming cross-document coreference exclusively make use of feature-based mention encoders. While there are many feature engineering options explored in the literature, in this work we consider the mention encoding approach proposed by Shrimpton et al. (2015), which uses character skip bigram indicator vectors to encode the surface text, and tf-idf vectors to represent contexts. When using this encoding scheme, similarity scores are computed independently for the surface and context embeddings, and a weighted average is taken to produce the final similarity score. We use the same setup and parameters as Shrimpton et al. (2015).
Masked Language Model Encoders  We also consider mention encodings produced by masked language models, particularly BERT (Devlin et al., 2019). We encode the mention by feeding the contiguous text of the mention (containing both the surrounding and surface text) into BERT and concatenating the contextualized vectors associated with the first and last word-piece of the surface text. That is, let s, e ∈ N denote the start and end of the mention surface text within the complete mention, and let M = BERT(m) denote the contextualized word vectors output by BERT. Then the mention encoding is given by: Enc_MLM(m) = [M[s]; M[e]].
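For concreteness, the sketch below computes Enc_MLM(m) with the HuggingFace transformers library. The example text, the use of character offsets to locate the surface span, and the specific checkpoint are assumptions of this illustration rather than details of our pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()


def enc_mlm(text: str, char_start: int, char_end: int) -> torch.Tensor:
    """Enc_MLM(m) = [M[s]; M[e]]: concatenate the contextualized vectors of
    the first (s) and last (e) word-piece of the mention surface text."""
    enc = tokenizer(text, return_offsets_mapping=True,
                    return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0]          # per-token character spans
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    # Word-pieces whose character span overlaps the surface span
    # (special tokens have empty (0, 0) spans and are skipped).
    inside = [i for i, (a, b) in enumerate(offsets.tolist())
              if a < char_end and b > char_start and b > a]
    s, e = inside[0], inside[-1]
    return torch.cat([hidden[s], hidden[e]])        # 1536-dim for BERT-base


context = ("Middle East respiratory syndrome (MERS) outbreaks have been "
           "linked to healthcare facilities.")
surface = "Middle East respiratory syndrome"
start = context.index(surface)
print(enc_mlm(context, start, start + len(surface)).shape)  # torch.Size([1536])
```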

Entity Linker-Based Encoders  We consider producing mention encodings using two bi-encoder-based neural entity linkers: RELIC (Ling et al., 2020) and BLINK (Wu et al., 2020). The bi-encoder architecture is comprised of two components—a mention encoder Enc_m and an entity encoder Enc_e—and is trained to maximize a similarity score (e.g., dot product) between the mention encoding and the encoding of its underlying entity, while simultaneously minimizing the score for other entities. We use Enc_m from pretrained entity linkers to encode mentions for CDC.

Hybrid Encoder  We also consider a hybrid encoder which combines feature-based and neural mention encoders. We retain the feature-based character skip bigram surface text encoder, but use one of the neural encoders from entity linkers in place of the tf-idf context representation. Unlike Shrimpton et al. (2015), similarity scores are computed by averaging the two without any weights.
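A sketch of the hybrid scoring follows. The character skip-bigram features with a Jaccard overlap are one plausible stand-in for the surface similarity of Shrimpton et al. (2015) (whose exact featurization we do not reproduce here), and the neural context similarity is passed in from whichever entity-linker encoder is used.

```python
from itertools import combinations


def skip_bigrams(surface: str) -> set:
    """All ordered character pairs (i < j) of the lowercased surface text."""
    return set(combinations(surface.lower(), 2))


def surface_similarity(s1: str, s2: str) -> float:
    """Jaccard overlap of skip-bigram feature sets (one simple choice)."""
    a, b = skip_bigrams(s1), skip_bigrams(s2)
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_similarity(surface1: str, surface2: str, context_sim: float) -> float:
    """Unweighted average of surface and neural context similarities."""
    return 0.5 * (surface_similarity(surface1, surface2) + context_sim)


print(hybrid_similarity("MERS", "MERS-CoV", context_sim=0.9))
```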


Figure 2: Bounded GRINCH. (a) Greedy agglomerative clustering produces a sub-optimal tree structure due to the order in which points are received. (b) The GRINCH (Monath et al., 2019) algorithm recovers from this mistake by reconfiguring the tree structure (in this case using a rotation operation). (c) To ensure the memory used by GRINCH remains bounded, we add a new operation—remove—that prunes two leaf nodes when the number of leaves exceeds a given size. Nodes are selected for removal using a scoring function φ_r. In this case nodes m_1 and m_2 are selected, and their parent m̃_{1,2} becomes a new leaf.

3.2 Clustering Algorithms

Here we describe incremental clustering approaches, Clust(C_{t−1}, m_t), that compute a new clustering when m_t is added to the mentions under consideration (M̃).

Greedy Nearest Neighbors Clustering  Shrimpton et al. (2015) and Rao et al. (2010) both evaluate CDC using a single-linkage incremental clustering approach that clusters each new mention m with its nearest neighbor m′ = argmax_{m′ ∈ M̃} sim(m, m′), provided the similarity exceeds some threshold τ. We use a similar approach here; however, we cluster m with all m′ ∈ M̃ such that sim(m, m′) > τ, thus allowing previously separate clusters to be merged if m is similar to both of them.

GRINCH  Gooi and Allan (2004) find that average-link hierarchical agglomerative clustering can outperform greedy single-link approaches. However, agglomerative approaches are typically not used for streaming CDC because running the algorithm at each time step is too expensive, and incremental variants of the approach are not able to recover from incorrect choices made early on (Figure 2a). The recently introduced GRINCH clustering algorithm (Monath et al., 2019) uses rotate and graft operations that reconfigure the tree, thereby avoiding these issues (Figure 2b). We defer to the original paper for details, but note that, for our application, each interior node of the cluster tree is computed as a weighted average of its children's representations (where the weights are proportional to the number of leaves). Thus, at each interior node, it is possible to compute the similarity score between that node's children. This allows us to produce a flat clustering from the cluster tree by thresholding the similarity score, just as in the greedy clustering case.

3.3 Memory Management Policies

As described in Section 2.1, memory management policies decide which mentions to remove from M̃ to prevent its size from exceeding the memory bound, providing scalable, memory-bounded variants of the clustering algorithms.

Bounded Memory Greedy NN Clustering  For bounded memory greedy nearest neighbors clustering, we consider the following memory management policies of Shrimpton et al. (2015):
• Window: Remove the oldest mention in M̃.
• Cache: Remove the oldest mention in the least recently updated cluster C_LRU.
• Diversity: Remove the mention most similar to the mention just added, i.e., argmax_m sim(m, m_t).
• Diversity-Cache: A combination of the diversity and cache strategies, where the diversity strategy is used if the similarity score exceeds a given threshold, sim(m, m_t) > α, and the cache strategy is used otherwise.

Bounded Memory GRINCH  Memory management for GRINCH is more complicated than for greedy clustering: instead of maintaining a flat clustering of mentions, GRINCH maintains a cluster hierarchy in the form of a binary cluster tree. Every time a mention is inserted into the tree, two new nodes are created: one node for the mention itself, and a new parent node linking the mention to its sibling (Figure 2a). Accordingly, when the memory bound is reached, the memory management policy for GRINCH must remove two nodes from the tree. Furthermore, in order to preserve the tree's binary structure, the removed nodes must be leaf nodes as well as siblings. Because the original GRINCH algorithm only includes routines for inserting nodes into the tree and reconfiguring the tree's structure, we modify GRINCH to include a new remove operation that prunes two nodes satisfying these criteria. The parent of these nodes then becomes a leaf node, whose vector representation is produced by combining the vector representations of its former children using a weighted average (this is conceptually similar to the collapse operation described in Kobren et al. (2017)). We consider the following policies here:
• Window: Remove the nodes whose parent was least recently added to the tree.
• Diversity: Remove the pair of nodes that are most similar to each other.
A minimal sketch of the remove operation is given below.
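The sketch illustrates the new remove operation and the diversity selection policy. The Node class, the dot-product score standing in for φ_r, and the bookkeeping are simplified assumptions made for illustration; the actual modification is made to the official GRINCH implementation.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np


@dataclass
class Node:
    vec: np.ndarray                  # weighted average of descendant leaf vectors
    n_leaves: int = 1
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def remove(leaf_a: Node, leaf_b: Node) -> None:
    """Prune two sibling leaves; their parent becomes a new leaf whose vector
    is the leaf-count-weighted average of its former children (conceptually
    similar to the collapse operation of Kobren et al. (2017))."""
    parent = leaf_a.parent
    assert parent is not None and parent is leaf_b.parent
    assert leaf_a.is_leaf() and leaf_b.is_leaf()
    total = leaf_a.n_leaves + leaf_b.n_leaves
    parent.vec = (leaf_a.n_leaves * leaf_a.vec
                  + leaf_b.n_leaves * leaf_b.vec) / total
    parent.n_leaves = total
    parent.children = []             # the parent is now a leaf
    leaf_a.parent = leaf_b.parent = None


def diversity_pair(leaves):
    """Diversity policy: pick the most similar pair of sibling leaves."""
    best, best_score = None, -np.inf
    for leaf in leaves:
        if leaf.parent is None:
            continue
        siblings = [c for c in leaf.parent.children if c is not leaf]
        if siblings and siblings[0].is_leaf():
            score = float(leaf.vec @ siblings[0].vec)   # stands in for phi_r
            if score > best_score:
                best, best_score = (leaf, siblings[0]), score
    return best
```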
4 Benchmarking Streaming CDC

In this section, we describe our proposed benchmark for evaluating streaming CDC systems.

4.1 Datasets

Current research on CDC is inhibited by a lack of large, publicly accessible datasets. We address this by compiling datasets for streaming CDC by adapting existing entity linking datasets: AIDA CoNLL-YAGO, MedMentions, and Zeshel.

AIDA  AIDA CoNLL-YAGO (Hoffart et al., 2011) contains news articles from the Reuters Corpus written between August and December 1996, with annotations linking mentions to YAGO and Wikipedia. We create a canonical ordering for this dataset by ordering articles by date. As the original train, dev, and test splits respect this ordering, we use the original splits in our benchmark.

MedMentions  The MedMentions (Mohan and Li, 2019) corpus contains abstracts for biomedical articles published to PubMed in 2016, annotated with links to the UMLS medical ontology. We order abstracts by publication date3 to create a canonical ordering. Since the original dataset is not ordered by date, we create new train, dev, and test splits of comparable size that respect this ordering.

3 Six abstracts were omitted due to missing metadata.

Zeshel  The Zeshel (Logeswaran et al., 2019) dataset consists of Wikia articles for different FANDOMs. In addition to the original set of annotated mentions, we use the provided entity descriptions as an additional source of mentions. We impose an ordering that groups all mentions belonging to the same Wikia together, and otherwise retains their original order in the Zeshel data. This is an interesting scenario for streaming CDC, as no clusters need to be retained when transitioning to a new Wikia.

Dataset        Split    |M|     |E|    % Seen   MAE
AIDA           Train   18.5K    4.1K    100%    1.1K
               Dev      4.8K    1.6K     23%     290
               Test     4.5K    1.6K     16%     263
MedMentions    Train    121K     18K    100%    4.7K
               Dev       42K    8.8K     27%    1.8K
               Test      39K    8.3K     26%    1.7K
Zeshel         Train     81K     32K    100%    9.3K
               Dev       18K    7.5K      0%    2.9K
               Test      17K    7.2K      0%    3.3K

Table 1: Dataset Statistics. |M|: #mentions, |E|: #unique entities, % Seen: fraction of entities observed during training, MAE: maximum active entities, i.e., the number of mentions an ideal streaming CDC system would need to store to perfectly cluster the data.

Analysis  Statistics for the benchmark data are provided in Table 1, which lists the number of mentions and unique entities for each dataset. We also list the percentage overlap between entities in the training set and entities in the dev and test sets (% Seen), as well as the maximum active entities (MAE). MAE is a quantity introduced by Toshniwal et al. (2020) that measures the maximum number of "active entities" (i.e., entities that have been previously mentioned and will be mentioned again in the future) for a given dataset; it can alternatively be interpreted as the smallest possible memory bound that still ensures a CDC system can cluster each mention with at least one other mention of the same entity. Importantly, this number is a small fraction of the total number of mentions in each dataset, indicating that these datasets are appropriate for the streaming setting and for comparing memory management policies.
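For reference, MAE can be computed in one pass over the gold entity labels in stream order; the list-of-labels input format below is an assumption of this sketch.

```python
from collections import Counter


def max_active_entities(entity_stream) -> int:
    """An entity is "active" after time t if it has been mentioned at or
    before t and will be mentioned again after t; MAE is the peak count."""
    remaining = Counter(entity_stream)   # mentions of each entity still to come
    active, mae = set(), 0
    for entity in entity_stream:
        remaining[entity] -= 1
        if remaining[entity] > 0:        # entity will be mentioned again
            active.add(entity)
        else:                            # final mention: entity retires
            active.discard(entity)
        mae = max(mae, len(active))
    return mae


print(max_active_entities(["e1", "e2", "e1", "e3", "e2", "e3"]))  # prints 2
```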
4.2 Evaluation Metrics

We evaluate CDC performance using the standard evaluation metrics: MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), and CoNLL F1, which is an average of the previous three. In order to perform evaluation when memory is bounded, we perform the following bookkeeping to track nodes which have been removed by the memory management policy. For bounded memory greedy NN clustering, we keep track of the removed node's predicted cluster (e.g., if the node was removed from cluster C, then it is considered an element of C during evaluation). This is similar to the evaluation used by Toshniwal et al. (2020). For bounded memory GRINCH, we keep track of the removed node's place within the tree structure, and produce a flat clustering using the thresholding approach described in Section 3.2 as if the node were never removed. Because leaf nodes (and accordingly removed nodes) are never updated by insertion or removal operations, nodes belonging to the same cluster before they are pruned will always remain in the same cluster during evaluation, which is the same assumption used for the greedy NN evaluation.

4.3 Hyperparameters

Vocabulary and inverse document frequency (idf) weights are estimated using each dataset's train set. For masked language model encoders, we use an unmodified BERT-base architecture, with model weights provided by the HuggingFace transformers library (Wolf et al., 2020). For BLINK, we use the released BERT-large bi-encoder weights.4 Our bounded memory variant of GRINCH is based on the official implementation.5 Note that GRINCH does not currently support sparse inputs, so we do not include GRINCH results for feature-based mention encoders. RELIC model weights are initialized from BERT-base, and then finetuned to perform entity linking in the following settings:
• RELIC (Wiki): Trained on the same Wikipedia data used to train the BLINK bi-encoder.
• RELIC (In-Domain): Trained on the respective benchmark's training dataset; a separate model is trained for each benchmark.
Training is performed using hyperparameters suggested by Ling et al. (2020).6 For each benchmark, the hybrid mention encoder uses the best performing RELIC variant on that benchmark. Cluster thresholds τ are chosen so that the number of predicted clusters on the dev dataset approximately matches the number of unique entities.

4 https://github.com/facebookresearch/BLINK
5 https://github.com/iesl/grinch
6 Trained on a server w/ 754 GB RAM, an Intel Xeon Gold 5218 CPU, and 4x NVIDIA Quadro RTX 8000 GPUs.
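This threshold selection can be automated by bisecting on τ until the predicted cluster count on the dev set matches the number of unique entities; the sketch below assumes the count grows monotonically with τ (higher thresholds link fewer mentions) and treats the clustering routine as a black box.

```python
def tune_threshold(cluster, dev_mentions, n_entities,
                   lo=0.0, hi=1.0, iters=20) -> float:
    """cluster(mentions, tau) -> {mention index: cluster id} is any routine
    from Section 3.2 (e.g., stream_cdc above with tau passed through)."""
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        n_clusters = len(set(cluster(dev_mentions, tau).values()))
        if n_clusters > n_entities:
            hi = tau    # over-splitting: lower the threshold
        else:
            lo = tau    # over-merging: raise the threshold
    return (lo + hi) / 2.0
```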
5 Results

In this section, we provide a comprehensive evaluation of the design choices that define the existing and proposed approaches for streaming CDC.

Choice of Encoder  We include the results for CDC systems with unbounded memory on the benchmark datasets in Table 2, as well as results for two baselines: 1) a system that clusters together all mentions with the same surface form (exact match), and 2) a system that only considers gold within-document clusters and does not merge clusters across documents (oracle within-doc). We observe that, in general, neural mention encoders are not sufficient to obtain good CDC performance. With the exception of RELIC (In-Domain) on MedMentions, no neural mention encoder is able to outperform the feature-based greedy NN approach, and furthermore, the MLM and BLINK mention encoders do not even surpass the exact match baseline. However, note that for AIDA and Zeshel, the best results are obtained using a hybrid mention encoder. Thus, in these domains, we can conclude that while neural mention encoders are useful for encoding contexts, CDC systems require an additional system to model surface texts to achieve good performance. The results on MedMentions provide an interesting contrast to this conclusion. Here the RELIC (In-Domain) mention encoder outperforms both the feature-based and hybrid mention encoders. In the error analysis below, we find that this is due mainly to improved performance clustering mentions of entities seen when training the mention encoder.

Choice of Clustering Algorithm  Comparing greedy nearest neighbors clustering to GRINCH, we do not observe a consistent trend across mention encoders or datasets. While the best performance on AIDA and Zeshel is achieved using greedy nearest neighbor clustering, the best performance on MedMentions is achieved using GRINCH. These results highlight the importance of benchmarking CDC systems on a number of different datasets; patterns observed on a single dataset do not extrapolate well to other settings. It is also interesting to observe that a much simpler approach often works better than the more complex one.

Error Analysis  We characterize the errors of these models by investigating: a) the entities whose mentions are conflated (i.e., wrongly clustered together) and split (i.e., wrongly grouped into separate clusters), using the approach of Kummerfeld and Klein (2013), and b) differences in performance on entities that are seen vs. unseen during training, for models that use in-domain data. A subset of our results is provided in Table 3, with full results available in Tables 4–11 in the Appendix.
                          AIDA                     MedMentions                Zeshel
                   MUC   B3    CEAF  Avg.    MUC   B3    CEAF  Avg.    MUC   B3    CEAF  Avg.
Exact Match        90.2  84.1  81.0  85.1    78.8  66.0  54.0  66.3    28.3  64.3  46.4  46.3
Oracle Within-Doc  15.2  46.8  47.1  36.4    16.5  32.8  34.8  28.0      -     -     -     -
Greedy NN
  Feature-Based    94.2  89.0  87.3  90.2    83.6  67.0  58.0  69.5    39.6  60.9  52.6  51.0
  MLM              75.9  71.1  58.1  68.4    70.8  52.1  42.0  55.0    16.2  53.8  44.8  38.3
  BLINK (Wiki)     58.2  56.3  56.6  57.0    59.4  39.2  43.2  47.3    36.6  36.9  40.9  41.6
  RELIC (Wiki)     92.4  89.4  83.6  88.5    73.2  56.1  42.4  57.2    36.2  58.1  48.7  47.7
  RELIC (In-Dom.)  93.2  80.7  84.5  86.1    86.8  69.5  62.4  72.9    28.2  61.4  42.5  44.0
  Hybrid           94.7  90.1  88.5  91.1    85.6  70.5  59.9  72.0    44.0  64.5  53.3  54.0
GRINCH
  MLM              37.8  59.2  41.5  46.2    70.8  52.1  42.0  55.0    49.0  38.0  33.1  40.0
  BLINK (Wiki)     64.3  26.9  23.2  38.1    83.2  17.1  11.9  37.4    45.6  24.8  21.7  30.7
  RELIC (Wiki)     91.6  88.3  82.5  87.5    73.9  57.9  42.2  58.0    72.6   4.2   4.3  27.0
  RELIC (In-Dom.)  82.8  84.0  69.5  78.8    85.4  73.3  61.8  73.5    27.3  57.5  40.1  41.6

Table 2: Unbounded Memory Results. CoNLL F1 scores for each valid combination of clustering algorithm and mention encoder. Similarity-threshold clustering + hybrid mention encoder works best on AIDA and Zeshel, whereas GRINCH clustering + in-domain RELIC mention encoder works best for MedMentions.

Feature-Based + Greedy NN
  FIFA World Cup:  . . . '' Japan , co-hosts of the World Cup in 2002 and ranked 20th in the world by . . .
  1995 Rugby World Cup:  . . . team . Cuttitta announced his retirement after the 1995 World Cup , where he took issue with being dropped from . . .
  Rugby World Cup:  . . . Australia to defeat with a last-ditch drop-goal in the World Cup quarter-final in Cape Town . . .

RELIC (Wiki) + Greedy NN
  FC Volendam:  . . . leaders PSV Eindhoven romped to a 6-0 win over Volendam on Saturday . Their other marksmen were Brazilian . . .
  . . . game . They boast a nine-point lead over Feyenoord , who have two games in hand , and . . .
  RKC Waalwijk:  . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II . . .

Table 3: Most Conflated Entities on AIDA. Left: Unique entity ID. Right: Mention with entity surface form in italics. Results for remaining models are provided in the Appendix.

In aggregate, these error metrics closely track the results in Table 2: better models make fewer errors of all types. We do, however, observe that in-domain training improves RELIC's performance considerably on MedMentions (+15 CoNLL F1 on seen entities, and +18 on unseen entities), and is the primary reason underlying the improved performance over feature-based encoders (72.6 vs. 60.7 CoNLL F1 on seen entities, while performance on unseen entities is comparable).

Comparing mentions of the most conflated entities provides a qualitative sense of the failure modes of each method. We note that the feature-based method tends to fail at distinguishing entities with the same surface form, e.g., World Cups of different sports, while neural entity linkers tend to conflate entities with similar contexts, particularly when surface forms are split into multiple word pieces in the model's vocabulary (each surface form in the bottom of Table 3 gets broken into 3+ word pieces).
Effect of Bounded Memory  Results for the bounded memory setting are illustrated in Figure 3. In these experiments we take the best neural mention encoder for each benchmark dataset (RELIC (Wiki) for AIDA and Zeshel, and RELIC (In-Domain) for MedMentions), and plot the CoNLL F1 score for each of the memory management policies described in Section 3.3. We measure performance for memory bounds at the maximum number of active entities (MAE) and the total number of unique entities (|E|) for each dataset (as well as 1/2x and 2x multiples of these numbers). In sum, these results provide strong evidence that CDC systems can reliably cluster mentions in a truly streaming setting, even when memory is bounded to a small fraction of the number of entities encountered by the system. Most impressively, using the diversity-cache memory management policy, a greedy nearest neighbors bounded memory model achieves a CoNLL F1 score within 2% of the best performing unbounded memory model, while only storing approximately 10% (i.e., |E|/2) of the mentions.

We notice a few fairly consistent trends across datasets. The first is that increasing the memory bound has diminishing returns; while there is a large benefit incurred by increasing the bound from MAE/2 to MAE, the difference in performance attained from increasing the bound from |E| to 2|E| is often negligible. We also find that naïve memory management policies that store recent mentions (i.e., window, W, and cache, C) tend to perform better than the policies that attempt to remove redundant mentions (i.e., diversity, D). This effect is particularly pronounced for small memory bounds. While this is somewhat surprising—storing mentions of the same entity is particularly harmful when memory is limited, so encouraging diversity should be a good thing—one possible explanation is that the diversity policy is actually removing mentions of entities that appear within the same context, as we saw earlier that neural mention encoders appear to focus more on mention context than surface text. Lastly, regarding the comparison of greedy nearest neighbors clustering to GRINCH, we again see inconsistency in performance across datasets; GRINCH appears to perform better at larger cache sizes for AIDA and MedMentions, while greedy nearest neighbors clustering has much better performance than GRINCH on Zeshel.

[Figure 3 (plots omitted): Effect of Bounded Memory. CoNLL F1 scores on AIDA, MedMentions, and Zeshel as the memory bound varies over MAE/2, MAE, |E|, and 2|E|, for GRINCH (D, W) and Greedy NN (W, C, D, D-C). MAE: max. active entities (defined in Section 4.1). |E|: #unique entities.]

6 Conclusion and Future Work

Streaming cross document coreference has a number of compelling applications, especially concerning the processing of streams of data such as research publications, social media feeds, and news articles, where new entities are frequently introduced. Despite being well-motivated, this task has received little attention from the NLP community. In order to foster a more welcoming environment for research on this task, we compile a diverse benchmark dataset for evaluating CDC, comprised of existing datasets that are free and publicly available. We additionally evaluate the performance of a collection of existing approaches for CDC, as well as introduce new approaches that leverage modern neural architectures. Our results highlight a number of challenges for future CDC research, such as how to better incorporate surface-level features into neural mention encoders, as well as alternative policies for memory management that improve upon the naïve baselines studied in this work. Benchmark data and materials needed to reproduce our results are provided at: https://github.com/rloganiv/streaming-cdc.

Acknowledgements

The authors would like to thank Sanjay Subramanian, Nitish Gupta, Keith Hall, Ryan McDonald, Livio Baldini Soares, Nicholas FitzGerald, and Tom Kwiatkowski for their technical guidance and helpful comments while conducting this work. We would also like to thank the anonymous ACL reviewers for their valuable feedback. This project is supported in part by NSF award no. 1817183, and the DARPA MCS program under Contract No. N660011924033.
Broader Impact Statement

This paper focuses on systems that perform entity disambiguation without reliance on an external knowledge base. The potential benefit of such systems is an improved ability to track mentions of rare and emergent entities (e.g., natural disasters, novel disease variants, etc.); however, this is also relevant in digital surveillance settings, and may result in reduced privacy.

References

Prabal Agarwal, Jannik Strötgen, Luciano del Corro, Johannes Hoffart, and Gerhard Weikum. 2018. diaNED: Time-aware named entity disambiguation for diachronic corpora. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 686–693, Melbourne, Australia. Association for Computational Linguistics.

Emily Allaway, Shuai Wang, and Miguel Ballesteros. 2021. Sequential cross-document coreference resolution. arXiv preprint arXiv:2104.08413.

Amit Bagga and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 79–85, Montreal, Quebec, Canada. Association for Computational Linguistics.

Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics.

Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, and Ido Dagan. 2021. Cross-document language modeling. arXiv preprint arXiv:2101.00406.

Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, and Ido Dagan. 2020. Streamlining cross-document coreference resolution: Evaluation and modeling. arXiv preprint arXiv:2009.11032.

Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4545–4552, Reykjavik, Iceland. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sourav Dutta and Gerhard Weikum. 2015. Cross-document co-reference resolution using sample-based clustering with knowledge enrichment. Transactions of the Association for Computational Linguistics, 3:15–28.

Michael Färber, Achim Rettinger, and Boulos El Asmar. 2016. On emerging entity detection. In Knowledge Engineering and Knowledge Management, pages 223–238, Cham. Springer International Publishing.

Chung Heong Gooi and James Allan. 2004. Cross-document coreference on a large scale corpus. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 9–16, Boston, Massachusetts, USA. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. 2017. A hierarchical algorithm for extreme clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 255–264.

Nikolaos Kolitsas, Octavian-Eugen Ganea, and Thomas Hofmann. 2018. End-to-end neural entity linking. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 519–529, Brussels, Belgium. Association for Computational Linguistics.

Jonathan K. Kummerfeld and Dan Klein. 2013. Error-driven analysis of challenges in coreference resolution. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 265–277, Seattle, Washington, USA. Association for Computational Linguistics.

Jeffrey Ling, Nicholas FitzGerald, Zifei Shan, Livio Baldini Soares, Thibault Févry, David Weiss, and Tom Kwiatkowski. 2020. Learning cross-context entity representations from text. arXiv preprint arXiv:2001.03765.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.

Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 25–32, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Gideon Mann and David Yarowsky. 2003. Unsupervised personal name disambiguation. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 33–40.

Yehudit Meged, Avi Caciularu, Vered Shwartz, and Ido Dagan. 2020. Paraphrasing vs coreferring: Two sides of the same coin. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4897–4907, Online. Association for Computational Linguistics.

Rada Mihalcea and Andras Csomai. 2007. Wikify! Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 233–242.

Sunil Mohan and Donghui Li. 2019. MedMentions: A large biomedical corpus annotated with UMLS concepts. In Automated Knowledge Base Construction (AKBC).

Nicholas Monath, Ari Kobren, Akshay Krishnamurthy, Michael R. Glass, and Andrew McCallum. 2019. Scalable hierarchical clustering with tree grafting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 1438–1448, New York, NY, USA. Association for Computing Machinery.

Delip Rao, Paul McNamee, and Mark Dredze. 2010. Streaming cross document entity coreference resolution. In Coling 2010: Posters, pages 1050–1058, Beijing, China. Coling 2010 Organizing Committee.

Luke Shrimpton, Victor Lavrenko, and Miles Osborne. 2015. Sampling techniques for streaming cross document coreference resolution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1391–1396, Denver, Colorado. Association for Computational Linguistics.

Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 793–803, Portland, Oregon, USA. Association for Computational Linguistics.

Shubham Toshniwal, Sam Wiseman, Allyson Ettinger, Karen Livescu, and Kevin Gimpel. 2020. Learning to ignore: Long document coreference with bounded memory neural networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8519–8526, Online. Association for Computational Linguistics.

Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6-8, 1995.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.
A Error Analysis

A.1 Seen vs. Unseen Performance

We evaluate CoNLL F1 scores for mentions of entities that are seen vs. unseen in the AIDA and MedMentions training datasets (Zeshel is excluded since no test entities are seen in the training data). Results are provided in Table 4, with performance of models that are trained using the in-domain training datasets reported in bold.

                      AIDA              MedMentions
                  Seen   Unseen       Seen   Unseen
Greedy NN
  Feature-Based   91.2    92.2        60.7    80.8
  MLM             67.2    71.7        54.5    61.5
  BLINK (Wiki)    39.0    39.2        45.1    59.0
  RELIC (Wiki)    89.3    89.9        57.6    62.2
  RELIC (In-Dom.) 87.1    86.4        72.6    80.3
  Hybrid          92.8    92.2        71.8    80.5
GRINCH
  MLM             46.2    43.0        30.6    24.9
  BLINK (Wiki)    39.0    39.2        38.1    33.7
  RELIC (Wiki)    88.2    89.1        58.4    62.8
  RELIC (In-Dom.) 84.2    70.3        73.1    78.1

Table 4: CoNLL F1 scores on mentions of entities that are seen vs. unseen in the AIDA and MedMentions training datasets. Zeshel is excluded since all entities in the test data are unseen. Bolded numbers indicate that the mention encoder is trained on seen mentions.

A.2 Clustering Mistakes

Kummerfeld and Klein (2013) define a system for categorizing coreference errors into a number of underlying error types. Because gold mention boundaries are provided in our task setup, the main error types of relevance are divided entities, i.e., mentions of the same entity that occur in different clusters, and conflated entities, i.e., mentions of different entities that are grouped into the same clusters. We can quantify these error types by counting the number of times clusters need to be merged together vs. split, respectively. The overall error counts are provided in Table 5.

In addition to providing the overall error counts, we also render a sample of mentions from predicted clusters containing the most conflated entities in Tables 6–11.

                     AIDA          MedMentions        Zeshel
                 Confl.  Div.    Confl.   Div.    Confl.   Div.
Greedy NN
  Feature-Based    173    156     5.0K    5.2K     5.4K    6.2K
  MLM              764    676     8.9K    9.1K     6.0K    8.6K
  BLINK (Wiki)    1.6K    749    13.1K   12.3K     7.4K    5.9K
  RELIC (Wiki)     243    209     8.1K    8.4K     5.8K    6.5K
  RELIC (In-Dom.)  178    219     4.1K    4.1K     3.5K    7.8K
  Hybrid           155    156     4.4K    4.5K     4.5K    5.9K
GRINCH
  MLM             1.2K    829     8.4K       0     6.9K    5.2K
  BLINK (Wiki)    1.7K    749     7.9K    3.2K     8.0K    2.8K
  RELIC (Wiki)     284    214     7.8K    8.2K     6.8K    6.6K
  RELIC (In-Dom.)  101    794     3.3K    5.4K     2.9K    8.7K

Table 5: Clustering Mistakes. Conflated: Number of times mentions of different entities are grouped into the same cluster. Divided: Number of times mentions of the same entity are grouped into different clusters.

Feature-Based

FIS Ski Jumping World Cup . . . ) 228.1 ( 129.4 / 98.7 ) Leading World Cup standings ( after three events ) : 1. . . . FIFA World Cup . . . his squad to face Macedonia next week in a World Cup qualifier . Midfielder Valentin Stefan and striker Viorel . . . 1966 FIFA World Cup . . . 35 caps and was a key member of the 1966 World Cup winning team with his younger brother , Bobby . . . . MLM

Sheffield Wednesday F.C. . . . . 0-1 . 19,306 Liverpool 0 Sheffield Wednesday 1 ( Whittingham 22 ) . 0-1 . . . . Aston Villa F.C. . . . 0 Leeds 0 . 30,018 Southampton 0 Aston Villa 1 ( Townsend 34 ) . 0-1 . . . . Newcastle United F.C. . . . Villa 17 9 3 5 22 15 30 Newcastle 15 9 2 4 26 17 29 Manchester . . . BLINK (Wiki)

Japan national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . . China PR national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . . Al Ain . . . CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the . . . RELIC (Wiki)

RKC Waalwijk . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 . . . Willem II (football club) . . . : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 ( Konterman 45 , Van der Vegt . . . PSV Eindhoven . . . SOCCER - PSV HIT VOLENDAM FOR SIX . AMSTERDAM 1996-12-07 . . . RELIC (In-Domain)

Japan national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . . China PR national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . . United Nations . . . 1975 and annexed the following year . The United Nations has never recognised Jakarta ’s move . Alatas . . . Hybrid

FIFA World Cup . . . ’ ’ Japan , co-hosts of the World Cup in 2002 and ranked 20th in the world by . . . 1966 FIFA World Cup . . . 35 caps and was a key member of the 1966 World Cup winning team with his younger brother , Bobby . . . . Rugby World Cup . . . Australia to defeat with a last-ditch drop-goal in the World Cup quarter-final in Cape Town . ” Campo has . . .

Table 6: Most Conflated Entities on AIDA using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM

Japan national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . . China PR national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . . Japan national football team . . . Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the defence of their Asian Cup title with . . . BLINK (Wiki)

Japan national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . . . . China PR national football team . . . SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN . . . Al Ain . . . CHINA IN SURPRISE DEFEAT . Nadim Ladki AL-AIN , United Arab Emirates 1996-12-06 Japan began the . . . RELIC (Wiki)

RKC Waalwijk . . . division soccer match played on Friday : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 . . . Willem II (football club) . . . : RKC Waalwijk 1 ( Starbuck 76 ) Willem II Tilburg 2 ( Konterman 45 , Van der Vegt . . . PSV Eindhoven . . . SOCCER - PSV HIT VOLENDAM FOR SIX . AMSTERDAM 1996-12-07 . . . RELIC (In-Domain)

English people . . . Mahindra International . The Australian brushed aside unseeded Englishman Mark Cairns 15-7 15-6 15-8 . Top-seeded Eyles . . . England . . . 16-year-old who attended Sale Grammar School in the northern England city of Manchester died less than a day after ... England . . . a last-minute goal to salvage a 2-2 draw for English premier league leaders Arsenal at home to Derby on . . .

Table 7: Most Conflated Entities on AIDA using GRINCH Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

Feature-Based

Transcription, Genetic . . . of eight tissues. The Mef2c promoter had the higher transcriptional activity in differentiated C2C12 cells than that in proliferating C2C12 . . . Transcriptional Activation . . . in proliferating C2C12 cells, which was accompanied by the up-regulation of mRNA expression of Mef2c gene. Function deletion and . . . Regulation of Biological Process . . . (GPCRs) is a key event for cell signaling and regulation of receptor function. Previously, using tandem mass spec- trometry, we . . . MLM

Left Ventricular Function . . . computed tomography angiography, we assessed 3 primary outcome measures: left ventricular (LV) systolic function (left ventricular ejection fraction), LV diastolic function (early relaxation . . . Left Ventricular Ejection Fraction . . . 3 primary outcome measures: left ventricular (LV) systolic function ( left ventricular ejection fraction ), LV diastolic function (early relaxation velocity), and coronary atherosclerosis . . . Endoscopy (Procedure) . . . Comprehensive Cancer Network (NCCN) guidelines, which are based on rigid endoscopic measurements. The medi- cal records of patients scheduled to receive . . . BLINK (Wiki)

Medical Records . . . guidelines, which are based on rigid endoscopic measurements. The medical records of patients scheduled to receive curative surgery for histologically . . . Lower - Spatial Qualifier . . . with rectal cancer located in the upper (Ra) or lower (Rb) division using double-contrast barium enema. The median values . . . Asymptomatic (Finding) . . . two bladder urothelial cancer metastatic to the penis with no relevant clinical symptoms . Namely, a 69 years-old man with a warthy lesions . . . RELIC (Wiki)

Lysome-associated ...... Herein, we demonstrated that Zn(2+) could induce deglycosylation of lysosome-associated membrane protein 1 and 2 (LAMP-1 and LAMP-2), which primarily locate in . . . Sialic Acid ...... In this study, we set out to define how CD169(+) phagocytes contribute to neuroinflammation in MS. CD169 - diph- theria . . . SMAD3 gene . . . Epigenome-wide analysis links SMAD3 methylation at birth to asthma in children of asthmatic . . . RELIC (In-Domain)

Individual . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic- androgenic steroids (AAS), but the long-term . . . Pharmicologic Substance . . . steroids (AAS), but the long-term cardiovascular associations of these drugs remain incompletely understood. Using a cross-sectional cohort design, we . . . Finding . . . complications occurred in the studied neonates. Based on these findings , IC - ECG -guided tip placement appears to be . . . Hybrid

Study . . . marker for activated phagocytes in inflammatory disorders. In this study , we set out to define how CD169(+) phago- cytes contribute . . . Evaluation . . . to provide holistic end-of-life care and assisted in the overall assessment of palliative care patients, identifying areas that might not . . . Research Study . . . which is hardly visible in clinically applied CT-imaging. This experimental study investigates ten different PSI designs and their effect to . . .

Table 8: Most Conflated Entities on MedMentions using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM

Cardiovascular . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals . . . Toxic Effect . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have . . . Steroids . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic-androgenic steroids . . . BLINK (Wiki)

Cardiovascular . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals . . . Toxic Effect . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have . . . Steroids . . . Cardiovascular Toxicity of Illicit Anabolic-Androgenic Steroid Use Millions of individuals have used illicit anabolic-androgenic steroids . . . RELIC (Wiki)

Sialic Acid ...... In this study, we set out to define how CD169(+) phagocytes contribute to neuroinflammation in MS. CD169 - diphtheria . . . USP17L2 Protein . . . DUB3 and USP7 de-ubiquitinating enzymes control replication inhibitor Geminin: molecular . . . USP7 Protein . . . DUB3 and USP7 de-ubiquitinating enzymes control replication inhibitor Geminin: molecular characterization and . . . RELIC (In-Domain)

Protein Expression . . . by vascular endothelial growth factor (VEGF) signaling. We describe spatiotemporal expression of vegf and vegfr and experimental manipulations targeting VEGF . . . Genes, Homeobox . . . cell adhesion, and newly identified processes, including transcription and homeobox genes . We identified mutations in protein binding sites correlating with . . . Gene Expression . . . identified mutations in protein binding sites correlating with differential expression of proximal genes and experimentally validated effects of mutations . . .

Table 9: Most Conflated Entities on MedMentions using GRINCH Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

Feature-Based

Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . . Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . . Yu - Gi - Oh ! Duelist - Duel 168 . . . Yu - Gi - Oh ! Duelist - Duel 168 ” The Waiting Grave ” , also known as ” . . . MLM

Yami Yugi and Rafael ’ s first Duel . . . Yami Yugi and Rafael ’ s first Duel Yami Yugi goes to duel against Rafael as the message . . . Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . . Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . . BLINK (Wiki)

Robin ( Friends ) . . . , Sarah and Maya . Emma has a horse called Robin , dog called Lady and a cat called Jewel . . . . 41003 Olivia ’ s Newborn Foal . . . a pet bird , Goldie . Olivia also has a new pet foal , which she takes care of frequently . She seems . . . 41007 Heartlake Pet Salon . . . its neck . Background . Joanna brings her poodle to the pet salon , where Emma pampers her up . ¡ br ¿ . . . RELIC (Wiki)

Ro Gale . . . , as was Maquis leader Macias . Ro recalled that her father made the strongest ” hasperat ” she ’ d ever . . . Unnamed shuttlepods ( 22nd century ) . . . . ” ( ) The Federation starship carried at least one shuttlepod until the time of its disappearance in the mid - . . . Founders ’ homeworld ( 2372 ) . . . As she is reluctant to reveal the location of the Founders ’ new homeworld , but respects Sisko ’ s loyalty to Odo when . . . RELIC (In-Domain)

Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . . Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . . Yu - Gi - Oh ! Duelist - Duel 168 . . . Yu - Gi - Oh ! Duelist - Duel 168 ” The Waiting Grave ” , also known as ” . . . Hybrid

Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . . Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . . Yu - Gi - Oh ! Duelist - Duel 168 . . . Yu - Gi - Oh ! Duelist - Duel 168 ” The Waiting Grave ” , also known as ” . . .

Table 10: Most Conflated Entities on Zeshel using Greedy NN Clustering. Left: Unique entity ID. Right: Mention with entity surface form in italics.

MLM

Moondeep Sea . . . Larynda Telenna was the high priestess of Kiaransalee in the Vault of Gnashing Teeth beneath Vaasa . She was also the leader of Kiaransalee . . . Tabaxi ( tribe ) . . . the Chultan Peninsula , consisting primarily of members of the Tabaxi tribe . Description . Chultans were tall and had dark , . . . New Velar . . . from the Moonsea Ride , as it would have connected Harrowdale Town with this major road . To avoid ambushes by the . . . BLINK (Wiki)

Astral projection . . . also possible to escape with ” teleportation ” spells or astral travel , though the force blocked ethereal travel . A captive . . . Krakentua ( Shinkintin ) . . . and force newly hatched krakentua spawn to fight . A krakentua related these events via dreams to adventurers in the Fochu . . . Generic temple guard . . . two to attempt a crossing were Father Sambar and a temple guard . Sambar died horrifically , but the guard survived as . . . RELIC (Wiki)

Moondeep Sea . . . Larynda Telenna was the high priestess of Kiaransalee in the Vault of Gnashing Teeth beneath Vaasa . She was also the leader of Kiaransalee . . . Tabaxi ( tribe ) . . . the Chultan Peninsula , consisting primarily of members of the Tabaxi tribe . Description . Chultans were tall and had dark , . . . Kaedlaw Burdun . . . the Silver Wyrm in 1369 DR , Queen Brianna , her newborn child , Avner , Tavis Burdun , and Basil retreated to ... RELIC (In-Domain)

Vetrix Family . . . Vetrix Family The Vetrix Family , known as the Tron Family in . . . Yu - Gi - Oh ! - Episode 004 . . . Yu - Gi - Oh ! - Episode 004 ” Into the Hornet ’ s Nest ” , known . . . Yu - Gi - Oh ! ZEXAL - Episode 082 . . . Yu - Gi - Oh ! ZEXAL - Episode 082 ” Sphere Cube Calamity : Part 1 ” , known . . .

Table 11: Most Conflated Entities on Zeshel using GRINCH Clustering. Left: Unique entity ID. Right: Men- tion with entity surface form in italics.
