Entity Linking in 100 Languages

Jan A. Botha, Zifei Shan, Daniel Gillick
Google Research
[email protected]  [email protected]  [email protected]

Abstract

We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset[1] matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.

[1] http://goo.gle/mewsli-dataset

1 Introduction

Entity linking (EL) fulfils a key role in grounded language understanding: given an ungrounded entity mention in text, the task is to identify the entity's corresponding entry in a Knowledge Base (KB). In particular, EL provides grounding for applications like Question Answering (Févry et al., 2020b) (also via Semantic Parsing (Shaw et al., 2019)) and Text Generation (Puduppully et al., 2019); it is also an essential component in knowledge base population (Shen et al., 2014). Entities have played a growing role in representation learning. For example, entity mention masking led to greatly improved fact retention in large language models (Guu et al., 2020; Roberts et al., 2020).

But to date, the primary formulation of EL outside of the standard monolingual setting has been cross-lingual: link mentions expressed in one language to a KB expressed in another (McNamee et al., 2011; Tsai and Roth, 2016; Sil et al., 2018). The accompanying motivation is that KBs may only ever exist in some well-resourced languages, but text in many different languages needs to be linked. Recent work in this direction features progress on low-resource languages (Zhou et al., 2020), zero-shot transfer (Sil and Florian, 2016; Rijhwani et al., 2019; Zhou et al., 2019) and scaling to many languages (Pan et al., 2017), but commonly assumes a single primary KB language and a limited KB, typically English.

We contend that this popular formulation limits the scope of EL in ways that are artificial and inequitable.

First, it artificially simplifies the task by restricting the set of viable entities and reducing the variety of mention ambiguities. Limiting the focus to entities that have English Wikipedia pages understates the real-world diversity of entities. Even within the Wikipedia ecosystem, many entities only have pages in languages other than English. These are often associated with locales that are already underrepresented on the global stage. By ignoring these entities and their mentions, most current modeling and evaluation work side-steps underappreciated challenges faced in practical industrial applications, which often involve KBs much larger than English Wikipedia, with a much more significant zero- or few-shot inference problem.

Second, it entrenches an English bias in EL research that is out of step with the encouraging shift toward inherently multilingual approaches in natural language processing, enabled by advances in representation learning (Johnson et al., 2017; Pires et al., 2019; Conneau et al., 2020).

Third, much recent EL work has focused on models that rerank entity candidates retrieved by an alias table (Févry et al., 2020a), an approach that works well for English entities with many linked mentions, but less so for the long tail of entities and languages.

To overcome these shortcomings, this work makes the following key contributions:

• Reformulate entity linking as inherently multilingual: link mentions in 104 languages to entities in WikiData, a language-agnostic KB.

• Advance prior dual encoder retrieval work with improved mention and entity encoder architecture and improved negative mining targeting.

• Establish new state-of-the-art performance relative to prior cross-lingual linking systems, with one model capable of linking 104 languages against 20 million WikiData entities.

• Introduce Mewsli-9, a large dataset with nearly 300,000 mentions across 9 diverse languages with links to WikiData. The dataset features many entities that lack English Wikipedia pages and which are thus inaccessible to many prior cross-lingual systems.

• Present frequency-bucketed evaluation that highlights zero- and few-shot challenges with clear headroom, implicitly including low-resource languages without enumerating results over a hundred languages.

2 Task Definition

Multilingual Entity Linking (MEL) is the task of linking an entity mention m in some context language l_c to the corresponding entity e ∈ V in a language-agnostic KB. That is, while the KB may include textual information (names, descriptions, etc.) about each entity in one or more languages, we make no prior assumption about the relationship between these KB languages L^kb = {l_1, . . . , l_k} and the mention-side language: l_c may or may not be in L^kb.

This is a generalization of cross-lingual EL (XEL), which is concerned with the case where L^kb = {l_0} and l_c ≠ l_0. Commonly, l_0 is English, and V is moreover limited to the set of entities that express features in l_0.

2.1 MEL with WikiData and Wikipedia

As a concrete realization of the proposed task, we use WikiData (Vrandečić and Krötzsch, 2014) as our KB: it covers a large set of diverse entities, is broadly accessible and actively maintained, and it provides access to entity features in many languages. WikiData itself contains names and short descriptions, but through its close integration with all Wikipedia editions, it also connects entities to rich descriptions (and other features) drawn from the corresponding language-specific Wikipedia pages.

Basing entity representations on features of their Wikipedia pages has been a common approach in EL (e.g. Sil and Florian, 2016; Francis-Landau et al., 2016; Gillick et al., 2019; Wu et al., 2019), but we will need to generalize this to include multiple Wikipedia pages with possibly redundant features in many languages.

2.1.1 WikiData Entity Example

Consider the WikiData entity Sí Ràdio (Q3511500), a now defunct Valencian radio station. Its KB entry references Wikipedia pages in three languages, which contain the following descriptions:[2]

• (Catalan) Sí Ràdio fou una emissora de ràdio musical, la segona de Radio Autonomía Valenciana, S.A. pertanyent al grup Radiotelevisió Valenciana.

• (Spanish) Nou Si Ràdio (anteriormente conocido como Sí Ràdio) fue una cadena de radio de la Comunidad Valenciana y emisora hermana de Nou Ràdio perteneciente al grupo RTVV.

• (French) Sí Ràdio est une station de radio publique espagnole appartenant au groupe Ràdio Televisió Valenciana, entreprise de radio-télévision dépendant de la Generalitat valencienne.

[2] We refer to the first sentence of a Wikipedia page as a description because it follows a standardized format.

Note that these Wikipedia descriptions are not direct translations, and contain some name variations. We emphasize that this particular entity would have been completely out of scope in the standard cross-lingual task (Tsai and Roth, 2016), because it does not have an English Wikipedia page.

In our analysis, there are millions of WikiData entities with this property, meaning the standard setting skips over the substantial challenges of modeling these (often rarer) entities and disambiguating them in different language contexts. Our formulation seeks to address this.
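To make the setting concrete, here is a minimal sketch (ours, not an artifact of the paper) of how a language-agnostic KB entry and a mention instance might be represented; all class and field names are illustrative.

```python
# A minimal sketch (ours, not from the paper) of a language-agnostic KB
# entry and a mention instance; all class and field names are illustrative.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class KBEntity:
    qid: str                       # language-agnostic WikiData ID
    descriptions: Dict[str, str] = field(default_factory=dict)  # lang -> text

@dataclass
class Mention:
    surface: str                   # the mention string as written
    context: str                   # surrounding document text
    language: str                  # context language l_c; may be absent from KB

# The Section 2.1.1 entity: no English description exists, yet mentions of
# it may occur in any of the 104 mention-side languages.
si_radio = KBEntity(
    qid="Q3511500",
    descriptions={
        "ca": "Sí Ràdio fou una emissora de ràdio musical, ...",
        "es": "Nou Si Ràdio (anteriormente conocido como Sí Ràdio) ...",
        "fr": "Sí Ràdio est une station de radio publique espagnole ...",
    },
)
```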

2.2 Knowledge Base Scope

Our modeling focus is on using unstructured textual information for entity linking, leaving other modalities or structured information as areas for future work. Accordingly, we narrow our KB to the subset of entities that have descriptive text available: we define our entity vocabulary V as all WikiData items that have an associated Wikipedia page in at least one language, independent of the languages we actually model.[3] This gives 19,666,787 entities, substantially more than in any other task setting we have found: the KB accompanying the entrenched TAC-KBP 2010 benchmark (Ji et al., 2010) has less than a million entities, and although English Wikipedia continues to grow, recent work using it as a KB still only contends with roughly 6 million entities (Févry et al., 2020a; Zhou et al., 2020). Further, by employing a simple rule to determine the set of viable entities, we avoid potential selection bias based on our desired test sets or the language coverage of a specific pretrained model.

[3] More details in Appendix C.

2.3 Supervision

We extract a supervision signal for MEL by exploiting the hyperlinks that editors place on Wikipedia pages, taking the anchor text as a linked mention of the target entity. This follows a long line of work exploiting hyperlinks for EL supervision (Bunescu and Paşca, 2006; Singh et al., 2012; Logan et al., 2019), which we extend here by applying the idea to extract a large-scale dataset of 684 million mentions in 104 languages, linked to WikiData entities. This is at least six times larger than datasets used in prior English-only linking work (Gillick et al., 2019). Such large-scale supervision is beneficial for probing the quality attainable with current-day high-capacity neural models.
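The extraction logic can be summarized in a few lines. The sketch below is our own illustration of the procedure, assuming a parsed dump exposes pages with anchor spans, a redirect map, and a (language, title) -> QID index; none of these helper structures come from the paper's released code.

```python
# A sketch (ours) of the supervision extraction in Section 2.3: every
# hyperlink on a Wikipedia page yields one (mention-in-context, entity)
# pair. The `pages` iterator, `redirects` map, and `title_to_qid` index
# are assumed inputs from a parsed dump, not the paper's code.
def extract_training_pairs(pages, redirects, title_to_qid, window=200):
    for page in pages:                        # pages in all 104 languages
        for a in page.anchors:                # spans (start, end, target_title)
            title = redirects.get(a.target_title, a.target_title)
            qid = title_to_qid.get((page.language, title))
            if qid is None:                   # target outside the vocabulary V
                continue
            yield {
                "language": page.language,
                "title": page.title,          # global context for the encoder
                "left": page.text[max(0, a.start - window):a.start],
                "mention": page.text[a.start:a.end],
                "right": page.text[a.end:a.end + window],
                "entity": qid,                # WikiData ID, e.g. "Q3511500"
            }
```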
3 Mewsli-9 Dataset

We facilitate evaluation on the proposed multilingual EL task by releasing a matching dataset that covers a diverse set of languages and entities. Mewsli-9 (Multilingual Entities in News, linked) contains 289,087 entity mentions appearing in 58,717 originally written news articles from WikiNews, linked to WikiData.[4] The corpus includes documents in nine languages, representing five language families and six orthographies.[5] Per-language statistics appear in Table 1.

[4] www.wikinews.org, using the 2019-01-01 snapshot from archive.org
[5] Mewsli-9 languages (code, family, script): Japanese ('ja', Japonic, ideograms); German ('de', Indo-European (IE), Latin); Spanish ('es', IE, Latin); Arabic ('ar', Afro-Asiatic, Arabic); Serbian ('sr', IE, Latin & Cyrillic); Turkish ('tr', Turkic, Latin); Persian ('fa', IE, Perso-Arabic); Tamil ('ta', Dravidian, Brahmic); English ('en', IE, Latin).

                                  Entities
Lang.    Docs      Mentions    Distinct   ∉ EnWiki
ja       3,410     34,463      13,663     3,384
de       13,703    65,592      23,086     3,054
es       10,284    56,716      22,077     1,805
ar       1,468     7,367       2,232      141
sr       15,011    35,669      4,332      269
tr       997       5,811       2,630      157
fa       165       535         385        12
ta       1,000     2,692       1,041      20
en       12,679    80,242      38,697     14
Total    58,717    289,087     82,162     8,807
en′      1,801     2,263       1,799      0

Table 1: Corpus statistics for Mewsli-9, an evaluation set we introduce for multilingual entity linking against WikiData. Line en′ shows statistics for English WikiNews-2018, by Gillick et al. (2019).

Crucially, 11% of the 82,162 distinct target entities in Mewsli-9 do not have English Wikipedia pages, thereby setting a restrictive upper bound on the performance attainable by a standard XEL system focused on English Wikipedia entities.[6] Even some English documents contain such mentions, for example of the Romanian reality TV show Noră pentru mama (Q12736895).

[6] As of 2019-10-03.

WikiNews articles constitute a somewhat different text genre from our Wikipedia training data: the articles do not begin with a formulaic entity description, for example, and anchor link conventions are likely different. We treat the full dataset as a test set, avoiding any fine-tuning or hyperparameter tuning, thus allowing us to evaluate our model's robustness to domain drift.

Mewsli-9 is a drastically expanded version of the English-only WikiNews-2018 dataset by Gillick et al. (2019). Our automatic extraction technique trades annotation quality for scale and diversity, in contrast to the MEANTIME corpus based on WikiNews (Minard et al., 2016). Mewsli-9 intentionally stretches the KB definition beyond English Wikipedia, unlike VoxEL (Rosales-Méndez et al., 2018). Both MEANTIME and VoxEL are limited to a handful of European languages.

[Figure 1: diagram. Two 4-layer BERT Transformer towers (a Mention Encoder and an Entity Encoder) each project their [CLS] token output to the encoding dimension, and the two encodings are scored by cosine similarity. The mention-side input carries segment labels SEG0, SEG1, SEG2, SEG3 over the WordPiece sequence [CLS] T1 T2 … [SEP] L1 L2 … [E] M1 M2 … [/E] R1 R2 … [SEP]; the entity-side input is [CLS] D1 D2 … [SEP] with segment label SEG0.]
Figure 1: Dual encoder Model F diagram. The input to the Mention Encoder is a sequence of WordPiece tokens that includes the document title (T_i), context immediately left of the mention (L_i), the mention span (M_i) demarcated by [E] and [/E] markers, and context immediately right of the mention (R_i). Segment labels (SEG_i) are also used to distinguish the input segments. The input to the (Model F) Entity Encoder is simply the WordPiece tokens of the entity description (D_i). As usual, the embeddings passed to the first Transformer layer are the sum of positional embeddings (not pictured here), the segment embeddings, and the WordPiece embeddings. The example shows a Turkish mention of Augustus (Q211804) paired with its Italian description.
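A sketch of how the pictured Mention Encoder input might be assembled (our illustration, not the released implementation; the tokenizer object, the [E]/[/E] marker symbols, and the exact truncation policy are assumptions):

```python
# A sketch of assembling the Mention Encoder input of Figure 1 (ours, not
# the released implementation). `tokenizer` is any multilingual BERT
# WordPiece tokenizer; [E] and [/E] are assumed to be reserved symbols in
# its vocabulary. Padding is omitted, and a real implementation would
# budget the context window around the mention rather than truncate right.
MAX_LEN = 64          # maximum sequence length used in the paper
TITLE_BUDGET = 16     # up to a quarter of the sequence for the title

def build_mention_input(tokenizer, title, left, mention, right):
    tokens = ["[CLS]"] + tokenizer.tokenize(title)[:TITLE_BUDGET - 2] + ["[SEP]"]
    segments = [0] * len(tokens)                      # SEG0: document title
    for piece, seg in [
        (tokenizer.tokenize(left), 1),                # SEG1: left context
        (["[E]"] + tokenizer.tokenize(mention) + ["[/E]"], 2),  # SEG2: span
        (tokenizer.tokenize(right), 3),               # SEG3: right context
    ]:
        tokens += piece
        segments += [seg] * len(piece)
    tokens = tokens[:MAX_LEN - 1] + ["[SEP]"]         # truncate, close sequence
    segments = segments[:MAX_LEN - 1] + [3]
    return tokens, segments
```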

4 Model

Prior work showed that a dual encoder architecture can encode entities and contextual mentions in a dense vector space to facilitate efficient entity retrieval via nearest-neighbors search (Gillick et al., 2019; Wu et al., 2019). We take the same approach. The dual encoder maps a mention-entity pair (m, e) to a score:

    s(m, e) = φ(m)^T ψ(e) / (‖φ(m)‖ ‖ψ(e)‖),    (1)

where φ and ψ are learned neural network encoders that encode their arguments as d-dimensional vectors (d=300, matching prior work).

Our encoders are BERT-based Transformer networks (Vaswani et al., 2017; Devlin et al., 2019), which we initialize from a pretrained multilingual BERT checkpoint.[7] For efficiency, we only use the first 4 layers, which results in a negligible drop in performance relative to the full 12-layer stack. The WordPiece vocabulary contains 119,547 symbols covering the top 104 Wikipedia languages by frequency—this is the language set we use in our experiments.

[7] github.com/google-research/bert, checkpoint multi_cased_L-12_H-768_A-12

4.1 Mention Encoder

The mention encoder φ uses an input representation that combines local context (the mention span with surrounding words, ignoring sentence boundaries) and simple global context (the document title). The document title, context, and mention span are marked with special separator tokens as well as identifying token type labels (see Figure 1 for details). Both the mention span markers and the document title have been employed in related work (Agarwal and Bikel, 2020; Févry et al., 2020a). We use a maximum sequence length of 64 tokens, similar to prior work (Févry et al., 2020a), up to a quarter of which is used for the document title. The [CLS] token encoding from the final layer is projected to the encoding dimension to form the final mention encoding.

4.2 Entity Encoders

We experiment with two entity encoder architectures. The first, called Model F, is a featurized entity encoder that uses a fixed-length text description (64 tokens) to represent each entity (see Figure 1). The same 4-layer Transformer architecture is used—without parameter sharing between mention and entity encoders—and again the [CLS] token vector is projected down to the encoding dimension. Variants of this entity architecture were employed by Wu et al. (2019) and Logeswaran et al. (2019).

The second architecture, called Model E, is simply a QID-based embedding lookup, as in Févry et al. (2020a). This latter model is intended as a baseline. A priori, we expect Model E to work well for common entities, less well for rarer entities, and not at all for zero-shot retrieval. We expect Model F to provide more parameter-efficient storage of entity information and possibly improve on zero- and few-shot retrieval.
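In TensorFlow terms, Equation 1 amounts to a cosine similarity between the two tower outputs. A minimal sketch, with illustrative names only:

```python
# A minimal TensorFlow sketch of Equation 1 (illustrative names, not the
# paper's code): both towers emit d=300 vectors, and the score is their
# cosine similarity. For Model E, `entity_vecs` would instead come from an
# embedding lookup over the ~20M QIDs, e.g.
# tf.nn.embedding_lookup(qid_embeddings, qid_indices).
import tensorflow as tf

def score(mention_vecs, entity_vecs):
    m = tf.math.l2_normalize(mention_vecs, axis=-1)   # phi(m) / ||phi(m)||
    e = tf.math.l2_normalize(entity_vecs, axis=-1)    # psi(e) / ||psi(e)||
    return tf.reduce_sum(m * e, axis=-1)              # s(m, e), Equation 1
```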

4.2.1 Entity Description Choice

There are many conceivable ways to make use of entity descriptions from multiple languages. We limit the scope to using one primary description per entity, thus obtaining a single coherent text fragment to feed into the Model F encoder.

We use a simple data-driven selection heuristic based on observed entity usage: given an entity e, let n_e(l) denote the number of mentions of e in documents of language l, and n(l) the global number of mentions in language l across all entities. From a given source of descriptions—first Wikipedia and then WikiData—we order the candidate descriptions (t_e^{l_1}, t_e^{l_2}, . . .) for e first by the per-entity distribution n_e(l) and then by the global distribution n(l).[8] For the example entity in Section 2.1.1, this heuristic selects the Catalan description, because 9/16 training examples link to the Catalan Wikipedia page.

[8] The candidate descriptions (but not V) are limited to the 104 languages covered by our model vocabulary—in general, both Wikipedia and WikiData cover more than 300 languages.
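A minimal sketch of this selection rule (our paraphrase; the count dictionaries are assumed to be precomputed from the training mentions):

```python
# A sketch of the Section 4.2.1 heuristic. Candidates are drawn from
# Wikipedia first and, only if none exist there, from WikiData.
def select_description(candidates, n_e, n):
    # candidates: language -> description text for entity e
    # n_e: language -> mention count of e; n: language -> global mention count
    best_lang = max(candidates, key=lambda l: (n_e.get(l, 0), n.get(l, 0)))
    return candidates[best_lang]

# For the Section 2.1.1 entity, hypothetical counts n_e = {"ca": 9, "es": 4,
# "fr": 3} (9 of 16 training mentions link the Catalan page) select "ca".
```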
4.3 Training Process

In all our experiments, we use an 8k batch size with in-batch sampled softmax (Gillick et al., 2018). Models are trained with TensorFlow (Abadi et al., 2016) using the Adam optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019). All BERT-based encoders are initialized from a pretrained checkpoint, but the Model E embeddings are initialized randomly. We doubled the batch size until no further held-out set gains were evident, and chose the number of training steps to keep the training time of each phase under one day on a TPU. Further training would likely yield small improvements. See Appendix B for more detail.

5 Experiments

We conduct a series of experiments to gain insight into the behavior of the dual encoder retrieval models under the proposed MEL setting, asking:

• What are the relative merits of the two types of entity representations used in Model E and Model F (embeddings vs. encodings of textual descriptions)?

• Can we adapt the training task and hard-negative mining to improve results across the entity frequency distribution?

• Can a single model achieve reasonable performance on over 100 languages while retrieving from a 20 million entity candidate set?

5.1 Evaluation Data

We follow Upadhyay et al. (2018) and evaluate on the "hard" subset of the Wikipedia-derived test set introduced by Tsai and Roth (2016) for cross-lingual EL against English Wikipedia, TR2016hard. This subset comprises mentions for which the correct entity did not appear as the top-ranked item in their alias table, thus stress-testing a model's ability to generalize beyond mention surface forms.

Unifying this dataset with our task formulation and data version requires mapping its gold entities from the provided, older Wikipedia titles to newer WikiData entity identifiers (and following intermediate Wikipedia redirection links). This succeeded for all but 233 of 42,073 queries in TR2016hard—our model receives no credit on the missing ones.

To be compatible with the pre-existing train/test split, we excluded from our training set all mentions appearing on Wikipedia pages in the full TR2016 test set. This was done for all 104 languages, to avoid cross-lingual overlap between train and test sets. This aggressive scheme holds out 33,460,824 instances, leaving our final training set with 650,975,498 mention-entity pairs. Figure 2 provides a breakdown by language.

5.2 Evaluating Design Choices

5.2.1 Setup and Metrics

In this first phase of experiments, we evaluate design choices by reporting the differences in Recall@100 between two models at a time, for conciseness. Note that for final system comparisons it is standard to use the accuracy of the top retrieved entity (R@1), but to evaluate a dual encoder retrieval model we prefer R@100, as this is better matched to its likely use case as a candidate generator. Here we use the TR2016hard dataset, as well as a portion of the 104-language set held out from our training data, sampled to have 1,000 test mentions per language. (We reserve the new Mewsli-9 dataset for testing the final model in Section 5.5.)

Reporting results for 104 languages is a challenge. To break down evaluation results by entity frequency bins, we partition a test set according to the frequency of its gold entities as observed in the training set. This is in line with recent recommendations for finer-grained evaluation in EL (Waitelonis et al., 2016; Ilievski et al., 2018). We calculate metrics within each bin, and report the macro-average over bins. This is a stricter form of the label-based macro-averaging sometimes used, but better highlights the zero-shot and few-shot cases. We also report micro-average metrics, computed over the entire dataset, without binning.
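The binning and macro-averaging can be stated compactly. A sketch under the assumption that gold-entity training counts and retrieved candidate lists are available:

```python
# A sketch of the frequency-bucketed evaluation (our illustration): queries
# are binned by the training-set frequency of their gold entity, recall is
# computed per bin, and the macro-average is the mean over non-empty bins.
# `train_counts` and the `retrieved` candidate lists are assumed inputs.
import bisect

BIN_EDGES = [1, 10, 100, 1_000, 10_000]  # bins [0,1), [1,10), ..., [10k,+)

def macro_recall_at_k(examples, train_counts, retrieved, k=100):
    n_bins = len(BIN_EDGES) + 1
    hits, totals = [0] * n_bins, [0] * n_bins
    for ex in examples:
        b = bisect.bisect_right(BIN_EDGES, train_counts.get(ex.gold, 0))
        totals[b] += 1
        hits[b] += ex.gold in retrieved[ex.query_id][:k]
    per_bin = [h / t for h, t in zip(hits, totals) if t > 0]
    return sum(per_bin) / len(per_bin)
```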

                 (a)                    (b)                    (c)
Bin          holdout  TR2016hard    holdout  TR2016hard    holdout  TR2016hard
[0, 1)       +0.842   +0.380        +0.009   +0.093        +0.044   +0.144
[1, 10)      +0.857   +0.814        +0.018   +0.037        +0.051   +0.031
[10, 100)    +0.211   +0.191        +0.012   +0.024        +0.006   -0.019
[100, 1k)    -0.010   -0.031        +0.007   +0.019        -0.005   -0.015
[1k, 10k)    -0.018   -0.051        +0.008   +0.011        -0.003   -0.007
[10k, +)     -0.009   -0.089        +0.004   +0.003        -0.002   -0.013
micro-avg    +0.018   +0.008        +0.006   +0.017        -0.001   -0.006
macro-avg    +0.312   +0.202        +0.010   +0.031        +0.015   +0.020

Table 2: R@100 differences between pairs of models: (a) Model F (featurized inputs for entities) relative to Model E (a dedicated embedding for each entity); (b) adding the cross-lingual entity-entity task on top of the mention-entity task for Model F; (c) controlling label balance per entity during negative mining (versus not).

5.2.2 Entity Encoder Comparison

We first consider the choice of entity encoder, comparing Model F with Model E. Table 2(a) shows that using the entity descriptions as inputs leads to dramatically better performance on rare and unseen entities, in exchange for small losses on entities appearing more than 100 times, and overall improvements in both macro and micro recall.

Note that, as expected, the embedding Model E gives 0% recall in zero-shot cases, as those embeddings are randomly initialized and never get updated in the absence of any training examples.

The embedding table of Model E has 6 billion parameters, but there is no sharing across entities. Model F has approximately 50 times fewer parameters, but can distribute information in its shared, compact WordPiece vocabulary and Transformer layer parameters. We can think of these dual encoder models as classifiers over 20 million classes where the softmax layer is either parameterized by an ID embedding (Model E) or an encoding of a description of the class itself (Model F). Remarkably, using a Transformer for the latter approach effectively compresses (nearly) all the information in the traditional embedding model into a compact and far more generalizable model.

This result highlights the value of analyzing model behavior in terms of entity frequency. When looking at the micro-averaged metric in isolation, one might conclude that the two models perform similarly; but the macro-average is sensitive to the large differences in the low-frequency bins.

5.2.3 Auxiliary Cross-Lingual Task

In seeking to improve the performance of Model F on tail entities, we return to the (partly redundant) entity descriptions in multiple languages. By choosing just one language as the input, we are ignoring potentially valuable information in the remaining descriptions.

Here we add an auxiliary task: cross-lingual entity description retrieval. This reuses the entity encoder ψ of Model F to map two descriptions of an entity e to a score, s(t_e^l, t_e^{l′}) ∝ ψ(t_e^l)^T ψ(t_e^{l′}), where t_e^{l′} is the description selected by the earlier heuristic, and t_e^l is sampled from the other available descriptions for the entity.

We sample up to 5 such cross-lingual pairs per entity to construct the training set for this auxiliary task. This makes richer use of the available multilingual descriptions, and exposes the model to 39 million additional high-quality training examples whose distribution is decoupled from that of the mention-entity pairs in the primary task. The multi-task training computes an overall loss by averaging the in-batch sampled softmax loss for a batch of (m, e) pairs and for a batch of (e, e) pairs.

Table 2(b) confirms this brings consistent quality gains across all frequency bins, and more so for uncommon entities. Again, reliance on the micro-average metric alone understates the benefit of this data augmentation step for rarer entities.
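A sketch of the combined objective (our simplification; details such as a softmax temperature are omitted). The in-batch sampled softmax treats each row's aligned target as its positive and every other row in the batch as a negative, and the overall loss averages the mention-entity and entity-entity batch losses:

```python
# A sketch (ours) of the multi-task loss in Section 5.2.3. Row i of
# `targets` is the positive for row i of `queries`; all other rows in the
# batch act as sampled negatives.
import tensorflow as tf

def in_batch_softmax_loss(queries, targets):
    logits = tf.matmul(queries, targets, transpose_b=True)  # [batch, batch]
    labels = tf.range(tf.shape(logits)[0])                  # diagonal = positive
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))

def multitask_loss(mention_enc, entity_enc, desc_enc_a, desc_enc_b):
    # Average the mention-entity loss and the cross-lingual
    # description-description loss, as described above.
    return 0.5 * (in_batch_softmax_loss(mention_enc, entity_enc) +
                  in_batch_softmax_loss(desc_enc_a, desc_enc_b))
```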

[Figure 2: per-language chart over all 104 language codes, plotting Model F+ R@1 and alias table R@1 (left axis, 0.5 to 1.0) against log training-set size (right axis).]

Figure 2: Accuracy of Model F+ on the 104 languages in our balanced Wikipedia heldout set, overlayed on alias table accuracy and Wikipedia training set size. (See Figure B1 in the Appendix for a larger view.)

5.2.4 Hard-Negative Mining

Training with hard negatives is highly effective in monolingual entity retrieval (Gillick et al., 2019), and we apply the technique they detail to our multilingual setting.

In its standard form, a certain number of negatives are mined for each mention in the training set by collecting top-ranked but incorrect entities retrieved by a prior model. However, this process can lead to a form of the class imbalance problem, as uncommon entities become over-represented as negatives in the resulting dataset. For example, an entity appearing just once in the original training set could appear hundreds or thousands of times as a negative example. Instead, we control the ratio of positives to negatives on a per-entity basis, mining up to 10 negatives per positive.

Table 2(c) confirms that our strategy effectively addresses the imbalance issue for rare entities, with only small degradation for more common entities. We use this model to perform a second, final round of the adapted negative mining, followed by further training, improving the macro-average by a further +0.05 (holdout) and +0.08 (TR2016hard).

The model we use in the remainder of the experiments combines all these findings: Model F with the entity-entity auxiliary task and hard-negative mining with per-entity label balancing, referenced as Model F+.
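Our reading of the balancing step above, as a sketch (the mined-pair format and the count dictionary are assumptions):

```python
# A sketch of per-entity negative balancing. Standard mining would keep
# every (mention, hard-negative) pair; here an entity may serve as a hard
# negative at most `ratio` times per positive example it has.
from collections import Counter

def balance_negatives(mined_pairs, positive_counts, ratio=10):
    used = Counter()
    kept = []
    for mention_id, neg_entity in mined_pairs:
        cap = ratio * positive_counts.get(neg_entity, 0)
        if used[neg_entity] < cap:
            used[neg_entity] += 1
            kept.append((mention_id, neg_entity))
    return kept
```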

5.3 Linking in 100 Languages

Breaking down the model's performance by language (R@1 on our heldout set) reveals relatively strong performance across all languages, despite greatly varying training set sizes (Figure 2). It also shows improvement over an alias table baseline in all languages. While this does not capture the relative difficulty of the EL task in each language, it does strongly suggest effective cross-lingual transfer in our model: even the most data-poor languages have reasonable results. This validates our massively multilingual approach.

5.4 Comparison to Prior Work

We evaluate the performance of our final retrieval model relative to previous work on two existing datasets, noting that direct comparison is impossible because our task setting is novel.

5.4.1 Cross-Lingual Wikification Setting

We compare to two previously reported results on TR2016hard: the WIKIME model of Tsai and Roth (2016) that accompanied the dataset, and the XELMS-MULTI model of Upadhyay et al. (2018). Both models depend at their core on multilingual word embeddings, which are obtained by applying (bilingual) alignment or projection techniques to pretrained monolingual word embeddings.

As reported in Table 3, our multilingual dual encoder outperforms the other two by a significant margin. To the best of our knowledge, this is the highest accuracy to date on this challenging evaluation set. (Our comparison is limited to the four languages on which Upadhyay et al. (2018) evaluated their multilingual model.)

             Tsai+   Upad.+   Model F+
Languages    13      5        104
|V|          5m      5m       20m
Candidates   20      20       20m
de           0.53    0.55     0.62
es           0.54    0.57     0.58
fr           0.48    0.51     0.54
it           0.48    0.52     0.56
Average      0.51    0.54     0.57

Table 3: Our best model outperforms previous related non-monolingual models that relied on alias tables and disambiguated among a much smaller set of entities. Top half: language coverage; entity vocabulary size; and entities disambiguated among at inference time. Bottom half: linking accuracy on the TR2016hard test set. Middle columns: Tsai and Roth (2016) and Upadhyay et al. (2018).

This is a strong validation of the proposed approach because the experimental setting is heavily skewed toward the prior models: both are rerankers and require a first-stage candidate generation step. They therefore only disambiguate among the resulting ≤20 candidate entities (only from English Wikipedia), whereas our model performs retrieval against all 20 million entities.

5.4.2 Out-of-Domain English Evaluation

We now turn to the question of how well the proposed multilingual model can maintain competitive performance in English and generalize to a domain other than Wikipedia. Gillick et al. (2019) provide a suitable comparison point. Their DEER model is closely related to our approach, but used a more light-weight dual encoder architecture with bags-of-embeddings and feed-forward layers, without attention, and was evaluated on English EL only. On the English WikiNews-2018 dataset they introduced, our Transformer-based multilingual dual encoder matches their monolingual model's performance at R@1 and improves R@100 by 0.01 (reaching 0.99). Our model thus retains strong English performance despite covering many languages and linking against a larger KB. See Table 4.

                    DEER   Model F+
Languages           1      104
|V| = Candidates    5.7m   20m
R@1                 0.92   0.92
R@100               0.98   0.99

Table 4: Comparison to the DEER model (Gillick et al., 2019) on their English WikiNews-2018 dataset.

5.5 Evaluation on Mewsli-9

Table 5 shows the performance of our model on our new Mewsli-9 dataset, compared with an alias table baseline that retrieves entities based on the prior probability of an entity given the observed mention string. Table 6 shows the usual frequency-binned evaluation. While overall (micro-average) performance is strong, there is plenty of headroom in zero- and few-shot retrieval.

             Alias Table      Model F+
Language     R@1     R@10     R@1     R@10
ar           0.89    0.93     0.92    0.98
de           0.86    0.91     0.92    0.97
en           0.79    0.86     0.87    0.94
es           0.82    0.90     0.89    0.97
fa           0.87    0.92     0.92    0.97
ja           0.82    0.90     0.88    0.96
sr           0.87    0.92     0.93    0.98
ta           0.79    0.85     0.88    0.97
tr           0.80    0.88     0.88    0.97
micro-avg    0.83    0.89     0.89    0.96
macro-avg    0.83    0.89     0.90    0.97

Table 5: Results of our main dual encoder Model F+ on the new Mewsli-9 dataset. Consistent performance across languages in a different domain from the training set points at good generalization.

                        Model F+          +CA
Bin          Queries    R@1     R@10      R@1
[0, 1)       3,198      0.08    0.34      0.07
[1, 10)      6,564      0.58    0.81      0.60
[10, 100)    32,371     0.80    0.93      0.82
[100, 1k)    66,232     0.90    0.97      0.90
[1k, 10k)    78,519     0.93    0.98      0.93
[10k, +)     102,203    0.94    0.99      0.96
micro-avg    289,087    0.89    0.96      0.91
macro-avg               0.70    0.84      0.71

Table 6: Results on the new Mewsli-9 dataset, by entity frequency, attained by our main dual encoder Model F+, plus reranking its predictions with a cross-attention scoring model (CA).

5.5.1 Example Outputs
We sampled the model's correct predictions on Mewsli-9, focusing on cross-lingual examples where entities do not have an English Wikipedia page (Table 7). These examples demonstrate that the model effectively learns cross-lingual entity representations. Based on a random sample of the model's errors, we also show examples that summarize notable error categories.

5.5.2 Reranking Experiment

We finally report a preliminary experiment applying a cross-attention scoring model (CA) to rerank entity candidates retrieved by the main dual encoder (DE), using the same architecture as Logeswaran et al. (2019). We feed the concatenated mention text and entity description into a 12-layer Transformer model, initialized from the same multilingual BERT checkpoint referenced earlier. The CA model's [CLS] token encoding is used to classify mention-entity coherence. We train the model with a binary cross-entropy loss, using positives from our Wikipedia training data, taking for each one the top-4 DE-retrieved candidates plus 4 random candidates (proportional to the positive distributions).
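Schematically, the reranking step looks as follows; `ca_score` stands in for the 12-layer cross-attention Transformer, and all other names are illustrative:

```python
# A sketch of the reranking step (ours). `ca_score` reads the mention text
# and one entity description together and returns the coherence score from
# the binary classifier on the [CLS] output.
def rerank(mention_text, candidates, descriptions, ca_score, top_k=5):
    scored = [(ca_score(mention_text, descriptions[qid]), qid)
              for qid in candidates[:top_k]]
    return [qid for _, qid in sorted(scored, reverse=True)]
```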

Example 1 — Correct
Context (German): ". . . Bei den neuen Bahnen handelt es sich um das Model Tramino von der polnischen Firma Solaris Bus & Coach . . ."
Prediction: Solaris Tramino (Q780281) — "Solaris Tramino – rodzina tramwajów, które są produkowane przez firmę Solaris Bus & Coach z Bolechowa koło Poznania . . ."
Outcome: A family of trams originally manufactured in Poland, mentioned here in German, linked to its Polish description.

Example 2 — Correct
Context (Spanish): ". . . sobre una tecnología que permitiría fabricar chocolate a partir de los zumos de fruta, agua con vitamina C o gaseosa dietética . . ."
Prediction: fruit juice (Q20932605) — "Fruchtsaft, spezieller auch Obstsaft, ist ein aus Früchten einer oder mehrerer Fruchtarten gewonnenes flüssiges Erzeugnis . . ."
Outcome: A Spanish mention of "fruit juice" linked to its German description—only "juice" has a dedicated English Wikipedia page.

Example 3 — Wrong
Context (Serbian): ". . . Душан Ивковић рекао је да је његов тим имао императив победе над (Италијом) на Европском првенству . . ."
Prediction: It. men's water polo team (Q261190) — "La nazionale di pallanuoto maschile dell'Italia . . ."
Expected: It. nat. basketball team (Q261190) — "La nazionale di pallacanestro italiana è la selezione dei migliori giocatori di nazionalità italiana . . ."
Outcome: A legitimately ambiguous mention of "Italy" in Serbian (sports context), for which the model retrieved the water polo and football teams, followed by the expected basketball team entity, all featurized in Italian.

Example 4 — Wrong
Context (English): ". . . In July 2009, action by the Federal Bureau of Reclamation to protect threatened fish stopped irrigation pumping to parts of the California Central Valley . . ."
Prediction: irrigation sprinkler (Q998539) — "スプリンクラーは、水に高圧をかけ飛沫にしてノズルから散布する装置 . . ."
Outcome: Metonymous mention of the Central Valley Project (Q2944429) in English, but the model retrieved the more literal match, featurized in Japanese. Metonymy is a known challenging case for EL (Ling et al., 2015).

Table 7: Correct and mistaken examples observed in error analysis of the dual encoder Model F+ on Mewsli-9.

We use the trained CA model to rerank the top-5 DE candidates for Mewsli-9 (Table 6). We observed improvements in most frequency buckets compared to DE R@1, which suggests that the model's few-shot capability can be improved by cross-lingual reading comprehension. This also offers an initial multilingual validation of a similar two-step BERT-based approach recently introduced in a monolingual setting by Wu et al. (2019), and provides a strong baseline for future work.

6 Conclusion

We have proposed a new formulation for multilingual entity linking that seeks to expand the scope of entity linking to better reflect the real-world challenges of rare entities and/or low-resource languages. Operationalized through Wikipedia and WikiData, our experiments with enhanced dual encoder retrieval models and frequency-based evaluation provide compelling evidence that it is feasible to perform this task with a single model covering over 100 languages.

Our automatically extracted Mewsli-9 dataset serves as a starting point for evaluating entity linking beyond the entrenched English benchmarks and under the expanded multilingual setting. Future work could investigate the use of non-expert human raters to improve the dataset quality further.

In pursuit of improved entity representations, future work could explore the joint use of complementary multi-language descriptions per entity, methods to update representations in a light-weight fashion when descriptions change, and ways to incorporate relational information stored in the KB.

Acknowledgments

Thanks to Sayali Kulkarni and the anonymous reviewers for their helpful feedback.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Oshin Agarwal and Daniel M. Bikel. 2020. Entity linking via dual and cross-attention encoders. arXiv preprint arXiv:2004.03555.

Razvan Bunescu and Marius Paşca. 2006. Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Thibault Févry, Nicholas FitzGerald, Livio Baldini Soares, and Tom Kwiatkowski. 2020a. Empirical evaluation of pretraining strategies for supervised entity linking. In Automated Knowledge Base Construction.

Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. 2020b. Entities as experts: Sparse memory access with entity supervision. arXiv preprint arXiv:2004.07202.

Matthew Francis-Landau, Greg Durrett, and Dan Klein. 2016. Capturing semantic similarity for entity linking with convolutional neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1256–1261, San Diego, California. Association for Computational Linguistics.

Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 528–537, Hong Kong, China. Association for Computational Linguistics.

Daniel Gillick, Alessandro Presta, and Gaurav Singh Tomar. 2018. End-to-end retrieval in continuous space. arXiv preprint arXiv:1811.08008.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria. PMLR.

Filip Ilievski, Piek Vossen, and Stefan Schlobach. 2018. Systematic study of long tail phenomena in entity linking. In Proceedings of the 27th International Conference on Computational Linguistics, pages 664–674, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, and Joe Ellis. 2010. Overview of the TAC 2010 knowledge base population track. In Proceedings of the Text Analytics Conference.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations, San Diego, CA.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328.

Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. 2019. Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5962–5971, Florence, Italy. Association for Computational Linguistics.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas Oard, and David Doermann. 2011. Cross-language entity linking. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 255–263, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. 2016. MEANTIME, the NewsReader multilingual event and time corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, pages 4417–4422, Portorož, Slovenia. European Language Resources Association (ELRA).

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035. Association for Computational Linguistics.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6924–6931.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.

Henry Rosales-Méndez, Aidan Hogan, and Barbara Poblete. 2018. VoxEL: a benchmark dataset for multilingual entity linking. In International Semantic Web Conference, pages 170–186. Springer.

Peter Shaw, Philip Massey, Angelica Chen, Francesco Piccinno, and Yasemin Altun. 2019. Generating logical forms from graph representations of text and entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 95–106. Association for Computational Linguistics.

Wei Shen, Jianyong Wang, and Jiawei Han. 2014. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.

Avirup Sil and Radu Florian. 2016. One for all: Towards language independent named entity linking. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2255–2264, Berlin, Germany. Association for Computational Linguistics.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Thirty-Second AAAI Conference on Artificial Intelligence.

Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2012. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598, San Diego, California. Association for Computational Linguistics.

Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018. Joint multilingual supervision for cross-lingual entity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2486–2495, Brussels, Belgium. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Jörg Waitelonis, Henrik Jürges, and Harald Sack. 2016. Don't compare apples to oranges: Extending GERBIL for a fine grained NEL evaluation. In Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS 2016, pages 65–72, New York, NY, USA. Association for Computing Machinery.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.

Shuyan Zhou, Shruti Rijhwani, and Graham Neubig. 2019. Towards zero-resource cross-lingual entity linking. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 243–252. Association for Computational Linguistics.

Shuyan Zhou, Shruti Rijhwani, John Wieting, Jaime Carbonell, and Graham Neubig. 2020. Improving candidate generation for low-resource cross-lingual entity linking. Transactions of the Association for Computational Linguistics, 8:109–124.

Appendix

A Mewsli-9 Dataset

Available at: http://goo.gle/mewsli-dataset

We used an automated process to construct Mewsli-9, exploiting link anchor text to identify naturally occurring entity mentions in WikiNews articles, from its inception to the end of 2018.

From a given WikiNews page dump,[9] we extracted text including link anchors and section headings using a modified version of wikiextractor.[10] To obtain clean article text, we discard page-final sections that merely contain external references, etc. This is done by matching section headings against a small set of hand-collected, language-specific patterns.

Mention candidates are filtered to those remaining links that point to Wikipedia pages in any language (not limited to our 104 languages). These Wikipedia links are redirected if necessary, and resolved to WikiData identifiers to determine the gold entity for a mention. There are many reasons why resolution may fail, including mistakes in the original markup and churn in the data sources over time. The final dataset is limited to (mention, entity) pairs where resolution to WikiData succeeded.

[9] archive.org/download/XXwikinews-20190101/XXwikinews-20190101-pages-articles.xml.bz2, where XX is a language code.
[10] github.com/attardi/wikiextractor

B Training Details and Hyperparameters

All model training was carried out on a Google TPU v3 architecture,[11] using batch size 8192 and a learning rate schedule that uses linear warm-up followed by linear decay to 0.

[11] cloud.google.com/tpu/docs/tpus

The first phase of training our dual encoders (DE) with in-batch random negatives encompasses 500,000 steps, which takes approximately one day. Where hard-negative training is applied, we initialize from the corresponding prior model checkpoint and continue training against the multi-task loss for a further 250,000 steps, which also takes about a day.

Other than the limit to using a 4-layer Transformer stack, our mention encoder and Model F entity encoders use the same hyperparameters as mBERT-base, allowing initialization from the publicly available checkpoint—we use the weights of its first 4 layers, in addition to those of the token and positional embeddings.

The cross-attention scoring model (CA) in the final preliminary experiment is a full 12-layer Transformer (also mBERT-base), and was trained for 1 million steps, taking just under one day.

The learning rates were 1e-4 (DE) and 1e-5 (CA), and included warm-up phases of 10% (DE) and 1% (CA) of the respective number of training steps.

C Data Preprocessing

We used the 2019-10-03 dump of Wikipedia and WikiData, parsed using in-house tools.[12]

[12] dumps.wikimedia.org

Two filtering criteria are relevant in preprocessing WikiData to define our KB. The first is to exclude items that are a subclass (P279) or instance of (P31) the most common Wikimedia-internal administrative entities, detailed in Table 8. The remaining entities are then filtered to retain only those for which the WikiData entry points to at least one Wikipedia page, in any language, motivated by our objective of using descriptive text as entity features.

QID          label
Q4167836     category
Q24046192    category stub
Q20010800    user category
Q11266439    template
Q11753321    navigational template
Q19842659    user template
Q21528878    redirect page
Q17362920    duplicated page
Q14204246    project page
Q21025364    project page
Q17442446    internal item
Q26267864    KML file
Q4663903     portal
Q15184295    module

Table 8: WikiData identifiers used for filtering out Wikimedia-internal entities from our KB definition.

D Other data sources

– TR2016hard dataset: cogcomp.seas.upenn.edu/page/resource_view/102
– English WikiNews-2018 dataset by Gillick et al. (2019): github.com/google-research/google-research/tree/master/dense_representations_for_entity_retrieval
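A sketch of the Appendix C KB-definition rule (the WikiData item accessors are assumptions about a parsed dump, and only direct P31/P279 claims are checked here):

```python
# A sketch (ours) of the KB filtering rule: exclude administrative items
# (Table 8) and require at least one Wikipedia sitelink in any language.
ADMIN_QIDS = {
    "Q4167836", "Q24046192", "Q20010800", "Q11266439", "Q11753321",
    "Q19842659", "Q21528878", "Q17362920", "Q14204246", "Q21025364",
    "Q17442446", "Q26267864", "Q4663903", "Q15184295",
}

def in_entity_vocabulary(item):
    # item.claims("P31")/("P279"): instance-of / subclass-of target QIDs;
    # item.sitelinks: Wikipedia pages linked to this item, in any language.
    typed = set(item.claims("P31")) | set(item.claims("P279"))
    if typed & ADMIN_QIDS:
        return False                    # Wikimedia-internal administrative item
    return len(item.sitelinks) > 0      # must have at least one Wikipedia page
```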

[Figure B1: full-page chart, rotated in the original PDF; per-language Wikipedia training set sizes (log scale, right axis) with Model F+ R@1 and alias table R@1 overlaid (left axis, 0.5 to 1.0) for all 104 language codes.]

Figure B1: Accuracy (left axis) of Model F+ on the 104 languages in our balanced Wikipedia heldout set, overlayed on alias table accuracy, and Wikipedia training set size (right axis). (This is a scaled-up reproduction of Figure 2 from Section 5.3.) Model F+ obtains relatively strong performance across 104 languages and outperforms an alias table baseline in all cases. Although task difficulty is not necessarily comparable between languages, this result suggests effective cross-lingual transfer is happening: even languages with small training sets show reasonable accuracy.