arXiv:2103.12528v1 [cs.CL] 23 Mar 2021
Multilingual Autoregressive Entity Linking

Nicola De Cao¹,², Ledell Wu¹, Kashyap Popat¹, Mikel Artetxe¹, Naman Goyal¹, Mikhail Plekhanov¹, Luke Zettlemoyer¹,³, Nicola Cancedda¹, Sebastian Riedel¹,⁴, Fabio Petroni¹

¹Facebook AI ²University of Amsterdam ³University of Washington ⁴University College London

[email protected] {ledell, kpopat, artetxe, naman, movb, lsz, ncan, sriedel, fabiopetroni}@fb.com

Abstract

We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem: the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where mGENRE establishes new state-of-the-art results. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

1 Introduction

Entity Linking (EL, Hoffart et al., 2011; Dredze et al., 2010; Bunescu and Paşca, 2006; Cucerzan, 2007) is an important task in NLP, with plenty of applications in multiple domains, spanning Question Answering (De Cao et al., 2019; Nie et al., 2019; Asai et al., 2020), Dialogue (Bordes et al., 2016; Wen et al., 2017; Williams et al., 2017; Chen et al., 2017; Curry et al., 2018), and Biomedical systems (Leaman and Gonzalez, 2008; Zheng et al., 2015), to name just a few. It consists of grounding entity mentions in unstructured texts to KB descriptors (e.g., Wikipedia articles).

The multilingual version of the EL problem has long been tied to a purely cross-lingual formulation (XEL, McNamee et al., 2011a; Ji et al., 2015), where mentions expressed in one language are linked to a KB expressed in another (typically English). Recently, Botha et al. (2020) made a step towards a more inherently multilingual formulation by defining a language-agnostic KB, obtained by grouping language-specific descriptors per entity. Such a formulation can consider entities that do not have an English descriptor (e.g., a Wikipedia article in English) but have one in some other language.

A design choice common to most current solutions, regardless of the specific formulation, is to provide a unified entity representation, either by collating multilingual descriptors in a single vector or by defining a canonical language. For the common bi-encoder approach (Wu et al., 2020; Botha et al., 2020), this might be optimal. However, the recently proposed GENRE model (De Cao et al., 2021) is an autoregressive formulation of the EL problem that leads to stronger performance and considerably smaller memory footprints than bi-encoder approaches on monolingual benchmarks. In GENRE, the representations to match against are entity names (i.e., strings), and it is unclear how to extend those beyond a monolingual setting.

In this context, we find that maintaining as much language information as possible, hence providing multiple representations per entity, helps due to the connections between source language and entity names in different languages. We additionally find that using all available languages as targets and aggregating over the possible choices is an effective way to deal with a zero-shot setting where no training data is available for the source language.

[Figure 1 diagram: an INPUT context, "[..] Es steht in Konkurrenz zum etablierten [START] GPS [END]-System der USA, soll aber mit den technischen Spezifikationen der Datenströme des GPS-Systems kompatibel sein. [..]" (TRANSLATION: "[..] It competes with the established [START] GPS [END] system in the USA, but should be compatible with the technical specifications of the data streams of the GPS system. [..]"), is fed to a Transformer Encoder and an Autoregressive Transformer Decoder with prefix-constrained vocabulary. The decoder produces scored sequences such as "Global Positioning System >> de", "Globales Navigationssatellitensystem >> de", "Sistema de posicionamento global >> es", "Sistema di posizionamento globale >> it" and "Sistema de posicionament global >> ca". A bidirectional Wikipedia-Wikidata mapping resolves them to Wikidata IDs (Q18822, Global Positioning System; Q179435, satellite navigation system), and the sequence scores are aggregated per entity.]

Figure 1: mGENRE: we use an autoregressive decoder to generate language IDs as well as entity names (i.e., Wikipedia titles). The combination of a language ID and an entity name uniquely identifies a Wikidata ID (with an N-to-1 mapping). We use Beam Search for efficient inference and we marginalize the probability scores for different languages to score entities. This example is a real output from our system.

Concretely, in this paper, we present mGENRE, the first multilingual EL system that exploits a sequence-to-sequence architecture to generate entity names in more than 100 languages left to right, token-by-token in an autoregressive fashion and conditioned on the context (see Figure 1 for an outline of our system). While prior works use a single representation for each entity, we maintain entity names for as many languages as possible, which allows exploiting language connections between source input and target name. To summarize, this work makes the following contributions:

• Extend the catalog of entity names by considering all languages for each entry in the KB. Storing the multilingual names index is feasible and cheap (i.e., 2.2GB for ∼89M names).
• Design a novel objective function that marginalizes over all languages to perform a prediction. This approach is particularly effective in dealing with languages not seen during fine-tuning (∼50% improvements).
• Establish new state-of-the-art performance for the Mewsli-9 (Botha et al., 2020), TR2016^hard (Tsai and Roth, 2016a) and TAC-KBP2015 (Ji et al., 2015) MEL datasets.
• Present extensive analysis of modeling choices, including the usage of candidates from a mention table, frequency-bucketed evaluation, and performance on a held-out set including low-resource languages.
• Publicly release our best model, pre-trained as a multilingual denoising auto-encoder using the BART objective (Lewis et al., 2020; Liu et al., 2020) on large-scale monolingual corpora in 125 languages and fine-tuned to generate entity names given ∼730M in-context Wikipedia hyperlinks in 105 languages.

2 Background

We first introduce Multilingual Entity Linking in Section 2.1, highlighting its differences from monolingual and cross-lingual linking. We address the MEL problem with a sequence-to-sequence model that generates textual entity identifiers (i.e., entity names). Our formulation generalizes the GENRE model by De Cao et al. (2021) to a multilingual setting (mGENRE). Thus, in Sections 2.2 and 2.3, we discuss the GENRE model and how it ranks entities with Beam Search, respectively.

2.1 Task Definition

Multilingual Entity Linking (MEL, Botha et al., 2020) is the task of linking a given entity mention $m$ in a given context $c$ of language $l \in L_C$ to the corresponding entity $e \in E$ in a multilingual Knowledge Base (KB). See Figure 1 for an example: there are textual inputs with entity mentions (in bold) and we ask the model to predict the corresponding entities in the KB. A language-agnostic KB includes at least the name (but could also include descriptions, aliases, etc.) of each entity in one or more languages, but there is no assumption about the relationship between these languages $L_{KB}$ and the languages of the context $L_C$. This is a generalization of both monolingual Entity Linking (EL) and cross-lingual EL (XEL, McNamee et al., 2011a; Ji et al., 2015). The latter considers contexts in different languages while mapping mentions to entities in a monolingual KB (e.g., English Wikipedia).

Additionally, we assume that each $e \in E$ has a unique textual identifier in at least one language. Concretely, in this work, we use Wikidata (Vrandečić and Krötzsch, 2014) as our KB. Each Wikidata item lists a set of Wikipedia pages in multiple languages linked to it, and in any given language each page has a unique name (i.e., its title).

[...] $e$ in the language $l$. We extracted these identifiers from our KB: each Wikidata item has a set of Wikipedia pages in multiple languages linked to it, and in any given language, each page has a unique name. We identified 3 strategies to employ these identifiers:
i) define a canonical textual identifier for each entity such that there is a 1-to-1 mapping between the two (i.e., for each entity, select a specific language for its name; see Section 3.1);
ii) define an N-to-1 mapping between textual identifiers and entities by concatenating a language ID (e.g., a special token or the ISO 639-1 code¹) followed by the entity's name in that language, or alternatively by concatenating its name first and then a language ID (see Section 3.2);
iii) treat the selection of an identifier in a particular language as a latent variable (i.e., we let the model learn a conditional distribution of [...]

2.2 Autoregressive generation

GENRE ranks each $e \in E$ by computing a score with an autoregressive formulation: $\mathrm{score}_\theta(e \mid x) = p_\theta(y \mid x) = \prod_{i=1}^{N} p_\theta(y_i \mid y_{<i}, x)$, where $y$ is the sequence of $N$ tokens in the identifier of $e$, $x$ the input (i.e., the context $c$ and mention $m$), and $\theta$ the parameters of the model.