Multilingual Autoregressive Entity Linking

Nicola De Cao1,2, Ledell Wu1, Kashyap Popat1, Mikel Artetxe1, Naman Goyal1, Mikhail Plekhanov1, Luke Zettlemoyer1,3, Nicola Cancedda1, Sebastian Riedel1,4, Fabio Petroni1
1Facebook AI 2University of Amsterdam 3University of Washington 4University College London
[email protected] {ledell, kpopat, artetxe, naman, movb, lsz, ncan, sriedel, fabiopetroni}@fb.com

arXiv:2103.12528v1 [cs.CL] 23 Mar 2021

Abstract

We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem: the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where mGENRE establishes new state-of-the-art results. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

1 Introduction

Entity Linking (EL, Hoffart et al., 2011; Dredze et al., 2010; Bunescu and Paşca, 2006; Cucerzan, 2007) is an important task in NLP, with plenty of applications in multiple domains, spanning Question Answering (De Cao et al., 2019; Nie et al., 2019; Asai et al., 2020), Dialogue (Bordes et al., 2016; Wen et al., 2017; Williams et al., 2017; Chen et al., 2017; Curry et al., 2018), and Biomedical systems (Leaman and Gonzalez, 2008; Zheng et al., 2015), to name just a few. It consists of grounding entity mentions in unstructured texts to KB descriptors (e.g., Wikipedia articles).

The multilingual version of the EL problem has long been tied to a purely cross-lingual formulation (XEL, McNamee et al., 2011a; Ji et al., 2015), where mentions expressed in one language are linked to a KB expressed in another (typically English). Recently, Botha et al. (2020) made a step towards a more inherently multilingual formulation by defining a language-agnostic KB, obtained by grouping language-specific descriptors per entity. Such a formulation can consider entities that do not have an English descriptor (e.g., a Wikipedia article in English) but have one in some other language.

A design choice common to most current solutions, regardless of the specific formulation, is to provide a unified entity representation, either by collating multilingual descriptors in a single vector or by defining a canonical language. For the common bi-encoder approach (Wu et al., 2020; Botha et al., 2020), this might be optimal. However, the recently proposed GENRE model (De Cao et al., 2021) uses an autoregressive formulation of the EL problem, which leads to stronger performance and considerably smaller memory footprints than bi-encoder approaches on monolingual benchmarks. In GENRE, the representations to match against are entity names (i.e., strings), and it is unclear how to extend those beyond a monolingual setting.

In this context, we find that maintaining as much language information as possible, hence providing multiple representations per entity, helps due to the connections between source language and entity names in different languages.

[Figure 1 (diagram): a German input sentence with the mention marked as [START] GPS [END], together with its English translation, is fed to a Bidirectional Transformer Encoder. An Autoregressive Transformer Decoder with a prefix-constrained vocabulary generates entity names followed by a language ID (e.g., "Global Positioning System >> de"), each with a sequence score. A Wikipedia-Wikidata mapping converts every generated name to a Wikidata ID (Q18822, Global Positioning System, or Q179435, satellite navigation system), and the scores are aggregated per entity.]

Figure 1: mGENRE: we use an autoregressive decoder to generate language IDs as well as entity names (i.e., Wikipedia titles). The combination of a language ID and an entity name uniquely identifies a Wikidata ID (with an N-to-1 mapping). We use Beam Search for efficient inference and we marginalize the probability scores for different languages to score entities. This example is a real output from our system.

We additionally find that using all available languages as targets and aggregating over the possible choices is an effective way to deal with a zero-shot setting where no training data is available for the source language.

Concretely, in this paper, we present mGENRE, the first multilingual EL system that exploits a sequence-to-sequence architecture to generate entity names in more than 100 languages left to right, token-by-token in an autoregressive fashion and conditioned on the context (see Figure 1 for an outline of our system). While prior works use a single representation for each entity, we maintain entity names for as many languages as possible, which allows exploiting language connections between source input and target name. To summarize, this work makes the following contributions:

• Extend the catalog of entity names by considering all languages for each entry in the KB. Storing the multilingual names index is feasible and cheap (i.e., 2.2GB for ∼89M names).
• Design a novel objective function that marginalizes over all languages to perform a prediction. This approach is particularly effective in dealing with languages not seen during fine-tuning (∼50% improvements).
• Establish new state-of-the-art performance for the Mewsli-9 (Botha et al., 2020), TR2016^hard (Tsai and Roth, 2016a) and TAC-KBP2015 (Ji et al., 2015) MEL datasets.
• Present extensive analysis of modeling choices, including the usage of candidates from a mention table, frequency-bucketed evaluation, and performance on a held-out set including low-resource languages.
• Publicly release our best model, pre-trained as a multilingual denoising auto-encoder using the BART objective (Lewis et al., 2020; Liu et al., 2020) on large-scale monolingual corpora in 125 languages and fine-tuned to generate entity names given ∼730M in-context Wikipedia hyperlinks in 105 languages.

2 Background

We first introduce Multilingual Entity Linking in Section 2.1, highlighting its difference with monolingual and cross-lingual linking. We address the MEL problem with a sequence-to-sequence model that generates textual entity identifiers (i.e., entity names). Our formulation generalizes the GENRE model by De Cao et al. (2021) to a multilingual setting (mGENRE). Thus in Sections 2.2 and 2.3, we discuss the GENRE model and how it ranks entities with Beam Search, respectively.

2.1 Task Definition

Multilingual Entity Linking (MEL, Botha et al., 2020) is the task of linking a given entity mention $m$ in a given context $c$ of language $l \in L_C$ to the corresponding entity $e \in E$ in a multilingual Knowledge Base (KB). See Figure 1 for an example: there are textual inputs with entity mentions (in bold) and we ask the model to predict the corresponding entities in the KB. A language-agnostic KB includes at least the name (but could also include descriptions, aliases, etc.) of each entity in one or more languages, but there is no assumption about the relationship between these languages $L_{KB}$ and the languages of the context $L_C$. This is a generalization of both monolingual Entity Linking (EL) and cross-lingual EL (XEL, McNamee et al., 2011a; Ji et al., 2015). The latter considers contexts in different languages while mapping mentions to entities in a monolingual KB (e.g., English Wikipedia).

Additionally, we assume that each $e \in E$ has a unique textual identifier in at least one language. Concretely, in this work, we use Wikidata (Vrandečić and Krötzsch, 2014) as our KB. Each Wikidata item lists a set of Wikipedia pages in multiple languages linked to it, and in any given language each page has a unique name (i.e., its title).

2.2 Autoregressive generation

GENRE ranks each $e \in E$ by computing a score with an autoregressive formulation:

$$\text{score}_\theta(e|x) = p_\theta(y|x) = \prod_{i=1}^{N} p_\theta(y_i \mid y_{<i}, x)$$

where $y$ is the sequence of $N$ tokens in the identifier of $e$, $x$ is the input, and $\theta$ are the parameters of the model.

2.3 Ranking with Constrained Beam Search

At test time, it is prohibitively expensive to compute a score for every element in $E$ and then sort them. Thus, GENRE exploits Beam Search (BS, Sutskever et al., 2014), an established approximate decoding strategy to navigate the search space efficiently. Instead of explicitly scoring all entities in $E$, it searches for the top-$k$ entities in $E$ using BS with $k$ beams. BS only considers one step ahead during decoding (i.e., it generates the next token conditioned on the previous ones). Thus, GENRE employs a prefix tree (trie) to enable constrained beam search and then generate only valid entity identifiers.

3 Model

To extend GENRE to a multilingual setting we need to define what the unique identifiers of all entities are in a language-agnostic fashion. This is not trivial since we rely on text representations that are by their nature grounded in some language. Concretely, for each entity $e$, we have a set of identifiers $I_e$ that consists of pairs $\langle l, n_e^l \rangle$ where $l \in L_{KB}$ indicates a language and $n_e^l$ the name of the entity $e$ in the language $l$. We extracted these identifiers from our KB: each Wikidata item has a set of Wikipedia pages in multiple languages linked to it, and in any given language, each page has a unique name. We identified 3 strategies to employ these identifiers:

i) define a canonical textual identifier for each entity such that there is a 1-to-1 mapping between the two (i.e., for each entity, select a specific language for its name; see Section 3.1);
ii) define an N-to-1 mapping between textual identifiers and entities by concatenating a language ID (e.g., a special token or the ISO 639-1 code¹) followed by its name in that language or, alternatively, concatenating its name first and then a language ID (see Section 3.2);
iii) treat the textual identifier as a latent variable and marginalize over languages (see Section 3.3).

¹ https://www.iso.org/standard/22109.html

3.1 Canonical entity representation

Selecting a single textual identifier for each entity corresponds to choosing its name among all the available languages of that entity. We employ the same data-driven selection heuristic as in Botha et al. (2020): for each entity $e$ we sort all its names $n_e^l$ for each language $l$ according to the number of mentions of $e$ in documents of language $l$. Then we take the name $n_e^l$ in the language $l$ that has the most mentions of $e$. In case of a tie, we select the language that has the largest number of mentions across all entities (i.e., the language for which we have more training data). Having a single identifier for each entity corresponds to having a 1-to-1 mapping between strings and entities. Thus, $\text{score}_\theta(e|x) = p_\theta(n_e|x)$, where $n_e$ indicates the canonical name for $e$. We train to maximize the scores for all our training data. A downside of this strategy is that most of the time, the model cannot exploit the lexical overlap between the context and entity name since it has to translate the name into the canonical one.

3.2 Multilingual entity representation

To address the issues of the canonical representation, we can predict entity names in any language. Concatenating a language ID $l$ and an entity name $n_e^l$ in different orders induces two alternative factorizations. We train maximizing the scores for all our training data:

$$\text{score}_\theta(e|x) = \begin{cases} p_\theta(l|x) \cdot p_\theta(n_e^l \mid x, l) & \text{for `lang+name'} \\ p_\theta(n_e^l \mid x) \cdot p_\theta(l \mid n_e^l, x) & \text{for `name+lang'} \end{cases} \qquad (1)$$

The former corresponds to first predicting a distribution over languages and then predicting a title conditioned on the language $l$, whereas the latter corresponds to the opposite. Predicting the language first conditions the generation to a smaller set earlier during beam search (i.e., all names in a specific language). However, it might exclude some targets from the search too early if the beam size is too small. Predicting the language last does not condition the generation of names on a particular language, but it asks the model to disambiguate the language of the generated name whenever it is ambiguous (i.e., when the same name in different languages corresponds to possibly different entities). Only 1.65% of the entity names need to be disambiguated with the language. In practice, we observe no difference in performance between the two approaches. Both strategies define an N-to-1 mapping between textual identifiers and entities, and at test time we just use a lookup table to select the correct KB item. This N-to-1 mapping is an advantage compared to using canonical names because the model can predict in any available language and therefore exploit synergies between source and target language as well as avoid translation.

3.3 Marginalization

Differently from the plain generation strategies described above, we can treat the textual identifiers as a latent variable and express $\text{score}_\theta(e|x)$ as the probability of the entity name in all languages, marginalizing over them:

$$\text{score}_\theta(e|x) = p_\theta(e|x) = \sum_{\langle l, n_e^l \rangle \in I_e} p_\theta(n_e^l, l \mid x) \qquad (2)$$

Marginalization exposes the model to all representations in all languages of the same entity and it requires a minor modification of the training procedure. Unfortunately, because computing $\text{score}_\theta(e|x)$ requires a sum over all languages, both training and inference with marginalization are more expensive than with simple generation (scaling linearly with the number of languages). However, at least during inference, we can still apply BS and marginalize using only the top-$k$ generations. For this reason, we test this training strategy only on a few languages, but we evaluate marginalization even when training with the other generation strategies described above.

3.4 Candidate selection

Modern EL systems that employ cross-encoding between context and entities usually do not score all entities in a KB as it is too computationally expensive (Wu et al., 2020). Instead, they first apply candidate selection (with a less expensive method or just a non-parametric mention table) to reduce the number of entities before scoring. In our formulation, there is no need to do that since mGENRE uses Beam Search to generate efficiently. However, using candidates might help, and therefore we also experiment with that. Scoring all candidates might not always be possible (sometimes there are thousands of candidates for a mention), especially because with an N-to-1 mapping between textual identifiers and entities there are names to rank in all languages available for each candidate. Thus, when we use candidates, it is to constrain BS steps further, rather than to rank all of them.

Concretely, candidate selection is made with an alias table. Using the training data, we build a mention table where we record all entities indexed by the names used to refer to them. Additionally, we also use Wikipedia titles as additional mentions (useful for entities that never appear as links), redirects, Wikidata labels, and aliases.

4 Experimental Setting

We use Wikidata (Vrandečić and Krötzsch, 2014) as our KB while exploiting the supervision signal from Wikipedia hyperlinks. For evaluation, we test our model on two established cross-lingual datasets, TR2016^hard and TAC-KBP2015 (Ji et al., 2015; Tsai and Roth, 2016a), as well as the recently proposed Mewsli-9 MEL dataset (Botha et al., 2020). Additionally, we propose a novel setting extracted from Wikinews² where we train a model on a set of languages, and we test it on unseen ones.

² https://www.wikinews.org

4.1 Knowledge Base: Wikidata

We use Wikidata as the target KB to link to, filtering it with the same heuristic as Botha et al. (2020) (see Appendix A for more details). Eventually, our entity set $E$ contains 20,277,987 items (as a reference, English Wikipedia has just ∼6M items). Using the corresponding Wikipedia titles as textual identifiers in all languages leads to a table of 53,849,351 entity names. We extended the identifiers by including redirects, which leads to a total of 89,270,463 entity names (see Table 10 in Appendix A for more details). Although large, the number of entity names is not a bottleneck as the generated prefix tree only occupies 2.2GB of storage. As a comparison, the MEL systems of Botha et al. (2020) need ∼10 times more storage.

4.2 Supervision: Wikipedia

For all experiments, we do not train a model from scratch, but we fine-tune a multilingual language model trained on 125 languages (see Appendix A for more details on the pre-trained model). We exploit Wikipedia hyperlinks as the source of supervision for MEL. We used Wikipedia in 105 languages out of the >300 available. These 105 are all the languages on which our model was pre-trained that overlap with the ones available in Wikipedia (see the full language list in Figure 2 and more details in Appendix A). Eventually, we extracted a large-scale dataset of 734,826,537 datapoints. For the plain generation strategy, we selected as the ground truth the name in the source language. When such an entity name is not available, we randomly select 5 alternative languages and we use all of them as datapoints. To enable model selection, we randomly selected 1k examples from each language to keep as a validation set.

4.3 Datasets

Mewsli-9 (Botha et al., 2020) contains 289,087 entity mentions appearing in 58,717 originally written news articles from Wikinews, linked to WikiData. The corpus includes documents in 9 languages.³ Differently from the cross-lingual setting, this is a truly multilingual dataset since 11% of the target entities in Mewsli-9 do not have an English Wikipedia page, thus a XEL model would not link these.

TR2016^hard (Tsai and Roth, 2016a) is a Wikipedia-based cross-lingual dataset specifically constructed to contain difficult mention-entity pairs. The authors extracted Wikipedia hyperlinks for which the corresponding entity is not the most likely when using an alias table. Since we train on Wikipedia, to avoid an overlap with this test data, we removed all mentions from our training data that also appear in TR2016^hard. Note that this pruning strategy is more aggressive than the strategies of Tsai and Roth (2016a) and Botha et al. (2020). Tsai and Roth (2016a) ensured that there are no mention-entity pair overlaps between training and test, but a mention (with a different entity) might appear in training. Botha et al. (2020)⁴ split at the page level only, making sure to hold out all Tsai and Roth (2016a) test pages (and their corresponding pages in other languages), but they trained on any mention-entity pair that could be extracted from their remaining training page partition (i.e., they have overlap between training and test entity-mention pairs). To compare with previous works (Tsai and Roth, 2016a; Upadhyay et al., 2018; Botha et al., 2020) we only evaluate on German, Spanish, French and Italian (a total of 16,357 datapoints).

TAC-KBP2015. To evaluate our system on documents out of the Wikipedia domain, we experiment on the TAC-KBP2015 Tri-Lingual Entity Linking Track (Ji et al., 2015). To compare with previous works (Tsai and Roth, 2016a; Upadhyay et al., 2018; Sil et al., 2018; Zhou et al., 2019), we use only Spanish and Chinese (i.e., we do not evaluate in English). Following previous work, we only evaluate in-KB links (Yamada et al., 2016; Ganea and Hofmann, 2017), i.e., we do not evaluate on mentions that link to entities out of the KB. Previous works considered Freebase (Bollacker et al., 2008) as KB, and thus we computed a mapping between Freebase IDs and Wikidata IDs. When we cannot resolve the match, our system gets a zero score (i.e., it counts as a wrong prediction). TAC-KBP2015 contains 166 Chinese documents (84 news and 82 discussion forum articles) and 167 Spanish documents (84 news and 83 discussion forum articles) for a total of 12,853 mention-entity datapoints.

Wikinews-7. For the purpose of testing a model on languages unseen during training, we extract mention-entity pairs from Wikinews in 7 languages that are not in the Mewsli-9 language set.⁵

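To make the prefix-tree constraint of Sections 2.3 and 4.1 concrete, the following is a minimal word-level sketch. It is an illustration under simplifying assumptions, not the released implementation: the `Trie` class, its method names, the whitespace tokenization, the `</s>` end marker, and the three toy identifiers in 'name >> lang' order are all hypothetical, whereas the actual system constrains subword tokens inside beam search.

```python
class Trie:
    """Toy prefix tree over token sequences, stored as nested dicts."""

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq):
        # Walk down the tree, creating one child node per token.
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def next_tokens(self, prefix):
        """Tokens allowed after `prefix`; empty if the prefix is invalid or complete."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


# Hypothetical identifiers (assumption: 'name >> lang' order, split on whitespace).
identifiers = [
    "Global Positioning System >> de </s>",
    "Sistema de posicionamiento global >> es </s>",
    "Sistema di posizionamento globale >> it </s>",
]
trie = Trie(s.split() for s in identifiers)

print(trie.next_tokens([]))           # allowed first tokens
print(trie.next_tokens(["Sistema"]))  # allowed continuations after 'Sistema'
```

In a constrained beam search, every token not returned by `next_tokens` for the current hypothesis would be masked out at each step, so only sequences stored in the tree can be completed.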
Table 7 in Appendix A reports statistics of the Wikinews-7 dataset. Wikinews-7 is created in the same way as Mewsli-9, but we used our own implementation to extract data from raw dumps.⁶

³ Arabic, English, Farsi, German, Japanese, Serbian, Spanish, Tamil, and Turkish.
⁴ Information provided by private correspondence with the authors.
⁵ Chinese, Czech, French, Italian, Polish, Portuguese, and Russian.
⁶ Botha et al. (2020) did not release code for extracting Mewsli-9 from a Wikinews dump.

5 Results

The main results of this work are reported in Table 1 for Mewsli-9, and in Table 2 for TR2016^hard and TAC-KBP2015 respectively. Our mGENRE (trained with 'title+lang') outperforms all previous works in all those datasets. We show the accuracy of mGENRE on the 105 languages in our Wikipedia validation set against an alias table baseline in Figure 2. In Table 11 in Appendix A we report more details on the validation set results. In Table 5 we report some examples of correct and wrong predictions of our mGENRE on selected datapoints from Mewsli-9.

[Figure 2 (plot): x-axis: the 105 language codes; left y-axis: Accuracy (0.2 to 1.0); right y-axis: log10(training size) (1 to 9); series: mGENRE, Alias Table.]

Figure 2: Accuracy of mGENRE and of the alias table on the 105 languages in our Wikipedia validation set. We also report the log-training set sizes per each language (see Figure 7 and Table 11 in Appendix B for a larger view and precise values for accuracy and full language names).

5.1 Performance evaluation

Mewsli-9. In Table 1 we compare our mGENRE against the best model from Botha et al. (2020) (Model F+) as well as with their alias table baseline. We report results from mGENRE with and without constraining the beam search to the top-k candidates from the table (see Section 3.4) as well as with and without marginalization (see Section 3.3). All of these alternatives outperform Model F+ on both micro and macro average accuracy across the 9 languages. Our base model (without candidates or marginalization) has a 10.9% error reduction in micro average and an 18.0% error reduction in macro average over all languages. The base model has no restrictions on candidates, so it is effectively classifying among all the ∼20M entities. The base model performs better than Model F+ on each individual language except English and German. Note that these languages are the ones for which we have more training data (∼134M and ∼60M datapoints each) but also the languages that have the most entities/pages (∼6.1M and ∼2.4M). Therefore these are the hardest languages to link.⁷ When enabling candidate filtering to restrict the space for generation, we further improve error reduction to 13.6% and 21.0% for micro and macro average respectively. Marginalization reduces the error by the same amount as candidate filtering, but combining search with candidates and marginalization leads to our best model: it improves error reduction to 14.5% and 23.0% on micro and macro average respectively. Our best model is also better than Model F+ in English and on par with it in German.

⁷ Even though there are many datapoints, disambiguating between more entities is harder. Moreover, low-resource languages should be harder as there are fewer training instances, but there are no such languages in the Mewsli-9 set.

TR2016^hard and TAC-KBP2015. We compared our mGENRE against cross-lingual systems (Tsai and Roth, 2016a; Sil et al., 2018; Upadhyay et al., 2018; Zhou et al., 2019) and Model F+ by Botha et al. (2020) in Table 2. Differently from Mewsli-9, the base mGENRE model does not outperform previous systems. Using marginalization brings minimal improvements. Instead, using candidates gives +11% absolute accuracy on TAC-KBP2015 and +5% on TR2016^hard, effectively making mGENRE state-of-the-art in both datasets. The role of candidates is very evident on TAC-KBP2015, where there is not much of a difference for Spanish but a +22% absolute accuracy gain for Chinese. TAC-KBP2015 comes with a training set and we used it to expand the candidate set. Additionally, we included all simplified Chinese versions of the entity names because we used traditional Chinese in pre-training and TAC-KBP2015 uses simplified Chinese.

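The marginalization of Equation 2, restricted to the top-k beam search outputs as described in Section 3.3, can be sketched as follows. This is an illustrative sketch only: the function name `marginalize_top_k`, the lookup-table layout, and the log-probabilities are hypothetical (the numbers are not model outputs); only the two Wikidata IDs are taken from the Figure 1 example.

```python
import math
from collections import defaultdict


def marginalize_top_k(beam, name_to_entity):
    """Sum the probabilities of all beam hypotheses that map to the same entity.

    beam: list of (name, lang, log_prob) tuples returned by beam search.
    name_to_entity: N-to-1 lookup table from (name, lang) to an entity ID.
    Returns a dict: entity ID -> log of the summed probability.
    """
    grouped = defaultdict(list)
    for name, lang, log_prob in beam:
        entity = name_to_entity.get((name, lang))
        if entity is not None:  # skip hypotheses without a KB entry
            grouped[entity].append(log_prob)

    scores = {}
    for entity, log_probs in grouped.items():
        # Numerically stable log-sum-exp over the languages of one entity.
        top = max(log_probs)
        scores[entity] = top + math.log(sum(math.exp(lp - top) for lp in log_probs))
    return scores


# Hypothetical lookup table and beam (assumed values, for illustration only).
name_to_entity = {
    ("Global Positioning System", "de"): "Q18822",
    ("Sistema de posicionamiento global", "es"): "Q18822",
    ("Globales Navigationssatellitensystem", "de"): "Q179435",
}
beam = [
    ("Global Positioning System", "de", -0.9),
    ("Globales Navigationssatellitensystem", "de", -1.2),
    ("Sistema de posicionamiento global", "es", -2.0),
]

scores = marginalize_top_k(beam, name_to_entity)
best = max(scores, key=scores.get)
print(best, scores)
```

Because only the k hypotheses in the beam are summed, the cost at inference time stays proportional to the beam size rather than to the number of languages.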
             Botha et al. (2020)        Ours
Language     Alias Table   Model F+     mGENRE   + cand.   + marg.   + cand. + marg.
ar           89.0          92.0         94.7     94.8      95.3      95.4
de           86.0          92.0         91.5     91.8      91.8      92.0
en           79.0          87.0         86.7     87.1      87.0      87.2
es           82.0          89.0         90.0     90.1      90.1      90.1
fa           87.0          92.0         94.6     94.6      94.2      94.4
ja           82.0          88.0         89.9     91.1      90.2      91.4
sr           87.0          93.0         94.9     94.4      95.0      94.5
ta           79.0          88.0         92.9     93.3      93.1      93.8
tr           80.0          88.0         90.7     91.4      90.9      91.5
micro-avg    83.0          89.0         90.2     90.5      90.4      90.6
macro-avg    83.0          90.0         91.8     92.1      92.0      92.3

Table 1: Accuracy on Mewsli-9 dataset. We report results of mGENRE (trained with ‘title+lang’) with and without top-k candidates from the table as well as with and without marginalization.
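The error reductions quoted in Section 5.1 can be recomputed from the average rows of Table 1. The sketch below assumes that "error reduction" means the relative decrease of the error rate (100 minus accuracy) with respect to Model F+; the helper name `error_reduction` is ours.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative reduction (in %) of the error rate, where error = 100 - accuracy."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# (micro-avg, macro-avg) accuracies copied from Table 1.
model_f_plus = (89.0, 90.0)
ours = {
    "mGENRE": (90.2, 91.8),
    "mGENRE + cand.": (90.5, 92.1),
    "mGENRE + cand. + marg.": (90.6, 92.3),
}

for name, (micro, macro) in ours.items():
    micro_red = error_reduction(model_f_plus[0], micro)
    macro_red = error_reduction(model_f_plus[1], macro)
    print(f"{name}: micro {micro_red:.1f}%, macro {macro_red:.1f}%")
```

Under this reading, the three rows give 10.9%/18.0%, 13.6%/21.0% and 14.5%/23.0%, which matches the figures reported in the text.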

                           TAC-KBP2015               TR2016^hard
Method                     es     zh     macro-avg   de     es     fr     it     macro-avg
Tsai and Roth (2016a)      82.4   85.1   83.8        53.3   54.5   47.5   48.3   50.9
Sil et al. (2018)*         83.9   85.9   84.9        -      -      -      -      -
Upadhyay et al. (2018)     84.4   86.0   85.2        55.2   56.8   51.0   52.3   53.8
Zhou et al. (2019)         82.9   85.5   84.2        -      -      -      -      -
Botha et al. (2020)        -      -      -           62.0   58.0   54.0   56.0   57.5
mGENRE                     86.3   64.6   75.5        56.3   57.1   50.0   51.0   53.6
mGENRE + marg.             86.9   65.1   76.0        56.2   56.9   49.7   51.1   53.5
mGENRE + cand.             86.5   86.6   86.5        61.8   61.0   54.3   56.9   58.5
mGENRE + cand. + marg.     86.7   88.4   87.6        61.5   60.6   54.3   56.6   58.2

Table 2: Accuracy on TAC-KBP2015 Entity Linking dataset (only datapoints linked to FreeBase) and TR2016hard of mGENRE (trained with ‘title+lang’) with and without top-k candidates from the table as well as with and without marginalization. *as reported by Upadhyay et al.(2018). and TAC-KBP2015 uses simplified Chinese. Many Botha et al.(2020) mGENRE mentions in TAC-KBP2015 were not observed in Bin Support Acc. Support Acc. Wikipedia, so the performance gain mostly comes [0, 1) 3,198 8.0 1,244 22.1 from this but including the simplified and alterna- [1, 10) 6,564 58.0 5,777 47.3 tive Chinese names also played an important role [10, 100) 32,371 80.0 28,406 77.3 8 [100, 1k) 66,232 90.0 72,414 89.9 (+5% comes from this alone). [1k, 10k) 78,519 93.0 84,790 93.2 [10k, +) 102,203 94.0 96,456 96.3 5.2 Analysis micro-avg 289,087 89.0 289,087 90.6 By entity frequency Table3 shows a break- macro-avg - 70.0 - 71.0 down of Mewsli-9 accuracy by entity frequency in training for Botha et al.’s (2020) Model F+ and Table 3: Results on the Mewsli-9 dataset, by entity fre- mGENRE. Interestingly, our mGENRE has much quency in training. The support is slightly different be- cause training data differ (i.e., the set of languages from higher accuracy (22% vs 8%) on unseen entities Wikipedia is different). (i.e., the [0,1) bin). This is because our formula- tion can take advantage of copying names from the source, translating them or normalizing them. For these cases. On very rare entities (i.e., the [1,10) example, an unseen person name should likely be bin) our model performs worse than Model F+. linked to the entity with the same name. This is Note that Model F+ was trained specifically to a powerful bias that gives the model advantage in tackle those cases (e.g., with hard negatives and 8We speculate that including different version (e.g., dif- frequency-based mini-batches) whereas our model ferent dialects for Arabic) of entity names could improve performance in all languages. 
Since this is not in the scope of was not. We argue that similar strategies can be ap- this paper, we will leave it for future work. plied to mGENRE to improve performance on rare mGENRE log2(support) ar de en es fa ja sr ta tr

5.0 0.00 0.80 4.18 42.36 0.00 1.38 39.63 5.46 6.19 1.0 cs 4.5 0.9 fr 0.09 0.50 4.26 91.19 0.00 0.42 2.06 0.79 0.69 0.8 4.0 ) t r

0.7 o 3.5 it 0.05 1.11 5.49 83.38 0.28 0.13 2.19 0.53 6.83 p

0.6 p

3.0 u

0.5 s pl 0.00 2.45 8.81 60.43 0.00 2.08 15.58 8.29 2.35 (

2.5 2

Accuracy 0.4 g

o pt 0.19 0.98 1.81 94.04 0.00 0.08 1.66 1.13 0.11 0.3 l 2.0 0.2 1.5 ru 0.02 0.04 0.44 4.78 0.00 1.79 92.74 0.12 0.06 0.1 0.0 1.0 0.47 0.00 1.16 1.42 0.11 94.89 1.05 0.42 0.47 0 20 21 22 23 24 25 26 27 28 29 210 211 212 zh Number of candidates (a) Lang+Name.

Figure 3: Results of mGENRE on Mewsli-9 by the ar de en es fa ja sr ta tr number of retrieved candidates. cs 8.75 12.44 10.94 8.86 5.44 21.83 8.44 19.69 3.60 fr 7.93 9.21 18.83 9.16 6.96 22.09 5.75 17.44 2.63 Lang. Can. N+L L+N L+NM it 8.64 10.40 14.11 9.71 5.16 33.80 4.51 11.03 2.65 cs 36.3 30.2 34.0 69.7 pl 7.88 12.46 24.70 7.84 5.22 19.59 6.46 13.06 2.78 fr 62.9 57.0 53.3 73.4 it 44.8 43.7 42.9 56.8 pt 8.50 7.07 15.53 10.89 3.79 19.32 7.27 23.09 4.54 pl 31.9 21.2 25.6 68.8 ru 7.66 6.91 14.81 7.56 5.15 26.09 7.05 20.62 4.15 pt 60.8 61.7 59.5 76.2 ru 34.9 32.4 35.1 65.8 zh 10.06 6.85 17.70 5.71 4.93 32.26 4.38 15.39 2.72 zh 35.1 41.1 44.0 52.8 (b) Lang+NameM. micro-avg 41.6 38.3 39.5 65.9 macro-avg 43.8 41.0 42.1 66.2 Figure 4: Distribution of languages on the top-1 predic- tion of two mGENRE models on Wikinews-7 (unseen Table 4: mGENRE on the Wikinew-7 unseen lan- languages). Y-axis indicates the source (unseen at train- guages. Models are trained only on the Mewsli-9 lan- ing time) language where X-axis indicates the language guages (1M datapoints per language). ‘Can.’ is canon- (seen at training time) of the first prediction. ical, ‘N+L’ is ‘name+language‘ and ‘L+N’ is the oppo- site. M indicates marginalization. set of languages in train and test are disjoint). This entities, and we leave that to future work. The per- zero-shot setting implies that no mention table is formance gap between Model F+ and mGENRE available during inference; hence we do not con- on entities that appear more than 100 times in the sider candidates for test mentions. We train our training set is negligible. models on the nine Mewsli-9 languages and com- pare all strategies exposed in Section3. To make By candidate frequency We additionally mea- our ablation study feasible, we restrict the train- sure the accuracy on Mewsli-9 by the number of ing data to the first 1 million hyperlinks from candidates retrieved from the alias table (details in Wikipedia abstracts. 
Results are reported in Ta- Figure3). When there are no candidates ( ∼12k dat- ble4. apoints that is ∼4% Mewsli-9) an alias table would automatically fail, but mGENRE uses the entire Using our novel marginalization strategy that ag- KB as candidates and has 63.9% accuracy. For gregates (both at training and inference time) over datapoints with few candidates (e.g., less than 100), all seen languages to perform the linking brings an we could use mGENRE as a ranker and score all improvement of over 50% with respect to consid- of the options without relying on constrained beam ering a single language. To deeper investigate the search. However, this approach would be compu- behaviour of the model in this setting, we compute tationally infeasible when there are no candidates the probability mass distribution over languages (i.e., we use all the KB as candidates) or too many seen at training time for the first prediction (re- candidates (e.g., thousands). Constrained BS al- ported in Figure4). When marginalization is en- lows us to efficiently explore the space of entity abled (Figure 4b) the distribution is more spread names, whatever the number of candidates. across languages since the model is trained to use all of them. Hence the model can exploit connec- Unseen Languages We use our Wikinews-7 tions between an unseen language and all seen lan- dataset to evaluate mGENRE capabilities to deal guages for the linking process, which drastically with languages not seen during training (i.e., the increases the accuracy. Input [ES] . . . Michaëlle Jean, gobernadora general de Canadá, ha emitido el miércoles (13) un comunicado acerca del tiroteo ocurrido en el [START] Dawson College [END] de Montreal. . . . Translation [EN] . . . Michaëlle Jean, Governor General of , issued a statement on Wednesday (13) about the shooting that occurred at [START] Dawson College [END] in Montreal. . . . 
Prediction: ‘Collège Dawson » fr’: College in Montreal (Q2983587)
Outcome: Correct. mGENRE copies and normalizes the college name even though it does not have an identifier in the source language (i.e., it predicts in French but the source is Spanish).

Input [DE]: . . . Etwa 47 Menschen sind bei den Protesten festgenommen worden, darunter Chas Booth, Mitglied der [START] schottischen Grünen [END] mit Sitz im Stadtrat von Edinburgh. . . .
Translation [EN]: . . . Around 47 people were arrested during the protests, including Chas Booth, a member of the [START] Scottish Greens [END] on the Edinburgh City Council. . . .
Prediction: ‘Scottish Green Party » de’: Scottish Green Party (Q1256956)
Outcome: Correct. Even though the party is referred to by its German alias, mGENRE predicts the identifier with its English name, since the ground-truth German Wikipedia page has the English name.

Input [TR]: . . . Kâinat Güzeli yine Venezuela’dan [START] 2009 yılı Kâinat Güzellik Yarışması [END] 83 ülkenin temsilcisiyle Bahamalar’da yapıldı. . . .
Translation [EN]: . . . is again from Venezuela [START] The Universe Beauty Contest of 2009 [END] was held in Bahamas with the representatives of 83 countries. . . .
Prediction: ‘2009 Eurovision Çocuk Şarkı Yarışması » tr’: Junior Eurovision Song Contest 2009 (Q205038)
Annotation: Miss Universe 2009 (Q756701)
Outcome: Wrong. The model is conditioned early during beam search to start with ‘2009’. Thus, it does not effectively search sequences where the year is at the end, and it misses the ground-truth answer.

Input [SR]: . . . [START] Марко Стојановић [END], председник Светске организације пантомимичара, каже: . . .
Translation [EN]: . . . [START] Marko Stojanović [END], President of the World Mime Organization, says: . . .
Prediction: ‘Марко Стојановић » sr’: Marko Stojanović (lawyer) (Q12754975)
Annotation: Marko Stojanović (actor) (Q16099367)
Outcome: Wrong. From the context alone, lawyer is potentially more appropriate than actor. This is the risk of not considering the full entity description (which might say that Marko is an actor and also the President of the World Mime Organization). Even if copying is an effective strategy on average, it does not always succeed (using the alias table on this example leads to a correct prediction).

Table 5: Examples of correct and wrong predictions of our mGENRE model on selected samples from Mewsli-9. For both correct and wrong predictions we highlight some specific behaviour of our model.
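The constrained decoding discussed in the analysis, and the early-commitment failure in the Turkish example of Table 5, can be illustrated with a small sketch. It assumes entity names are stored in a prefix trie over tokens and that the search may only extend a hypothesis with tokens the trie allows. The class `Trie`, the function `constrained_beam_search`, the end-of-sequence symbol, and all toy probabilities below are hypothetical and chosen only for illustration; this is not the implementation used for mGENRE.

```python
import math

EOS = "</s>"  # hypothetical end-of-sequence symbol


class Trie:
    """Prefix tree over token sequences (here: tokenized entity names)."""

    def __init__(self, sequences=()):
        self.root = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq):
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Tokens that can follow `prefix` so that it stays a valid name prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


def constrained_beam_search(trie, log_prob_fn, beam_size=2, max_steps=32):
    """Beam search that only expands tokens allowed by the trie."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            for tok in trie.allowed_next(prefix):
                hyp = (prefix + (tok,), score + log_prob_fn(prefix, tok))
                (finished if tok == EOS else candidates).append(hyp)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return sorted(finished, key=lambda b: b[1], reverse=True)


# Toy name set and made-up token probabilities (assumptions for illustration).
names = [("2009", "Eurovision", EOS), ("Miss", "Universe", "2009", EOS)]
trie = Trie(names)
toy = {
    ((), "2009"): math.log(0.6),
    ((), "Miss"): math.log(0.4),
    (("2009",), "Eurovision"): math.log(0.5),
}


def toy_log_prob(prefix, tok):
    # Unlisted continuations are treated as certain (log-probability 0).
    return toy.get((prefix, tok), 0.0)


narrow = constrained_beam_search(trie, toy_log_prob, beam_size=1)
wide = constrained_beam_search(trie, toy_log_prob, beam_size=2)
```

With these toy numbers, a beam of size 1 commits to ‘2009’ at the first step and can only return the name that starts with the year, while a beam of size 2 keeps the alternative prefix alive and ranks the name with the year at the end first. The search cost depends on the beam size and the name length, not on the number of names stored in the trie.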

6 Related Work

The most related works to ours are De Cao et al. (2021), who proposed to use an autoregressive language model for monolingual EL, and Botha et al. (2020), who proposed to extend the cross-lingual EL task to multilingual EL with a language-agnostic KB. We provide an outline of the GENRE model proposed by De Cao et al. (2021) in Sections 2.2 and 2.3. GENRE was applied not only to EL but also to joint mention detection and entity linking (still with an autoregressive formulation) as well as to page-level document retrieval across multiple Knowledge Intensive Language Tasks (KILT; Petroni et al., 2021), i.e., fact-checking, open-domain question answering, slot filling, and dialog. Botha et al.’s (2020) Model F+ is a bi-encoder model: it is based on two BERT-based (Devlin et al., 2019) encoders that output vector representations for context and entities. Similar to Wu et al. (2020), they rank entities with a dot-product between these representations. Model F+ uses the description of entities as input to the entity encoder, and title, document and mention (separated with special tokens) as inputs to the context encoder. Bi-encoder solutions may be memory inefficient since they require keeping large matrices of embeddings in memory, although memory-efficient dense retrieval has recently received attention (Izacard et al., 2020; Min et al., 2021; Lewis et al., 2021).

Another widely explored line of work is Cross-Language Entity Linking (XEL; McNamee et al., 2011b; Cheng and Roth, 2013). XEL considers contexts in different languages while mapping mentions to entities in a monolingual KB (e.g., English Wikipedia). Tsai and Roth (2016b) used alignments between languages to train multilingual entity embeddings. They used candidate selection and then re-ranked the candidates with an SVM using these embeddings as well as a set of features (based on the multilingual title, mention, and context tokens). Sil et al. (2018) explored the use of more sophisticated neural models for XEL, as did Upadhyay et al. (2018), who jointly modeled type information to boost performance. Zhou et al. (2019) propose improvements to both entity candidate generation and disambiguation to make better use of the limited data in low-resource scenarios. Note that in this work we focus on multilingual EL, not cross-lingual. XEL is limited to a monolingual KB (usually English), whereas MEL is more general since it can link to entities that might not necessarily be represented in the target monolingual KB but in any of the available languages.

7 Conclusion

In this work, we propose an autoregressive formulation of the multilingual entity linking problem. For a mention in a given language, our solution generates entity names left-to-right and token-by-token. The resulting system, mGENRE, maintains entity names in as many languages as possible to exploit language connections and interactions between source mention context and target entity name. The constrained beam search decoding strategy enables fast search within a large set of entity names (e.g., the whole KB in multiple languages) with no need for large-scale dense indices. We additionally design a novel objective function that marginalizes over all available languages to perform a prediction. We empirically show that this strategy is highly effective in dealing with languages for which no training data is available (i.e., 50% improvements for languages never seen during training). Overall, our experiments show that mGENRE achieves new state-of-the-art performance on three popular multilingual entity linking datasets.

Acknowledgments

The authors thank Patrick Lewis and Aleksandra Piktus for helpful discussions and technical support.

References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations.

Giuseppe Attardi. 2015. Wikiextractor. https://github.com/attardi/wikiextractor.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. International Conference on Learning Representations.

Jan A. Botha, Zifei Shan, and Daniel Gillick. 2020. Entity Linking in 100 Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7833–7845, Online. Association for Computational Linguistics.

Razvan Bunescu and Marius Paşca. 2006. Using encyclopedic knowledge for named entity disambiguation. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.

Henry Y. Chen, Ethan Zhou, and Jinho D. Choi. 2017. Robust coreference resolution and entity linking on dialogues: Character identification on TV show transcripts. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 216–225, Vancouver, Canada. Association for Computational Linguistics.

Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic. Association for Computational Linguistics.

Amanda Cercas Curry, Ioannis Papaioannou, Alessandro Suglia, Shubham Agarwal, Igor Shalyminov, Xinnuo Xu, Ondřej Dušek, Arash Eshghi, Ioannis Konstas, Verena Rieser, et al. 2018. Alana v2: Entertaining and informative open-domain social dialogue using ontologies and entity linking. Alexa Prize Proceedings.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2019. Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2306–2317, Minneapolis, Minnesota. Association for Computational Linguistics.

Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive entity retrieval. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 277–285, Beijing, China. Coling 2010 Organizing Committee.

Octavian-Eugen Ganea and Thomas Hofmann. 2017. Deep joint entity disambiguation with local neural attention. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2619–2629, Copenhagen, Denmark. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, and Edouard Grave. 2020. A memory efficient baseline for open domain question answering. arXiv preprint arXiv:2012.15156.

Heng Ji, Joel Nothman, Ben Hachey, and Radu Florian. 2015. Overview of tac-kbp2015 tri-lingual entity discovery and linking. In TAC.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Robert Leaman and Graciela Gonzalez. 2008. Banner: An executable survey of advances in biomedical named entity recognition. In Pacific Symposium on Biocomputing 2008 (PSB 2008), pages 652–663.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. 2021. Paq: 65 million probably-asked questions and what you can do with them. arXiv preprint arXiv:2102.07033.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas Oard, and David Doermann. 2011a. Cross-language entity linking. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 255–263, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Paul McNamee, James Mayfield, Dawn Lawrie, Douglas W Oard, and David Doermann. 2011b. Cross-language entity linking. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 255–263.

Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Edouard Grave, Ikuya Yamada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao, Barlas Oguz, Xilun Chen, Vladimir Karpukhin, Stan Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen tau Yih. 2021. NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned. arXiv preprint arXiv:2101.00133.

Yixin Nie, Songhe Wang, and Mohit Bansal. 2019. Revealing the importance of semantic retrieval for machine reading at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2553–2566, Hong Kong, China. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, et al. 2021. KILT: a Benchmark for Knowledge Intensive Language Tasks. To appear at Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958.

Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1017–1024.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826.

Chen-Tse Tsai and Dan Roth. 2016a. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598, San Diego, California. Association for Computational Linguistics.

Chen-Tse Tsai and Dan Roth. 2016b. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018. Joint multilingual supervision for cross-lingual entity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2486–2495, Brussels, Belgium. Association for Computational Linguistics.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. Association for Computational Linguistics.

Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France. European Language Resources Association.

Jason D. Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 665–677, Vancouver, Canada. Association for Computational Linguistics.

Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.

Jin G Zheng, Daniel Howsmon, Boliang Zhang, Juergen Hahn, Deborah McGuinness, James Hendler, and Heng Ji. 2015. Entity linking for biomedical literature. BMC medical informatics and decision making, 15(1):1–9.

Shuyan Zhou, Shruti Rijhwani, and Graham Neubig. 2019. Towards zero-resource cross-lingual entity linking. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 243–252, Hong Kong, China. Association for Computational Linguistics.

A Experimental Details

A.1 Pre-training

We used a pre-trained mBART (Lewis et al., 2020; Liu et al., 2020) model on 125 languages; see Figure 5 for a visual overview of the overlap between these languages, the languages available on Wikipedia, and the languages used by Botha et al. (2020). mBART has 24 layers with a hidden size of 1,024 and a total of 406M parameters. We pre-trained on an extended version of the cc100 (Conneau et al., 2020; Wenzek et al., 2020) corpora, available at http://data.statmt.org/cc-100 (footnote 9), where we increased the number of common crawl snapshots for low-resource languages from 12 to 60. The dataset has ∼5TB of text. We pre-trained for 500k steps with max 1,024 tokens per GPU on a variable batch size (∼3000). Figure 5 shows a Venn diagram of the overlap of languages used during pre-training and fine-tuning.

[Figure 5 image: Venn diagram with three sets labeled "Botha et al. (2020)", "Pre-trained", and "Wikipedia"; region counts shown in the diagram: 20, 28, 75, 30, 165.]

Figure 5: Venn diagram of the overlap of languages used during multilingual language modeling (pre-training), the languages available on Wikipedia (as of 2019-10-01), and the languages used by Botha et al. (2020). After pre-training on 125 languages, we fine-tune on the 105 that overlap with the ones available in Wikipedia.

A.2 Data for supervision

Wikidata Wikidata contains tens of millions of items, but most of them are scholarly articles or correspond to help and template pages in Wikipedia (i.e., not entities we want to retain; see footnote 10). Following Botha et al. (2020), we only keep Wikidata items that have an associated Wikipedia page in at least one language, independent of the languages we actually model. Moreover, we filter out items that are a subclass (P279) or instance of (P31) some Wikimedia organizational entities (e.g., help and template pages; see Table 6).

Wikidata ID   Label
Q4167836      category
Q24046192     category stub
Q20010800     user category
Q11266439     template
Q11753321     navigational template
Q19842659     user template
Q21528878     redirect page
Q17362920     duplicated page
Q14204246     project page
Q21025364     project page
Q17442446     internal item
Q26267864     KML file
Q4663903      portal
Q15184295     module

Table 6: Wikidata identifiers used for filtering out items, from Botha et al. (2020).

Wikipedia We aligned each Wikipedia hyperlink to its respective Wikidata item using a custom script. Note that each Wikipedia page maps to a Wikidata item. For the alignment we use i) a direct reference when the hyperlink points directly to a Wikipedia page, ii) a redirection table if the hyperlink points to an alias page, and iii) a Wikidata search among labels and aliases of items if the previous two alignment strategies failed. The previous two alignment strategies might fail when i) authors made a mistake by linking to a non-existing page, ii) authors linked to a non-existing page on purpose hoping it will be created in the future, or iii) the original title of a page changed over time and no redirection was added to accommodate old hyperlinks. This procedure successfully aligns 91% of the hyperlinks. We only keep unambiguous alignments since, when using Wikidata search (i.e., the third alignment strategy), the mapping could be ambiguous (e.g., multiple items may share the same labels and aliases).

In Table 10 we report some statistics of the training data extracted from Wikipedia. We use a standard Wikipedia extractor, wikiextractor (footnote 11), by Attardi (2015) and a redirect extractor (footnote 12). We use both Wikipedia and Wikidata dumps from 2019-10-01.

Footnotes:
9 http://data.statmt.org/cc-100
10 https://www.wikidata.org/wiki/Wikidata:Statistics
11 https://github.com/attardi/wikiextractor
12 https://code.google.com/archive/p/wikipedia-redirect

A.3 Data for test

We use Wikinews (from 2019-10-01) to construct our unseen Wikinews-7 dataset. In Table 7 we report some statistics of our new dataset.

Lang.   Docs    Mentions   Distinct entities   Entities ∉ EnWiki
ru      1,625   20,698     8,832               1,838
it        907    8,931     4,857                 911
pl      1,162    5,957     3,727                 547
fr        978    7,000     4,093                 349
cs        454    2,902     1,974                 200
pt        666    2,653     1,313                 113
zh        395    2,057     1,274                 110
Total   6,187   50,198     26,070              4,068

Table 7: Corpus statistics for the Wikinews unseen languages we use as an evaluation set.

A.4 Training

We implemented, trained, and evaluated our model using the fairseq library (Ott et al., 2019). We trained mGENRE using Adam (Kingma and Ba, 2014) with a learning rate of $10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, and with a linear warm-up for 5,000 steps followed by linear decay for a maximum of 2M steps. The objective is sequence-to-sequence categorical cross-entropy loss with 0.1 of label smoothing and 0.01 of weight decay. We used a dropout probability of 0.1 and attention dropout of 0.1. We used max 3,072 tokens per GPU and a variable batch size (∼12,500). Training was done on 384 GPUs (Tesla V100 with 32GB of memory) and it completed in ∼72h for a total of ∼27,648 GPU hours or ∼1,152 GPU days. Since TAC-KBP2015 contains noisy text (e.g., XML/HTML tags), we further fine-tune mGENRE for 2k steps on its training set when testing on it.

A.5 Inference

At test time, we use Constrained Beam Search with 10 beams, a length penalty of 1, and a maximum of 32 decoding steps. We restrict the input sequence to be at most 128 tokens, cutting the left, right, or both parts of the context around a mention. When employing marginalization, we normalize the log-probabilities by sequence length using $\log p(y \mid x) / L^{\alpha}$, where $\alpha = 0.5$ was tuned on the development set.

B Additional results

B.1 Analysis

By mention frequency We show a breakdown of the accuracy of mGENRE on Mewsli-9 by mention frequency in Table 8. The accuracy on unseen mentions is 66.7% and increases up to 93.6% for mentions seen more than 10k times. For extremely common mentions (i.e., seen more than 1M times) the accuracy drops to 73.2%. These mentions correspond to entities that are harder to disambiguate (e.g., a country name that appears 3.2M times can be linked to the country as well as to any sports team when the context refers to sports).

Bin            Support   Acc.
[0, 1)         14,741    66.7
[1, 10)        15,279    88.1
[10, 100)      43,169    92.0
[100, 1k)      75,927    91.7
[1k, 10k)      80,329    91.5
[10k, 100k)    47,944    93.6
[100k, 1M)     11,460    93.0
[1M, 10M)         238    73.2

Table 8: mGENRE results on the Mewsli-9 dataset by mention frequency in training.

Unseen Languages Even though marginalization and canonical representation are the top-two systems in the unseen languages setting, they are not on seen languages. In Table 9 we report the results of all these strategies also on the seen languages (Mewsli-9 test set). Complementary to Figure 4, we also report the probability mass distribution over languages seen for Mewsli-9 (Figure 6).
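The length normalization from A.5 can be combined with marginalization over languages as in the minimal sketch below. The paper states only the normalization $\log p(y \mid x) / L^{\alpha}$ with $\alpha = 0.5$; aggregating the normalized scores of one entity across languages with a log-sum-exp is an assumption made here for illustration, as are the function names, the tuple layout, and the toy scores.

```python
import math
from collections import defaultdict


def length_normalized(log_prob, length, alpha=0.5):
    """Sequence log-probability divided by L**alpha, as stated in A.5."""
    return log_prob / (length ** alpha)


def marginalize(hypotheses, alpha=0.5):
    """Aggregate per-language name scores into one score per entity.

    `hypotheses` is an iterable of (entity_id, language, log_prob, length)
    tuples (a hypothetical layout). Combining the normalized scores with a
    log-sum-exp is an assumption of this sketch.
    """
    per_entity = defaultdict(list)
    for entity_id, _language, log_prob, length in hypotheses:
        per_entity[entity_id].append(length_normalized(log_prob, length, alpha))
    scores = {}
    for entity_id, values in per_entity.items():
        top = max(values)  # subtract the maximum for numerical stability
        scores[entity_id] = top + math.log(sum(math.exp(v - top) for v in values))
    return scores


# Toy hypotheses: Q1 has names in two languages, Q2 in only one.
hyps = [
    ("Q1", "en", -2.0, 4),
    ("Q1", "fr", -2.0, 4),
    ("Q2", "de", -1.8, 4),
]
scores = marginalize(hyps)
best = max(scores, key=scores.get)
```

In this toy case Q2 has the best single name, but Q1 wins once the scores of its English and French names are aggregated, which is the kind of effect marginalization is meant to capture.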

ar 99.99 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 M Lang. Can. N+L L+N L+N de 0.02 99.38 0.54 0.04 0.00 0.01 0.00 0.00 0.00 ar 90.5 92.8 92.9 89.2 en 0.02 0.07 99.85 0.04 0.00 0.01 0.00 0.00 0.00 de 84.6 86.4 86.4 85.3 es 0.06 0.04 0.79 99.08 0.00 0.02 0.01 0.00 0.00 en 77.6 79.3 79.2 76.5 es 83.4 85.5 85.2 83.4 fa 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 fa 91.6 90.7 91.8 88.2 ja 0.00 0.00 0.01 0.00 0.00 99.98 0.00 0.00 0.00 ja 81.3 82.3 82.8 81.3 sr 91.5 92.7 92.9 92.5 sr 0.01 0.00 0.10 0.01 0.00 0.02 99.86 0.00 0.00 ta 92.8 91.8 91.9 91.3 ta 0.00 0.00 0.04 0.00 0.00 0.00 0.00 99.96 0.00 tr 88.0 87.7 87.3 86.0 tr 0.03 0.02 0.53 0.03 0.02 0.00 0.02 0.05 99.29 micro-avg 83.20 84.77 84.80 83.05 macro-avg 86.82 87.68 87.82 85.97 (a) Lang+Name.

+ candidates ar de en es fa ja sr ta tr ar 94.4 94.5 94.7 93.0 ar 27.72 3.63 4.14 7.21 4.46 17.93 4.09 28.66 2.15 de 89.4 89.8 89.8 89.3 de 7.06 26.76 7.48 7.38 6.48 18.34 5.52 18.87 2.12 en 83.6 83.8 83.9 82.4 9.80 7.32 35.24 6.76 5.65 16.77 3.07 12.70 2.68 es 87.7 88.2 88.3 87.3 en fa 93.6 93.3 93.6 93.3 es 9.00 6.39 10.73 21.99 6.54 18.12 4.63 19.59 3.00 ja 87.9 88.0 88.4 87.9 fa 10.00 6.64 7.10 5.23 23.27 18.97 6.17 20.00 2.62 sr 93.1 93.4 93.5 93.2 ta 93.0 92.2 92.5 92.5 ja 7.22 7.95 8.96 4.70 6.80 46.85 2.70 11.33 3.47 tr 91.1 90.4 89.9 89.1 sr 3.66 4.04 4.41 2.57 2.92 32.59 13.23 34.55 2.03

micro-avg 87.95 88.22 88.32 87.43 ta 6.17 3.25 4.38 3.83 10.55 20.97 7.56 40.77 2.53 macro-avg 90.42 90.41 90.51 89.78 tr 6.63 6.51 8.30 6.25 6.80 16.80 4.26 25.75 18.70

Table 9: mGENRE on the Mewsli-9. Models are (b) Lang+NameM. trained only on the Mewsli-9 languages (1M data- points per language). ‘Can.’ is canonical, ‘N+L’ is Figure 6: Distribution of languages on the top-1 predic- ‘name+language‘ and ‘L+N’ is the opposite. M indi- tion of two mGENRE models on Mewsli-9. Y-axis in- cates marginalization. dicates the source language where X-axis indicates the language of the top-1 prediction. The models trained on those languages. Language Pages Names Hyperlinks Language Pages Names Hyperlinks Afrikaans (af) 85,456 110,705 1,089,581 Kurdish (ku) 26,963 42,134 244,779 Albanian (sq) 86,234 112,112 978,394 Kyrgyz (ky) 80,985 89,486 271,335 Amharic (am) 15,280 19,905 75,575 Lao (lo) 4,414 5,761 16,173 Arabic (ar) 971,861 1,883,080 10,308,074 Latin (la) 132,410 186,829 1,986,307 Armenian (hy) 260,395 582,941 3,082,000 Latvian (lv) 99,062 226,570 1,522,814 Assamese (as) 6,119 19,041 61,209 Lingala (ln) 3,262 4,134 15,518 Azerbaijani (az) 152,033 189,793 1,562,968 Lithuanian (lt) 197,215 282,077 3,512,764 Bambara (bm) 747 916 2,191 Macedonian (mk) 103,960 152,384 2,035,348 Basque (eu) 337,916 430,456 4,305,648 Malagasy (mg) 92,500 142,156 857,000 Belarusian (be) 181,030 415,152 2,459,794 Malay (ms) 331,403 388,110 3,190,700 Bengali (bn) 76,121 257,730 960,484 Malayalam (ml) 67,475 152,809 712,869 Bosnian (bs) 82,164 184,148 1,916,515 Marathi (mr) 55,601 100,904 355,536 Breton (br) 67,388 88,284 1,255,295 Mongolian (mn) 21,772 28,455 208,847 Bulgarian (bg) 257,962 376,934 4,655,641 Nepali (ne) 34,107 39,904 151,958 Burmese (my) 48,683 55,700 98,992 Norwegian (no) 521,665 816,772 10,234,086 Catalan (ca) 630,340 1,024,519 14,790,419 Oriya (or) 15,532 30,431 79,261 Chinese (zh) 1,085,180 1,951,612 17,262,417 Oromo (om) 1,063 1,317 7,153 Croatian (hr) 193,705 250,008 4,223,179 Panjabi (pa) 33,934 46,720 145,204 Czech (cs) 439,249 719,643 12,173,376 Pashto (ps) 11,773 16,878 46,987 Danish (da) 255,957 405,745 5,621,483 Persian (fa) 716,604 
2,139,255 5,567,774 Dutch (nl) 1,986,801 2,714,649 25,002,389 Polish (pl) 1,370,672 1,812,412 25,817,929 English (en) 6,071,492 14,751,661 134,477,329 Portuguese (pt) 1,053,673 1,858,821 20,625,904 Esperanto (eo) 270,871 447,159 5,570,306 Quechua (qu) 21,670 41,230 247,508 Estonian (et) 201,505 342,215 4,700,888 Romanian (ro) 403,517 979,524 6,974,837 Finnish (fi) 470,896 737,165 8,390,037 Russian (ru) 1,585,051 3,592,042 35,783,391 French (fr) 2,160,840 3,718,185 59,006,932 Sanskrit (sa) 11,960 22,472 73,380 Frysk (fy) 42,893 72,490 1,206,432 Serbian (sr) 625,871 3,248,789 7,012,202 Fulah (ff) 306 421 912 Sindhi (sd) 14,616 18,556 33,990 Gaelic, (gd) 15,126 23,631 180,186 Sinhala (si) 20,363 29,794 90,866 Galician (gl) 159,849 229,561 4,709,070 Slovak (sk) 232,109 301,681 4,014,344 Ganda (lg) 2,376 2,668 2,476 Slovenian (sl) 166,997 238,706 3,754,135 Georgian (ka) 135,040 138,267 1,369,094 Somali (so) 6,716 9,595 53,132 German (de) 2,356,465 3,877,850 60,638,345 Spanish (es) 1,547,372 3,313,727 37,749,593 Greek (el) 170,541 251,692 3,310,875 Sundanese (su) 54,921 61,716 598,878 Guarani (gn) 3,755 5,589 89,593 Swahili (sw) 53,926 74,634 693,049 Gujarati (gu) 29,091 32,526 402,483 Swati (ss) 514 610 4,344 Haitian (ht) 59,350 63,279 677,064 Swedish (sv) 3,755,203 6,143,945 39,409,278 Hausa (ha) 4,143 5,025 19,929 Tagalog (tl) 79,036 181,951 562,526 Hebrew (he) 253,861 444,127 9,947,354 Tamil (ta) 129,591 168,718 1,110,037 Hindi (hi) 138,378 192,652 1,040,288 Telugu (te) 71,819 98,189 841,549 Hungarian (hu) 459,261 663,995 10,138,904 Thai (th) 139,522 299,433 2,190,249 Icelandic (is) 48,563 75,963 772,213 Tigrinya (ti) 307 390 696 Igbo (ig) 1,521 3,000 4,702 Tswana (tn) 827 894 4,896 Indonesian (id) 516,196 1,015,784 7,882,254 Turkish (tr) 338,865 593,365 5,657,757 Irish (ga) 51,824 61,336 435,135 Ukrainian (uk) 939,234 1,468,963 16,360,016 Italian (it) 1,571,189 2,450,009 39,382,886 Urdu (ur) 156,300 353,391 1,142,953 Japanese (ja) 1,173,978 1,877,660 45,957,053 
Uzbek (uz) 132,666 450,865 764,566 Javanese (jv) 57,422 75,792 718,589 Vietnamese (vi) 1,240,324 1,466,573 10,015,209 Kannada (kn) 25,986 33,880 227,731 Welsh (cy) 106,556 154,043 1,254,901 Kazakh (kk) 229,165 271,260 1,564,344 Wolof (wo) 1,503 1,969 7,257 Khmer (km) 9,838 12,349 73,950 Xhosa (xh) 1,370 1,610 14,163 Kongo (kg) 1,247 1,440 3,733 Yoruba (yo) 32,304 42,022 88,032 Korean (ko) 475,605 1,061,961 8,309,492 Others 12,613,082 12,613,082 - total 53,849,351 89,270,463 777,210,183

Table 10: Number of pages, entity names, and hyperlinks for the 105 languages used for mGENRE. There are more entity names than pages because we also include redirections. Hyperlink counts are reported after filtering out missed alignments to Wikidata and after augmenting when there is no name in the source language.

[Figure 7 plot: two panels, (a) sorted by mGENRE accuracy and (b) sorted by training set size. Each panel lists the 105 languages with the accuracy of mGENRE and of the alias table (axis from 0.2 to 1.0) together with log10(training size) (axis from 1 to 9).]

Figure 7: Accuracy of mGENRE and the alias table on the 105 languages in our Wikipedia validation set. We also report the log training set size for each language. See Table 11 for all precise values.

Language          Alias Table  mGENRE  Support
Afrikaans (af)    92.1         97.0    1,089,581
Albanian (sq)     87.4         94.1    978,394
Amharic (am)      85.6         94.2    75,575
Arabic (ar)       89.9         94.2    10,308,074
Armenian (hy)     89.2         94.3    3,082,000
Assamese (as)     85.8         94.6    61,209
Azerbaijani (az)  90.0         96.0    1,562,968
Bambara (bm)      80.6         89.8    2,191
Basque (eu)       94.0         97.6    4,305,648
Belarusian (be)   85.9         94.8    2,459,794
Bengali (bn)      79.7         90.4    960,484
Bosnian (bs)      86.5         93.7    1,916,515
Breton (br)       91.9         96.8    1,255,295
Bulgarian (bg)    88.8         95.4    4,655,641
Burmese (my)      86.4         93.7    98,992
Catalan (ca)      91.5         96.3    14,790,419
Chinese (zh)      88.0         93.5    17,262,417
Croatian (hr)     84.7         93.9    4,223,179
Czech (cs)        87.1         94.0    12,173,376
Danish (da)       90.6         95.5    5,621,483
Dutch (nl)        86.4         95.2    25,002,389
English (en)      84.6         92.6    134,477,329
Esperanto (eo)    89.7         95.5    5,570,306
Estonian (et)     91.2         97.7    4,700,888
Finnish (fi)      87.4         94.3    8,390,037
French (fr)       83.6         92.6    59,006,932
Frysk (fy)        91.3         95.2    1,206,432
Fulah (ff)        44.3         69.7    912
Gaelic, (gd)      90.9         94.6    180,186
Galician (gl)     90.9         95.9    4,709,070
Ganda (lg)        74.7         88.2    2,476
Georgian (ka)     87.9         95.7    1,369,094
German (de)       86.9         94.7    60,638,345
Greek (el)        84.1         91.9    3,310,875
Guarani (gn)      92.7         96.7    89,593
Gujarati (gu)     96.6         98.1    402,483
Haitian (ht)      95.6         90.7    677,064
Hausa (ha)        81.7         89.9    19,929
Hebrew (he)       90.6         93.5    9,947,354
Hindi (hi)        89.1         94.3    1,040,288
Hungarian (hu)    90.7         95.1    10,138,904
Icelandic (is)    89.8         94.9    772,213
Igbo (ig)         87.4         89.7    4,702
Indonesian (id)   91.8         94.8    7,882,254
Irish (ga)        90.1         96.3    435,135
Italian (it)      89.9         94.2    39,382,886
Japanese (ja)     92.1         96.2    45,957,053
Javanese (jv)     92.4         96.7    718,589
Kannada (kn)      87.8         93.5    227,731
Kazakh (kk)       91.0         97.3    1,564,344
Khmer (km)        85.1         90.5    73,950
Kongo (kg)        81.1         92.6    3,733
Korean (ko)       89.1         93.7    8,309,492
Kurdish (ku)      89.1         95.4    244,779
Kyrgyz (ky)       86.1         96.6    271,335
Lao (lo)          86.8         91.5    16,173
Latin (la)        88.0         95.8    1,986,307
Latvian (lv)      85.3         95.0    1,522,814
Lingala (ln)      84.4         93.7    15,518
Lithuanian (lt)   89.5         95.5    3,512,764
Macedonian (mk)   88.4         96.1    2,035,348
Malagasy (mg)     94.4         98.8    857,000
Malay (ms)        89.0         95.2    3,190,700
Malayalam (ml)    75.8         90.0    712,869
Marathi (mr)      85.0         94.2    355,536
Mongolian (mn)    86.1         93.5    208,847
Nepali (ne)       83.9         92.8    151,958
Norwegian (no)    88.7         95.4    10,234,086
Oriya (or)        87.3         93.2    79,261
Oromo (om)        80.8         91.4    7,153
Panjabi (pa)      85.4         92.1    145,204
Pashto (ps)       80.0         92.1    46,987
Persian (fa)      87.6         95.2    5,567,774
Polish (pl)       82.3         92.4    25,817,929
Portuguese (pt)   85.9         93.0    20,625,904
Quechua (qu)      93.8         96.2    247,508
Romanian (ro)     88.8         95.2    6,974,837
Russian (ru)      85.7         94.5    35,783,391
Sanskrit (sa)     86.2         95.8    73,380
Serbian (sr)      85.7         95.5    7,012,202
Sindhi (sd)       80.7         92.1    33,990
Sinhala (si)      80.1         89.9    90,866
Slovak (sk)       88.0         96.3    4,014,344
Slovenian (sl)    87.0         95.4    3,754,135
Somali (so)       84.5         92.2    53,132
Spanish (es)      86.5         92.3    37,749,593
Sundanese (su)    93.6         96.8    598,878
Swahili (sw)      91.9         97.2    693,049
Swati (ss)        81.2         91.2    4,344
Swedish (sv)      90.9         97.5    39,409,278
Tagalog (tl)      83.4         89.9    562,526
Tamil (ta)        84.2         93.8    1,110,037
Telugu (te)       89.2         95.4    841,549
Thai (th)         92.3         95.2    2,190,249
Tigrinya (ti)     57.8         79.0    696
Tswana (tn)       89.5         90.6    4,896
Turkish (tr)      89.0         93.9    5,657,757
Ukrainian (uk)    85.8         94.3    16,360,016
Urdu (ur)         91.0         96.3    1,142,953
Uzbek (uz)        73.8         98.4    764,566
Vietnamese (vi)   91.8         95.8    10,015,209
Welsh (cy)        94.4         96.4    1,254,901
Wolof (wo)        78.0         91.9    7,257
Xhosa (xh)        73.9         92.6    14,163
Yoruba (yo)       75.6         87.9    88,032
micro-avg         86.5         93.8    -
macro-avg         86.6         93.9    -
total             -            -       777,210,183

Table 11: Accuracy of mGENRE and the alias table on the 105 languages in our Wikipedia validation set. Support indicates how many datapoints were used for training; validation is done on 1,000 examples per language (fewer for Tigrinya and Fulah, since they have fewer than a thousand hyperlinks).

Figure 8: Distribution of languages on the top-1 prediction of mGENRE on the Wikipedia heldout set. The two axes indicate the source language and the language of the top-1 prediction (language codes af to zh). The model is trained on all those languages. Clearly, the model is biased to predict in the source language (note that we train in such a way), but there are some languages that are quite often also used in other languages (e.g., English). French and Russian are also often used.
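The summary rows of Table 11 and the distribution shown in Figure 8 can both be derived from per-example predictions. The following is a minimal sketch under stated assumptions, not the evaluation code used for the paper: the records are hypothetical toy data, the function names are illustrative, and it assumes the usual definitions, where the macro-average is the unweighted mean of per-language accuracies and the micro-average is the mean over all validation examples.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (not data from the paper):
# (source language, language of the top-1 predicted name, entity correct?)
records = [
    ("en", "en", True),
    ("en", "en", True),
    ("en", "fr", True),   # a name in another language can still be the right entity
    ("en", "en", False),
    ("ff", "ff", True),
    ("ff", "en", False),
]

def per_language_accuracy(records):
    """Accuracy (in %) of the top-1 prediction for each source language."""
    hits, totals = Counter(), Counter()
    for source, _, correct in records:
        totals[source] += 1
        hits[source] += int(correct)
    return {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}

def macro_average(accuracy):
    """Unweighted mean over languages: every language counts equally."""
    return sum(accuracy.values()) / len(accuracy)

def micro_average(records):
    """Mean over examples: languages with more examples weigh more."""
    return 100.0 * sum(int(correct) for _, _, correct in records) / len(records)

def prediction_language_distribution(records):
    """For each source language, the fraction of top-1 predictions per language."""
    counts = defaultdict(Counter)
    for source, predicted, _ in records:
        counts[source][predicted] += 1
    return {
        source: {lang: n / sum(row.values()) for lang, n in row.items()}
        for source, row in counts.items()
    }

accuracy = per_language_accuracy(records)
macro = macro_average(accuracy)
micro = micro_average(records)
distribution = prediction_language_distribution(records)
```

With roughly 1,000 validation examples for almost every language, the two averages are expected to be close, which is consistent with the micro-avg and macro-avg rows of Table 11 differing by only 0.1. Each row of `distribution` corresponds to one source language in a plot like Figure 8.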