Design Challenges in Low-Resource Cross-Lingual Entity Linking
Xingyu Fu1,* Weijia Shi2,* Xiaodong Yu1, Zian Zhao3, Dan Roth1
1University of Pennsylvania, 2University of California, Los Angeles, 3Columbia University
xingyuf2, xdyu, [email protected], [email protected], [email protected]
* Both authors contributed equally to this work.

Abstract

Cross-lingual Entity Linking (XEL), the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia, has seen a lot of research in recent years, with a range of promising techniques. However, current techniques do not rise to the challenges introduced by text in low-resource languages (LRL) and, surprisingly, fail to generalize to text not taken from Wikipedia, on which they are usually trained. This paper provides a thorough analysis of low-resource XEL techniques, focusing on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention. Our analysis indicates that current methods are limited by their reliance on Wikipedia's interlanguage links and thus suffer when the foreign language's Wikipedia is small. We conclude that the LRL setting requires the use of outside-Wikipedia cross-lingual resources and present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.[1] With experiments on 25 languages, QuEL shows an average increase of 25% in gold candidate recall and of 13% in end-to-end linking accuracy over state-of-the-art baselines.

[1] Code is available at: http://cogcomp.org/page/publication_view/911. The LORELEI data will be available through LDC.

Odia: [ଚିଲିକାେର] ଅରଖକୁଦ ନିକଟେର ଏକ ମୁହାଣ େଖାଲିଥିଲା ।
English gloss: Flood appeared in Arakhkud near [Chilika Lake].
XEL: [ଚିଲିକାେର] → Chilika Lake (https://en.wikipedia.org/wiki/Chilika_Lake)
Figure 1: The XEL task: in the given sentence we link "Chilika Lake" to its corresponding English Wikipedia entry.

1 Introduction

Cross-lingual Entity Linking (XEL) aims at grounding mentions written in a foreign (source) language (SL) into entries in a (target) language Knowledge Base (KB), which we consider here as the English Wikipedia, following Pan et al. (2017); Upadhyay et al. (2018a); Zhou et al. (2020). In Figure 1, for instance, an Odia (an Indo-Aryan language in India) mention ("Chilika Lake") is linked to the corresponding English Wikipedia entry. The XEL task typically involves two main steps: (1) candidate generation, retrieving a list of candidate KB entries for the mention, and (2) candidate ranking, selecting the most likely entry from the candidates.

While XEL techniques have been studied heavily in recent years, many challenges remain in the LRL setting. Specifically, existing candidate generation methods perform well on Wikipedia-based datasets but fail to generalize beyond Wikipedia, to news and social media text. Error analysis on existing LRL XEL systems shows that the key obstacle is candidate generation. For example, 79.3%-89.1% of XEL errors in Odia can be attributed to the limitations of candidate generation.
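To make the two-step pipeline concrete, the sketch below wires candidate generation and candidate ranking together in Python. The function names, the toy candidate table, and the scores are illustrative assumptions, not the interface or data of any system discussed in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    en_title: str  # English Wikipedia title
    score: float   # prior/model score used for ranking

# Toy candidate table standing in for a real cross-lingual resource
# (interlanguage links, anchor-text statistics, or query logs); a real
# system would return many candidates per mention.
TOY_CANDIDATES: Dict[str, List[Candidate]] = {
    "ଚିଲିକାେର": [Candidate("Chilika Lake", 0.9)],
}

def generate_candidates(mention: str) -> List[Candidate]:
    """Step 1: retrieve candidate English Wikipedia entries for an SL mention."""
    return TOY_CANDIDATES.get(mention, [])

def rank_candidates(candidates: List[Candidate]) -> Candidate:
    """Step 2: select the most likely entry (here: highest prior score)."""
    return max(candidates, key=lambda c: c.score)

def link(mention: str) -> str:
    """End-to-end XEL for one mention: generation followed by ranking."""
    candidates = generate_candidates(mention)
    if not candidates:
        return "NIL"  # no candidate found; the mention cannot be linked
    return rank_candidates(candidates).en_title

print(link("ଚିଲିକାେର"))  # -> "Chilika Lake"
```

The rest of the paper is concerned almost entirely with the quality of the first step, candidate generation, under low-resource conditions.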
In this paper, we present a thorough analysis of the limitations of several leading candidate generation methods. Although these methods adopt different techniques, we find that all of them heavily rely on Wikipedia interlanguage links[2] as their cross-lingual resources. However, small SL Wikipedia size limits their performance in the LRL setting. As shown in Figure 2, while the core challenge of LRL XEL is to link LRL entities (A) to candidates in the English Wikipedia (C), interlanguage links only map a small subset of the LRL entities that appear in both LRL Wikipedia (B) and English Wikipedia. Therefore, methods that only leverage interlanguage links (B ∩ C) as the main source of supervision cannot cover a wide range of entities. For example, the Amharic Wikipedia has 14,854 entries, but only 8,176 of them have interlanguage links to English.

[2] https://en.wikipedia.org/wiki/Help:Interlanguage_links

Furthermore, as we show, existing candidate generation methods perform well on Wikipedia-based datasets but fail to generalize to outside-Wikipedia text such as news or social media.

Figure 2: An illustration of cross-lingual resources serving current XEL systems: (A) LRL entities, (B) LRL Wikipedia entities, (C) English Wikipedia entities.

Our observations lead to the conclusion that the LRL setting necessitates the use of outside-Wikipedia cross-lingual resources. Specifically, we propose various ways to utilize the abundant query logs from online search engines to compensate for the lack of supervision. Similar to Wikipedia, a free online encyclopedia created by Internet users, query logs (QL) provide a free resource, collaboratively generated by a large number of users, and mildly curated. However, this resource is orders of magnitude larger than Wikipedia.[3] In particular, it includes all of Wikipedia's cross-lingual resources as a subset, since a search for an SL mention leads to the English Wikipedia entity if the corresponding Wikipedia entries are interlanguage-linked.

[3] https://www.internetlivestats.com/google-search-statistics/

In this paper, the main part, Sec. 3, presents a thorough method-wise evaluation and analysis of leading candidate generation methods, and quantifies their limitations as a function of SL Wikipedia size and the size of the interlanguage cross-lingual resources. Based on these limitations, we analyze QuEL_CG, an improved candidate generation method utilizing QL, in Sec. 4, showing that it exceeds the Wikipedia resource limit on LRL. To exhibit a system-wise XEL comparison, in Sec. 5, we suggest a simple yet efficient zero-shot XEL framework, QuEL, that incorporates QuEL_CG. QuEL achieves an average 25% increase in gold candidate recall, and a 13% increase in end-to-end linking accuracy, on outside-Wikipedia text.

2 Related Work

2.1 Background: Wikipedia resources

Wikipedia title mappings: The mapping comes from Wikipedia interlanguage links[4] between the SL and English, and uses Wikipedia article titles as mappings. It directly links an SL entity to an English Wikipedia entry without ambiguity.

[4] https://en.wikipedia.org/wiki/Help:Interlanguage_links

Wikipedia anchor text mappings: A clickable text mention in Wikipedia articles is annotated with anchor text linking it to a Wikipedia entry. The following retrieval order: SL anchor text → SL Wikipedia entry → English Wikipedia entry (where the last step is done via the Wikipedia interlanguage links), allows one to build a bilingual title mapping from SL mentions to English Wikipedia entries, resulting in a probabilistic mapping with scores calculated using total counts (Tsai and Roth, 2016).
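As a rough sketch of how such a probabilistic mapping can be assembled, the code below counts, for every SL anchor text, the English entries reached through the interlanguage links, and normalizes the counts into scores. The input formats (pairs of anchor text and SL title, plus an SL-to-English title dictionary) are simplifying assumptions, not the exact construction of Tsai and Roth (2016).

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def build_mention_table(
    sl_anchors: Iterable[Tuple[str, str]],   # (anchor text, SL Wikipedia title) pairs
    interlanguage: Dict[str, str],           # SL Wikipedia title -> English Wikipedia title
) -> Dict[str, Dict[str, float]]:
    """Build P(English entry | SL mention) from anchor-text counts.

    Retrieval order: SL anchor text -> SL Wikipedia entry -> English Wikipedia
    entry (last step via interlanguage links); scores are normalized counts.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for anchor_text, sl_title in sl_anchors:
        en_title = interlanguage.get(sl_title)
        if en_title is None:
            continue  # SL entry has no interlanguage link to English
        counts[anchor_text][en_title] += 1

    table: Dict[str, Dict[str, float]] = {}
    for mention, entry_counts in counts.items():
        total = sum(entry_counts.values())
        table[mention] = {en: c / total for en, c in entry_counts.items()}
    return table

# Tiny illustration with placeholder titles:
# build_mention_table([("mention", "SL_Title")], {"SL_Title": "English_Title"})
# -> {"mention": {"English_Title": 1.0}}
```

The coverage of the resulting table is bounded by the interlanguage link dictionary, which is exactly the limitation discussed in Sec. 1 and quantified in Sec. 3.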
2.2 XEL systems for low-resource languages

We briefly survey key approaches to XEL below.

Direct Mapping Based Systems, including xlwikifier (Tsai and Roth, 2016) and xelms (Upadhyay et al., 2018a), focus on building an SL-to-English mapping to generate candidates. For candidate ranking, both xlwikifier and xelms combine supervision from multiple languages to learn a ranking model.

Word Translation Based Systems, including Pan et al. (2017) and ELISA (Zhang et al., 2018), extract SL-English name translation pairs and apply an unsupervised collective inference approach to link the translated mention.

Transliteration Based Systems include Tsai and Roth (2018) and translit (Upadhyay et al., 2018b). translit uses a sequence-to-sequence model and bootstrapping to deal with limited data. It is useful when the English and SL word pairs have similar pronunciation.

Pivoting Based Systems, including (Rijhwani et al., 2019; Zhou et al., 2019) and PBEL_PLUS (Zhou et al., 2020), remove the reliance on SL resources and use a pivot language for candidate generation. Specifically, they train the XEL model on a selected high-resource language and apply it to SL mentions through language conversion.

2.3 Query Logs

Query logs have long been used in many tasks such as across-domain generalization (Rüd et al., 2011) for NER, and ontological knowledge acquisition (Alfonseca et al., 2010; Pantel et al., 2012). In the English entity linking task, Shen et al. (2015) pointed out that Google query logs can be an efficient way to identify candidates. Dredze et al. (2010); Monahan et al. (2011) use search results as one of their methods for candidate generation in the high-resource language entity linking task. While earlier work indicates that query logs provide abundant information that can be used in NLP tasks, as far as we know, they have never been used in cross-lingual tasks as we do here. Moreover, only search information has been used, while we suggest using Maps information too.

3 Candidate Generation Analysis

In this section, we analyze the leading candidate generation methods: p(e|m), xlwikifier, name_trans, pivoting, and translit (see Table 1 for the systems used by each method).

p(e|m) is the probabilistic mapping built from the Wikipedia anchor text mappings of Sec. 2.1, following the flow: SL mention → SL Wikipedia entity → English Wikipedia entity.

name_trans (Name Translation), as introduced in (Pan et al., 2017; Zhang et al., 2018), performs word alignment on Wikipedia title mappings to induce a fixed word-to-word translation table between SL and English. For instance, to link the Finnish (Suomi) name "Pekingin tekninen instituutti" (Beijing Institute of Technology), it translates each word in the mention: ("Pekingin" – Beijing, "tekninen" – Technology, "instituutti" – Institute). At test time, after mapping each word in the given mention to English, it links the translation to English Wikipedia using an unsupervised collective inference approach.

translit (Upadhyay et al., 2018b) trains a seq2seq model on Wikipedia title mappings to generate the English entity directly.

pivoting to a related high-resource language (HRL) (Zhou et al., 2020) attempts to generalize to unseen LRL mentions through grapheme or phoneme similarity between the SL and a related HRL. Specifically, it first finds a related HRL for the SL, and learns an XEL model on the related HRL, using Wikipedia title mappings and anchor text mappings.
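As a small illustration of the word-by-word translation step in name_trans described above, the sketch below applies a fixed translation table to an SL mention. The tiny Finnish-English table mirrors the "Pekingin tekninen instituutti" example and is assumed for illustration only; the actual systems induce the table via word alignment and link the translated words with collective inference.

```python
from typing import Dict, List

def translate_mention(mention: str, table: Dict[str, str]) -> List[str]:
    """Translate an SL mention word by word with a fixed word-to-word table.

    Words missing from the table are kept unchanged; a downstream linker then
    matches the translated words against English Wikipedia titles.
    """
    return [table.get(word.lower(), word) for word in mention.split()]

# Toy Finnish-English table for the example from the text.
fi_en = {"pekingin": "Beijing", "tekninen": "Technology", "instituutti": "Institute"}

print(translate_mention("Pekingin tekninen instituutti", fi_en))
# ['Beijing', 'Technology', 'Institute'] -> linked to "Beijing Institute of Technology"
```

Because the table is induced from Wikipedia title mappings, its vocabulary, like the other methods surveyed here, is ultimately bounded by the SL Wikipedia and its interlanguage links.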