arXiv:2010.11856v3 [cs.CL] 13 Apr 2021

XOR QA: Cross-lingual Open-Retrieval Question Answering

Akari Asai♡, Jungo Kasai♡, Jonathan H. Clark♢, Kenton Lee♢, Eunsol Choi♣, Hannaneh Hajishirzi♡♠
♡University of Washington ♢Google Research ♣The University of Texas at Austin ♠Allen Institute for AI
{akari, jkasai, hannaneh}@cs.washington.edu {jhclark, kentonl}@google.com, [email protected]

Abstract

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity, where languages have few reference articles, and information asymmetry, where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

[Figure 1 content]
Question [Japanese]: ロン・ポールの学部時代の専攻は？ (What did Ron Paul major in during undergraduate?)
Multilingual document collections (Wikipedias):
ロン・ポール (ja.wikipedia): 高校卒業後はゲティスバーグ大学へ進学。 (After high school, he went to Gettysburg College.)
Ron Paul (en.wikipedia): Paul went to Gettysburg College, where he was a member of the Lambda Chi Alpha fraternity. He graduated with a B.S. degree in Biology in 1957.
Answer: 生物学 (Biology)

Figure 1: Overview of XOR QA. Given a question in L_i, the model finds an answer in either English or L_i Wikipedia and returns an answer in English or L_i. L_i is one of the 7 typologically diverse languages.

1 Introduction

Information-seeking questions (questions from people who are actually looking for an answer) have been increasingly studied in question answering (QA) research. Fulfilling these information needs has led the research community to look further for answers: beyond paragraphs and articles toward performing open retrieval¹ on large-scale document collections (Chen and Yih, 2020). Yet the bulk of this work has been exclusively on English. In this paper, we bring together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval.

¹ We use open retrieval, instead of open domain, to refer to models that can access answer context from large document collections. We avoid using open domain due to its double meaning as “covering topics from many domains.”

While multilingual open QA systems would benefit the many speakers of non-English languages, there are several pitfalls in designing such a dataset. First, a multilingual QA dataset should include questions from non-English native speakers to represent real-world applications. Questions in most recent multilingual QA datasets (Lewis et al., 2020; Artetxe et al., 2020; Longpre et al., 2020) are translated from English, which leads to English-centric questions such as questions about American sports, cultures and politics. Second, it is important to support retrieving answers in languages other than the original language due to information scarcity of low-resource languages (Miniwatts Marketing Group, 2011). Moreover, questions strongly related to entities from other cultures are less likely to have answer content in the questioner’s language due to cultural bias (information asymmetry, Callahan and Herring, 2011). For example, Fig. 1 shows that the Japanese Wikipedia article of an American politician, Ron Paul, does not have information about his college degree, perhaps because Japanese Wikipedia editors are less interested in specific educational backgrounds of American politicians.

In this paper, we introduce the task of cross-lingual open-retrieval question answering (XOR QA), which aims at answering multilingual questions from non-English native speakers given multilingual resources. To support research in this area, we construct a dataset (called XOR-TyDi QA) of 40k annotated questions and answers across 7 typologically diverse languages. Questions in our dataset are inherited from TyDi QA (Clark et al., 2020), which are written by native speakers and are originally unanswerable due to the information scarcity or asymmetry issues. XOR-TyDi QA is the first large-scale cross-lingual open-retrieval QA dataset that consists of information-seeking questions from native speakers and multilingual reference documents.

XOR-TyDi QA is constructed with an annotation pipeline that allows for cross-lingual retrieval from large-scale Wikipedia corpora (§2). Unanswerable questions in TyDi QA are first translated into English by professional translators. Then, annotators find answers to translated queries given English Wikipedia using our new model-in-the-loop annotation framework that reduces annotation errors. Finally, answers are verified and translated back to the target languages.

Building on the dataset, we introduce three new tasks in the order of increasing complexity (§3). In XOR-Retrieve, a system retrieves English Wikipedia paragraphs with sufficient information to answer the question posed in the target language. XOR-EnglishSpan takes one step further and finds a minimal answer span from the retrieved English paragraphs. Finally, XOR-Full expects a system to generate an answer end to end in the target language by consulting both English and the target language’s Wikipedia. XOR-Full is our ultimate goal, and the first two tasks enable researchers to diagnose where their models fail and develop with less coding effort and fewer resources.

We provide baselines that extend state-of-the-art open-retrieval QA systems (Asai et al., 2020; Karpukhin et al., 2020) to our multilingual retrieval setting. Our best baseline achieves an average of 18.7 F1 points on XOR-Full. This result indicates that XOR-TyDi QA poses unique challenges to tackle toward building a real-world open-retrieval QA system for diverse languages. We expect that our dataset opens up new challenges to make progress in multilingual representation learning.

2 The XOR-TyDi QA Dataset

Our XOR-TyDi QA dataset comprises questions inherited from TyDi QA (Clark et al., 2020) and answers augmented with our annotation process across 7 typologically diverse languages. We focus on cross-lingual retrieval from English Wikipedia because in our preliminary investigation we were able to find answers to a majority of the questions from resource-rich English Wikipedia, and native speakers with much annotation experience were readily available via crowdsourcing in English.

2.1 XOR-TyDi QA Collection

Our annotation pipeline proceeds in four steps: 1) collection of questions from TyDi QA without a same-language answer that require cross-lingual reference to answer (§2.1.1); 2) question translation from a target language to the pivot language of English where the missing information may exist (§2.1.2); 3) answer retrieval in the pivot language given a set of candidate documents (§2.1.3); 4) answer verification and translation from the pivot language back to the original language (§2.1.4). Fig. 2 shows an overview of the pipeline.

[Figure 2 panels: 1. Question Selection; 2. Question Translation (Q_L → Q_en); 3. Answer Retrieval in English (Q_en, P_en); 4. Answer Translation (Q_en, P_en, A_en → A_L)]

Figure 2: Overview of the annotation process for XOR-TyDi QA.

2.1.1 Question Selection

Our questions are collected from unanswerable questions in TyDi QA. A question is unanswerable in TyDi QA if an annotator cannot select a passage answer (a paragraph in the article that contains an answer). We randomly sample 5,000 questions without any passage answer annotations (unanswerable questions) from the TyDi QA training data, and split them into training (4,500) and development (500) sets. We use the development data from TyDi QA as our test data, since TyDi QA’s original test data is not publicly available.² We choose 7 languages with varying amounts of Wikipedia data out of the 10 non-English languages based on the cost and availability of translators:³ Arabic, Bengali, Finnish, Japanese, Korean, Russian and Telugu.

² Furthermore, despite the benefits of hidden test sets, the resource-intensive nature of open-retrieval QA is not well suited to code-submission leaderboards. This further precluded the use of the original TyDi QA test sets.

2.1.2 Question Translation

We use a professional translation service, Gengo,⁴ to translate all collected questions into English. Since named entities are crucial for QA, we instruct translators to carefully translate them by searching for common English translations from English Wikipedia or other external sources.

[...] annotation errors because annotators have to find answer context among many candidate articles.

Collaborative model-in-the-loop. To find a middle ground in the tradeoff, we introduce a collaborative model-in-the-loop framework that uses Google Search and a state-of-the-art paragraph ranker. We first run Google Search to retrieve as many as top
