SD-QA: Spoken Dialectal Question Answering for the Real World
Fahim Faisal, Sharlina Keshava, Md Mahfuz ibn Alam, Antonios Anastasopoulos
Department of Computer Science, George Mason University
{ffaisal,skeshav,malam21,antonis}@gmu.edu

Abstract

Question answering (QA) systems are now available through numerous commercial applications for a wide variety of domains, serving millions of users that interact with them via speech interfaces. However, current benchmarks in QA research do not account for the errors that speech recognition models might introduce, nor do they consider the language variations (dialects) of the users. To address this gap, we augment an existing QA dataset to construct a multi-dialect, spoken QA benchmark in four languages (Arabic, Bengali, English, Kiswahili) with more than 68k audio prompts in 22 dialects from 245 speakers. We provide baseline results showcasing the real-world performance of QA systems and analyze the effect of language variety and other sensitive speaker attributes on downstream performance. Last, we study the fairness of the ASR and QA models with respect to the underlying user populations.1

[Figure 1: Illustration of the envisioned scenario for a user-facing QA system that SD-QA aims to evaluate (example from Swahili). The spoken question is transcribed by ASR ("Teolojia ya dogma ni nini?") and answered with the minimal answer "neno linalotumika hasa kumaanisha fundisho la imani lisiloweza kukanushwa na wafuasi wa dini fulani."]

1 Introduction

The development of question answering (QA) systems that can answer human prompts with, or in some cases without, context is one of the great success stories of modern natural language processing (NLP) and a rapidly expanding research area. Usage of such systems has reached a global scale, in that millions of users can conveniently query voice assistants like Google Assistant,2 Amazon Alexa,3 or Apple Siri4 through their smartphones or smart home devices. QA systems have also made large strides in technology-adjacent industries like healthcare, privacy, and e-commerce platforms.

Wide-spread adoption of these systems requires that they perform consistently in real-world conditions, an area where Ravichander et al. (2021) note there is substantial room for improvement. Existing QA evaluation benchmarks rely on text-based data that is provided in written form, without error or noise. However, inputs to real-world QA systems are gathered from users through error-prone interfaces such as keyboards, speech recognition systems that convert verbal queries to text, and machine-translation systems. Evaluating production-ready QA systems on data that is not representative of real-world inputs is problematic and has consequences for the utility of such systems. For example, Ravichander et al. (2021) quantify and illustrate the effects of interface noise on English QA systems.

In this work, we address the need for realistic evaluations of QA systems by creating a multilingual and multi-dialect spoken QA evaluation benchmark. Our focus is the utility of QA systems for users with varying demographic traits like age, gender, and dialect spoken. Our contributions are as follows:

1. We augment the TyDi-QA dataset (Clark et al., 2020) with spoken utterances matching the questions.

1 The dataset and code for reproducing our experiments are available here: https://github.com/ffaisal93/SD-QA.
2 https://assistant.google.com/
3 https://developer.amazon.com/alexa
4 https://www.apple.com/siri/
In particular, we collect utterances in four languages (Arabic, Bengali, English, Kiswahili) and from multiple varieties5 (seven for Arabic, eleven for English, and two for each of Bengali and Kiswahili).

2. We perform a contrastive evaluation of a baseline pipeline approach that first transcribes the utterances with an ASR system and then provides the transcription to the QA system. We compare general and localized ASR systems, finding wide divergences between the downstream QA performance for different language varieties.

5 We will use the terms "dialect" and "language variety" interchangeably.

2 The SD-QA dataset

The SD-QA dataset builds on the TyDi-QA dataset (Clark et al., 2020), using questions, contexts, and answers from four typologically diverse languages. Our focus is to augment the dataset along two additional dimensions. The first is a speech component, to match the real-world scenario of users querying virtual assistants for information. Second, we add a dimension of dialectal and geographical language variation.

Language    Locations (Variety Code)
Arabic      Algeria (DZA), Bahrain (BHR), Egypt (EGY), Jordan (JOR), Morocco (MAR), Saudi Arabia (SAU), Tunisia (TUN)
Bengali     Bangladesh-Dhaka (BGD), India-Kolkata (IND)
English     Australia (AUS), India-South (IND-S), India-North (IND-N), Ireland (IRL), Kenya (KEN), New Zealand (NZL), Nigeria (NGA), Philippines (PHI), Scotland (SCO), South Africa (ZAF), US-Southeast (USA-SE)
Kiswahili   Kenya (KEN), Tanzania (TZA)

Table 1: Languages and sample collection locations (roughly corresponding to different spoken varieties) in the SD-QA dataset.

2.1 Languages and Varieties

We focus on four of the languages present in the original TyDi-QA dataset: English, Arabic, Bengali, and Kiswahili.6 In total, SD-QA includes more than 68k audio prompts in 22 dialects from 245 annotators. Table 1 presents a list of the different locations from which we collected spoken samples. A detailed breakdown of dataset and speaker statistics is available in Appendix A. We provide a brief discussion of the dialectal variation exhibited in the dataset's languages in the following paragraphs. We note that English and Arabic varieties are over-represented, but this is due to cost: it was easier and cheaper to source data in English and Arabic than in other languages; we plan to further expand the dataset in the future with more varieties for Bengali and Kiswahili, as well as more languages.

Arabic is practically a family of several varieties, forming a dialect continuum from Morocco in the west to Oman in the east. Most NLP work has focused on Modern Standard Arabic (MSA), the literary standard for the language of culture, media, and education. However, no Arabic speaker is really a native speaker of MSA; rather, most Arabic speakers in everyday life use their native dialect, and often resort to code-switching between their variety and MSA in unscripted situations (Abu-Melhim, 1991). We follow a sub-regional classification for the Arabic dialects we work with, as these dialects can differ in terms of their phonology, morphology, orthography, syntax, and lexicon, even between cities in the same country. We direct the reader to Habash (2010) for further details on the handling of Arabic in NLP, and to Bouamor et al. (2018) for a larger corpus of Arabic dialects.

The original TyDi-QA dataset provides data in MSA, as the underlying Wikipedia data are in MSA.7 To capture the regional variation of Arabic as much as possible, the instructions to the annotators were to first read the question and then record the same question in their local variety, rephrasing as they deem appropriate.

Bengali is an Indo-Aryan language spoken in the general Bengal region, with the majority of speakers concentrated in modern-day Bangladesh and the West Bengal state in India (Klaiman, 1987). The language exhibits large variation.
Languages like Sylheti and Chittagonian, once considered varieties of Bengali, are now regarded as languages in their own right (Simard et al., 2020; Masica, 1993). We focus our collection on two major geographical poles: Dhaka (Bangladesh) and Kolkata (India). The Dhaka variety is the most widely spoken one, while the Rarhi variety (Central Standard Bengali) is prevalent in Kolkata and has been the basis for standard Bengali. Thompson (2020) notes that the linguistic differences between the two sides of the border are minor. The differences are mainly phonological, although some lexical differences are also observed in common words. For example, the Bangla spoken in Dhaka is more "spread out": speakers use চইলা (choila, "to go") and কইরা (koira, "to do") instead of চলে (chole) and করে (kore). Arabic influence is prevalent in the Dhaka dialect, whereas the Kolkata dialect is more influenced by other Indo-Aryan languages, resulting in lexical differences (examples shown in Table 2).

Dhaka             Kolkata                English
দোয়া (doa)        প্রার্থনা (prarthona)    pray
পানি (pani)        জল (jol)               water
দাওয়াত (daowat)    নিমন্ত্রণ (nimontron)    invitation

Table 2: Example of lexical differences between the Dhaka and Kolkata Bengali varieties.

Kiswahili is a Bantu language spoken in Tanzania, Kenya, Congo, and Uganda. Its grammar is characteristically Bantu, but it also has strong Arabic influences and uses a significant number of Arabic loanwords. While there are more than a dozen Swahili varieties, the three most prominent are kiUnguja, spoken on Zanzibar and in the mainland areas of Tanzania, which is also the basis of what is considered standard Swahili; kiMvita, spoken in Mombasa and other areas of Kenya; and kiAmu (or Kiamu), spoken on the island of Lamu and adjoining parts of the coast. Also prominent is Congolese Swahili (ISO code: swc), which we treat as a separate language because of its significant French influences (due to colonisation). In this work we collected utterances from Kenya (the Nairobi area) and from Tanzania (the Dar-es-Salaam area). Some of our contributors self-reported as non-native Swahili speakers, naming languages like Kikuyu (ISO code: kik) and Chidigo (ISO code: dig) as their native ones.

2.2 Data Collection Process

Data was collected through subcontractors. Each annotator was a native speaker of the language who grew up and lived in the same region we were focusing on. The annotators were paid a minimum of $15 per hour.8 We aimed for a gender- and age-balanced collection. The data for almost all dialects are gender-balanced,9 but not all age

6 The languages were selected for their typological diversity and the wide range of variation they exhibit.
7 While Clark et al. (2020) do not specify whether the data are in MSA or in any vernaculars, we rely on a native speaker for this observation.
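As a concrete illustration of the baseline evaluation described in the introduction (contribution 2), the cascade first transcribes the audio prompt and then passes the transcription to a text QA system; per-dialect scoring then reveals how recognition errors propagate differently across varieties. The sketch below is only a minimal stand-in under stated assumptions: asr_transcribe, answer_question, and the word-overlap heuristic are hypothetical placeholders for illustration, not the actual ASR or QA models evaluated in the paper.

```python
from collections import defaultdict

def asr_transcribe(audio_prompt: str) -> str:
    # Stand-in for a (general or dialect-localized) speech recognizer.
    # Real input would be audio; here we just normalize a text proxy.
    return audio_prompt.lower()

def answer_question(question: str, context: str) -> str:
    # Stand-in for a text QA model: return the context sentence with the
    # largest word overlap with the (possibly noisy) transcription.
    q_words = set(question.split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(set(s.lower().split()) & q_words))

def cascade(audio_prompt: str, context: str) -> str:
    # The pipeline baseline: transcribe first, then answer over text.
    return answer_question(asr_transcribe(audio_prompt), context)

def per_dialect_exact_match(examples) -> dict:
    # examples: iterable of (dialect, audio_prompt, context, gold_answer).
    # Aggregating exact-match accuracy per dialect exposes divergences
    # in downstream QA performance across language varieties.
    totals, hits = defaultdict(int), defaultdict(int)
    for dialect, audio, context, gold in examples:
        totals[dialect] += 1
        hits[dialect] += int(cascade(audio, context) == gold)
    return {d: hits[d] / totals[d] for d in totals}
```

Swapping asr_transcribe for dialect-specific recognizers while keeping answer_question fixed mirrors the paper's contrastive setup, isolating the ASR component's contribution to the per-dialect score differences.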