Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

∀∗, Wilhelmina Nekoto1, Vukosi Marivate2, Tshinondiwa Matsila1, Timi Fasubaa3, Tajudeen Kolawole4, Taiwo Fagbohungbe5, Solomon Oluwole Akinola6, Shamsuddeen Hassan Muhammad7,39, Salomon Kabongo4, Salomey Osei4, Sackey Freshia8, Rubungo Andre Niyongabo9, Ricky Macharm10, Perez Ogayo11, Orevaoghene Ahia12, Musie Meressa13, Mofe Adeyemi14, Masabata Mokgesi-Selinga15, Lawrence Okegbemi5, Laura Jane Martinus16, Kolawole Tajudeen4, Kevin Degila17, Kelechi Ogueji12, Kathleen Siminyu18, Julia Kreutzer19, Jason Webster20, Jamiil Toure Ali1, Jade Abbott21, Iroro Orife3, Ignatius Ezeani38, Idris Abdulkabir Dangana23,7, Herman Kamper24, Hady Elsahar25, Goodness Duru26, Ghollah Kioko27, Espoir Murhabazi1, Elan van Biljon12,24, Daniel Whitenack28, Christopher Onyefuluchi29, Chris Emezue40, Bonaventure Dossou31, Blessing Sibanda32, Blessing Itoro Bassey4, Ayodele Olabiyi33, Arshath Ramkilowan34, Alp Öktem35, Adewale Akinfaderin36, Abdallah Bashir37

∗Masakhane, Africa

1Independent, 2University of Pretoria, 3Niger-Volta LTI, 4African Masters in Machine Intelligence, 5Federal University of Technology, Akure, 6University of Johannesburg, 7Bayero University, Kano, 8Jomo Kenyatta University of Agriculture and Technology, 9UESTC, 10Siseng Consulting Ltd, 11African Leadership University, 12InstaDeep Ltd, 13Sapienza University of Rome, 14Udacity, 15Parliament, Republic of South Africa, 16Explore Data Science Academy, 17UCD, Konta, 18AI4Dev, 19Google Research, 20Percept, 21Retro Rabbit, 23Di-Hub, 24Stellenbosch University, 25Naver Labs, 26Retina, AI, 27Lori Systems, 28SIL International, 29Federal College of Dental Technology and Therapy, Enugu, 31Jacobs University, 32Namibia University of Science and Technology, 33Data Science Nigeria, 34Praekelt Consulting, 35Translators without Borders, 36Amazon, 37Max Planck Institute for Informatics, University of Saarland, 38Lancaster University, 39University of Porto, 40Technical University of Munich

[email protected]

arXiv:2010.02353v2 [cs.CL] 6 Nov 2020

Abstract

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. “Low-resourced”-ness is a complex problem going beyond data availability and reflects systemic problems in society.

In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

∀∗ to represent the whole Masakhane community.

1 Introduction

Language prevalence in societies is directly bound to the people and places that speak this language. Consequently, resource-scarce languages in an NLP context reflect the resource scarcity in the society from which the speakers originate (McCarthy, 2017). Through the lens of a machine learning researcher, “low-resourced” identifies languages for which few digital or computational data resources exist, often classified in comparison to another language (Gu et al., 2018; Zoph et al., 2016). However, to the sociolinguist, “low-resourced” can be broken down into many categories: low density, less commonly taught, or endangered, each carrying slightly different meanings (Cieri et al., 2016). In this complex definition, the “low-resourced”-ness of a language is a symptom of a range of societal problems, e.g. authors oppressed by colonial governments have been imprisoned for writing novels in their languages, impacting the publications in those languages (Wa Thiong’o, 1992), or fewer PhD candidates come from oppressed societies due to low access to tertiary education (Jowi et al., 2018). This results in fewer linguistic resources and researchers from those regions to work on NLP for their language. Therefore, the problem of “low-resourced”-ness relates not only to the available resources for a language, but also to the lack of geographic and language diversity of NLP researchers themselves.

Language           Articles    Speakers         Category
English            6,087,118   1,268,100,000    Winner
Egyptian Arabic      573,355      64,600,000    Hopeful
Afrikaans             91,002      17,500,000    Rising Star
Kiswahili             59,038      98,300,000    Rising Star
Yoruba                32,572      39,800,000    Rising Star
Shona                  5,505       9,000,000    Scraping by
Zulu                   2,219      27,800,000    Hopeful
Igbo                   1,487      27,000,000    Scraping by
Luo                        0       4,200,000    Left-behind
Fon                        0       2,200,000    Left-behind
Dendi                      0         257,000    Left-behind
Damara                     0         200,000    Left-behind

Table 1: Sizes of a subset of African language Wikipedias¹, speaker populations², and categories according to Joshi et al. (2020) (28 May 2020).

The NLP community has awakened to the fact that it has a diversity crisis in terms of limited geographies and languages (Caines, 2019; Joshi et al., 2020): Research groups are extending NLP research to low-resourced languages (Guzmán et al., 2019; Hu et al., 2020; Wu and Dredze, 2020), and workshops have been established (Haffari et al., 2018; Axelrod et al., 2019; Cherry et al., 2019).

We scope the rest of this study to machine translation (MT) using parallel corpora only, and refer the reader to Joshi et al. (2019) for an assessment of low-resourced NLP in general.

Contributions. We diagnose the problems of MT systems for low-resourced languages by reflecting on what agents and interactions are necessary for a sustainable MT research process. We identify which agents and interactions are commonly omitted from existing low-resourced MT research, and assess the impact that their exclusion has on the research. To involve the necessary agents and facilitate required interactions, we propose participatory research to build sustainable MT research communities for low-resourced languages. The feasibility and scalability of this method is demonstrated with a case study on MT for African languages, where we present its implementation and outcomes, including novel translation datasets, benchmarks for over 30 target languages contributed and evaluated by language speakers, and publications authored by participants without formal training as scientists.

2 Background

Cross-lingual Transfer. With the success of deep learning in NLP, language-specific feature design has become rare, and cross-lingual transfer methods have come into bloom (Upadhyay et al., 2016; Ruder et al., 2019) to transfer progress from high-resourced to low-resourced languages (Adams et al., 2017; Wang et al., 2019; Kim et al., 2019). The most diverse benchmark for multilingual transfer by Hu et al. (2020) allows measurement of the success of such transfer approaches across 40 languages from 12 language families. However, the inclusion of languages in the set of benchmarks is dependent on the availability of monolingual data for representation learning with previously annotated resources. The content of the benchmark tasks is English-sourced, and human performance estimates are taken from English. Most cross-lingual representation learning techniques are Anglo-centric in their design (Anastasopoulos and Neubig, 2019).

Multilingual Approaches. Multilingual MT (Dong et al., 2015; Firat et al., 2016a,b; Wang et al., 2020) addresses the transfer of MT from high-resourced to low-resourced languages by training multilingual models for all languages at once. Aharoni et al. (2019) and Arivazhagan et al. (2019) train models to translate between English and 102 languages, for the 10 most high-resourced African languages on private data, and otherwise on public TED talks (Qi et al., 2018). Multilingual training often outperforms bilingual training, especially for low-resourced languages. However, with multilingual parallel data being also Anglo-centric, the capabilities to translate from English versus into English vastly diverge (Zhang et al., 2020).

Another recent approach, mBART (Liu et al., 2020), leverages both monolingual and parallel data and also yields improvements in translation quality for lower-resource languages such as Nepali, Sinhala and Gujarati.³ While this provides a solution for small quantities of training data or monolingual resources, the extent to which standard BLEU evaluations reflect translation quality is not clear yet, since human evaluation studies are missing.

³ Note that these languages have more digital resources available and a longer history of written texts than the low-resourced languages we are addressing here.

Targeted Resource Creation. Guzmán et al. (2019) develop evaluation datasets for low-resourced MT between English and Nepali, Sinhala, Khmer and Pashto. They highlight many problems with low-resourced translation: tokenization, content selection, and translation verification, illustrating increased difficulty translating from English into low-resourced languages, and highlight the ineffectiveness of accepted state-of-the-art techniques on morphologically-rich languages. Despite involving all agents of the MT process (Section 3), the study does not involve data curators or evaluators that understood the languages involved, and resorts to standard MT evaluation metrics. Additionally, how this effort-intensive approach would scale to more than a handful of languages remains an open question.

3 The Machine Translation Process

We reflect on the process enabling sustainable MT research on parallel corpora in terms of the required agents and interactions, visualized in Figure 1. Content creators, translators, and curators form the dataset creation process, while the language technologists and evaluators are part of the model creation process. Stakeholders (not displayed) create demand for both processes.

Figure 1: The MT Process, in terms of the necessary agents, interactions and external constraints and demand (excluding stakeholders).

Stakeholders are people impacted by the artifacts generated by each agent in the MT process, and can typically speak and read the source or the target languages. To benefit from MT systems, the stakeholders need access to technology and electricity.

Content Creators produce content in a language, where content is any digital or non-digital representation of language. For digital content, content creators require keyboards and access to technology.

Translators translate the original content, including crowd-workers, researchers, or translation professionals. They must understand the language of the content creator and the target language. A translator needs content to translate, provided by content creators. For digital content, the translator requires keyboards and technology access.

Curators are defined as individuals involved in the content selection for a dataset (Bender and Friedman, 2018), requiring access to content and translations. They should understand the languages in question for quality control and encoding information.

Language Technologists are defined as individuals using datasets and computational linguistic techniques to produce MT models between language pairs. Language technologists require language preprocessors, MT toolkits, and access to compute resources.

Evaluators are individuals who measure and analyse the performance of an MT model, and therefore need knowledge of both source and target languages. To report on the performance of models, evaluators require quality metrics, as well as evaluation datasets. Evaluators provide feedback to the Language Technologists for improvement.

3.1 Limitations of Existing Approaches

If we place a high-resource MT pair such as English-to-French into the process defined above, we observe that each agent nowadays has the necessary resources and historical stakeholder demand to perform their role effectively. A “virtuous cycle” emerged where available content enabled the development of MT systems that in turn drove more translations, more tools, more evaluation and more content, which cycled back to improving MT systems.

By contrast, parts of the process for existing low-resourced MT are constrained. Historically, many low-resourced languages had low demand from stakeholders for content creation and translation (Wa Thiong’o, 1992). Due to missing keyboards or limited access to technology, content creators were not empowered to write digital content (Adam, 1997; van Esch et al., 2019). This is a chicken-or-egg problem, where existing digital content in a language would attract more stakeholders, which would incentivize content creators (Kaffee et al., 2018). As a result, primary data sources for NLP research, such as Wikipedia, often have a few hundred articles only for low-resourced languages despite large speaker populations, see Table 1. Due to limited demand, existing translations are often domain-specific and small in size, such as the JW300 corpus (Agić and Vulić, 2019) whose content was created for missionary purposes.

When data curators are not part of the societies from where these languages originate, they are often unable to identify data sources or translators for languages, prohibiting them from checking the validity of the created resource. This creates problems in encoding, orthography or alignment, resulting in noisy or incorrect translation pairs (Taylor et al., 2015). This is aggravated by the fact that many low-resourced languages do not have a long written history to draw from and therefore might be less standardized and use multiple scripts. In collaboration with content creators, data curators can contribute to standardization or at least recognize potential issues for data processing further down the line.

As discussed in Section 1, language technologists are fewer in low-resourced societies. Furthermore, the techniques developed in high-resourced societies might be inapplicable due to compute, infrastructure or time constraints. Aside from the problem of education and complexity, existing techniques may not apply due to linguistic and morphological differences in the languages, or the scale, domain, or quality of the data (Hu et al., 2020; Pires et al., 2019).

Evaluators usually resort to potentially unsuitable automatic metrics due to time constraints or missing connections to stakeholders (Guzmán et al., 2019). The main evaluators of low-resourced NLP that is developed today typically cannot use human metrics due to the inability to speak the languages, or the lack of reliable crowdsourcing infrastructure, identified as one of the core weaknesses of previous approaches (in Section 2).

In summary, many agents in the MT process for low-resourced languages are either missing invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other agents in the process.

3.2 Participatory Research Approach

We propose one way to overcome the limitations in Section 3.1: ensuring that the agents in the MT process originate from the countries where the low-resourced languages are spoken or can speak the low-resourced languages. Where this condition cannot be satisfied, at least a knowledge transfer between agents should be enabled. We hypothesize that using a participatory approach will allow researchers to improve the MT process by iterating faster and more effectively.

Participatory research, unlike conventional research, emphasizes the value of research partners in the knowledge-production process where the research process itself is defined collaboratively and iteratively. The “participants” are individuals involved in conducting research without formal training as researchers. Participatory research describes a broad set of methodologies, organised in terms of the level of participation. At the lowest level is crowd-sourcing, where participants are involved solely in data collection. The highest level, extreme citizen science, involves participation in the problem definition, data collection, analysis and interpretation (English et al., 2018).

Crowd-sourcing has been applied to low-resourced language data collection (Ambati et al., 2010; Guevara-Rukoz et al., 2020; Millour and Fort, 2018), but existing studies highlight how the disconnect between the data creation process and model creation process causes challenges. In seeking to create cross-disciplinary teams that emphasize the values in a societal context, a participatory approach which involves participants in every part of the scientific process appears pertinent to solving the problems for low-resourced languages highlighted in Section 3.1.

To show how more involved participatory research can benefit low-resource language translation, we present a case study in MT for African languages.

4 Case Study: Masakhane

Africa currently has 2144 living languages (Eberhard et al., 2019). Despite this, African languages account for a small fraction of available language resources, and NLP research rarely considers African languages. In the taxonomy of Joshi et al. (2020), African languages are assigned categories ranging from “The Left Behinds” to “The Rising Stars”, with most languages not having any annotated data. Even monolingual resources are sparse, as shown in Table 1.

In addition to a lack of NLP datasets, the African continent lacks NLP researchers. In 2018, only five out of the 2695 affiliations of the five major NLP conferences were from African institutions (Caines, 2019). ∀ et al. (2020) attribute this to a culmination of circumstances, in particular their societal embedding (Alexander, 2009) and socio-economic factors, hindering participation in research activities and events, leaving researchers disconnected and distributed across the continent. Consequently, existing data resources are harder to discover, especially since these are often published in closed journals or are not digitized (Mesthrie, 1995).

For African languages, the implementation of a standard crowd-sourcing pipeline, as for example used for collecting task annotations for English, is at the current stage infeasible, due to the challenges outlined in Section 3 and above. Additionally, no standard MT evaluation set for all of the languages in focus exists, nor are there prior published systems that we could compare all models against for a more insightful human evaluation. We therefore resort to intrinsic evaluation, and rely on this work becoming the first benchmark for future evaluations.

We invite the reader to adopt a meta-perspective of this case study as an empirical experiment: the hypothesis is that participatory research can facilitate low-resourced MT development; the experimental methodology is the strategies and tools employed to bring together distributed participants, enabling each language speaker to train, contribute, and evaluate their models. The experiment is evaluated in terms of the quantity and diversity of participants and languages, and the variety of research artifacts, in terms of benchmarks, human evaluations, publications, and the overall health of the community. While a set of novel human evaluation results is presented, they serve as a demonstration of the value of a participatory approach, rather than the empirical focus of the paper.

4.1 Methodology

To overcome the challenge of recruiting participants, a number of strategies were employed. Starting from local demand at a machine learning school (Deep Learning Indaba (Engelbrecht, 2018)), meetups and universities, distant connections were made through Twitter, conference workshops,⁴ and eventually press coverage⁵ and research publications.⁶ To overcome the limited tertiary education enrollments in Sub-Saharan Africa (Jowi et al., 2018), no prerequisites were placed on researchers joining the project. For the agents outlined in Section 3, no fixed roles are imposed onto participants. Instead, they join with a specific interest, background, or skill aligning them best to one or more of the agents. To obtain cross-disciplinarity, we focus on the communication and interaction between participants to enable knowledge transfer between missing connections (identified in Section 3.1), allowing a fluidity of agent roles. For example, someone who initially joined with the interest of using machine translation for their local language (as a stakeholder) to translate education material, might turn into a junior language technologist when equipped with tools and introductory material and mentoring, and guide content creation more specifically for resources needed for MT.

⁴ ICLR AfricaNLP 2020: https://africanlp-workshop.github.io/
⁵ https://venturebeat.com/2019/11/27/the-masakhane-project-wants-machine-translation-and-ai-to-transform-africa/
⁶ https://github.com/masakhane-io/masakhane-community/blob/master/publications.md

To bridge large geographical divides, the community lives online. Communication occurs on GitHub and Slack with weekly video conference meetings and reading groups. Meeting notes are shared openly so that continuous participation is not required and time commitment can be organized individually. Sub-interest groups have emerged in Slack channels to allow focused discussions. Agendas for meetings and reading groups are public and democratically voted upon. In this way, the research questions evolve based on stakeholder demands, rather than being imposed by external forces.

The lack of compute resources and prior exposure to NLP is overcome by providing tutorials for training a custom-size Transformer model with JoeyNMT (Kreutzer et al., 2019) on Google Colab⁷. International researchers were not prohibited from joining. As a result, mutual mentorship relations emerged, whereby international researchers with more language technology experience guided research efforts and enabled data curators or translators to become language technologists. In return, African researchers introduced the international language technologists to African stakeholders, languages and context.

⁷ https://colab.research.google.com

4.2 Research Outcomes

Participants. A growth to over 400 participants of diverse disciplines, from at least 20 countries, has been achieved within the past year, suggesting the participant recruitment process was effective. Appendix A contains detailed demographics of a subset of participants from a voluntary survey in February 2020. 86.5% of participants responded positively when asked if the community helped them find mentors or collaborators, indicating that the health of the community is positive. This is also reflected in joint research publications of new groups of collaborators.

Research Artifacts. As a result of mentorship and knowledge exchange between agents of the translation process, our implementation of participatory research has produced artifacts for NLP research, namely datasets, benchmarks and models, which are publicly available online.⁸ Additionally, over 10 participants have gone on to publish works addressing language-specific challenges at conference workshops, such as Dossou and Emezue (2020); Orife (2020); Orife et al. (2020); Öktem et al. (2020); Van Biljon et al. (2020); Martinus et al. (2020); Marivate et al. (2020).

⁸ https://github.com/masakhane-io

Dataset Creation. The dataset creation process is ongoing, with new initiatives still emerging. We showcase a few initiatives below to demonstrate how bridging connections between agents facilitates the MT process.

1. A team of Nigerian participants, driven by the internal demand to ensure that accessible and representative data of their culture is used to train models, are translating their own writings including personal religious stories and undergraduate theses into Yoruba and Igbo.⁹

2. A Namibian participant, driven by a passion to preserve the culture of the Damara, is hosting collaborative sessions with Damara speakers, to collect and translate phrases that reflect Damara culture around traditional clothing, songs, and prayers.¹⁰

3. Creating a connection between a translator in South-Africa’s parliament and a language technologist has enabled the process of data curation, allowing access to data from the parliament in South-Africa’s languages (which are public but obfuscated behind internal tools).¹¹

⁹ https://github.com/masakhane-io/masakhane-wazobia-dataset
¹⁰ https://github.com/masakhane-io/masakhane-khoekhoegowab
¹¹ http://bit.ly/raw-parliamentary-translations

These stories demonstrate the value of including curators, content creators, and translators as participants.

Benchmarks. We publish 46 benchmarks for neural translation models from English into 39 distinct African languages, and from French into two additional languages, as well as from three different languages into English.¹² Most were trained on the JW300 corpus (Agić and Vulić, 2019). From this corpus, we select the English sentences most commonly found (and longer than 4 tokens) in all languages, as a global set of test sources. For individual languages, test splits are composed by selecting the translations that are available from this subset. While this biases the test set towards frequent segments, it prevents cross-lingual overlap between training and test data which has to be ensured for cross-lingual transfer learning. For training data, other sources like Autshumato (McKellar, 2014), TED (Cettolo et al., 2012), SAWA (De Pauw et al., 2009), Tatoeba¹³, Opus (Tiedemann, 2012), and data translated or curated by participants were added. Language pairs were selected based on the individual demands of each of the 33 participants, who voluntarily contributed the benchmarks they valued most. 16 of the selected target languages are categorized as “Left-behind” and 11 are categorized as “Scraping by” in the taxonomy of Joshi et al. (2020). The benchmarks are hosted publicly, including model weights, configurations and preprocessing pipelines for full reproducibility. The benchmarks are submitted by individuals or groups of participants in the form of a GitHub Pull Request. By this, we ensure that the benchmark contributors can be contacted, and ownership is experienced.

¹² Benchmark scores can be found in Appendix C.
¹³ https://tatoeba.org/

4.3 Human MT Evaluation

To our knowledge, there is no prior research on human evaluation specifically for machine translations of low-resourced languages. Until now, NLP practitioners were left with the hope that successful evaluation methodologies for high-resource languages would transfer well to low-resourced languages. This lack of study is due to the missing connections between the community of speakers (content creators and translators), and the language technologists. MT evaluations by humans are often done either within a group of researchers from the same lab or field (e.g. for WMT evaluations¹⁴), or via crowdsourcing platforms (Ambati and Vogel, 2010; Post et al., 2012). Speakers of low-resource languages are traditionally underrepresented in these groups, which makes such studies even harder (Joshi et al., 2019; Guzmán et al., 2019).

¹⁴ http://www.statmt.org/wmt19/

One might argue that human evaluation should not be attempted before reaching a viable state of quality, but we found that early evaluation results in an improved understanding of the individual challenges of the target languages, strengthens the network of the community, and most importantly, improves the connection and knowledge transfer between language technologists, content creators and curators.

The “low-resourced”-ness of the addressed languages poses challenges for evaluation beyond interface design or recruitment of evaluators proficient in the target language. For the example of Igbo, evaluators had to find solutions for typing diacritics without a suitable keyboard. In addition, Igbo has many dialects and variations which the MT model is uninformed of. Medical or technical terminology (e.g., “data”) is difficult to translate and whether to use loan words required discussion. Target language news websites were found to be useful for resolving standardization or terminology questions. Solutions for each language were shared and often also applicable for other languages.

Data. The models are trained on JW300 data.¹⁵ To gain real-world quality estimates beyond religious context, we assess the models’ out-of-domain generalization by translating an English COVID-19 survey with 39 questions and statements regarding COVID-19,¹⁶ where the human-corrected and approved translations can directly serve the purpose of gathering responses. The domain is challenging as it contains medical terms and new vocabulary. Furthermore, we evaluate a subset of the Multitarget TED test data (Duh, 2018)¹⁷. The obtained translations enrich the TED datasets, adding new languages for which no prior translations exist. The size of the TED evaluations varies from 30 to 120 sentences. Details are given in Table 3, Appendix B.

¹⁵ Except for Hausa: multiple domains, see Table 4.
¹⁶ https://coronasurveys.org/
¹⁷ http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/

Evaluators. 11 participants of the community volunteered to evaluate translations in their language(s), often involving family or friends to determine the most correct translations. The evaluator role is therefore taken by both stakeholders and language technologists. Within only 10 days, we gathered a total of 707 evaluated translations covering Igbo (ig), Nigerian Pidgin (pcm), Shona (sn), Luo (luo), Hausa (ha, twice by two different annotators), Kiswahili (sw), Yoruba (yo), Fon (fon) and Dendi (ddn). We did not impose prescriptions in terms of number of sentences to evaluate, or time to spend, since this was voluntary work, and guidelines or estimates for the evaluation of translations into these languages are non-existent.

Evaluation Technique. Instead of a direct assessment (Graham et al., 2013) often used in benchmark MT evaluations (Barrault et al., 2019; Guzmán et al., 2019), we opt for post-editing. Post-edits are grounded in actions that can be analyzed in terms of e.g. error types for further investigations, while direct assessments require expensive calibration (Bentivogli et al., 2018). Embedded in the community, these post-edit evaluations create an asset for the interaction of various agents: for the language technologists for domain adaptation, or for the content creators, curators, or translators for guidance in standardization or domain choice.

Results. Table 2 reports evaluation results in terms of BLEU evaluated on the benchmark test set from JW300, and human-targeted TER (HTER) (Snover et al., 2006), BLEU (Papineni et al., 2002) and ChrF (Popović, 2015) against human corrected model translations. For ha we find modest agreement between evaluators: Spearman’s ρ = 0.56 for sentence-BLEU measurements of the post-edits compared to the original hypotheses. Generally, we observe that the JW300 score is misleading, overestimating model quality (except yo). Training data size appears to be a more reliable predictor of generalization abilities, illustrating the danger of chasing a single benchmark. However, ig and yo both have comparable amounts of training data, JW300 scores, and carry diacritics, but exhibit very different evaluation performances, in particular on COVID. This can be explained by the large variations of ig as discussed above: Training data and model output are not consistent with respect to one dialect, while the evaluator had to decide on one. We also find differences in performance across domains, with the TED domain appearing easier for pcm and ig, while the yo model performs better on COVID.

Trg.    Train.     Autom.: JW300   Human: COVID                    Human: TED
lang.   size       BLEU ↑          HTER ↓   HBLEU ↑   HCHRF ↑      HTER ↓   HBLEU ↑   HCHRF ↑
ddn       6,937    22.30           1.11      0.27      0.08        -        -         -
pcm      20,214    23.29           0.98      3.03      0.19        0.84      9.76     25.16
fon      27,510    31.07           0.92     15.43     23.22        -        -         -
luo     136,459    34.33           -        -         -            1.26      7.90     20.88
ha      333,845    41.11           0.71     26.96     43.97        0.73     20.42     39.31
                                   0.64     26.56     46.71        -        -         -
ig      414,467    34.85           0.85     11.94     29.86        0.55     33.74     49.67
yo      415,100    38.62           0.09     85.92     89.90        0.51     49.22     58.41
sn      712,455    30.84           0.53     31.31     54.04        -        -         -
sw      875,558    48.94           -        -         -            0.32     60.47     78.67

Table 2: Evaluation results for translations from English. Metrics are computed based on Polyglot-tokenized translations. HTER are mean sentence-level TER scores computed with the Pyter Python package. BLEU and ChrF are computed with Sacrebleu and tokenize “none” (Post, 2018).

5 Conclusion

We proposed a participatory approach as a solution to sustainably scaling NLP research to low-resourced languages. Having identified key agents and interactions in the MT development process, we implement a participatory approach to build a community for African MT. In the process, we discovered successful strategies for distributed growth and communication, knowledge sharing and model building. In addition to publishing benchmarks and datasets for previously understudied languages, we show how the participatory design of the community enables us to conduct a human evaluation study of model outputs, which has been one of the limitations of previous approaches to low-resourced NLP. The sheer volume and diversity of participants, languages and outcomes, and the fact that for many of the languages featured, this paper constitutes the first time that human evaluation of an MT system has been performed, is evidence of the value of participatory approaches for low-resourced MT. For future work, we will (1) continue to iterate, analyze and widen our benchmarks and evaluations, (2) build richer and more meaningful datasets that reflect priorities of the stakeholders, (3) expand the focus of the existing community for African languages to other NLP tasks, and (4) help implement similar communities for other geographic regions with low-resourced languages.

Acknowledgements

We would like to thank Benjamin Rosman and Richard Klein for their invaluable feedback, as well as the anonymous EMNLP reviewers, and George Foster and Daan van Esch. We would also like to thank Google Cloud for the grant that enabled us to build baselines for languages with larger datasets.

References

Lishan Adam. 1997. Content and the web for african development. Journal of information science, 23(1):91–97.

Oliver Adams, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 937–947, Valencia, Spain. Association for Computational Linguistics.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy. Association for Computational Linguistics.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George F. Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. CoRR, abs/1907.05019.

Amittai Axelrod, Diyi Yang, Rossana Cunha, Samira Shaikh, and Zeerak Waseem, editors. 2019. Proceedings of the 2019 Workshop on Widening NLP. Association for Computational Linguistics, Florence, Italy.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019.
Findings of the 2019 con- ference on machine translation (wmt19). In Roee Aharoni, Melvin Johnson, and Orhan Firat. Proceedings of the Fourth Conference on Ma- 2019. Massively multilingual neural machine chine Translation (Volume 2: Shared Task Pa- translation. In Proceedings of the 2019 Confer- pers, Day 1), pages 1–61, Florence, Italy. Asso- ence of the North American Chapter of the Asso- ciation for Computational Linguistics. ciation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Emily M Bender and Batya Friedman. 2018. Data Short Papers), pages 3874–3884, Minneapolis, statements for natural language processing: To- Minnesota. Association for Computational Lin- ward mitigating system bias and enabling bet- guistics. ter science. Transactions of the Association for Computational Linguistics, 6:587–604. Neville Alexander. 2009. Evolving african ap- proaches to the management of linguistic diver- Luisa Bentivogli, Mauro Cettolo, Marcello Fed- sity: The acalan project. Language Matters, erico, and Federmann Christian. 2018. Ma- 40(2):117–132. chine translation human evaluation: an investi- gation of evaluation based on post-editing and Vamshi Ambati and Stephan Vogel. 2010. Can its relation with direct assessment. In 15th Inter- crowds build parallel corpora for machine trans- national Workshop on Spoken Language Trans- lation systems? In Proceedings of the NAACL lation 2018, pages 62–69. HLT 2010 Workshop on Creating Speech and Language Data with ’s Mechanical Andrew Caines. 2019. The geographic diversity Turk, pages 62–65, Los Angeles. Association of nlp conferences. for Computational Linguistics. Mauro Cettolo, Christian Girardi, and Marcello Vamshi Ambati, Stephan Vogel, and Jaime Car- Federico. 2012. Wit3: Web inventory of tran- bonell. 2010. Active learning-based elicitation scribed and translated talks. In Proceedings of for semi-supervised word alignment. 
In Pro- the 16th Conference of the European Associ- ceedings of the ACL 2010 Conference Short Pa- ation for Machine Translation (EAMT), pages pers, pages 365–370, Uppsala, Sweden. Associ- 261–268, Trento, Italy. ation for Computational Linguistics. Colin Cherry, Greg Durrett, George Foster, Reza Antonios Anastasopoulos and Graham Neubig. Haffari, Shahram Khadivi, Nanyun Peng, Xi- 2019. Should all cross-lingual embeddings ang Ren, and Swabha Swayamdipta, editors. speak english? 2019. Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource Prasad, Evan Crew, Chieu Nguyen, and NLP (DeepLo 2019). Association for Computa- Franc¸oise Beaufays. 2019. Writing across the tional Linguistics, Hong Kong, China. world’s languages: Deep internationalization for gboard, the google keyboard. arXiv preprint Christopher Cieri, Mike Maxwell, Stephanie arXiv:1912.01218. Strassel, and Jennifer Tracey. 2016. Selection criteria for low resource language programs. Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. In Proceedings of the Tenth International Con- 2016a. Multi-way, multilingual neural machine ference on Language Resources and Evalua- translation with a shared attention mechanism. tion (LREC 2016), pages 4543–4549, Portoroz,ˇ In Proceedings of the 2016 Conference of the Slovenia. European Language Resources Asso- North American Chapter of the Association for ciation (ELRA). Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, Cali- Guy De Pauw, Peter Waiganjo Wagacha, and fornia. Association for Computational Linguis- Gilles-Maurice de Schryver. 2009. The SAWA tics. corpus: A parallel corpus English - Swahili. In Proceedings of the First Workshop on Language Orhan Firat, Baskaran Sankaran, Yaser Al- Technologies for African Languages, pages 9– onaizan, Fatos T. Yarman Vural, and 16, Athens, Greece. Association for Computa- Kyunghyun Cho. 2016b. Zero-resource tional Linguistics. 
translation with multi-lingual neural machine translation. In Proceedings of the 2016 Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, Conference on Empirical Methods in Nat- and Haifeng Wang. 2015. Multi-task learning ural Language Processing, pages 268–277, for multiple language translation. In Proceed- Austin, Texas. Association for Computational ings of the 53rd Annual Meeting of the Asso- Linguistics. ciation for Computational Linguistics and the 7th International Joint Conference on Natural ∀, Iroro Orife, Julia Kreutzer, Blessing Sibanda, Language Processing (Volume 1: Long Papers), Daniel Whitenack, Kathleen Siminyu, Laura pages 1723–1732, Beijing, China. Association Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi for Computational Linguistics. Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Bonaventure FP Dossou and Chris C Emezue. Biljon, Arshath Ramkilowan, Adewale Akin- 2020. Ffr v1. 0: Fon-french neural machine faderin, Alp Oktem,¨ Wole Akin, Ghollah Kioko, translation. arXiv preprint arXiv:2003.12111. Kevin Degila, Herman Kamper, Bonaventure Kevin Duh. 2018. The multitarget ted talks task. Dossou, Chris Emezue, Kelechi Ogueji, and http://www.cs.jhu.edu/ kevinduh/a/ Abdallah Bashir. 2020. Masakhane – machine ˜ translation for africa. In “AfricaNLP” Work- multitarget-tedtalks/. shop at the 8th International Conference on David M Eberhard, Gary F. Simons, and Charles D. Learning Representations. Fenning. 2019. Ethnologue: Languages of the worlds. twenty-second edition. Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measure- Herman Engelbrecht. 2018. The deep learning ment scales in human evaluation of machine indaba report. ACM SIGMultimedia Records, translation. In Proceedings of the 7th Linguistic 9(3):5–5. Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Asso- PB English, MJ Richardson, and Catalina Garzon-´ ciation for Computational Linguistics. 
Galvis. 2018. From crowdsourcing to extreme citizen science: participatory research for en- Jiatao Gu, Hany Hassan, Jacob Devlin, and Vic- vironmental health. Annual review of public tor O.K. Li. 2018. Universal neural ma- health, 39:335–350. chine translation for extremely low resource lan- guages. In Proceedings of the 2018 Conference Daan van Esch, Elnaz Sarbar, Tamar Lucassen, of the North American Chapter of the Associ- Jeremy O’Brien, Theresa Breiner, Manasa ation for Computational Linguistics: Human Language Technologies, Volume 1 (Long Pa- 2020. The state and fate of linguistic diver- pers), pages 344–354, New Orleans, Louisiana. sity and inclusion in the nlp world. ArXiv, Association for Computational Linguistics. abs/2004.09095.

Adriana Guevara-Rukoz, Isin Demirsahin, Fei James Jowi, Charles Ochieng Ong’ondo, and He, Shan-Hui Cathy Chu, Supheakmungkol Mulu Nega. 2018. Building phd capacity in sub- Sarin, Knot Pipatsrisawat, Alexander Gutkin, saharan africa. Alena Butryna, and Oddur Kjartansson. 2020. Lucie-Aimee´ Kaffee, Hady ElSahar, Pavlos Vou- Crowdsourcing Latin American Spanish for giouklis, Christophe Gravier, Fred´ erique´ Lafor- low-resource text-to-speech. In Proceedings est, Jonathon Hare, and Elena Simperl. 2018. of The 12th Language Resources and Evalua- Mind the (language) gap: generation of multi- tion Conference, pages 6504–6513, Marseille, lingual wikipedia summaries from wikidata for France. European Language Resources Associ- articleplaceholders. In European Semantic Web ation. Conference, pages 319–334. Springer. Francisco Guzman,´ Peng-Jen Chen, Myle Ott, Yunsu Kim, Yingbo Gao, and Hermann Ney. 2019. Juan Pino, Guillaume Lample, Philipp Koehn, Effective cross-lingual transfer of neural ma- Vishrav Chaudhary, and Marc’Aurelio Ran- chine translation models without shared vocabu- zato. 2019. The flores evaluation datasets laries. In Proceedings of the 57th Annual Meet- for low-resource machine translation: Nepali– ing of the Association for Computational Lin- english and sinhala–english. In Proceedings guistics, pages 1246–1257, Florence, Italy. As- of the 2019 Conference on Empirical Meth- sociation for Computational Linguistics. ods in Natural Language Processing and the 9th International Joint Conference on Natu- Julia Kreutzer, Jasmijn Bastings, and Stefan Rie- ral Language Processing (EMNLP-IJCNLP), zler. 2019. Joey NMT: A minimalist NMT pages 6100–6113. toolkit for novices. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Reza Haffari, Colin Cherry, George Foster, Joint Conference on Natural Language Pro- Shahram Khadivi, and Bahar Salehi, editors. cessing (EMNLP-IJCNLP): System Demonstra- 2018. 
Proceedings of the Workshop on tions, Hong Kong, China. Deep Learning Approaches for Low-Resource NLP. Association for Computational Linguis- Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, tics, Melbourne. Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilin- Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra- gual denoising pre-training for neural machine ham Neubig, Orhan Firat, and Melvin John- translation. arXiv preprint arXiv:2001.08210. son. 2020. Xtreme: A massively multi- lingual multi-task benchmark for evaluating Vukosi Marivate, Tshephisho Sefara, Vongani cross-lingual generalization. arXiv preprint Chabalala, Keamogetswe Makhaya, Tumisho arXiv:2003.11080. Mokgonyane, Rethabile Mokoena, and Abio- dun Modupe. 2020. Investigating an approach Pratik Joshi, Christain Barnes, Sebastin Santy, for low resource language dataset creation, cu- Simran Khanuja, Sanket Shah, Anirudh Srini- ration and classification: Setswana and sepedi. vasan, Satwik Bhattamishra, Sunayana Sitaram, arXiv preprint arXiv:2003.04986. Monojit Choudhury, and Kalika Bali. 2019. Laura Martinus, Jason Webster, Joanne Moon- Unsung challenges of building and deploying samy, Moses Shaba Jnr, Ridha Moosa, and language technologies for low resource lan- Robert Fairon. 2020. Neural machine transla- guage communities. In Sixteenth International tion for south africa’s official languages. arXiv Conference on Natural Language Processing preprint arXiv:2005.06609. (ICON), Hyderabad, India. Arya McCarthy. 2017. The new digital divide: Pratik M. Joshi, Sebastin Santy, Amar Budhi- Language is the impediment to informa- raja, Kalika Bali, and Monojit Choudhury. tion access. https://hilltopicssmu. wordpress.com/2017/04/08/the-new- Maja Popovic.´ 2015. chrF: character n-gram f- digital-divide-language-is-the- score for automatic MT evaluation. In Proceed- impediment-to-information-access/. ings of the Tenth Workshop on Statistical Ma- Accessed: 2020-05-30. 
chine Translation, pages 392–395, Lisbon, Por- tugal. Association for Computational Linguis- Cindy A. McKellar. 2014. An english to xit- tics. songa statistical machine translation system for the government domain. In Proceedings of Matt Post. 2018. A call for clarity in reporting the 2014 PRASA, RobMech and AfLaT Interna- BLEU scores. In Proceedings of the Third Con- tional Joint Symposium, pages 229–233. ference on Machine Translation: Research Pa- pers, pages 186–191, Belgium, Brussels. Asso- Rajend Mesthrie. 1995. Language and social his- ciation for Computational Linguistics. tory: Studies in South African sociolinguistics. Matt Post, Chris Callison-Burch, and Miles Os- New Africa Books. borne. 2012. Constructing parallel corpora for six Indian languages via crowdsourcing. In Alice Millour and Karen¨ Fort. 2018. Toward Proceedings of the Seventh Workshop on Sta- a lightweight solution for less-resourced lan- tistical Machine Translation, pages 401–409, guages: Creating a POS tagger for alsatian Montreal,´ Canada. Association for Computa- using voluntary crowdsourcing. In Proceed- tional Linguistics. ings of the Eleventh International Conference on Language Resources and Evaluation (LREC- Ye Qi, Devendra Sachan, Matthieu Felix, Sar- 2018), Miyazaki, Japan. European Languages guna Padmanabhan, and Graham Neubig. 2018. Resources Association (ELRA). When and why are pre-trained word embed- dings useful for neural machine translation? Alp Oktem,¨ Mirko Plitt, and Grace Tang. 2020. In Proceedings of the 2018 Conference of the Tigrinya neural machine translation with trans- North American Chapter of the Association for fer learning for humanitarian response. arXiv Computational Linguistics: Human Language preprint arXiv:2003.11523. Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association Iroro Orife. 2020. Towards neural machine trans- for Computational Linguistics. lation for edoid languages. arXiv preprint arXiv:2003.10704. 
Sebastian Ruder, Ivan Vulic,´ and Anders Søgaard. 2019. A survey of cross-lingual word embed- Iroro Orife, David I Adelani, Timi Fasubaa, ding models. Journal of Artificial Intelligence Victor Williamson, Wuraola Fisayo Oyewusi, Research, 65:569–631. Olamilekan Wahab, and Kola Tubosun. 2020. Matthew Snover, Bonnie Dorr, Richard Schwartz, Improving yor\ub\’a diacritic restoration. Linnea Micciulla, and John Makhoul. 2006. A arXiv preprint arXiv:2003.10564. study of translation edit rate with targeted hu- man annotation. In Proceedings of association Kishore Papineni, Salim Roukos, Todd Ward, and for machine translation in the Americas, vol- Wei-Jing Zhu. 2002. Bleu: a method for auto- ume 200. matic evaluation of machine translation. In Pro- ceedings of the 40th Annual Meeting of the As- Alex S Taylor, Sianˆ Lindley, Tim Regan, sociation for Computational Linguistics, pages David Sweeney, Vasillis Vlachokyriakos, Lillie 311–318, Philadelphia, Pennsylvania, USA. As- Grainger, and Jessica Lingel. 2015. Data-in- sociation for Computational Linguistics. place: Thinking through the relations between data and community. In Proceedings of the Telmo Pires, Eva Schlinger, and Dan Garrette. 33rd Annual ACM Conference on Human Fac- 2019. How multilingual is multilingual BERT? tors in Computing Systems, pages 2863–2872. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Jorg¨ Tiedemann. 2012. Parallel data, tools and in- pages 4996–5001, Florence, Italy. Association terfaces in opus. In Lrec, volume 2012, pages for Computational Linguistics. 2214–2218. Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical compari- son. In Proceedings of the 54th Annual Meet- ing of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1661– 1670, Berlin, Germany. Association for Compu- tational Linguistics.

Elan Van Biljon, Arnu Pretorius, and Julia Kreutzer. 2020. On optimal transformer depth for low-resource language translation. arXiv preprint arXiv:2004.04418.

Ngugi Wa Thiong’o. 1992. Decolonising the mind: The politics of language in African literature. East African Publishers.

Xinyi Wang, Yulia Tsvetkov, and Graham Neubig. 2020. Balancing training for multilingual neural machine translation.

Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. 2019. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5721–5727, Hong Kong, China. Association for Computational Linguistics.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual bert? arXiv preprint arXiv:2005.09093.

Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. 2020. Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575, Austin, Texas. Association for Computational Linguistics.

A Demographics

Figure 2 shows the demographics for a subset of participants from a voluntary survey conducted in February 2020. Between then and now (May 2020), the community has grown by 30%, so these figures have to be seen as a snapshot. Nevertheless we can see that the educational background and the occupation are fairly diverse, with a majority of undergraduate students (not necessarily Computer Science).

[Figure 2, panel (a) Highest Level of Education: Secondary School, Undergraduate, Masters, PhD, Online Courses. Panel (b) Occupation: Student, Researcher, Data Scientist, Academic, Software Engineer, Linguist, Other.]

Figure 2: Education (a) and occupation (b) of a subset of 37 participants as indicated in a voluntary survey in February 2020.

B Evaluation Data

Table 3 reports the number of sentences that were post-edited in the human evaluation study reported in Section 4.

Language          Domain   Size
Nigerian Pidgin   COVID      39
Nigerian Pidgin   TED       100
Luo               TED        30
Yoruba            COVID      39
Yoruba            TED        80
Hausa             COVID      78
Hausa             TED       120
Igbo              COVID      39
Igbo              TED        50
Fon               COVID      39
Swahili           TED        55
Shona             COVID      39
Dendi             COVID      39

Table 3: Number of sentences for collected post-edits for TED talks and COVID surveys.

C Benchmark Scores

Table 4 contains BLEU scores on the JW300 test set for all benchmark models. BLEU scores are computed with Sacrebleu (Post, 2018) with tokenizer 'none' since the JW300 data comes tokenized with Polyglot.[18] The table also features the target categories according to (Joshi et al., 2020) as of 28 May 2020.

[18] https://polyglot.readthedocs.io/en/latest/index.html

Source    Target                           Best Test BLEU   Category
English   Afrikaans (Autshumato)           19.56            Rising Star
English   Afrikaans (JW300)                45.48            Rising Star
English   Amharic                           2.03            Rising Star
English   Arabic (TED, custom)              9.28            Underdog
English   Cinyanja                         30.00            Scraping by
English   Dendi                            22.30            Left Behind
English   Efik                             33.48            Left Behind
English   Ẹ̀dó                              12.49            Left Behind
English   Ẹ̀sán                              6.2             Left Behind
English   Fon                              31.07            Left Behind
English   Hausa (JW300+Tatoeba+more)       41.11            Hopeful
English   Igbo                             34.85            Scraping by
English   Isoko                            38.91            Left Behind
English   Kamba                            27.90            Left Behind
English   Kimbundu                         32.76            Left Behind
English   Kikuyu                           37.85            Scraping by
English   Lingala                          48.64            Scraping by
English   Luo                              34.33            Left Behind
English   Nigerian Pidgin                  23.29            Left Behind
English   Northern Sotho (Autshumato)      19.56            Scraping by
English   Northern Sotho (JW300)           15.40            Scraping by
English   Sesotho                          41.23            Scraping by
English   Setswana                         19.66            Hopeful
English   Shona                            30.84            Scraping by
English   Southern Ndebele (I)              4.01            Left Behind
English   Southern Ndebele (II)            26.61            Left Behind
English   kiSwahili (JW300)                48.94            Rising Star
English   kiSwahili (SAWA)                  3.60            Rising Star
English   Tigrigna (JW300)                  4.02            Hopeful
English   Tigrigna (JW300+Tatoeba+more)    14.88            Hopeful
English   Tiv                              44.70            Left Behind
English   Tshiluba                         42.52            Left Behind
English   Tshivenda                        49.57            Scraping by
English   Urhobo                           28.82            Left Behind
English   isiXhosa (Autshumato)            13.32            Hopeful
English   isiXhosa (JW300)                  6.00            Hopeful
English   Xitsonga (JW300)                  4.44            Scraping by
English   Xitsonga (Autshumato)            13.54            Scraping by
English   Yoruba                           38.62            Rising Star
English   isiZulu (Autshumato)              1.96            Hopeful
English   isiZulu (JW300)                   4.87            Hopeful
Efik      English                          33.68            Winner
French    Lingala                          39.81            Scraping by
French    Swahili Congo                    33.73            Left Behind
Hausa     English                          25.27            Winner
Yoruba    English                          39.44            Winner

Table 4: Benchmarks as of Nov 6, 2020. If not indicated, training domain is JW300. BLEU scores are computed with Sacrebleu (tokenize=’none’) on the JW300 test sets. Target languages are categorized according to (Joshi et al., 2020) as of 28 May 2020.
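
Since the test data is already tokenized, scoring with tokenization switched off amounts to treating every whitespace-separated token as a unit. The following is a minimal sketch of corpus-level BLEU under that assumption, written in plain Python for illustration. It is not Sacrebleu's implementation: the function names `ngram_counts` and `corpus_bleu` are hypothetical, only one reference per hypothesis is supported, and no smoothing is applied, so scores on real data may differ from the figures reported in the tables.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_order=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Inputs are pre-tokenized strings; tokens are split on whitespace only
    (illustrative stand-in for scoring with tokenization disabled).
    """
    matches = [0] * max_order
    totals = [0] * max_order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        # Unsmoothed: a zero precision at any order gives BLEU 0.
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(
        math.log(m / t) for m, t in zip(matches, totals)
    ) / max_order
    # Brevity penalty punishes output that is shorter than the reference.
    if hyp_len >= ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu(model_outputs, reference_translations)` with two equally long lists of tokenized sentences returns a single corpus score. Counts are pooled over all sentences before the precisions are computed, which is why a corpus score generally differs from the mean of sentence-level scores.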