Named Entity Recognition for African Languages
Total Page:16
File Type:pdf, Size:1020Kb
MasakhaNER: Named Entity Recognition for African Languages David Ifeoluwa Adelani1∗, Jade Abbott2∗, Graham Neubig3, Daniel D’souza4∗, Julia Kreutzer5∗, Constantine Lignos6∗, Chester PalenMichel6∗, Happy Buzaaba7∗ Shruti Rijhwani3, Sebastian Ruder8, Stephen Mayhew9, Israel Abebe Azime10∗ Shamsuddeen H. Muhammad11;12∗, Chris Chinenye Emezue13∗, Joyce NakatumbaNabende14∗ Perez Ogayo15∗, Anuoluwapo Aremu16∗, Catherine Gitau∗, Derguene Mbaye∗ Jesujoba Alabi17∗, Seid Muhie Yimam18, Tajuddeen Gwadabe19∗, Ignatius Ezeani20∗ Rubungo Andre Niyongabo21∗, Jonathan Mukiibi14, Verrah Otiende22∗ Iroro Orife23∗, Davis David∗, Samba Ngom∗, Tosin Adewumi24∗ Paul Rayson20, Mofetoluwa Adeyemi∗, Gerald Muriuki14, Emmanuel Anebi∗ Chiamaka Chukwuneke20, Nkiruka Odu25, Eric Peter Wairagala14 Samuel Oyerinde∗, Clemencia Siro∗, Tobius Saul Bateesa14, Temilola Oloyede∗ Yvonne Wambui∗, Victor Akinode∗, Deborah Nabagereka14, Maurice Katusiime14 Ayodele Awokoya26∗, Mouhamadane MBOUP∗ Dibora Gebreyohannes∗, Henok Tilaye∗ Kelechi Nwaike∗, Degaga Wolde∗, Abdoulaye Faye∗, Blessing Sibanda27∗ Orevaoghene Ahia28∗, Bonaventure F. P. Dossou29∗, Kelechi Ogueji30∗ Thierno Ibrahima DIOP∗, Abdoulaye Diallo∗, Adewale Akinfaderin∗ Tendai Marengereke∗, and Salomey Osei10∗ ∗ Masakhane, 1 Spoken Language Systems Group (LSV), Saarland University, Germany 2 Retro Rabbit, 3 Language Technologies Institute, Carnegie Mellon University 4 ProQuest, 5 Google Research, 6 Brandeis University, 8 DeepMind, 9 Duolingo 7 Graduate School of Systems and Information Engineering, University of Tsukuba, Japan. 10 African Institute for Mathematical Sciences (AIMSAMMI), 11 University of Porto 12 Bayero University, Kano, 13 Technical University of Munich, Germany 14 Makerere University, Kampala, Uganda,15 African Leadership University, Rwanda 16 University of Lagos, Nigeria, 17 Max Planck Institute for Informatics, Germany. 18 LT Group, Universität Hamburg, 19 University of Chinese Academy of Science, China 20 Lancaster University, 21 University of Electronic Science and Technology of China, China. 22 United States International University Africa (USIUA), Kenya. 23 NigerVolta LTI 24 Luleå University of Technology, 25 African University of Science and Technology, Abuja 26 University of Ibadan, Nigeria, 27Namibia University of Science and Technology 28 Instadeep, 29 Jacobs University Bremen, Germany, 30 University of Waterloo. Abstract 1 Introduction We take a step towards addressing the under Africa has over 2,000 languages (Eberhard et al., representation of the African continent in NLP 2020); however, these languages are scarcely rep research by creating the first large publicly resented in existing natural language processing available highquality dataset for named entity (NLP) datasets, research, and tools (Martinus and recognition (NER) in ten African languages, Abbott, 2019). 8 et al. (2020) investigate the rea bringing together a variety of stakeholders. sons for these disparities by examining how NLP We detail characteristics of the languages to for lowresource languages is constrained by sev help researchers understand the challenges that eral societal factors. One of these factors is the these languages pose for NER. We analyze our datasets and conduct an extensive empirical geographical and language diversity of NLP re evaluation of stateoftheart methods across searchers. For example, only 5 out of the 2695 both supervised and transfer learning settings. affiliations of authors whose work was published We release the data, code, and models in order at the five major NLP conferences in 2019 are to inspire future research on African NLP1. from African institutions (Caines, 2019). Many 1https://github.com/masakhane-io/ NLP tasks such as machine translation, text classi masakhane-ner fication, partofspeech tagging, and named entity recognition would benefit from the knowledge of such as the WikiAnn corpus (Pan et al., 2017) native speakers who are involved in the develop covering 282 languages, such datasets consist of ment of datasets and models. “silverstandard” labels created by transferring an In this paper, we focus on named entity recog notations from English to other languages through nition (NER)—one of the most impactful tasks in crosslingual links in knowledge bases. As the data NLP (Sang and De Meulder, 2003; Lample et al., comes from Wikipedia, it includes some African 2016). NER is a core NLP task in information languages, most with fewer than 10k tokens. extraction and an essential component of numer Other NER datasets for African languages are ous products including spellcheckers, localization SADiLaR (Eiselen, 2016) for ten South African of voice and dialogue systems, and conversational languages based on government data, and small agents. It also enables identifying African names, corpora of less than 2K sentences for Yorùbá (Al places and organizations for information retrieval. abi et al., 2020) and Hausa (Hedderich et al., 2020). African languages are underrepresented in this Additionally, LORELEI language packs (Strassel crucial task due to lack in datasets, reproducible and Tracey, 2016) were created for some African results, and researchers who understand the chal languages including Yorùbá, Hausa, Amharic, So lenges that such languages pose for NER. mali, Twi, Swahili, Wolof, Kinyarwanda, and In this paper, we take an initial step towards im Zulu, but are not publicly available. proving representation for African languages in the NER models Popular sequence labeling models NER task. More specifically, this paper makes the for NER are CRF (Lafferty et al., 2001), CNN following contributions: BiLSTM (Chiu and Nichols, 2016), BiLSTMCRF (i) We bring together language speakers, dataset (Huang et al., 2015), and CNNBiLSTMCRF (Ma curators, NLP practitioners, and evaluation and Hovy, 2016). The traditional CRF makes use experts to address the constraints facing of handcrafted features like partofspeech tags, African NER. Based on the availability of on context words and word capitalization. Neural net line news corpora and language annotators, work NER models are initialized with word em we develop NER datasets, models, and evalu beddings like Word2Vec (Mikolov et al., 2013), ation covering ten widely spoken African lan GloVe (Pennington et al., 2014) and FastText (Bo guages. janowski et al., 2016). More recently, pretrained language models such as BERT (Devlin et al., (ii) We curate NER datasets from local sources to 2019), RoBERTa (Liu et al., 2019), and LUKE (Ya ensure relevance of future research for native mada et al., 2020) produce stateoftheart results speakers of the respective languages. for the task. Multilingual variants of these models (iii) We train and evaluate multiple NER mod like mBERT and XLMRoBERTa (Conneau et al., els for all ten languages. Our experiments 2020) make it possible to train NER models for sev provide insights into the transfer across lan eral languages using transfer learning. Language guages, and highlight open challenges. specific parameters and adaptation to unlabelled data of the target language have yielded further (iv) We release the data, code, and model outputs gains (Pfeiffer et al., 2020a,b). in order to facilitate future research on the spe cific challenges of African NER. 3 Focus Languages Table 1 gives an overview of the languages consid 2 Related Work ered in this work, in terms of their language fam African NER datasets NER is a wellstudied se ily, number of speakers and the regions in Africa quence labeling task (Yadav and Bethard, 2018) where they are spoken. We chose to focus on these and has been subject of many shared tasks in dif languages due to the availability of online news ferent languages (Tjong Kim Sang, 2002; Tjong corpora, annotators, and most importantly because Kim Sang and De Meulder, 2003; ijc, 2008; they are widely spoken African languages. Both re Shaalan, 2014; Benikova et al., 2014). Most of gion and language family might indicate a notion the available datasets are from highresource lan of proximity for NER, either because of linguis guages. Although there have been efforts to cre tic features shared within that family, or because ate NER datasets for lowerresourced languages, data sources cover a common set of locally rele Language Family Speakers Region (except c), Igbo contains the following digraphs: Amharic AfroAsiaticEthio 26M East (ch, gb, gh, gw, kp, kw, nw, ny, sh). Semitic Hausa AfroAsiaticChadic 63M West Kinyarwanda (kin) makes use of 24 Latin char Igbo NigerCongoVoltaNiger 27M West acters with 5 vowels similar to English and 19 Kinyarwanda NigerCongoBantu 12M East consonants excluding q and x. Moreover, Kin Luganda NigerCongoBantu 7M East yarwanda has 74 additional complex consonants Luo Nilo Saharan 4M East (such as mb, mpw, and njyw).2 It is a tonal lan Nigerian English Creole 75M West Pidgin guage with three tones: low (no diacritic), high Swahili NigerCongoBantu 98M Central (signaled by “/”) and falling (signaled by “^”). The & East default word order is SubjectVerbObject. Wolof NigerCongo 5M West Senegambia & NW Luganda (lug) is a tonal language with subject Yorùbá NigerCongoVoltaNiger 42M West verbobject word order. The Luganda alphabet is Table 1: Language, family, number of speakers (Eber composed of 24 letters that include 17 consonants hard et al., 2020), and regions in Africa. (p, v, f, m, d, t, l, r, n, z, s, j, c, g), 5 vowel sounds represented in the five alphabetical symbols