Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof

Elodie Gauthier1, Laurent Besacier1, Sylvie Voisin2, Michael Melese3, Uriel Pascal Elingui4
1Univ. Grenoble Alpes, LIG, F-38000 Grenoble, France
2Laboratoire Dynamique Du Langage (DDL), CNRS - Université de Lyon, France
3Addis Ababa University, Ethiopia
4Voxygen SAS, Pleumeur-Bodou, France & Sénégal
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract
This article presents the data collected and the ASR systems developed for 4 sub-Saharan African languages (Swahili, Hausa, Amharic and Wolof). To illustrate our methodology, the focus is made on Wolof (a very under-resourced language), for which we designed the first ASR system ever built for this language. All data and scripts are available online on our GitHub repository.

Keywords: African languages, under-resourced languages, Wolof, data collection, automatic speech recognition

1. Introduction

Today's context is very favorable to the development of a market for speech in sub-Saharan African languages. People's access to information and communication technologies (ICT) is done mainly through mobile phones (and keyboards), and the need for services can be found in all sectors, from the highest priority (health, food) to more entertaining ones (games, social media). For this, overcoming the language barrier is needed. This work is done in the context of the ALFFA project1, where two main aspects are involved: fundamentals of speech analysis (phonetic and linguistic description of languages, dialectology) and speech technologies (automatic speech recognition (ASR) and text-to-speech (TTS)) for African languages. In the project, the developed ASR and TTS technologies will be used to build micro speech services for mobile phones in Africa. For this, fundamental knowledge of the targeted languages has to be upgraded, while African language technologies are still at their very beginning. For these reasons, the ALFFA project is truly interdisciplinary, since it does not only gather technology experts but also includes fieldwork linguists/phoneticians.

Paper contribution. This article presents the data collected and the ASR systems developed for 4 sub-Saharan African languages (Swahili, Hausa, Amharic and Wolof). To illustrate our methodology, the focus is made on Wolof (a very under-resourced language), for which we designed the first ASR system ever built for this language.

Paper outline. The outline of this paper is the following: first, section 2 summarizes the ASR systems already available on our GitHub repository2 for Swahili, Hausa and Amharic (together with the scripts and resources needed to reproduce the ASR experiments). Then, sections 3, 4 and 5 focus respectively on text data collection, speech data collection and ASR for Wolof. Finally, section 6 concludes this work and gives some perspectives.

2. ASR systems made available in the ALFFA project

2.1. Target languages of the project

Language choice for the project is mainly governed by population coverage and industrial perspectives. We focus on Hausa, spoken by around 60 million people as a first or second language. Hausa is part of the family of Afroasiatic languages. Specifically, Hausa is the most spoken language among the Chadic languages (Vycichl, 1990). It is the official language of northern Nigeria (around 30 million speakers) and a national language of Niger (around 9 million speakers), but it is also spoken in Ghana, Benin, Cameroon, Togo, Chad and Burkina Faso (Koslow, 1995). Hausa is considered to be a common language of West and Central Africa: it is spoken in many large commercial cities such as Dakar, Abidjan, Lome, Ouagadougou or Bamako. Hausa is a tonal language. About a quarter of the Hausa words come from the Arabic language, but Hausa was also influenced by French. Hausa has been written with an Arabic spelling since the beginning of the 17th century; this writing system is called 'Ajami. However, the official spelling is based on the Latin script and called Boko. There are different varieties of Hausa, depending on whether it is spoken in eastern (i.e. Kano, Zaria, Bauchi, Daura), western (i.e. Sokoto, Gobir, Tahoua), northern (i.e. Katsina, Maradi, Zinder) or southern (i.e. Zaria, Bauci) areas.

Enlarging the subregion coverage, we also consider the Bambara, Wolof and Fula languages, to cover the major West African languages. All the targeted languages cover more than half of the 300 million people of West Africa. Bambara (or Bamanankan) is largely spoken in West Africa by around 40 million people. It is mainly spoken in Mali, with 4 million speakers (census from the SIL3 in 2012), and is used as a lingua franca. Bambara is an agglutinative language. It uses the Roman script and is a tonal language.

Wolof belongs to the Atlantic languages, which are part of the Niger-Congo phylum. This language is spoken in Senegal, Gambia and Mauritania, where it is considered as a national language. In addition, Wolof is considered as one of the common languages of the area and is spoken by 10 million people in total. More details on the language are given in section 3.1.

Fula is a set of languages spoken in all West African countries by 70 million people. They include both tonal and non-tonal languages. In the ALFFA project, we have decided to focus on Pulaar, the western dialect of the Fula languages. Like Wolof, Pulaar belongs to the Atlantic languages, which are part of the Niger-Congo phylum. The dialect is mainly spoken in Senegal, by almost 3.5 million speakers (census in 2015 from the SIL), but also in dozens of other African countries. The Pulaar dialect uses a Latin-based orthography.

As far as East Africa is concerned, we designed an ASR system for Swahili, which is the most widespread language in the east of the continent: it is spoken by more than 100 million people4. Swahili (or Kiswahili) belongs to the Niger-Congo phylum. It is the most spoken language among the Bantu languages: it is used as a first language but also as a common language by a large population. In Tanzania, the language has an official status, and Swahili is a national language in numerous East and Central African countries. For centuries the Arabic script was used to write Swahili, but currently the standard writing system is based on the Latin script.

We also work on Amharic, a Semitic language which is part of the Afroasiatic phylum. It is the second most spoken language among the Semitic languages. It is spoken mostly in Ethiopia, by 22 million speakers (census from the SIL in 2010), where the language has an official status. Amharic uses an alphasyllabary writing system named fidel: each consonant-vowel sequence represents a discrete unit (Comrie, 2009).

2.2. Automatic Speech Recognition for 3 languages

ASR systems for Swahili, Hausa and Amharic have been built so far. All the data and scripts to build a complete ASR system for these 3 languages are already available to the public on the GitHub repository mentioned in section 1. We used the Kaldi speech recognition toolkit (Povey et al., 2011) for building our ASR systems. For the Swahili and Amharic ASR systems, the transcribed speech corpora, pronunciation lexicons and language models (LMs) are also made available, while for Hausa ASR, users first need to buy the corpus and the lexicon at ELDA.

The ASR system for Swahili was trained on about 10 hours of speech and the evaluation was done on about 1.8 hours. The language model was trained on text data grabbed from online newspapers (about 28M words) and cleaned as much as possible. More details on the Swahili corpus and how it was collected can be found in (Gelas et al., 2010).

For Hausa, the GlobalPhone Speech Corpus (Schlippe et al., 2012) was used. About 7 hours of data were used to train the system and 1 hour to evaluate it. The language model is composed of the transcribed speech data from the GlobalPhone corpus (41k words) and was converted into lower case. Finally, the lexicon contains 42,662 entries.

The Amharic system was retrained from the corpus described in (Tachbelie et al., 2014), which represents about 20 hours of speech data, while the testing set represents about 2 hours. Concerning the language model, it was created with SRILM using 3-grams, and the text was segmented into morphemes using Morfessor 2.0 (Creutz and Lagus, 2002).

A summary of the ASR performance (measured with Word Error Rate (WER)) obtained for the three languages is given in table 1, but more experimental details can be found in the README files of the GitHub repository.

Table 1: ASR performance for Swahili, Hausa and Amharic - HMM/SGMM acoustic modeling - all scripts available on GitHub to reproduce experiments.

Task                     WER (%)
Swahili broadcast news   20.7
Hausa read speech        10.0
Amharic read speech      8.7

3. Collecting text in Wolof

3.1. Our focus: Wolof language

As we said in 2.1., Wolof is mainly spoken in Senegal but also in Gambia and Mauritania. Even if people who speak Wolof understand each other, Senegalese Wolof and Gambian Wolof are two distinct languages: both have their own ISO 639-3 language code (respectively "WOL" and "WOF"). For our studies, we decided to focus on Senegalese Wolof, and more precisely on the urban Wolof spoken in Dakar.

About 90% of the Senegalese people speak Wolof, while almost 40% of the population uses it as their mother tongue5. Wolof was originally used as an oral language. The written form was developed at a later time, first in an Arabic orthography named Wolofal (going back to the pre-colonial period, through the spreading of Islam) and then in a Latin orthography (through the colonial period). Nowadays, the latter is officially in use. 29 Roman-based characters are used from the Latin script, and most of them are involved in digraphs standing for geminate and prenasalized stops. Wolof is a non-tonal language, with no diphthong and a moderate syllabic complexity.

The official language in Senegal is French. By definition, it is the language used in all the institutions of the government, like administrations and all administrative papers, courts, schools, etc. Consequently, Wolof is not learned at school. Its orthography has no real standard rules, and words can be written in many ways. We will see later in this paper that this causes some problems in ASR. Nonetheless, the Center of Applied Linguistics of Dakar (CLAD)6 coordinates the orthographic standardization of the Wolof language.

Finally, very few electronic documents are available for this language. We started from a first set of initial documents gathered as part of (Nouguier Voisin, 2002) and then tried to collect Wolof text from the Web.

3.2. Initial documents available

First of all, in order to build a textual corpus in Wolof, we used some texts collected for the purpose of (Nouguier Voisin, 2002). It gathers proverbs (Becker et al., 2000), stories from (Kesteloot and Dieng, 1989), transcripts of debates about healers, a song entitled "Baay de Ouza" and two dictionaries: "Dictionnaire wolof-français", written by Aram Fal, Rosine Santos and Jean-Léonce Doneux (Fal et al., 1990), and "Dictionnaire wolof-français et français-wolof", by Jean Léopold Diouf (Diouf, 2003). These files were in different formats such as PDF, MS Word/Excel, HTML, etc.

We converted all these documents to TXT format. We post-processed them by converting the text into lower case and by cleaning them of non-Wolof data (like section numbering, numbered lists, French notes, etc.) and punctuation. Overall, this represents a usable text corpus of 20,162 utterances (147,801 words).

3.3. Retrieving data from the Web

148k words is small for statistical language modeling, so we decided to collect more text data in Wolof using the Web. Very few documents written in Wolof are actually available. In addition, we looked for well-structured data (in accordance with syntactic rules). For this purpose, we found some PDF files from educational, religious and news websites. We extracted the contents of the Universal Declaration of Human Rights, the Bible and a book written by a humanist, in TXT format. In terms of post-processing, we removed symbols, punctuation characters and non-meaningful text (like section numbering, numbered lists, etc.) from the collected texts and converted the characters to lower case. In total, we got 197,430 additional words.

Also, given the limited data found manually, we decided to crawl the Wikipedia database to collect a larger amount of data in Wolof. We retrieved all the articles indexed in the Wolof language using Wikipedia Extractor (Attardi and Fuschetto, 2013). As this kind of open database is only lightly supervised, some articles can be multilingual. To remove non-Wolof text, we applied the Google Compact Language Detector (CLD2)7. As CLD2 cannot recognize Wolof but can detect the most widely used languages, we used the tool to filter out the non-Wolof languages detected (and hypothesized that the remaining documents were in Wolof). To improve the precision of the Wolof text retrieval, we also applied a data selection tool called XenC (Rousseau, 2013). After these two filtering passes and a quick manual cleaning, we obtained an additional collection of about 311k words. Table 2 summarizes the data finally retrieved from the Web.

4. Audio corpus in Wolof

From our initial textual corpus (described in section 3.2.), we randomly extracted a corpus of 6,000 utterances with lengths between 6 and 12 words. Then, from this small homogeneous corpus, we extracted 18 sub-corpora of 1,000 utterances that were used as recording sessions. We took advantage of the TTS recording campaign in Dakar (Senegal) of our project partner Voxygen to collect our read speech corpus. They recorded for us 18 native Wolof speakers (10 male, 8 female) from different socio-professional categories (journalist, student, manager, teacher, teleoperator) and from 24 to 48 years old, using a Samson G-Track microphone in a clean environment. A sociolinguistic questionnaire was also collected for each speaker.

The 18,000 recorded utterances represent 21h22mn of signal. We chose 14 speakers for the training set, 2 for the development set and 2 for the testing set. We checked that each set was composed of an equivalent quantity of each literary genre. The training / development / testing partition is given in table 3.

Table 3: Wolof speech corpus overview.

Set          Male  Female  #utterances  #tokens  Duration
Training     8     6       13,998       132,963  16 h 49 mins
Development  1     1       2,000        18,790   2 h 12 mins
Testing      1     1       2,000        18,843   2 h 20 mins
Total        10    8       17,998       170,596  21 h 21 mins

5. First ASR system for Wolof

We built two language models. The first 3-gram model, trained with the SRILM toolkit (Stolcke and others, 2002), was trained on 106,206 words (11,065 unigrams). These training data were generated from the initial text corpus (20,162 utterances, represented by 147,801 words), from which we removed the utterances used for the speech recordings (41,595 words removed). The perplexity of this language model is 294 on the dev set (7.1% of OOVs) and 301 on the test set (7.2% of OOVs). This model will be called LM1 in the next sections. The second 3-gram language model, also trained with the SRILM toolkit, is an interpolation between the first one and another model built from the cleaned data collected on the Web (509k words) mentioned in 3.3. It finally corresponds to 601,609 words (29,148 unique words). Its perplexity is 314 (5.4% of OOVs) on the dev set and 323 (5.1% of OOVs) on the test set. This model will be called LM2 in the next sections. Table 4 summarizes the language models built so far.

Footnotes:
1 see http://alffa.imag.fr
2 see https://github.com/besacier/ALFFA_PUBLIC/tree/master/ASR
3 see https://www.ethnologue.com/
4 see http://swahililanguage.stanford.edu
5 see http://www.axl.cefan.ulaval.ca/afrique/senegal.htm
6 see http://clad.ucad.sn
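The text post-processing described above (lowercasing, stripping punctuation, numbering and other non-linguistic symbols while keeping the accented characters of the Latin-based Wolof orthography) can be sketched as follows. This is an illustrative reconstruction under our own naming (clean_wolof_text is not a function from the project's scripts):

```python
import re
import unicodedata

def clean_wolof_text(raw: str) -> str:
    """Rough sketch of the cleaning applied to the collected documents:
    lowercase the text, drop list/section numbering, and remove every
    character that is neither a letter nor whitespace, so that accented
    Wolof letters (é, ë, à, ñ, ...) are preserved."""
    text = raw.lower()
    # Drop numbering such as "1." or "2.3." at the start of a line.
    text = re.sub(r"^\s*\d+(\.\d+)*\.?\s+", "", text, flags=re.MULTILINE)
    # Keep only letters (Unicode category L*) and whitespace.
    text = "".join(c for c in text
                   if unicodedata.category(c).startswith("L") or c.isspace())
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

The language-detection and data-selection passes (CLD2, XenC) would then run on the output of such a cleaning step.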

Table 2: Additional Wolof text retrieved from the Web.

Text                                   #utterances  #tokens
Universal Declaration of Human Rights  112          1,923
Silo's Message                         602          10,443
The Bible                              14,474       185,064
Wikipedia                              10,738       311,995
Total                                  25,926       509,425

Table 4: Summary table of the 2 language models built so far.

Language  #Words of the   Ngram perplexity  Out-of-vocabulary words (%)
model     textual corpus  dev     test      dev     test
LM1       ~106k           294     301       7.1     7.2
LM2       ~600k           314     323       5.4     5.1
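The out-of-vocabulary percentages reported in Table 4 are simple coverage counts of the test tokens against the LM vocabulary. A minimal sketch (oov_rate is our own helper name and the values below are toy data, not the paper's corpora):

```python
def oov_rate(test_tokens, lm_vocab):
    """Percentage of test tokens absent from the LM vocabulary,
    i.e. the per-set quantity reported in Table 4."""
    oov = sum(1 for token in test_tokens if token not in lm_vocab)
    return 100.0 * oov / len(test_tokens)

# Toy example: one token out of four is unknown to the vocabulary.
vocab = {"ndax", "dégg", "nga"}
tokens = ["ndax", "dégg", "nga", "wolof"]
rate = oov_rate(tokens, vocab)  # 25.0
```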

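LM2 is described as an interpolation between the initial-corpus model and a Web-data model. In probability terms, linear interpolation combines the two estimates as p(w|h) = λ·p1(w|h) + (1−λ)·p2(w|h). A toy sketch of that combination (the paper used SRILM's own interpolation, not this code; the function name and weight are ours):

```python
def interpolate(p1, p2, lam=0.5):
    """Linear interpolation of two conditional probability tables,
    mapping (history, word) -> probability. Entries missing from one
    table are treated as probability 0 for that component."""
    keys = set(p1) | set(p2)
    return {k: lam * p1.get(k, 0.0) + (1 - lam) * p2.get(k, 0.0)
            for k in keys}

# Toy example with two tiny "models" sharing one context.
lm_initial = {("naa", "ko"): 0.5}
lm_web = {("naa", "ko"): 0.25, ("naa", "la"): 0.25}
mixed = interpolate(lm_initial, lm_web, lam=0.5)
```

In practice the interpolation weight would be tuned on the development set, which is also how SRILM-style mixtures are usually configured.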
Regarding the pronunciation dictionary, we used a seed of 8,724 entries corresponding to the concatenation of the entries phonetically transcribed in (Fal et al., 1990) and (Diouf, 2003). From this, we trained a 7-gram pronunciation model for Wolof using Phonetisaurus (Novak, 2011), a Grapheme-to-Phoneme (G2P) conversion system, which allowed us to automatically transcribe into phonetic symbols the vocabulary of the LMs not yet phonetized. The vocabulary covering the LM1 language model is made of 15,575 entries (including 302 variants), while the one for LM2 has 32,039 entries. Each pronunciation dictionary was used for both the training and decoding stages.

We used the Kaldi speech recognition toolkit for building our ASR systems (and consequently, our acoustic models). Three systems based on different acoustic modeling techniques were built: one based on the classical hidden Markov model and Gaussian mixture model (HMM/GMM) approach, one based on the subspace Gaussian mixture model (SGMM) approach, and another one using deep neural networks (DNNs).

For the HMM/GMM system, the acoustic models were built using 13 Mel-frequency cepstrum coefficients (MFCCs) and Gaussian mixture models on 16.8h of training data. We trained triphone models employing 3,401 context-dependent states and 40k Gaussians. Besides that, we applied delta-delta coefficients on the MFCCs, a linear discriminant analysis (LDA) transformation and a maximum likelihood linear transform (MLLT) (Gopinath, 1998), as well as speaker adaptation based on feature-space maximum likelihood linear regression (fMLLR) (Gales, 1998).

To build the DNNs, we trained the network using state-level minimum Bayes risk (sMBR) (Kingsbury, 2009). The network had seven layers; each of the six hidden layers had 1024 hidden units. The network was trained from 11 consecutive frames (5 preceding and 5 following frames) of the same MFCCs as in the GMM systems. Furthermore, the same HMM states were used as targets of the DNN. The initial weights of the network were obtained using Restricted Boltzmann Machines (RBMs) (Hinton, 2010), then fine-tuning was done using Stochastic Gradient Descent.

Table 5: Wolof ASR systems performance for different AMs and LMs - with speaker adaptation.

                        WER (%)
Acoustic model    LM1 (~106k)      LM2 (~600k)
                  dev     test     dev     test
HMM/GMM           33.51   37.95    31.70   35.97
  no diacritic    31.60   36.20    29.70   34.10
SGMM+MMI          30.37   35.24    28.56   33.56
  no diacritic    28.40   33.50    26.60   31.70
DNN+sMBR          29.10   35.45    27.21   33.63
  no diacritic    27.20   33.60    25.10   31.70

We can see in table 5 the performance of the first Wolof ASR system trained using the HMM/GMM, SGMM and DNN approaches. These baseline performances show that our first Wolof ASR system can reach a WER around 30%. Since we have only two speakers per evaluation set, the standard deviation is large: indeed, we observed up to 6% of WER difference between speakers, on both the dev and test sets. Also, the results are close between the language models. Our hypothesis is that LM2 does not provide much better performance because it comes for the most part from the Web, so its vocabulary and syntax are quite different from those of the speech corpus (which is similar to LM1). We observe, however, a small improvement of the performance with LM2 compared to LM1.

We also analysed the outputs of the ASR systems and found that many errors are due to normalization issues in the text. Wolof is a morphologically complex language and some errors can appear at this stage. In addition, as we said in 3.1., a same word can have several surface forms, especially when it contains diacritics. For example, the word "jél" can also be written "jël" and mean "took" or "steal". Also, the word "randal" can be written "ràndal" and mean the same thing, "keep away", but it can also be spelled "dandal". Regarding this orthography problem, the word "céetal", for example, can be spelled "sétal" and means "organise the wedding". Concerning diacritics, we also evaluated the WER after removing all of them from both the hypotheses and the references. These results are also provided in table 5. The WERs are slightly lower in that case, and the difference between both numbers (WER with/without diacritics) corresponds to diacritic errors. Moreover, we can observe that the DNNs bring some improvement on the dev set, but not on the test set, in comparison with the SGMMs.

6. Conclusion

This paper presented the data collected and the ASR systems developed for 4 sub-Saharan African languages (Swahili, Hausa, Amharic and Wolof). All data and scripts are available online on our GitHub repository8. More precisely, we focused on the Wolof language by explaining our text and speech collection methodology. We trained two language models: one from data we already owned and another one with the addition of data crawled from the Web. Finally, we presented the first ASR system ever built for this language. The system which obtains the best score is the one using LM2 and the DNNs, for which we got 27.21% of WER.

Perspectives. In the short run, we intend to improve the quality of LM2 by using neural networks. We are also currently working on a duration model for the Wolof and Hausa ASR systems. In the medium term, we want to deal with text normalization. As we have seen, words can have several surface forms and can be written in many ways. By selecting one among the several possible variants, we want to normalize the orthography of our corpus. We expect that this disambiguation will reduce the number of OOVs and improve the quality of our language model. Last but not least, we continually develop and improve Lig-Aikuma (Blachon et al., 2016), a fork of the Aikuma application (Bird et al., 2014). Lig-Aikuma is intended for field linguists and is designed to make data collection easier and faster. Through the use of this application, the collected data will be checked by a Wolof expert (verifying the correspondence between each audio file and its transcription), so that we can assess our system on more reliable data.

Footnotes:
7 see https://github.com/CLD2Owners/cld2
8 see https://github.com/besacier/ALFFA_PUBLIC/tree/master/ASR
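The diacritic-insensitive scoring discussed in section 5 amounts to stripping combining marks from both the hypothesis and the reference before computing the WER. A minimal sketch of that evaluation (our own helper names; the actual scoring was done with the toolkit's tools):

```python
import unicodedata

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions,
    deletions) computed with a single rolling row."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(ref, hyp):
    """Word Error Rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def strip_diacritics(words):
    """Map e.g. 'jél' and 'jël' to the same form 'jel' by removing
    combining marks after NFD decomposition."""
    return ["".join(c for c in unicodedata.normalize("NFD", w)
                    if not unicodedata.combining(c)) for w in words]

# With diacritics, 'jél' vs 'jël' counts as a substitution;
# without them, the two hypotheses match.
ref = "jël naa ko".split()
hyp = "jél naa ko".split()
scored = wer(ref, hyp)                                        # one substitution
scored_nodia = wer(strip_diacritics(ref), strip_diacritics(hyp))  # 0.0
```

The gap between the two scores is exactly the "with/without diacritics" difference reported in Table 5.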

7. Bibliography

Attardi, G. and Fuschetto, A. (2013). Wikipedia Extractor. Medialab, University of Pisa.

Becker, C., Martin, V., and Mbodj, M. (2000). Proverbes et énigmes wolof cités dans le dictionnaire volof-français de Mgr Kobès et du RP Abiven.

Bird, S., Hanke, F. R., Adams, O., and Lee, H. (2014). Aikuma: A mobile app for collaborative language documentation. ACL 2014, page 1.

Blachon, D., Gauthier, E., Besacier, L., Kouarata, G.-N., Adda-Decker, M., and Rialland, A. (2016). Parallel speech collection for under-resourced language studies using the Lig-Aikuma mobile device app. SLTU. Submitted.

Comrie, B. (2009). The World's Major Languages. Routledge.

Creutz, M. and Lagus, K. (2002). Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Volume 6, pages 21–30. Association for Computational Linguistics.

Diouf, J. L. (2003). Dictionnaire wolof-français et français-wolof. KARTHALA Editions.

Fal, A., Santos, R., and Doneux, J. L. (1990). Dictionnaire wolof-français: suivi d'un index français-wolof. Karthala.

Gales, M. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language, volume 12, pages 75–98.

Gelas, H., Besacier, L., Rossato, S., and Pellegrino, F. (2010). Using automatic speech recognition for phonological purposes: study of vowel length in Punu (Bantu B40).

Gopinath, R. A. (1998). Maximum likelihood modeling with Gaussian distributions for classification. In Proc. of ICASSP, pages 661–664.

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. UTML TR 2010-003, Dept. Computer Science, University of Toronto.

Kesteloot, L. and Dieng, B. (1989). Du tieddo au talibé, volume 2. Editions Présence Africaine.

Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural network acoustic modeling. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3761–3764, April.

Koslow, P. (1995). Hausaland: The Fortress Kingdoms. Chelsea House Pub.

Nouguier Voisin, S. (2002). Relations entre fonctions syntaxiques et fonctions sémantiques en wolof. Ph.D. thesis, Lyon 2.

Novak, J. R. (2011). Phonetisaurus: A WFST-driven phoneticizer. The University of Tokyo, Tokyo Institute of Technology, pages 221–222.

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit.

Rousseau, A. (2013). XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, 100:73–82.

Schlippe, T., Djomgang, E. G. K., Vu, N. T., Ochs, S., and Schultz, T. (2012). Hausa large vocabulary continuous speech recognition. In SLTU, pages 11–14.

Stolcke, A. et al. (2002). SRILM - an extensible language modeling toolkit. In INTERSPEECH.

Tachbelie, M. Y., Abate, S. T., and Besacier, L. (2014). Using different acoustic, lexical and language modeling units for ASR of an under-resourced language - Amharic. Speech Communication, 56:181–194.

Vycichl, W. (1990). Les langues tchadiques et l'origine chamitique de leur vocabulaire. In Relations interethniques et culture matérielle dans le bassin du lac Tchad: actes du IIIème Colloque MEGA-TCHAD, Paris, ORSTOM, 11-12 septembre 1986, page 33. IRD Editions.
