arXiv:2103.07762v2 [cs.CL] 16 Mar 2021

OkwuGbé: End-to-End Speech Recognition for Fon and Igbo

Bonaventure F. P. Dossou∗ (Jacobs University Bremen)
Chris C. Emezue∗ (Technical University of Munich)

∗ These authors contributed equally to this work.

OkwuGbé, the union of two words from Igbo (Okwu = speech) and Fon (Gbé = languages), means the speech of languages, and signifies studying and integrating languages.

Abstract

Language is inherent and compulsory for human communication. Whether expressed in a written or spoken way, it ensures understanding between people of the same and different regions. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in machine translation and other text-based areas of NLP. However, there is still very little comparable research in speech recognition for African languages. Interestingly, some of the unique properties of African languages affecting NLP, like their diacritical and tonal complexities, have a major root in their speech, suggesting that careful speech interpretation could provide more intuition on how to deal with the linguistic complexities of African languages for text-based NLP. OkwuGbé is a step towards building speech recognition systems for African low-resourced languages. Using Fon and Igbo as our case study, we conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages. We present a state-of-the-art ASR model for Fon, as well as benchmark ASR model results for Igbo. Our linguistic analyses (for Fon and Igbo) provide valuable insights and guidance into the creation of speech recognition models for other African low-resourced languages, as well as guide future NLP research for Fon and Igbo. The source code of the Fon and Igbo models has been made publicly available.

1 Introduction

Automatic speech recognition (ASR, or speech-to-text) is a language technology where spoken words are identified, interpreted, and converted to text. ASR is changing the way information is accessed, processed, and used. In recent years, ASR has achieved state-of-the-art performance for most Western and Asian languages such as English, French, Chinese, and Japanese, due to the availability of a large quantity of quality speech resources. African languages, on the other hand, are still lacking ASR applications. This is mainly due to the lack or unavailability of speech resources for most African Languages (ALs).

African languages received very little to no research in natural language processing (NLP) in the past decade (Joshi et al., 2020a; Andrew Caines, 2019), prompting recent efforts geared towards improving the state of African languages in NLP (∀ et al., 2020a,b; Abbott and Martinus, 2018; Siminyu et al., 2020). However, there are few works being done on speech for these African languages, as more emphasis is being placed on text. Due to the largely acoustic nature of African languages (mostly tonal, diacritical, etc.), a careful speech analysis of African languages could provide better insight for text-based NLP involving African languages, as well as supplement the textual data needed for machine translation or language modelling. This is what inspired OkwuGbé: automatic speech recognition for several African languages, in an effort to unify them.

In this paper, we introduce OkwuGbé, ASR systems for two low-resourced languages: Fon and Igbo. We show that using end-to-end deep neural networks (E2E DNN) with Connectionist Temporal Classification (CTC) (Graves et al., 2006a), without using language models (LMs), which usually require huge amounts of data for training, allows us to achieve promising results. We also demonstrate that leveraging an attention mechanism (Bahdanau et al., 2016) improves the performance of acoustic models.

In section 2, we give an overview of the Fon and Igbo languages. Then we discuss some related work in section 3 and examine the data and the data processing techniques we employed in this research in section 4. In section 5, we explore the model architectures used for our experiments, and we show our evaluation in section 6.

2 Overview of Fon and Igbo

In this section, we give an extensive overview of both languages. Table 1 summarises our analysis for the reader.

2.1 Fon

Fon (also known as Fongbe) is a native language of Benin Republic, spoken on average by more than 2.2 million people in Benin, in Nigeria, and Togo (Eberhard et al., 2020). Fon belongs to the Niger-Congo-Gbe languages family, and is a tonal, isolating and left-behind language according to Joshi et al. (2020b), with a basic Subject-Verb-Object (SVO) word order. There are currently about 53 different dialects of the Fon language spoken throughout Benin (Lefebvre and Brousseau, 2002; Capo, 1991; Eberhard et al., 2020).

Its alphabet is based on the Latin alphabet, with the addition of the letters ɔ, ɖ, ɛ, and the digraphs gb, hw, kp, ny, and xw. There are 10 vowel phonemes in Fon: 6 said to be closed [i, u, ĩ, ũ], and 4 said to be opened [ɛ, ɔ, a, ã]. There are 22 consonants (m, b, n, ɖ, p, t, d, c, j, k, g, kp, gb, f, v, s, z, x, h, xw, hw, w). Fon has two phonemic tones: high and low. High is realized as rising (low–high) after a consonant. Basic disyllabic words have all four possibilities: high-high, high-low, low-high, and low-low. In longer phonological words, like verb and noun phrases, a high tone tends to persist until the final syllable. If that syllable has a phonemic low tone, it becomes falling (high–low). Low tones disappear between high tones, but their effect remains as a downstep. Rising tones (low–high) simplify to high after high (without triggering downstep) and to low before high (Lefebvre and Brousseau, 2002; Capo, 1991).

Fon makes extensive use of a rich system of tense or aspect markers and expresses many semantic features by lexical items, and the periphrastic constructions often used are of a more agglutinative nature (Capo, 1986). Fon nominals are generally preceded by a prefix consisting of a vowel (e.g., the word aɖú: 'tooth'). The quality of this vowel is restricted to the subset of non-nasal vowels (Capo, 1991; Duthie and Vlaardingerbroek, 1981).

Reduplication is a morphological process in which the root or stem of a word, or part of it, is repeated. Fon, like the other Gbe languages, makes extensive use of reduplication in the formation of new words, especially in deriving nouns, adjectives, and adverbs from verbs. For instance, the verb lã, which means to cut (both in Fon and Ewe), is nominalized by reduplication, yielding lãlã: the act of cutting. Triplication is used to intensify the meaning of adjectives and adverbs (Capo, 1991; Duthie and Vlaardingerbroek, 1981).

2.2 Igbo

Igbo is a native language of the Igbo people, an ethnic group mainly located in the southeastern part of Nigeria, in states like Abia, Anambra, Ebonyi, Enugu, and Imo, as well as in the northeast of Delta state and the southeast of Rivers state. Outside Nigeria, it is spoken a little in Cameroon and Equatorial Guinea. Igbo belongs to the Benue-Congo group of the Niger-Congo language family and is spoken by over 27 million people (Eberhard et al., 2020). There are approximately 30 Igbo dialects, some of which are not mutually intelligible. To illustrate the complexity of Igbo, we quote Nwaozuzu (2008): "...almost every community living as few as three kilometers apart has its few linguistic peculiarities. If these tiny peculiarities are isolated and considered to be able to assign linguistic dependence to each of these communities, we shall therefore be boasting of not less than one thousand languages in what we now know as the Igbo language."

This large number of dialects and peculiarities inspired the development of a standardized spoken and written Igbo in 1962, called the Standard Igbo (Ohiri-Aniche, 2007), which is what we refer to when we say "Igbo". However, studies have shown that there are many sounds (mainly consonants) found in some other dialects of Igbo which are lacking in the Standard Igbo orthography. For example, Achebe et al. (2011) discovered about 50 unique speech sounds in Igbo. Morphologically, Igbo is an agglutinating language, with a compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive). Igbo also uses reduplication, like Fon. Igbo has 28 consonants and 8 vowels, totalling 36 letters of the alphabet.

Characteristics | Fon | Igbo
Spoken where | mostly in Benin. Some part of Nigeria and Togo | mostly in southeastern Nigeria. A little bit in the Equatorial Guinea and Cameroon
Speakers (Eberhard et al., 2020) | 2.2 million | 22 million
Language family tree | Niger-Congo > Atlantic Congo > Volta-Congo > Kwa > Gbe > Fon | Niger-Congo > Atlantic Congo > Volta-Congo > Volta-Niger > Igboid > Igbo
Language structure | Agglutinating language | Agglutinating language
Alphabet structure | 32 letters: 22 consonants, 10 vowels | 36 letters: 28 consonants, 8 vowels
Special alphabets besides Latin | ɔ, ɖ, ɛ, ã, gb, hw, kp, ny, and xw | ch, gb, gh, gw, kp, kw, nw, ny, and sh
Tonal language? | Yes. 3 tones: high (/), low (\) and down step (-) | Yes. 4 tones: high tone (/), low (\), down step (−), and down drift (−)
Phoneme structure | 10 vowel phonemes and 22 consonant phonemes | 28 consonant phonemes and 8 vowel phonemes
Nasalization | is present | is present
Number of dialects | about 53 | about 30
Reduplication? | Yes, especially in deriving nouns, adjectives, and adverbs from verbs | Yes, sometimes in compounding word formation: e.g., ugbo (vehicle) + igwe (iron) = ugboigwe (locomotive)
Code-switching? | No | Yes

Table 1: Summary analysis of Fon and Igbo

The sound system of Igbo consists of eight vowel phonemes and 28 consonant phonemes (Ikekeonwu, 1999). There are four different types of tones in the Igbo language (Odinye and Udechukwu, 2016): high tone (/), low tone (\), down step (−) (Rice, 1992), and down drift (−). Down drift is only observed in Igbo sentences, because one can raise or lower the pitch before a sentence is completed. Tone is an integral part of a word in Igbo. It is the interface of phonology and syntax in Igbo because it performs both lexical and grammatical functions (Nkamigbo, 2012). Igbo has three syllable types: consonant + vowel (the most common syllable type), vowel, or syllabic nasal.

Code-switching, the act of "alternation of two languages during speech" (Poplack, 1979), is very common among Igbo-English bilingual speakers, making it an interesting feature for speech recognition research. Therefore, we go deeper into it.

G.O and Mbagwu (2007) and Obiamalu and Mbagwu (2010) carried out extensive research on code-switching among Igbo speakers, which they classified into three types: borrowing, quasi-borrowing and true code-switching (see Table 2). Borrowing in Igbo arises when words from English are inserted into Igbo during speech and the words go through phonological and morphological transformation (mark -> maakigo, table -> tebulu). This is usually because the speaker cannot quickly find the Igbo equivalent of the word, or because such an equivalent does not exist. This is illustrated by examples 1 and 2. In quasi-borrowing, the Igbo equivalents of the English words exist, but the English words are more often used by both monolinguals and bilinguals. The word may or may not be assimilated into Igbo, as in borrowing. This is illustrated by examples 3 and 4. The third situation, called true code-switching, occurs when the speaker purposely chooses to use the English word, even though the Igbo equivalent is known and always used. This is most common among Igbo-English bilinguals. Examples 5 and 6 are good illustrations.

Type | Examples (Igbo | English) | Explanation
borrowing | 1. O maakigo (mark) ule ahu. | He has marked the examination. 2. Ọ dị na tebulu (table) | It is on the table | The words 'mark' and 'table' had been borrowed and assimilated into Igbo because they are not readily available in Igbo.
quasi-borrowing | 3. Obi zụrụ car ọhụrụ. | Obi bought a new car. 4. Ọbi zụrụ ụgbọ ala ọhụrụ. | Obi bought a new car | Even though Igbo has words for 'car', some bilinguals still use English words.
true code-switching | 5. Fela na ecriticize onye ọbụla. | Fela criticizes everybody. 6. Jesus turnụrụ water ọ ghọrọ wine. | Jesus turned water into wine | These cases are true code-switching because the Igbo words for 'criticize', 'turn', 'water' and 'wine' are readily available in Igbo, but the speaker chooses to use the English equivalents.

Table 2: Code-switching types and examples. Adapted from (G.O and Mbagwu, 2007)

3 Related Works

In this section, we review some related works according to the data resources, the model architectures and the state of ASR research for Fon and Igbo.

Previous works according to data resources: Xu et al. (2020) classified previous works on ASR, according to data resources, into rich-resource, low-resource and unsupervised settings, as shown in Table 3.

In the rich-resource setting, a large amount of paired speech and text data is available for training. This amounts to hundreds of hours by multiple speakers. Furthermore, a pronunciation lexicon is also leveraged during training for better results. These are the languages with ASR models already deployed in the industry. English is a main example of this setting. In the low-resource setting, there are only about a dozen minutes of single-speaker high-quality paired data, and a few hours of multi-speaker low-quality paired data. Compared to the rich-resource setting, these data resources contain less paired data.

In the extremely low-resource setting, which is where our work lies, there are little to no paired speech data resources, a very low online presence, and sometimes no developed pronunciation lexicon rule or language model to improve ASR models. Some of these languages also have only a little unpaired multi-speaker data. This is the case for many African languages.

Setting | Rich-Resource | Low-Resource | Extremely Low-Resource
Data: pronunciation lexicon | ✓ | ✓ | ✗
Data: paired data (single-speaker, high-quality) | dozens of minutes | several minutes | ✗
Data: paired data (multi-speaker, high-quality) | hundreds of hours | dozens of hours | several hours
Data: unpaired speech (single-speaker, high-quality) | ✓ | dozens of hours | ✗
Data: unpaired speech (multi-speaker, low-quality) | ✓ | ✓ | dozens of hours
Data: unpaired text | ✓ | ✓ | very few
Related Work (ASR) | (Chorowski et al., 2014a; Chiu et al., 2018; Chan et al., 2016; Hannun et al., 2014; Li et al., 2019; Mamyrbayev et al., 2020; Chorowski et al., 2014b; Hori et al., 2019; Rosenberg et al., 2019; Schneider et al., 2019) | (Tjandra et al., 2017) | Our Work, Laleye et al. (2016); Xu et al. (2020); Baevski et al. (2020); Liu et al. (2020); Ren et al. (2019)

Table 3: Data sources to build ASR models and the corresponding related works in the different settings. Adapted from (Xu et al., 2020)

Previous works according to model architecture: While traditional phonetic-based approaches (Hidden Markov Models) have produced considerable results in the past, we focus on end-to-end speech recognition with deep learning (Chorowski et al., 2014a,b; Hannun et al., 2014; Amodei et al., 2015; Chan et al., 2016; Chiu et al., 2018), because these approaches have been shown to produce better results, with little dependence on hand-crafted features and phoneme dictionaries.

Chorowski et al. (2014b) introduced an end-to-end continuous ASR system using a bidirectional recurrent neural network (RNN) encoder with an RNN decoder that aligns the input and the output sequences using the attention mechanism. The model achieved a word error rate (WER) of 18.57% on the TIMIT data set. Hannun et al. (2014) and Amodei et al. (2015) presented a state-of-the-art ASR system using E2E DNNs. They introduced a system that does not use any hand-designed language component, nor even the concept of a "phoneme". Their result was achieved, as the authors stated in their original paper, through a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques and language models.

Following the promising features that E2E DNNs offer, Mamyrbayev et al. (2020) showed in their recent studies that using them with CTC works without the need for direct inclusion of language models.

State of ASR resources for African languages: Open-sourced data is one of the driving forces of research in any deep learning field, including ASR, because it fosters experimenting, training and developing better models. The Open Speech and Language Resources (OpenSLR) [1] is a platform with open-sourced speech and language resources, such as training corpora for speech recognition, for public use. It currently contains many languages (both high- and low-resource). However, we discovered that of the 2000 African languages, only Yoruba (Gutkin et al., 2020), Nigerian Pidgin and four South African languages (Afrikaans, Sesotho, Setswana, isiXhosa) (van Niekerk et al., 2017) are present, as of the time of writing this paper. Furthermore, they contain very few samples (tens to a few hundred audio hours), compared to their high-resource counterparts (thousands to millions of audio hours). This scarcity of open resources for the development of ASR for low-resourced African languages is one of the major factors behind the low state of ASR research in African languages. Although there may be some non-open resources for some of these African languages, they come with huge licensing fees, among other limiting factors.

[1] http://openslr.org/index.html

State of ASR research for Fon and Igbo: Fon, unlike Igbo, has little to no digital presence. With very few speakers, and almost no online presence, there has understandably been very little tonal analysis or ASR research for this language. The few works that exist are mostly by researchers who are native speakers of the language.

To the best of our knowledge, the only notable efforts to build an ASR system for Fon are those of Laleye (2016) and Laleye et al. (2016), with a word error rate (WER) of 14.83%. This result was achieved by building two LMs, and only after normalizing and removing the diacritics, whose crucial importance for both performant ASR and neural machine translation (NMT) has been demonstrated by Orife (2018) and Dossou and Emezue (2020). This is discussed later in section 4.3.2. Their best model with diacritics scored a WER of 44.04%.

Igbo, on the other hand, has had a lot of tonal and speech analysis research in the past decade, but no public research on E2E DNN ASR, to the best of our knowledge. We opine that this is largely because 1) many earlier works on Igbo focused solely on tonal analysis (Odinye and Udechukwu, 2016; Nkamigbo, 2012), and 2) there is a lack of open-source speech data to encourage further research on exploring ASR with deep learning methods, which are known to be data-hungry.

4 Speech-to-Text Corpora and Data Preprocessing

4.1 Fon Speech-to-Text Corpus

We got our speech dataset for Fon from the existing Fon speech corpus [2], which was built through the tedious task of recording texts pronounced by native speakers of Fongbe (8 women and 20 men) in a noiseless environment. The recordings are sampled at a frequency of 16 kHz. The 28 native speakers spoke around 1500 phrases (daily conversations domain). These recordings were made with the LigAikuma [3] Android application. The minimum length of a speech sample is 2 seconds and the maximum is 5 seconds, giving us an average content length of 4 seconds. Overall, around 10 hours of speech data have been collected.

[2] https://paperswithcode.com/dataset/fongbe-speech-recognition
[3] https://lig-aikuma.imag.fr/

The global data set has been split into training, validation and test data sets. The training set contains 8 hours of speech (8235 speech samples), the validation data set contains 1500 speech samples, and the test data set contains 669 speech samples. The text corpus, made of the 1500 sentences used to build the speech data set, has been scraped from BéninLangues [4].

[4] https://beninlangues.com/fongbe

4.2 Igbo Speech-to-Text Corpus

It was very hard to find a data set of Igbo audio samples and their transcripts. We realized that there is a great lacuna: even though there has been much research on Igbo phonology, there have not really been any (public) efforts to gather a speech-to-text data set for it.

The data set for our experiments on Igbo was obtained through a license from the Linguistic Data Consortium (LDC2019S16: IARPA Babel Igbo Language Pack) (Nikki et al., 2019). It contains approximately 207 hours of Igbo conversational and scripted telephone speech collected in 2014 and 2015, along with corresponding transcripts. The data set (hereafter called IgboDataset) is made up of telephone calls representing the Owerri, Onitsha, and Ngwa dialects spoken in Nigeria, sampled at 8 kHz, with a few sampled at 48 kHz. The gender distribution among speakers is approximately equal; speakers' ages range from 16 years to 67 years. The telephone calls were made using different telephones (e.g., mobile, landline) from a variety of environments including the street, a home or office, a public place, and inside a vehicle. The diacritics were originally removed from the transcripts.

Unlike the Fon data set, which was modern and contained very clean audio, IgboDataset had old speech patterns and contained many noisy audio samples. Therefore, we had to implement a number of cleaning strategies (like filtering based on length of words, upsampling, and exploring different mel-spectrogram units and numbers of Fast Fourier Transform (FFT) bins). The FFT is an algorithm that computes the discrete Fourier transform (DFT, described in section 4.3.1) of a sequence. Our cleaning strategies gave us a reduced data set of 2.5 hours, which we split into train, dev and test sets of 4000, 100 and 100 audio samples respectively. To test the importance of our pre-cleaning, we trained the model on both the large uncleaned data set and our cleaned version (results are discussed in section 6).

4.3 Data Preprocessing

4.3.1 Speech Preprocessing

Speech signals are made up of amplitudes and frequencies. Amplitudes only convey the loudness of the sound recording, which is not very informative on its own. To get more information from our speech samples, we decided to map them into the frequency domain. Two of the known techniques enabling the conversion of speech data from its time domain to its frequency domain are the Fourier Transformation (FT) and the Discrete Fourier Transformation (DFT) (Boashash, 2003; Bracewell, 2000).

FT is a mathematical concept that converts a continuous signal from the time domain to the frequency domain. FT decomposes a continuous signal into its frequency components, giving the frequencies present in the signal and their respective magnitudes. DFT, similarly to FT, converts a sequence, considered as a discrete signal, into its frequency components.

However, applying only the FFT just gives frequency values without time information. To make sure we preserve frequency, time and amplitude information about the speech samples, in a reasonable and adequate range, we decided to use mel-spectrograms.

The mel-scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another (Stevens et al., 1937). It is constructed such that sounds of equal distance from each other on the mel-scale also «sound» to humans as if they are equal in distance from one another. A popular formula to convert f hertz (a frequency measure) into m mels is the O'Shaughnessy formula, described in O'Shaughnessy's book (O'Shaughnessy, 1987) and defined as

$$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right). $$

A spectrogram [5] is an image, a three-dimensional (3D) representation that displays how the energy of the frequency components of the speech changes over time. The abscissa represents time, the ordinate axis represents frequency, and amplitudes are shown by the darkness of a precise frequency at a particular time: low amplitudes are represented with a light-blue color, and very high amplitudes are represented by dark red.

[5] http://www.glottopedia.org/index.php/Spectrogram

There are two types of spectrograms: broad-band spectrograms and narrow-band spectrograms [6].

• Broad-band spectrograms have higher temporal resolution, allowing the detection of changes in frequency over small intervals of time. However, they usually do not help in making good frequency distinctions, as the time interval for each spectrum is small.

• Narrow-band spectrograms have higher frequency resolution, and a larger time interval for every spectrum, than broad-band spectrograms: this allows the detection of very small differences in frequencies. Moreover, they show individual harmonic structures, which are vibration frequency folds of the speech, as horizontal striations.

[6] http://www.cas.usf.edu/~frisch/SPA3011_L07.html

A mel-spectrogram is hence a spectrogram with the mel-scale. In our study, we decided to use narrow-band mel-spectrograms as input features for the model.

We used 512 as the length of the FFT window and 512 as the hop-length (number of samples between successive frames), with a Hann window whose size is set to the length of the FFT window.

For handling the audio data, we used the torchaudio utility from PyTorch (Paszke et al., 2019). We used Spectrogram Augmentation (SpecAugment) (Park et al., 2019) as a form of data augmentation: we cut out random blocks of consecutive time and frequency dimensions. Mel-spectrograms were generated from each speech sample with some fine-tuned hyper-parameters:

• sample_rate: sample rate of the audio signal, set to 16000 hertz (16 kHz) for Fon and 8000 hertz (8 kHz) for Igbo.

• n_mels: number of mel filterbanks, set to 128 for Fon and 64 for Igbo.

4.3.2 Text Preprocessing

Scholars like Orife (2018) and Dossou and Emezue (2020) have shown in their studies that keeping the diacritics reduces lexical ambiguity and provides more morphological information to neural machine translation models. Additionally, diacritics relay the pronunciation tone and the sound generated, leading to an improved understanding of the sentences and their contexts.

A diacritic is a glyph added to a letter or basic glyph. Diacritics appear above or below a letter, or in some other position such as within the letter or between two letters. Some diacritical marks, such as the acute (´) and grave (`), are often called accents. For Fon and Igbo, a good example of their importance can be seen in Table 5, where we demonstrate that removing diacritics from a word could lead to ambiguity and result in confusion about the word. Therefore, we preprocessed the textual data without removing the diacritics. The results of the text data preprocessing on the Fon data set are presented in Table 4. For Igbo, unfortunately, the IgboDataset was originally stripped of its diacritics. Therefore, we were not able to encode any diacritical information.

Lang. | Source Text (preprocessed text same as the source) | English Translation
Fon | xovɛ sin mì tlala kɛnklɛn bo ná mì nùɖé | I'm super hungry, I'm starving. Please give me some food
Fon | a ná gbɔ nù ɖú bó gbɔ sin nù bó gbɔ xó ɖɔ káká yi azãn tɔn gbè | You will not be eating, neither drinking nor speaking for the next three days

Table 4: Fon sentences before and after the preprocessing with their English Translations

Fon (a) | Igbo (b)
to (ambiguous and uncertain) | akwa (ambiguous and uncertain)
tó (ear), tò (sea), tô (country) | ákwà (cloth), akwá (egg), ákwá (to cry)

Table 5: Examples of tonal inflection with Fon and Igbo

5 Models Architectures and Experiments

5.1 Preliminaries

We take a moment to briefly define the task of ASR mathematically. We have a training set of n samples: $\chi = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. Each utterance $x^{(i)}$, sampled from the training set, is a time-series of length $T^{(i)}$ where every time-slice is a vector of audio features, $x_t^{(i)}$, $t = 0, \ldots, T^{(i)} - 1$. The goal of ASR is to generate a transcript $\hat{y}^{(i)}$ for each utterance $x^{(i)}$. In order to achieve this, we use an architecture consisting of one or more recurrent neural networks (RNNs), since they are best equipped for time-series data.

In order to generate $\hat{y}^{(i)}$ for a given utterance $x^{(i)}$, the RNN models the probability of picking a character from the character set. To put it mathematically, at each output time-step t, the RNN makes a prediction over characters, $p(c_t \mid x^{(i)})$, where $c_t$ is either a character of the alphabet (including its diacritics) or the blank symbol. In Fon for example, due to our inclusion of the diacritics, we have $c_t \in$ {a, b, c, ..., z, ɖ, ɛ, ɔ, á, à, é, è, í, ì, î, ï, ó, ò, ú, ù, ..., space, apostrophe, comma, fullstop, blank}. Apostrophe, white-space, comma, and full stop characters have been added to denote word boundaries. For Igbo, $c_t \in$ {a, b, ..., z, apostrophe, comma, fullstop, blank}, which is the same as the character set for the English language. This constraint arose because the speech dataset for Igbo was already stripped of the diacritics, as explained in section 4.2.

5.2 Model Architecture

Related works have shown that we can increase model capacity, in order to efficiently learn from large speech datasets, by adding more hidden layers rather than making each layer larger. Graves et al. (2013) explored increasing the number of consecutive bidirectional recurrent layers, and Amodei et al. (2015) proposed Deep Speech 2, which, among a number of optimization techniques, extensively applied batch normalization (Ioffe and Szegedy, 2015) to the deep RNNs. Furthermore, Chorowski et al. (2014a) showed that the use of the Bahdanau (additive) attention mechanism (Bahdanau et al., 2016) could reduce the phoneme or word error rate (WER) of the ASR model. This is possible because the attention mechanism forces the decoder to make a monotonic alignment and hence improves the predictions.

[Figure 1: Architecture of the best model for Fon and Igbo, with an expansion of each component of the rCNN block, BiLSTM block and BiGRU block. Diagram labels: narrow-band mel-spectrogram input; N-layer CNN; rCNN block (Layer Norm, GeLU, Dropout, CNN, twice, with a residual addition); Fully Connected Layer; BiLSTM block; BiGRU block (Layer Norm, GeLU, Bidirectional GRU producing outputs and hidden state); attention mechanism (hidden state, alignment score, softmax, attention weights, context vector, concatenation with the output); Batch Normalization; Fully Connected Layer; CTC Loss.]

Our model architecture, shown in Figure 1, draws inspiration from these research findings. While our model at its core is similar to Deep Speech 2, our key improvements are:

• the exploration of the combination of Bidirectional Long Short Term Memory networks (BiLSTMs) and Bidirectional Gated Recurrent Units (BiGRUs) for low-resource ASR.

• the integration of the Bahdanau attention mechanism, whose effect has been demonstrated on Fon.

Our model has two main neural network modules: N blocks of Residual Convolutional Neural Networks (rCNNs) (He et al., 2015, 2016) and M blocks each of BiLSTMs and BiGRUs. Each rCNN block is made of two CNN layers, two dropout layers and two normalization layers (Ba et al., 2016) for the CNN inputs. We leveraged the power of convolutional neural networks (CNNs) to extract abstract features from the speech converted into spectrograms. RNNs process the abstract features produced by the rCNNs, step by step, making a prediction for each frame while using context from previous frames. We use BiRNNs (Schuster and Paliwal, 1997) because we want the context of not only the frames before each timestep, but the following ones as well. This helps the model make better predictions.

In our scenario, BiLSTMs and BiGRUs act respectively as encoder and decoder blocks. Each block in turn produces outputs and hidden states that are fed to the next block. The last hidden state of the last BiGRU block is used to compute the attention weights and the context vector, which is concatenated to the BiGRU output to serve as the final output. Dropout layers are stacked in between and towards each block output to prevent overfitting (Srivastava et al., 2014).

The output from the models is a probability matrix for characters, which is fed into a greedy decoder. We implemented the greedy decoder suggested by Graves et al. (2006b) to extract what the model believes are the highest-probability characters that were spoken. This simple decoder, albeit with no linguistic information, has been shown to produce useful transcriptions (Zenkel et al., 2017).

Our model is trained to predict the probability distribution of every character of the alphabet at each timestep from the narrow-band mel-spectrogram we feed it. Traditional ASR models require aligning the transcribed text to the speech before training, and the model is trained to predict specific labels at specific timesteps. However, with the CTC loss function (Graves et al., 2006a), this alignment step is skipped and the model directly learns to align the transcript itself during training.

5.2.1 Implementing the attention mechanism

Let us recall here that we have 5 blocks of rCNNs, and 3 blocks each of BiLSTMs and BiGRUs. The Bahdanau attention was slightly modified and implemented as shown in Figure 1, in the following steps:

• An input x from the stacked blocks of rCNNs is fed to the BiLSTM (encoder), where it is first layer-normalized. The output is then passed through a GeLU activation function, whose output is fed to the BiLSTM layers of the current block. Within the current encoder block, each of the BiLSTM layers produces an output, a hidden state and a cell state. The encoder output is lastly passed to a dropout layer, and is used as input for the next encoder block. The output of the last encoder is used as input for the first BiGRU (decoder) block.

• The input of the decoder goes successively through a normalization layer and a GeLU activation function. The output is then fed to the BiGRU layers of the current decoder, which produce the decoder output and a hidden state. In common NLP, or more specifically in neural machine translation (NMT), the hidden state is a 2-dimensional tensor. However, this is not the case here, since our initial input features from the stack of rCNN layers are 4-dimensional tensors of shape (batch, channel, feature, time), and the output from the stack of encoders is a 3-dimensional tensor of shape (batch_size, number of features, hidden size).

• Using the shapes of the hidden state h and the decoder output x, three dense layers are created accordingly, using the following PyTorch-style pseudo-code:

    w1 = nn.Linear(x.size(2), x.size(2) // 2)
    w2 = nn.Linear(h.size(2), h.size(2))
    v = nn.Linear(x.size(2) // 2, x.size(2))

During our manipulation, in the attempt to adapt the NMT concept to ours, we encountered a few cases where the shapes of the decoder output and the hidden state were mismatching. To resolve that, we augmented the hidden state tensor along the second axis (axis = 1) with the desired number of zeros. This was done for two reasons:

1. the concatenation will not affect the original hidden state tensor;
2. tanh(x | x = 0) = 0: 0 hence acts like a neutral element.

Once the decoder output and the hidden state tensor have compatible shapes, we use them to compute the attention scores s in the conventional way.

• epochs: 500 (for Fon) and 1000 (for Igbo), with early stopping after 100 epochs.

• activation_function: GeLU (Hendrycks and Gimpel, 2016)

• optimizer: AdamW (Loshchilov and Hutter, 2019) (Fon), Nesterov accelerated descent (Nesterov, 1983) (Igbo)

We used two metrics to evaluate the models: the Character Error Rate (CER) and the WER. WER uses the Levenshtein distance (Levenshtein, 1966) to compare the reference text and the hypothesis text at the word level. Even with a low CER, the WER can be high: hence the lower the WER, the better the model. For the training processes, we used the WER of the validation data set to select the best weights and parameters.
m = nn.T anh() s = v(m(w1(x)+ w2(h)) 6 Results

• The attention weights are then computed by Models Fon Igbo (without cleaning) (rCNN) CER WER CER WER passing s through a softmax operation. The +BiGRUs 22.0831 59.66 - - context vector, cv , which is got by multi- +BiLSTMs 24.2783 61.46 - - +BiLSTMs+BiGRUs 16.9581 47.05 56.00 64.00 plying the attention weights by x, is concate- +BiLSTMs+BiGRUs+Attn 18.7976 42.50 50.12 (92.67) 55.03 (97.99) 7 nated to the decoder output x to get the atten- (Laleye et al., 2016) - 44.09 - - tion features: Table 6: CER (%), and WER (%) of different models on Fon and Igbo (original and cleaned) test datasets. n = nn.Softmax() attention_weights = n(s) We present our findings using 5 blocks of rC- cv = attention_weights ∗ x NNs with: • blocks solely of BiGRUs x = concatenate(cv,x) 3 • 3 blocks solely of BiLSTMs • The output of the current decoder block is • 3 blocks each of BiLSTMs and BiGRUs the attention output passed through a dropout • 3 blocks each of BiLSTMs and BiGRUs + At- layer, and taken as input for the next decoder. tention mechanism. Table 6 presents the results of different models 5.3 Experiments architectures on the test data set for Fon and Igbo. Throughout our experiments, we explored differ- ent model architectures with various number of 6.1 Results for Fon convolutional and bidirectional recurrent layers. We show that implementing attention mechanism The best ASR model, shown on Figure 1 has reduced the WER by 5%. Our Fon ASR model 5 blocks of rCNNs, 3 blocks each of BiLSTMs outperformed the current Fon ASR model with di- and BiGRUs, with attention incorporated into each acritics of Laleye et al. (2016). component of the BiGRU block. Also, we used Table 7 shows some decoded predictions and a form of Batch Normalization throughout the targets from the Fon ASR model, which are very model. similar. 
Common mistakes (colored), happen most We got the best evaluation results with the fol- often at a character level where a character is either lowing hyper-parameters: omitted, added or replaced by another one. The na- • max learning_rate: 5e-4 (for Fon), 3e-4 (for tive speakers included in this study have testified Igbo) to the fact that those mismatched words or char- • batch_size: 20 (for Fon), 20 (for Igbo). acters are often practically not distinguishable in • (N, M): (5, 3) for Fon and Igbo. 7WER of the best model with diacritics from (Laleye • embedding_size: 512 et al., 2016)

Fon Decoded Predictions Fon Decoded Targets

ª ª ª ª ª ª ª ª ª tª ce xwe y y din t n ci gblagadaa t ce xwe y y din t n ci gblagadaa

eo mi sa aakpan nu mi eo mi sa akpan nu mi ¢

fit¢ a gosin xwe yi gbe fit a go sin xwe yin gbe ¢¡

e kpo kp¢¡é e kpo kp e

¢ ¡ ¢ ¢ ¢ ¡ ¢ akw¢ c gbadé jí axim akw j gbadé ji axim

Table 7: Decoded Predictions and Targets of the best Fon ASR Model speaking. The model source code is open-sourced For Igbo language, the next stage involves incor- at: https://github.com/bonaventuredossou/fonasr porating diacritical information in the ASR model. We have begun by gathering new speech dataset 6.2 Results for Igbo which include the diacritics. An important observation we show in Table 6 is the effect of the state of audio samples on the 8 Acknowledgement model’s ability to learn: for our large IgboData We are grateful to Professor Graham Neubig of with background noise, uneven audio length, low Carnegie Mellon University for coming to our aid sampling rate, etc, the model found it very diffi- by providing us with an Amazon EC2 instance cult to learn the speech representations. Taking for training our model when we were very low on time to sieve through the data (in Section 4.2) mit- computational resources. We also thank Dr Frejus igated this issue by helping the model learn the ab- Layele for giving us access to the Fon data set, and stract features better, albeit on a small training set. Dr Iroro Orife, for his guidance on designing the While the model is currently still training on more ASR model and cleaning the IgboData. epochs (with hope of improving), our preliminary results serve as a benchmark for ASR on Igbo. The source code for the model can be accessed at References https://github.com/chrisemezue/IgboASR. Jade Z. Abbott and Laura Martinus. 2018. Towards In Table 6, one may observe the large difference neural machine translation for african languages. between the CER and WER on Fon language, un- CoRR, abs/1811.05467. like Igbo. We strongly believe that this is due to Ike Achebe, Clara Ikekeonwu, Cecilia Eme, Nolue the fact that the character set for Fon contains all Emenanjo, and Nganga Wanjiku. 2011. A com- the possible diacritics for each letter of the Fon posite synchronic alphabet of igbo dialects (csaid). 
alphabet, making it extremely large (compared to IADP, New York. the set of Igbo characters which had no diacritical Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl information). To further support our claim, a close Case, Jared Casper, Bryan Catanzaro, Jingdong observation of the targets and predictions in Table Chen, Mike Chrzanowski, Adam Coates, Greg Di- 7 reveals that the errors are mostly due to omis- amos, Erich Elsen, Jesse Engel, Linxi Fan, Christo- pher Fougner, Tony Han, Awni Hannun, Billy Jun, sion or mismatch of diacritics for the characters Patrick LeGresley, Libby Lin, Sharan Narang, An- (‘e’ predicted instead of ‘é ’ in row 4 or a space drew Ng, Sherjil Ozair, Ryan Prenger, Jonathan added between ‘go’ and ‘sin’ in row 3 ). Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, 7 Future Work Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. 2015. Deep speech 2: End-to-end speech Our work shows promising results considering the recognition in english and mandarin. small training sizes, and we have presented a state- Andrew Caines. 2019. The Geographic Diversity of of-the-art ASR model for Fon. As future path- NLP Conferences. way to improve the proposed models, we are ex- Jimmy Ba, J. Kiros, and Geoffrey E. Hinton. 2016. ploring approaches like leveraging language mod- Layer normalization. ArXiv, abs/1607.06450. els, deeper model structures, transformers and crowd-sourcing/compiling speech-to-text data set Alexei Baevski, Michael Auli, and Abdelrahman Mo- for Igbo and Fon. hamed. 2020. Effectiveness of self-supervised pre– training for speech recognition. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Sa- gio. 2016. Neural machine translation by jointly lomon Kabongo, Musie Meressa, Espoir Murhabazi, learning to align and translate. Orevaoghene Ahia, Elan van Biljon, Arshath Ramk- ilowan, Adewale Akinfaderin, Alp Öktem, Wole B. Boashash. 2003. 
Time-Frequency Signal Analysis Akin, Ghollah Kioko, Kevin Degila, Herman Kam- and Processing: A Comprehensive Reference. Ox- per, Bonaventure Dossou, Chris Emezue, Kelechi ford: Elsevier Science. Ogueji, and Abdallah Bashir. 2020b. Masakhane – machine translation for africa. R. N Bracewell. 2000. The Fourier Transform and Its Applications. Boston: McGraw-Hill. Obiamalu G.O and D.U. Mbagwu. 2007. Code- switching: Insights from code-switched en- Hounkpati B. C. Capo. 1986. Renaissance du gbe, une glish/igbo expressions. pages 51–53. Awka Journal langue de l’Afrique occidentale: étude critique sur of Linguistics and Languages Vol 3. les langues ajatado, l’ewe, le fon, le gen, laja, le gun, etc. Université du Bénin, Institut national des A. Graves, Abdel rahman Mohamed, and Geoffrey E. sciences de l’éducation. Hinton. 2013. Speech recognition with deep recur- rent neural networks. 2013 IEEE International Con- Hounkpati B. C. Capo. 1991. A comparative phonol- ference on Acoustics, Speech and Signal Processing, ogy of Gbe. Foris Publications. pages 6645–6649. William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Alex Graves, Santiago Fernández, Faustino Gomez, Vinyals. 2016. Listen, attend and spell: A neural and Jürgen Schmidhuber. 2006a. Connectionist network for large vocabulary conversational speech temporal classification: Labelling unsegmented se- recognition. In ICASSP. quence data with recurrent neural networks. In Pro- ceedings of the 23rd International Conference on Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Ro- Machine Learning, ICML ’06, page 369–376, New hit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, York, NY,USA. Association for Computing Machin- Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekate- ery. rina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. 2018. State-of-the-art Alex Graves, Santiago Fernández, Faustino Gomez, speech recognition with sequence-to-sequence mod- and Jürgen Schmidhuber. 2006b. Connectionist els. 
temporal classification: Labelling unsegmented se- quence data with recurrent neural networks. In Pro- Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, ceedings of the 23rd International Conference on and Yoshua Bengio. 2014a. End-to-end continuous Machine Learning, ICML ’06, page 369–376, New speech recognition using attention-based recurrent York, NY,USA. Association for Computing Machin- nn: First results. ery. Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, Alexander Gutkin, I¸sın Demir¸sahin, Oddur Kjartans- and Yoshua Bengio. 2014b. End-to-end continuous son, Clara Rivera, and Kó.lá Túbò.sún. 2020. De- speech recognition using attention-based recurrent veloping an Open-Source Corpus of Yoruba Speech. nn: First results. In Proceedings of Interspeech 2020, pages 404–408, Bonaventure F. P. Dossou and Chris C. Emezue. 2020. Shanghai, China. International Speech and Commu- Ffr v1.1: Fon-french neural machine translation. nication Association (ISCA). Alan S. Duthie and R. K. Vlaardingerbroek. 1981. Bib- Awni Hannun, Carl Case, Jared Casper, Bryan Catan- liography of GBE: (Ewe, Gen, Aja, Xwala, Fon, zaro, Greg Diamos, Erich Elsen, Ryan Prenger, San- Gun, etc.): publications "on" and "in" the language. jeev Satheesh, Shubho Sengupta, Adam Coates, and Basler Afrika Bibliographien. Andrew Y. Ng. 2014. Deep speech: Scaling up end– to-end speech recognition. David M. Eberhard, Gary F. Simons, and Charles Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian D. Fennig (eds.). 2020. : Languages of Sun. 2015. Deep residual learning for image recog- the world. twenty-third edition. nition. ∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Sun. 2016. Identity mappings in deep residual net- Fagbohungbe, Solomon Oluwole Akinola, Sham- works. suddee Hassan Muhammad, Salomon Kabongo, Sa- lomey Osei, et al. 2020a. Participatory research for Dan Hendrycks and Kevin Gimpel. 2016. 
Gaussian er- low-resourced machine translation: A case study in ror linear units (gelus). african languages. Findings of EMNLP. T. Hori, R. Astudillo, T. Hayashi, Y. Zhang, S. Watan- ∀, Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel abe, and J. Le Roux. 2019. Cycle-consistency train- Whitenack, Kathleen Siminyu, Laura Martinus, ing for end-to-end speech recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Daniel van Niekerk, Charl van Heerden, Marelie Davel, pages 6271–6275. Neil Kleynhans, Oddur Kjartansson, Martin Jansche, and Linne Ha. 2017. Rapid development of TTS cor- Clara Ikekeonwu. 1999. Igbo”, handbook of the inter- pora for four South African languages. In Proc. In- national phonetic association. terspeech 2017, pages 2178–2182, Stockholm, Swe- den. Sergey Ioffe and Christian Szegedy. 2015. Batch nor- malization: Accelerating deep network training by Adams Nikki, Bills Aric, Conners Thomas, David reducing internal covariate shift. Anne, Dubinski Eyal, Fiscus Jonathan G., Gann Ketty, Harper Mary, Kaiser-Schatzlein Alice, Kazi Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Michael, Malyska Nicolas, Melot Jennifer, Onaka Bali, and Monojit Choudhury. 2020a. The state and Akiko, Paget Shelley, Ray Jessica, Richardson Fred, fate of linguistic diversity and inclusion in the NLP Rytting Anton, and Sinney Shen. 2019. Iarpa world. In Proceedings of the 58th Annual Meet- babel igbo language pack iarpa-babel306b-v2.0c ing of the Association for Computational Linguistics, ldc2019s16. web download. pages 6282–6293, Online. Association for Computa- tional Linguistics. Linda Chinelo Nkamigbo. 2012. A phonetic analysis of igbo tone. ISCA Archive, The Third International Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Symposium on Tonal Aspects of Languages. Bali, and Monojit Choudhury. 2020b. The state and fate of linguistic diversity and inclusion in the nlp G.I. Nwaozuzu. 2008. 
Dialects of the Igbo Language. world. University of Nigeria Press. F. A. A. Laleye, L. Besacier, E. C. Ezin, and C. Mo- G. Obiamalu and Davidson U. Mbagwu. 2010. Moti- tamed. 2016. First automatic fongbe continuous vations for code-switching among igboenglish bilin- speech recognition system: Development of acous- guals: A linguistic and sociopsychological survey. tic models and language models. In 2016 Federated OGIRISI: a New Journal of African Studies, 5:27– Conference on Computer Science and Information 39. Systems (FedCSIS), pages 477–482. Sunny Odinye and Gladys Udechukwu. 2016. Igbo Frejus Adissa Akintola Laleye. 2016. Contributions and chinese tonal systems: a comparative analysis. à the á study and à automatic speech recognition Ogirisi: A new Journal of African Studies, Volume in Fongbe. Theses, Universit é du Littoral C ô te 1:48. d’Opale. Chinyere Ohiri-Aniche. 2007. Stemming the tide of Claire Lefebvre and Anne-Marie Brousseau. 2002. A centrifugal forces in igbo orthography. Dialectical grammar of Fongbe. Mouton de Gruyter. Anthropology, 31(4):423–436. V. I. Levenshtein. 1966. Binary Codes Capable of Cor- Iroro Orife. 2018. Attentive sequence-to-sequence recting Deletions, Insertions and Reversals. Soviet learning for diacritic restoration of yorùbá language Physics Doklady, 10:707. text.

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Douglas O’Shaughnessy. 1987. Speech communica- Ming Liu. 2019. Neural speech synthesis with trans- tion: human and machine. Journal of the Acoustical former network. Proceedings of the AAAI Confer- Society of America. ence on Artificial Intelligence, 33(01):6706–6713. Daniel S. Park, William Chan, Yu Zhang, Chung- Alexander H. Liu, Tao Tu, Hung yi Lee, and Lin shan Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Lee. 2020. Towards unsupervised speech recogni- Quoc V.Le. 2019. Specaugment: A simple data aug- tion and synthesis with quantized speech representa- mentation method for automatic speech recognition. tion learning. Interspeech 2019.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled Adam Paszke, Sam Gross, Francisco Massa, Adam weight decay regularization. Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Orken Mamyrbayev, Keylan Alimhan, Bagashar Zhu- Antiga, Alban Desmaison, Andreas Köpf, Edward mazhanov, Tolganay Turdalykyzy, and Farida Gus- Yang, Zach DeVito, Martin Raison, Alykhan Tejani, manova. 2020. End-to-end speech recognition in Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Jun- agglutinative languages. In Intelligent Informa- jie Bai, and Soumith Chintala. 2019. Pytorch: An tion and Database Systems, pages 391–401, Cham. imperative style, high-performance deep learning li- Springer International Publishing. brary. Y. Nesterov. 1983. A method for unconstrained convex S. Poplack. 1979. “sometimes i’ll start a sentence in minimization problem with the rate of convergence 2 spanish y termino en español”: Toward a typology o(1/k ). of code-switching. Yi Ren, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Almost unsupervised text to speech and automatic speech recognition. In Pro- ceedings of the 36th International Conference on Machine Learning, volume97 of Proceedings of Ma- chine Learning Research, pages 5410–5419. PMLR. Keren Rice. 1992. Language, 68(1):149–156. Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. 2019. Speech recognition with augmented synthe- sized speech. Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Inter- speech 2019, pages 3465–3469. Mike Schuster and Kuldip Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45:2673 – 2681. Kathleen Siminyu, Sackey Freshia, Jade Abbott, and Vukosi Marivate. 2020. Ai4d – african language dataset challenge. 
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural net- works from overfitting. J. Mach. Learn. Res., 15(1):1929–1958. Stanley Smith Stevens, John Volkmann, and Edwin B Newman. 1937. A scale for the measurement of the psychological magnitude pitch. Journal of the Acoustical Society of America. Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Listening while speaking: Speech chain by deep learning. Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. Lrspeech: Extremely low-re- source speech synthesis and recognition. Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, and Alex Waibel. 2017. Comparison of decoding strate- gies for ctc acoustic models.
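As a supplementary illustration of the greedy decoding step described in Section 5.2, the following is a minimal sketch of best-path CTC decoding: take the most probable symbol in each frame, merge consecutive repeats, and drop the blank symbol. It is not taken from the released source code; the function name `greedy_ctc_decode`, the three-symbol alphabet and the choice of index 0 as the blank are assumptions made only for this example.

```python
from typing import List, Sequence


def greedy_ctc_decode(
    probs: Sequence[Sequence[float]], alphabet: str, blank: int = 0
) -> str:
    """Best-path decoding of a (time x symbols) probability matrix.

    `alphabet[i]` is the character for class index i. The character stored at
    the `blank` position is never emitted (hypothetical convention).
    """
    decoded: List[str] = []
    previous = None
    for frame in probs:
        # Index of the most probable symbol in this frame.
        best = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the symbol changes and is not the blank.
        if best != previous and best != blank:
            decoded.append(alphabet[best])
        previous = best
    return "".join(decoded)


# Toy example: 7 frames over the classes (blank, 'a', 'b').
example_frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
    [0.8, 0.1, 0.1],    # blank
]
decoded = greedy_ctc_decode(example_frames, alphabet="_ab")
print(decoded)  # aab
```

Because the decoder looks at one frame at a time and uses no language model, a repeated character can only be produced when a blank frame separates its two occurrences, as in the toy input above.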
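The attention steps of Section 5.2.1 can also be traced on a single utterance without any deep learning library. The sketch below is written in plain Python under several assumptions that the text does not fix: the dense layers are bias-free matrices, the softmax is taken over the feature axis, there is no batch axis (so zero padding is applied to rows), and the hidden size equals half the feature size so that w1(x) and w2(h) can be added. All names (`attention_features`, `pad_hidden`, and so on) are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Bias-free dense layer: one output value per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(s: Vector) -> Vector:
    m = max(s)  # subtract the maximum for numerical stability
    exps = [math.exp(value - m) for value in s]
    total = sum(exps)
    return [e / total for e in exps]


def pad_hidden(h: Matrix, steps: int) -> Matrix:
    """Append all-zero rows until h has `steps` rows (h itself is not modified)."""
    width = len(h[0])
    return h + [[0.0] * width for _ in range(steps - len(h))]


def attention_features(
    x: Matrix, h: Matrix, w1: Matrix, w2: Matrix, v: Matrix
) -> Tuple[Matrix, Matrix]:
    """Return (attention features, attention weights), one row per timestep."""
    h = pad_hidden(h, len(x))
    features, weights = [], []
    for x_t, h_t in zip(x, h):
        summed = [a + b for a, b in zip(matvec(w1, x_t), matvec(w2, h_t))]
        s_t = matvec(v, [math.tanh(u) for u in summed])  # alignment scores
        a_t = softmax(s_t)                               # attention weights
        cv_t = [a * xi for a, xi in zip(a_t, x_t)]       # context vector
        features.append(cv_t + x_t)                      # concatenate(cv, x)
        weights.append(a_t)
    return features, weights


random.seed(0)
F, H, T = 4, 2, 3  # feature size, hidden size (= F // 2), timesteps


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w1, w2, v = rand_matrix(F // 2, F), rand_matrix(H, H), rand_matrix(F, F // 2)
x = rand_matrix(T, F)      # decoder output, T rows
h = rand_matrix(T - 1, H)  # hidden state with one row fewer, so it gets padded
features, weights = attention_features(x, h, w1, w2, v)
print(len(features), len(features[0]))  # 3 8
```

With bias-free layers, a zero-padded row of h contributes exactly zero to the sum inside tanh, so the scores of that timestep depend on x alone; this is one way to read the "neutral element" argument. With a biased layer such as the default `nn.Linear`, the padded row would contribute the bias term instead.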
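To make the two evaluation metrics of Section 5.3 concrete, the sketch below computes CER and WER from the Levenshtein distance, taken over characters and over whitespace-separated words respectively and divided by the reference length. This is one common definition and is assumed here rather than taken from the released code; the function names are hypothetical. The example strings are the second row of Table 7 as it survives in this copy.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


target = "eo mi sa akpan nu mi"
prediction = "eo mi sa aakpan nu mi"
print(f"CER = {cer(target, prediction):.4f}")  # CER = 0.0500
print(f"WER = {wer(target, prediction):.4f}")  # WER = 0.1667
```

A single extra character costs one edit out of 20 characters but one word out of 6, which illustrates why the WER can be high even when the CER is low.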