Accent Identification in the Presence of Code-Mixing
Total Page:16
File Type:pdf, Size:1020Kb
Accent identification in the presence of code-mixing Thomas Niesler† & Febe de Wet‡ †Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa [email protected] ‡Centre for Language and Speech Technology, Stellenbosch University, South Africa [email protected] Abstract tongue of Xhosa and of Zulu speakers can be determined when the utterance is part of a mixed code, and when it is We investigate whether automatic accent identification is not. more effective for English utterances embedded in a dif- ferent language as part of a mixed code than for English The work was motivated by the findings of two pre- utterances that are part of a monolingual dialogue. Our vious studies. In the first, the performance of language focus is on Xhosa and Zulu, two South African languages identification systems were investigated for Afrikaans, for which code mixing with English is very common. In English, Xhosa and Zulu [1]. Here it was established that order to carry out our investigation, we extract English even in mixed-code utterances consisting predominantly utterances from mixed-code Xhosa and Zulu speech cor- of English words, the language could be correctly classi- pora, as well as comparable utterances from an English- fied with approximately 70% accuracy. The second study only corpus by Xhosa and Zulu mother-tongue speakers. focussed on the identification of accents in English spo- Experiments show that accent identification is substan- ken by mother-tongue speakers of an African language, tially more accurate for the utterances originating from and for whom English is a second or third language [2]. the mixed-code speech. We conclude that accent identi- Instead of considering individual languages, a distinction fication is more successful for these utterances because was in this case made between the Nguni (Zulu, Xhosa, accents are more pronounced for English embedded in Swati, Ndebele) and Sotho (Northern Sotho, Southern mother-tongue speech than for English spoken as part of Sotho, Tswana) language families. These are the two a monolingual dialogue by non-native speakers. largest language families in South Africa, and each shares a similar vowel system. It was shown by means of both perceptual tests involving human subjects as well as au- tomatic accent identification systems that neither humans 1. Introduction nor machines were able to distinguish reliably between the Nguni and Sotho English accents. The South African constitution officially recognises eleven languages and, in practice, many more are spoken These two findings were seemingly contradictory. On by the country’s population. As is often the case in such a the one hand, reliable accent identification was not pos- multilingual society, English is usually the lingua franca. sible between Nguni and Sotho accent groups. On the South African English is consequently characterised by a other hand, Xhosa and Zulu accents could be identified large variety of accents. Automatic language and accent with fair accuracy automatically, although both are Nguni identification as well as multilingual and accent specific languages and hence strongly related. There was, how- speech recognition are therefore important aspects of the ever, a difference in the type of English utterances that development of ASR technology in the region. had been used for testing. For the Xhosa/Zulu experi- Since most citizens are multilingual, it is common ments, the English words were embedded in the respec- for South African speakers to switch among different tive African mother tongues as part of mixed codes. For languages both between utterances (code-switching) and the Nguni/Sotho experiments, data was drawn from a cor- within an utterance (code-mixing). In modern Xhosa and pus in which all words were English. Hence in this sec- Zulu, for example, it is very common to revert to English ond case, code mixing and switching did not occur. when citing numbers, dates and money amounts. In this paper we attempt to establish experimentally The research presented in this paper investigates the whether or not the performance of automatic accent- effect of such code-switching and code-mixing on the identification systems is indeed improved for mixed-code accuracy of an automatic accent-identification system. utterances produced by mother-tongue speakers Xhosa In particular, we determine how accurately the mother and of Zulu. 2. Speech databases transcriptions were subsequently corrected and validated manually by human experts. Our experiments are based on the African Speech Tech- nology (AST) corpora, which consist of recorded and an- notated South African speech collected over mobile as 3. Automatic accent identification well as fixed telephone networks [3]. For the compilation The structure of our automatic accent identification sys- of these corpora, speakers were recruited from targeted tem in is shown in Figure 1. This architecture, referred language groups and given a unique datasheet with items to as Parallel Phone Recognition followed by Language designed to elicit a phonetically diverse mix of read and Modelling (PPRLM), uses a parallel set of phone recog- spontaneous speech. The datasheets included read items nisers for explicit accent classification [2, 4, 5]. such as isolated digits, as well as digit strings, money amounts, dates, times, spellings and also phonetically- XHOSA ASR rich words and sentences. Spontaneous items included SYSTEM WX ; L( W X ) references to gender, age, mother tongue, place of resi- X dence and level of education. English speech in ZULU ASR ( unknown accent ) Corpora were compiled in five different languages, SYSTEM WZ ; L( W Z ) namely Afrikaans, English, Southern Sotho, Xhosa and Zulu. Furthermore, the accents of South African En- Multi−accent ASR system glish spoken by English, Afrikaans, Coloured, Indian and Black mother-tongue speakers were gathered separately, Figure 1: Accent-identification using parallel recognition resulting in five accent-specific English corpora. In the systems for Xhosa and Zulu. work presented here, we have made use of the Xhosa and Zulu mother-tonguecorpora, as well as the Black English In Figure 1, each recognition system includes accent- corpus. The latter consists of English spoken chiefly by specific acoustic and language models. Since we would mother-tongue speakers of Xhosa, Zulu, Southern Sotho like to determine the accent (Xhosa or Zulu) of En- (Sesotho), and Tswana (Setswana), as set out in Table 1. glish input speech, the acoustic models should ideally be The former two corpora, on the other hand, each con- trained using only English utterances by speakers of the sist of speech uttered exclusively by Xhosa and by Zulu appropriate mother tongue. However, due to the very lim- mother-tongue speakers respectively. Furthermore, al- ited size of the corpora at our disposal, and since approx- though most words were uttered in Xhosa or in Zulu, imately 45% of the words in the Xhosa and Zulu corpora mixed codes were very common in these two corpora, are in fact English due to code mixing and switching, especially for dates, times, numbers, money amounts and these two corpora have been used to train the acoustic spelled words. By identifying such predominantly En- models instead. Using the system depicted in Figure 1, glish utterances in the Xhosa and the Zulu data, and each recognition system provides a transcription W of considering comparable utterances by Xhosa and Zulu the speech utterance X in terms of its phone inventory speakers in the Black English corpus, the effect of code- and a corresponding language-specific language model. mixing on accent-identification can be studied. In addition to this transcription, each recogniser provides a likelihood L(W), which is used to identify the accent Mother tongue % of speakers of the input speech. Xhosa 23 Zulu 18 3.1. Data preparation Sesotho 23 In the following, we will refer to the Xhosa, Zulu and Tswana 32 Black English corpora described in Section 2 by the ab- Other 4 breviations XX, ZZ and BE respectively. Because the mothertongueof each speaker in the BE corpusis known, Table 1: Mother tongues of the speakers making up the it was possible to extract sub-portions of this corpus ut- Black English AST corpus. tered by Xhosa and Zulu speakers. These two corpora (XBE and ZBE respectively), were large enough to pro- Together with the recorded speech waveforms in vide testing material free of code mixing and switching each corpus, both orthographic (word-level) and pho- for the purposes of our evaluation. However, as already netic (phone-level) transcriptions were available for each pointed out in the previous section, the XBE and ZBE utterance. The orthographic transcriptions were pro- corpora were not large enough to provide training data duced and validated by human transcribers. Initial pho- for the acoustic models. Instead, these were trained using netic transcriptions were obtained from the orthography data obtained from the XX and ZZ corpora, each contain- using grapheme-to-phoneme rules for Xhosa and Zulu, ing approximately 7 hours of audio data, as indicated in and using a pronunciation dictionary for English. These Table 2. Corpus Speech No. of No. of Phone Word We will refer to the subset of utterances in the XX name (h) utts. spkrs. tokens tokens and ZZ test sets that consist entirely of English words XX 6.98 8 538 219 177 843 36 676 as XXE and ZZE respectively. These subsets of the full ZZ 7.03 8 295 203 187 249 35 568 XX and ZZ test sets are described in Table 4. The same table also shows the subsets of XBE and ZBE that consist Table 2: Training sets for the Xhosa and the Zulu (XX of the same English words found in the XXE and ZZE and ZZ) corpora, as used to train acoustic models. test sets. These subsets of the full XBE and ZBE test sets will be referred to as XXBE and ZZBE respectively.