Accent identification in the presence of code-mixing

Thomas Niesler† & Febe de Wet‡

†Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa, [email protected]
‡Centre for Language and Speech Technology, Stellenbosch University, South Africa, [email protected]

Abstract

We investigate whether automatic accent identification is more effective for English utterances embedded in a different language as part of a mixed code than for English utterances that are part of a monolingual dialogue. Our focus is on Xhosa and Zulu, two South African languages for which code mixing with English is very common. In order to carry out our investigation, we extract English utterances from mixed-code Xhosa and Zulu speech corpora, as well as comparable utterances from an English-only corpus by Xhosa and Zulu mother-tongue speakers. Experiments show that accent identification is substantially more accurate for the utterances originating from the mixed-code speech. We conclude that accent identification is more successful for these utterances because accents are more pronounced for English embedded in mother-tongue speech than for English spoken as part of a monolingual dialogue by non-native speakers.

1. Introduction

The South African constitution officially recognises eleven languages and, in practice, many more are spoken by the country's population. As is often the case in such a multilingual society, English is usually the lingua franca, and South African English is consequently characterised by a large variety of accents. Automatic language and accent identification, as well as multilingual and accent-specific speech recognition, are therefore important aspects of the development of ASR technology in the region.

Since most citizens are multilingual, it is common for South African speakers to switch among different languages both between utterances (code-switching) and within an utterance (code-mixing). In modern Xhosa and Zulu, for example, it is very common to revert to English when citing numbers, dates and money amounts.

The research presented in this paper investigates the effect of such code-switching and code-mixing on the accuracy of an automatic accent-identification system. In particular, we determine how accurately the mother tongue of Xhosa and of Zulu speakers can be determined when the utterance is part of a mixed code, and when it is not.

The work was motivated by the findings of two previous studies. In the first, the performance of language identification systems was investigated for English, Xhosa and Zulu [1]. Here it was established that even in mixed-code utterances consisting predominantly of English words, the language could be correctly classified with approximately 70% accuracy. The second study focussed on the identification of accents in English spoken by mother-tongue speakers of an African language, for whom English is a second or third language [2]. Instead of considering individual languages, a distinction was in this case made between the Nguni (Zulu, Xhosa, Swati, Ndebele) and Sotho (Northern Sotho, Southern Sotho, Tswana) language families. These are the two largest language families in South Africa, and the languages within each share a similar sound system. It was shown by means of both perceptual tests involving human subjects and automatic accent identification systems that neither humans nor machines were able to distinguish reliably between the Nguni and Sotho English accents.

These two findings were seemingly contradictory. On the one hand, reliable accent identification was not possible between the Nguni and Sotho accent groups. On the other hand, Xhosa and Zulu accents could be identified automatically with fair accuracy, although both are Nguni languages and hence strongly related. There was, however, a difference in the type of English utterances that had been used for testing. For the Xhosa/Zulu experiments, the English words were embedded in the respective African mother tongues as part of mixed codes. For the Nguni/Sotho experiments, data was drawn from a corpus in which all words were English. Hence, in this second case, code mixing and switching did not occur.
In this paper we attempt to establish experimentally whether or not the performance of automatic accent-identification systems is indeed improved for mixed-code utterances produced by mother-tongue speakers of Xhosa and of Zulu.

2. Speech databases

Our experiments are based on the African Speech Technology (AST) corpora, which consist of recorded and annotated South African speech collected over mobile as well as fixed telephone networks [3]. For the compilation of these corpora, speakers were recruited from targeted language groups and given a unique datasheet with items designed to elicit a phonetically diverse mix of read and spontaneous speech. The datasheets included read items such as isolated digits, digit strings, money amounts, dates, times, spellings and phonetically-rich words and sentences. Spontaneous items included references to gender, age, mother tongue, place of residence and level of education.

Corpora were compiled in five different languages, namely Afrikaans, English, Southern Sotho, Xhosa and Zulu. Furthermore, the accents of South African English spoken by English, Afrikaans, Coloured, Indian and Black mother-tongue speakers were gathered separately, resulting in five accent-specific English corpora. In the work presented here, we have made use of the Xhosa and Zulu mother-tongue corpora, as well as the Black English corpus. The latter consists of English spoken chiefly by mother-tongue speakers of Xhosa, Zulu, Southern Sotho (Sesotho) and Tswana (Setswana), as set out in Table 1. The former two corpora, on the other hand, each consist of speech uttered exclusively by Xhosa and by Zulu mother-tongue speakers respectively. Although most words in these two corpora were uttered in Xhosa or in Zulu, mixed codes were very common, especially for dates, times, numbers, money amounts and spelled words. By identifying such predominantly English utterances in the Xhosa and the Zulu data, and considering comparable utterances by Xhosa and Zulu speakers in the Black English corpus, the effect of code-mixing on accent identification can be studied.
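The selection of predominantly English utterances described above can be sketched in a few lines of Python. This is a minimal sketch, not the actual AST tooling: the language tags and data layout shown here are illustrative assumptions, and the same proportion measure can be swept over thresholds to reproduce points along the horizontal axis of Figure 2.

```python
# Sketch: select predominantly English utterances from a mixed-code corpus.
# Each utterance is a list of (word, language_tag) pairs; the tag names and
# example data are illustrative placeholders, not the AST annotation format.

def english_proportion(utterance):
    """Fraction of words in the utterance tagged as English."""
    if not utterance:
        return 0.0
    n_english = sum(1 for _, lang in utterance if lang == "eng")
    return n_english / len(utterance)

def select_utterances(corpus, threshold=1.0):
    """Return utterances whose English-word proportion is at least `threshold`.

    threshold=1.0 keeps only all-English utterances (as for the English-only
    test subsets); lower thresholds correspond to other points on the
    horizontal axis of Figure 2.
    """
    return [u for u in corpus if english_proportion(u) >= threshold]

# Toy example: a mixed-code Zulu utterance citing a money amount in English,
# and a fully English date utterance.
corpus = [
    [("ngiza", "zul"), ("ngo", "zul"), ("five", "eng"), ("rand", "eng")],
    [("twenty", "eng"), ("third", "eng"), ("of", "eng"), ("may", "eng")],
]

print(len(select_utterances(corpus, threshold=1.0)))  # prints 1
```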
Mother tongue    % of speakers
Xhosa            23
Zulu             18
Sesotho          23
Tswana           32
Other             4

Table 1: Mother tongues of the speakers making up the Black English AST corpus.

Together with the recorded speech waveforms in each corpus, both orthographic (word-level) and phonetic (phone-level) transcriptions were available for each utterance. The orthographic transcriptions were produced and validated by human transcribers. Initial phonetic transcriptions were obtained from the orthography using grapheme-to-phoneme rules for Xhosa and Zulu, and using a pronunciation dictionary for English. These transcriptions were subsequently corrected and validated manually by human experts.

3. Automatic accent identification

The structure of our automatic accent identification system is shown in Figure 1. This architecture, referred to as Parallel Phone Recognition followed by Language Modelling (PPRLM), uses a parallel set of phone recognisers for explicit accent classification [2, 4, 5].

[Figure 1: Accent identification using parallel recognition systems for Xhosa and Zulu. English speech X of unknown accent is decoded in parallel by a Xhosa ASR system, yielding WX and L(WX), and by a Zulu ASR system, yielding WZ and L(WZ), together forming a multi-accent ASR system.]

In Figure 1, each recognition system includes accent-specific acoustic and language models. Since we would like to determine the accent (Xhosa or Zulu) of English input speech, the acoustic models should ideally be trained using only English utterances by speakers of the appropriate mother tongue. However, due to the very limited size of the corpora at our disposal, and since approximately 45% of the words in the Xhosa and Zulu corpora are in fact English due to code mixing and switching, these two corpora have been used to train the acoustic models instead. Using the system depicted in Figure 1, each recognition system provides a transcription W of the speech utterance X in terms of its phone inventory and a corresponding language-specific language model. In addition to this transcription, each recogniser provides a likelihood L(W), which is used to identify the accent of the input speech.

3.1. Data preparation

In the following, we will refer to the Xhosa, Zulu and Black English corpora described in Section 2 by the abbreviations XX, ZZ and BE respectively. Because the mother tongue of each speaker in the BE corpus is known, it was possible to extract sub-portions of this corpus uttered by Xhosa and Zulu speakers. These two corpora (XBE and ZBE respectively) were large enough to provide testing material free of code mixing and switching for the purposes of our evaluation. However, as already pointed out in the previous section, the XBE and ZBE corpora were not large enough to provide training data for the acoustic models. Instead, these were trained using data from the XX and ZZ corpora, each containing approximately 7 hours of audio, as indicated in Table 2.

Corpus  Speech  No. of  No. of  Phone    Word
name    (h)     utts.   spkrs.  tokens   tokens
XX      6.98    8 538   219     177 843  36 676
ZZ      7.03    8 295   203     187 249  35 568

Table 2: Training sets for the Xhosa and the Zulu (XX and ZZ) corpora, as used to train acoustic models.

Independent test sets were also prepared for Xhosa and Zulu, each containing approximately 25 minutes of speech. These two test sets, as well as the XBE and ZBE test sets, are described in Table 3. There was no speaker overlap between the training and any of the test sets, and each contained a balance of male and female speakers.

Corpus  Speech  No. of  No. of  Phone   Word
name    (min)   utts.   spkrs.  tokens  tokens
XX      26.8    609     17      10 925  2 480
ZZ      27.1    583     16      11 008  2 385
XBE     23.8    614     17      9 112   2 709
ZBE     25.8    643     16      9 841   2 856

Table 3: Test sets prepared for the Xhosa and Zulu (XX and ZZ) corpora as well as the Xhosa and Zulu subsets (XBE and ZBE) of the BE corpus.

However, the XX and ZZ test sets both contain a significant amount of code mixing and switching. Figure 2 shows the fraction of utterances in the XX and ZZ test sets that contain at least a certain threshold proportion of English words. This threshold is indicated on the horizontal axis, so that it can be seen, for example, that approximately 25% of the utterances in both test sets consist entirely of English words.

[Figure 2: Fraction of Xhosa and Zulu (XX and ZZ) test set utterances that contain a certain proportion of English words. Horizontal axis: minimum proportion of English words per test utterance; vertical axis: proportion of utterances in the test set.]

We will refer to the subsets of utterances in the XX and ZZ test sets that consist entirely of English words as XXE and ZZE respectively. These subsets of the full XX and ZZ test sets are described in Table 4. The same table also shows the subsets of XBE and ZBE that consist of the same English words found in the XXE and ZZE test sets. These subsets of the full XBE and ZBE test sets will be referred to as XXBE and ZZBE respectively. The composition of these various test sets is illustrated in Figure 3.

[Figure 3: Composition of the test sets described in Table 3 and in Table 4. XXE and ZZE are the utterances in the XX and ZZ test sets (Xhosa and Zulu mother-tongue utterances, including mixed codes) that consist entirely of English words. XBE and ZBE are the utterances in the BE corpus (English spoken by Black mother-tongue speakers) produced by Xhosa and Zulu mother-tongue speakers respectively, and XXBE and ZZBE are their subsets containing the same words as XXE and ZZE.]

Corpus  Speech  No. of  No. of  Phone   Word
name    (min)   utts.   spkrs.  tokens  tokens
XXE     10.0    148     17      2 873   1 332
ZZE     10.5    158     16      3 139   1 272
XXBE    12.2    237     17      4 007   1 752
ZZBE    10.2    214     16      3 472   1 456

Table 4: English-only subsets of each test set in Table 3.
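The accent decision made by the PPRLM system of Section 3 amounts to comparing the likelihoods produced by the two parallel recognisers. The sketch below illustrates this; the function and variable names are ours, and the paper does not specify whether the likelihoods are normalised, so per-frame normalisation is shown here only as one common choice.

```python
# Sketch of the PPRLM accent decision: each parallel recogniser returns a
# transcription and a log-likelihood; the accent whose recogniser gives the
# higher (here frame-normalised) log-likelihood is chosen. All names and
# numeric values are illustrative, not taken from the paper's system.

def identify_accent(loglik_xhosa, loglik_zulu, n_frames):
    """Compare frame-normalised log-likelihoods from the two recognisers."""
    score_x = loglik_xhosa / n_frames
    score_z = loglik_zulu / n_frames
    return "Xhosa" if score_x > score_z else "Zulu"

# Toy example: hypothetical decoder scores for a 300-frame utterance.
print(identify_accent(-2154.0, -2210.5, 300))  # prints Xhosa
```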

The utterances in the XXE and ZZE test sets are therefore quite similar to those comprising XXBE and ZZBE, because they are made up of the same English words that often occur as part of Xhosa and Zulu mixed codes. XXE and ZZE differ from XXBE and ZZBE in that the former two have been extracted from the XX and ZZ corpora respectively, while XXBE and ZZBE both originate from the BE corpus. Hence XXE and ZZE are drawn from mixed codes, while XXBE and ZZBE are drawn from monolingual speech.

3.2. Acoustic models

Acoustic models were trained using the HTK tools [6] and the XX and ZZ training sets described in Table 2. The Xhosa and Zulu recognition systems employ a common set of 90 phones, including silence and speaker noise. This phone set was verified to account for 99.5% of the phone tokens in the XBE and ZBE test sets. Thus, although the phones present in the XBE/ZBE and the XX/ZZ corpora are not exactly the same, there is an overwhelming degree of overlap.

The speech was parameterised as Mel-frequency cepstral coefficients (MFCCs) and their first and second differentials, with cepstral mean normalisation (CMN) applied on a per-utterance basis. Speaker-independent cross-word left-to-right triphone HMMs were trained by embedded Baum-Welch re-estimation and decision-tree state clustering, using the phonetically-labelled training sets. Each model had three states, eight Gaussian mixtures per state and diagonal covariance matrices. Triphone clustering resulted in a total of approximately 1250 clustered states for each set of acoustic models.
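Per-utterance CMN simply subtracts each cepstral coefficient's mean, computed over all frames of the utterance, from every frame. A minimal pure-Python sketch of this step (in practice a front end such as HTK applies it to the MFCC vectors; the feature values below are illustrative):

```python
# Sketch of per-utterance cepstral mean normalisation (CMN): subtract the
# per-coefficient mean, computed over the frames of one utterance, from
# every frame of that utterance. The feature values are illustrative.

def cepstral_mean_normalise(frames):
    """frames: list of equal-length feature vectors, one per analysis frame."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    return [[f[i] - means[i] for i in range(n_coeffs)] for f in frames]

utterance = [[1.0, 4.0], [3.0, 0.0]]       # two frames, two coefficients
print(cepstral_mean_normalise(utterance))  # [[-1.0, 2.0], [1.0, -2.0]]
```

After normalisation each coefficient has zero mean over the utterance, which removes constant channel effects such as the telephone transfer function.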

3.3. Accent identification results

Table 5 shows the accuracy of the automatic accent identification system illustrated in Figure 1 when presented with the test sets listed in Table 4. In these experiments, the Xhosa and Zulu recognition systems depicted in Figure 1 each used an unweighted phone loop as a language model. Since the two systems share a common phone set, this means that discrimination between the two accents was based exclusively on acoustic differences.

Test      Classified as (%)
corpus    Xhosa   Zulu
XXE       75.0    25.0
ZZE       29.1    70.9
Average correct: 73.4%
XXBE      48.9    51.1
ZZBE      37.4    62.6
Average correct: 55.4%

Table 5: Accent identification accuracy (%) for the XXE, ZZE, XXBE and ZZBE test sets using a phone loop language model.

The experimental results in Table 5 indicate that while the accent of 73.4% of the utterances extracted from the XX and ZZ test sets was correctly classified, this drops to 55.4% for a set of comparable utterances drawn from the XBE and ZBE test sets. Hence the accent is more difficult to classify when the speech is part of a monolingual English dialogue, and easier to classify when it forms part of a mixed code.

In order to gauge the effect which the proportion of English words in the test set has on accent identification, the accuracy was calculated for a series of points along the horizontal axis of Figure 2. These results, shown in Figure 4, indicate the effect which the degree of code mixing (the proportion of English words in a Xhosa or Zulu sentence) has on the average accuracy of the identification system. The left-hand side of the graph corresponds to testing the system using the XX and ZZ test sets of Table 3, while the right-hand side corresponds to testing with the XXE and ZZE subsets described in Table 4. Figure 4 shows that, for utterances drawn from the XX and ZZ corpora, identification accuracy deteriorates as the proportion of English words in the test utterances increases, but not drastically. This happens because a higher proportion of Xhosa and Zulu words in an utterance makes it easier to identify the mother tongue correctly.

[Figure 4: The effect of the degree of code-mixing in the test set on the accent identification accuracy. Horizontal axis: minimum proportion of English words per test utterance; vertical axis: accent identification accuracy (%); test sets drawn from XX and ZZ.]

Figure 5 shows the effect of restricting the test utterances drawn from the XBE and ZBE test sets to consist of the same words found in the XXE and ZZE test sets. Again, the left-hand side of the graph corresponds to testing the system using the full XBE and ZBE test sets of Table 3, while the right-hand side corresponds to testing with the XXBE and ZZBE subsets described in Table 4. The results in the figure show that the particular words present in the monolingual test sets drawn from the BE data do not affect the identification in a systematic way. This indicates that the strength of the Xhosa or Zulu accent in an English word does not depend upon whether that word occurs frequently as part of a mixed code or not. Rather, when code mixing does not occur, the accent of the speaker cannot be determined accurately, irrespective of the vocabulary.

[Figure 5: The effect of the similarity between the words used in the monolingual and the mixed-code test sets on the accent identification accuracy. Horizontal axis: minimum proportion of XXE/ZZE words per test utterance; vertical axis: accent identification accuracy (%); test sets drawn from XBE and ZBE.]

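For reference, absolute discounting [8] estimates n-gram probabilities by subtracting a fixed discount d from every observed bigram count and redistributing the freed probability mass to a lower-order distribution. The sketch below shows the interpolated variant with a unigram backoff for simplicity; the phone sequences and discount value are hypothetical, and the exact discounting scheme used in the experiments may differ in detail.

```python
# Sketch of absolute-discounted bigram estimation over phone sequences.
# A fixed discount d is subtracted from each observed bigram count, and the
# freed probability mass is redistributed via a unigram backoff distribution
# (interpolated variant). Sequences and d are illustrative only.
from collections import Counter

def bigram_model(sequences, d=0.5):
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    total = sum(unigrams.values())

    def prob(prev, cur):
        n_prev = sum(c for (a, _), c in bigrams.items() if a == prev)
        p_uni = unigrams[cur] / total                  # backoff distribution
        if n_prev == 0:
            return p_uni                               # unseen history
        n_types = sum(1 for (a, _) in bigrams if a == prev)
        backoff_mass = d * n_types / n_prev            # mass freed by discounting
        p_big = max(bigrams[(prev, cur)] - d, 0) / n_prev
        return p_big + backoff_mass * p_uni

    return prob

# Toy phone sequences (illustrative, not real Xhosa/Zulu transcriptions).
prob = bigram_model([["a", "b", "a", "b"], ["a", "b", "c"]])
assert abs(sum(prob("a", x) for x in "abc") - 1.0) < 1e-9  # sums to one
```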
Finally, in order to gauge the effect of the grammar on the accent identification accuracy, the experiments of Table 5 were repeated, but now with each of the two recognisers using a backoff bigram phone language model [7]. These were trained using the XX and ZZ training set transcriptions, and language model probabilities were estimated using absolute discounting [8]. The resulting accuracies are shown in Table 6.

Test      Classified as (%)
corpus    Xhosa   Zulu
XXE       79.7    20.3
ZZE       29.7    70.3
Average correct: 74.8%
XXBE      56.5    43.5
ZZBE      42.1    57.9
Average correct: 57.2%

Table 6: Accent identification accuracy (%) for the XXE, ZZE, XXBE and ZZBE test sets when using a bigram language model.

It is apparent that the use of a bigram slightly improves identification accuracies for both the code-mixed (XXE/ZZE) and the monolingual (XXBE/ZZBE) test sets. Nevertheless, the accent identification accuracy for the utterances drawn from the monolingual BE corpus remains much lower than for the mixed-code utterances drawn from the XX and ZZ corpora.

4. Summary and conclusions

We have investigated the effect which code-mixing and code-switching have on the automatic identification of the English accent of Xhosa and Zulu mother-tongue speech. In particular, we have sought to determine whether the performance of an automatic accent identification system is different for English utterances that are embedded in Xhosa or Zulu as part of a mixed code, in relation to English utterances that are part of a monolingual dialogue. We have found that, in the latter case, it is not possible to distinguish between the two accents with good accuracy, while in the former case accuracies of above 70% were achieved. We therefore conclude that English which is embedded within a Xhosa or Zulu dialogue exhibits a more distinct accent than monolingual English produced by the same type of speakers. For the development of automatic speech recognition systems this appears to suggest that, while a common set of acoustic models is suitable for the recognition of monolingual English spoken by Xhosa and Zulu mother-tongue speakers, accent-specific models will be more appropriate for the modelling of English words that are expected to be embedded in the respective African mother tongue as part of a mixed code.

5. Acknowledgements

This work was supported by the National Research Foundation (NRF) under grants FA2005022300010 and GUN2072874.

6. References

[1] T.R. Niesler and D. Willett, "Language identification and multilingual speech recognition using discriminatively trained acoustic models," in First ISCA ITRW on Multilingual Speech and Language Processing (MULTILING), Stellenbosch, South Africa, 2006.

[2] F. de Wet, P.H. Louw, and T.R. Niesler, "Human and automatic accent identification of Nguni and Sotho Black South African English," South African Journal of Science, 2007.

[3] J.C. Roux, P.H. Louw, and T.R. Niesler, "The African Speech Technology project: An assessment," in Proc. LREC, Lisbon, Portugal, 2004.

[4] W. Shen and D. Reynolds, "Improving phonotactic language recognition with acoustic adaptation," in Proc. Interspeech, Antwerp, Belgium, 2007.

[5] M.A. Zissman and K.M. Berkling, "Automatic language identification," Speech Communication, vol. 35, pp. 115–124, 2001.

[6] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, version 3.2.1, Cambridge University Engineering Department, 2002.

[7] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recogniser," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, no. 3, pp. 400–401, March 1987.

[8] H. Ney, U. Essen, and R. Kneser, "On structuring probabilistic dependencies in stochastic language modelling," Computer, Speech and Language, vol. 8, pp. 1–38, 1994.