Spoken Arabic Dialect Identification Using Phonotactic Modeling
Fadi Biadsy and Julia Hirschberg
Department of Computer Science
Columbia University, New York, USA
{fadi,julia}@cs.columbia.edu

Nizar Habash
Center for Computational Learning Systems
Columbia University, New York, USA
[email protected]

Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 53–61, Athens, Greece, 31 March, 2009. © 2009 Association for Computational Linguistics

Abstract

The Arabic language is a collection of multiple variants, among which Modern Standard Arabic (MSA) has a special status as the formal written standard language of the media, culture and education across the Arab world. The other variants are informal spoken dialects that are the media of communication for daily life. Arabic dialects differ substantially from MSA and from each other in terms of phonology, morphology, lexical choice and syntax. In this paper, we describe a system that automatically identifies the Arabic dialect (Gulf, Iraqi, Levantine, Egyptian and MSA) of a speaker given a sample of his/her speech. The phonotactic approach we use proves to be effective in identifying these dialects with considerable overall accuracy: 81.60% using 30s test utterances.

1 Introduction

For the past three decades, there has been a great deal of work on the automatic identification (ID) of languages from the speech signal alone. Recently, accent and dialect identification have begun to receive attention from the speech science and technology communities. The task of dialect identification is the recognition of a speaker's regional dialect, within a predetermined language, given a sample of his/her speech. The dialect-identification problem has been viewed as more challenging than that of language ID due to the greater similarity between dialects of the same language. Our goal in this paper is to analyze the effectiveness of a phonotactic approach for the identification of Arabic dialects, i.e. one that makes use primarily of the rules that govern phonemes and their sequences in a language, a technique which has often been employed by the language ID community.

The Arabic language has multiple variants, including Modern Standard Arabic (MSA), the formal written standard language of the media, culture and education, and the informal spoken dialects that are the preferred method of communication in daily life. While there are commercially available Automatic Speech Recognition (ASR) systems for recognizing MSA with low error rates (typically trained on Broadcast News), these recognizers fail when a native Arabic speaker speaks in his/her regional dialect. Even in news broadcasts, speakers often code switch between MSA and dialect, especially in conversational speech, such as that found in interviews and talk shows. Being able to identify dialect vs. MSA, as well as to identify which dialect is spoken during the recognition process, will enable ASR engines to adapt their acoustic, pronunciation, morphological, and language models appropriately and thus improve recognition accuracy.

Identifying the regional dialect of a speaker will also provide important benefits for speech technology beyond improving speech recognition. It will allow us to infer the speaker's regional origin and ethnicity and to adapt features used in speaker identification to regional origin. It should also prove useful in adapting the output of text-to-speech synthesis to produce regional speech as well as MSA, which is important for the development of spoken dialogue systems.

In Section 2, we describe related work. In Section 3, we discuss some linguistic aspects of Arabic dialects which are important to dialect identification. In Section 4, we describe the Arabic dialect corpora employed in our experiments. In Section 5, we explain our approach to the identification of Arabic dialects. We present our experimental results in Section 6. Finally, we conclude in Section 7 and identify directions for future research.

2 Related Work

A variety of cues by which humans and machines distinguish one language from another have been explored in previous research on language identification. Examples of such cues include phone inventory and phonotactics, prosody, lexicon, morphology, and syntax. Some of the most successful approaches to language ID have made use of phonotactic variation. For example, the Phone Recognition followed by Language Modeling (PRLM) approach uses phonotactic information to identify languages from the acoustic signal alone (Zissman, 1996). In this approach, a phone recognizer (not necessarily trained on a related language) is used to tokenize training data for each language to be classified. Phonotactic language models generated from this tokenized training speech are used during testing to compute language ID likelihoods for unknown utterances.

Similar cues have successfully been used for the identification of regional dialects. Zissman et al. (1996) show that the PRLM approach yields good results classifying Cuban and Peruvian dialects of Spanish, using an English phone recognizer trained on TIMIT (Garofolo et al., 1993). The recognition accuracy of this system on these two dialects is 84%, using up to 3 minutes of test utterances. Torres-Carrasquillo et al. (2004) developed an alternate system that identifies these two Spanish dialects using Gaussian Mixture Models (GMM) with shifted-delta-cepstral features. This system performs less accurately (accuracy of 70%) than that of Zissman et al. (1996). Alorfi (2008) uses an ergodic HMM to model phonetic differences between two Arabic dialects (Gulf and Egyptian Arabic), employing standard MFCC (Mel Frequency Cepstral Coefficients) and delta features. With the best parameter settings, this system achieves a high accuracy of 96.67% on these two dialects. Ma et al. (2006) use multi-dimensional pitch flux features and MFCC features to distinguish three Chinese dialects. In this system, the pitch flux features reduce the error rate by more than 30% when added to a GMM-based MFCC system. Given 15s of test utterances, the system achieves an accuracy of 90% on the three dialects.

Intonational cues have been shown to be good indicators to human subjects identifying regional dialects. Peters et al. (2002) show that human subjects rely on intonational cues to identify two German dialects (Hamburg urban dialects vs. Northern Standard German). Similarly, Barakat et al. (1999) show that subjects distinguish between Western and Eastern Arabic dialects significantly above chance based on intonation alone.

Hamdi et al. (2004) show that rhythmic differences exist between Western and Eastern Arabic. The analysis of these differences is done by comparing percentages of vocalic intervals (%V) and the standard deviation of intervocalic intervals (ΔC) across the two groups. These features have been shown to capture the complexity of the syllabic structure of a language/dialect in addition to the existence of vowel reduction. The complexity of the syllabic structure of a language/dialect and the existence of vowel reduction in a language are good correlates of the rhythmic structure of the language/dialect, hence the importance of such a cue for language/dialect identification (Ramus, 2002).

As far as we could determine, there is no previous work that analyzes the effectiveness of a phonotactic approach, particularly the parallel PRLM, for identifying Arabic dialects. In this paper, we build a system based on this approach and evaluate its performance on five Arabic dialects (four regional dialects and MSA). In addition, we experiment with six phone recognizers trained on six languages as well as three MSA phone recognizers and analyze their contribution to this classification task. Moreover, we make use of a discriminative classifier that takes all the perplexities of the language models on the phone sequences and outputs the hypothesized dialect. This classifier turns out to be an important component, although it has not been a standard component in previous work.

3 Linguistic Aspects of Arabic Dialects

3.1 Arabic and its Dialects

MSA is the official language of the Arab world. It is the primary language of the media and culture. MSA is syntactically, morphologically and phonologically based on Classical Arabic, the language of the Qur’an (Islam’s Holy Book). Lexically, however, it is much more modern. It is not a native language of any Arabs but is the language of education across the Arab world. MSA is primarily written, not spoken.

The Arabic dialects, in contrast, are the true native language forms. They are generally restricted in use to informal daily communication. They are not taught in schools or even standardized, although there is a rich popular dialect culture of folktales, songs, movies, and TV shows. Dialects are primarily spoken, not written. However, this is changing as more Arabs gain access to electronic media such as emails and newsgroups. Arabic dialects are loosely related to Classical Arabic. They are the result of the interaction between different ancient dialects of Classical Arabic and other languages that existed in, neighbored and/or colonized what is today the Arab world. For ex-

a large gray area in between and it is often filled with a mixing of the two forms.

In this paper, we focus on classifying the dialect of audio recordings into one of five varieties: MSA, GLF, IRQ, LEV, and EGY. We do not address other dialects or