
Spoken Arabic Dialect Identification Using Phonotactic Modeling

Fadi Biadsy and Julia Hirschberg
Department of Computer Science
Columbia University, New York, USA
{fadi,julia}@cs.columbia.edu

Nizar Habash
Center for Computational Learning Systems
Columbia University, New York, USA
[email protected]

Abstract

The Arabic language is a collection of multiple variants, among which Modern Standard Arabic (MSA) has a special status as the formal written standard language of the media, culture and education across the Arab world. The other variants are informal spoken dialects that are the media of communication for daily life. Arabic dialects differ substantially from MSA and each other in terms of phonology, morphology, lexical choice and syntax. In this paper, we describe a system that automatically identifies the Arabic dialect (Gulf, Iraqi, Levantine, Egyptian and MSA) of a speaker given a sample of his/her speech. The phonotactic approach we use proves to be effective in identifying these dialects with considerable overall accuracy — 81.60% using 30s test utterances.

1 Introduction

For the past three decades, there has been a great deal of work on the automatic identification (ID) of languages from the speech signal alone. Recently, accent and dialect identification have begun to receive attention from the speech science and technology communities. The task of dialect identification is the recognition of a speaker's regional dialect, within a predetermined language, given a sample of his/her speech. The dialect-identification problem has been viewed as more challenging than that of language ID due to the greater similarity between dialects of the same language. Our goal in this paper is to analyze the effectiveness of a phonotactic approach, i.e. making use primarily of the rules that govern phonemes and their sequences in a language — a technique which has often been employed by the language ID community — for the identification of Arabic dialects.

The Arabic language has multiple variants, including Modern Standard Arabic (MSA), the formal written standard of the media, culture and education, and the informal spoken dialects that are the preferred method of communication in daily life. While there are commercially available Automatic Speech Recognition (ASR) systems for recognizing MSA with low error rates (typically trained on Broadcast News), these recognizers fail when a native Arabic speaker speaks in his/her regional dialect. Even in news broadcasts, speakers often code switch between MSA and dialect, especially in conversational speech, such as that found in interviews and talk shows. Being able to identify dialect vs. MSA as well as to identify which dialect is spoken during the recognition process will enable ASR engines to adapt their acoustic, pronunciation, morphological, and language models appropriately and thus improve recognition accuracy.

Identifying the regional dialect of a speaker will also provide important benefits for speech technology beyond improving speech recognition. It will allow us to infer the speaker's regional origin and ethnicity and to adapt features used in speaker identification to regional origin. It should also prove useful in adapting the output of text-to-speech synthesis to produce regional speech as well as MSA — important for spoken dialogue systems' development.

In Section 2, we describe related work. In Section 3, we discuss some linguistic aspects of Arabic dialects which are important to dialect identification. In Section 4, we describe the Arabic dialect corpora employed in our experiments. In Section 5, we explain our approach to the identification of Arabic dialects. We present our experimental results in Section 6. Finally, we conclude in Section 7 and identify directions for future research.

2 Related Work

A variety of cues by which humans and machines distinguish one language from another have been explored in previous research on language identification.

Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages, pages 53–61, Athens, Greece, 31 March 2009. © 2009 Association for Computational Linguistics

Examples of such cues include phone inventory and phonotactics, prosody, lexicon, morphology, and syntax. Some of the most successful approaches to language ID have made use of phonotactic variation. For example, the Phone Recognition followed by Language Modeling (PRLM) approach uses phonotactic information to identify languages from the acoustic signal alone (Zissman, 1996). In this approach, a phone recognizer (not necessarily trained on a related language) is used to tokenize training data for each language to be classified. Phonotactic language models generated from this tokenized training speech are used during testing to compute language ID likelihoods for unknown utterances.

Similar cues have successfully been used for the identification of regional dialects. Zissman et al. (1996) show that the PRLM approach yields good results classifying Cuban and Peruvian dialects of Spanish, using an English phone recognizer trained on TIMIT (Garofolo et al., 1993). The recognition accuracy of this system on these two dialects is 84%, using up to 3 minutes of test utterances. Torres-Carrasquillo et al. (2004) developed an alternate system that identifies these two Spanish dialects using Gaussian Mixture Models (GMM) with shifted-delta-cepstral features. This system performs less accurately (accuracy of 70%) than that of (Zissman et al., 1996). Alorfi (2008) uses an ergodic HMM to model phonetic differences between two Arabic dialects (Gulf and Egyptian Arabic) employing standard MFCC (Mel Frequency Cepstral Coefficients) and delta features. With the best parameter settings, this system achieves high accuracy of 96.67% on these two dialects. Ma et al. (2006) use multi-dimensional pitch flux features and MFCC features to distinguish three Chinese dialects. In this system the pitch flux features reduce the error rate by more than 30% when added to a GMM based MFCC system. Given 15s of test utterances, the system achieves an accuracy of 90% on the three dialects.

Intonational cues have been shown to be good indicators to human subjects identifying regional dialects. Peters et al. (2002) show that human subjects rely on intonational cues to identify two German dialects (Hamburg urban dialects vs. Northern Standard German). Similarly, Barkat et al. (1999) show that subjects distinguish between Western vs. Eastern Arabic dialects significantly above chance based on intonation alone.

Hamdi et al. (2004) show that rhythmic differences exist between Western and Eastern Arabic. The analysis of these differences is done by comparing percentages of vocalic intervals (%V) and the standard deviation of intervocalic intervals (∆C) across the two groups. These features have been shown to capture the complexity of the syllabic structure of a language/dialect in addition to the existence of vowel reduction. The complexity of syllabic structure of a language/dialect and the existence of vowel reduction in a language are good correlates with the rhythmic structure of the language/dialect, hence the importance of such a cue for language/dialect identification (Ramus, 2002).

As far as we could determine, there is no previous work that analyzes the effectiveness of a phonotactic approach, particularly the parallel PRLM, for identifying Arabic dialects. In this paper, we build a system based on this approach and evaluate its performance on five Arabic dialects (four regional dialects and MSA). In addition, we experiment with six phone recognizers trained on six languages as well as three MSA phone recognizers and analyze their contribution to this classification task. Moreover, we make use of a discriminative classifier that takes all the perplexities of the language models on the phone sequences and outputs the hypothesized dialect. This classifier turns out to be an important component, although it has not been a standard component in previous work.

3 Linguistic Aspects of Arabic Dialects

3.1 Arabic and its Dialects

MSA is the official language of the Arab world. It is the primary language of the media and culture. MSA is syntactically, morphologically and phonologically based on Classical Arabic, the language of the Qur'an (Islam's Holy Book). Lexically, however, it is much more modern. It is not a native language of any Arabs but is the language of education across the Arab world. MSA is primarily written, not spoken.

The Arabic dialects, in contrast, are the true native language forms. They are generally restricted in use to informal daily communication. They are not taught in schools or even standardized, although there is a rich popular dialect culture of folktales, songs, movies, and TV shows. Dialects are primarily spoken, not written. However, this is changing as more Arabs gain access to electronic media such as emails and newsgroups.

Arabic dialects are loosely related to Classical Arabic. They are the result of the interaction between different ancient dialects of Classical Arabic and other languages that existed in, neighbored and/or colonized what is today the Arab world. For example, Moroccan Arabic has many influences from Berber as well as French.

Arabic dialects vary on many dimensions — primarily, geography and social class. Geo-linguistically, the Arab world can be divided in many different ways. The following is only one of many that covers the main Arabic dialects:

• Gulf Arabic (GLF) includes the dialects of Kuwait, Saudi Arabia, Bahrain, Qatar, United Arab Emirates, and Oman.

• Iraqi Arabic (IRQ) is the dialect of Iraq. In some dialect classifications, Iraqi Arabic is considered a sub-dialect of Gulf Arabic.

• Levantine Arabic (LEV) includes the dialects of Lebanon, Syria, Jordan, and Palestine.

• Egyptian Arabic (EGY) covers the dialects of the Nile valley: Egypt and Sudan.

• North African Arabic covers the dialects of Morocco, Algeria, Tunisia, and Libya. Mauritania is sometimes included.

Yemenite Arabic is often considered its own class. Maltese Arabic is not always considered an Arabic dialect. It is the only Arabic variant that is considered a separate language and is written with Latin script.

Socially, it is common to distinguish three sub-dialects within each dialect region: city dwellers, peasants/farmers and Bedouins. The three degrees are often associated with a class hierarchy from rich, settled city-dwellers down to Bedouins. Different social associations exist as is common in many other languages around the world.

The relationship between MSA and the dialect in a specific region is complex. Arabs do not think of these two as separate languages. This particular perception leads to a special kind of coexistence between the two forms of language that serve different purposes. This kind of situation is what linguists term diglossia. Although the two variants have clear domains of prevalence: formal written (MSA) versus informal spoken (dialect), there is a large gray area in between and it is often filled with a mixing of the two forms.

In this paper, we focus on classifying the dialect of audio recordings into one of five varieties: MSA, GLF, IRQ, LEV, and EGY. We do not address other dialects or sub-dialects.

3.2 Phonological Variations among Arabic Dialects

Although Arabic dialects and MSA vary on many different levels — phonology, orthography, morphology, lexical choice and syntax — we will focus on phonological difference in this paper.1 MSA's phonological profile includes 28 consonants, three short vowels, three long vowels and two diphthongs (/ay/ and /aw/). Arabic dialects vary phonologically from standard Arabic and each other. Some of the common variations include the following (Holes, 2004; Habash, 2006):

The MSA consonant /q/ is realized as a glottal stop /'/ in EGY and LEV and as /g/ in GLF and IRQ. For example, the MSA word /ṭari:q/ 'road' appears as /ṭari:'/ (EGY and LEV) and /ṭari:g/ (GLF and IRQ). Other variants also are found in sub-dialects, such as /k/ in rural Palestinian (LEV) and /dj/ in some GLF dialects. These changes do not apply to modern and religious borrowings from MSA. For instance, the word for 'Qur'an' is never pronounced as anything but /qur'a:n/.

The MSA alveolar affricate /dj/ is realized as /g/ in EGY, as /j/ in LEV and as /y/ in GLF. IRQ preserves the MSA pronunciation. For example, the word for 'handsome' is /djami:l/ (MSA, IRQ), /gami:l/ (EGY), /jami:l/ (LEV) and /yami:l/ (GLF).

The MSA consonant /k/ is generally realized as /k/ in Arabic dialects, with the exception of GLF, IRQ and the Palestinian rural sub-dialect of LEV, which allow a /č/ pronunciation in certain contexts. For example, the word for 'fish' is /samak/ in MSA, EGY and most of LEV but /simač/ in IRQ and GLF.

The MSA consonant /θ/ is pronounced as /t/ in LEV and EGY (or /s/ in more recent borrowings from MSA), e.g., the MSA word /θala:θa/ 'three' is pronounced /tala:ta/ in EGY and /tla:te/ in LEV. IRQ and GLF generally preserve the MSA pronunciation.

1It is important to point out that since Arabic dialects are not standardized, their orthography may not always be consistent. However, this is not a relevant point to this paper since we are interested in dialect identification using audio recordings and without using the dialectal transcripts at all.
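The consonant correspondences described above can be summarized as a small lookup table. The sketch below is purely illustrative — the phone symbols are simplified ASCII stand-ins for the transcriptions used in the paper, and the table covers only the three correspondences discussed so far:

```python
# Illustrative reflexes of selected MSA consonants in four dialect
# groups (simplified ASCII notation; "'" stands for the glottal stop).
MSA_REFLEXES = {
    "q":  {"EGY": "'",  "LEV": "'",  "GLF": "g",  "IRQ": "g"},
    "dj": {"EGY": "g",  "LEV": "j",  "GLF": "y",  "IRQ": "dj"},
    "th": {"EGY": "t",  "LEV": "t",  "GLF": "th", "IRQ": "th"},
}

def realize(msa_phone, dialect):
    """Map an MSA phone to its typical reflex in the given dialect;
    phones without a listed correspondence are returned unchanged."""
    return MSA_REFLEXES.get(msa_phone, {}).get(dialect, msa_phone)

# MSA /tari:q/ 'road' surfaces with /g/ in Gulf Arabic:
assert realize("q", "GLF") == "g"
```

Real dialectal variation is of course context-dependent (e.g., the borrowing exceptions noted above), so a table like this is a caricature; it is exactly these systematic phone-level differences, however, that a phonotactic model can exploit.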

The MSA consonant /δ/ is pronounced as /d/ in LEV and EGY (or /z/ in more recent borrowings from MSA), e.g., the word for 'this' is pronounced /ha:δa/ in MSA versus /ha:da/ (LEV) and /da/ (EGY). IRQ and GLF generally preserve the MSA pronunciation.

The MSA consonants /ḍ/ (emphatic/velarized d) and /δ̣/ (emphatic /δ/) are both normalized to /ḍ/ in EGY and LEV and to /δ̣/ in GLF and IRQ. For example, the MSA sentence /δ̣alla yaḍrubu/ 'he continued to hit' is pronounced /ḍall yuḍrub/ (LEV) and /δ̣all yuδ̣rub/ (GLF). In modern borrowings from MSA, /δ̣/ is pronounced as /ẓ/ (emphatic z) in EGY and LEV. For instance, the word for 'police officer' is /δ̣a:biṭ/ in MSA but /ẓa:biṭ/ in EGY and LEV.

In some dialects, a loss of the emphatic feature of some MSA consonants occurs, e.g., the MSA word /laṭi:f/ 'pleasant' is pronounced as /lati:f/ in the Lebanese city sub-dialect of LEV. Emphasis typically spreads to neighboring vowels: if a vowel is preceded or succeeded directly by an emphatic consonant (/ḍ/, /ṣ/, /ṭ/, /δ̣/) then the vowel becomes an emphatic vowel. As a result, the loss of the emphatic feature does not affect the consonants only, but also their neighboring vowels.

Other vocalic differences among MSA and the dialects include the following: First, short vowels change or are completely dropped, e.g., the MSA word /yaktubu/ 'he writes' is pronounced /yiktib/ (EGY and IRQ) or /yoktob/ (LEV). Second, final and unstressed long vowels are shortened, e.g., the word /maṭa:ra:t/ 'airports' in MSA becomes /maṭara:t/ in many dialects. Third, the MSA diphthongs /aw/ and /ay/ have mostly become /o:/ and /e:/, respectively. These vocalic changes, particularly vowel drop, lead to different syllabic structures. MSA syllables are primarily light (CV, CV:, CVC) but can also be heavier (CV:C and CVCC) in utterance-final positions. EGY syllables are the same as MSA's although without the utterance-final restriction. LEV, IRQ and GLF allow heavier syllables including word-initial clusters such as CCV:C and CCVCC.

4 Corpora

When training a system intended to classify languages or dialects, it is of course important to use training and testing corpora recorded under similar acoustic conditions. We are able to obtain corpora from the Linguistic Data Consortium (LDC) with similar recording conditions for four Arabic dialects: Gulf Arabic, Iraqi Arabic, Egyptian Arabic, and Levantine Arabic. These are corpora of spontaneous telephone conversations produced by native speakers of the dialects, speaking with family members, friends, and unrelated individuals, sometimes about predetermined topics. Although the data have been annotated phonetically and/or orthographically by LDC, in this paper we do not make use of any of these annotations.

We use the speech files of 965 speakers (about 41.02 hours of speech) from the Gulf Arabic Conversational Telephone Speech database for our Gulf Arabic data (Appen Pty Ltd, 2006a).2 From these speakers we hold out 150 speakers for testing (about 6.06 hours of speech).3 We use the Iraqi Arabic Conversational Telephone Speech database (Appen Pty Ltd, 2006b) for the Iraqi dialect, selecting 475 Iraqi Arabic speakers with a total duration of about 25.73 hours of speech. From these speakers we hold out 150 speakers4 for testing (about 7.33 hours of speech).

Our Levantine data consists of 1258 speakers from the Arabic CTS Levantine Fisher Training Data Set 1-3 (Maamouri, 2006). This set contains about 78.79 hours of speech in total. We hold out 150 speakers for testing (about 10 hours of speech) from Set 1.5

For our Egyptian data, we use CallHome Egyptian and its Supplement (Canavan et al., 1997) and CallFriend Egyptian (Canavan and Zipperlen, 1996). We use 398 speakers from these corpora (75.7 hours of speech), holding out 150 speakers for testing6 (about 28.7 hours of speech).

Unfortunately, as far as we can determine, there is no data with similar recording conditions for MSA. Therefore, we obtain our MSA training data from TDT4 Arabic broadcast news. We use about 47.6 hours of speech. The acoustic signal was processed using forced-alignment with the transcript to remove non-speech data, such as music. For testing we again use 150 speakers, this time identified automatically from the GALE Year 2 Distillation evaluation corpus (about 12.06 hours of speech). Non-speech data (e.g., music) in the test corpus was removed manually.

2We excluded very short speech files from the corpora.
3The 24 speakers in the devtest folder and the last 63 files, after sorting by file name, in the train2c folder (126 speakers). The sorting is done to make our experiments reproducible by other researchers.
4Similar to the Gulf corpus, the 24 speakers in the devtest folder and the last 63 files (after sorting by filename) in the train2c folder (126 speakers).
5We use the last 75 files in Set 1, after sorting by name.
6The test speakers were from the evaltest and devtest folders in CallHome and CallFriend.

It should be noted that the MSA data includes read speech by anchors and reporters as well as spontaneous speech spoken in interviews in studios and through the phone.

5 Our Dialect ID Approach

Since, as described in Section 3, Arabic dialects differ in many respects, such as phonology, lexicon, and morphology, it is highly likely that they differ in terms of phone-sequence distribution and phonotactic constraints. Thus, we adopt the phonotactic approach to distinguishing among Arabic dialects.

5.1 PRLM for Dialect ID

As mentioned in Section 2, the PRLM approach to language identification (Zissman, 1996) has had considerable success. Recall that, in the PRLM approach, the phones of the training utterances of a dialect are first identified using a single phone recognizer.7 Then an n-gram language model is trained on the resulting phone sequences for this dialect. This process results in an n-gram language model for each dialect that models the dialect's distribution of phone sequence occurrences. During recognition, given a test speech segment, we run the phone recognizer to obtain the phone sequence for this segment and then compute the perplexity of each dialect's n-gram model on the sequence. The dialect with the n-gram model that minimizes the perplexity is hypothesized to be the dialect from which the segment comes.

7The phone recognizer is typically trained on one of the languages being identified. Nonetheless, a phone recognizer trained on any language might be a good approximation, since languages typically share many phones in their phonetic inventory.
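The PRLM decision rule reduces to picking the dialect whose phone language model assigns the lowest perplexity to the decoded phone sequence. The sketch below illustrates this with a toy add-one-smoothed bigram model standing in for the tri-gram models the system actually trains; all data and class names here are hypothetical:

```python
import math
from collections import Counter

class BigramLM:
    """Add-one-smoothed bigram model over phone symbols (a toy
    stand-in for the tri-gram phone LMs used in the paper)."""
    def __init__(self, phone_sequences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.vocab = set()
        for seq in phone_sequences:
            seq = ["<s>"] + seq
            self.vocab.update(seq)
            for a, b in zip(seq, seq[1:]):
                self.unigrams[a] += 1
                self.bigrams[(a, b)] += 1

    def perplexity(self, seq):
        seq = ["<s>"] + seq
        logprob, n = 0.0, len(seq) - 1
        for a, b in zip(seq, seq[1:]):
            # Add-one smoothing so unseen phone bigrams get nonzero mass.
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + len(self.vocab))
            logprob += math.log(p)
        return math.exp(-logprob / n)

def classify(phone_seq, dialect_lms):
    """PRLM decision rule: the dialect whose LM gives the lowest
    perplexity on the decoded phone sequence wins."""
    return min(dialect_lms, key=lambda d: dialect_lms[d].perplexity(phone_seq))

# Toy "dialects" differing only in the reflex of MSA /q/:
lm_a = BigramLM([["t", "a", "r", "i", "q"]] * 5)
lm_b = BigramLM([["t", "a", "r", "i", "g"]] * 5)
assert classify(["t", "a", "r", "i", "g"], {"A": lm_a, "B": lm_b}) == "B"
```

In the actual system the phone sequences come from a phone recognizer and the models are tri-grams trained with SRILM, but the minimum-perplexity decision is the same.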
This process results in an n-gram lan- In our experiments, we have used phone recogniz- guage model for each dialect to model the dialect ers for English, German, Japanese, , Man- distribution of phone sequence occurrences. Dur- darin, and Spanish, from a toolkit developed by ing recognition, given a test speech segment, we Brno University of Technology.8 These phone rec- run the phone recognizer to obtain the phone se- ognizers were trained on the OGI multilanguage quence for this segment and then compute the per- database (Muthusamy et al., 1992) using a hybrid plexity of each dialect n-gram model on the se- approach based on Neural Networks and Viterbi quence. The dialect with the n-gram model that decoding without language models (open-loop) minimizes the perplexity is hypothesized to be the (Matejka et al., 2005). dialect from which the segment comes. Since Arabic dialect identification is our goal, Parallel PRLM is an extension to the PRLM ap- we hypothesize that an Arabic phone recognizer proach, in which multiple (k) parallel phone rec- would also be useful, particularly since other ognizers, each trained on a different language, are phone recognizers do not cover all Arabic con- used instead of a single phone recognizer (Ziss- sonants, such as pharyngeals and emphatic alveo- man, 1996). For training, we run all phone recog- lars. Therefore, we have built our own MSA phone nizers in parallel on the set of training utterances recognizer using the HMM toolkit (HTK) (Young of each dialect. An n-gram model on the outputs of et al., 2006). The monophone acoustic models each phone recognizer is trained for each dialect. are built using 3-state continuous HMMs without Thus if we have m dialects, k x m n-gram models state-skipping, with a mixture of 12 Gaussians per are trained. During testing, given a test utterance, state. 
We extract standard Mel Frequency Cepstral Coefficient (MFCC) features from 25 ms frames, with a frame shift of 10 ms. Each feature vector is 39D: 13 features (12 cepstral features plus energy), 13 deltas, and 13 double-deltas. The features are normalized using cepstral mean normalization. We use the Broadcast News TDT4 corpus (Arabic Set 1; 47.61 hours of speech; downsampled to 8Khz) to train our acoustic models.

8www.fit.vutbr.cz/research/groups/speech/sw/phnrec
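The 39-dimensional vectors can be assembled from the 13 static coefficients by appending first- and second-order difference features and applying cepstral mean normalization. The sketch below is a simplified stand-in for HTK's processing (it uses plain two-frame differences rather than HTK's windowed regression formula, and the input matrix of static MFCCs is hypothetical):

```python
import numpy as np

def deltas(feats):
    """First-order frame-to-frame differences, edge-padded.
    (HTK computes deltas with a windowed regression formula;
    this two-frame difference is a simplified approximation.)"""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def make_39d(static_13):
    """Stack 13 static MFCCs (12 cepstra + energy) with deltas and
    double-deltas, then apply cepstral mean normalization."""
    d = deltas(static_13)
    dd = deltas(d)
    full = np.hstack([static_13, d, dd])  # shape: (frames, 39)
    return full - full.mean(axis=0)       # cepstral mean normalization

frames = np.random.randn(100, 13)         # hypothetical static MFCCs
assert make_39d(frames).shape == (100, 39)
```

Mean normalization per utterance, as in the last line, removes stationary channel effects — useful here because the corpora mix telephone and broadcast channels.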

57 72.8*0!$'%5()! 1(2$/(3*&*()! !"#$%&'(&

)*+,&'(& !*/0'"()1#-+( -./01%#2&'(& 2+"#.-'3+*(( '34#21%23&'(&

(56&'(&

456/*)'!$'%5()! !"#$%&'(&

)*+,&'(& 4-.5'%1()1#-+( !"#$%&'"( -./01%#2&'(& 7/"894-:( 2+"#.-'3+*(( )*+,*#"+%%'-.! '34#21%23&'(& ;5/%%'<'+*((

(56&'(&

9.$.5()(!$'%5()! "#$%&'()*+(,! !"#$%&'(& -*./(0&! )*+,&'(& 6/,/-+%+()1#-+( -./01%#2&'(& 2+"#.-'3+*(( '34#21%23&'(&

(56&'(&

Figure 1: Parallel Phone Recognition Followed by Language Modeling (PRLM) for Arabic Dialect Identification.

The pronunciation dictionary is generated as described in (Biadsy et al., 2009). Using these settings we build three MSA phone recognizers: (1) an open-loop phone recognizer which does not distinguish emphatic vowels from non-emphatic ones (ArbO), (2) an open-loop recognizer with emphatic vowels (ArbOE), and (3) a phone recognizer with emphatic vowels and with a bi-gram phone language model (ArbLME). We add a new pronunciation rule to the set of rules described in (Biadsy et al., 2009) to distinguish emphatic vowels from non-emphatic ones (see Section 3) when generating our pronunciation dictionary for training the acoustic models for the phone recognizers. In total we build 9 (Arabic and non-Arabic) phone recognizers.

6 Experiments and Results

In this section, we evaluate the effectiveness of the parallel PRLM approach on distinguishing Arabic dialects. We first run the nine phone recognizers described in Section 5 on the training data described in Section 4, for each dialect. This process produces nine sets of phone sequences for each dialect. In our implementation, we train a tri-gram language model on each phone set using the SRILM toolkit (Stolcke, 2002). Thus, in total, we have 9 x (number of dialects) tri-gram models.

In all our experiments, the 150 test speakers of each dialect are first decoded using the phone recognizers. Then the perplexities of the corresponding tri-gram models on these sequences are computed and given to the logistic regression classifier. Instead of splitting our held-out data into test and training sets, we report our results with 10-fold cross validation.

We have conducted three experiments to evaluate our system. The first is to compare the performance of our system to Alorfi's (2008) on the same two dialects (Gulf and Egyptian Arabic). The second is to attempt to classify four colloquial Arabic dialects. In the third experiment, we include MSA as well in a five-way classification task.

6.1 Gulf vs. Egyptian Dialect ID

To our knowledge, Alorfi's (2008) work is the only work dealing with the automatic identification of Arabic dialects. In this work, an Ergodic HMM is used to model phonetic differences between Gulf and Egyptian Arabic using MFCC and delta features. The test and training data used in this work was collected from TV soap operas containing both the Egyptian and Gulf dialects and from twenty speakers from the CallHome Egyptian database. The best accuracy reported by Alorfi (2008) on identifying the dialect of 40 utterances of 30 seconds each, from 40 male speakers (20 Egyptian and 20 Gulf speakers), is 96.67%.

Since we do not have access to the test collection used in (Alorfi, 2008), we test a version of our system which identifies these two dialects only, on our 150 Gulf and 150 Egyptian speakers, as described in Section 4. Our best result is 97.00% (Egyptian and Gulf F-Measure = 0.97) when using only the features from the ArbOE, English, Japanese, and Mandarin phone recognizers. While our accuracy might not be significantly higher than that of Alorfi's, we note a few advantages of our experiments. First, the test sets of both dialects are from telephone conversations, with the same recording conditions, as opposed to a mix of different genres. Second, in our system we test 300 speakers as opposed to 40, so our results may be more reliable. Third, our test data includes female as well as male speakers.

Four-way classification results by test-utterance duration: overall accuracy and per-dialect F-Measure (%).

Duration  Accuracy  Gulf  Iraqi  Levantine  Egyptian
5s        60.83     49.2  52.7   58.1       83.0
15s       72.83     60.8  61.2   77.6       91.9
30s       78.50     68.7  67.3   84.0       94.0
45s       81.50     72.6  72.4   86.9       93.7
60s       83.33     75.1  75.7   87.9       94.6
120s      84.00     75.1  75.4   89.5       96.0

Dur.  Acc. (%)  Phone Recognizers
5s    60.83     ArbOE+ArbLME+G+H+M+S
15s   72.83     ArbOE+ArbLME+G+H+M
30s   78.50     ArbO+H+S
45s   81.50     ArbOE+ArbLME+H+G+S
60s   83.33     ArbOE+ArbLME+E+G+H+M
120s  84.00     ArbOE+ArbLME+G+M

Table 1: Accuracy of the four-way classification (four colloquial Arabic dialects) and the best combination of phone recognizers used per test-utterance duration. The phone recognizers used are: E=English, G=German, H=Hindi, M=Mandarin, S=Spanish, ArbO=open-loop MSA without emphatic vowels, ArbOE=open-loop MSA with emphatic vowels.

Five-way classification results (four colloquial dialects plus MSA) by test-utterance duration: overall accuracy and F-Measure (%) for the colloquial dialects.

Duration  Accuracy  Gulf  Iraqi  Levantine  Egyptian
5s        68.67     54.5  50.7   60.0       77.9
15s       76.67     57.3  62.6   73.8       90.7
30s       81.60     68.3  71.7   79.4       90.2
45s       84.80     69.9  73.6   86.2       94.9
60s       86.93     76.8  76.5   85.4       96.3
120s      87.86     79.1  77.4   90.1       93.6

The best results are obtained when the MSA phone recognizers (ArbO, ArbOE, and/or ArbLME) are included; removing them completely leads to a significant drop in accuracy. In this classification task, we observe that all phone recognizers play a role in the classification in some of the conditions.

7 Conclusions and Future Work

In this paper, we have shown that four Arabic colloquial dialects (Gulf, Iraqi, Levantine, and Egyptian) plus MSA can be distinguished using a phonotactic approach.

References

F. S. Alorfi. 2008. Automatic Identification of Arabic Dialects Using Hidden Markov Models. Ph.D. dissertation, University of Pittsburgh.

Appen Pty Ltd. 2006a. Gulf Arabic Conversational Telephone Speech. Linguistic Data Consortium, Philadelphia.

Appen Pty Ltd. 2006b. Iraqi Arabic Conversational Telephone Speech. Linguistic Data Consortium, Philadelphia.

M. Barkat, J. Ohala, and F. Pellegrino. 1999. Prosody as a Distinctive Feature for the Discrimination of Arabic Dialects. In Proceedings of Eurospeech'99.

F. Biadsy, N. Habash, and J. Hirschberg. 2009. Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules. In Proceedings of NAACL/HLT 2009, Colorado, USA.

A. Canavan and G. Zipperlen. 1996. CALLFRIEND Egyptian Arabic Speech. Linguistic Data Consortium, Philadelphia.

A. Canavan, G. Zipperlen, and D. Graff. 1997. CALLHOME Egyptian Arabic Speech. Linguistic Data Consortium, Philadelphia.

J. S. Garofolo et al. 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia.

N. Habash. 2006. On Arabic and its Dialects. Multilingual Magazine, 17(81).

R. Hamdi, M. Barkat-Defradas, E. Ferragne, and F. Pellegrino. 2004. Speech Timing and Rhythmic Structure in Arabic Dialects: A Comparison of Two Approaches. In Proceedings of Interspeech'04.

C. Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown University Press. Revised Edition.

B. Ma, D. Zhu, and R. Tong. 2006. Chinese Dialect Identification Using Tone Features Based on Pitch Flux. In Proceedings of ICASSP'06.

M. Maamouri. 2006. Levantine Arabic QT Training Data Set 5, Speech. Linguistic Data Consortium, Philadelphia.

P. Matejka, P. Schwarz, J. Cernocky, and P. Chytil. 2005. Phonotactic Language Identification Using High Quality Phoneme Recognition. In Proceedings of Eurospeech'05.

Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. 1992. The OGI Multi-Language Telephone Speech Corpus. In Proceedings of ICSLP'92.

J. Peters, P. Gilles, P. Auer, and M. Selting. 2002. Identification of Regional Varieties by Intonational Cues. An Experimental Study on Hamburg and Berlin German. Language and Speech, 45(2):115–139.

F. Ramus. 2002. Acoustic Correlates of Linguistic Rhythm: Perspectives. In Speech Prosody 2002.

A. Stolcke. 2002. SRILM - an Extensible Language Modeling Toolkit. In Proceedings of ICSLP'02, pages 901–904.

P. Torres-Carrasquillo, T. P. Gleason, and D. A. Reynolds. 2004. Dialect Identification Using Gaussian Mixture Models. In Proceedings of the Speaker and Language Recognition Workshop, Spain.

S. Young, G. Evermann, M. Gales, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. 2006. The HTK Book, version 3.4.

M. A. Zissman, T. Gleason, D. Rekart, and B. Losiewicz. 1996. Automatic Dialect Identification of Extemporaneous Conversational, Latin American Spanish Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, USA.

M. A. Zissman. 1996. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech. IEEE Transactions on Speech and Audio Processing, 4(1).
