
RAPID PHONETIC TRANSCRIPTION USING EVERYDAY LIFE NATURAL CHAT ALPHABET ORTHOGRAPHY FOR DIALECTAL ARABIC SPEECH RECOGNITION

Mohamed Elmahdy†, Rainer Gruhn, Slim Abdennadher†, Wolfgang Minker

Faculty of Engineering & Computer Science, University of Ulm, Ulm, Germany; † Faculty of Media Engineering & Technology, German University in Cairo, Cairo, Egypt

ABSTRACT

We propose the Arabic Chat Alphabet (ACA), as naturally written in everyday life, for dialectal Arabic speech transcription. Our assumption is that ACA is a natural orthography that includes the short vowels that are missing in traditional Arabic orthography. Furthermore, ACA transcriptions can be rapidly prepared. Egyptian Colloquial Arabic was chosen as a typical dialect. Two speech recognition baselines were built: phonemic and graphemic. Original transcriptions were re-written in ACA by different transcribers. Ambiguous ACA sequences were handled by automatically generating all possible variants. ACA variations across transcribers were modeled by normalization and phoneme merging. Results show that the ACA-based approach outperforms the graphemic baseline, while it performs as accurately as the phoneme-based baseline with a slight increase in WER.

Index Terms — Arabic, acoustic modeling, chat alphabet, phonetic transcription, speech recognition

1. INTRODUCTION

Modern Standard Arabic (MSA) is currently the formal Arabic variety across all Arabic speakers. MSA is used in news broadcast, newspapers, formal speeches, books, movie subtitling, etc. Practically, MSA is not the natural spoken language for native Arabic speakers. In fact, dialectal (or colloquial) Arabic is the natural spoken form in everyday life. The different Arabic dialects (Egyptian, Levantine, Iraqi, Gulf, etc.) are mainly spoken and not written, and significant phonological, morphological, syntactic, and lexical differences exist between the dialects and MSA. This situation is called diglossia [1]. Because of the diglossic nature of dialectal Arabic, little research has been done in dialectal Arabic speech recognition, as well as in other natural language processing tasks. For MSA, on the other hand, a lot of research has been conducted.

Dialectal Arabic speech resources are very limited. Moreover, for some dialects, no resources are available at all. This is mainly due to the difficulties of estimating the correct phonetic transcription for the different Arabic dialects. Basically, Arabic is a morphologically very rich language. That is why a simple lookup table for phonetic transcription (essential for acoustic modeling) is not appropriate, because of the high out-of-vocabulary (OOV) rate. Furthermore, Arabic orthographic transcriptions are written without diacritics. Diacritics are essential to estimate short vowels, nunation, gemination, and silent letters. State-of-the-art techniques for MSA phonetic transcription are usually done in several phases. In one phase, transcriptions are written without diacritics. Afterwards, automatic diacritization is performed to estimate the missing diacritic marks (WER is 15%-25%) as in [2] and [3]. Finally, the mapping from diacritized text to phonetic transcription is almost a one-to-one mapping.

Dialectal Arabic usually differs significantly from MSA, to the extent that the dialects can be considered totally different languages. That is why phonetic transcription techniques for MSA cannot be applied directly on dialectal Arabic. In order to avoid automatic or manual diacritization, graphemic acoustic modeling was proposed for MSA [4], where the phonetic transcription is approximated to be the sequence of word letters while ignoring short vowels. It could be noticed that graphemic systems work with an acceptable recognition rate; however, the performance is still below the accuracy of phonemic models. Since MSA and the Arabic dialects share the same character inventory, the grapheme-based approach was also applicable for dialectal Arabic, as shown in [5] and [6].

In this paper, we propose the Arabic Chat Alphabet (ACA) for dialectal Arabic speech transcription, instead of traditional Arabic orthography, for the purpose of acoustic modeling. Transcriptions are written directly in ACA exactly as in everyday life, without any special training for transcribers. In fact, we have noticed that ACA is usually written with the short vowels that are omitted in normal Arabic orthography. Furthermore, it is natural and very well known among all Arabic computer users. It was also found that the majority of computer users type in ACA faster than in traditional Arabic orthography. So, our assumption is that ACA-based transcriptions are closer to fully phonetically transcribed text than graphemic transcriptions (Arabic letters); moreover, transcription development time is significantly reduced. In previous work [7], a romanized transcription scheme was proposed for Arabic phonetic transcription; however, that technique needs specially trained transcribers, which is too costly and requires a significantly long development time. We would like to highlight that we are not proposing a transliteration notation such as Buckwalter [8]. Buckwalter notation cannot be directly written by any transcriber; it is only used to transliterate Arabic letters and diacritics into standard ASCII characters that are fully reversible to the original Arabic letters.

Egyptian Colloquial Arabic (ECA) was chosen in our work as a typical dialect. Unlike other Arabic dialects, ECA was mainly selected since for that dialect there exist some pronunciation dictionaries or lexicons like [9] and [10]. Thus, it is possible to build a phonemic baseline which can be used to evaluate the performance of any other proposed technique. The main phonetic characteristics of ECA when compared to MSA have already been summarized in [6].

2. ARABIC CHAT ALPHABET

Arabic Chat Alphabet (ACA), also known as Arabizi, Arabish, Franco-Arab, or Franco, is a writing system for Arabic in which English letters are written instead of Arabic ones. Basically, it is an encoding system that represents every Arabic phoneme with the English letter that matches the same pronunciation [11].

Arabic phonemes that do not have an equivalent in English are replaced with numerals or with accent marks that are close in shape to the corresponding Arabic letter. ACA was originally introduced when the Arabic alphabet was not available on computer systems. ACA is widely used in chat rooms, SMS, social networks, and non-formal e-mails. Actually, the majority still prefers writing Arabic in ACA instead of the original Arabic alphabet, even if the Arabic alphabet is supported, since ACA is much faster for them.

By comparing ACA-based text to normal Arabic orthography, it could be noticed that short vowels are usually written in ACA. For example, the word for milk is written in traditional Arabic with only three letters (Lam, Beh, and Noon), from which we can only estimate the three consonants /l/, /b/, and /n/; the vowels in-between are missing. In ACA, on the other hand, the same word is commonly written laban, where we have the same three consonants plus two short vowels of the type /a/. Thus, ACA-based orthography provides us with more information about the missing vowels compared to normal Arabic orthography.

In order to prove our first assumption that ACA-based transcription is easier and faster than Arabic orthography (non-diacritized script), we conducted a survey over 100 Arabic computer users: 86% of the users confirmed that they type faster using ACA, 9% do not feel a difference, and 5% type Arabic letters slightly faster than ACA. All users confirmed that it is almost impossible to type a correct, fully diacritized Arabic text. Further supporting our assumption, some research has recently been done to help the vast majority of Arabic computer users who are not able to type directly in Arabic orthography, as in [12] and [13], where automatic conversion from ACA back to Arabic letters was proposed.

Table 1. ECA phonemes in IPA, SAMPA, and ACA with corresponding Arabic letters. Note that for each vowel there exist two forms: short and long. Long vowels are almost double the duration of the short ones.

    Type        IPA    SAMPA   ACA
    Consonant   ʔ      ?       2, '
                b      b       b
                p      p       p
                t      t       t
                g      g       g
                ʒ      Z       j
                ħ      X\      7
                x      x       kh, 5, 7'
                d      d       d
                r      r       r
                z      z       z
                s      s       s
                ʃ      S       sh
                sˤ     s'      s, 9
                dˤ     d'      d, 9'
                tˤ     t'      t, 6
                zˤ     D'      z, 6'
                ʕ      ?\      3
                ɣ      G       gh, 3'
                f      f       f
                v      v       v
                q      q       q, 8, 9
                k      k       k
                l      l       l
                m      m       m
                n      n       n
                h      h       h
                w      w       w
                j      j       y
    Vowel       a      a       a
                ɑ      A       a
                i      i       i, e
                e      e       i, e
                u      u       u, o
                o      o       u, o

3. SPEECH CORPUS

The same ECA corpus as in [6] has been used in this work. The corpus was recorded with a Sennheiser ME 3-N super-cardioid microphone in linear PCM, 16 kHz, and 16 bits. The database includes utterances from different speech domains like greetings, time and dates, restaurants, train reservation, Egyptian proverbs, etc. The diversity of speech domains ensures good coverage of acoustic features. The corpus consists of 15K distinct tri-phones with a lexicon of 700 words with accurate phonetic transcription. Every speaker was prompted 50 utterances randomly selected from the database. Overall we have 22 native Egyptian speakers. The corpus was divided into a training set of 14 speakers and a testing set of 8 speakers.

The initial ECA phoneme set consists of twenty-nine consonants and twelve vowels. Table 1 shows ECA phonemes in IPA, SAMPA, ACA, and the corresponding Arabic letters. In ECA, some consonants may be represented by more than one unique Arabic letter, like /t/ and /d/. Some Arabic letters are represented by more than one form in ACA notation, like the letters Khah and Ghain. Short vowels are represented in Arabic by diacritic marks, which are very uncommonly written since the reader infers them from the context.
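As a rough, hypothetical illustration of how an ACA string can be projected onto the phoneme inventory of Table 1 (this is not the paper's tooling: the mapping dictionary is abridged, long vowels are rendered with a length mark by our own choice, and digraphs such as sh/kh/gh are matched greedily while ignoring the ambiguity treated later in Section 5.3):

    # Rough, hypothetical sketch (not the paper's tooling): project an ACA word
    # onto a SAMPA-like phoneme string using an abridged subset of the Table 1
    # mapping.  Digraphs and doubled (long) vowels are matched greedily; the
    # ambiguity of such sequences is ignored here (see Section 5.3).

    ACA2SAMPA = {
        # digraphs (tried first)
        "sh": "S", "kh": "x", "gh": "G",
        "aa": "a:", "ee": "e:", "ii": "i:", "oo": "o:", "uu": "u:",
        # numerals and single letters
        "2": "?", "3": "?\\", "5": "x", "7": "X\\", "6": "t'", "9": "s'",
        "b": "b", "t": "t", "g": "g", "d": "d", "r": "r", "z": "z", "s": "s",
        "f": "f", "q": "q", "k": "k", "l": "l", "m": "m", "n": "n",
        "h": "h", "w": "w", "y": "j",
        "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    }

    def aca_to_phonemes(word):
        """Greedy longest-match conversion of one ACA word into phoneme symbols."""
        phones, i = [], 0
        while i < len(word):
            for n in (2, 1):                  # try a digraph before a single symbol
                chunk = word[i:i + n].lower()
                if chunk in ACA2SAMPA:
                    phones.append(ACA2SAMPA[chunk])
                    i += n
                    break
            else:
                i += 1                        # skip characters outside the table
        return phones

    print(aca_to_phonemes("laban"))       # ['l', 'a', 'b', 'a', 'n']
    print(aca_to_phonemes("2uktoobar"))   # ['?', 'u', 'k', 't', 'o:', 'b', 'a', 'r']

The laban example above makes the point: the ACA form yields the two short /a/ vowels that the bare Arabic letters cannot provide.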

3.1. ACA transcription layer

Another transcription layer in ACA was added to the corpus. Since ACA may vary across users, one transcriber is not enough to catch all possible variations. Five different volunteer transcribers were asked to transcribe utterances from the corpus in ACA exactly as in their everyday life, without any specific phonetic guidelines. The only requirement was to transcribe numbers in normalized word form rather than in digits. Fig. 1 shows some sample utterances with corresponding IPA and ACA transcriptions.

4. SYSTEM DESCRIPTION AND BASELINES

Our system is a GMM-HMM architecture based on the CMU Sphinx engine. Acoustic models are all fully continuous density context-dependent tri-phones with 3 states per HMM. A bi-gram language model with Kneser-Ney smoothing was trained using the transcriptions of the ECA training set plus an additional 25,000 utterances from the same speech domains. All language modeling parameters were fixed during the whole work, so that any change in recognition rate is mainly due to acoustic modeling. We have built two baseline systems for the two well-known acoustic modeling techniques for Arabic: phoneme-based and grapheme-based.
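Purely as a hedged, self-contained illustration of what the Kneser-Ney smoothing mentioned above computes for bigrams (the absolute discount D = 0.75 is an assumed value, and this is not the LM toolkit actually used with the recognizer):

    # Hedged sketch of interpolated Kneser-Ney smoothing for a bigram model.
    # The discount D = 0.75 is an assumed value, not taken from the paper.
    from collections import Counter, defaultdict

    def train_kn_bigram(sentences, discount=0.75):
        hist_counts, bigrams = Counter(), Counter()
        followers, histories = defaultdict(set), defaultdict(set)
        for sent in sentences:
            toks = ["<s>"] + sent.split() + ["</s>"]
            for v, w in zip(toks, toks[1:]):
                hist_counts[v] += 1          # c(v, .): how often v occurs as a history
                bigrams[(v, w)] += 1         # c(v, w)
                followers[v].add(w)          # distinct continuations of v
                histories[w].add(v)          # distinct histories preceding w
        n_bigram_types = len(bigrams)

        def prob(w, v):
            # continuation probability P_cont(w) = N1+(., w) / N1+(., .)
            p_cont = len(histories[w]) / n_bigram_types
            if hist_counts[v] == 0:          # unseen history: back off completely
                return p_cont
            discounted = max(bigrams[(v, w)] - discount, 0.0) / hist_counts[v]
            backoff_mass = discount * len(followers[v]) / hist_counts[v]
            return discounted + backoff_mass * p_cont

        return prob

    # Toy usage on two ACA utterances from Fig. 1:
    p = train_kn_bigram(["yoom 2il7'amiis waa7ed 2uktoobar 2alfeen w3ashaara",
                         "tazkara daraga 2uula"])
    print(p("daraga", "tazkara"))    # 0.3125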

Fig. 1. Samples from the ECA speech corpus with Arabic, IPA, and ACA transcriptions. Note that the shown ACA is the most correct form, as ACA varies across transcribers.

    English: Thursday 1st of October 2010
    IPA:     /yo:m ʔilxami:s wa:ħid ʔukto:bar ʔalfe:n wʕɑʃɑ:rɑ/
    ACA:     /yoom 2il7'amiis waa7ed 2uktoobar 2alfeen w3ashaara/

    English: First class ticket
    IPA:     /tɑzkɑrɑ dɑrɑgɑ ʔu:la/
    ACA:     /tazkara daraga 2uula/

    English: Boiled eggs
    IPA:     /be:dˤ maslu:ʔ/
    ACA:     /bee9' masluu2/

Table 2. Speech recognition results using phoneme-based, grapheme-based, and ACA-based acoustic models.

    Acoustic model       WER(%)   Relative1(%)   Relative2(%)
    Phoneme-based        13.4     Baseline1      -28.0
    Grapheme-based       18.6     +38.8          Baseline2
    ACA-based initial    16.9     +26.1          -9.1
    ACA-based final      14.1     +5.2           -24.2
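The Relative1 and Relative2 columns in Table 2 are plain relative WER changes against the phoneme-based and grapheme-based baselines respectively; the short check below (not part of the original paper) reproduces every entry:

    # Reproduce the Relative1/Relative2 columns of Table 2:
    # relative change = 100 * (WER_model - WER_baseline) / WER_baseline
    wer = {"phoneme": 13.4, "grapheme": 18.6, "aca_initial": 16.9, "aca_final": 14.1}

    def relative(model, baseline):
        return round(100.0 * (wer[model] - wer[baseline]) / wer[baseline], 1)

    print(relative("grapheme", "phoneme"))      # 38.8   (Table 2: +38.8)
    print(relative("phoneme", "grapheme"))      # -28.0
    print(relative("aca_initial", "phoneme"))   # 26.1   (Table 2: +26.1)
    print(relative("aca_initial", "grapheme"))  # -9.1
    print(relative("aca_final", "phoneme"))     # 5.2    (Table 2: +5.2)
    print(relative("aca_final", "grapheme"))    # -24.2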

4.1. Phoneme-based baseline

An ECA phoneme-based acoustic model baseline was trained with the ECA training set. The optimized number of Gaussians per state and the optimized number of tied-states were found to be 250 and 4 respectively. No approximations were applied on the phoneme set, which consists of 41 phonemes. Decoding the ECA test set with this baseline acoustic model gave an absolute WER of 13.4%, as shown in Table 2.

4.2. Grapheme-based baseline

Grapheme-based acoustic modeling (also known as graphemic modeling) is an acoustic modeling approach for Arabic where the phonetic transcription is approximated by the Arabic word letters instead of the exact phoneme sequence. All possible vowels, geminations, or nunations are assumed to be implicitly modeled in the acoustic model. The graphemic convention in our work assigns one distinct phoneme to each letter, except that all forms of Alef and Hamza were assigned the same phoneme model. The grapheme-based baseline was built using only the ECA training set, in a similar way to the phoneme-based baseline, but the lexicon in this case is purely graphemic. The optimized number of Gaussians and tied-states were found to be 125 and 8 respectively. Decoding the ECA test set with the grapheme-based acoustic model gave an absolute WER of 18.6%, as shown in Table 2, a +38.8% relative increase in WER compared to the phoneme-based baseline.
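A minimal sketch of the graphemic convention just described, assuming a plausible set of Alef/Hamza forms (the exact collapsed set, the example word, and the symbol names are illustrative, not the authors' internal inventory):

    # Minimal sketch of the graphemic convention of Section 4.2: one acoustic
    # unit per Arabic letter, with all Alef and Hamza forms collapsed onto a
    # single shared unit.  The collapsed set and "G_" symbol names are assumed.
    ALEF_HAMZA_FORMS = set("اأإآءؤئ")      # assumed forms sharing one model

    def graphemic_pronunciation(arabic_word):
        units = []
        for letter in arabic_word:
            if letter in ALEF_HAMZA_FORMS:
                units.append("A")             # shared Alef/Hamza model
            else:
                units.append("G_" + letter)   # one distinct unit per letter
        return units

    # The word for "milk" (laban) is written with the letters Lam, Beh, Noon only,
    # so its graphemic entry carries no trace of the two short /a/ vowels:
    print(graphemic_pronunciation("لبن"))     # ['G_ل', 'G_ب', 'G_ن']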
5. ACA-BASED ACOUSTIC MODELING

5.1. Initial ACA-based model

Since there are different ACA representations for the same phoneme, we have normalized all ACA transcriptions of the ECA corpus into one consistent notation. More specifically, the different representations of a phoneme were mapped to the same phoneme model; e.g., /kh/, /5/, and /7'/ were all normalized to the same phoneme. Since the vowel /A/ (emphatic /a/) does not have an equivalent in ACA, as it is always transcribed as /a/, it was removed from our phoneme set (long and short forms). Our phoneme set now consists of 39 phonemes. The normalized ACA transcriptions were used to train the initial ACA-based acoustic model. The optimized number of Gaussians and tied-states were found to be 125 and 8 respectively. An ACA-based lexicon was generated from the two parallel transcription layers: ACA and traditional Arabic. This lexicon was only utilized in testing, to allow the same language model as in the baselines. The speech recognition result was an absolute WER of 16.9%, as shown in Table 2: a +26.1% relative increase in WER compared to the phoneme-based baseline, while it outperformed the grapheme-based baseline by a -9.1% relative decrease in WER.

5.2. Modeling ACA common transcription errors

ACA is not a standard encoding and it varies among transcribers. That is why we have summarized all common transcription errors that occurred in the ACA transcription layer of the corpus. The common errors (observed at least 4 times) were as follows: (1) Confusion between long and short vowels, for instance using /a/ instead of /aa/ or vice versa. (2) Confusion between /y/, /i/, and /e/. (3) Confusion between /u/ and /o/. (4) Writing /ea/ instead of /ee/, e.g. instead of writing /2eeh/ (what), it is sometimes written /2eah/. (5) Ignoring the glottal stop /2/ at word beginnings, e.g. writing /ana/ (me) instead of /2ana/. (6) Geminations (consonant doublings) are sometimes not written, e.g. writing /s/ instead of /ss/. (7) Ignoring /h/ at word endings. (8) Incorrect representation of emphatic phonemes like /9/, /9'/, /6/, /6'/, and /q/, typing the corresponding non-emphatic forms /s/, /d/, /t/, /z/, and /k/ instead. (9) Rare and foreign phonemes are sometimes misrepresented, e.g. typing /g/ instead of /j/.

In order to model all ACA common errors, one idea was to generate all possible variants for a given word. This idea was quickly rejected, since we found that the number of variants would be huge (it may exceed 10 variants per word on average). A high number of variants increases confusability, as the difference in pronunciation between words becomes smaller. Furthermore, it leads to a very complex search space, as the decoder has to consider all of them. Our alternative solution was to apply several approximations and phoneme mergings to cover all error types (uncommon errors that occurred less than 4 times were not considered), relying on the assumption that they will be implicitly modeled in the acoustic model. ACA transcriptions were further pre-processed as follows: (1) Merging short vowels and their corresponding long vowels into the same phoneme model; in other words, regardless of whether the transcription is /aa/ or /a/, both are normalized to /a/. (2) Merging /y/, /i/, and /e/. (3) Merging /u/ and /o/. (4) /ea/ is normalized to /ee/. (5) In Arabic, there is a constraint that words or utterances cannot start with vowels; thus, /2/ was automatically added to the beginning of words starting with vowels. (6) Double consonants were approximated to be single consonants. (7) /h/ is deleted if found at word endings. (8) Merging emphatic consonants and their corresponding non-emphatic forms. (9) Normalizing foreign phonemes to the nearest Arabic ones, approximating /v/, /p/, and /j/ to /f/, /b/, and /g/ respectively. Finally, the total number of phonemes was reduced to 23. To clarify the concept of normalization in this stage with an example: the words /2eah/, /2eeh/, /eah/, /eeh/, /eih/, /ea/, etc. (what) are all mapped to the same model.
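The normalization and merging rules of Sections 5.1 and 5.2 can be pictured as ordered string rewrites. The sketch below is an assumption-laden condensation: the rule order, the regular expressions, and the single-character placeholder symbols (x, G, S) are ours, not the authors' implementation.

    # Condensed sketch of the normalization / merging rules of Sections 5.1-5.2,
    # written as ordered regex rewrites on ACA words.  Rule order, the patterns,
    # and the placeholder symbols are assumptions, not the authors' code.
    import re

    RULES = [
        (r"kh|7'|5", "x"),      # 5.1: ACA variants of one phoneme -> one symbol
        (r"gh|3'", "G"),        #      (digraphs treated as single phonemes here;
        (r"sh", "S"),           #       their ambiguity is handled in Section 5.3)
        (r"ea", "ee"),          # 5.2 (4): /ea/ -> /ee/
        (r"aa+", "a"),          # 5.2 (1): merge long and short vowels
        (r"ee+", "i"),          # 5.2 (2): merge /y/, /i/, /e/
        (r"[yie]", "i"),
        (r"oo+|uu+", "u"),      # 5.2 (3): merge /u/ and /o/
        (r"[ou]", "u"),
        (r"(.)\1+", r"\1"),     # 5.2 (6): drop geminations (doubled symbols)
        (r"h$", ""),            # 5.2 (7): drop word-final /h/
        (r"9'", "d"),           # 5.2 (8): emphatic -> non-emphatic
        (r"6'", "z"),
        (r"9", "s"), (r"6", "t"), (r"q|8", "k"),
        (r"v", "f"), (r"p", "b"), (r"j", "g"),   # 5.2 (9): foreign phonemes
    ]

    def normalize_aca(word):
        w = word.lower()
        for pattern, repl in RULES:
            w = re.sub(pattern, repl, w)
        if w and w[0] in "aiu":   # 5.2 (5): a word cannot start with a vowel
            w = "2" + w
        return w

    print({normalize_aca(w) for w in ["2eah", "2eeh", "eah", "eeh", "eih", "ea"]})
    # -> {'2i'}: the Section 5.2 example words collapse to a single form

Running it on the example words above collapses /2eah/, /2eeh/, /eah/, /eeh/, /eih/, and /ea/ into one normalized form, which is exactly the behaviour the merging is meant to achieve.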

5.3. Modeling ACA ambiguity

Some phoneme sequences in ACA are phonetically ambiguous; e.g., the consonant /3'/ that corresponds to the Arabic letter Ghain can in many cases be written /gh/. The question would be: does the sequence /gh/ represent /3'/, or /g/ followed by /h/? Other ambiguous examples are /kh/ and /sh/, which may correspond to the letters Khah and Sheen respectively. Our solution was to initially train a preliminary context-independent acoustic model. All possible pronunciation variants were automatically generated for ambiguous sequences as long as they satisfy two constraints. First, no more than two consecutive consonants are allowed in Arabic. Second, the syllable patterns allowed in Arabic are CV, CVC, and CVCC, where C is a consonant and V is a vowel; note that the CVCC pattern can only be found at word endings. Afterwards, the context-independent acoustic model is used to force-align the ACA transcriptions in order to select the most probable pronunciation variants. Finally, the force-aligned transcriptions are used to train the final ACA-based acoustic model.
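The variant-generation step can be sketched as follows. This is a hypothetical illustration: the digraph inventory, the tokenizer, and the example words are ours, and the final selection by forced alignment with the context-independent model is not shown.

    # Hypothetical sketch of the variant-generation step of Section 5.3: each
    # ambiguous ACA digraph is expanded either as one phoneme or as two, and
    # variants violating the two stated constraints are discarded.
    import itertools
    import re

    AMBIGUOUS = {"gh": [["gh"], ["g", "h"]],
                 "kh": [["kh"], ["k", "h"]],
                 "sh": [["sh"], ["s", "h"]]}
    VOWELS = set("aeiou")

    def tokenize(word):
        """Split an ACA word into chunks, keeping ambiguous digraphs whole."""
        chunks, i = [], 0
        while i < len(word):
            if word[i:i + 2] in AMBIGUOUS:
                chunks.append(word[i:i + 2])
                i += 2
            else:
                chunks.append(word[i])
                i += 1
        return chunks

    def is_valid(phones):
        """Constraints: at most two consecutive consonants, and a
        CV / CVC / CVCC syllable skeleton with CVCC only word-finally."""
        skeleton = "".join("V" if p in VOWELS else "C" for p in phones)
        if "CCC" in skeleton:
            return False
        return re.fullmatch(r"(?:CV|CVC)*(?:CV|CVC|CVCC)", skeleton) is not None

    def variants(word):
        options = [AMBIGUOUS.get(chunk, [[chunk]]) for chunk in tokenize(word)]
        for combo in itertools.product(*options):
            phones = [p for group in combo for p in group]
            if is_valid(phones):
                yield phones

    for word in ["mesh", "ghali"]:            # illustrative ACA words
        print(word, "->", list(variants(word)))

For a word such as mesh both readings survive the constraints and the forced alignment would then pick the more probable one; for ghali the two-phoneme reading /g h/ is rejected because it would start the word with two consonants.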
5.4. ACA-based final results

After modeling ACA common errors and ambiguity, speech recognition results indicate an absolute WER of 14.1%, with a small relative increase of +5.2% compared to the phoneme-based baseline (see Table 2). This implies that ACA-based models can perform almost as accurately as phoneme-based models, and this is mainly due to the representation of short vowels. It can also be noticed that the ACA-based model has outperformed the grapheme-based baseline by a -24.2% relative decrease in WER, which is a major improvement over graphemic modeling for dialectal Arabic, since accurate phonetic transcription for the different Arabic dialects is in most cases too costly and may not be feasible at all for some dialects. The results also confirm that all the applied approximations are acceptable and contribute to a better recognition rate.

6. DISCUSSIONS

In our proposed approach of using ACA orthographic transcriptions, the ACA-based acoustic model has outperformed the grapheme-based one (given that transcription time was also reduced) because of several facts: (1) In grapheme-based modeling, short vowels are not explicitly modeled, while in ACA the majority of short vowels are explicitly modeled; we have found that in Arabic in general, ∼40% of the speech consists of vowels. (2) Moreover, traditional graphemic transcriptions for dialectal Arabic are highly affected by MSA, for example writing the letter Qaf /q/ even though it is realized as a glottal stop /2/ in ECA. In ACA, on the other hand, MSA influence is not observed and transcriptions always follow the correct realized pronunciation. (3) In the traditional Arabic alphabet, the letter Teh Marboota and the letter Alef Maksura are ambiguous. The letter Teh Marboota is either pronounced /h/, /a/, or /t/. The letter Alef Maksura is sometimes written instead of the letter Yeh, and hence it is either pronounced /y/ or /a/. Such ambiguity is already resolved in ACA transcriptions. (4) In the traditional Arabic alphabet, nunation at word endings is not usually written. In ACA, nunation is always correctly represented.

7. CONCLUSIONS

In this paper, we proposed a very promising approach for rapid phonetic transcription development for dialectal Arabic, where the Arabic Chat Alphabet is used exactly as in everyday life for the purpose of acoustic modeling. The proposed approach represents an alternative to conventional phonetic transcription methods. Results show that ACA-based acoustic models trained with ACA transcriptions perform almost as accurately as phoneme-based models, with a small relative increase in WER of +5.2%. Furthermore, the ACA-based acoustic models were found to outperform grapheme-based models by a -24.2% relative decrease in WER. This is mainly due to the lack of explicit modeling of short vowels in traditional Arabic orthography. Finally, the performance of the ACA-based models is mainly explained by the appearance of short vowels in the ACA transcripts. The proposed approach can be extended not only to the different Arabic varieties but also to other languages where a chat alphabet is widely used, like Persian, Bengali, Hebrew, Greek, Serbian, Chinese, Russian, etc.

8. REFERENCES

[1] C. A. Ferguson, "Diglossia," Word, vol. 15, pp. 325-340, 1959.
[2] K. Kirchhoff and D. Vergyri, "Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition," Speech Communication, vol. 46(1), pp. 37-51, 2005.
[3] R. Sarikaya, O. Emam, I. Zitouni, and Y. Gao, "Maximum entropy modeling for diacritization of Arabic text," in Proceedings of INTERSPEECH, 2006, pp. 145-148.
[4] J. Billa, M. Noamany, A. Srivastava, D. Liu, R. Stone, J. Xu, J. Makhoul, and F. Kubala, "Audio indexing of Arabic broadcast news," in Proceedings of ICASSP, 2002, vol. 1, pp. 5-8.
[5] D. Vergyri, K. Kirchhoff, R. Gadde, A. Stolcke, and J. Zheng, "Development of a conversational telephone speech recognizer for Levantine Arabic," in Proceedings of INTERSPEECH, 2005, pp. 1613-1616.
[6] M. Elmahdy, R. Gruhn, W. Minker, and S. Abdennadher, "Cross-lingual acoustic modeling for dialectal Arabic speech recognition," in Proceedings of INTERSPEECH, 2010, pp. 873-876.
[7] A. Canavan, G. Zipperlen, and D. Graff, "CALLHOME Egyptian Arabic speech," Linguistic Data Consortium, 1997.
[8] T. Buckwalter, "Arabic transliteration," 2002, http://www.qamus.org/transliteration.htm.
[9] H. Kilany, H. Gadalla, H. Arram, A. Yacoub, A. El-Habashi, and C. McLemore, "Egyptian colloquial Arabic lexicon," Linguistic Data Consortium, 2002.
[10] M. Hinds and E. Badawi, A Dictionary of Egyptian Arabic, Librairie du Liban, 2009.
[11] M. A. Yaghan, "Arabizi: a contemporary style of Arabic slang," Design Issues, vol. 24(2), pp. 39-52, 2008, MIT Press.
[12] Google Labs, "Google transliteration," 2009, http://www.google.com/ta3reeb/.
[13] Cairo Microsoft Innovation Lab, "Microsoft Maren," 2009, http://www.microsoft.com/middleeast/egypt/cmic/maren/.
