<<

Collection and Analysis of Code-switch Egyptian -English Speech Corpus

Injy Hamed, Mohamed Elmahdy, Slim Abdennadher The German in Tagamoa El Khames, New Cairo, Cairo, {injy.hamed, mohamed.elmahdy, slim.abdennadher}@guc.edu.eg

Abstract Speech corpora are key components needed by both: linguists (in analyses, research and teaching ) and Natural Language Processing (NLP) researchers (in training and evaluating several NLP tasks such as speech recognition, text-to-speech and speech-to-text synthesis). Despite of the great demand, there is still a huge shortage in available corpora, especially in the case of dialectal languages, and code-switched speech. In this paper, we present our efforts in collecting and analyzing a speech corpus for conversational . As in other multilingual societies, it is common among to use a mix of Arabic and English in daily conversations. The act of switching languages, at sentence boundaries or within the same sentence, is referred to as code-switching. The aim of this work is a three-fold: (1) gather conversational Egyptian Arabic spontaneous speech, (2) obtain manual transcriptions and (3) analyze the speech from the code-switching perspective. A subset of the transcriptions were manually annotated for part-of-speech (POS) tags. The POS distribution of the embedded words was analyzed as well as the POS distribu- tion for the trigger words (Arabic words preceding a code-switching point). The speech corpus can be obtained by contacting the authors.

Keywords: Speech corpus, Dialectal Egyptian Arabic, Conversational Egyptian Arabic, Egyptian Arabic-English, code-switching, code-mixing

1. Introduction • Inter-sentential Code-switching: defined as switching The Arabic language is one of the most widely used languages from one sentence to another. For example: languages in the world. It is the 6th most used language “úæJ .k. A« àA¿ . It was very interesting.” based on number of of native speakers. Nearly 250 (I liked it. It was very interesting.) million persons use Arabic as their first language and it is the for around four that number • Intra-sentential Code-switching (also known as code- (Elmahdy et al., 2009). There are three types of the Arabic mixing): defined as using multiple languages within language: , the same sentence. For example: (MSA), and Colloquial (dialectal) Arabic. The classical “ .implement a semantic search engine áK . AJ» AJk@ ” Arabic is the most formal type of Arabic. It is used in the (We were implementing a semantic search engine.) and early . MSA is the official used in the . It is derived Another type of language alternation is referred to as from the Classical Arabic, with some simplifications ”borrowing“ or ”“. This is defined as having the such as removal of marks. MSA is the form of whole sentence in one language except for words which Arabic taught in schools and is used in writings, formal are borrowed from the secondary language. For example: speeches, interviews, news broadcasts, movies’ subtitling “.AJÔ « researchÈ@Im' AK@ ” (I generally research.) and education. However, MSA is not the language used . . in everyday life and is considered as a second language Researchers use different definitions for code-switching. for all Arabic-speakers. Each country or region within a Some researchers define code-switching as incorporating country has its own . Colloquial (dialectal) Arabic sentences from different languages, each having its own is the language used in daily conversations and informal grammatical rules. Therefore, following this definition, writings such as chats, blogs and comments on on-line borrowing is not considered as a type of code-switching. social media such as Facebook and . However, they On the other hand, some researchers consider borrowing do not necessarily have a standard written form. as a type of intra-sentential Code-switching. In the scope of this paper, for the sake of simplicity, we will use the In addition to the three forms of Arabic, as in other mul- term “code-switching” to refer to all types of language tilingual environments, many Arabic-speakers use code- alternation; inter-sentential code-switching, intra-sentential switching and code-mixing in their conversations. Such code-switching as well as borrowing. mixed speech is usually defined as a mixture of two distinct languages: primary language (also known as the matrix This phenomenon of comes as a result language); which is spoken in majority and secondary lan- of several factors such as globalization, immigration, guage (also known as the embedded language); by which colonization, the rise of education levels, as well as inter- words (in the case of code-mixing) or phrases (in the case of national business and communication. Code-switching is code-switching) are embedded into the conversation. There seen in several Arab countries, where for example, English are two types of code-switching: is commonly used in Egypt and French in ,

3805 , and Jordon. 1997). Dialectal Egyptian Arabic speech corpora were also gathered by and (Elmahdy et al., 2009) and (El-Sakhawy et Arabic varieties are characterized by where al., 2014). These corpora may contain a small percentage dialectal forms usually differ considerably from their of code-switching, however, they are mainly designed for formal language, and are thus considered by researchers to Dialectal Egyptian Arabic. Up to our knowledge, there are be a separate language (Ferguson, 1959). For instance, the no speech corpora dedicated to code-switching occurring huge variation between MSA and Colloquial in dialectal Egyptian Arabic. In this paper, we present Arabic was shown in the study conducted by Kirchhoff and our first efforts in filling this gap. The collected corpus Vergyri (Kirchhoff and Vergyri, 2005). The authors studied can be used in training and testing data for building a the data sharing between both languages, where the author multilingual Egyptian Arabic-English ASR system. It calculated the percentage of shared unigrams, bigrams can also serve as a data set for linguistic analysis on the and trigrams in the LDC CallHome corpus (telephone Egyptian Arabic-English code-switching behaviour. conversations in Egyptian Colloquial Arabic) and the FBIS corpus (broadcast news in MSA). It was found that the The remainder of the paper is structured as follows: Related corpora only overlap by 10.3% in unigrams, 1% in bigrams work will be discussed in Section 2 . Section 3 describes and < 1% in trigrams. The same overlap computation was the process of corpus creation; speech collection and tran- done for the CHRISTINE corpus (conversational British scription. In Section 4 , the corpus is analyzed to exam- English) and the English Broadcast News data ine/provide insights on its code-switching features. Finally, from the NIST 2004 Rich Transcription evaluations. The Section 5 concludes and provides future work. overlap was found to be 44.5% in unigrams, 19.2% in bigrams and 5.3% in trigrams. Consequently, existing 2. Related Work speech corpora in Modern Standard Arabic will be unsuit- There are two types of speech corpora: read speech and able for recognizing dialectal Arabic. Moreover, it is also spontaneous speech. Spontaneous speech corpora are usu- unsuitable to use monolingual speech corpora to recognize ally preferred in building ASR systems as they are closer code-switch speech. to natural speech. However, they require manual tran- scription, which is a labour-intensive and -consuming In literature, there are two main approaches used to task. Both approaches have been recorded for collect- recognize code-switched speech: language-dependent ing code-switched speech corpora. (Chan et al., 2005) and language-independent. In the language-dependent collected a Cantonese-English speech corpus through read approach, monolingual ASR systems are used. This is done content. (Li et al., 2012) gathered a Mandarin- by first detecting the boundaries at which language switch- English speech from four different sources: (1) conversa- ing occurs using language boundary detection (LBD) tional meetings ; (2) group project meetings; (3) student . For each language-homogeneous segment, the interviews speech. The speech was then transcribed by an language is identified using language identity detection annotator and manually verified by Mandarin-English bilin- (LID) algorithms. Finally, each segment is recognized by gual speakers. (Lyu et al., 2015) collected the SEAME cor- its respective monolingual ASR system. This approach pus, where Mandarin-English audio recordings were col- is suitable for building multilingual ASR systems that lected from interviews and conversational speech and were handle multiple languages, where the input speech is manually transcribed. (Chong et al., 2012) developed a monolingual, as well as those that handle inter-sentential Malay conversational speech corpus, containing Malay- code-switching. However, in the case of intra-sentential English code-switching through the recording and tran- code-switching, the language switching occurs within the scription of conversational speech. same sentence and the language-homogeneous segments can be short. The accuracy of the LBD and LID algorithms 3. Approach can then become a limitation to the overall ASR system The purpose of this work is to develop a corpus for performance. In this case, the language-independent spontaneous Egyptian Arabic-English code-switching approach is more suitable for the task. It involves building speech. This was done by recording informal interviews a truly multilingual language model, acoustic model rather than collecting read speech to gather casual speech and pronunciation that encompass both- or of spontaneous nature. This decision was made under the all- languages involved. Recognition is then done in a assumption that code-switching occurs most frequently in one-pass approach. Building multilingual language and spontaneous speech. acoustic models is considered to be a holistic task. One of the main challenges is the need for code-switching For the interview setups, each interview involved two corpora for training and testing purposes. Despite the great interviewers and one or two interviewee(s). Only the efforts done in collecting speech and text corpora, there is interviewees’ speech were transcribed. The participants still shortage in the available corpora for code-switching were asked to discuss technical topics such as the courses languages. The lack of existing speech corpora for they teach, work experiences, as well as their B.Sc., code-switched Egyptian Arabic-English is a bottleneck in M.Sc. and Ph.D. projects. The participants were Egyptian the creation of ASR systems for conversational Egyptian teaching assistants in the German University in Cairo. The Arabic. Few speech corpora are available for Dialectal participants were of both genders, with their ages ranging Egyptian Arabic, such as CALLHOME (Canavan et al., between 23 and 28. The mother of all participants

3806 is Arabic, and they are all fluent in English. Trigger word Percentage È@ 31.0% The interviews were recorded in a quiet closed room at 4.8% the German University in Cairo. The recordings were ú ¯ collected using two table-mounted microphones. All ð 3.4% audio recordings were stored in mono channel pulse-code úæªK 1.5% modulation (PCM) at 16kHz sampling rate. In order to 1.3% minimize the interference with power lines, the laptop ñë charger was disconnected. The transcriptions were done manually by two Egyptian annotators who master both Table 1: The most frequent trigger words. languages. The transcriptions were done using Transcrib- erAG. All transcriptions were recorded in UTF-8. Arabic transcriptions were written without diactritic marks. All 4.3. Top unigrams, bigrams and trigrams non-speech sounds, pauses, and unintelligible speech The number of unigrams, bigrams and trigrams gathered segments were transcribed with some predefined filler tags. from the transcriptions are shown in Table 4.3.. Figure 4.3. shows the most frequently used mixed n-grams in the con- versations. 4. Corpus Evaluation N-grams order The interviews were held with 12 participants; 6 males and unigrams 4,330 6 females. A total of 5.3 hours were recorded, with an - bigrams 14,205 erage duration of a participant’s recording of 26.5 minutes. trigrams 15,855 A total of 4.5 hours were transcribed. The transcriptions made up a total of 1,234 sentences and 17,769 words, with Table 2: The number of unigrams, bigrams and trigrams an average of 14 words per sentence. The transcribed text gathered from the transcriptions. is analyzed from two perspectives: (1) code-switching and code-mixing distribution, (2) part-of-Speech (POS) distri- bution of embedded words and POS trigger tags. The most frequently used Arabic trigger words (preceding a code- switching point) are identified. The most frequently used unigrams, bigrams and trigrams are also presented as well as examples from the transcriptions. 4.1. Code-switching and code-mixing analysis The transcriptions contain a total of 17,769 words; 11,045 (62.1%) Arabic and 6,641 (37.4%) English words. This shows a high usage of the embedded in the conversations. The transcriptions show a high rate of code-switching and code-mixing. The 1,234 sentences are divided by language as follows: 124 monolingual Arabic, 125 monolingual English and 985 mixed. This shows a percentage of 79.8% of code-mixing in speech. It also shows a percentage of 10.1% of code-switching, where the whole sentence was uttered in the embedded language. Only 10% of the sentences are uttered purely in the matrix (Arabic) language. Figure 1: The top most frequently used unigrams, bigrams and trigrams. The transcriptions were further analyzed from the code- mixing perspective. In the scope of the code-mixed sentences, 34.4% of the words are in the embedded 4.4. Part-of-Speech analysis language. The average number of code-switch points The transcriptions were analyzed to determine the Part-of- (Arabic-English and English-Arabic) in a sentence is 4. On speech (POS) of the embedded English words. A total of average, there are 2 embedded English parts. The average 388 sentences were manually annotated for POS tags. Fig- number of English words in an embedded part is 2. ure 2 shows the POS distribution of the embedded English words. It can be seen from the data that the participants used the English language mostly in nouns. 4.2. Trigger Words The POS of the Arabic words preceding a code-switching Trigger words are defined as the Arabic words preceding a point was also examined. These are considered to be trig- code-switching point. There are in total 535 unique Arabic ger POS tags. Figure 3 shows the POS distribution of the words preceding a code-switching point. Table 4.2. shows trigger POS tags. It can be seen that code-switching mostly the most frequent trigger words. occurs after articles, followed by verbs and prepositions.

3807 after articles. It was found that code-switching occurs at a rate of 30% after the artice È@. It is to be noted that in tech- nical contexts, many technical terms are embedded as loan- words, thus contributing to this high code-switching rate. Accordingly, further research needs to be done to inves- tigate the code-switching rate in other domains. We also intend to continue working on the corpus and collect more recordings and transcriptions.

6. Bibliographical References Canavan, A., Zipperlen, ., and Graff, D. (1997). Call- home egyptian arabic speech. Linguistic Data Consor- tium. Figure 2: The POS distribution of the embedded English Chan, . Y., Ching, P., and Lee, T. (2005). Development words in the set of manually annotated sentences. of a cantonese-english code-mixing speech corpus. In Ninth European Conference on Speech Communication and Technology. Chong, T. Y., Xiao, X., Tan, T.-P., Chng, E. S., and Li, H. (2012). Collection and annotation of malay conver- sational speech corpus. In Speech Database and Assess- ments ( COCOSDA), 2012 International Confer- ence on, pages 30–35. IEEE. El-Sakhawy, D., Abdennadher, S., and Hamed, I. (2014). Collecting data for automatic speech recognition systems in dialectal arabic using games with a purpose. In Mul- timodal Analyses enabling Artificial Agents in - Machine Interaction. Springer LNAI. Elmahdy, M., Gruhn, ., Minker, W., and Abdennadher, S. (2009). Modern standard arabic based multilingual ap- proach for dialectal arabic speech recognition. In Eighth Figure 3: The POS distribution of the trigger POS tags in International Symposium on Natural Language Process- the set of manually annotated sentences. ing (SNLP’09), pages 169–174. IEEE. Ferguson, C. A. (1959). Diglossia. word, 15(2):325–340. Kirchhoff, K. and Vergyri, D. (2005). Cross-dialectal data sharing for acoustic modeling in arabic speech recogni- 4.5. Transcriptions examples tion. Speech Communication, 46(1):37–51. Figure 4 presents a sample of the transcribed sentences. Li, Y., Yu, Y., and Fung, P. (2012). A mandarin-english code-switching corpus. In LREC, pages 2515–2519. 5. Conclusion Lyu, D.-C., Tan, T.-P., Chng, E.-S., and Li, H. (2015). Mandarin–english code-switching speech corpus in Code-switching has become a prevalent phenomenon in ev- south-east asia: Seame. Language Resources and Eval- eryday conversations in multilingual societies. In Egypt, uation, 49(3):581–600. it has become common, especially among youth, to mix Arabic and English in conversations. This created a need for ASR systems to be able to recognize mixed Egyptian Arabic-English speech. In order to build such a multi- lingual ASR system, an Egyptian Arabic-English speech corpus is crucial. In this work, we collected spontaneous speech through informal interviews. The topics of the in- terviews were chosen to be technical, thus more probable to contain code-switching. The collected speech was man- ually annotated. The transcriptions were analyzed in terms of the code-switching and code-mixing behaviour. It was seen that both phenomena were used extensively. A sub- set of the transcribed data was manually annotated for POS tags. Analyses were performed on the POS tags of the En- glish embedded words as well as the Arabic words preced- ing a code-switching point. It was found that English em- bedded words were mostly used for nouns, and that code- switching (from Arabic to English) occurs most frequently

3808 Figure 4: Some of the transcribed sentences. The time interval in seconds is shown for each sentence between braces.

3809