Collection and Analysis of Code-Switch Egyptian Arabic-English Speech Corpus
Total Page:16
File Type:pdf, Size:1020Kb
Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus Injy Hamed, Mohamed Elmahdy, Slim Abdennadher The German University in Cairo El Tagamoa El Khames, New Cairo, Cairo, Egypt finjy.hamed, mohamed.elmahdy, [email protected] Abstract Speech corpora are key components needed by both: linguists (in language analyses, research and teaching languages) and Natural Language Processing (NLP) researchers (in training and evaluating several NLP tasks such as speech recognition, text-to-speech and speech-to-text synthesis). Despite of the great demand, there is still a huge shortage in available corpora, especially in the case of dialectal languages, and code-switched speech. In this paper, we present our efforts in collecting and analyzing a speech corpus for conversational Egyptian Arabic. As in other multilingual societies, it is common among Egyptians to use a mix of Arabic and English in daily conversations. The act of switching languages, at sentence boundaries or within the same sentence, is referred to as code-switching. The aim of this work is a three-fold: (1) gather conversational Egyptian Arabic spontaneous speech, (2) obtain manual transcriptions and (3) analyze the speech from the code-switching perspective. A subset of the transcriptions were manually annotated for part-of-speech (POS) tags. The POS distribution of the embedded words was analyzed as well as the POS distribu- tion for the trigger words (Arabic words preceding a code-switching point). The speech corpus can be obtained by contacting the authors. Keywords: Speech corpus, Dialectal Egyptian Arabic, Conversational Egyptian Arabic, Egyptian Arabic-English, code-switching, code-mixing 1. Introduction • Inter-sentential Code-switching: defined as switching The Arabic language is one of the most widely used languages from one sentence to another. For example: languages in the world. It is the 6th most used language “úæJ .k. A« àA¿ . It was very interesting.” based on number of of native speakers. Nearly 250 (I liked it. It was very interesting.) million persons use Arabic as their first language and it is the second language for around four times that number • Intra-sentential Code-switching (also known as code- (Elmahdy et al., 2009). There are three types of the Arabic mixing): defined as using multiple languages within language: Classical Arabic, Modern Standard Arabic the same sentence. For example: (MSA), and Colloquial (dialectal) Arabic. The classical “ .implement a semantic search engine áK . AJ» AJk@ ” Arabic is the most formal type of Arabic. It is used in the (We were implementing a semantic search engine.) Quran and early Islamic literature. MSA is the official modern language used in the Arab world. It is derived Another type of language alternation is referred to as from the Classical Arabic, with some simplifications ”borrowing“ or ”loanword“. This is defined as having the such as removal of diacritic marks. MSA is the form of whole sentence in one language except for words which Arabic taught in schools and is used in writings, formal are borrowed from the secondary language. For example: speeches, interviews, news broadcasts, movies’ subtitling “.AJÔ « researchÈ@ Im' AK@ ” (I generally love research.) and education. However, MSA is not the language used . in everyday life and is considered as a second language Researchers use different definitions for code-switching. for all Arabic-speakers. Each country or region within a Some researchers define code-switching as incorporating country has its own dialect. Colloquial (dialectal) Arabic sentences from different languages, each having its own is the language used in daily conversations and informal grammatical rules. Therefore, following this definition, writings such as chats, blogs and comments on on-line borrowing is not considered as a type of code-switching. social media such as Facebook and Twitter. However, they On the other hand, some researchers consider borrowing do not necessarily have a standard written form. as a type of intra-sentential Code-switching. In the scope of this paper, for the sake of simplicity, we will use the In addition to the three forms of Arabic, as in other mul- term “code-switching” to refer to all types of language tilingual environments, many Arabic-speakers use code- alternation; inter-sentential code-switching, intra-sentential switching and code-mixing in their conversations. Such code-switching as well as borrowing. mixed speech is usually defined as a mixture of two distinct languages: primary language (also known as the matrix This phenomenon of multilingualism comes as a result language); which is spoken in majority and secondary lan- of several factors such as globalization, immigration, guage (also known as the embedded language); by which colonization, the rise of education levels, as well as inter- words (in the case of code-mixing) or phrases (in the case of national business and communication. Code-switching is code-switching) are embedded into the conversation. There seen in several Arab countries, where for example, English are two types of code-switching: is commonly used in Egypt and French in Morocco, 3805 Tunisia, Lebanon and Jordon. 1997). Dialectal Egyptian Arabic speech corpora were also gathered by and (Elmahdy et al., 2009) and (El-Sakhawy et Arabic varieties are characterized by Diglossia where al., 2014). These corpora may contain a small percentage dialectal forms usually differ considerably from their of code-switching, however, they are mainly designed for formal language, and are thus considered by researchers to Dialectal Egyptian Arabic. Up to our knowledge, there are be a separate language (Ferguson, 1959). For instance, the no speech corpora dedicated to code-switching occurring huge variation between MSA and the Egyptian Colloquial in dialectal Egyptian Arabic. In this paper, we present Arabic was shown in the study conducted by Kirchhoff and our first efforts in filling this gap. The collected corpus Vergyri (Kirchhoff and Vergyri, 2005). The authors studied can be used in training and testing data for building a the data sharing between both languages, where the author multilingual Egyptian Arabic-English ASR system. It calculated the percentage of shared unigrams, bigrams can also serve as a data set for linguistic analysis on the and trigrams in the LDC CallHome corpus (telephone Egyptian Arabic-English code-switching behaviour. conversations in Egyptian Colloquial Arabic) and the FBIS corpus (broadcast news in MSA). It was found that the The remainder of the paper is structured as follows: Related corpora only overlap by 10.3% in unigrams, 1% in bigrams work will be discussed in Section 2 . Section 3 describes and < 1% in trigrams. The same overlap computation was the process of corpus creation; speech collection and tran- done for the CHRISTINE corpus (conversational British scription. In Section 4 , the corpus is analyzed to exam- English) and the American English Broadcast News data ine/provide insights on its code-switching features. Finally, from the NIST 2004 Rich Transcription evaluations. The Section 5 concludes and provides future work. overlap was found to be 44.5% in unigrams, 19.2% in bigrams and 5.3% in trigrams. Consequently, existing 2. Related Work speech corpora in Modern Standard Arabic will be unsuit- There are two types of speech corpora: read speech and able for recognizing dialectal Arabic. Moreover, it is also spontaneous speech. Spontaneous speech corpora are usu- unsuitable to use monolingual speech corpora to recognize ally preferred in building ASR systems as they are closer code-switch speech. to natural speech. However, they require manual tran- scription, which is a labour-intensive and time-consuming In literature, there are two main approaches used to task. Both approaches have been recorded for collect- recognize code-switched speech: language-dependent ing code-switched speech corpora. (Chan et al., 2005) and language-independent. In the language-dependent collected a Cantonese-English speech corpus through read approach, monolingual ASR systems are used. This is done newspaper content. (Li et al., 2012) gathered a Mandarin- by first detecting the boundaries at which language switch- English speech from four different sources: (1) conversa- ing occurs using language boundary detection (LBD) tional meetings ; (2) group project meetings; (3) student algorithms. For each language-homogeneous segment, the interviews speech. The speech was then transcribed by an language is identified using language identity detection annotator and manually verified by Mandarin-English bilin- (LID) algorithms. Finally, each segment is recognized by gual speakers. (Lyu et al., 2015) collected the SEAME cor- its respective monolingual ASR system. This approach pus, where Mandarin-English audio recordings were col- is suitable for building multilingual ASR systems that lected from interviews and conversational speech and were handle multiple languages, where the input speech is manually transcribed. (Chong et al., 2012) developed a monolingual, as well as those that handle inter-sentential Malay conversational speech corpus, containing Malay- code-switching. However, in the case of intra-sentential English code-switching through the recording and tran- code-switching, the language switching occurs within the scription of conversational speech. same sentence and the language-homogeneous segments can be short. The accuracy of the LBD and LID algorithms 3. Approach can then become a limitation