Toward a Web-Based Speech Corpus for Algerian Dialectal Arabic Varieties

Soumia Bougrine¹, Aicha Chorana¹, Abdallah Lakhdari¹, Hadda Cherroun¹
¹Laboratoire d’informatique et Mathématiques, Université Amar Telidji, Laghouat, Algérie
{sm.bougrine,a.chorana,a.lakhdari,hadda_cherroun}@lagh-univ.dz

Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP), pages 138–146, Valencia, Spain, April 3, 2017. © 2017 Association for Computational Linguistics

Abstract

The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often a challenging task, as it implies a significant investment of time and money. In this paper, we devise a recipe for building large-scale speech corpora by harnessing Web resources, namely YouTube, other social media, online radio and TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we make sure that this material takes into account numerous aspects that foster its richness. In fact, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented in utterances of at least 6 s.

1 Introduction

Speech datasets and corpora are crucial for both developing and evaluating Natural Language Processing (NLP) systems. Moreover, such corpora have to be large to meet the expectations of the NLP community. In fact, the notion of "More data is better data" was born with the success of modeling based on machine learning and statistical methods.

The applications that use speech corpora can be grouped into four major categories: speech recognition, speech synthesis, speaker recognition/verification and spoken language systems. The need for such systems has become inevitable. These systems include a wide span of real-life applications such as speech search engines and, more recently, conversational agents, as conversation is becoming a key mode of human-computer interaction.

The crucial points to be taken into consideration when designing and developing a relevant speech corpus are numerous. One necessity is that a corpus capture within-language variability (Li et al., 2013). We can mention some of these points: the corpus size and scope, richness of speech topics and content, number of speakers, gender, regional dialects, recording environment and materials. We have attempted to cover a maximum of these considerations. We will underline each considered point in what follows.

For many languages, the state of the art in designing and developing speech corpora has reached maturity. At the other extreme, there are few corpora for Arabic (Mansour, 2013), even though Arabic is geographically one of the most widespread languages of the world (Behnstedt and Woidich, 2013). It is spoken by more than 420 million people in 60 countries of the world (Lewis et al., 2015). It has two major variants: Modern Standard Arabic (MSA) and Dialectal Arabic (DA). MSA is the official language of all Arab countries. It is used in administrations, schools, official radios, and press. DA, however, is the language of informal daily communication. Recently, it has also become the medium of communication on the Web, in chat rooms, social media, etc. This fact amplifies the need for language resources and language-related NLP systems for dialects.

For some dialects, especially Egyptian and Levantine, there are some investigations in terms of building corpora and designing NLP tools. However, very few attempts have considered the Algerian Arabic dialect, which leads us to affirm that the Algerian dialect and its varieties are an under-resourced language. In this paper, we aim to fill this gap by giving a complete recipe to build a large-size speech corpus. This recipe can be adopted for any under-resourced language. It eases the challenging task of building large datasets, which, when done by means of traditional direct recording, is known to be time-consuming and costly. Our idea relies on Web resources, an essential milestone of our era. In fact, Web 2.0 has become a global platform for information access and sharing that allows collecting any type of data at scales hardly conceivable in the near past.

The proposed recipe is used to build a speech corpus for Algerian Arabic dialect varieties. For this preliminary version, the corpus is annotated mainly to support research in dialect and speaker identification.

The rest of this paper is organized as follows. In the next section, we review related work that has built DA corpora. In Section 3, we give a brief overview of the features of Algerian sub-dialects. Section 4 is dedicated to describing the complete proposed recipe for building a Web-based speech dataset. In Section 5, we show how this recipe is applied to construct a speech corpus for Algerian dialectal varieties. The resulting corpus is described in Section 6. We enumerate its potential uses in Section 7.

2 Related Work

In this section, we restrict our review to speech corpora dealing with Arabic dialects. We classify them according to two criteria: collecting method and Intra/Inter-country dialect collection context. The corpora can be classified into five categories according to the collecting method: recording broadcast, spontaneous telephone conversations, telephone responses to questionnaires, direct recording and Web-based resourcing. The second criterion distinguishes the origin of the targeted dialects as either Intra-country/region or Inter-country, meaning that the targeted dialects are either from the same country/region or from different countries. This criterion is chosen because it is harder to perform a fine-grained collection of Arabic dialects belonging to close geographical areas that share many historic, social and cultural aspects.

In contrast to the relative abundance of speech corpora for Modern Standard Arabic, very few attempts have considered building Arabic speech corpora for dialects. Table 1 reports some features of the studied DA corpora. The first set of corpora has exploited the limited solution of telephone conversation recording. In fact, as far as we know, the pioneering DA corpus is CALLFRIEND Egyptian (Canavan and Zipperlen, 1996), whose development began in the mid-nineties. Another part of the OrienTel project, cited below, has been dedicated to collecting speech corpora for the Arabic dialects of Egypt, Jordan, Morocco, Tunisia, and the United Arab Emirates. In these corpora, the same method of telephone response to questionnaire is used. These corpora are available via the ELRA catalogue¹.

The DARPA Babylon Levantine² Arabic speech corpus gathers four Levantine dialects spoken by speakers from Jordan, Syria, Lebanon, and Palestine (Makhoul et al., 2005).

The company Appen has collected three Arabic dialect corpora by means of the spontaneous telephone conversation method. These corpora³ were uttered by speakers of Gulf, Iraqi and Levantine dialects. With a more guided telephone conversation recording protocol, the Fisher Levantine Arabic corpus is available via the LDC catalogue⁴. The speakers are selected from Jordan, Lebanon, Palestine, Syria and other Levantine countries.

TuDiCoI (Graja et al., 2010) is a spontaneous dialogue speech corpus dedicated to the Tunisian dialect, which contains recorded dialogues between staff and clients in the railway station of the town of Sfax, Tunisia.

Concerning corpora that gather MSA and Arabic dialects, we have studied some of them. The SAAVB corpus is dedicated to speakers from all the cities of Saudi Arabia and uses the method of telephone response to questionnaire (Alghamdi et al., 2008). The main characteristic of this corpus is that, before recording, a preliminary choice of speakers and environment is performed. The selection aims to control speaker age and gender and telephone type.

The Multi-Dialect Parallel (MDP) corpus is a free corpus that gathers MSA and three Arabic dialects (Almeman et al., 2013).

pora dedicated to Algerian Arabic dialect vari-

¹ The respective product codes are ELRA-S0221, ELRA-S0289, ELRA-S0183, ELRA-S0186 and ELRA-S0258.
² The product code is LDC2005S08.
³ The respective LDC catalogue product codes are LDC2006S43, LDC2006S45 and LDC2007S01.
⁴ The product code is LDC2007S02.

Corpus | Type | Collecting Method | Corpus Details
------ | ---- | ----------------- | --------------
Al Jazeera multi-dialectal | Inter | Broadcast news | 57 hours, 4 major Arabic dialect groups annotated using crowdsourcing
ALG-DARIDJAH | Intra | Direct recording | 109 speakers from 17 Algerian departments, 4.5 hours
AMCASC | Intra | Telephone conversations | 3 Algerian dialect groups, 735 speakers, more than 72 hours
KSU Rich Arabic | Inter | Guided telephone conversations and direct recording | 201 speakers from nine Arab countries, 9 dialects + MSA
MDP | Inter | Direct recording | 52 speakers, 23% MSA utterances, 77% DA utterances, 32 hours, 3 dialects + MSA
SAAVB | Inter | Selected speaker before telephone response of questionnaire | 1033 speakers; 83% MSA utterances, 17% DA utterances, size: 2.59 GB, 1 dialect + MSA
TuDiCoI | Inter | Spontaneous dialogue | 127 dialogues, 893 utterances, 1 dialect
Fisher Levantine | Inter | Guided telephone conversations | 279 conversations, 45 hours, 5 dialects
Appen’s corpora | Inter | Spontaneous telephone conversations | 3 dialects; Gulf: 975 conversations, ~93 hours; Iraqi: 474 conversations, ~24 hours; Levantine: 982 conversations, ~90 hours
DARPA Babylon Levantine | Inter | Direct recording of spontaneous speech | 164 speakers, 75900 utterances, size: 6.5 GB, 45 hours, 4 dialects
OrienTel MCA | Inter | Telephone response of questionnaire | 5 dialects; # speakers: 750 Egyptian, 757 Jordanian, 772 Moroccan, 792 Tunisian and 880 Emirates
CALLFRIEND | Inter | Spontaneous telephone conversations | 60 conversations, lasting between 5-30 minutes, 1 dialect

Table 1: Speech Corpora for Arabic dialects.
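As an illustration of the two classification criteria used in this section, the corpora of Table 1 can be held in a small data structure and grouped by Intra/Inter type. This is only a minimal sketch: the names `Corpus`, `TABLE_1`, `group_by_scope` and `uses_telephone` are hypothetical and not part of any released tooling, the type and method values are copied from Table 1, and the keyword match on "telephone" is a simplifying assumption rather than the five-category classification itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str
    scope: str   # "Intra" or "Inter", as listed in Table 1
    method: str  # collecting method, copied from Table 1


# Rows of Table 1 (corpus details omitted for brevity).
TABLE_1 = [
    Corpus("Al Jazeera multi-dialectal", "Inter", "Broadcast news"),
    Corpus("ALG-DARIDJAH", "Intra", "Direct recording"),
    Corpus("AMCASC", "Intra", "Telephone conversations"),
    Corpus("KSU Rich Arabic", "Inter",
           "Guided telephone conversations and direct recording"),
    Corpus("MDP", "Inter", "Direct recording"),
    Corpus("SAAVB", "Inter",
           "Selected speaker before telephone response of questionnaire"),
    Corpus("TuDiCoI", "Inter", "Spontaneous dialogue"),
    Corpus("Fisher Levantine", "Inter", "Guided telephone conversations"),
    Corpus("Appen's corpora", "Inter", "Spontaneous telephone conversations"),
    Corpus("DARPA Babylon Levantine", "Inter",
           "Direct recording of spontaneous speech"),
    Corpus("OrienTel MCA", "Inter", "Telephone response of questionnaire"),
    Corpus("CALLFRIEND", "Inter", "Spontaneous telephone conversations"),
]


def group_by_scope(corpora):
    """Map each scope ("Intra"/"Inter") to the corpus names, in table order."""
    groups = defaultdict(list)
    for corpus in corpora:
        groups[corpus.scope].append(corpus.name)
    return dict(groups)


def uses_telephone(corpus):
    """Assumption: a method counts as telephone-based if it mentions the word."""
    return "telephone" in corpus.method.lower()


if __name__ == "__main__":
    for scope, names in group_by_scope(TABLE_1).items():
        print(f"{scope}: {len(names)} corpora")
    telephone = [c.name for c in TABLE_1 if uses_telephone(c)]
    print(f"Telephone-based: {len(telephone)} of {len(TABLE_1)}")
```

Grouping this way makes visible that only two of the listed corpora are of the Intra type, both of them Algerian, and that most of the listed corpora rely on some form of telephone recording.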
