Toward a Web-based Speech Corpus for Algerian Dialectal Varieties

Soumia Bougrine¹, Aicha Chorana¹, Abdallah Lakhdari¹, Hadda Cherroun¹
¹ Laboratoire d’informatique et Mathématiques, Université Amar Telidji, Laghouat, Algérie
{sm.bougrine,a.chorana,a.lakhdari,hadda_cherroun}@lagh-univ.dz

Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP), pages 138–146, Valencia, Spain, April 3, 2017. © 2017 Association for Computational Linguistics

Abstract

The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often a challenging task, as it implies a significant investment of time and money. In this paper, we devise a recipe for building large-scale speech corpora by harnessing Web resources, namely YouTube, other social media, online radio and TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we make sure that this material takes into account numerous aspects that foster its richness. In fact, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours segmented in utterances of at least 6 s.

1 Introduction

Speech datasets and corpora are crucial for both developing and evaluating Natural Language Processing (NLP) systems. Moreover, such corpora have to be large to meet the expectations of the NLP community. In fact, the notion that "more data is better data" was born with the success of modeling based on machine learning and statistical methods.

The applications that use speech corpora can be grouped into four major categories: speech recognition, speech synthesis, speaker recognition/verification and spoken language systems. The need for such systems has become inevitable. They include wide-ranging real-life applications such as speech search engines and, more recently, conversational agents, as conversation is becoming a key mode of human-computer interaction.

Many points must be taken into consideration when designing and developing a relevant speech corpus. Above all, a corpus has to capture the within-language variability (Li et al., 2013). Other points include the corpus size and scope, the richness of speech topics and content, the number of speakers, gender, regional dialects, recording environment and materials. We have attempted to cover as many of these considerations as possible, and we underline each considered point in what follows.

For many languages, the design and development of speech corpora has reached a mature state. At the other extreme, there are few corpora for Arabic (Mansour, 2013), even though, geographically, Arabic is one of the most widespread languages of the world (Behnstedt and Woidich, 2013). It is spoken by more than 420 million people in 60 countries (Lewis et al., 2015). It has two major variants: Modern Standard Arabic (MSA) and Dialectal Arabic (DA). MSA is the official language of all Arab countries. It is used in administrations, schools, official radios, and the press. DA, however, is the language of informal daily communication. Recently, it has also become the medium of communication on the Web, in chat rooms, social media, etc. This fact amplifies the need for language resources and language-related NLP systems for dialects.

For some dialects, especially Egyptian and Levantine, there has been some work on building corpora and designing NLP tools. By contrast, very few attempts have considered the Algerian Arabic dialect, which leads us to affirm that the Algerian dialect and its varieties are an under-resourced language. In this paper, we aim to fill this gap by giving a complete recipe to build a large speech corpus. This recipe can be adopted for any under-resourced language. It eases the challenging task of building large datasets, which, by means of traditional direct recording, is known to be time-consuming and costly. Our idea relies on Web resources, an essential milestone of our era. In fact, the Web 2.0 has become a global platform for information access and sharing that allows collecting any type of data at scales hardly conceivable in the near past.

We apply the proposed recipe to build a speech corpus for Algerian Arabic dialect varieties. For this preliminary version, the corpus is annotated mainly to support research in dialect and speaker identification.

The rest of this paper is organized as follows. In the next section, we review related work that has built DA corpora. In Section 3, we give a brief overview of the features of Algerian sub-dialects. Section 4 describes the complete proposed recipe for building a Web-based speech dataset. In Section 5, we show how this recipe is applied to construct a speech corpus for Algerian dialectal varieties. The resulting corpus is described in Section 6. We enumerate its potential uses in Section 7.

2 Related Work

In this section, we restrict our review to speech corpora dealing with Arabic dialects. We classify them according to two criteria: the collecting method and the intra/inter-country dialect collection context. According to the collecting method, they fall into five categories: broadcast recording, spontaneous telephone conversations, telephone responses to questionnaires, direct recording and Web-based resourcing. The second criterion distinguishes the origin of the targeted dialects as either intra-country/region or inter-country, that is, whether the targeted dialects come from the same country/region or from different countries. This criterion is chosen because it is harder to perform a fine-grained collection of Arabic dialects belonging to close geographical areas that share many historic, social and cultural aspects.

In contrast to the relative abundance of speech corpora for Modern Standard Arabic, very few attempts have considered building Arabic speech corpora for dialects. Table 1 reports some features of the studied DA corpora. The first set of corpora exploited the limited solution of telephone conversation recording. As far as we know, the development of the pioneering DA corpus, CALLFRIEND Egyptian (Canavan and Zipperlen, 1996), began in the middle of the nineties. Part of the OrienTel project (see Table 1) was dedicated to collecting speech corpora for the Arabic dialects of Egypt, Jordan, Morocco, Tunisia, and the United Arab Emirates. In these corpora, the same telephone-response-to-questionnaire method is used. These corpora are available via the ELRA catalogue¹.

The DARPA Babylon Levantine² Arabic speech corpus gathers four Levantine dialects spoken by speakers from Jordan, Syria, Lebanon, and Palestine (Makhoul et al., 2005).

The Appen company has collected three Arabic dialect corpora by means of spontaneous telephone conversations. These corpora³ were uttered by speakers of Gulf, Iraqi and Levantine dialects. With a more guided telephone conversation recording protocol, the Fisher Levantine corpus is available via the LDC catalogue⁴. Its speakers are selected from Jordan, Lebanon, Palestine, Syria and other Levantine countries.

¹ The respective product codes are ELRA-S0221, ELRA-S0289, ELRA-S0183, ELRA-S0186 and ELRA-S0258.
² The product code is LDC2005S08.
³ The respective LDC catalogue product codes are LDC2006S43, LDC2006S45 and LDC2007S01.
⁴ The product code is LDC2007S02.

TuDiCoI (Graja et al., 2010) is a spontaneous dialogue speech corpus dedicated to the Tunisian dialect. It contains recorded dialogues between staff and clients in the railway station of the town of Sfax, Tunisia.

We have also studied some corpora that gather both MSA and Arabic dialects. The SAAVB corpus is dedicated to speakers from all the cities of Saudi Arabia and uses the telephone-response-to-questionnaire method (Alghamdi et al., 2008). The main characteristic of this corpus is that, before recording, a preliminary selection of speakers and environment is performed. The selection aims to control speaker age, gender and telephone type.

The Multi-Dialect Parallel (MDP) corpus is a free corpus that gathers MSA and three Arabic dialects (Almeman et al., 2013), namely those of the Gulf, Egypt and the Levant. The speech data is collected by direct recording.

The KSU Rich Arabic corpus encompasses speakers from different ethnic groups, Arabs and non-Arabs (Africa and Asia). The Arab speakers in this corpus are selected from nine Arab countries: Saudi Arabia, Yemen, Egypt, Syria, Tunisia, Algeria, Sudan, Lebanon and Palestine. This corpus is rich in many aspects, among them the richness of the recorded text. In addition, different recording sessions, environments and systems are taken into account (Alsulaiman et al., 2013).

The Al Jazeera multi-dialectal speech corpus is a larger-scale corpus based on broadcast news from Al Jazeera (Wray and Ali, 2015). Its annotation is performed by crowdsourcing. It encompasses the four major Arabic dialectal categories.

In an intra-country context, there are two corpora dedicated to Algerian Arabic dialect varieties: AMCASC (Djellab et al., 2016) and ALG-DARIDJAH (Bougrine et al., 2016). The AMCASC corpus, collected from telephone conversations, is a large corpus that covers three regional dialectal varieties. The ALG-DARIDJAH corpus is a parallel corpus that encompasses Algerian Arabic sub-dialects. It is based on direct recording, so many considerations are controlled while building it. Compared to AMCASC, the size of ALG-DARIDJAH is limited.

| Corpus | Type | Collecting Method | Corpus Details |
|---|---|---|---|
| Al Jazeera multi-dialectal | Inter | Broadcast news | 57 hours, 4 major Arabic dialect groups, annotated using crowdsourcing |
| ALG-DARIDJAH | Intra | Direct recording | 109 speakers from 17 Algerian departments, 4.5 hours |
| AMCASC | Intra | Telephone conversations | 3 Algerian dialect groups, 735 speakers, more than 72 hours |
| KSU Rich Arabic | Inter | Guided telephone conversations and direct recording | 201 speakers from nine Arab countries, 9 dialects + MSA |
| MDP | Inter | Direct recording | 52 speakers, 23% MSA utterances, 77% DA utterances, 32 hours, 3 dialects + MSA |
| SAAVB | Inter | Selected speakers before telephone response to questionnaire | 1033 speakers; 83% MSA utterances, 17% DA utterances; size: 2.59 GB; 1 dialect + MSA |
| TuDiCoI | Inter | Spontaneous dialogue | 127 dialogues, 893 utterances, 1 dialect |
| Fisher Levantine | Inter | Guided telephone conversations | 279 conversations, 45 hours, 5 dialects |
| Appen’s corpora | Inter | Spontaneous telephone conversations | 3 dialects; Gulf: 975 conversations, ~93 hours; Iraqi: 474 conversations, ~24 hours; Levantine: 982 conversations, ~90 hours |
| DARPA Babylon Levantine | Inter | Direct recording of spontaneous speech | 164 speakers, 75900 utterances, size: 6.5 GB, 45 hours, 4 dialects |
| OrienTel MCA | Inter | Telephone response to questionnaire | 5 dialects; # speakers: 750 Egyptian, 757 Jordanian, 772 Moroccan, 792 Tunisian and 880 Emirates |
| CALLFRIEND | Inter | Spontaneous telephone conversations | 60 conversations, lasting between 5 and 30 minutes, 1 dialect |

Table 1: Speech Corpora for Arabic dialects.

From our study of these major Arabic dialect corpora, we underline some points. First, these corpora are mainly fee-based, and free ones are extremely rare. Second, almost all existing corpora are dedicated to inter-country dialects. Third, to the best of our knowledge, there is no Web-based speech dataset/corpus that deals with Arabic speech data, either for MSA or for dialects. For other languages, however, there have been some investigations.
We can cite the large recent collection Kalaka-3 (Rodríguez-Fuentes et al., 2016). This is a speech database specifically designed for Spoken Language Recognition. The dataset provides TV broadcast speech for training, and audio data extracted from YouTube videos for testing. It deals with European languages.

3 Algerian Dialects: Brief Overview

Algeria is a large country, administratively divided into 48 departments. Its first official language is Modern Standard Arabic. However, Algerian dialects are by far the predominant means of communication. In Figure 1, we depict the main Algerian dialect varieties. In this work, we focus on Algerian Arabic sub-dialects, as they are spoken by 75% to 80% of the population. The Algerian dialect is known as Daridjah to its speakers.

[Figure 1: Hierarchical Structure of Algerian Dialects. Node labels: Algerian Dialects; Tamazight (Kabyle, Shawiya, Tuareg, Mozabite, Chenoua, Zenati); Arabic (Pre-Hilālī; Hilālī: Saharan, Tellian, High-plains (Tell); Ma’qilian; Sulaymite; Algiers-Blanks; Sahel-Tell).]

Algerian Arabic dialects resulted from two Arabization processes due to the expansion of Islam in the 7th and 11th centuries, which led to the appropriation of the Arabic language by the Berber population. On the basis of these two processes, dialectologists (Palva, 2006; Pereira, 2011) show that Algerian Arabic dialects can be divided into two major groups: Pre-Hilālī and Bedouin dialects. The two groups differ in many linguistic features (Marçais, 1986; Caubet, 2000).

Firstly, the Pre-Hilālī dialect is called a sedentary dialect. It is spoken in areas that were affected by the expansion of Islam in the 7th century. At that time, the partially affected cities were: […], Constantine and their rural surroundings. The other cities preserved their mother tongue (Berber).

Secondly, the Bedouin dialect is spoken in areas that were influenced by the Arab immigration of the 11th century (Palva, 2006; Pereira, 2011). Marçais (1986) has divided the Bedouin dialect into four distinct dialects:
i) the Sulaymite dialect, which is connected with Tunisian Bedouin dialects;
ii) the Ma’qilian dialect, which is connected with Moroccan Bedouin dialects;
iii) the Hilālī dialect, which contains three nomadic sub-dialects: Hilālī-Saharan, which covers the totality of the Sahara of Algeria; Hilālī-Tellian, whose speakers occupy a large part of the Tell of Algeria; and the High-plains of Constantine, which covers the area from the north of the Hodna region to the Seybouse river;
iv) the Completely-Bedouin dialect, which covers Algiers’ Blanks and some of its nearby sea-coast cities. Owing to some linguistic differences, we have divided this last dialect into two sub-dialects, namely Algiers-Blanks and Sahel-Tell.

Algerian Arabic dialects present complex linguistic features, and many linguistic phenomena can be observed. Indeed, there are many borrowed words due to the deep colonization. Algerian Arabic dialects are affected by other languages such as Turkish, French, Italian, and Spanish (Leclerc, 30 avril 2012). In addition, code switching is omnipresent, especially from French.

Versteegh et al. (2006) used four consonants (the dentals /t, d, ḍ/ and the voiceless uvular stop /q/) to discriminate the two major groups, Pre-Hilālī and Bedouin. They show that the Pre-Hilālī dialect is characterized by /q/ being pronounced /k/ and by the loss of the inter-dentals, which pass into the dentals /t, d, ḍ/. In the Algerian Bedouin dialect, /q/ is pronounced /g/ and the inter-dentals are fairly well preserved. For more details on Algerian linguistic features, refer to (Embarki, 2008; Versteegh et al., 2006; Harrat et al., 2016).

4 Methodology

In this section, we first describe, in a general way, the complete recipe to collect and annotate a Web-based spoken dataset for an under-resourced language/dialect. Then, we illustrate this recipe by building our Algerian Arabic dialectal speech corpus, mainly dedicated to dialect and speaker identification.

Global View of the Recipe

The recipe described in the following can be easily tailored to the potential uses of the corpus and to the specificities of the targeted language resources and its speaker community.

1. Inventorying Potential Web sources: First, we have to identify the sources that are most used by the communities of the languages/dialects concerned. Indeed, depending on their culture and preferences, some communities prefer some Web media over others. For example, Algerian people use Instagram or Snapchat less than people in the Middle East and the Gulf. Moreover, each country has its own most used communication media. For instance, some societies (Arab ones) are more productive on TV and radio, compared with Western communities, which are more present and productive on social media.

2. Extraction Process: In order to avoid crawling useless data, this step is carried out in four stages:

   (a) Preliminary Validation Lists: For each chosen Web source, we define the main keywords that can help search video/audio lists automatically. When such lists are established, a first cleaning is performed, keeping only the potentially suitable data. The size of such lists depends on the sought scale.

   (b) Providing the collection script: For each resource, we fix and implement a suitable way to collect data automatically. Open source tools are the most suitable. In fact, downloading speech from a stream, from YouTube or from online TV requires different scripts. The same holds for their related metadata⁵, which are very useful for annotation.

   (c) Downloading: This is a time-consuming task. Thus, it is important to plan for aspects such as preparing storage and downloading the related metadata.

   (d) Cleaning: Once the videos/audios are locally available, a first scan is performed in order to keep the data most appropriate to the corpus concerns. This can be achieved by establishing a strategy that depends on the future use of the corpus.

3. Annotation and Pre-processing: For a targeted NLP task, pre-processing the collected speech/video can include segmentation, white noise removal, etc. Some annotations can simply be taken from the related metadata of the Web source when they exist. However, this task also makes use of other annotation techniques such as crowdsourcing, where the crowd is asked to identify the targeted dialect/speaker and/or perform translations.

The method can be generalized to other languages/dialects without linguistic and cultural knowledge of the regional language or dialect, by using video/audio search queries based on the area (location) of the targeted dialect/language and then using the power of crowdsourcing to annotate the corpus.

⁵ YouTube video metadata such as published_date, duration, description, category, etc.

5 Corpus Building

In the context of the Algerian dialects, in order to build a speech corpus that is mainly dedicated to dialect/speaker identification using machine learning techniques, we have chosen several resources.

5.1 Web Sources Inventory

The main aim is to ensure the richness of the corpus. In fact, it is well known that modeling a spoken language needs a set of speech data that accounts for the within-language/intersession variability, such as speaker, content, recording device, communication channel, and background noise (Li et al., 2013). It is desirable to have sufficient data that include the intended intersession effects.

Table 2 reports the main Web sources that feed our corpus. Let us observe that there are several speech topics, which allows capturing more linguistic varieties. In fact, this inventory contains "Local radio channels" resources. Fortunately, each Algerian province has at least one local radio (a governmental one). It deals with local community concerns and covers the main local events. Some of their reports often use their own local dialect. The same holds for amateur radios. Both these radio channels and the TV channels are streamed live on the Web.

In addition, we have chosen some Algerian TV channels whose programs are mostly addressed to a large community and therefore use dialects. Finally, we have targeted some YouTube resources such as Algerian podcasts, Algerian tags, and channels of Algerian YouTubers.

| Source | Sample | Topics |
|---|---|---|
| Algerian TV | Ennahar | News |
| Algerian TV | El chorouk | News, General |
| Algerian TV | Samira, Bina | Cook |
| Local Radios | 48 departments | Social, local, General |
| On YouTube: Algerian PodCast | Anes Tina | Politic, Culture, Social |
| On YouTube: Algerian YouTubers | Khaled Fkir, CCNA DZ, Mesolyte | Blogs, Cook, Tips, Fun, Advices, Beauty, Technology, Vlog |
| On YouTube: Algerian TAG | – | Advices, Tips, social discussions |

Table 2: Main Sources of Videos

5.2 Extraction Process

Given these Web sources, and as they are numerous, we proceed in two steps in order to acquire video/audio speech data. First, we draw up lists by crawling information, mainly metadata, about videos/audios that potentially contain Algerian dialect speech. The deployed procedure relies mainly on two different scripts, according to the type of Web resource.

In order to collect speech data from local radio channels, we go through the radio schedules manually to select lists of reports and programs that are likely to contain dialectal speech.

For data from YouTube, the lists are fed by using the YouTube search engine through its Data API⁶. In addition, the extraction of related metadata is performed using the Python package BeautifulSoup V4, dedicated to Web crawling (Nair, 2014).

⁶ YouTube Data API, https://developers.google.com/YouTube/v3/

In order to draw up search queries, we have used three lists. The first one, Dep, contains the names of the 48 Algerian provinces, spelled in MSA, dialect and French. The second list, Cat, contains the selected YouTube categories. Among the actual YouTube categories, we have targeted: People & Blogs, Education, Entertainment, How-to & Style, News & Politics, Non-profits & Activism, and Humor. The third list, Comm_Word, contains a set of common Algerian dialect words (called White Algerian terms) that are used by all Algerians. This chosen set is reported in Table 3. Then, we iterate the search using a query model that has four keywords. The first and the second keywords are taken from the Dep and Cat lists respectively. The remaining two keywords are selected arbitrarily from Comm_Word. This query formulation helps ensure that speakers are from the chosen province and that the content topics are varied, thanks to the YouTube topic classification.

Concerning the Algerian TV sources, the search queries are drawn up using mainly two keywords. The first one is the name of the channel and the second one refers to the name of the report. In fact, a prior list is filled manually with the names of programs and reports that use dialects. Videos from Algerian YouTuber/podcast channels are simply searched using the name of the corresponding author.

[Table 3: Common Algerian terms used as keywords. The Arabic-script terms are not recoverable from the source; their English glosses are: What, Problems, Discussion, Journal, Injure, Maybe, Electricity, Happen, Happened, Colloquial, Algerian, Village, News, See, Algerian, Algeria, Department, Baccalaureat, Problem, food, Your Opinion.]

For all YouTube search queries, we selected the first 100 videos (when available). When the lists are drawn up, we start the cleaning process, which discards the irrelevant video entries. In fact, we remove all list entries whose duration is less than 5 s. The videos whose topic shows that they do not deal with dialects are also discarded. This is done by manually analyzing the video title, then its description and keywords.

In addition, to be in compliance with the YouTube Terms of Service, we first take into account the existence of any Creative Commons license associated with the video.

The whole cleaning process leads us to keep 1182 audios/videos among 4 000 retrieved ones. Video length varies between 6 s and 1 hour. The cleaning task was carried out by the authors of this paper.

Now, the download process is launched using the resulting cleaned lists. Concerning the data from radio channels, we script a planned recording of the reports from the radio stream. Here also, the amount of recording depends on the desired scale. For this radio data, the download script mainly relies on the VLC Media Player tool⁷ together with the Linux cron command. In order to download videos from YouTube, we have deployed YouTube-dl⁸, a command-line program.

For this preliminary version, most manual annotations were made by the authors themselves with the help of 8 volunteers. These in-lab annotations consist of assigning the spoken dialect to each utterance and validating the speaker gender (detected automatically by VoiceID, see Section 5.3). During these manual annotations, we check that utterances are indeed in dialect. Otherwise, they are discarded.

| Number of Dialects | 8 |
| Total Duration | 104.4 hours |
| Clean Speech Ratio | 39.15 % |
| Number of speakers | 4881 |
| Speech duration by Speaker | 6 s – 9 hours |

Table 4: Corpus Global Statistics.
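The four-keyword query model of Section 5.2 can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `build_queries`, the fixed random seed, the choice of one query per (province, category) pair, and the placeholder terms standing in for the Comm_Word list are all assumptions made for the example.

```python
import itertools
import random


def build_queries(dep, cat, comm_word, seed=0):
    """Build one four-keyword query per (province, category) pair.

    Keyword 1 comes from Dep, keyword 2 from Cat, and the last two are
    drawn arbitrarily (here: at random, without repetition) from Comm_Word.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    queries = []
    for province, category in itertools.product(dep, cat):
        word_1, word_2 = rng.sample(comm_word, 2)
        queries.append(" ".join([province, category, word_1, word_2]))
    return queries


# Hypothetical, shortened lists; the real ones hold 48 provinces, the
# selected YouTube categories, and the common terms of Table 3.
dep = ["Laghouat", "Oran"]
cat = ["Education", "Humor"]
comm_word = ["term_a", "term_b", "term_c"]  # placeholders for Arabic terms

for query in build_queries(dep, cat, comm_word):
    print(query)
```

Each resulting string would then be submitted to the video search engine, keeping the first 100 results as described above.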

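The automatic part of the list cleaning in Section 5.2 (dropping entries shorter than 5 s and taking the license into account) can be sketched as below. The entry layout (`id`, `duration`, `license` keys) is hypothetical, the duration parser only handles ISO 8601 strings of the form `PT#H#M#S`, and whether a Creative Commons license is strictly required is left as a flag, since the text does not say so explicitly. The manual check of titles, descriptions and keywords is not covered.

```python
import re

# Matches durations such as "PT45S", "PT1M30S" or "PT1H2M3S" (assumed format).
_ISO_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def duration_seconds(iso_duration):
    """Convert an ISO 8601 duration like 'PT1M30S' to seconds."""
    match = _ISO_DURATION.fullmatch(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return 3600 * hours + 60 * minutes + seconds


def clean_entries(entries, min_seconds=5, require_cc=False):
    """Keep entries lasting at least min_seconds; optionally require CC."""
    kept = []
    for entry in entries:
        if duration_seconds(entry["duration"]) < min_seconds:
            continue  # too short to be useful
        if require_cc and entry.get("license") != "creativeCommon":
            continue  # no Creative Commons license recorded
        kept.append(entry)
    return kept


# Hypothetical list entries, for illustration only.
candidates = [
    {"id": "v1", "duration": "PT4S", "license": "creativeCommon"},
    {"id": "v2", "duration": "PT1M30S", "license": "youtube"},
    {"id": "v3", "duration": "PT12M", "license": "creativeCommon"},
]
print([e["id"] for e in clean_entries(candidates)])
print([e["id"] for e in clean_entries(candidates, require_cc=True)])
```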
5.3 Preprocessing and Annotation

A collection of Web speech data is not immediately suitable for exploitation in the same way as a traditional corpus. It needs more cleaning and preprocessing. Let us recall that our illustrative corpus will serve for dialect/speaker identification using machine learning techniques. For that purpose, we have applied the following processing to each downloaded video:

1. Audio extraction: the FFmpeg⁹ tool is used to extract the audio layer. In addition, the SoX¹⁰ tool, a sound processing program, is applied to get single-channel 16 kHz WAV audio files.

2. Non-speech segment removal: segments such as music or white noise are removed, as far as possible, by running a VAD (Voice Activity Detection) tool.

3. Speaker diarization: this is performed to determine who speaks when, and to assign a speaker ID to each utterance. It is achieved using the VoiceID Python library, based on the LIUM Speaker Diarization framework (Meignier and Merlin, 2010). The output of the VoiceID segmentation is a set of audio files with information about speaker ID and utterance duration.

⁷ VLC media player V 2.2.4, https://www.videolan.org/vlc/
⁸ YouTube-dl V3, http://YouTube-dl.org/
⁹ FFmpeg V3.2, http://www.ffmpeg.org/
¹⁰ SoX, http://sox.sourceforge.net/

6 Corpus Main Features

First of all, we note that this preliminary version of our corpus was collected and annotated in less than two months, and that building it is still an ongoing process. A sample of our corpus is available online¹¹. KALAM’DZ covers most major Arabic sub-dialects of Algeria. Table 4 reports some global statistics of the resulting corpus. The Clean Speech Ratio row gives the proportion of speech that has good sound quality. The remaining portion contains some background noise, mainly music, or was recorded outdoors. However, it can still be used to build dialect models in order to represent the within-language variability.

¹¹ https://github.com/LIM-MoDos/KalamDZ

More details on the KALAM’DZ corpus are reported in Table 5. Let us observe that some dialects are more represented than others. This is due to the distribution of the population and to Web culture. For instance, Algiers and […] are metropolitan cities, so their productivity on the Web is more abundant.

In order to facilitate the deployment of the KALAM’DZ corpus, we have adopted the same packaging method as the TIMIT Acoustic-Phonetic Continuous Speech Corpus (Garofolo et al., 1993).

| Sub-Dialect | Departments/Village | # Speakers | Algerian TV (h) | Local Radios (h) | On YouTube (h) | Total (h) | Good Quality (%) |
|---|---|---|---|---|---|---|---|
| Hilali-Saharan | Laghouat, […], Ghardaia, Adrar, Bechar, Naâma’ South | 1338 | 12.7 | 16.5 | - | 29.4 | 38.5 |
| Hilali-Tellian | Setif, Batna, Bordj-Bou-Arreridj | 605 | 3.6 | - | - | 3.6 | 32.2 |
| High-plains | Constantine, Mila, Skikda | 297 | 2.0 | - | - | 2.0 | 42.7 |
| Ma’qilian | Sidi-Bel Abbas, Saïda, Mascara, Relizane, Oran, Aïn Timouchent, Tiaret, Mostaganem, Naâma’ North | 421 | 4.1 | - | 20.7 | 24.8 | 38.3 |
| Sulaymite | Annaba, El-Oued, Souk-Ahras, Tebessa, Biskra, Khanchela, Oum El Bouagui, […], El Taref | 914 | 6.7 | - | 6.9 | 13.6 | 39 |
| Algiers Blanks | Algiers, Blida, Boumerdes, Tipaza | 723 | 5.1 | - | 16.1 | 21.2 | 42.3 |
| Sahel-Tell | Médea, Chlef, Tissemsilt, Ain Defla | 447 | 3.1 | 6.0 | - | 9.1 | 45.5 |
| Pre-Hilālī | Tlemcen Nadrouma, Dellys, […], Collo, Cherchell | 136 | 0.7 | - | - | 0.7 | 34.7 |
| Global | | 4881 | 38.2 | 22.5 | 43.7 | 104.4 | 39.1 |

Table 5: Distribution of speakers and Web-sources per sub-dialect in KALAM’DZ corpus.

| Sub-Dialect | N&P | Edu. | Ent. | H&S | P&B | Hum. |
|---|---|---|---|---|---|---|
| Hilali-Saharan | 29.4 | - | - | - | - | - |
| Hilali-Tellian | 3.6 | - | - | - | - | - |
| High-plains | 2.0 | - | - | - | - | - |
| Ma’qilian | 4.1 | 1.5 | - | 7.4 | 10.2 | 1.6 |
| Sulaymite | 6.7 | - | - | - | 6.9 | - |
| Algiers Blanks | 5.1 | 2.2 | 8.7 | - | 1.4 | 2.8 |
| Sahel-Tell | 9.1 | - | - | - | - | - |
| Pre-Hilālī | 0.7 | - | - | - | - | - |
| Total | 60.7 | 3.7 | 8.7 | 7.4 | 19.5 | 4.4 |

Note: News & Politics (N&P), Education (Edu.), Entertainment (Ent.), How-to & Style (H&S), People & Blogs (P&B), and Humor (Hum.).

Table 6: Distribution of categories per dialect (in hours).

7 Potential Uses

The KALAM’DZ corpus can be considered the first of its kind as a rich speech corpus for Algerian Arabic dialects built from Web resources. It can be useful for many purposes, for both the NLP and the computational linguistics communities. In fact, it can be used for building models for both speaker and dialect identification systems for the Algerian dialects. For the linguistics and sociolinguistics communities, it can serve as a basis for capturing the characteristics of dialects.

All videos related to the extracted audio data are also available. They can be used to build another version of the corpus to serve any image/video processing application.

8 Conclusion

In this paper, we have devised a recipe that facilitates building large-scale speech corpora by harnessing Web resources. In fact, the methodology makes it possible to build a Web-based corpus that reflects the within-language variability. In addition, we have applied this procedure to build KALAM’DZ, a speech corpus dedicated to the whole range of Algerian Arabic sub-dialects. We have ensured that this material takes into account numerous speech aspects that foster its richness and provides a representation of linguistic varieties. In fact, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. This preliminary KALAM’DZ version encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours.

Mainly developed to be used in dialect identification, KALAM’DZ can serve as a testbed supporting the evaluation of a wide spectrum of NLP systems.

In future work, we will extend the corpus by collecting Algerian sub-dialects uttered by Berber native speakers. As the corpus building is still ongoing, its evaluation is left to future work. In fact, we plan to evaluate the corpus on dialect identification in an intra-country context.

References

Mansour Alghamdi, Fayez Alhargan, Mohammed Alkanhal, Ashraf Alkhairy, Munir Eldesouki, and Ammar Alenazi. 2008. Saudi Accented Arabic Voice Bank. Journal of King Saud University-Computer and Information Sciences, 20:45–64.

K. Almeman, M. Lee, and A. A. Almiman. 2013. Multi Dialect Arabic Speech Parallel Corpora. In Communications, Signal Processing, and their Applications (ICCSPA), pages 1–6, Feb.

Mansour Alsulaiman, Ghulam Muhammad, Mohamed A Bencherif, Awais Mahmood, and Zulfiqar Ali. 2013. KSU Rich Arabic Speech Database. Journal of Information, 16(6).

Peter Behnstedt and Manfred Woidich. 2013. Dialectology.

S. Bougrine, H. Cherroun, D. Ziadi, A. Lakhdari, and A. Chorana. 2016. Toward a Rich Arabic Speech Parallel Corpus for Algerian sub-Dialects. In LREC’16 Workshop on Free/Open-Source Arabic Corpora and Corpora Processing Tools (OSACT), pages 2–10.

Alexandra Canavan and George Zipperlen. 1996. CALLFRIEND LDC96S49. Philadelphia: Linguistic Data Consortium.

Dominique Caubet. 2000. Questionnaire de dialectologie du […] (d’après les travaux de W. Marçais, M. Cohen, GS Colin, J. Cantineau, D. Cohen, Ph. Marçais, S. Lévy, etc.). Estudios de dialectología norteafricana y andalusí, EDNA, 5(2000-2001):73–90.

Mourad Djellab, Abderrahmane Amrouche, Ahmed Bouridane, and Noureddine Mehallegue. 2016. Algerian Modern Colloquial Arabic Speech Corpus (AMCASC): regional accents recognition within complex socio-linguistic environments. Language Resources and Evaluation, pages 1–29.

Mohamed Embarki. 2008. Les dialectes arabes modernes: état et nouvelles perspectives pour la classification géo-sociologique. Arabica, 55(5):583–604.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. 1993. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM.

Marwa Graja, Maher Jaoua, and L Hadrich-Belguith. 2010. Lexical Study of A Spoken Dialogue Corpus in Tunisian Dialect. In The International Arab Conference on Information Technology (ACIT), Benghazi, Libya.

Salima Harrat, Karima Meftouh, Mourad Abbas, Khaled-Walid Hidouci, and Kamel Smaili. 2016. An Algerian dialect: Study and resources. International Journal of Advanced Computer Science and Applications-IJACSA, 7(3).

Jacques Leclerc. 30 avril 2012. Algérie dans l’aménagement linguistique dans le monde.

M. Paul Lewis, F. Simons Gary, and D. Fenning Charles. 2015. Ethnologue: Languages of the World, Eighteenth edition. Web.

H. Li, B. Ma, and K. A. Lee. 2013. Spoken language recognition: From fundamentals to practice. Proceedings of the IEEE, 101(5):1136–1159, May.

J. Makhoul, B. Zawaydeh, F. Choi, and D. Stallard. 2005. BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts. Linguistic Data Consortium (LDC). LDC Catalog Number LDC2005S08.

Mohamed Abdelmageed Mansour. 2013. The Absence of Arabic Corpus Linguistics: A Call for Creating an Arabic National Corpus. International Journal of Humanities and Social Science, 3(12):81–90.

Philippe Marçais. 1986. Algeria. Leiden: E.J. Brill.

Sylvain Meignier and Teva Merlin. 2010. Lium spkdiarization: an open source toolkit for diarization. In CMU SPUD Workshop.

Vineeth G. Nair. 2014. Getting Started with Beautiful Soup. Packt Publishing.

Heikki Palva. 2006. Dialects: classification. Encyclopedia of Arabic Language and Linguistics, 1:604–613.

C. Pereira. 2011. Arabic in the North African Region. In S. Weniger, G. Khan, M. P. Streck, and J. C. E. Watson, editors, […]. An International Handbook, pages 944–959. Berlin.

Luis Javier Rodríguez-Fuentes, Mikel Penagarikano, Amparo Varona, Mireia Diez, and Germán Bordel. 2016. Kalaka-3: a database for the assessment of spoken language recognition technology on youtube audios. Language Resources and Evaluation, 50(2):221–243.

Kees Versteegh, Mushira Eid, Alaa Elgibali, Manfred Woidich, and Andrzej Zaborski. 2006. Encyclopedia of Arabic Language and Linguistics. African Studies, 8.

Samantha Wray and Ahmed Ali. 2015. Crowdsource a little to label a lot: Labeling a speech corpus of dialectal Arabic, volume 2015-January, pages 2824–2828. International Speech and Communication Association.
