MOROCO: the Moldavian and Romanian Dialectal Corpus
Total Page:16
File Type:pdf, Size:1020Kb
MOROCO: The Moldavian and Romanian Dialectal Corpus Andrei M. Butnaru and Radu Tudor Ionescu Department of Computer Science, University of Bucharest 14 Academiei, Bucharest, Romania [email protected] [email protected] Abstract In this work, we introduce the Moldavian and Romanian Dialectal Corpus (MOROCO), which is freely available for download at https://github.com/butnaruandrei/MOROCO. The corpus contains 33564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21719 samples for training, 5921 samples for validation and another 5924 Figure 1: Map of Romania and the Republic of samples for testing. For each sample, we Moldova. provide corresponding dialectal and category labels. This allows us to perform empirical studies on several classification tasks such as Romanian is part of the Balkan-Romance group (i) binary discrimination of Moldavian versus that evolved from several dialects of Vulgar Romanian text samples, (ii) intra-dialect Latin, which separated from the Western Ro- multi-class categorization by topic and (iii) mance branch of languages from the fifth cen- cross-dialect multi-class categorization by tury (Coteanu et al., 1969). In order to dis- topic. We perform experiments using a shallow approach based on string kernels, tinguish Romanian within the Balkan-Romance as well as a novel deep approach based on group in comparative linguistics, it is referred to character-level convolutional neural networks as Daco-Romanian. Along with Daco-Romanian, containing Squeeze-and-Excitation blocks. which is currently spoken in Romania, there We also present and analyze the most discrim- are three other dialects in the Balkan-Romance inative features of our best performing model, branch, namely Aromanian, Istro-Romanian, and before and after named entity removal. Megleno-Romanian. Moldavian is a subdialect of Daco-Romanian, that is spoken in the Republic of 1 Introduction Moldova and in northeastern Romania. The delim- arXiv:1901.06543v2 [cs.CL] 1 Jun 2019 The high number of evaluation campaigns on spo- itation of the Moldavian dialect, as with all other ken or written dialect identification conducted in Romanian dialects, is made primarily by analyz- recent years (Ali et al., 2017; Malmasi et al., ing its phonetic features and only marginally by 2016; Rangel et al., 2017; Zampieri et al., 2017, morphological, syntactical, and lexical character- 2018) prove that dialect identification is an inter- istics. Although the spoken dialects in Romania esting and challenging natural language process- and the Republic of Moldova are different, the two ing (NLP) task, actively studied by researchers in countries share the same literary standard (Mina- nowadays. Due to the recent interest in dialect han, 2013). Some linguists (Pavel, 2008) consider identification, we introduce the Moldavian and that the border between Romania and the Repub- Romanian Dialectal Corpus (MOROCO), which lic of Moldova (see Figure1) does not correspond is composed of 33564 samples of text collected to any significant isoglosses to justify a dialectal from the news domain. division. One question that arises in this context is whether we can train a machine to accurately nels, in (i) distinguishing the Moldavian and distinguish literary text samples written by peo- the Romanian dialects and in (ii) categoriz- ple in Romania from literary text samples writ- ing the text samples by topic. ten by people in the Republic of Moldova. If We organize the remainder of this paper as fol- we can construct such a machine, then what are lows. We discuss related work in Section2. We the discriminative features employed by this ma- describe the MOROCO data set in Section3. We chine? Our corpus formed of text samples col- present the chosen classification methods in Sec- lected from Romanian and Moldavian news web- tion4. We show empirical results in Section5, sites, enables us to answer these questions. Fur- and we provide a discussion on the discriminative thermore, MOROCO provides a benchmark for features in Section6. Finally, we draw our conclu- the evaluation of dialect identification methods. sion in Section7. To this end, we consider two state-of-the-art meth- ods, string kernels (Butnaru and Ionescu, 2018; 2 Related Work Ionescu and Butnaru, 2017; Ionescu et al., 2014) There are several corpora available for dialect and character-level convolutional neural networks identification (Ali et al., 2016; Alsarsour et al., (CNNs) (Ali, 2018; Belinkov and Glass, 2016; 2018; Bouamor et al., 2018; Francom et al., 2014; Zhang et al., 2015), which obtained the first two Johannessen et al., 2009; Kumar et al., 2018; places (Ali, 2018; Butnaru and Ionescu, 2018) in Samardziˇ c´ et al., 2016; Tan et al., 2014; Zaidan the Arabic Dialect Identification Shared Task of and Callison-Burch, 2011). Most of these cor- the 2018 VarDial Evaluation Campaign (Zampieri pora have been proposed for languages that are et al., 2018). We also experiment with a novel widely spread across the globe, e.g. Arabic (Ali CNN architecture inspired the recently introduced et al., 2016; Alsarsour et al., 2018; Bouamor et al., Squeeze-and-Excitation (SE) networks (Hu et al., 2018), Spanish (Francom et al., 2014), Indian (Ku- 2018), which exhibit state-of-the-art performance mar et al., 2018) or German (Samardziˇ c´ et al., in object recognition from images. To our knowl- 2016). Among these, Arabic is the most popu- edge, we are the first to introduce Squeeze-and- lar, with a number of four data sets (Ali et al., Excitation networks in the text domain. 2016; Alsarsour et al., 2018; Bouamor et al., 2018; As we provide category labels for the collected Zaidan and Callison-Burch, 2011), if not even text samples, we can perform additional exper- more. iments on various text categorization by topic Arabic. The Arabic Online news Commentary tasks. One type of task is intra-dialect multi-class (AOC) (Zaidan and Callison-Burch, 2011) is the categorization by topic, i.e. the task is to clas- first available dialectal Arabic data set. Although sify the samples written either in the Moldavian AOC contains 3.1 million comments gathered dialect or in the Romanian dialect into one of the from Egyptian, Gulf and Levantine news websites, following six topics: culture, finance, politics, sci- the authors labeled only around 0:05% of the data ence, sports and tech. Another type of task is set through the Amazon Mechanical Turk crowd- cross-dialect multi-class categorization by topic, sourcing platform. Ali et al.(2016) constructed i.e. the task is to classify the samples written in a data set of audio recordings, Automatic Speech one dialect, e.g. Romanian, into six topics, using Recognition transcripts and phonetic transcripts of a model trained on samples written in the other Arabic speech collected from the Broadcast News dialect, e.g. Moldavian. These experiments are domain. The data set was used in the 2016, 2017 aimed at showing if the considered text categoriza- and 2018 VarDial Evaluation Campaigns (Mal- tion methods are robust to the dialect shift between masi et al., 2016; Zampieri et al., 2017, 2018). training and testing. Alsarsour et al.(2018) collected the Dialectal In summary, our contribution is threefold: ARabic Tweets (DART) data set, which contains • We introduce a novel large corpus containing around 25K manually-annotated tweets. The data 33564 text samples written in the Moldavian set is well-balanced over five main groups of Ara- and the Romanian dialects. bic dialects: Egyptian, Maghrebi, Levantine, Gulf • We introduce Squeeze-and-Excitation net- and Iraqi. Bouamor et al.(2018) presented a large works to the text domain. parallel corpus of 25 Arabic city dialects, which • We analyze the discriminative features that was created by translating selected sentences from help the best performing method, string ker- the travel domain. Other languages. The Nordic Dialect Corpus (Johannessen et al., 2009) contains about 466K spoken words from Denmark, Faroe Islands, Ice- land, Norway and Sweden. The authors tran- scribed each dialect by the standard official or- thography of the corresponding country. Francom et al.(2014) introduced the ACTIV-ES corpus, which represents a cross-dialectal record of the informal language use of Spanish speakers from Argentina, Mexico and Spain. The data set is Figure 2: The distribution of text samples per topic for the Moldavian and the Romanian dialects, respectively. composed of 430 TV or movie subtitle files. The Best viewed in color. DSL corpus collection (Tan et al., 2014) comprises news data from various corpora to emulate the in Figure2. In both countries, we notice that the diverse news content across different languages. most popular topics are finance and politics, while The collection is comprised of six language vari- the least popular topics are culture and science. ety groups. For each language, the collection con- The distribution of topics for the two dialects is tains 18K training sentences, 2K validation sen- mostly similar, but not very well-balanced. For in- tences and 1K test sentences. The ArchiMob cor- stance, the number of Moldavian politics samples pus (Samardziˇ c´ et al., 2016) contains manually- (5154) is about six times higher than the number annotated transcripts of Swiss German speech col- of Moldavian science samples (877). However, lected from four different regions: Basel, Bern, MOROCO is well-balanced when it comes to the Lucerne and Zurich. The data set was used in distribution of samples per dialect, since we were the 2017 and 2018 VarDial Evaluation Campaigns able to collect 15403 Moldavian text samples and (Zampieri et al., 2017, 2018). Kumar et al.(2018) 18161 Romanian text samples. constructed a corpus of five Indian dialects con- sisting of 307K sentences. The samples were col- It is important to note that, in order to obtain the lected by scanning, passing through an OCR en- text samples, we removed all HTML tags and re- gine and proofreading printed stories, novels and placed consecutive space characters with a single essays from books, magazines or newspapers.