A Computational Perspective on the Romanian Dialects
Alina Maria Ciobanu, Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest
Center for Computational Linguistics, University of Bucharest
[email protected], [email protected]

Abstract
In this paper we conduct an initial study on the dialects of Romanian. We analyze the differences between Romanian and its dialects using the Swadesh list. We analyze the predictive power of the orthographic and phonetic features of the words, building a classification problem for dialect identification.

Keywords: Romanian, dialects, language similarity.

1. Introduction and Related Work

The rapid development of online repositories has led to a significant increase in the number of multilingual documents, allowing users from all over the world to access information that has never been available before. This accelerated growth created a pressing need to overcome the language barrier by developing methods and tools for processing multilingual information. Nowadays, NLP tools for the official languages spoken in the European Union and for the most popular languages are constantly created and improved. However, there are many other language varieties and dialects that could benefit from such NLP tools. The effort of building NLP tools for resource-poor language varieties and dialects can be reduced by adapting the tools from related languages for which more resources are available. The importance of adapting NLP tools from resource-rich to resource-poor closely related languages has been acknowledged by the research communities and has materialized through multiple events, such as the workshop on Language Technology for Closely Related Languages and Language Variants (Nakov et al., 2014) or the workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (Zampieri et al., 2014).

A related problem occurs when researchers are interested in the cultural heritage of small communities, who developed their own techniques of communication and prefer using dialects instead of the official language of the region they live in. In some situations, these dialects are close enough to the standard language, but in other situations the difference is substantial, so much so that some dialects have become languages of their own (for example Friulian, spoken in the North-East of Italy). These matters raise interesting research problems, since many such dialects are used only in speaking. Moreover, they often tend to be used only in very specific situations (such as speaking in the family), and are very rarely taught in schools. Thus, many dialects are in danger of extinction, according to the UNESCO list of endangered languages (Moseley, 2010).

In this paper, we conduct an initial study on the dialects of Romanian. This investigation has the purpose of providing a deeper understanding of the differences between dialects, which would aid the adaptation of existing NLP tools for related varieties. The aim of our investigation is to assess the orthographic and phonetic differences between the dialects of Romanian. In this paper, we quantify only the orthographic and phonetic differences; morphology and syntax are other important aspects that contribute to the individualization of each variety, and we leave them for further study.

Previously, Tonelli et al. (2010) proposed such an adaptation for a morphological analyzer for Venetan. Similarly, Kanayama et al. (2014) built a dependency parser for Korean, leveraging resources (transfer learning) from Japanese. They showed that Korean sentences could be successfully parsed using features learnt from a Japanese training corpus. In the field of machine translation, Aminian et al. (2014) developed a method for translating from dialectal Arabic into English in which they reduce the OOV ratio by exploiting resources from standard Arabic. Although dialects and varieties have been investigated for other languages, such as Spanish and Portuguese (Zampieri and Gebre, 2012; Zampieri et al., 2013), Romanian dialects have not received much attention in NLP. To our knowledge, while the syllabic structure of Aromanian has been previously investigated (Nisioi, 2014), this is one of the very first computational comparative studies on the Romanian dialects.

2. The Romanian Dialects

Romanian is a Romance language, belonging to the Italic branch of the Indo-European language family, and is of particular interest regarding its geographic setting. It is surrounded by Slavic languages and its relationship with the big Romance kernel was difficult. According to Tagliavini (1972), Romanian has been isolated for a long period from the Latin culture in an environment of different languages. Joseph (1999) emphasizes the reasons which make Romanian of special interest to linguists with comparative interests. Besides general typological comparisons that can be made between any two or more languages, Romanian can be studied based on comparisons of genetic and geographical nature. Joseph further states that, regarding genetic relationships, Romanian can be studied in the context of those languages most closely related to it and that the well-studied Romance languages enable comparisons that might not be possible otherwise, within less well documented families of languages. Romanian is of particular interest also regarding its geographic setting, participating in numerous areally-based similarities that define the Balkan convergence area.

Romanian is spoken by over 24 million people as a native language, of whom 17 million are located in Romania, and most of the others in territories that surround Romania (Lewis et al., 2015). According to most Romanian linguists (Pușcariu, 1976; Petrovici, 1970; Caragiu Marioțeanu, 1975), Romanian has four dialects:

• Daco-Romanian, or Romanian (RO) - spoken primarily in Romania and the Republic of Moldova, where it has an official status.
• Macedo-Romanian, or Aromanian (AR) - spoken in relatively wide areas in Macedonia, Albania, Greece, Bulgaria, Serbia and Romania.
• Megleno-Romanian (ME) - spoken in a narrower area in the Meglen region, in the South of the Balkan Peninsula.
• Istro-Romanian (IS) - spoken in a few villages in the North-East of the Istrian Peninsula in Croatia. It is much closer to Italy than to Romania, from a geographical point of view, but shows obvious similarities with Romanian. It seems that the community of Istro-Romanians has existed here since before the 12th century. Istro-Romanian is today on the "selected" list of endangered languages, according to the UNESCO classification.¹

¹ UNESCO Interactive Atlas of the World's Languages in Danger (Moseley, 2010) provides for Istro-Romanian the following information: severely endangered, with an estimation of 300

Romanian was originally a single language, descendant of the oriental Latin, spoken in the regions around the romanized Danube: Moesia Inferior and Superior, Dacia and Pannonia Inferior (Rosetti, 1966). The period of common Romanian began in the 7th-8th century and ended in the 10th century, when a part of the population migrated to the South of the Danube, beginning the creation of the dialects. Densușianu (1901) places the migration to the South even earlier in time, in the 6th and 7th century. Thus, starting with the 10th century, given a series of political, military, economic and social events, the four dialects of Romanian were born: Daco-Romanian (to the North of the Danube), Aromanian, Megleno-Romanian and Istro-Romanian (to the South of the Danube). Among these dialects, only Daco-Romanian could develop into a national standard language, in the context of several political and historical factors, leading to the Romanian language that is spoken today inside the borders of Romania. The other three dialects are spoken in communities spread across different countries. An explanation for this fact is the settlement of the Slavic people to the South of the Danube, which has led, among other things, to the dispersion of the groups that spoke the three dialects to the South of the Danube. According to the Ethnologue (Lewis et al., 2015), the three dialects to the South of the Danube developed between the 5th and the 10th century, while according to Rosetti (1966), this process took place after the 10th century. Thus, according to Rosetti (1966), Aromanian and Megleno-Romanian developed in the 11th century, while Istro-Romanian developed in the 13th century. Rosetti (1966) states that there are, actually, two main dialects of Romanian: Daco-Romanian and Aromanian, the other two being derived from them (Megleno-Romanian derived from Aromanian and Istro-Romanian derived from Daco-Romanian).

[Bar chart; x-axis: Language (RO, IS, ME, AR); y-axis: Avg. word length]
Figure 1: Average word length, using the orthographic form of the words.

3. Experiments

In this section we describe our investigations and experiments on the dialects of Romanian. We are mainly interested in assessing the differences between the dialects from the South of the Danube and Daco-Romanian. We henceforth refer to Daco-Romanian, the standard language spoken in Romania, as Romanian.

3.1. Data

We use a dataset of 108 words comprising the short Swadesh list for the Romanian dialects.² The Swadesh list has been widely used in lexicostatistics and comparative linguistics, to investigate the classification of languages (Dyen et al., 1992; McMahon and McMahon, 2003). The dataset is provided in two versions: orthographic and phonetic. In Figure 1 we represent the average word length (considering the orthographic form of the words) for the Romanian dialects. Istro-Romanian has the shortest words, followed by Romanian. Megleno-Romanian and Aromanian have slightly longer words, on average, but the differences are not significant.

The orthographic or phonetic distance has been widely used for analyzing related words and for reconstructing phylogenies (Kondrak, 2004; Delmestri and Cristianini, 2012). We use the edit distance to observe how close the Romanian dialects are to one another (Table 1).
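As an illustration of the two measurements mentioned in this section, the following is a minimal Python sketch of average word length and of an average edit distance between two aligned word lists. It is not the authors' implementation: the function names are hypothetical, the edit distance is the plain unit-cost Levenshtein distance, and dividing each distance by the length of the longer word is an assumption made here so that pairs of different lengths are comparable. The text above does not state whether or how the distances are normalized.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def average_word_length(words: Sequence[str]) -> float:
    """Mean number of characters per word in a non-empty list."""
    return sum(len(w) for w in words) / len(words)


def average_normalized_distance(list_a: Sequence[str],
                                list_b: Sequence[str]) -> float:
    """Mean edit distance over aligned word pairs.

    Assumption (not from the paper): each distance is divided by the
    length of the longer word in the pair.
    """
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("word lists must be non-empty and aligned")
    total = 0.0
    for a, b in zip(list_a, list_b):
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / len(list_a)
```

With two lists holding the same Swadesh concepts in the same order, one per dialect, `average_normalized_distance(list_a, list_b)` gives a value between 0 (identical lists) and 1, and `average_word_length` corresponds to the quantity plotted in Figure 1 when applied to the orthographic forms.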