
Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language

Victoria Bobicev
Technical University, Chișinău, Republic of Moldova
victoria [email protected]

Cătălina Mărănduc
“Al. I. Cuza” University, Iași, Romania
Institute of Linguistics “Al. Rosetti”, Romania
− [email protected]

Cenel Augusto Perez
Al. I. Cuza University of Iasi
[email protected]

Proceedings of the LT4DHCSEE in conjunction with RANLP 2017, pages 10–19, Varna, Bulgaria, 8 September 2017. http://doi.org/10.26615/978-954-452-046-5_002

Abstract

Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those that exist or are in progress deal only with the Contemporary Romanian standard. However, the need to study the dynamics of natural language gave rise to balanced corpora that contain non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a big balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus in Old Romanian. We also intend to include the South-Danube dialects, remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, as well as the mutual understanding between the speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.

1 Introduction

The UAIC-RoDia-DepTb (ISLRN 156-635-615-024-0) is a balanced treebank that becomes the core of a big corpus for Old and Regional Romanian and for its South-Danube dialects. The treebank now has 16,187 sentences, with 322,404 tokens, illustrating all the styles of communication, among which there are 5,723 sentences in Old Romanian and 1,230 sentences in regional variants of Romanian.

If we know the non-standard, regional or earlier forms, we can understand the laws of natural language evolution; we can know how language functions in communication and we can process it. The use of Old or Regional Romanian should not be judged as mistaken by reference to the standard rules; it follows other rules, which we must discover.

Linguists are increasingly interested in the study of old languages with modern tools, and their demand for old language processing tools is growing worldwide. Consequently, computational linguists are building diachronic and old-language corpora for all natural languages, some of them described in related works such as Borin and Forsberg (2008), Davies (2010), Prevost and Stein (2013), etc.

We have built a sub-corpus for Old Romanian and another for the regional variants, but it is more difficult to build corpora for the South-Danube dialects (because they are very different from the standard language, cannot be understood by its speakers, and two of them have no written aspect).

The emergence of the South-Danube dialects is historically and politically determined. Due to their isolation from the linguistic center, they are more conservative than the center and retain many archaic linguistic phenomena.

Matteo Giulio Bartoli (1925) formulated the theory of isolated areas or side areas, demonstrating with some examples that these areas are more conservative than the center. The dialects conserve more forms of the old language, or of the language from which they were inherited (in our case, the Latin language).

For example, Istroromanian is the dialect of Romanian spoken by the smallest number of speakers: about 1,000 people in eight villages in Croatia, called “Vlach”, only recently recognized as a national minority. This is an isolated area, very conservative with respect to Old Romanian. Through its disappearance and the lack of collected and digitized testimonies, we could lose important data on the evolution of the Romanian language.

The study of dialectal variation is important for the history of languages and for etymology. The big etymological dictionary of the Romance languages, DeRom [1], has among its bibliographical sources books by specialists in dialectology such as Iosif Popovici (1909) and Sarbu (1998) for the Istroromanian dialect.

[1] http://www.degruyter.com/view/product/205712

Besides their historical importance, these dialects are languages of disadvantaged minorities, threatened with extinction, with limited access to culture. Their folk creations and other contributions must be conserved, and the people speaking them have to be received into the family of European languages.

2 Related Work

The need for a research infrastructure for the study of historical lexical resources through digitization and the involvement of language technology is increasingly recognized by the historical research community. Historical documents are being digitized on a vast scale in cultural heritage and digital library projects in many countries. Modern linguistic studies pay increasing attention to diachronic and dialectal variations of languages. Similar corpora for other languages have started from contemporary language processing tools, adapting them to the old or regional variants.

Digitized historical corpora have already been created for many languages: English (Yanez-Bouza (2011), Davies (2012)), Spanish and Portuguese (Davies (2010)), French (Stein (2008)) and so on. In another paper, Martineau (2007) analyzes the use of probabilistic parsing methods for Old French texts. Unfortunately, the probabilistic parser described there was trained for, and can be used only within, that project; it cannot be adapted to other languages and annotation conventions.

Nuria Yanez-Bouza (2011) describes the building of a rule-based automatic pre-annotator, which has around 30 rules to identify complex verb forms (VCOMP), adjective phrases (AP), noun phrases (NP) and prepositional phrases (PP). The next phase is manual annotation via a special interface. We also have a hybrid POS-tagger which permits the introduction of rules (see below), and we use manual annotation for a part of the semantic relations.

In another paper, Borin et al. (2010) present ongoing work on building a digitized diachronic Swedish lexical resource. The paper presents a basic research infrastructure for language technology called BLARK (Basic Language Resource Kit), which includes basic lexical resources, annotated corpora and basic NLP tools for processing these corpora. The same authors, Borin and Forsberg (2008), describe the creation of a tool for the morphological analysis of Old Swedish words, which should be followed by syntactic and semantic analysis. East European languages have several preprocessing issues. For example, for some of them Cyrillic and Latin scripts were used in various periods of time; hence some documents need to be transliterated before further processing (Gruszczynski and Ogrodniczuk, 2015).

3 Tools for Romanian Standard Processing

3.1 The UAIC-Ro Hybrid Part Of Speech (POS)-Tagger

The tools for processing Contemporary Romanian are the basis for creating those for processing Old or Regional Romanian. The lexicon of the UAIC-POS-tagger for Contemporary Romanian also contains archaic words and forms, extracted from dictionaries, while Old Romanian also contains words and forms used today.

The UAIC-Ro POS-tagger is hybrid, i.e. it combines a statistical model with a rule-based system (Simionescu, 2011). The specificity of the hybrid model is that it applies a set of rules to reduce the large set of valid pairs of lemma and POS-tag (abbreviated morphological analysis) that can be assigned to a word-form. There are morphological homonyms, which can be interpreted only by taking into account the words in the vicinity of their occurrence. After the set of possible analyses has been reduced, the statistical system is put into operation.

The dictionary of the POS-tagger is formed of triplets: word-form, lemma (the base form of the word, found in dictionaries), and POS-tag (an abbreviated morphological analysis). The size of the POS-tagger lexicon is related to the accuracy of the tool. The tool for Contemporary Romanian contains 1.15 million distinct words extracted from dictionaries and 100,000 proper nouns extracted from [...]. The set of 406 tags is a reduced version of the tagset used by the Multext East project (Erjavec (2004)).

The rule of big dimensions shows that the higher the number of tags, the larger the gold corpus for training must be. The training corpus for Contemporary Romanian consists of the NAACL 2003 corpus (39,000 sentences) and another 28,000 sentences extracted from the JRC-ACQUIS. The corpus for evaluation was Orwell's 1984, manually annotated in the Multext East project.

However, these corpora do not share an identical set of conventions and do not use our tagset. The training corpus is not entirely manually corrected, and there may be inconsistencies between the annotations of these corpora. The accuracy of the tool, evaluated on standard Romanian, is 95.12 % without rules and 96.66 % with rules.

The POS-tagger was trained on standardized, but also on non-standardized language, before processing 2,570 sentences of Social Media communication (Romanian chat). One method to increase the accuracy on chat sentences was to double the training corpus, with and without the letters with specific Romanian diacritics (ș, ă, ț, î, â), which are not always used in chat communication.

The diacritics-only RO-POS tagger obtains an accuracy of 97.03 % when evaluated on Orwell's 1984 novel, but only 68.67 % when evaluated on the chat corpus. The mixed-diacritics RO-POS tagger obtains 94.38 % when evaluated on the mixed 1984 and 84.78 % when evaluated on the chat corpus, as shown in Perez (2016).

For difficult texts (old, non-standardized, new styles of communication not yet trained) we rely largely on manual validation of the output of the tool, and, by the bootstrapping method, the corrected sentences are added to the gold training corpus.

3.2 The Malt Parser trained on Romanian

A variant of the Malt parser trained on UAIC-RoDepTb began to operate satisfactorily. The annotation conventions used are those of FDG (Functional Dependency Grammar), with labels of classical syntax and numerous semantic sub-classifications of modifiers. When creating the treebank in 2007, Augusto Perez intended it for didactic purposes, for secondary education, even building a computer game to prepare students for exams; but, of course, the education system cannot so easily be convinced to adopt Dependency Grammar.

This system can be transposed both into the modern syntactic system of Universal Dependencies (UD), with loss of semantic information, and into a semantic annotation system, by adding information. This is why we will continue to use this classic format, in which the processing tools were trained; it will then be converted automatically, under supervision, into UD and into a semantic annotation.

The parser is called multilingual or universal in Hall et al. (2006), because its functioning is based only on dependency relations, on the training, and on the previous morphological annotation, whatever the language. The accuracy of the parser is determined by the size of the training corpus (more than 10,000 sentences), by the exactness of the morphological annotation, and by the consistency of the syntactic conventions. Thus, the syntactic parser will not cause problems if used for the annotation of old or regional texts, consistently respecting the conventions of the UAIC-RoDia Dep-treebank, provided that the difficult problem of correct morphological annotation is solved.

The parser was successfully used for syntactic annotation of the 2,570 chat sentences. On this occasion we found that the parser has a better accuracy after being trained on a large corpus which contains standard and non-standard sentences. After training only with chat sentences, the accuracy on chat was 71.74 % for head attachment, 66.08 % for label attachment and 62.31 % for both attachments.

Using the same method, after training with 15,000 sentences from all types of texts, including 4,000 sentences from the seventeenth century, and after the creation of the Old Ro POS-tagger, the accuracy of our parser evaluated on Old Romanian was 77.06 % for both attachments, 83.79 % for head attachment, and 82.5 % for label attachment. The results are better than on the chat corpus, because the training corpus was bigger, and we used the new Old Ro POS-tagger (described below) with a satisfactory accuracy, with the output entirely manually corrected.

3.3 POS-tagger for the Old and Dialectal Romanian

To build a series of Old Romanian processing tools, we began by building a POS-tagger for Old Romanian, which would give us the basic annotation for the syntactic and semantic parsers or for any other type of annotation. To build a new POS-tagger, a list of tags, a lexicon, and a training corpus are needed. Once these data are prepared, we can make clones of the UAIC hybrid POS-tagger, described above, and of the POS-tagger of the Institute of Mathematics and Computer Science of Chisinau, Republic of Moldova, which can also analyze Romanian words written in Cyrillic letters.

3.3.1 The List of Tags for the Old Ro POS-tagger

To establish the list of tags for the new POS-tagger, we began with the list of the UAIC hybrid POS-tagger for Contemporary Romanian, re-introducing some tags which had been eliminated, i.e. the detailed analysis of personal and reflexive pronouns (dative and accusative case, strong and weak forms) and the complex tags for the relational words (prepositions, conjunctions, relative adverbs). The first category (personal and reflexive pronouns) is useful to differentiate direct and indirect objects, and to establish co-references and expletives.

A new set of original tags which we introduced is aimed at annotating a language-specific phenomenon of (Old) Romanian, namely the negation of the non-personal synthetic moods (participle, gerund and supine) by the prefix ne-. Words such as “neterminat”, “nestricată”, “neștiind” (En: unfinished, unbroken, not knowing) have the lemmas “a termina” (to finish), “a strica” (to break), “a ști” (to know), because the verbs “*a netermina”, “*a nestrica”, “*a neşti” do not exist. The POS-tags of these forms are annotated as “Vmp--sm-”, “Vmp--pf-z”, “Vmg-----z” (participle negative singular masculine, participle negative plural feminine, gerund negative). The opposite tag to “Vmp--pf-z” has, on the eighth position, the p that means “positive”.

The annotation of verbal participles as adjectives is not acceptable, and has been corrected everywhere in the training corpus. We intend to build a dictionary of predicate arguments and adjunct structures (Cenel-Augusto Perez, 2015), and the participle has the same possible (syntactic and semantic) dependencies as the other verbal forms.

We do not accept the letters y / n for annotating categories other than +/- definiteness, because they are not transparent and can be confusing, as shown in Mărănduc et al. (2016).

Finally, while the above forms are possible, but uncommon, in Contemporary Romanian, other tags are needed to annotate forms specific only to the Old language. The list of tags for the forms of Dh (emphatic determiners) was doubled as Ph (emphatic pronouns), because these forms exist in Old Romanian independently, not only as determiners of a noun. The emphatic adjective has many specific forms in Old Romanian: “eluși”, “eiși”, “luiși” (En: himself, themselves). Another phenomenon is the imperative formed from the long infinitive. Example: “Nu vă teameți!” (with short infinitive), “Nu vă teamereți!” (with long infinitive) (En: Do not be afraid!).

In Contemporary Romanian, the relative pronoun “care” has a very reduced inflection, while in Old Romanian its inflection is complex. Examples: “carele” Pw3msry, “carea” Pw3fsry, “cari” Pw3mpr, “carii” Pw3mpry (En: which).

The new tagset for Old Romanian has 540 tags. All these tags were annotated in the entire corpus, and the tags which do not exist in the tagset were eliminated from all the sub-corpora and from the lexicon of the POS-tagger. In this way we ensured the consistency of the training corpus with the lexicon.

3.3.2 The Lexicon for the Old Ro POS-tagger

The collection of texts in Old Romanian is quite advanced; it contains 23 documents in TXT format from the sixteenth century, 60 from the seventeenth, 76 from the eighteenth and 325 from the nineteenth century. These texts were cleaned of meta-text, specialists' comments and notes, then processed by a concordancer program that builds indexes and computes the number of occurrences of each form across all the 500 books opened in text format. The concordancer used is Lucon 03.16, a program by Cătălin Mititelu, available on the Sourceforge site [2].

[2] https://sourceforge.net/directory/os:windows/?q=Lucon

However, the Old Romanian texts were written in Old Cyrillic letters. The collection that we hold contains both scans in Cyrillic and transcripts in Latin letters made by specialists. There was no Optical Character Recognizer (OCR) program for such letters. The letters differ from one book to another, and there are many transitional alphabets with mixed Latin and Cyrillic letters. Our colleagues from the Institute of Mathematics and Computer Science of the Academy of Sciences of Moldova are now building an OCR which has begun operating satisfactorily for some texts (Cojocaru et al., 2017).

For Romanian written in Cyrillic, we used the same lexicon for the Old-Romanian POS-tagger and for the OCR program, in two variants, with the word forms written in Latin letters and in Cyrillic letters, trying to help our colleagues increase the performance of the OCR for Romanian Cyrillic letters. The lexicon is obtained from the annotation of the Gospel, the first part of the New Testament (1648), which has 5,028 sentences; the text was partially obtained with the OCR program cited above.

We have about 500 scanned old books without transcription in the Latin alphabet. We hope that the new OCR will solve this problem. Relying on transcriptions made by specialists is questionable, because the specialists sometimes made interpretative transcriptions that bring the old text closer to the contemporary language (so that the POS-tagger will not recognize these forms in the text obtained by OCR), and they introduced numerous notes and comments in contemporary language, whose elimination is time consuming.

Using the Lucon program, we created a lexicon of 120,000 word-forms to be introduced in the POS-tagger. However, we do not know how many actually attested word forms it contains; after the OCR processing of more printed old books, we will construct a more authentic lexicon.

Using the program DEPAR (Dictionary Parser) (Mărănduc et al., 2017), we extracted a list of 5,000 stable unanalyzable Multi Word Expressions from a dictionary (Mărănduc, 2010), and also 98,000 lexical or spelling variants from the Thesaurus Dictionary [3]. The variants are generally old or regional; consequently, their introduction in the POS-tagger lexicon is useful for processing both the Old and the regional variants of the Daco-Romanian dialect (the name used by dialectologists for the language spoken in Romania), but they are not useful for the South-Danube dialects, which have special dictionaries.

The project Monumenta Linguae Dacoromanorum [4], started in 1988 in cooperation with Freiburg University, aims to digitize the old religious books in Romanian. A new edition of the first Bible printed in Romanian (1688) has been completed using manually checked automatic morphological annotation; we have received the indices of their edition, which use another system of abbreviations compatible with ours, to be introduced in our Old Ro POS-tagger lexicon.

Finally, the result will be processed with a tool that generates complete paradigms of words (adding possible forms that do not appear in the indexed texts). We have such a tool for Contemporary Romanian, called Anamorph (Timofciuc et al., 2013); a variant for Old Romanian has now been built and its training must begin (Gîfu and Simionescu, 2016). All variants of a word are associated with the contemporary lemma, and the inflection is generated starting from the root of the word form chosen in the text.

3.3.3 The training corpus for the Old Ro POS-tagger

The third step is the construction of a training corpus for the Old Romanian POS-tagger. We have no other solution than to train the POS-tagger and the Malt parser on the contemporary gold corpus, then to process the old texts, and then to correct the output of the tools, which at the beginning have a modest accuracy.

The manually corrected sentences will, in time, form the training corpus for Old Romanian. The module attached to the POS-tagger will extract from these manually corrected annotations the forms or the analyses that do not exist in its lexicon and will add them. An increase in the accuracy of the POS-tagger is expected as the training corpus grows, i.e. after the manual correction of more books automatically annotated with this tool. The training corpus (now having 18,187 sentences, including 6,882 from the 17th to 19th centuries) is smaller than the one for the Standard Romanian POS-tagger, but is consistent with the tagset. The POS-tagger for Old Romanian now has an accuracy of 91.66 %.
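The lexicon-update step described in Section 3.3.3 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the module attached to the UAIC tagger: the data layout (a dictionary from word-form to a set of lemma and tag pairs), the function name `extend_lexicon`, the lowercasing of forms, and the tag `Pw3--r` used for contemporary “care” are all hypothetical choices made for the example.

```python
from collections import defaultdict


def extend_lexicon(lexicon, corrected_sentences):
    """Add analyses found in corrected sentences but missing from the lexicon.

    lexicon: dict mapping a word-form to a set of (lemma, tag) pairs.
    corrected_sentences: iterable of sentences, each a list of
        (word_form, lemma, tag) triplets validated by an annotator.
    Returns the list of triplets that were newly added.
    """
    added = []
    for sentence in corrected_sentences:
        for form, lemma, tag in sentence:
            key = form.lower()  # simplification: case is ignored (assumption)
            analysis = (lemma, tag)
            if analysis not in lexicon[key]:
                lexicon[key].add(analysis)
                added.append((key, lemma, tag))
    return added


# Hypothetical starting lexicon with one contemporary entry.
lexicon = defaultdict(set)
lexicon["care"].add(("care", "Pw3--r"))  # "Pw3--r" is an illustrative tag

# One manually corrected sentence fragment; "carele" Pw3msry is from the paper.
corrected = [[("carele", "care", "Pw3msry"), ("care", "care", "Pw3--r")]]

new_entries = extend_lexicon(lexicon, corrected)
print(new_entries)
```

Only the unseen analysis is reported as new; running the same corrected sentences a second time adds nothing, which is the behaviour one would want when the same book is re-processed during bootstrapping.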

[3] http://edtlr.info.uaic.ro/
[4] https://consilr.info.uaic.ro/ mld/monumenta/

3.3.4 Building POS-taggers for processing the South Danube Dialects

We will then apply the same solution: building clones of the UAIC-Ro-POS-tagger, one for each South-Danube dialect. The same steps must be completed for each of these dialects, beginning with the difficult step of acquiring digitized sources.

The construction of the Regional lexicon begins with the acquisition of a big collection of sources: texts in each variant of the language, in editable form. Then, two problems must be solved: first, a big lexicon is needed, containing all the possible word-forms in each dialect of the language, with the lemmas and with the correct morphological analysis for any form found in texts; secondly, a big gold corpus (manually checked) is needed for the training.

The greater difficulty lies in the fact that there is a very big distance between these variants of the language, so we will need to start the construction of the training corpus for each dialect from zero, without using the large corpus of the other variants of the language. That corpus can only be used as a training corpus for the regional variants spoken in Romania and the Republic of Moldova.

We have a collection of texts published by specialists for each dialect, and some dictionaries. In order to create a collection of sources, we should carry out dialectal surveys in the villages where the dialects are spoken, because two of them do not have a written aspect, to record and annotate the texts. We do not yet have a collection of annotated specimens of spontaneous spoken language in our corpus.

We can extract lemmas from the dictionaries using the program DEPAR (Dictionary Parser) (Mărănduc et al., 2017), but the inflection must be manually introduced in the POS-tagger lexicon. For this purpose, we will have to involve specialists in the South-Danube dialects in our project. The existence of a lexicon with inflected forms could also lead to the creation of an OCR for the Aromanian dialect, which has a written aspect. The difficulty is that various phonetic transcription systems are used in different published collections.

4 Short Presentation of Romanian Dialects

4.1 The Aromanian Dialect

Aromanian is the dialect with the most speakers, approximately 250,000. They live in Albania, in Greece (the Pindos Mountains, the [...] of [...]) or in the Republic of Macedonia, hence the name Macedo-Romanian dialect, which can be confusing, because this is the name of a geographical region, and in that region there are also people of other origins and languages.

There are numerous texts in this dialect, because many of its speakers migrated to Romania and some of them became specialists in dialectal linguistics, for example Marioțeanu (2006), Saramandu (2003) and Nevaci (2011), to mention only the contemporary ones. Other, older collections are Capidan (1925), Obedenaru and Bianu (1891), Bujduveanu (2005), and so on. They have published collections of texts in several styles of the language: written or oral popular literature, dictionaries, textbooks. This dialect has been studied since the nineteenth century and it has historical variants (texts in Old Aromanian). It also has different regional variants (see Figure 1).

Figure 1: The map of Aromanian localities, from the http://www.theapricity.com/forum/showthread

However, there are several difficulties. The pronunciation is transcribed differently in different books, depending on the period in which they were collected and published. This dialect is also studied by Greek linguists, who have another set of graphical conventions for the transcription. These specialists, too, tend towards an interpretive spelling close to the standard language, and there is also an intention to bring the Aromanian dialect closer to the standard language, especially since it is the only one of these dialects that has schools and textbooks in Aromanian.

Aromanian is not an option in any OCR software, nor in automated translation programs, and various letters with diacritics are not recognized. For the moment, the editable text obtained by OCR is of poor quality and should be carefully corrected by specialists in the South-Danube Romanian dialects.
Probably we have to introduce the three South-Danube dialects as independent options in the OCR program, each with its lexicon, including the letters from all the transcription systems.
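One small part of the problem of unrecognized diacritic letters can be illustrated with standard Romanian characters, where the same letter may be encoded in several ways (for instance ş with cedilla versus ș with comma below, or a base letter followed by a combining mark). The sketch below is an assumption-laden illustration, not the OCR lexicon described above: the function names `normalize_form` and `strip_diacritics` are hypothetical, and the dialect-specific transcription systems would need their own mapping tables, which are not given in the paper.

```python
import unicodedata

# Legacy cedilla code points mapped to the comma-below letters
# used in modern Romanian text.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u0162": "\u021A",  # Ţ -> Ț
})


def normalize_form(text):
    """Compose base letters with combining marks, then unify cedilla/comma."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(CEDILLA_TO_COMMA)


def strip_diacritics(text):
    """Return a diacritic-free key, usable as a fallback for lexicon lookup."""
    decomposed = unicodedata.normalize("NFD", normalize_form(text))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "s" followed by a combining cedilla and the precomposed "ş" both
# end up as the same comma-below form.
print(normalize_form("s\u0327tiind") == normalize_form("\u015Ftiind"))
print(strip_diacritics("ne\u0219tiind"))
```

Keeping both the normalized form and the diacritic-free key for each lexicon entry is one possible design; it lets a lookup succeed even when a source text uses a different encoding or omits diacritics.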

4.2 The Meglenoromanian Dialect

The Meglenoromanians are an ethnic group living in the Meglen region of Central Macedonia, Greece. This ethnic group is less numerous than the Aromanians. Researchers estimated their number at 20,000 persons in the nineteenth century. However, Thede Kahl (2006) estimated them at 5,000 persons; the negative demographic dynamic is evident. Minorities are not recognized in Greece, and in Turkey they are subjected to a process of assimilation and Islamization. There are no schools in the Meglenoromanian dialect. This idiom, having no written aspect, has no cultivated literature. However, literary folklore texts were published by many linguists. Several collections were published by Pericles Papahagi (1902), Ion Aurel Candrea (1925a, 1925b), and Theodor Capidan (1925). There is only one cultivated publication, a brochure about silkworm rearing, with the script adapted and terms borrowed from Romanian. The Megleno idiom is endangered; it was entered in the UNESCO Red Book on endangered languages, under “Languages in grave danger”, and in the UNESCO Atlas of Languages in danger in the world (Atanasov, 2014). Unlike the Aromanians, who are mostly herdsmen, the Meglenoromanians are traditionally occupied with agriculture. They are not nomads, but sedentary, and therefore this dialect suffered fewer external influences and was kept as a native language spoken in the family. Figure 2 shows a map of their localities, extracted from the Th. Capidan study, 1928. Many families emigrated to the Dobrudja region of Romania (see Figure 2).

Figure 2: The map of Meglenoromanian localities, drawn by T. Capidan

Therefore, we have to contribute to the preservation of texts in the Meglenoromanian dialect, and to facilitate access to European culture for this disfavored linguistic community, by building an annotated corpus for this idiom and by computerizing their dialect. A computerized form of their dictionaries and lexicons is also necessary. The annotated corpus and the electronic dictionary will be used to build a machine translation system.

4.3 Short Presentation of the Istroromanian Dialect

Istroromanian is an Eastern Romance idiom spoken in a few villages in the peninsula of Istria, in Croatia. The number of Istroromanian speakers is more than 500, the “smallest ethnic group in Europe”. In the eighteenth century the number of speakers was 10,000, and many toponyms originating in the Istroromanian dialect demonstrate this fact. Part of the speakers are migrants in Europe, the USA, Canada or Australia.

It is listed among the languages that are “seriously endangered” in the UNESCO Red Book of Endangered Languages. Since 2010, the Croatian Constitution recognizes Istroromanians as one of 22 national minorities. However, there have not been significant changes in preserving their language, culture and ethnic identity (see Figure 3).

Given that Istroromanians have long been in a gradual process of assimilation, and that their language was not used in writing, it is strongly influenced by the Croatian language, and there are no documents to be processed, except a small number of texts collected by linguists: Feresini (1996), Pușcariu (1906), Cantemir (1959). Currently there are some rescue actions for the preservation of the Istroromanian language, carried out by cultural associations.

Figure 3: The map of Istroromanian localities, drawn by Pericle Papahagi

A digitization of all the published texts in Istroromanian, and their careful annotation, is necessary. There are some lexicons to be computerized and linked to the dictionaries of the other South-Danube dialects and to the Romanian computerized dictionaries. Table 1 exemplifies some differences between the Romanian dialects. We ignored here the different diacritics of letters.

5 Discussion

5.1 The Linguistic Correct Definition of the Variants of the Romanian

The linguistic definition of the concept of dialect is that this variant of the language is quite different from the standard language, and mutual understanding between the speakers of the dialect and the speakers of the standard language is quite difficult or even impossible. In cases where the differences are small and the understanding between speakers is easy, these are not different dialects, but simple regional variants. Such is the case of the language spoken in the Republic of Moldova, sometimes referred to as an independent language, or as a different dialect of Romanian, for no linguistic reasons. However, to demonstrate this we need big corpora in the language spoken in the Republic of Moldova and in the South-Danube dialects, all morphologically and syntactically annotated; then we can statistically calculate, for each of these idioms, the indices of its closeness to or departure from standard Romanian. The creation of a big Romanian corpus with morphological and syntactic annotations, illustrating all the geographical and historical variations of this language, is a long-term project, which will probably continue until 2020 or beyond. In 2014 the corpus counted only 4,600 sentences in Standard Romanian (Perez, 2014).

5.2 Aromanian: A Dialect or an Independent Idiom?

We can establish with scientific arguments whether this dialect has the tendency to become an independent natural language or to remain a dialect of Romanian, without imposing any solution, but only to ascertain whether, as some experts have said, Aromanian is an independent language belonging to the group of [...].

The national consciousness of the speakers, as it results from their texts, is also important for establishing whether it is an independent idiom. Aromanians are divided into three main groups: the first, living in northern and central Greek Macedonia, called Gramushtenians; the second, living in the Pindos mountains, called Pindenians; and the third, living in the south of Epirus and in Thessaly, called Farsherots. However, the speakers of this dialect have different conceptions about their nationality. According to the testimonies of early researchers, the first two groups consider that they are Aromanians, but the Farsherots consider that they are [...]. By studying more recently collected texts we will see whether these conceptions of Aromanians about themselves were kept or changed.

6 Future Work

The corpus which we intend to create can be useful for creating computerized dictionaries for the Romanian dialects, aligned with those of the standard Romanian language (and with the Romanian Word Net (RoWN), aligned to the Princeton Word Net (PWN)), and for building a machine translation system for these isolated idioms, to introduce them into international circulation. The cultural and linguistic isolation of these speakers will cease if a platform such as Babel is able to translate into these dialects the information that the speakers find on the Internet in a well-known language.

6.1 Study of the Historical Variations and of the Evolution Tendencies

Another important use of the corpus that we are now beginning to build is the statistical demonstration of the evolution tendencies. Aromanian is the most important dialect of Romanian after the standard language, called the Dacoromanian dialect. The comparative study of Aromanian texts from different historical periods with statistical methods will allow us to know what is their
17 evolution tendency, whether the trend of this idiom Istro-Romanian Aromanian Megleno-Romanian Romanian English klieptu cheptu klieptu piept chest bire ghine bini bine well, good bliera azghirari zber zbiera to roar filiu hilj iliu fiu son filia hilje ilie fiica˘ daughter flier heru ieru fier iron vit¸elu yit¸al vit¸al vit¸el calf (g)lierm iermu ghiarmi vierme worm

Table 1: Differences between few words in the Romanian Dialects is approaching the Romanian standard language or Lars Borin, Marcus Forsberg, and Dimitrios Kokki- the Greek language, whether it becomes an inde- nakis. 2010. Diabase: Towards a diachronic blark pendent idiom or not. We plan to compare parts in support of historical studies. In Proceedings of LREC.. pages 35–42. of our corpora using various comparison methods in order to better understand their similarities and Lars Borin and Markus Forsberg. 2008. Something dissimilarities. old, something new: A computational morpholog- ical description of old swedish. In LREC 2008 - Workshop on Language Technology for Cultural 7 Conclusions Heritage Data (LaTeCH 2008) Conference Proceed- In this paper we described an ongoing work on ings. pages 9–16. the creation of a big balanced corpus of Old and Tanase˘ Bujduveanu. 2005. The Sar˘ ac˘ acians˘ . Aroma- Regional Romanian texts, impossible without cre- nian Book Publisher, Reading, Massachusetts. ating tools for its processing. Consequently, we described some such as tools used for the develop- Ion Aurel Candrea. 1925a. Meglenoromanian texts. Speech and Soul I/I:261285. ment of our corpus in several directions. The balanced corpus does not have only a sci- Ion Aurel Candrea. 1925b. Meglenoromanian texts. entific interest, but also practical consequences. If Speech and Soul I/II:100128. the language spoken in the Republic of Moldova is Traian Cantemir. 1959. Istroromanian Texts. Ro- easily understandable by Romanian speakers with- manian Academy Publisher, Institute of Linguistics out consulting a bilingual dictionary, it demon- Cluj. strates that the Moldavian is not a dialect of Ro- Theodor Capidan. 1925. Meglenoromanians. Their manian. However, in the case of South-Danube History and Their Language. National Culture / Ro- dialects, there are dictionaries and also final lexi- manian Academy, Bucharest. 
cons in the published books, that being necessary Theodor Capidan. 1928. The Meglenoromanians. in order to allow Romanian readers to understand Their Folk Literature. Socec & Co publisher, the published texts. That is proof that they are di- Bucharest. alects and there cannot be any mutual understand- ing between their speakers. Cat˘ alina˘ Mar˘ anduc˘ Radu Simionescu Cenel- Augusto Perez. 2015. Ro-paas a resource linked to our uaic-ro-dep-treebank. In Advances in Artificial Intelligence and Soft Computing 14th References Mexican International Conference on Artificial 1902. Meglenoromanians. Folk Ethnographic Study. Intelligence, MICAI 2015. pages 29–46. Socec & Co publisher, Bucharest. Cat˘ alina˘ Mar˘ anduc˘ Radu Simionescu Cenel- Peter Atanasov. 2014. The current state of megleno- Augusto Perez. 2016. Social media processing romanians. megleno-romanian, an endangered id- romanian chats and discourse analysis. Computacin iom. Memoria Ethnologica XIV(52-53):3037. y Sistemas 20(3):404–414. Matteo Bartoli. 1925. Introduzione alla neolin- Svetlana Cojocaru, Alexander Colesnicov, and Lud- guistica: (princ`ıpi-scopi-metodi). 2. Biblioteca mila Malahova. 2017. Digitization of old romanian dell’ Archivum Romanicum - Serie II: Linguis- texts printed in the cyrillic script. In Proceedings of tica. https://books.google.ro/books? International Conference on Digital Access to Tex- id=Zg0MAAAAIAAJ. 18 tual Cultural Heritage. pages 143–148. Mark Davies. 2010. Creating useful historical corpora: Manuela Nevaci. 2011. The Gray of the Frontier Aro- A comparison of corde, the corpus del espanol, and manians in Dobrogea. Cartea Universitara˘ Publish- the corpus do portuguesˆ pages 137–166. ing House, Bucharest.

Mark Davies. 2012. Expanding horizons in historical Gheorghiad Mihail Obedenaru and . 1891. linguistics with the 400 million word corpus of his- Macedoromanans Texts. Tales and Folk Poems torical american english. Corpora (7):121–157. from Crushova. Carol Gobl¨ Publisher, Bucharest.

Tomaz Erjavec. 2004. Multext-east version 3: Mul- Cenel-Augusto Perez. 2014. Linguistic Resources for tilingual morphosyntactic specifications, lexicons Natural Language Processing. (PhD thesis). Al. I. and corpora. In Proceedings of the Fourth Con- Cuza University, Ias¸i. ference on Language Resources and Evaluation, LREC2004. http://nl.ijs.si/ME/. Iosif Popovici. 1909. The Romanian Dialects in Istria. Halle. Nerina Feresini. 1996. Il Comune Istro-romeno di Val- Syntactic darsa. Edizioni Italo Svevo, Trieste. Sophie Prevost and Achim Stein. 2013. Reference Corpus of Medieval French (SRCMF). Daniela Gˆıfu and Radu Simionescu. 2016. Tracing ENS de Lyon; Universitt Stuttgart; Lattice, Lyon/S- language variation for romanian. In Proceedings tuttgart/Paris. of the 17th International Conference on Intelligent Sextil Pus¸cariu. 1906. Istroromanian studies. in col- Text Processing and Computational Linguistics, CI- laboration with m. bartoli, a. belulovici and a. by- CLing. han, vol. i. texts. Annals series Wlodzimierz Gruszczynski and Maciej Ogrodniczuk. II,(tom. XXVIII):117–182. 2015. The electronic corpus of the 17th and 18th Nicolae Saramandu. 2003. Aromanian and Meglenoro- century polish texts (up to 1772) aims, methods, manian Studies. Ex Ponto Publisher, Constant¸a. current state, problems and prospects for develop- ment. In Slavic Corpus Linguistics: The Historical Richard Sarbuˆ and Vasile Frat¸il˘ a.˘ 1998. Istroromanan Dimension. pages 21–25. Dialect. Texts and Glossary. Amarcord publisher, Timis¸oara. Johan Hall, Joakim Nivre, and Jens Nilsson. 2006. Dis- criminative classifiers for deterministic dependency Radu Simionescu. 2011. Hybrid pos tagger. Proceed- parsing. In Proceedings of the 21st International ings of Language Resources and Tools with Indus- Conference on Computational Linguistics and 44th trial Applications Workshop Eurolan 2011 Summer Annual Meeting of the Association for Computa- school . tional Linguistics (COLING-ACL). pages 316–323. 
Achim Stein. 2008. Syntactic annotation of old french Thede Kahl. 2006. The islamisation of the text corpora. Corpus (7):157–172. meglen (megleno-romanians): The village of nantiˆ (notia)´ and the nantinetsˆ in present- Ana Maria Timofciuc, Daniela Gˆıfu, and Corina day turkey. Nationalities Papers 34(1):71–90. Forascu.˘ 2013. A simple reliable web application https://doi.org/10.1080/00905990500504871. for public discourse analysis. In Proceedings of the International Conference on Intelligent Information Cat˘ alina˘ Mar˘ anduc.˘ 2010. The Dictionary of Roma- Systems (IIS). pages 158–162. nian Expressions, Syntagms and Phrases (DELS). Corint Publishing, Bucharest. Nuria Yanez-Bouza. 2011. Archer past and present (1990-2010). ICAME Journal (35):205–236. Cat˘ alina˘ Mar˘ anduc,˘ Ludmila Malahov, Cenel-Augusto Perez, and Alexandru Colesnicov. 2016. Rodia project of a regional and historical corpus for roma- nian. In Proceedings of MFOI. pages 268–284.

Cat˘ alina˘ Mar˘ anduc,˘ Cat˘ alin˘ Mititelu, and Radu Simionescu. 2017. Parsing romanian specialized dictionaries structured in nests. In Proceedings of Digital Access to Textual Cultural Heritage (DAT- eCH). pages 35–41.

Matilda Caragiu Mariot¸eanu. 2006. Aromanians and Aromanian Dialect in Contemporary Conscious- ness. Romanian Academy Publisher, Bucharest.

France Martineau, Constana Rodica Diaconescu, , and Paul Hirschbuhler.¨ 2007. Le corpus voies du fran- cais : de lelaboration a` lannotation pages 121–142.19
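The "indices of approach or departure" from standard Romanian mentioned in Section 5.1 can be illustrated on the word forms of Table 1. The following is only a minimal sketch, not the method used in the paper: it assumes that a normalized edit (Levenshtein) distance averaged over word pairs is an acceptable first approximation of such an index. The function names `levenshtein`, `normalized_distance` and `departure_index` are hypothetical, diacritics are ignored as in Table 1, and the last row of the table is left out because its Istro-Romanian form has an optional initial letter.

```python
from statistics import mean

# Word forms copied from Table 1, with diacritics ignored as in the table.
# Column order: Istro-Romanian, Aromanian, Megleno-Romanian, standard Romanian.
TABLE_1 = [
    ("klieptu", "cheptu", "klieptu", "piept"),
    ("bire", "ghine", "bini", "bine"),
    ("bliera", "azghirari", "zber", "zbiera"),
    ("filiu", "hilj", "iliu", "fiu"),
    ("filia", "hilje", "ilie", "fiica"),
    ("flier", "heru", "ieru", "fier"),
    ("vitelu", "yital", "vital", "vitel"),
]
DIALECTS = ("Istro-Romanian", "Aromanian", "Megleno-Romanian")


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length: 0.0 means identical forms."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def departure_index(rows=TABLE_1) -> dict:
    """Mean normalized distance from standard Romanian, per dialect (assumed index)."""
    return {
        name: mean(normalized_distance(row[col], row[3]) for row in rows)
        for col, name in enumerate(DIALECTS)
    }


if __name__ == "__main__":
    for name, score in departure_index().items():
        print(f"{name}: {score:.2f}")
```

Seven word pairs are far too few for any conclusion; the sketch only shows the shape of the computation. On an annotated corpus the same function could be applied to aligned lemmas, and a phonetically informed distance would probably be a better choice than plain character edits.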