Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language

Victoria Bobicev
Technical University of Moldova, Chișinău, Republic of Moldova
victoria [email protected]

Cătălina Mărănduc
"Al. I. Cuza" University, Iași, Romania
Institute of Linguistics "Iorgu Iordan Al. Rosetti", Bucharest, Romania
[email protected]

Cenel Augusto Perez
Al. I. Cuza University, Iași, Romania
[email protected]

Proceedings of the LT4DHCSEE in conjunction with RANLP 2017, pages 10–19, Varna, Bulgaria, 8 September 2017. http://doi.org/10.26615/978-954-452-046-5_002

Abstract

Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those existing or in progress deal only with the Contemporary Romanian standard. However, the need to study the dynamics of natural languages has given rise to balanced corpora containing non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a big balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus of Old Romanian. We also intend to include the South-Danube dialects, which are remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, together with the mutual understanding between speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.

1 Introduction

The UAIC-RoDia-DepTb (ISLRN 156-635-615-024-0) is a balanced treebank that forms the core of a big corpus for Old and Regional Romanian and for its South-Danube dialects. The treebank now has 16,187 sentences, with 322,404 tokens, illustrating all the styles of communication, among which are 5,723 sentences in Old Romanian and 1,230 sentences in regional variants of Romanian.

If we know the non-standard, regional or earlier forms, we can understand the laws of natural language evolution; we can know how language functions in communication, and we can process it. The use of Old or Regional Romanian should not be judged as mistaken by reference to the rules of the standard; it follows other rules, which we must discover.

Linguists are increasingly interested in the study of old languages with modern tools, and their demand for old language processing tools is growing worldwide. Consequently, computational linguists are building diachronic and old-language corpora for all natural languages, some of which are described in related works such as Borin and Forsberg (2008), Davies (2010), Prevost and Stein (2013), etc.

We have built one sub-corpus for Old Romanian and another for the regional variants, but it is more difficult to build corpora for the South-Danube Romanian dialects, because they are very different from the standard language, cannot be understood by its speakers, and two of them have no written form.

The emergence of the South-Danube dialects is historically and politically determined. Due to their isolation from the linguistic center, they are more conservative than it and retain many archaic linguistic phenomena.

Matteo Giulio Bartoli (1925) formulated the theory of isolated or side areas, demonstrating with some examples that these areas are more conservative than the center. The dialects conserve more forms of the old language, or of the language from which they were inherited (in our case, Latin).

For example, Istroromanian is the Romanian dialect with the smallest number of speakers: it is spoken in eight villages in Croatia by 1,000 people, called "Vlach", who have only recently been recognized as a national minority. This is an isolated area, very conservative with respect to Old Romanian. Through its disappearance, and for lack of collected and digitized testimonies, we could lose important data on the evolution of the Romanian language.

The study of dialectal variation is important for the history of languages and for etymology. The big etymological dictionary of the Romance languages, DeRom¹, includes among its bibliographical sources books by specialists in dialectology such as Iosif Popovici (1909) and Richard Sarbu (1998) for the Istroromanian dialect.

¹ http://www.degruyter.com/view/product/205712

Besides their historical importance, these dialects are languages of disadvantaged minorities, threatened with extinction and with limited access to culture. Their folk creation and other contributions must be conserved, and the people speaking them have to be received into the family of European languages.

2 Related Work

The need for a research infrastructure for the study of historical lexical resources, through digitization and the use of language technology, is increasingly recognized by the historical research community. Historical documents are being digitized on a vast scale in cultural heritage and digital library projects in many countries. Modern linguistic studies pay increasing attention to diachronic and dialectal variation of languages. Similar corpora for other languages have started from contemporary language processing tools, adapting them to the old or regional variants.

Digitized historical corpora have already been created for many languages: English (Yanez-Bouza (2011), Davies (2012)), Spanish and Portuguese (Davies (2010)), French (Stein (2008)) and so on. In another paper, France Martineau (2007) analyzes the use of probabilistic parsing methods for Old French texts. Unfortunately, the probabilistic parser described there was trained for, and can be used only within, that project; it cannot be adapted to other languages and annotation conventions.

Nuria Yanez-Bouza (2011) describes the building of a rule-based automatic pre-annotator which has around 30 rules to identify complex verb forms (VCOMP), adjective phrases (AP), noun phrases (NP) and prepositional phrases (PP). The next phase is manual annotation via a special interface. We also have a hybrid POS-tagger that permits the introduction of rules (see below), and we use manual annotation for part of the semantic relations.

In another paper, Borin et al. (2010) present ongoing work on building a digitized diachronic Swedish lexical resource. The paper presents a basic research infrastructure for language technology called BLARK (Basic Language Resource Kit), which includes basic lexical resources, annotated corpora and basic NLP tools for processing these corpora. The same authors, Borin and Forsberg (2008), describe the creation of a tool for the morphological analysis of Old Swedish words, which should be followed by syntactic and semantic analysis. East European languages raise several preprocessing issues. For example, for some of them Cyrillic and Latin scripts were used in various periods of time; hence some documents need to be transliterated before further processing (Gruszczynski and Ogrodniczuk, 2015).

3 Tools for Romanian Standard Processing

3.1 The UAIC-Ro Hybrid Part Of Speech (POS)-Tagger

The tools for processing Contemporary Romanian are the basis for creating those for processing Old or Regional Romanian. The lexicon of the UAIC-POS-tagger for Contemporary Romanian also contains archaic words and forms, extracted from dictionaries, while Old Romanian also contains words and forms used today.

The UAIC-Ro POS-tagger is hybrid, i.e., it combines a statistical model with a rule-based system (Simionescu, 2011). The specificity of the hybrid model is that it applies a set of rules to reduce the large set of valid pairs of lemma and POS-tag (an abbreviated morphological analysis) that can apply to a word-form. Indeed, there are morphological homonyms, which can be interpreted only by taking into account the words in the vicinity of their occurrence. After the set of possible analyses has been reduced, the statistical system is put into operation.

The dictionary of the POS-tagger is formed of triplets: word-form, lemma (the base form of the word, found in dictionaries), and POS-tag (an abbreviated morphological analysis). The size of the POS-tagger lexicon is related to the accuracy of the tool. The tool for Contemporary Romanian contains 1.15 million distinct words extracted from dictionaries and 100,000 proper nouns extracted from Wikipedia. The set of 406 tags is a reduced version of the tagset used by the Multext East project (Erjavec, 2004).

The rule of big dimensions states that the higher the number of tags, the larger the gold corpus for training must be. The training corpus for Contemporary Romanian consists of the NAACL 2003 corpus (39,000 sentences) and another 28,000 sentences extracted from the JRC-ACQUIS. The corpus for evaluation was Orwell's novel 1984, manually annotated in the Multext East project.

However, these corpora did not share an identical set of conventions and did not use our tagset.

[...] conventions of annotation used are those of FDG (Functional Dependency Grammar), with labels from classical syntax and numerous semantic sub-classifications of modifiers. In creating the treebank in 2007, Augusto Perez intended to target it at didactic purposes, for secondary education, even building a computer game to prepare students for exams; but, of course, the education system cannot be so easily convinced to adopt the Dependency Grammar.

This system can be transposed both into the modern syntactic system of Universal Dependencies (UD), with loss of semantic information, and into a semantic annotation system, by adding information. This is why we will continue to use this classic format, in which the processing tools were trained, and then it will be automatically (supervised) converted into UD (Universal Dependen-
