Lexicon Induction for Spoken Rusyn – Challenges and Results
Achim Rabus
Department of Slavonic Studies
University of Freiburg
Germany
achim.rabus@slavistik.uni-freiburg.de

Yves Scherrer
Department of Linguistics
University of Geneva
Switzerland
[email protected]

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 27–32, Valencia, Spain, 4 April 2017. © 2017 Association for Computational Linguistics

Abstract

This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which results in a tagging recall increased by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the efficiency of using NLP resources from neighboring languages for low-resourced languages.

1 Introduction

This paper deals with the development of a morphological dictionary for spoken varieties of the Slavic minority language Rusyn by leveraging the similarities between Rusyn and neighboring etymologically related languages. It is structured as follows: First, we give a brief introduction to the characteristics of the Rusyn minority language and the data our investigation is based upon. Afterwards, we describe our approach to lexicon induction using resources from several related Slavic languages and the steps we took to improve the matches from the dictionaries. Finally, we discuss the results and give an outlook on future work.

2 Rusyn and the Corpus of Spoken Rusyn

Rusyn belongs to the Slavic language family and is spoken predominantly in the Carpathian region, most notably in Transcarpathian Ukraine, Eastern Slovakia, and South Eastern Poland, where it is called Lemko.¹ Some scholars claim Rusyn to be a dialect of Ukrainian (Skrypnyk, 2013), others see it as an independent Slavic language (Pugh, 2009; Plishkova, 2009). While there is no denying the fact that Ukrainian is the standard language closest to the Rusyn varieties, certain distinct features at all linguistic levels can be detected. This makes the Rusyn varieties take an intermediary position between the East and West Slavic languages (for more details see, e.g., Teutsch (2001)). Nowadays, the speakers of Rusyn find themselves in a dynamic sociolinguistic environment and experience significant pressure from their respective roofing state languages Ukrainian, Slovak, or Polish. Thus, new divergences within the old Rusyn dialect continuum due to contact with the majority language, i.e., so-called border effects, are to be expected (Rabus, 2015; Woolhiser, 2005). In order to trace these divergences and create an empirically sound basis for investigating current Rusyn speech, the Corpus of Spoken Rusyn (www.russinisch.uni-freiburg.de/corpus, Rabus and Šymon (2015)) has been created. It consists of several hours of transcribed speech as well as recordings.² Although the transcription in the corpus is not phonetic, but rather orthographic, both diatopic and individual variation is reflected in the transcription. The reason for that is that exactly this variation is what we want to investigate using the corpus, i.e., more “Slovak” Rusyn varieties should be distinguished from more “Ukrainian” or “Polish” varieties. Besides, variation in transcription practices of different transcribers cannot be avoided.

¹ According to official data, there are 110 750 Rusyns, according to an “informed estimate” no less than 1 762 500, the majority of them living in the Carpathian region (Magocsi, 2015, p. 1).

² The corpus engine is CWB (Christ, 1994), the GUI functionality has been continuously expanded for several Slavic corpus projects (Waldenfels and Woźniak, 2017; Waldenfels and Rabus, 2015; Rabus and Šymon, 2015).

At the moment, Rusyn does not have any existing NLP resources (annotated corpora or tools) to make use of. The aim of this paper is to investigate first steps towards (semi-)automatically annotating the transcribed speech data. It goes without saying that the different types of variation present in our data significantly complicate the task of developing NLP resources.

3 Lexicon Induction

We propose to build a morphosyntactic dictionary for Rusyn, using existing resources from etymologically related languages. The idea is that if we know that a Rusyn word X corresponds to the Ukrainian word Y, and that Y is linked to the morphosyntactic descriptions M1, M2, Mn, we can create an entry in the Rusyn dictionary consisting of X and M1, M2, Mn. The proposed approach is inspired by earlier work by Mann and Yarowsky (2001), who aim to detect cognate word pairs in order to induce a translation lexicon. They evaluate different measures of phonetic or graphemic distance on this task. While they show that distance measures adapted to the language pair by machine learning work best, we are not able to use them as we do not have the required bilingual training corpus at our disposal. Scherrer and Sagot (2014) use such distance measures as a first step of a pipeline for transferring morphosyntactic annotations from a resourced language (RL) towards an etymologically related non-resourced language (NRL).

Due to the high degree of variation and the heterogeneity of the Rusyn data (our NRL), we resolved to use resources from several neighboring RLs, namely from the East Slavic languages Ukrainian and Russian as well as from the West Slavic languages Polish and Slovak.³ This makes sense, because the old Rusyn dialect continuum features both West Slavic and East Slavic linguistic traits, with more West Slavic features in the westernmost dialects and more East Slavic ones in the easternmost dialects. Moreover, the respective umbrella languages – Ukrainian, Slovak, and Polish – exert considerable influence on the Rusyn vernacular. In fact, the overwhelming majority of Rusyn speakers are bilingual.

³ As a matter of fact, Russian is not a neighboring language of Rusyn, but since for historical reasons there are numerous Russian borrowings in Rusyn and since NLP resources for Russian are developed quite well, we also include Russian.

3.1 Data

Our RL data consist of morphosyntactic dictionaries (i.e., files associating word tokens with their lemmas and tags) for Ukrainian, Slovak, Polish, and Russian. All of them were taken from the MULTEXT-East repository (Erjavec et al., 2010a; Erjavec et al., 2010b; Erjavec, 2012). As Rusyn is written in Cyrillic script, we converted the Slovak and Polish dictionaries into Cyrillic script first. During the conversion process, we made the tokens more similar to Rusyn by applying certain linguistic transformations (e.g., denasalization in the Polish case) and thus excluded some output tokens that could not possibly match any Rusyn tokens for obvious linguistic reasons.

As mentioned above, the standard language closest to the Rusyn varieties is Ukrainian. Several Ukrainian NLP resources exist, e.g., the Ukrainian National Corpus.⁴ However, these resources cannot easily be used to train taggers or parsers. UGtag (Kotsyba et al., 2011) is a tagger specifically developed for Ukrainian; it is essentially a morphological dictionary with a simple disambiguation component. Its underlying dictionary is rather large and can be easily converted to text format, making it a good addition to the small MULTEXT-East Ukrainian dictionary. For Russian, we complemented the small MULTEXT-East dictionary with the TnT lexicon file based on data from the Russian National Corpus (Sharoff et al., 2008). We also harmonized the MSD tags (morphosyntactic descriptions) across all languages and data sources. Table 1 sums up the resources used.

⁴ www.mova.info

Language    Source          Entries
Polish      MULTEXT-East    1.9M
Russian     MULTEXT-East    244k
Russian     TnT (RNC)       373k
Ukrainian   MULTEXT-East    300k
Ukrainian   UGtag           4.6M
Slovak      MULTEXT-East    1.9M

Table 1: Sizes of the morphosyntactic dictionaries used for induction.

Our NRL data consist of 10 361 unique tokens extracted from the Corpus of Spoken Rusyn (which currently contains a total of 75 000 running words). In addition, we were able to obtain a small sample of morphosyntactically annotated Rusyn, amounting to 1 047 tokens; the induction methods are evaluated on this sample.

3.2 Exact Matches

As a baseline, we checked how many Rusyn word forms could be retrieved by exact match in the four RL lexicons. Despite Rusyn being closely related to the dictionary languages, the results are rather poor: merely 55.47% of all Rusyn tokens were found in at least one RL lexicon (see Table 2, first column).

We further show the relative contributions of the four RLs in Table 2.

[...] propose different types of transformations, as described in the following sub-sections.

3.3 Daitch-Mokotoff Soundex Algorithm

Soundex is a family of phonetic algorithms for indexing words and, in particular, names by their pronunciation and regardless of their spelling (Hall and Dowling, 1980). The principle behind a Soundex algorithm is to group different graphemes into a small set of sound classes, where all vowels except the first of a word are discarded. The Daitch-Mokotoff Soundex is a variant of the original (English) Soundex that is adapted to Eastern European names (Mokotoff, 1997).

Matching soundex-transformed RL words with soundex-transformed NRL words allowed us to obtain a coverage of 97.16% (i.e., almost all NRL words were matched), but in fact, each matched