Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic
Zurich Open Repository and Archive
University of Zurich
Main Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch

Year: 2019

Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic

Vukovic, Teodora ; Nora, Muheim ; Winistörfer, Olivier-Andreas ; Anastasia, Makarova ; Ivan, Šimko ; Sanja, Bradjan

Abstract: The paper describes three corpora of different varieties of Balkan Slavic (BS) that are currently being developed with the goal of providing data for the analysis of diatopic and diachronic variation in non-standard Balkan Slavic. The corpora include spoken materials from Torlak and Macedonian dialects, as well as manuscripts of pre-standardized Bulgarian. Apart from the texts, tools for PoS annotation and lemmatization are being created for all varieties, as well as syntactic parsing for the Torlak and Bulgarian varieties. The corpora are built using a unified methodology, relying on best practices and state-of-the-art methods from the field. The uniform methodology allows the contrastive analysis of the data from different varieties. The corpora under construction can be considered a crucial contribution to linguistic research on the languages of the Balkans, as they provide the lacking data needed for studies of linguistic variation in Balkan Slavic and enable the comparison of the said varieties with other neighbouring languages.

Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-175260
Conference or Workshop Item
Published Version

The following work is licensed under a Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.

Originally published at:
Vukovic, Teodora; Nora, Muheim; Winistörfer, Olivier-Andreas; Anastasia, Makarova; Ivan, Šimko; Sanja, Bradjan (2019). Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic.
In: The 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2 September 2019 - 4 September 2019, 62-68.

Proceedings of the Student Research Workshop (RANLPStud 2019) associated with RANLP 2019, pages 62–68, Varna, Bulgaria, 2–4 September, 2019.

Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic

Teodora Vuković, Nora Muheim, Olivier Winistörfer, Ivan Šimko, Anastasia Makarova, Sanja Bradjan
Slavic Department, University of Zurich, Plattenstrasse 43, Zurich

Abstract

The paper describes three corpora of different varieties of Balkan Slavic (BS) that are currently being developed with the goal of providing data for the analysis of diatopic and diachronic variation in non-standard Balkan Slavic. The corpora include spoken materials from Torlak and Macedonian dialects, as well as manuscripts of pre-standardized Bulgarian. Apart from the texts, tools for PoS annotation and lemmatization are being created for all varieties, as well as syntactic parsing for the Torlak and Bulgarian varieties. The corpora are built using a unified methodology, relying on best practices and state-of-the-art methods from the field. The uniform methodology allows the contrastive analysis of the data from different varieties. The corpora under construction can be considered a crucial contribution to linguistic research on the languages of the Balkans, as they provide the lacking data needed for studies of linguistic variation in Balkan Slavic and enable the comparison of the said varieties with other neighbouring languages.

1 Introduction

Balkan Slavic (BS) languages are the eastern branch of the South Slavic languages, which are known for their affiliation to the so-called Balkan Sprachbund. The languages that belong to this group are the Bulgarian and Macedonian varieties as well as the Torlak dialects spoken in South East Serbia and West Bulgaria. These languages differ typologically from other Slavic languages. Some of the differentiating features are: complete or almost complete loss of noun case inflection and fully or partially grammaticalized post-positive definite articles (Lindstedt, 2000). Many other differences lie in the nominal and verbal morphosyntax, adjectival morphology, and the tense and mood system, as well as in the lexical and phraseological domain. Nonetheless, the area itself is not linguistically uniform. On the contrary, the diversification starts from the division into official standard languages, further separated by various phonological and structural isoglosses (Ivić, 1985; Institute for Bulgarian Language, 2018; Stojkov, 2002; Vidoeski, 1999). Variation occurs even in very small regions, such as the Timok area in South East Serbia, where substantial differences occur in the use of post-posed articles between villages in the mountains and valleys or urban areas (Vuković and Samardžić, 2018). There is significant variation not only from an areal but also from a diachronic point of view. Older texts written in pre-standardized Bulgarian show noticeable phases of language change over time.

This variation is best analyzed in non-standard varieties, unaffected or minimally affected by the prescriptive standard, and ideally in the form of spoken language, since it arguably represents a more "natural" state. In this paper, we present three ongoing projects on South Slavic dialectology and diachronic linguistics, which are currently being developed at the Slavic linguistics department at the University of Zurich. In these projects, special focus is on the places in time and space where change happens and on the underlying grammatical processes and structures. Apart from linguistic questions, the focus of our research is on the identification of potential geographical and social factors influencing changes in Bulgarian, Torlak and Macedonian.

For the purpose of this research, we are building three corpora (and accompanying processing tools) for modern and historical non-standard BS varieties that reflect diachronic and diatopic variation within the Balkan Slavic languages. The varieties covered are pre-standardized Bulgarian, Macedonian dialects, and Torlak varieties from Serbia and Bulgaria. Apart from the large collection of texts, the resources are equipped with part-of-speech (PoS) and morphosyntactic description (MSD) annotation, while some corpora also include a syntactic treebank and a layer of normalization. In order to make the corpora mutually comparable, we are developing and applying a uniform methodology. Methodology, corpora and tools are based on the existing standards used in the field as well as on resources for the standard languages of the area, wherever available. This enables inter-comparability and reproducibility of the data. At the same time, re-using existing tools makes the process easier for the creators and provides well-established methods, which are discussed below.

2 Related Work

There is currently little data available in digital form that allows the analysis of the mentioned variation and change in BS. Bulgarian is the only language that has a corpus of non-standard spoken varieties in addition to corpora of the standard language (Alexander and Zhobov, 2016; Alexander, 2015). Serbian only has corpora of the written standard and of computer-mediated communication (CMC) (CCSL; Utvić, 2011; Ljubešić and Klubička, 2014; Miličević and Ljubešić, 2016b). The only available searchable resource for Macedonian is an unannotated corpus of movie subtitles (Steinberger et al., 2014; Lison and Tiedemann, 2016). The mentioned corpora of standard language (or movie subtitles) are extremely valuable resources in themselves, but they provide little-to-no insight into variation because they are a sample of only one variety by definition; [...]

[...] in size. Texts were transcribed into Latin and Cyrillic script in order to make the data available to a wider audience. The texts are annotated with grammatical information, lemmas, English glosses for each token, and English translations for each line (see the annotated sentence in example 1, presented as it is in the corpus). The MSD are easily readable, but do not fit the standards used in the field. The database represents an impressive achievement, and a particularly valuable one given that it is the sole resource for dialectal research on BS at the moment.

(1) hm    kvó               se            kázvaše
    disc  what.Sg.N.Interr  Acc.Refl.Clt  say.3Sg.Impf.I
    .     какво             се            казвам
    ‘Hmmmm. What is it called?’

The corpora described in this paper are created with the goal of being comparable with other existing related resources, be they corpora of dialects or of the standard language. The automatic processing tools developed for the language varieties included in the collection are based on existing ones whenever possible. For example, the morphosyntactic tagger and lemmatizer for Torlak is an adaptation of the tagger for standard Serbian (Miličević and Ljubešić, 2016a).

An important tool for Serbian is the ReLDI tagger, which assigns morphosyntactic tags and lemmas to text (Ljubešić, 2016). The tagger was developed for standard Serbian, Croatian, and Slovene using a character-level machine translation method. It assigns tags specified by the MULTEXT-East standards (Krstev et al., 2004). The Python package allows training for any other variety with a novel training set as input (Ljubešić, 2016).

MULTEXT-East resources represent an