REDUNDANT FEATURES ARE LESS LIKELY TO SURVIVE: EMPIRICAL EVIDENCE FROM THE SLAVIC LANGUAGES

ALEKSANDRS BERDICEVSKIS, HANNE ECKHOFF

Department of Language and Linguistics, UiT The Arctic University of Norway
Tromsø, Norway
[email protected], [email protected]

We test whether the functionality (non-redundancy) of morphological features can serve as a predictor of the survivability of those features in the course of language change. We apply to the Slavic language group a recently proposed method of measuring the functionality of a feature by estimating its importance for the performance of an automatic parser. We find that the functionality of a Common Slavic grammeme, together with the functionality of its category, is a significant predictor of its survivability in modern Slavic languages. The least functional grammemes within the most functional categories are most likely to die out.

1. Introduction

Many explanations of language evolution and change involve (either explicitly or implicitly) the concept of redundancy, especially morphological redundancy. The assumption that redundant features are more likely to disappear has played an important role in historical linguistics for decades (see Kiparsky, 1982: 88–99 for an example; Lloyd, 1987: 33–35 for a brief overview). More recently, several influential theories have emerged (Sampson et al., 2009; Lupyan and Dale, 2010; Trudgill, 2011) that refine this assumption, claiming that it does not apply in equal measure to all languages. It is hypothesized that languages under certain sociocultural conditions (such as large population size or a large share of adult learners) will tend to shed excessive (i.e. redundant) complexity. A serious problem with the notion of redundancy, however, is that it is difficult to operationalize and measure quantitatively, which means that theories such as those cited above must to some extent rest on assumptions or indirect qualitative estimates.
In this paper, we improve on a method of measuring morphological redundancy proposed by Berdicevskis (2015). The key idea behind the method is that the identification of syntactic structure by an automatic parser can be taken as a model of how human beings understand meaning (i.e. identify semantic structure). While the model is not necessarily ecologically valid (parsers and humans process information differently), it is externally valid: given the same input (text to process) as humans, parsers can approximate the output (correct structure) very well. The main benefit of the model is that it makes it possible to run experiments, manipulating the input. If we, for instance, artificially distort the input by removing the information about a given morphological feature, and then compare the performance of the parser before and after removal, we can estimate how important the feature is for the identification of the underlying structure, how necessary it is for the understanding of the meaning and hence how functional (non-redundant) it is. Importantly for the study of language change and evolution, this ablation technique can be applied both to extant and extinct languages, as long as a decent treebank exists.

We present a case study where we apply the method to the Slavic language group. We estimate the functionality of morphological categories and grammemes in Common Slavic and test how well this information predicts the survival and death of those features in modern Slavic languages.

2. Materials and methods

2.1. The Slavic group

The Slavic language group is divided into three branches: South, West and East. All extant languages have rich inflectional morphology, mostly inherited from Common Slavic. In this section, we describe how Common Slavic grammemes survive across Slavic languages (Table 1). In the following two sections we describe how we measure the functionality of these grammemes.
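The distortion step described above can be illustrated with a minimal sketch. It assumes pipe-separated PROIEL-style feature strings such as NUMBs|GENDm|CASEn in the FEATS column of a CONLL file, with each feature prefixed by its category tag; the helper names are ours and not part of any existing toolkit:

```python
# Minimal sketch of the ablation step: remove all information about one
# morphological category (e.g. CASE) from every token's feature string
# before retraining the parser. Helper names are illustrative only.

def ablate_category(feats: str, category: str) -> str:
    """Drop all features of the given category from a pipe-separated
    feature string such as 'NUMBs|GENDm|CASEn'."""
    if feats in ("", "_"):
        return feats
    kept = [f for f in feats.split("|") if not f.startswith(category)]
    return "|".join(kept) if kept else "_"

def ablate_conll_line(line: str, category: str) -> str:
    """Apply the ablation to the FEATS column (index 5) of a
    tab-separated CONLL token line; pass other lines through."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6:
        return line.rstrip("\n")
    cols[5] = ablate_category(cols[5], category)
    return "\t".join(cols)
```

For example, ablate_category("NUMBs|GENDm|CASEn", "CASE") yields "NUMBs|GENDm"; running the line-level helper over a whole treebank file produces the distorted input on which the parser is retrained.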
The earliest Slavic texts were written in Old Church Slavonic (OCS), a literary language based on a South Slavic dialect of Late Common Slavic. We use OCS as a proxy for Common Slavic, as is often done in historical linguistics. We exclude the following from the analysis: mood, finiteness, voice, degree of comparison, adjective long/short form, synthetic future tense (which exists only for the verb 'be'), non-indicative and non-finite verbal forms.a The tense grammeme coded as res in Table 1 stands for the Common Slavic perfect, pluperfect and conditional that consisted of an auxiliary verb and a so-called resultative participle. We do not take into account any morphological innovations. Decisions represented in Table 1 largely follow Comrie & Corbett (1993).

a The reasons for exclusion range from theoretical (there is no unified view on the structure of some categories, e.g. finiteness) to methodological (our experiments in their current form do not work with binary categories, e.g. adjective form, or categories that are too heterogeneous, e.g. mood).

Table 1.
Common Slavic grammemes across modern Slavic languages

                                                  South branch      West branch               East branch
Category  Grammeme  CF×10⁻³  GF×10⁻³  freq×10³    bul mkd hbs slv   ces slk hsb dsb pol csb   rus bel ukr
Case      nom       36       6.3      20.5         1   1   1   1     1   1   1   1   1   1     1   1   1
Case      acc       36       5.2      16.1         0   0   1   1     1   1   1   1   1   1     1   1   1
Case      dat       36       4.8       8.4         0   0   1   1     1   1   1   1   1   1     1   1   1
Case      gen       36       4.2      11.1         0   0   1   1     1   1   1   1   1   1     1   1   1
Case      ins       36       2.8       3.1         0   0   1   1     1   1   1   1   1   1     1   1   1
Gend      m          4       2.5      35.4         1   1   1   1     1   1   1   1   1   1     1   1   1
Pers      1          3       2.0       4.8         1   1   1   1     1   1   1   1   1   1     1   1   1
Pers      2          3       2.0       6.1         1   1   1   1     1   1   1   1   1   1     1   1   1
Pers      3          3       2.0      22.4         1   1   1   1     1   1   1   1   1   1     1   1   1
Gend      n          4       2.0      10.3         1   1   1   1     1   1   1   1   1   1     1   1   1
Numb      pl         4       2.0      20.2         1   1   1   1     1   1   1   1   1   1     1   1   1
Numb      sg         4       2.0      60.7         1   1   1   1     1   1   1   1   1   1     1   1   1
Tens      res        8       2.0       0.4         1   1   1   1     1   1   1   1   1   1     1   1   1
Case      loc       36       1.7       3.3         0   0   0   1     1   1   1   1   1   1     1   1   1
Case      voc       36       1.7       0.9         1   1   1   0     1   0   1   0   1   1     0   0   1
Gend      f          4       1.5      10.6         1   1   1   1     1   1   1   1   1   1     1   1   1
Tens      pres       8       1.3      15.6         1   1   1   1     1   1   1   1   1   1     1   1   1
Numb      du         4       1.0       2.1         0   0   0   1     0   0   1   1   0   0     0   0   0
Tens      aor        8       0.7       7.4         1   1   1   0     0   0   1   0   0   0     0   0   0
Tens      impf       8       0.7       2.1         1   1   0   0     0   0   1   0   0   0     0   0   0

CF = category functionality, GF = grammeme functionality (see section 2.3), freq = absolute frequency (in OCS). The table is sorted first by GF (descending), then by CF (ascending), i.e. in the approximate descending order of survivability (see section 3). Languages are denoted by their ISO 639-3 codes. 0 means that a grammeme is (almost) extinct, 1 means that it is extant.

2.2. Treebank and parser

We extracted OCS data from the Tromsø Old Russian and OCS Treebank (TOROT),b using the two largest documents, the Codex Marianus and the Codex Suprasliensis, both dated to the beginning of the 11th century. The joint TOROT file contains 13308 manually annotated (and double-checked) sentences. The TOROT is a dependency treebank with morphological and syntactic annotation according to the PROIEL scheme (Haug et al., 2009).

b https://nestor.uit.no/
For our experiments, we converted the files to the CONLL format (Table 2). For the parsing experiments we used MaltParser (Nivre et al., 2007), version 1.8.1.c The parser was optimized using MaltOptimizer (Ballesteros and Nivre, 2012), version 1.0.3.d Optimization was performed on the original text, before any changes (see section 2.3).

Table 2. Example OCS sentence ('He said to them', from Matthew 12:11) in the PROIEL scheme and CONLL format.

ID  Form  Lemma  CPOS  FPOS  Features                        Head  DREL  Gloss
1   onʺ   onʺ    P     Pd    NUMBs|GENDm|CASEn               3     sub   he
2   že    že     D     Df    INFLn                           3     aux   but
3   reče  reŝi   V     V-    PERS3|NUMBs|TENSa|MOODi|VOICa   0     pred  say
4   imʺ   i      P     Pp    PERS3|NUMBp|GENDm|CASEd         3     obl   them

CPOS/FPOS = coarse/fine-grained part-of-speech tag; DREL = dependency relation. OCS words are transliterated using the ISO 9 system.

Contrary to standard practice in computer science, we do not create separate training and test sets, and thus perform all operations, including optimization, on the whole dataset. The reason for this solution is that our goal is not to evaluate how accurately a given parser can analyze a given text, but how its performance is affected by certain changes in the annotation of the input data. As regards absolute measures of performance, we want them to be as high as possible, in order to approximate human performance and thus increase the validity of the model. Training and parsing on the same set allows us to reach a LAS (labelled attachment score) of 0.938,e while parsing of unfamiliar test sets usually results in a LAS in the high seventies at best.
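The labelled attachment score used above can be computed with a short sketch. The function name and data layout are ours (not MaltParser's evaluation tooling); we assume one (head, relation) pair per token, for both the gold and the predicted analysis:

```python
# Minimal sketch of the labelled attachment score (LAS): the share of
# tokens whose predicted head AND dependency relation both match the
# gold annotation. Illustrative code, not MaltParser's evaluator.

def las(gold, predicted):
    """gold, predicted: lists of (head, deprel) pairs, one per token."""
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# The four-token example sentence from Table 2, with one attachment error:
gold = [(3, "sub"), (3, "aux"), (0, "pred"), (3, "obl")]
pred = [(3, "sub"), (3, "aux"), (0, "pred"), (3, "adv")]
print(las(gold, pred))  # 0.75
```

The unlabelled attachment score (UAS) would compare only the head indices; LAS is stricter, which is why it is the measure reported here.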