Diachronic Trends in Word Order Freedom and Dependency Length in Dependency-Annotated Corpora of Latin and Ancient Greek
Kristina Gulordava, Paola Merlo
University of Geneva

Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 121–130, Uppsala, Sweden, August 24–26 2015.

Abstract

One easily observable aspect of language variation is the order of words. In human and machine natural language processing, it is often claimed that parsing free-order languages is more difficult than parsing fixed-order languages. In this study on Latin and Ancient Greek, two well-known and well-documented free-order languages, we propose syntactic correlates of word order freedom. We apply our indicators to a collection of dependency-annotated texts of different time periods. On the one hand, we confirm a trend towards more fixed-order patterns over time. On the other hand, we show that a dependency-based measure of the flexibility of word order is correlated with the parsing performance on these languages.

1 Introduction

Languages vary in myriad ways. One easily observable aspect of variation is the order of words. Not only do languages vary in the linear order of their phrases, they also vary in how fixed and uniform the orders are. We speak of fixed-order languages and free word order languages.

Free word order has been associated in the linguistic literature with other properties, such as richness of morphology. In natural language processing, it is often claimed that parsing freer word order languages is more difficult than, for instance, parsing English, whose word order is quite fixed.

Quantitative measures of word order freedom, and investigations of it on a scale large enough to draw firm conclusions, are however not common (Liu, 2010; Futrell et al., 2015b). To be able to study word order flexibility quantitatively and computationally, we need a syntactic representation that is appropriate for both fixed and flexible word order; we need languages that exhibit genuine optionality of word order, and for which large amounts of text have been carefully annotated in the chosen representation.

In the current choice of hand-annotated treebanks, these requirements are fulfilled by dependency-annotated corpora of Latin and Ancient Greek. These two languages are extensively documented; they are dead languages and are therefore studied in a tradition where careful text editing and curation is a necessity; and they have the added advantage that their genealogical children, the Romance languages and Modern Greek, are also grammatically well studied, so that we can add a diachronic dimension to our observations.

Both Latin and Ancient Greek allow a lot of freedom in the linearisation of sentence elements. In these languages, this also concerns the noun-phrase domain, which is otherwise typically more constrained than the verbal domain in modern European languages.¹

In this study, we propose syntactic correlates of word order freedom both in the noun phrase and at the sentence level: variability in the directionality of the head-modifier relation, adjacency of the head-modifier relation (also called non-projectivity), and degree of minimisation of dependency length.

First, we look at head directionality, that is, post-nominal versus prenominal placement, of adjectives and numerals. While the variation in adjective placement is a widespread and well-studied phenomenon in modern languages, such as the Romance languages, the variation in numeral placement is a rarer phenomenon and is particularly interesting to investigate.

Then, we analyse the discontinuity of noun-phrases. Specifically, we extract the modifiers that are separated from the noun by some elements of a sentence that are not themselves noun dependents. Example (1) illustrates a non-adjacent dependency between the noun maribus and the adjective reliquis, separated by the verb utimur.

(1) (Caes. Gal. 5.1.2)
    ... quam  quibus in reliquis_a utimur_v maribus_n
    ... than  those  in other      we-use   seas
    '... than those (that) we use in (the) other seas'

We apply our two indicators to a collection of dependency-annotated texts of different time periods and show a pattern of diachronic change, demonstrating a trend towards more fixed-order patterns over time.

The different word order properties that we detect at different points in time for the same language allow us to set up a controlled experiment to ask whether greater word-order freedom causes greater parsing difficulty. We show that the dependency formalism provides us with a sentence-level measure of the flexibility of word order, which we define as the distance between the actual dependency length of a sentence and its optimal dependency length (Gildea and Temperley, 2010). We demonstrate that this robust measure of the word order freedom of the languages reflects their parsing complexity.

¹ Regarding the diachronic change in word order freedom, Tily (2010) found that in the change from Old to Middle and Modern English, the verb-headed clause changed considerably in word order and dependency length, from verb-final to verb-initial, while the domain of the noun phrase did not.

2 Materials

Before discussing our measures in detail, we take a look at the resources that are available and that are used in our study.

2.1 Dependency-annotated corpora

The dependency treebanks of Latin and Ancient Greek used in our study come from the PROIEL project (Haug and Jøhndal, 2008). Compared to other treebanks, such as the Perseus treebanks (Bamman and Crane, 2011), previously used in the parsing literature, the PROIEL corpus contains exclusively prose and is therefore more appropriate for a word order variation study than other treebanks, which also contain poetry. Moreover, the PROIEL corpus allows us to analyse different texts and authors independently of each other. This, as we will see, provides us with interesting diachronic data. Table 1 presents the texts included in the corpus with their time periods and their size in sentences and words.

Language       Text                                         Period          #Sentences  #Words
Latin          Caesar, Commentarii belli Gallici            58–49 BC              1154   22408
Latin          Cicero, Epistulae ad Atticum & De officiis   68–43 BC              3830   44370
Latin          Aetheriae, Peregrinatio                      4th century AD         921   17554
Latin          Jerome's Vulgate                             4th century AD        8903   79389
Ancient Greek  Herodotus, Histories                         450–420 BC            5098   75032
Ancient Greek  New Testament                                4th century AD       10627  119371

Table 1: Summary of properties of the treebanks of Latin and Ancient Greek languages, including the historical period and size of each text.

The texts in Latin range from the Classical Latin period (Caesar and Cicero) to the Late Latin of the 4th century AD (Vulgate and Peregrinatio). Jerome's Vulgate is a translation from the Greek New Testament. The two Greek texts are Herodotus (5th century BC) and the New Testament (4th century AD). The sizes of the texts are uneven, but each text contains at least 17000 words and 900 sentences.

2.2 Modifier-noun dependencies in the corpus

We use the dependency and part-of-speech annotations of the PROIEL corpus to extract adjective-noun and numeral-noun dependencies and their properties.

Both Latin and Ancient Greek are annotated using the same guidelines and tagsets. We identify adjectives by their unique (fine and coarse) PoS tag "A-". The PoS annotation of the PROIEL corpora distinguishes between cardinal and ordinal numerals (fine tags "Ma" and "Mo", respectively). Cardinal numerals differ in their structural and functional properties from ordinal numerals; the current analysis includes only cardinals to ensure the homogeneity of this class of modifiers.

For our analysis, we consider only adjectives and numerals which directly modify a noun, that is, their dependency head must be tagged as a noun (fine tags "Nb" and "Ne"). Such dependencies must also have an "atr" dependency label, for attribute.

The overall number of extracted adjective dependencies ranges from 600 (Peregrinatio) to 1700 (Herodotus and New Testament), with an average of 1000 dependencies per text. The overall number of extracted numeral dependencies ranges from 83 (Peregrinatio) to 400 (New Testament and Vulgate), with an average of 220 dependencies per text.

2.3 Measures

Our indicators of word order freedom are based on the relationship between the head and the dependent.

Head-Dependent Directionality. Word order is a relative positional notion. The simplest indicator of word order is therefore the relative order of head and dependent. We say then that a language has free(r) word order if the position of the dependents relative to the head, before or after, is less uniform than for a fixed-order language. In traditional linguistic studies, this is the notion that is most often used. However, it is a measure that is often too coarse to exhibit any clear patterns.

Head-Dependent Adjacency. A more sensitive measure of freedom of word order will take into account adjacency to the head.

Figure 1: The dependency tree of the sentence from Example (1), extracted from the original PROIEL treebank. (Tokens, by position: 1 quam, 2 quibus, 3 in, 4 reliquis_a, 5 utimur_v, 6 maribus_n.)

[...] long to the subtree of maribus (which comprises only reliquis and maribus, in this example). We calculate the proportion of such non-projective adjectives over all adjectives whose head is a noun. In addition, we report the average distance of non-projective adjectives from their head. The same values are also computed and reported for numerals.

3 NP-internal word order variation

We begin our investigation of word order variation by looking at word order in the noun phrase, a controlled setting potentially influenced by fewer factors than sentential word order.

3.1 Head-Dependent Directionality

For each of the texts in our corpus, we computed the percentage of prenominal versus post-nominal
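As a rough illustration of the extraction criteria of Section 2.2 and of the directionality and adjacency indicators of Section 2.3, the following Python sketch applies them to the sentence of Example (1). It is not the authors' code. The Token class and all function names are hypothetical. Only the tags "A-", "Ma", "Nb", "Ne", the label "atr", and the arc from reliquis to maribus come from the text; every other tag, label and head in the toy sentence is an assumption made so that the example runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # 1-based position in the sentence
    form: str
    pos: str    # fine-grained PoS tag
    head: int   # position of the head, 0 for ROOT
    rel: str    # dependency label


NOUN_TAGS = {"Nb", "Ne"}                       # noun tags named in Section 2.2
MODIFIER_TAGS = {"A-": "adjective", "Ma": "cardinal numeral"}


def subtree(tokens, root_idx):
    """Return the positions of root_idx and of all its transitive dependents."""
    members = {root_idx}
    changed = True
    while changed:                             # grow the set until it is stable
        changed = False
        for t in tokens:
            if t.head in members and t.idx not in members:
                members.add(t.idx)
                changed = True
    return members


def modifier_dependencies(tokens):
    """Yield one record per adjective/cardinal attached to a noun with 'atr'."""
    by_idx = {t.idx: t for t in tokens}
    for t in tokens:
        head = by_idx.get(t.head)
        if head is None or head.pos not in NOUN_TAGS:
            continue
        if t.pos not in MODIFIER_TAGS or t.rel != "atr":
            continue
        lo, hi = sorted((t.idx, head.idx))
        noun_subtree = subtree(tokens, head.idx)
        # Intervening tokens that are not dependents of the noun.
        outside = [i for i in range(lo + 1, hi) if i not in noun_subtree]
        yield {
            "modifier": t.form,
            "noun": head.form,
            "kind": MODIFIER_TAGS[t.pos],
            "prenominal": t.idx < head.idx,
            "non_adjacent": bool(outside),
            "distance": abs(t.idx - head.idx),
        }


def proportion(deps, key):
    """Share of dependency records for which the boolean field `key` is true."""
    if not deps:
        raise ValueError("no dependencies extracted")
    return sum(1 for d in deps if d[key]) / len(deps)


# Toy encoding of Example (1). Only reliquis -> maribus ("atr") is from the
# paper; the remaining heads, tags and labels are illustrative assumptions.
SENTENCE = [
    Token(1, "quam", "Df", 0, "root"),
    Token(2, "quibus", "Pr", 5, "obl"),
    Token(3, "in", "R-", 5, "obl"),
    Token(4, "reliquis", "A-", 6, "atr"),
    Token(5, "utimur", "V-", 1, "pred"),
    Token(6, "maribus", "Nb", 3, "obl"),
]

if __name__ == "__main__":
    deps = list(modifier_dependencies(SENTENCE))
    for d in deps:
        print(d)
    print("prenominal share:", proportion(deps, "prenominal"))
    print("non-adjacent share:", proportion(deps, "non_adjacent"))
```

In this toy encoding the verb utimur lies between reliquis and maribus but is outside the subtree of maribus, so the dependency is counted as prenominal and non-adjacent. Aggregating the same records over a whole text would give per-text proportions of the kind the paper reports.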