Towards a Computational Model of Grammaticalization and Lexical Diversity
Christian Bentz
University of Cambridge, DTAL
9 West Road, CB3 9DA
[email protected]

Paula Buttery
University of Cambridge, DTAL
9 West Road, CB3 9DA
[email protected]

Proc. of 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL) @ EACL 2014, pages 38–42, Gothenburg, Sweden, April 26 2014. © 2014 Association for Computational Linguistics

Abstract

Languages use different lexical inventories to encode information, ranging from small sets of simplex words to large sets of morphologically complex words. Grammaticalization theories argue that this variation arises as the outcome of diachronic processes whereby co-occurring words merge into one word and build up complex morphology. To model these processes we present a) a quantitative measure of lexical diversity and b) a preliminary computational model of changes in lexical diversity over several generations of merging highly frequent collocates.

1 Introduction

All languages share the property of being carriers of information. However, they differ vastly in the exact encoding strategies they adopt. For example, German encodes information about number, gender, case, tense, aspect, etc. in a multitude of different articles, pronouns, nouns, adjectives and verbs. This abundant set of word forms contrasts with a smaller set of uninflected words in English.

Crucially, grammaticalization theories (Heine and Kuteva, 2007, 2002; Bybee 2006, 2003; Hopper and Traugott, 2003; Lehmann, 1985) demonstrate that complex morphological marking can derive diachronically from merging originally independent word forms that frequently co-occur. Over several generations of language learning and usage such grammaticalization and entrenchment processes can gradually increase the complexity of word forms and hence the lexical diversity of languages.

To model these processes Section 2 will present a quantitative measure of lexical diversity based on Zipf-Mandelbrot's law, which is also used as a biodiversity index (Jost, 2006). Based on this measure we present a preliminary computational model to reconstruct the gradual change from lexically constrained to lexically rich languages in Section 3. To this end we use a simple grammaticalization algorithm and show how historical developments towards higher lexical diversity match the variation in lexical diversity of natural languages today. This suggests that synchronic variation in lexical diversity can be explained as the outcome of diachronic language change.

The computational model we present will therefore help a) to better understand the diversity of lexical encoding strategies across languages, and b) to further uncover the diachronic processes leading up to these synchronic differences.

2 Zipf's law as a measure of lexical diversity

Zipf-Mandelbrot's law (Mandelbrot, 1953; Zipf, 1949) states that ordering words according to their frequencies in texts will render frequency distributions of a specific shape: in general, few words have high frequencies, followed by a middle ground of medium frequencies and a long tail of low frequency items. However, a series of studies pointed out that there are subtle differences in frequency distributions for different texts and languages (Bentz et al., forthcoming; Ha et al., 2006; Popescu and Altmann, 2008). Namely, languages with complex morphology tend to have longer tails of low frequency words than languages with simplex morphology. The parameters of Zipf-Mandelbrot's law reflect these differences, and can be used as a quantitative measure of lexical diversity.

2.1 Method

We use the definition of ZM's law as captured by equation (1):

$$f(r_i) = \frac{C}{(\beta + r_i)^{\alpha}}, \quad C > 0,\ \alpha > 0,\ \beta > -1,\ i = 1, 2, \ldots, n \tag{1}$$

where f(r_i) is the frequency of the word of the i-th rank (r_i), n is the number of ranks, C is a normalizing factor and α and β are parameters. To illustrate this, we use parallel texts of the Universal Declaration of Human Rights (UDHR) for Fijian, English, German and Hungarian. For frequency distributions of these texts (with tokens delimited by white spaces) we can approximate the best fitting parameters of the ZM law by means of maximum likelihood estimation (Izsák, 2006; Murphy, 2013). In double logarithmic space (see Figure 1) the normalizing factor C would shift the line of best fit upwards or downwards, α determines the steepness of its (negative) slope and β is Mandelbrot's (1953) corrective for the fact that the line of best fit will deviate from a straight line for higher frequencies (upper left corner in Figure 1).

Figure 1: Zipf frequency distributions for four natural languages (Fijian, English, German, Hungarian). Plots are in log-log space; values 0.15, 0.1 and 0.05 were added to Fijian, English and German log-frequencies to avoid overplotting. Values for the Zipf-Mandelbrot parameters are given in the legend. The straight black line is the line of best fit for Fijian.

As can be seen in Figure 1, Fijian has higher frequencies towards the lowest ranks (upper left corner) but the shortest tail of words with frequency one (horizontal bars in the lower right corner). For Hungarian the pattern runs the other way round: it has the lowest frequencies towards the low ranks and a long tail of words with frequency one. German and English lie between these. These patterns are reflected in ZM parameter values. Namely, Fijian has the highest parameters, followed by English, German and Hungarian. As a trend there is a negative relationship between ZM parameters and lexical diversity: low lexical diversity is associated with high parameters, high diversity is associated with low parameters. Cross-linguistically this effect can be used to measure lexical diversity by means of approximating the parameters of ZM's law for parallel texts.

In the following, we will present a computational model to elicit the diachronic pathways of grammaticalization through which a low lexical diversity language like Fijian might develop towards a high diversity language like Hungarian.

3 Modelling changes in lexical diversity

Grammaticalization theorists have long claimed that synchronic variation in word complexity and lexical diversity might be the outcome of diachronic processes. Namely, the grammaticalization cline from content item > grammatical word > clitic > inflectional affix is seen as a ubiquitous process in language change (Hopper and Traugott, 2003: 7). In the final stage frequently co-occurring words merge by means of phonological fusion (Bybee, 2003: 617) and hence 'morphologize' to build inflections and derivations.

Typical examples of a full cline of grammaticalization are the Old English noun līc 'body' becoming the derivational suffix -ly, the inflectional future in Romance languages such as Italian canterò 'I will sing' derived from Latin cantare habeo 'I have to sing', or Hungarian inflectional elative and inessive case markers derived from a noun originally meaning 'interior' (Heine and Kuteva, 2007: 66). These processes can cause languages to distinguish between a panoply of different word forms. For example, Hungarian displays up to 20 different noun forms where English would use a single form (e.g. ship corresponding to Hungarian hajó 'ship', hajóban 'in the ship', hajóba 'into the ship', etc.).

As a consequence, once the full grammaticalization cline is completed this will increase the lexical diversity of a language. Note, however, that borrowings (loanwords) and neologisms can also increase lexical diversity. Hence, a model of changes in lexical diversity will have to take both grammaticalization and new vocabulary into account.

3.1 The model

Text: We use the Fijian UDHR as our starting point for two reasons: a) Fijian is a language that is well known to be largely lacking complex morphology, b) the UDHR is a parallel text and hence allows us to compare different languages by controlling for constant information content. Fijian has relatively low lexical diversity and high ZM parameter values (see Figure 1). The question is whether we can simulate a simple merging process over several generations that will transform the frequency distribution of the original Fijian text to fit the frequency distribution of the morphologically and lexically rich Hungarian text. To answer this question, we simulate the outcome of grammaticalization on the frequency distributions in the following steps:

Simulation: Our program takes a given text of generation i, calculates a frequency distribution for this generation, changes the text along various operations given below, and gives the frequency distribution of the text for a new generation i + 1 as output.

• p_v: Percentage of words replaced by new words. Choose p_v of words randomly and replace all instances of these words by inverting the letters. This simulates neologisms and loanwords replacing deprecated words.

• r_R: Range of ranks to be included in p_v replacements. If set to 0, vocabulary from anywhere in the distribution will be randomly replaced.

• n_G: Number of generations to simulate.

This simulation essentially allows us to vary the degree of grammaticalization by means of varying p_m, and also to control for the fact that frequency distributions might change due to loanword borrowing and introduction of new vocabulary (p_v). Additionally, r_R allows us to vary the range of ranks where new words might replace deprecated ones. For the frequency distribution calculated for each generation we approximate ZM parameters by maximum likelihood estimation and thereby document the change in its shape.

Results: Figure 2 illustrates a simulation of how the low lexical diversity language Fijian approaches quantitative lexical properties similar to the Hungarian text just by means of merging high-frequency collocates. While the frequency distribution of Fijian in generation 0 still reflects the original ZM values, the ZM parameter values after 6 generations of grammaticalization have become much closer to the values of the Hungarian UDHR:

Fij (n_G = 0): α = 1.21, β = 2.1, C = 812
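As an illustration of the parameter estimation mentioned in Section 2.1, the following is a minimal sketch of how α and β of equation (1) might be approximated by maximum likelihood for a whitespace-tokenized text. It is not the estimation procedure used in the paper: the function names are hypothetical, the ZM law is treated as a normalized probability distribution over ranks (so C is absorbed into the normalization), and a coarse grid search stands in for a proper numerical optimizer.

```python
import math
from collections import Counter


def rank_frequencies(text):
    """Whitespace-tokenize a text and return frequencies ordered by rank (1 = most frequent)."""
    return sorted(Counter(text.split()).values(), reverse=True)


def zm_log_likelihood(freqs, alpha, beta):
    """Log-likelihood of ranked frequencies under a normalized Zipf-Mandelbrot law.

    Assumption: P(r) is proportional to (beta + r) ** -alpha for r = 1..n,
    with alpha > 0 and beta > -1 as in equation (1).
    """
    weights = [(beta + r) ** -alpha for r in range(1, len(freqs) + 1)]
    norm = sum(weights)
    return sum(f * math.log(w / norm) for f, w in zip(freqs, weights))


def fit_zm(freqs, alphas, betas):
    """Grid search (a stand-in for a real optimizer) for the best (alpha, beta) pair."""
    candidates = ((a, b) for a in alphas for b in betas)
    return max(candidates, key=lambda ab: zm_log_likelihood(freqs, *ab))
```

A call such as `fit_zm(rank_frequencies(text), alphas, betas)` returns the grid point with the highest likelihood; a finer grid or a gradient-based optimizer would be needed for estimates precise enough to compare languages.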
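The generation loop described under "Simulation" in Section 3.1 can be sketched roughly as follows. This is an assumption-laden illustration rather than the authors' program: the merging criterion (`n_merge`, the number of most frequent adjacent word pairs fused per generation) is a hypothetical stand-in for p_m, whose exact definition is not given in this excerpt; `p_v` is taken as a fraction of word types; and the rank range r_R is not implemented.

```python
import random
from collections import Counter


def merge_collocates(tokens, n_merge):
    """Fuse the n_merge most frequent adjacent word pairs into single tokens (assumed merge rule)."""
    top = {pair for pair, _ in Counter(zip(tokens, tokens[1:])).most_common(n_merge)}
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in top:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def replace_vocabulary(tokens, p_v, rng):
    """Invert the letters of a random share p_v of word types, replacing all their instances."""
    types = sorted(set(tokens))
    chosen = set(rng.sample(types, round(p_v * len(types))))
    return [t[::-1] if t in chosen else t for t in tokens]


def simulate(text, n_generations, n_merge, p_v, seed=0):
    """Return one frequency distribution (Counter) per generation, generation 0 first."""
    rng = random.Random(seed)
    tokens = text.split()
    history = [Counter(tokens)]
    for _ in range(n_generations):
        tokens = replace_vocabulary(merge_collocates(tokens, n_merge), p_v, rng)
        history.append(Counter(tokens))
    return history
```

Each entry of the returned list can then be ranked and passed to a ZM fitting routine to track how the parameters change from one generation to the next.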