Towards a computational model of grammaticalization and lexical diversity Christian Bentz Paula Buttery University of Cambridge, DTAL University of Cambridge, DTAL 9 West Road, CB3 9DA 9 West Road, CB3 9DA
[email protected] [email protected] Abstract To model these processes Section 2 will Languages use different lexical inven- present a quantitative measure of lexical diver- tories to encode information, ranging sity based on Zipf-Mandelbrots law, which is from small sets of simplex words to also used as a biodiversity index (Jost, 2006). large sets of morphologically complex Based on this measure we present a prelimi- words. Grammaticalization theories nary computational model to reconstruct the argue that this variation arises as gradual change from lexically constrained to the outcome of diachronic processes lexically rich languages in Section 3. We whereby co-occurring words merge therefore use a simple grammaticalization al- to one word and build up complex gorithm and show how historical developments morphology. To model these pro- towards higher lexical diversity match the vari- cesses we present a) a quantitative ation in lexical diversity of natural languages measure of lexical diversity and b) a today. This suggests that synchronic variation preliminary computational model of in lexical diversity can be explained as the out- changes in lexical diversity over several come of diachronic language change. generations of merging higly frequent The computational model we present will collocates. therefore help to a) understand the diver- sity of lexical encoding strategies across lan- guages better, and b) to further uncover the 1 Introduction diachronic processes leading up to these syn- All languages share the property of being car- chronic differences.