A Probabilistic Generative Model of Linguistic Typology
Johannes Bjerva@  Yova Kementchedjhieva@  Ryan CotterellP,H  Isabelle Augenstein@
@Department of Computer Science, University of Copenhagen
PDepartment of Computer Science, Johns Hopkins University
HDepartment of Computer Science and Technology, University of Cambridge
bjerva,yova,[email protected], [email protected]

Abstract

In the principles-and-parameters framework, the structural features of languages depend on parameters that may be toggled on or off, with a single parameter often dictating the status of multiple features. The implied covariance between features inspires our probabilisation of this line of linguistic inquiry: we develop a generative model of language based on exponential-family matrix factorisation. By modelling all languages and features within the same architecture, we show how structural similarities between languages can be exploited to predict typological features with near-perfect accuracy, outperforming several baselines on the task of predicting held-out features. Furthermore, we show that language embeddings pre-trained on monolingual text allow for generalisation to unobserved languages. This finding has clear practical and theoretical implications: the results confirm what linguists have hypothesised, i.e. that there are significant correlations between typological features and languages.

Figure 1: Correlations between selected typological parameters. Feature values are classified according to head-directionality (head-initial +, head-final -, no dominant order o). For instance, ++ under Affixation means strongly suffixing.

1 Introduction

Linguistic typologists dissect and analyse languages in terms of their structural properties (Croft, 2002). For instance, consider the phonological property of word-final obstruent devoicing: German devoices word-final obstruents (Zug is pronounced /zuk/), whereas English does not (dog is pronounced /dɒg/). In the tradition of generative linguistics, one line of typological analysis is the principles-and-parameters framework (Chomsky, 1981), which posits the existence of a set of universal parameters, switches as it were, that languages toggle. One arrives at a kind of factorial typology, to borrow terminology from optimality theory (Prince and Smolensky, 2008), through different settings of the parameters. Within the principles-and-parameters research program, then, the goal is to identify the parameters that serve as axes along which languages may vary.

It is not enough, however, to simply write down the set of parameters available to language. Indeed, one of the most interesting facets of typology is that different parameters are correlated. To illustrate this point, the heatmap in Figure 1 shows the correlations between the values of selected parameters taken from a typological knowledge base (KB). Notice how head-final word order, for example, highly correlates with strong suffixation. The primary contribution of this work is a probabilisation of typology inspired by the principles-and-parameters framework. We assume a given set of typological parameters and develop a generative model of a language's parameters, casting the problem as a form of exponential-family matrix factorisation. We observe a binary matrix that encodes the settings of each parameter for each language. For example, the Manchu head-final entry of this matrix would be set to 1, because Manchu is a head-final language. The goal of our model is to explain each entry of the matrix as arising through the dot product of a language embedding and a parameter embedding, passed through a sigmoid.
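To make this scoring function concrete, here is a minimal numpy sketch of the sigmoid-of-dot-product model just described; the embedding dimensionality, variable names, and random values are our own illustrative choices, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 32                                 # embedding dimensionality (illustrative)
rng = np.random.default_rng(0)

lang_emb = rng.normal(size=d)          # language embedding, lambda^l
param_emb = rng.normal(size=d)         # parameter embedding, e_pi_i

# Probability that language l toggles parameter pi_i on:
# p(pi_i^l = 1 | lambda^l) = sigmoid(e_pi_i . lambda^l)
p_on = sigmoid(param_emb @ lang_emb)
print(f"p(parameter on) = {p_on:.3f}")
```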
We test our model on The World Atlas of Language Structures (WALS), the largest available knowledge base of typological parameters at the lexical, phonological, syntactic and semantic level. Our contributions are: (i) We develop a probabilisation of typology inspired by the principles-and-parameters framework. (ii) We introduce the novel task of typological collaborative filtering, where we observe some of a language's parameters, but hold some out; at evaluation time, we predict the held-out parameters using the generative model (sketched below). (iii) We develop a semi-supervised extension, in which we incorporate language embeddings output by a neural language model, thus improving performance with unlabelled data. Indeed, when we partially observe some of the typological parameters of a language, we achieve near-perfect (97%) accuracy on the prediction of held-out parameters. (iv) We perform an extensive qualitative and quantitative analysis of our method.
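As a rough illustration of typological collaborative filtering, the sketch below scores a held-out cell of the parameter matrix with the same sigmoid-of-dot-product model, assuming the embeddings have already been fit on the observed cells; the toy sizes, names, and the 0.5 decision threshold are our assumptions, not the paper's experimental setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_langs, n_params, d = 4, 6, 32        # toy sizes (illustrative)

# Pretend these were learned on the observed cells (blue in Figure 2).
lang_embs = rng.normal(size=(n_langs, d))    # lambda^l for each language
param_embs = rng.normal(size=(n_params, d))  # e_pi_i for each parameter

# Predict a held-out cell (red in Figure 2): language 2, parameter 5.
p = sigmoid(lang_embs[2] @ param_embs[5])
prediction = int(p > 0.5)              # threshold into a binary parameter setting
print(f"p = {p:.3f} -> predicted setting {prediction}")
```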
2 Typology in Generative Grammar

What we will present in this paper is a generative model that corresponds to a generative tradition of research in linguistic typology. We first outline the technical linguistic background necessary for the model's exposition. Chomsky famously argued that the human brain contains a prior, as it were, over possible linguistic structures, which he termed universal grammar (Chomsky, 1965). The connection between Chomsky's universal grammar and the Bayesian prior is an intuitive one, but the earliest citation we know of for the connection is Eisner (2002, §2). As a theory, universal grammar holds great promise in explaining the typological variation of human language. Cross-linguistic similarities and differences may be explained by the influence universal grammar exerts over language acquisition and change. While universal grammar arose early in the writings of Chomsky, early work in generative grammar focused primarily on English (Harris, 1995). Indeed, Chomsky's Syntactic Structures contains exclusively examples in English (Chomsky, 1957). As the generative grammarians turned their focus to a wider selection of languages, the principles-and-parameters framework for syntactic analysis rose to prominence. Given the tight relationship between the theory of universal grammar and typology, principles and parameters offers a fruitful manner in which to research typological variation.

The principles-and-parameters framework takes a parametric view of linguistic typology. The structure of human language is governed by a series of principles, which are hard constraints on human language. A common example of a principle is the requirement that every sentence has a subject, even if one that is not pronounced; see the discussion of the pro-drop parameter in Carnie (2013). Principles are universally true for all languages. On the other hand, languages are also governed by parameters. Unlike principles, parameters are the parts of linguistic structure that are allowed to vary. It is useful to think of parameters as attributes that can take on a variety of values. As Chomsky (2000) himself writes: "we can think of the initial state of the faculty of language as a fixed network connected to a switch box; the network is constituted of the principles of language, while the switches are the options to be determined by experience. When the switches are set one way, we have Swahili; when they are set another way, we have Japanese. Each possible human language is identified as a particular setting of the switches - a setting of parameters, in technical terminology."

What are possible parameters? Here, in our formalisation of the parameter aspect of the principles-and-parameters framework, we take a catholic view of parameters, encompassing all areas of linguistics, rather than just syntax. For example, as we saw before, consider the switch of devoicing word-final obstruents a parameter. We note that while principles-and-parameters typology has primarily been applied to syntax, there are also interesting applications to non-syntactic domains. For instance, van Oostendorp (2015) applies a parametric approach to metrical stress in phonology; this is in line with our view. In the field of linguistic typology, there is a vibrant line of research which fits into the tradition of viewing typological parameters through the lens of principles and parameters. Indeed, while earlier work due to Chomsky focused on what have come to be called macro-parameters, many linguists now focus on micro-parameters, which are very close to the features found in the WALS dataset that we will be modelling (Baker, 2008; Nicolis and Biberauer, 2008; Biberauer et al., 2009). This justifies our viewing WALS through the lens of principles and parameters, even though the authors of WALS adhere to the functional-typological school.¹

Figure 2: Example of training and evaluation portions of a feature matrix, with word order features (81A: SOV, ..., SVO) and affixation features (26A: Strongly Prefixing, ..., Strongly Suffixing). We train on the examples highlighted in blue, evaluate on the examples highlighted in red, and ignore those which are not covered in WALS, highlighted in green.

     ℓ     SOV  ⋯  SVO | Str. Pref.  ⋯  Str. Suff.
    eng  [  1   0   0  |     0       0      1     ]
    nld  [  0   1   0  |     0       0      1     ]
    deu  [  0   1   0  |     0       0      1     ]
     ⋮
    vie  [  0   0   1  |     0       1      0     ]
    tur  [  0   1   0  |     0       0      1     ]
    mrd  [  1   0   0  |     −       −      −     ]

Notationally, we will represent the parameters as a vector $\pi = [\pi_1, \ldots, \pi_n]$.

We define the prior over language embeddings:

$$p(\lambda^\ell) = \mathcal{N}(\mathbf{0}, \sigma^2 I) \quad (4)$$

where the mean of the Gaussian is $\mathbf{0}$ and its covariance is fixed at $\sigma^2 I$. Now, given a collection of languages $L$, we arrive at the joint distribution

$$\prod_{\ell \in L} p(\pi^\ell, \lambda^\ell) = \prod_{\ell \in L} p(\pi^\ell \mid \lambda^\ell) \cdot p(\lambda^\ell) \quad (5)$$

Note that $p(\lambda^\ell)$ is, spiritually at least, a universal grammar: it is the prior over what sort of languages can exist, albeit encoded as a real vector. In the parlance of principles and parameters, the prior represents the principles.

Then our model parameters are $\Theta = \{e_{\pi_1}, \ldots, e_{\pi_{|\pi|}}, \lambda^1, \ldots, \lambda^{|L|}\}$. Note that for the remainder of the paper, we will never shorten 'model parameters' to simply 'parameters' to avoid ambiguity. We will, however, refer to 'typological parameters' as simply 'parameters.' We can view this model as a form of exponential-family matrix factorisation.
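Putting the pieces together, the log of the joint in equation (5) decomposes into a Bernoulli log-likelihood over the observed matrix cells plus the Gaussian prior of equation (4), which one could then maximise with any gradient-based optimiser. The sketch below is our own minimal formulation under those assumptions; the masking convention mirrors the training (blue) cells of Figure 2, and all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_joint(lang_embs, param_embs, pi, observed, sigma=1.0):
    """Log of the joint (5): Bernoulli log-likelihood over observed entries
    of the parameter matrix, plus the Gaussian prior (4) on the language
    embeddings (up to an additive constant)."""
    logits = lang_embs @ param_embs.T          # score for every (language, parameter) pair
    p = sigmoid(logits)                        # p(pi_i^l = 1 | lambda^l)
    log_lik = pi * np.log(p) + (1 - pi) * np.log(1 - p)
    log_prior = -0.5 * np.sum(lang_embs ** 2) / sigma ** 2   # N(0, sigma^2 I)
    return np.sum(log_lik[observed]) + log_prior

# Toy usage: 3 languages, 4 binary parameters, 8-dimensional embeddings.
rng = np.random.default_rng(2)
lang_embs = rng.normal(size=(3, 8))
param_embs = rng.normal(size=(4, 8))
pi = rng.integers(0, 2, size=(3, 4)).astype(float)   # binary parameter matrix
observed = rng.random((3, 4)) < 0.7                  # mask of observed (training) cells
print(log_joint(lang_embs, param_embs, pi, observed))
```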