Studying the Inductive Biases of RNNs with Synthetic Variations of Natural Languages
Accepted as a long paper in NAACL 2019
arXiv:1903.06400v2 [cs.CL] 26 Mar 2019

Shauli Ravfogel¹  Yoav Goldberg¹,²  Tal Linzen³
¹Computer Science Department, Bar Ilan University
²Allen Institute for Artificial Intelligence
³Department of Cognitive Science, Johns Hopkins University
{shauli.ravfogel, [email protected]}, [email protected]

Abstract

How do typological properties such as word order and morphological case marking affect the ability of neural sequence models to acquire the syntax of a language? Cross-linguistic comparisons of RNNs' syntactic performance (e.g., on subject-verb agreement prediction) are complicated by the fact that any two languages differ in multiple typological properties, as well as by differences in training corpus. We propose a paradigm that addresses these issues: we create synthetic versions of English, which differ from English in one or more typological parameters, and generate corpora for those languages based on a parsed English corpus. We report a series of experiments in which RNNs were trained to predict agreement features for verbs in each of those synthetic languages. Among other findings, (1) performance was higher in subject-verb-object order (as in English) than in subject-object-verb order (as in Japanese), suggesting that RNNs have a recency bias; (2) predicting agreement with both subject and object (polypersonal agreement) improves over predicting each separately, suggesting that underlying syntactic knowledge transfers across the two tasks; and (3) overt morphological case makes agreement prediction significantly easier, regardless of word order.

1 Introduction

The strong performance of recurrent neural networks (RNNs) in applied natural language processing tasks has motivated an array of studies that have investigated their ability to acquire natural language syntax without syntactic annotations; these studies have identified both strengths (Linzen et al., 2016; Giulianelli et al., 2018; Gulordava et al., 2018; Kuncoro et al., 2018; van Schijndel and Linzen, 2018; Wilcox et al., 2018) and limitations (Chowdhury and Zamparelli, 2018; Marvin and Linzen, 2018; Wilcox et al., 2018).

Most of the work so far has focused on English, a language with a specific word order and relatively poor morphology. Do the typological properties of a language affect the ability of RNNs to learn its syntactic regularities? Recent studies suggest that they might. Gulordava et al. (2018) evaluated language models on agreement prediction in English, Russian, Italian and Hebrew, and found worse performance on English than on the other languages. In the other direction, a study on agreement prediction in Basque showed substantially worse average-case performance than reported for English (Ravfogel et al., 2018).

Existing cross-linguistic comparisons are difficult to interpret, however. Models were inevitably trained on a different corpus for each language. The constructions tested can differ across languages (Gulordava et al., 2018). Perhaps most importantly, any two natural languages differ in a number of typological dimensions, such as morphological richness, word order, or explicit case marking. This paper proposes a controlled experimental paradigm for studying the interaction of the inductive bias of a neural architecture with particular typological properties. Given a parsed corpus for a particular natural language (English, in our experiments), we generate corpora for synthetic languages that differ from the original language in one or more typological parameters (Chomsky, 1981), following Wang and Eisner (2016). In a synthetic version of English with a subject-object-verb order, for example, sentence (1a) would be transformed into (1b):

(1) a. The man eats the apples.
    b. The man the apples eats.
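The released generation code (linked in Section 2) performs these transformations on full Universal Dependencies trees, reordering entire subtrees. Purely as an illustration of the idea, here is a minimal sketch that permutes pre-extracted subject, verb, and object spans of a single clause; the function and its inputs are our own toy construction, not the authors' implementation.

```python
# Toy illustration of the word-order transformations (not the paper's code):
# given the subject, verb, and object spans of a clause, rebuild the clause
# in any of the six possible orders of S, V, and O.

ORDERS = {"svo", "sov", "vos", "vso", "osv", "ovs"}

def reorder(subject, verb, obj, order="sov"):
    """Rebuild a clause in the requested order of subject (s), verb (v), object (o)."""
    assert order in ORDERS
    spans = {"s": subject, "v": verb, "o": obj}
    return " ".join(" ".join(spans[slot]) for slot in order)

# Example (1a) -> (1b): "The man eats the apples." becomes "The man the apples eats."
print(reorder(["The", "man"], ["eats"], ["the", "apples"], order="sov"))
```

A real implementation has to decide what moves along with each constituent (determiners, relative clauses, and so on), which is why the paper operates on dependency subtrees rather than on flat strings.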
We then train a model to predict the agreement features of the verb; in the present paper, we focus on predicting the plurality of the subject and the object (that is, whether they are singular or plural). The subject plurality prediction problem for (1b), for example, can be formulated as follows:

(2) The man the apples ⟨singular/plural subject?⟩.

We illustrate the potential of this approach in a series of case studies. We first experiment with polypersonal agreement, in which the verb agrees with both the subject and the object (§3). We then manipulate the order of the subject, the object and the verb (§4), and experiment with overt morphological case (§5). For a preview of our synthetic languages, see Figure 1.

Original                 they say the broker took them out for lunch frequently .
                         (subject; verb; object)

Polypersonal agreement   they saykon the broker tookkarker them out for lunch frequently .
                         (kon: plural subject; kar: singular subject; ker: plural object)

Word order variation
  SVO                    they say the broker took out frequently them for lunch .
  SOV                    they the broker them took out frequently for lunch say .
  VOS                    say took out frequently them the broker for lunch they .
  VSO                    say they took out frequently the broker them for lunch .
  OSV                    them the broker took out frequently for lunch they say .
  OVS                    them took out frequently the broker for lunch say they .

Case systems
  Unambiguous            theykon saykon the brokerkar tookkarker theyker out for lunch frequently .
                         (kon: plural subject; kar: singular subject; ker: plural object)
  Syncretic              theykon saykon the brokerkar tookkarkar theykar out for lunch frequently .
                         (kon: plural subject; kar: plural object/singular subject)
  Argument marking       theyker sayker the brokerkin tookkerkin theyker out for lunch frequently .
                         (ker: plural argument; kin: singular argument)

Figure 1: The sentences generated in our synthetic languages based on an original English sentence. All verbs in the experiments reported in the paper carried subject and object agreement suffixes as in the polypersonal agreement experiment; we omitted these suffixes from the word order variation examples in the table for ease of reading.
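In the suffixed variants of Figure 1, each argument carries an overt suffix encoding its role and plurality, and the verb is marked with the concatenation of its arguments' suffixes. The sketch below is our own illustration of this marking scheme, using the suffix inventory of the unambiguous case system (listed again in Table 2 below); it is not the released generation code.

```python
# Suffixes of the "unambiguous" case system in Figure 1 (our sketch).
SUFFIXES = {
    ("subject", "singular"): "kar",
    ("subject", "plural"):   "kon",
    ("object",  "singular"): "kin",
    ("object",  "plural"):   "ker",
}

def mark_argument(word, role, plurality):
    """Attach an overt case suffix to an argument."""
    return word + SUFFIXES[(role, plurality)]

def mark_verb(verb, arguments):
    """Polypersonal agreement: the verb concatenates its arguments' suffixes."""
    return verb + "".join(SUFFIXES[arg] for arg in arguments)

# Reproduces the marking in Figure 1: "the brokerkar tookkarker theyker ..."
print(mark_argument("broker", "subject", "singular"))                      # brokerkar
print(mark_verb("took", [("subject", "singular"), ("object", "plural")]))  # tookkarker
print(mark_argument("they", "object", "plural"))                           # theyker
```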
2 Setup

Synthetic language generation  We used an expert-annotated corpus, to avoid potential confounds between the typological parameters we manipulated and possible parse errors in an automatically parsed corpus. As our starting point, we took the English Penn Treebank (Marcus et al., 1993), converted to the Universal Dependencies scheme (Nivre et al., 2016) using the Stanford converter (Schuster and Manning, 2016). We then manipulated the tree representations of the sentences in the corpus to generate parametrically modified English corpora, varying in case systems, agreement patterns, and order of core elements. For each parametric version of English, we recorded the verb-argument relations within each sentence, and created a labeled dataset. We exposed our models to sentences from which one of the verbs was omitted, and trained them to predict the plurality of the arguments of the unseen verb. The following paragraph describes the process of collecting verb-argument relations; a detailed discussion of the parametric generation process for agreement marking, word order and case marking is given in the corresponding sections. We have made our synthetic language generation code publicly available.¹

¹ https://github.com/Shaul1321/rnn_typology

Argument collection  We created a labeled agreement prediction dataset by first collecting verb-argument relations from the parsed corpus. We collected nouns, proper nouns, pronouns, adjectives, cardinal numbers and relative pronouns connected to a verb (identified by its part-of-speech tag) with an nsubj, nsubjpass or dobj dependency edge, and recorded the plurality of those arguments. Verbs that were the head of a clausal complement without a subject (xcomp dependencies) were excluded. We recorded the plurality of the dependents of the verb regardless of whether the tense and person of the verb condition agreement in English (that is, not only in third-person present-tense verbs). For relative pronouns that function as subjects or objects, we recorded the plurality of their referent; for instance, in the phrase "Treasury bonds, which pay lower interest rates", we considered the verb pay to have a plural subject.

Prediction task  We experimented both with prediction of one of the arguments of the verb (subject or object), and with a joint setting in which the model predicted both arguments of each verb. [...] embedding and embeddings of the character n-grams that made up the word.² The model (including the embedding layer) was trained end-to-end using the Adam optimizer (Kingma and Ba, 2014). For each of the experiments described in the paper, we trained four models with different random initializations; we report averaged results alongside standard deviations.

Prediction task   Subject accuracy   Object accuracy   Object recall
Subject           94.7 ± 0.3         –                  –
Object            –                  88.9 ± 0.26        81.8 ± 1.4
Joint             95.7 ± 0.23        90.0 ± 0.1         85.4 ± 2.3

Table 1: Results of the polypersonal agreement experiments. "Joint" refers to multitask prediction of subject and object plurality.

                  Singular   Plural
Subject           -kar       -kon
Object            -kin       -ker
Indirect object   -ken       -kre

Table 2: Case suffixes used in the experiments. Verbs are marked by a concatenation of the suffixes of their corresponding arguments.

3 Polypersonal agreement

In languages with polypersonal agreement, verbs agree not only with their subject (as in English), but also with their direct object. Consider the following Basque example:³

(4) Kutxazain-ek     bezeroa-ri        liburu-ak
    cashier-PL.ERG   customer-SG.DAT   book-PL.ABS
    eman dizkiote
    gave they-them-to-her/him
    'The cashiers gave the books to the customer.'

Information about the grammatical role of certain constituents in the sentence may disambiguate the function of others; most trivially, if a word is the subject of a given verb, it cannot simultaneously be its object. The goal of the present experiment is to determine whether jointly predicting both object and subject plurality improves the overall performance of the model.
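This excerpt fixes only some modeling details: each word is represented using its embedding together with embeddings of its character n-grams, and the model is trained end-to-end with Adam over four random initializations. As a rough, hypothetical sketch of what the joint setting could look like, the PyTorch module below shares one sentence encoder between two classification heads, one per argument; the BiLSTM encoder, layer sizes, mean pooling, and the three-way label set (singular, plural, or no such argument) are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class JointAgreementPredictor(nn.Module):
    """Hypothetical sketch of joint subject/object plurality prediction:
    a shared encoder feeding two classification heads. The architecture
    details here are assumptions; the excerpt does not fully specify them."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=150, n_classes=3):
        super().__init__()
        # The paper also uses character n-gram embeddings; for brevity this
        # sketch uses word embeddings only.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        # Assumed label set per head: singular, plural, no such argument.
        self.subject_head = nn.Linear(2 * hidden_dim, n_classes)
        self.object_head = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)      # (batch, seq, emb_dim)
        states, _ = self.encoder(embedded)    # (batch, seq, 2 * hidden_dim)
        pooled = states.mean(dim=1)           # pool over the sequence
        return self.subject_head(pooled), self.object_head(pooled)

# In the joint setting, training would sum the two cross-entropy losses and
# optimize end-to-end with Adam (Kingma and Ba, 2014), e.g.:
#   loss = ce(subj_logits, subj_gold) + ce(obj_logits, obj_gold)
```

Multitask gains of the kind reported in Table 1 would then indicate that the shared encoder learns representations useful for both plurality decisions.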