Can LSTM Learn to Capture Agreement? The Case of Basque
Shauli Ravfogel (1), Francis M. Tyers (2,3) and Yoav Goldberg (1,4)
(1) Computer Science Department, Bar Ilan University
(2) School of Linguistics, Higher School of Economics
(3) Department of Linguistics, Indiana University
(4) Allen Institute for Artificial Intelligence

Abstract

Sequential neural network models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical of human language, and what kinds of grammatical phenomena can they acquire? We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks – verb number prediction and suffix recovery – we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest that the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as a challenging benchmark for models that attempt to learn regularities in human language.

1 Introduction

In recent years, recurrent neural network (RNN) models have emerged as a powerful architecture for a variety of NLP tasks (Goldberg, 2017). In particular, gated variants such as Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Cho et al., 2014; Chung et al., 2014) achieve state-of-the-art results in tasks such as language modeling, parsing, and machine translation. RNNs have been shown to capture long-term dependencies and statistical regularities in input sequences (Karpathy et al., 2015; Linzen et al., 2016; Shi et al., 2016; Jurafsky et al., 2018; Gulordava et al., 2018).

An adequate evaluation of the ability of RNNs to capture syntactic structure requires the use of established benchmarks. A common approach is to use an annotated corpus to learn an explicit syntax-oriented task, such as parsing or shallow parsing (Dyer et al., 2015; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016). While such an approach does evaluate the ability of the model to learn syntax, it has several drawbacks. First, the annotation process relies on human experts and is thus demanding in terms of resources. Second, by its very nature, training a model on such a corpus evaluates it against a human-dictated notion of grammatical structure, and is tightly coupled to a particular linguistic theory. Lastly, supervised training on such a corpus provides the network with explicit grammatical labels (e.g. a parse tree). While this is sometimes desirable, in some instances we would like to evaluate the ability of the model to implicitly acquire hierarchical representations.

Alternatively, one can train a language model (LM) (Graves, 2013; Józefowicz et al., 2016; Melis et al., 2017; Yogatama et al., 2018) to model the probability distribution of a language, and use common quality measures such as perplexity as an indication of the model's ability to capture regularities in language. While this approach does not suffer from the drawbacks discussed above, it conflates syntactic capacity with other factors such as world knowledge and the frequency of lexical items. Furthermore, the LM task does not provide one clear answer: one cannot be "right" or "wrong" in language modeling, only softly worse or better than other systems.
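As a brief worked illustration of the perplexity measure mentioned above (not part of the original paper), the sketch below computes perplexity from per-token log-probabilities; the probability values are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-probability
    that a language model assigns to each token in a text."""
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities for a short sentence.
log_probs = [math.log(p) for p in [0.20, 0.05, 0.10, 0.40]]
print(perplexity(log_probs))  # lower perplexity = the model is less "surprised"
```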
A different approach is to test the model on a grammatical task that does not require extensive grammatical annotation, but is still indicative of syntactic comprehension. Specifically, previous works (Linzen et al., 2016; Bernardy and Lappin, 2017; Gulordava et al., 2018) used the task of predicting agreement, which requires detecting hierarchical relations between sentence constituents. Labeled data for such a task requires only the collection of sentences that exhibit agreement from an unannotated corpus. However, those works have focused on a relatively small set of languages: several Indo-European languages and a Semitic language (Hebrew). As we show, drawing conclusions about a model's abilities from a relatively small subset of languages can be misleading.

In this work, we test agreement prediction in a substantially different language, Basque, which has ergative–absolutive alignment, rich morphology, relatively free word order, and polypersonal agreement (see Section 3). We propose two tasks, verb-number prediction (Section 6) and suffix prediction (Section 7), and show that agreement prediction in Basque is indeed harder for RNNs. We thus propose Basque agreement as a challenging benchmark for the ability of models to capture regularities in human language.

2 Background and Previous Work

To shed light on the question of hierarchical structure learning, previous work on English (Linzen et al., 2016) focused on subject-verb agreement: the form of third-person present-tense verbs in English depends on the number of their subject ("They walk" vs. "She walks"). Agreement prediction is an interesting case study for implicit learning of the tree structure of the input, as once the arguments of each present-tense verb in the sentence are found and their grammatical relation to the verb is established, predicting the verb form is straightforward.

Linzen et al. (2016) tested different variants of the agreement prediction task: categorical prediction of the verb form based on the left context; grammatical assessment of the validity of the agreement present in a given sentence; and language modeling. Since in many cases the verb form can be predicted from the number of the preceding noun, they focused on agreement attractors: sentences in which the nouns preceding the verb have the opposite number from the grammatical subject. Their model achieved very good overall performance on the first two tasks, number prediction and grammatical judgment, while in the third task, language modeling, weak supervision did not suffice to learn structural dependencies. With regard to agreement attractors, they showed that performance decays with the number of attractors, to the point of worse-than-random accuracy in the presence of four attractors; this suggests that the network relies, at least to a certain degree, on local cues. Bernardy and Lappin (2017) evaluated agreement prediction on a larger dataset, and argued that a large vocabulary aids the learning of structural patterns. Gulordava et al. (2018) focused on the ability of LMs to capture agreement as a marker of syntactic ability, and used nonsensical sentences to control for semantic cues. They showed positive results in four languages, as well as some similarities between their models' performance and human judgments of grammaticality.
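To make the left-context prediction setup concrete, the following is a minimal sketch (ours, not the implementation used in the works cited above) of a verb-number classifier: an LSTM reads the words preceding the verb and a linear layer scores singular vs. plural. The vocabulary size, dimensions, and toy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VerbNumberPredictor(nn.Module):
    """Predict the number of an upcoming verb (0 = singular, 1 = plural)
    from the sequence of word ids preceding it."""
    def __init__(self, vocab_size, emb_dim=50, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)

    def forward(self, left_context):      # (batch, seq_len) word ids
        emb = self.embed(left_context)    # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(emb)      # final hidden state: (1, batch, hidden_dim)
        return self.out(h_n.squeeze(0))   # (batch, 2) class scores

# Toy usage with a hypothetical 1000-word vocabulary:
# four left contexts of seven tokens each, with gold singular/plural labels.
model = VerbNumberPredictor(vocab_size=1000)
contexts = torch.randint(0, 1000, (4, 7))
labels = torch.tensor([0, 1, 1, 0])
loss = nn.CrossEntropyLoss()(model(contexts), labels)
loss.backward()
```

Under such a setup, the effect of agreement attractors can be examined by stratifying test sentences according to the number of intervening nouns whose number differs from that of the subject.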
3 Properties of the Basque Language

Basque agreement patterns are ostensibly more complex and very different from those of English. In particular, nouns inflect for case, and the verb agrees with all of its core arguments. How well can an RNN learn such agreement patterns?

We first outline the key properties of Basque relevant to this work. We have used the following two grammars written in English for reference (Laka, 1996; de Rijk, 2007).

Morphological marking of case and number on NPs The grammatical role of a noun phrase is explicitly marked by a nuclear case suffix that attaches after the determiner, which is typically the last element in the phrase. The nuclear cases are the ergative (ERG), the absolutive (ABS) and the dative (DAT) [1]. In addition to case, the same suffixes also encode number (singular or plural), as shown in Table 1.

[1] Additional cases encode different aspects of the role of the noun phrase in the sentence. For example, local cases indicate aspects such as destination and place of occurrence, possessive/genitive cases indicate possession, causal cases indicate causation, and so on. In this work we focus only on the three cases mentioned above.

Case       | Function | Sg   | Pl  | No det
Absolutive | S, O     | -a   | -ak | -
Ergative   | A        | -ak  | -ek | -(e)k
Dative     | IO       | -ari | -ei | -(r)i

Table 1: Basque cases and their corresponding determined nuclear case suffixes. Note the case syncretism, resulting in structural ambiguity between the plural absolutive and the ergative singular. Under Function, S refers to the single argument of a prototypical intransitive verb, O refers to the most patient-like argument of a prototypical transitive verb, and A refers to the most agent-like argument of a prototypical transitive verb.

Ergative-absolutive case system Unlike English and most other Indo-European languages, which have nominative–accusative morphosyntactic alignment (the single argument of an intransitive verb and the agent of a transitive verb behave alike, as "subjects", and differently from the object of a transitive verb), Basque has ergative–absolutive alignment. This means that the "subject" of an intransitive verb and the "object" of a transitive verb behave similarly to each other and receive the absolutive case, while the "subject" of a transitive verb receives the ergative case. To illustrate the difference, while in English we say "she sleeps" and "she sees them" (treating she the same in both sentences), in an ergative–absolutive language the "she" of "she sleeps" and the "them" of "she sees them" receive the same (absolutive) marking, while the "she" of "she sees them" is marked with the ergative.

Because of the case syncretism noted in Table 1 (the suffix -ak marks both the absolutive plural and the ergative singular), a sentence such as (4) is structurally ambiguous:

(4) Zuhaitz-ak    pertson-ak     ikusten  ditu
    tree-SG.ERG   person-PL.ABS  seeing   it-is-them
    "The tree sees the people." (under the alternative reading: "The person sees the trees.")

Word-order and Polypersonal Agreement Basque is often said to have SOV word order, although the rules governing word order are rather complex, and word order depends on the focus and topic of the sentence.
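Returning to the case suffixes in Table 1, the following sketch (illustrative only, not from the paper) encodes the determined suffix forms as a lookup table; a naive longest-suffix match is enough to expose the syncretism behind the ambiguity in example (4), since -ak can be analyzed as either absolutive plural or ergative singular.

```python
# Determined nuclear case suffixes from Table 1: suffix -> possible (case, number) analyses.
SUFFIX_ANALYSES = {
    "-a":   [("ABS", "SG")],
    "-ak":  [("ABS", "PL"), ("ERG", "SG")],  # syncretism behind the ambiguity in (4)
    "-ek":  [("ERG", "PL")],
    "-ari": [("DAT", "SG")],
    "-ei":  [("DAT", "PL")],
}

def case_analyses(word):
    """Return the possible case/number analyses of a noun phrase, judged
    naively by its longest matching final suffix (determined forms only)."""
    for suffix, readings in sorted(SUFFIX_ANALYSES.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix.lstrip("-")):
            return readings
    return []

print(case_analyses("zuhaitzak"))   # [('ABS', 'PL'), ('ERG', 'SG')]: two possible readings
print(case_analyses("zuhaitzari"))  # [('DAT', 'SG')]
```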