
SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics

Hitomi Yanaka1,2, Koji Mineshima3, and Kentaro Inui4,2
1The University of Tokyo, 2RIKEN, 3Keio University, 4Tohoku University
[email protected], [email protected], [email protected]

Abstract

Recently, deep neural networks (DNNs) have achieved great success in semantically challenging NLP tasks, yet it remains unclear whether DNN models can capture compositional meanings, those aspects of meaning that have been long studied in formal semantics. To investigate this issue, we propose a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose challenge is to map natural language sentences to multiple forms of scoped meaning representations, designed to account for various semantic phenomena. Using SyGNS, we test whether neural networks can systematically parse sentences involving novel combinations of logical expressions such as quantifiers and negation. Experiments show that Transformer and GRU models can generalize to unseen combinations of quantifiers, negations, and modifiers that are similar to given training instances in form, but not to the others. We also find that the generalization performance to unseen combinations is better when the form of meaning representations is simpler. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS.

Figure 1: Illustration of our evaluation protocol using SyGNS. The goal is to map English sentences to meaning representations. The generalization test evaluates novel combinations of operations (modifier, quantifier, negation) in the training set. We use multiple meaning representations and evaluation methods.

1 Introduction

Deep neural networks (DNNs) have shown impressive performance in various language understanding tasks (Wang et al., 2019a,b, i.a.), including semantically challenging tasks such as Natural Language Inference (NLI; Dagan et al., 2013; Bowman et al., 2015). However, a number of studies to probe DNN models with various NLI datasets (Naik et al., 2018; Dasgupta et al., 2018; Yanaka et al., 2019; Kim et al., 2019; Richardson et al., 2020; Saha et al., 2020; Geiger et al., 2020) have reported that current DNN models have some limitations to generalize to diverse semantic phenomena, and it is still not clear whether DNN models obtain the ability to capture compositional aspects of meaning in natural language.

There are two issues to consider here. First, recent analyses (Talmor and Berant, 2019; Liu et al., 2019; McCoy et al., 2019) have pointed out that the standard paradigm for evaluation, where a test set is drawn from the same distribution as the training set, does not always indicate that the model has obtained the intended generalization ability for language understanding. Second, the NLI task of predicting the relationship between a premise sentence and an associated hypothesis without asking their semantic interpretation tends to be black-box, in that it is often difficult to isolate the reasons why models make incorrect predictions (Bos, 2008).

To address these issues, we propose SyGNS (pronounced as signs), a Systematic Generalization testbed based on Natural language Semantics. The goal is to map English sentences to various meaning representations, so it can be taken as a sequence-to-sequence task.

Figure 1 illustrates our evaluation protocol using SyGNS. To address the first issue above, we probe the generalization capability of DNN models on two out-of-distribution tests: systematicity (Section 3.1) and productivity (Section 3.2), two concepts treated as hallmarks of human cognitive capacities in cognitive sciences (Fodor and Pylyshyn, 1988; Calvo and Symons, 2014). We use a train-test split controlled by each target concept and train models with a minimally sized training set (Basic set) involving primitive patterns composed of semantic phenomena such as quantifiers, modifiers, and negation. If a model learns different properties of each semantic phenomenon from the Basic set, it should be able to parse a sentence with novel combination patterns. Otherwise, a model has to memorize an exponential number of combinations of linguistic expressions.

To address the second issue, we use multiple forms of meaning representations developed in formal semantics (Montague, 1973; Heim and Kratzer, 1998; Jacobson, 2014) and their respective evaluation methods. We use three scoped meaning representation forms, each of which preserves the same semantic information (Section 3.3). In formal semantics, it is generally assumed that scoped meaning representations are standard forms for handling diverse semantic phenomena such as quantification and negation. Scoped meaning representations also enable us to evaluate the compositional generalization ability of the models to capture semantic phenomena in a more fine-grained way. By decomposing an output meaning representation into semantic constituents (e.g., words) in accordance with its structure, we can compute the matching ratio between the output representation and the gold standard representation. Evaluating the models on multiple meaning representation forms also allows us to explore whether the performance depends on the complexity of the representation forms.

This paper provides three main contributions. First, we develop the SyGNS testbed to test model ability to systematically transform sentences involving linguistic phenomena into multiple forms of scoped meaning representations. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS. Second, we use SyGNS to analyze the systematic generalization capacity of two standard DNN models: Gated Recurrent Unit (GRU) and Transformer. Experiments show that these models can generalize to unseen combinations of quantifiers, negations, and modifiers to some extent. However, the generalization ability is limited to the combinations whose forms are similar to those of the training instances. In addition, the models struggle with parsing sentences involving nested clauses. We also show that the extent of generalization depends on the choice of primitive patterns and representation forms.

2 Related Work

The question of whether neural networks obtain the systematic generalization capacity has long been discussed (Fodor and Pylyshyn, 1988; Marcus, 2003; Baroni, 2020). Recently, empirical studies using NLI tasks have revisited this question, showing that current models learn undesired biases (Glockner et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Geva et al., 2019; Liu et al., 2019) and heuristics (McCoy et al., 2019), and fail to consistently learn various inference types (Rozen et al., 2019; Nie et al., 2019; Yanaka et al., 2019; Richardson et al., 2020; Joshi et al., 2020). In particular, previous works (Goodwin et al., 2020; Yanaka et al., 2020; Geiger et al., 2020; Yanaka et al., 2021) have examined whether models learn the systematicity of NLI on monotonicity and veridicality. While this line of work has shown certain limitations of model generalization capacity, it is often difficult to figure out why the NLI model fails and how to improve it, partly because NLI tasks depend on multiple factors, including semantic interpretation of target phenomena and acquisition of background knowledge. By focusing on semantic parsing rather than NLI, one can probe to what extent models systematically interpret the meaning of sentences according to their structures and the meanings of their constituents.

Meanwhile, datasets for analysing the compositional generalization ability of DNN models in semantic parsing have been proposed, including SCAN (Lake and Baroni, 2017; Baroni, 2020), CLUTRR (Sinha et al., 2019), and CFQ (Keysers et al., 2020). For example, the SCAN task is to investigate whether models trained with a set of primitive instructions (e.g., jump → JUMP) and modifiers (e.g., walk twice → WALK WALK) generalize to new combinations of primitives (e.g., jump twice → JUMP JUMP). However, these datasets deal with artificial languages, where the variation of linguistic expressions is limited, so it is not clear to what extent the models systematically interpret various semantic phenomena in natural language, such as quantification and negation.

Regarding the generalization capacity of DNN models in natural language, previous studies have focused on syntactic and morphological generalization capacities such as subject-verb agreement tasks (Linzen et al., 2016; Gulordava et al., 2018; Marvin and Linzen, 2018, i.a.). Perhaps closest to our work is the COGS task (Kim and Linzen, 2020) for probing the generalization capacity of semantic parsing in a synthetic natural language fragment. For instance, the task is to see whether models trained to parse sentences where some lexical items only appear in subject position (e.g., John ate the meat) can generalize to structurally related sentences where these items appear in object position (e.g., The kid liked John). In contrast to this work, our focus is more on semantic parsing of sentences with logical and semantic phenomena that require scoped meaning representations. Our study also improves previous work on the compositional generalization capacity in semantic parsing in that we compare three types of meaning representations and evaluate them at multiple levels, including logical entailment, polarity assignment, and partial clause matching (Section 3.3).

3 Overview of SyGNS

We use two evaluation concepts to assess the systematic capability of models: systematicity (Section 3.1) and productivity (Section 3.2). In evaluating these two concepts, we use synthesized pairs of sentences and their meaning representations to control a train-test split (Section 3.4). The main idea is to analyze models trained with a minimum size of a training set (Basic set) involving primitive patterns composed of various semantic phenomena; if a model systematically learns primitive combination patterns in the Basic set, it should parse a new sentence with different combination patterns. We target three types of scoped meaning representations and use their respective evaluation methods, according to the function and structure of each representation form (Section 3.3).

3.1 Systematicity

Table 1 illustrates how we test systematicity, i.e., the capacity to interpret novel combinations of primitive semantic phenomena. We generate Basic set 1 by combining various quantifiers with sentences without modifiers. We also generate Basic set 2 by setting an arbitrary quantifier (e.g., one) to a primitive quantifier and combining it with various types of modifiers. We then evaluate whether models trained with Basic sets 1 and 2 can parse sentences involving unseen combinations of quantifiers and modifiers. We also test the combination of quantifiers and negation in the same manner; the detail is given in Appendix D.

    Pattern                    Sentence
  Train
    Primitive quantifier       One tiger ran
    Basic 1   EXI              A tiger ran
              NUM              Two tigers ran
              UNI              Every tiger ran
    Basic 2   ADJ              One small tiger ran
              ADV              One tiger ran quickly
              CON              One tiger ran or came
  Test
    EXI+ADJ                    A small tiger ran
    NUM+ADV                    Two tigers ran quickly
    UNI+CON                    Every tiger ran or came

Table 1: Training and test instances for systematicity.

To provide a controlled setup, we use three quantifier types: existential quantifiers (EXI), numerals (NUM), and universal quantifiers (UNI). Each type has two patterns: one and a for EXI, two and three for NUM, and all and every for UNI. We consider three settings where the primitive quantifier is set to the type EXI, NUM, or UNI.

For modifiers, we distinguish three types — adjectives (ADJ), adverbs (ADV), and logical connectives (CON; conjunction and, disjunction or) — and ten patterns for each. Note that each modifier type differs in its position; an adjective appears inside a noun phrase (e.g., one small tiger), while an adverb (e.g., quickly) and a coordinated phrase with a connective (e.g., or came) appear at the end of a sentence. Although Table 1 only shows the pattern generated by the primitive quantifier one and the noun tiger, the noun can be replaced with ten other nouns (e.g., dog, cat, etc.) in each setting. See Appendix A for more details on the fragment of English considered here.

3.2 Productivity

Productivity refers to the capacity to interpret an indefinite number of sentences with recursive operations. To test productivity, we use embedded relative clauses, which interact with quantifiers to generate logically complex sentences. Table 2 shows examples. We provide two Basic sets; Basic set 1 consists of sentences without embedded clauses (NON) and Basic set 2 consists of sentences with a single embedded clause, which we call sentences with depth one. We then test whether models trained with Basic sets 1 and 2 can parse a sentence involving deeper embedded clauses, i.e., sentences whose depth is two or more. As Table 2 shows, we consider both peripheral-embedding (PER) and center-embedding (CEN) clauses.

    Pattern           Sentence
  Train (Basic 1: depth 0, Basic 2: depth 1)
    Basic 1   NON     Two dogs loved Ann
    Basic 2   PER     Bob liked a bear [that chased all polite cats]
              CEN     Two dogs [that all cats kicked] loved Ann
  Test (examples: depth 2)
    PER+PER           Bob liked a bear [that chased all polite cats [that loves Ann]]
    PER+CEN           Two dogs [that a bear [that chased all polite cats] kicked] loved Ann

Table 2: Training and test instances for productivity.
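The splits in Tables 1 and 2 are fully synthetic, so they can be regenerated from small inventories of quantifiers, modifiers, and nouns. The following is a minimal sketch of how a systematicity-style split can be assembled; the inventories are truncated and the function names (sentence, systematicity_split) are illustrative, not those of the released generator.

```python
from itertools import product

# Toy inventories; the actual dataset uses two patterns per quantifier type,
# ten patterns per modifier type, and ten nouns (see Section 3.1).
QUANTIFIERS = {"EXI": ["one", "a"], "NUM": ["two", "three"], "UNI": ["all", "every"]}
MODIFIERS = {"ADJ": ["small", "wild"], "ADV": ["quickly"], "CON": ["or came"]}
NOUNS = ["tiger", "dog"]
VERB = "ran"

def sentence(q, noun, mod_type=None, mod=None):
    """Assemble a toy sentence for a quantifier, noun, and optional modifier."""
    plural = q in {"two", "three", "all"}
    n = noun + "s" if plural else noun
    verb = VERB
    if mod_type == "ADJ":
        n = f"{mod} {n}"
    elif mod_type in ("ADV", "CON"):
        verb = f"{verb} {mod}"
    return f"{q} {n} {verb}".capitalize()

def systematicity_split(primitive_type="EXI"):
    """Basic set 1: all quantifiers without modifiers; Basic set 2: the primitive
    quantifier type combined with every modifier; test set: the remaining
    quantifier-modifier combinations."""
    train, test = [], []
    for q_type, qs in QUANTIFIERS.items():
        for q, noun in product(qs, NOUNS):
            train.append(sentence(q, noun))  # Basic set 1
            for m_type, ms in MODIFIERS.items():
                for m in ms:
                    s = sentence(q, noun, m_type, m)
                    (train if q_type == primitive_type else test).append(s)
    return train, test

if __name__ == "__main__":
    train, test = systematicity_split("EXI")
    print(len(train), "train;", len(test), "test")
    print(train[:3])
    print(test[:3])
```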

3.3 Meaning representation and evaluation

Overview To evaluate generalization capacity in semantic parsing at multiple levels, we use three types of scoped meaning representations: (i) First-Order Logic (FOL) formulas, (ii) Discourse Representation Structures (DRSs; Kamp and Reyle, 1993), and (iii) Variable-Free (VF) formulas (Baader et al., 2003; Prat-Hartmann and Moss, 2009). DRSs can be converted to clausal forms (van Noord et al., 2018a) for evaluation. For instance, the sentence (1) is mapped to the FOL formula in (2), the DRS in (3a), its clausal form in (3b), and the VF formula in (4).

(1) One white dog did not run.
(2) ∃x1.(white(x1) ∧ dog(x1) ∧ ¬run(x1))
(3) a. [x1 : white(x1), dog(x1), ¬[ : run(x1)]]
    b. b1 REF x1
       b1 white x1
       b1 dog x1
       b1 NOT b2
       b2 run x1
(4) EXIST AND WHITE DOG NOT RUN

Using these multiple forms enables us to analyze whether the difficulty in semantic generalization depends on the format of meaning representations.

Previous studies for probing generalization capacity in semantic parsing (e.g., Lake and Baroni, 2017; Sinha et al., 2019; Keysers et al., 2020; Kim and Linzen, 2020) use a fixed type of meaning representation, with its evaluation method limited to the exact-match percentage, where an output is considered correct only if it exactly matches the gold standard. However, this does not properly assess whether models capture the structure and function of meaning representation. First, exact matching does not directly take into account whether two meanings are logically equivalent (Blackburn and Bos, 2005): for instance, schematically two formulas A ∧ B and B ∧ A are different in form but have the same meaning. Relatedly, scoped meaning representations for natural languages can be made complex by including parentheses and a variable renaming mechanism (the so-called α-conversion in λ-calculus). For instance, we want to identify two formulas which only differ in variable naming, e.g., ∃x1.F(x1) and ∃x2.F(x2). It is desirable to compare exact matching with alternative evaluation methods, and to consider alternative meaning representations that avoid these problems. Having this background in mind, below we will describe each type of meaning representation in detail.

FOL formula FOL formulas are standard forms in formal and computational semantics (Blackburn and Bos, 2005; Jurafsky and Martin, 2009), where content words such as nouns and verbs are represented as predicates, and function words such as quantifiers, negation, and connectives are represented as logical operators with scope relations (cf. the example in (2)). To address the issue on evaluation, we consider two ways of evaluating FOL formulas in addition to exact matching: (i) automated theorem proving (ATP) and (ii) monotonicity-based polarity assignment.

First, FOL formulas can be evaluated by checking the logical entailment relationships that directly consider whether two formulas are logically equivalent. Thus we evaluate predicted FOL formulas by using ATP. We check whether a gold formula G entails prediction P and vice versa, using an off-the-shelf FOL theorem prover.1 To see the logical relationship between G and P, we measure the accuracy for unidirectional and bidirectional entailment: G ⇒ P, G ⇐ P, and G ⇔ P.

1 We use a state-of-the-art FOL prover Vampire available at https://github.com/vprover/vampire
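As a concrete illustration of this ATP-based check, the sketch below tests G ⇒ P, G ⇐ P, and G ⇔ P for a toy gold/prediction pair. It assumes NLTK's logic parser and its Prover9 interface with a locally installed Prover9 binary as a stand-in prover; the experiments themselves use Vampire, so this only shows the bidirectional test, not the exact toolchain.

```python
from nltk.sem import Expression
from nltk.inference import Prover9

read = Expression.fromstring  # NLTK logic syntax: 'exists x.', 'all x.', '&', '|', '-', '->'

# Gold and predicted FOL formulas for "One white dog did not run",
# where the prediction has dropped the modifier white(x).
gold = read(r"exists x.(white(x) & dog(x) & -run(x))")
pred = read(r"exists x.(dog(x) & -run(x))")

prover = Prover9(timeout=10)               # requires the Prover9 binary on PATH
g_implies_p = prover.prove(pred, [gold])   # does G entail P?
p_implies_g = prover.prove(gold, [pred])   # does P entail G?

print("G => P:", g_implies_p)   # True: dropping a conjunct is entailed by the gold formula
print("G <= P:", p_implies_g)   # False: the prediction is strictly weaker than the gold
print("G <=> P:", g_implies_p and p_implies_g)
```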
upward monotone (shown as ↑) with respect to the Due to the equivalence with FOL formulas, it is subject NP and the VP, because they can be sub- possible to extract polarities from VF formulas. See stituted with their hypernyms (e.g., One dog ran Appendix A for more examples of VF formulas. ⇒ One animal moved). These polarities can be extracted from the FOL formula because ∃ and ∧ 3.4 Data generation are upward monotone operators in FOL. Univer- To provide synthesized data, we generate sentences sal quantifiers such as all are downward monotone using a -free grammar (CFG) associated (shown as ↓) with respect to the subject NP and up- with semantic composition rules in the standard ward monotone with respect to the VP. Expressions λ-calculus (see Appendix A for details). Each sen- in downward monotone position can be substituted tence is mapped to an FOL formula and VF formula with their hyponymous expressions (e.g., All dogs by using the semantic composition rules specified ran ⇒ All white dogs ran). The polarity can be in the CFG. DRSs are converted from the generated reversed by embedding another downward entail- FOL formulas using the standard mapping (Kamp ing context like negation, so the polarity of run in and Reyle, 1993). To generate controlled fragments the third case in Table3 is flipped to downward for each train-test split, the CFG rules automatically monotone.2 For evaluation based on monotonic- annotate the types of semantic phenomena involved ity, we extract a polarity for each content word in in sentences generated. We annotate seven types: a gold formula and a prediction and calculate the the positions of quantifiers (subject or object), nega- F-score for each monotonicity direction (upward tion, adjectives, adverbs, conjunction, disjunction, and downward). and embedded clause types (peripheral or center embedding). DRS A DRS is a form of scoped meaning rep- To test systematicity, we generate sentences us- resentations proposed in Representa- ing the CFG, randomly select 50,000 examples, tion Theory, a well-studied formalism in formal and then split them into 12,000 training examples semantics (Kamp and Reyle, 1993; Asher, 1993; and 38,000 test examples according to a primitive Muskens, 1996). By translating a box notation as quantifier. To test productivity, we apply up to in (3a) to the clausal form as in (3b), one can eval- four recursive rules and randomly select 20,000 uate DRSs by COUNTER3, which is a standard tool examples for each depth. for evaluating neural DRS parsers (Liu et al., 2018; van Noord et al., 2018b). COUNTER searches for 4 Experiments and Analysis the best variable mapping between predicted DRS clauses and gold DRS clauses and calculates an Using SyGNS, we test the performance of Gated F-score over matching clause, which is similar to Recurrent Unit (GRU; Cho et al., 2014) and Trans- SMATCH (Cai and Knight, 2013), an evaluation former (Vaswani et al., 2017) in an encoder-decoder setup. These are widely used models to perform 2We follow the surface order of NPs and take it that the subject NP always take scope over the VP. well on hierarchical generalization tasks (McCoy 3https://github.com/RikVN/DRS_parsing et al., 2018; Russin et al., 2020). 
GRU Transformer Test FOL DRS DRS (cnt) VF FOL DRS DRS (cnt) VF primitive quantifier: existential quantifier one EXI 96.1 99.5 99.9 99.7 99.9 99.8 99.9 100.0 NUM 7.6 12.7 86.0 37.0 18.1 96.9 99.7 20.7 UNI 3.1 4.4 56.8 39.5 8.3 2.2 74.2 17.7 Valid 98.2 99.7 100.0 99.6 100.0 100.0 100.0 100.0 primitive quantifier: numeral two EXI 11.6 42.1 91.4 45.3 34.0 84.5 98.3 10.5 NUM 59.5 83.6 98.7 42.8 99.9 97.4 99.8 80.9 UNI 2.5 1.8 68.6 39.2 0.0 0.1 72.3 90.9 Valid 84.3 99.7 100.0 98.9 100.0 100.0 100.0 100.0 primitive quantifier: universal quantifier every EXI 1.6 0.3 43.8 61.3 2.1 0.2 70.8 20.8 NUM 1.4 0.3 75.9 69.3 0.1 0.1 76.8 99.7 UNI 33.8 96.5 99.4 100.0 100.0 100.0 100.0 99.9 Valid 93.4 100.0 100.0 99.3 100.0 100.0 100.0 99.9 primitive quantifiers: one, two, every EXI 99.7 99.0 100.0 100.0 100.0 100.0 100.0 100.0 NUM 91.2 96.4 99.2 99.3 100.0 100.0 100.0 100.0 UNI 95.7 97.6 99.4 100.0 99.9 100.0 100.0 100.0 Valid 98.4 100.0 100.0 99.3 100.0 100.0 100.0 100.0

Table 4: Accuracy by quantifier type. “DRS (cnt)” columns show the accuracy of predicted DRSs by COUNTER, and “Valid” rows show the validation accuracy. Each accuracy is measured by exact matching, except for “DRS (cnt)” columns.

4.1 Experimental setup 4.2 Results on systematicity

In all experiments, we trained each model for 25 Generalization on quantifiers Table4 shows epochs with early stopping (patience = 3). We the accuracy by quantifier type. When the exis- performed five runs and reported their average ac- tential quantifier one was the primitive quantifier, curacies. The input sentence is represented as a the accuracy on the problems involving existen- sequence of words, using spaces as a separator. tial quantifiers, which have the same type as the The maximum input and output sequence length primitive quantifier, was nearly perfect. Similarly, is set to the length of a sequence with maximum when the universal quantifier every was the primi- depths of embedded clauses. We set the dropout tive quantifier, the accuracy on the problems involv- probability to 0.1 on the output and used a batch ing universal quantifiers was much better than that size of 128 and an embedding size of 256. Since on the problems involving other quantifier types. incorporating pre-training would make it hard to These results indicate that models can easily gener- distinguish whether the models’ ability to perform alize to problems involving quantifiers of the same semantic parsing comes from training data or from type as the primitive quantifier, while the models pre-training data, we did not use any pre-training. struggle with generalizing to the others. We also For the GRU, we used a single-layer encoder- experimented with larger models and observed the decoder with global attention and a dot-product same trend (see Appendix C). The extent of gener- score function. Since a previous work (Kim and alization varies according to the primitive quantifier Linzen, 2020) reported that unidirectional models type and meaning representation forms. For exam- are more robust regarding sentence structures than ple, when the primitive quantifier is the numeral ex- bi-directional models, we selected a unidirectional pression two, models generalize to problems of VF GRU encoder. For the Transformer, we used a formulas involving universal quantifiers. This can three-layer encoder-decoder, a model size of 512, be explained by the fact that VF formulas involving and a hidden size of 256. The number of model universal quantifiers like (5b) have a similar form parameters was 10M, respectively. See Appendix B to those involving numerals as in (6b), whereas for additional training details. FOL formulas involving universal quantifiers have a different form from those involving numerals as like negation does not affect generalization per- in (5c) and (6c). formance of models on modifiers so long as such phenomena are included in the training set. (5) a. All small cats chased Bob b. ALL AND CAT SMALL EXIST BOB Meaning representation comparison Compar- CHASE ing forms of meaning representations, accuracy by exact matching is highest in the order of VF formu- c. ∀x1.(cat(x1) ∧ small(x1) las, DRS clausal forms, and FOL formulas. This → ∃x .(bob(x ) ∧ chase(x , x ))) 2 2 1 2 indicates that models can more easily generalize to (6) a. Two small cats chased Bob unseen combinations where the form of meaning b. TWO AND CAT SMALL EXIST BOB representation is simpler; VF formulas do not con- CHASE tain parentheses nor variables, DRS clausal forms contain variables but not parentheses, and FOL for- c. ∃x1.(two(x1) ∧ cat(x1) ∧ small(x1) mulas contain both parentheses and variables. ∧ ∃x2.(bob(x2) ∧ chase(x1, x2)))

We also check the performance when three quanti- 4.3 Model comparison fiers one, two, and every are set as primitive quanti- Regarding the generalization capacity of models fiers. This setting is easier than that for the system- for decoding meaning representations, the left two aticity in Table1, since the models are exposed to figures in Figure2 show learning curves on FOL combination patterns of all the quantifier types and prediction tasks by quantifier type. While GRU all the modifier types. In this setting, the models achieved perfect performance on the same quanti- achieved almost perfect performance on the test set fier type as the primitive quantifier, where the num- involving non-primitive quantifiers (a, three, all). ber of training data is 2,500, Transformer achieved the same performance when the number of train- Generalization on modifiers Table5 shows the ing data is 8,000. The right two figures in Fig- accuracy by modifier type where one is set to the ure2 show learning curves by modifier type. The primitive quantifier (see Appendix C for the results GRU accuracy is unstable even when the number where other quantifier types are set to the primitive of training examples is maximal. In contrast, the quantifier). No matter which quantifier is set as Transformer accuracy is stable when the number of the primitive quantifier, the accuracy for problems training data exceeds 8,000. These results indicate involving logical connectives or adverbs is better that GRU generalizes to unseen combinations of than those involving adjectives. As in (8), an ad- quantifiers and modifiers with a smaller training jective is placed between a quantifier and a noun, set than can Transformer, while the Transformer so the position of the noun dog with respect to the performance is more stable than that of GRU. quantifier every in the test set changes from the ex- ample in the training (Basic) set in (7). In contrast, ATP-based evaluation Table6 shows the ATP- adverbs and logical connectives are placed at the based evaluation results on FOL formulas. For end of a sentence, so the position of the noun does combinations involving numerals, both GRU and not change from the training set, as in (9). This Transformer achieve high accuracies for G ⇒ P suggests that models can more easily generalize in entailments but low accuracies for G ⇐ P entail- problems involving unseen combinations of quanti- ments. Since both models fail to output the formu- fiers and modifiers where the position of the noun las corresponding to modifiers, they fail to prove is the same between the training and test sets. G ⇐ P entailments. Regarding combinations in- volving universal quantifiers, the GRU accuracy for (7) Every dog ran Train (Basic set) both G ⇒ P and G ⇐ P entailments is low, and (8) Every large dog ran Test (ADJ) the Transformer accuracy for G ⇐ P entailments (9) Every dog ran and cried Test (CON) is much higher than that for G ⇒ P entailments. As indicated by examples shown in Table7, GRU Table5 also shows that the accuracy is nearly the tends to fail to output the formula for a modifier, same regardless of the existence of negation. Basic e.g., wild(x) in this case, while Transformer fails set contains examples involving negation, and this to correctly output the position of the implication indicates that the existence of complex phenomena (→). 
The ATP-based evaluation results reflect such GRU Transformer Test FOL DRS DRS (cnt) VF FOL DRS DRS (cnt) VF ADJ 18.9 20.1 78.1 42.3 26.8 59.1 91.3 27.6 ADJ+NEG 18.8 20.2 80.5 39.7 23.1 59.5 93.6 27.5 ADV 20.1 47.7 87.5 58.4 36.2 67.3 97.6 50.7 ADV+NEG 26.9 62.7 92.7 67.2 50.7 69.4 97.3 62.1 CON 28.9 58.3 84.7 72.9 54.3 66.8 88.3 65.9 CON+NEG 33.6 62.8 86.6 74.9 60.1 65.1 89.9 69.1 Valid 98.2 99.7 100.0 99.6 100.0 100.0 100.0 100.0

Table 5: Accuracy by modifier type (primitive quantifier: existential quantifier one). +NEG indicates problems involving negation. Each accuracy is measured by exact matching, except for “DRS (cnt)” columns.

Figure 2: Learning curves on FOL formula generalization tasks (primitive quantifier: one).

GRU Transformer captures the polarities assigned to content words Test G ⇒ PG ⇐ PG ⇔ P G ⇒ PG ⇐ PG ⇔ P even for the problems that exact-matching judges EXI 99.8 100.0 99.8 100.0 100.0 100.0 NUM 77.1 19.0 10.4 91.3 21.1 12.4 as incorrect because of the differences in form. Ta- UNI 7.1 18.7 2.7 21.1 83.4 12.3 ble7 shows examples of predicted polarity assign- ments. Here both models predicted correct polari- Table 6: ATP-based evaluation results on FOL formu- ↓ ↑ ↑ las (primitive quantifier: one). ties for three content words, cat , escape , run . Exact-matching cannot take into account such par- Input Every wild cat escaped and ran tial matching. The downward monotone accuracies Gold ∀x.((cat↓(x) ∧ wild↓(x)) → (escape↑(x) ∧ run↑(x))) for problems involving universal quantifiers are low GRU ∀x.(cat↓(x) → (escape↑(x) ∧ run↑(x))) (40.7 and 73.4 in Table8). In Table7, both mod- Trans ∀x.(cat↓(x) → wild↑(x) ∧ (escape↑(x) ∧ run↑(x))) els failed to predict the downward monotonicity ↓ Table 7: Examples of typical errors. of wild . The results indicate that both models struggle with capturing the scope of universal quan- GRU Transformer Test tifiers. Appendix C shows the evaluation results on Up Down Up Down the polarities of VF formulas. EXI 99.9 100.0 100.0 100.0 NUM 84.8 96.8 88.1 97.5 4.4 Results on productivity UNI 90.9 40.7 94.9 73.4 Table9 shows very low generalization accuracy for Table 8: Monotonicity-based evaluation results on FOL both GRU and Transformer at unseen depths. Al- (primitive quantifier: one). “Up” and “Down” columns though the evaluation results using COUNTER on show upward and downward accuracy, respectively. DRS prediction tasks is much higher than those by exact matching, this is due to the fact that differences between error trends of models in prob- COUNTER uses partial matching; both models lems involving different forms of quantifiers. tended to correctly predict the clauses in the sub- ject NP that are positioned at the beginning of the Monotonicity-based evaluation Table8 shows sentence (see Appendix E for details). accuracies for the monotonicity-based polarity as- We checked whether models can generalize to signment evaluation on FOL formulas. The accu- unseen combinations involving embedded clauses racies were higher than those using exact match- when the models are exposed to a part of training ing (cf. Table4). Monotonicity-based evaluation instances at each depth. We provide Basic set 1 GRU Transformer Test FOL DRS DRS (cnt) VF FOL DRS DRS (cnt) VF Dep2 0.36 0.41 55.5 0.32 0.61 0.61 64.6 0.58 Dep3 0.04 0.07 45.6 0.04 0.11 0.12 46.6 0.12 Dep4 0.00 0.01 38.0 0.00 0.02 0.02 37.6 0.02 Valid 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

Table 9: Accuracy for productivity. “Dep” rows show embedding depths, “DRS (cnt)” columns show accuracy of predicted DRSs by COUNTER, and “Valid” row shows the validation accuracy. Each accuracy is measured by exact matching, except for “DRS (cnt)” columns.

GRU Transformer Test FOL DRS DRS (cnt) VF FOL DRS DRS (cnt) VF Dep1 22.1 21.9 81.9 48.9 96.6 77.1 96.5 97.5 Dep2 3.52 3.89 59.1 12.3 76.3 54.6 90.5 85.4 Dep3 0.15 0.12 43.3 0.31 24.9 4.5 70.4 37.0 Dep4 0.08 0.15 37.7 0.46 4.4 1.6 60.3 5.57 Valid 94.3 94.9 100.0 96.0 97.6 98.1 100.0 97.8

Table 10: Evaluation results for systematicity involving embedding quantifiers. “Dep” rows show embedding depths. involving non-embedding patterns like (10), where parsing sentences whose embedding depth is 3 or Q can be replaced with any quantifier. This Basic more. set 1 exposes models to all quantifier patterns. We also expose models to Basic set 2 involving three 5 Conclusion one two every primitive quantifiers ( , , and ) at each We have introduced an analysis method using embedding depth, like (11) and (12). We provide SyGNS, a testbed for diagnosing the systematic around 2,000 training instances at each depth. We generalization capability of DNN models in se- then test models on a test set involving the other mantic parsing. We found that GRU and Trans- a three all quantifiers ( , , and ) at each embedding former generalized to unseen combinations of se- depth, like (13) and (14). If models can distinguish mantic phenomena whose meaning representations quantifier types during training, they can correctly are similar in forms to those in a training set, while compose meaning representations involving differ- the models struggle with generalizing to the oth- ent combinations of multiple quantifiers. Note that ers. In addition, these models failed to generalize this setting is easier than that for productivity in Ta- to cases involving nested clauses. Our analyses ble2, in that models are exposed to some instances using multiple meaning representation and evalu- at each depth. ation methods also revealed detailed behaviors of (10) Q dog(s) liked Bob models. We believe that SyGNS serves as an effec- tive testbed for investigating the ability to capture (11) One dog liked Bob [that loved two rats] compositional meanings in natural language. (12) One dog liked Bob [that loved two rats [that knew every pig]] Acknowledgement (13) A dog liked Bob [that loved three rats] We thank the three anonymous reviewers for their helpful comments and suggestions. This work (14) A dog liked Bob [that loved three rats [that was partially supported by JSPS KAKENHI Grant knew all pigs]] Number JP20K19868. Table 10 shows that both models partially gener- alize to the cases where the depth is 1 or 2. How- References ever, both models fail to generalize to the cases where the depth is 3 or more. This suggests that Nicholas Asher. 1993. Reference to Abstract Objects even if models are trained with some instances at in Discourse. Springer. each depth, the models fail to learn distinctions be- Franz Baader, Diego Calvanese, Deborah McGuinness, tween different quantifier types and struggle with Peter Patel-Schneider, Daniele Nardi, et al. 2003. The Description Logic Handbook: Theory, Imple- Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, mentation and Applications. Cambridge University Samuel J. Gershman, and Noah D. Goodman. 2018. Press. Evaluating compositionality in sentence embed- dings. In Proceedings of the 40th Annual Confer- Laura Banarescu, Claire Bonial, Shu Cai, Madalina ence of the Cognitive Science Society, pages 1596– Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin 1601. Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Connec- for sembanking. In Proceedings of the 7th Linguis- tionism and cognitive architecture: A critical analy- tic Annotation Workshop and Interoperability with sis. Cognition, 28(1-2):3–71. Discourse, pages 178–186, Sofia, Bulgaria. Associa- tion for Computational Linguistics. 
Atticus Geiger, Kyle Richardson, and Christopher Potts. 2020. Neural natural language inference mod- Marco Baroni. 2020. Linguistic generalization and els partially embed theories of lexical entailment and compositionality in modern artificial neural net- negation. In Proceedings of the Third BlackboxNLP works. Philosophical Transactions of the Royal So- Workshop on Analyzing and Interpreting Neural Net- ciety B, 375(1791):20190307. works for NLP, pages 163–173, Online. Association for Computational Linguistics. Johan van Benthem. 1986. Essays in Logical Seman- tics. Springer. Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? an inves- Patrick Blackburn and Johan Bos. 2005. Representa- tigation of annotator bias in natural language under- tion and Inference for Natural Language: A First standing datasets. In Proceedings of the 2019 Con- Course in Computational Semantics. Center for the ference on Empirical Methods in Natural Language Study of Language and Information. Processing and the 9th International Joint Confer- ence on Natural Language Processing (EMNLP- Johan Bos. 2008. Let’s not argue about seman- IJCNLP), pages 1161–1166, Hong Kong, China. As- tics. In Proceedings of the Sixth International sociation for Computational Linguistics. Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco. European Lan- Max Glockner, Vered Shwartz, and Yoav Goldberg. guage Resources Association (ELRA). 2018. Breaking NLI systems with sentences that re- quire simple lexical inferences. In Proceedings of Samuel R. Bowman, Christopher Potts, and Christo- the 56th Annual Meeting of the Association for Com- pher D. Manning. 2015. Recursive neural networks putational Linguistics (Volume 2: Short Papers), can learn logical semantics. In Proceedings of the pages 650–655, Melbourne, Australia. Association 3rd Workshop on Continuous Vector Space Models for Computational Linguistics. and their Compositionality , pages 12–21. Emily Goodwin, Koustuv Sinha, and Timothy J. O’Donnell. 2020. Probing linguistic systematicity. Shu Cai and Kevin Knight. 2013. Smatch: an evalua- In Proceedings of the 58th Annual Meeting of the tion metric for semantic feature structures. In Pro- Association for Computational Linguistics, pages ceedings of the 51st Annual Meeting of the Associa- 1958–1969, Online. Association for Computational tion for Computational Linguistics (Volume 2: Short Linguistics. Papers), pages 748–752, Sofia, Bulgaria. Associa- tion for Computational Linguistics. Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless Paco Calvo and John Symons. 2014. The Architecture green recurrent networks dream hierarchically. In of Cognition: Rethinking Fodor and Pylyshyn’s Sys- Proceedings of the 2018 Conference of the North tematicity Challenge. MIT Press. American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Kyunghyun Cho, Bart van Merriënboer, Caglar Gul- Volume 1 (Long Papers), pages 1195–1205, New cehre, Dzmitry Bahdanau, Fethi Bougares, Holger Orleans, Louisiana. Association for Computational Schwenk, and Yoshua Bengio. 2014. Learning Linguistics. phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of Irene Heim and Angelika Kratzer. 1998. Semantics in the 2014 Conference on Empirical Methods in Nat- Generative Grammar. Blackwell. ural Language Processing (EMNLP), pages 1724– 1734, Doha, Qatar. 
Association for Computational Pauline Jacobson. 2014. Compositional Semantics: An Linguistics. Introduction to the Syntax/Semantics Interface. Ox- ford University Press. Ido Dagan, Dan Roth, Mark Sammons, and Fabio Mas- simo Zanzotto. 2013. Recognizing Textual Entail- Pratik Joshi, Somak Aditya, Aalok Sathe, and Mono- ment: Models and Applications. Synthesis Lectures jit Choudhury. 2020. TaxiNLI: Taking a ride up the on Human Language Technologies. Morgan & Clay- NLU hill. In Proceedings of the 24th Conference on pool Publishers. Computational Natural Language Learning, pages 41–55, Online. Association for Computational Lin- Association for Computational Linguistics: Human guistics. Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Daniel Jurafsky and James H. Martin. 2009. Speech Association for Computational Linguistics. and Language Processing (2nd Edition). Prentice- Hall, Inc. Bill MacCartney and Christopher D. Manning. 2007. Natural logic for textual inference. In Proceedings Hans Kamp and Uwe Reyle. 1993. From Discourse to of the ACL-PASCAL Workshop on Textual Entail- Logic. Dordrecht: Kluwer Academic Publishers. ment and Paraphrasing, pages 193–200. Daniel Keysers, Nathanael Schärli, Nathan Scales, Gary Marcus. 2003. The Algebraic Mind: Integrating Hylke Buisman, Daniel Furrer, Sergii Kashubin, Connectionism and Cognitive Science. MIT Press. Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Rebecca Marvin and Tal Linzen. 2018. Targeted syn- Marc van Zee, and Olivier Bousquet. 2020. Measur- tactic evaluation of language models. In Proceed- ing compositional generalization: A comprehensive ings of the 2018 Conference on Empirical Methods method on realistic data. In Proceedings of Inter- in Natural Language Processing, pages 1192–1202, national Conference on Learning Representations Brussels, Belgium. Association for Computational (ICLR). Linguistics. Najoung Kim and Tal Linzen. 2020. COGS: A com- R. Thomas McCoy, Robert Frank, and Tal Linzen. positional generalization challenge based on seman- 2018. Revisiting the poverty of the stimulus: hi- tic interpretation. In Proceedings of the 2020 Con- erarchical generalization without a hierarchical bias ference on Empirical Methods in Natural Language in recurrent neural networks. In Proceedings of the Processing (EMNLP), pages 9087–9105, Online. As- 40th Annual Meeting of the Cognitive Science So- sociation for Computational Linguistics. ciety, CogSci 2018, Madison, WI, USA, July 25-28, 2018. Najoung Kim, Roma Patel, Adam Poliak, Patrick Xia, Alex Wang, Tom McCoy, Ian Tenney, Alexis Ross, Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Tal Linzen, Benjamin Van Durme, Samuel R. Bow- Right for the wrong reasons: Diagnosing syntactic man, and Ellie Pavlick. 2019. Probing what dif- heuristics in natural language inference. In Proceed- ferent NLP tasks teach machines about function ings of the 57th Annual Meeting of the Association word comprehension. In Proceedings of the Eighth for Computational Linguistics, pages 3428–3448, Joint Conference on Lexical and Computational Se- Florence, Italy. Association for Computational Lin- mantics (*SEM 2019), pages 235–249, Minneapolis, guistics. Minnesota. Association for Computational Linguis- tics. Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Jaakko Hin- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A tikka, Julius M. E. 
Moravcsik, and Patrick Suppes, method for stochastic optimization. In Proceedings editors, Approaches to Natural Language, pages of International Conference on Learning Represen- 189–224. Reidel, Dordrecht. Reprinted in Rich- tations (ICLR). mond H. Thomason (ed.), Formal Philosophy: Se- lected Papers of Richard Montague, 247–270, 1974, Brenden M. Lake and Marco Baroni. 2017. General- New Haven: Yale University Press. ization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. Reinhard Muskens. 1996. Combining Montague se- In Proceedings of International Conference on Ma- mantics and discourse representation. Linguistics chine Learning (ICML). and philosophy, 19(2):143–186. Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Aakanksha Naik, Abhilasha Ravichander, Norman 2016. Assessing the ability of LSTMs to learn Sadeh, Carolyn Rose, and Graham Neubig. 2018. syntax-sensitive dependencies. Transactions of the Stress test evaluation for natural language inference. Association for Computational Linguistics, 4:521– In Proceedings of the 27th International Conference 535. on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Com- Jiangming Liu, Shay B. Cohen, and Mirella Lapata. putational Linguistics. 2018. Discourse representation structure parsing. In Proceedings of the 56th Annual Meeting of the As- Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. sociation for Computational Linguistics (Volume 1: Analyzing compositionality-sensitivity of NLI mod- Long Papers), pages 429–439, Melbourne, Australia. els. In Proceedings of the AAAI Conference on Arti- Association for Computational Linguistics. ficial Intelligence, pages 6867–6874. Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Rik van Noord, Lasha Abzianidze, Hessel Haagsma, Inoculation by fine-tuning: A method for analyz- and Johan Bos. 2018a. Evaluating scoped meaning ing challenge datasets. In Proceedings of the 2019 representations. In Proceedings of the Eleventh In- Conference of the North American Chapter of the ternational Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. Euro- Masatoshi Tsuchiya. 2018. Performance impact pean Language Resources Association (ELRA). caused by hidden bias of training data for recog- nizing textual entailment. In Proceedings of the Rik van Noord, Lasha Abzianidze, Antonio Toral, and Eleventh International Conference on Language Re- Johan Bos. 2018b. Exploring neural methods for sources and Evaluation (LREC 2018), Miyazaki, parsing discourse representation structures. Trans- Japan. European Language Resources Association actions of the Association for Computational Lin- (ELRA). guistics, 6:619–633. Jan Van Eijck. 2005. Natural logic for natural language. Adam Poliak, Jason Naradowsky, Aparajita Haldar, In International Tbilisi Symposium on Logic, Lan- Rachel Rudinger, and Benjamin Van Durme. 2018. guage, and Computation, pages 216–230. Springer. Hypothesis only baselines in natural language in- ference. In Proceedings of the Seventh Joint Con- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob ference on Lexical and Computational Semantics, Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz pages 180–191, New Orleans, Louisiana. Associa- Kaiser, and Illia Polosukhin. 2017. Attention is all tion for Computational Linguistics. you need. In Proceedings of Advances in Neural Information Processing Systems 33: Annual Con- Ian Prat-Hartmann and Lawrence S. Moss. 2009. 
Log- ference on Neural Information Processing Systems ics for the relational syllogistic. The Review of Sym- (NeurIPS), pages 5998–6008. bolic Logic, 2(4):647–683. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Kyle Richardson, Hai Hu, Lawrence S. Moss, and Amanpreet Singh, Julian Michael, Felix Hill, Omer Ashish Sabharwal. 2020. Probing natural language Levy, and Samuel Bowman. 2019a. SuperGLUE: A inference models through semantic fragments. In stickier benchmark for general-purpose language un- Proceedings of the AAAI Conference on Artificial In- derstanding systems. In Proceedings of Advances in telligence. Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Sys- Ohad Rozen, Vered Shwartz, Roee Aharoni, and Ido tems (NeurIPS), pages 3266–3280. Dagan. 2019. Diversify your datasets: Analyzing generalization via controlled variance in adversar- Alex Wang, Amanpreet Singh, Julian Michael, Fe- ial datasets. In Proceedings of the 23rd Confer- lix Hill, Omer Levy, and Samuel Bowman. 2019b. ence on Computational Natural Language Learning GLUE: A multi-task benchmark and analysis plat- (CoNLL), pages 196–205, Hong Kong, China. Asso- form for natural language understanding. In Pro- ciation for Computational Linguistics. ceedings of the International Conference on Learn- ing Representations (ICLR). Jacob Russin, Jason Jo, Randall O’Reilly, and Yoshua Bengio. 2020. Compositional generalization by fac- Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. torizing alignment and translation. In Proceedings 2017. Premise selection for theorem proving by of the 58th Annual Meeting of the Association for deep graph embedding. In Proceedings of Advances Computational Linguistics: Student Research Work- in Neural Information Processing Systems 30: An- shop, pages 313–327, Online. Association for Com- nual Conference on Neural Information Processing putational Linguistics. Systems (NeurIPS), pages 2786–2796. Swarnadeep Saha, Yixin Nie, and Mohit Bansal. 2020. Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, and ConjNLI: Natural language inference over conjunc- Kentaro Inui. 2020. Do neural models learn sys- tive sentences. In Proceedings of the 2020 Confer- tematicity of monotonicity inference in natural lan- ence on Empirical Methods in Natural Language guage? In Proceedings of the 58th Annual Meet- Processing (EMNLP), pages 8240–8252, Online. As- ing of the Association for Computational Linguistics, sociation for Computational Linguistics. pages 6105–6117, Online. Association for Computa- tional Linguistics. Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: Hitomi Yanaka, Koji Mineshima, Daisuke Bekki, Ken- A diagnostic benchmark for inductive reasoning taro Inui, Satoshi Sekine, Lasha Abzianidze, and Jo- from text. In Proceedings of the 2019 Conference on han Bos. 2019. Can neural networks understand Empirical Methods in Natural Language Processing monotonicity reasoning? In Proceedings of the and the 9th International Joint Conference on Natu- 2019 ACL Workshop BlackboxNLP: Analyzing and ral Language Processing (EMNLP-IJCNLP), pages Interpreting Neural Networks for NLP, pages 31–40, 4506–4515, Hong Kong, China. Association for Florence, Italy. Association for Computational Lin- Computational Linguistics. guistics. Hitomi Yanaka, Koji Mineshima, and Kentaro Inui. Alon Talmor and Jonathan Berant. 2019. MultiQA: An 2021. 
Exploring transitivity in neural NLI models empirical investigation of generalization and trans- through veridicality. In Proceedings of the 16th Con- fer in reading comprehension. In Proceedings of the ference of the European Chapter of the Association 57th Annual Meeting of the Association for Com- for Computational Linguistics: Main Volume, pages putational Linguistics, pages 4911–4921, Florence, 920–934. Italy. Association for Computational Linguistics. GRU Transformer A Data generation details Test 4M 10M 27M 4M 10M 27M Table 12 shows a set of context-free grammar rules EXI 96.8 99.9 97.1 99.9 99.8 99.3 and semantic composition rules we use to gener- NUM 7.1 11.5 10.4 12.3 12.2 12.4 UNI 6.0 4.9 2.9 7.8 5.9 7.9 ate a fragment of English annotated with mean- Valid 97.2 99.9 97.6 100.0 99.8 97.2 ing representations in the SyGNS dataset. Each grammar rule is associated with two kinds of se- Table 11: The effect of model size on generalization mantic composition rules formulated in λ-calculus. performance (primitive quantifier: existential quantifier One is for deriving first-order logic (FOL) for- one, representation form: FOL). mulas, and the other is for deriving variable-free (VF) formulas. For FOL, semantic composition the performance of three models of varying size runs in the standard Montagovian fashion where (large: 27M, medium: 10M, small: 4M). The num- all NPs (including proper nouns) are treated as ber of parameters did not have a large impact on the generalized quantifiers (Heim and Kratzer, 1998; generalization performance; all runs of the models Jacobson, 2014). From FOL formulas, we can achieved higher than 90% accuracy on the valida- extract the polarity of each content word using tion set and the test set involving quantifiers of the the monotonicity calculus (Van Eijck, 2005). Ta- same type as the primitive quantifier, while they ble 13 shows some examples of polarized FOL did not work well on the test set involving the other formulas. The derivation of VF formulas runs in types of quantifiers. two steps. To begin with, a sentence is mapped to a variable-free form by semantic composition Modifier type Table 14 shows all evaluation re- rules. For instance, the sentence a small dog sults by modifier types where two or every is set did not swim is mapped to a variable-free for- to the primitive quantifier. Regardless of primitive mula EXIST(AND(SMALL,DOG),NOT(SWIM)) by quantifier type, accuracies for problems involving the rules in Table 12. Second, since this form is in logical connectives or adverbs were better than prefix notation, all brackets can be eliminated with- those for problems involving adjectives. out causing . This produces the resulting VF formula EXIST AND SMALL DOG NOT SWIM. Monotonicity Table 15 shows all evaluation re- Some other examples are shown in Table 13. DRSs sults of predicted FOL formulas and VF formulas are converted from FOL formulas in the standard based on monotonicity. We evaluate the precision, way (Kamp and Reyle, 1993). recall, and F-score for each monotonicity direction (upward and downward). Regardless of meaning B Training details representation forms, downward monotone accu- racy on problems involving universal quantifiers is We implemented the GRU model and the Trans- low. This indicates that both models struggle with former model using PyTorch. Both models were learning the scope of universal quantifiers. optimized using Adam (Kingma and Ba, 2015) at an initial learning rate of 0.0005. 
The hyper- D Evaluation on systematicity of parameters (batch size, learning rate, number of quantifiers and negation epochs, hidden units, and dropout probability) were tuned by random search. In all experiments, we We also analyze whether models can generalize to trained models on eight NVIDIA DGX-1 Tesla unseen combinations of quantifiers and negation. V100 GPUs. The runtime for training each model Here, we generate Basic set 1 by setting an arbitrary was about 1-4 hours, depending on the size of the quantifier to a primitive quantifier and combining training set. The order of training instances was it with negation. As in (15b), we fix the primitive shuffled for each model. We used 10% of the train- quantifier to the existential quantifier one and gen- ing set for a validation set. erate the negated sentence One tiger did not run. Next, as in (16a) and (16b), we generate Basic set 2 C Detailed evaluation results by combining a primitive term (e.g., tiger) with Effect of Model Size The results we report are various quantifiers. If a model has the ability to from a model with 10M parameters. How does the systematically understand primitive combinations number of parameters affect the systematic gener- in Basic set, it can represent a new meaning repre- alization performance of models? Table 11 shows sentation with different combinations of quantifiers Grammar rules Semantic composition rules: FOL Semantic composition rules: VF S → NPVP [[S]] = [[NP]]([[VP]]) [[S]] = [[NP]]([[VP]]) S → NP did not VP [[S]] = [[NP]](λx.¬[[VP]](x)) [[S]] = [[NP]](NOT([[VP]])) NP → PN [[NP]] = [[PN]] [[NP]] = [[PN]] NP → QN [[NP]] = [[Q]]([[N]]) [[NP]] = [[Q]]([[N]]) NP → QADJ N [[NP]] = [[Q]](λx.([[N]](x) ∧ [[ADJ]](x))) [[NP]] = [[Q]](AND([[N]], [[ADJ]])) NP → QN S [[NP]] = [[Q]](λx.([[N]](x) ∧ [[S]](x))) [[NP]] = [[Q]](AND([[N]], [[S]])) VP → IV [[VP]] = [[IV]] [[VP]] = [[IV]] VP → IVADV [[VP]] = λx.([[IV]](x) ∧ [[ADV]](x)) [[VP]] = AND([[IV]], [[ADV]])) VP → IV or IV0 [[VP]] = λx.([[IV]](x) ∨ [[IV0]](x)) [[VP]] = OR([[IV]], [[IV0]])) VP → IV and IV0 [[VP]] = λx.([[IV]](x) ∧ [[IV0]](x)) [[VP]] = AND([[IV]], [[IV0]])) VP → TVNP [[VP]] = λx.[[NP]](λy.[[TV]](x, y)) [[VP]] = [[NP]]([[TV]]) S → that VP [[S]] = [[VP]] [[S]] = [[VP]] S → that did not VP [[S]] = λx.¬[[VP]](x) [[S]] = NOT([[VP]]) S → that NPTV [[S]] = λy.[[NP]](λx.[[TV]](x, y)) [[S]] = [[NP]](INV([[TV]])) S → that NP did not TV [[S]] = λy.[[NP]](λx.¬[[TV]](x, y)) [[S]] = [[NP]](NOT(INV([[TV]]))) Q → {every, all, a, one, two, three} [[every]] = [[all]] = λF λG.∀x(F (x) → G(x)) [[every]] = [[all]] = λF λG.ALL(F,G) [[a]] = [[one]] = λF λG.∃x.(F (x) ∧ G(x)) [[a]] = [[one]] = λF λG.EXIST(F,G) [[two]] = λF λG.∃x.(two(x) ∧ F (x) ∧ G(x)) [[two]] = λF λG.TWO(F,G) [[three]] = λF λG.∃x.(three(x) ∧ F (x) ∧ G(x)) [[three]] = λF λG.THREE(F,G) N → {dog, rabbit, cat, bear, tiger,...} [[dog]] = λx.dog(x) [[dog]] = DOG PN → {ann, bob, fred, chris, eliott,...} [[ann]] = λF.F (ann) [[ann]] = λF.EXIST(ANN,F ) IV → {ran, walked, swam, danced, dawdled,...} [[ran]] = λx.run(x) [[ran]] = RUN IV0 → {laughed, groaned, roared, screamed,...} [[laugh]] = λx.laugh(x) [[laugh]] = LAUGH TV → {kissed, kicked, cleaned, touched,...} [[kissed]] = λyλx.kiss(x, y) [[kissed]] = KISS ADJ → {small, large, crazy, polite, wild,...} [[small]] = λx.small(x) [[small]] = SMALL ADV → {slowly, quickly, seriously, suddenly,...} [[slowly]] = λx.slowly(x) [[slowly]] = SLOWLY

Table 12: A set of context-free grammar rules and semantic composition rules used to generate the SyGNS dataset.
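To make the two-step VF derivation concrete, the following is a minimal sketch (not the authors' released code; the lexical symbols and argument ordering follow the worked example above) of how the VF rules in Table 12 can be realized with Python lambdas, and of why the bracketed prefix form can be flattened without loss:

# Minimal sketch of the VF composition rules in Table 12 (illustrative only).
# Content words denote bare symbols; connectives build bracketed prefix terms.
DOG, SMALL, SWIM = "DOG", "SMALL", "SWIM"
AND = lambda f, g: f"AND({f},{g})"
NOT = lambda f: f"NOT({f})"
EXIST = lambda f, g: f"EXIST({f},{g})"

# [[a]] = [[one]] = λF.λG.EXIST(F, G)
a = lambda F: lambda G: EXIST(F, G)

# "a small dog did not swim":
# NP → Q ADJ N builds the restrictor AND(SMALL, DOG);
# S → NP did not VP applies the NP to NOT([[VP]]).
np = a(AND(SMALL, DOG))
sentence = np(NOT(SWIM))
print(sentence)                       # EXIST(AND(SMALL,DOG),NOT(SWIM))

# Step 2: since the formula is in prefix notation, brackets and commas can be
# dropped without ambiguity, yielding the flat VF target sequence.
flat = sentence.replace("(", " ").replace(")", " ").replace(",", " ").split()
print(" ".join(flat))                 # EXIST AND SMALL DOG NOT SWIM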

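The optimization setup described in Appendix B can be summarized in a few lines. The sketch below is an assumption about how such a setup is typically wired up in PyTorch; the search-space values are hypothetical, since the paper only lists which hyperparameters were tuned:

import random
import torch

def sample_hyperparameters():
    # Hypothetical search space; Appendix B only names the tuned knobs.
    return {
        "batch_size": random.choice([32, 64, 128]),
        "epochs": random.choice([10, 20, 30]),
        "hidden_units": random.choice([256, 512]),
        "dropout": random.choice([0.1, 0.2, 0.3]),
        "lr": random.choice([1e-4, 5e-4, 1e-3]),
    }

def make_optimizer(model: torch.nn.Module, lr: float = 0.0005):
    # Adam (Kingma and Ba, 2015) at the initial learning rate reported above.
    return torch.optim.Adam(model.parameters(), lr=lr)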
Sentence | FOL | VF
a small dog did not swim | ∃x.(small↑(x) ∧ dog↑(x) ∧ ¬swim↓(x)) | EXIST AND SMALL↑ DOG↑ NOT SWIM↓
all tigers ran or swam | ∀x.(tiger↓(x) → run↑(x) ∨ swim↑(x)) | ALL TIGER↓ OR RUN↑ SWIM↑
ann did not chase two dogs | ¬∃x.(two(x) ∧ dog↓(x) ∧ chase↓(ann, x)) | EXIST ANN NOT TWO DOG↓ CHASE↓

Table 13: Examples of (polarized) FOL formulas and VF formulas.
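The polarity marks in Table 13 follow the standard monotonicity profiles of the operators: universal quantifiers are downward monotone in their restrictor, existentials are upward in both arguments, and negation flips the polarity of its scope. A rough sketch of this marking over a toy FOL syntax (our own simplification, not the authors' implementation) is:

from dataclasses import dataclass

@dataclass
class Pred:            # content predicate, e.g. dog(x)
    name: str

@dataclass
class Not:
    body: object

@dataclass
class Forall:          # ∀x.(restrictor → body)
    restrictor: object
    body: object

@dataclass
class Exists:          # ∃x.(restrictor ∧ body)
    restrictor: object
    body: object

@dataclass
class And:
    left: object
    right: object

def polarities(node, pol="+"):
    """Return {predicate: '+' (upward) or '-' (downward)}."""
    flip = {"+": "-", "-": "+"}
    if isinstance(node, Pred):
        return {node.name: pol}
    if isinstance(node, Not):
        return polarities(node.body, flip[pol])
    if isinstance(node, And):
        return {**polarities(node.left, pol), **polarities(node.right, pol)}
    if isinstance(node, Forall):
        # restrictor is downward monotone; body keeps the current polarity
        return {**polarities(node.restrictor, flip[pol]),
                **polarities(node.body, pol)}
    if isinstance(node, Exists):
        return {**polarities(node.restrictor, pol),
                **polarities(node.body, pol)}
    raise TypeError(node)

# "all wild dogs ran": ∀x.((dog(x) ∧ wild(x)) → run(x))
formula = Forall(And(Pred("dog"), Pred("wild")), Pred("run"))
print(polarities(formula))   # {'dog': '-', 'wild': '-', 'run': '+'}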

Test | GRU FOL | GRU DRS | GRU DRS (cnt) | GRU VF | Transformer FOL | Transformer DRS | Transformer DRS (cnt) | Transformer VF
primitive quantifier: numeral two
ADJ     | 10.7 | 16.8 | 74.9 | 22.8 | 34.4 | 58.2 | 91.6 | 52.0
ADJ+NEG | 10.1 | 22.3 | 79.7 | 24.3 | 33.4 | 58.7 | 94.4 | 51.0
ADV     | 12.8 | 29.3 | 79.8 | 46.9 | 40.3 | 56.1 | 89.1 | 60.5
ADV+NEG | 14.1 | 33.9 | 83.8 | 58.6 | 34.4 | 56.4 | 93.3 | 65.8
CON     | 18.3 | 37.9 | 80.2 | 64.8 | 34.3 | 52.4 | 83.4 | 67.2
CON+NEG | 24.4 | 40.6 | 82.6 | 68.7 | 31.3 | 50.8 | 88.0 | 68.5
primitive quantifier: universal quantifier every
ADJ     | 7.7  | 19.5 | 70.6 | 58.5 | 20.7 | 20.5 | 89.9 | 62.2
ADJ+NEG | 6.9  | 19.2 | 75.5 | 56.8 | 20.3 | 20.2 | 92.2 | 63.6
ADV     | 9.2  | 18.2 | 82.3 | 70.1 | 19.7 | 19.7 | 85.1 | 70.4
ADV+NEG | 14.8 | 18.1 | 79.4 | 76.1 | 22.7 | 19.8 | 89.3 | 75.5
CON     | 14.7 | 18.0 | 70.3 | 79.6 | 21.5 | 19.2 | 68.8 | 75.8
CON+NEG | 18.3 | 18.2 | 80.1 | 81.2 | 22.7 | 19.1 | 80.5 | 76.8

Table 14: Accuracy by modifier type where two or every is the primitive quantifier. “DRS (cnt)” columns show accuracies of predicted DRSs by COUNTER.
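Table 15 below reports precision, recall, and F-score separately for the two monotonicity directions. A minimal sketch of the metric as we understand it from the description above (our own simplification: each polarized content word, e.g. dog↓, is treated as an item, and upward and downward items are scored separately) is:

from collections import Counter

def direction_scores(gold_tokens, pred_tokens, arrow):
    # Keep only the tokens marked with the requested direction ("↑" or "↓").
    gold = Counter(t for t in gold_tokens if t.endswith(arrow))
    pred = Counter(t for t in pred_tokens if t.endswith(arrow))
    tp = sum((gold & pred).values())
    prec = tp / max(sum(pred.values()), 1)
    rec = tp / max(sum(gold.values()), 1)
    f = 2 * prec * rec / max(prec + rec, 1e-9)
    return prec, rec, f

gold = ["tiger↓", "run↑", "swim↑"]
pred = ["tiger↓", "run↑", "swim↓"]
print(direction_scores(gold, pred, "↑"))  # upward:   (1.0, 0.5, 0.667)
print(direction_scores(gold, pred, "↓"))  # downward: (0.5, 1.0, 0.667)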

FOL formula
Model | Test | Exact Match | Upward Prec | Upward Rec | Upward F | Downward Prec | Downward Rec | Downward F
GRU         | EXI | 96.1 | 100.0 | 99.9  | 99.9  | 100.0 | 100.0 | 100.0
GRU         | NUM | 7.6  | 99.2  | 75.5  | 84.8  | 99.6  | 94.7  | 96.8
GRU         | UNI | 3.1  | 92.6  | 89.9  | 90.9  | 42.9  | 39.4  | 40.7
Transformer | EXI | 99.9 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
Transformer | NUM | 18.1 | 100.0 | 79.5  | 88.1  | 100.0 | 95.6  | 97.5
Transformer | UNI | 8.3  | 97.3  | 93.4  | 94.9  | 79.4  | 70.0  | 73.4

VF formula
Model | Test | Exact Match | Upward Prec | Upward Rec | Upward F | Downward Prec | Downward Rec | Downward F
GRU         | EXI | 99.7  | 100.0 | 99.9  | 99.9  | 99.9  | 99.9  | 99.9
GRU         | NUM | 37.0  | 70.2  | 54.4  | 59.8  | 68.6  | 58.5  | 62.0
GRU         | UNI | 39.5  | 91.1  | 80.7  | 84.4  | 49.0  | 35.2  | 39.7
Transformer | EXI | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
Transformer | NUM | 20.7  | 99.2  | 77.0  | 85.4  | 99.3  | 95.4  | 97.0
Transformer | UNI | 17.7  | 99.9  | 97.4  | 98.4  | 98.6  | 72.7  | 82.3

Table 15: Evaluation results on monotonicity. "Prec", "Rec", and "F" indicate precision, recall, and F-score.

E Error analysis of predicted DRSs

In the productivity experiments, the evaluation results using COUNTER on the DRS prediction tasks are much higher than those obtained by exact matching. Table 18 shows an example of predicted DRSs for the sentence "all lions that did not follow two bears that chased three monkeys did not cry". This sentence contains embedded clauses of depth two and has the following gold DRS:

[x1 | lion(x1), ¬[x2, x3 | two(x2), bear(x2), three(x3), monkey(x3), chase(x2, x3), follow(x1, x2)]] ⇒ [ | ¬[ | cry(x1)]]

Both GRU and Transformer tend to correctly predict some of the clauses for content words, implication, and negation that appear at the beginning of the input sentence, while they fail to capture long-distance dependencies between subject nouns and verbs (e.g., all lions ... did not cry). Also, COUNTER correctly evaluates the cases where the order of clauses is different from that of the gold answers.

Test | GRU FOL | GRU DRS | GRU DRS (cnt) | GRU VF | Transformer FOL | Transformer DRS | Transformer DRS (cnt) | Transformer VF
primitive quantifier: existential quantifier one
EXI   | 65.9 | 88.0 | 98.1  | 94.5 | 99.9  | 96.6  | 97.6  | 45.0
NUM   | 48.4 | 71.0 | 96.7  | 54.5 | 65.8  | 86.6  | 98.8  | 26.3
UNI   | 22.3 | 0.0  | 52.0  | 53.1 | 0.0   | 0.0   | 70.2  | 26.6
Valid | 96.6 | 98.1 | 100.0 | 99.8 | 100.0 | 100.0 | 100.0 | 100.0
primitive quantifier: numeral two
EXI   | 36.4 | 63.6 | 89.8  | 13.9 | 14.4  | 49.6  | 86.9  | 11.5
NUM   | 40.0 | 66.7 | 94.1  | 21.5 | 33.0  | 59.1  | 87.6  | 14.3
UNI   | 15.4 | 0.0  | 50.7  | 0.0  | 0.0   | 0.0   | 71.0  | 12.4
Valid | 96.4 | 97.8 | 100.0 | 99.3 | 100.0 | 100.0 | 100.0 | 100.0
primitive quantifier: universal quantifier every
EXI   | 12.8 | 0.0  | 73.7  | 78.6 | 0.8   | 0.5   | 52.2  | 59.4
NUM   | 17.1 | 0.0  | 75.3  | 78.1 | 0.0   | 0.6   | 59.2  | 65.1
UNI   | 91.1 | 88.3 | 97.4  | 94.9 | 86.6  | 70.1  | 92.3  | 76.6
Valid | 98.8 | 98.1 | 100.0 | 98.8 | 100.0 | 100.0 | 100.0 | 100.0

Table 16: Accuracy on combinations of quantifiers and negation by quantifier type. “DRS (cnt)” columns show accuracies of predicted DRSs by COUNTER. “Valid” row shows the validation accuracy.
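The quantifier-negation split evaluated in Tables 16 and 17 is built from the Basic sets described in Appendix D. The following is an illustrative sketch (the surface-form helper and its vocabulary are our own simplification) of how the observed combinations and the held-out combinations can be enumerated for the primitive quantifier one:

quantifiers = ["one", "two", "every"]
primitive = "one"

def sent(quantifier, negated, noun="tiger"):
    # Toy surface realization: plural nouns for numerals, "did not" for negation.
    n = noun + "s" if quantifier in ("two", "three") else noun
    vp = "did not run" if negated else "ran"
    return f"{quantifier.capitalize()} {n} {vp}"

basic_set_1 = [sent(primitive, False), sent(primitive, True)]          # One tiger ran / One tiger did not run
basic_set_2 = [sent(q, False) for q in quantifiers if q != primitive]  # Two tigers ran, Every tiger ran
generalization_test = [sent(q, True) for q in quantifiers if q != primitive]  # unseen: negation with two / every
print(basic_set_1, basic_set_2, generalization_test, sep="\n")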

Test | GRU FOL | GRU DRS | GRU DRS (cnt) | GRU VF | Transformer FOL | Transformer DRS | Transformer DRS (cnt) | Transformer VF
primitive quantifier: existential quantifier one
ADJ+NEG | 34.6 | 33.2 | 68.3 | 45.8 | 19.4 | 45.1 | 84.1 | 25.3
ADV+NEG | 38.0 | 36.3 | 77.9 | 53.0 | 43.2 | 54.8 | 88.1 | 37.5
CON+NEG | 36.8 | 33.2 | 73.2 | 54.7 | 47.4 | 52.4 | 85.4 | 37.0
primitive quantifier: numeral two
ADJ+NEG | 21.2 | 18.2 | 69.3 | 8.0  | 11.6 | 29.3 | 75.9 | 1.2
ADV+NEG | 26.5 | 28.5 | 71.4 | 10.3 | 19.3 | 36.5 | 81.8 | 17.8
CON+NEG | 21.9 | 28.1 | 68.5 | 9.1  | 11.7 | 34.6 | 78.7 | 16.3
primitive quantifier: universal quantifier every
ADJ+NEG | 26.7 | 12.4 | 63.9 | 60.5 | 13.9 | 14.6 | 58.0 | 50.7
ADV+NEG | 25.9 | 13.2 | 69.2 | 66.1 | 21.9 | 15.2 | 60.9 | 63.3
CON+NEG | 28.5 | 18.8 | 71.4 | 65.9 | 20.2 | 14.6 | 59.7 | 62.8

Table 17: Accuracy on combinations of quantifiers and negation by modifier type.

(a) Gold answer:
b1 IMP b2 b4
b2 REF x1
b2 lion x1
b2 NOT b3
b3 REF x2
b3 REF x3
b3 two x2
b3 bear x2
b3 three x3
b3 monkey x3
b3 chase x3 x2
b3 follow x2 x1
b4 NOT b5
b5 cry x1

(b) GRU (F: 0.45):
b1 IMP b2 b3
b2 REF x1
b2 lion x1
b2 NOT b3
b3 REF x2
b3 two x2
b3 bear x2
b3 follow x2 x2
b3 REF x3
b3 three x3
b4 monkey x3
b4 like x3 x2
b4 like x1 x2

(c) Transformer (F: 0.42):
b1 IMP b2 b3
b2 REF x1
b2 lion x1
b2 NOT b3
b3 REF x2
b3 two x2
b3 monkey x2
b3 follow x2 x1
b3 REF x3
b3 john x3
b3 chase x1 x3

Table 18: Error analysis of DRSs for the sentence "all lions that did not follow two bears that chased three monkeys did not cry". Clauses in green are correct and those in red are incorrect. "F" shows the F-score over matching clauses.
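For reference, the clause-level comparison that COUNTER performs can be illustrated with a simplified sketch: predicted and gold DRSs are compared as bags of clauses, so clause order does not matter. The real COUNTER additionally searches over variable renamings; this toy version (our own simplification, with made-up clause lists) matches clauses literally:

from collections import Counter as Bag

def clause_f1(gold_clauses, pred_clauses):
    # Bag intersection counts clauses that occur in both, regardless of order.
    gold, pred = Bag(gold_clauses), Bag(pred_clauses)
    matched = sum((gold & pred).values())
    prec = matched / max(len(pred_clauses), 1)
    rec = matched / max(len(gold_clauses), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

gold = ["b1 IMP b2 b3", "b2 REF x1", "b2 lion x1", "b3 NOT b4", "b4 cry x1"]
pred = ["b2 REF x1", "b1 IMP b2 b3", "b2 lion x1", "b3 cry x1"]  # reordered, partly wrong
print(round(clause_f1(gold, pred), 2))  # 0.67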