arXiv:2007.06761v2 [cs.CL] 23 Sep 2020
Can neural networks acquire a structural bias from raw linguistic data?

Alex Warstadt ([email protected])
Department of Linguistics, New York University
New York, NY 10003 USA

Samuel R. Bowman ([email protected])
Department of Linguistics & Center for Data Science & Department of Computer Science, New York University
New York, NY 10003 USA

Abstract

We evaluate whether BERT, a widely used neural network for sentence processing, acquires an inductive bias towards forming structural generalizations through pretraining on raw data. We conduct four experiments testing its preference for structural vs. linear generalizations in different structure-dependent phenomena. We find that BERT makes a structural generalization in 3 out of 4 empirical domains (subject-auxiliary inversion, reflexive binding, and verb tense detection in embedded clauses) but makes a linear generalization when tested on NPI licensing. We argue that these results are the strongest evidence so far from artificial learners supporting the proposition that a structural bias can be acquired from raw data. If this conclusion is correct, it is tentative evidence that some linguistic universals can be acquired by learners without innate biases. However, the precise implications for human language acquisition are unclear, as humans learn language from significantly less data than BERT.

[Figure 1 panels. Training: "Has the man who has gone seen the cat?" vs. "Has the man who gone has seen the cat?" Hypothesis space: Structural Generalization (move the structurally highest auxiliary to the front) or Linear Generalization (move the linearly last auxiliary to the front). Test: "Has the man seen the cat who has gone?" vs. "Has the man has seen the cat who gone?", classified according to either a structural bias or a linear bias.]
Figure 1: Illustration of the poverty of the stimulus design experiment for subject-auxiliary inversion. Colors correspond to the binary classes for sentences.

Keywords: inductive bias; structure dependence; BERT; learnability of grammar; poverty of the stimulus; neural network; self-supervised learning

Introduction

Humans appear to use structural biases to guide language acquisition. A classic example is the rule for subject-auxiliary inversion: Native English speakers uniformly acquire a rule like the structural generalization in Figure 1 that makes reference to hierarchical syntactic structures, despite the fact that the raw linguistic input often supports linear generalizations which are intuitively just as simple (Chomsky, 1965). Humans are not alone in possessing this inductive bias: Prior investigations have identified some artificial learners with a structural bias by virtue of having a significantly restricted hypothesis space (Perfors, Tenenbaum, & Regier, 2011) or a hierarchically structured architecture that learns from pre-parsed data (McCoy, Frank, & Linzen, 2020).

However, these results cannot tell us whether a learner starting with very weak biases can acquire a structural bias merely from exposure to raw linguistic data. While inductive biases are often understood to be unchangeable properties of a learner, this need not be the case. For instance, in one dominant paradigm in natural language processing, pretraining on raw data is used to create a general purpose sentence processing model like BERT (Bidirectional Encoder Representations from Transformers; Devlin, Chang, Lee, & Toutanova, 2019), which can subsequently be fine-tuned to perform a downstream task. The model’s inductive biases with respect to the downstream task may be substantially influenced by the prior knowledge acquired during pretraining.

In this work, we present new experimental evidence that BERT may acquire an inductive bias towards structural generalizations from exposure to raw data alone. We conduct four experiments inspired by McCoy, Frank, and Linzen (2018, 2020) to evaluate whether BERT has a structural or linear bias when generalizing about structure-dependent English phenomena. We follow the poverty of the stimulus design (Wilson, 2006), outlined in Figure 1. First, we fine-tune BERT to perform a classification task using data intentionally ambiguous between structural and linear generalizations. Then, we probe the inductive biases of the learner by observing how it classifies held-out examples that disambiguate between the generalizations. The classification tasks illustrate three structure-dependent rules of English grammar regarding subject-auxiliary inversion, reflexive-antecedent agreement, and negative polarity item (NPI) licensing. A fourth task requires the classification of sentences based on an arbitrary rule: whether the verb in an embedded clause is in the past tense. The data is generated from templates using the generation tools and lexicon built by Warstadt, Cao, et al. (2019) and Warstadt, Parrish, et al. (2019).

The results of these experiments suggest that BERT likely acquires an inductive bias towards structural rules from self-supervised pretraining. BERT generalizes in a way consistent with a structural bias in 3 out of 4 experiments: those involving subject-auxiliary inversion, reflexive binding, and embedded verb tense detection. While these experiments leave open several alternative explanations for this generalization behavior, they add to mounting evidence that significant syntactic knowledge, including a structural bias, can be acquired from self-supervised learning on raw data.

Background & Related Work

Self-supervised Learning and BERT

Recent advances in machine learning for natural language processing give new reason to believe that low-bias learners can acquire significant grammatical knowledge from raw data. The Transformer neural network architecture (Vaswani et al., 2017) used in models like BERT has very weak biases: It is a universal approximator of the class of sequence transduction functions (Yun, Bhojanapalli, Rawat, Reddi, & Kumar, 2019), and it has been applied effectively in non-linguistic domains such as computer vision (Parmar et al., 2018) and protein sequence modeling (Rives et al., 2019). However, rather than training a low-bias model from scratch to perform a particular linguistic task, recent results show that it is far more effective to pretrain a general purpose model on raw data and subsequently fine-tune it on a downstream task (e.g. Howard & Ruder, 2018). The implication is that such models acquire helpful biases from pretraining.

Crucially, these models are usually pretrained with only raw data using the technique of self-supervised learning, which circumvents the need for labeled data by using the data itself as the labels. The most common self-supervised tasks for pretraining are the language modeling task, where the objective is to predict the next word in a string (e.g. Peters et al., 2018; Radford et al., 2019), and, in the case of BERT, the cloze task, where the objective is to predict the identity of a masked token anywhere in a string.

Despite containing no explicit information about grammatical concepts, self-supervised tasks appear to teach neural models significant knowledge of grammar and hierarchical syntax. These models can perform human-like acceptabil- [...] vised training on acceptability. BERT’s internal representations appear to attend to linguistic features such as syntactic category (Clark, Khandelwal, Levy, & Manning, 2019) and contain sufficient information from which to recover a dependency parse for an input sentence (Hewitt & Manning, 2019). However, it is not known whether BERT is biased towards forming generalizations based on structural features when fine-tuned on structure-dependent phenomena. Indeed, it is possible that BERT could acquire knowledge of hierarchical syntax but still preferentially use surface features to generalize. Our experiments are designed to address this question.

Structure Dependence & the Innateness Hypothesis

The learnability of structural bias has played a large role in debates about human language acquisition. Chomsky (1965, 1971) proposes that humans have an innate bias towards learning structural grammatical rules. For example, children must learn a general rule for subject-auxiliary inversion primarily from input like (1). With this input, a learner could form a structural generalization (front the highest auxiliary in the corresponding declarative) or a linear one (e.g. front the first or last auxiliary), but human learners always choose the former. That is, no human learner of English acquires a linear rule that produces the form in (2a) over (2b). From such examples, Chomsky (1971) concludes that humans have an innate preference for structure-dependent rules. Otherwise it would be difficult to explain how we so consistently avoid deeply un-language-like hypotheses in the absence of significant disconfirming evidence.

(1) a. The cat has gone.
    b. Has the cat gone?
(2) a. *Is the man has seen the cat who going?
    b. Has the man seen the cat who is going?

These examples play a key role in Chomsky’s influential argument from the poverty of the stimulus in support of this hypothesis.¹ A version of the argument is given below:

Premise 1 Humans form grammatical hypotheses about their native language using either innate biases or data-driven learning.
Premise 2 Human language learners preferentially form structural hypotheses over equally simple linear hypotheses.
Premise 3 The raw linguistic input