
arXiv:2106.08694v1 [cs.CL] 16 Jun 2021

On the proper role of linguistically-oriented deep net analysis in linguistic theorizing

Marco Baroni 1,2,3

1 Facebook AI Research
2 Universitat Pompeu Fabra
3 Catalan Institution for Research and Advanced Studies

June 17, 2021

Abstract

A lively research field has recently emerged that uses experimental methods to probe the linguistic behavior of modern deep networks. While work in this tradition often reports intriguing results about the grammatical skills of deep nets, it is not clear what their implications for linguistic theorizing should be. As a consequence, linguistically-oriented deep net analysis has had very little impact on linguistics at large. In this chapter, I suggest that deep networks should be treated as theories making explicit predictions about the acceptability of linguistic utterances. I argue that, if we overcome some obstacles standing in the way of seriously pursuing this idea, we will gain a powerful new theoretical tool, complementary to mainstream algebraic approaches.

1 Introduction

During the last decade, deep neural networks have come to dominate natural language processing (NLP) (Sutskever et al., 2014; Vaswani et al., 2017; Devlin et al., 2019). While earlier approaches to NLP typically relied on explicit linguistic annotation of text corpora, extracted through manual annotation and tools such as part-of-speech taggers and parsers (Jurafsky and Martin, 2008), deep-learning-based methods adopt an "end-to-end" approach: a deep network is directly trained to associate some form of linguistic input (e.g., text in a language) with a corresponding linguistic output (e.g., the same text in a different language), dispensing with the traditional pipeline of intermediate linguistic modules and the related annotation of latent linguistic structure (e.g., the syntactic parses of source and target sentences) (Goldberg, 2017; Lappin, 2021).

This paradigm shift has important implications for the relation between theoretical and computational linguistics. The issue of which linguistic formalisms might provide the best annotation schemes to develop effective NLP tools is no longer relevant. Instead, the last few years have seen the rise of a new field of investigation consisting in the experimental analysis of the grammatical skills of deep nets trained without the injection of any explicit linguistic knowledge. For the remainder of the chapter, I will refer to this research area as LODNA, for linguistically-oriented deep net analysis. LODNA takes the perspective of a psycholinguist (Futrell et al., 2019), or perhaps more accurately that of an ethologist (McCloskey, 1991; Scholte, 2016), designing sophisticated experiments to "probe" the knowledge implicit in a species' behavior. LODNA is currently a very active research field, with many papers focusing on whether neural networks have correctly induced specific kinds of grammatical generalization (e.g., Linzen et al., 2016; Chowdhury and Zamparelli, 2018; Futrell et al., 2019; Chaves, 2020), as well as benchmarks attempting to probe their linguistic competence at multiple levels (Conneau et al., 2018; Warstadt et al., 2019).
LODNA papers account for a significant proportion of the work presented at annual events such as the Society for Computation in Linguistics conference and the BlackBox NLP workshop.

LODNA is well-motivated from a machine-learning perspective. Understanding how a system behaves is a prerequisite to improving it, and might be important in the perspective of AI safety and explainability (Belinkov and Glass, 2019; Xie et al., 2020). However, there is no doubt that the grammatical performance of deep nets is also extremely intriguing from a linguistic perspective, particularly because the architectural primitives of these models (such as distributed representations and structures that linearly propagate information across time) are profoundly different from those postulated in linguistics (such as categorical labels and tree structures). Still, as we will see below, for all the enthusiasm for LODNA within NLP, this line of work is hardly having any impact on the current debate in theoretical linguistics.

In this chapter, after introducing, as an example of LODNA, the by-now "classic" domain of long-distance agreement probing (Section 2), I will present evidence for the claim that this sort of research, despite the intriguing patterns it uncovers, is hardly affecting contemporary linguistics (Section 3), and I will argue that this gap stems from lack of clarity about its theoretical significance (Section 4). In particular, I will show that modern deep networks cannot be treated as blank slates meant to falsify innateness claims. They should rather be seen as algorithmic linguistic theories making predictions about utterance acceptability. I will then outline several issues that are currently standing in the way of taking deep nets seriously as linguistic theories. I will conclude in Section 5 by briefly discussing why taking this stance might be beneficial to computational and theoretical linguistics, and by sketching two possible ways to pursue LODNA-based linguistic theorizing.

2 Linguistically-oriented analysis of deep nets: The case of long-distance agreement

Linguists identify sensitivity to syntactic structure that is not directly observable in the signal as one of the core properties of human grammatical competence (Everaert et al., 2015). A paradigmatic test for structure sensitivity comes from agreement phenomena. For example, subject-verb number agreement in an English clause depends on the c-command relation between the subject noun phrase and the corresponding verb, and it is not affected by nouns intervening between the NP head and the verb:

(1) [The kid [near the toys in the boxes]] is tired.

In (1), an example of long-distance agreement, the fact that two plural nouns (toys and boxes) directly precede the main verb "is" does not affect its number, as the only noun that entertains the right relation with the verb is kid. As Everaert and colleagues' (2015) motto goes, it's all about structures, not strings!

Current deep network architectures, such as long short-term memory networks (LSTMs, Hochreiter and Schmidhuber, 1997), convolutional networks (CNNs, Kalchbrenner et al., 2014) or Transformers (Vaswani et al., 2017), do not encode any prior favoring a structural analysis of their input over a sequential one. It is natural then to ask whether they are able to correctly handle structure-dependent phenomena, such as long-distance agreement. Consequently, starting with the influential work of Linzen et al. (2016), long-distance agreement has become a standard test to probe their linguistic abilities.

Probably the most thorough analysis of long-distance number agreement in deep networks was the one we carried out in Gulordava et al. (2018). We focused on LSTMs trained as language models. That is, the networks were trained by exposing them to large amounts of textual data (samples from the Wikipedias of the relevant languages), with the task of predicting the next word given the context. No special tuning for the long-distance agreement challenge was applied. After this generic training, the networks were presented with sentence prefixes up to and excluding the second item in an agreement relation (e.g., The kid near the toys in the boxes...), and the probability they assigned to continuations with the right or wrong agreement (is/are) was measured. The experiment was conducted with a test set of genuine corpus examples in four languages (English, Hebrew, Italian and Russian), and considering various agreement relations (not only noun-verb but also, for example, verb-verb and adjective-noun). The networks got the right agreement with high accuracy in all languages and for all constructions. Even more impressively, the networks were also able to get agreement right when tested with nonsense sentences such as the one in (2), showing that they must extract syntactic rules at a rather abstract level.

(2) The carrot around the lions for the disasters... sings/*sing.

Finally, we compared the agreement accuracy of the Italian network with that of native speakers (both on corpus-extracted and nonsense sentences), finding that the network is only marginally lagging behind human performance.
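To make the probing protocol concrete, here is a minimal sketch of the prefix-scoring procedure in Python. This is not the original experimental code: it assumes an off-the-shelf GPT-2 model accessed through the Hugging Face transformers library as a stand-in for the LSTM language models we actually trained, but the logic, comparing the probability the model assigns to the right and wrong verb form after a sentence prefix, is the same.

```python
# Minimal sketch of prefix-probability agreement probing. Assumes the Hugging
# Face `transformers` library and GPT-2 as a stand-in language model; the
# original experiments used LSTMs trained on Wikipedia samples.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Log-probability the model assigns to `continuation` right after `prefix`."""
    prefix_ids = tokenizer.encode(prefix)
    cont_ids = tokenizer.encode(" " + continuation)  # leading space: a new word
    ids = torch.tensor([prefix_ids + cont_ids])
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # The logits at position t score the token at position t+1.
    return sum(logprobs[0, len(prefix_ids) + i - 1, tok].item()
               for i, tok in enumerate(cont_ids))

prefix = "The kid near the toys in the boxes"
right, wrong = (continuation_logprob(prefix, v) for v in ("is", "are"))
print(f"log P(is) = {right:.2f}, log P(are) = {wrong:.2f}, correct: {right > wrong}")
```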

Other studies tested different deep architectures, such as CNNs (Bernardy and Lappin, 2017) and Transformers (Goldberg, 2019b), confirming that they also largely succeed at long-distance agreement.

Deep nets have been tested for a number of other linguistic generalizations, such as those pertaining to filler-gap constructions, auxiliary fronting and case assignment. See Linzen and Baroni (2021) for a recent survey of LODNA specifically aimed at linguists.1 In pretty much all cases, while they departed here and there from human intuition, deep nets captured at least the general gist of the phenomena being investigated.

3 The gap

Results such as the ones on long-distance agreement I briefly reviewed should be of interest to theoretical linguists, since, as already mentioned, deep nets possess very different priors from those postulated by linguists as part of the universal language faculty, such as a predisposition for hierarchical structures (Hauser et al., 2002; Berwick and Chomsky, 2016; Adger, 2019). In reality, however, the growing body of work on LODNA is almost completely ignored by the current theoretical linguistics debate.

To support this claim with quantitative evidence, I looked at the impact of Tal Linzen's original paper on long-distance agreement in deep nets (Linzen et al., 2016). This is a highly-cited paper, having amassed 514 Google Scholar citations in less than 5 years.2 I sifted through these citations, keeping track of how many came from theoretical linguistics (under a very broad notion of what counts as theoretical linguistics). I found that only 6 citations qualified. Of these, 3 were opinion pieces, one of them written by Linzen himself. Note that the article does not lack general interdisciplinary appeal, as shown by many citations from psycho- and neuro-linguistics, and even 4 citations from the field of computational agricultural studies!

Perhaps Google Scholar does a poor job at tracking theoretical linguistic work. Indeed, David Adger's recent Language Unlimited volume (Adger, 2019) does extensively discuss Linzen's article, but I did not find it among the studies citing it according to Scholar. Thus, as a supplementary source of evidence, I also downloaded all papers from the front page of LingBuzz, a popular linguistics preprint archive.3 I filtered out papers that do not qualify as theoretical linguistics. Again, I tried to be inclusive: I excluded, for example, one paper about the aftermath of the "Pinker LSA letter" controversy (Kastner et al., 2021), but I did include one about phonosymbolism in Pokémon character

1 Lappin (2021) also provides a review of some of the relevant work, as part of a book-length treatment of the linguistic and cognitive implications of deep learning models.
2 Google Scholar queried on May 27th, 2021.
3 Papers downloaded from https://ling.auf.net/lingbuzz on May 27th, 2021. I downloaded all freshly changed and new papers, as well as all the papers in the Top Recent Downloads and Last 6 months sections of the front page.

names (Kawahara et al., 2021). This left me with a corpus of 37 papers. I then went through their bibliographies, looking for references to deep learning work, and finding... none!4

It is not fair to impute this lack of references to a possible endogamous bent of theoretical linguistics. To the contrary, the papers in my mini-corpus reveal considerable interdisciplinary breadth, with frequent references to neuroscience, ethology, psycholinguistics and sociolinguistics; they include statistical treatments of experimental and corpus data; and they use sophisticated computational tools, such as graph-theoretical analyses. It is really NLP, and in particular deep-learning-based NLP, that is missing from the party.

To understand this gap, we need to ask: why should linguists care about the grammatical analysis of deep networks? What is it supposed to tell us about human linguistic competence? In other words, what is the theoretical significance of LODNA?

4 The theoretical significance of linguistically-oriented deep net analysis

When LODNA researchers situate their work within a broader theoretical context, it is invariably in terms of nature-or-nurture arguments resting on a view of deep nets as blank slates. For example, when asked about the significance of his work for theoretical linguistics, Tal Linzen told me that deep-net simulations "can help linguists focus on the aspects [...] that truly require explanation in terms of innate constraints. If the simulation shows that there is plenty of data for the learner to acquire a particular phenomenon, maybe there's nothing to explain!" (Tal Linzen, p.c.). Similar claims are sprinkled throughout LODNA papers. Here are just a few examples (from otherwise excellent papers): "Our results also contribute to the long-running nature-nurture debate in language acquisition: whether the success of neural models implies that unbiased learners can learn natural languages with enough data, or whether human abilities to acquire language given sparse stimulus implies a strong innate human learning bias" (Papadimitriou and Jurafsky, 2020). "The APS [(argument from the poverty of the stimulus)] predicts that any artificial learner trained with no prior knowledge of the principles of syntax [...] must fail to make acceptability judgments with human-level accuracy. [...] If linguistically uninformed neural network models achieve human-level performance on specific phenomena [...], this would be clear evidence limiting the scope of phenomena for which the APS can hold" (Warstadt et al., 2019). "[I]f such a device [(a neural network)] could manage to replicate fine-grained human intuitions inducing them from the raw training input this would be evidence that exposure to language structures [...] should in principle be

4 I had performed a similar experiment in March 2021, by collecting papers from the latest issues of Linguistic Inquiry, Natural Language and Linguistic Theory and , with the very same outcome (no reference whatsoever to deep learning work).

sufficient to derive a syntactic competence, against the innatist hypothesis" (Chowdhury and Zamparelli, 2018).

Deep nets are linguistic theories, not blank slates

If blank slate arguments were (perhaps) valid for the simple connectionist models of the eighties (Rumelhart et al., 1986; Churchland, 1989; Clark, 1989), all modern deep networks possess highly-structured innate architectures that considerably weaken such arguments. Consider, for example, the Transformer (Vaswani et al., 2017), the current darling of NLP. A Transformer network is structured into a number of layered modules, each involving a complex bank of linear and non-linear transformations. These, in turn, differ in profound ways from the innate structure of an LSTM (Hochreiter and Schmidhuber, 1997). For example, an LSTM will read a sentence one word at a time, and will use a recurrent function to preserve information across time, whereas the Transformer will read the whole sentence in parallel, and use an extended backward and forward attention system to incorporate contextual information.

Even more importantly, as demonstrated by the widespread interest of NLP and machine learning researchers in proposing new architectures, differences in the supposedly "weak" and "general" biases of different deep nets lead them to behave very differently, given the same input data. A striking illustration of this was recently provided by Kharitonov and Chaabouni (2021), in a study of seq2seq deep nets, that is, networks trained to associate input and output sequences (as in, e.g., a translation task). Kharitonov and Chaabouni trained such networks on really tiny corpora that severely underspecify the input-output relation. The test-time behavior of the networks in cases where different generalizations lead to different outputs was then inspected, to reveal which innate preferences the networks brought to the task.

In one of their experiments, the whole training corpus consists of the following input → output examples:

(3) aabaa → b
    bbabb → a
    aaaaa → a
    bbbbb → b

The mini-corpus in (3) is compatible with (at least) two rules: a "hierarchical" one, stating that the output is generated by taking the character in the middle of the input; and a "linear" generalization, stating that the output is the third character in the input sequence.5 After training it with just the examples in (3), a network is exposed to a new input where the two rules lead to different predictions, e.g., aaabaaa, where the hierarchical generalization would pick b and the linear one a.
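For concreteness, the two rules can be rendered as toy Python functions (my own schematic illustration, not Kharitonov and Chaabouni's code): both fit the four training items in (3) perfectly, yet they diverge on longer probe inputs.

```python
# Two generalizations consistent with the mini-corpus in (3).
def hierarchical_rule(s: str) -> str:
    return s[len(s) // 2]  # the character in the middle of the input

def linear_rule(s: str) -> str:
    return s[2]            # the third character of the input

train = {"aabaa": "b", "bbabb": "a", "aaaaa": "a", "bbbbb": "b"}
assert all(hierarchical_rule(x) == y == linear_rule(x) for x, y in train.items())

# A longer probe input reveals which rule a trained network has induced:
print(hierarchical_rule("aaabaaa"), linear_rule("aaabaaa"))  # -> b a
```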

5 This can be seen as a schematic reproduction of classic poverty-of-the-stimulus thought experiments, such as the one built around English auxiliary fronting by Chomsky (1965).

Of four widely-used seq2seq models tested by Kharitonov and Chaabouni, two (LSTMs with attention and Transformers) show a strong preference for the hierarchical generalization, and two (LSTMs without attention and CNNs) show a strong preference for the linear generalization.

Studies such as this invalidate any blank-slate claim about deep nets. It is more appropriate, instead, to look at deep nets as linguistic theories, encoding non-trivial structural priors facilitating language acquisition and processing. Most linguists agree, explicitly or implicitly, that a linguistic theory is a computational system that, given an input utterance in a language, can predict whether the sequence would be acceptable to an idealized speaker of the language, typically under the view that such a computational system is a high-level representation of the speaker's mental knowledge of grammar (e.g., Chomsky, 1986; Sag et al., 2003; Müller, 2020). And a deep net trained on text with the standard language-modeling (that is, next-word prediction) objective will indeed do just that.6

It is undoubtedly easier to inspect the inner workings of a symbolic linguistic theory than those of a trained deep net. However, the classic objection that artificial neural networks can't be taken seriously as theories of mental faculties because they are unopenable black boxes (e.g., McCloskey, 1991) is much weaker today, in light of extensive progress in the development of methods to analyze network states and behaviors (Belinkov and Glass, 2019), including the whole LODNA tradition.

Why don't we see, then, many articles positioning deep nets as alternative or complementary theories to traditional grammatical formalisms? I believe that two crucial ingredients are still missing before deep nets can seriously contribute to contemporary linguistic theorizing.
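As a minimal illustration of the acceptability-prediction view (and of the choices it involves; see footnote 6), the following sketch scores a minimal pair by length-normalized log-probability under a pretrained language model. GPT-2 via the Hugging Face transformers library is again an assumed stand-in, and per-token log-probability is just one of several scores proposed in the literature (e.g., Lau et al., 2017).

```python
# Hedged sketch: a language model as an acceptability-predicting theory.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_logprob(sentence: str) -> float:
    """Per-token log-probability of a sentence, one possible acceptability score."""
    ids = torch.tensor([tokenizer.encode(sentence)])
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    n = ids.shape[1]
    # The logits at position t score the token at position t+1.
    return sum(logprobs[0, t, ids[0, t + 1]].item() for t in range(n - 1)) / (n - 1)

good = "The kid near the toys in the boxes is tired."
bad = "The kid near the toys in the boxes are tired."
print(mean_logprob(good) > mean_logprob(bad))  # the "theory" predicts True
```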

The problem of low commitment to models

Differences between deep nets, as we have discussed above, are huge. Mutatis mutandis, the difference between an LSTM, reading a word at a time and building a joint representation through its recurrent state, and a Transformer, processing all words in parallel to create multiple context-weighted representations, might be as large as that between a derivational and a constraint-based theory in formal linguistics, if not bigger. Other differences, such as those between the memory structures used by LSTMs and GRU recurrent networks (Cho et al., 2014), might not be as large, but certainly making similarly discrepant structural assumptions about, say, possible tree configurations would be considered a big deal in formal linguistics.

And, yet, researchers investigating the linguistic behavior of these architectures almost never provide a theoretically grounded motivation for why they

6 Just like in linguistics (e.g., Murphy, 2007; Lau et al., 2017; Sprouse and Schütze, 2019), there is considerable debate on the best way to elicit acceptability judgments from the models to compare them to human data, and on whether such judgments should be probabilistic or categorical (e.g., Linzen et al., 2016; Chowdhury and Zamparelli, 2018; Warstadt et al., 2019; Niu and Penn, 2020).

focused on one architecture or the other. Interest tends to shift with the state of the art in applied tasks such as machine translation or natural language inference. So, if nearly all early LODNA papers focused on LSTMs and GRUs, nowadays the field has almost entirely shifted to analyzing Transformer networks, not because the latter were found to be more plausible models of human language processing (if anything, their ability to read and process massive windows of text in parallel makes them less plausible models than recurrent networks), but because they became the mainstream approach in applied NLP, thanks to their astounding performance.

As a concrete illustration of this phenomenon, we can compare the LODNA papers from the first (2018) and third (2020) editions of the BlackBox NLP workshop (one of the core events in the area).7 In the 2018 edition, I found 13 full papers that broadly qualify as LODNA. Of them, 12 focus on LSTM analysis, with the remaining one already looking at the Transformer. In two years, the balance has completely shifted. All 9 relevant papers in the 2020 edition analyze some variant of the Transformer, with two also including LSTM variants among the comparison models. Importantly, in none of these papers is there a linguistically-oriented (or even engineering-oriented) discussion of why the Transformer was picked over the LSTM or other architectures. Indeed, in a few cases, earlier work that was based on LSTMs is cited as corroborating evidence, only mentioning in passing that this earlier work was based on a profoundly different architecture.

The problem is mostly sociological: NLP puts a strong (and reasonable) emphasis on whichever models work best in applications, and consequently analytical work will also tend to concentrate on such models. However, if radical changes in the underlying architecture are not motivated by linguistic considerations, and indeed they tend to be completely glossed over, it is hard to take this work seriously from the perspective of linguistic theorizing.8

The problem of paucity of interesting predictions

A good linguistic theory should not only fit what is already known about a language, but also make predictions about previously unexplored patterns. This is the typical modus operandi, for example, in generative syntax, where, e.g., hypotheses about possible syntactic configurations lead to strong typological predictions about acceptable adverb and adjective orders (e.g., Cinque, 1999, 2010).

The standard approach in LODNA, instead, is to check whether models capture well-known patterns, such as vanilla English subject-verb number agreement. The occasional focus on cases outside the standard paradigm is typically meant to highlight obviously wrong predictions made by the model (e.g.,

7 https://www.aclweb.org/anthology/volumes/W18-54/; https://www.aclweb.org/anthology/volumes/2020.blackboxnlp-1/
8 There are important exceptions. Work that does put an emphasis on the linguistic motivation of architectural choices includes that of Chris Dyer and colleagues on recurrent neural network grammars (e.g., Kuncoro et al., 2018b), and that of Paul Smolensky and colleagues on tensor product decomposition networks (e.g., McCoy et al., 2019).

Kuncoro et al., 2018a, show how in some syntactic configurations LSTMs let the verb agree with the first noun in a sentence even if it is not its subject). What we are doing, then, is an extensive (and important!) sanity check of our systems, rather than using them to widen the coverage of linguistic phenomena we are able to explain through computational modeling.

In order to move from sanity checking to employing our models as generators of interesting predictions, there are at least two issues we need to overcome. The first is that apparently insignificant changes in the very same model or in the way it is trained can lead to qualitatively different linguistic behavior. For example, McCoy et al. (2020a) study how seq2seq recurrent networks generalize auxiliary fronting. Consider how English yes/no questions are formed by moving an auxiliary to the first position of the sentence, as in:

(4) a. The zebra does chuckle.
    b. Does the zebra chuckle?

Chomsky (1965) observed that examples such as (4) are compatible with two generalizations: move the auxiliary of the main clause, as in (5-b); but also: move the first auxiliary in the sentence, as in the ungrammatical (5-c):

(5) a. Your zebras that don't dance do chuckle.
    b. Do your zebras that don't dance chuckle?
    c. *Don't your zebras that dance do chuckle?

Chomsky (1965) famously built a poverty-of-the-stimulus argument around these examples, noting that children rarely or never hear examples such as (5-b) when learning English, and yet they unfailingly go for the structure-sensitive generalization (move the main auxiliary).

McCoy and colleagues created a similar poverty-of-the-stimulus setup for LSTMs. Among other results, they found, strikingly, that the very same network, depending on different random initializations of its weights, could converge to a strong preference for the move-main or the move-first generalization. McCoy et al. (2020b) showed that Transformer networks sharing a large proportion of weights, with random initialization only affecting their topmost linear layers and the order in which training examples are presented, can converge to accuracies between 0% and 66% in an experiment probing the ability to generalize across natural language inference datasets. Liska et al. (2018) reported that the very same recurrent network trained from the same initialization could discover or fail to discover compositionality, simply depending on the different random orders in which the training data were presented.

In all these cases, our deep net "theories" make radically different predictions based on very trivial perturbations of the training process, which makes it difficult to trust such predictions. What's worse, in none of these cases do we understand how the relevant small, random differences in initialization or training schedule catastrophically affect the outcome.
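As a toy rendering of how different the two generalizations are as computational procedures (my own simplification, not McCoy and colleagues' setup): the linear rule needs only string positions, whereas the structure-sensitive rule needs access to a parse, supplied by hand below.

```python
AUXES = {"do", "does", "don't"}

def move_first_aux(words):
    # Linear rule: front the linearly first auxiliary, whatever clause it is in.
    i = next(i for i, w in enumerate(words) if w in AUXES)
    return [words[i]] + words[:i] + words[i + 1:]

def move_main_aux(words, main_aux_index):
    # Hierarchical rule: front the main-clause auxiliary. Its index would come
    # from a syntactic parse; here it is supplied by hand.
    i = main_aux_index
    return [words[i]] + words[:i] + words[i + 1:]

declarative = "your zebras that don't dance do chuckle".split()
print(" ".join(move_main_aux(declarative, 5)))  # (5-b) do your zebras that don't dance chuckle
print(" ".join(move_first_aux(declarative)))    # (5-c) *don't your zebras that dance do chuckle
```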

This leads to my second point: in order to use our models as generators of interesting predictions, we need to achieve a good mechanistic understanding of which components lead them to adopt a certain behavior, something which is seldom done in LODNA. As an example of the kind of study combining a granular understanding of a model's inner workings with a non-trivial prediction tested in humans, I will briefly summarize the detailed analysis of deep network behavior with respect to long-distance number agreement that we reported in Lakretz et al. (2019, 2021).

In the first of these studies, a cell-by-cell analysis of LSTMs performing the subject-verb agreement task revealed that they develop a sparse mechanism to store and propagate a single number feature between subject and verb. This sparse grammar-aware circuit is complemented by a distributed system that can fill in the number feature based on purely sequential heuristics. This leads to an interesting prediction for sentences with two embedded long-distance dependencies, such as:

(6) The kid_1 that the dogs_2 near the toy like_2/*likes_2 is_1/*are_1 tired.

Here, the sparse grammar-aware mechanism will be activated when kid is encountered, and, due to its sparsity, it will not be able to also record the number of dogs. Consequently, once like(s) is encountered, the heuristic distributed system will take over, and it will wrongly predict the singular form, since the sequentially closer noun is toy. On the other hand, once is/are is reached, the feature stored in the sparse long-distance circuit can be released, correctly predicting a preference for singular is. This is an interesting prediction because, intuitively, the longer-distance kid-is relation should be harder to track than the shorter-distance one connecting dogs and like.

In Lakretz et al. (2021), we proceeded to test the prediction both in LSTMs and with human subjects. We did indeed find the predicted inner/outer agreement asymmetry both in machines and (more weakly) in humans. This suggests that agreement might be implemented by means of sparse feature-carrying mechanisms in humans as well.

Arguably, this study focused on linguistic performance rather than competence. Nobody would claim that "The kid that the dogs near the toy likes is tired" is grammatical; it is simply an easier mistake to make during online language processing. Still, this work points the way towards how deep nets could profitably be used to lead linguistic theorizing, through an in-depth process of model analysis, hypothesis formulation and human subject testing.

Lakretz's study took about four years to run. By the time it was completed, it presented a detailed analysis of a model, the LSTM, that many in NLP would find obsolete. Its focus on a single grammatical construction might look quaint, now that the field has moved towards large-scale evaluation suites probing models on a variety of phenomena and tasks (e.g., Conneau and Kiela, 2018; Conneau et al., 2018; Marvin and Linzen, 2018; Wang et al., 2019; Warstadt et al., 2019). Yet, if we want to reach the sort of understanding of a deep model's inner workings that can be useful to gain new insights on human linguistic competence and behavior, we should have more studies running at the same slow, thorough, narrow-focused pace of this project!

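The predicted processing profile can be captured in a toy simulation (my own schematic rendering of the informal description above, not the trained LSTM, in which both the register and the decision of when to release it are learned): a single-slot register for the syntactically licensed number feature, plus a recency heuristic that takes over when the slot is unavailable.

```python
# Sentence (6), schematically: the nouns with their number, then the two verbs.
sentence = [("kid", "N", "sg"), ("dogs", "N", "pl"), ("toy", "N", "sg"),
            ("like(s)", "V_inner", None), ("is/are", "V_outer", None)]

register = None  # single sparse slot: claimed by the first subject, "kid"
recent = None    # number of the linearly closest noun (heuristic system)
for word, role, number in sentence:
    if role == "N":
        if register is None:
            register = number  # "kid" claims the slot; "dogs" cannot be stored
        recent = number
    elif role == "V_outer":
        print(word, "->", register)  # released feature: sg, correctly "is"
    else:
        print(word, "->", recent)    # recency fallback: sg, wrongly "*likes"
```

Running the loop yields sg for both verb slots: wrong for the inner verb (whose subject dogs is plural) and right for the outer one, which is exactly the inner/outer asymmetry described above.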

5 Conclusion

Language models based on deep network architectures such as the LSTM and the Transformer are computational devices that, by being exposed to large amounts of natural text, learn to assign probabilities to arbitrary word sequences. In the last five years or so, a rich tradition of studies has emerged that analyzes such models in order to understand what kind of grammatical competence they possess. The results of these studies are often intriguing, revealing the sophisticated linguistic skills of deep nets, as well as interesting error patterns. However, such studies have had very little impact on theoretical linguistics.

I attributed this gap to the fact that such studies typically lack a clear theoretical standing and, when they do have one, it is based on the wrong idea that we should treat modern deep nets as tabulae rasae lacking strong innate priors. Deep nets do possess such innate priors, as shown by the fact that different models trained on the same data can extract dramatically different generalizations. I proposed that a more solid theoretical standing for the linguistic analysis of deep nets can be achieved by treating them as algorithmic linguistic theories. I discussed above some concrete roadblocks we must overcome if we want to seriously adopt this stance. I will conclude by briefly explaining why I think that such a stance is beneficial for both computational and theoretical linguists, and by providing quick sketches of what deep-net-based linguistic theorizing could look like.

Why should computational linguists care?

The incredible progress in deep learning for NLP we've seen in the last few years must be entirely credited to NLP and machine-learning practitioners interested in solving concrete challenges such as machine translation. Ideas from theoretical linguistics have played no role in the area (Lappin, 2021), and there is no clear reason, in turn, why computational linguists interested in practical NLP technologies should care about the implications of their work for linguistics.

However, the success of events such as the already mentioned Society for Computation in Linguistics conference and BlackBox NLP workshop, as well as the fact that all major NLP conferences now feature special tracks on linguistic analysis of computational models, suggests that there is a significant sub-community of computational linguists who are interested in the linguistic implications of deep learning models. These researchers should be bothered by the fact that their work is not having an impact on mainstream theoretical linguistics. Clarifying the theoretical status of deep net simulations, and in particular boldly presenting them as alternative linguistic theories, might finally attract due attention from the linguistics community.

Why should theoretical linguists care?

Deep nets attained incredible empirical results in tasks that heavily depend on linguistic knowledge, such as machine translation (Edunov et al., 2018), well beyond what was ever achieved by symbolic or hybrid systems. While it is possible that deep nets rely on a completely different approach to language processing than the one encoded in human linguistic competence, theoretical linguists should investigate the building blocks that make these systems so effective: if for no other reason, at least in order to explain why a model that is supposedly encoding completely different priors than those programmed into the human brain should be so good at handling human language.

I conjecture, however, that deep nets and traditional symbolic theories are both valid algorithmic approaches to modeling human linguistic competence, and that they are complementary in the aspects they best explain. The more algebraic features of language, such as recursive structures, are elegantly handled by traditional linguistic formalisms such as generative syntax (Müller, 2020) and formal semantics (Heim and Kratzer, 1998). However, language has other facets, in particular those where the fuzzy, large-scale knowledge that characterizes the lexicon is involved, where such theories struggle. Neural language models, by inducing a large set of context-dependent and fuzzy patterns from natural input, and by being inherently able to probabilistically generate and process text, should be better equipped to handle phenomena such as polysemy, the partial productivity of morphological derivation, non-fully-compositional phrase formation and diachronic shift (e.g., Marelli and Baroni, 2015; Vecchi et al., 2017; Lenci, 2018; Boleda, 2020).9

From this angle, the current emphasis of LODNA on exactly those phenomena (such as long-distance agreement) that are already satisfactorily captured by traditional algebraic models might be misguided. Curiously, even staying within the syntax domain, I am aware of no work focusing instead on those patterns, such as partially lexicalized constructions (e.g., Goldberg, 2005, 2019a), where the fuzzier rules typically learned by neural networks might give us novel insights into human generalization.

Do neural network theories require a switch from algebraic to distributed models of linguistic competence?

The main topic of this volume is the role of algebraic systems in the representation of linguistic knowledge. By proposing a trained Transformer, with its billions of weights and its continuous activation vectors, as a linguistic theory, I am de facto implying that the appropriate level to represent linguistic knowledge is not algebraic, but massively distributed. This requires a radical methodological shift in the way linguistic models are studied. Standard rule- or constraint-based systems can easily be probed by direct inspection. With

9 These references mostly discuss a precursor of neural language models known as distributional semantics, but the same accounts could be replicated and extended using latest-generation neural language models.

deep networks, model probing requires sophisticated experiments: indeed, the whole area of LODNA can be seen as an example of how to gain insights from a linguistic theory based on distributed representations.

However, I would like to leave the issue of the right level for deep-net-based linguistic theorizing open. Optimality Theory (Prince and Smolensky, 2004) was the most fruitful outcome of early attempts to bring together linguistics and connectionism. Optimality Theory is an algebraic approach whose principles are inspired by how linguistic constraints might be implemented in a neural network. Could the way in which LSTMs or Transformers process linguistic information similarly inspire a symbolic theory of language? Perhaps one that is not based on tree structures but on storage and retrieval mechanisms akin to gating and attention?
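To make the last suggestion concrete, here is the scaled dot-product attention at the core of the Transformer (Vaswani et al., 2017), reduced to a few lines of NumPy: each query softly retrieves a mixture of stored values, weighted by query-key similarity, the kind of content-addressable storage-and-retrieval operation a symbolic theory might draw inspiration from. This is a sketch for illustration only; real Transformers add learned projections, multiple heads and stacked layers.

```python
import numpy as np

def attention(Q, K, V):
    # Content-based retrieval: each query row fetches a softmax-weighted
    # mixture of the value rows, with weights given by query-key similarity.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one retrieved vector per query
```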

To conclude, despite the criticism I directed at some aspects of the field, I think that LODNA is one of the most exciting things that happened to computational linguistics in the last five years. I hope that, once we clarify its theoretical standing, the body of evidence assembled in this area will finally have the impact it deserves on linguistics at large.

Acknowledgments

I would like to thank Jelke Bloem, Grzegorz Chrupala, Ido Dagan, Roberto Dessí, Emmanuel Dupoux, Dieuwke Hupkes, Shalom Lappin, Yair Lakretz, Paola Merlo, the members of the UPF Computational Linguistics and Linguistic Theory group, the participants in the EACL 2021 Birds-of-a-Feather Meetup on Linguistic Theories, the audience at EACL 2021 and, especially, David Adger, Gemma Boleda, Roberta D'Alessandro, Tal Linzen, Louise McNally and Adina Williams for a mixture of advice, stimulating discussion and constructive feedback.

References

David Adger. Language Unlimited: The Science Behind Our Most Creative Power. Oxford University Press, Oxford, UK, 2019.

Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019.

Jean-Philippe Bernardy and Shalom Lappin. Using deep neural networks to learn syntactic agreement. Linguistic Issues in Language Technology, 15(2):1–15, 2017.

Robert Berwick and Noam Chomsky. Why Only Us: Language and Evolution. MIT Press, Cambridge, MA, 2016.

Gemma Boleda. Distributional semantics and linguistic theory. Annual Review of Linguistics, 6:213–234, 2020.

Rui Chaves. What don't RNN language models learn about filler-gap dependencies? In Proceedings of SCiL, pages 20–30, New Orleans, LA, 2020.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734, Doha, Qatar, 2014.

Noam Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA, 1965.

Noam Chomsky. Knowledge of Language: Its Nature, Origin, and Use. Praeger, Westport, CT, 1986.

Shammur Chowdhury and Roberto Zamparelli. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of COLING, pages 133–144, Santa Fe, NM, 2018.

Paul Churchland. A Neurocomputational Perspective: The Nature of Mind and the Structure of Science. MIT Press, Cambridge, MA, 1989.

Guglielmo Cinque. Adverbs and Functional Heads: A Cross-Linguistic Perspective. Oxford University Press, Oxford, UK, 1999.

Guglielmo Cinque. The Syntax of Adjectives. MIT Press, Cambridge, MA, 2010.

Andy Clark. Microcognition: Philosophy, Cognitive Science, and Parallel Distributed Processing. MIT Press, Cambridge, MA, 1989.

Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of LREC, pages 1699–1704, Miyazaki, Japan, 2018.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of ACL, pages 2126–2136, Melbourne, Australia, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186, Minneapolis, MN, 2019.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of EMNLP, pages 489–500, Brussels, Belgium, 2018.

Martin Everaert, Marinus Huybregts, Noam Chomsky, Robert Berwick, and Johan Bolhuis. Structures, not strings: Linguistics as part of the cognitive sciences. Trends in Cognitive Sciences, 19(12):729–743, 2015.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of NAACL, pages 32–42, Minneapolis, MN, 2019.

Adele Goldberg. Constructions at Work: The Nature of Generalization in Language. Oxford University Press, Oxford, UK, 2005.

Adele Goldberg. Explain Me This: Creativity, Competition, and the Partial Productivity of Constructions. Princeton University Press, Princeton, NJ, 2019a.

Yoav Goldberg. Neural Network Methods for Natural Language Processing. Morgan & Claypool, San Francisco, CA, 2017.

Yoav Goldberg. Assessing BERT's syntactic abilities. https://arxiv.org/abs/1901.05287, 2019b.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. Colorless green recurrent networks dream hierarchically. In Proceedings of NAACL, pages 1195–1205, New Orleans, LA, 2018.

Marc Hauser, Noam Chomsky, and Tecumseh Fitch. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579, 2002.

Irene Heim and Angelika Kratzer. Semantics in Generative Grammar. Blackwell, Malden, MA, 1998.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Dan Jurafsky and James Martin. Speech and Language Processing, 2nd ed. Prentice Hall, Upper Saddle River, NJ, 2008.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, pages 655–665, Baltimore, MD, 2014.

Itamar Kastner, Hadas Kotek, Anonymous Anonymous, Rikker Dockum, Michael Dow, Maria Esipova, Caitlin Green, and Todd Snider. Who speaks for us? Lessons from the Pinker letter. https://ling.auf.net/lingbuzz/005381, 2021.

Shigeto Kawahara, Gakuji Kumagai, and Mahayana Godoy. English speakers can infer Pokémon types using sound symbolism. https://ling.auf.net/lingbuzz/005129, 2021.

Eugene Kharitonov and Rahma Chaabouni. What they do when in doubt: A study of inductive biases in seq2seq learners. In Proceedings of ICLR, Online event, 2021. Proceedings at: https://openreview.net/group?id=ICLR.cc/2021/Conference.

Adhiguna Kuncoro, Chris Dyer, John Hale, and Phil Blunsom. The perils of natural behavioral tests for unnatural models: The case of number agreement. Poster presented at the Learning Language in Humans and in Machines conference, online at: https://osf.io/view/L2HM/, 2018a.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of ACL, pages 1426–1436, Melbourne, Australia, 2018b.

Yair Lakretz, Germán Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. The emergence of number and syntax units in LSTM language models. In Proceedings of NAACL, pages 11–20, Minneapolis, MN, 2019.

Yair Lakretz, Dieuwke Hupkes, Alessandra Vergallito, Marco Marelli, Marco Baroni, and Stanislas Dehaene. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, 2021. In press.

Shalom Lappin. Deep Learning and Linguistic Representation. CRC Press, Boca Raton, FL, 2021.

Jey Han Lau, Alexander Clark, and Shalom Lappin. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 41(5):1202–1241, 2017.

Alessandro Lenci. Distributional models of word meaning. Annual Review of Linguistics, 4(14):1–21, 2018.

Tal Linzen and Marco Baroni. Syntactic structure from deep learning. Annual Review of Linguistics, 7:195–212, 2021.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016.

Adam Liska, Germán Kruszewski, and Marco Baroni. Memorize or generalize? Searching for a compositional RNN in a haystack. In Proceedings of the ICML AEGAP Workshop, Stockholm, Sweden, 2018.

Marco Marelli and Marco Baroni. Affixation in semantic space: Modeling morpheme meanings with compositional distributional semantics. Psychological Review, 122(3):485–515, 2015.

Rebecca Marvin and Tal Linzen. Targeted syntactic evaluation of language models. In Proceedings of EMNLP, pages 1192–1202, Brussels, Belgium, 2018.

Michael McCloskey. Networks and theories: The place of connectionism in cognitive science. Psychological Science, 2(6):387–395, 1991.

Thomas McCoy, Tal Linzen, Ewan Dunbar, and Paul Smolensky. RNNs implicitly implement tensor-product representations. In Proceedings of ICLR, New Orleans, LA, 2019. Published online: https://openreview.net/group?id=ICLR.cc/2019/conference.

Thomas McCoy, Robert Frank, and Tal Linzen. Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks. Transactions of the Association for Computational Linguistics, 8:125–140, 2020a.

Thomas McCoy, Junghyun Min, and Tal Linzen. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the EMNLP BlackboxNLP Workshop, pages 217–227, Online event, 2020b.

Stefan Müller. Grammatical Theory: From Transformational Grammar to Constraint-Based Approaches, 4th ed. Language Science Press, Berlin, Germany, 2020.

Brian Murphy. A Study of Notions of Participation and Discourse in Argument Structure Realisation. Dissertation, Trinity College Dublin, 2007.

Jingcheng Niu and Gerald Penn. Grammaticality and language modelling. In Proceedings of the EMNLP Eval4NLP Workshop, pages 110–119, Online event, 2020.

Isabel Papadimitriou and Dan Jurafsky. Learning music helps you read: Using transfer to study linguistic structure in language models. In Proceedings of EMNLP, pages 6829–6839, Online event, 2020.

Alan Prince and Paul Smolensky. Optimality Theory. Blackwell, Malden, MA, 2004.

David Rumelhart, James McClelland, and PDP Research Group, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, 1986.

Ivan Sag, Thomas Wasow, and Emily Bender. Syntactic Theory: A Formal Introduction. CSLI, Stanford, CA, 2003.

Steven Scholte. Fantastic DNimals and where to find them. NeuroImage, 180:112–113, 2016.

Jon Sprouse and Carson Schütze. Grammar and the use of data. In Bas Aarts, Jill Bowie, and Gergana Popova, editors, The Oxford Handbook of English Grammar, pages 40–58. Oxford University Press, Oxford, UK, 2019.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112, Montreal, Canada, 2014.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NIPS, pages 5998–6008, Long Beach, CA, 2017.

Eva Maria Vecchi, Marco Marelli, Roberto Zamparelli, and Marco Baroni. Spicy adjectives and nominal donkeys: Capturing semantic deviance using compositionality in distributional spaces. Cognitive Science, 41(1):102–136, 2017.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, New Orleans, LA, 2019. Published online: https://openreview.net/group?id=ICLR.cc/2019/conference.

Alex Warstadt, Amanpreet Singh, and Samuel Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.

Ning Xie, Gabrielle Ras, Marcel van Gerven, and Derek Doran. Explainable deep learning: A field guide for the uninitiated. https://arxiv.org/abs/2004.14545, 2020.
