
Emergence of Syntax Needs Minimal Supervision

Raphaël Bailly
SAMM, EA 4543, FP2M 2036 CNRS
Université Paris 1 Panthéon-Sorbonne
[email protected]

Kata Gábor
ERTIM, EA 2520
INALCO
[email protected]

Abstract

This paper is a theoretical contribution to the debate on the learnability of syntax from a corpus without explicit syntax-specific guidance. Our approach originates in the observable structure of a corpus, which we use to define and isolate grammaticality (syntactic information) and meaning/pragmatics information. We describe the formal characteristics of an autonomous syntax and show that it becomes possible to search for syntax-based lexical categories with a simple optimization process, without any prior hypothesis on the form of the model.

1 Introduction

Syntax is the essence of the human linguistic capacity that makes it possible to produce and understand a potentially infinite number of unheard sentences. The principle of compositionality (Frege, 1892) states that the meaning of a complex expression is fully determined by the meanings of its constituents and its structure; hence, our understanding of sentences we have never heard before comes from the ability to construct the sense of a sentence out of its parts. The number of constituents and assigned meanings is necessarily finite. Syntax is responsible for creatively combining them, and it is commonly assumed that syntax operates by means of algebraic compositional rules (Chomsky, 1957) and a finite number of syntactic categories.

One would also expect a computational model of language to have - or be able to acquire - this compositional capacity. The recent success of neural network based language models on several NLP tasks, together with their "black box" nature, attracted attention to at least two questions. First, when recurrent neural language models generalize to unseen data, does it imply that they acquire syntactic knowledge, and if so, does it translate into human-like compositional capacities (Baroni, 2019; Lake and Baroni, 2017; Linzen et al., 2016; Gulordava et al., 2018)? Second, can research into neural networks and linguistics benefit each other (Pater, 2019; Berent and Marcus, 2019), either by providing evidence that syntax can be learnt in an unsupervised fashion (Blevins et al., 2018), or, on the contrary, by showing that humans and machines alike need innate constraints on the hypothesis space (a universal grammar) (Adhiguna et al., 2018; van Schijndel et al., 2019)?

A closely related question is whether it is possible to learn a language's syntax exclusively from a corpus. The poverty of the stimulus argument (Chomsky, 1980) suggests that humans cannot acquire their target language from only positive evidence unless some of their linguistic knowledge is innate. The machine learning equivalent of this categorical "no" is a formulation known as Gold's theorem (Gold, 1967), which suggests that the complete unsupervised learning of a language (correct grammaticality judgments for every sequence) is intractable from only positive data. Clark and Lappin (2010) argue that Gold's paradigm does not resemble a child's learning situation and that there exist algorithms that can learn unconstrained classes of infinite languages (Clark and Eyraud, 2006). This ongoing debate on syntax learnability and the poverty of the stimulus can benefit from empirical and theoretical machine learning contributions (Lappin and Shieber, 2007; McCoy et al., 2018; Linzen, 2019).

In this paper, we argue that syntax can be inferred from a sample of natural language with very minimal supervision. We introduce an information-theoretic definition of what constitutes syntactic information. The linguistic basis of our approach is the autonomy of syntax, which we redefine in terms of (statistical) independence. We demonstrate that it is possible to establish a syntax-based lexical classification of words from a corpus without a prior hypothesis on the form of a syntactic model.

Our work is loosely related to previous attempts at optimizing language models for syntactic performance (Dyer et al., 2016; Adhiguna et al., 2018), and more particularly to Li and Eisner (2019), because of their use of mutual information and the information bottleneck principle (Tishby et al., 1999). However, our goal is different in that we demonstrate that very minimal supervision is sufficient in order to guide a symbolic or statistical learner towards grammatical competence.

2 Language models and syntax

As recurrent neural network based language models started to achieve good performance on different tasks (Mikolov et al., 2010), this success sparked attention on whether such models implicitly learn syntactic information. Language models are typically evaluated using perplexity on test data that is similar to the training examples. However, lower perplexity does not necessarily imply better syntactic generalization. Therefore, new tests have been put forward to evaluate the linguistically meaningful knowledge acquired by LMs.

A number of tests based on artificial data have been used to detect compositionality or systematicity in deep neural networks. Lake and Baroni (2017) created a task that requires executing commands expressed in a compositional language. Bowman et al. (2015) design a task of logical entailment relations to be solved by discovering a recursive compositional structure. Saxton et al. (2019) propose a semi-artificial probing task of mathematical problems.

Linzen et al. (2016) initiated a different line of linguistically motivated evaluation of RNNs. Their data set consists of minimal pairs that differ in grammaticality and instantiate sentences with long distance dependencies (e.g. number agreement). The model is supposed to give a higher probability to the grammatical sentence. The test aims to detect whether the model can solve the task even when this requires knowledge of a hierarchical structure. Subsequently, several alternative tasks were created along the same lines to overcome specific shortcomings (Bernardy and Lappin, 2017; Gulordava et al., 2018), or to extend the evaluation to different languages or phenomena (Ravfogel et al., 2018, 2019).

It was also suggested that the information content of a network can be tested using "probing tasks" or "diagnostic classifiers" (Giulianelli et al., 2018; Hupkes et al., 2018). This approach consists of extracting a representation from a NN and using it as input for a supervised classifier to solve a different linguistic task. Accordingly, probes were conceived to test whether the model learned parts of speech (Saphra and Lopez, 2018), morphology (Belinkov et al., 2017; Peters et al., 2018a), or syntactic information. Tenney et al. (2019) evaluate contextualized word representations on syntactic and semantic sequence labeling tasks. Syntactic knowledge can be tested by extracting constituency trees from a network's hidden states (Peters et al., 2018b) or from its word representations (Hewitt and Manning, 2019). Other syntactic probe sets include the work of Conneau et al. (2018) and Marvin and Linzen (2018).

Despite the vivid interest in the topic, no consensus seems to emerge from the experimental results. Two competing hypotheses stand out:

• Deep neural language models generalize by learning human-like syntax: given a sufficient amount of training data, RNN models approximate human compositional skills and implicitly encode hierarchical structure at some level of the network. This conjecture coincides with the findings of, among others, Bowman et al. (2015); Linzen et al. (2016); Giulianelli et al. (2018); Gulordava et al. (2018); Adhiguna et al. (2018).

• The language model training objective does not allow learning compositional syntax from a corpus alone, no matter what amount of training data the model was exposed to. Syntax learning can only be achieved with task-specific guidance, either as explicit supervision, or by restricting the hypothesis space to hierarchically structured models (Dyer et al., 2016; Marvin and Linzen, 2018; Chowdhury and Zamparelli, 2018; van Schijndel et al., 2019; Lake and Baroni, 2017).

Moreover, some shortcomings of the above probing methods make it more difficult to come to a conclusion. Namely, it is not trivial to come up with minimal pairs of naturally occurring sentences that are equally likely. Furthermore, assigning a (slightly) higher probability to one sentence does not reflect the nature of the knowledge behind a grammaticality judgment. Diagnostic classifiers may do well on a linguistic task because they learn to solve it, not because their input contains a hierarchical structure (Hewitt and Liang, 2019).

In what follows, we present our assessment of how the difficulty of creating a linguistic probing data set is interconnected with the theoretical problem of learning a model of syntactic competence.

2.1 Competence or performance, or why syntax drowns in the corpus

If syntax is an autonomous module of linguistic capacity, the rules and principles that govern it are formulated independently of meaning. However, a corpus is a product of language use, or performance. Syntax constitutes only a subset of the rules that generate such a product; the others include communicative needs and pragmatics. Just as meaning is uncorrelated with grammaticality, corpus frequency is only remotely correlated with human grammaticality judgment (Newmeyer, 2003).

Language models learn a probability distribution over sequences of words. The training objective is not designed to distinguish the grammatical from the agrammatical, but to predict language use. While Linzen et al. (2016) found a correlation between the perplexity of RNN language models and their syntactic knowledge, subsequent studies (Bernardy and Lappin, 2017; Gulordava et al., 2018) recognized that this result could have been achieved by encoding lexical semantic information, such as argument typicality. E.g. "in 'dogs (...) bark', an RNN might get the right agreement by encoding information about what typically barks" (Gulordava et al., 2018).

Several papers revealed the tendency of deep neural networks to fixate on surface cues and heuristics instead of "deep" generalization in solving NLP tasks (Levy et al., 2015; Niven and Kao, 2019). In particular, McCoy et al. (2019) identify three types of syntactic heuristics that get in the way of meaningful generalization in language models.

Finally, it is difficult to build a natural language data set without semantic cues. Results from syntax-semantics interface research show that lexical semantic properties account for part of syntactic realization (Levin and Rappaport Hovav, 2005).

3 What is syntax a generalization of?

We have seen in section 2 that previous works on the linguistic capacity of neural language models concentrate on compositionality, the key to creative use of language. However, this creativity is not present in language models: they are bound by the type of the data they are exposed to in learning.

We suggest that it is still possible to learn syntactic generalization from a corpus, but not with likelihood maximization. We propose to isolate the syntactic information from shallow, performance-related information. In order to identify such information without explicitly injecting it as direct supervision or model-dependent linguistic presuppositions, we propose to examine inherent structural properties of corpora. As an illustration, consider the following natural language sample:

cats eat rats
rats fear cats
mathematicians prove theorems
doctors heal wounds

According to the Chomskyan principle of the autonomy of syntax (Chomsky, 1957), the syntactic rules that define well-formedness can be formulated without reference to meaning and pragmatics. For instance, the sentence Colorless green ideas sleep furiously is grammatical for humans, despite being meaningless and unlikely to occur. We study whether it is possible to deduce, from the structural properties of our sample above, human-like grammaticality judgments that predict sequences like cats rats fear as agrammatical, and accept e.g. wounds eat theorems as grammatical.

We distinguish two levels of observable structure in a corpus:

1. proximity: the tendency of words to occur in the vicinity of each other (in the same document, same sentence, etc.);
2. the order in which the words appear.

Definition 1. Let L be a language over vocabulary V. The language that contains every possible sequence obtained by shuffling the elements in a sequence of L will be denoted $\bar{L}$.

If $V^*$ is the set of every possible sequence over vocabulary V and L is the language instantiated by our corpus, L is generated by a mixture of contextual and syntactic constraints over $V^*$. We are looking to separate the syntactic specificities from the grammatically irrelevant, contextual cues. The processes that transform $V^*$ into $\bar{L}$, and $\bar{L}$ into L,

$$V^* \xrightarrow{\text{proximity}} \bar{L} \xrightarrow{\text{order}} L$$

are entirely dependent on words: it should be possible to encode the information used by these processes into word categories.
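To make Definition 1 concrete, the following sketch (an illustration added here, not code from the paper; the helper names are ours) abbreviates each word of the toy sample by its initial letter, builds the language L, and enumerates the shuffled language L-bar.

```python
# Illustrative sketch (not from the paper): the toy language L from the sample
# corpus and its shuffled counterpart L-bar (Definition 1).
from itertools import permutations

corpus = [
    "cats eat rats",
    "rats fear cats",
    "mathematicians prove theorems",
    "doctors heal wounds",
]

# Abbreviate every word by its initial letter, as in the paper's example.
L = {"".join(w[0] for w in sentence.split()) for sentence in corpus}
print(sorted(L))  # ['cer', 'dhw', 'mpt', 'rfc']

# L-bar: every sequence obtained by shuffling the elements of a sequence of L.
L_bar = {"".join(p) for seq in L for p in permutations(seq)}
print(len(L_bar))  # 24 sequences in total
```

Under this construction, cats rats fear (crf) appears in L-bar but not in L: proximity alone licenses it, while order does not.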

In what follows, we will provide tools to isolate the information involved in proximity from the information involved in order. We also relate these categories to linguistically relevant notions.

3.1 Isolating syntactic information

For a given word, we want to identify the information involved in each type of structure of the corpus, and represent it as partitions of the vocabulary into lexical categories:

1. Contextual information is any information unrelated to sentence structure, and hence, grammaticality: this encompasses meaning, topic, pragmatics, corpus artefacts etc. The surface realization of sentence structure is a language-specific combination of word order and morphological markers.

2. Syntactic information is the information related to sentence structure and - as per the autonomy requirement - nothing else: it is independent of all contextual information.

In the rest of the paper we will concentrate on English as an example, a language in which syntactic information is primarily encoded in order. In section 5 we present our ideas on how to deal with morphologically richer languages.

Definition 2. Let L be a language over vocabulary $V = \{v_1, \dots\}$, and let $P = (V, C, \pi: V \mapsto C)$ be a partition of V into categories C. Let $\pi(L)$ denote the language that is created by replacing a sequence of elements in V by the sequence of their categories.

One defines the partition $P_{tot} = \{\{v\}, v \in V\}$ (one category per word) and the partition $P_{nul} = \{V\}$ (every word in the same category). $P_{tot}$ is such that $\pi_{tot}(L) \sim L$. The minimal partition $P_{nul}$ does not contain any information.

A partition $P = (V, C, \pi)$ is contextual if it is impossible to determine word order in language L from sequences of its categories:

Definition 3. Let L be a language over vocabulary V, and let $P = (V, C, \pi)$ be a partition over V. The partition P is said to be contextual if

$$\pi(\bar{L}) = \pi(L)$$

The trivial partition $P_{nul}$ is always contextual.

Example. Consider the natural language sample above. We refer to the words by their initial letters: r(ats), e(at), ..., thus we have $V = \{c, e, r, f, m, p, t, d, h, w\}$ and $L = \{cer, rfc, mpt, dhw\}$. One can check that the partition $P_1$:

$c_1 = \{c, r, e, f\}$
$c_2 = \{m, p, t\}$
$c_3 = \{d, h, w\}$

is contextual: the well-formed sequences over this partition are $c_1c_1c_1$, $c_2c_2c_2$ and $c_3c_3c_3$. These patterns convey the information that words like 'mathematicians' and 'theorems' occur together, but do not provide information on order. Therefore $\pi_1(L) = \{c_1c_1c_1, c_2c_2c_2, c_3c_3c_3\} = \pi_1(\bar{L})$. $P_1$ is also a maximal partition with that property: any further splitting leads to order-specific patterns. Intuitively, this partition corresponds to the semantic categories Animals = $\{r, c, e, f\}$, Science = $\{m, p, t\}$, and Medicine = $\{d, h, w\}$.

A syntactic partition has two characteristics: its patterns encode the structure (in our case, order), and it is completely autonomous with respect to contextual information. Let us now express this autonomy formally. Two partitions of the same vocabulary are said to be independent if they do not share any information with respect to language L. In other words, if we translate a sequence of words from L into their categories from one partition, this sequence of categories will not provide any information on how the sequence translates into categories from the other partition:

Definition 4. Let L be a language over vocabulary V, and let $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ be two partitions of V. P and P' are considered as independent with respect to L if

$$\forall c_{i_1} \dots c_{i_n} \in \pi(L),\ \forall c'_{j_1} \dots c'_{j_n} \in \pi'(L):\quad \pi^{-1}(c_{i_1} \dots c_{i_n}) \cap \pi'^{-1}(c'_{j_1} \dots c'_{j_n}) \neq \emptyset$$

Definition 5. Let L be a language over V, and let $P = (V, C, \pi)$ be a partition. P is said to be syntactic if it is independent of any contextual partition of V.

A syntactic partition is hence a partition that does not share any information with contextual partitions; or, in linguistic terms, a syntactic pattern is equally applicable to any contextual category.
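As an illustration of Definition 3 (ours, not the authors'; the function names are hypothetical), the following sketch checks whether a partition of the toy vocabulary is contextual by comparing the category patterns of L and of its shuffled version.

```python
# Illustrative sketch (not from the paper): testing Definition 3 on the toy corpus.
from itertools import permutations

L = {"cer", "rfc", "mpt", "dhw"}
L_bar = {"".join(p) for s in L for p in permutations(s)}

def patterns(language, pi):
    """Translate every sequence of words into its sequence of categories."""
    return {"".join(pi[w] for w in seq) for seq in language}

def is_contextual(pi):
    """Definition 3: P is contextual iff pi(L-bar) = pi(L)."""
    return patterns(L_bar, pi) == patterns(L, pi)

# P1: the 'semantic' partition Animals / Science / Medicine.
P1 = {**{w: "c1" for w in "cref"}, **{w: "c2" for w in "mpt"}, **{w: "c3" for w in "dhw"}}
# P2: the 'syntactic' partition Noun / Verb.
P2 = {**{w: "c4" for w in "crmtdw"}, **{w: "c5" for w in "efph"}}

print(is_contextual(P1))  # True: shuffling adds no new category patterns
print(is_contextual(P2))  # False: c4c5c4 vs. e.g. c5c4c4 after shuffling
```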

Example. We can see that the partition $P_2$:

$c_4 = \{c, r, m, t, d, w\}$
$c_5 = \{e, f, p, h\}$

is independent of the partition $P_1$: one has $\pi_2(L) = \{c_4c_5c_4\}$. Knowing the sequence $c_4c_5c_4$ does not provide any information on which $P_1$ categories the words belong to. $P_2$ is therefore a syntactic partition.

Looking at the corpus, one might be tempted to consider a partition $P_3$ that sub-divides $c_4$ into subject nouns, object nouns, and - if each word can be mapped to only one category - "ambiguous" nouns:

$c_6 = \{m, d\}$
$c_7 = \{t, w\}$
$c_8 = \{c, r\}$
$c_9 = \{e, f, p, h\}$

The patterns corresponding to this partition would be $\pi_3(L) = \{c_6c_9c_7, c_8c_9c_8\}$. These patterns will not predict that the sentence wounds eat theorems is grammatical, because the word wounds was only seen as an object. If we want to learn the correct generalization we need to reject this partition in favour of $P_2$. This is indeed what happens by virtue of Definition 5. We notice that the patterns over $P_3$ categories are not independent of the contextual partition $P_1$: one can deduce from the rule $c_8c_9c_8$ that the corresponding sentence cannot be of category $c_2$, for example:

$$\pi_3^{-1}(c_8c_9c_8) \cap \pi_1^{-1}(c_2c_2c_2) = \emptyset$$

$P_3$ is hence rejected as a syntactic partition.

$P_2$ is the maximal syntactic partition: any further distinction that does not conflate $P_1$ categories would lead to an inclusion of contextual information. We can indeed see that category $c_4$ corresponds to Noun and $c_5$ corresponds to Verb. The syntactic rule for the sample is Noun Verb Noun. It becomes possible to distinguish between syntactic and contextual acceptability: cats rats fear is acceptable as a contextual pattern $c_1c_1c_1$ under 'Animals', but is not a valid syntactic pattern. The sequence wounds eat theorems is syntactically well-formed as $c_4c_5c_4$, but does not correspond to a valid contextual pattern.

In this section we provided the formal definitions of syntactic information and of the broader contextual information. By an illustrative example we gave an intuition of how we apply the autonomy of syntax principle in a non-probabilistic grammar. We now turn to the probabilistic scenario and the inference from a corpus.

4 Syntactic and contextual categories in a corpus

As we have seen in section 2, probabilistic language modeling with a likelihood maximization objective has no incentive to concentrate on syntactic generalizations. In what follows, we demonstrate that, using the autonomy of syntax principle, it is possible to infer syntactic categories for a probabilistic language.

A stochastic language L is a language which assigns a probability to each sequence. As an illustration of such a language, we consider the empirical distribution induced from the sample in section 3:

$$L = \{cer(\tfrac{1}{4}),\ rfc(\tfrac{1}{4}),\ mpt(\tfrac{1}{4}),\ dhw(\tfrac{1}{4})\}$$

We will denote by $p_L(v_{i_1} \dots v_{i_n})$ the probability distribution associated to L.

Definition 6. Let V be a vocabulary. A (probabilistic) partition of V is defined by $P = (V, C, \pi: V \mapsto \mathcal{P}(C))$ where $\mathcal{P}(C)$ is the set of probability distributions over C.

Example. The following probabilistic partitions correspond to the non-probabilistic partitions (contextual and syntactic, respectively) defined in section 3. We will now consider these partitions in the context of the probabilistic language L.

$$\pi_1 = \begin{array}{c|ccc} & c_1 & c_2 & c_3 \\ \hline c & 1 & 0 & 0 \\ r & 1 & 0 & 0 \\ e & 1 & 0 & 0 \\ f & 1 & 0 & 0 \\ m & 0 & 1 & 0 \\ p & 0 & 1 & 0 \\ t & 0 & 1 & 0 \\ d & 0 & 0 & 1 \\ h & 0 & 0 & 1 \\ w & 0 & 0 & 1 \end{array} \qquad \pi_2 = \begin{array}{c|cc} & c_4 & c_5 \\ \hline c & 1 & 0 \\ r & 1 & 0 \\ e & 0 & 1 \\ f & 0 & 1 \\ m & 1 & 0 \\ p & 0 & 1 \\ t & 1 & 0 \\ d & 1 & 0 \\ h & 0 & 1 \\ w & 1 & 0 \end{array}$$

From a probabilistic partition $P = (V, C, \pi)$ as defined above, one can map a stochastic language L to a stochastic language $\pi(L)$ over the sequences of categories:

$$p_\pi(c_{i_1} \dots c_{i_n}) = \sum_{u_{j_1} \dots u_{j_n}} \Big( \prod_k \pi(c_{i_k} \mid u_{j_k}) \Big)\, p_L(u_{j_1} \dots u_{j_n})$$

As in the non-probabilistic case, the language $\bar{L}$ will be defined as the language obtained by shuffling the sequences in L.
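The following sketch (our illustration, not code from the paper) applies the projection formula above to the toy stochastic language, computing the category-level distribution $p_\pi$ for the hard partitions $\pi_1$ and $\pi_2$.

```python
# Illustrative sketch (not from the paper): computing p_pi(c_i1 ... c_in)
# for the toy stochastic language and the partitions pi_1 and pi_2.
from itertools import product

p_L = {"cer": 0.25, "rfc": 0.25, "mpt": 0.25, "dhw": 0.25}

# Hard (one-hot) probabilistic partitions: pi[word][category] = probability.
pi_1 = {w: {"c1": 1.0} for w in "cref"}
pi_1.update({w: {"c2": 1.0} for w in "mpt"})
pi_1.update({w: {"c3": 1.0} for w in "dhw"})

pi_2 = {w: {"c4": 1.0} for w in "crmtdw"}
pi_2.update({w: {"c5": 1.0} for w in "efph"})

def push_forward(p_L, pi):
    """p_pi(c_1..c_n) = sum_u (prod_k pi(c_k | u_k)) * p_L(u_1..u_n)."""
    p_pi = {}
    for seq, p in p_L.items():
        cats = [pi[w] for w in seq]
        for combo in product(*(c.items() for c in cats)):
            key = " ".join(c for c, _ in combo)
            weight = p
            for _, q in combo:
                weight *= q
            p_pi[key] = p_pi.get(key, 0.0) + weight
    return p_pi

print(push_forward(p_L, pi_1))  # {'c1 c1 c1': 0.5, 'c2 c2 c2': 0.25, 'c3 c3 c3': 0.25}
print(push_forward(p_L, pi_2))  # {'c4 c5 c4': 1.0}
```

The two outputs match the stochastic patterns $\pi_1(L)$ and $\pi_2(L)$ given in the example below.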

Definition 7. Let L be a stochastic language over vocabulary V. We will denote by $\bar{L}$ the language obtained by shuffling the elements in the sequences of L in the following way: for a sequence $v_1 \dots v_n$, one has

$$p_{\bar{L}}(v_1 \dots v_n) = \frac{1}{n!} \sum_{(i_1 \dots i_n) \in \sigma(n)} p_L(v_{i_1} \dots v_{i_n})$$

One can easily check that $\pi(\bar{L}) = \overline{\pi(L)}$.

Example. The stochastic patterns of L over the two partitions are, respectively:

$$\pi_1(L) = \{c_1c_1c_1(\tfrac{1}{2}),\ c_2c_2c_2(\tfrac{1}{4}),\ c_3c_3c_3(\tfrac{1}{4})\}$$
$$\pi_2(L) = \{c_4c_5c_4(1)\}$$

We can now define a probabilistic contextual partition:

Definition 8. Let L be a stochastic language over vocabulary V, and let $P = (V, C, \pi)$ be a probabilistic partition. P will be considered as contextual if

$$\pi(\bar{L}) = \pi(L)$$

We now want to express the independence of syntactic partitions from contextual partitions. The independence of two probabilistic partitions can be construed as an independence between two random variables:

Definition 9. Consider two probabilistic partitions $P = (V, C, \pi)$ and $P' = (V, C', \pi')$. We will use the notation

$$(\pi \cdot \pi')_v(c_i, c'_j) = \pi_v(c_i)\, \pi'_v(c'_j)$$

and the notation

$$P \cdot P' = (V, C \times C', \pi \cdot \pi')$$

P and P' are said to be independent (with respect to L) if the distributions inferred over sequences of their categories are independent:

$$\forall w \in \pi(L),\ \forall w' \in \pi'(L),\quad p_{\pi \cdot \pi'}(w, w') = p_\pi(w)\, p_{\pi'}(w')$$

A syntactic partition will be defined by its independence from contextual information:

Definition 10. Let P be a probabilistic partition, and L a stochastic language. The partition P is said to be syntactic if it is independent (with respect to L) of any possible probabilistic contextual partition of L.

Example. The partition $P_1$ is contextual, as $\pi_1(\bar{L}) = \pi_1(L)$. The partition $P_2$ is clearly independent of $P_1$ w.r.t. L.

4.1 Information-theoretic formulation

The definitions above may need to be relaxed if we want to infer syntax from natural language corpora, where strict independence cannot be expected. We propose to reformulate the definitions of contextual and syntactic information in the information theory framework.

We present a relaxation of our definition based on Shannon's information theory (Shannon, 1948). We seek to quantify the amount of information in a partition $P = (V, C, \pi)$ with respect to a language L. Shannon's entropy provides an appropriate measure. Applied to $\pi(L)$, it gives

$$H(\pi(L)) = - \sum_{w \in \pi(L)} p_\pi(w) \log(p_\pi(w))$$

For a simpler illustration, from now on we will consider only languages composed of fixed-length sequences s, i.e. $|s| = n$ for a given n. If L is such a language, we will consider the unigram language $\underline{L}$ as the language of sequences of size n defined by

$$p_{\underline{L}}(v_{i_1} \dots v_{i_n}) = \prod_j p_L(v_{i_j})$$

where $p_L(v)$ is the frequency of v in language L.

Proposition 1. Let L be a stochastic language and $P = (V, C, \pi)$ a partition. One has:

$$H(\pi(\underline{L})) \geq H(\pi(\bar{L})) \geq H(\pi(L))$$

with equality iff the corresponding stochastic languages are equal.

Let C be a set of categories. For a given distribution over the categories $p(c_i)$, the partition defined by $\pi(c_i \mid v) = p(c_i)$ (constant distribution w.r.t. the vocabulary) contains no information on the language. One has $p_\pi(c_{i_1} \dots c_{i_k}) = p(c_{i_1}) \dots p(c_{i_k})$, which is the unigram distribution; in other words, $\pi(L) = \pi(\underline{L})$. As the amount of syntactic or contextual information contained in $\underline{L}$ can be considered as zero, a consistent definition of the information would be:

Definition 11. Let $P = (V, C, \pi)$ be a partition, and L a language. The information contained in P with respect to L is defined as

$$I_L(P) = H(\pi(\underline{L})) - H(\pi(L))$$

Lemma 1. The information $I_L(P)$ defined as above is always positive. One has $I_{\bar{L}}(P) \leq I_L(P)$, with equality iff $\pi(\bar{L}) = \pi(L)$.
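To make these quantities concrete, the sketch below (ours, not the paper's code; it reuses the toy helpers of the earlier snippets and our reconstruction of the unigram language) computes $I_L(P)$ and $I_{\bar{L}}(P)$ for $P_1$ and $P_2$ on the toy language.

```python
# Illustrative sketch (not from the paper): entropies and information
# I_L(P) = H(pi(L_unigram)) - H(pi(L)) for the toy language and two partitions.
from itertools import permutations, product
from math import log2

p_L = {"cer": 0.25, "rfc": 0.25, "mpt": 0.25, "dhw": 0.25}
P1 = {**{w: "c1" for w in "cref"}, **{w: "c2" for w in "mpt"}, **{w: "c3" for w in "dhw"}}
P2 = {**{w: "c4" for w in "crmtdw"}, **{w: "c5" for w in "efph"}}

def shuffled(p):
    """L-bar (Definition 7): average over all permutations of every sequence."""
    out = {}
    for seq, prob in p.items():
        perms = ["".join(q) for q in permutations(seq)]
        for s in perms:
            out[s] = out.get(s, 0.0) + prob / len(perms)
    return out

def unigram(p, n=3):
    """Unigram language: length-n sequences drawn independently from word frequencies."""
    freq = {}
    for seq, prob in p.items():
        for w in seq:
            freq[w] = freq.get(w, 0.0) + prob / len(seq)
    out = {}
    for ws in product(freq, repeat=n):
        q = 1.0
        for w in ws:
            q *= freq[w]
        out["".join(ws)] = q
    return out

def project(p, pi):
    """pi(L): the distribution over category sequences."""
    out = {}
    for seq, prob in p.items():
        key = "".join(pi[w] for w in seq)
        out[key] = out.get(key, 0.0) + prob
    return out

def H(p):
    """Shannon entropy in bits."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def info(p, pi):
    """I_L(P) = H(pi(L_unigram)) - H(pi(L))."""
    return H(project(unigram(p), pi)) - H(project(p, pi))

for name, pi in [("P1", P1), ("P2", P2)]:
    print(name, round(info(p_L, pi), 3), round(info(shuffled(p_L), pi), 3))
# P1: the two values coincide (contextual partition);
# P2: the value on the shuffled language is strictly smaller, as in Lemma 1.
```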

After having defined how to measure the amount of information in a partition with respect to a language, we now translate the independence between two partitions into the terms of mutual information:

Definition 12. We follow the notations of Definition 9. We define the mutual information of two partitions $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ with respect to L as

$$I_L(P; P') = H(P) + H(P') - H(P \cdot P')$$

This directly implies:

Lemma 2. $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ are independent w.r.t. L $\Leftrightarrow$ $I_L(P; P') = 0$.

This comes from the fact that, by construction, the marginal distributions of $\pi \cdot \pi'$ are the distributions $\pi$ and $\pi'$.

With these two definitions, we can now propose an information-theoretic reformulation of what constitutes a contextual and a syntactic partition:

Proposition 2. Let L be a stochastic language over vocabulary V, and let $P = (V, C, \pi)$ be a probabilistic partition.

• P is contextual iff $I_L(P) = I_{\bar{L}}(P)$

• P is syntactic iff, for any contextual partition $P_*$, $I_L(P; P_*) = 0$

4.2 Relaxed formulation

If we deal with non-artificial samples of natural language data, we need to prepare for sampling issues and word (form) ambiguity that make the above formulation of independence too strict. Consider for instance adding the following sentence to the previous sample:

doctors heal fear

The distinction between syntactic and contextual categories is not as clear as before. We need a relaxed formulation for real corpora: we introduce γ-contextual and µ, γ-syntactic partitions.

Definition 13. Let L be a stochastic language.

• A partition P is considered as γ-contextual if it minimizes

$$I_L(P)(1 - \gamma) - I_{\bar{L}}(P) \quad (1)$$

• A partition P is considered µ, γ-syntactic if it minimizes

$$\max_{P_*} I_L(P; P_*) - \mu\, I_L(P) \quad (2)$$

for any γ-contextual partition $P_*$.

Let P and P' be two partitions for L, such that

$$\Delta_I(L) = I_L(P') - I_L(P) \geq 0$$

Then the γ-contextual program (1) would choose P' over P iff

$$\frac{\Delta_I(L) - \Delta_I(\bar{L})}{\Delta_I(L)} \leq \gamma$$

Let $P_*$ be a γ-contextual partition, and let

$$\Delta_{MI}(L, P_*) = I_L(P'; P_*) - I_L(P; P_*)$$

Then the µ, γ-syntactic program (2) would choose P' over P iff

$$\frac{\Delta_{MI}(L, P_*)}{\Delta_I(L)} \leq \mu$$

Example. Let us consider the following partitions:

- $P_1$ and $P_2$ refer to the previous partitions above: {Animals, Science, Medicine} and {Noun, Verb}

- $P_A$ is adapted from $P_1$ so that 'fear' belongs to both Animals and Medicine: $\{c, e, r, f(\tfrac{1}{2})\}, \{m, p, t\}, \{d, h, w, f(\tfrac{1}{2})\}$

- $P_B$ merges Animals and Medicine from $P_1$: $\{c, e, r, f, d, h, w\}, \{m, p, t\}$

- $P_{sent}$ describes the probability for a word to belong to a given sentence (5 categories)

- $P_C$ is adapted from $P_2$ so that 'fear' belongs to both Verb and Noun: $\{c, r, m, t, d, w, f(\tfrac{1}{2})\}, \{e, p, h, f(\tfrac{1}{2})\}$

- $P_D$ is adapted from $P_2$ and creates a special category for 'fear': $\{c, r, m, t, d, w\}, \{e, p, h\}, \{f\}$

- $P_{posi}$ describes the probability for a word to appear in a given position (3 categories)

Figure 1: $I_L(P) - I_{\bar{L}}(P)$ represented w.r.t. $I_L(P)$ for the different partitions: acceptable solutions of program (1) lie on the convex hull boundary of the set of all partitions. The solution for γ is given by the tangent of slope γ. Non-trivial solutions are $P_B$ and $P_1$.

Figure 2: $I_L(P; P_B)$ represented w.r.t. $I_L(P)$ for the different partitions: acceptable solutions of program (2) lie on the convex hull boundary of the set of all partitions. The solution for µ is given by the tangent of slope µ. The non-trivial solution is $P_2$.

Acceptable solutions of (1) and (2) are, respectively, on the convex hull boundary in Fig. 1 and Fig. 2. While the lowest-parameter (non-trivial) solutions are $P_B$ for context and $P_2$ for syntax, one can check that partitions $P_1$, $P_A$ and $P_{sent}$ are all close to the boundary in Fig. 1, and that partitions $P_C$, $P_D$ and $P_{posi}$ are all close to the boundary in Fig. 2, as expected considering their information content.

4.3 Experiments

In this section we illustrate the emergence of syntactic information via the application of objectives (1) and (2) to a natural language corpus. We show that the information we acquire indeed translates into known syntactic and contextual categories.

For this experiment we created a corpus from the Simple English Wikipedia dataset (Kauchak, 2013), selected along three main topics: Numbers, Democracy, and Hurricane, with about 430 sentences for each topic and a vocabulary of 2963 unique words. The stochastic language $L_3$ is the set of 3-gram frequencies from the dataset. In order to avoid biases with respect to the final punctuation, we considered overlapping 3-grams over sentences. For the sake of evaluation, we construct one contextual and one syntactic embedding for each word. These are the probabilistic partitions over gold standard contextual and syntactic categories. The contextual embedding $P_{con}$ is defined by relative frequency in the three topics. The results for this partition are $I_{L_3}(P_{con}) = 0.06111$ and $I_{\bar{L}_3}(P_{con}) = 0.06108$, corresponding to a γ threshold of $6.22 \cdot 10^{-4}$ in (1); thus the distribution over topics can be considered as an almost purely contextual partition. The syntactic partition $P_{syn}$ is the distribution over POS categories (tagged with the Stanford tagger, Toutanova et al. (2003)).

Using the gold categories, we can manipulate the information in the partitions by merging and splitting across contextual or syntactic categories. We study how the information calculated by (1) and (2) evolves; we validate our claims if we can deduce the nature of the information from these statistics.

We start from the syntactic embeddings and we split and merge over the following POS categories: Nouns (NN), Adjectives (JJ), Verbs (V), Adverbs (ADV) and Wh-words (WH). For a pair of categories (say NN+V), we create:

• $P_{merge}$, which merges the two categories (NN + V)

• $P_{syntax}$, which splits the merged category into NN and V (syntactic split)

• $P_{topic}$, which splits the merged category into $(NN + V)_{t_1}$, $(NN + V)_{t_2}$ and $(NN + V)_{t_3}$ along the three topics (topic split)

• $P_{random}$, which splits the merged category into $(NN + V)_1$ and $(NN + V)_2$ randomly (random split)

Figure 3: Increase of information $\Delta_I$ in three scenarios: syntactic split, topic split and random split.

It is clear that each split will increase the information compared to $P_{merge}$. We display the simple information gains $\Delta_I$ in Fig. 3. The question is whether we can identify if the added information is syntactic or contextual in nature, i.e. if we can find a µ for which the µ, γ-syntactic program (2) selects every syntactic splitting and rejects every contextual or random one.
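As a concrete reading of the two selection criteria above, the snippet below (our sketch; the numbers are invented for illustration and are not the paper's results) decides whether a candidate refinement P' of a partition P is accepted by the γ-contextual program (1) or by the µ, γ-syntactic program (2), given the corresponding information gains.

```python
# Illustrative sketch (not from the paper; the numeric inputs are invented):
# applying the acceptance criteria of programs (1) and (2) to a refinement P'.

def gamma_contextual_accepts(dI_L, dI_Lbar, gamma):
    """Program (1): accept P' over P iff (dI_L - dI_Lbar) / dI_L <= gamma."""
    return (dI_L - dI_Lbar) / dI_L <= gamma

def mu_syntactic_accepts(dMI, dI_L, mu):
    """Program (2): accept P' over P iff dMI / dI_L <= mu."""
    return dMI / dI_L <= mu

# A split that adds order information but shares little information with the
# contextual partition (a 'syntactic-looking' split) ...
print(mu_syntactic_accepts(dMI=0.02, dI_L=0.20, mu=0.5))   # True
# ... versus a split whose gain is mostly mutual information with the topics.
print(mu_syntactic_accepts(dMI=0.15, dI_L=0.20, mu=0.5))   # False
# A refinement whose gain survives shuffling is accepted as contextual
# even for a very small gamma.
print(gamma_contextual_accepts(dI_L=0.10, dI_Lbar=0.0999, gamma=1e-3))  # True
```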

Figure 4: Ratio $\Delta_{MI} / \Delta_I$ in three scenarios: syntactic split, topic split and random split. Considering objective (2) with parameter µ = 0.5 leads to discrimination between contextual and syntactic information.

Fig. 4 represents the ratio between the increase of mutual information (relative to $P_{con}$), $\Delta_{MI}$, and the increase of information, $\Delta_I$, corresponding to the threshold µ in (2). It shows that indeed, for µ = 0.5, syntactic information (meaningful refinement according to POS) will be systematically selected, while random or topic splittings will not. We conclude that even for a small natural language sample, syntactic categories can be identified based on statistical considerations, where a language model learning algorithm would need further information or hypotheses.

4.4 Integration with Models

We have shown that our framework allows searching for syntactic categories without a prior hypothesis of a particular model. Yet if we do have a hypothesis, we can indeed search for the syntactic categories that fit a particular class of models $\mathcal{M}$. In order to find the categories which correspond to the syntax rules that can be formulated in a given class of models, we can integrate the model class into the training objective by replacing entropy with the negative log-likelihood of the training sample.

Let $M \in \mathcal{M}$ be a model which takes a probabilistic partition $P = (V, C, \pi)$ as input, and let $LL(M, P, L_S)$ be the log-likelihood obtained for sample S. We will denote

$$\tilde{H}(L_S, P) = - \sup_{M \in \mathcal{M}} LL(M, P, L_S)$$

$$\tilde{I}_{L_S}(P) = \tilde{H}(\underline{L}_S, P) - \tilde{H}(L_S, P)$$

Following Definition 12, we define

$$\tilde{I}_{L_S}(P; P') = \tilde{H}(L_S, P) + \tilde{H}(L_S, P') - \tilde{H}(L_S, P \cdot P')$$

We may consider the following program:

• A partition P is said to be γ-contextual if it minimizes

$$\tilde{I}_{L_S}(P)(1 - \gamma) - \tilde{I}_{\bar{L}_S}(P)$$

• Let $P_*$ be a γ-contextual partition for L, $\mu \in \mathbb{R}^+$, $k \in \mathbb{N}$. The partition P is considered µ, γ-syntactic if it minimizes

$$\max_{P_*} \tilde{I}_{L_S}(P; P_*) - \mu\, \tilde{I}_{L_S}(P)$$

5 Conclusion and Future Work

In this paper, we proposed a theoretical reformulation of the problem of learning syntactic information from a corpus. Current language models have difficulty acquiring syntactically relevant generalizations, for diverse reasons. On the one hand, we observe a natural tendency to lean towards shallow contextual generalizations, likely due to the maximum likelihood training objective. On the other hand, a corpus is not representative of human linguistic competence but of performance. It is however possible for linguistic competence - syntax - to emerge from data if we prompt models to establish a distinction between syntactic and contextual (semantic/pragmatic) information.

Two orientations can be identified for future work. The immediate one is experimentation. The current formulation of our syntax learning scheme needs adjustments in order to be applicable to real natural language corpora. At present, we are working on an incremental construction of the space of categories.

The second direction is towards extending the approach to morphologically rich languages. In that case, two types of surface realization need to be considered: word order and morphological markers. An agglutinating morphology probably allows a more straightforward application of the method, by treating affixes as individual elements of the vocabulary. The adaptation to other types of morphological markers will necessitate more elaborate linguistic reflection.

References

Kuncoro Adhiguna, Dyer Chris, Hale John, Yogatama Dani, Clark Stephen, and Blunsom Phil. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Marco Baroni. 2019. Linguistic generalization and compositionality in modern artificial neural networks. CoRR, abs/1904.00157.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Iris Berent and Gary Marcus. 2019. No integration without structured representations: Response to Pater. Language, 95:1:e75–e86.

Jean-Philippe Bernardy and Shalom Lappin. 2017. Using deep neural networks to learn syntactic agreement. Linguistic Issues in Language Technology, 15(2):1–15.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Samuel R. Bowman, Christopher D. Manning, and Christopher Potts. 2015. Tree-structured composition in neural networks without tree-structured architectures. In NIPS Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches.

Noam Chomsky. 1957. Syntactic Structures. Mouton, Berlin, Germany.

Noam Chomsky. 1980. Rules and representations. Behavioral and Brain Sciences, 3(1):1–15.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics.

Alexander Clark and Rémi Eyraud. 2006. Learning auxiliary fronting with grammatical inference. In Conference on Computational Language Learning.

Alexander Clark and Shalom Lappin. 2010. Unsupervised learning and grammar induction. In Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, Oxford.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 10:5:447–474.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Dieuwke Hupkes, Sara Veldhoen, and Willem H. Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

Brenden M. Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In 34th International Conference on Machine Learning.

Shalom Lappin and Stuart Shieber. 2007. Machine learning theory and practice as a source of insight into universal grammar. Journal of Linguistics, 43:393–427.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Cambridge University Press, Cambridge.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies.

Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In 2019 Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.

Tal Linzen. 2019. What can linguistics and deep learning contribute to each other? Response to Pater. Language, 95(1):e98–e108.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Richard McCoy, Robert Frank, and Tal Linzen. 2018. Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks. ArXiv, abs/1802.09091.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Frederick J. Newmeyer. 2003. Grammar is grammar and usage is usage. Language, 79:4:682–707.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Joe Pater. 2019. Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language, 95:1:41–74.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wentau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. CoRR, abs/1903.06400.

Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. 2018. Can LSTM learn to capture agreement? The case of Basque. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Naomi Saphra and Adam Lopez. 2018. Language models learn POS first. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In Proceedings of the 7th International Conference on Learning Representations.

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.

Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Naftali Tishby, Fernando Pereira, and William Bialek. 1999. The information bottleneck method. In Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
