
Emergence of Syntax Needs Minimal Supervision

Raphaël Bailly
SAMM, EA 4543, FP2M 2036 CNRS
Université Paris 1 Panthéon-Sorbonne
[email protected]

Kata Gábor
ERTIM, EA 2520
INALCO
[email protected]

Abstract

This paper is a theoretical contribution to the debate on the learnability of syntax from a corpus without explicit syntax-specific guidance. Our approach originates in the observable structure of a corpus, which we use to define and isolate grammaticality (syntactic information) and meaning/pragmatics information. We describe the formal characteristics of an autonomous syntax and show that it becomes possible to search for syntax-based lexical categories with a simple optimization process, without any prior hypothesis on the form of the model.

1 Introduction

Syntax is the essence of the human linguistic capacity that makes it possible to produce and understand a potentially infinite number of unheard sentences. The principle of compositionality (Frege, 1892) states that the meaning of a complex expression is fully determined by the meanings of its constituents and its structure; hence, our understanding of sentences we have never heard before comes from the ability to construct the sense of a sentence out of its parts. The number of constituents and assigned meanings is necessarily finite. Syntax is responsible for creatively combining them, and it is commonly assumed that syntax operates by means of algebraic compositional rules (Chomsky, 1957) and a finite number of syntactic categories.

One would also expect a computational model of language to have - or be able to acquire - this compositional capacity. The recent success of neural network based language models on several NLP tasks, together with their "black box" nature, attracted attention to at least two questions. First, when recurrent neural language models generalize to unseen data, does it imply that they acquire syntactic knowledge, and if so, does it translate into human-like compositional capacities (Baroni, 2019; Lake and Baroni, 2017; Linzen et al., 2016; Gulordava et al., 2018)? Second, can research into neural networks and linguistics benefit each other (Pater, 2019; Berent and Marcus, 2019), either by providing evidence that syntax can be learnt in an unsupervised fashion (Blevins et al., 2018), or, on the contrary, by showing that humans and machines alike need innate constraints on the hypothesis space (a universal grammar) (Adhiguna et al., 2018; van Schijndel et al., 2019)?

A closely related question is whether it is possible to learn a language's syntax exclusively from a corpus. The poverty of the stimulus argument (Chomsky, 1980) suggests that humans cannot acquire their target language from only positive evidence unless some of their linguistic knowledge is innate. The machine learning equivalent of this categorical "no" is a formulation known as Gold's theorem (Gold, 1967), which suggests that the complete unsupervised learning of a language (correct grammaticality judgments for every sequence) is intractable from only positive data. Clark and Lappin (2010) argue that Gold's paradigm does not resemble a child's learning situation and that there exist algorithms that can learn unconstrained classes of infinite languages (Clark and Eyraud, 2006). This ongoing debate on syntax learnability and the poverty of the stimulus can benefit from empirical and theoretical machine learning contributions (Lappin and Shieber, 2007; McCoy et al., 2018; Linzen, 2019).

In this paper, we argue that syntax can be inferred from a sample of natural language with very minimal supervision. We introduce an information-theoretic definition of what constitutes syntactic information. The linguistic basis of our approach is the autonomy of syntax, which we redefine in terms of (statistical) independence. We demonstrate that it is possible to establish a syntax-based lexical classification of words from a corpus without a prior hypothesis on the form of a syntactic model.

Our work is loosely related to previous attempts at optimizing language models for syntactic performance (Dyer et al., 2016; Adhiguna et al., 2018), and more particularly to Li and Eisner (2019), because of their use of mutual information and the information bottleneck principle (Tishby et al., 1999). However, our goal is different in that we demonstrate that very minimal supervision is sufficient in order to guide a symbolic or statistical learner towards grammatical competence.

2 Language models and syntax

As recurrent neural network based language models started to achieve good performance on different tasks (Mikolov et al., 2010), this success sparked attention on whether such models implicitly learn syntactic information. Language models are typically evaluated using perplexity on test data that is similar to the training examples. However, lower perplexity does not necessarily imply better syntactic generalization. Therefore, new tests have been put forward to evaluate the linguistically meaningful knowledge acquired by LMs.

A number of tests based on artificial data have been used to detect compositionality or systematicity in deep neural networks. Lake and Baroni (2017) created a task that requires executing commands expressed in a compositional language. Bowman et al. (2015) design a task of logical entailment relations to be solved by discovering a recursive compositional structure. Saxton et al. (2019) propose a semi-artificial probing task of mathematical problems.

Linzen et al. (2016) initiated a different line of linguistically motivated evaluation of RNNs. Their data set consists of minimal pairs that differ in grammaticality and instantiate sentences with long distance dependencies (e.g. number agreement). The model is supposed to give a higher probability to the grammatical sentence. The test aims to detect whether the model can solve the task even when this requires knowledge of a hierarchical structure. Subsequently, several alternative tasks were created along the same lines to overcome specific shortcomings (Bernardy and Lappin, 2017; Gulordava et al., 2018), or to extend the evaluation to different languages or phenomena (Ravfogel et al., 2018, 2019).

It was also suggested that the information content of a network can be tested using "probing tasks" or "diagnostic classifiers" (Giulianelli et al., 2018; Hupkes et al., 2018). This approach consists of extracting a representation from a NN and using it as input for a supervised classifier to solve a different linguistic task. Accordingly, probes were conceived to test whether the model learned parts of speech (Saphra and Lopez, 2018), morphology (Belinkov et al., 2017; Peters et al., 2018a), or syntactic information. Tenney et al. (2019) evaluate contextualized word representations on syntactic and semantic sequence labeling tasks. Syntactic knowledge can be tested by extracting constituency trees from a network's hidden states (Peters et al., 2018b) or from its word representations (Hewitt and Manning, 2019). Other syntactic probe sets include the work of Conneau et al. (2018) and Marvin and Linzen (2018).

Despite the vivid interest in the topic, no consensus seems to emerge from the experimental results. Two competing hypotheses stand out:

• Deep neural language models generalize by learning human-like syntax: given a sufficient amount of training data, RNN models approximate human compositional skills and implicitly encode hierarchical structure at some level of the network. This conjecture coincides with the findings of, among others, Bowman et al. (2015); Linzen et al. (2016); Giulianelli et al. (2018); Gulordava et al. (2018); Adhiguna et al. (2018).

• The language model training objective does not allow learning compositional syntax from a corpus alone, no matter what amount of training data the model was exposed to. Syntax learning can only be achieved with task-specific guidance, either as explicit supervision, or by restricting the hypothesis space to hierarchically structured models (Dyer et al., 2016; Marvin and Linzen, 2018; Chowdhury and Zamparelli, 2018; van Schijndel et al., 2019; Lake and Baroni, 2017).

Moreover, some shortcomings of the above probing methods make it more difficult to come to a conclusion. Namely, it is not trivial to come up with minimal pairs of naturally occurring sentences that are equally likely. Furthermore, assigning a (slightly) higher probability to one sentence does not reflect the nature of the knowledge behind a grammaticality judgment. Diagnostic classifiers may do well on a linguistic task because they learn to solve it, not because their input contains a hierarchical structure (Hewitt and Liang, 2019).

In what follows, we present our assessment of how the difficulty of creating a linguistic probing data set is interconnected with the theoretical problem of learning a model of syntactic competence.

2.1 Competence or performance, or why syntax drowns in the corpus

If syntax is an autonomous module of linguistic capacity, the rules and principles that govern it are formulated independently of meaning. However, a corpus is a product of language use, or performance. Syntax constitutes only a subset of the rules that generate such a product; the others include communicative needs and pragmatics. Just as meaning is uncorrelated with grammaticality, corpus frequency is only remotely correlated with human grammaticality judgment (Newmeyer, 2003).

Language models learn a probability distribution over sequences of words. The training objective is not designed to distinguish the grammatical from the agrammatical, but to predict language use. While Linzen et al. (2016) found a correlation between the perplexity of RNN language models and their syntactic knowledge, subsequent studies (Bernardy and Lappin, 2017; Gulordava et al., 2018) recognized that this result could have been achieved by encoding lexical semantic information, such as argument typicality. E.g. "in 'dogs (...) bark', an RNN might get the right agreement by encoding information about what typically barks" (Gulordava et al., 2018).

Several papers revealed the tendency of deep neural networks to fixate on surface cues and heuristics instead of "deep" generalization in solving NLP tasks (Levy et al., 2015; Niven and Kao, 2019). In particular, McCoy et al. (2019) identify three types of syntactic heuristics that get in the way of meaningful generalization in language models.

Finally, it is difficult to build a natural language data set without semantic cues. Results from syntax-semantics interface research show that lexical semantic properties account for part of syntactic realization (Levin and Rappaport Hovav, 2005).

3 What is syntax a generalization of?

We have seen in section 2 that previous works on the linguistic capacity of neural language models concentrate on compositionality, the key to creative use of language. However, this creativity is not present in language models: they are bound by the type of the data they are exposed to in learning.

We suggest that it is still possible to learn syntactic generalization from a corpus, but not with likelihood maximization. We propose to isolate the syntactic information from shallow, performance-related information. In order to identify such information without explicitly injecting it as direct supervision or model-dependent linguistic presuppositions, we propose to examine inherent structural properties of corpora. As an illustration, consider the following natural language sample:

cats eat rats
rats fear cats
mathematicians prove theorems
doctors heal wounds

According to the Chomskyan principle of the autonomy of syntax (Chomsky, 1957), the syntactic rules that define well-formedness can be formulated without reference to meaning and pragmatics. For instance, the sentence Colorless green ideas sleep furiously is grammatical for humans, despite being meaningless and unlikely to occur. We study whether it is possible to deduce, from the structural properties of our sample above, human-like grammaticality judgments that predict sequences like cats rats fear as agrammatical, and accept e.g. wounds eat theorems as grammatical.

We distinguish two levels of observable structure in a corpus:

1. proximity: the tendency of words to occur in the vicinity of each other (in the same document, same sentence, etc.);
2. the order in which the words appear.

Definition 1. Let L be a language over vocabulary V. The language that contains every possible sequence obtained by shuffling the elements in a sequence of L will be denoted $\bar{L}$.

If $V^*$ is the set of every possible sequence over vocabulary V and L is the language instantiated by our corpus, L is generated by a mixture of contextual and syntactic constraints over $V^*$. We are looking to separate the syntactic specificities from the grammatically irrelevant, contextual cues. The processes that transform $V^*$ into $\bar{L}$, and $\bar{L}$ into L,

$$V^* \xrightarrow{\text{proximity}} \bar{L} \xrightarrow{\text{order}} L$$

are entirely dependent on words: it should be possible to encode the information used by these processes into word categories.
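To make Definition 1 concrete, the following sketch (an illustration added here, not code from the paper; the helper names are ours) abbreviates each word of the toy sample by its initial letter, builds the language L, and enumerates the shuffled language L-bar.

```python
# Illustrative sketch (not from the paper): the toy language L from the sample
# corpus and its shuffled counterpart L-bar (Definition 1).
from itertools import permutations

corpus = [
    "cats eat rats",
    "rats fear cats",
    "mathematicians prove theorems",
    "doctors heal wounds",
]

# Abbreviate every word by its initial letter, as in the paper's example.
L = {"".join(w[0] for w in sentence.split()) for sentence in corpus}
print(sorted(L))  # ['cer', 'dhw', 'mpt', 'rfc']

# L-bar: every sequence obtained by shuffling the elements of a sequence of L.
L_bar = {"".join(p) for seq in L for p in permutations(seq)}
print(len(L_bar))  # 24 sequences in total
```

Under this construction, cats rats fear (crf) appears in L-bar but not in L: proximity alone licenses it, while order does not.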

In what follows, we will provide tools to isolate the information involved in proximity from the information involved in order. We also relate these categories to linguistically relevant notions.

3.1 Isolating syntactic information

For a given word, we want to identify the information involved in each type of structure of the corpus, and represent it as partitions of the vocabulary into lexical categories:

1. Contextual information is any information unrelated to sentence structure, and hence, grammaticality: this encompasses meaning, topic, pragmatics, corpus artefacts etc. The surface realization of sentence structure is a language-specific combination of word order and morphological markers.

2. Syntactic information is the information related to sentence structure and - as per the autonomy requirement - nothing else: it is independent of all contextual information.

In the rest of the paper we will concentrate on English as an example, a language in which syntactic information is primarily encoded in order. In section 5 we present our ideas on how to deal with morphologically richer languages.

Definition 2. Let L be a language over vocabulary $V = \{v_1, \dots\}$, and let $P = (V, C, \pi: V \mapsto C)$ be a partition of V into categories C. Let $\pi(L)$ denote the language that is created by replacing a sequence of elements in V by the sequence of their categories.

One defines the partition $P_{tot} = \{\{v\}, v \in V\}$ (one category per word) and the partition $P_{nul} = \{V\}$ (every word in the same category). $P_{tot}$ is such that $\pi_{tot}(L) \sim L$. The minimal partition $P_{nul}$ does not contain any information.

A partition $P = (V, C, \pi)$ is contextual if it is impossible to determine word order in language L from sequences of its categories:

Definition 3. Let L be a language over vocabulary V, and let $P = (V, C, \pi)$ be a partition over V. The partition P is said to be contextual if

$$\pi(\bar{L}) = \pi(L)$$

The trivial partition $P_{nul}$ is always contextual.

Example. Consider the natural language sample above. We refer to the words by their initial letters: r(ats), e(at), ..., thus we have $V = \{c, e, r, f, m, p, t, d, h, w\}$ and $L = \{cer, rfc, mpt, dhw\}$. One can check that the partition $P_1$:

$c_1 = \{c, r, e, f\}$
$c_2 = \{m, p, t\}$
$c_3 = \{d, h, w\}$

is contextual: the well-formed sequences over this partition are $c_1c_1c_1$, $c_2c_2c_2$ and $c_3c_3c_3$. These patterns convey the information that words like 'mathematicians' and 'theorems' occur together, but do not provide information on order. Therefore $\pi_1(L) = \{c_1c_1c_1, c_2c_2c_2, c_3c_3c_3\} = \pi_1(\bar{L})$. $P_1$ is also a maximal partition with that property: any further splitting leads to order-specific patterns. Intuitively, this partition corresponds to the semantic categories Animals = $\{r, c, e, f\}$, Science = $\{m, p, t\}$, and Medicine = $\{d, h, w\}$.

A syntactic partition has two characteristics: its patterns encode the structure (in our case, order), and it is completely autonomous with respect to contextual information. Let us now express this autonomy formally. Two partitions of the same vocabulary are said to be independent if they do not share any information with respect to language L. In other words, if we translate a sequence of words from L into their categories from one partition, this sequence of categories will not provide any information on how the sequence translates into categories from the other partition:

Definition 4. Let L be a language over vocabulary V, and let $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ be two partitions of V. P and P' are considered as independent with respect to L if

$$\forall c_{i_1} \dots c_{i_n} \in \pi(L),\ \forall c'_{j_1} \dots c'_{j_n} \in \pi'(L):\quad \pi^{-1}(c_{i_1} \dots c_{i_n}) \cap \pi'^{-1}(c'_{j_1} \dots c'_{j_n}) \neq \emptyset$$

Definition 5. Let L be a language over V, and let $P = (V, C, \pi)$ be a partition. P is said to be syntactic if it is independent of any contextual partition of V.

A syntactic partition is hence a partition that does not share any information with contextual partitions; or, in linguistic terms, a syntactic pattern is equally applicable to any contextual category.
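As an illustration of Definition 3 (ours, not the authors'; the function names are hypothetical), the following sketch checks whether a partition of the toy vocabulary is contextual by comparing the category patterns of L and of its shuffled version.

```python
# Illustrative sketch (not from the paper): testing Definition 3 on the toy corpus.
from itertools import permutations

L = {"cer", "rfc", "mpt", "dhw"}
L_bar = {"".join(p) for s in L for p in permutations(s)}

def patterns(language, pi):
    """Translate every sequence of words into its sequence of categories."""
    return {"".join(pi[w] for w in seq) for seq in language}

def is_contextual(pi):
    """Definition 3: P is contextual iff pi(L-bar) = pi(L)."""
    return patterns(L_bar, pi) == patterns(L, pi)

# P1: the 'semantic' partition Animals / Science / Medicine.
P1 = {**{w: "c1" for w in "cref"}, **{w: "c2" for w in "mpt"}, **{w: "c3" for w in "dhw"}}
# P2: the 'syntactic' partition Noun / Verb.
P2 = {**{w: "c4" for w in "crmtdw"}, **{w: "c5" for w in "efph"}}

print(is_contextual(P1))  # True: shuffling adds no new category patterns
print(is_contextual(P2))  # False: c4c5c4 vs. e.g. c5c4c4 after shuffling
```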

Example. We can see that the partition $P_2$:

$c_4 = \{c, r, m, t, d, w\}$
$c_5 = \{e, f, p, h\}$

is independent of the partition $P_1$: one has $\pi_2(L) = \{c_4c_5c_4\}$. Knowing the sequence $c_4c_5c_4$ does not provide any information on which $P_1$ categories the words belong to. $P_2$ is therefore a syntactic partition.

Looking at the corpus, one might be tempted to consider a partition $P_3$ that sub-divides $c_4$ into subject nouns, object nouns, and - if each word can be mapped to only one category - "ambiguous" nouns:

$c_6 = \{m, d\}$
$c_7 = \{t, w\}$
$c_8 = \{c, r\}$
$c_9 = \{e, f, p, h\}$

The patterns corresponding to this partition would be $\pi_3(L) = \{c_6c_9c_7, c_8c_9c_8\}$. These patterns will not predict that the sentence wounds eat theorems is grammatical, because the word wounds was only seen as an object. If we want to learn the correct generalization we need to reject this partition in favour of $P_2$. This is indeed what happens by virtue of Definition 5. We notice that the patterns over $P_3$ categories are not independent of the contextual partition $P_1$: one can deduce from the rule $c_8c_9c_8$ that the corresponding sentence cannot be of category $c_2$, for example:

$$\pi_3^{-1}(c_8c_9c_8) \cap \pi_1^{-1}(c_2c_2c_2) = \emptyset$$

$P_3$ is hence rejected as a syntactic partition.

$P_2$ is the maximal syntactic partition: any further distinction that does not conflate $P_1$ categories would lead to an inclusion of contextual information. We can indeed see that category $c_4$ corresponds to Noun and $c_5$ corresponds to Verb. The syntactic rule for the sample is Noun Verb Noun. It becomes possible to distinguish between syntactic and contextual acceptability: cats rats fear is acceptable as a contextual pattern $c_1c_1c_1$ under 'Animals', but is not a valid syntactic pattern. The sequence wounds eat theorems is syntactically well-formed as $c_4c_5c_4$, but does not correspond to a valid contextual pattern.

In this section we provided the formal definitions of syntactic information and of the broader contextual information. By an illustrative example we gave an intuition of how we apply the autonomy of syntax principle in a non-probabilistic grammar. We now turn to the probabilistic scenario and the inference from a corpus.

4 Syntactic and contextual categories in a corpus

As we have seen in section 2, probabilistic language modeling with a likelihood maximization objective has no incentive to concentrate on syntactic generalizations. In what follows, we demonstrate that, using the autonomy of syntax principle, it is possible to infer syntactic categories for a probabilistic language.

A stochastic language L is a language which assigns a probability to each sequence. As an illustration of such a language, we consider the empirical distribution induced from the sample in section 3:

$$L = \{cer(\tfrac{1}{4}),\ rfc(\tfrac{1}{4}),\ mpt(\tfrac{1}{4}),\ dhw(\tfrac{1}{4})\}$$

We will denote by $p_L(v_{i_1} \dots v_{i_n})$ the probability distribution associated to L.

Definition 6. Let V be a vocabulary. A (probabilistic) partition of V is defined by $P = (V, C, \pi: V \mapsto \mathcal{P}(C))$ where $\mathcal{P}(C)$ is the set of probability distributions over C.

Example. The following probabilistic partitions correspond to the non-probabilistic partitions (contextual and syntactic, respectively) defined in section 3. We will now consider these partitions in the context of the probabilistic language L.

$$\pi_1 = \begin{array}{c|ccc} & c_1 & c_2 & c_3 \\ \hline c & 1 & 0 & 0 \\ r & 1 & 0 & 0 \\ e & 1 & 0 & 0 \\ f & 1 & 0 & 0 \\ m & 0 & 1 & 0 \\ p & 0 & 1 & 0 \\ t & 0 & 1 & 0 \\ d & 0 & 0 & 1 \\ h & 0 & 0 & 1 \\ w & 0 & 0 & 1 \end{array} \qquad \pi_2 = \begin{array}{c|cc} & c_4 & c_5 \\ \hline c & 1 & 0 \\ r & 1 & 0 \\ e & 0 & 1 \\ f & 0 & 1 \\ m & 1 & 0 \\ p & 0 & 1 \\ t & 1 & 0 \\ d & 1 & 0 \\ h & 0 & 1 \\ w & 1 & 0 \end{array}$$

From a probabilistic partition $P = (V, C, \pi)$ as defined above, one can map a stochastic language L to a stochastic language $\pi(L)$ over the sequences of categories:

$$p_\pi(c_{i_1} \dots c_{i_n}) = \sum_{u_{j_1} \dots u_{j_n}} \Big( \prod_k \pi(c_{i_k} \mid u_{j_k}) \Big)\, p_L(u_{j_1} \dots u_{j_n})$$

As in the non-probabilistic case, the language $\bar{L}$ will be defined as the language obtained by shuffling the sequences in L.
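The following sketch (our illustration, not code from the paper) applies the projection formula above to the toy stochastic language, computing the category-level distribution $p_\pi$ for the hard partitions $\pi_1$ and $\pi_2$.

```python
# Illustrative sketch (not from the paper): computing p_pi(c_i1 ... c_in)
# for the toy stochastic language and the partitions pi_1 and pi_2.
from itertools import product

p_L = {"cer": 0.25, "rfc": 0.25, "mpt": 0.25, "dhw": 0.25}

# Hard (one-hot) probabilistic partitions: pi[word][category] = probability.
pi_1 = {w: {"c1": 1.0} for w in "cref"}
pi_1.update({w: {"c2": 1.0} for w in "mpt"})
pi_1.update({w: {"c3": 1.0} for w in "dhw"})

pi_2 = {w: {"c4": 1.0} for w in "crmtdw"}
pi_2.update({w: {"c5": 1.0} for w in "efph"})

def push_forward(p_L, pi):
    """p_pi(c_1..c_n) = sum_u (prod_k pi(c_k | u_k)) * p_L(u_1..u_n)."""
    p_pi = {}
    for seq, p in p_L.items():
        cats = [pi[w] for w in seq]
        for combo in product(*(c.items() for c in cats)):
            key = " ".join(c for c, _ in combo)
            weight = p
            for _, q in combo:
                weight *= q
            p_pi[key] = p_pi.get(key, 0.0) + weight
    return p_pi

print(push_forward(p_L, pi_1))  # {'c1 c1 c1': 0.5, 'c2 c2 c2': 0.25, 'c3 c3 c3': 0.25}
print(push_forward(p_L, pi_2))  # {'c4 c5 c4': 1.0}
```

The two outputs match the stochastic patterns $\pi_1(L)$ and $\pi_2(L)$ given in the example below.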

Definition 7. Let L be a stochastic language over vocabulary V. We will denote by $\bar{L}$ the language obtained by shuffling the elements in the sequences of L in the following way: for a sequence $v_1 \dots v_n$, one has

$$p_{\bar{L}}(v_1 \dots v_n) = \frac{1}{n!} \sum_{(i_1 \dots i_n) \in \sigma(n)} p_L(v_{i_1} \dots v_{i_n})$$

One can easily check that $\pi(\bar{L}) = \overline{\pi(L)}$.

Example. The stochastic patterns of L over the two partitions are, respectively:

$$\pi_1(L) = \{c_1c_1c_1(\tfrac{1}{2}),\ c_2c_2c_2(\tfrac{1}{4}),\ c_3c_3c_3(\tfrac{1}{4})\}$$
$$\pi_2(L) = \{c_4c_5c_4(1)\}$$

We can now define a probabilistic contextual partition:

Definition 8. Let L be a stochastic language over vocabulary V, and let $P = (V, C, \pi)$ be a probabilistic partition. P will be considered as contextual if

$$\pi(\bar{L}) = \pi(L)$$

We now want to express the independence of syntactic partitions from contextual partitions. The independence of two probabilistic partitions can be construed as an independence between two random variables:

Definition 9. Consider two probabilistic partitions $P = (V, C, \pi)$ and $P' = (V, C', \pi')$. We will use the notation

$$(\pi \cdot \pi')_v(c_i, c'_j) = \pi_v(c_i)\, \pi'_v(c'_j)$$

and the notation

$$P \cdot P' = (V, C \times C', \pi \cdot \pi')$$

P and P' are said to be independent (with respect to L) if the distributions inferred over sequences of their categories are independent:

$$\forall w \in \pi(L),\ \forall w' \in \pi'(L),\quad p_{\pi \cdot \pi'}(w, w') = p_\pi(w)\, p_{\pi'}(w')$$

A syntactic partition will be defined by its independence from contextual information:

Definition 10. Let P be a probabilistic partition, and L a stochastic language. The partition P is said to be syntactic if it is independent (with respect to L) of any possible probabilistic contextual partition of L.

Example. The partition $P_1$ is contextual, as $\pi_1(\bar{L}) = \pi_1(L)$. The partition $P_2$ is clearly independent of $P_1$ w.r.t. L.

4.1 Information-theoretic formulation

The definitions above may need to be relaxed if we want to infer syntax from natural language corpora, where strict independence cannot be expected. We propose to reformulate the definitions of contextual and syntactic information in the information theory framework.

We present a relaxation of our definition based on Shannon's information theory (Shannon, 1948). We seek to quantify the amount of information in a partition $P = (V, C, \pi)$ with respect to a language L. Shannon's entropy provides an appropriate measure. Applied to $\pi(L)$, it gives

$$H(\pi(L)) = - \sum_{w \in \pi(L)} p_\pi(w) \log(p_\pi(w))$$

For a simpler illustration, from now on we will consider only languages composed of fixed-length sequences s, i.e. $|s| = n$ for a given n. If L is such a language, we will consider the unigram language $\underline{L}$ as the language of sequences of size n defined by

$$p_{\underline{L}}(v_{i_1} \dots v_{i_n}) = \prod_j p_L(v_{i_j})$$

where $p_L(v)$ is the frequency of v in language L.

Proposition 1. Let L be a stochastic language and $P = (V, C, \pi)$ a partition. One has:

$$H(\pi(\underline{L})) \geq H(\pi(\bar{L})) \geq H(\pi(L))$$

with equality iff the corresponding stochastic languages are equal.

Let C be a set of categories. For a given distribution over the categories $p(c_i)$, the partition defined by $\pi(c_i \mid v) = p(c_i)$ (constant distribution w.r.t. the vocabulary) contains no information on the language. One has $p_\pi(c_{i_1} \dots c_{i_k}) = p(c_{i_1}) \dots p(c_{i_k})$, which is the unigram distribution; in other words, $\pi(L) = \pi(\underline{L})$. As the amount of syntactic or contextual information contained in $\underline{L}$ can be considered as zero, a consistent definition of the information would be:

Definition 11. Let $P = (V, C, \pi)$ be a partition, and L a language. The information contained in P with respect to L is defined as

$$I_L(P) = H(\pi(\underline{L})) - H(\pi(L))$$

Lemma 1. The information $I_L(P)$ defined as above is always positive. One has $I_{\bar{L}}(P) \leq I_L(P)$, with equality iff $\pi(\bar{L}) = \pi(L)$.
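To make these quantities concrete, the sketch below (ours, not the paper's code; it reuses the toy helpers of the earlier snippets and our reconstruction of the unigram language) computes $I_L(P)$ and $I_{\bar{L}}(P)$ for $P_1$ and $P_2$ on the toy language.

```python
# Illustrative sketch (not from the paper): entropies and information
# I_L(P) = H(pi(L_unigram)) - H(pi(L)) for the toy language and two partitions.
from itertools import permutations, product
from math import log2

p_L = {"cer": 0.25, "rfc": 0.25, "mpt": 0.25, "dhw": 0.25}
P1 = {**{w: "c1" for w in "cref"}, **{w: "c2" for w in "mpt"}, **{w: "c3" for w in "dhw"}}
P2 = {**{w: "c4" for w in "crmtdw"}, **{w: "c5" for w in "efph"}}

def shuffled(p):
    """L-bar (Definition 7): average over all permutations of every sequence."""
    out = {}
    for seq, prob in p.items():
        perms = ["".join(q) for q in permutations(seq)]
        for s in perms:
            out[s] = out.get(s, 0.0) + prob / len(perms)
    return out

def unigram(p, n=3):
    """Unigram language: length-n sequences drawn independently from word frequencies."""
    freq = {}
    for seq, prob in p.items():
        for w in seq:
            freq[w] = freq.get(w, 0.0) + prob / len(seq)
    out = {}
    for ws in product(freq, repeat=n):
        q = 1.0
        for w in ws:
            q *= freq[w]
        out["".join(ws)] = q
    return out

def project(p, pi):
    """pi(L): the distribution over category sequences."""
    out = {}
    for seq, prob in p.items():
        key = "".join(pi[w] for w in seq)
        out[key] = out.get(key, 0.0) + prob
    return out

def H(p):
    """Shannon entropy in bits."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

def info(p, pi):
    """I_L(P) = H(pi(L_unigram)) - H(pi(L))."""
    return H(project(unigram(p), pi)) - H(project(p, pi))

for name, pi in [("P1", P1), ("P2", P2)]:
    print(name, round(info(p_L, pi), 3), round(info(shuffled(p_L), pi), 3))
# P1: the two values coincide (contextual partition);
# P2: the value on the shuffled language is strictly smaller, as in Lemma 1.
```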

After having defined how to measure the amount of information in a partition with respect to a language, we now translate the independence between two partitions into the terms of mutual information:

Definition 12. We follow the notations of Definition 9. We define the mutual information of two partitions $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ with respect to L as

$$I_L(P; P') = H(P) + H(P') - H(P \cdot P')$$

This directly implies:

Lemma 2. $P = (V, C, \pi)$ and $P' = (V, C', \pi')$ are independent w.r.t. L $\Leftrightarrow$ $I_L(P; P') = 0$.

This comes from the fact that, by construction, the marginal distributions of $\pi \cdot \pi'$ are the distributions $\pi$ and $\pi'$.

With these two definitions, we can now propose an information-theoretic reformulation of what constitutes a contextual and a syntactic partition:

Proposition 2. Let L be a stochastic language over vocabulary V, and let $P = (V, C, \pi)$ be a probabilistic partition.

• P is contextual iff $I_L(P) = I_{\bar{L}}(P)$

• P is syntactic iff, for any contextual partition $P_*$, $I_L(P; P_*) = 0$

4.2 Relaxed formulation

If we deal with non-artificial samples of natural language data, we need to prepare for sampling issues and word (form) ambiguity that make the above formulation of independence too strict. Consider for instance adding the following sentence to the previous sample:

doctors heal fear

The distinction between syntactic and contextual categories is not as clear as before. We need a relaxed formulation for real corpora: we introduce γ-contextual and µ, γ-syntactic partitions.

Definition 13. Let L be a stochastic language.

• A partition P is considered as γ-contextual if it minimizes

$$I_L(P)(1 - \gamma) - I_{\bar{L}}(P) \quad (1)$$

• A partition P is considered µ, γ-syntactic if it minimizes

$$\max_{P_*} I_L(P; P_*) - \mu\, I_L(P) \quad (2)$$

for any γ-contextual partition $P_*$.

Let P and P' be two partitions for L, such that

$$\Delta_I(L) = I_L(P') - I_L(P) \geq 0$$

Then the γ-contextual program (1) would choose P' over P iff

$$\frac{\Delta_I(L) - \Delta_I(\bar{L})}{\Delta_I(L)} \leq \gamma$$

Let $P_*$ be a γ-contextual partition, and let

$$\Delta_{MI}(L, P_*) = I_L(P'; P_*) - I_L(P; P_*)$$

Then the µ, γ-syntactic program (2) would choose P' over P iff

$$\frac{\Delta_{MI}(L, P_*)}{\Delta_I(L)} \leq \mu$$

Example. Let us consider the following partitions:

- $P_1$ and $P_2$ refer to the previous partitions above: {Animals, Science, Medicine} and {Noun, Verb}

- $P_A$ is adapted from $P_1$ so that 'fear' belongs to both Animals and Medicine: $\{c, e, r, f(\tfrac{1}{2})\}, \{m, p, t\}, \{d, h, w, f(\tfrac{1}{2})\}$

- $P_B$ merges Animals and Medicine from $P_1$: $\{c, e, r, f, d, h, w\}, \{m, p, t\}$

- $P_{sent}$ describes the probability for a word to belong to a given sentence (5 categories)

- $P_C$ is adapted from $P_2$ so that 'fear' belongs to both Verb and Noun: $\{c, r, m, t, d, w, f(\tfrac{1}{2})\}, \{e, p, h, f(\tfrac{1}{2})\}$

- $P_D$ is adapted from $P_2$ and creates a special category for 'fear': $\{c, r, m, t, d, w\}, \{e, p, h\}, \{f\}$

- $P_{posi}$ describes the probability for a word to appear in a given position (3 categories)

Figure 1: $I_L(P) - I_{\bar{L}}(P)$ represented w.r.t. $I_L(P)$ for the different partitions: acceptable solutions of program (1) lie on the convex hull boundary of the set of all partitions. The solution for γ is given by the tangent of slope γ. Non-trivial solutions are $P_B$ and $P_1$.

Figure 2: $I_L(P; P_B)$ represented w.r.t. $I_L(P)$ for the different partitions: acceptable solutions of program (2) lie on the convex hull boundary of the set of all partitions. The solution for µ is given by the tangent of slope µ. The non-trivial solution is $P_2$.

Acceptable solutions of (1) and (2) are, respectively, on the convex hull boundary in Fig. 1 and Fig. 2. While the lowest-parameter (non-trivial) solutions are $P_B$ for context and $P_2$ for syntax, one can check that partitions $P_1$, $P_A$ and $P_{sent}$ are all close to the boundary in Fig. 1, and that partitions $P_C$, $P_D$ and $P_{posi}$ are all close to the boundary in Fig. 2, as expected considering their information content.

4.3 Experiments

In this section we illustrate the emergence of syntactic information via the application of objectives (1) and (2) to a natural language corpus. We show that the information we acquire indeed translates into known syntactic and contextual categories.

For this experiment we created a corpus from the Simple English Wikipedia dataset (Kauchak, 2013), selected along three main topics: Numbers, Democracy, and Hurricane, with about 430 sentences for each topic and a vocabulary of 2963 unique words. The stochastic language $L_3$ is the set of 3-gram frequencies from the dataset. In order to avoid biases with respect to the final punctuation, we considered overlapping 3-grams over sentences. For the sake of evaluation, we construct one contextual and one syntactic embedding for each word. These are the probabilistic partitions over gold standard contextual and syntactic categories. The contextual embedding $P_{con}$ is defined by relative frequency in the three topics. The results for this partition are $I_{L_3}(P_{con}) = 0.06111$ and $I_{\bar{L}_3}(P_{con}) = 0.06108$, corresponding to a γ threshold of $6.22 \cdot 10^{-4}$ in (1); thus the distribution over topics can be considered as an almost purely contextual partition. The syntactic partition $P_{syn}$ is the distribution over POS categories (tagged with the Stanford tagger, Toutanova et al. (2003)).

Using the gold categories, we can manipulate the information in the partitions by merging and splitting across contextual or syntactic categories. We study how the information calculated by (1) and (2) evolves; we validate our claims if we can deduce the nature of the information from these statistics.

We start from the syntactic embeddings and we split and merge over the following POS categories: Nouns (NN), Adjectives (JJ), Verbs (V), Adverbs (ADV) and Wh-words (WH). For a pair of categories (say NN+V), we create:

• $P_{merge}$, which merges the two categories (NN + V)

• $P_{syntax}$, which splits the merged category into NN and V (syntactic split)

• $P_{topic}$, which splits the merged category into $(NN + V)_{t_1}$, $(NN + V)_{t_2}$ and $(NN + V)_{t_3}$ along the three topics (topic split)

• $P_{random}$, which splits the merged category into $(NN + V)_1$ and $(NN + V)_2$ randomly (random split)

Figure 3: Increase of information $\Delta_I$ in three scenarios: syntactic split, topic split and random split.

It is clear that each split will increase the information compared to $P_{merge}$. We display the simple information gains $\Delta_I$ in Fig. 3. The question is whether we can identify if the added information is syntactic or contextual in nature, i.e. if we can find a µ for which the µ, γ-syntactic program (2) selects every syntactic splitting and rejects every contextual or random one.
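As a concrete reading of the two selection criteria above, the snippet below (our sketch; the numbers are invented for illustration and are not the paper's results) decides whether a candidate refinement P' of a partition P is accepted by the γ-contextual program (1) or by the µ, γ-syntactic program (2), given the corresponding information gains.

```python
# Illustrative sketch (not from the paper; the numeric inputs are invented):
# applying the acceptance criteria of programs (1) and (2) to a refinement P'.

def gamma_contextual_accepts(dI_L, dI_Lbar, gamma):
    """Program (1): accept P' over P iff (dI_L - dI_Lbar) / dI_L <= gamma."""
    return (dI_L - dI_Lbar) / dI_L <= gamma

def mu_syntactic_accepts(dMI, dI_L, mu):
    """Program (2): accept P' over P iff dMI / dI_L <= mu."""
    return dMI / dI_L <= mu

# A split that adds order information but shares little information with the
# contextual partition (a 'syntactic-looking' split) ...
print(mu_syntactic_accepts(dMI=0.02, dI_L=0.20, mu=0.5))   # True
# ... versus a split whose gain is mostly mutual information with the topics.
print(mu_syntactic_accepts(dMI=0.15, dI_L=0.20, mu=0.5))   # False
# A refinement whose gain survives shuffling is accepted as contextual
# even for a very small gamma.
print(gamma_contextual_accepts(dI_L=0.10, dI_Lbar=0.0999, gamma=1e-3))  # True
```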

Figure 4: Ratio $\Delta_{MI} / \Delta_I$ in three scenarios: syntactic split, topic split and random split. Considering objective (2) with parameter µ = 0.5 leads to discrimination between contextual and syntactic information.

Fig. 4 represents the ratio between the increase of mutual information (relative to $P_{con}$), $\Delta_{MI}$, and the increase of information, $\Delta_I$, corresponding to the threshold µ in (2). It shows that indeed, for µ = 0.5, syntactic information (meaningful refinement according to POS) will be systematically selected, while random or topic splittings will not. We conclude that even for a small natural language sample, syntactic categories can be identified based on statistical considerations, where a language model learning algorithm would need further information or hypotheses.

4.4 Integration with Models

We have shown that our framework allows searching for syntactic categories without a prior hypothesis of a particular model. Yet if we do have a hypothesis, we can indeed search for the syntactic categories that fit a particular class of models $\mathcal{M}$. In order to find the categories which correspond to the syntax rules that can be formulated in a given class of models, we can integrate the model class into the training objective by replacing entropy with the negative log-likelihood of the training sample.

Let $M \in \mathcal{M}$ be a model which takes a probabilistic partition $P = (V, C, \pi)$ as input, and let $LL(M, P, L_S)$ be the log-likelihood obtained for sample S. We will denote

$$\tilde{H}(L_S, P) = - \sup_{M \in \mathcal{M}} LL(M, P, L_S)$$

$$\tilde{I}_{L_S}(P) = \tilde{H}(\underline{L}_S, P) - \tilde{H}(L_S, P)$$

Following Definition 12, we define

$$\tilde{I}_{L_S}(P; P') = \tilde{H}(L_S, P) + \tilde{H}(L_S, P') - \tilde{H}(L_S, P \cdot P')$$

We may consider the following program:

• A partition P is said to be γ-contextual if it minimizes

$$\tilde{I}_{L_S}(P)(1 - \gamma) - \tilde{I}_{\bar{L}_S}(P)$$

• Let $P_*$ be a γ-contextual partition for L, $\mu \in \mathbb{R}^+$, $k \in \mathbb{N}$. The partition P is considered µ, γ-syntactic if it minimizes

$$\max_{P_*} \tilde{I}_{L_S}(P; P_*) - \mu\, \tilde{I}_{L_S}(P)$$

5 Conclusion and Future Work

In this paper, we proposed a theoretical reformulation of the problem of learning syntactic information from a corpus. Current language models have difficulty acquiring syntactically relevant generalizations, for diverse reasons. On the one hand, we observe a natural tendency to lean towards shallow contextual generalizations, likely due to the maximum likelihood training objective. On the other hand, a corpus is not representative of human linguistic competence but of performance. It is however possible for linguistic competence - syntax - to emerge from data if we prompt models to establish a distinction between syntactic and contextual (semantic/pragmatic) information.

Two orientations can be identified for future work. The immediate one is experimentation. The current formulation of our syntax learning scheme needs adjustments in order to be applicable to real natural language corpora. At present, we are working on an incremental construction of the space of categories.

The second direction is towards extending the approach to morphologically rich languages. In that case, two types of surface realization need to be considered: word order and morphological markers. An agglutinating morphology probably allows a more straightforward application of the method, by treating affixes as individual elements of the vocabulary. The adaptation to other types of morphological markers will necessitate more elaborate linguistic reflection.

References

Kuncoro Adhiguna, Dyer Chris, Hale John, Yogatama Dani, Clark Stephen, and Blunsom Phil. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Marco Baroni. 2019. Linguistic generalization and compositionality in modern artificial neural networks. CoRR, abs/1904.00157.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Iris Berent and Gary Marcus. 2019. No integration without structured representations: Response to Pater. Language, 95:1:e75–e86.

Jean-Philippe Bernardy and Shalom Lappin. 2017. Using deep neural networks to learn syntactic agreement. Linguistic Issues in Language Technology, 15(2):1–15.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Samuel R. Bowman, Christopher D. Manning, and Christopher Potts. 2015. Tree-structured composition in neural networks without tree-structured architectures. In NIPS Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches.

Noam Chomsky. 1957. Syntactic Structures. Mouton, Berlin, Germany.

Noam Chomsky. 1980. Rules and representations. Behavioral and Brain Sciences, 3(1):1–15.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics.

Alexander Clark and Rémi Eyraud. 2006. Learning auxiliary fronting with grammatical inference. In Conference on Computational Language Learning.

Alexander Clark and Shalom Lappin. 2010. Unsupervised learning and grammar induction. In Handbook of Computational Linguistics and Natural Language Processing. Wiley-Blackwell, Oxford.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Gottlob Frege. 1892. Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. 2018. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 10:5:447–474.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Dieuwke Hupkes, Sara Veldhoen, and Willem H. Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.

Brenden M. Lake and Marco Baroni. 2017. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In 34th International Conference on Machine Learning.

Shalom Lappin and Stuart Shieber. 2007. Machine learning theory and practice as a source of insight into universal grammar. Journal of Linguistics, 43:393–427.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Cambridge University Press, Cambridge.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies.

Xiang Lisa Li and Jason Eisner. 2019. Specializing word embeddings (for parsing) by information bottleneck. In 2019 Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.

Tal Linzen. 2019. What can linguistics and deep learning contribute to each other? Response to Pater. Language, 95(1):e98–e108.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Richard McCoy, Robert Frank, and Tal Linzen. 2018. Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks. ArXiv, abs/1802.09091.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Frederick J. Newmeyer. 2003. Grammar is grammar and usage is usage. Language, 79:4:682–707.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Joe Pater. 2019. Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language, 95:1:41–74.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wentau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. 2019. Studying the inductive biases of RNNs with synthetic variations of natural languages. CoRR, abs/1903.06400.

Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. 2018. Can LSTM learn to capture agreement? The case of Basque. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Naomi Saphra and Adam Lopez. 2018. Language models learn POS first. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In Proceedings of the 7th International Conference on Learning Representations.

Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.

Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Naftali Tishby, Fernando Pereira, and William Bialek. 1999. The information bottleneck method. In Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
