On the gap between theoretical and computational linguistics

Marco Baroni, EACL, April 21st, 2021

Facebook AI Research

Foreword 1

• Theoretical linguistics:
  • in the generative tradition
  • 68% of articles in the last issues of the 3 highest-impact linguistics journals
• Computational linguistics:
  • deep learning for NLP
  • 40% of long papers at the last EACL had a deep-learning keyword in the title

Foreword 2

• My attitude here:
  • Ask not what theoretical linguistics can do for you – ask what you can do for theoretical linguistics (JFK)
• ... but theoretical linguists should also do their part to bridge the gap:
  • better experimental practices
  • less theory-internal argumentation
  • larger corpus data coverage

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

Deep nets’ shocking grammatical competence

Long-distance number agreement: structures, not strings!

• The boy is jumping
• The boy [that I saw yesterday with the girls and the dogs on the rocks] is jumping

Long-distance agreement in neural models

• Problem formulation in Linzen et al. TACL 2016
• Given the prefix...
  The boy that I saw yesterday with the girls and the dogs on the rocks...
• ... does a pre-trained neural language model assign a higher probability to the continuation...

• is (model got agreement right) OR are (model got agreement wrong)?
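Concretely, the probe only requires comparing the probabilities a language model assigns to the two verb forms. Below is a minimal sketch, with an off-the-shelf GPT-2 from the Hugging Face transformers library standing in for the LSTM language models that Linzen et al. actually evaluated (the model choice and variable names are illustrative, not the authors' setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2 as a stand-in for the LSTM LMs of the original paper.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prefix = ("The boy that I saw yesterday with the girls "
          "and the dogs on the rocks")
ids = torch.tensor([tok.encode(prefix)])
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]   # scores for the next token
logp = torch.log_softmax(next_logits, dim=-1)

# " is" and " are" are single tokens in GPT-2's BPE vocabulary.
is_id, are_id = tok.encode(" is")[0], tok.encode(" are")[0]
print("agreement correct" if logp[is_id] > logp[are_id] else "agreement wrong")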

Long-distance agreement in neural language models (results from Gulordava et al. NAACL 2018)

• Vanilla LSTM language models trained on Wikipedia samples guess the right number agreement with high accuracy:
  • in English, Hebrew, Italian and Russian
  • in a variety of constructions (subject-verb, noun-adjective, verb-verb...)
  • in nonsense sentences:
    The colorless green ideas [that I ate yesterday with the chair] sleep furiously
  • with accuracy comparable to that of human subjects

The result also holds for other architectures (CNNs: Bernardy and Lappin LILT 2017; Transformers: Goldberg arXiv 2019)

What do the long-distance agreement experiments teach us?

• Neural language models possess awesome grammatical skills!
• Such skills can thus be acquired without supposedly fundamental innate priors, such as a preference for tree-like structures

Learning music helps you read! Papadimitriou and Jurafsky EMNLP 2020

(Image source: Pixabay)

Learning music helps you read! Papadimitriou and Jurafsky EMNLP 2020

• Train an LSTM model on non-linguistic data
• Freeze the LSTM weights
• Fine-tune the input and output embeddings on Spanish
• Compute perplexity on Spanish text (recipe sketched below)
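A minimal sketch of this recipe in PyTorch, assuming a plain embedding-LSTM-projection language model; layer names, sizes and the omitted training loops are placeholders, not the setup of the paper:

```python
import torch
import torch.nn as nn

# A sketch of the transfer recipe, assuming a plain LSTM language model.
class LSTMLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 650):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # input embeddings
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)        # output embeddings

    def forward(self, x):
        hidden, _ = self.lstm(self.embed(x))
        return self.out(hidden)

model = LSTMLM(vocab_size=50_000)
# 1. ... pre-train on the non-linguistic corpus (music, code, ...) ...

# 2. Freeze the recurrent weights.
for p in model.lstm.parameters():
    p.requires_grad = False

# 3. Fine-tune only the input/output embeddings on Spanish.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
# ... Spanish fine-tuning loop here ...

# 4. Perplexity on held-out Spanish = exp(mean per-token cross-entropy).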

Pre-training sources (examples and descriptions from Papadimitriou and Jurafsky's Figure 3; all examples are taken from the corpora):

• Random: sampled randomly from the Spanish vocabulary, with no underlying structure of any kind linking words to each other. All words are equally likely in the Uniform corpus (marroquín jemer pertenecer osasuna formaron citoesqueleto relativismo), while common words are more likely in the Zipfian corpus (en con conocidas y en los victoriano como trabajar <unk> monte * en juegos días en el)
• Music: classical piano performances (over 172 hours; the MAESTRO dataset of Hawthorne et al. 2018), encoded as a linear token sequence with a vocabulary of 310 tokens. Music is structured on many levels: on a small timescale, for instance, each note is linked to its corresponding note when a motif is repeated but modulated down a whole-step
• Code: tokenized Java code (the Habeas corpus of Movshovitz-Attias and Cohen 2013, with comments removed). Code is richly structured: brackets are linked to their pairs, else statements are linked to an if statement, and coreference of variable names is unambiguous
• Parentheses: artificial corpora consisting of pairs of matching integers. In the Nesting Parentheses corpus, integer pairs nest hierarchically, so the arcs linking them never cross (0 29 29 0 0 5 5 0 1016 1016 9 8 8 28 28 9); in the Flat Parentheses corpus, each pair is placed independently of all the others, so the arcs can cross multiple times (21 13 21 6294 13 6294 5 5471 5 32 32 5471). The integers are sampled from the same Spanish vocabulary distribution as the Random Zipfian corpus

• The Zipfian Random baseline is controlled for vocabulary distribution: models that beat it at transfer to a human language cannot owe their success to lexical-level similarity alone; they must have acquired useful, generalizable syntactic information about the structures that link tokens
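As a concrete illustration of the nesting/flat contrast just described, here is a sketch of how such corpora could be generated. The sampling details are my assumptions (for instance, the real corpora draw integers from the Spanish Zipfian distribution rather than uniformly); only the crossing/non-crossing contrast is the point:

```python
import random

def nesting_pairs(n_pairs, vocab=50_000):
    """Hierarchically nested pairs: arcs never cross."""
    if n_pairs == 0:
        return []
    tok = random.randrange(vocab)
    inner = random.randint(0, n_pairs - 1)   # pairs nested inside this one
    return ([tok] + nesting_pairs(inner, vocab) + [tok]
            + nesting_pairs(n_pairs - 1 - inner, vocab))

def flat_pairs(n_pairs, vocab=50_000):
    """Independently placed pairs: arcs can cross multiple times."""
    seq = [None] * (2 * n_pairs)
    for _ in range(n_pairs):
        tok = random.randrange(vocab)
        free = [i for i, v in enumerate(seq) if v is None]
        i, j = random.sample(free, 2)
        seq[i] = seq[j] = tok
    return seq

print(nesting_pairs(4))   # e.g. [7, 19, 19, 3, 3, 7, 42, 42]
print(flat_pairs(4))      # e.g. [7, 19, 7, 42, 19, 3, 42, 3]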
Results

(Figure 4 of the paper: Spanish perplexities after training on each non-linguistic corpus; error bars are 95% confidence intervals over 5 random restarts.)

• Language models pretrained on music are far better at modelling Spanish than those pretrained on random data: LSTMs trained on music average 256.15 ppl on Spanish, versus 493.15 when training on the Zipfian random corpus, which has the same surface-level vocabulary and distribution as Spanish (a 237-ppl gap)
• Since the surface forms of music and language are so different, the difference cannot rest on surface-level heuristics: the results suggest generalizable, structurally-informed representations in LSTM language models
• Models trained on Java code transfer this knowledge even better (139.10 ppl on the Spanish test set): LSTMs trained on code capture the syntactic commonalities between code and natural language, such as recursion, in a manner usable for modelling natural language
• The two parentheses corpora result in equivalent perplexities, even though one has a hierarchical underlying structure and the other does not

Results

• Structural information is important for neural language models when acquiring a natural language
• More so than lexical frequency statistics

... and there's a lot more cool work on linguistically-oriented deep nets

• Linzen and Baroni: Syntactic Structures from Deep Learning. Annual Review of Linguistics 2021
• Proceedings of the BlackBox NLP workshop series
• Proceedings of the Society for Computation in Linguistics conference
• Interpretability and Model Analysis in NLP and the Linguistic Theories, Cognitive Modeling and Psycholinguistics tracks at EACL and other major *CL conferences
• Most of this work addresses core questions of generative linguistics:
  • learnability
  • generalization given insufficient data
  • centrality of syntactic structure
  • ...

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

The shallow impact of linguistically-oriented deep learning on linguistics

Tal Linzen's original long-distance agreement paper: 486 citations

The shallow impact of linguistically-oriented deep learning on linguistics

Of the 486 papers citing Tal Linzen's seminal long-distance agreement paper, excluding self-citations, a total of...

The shallow impact of linguistically-oriented deep learning on linguistics

Of the 486 papers citing Tal Linzen's seminal long-distance agreement paper, excluding self-citations, a total of... 3 qualify as theoretical linguistics papers:
• J. Pater: Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language 2019
• E. Dunbar: Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 2019
• S. Lappin & J.H. Lau: Gradient probabilistic models vs categorical grammars: A reply to Sprouse et al. (2018). The Science of Language 2018

The shallow impact of linguistically-oriented deep learning on linguistics

Compare to the paper's impact on agricultural studies (4 citations):
• Applying deep learning for agricultural classification using multitemporal SAR Sentinel-1 for Camargue, France
• An adversarial generative network for crop classification from remote sensing time series images
• Land cover classification via multitemporal spatial data by deep recurrent neural networks
• Deep recurrent neural networks for winter vegetation quality mapping via multitemporal SAR Sentinel-1

A mini-corpus of contemporary theoretical linguistics

• Top 3 theoretical linguistics journals according to the Scimago ranking:
  •
  • Natural Language and Linguistic Theory
  • Syntax
• Total papers in latest issues: 19
• Total syntax papers in latest issues: 13

The shallow impact of linguistically-oriented deep learning on linguistics

• Number of references to deep learning work in the theoretical linguistics mini-corpus:

The shallow impact of linguistically-oriented deep learning on linguistics

• Number of references to deep learning work in the theoretical linguistics mini-corpus: 0

What is theoretical linguistics about? Representative titles from the mini-corpus

• Compound Wh-questions and fragment answers in Japanese: Implications for the nature of ellipsis
• Diagnosing object agreement vs clitic doubling: an Inuit case study
• Diagnosing clause structure in a polysynthetic language: Wh-agreement and parasitic gaps in West Circassian
• In favour of the low IP area in the Arabic clause structure: Evidence from the VSO word order in Jordanian Arabic
• Spans in South Caucasian agreement: Revisiting the pieces of inflection

What is theoretical linguistics about?

• Mainstream theoretical linguistics is about... an impressive variety of languages!
• In my mini-corpus:
  • 1 article focusing on English
  • 2 articles focusing on other Indo-European languages (Russian and colloquial French)
  • 8 articles focusing on non-Indo-European languages (Turkish, Japanese, Tundra Nenets, West Circassian, Inuit, Samoan, Jordanian Arabic and Georgian)
  • 2 articles not focusing on any specific language (one about wide typological generalizations, one heavy on pure theory)

What is theoretical linguistics about?

• Rather than generically testing the "deep questions" of generative linguistics, papers start from theoretical assumptions and explore their empirical implications, with lots of emphasis on new empirical findings
• Typical methodology:
  • a prediction derives from typological analysis and strong universality assumptions
  • it is apparently contradicted by glaring counter-examples
  • the counter-examples are analyzed
  • happy ending: not only do the counter-examples turn out to be merely apparent, but the strong hypothesis leads to explaining new empirical evidence

The that-trace effect
Example suggested by David Adger; Portuguese data from Raposo NELS 1987

• The that-trace effect:
  • you said (that) Gi likes Carlo
  • who did you say __ likes Carlo?
  • * who did you say that __ likes Carlo?
• Strong hypothesis: the effect derives from innate constraints (and should thus be universal)
• Portuguese as a counterexample:
  que pessoas a Inês acha que __ viram o filme?
  which people the Ines thinks that __ saw the movie?
  "which people does Ines think saw the movie?"

The that-trace effect

• ... but wait, Portuguese also allows post-verbal subjects:
  assistiu ao jogo António
  attended to-the game António
  "Antonio attended the game"
• Perhaps the counterexample is only apparent:
  que pessoas a Inês acha que __ viram o filme?
  which people the Ines thinks that __ saw the movie?
  vs.
  que pessoas a Inês acha que viram o filme __?
  which people the Ines thinks that saw the movie __?

The that-trace effect

• Sticking to the strong hypothesis (the that-trace effect is also present in Portuguese) leads to explaining a further asymmetry in Portuguese
• In-situ wh-questions:
  encontraste quem ontem?
  met who yesterday?
  "who did you meet yesterday?"
• In embedded clauses, this is OK(-ish) when the subject wh-phrase is sentence-final, but very bad when it directly follows que (that):
  achas que falou com o teu pai acerca de ti quem?
  think that talked with the your father about of you who?
  * achas que quem falou com o teu pai acerca de ti?
  think that who talked with the your father about of you?
  "who do you think talked with your father about you?"

The that-trace effect story in short

• By sticking to the strongest hypothesis (the that-trace effect stems from universal constraints)...
• ... we were not only able to account for apparent Portuguese counterexamples...
• ... but also to capture a heretofore unobserved pattern of Portuguese without any further stipulation...
• ... thus extending the empirical coverage of universal grammar

Leaving the floor to the experts

Catalan Journal of Linguistics, Special Issue 2019, 7-26. https://doi.org/10.5565/rev/catjl.232

The achievements of Generative Syntax: a time chart and some reflections

Roberta D'Alessandro, Utrecht University/UiL-OTS, [email protected]


Abstract

In May 2015, a group of eminent linguists met in Athens to debate the road ahead for generative grammar. There was a lot of discussion, and the linguists expressed the intention to draw a list of achievements of generative grammar, for the benefit of other linguists and of the field in general. The list has been sketched, and it is rather interesting, as it presents a general picture of the results that is very 'past-heavy'. In this paper I reproduce the list and discuss the reasons why it looks the way it does.

Keywords: generative grammar; syntax; linguistics; results




What are the most exciting issues in theoretical linguistics right now? (Roberta D'Alessandro, David Adger)

• labeling algorithm in syntax
• reconstruction phenomena
• unexpected agreement patterns
• features and their composition
• systematic gaps in typology
• differential object marking
• relation between Merge and Agree
• syntax- interface

How can computational methods help theoretical research? (Roberta, David)

• extract agreement patterns across microvariational data
• tagging/parsing large corpora
• cluster languages by different feature setups
• implement complex syntactic systems
• track evolution of single features across time

How can computational methods help theoretical research? (Roberta, David)

"Now, I know that NLP does completely different stuff, but hey: we're looking at the future, right? ☺" (Roberta)

How could we help?

• Tools for linguistic research:
  • parsers, automated classification of historical data...
  • deep nets might help building better tools, but this is not about deep learning per se
• Most current research in linguistically-oriented deep learning is instead about studying the linguistic behavior (successes and failures) of deep nets themselves
  • just browse the titles of any edition of BlackBox NLP, the Society for Computation in Linguistics, etc.

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

What can theoretical linguistics learn from linguistic successes and failures of deep nets?

A common idea:
• Deep nets are blank slates
• If a deep net can learn linguistic pattern P, then P must be learnable from data without special innate knowledge

"[This] sort of simulations [...] can help linguists focus on the aspects [...] that truly require explanation in terms of innate constraints. If the simulation shows that there is plenty of data for the learner to acquire a particular phenomenon, maybe there's nothing to explain!" (Tal Linzen, p.c.)

A blank slate?

(Image source: https://www.mihaileric.com/posts/transformers-attention-in-disguise/)

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Minimal training corpus:
  • aabaa -> b
  • bbabb -> a
  • aaaaa -> a
  • bbbbb -> b

Blank slates? Kharitonov and Chaabouni ICLR 2021

• The minimal training corpus is fully compatible with (at least) two generalizations:
  • repeat the symbol in the middle (hierarchical generalization)
  • repeat the third symbol (linear generalization)

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Test example where the output differs based on the chosen generalization:
  • aaabaaa -> ???
  • aaabaaa -> b under the hierarchical generalization
  • aaabaaa -> a under the linear generalization
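Stated as explicit rules, it is easy to see why the training set cannot distinguish the two generalizations while the test item can. These rule functions are mine for illustration; the paper of course measures which rule trained networks converge on:

```python
train = [("aabaa", "b"), ("bbabb", "a"), ("aaaaa", "a"), ("bbbbb", "b")]

def hierarchical(s):      # repeat the symbol in the middle
    return s[len(s) // 2]

def linear(s):            # repeat the third symbol
    return s[2]

# The two rules agree on every training example...
assert all(hierarchical(x) == y == linear(x) for x, y in train)
# ...but disagree on the test item, which is what diagnoses a model's bias.
print(hierarchical("aaabaaa"), linear("aaabaaa"))   # prints: b a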

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Trained on the minimal corpus and tested on aaabaaa:
  • Transformers and LSTMs with attention show a strong preference for the hierarchical generalization (aaabaaa -> b)
  • CNNs and LSTMs without attention show a strong preference for the linear generalization (aaabaaa -> a)

Deep nets are not blank slates!

• Modern deep nets have complex, highly structured (and differing!) innate architectures...
• ... leading them to behave very differently when exposed to the same data
• Like formal linguistic theories, deep nets for language processing are full-fledged algorithmic models of linguistic knowledge!

Deep nets as linguistic theories?

• Like traditional formal linguistic theories, deep nets are algorithmic models assigning latent structures to sentences
  • these structures are dense vectors rather than symbolic trees
  • consequently, structure-building rules are not discrete tree-construction operations, but tensor-algebra computations
• Like traditional theories, deep-net theories can make precise predictions about sentence acceptability

Deep nets as linguistic theories: what's missing? (1)

• Low commitment to models:
  • we just like to use the state of the art
  • differences between LSTMs and Transformers are probably larger than differences between HPSG and Minimalism, yet linguistically-oriented computational linguistics is switching from the one to the other without much discussion
• Little or no reflection on the language-processing assumptions we make when we use a deep learning model:
  • significance of different input reading methods, gating mechanisms, attention, etc.

Deep nets as linguistic theories: what's missing? (2)

• Showing that our models capture known linguistic patterns is a good sanity check, but...
  • where are the interesting new predictions they make?
  • recall the Portuguese that-trace effect?
• It is problematic to be interested in deep nets' non-trivial predictions when
  • trivial changes can make a big difference
  • for predictions to further our understanding of how language works, we need a mechanistic understanding of how deep nets process language, which we largely lack

The impact of uninteresting parameters

• McCoy et al. TACL 2020 test several models and model variations on auxiliary fronting, a natural language phenomenon akin to hierarchical/linear generalization (as in the Kharitonov/Chaabouni study above)

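To make the task concrete, here is a sketch of the two competing rules that auxiliary fronting pits against each other; the example sentences and the shortcut of handing the hierarchical rule the main auxiliary's position are illustrative, not McCoy et al.'s actual seq2seq setup:

```python
AUX = {"can", "will", "is", "does"}

def move_first(sent):
    """Linear rule: front the linearly first auxiliary."""
    words = sent.split()
    i = next(k for k, w in enumerate(words) if w in AUX)
    return " ".join([words[i]] + words[:i] + words[i + 1:])

def move_main(sent, main_aux_pos):
    """Hierarchical rule: front the main-clause auxiliary. A real
    implementation needs a parse; here its position is given."""
    words = sent.split()
    return " ".join([words[main_aux_pos]] + words[:main_aux_pos]
                    + words[main_aux_pos + 1:])

simple = "my walrus can giggle"                  # the rules agree here
hard = "my walrus that is sleeping can giggle"   # ...and disagree here
print(move_first(simple))     # can my walrus giggle
print(move_first(hard))       # is my walrus that sleeping can giggle
print(move_main(hard, 5))     # can my walrus that is sleeping giggle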
The impact of uninteresting parameters

Effects of squashing (McCoy et al.'s Figure 5; all numbers are medians across 100 initializations; the standard versions of the architectures are the squashed GRU and the unsquashed LSTM):

        Unsquashed   Squashed            Unsquashed   Squashed
GRU        0.99        0.77       GRU       0.54        0.78
LSTM       0.98        0.98       LSTM      0.05        0.43
(a) Full-sentence accuracy        (b) First-word accuracy
    on the test set                   on the generalization set

• Both the recurrent unit and the type of attention can qualitatively affect a model's inductive bias: LSTMs and GRUs with squashing chose the hierarchical rule (MOVE-MAIN) more often than the corresponding models without squashing
• The random seed alone also matters: across runs differing only in seed, the generalization-set first-word accuracy of SRNs with content-based attention ranged from 0.17 to 0.90, that is, from 83% preference for the linear generalization to 90% preference for the hierarchical generalization

Deep nets as linguistic theories: what's missing? (2)

• Where are the interesting predictions our models make?
  • recall the Portuguese that-trace effect?
• It is problematic to be interested in deep nets' non-trivial predictions when
  • trivial changes can make a huge difference
  • for predictions to further our understanding of how language works, we need a mechanistic understanding of how deep nets process language, which we largely lack

Taking deep nets' predictions seriously
Lakretz et al. NAACL 2019, Cognition to appear

• In-depth studies of LSTM long-distance agreement processing, down to a cell-by-cell level
• LSTMs have a single, sparse mechanism to track long-distance agreement
• Non-trivial prediction: two embedded long-distance agreement relations will lead to processing problems, with the inner (shorter-distance) one being more difficult:

  The kids that the teacher with the pencils likes say...
  (kids ... say: longer distance, but easier; teacher ... likes: shorter distance, but harder)
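A sketch of how this asymmetry can be probed, again with GPT-2 standing in for the LSTMs that Lakretz et al. actually dissect; the margin measure and sentence framing are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def margin(context, good, bad):
    """log P(good | context) - log P(bad | context), first-token scores."""
    ids = torch.tensor([tok.encode(context)])
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    return (logp[tok.encode(" " + good)[0]]
            - logp[tok.encode(" " + bad)[0]]).item()

# Inner (shorter-distance) dependency: teacher ... likes
inner = margin("The kids that the teacher with the pencils", "likes", "like")
# Outer (longer-distance) dependency: kids ... say
outer = margin("The kids that the teacher with the pencils likes", "say", "says")
print(inner, outer)   # predicted: smaller margin (harder) at the inner site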

Taking deep nets' predictions seriously
Lakretz et al. NAACL 2019, Cognition to appear

The kids that the teacher with the pencils likes say...
(kids ... say: longer distance, but easier; teacher ... likes: shorter distance, but harder)

• The prediction is borne out in LSTMs and (more weakly) in human subjects!
• ... a single network type, a single construction, 4 years in the making!

Outline

• Deep learning for deep linguistics
• The gap
• Deep nets as linguistic theories?
• Conclusion: can we narrow the gap?

Outro

• Deep nets have something to say about language
• How can we make them more relevant to theoretical linguistics?
  • ... besides harnessing them to create better massively multilingual language analysis tools

Outro

• It is not unreasonable to think of deep nets as linguistic theories
• However, we must stop shooting at the moving target of state-of-the-art performance and commit to a model
• We need to understand the model well enough to use it to make interesting predictions about human language
• This takes lots of time and patience, and it is incompatible with the current pace of exploration of models and linguistic phenomena in computational linguistics
• Can we slow down a bit? ☺

With many thanks to...

Tal Linzen Emmanuel Dupoux Adina Williams Gemma Boleda Louise McNally

Roberta D’Alessandro Dieuwke Hupkes Shalom Lappin David Adger

Thank you!

• Questions?
• Wanna catch up later? Ping me @ mbaroni at gmail
• We can chat at the Gather Town social gathering, if I don't get lost in there ☺
• Birds of a Feather Meetup on Linguistic Theories, Cognitive Modeling and Psycholinguistics: tomorrow, 7.30pm-8.30pm CEST
