On the gap between theoretical and computational linguistics

Marco Baroni, EACL, April 21st, 2021

Facebook AI Research

Foreword 1

• Theoretical linguistics:
  • in the generative tradition
  • 68% of articles in the last issues of the 3 highest-impact linguistics journals
• Computational linguistics:
  • deep learning for NLP
  • 40% of long papers at the last EACL had a deep-learning keyword in the title

Foreword 2

• My attitude here:
  • Ask not what theoretical linguistics can do for you – ask what you can do for theoretical linguistics (JFK)
• ... but theoretical linguists should also do their part to bridge the gap:
  • better experimental practices
  • less theory-internal argumentation
  • larger corpus data coverage

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

Deep nets’ shocking grammatical competence

Long-distance number agreement: structures, not strings!

• The boy is jumping
• The boy [that I saw yesterday with the girls and the dogs on the rocks] is jumping

Long-distance agreement in neural models

• Problem formulation in Linzen et al. TACL 2016
• Given the prefix...
  The boy that I saw yesterday with the girls and the dogs on the rocks...
• ... does a pre-trained neural language model assign a higher probability to the continuation...

• is (model got agreement right) OR are (model got agreement wrong)?
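Concretely, the probe only requires comparing the probabilities a language model assigns to the two verb forms. Below is a minimal sketch, with an off-the-shelf GPT-2 from the Hugging Face transformers library standing in for the LSTM language models that Linzen et al. actually evaluated (the model choice and variable names are illustrative, not the authors' setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# GPT-2 as a stand-in for the LSTM LMs of the original paper.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prefix = ("The boy that I saw yesterday with the girls "
          "and the dogs on the rocks")
ids = torch.tensor([tok.encode(prefix)])
with torch.no_grad():
    next_logits = model(ids).logits[0, -1]   # scores for the next token
logp = torch.log_softmax(next_logits, dim=-1)

# " is" and " are" are single tokens in GPT-2's BPE vocabulary.
is_id, are_id = tok.encode(" is")[0], tok.encode(" are")[0]
print("agreement correct" if logp[is_id] > logp[are_id] else "agreement wrong")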

Long-distance agreement in neural language models (results from Gulordava et al. NAACL 2018)

• Vanilla LSTM language models trained on Wikipedia samples guess the right number agreement with high accuracy:
  • in English, Hebrew, Italian and Russian
  • in a variety of constructions (subject-verb, noun-adjective, verb-verb...)
  • in nonsense sentences:
    The colorless green ideas [that I ate yesterday with the chair] sleep furiously
  • with accuracy comparable to that of human subjects

The result also holds for other architectures (CNNs: Bernardy and Lappin LILT 2017; Transformers: Goldberg arXiv 2019)

What do the long-distance agreement experiments teach us?

• Neural language models possess awesome grammatical skills!
• Such skills can thus be acquired without supposedly fundamental innate priors, such as a preference for tree-like structures

Learning music helps you read! Papadimitriou and Jurafsky EMNLP 2020

(Image source: Pixabay)

Learning music helps you read! Papadimitriou and Jurafsky EMNLP 2020

• Train an LSTM model on non-linguistic data
• Freeze the LSTM weights
• Fine-tune the input and output embeddings on Spanish
• Compute perplexity on Spanish text (recipe sketched below)
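A minimal sketch of this recipe in PyTorch, assuming a plain embedding-LSTM-projection language model; layer names, sizes and the omitted training loops are placeholders, not the setup of the paper:

```python
import torch
import torch.nn as nn

# A sketch of the transfer recipe, assuming a plain LSTM language model.
class LSTMLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 650):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # input embeddings
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)        # output embeddings

    def forward(self, x):
        hidden, _ = self.lstm(self.embed(x))
        return self.out(hidden)

model = LSTMLM(vocab_size=50_000)
# 1. ... pre-train on the non-linguistic corpus (music, code, ...) ...

# 2. Freeze the recurrent weights.
for p in model.lstm.parameters():
    p.requires_grad = False

# 3. Fine-tune only the input/output embeddings on Spanish.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
# ... Spanish fine-tuning loop here ...

# 4. Perplexity on held-out Spanish = exp(mean per-token cross-entropy).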

Pre-training sources (examples and descriptions from Papadimitriou and Jurafsky's Figure 3; all examples are taken from the corpora):

• Random: sampled randomly from the Spanish vocabulary, with no underlying structure of any kind linking words to each other. All words are equally likely in the Uniform corpus (marroquín jemer pertenecer osasuna formaron citoesqueleto relativismo), while common words are more likely in the Zipfian corpus (en con conocidas y en los victoriano como trabajar <unk> monte * en juegos días en el)
• Music: classical piano performances (over 172 hours; the MAESTRO dataset of Hawthorne et al. 2018), encoded as a linear token sequence with a vocabulary of 310 tokens. Music is structured on many levels: on a small timescale, for instance, each note is linked to its corresponding note when a motif is repeated but modulated down a whole-step
• Code: tokenized Java code (the Habeas corpus of Movshovitz-Attias and Cohen 2013, with comments removed). Code is richly structured: brackets are linked to their pairs, else statements are linked to an if statement, and coreference of variable names is unambiguous
• Parentheses: artificial corpora consisting of pairs of matching integers. In the Nesting Parentheses corpus, integer pairs nest hierarchically, so the arcs linking them never cross (0 29 29 0 0 5 5 0 1016 1016 9 8 8 28 28 9); in the Flat Parentheses corpus, each pair is placed independently of all the others, so the arcs can cross multiple times (21 13 21 6294 13 6294 5 5471 5 32 32 5471). The integers are sampled from the same Spanish vocabulary distribution as the Random Zipfian corpus

• The Zipfian Random baseline is controlled for vocabulary distribution: models that beat it at transfer to a human language cannot owe their success to lexical-level similarity alone; they must have acquired useful, generalizable syntactic information about the structures that link tokens
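As a concrete illustration of the nesting/flat contrast just described, here is a sketch of how such corpora could be generated. The sampling details are my assumptions (for instance, the real corpora draw integers from the Spanish Zipfian distribution rather than uniformly); only the crossing/non-crossing contrast is the point:

```python
import random

def nesting_pairs(n_pairs, vocab=50_000):
    """Hierarchically nested pairs: arcs never cross."""
    if n_pairs == 0:
        return []
    tok = random.randrange(vocab)
    inner = random.randint(0, n_pairs - 1)   # pairs nested inside this one
    return ([tok] + nesting_pairs(inner, vocab) + [tok]
            + nesting_pairs(n_pairs - 1 - inner, vocab))

def flat_pairs(n_pairs, vocab=50_000):
    """Independently placed pairs: arcs can cross multiple times."""
    seq = [None] * (2 * n_pairs)
    for _ in range(n_pairs):
        tok = random.randrange(vocab)
        free = [i for i, v in enumerate(seq) if v is None]
        i, j = random.sample(free, 2)
        seq[i] = seq[j] = tok
    return seq

print(nesting_pairs(4))   # e.g. [7, 19, 19, 3, 3, 7, 42, 42]
print(flat_pairs(4))      # e.g. [7, 19, 7, 42, 19, 3, 42, 3]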
Results

(Figure 4 of the paper: Spanish perplexities after training on each non-linguistic corpus; error bars are 95% confidence intervals over 5 random restarts.)

• Language models pretrained on music are far better at modelling Spanish than those pretrained on random data: LSTMs trained on music average 256.15 ppl on Spanish, versus 493.15 when training on the Zipfian random corpus, which has the same surface-level vocabulary and distribution as Spanish (a 237-ppl gap)
• Since the surface forms of music and language are so different, the difference cannot rest on surface-level heuristics: the results suggest generalizable, structurally-informed representations in LSTM language models
• Models trained on Java code transfer this knowledge even better (139.10 ppl on the Spanish test set): LSTMs trained on code capture the syntactic commonalities between code and natural language, such as recursion, in a manner usable for modelling natural language
• The two parentheses corpora result in equivalent perplexities, even though one has a hierarchical underlying structure and the other does not

Results

• Structural information is important for neural language models when acquiring a natural language
• More so than lexical frequency statistics

... and there's a lot more cool work on linguistically-oriented deep nets

• Linzen and Baroni: Syntactic Structures from Deep Learning. Annual Review of Linguistics 2021
• Proceedings of the BlackBox NLP workshop series
• Proceedings of the Society for Computation in Linguistics conference
• Interpretability and Model Analysis in NLP and the Linguistic Theories, Cognitive Modeling and Psycholinguistics tracks at EACL and other major *CL conferences
• Most of this work addresses core questions of generative linguistics:
  • learnability
  • generalization given insufficient data
  • centrality of syntactic structure
  • ...

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

The shallow impact of linguistically-oriented deep learning on linguistics

Tal Linzen's original long-distance agreement paper: 486 citations

The shallow impact of linguistically-oriented deep learning on linguistics

Of the 486 papers citing Tal Linzen's seminal long-distance agreement paper, excluding self-citations, a total of...

The shallow impact of linguistically-oriented deep learning on linguistics

Of the 486 papers citing Tal Linzen's seminal long-distance agreement paper, excluding self-citations, a total of... 3 qualify as theoretical linguistics papers:
• J. Pater: Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language 2019
• E. Dunbar: Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 2019
• S. Lappin & J.H. Lau: Gradient probabilistic models vs categorical grammars: A reply to Sprouse et al. (2018). The Science of Language 2018

The shallow impact of linguistically-oriented deep learning on linguistics

Compare to the paper's impact on agricultural studies (4 citations):
• Applying deep learning for agricultural classification using multitemporal SAR Sentinel-1 for Camargue, France
• An adversarial generative network for crop classification from remote sensing time series images
• Land cover classification via multitemporal spatial data by deep recurrent neural networks
• Deep recurrent neural networks for winter vegetation quality mapping via multitemporal SAR Sentinel-1

A mini-corpus of contemporary theoretical linguistics

• Top 3 theoretical linguistics journals according to the Scimago ranking:
  •
  • Natural Language and Linguistic Theory
  • Syntax
• Total papers in latest issues: 19
• Total syntax papers in latest issues: 13

The shallow impact of linguistically-oriented deep learning on linguistics

• Number of references to deep learning work in the theoretical linguistics mini-corpus:

The shallow impact of linguistically-oriented deep learning on linguistics

• Number of references to deep learning work in the theoretical linguistics mini-corpus: 0

What is theoretical linguistics about? Representative titles from the mini-corpus

• Compound Wh-questions and fragment answers in Japanese: Implications for the nature of ellipsis
• Diagnosing object agreement vs clitic doubling: an Inuit case study
• Diagnosing clause structure in a polysynthetic language: Wh-agreement and parasitic gaps in West Circassian
• In favour of the low IP area in the Arabic clause structure: Evidence from the VSO word order in Jordanian Arabic
• Spans in South Caucasian agreement: Revisiting the pieces of inflection

What is theoretical linguistics about?

• Mainstream theoretical linguistics is about... an impressive variety of languages!
• In my mini-corpus:
  • 1 article focusing on English
  • 2 articles focusing on other Indo-European languages (Russian and colloquial French)
  • 8 articles focusing on non-Indo-European languages (Turkish, Japanese, Tundra Nenets, West Circassian, Inuit, Samoan, Jordanian Arabic and Georgian)
  • 2 articles not focusing on any specific language (one about wide typological generalizations, one heavy on pure theory)

What is theoretical linguistics about?

• Rather than generically testing the "deep questions" of generative linguistics, papers start from theoretical assumptions and explore their empirical implications, with lots of emphasis on new empirical findings
• Typical methodology:
  • a prediction derives from typological analysis and strong universality assumptions
  • it is apparently contradicted by glaring counter-examples
  • the counter-examples are analyzed
  • happy ending: not only do the counter-examples turn out to be merely apparent, but the strong hypothesis leads to explaining new empirical evidence

The that-trace effect
Example suggested by David Adger; Portuguese data from Raposo NELS 1987

• The that-trace effect:
  • you said (that) Gi likes Carlo
  • who did you say __ likes Carlo?
  • * who did you say that __ likes Carlo?
• Strong hypothesis: the effect derives from innate constraints (and should thus be universal)
• Portuguese as a counterexample:
  que pessoas a Inês acha que __ viram o filme?
  which people the Ines thinks that __ saw the movie?
  "which people does Ines think saw the movie?"

The that-trace effect

• ... but wait, Portuguese also allows post-verbal subjects:
  assistiu ao jogo António
  attended to-the game António
  "Antonio attended the game"
• Perhaps the counterexample is only apparent:
  que pessoas a Inês acha que __ viram o filme?
  which people the Ines thinks that __ saw the movie?
  vs.
  que pessoas a Inês acha que viram o filme __?
  which people the Ines thinks that saw the movie __?

The that-trace effect

• Sticking to the strong hypothesis (the that-trace effect is also present in Portuguese) leads to explaining a further asymmetry in Portuguese
• In-situ wh-questions:
  encontraste quem ontem?
  met who yesterday?
  "who did you meet yesterday?"
• In embedded clauses, this is OK(-ish) when the subject wh-phrase is sentence-final, but very bad when it directly follows que (that):
  achas que falou com o teu pai acerca de ti quem?
  think that talked with the your father about of you who?
  * achas que quem falou com o teu pai acerca de ti?
  think that who talked with the your father about of you?
  "who do you think talked with your father about you?"

The that-trace effect story in short

• By sticking to the strongest hypothesis (the that-trace effect stems from universal constraints)...
• ... we were not only able to account for apparent Portuguese counterexamples...
• ... but also to capture a heretofore unobserved pattern of Portuguese without any further stipulation...
• ... thus extending the empirical coverage of universal grammar

Leaving the floor to the experts

Catalan Journal of Linguistics, Special Issue 2019, 7-26. https://doi.org/10.5565/rev/catjl.232

The achievements of Generative Syntax: a time chart and some reflections

Roberta D'Alessandro, Utrecht University/UiL-OTS, [email protected]


Abstract

In May 2015, a group of eminent linguists met in Athens to debate the road ahead for generative grammar. There was a lot of discussion, and the linguists expressed the intention to draw a list of achievements of generative grammar, for the benefit of other linguists and of the field in general. The list has been sketched, and it is rather interesting, as it presents a general picture of the results that is very 'past-heavy'. In this paper I reproduce the list and discuss the reasons why it looks the way it does.

Keywords: generative grammar; syntax; linguistics; results




What are the most exciting issues in theoretical linguistics right now? (Roberta D'Alessandro, David Adger)

• labeling algorithm in syntax
• reconstruction phenomena
• unexpected agreement patterns
• features and their composition
• systematic gaps in typology
• differential object marking
• relation between Merge and Agree
• syntax- interface

How can computational methods help theoretical research? (Roberta, David)

• extract agreement patterns across microvariational data
• tagging/parsing large corpora
• cluster languages by different feature setups
• implement complex syntactic systems
• track evolution of single features across time

How can computational methods help theoretical research? (Roberta, David)

"Now, I know that NLP does completely different stuff, but hey: we're looking at the future, right? ☺" (Roberta)

How could we help?

• Tools for linguistic research:
  • parsers, automated classification of historical data...
  • deep nets might help building better tools, but this is not about deep learning per se
• Most current research in linguistically-oriented deep learning is instead about studying the linguistic behavior (successes and failures) of deep nets themselves
  • just browse the titles of any edition of BlackBox NLP, the Society for Computation in Linguistics, etc.

Outline

• Deep learning for deep linguistics
• The gap
• The theoretical significance of deep nets
• Conclusion: can we narrow the gap?

What can theoretical linguistics learn from linguistic successes and failures of deep nets?

A common idea:
• Deep nets are blank slates
• If a deep net can learn linguistic pattern P, then P must be learnable from data without special innate knowledge

"[This] sort of simulations [...] can help linguists focus on the aspects [...] that truly require explanation in terms of innate constraints. If the simulation shows that there is plenty of data for the learner to acquire a particular phenomenon, maybe there's nothing to explain!" (Tal Linzen, p.c.)

A blank slate?

(Image source: https://www.mihaileric.com/posts/transformers-attention-in-disguise/)

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Minimal training corpus:
  • aabaa -> b
  • bbabb -> a
  • aaaaa -> a
  • bbbbb -> b

Blank slates? Kharitonov and Chaabouni ICLR 2021

• The minimal training corpus is fully compatible with (at least) two generalizations:
  • repeat the symbol in the middle (hierarchical generalization)
  • repeat the third symbol (linear generalization)

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Test example where the output differs based on the chosen generalization:
  • aaabaaa -> ???
  • aaabaaa -> b under the hierarchical generalization
  • aaabaaa -> a under the linear generalization
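Stated as explicit rules, it is easy to see why the training set cannot distinguish the two generalizations while the test item can. These rule functions are mine for illustration; the paper of course measures which rule trained networks converge on:

```python
train = [("aabaa", "b"), ("bbabb", "a"), ("aaaaa", "a"), ("bbbbb", "b")]

def hierarchical(s):      # repeat the symbol in the middle
    return s[len(s) // 2]

def linear(s):            # repeat the third symbol
    return s[2]

# The two rules agree on every training example...
assert all(hierarchical(x) == y == linear(x) for x, y in train)
# ...but disagree on the test item, which is what diagnoses a model's bias.
print(hierarchical("aaabaaa"), linear("aaabaaa"))   # prints: b a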

Blank slates? Kharitonov and Chaabouni ICLR 2021

• Trained on the minimal corpus and tested on aaabaaa:
  • Transformers and LSTMs with attention show a strong preference for the hierarchical generalization (aaabaaa -> b)
  • CNNs and LSTMs without attention show a strong preference for the linear generalization (aaabaaa -> a)

Deep nets are not blank slates!

• Modern deep nets have complex, highly structured (and differing!) innate architectures...
• ... leading them to behave very differently when exposed to the same data
• Like formal linguistic theories, deep nets for language processing are full-fledged algorithmic models of linguistic knowledge!

Deep nets as linguistic theories?

• Like traditional formal linguistic theories, deep nets are algorithmic models assigning latent structures to sentences
  • these structures are dense vectors rather than symbolic trees
  • consequently, structure-building rules are not discrete tree-construction operations, but tensor-algebra computations
• Like traditional theories, deep-net theories can make precise predictions about sentence acceptability

Deep nets as linguistic theories: what's missing? (1)

• Low commitment to models:
  • we just like to use the state of the art
  • differences between LSTMs and Transformers are probably larger than differences between HPSG and Minimalism, yet linguistically-oriented computational linguistics is switching from the one to the other without much discussion
• Little or no reflection on the language-processing assumptions we make when we use a deep learning model:
  • significance of different input reading methods, gating mechanisms, attention, etc.

Deep nets as linguistic theories: what's missing? (2)

• Showing that our models capture known linguistic patterns is a good sanity check, but...
  • where are the interesting new predictions they make?
  • recall the Portuguese that-trace effect?
• It is problematic to be interested in deep nets' non-trivial predictions when
  • trivial changes can make a big difference
  • for predictions to further our understanding of how language works, we need a mechanistic understanding of how deep nets process language, which we largely lack

The impact of uninteresting parameters

• McCoy et al. TACL 2020 test several models and model variations on auxiliary fronting, a natural language phenomenon akin to hierarchical/linear generalization (as in the Kharitonov/Chaabouni study above)

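To make the task concrete, here is a sketch of the two competing rules that auxiliary fronting pits against each other; the example sentences and the shortcut of handing the hierarchical rule the main auxiliary's position are illustrative, not McCoy et al.'s actual seq2seq setup:

```python
AUX = {"can", "will", "is", "does"}

def move_first(sent):
    """Linear rule: front the linearly first auxiliary."""
    words = sent.split()
    i = next(k for k, w in enumerate(words) if w in AUX)
    return " ".join([words[i]] + words[:i] + words[i + 1:])

def move_main(sent, main_aux_pos):
    """Hierarchical rule: front the main-clause auxiliary. A real
    implementation needs a parse; here its position is given."""
    words = sent.split()
    return " ".join([words[main_aux_pos]] + words[:main_aux_pos]
                    + words[main_aux_pos + 1:])

simple = "my walrus can giggle"                  # the rules agree here
hard = "my walrus that is sleeping can giggle"   # ...and disagree here
print(move_first(simple))     # can my walrus giggle
print(move_first(hard))       # is my walrus that sleeping can giggle
print(move_main(hard, 5))     # can my walrus that is sleeping giggle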
The impact of uninteresting parameters

Effects of squashing (McCoy et al.'s Figure 5; all numbers are medians across 100 initializations; the standard versions of the architectures are the squashed GRU and the unsquashed LSTM):

        Unsquashed   Squashed            Unsquashed   Squashed
GRU        0.99        0.77       GRU       0.54        0.78
LSTM       0.98        0.98       LSTM      0.05        0.43
(a) Full-sentence accuracy        (b) First-word accuracy
    on the test set                   on the generalization set

• Both the recurrent unit and the type of attention can qualitatively affect a model's inductive bias: LSTMs and GRUs with squashing chose the hierarchical rule (MOVE-MAIN) more often than the corresponding models without squashing
• The random seed alone also matters: across runs differing only in seed, the generalization-set first-word accuracy of SRNs with content-based attention ranged from 0.17 to 0.90, that is, from 83% preference for the linear generalization to 90% preference for the hierarchical generalization

Deep nets as linguistic theories: what's missing? (2)

• Where are the interesting predictions our models make?
  • recall the Portuguese that-trace effect?
• It is problematic to be interested in deep nets' non-trivial predictions when
  • trivial changes can make a huge difference
  • for predictions to further our understanding of how language works, we need a mechanistic understanding of how deep nets process language, which we largely lack

Taking deep nets' predictions seriously
Lakretz et al. NAACL 2019, Cognition to appear

• In-depth studies of LSTM long-distance agreement processing, down to a cell-by-cell level
• LSTMs have a single, sparse mechanism to track long-distance agreement
• Non-trivial prediction: two embedded long-distance agreement relations will lead to processing problems, with the inner (shorter-distance) one being more difficult:

  The kids that the teacher with the pencils likes say...
  (kids ... say: longer distance, but easier; teacher ... likes: shorter distance, but harder)
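A sketch of how this asymmetry can be probed, again with GPT-2 standing in for the LSTMs that Lakretz et al. actually dissect; the margin measure and sentence framing are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def margin(context, good, bad):
    """log P(good | context) - log P(bad | context), first-token scores."""
    ids = torch.tensor([tok.encode(context)])
    with torch.no_grad():
        logp = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    return (logp[tok.encode(" " + good)[0]]
            - logp[tok.encode(" " + bad)[0]]).item()

# Inner (shorter-distance) dependency: teacher ... likes
inner = margin("The kids that the teacher with the pencils", "likes", "like")
# Outer (longer-distance) dependency: kids ... say
outer = margin("The kids that the teacher with the pencils likes", "say", "says")
print(inner, outer)   # predicted: smaller margin (harder) at the inner site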

Taking deep nets' predictions seriously
Lakretz et al. NAACL 2019, Cognition to appear

The kids that the teacher with the pencils likes say...
(kids ... say: longer distance, but easier; teacher ... likes: shorter distance, but harder)

• The prediction is borne out in LSTMs and (more weakly) in human subjects!
• ... a single network type, a single construction, 4 years in the making!

Outline

• Deep learning for deep linguistics
• The gap
• Deep nets as linguistic theories?
• Conclusion: can we narrow the gap?

Outro

• Deep nets have something to say about language
• How can we make them more relevant to theoretical linguistics?
  • ... besides harnessing them to create better massively multilingual language analysis tools

Outro

• It is not unreasonable to think of deep nets as linguistic theories
• However, we must stop shooting at the moving target of state-of-the-art performance and commit to a model
• We need to understand the model well enough to use it to make interesting predictions about human language
• This takes lots of time and patience, and it is incompatible with the current pace of exploration of models and linguistic phenomena in computational linguistics
• Can we slow down a bit? ☺

With many thanks to...

Tal Linzen Emmanuel Dupoux Adina Williams Gemma Boleda Louise McNally

Roberta D’Alessandro Dieuwke Hupkes Shalom Lappin David Adger

Thank you!

• Questions?
• Wanna catch up later? Ping me @ mbaroni at gmail
• We can chat at the Gather Town social gathering, if I don't get lost in there ☺
• Birds of a Feather Meetup on Linguistic Theories, Cognitive Modeling and Psycholinguistics: tomorrow, 7.30pm-8.30pm CEST
