Grammaticality and Language Modelling
Jingcheng Niu and Gerald Penn
Department of Computer Science
University of Toronto
Toronto, Canada
[email protected]

Abstract

Ever since Pereira (2000) provided evidence against Chomsky's (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as "psycholinguistic subjects" and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned a spot on the leaderboard for acceptability judgements, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve.

That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), although probably because CoLA's seminal publication used it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other.

The score that we will advocate for in this paper, the point-biserial correlation (PBC), in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018), and that paper actually applied it backwards (with some justification), so that the language model probability was treated as the discrete binary variable by setting a threshold.

With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high watermark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. We also study the effects of several variables such as normalization and data homogeneity on PBC.

1 Background

The three currently most popular means of evaluating a neural language model are: (1) perplexity, an information-theoretic measure that was in use long before neural networks became the preferred means of implementing language models; (2) task performance profiles, in which derivative aspects of a language model's predictions are embedded in a so-called "downstream" task, with all other aspects of the implementation held constant; and (3) targeted linguistic evaluations, the purpose of which is to demonstrate specific syntactic generalizations that a candidate model implicitly captures or does not capture. These targeted evaluations must take place on a large number of small data sets in order to control for the syntactic and lexical variations that we witness among sentences in a realistic corpus.

The purpose of this paper is ultimately to find a task-independent means of testing how well language model probabilities might serve as grammaticality regression scores. Using evidence from targeted linguistic evaluations, we argue for the point-biserial correlation as at least the basis of such a task-independent measure, and then use the PBC to examine several neural models along with some important variables that affect both their evaluation and the data that we evaluate on.
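To make the measure concrete, here is a minimal sketch, assuming invented placeholder judgements and scores (none of these numbers come from the paper): scipy.stats.pointbiserialr correlates binary acceptability judgements with continuous language model scores, and a thresholded variant illustrates one way to realize the "backwards" application attributed to Sprouse et al. (2018), assuming a gradient acceptability rating as the continuous variable.

```python
# Illustrative sketch only: the judgements and scores below are invented
# placeholders, not data or results from this paper.
import numpy as np
from scipy import stats

# Binary human acceptability judgements (1 = acceptable, 0 = unacceptable).
judgements = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Continuous language model scores for the same strings,
# e.g. sentence log probabilities.
log_probs = np.array([-21.3, -19.8, -25.1, -33.4, -30.2, -22.7, -29.8, -35.0])

# The PBC as advocated here: discrete judgements vs. continuous LM scores.
r_pbc, p_pbc = stats.pointbiserialr(judgements, log_probs)
print(f"PBC = {r_pbc:.3f} (p = {p_pbc:.3f})")

# One way to realize the "backwards" application: threshold the LM score so
# that it becomes the discrete variable, and correlate it with a continuous
# rating (here, hypothetical gradient acceptability ratings on a 7-point scale).
gradient_ratings = np.array([6.1, 5.8, 5.2, 2.3, 3.0, 5.5, 2.8, 1.9])
lm_binary = (log_probs > np.median(log_probs)).astype(int)  # illustrative threshold
r_back, p_back = stats.pointbiserialr(lm_binary, gradient_ratings)
print(f"backwards PBC = {r_back:.3f} (p = {p_back:.3f})")
```

Numerically, the point-biserial correlation is just Pearson's r computed between a dichotomous and a continuous variable, which is why the measure can be applied in either direction once one of the two variables has been made discrete.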
Marvin and Linzen (2018) coined the use of "minimal pairs" as input to language models in order to test these fine-grained variations. For example:

(1) Reflexive pronoun in a sentential complement:
    a. The bankers thought the pilot embarrassed himself.
    b. *The bankers thought the pilot embarrassed themselves.

(2) Reflexive pronoun across an object relative clause:
    a. The manager that the architects like doubted himself.
    b. *The manager that the architects like doubted themselves.

These pairs deal with referential agreement in specific syntactic environments. If a model assigns the grammatical string in a pair a higher score than the ungrammatical string, then we say that the model made the correct prediction on that pair. Having evaluated the model over a large number of these pairs, we can compute an accuracy score, relative to a 50% random baseline.

Hu et al. (2020) have taken exception to the design of many such evaluations on the grounds that: (1) a number of English nouns are stereotypically gendered, which conditions pronoun choice, and (2) the unigram probabilities of reflexive pronouns are different, which biases the probabilities that models assign to sentences that contain them. To circumvent these shortcomings, they generalized the pairs to larger sets of strings in which multiple nouns were used in multiple positions so that lexical choice and order could be permuted across the sets. They also introduced distractors, grammatical strings that contain material irrelevant, or distracting, to the determination of the sentence's grammaticality. One set that they use, for example, is:

(1B) The girl said that the mother saw herself.
(2B) The mother said that the girl saw herself.
(1D) The girls said that the mother saw herself.
(2D) The mothers said that the girl saw herself.
(1U) The girl said that the mothers saw herself.
(2U) The mother said that the girls saw herself.

Borrowing a convention from linguistic theory, (B) marks a baseline grammatical string, (D) a distractor, and (U) an ungrammatical string. This set has six strings, but sets in their experiments can have as many as 48 strings each, with as many as 75 sets in a single experiment, each one having a unique target pronoun in all of its strings. Because here it is the context that varies, rather than the pronoun, Hu et al. (2020) must rank the conditional probabilities of the pronoun in these various contexts, rather than total sentence probabilities.

Hu et al. (2020) also evaluate models with accuracies. Because there are three classes of string, rather than two, a model is said to have made the correct prediction if the ungrammatical data receive a lower score than both the baseline and distractor data. But because there are more than three strings, they do not compare individual scores from the candidate model, but rather the three means that result from averaging the conditional pronoun probabilities of the baseline, distractor and ungrammatical strings, respectively.

This alternative design not only provided better accuracies than were achieved by Marvin and Linzen (2018), but the inclusion of distractors in the design lowers the random baseline from 50% to 33.3% accuracy. Hu et al. (2020) conclude that current neural language models are learning more about the licensing of reflexive anaphora than was previously thought.
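To make both accuracy criteria concrete, here is a minimal sketch, assuming invented placeholder scores in place of a model's conditional pronoun probabilities (none of these numbers come from Marvin and Linzen (2018) or Hu et al. (2020)):

```python
# Illustrative sketch only: the scores below are invented placeholders standing
# in for a model's conditional pronoun (log) probabilities.
import numpy as np

# Minimal-pair criterion: correct when the grammatical member of each pair
# outscores the ungrammatical one (random baseline = 50%).
pairs = [(-3.1, -5.4), (-2.8, -2.9), (-4.0, -3.2)]  # (grammatical, ungrammatical)
pair_accuracy = np.mean([g > u for g, u in pairs])

# Set criterion: within each set, average the conditional pronoun scores of the
# baseline (B), distractor (D) and ungrammatical (U) strings; the prediction is
# correct when the U mean falls below both the B and D means (baseline = 33.3%).
def set_correct(scores):
    b, d, u = (np.mean(scores[k]) for k in ("B", "D", "U"))
    return u < b and u < d

sets = [
    {"B": [-3.0, -3.2], "D": [-3.5, -3.4], "U": [-5.1, -4.8]},
    {"B": [-2.9, -3.1], "D": [-3.0, -3.3], "U": [-3.2, -2.7]},
]
set_accuracy = np.mean([set_correct(s) for s in sets])

print(f"minimal-pair accuracy: {pair_accuracy:.2f}")
print(f"set accuracy:          {set_accuracy:.2f}")
```

The 33.3% figure follows directly from the criterion: under a model that orders the three class means at random, the ungrammatical mean is the lowest of the three with probability one in three.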
2 Theoretical Exceptions

In a typical psycholinguistics experiment, we would give human subjects a task to perform during which they would be presented with a stimulus that was labelled as either baseline, distractor or ungrammatical. The effect of the stimulus on the task could be measured by time to completion, the number of correct tokens retrieved during a fixed interval of time, etc. Regardless, the task would almost certainly be chosen so that samples of its corresponding measure of success would be normally distributed. So a within-subjects mean of these quantities is entirely justifiable.

The situation is somewhat less clear with the scores that are returned by a neural language model. Ignoring for the moment that Hu et al. (2020) are interested in conditional pronoun probabilities and not sentence probabilities, the scores are generally not regarded as measures of success on a task per se; there is no actual task here, apart from achieving a high rank in the evaluation. Legitimate task performance profiles are defined over separate downstream tasks, such as those in the GLUE leaderboard (Wang et al., 2018). It is rather more difficult to think of downstream tasks that depend on conditional pronoun probabilities, however. Note that for Marvin and Linzen (2018), the ratio of conditional pronoun probabilities of a set of stimuli was the same as the ratio of their total sentence probabilities, because the reflexive pronoun is always the last word of the sentence and the contexts preceding the pronouns are always identical.

Figure 1: Surprisals (negative log probabilities) for every set in experiment 1b for the GRNN model with herself, on the left, and for the TransXL model with themselves on the right.

Several papers by Lau et al., culminating in Lau et al. (2017), have argued instead that sentence probabilities can justifiably be interpreted as gradient grammaticality scores, rejecting the long-standing assumption in generative linguistics that grammaticality is categorical.

[…] at α = 0.05, and marginally successful for an additional 8% at α = 0.1. This means that somewhere between 20–30% of the sets are provably not normal. Homoscedasticity is merely one aspect of normal distributions that can be used to prove that a distribution is not normal.
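The passage above breaks off before naming the test applied to each set, so the following is only a sketch of the kind of check it implies, assuming Levene's test for homoscedasticity and a Shapiro-Wilk test for normality from scipy.stats, with invented placeholder surprisals rather than the data behind Figure 1:

```python
# Illustrative sketch only: plausible per-set checks at the two thresholds
# mentioned above, using invented placeholder surprisals.
import numpy as np
from scipy import stats

# Surprisals (negative log probabilities) for one set, grouped by condition.
baseline      = np.array([3.1, 2.9, 3.4, 3.0, 3.2])
distractor    = np.array([3.6, 3.8, 3.5, 3.9, 3.7])
ungrammatical = np.array([5.2, 6.8, 4.9, 7.4, 5.6])

alpha, alpha_marginal = 0.05, 0.1

# Levene's test: a small p-value rejects homoscedasticity (equal variances)
# across the three conditions, one way of showing the set's surprisals cannot
# all be draws from a single normal distribution.
lev_stat, lev_p = stats.levene(baseline, distractor, ungrammatical)

# Shapiro-Wilk: a small p-value rejects normality of the pooled surprisals.
pooled = np.concatenate([baseline, distractor, ungrammatical])
sw_stat, sw_p = stats.shapiro(pooled)

print(f"Levene p = {lev_p:.3f}; heteroscedastic at {alpha}: {lev_p < alpha}")
print(f"Shapiro-Wilk p = {sw_p:.3f}; non-normal at {alpha}: {sw_p < alpha}, "
      f"marginal at {alpha_marginal}: {sw_p < alpha_marginal}")
```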