A Model of Discourse Predictions in Human Sentence Processing
Amit Dubey and Frank Keller and Patrick Sturt
Human Communication Research Centre, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
{amit.dubey,frank.keller,patrick.sturt}@ed.ac.uk

Abstract

This paper introduces a psycholinguistic model of sentence processing which combines a Hidden Markov Model noun phrase chunker with a co-reference classifier. Both models are fully incremental and generative, giving probabilities of lexical elements conditional upon linguistic structure. This allows us to compute the information-theoretic measure of surprisal, which is known to correlate with human processing effort. We evaluate our surprisal predictions on the Dundee corpus of eye-movement data and show that our model achieves a better fit with human reading times than a syntax-only model which does not have access to co-reference information.

1 Introduction

Recent research in psycholinguistics has seen a growing interest in the role of prediction in sentence processing. Prediction refers to the fact that the human sentence processor is able to anticipate upcoming material, and that processing is facilitated when predictions turn out to be correct (evidenced, e.g., by shorter reading times on the predicted word or phrase). Prediction is presumably one of the factors that contribute to the efficiency of human language understanding. Sentence processing is incremental (i.e., it proceeds on a word-by-word basis); therefore, it is beneficial if unseen input can be anticipated and relevant syntactic and semantic structure constructed in advance. This allows the processor to save time and makes it easier to cope with the constant stream of new input.

Evidence for prediction has been found in a range of psycholinguistic processing domains. Semantic prediction has been demonstrated by studies that show anticipation based on selectional restrictions: listeners are able to launch eye-movements to the predicted argument of a verb before having encountered it, e.g., they will fixate an edible object as soon as they hear the word eat (Altmann and Kamide, 1999). Semantic prediction has also been shown in the context of semantic priming: a word that is preceded by a semantically related prime or by a semantically congruous sentence fragment is processed faster (Stanovich and West, 1981; Clifton et al., 2007). An example of syntactic prediction can be found in coordinate structures: readers predict that the second conjunct in a coordination will have the same syntactic structure as the first conjunct (Frazier et al., 2000). In a similar vein, having encountered the word either, readers predict that or and a conjunct will follow it (Staub and Clifton, 2006). Again, priming studies corroborate this: comprehenders are faster at naming words that are syntactically compatible with prior context, even when they bear no semantic relationship to it (Wright and Garrett, 1984).

Predictive processing is not confined to the sentence level. Recent experimental results also provide evidence for discourse prediction. An example is the study by van Berkum et al. (2005), who used a context that made a target noun highly predictable, and found a mismatch effect in the ERP (event-related brain potential) when an adjective appeared that was inconsistent with the target noun. An example is (we give translations of their Dutch materials):

(1) The burglar had no trouble locating the secret family safe.
    a. Of course, it was situated behind a big_neu but unobtrusive painting_neu.
    b. Of course, it was situated behind a big_com but unobtrusive bookcase_com.

Here, the adjective big, which can have neutral or common gender in Dutch, is consistent with the predicted noun painting in (1-a), but inconsistent with it in (1-b), leading to a mismatch ERP on big in (1-b) but not in (1-a).

Previous results on discourse effects in sentence processing can also be interpreted in terms of prediction. In a classical paper, Altmann and Steedman (1988) demonstrated that PP-attachment preferences can change through discourse context: if the context contains two potential referents for the target NP, then NP-attachment of a subsequent PP is preferred (to disambiguate between the two referents), while if the context only contains one target NP, VP-attachment is preferred (as there is no need to disambiguate). This result (and a large body of related findings) is compatible with an interpretation in which the processor predicts upcoming syntactic attachment based on the presence of referents in the preceding discourse.

Most attempts to model prediction in human language processing have focused on syntactic prediction. Examples include Hale's (2001) surprisal model, which relates processing effort to the conditional probability of the current word given the previous words in the sentence. This approach has been elaborated by Demberg and Keller (2009) in a model that explicitly constructs predicted structure, and includes a verification process that incurs additional processing cost if predictions are not met.
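Concretely, the surprisal of a word w_t is its negative log-probability given the words that precede it; a generative incremental model can obtain this as a ratio of successive prefix probabilities. This is the standard formulation of Hale (2001), spelled out here for reference:

    S(w_t) = -log P(w_t | w_1 ... w_(t-1)) = -log [ P(w_1 ... w_t) / P(w_1 ... w_(t-1)) ]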
Recent work has attempted to integrate semantic and discourse prediction with models of syntactic processing. This includes Mitchell et al.'s (2010) approach, which combines an incremental parser with a vector-space model of semantics. However, this approach only provides a loose integration of the two components (through simple addition of their probabilities), and the notion of semantics used is restricted to lexical meaning approximated by word co-occurrences. At the discourse level, Dubey (2010) has proposed a model that combines an incremental parser with a probabilistic logic-based model of co-reference resolution. However, this model does not explicitly model discourse effects in terms of prediction, and again only proposes a loose integration of co-reference and syntax. Furthermore, Dubey's (2010) model has only been tested on two experimental data sets (pertaining to the interaction of ambiguity resolution with context); no broad-coverage evaluation is available.

The aim of the present paper is to overcome these limitations. We propose a computational model that captures discourse effects on syntax in terms of prediction. The model comprises a co-reference component which explicitly stores discourse mentions of NPs, and a syntactic component which adjusts the probabilities of NPs in the syntactic structure based on the mentions tracked by the discourse component. Our model is HMM-based, which makes it possible to efficiently process large amounts of data, allowing an evaluation on eye-tracking corpora, which has recently become the gold standard in computational psycholinguistics (e.g., Demberg and Keller 2008; Frank 2009; Boston et al. 2008; Mitchell et al. 2010).

The paper is structured as follows: In Section 2, we describe the co-reference and the syntactic models and evaluate their performance on standard data sets. Section 3 presents an evaluation of the overall model on the Dundee eye-tracking corpus. The paper closes with a comparison with related work and a general discussion in Sections 4 and 5.

2 Model

This model utilises an NP chunker based upon a hidden Markov model (HMM) as an approximation to syntax. Using a simple model such as an HMM facilitates the integration of a co-reference component, and the fact that the model is generative is a prerequisite to using surprisal as our metric of interest (as surprisal requires the computation of prefix probabilities). The key insight in our model is that human sentence processing is, on average, facilitated when a previously-mentioned discourse entity is repeated. This facilitation depends upon keeping track of a list of previously-mentioned entities, which requires (at the least) shallow syntactic information, yet the facilitation itself is modeled primarily as a lexical phenomenon. This allows a straightforward separation of concerns: shallow syntax is captured using the HMM's hidden states, whereas the co-reference facilitation is modeled using the HMM's emissions. The vocabulary of hidden states is described in Section 2.1 and the emission distribution in Section 2.2.

2.1 Syntactic Model

A key feature of the co-reference component of our model (described below) is that syntactic analysis and co-reference resolution happen simultaneously. This could potentially slow down the syntactic analysis, which tends to already be quite slow for exhaustive surprisal-based incremental parsers. Therefore, rather than using full parsing, we use an HMM-based NP chunker which allows for a fast analysis. NP chunking is sufficient to extract NP discourse mentions and, as we show below, surprisal values computed using HMM chunks provide a useful fit on the Dundee eye-movement data.

For a word that was not seen in training, the emission probability P(word | tag) cannot be estimated directly. However, using Bayes' theorem we can compute:

    P(word | tag) = P(tag | word) P(word) / P(tag)    (1)

which is what we need for surprisal. The primary information from this probability comes from P(tag | word); however, reasonable estimates of P(tag) and P(word) are required to ensure the probability distribution is proper. P(tag) may be estimated on a parsed treebank. P(word), the probability of a particular unseen word, is difficult to estimate directly. Given that our training data contains approximately 10^6 words, we assume that this probability must be bounded above by 10^-6. As an approximation, we use this upper bound as the value of P(word).

Training: The chunker is trained on sections 2–22 of the Wall Street Journal section of the Penn Treebank.

To allow the HMM to handle possessive constructions as well as NPs with simple modifiers and com-
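To make the estimation scheme described above concrete, the following is a minimal sketch of how an HMM chunker could compute per-word surprisal from prefix probabilities, backing off to the Bayes inversion in equation (1) with the 10^-6 bound for unseen words. It is an illustration under stated assumptions, not the implementation used in the paper: the function names and the dictionary-based parameter tables are invented here.

import math

# Illustrative parameter tables (assumptions, not the authors' data structures):
#   initial[tag]                    P(tag) at the first position
#   transitions[prev][tag]          P(tag | prev)
#   emissions[tag][word]            P(word | tag) for words seen in training
#   p_tag_given_word[(tag, word)]   P(tag | word)
#   p_tag[tag]                      marginal P(tag), estimated from a parsed treebank

UNSEEN_WORD_PROB = 1e-6  # upper bound on P(word) for an unseen word (see text)


def emission_prob(word, tag, emissions, p_tag_given_word, p_tag):
    """P(word | tag): training estimate if available, otherwise the Bayes
    inversion of equation (1) with the 1e-6 approximation for P(word)."""
    if word in emissions.get(tag, {}):
        return emissions[tag][word]
    return p_tag_given_word.get((tag, word), 0.0) * UNSEEN_WORD_PROB / p_tag[tag]


def hmm_surprisals(words, tags, initial, transitions, emissions,
                   p_tag_given_word, p_tag):
    """Forward pass over the HMM; the surprisal of word t is the negative log
    ratio of successive prefix probabilities, -log P(w_t | w_1 .. w_(t-1))."""
    surprisals = []
    forward = {}
    prev_prefix = 1.0
    for t, word in enumerate(words):
        new_forward = {}
        for tag in tags:
            if t == 0:
                mass = initial[tag]
            else:
                mass = sum(forward[prev] * transitions[prev].get(tag, 0.0)
                           for prev in tags)
            new_forward[tag] = mass * emission_prob(word, tag, emissions,
                                                    p_tag_given_word, p_tag)
        forward = new_forward
        prefix = sum(forward.values())  # prefix probability P(w_1 .. w_t)
        surprisals.append(-math.log(prefix / prev_prefix))
        prev_prefix = prefix
    return surprisals

In practice the forward probabilities would be kept in log space or renormalised at each position to avoid numerical underflow on long sentences.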