A Probabilistic Earley Parser as a Psycholinguistic Model

John Hale
Department of Cognitive Science
The Johns Hopkins University
3400 North Charles Street; Baltimore MD 21218-2685
[email protected]

Abstract

In human sentence processing, cognitive load can be defined many ways. This report considers a definition of cognitive load in terms of the total probability of structural options that have been disconfirmed at some point in a sentence: the surprisal of word w_i given its prefix w_0...i-1 on a phrase-structural language model. These loads can be efficiently calculated using a probabilistic Earley parser (Stolcke, 1995) which is interpreted as generating predictions about reading time on a word-by-word basis. Under grammatical assumptions supported by corpus-frequency data, the operation of Stolcke's probabilistic Earley parser correctly predicts processing phenomena associated with garden path structural ambiguity and with the subject/object relative asymmetry.

Introduction

What is the relation between a person's knowledge of grammar and that same person's application of that knowledge in perceiving syntactic structure? The answer to be proposed here observes three principles.

Principle 1 The relation between the parser and grammar is one of strong competence.

Strong competence holds that the human sentence processing mechanism directly uses rules of grammar in its operation, and that a bare minimum of extragrammatical machinery is necessary. This hypothesis, originally proposed by Chomsky (Chomsky, 1965, page 9), has been pursued by many researchers (Bresnan, 1982) (Stabler, 1991) (Steedman, 1992) (Shieber and Johnson, 1993), and stands in contrast with an approach directed towards the discovery of autonomous principles unique to the processing mechanism.

Principle 2 Frequency affects performance.

The explanatory success of neural network and constraint-based lexicalist theories (McClelland and St. John, 1989) (MacDonald et al., 1994) (Tabor et al., 1997) suggests a statistical theory of language performance. The present work adopts a numerical view of competition in grammar that is grounded in probability.

Principle 3 Sentence processing is eager.

"Eager" in this sense means the experimental situations to be modeled are ones like self-paced reading in which sentence comprehenders are unrushed and no information is ignored at a point at which it could be used.

The proposal is that a person's difficulty perceiving syntactic structure be modeled by word-to-word surprisal (Attneave, 1959, page 6), which can be directly computed from a probabilistic phrase-structure grammar. The approach taken here uses a parsing algorithm developed by Stolcke. In the course of explaining the algorithm at a very high level I will indicate how the algorithm, interpreted as a psycholinguistic model, observes each principle. After that will come some simulation results, and then a conclusion.

1 Language models

Stolcke's parsing algorithm was initially applied as a component of an automatic speech recognition system. In speech recognition, one is often interested in the probability that some word will follow, given that a sequence of words has been seen. Given some lexicon of all possible words, a language model assigns a probability to every string of words from the lexicon. This defines a probabilistic language (Grenander, 1967) (Booth and Thompson, 1973) (Soule, 1974) (Wetherell, 1980).

A language model helps a speech recognizer focus its attention on words that are likely continuations of what it has recognized so far. This is typically done using conditional probabilities of the form

    P(W_n = w_n | W_1 = w_1, ..., W_{n-1} = w_{n-1}),

the probability that the nth word will actually be w_n given that the words leading up to the nth have been w_1, w_2, ..., w_{n-1}. Given some finite lexicon, the probability of each possible outcome for W_n can be estimated using that outcome's relative frequency in a sample.
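To make the estimation concrete, the short sketch below (mine, not the paper's) computes this relative-frequency estimate over a fixed one-word history, as the n-gram models discussed next do; the toy corpus and the choice of history length are purely illustrative assumptions.

```python
# A minimal sketch (not from the original paper) of the relative-frequency
# estimate of P(W_n = w_n | history), restricted to a fixed-length history
# as an n-gram model would be. Toy corpus and n = 2 are assumptions.
from collections import Counter

corpus = "the horse raced past the barn . the horse fell .".split()

n = 2  # bigram model: one word of history
history_counts = Counter()
continuation_counts = Counter()
for i in range(len(corpus) - n + 1):
    history = tuple(corpus[i:i + n - 1])
    history_counts[history] += 1
    continuation_counts[(history, corpus[i + n - 1])] += 1

def p_next(history, word):
    """Relative-frequency estimate of P(next word = word | history)."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return continuation_counts[(h, word)] / history_counts[h]

print(p_next(["the"], "horse"))  # 2 of the 3 occurrences of "the" precede "horse"
```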
Traditional language models used for speech are n-gram models, in which n-1 words of history serve as the basis for predicting the nth word. Such models do not have any notion of hierarchical syntactic structure, except as might be visible through an n-word window.

Aware that the n-gram obscures many linguistically-significant distinctions (Chomsky, 1956, section 2.3), many speech researchers (Jelinek and Lafferty, 1991) sought to incorporate hierarchical phrase structure into language modeling (see (Stolcke, 1997)), although it was not until the late 1990s that such models were able to significantly improve on 3-grams (Chelba and Jelinek, 1998).

Stolcke's probabilistic Earley parser is one way to use hierarchical phrase structure in a language model. The grammar it parses is a probabilistic context-free phrase structure grammar (PCFG), e.g.

    1.0  S → NP VP
    0.5  NP → Det N
    0.5  NP → NP VP
    ...

see (Charniak, 1993, chapter 5).

Such a grammar defines a probabilistic language in terms of a stochastic process that rewrites strings of grammar symbols according to the probabilities on the rules. Then each sentence in the language of the grammar has a probability equal to the product of the probabilities of all the rules used to generate it. This multiplication embodies the assumption that rule choices are independent. Sentences with more than one derivation accumulate the probability of all derivations that generate them. Through recursion, infinite languages can be specified; an important mathematical question in this context is whether or not such a grammar is consistent – whether it assigns some probability to infinite derivations, or whether all derivations are guaranteed to terminate.

Even if a PCFG is consistent, it would appear to have another drawback: it only assigns probabilities to complete sentences of its language. This is as inconvenient for speech recognition as it is for modeling reading times.

Stolcke's algorithm solves this problem by computing, at each word of an input string, the prefix probability. This is the sum of the probabilities of all derivations whose yield is compatible with the string seen so far. If the grammar is consistent (the probabilities of all derivations sum to 1.0) then subtracting the prefix probability from 1.0 gives the total probability of all the analyses the parser has disconfirmed. If the human parser is eager, then the "work" done during sentence processing is exactly this disconfirmation.
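The following toy calculation (not from the paper) illustrates the two quantities just defined, prefix probability and disconfirmed probability mass. A hand-enumerated finite language with hand-assigned probabilities stands in for the sum over derivations that a probabilistic Earley parser would actually compute.

```python
# A toy illustration of prefix probability and of the "disconfirmed" mass.
# The sentences and their probabilities are invented for illustration.
toy_language = {
    ("the", "horse", "raced", "past", "the", "barn"): 0.5,
    ("the", "horse", "raced", "past", "the", "barn", "fell"): 0.25,
    ("the", "horse", "fell"): 0.25,
}

def prefix_probability(prefix):
    """Sum of the probabilities of every sentence compatible with the prefix."""
    prefix = tuple(prefix)
    return sum(p for sentence, p in toy_language.items()
               if sentence[:len(prefix)] == prefix)

alpha = prefix_probability(("the", "horse", "raced"))
print(alpha)        # 0.75: probability of analyses still compatible after "the horse raced"
print(1.0 - alpha)  # 0.25: total probability of analyses disconfirmed so far
```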
2 Earley parsing

The computation of prefix probabilities takes advantage of the design of the Earley parser (Earley, 1970), which by itself is not probabilistic. In this section I provide a brief overview of Stolcke's algorithm, but the original paper should be consulted for full details (Stolcke, 1995).

Earley parsers work top-down, and propagate predictions confirmed by the input string back up through a set of states representing hypotheses the parser is entertaining about the structure of the sentence. The global state of the parser at any one time is completely defined by this collection of states, a chart, which defines a tree set. A state is a record that specifies

• the current input string position processed so far
• a grammar rule
• a "dot-position" in the rule representing how much of the rule has already been recognized
• the leftmost edge of the substring this rule generates

An Earley parser has three main functions, predict, scan and complete, each of which can enter new states into the chart. Starting from a dummy start state in which the dot is just to the left of the grammar's start symbol, predict adds new states for rules which could expand the start symbol. In these new predicted states, the dot is at the far left-hand side of each rule. After prediction, scan checks the input string: if the symbol immediately following the dot matches the current word in the input, then the dot is moved rightward, across the symbol. The parser has "scanned" this word. Finally, complete propagates this change throughout the chart. If, as a result of scanning, any states are now present in which the dot is at the end of a rule, then the left hand side of that rule has been recognized, and any other states having a dot immediately in front of the newly-recognized left hand side symbol can now have their dots moved as well. This happens over and over until no new states are generated. Parsing finishes when the dot in the dummy start state is moved across the grammar's start symbol.

Stolcke's innovation, as regards prefix probabilities, is to add two additional pieces of information to each state: α, the forward, or prefix probability, and γ, the "inside" probability. He notes that

    path          An (unconstrained) Earley path, or simply path, is a sequence
                  of Earley states linked by prediction, scanning, or completion.

    constrained   A path is said to be constrained by, or generate a string x if
                  the terminals immediately to the left of the dot in all scanned
                  states, in sequence, form the string x.

    ...

    The significance of Earley paths is that they are in a one-to-one
    correspondence with left-most derivations. This will allow us to talk about
    probabilities of derivations, strings and prefixes in terms of the actions
    performed by Earley's parser.

                                                         (Stolcke, 1995, page 8)

This correspondence between paths of parser operations and derivations enables the computation of the prefix probability – the sum of all derivations compatible with the prefix seen so far. By the correspondence between derivations and Earley paths, one would need only to compute the sum of all paths that are constrained by the observed prefix. But this can be done in the course of parsing by storing the current prefix probability in each state. Then, when a new state is added by some parser operation, the contribution from each antecedent state – each previous state linked by some parser operation – is summed in the new state. Knowing the prefix probability at each state and then summing for all parser operations that result in the same new state efficiently counts all possible derivations.

Predicting a rule corresponds to multiplying by that rule's probability. Scanning does not alter any probabilities. Completion, though, requires knowing γ, the inside probability, which records how probable was the inner structure of some recognized phrasal node. When a state is completed, a bottom-up confirmation is united with a top-down prediction, so the α value of the complete-ee is multiplied by the γ value of the complete-er.

Important technical problems involving left-recursive and unit productions are examined and overcome in (Stolcke, 1995). However, these complications do not add any further machinery to the parsing algorithm per se beyond the grammar rules and the dot-moving conventions: in particular, there are no heuristic parsing principles or intermediate structures that are later destroyed. In this respect the algorithm observes strong competence – principle 1. In virtue of being a probabilistic parser it observes principle 2. Finally, in the sense that predict and complete each apply exhaustively at each new input word, the algorithm is eager, satisfying principle 3.
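To summarize the bookkeeping described in this section, here is a schematic sketch (mine, not Stolcke's code) of the state record and of how each operation would touch α and γ. The corrections for left-recursive and unit productions derived in (Stolcke, 1995) are deliberately omitted, so this is an illustration of the idea rather than the full algorithm.

```python
# Schematic sketch of a probabilistic Earley state and the alpha/gamma updates.
from dataclasses import dataclass

@dataclass
class State:
    lhs: str      # left-hand side of the grammar rule
    rhs: tuple    # right-hand side symbols of the rule
    dot: int      # how much of the rule has already been recognized
    start: int    # leftmost edge of the substring this rule generates
    pos: int      # current input string position processed so far
    alpha: float  # forward (prefix) probability
    gamma: float  # inner probability of the recognized part

def predict(state, lhs, rhs, rule_prob):
    """Predicting a rule multiplies the forward probability by that rule's probability."""
    return State(lhs, tuple(rhs), 0, state.pos, state.pos,
                 state.alpha * rule_prob, rule_prob)

def scan(state):
    """Scanning moves the dot across a matching terminal; no probabilities change."""
    return State(state.lhs, state.rhs, state.dot + 1, state.start, state.pos + 1,
                 state.alpha, state.gamma)

def complete(waiting, finished):
    """Completion advances a waiting state over a finished constituent; its alpha
    and gamma are both multiplied by the gamma of the completed state.
    (A valid completion requires waiting.pos == finished.start.)"""
    return State(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.start, finished.pos,
                 waiting.alpha * finished.gamma, waiting.gamma * finished.gamma)

# In the chart itself, operations that yield the same state have their alpha
# (and, for completion, gamma) contributions summed rather than kept separate.
```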
3 Parallelism

Psycholinguistic theories vary regarding the amount of bandwidth they attribute to the human sentence processing mechanism. Theories of initial parsing preferences (Fodor and Ferreira, 1998) suggest that the human parser is fundamentally serial: a function from a tree and new word to a new tree. These theories explain processing difficulty by appealing to "garden pathing" in which the current analysis is faced with words that cannot be reconciled with the structures built so far. A middle ground is held by bounded-parallelism theories (Narayanan and Jurafsky, 1998) (Roark and Johnson, 1999). In these theories the human parser is modeled as a function from some subset of consistent trees and the new word, to a new tree subset. Garden paths arise in these theories when analyses fall out of the set of trees maintained from word to word, and have to be reanalyzed, as on strictly serial theories. Finally, there is the possibility of total parallelism, in which the entire set of trees compatible with the input is maintained somehow from word to word. On such a theory, garden-pathing cannot be explained by reanalysis.

The probabilistic Earley parser computes all parses of its input, so as a psycholinguistic theory it is a total parallelism theory. The explanation for garden-pathing will turn on the reduction in the probability of the new tree set compared with the previous tree set – reanalysis plays no role. Before illustrating this kind of explanation with a specific example, it will be important to first clarify the nature of the linking hypothesis between the operation of the probabilistic Earley parser and the measured effects of the human parser.
4 Linking hypothesis

The measure of cognitive effort mentioned earlier is defined over prefixes: for some observed prefix, the cognitive effort expended to parse that prefix is proportional to the total probability of all the structural analyses which cannot be compatible with the observed prefix. This is consistent with eagerness since, if the parser were to fail to infer the incompatibility of some incompatible analysis, it would be delaying a computation, and hence not be eager.

This prefix-based linking hypothesis can be turned into one that generates predictions about word-by-word reading times by comparing the total effort expended before some word to the total effort after: in particular, take the comparison to be a ratio. Making the further assumption that the probabilities on PCFG rules are statements about how difficult it is to disconfirm each rule,[1] then the ratio of the α value for the previous word to the α value for the current word measures the combined difficulty of disconfirming all disconfirmable structures at a given word – the definition of cognitive load. Scaling this number by taking its log gives the surprisal, and defines a word-based measure of cognitive effort in terms of the prefix-based one. Of course, if the language model is sensitive to hierarchical structure, then the measure of cognitive effort so defined will be structure-sensitive as well.

[1] This assumption is inevitable given principles 1 and 2. If there were separate processing costs distinct from the optimization costs postulated in the grammar, then strong competence is violated. Defining all grammatical structures as equally easy to disconfirm or perceive likewise voids the gradedness of grammaticality of any content.
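In code, this linking hypothesis amounts to one line per word. The sketch below (not from the paper) assumes a list of prefix probabilities α_0 ... α_N supplied by the parser; the α values are made up solely to show the arithmetic, and base 2 for the logarithm is an arbitrary choice here.

```python
# Word-by-word surprisal as the log of the ratio of successive prefix probabilities.
from math import log2

def surprisals(prefix_probs):
    """Predicted load at word n: log(alpha_{n-1} / alpha_n)."""
    return [log2(prev / curr) for prev, curr in zip(prefix_probs, prefix_probs[1:])]

# alpha_0 = 1.0 for the empty prefix; later values shrink as words disconfirm analyses.
alphas = [1.0, 0.5, 0.25, 0.0025]
print(surprisals(alphas))  # [1.0, 1.0, 6.64...]; the large final value is a predicted slowdown
```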
5 Plausibility of Probabilistic Context-Free Grammar

The debate over the form grammar takes in the mind is clearly a fundamental one for cognitive science. Much recent psycholinguistic work has generated a wealth of evidence that frequency of exposure to linguistic elements can affect our processing (Mitchell et al., 1995) (MacDonald et al., 1994). However, there is no clear consensus as to the size of the elements over which exposure has clearest effect. Gibson and Pearlmutter identify it as an "outstanding question" whether or not phrase structure statistics are necessary to explain performance effects in sentence comprehension:

    Are phrase-level contingent frequency constraints necessary to explain
    comprehension performance, or are the remaining types of constraints
    sufficient? If phrase-level contingent frequency constraints are necessary,
    can they subsume the effects of other constraints (e.g. locality)?

                                        (Gibson and Pearlmutter, 1998, page 13)

Equally, formal work in linguistics has demonstrated the inadequacy of context-free grammars as an appropriate model for natural language in the general case (Shieber, 1985). To address this criticism, the same prefix probabilities could be computed using tree-adjoining grammars (Nederhof et al., 1998). With context-free grammars serving as the implicit backdrop for much work in human sentence processing, as well as linguistics,[2] simplicity seems as good a guide as any in the selection of a grammar formalism.

[2] Some important work in computational psycholinguistics (Ford, 1989) assumes a Lexical-Functional Grammar where the c-structure rules are essentially context-free and have attached to them "strengths" which one might interpret as probabilities.
6 Garden-pathing

6.1 A celebrated example

Probabilistic context-free grammar (1) will help illustrate the way a phrase-structured language model could account for garden path structural ambiguity. Grammar (1) generates the celebrated garden path sentence "the horse raced past the barn fell" (Bever, 1970). English speakers hearing these words one by one are inclined to take "the horse" as the subject of "raced," expecting the sentence to end at the word "barn." This is the main verb reading in figure 1.

[Figure 1: Main verb reading – the tree (S (NP the horse) (VP (VBD raced) (PP (IN past) (NP (DT the) (NN barn)))))]

The human sentence processing mechanism is metaphorically led up the garden path by the main verb reading, when, upon hearing "fell" it is forced to accept the alternative reduced relative reading shown in figure 2.

[Figure 2: Reduced relative reading – the tree (S (NP (NP (DT the) (NN horse)) (VP (VBN raced) (PP (IN past) (NP (DT the) (NN barn))))) (VP (VBD fell)))]

The confusion between the main verb and the reduced relative readings, which is resolved upon hearing "fell," is the empirical phenomenon at issue. As the parse trees indicate, grammar (1) analyzes reduced relative clauses as a VP adjoined to an NP.[3] In one sample of parsed text[4] such adjunctions are about 7 times less likely than simple NPs made up of a determiner followed by a noun. The probabilities of the other crucial rules are likewise estimated by their relative frequencies in the sample.

[3] See section 1.24 of the Treebank style guide.
[4] The sample starts at sentence 93 of section 16 of the Treebank and goes for 500 sentences (12924 words). For information about the Penn Treebank project see http://www.cis.upenn.edu/~treebank/

Grammar (1):

    1.0              S → NP VP .
    0.876404494831   NP → DT NN
    0.123595505169   NP → NP VP
    1.0              PP → IN NP
    0.171428571172   VP → VBD PP
    0.752380952552   VP → VBN PP
    0.0761904762759  VP → VBD
    1.0              DT → the
    0.5              NN → horse
    0.5              NN → barn
    0.5              VBD → fell
    0.5              VBD → raced
    1.0              VBN → raced
    1.0              IN → past

This simple grammar exhibits the essential character of the explanation: garden paths happen at points where the parser can disconfirm alternatives that together comprise a great amount of probability. Note the category ambiguity present with raced, which can show up as both a past-tense verb (VBD) and a past participle (VBN).

Figure 3 shows the reading time predictions derived[5] via the linking hypothesis that reading time at word n is proportional to the surprisal log(α_{n-1} / α_n).

[5] Whether the quantitative values of the predicted reading times can be mapped onto a particular experiment involves taking some position on the oft-observed (Gibson and Schütze, 1999) imperfect relationship between corpus frequency and psychological norms.

[Figure 3: Predictions of probabilistic Earley parser on simple grammar – bar chart of log(previous prefix probability / current prefix probability) at each word of "the horse raced past the barn fell", with by far the largest value at "fell"]

At "fell," the parser garden-paths: up until that point, both the main-verb and reduced-relative structures are consistent with the input. The prefix probability before "fell" is scanned is more than 10 times greater than after, suggesting that the probability mass of the analyses disconfirmed at that point was indeed great. In fact, all of the probability assigned to the main-verb structure is now lost, and only parses that involve the low-probability NP rule survive – a rule introduced 5 words back.
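The rule probabilities in grammars (1) and (2) are obtained by relative frequency in the parsed sample, as described above. The sketch below shows that estimation; the two NP counts are invented for illustration (chosen only so that the resulting probabilities land near the grammar (1) values) and are not the actual counts from the Treebank sample.

```python
# Relative-frequency estimation of PCFG rule probabilities: each rule's count
# divided by the total count of rules expanding the same left-hand side.
# The counts below are illustrative assumptions, not the paper's counts.
from collections import Counter, defaultdict

observed_rules = Counter({
    ("NP", ("DT", "NN")): 78,   # simple determiner-noun NPs seen in a sample
    ("NP", ("NP", "VP")): 11,   # NPs with an adjoined (reduced relative) VP
})

lhs_totals = defaultdict(int)
for (lhs, rhs), count in observed_rules.items():
    lhs_totals[lhs] += count

rule_prob = {(lhs, rhs): count / lhs_totals[lhs]
             for (lhs, rhs), count in observed_rules.items()}

print(rule_prob[("NP", ("DT", "NN"))])  # about 0.876 with these invented counts
print(rule_prob[("NP", ("NP", "VP"))])  # about 0.124, roughly 7 times less likely
```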
6.2 A comparison

If this garden path effect is truly a result of both the main verb and the reduced relative structures being simultaneously available up until the final verb, then the effect should disappear when words intervene that cancel the reduced relative interpretation early on.

To examine this possibility, consider now a different example sentence, this time from the language of grammar (2).

Grammar (2):

    0.574927953937  S → NP VP
    0.425072046063  S → VP
    1.0             SBAR → WHNP S
    0.80412371161   NP → DT NN
    0.082474226966  NP → NP SBAR
    0.113402061424  NP → NP VP
    0.11043         VP → VBD PP
    0.141104        VP → VBD NP PP
    0.214724        VP → AUX VP
    0.484663        VP → VBN PP
    0.0490798       VP → VBD
    1.0             PP → IN NP
    1.0             WHNP → who
    1.0             DT → the
    0.33            NN → boss
    0.33            NN → banker
    0.33            NN → buy-back
    0.5             IN → about
    0.5             IN → by
    1.0             AUX → was
    0.74309393      VBD → told
    0.25690607      VBD → resigned
    1.0             VBN → told

The probabilities in grammar (2) are estimated from the same sample as before. It generates a sentence composed of words actually found in the sample, "the banker told about the buy-back resigned." This sentence exhibits the same reduced relative clause structure as does "the horse raced past the barn fell":

    (S (NP (NP (DT the) (NN banker))
           (VP (VBN told) (PP (IN about) (NP (DT the) (NN buy-back)))))
       (VP (VBD resigned)))

Grammar (2) also generates[6] the subject relative "the banker who was told about the buy-back resigned." Now a comparison of two conditions is possible.

[6] This grammar also generates active and simple passive sentences, rating passive sentences as more probable than the actives. This is presumably a fact about the writing style favored by the Wall Street Journal.
    MV and RC   the banker told about the buy-back resigned
    RC only     the banker who was told about the buy-back resigned

The words who was cancel the main verb reading, and should make that condition easier to process. This asymmetry is borne out in graphs 4 and 5. At "resigned" the probabilistic Earley parser predicts less reading time in the subject relative condition than in the reduced relative condition.

[Figure 4: Mean 10.5 – log(previous prefix probability / current prefix probability) at each word of the subject relative clause "the banker who was told about the buy-back resigned"]

[Figure 5: Mean: 16.44 – log(previous prefix probability / current prefix probability) at each word of the reduced relative clause "the banker told about the buy-back resigned"]

This comparison verifies that the same sorts of phenomena treated in reanalysis and bounded-parallelism parsing theories fall out as cases of the present, total parallelism theory.

6.3 An entirely empirical grammar

Although they used frequency estimates provided by corpus data, the previous two grammars were partially hand-built. They used a subset of the rules found in the sample of parsed text. A grammar including all rules observed in the entire sample supports the same sort of reasoning. In this grammar, instead of just 2 NP rules there are 532, along with 120 S rules. Many of these generate analyses compatible with prefixes of the reduced relative clause at various points during parsing, so the expectation is that the parser will be disconfirming many more hypotheses at each word than in the simpler example. Figure 6 shows the reading time predictions derived from this much richer grammar.

[Figure 6: Predictions of Earley parser on richer grammar – log(previous prefix probability / current prefix probability) for "the banker told about the buy-back resigned" under the grammar from the Wall Street Journal sample]
Because the terminal vocabulary of this richer grammar is so much larger, a comparatively large amount of information is conveyed by the nouns "banker" and "buy-back", leading to high surprisal values at those words. However, the garden path effect is still observable at "resigned", where the prefix probability ratio is nearly 10 times greater than at either of the nouns. Amid the lexical effects, the probabilistic Earley parser is affected by the same structural ambiguity that affects English speakers.
7 Subject/Object asymmetry

The same kind of explanation supports an account of the subject-object relative asymmetry (cf. references in (Gibson, 1998)) in the processing of unreduced relative clauses. Since the Earley parser is designed to work with context-free grammars, the following example grammar adopts a GPSG-style analysis of relative clauses (Gazdar et al., 1985, page 155). The estimates of the ratios for the two S[+R] rules are obtained by counting the proportion of subject relatives among all relatives in the Treebank's parsed Brown corpus.[7]

[7] In particular, relative clauses in the Treebank are analyzed as NP → NP SBAR (rule 1), SBAR → WHNP S (rule 2), where the S contains a trace *T* coindexed with the WHNP. The total number of structures in which both rule 1 and rule 2 apply is 5489. The total number where the first child of S is null is 4768. This estimate puts the total number of object relatives at 721 and the frequency of object relatives at 0.13135362 and the frequency of subject relatives at 0.86864638.
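For concreteness, the estimate in footnote [7] can be recomputed directly from the counts it reports; nothing here goes beyond that arithmetic.

```python
# Recomputing the S[+R] rule probabilities of grammar (3) from the counts in
# footnote [7]: 5489 relative clauses in the parsed Brown corpus, 4768 of them
# subject relatives.
total_relatives = 5489
subject_relatives = 4768
object_relatives = total_relatives - subject_relatives  # 721

p_subject = subject_relatives / total_relatives
p_object = object_relatives / total_relatives
print(round(p_subject, 8), round(p_object, 8))  # 0.86864638 0.13135362
```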
Grammar (3):

    0.33        NP → SPECNP NBAR
    0.33        NP → you
    0.33        NP → me
    1.0         SPECNP → DT
    0.5         NBAR → NBAR S[+R]
    0.5         NBAR → N
    1.0         S → NP VP
    0.86864638  S[+R] → NP[+R] VP
    0.13135362  S[+R] → NP[+R] S/NP
    1.0         S/NP → NP VP/NP
    1.0         VP/NP → V NP/NP
    1.0         VP → V NP
    1.0         V → saw
    1.0         NP[+R] → who
    1.0         DT → the
    1.0         N → man
    1.0         NP/NP → ε

Grammar (3) generates both subject and object relative clauses. S[+R] → NP[+R] VP is the rule that generates subject relatives and S[+R] → NP[+R] S/NP generates object relatives. One might expect there to be a greater processing load for object relatives as soon as enough lexical material is present to determine that the sentence is in fact an object relative.[8]

[8] The difference in probability between subject and object rules could be due to the work necessary to set up storage for the filler, effectively recapitulating the HOLD Hypothesis (Wanner and Maratsos, 1978, page 119).

The same probabilistic Earley parser (modified to handle null-productions) explains this asymmetry in the same way as it explains the garden path effect. Its predictions, under the same linking hypothesis as in the previous cases, are depicted in graphs 7 and 8. The mean surprisal for the object relative is about 5.0 whereas the mean surprisal for the subject relative is about 2.1.

[Figure 7: Subject relative clause – log(previous prefix probability / current prefix probability) at each word of "the man who saw you saw me"]

[Figure 8: Object relative clause – log(previous prefix probability / current prefix probability) at each word of "the man who you saw saw me"]

Conclusion

These examples suggest that a "total-parallelism" parsing theory based on probabilistic grammar can characterize some important processing phenomena. In the domain of structural ambiguity in particular, the explanation is of a different kind than in traditional reanalysis models: the order of processing is not theoretically significant, but the estimate of its magnitude at each point in a sentence is. Results with empirically-derived grammars suggest an affirmative answer to Gibson and Pearlmutter's question: phrase-level contingent frequencies can do the work formerly done by other mechanisms.

Pursuit of methodological principles 1, 2 and 3 has identified a model capable of describing some of the same phenomena that motivate psycholinguistic interest in other theoretical frameworks. Moreover, this recommends probabilistic grammars as an attractive possibility for psycholinguistics by providing clear, testable predictions and the potential for new mathematical insights.

References

Fred Attneave. 1959. Applications of Information Theory to Psychology: A summary of basic concepts, methods and results. Holt, Rinehart and Winston.

Thomas G. Bever. 1970. The cognitive basis for linguistic structures. In J.R. Hayes, editor, Cognition and the Development of Language, pages 279–362. Wiley, New York.

Taylor L. Booth and Richard A. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5).

Joan Bresnan. 1982. Introduction: Grammars as mental representations of language. In Joan Bresnan, editor, The Mental Representation of Grammatical Relations, pages xvii–lii. MIT Press, Cambridge, MA.

Eugene Charniak. 1993. Statistical Language Learning. MIT Press.

Ciprian Chelba and Frederick Jelinek. 1998. Exploiting syntactic structure for language modelling. In Proceedings of COLING-ACL '98, pages 225–231, Montreal.

Noam Chomsky. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

Jay Earley. 1970. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery, 13(2), February.

Janet Dean Fodor and Fernanda Ferreira, editors. 1998. Reanalysis in Sentence Processing, volume 21 of Studies in Theoretical Psycholinguistics. Kluwer, Dordrecht.

Marilyn Ford. 1989. Parsing complexity and a theory of parsing. In Greg N. Carlson and Michael K. Tanenhaus, editors, Linguistic Structure in Language Processing, pages 239–272. Kluwer.

Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, MA.

Edward Gibson and Neal J. Pearlmutter. 1998. Constraints on sentence processing. Trends in Cognitive Sciences, 2:262–268.

Edward Gibson and Carson Schütze. 1999. Disambiguation preferences in noun phrase conjunction do not mirror corpus frequency. Journal of Memory and Language.

Edward Gibson. 1998. Linguistic complexity: locality of syntactic dependencies. Cognition, 68:1–76.

Ulf Grenander. 1967. Syntax-controlled probabilities. Technical report, Brown University Division of Applied Mathematics, Providence, RI.

Frederick Jelinek and John D. Lafferty. 1991. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3).

Maryellen C. MacDonald, Neal J. Pearlmutter, and Mark S. Seidenberg. 1994. Lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4):676–703.

James McClelland and Mark St. John. 1989. Sentence comprehension: A PDP approach. Language and Cognitive Processes, 4:287–336.

Don C. Mitchell, Fernando Cuetos, Martin M.B. Corley, and Marc Brysbaert. 1995. Exposure-based models of human parsing: Evidence for the use of coarse-grained (nonlexical) statistical records. Journal of Psycholinguistic Research, 24(6):469–488.

Srini Narayanan and Daniel Jurafsky. 1998. Bayesian models of human sentence processing. In Proceedings of the 19th Annual Conference of the Cognitive Science Society, University of Wisconsin-Madison.

Mark-Jan Nederhof, Anoop Sarkar, and Giorgio Satta. 1998. Prefix probabilities from stochastic tree adjoining grammars. In Proceedings of COLING-ACL '98, pages 953–959, Montreal.

Brian Roark and Mark Johnson. 1999. Broad coverage predictive parsing. Presented at the 12th Annual CUNY Conference on Human Sentence Processing, March.

Stuart Shieber and Mark Johnson. 1993. Variations on incremental interpretation. Journal of Psycholinguistic Research, 22(2):287–318.

Stuart Shieber. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.

Stephen Soule. 1974. Entropies of probabilistic grammars. Information and Control, 25:57–74.

Edward Stabler. 1991. Avoid the pedestrian's paradox. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, Studies in Linguistics and Philosophy, pages 199–237. Kluwer, Dordrecht.

Mark Steedman. 1992. Grammars and processors. Technical Report TR MS-CIS-92-52, University of Pennsylvania CIS Department.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2).

Andreas Stolcke. 1997. Linguistic knowledge and empirical methods in speech recognition. AI Magazine, 18(4):25–31.

Whitney Tabor, Cornell Juliano, and Michael Tanenhaus. 1997. Parsing in a dynamical system: An attractor-based account of the interaction of lexical and structural constraints in sentence processing. Language and Cognitive Processes, 12(2/3):211–271.

Eric Wanner and Michael Maratsos. 1978. An ATN approach to comprehension. In Morris Halle, Joan Bresnan, and George A. Miller, editors, Linguistic Theory and Psychological Reality, chapter 3, pages 119–161. MIT Press, Cambridge, Massachusetts.

C.S. Wetherell. 1980. Probabilistic languages: A review and some open questions. Computing Surveys, 12(4).