
Parsing with generative models of predicate-argument structure

Julia Hockenmaier
IRCS, University of Pennsylvania, Philadelphia, USA, and Informatics, University of Edinburgh, Edinburgh, UK
[email protected]

Abstract

The model used by the CCG parser of Hockenmaier and Steedman (2002b) would fail to capture the correct bilexical dependencies in a language with freer word order, such as Dutch. This paper argues that probabilistic parsers should therefore model the dependencies in the predicate-argument structure, as in the model of Clark et al. (2002), and defines a generative model for CCG derivations that captures these dependencies, including bounded and unbounded long-range dependencies.

1 Introduction

State-of-the-art statistical parsers for Penn Treebank-style phrase-structure grammars (Collins, 1999; Charniak, 2000), but also for Categorial Grammar (Hockenmaier and Steedman, 2002b), include models of bilexical dependencies defined in terms of local trees. However, this paper demonstrates that such models would be inadequate for languages with freer word order. We use the example of Dutch ditransitives, but our argument applies equally to other languages such as Czech (see Collins et al. (1999)). We argue that this problem can be avoided if the bilexical dependencies in the predicate-argument structure are captured instead, and we propose a generative model for these dependencies.

The focus of this paper is on models for Combinatory Categorial Grammar (CCG, Steedman (2000)). Due to CCG's transparent syntax-semantics interface, the parser has direct and immediate access to the predicate-argument structure, which includes not only local but also long-range dependencies arising through coordination, extraction and control. These dependencies can be captured by our model in a sound manner, and our experimental results for English demonstrate that their inclusion improves parsing performance. However, since the predicate-argument structure itself depends only to a certain degree on the grammar formalism, it is likely that parsers based on other grammar formalisms could equally benefit from such a model. The conditional model used by the CCG parser of Clark et al. (2002) also captures dependencies in the predicate-argument structure; however, their model is inconsistent.

First, we review the dependency model proposed by Hockenmaier and Steedman (2002b). We then use the example of Dutch ditransitives to demonstrate its inadequacy for languages with a freer word order. This leads us to define a new generative model of CCG derivations which captures word-word dependencies in the underlying predicate-argument structure. We show how this model can capture long-range dependencies, and how it deals with the multiple dependencies that arise through their presence. In our current implementation, the probabilities of derivations are computed during parsing, and we discuss the difficulties of integrating the model into a probabilistic chart parsing regime. Since there is no CCG treebank available for other languages, experimental results are presented for English, using CCGbank (Hockenmaier and Steedman, 2002a), a translation of the Penn Treebank to CCG. These results demonstrate that the model benefits greatly from the inclusion of long-range dependencies.

2 A model of surface dependencies

Hockenmaier and Steedman (2002b) define a surface dependency model (henceforth: SD) HWDep which captures word-word dependencies that are defined in terms of the derivation tree itself. It assumes that binary trees (with parent category P) have one head child (with category H) and one non-head child (with category D), and that each node has one lexical head h = <c, w>. In the following tree, P = S[dcl]\NP, H = (S[dcl]\NP)/NP, D = NP, h_H = <(S[dcl]\NP)/NP, opened>, and h_D = <N, doors>:

    S[dcl]\NP
       (S[dcl]\NP)/NP   opened
       NP               its doors

The model conditions w_D on its own lexical category c_D, on h_H = <c_H, w_H>, and on the local tree in which the D is generated (represented in terms of the categories <P, H, D>):

    P(w_D | c_D, <P, H, D>, h_H = <c_H, w_H>)
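The following sketch is our own illustration rather than part of the system described here: a simple relative-frequency estimate of the surface dependency probability above, with hypothetical count tables and function names.

    from collections import defaultdict

    # Hypothetical count tables for estimating the SD model's word probability
    # P(w_D | c_D, <P,H,D>, h_H = <c_H, w_H>) by relative frequency.
    dep_counts = defaultdict(int)      # counts of (w_D, c_D, (P, H, D), (c_H, w_H))
    context_counts = defaultdict(int)  # counts of (c_D, (P, H, D), (c_H, w_H))

    def observe(w_D, c_D, local_tree, head):
        """Record one dependency observed in a training derivation."""
        dep_counts[(w_D, c_D, local_tree, head)] += 1
        context_counts[(c_D, local_tree, head)] += 1

    def p_surface_dependency(w_D, c_D, local_tree, head):
        """Relative-frequency estimate of P(w_D | c_D, <P,H,D>, h_H)."""
        denom = context_counts[(c_D, local_tree, head)]
        if denom == 0:
            return 0.0  # a real implementation would back off or smooth here
        return dep_counts[(w_D, c_D, local_tree, head)] / denom

    # Example: the local tree S[dcl]\NP -> (S[dcl]\NP)/NP NP, head word "opened".
    tree = ("S[dcl]\\NP", "(S[dcl]\\NP)/NP", "NP")
    head = ("(S[dcl]\\NP)/NP", "opened")
    observe("doors", "N", tree, head)
    print(p_surface_dependency("doors", "N", tree, head))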

3 Predicate-argument structure in CCG

Like Clark et al. (2002), we define the predicate-argument structure for CCG in terms of the dependencies that hold between words with lexical functor categories and their arguments. We assume that a lexical head is a pair <c, w>, consisting of a word w and its lexical category c. Each constituent has at least one lexical head (more if it is a coordinate construction). The arguments of functor categories are numbered from 1 to n, starting at the innermost argument, where n is the arity of the functor, e.g. (S[dcl]\NP_1)/NP_2, (NP\NP_1)/(S[dcl]/NP)_2. Dependencies hold between lexical heads whose category is a functor category and the lexical heads of their arguments. Such dependencies can be expressed as 3-tuples <<c, w>, i, <c', w'>>, where c is a functor category with arity >= i, and <c', w'> is a lexical head of the ith argument of c.

The predicate-argument structure that corresponds to a derivation contains not only local, but also long-range dependencies that are projected from the lexicon or through some rules such as the coordination of functor categories. For details, see Hockenmaier (2003).
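As a purely illustrative sketch (the class and field names below are our own, not the paper's), such dependency 3-tuples could be represented as follows; the argument slot is left unfilled until it is instantiated.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class LexicalHead:
        category: str   # lexical category c, e.g. "(S[dcl]\\NP_1)/NP_2"
        word: str       # word w, e.g. "opened"

    @dataclass(frozen=True)
    class Dependency:
        """A predicate-argument dependency <<c, w>, i, <c', w'>>."""
        functor: LexicalHead                     # <c, w>, a lexical head with functor category
        slot: int                                # argument slot i (1 = innermost argument)
        argument: Optional[LexicalHead] = None   # <c', w'>, once the slot is filled

        def fill(self, argument: "LexicalHead") -> "Dependency":
            return Dependency(self.functor, self.slot, argument)

    # "opened" subcategorizes for a subject (slot 1) and an object (slot 2):
    opened = LexicalHead("(S[dcl]\\NP_1)/NP_2", "opened")
    subject_dep = Dependency(opened, 1)                        # not yet instantiated
    object_dep = Dependency(opened, 2).fill(LexicalHead("N", "doors"))
    print(subject_dep, object_dep)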

4 Word-word dependencies in Dutch

Dutch has a much freer word order than English. The analyses given in Steedman (2000) assume that this can be accounted for by an extended use of composition. As indicated by the indices (which are only included to improve readability), in the following examples hij is the subject (NP_3) of geeft, de politieman the indirect object (NP_2), and een bloem the direct object (NP_1). The variables T in the categories below are uninstantiated for reasons of space.

    Hij            geeft                   de politieman   een bloem
    S/(S/NP_3)     ((S/NP_1)/NP_2)/NP_3    T\(T/NP_2)      T\(T/NP_1)
    (He gives the policeman a flower)

    Een bloem      geeft                   hij             de politieman
    S/(S/NP_1)     ((S/NP_1)/NP_2)/NP_3    T\(T/NP_3)      T\(T/NP_2)

    De politieman  geeft                   hij             een bloem
    S/(S/NP_2)     ((S/NP_1)/NP_2)/NP_3    T\(T/NP_3)      T\(T/NP_1)

In each case the verb combines with the type-raised noun phrases to its right by application and (crossed) composition, and the fronted noun phrase then applies forward to the resulting S/NP.

An SD model estimated from a corpus containing these three sentences would not be able to capture the correct dependencies. Unless we assume that the above indices are given as a feature on the NP categories, the model could not distinguish between the dependency relations of Hij and geeft in the first sentence, bloem and geeft in the second sentence, and politieman and geeft in the third sentence. Even with the indices, either the dependency between politieman and geeft or between bloem and geeft in the first sentence could not be captured by a model that assumes that each local tree has exactly one head. Furthermore, if one of these sentences occurred in the training data, all of the dependencies in the other variants of this sentence would be unseen to the model. However, in terms of the predicate-argument structure, all three examples express the same relations. The model we propose here would therefore be able to generalize from one example to the word-word dependencies in the other examples.

The cross-serial dependencies of Dutch are one of the syntactic constructions that led people to believe that more than context-free power is required for natural language analysis. Here is an example together with the CCG derivation from Steedman (2000):

    dat   ik     Cecilia   de paarden   zag                   voeren
          NP_1   NP_2      NP_3         ((S\NP_1)\NP_2)/VP    VP\NP_3
    (that I Cecilia the horses saw feed)

Here zag and voeren first combine by crossed composition into ((S\NP_1)\NP_2)\NP_3, which then takes the three noun phrases as arguments.

Again, a local dependency model would systematically model the wrong dependencies in this case, since it would assume that all noun phrases are arguments of the same verb. However, since there is no Dutch corpus that is annotated with CCG derivations, we restrict our attention to English in the remainder of this paper.

5 A model of predicate-argument structure

We first explain how word-word dependencies in the predicate-argument structure can be captured in a generative model, and then describe how these probabilities are estimated in the current implementation.

5.1 Modelling local dependencies

We first define the probabilities for purely local dependencies without coordination. By excluding non-local dependencies and coordination, at most one dependency relation holds for each word. Consider the following sentence:

    Smith   resigned    yesterday
    N       S[dcl]\NP   (S\NP)\(S\NP)

This derivation expresses the following dependencies:

    <<S[dcl]\NP, resigned>, 1, <N, Smith>>
    <<(S\NP)\(S\NP), yesterday>, 2, <S[dcl]\NP, resigned>>

We assume again that heads are generated before their modifiers or arguments, and that word-word dependencies are expressed by conditioning modifiers or arguments on heads. Therefore, the head words of arguments (such as Smith) are generated in the following manner:

    P(w_a | c_a, <<c_h, w_h>, i, <c_a, w_a>>)

The head word of modifiers (such as yesterday) is generated differently:

    P(w_m | c_m, <<c_m, w_m>, i, <c_h, w_h>>)

Like Collins (1999) and Charniak (2000), the SD model assumes that word-word dependencies can be defined at the maximal projection of a constituent. However, as the Dutch examples show, the argument slot i can only be determined if the head constituent is fully expanded. For instance, if S[dcl] expands to a non-head S/(S/NP) and to a head S[dcl]/NP, it is necessary to know how the S[dcl]/NP expands in order to determine which argument is filled by the non-head, even if we already know that the lexical category of the head word of the S[dcl]/NP is a ditransitive ((S[dcl]/NP)/NP)/NP. Therefore, we assume that the non-head child of a node is only expanded after the head child has been fully expanded.
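The following fragment is an illustration of our own (the function names are hypothetical, and the probability values are placeholders): it shows which of the two distributions above generates each word of the example sentence.

    # Illustrative stand-ins for the two distributions defined above.
    def p_argument(word, category, dependency):
        """P(w_a | c_a, <<c_h, w_h>, i, <c_a, w_a>>): argument word given its functor."""
        return 1.0  # placeholder; estimated by relative frequency in Section 5.5

    def p_functor(word, category, dependency):
        """P(w_m | c_m, <<c_m, w_m>, i, <c_h, w_h>>): functor/modifier word given its argument."""
        return 1.0  # placeholder

    # The two dependencies of "Smith resigned yesterday" (Section 5.1):
    dep_subject = (("S[dcl]\\NP", "resigned"), 1, ("N", "Smith"))
    dep_adjunct = (("(S\\NP)\\(S\\NP)", "yesterday"), 2, ("S[dcl]\\NP", "resigned"))

    # Smith fills argument slot 1 of resigned, so its word is generated with the
    # argument probability; yesterday is a modifier, so its own word is
    # conditioned on the head it modifies (the functor probability).
    p_smith = p_argument("Smith", "N", dep_subject)
    p_yesterday = p_functor("yesterday", "(S\\NP)\\(S\\NP)", dep_adjunct)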

5.2 Modelling long-range dependencies

The predicate-argument structure that corresponds to a derivation contains not only local, but also long-range dependencies that are projected from the lexicon or through some rules such as the coordination of functor categories. In the following derivation, Smith is the subject of resigned and of left:

    S[dcl]
       NP
          N  Smith
       S[dcl]\NP
          S[dcl]\NP  resigned
          S[dcl]\NP[conj]
             conj  and
             S[dcl]\NP  left

In order to express both dependencies, Smith has to be conditioned on resigned and on left:

    P(w = Smith | N, <<S[dcl]\NP, resigned>, 1, <N, w>>, <<S[dcl]\NP, left>, 1, <N, w>>)

In terms of the predicate-argument structure, resigned and left are both lexical heads of this sentence. Since neither fills an argument slot of the other, we assume that they are generated independently. This is different from the SD model, which conditions the head word of the second and subsequent conjuncts on the head word of the first conjunct. Similarly, in a sentence such as Miller and Smith resigned, the current model assumes that the two heads of the subject NP are conditioned on the verb, but not on each other.

Argument-cluster coordination constructions such as give a dog a bone and a policeman a flower are another example where the dependencies in the predicate-argument structure cannot be expressed at the level of the local trees that combine the individual arguments. Instead, these dependencies are projected down through the category of the argument cluster:

    S\NP_1
       ((S\NP_1)/NP_2)/NP_3   give
       (S\NP_1)\(((S\NP_1)/NP_2)/NP_3)   a dog a bone and a policeman a flower

Lexical categories that project long-range dependencies include cases such as relative pronouns, control verbs, auxiliaries, modals and raising verbs. This can be expressed by co-indexing their arguments, e.g. (NP\NP_i)/(S[dcl]\NP_i) for relative pronouns. Here, Smith is also the subject of resign:

    S[dcl]
       NP
          N  Smith
       S[dcl]\NP
          (S[dcl]\NP)/(S[b]\NP)  will
          S[b]\NP  resign

Again, in order to capture this dependency, we assume that the entire verb phrase is generated before the subject.

In relative clauses, there is a dependency between the verbs in the relative clause and the head of the noun phrase that is modified by the relative clause:

    NP
       NP
          N  Smith
       NP\NP
          (NP\NP)/(S[dcl]\NP)  who
          S[dcl]\NP  resigned

Since the entire relative clause is an adjunct, it is generated after the noun phrase Smith. Therefore, we cannot capture the dependency between Smith and resigned by conditioning Smith on resigned. Instead, resigned needs to be conditioned on the fact that its subject is Smith. This is similar to the way in which the head words of adjuncts such as yesterday are generated. In addition to this dependency, we also assume that there is a dependency between who and resigned. It follows that if we want to capture unbounded long-range dependencies such as object extraction, words cannot be generated at the maximal projection of constituents anymore. Consider the following examples:

    NP
       NP  The woman
       NP\NP
          (NP\NP)/(S[dcl]/NP)  that
          S[dcl]/NP
             S/(S\NP)  I
             (S[dcl]\NP)/NP  saw

    NP
       NP  The woman
       NP\NP
          (NP\NP)/(S[dcl]/NP)  that
          S[dcl]/NP
             S/(S\NP)  I
             (S[dcl]\NP)/NP
                (S[dcl]\NP)/NP  saw
                NP/NP  a picture of

In both cases, there is an S[dcl]/NP with lexical head (S[dcl]\NP)/NP; however, in the second case, the NP argument is not the object of the verb. This problem can be solved by generating words at the leaf nodes instead of at the maximal projection of constituents. After expanding the (S[dcl]\NP)/NP node to (S[dcl]\NP)/NP and NP/NP, the NP that is co-indexed with woman cannot be unified with the object of saw anymore.

These examples have shown that two changes to the generative process are necessary if word-word dependencies in the predicate-argument structure are to be captured. First, head constituents have to be fully expanded before non-head constituents are generated. Second, words have to be generated at the leaves of the tree, not at the maximal projection of constituents.
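As an illustration of the co-indexation mechanism described above (the names and the representation below are our own assumptions, not the parser's internals), the relative pronoun category projects the head of the modified noun phrase into the subject slot of the relative-clause verb:

    # A relative pronoun such as "who" has category (NP\NP_i)/(S[dcl]\NP_i):
    # the modified NP and the missing subject of the relative clause share the
    # index i, so the same lexical head fills both argument slots.
    RELATIVE_PRONOUN_CATEGORY = "(NP\\NP{i})/(S[dcl]\\NP{i})"

    def project_long_range(head_of_modified_np, relative_clause_verb):
        """The co-index propagates the head noun into the verb's subject slot,
        yielding the long-range dependency <<S[dcl]\\NP, verb>, 1, <N, noun>>."""
        return (("S[dcl]\\NP", relative_clause_verb), 1, ("N", head_of_modified_np))

    # "Smith who resigned": Smith is also the subject of resigned.
    print(project_long_range("Smith", "resigned"))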

5.3 The word probabilities

Not all words have functor categories or fill argument slots of other functors. For instance, punctuation marks, conjunctions, and the heads of entire sentences are not conditioned on any other words; they are only conditioned on their lexical categories. This model therefore contains the following three kinds of word probabilities:

1. Argument probabilities: P(w | c, <<c', w'>, i, <c, w>>)
   The probability of generating word w, given that its lexical category is c and that <c, w> is the head of the ith argument of <c', w'>.

2. Functor probabilities: P(w | c, <<c, w>, i, <c', w'>>)
   The probability of generating word w, given that its lexical category is c and that <c', w'> is the head of the ith argument of <c, w>.

3. Other word probabilities: P(w | c)
   If a word does not fill any dependency relation, it is only conditioned on its lexical category.
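As a sketch of how this three-way distinction could be organized (our own illustration; the function names are hypothetical), a word with at most one dependency would be scored as follows, with multiple dependencies handled as described in Section 5.6:

    def p_word(word, category, dep, p_arg, p_fun, p_lex):
        """Pick among the three word probabilities of Section 5.3 for a word with
        at most one dependency; Section 5.6 handles multiple dependencies."""
        if dep is None:
            # punctuation marks, conjunctions, heads of entire sentences, ...
            return p_lex(word, category)
        functor, slot, argument = dep
        if argument[1] == word:
            # the word fills argument slot `slot` of the functor
            return p_arg(word, category, dep)
        # otherwise the word is the functor, conditioned on its argument
        return p_fun(word, category, dep)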

5.4 The structural probabilities

Like the SD model, we assume an underlying process which generates CCG derivation trees starting from the root node. Each node in a derivation tree has a category, a list of lexical heads and a (possibly empty) list of dependency relations to be filled by its lexical heads. As discussed in the previous section, head words cannot in general be generated at the maximal projection if unbounded long-range dependencies are to be captured. This is not the case for lexical categories. We therefore assume that a node's lexical head category is generated at its maximal projection, whereas head words are generated at the leaf nodes. Since lexical categories are generated at the maximal projection, our model has the same structural probabilities as the LexCat model of Hockenmaier and Steedman (2002b).

corpus:

C ´Û; cµ

^

´Û jcµ=

È In the presence of long-range dependencies and co-

C ´cµ In order to estimate the probability of an argument ordination, the new model requires the conditioning of certain events on multiple heads. Since it is un-

Û , we count the number of times it occurs with lex-

likely that such probabilities can be estimated di- i

ical category c and is the th argument of the lexical ¼

¼ rectly from data, they have to be approximated in

c ;Û i

functor h in question, divided by the number ¼

¼ some manner.

hc ;Û i

of times the ith argument of is instantiated deÔ

If we assume that all dependencies i that hold

with a constituent whose lexical head category is c:

¼ ¼

^ for a word are equally likely, we can approximate

È ´Û jc; hhc ;Û i;i;hc; Û iiµ=

È ´Û jc; deÔ ; :::; deÔ µ

¼ ¼ Ò

½ as the average of the individ-

C ´hhc ;Û i;i;hc; Û iiµ

È

¼ ¼¼

¼ ual dependency probabilities:

C ´hhc ;Û i;i;hc; Û iiµ

¼¼ Û The fact that constituents can have more than one

Ò lexical head causes similar problems for dynamic

X

½

È ´Û jc; deÔ ; :::; deÔ µ  È ´Û jc; deÔ µ

Ò i

½ programming and the beam search.

Ò =½ i In order to be able to parse efficiently with our This approximation is has the advantage that it is model, we use the following approximations for dy- easy to compute, but might not give a good estimate, namic programming and the beam search: Two con- since it averages over all individual distributions. stituents with the same span and the same category 6 Dynamic programming and beam search are considered equivalent if they delay the evalua- tion of the probabilities of the same words and if This section describes how this model is integrated they have the same number of lexical heads, and if into a CKY chart parser. Dynamic programming and the first two elements of their lists of lexical heads effective beam search strategies are essential to guar- are identical (the same words and lexical categories). antee efficient parsing in the face of the high ambi- This is only an approximation to true equivalence, guity of wide-coverage grammars. Both use the in- since we do not check the entire list of lexical heads. side probability of constituents. In lexicalized mod- Furthermore, if a cell contains more than 100 con- els where each constituent has exactly one lexical stituents, we iteratively narrow the beam (by halv- head, and where this lexical head can only depend ingitinsize)2 until the beam search has no further on the lexical head of one other constituent, the in- effect or the cell contains less than 100 constituents. side probability of a constituent is the probability This is a very aggressive strategy, and it is likely to that a node with the label and lexical head of this adversely parsing accuracy. However, more constituent expands to the tree below this node. The lenient strategies were found to require too much probability of generating a node with this label and space for the chart to be held in memory. A better lexical head is given by the outside probability of the way of dealing with the space requirements of our constituent. model would be to implement a packed shared parse In the model defined here, the lexical head of forest, but we leave this to future work. a constituent can depend on more than one other word. As explained in section 5.2, there are in- 7 An experiment stances where the categorial functor is conditioned We use sections 02-21 of CCGbank for training, sec- on its arguments – the example given above showed tion 00 for development, and section 23 for test- that verbs in relative clauses are conditioned on the ing. The input is POS-tagged using the tagger of lexical head of the noun which is modified by the Ratnaparkhi (1996). However, since parsing with relative clause. Therefore, the inside probability of

the new model is less efficient, only sentences  40 a constituent cannot include the probability of any tokens only are used to test the model. A fre- lexical head whose argument slots are not all filled.

quency cutoff of  20 was used to determine rare This means that the equivalence relation defined words in the training data, which are replaced with by the probability model needs to take into account their POS-tags. Unknown words in the test data not only the head of the constituent itself, but also are also replaced by their POS-tags. The models all other lexical heads within this constituent which are evaluated according to their Parseval scores and have at least one unfilled argument slot. As a conse- to the recovery of dependencies in the predicate- quence, dynamic programming becomes less effec- argument structure. Like Clark et al. (2002), we tive. There is a related problem for the beam search: do not take the lexical category of the dependent

in our model, the inside probabilities of constituents

¼

c; Û i;i;h ;Û ii into account, and evaluate hh for la-

within the same cell cannot be directly compared

¼

;Ûi; ; h ;Û ii belled, and hh for unlabelled recov- anymore. Instead, the number of unfilled lexical ery. Undirectional recovery (UdirP/UdirR) evalu- heads needs to be taken into account. If a lexical

ates only whether there is a dependency between Û

c; Û i

head h is unfilled, the evaluation of the proba- ¼

and Û . Unlike unlabelled recovery, this does not pe- bility of Û is delayed. This creates a problem for the beam search strategy. 2Beam search is as in Hockenmaier and Steedman (2002b). nalize the parser if it mistakes a for an LexCat Local LeftArgs All Lex. cats: 88.2 89.9 90.1 90.1 adjunct or vice versa. Parseval In order to determine the impact of capturing dif- LP: 76.3 78.4 78.5 78.5 ferent kinds of long-range dependencies, four differ- LR: 75.9 78.5 79.0 78.7 UP: 82.0 83.4 83.6 83.2 ent models were investigated: The baseline model is UR: 81.6 83.6 83.8 83.4 like the LexCat model of (2002b), since the struc- Predicate-argument structure (all) tural probabilities of our model are like those of LP: 77.3 80.8 81.6 81.5 LR: 78.2 80.6 81.5 81.4 that model. Local only takes local dependencies UP: 86.4 88.3 88.9 88.7 into account. LeftArgs only takes long-range de- UR: 87.4 88.1 88.8 88.6 pendencies that are projected through left arguments UdirP: 88.0 89.7 90.2 90.0

UdirR: 89.0 89.5 90.1 90.0 X (Ò ) into account. This includes for instance long- Non-long-range dependencies range dependencies projected by subjects, subject LP: 78.9 82.5 83.0 82.9 and object verbs, subject extraction and left- LR: 79.5 82.3 82.7 82.6 UP: 87.5 89.7 89.9 89.8 node raising. All takes all long-range dependen- UR: 88.1 89.4 89.6 89.4 cies into account, in particular it extends LeftArgs All long-range dependencies by capturing also the unbounded dependencies aris- LP: 60.8 62.6 67.1 66.3 LR: 64.4 63.0 68.5 68.8 ing through right-node-raising and object extraction. UP: 75.3 74.2 78.9 78.1 Local, LeftArgs and All are all tested with the ag- UR: 80.2 74.9 80.5 80.9 gressive beam strategy described above. Bounded long-range dependencies LP: 63.9 64.8 69.0 69.2 In all cases, the CCG derivation includes all long- LR: 65.9 64.1 70.2 70.0 range dependencies. However, with the models that UP: 79.8 77.1 81.4 81.4 exclude certain kinds of dependencies, it is possible UR: 82.4 76.7 82.6 82.6 Unbounded long-range dependencies

that a word is conditioned on no dependencies. In LP: 46.0 50.4 55.6 52.4

´Û jcµ these cases, the word is generated with È . LR: 54.7 55.8 58.7 61.2 Table 1 gives the performance of all four mod- UP: 54.1 58.2 63.8 61.1 UR: 66.5 63.7 66.8 69.9 els on section 23 in terms of the accuracy of lexical

categories, Parseval scores, and in terms of the re- Table 1: Evaluation (sec. 23,  40 words). covery of word-word dependencies in the predicate- argument structure. Here, results are further bro- ture. By including long-range dependencies on left ken up into the recovery of local, all long-range, arguments (such as subjects) (LeftArgs), a further bounded long-range and unbounded long-range de- improvement of 0.7% on the recovery of predicate- pendencies. argument structure is obtained. This model captures LexCat does not capture any word-word de- the dependency between shares and bought. In con- pendencies. Its performance on the recovery of trast to the SD model, it can use all instances of predicate-argument structure can be improved by shares as the subject of a passive verb in the train- 3% by capturing only local word-word dependen- ing data to estimate this probability. Therefore, even cies (Local). This excludes certain kinds of depen- if shares and bought do not co-occur in this partic- dencies that were captured by the SD model. For in- ular construction in the training data, the event that stance, the dependency between the head of a noun is modelled by our dependency model might not be phrase and the head of a reduced relative clause (the unseen, since it could have occurred in another syn- shares bought by John) is captured by the SD model, tactic context. since shares and bought are both heads of the local Our results indicate that in order to perform well trees that are combined to form the complex noun on long-range dependencies, they have to be in- phrase. However, in the SD model the probability of cluded in the model, since Local, the model that this dependency can only be estimated from occur- captures only local dependencies performs worse on rences of the same construction, since dependency long-range dependencies than LexCat, the model relations are defined in terms of local trees and not that captures no word-word dependencies. How- in terms of the underlying predicate-argument struc- ever, with more than 5% difference on labelled pre- cision and recall on long-range dependencies, the probabilities of multiple dependencies, and more so- model which captures long-range dependencies on phisticated techniques should be investigated. left arguments performs significantly better on re- We have argued that a model of the kind proposed covering long-range dependencies than Local.The in this paper is essential for parsing languages with greatest difference in performance between the mod- freer word order, such as Dutch or Czech, where the els which do capture long-range dependencies and model of Hockenmaier and Steedman (2002b) (and the models which do not is on long-range dependen- other models of surface dependencies) would sys- cies. This indicates that, at least in the kind of model tematically capture the wrong dependencies, even if considered here, it is very important to model not only local dependencies are taken into account. For just local, but also long-range dependencies. It is not English, our experimental results demonstrate that clear why All, the model that includes all dependen- our model benefits greatly from modelling not only cies, performs slightly worse than the model which local, but also long-range dependencies, which are includes only long-range dependencies on subjects. beyond the scope of surface dependency models. 
On the Wall Street Journal task, the overall performance of this model is lower than that of the SD model of Hockenmaier and Steedman (2002b). In that model, words are generated at the maximal projection of constituents; therefore, the structural probabilities can also be conditioned on words, which improves the scores by about 2%. It is also very likely that the performance of the new models is harmed by the very aggressive beam search.

8 Conclusion and future work

This paper has defined a new generative model for CCG derivations which captures the word-word dependencies in the corresponding predicate-argument structure, including bounded and unbounded long-range dependencies. In contrast to the conditional model of Clark et al. (2002), our model captures these dependencies in a sound and consistent manner. The experiments presented here demonstrate that the performance of a simple baseline model can be improved significantly if long-range dependencies are also captured. In particular, our results indicate that it is important not to restrict the model to local dependencies. Future work will address the question whether these models can be run with a less aggressive beam search strategy, or whether a different parsing algorithm is more suitable. The problems that arise due to the overly aggressive beam search strategy might be overcome if we used an n-best parser with a simpler probability model (e.g. of the kind proposed by Hockenmaier and Steedman (2002b)) and used the new model as a re-ranker. The current implementation uses a very simple method of estimating the probabilities of multiple dependencies, and more sophisticated techniques should be investigated.

We have argued that a model of the kind proposed in this paper is essential for parsing languages with freer word order, such as Dutch or Czech, where the model of Hockenmaier and Steedman (2002b) (and other models of surface dependencies) would systematically capture the wrong dependencies, even if only local dependencies are taken into account. For English, our experimental results demonstrate that our model benefits greatly from modelling not only local, but also long-range dependencies, which are beyond the scope of surface dependency models.

Acknowledgements

I would like to thank Mark Steedman and Stephen Clark for many helpful discussions, and gratefully acknowledge support from an EPSRC studentship and grant GR/M96889, the School of Informatics, and NSF ITR grant 0205 456.

References

Eugene Charniak. 2000. A Maximum-Entropy-Inspired Parser. In Proceedings of the First Meeting of the NAACL, Seattle.

Stephen Clark, Julia Hockenmaier, and Mark Steedman. 2002. Building Deep Dependency Structures using a Wide-Coverage CCG Parser. In Proceedings of the 40th Annual Meeting of the ACL.

Michael Collins, Jan Hajic, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser for Czech. In Proceedings of the 37th Annual Meeting of the ACL.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Julia Hockenmaier and Mark Steedman. 2002a. Acquiring Compact Lexicalized Grammars from a Cleaner Treebank. In Proceedings of the Third LREC, pages 1974-1981, Las Palmas, May.

Julia Hockenmaier and Mark Steedman. 2002b. Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the ACL.

Julia Hockenmaier. 2003. Data and Models for Statistical Parsing with CCG. Ph.D. thesis, School of Informatics, University of Edinburgh.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part-Of-Speech Tagger. In Proceedings of the EMNLP Conference, pages 133-142, Philadelphia, PA.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, Mass.