Cascaded Grammatical Relation Assignment
Sabine Buchholz and Jorn Veenstra and Walter Daelemans
ILK, Computational Linguistics, Tilburg University
PO Box 90153, 5000 LE Tilburg, The Netherlands
[buchholz, veenstra, daelemans]@kub.nl

Abstract

In this paper we discuss cascaded Memory-Based grammatical relations assignment. In the first stages of the cascade, we find chunks of several types (NP, VP, ADJP, ADVP, PP) and label them with their adverbial function (e.g. local, temporal). In the last stage, we assign grammatical relations to pairs of chunks. We studied the effect of adding several levels to this cascaded classifier and found that even the less well performing chunkers enhanced the performance of the relation finder.

1 Introduction

When dealing with large amounts of text, finding structure in sentences is often a useful preprocessing step. Traditionally, full parsing is used to find structure in sentences. However, full parsing is a complex task and often provides us with more information than we need. For many tasks, detecting only shallow structures in a sentence in a fast and reliable way is preferable to full parsing. For example, in information retrieval it can be enough to find only simple NPs and VPs in a sentence; for information extraction we might also want to find relations between constituents, for example the subject and object of a verb.

In this paper we discuss some Memory-Based (MB) shallow parsing techniques to find labeled chunks and grammatical relations in a sentence. Several MB modules have been developed in previous work, such as a POS tagger (Daelemans et al., 1996), a chunker (Veenstra, 1998; Tjong Kim Sang and Veenstra, 1999) and a grammatical relation (GR) assigner (Buchholz, 1998). The questions we will answer in this paper are: Can we reuse these modules in a cascade of classifiers? What is the effect of cascading? Will errors at a lower level percolate to higher modules?

Recently, many people have looked at cascaded and/or shallow parsing and GR assignment. Abney (1991) was one of the first to propose splitting up parsing into several cascades. He suggests first finding the chunks and then the dependencies between these chunks. Grefenstette (1996) describes a cascade of finite-state transducers, which first finds noun and verb groups, then their heads, and finally syntactic functions. Brants and Skut (1998) describe a partially automated annotation tool which constructs a complete parse of a sentence by recursively adding levels to the tree. Collins (1997) and Ratnaparkhi (1997) use cascaded processing for full parsing with good results. Argamon et al. (1998) applied Memory-Based Sequence Learning (MBSL) to NP chunking and subject/object identification; however, their subject and object finders are independent of their chunker (i.e. not cascaded).

Drawing on this previous work, we will explicitly study the effect of adding steps to the grammatical relations assignment cascade. Through experiments with cascading several classifiers, we will show that even using imperfect classifiers can improve the overall performance of the cascaded classifier. We illustrate this claim on the task of finding grammatical relations (e.g. subject, object, locative) to verbs in text. The GR assigner uses several sources of information step by step, such as several types of XP chunks (NP, VP, PP, ADJP and ADVP) and the adverbial functions assigned to these chunks (e.g. temporal, local). Since not all of these entities are predicted reliably, the question is whether each source leads to an improvement of the overall GR assignment.

In the rest of this paper we will first briefly describe Memory-Based Learning in Section 2. In Section 3.1, we discuss the chunking classifiers that we later use as steps in the cascade. Section 3.2 describes the basic GR classifier. Section 3.3 presents the architecture and results of the cascaded GR assignment experiments. We discuss the results in Section 4 and conclude with Section 5.
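To make the cascade architecture concrete, the following minimal Python sketch shows one way such a pipeline can be organized: each stage predicts one new feature per word (POS tag, chunk tag, adverbial function, grammatical relation) and its output is added to the input of all later stages, so that errors made at a lower level are indeed passed on to higher modules. The stage names and interfaces are illustrative assumptions, not the implementation used in the experiments.

    # A minimal sketch of a cascaded classifier pipeline (illustrative,
    # not the authors' implementation). Each stage maps per-word feature
    # dicts to one new predicted label per word.
    from typing import Callable, Dict, List, Tuple

    Token = Dict[str, str]                      # one feature dict per word
    Stage = Callable[[List[Token]], List[str]]  # one label per word

    def run_cascade(tokens: List[Token],
                    stages: List[Tuple[str, Stage]]) -> List[Token]:
        # Run each stage in order; its (possibly erroneous) predictions
        # become input features for every later stage.
        for feature_name, stage in stages:
            predictions = stage(tokens)
            for token, label in zip(tokens, predictions):
                token[feature_name] = label
        return tokens

    # Hypothetical usage, where pos_tagger, chunker, advfunc_labeler and
    # gr_assigner stand for trained memory-based classifiers:
    # run_cascade([{"word": w} for w in "But the dollar rebounded".split()],
    #             [("pos", pos_tagger), ("chunk", chunker),
    #              ("advfunc", advfunc_labeler), ("gr", gr_assigner)])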
2 Memory-Based Learning

Memory-Based Learning (MBL) keeps all training data in memory and only abstracts at classification time, by extrapolating a class from the most similar item(s) in memory. In recent work, Daelemans et al. (1999b) have shown that for typical natural language processing tasks, this approach is at an advantage because it also "remembers" exceptional, low-frequency cases which are useful to extrapolate from. Moreover, automatic feature weighting in the similarity metric of an MB learner makes the approach well-suited for domains with large numbers of features from heterogeneous sources, as it embodies a smoothing-by-similarity method when data is sparse (Zavrel and Daelemans, 1997).

We have used the following MBL algorithms[1]:

IB1: A variant of the k-nearest neighbor (k-NN) algorithm. The distance between a test item and each memory item is defined as the number of features for which they have a different value (overlap metric).

IB1-IG: IB1 with information gain (an information-theoretic notion measuring the reduction of uncertainty about the class to be predicted when knowing the value of a feature) to weight the cost of a feature value mismatch during comparison.

IGTree: In this variant, a decision tree is created with features as tests, ordered according to the information gain of the features, as a heuristic approximation of the computationally more expensive IB1 variants.

For more references and information about these algorithms we refer to (Daelemans et al., 1998; Daelemans et al., 1999b). For other memory-based approaches to parsing, see (Bod, 1992) and (Sekine, 1998).

[1] For the experiments described in this paper we have used TiMBL, an MBL software package developed in the ILK group (Daelemans et al., 1998). TiMBL is available from: http://ilk.kub.nl/.
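As an illustration of the two instance-based variants, here is a minimal Python sketch of 1-nearest-neighbor classification with the plain overlap metric (IB1) and with information-gain feature weighting (IB1-IG). It is a toy reimplementation for exposition only, assuming purely symbolic feature values; the actual experiments used TiMBL.

    # Toy sketch of IB1 / IB1-IG (with k=1), assuming symbolic feature
    # values. For exposition only; the experiments used TiMBL.
    from collections import Counter
    import math

    def ig_weights(instances, classes):
        # Information gain per feature: class entropy minus the expected
        # class entropy after splitting the data on that feature's values.
        def entropy(labels):
            total = len(labels)
            return -sum((c / total) * math.log2(c / total)
                        for c in Counter(labels).values())
        h_class = entropy(classes)
        weights = []
        for f in range(len(instances[0])):
            by_value = {}
            for inst, cls in zip(instances, classes):
                by_value.setdefault(inst[f], []).append(cls)
            remainder = sum(len(sub) / len(classes) * entropy(sub)
                            for sub in by_value.values())
            weights.append(h_class - remainder)
        return weights

    def classify(test_item, instances, classes, weights=None):
        # IB1 if weights is None (overlap metric: count mismatches);
        # IB1-IG if weights are the information-gain values per feature.
        w = weights or [1.0] * len(test_item)
        def distance(memory_item):
            return sum(w[f] for f in range(len(test_item))
                       if test_item[f] != memory_item[f])
        best = min(range(len(instances)),
                   key=lambda i: distance(instances[i]))
        return classes[best]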
3 Methods and Results

In this section we describe the stages of the cascade. The very first stage consists of a Memory-Based Part-of-Speech Tagger (MBT), for which we refer to (Daelemans et al., 1996). The next three stages involve determining boundaries and labels of chunks. Chunks are non-recursive, non-overlapping constituent parts of sentences (see (Abney, 1991)). First, we simultaneously chunk sentences into NP-, VP-, Prep-, ADJP- and ADVP-chunks. As these chunks are non-overlapping, no word can belong to more than one chunk, and thus no conflicts can arise. Prep-chunks are the prepositional part of PPs, thus excluding the nominal part. Then we join a Prep-chunk and one or more coordinated NP-chunks into a PP-chunk. Finally, we assign adverbial function (ADVFUNC) labels (e.g. locative or temporal) to all chunks.

In the last stage of the cascade, we label several types of grammatical relations between pairs of words in the sentence. In these experiments, one of these words is always a verb, since this yields the most important GRs. The other word is the head of the phrase which is annotated with this grammatical relation in the treebank: a preposition is the head of a PP, a noun of an NP, and so on.

The data for all our experiments was extracted from the Penn Treebank II Wall Street Journal (WSJ) corpus (Marcus et al., 1993). For all experiments, we used sections 00-19 as training material and 20-24 as test material. See Section 4 for results on other train/test set splittings.

For evaluation of our results we use the precision and recall measures. Precision is the percentage of predicted chunks/relations that are actually correct; recall is the percentage of correct chunks/relations that are actually found. For convenient comparison by a single value, we also list the F_β=1 value (C.J. van Rijsbergen, 1979):

  F_β = (β² + 1) · prec · rec / (β² · prec + rec), with β = 1

3.1 Chunking

In the first experiment described in this section, the task is to segment the sentence into chunks and to assign labels to these chunks. This process of chunking and labeling is carried out by assigning a tag to each word in a sentence, left-to-right. Ramshaw and Marcus (1995) first assigned a chunk tag to each word in the sentence: I for inside a chunk, O for outside a chunk, and B for inside a chunk whose preceding word is in another chunk. As we want to find more than one kind of chunk, we have to further differentiate the IOB tags as to which kind of chunk (NP, VP, Prep, ADJP or ADVP) the word is in. With the extended IOB tag set at hand we can tag the sentence:

  But/CC [NP the/DT dollar/NN NP] [ADVP later/RB ADVP]
  [VP rebounded/VBD VP] ,/, [VP finishing/VBG VP]
  [ADJP slightly/RB higher/RBR ADJP] [Prep against/IN Prep]
  [NP the/DT yen/NNS NP] [ADJP although/IN ADJP]
  [ADJP slightly/RB lower/JJR ADJP] [Prep against/IN Prep]
  [NP the/DT mark/NN NP] ./.

as:

  But/CC_O the/DT_I-NP dollar/NN_I-NP later/RB_I-ADVP
  rebounded/VBD_I-VP ,/,_O finishing/VBG_I-VP
  slightly/RB_I-ADVP higher/RBR_I-ADVP against/IN_I-Prep

  type          precision  recall  F_β=1
  NP chunks     92.5       92.2    92.3
  VP chunks     91.9       91.7    91.8
  ADJP chunks   68.4       65.0    66.7
  ADVP chunks   78.0       77.9    77.9
  Prep chunks   95.5       96.7    96.1
  PP chunks     91.9       92.2    92.0
  ADVFUNCs      78.0       69.5    73.5

Table 1: Results of chunking-labeling experiments. NP-, VP-, ADJP-, ADVP- and Prep-chunks are found simultaneously, but for convenience, precision and recall values are given separately for each type of chunk.
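The extended IOB encoding can be decoded mechanically into labeled chunks. Here is a minimal Python sketch of that step, under the I/O/B definitions given above; the function name and its output format are illustrative assumptions, not part of the original system.

    # Minimal sketch of decoding extended IOB tags into labeled chunks,
    # following the I/O/B definitions above. Illustrative only.
    def iob_to_chunks(words, tags):
        # Returns (chunk_type, chunk_words) pairs, left to right.
        chunks, start, chunk_type = [], None, None
        for i, tag in enumerate(tags + ["O"]):   # sentinel flushes last chunk
            boundary = tag == "O" or tag.startswith("B-") or (
                chunk_type is not None and tag[2:] != chunk_type)
            if boundary and start is not None:
                chunks.append((chunk_type, words[start:i]))
                start, chunk_type = None, None
            if tag != "O" and start is None:
                start, chunk_type = i, tag[2:]
        return chunks

    # Example, using the tagged sentence prefix above:
    # iob_to_chunks(["But", "the", "dollar", "later", "rebounded"],
    #               ["O", "I-NP", "I-NP", "I-ADVP", "I-VP"])
    # -> [("NP", ["the", "dollar"]), ("ADVP", ["later"]),
    #     ("VP", ["rebounded"])]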