Recognition and Tagging of Compound Verb Groups in Czech
In: Proceedings of CoNLL-2000 and LLL-2000, pages 219-225, Lisbon, Portugal, 2000.

Eva Žáčková and Luboš Popelínský and Miloš Nepil
NLP Laboratory, Faculty of Informatics, Masaryk University
Botanická 68, CZ-602 00 Brno, Czech Republic
{glum, popel, nepil}@fi.muni.cz

Abstract

In Czech corpora compound verb groups are usually tagged in a word-by-word manner. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. We present a method for automatic recognition of compound verb groups in Czech. From an annotated corpus, 126 definite clause grammar rules were constructed. These rules describe all compound verb groups that are frequent in Czech. Using those rules we can find compound verb groups in unannotated texts with an accuracy of 93%. Tagging compound verb groups in an annotated corpus by exploiting the verb rules is also described.

Keywords: compound verb groups, chunking, morphosyntactic tagging, inductive logic programming

1 Compound Verb Groups

Recognition and analysis of the predicate in a sentence is fundamental for the meaning of the sentence and its further analysis. In more than half of Czech sentences the predicate contains a compound verb group. E.g. in the sentence Mrzí mě, že jsem o té konferenci nevěděla, byla bych se jí zúčastnila. (literal translation: I am sorry that I did not know about the conference, I would have participated in it.) there are three verb groups:

<vg> Mrzí </vg> mě, že <vg> jsem o té konferenci nevěděla </vg>, <vg> byla bych se jí zúčastnila. </vg>

I <vg> am sorry </vg> that I <vg> did not know </vg> about the conference, I <vg> would have participated </vg> in it.

Verb groups are often split into several parts by so-called gap words. In the second verb group the gap words are o té konferenci (about the conference). In annotated Czech corpora, including DESAM (Pala et al., 1997), compound verb groups are usually tagged in a word-by-word manner. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. That is, the tags are correct for a single word but they do not reflect the meaning of the words in context. In the above sentence the word jsem is tagged as a verb in present tense, but the whole verb group to which it belongs - jsem nevěděla - is in past tense. A similar situation appears in byla bych se jí zúčastnila (I would have participated in it), where zúčastnila is tagged as past tense while it is only a part of the past conditional. Without finding all parts of a compound verb group and without tagging the whole group (which necessarily depends on the other parts of the compound verb group) it is impossible to continue with any kind of semantic analysis.

We consider a compound verb group to be a list of verbs and possibly the reflexive pronouns se, si. Such a group is composed of auxiliary and full-meaning verbs, e.g. budu se umývat, where budu is an auxiliary verb (like will in English), se is the reflexive pronoun and umývat means to wash. As word-by-word tagging of verb groups is confusing, it is useful to find the whole group and assign a new tag to it. This tag should contain information about the beginning and the end of the group and about the particular components of the verb group. It must also contain information about relevant grammatical categories that characterise the verb group as a whole. In (Žáčková and Pala, 1999), a proposal of a method for automatic finding of compound verb groups in the corpus DESAM is introduced.
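To make the notion of gap words concrete, the following is a minimal Python sketch; it is not part of the method described in this paper. It assumes that a clause is already given as a list of (word, is_component) pairs, where is_component marks verbs and the reflexive pronouns se, si. The function name split_verb_group and this input format are illustrative assumptions.

```python
def split_verb_group(tokens):
    """Return (components, gap_words) for one clause.

    tokens: list of (word, is_component) pairs; is_component is True for
    verbs and the reflexive pronouns se/si (hypothetical input format).
    Gap words are the non-component words lying between the first and the
    last component of the group.
    """
    positions = [i for i, (_, is_comp) in enumerate(tokens) if is_comp]
    if not positions:
        return [], []
    first, last = positions[0], positions[-1]
    span = tokens[first:last + 1]
    components = [word for word, is_comp in span if is_comp]
    gap_words = [word for word, is_comp in span if not is_comp]
    return components, gap_words


# Second verb group of the example sentence above.
clause = [("jsem", True), ("o", False), ("té", False),
          ("konferenci", False), ("nevěděla", True)]
components, gap_words = split_verb_group(clause)
print(components)  # ['jsem', 'nevěděla']
print(gap_words)   # ['o', 'té', 'konferenci']
```

Applied to the third verb group, byla bych se jí zúčastnila, the same function would report jí as the only gap word.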
We describe here an improved method that results in definite clause grammar rules - called verb rules - that contain information about all components of a particular verb group and about tags. We also describe some improvements that allow us to increase the accuracy of verb group recognition.

The paper is organised as follows. Corpus DESAM is described in Section 2. Section 3 contains a description of the method for learning verb rules. Recognition of verb groups in annotated text is discussed in Section 4. Improvements of this method are introduced in Section 5. In Section 6 we briefly show how the tag for the compound verb group is constructed employing verb rules. We conclude with a discussion (Section 7), a description of ongoing research (Section 8) and a summary of relevant works (Section 9).

2 Data Source

DESAM (Pala et al., 1997), the annotated and fully disambiguated corpus of Czech newspaper texts, has been used as the source of learning data. It contains more than 1 000 000 word positions, about 130 000 different word forms, about 65 000 of them occurring more than once, and 1 665 different tags. E.g. in Tab. 1 the tag k5eApFnStMmPaP of the word zúčastnila (participated) means: part of speech (k) = verb (5), person (p) = feminine (F), number (n) = singular (S) and tense (t) = past (M). Lemmata and possible tags are prefixed by <l> and <t> respectively. As pointed out in (Pala et al., 1997; Popelínský et al., 1999), DESAM is not large enough. It does not yet contain a representative set of Czech sentences. In addition, some words are tagged incorrectly and about 1/5 of the positions are untagged.

Mrzí         <l>mrzet        <t>k5eAp3nStPmIaI
mě           <l>já           <t>k3xPnSc24p1
že           <l>že           <t>k8xS
jsem         <l>být          <t>k5eAp1nStPmIaI
o            <l>o            <t>k7c46
té           <l>ten          <t>k3xDgFnSc6 <t>k3xOgXnSc6p3
konferenci   <l>konference   <t>k1gFnSc6
nevěděla     <l>vědět        <t>k5eNpFnStMmPaI
byla         <l>být          <t>k5eApFnStMmPaI
bych         <l>by           <t>k5eAp1nStPmCaI
se           <l>se           <t>k3xXnSc4
jí           <l>on           <t>k3xPgFnSc2p3
zúčastnila   <l>zúčastnit    <t>k5eApFnStMmPaP

Table 1: Example of the disambiguated Czech sentence

3 Learning Verb Rules

The algorithm for learning verb rules (Žáčková and Pala, 1999) takes as its input annotated sentences from corpus DESAM. The algorithm is split into three steps: finding verb chunks (i.e. finding boundaries of simple clauses in compound or complex sentences, and elimination of gap words), generalisation, and verb rule synthesis. These three steps are described below.

3.1 Verb Chunks

The observed properties of a verb group are the following: its components are either verbs or a reflexive pronoun se (si); the boundary of a verb group cannot be crossed by the boundary of a sentence; and between two components of the verb group there can be a gap consisting of an arbitrary number of non-verb words or even a whole sentence. In the first step, the boundaries of all sentences are found. Then each gap is replaced by the tag gap.

The method exploits only the lemma of each word (nominative singular for nouns, adjectives, pronouns and numerals, infinitive for verbs) and its tag. We will demonstrate the whole process using the third simple clause of the sentence in Tab. 1, byla bych se jí zúčastnila (I would have participated in it):

být/k5eApFnStMmPaI
by/k5eAp1nStPmCaI
se/k3xXnSc3
on/k3xPgUnSc4p3
zúčastnit/k5eApFnStMmPaP

After substitution of gaps we obtain

být/k5eApFnStMmPaI
by/k5eAp1nStPmCaI
si/k3xXnSc3
gap
zúčastnit/k5eApFnStMmPaP

3.2 Generalisation

The lemmata and the tags are now generalised. Three generalisation operations are employed: elimination of (some of) the lemmata, generalisation of grammatical categories, and finding grammatical agreement constraints.

3.2.1 Elimination of lemmata

All lemmata except forms of the auxiliary verb být (to be) (být, by, aby, kdyby) are rejected. Lemmata of modal verbs and verbs with similar behaviour are replaced by the tag modal. These verbs have been found in the list of more than 15 000 verb valencies (Pala and Ševeček, 1999). In our example it is the verb zúčastnit that is removed.

být/k5eApFnStMmPaI
by/k5eAp1nStPmCaI
k3xXnSc3
gap
k5eApFnStMmPaP

3.2.2 Generalisation of grammatical categories

According to linguistic knowledge, several grammatical categories are not important for verb group description. Very often these are negation (e) or aspect (a - aI stands for imperfectum, aP for perfectum).

k3xXnSc?
gap
k5e?pFnStMmPa?

3.2.3 Finding grammatical agreement constraints

Another situation appears when two or more values of some category are related. In the simplest case they have to be the same - e.g. the value of attribute person (p) in the first and the last word of our example. More complicated is the relation among the values of attribute number (n). They should be the same except when the polite way of addressing occurs, e.g. in byl byste se jí zúčastnil (you would have participated in it). Thus we have to check whether the values are the same or the conditions of the polite way of addressing are satisfied. For this purpose we add the predicate check_num() that ensures agreement in the grammatical category number, and we obtain

být/k5e?p_n_tMmPa?
by/k5e?p?n_tPmCa?
k3xXnSc?
gap
k5e?p_n_tMmPa?
check_num(n)

3.3 DCG Rules Synthesis

Finally the verb rule is constructed by rewriting the result of the generalisation phase. For the sentence byla bych se jí zúčastnila (I would have participated in it) we obtain

verb_group(vg(Be, Cond, Se, Verb), Gaps) -->
    be(Be, _, P, N, tM, mP, _),            % být/k5e?p_n_tMmPa?
    cond(Cond, _, _, Ncond, tP, mC, _),    % by/k5e?p?n_tPmCa?
    {check_num(N, Ncond, Cond, Vy)},
    reflex_pron(Se, xX, _, _),             % k3xXnSc?