In: Proceedings of CoNLL-2000 and LLL-2000, pages 219-225, Lisbon, Portugal, 2000. Recognition and Tagging of Compound Verb Groups in Czech
Eva Zgt~kov~ and Lubo~ Popelinsk:~ and Milo~ Nepil NLP Laboratory, Faculty of Informatics, Masaryk University Botanick£ 68, CZ-602 00 Brno, Czech Republic {glum, popel, nepil}@fi.muni.cz
Verb groups are often split into more parts with so called gap words. In the second verb group Abstract the gap words are o td konferenci (about the In Czech corpora compound verb groups are conference). In annotated Czech corpora, in- usually tagged in word-by-word manner. As a cluding DESAM (Pala et al., 1997), compound consequence, some of the morphological tags of verb groups are usually tagged in word-by-word particular components of the verb group lose manner. As a consequence, some of the morpho- their original meaning. We present a method for logical tags of particular components of the verb automatic recognition of compound verb groups group loose their original meaning. It means in Czech. From an annotated corpus 126 def- that the tags are correct for a single word but inite clause grammar rules were constructed. they do not reflect the meaning of the words in These rules describe all compound verb groups context. In the above sentence the word jsem that are frequent in Czech. Using those rules is tagged as a verb in present tense, but the we can find compound verb groups in unanno- whole verb group to which it belongs - jsem tated texts with the accuracy 93%. Tagging nev~d~la - is in past tense. Similar situation ap- compound verb groups in an annotated corpus pears in byla bych se jl zdSastnila (I would have exploiting the verb rules is described. participated in it) where zdSastnila is tagged as past tense while it is only a part of past condi- Keywords: compound verb groups, chunking, tional. Without finding all parts of a compound morphosyntactic tagging, inductive logic pro- verb group and without tagging the whole group gramming (what is necessary dependent on other parts of the compound verb group) it is impossible to 1 Compound Verb Groups continue with any kind of semantic analysis. Recognition and analysis of the predicate in a We consider a compound verb group to be a sentence is fundamental for the meaning of the list of verbs and maybe the reflexive pronouns sentence and its further analysis. In more than se, si. Such a group is obviously compound of half of Czech sentences the predicate contains auxiliary and full-meaning verbs, e.g. budu se the compound verb group. E.g. in the sentence um~vat where budu is auxiliary verb (like will in Mrzl m~, 2e jsem o td konferenci nev~d~la, byla English), se is the reflexive pronoun and um~vat bych se jl zdSastnila. (literary translation: I am means to wash. As word-by-word tagging of sorry that I did not know about the conference, verb groups is confusing, it is useful to find and I would have participated in it.) there are three assign a new tag to the whole group. This tag verb groups should contain information about the beginning
219 compound verb groups in the corpus DESAM Mrzf
220 participaied in iV): k3xXnSc? gap b~t/k5eApFnStMmPaI k5e?pFnStMmPa? by/k5eAplnStPmCaI se/k3xXnSc3 3.2.3 Finding grammatical agreement on/k3xPgUnSc4p3 constraints zfi~astnit/k5eApFnStMmPaP Another situation appears when two or more values of some category are related. In the sim- After substitution ofgaps we obtain plest case they have to be the same - e.g. the b~t/k5eApFnStMmPaI value of attribute person (p) in the first and the by/k5eAplnStPmCaI last word of our example. More complicated is si/k3xXnSc3 the relation among the values of attribute num- gap ber (n). They should be the same except when zfi~astnit/k5eApFnStMmPaP the polite way of addressing occurs, e.g. in byl byste se ji zdSastnil (you would have partici- pated in it). Thus we have to check whether the 3.2 Generalisation values are the same or the conditions of polite The lemmata and the tags are now being gener- way of addressing are satisfied. For this purpose alised. Three generalisation operations are em- we add the predicate check_.num() that ensures ployed: elimination of (some of) lemmata, gen- agreement in the grammatical category number eralisation of grammatical categories and find- and we obtain ing grammatical agreement constraints. b~t/k5e?p_n_tMmPa? 3.2.1 Elimination of lemmata by/k5e?p?n_tPmCa? All lemmata except forms of auxiliary verb bit k3xXnSc? (to be) (b~t, by, aby, kdyby) are rejected. Lem- gap mata of modal verbs and verbs with similar be- k5e?p_n_tMmPa? haviour are replaced by tag modal. These verbs check_hum(n) have been found in the list of more than 15 000 verb valencies (Pala and Seve£ek P., 1999). In our example it is the verb zdSastnit that is re- 3.3 DCG Rules Synthesis moved. Finally the verb rule is constructed by rewriting the result of the generalisation phase. For the b@t/k5eApFnStMmPaI sentence byla bych se jl zd6astnila (I would have by/k5eAplnStPmCaI participated in it) we obtain k3xXnSc3 gap verb_group (vg (Be, Cond, Se, Verb), k5eApFnStMmPaP Gaps) --> be (Be ,_, P, N, tM.,mP,_), 3.2.2 Generalisation of grammatical 7, b~t/k5e?p_n_tMmPa? categories cond (Cond, .... Ncond, tP, mC, _), Exploiting linguistic knowledge, several gram- 7, by/k5e?p?n_tPmCa? matical categories are not important for verb {che ck_num (N, Ncond, Cond, Vy) }, group description. Very often it is negation (e), ref lex_pron (Se, xX, _, _), or aspect (a - aI stands for imperfectum, aP 7, k3xXnSc? for perfectum). These categories may be re- gap( [] ,Gaps), moved. For some of verbs even person (p) can 7. gap be removed. In our example the values of those k5 (Verb ..... P,N,tM,mP,_). grammatical categories have been replaced by ? 7. k5e?p_n_tMmPa? and we obtained If this rule does not exist in the set of verb b~t/k5e?pFnStMmPa? rules yet~ it is added into it. The meanings by/k5e?p?nStPmCa? of non-terminals used in the rule are following:
221 be() represents auxiliary verb b~t, cond() rep- 5 Fixing Misclassification Errors resents various forms of conditionals by, aby, We combined two approaches, elimination of kdyby, reflex_pron() stands for reflexive pro- lemmata which are very rare for a given noun se (si), gap() is a special predicate for word form, and inductive logic program- manipulation with gaps, and k5 () stands for ar- ming (Popelinsk~ et al., 1999; Pavelek and bitrary non-auxiliary verb. The particular val- Popelinsk~, 1999). The method is used in the ues of some arguments of non-terminals repre- post-processing phase to prune the set of rules sent required properties. Simple cases of gram- that have fired for the sentence. matical agreement are treated through binding of variables. More complicated situations are 5.1 Elimination of infrequent lemmata solved employing constraints like the predicate In Czech corpora it was observed that 10% of check_hum(). word positions - i.e. each 10th word of a text The method has been implemented in Perl. 126 - have at least 2 lemmata and about 1% word definite clause grammar rules were constructed forms of Czech vocabulary has at least 2 lem- from the annotated corpus that describe all verb mata. (Popelinsk~ et al., 1999; Pavelek and groups that are frequent in Czech. Popelinsk~, 1999) E.g. word form psi can be correct rules(%) > 1 verb all 4 Recognition of Verb Groups number of examples 349 600 original method 86.8 92.3 The verb rules have been used for recognition, + infrequent lemmata 91.1 94.8 and consequently for tagging, of verb groups in + ILP 92.8 95.8 unannotated text. A portion of sentences which have not been used for learning, has been ex- Table 2: DESAM: Results for unannotated text tracted from a corpus. Each sentence has been ambiguously tagged with LEMMA morphologi- cal analyser (Pala and Seve~ek, 1995), i.e. each correct rules(%) word of the sentence has been assigned to all > 1 verb all possible tags. Then all the verb rules were ap- number of examples 284 467 plied to each sentence. The learned verb rules original method 87.0 92.1 displayed quite good accuracy. For corpus DE- + infrequent lemmata 91.6 94.9 SAM, a verb rule has been correctly assigned to + ILP 92.6 95.5 92.3% verb groups. We tested, too, how much is this method dependent on the corpus that Table 3: PTB: Results for unannotated text was used for learning. As the source of testing data we used Prague Tree Bank (PTB) Corpus either preposition at (like at the lesson) or im- that is under construction at Charles Univer- perative of argue. We decided to remove all the sity in Prague. The accuracy displayed was not verb rules that recognised a word-lemma couple different from results for DESAM. It maybe ex- of a very small frequency in the corpus. Actu- plained by the fact that both corpora have been ally those lemmata that did not appear more built from newspaper articles. than twice in DESAM corpus were supposed to Although the accuracy is acceptable for the test be incorrect. data that include also clauses with just one For testing, we randomly chose the set of 600 verb, errors have been observed for complex sen- examples including compound or complex sen- tences. In about 13% of them, some of com- tences from corpus DESAM. 251 sentences con- pound verb groups were not correctly recog- tained only one verb. The results obtained are nized. It was observed that almost 7{)% of these in Tab. 2. The first line contains the number errors were caused by incorrect lemrna recogni- of examples used. In the following line there tion. In the next section we describe a method are results of the original method as mentioned for fixing this kind of errors. in Section 4. Next line (+ infrequent lemmata)
222 displays results when the word-lemma couple of 6 Tagging Verb Groups a very small frequency have been removed. The column '> 1 verb' concerns the sentences where We now describe a method for compound verb at least two verbs appeared. The column 'all' group tagging in morphologically annotated displays accuracy for all sentences. Results for corpora. We decided to use SGML-like notation corpus PTB are displayed in Tab. 3. It can be for tagging because it allows to incorporate observed that after pruning rules that contain a the new tags very easily into DESAM corpus. rare lemma the accuracy significantly increased. The beginning and the end of the whole verb group and beginnings and ends of its particular 5.2 Lemma disambiguation by ILP components are marked. For the sentence byla Some of incorrectly recognised lemmata cannot bych se jl zd6astnila (I would have participated be fixed by the method described. E.g. word in it) we receive form se has two lemmata, se - reflexive pronoun
223 The set of attributes can be also enriched e.g. tactic information. In this paper we deal with with the number of components. We also plan recognition and tagging potentially discontinu- to include into the attributes of
224 9 Relevant works K. Osolsob~. 1999. Morphological tagging of com- Another approach for recognition of compound posed verb forms in Czech corpus. Technical re- verb groups in Czech (Osolsob~, 1999) have port, Studia Minora Facultatis Philosophicae Uni- versitatis Brunensis Brno. been already discussed in Section 7. Ramshaw K. Pala and Seve~ek P. 1999. Valencies of Czech and Marcus (Ramshaw and Marcus, 1995) views Verbs. Technical Report A45, Studia Minora chunking as a tagging problem. They used Facultatis Philosophicae Universitatis Brunensis transformation-based learning and achieved re- Brno. call and precision rates 93% for base noun K. Pala and P. Seve~ek. 1995. Lemma morphologi- phrase (non-recursive noun phrase) and 88% for cal analyser. User manual. Lingea Brno. chunks that partition the sentence. Verb chunk- K. Pala, P. Rychl~, and P. Smr~. 1997. DESAM ing for English was solved by Veenstra (Veen- - annotated corpus for Czech. In In Pldgil F., stra, 1999). He used memory-based learner Jeffery K.G.(eds.): Proceedings of SOFSEM'97, Timbl for noun phrase, verb phrase and propo- Milovy, Czech Republic. LNCS 1338, pages 60- 69. Springer. sitional phrase chunking.Each word in a sen- T. Pavelek and L. Popelinskp. 1999. Mining lemma tence was first assigned one of tags I_T (inside disambiguation rules from Czech corpora. In the phrase), O_T (outside the phrase) and B_T Principles of Knowledge Discovery in Databases: (left-most word in the phrase that is preceded Proceedings of PKDD '99 Conference, LNA11704, by another phrase of the same kind), where T pages 498-503. Springer. stands for the kind of a phrase. L. Popellnsk~, T. Pavelek, and T. Ptgt~nik. 1999. Towards disambiguation in Czech corpora. In Chunking in Czech language is more difficult Learning Language in Logic: Working Notes of than in English for two reasons. First, a gap LLL Workshop, pages 106-116. JSI Ljubljana. inside a verb group may be more complex and L. A. Ramshaw and M. P. Marcus. 1995. Text it may be even a whole sentence. Second, Czech chunking using transformation-based learning. In language is a free word-order language what im- Proceedings of the Third A CL Workshop on Very plies that the process of recognition of the verb Large Corpora. Association for Computational group structure is much more difficult. Linguistics. P. Smr~ and E. Z£~kov£. 1998. New tools for disam- 10 Conclusion biguation of Czech texts. In Text, Speech and Di- alogue: Proceedings of TSD'98 Workshop, pages We described the method for automatic recog- 129-134. Masaryk University, Brno. nition of compound verb groups in Czech sen- J. Veenstra. 1999. Memory-based text chunking. In tences. Recognition of compound verb groups Machine Learning in Human Language Technol- was tested on unannotated text randomly cho- ogy: Workshop at ACAI 99. sen from two different corpora and the accu- E. Zg~kov~ and K. Pala. 1999. Corpus-based racy reached 95% of correctly recognised verb rules for Czech verb discontinuous constituents. groups. We also introduced the method for au- In Text, Speech and Dialogue: Proceedings of tomatic tagging of compound verb groups. TSD'99 Workshop, LNAI 1692, pages 325-328. Springer. Acknowledgements E. Z~kov£ and L. Popelinskp. 2000. Automatic tagging of compound verb groups in Czech cor- We thank to anonymous referees for their com- pora. In Text, Speech and Dialogue: Proceedings ments. This research has been partially sup- of TSD'2000 Workshop, LNAL Springer. ported by the Czech Ministry of Education un- der the grant VS97028. References S. Abney. 1991. Parsing by chunks. In Principle- Based Parsing. Kluwer Academic Publishers. N. M. Marques, G. P. Lopes, and C. A. Coelho. 1998. Using loglinear clustering for subcatego- rization identification. In Principles of Data Min- ing and Knowledge Discovery: Proceedings of PKDD'98 Symposium, LNAI 1510, pages 379- 387. Springer.
225