
In: Proceedings of CoNLL-2000 and LLL-2000, pages 219-225, Lisbon, Portugal, 2000.

Recognition and Tagging of Compound Verb Groups in Czech

Eva Žáčková and Luboš Popelínský and Miloš Nepil
NLP Laboratory, Faculty of Informatics, Masaryk University
Botanická 68, CZ-602 00 Brno, Czech Republic
{glum, popel, nepil}@fi.muni.cz

Abstract

In Czech corpora compound verb groups are usually tagged in a word-by-word manner. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. We present a method for automatic recognition of compound verb groups in Czech. From an annotated corpus, 126 definite clause grammar rules were constructed. These rules describe all compound verb groups that are frequent in Czech. Using those rules we can find compound verb groups in unannotated texts with an accuracy of 93%. Tagging compound verb groups in an annotated corpus exploiting the verb rules is also described.

Keywords: compound verb groups, chunking, morphosyntactic tagging, inductive logic programming

1 Compound Verb Groups

Recognition and analysis of the predicate in a sentence is fundamental for the meaning of the sentence and its further analysis. In more than half of Czech sentences the predicate contains a compound verb group. E.g. in the sentence Mrzí mě, že jsem o té konferenci nevěděla, byla bych se jí zúčastnila (literal translation: I am sorry that I did not know about the conference, I would have participated in it) there are three verb groups: Mrzí; jsem ... nevěděla; and byla bych se ... zúčastnila.

Verb groups are often split into more parts with so-called gap words. In the second verb group the gap words are o té konferenci (about the conference). In annotated Czech corpora, including DESAM (Pala et al., 1997), compound verb groups are usually tagged in a word-by-word manner. As a consequence, some of the morphological tags of particular components of the verb group lose their original meaning. It means that the tags are correct for a single word but they do not reflect the meaning of the words in context. In the above sentence the word jsem is tagged as a verb in present tense, but the whole verb group to which it belongs - jsem nevěděla - is in past tense. A similar situation appears in byla bych se jí zúčastnila (I would have participated in it), where zúčastnila is tagged as past tense while it is only a part of a past conditional. Without finding all parts of a compound verb group and without tagging the whole group (which necessarily depends on the other parts of the compound verb group) it is impossible to continue with any kind of semantic analysis.

We consider a compound verb group to be a list of verbs and possibly the reflexive pronoun se or si. Such a group is composed of auxiliary and full-meaning verbs, e.g. budu se umývat, where budu is an auxiliary (like will in English), se is the reflexive pronoun and umývat means to wash. As word-by-word tagging of verb groups is confusing, it is useful to find and assign a new tag to the whole group. This tag should contain information about the beginning and the end of the group and about the particular components of the verb group. It must also contain information about the relevant grammatical categories that characterise the verb group as a whole. In (Žáčková and Pala, 1999), a proposal of the method for automatic finding of compound verb groups in the corpus DESAM is introduced. We describe here an improved method that results in definite clause grammar rules - called verb rules - that contain information about all components of a particular verb group and about their tags. We also describe some improvements that allow us to increase the accuracy of verb group recognition.

The paper is organised as follows. Corpus DESAM is described in Section 2. Section 3 contains a description of the method for learning verb rules. Recognition of verb groups in unannotated text is discussed in Section 4. Improvements of this method are introduced in Section 5. In Section 6 we briefly show how the tag for the compound verb group is constructed employing verb rules. We conclude with a discussion (Section 7), a description of ongoing research (Section 8) and a summary of relevant works (Section 9).

2 Data Source

DESAM (Pala et al., 1997), the annotated and fully disambiguated corpus of Czech newspaper texts, has been used as the source of learning data. It contains more than 1 000 000 word positions, about 130 000 different word forms, about 65 000 of them occurring more than once, and 1 665 different tags. E.g. in Tab. 1 the tag k5eApFnStMmPaP of the word zúčastnila (participated) means: part of speech (k) = verb (5), person (p) = feminine (F), number (n) = singular (S) and tense (t) = past (M). Lemmata and possible tags carry distinguishing prefixes in the corpus (<l> in the case of lemmata). As pointed out in (Pala et al., 1997; Popelínský et al., 1999), DESAM is not large enough. It does not contain a representative set of Czech sentences yet. In addition, some words are tagged incorrectly and about 1/5 of the positions are untagged.

  form          lemma        tag
  Mrzí          mrzet        k5eAp3nStPmIaI
  mě            já           k3xPnSc24p1
  že            že           k8xS
  jsem          být          k5eAp1nStPmIaI
  o             o            k7c46
  té            ten          k3xDgFnSc6, k3xOgXnSc6p3
  konferenci    konference   k1gFnSc6
  nevěděla      vědět        k5eNpFnStMmPaI
  byla          být          k5eApFnStMmPaI
  bych          by           k5eAp1nStPmCaI
  se            se           k3xXnSc4
  jí            on           k3xPgFnSc2p3
  zúčastnila    zúčastnit    k5eApFnStMmPaP

  Table 1: Example of the disambiguated Czech sentence
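Each positional tag packs one value per grammatical category behind a one-letter category name (k5, eA, pF, ...). Purely as an illustration of this format - a sketch in Prolog (SWI-Prolog syntax) whose predicate names and tag-splitting convention are our own assumptions, not part of the authors' Perl implementation - such a tag can be unpacked into category-value pairs:

  % Split a DESAM-style positional tag into Category-Value pairs.
  % A category is a lowercase letter; its value is the following run of
  % digits and/or uppercase letters (so multi-character values such as
  % c24 are handled as well).
  decode_tag(TagAtom, Pairs) :-
      atom_chars(TagAtom, Chars),
      decode_chars(Chars, Pairs).

  decode_chars([], []).
  decode_chars([Cat|Rest], [Cat-Value|Pairs]) :-
      char_type(Cat, lower),
      value_chars(Rest, ValueChars, Rest1),
      atom_chars(Value, ValueChars),
      decode_chars(Rest1, Pairs).

  value_chars([C|Cs], [C|Vs], Rest) :-
      \+ char_type(C, lower), !,
      value_chars(Cs, Vs, Rest).
  value_chars(Cs, [], Cs).

  % ?- decode_tag(k5eApFnStMmPaP, Pairs).
  % Pairs = [k-'5', e-'A', p-'F', n-'S', t-'M', m-'P', a-'P'].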

3 Learning Verb Rules

The algorithm for learning verb rules (Žáčková and Pala, 1999) takes as its input annotated sentences from corpus DESAM. The algorithm is split into three steps: finding verb chunks (i.e. finding the boundaries of simple sentences in compound or complex sentences, and elimination of gap words), generalisation, and verb rule synthesis. These three steps are described below.

3.1 Verb Chunks

The observed properties of a verb group are the following: its components are either verbs or a reflexive pronoun se (si); the boundary of a verb group cannot be crossed by the boundary of a sentence; and between two components of the verb group there can be a gap consisting of an arbitrary number of non-verb words or even a whole sentence. In the first step, the boundaries of all sentences are found. Then each gap is replaced by the tag gap.

The method exploits only the lemma of each word (nominative singular for nouns, adjectives, pronouns and numerals, infinitive for verbs) and its tag. We will demonstrate the whole process using the third simplex sentence of the clause in Tab. 1, byla bych se jí zúčastnila (I would have participated in it):

  být/k5eApFnStMmPaI
  by/k5eAp1nStPmCaI
  se/k3xXnSc4
  on/k3xPgFnSc2p3
  zúčastnit/k5eApFnStMmPaP

After substitution of gaps we obtain

  být/k5eApFnStMmPaI
  by/k5eAp1nStPmCaI
  se/k3xXnSc4
  gap
  zúčastnit/k5eApFnStMmPaP

3.2 Generalisation

The lemmata and the tags are now generalised. Three generalisation operations are employed: elimination of (some of the) lemmata, generalisation of grammatical categories, and finding grammatical agreement constraints.

3.2.1 Elimination of lemmata

All lemmata except forms of the auxiliary verb být (to be) (být, by, aby, kdyby) are rejected. Lemmata of modal verbs and verbs with similar behaviour are replaced by the tag modal. These verbs have been found in the list of more than 15 000 verb valencies (Pala and Ševeček, 1999). In our example it is the verb zúčastnit that is removed.

  být/k5eApFnStMmPaI
  by/k5eAp1nStPmCaI
  k3xXnSc4
  gap
  k5eApFnStMmPaP

3.2.2 Generalisation of grammatical categories

Exploiting linguistic knowledge, several grammatical categories are not important for verb group description. Very often it is negation (e) or aspect (a - aI stands for imperfectum, aP for perfectum). These categories may be removed. For some verbs even person (p) can be removed. In our example the values of those grammatical categories have been replaced by ? and we obtain

  být/k5e?pFnStMmPa?
  by/k5e?p?nStPmCa?
  k3xXnSc?
  gap
  k5e?pFnStMmPa?

3.2.3 Finding grammatical agreement constraints

Another situation appears when two or more values of some category are related. In the simplest case they have to be the same - e.g. the value of the attribute person (p) in the first and the last word of our example. More complicated is the relation among the values of the attribute number (n). They should be the same except when the polite way of addressing occurs, e.g. in byl byste se jí zúčastnil (you would have participated in it). Thus we have to check whether the values are the same or the conditions of the polite way of addressing are satisfied. For this purpose we add the predicate check_num() that ensures agreement in the grammatical category number, and we obtain

  být/k5e?p_n_tMmPa?
  by/k5e?p?n_tPmCa?
  k3xXnSc?
  gap
  k5e?p_n_tMmPa?
  check_num(n)

3.3 DCG Rules Synthesis

Finally the verb rule is constructed by rewriting the result of the generalisation phase. For the sentence byla bych se jí zúčastnila (I would have participated in it) we obtain

  verb_group(vg(Be, Cond, Se, Verb), Gaps) -->
      be(Be, _, P, N, tM, mP, _),           % být/k5e?p_n_tMmPa?
      cond(Cond, _, _, Ncond, tP, mC, _),   % by/k5e?p?n_tPmCa?
      { check_num(N, Ncond, Cond, Vy) },
      reflex_pron(Se, xX, _, _),            % k3xXnSc?
      gap([], Gaps),                        % gap
      k5(Verb, _, _, P, N, tM, mP, _).      % k5e?p_n_tMmPa?

If this rule does not exist in the set of verb rules yet, it is added into it. The meanings of the non-terminals used in the rule are the following: be() represents the auxiliary verb být, cond() represents the various forms of conditionals by, aby, kdyby, reflex_pron() stands for the reflexive pronoun se (si), gap() is a special predicate for manipulation with gaps, and k5() stands for an arbitrary non-auxiliary verb. The particular values of some arguments of the non-terminals represent required properties. Simple cases of grammatical agreement are treated through binding of variables. More complicated situations are solved employing constraints like the predicate check_num().
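The non-terminals belong to the authors' grammar, which is not listed in full in the paper. Purely to illustrate how such a verb rule can be applied, the following self-contained sketch (SWI-Prolog) defines simplified stand-ins for them; the token representation w(Lemma, Tag) with Tag as a list of Category-Value pairs, the reduced arities, and the two-argument check_num/2 are our assumptions and not the authors' implementation, which was written in Perl.

  :- use_module(library(lists)).

  cat(Tag, Category, Value) :- member(Category-Value, Tag).

  % Simplified stand-ins for the non-terminals of the verb rule.
  be(w(L,T), P, N)    --> [w(L,T)], { L == 'být', cat(T,k,'5'), cat(T,t,'M'),
                                      cat(T,m,'P'), cat(T,p,P), cat(T,n,N) }.
  cond(w(L,T), N)     --> [w(L,T)], { member(L, [by, aby, kdyby]), cat(T,k,'5'),
                                      cat(T,t,'P'), cat(T,m,'C'), cat(T,n,N) }.
  reflex_pron(w(L,T)) --> [w(L,T)], { member(L, [se, si]), cat(T,k,'3'), cat(T,x,'X') }.
  gap([])             --> [].
  gap([W|Ws])         --> [W], { W = w(_,T), \+ cat(T,k,'5') }, gap(Ws).
  k5(w(L,T), P, N)    --> [w(L,T)], { cat(T,k,'5'), cat(T,t,'M'), cat(T,m,'P'),
                                      cat(T,p,P), cat(T,n,N) }.

  % Number agreement, allowing the polite form (singular participle with a
  % plural conditional auxiliary).
  check_num(N, N) :- !.
  check_num('S', 'P').

  verb_group(vg(Be, Cond, Se, Verb), Gaps) -->
      be(Be, P, N),
      cond(Cond, Ncond),
      { check_num(N, Ncond) },
      reflex_pron(Se),
      gap(Gaps),
      k5(Verb, P, N).

  % Applying the rule to the example clause byla bych se jí zúčastnila,
  % with tokens given as lemma-tag pairs:
  % ?- phrase(verb_group(VG, Gaps),
  %        [w('být',       [k-'5', e-'A', p-'F', n-'S', t-'M', m-'P', a-'I']),
  %         w(by,          [k-'5', e-'A', p-'1', n-'S', t-'P', m-'C', a-'I']),
  %         w(se,          [k-'3', x-'X', n-'S', c-'4']),
  %         w(on,          [k-'3', x-'P', g-'F', n-'S', c-'2', p-'3']),
  %         w('zúčastnit', [k-'5', e-'A', p-'F', n-'S', t-'M', m-'P', a-'P'])]).
  % VG = vg(w('být',_), w(by,_), w(se,_), w('zúčastnit',_)), Gaps = [w(on,_)].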

The method has been implemented in Perl. 126 definite clause grammar rules were constructed from the annotated corpus; they describe all verb groups that are frequent in Czech.

4 Recognition of Verb Groups

The verb rules have been used for recognition, and consequently for tagging, of verb groups in unannotated text. A portion of sentences which had not been used for learning was extracted from a corpus. Each sentence was ambiguously tagged with the LEMMA morphological analyser (Pala and Ševeček, 1995), i.e. each word of the sentence was assigned all possible tags. Then all the verb rules were applied to each sentence. The learned verb rules displayed quite good accuracy. For corpus DESAM, a verb rule was correctly assigned to 92.3% of the verb groups. We also tested how much this method depends on the corpus that was used for learning. As the source of testing data we used the Prague Tree Bank (PTB) corpus that is under construction at Charles University in Prague. The accuracy displayed was not different from the results for DESAM. It may be explained by the fact that both corpora have been built from newspaper articles.

Although the accuracy is acceptable for test data that also include clauses with just one verb, errors have been observed for complex sentences. In about 13% of them, some of the compound verb groups were not correctly recognized. It was observed that almost 70% of these errors were caused by incorrect lemma recognition. In the next section we describe a method for fixing this kind of errors.

5 Fixing Misclassification Errors

We combined two approaches: elimination of lemmata which are very rare for a given word form, and inductive logic programming (Popelínský et al., 1999; Pavelek and Popelínský, 1999). The method is used in the post-processing phase to prune the set of rules that have fired for a sentence.

5.1 Elimination of infrequent lemmata

In Czech corpora it was observed that 10% of word positions - i.e. each 10th word of a text - have at least 2 lemmata, and about 1% of word forms of the Czech vocabulary have at least 2 lemmata (Popelínský et al., 1999; Pavelek and Popelínský, 1999). E.g. the word form při can be either the preposition at (as in at the lesson) or the imperative of the verb to argue. We decided to remove all the verb rules that recognised a word-lemma couple of a very small frequency in the corpus. Actually, those lemmata that did not appear more than twice in the DESAM corpus were supposed to be incorrect.
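A minimal sketch of this pruning step follows (our own Prolog illustration; the fired-rule data structure and the frequency facts are hypothetical, and the authors' implementation was written in Perl):

  :- use_module(library(apply)).
  :- use_module(library(lists)).

  % Keep only those fired verb rules whose recognised word-lemma couples
  % all occur more than twice in the corpus.
  prune_fired_rules(Fired, Kept) :-
      include(frequent_couples, Fired, Kept).

  frequent_couples(fired(_RuleId, Couples)) :-
      forall(member(Word-Lemma, Couples),
             ( couple_count(Word, Lemma, Count), Count > 2 )).

  % Hypothetical frequency facts for the ambiguous word form "při"
  % mentioned above (the counts are invented for the example).
  couple_count('při', 'přít', 1).     % imperative of "přít (se)" (to argue)
  couple_count('při', 'při', 812).    % the preposition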
For testing, we randomly chose a set of 600 examples including compound or complex sentences from corpus DESAM. 251 sentences contained only one verb. The results obtained are in Tab. 2. The first line contains the number of examples used. The following line gives the results of the original method as described in Section 4. The next line (+ infrequent lemmata) displays the results after the word-lemma couples of very small frequency have been removed. The column '> 1 verb' concerns the sentences where at least two verbs appeared; the column 'all' displays the accuracy for all sentences. Results for corpus PTB are displayed in Tab. 3. It can be observed that after pruning the rules that contain a rare lemma the accuracy increased significantly.

                           correct rules (%)
                           > 1 verb     all
  number of examples          349       600
  original method             86.8      92.3
  + infrequent lemmata        91.1      94.8
  + ILP                       92.8      95.8

  Table 2: DESAM: Results for unannotated text

                           correct rules (%)
                           > 1 verb     all
  number of examples          284       467
  original method             87.0      92.1
  + infrequent lemmata        91.6      94.9
  + ILP                       92.6      95.5

  Table 3: PTB: Results for unannotated text

5.2 Lemma disambiguation by ILP

Some of the incorrectly recognised lemmata cannot be fixed by the method described above. E.g. the word form se has two lemmata: se, the reflexive pronoun, and s, the preposition with. Such ambiguous word-lemma couples are very frequent in Czech. For such cases we exploited inductive logic programming (ILP). The program reads the context of the lemma-ambiguous word and produces disambiguation rules (Popelínský et al., 1999; Pavelek and Popelínský, 1999). We employed the ILP system Aleph (http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html).

Domain knowledge predicates (Popelínský et al., 1999; Pavelek and Popelínský, 1999) have the form p(Context, first(N), Condition) or p(Context, somewhere, Condition), where Context contains the tags of either the left or the right context (for the left context in reverse order), first(N) defines a subset of Context, and somewhere does not narrow the context. The term Condition can have three forms: somewhere(List) (the values in List appear somewhere in the defined subset of Context), always(List) (the values appear in all tags in the given subset of Context) and n_times(N, Tag) (Tag may appear n times in the specified context). E.g. p(Left, first(2), always([k5, eA])) succeeds if the first two tags of the left context always contain k5, eA.
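Read this way, the domain knowledge predicate can be reconstructed as the following sketch (our Prolog reading of the description; the representation of a context as a list of tags, each tag a list of values such as [k5, eA, pF], follows the example above but is still an assumption):

  :- use_module(library(lists)).

  % The subset of the context addressed by the second argument of p/3.
  context_subset(Context, first(N), Sub) :-
      length(Sub, N),
      append(Sub, _, Context).
  context_subset(Context, somewhere, Context).

  % p(Context, Where, Condition)
  p(Context, Where, somewhere(Values)) :-        % Values occur in some tag
      context_subset(Context, Where, Sub),
      member(Tag, Sub),
      subset(Values, Tag).
  p(Context, Where, always(Values)) :-           % Values occur in every tag
      context_subset(Context, Where, Sub),
      Sub \== [],
      forall(member(Tag, Sub), subset(Values, Tag)).
  p(Context, Where, n_times(N, Value)) :-        % Value occurs in N tags
      context_subset(Context, Where, Sub),
      count_tags_with(Value, Sub, N).

  count_tags_with(_, [], 0).
  count_tags_with(Value, [Tag|Tags], N) :-
      count_tags_with(Value, Tags, N0),
      ( member(Value, Tag) -> N is N0 + 1 ; N = N0 ).

  % The example above: the first two tags of the (reversed) left context
  % must both contain k5 and eA.
  % ?- p([[k5,eA,pF,nS,tM,mP], [k5,eA,p1,nS,tP,mC], [k1,gF,nS,c6]],
  %      first(2), always([k5, eA])).
  % true.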

In the last line of Tab. 2 and Tab. 3 there is the percentage of correct rules when the lemma disambiguation has also been employed. The increase of accuracy was much smaller than after pruning the rules that contain a rare lemma. It has to be mentioned that in the case of PTB a portion of the errors was caused by incorrect recognition of sentence boundaries.

6 Tagging Verb Groups

We now describe a method for compound verb group tagging in morphologically annotated corpora. We decided to use an SGML-like notation for tagging because it allows the new tags to be incorporated very easily into the DESAM corpus. The beginning and the end of the whole verb group and the beginnings and ends of its particular components are marked: an opening and a closing mark point out the beginning and the end of the verb group, and further marks delimit its components (parts). For the sentence byla bych se jí zúčastnila (I would have participated in it) the components byla, bych, se and zúčastnila are marked, while the gap word jí is not.

The assigned tag - i.e. the values of the significant morphological categories - of the whole group is included as the value of an attribute called tag in the starting mark of the group. The value of the attribute fmverb is the full-meaning verb; this information can be exploited e.g. for searching and processing of verb valencies afterwards. The value of the attribute tag is computed automatically from the verb rule that describes the compound verb group.

We are also able to detect other properties of compound verb groups. In the example above the new category r is introduced. It indicates whether the group is reflexive (r1) or not (r0). The category v makes it possible to mark whether the group is in the form of the polite way of addressing (v1) or not (v0). The differences of the tag values can be observed by comparing the previous example with nebyl byste se jí zúčastnil (you would not have participated in it); a sketch of the notation for both is given below.
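One possible rendering of the notation just described is sketched below. The attribute names tag, fmverb, r and v are those introduced above, while the element names vg and vgp and the concrete attribute values shown here are only illustrative assumptions; the real value of tag is the one composed from the verb rule.

  <vg tag="..." fmverb="zúčastnit" r="r1" v="v0">
    <vgp>byla</vgp> <vgp>bych</vgp> <vgp>se</vgp> jí <vgp>zúčastnila</vgp>
  </vg>

  <vg tag="..." fmverb="zúčastnit" r="r1" v="v1">
    <vgp>nebyl</vgp> <vgp>byste</vgp> <vgp>se</vgp> jí <vgp>zúčastnil</vgp>
  </vg>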

The set of attributes can also be enriched, e.g. with the number of components. We also plan to include the type of the compound verb group among the attributes. It will make it possible to find groups of the same type but with a different word order or number of components.

7 Discussion

Compound verb groups are sometimes defined in a less general way. Another approach that deals with the recognition and morphological tagging of compound verb groups in Czech appeared in (Osolsobě, 1999). Basic compound verb groups in Czech, like the active present, the passive past tense, the present conditional etc., are defined there in terms of the grammatical categories used in the DESAM corpus. Two drawbacks of this approach can be observed. First, verb groups may only be composed of a reflexive pronoun, forms of the verb to be and not more than one full-meaning verb. Second, the gap between two words of a particular group cannot be longer than three words. The verb rules defined here are less general than the basic verb groups (Osolsobě, 1999); actually, the verb rules form a partition of them. Thus we can tag all these basic verb groups without the limitations mentioned above. In contrast to some other approaches, we also include in the groups some verbs which are in fact infinitive participants of verb valencies. However, we are able to detect such cases and recognize the "pure" verb groups afterwards. We believe that for some kinds of shallow semantic analysis - e.g. in dialogue systems - our approach is more convenient.

We are also able to recognize the form of the polite way of addressing a person (which has no equivalent in English, but a similar phenomenon appears e.g. in French or German). We extend the tag of a verb group with this information because it is quite important for understanding the sentence. E.g. in šel jste (vous êtes allé) the word jste (êtes) should be counted as singular although it is always tagged as plural.

8 Ongoing Research

Our work is a part of a project which aims at building a partial parser for Czech. The main idea of partial parsing (Abney, 1991) is to recognize those parts of sentences which can be recovered reliably and with a small amount of syntactic information. In this paper we deal with the recognition and tagging of potentially discontinuous verb chunks. The problem of recognition of noun chunks in Czech was addressed in (Smrž and Žáčková, 1998). We aim at an advanced method that should employ a minimum of ad hoc techniques and should be, ideally, fully automatic. The first step in this direction, the method for pruning verb rules, has been described in this paper. In the future we want to make the method even more adaptive. Some preliminary results on finding sentence boundaries are given below.

In Czech it is either a comma and/or a conjunction that makes the boundary between two sentences. From the corpus we randomly chose 406 pieces of text that contain a comma. In 294 cases the comma split two sentences. All "easy" cases (when a comma is followed by a conjunction, it must be a boundary) were removed; this concerned 155 out of the 294 cases. 80% of the remaining examples were used for learning, again with Aleph. On the test set the learned rules correctly recognised a comma as a delimiter of two sentences in 86.3% of cases. When the "easy" cases were added, the accuracy increased to 90.8%.

Then we tried to employ this method for automatic finding of boundaries in our system for verb group recognition. A decrease of accuracy was expected, but it was quite small. Even when some extra boundary was found (the particular comma did not split two sentences), the correct verb groups were still found in most of such cases. The reason is that such an incorrectly detected boundary splits a compound verb group very rarely.

The last experiment concerned the case when a conjunction splits two sentences and the conjunction is not preceded by a comma. There are four such conjunctions in Czech - a (and), nebo (or), i (even) and ani (neither, nor). Using Aleph we obtained an accuracy on test data of 88.3% for a (500 examples, 90% used for learning) and 87.2% for nebo (110 examples). The last two conjunctions split sentences very rarely; actually, in the current version of the DESAM corpus it has never happened.
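The "easy case" test mentioned above can be written down directly; the sketch below (our own formulation, using the token and tag representation of the earlier sketches) treats a comma immediately followed by a word tagged as a conjunction (k8) as a sentence boundary, so that only the remaining commas have to be classified by the learned rules.

  :- use_module(library(lists)).

  % Succeeds when the token window starting at a comma begins a new
  % sentence according to the "easy" rule: the next word is a conjunction.
  easy_boundary([w(',', _), w(_NextWord, NextTag) | _Rest]) :-
      member(k-'8', NextTag).

  % ?- easy_boundary([w(',', []), w('že', [k-'8', x-'S'])]).
  % true.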

9 Relevant works

Another approach to the recognition of compound verb groups in Czech (Osolsobě, 1999) has already been discussed in Section 7. Ramshaw and Marcus (Ramshaw and Marcus, 1995) view chunking as a tagging problem. They used transformation-based learning and achieved recall and precision rates of 93% for base noun phrases (non-recursive noun phrases) and 88% for chunks that partition the sentence. Verb chunking for English was addressed by Veenstra (Veenstra, 1999). He used the memory-based learner Timbl for noun phrase, verb phrase and prepositional phrase chunking. Each word in a sentence was first assigned one of the tags I_T (inside the phrase), O_T (outside the phrase) and B_T (left-most word in a phrase that is preceded by another phrase of the same kind), where T stands for the kind of phrase.

Chunking in Czech is more difficult than in English for two reasons. First, a gap inside a verb group may be more complex and it may even be a whole sentence. Second, Czech is a free word-order language, which implies that the process of recognition of the verb group structure is much more difficult.

10 Conclusion

We described a method for automatic recognition of compound verb groups in Czech sentences. Recognition of compound verb groups was tested on unannotated text randomly chosen from two different corpora and the accuracy reached 95% of correctly recognised verb groups. We also introduced a method for automatic tagging of compound verb groups.

Acknowledgements

We thank the anonymous referees for their comments. This research has been partially supported by the Czech Ministry of Education under the grant VS97028.

References

S. Abney. 1991. Parsing by chunks. In Principle-Based Parsing. Kluwer Academic Publishers.

N. M. Marques, G. P. Lopes, and C. A. Coelho. 1998. Using loglinear clustering for subcategorization identification. In Principles of Data Mining and Knowledge Discovery: Proceedings of PKDD'98 Symposium, LNAI 1510, pages 379-387. Springer.

K. Osolsobě. 1999. Morphological tagging of composed verb forms in Czech corpus. Technical report, Studia Minora Facultatis Philosophicae Universitatis Brunensis, Brno.

K. Pala and P. Ševeček. 1999. Valencies of Czech Verbs. Technical Report A45, Studia Minora Facultatis Philosophicae Universitatis Brunensis, Brno.

K. Pala and P. Ševeček. 1995. Lemma morphological analyser. User manual. Lingea, Brno.

K. Pala, P. Rychlý, and P. Smrž. 1997. DESAM - annotated corpus for Czech. In Plášil F., Jeffery K. G. (eds.): Proceedings of SOFSEM'97, Milovy, Czech Republic, LNCS 1338, pages 60-69. Springer.

T. Pavelek and L. Popelínský. 1999. Mining lemma disambiguation rules from Czech corpora. In Principles of Knowledge Discovery in Databases: Proceedings of PKDD'99 Conference, LNAI 1704, pages 498-503. Springer.

L. Popelínský, T. Pavelek, and T. Ptáčník. 1999. Towards disambiguation in Czech corpora. In Learning Language in Logic: Working Notes of the LLL Workshop, pages 106-116. JSI, Ljubljana.

L. A. Ramshaw and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora. Association for Computational Linguistics.

P. Smrž and E. Žáčková. 1998. New tools for disambiguation of Czech texts. In Text, Speech and Dialogue: Proceedings of TSD'98 Workshop, pages 129-134. Masaryk University, Brno.

J. Veenstra. 1999. Memory-based text chunking. In Machine Learning in Human Language Technology: Workshop at ACAI 99.

E. Žáčková and K. Pala. 1999. Corpus-based rules for Czech verb discontinuous constituents. In Text, Speech and Dialogue: Proceedings of TSD'99 Workshop, LNAI 1692, pages 325-328. Springer.

E. Žáčková and L. Popelínský. 2000. Automatic tagging of compound verb groups in Czech corpora. In Text, Speech and Dialogue: Proceedings of TSD'2000 Workshop, LNAI. Springer.
