A Machine Learning Approach to German Pronoun Resolution

Beata Kouchnir, Department of Computational Linguistics, Tübingen University, 72074 Tübingen, Germany, [email protected]

Abstract

This paper presents a novel ensemble learning approach to resolving German pronouns. Boosting, the method in question, combines the moderately accurate hypotheses of several classifiers to form a highly accurate one. Experiments show that this approach is superior to a single decision-tree classifier. Furthermore, we present a standalone system that resolves pronouns in unannotated text by using a fully automatic sequence of preprocessing modules that mimics the manual annotation process. Although the system performs well within a limited textual domain, further research is needed to make it effective for open-domain question answering and text summarisation.

1 Introduction

Automatic coreference resolution, pronominal and otherwise, has been a popular research area in Natural Language Processing for more than two decades, with extensive documentation of both the rule-based and the machine learning approach. For the latter, good results have been achieved with large feature sets (including syntactic, semantic, grammatical and morphological information) derived from hand-annotated corpora. However, for applications that work with plain text (e.g. question answering, text summarisation), this approach is not practical.

The system presented in this paper resolves German pronouns in free text by imitating the manual annotation process with off-the-shelf language software. As the availability and reliability of such software is limited, the system can use only a small number of features. The fact that most German pronouns are morphologically ambiguous poses an additional challenge.

The choice of boosting as the underlying machine learning algorithm is motivated both by its theoretical concept and by its performance on other NLP tasks. The fact that boosting uses the method of ensemble learning, i.e. combining the decisions of several classifiers, suggests that the combined hypothesis will be more accurate than one learned by a single classifier. On the practical side, boosting has distinguished itself by achieving good results with small feature sets.

2 Related Work

Although extensive research has been conducted on statistical anaphora resolution, the bulk of the work has concentrated on the English language. Nevertheless, comparing different strategies helped shape the system described in this paper.

(McCarthy and Lehnert, 1995) were among the first to use machine learning for coreference resolution. RESOLVE was trained on data from the MUC-5 English Joint Venture (EJV) corpus and used the C4.5 decision tree algorithm (Quinlan, 1993) with eight features, most of which were tailored to the joint venture domain. The system achieved an F-measure of 86.5 for full coreference resolution (no values were given for pronouns). Although a number this high must be attributed to the specific textual domain, RESOLVE also outperformed the authors' rule-based algorithm by 7.6 percentage points, which encouraged further research in this direction.

Unlike the other systems presented in this section, (Morton, 2000) does not use a decision tree algorithm but opts instead for a maximum entropy model. The model is trained on a subset of the Wall Street Journal, comprising 21 million tokens. The reported F-measure for pronoun resolution is 81.5. However, (Morton, 2000) only attempts to resolve singular pronouns, and there is no mention of what percentage of total pronouns is covered by this restriction.

(Soon et al., 2001) use the C4.5 algorithm with a set of 12 domain-independent features, ten syntactic and two semantic. Their system was trained on both the MUC-6 and the MUC-7 datasets, for which it achieved F-scores of 62.6 and 60.4, respectively. Although these results are far worse than the ones reported in (McCarthy and Lehnert, 1995), they are comparable to the best-performing rule-based systems in the respective competitions. Like (McCarthy and Lehnert, 1995), (Soon et al., 2001) do not report separate results for pronouns.

(Ng and Cardie, 2002) expanded on the work of (Soon et al., 2001) by adding 41 lexical, semantic and grammatical features. However, since using this many features proved to be detrimental to performance, all features that induced low-precision rules were discarded, leaving only 19. The final system outperformed that of (Soon et al., 2001), with F-scores of 69.1 and 63.4 for MUC-6 and MUC-7, respectively. For pronouns, the reported results are 74.6 and 57.8, respectively.

The experiment presented in (Strube et al., 2002) is one of the few dealing with the application of machine learning to German coreference resolution, covering definite noun phrases, proper names, and personal, possessive and demonstrative pronouns. The research is based on the Heidelberg Text Corpus (see Section 4), which makes it ideal for comparison with our system. (Strube et al., 2002) used 15 features modeled after those used by state-of-the-art resolution systems for English. The results for personal and possessive pronouns are 82.79 and 84.94, respectively.

3 Boosting

All of the systems described in the previous section use a single classifier to resolve coreference. Our intuition, however, is that a combination of classifiers is better suited for this task. The concept of ensemble learning (Dietterich, 2000) is based on the assumption that combining the hypotheses of several classifiers yields a hypothesis that is much more accurate than that of an individual classifier.

One of the most popular ensemble learning methods is boosting (Schapire, 2002). It is based on the observation that finding many weak hypotheses is easier than finding one strong hypothesis. This is achieved by running a base learning algorithm over several iterations. Initially, an importance weight is distributed uniformly among the training examples. After each iteration, the weight is redistributed, so that misclassified examples get higher weights. The base learner is thus forced to concentrate on difficult examples.

Although boosting has not yet been applied to coreference resolution, it has outperformed state-of-the-art systems for NLP tasks such as part-of-speech tagging and prepositional phrase attachment (Abney et al., 1999), word sense disambiguation (Escudero et al., 2000), and named entity recognition (Carreras et al., 2002).

The implementation used for this project is BoosTexter (Schapire and Singer, 2000), a toolkit freely available for research purposes. In addition to labels, BoosTexter assigns confidence weights that reflect the reliability of the decisions.

4 System Description

Our system resolves pronouns in three stages: preprocessing, classification, and postprocessing. Figure 1 gives an overview of the system architecture, while this section provides details of each component.

Figure 1: System Architecture

4.1 Training and Test Data

The system was trained with data from the Heidelberg Text Corpus (HTC), provided by the European Media Laboratory in Heidelberg, Germany.
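The iterative reweighting scheme described in Section 3 can be illustrated with a minimal AdaBoost-style sketch. This is our own illustration only, not the BoosTexter implementation used in the paper; the decision-stump representation and all function names here are hypothetical:

```python
import math

def stump_predict(stump, x):
    """A decision stump: +1 if the feature exceeds the threshold, else -1."""
    feature, threshold = stump
    return 1 if x[feature] > threshold else -1

def train_adaboost(X, y, stumps, rounds=10):
    """Minimal AdaBoost sketch over a pool of candidate stumps.
    X: list of feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n               # importance weights, initially uniform
    ensemble = []                   # (alpha, stump) pairs
    for _ in range(rounds):
        # choose the weak hypothesis with the lowest weighted error
        def weighted_error(s):
            return sum(wi for xi, yi, wi in zip(X, y, w)
                       if stump_predict(s, xi) != yi)
        stump = min(stumps, key=weighted_error)
        err = weighted_error(stump)
        if err >= 0.5:              # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, stump))
        if err == 0:                # perfect weak hypothesis: done
            break
        # redistribute weight so that misclassified examples count more
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote; the vote's magnitude plays the role of the
    confidence weight that BoosTexter attaches to its decisions."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

The exponential reweighting step is what forces later rounds to concentrate on the examples the current ensemble still misclassifies, which is the property motivating its use for the binary coreferent/non-coreferent decision.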

The HTC is a collection of 250 short texts (30-700 tokens) describing architecture, historical events and people associated with the city of Heidelberg. To examine its domain (in)dependence, the system was tested on 40 unseen HTC texts as well as on 25 articles from the Spiegel magazine, the topics of which include current events, science, arts and entertainment, and travel.

4.2 The MMAX Annotation Tool

The manual annotation of the training data was done with the MMAX (Multi-Modal Annotation in XML) annotation tool (Müller and Strube, 2001). The first step of coreference annotation is to identify the markables, i.e. noun phrases that refer to real-world entities. Each markable is annotated with the following attributes:

np form: proper noun, definite NP, indefinite NP, personal pronoun, possessive pronoun, or demonstrative pronoun.

grammatical role: subject, object (direct or indirect), or other.

agreement: this attribute is a combination of person, number and gender. The possible values are 1s, 1p, 2s, 2p, 3m, 3f, 3n, 3p.

semantic class: human, physical (includes animals), or abstract. When the semantic class is ambiguous, the "abstract" option is chosen.

type: if the entity that the markable refers to is new to the discourse, the value is "none". If the markable refers to an already mentioned entity, the value is "anaphoric". An anaphoric markable has another attribute for its relation to the antecedent. The values for this attribute are "direct", "pronominal", and "ISA" (hyponym-hyperonym).

To mark coreference, MMAX uses coreference sets, such that every new reference to an already mentioned entity is added to the set of that entity. Implicitly, there is a set for every entity in the discourse: if an entity occurs only once, its set contains one markable.

4.3 Feature Vector

The features used by our system are summarised in Table 1. The individual features for anaphor and antecedent (NP form, grammatical role, semantic class) are extracted directly from the annotation. The relational features are generated by comparing the individual ones. The binary target function (coreferent, non-coreferent) is determined by comparing the values of the member attribute. If both markables are members of the same set, they are coreferent; otherwise they are not.

Due to lack of resources, the semantic class attribute cannot be annotated automatically, and is therefore used only for comparison with (Strube et al., 2002).

Feature          Description
pron             the pronoun
ana npform       NP form of the anaphor
ana gramrole     grammatical role of the anaphor
ana agr          agreement of the anaphor
ana semclass*    semantic class of the anaphor
ante npform      NP form of the antecedent
ante gramrole    grammatical role of the antecedent
ante agr         agreement of the antecedent
ante semclass*   semantic class of the antecedent
dist             distance in markables between anaphor and antecedent (1 .. 20)
same agr         same agreement of anaphor and antecedent?
same gramrole    same grammatical role of anaphor and antecedent?
same semclass*   same semantic class of anaphor and antecedent?

Table 1: Features used by our system. *-ed features were only used for 10-fold cross-validation on the manually annotated data.

4.4 Noun Phrase Chunking, NER and POS-Tagging

To identify markables automatically, the system uses the noun phrase chunker described in (Schmid and Schulte im Walde, 2000), which displays case information along with the chunks. The chunker is based on a head-lexicalised probabilistic context-free grammar (H-L PCFG) and achieves an F-measure of 92 for range only and 83 for range and label, whereby the range of a noun chunk is defined as "all words from the beginning of the chunk to the head noun". This is different from manually annotated markables, which can be complex noun phrases.

Despite good overall performance, the chunker fails on multi-word proper names, in which case it marks each word as an individual chunk.¹ Since many pronouns refer to named entities, the chunker needs to be supplemented by a named entity recogniser. Although, to our knowledge, there currently does not exist an off-the-shelf named entity recogniser for German, we were able to obtain the system submitted by (Curran and Clark, 2003) to the 2003 CoNLL competition. In order to run the recogniser, the data needs to be tokenised, tagged and lemmatised, all of which is done by the TreeTagger (Schmid, 1995).

¹An example is [Verteidigungsminister Donald] [Rumsfeld] ([Minister of Defense Donald] [Rumsfeld]).

4.5 Markable Creation

After the markables are identified, they are automatically annotated with the attributes described in Section 4.2. The NP form can be reliably determined by examining the output of the noun chunker and the named entity recogniser. Pronouns and named entities are already labeled during chunking. The remaining markables are labelled as definite NPs if their first words are definite articles or possessive determiners, and as indefinite NPs otherwise. Grammatical role is determined by the case assigned to the markable: subject if nominative, object if accusative. Although datives and genitives can also be objects, they are more likely to be adjuncts and are therefore assigned the value "other".

For non-pronominal markables, agreement is determined by lexicon lookup of the head noun. Number ambiguities are resolved with the help of the case information. Most proper names, except for a few common ones, do not appear in the lexicon and have to remain ambiguous. Although it is impossible to fully resolve the agreement ambiguities of pronominal markables, they can be classified as either feminine/plural or masculine/neuter. Therefore we added two underspecified values to the agreement attribute: 3f 3p and 3m 3n. Each of these values was made to agree with both of its subvalues.

4.6 Antecedent Selection

After classification, one non-pronominal antecedent has to be found for each pronoun. As BoosTexter assigns confidence weights to its predictions, we have a choice between selecting the antecedent closest to the anaphor (closest-first) and the one with the highest weight (best-first). Furthermore, we have a choice between ignoring pronominal antecedents (and risking discarding all the correct antecedents within the window) and resolving them (and risking a multiplication of errors). In case all of the instances within the window have been classified as non-coreferent, we choose the negative instance with the lowest weight as the antecedent. The following section presents the results for each of the selection strategies.

5 Evaluation

Before evaluating the actual system, we compared the performance of boosting to that of C4.5, as reported in (Strube et al., 2002). Trained on the same corpus and evaluated with the 10-fold cross-validation method, boosting significantly outperforms C4.5 on both personal and possessive pronouns (see Table 2). These results support the intuition that ensemble methods are superior to single classifiers.

                        PPER   PPOS
(Strube et al., 2002)   82.8   84.9
our system              87.4   86.9

Table 2: Comparison of classification performance (F-score) with (Strube et al., 2002)

To put the performance of our system into perspective, we established a baseline and an upper bound for the task. The baseline chooses as the antecedent the closest non-pronominal markable that agrees in number and gender with the pronoun. The upper bound is the system's performance on the manually annotated (gold standard) data without the semantic features.

For the baseline, accuracy is significantly higher for the gold standard data than for the two test sets (see Table 3). This shows that agreement is the most important feature, which, if annotated correctly, resolves almost half of the pronouns. The classification results for the gold standard data, which are much lower than the ones in Table 2, also demonstrate the importance of the semantic features. As for the test sets, while the classifier significantly outperformed the baseline for the HTC set, it did nothing for the Spiegel set. This shows the limitations of an algorithm trained on overly restricted data.

Among the selection heuristics, the approach of ignoring pronominal antecedents proved consistently more effective than resolving them, while the results for the closest-first and best-first strategies were mixed. They imply, however, that the best-first approach should be chosen if the classifier performed above a certain threshold; otherwise the closest-first approach is safer.

Overall, the fact that 67.2% of the pronouns were correctly resolved in the automatically annotated HTC test set, while the upper bound is 82.0%, validates the approach taken for this system.

6 Conclusion and Future Work

The pronoun resolution system presented in this paper performs well for unannotated text of a limited domain. While the results are encouraging considering the knowledge-poor approach, experiments with a more complex textual domain show that the system is unsuitable for wide-coverage tasks such as question answering and summarisation.

To examine whether the system would yield comparable results in unrestricted text, it needs to be trained on a more diverse and possibly larger corpus. For this purpose, TüBa-D/Z, a treebank consisting of German newswire text, is presently being annotated with coreference information. As the syntactic annotation of the treebank is richer than that of the HTC corpus, additional features may be derived from it. Experiments with TüBa-D/Z will show whether the performance achieved for the HTC test set is scalable.

For future versions of the system, it might also

                                          HTC-Gold   HTC-Test   Spiegel
Baseline accuracy                         46.7%      30.9%      31.1%
Classification F-score                    77.9       62.8       30.4
Best-first, ignoring pronominal ant.      82.0%      67.2%      28.3%
Best-first, resolving pronominal ant.     72.2%      49.1%      21.7%
Closest-first, ignoring pronominal ant.   82.0%      57.3%      34.4%
Closest-first, resolving pronominal ant.  72.2%      49.1%      22.8%

Table 3: Accuracy of the different selection heuristics compared with baseline accuracy and classification F-score. HTC-Gold and HTC-Test stand for manually and automatically annotated test sets, respectively.

be beneficial to use full parses instead of chunks. As most German verbs are morphologically unambiguous, an analysis of them could help disambiguate pronouns. However, due to the relatively free word order of the German language, this approach requires extensive research.

References

Steven Abney, Robert E. Schapire, and Yoram Singer. 1999. Boosting applied to tagging and PP attachment. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of CoNLL-2002, pages 167-170, Taipei, Taiwan.

James R. Curran and Stephen Clark. 2003. Language-independent NER using a maximum entropy tagger. In Proceedings of CoNLL-2003, pages 164-167, Edmonton, Canada.

Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, pages 1-15. Springer, New York.

Gerard Escudero, Lluís Màrquez, and German Rigau. 2000. Boosting applied to word sense disambiguation. In Proceedings of the 12th European Conference on Machine Learning, pages 129-141.

Joseph F. McCarthy and Wendy G. Lehnert. 1995. Using decision trees for coreference resolution. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), pages 1050-1055, Montreal, Canada.

Thomas S. Morton. 2000. Coreference for NLP applications. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL'00), Hong Kong.

Christoph Müller and Michael Strube. 2001. Annotating anaphoric and bridging relations with MMAX. In Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, pages 90-95, Aalborg, Denmark.

Vincent Ng and Claire Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pages 104-111, Philadelphia, PA, USA.

J. Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Robert E. Schapire and Yoram Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168.

Robert E. Schapire. 2002. The boosting approach to machine learning: an overview. In Proceedings of the MSRI Workshop on Nonlinear Estimation and Classification.

Helmut Schmid and Sabine Schulte im Walde. 2000. Robust German noun chunking with a probabilistic context-free grammar. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 726-732, Saarbrücken, Germany.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop.

Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521-544.

Michael Strube, Stefan Rapp, and Christoph Müller. 2002. The influence of minimum edit distance on reference resolution. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP'02), pages 312-319, Philadelphia, PA, USA.