
A Lemma-Based Approach to a Maximum Entropy Sense Disambiguation System for Dutch

Tanja Gaustad
Alfa-Informatica, University of Groningen
Postbus 716, 9700 AS Groningen, The Netherlands
[email protected]

Abstract

In this paper, we present a corpus-based supervised word sense disambiguation (WSD) system for Dutch which combines statistical classification (maximum entropy) with linguistic information. Instead of building individual classifiers per ambiguous wordform, we introduce a lemma-based approach. The advantage of this novel method is that it clusters all inflected forms of an ambiguous word in one classifier, therefore augmenting the training material available to the algorithm. Testing the lemma-based model on the Dutch SENSEVAL-2 test data, we achieve a significant increase in accuracy over the wordform model. Also, the WSD system based on lemmas is smaller and more robust.

1 Introduction: WSD for Dutch

A major problem in natural language processing (NLP) for which no satisfactory solution has been found to date is word sense disambiguation (WSD). WSD refers to the resolution of lexical semantic ambiguity, and its goal is to attribute the correct sense(s) to words in a certain context. For instance, machine translation, information retrieval or document extraction could all benefit from the accurate disambiguation of word senses.

The WSD system for Dutch[1] presented here is a corpus-based supervised algorithm combining statistical classification with various kinds of linguistic information. The intuition behind the system is that linguistic information is beneficial for WSD, which means that it will improve results over purely statistical approaches. The linguistic information includes lemmas, part-of-speech (PoS), and the context around the ambiguous word.

In this paper, we focus on a lemma-based approach to WSD for Dutch. So far, systems have built individual classifiers for each ambiguous wordform (Hendrickx et al., 2002; Hoste et al., 2002). In the system presented here, the classifiers built for each ambiguous word are based on its lemma instead. Lemmatization allows for more compact and generalizable data by clustering all inflected forms of an ambiguous word together, an effect already commented on by Yarowsky (1994). The more inflection in a language, the more lemmatization will help to compress and generalize the data. In the case of our WSD system this means that fewer classifiers have to be built, thereby increasing the training material available to the algorithm for each ambiguous wordform. Accuracy is expected to increase for the lemma-based model in comparison to the wordform model.

The paper is structured as follows: First, we present the dictionary-based lemmatizer for Dutch which was used to lemmatize the data, followed by a detailed explanation of the lemma-based approach adopted in our WSD system. Next, the statistical classification algorithm, namely maximum entropy, and Gaussian priors (used for smoothing purposes) are introduced. We then proceed to describe the corpus, the corpus preparation, and the system settings. We conclude the paper with results on the Dutch SENSEVAL-2 data, their evaluation, and ideas for future work.

[1] The interest in Dutch lies grounded in the fact that we are working in the context of a project concerned with developing NLP tools for Dutch (see http://www.let.rug.nl/~vannoord/alp).

2 Dictionary-Based Lemmatizer for Dutch

Statistical classification systems, like our WSD system, determine the most likely class for a given instance by computing how likely the words or linguistic features in the instance are for any given class. Estimating these probabilities is difficult, as corpora contain lots of different, often infrequent, words. Lemmatization[2] is a method that can be used to reduce the number of wordforms that need to be taken into consideration, as estimation is more reliable for frequently occurring data.

[2] We chose to use lemmatization and not stemming because the lemma (or canonical dictionary entry form) can be used to look up an ambiguous word in a dictionary or an ontology like e.g. WordNet. This is not the case for a stem.

Lemmatization reduces all inflected forms of a word to the same lemma. The number of different lemmas in a training corpus will therefore in general be much smaller than the number of different wordforms, and the frequency of lemmas will therefore be higher than that of the corresponding individual inflected forms, which in turn suggests that probabilities can be estimated more reliably.

For the experiments in this paper, we used a lemmatizer for Dutch with dictionary lookup. Dictionary information is obtained from Celex (Baayen et al., 1993), a lexical database for Dutch. Celex contains 381,292 wordforms and 124,136 lemmas for Dutch. It also contains the PoS associated with the lemmas. This information is useful for disambiguation: in those cases where a particular wordform has two (or more) possible corresponding lemmas, the one matching the PoS of the wordform is chosen. Thus, in a first step, information about wordforms, their respective lemmas and their PoS is extracted from the database.

Dictionary lookup can be time consuming, especially for large dictionaries such as Celex. To guarantee fast lookup and a compact representation, the information extracted from the dictionary is stored as a finite state automaton (FSA) using Daciuk's (2000) FSA tools.[3] Given a wordform, the compiled automaton provides the corresponding lemmas in time linear in the length of the input word. Contrasting this dictionary-based lemmatizer with a simple suffix stripper, such as the Dutch Porter Stemmer (Kraaij and Pohlman, 1994), our lemmatizer is more accurate, faster and more compact (see (Gaustad and Bouma, 2002) for a more elaborate description and evaluation).

[3] Available at http://www.eti.pg.gda.pl/~jandac/fsa.html

During the actual lemmatization procedure, the FSA encoding of the information in Celex assigns every wordform all its possible lemmas. For ambiguous wordforms, the lemma with the same PoS as the wordform in question is chosen. All wordforms that were not found in Celex are processed with a morphological guessing automaton.[4]

[4] Also available from the FSA morphology tools (Daciuk, 2000).

The key features of the lemmatizer employed are that it is fast, compact and accurate; a small illustration of the lookup step follows below.
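To make the lookup-and-choose step concrete, here is a minimal sketch in Python, assuming a plain dictionary in place of the compiled automaton; the entries and the fallback behaviour are illustrative, not the actual interface of the FSA tools.

    # Minimal sketch of the lemmatizer's lookup step. A plain dict stands
    # in for the compiled FSA, and the two entries are illustrative; the
    # real system uses Daciuk's FSA tools plus a morphological guesser.

    # wordform -> list of (lemma, PoS) pairs, as extracted from Celex
    LEXICON = {
        "boog": [("buigen", "V"), ("boog", "N")],
        "bloemen": [("bloem", "N")],
    }

    def lemmatize(wordform, pos):
        """Return the lemma whose PoS matches the PoS-tagged wordform."""
        candidates = LEXICON.get(wordform)
        if candidates is None:
            # Unknown wordform: the real system consults a morphological
            # guessing automaton here; we simply fall back to the wordform.
            return wordform
        for lemma, lemma_pos in candidates:
            if lemma_pos == pos:
                return lemma
        return candidates[0][0]  # no PoS match: take the first candidate

    print(lemmatize("boog", "V"))   # buigen (past tense of 'to bend')
    print(lemmatize("boog", "N"))   # boog ('arch')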

3 Lemma-Based Approach

As we have mentioned in the previous section, lemmatization collapses all inflected forms of a given word to the same lemma. In our system, separate classifiers are built for every ambiguous wordform. Normally, this implies that the basis for grouping occurrences of particular ambiguous words together is that their wordform is the same. Alternatively, we chose a model constructing classifiers based on lemmas, therefore reducing the number of classifiers that need to be made.

As has already been noted by Yarowsky (1994), using lemmas helps to produce more concise and generic evidence than inflected forms. Therefore, building classifiers based on lemmas increases the data available to each classifier. We make use of the advantage of clustering all instances of e.g. one verb in a single classifier instead of several classifiers (one for each inflected form found in the data). In this way, there is more training data per ambiguous wordform available to each classifier. The expectation is that this should increase the accuracy of our maximum entropy WSD system in comparison to the wordform-based model.

Figure 1 shows how the system works (a code sketch of the same procedure follows below). During training, every wordform is first checked for ambiguity, i.e. whether it has more than one sense associated with all its occurrences. If the wordform is ambiguous, the number of lemmas associated with it is looked up. If the wordform has one lemma, all occurrences of this lemma in the training data are used to make the classifier for that particular wordform, and for others with the same lemma. If a wordform has more than one lemma, a classifier based on the wordform is built. This strategy has been decided on in order to be able to treat all ambiguous words, notwithstanding lemmatization errors or wordforms that can genuinely be assigned two or more lemmas. An example of a word that has two different lemmas depending on the context is boog: it can either be the past tense of the verb buigen ('to bend') or the noun boog ('arch'). Since the Dutch SENSEVAL-2 data is not only ambiguous with regard to meaning but also with regard to PoS, both lemmas are subsumed in the wordform classifier for boog.

During testing, we check for each word whether there is a classifier available for either its wordform or its lemma and apply that classifier to the test instance.

[Figure 1, not reproduced here: a flowchart in which non-ambiguous wordforms (one sense) need no classifier, ambiguous wordforms with one lemma are handled by the lemma model, and those with several lemmas by the wordform model.]
Figure 1: Schematic overview of the lemma-based approach for our WSD System for Dutch
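The decision procedure of Figure 1 can be summarized in a few lines. The following is a runnable sketch: the toy corpus summary and the train() stub stand in for the real data and the maximum entropy learner, and the sense labels are invented for illustration.

    # Sketch of the classifier-building strategy from Figure 1. The corpus
    # summary below is a toy stand-in; train() stubs out the maxent learner.

    CORPUS = {
        # wordform: (senses attested for it, candidate lemmas)
        "bloemen": ({"flower", "flour"}, {"bloem"}),
        "bloem":   ({"flower", "flour"}, {"bloem"}),
        "boog":    ({"bend_V", "arch_N"}, {"buigen", "boog"}),
        "fiets":   ({"bicycle"}, {"fiets"}),
    }

    def train(key):
        return "classifier(%s=%s)" % key  # stub for the maxent learner

    def build_classifiers(corpus):
        classifiers = {}
        for wordform, (senses, lemmas) in corpus.items():
            if len(senses) < 2:
                continue                    # unambiguous: nothing to build
            if len(lemmas) == 1:            # one lemma: pool all its forms
                key = ("lemma", next(iter(lemmas)))
            else:                           # several lemmas, e.g. 'boog'
                key = ("wordform", wordform)
            classifiers.setdefault(key, train(key))
        return classifiers

    print(build_classifiers(CORPUS))
    # {('lemma', 'bloem'): ..., ('wordform', 'boog'): ...}

At test time, the system then looks the word up under both keys; the paper does not specify in which order wordform and lemma classifiers are consulted, so that detail is left open here.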

4 Maximum Entropy Word Sense Disambiguation System

Our WSD system is founded on the idea of combining statistical classification with linguistic sources of knowledge. In order to be able to take full advantage of the linguistic information, we need a classification algorithm capable of incorporating the information provided. The main advantage of maximum entropy modeling is that heterogeneous and overlapping information can be integrated into a single statistical model. Other learning algorithms, like e.g. decision lists, only take the strongest feature into account, whereas maximum entropy combines them all. Also, no independence assumptions as in e.g. Naive Bayes are necessary.

We will now describe the different steps in putting together the WSD system we used to incorporate and test our lemma-based approach, starting with the introduction of maximum entropy, the machine learning algorithm used for classification. Then, smoothing with Gaussian priors will be explained.

4.1 Maximum Entropy Classification

Several problems in NLP have lent themselves to solutions using statistical language processing techniques. Many of these problems can be viewed as a classification task in which linguistic classes have to be predicted given a context.

The statistical classifier used in the experiments reported in this paper is a maximum entropy classifier (Berger et al., 1996; Ratnaparkhi, 1997b). Maximum entropy is a general technique for estimating probability distributions from data. A probability distribution is derived from a set of events based on the computable qualities (characteristics) of these events. The characteristics are called features, and the events are sets of feature values.

If nothing about the data is known, estimating a probability distribution using the principle of maximum entropy involves selecting the most uniform distribution where all events have equal probability. In other words, it means selecting the distribution which maximises the entropy.

If data is available, a number of features extracted from the labeled training data are used to derive a set of constraints for the model. This set of constraints characterises the class-specific expectations for the distribution. So, while the distribution should maximise the entropy, the model should also satisfy the constraints imposed by the training data. A maximum entropy model is thus the model with maximum entropy of all models that satisfy the set of constraints derived from the training data.

The model consists of a set of features which occur on events in the training data. Training itself amounts to finding weights for each feature using the following formula:

    p(c \mid x) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{n} \lambda_i f_i(x, c) \Big)

where the property function f_i(x, c) represents the number of times feature i is used to find class c for event x, and the weights λ_i are chosen to maximise the likelihood of the training data and, at the same time, maximise the entropy of p. Z is a normalizing constant, constraining the distribution to sum to 1, and n is the total number of features.

This means that during training the weight λ_i for each feature i is computed and stored. During testing, the sum of the weights λ_i of all features i found in the test instance is computed for each class c, and the class with the highest score is chosen.

A big advantage of maximum entropy modeling is that the features can include any information which might be useful for disambiguation. Thus, dissimilar types of information, such as various kinds of linguistic knowledge, can be combined into a single model for WSD without having to assume independence of the different features. Furthermore, good results have been produced in other areas of NLP research using maximum entropy techniques (Berger et al., 1996; Koeling, 2001; Ratnaparkhi, 1997a).
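To make the test-time procedure concrete, here is a minimal sketch in Python; the feature names and weight values are invented for illustration.

    # Test-time scoring for a conditional maxent model:
    # p(c|x) is proportional to exp(sum of weights of active features).
    import math

    WEIGHTS = {  # (feature, class) -> lambda learned in training
        ("lemma=bloem", "flower"): 1.2,
        ("left1=plukken", "flower"): 0.8,
        ("lemma=bloem", "flour"): 0.9,
        ("left1=kneden", "flour"): 1.1,
    }

    def classify(active_features, classes):
        scores = {c: math.exp(sum(WEIGHTS.get((f, c), 0.0)
                                  for f in active_features))
                  for c in classes}
        z = sum(scores.values())          # normalizing constant Z
        return {c: s / z for c, s in scores.items()}

    print(classify({"lemma=bloem", "left1=plukken"}, ["flower", "flour"]))
    # 'flower' wins: its active features carry more weight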

4.2 Smoothing: Gaussian Priors

Since NLP maximum entropy models usually have lots of features and lots of sparseness (e.g. features seen in testing not occurring in training), smoothing is essential as a way to optimize the feature weights (Chen and Rosenfeld, 2000; Klein and Manning, 2003). In the case of the Dutch SENSEVAL-2 data, little training data is available for many ambiguous words, therefore making smoothing essential.

The intuition behind Gaussian priors is that the parameters in the maximum entropy model should not be too large because of optimization problems with infinite feature weights. In other words: we enforce that each parameter will be distributed according to a Gaussian prior with mean µ and variance σ². This prior expectation over the distribution of parameters penalizes parameters for drifting too far from their mean prior value, which is µ = 0.

Using Gaussian priors has a number of effects on the maximum entropy model. We trade off some expectation-matching for smaller parameters. Also, when multiple features can be used to explain a data point, the more common ones generally receive more weight. Last but not least, accuracy generally goes up and convergence is faster.

In the current experiments the Gaussian prior was set to σ² = 1000 (based on preliminary experiments), which led to an overall increase of at least 0.5% when compared to a model which was built without smoothing.
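Concretely, a Gaussian prior with mean µ = 0 corresponds to adding a quadratic penalty on the weights to the training objective. The paper does not spell the objective out; a standard formulation (cf. Chen and Rosenfeld, 2000) would be:

    % Penalized log-likelihood under a Gaussian prior (standard form,
    % not given explicitly in the paper):
    \ell(\lambda) = \sum_{(x,c)} \log p_{\lambda}(c \mid x)
                    - \sum_{i=1}^{n} \frac{(\lambda_i - \mu)^2}{2\sigma^2},
    \qquad \mu = 0

Maximizing this penalized log-likelihood shrinks each λ_i towards 0; the larger σ², the weaker the shrinkage.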
5 Corpus Preparation and Building Classifiers

In the context of SENSEVAL-2[5], the first sense-tagged corpus for Dutch was made available (see (Hendrickx and van den Bosch, 2001) for a detailed description). The training section of the Dutch SENSEVAL-2 dataset contains approximately 120,000 tokens and 9,300 sentences, whereas the test section consists of ca. 40,000 tokens and 3,000 sentences.

[5] See http://www.senseval.org/ for more information on SENSEVAL and for downloads of the data.

In contrast to the English WSD data available from SENSEVAL-2, the Dutch WSD data is not only ambiguous in word senses, but also with regard to PoS. This means that accurate PoS information is important in order for the WSD system to accurately achieve morpho-syntactic as well as semantic disambiguation.

First, the corpus is lemmatized (see section 2) and part-of-speech tagged. We used the Memory-Based tagger MBT (Daelemans et al., 2002a; Daelemans et al., 2002b) with the (limited) WOTAN tag set (Berghmans, 1994; Drenth, 1997) to PoS tag our data (see (Gaustad, 2003) for an evaluation of different PoS-taggers on this task). Since we are only interested in the main PoS categories, we discarded all additional information from the assigned PoS. This resulted in 12 different tags being kept. In the current experiments, we included the PoS of the ambiguous wordform (important for the morpho-syntactic disambiguation) and also the PoS of the context words or lemmas.

After the preprocessing (lemmatization and PoS tagging), for each ambiguous wordform[6] all instances of its occurrence are extracted from the corpus. These instances are then transformed into feature vectors including the features specified in a particular model. The model we used in the reported experiments includes information on the wordform, its lemma, its PoS, context words to the left and right as well as the context PoS, and its sense/class.

[6] A wordform is 'ambiguous' if it has two or more different senses/classes in the training data. The sense '=' is seen as marking the basic sense of a word/lemma and is therefore also taken into account.

(1) Nu  ging hij bloemen plukken en  maakte er een krans van.
    now went he  flowers pick    and made   it a   crown of
    'Now he went to pick flowers and made a crown of it.'

Below we show an example of a feature vector for the ambiguous word bloem ('flower'/'flour') in sentence (1):

    bloemen bloem N nu gaan hij Adv V Pron plukken en maken V Conj V bloem_plant

The first slot represents the ambiguous wordform, the second its lemma, the third the PoS of the ambiguous wordform, the fourth to fifteenth slots contain the context lemmas and their PoS (left before right), and the last slot represents the sense or class. Various preliminary experiments have shown a context size of 3 context words, i.e. 3 words to the left and 3 words to the right of the ambiguous word, to achieve the best and most stable results. Only context words within the same sentence as the ambiguous wordform were taken into account.
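As an illustration, a sketch of the vector construction, assuming tokens come as (wordform, lemma, PoS) triples; the behaviour at sentence edges (simple truncation) is our assumption:

    # Sketch of turning a tagged, lemmatized sentence into the feature
    # vector shown above (3 context lemmas + PoS on each side). During
    # training, the sense label would be appended as the final slot.

    def feature_vector(tokens, i, window=3):
        """tokens: list of (wordform, lemma, pos); i: ambiguous word index."""
        wordform, lemma, pos = tokens[i]
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        vec = [wordform, lemma, pos]
        vec += [lem for _, lem, _ in left] + [p for _, _, p in left]
        vec += [lem for _, lem, _ in right] + [p for _, _, p in right]
        return vec

    sent = [("nu", "nu", "Adv"), ("ging", "gaan", "V"),
            ("hij", "hij", "Pron"), ("bloemen", "bloem", "N"),
            ("plukken", "plukken", "V"), ("en", "en", "Conj"),
            ("maakte", "maken", "V")]
    print(feature_vector(sent, 3))
    # ['bloemen', 'bloem', 'N', 'nu', 'gaan', 'hij', 'Adv', 'V', 'Pron',
    #  'plukken', 'en', 'maken', 'V', 'Conj', 'V']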
Earlier experiments showed that using lemmas as context instead of wordforms increases accuracy due to the compression achieved through lemmatization (as explained earlier in this paper and put to practice in the lemma-based approach). With lemmas, fewer context features have to be estimated, therefore counteracting data sparseness.

In the experiments presented here, no frequency threshold was used. Experiments have shown that building classifiers even for wordforms with very few training instances yields better results than applying a frequency threshold and using the baseline count (assigning the most frequent sense) for wordforms with an amount of training instances below the threshold. It has to be noted, though, that the effect of applying a threshold may depend on the choice of learning algorithm.

6 Results and Evaluation

In order to be able to evaluate the results from the lemma-based approach, we also include results based on wordform classifiers. During training with wordform classifiers, 953 separate classifiers were built.

With the lemma-based approach, 669 classifiers were built in total during training, 372 based on the lemma of an ambiguous word (subsuming 656 wordforms) and 297 based on the wordform. A total of 512 unique ambiguous wordforms was found in the test data. 438 of these were classified using the classifiers built from the training data, whereas only 410 could be classified using the wordform model (see table 1 for an overview).

We include the accuracy of the WSD system on all words for which classifiers were built (ambig) as well as the overall performance on all words (all), including the non-ambiguous ones. This makes our results comparable to other systems which use the same data, but maybe a different data split or a different number of classifiers (e.g. in connection with a frequency threshold applied). The baseline has been computed by always choosing the most frequent sense of a given wordform in the test data.

The results in table 2 show the average accuracy for the two different approaches. The accuracy of both approaches improves significantly over the baseline (when applying a paired sign test with a confidence level of 95%; a sketch of this test follows below). This demonstrates that the general idea of the system, to combine linguistic features with statistical classification, works well. Focusing on a comparison of the two approaches, we can clearly see that the lemma-based approach works significantly better than the wordform-only model, thereby verifying our hypothesis.

Another advantage of the approach proposed, besides increasing the classification accuracy, is that fewer classifiers need to be built during training, and therefore the WSD system based on lemmas is smaller. In an online application, this might be an important aspect of the speed and the size of the application. It should be noted here that the degree of generalization through lemmatization strongly depends on the data. Only inflected wordforms occurring in the corpus are subsumed in one lemma classifier. The more different inflected forms the training corpus contains, the better the "compression rate" in the WSD model. Added robustness is a further asset of our system. More wordforms could be classified with the lemma-based approach compared to the wordform-based one (438 vs. 410).

In order to better assess the real gain in accuracy from the lemma-based model, we also evaluated a subpart of the results for the lemma-based and the wordform-based model, namely the accuracy of those wordforms which were classified based on their lemma in the former approach, but based on their wordform in the latter case. The comparison in table 3 clearly shows that there is much to be gained from lemmatization. The fact that inflected wordforms are subsumed in lemma classifiers leads to an error rate reduction of 8% and a system with less than half as many classifiers.

In table 4, we see a comparison with another WSD system for Dutch which uses Memory-Based learning (MBL) in combination with local context (Hendrickx et al., 2002). A big difference with the system presented in this article is that extensive parameter optimization for the classifier of each ambiguous wordform has been conducted for the MBL approach. Also, a frequency threshold of minimally 10 training instances was applied, using the baseline classifier for all words below that threshold. As we can see, our lemma-based WSD system scores the same as the Memory-Based WSD system, without extensive "per classifier" parameter optimization. According to Daelemans and Hoste (2002), different machine learning results should be compared once all parameters have been optimized for all classifiers. This is not the case in our system, and yet it achieves the same accuracy as an optimized model. Optimization of parameters for each ambiguous wordform and lemma classifier might help increase our results even further.
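For reference, a paired sign test over two systems' per-instance outcomes can be sketched as follows; the toy outcome vectors are illustrative only. Wins and losses are the instances where exactly one system is correct; ties are discarded.

    # Paired sign test: test whether the win/loss split between two
    # systems deviates from 50/50 (two-sided, alpha = 0.05).
    from math import comb

    def sign_test(correct_a, correct_b):
        """correct_a/b: per-instance booleans for systems A and B."""
        wins = sum(a and not b for a, b in zip(correct_a, correct_b))
        losses = sum(b and not a for a, b in zip(correct_a, correct_b))
        n, k = wins + losses, min(wins, losses)
        # two-sided binomial tail under H0: P(win) = 0.5
        p = 2 * sum(comb(n, j) for j in range(k + 1)) / 2 ** n
        return min(p, 1.0)

    a = [True] * 60 + [False] * 40   # toy outcomes, illustrative only
    b = [True] * 45 + [False] * 55
    print(sign_test(a, b) < 0.05)    # True: difference is significant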

                                 lemma-based   wordforms
Training
  # classifiers built                669          953
    based on wordforms               297          953
    based on lemmas                  372           na
  # wordforms subsumed               656           na
Testing
  # unique ambiguous wordforms       512          512
  # classifiers used                 387          410
    based on wordforms               230          410
    based on lemmas                   70           na
  # wordforms subsumed               208           na
  # wordforms seen 1st time           74          102

Table 1: Overview of classifiers built during training and used in testing with the lemma-based and the wordform-based approach

Model                          ambig     all
baseline all ambiguous words   78.47    89.44
wordform classifiers           83.66    92.37
lemma-based classifiers        84.15    92.45

Table 2: WSD results (in %) with the lemma-based approach compared to classifiers based on wordforms

Model                     ambig    # classifiers
baseline                  76.77         192
wordform classifiers      78.66         192
lemma-based classifiers   80.39          70

Table 3: Comparison of results (in %) for wordforms with different classifiers in the lemma-based and wordform-based approach

Model                      all
baseline all words         89.4
wordform classifiers       92.4
lemma-based classifiers    92.5
Hendrickx et al. (2002)    92.5

Table 4: Comparison of results (in %) on the Dutch SENSEVAL-2 data with different WSD systems

7 Conclusion and Future Work

In this paper, we have introduced a lemma-based approach for a statistical WSD system using maximum entropy and a number of linguistic sources of information. This novel approach uses the advantage of more concise and more generalizable information contained in lemmas as its key feature: classifiers for individual ambiguous words are built on the basis of their lemmas, instead of wordforms as has traditionally been done. Therefore, more training material is available to each classifier, and the resulting WSD system is smaller and more robust.

The lemma-based approach has been tested on the Dutch SENSEVAL-2 data set and resulted in a significant improvement of the accuracy achieved over the system using the traditional wordform-based approach. In comparison to earlier results with a Memory-Based WSD system, the lemma-based approach performs the same, involving less work (no parameter optimization).

A possible extension of the present approach is to include more specialized feature selection and also to optimize the settings for each ambiguous wordform instead of adopting the same strategy for all words in the corpus. Furthermore, we would like to test the lemma-based approach in a multi-classifier voting scheme.

Acknowledgments

This research was carried out within the framework of the PIONIER Project Algorithms for Linguistic Processing. This PIONIER Project is funded by NWO (Dutch Organization for Scientific Research) and the University of Groningen. We are grateful to Gertjan van Noord and Menno van Zaanen for comments and discussions.

References

R. Harald Baayen, Richard Piepenbrock, and Afke van Rijn. 1993. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia.

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

Johan Berghmans. 1994. WOTAN—een automatische grammaticale tagger voor het Nederlands. Master's thesis, Nijmegen University, Nijmegen.

Stanley Chen and Ronald Rosenfeld. 2000. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37-50.

Jan Daciuk. 2000. Finite state tools for natural language processing. In Proceedings of the COLING 2000 Workshop "Using Toolsets and Architectures to Build NLP Systems", pages 34-37, Centre Universitaire, Luxembourg.

Walter Daelemans and Véronique Hoste. 2002. Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 755-760, Las Palmas, Gran Canaria.

Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 2002a. MBT: Memory-Based tagger, reference guide. Technical Report ILK 02-09, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg. Version 1.0.

Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. 2002b. TiMBL: Tilburg Memory-Based learner, reference guide. Technical Report ILK 02-10, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg. Version 4.3.

Erwin Drenth. 1997. Using a hybrid approach towards Dutch part-of-speech tagging. Master's thesis, Alfa-Informatica, University of Groningen, Groningen.

Tanja Gaustad and Gosse Bouma. 2002. Accurate stemming of Dutch for text classification. In Mariët Theune, Anton Nijholt, and Hendri Hondorp, editors, Computational Linguistics in the Netherlands 2001, Amsterdam. Rodopi.

Tanja Gaustad. 2003. The importance of high quality input for WSD: An application-oriented comparison of part-of-speech taggers. In Proceedings of the Australasian Language Technology Workshop (ALTW 2003), pages 65-72, Melbourne.

Iris Hendrickx and Antal van den Bosch. 2001. Dutch word sense disambiguation: Data and preliminary results. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 13-16, Toulouse.

Iris Hendrickx, Antal van den Bosch, Véronique Hoste, and Walter Daelemans. 2002. Dutch word sense disambiguation: Optimizing the localness of context. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia.

Véronique Hoste, Iris Hendrickx, Walter Daelemans, and Antal van den Bosch. 2002. Parameter optimization for machine-learning of word sense disambiguation. Natural Language Engineering, Special Issue on Word Sense Disambiguation Systems, 8(4):311-325.

Dan Klein and Christopher Manning. 2003. Maxent models, conditional estimation, and optimization without the magic. ACL 2003 Tutorial Notes, Sapporo.

Rob Koeling. 2001. Dialogue-Based Disambiguation: Using Dialogue Status to Improve Speech Understanding. Ph.D. thesis, Alfa-Informatica, University of Groningen, Groningen.

Wessel Kraaij and Renée Pohlman. 1994. Porter's stemming algorithm for Dutch. In L.G.M. Noordman and W.A.M. de Vroomen, editors, Informatiewetenschap 1994: Wetenschapelijke bijdragen aan de derde STINFON Conferentie, pages 167-180, Tilburg.

Adwait Ratnaparkhi. 1997a. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Providence.

Adwait Ratnaparkhi. 1997b. A simple introduction to maximum entropy models for natural language processing. Technical Report IRCS Report 97-08, IRCS, University of Pennsylvania, Philadelphia.

David Yarowsky. 1994. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994), Las Cruces.