A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch
Tanja Gaustad
Alfa-Informatica, University of Groningen
Postbus 716, 9700 AS Groningen, The Netherlands
[email protected]

Abstract

In this paper, we present a corpus-based supervised word sense disambiguation (WSD) system for Dutch which combines statistical classification (maximum entropy) with linguistic information. Instead of building individual classifiers per ambiguous wordform, we introduce a lemma-based approach. The advantage of this novel method is that it clusters all inflected forms of an ambiguous word in one classifier, thereby augmenting the training material available to the algorithm. Testing the lemma-based model on the Dutch SENSEVAL-2 test data, we achieve a significant increase in accuracy over the wordform model. Also, the WSD system based on lemmas is smaller and more robust.

1 Introduction: WSD for Dutch

A major problem in natural language processing (NLP) for which no satisfactory solution has been found to date is word sense disambiguation (WSD). WSD refers to the resolution of lexical semantic ambiguity, and its goal is to attribute the correct sense(s) to words in a given context. For instance, machine translation, information retrieval, or document extraction could all benefit from the accurate disambiguation of word senses.

The WSD system for Dutch[1] presented here is a corpus-based supervised algorithm combining statistical classification with various kinds of linguistic information. The intuition behind the system is that linguistic information is beneficial for WSD, which means that it will improve results over purely statistical approaches. The linguistic information includes lemmas, part-of-speech (PoS), and the context around the ambiguous word.

In this paper, we focus on a lemma-based approach to WSD for Dutch. So far, systems have built individual classifiers for each ambiguous wordform (Hendrickx et al., 2002; Hoste et al., 2002). In the system presented here, the classifiers built for each ambiguous word are based on its lemma instead. Lemmatization allows for more compact and generalizable data by clustering all inflected forms of an ambiguous word together, an effect already commented on by Yarowsky (1994). The more inflection in a language, the more lemmatization will help to compress and generalize the data. For our WSD system this means that fewer classifiers have to be built, which in turn increases the training material available to the algorithm for each ambiguous wordform. Accuracy is therefore expected to increase for the lemma-based model in comparison to the wordform model.

The paper is structured as follows: First, we present the dictionary-based lemmatizer for Dutch which was used to lemmatize the data, followed by a detailed explanation of the lemma-based approach adopted in our WSD system. Next, the statistical classification algorithm, namely maximum entropy, and Gaussian priors (used for smoothing purposes) are introduced. We then proceed to describe the corpus, the corpus preparation, and the system settings. We conclude the paper with results on the Dutch SENSEVAL-2 data, their evaluation, and ideas for future work.

[1] The interest in Dutch lies grounded in the fact that we are working in the context of a project concerned with developing NLP tools for Dutch (see http://www.let.rug.nl/~vannoord/alp).

2 Dictionary-Based Lemmatizer for Dutch

Statistical classification systems, like our WSD system, determine the most likely class for a given instance by computing how likely the words or linguistic features in the instance are for any given class. Estimating these probabilities is difficult, as corpora contain many different, often infrequent, words. Lemmatization[2] is a method that can be used to reduce the number of wordforms that need to be taken into consideration, as estimation is more reliable for frequently occurring data.

Lemmatization reduces all inflected forms of a word to the same lemma. The number of different lemmas in a training corpus will therefore in general be much smaller than the number of different wordforms, and the frequency of lemmas will accordingly be higher than that of the corresponding individual inflected forms, which in turn suggests that probabilities can be estimated more reliably.

For the experiments in this paper, we used a lemmatizer for Dutch with dictionary lookup. Dictionary information is obtained from Celex (Baayen et al., 1993), a lexical database for Dutch. Celex contains 381,292 wordforms and 124,136 lemmas for Dutch. It also contains the PoS associated with the lemmas. This information is useful for disambiguation: in those cases where a particular wordform has two (or more) possible corresponding lemmas, the one matching the PoS of the wordform is chosen. Thus, in a first step, information about wordforms, their respective lemmas, and their PoS is extracted from the database.

[2] We chose to use lemmatization and not stemming because the lemma (or canonical dictionary entry form) can be used to look up an ambiguous word in a dictionary or an ontology like e.g. WordNet. This is not the case for a stem.

Dictionary lookup can be time-consuming, especially for large dictionaries such as Celex. To guarantee fast lookup and a compact representation, the information extracted from the dictionary is stored as a finite state automaton (FSA) using Daciuk's (2000) FSA morphology tools.[3] Given a wordform, the compiled automaton provides the corresponding lemmas in time linear in the length of the input word. Contrasting this dictionary-based lemmatizer with a simple suffix stripper, such as the Dutch Porter Stemmer (Kraaij and Pohlman, 1994), our lemmatizer is more accurate, faster, and more compact (see Gaustad and Bouma (2002) for a more elaborate description and evaluation).

During the actual lemmatization procedure, the FSA encoding of the information in Celex assigns every wordform all its possible lemmas. For ambiguous wordforms, the lemma with the same PoS as the wordform in question is chosen. All wordforms that were not found in Celex are processed with a morphological guessing automaton.[4]

The key features of the lemmatizer employed are that it is fast, compact, and accurate.

[3] Available at http://www.eti.pg.gda.pl/~jandac/fsa.html
[4] Also available from the FSA morphology tools (Daciuk, 2000).

3 Lemma-Based Approach

As we have mentioned in the previous section, lemmatization collapses all inflected forms of a given word to the same lemma. In our system, separate classifiers are built for every ambiguous wordform. Normally, this implies that the basis for grouping occurrences of particular ambiguous words together is that their wordform is the same. Instead, we chose a model constructing classifiers based on lemmas, thereby reducing the number of classifiers that need to be built.

As has already been noted by Yarowsky (1994), using lemmas helps to produce more concise and generic evidence than inflected forms. Therefore, building classifiers based on lemmas increases the data available to each classifier. We make use of the advantage of clustering all instances of e.g. one verb in a single classifier instead of several classifiers (one for each inflected form found in the data). In this way, there is more training data per ambiguous wordform available to each classifier. The expectation is that this should increase the accuracy of our maximum entropy WSD system in comparison to the wordform-based model.

Figure 1 shows how the system works. During training, every wordform is first checked for ambiguity, i.e. whether it has more than one sense associated with all its occurrences. If the wordform is ambiguous, the number of lemmas associated with it is looked up. If the wordform has one lemma, all occurrences of this lemma in the training data are used to build the classifier for that particular wordform (and for others with the same lemma). If a wordform has more than one lemma, a classifier based on the wordform is built. This strategy was decided on in order to be able to treat all ambiguous words, notwithstanding lemmatization errors or wordforms that can genuinely be assigned two or more lemmas.

An example of a word that has two different lemmas depending on the context is boog: it can either be the past tense of the verb buigen ('to bend') or the noun boog ('arch'). Since the Dutch SENSEVAL-2 data is not only ambiguous with regard to meaning but also with regard to PoS, both lemmas are subsumed in the wordform classifier for boog.

During testing, we check for each word whether there is a classifier available for either its wordform or its lemma and apply that classifier to the test instance.

[Figure 1: Schematic overview of the lemma-based approach for our WSD system for Dutch]

4 Maximum Entropy Word Sense Disambiguation System

Our WSD system is founded on the idea of combining statistical classification with linguistic sources of knowledge. In order to be able to take full advantage of the linguistic information, we need a classification algorithm capable of incorporating the information provided. The main advantage of maximum entropy modeling is that heterogeneous and overlapping information can be integrated into a single statistical model. Other learning algorithms, like e.g. decision lists, only take the strongest feature into account, whereas maximum entropy combines them all. Features extracted from the labeled training data are used to derive a set of constraints for the model. This set of constraints characterises the class-specific expectations for the distribution.
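The PoS-based lemma selection described in Section 2 can be illustrated with a small sketch. The dictionary below is a toy stand-in for the Celex-derived automaton; its entries and the function name are hypothetical, chosen only to mirror the lookup step:

```python
# Toy stand-in for the Celex-derived lemma lexicon: maps a wordform to its
# candidate (lemma, PoS) pairs. Entries are illustrative, not from Celex.
LEMMA_DICT = {
    "boog": [("buigen", "V"), ("boog", "N")],   # past tense of 'buigen' / noun 'arch'
    "liep": [("lopen", "V")],
}

def lemmatize(wordform, pos):
    """Return the candidate lemma whose PoS matches the wordform's PoS.

    Mirrors the disambiguation step: when a wordform has two or more
    possible lemmas, the one matching the wordform's PoS is chosen.
    """
    candidates = LEMMA_DICT.get(wordform)
    if candidates is None:
        return None  # in the real system: morphological guessing automaton
    for lemma, lemma_pos in candidates:
        if lemma_pos == pos:
            return lemma
    return candidates[0][0]  # fall back to the first candidate

print(lemmatize("boog", "V"))  # buigen ('to bend')
print(lemmatize("boog", "N"))  # boog ('arch')
```
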
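The training-time decision procedure of Section 3 (build a lemma-based classifier when an ambiguous wordform maps to exactly one lemma, otherwise fall back to a wordform classifier) could be sketched as follows. The instance format and grouping keys are simplifying assumptions for illustration, not the paper's actual implementation:

```python
from collections import defaultdict

def assign_classifiers(instances):
    """Group annotated training instances into classifiers.

    Each instance is a (wordform, lemma, sense) triple. An ambiguous
    wordform with a single lemma contributes its occurrences to one
    shared lemma-based classifier; a wordform with several possible
    lemmas (like 'boog') gets its own wordform-based classifier.
    """
    senses = defaultdict(set)
    lemmas = defaultdict(set)
    for wordform, lemma, sense in instances:
        senses[wordform].add(sense)
        lemmas[wordform].add(lemma)

    classifiers = defaultdict(list)
    for wordform, lemma, sense in instances:
        if len(senses[wordform]) < 2:
            continue  # unambiguous wordform: no classifier needed
        if len(lemmas[wordform]) == 1:
            classifiers[("lemma", lemma)].append((wordform, sense))
        else:
            classifiers[("wordform", wordform)].append((wordform, sense))
    return classifiers
```

Note how the inflected forms of one verb end up pooled in a single lemma classifier, which is exactly where the extra training material per classifier comes from.
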
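A maximum entropy classifier of the kind described in Section 4 defines p(c|x) ∝ exp(Σ_i λ_i f_i(x, c)), so heterogeneous, overlapping features simply contribute additive weights. A minimal sketch of the scoring step, with invented feature names and weights (the real model's features and trained weights are not shown in this section):

```python
import math

def maxent_probs(features, classes, weights):
    """Conditional class distribution of a log-linear (maximum entropy)
    model: p(c|x) is proportional to exp(sum of the weights of the
    active (feature, class) pairs)."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # normalization constant
    return {c: math.exp(s) / z for c, s in scores.items()}

# Hypothetical weights for the ambiguous wordform 'boog':
weights = {
    ("prev=de", "arch"): 1.5,    # a preceding determiner suggests the noun
    ("pos=verb", "bend"): 2.0,   # verbal PoS suggests the 'buigen' reading
}
probs = maxent_probs({"prev=de"}, ["arch", "bend"], weights)
```
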