
Automatic WordNet mapping using sense disambiguation*

Changki Lee, Geunbae Lee†
Natural Language Processing Lab
Dept. of Computer Science and Engineering
Pohang University of Science & Technology
San 31, Hyoja-Dong, Pohang, 790-784, Korea
{leeck,gblee}@postech.ac.kr

Seo JungYun
Natural Language Processing Lab
Dept. of Computer Science
Sogang University
Sinsu-dong 1, Mapo-gu, Seoul, Korea
seojy@ccs.sogang.ac.kr

Abstract

This paper presents the automatic construction of a Korean WordNet from pre-existing lexical resources. A set of automatic WSD techniques is described for linking Korean words collected from a bilingual MRD to English WordNet synsets. We will show how the individual linkings provided by each WSD method are then combined to produce a Korean WordNet.

1 Introduction

There is no doubt about the increasing importance of wide-coverage ontologies for NLP tasks, especially for information retrieval and cross-language information retrieval. While such ontologies exist in English, very few wide-coverage ontologies are available for other languages. Manual construction by experts is the most reliable technique, but it is costly and highly time-consuming. This is the reason many researchers have focused on the massive acquisition of lexical knowledge and semantic information from pre-existing lexical resources, as automatically as possible.

This paper presents a novel approach for automatic WordNet mapping using word sense disambiguation. The method has been applied to link Korean words from a bilingual dictionary to English WordNet synsets.

To clarify the description, an example is given. To link the first sense of the Korean word 'gwan-mog' to a WordNet synset, we employ a bilingual Korean-English dictionary. The first sense of 'gwan-mog' has 'bush' as a translation in English, and 'bush' has five synsets in WordNet. Therefore the first sense of 'gwan-mog' has five candidate synsets. Somehow we must decide on the synset {shrub, bush} among the five candidates and link the sense of 'gwan-mog' to it.

As seen from this example, when we link the senses of Korean words to WordNet synsets, there are semantic ambiguities. To remove the ambiguities we develop new word sense disambiguation heuristics and an automatic mapping method to construct a Korean WordNet based on the existing English WordNet.

This paper is organized as follows. In section 2, we describe multiple heuristics for word sense disambiguation for sense linking. In section 3, we explain the method of combining these heuristics. Section 4 presents some experimental results, and section 5 discusses related research. Finally, we draw conclusions and outline future research in section 6. The automatically mapped Korean WordNet is a natural Korean-English bilingual resource, so it can be directly applied to Korean-English cross-lingual information retrieval as well as to Korean monolingual information retrieval.

2 Multiple heuristics for word sense disambiguation

As the mapping method described in this paper has been developed by combining multiple individual solutions, each single heuristic must be seen as a container for some part of the linguistic knowledge needed to disambiguate the

* This research was supported by KOSEF special purpose basic research (1997.9-2000.8, #970-1020-301-3).
† Corresponding author

ambiguous WordNet synsets. Therefore no single heuristic is suitable for all Korean words collected from a bilingual dictionary. We describe below each individual WSD (word sense disambiguation) heuristic for mapping Korean words into their corresponding English senses.

[Figure 1: Word-to-Concept Association. The j-th sense of a Korean word is linked through its English translations to the candidate WordNet synsets ws_1, ..., ws_n.]

Figure 1 shows the Korean word to WordNet synset association. The j-th sense of a Korean word kw has m translations in English and n WordNet synsets as candidate senses. Each heuristic is applied to the candidate senses (ws_1, ..., ws_n) and provides scores for them.

2.1 Heuristic 1: Maximum Similarity

This heuristic comes from our previous Korean WSD research (Lee and Lee, 2000) and assumes that all the English translations of the same Korean word sense are semantically similar. So this heuristic gives the maximum score to the sense that is most similar to the senses of the other translations. It is applied when the number of translations for the same Korean word sense is more than 1. The following formula explains the idea:

    H1(s_i) = max_{ew_j ∈ EW_i} [ 1/((n−1)+α) · Σ_{k=1, k≠j}^{n} support(s_i, ew_k) ]

    where EW_i = {ew | s_i ∈ synset(ew)}

In this formula, H1(s_i) is the heuristic score of the synset s_i, s_i is a candidate synset, ew is a translation into English, n is the number of translations, and synset(ew) is the set of synsets of the translation ew. So EW_i is the set of translations which have the synset s_i. The parameter α controls the relative contribution of candidate synsets with different numbers of translations: as the value of α increases, candidate synsets with a smaller number of translations get relatively less weight (α=0.5 was tuned experimentally). support(s_i, ew) calculates the maximum similarity between the synset s_i and the translation ew, and is defined as:

    support(s_i, ew) = max_{s ∈ synset(ew)} S(s_i, s)

    S(s_1, s_2) = sim(s_1, s_2) if sim(s_1, s_2) ≥ θ, and 0 otherwise

Similarity measures lower than a threshold θ are considered to be noise and are ignored. In our experiments, θ=0.3 was used. sim(s_1, s_2) computes the conceptual similarity between the concepts s_1 and s_2 by the following formula:

    sim(s_1, s_2) = 2 × level(MSCA(s_1, s_2)) / (level(s_1) + level(s_2))

where MSCA(s_1, s_2) represents the most specific common ancestor of the concepts s_1 and s_2, and level(s) refers to the depth of the concept s from the root node in WordNet.¹

2.2 Heuristic 2: Prior Probability

This heuristic provides the prior probability of each sense of a single translation as its score. Therefore it gives the maximum score to the synset of a monosemous translation, that is, a translation which has only one corresponding synset. The following formula explains the idea:

    H2(s_i) = max_{ew ∈ EW_i} P(s_i | ew)

    where EW_i = {ew | s_i ∈ synset(ew)}

    P(s_i | ew_j) ≈ 1/n_j, where s_i ∈ synset(ew_j) and n_j = |synset(ew_j)|

In this formula, n_j is the number of synsets of the translation ew_j.

2.3 Heuristic 3: Sense Ordering

(Gale et al., 1992) reports that word sense disambiguation would be at least 75% correct if a system assigns the most frequently occurring sense. (Miller et al., 1994) found that the automatic assignment of polysemous words in the Brown Corpus to senses in WordNet was 58% correct with a heuristic of the most frequently occurring sense. We adopt these previous results to develop the sense ordering heuristic.

The sense ordering heuristic gives the maximum score to the most frequently used sense of a translation. The following formula explains the heuristic:

    H3(s_i) = max_{ew ∈ EW_i} SO(s_i, ew)

    where EW_i = {ew | s_i ∈ synset(ew)}

    SO(s_i, ew) = a · x^(−β)

    where s_i ∈ synset(ew), synset(ew) is sorted by frequency, and s_i is the x-th synset in synset(ew)

In this formula, x refers to the sense order of s_i in synset(ew): x is 1 when s_i is the most frequently used sense of ew. The information about the sense order of the synsets of an English word was extracted from WordNet.

[Figure 2: Sense distribution in SemCor. The probability of a sense decreases with its sense order.]

The values a=0.705 and β=2.2 were acquired from a regression of the SemCor² data distribution shown in Figure 2.

2.4 Heuristic 4: IS-A relation

This heuristic is based on the following fact: if two Korean words have an IS-A relation, their translations in English should also have an IS-A relation.

[Figure 3: IS-A relation.]

Figure 3 explains the IS-A relation heuristic. In Figure 3, hkw is a hypernym of the Korean word kw, hew is a translation of hkw, and ew is a translation of kw. This heuristic assigns score 1 to the synsets which satisfy the above assumption, according to the following formula:

    H4(s_i) = max_{ew ∈ EW_i} IR(s_i, ew)

    where EW_i = {ew | s_i ∈ synset(ew)}

    IR(s_i, ew) = 1 if IsA(s_i, s_j), and 0 otherwise

    where s_i ∈ synset(ew) and s_j ∈ synset(hew)

In this formula, IsA(s_1, s_2) returns true if s_1 is a kind of s_2.

2.5 Heuristic 5: Word Match

This heuristic assumes that related concepts will be expressed using the same content words. Given two definitions, that of the bilingual dictionary and that of WordNet, this heuristic computes the total amount of shared content words:

    H5(s_i) = max_{ew ∈ EW_i} WM(s_i, ew)

    where EW_i = {ew | s_i ∈ synset(ew)}

    WM(s_i, ew) = sim(X, Y_i)

    sim(X, Y) = |X ∩ Y| / |X ∪ Y|

In this formula, X is the set of content words in the English examples of the bilingual dictionary, and Y_i is the set of content words in the definition and example of the synset s_i in WordNet.

¹ We use English WordNet version 1.6.
² SemCor is a sense-tagged corpus drawn from part of the Brown Corpus.
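The taxonomy-based similarity at the core of Heuristic 1 can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' code: a toy child-to-parent hypernym table stands in for the WordNet hierarchy, `level` counts depth from the root, `msca` finds the most specific common ancestor, and `support` applies the noise threshold θ=0.3 mentioned in the text.

```python
def level(parent, s):
    """Depth of synset s from the root (the root has level 1),
    following child -> parent hypernym links in the `parent` table."""
    depth = 1
    while s in parent:
        s = parent[s]
        depth += 1
    return depth

def msca(parent, s1, s2):
    """Most specific common ancestor (MSCA) of s1 and s2."""
    ancestors = {s1}
    while s1 in parent:
        s1 = parent[s1]
        ancestors.add(s1)
    while s2 not in ancestors:
        s2 = parent[s2]
    return s2

def sim(parent, s1, s2):
    """Conceptual similarity: 2 * level(MSCA) / (level(s1) + level(s2))."""
    common = msca(parent, s1, s2)
    return 2.0 * level(parent, common) / (level(parent, s1) + level(parent, s2))

def support(parent, synset_of, si, ew, theta=0.3):
    """Max similarity between candidate synset si and any synset of translation ew;
    similarities below the threshold theta are treated as noise and scored 0."""
    best = max(sim(parent, si, s) for s in synset_of[ew])
    return best if best >= theta else 0.0

# Toy hypernym table: entity > plant > {shrub, tree}; "entity" is the root.
parent = {"plant": "entity", "shrub": "plant", "tree": "plant"}
print(sim(parent, "shrub", "tree"))  # 2*2 / (3+3) = 0.666...
```

With a real lexical database, `parent` and `synset_of` would be populated from WordNet's hypernym pointers and from the bilingual dictionary's translation lists, respectively.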

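Two of the shallower scores are equally easy to state concretely. This sketch is illustrative rather than the authors' implementation: `sense_order_score` uses the values a=0.705 and β=2.2 reported for the SemCor sense distribution (the power-law form is our reading of the regression), and `word_match` computes the content-word overlap of Heuristic 5, assuming normalization by the union of the two word sets.

```python
def sense_order_score(x, a=0.705, beta=2.2):
    """Heuristic 3 (sense ordering): score of the x-th most frequent sense of a
    translation, x = 1 being the most frequent; a and beta come from a regression
    over the SemCor sense distribution."""
    return a * x ** (-beta)

def word_match(x_words, y_words):
    """Heuristic 5 (word match): overlap of content words between the bilingual
    dictionary example (X) and the WordNet definition/example (Y).
    Union normalization is assumed here."""
    x, y = set(x_words), set(y_words)
    return len(x & y) / len(x | y) if x | y else 0.0

print(sense_order_score(1))  # 0.705: the most frequent sense gets the highest score
print(word_match(["low", "woody", "plant"],
                 ["low", "woody", "perennial", "plant"]))  # 3 shared / 4 in union = 0.75
```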
2.6 Heuristic 6: Cooccurrence

This heuristic uses a cooccurrence measure acquired from the sense-tagged Korean definition sentences of the bilingual dictionary. To build the sense-tagged corpus, we use the definition sentences which have a monosemous translation in the bilingual dictionary, and we use the 25 semantic tags of WordNet as sense tags:

    H6(s_i) = max_{x ∈ Def} p(t_i | x)

    with p = p̂ − z_(1−α)/2 × √(p̂(1−p̂)/n)

    and p̂(t_i | x) = Freq(t_i, x) / Freq(x)

In this formula, Def is the set of content words of a Korean definition sentence, t_i is the semantic tag corresponding to the synset s_i, and n refers to Freq(x).

3 Combining heuristics with decision tree learning

Given a candidate synset of a translation and its 6 individual heuristic scores, our goal is to use all 6 scores in combination to classify the synset as linking or discarding. The combination of the heuristics is performed by decision tree learning, which can capture non-linear relationships. Each internal node of a decision tree is a choice point, dividing an individual method's score into ranges of possible values. Each leaf node is labeled with a classification (linking or discarding). The most popular method of decision tree induction, employed here, is C4.5 (Quinlan, 1993).

[Figure 4: Training phase. The scores of heuristics 1-6 for each manually classified candidate synset are fed to decision tree learning.]

Figure 4 shows the training phase of the decision tree based combination method. In the training phase, each candidate synset ws_k of a Korean word is manually classified as linking or discarding and is assigned a score by each heuristic. A training data set is constructed from these scores and the manual classifications, and is used to optimize a model for combining the heuristics.

[Figure 5: Mapping phase. The learned decision tree classifies each new candidate synset as linking or discarding.]

Figure 5 shows the mapping phase. In the mapping phase, a new candidate synset ws_k of a Korean word is rated by the 6 heuristics, and the decision tree learned in the training phase classifies ws_k as linking or discarding. A synset classified as linking is linked to the corresponding Korean word.

4 Evaluation

In this section, we evaluate the performance of each of the six heuristics as well as of the combination method. To evaluate the performance of the WordNet mapping, the candidate synsets of 3260 senses of Korean words in the bilingual dictionary were manually classified as linking or discarding.

We define 'precision' as the proportion of correctly linked senses of Korean words to all the linked senses of Korean words in a test set. We also define 'coverage' as the proportion of linked senses of Korean words to all the senses of Korean words in a test set.

Table 1 contains the results for each heuristic evaluated individually against the manually classified data. The test set here consists of the 3260 manually classified senses. In general, the results of each heuristic seem poor, but they are always better than the random choice baseline. The best heuristic according to

the precision is the maximum similarity heuristic. But it was applied to only 59.51% of the 3260 senses of Korean words. The results of each heuristic are better than random mapping, with statistical significance at the 99% level.

                         Precision(%)   Coverage(%)
    Random mapping           49.85         100.0
    Heuristic 1              75.21          59.51
    Heuristic 2              74.66         100.0
    Heuristic 3              71.87         100.0
    Heuristic 4              55.49          29.36
    Heuristic 5              56.48          63.01
    Heuristic 6              67.24          64.14

    Table 1: Individual heuristics performance

We performed 10-fold cross validation to evaluate the performance of the combination of all the heuristics using the decision tree: we split the data into ten parts, reserved one part as a validation set, trained the decision tree on the other nine parts, and then evaluated it on the reserved part, repeating the process with each of the ten parts in turn as the validation set. Table 2 shows the results for the combination of all the heuristics. Summing simply sums the scores of all the heuristics; the candidate synset with the highest summed score is then selected. Logistic regression, as described in (Hosmer and Lemeshow, 1989), is a popular technique for binary classification; it applies an inverse logit function, employs the iteratively reweighted least squares algorithm, and determines the weight of each heuristic.

                         Precision(%)   Coverage(%)
    Summing                  84.61         100.0
    Logistic regression      86.41         100.0
    Decision tree            93.59          77.12

    Table 2: Performance comparison of the combination methods

With the combination of the heuristics using summing, we obtained an improvement of 9% over the maximum similarity heuristic (heuristic 1), maintaining a coverage of 100%. The decision tree correctly maps 93.59% of the senses of Korean words in the bilingual dictionary, maintaining a coverage of 77.12%.

Applying the decision tree to combine all the heuristics for all Korean words in the bilingual dictionary, we obtain a preliminary version of the Korean WordNet containing 21654 senses of 17696 Korean nouns, with an accuracy of 93.59% (±0.84% with 99% confidence).

5 Related works

Several attempts have been made to automatically produce multilingual ontologies. (Knight & Luk 1994) focuses on the construction of Sensus, a large knowledge base supporting the Pangloss machine translation system, by merging ontologies (ONTOS and UpperModel) and WordNet with monolingual and bilingual dictionaries. (Okumura & Hovy 1994) describes a semi-automatic method for associating a Japanese lexicon to an ontology using a Japanese/English bilingual dictionary as a 'bridge'. Several lexical resources and techniques are combined in (Atserias et al., 1997) to map Spanish words from a bilingual dictionary to WordNet. In (Farreres et al., 1998), the use of a taxonomic structure derived from a monolingual MRD is proposed as an aid to the mapping process.

In contrast, this research utilizes a bilingual dictionary to build a monolingual thesaurus based on existing popular lexical resources, using a combination of multiple unsupervised WSD heuristics.

6 Conclusion

This paper has explored the automatic construction of a Korean WordNet from pre-existing lexical resources: the English WordNet and a Korean/English bilingual dictionary. We presented several techniques for word sense disambiguation and applied them to disambiguate the translations in the bilingual dictionary. We obtained a preliminary version of the Korean WordNet containing 21654 senses of 17696 Korean nouns. In a series of experiments, we observed that the accuracy of the mapping is over 90%.

References

Atserias J., Climent S., Farreres X., Rigau G. and Rodriguez H. (1997) Combining Multiple Methods for the Automatic Construction of

Multilingual WordNets. In Proceedings of the Conference on Recent Advances in NLP.

Farreres X., Rigau G. and Rodriguez H. (1998) Using WordNet for building WordNets. In Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing Systems.

Gale W., Church K. and Yarowsky D. (1992) Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics.

Hosmer D. Jr. and Lemeshow S. (1989) Applied Logistic Regression. Wiley, New York.

Knight K. and Luk S. (1994) Building a large-scale knowledge base for machine translation. In Proceedings of the American Association for Artificial Intelligence.

Miller G. (1990) Five papers on WordNet. Special issue of the International Journal of Lexicography.

Miller G., Chodorow M., Landes S., Leacock C. and Thomas R. (1994) Using a semantic concordance for sense identification. In Proceedings of the Human Language Technology Workshop.

Okumura A. and Hovy E. (1994) Building a Japanese-English Dictionary based on Ontology for Machine Translation. In Proceedings of the ARPA Workshop on Human Language Technology.

Quinlan R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

Seungwoo Lee and Geunbae Lee. (2000) Unsupervised Sense Disambiguation Using Local Context and Co-occurrence. Journal of Korean Information Science Society. (in press)
