Adapting a synonym database to specific domains

Davide Turcato  Fred Popowich  Janine Toole  Dan Fass  Devlan Nicholson  Gordon Tisher

gavagai Technology Inc.
P.O. Box 374, 3495 Cambie Street, Vancouver, British Columbia, V5Z 4R3, Canada
and
Natural Language Laboratory, School of Computing Science, Simon Fraser University
8888 University Drive, Burnaby, British Columbia, V5A 1S6, Canada
{turk,popowich,toole,fass,devlan,gtisher}@{gavagai.net,cs.sfu.ca}

Abstract

This paper describes a method for adapting a general purpose synonym database, like WordNet, to a specific domain, where only a subset of the synonymy relations defined in the general database hold. The method adopts an eliminative approach, based on incrementally pruning the original database. It is based on a preliminary manual pruning phase and an algorithm for automatically pruning the database. The method has been implemented and used for an Information Retrieval system in the aviation domain.

1 Introduction

Synonyms can be an important resource for Information Retrieval (IR) applications, and attempts have been made at using them to expand query terms (Voorhees, 1998). In expanding query terms, overgeneration is as much of a problem as incompleteness or lack of synonym resources. Precision can dramatically drop because of false hits due to incorrect synonymy relations. This problem is particularly felt when IR is applied to documents in specific technical domains. In such cases, the synonymy relations that hold in the specific domain are only a restricted portion of the synonymy relations holding for a given language at large. For instance, a set of synonyms like

(1) {cocaine, cocain, coke, snow, C}

valid for English, would be detrimental in a specific domain like weather reports, where both snow and C (for Celsius) occur very frequently, but never as synonyms of each other.

We describe a method for creating a domain specific synonym database from a general purpose one. We use WordNet (Fellbaum, 1998) as our initial database, and we draw evidence from a domain specific corpus about what synonymy relations hold in the domain.

Our task has obvious relations to word sense disambiguation (Sanderson, 1997; Leacock et al., 1998), since both tasks are based on identifying senses of ambiguous words in a text. However, the two tasks are quite distinct. In word sense disambiguation, a set of candidate senses for a given word is checked against each occurrence of the relevant word in a text, and a single candidate sense is selected for each occurrence of the word. In our synonym specialization task, a set of candidate senses for a given word is checked against an entire corpus, and a subset of candidate senses is selected. Although the latter task could be reduced to the former (by disambiguating all occurrences of a word in a text and taking the union of the selected senses), alternative approaches could also be used. In a specific domain, where words can be expected to be monosemous to a large extent, synonym pruning can be an effective alternative (or a complement) to word sense disambiguation.

From a different perspective, our task is also related to the task of assigning Subject Field Codes (SFC) to a terminological resource, as done by Magnini and Cavaglià (2000) for WordNet. Assuming that a specific domain corresponds to a single SFC (or a restricted set of SFCs, at most), the difference between SFC assignment and our task is that the former assigns one of many possible values to a given synset (one of all possible SFCs), while the latter assigns one of two possible values (the synset belongs or does not belong to the SFC representing the domain). In other words, SFC assignment is a classification task, while ours can be seen as either a filtering or ranking task.

Adopting a filtering/ranking perspective makes apparent that the synonym pruning task can also be seen as an eliminative process, and as such it can be performed incrementally. In the following sections we will show how such characteristics have been exploited in performing the task. In section 2 we describe the pruning methodology, while section 3 provides a practical example from a specific domain. Conclusions are offered in section 4.

2 Methodology

2.1 Outline

The synonym pruning task aims at improving both the accuracy and the speed of a synonym database. In order to set the terms of the problem, we find it useful to partition the set of synonymy relations defined in WordNet into three classes:

1. Relations irrelevant to the specific domain (e.g. relations involving words that seldom or never appear in the specific domain);
2. Relations that are relevant but incorrect in the specific domain (e.g. the synonymy of two words that do appear in the specific domain, but are only synonyms in a sense irrelevant to the specific domain);
3. Relations that are relevant and correct in the specific domain.

The creation of a domain specific database aims at removing relations in the first two classes (to improve speed and accuracy, respectively) and including only relations in the third class. The overall goal of the described method is to inspect all synonymy relations in WordNet and classify each of them into one of the three aforementioned classes.

We define a synonymy as a binary relation between two synonym terms (with respect to a particular sense). Therefore, a WordNet synset containing n terms defines n(n-1)/2 synonymy relations. The assignment of a synonymy relation to a class is based on evidence drawn from a domain specific corpus. We use a tagged and lemmatized corpus for this purpose. Accordingly, all frequencies used in the rest of the paper are to be intended as frequencies of (lemma, tag) pairs.

The pruning process is carried out in three steps: (i) manual pruning; (ii) automatic pruning; (iii) optimization. The first two steps focus on incrementally eliminating incorrect synonyms, while the third step focuses on removing irrelevant synonyms. The three steps are described in the following sections.
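To make the data layout concrete, the sketch below shows one possible in-memory representation of the pruning candidates. It assumes a deliberately simplified view of WordNet in which a synset is just a list of lemmas plus a gloss, and in which corpus frequencies are keyed by (lemma, tag) pairs; the names (Synset, pruning_candidates, etc.) are illustrative and are not part of the system described in this paper.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical, simplified representation of a WordNet entry: a synset is a
# list of lemmas plus a gloss. Real WordNet access (e.g. via a WordNet API)
# is richer than this.
@dataclass
class Synset:
    sense_id: str
    lemmas: List[str]
    gloss: str

# Frequencies are collected over (lemma, POS-tag) pairs, as described above.
FreqTable = Dict[Tuple[str, str], int]

def synonymy_relations(synset: Synset) -> List[Tuple[str, str]]:
    """A synset with n terms yields n(n-1)/2 binary synonymy relations."""
    return list(combinations(sorted(synset.lemmas), 2))

def pruning_candidates(synsets: List[Synset],
                       corpus_freq: FreqTable,
                       pos: str = "n") -> List[Tuple[str, Synset]]:
    """(term, synset) pairs whose term occurs in the domain corpus."""
    return [(lemma, s) for s in synsets for lemma in s.lemmas
            if corpus_freq.get((lemma, pos), 0) > 0]
```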
2.2 Manual pruning

Different synonymy relations have a different impact on the behavior of the application in which they are used, depending on how frequently each synonymy relation is used. Relations involving words frequently appearing in either queries or corpora have a much higher impact (either positive or negative) than relations involving rarely occurring words. E.g. the synonymy between snow and C has a higher impact on the weather report domain (or the aviation domain, discussed in this paper) than the synonymy relation between cocaine and coke. Consequently, the precision of a synonym database obviously depends much more on frequently used relations than on rarely used ones.

Another important consideration is that judging the correctness of a given synonymy relation in a given domain is often an elusive issue: besides clearcut cases, there is a large gray area where judgments may not be trivial even for human evaluators. E.g. given the following three senses of the noun approach

(2) a. {approach, approach path, glide path, glide slope}
       (the final path followed by an aircraft as it is landing)
    b. {approach, approach shot}
       (a relatively short golf shot intended to put the ball onto the putting green)
    c. {access, approach}
       (a way of entering or leaving)

it would be easy to judge the first and second senses respectively relevant and irrelevant to the aviation domain, but the evaluation of the third sense would be fuzzier.

The combination of the two remarks above induced us to consider a manual pruning phase for the terms of highest 'weight' as a good investment of human effort, in terms of the ratio between the achieved increase in precision and the amount of work involved. A second reason for performing an initial manual pruning is that its outcome can be used as a reliable test set against which automatic pruning algorithms can be tested. Based on such considerations, we included a manual phase in the pruning process, consisting of two steps: (i) the ranking of synonymy relations in terms of their weight in the specific domain; (ii) the actual evaluation of the correctness of the top ranking synonymy relations, by human evaluators.

2.2.1 Ranking of synonymy relations

The goal of ranking synonymy relations is to associate them with a score that estimates how often a synonymy relation is likely to be used in the specific domain. The input database is sorted by the assigned scores, and the top ranking words are checked for manual pruning. Only terms appearing in the domain specific corpus are considered at this stage. In this way the benefit of manual pruning is maximized. Ranking is based on three sorting criteria, listed below in order of priority.

Criterion 1. Since a term that does appear in the domain corpus must have at least one valid sense in the specific domain, words with only one sense are not good candidates for pruning (under the assumption of completeness of the synonym database). Therefore polysemous terms are prioritized over monosemous terms.

Criterion 2. The second and third sorting criteria are similar, the only difference being that the second criterion assumes the existence of some inventory of relevant queries (a term list, a collection of previous queries, etc.). If such an inventory is not available, the second sorting criterion can be omitted. If the inventory is available, it is used to check which synonymy relations are actually to be used in queries to the domain corpus. Given a pair (t_i, t_j) of synonym terms, a score (which we name scoreCQ) is assigned to their synonymy relation, according to the following formula:

(3) scoreCQ_ij = (fcorpus_i * fquery_j) + (fcorpus_j * fquery_i)

where fcorpus_n and fquery_n are, respectively, the frequencies of a term in the domain corpus and in the inventory of query terms. The above formula aims at estimating how often a given synonymy relation is likely to be actually used. In particular, each half of the formula estimates how often a given term in the corpus is likely to be matched as a synonym of a given term in a query. Consider, e.g., the following situation (taken from the aviation domain discussed in section 3.1):

(4) fcorpus_snow = 3042
    fquery_snow  = 2
    fcorpus_C    = 9168
    fquery_C     = 0

It is estimated that C would be matched 18336 times as a synonym for snow (i.e. 9168 * 2), while snow would never be matched as a synonym for C, because C never occurs as a query term. Therefore scoreCQ_snow,C is 18336 (i.e. 18336 + 0).
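As a minimal illustration of formula (3), the sketch below computes scoreCQ for a pair of terms. The frequency tables are simplified to plain term keys (rather than (lemma, tag) pairs) for brevity, and the function name is ours, not part of the described system.

```python
def score_cq(term_i: str, term_j: str,
             corpus_freq: dict, query_freq: dict) -> int:
    """Formula (3): scoreCQ_ij = fcorpus_i * fquery_j + fcorpus_j * fquery_i."""
    return (corpus_freq.get(term_i, 0) * query_freq.get(term_j, 0)
            + corpus_freq.get(term_j, 0) * query_freq.get(term_i, 0))

# Reproducing example (4): C would be matched 9168 * 2 = 18336 times as a
# synonym for snow, while snow would never be matched as a synonym for C.
corpus_freq = {"snow": 3042, "C": 9168}
query_freq = {"snow": 2, "C": 0}
assert score_cq("snow", "C", corpus_freq, query_freq) == 18336
```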

Then, for each polysemous term i and synset s such that i ∈ s, the following score is computed:

(5) scorePolyCQ_i,s = Σ { scoreCQ_ij | j ∈ s and i ≠ j }

E.g., if S is the synset in (1), then scorePolyCQ_snow,S is the sum of scoreCQ_snow,cocaine, scoreCQ_snow,cocain, scoreCQ_snow,coke and scoreCQ_snow,C. Given the data in Table 1 (taken again from our aviation domain), the following scoreCQ values would result:

(6) scoreCQ_snow,cocaine = 2
    scoreCQ_snow,cocain  = 0
    scoreCQ_snow,coke    = 16
    scoreCQ_snow,C       = 18336

Table 1: Frequencies of sample synset terms.

    j        fcorpus_j  fquery_j
    cocaine          1         0
    cocain           0         0
    coke             8         0
    C             9168         0

Therefore, scorePolyCQ_snow,S would equal 18354.

The final score assigned to each polysemous term t_i is its highest scorePolyCQ_i,s. For snow, which has the following three senses

(7) a. {cocaine, cocain, coke, C, snow}
       (a narcotic (alkaloid) extracted from coca leaves)
    b. {snow}
       (a layer of snowflakes (white crystals of frozen water) covering the ground)
    c. {snow, snowfall}
       (precipitation falling from clouds in the form of ice crystals)

the highest score would be the one computed above.

Criterion 3. The third criterion assigns a score in terms of domain corpus frequency alone. It is used to further rank terms that do not occur in the query term inventory (or when no query term inventory is available). It is computed in the same way as the previous score, with the only difference that a value of 1 is conventionally assumed for fquery (the frequency of a term in the inventory of query terms).
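Building on score_cq from the previous sketch, the following (again illustrative) functions compute scorePolyCQ (formula 5) and the term-level score used for ranking; Criterion 3 falls out of the same code by supplying a query-frequency table that returns 1 for every term.

```python
def score_poly_cq(term: str, synset_lemmas: list,
                  corpus_freq: dict, query_freq: dict) -> int:
    """Formula (5): sum of scoreCQ over the other members of the synset."""
    return sum(score_cq(term, other, corpus_freq, query_freq)
               for other in synset_lemmas if other != term)

def term_ranking_score(term: str, synsets_of_term: list,
                       corpus_freq: dict, query_freq: dict) -> int:
    """Criterion 2: the final score of a polysemous term is its highest
    scorePolyCQ. Criterion 3 (no query inventory) can be emulated by passing
    a query_freq table that yields 1 for every term, e.g.
    collections.defaultdict(lambda: 1)."""
    return max(score_poly_cq(term, s, corpus_freq, query_freq)
               for s in synsets_of_term)

# Example (6): the partial scores for snow and the synset in (1) are
# 2 + 0 + 16 + 18336, so scorePolyCQ is 18354.
corpus_freq = {"snow": 3042, "C": 9168, "cocaine": 1, "cocain": 0, "coke": 8}
query_freq = {"snow": 2, "C": 0, "cocaine": 0, "cocain": 0, "coke": 0}
cocaine_synset = ["cocaine", "cocain", "coke", "C", "snow"]
assert score_poly_cq("snow", cocaine_synset, corpus_freq, query_freq) == 18354
```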
2.2.2 Correctness evaluation

All the synsets containing the top ranking terms, according to the hierarchy of criteria described above, are manually checked for pruning. For each term, all the synsets containing the term are clustered together and presented to a human operator, who examines each (term, synset) pair and answers the question: does the term belong to the synset in the specific domain? Evidence about the answer is drawn from relevant examples automatically extracted from the domain specific corpus. E.g., following up on our example in the previous section, the operator would be presented with the word snow associated with each of the synsets in (7) and would have to provide a yes/no answer for each of them. In the specific case, the answer would be likely to be 'no' for (7a) and 'yes' for (7b) and (7c).

The evaluator is presented with all the synsets involving a relevant term (even those that did not rank high in terms of scorePolyCQ) in order to apply a contrastive approach. It might well be the case that the correct sense for a given term is one for which the term has no synonyms at all (e.g. (7b) in the example), therefore all synsets for a given term need to be presented to the evaluator in order to make an informed choice. The evaluator provides a yes/no answer for all the (term, synset) pairs he/she is presented with (with some exceptions, as explained in section 3.1).

2.3 Automatic pruning

The automatic pruning task is analogous to manual pruning in two respects: (i) its input is the set of synonymy relations involving WordNet polysemous words appearing in the domain specific corpus; (ii) it is performed by examining all (term, synset) input pairs and answering the question: does the term belong to the synset in the specific domain? However, while the manual pruning task was regarded as a filtering task, where a human evaluator assigns a boolean value to each pruning candidate, the automatic pruning task can be more conveniently regarded as a ranking task, where all the pruning candidates are assigned a score, measuring how appropriate a given sense is for a given word in the domain at hand. The actual pruning is left as a subsequent step. Different pruning thresholds can be applied to the ranked list, based on different considerations (e.g. depending on whether a stronger emphasis is put on the precision or the recall of the resulting database).

The score is based on the frequencies of both the words in the synset (except the word under consideration) and the words in the sense gloss. We also remove from the gloss all words belonging to a stoplist (a stoplist provided with WordNet was used for this purpose). The following scoring formula is used:

(8) (average_synset_frequency / synset_cardinality^k) +
    (average_gloss_frequency / gloss_cardinality^k)

Note that the synset cardinality does not include the word under consideration, reflecting the fact that the word's frequency is not used in calculating the score. Therefore a synset only containing the word under consideration and no synonyms is assigned cardinality 0.

The goal is to identify (term, sense) pairs not pertaining to the domain. For this reason we tend to assign high scores to candidates for which we do not have enough evidence about their inappropriateness. This is why average frequencies are divided by some factor which is a function of the number of averaged frequencies, in order to increase the scores based on little evidence (i.e. fewer averaged numbers). In the sample application described in section 3 the value of k was set to 2. For analogous reasons, we conventionally assign a very high score to candidates for which we have no evidence (i.e. no words in either the synset or the gloss). If only the synset or only the gloss is empty, we conventionally double the score for the gloss or the synset, respectively. We note at this point that our final ranking list is sorted in reverse order with respect to the assigned scores, since we are focusing on removing incorrect items. At the top of the list are the items that receive the lowest scores, i.e. that are more likely to be incorrect (term, sense) associations for our domain (thus being the best candidates to be pruned out).

Table 2 shows the ranking of the senses for the word C in the aviation domain. In the table, each term is followed by its corpus frequency, separated by a slash. From each synset the word C itself has been removed, as well as the gloss words found in the stoplist. Therefore, the table only contains the words that contribute to the calculation of the sense's score. E.g. the score for the first sense in the list is obtained from the following expression:

(9) ((0 + 57)/2/2^2) + ((8 + 0 + 0 + 198 + 9559 + 0 + 1298)/7/7^2)

The third sense in the list exemplifies the case of an empty synset (i.e. a synset originally containing only the word under consideration). In this case the score obtained from the gloss is doubled. Note that the obviously incorrect sense of C as a narcotic is in the middle of the list. This is due to a tagging problem, as the word leaves in the gloss was tagged as a verb instead of a noun. Therefore it was assigned a very high frequency, as the verb leave, unlike the noun leaf, is very common in the aviation domain. The last sense in the list also requires a brief explanation. The original word in the gloss was 10S. However, the pre-processor that was used before tagging the glosses recognized S as an abbreviation for South and expanded the term accordingly. It so happens that both words 10 and South are very frequent in the aviation corpus we used, therefore the sense was assigned a high score.

Table 2: Ranking of synsets containing the word C

Score    Frequencies
39.37    synset: ATOMIC_NUMBER_6/0, CARBON/57
         gloss:  ABUNDANT/8, NONMETALLIC/0, TETRAVALENT/0, ELEMENT/198,
                 OCCUR/9559, ALLOTROPIC/0, FORM/1298
62.75    synset: AMPERE-SECOND/0, COULOMB/0
         gloss:  UNIT/3378, ELECTRICAL/2373, CHARGE/523, EQUAL/153,
                 AMOUNT/1634, CHARGE/523, TRANSFER/480, CURRENT/242,
                 1/37106, AMPERE/4, 1/37106
224.28   synset: (empty)
         gloss:  GENERAL-PURPOSE/0, PROGRAMING/0, LANGUAGE/445, CLOSELY/841,
                 ASSOCIATE/543, UNIX/0, OPERATE/5726, SYSTEM/49863
241.69   synset: COCAIN/0, COCAINE/1, COKE/8, SNOW/3042
         gloss:  NARCOTIC/1, ALKALOID/0, EXTRACT/31, COCA/1, LEAVE/24220
585.17   synset: LIGHT_SPEED/1, SPEED_OF_LIGHT/0
         gloss:  SPEED/14665, LIGHT/22481, TRAVEL/105, VACUUM/192
743.28   synset: DEGREE_CELSIUS/24, DEGREE_CENTIGRADE/28
         gloss:  DEGREE/43617, CENTIGRADE/34, SCALE/540, TEMPERATURE/2963
1053.43  synset: 100/0, CENTRED/0, CENTURY/31, HUNDRED/0, ONE_C/0
         gloss:  TEN/73, 10/16150, SOUTH/12213
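The scoring step can be sketched as follows, under the simplifying assumptions already made above (bare-lemma frequency keys, illustrative names). The choice of float('inf') as the "very high score" for candidates with no evidence is ours, since the exact constant is not specified here; since the list is sorted in ascending order, such candidates end up at the bottom and are never proposed for pruning.

```python
def sense_score(term: str, synset_lemmas: list, gloss_words: list,
                corpus_freq: dict, stoplist: set, k: int = 2) -> float:
    """Automatic pruning score (formula 8). Low scores flag (term, sense)
    pairs that look inappropriate for the domain."""
    syn = [w for w in synset_lemmas if w != term]          # exclude the word itself
    gloss = [w for w in gloss_words if w not in stoplist]  # drop stoplisted gloss words

    def half(words):
        if not words:
            return None
        avg = sum(corpus_freq.get(w, 0) for w in words) / len(words)
        return avg / len(words) ** k

    syn_half, gloss_half = half(syn), half(gloss)
    if syn_half is None and gloss_half is None:
        return float("inf")        # no evidence at all: conventionally a very high score
    if syn_half is None:
        return 2 * gloss_half      # empty synset: double the gloss score
    if gloss_half is None:
        return 2 * syn_half        # empty gloss: double the synset score
    return syn_half + gloss_half

# For the first sense of C in Table 2 this reproduces expression (9),
# yielding roughly 39.37. Candidates are then ranked in ascending order of
# score, so the most likely incorrect (term, sense) pairs surface first, e.g.
#   ranked = sorted(candidates,
#                   key=lambda c: sense_score(*c, corpus_freq, stoplist))
# assuming each candidate is a (term, synset_lemmas, gloss_words) triple.
```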
2.4 Optimization

The aim of this phase is to improve the access speed to the synonym database, by removing all information that is not likely to be used. The main idea is to minimize the size of the database in such a way that the database behavior remains unchanged. Two operations are performed at this stage: (i) a simple relevance test, to remove irrelevant terms (i.e. terms not pertaining to the domain at hand); (ii) a redundancy check, to remove information that, although perhaps relevant, does not affect the database behavior.

2.4.1 Relevance test

Terms not appearing in the domain corpus are considered not relevant to the specific domain and removed from the synonym database. The rationale underlying this step is to remove from the synonym database synonymy relations that are never going to be used in the specific domain. In this way the efficiency of the module can be increased, by reducing the size of the database and the number of searches performed (synonyms that are known to never appear are not searched for), without affecting the system's matching accuracy. E.g., the synset in (10a) would be reduced to the synset in (10b):

(10) a. AMPERE-SECOND/0, COULOMB/0, C/9168
     b. C/9168

2.4.2 Redundancy check

The final step is the removal of redundant synsets, possibly as a consequence of the previous pruning steps. Specifically, the following synsets are removed:

• Synsets containing a single term (although the associated sense might be a valid one for that term, in the specific domain).
• Duplicate synsets, i.e. synsets identical (in terms of synset elements) to some other synset not being removed (the choice of the only synset to be preserved is arbitrary).

E.g., the synset in (10b) would be finally removed at this stage.
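The two optimization steps can be sketched as follows, again over a simplified representation in which a synset is just a list of lemmas; the function name and data layout are illustrative, not the actual implementation.

```python
def optimize(synsets: list, corpus_freq: dict) -> list:
    """Optimization pass: drop terms unseen in the domain corpus (relevance
    test), then drop singleton and duplicate synsets (redundancy check)."""
    # (i) Relevance test: remove terms that never occur in the domain corpus.
    reduced = [[w for w in s if corpus_freq.get(w, 0) > 0] for s in synsets]

    # (ii) Redundancy check: remove synsets with fewer than two remaining
    # terms, and keep only one copy of any duplicated synset.
    seen, result = set(), []
    for s in reduced:
        key = frozenset(s)
        if len(s) < 2 or key in seen:
            continue
        seen.add(key)
        result.append(s)
    return result

# Example (10): {AMPERE-SECOND, COULOMB, C} first shrinks to {C} (relevance
# test), and the resulting singleton is then removed (redundancy check).
assert optimize([["AMPERE-SECOND", "COULOMB", "C"]], {"C": 9168}) == []
```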
3 Sample application

The described methodology was applied to the aviation domain. We used the Aviation Safety Reporting System (ASRS) corpus (http://asrs.arc.nasa.gov/) as our aviation specific corpus. The resulting domain-specific database is being used in an IR application that retrieves documents relevant to user defined queries, expressed as phrase patterns, and identifies portions of text that are instances of the relevant phrase patterns. The application makes use of Natural Language Processing (NLP) techniques (tagging and partial parsing) to annotate documents. User defined queries are matched against such annotated corpora. Synonyms are used to expand occurrences of specific words in such queries. In the following two sections we describe how the pruning process was performed and provide some results.

3.1 Adapting WordNet to the aviation domain

A vocabulary of relevant query terms was made available by a user of our IR application and was used in our ranking of synonymy relations. Manual pruning was performed on the 1000 top ranking terms, with which 6565 synsets were associated overall. The manual pruning task was split between two human evaluators. The evaluators were programmers, members of our staff. They were English native speakers who had acquaintance with our IR application and with the goals of the manual pruning process, but no specific training or background on lexicographic or WordNet-related tasks. For each of the 1000 terms, the evaluators were provided with a sample of 100 (at most) sentences where the relevant word occurred in the ASRS corpus. 100 of the 1000 manually checked clusters (i.e. groups of synsets referring to the same head term) were submitted to both evaluators (576 synsets overall), in order to check the rate of agreement of their evaluations. The evaluators were allowed to leave synsets unanswered when the synsets only contained the head term (and at least one other synset in the cluster had been deemed correct). Leaving out the cases when one or both evaluators skipped the answer, there remained 418 synsets for which both answered. There was agreement in 315 cases (75%) and disagreement in 103 cases (25%). A sample of senses on which the evaluators disagreed is shown in (11). In each case, the term being evaluated is the first in the synset.

(11) a. {about, around}
        (in the area or vicinity)
     b. {accept, admit, take, take on}
        (admit into a group or community)
     c. {accept, consent, go for}
        (give an affirmative reply to)
     d. {accept, swallow}
        (tolerate or accommodate oneself to)
     e. {accept, take}
        (be designed to hold or take)
     f. {accomplished, effected, established}
        (settled securely and unconditionally)
     g. {acknowledge, know, recognize}
        (discern)
     h. {act, cognitive operation, cognitive process, operation, process}
        (the performance of some composite cognitive activity)
     i. {act, act as, play}
        (pretend to have certain qualities or state of mind)
     j. {action, activeness, activity}
        (the state of being active)
     k. {action, activity, natural action, natural process}
        (a process existing in or produced by nature (rather than by the intent of human beings))

It should be noted that the 'yes' and 'no' answers were not evenly distributed between the evaluators. In 80% of the cases of disagreement, it was evaluator A answering 'yes' and evaluator B answering 'no'. This seems to suggest that one of the reasons for disagreement was a different degree of strictness in evaluating. Since the evaluators matched a sense against an entire corpus (represented by a sample of occurrences), one common situation may have been that a sense did occur, but very rarely. Therefore, the evaluators may have applied different criteria in judging how many occurrences were needed to deem a sense correct. This discrepancy, of course, may compound with the fact that the differences among WordNet senses can sometimes be very subtle.
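For reference, agreement figures of this kind (and the κ value reported in the conclusions) can be computed from a 2x2 contingency table of the two evaluators' yes/no decisions, as in the sketch below. The counts in the example are hypothetical, chosen only to be consistent with the reported totals (418 doubly-judged synsets, 315 agreements, and roughly an 80/20 split of the disagreements); the full table is not given here.

```python
def agreement_and_kappa(n_yy: int, n_yn: int, n_ny: int, n_nn: int):
    """Observed agreement and Cohen's kappa for two evaluators' yes/no answers.
    n_yy = both said 'yes', n_yn = A 'yes' / B 'no', and so on."""
    total = n_yy + n_yn + n_ny + n_nn
    observed = (n_yy + n_nn) / total
    # Chance agreement from the marginal 'yes'/'no' rates of each evaluator.
    a_yes, b_yes = (n_yy + n_yn) / total, (n_yy + n_ny) / total
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical counts for illustration only: 418 synsets judged by both
# evaluators, 315 agreements, and disagreements split roughly 80/20 between
# the two evaluators. This yields about 75% observed agreement and kappa ~0.5.
print(agreement_and_kappa(200, 82, 21, 115))
```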
Automatic pruning was performed on the entire WordNet database, regardless of whether candidates had already been manually checked or not. This was done for testing purposes, in order to check the results of automatic pruning against the test set obtained from manual pruning. Besides associating ASRS frequencies with all words in synsets and glosses, we also computed frequencies for collocations (i.e. multi-word terms) appearing in synsets. The input to automatic pruning was constituted by 10352 polysemous terms appearing at least once in the ASRS corpus. Such terms correspond to 37494 (term, synset) pairs. Therefore, the latter was the actual number of pruning candidates that were ranked.

The check of WordNet senses against ASRS senses was only done unidirectionally, i.e. we only checked whether WordNet senses were attested in ASRS. Although it would be interesting to see how often the appropriate, domain-specific senses were absent from WordNet, no check of this kind was done. We took the simplifying assumption that WordNet is complete, thus aiming at assigning at least one WordNet sense to each term that appeared in both WordNet and ASRS.

3.2 Results

In order to test the automatic pruning performance, we ran the ranking procedure on a test set taken from the manually checked files. This file had been set apart and had not been used in the preliminary tests on the automatic pruning algorithm. The test set included 350 clusters, comprising 2300 candidates. 1643 candidates were actually assigned an evaluation during manual pruning. These were used for the test. We extracted the 1643 relevant items from our ranking list, then we incrementally computed precision and recall in terms of the items that had been manually checked by our human evaluators. The results are shown in figure 1. As an example of how this figure can be interpreted, taking into consideration the top 20% of the ranking list (along the X axis), an 80% precision (Y axis) means that 80% of the items encountered so far had been removed in manual pruning; a 27% recall (Y axis) means that 27% of the overall manually removed items have been encountered so far.

The automatic pruning task was intentionally framed as a ranking problem, in order to leave open the issue of what pruning threshold would be optimal. This same approach was taken in the IR application in which the pruning procedure was embedded. Users are given the option to set their own pruning threshold (depending on whether they focus more on precision or recall), by setting a value specifying what precision they require. Pruning is performed on the top section of the ranking list that guarantees the required precision, according to the correlation between precision and amount of pruning shown in figure 1.
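The evaluation and the thresholding policy just described can be sketched as follows; candidates are assumed to be hashable (term, sense) identifiers, and the function names are ours rather than part of the described system.

```python
def precision_recall_curve(ranked_candidates: list, manually_removed: set):
    """Walk the ranking list top-down; after each item report the fraction of
    items seen so far that were manually removed (precision) and the fraction
    of all manually removed items seen so far (recall)."""
    curve, hits = [], 0
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in manually_removed:
            hits += 1
        precision = hits / rank
        recall = hits / len(manually_removed)
        curve.append((rank, precision, recall))
    return curve

def pruning_threshold(curve, required_precision: float) -> int:
    """Prune the longest top section of the ranking list whose precision still
    meets the user's requirement (the policy described for the IR application)."""
    best = 0
    for rank, precision, _ in curve:
        if precision >= required_precision:
            best = rank
    return best
```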
A second test was designed to check whether there is a correlation between the levels of confidence of automatic and manual pruning. For this purpose we used the file that had been manually checked by both human evaluators. We took into account the candidates that had been removed by at least one evaluator: the candidates that were removed by both evaluators were deemed to have a high level of confidence, while those removed by only one evaluator were deemed to have a lower level of confidence. Then we checked whether the two classes were equally distributed in the automatic pruning ranking list, or whether higher confidence candidates tended to be ranked higher than lower confidence ones. The results are shown in figure 2, where the automatic pruning recall for each class is shown. For any given portion of the ranking list, higher confidence candidates (solid line) have a significantly higher recall than lower confidence candidates (dotted line).

Finally, table 3 shows the result of applying the described optimization techniques alone, i.e. without any prior pruning, with respect to the ASRS corpus. The table shows how many synsets and how many word-senses are contained in the full WordNet database and in its optimized version. Note that such reduction does not involve any loss of accuracy.

Table 3: WordNet optimization results.

    DB          Synsets  Word-senses
    Full WN      99,642      174,008
    Reduced WN    9,441       23,368

4 Conclusions

There is a need for automatically or semi-automatically adapting NLP components to specific domains, if such components are to be effectively used in IR applications without involving labor-intensive manual adaptation. A key part of adapting NLP components to specific domains is the adaptation of their lexical and terminological resources. It may often be the case that a consistent section of a general purpose terminological resource is irrelevant to a specific domain, thus involving an unnecessary amount of ambiguity that affects both the accuracy and efficiency of the overall NLP component. In this paper we have proposed a method for adapting a general purpose synonym database to a specific domain.

Evaluating the performance of the proposed pruning method is not a straightforward task, since there are no other results available on a similar task, to the best of our knowledge. However, a comparison between the results of manual and automatic pruning provides some useful hints. In particular:

• The discrepancy between the evaluations of the human operators shows that the task is elusive even for humans (the value of the agreement evaluation statistic κ for our human evaluators was 0.5);

• however, the correlation between the level of confidence of human evaluations and the scores assigned by the automatic pruning procedure shows that the automatic pruning algorithm captures some significant aspect of the problem.

Although there is probably room for improving the automatic pruning performance, the preliminary results show that the current approach is pointing in the right direction.

References

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Claudia Leacock, Martin Chodorow, and George A. Miller. 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147-165.

Bernardo Magnini and Gabriela Cavaglià. 2000. Integrating Subject Field Codes into WordNet. In Maria Gavrilidou, George Carayannis, Stella Markantonatou, Stelios Piperidis, and Gregory Stainhaouer, editors, Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), pages 1413-1418, Athens, Greece.

Mark Sanderson. 1997. Word Sense Disambiguation and Information Retrieval. Ph.D. thesis, Department of Computing Science, University of Glasgow, Glasgow. Technical Report TR-1997-7.

Ellen M. Voorhees. 1998. Using WordNet for text retrieval. In Fellbaum (1998), chapter 12, pages 285-303.
[Figure 1 appears here: precision and recall of automatic pruning (%, Y axis) plotted against the top % of the ranking list (X axis).]

Figure 1: Precision and recall of automatic pruning

[Figure 2 appears here: automatic pruning recall (%, Y axis) plotted against the top % of the ranking list (X axis), for higher confidence (solid line) and lower confidence (dotted line) candidates.]

Figure 2: A recall comparison for different confidence rates
