Adapting a Synonym Database to Specific Domains Davide Turcato Fred Popowich Janine Toole Dan Pass Devlan Nicholson Gordon Tisher Gavagai Technology Inc
Total Page:16
File Type:pdf, Size:1020Kb
Adapting a synonym database to specific domains Davide Turcato Fred Popowich Janine Toole Dan Pass Devlan Nicholson Gordon Tisher gavagai Technology Inc. P.O. 374, 3495 Ca~abie Street, Vancouver, British Columbia, V5Z 4R3, Canada and Natural Language Laboratory, School of Computing Science, Simon Fraser University 8888 University Drive, Burnaby, British Columbia, V5A 1S6, Canada {turk, popowich ,toole, lass, devl an, gt i sher}@{gavagai, net, as, sfu. ca} Abstract valid for English, would be detrimental in a specific domain like weather reports, where This paper describes a method for both snow and C (for Celsius) occur very fre- adapting a general purpose synonym quently, but never as synonyms of each other. database, like WordNet, to a spe- We describe a method for creating a do- cific domain, where only a sub- main specific synonym database from a gen- set of the synonymy relations de- eral purpose one. We use WordNet (Fell- fined in the general database hold. baum, 1998) as our initial database, and we The method adopts an eliminative draw evidence from a domain specific corpus approach, based on incrementally about what synonymy relations hold in the pruning the original database. The domain. method is based on a preliminary Our task has obvious relations to word manual pruning phase and an algo- sense disambiguation (Sanderson, 1997) (Lea- rithm for automatically pruning the cock et al., 1998), since both tasks are based database. This method has been im- on identifying senses of ambiguous words in plemented and used for an Informa- a text. However, the two tasks are quite dis- tion Retrieval system in the aviation tinct. In word sense disambiguation, a set of domain. candidate senses for a given word is checked against each occurrence of the relevant word 1 Introduction in a text, and a single candidate sense is se- lected for each occurrence of the word. In our Synonyms can be an important resource for synonym specialization task a set of candidate Information Retrieval (IR) applications, and senses for a given word is checked against an attempts have been made at using them to entire corpus, and a subset of candidate senses expand query terms (Voorhees, 1998). In is selected. Although the latter task could be expanding query terms, overgeneration is as reduced to the former (by disambiguating all much of a problem as incompleteness or lack occurrences of a word in a test and taking of synonym resources. Precision can dramat- the union of the selected senses), alternative ically drop because of false hits due to in- approaches could also be used. In a specific correct synonymy relations. This problem is domain, where words can be expected to be particularly felt when IR is applied to docu- monosemous to a large extent, synonym prun- ments in specific technical domains. In such ing can be an effective alternative (or a com- cases, the synonymy relations that hold in the plement) to word sense disambiguation. specific domain are only a restricted portion of the synonymy relations holding for a given From a different perspective, our language at large. For instance, a set of syn- task is also related to the task of as- onyms like signing Subject Field Codes (SFC) to a terminological resource, as done by (1) {cocaine, cocain, coke, snow, C} Magnini and Cavagli~ (2000) for WordNet. Assuming that a specific domain corresponds classes (to improve speed and accuracy, re- to a single SFC (or a restricted set of SFCs, spectively) and including only relations in the at most), the difference between SFC as- third class. signment and our task is that the former The overall goal of the described method assigns one of many possible values to a given is to inspect all synonymy relations in Word- synset (one of all possible SFCs), while the Net and classify each of them into one of the latter assigns one of two possible values (the three aforementioned classes. We define a words belongs or does not belong to the SFC synonymy relation as a binary relation be- representing the domain). In other words, tween two synonym terms (with respect to SFC assignment is a classification task, while • a particular sense). Therefore, a WordNet ours can be seen as either a filtering or synset containing n terms defines ~11 k syn- ranking task. onym relations. The assignment of a syn- Adopting a filtering/ranking perspective onymy relation to a class is based on evidence makes apparent that the synonym pruning drawn from a domain specific corpus. We use task can also be seen as an eliminative pro- a tagged and lemmatized corpus for this pur- cess, and as such it can be performed incre- pose. Accordingly, all frequencies used in the mentally. In the following section we will rest of the paper are to be intended as fre- show how such characteristics have been ex- quencies of (lemma, tag) pairs. ploited in performing the task. The pruning process is carried out in three In section 2 we describe the pruning steps: (i) manual pruning; (ii) automatic methodology, while section 3 provides a prac- pruning; (iii) optimization. The first two tical example from a specific domain. Con- steps focus on incrementally eliminating in- clusions are offered in section 4. correct synonyms, while the third step focuses on removing irrelevant synonyms. The three 2 Methodology steps are described in the following sections. 2.1 Outline 2.2 Manual pruning The synonym pruning task aims at improv- ing both the accuracy and the speed of a syn- Different synonymy relations have a different onym database. In order to set the terms of impact on the behavior of the application in the problem, we find it useful to partition the which they are used, depending on how fre- set of synonymy relations defined in WordNet quently each synonymy relation is used. Rela- into three classes: tions involving words frequently appearing in either queries or corpora have a much higher . Relations irrelevant to the specific do- impact (either positive or negative) than re- main (e.g. relations involving words that lations involving rarely occurring words. E.g. seldom or never appear in the specific do- the synonymy between snow and C has a main) higher impact on the weather report domain (or the aviation domain, discussed in this pa- . Relations that are relevant but incorrect per) than the synonymy relation between co- in the specific domain (e.g. the syn- caine and coke. Consequently, the precision of onymy of two words that do appear in the a synonym database obviously depends much specific domain, but are only synonyms more on frequently used relations than on in a sense irrelevant to the specific do- rarely used ones. Another important consid- main); eration is that judging the correctness of a 3. Relations that are relevant and correct in given synonymy relation in a given domain is the specific domain. often an elusive issue: besides clearcut cases, there is a large gray area where judgments The creation of a domain specific database may not be trivial even for humans evalua- aims at removing relations in the first two tots. E.g. given the following three senses of the noun approach with only one sense are not good candidates for pruning (under the assumption of com- (2) a. {approach, approach path, glide pleteness of the synonym database). There- path, glide slope} fore .polysemous terms are prioritized over (the final path followed by an air- monosemous terms. craft as it is landing) Criterion 2. The second and third sort- b. {approach, approach shot} ing criteria axe similar, the only difference be- (a relatively short golf shot in- ing that the second criterion assumes the ex- tended to put the ball onto the istence of some inventory of relevant queries putting green) (a term list, a collection of previous queries, c. {access, approach} etc.), ff such an inventory is not available, the (a way of entering or leaving) second sorting criterion can be omitted. If the inventory is available, it is used to check which it would be easy to judge the first and second synonymy relations are actually to be used in senses respectively relevant and irrelevant to queries to the domain corpus. Given a pair the aviation domain, but the evaluation of the (ti,tj) of synonym terms, a score (which we third sense would be fuzzier. name scoreCQ) is assigned to their synonymy The combination of the two remarks above relation, according to the following formula: induced us to consider a manual pruning phase for the terms of highest 'weight' as a (3) scoreCQij = good investment of human effort, in terms of (fcorpusi * fqueryj) + rate between the achieved increase in preci- (fcorpusj • fqueryi) sion and the amount of work involved. A second reason for performing an initial man- where fcorpusn and fqueryn are, respec- ual pruning is that its outcome can be used tively, the frequencies of a term in the domain as a reliable test set against which automatic corpus and in the inventory of query terms. pruning algorithms can be tested. The above formula aims at estimating how Based on such considerations, we included a often a given synonymy relation is likely to manual phase in the pruning process, consist- be actually used. In particular, each half of ing of two steps: (i) the ranking of synonymy the formula estimates how often a given term relations in terms of their weight in the spe- in the corpus is likely to be matched as a syn- cific domain; (ii) the actual evaluation of the onym of a given term in a query. Consider, correctness of the top ranking synonymy re- e.g., the following situation (taken form the lation, by human evaluators.