Extraction of Semantic Clusters for Terminological Information Retrieval from MRDs

Gerardo Sierra*, John McNaught†

*Instituto de Ingeniería, UNAM, Apdo. Postal 70-472, México 04510, D.F. [email protected] †Centre for Computational Linguistics, UMIST, P.O. Box 88, Manchester M60 1QD, U.K. [email protected]

Abstract This paper describes a semantic clustering method for data extracted from machine readable dictionaries (MRDs) in order to build a terminological information retrieval system that finds terms from descriptions of concepts. We first examine approaches based on ontologies and statistics, before introducing our analogy-based approach that lets us extract semantic clusters by aligning definitions from two dictionaries. Evaluation of the final set of clusters for a small set of definitions demonstrates the utility of our approach.

1. Background

The majority of lexicographers recognise the need for dictionaries that, contrary to the alphabetical ordering of entries, help users to look for a word that has escaped their memory even though they remember the concept. From a semantic point of view, Baldinger (1980) identifies two kinds of dictionaries. The semasiological one corresponds to the viewpoint of the person interpreting the speaker: one starts with the form of the expression to look for the meaning. It is alphabetically arranged. The onomasiological type, from the perspective of the speaker, allows one to start from the mental object and look for its designations. It is arranged by concepts and splits up the fields of meanings. E.g., the name “green” may be found under the concepts “colour”, “vegetable” or “inexperience”.

In the context of computational linguistics, it has been shown that machine readable dictionaries (MRDs), which are conventional semasiological dictionaries, can be used for onomasiological searches. This is based on the assumption that semasiological dictionaries have the necessary information in the first place. Kipfer (1986) stated that a dictionary can be considered as a matrix that maps between word senses, and that an on-line dictionary can be entered via words or senses. She added that in an on-line dictionary it is possible to find a word by following semantic links or by genus. For example, if a user needs to locate the word expressing a group of ducks (flock), they can check the entry for duck. Calzolari (1988) stated that an on-line dictionary can be used to seek a word through examining definitions that contain words supplied by the user (clue words).

The success of an onomasiological search relies upon the accuracy of all clue words in the concept definition that might represent the target word the user is looking for. Since the user often does not employ precisely the same terminology as the indexed keywords or the stored full-text database, the retrieved words may be far from the concept desired. When the result is not satisfactory, the user can expand the query with closely related keywords which enhance the meaning, such as alternative forms or cross-references. As a result, it has been found advantageous to expand the original query with closely related keywords (Fox, 1988). In addition to the user’s own knowledge of expressing the same concept in alternative ways, a relational thesaurus brings related words together and thereby helps to stimulate the user’s memory. Some systems provide an on-line thesaurus as a facility for the user in this regard. However, formalising a concept with the exact clue words is sometimes a heavy task for the user, and searching can become harder still if the user has also to identify clusters of related keywords, particularly when the query is expressed in natural language. The systematisation of this task has hence been placed on the system, which can provide clusters during the search session in order to allow the user to select the best ones, or alternatively they can be used automatically by the system.

In order to help the user focus on the search, it is convenient that the system produces and manages the semantic clusters transparently, without any intervention by the user. In fact, this is the goal of a user-friendly onomasiological search system, and the success of such a system relies on the accurate identification of the semantic clusters.

2. Survey of clustering

Clustering has been applied to almost every discipline. The process of identifying clusters has variously been called cluster analysis, classification, categorisation, typology or clumping, according to the discipline. The purpose of clustering varies from classification and sorting to the development of inductive generalisations (Anderberg, 1973). The primary goal of clustering is to collect together into clusters a set of elements associated by some common characteristic. The members of a cluster A are strongly associated with each other because they share the same property, while members of other clusters show characteristics distinct from those of A. Clustering may alternatively be oriented either to discover the strongest association among members or to seek members which are isolated from each other. Clustering is often based on measurements of the similarity or dissimilarity between a pair of objects, these objects being either single members or other clusters. A cluster is defined by its members and often by the “central concept” with which all the cluster’s members are associated (McRoy, 1992). This central concept could be the common characteristic, the particular conceptual parent or even any member, when there is no need to specify the exact nature of the association among the members. The identification of the central concept relies on the variables that are used to characterise the elements of the problem: characteristics, attributes, class memberships or other such properties. Here, our focus is on semantic variables.
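As a minimal illustration of similarity-based clustering (a sketch of the general idea only, not the method proposed later in this paper), the following groups elements by transitive association over a pairwise similarity relation:

```python
# Sketch: single-link clustering over a pairwise similarity relation.
# The similarity test and word list below are illustrative assumptions.

def single_link_clusters(items, similar):
    """Group items into clusters; two items share a cluster whenever a
    chain of pairwise-similar items connects them (transitive closure)."""
    clusters = []
    for item in items:
        # Find every existing cluster this item is similar to ...
        linked = [c for c in clusters if any(similar(item, m) for m in c)]
        merged = {item}
        for c in linked:            # ... and merge them all into one.
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Toy similarity: words count as "similar" if they share a 4-letter prefix.
words = ["measure", "measuring", "meter", "metering", "gauge"]
result = single_link_clusters(words, lambda a, b: a[:4] == b[:4])
```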
Clustering methods that identify semantically similar words are broadly divided into relation-based and distribution-based approaches (Hirakawa, Xu & Haase, 1996). The former analyse relations in an ontology, while the latter use statistical analysis. In the terminology of Grefenstette (1996), these methods can be called knowledge-rich, based on a conceptual dependency representation, and knowledge-poor, based on distributional analysis. From a methodological point of view, there is, in addition to the above two approaches, a little-known approach called the analogy-based approach. This employs an inferential process and is used in computational linguistics and artificial intelligence as an alternative to current rule-based linguistic models.

2.1. Relation-based clustering

Relation-based clustering methods rely on the relations in a taxonomy or ontology to judge the similarity between two concepts.
Since an ontology connects concepts, each located in a node, it is then possible to analyse either the taxonomic relations or just the conceptual distance between the nodes. A taxonomy lets us extract semantic relations, such as is-a or a-kind-of, to judge the similarity between two concepts by comparing their parents. A semantic network lets us derive similarity by determining the path length, or number of links, between the nodes.

The most important lexical knowledge resources that provide a basic ontology for clustering are the semantic taxonomy WordNet (Fellbaum, 1998) and Roget’s Thesaurus (1987).

2.1.1. WordNet

WordNet is an organised lexical resource of English nouns, verbs, and adjectives, widely used for relation-based applications. It is a hand-constructed system, designed on psycholinguistic principles at Princeton University. It relies, for each concept, on the set of words that can be used to express that concept, namely synonym sets, or synsets for short. The semantic relations between concepts and other relations are represented by cross-references between synsets, such as synonymy and antonymy, hyponymy and hypernymy (subordinate/superordinate), meronymy and holonymy (part–whole relation), and troponymy (manner-of, in verbs). These relations allow the lexicon to be structured into hierarchies, so that we are able to request, e.g., a list of all superordinates in the hierarchy for a given concept.

In order to determine closeness in meaning among words, Agirre and Rigau (1996) introduce the measure of conceptual distance, which is the shortest length that connects two concepts in the hierarchical net. Their method enables us to check the closeness of candidate hypernyms for a given hyponym. This measure may be applied to cluster similar concepts, so that candidate concepts with the shortest path in the hierarchy should be clustered. On the other hand, Resnik (1995) suggests WordNet for semantic clustering on the basis of the information content shared by the synsets in comparison.

However, we may add that the technique of conceptual distance, or number of links, is highly dependent on the degree of density of coverage of the conceptual space that the WordNet lexicographers have been able to achieve in an area. Besides, it is also appropriate to point out some drawbacks observed by researchers applying WordNet to specific purposes (Arranz, 1998; Agirre & Rigau, 1996; Basili, Pazienza & Velardi, 1996): restricted types of semantic relationships; lack of cross-categorical semantic relations among nouns, verbs and adjectives; sense distinctions that are not always satisfactory; similar words that are not recognised in WordNet; and tags in WordNet that create over-ambiguity.

2.1.2. Roget’s Thesaurus

Roget’s Thesaurus has become rather popular for many applications. One reason is that it is a general thesaurus with broad vocabulary coverage, although it is likely to be missing many domain-specific words. Another reason is its well-organised structure in three hierarchical levels above the basic level of words, namely the category. Grefenstette (1996) has used Roget’s Thesaurus as a gold standard to evaluate distribution-based clustering methods, on the premise that there is a very low chance, 0.4%, of finding two words together under the same category. Therefore, he evaluates the results of these methods over Roget categories, in such a way that there is a hit when two words appear under the same category. Morris and Hirst (1991) used Roget’s Thesaurus as a knowledge base to identify lexical chains, not only on the basis of two words sharing a common category but also on the basis of other relationships, e.g. two words with different categories that both point to another common one. The assumption that two words connected by a category can be clustered together is, however, not always reliable. Exploration of Roget’s reveals how members of different semantic clusters may belong to the same category.

Word 1        Word 2        Score
measure       meter         10
instrument    measure       9
measure       scale         8
instrument    meter         7
instrument    record        7
…             …             …
apparatus     device        1
apparatus     instrument    1
device        instrument    1
graduate      measure       1
machine       tool          1
gauge         weigh         1

Table 1: Top words sharing Roget’s categories

Some words, such as “apparatus” and “device”, are connected by one category, while other words, such as “instrument” and “tool”, share up to four categories. Assuming that two words are more semantically related if they share a larger number of categories, top scores for both paradigms are presented in table 1. Half of the high scores relate words of two different paradigms, while low scores poorly relate members of the same intuitive paradigm.
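The shared-category scoring idea can be sketched as follows; the tiny category map below is an invented illustration, not Roget's actual data:

```python
from itertools import combinations
from collections import Counter

# Sketch: score word pairs by the number of thesaurus categories in
# which they appear together. The category map is toy data.

def shared_category_scores(categories):
    """Map each unordered word pair to the number of categories that
    contain both words."""
    scores = Counter()
    for members in categories.values():
        for a, b in combinations(sorted(set(members)), 2):
            scores[(a, b)] += 1
    return scores

roget_like = {
    "measurement": ["measure", "meter", "instrument"],
    "apparatus":   ["instrument", "tool", "device"],
    "means":       ["instrument", "tool"],
}
scores = shared_category_scores(roget_like)
```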
2.1.3. Remarks

There are two possible uses of ontologies for clustering. The first is to use them as a lexical knowledge base from which to extract information and build clusters. The second is to use them as a “gold standard” against which to check candidate clusters previously determined by some other method and other resources. With respect to the first use, it is convenient to know the amount of information each one provides. WordNet and Roget’s Thesaurus describe a huge number of members for a paradigm; that is, they seem sufficient. Nevertheless, only a few words of a category may be interchangeable in the same context and thus usable as members of the same paradigm. This means that not all words in a category are necessary. Better possibilities of use exist for these lexical resources as a “gold standard” for clustering. WordNet’s and Roget’s information seems quite sufficient to corroborate the similarity of a candidate pair of words, but only in the case where such a pair refers to two words that are already similar. As observed above, members of two different paradigms may belong to the same category.

2.2. Distribution-based clustering

Conversely to the semantic relations extracted from an ontology, distribution-based clustering methods depend on purely statistical analysis of the lexical occurrences in running texts. The basis for the statistical approach is that the similarity of words can be judged by analysing the similarity of the surrounding contexts in which they occur, since it has been observed that two synonymous words share similar contexts when they occur separately.

There has been a great deal of research to find the similarity between words on the basis of the similarity of the context in which they occur. The methods derived from the distribution of words in corpora vary in the type and size of the corpus, the span of the context for the analysis, and the measure of significance applied. Brown et al. (1992) use mutual information in a window of 1,001 words, excluding the two words before and after the keyword, on large corpora. Mutual information has also been used to analyse on-line dictionary definitions and textual corpora (Fukumoto and Tsujii, 1994; Fukumoto and Suzuki, 1996); the latter used a window size of 100 words. Edmonds (1997), besides mutual information, applies t-scores within a span of four words in a corpus of almost 3 million words.

Despite their generally good results and the attempts to produce clusters with semantic coherence, it is appropriate to note some drawbacks that arise with distribution-based methods. The use of statistical techniques to find similar words faces difficulties when it is fully automated, and new methods attempt insofar as possible to solve these difficulties. Earlier studies encountered drawbacks with the treatment of independent variant forms, such as spelling variation and inflectional endings of words (Adamson and Boreham, 1974). Although most corpus analysis software allows us to analyse variations of a word in the same utterance, it requires additional effort that reduces the efficiency of the method. The foremost reason is that distribution-based methods require us to process a large amount of data in order to get more reliable results (Habert et al., 1996; Arranz, 1998). However, the use of large corpora is not always practical, due to economic, time or capability factors. The consequences of lacking large corpora include results based on low-frequency words, which are quite unrepresentative for clustering.

Grefenstette (1996) suggests that a mixture of different methods, rather than any single statistical technique, may be adapted so as to be usefully applied to all ranges of frequency of words encountered in a corpus: for more frequent words, finer-grained context discrimination; for less frequent words, windows of N words; for rare words, large windows, even up to entire document level. Regardless of the method used and of its reliability, there is always the task of checking the accuracy of the final clusters, due to some strange results that occur for reasons that are not immediately apparent. For example, Charniak (1993) shows that many clusters present typical antonyms as similar adjectives. As he states, “there are some possibly intrinsic limits to how finely we can group semantically simply by looking at surface phenomena”.

2.3. Analogy-based clustering

As an alternative to the two traditional approaches described above, analogy-based methods have been proposed in computational linguistics for language processing. Federici and Pirrelli (1997) describe generalisation by analogy as the inferential process by which one can acquire knowledge of an unfamiliar linguistic object by drawing an analogy to more familiar objects, i.e., by extracting the right amount of linguistic knowledge from examples of similar objects. Through analogy to a set of unselected examples, analogy-based approaches can generalise and find all the rules to apply correctly, unlike rule-based approaches, which apply a single rule to a given context.

Jones (1996) suggests corpus alignment as a feasible analogy-based approach. The automatic alignment of parallel texts aims to discover which words of the target sentence are most likely to correspond with which words of the source sentence. For example, given the following two definitions for alkalimeter:

CED (1994): An apparatus for determining the concentration of alkalis in solution
OED2 (1994): An instrument for ascertaining the amount of alkali in a solution

alignment may identify which words in these definitions are equivalents of each other. Observation of the sentences lets us identify three pairs of words, namely: (apparatus, instrument), (determining, ascertaining) and (concentration, amount). Therefore, if the alignment of definitions from different dictionaries is used instead of pure statistical distribution-based methods or relation-based methods, it is possible to identify pairs of words used indistinguishably in similar contexts.

The appeal of using definitions as corpora for alignment is founded on two reasons. Firstly, dictionaries contain all the necessary information as a knowledge base for extracting keywords (Boguraev and Pustejovsky, 1996). Secondly, it is much easier to find the sentences for aligning, since definitions are distinguished by the entry.
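The alkalimeter example can be reproduced with a word-level edit-distance alignment. The following is a minimal dynamic-programming sketch; the unit edit costs are an illustrative assumption, not necessarily the exact costs used in any cited system:

```python
# Sketch: align two definitions word by word with a Levenshtein-style
# dynamic programme over word tokens. Unit costs are assumed.

def align(src, tgt):
    """Return (distance, pairs): matched word couples in order, with
    '--' marking an insertion or deletion."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    # Trace back through the table to recover the ordered pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            pairs.append((src[i - 1], "--"))
            i -= 1
        else:
            pairs.append(("--", tgt[j - 1]))
            j -= 1
    return d[n][m], pairs[::-1]

ced = "an apparatus for determining the concentration of alkalis in solution".split()
oed = "an instrument for ascertaining the amount of alkali in a solution".split()
dist, pairs = align(ced, oed)
```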
Through the alignment of definitions from two or more different sources, it is possible to retrieve pairs of words that can be used indistinguishably in the same sentence without changing the meaning of the concept. As lexicographic work relies on the same basis, such as genus and differentia, a concept is similarly defined by different dictionaries. The difference in the words used between two lexicographic sources lets us extend the lexical knowledge base, so that clustering is possible by merging two or more dictionaries into a single database and then using an appropriate alignment technique. Since alignment starts from the same entry of a dictionary, clustering is faster than with any other technique.

3. Our clustering algorithm

The analogy-based method proposed here to identify semantic clusters automatically aligns definitions from two different dictionary sources. Definitions are used as a lexical knowledge base because the clustering analysis is applied to terminological information retrieval. The method relies on the assumption that two authors use different words to express a definition. The use of several dictionaries and terminologies is an advantage, because of the variety of senses and words that can be used to define or describe a concept. A dictionary designed for foreign learners of a language provides much information with a minimum of words, such as LDOCE (Procter, 1978), which uses a core controlled vocabulary of about 2,000 words. A dictionary for children has less coverage, but it defines concepts simply and with examples (Barrière and Popowich, 1996). A dictionary for monolingual speakers can cover a wide variety of words. A terminology uses technical words to define a concept.

The alignment matches the words of two definitions and shows the correspondence between words that can replace each other in the definition without producing any major change of meaning. The difference in the words used between two or more lexicographic definitions enables us to infer paradigms by merging the dictionary definitions into a single database and then using our own alignment technique.
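Merging several sources into a single database keyed by headword can be sketched as follows (the entries shown are toy stand-ins, not the actual CED and OED2 text):

```python
# Sketch: merge definitions for the same entry from several dictionaries
# into one database keyed by headword. Entries below are toy data.

def merge_dictionaries(*dicts):
    merged = {}
    for d in dicts:
        for entry, definition in d.items():
            merged.setdefault(entry, []).append(definition)
    return merged

ced_like = {"alkalimeter": "an apparatus for determining the concentration of alkalis in solution"}
oed_like = {"alkalimeter": "an instrument for ascertaining the amount of alkali in a solution"}
db = merge_dictionaries(ced_like, oed_like)
```

Alignment can then be run over every pair of definitions stored under the same entry.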
3.1. Alignment of definitions

In order to align pairs of definitions for the same entry, drawn from two different dictionaries, our algorithm uses the Levenshtein distance (Levenshtein, 1966), a variation of a technique for measuring the similarity between lexical strings known as edit distance (Sellers, 1974; Waterman, 1996). Both match the words of two sentences in linear order and determine their correspondence. As they are defined as the smallest number of steps required to make two sentences identical, they identify the minimum cost of the operations required to change one definition into another. The operations are: substitution of one word for another, insertion of a word into a sentence, and deletion of a word from a sentence.

Experimental results have shown that applying the Levenshtein distance to stem forms gives better matches than using full forms. We use the stemming algorithm of Porter (1980), which removes endings from words. This algorithm, widely used in IR, has been chosen because it performs slightly better than other similar algorithms. It should be noted that the Porter algorithm causes some over- and understemming. The risk of overstemming is low, since there is not too great a chance of two different full forms in the same definition sharing a stem form. Understemming is more probable and can cause some words not to match; however, the proposed clustering procedure will eventually match them.

A dynamic programming method (Wagner & Fischer, 1974) lets us align the elements of two strings from the arrangement given by the calculation of the Levenshtein distance, and obtain the ordered pairs of the alignment. Every pair of words has associated with it the cost of the Levenshtein distance.

Word 1        Word 2         Cost
an            an             0
instrument    instrument     0
for           for            0
measuring     ascertaining   1
the           the            1
amount        quality        2
of            or             3
nitrogen      value          4
in            of             5
a             niter          6
substance     --             7

Table 2. Alignment for nitrometer

Table 2 gives the pairs of words aligned for the definition of nitrometer. The pairs of different words (measuring, ascertaining) and (amount, quality) are semantically related, considering the stem forms, but the other pairs of matched words are far from being related. Therefore, it is convenient to use a factor that measures the degree of similarity among the pairs of matched words.

3.2. Degree of similarity

The purpose of semantic clustering is to cluster those words that can be used indistinguishably in the same definition. For example, the pair (measuring, ascertaining) in table 2 means that, if we replace “measuring” with “ascertaining” or vice versa, the definitions from which that pair was obtained do not alter their meaning. As a measure to judge the degree of similarity between the candidates for semantic clusters, we introduce the concept of the longest collocation couple (lcc). This quantifies the surrounding pairs of equal words (stem forms) above and below a potential pair of similar words. Given the alignment of two strings consisting of ordered pairs of words, the lcc of a pair of matched words whose stem forms are different is the maximal length of the string of couples such that there is just one matched couple surrounded above and below by couples whose stem forms are equal; otherwise, the value of the lcc is equal to zero.

By this definition, only the pair of matched words (measuring, ascertaining) of table 2 presents a valuable degree of similarity; its lcc is equal to 5. The pair (amount, quality) is not considered, as the couple below it is not an equal couple of stem forms.

Experimental results on 314 terms for measuring instruments, extracted with their definitions from CED and OED2, show us that the greater the value of the lcc, the greater the similarity between the pair of words. By inspection of table 3, we observe that an lcc length equal to 5 is a reliable threshold.
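The lcc computation can be sketched as follows; the suffix-stripping function is a crude stand-in for the Porter stemmer, used here only to keep the example self-contained:

```python
# Sketch of the longest collocation couple (lcc). The stemmer below is a
# crude stand-in for Porter's algorithm, for illustration only.

def stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lcc(pairs, k):
    """lcc of the k-th aligned couple: the length of the run of
    equal-stem couples surrounding position k, plus the couple itself;
    zero unless at least one equal couple lies on each side."""
    a, b = pairs[k]
    if stem(a) == stem(b):
        return 0                      # only differing couples are scored
    length = 1
    i = k - 1
    while i >= 0 and stem(pairs[i][0]) == stem(pairs[i][1]):
        length, i = length + 1, i - 1
    j = k + 1
    while j < len(pairs) and stem(pairs[j][0]) == stem(pairs[j][1]):
        length, j = length + 1, j + 1
    if i == k - 1 or j == k + 1:      # no equal couple above or below
        return 0
    return length

pairs = [("an", "an"), ("instrument", "instrument"), ("for", "for"),
         ("measuring", "ascertaining"), ("the", "the"),
         ("amount", "quality"), ("of", "or")]
```

On this alignment the couple (measuring, ascertaining) scores 5, while (amount, quality) scores 0, matching the worked example above.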
ff_i              ff_j           lcc_ij
any               an             9
determining       measuring      9
celestial         heavenly       8
intensity         amount         8
one               that           8
swinging          turning        8
that              which          8
determining       which          7
inclination       direction      7
instrument        telescope      7
that              for            7
…                 …              …
radiofrequency    radio          3
resonator         combination    3
spherical         closed         3
such              at             3
to                in             3
to                on             3
used              serving        3
variation         form           3
variation         motion         3
variously         various        3
wavelength        frequency      3

Table 3. Some lcc triplets for our corpus

3.3. Stoplist discrimination

For the information retrieval domain, there are stop-word lists containing commonly occurring words that are unlikely to be useful for retrieval purposes (Salton, 1968). Removing stop words or non-relevant words from definitions must be done carefully in the case of onomasiological search, as a query can contain several important words that are usually removed for IR purposes. For example, “where”, “when” and “who” are usually stop words, but for the user they can be equivalent to “place”, “time” and “person”, respectively. Function words can also be clustered: from table 3 we can observe that one of the highest lcc scores corresponds to “any–an”, and other such pairs are “any–a” and “an–the”. However, in general, stop words interfere in the identification of clusters and can give more wrong than good results.

Therefore, we use a stoplist to automatically identify and exclude any pair of words in which a non-relevant word appears, on the grounds that these are not very useful words for clustering. Thus, when the program comes across a matched pair of different words in a context, and that matched pair contains a word from the stoplist, the pair is rejected. Essentially, this is the same as using a tagger and looking at the tags as well as the words, since one would not want to choose a noun pairing with a determiner or a relative.

3.4. Clustering

We introduce the term binding to represent a candidate cluster, i.e. a pair of words drawn from two definitions in such a way that the words are equivalent, in a determined context, after stoplist discrimination and according to a determined threshold (lcc ≥ 5). As bindings represent pairs of words that are used with the same meaning in particular contexts, by replacing the full forms in the definition according to the bindings, we preserve the same meaning. As an example, we follow the alignment of “alkalimeter”. By replacing the three bindings together, (apparatus, instrument), (determining, ascertaining) and (concentration, amount), the final Levenshtein distance will be only 1, which indicates a strong similarity of both strings:

CED: an apparatus instrument for determining ascertaining the concentration amount of alkalis in solution.
OED2: an instrument for ascertaining the amount of alkali in a solution.

In a consecutive sequence of bindings, it may happen that a stem form occurs in two or more different bindings. In this case, one can cluster all bindings with a common stem form according to the transitive property. The transitive property is oriented, in the context of the substitution of words in the strings, more towards preserving the meaning than towards the synonymy of words. Even so, it is important to note that the transitive property applies in particular contexts, in the same way as a binding means that two words are used indistinguishably in the same definition. For example, the binding (instrument, telescope) means that if we replace “instrument” with “telescope” or vice versa, the strings from which that cluster was obtained do not alter their meaning. However, replacing “instrument” with “telescope” in other strings can change the meaning of the string. This particular binding is a case of a hyponymy relationship (Cruse, 1986), where a word A can be substituted by a word B in all contexts without producing any difference of meaning, but word B cannot be substituted by word A in all contexts. As the clustering algorithm is based on aligning definitions, and a strategy of lexicographers is to use the superordinate with the same meaning as the hyponym, it is common to find this kind of relationship between the words of a binding.

The algorithm to cluster bindings consists of three loops. We firstly assign a cluster number to each binding, so that bindings with a common word have the same cluster number. We then cluster bindings with the same cluster number, removing duplicate stem forms in the same cluster. Finally, we check whether it is possible to merge new clusters with those of previous cycles.

After substituting the bindings, it may happen that several pairs of words present a high lcc score, even those pairs of words which initially did not yield matches with any word. It is then advantageous to replace the bindings in the definitions and repeat the entire process until no new clusters are found. The first cycle runs from the reading of definitions up to the merging of clusters. All subsequent cycles start by replacing the bindings in the definitions. The replacement of bindings in the definitions is a necessary step before starting a new cycle: it implies likely changes in the Levenshtein distance calculated for these modified definitions, so that words that were not previously aligned can now be aligned, and matched couples with a previously low lcc could increase it above the threshold value required for these couples to participate in clustering. The process iterates, replacing similar pairs of words in the definitions, until no new clusters are found.
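The transitive merging of bindings can be sketched with a small union-find structure; the bindings below are toy data that have notionally passed the lcc ≥ 5 threshold and stoplist discrimination, not output from the paper's corpus:

```python
# Sketch: cluster bindings by the transitive property using union-find.
# The bindings are toy data assumed to have passed threshold and stoplist.

def cluster_bindings(bindings):
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path halving
            w = parent[w]
        return w

    for a, b in bindings:                   # union the two words
        parent[find(a)] = find(b)

    clusters = {}
    for w in parent:                        # group words by their root
        clusters.setdefault(find(w), set()).add(w)
    return sorted(sorted(c) for c in clusters.values())

bindings = [("apparatus", "instrument"), ("determining", "ascertaining"),
            ("instrument", "meter"), ("concentration", "amount")]
clusters = cluster_bindings(bindings)
```

Because "instrument" occurs in two bindings, the transitive property joins "apparatus", "instrument" and "meter" into one cluster.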
An objective answer is always preferred to words in the definitions until no new clusters are found. a subjective one. However, there is no clear boundary between one and the other. By comparing clusters with 3.5. Modifying the strings other lexical resources, such as WordNet, Roget's thesaurus or a synonym dictionary, for example, then the In order not to manipulate the strings to retrieve question is to evaluate them and define which of them is biased clusters, definitions were not modified to “tidy up” the "best". In an onomasiological search, as proposed in the data, before being submitted to the main process. No this paper, the user and only the user has the answer. Our words in definitions were replaced or moved. In fact, clustering program is only one approximation to the truth entry words were chosen randomly, but always in the of one kind of user: the lexicographer. domain of "measuring instruments". Although good In this perspective, we assess our semantic clustering precision is observable in the final clusters, there are still algorithm and demonstrate the utility of our approach by some relevant words in the strings that are semantically identifying the recall and precision for a small set of similar to some of those of the clusters. For example, the texts. word "device" is frequently used instead of "instrument", but because of the definition of lcc, the matched couple (device instrument) rarely can be a binding for 4.1. Clusters for barometer clustering, as the preceding determiner of each word is General language dictionaries present the advantage different. The former use "a", while the latter use "an". of using well-established lexicographic criteria to As we observed, the matched couples (any an) and (any normalise definitions. 
These criteria, as for example the a) present a lcc ≥ 5, so that by our clustering algorithm use of analytical definitions by genus and differentia, they should belong to the same cluster and then we can have been nowadays implemented by terminological or replace one with the other in the strings. By running the specialised dictionaries, with the addition of a richer program without stoplist discrimination, we observe two vocabulary and the identification of properties that are clusters related to function words: not always considered relevant in other resources. Cluster 1: a an any the Unfortunately, these are more oriented to a specific Cluster 2: for that which domain, so that it is sometimes necessary to search in two or more resources to compile the data. We used many Therefore, table 4 demonstrates clusters by first online lexical resources, some of them available on the replacing all the strings according to these clusters of Internet. This allowed us to easily use different databases function words. to extract semantic clusters. As an example, for the term "barometer" we selected from the Internet 17 sources 1. apparatus device instrument meter telescope from general language dictionaries, terminologies and 2. analyse ascertaining astronomical counting specialised dictionaries, in addition to OED2 and CED. detecting determining estimating indicates The use of several terminologies lets us identify the location making measuring provides recording clusters corresponding to a specific term or a more taking testing concentrated group of data. For example, it is interesting 3. amount concentration intensity percentage to analyse our 19 definitions of “barometer” and identify proportion rate salinity strength the words that are semantically similar. Table 5 demonstrates the use of our clustering program for the 19 4. hyperbolic radio radiofrequency definitions of "barometer", using lcc ≥ 5, stoplist

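The thresholding and replacement steps described above can be illustrated with a short sketch. This is not the authors' implementation: the lcc scores below are invented for illustration (the paper computes lcc from aligned dictionary definitions), and only the threshold behaviour is modelled — matched couples with lcc ≥ 5 are merged transitively into one cluster (here via a simple union-find), after which any cluster member may stand in for any other in the strings.

```python
# Illustrative sketch only, not the authors' code: lcc scores are
# invented, and clustering is modelled as a transitive merge
# (union-find) of matched couples whose score meets the threshold.
from collections import defaultdict

def build_clusters(scored_couples, threshold=5):
    """Merge words transitively: any matched couple with a score of
    at least `threshold` is placed in the same cluster."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for (a, b), score in scored_couples.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two groups

    clusters = defaultdict(set)
    for w in parent:
        clusters[find(w)].add(w)
    return [sorted(c) for c in clusters.values() if len(c) > 1]

def normalise(definition, clusters):
    """Replace every cluster member by a canonical representative
    (the first member), so e.g. "a" and "an" no longer block a match."""
    canon = {w: c[0] for c in clusters for w in c}
    return " ".join(canon.get(tok, tok) for tok in definition.split())

# Invented scores for the function-word couples discussed in the text;
# the (device, for) couple falls below the threshold and is discarded.
scores = {("a", "an"): 6, ("an", "any"): 5, ("a", "the"): 5,
          ("for", "that"): 5, ("that", "which"): 5, ("device", "for"): 1}

fw_clusters = build_clusters(scores)
print(fw_clusters)  # [['a', 'an', 'any', 'the'], ['for', 'that', 'which']]
print(normalise("an instrument for measuring atmospheric pressure", fw_clusters))
# -> "a instrument for measuring atmospheric pressure"
```

On this toy input the sketch reproduces the two function-word clusters reported above; rewriting the strings with a canonical member is what lets couples such as (device instrument) subsequently match despite their differing determiners.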
An objective answer is always preferred to a subjective one. However, there is no clear boundary between one and the other. By comparing clusters with other lexical resources, such as WordNet, Roget's thesaurus or a synonym dictionary, the question is then how to evaluate them and decide which of them is the "best". In an onomasiological search, as proposed in this paper, the user and only the user has the answer. Our clustering program is only one approximation to the truth of one kind of user: the lexicographer.

In this perspective, we assess our semantic clustering algorithm and demonstrate the utility of our approach by identifying the recall and precision for a small set of texts.

4.1. Clusters for barometer

General language dictionaries present the advantage of using well-established lexicographic criteria to normalise definitions. These criteria, such as the use of analytical definitions by genus and differentia, have nowadays been adopted by terminological or specialised dictionaries, with the addition of a richer vocabulary and the identification of properties that are not always considered relevant in other resources. Unfortunately, these are more oriented to a specific domain, so that it is sometimes necessary to search in two or more resources to compile the data. We used many online lexical resources, some of them available on the Internet. This allowed us to easily use different databases to extract semantic clusters. As an example, for the term "barometer" we selected from the Internet 17 sources from general language dictionaries, terminologies and specialised dictionaries, in addition to OED2 and CED. The use of several terminologies lets us identify the clusters corresponding to a specific term or a more concentrated group of data. For example, it is interesting to analyse our 19 definitions of "barometer" and identify the words that are semantically similar. Table 5 shows the result of our clustering program for the 19 definitions of "barometer", using lcc ≥ 5, stoplist discrimination, and a previous modification of the original strings with the clusters of function words.

Cluster 1: air atmospheric
Cluster 2: device instrument
Cluster 3: determining measures shows

Table 5. Clusters for barometer

From this table, we see there are only 3 clusters, but comparing these with the strings we observe that the clusters are complete, with high recall and high precision: no more clusters can be extracted from the strings, there are no more relevant words in the strings that can still be clustered, and there are no unnecessary words in any of these clusters.

5. Acknowledgments

The research reported here was part of a Ph.D. project in the Department of Language Engineering, UMIST, 1996–1999. Support was provided by a scholarship from the Universidad Nacional Autónoma de México (UNAM) and the Secretaría de Educación Pública, Mexico.

6. References

Adamson, G.W. and Boreham, J. (1974). The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Information Storage and Retrieval, 10, 253–260.
Agirre, E. and Rigau, G. (1996). Word sense disambiguation using conceptual density. Proc. COLING-96, 16–22.
Anderberg, M.R. (1973). Cluster analysis for applications. New York: Academic Press.
Arranz, M.V. (1998). Sublanguage-based semantic clustering and disambiguation from corpora. PhD Thesis, UMIST.
Baldinger, K. (1980). Semantic theory: towards a modern semantics. Oxford: Basil Blackwell.
Barrière, C. and Popowich, F. (1996). Concept clustering and knowledge integration from a children's dictionary. Proc. COLING-96, 65–70.
Basili, R., Pazienza, M.T. and Velardi, P. (1996). Context driven conceptual clustering method for verb classification. In B. Boguraev and J. Pustejovsky (eds) (1996), 117–142.
Boguraev, B. and Pustejovsky, J. (1996). Issues in text-based lexicon acquisition. In B. Boguraev and J. Pustejovsky (eds) (1996).
Boguraev, B. and Pustejovsky, J. (eds) (1996). Corpus processing for lexical acquisition. Cambridge: The MIT Press.
Brown, P.F., de Souza, P.V., Mercer, R.L., Della Pietra, V.J. and Lai, J.C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.
Calzolari, N. (1988). The dictionary and the thesaurus can be combined. In M.W. Evens (ed.), Relational models of the lexicon: representing knowledge in semantic networks (pp. 75–96). Cambridge: Cambridge University Press.
CED. (1994). Collins English Dictionary. Glasgow: Harper Collins Publishers.
Charniak, E. (1993). Statistical language learning. Cambridge: The MIT Press.
Cruse, D.A. (1986). Lexical semantics. Cambridge: Cambridge University Press.
Edmonds, P. (1997). Choosing the word most typical in context using a lexical co-occurrence network. Proc. 35th Annual Meeting of the Association for Computational Linguistics, 507–509.
Federici, S. and Pirrelli, V. (1997). Analogy, computation and linguistic theory. In D.B. Jones and H.L. Somers (eds), New Methods in Language Processing (pp. 16–34). London: UCL Press.
Fellbaum, C. (ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge: The MIT Press.
Fox, E.A. (1988). Improved retrieval using a relational thesaurus for automatic expansion of extended Boolean logic queries. In M.W. Evens (ed.), Relational models of the lexicon: representing knowledge in semantic networks (pp. 199–210). Cambridge: Cambridge University Press.
Fukumoto, F. and Suzuki, Y. (1996). An automatic clustering of articles using dictionary definitions. Proc. COLING-96, 406–411.
Fukumoto, F. and Tsujii, J. (1994). Automatic recognition of verbal polysemy. Proc. COLING-94, 762–768.
Grefenstette, G. (1996). Evaluation techniques for automatic semantic extraction: comparing syntactic and window based approaches. In B. Boguraev and J. Pustejovsky (eds) (1996), 205–227.
Habert, B., Naulleau, E. and Nazarenko, A. (1996). Symbolic word clustering for medium-size corpora. Proc. COLING-96, 490–495.
Hirakawa, H., Xu, Z. and Haase, K. (1996). Inherited feature-based similarity measure based on large semantic hierarchy and large text corpus. Proc. COLING-96, 508–513.
Jones, D. (1996). Analogical natural language processing. London: UCL Press.
Kipfer, B.A. (1986). Investigating an onomasiological approach to dictionary material. Dictionaries: Journal of the Dictionary Society of North America, 8, 55–64.
Levenshtein, V.I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8), 707–710.
McRoy, S.W. (1992). Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1), 1–30.
Morris, J. and Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1), 21–48.
OED2. (1994). Oxford English Dictionary. Oxford: Oxford University Press and Rotterdam: Software B.V.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Procter, P. (1978). Longman Dictionary of Contemporary English. Essex: Longman Group.
Resnik, P. (1995). Disambiguating noun groupings with respect to WordNet senses. Proc. 3rd Workshop on Very Large Corpora, 54–68.
Roget, P. (1987). Roget's Thesaurus of English words and phrases. Essex: Longman.
Sellers, P.H. (1974). An algorithm for the distance between two finite sequences. Journal of Combinatorial Theory (A), 16, 253–258.
Wagner, R.A. and Fischer, M.J. (1974). The string-to-string correction problem. Journal of the Association for Computing Machinery, 21(1), 168–173.
Waterman, S.A. (1996). Distinguished usage. In B. Boguraev and J. Pustejovsky (eds) (1996).