Building a Hyponymy Lexicon with Hierarchical Structure Nph

Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, July 2002, pp. 26-33. Association for Computational Linguistics. Building a hyponymy lexicon with hierarchical structure Sara Rydin [email protected] Centre for Speech Technology (CTT) KTH Stockholm, Sweden GSLT Abstract with new words and new uses of existing words. Our minimally supervised method for automatically Many lexical semantic relations, such as building partial hierarchies presents one way to the hyponymy relation, can be extracted solve the update problem. from text as they occur in detectable The objective of this project is to automatically syntactic constructions. This paper shows build a hierarchical hyponymy lexicon of noun how a hypernym-hyponym based lexi- phrases given large, part-of-speech tagged and lem- con for Swedish can be created directly matized corpora that are not restricted to one specific from a news paper corpus. An algorithm domain or topic. The lexicon will thus, reflect partial is presented for building partial hierarchi- hierarchical hyponymy structures that bring forward cal structures from non domain-specific extended hypernym-hyponym relations. texts. Section 2 describes previous work in the area of automatic acquisition of semantic lexicons, section 1 Introduction 3 elaborates on the principles for this work, and the Automatic acquisition of information on semantic remaining sections describe the implementation as relations from text has become more and more pop- well as the evaluation of the algorithm for building ular during the last ten to fifteen years. The goal has a hierarchical hyponymy lexicon. been to build various types of semantic lexicons for 2 Previous work use in natural language processing (NLP) systems, such as systems for information extraction/retrieval One of the first studies on acquisition of hyponymy or dialog systems. The lexicons are used to intro- relations was made by Hearst (1992). She found that duce extended semantic knowledge into the different certain lexico-syntactic constructions can be used as systems. indicators of the hyponymy relation between words Hand-built general-purpose lexicons, such as the in text. Example 1 shows a relation of this kind and WordNet (Fellbaum, 1998), have often been used to an example. The noun phrase ’ ¢¡¤£ ’ is a hypernym §¡©¨ ¤ ¢¡© §¡ bring semantic knowledge into NLP-systems. Two and ‘ ¥¦¥ ’ is one or more important problems concerning (semantic) lexicons (conjoined) noun phrases that are the hyponyms: are those of domain coverage and updates. Firstly, a ¢¡¤£%$ §¡©¨& ' © §¡"( §¡ "! # general-purpose lexicon cannot be expected to cover ¥¦¥ (1) all specific words used in different sub-domains. Therefore, the need for domain-specific lexicons has ‘such cars as Volvo, Seat and Ford’ recently been brought to the surface. Secondly, any lexicon, general or specific, has to Hearst proposed furthermore, that new syntactic pat- be updated from time to time, in order to keep up terns can be found in the following way: 1. Use the list of hypernyms-hyponyms found by hyponymy hierarchies, and for each hierarchy the the type of pattern described above to search for following criteria should be fulfilled: places in the corpus where the two expressions occur syntactically close to each other. Save the 1. A hierarchy has to be strict, so that every child syntactic examples. node in it can have one parent node only. 2. Examine all saved syntactic environments and 2. The words or phrases forming the nodes in a find new useful syntactic patterns. hierarchy should be disambiguated. 3. Use each new pattern to find more hypernym- 3. The organization in a hierarchy should be such hyponym examples. Continue at 1. that every child node is a hyponym (i.e. a type/kind) of its parent. Caraballo (1999) uses a hierarchical clustering tech- nique to build a hyponymy hierarchy of nouns. The Generally, principle 1-2 above are meant to pre- internal nodes are labeled by the syntactic construc- vent the hierarchies from containing ambiguity. The tions from Hearst (1992). Each internal node in the built-in ambiguity in the hyponymy hierarchy pre- hierarchy can be represented by up to three nouns. sented in (Caraballo, 1999) is primarily an effect of Work by Riloff & Shepherd (1997) and Char- the fact that all information is composed into one niak & Roark (1998) aims to build semantic lexicons tree. Part of the ambiguity could have been solved where the words included in each category or entry if the requirement of building one tree had been re- are related to, or are a member of the category. laxed. Sanderson & Croft (1999) build hierarchical Principle 2, regarding keeping the hierarchy structures of concepts on the basis of generality and ambiguity-free, is especially important, as we are specificity. They use material divided by different working with acquisition from a corpus that is not text categories and base the decision of subsumption domain restricted. We will have to constrain the way on term co-occurrence in the different categories. in which the hierarchy is growing in order to keep it A term x is said to subsume y if the documents in unambiguous. Had we worked with domain-specific which y occurs are a subset of the documents in data (see e.g. Morin and Jaquemin (1999)), it would which x occurs. The relations between concepts in have been possible to assume only one sense per their subsumption hierarchy are of different kinds word or phrase. (among other the hyponymy relation), and are un- The problem of building a hyponymy lexicon can labeled. be seen as a type of classification problem. In The work most similar to ours is that of Morin & this specific classification task, the hypernym is the Jacquemin (1999). They produce partial hyponymy class, the hyponyms are the class-members, and hierarchies guided by transitivity in the relation. But classifying a word means connecting it to its cor- while they work on a domain-specific corpus, we rect hypernym. The algorithm for classification and will acquire hyponymy data from a corpus which is for building hierarchies will be further described in not restricted to one domain. section 6. 3 Principles for building a hierarchical 4 Corpus and relevant terms lexicon This work has been implemented for Swedish, a This section will describe the principles behind our Germanic language. Swedish has frequent and pro- method for building the hierarchical structures in a ductive compounding, and morphology is richer lexicon. compared to, for example, English. Compounding As the objective is to build a nominal hyponymy affects the building of any lexical resource in that lexicon with partial hierarchical structures, there the number of different word types in the language are conditions that the hierarchical structures should is larger, and thus, the problems of data sparseness meet. The structures can each be seen as separate become more noticeable. In order to, at least partly, overcome the data sparseness problem, lemmatiza- ‘riksdagen, stadsfullmaktige¨ och liknande tion has been performed. However, no attempt has forsamlingar’¨ /lit. the Swedish Parliament, the been made to make a deeper analysis of compounds. town councilor and similar assemblies/ The corpus used for this research consists of §¡7 < ¢¡©8 "A /10© 243:3;26 §¡¤¨ ¤.F',>GH21IJKGL §¡¤£ 293,692 articles from the Swedish daily news pa- ¥ per ‘Dagens Nyheter’. The corpus was tokenized, (6) tagged and lemmatized. The tagger we used, im- ‘Osterleden¨ och Vasterleden,¨ de tva˚ mo- plemented by Megyesi (2001) for Swedish, is the torvagsprojekt’¨ /lit. the East way and the West TnT-tagger (Brants, 2000), trained on the SUC Cor- way, the two highway projects/ pus (Ejerhed et al., 1992). After preprocessing, the corpus was labeled for base noun phrases (baseNP). The basic assumption is that these construc- A baseNP includes optional determiners and/or pre- tions (henceforth called hh-constructions), yield modifiers, followed by nominal heads. pairs of terms between which the hyponymy relation Naturally, conceptually relevant terms, rather than holds. After a manual inspection of 20% of the noun phrases, should be placed in the lexicon and the total number of hh-constructions, it was estimated hierarchies. For reasons of simplification, though, that 92% of the hh-constructions give us correct the choice was made as to treat nominal heads with hyponymy relations. Erroneous hh-constructions premodifying nouns in genitive (within the limits of are mainly due to problems with, for example, the baseNP described above) as the relevant terms incorrect tagging, but also change in meaning due to include in the hierarchies. However, premodi- to PP-attachment. fiers describing amounts, such as ‘kilo’, are never 6 Building the hierarchical lexicon included in the relevant terms. To give an accurate description of the algorithm 5 Lexico-syntactic constructions for building the lexicon, the description here is di- Lexico-syntactic constructions are extracted vided into several parts. The first part describes from the corpus, in the fashion suggested by how hypernyms/hyponyms are grouped into classes, Hearst (1992). Five different Swedish constructions building an unambiguous lexicon base. The sec- has been chosen – constructions 2-6 below – as ond part describes how arrangement into hierarchi- a basis for building the lexicon (an example with cal structures is performed from this unambiguous the English translation is given below for each data. Last, we will describe how the lexicon is ex- construction)1 : tended. ++( ¢¡¤£ , §¡©' ' - §¡©¨. /10© 24353526 ¢¡7 ) * ) ¥¦¥ 6.1 Classification (2) There are two straightforward methods that can be ‘sadana˚ kanslor¨ som medkansla¨ och barmhartighet’¨ used to classify the data from the hh-constructions. /lit. such feelings as sympathy and compassion/ The first would be to group all hypernyms of the §¡©£ 8 9, §¡" ' ©7 §¡©¨. /10© 243:3;26& §¡7 ) ) * same lemma into one class. The second would be ¥ ¥¦¥ (3) to let each hypernym token (independently of their ‘exotiska frukter som papaya, pepino och mango’ lemma) initially build their own class, and then try /lit.

Load more