<<

Enhanced Thesaurus Terms Extraction for Document Indexing

Frane ’ari¢, Jan ’najder, Bojana Dalbelo Ba²i¢, Hrvoje Ekli¢ Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 10000 Zagreb, Croatia E-mail:{Frane.Saric, Jan.Snajder, Bojana.Dalbelo, Hrvoje.Eklic}@fer.hr

Abstract. In this paper we present an mogeneous due to diverse background knowl- enhanced method for the thesaurus term edge and expertise of human indexers. The extraction regarded as the main support to task of building semi-automatic and auto- a semi-automatic indexing system. The matic systems, which aim to decrease the enhancement is achieved by neutralising burden of work borne by indexers, has re- the eect of language morphology applying cently attracted interest in the research com- lemmatisation on both the text and the munity [4], [13], [14]. Automatic indexing thesaurus, and by implementing an ecient systems still do not achieve the performance recursive algorithm for term extraction. of human indexers, so semi-automatic sys- Formal denition and statistical evaluation tems are widely used (CINDEX, MACREX, of the experimental results of the proposed MAI [10]). method for thesaurus term extraction are In this paper we present a method for the- given. The need for disambiguation methods saurus term extraction regarded as the main and the eect of lemmatisation in the realm support to semi-automatic indexing system. of thesaurus term extraction are discussed. Term extraction is a process of nding all ver- batim occurrences of all terms in the text. Keywords. Information retrieval, term Our method of term extraction is a part of extraction, NLP, lemmatisation, Eurovoc. CADIS [9], and is meant to facilitate nding those terms that are explicitly contained in a 1. Introduction document, although the document does not necessarily need to be indexed with the ex- Finding documents on the Web or in large tracted terms. document databases that are relevant for The process of term extraction gives rise user's queries is the primary research topic in to some practical problems concerning term the eld of information retrieval (IR). Doc- variation, such as morphological, lexical, ument indexing is the process of assigning structural, etc. An example of a method one or more key phrases that describe the which deals with term variation for English content of the document in order to facili- is presented in [11]. We restrict our work tate IR. These key phrases (called terms or to variation due to inectional morphology, descriptors) usually belong to a nite set of which makes the appear in various phrases arranged in the form of a controlled forms. We can implicitly cope with some vocabulary or thesaurus. Thesauri contain other types of term variation by using the- additional information about term relation- sauri which encode relationships between syn- ships, such as: related terms, hypernyms, hy- onyms. ponyms, etc., thus providing the means to Recall of term extraction suers in doc- control recall and precision of searches [8]. uments written in morphologically rich lan- Examples of widely used thesauri are the guages (such as Croatian), so the eects of Eurovoc (EUROpean VOCabulary) [7] and morphology have to be neutralised. The NASA Thesaurus [10]. method described in the paper enhances the Manual indexing is a time consuming, ex- process of term extraction in two aspects. pensive intellectual task and is often inho- It eciently tackles the problem of language tised, i.e. a lemma for a given inected form has to be found. Lemmatisation procedures range from purely algorithmic (rule-driven) to lexicon-based (relying on queries made to a morphological lexicon). For highly inected languages the latter approach is more com- mon. The morphological lexicon typically relates all inected forms of a to its lemma. The construction of a morphological lexicon is a labour intensive task. To facili- tate the process, various automatic and semi- Figure 1: Croatian word vode has three lemmas: voda automatic procedures based on lexical acqui- (water), vod (a duct or a squad) and voditi (to conduct, to lead). sition from corpora have been developed [3], [6], [12], [15]. In our work, contrary to the usual prac- morphology by applying lemmatisation, the tice, the process of lemmatisation does not most prominent natural language processing imply word disambiguation. Instead, lemma- (NLP) technique used for indexing, on both tisation of an ambiguous word results in more the text and the thesaurus. The process of than one lemma. Ambiguity considered here extraction is further enhanced by identifying is called homography  the case when two or and ignoring some cases in which terms are more lemmas have overlapping forms. A no- considered irrelevant. torious example in English is the word saw, The paper is structured as follows. In Sec- which can be a noun (a tool used for cutting) tion 2, we address the problem of language or the past tense of the verb see. An example morphology, and describe our approach to in Croatian is the word vode as a feminine lemmatisation. In Section 3, formal deni- noun voda (water), a masculine noun vod (a tion of our method is given. Finally, in Sec- duct or a squad), or a verb voditi (to conduct, tion 4 we present the statistical evaluation of to lead), as shown in Fig. 1. experimental results of Eurovoc term extrac- tion on a set of parallel documents written in 2.2. Our approach to lemmatisation Croatian and English.

2. Lemmatisation In our work lemmatisation of English and Croatian documents and terms is performed 2.1. The problem of morphology using appropriate morphological lexicons. For lemmatisation of English, one of many When extracting terms from text docu- publicly available lexicons was used [1]. It ments, the eects of language morphology contains over 250, 000 forms assigned to more have to be taken into account. Relevant to than 100, 000 lemmas, with 4.8% ambiguous the task of term extraction are the eects forms due to homography. of inectional morphology. It describes how For lemmatising Croatian, a morphologi- from basic word form (the lemma) dierent cal lexicon constructed by a rule-based au- word forms are generated in order to express tomatic acquisition [15] from a subsection grammatical features (e.g. number, case, gen- of Croatian National Corpus [5] totaling 107 der, degree etc., depending on the word's words was used. The obtained lexicon con- part-of-speech). If term extraction were per- tains over 500, 000 forms assigned to more formed by literal string matching, various in- than 30, 000 lemmas. Degree of homography ections of a term would not be found in the in lexicon is 5.1%. document, resulting in decreased recall. In the process of term extraction, both To neutralise the eect of inective mor- precision and recall depend on the linguistic phology, each word form has to be lemma- validity and the coverage of the morphological lexicon used for lemmatisation. Estimates for lexicons used in our experiments will be given in Section 4.

3. Term Extraction

3.1. Formal denition

In order to dene term extraction for- mally, word and term matching need to be dened rst. Let W be a set of all words and L : W → ℘(W ) denote a function that maps each word to the set of its lemmas, e.g. L(vode) = {vod, voda, voditi}. If we are not using lemmatisation or word w is not listed in the lexicon (which is usually the case Figure 2: A set of extracted terms contains the longest for non-inective words), then L(w) = {w}. term extracted in step a) and the contents of sets

We dene words w1 and w2 to match i left a and right a. Sets left a = {(tax)} and right a = {(insurance, premium)} are calculated in steps b) and L(w1) ∩ L(w2) 6= ∅, i.e. both words are in- ections of a common lemma. c), respectively. We represent a term as a list of words (t1, . . . , tm), and a portion of text In these rare cases no term takes precedence with no intervening punctuation as list of over the other, so we choose to extract all of words (w1, . . . , wn). We dene three rela- them. tions that are relevant for term extraction. The process of term extraction can be for- Term (t , . . . , t ) matches a list of words malised as follows. Let be a set of all terms, 1 m S T (w , . . . , w ) at the position k i k+m−1 ≤ n + ∞ n a set of all word n-tuples 1 n W = n=1 W + and words ti and wk+i−1 match (in the sense and E : W → ℘(T ) a function mapping a introduced above) for . Term + i = 1, . . . , m list of words (w1, . . . , wn) ∈ W to a set of tA = (t1, . . . , tn) matching some list of words extracted terms, element of ℘(T ). For ex- at position a subsumes term tB = (t1, . . . , tm) ample, E(tax, on, motor, vehicle, insurance, matching the same list of words at position premium) = {(tax), (insurance, premium), b i a ≤ b and b + m ≤ a + n. Term (motor, vehicle, insurance)}, as shown in tA = (t1, . . . , tn) matching some list of words Fig. 2. at position and term a tB = (t1, . . . , tm) Function E is dened recursively as fol- matching the same list of words at position lows. If there are no terms present in the list b overlap i a < b < a + n < b + m. of words (w1, . . . , wn), then E(w1, . . . , wn) = If term tA subsumes term tB, it is al- ∅. Otherwise, E(w1, . . . , wn) = {t} ∪ most certainly true that term tA is more spe- left ∪ right, where t is the leftmost among cic. During term extraction we always pre- the longest terms in the list of words fer more specic terms because they are of (w1, . . . , wn). If there is no term starting be- greater semantic value. For example, Eurovoc fore term t then left := ∅, otherwise left := term equality between men and women sub- E(w1, . . . , wk), where k is the greatest index sumes both Eurovoc terms men and women. of ending of all such terms. Set right is de- Here, the multi-word term is the preferable ned analogously. We choose the leftmost choice over shorter terms. A less common term only to break the tie among the longest case is when two terms overlap. For exam- terms of the same length. This denition of E ple, phrase motor vehicle insurance premium ensures that a term that is always subsumed contains two overlapping Eurovoc terms: mo- is not extracted, while those that overlap will tor vehicle insurance and insurance premium. be extracted. 3.2. Errors in term extraction Section 2.1). Polysemy refers to the phe- nomenon of multiple related meanings within The process of term extraction as de- a single lemma. Word sense disambiguation scribed above is prone to two kinds of errors: is usually performed by examining the con- lemmatisation errors and errors due to lexical text of an ambiguous word. Term extraction ambiguity. We continue with a description of as presented in this paper makes no use of these errors, while their relevance is discussed disambiguation. Consequently two kinds of in Section 4. errors are possible, both causing a decrease in precision.

3.2.1. Lemmatisation errors First is due to homography: if word w1 and a single-word term (w2) are homographs Lemmatisation errors may decrease both (L(w1) ∩ L(w2) 6= ∅ and L(w1) 6= L(w2)), precision and recall of term extraction. A dis- they will be matched regardless of actual tinction can be made between lemmatisation senses in which words w1 and w2 are used. failure (given form is not present in the lexi- Obviously, if w1 and w2 are used in dierent con) and incorrect lemmatisation (given form senses, then this is an error. As an exam- is related to a wrong lemma). ple, consider the sentence Church bells toll across the town. Here, the verb toll is used Since we have L(w) = {w} when lemmati- in a sense of sounding a bell by pulling a rope, sation fails, w1 and w2 will match i w1 = w2. In other words, if lemmatisation fails on any unlike the orthographically identical Eurovoc term, which is a noun, and used in a sense of of the words (t1, . . . , tm) constituting a term, then an exact match for this word is required. charge for the use of transport infrastructure. This poses no problem if the word in ques- The probability of this kind of error decreases tion is not inective (e.g. functional words, with the number of words constituting a term. abbreviations etc.). However, if the word is Second kind of error is due to polysemy: if inective (e.g. nouns, adjectives, verbs, etc.), a term is in itself polysemious, matching it to then various inective forms of a term, dier- any word in text is always questionable. An ing from that listed in the thesaurus, will not example is the sentence Sleeping tablet con- be found in text. This ultimately leads to a sumption was higher among subjects reporting decrease in recall. a bad atmosphere at work. Here the word at- mosphere is used in a sense of a surrounding If lemmatisation of a word w is incorrect, 1 feeling or mood encountered in the working then it is possible that L(w ) ∩ L(w ) 6= ∅ 1 2 environment, while in the Eurovoc thesaurus although w1 and w2 are in fact not inec- tions of the same lemma. A single-word term it is meant to be used in the sense of physi- cal environment. Again, the more words con- (w ) will then be incorrectly matched with 1 stitute a term, the less the probability of an the word w2 occurring in text. A mismatch causes a decrease in precision, but it can also error due to polysemy. cause a decrease in recall if w1 happens to be an inective form of another term. It should 4. Experimental results however be noted that for multi-word terms the probability of this type of error is negli- Eurovoc thesaurus term extraction experi- gible. ments were conducted on a parallel Croatian- English corpus consisting of 39 legal doc- 3.2.2. Errors due to lexical ambiguity uments  European Commission Directives and Croatian legislative documents. Eurovoc Another problem is the lexical ambiguity of is a multilingual thesaurus, used by the Eu- natural language, in particular the cases of ropean Communities, containing over 6000 homography and polysemy. Homography is a terms, each of them translated into 21 Euro- relation between words that have the same or- pean languages (including Croatian [2]), thus thographic form with unrelated meaning (see enabling multilingual information retrieval. We chose this particular thesaurus because 180 the term extraction described will be a com- 160 ponent of a larger indexing system that is us- 140 ing Eurovoc [9]. 120

The number of words ranged from 365 to 100

26651 for documents in English and from 297 80 to 19946 for the same set of documents in 60

Croatian. The reason for the larger number of 40 words in English documents lies in the nature 20 of languages, the way documents are trans- 0 Mean lated, but it is mainly due to the existence ±SE -20 Non-Outlier Range English Croatian Outliers of articles in English. The number of terms English LEM Croatian LEM found depends upon the number of words in the text  more terms are found in a longer text. However, we have considered the num- Figure 3: Box and Whiskers plot – number of extracted ber of dierent terms, this being the relevant Eurovoc terms for English and Croatian , both with and without using lemmatisation. eciency parameter of an extraction proce- dure, since only diering terms carry new in- formation useful for document indexing. erage increase in recall of the terms for 39 With the method described in the paper, English documents if lemmatisation is used terms were extracted from each Croatian and is 0.20, and for Croatian it is 0.53. English document. Then, the same proce- In Section 3.2 we pointed out that cer- dure was repeated on the same documents, tain kinds of errors can decrease both the this time using lemmatisation. Fig. 3 shows recall and precision of term extraction. De- the Box and Whisker plot for a total number crease of recall is mainly due to lemmatisa- of extracted terms for English and Croatian, tion failure, which in turn depends on the both with and without using lemmatisation. coverage of the morphological lexicons. We Wilcoxon matched pairs test for depen- dene lexicon coverage as the number of dif- dent samples conrmed that the dierence ferent words in the corpus that were found in the number of extracted terms when us- in the lexicon divided by the total number ing and not using lemmatisation on Croatian of dierent words in the corpus. Lexicons documents, as well as when using and not us- used in our experiments have proven to be ing lemmatisation on English documents, is of relatively good coverage: 89.71% for En- signicant. This implies that the lemmatisa- glish and 92.50% for Croatian lexicon. As tion process is very important in the process for the decrease in precision caused by incor- of term extraction. rect lemmatisation, which reects the linguis- Fig. 3 also shows that the eect of lemma- tic validity of the lexicon, our experiments in- tisation on Croatian texts is stronger than on dicate that these kind of errors were negligi- English texts. This was anticipated due to ble. A hand validation revealed only 0.28% the morphologically rich Croatian language. of English and 1.66% of Croatian thesaurus The eect of lemmatisation is presented in terms to be incorrectly lemmatised. Further terms of a relative increase in the recall of decreases in precision due to lexical ambigu- extracted terms. In our case, the recall is ity are also negligible: only 1.65% of English the number of successfully extracted Eurovoc as well as Croatian thesaurus terms were mis- terms divided by the overall number of Eu- takenly extracted because of homography and rovoc terms in the text. Since we could not polysemy. Contributing to these low error determine the exact number of terms in the rates is also the choice of thesaurus terms: text, we computed the dierence in the re- e.g. more than 75% of the Eurovoc terms call relative to the number of extracted terms are multi-word terms, and ambiguity of such when the lemmatisation was used. The av- terms is rare. 5. Conclusion Indexing. Proc. of the 21st ACM SIGIR Conf. on Research and Development in In- An enhanced process for thesaurus formation Retrieval, Melbourne; 1998. term extraction has been described and [5] Croatian National Corpus. formally dened in the paper. The enhance- http://www.hnk.zg.hr [01/16/2005] ment was achieved by neutralising the eect [6] Erjavec T, Dºeroski S. Machine Learning of language morphology applying lemmati- of Morphosyntactic Structure: Lemmatiz- sation on both the text and the thesaurus ing Unknown Slovene Words. Applied Ar- terms, and by implementing an ecient ticial Intelligence, 2004; 18(1): 1741. recursive algorithm for text extraction. [7] Eurovoc Thesaurus. The statistical evaluation of experimental http://europa.eu.int/celex/eurovoc/ results has shown that using lemmatisation, [01/16/2005] term extraction from documents in Croatian [8] Jackson P, Moulinier I. Natural Language is signicantly improved, and brought on par Processing for Online Applications: Text to term extraction from documents in En- Retrieval, Extraction and Categorization. glish. This clearly indicates that lemmatisa- Amsterdam: John Benjamins Publishing tion plays an important role in term extrac- Co.; 2002. tion for highly inected languages. Further- more, experiments indicate that errors due to [9] Kolar M, Vukmirovi¢ I, Dalbelo Ba²i¢ B, lexical ambiguity in Eurovoc term extraction ’najder J. Computer Aided Document are rare, making the need for disambiguation Indexing System. To be published in methods for this particular thesaurus ques- Proc. of ITI 2005, June 2023, Cavtat. tionable in practice. [10] NASA Scientic and Tech. Information Program. NASA Thesaurus Machine Acknowledgements Aided Indexing. http://mai.larc.nasa.gov/ [01/16/2005] We thank another team from Faculty [11] Nenadi¢ G, Ananiadou S, McNaught J. of Electrical Engineering and Computing, Enhancing Automatic Term Recognition Croatian Information Documentation Refer- through Recognition of Variation. Proc. of ral Agency, and Department of , COLING, Geneva; 2004. p. 604610. Faculty of Philosophy for outstanding work [12] Oliver A, Tadi¢ M. Enlarging the Croat- on our joint project system ian Morphological Lexicon by Automatic (2003-082) funded by the Ministry of Sci- Lexical Acquisition from Raw Corpora. ence, Education and Sports of the Republic Proc. of LREC, Lisboa; 2004. p. 1259 of Croatia. We also thank two anonymous 1262. reviewers for helpful comments on an earlier [13] Pouliquen B, Steinberger R, Ignat C. Au- draft of this paper. tomatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus. References Proc. of the Workshop Ontologies and In- formation Extraction, Bucharest; 2003. [1] Automatically Generated Inection [14] Ripplinger B, Schmidt P. Automatic Mul- Database. http://wordlist.sourceforge.net tilingual Indexing and Natural Language [01/16/2005] Processing. Proc. of SIGIR, Athens; 2000. [2] Bratani¢ M, ed. Pojmovnik EUROVOC, [15] ’najder J. Rule-based automatic acquisi- 2nd ed., HIDRA, Zagreb; 2000. tion of large-coverage morphological lexi- [3] Clement L, Sagot B, Lang B. Morphology cons for information retrieval. Tech. Re- Based Automatic Acquisition of Large- port, MZO’ 2003-082, ZEMRIS, FER, Coverage Lexica, Proc. of LREC, Lisboa; University of Zagreb; 2005. 2004. [4] Bakel van B. Modern Classical Document