Enhanced Thesaurus Terms Extraction for Document Indexing

Enhanced Thesaurus Terms Extraction for Document Indexing Frane ari¢, Jan najder, Bojana Dalbelo Ba²i¢, Hrvoje Ekli¢ Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 10000 Zagreb, Croatia E-mail:{Frane.Saric, Jan.Snajder, Bojana.Dalbelo, Hrvoje.Eklic}@fer.hr Abstract. In this paper we present an mogeneous due to diverse background knowl- enhanced method for the thesaurus term edge and expertise of human indexers. The extraction regarded as the main support to task of building semi-automatic and auto- a semi-automatic indexing system. The matic systems, which aim to decrease the enhancement is achieved by neutralising burden of work borne by indexers, has re- the eect of language morphology applying cently attracted interest in the research com- lemmatisation on both the text and the munity [4], [13], [14]. Automatic indexing thesaurus, and by implementing an ecient systems still do not achieve the performance recursive algorithm for term extraction. of human indexers, so semi-automatic sys- Formal denition and statistical evaluation tems are widely used (CINDEX, MACREX, of the experimental results of the proposed MAI [10]). method for thesaurus term extraction are In this paper we present a method for the- given. The need for disambiguation methods saurus term extraction regarded as the main and the eect of lemmatisation in the realm support to semi-automatic indexing system. of thesaurus term extraction are discussed. Term extraction is a process of nding all ver- batim occurrences of all terms in the text. Keywords. Information retrieval, term Our method of term extraction is a part of extraction, NLP, lemmatisation, Eurovoc. CADIS [9], and is meant to facilitate nding those terms that are explicitly contained in a 1. Introduction document, although the document does not necessarily need to be indexed with the ex- Finding documents on the Web or in large tracted terms. document databases that are relevant for The process of term extraction gives rise user's queries is the primary research topic in to some practical problems concerning term the eld of information retrieval (IR). Doc- variation, such as morphological, lexical, ument indexing is the process of assigning structural, etc. An example of a method one or more key phrases that describe the which deals with term variation for English content of the document in order to facili- is presented in [11]. We restrict our work tate IR. These key phrases (called terms or to variation due to inectional morphology, descriptors) usually belong to a nite set of which makes the words appear in various phrases arranged in the form of a controlled forms. We can implicitly cope with some vocabulary or thesaurus. Thesauri contain other types of term variation by using the- additional information about term relation- sauri which encode relationships between syn- ships, such as: related terms, hypernyms, hy- onyms. ponyms, etc., thus providing the means to Recall of term extraction suers in doc- control recall and precision of searches [8]. uments written in morphologically rich lan- Examples of widely used thesauri are the guages (such as Croatian), so the eects of Eurovoc (EUROpean VOCabulary) [7] and morphology have to be neutralised. The NASA Thesaurus [10]. method described in the paper enhances the Manual indexing is a time consuming, ex- process of term extraction in two aspects. pensive intellectual task and is often inho- It eciently tackles the problem of language tised, i.e. a lemma for a given inected form has to be found. Lemmatisation procedures range from purely algorithmic (rule-driven) to lexicon-based (relying on queries made to a morphological lexicon). For highly inected languages the latter approach is more common. The morphological lexicon typically relates all inected forms of a word to its lemma. The construction of a morphological lexicon is a labour intensive task. To facilitate the process, various automatic and semi- Figure 1: Croatian word vode has three lemmas: voda automatic procedures based on lexical acqui- (water), vod (a duct or a squad) and voditi (to conduct, to lead). sition from corpora have been developed [3], [6], [12], [15]. In our work, contrary to the usual prac- morphology by applying lemmatisation, the tice, the process of lemmatisation does not most prominent natural language processing imply word disambiguation. Instead, lemma- (NLP) technique used for indexing, on both tisation of an ambiguous word results in more the text and the thesaurus. The process of than one lemma. Ambiguity considered here extraction is further enhanced by identifying is called homography the case when two or and ignoring some cases in which terms are more lemmas have overlapping forms. A no- considered irrelevant. torious example in English is the word saw, The paper is structured as follows. In Sec- which can be a noun (a tool used for cutting) tion 2, we address the problem of language or the past tense of the verb see. An example morphology, and describe our approach to in Croatian is the word vode as a feminine lemmatisation. In Section 3, formal deni- noun voda (water), a masculine noun vod (a tion of our method is given. Finally, in Sec- duct or a squad), or a verb voditi (to conduct, tion 4 we present the statistical evaluation of to lead), as shown in Fig. 1. experimental results of Eurovoc term extraction on a set of parallel documents written in 2.2. Our approach to lemmatisation Croatian and English. 2. Lemmatisation In our work lemmatisation of English and Croatian documents and terms is performed 2.1. The problem of morphology using appropriate morphological lexicons. For lemmatisation of English, one of many When extracting terms from text docu- publicly available lexicons was used [1]. It ments, the eects of language morphology contains over 250, 000 forms assigned to more have to be taken into account. Relevant to than 100, 000 lemmas, with 4.8% ambiguous the task of term extraction are the eects forms due to homography. of inectional morphology. It describes how For lemmatising Croatian, a morphologi- from basic word form (the lemma) dierent cal lexicon constructed by a rule-based au- word forms are generated in order to express tomatic acquisition [15] from a subsection grammatical features (e.g. number, case, gen- of Croatian National Corpus [5] totaling 107 der, degree etc., depending on the word's words was used. The obtained lexicon con- part-of-speech). If term extraction were per- tains over 500, 000 forms assigned to more formed by literal string matching, various in- than 30, 000 lemmas. Degree of homography ections of a term would not be found in the in lexicon is 5.1%. document, resulting in decreased recall. In the process of term extraction, both To neutralise the eect of inective mor- precision and recall depend on the linguistic phology, each word form has to be lemma- validity and the coverage of the morphological lexicon used for lemmatisation. Estimates for lexicons used in our experiments will be given in Section 4. 3. Term Extraction 3.1. Formal denition In order to dene term extraction for- mally, word and term matching need to be dened rst. Let W be a set of all words and L : W → ℘(W ) denote a function that maps each word to the set of its lemmas, e.g. L(vode) = {vod, voda, voditi}. If we are not using lemmatisation or word w is not listed in the lexicon (which is usually the case Figure 2: A set of extracted terms contains the longest for non-inective words), then L(w) = {w}. term extracted in step a) and the contents of sets We dene words w1 and w2 to match i left a and right a. Sets left a = {(tax)} and right a = {(insurance, premium)} are calculated in steps b) and L(w1) ∩ L(w2) 6= ∅, i.e. both words are in- ections of a common lemma. c), respectively. We represent a term as a list of words (t1, . , tm), and a portion of text In these rare cases no term takes precedence with no intervening punctuation as list of over the other, so we choose to extract all of words (w1, . , wn). We dene three rela- them. tions that are relevant for term extraction. The process of term extraction can be for- Term (t , . , t ) matches a list of words malised as follows. Let be a set of all terms, 1 m S T (w , . , w ) at the position k i k+m−1 ≤ n + ∞ n a set of all word n-tuples 1 n W = n=1 W + and words ti and wk+i−1 match (in the sense and E : W → ℘(T ) a function mapping a introduced above) for . Term + i = 1, . , m list of words (w1, . , wn) ∈ W to a set of tA = (t1, . , tn) matching some list of words extracted terms, element of ℘(T ). For ex- at position a subsumes term tB = (t1, . , tm) ample, E(tax, on, motor, vehicle, insurance, matching the same list of words at position premium) = {(tax), (insurance, premium), b i a ≤ b and b + m ≤ a + n. Term (motor, vehicle, insurance)}, as shown in tA = (t1, . , tn) matching some list of words Fig. 2. at position and term a tB = (t1, . , tm) Function E is dened recursively as fol- matching the same list of words at position lows. If there are no terms present in the list b overlap i a < b < a + n < b + m. of words (w1, . , wn), then E(w1, . , wn) = If term tA subsumes term tB, it is al- ∅. Otherwise, E(w1, . , wn) = {t} ∪ most certainly true that term tA is more spe- left ∪ right, where t is the leftmost among cic.

Enhanced Thesaurus Terms Extraction for Document Indexing

An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification

Using Lexico-Syntactic Ontology Design Patterns for Ontology Creation and Population

LASLA and Collatinus

Experiments in Clustering Urban-Legend Texts

Removing Boilerplate and Duplicate Content from Web Corpora

The State of (Full) Text Search in Postgresql 12

A Diachronic Treebank of Russian Spanning More Than a Thousand Years

Download in the Conll-Format and Comprise Over ±175,000 Tokens

Lemmagen: Multilingual Lemmatisation with Induced Ripple-Down Rules

Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

A 500 Million Word POS-Tagged Icelandic Corpus

Approaches for Natural Language Processing (NLP)