Thesaurus Based Semantic Representation in Language Modeling for Medical Article Indexing

Jihen Majdoubi, Mohamed Tmar and Faiez Gargouri
Multimedia Information System and Advanced Computing Laboratory
Higher Institute of Information Technology and Multimedia, Sfax, Tunisia

Keywords: Medical article, Conceptual indexing, Language models, MeSH thesaurus.

Abstract: The language modeling approach plays an important role in many areas of natural language processing, including speech recognition, machine translation, and information retrieval. In this paper, we propose a contribution to the conceptual indexing of medical articles using the MeSH (Medical Subject Headings) thesaurus. We then present a tool for indexing medical articles called SIMA (System of Indexing Medical Articles), which uses a language model to extract the MeSH descriptors representing the document. To assess the relevance of a document to a MeSH descriptor, we estimate the probability that the MeSH descriptor would have been generated by the language model of this document.

Majdoubi J., Tmar M. and Gargouri F. (2010). Thesaurus Based Semantic Representation in Language Modeling for Medical Article Indexing. In Proceedings of the 12th International Conference on Enterprise Information Systems - Artificial Intelligence and Decision Support Systems, pages 65-74. DOI: 10.5220/0002903300650074. Copyright © SciTePress.

1 INTRODUCTION

The goal of an Information Retrieval System (IRS) is to retrieve information relevant to a user's query. This goal is quite a difficult task given the rapid and increasing development of the Internet.

Indeed, web information retrieval is becoming more and more complex for users: an IRS provides a lot of information, but users often fail to find the most relevant item for their information need.

Classical IRS are not suitable for managing this growing volume of data and finding documents relevant to a user's information need. The information retrieval techniques commonly used are based on statistical methods and do not take into account the meaning of the words contained in the user's query or in the documents. Indeed, current IRS use simple keyword matching: to be returned to the user, a document should contain at least one word of the query. However, a document can be relevant even if it does not contain any word of the query.

As a simple example, if the query is about "operating system", a document containing windows, unix, vista, and not the term "operating system", would not be retrieved by classical search engines. Consequently, the recall is often low.

Thus, much more "intelligence" should be embedded in IRS so that they are able to understand the meaning of words.

Adding a semantic resource (dictionaries, thesaurus, domain ontologies) to IRS is a possible solution to this problem of the current web.

As in the example of "operating system" cited above, by using the concepts of the semantic resource (SR) and their descriptions, the IRS can detect the relationships between operating system, windows, unix and vista, and return the document that mentions windows as an answer to the query about "operating system".

Consequently, incorporating semantics in the IR process can improve IRS performance.

In the literature, there are three main approaches to incorporating semantic information into IRS: (1) semantic indexing, (2) conceptual indexing and (3) query expansion.

1. Semantic indexing (Sense Based Indexing) is an indexing approach based on word senses. The basic idea is to index word meanings, rather than words taken as lexical strings. Examples of ambiguous words are bank (river/money) and plant (manufacturing/life) (Sanderson.M, 1994) (Yarowski.D, 1993). Thus, Word Sense Disambiguation (WSD) algorithms are needed in order to resolve word ambiguity in the document and determine the best sense of each word.

The usage of word senses in the process of document indexing is a matter of debate. (Gonzalo.J et al., 1998) performed experiments in sense based indexing: they used the SMART retrieval system and a manually disambiguated collection (Semcor). The results of their experiments showed that indexing by synsets can increase recall by up to 29% compared to word based indexing.

Ellen Voorhees (Voorhees.E.M, 1998) applied word-meaning indexing to the collection of documents as well as to the query. Comparing the results obtained with the performance of a standard run, (Voorhees.E.M, 1998) reported that the overall results showed a degradation in IR effectiveness when word meanings were used for indexing. She states that a long query has a bad influence on these results and degrades the IR performance.

2. Conceptual Indexing. Unlike previous indexing systems that use lists of simple words to index a document, conceptual indexing is based on concepts taken from the SR.

The conceptual indexing technique has been used in several works (Baziz.M, 2006) (Stairmand.A and J.William, 1996) (Mauldin.M.L, 1991). However, to our knowledge, the most intensive work in this direction was performed by Woods (Woods.W.A, 1997), who proposed an approach that was evaluated using small collections, for example the unix manual pages (about 10MB of text). To evaluate his system, he defined a new measure, called success rate, which indicates whether a question has an answer in the top ten documents returned by a retrieval system. The success rate obtained was 60%, compared to a maximum of 45% obtained using other retrieval systems.

The experiments described in (Woods.W.A, 1997) are based on small collections of text. But, as shown in (Ambroziak.J, 1997), this is not a limitation; conceptual indexing can be successfully applied to much larger text collections.

3. Query Expansion. The SR can also help users choose search terms and formulate their requests. For example, (Mihalcea.D and Moldovan, 2000) and (Voorhees.E.M, 1994) propose IRS which use the WordNet thesaurus to expand the user's query. The query is thus expanded with terms similar to those of the original query.

These studies showed that IRS based on conceptual indexing, semantic indexing or query expansion can improve retrieval effectiveness.

In our work, we are interested in conceptual indexing. The essential argument which motivates our choice is that we are concerned with the medical field, and that the technique of conceptual indexing has been used with success in particular domains, such as the legal field (Stein.J.A, 1997), the medical field (Muller.H et al., 2004) and the sport field (Khan.L, 2000).

In this paper, we propose our contribution to the conceptual indexing of medical articles using the language modeling approach.

After summarizing the background for this problem in the next section, we present previous work on indexing medical articles in Section 3. Section 4 explains the language model for Information Retrieval. Following that, we detail our conceptual indexing approach in Section 5. An experimental evaluation and comparison results are discussed in Sections 6 and 7. Finally, Section 8 presents some conclusions and future work directions.

2 BACKGROUND

2.1 Context

Each year, the rate of publication of scientific literature grows, making it increasingly hard for researchers to keep up with novel relevant published work. In recent years, great efforts have been devoted to managing this huge volume of information effectively, in several fields.

In the medical field, scientific articles represent a very important source of knowledge for researchers of this domain. The researcher usually needs to deal with a large number of scientific and technical articles for checking, validating and enriching his research work.

This kind of information is often present in electronic biomedical resources available through the Internet, like CISMEF (http://www.chu-rouen.fr/cismef/) and PUBMED (http://www.ncbi.nlm.nih.gov/pubmed/). However, the effort that the user puts into the search is often forgotten and lost.

To solve these issues, current health Information Systems must take advantage of recent advances in knowledge representation and management, such as the use of medical terminology resources. Indeed, these resources aim at establishing the representations of knowledge through which the computer can handle semantic information.

2.2 Medical Terminology Resources

The language of biomedical texts, like all natural language, is complex and poses problems of synonymy and polysemy. Therefore, many terminological systems have been proposed and developed, such as Galen, UMLS, GO and MeSH.

In this section, we present some examples of medical terminology resources:

• SNOMED is a coding system, controlled vocabulary, classification system and thesaurus. It is a comprehensive clinical terminology designed to capture information about a patient's history, illnesses, treatment and outcomes.

• Galen (General Architecture for Language and

the resource to be indexed, translation of the emerging concepts into the appropriate controlled vocabulary (MeSH thesaurus) and revision of the resulting index.

In (Aronson.A et al., 2004), the authors proposed the MTI (MeSH Terminology Indexer), used by NLM to index English resources. MTI results from the combination of two MeSH indexing methods: MetaMap Indexing (MMI) and a statistical, knowledge-based approach
