First Steps Towards Building a Medical Lexicon for Spanish with Linguistic
Total Page:16
File Type:pdf, Size:1020Kb
First Steps towards a Medical Lexicon for Spanish with Linguistic and Semantic Information Leonardo Campillos-Llanos Computational Linguistics Laboratory, Universidad Autonoma´ de Madrid [email protected] Abstract methods (e.g. rule-based and dictionary-based). In order to overcome the data bottleneck, richly- We report the work-in-progress of collecting structured terminological thesauri enhance the an- MedLexSp, an unified medical lexicon for the Spanish language, featuring terms and in- notation and concept normalization of domain cor- flected word forms mapped to Unified Medical pora to be used subsequently in supervised mod- Language System (UMLS) Concept Unique els. More importantly, to achieve comparable Identifiers (CUIs), semantic types and groups. benchmarks, domain resources should integrate First, we leveraged a list of term lemmas and standard terminologies and coding schemes. forms from a previous project, and mapped In this context, we aim at providing a computa- them to UMLS terms and CUIs. To en- rich the lexicon, we used both domain-corpora tional lexicon to be used in the pre-processing of (e.g. Summaries of Product Characteristics text data used in more complex Natural Language and MedlinePlus) and natural language pro- Processing (NLP) tasks. The work here presented cessing techniques such as string distance reports the first steps towards building the Medi- methods or generation of syntactic variants of cal Lexicon for Spanish (MedLexSp). MedLexSp multi-word terms. We also added term vari- is conceived as an unified resource with linguis- ants by mapping their CUIs to missing items tic information (lemmas, inflected forms and part- available in the Spanish versions of standard of-speech), concepts mapped to Unified Medical thesauri (e.g. Medical Subject Headings and R World Health Organization Adverse Drug Re- Language System (hereafter, UMLS) (Boden- actions terminology). We enhanced the vo- reider, 2004) Concept Unique Identifiers (CUIs), cabulary coverage by gathering missing terms and semantic information (UMLS types and from resources such as the Anatomical Thera- groups). Figure1 is a sample of the lexicon. peutical Classification, the National Cancer In- MedLexSp is firstly aimed at named entity recog- stitute (NCI) Dictionary of Cancer Terms, Or- nition (NER), and it can be used in the pre- phaData, or the Nomenclator´ de Prescripcion´ annotation step of an NER pipeline. It can also for drug names. Part-of-Speech information is being included in the lexicon, and the cur- help lemmatization and feed general-purpose Part- rent version amounts up to 76 454 lemmas and of-Speech taggers applied to medical texts—as 1 203 043 inflected forms (including conjugated done in previous works (Oronoz et al., 2013). Be- verbs, number and gender variants), corre- cause it gathers semantic data of terms, it can ease sponding to 30 647 UMLS CUIs. MedLexSp relation extraction tasks. is distributed freely for research purposes. Our work makes several contributions. We 1 Introduction provide a resource to be distributed for research purposes in the BioNLP community. MedLexSp Current machine-learning and deep-learning- includes inflected forms (singular/plural, mascu- based methods are data-intensive; however, in do- line/feminine) and conjugated verb forms of term mains such as Medicine, sufficient data are not al- lemmas, which are mapped to UMLS Concept ways available—due to ethical concerns or privacy Unique Identifiers. Verb terms are also mapped issues, especially when dealing with Patient Pro- to Concept Unique Identifiers; this is the line of tected Information. Moreover, some tasks demand current works for expanding terminologies by in- high precision outcomes, which either need super- vised approaches with annotated data or hybrid 1https://zenodo.org/record/2621286 152 Proceedings of the BioNLP 2019 workshop, pages 152–164 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics Figure 1: Sample of the MedLexSp lexicon. In each entry, field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the Part-of-Speech; field 5, the semantic types(s); and field 6, the semantic group. cluding verb terms (Thompson et al., 2011; Chiu (MeSH) are developed by the National Library of et al., 2019). We also added inflected terms Medicine for indexing biomedical articles. Lastly, from MedlinePlus terms, OrphaData (INSERM, the World Organization of Family Doctors pro- 2019), the National Cancer Institute (NCI) Dic- duced the International Classification of Primary tionary of Cancer Terms, or the Nomenclator de Care (ICPC) to classify data aimed at family and prescripcion´ (AEMPS, 2019), a knowledge base primary care physicians (WONCA, 1998). of medical drugs prescribed in Spain. Medical taxonomies or classifications gather es- Section2 gives an overview of medical thesauri, sential domain knowledge.Some examples are the and Section3 describes the methods used to gather International Classification of Diseases vs. 10 terms (both corpora and NLP techniques), map (ICD-10) (WHO, 2004), or the Anatomical Thera- them to UMLS CUIs, and enrich the lexicon. Sec- peutical Chemical (ATC) classification of pharma- tion4 reports descriptive statistics of the current cological substances (WHO, 2019). version, and Section5, the results of an evaluation conducted during development. We discuss some 2.2 Medical Lexicons limitations and conclude in Section6. Medical lexicons provide a structured represen- 2 Background and Context tation of terms and their linguistic information (lemmas, inflection, or surface variants); hence, 2.1 Health thesauri and taxonomies they are essential for NLP tasks. Unlike medi- Medical thesauri and controlled vocabularies ag- cal thesauri or classifications, they do not register gregate listings of domain terms, and also gather term hierarchies, classifications nor ontological re- information about the type of term (e.g. syn- lations, but they can encode semantic information onym or preferred term), a semantic descriptor and, occasionally, argument structure and corpus- (e.g. DRUG or FINDING), an unique concept iden- based frequency data (Thompson et al., 2011). tifier, and very often a term definition or hier- Initiatives to collect medical lexicons have been archical relations between concepts (e.g. IS A). conducted for English (McCray et al., 1994; John- Thesauri are essential for indexing and populating son, 1999; Davis et al., 2012), German (Weske- databases, domain-specific information retrieval, Heck et al., 2002), French (Zweigenbaum et al., and standardized codification (Cimino, 1996). 2005) or Swedish, even in multilingual initia- Medical thesauri vary according to the applica- tives (Marko´ et al., 2006). For Spanish, some ef- tion (we only give examples related to our work). forts were sparked when a team at the National The Systematized Nomenclature of Medicine Library of Medicine (Divita et al., 2007) started Clinical Terms (SNOMED-CT) (Donnelly, 2006) to build an equivalent of the MetaMap tool (Aron- aims at encoding verbatim mentions in clinical son, 2001). Other teams conducted experiments to texts, and gathers ontological relations between automate the creation of a Spanish MetaMap by concepts. To report drug reactions in pharma- applying machine translation and domain ontolo- covigilance, the World Health Organization cre- gies (Carrero et al., 2008). These initiatives, to the ated the Adverse Reactions Terminology (WHO best of our knowledge, did not achieve a Spanish ART), although the Medical Dictionary for Regu- lexicon available for medical NLP. latory Activities (MedDRA) (Brown et al., 1999) Besides medical lexicons, domain-specific vo- is now preferred. The Medical Subject Headings cabularies were collected for Biology (Thompson 153 et al., 2011). With a different perspective and goal, Corpus-derived medical terminology construc- Consumer Health Vocabularies have been col- tion requires collecting domain texts and applying lected to bridge the gap between patients’ expres- term extraction methods, among others: comput- sions and healthcare professionals’ jargon (Zeng ing graphs of relations between parse trees and and Tse, 2006; Keselman et al., 2007). word dependency similarities (Nazarenko et al., 2001), using parallel corpora to map cognates or 2.3 The Unified Medical Language System aligned words (Sbrissia et al., 2004; Deleger´ et al., The Unified Medical Language System R 2009), linking terms or abbreviations to their def- (UMLS) (Bodenreider, 2004) MetaThesaurus initions or expanded word forms in the text where includes thesauri. The version we used (2018AB) they occur (Yu and Agichtein, 2003; McCrae and gathers 210 sources and over 3.82 millions of Collier, 2008), using dictionary features to iden- concepts in 23 languages. Synonym terms are tify polysemy (Pezik et al., 2008), combining encoded with Concept Unique Identifiers (CUIs); text mining techniques with databases (Thomp- and concepts are assigned a semantic type and son et al., 2011), or having experts review terms, group (McCray et al., 2001). a method which has been used to build disease- specific vocabularies (Wang et al., 2016). 2.4 Methods for Creating Medical Lexicons Approaches based on the Firthian Distribu- We will restrict us here to a shallow overview of tional hypothesis exploit distributional similarity approaches and will not consider taxonomy nor metrics (Carroll et al., 2012). Among them, more ontology building. Methods for widening medi- recent distributional semantics