First Steps towards a Medical for Spanish with Linguistic and Semantic Information

Leonardo Campillos-Llanos Computational Laboratory, Universidad Autonoma´ de Madrid [email protected]

Abstract methods (e.g. rule-based and -based). In order to overcome the data bottleneck, richly- We report the work-in-progress of collecting structured terminological thesauri enhance the an- MedLexSp, an unified medical lexicon for the Spanish , featuring terms and in- notation and concept normalization of domain cor- flected word forms mapped to Unified Medical pora to be used subsequently in supervised mod- Language System (UMLS) Concept Unique els. More importantly, to achieve comparable Identifiers (CUIs), semantic types and groups. benchmarks, domain resources should integrate First, we leveraged a list of term lemmas and standard terminologies and coding schemes. forms from a previous project, and mapped In this context, we aim at providing a computa- them to UMLS terms and CUIs. To en- rich the lexicon, we used both domain-corpora tional lexicon to be used in the pre-processing of (e.g. Summaries of Product Characteristics text data used in more complex Natural Language and MedlinePlus) and natural language pro- Processing (NLP) tasks. The work here presented cessing techniques such as string distance reports the first steps towards building the Medi- methods or generation of syntactic variants of cal Lexicon for Spanish (MedLexSp). MedLexSp multi-word terms. We also added term vari- is conceived as an unified resource with linguis- ants by mapping their CUIs to missing items tic information (lemmas, inflected forms and part- available in the Spanish versions of standard of-speech), concepts mapped to Unified Medical thesauri (e.g. Medical Subject Headings and R World Health Organization Adverse Drug Re- Language System (hereafter, UMLS) (Boden- actions terminology). We enhanced the vo- reider, 2004) Concept Unique Identifiers (CUIs), cabulary coverage by gathering missing terms and semantic information (UMLS types and from resources such as the Anatomical Thera- groups). Figure1 is a sample of the lexicon. peutical Classification, the National Cancer In- MedLexSp is firstly aimed at named entity recog- stitute (NCI) Dictionary of Cancer Terms, Or- nition (NER), and it can be used in the pre- phaData, or the Nomenclator´ de Prescripcion´ annotation step of an NER pipeline. It can also for drug names. Part-of-Speech information is being included in the lexicon, and the cur- help lemmatization and feed general-purpose Part- rent version amounts up to 76 454 lemmas and of-Speech taggers applied to medical texts—as 1 203 043 inflected forms (including conjugated done in previous works (Oronoz et al., 2013). Be- verbs, number and gender variants), corre- cause it gathers semantic data of terms, it can ease sponding to 30 647 UMLS CUIs. MedLexSp relation extraction tasks. is distributed freely for research purposes. Our work makes several contributions. We 1 Introduction provide a resource to be distributed for research purposes in the BioNLP community. MedLexSp Current machine-learning and deep-learning- includes inflected forms (singular/plural, mascu- based methods are data-intensive; however, in do- line/feminine) and conjugated verb forms of term mains such as , sufficient data are not al- lemmas, which are mapped to UMLS Concept ways available—due to ethical concerns or privacy Unique Identifiers. Verb terms are also mapped issues, especially when dealing with Patient Pro- to Concept Unique Identifiers; this is the line of tected Information. Moreover, some tasks demand current works for expanding terminologies by in- high precision outcomes, which either need super- vised approaches with annotated data or hybrid 1https://zenodo.org/record/2621286

152 Proceedings of the BioNLP 2019 workshop, pages 152–164 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics Figure 1: Sample of the MedLexSp lexicon. In each entry, field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the Part-of-Speech; field 5, the semantic types(s); and field 6, the semantic group. cluding verb terms (Thompson et al., 2011; Chiu (MeSH) are developed by the National Library of et al., 2019). We also added inflected terms Medicine for indexing biomedical articles. Lastly, from MedlinePlus terms, OrphaData (INSERM, the World Organization of Family Doctors pro- 2019), the National Cancer Institute (NCI) Dic- duced the International Classification of Primary tionary of Cancer Terms, or the Nomenclator de Care (ICPC) to classify data aimed at family and prescripcion´ (AEMPS, 2019), a knowledge base primary care physicians (WONCA, 1998). of medical drugs prescribed in Spain. Medical taxonomies or classifications gather es- Section2 gives an overview of medical thesauri, sential domain knowledge.Some examples are the and Section3 describes the methods used to gather International Classification of Diseases vs. 10 terms (both corpora and NLP techniques), map (ICD-10) (WHO, 2004), or the Anatomical Thera- them to UMLS CUIs, and enrich the lexicon. Sec- peutical Chemical (ATC) classification of pharma- tion4 reports descriptive statistics of the current cological substances (WHO, 2019). version, and Section5, the results of an evaluation conducted during development. We discuss some 2.2 Medical limitations and conclude in Section6. Medical lexicons provide a structured represen- 2 Background and Context tation of terms and their linguistic information (lemmas, inflection, or surface variants); hence, 2.1 Health thesauri and taxonomies they are essential for NLP tasks. Unlike medi- Medical thesauri and controlled vocabularies ag- cal thesauri or classifications, they do not register gregate listings of domain terms, and also gather term hierarchies, classifications nor ontological re- information about the type of term (e.g. syn- lations, but they can encode semantic information onym or preferred term), a semantic descriptor and, occasionally, argument structure and corpus- (e.g. DRUG or FINDING), an unique concept iden- based frequency data (Thompson et al., 2011). tifier, and very often a term definition or hier- Initiatives to collect medical lexicons have been archical relations between concepts (e.g. IS A). conducted for English (McCray et al., 1994; John- Thesauri are essential for indexing and populating son, 1999; Davis et al., 2012), German (Weske- databases, domain-specific information retrieval, Heck et al., 2002), French (Zweigenbaum et al., and standardized codification (Cimino, 1996). 2005) or Swedish, even in multilingual initia- Medical thesauri vary according to the applica- tives (Marko´ et al., 2006). For Spanish, some ef- tion (we only give examples related to our work). forts were sparked when a team at the National The Systematized Nomenclature of Medicine Library of Medicine (Divita et al., 2007) started Clinical Terms (SNOMED-CT) (Donnelly, 2006) to build an equivalent of the MetaMap tool (Aron- aims at encoding verbatim mentions in clinical son, 2001). Other teams conducted experiments to texts, and gathers ontological relations between automate the creation of a Spanish MetaMap by concepts. To report drug reactions in pharma- applying machine translation and domain ontolo- covigilance, the World Health Organization cre- gies (Carrero et al., 2008). These initiatives, to the ated the Adverse Reactions Terminology (WHO best of our knowledge, did not achieve a Spanish ART), although the Medical Dictionary for Regu- lexicon available for medical NLP. latory Activities (MedDRA) (Brown et al., 1999) Besides medical lexicons, domain-specific vo- is now preferred. The Medical Subject Headings cabularies were collected for Biology (Thompson

153 et al., 2011). With a different perspective and goal, Corpus-derived medical terminology construc- Consumer Health Vocabularies have been col- tion requires collecting domain texts and applying lected to bridge the gap between patients’ expres- term extraction methods, among others: comput- sions and healthcare professionals’ jargon (Zeng ing graphs of relations between parse trees and and Tse, 2006; Keselman et al., 2007). word dependency similarities (Nazarenko et al., 2001), using parallel corpora to map cognates or 2.3 The Unified Medical Language System aligned words (Sbrissia et al., 2004; Deleger´ et al., The Unified Medical Language System R 2009), linking terms or abbreviations to their def- (UMLS) (Bodenreider, 2004) MetaThesaurus initions or expanded word forms in the text where includes thesauri. The version we used (2018AB) they occur (Yu and Agichtein, 2003; McCrae and gathers 210 sources and over 3.82 millions of Collier, 2008), using dictionary features to iden- concepts in 23 . Synonym terms are tify polysemy (Pezik et al., 2008), combining encoded with Concept Unique Identifiers (CUIs); text mining techniques with databases (Thomp- and concepts are assigned a semantic type and son et al., 2011), or having experts review terms, group (McCray et al., 2001). a method which has been used to build disease- specific vocabularies (Wang et al., 2016). 2.4 Methods for Creating Medical Lexicons Approaches based on the Firthian Distribu- We will restrict us here to a shallow overview of tional hypothesis exploit distributional similarity approaches and will not consider taxonomy nor metrics (Carroll et al., 2012). Among them, more ontology building. Methods for widening medi- recent distributional semantics methods represent cal vocabularies range from generating syntactic- terms in the vector space, or calculate word- level variants of multi-word terms (Jacquemin, embeddings to compute similarity measures be- 1999), inferring derivation rules from string sim- tween vectors, thus allowing the unsupervised ex- ilarity matches and morphological relations be- pansion of domain terms (Pyysalo et al., 2013; tween derivational variants (Grabar and Zweigen- Skeppstedt et al., 2013; Henriksson et al., 2014; baum, 2000), gathering inflected variants semi- Wang et al., 2015; Ahltorp et al., 2016; Segura- automatically (Cartoni and Zweigenbaum, 2010), Bedmar and Mart´ınez, 2017) or concept normal- or deriving terms from corpora (more below). ization (Limsopatham and Collier, 2016). Graeco- components are very productive Lastly, to develop Consumer Health Vocabular- for coining medical terms; thus, several BioNLP ies (CHV), a variety of techniques have been used: systems integrate -based lexical re- analysis by experts of Medline queries (Zeng sources. For example, for decomposing terms and Tse, 2006), term recognition methods and morphosemantically and deriving their defini- collaborative review of user logs in medical tions (Namer and Zweigenbaum, 2004), or map- sites (Zeng et al., 2007), hybrid methods com- ping queries to concepts and indexing documents bining n-grams extraction, the C-value, and dic- in cross-lingual information retrieval, based on a tionary look-up (Doing-Harris and Zeng-Treitler, subword-based (Marko´ et al., 2011), co-occurrence analysis of terms and seed 2005). In this line, generating paraphrase equiv- words (Jiang and Yang, 2013), or approaches alents of neoclassical compounds (e.g. thyrome- based of similarity measures between CHV lexi- galia → enlarged thyroid) is an approach with po- cons and reference lexicons (Seedorff et al., 2013). tential for deriving new terms, and concept nor- malization systems (Thompson and Ananiadou, 3 Methods 2018) already implement it. Because string simi- larity measures and edit distance patterns are used Figure2 depicts the methods used to collect the for normalization—e.g (Tsuruoka et al., 2007; MedLexSp lexicon. In a first step (left part of Fig- Kate, 2015)—and terminology mapping (Dziadek ure2), we leveraged the lemmas and word forms et al., 2017), these approaches are also powerful obtained from a Spanish medical lexicon, mostly for expanding medical lexicons from a set of ref- corpus-derived; we will refer to it as the base list. erence terms. Decomposition of multi-word terms We only used the subset of lemmas and forms that and synonym expansion of their components are could be mapped authomatically to UMLS CUIs also alternative strategies applied in normalization (exact string match). In a second step, we added systems (Tseytlin et al., 2016). missing variants of terms using different methods:

154 Figure 2: Methods to collect the MedLexSp lexicon.

• Testing string distance metrics to match from the UMLS—e.g. Spanish Medical Sub- terms in the base list to variants that re- ject Headings (MeSH), SNOMED CT or mained unmatched: e.g. eccema ↔ eczema the WHO ART terminology—and external (’eczema’, C0013595). sources such as the Anatomical Therapeu- tical Classification, the National Cancer In- • Incorporating derivational variants to the stitute (NCI) Dictionary of Cancer Terms,2 base list: e.g. aneurisma (‘aneurysm’) ↔ the Nomenclator de prescripcion´ (AEMPS, aneurismatico´ (‘aneurysmatic’, C0002940). 2019), OrphaData (INSERM, 2019), or the Spanish Drug Effect database (SD- • Including conjugated verbs corresponding Edb) (Segura-Bedmar et al., 2015). to the noun terms with CUIs selected in the base list: e.g. tos (‘cough’, C0010200) → • Including subsets of missing terms from toser, tosiendo... (‘to cough’, ‘coughing’...). thesauri if attested in domain texts. And vice versa, extracting corpus-derived terms • Matching affixes and roots to those terms in from domain texts: synonymous terms from the base list with CUIs: e.g. corazon´ (‘heart, MedlinePlus,3 and terms from Summaries of C0018787) → cardio- (‘cardio-’). Product Characteristics (Segura-Bedmar and • Adding syntactic variants of the multi-word Mart´ınez, 2017). terms in the base list: e.g. aneurisma aortico´ The next subsections explain each method. abdominal (‘aortic abdominal aneurysm’) ↔ 3.1 Leveraging an Inflected Lexicon aneurisma abdominal aortico´ (‘abdominal aortic aneurysm’, C0162871). We started using a list of medical terms col- lected in a previous project on Spanish med- • Adding acronyms and abbreviations of ical terminology;4 we will refer to it as the the terms included in the base list: e.g. base list. We collected this resource by com- aneurisma abdominal aortico´ (‘abdominal bining different methods (Moreno Sandoval and aortic aneurysm’, C0162871) → AAA. Campillos Llanos, 2015) applied on a corpus of 4204 Spanish medical texts (around 4 mil- • Extending the base list by mapping the lion tokens) (Moreno-Sandoval and Campillos- CUIs of the terms in the subset to gather Llanos, 2013). To extract candidate medical missing variants of synonymous terms: 2 eccema ↔ der- https://www.cancer.gov/espanol/ e.g. (‘eczema’, C0013595) publicaciones/diccionario matitis eccematosa (‘eczematous dermatitis’, 3https://medlineplus.gov/spanish/ C0013595). We considered several sources 4http://labda.inf.uc3m.es/multimedica/

155 terms for the base list, we combined rule-based terms. Using this list, we assigned a CUI to the techniques (Part-of-Speech tagging and filtering corresponding derivational variant: e.g. the CUI through medical affixes), corpus-based methods of pancreas´ (C0030274) was also ascribed to pan- (comparing word forms from a general corpus and creatico´ (‘pancreatic’). The current version gath- from the domain corpus), and statistical methods, ers a total of 801 derivational variants with CUIs. namely the Log-Likelihood ratio (Dunning, 1993). We checked in medical sources—e.g. the dictio- Conjugated verbs Most terms in the UMLS nary published by the Spanish Royal Academy of or standard terminologies are noun or adjective Medicine (RANME, 2011)—the terms selected by phrases. This limits the named entity recogni- means of those three methods, before being in- tion of medical concepts expressed with verbs in cluded in the list. This base list was used to build free text; given a context such as el paciente tose an automatic term extractor (Campillos Llanos (‘the patient coughs’), the concept of ‘coughing’ et al., 2013), and amounted to 38 354 entries. would not be identified. To widen the scope of concept normalization, verb terms were mapped Because one of the goals of MedLexSp is to CUIs from derived nouns: e.g. tos (‘cough- concept normalization by using standard domain ing’, C0010200) → toser (‘to cough’, C0010200). terminologies, we did not include the full base We again used a list of correspondences between list. We only used terms that could be assigned verbs and deverbal nouns. We included the conju- UMLS Concept Unique Identifiers (CUIs) in the gated forms of verb lemmas in each verb entry of UMLS MetaThesaurus version 2018AB, namely the lexicon. We used a python script that relies from those terminologies of special biomedical or on the lexicon of a Spanish Part-of-Speech tag- clinical interest (e.g. SNOMED CT, WHO ART or ger (Moreno Sandoval and Guirao, 2006) to gener- Medical Subject Headings) with available Span- ate all conjugated forms of verb terms: e.g. toser ish translations. We mapped 18 263 lemmas to (‘to cough’) → tose (‘he/she coughs’), tosiendo CUIs, which means 47.61% entries of the original (‘coughing’), etc. The current version includes a lexicon. CUIs were assigned according to an ex- total of 295 single- or multi-word verb items. act match criterion. For example, donacion´ (‘do- nation’) is not matched with donacion´ de tejido Affixes and lexical roots In a first step, we (‘Tissue Donation’, C0080231), because the latter collected affixes and roots from several sources. makes reference to a donation subtype. Note that Firstly, we leveraged a list used in a previous ex- the current version of MedLexSp does not include periment (Sandoval et al., 2013). This list amounts the full list of terms from MeSH or SNOMED CT, to 1719 forms and considers morphological vari- but only those which were originally mapped from ants of affixes (e.g. prefix cardio- may have ac- the base list to UMLS terms with CUIs. cented variant forms in Spanish, such as cardio-´ ). Secondly, we translated to Spanish several affixes 3.2 Enriching the Lexicon and roots from the Specialist Lexicon R (McCray String distance metrics We tested mapping et al., 1994) and then added variant forms. In a terms from the subset of entities with CUIs to second step, we assigned UMLS CUIs to affixes terms in the UMLS by applying distance met- and roots in the list. The current list gathers a total rics (Levenshtein, 1966) of less than 2. This al- of 161 entries (82 prefixes and 79 suffixes) with lowed us mapping hyphenated variants to terms 134 different CUIs and 386 variant forms. Note without hyphen (e.g. creatina-cinasa ↔ creatina that many affixes and roots were not included be- cinasa, ‘creatine kinase’, C0010287), compound cause they are too underspecified to be assigned to terms that are often written as single-words (dietil a CUI, or are not restricted to the medical domain eter´ ↔ dietileter´ , ‘diethyl ether’, C0014994), or (e.g. kilo- expresses a quantitative concept). matching terms with minimal morphological vari- ation (eccema ↔ eczema, ‘eczema’, C0013595). Abbreviations and acronyms Firstly, we gath- A total of 1463 terms with CUIs were matched to ered a list of equivalences between full forms the original base list. and abbreviations and acronyms; we used three sources: 1) the collection of Spanish abbre- Derivational variants In line with previous viations and acronyms used in hospitals, col- work (Grabar and Zweigenbaum, 2000), we col- lected by medical doctors (Yetano and Alberola, lected a list of equivalent derivational variants of 2003); 2) abbreviations and acronyms used in

156 the 2nd IberEval Challenge 2018 on Biomedi- Spanish terms from the base list to the English cal Abbreviation Recognition and Resolution (In- component of the UMLS. This method allowed txaurrondo et al., 2018); and 3) Spanish abbre- us to obtain the CUIs of terms unavailable in viations and acronyms from Wikipedia.5 Sec- Spanish terminologies, which remain unchanged ondly, we matched the resulting list of equivalent in the Spanish language. Namely, Latin scien- terms (acronyms and full forms) to UMLS terms, tific names (e.g. Campylobacter fetus, C0006814), adding the corresponding CUIs to those miss- compound terms with Graeco-Latin roots (e.g. ab- ing acronyms. For example, the full term virus dominalgia, C0000737), English acronyms that de Epstein-Barr (‘Epstein-Barr virus’) has CUI are broadly used in the medical discourse with- C0014644, and we also assigned this code to the out Spanish translation (e.g. GABA, ‘gamma- corresponding acronym in Spanish (VEB). With aminobutyric acid’, C0016904), or international this method, we assigned CUIs to 1225 items. brand drug names (e.g. abilify R ). In these cases, the same word is used in both English and Spanish. Syntactic variants of terms To widen the cov- We manually revised the list of mapped terms to erage of terms mapped to CUIs, we generated discard homonymous terms with a different mean- variants of multiword entities by swapping the ing in English (e.g. TIP R is a brand name of a word order of their components. Then, we tried medical drug, but it also means ‘point’ or ‘sugges- to match each new variant to entities with CUIs. tion’ in English). For example, aneurisma aortico´ abdominal (‘aor- We extended the list of terms by extracting tic abdominal aneurysm’) has CUI C0162871, and the information related to rare diseases from we assigned the same CUI to the generated variant OrphaData (INSERM, 2019).6 We also added aneurisma abdominal aortico´ (‘abdominal aortic terms of pharmacological substances and inter- aneurysm’). With this method, we gathered a total national non-proprietary names from the Spanish of 154 variants of terms with CUIs in the base list. Drug Effect database (SDEdb) (Segura-Bedmar Mapping UMLS term variants through CUIs et al., 2015) and the Nomenclator de pre- We gathered synonymous variants referring to scripcion´ (AEMPS, 2019), a resource published and updated regularly by the Spanish Agency of each corresponding concept by using the UMLS 7 CUIs from the terms included in the base list. To Drugs and Food Products. avoid including noisy terms adequate for biomedi- For all these procedures and sources, we applied cal natural language processing, we first cleaned semiautomatic methods to generate the singular the terms from the terminologies we used. To and plural inflected forms of the missing terms that do so, we applied methods for cleaning term were mapped through CUIs. We used the Pattern strings (Aronson et al., 2008; Hettne et al., 2010; python library (Smedt and Daelemans, 2012) to Nev´ eol´ et al., 2012; Hellrich et al., 2015). We create plural forms of terms, which were revised deleted paraphrastic terms that include a descrip- manually before being included in MedLexSp. tion or specification of the entity type in the Corpus-derived terms When we started adding term string. These terms commonly come from variant terms from thesauri, the question of where Spanish SNOMED CT. For example, we deleted to stop adding terms came up. In the first version, tos (hallazgo), ‘cough (finding)’ (CUI C0010200) we decided not to include all terms available in and kept the term (cough, ‘cough’). Likewise, MeSH or SNOMED CT terminologies, given that we removed most anatomic terms beginning with these thesauri contain terms that are often not nec- estructura de (‘structure of’): e.g. regarding essary in clinical or biomedical NER tasks (e.g. term estructura del ojo (‘structure of eyeball’, names of trees, wild animals, professions or ab- C0015392), we only kept the synonym ojo (‘eye- stract concepts). On the other hand, to make the ball’). Lastly, terms in the WHO ART terminol- ogy needed to be accented and reversed regarding 6http://www.orphadata.org/data/ word order: e.g. disociativa, reaccion → reaccion´ xml/es_product1.xml We make available disociativa (’dissociative reaction’, C0012746). the script used to extract terms from OrphaData: https://github.com/lcampillos/bionlp2019 We also applied an exact-match mapping of The code can be adapted to process OrphaData in other languages (e.g. English, French, Italian or Portuguese). 5https://es.wikipedia.org/wiki/Anexo: 7http://listadomedicamentos.aemps.gob. Acrnimos_en_medicina es/prescripcion.zip.

157 resource comprehensive, we needed to comple- 3.3 Semantic and linguistic information ment the base list with supplementary terms from We added to each CUI and lexical entry the cor- thesauri. Hence, in order to decide which items to responding semantic type(s) and group from the include in a first version, we computed term fre- UMLS. To avoid noise when annotating biomed- quencies using a medical corpus from a previous ical texts semantically, we disfavoured semantic project (4 million tokens) (Moreno-Sandoval and types of the semantic group Concepts and Ideas Campillos-Llanos, 2013). We currently include (CONC, e.g. Quantitative Concept, Functional terms from the Spanish MeSH and SNOMED CT Concept or Qualitative Concept), which are rather that were missing in the base list, if they were doc- unspecified. We only included terms from that umented in that corpus. By limiting the inclusion group if no other semantic label was available. of such subset of terms, we aim at providing qual- If a concept or term can be assigned to two dif- ity enriched data (i.e. with revised inflected forms) ferent groups, the element labelled with CONC in a reasonable time and manner. is not included in our lexicon. For example, In a different vein, and similarly to for- the term inhalacion´ (‘inhalation’) can be related mer work (Calleja et al., 2017), we extracted to concept C0004048 (semantic type Organism terms from Summaries of Product Characteristics Function, and group PHYS) and also to con- (SPCs). We used Easy Drug Package Leaflets cept C4521689 (semantic type Intellectual Prod- (EasyDPL), a corpus of 306 texts annotated with uct, and group CONC). In this case, we only pre- medical drugs and pathological entities (1400 drug serve the lexical entry of concept C0004048 and effects) (Segura-Bedmar and Mart´ınez, 2017). We we rule out the entry of concept C4521689. annotated these texts and compared our output an- We have also started adding the Part-of-Speech notation with regard to this dataset. We used a (PoS) category of each entry in the lexicon. For purely dictionary-based named-entity recogniser multiword terms, the category of the head term is with modules for normalization (e.g. lower- selected; e.g. enfermedad de Crohn (‘Crohn’s dis- casing), tokenization and lemmatization, imple- ease’) is categorized as N (‘noun’). We are cur- 8 mented in spaCy; then, the MedLexSp lexicon rently testing different techniques to predict the was used for exact string matching. We did not use PoS and automate the assignment of categories to pre- or post-processing rules in the current version each entry, which is still not fully satisfactory. (e.g. rules of term composition). In several iterative rounds, we annotated the 4 Statistics texts, identified the unannotated entities, and added them to the lexicon. We did not add (al- Table1 shows the count of entries in the lexicon though annotated in the corpus) entities without according to each source or procedure applied to a CUI, e.g. coordinated entities (e.g. pies y map terms to UMLS CUIs. Note that the full manos fr´ıas, ‘cold hands and feet’) or too spe- count exceeds the count of term entries in the cur- cific, post-modified terms (e.g. dolor de cabeza rent version of MedLexSp, given that some terms intenso, ‘intense headache’; only ‘headache’ has were gathered through different methods simulta- CUI C0018681). By using SPCs, we added 837 neously. Table2 shows the descriptive statistics term entries to MedLexSp, and we ensure that it of the lexicon: counts of lemmas and word forms, includes common terms referring to adverse drug and total number of CUIs. Lastly, Table3 shows reactions and medical drugs. a preliminary count of PoS categories in the cur- Lastly, for Consumer Health Vocabulary terms, rent version of the lexicon. Note that most entries we extracted synonyms in MedlinePlus Spanish. are nouns or need revision (UNKN stands for ‘un- This resource provides terms in patient language known’); this task is currently being undertaken. that were missing: e.g. ojo vago (‘lazy eye’) is a Finally, Figure3 depicts the distribution of se- synonym of ambliop´ıa (‘amblyopia’, C0002418). mantic groups. Of note, some groups are under- We added 783 term entries from this resource. In represented, due to the corpora and thesauri used addition, we collected 6110 cancer-related terms to collect terms. For example, few entities belong from the Spanish version of the National Cancer to the GENE group, which implies that the cov- Institute Dictionary. erage of the current version of MedLexSp is not adequate for tasks in the Genomics domain. 8https://spacy.io/ The amount of lemmas/word forms is lower

158 Method # entries Abbreviations / acronyms 1225 Affixes / roots 161 Derived adjectives 801 Conjugated verb forms 295 Base list mapped to UMLS CUIs: Exact match to Spanish UMLS 18 263 Exact match to English UMLS 2534 String distance method 1463 Syntactic variants 134 Terms from thesauri and corpora: ATC + Nomenclator´ + SDEdb 2931 ICD-10 1299 Figure 3: Distribution of semantic groups in the lexicon ICPC 55 MedDRA 5015 MedlinePlus 783 UMLS CUIs; and 2) we cleaned noisy terms. MeSH 6831 As explained, descriptors and qualifiers were re- NCI 6110 moved: e.g. SNOMED CT term fiebre (hal- OrphaData 10 741 lazgo) (‘fever (finding)’, C0015967) was short- SNOMED CT 23 096 ened to fiebre. We also ruled out some con- SPCs (EasyDLP corpus) 837 cepts belonging to semantic groups that we can be noisy for clinical or medical NER tasks, such Table 1: Count of lexical entries according to each as CONC or GEOG; e.g. hierro is related to source or procedure to map terms to UMLS CUIs. concept C0302583, ‘iron’, CHEM; or to concept C0454671, ‘Island of Hierro’, GEOG (the latter Lemmas Forms CUIs concept was discarded). Single-words 23 572 23 592 - Multi-words 52 882 179 451 - 5 Development Evaluation Total 76 454 203 043 30 647 We analysed the coverage of the lexicon with re- gard to UMLS semantic groups. We applied the Table 2: Descriptive statistics of the lexicon dictionary-based NER tool explained below to a gold standard available in the community. We fo- PoS Example Count cused on analysing the annotation of few UMLS N pancreas 58 830 groups (DISO, CHEM, PROC and ANAT) and UNKN - 13 618 assessed how well the lexicon annotated them ADJ abdominal 2283 with regard to the gold standard. We quanti- ADJ/N gemelo (‘twin’) 700 fied the matched annotations in terms of preci- NPR Filoviridae 549 sion, recall and F1-measure by using the BRAT- V toser (‘to cough’) 295 Eval script (Verspoor et al., 2013). cardio- AFF 161 A first version of MedLexSp was evaluated with gravemente ADV (‘severely’) 20 the Spanish texts from the MANTRA corpus (Kors et al., 2015), which gathers 100 texts from the Eu- Table 3: Preliminary counts of Part-of-Speech (PoS) ropean Agency (1961 tokens) and 100 categories. N: ‘noun’; UNKN: ‘unknown’; ADJ: ‘adjec- texts from Medline (1087 tokens). These texts are tive’; ADJ/N: a term that can be an adjective or a noun (depending on the context); NPR: ‘proper name’; AFF: available in BRAT format and were annotated with ‘affix’; ADV: ‘adverb’ UMLS CUIs, semantic types and groups. We pre- processed the annotated texts for mapping refer- ence annotations to UMLS semantic groups. than in other UMLS-based resources because: 1) With this dataset, we achieved an overall F- we did not include the full thesauri, but only terms measure of 0.83 (exact match) and of 0.87 (ap- from the original base list that were mapped to proximate match), although the performance var-

159 Exact match Approx. match multiwords with coordinated terms (e.g. cancer´ de P R F1 P R F1 mama y ovario, ‘breast and ovarian cancer’) and ANAT 0.64 0.91 0.75 0.67 0.98 0.80 adjective modifiers. For example, MedLexSp in- CHEM 0.85 0.96 0.90 0.88 0.97 0.92 cludes the term cancer´ de mama (‘breast cancer’), DISO 0.83 0.83 0.83 0.89 0.92 0.90 but not common variants such as cancer´ de mama PROC 0.73 0.84 0.78 0.76 0.88 0.82 derecha (‘right breast cancer’) or cancer´ de una OA 0.79 0.87 0.83 0.83 0.93 0.87 mama (‘cancer of one breast’). Both phenomena need specific processing techniques. Table 4: Evaluation of the lexicon; P: precision; Mapping concepts to terms differing across va- R: recall; F1: F-measure; OA: overall rieties of the Spanish language was not exhaus- tive. As we departed mainly from a set of corpus- derived terms, most terms belong to the variety ied across semantic groups (Table4). In our er- used in the texts (i.e. Peninsular Spanish). How- ror analysis, we observed that unmatched entities ever, since we used other terminological sources, were misspellings (e.g. ∗detecion´ instead of de- terms from other varieties were included: e.g. teccion´ , ‘detection’), discontinuous entities (e.g. virus sincitial respiratorio (‘respiratory syncytial hinchazon´ de la piel in hinchazon´ y hormigueo de virus’, C0035236) is a term preferred in Spain or la piel, ‘swelling and tingling of skin’), or entities Colombia, but we have the variant virus sincicial whose scope was wrongly annotated. respiratorio (most frequent in Chile or Argentina). 6 Discussion and Conclusions These aspects need nonetheless improvement in future versions, in the same way as the coverage The lexicon is being developed by means of hybrid of terms from Consumer Health Vocabularies. NLP methods and corpus-derived terms. We com- Lastly, we are interested in exploring bine the mapping of corpus terms to available the- embedding-based methods for term expan- sauri, and viceversa, terms missing in the lexicon sion, and in evaluating the lexicon with a broader were attested in domain texts, so that only a sub- set of domain texts. set of attested terms be included in a first version. Interestingly, searching terms from thesauri in a Acknowledgments corpus showed us that many of those terms show low frequencies. From a subset of 56 813 MeSH This work has been done under the NLPMedTerm 9 terms missing in the base list, only 6 676 (11.75%) project, funded by the European Union’s Hori- occurred in the corpus we used (Moreno-Sandoval zon 2020 research programme under the Marie and Campillos-Llanos, 2013). Although this is Skodowska-Curie grant agreement no. 713366 due to the influence of the text types, it also reflects (InterTalentum UAM). We greatly thank the the difference betweem terms from thesauri and in institutions who gave permission to include their real usage. This is another argument that stands data in MedLexSp, and also thank the anonymous for the need for dedicated lexicons combined with reviewers for their valuable comments. NLP methods to achieve successful NER results. A limitation of our evaluation procedure is the restriction to a very small set of texts; hence, re- References sults are not comparable to other tasks or text types. To provide more generalizable results, we AEMPS. 2019. Nomenclator´ de prescripcion.´ www. aemps.gob.es [accessed 2019-03-09] need to evaluate the MedLexSp lexicon with an- . other annotated medical corpus in Spanish, but Magnus Ahltorp, Maria Skeppstedt, Shiho Kitajima, such resource is not freely available to date. Aron Henriksson, Rafal Rzepka, and Kenji Araki. We assume the lexicon is not task-independent. 2016. Expansion of medical vocabularies using distributional semantics on Japanese patient blogs. To avoid ambiguity, terms would need to be fil- Journal of Biomedical Semantics, 7(1):58. tered according to the semantic types needed. For example, terms from the Occupation or Discipline Alan R Aronson. 2001. Effective mapping of biomed- group could be removed for most NER tasks. We ical text to the UMLS Metathesaurus: the MetaMap are also aware of the limits of a purely lexicon- 9http://www.lllf.uam.es/ESP/ based approach. Contexts of variation occur in nlpmedterm_en.html

160 program. In Proc. of the AMIA Symposium, pages Louise Deleger,´ Magnus Merkel, and Pierre Zweigen- 17–21. American Medical Informatics Association. baum. 2009. Translating medical terminologies through word alignment in parallel text corpora. Alan R Aronson, James G Mork, Aurelie´ Nev´ eol,´ Journal of Biomedical Informatics, 42(4):692–701. Sonya E Shooshan, and Dina Demner-Fushman. 2008. Methodology for creating UMLS content Guy Divita, Graciela Rosemblat, and Allen C Browne. views appropriate for biomedical natural language 2007. Building a Medical Spanish Lexicon. In processing. In Proc. of the AMIA Annual Sympo- Proc. of the AMIA Symposium, page 941. sium, pages 21–25. American Medical Informatics Association. Kristina M Doing-Harris and Qing Zeng-Treitler. 2011. Computer-assisted update of a consumer health vo- Olivier Bodenreider. 2004. The Unified Medical cabulary through mining of social network data. Language System (UMLS): integrating biomedi- Journal of Medical Internet Research, 13(2):e37. cal terminology. Nucleic acids research, 32(suppl 1):D267–D270. Kevin Donnelly. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies Elliot G Brown, Louise Wood, and Sue Wood. 1999. in health technology and informatics, 121:279–290. The Medical Dictionary for Regulatory Activities (MedDRA). Drug safety, 20(2):109–117. Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational linguis- Pablo Calleja, Raul´ Garc´ıa-Castro, Guadalupe Aguado- tics, 19(1):61–74. de Cea, and Asuncion´ Gomez-P´ erez.´ 2017. Ex- panding SNOMED-CT through Spanish Drug Sum- Juliusz Dziadek, Aron Henriksson, and Martin Duneld. maries of Product Characteristics. In Proc. of 2017. Improving terminology mapping in clinical the Knowledge Capture Conference, pages 29–37. text with context-sensitive correction. In- ACM. formatics for Health: Connected Citizen-Led Well- ness and Population Health, 235:241. L Campillos Llanos, A Moreno Sandoval, and JM Guirao. 2013. An automatic term extractor for Natalia Grabar and Pierre Zweigenbaum. 2000. A gen- biomedical terms in Spanish. In Proc. of the 5th Int. eral method for sifting linguistic knowledge from Symposium on Languages in Biology and Medicine, structured terminologies. In Proc. of the AMIA Sym- Tokyo, Japan. posium, pages 310–314. American Medical Infor- matics Association. Francisco Carrero, Jose´ Carlos Cortizo, and Jose´ Mar´ıa Gomez.´ 2008. Building a Spanish MMTx by using Johannes Hellrich, Stefan Schulz, Sven Buechel, and automatic translation and biomedical ontologies. In Udo Hahn. 2015. Jufit: A configurable rule engine International Conference on Intelligent Data Engi- for filtering and generating new multilingual UMLS neering and Automated Learning, pages 346–353. terms. In Proc. of the AMIA Symposium, pages 604– Springer. 610. American Medical Informatics Association.

John Carroll, Rob Koeling, and Shivani Puri. 2012. Aron Henriksson, Hans Moen, Maria Skeppstedt, Vi- Lexical acquisition for clinical text mining using dis- das Daudaravicius,ˇ and Martin Duneld. 2014. Syn- tributional similarity. In International Conference onym extraction and abbreviation expansion with on Intelligent Text Processing and Computational ensembles of semantic spaces. Journal of Biomedi- Linguistics, pages 232–246. Springer. cal Semantics, 5(1):6–.

Bruno Cartoni and Pierre Zweigenbaum. 2010. Semi- Kristina M Hettne, Erik M van Mulligen, Martijn J Automated Extension of a Specialized Medical Lex- Schuemie, Bob JA Schijvenaars, and Jan A Kors. icon for French. In Proc. of LREC, Valletta, Malta. 2010. Rewriting and suppressing UMLS terms for improved biomedical term identification. Journal of Billy Chiu, Olga Majewska, Sampo Pyysalo, Laura Biomedical Semantics, 1(1):5. Wey, Ulla Stenius, Anna Korhonen, and Martha Palmer. 2019. A neural classification method for INSERM. 2019. Orphadata: Free access data supporting the creation of BioVerbNet. Journal of from Orphanet. Data version (XML data ver- Biomedical Semantics, 10(1):2:1–2:12. sion). http://www.orphadata.org [ac- cessed 2019-05-10]. James J Cimino. 1996. Coding systems in health care. Methods of information in medicine, 35(04/05):273– Ander Intxaurrondo, Montserrat Marimon,´ Aitor 284. Gonzalez-Agirre,´ Jose´ Antonio Lopez-Mart´ ´ın, H Rodr´ıguez Betanco, J Santamar´ıa, Marta Villegas, Allan Peter Davis, Thomas C Wiegers, Michael C and Martin Krallinger. 2018. Finding mentions Rosenstein, and Carolyn J Mattingly. 2012. of abbreviations and their definitions in Spanish MEDIC: a practical disease vocabulary used at the Clinical Cases: the BARR2 shared task evalua- Comparative Toxicogenomics Database. Database, tion results. In Proc. of IberEval@SEPLN 2018. 2012. SEPLN.

161 Christian Jacquemin. 1999. Syntagmatic and paradig- Alexa T McCray, Suresh Srinivasan, and Allen C matic representations of term variation. In Proc. of Browne. 1994. Lexical methods for managing varia- the 37th Annual Meeting of the Association for Com- tion in biomedical terminologies. In Proc. of the An- putational Linguistics, pages 341–348. nual Symposium on Computer Application in Medi- cal Care, pages 235–239. American Medical Infor- Ling Jiang and Christopher C Yang. 2013. Using co- matics Association. occurrence analysis to expand consumer health vo- cabularies from social media data. In 2013 IEEE In- Antonio Moreno-Sandoval and Leonardo Campillos- ternational Conference on Healthcare Informatics, Llanos. 2013. Design and Annotation of pages 74–81. IEEE. MultiMedica–A Multilingual Text Corpus of the Biomedical Domain. Procedia-Social and Behav- Stephen B Johnson. 1999. A semantic lexicon for med- ioral Sciences, 95:33–39. ical language processing. Journal of the American Medical Informatics Association, 6(3):205–218. Antonio Moreno Sandoval and Leonardo Campil- los Llanos. 2015. Combinacion´ de estrategias Rohit J Kate. 2015. Normalizing clinical terms using lexicas´ y estad´ısticas para el reconocimiento au- learned edit distance patterns. Journal of the Amer- tomatico´ de terminos:´ aplicacion´ a un corpus de ican Medical Informatics Association, 23(2):380– medicina. Lingu¨´ıstica Espanola˜ Actual, 37:173– 386. 197.

Alla Keselman, Tony Tse, Jon Crowell, Allen Browne, Antonio Moreno Sandoval and Jose´ Mar´ıa Guirao. Long Ngo, and Qing Zeng. 2007. Assessing con- 2006. Morphosyntactic tagging of the Spanish C- sumer health vocabulary familiarity: an exploratory ORAL-ROM corpus: Methodology, tools and eval- study. Journal of Medical Internet Research, uation. Spoken language corpus and linguistic in- 9(1):e5. formatics, 5:199–218.

Jan A Kors, Simon Clematide, Saber A Akhondi, Fiammetta Namer and Pierre Zweigenbaum. 2004. Ac- Erik M van Mulligen, and Dietrich Rebholz- quiring meaning for French medical terminology: Schuhmann. 2015. A multilingual gold-standard contribution of morphosemantics. In Proc. of Med- corpus for biomedical concept recognition: the Info, pages 535–539. Mantra GSC. Journal of the American Medical In- formatics Association, 22(5):948–956. Adeline Nazarenko, Pierre Zweigenbaum, Benoˆıt Habert, and Jacques Bouaud. 2001. Corpus-based Vladimir Levenshtein. 1966. Binary codes capable of extension of a terminological semantic lexicon. Re- correcting deletions, insertions, and reversals. In So- cent Advances in Computational Terminology, pages viet Physics Doklady, volume 10, pages 707–710. 327–351. Nut Limsopatham and Nigel Collier. 2016. Normalis- ing medical concepts in social media texts by learn- Aurelie´ Nev´ eol,´ Jiao Li, and Zhiyong Lu. 2012. ing semantic representation. In Proc. of the 54th An- Linking multiple disease-related resources through nual Meeting of the Association for Computational UMLS. In Proc. of the 2nd ACM SIGHIT inter- Linguistics (Volume 1: Long Papers), volume 1, national health informatics symposium, pages 767– pages 1014–1023. 772. ACM.

Kornel´ Marko,´ Robert Baud, Pierre Zweigenbaum, Maite Oronoz, Arantza Casillas, Koldo Gojenola, and Lars Borin, Magnus Merkel, and Stefan Schulz. Alicia Perez. 2013. Automatic annotation of med- 2006. Towards a multilingual medical lexicon. In ical records in Spanish with disease, drug and sub- Proc. of the AMIA Annual Symposium, pages 534– stance names. In Iberoamerican Congress on Pat- 538. American Medical Informatics Association. tern Recognition, pages 536–543. Springer.

Kornel´ Marko,´ Stefan Schulz, and Udo Hahn. Piotr Pezik, A Jimeno-Yepes, V Lee, and D Rebholz- 2005. MorphoSaurus. Design and evaluation of Schuhmann. 2008. Static dictionary features for an -based, cross-language document re- term polysemy identification. In Proc. of Building trieval engine for the medical domain. Methods of and Evaluating Resources for Biomedical Text Min- information in medicine, 44(04):537–545. ing LREC Workshop, Marrakech, Morocco.

John McCrae and Nigel Collier. 2008. Synonym set Sampo Pyysalo, Filip Ginter, Hans Moen, Salakoski extraction from the biomedical literature by lexical Tapio, and Sophia Ananiadou. 2013. Distributional pattern discovery. BMC Bioinformatics, 9(1):159. semantics resources for biomedical text processing. Proc. of Languages in Biology and Medicine, pages Alexa T McCray, Anita Burgun, and Olivier Bodenrei- 39–44. der. 2001. Aggregating UMLS semantic types for reducing conceptual complexity. Studies in health RANME. 2011. Diccionario de Terminos´ Medicos´ . technology and informatics, 84(0 1):216–220. Editorial Panamericana.

162 A Moreno Sandoval, L Campillos Llanos, A Gon- Chang Wang, Liangliang Cao, and Bowen Zhou. 2015. zlez Martnez, and JM Guirao. 2013. An affix-based Medical synonym extraction with concept space method for automatic term recognition from a med- models. In Proc. of the Twenty-Fourth International ical corpus of Spanish. In Proc. of the 7th Corpus Joint Conference on Artificial Intelligence, pages Linguistics Conference 2013, Lancaster University. 989–995.

Eduardo Sbrissia, Percy Nohama, Stefan Schulz, and Liqin Wang, Bruce E Bray, Jianlin Shi, Guilherme Kornel´ Marko.´ 2004. Semi-Supervised Acquisition Del Fiol, and Peter J Haug. 2016. A method for the of a Spanish Lexicon from a Portuguese Seed Lexi- development of disease-specific reference standards con. Proc. of COLING. vocabularies from textual biomedical literature re- sources. Artificial intelligence in medicine, 68:47– Michael Seedorff, Kevin J Peterson, Laurie A Nelsen, 57. Cristian Cocos, Jennifer B McCormick, Christo- pher G Chute, and Jyotishman Pathak. 2013. Incor- Gesa Weske-Heck, Albrecht Zaiss, Matthias Zabel, porating expert terminology and disease risk factors Stefan Schulz, Wolfgang Giere, Michael Schopen, into consumer health vocabularies. In Biocomputing and Rudiger¨ Klar. 2002. The German specialist lex- 2013, pages 421–432. World Scientific. icon. In Proceedings of the AMIA Symposium, pages 884–888. American Medical Informatics Associa- Isabel Segura-Bedmar and Paloma Mart´ınez. 2017. tion. Simplifying drug package leaflets written in Span- ish by using word embedding. Journal of Biomedi- WHO. 2004. International Statistical Classification cal Semantics, 8(1):45. of Diseases and Related Health Problems. World Health Organization. Isabel Segura-Bedmar, Paloma Mart´ınez, Ricardo Re- vert, and Julian´ Moreno-Schneider. 2015. Explor- WHO. 2019. Anatomical Therapeutical Chemi- ing Spanish health social media for detecting drug cal classification. Uppsala: Nordic Council on effects. In BMC medical informatics and decision Medicines. making, volume 15, page S6. BioMed Central. WONCA. 1998. International Classification of Pri- Maria Skeppstedt, Magnus Ahltorp, and Aron Henriks- mary Care 2nd ed. Oxford: Oxford University son. 2013. Vocabulary expansion by semantic ex- Press, 1998. Proc. of Languages in traction of medical terms. Javier Yetano and Vincent Alberola. 2003. Diccionario Biology and Medicine, pages 63–67. de siglas medicas´ y otras abreviaturas, eponimos´ y terminos´ medicos´ relacionados con la codificacion´ Tom De Smedt and Walter Daelemans. 2012. Pattern de las altas hospitalarias. SEDOM. for python. Journal of Machine Learning Research, 13(Jun):2063–2067. Hong Yu and Eugene Agichtein. 2003. Extracting syn- onymous gene and protein terms from biological lit- Paul Thompson and Sophia Ananiadou. 2018. Hyphen. erature. Bioinformatics, 19(suppl 1):i340–i349. Terminology., 24(1):91–121. Qing Zeng and Tony Tse. 2006. Exploring and de- Paul Thompson, John McNaught, Simonetta Monte- veloping consumer health vocabularies. Journal magni, Nicoletta Calzolari, Riccardo Del Gratta, Vi- of the American Medical Informatics Association, vian Lee, Simone Marchi, Monica Monachini, Piotr 13(1):24–29. Pezik, Valeria Quochi, et al. 2011. The BioLexicon: a large-scale terminological resource for biomedical Qing Zeng, Tony Tse, Guy Divita, Alla Keselman, text mining. BMC Bioinformatics, 12(1):397. Jonathan Crowell, Allen Browne, Sergey Gory- achev, and Long Ngo. 2007. Term identifica- Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, tion methods for consumer health vocabulary de- Julia Corrigan, Girish Chavan, and Rebecca S Ja- velopment. Journal of Medical Internet Research, cobson. 2016. Noble–flexible concept recognition 9(1):e4. for large-scale biomedical natural language process- ing. BMC Bioinformatics, 17(1):32. Pierre Zweigenbaum, Robert Baud, Anita Burgun, Fi- ammetta Namer, Eric´ Jarrousse, Natalia Grabar, Yoshimasa Tsuruoka, John McNaught, Junichi Tsujii, Patrick Ruch, Franck Le Duff, Jean-Franois Forget, and Sophia Ananiadou. 2007. Learning string sim- Magaly Douyere,` and Stefan´ Darmoni. 2005. A uni- ilarity measures for gene/protein name dictionary fied medical lexicon for French. International Jour- look-up using logistic regression. Bioinformatics, nal of Medical Informatics, 74(2–4):119–124. 23(20):2768–2774.

Karin Verspoor, Antonio Jimeno Yepes, Lawrence A Appendix - Copyright and Usage Cavedon, Tara McIntosh, Asha Herten-Crabb, Zoe¨ Thomas, and John-Paul Plazzer. 2013. Annotat- MedLexSp is distributed freely for research pur- ing the biomedical literature for the human variome. poses; contact for a license at the email address Database, 2013. provided or through the project page. Some

163 thesauri included in MedLexSp were obtained cal Terms R , which is used by permission of the through a distribution and usage agreement from International Health Terminology Standards De- the corresponding institutions who develop them. velopment Organization (IHTSDO; all rights re- In addition, some material in the UMLS Metathe- served). SNOMED CT R was originally created saurus is from copyrighted sources of the re- by The College of American Pathologists. spective copyright holders. Users of the UMLS Metathesaurus are solely responsible for compli- ance with any copyright, patent or trademark re- strictions and are referred to the copyright, patent or trademark notices appearing in the original sources, all of which are hereby incorporated by reference. The version of MedLexSp freely available for research does not include terms nor coding data from terminological sources with copyright rights; only the subset of data in MedLexSp without us- age restrictions is accessible. We acknowledge the intellectual property rights of the institutions who develop the sources from which we extracted subsets of terms to compile the lexicon, and who gave permission (or provide a li- cence to reuse their data) to distribute these sub- sets of terms: the National Library of Medicine maintains the MedLinePlus resource and the Med- ical Subject Headings, and BIREME/OPS (Latin- American and Caribbean Center on Health Sci- ences Information) is in charge of the Spanish translation (Descriptores en Ciencias de la Salud, DeCS); the National Cancer Institute publishes the Dictionary of Cancer Terms; the French Na- tional Institute of Health and Medical Research (INSERM) supports OrphaNet and gathers the information provided in OrphaData; the World Health Organization produces the Adverse Drug Reactions terminology, the International Classifi- cation of Diseases vs. 10, and the Anatomical Therapeutical Classification; the Spanish transla- tion of the International Classification of Primary Care (ICPC) is supported by the World Organiza- tion of Family Doctors; and the Spanish Agency of Drugs and Food Products (AEMPS) publishes the Nomenclator´ de prescripcion.´ MedLexSp also gathers some terms from the Spanish ver- sion of the Medical Dictionary for Regulatory Activities (MedDRA), which is maintained by the Maintenance and Support Services Organiza- tion (MSSO). However, the distributed version of MedLexSp does not include terms coming solely from the MedDRA sources, because of copyright restrictions. In addition, MedLexSp includes a subset of the Spanish version of SNOMED Clini-

164