WME 3.0: An Enhanced and Validated Lexicon of Medical Concepts
Anupam Mondal¹, Dipankar Das¹, Erik Cambria², Sivaji Bandyopadhyay¹
¹Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
²School of Computer Science and Engineering, Nanyang Technological University, Singapore

Abstract

Information extraction in the medical domain is laborious and time-consuming due to the insufficient number of domain-specific lexicons and the lack of involvement of domain experts such as doctors and medical practitioners. Thus, in the present work, we are motivated to design a new lexicon, WME 3.0 (WordNet of Medical Events), which contains over 10,000 medical concepts along with their part of speech, gloss (descriptive explanation), polarity score, sentiment, similar sentiment words, category, affinity score, and gravity score features. In addition, manual annotators help to validate the medical concepts of WME 3.0, both overall and at the level of individual categories, using Cohen's Kappa agreement metric. The agreement score indicates almost correct identification of medical concepts and their assigned features in WME 3.0.

1 Introduction

In the clinical domain, the representation of a lexical resource is treated as a crucial and contributory task because it has to handle several challenges. These challenges are the identification of medical concepts, their categories and relations, the disambiguation of polarities, and the recognition of semantics, and the scarcity of structured clinical texts doubles them. In the last few years, several researchers have been involved in developing various domain-specific lexicons such as Medical WordNet and UMLS (Unified Medical Language System) to cope with such challenges. These lexicons help to bridge the gap between medical experts such as doctors or medical practitioners and non-experts such as patients (Cambria et al., 2010a; Cambria et al., 2010b).

However, medical text is in general unstructured since doctors do not like to fill in forms and prefer free-form notes of their observations. Hence, lexical design is difficult due to the lack of any prior knowledge of medical terms and contexts. Therefore, we are motivated to enhance a medical lexicon, namely WordNet of Medical Events (WME 2.0), which helps to identify medical concepts and their features. In order to enrich this lexicon, we have employed various well-known resources like conventional WordNet, SentiWordNet (Esuli and Sebastiani, 2006), SenticNet (Cambria et al., 2016), Bing Liu (Liu, 2012), Taboada's Adjective list (Taboada et al., 2011), and a preprocessed English medical dictionary (footnote 1) on top of the WME 1.0 and WME 2.0 lexicons (Mondal et al., 2015; Mondal et al., 2016). WME 1.0 contains 6,415 medical concepts and their glosses, POS, polarity scores, and sentiment. Thereafter, Mondal et al. (2016) enhanced WME 1.0 by adding a few more features, namely affinity score, gravity score, and SSW, to the medical concepts and presented the result as WME 2.0. The affinity score presents the hidden link between a pair of medical concepts, and the gravity score presents the hidden link between a concept and its glosses from various sources. The SSW of a medical concept are its similar sentiment words, i.e., words that share a common sentiment property.

Footnote 1: http://alexabe.pbworks.com/f/Dictionary+of+Medical+Terms+4th+Ed.-+(Malestrom).pdf

In the current research, we have focused on enriching WME 2.0 with a larger number of medical concepts and an additional feature, i.e., medical category. In order to develop this updated version of WME, namely WME 3.0, we have taken the help of WME 1.0 and WME 2.0. We have also noticed that the previous versions of WME are unable to extract knowledge-based information such as the category of the medical concepts, and that their coverage is also lower.

Therefore, we have increased the number of medical concepts and added the category feature on top of WME 2.0. The current version, WME 3.0, contains 10,186 medical concepts and their category, POS, gloss, sentiment, polarity score, SSW, affinity score, and gravity score. For example, the WME 3.0 lexicon presents the properties of the medical concept amnesia as category (disease), POS (noun), gloss (loss of memory sometimes including the memory of personal identity due to brain injury, shock, fatigue, repression, or illness or sometimes induced by anesthesia.), sentiment (negative), polarity score (-0.375), SSW (memory loss, blackout, fugue, stupor), affinity score (0.429), and gravity score (0.170).

Moreover, our contributions in enhancing and validating the lexicon with the newly added medical concepts and categories are summarized as follows.

(a) Enriching the number of medical concepts in the existing lexicon, WME 2.0: To address this issue, we have employed a preprocessed English medical dictionary (footnote 1) and various well-defined lexicons such as SentiWordNet, SenticNet, and MedicineNet. They helped to increase the number of medical concepts in the proposed lexicon.

(b) Overall validation of the current lexicon: To resolve this issue, we have taken the help of two manual annotators who are medical practitioners. The agreement between the annotators is measured using Cohen's Kappa, and the resulting κ score assists in validating the overall lexicon as well as the individual features of WME 3.0 (Viera et al., 2005).

(c) Evaluating the individual features of the medical concepts: In order to extract the subjective and knowledge-based features, we have applied our evaluation scripts to the mentioned resources. The scripts assist in identifying the affinity and gravity scores as feature values for the concepts. The resources are also used to assign the SSW as semantics and the glosses for the concepts. In addition, a supervised classifier helps to add the category feature to the proposed lexicon.

The remainder of the paper is organized as follows: Section 2 presents the related work on building a medical lexicon; Section 3 and Section 4 describe the previous versions of WME (WME 1.0 and WME 2.0) and the development steps of WME 3.0; Section 5 discusses the validation process of the proposed lexicon; finally, Section 6 presents the concluding remarks and future scope of the research.

2 Background

Biomedical information extraction is treated as one of the challenging research tasks as it deals with available medical corpora that are either unstructured or semi-structured. Hence, a domain-specific lexicon becomes an essential component for converting an unstructured corpus into a structured one (Borthwick et al., 1998). It also helps in extracting the subjective and conceptual information related to medical concepts from the corpus. Various researchers have tried to build ontologies and lexicons in the domain of healthcare, such as UMLS, SNOMED-CT (Systematized Nomenclature of Medicine-Clinical Terms), MWN (Medical WordNet), SentiHealth, and WordNet of Medical Events (WME 1.0 and WME 2.0) (Miller and Fellbaum, 1998; Smith and Fellbaum, 2004; Asghar et al., 2016; Asghar et al., 2014). UMLS helps to enhance access to biomedical literature by facilitating the development of computer systems that understand biomedical language (Bodenreider, 2004). SNOMED-CT is a standardized, multilingual vocabulary that contains clinical terminologies and assists in exchanging electronic healthcare information among physicians (Donnelly, 2006).

Furthermore, Smith and Fellbaum (2004) proposed Medical WordNet (MWN) with two sub-networks, namely Medical FactNet (MFN) and Medical BeliefNet (MBN), for consumer health. MWN follows the formal architecture of the Princeton WordNet (Fellbaum, 1998). MFN aids in extracting and understanding generic medical information for non-expert groups, whereas MBN identifies the fraction of the beliefs about medical phenomena (Smith and Fellbaum, 2004). Their primary motivation was to develop a network for a medical information retrieval system with a visualization effect. The SentiHealth lexicon was developed to identify the sentiment of medical concepts (Asghar et al., 2016; Asghar et al., 2014). The WME 1.0 and WME 2.0 lexicons were designed to extract medical concepts and their related linguistic and sentiment features from the corpus (Mondal et al., 2015; Mondal et al., 2016).

The ontologies and lexicons mentioned above assist in identifying medical concepts and their sentiments from the corpus but are unable to provide the complete knowledge-based information of the concepts. Hence, in the current work, we are motivated to design a full-fledged lexicon in healthcare which provides the linguistic, sentiment, and knowledge-based features together for the medical concepts.

3 Attempts for WordNet of Medical Events

In healthcare, a domain-specific lexicon is required for identifying the conceptual and knowledge-based information such as category, gloss, semantics, and sentiment of the medical concepts from the clinical corpora (Cambria,

For example, the medical concept abnormality appears in WME 1.0 with a gloss, POS as noun, negative sentiment, and a polarity score of -0.25.

3.2 WME 2.0

The next version of WME, i.e., WME 2.0, extracts more semantic features of medical concepts and adds them to the existing features of WME 1.0 (Mondal et al., 2016). Although WME 2.0 adds the affinity score, gravity score, and SSW, the number of concepts in WME 2.0 remains the same as in WME 1.0; only the features of each concept are extended (Mondal et al., 2016).

Affinity score indicates the strength of association between a medical concept and its corresponding SSWs by assigning a probability score.