TF-IDF Vs Word Embeddings for Morbidity Identification in Clinical Notes: an Initial Study

TF-IDF vs Word Embeddings for Morbidity Identification in Clinical Notes: An Initial Study 1 2 Danilo Dess`ı [0000−0003−3843−3285], Rim Helaoui [0000−0001−6915−8920], Vivek 1 1 Kumar [0000−0003−3958−4704], Diego Reforgiato Recupero [0000−0001−8646−6183], and 1 Daniele Riboni [0000−0002−0695−2040] 1 University of Cagliari, Cagliari, Italy {danilo dessi, vivek.kumar, diego.reforgiato, riboni}@unica.it 2 Philips Research, Eindhoven, Netherlands [email protected] Abstract. Today, we are seeing an ever-increasing number of clinical notes that contain clinical results, images, and textual descriptions of pa-tient’s health state. All these data can be analyzed and employed to cater novel services that can help people and domain experts with their com-mon healthcare tasks. However, many technologies such as Deep Learn-ing and tools like Word Embeddings have started to be investigated only recently, and many challenges remain open when it comes to healthcare domain applications. To address these challenges, we propose the use of Deep Learning and Word Embeddings for identifying sixteen morbidity types within textual descriptions of clinical records. For this purpose, we have used a Deep Learning model based on Bidirectional Long-Short Term Memory (LSTM) layers which can exploit state-of-the-art vector representations of data such as Word Embeddings. We have employed pre-trained Word Embeddings namely GloVe and Word2Vec, and our own Word Embeddings trained on the target domain. Furthermore, we have compared the performances of the deep learning approaches against the traditional tf-idf using Support Vector Machine and Multilayer per-ceptron (our baselines). From the obtained results it seems that the lat-ter outperform the combination of Deep Learning approaches using any word embeddings. Our preliminary results indicate that there are specific features that make the dataset biased in favour of traditional machine learning approaches. Keywords: Deep Learning · Natural Language Processing · Morbidity Detection · Word Embeddings · Classification. 1 Introduction In these years we are seeing an increment of life expectancy that has also in- creased the risk of long-term diseases such as cancer, diabetes, mental health condition, and other chronic health threats [22, 10, 3, 21]. Also, one more disad- vantage with long life expectancy is that people can be affected by more than one 1 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). disease at a time, increasing the risk of substandard health quality. For instance, a person suffering from long term diabetes have higher chances of hypertension, high cholesterol levels, arteries or veins blockage. The World Health Organiza-tion report [17] states that, in a developed country, over 40% of the population is exposed to at least one long-term health condition including all ages, and 25% percent of the population suffers with multi- morbidity. Furthermore, the report emphasizes that the high rate of multi-morbidity is directly proportional to the middle and low-income countries as they do not have funds that should be invested for enhancing primary care of population [2]. All this medical information is being monitored, and with the advent of Information Technology (IT) services, a lot of clinical data are continuously being stored within clinical reports, which might be employed to provide novel healthcare services world-wide, overcoming issues related to the social or economic condition of people. Clinical reports may contain various information in the form of numbers (e.g., laboratory results), images (e.g., x-ray), or medical descriptions (e.g., surgical descriptions) that may be used to create content-based services. However, the whole amount of data is challenging to be analyzed and employed by humans to provide healthcare services, and the development of computer-based systems to deal with it has recently gained the attention of the scientific community. For example, textual data of clinical reports have been explored in tasks such as classification [4], clustering [12], and recommendation [8]. Although state-of-the-art research in this direction has already provided important outcomes, many challenges still remain open [5]. Methods based on Deep Learning models and advanced Word Embedding representations have recently proven to be state-of-the-art for many tasks, but their use within many healthcare problems have not been investigated yet. Therefore, in this paper we investigate the use of a Deep Learning model with various types of Word Embeddings in order to address the problem of morbidity detection related to the obesity disease in textual clinical notes. We performed our research study on the n2c23 dataset released for the i2b24 obesity and co-morbidity detection challenge in 2008, and compared our results against two baseline approaches where the Term Frequency — In-verse Document Frequency (TF-IDF) was used to represent clinical notes. More precisely, the contributions of our paper are: – We provide a Deep Learning model using Word Embeddings for performing multi-morbidity detection within clinical notes. – We analyze three types of Word Embeddings (for the Deep Learning approaches) and the TF-IDF (for the baselines) representation for modelling the knowledge of clinical notes. – We compare the proposed deep learning approaches against two machine learning methods by using k-fold cross-validation. – We found out a very high f-measure score of the machine learning baselines that beats those of the deep learning approaches indicating the occurrences 3 https://n2c2.dbmi.hms.harvard.edu/ 4 https://www.i2b2.org/NLP/Obesity/ 2 of representative tokens highly connected with the presence of each morbidity class. – We make available all the sources code used for our experiments by a GitHub repository5. The remaining of this manuscript is organized as it follows: Section 2 discusses the role of Deep Learning and Natural Language Processing (NLP) techniques within the healthcare domain. Section 3 describes the dataset we have used, its contents and how it was created. Section 4 introduces the methodology we have applied, and the feature engineering approaches we have designed for the Deep Learning task. Section 5 presents and discusses the results. Finally, Section 6 draws some conclusions of our preliminary investigation and describes problems to solve and future research challenges where we are headed. 2 Related Work These days Artificial Intelligence (AI) and its sub fields such as Deep Learning, Text Mining, and more in general Machine Learning, are playing a significant role in clinical decision making and understanding, automatic disease diagnosis, and therapy assistance [15]. Deep Learning applications in healthcare are con-tributing with relevant improvements in many fields such as analyzing the blood samples, detecting heart problems, detecting tumours, and so on [19]. Moreover, the high quality performances of Deep Learning models for healthcare issues has raised positive discussions and interests within the AI community. However, the use of Deep Learning technologies for the purpose of detecting multi-morbidity in clinical notes has not been deeply analyzed yet. For example, a relevant re-cent work [13] uses Negative Matrix Factorisation (NMF) for simultaneously mining disease clusters in order to detect relations that exist between morbidity patterns. Authors demonstrated how the temporal characteristics of the disease clusters can help mining multi-morbidity networks and generating new hypothe-ses for the emergence of various morbidity patterns over time. However, no Deep Learning methods have been investigated to try to uncover morbidity patterns. Other works only rely on NLP techniques [23]. For example, authors in [11] present a method called FREGEX which is based on regular expressions to ex-tract features from written clinical notes. The use of Deep Learning models for discovering multi-morbidity linked to the obesity disease has been recently inves-tigated by [24]. The Deep Learning model used in this work presents two layers, a Convolutional Neural Network (CNN) layer and a Max Pooling layer, that were fed by using both word and entity embeddings, and allowed the authors to improve the results that were obtained during the i2b2 obesity challenge in 2008 especially for the intuitive classification task. However, they investigated only CNN as the main layer of their model and, therefore, the use of Long-Short Term Memory (LSTM) still needs to be explored. In addition, in literature there is a strong evidence that Bidirectional layers can infer relevant characteristic 5 https://github.com/vsrana-ai/SmartPhil 3 from clinical notes [14]. Hence, in this work we investigate the use of Bidirec-tional LSTM layers for the task of multi-morbidity detection. Moreover, we try to understand whether state-of-the-art Word Embeddings available in literature can better represent the knowledge of clinical notes for the same task. 3 Dataset Description The dataset used for this work is n2c2 obesity data. The n2c2 dataset contains test and training documents of patients clinical records. The dataset was com-pletely anonymized by replacing personal and sensitive information of patients with surrogates. The morbidity classes in the dataset are Asthma, CAD, CHF, Depression, Diabetes, Gallstones, GERD, Gout, Hypercholesterolemia, Hyperten-sion, Hypertriglyceridemia, OA, Obesity, OSA, PVD, and Venous

Load more