Healthcare NER Models Using Language Model Pretraining: Empirical Evaluation of Healthcare NER Model Performance with Limited Training Data

Amogh Kamat Tarcar* (Persistent Systems Limited, Goa, India) [email protected]
Aashis Tiwari (Persistent Systems Limited, Pune, India) [email protected]
Dattaraj Rao (Persistent Systems Limited, Goa, India) [email protected]

Vineet Naique Dhaimodker (National Institute of Technology, Goa, India) [email protected]
Penjo Rebelo (National Institute of Technology, Goa, India) [email protected]
Rahul Desai (National Institute of Technology, Goa, India) [email protected]

ABSTRACT

In this paper, we present our approach to extracting structured information from unstructured Electronic Health Records (EHR) [2], which can be used, for example, to study adverse drug reactions in patients due to chemicals in their products. Our solution uses a combination of Natural Language Processing (NLP) techniques and a web-based annotation tool to optimize the performance of a custom Named Entity Recognition (NER) [1] model trained on a limited amount of EHR training data. This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [26].

We showcase a combination of tools and techniques leveraging recent advancements in NLP aimed at targeting domain shifts by applying transfer learning and language model pre-training techniques [3]. We compare our technique to the current popular approaches and show the effective increase in performance of the NER model and the reduction in time needed to annotate data. A key observation of the results presented is that the F1 score (0.734) of the model trained with our approach on just 50% of the available training data outperforms the F1 score (0.704) of the blank spaCy model without the language model component trained on 100% of the available training data.

We also demonstrate an annotation tool to minimize domain expert time and the manual effort required to generate such a training dataset. Further, we plan to release the annotated dataset as well as the pre-trained model to the community to further research in medical health records.

KEYWORDS

Transfer Learning, Named Entity Recognition, Natural Language Processing, Pre-Training, Language Modeling, Electronic Health Records (EHR), Annotations

ACM Reference format:

Amogh Kamat Tarcar, Aashis Tiwari, Dattaraj Rao, Vineet Naique Dhaimodker, Penjo Rebelo and Rahul Desai. 2020. Healthcare NER Models Using Language Model Pretraining: Empirical Evaluation of Healthcare NER Model Performance with Limited Training Data. In Proceedings of the Health Search and Data Mining Workshop (HSDM 2020) in the 13th ACM International WSDM Conference (WSDM 2020). ACM, Houston, TX, USA. https://doi.org/10.1145/3336191.3371879

1. INTRODUCTION

Extracting structured information from unstructured text such as EHRs and medical literature has always been a challenging task. Recent advancements in NLP take advantage of the large text corpora available in scientific literature as well as medical and pharmaceutical web sites to train systems which can be leveraged for several NLP tasks. Along with progress in the research space, there has been significant progress in the libraries and tools available for industry use.

The specific problem we focused on was extracting adverse drug reactions from EHRs using NER. The solution required us to extract key entities such as prescribed drugs with dosage and the symptoms and diseases mentioned in the EHRs. The extracted entities would be processed further downstream to link the entities and leverage dictionary-based techniques for flagging any symptoms which could potentially be adverse drug reactions of the prescribed medicines.

A crucial component of our solution is a custom NER model for extracting key entities from EHRs. State-of-the-art Named Entity Recognition models built using deep learning techniques [13] extract entities from text sentences not only by identifying the keywords or linguistic shape of entities, but also by leveraging the context of the entity in the sentence. Furthermore, with language model pre-trained embeddings, the NER models leverage the proximity of other words which appear along with the entity in domain-specific literature.

*Corresponding Author
Presented at the first Health Search and Data Mining Workshop (HSDM 2020) in the 13th ACM International WSDM Conference (WSDM 2020), held in Feb 2020, Houston, Texas, USA. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


One of the key challenges in training NLP-based models is the availability of reasonably sized, high-quality annotated datasets. Further, in a typical industrial setting, the relative difficulty in garnering significant domain expert time, and the lack of tools and techniques for effective annotation along with the ability to review such annotations to minimize human errors, affect research into and benchmarking of new learning techniques and algorithms.

Additionally, models like NER often need a significant amount of data to generalize well to a vocabulary and language domain. Such vast amounts of training data are often unavailable or difficult to manufacture or synthesize. To bridge the gap between academic developments and industrial requirements, we designed a series of experiments employing transfer learning from pre-trained models while working with a comparatively smaller dataset.

Transfer learning techniques [3] are largely successful in the image domain and are advancing steadily in the natural language domain with the availability of pre-trained language embeddings and pre-trained models.

In this paper we present findings of our experiments to solve the industrial problem of training NER models with limited data using spaCy [7], a state-of-the-art industrial-strength natural language processing package, along with the latest techniques in transfer learning.

The paper is organized as follows. Section 2 describes the motivation for our experiments. Section 3 discusses the problem and our solution, both in algorithmic and implementation terms, and evaluates the results produced by our solution. Section 4 discusses the results and Section 5 concludes and suggests directions for future work.

2. MOTIVATION

Recent advancements in NLP, also known as the ImageNet moment in NLP [3], have shown significant improvements in many NLP tasks using transfer learning. Language models like ELMo [4] and BERT [5] have shown the effect of language model pre-training on downstream NLP tasks. Language models are capable of adjusting to changes in the textual domain through a process of fine-tuning. Also, in this self-supervised learning scenario, there is an implicit annotation in sentences, i.e. the model predicts the next token given a sequence of tokens appearing earlier in the sequence. Given all this, we can adjust to a new domain-specific vocabulary with very little training time and almost no supervision.

NER, aimed at detecting and identifying entity classes in text, can help in extracting structured information and assisting upstream user experiences. The applicability of NER models is widespread, ranging from identifying dates and cities in documents to open-domain question answering.

Using task-specific annotation tools can minimize the time needed to generate high-quality annotated datasets for training models. The traditional process of annotating data is slow, but fundamental to most NLP models. It often acts as a hindrance in evaluating and benchmarking multiple models, as well as in parameter tuning of models. Many tools such as Doccano [6] exist in the open-source community that help in solving this problem. We developed an in-house tool which we could customize to speed up the annotation process.

3. USE CASE DETAILS

Studying adverse reactions of chemicals in a drug on the patient is central to drug development in healthcare. Pharmacovigilance (PV) [21], as described by WHO, is defined as "the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem." Pharmaceutical companies often want to understand the conditions and pre-conditions under which a drug might have an adverse reaction on a patient. This helps in research and studies of the drugs and also reduces or prevents risks of any harm to the patient.

Co-occurrence of diseases and chemicals in the EHR of a patient is useful in studies and research for most pharmaceutical companies. However, EHRs are unstructured data, and additional processing is required to extract structured information such as named entities of interest. Such extraction can lead to significant savings of manual labor and minimize the time taken to get a new drug to market.

We developed custom healthcare NER models to extract phrases related to (pharmaceutical) chemicals with dosage, diseases and symptoms from EHRs. As the entities were specific to the domain text, an in-house annotated dataset was created using our custom-built annotation tool. A number of experiments were designed and executed for training custom NER models on annotated data from base models (spaCy [7] and scispaCy [8]) using transfer learning. Section 3.1 describes the dataset preparation, followed by Section 3.2 which presents an architecture overview. Section 3.3 presents experiment details and Section 3.4 describes the results obtained.

3.1. DATASET PREPARATION

We created a domain-specific corpus by collating publicly available sample medical notes and drug public assessment reports from the European Medical Agency (EMA) [9] and Sample Medical Transcripts [10].

• A custom annotated dataset was created in-house specifically for the four entities: Chemical, Disease, Symptom and Dosage.
• A corpus containing domain-specific vocabulary was created by utilising text from 2300 sample notes from the Medical Transcripts Samples site and 100 FAQ sections from the EMA site; a sketch of this collation step follows below.
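To make the corpus-collation step concrete, the sketch below shows one way to serialize the collated raw notes into the JSONL format consumed by spaCy's pretrain command (one JSON object with a "text" field per line). This is a minimal illustration, not our exact script; the raw_notes/ directory and corpus.jsonl file name are hypothetical placeholders.

```python
import json
from pathlib import Path

# Collate raw medical note text files into a single JSONL corpus for
# language model pretraining: one {"text": ...} object per line.
# "raw_notes/" and "corpus.jsonl" are illustrative names.
with open("corpus.jsonl", "w", encoding="utf-8") as out:
    for path in sorted(Path("raw_notes").glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:
            out.write(json.dumps({"text": text}) + "\n")
```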

Figure 1: In-house Annotation Tool

For annotating data, a custom-built web browser-based tool was used. Figure 1 displays a screenshot of the annotation tool. As seen in the figure, the tool works with text files and the user provides annotations using mouse and keyboard inputs. After marking the required span of text using the mouse, the user can use keyboard keys to annotate the selected span. For example, the 'S' key on the keyboard represents the Symptom entity. On providing inputs, the tool highlights the span with a specific color for each entity, and also adds an entity name on the screen with a cross mark for making corrections. The tool also has a recheck functionality to enable the reviewer to reexamine annotations.

After initial annotations (around 100 occurrences of each entity), we utilized the annotated data to train the spaCy [7] NER model and leveraged it to identify named entities in new text files to accelerate the annotation process.

The annotated dataset was randomized and split into 80% for training and 20% for testing. As the training data for NER follows a pattern of sentence and entity tuples, there is no overlap between the sentences in the training and test datasets. The training data was further split into smaller sets ranging from 50% to 100% of the data, in 10% increments. Table 1 presents the statistics of the annotated data. It tabulates the counts of annotated sentences as well as entity-wise counts.

Table 1. Dataset Description (entity columns give total entity counts)

Dataset              Sentences   Chemical   Disease   Symptom   Dosage
All Annotated Data   4212        1194       929       1922      290
Train Data           2948        908        638       1351      207
Test Data            1264        286        291       571       83

3.2. ARCHITECTURE OVERVIEW

Figure 2 presents our solution architecture, which includes four key components. The first component comprises Python scripts to fetch and collate medical notes text from the Sample Medical Transcripts [10] site. The second component is the in-house annotation tool which is used by domain experts to annotate notes. The annotation tool is a web application with a Python Flask-based [14] REST API as the backend. The annotation tool processes document annotations and outputs annotated data in the format required to train spaCy models (see the sketch below). The third component consists of a Python module which utilizes the spaCy pre-train feature for language model pre-training. The fourth component is a Python module built using spaCy which consumes the annotated data, the spaCy models and the pre-trained vectors to perform model training, producing custom NER models.
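As an illustration of the annotated-data format and the 80/20 split described above, a minimal sketch follows; the example sentence and its character offsets are invented for illustration and are not taken from our dataset.

```python
import random

# spaCy 2.x training format: (sentence, {"entities": [(start, end, label), ...]})
# where start/end are character offsets. The example below is invented.
annotated_data = [
    ("Patient was prescribed 40 mg atorvastatin for hyperlipidemia.",
     {"entities": [(23, 28, "DOSAGE"),      # "40 mg"
                   (29, 41, "CHEMICAL"),    # "atorvastatin"
                   (46, 60, "DISEASE")]}),  # "hyperlipidemia"
    # ... remaining annotated sentences
]

# Randomize and split 80% train / 20% test, as described in Section 3.1.
random.seed(42)  # fixed seed for a reproducible split
random.shuffle(annotated_data)
cut = int(0.8 * len(annotated_data))
train_data, test_data = annotated_data[:cut], annotated_data[cut:]
```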


Figure 2: Architectural Overview of Experiment Setup

3.3. EXPERIMENTS

The spaCy library provides a variety of tools for fast text processing and is developed as a modular pipeline. The library parses text to create a custom spaCy data structure which is then passed through orchestrated components of the pipeline for further processing. The components of the pipeline are highly customizable for efficient execution of NLP tasks such as text categorization, POS tagging, named entity recognition etc. Furthermore, these components can be individually updated for adapting to specific implementations.

For the implementation of our experiments we focused on two critical components of the spaCy pipeline, namely the component responsible for converting string tokens to vectors and the named entity recognition component.

The NER component in the spaCy pipeline is a deep learning model utilizing Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architectures. The implementation is based on the transition-based framework described by Lample et al. [21].

spaCy provides wrapper APIs to interact with this NER model as well as to improve it and add custom entities by leveraging annotated data. The spaCy training API obtains error gradients and updates the model weights using back-propagation. Using the spaCy train API [15], we can train the deep learning model for the NER component starting either from a blank English language class model which has no learned entities, or from a model which already contains a few trained entities. Leveraging a model which has been trained to recognize a few overlapping entities is often beneficial when the amount of training data is limited, as compared to the blank model.

The second component which we experimented with was the component in the pipeline which provides vectors for string tokens. The token-to-vector layer of the pipeline (tok2vec) can be customized to provide custom vectors. There are two prominent techniques for obtaining word embeddings: classic techniques such as word2vec [22] and GloVe [23], which provide static embeddings for each word, and dynamic word embeddings, which are based on the context of the word in the sentence. Dynamic word embeddings can be obtained from language models such as ELMo [4] and BERT [5].

In order to obtain dynamic embeddings specific to the context of the text on which we need the NER model to run, we need to train a separate machine learning model. The learning objective of the training task is to work with the non-annotated text corpus by internally converting it into a supervised learning task: masking words from sentences and then predicting these words. This training task is run on a raw text corpus containing a large number of domain-specific words. The embeddings are then obtained from this trained model and can be leveraged as token vectors.

spaCy provides a deep learning implementation for obtaining dynamic word embeddings using an approximate language-modelling objective. The pretrain wrapper API [12] internally executes the training of this deep learning model given a large corpus of domain-specific text data. The output of the pretraining API is a domain-specific dynamic embedding model.

We designed three experiments using these two key components of the spaCy NLP pipeline and trained multiple NER models using the annotated training data to obtain optimal performance on test data using the spaCy training module [15]. Our experimental set-up used spaCy version 2.1.4 [18] on an Anaconda Distribution [16], Python 3.6.8 [17] environment running on a machine with x86_64 GNU/Linux and an Intel Core Processor (Broadwell) with 16 GB RAM. The experiments can be split into three main methods.
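Before turning to the individual methods, the sketch below shows how such a training run can be wired up with the spaCy 2.1 training API. It is a minimal reconstruction under the paper's stated hyperparameters (100 iterations, dropout 0.2), not the authors' exact script; the train_ner helper and the minibatch sizes are our own illustrative choices.

```python
import random
import spacy
from spacy.util import minibatch, compounding

LABELS = ["CHEMICAL", "DISEASE", "SYMPTOM", "DOSAGE"]

def train_ner(train_data, base_model=None, n_iter=100, drop=0.2):
    """Train NER from a blank English model or from a given base model."""
    nlp = spacy.load(base_model) if base_model else spacy.blank("en")
    if "ner" not in nlp.pipe_names:
        nlp.add_pipe(nlp.create_pipe("ner"))
    ner = nlp.get_pipe("ner")
    for label in LABELS:
        ner.add_label(label)
    # Update only the NER component; freeze any other pipeline components.
    other_pipes = [p for p in nlp.pipe_names if p != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training() if base_model is None else nlp.resume_training()
        for _ in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, drop=drop, sgd=optimizer, losses=losses)
    return nlp

# `train_data` is the offset-formatted list from the earlier sketch.
# Method 1: train_ner(train_data); Method 2: train_ner(train_data, "en_ner_bc5cdr_md")
```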

3.3.1. Method 1: Blank spaCy model

We trained a blank spaCy English language model (this model has no trained entities) using annotated training data to recognize four custom entities. We did not provide any custom token-to-vector layer and set the API to use the default execution of the spaCy NLP pipeline.

We started by utilizing only 50% of the available training data and trained five models (for 100 iterations with dropout rate = 0.2) while increasing the training data in increments of 10%. The performance of the trained models was evaluated on the test data.

3.3.2. Method 2: scispaCy + Transfer Learning + Retraining

We observed that the pre-trained scispaCy model (en_ner_bc5cdr_md) [8] was trained on the BC5CDR corpus [11] to recognize two entities (Disease and Chemical) that overlap with our four custom entities. The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions [20].

As the model was already trained on medical data, we used it as a base model, applied transfer learning and retrained it using our in-house annotated data. Similar to the blank models, we trained five models (for 100 iterations with dropout rate = 0.2) while increasing the training data from 50% to 100% of the available training data. The performance of the trained models was evaluated on the standard test data.

3.3.3. Method 3: scispaCy + Transfer Learning + Pre-training

In order to improve the performance of the transfer learning models further, we employed a newly released spaCy feature: pre-training. Pre-training allows us to initialize spaCy's CNN layers with a custom vector layer. This custom vector layer can be trained on a domain-specific text corpus using the spaCy pre-training command [12]. The pre-training API implements dynamic word embeddings using the Language Modelling with Approximate Outputs (LMAO) objective described in spaCy's language model pretraining documentation [25].

We leveraged the spaCy pre-training API and trained our custom dynamic embedding model over our domain-specific text corpus (created by utilizing text sentences from 2300 sample notes from the Sample Medical Transcripts [10] site and 100 FAQ section texts from the EMA [9] site). We provided the scispaCy model (en_ner_bc5cdr_md) [8] embedding vectors as a seed while training.

We monitored the loss over epochs during training and observed that the loss gradually reduced to a minimum around the 95-epoch mark, after which it plateaued. With our experimental set-up (CPU machine with x86_64 GNU/Linux, Intel Core Processor (Broadwell), 16 GB RAM), 95 epochs of fine-tuning were completed in 8 hours.

We then used this domain-specific embedding model for the vectorization of tokens while performing transfer learning from the scispaCy model (en_ner_bc5cdr_md) [8] using our annotated data. We trained five models (for 100 iterations with dropout rate = 0.2), similar to the models developed in the earlier methods. With our experimental set-up, the 100 iterations of training required for each model were completed in 48 minutes.
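The following sketch chains spaCy 2.1's pretrain and train commands in the spirit of Method 3. The file and directory names are illustrative, and the exact option set varies across spaCy 2.x minor versions, so the flags shown are an assumption to be checked against the installed version rather than a verbatim record of our runs.

```python
import subprocess

# Step 1: pretrain a domain-specific tok2vec layer on the raw corpus,
# seeding it with the scispaCy model's vectors (LMAO objective).
# Writes per-epoch weights (model0.bin, model1.bin, ...) to pretrain_out/.
subprocess.run(["python", "-m", "spacy", "pretrain",
                "corpus.jsonl",       # raw domain text (JSONL, see Section 3.1)
                "en_ner_bc5cdr_md",   # model supplying the seed word vectors
                "pretrain_out"], check=True)

# Step 2: train the NER component, initializing its token-to-vector layer
# with the pretrained weights via --init-tok2vec (added in spaCy 2.1).
# train.json / dev.json are the annotations converted to spaCy's JSON format.
subprocess.run(["python", "-m", "spacy", "train", "en", "ner_out",
                "train.json", "dev.json",
                "--pipeline", "ner",
                "--init-tok2vec", "pretrain_out/model95.bin"], check=True)
```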

3.4. RESULTS

Table 2 captures the observed overall NER model performance on test data for the conducted experiments.

Table 2. Performance of Trained Models on Test Data

Training Data Used   Model Name                             Precision   Recall   F1-Score
50%                  Blank                                  0.607       0.539    0.571
                     Retrained scispaCy                     0.682       0.719    0.700
                     Retrained scispaCy with pre-training   0.711       0.759    0.734
60%                  Blank                                  0.647       0.569    0.605
                     Retrained scispaCy                     0.714       0.728    0.721
                     Retrained scispaCy with pre-training   0.740       0.758    0.749
70%                  Blank                                  0.688       0.611    0.647
                     Retrained scispaCy                     0.744       0.752    0.748
                     Retrained scispaCy with pre-training   0.753       0.747    0.750
80%                  Blank                                  0.689       0.646    0.667
                     Retrained scispaCy                     0.755       0.741    0.748
                     Retrained scispaCy with pre-training   0.757       0.778    0.767
90%                  Blank                                  0.696       0.662    0.679
                     Retrained scispaCy                     0.747       0.743    0.745
                     Retrained scispaCy with pre-training   0.754       0.761    0.757
100%                 Blank                                  0.724       0.685    0.704
                     Retrained scispaCy                     0.755       0.743    0.749
                     Retrained scispaCy with pre-training   0.776       0.794    0.785
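The precision, recall and F1 numbers in Table 2 can be computed with spaCy 2.x's built-in Scorer; a minimal sketch follows, where the evaluate_ner helper is our own illustration and test_data is the held-out list in the offset format from Section 3.1.

```python
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate_ner(nlp, test_data):
    """Overall NER precision/recall/F1 over (text, annotations) pairs."""
    scorer = Scorer()
    for text, annotations in test_data:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        scorer.score(nlp(text), gold)  # accumulate per-document entity scores
    s = scorer.scores
    # spaCy 2.x reports these on a 0-100 scale; Table 2 uses 0-1.
    return s["ents_p"] / 100, s["ents_r"] / 100, s["ents_f"] / 100
```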


4. DISCUSSION

As observed in the results, with a progressive increase in the availability of training data, the performance of the models on test data steadily increases. A clear gain is observed between the blank model and the model based on the scispaCy pre-trained model. This gain can be attributed to the overlap of entities between the custom model and the scispaCy model. Furthermore, performance gains are observed when using a pre-training vector customized to the domain vocabulary used in the medical reports.

Figure 3: F1 Scores on Test Data While Increasing Training Data (bar chart of the overall F1 scores from Table 2 for the Blank, Retrained scispaCy and Retrained scispaCy with Pre-training models at 50% to 100% training data)

For each trained model, the overall NER evaluation metrics were recorded, including Precision, Recall and F1 score [19]. Figure 3 presents a bar chart of the observed overall F1 scores from Table 2 across the three methods, named Blank, Retrained scispaCy and Retrained scispaCy with pre-training, while progressively increasing the training data.

As observed in Figure 3, there is generally a steady increase in F1 score with an increase in available training data. The gain between the blank model and the scispaCy-derived models is prominent, along with a steady gain visible between the two scispaCy-derived models.

A key observation of the results presented is that the F1 score of the scispaCy + pre-trained model trained with just 50% of the available training data (0.734, as observed in Table 2 in Section 3.4) outperforms the F1 score of the blank spaCy model trained with 100% of the available training data (0.704, as observed in Table 2 in Section 3.4).

The final performance of the custom NER model was evaluated on the test data set. The overall F1 score of our recommended NER model, derived from scispaCy (en_ner_bc5cdr_md) [8] using Method 3 with a custom pre-trained vector, was 0.785, as observed in Table 2 in Section 3.4.

Table 3 presents the entity-wise F1 scores [19] of the models trained using 100% of the training data under the three methods. As observed in Table 3, the F1 scores of the model derived from scispaCy with pre-training are consistently higher than those of the other models across the entities.

Table 3. Entity-wise F1 Scores of Trained Models on Test Data with 100% Training Data

Label      Blank   Retrained scispaCy   Retrained scispaCy with Pre-training
CHEMICAL   0.790   0.842                0.860
DISEASE    0.690   0.785                0.809
SYMPTOM    0.637   0.719                0.723
DOSAGE     0.815   0.838                0.878

5. CONCLUSION

Our experiments present empirical results which corroborate the hypothesis that transfer learning delivers clear benefits even when working with a limited amount of training data. A key observation of the results presented is that the F1 score of a model trained with our approach on just 50% of the available training data (0.734) outperforms the F1 score of the blank spaCy model (0.704) trained on 100% of the available training data. Clearly, leveraging pre-trained models with partial overlap with the target entities provides significant benefits.

Our approach to the problem, using a custom annotation tool and pre-training techniques, can be utilized and extended to multiple NLP problems, such as Machine Comprehension, FAQ-based Question-Answering, Text Summarization etc. The techniques are application domain-agnostic and can be applied to any industrial vertical, such as but not limited to Banking, Insurance, Pharma and Healthcare, where domain expertise is required.

In future work, we plan to increase the number of entities and experiment with how the number of entities affects the performance of the trained models. We also plan to release our pre-trained model with pharmacology domain entities that can be used for multiple applications.

REFERENCES

[1] Named Entity Recognition: https://en.wikipedia.org/wiki/Named-entity_recognition. Accessed on 08/19.
[2] Electronic Health Records: https://en.wikipedia.org/wiki/Electronic_health_record
[3] Sebastian Ruder. NLP's ImageNet Moment: http://ruder.io/nlp-imagenet/ Accessed on 08/19.
[4] Matthew Peters et al. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics.
[5] Jacob Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT '19.
[6] Doccano: Open source text annotation for machine learning practitioners: https://github.com/chakki-works/doccano Accessed on 08/19.
[7] spaCy. https://spacy.io Accessed on 08/19.
[8] Mark Neumann et al. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL 2019. scispaCy (en_ner_bc5cdr_md) version 0.2.0.
[9] EMA. https://clinicaldata.ema.europa.eu/web/cdp/home Accessed on 08/19.
[10] Medical Transcripts Samples. http://www.medicaltranscriptionsamples.com/ Accessed on 08/19.
[11] BC5CDR Corpus. http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/ Accessed on 08/19.
[12] Matthew Honnibal. Language Model Pre-training in spaCy. https://spacy.io/usage/v2-1#pretraining Accessed on 08/19.
[13] Jing Li, Aixin Sun, Jianglei Han, Chenliang Li. 2018. A Survey on Deep Learning for Named Entity Recognition. CoRR, abs/1812.09449.
[14] Flask, the Python micro framework for building web applications. https://palletsprojects.com/p/flask/ Accessed on 08/19.
[15] Training spaCy's Statistical Models. https://spacy.io/usage/training Accessed on 08/19.
[16] Anaconda Distribution. https://www.anaconda.com/distribution/ Accessed on 08/19.
[17] Anaconda Python 3.6.8. https://anaconda.org/anaconda/python/files?version=3.6.8 Accessed on 08/19.
[18] spaCy version 2.1.4. https://pypi.org/project/spacy/ Accessed on 08/19.
[19] Machine learning performance evaluation: F1 score. https://en.wikipedia.org/wiki/F1_score Accessed on 08/19.
[20] National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/research/bionlp/Data/ Accessed on 08/19.
[21] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
[22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26.
[23] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[24] Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang. 2019. Efficient Contextual Representation Learning Without Softmax Layer.
[25] Matthew Honnibal and Ines Montani. Language model pretraining. Accessed on 08/19.
[26] Carsten Eickhoff, Yubin Kim and Ryen White. Overview of the Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining, WSDM 2020.