Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review
Irene Li, Jessica Pan, Jeremy Goldwasser, Neha Verma, Wai Pan Wong, Muhammed Yavuz Nuzumlalı, Benjamin Rosand, Yixin Li, Matthew Zhang, David Chang, R. Andrew Taylor, Harlan M. Krumholz and Dragomir Radev
{irene.li,jessica.pan,jeremy.goldwasser,neha.verma,waipan.wong}@yale.edu
{yavuz.nuzumlali,benjamin.rosand,yixin.li,matthew.zhang,david.chang}@yale.edu
{richard.taylor,harlan.krumholz,dragomir.radev}@yale.edu
Yale University, New Haven, CT 06511

Abstract

Electronic health records (EHRs), digital collections of patient healthcare events and observations, are ubiquitous in medicine and critical to healthcare delivery, operations, and research. Despite this central role, EHRs are notoriously difficult to process automatically. Well over half of the information stored within EHRs is in the form of unstructured text (e.g. provider notes, operation reports) and remains largely untapped for secondary use. Recently, however, newer neural network and deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. In this survey paper, we summarize current neural NLP methods for EHR applications. We focus on a broad scope of tasks, namely, classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, interpretability, etc.

CCS Concepts: • General and reference → Surveys and overviews; • Computing methodologies → Natural language processing; algorithms.

Keywords: natural language processing, neural networks, EHR, unstructured data

1 Introduction

Electronic health records (EHRs), digital collections of patient healthcare events and observations, are now ubiquitous in medicine and critical to healthcare delivery, operations, and research [45, 51, 73, 92]. Data within EHRs are often classified, based on collection and representation format, into two classes: structured or unstructured [43]. Structured EHR data consist of heterogeneous sources like diagnoses, medications, and laboratory values stored in fixed numerical or categorical fields. Unstructured data, in contrast, refer to free-form text written by healthcare providers, such as clinical notes and discharge summaries. Unstructured data represent about 80% of total EHR data, but unfortunately remain very difficult to process for secondary use [255]. In this survey paper, we focus our discussion mainly on unstructured text data in EHRs and the newer neural, deep-learning based methods used to leverage this type of data.

In recent years, artificial neural networks have dramatically impacted fields such as speech recognition, computer vision (CV), and natural language processing (NLP), within medicine and elsewhere [31, 167]. These networks, which most often are composed of multiple layers in a modeling strategy known as deep learning, have come to outperform traditional rule-based and statistical methods on many tasks. Recognizing the inherent performance advantages and the potential to improve health outcomes by unlocking unstructured EHR text data, neural NLP methods are attracting great interest among researchers in artificial intelligence for healthcare.

1.1 Challenges and difficulties

In this survey, we focus on neural approaches to analyzing clinical texts in EHRs. While these data can carry abundant useful information for healthcare, there also exist significant challenges and difficulties, including:

• Privacy. Due to the extremely sensitive nature of the information contained in EHRs and the existence of regulatory laws such as the (US) Health Insurance Portability and Accountability Act (HIPAA), maintenance of privacy within analytic pipelines is imperative [62, 115]. Therefore, before any downstream tasks can be performed or any data can be shared with others, additional privacy-preserving steps must often be taken. Removing identifying information from a large corpus of EHRs is an expensive process: it is difficult to automate and requires annotators with domain expertise.

• Lack of annotations. Many existing machine learning and deep learning models are supervised and thus require labeled data for training. Annotating EHR data can be challenging because of the cognitive complexity of the task and because of the variability in the quality of the data. Neural networks require large amounts of text on which to train, and unfortunately, useful EHR data are often in short supply. For some tasks, only qualified annotators can be recruited to complete annotations. Even when annotators are available, the quality of the annotations is sometimes hard to ensure, and there may be disagreement among the annotators, which can make evaluations more difficult and controversial.

• Interpretability. Deep neural networks are able to achieve superior results compared with other methods. However, in many fields, they are often treated as black boxes [230, 273]. Typically, a neural network model has a large number of trainable parameters, which causes severe difficulties for model interpretability. Moreover, unlike linear models, which are usually more straightforward and explainable, neural networks consist of non-linear layers and complex architectures, further hampering interpretation. Recently, there have been a few attempts to produce explainable deep neural networks to increase model transparency [29, 30, 164].

Figure 1. Various tasks covered in this survey (neural NLP methods for EHR tasks). Classification and prediction: medical text classification, segmentation, word sense disambiguation, medical coding, medical outcome prediction, de-identification. Embeddings: medical concept embeddings, visit embeddings, patient embeddings, BERT-based embeddings. Extraction: named entity recognition, entity linking, relation and event extraction, medication information extraction. Generation: EHR generation, summarization, medical language translation. Other topics: question answering, phenotyping, medical knowledge graphs, medical dialogues, multilinguality, interpretability, applications in public health.

1.2 Related Work

Prior surveys have focused on a variety of deep learning topics within health informatics, including EHRs. An early survey by Miotto et al. [157] summarized deep learning methods for healthcare and their applications in clinical imaging, EHRs, genomics, and mobile domains. The authors reviewed a number of tasks including predicting diseases, modeling phenotypes, and learning representations of medical concepts, such as diseases and medications. Other survey papers, such as [4, 204], described clinical applications that use deep learning techniques, including information extraction, representation learning, outcome prediction, phenotyping, and de-identification. Similarly, a systematic survey by Xiao et al. [255] targeted five categories: disease detection/classification, sequential prediction, concept embedding, data augmentation, and EHR data privacy. Kwak and Hui [116] reviewed research applying artificial intelligence to healthcare; specifically in the EHR field, they included the following applications: outcome prediction, computational phenotyping, representation learning, de-identification, and medical intervention recommendations. Assale et al. [9] presented a literature review of how free-text content from electronic patient records can benefit from recent Natural Language Processing techniques, selecting application domains including data quality, predictive models, and automated patient cohort selection. Another recent survey [251] summarized deep learning methods in clinical NLP. The authors reviewed methods such as deep learning architectures, embeddings, and medical knowledge on four groups of clinical NLP tasks: text classification, named entity recognition, relation extraction, and others. Joshi et al. [99] discussed textual epidemic intelligence, or the detection of disease outbreaks via medical and informal (i.e. the Web) text.

1.3 Motivation

While the aforementioned reviews target different applications or techniques and vary in the scope of selected topics, in this survey we aim at a more comprehensive coverage of applications and tasks, and we include several works with very recent BERT-based models. Specifically, we summarize a broad range of existing literature on deep learning methods for clinical text data. Most of the papers on clinical NLP discussed here achieve performance at or near the state-of-the-art for their tasks. We now provide some NLP preliminaries, and then we explore deep learning methods on various tasks: classification, embeddings, extraction, generation, and other topics including question answering, phenotyping, knowledge graphs, and more. Figure 1 shows the scope of the tasks. Finally, we also summarize relevant datasets and existing tools.

2 Preliminaries

In this section, we present an overview of important concepts in deep learning and natural language processing. In particular, we cover commonly used deep learning architectures in NLP, word embeddings, recent Transformer models, and concepts from transfer learning.

2.1 Basic Architectures

We first summarize some basic network architectures commonly used in NLP. We introduce autoencoders, convolutional neural networks, recurrent neural networks, and sequence-to-sequence models.

2.1.1 Autoencoders. Autoencoders [113] learn lower dimensional representations of a given input, while preserving salient features. The autoencoder is a neural network that encodes its input into (typically) fewer dimensions, and then decodes this representation to reconstruct the original input. In other words, the autoencoder identifies the hidden semantic features that can be used to represent the input data.

Several variations of the autoencoder exist. The denoising autoencoder [232] corrupts each input by randomly setting some of the input values to zero, but still trains to reconstruct the uncorrupted sample. Doing so prevents overfitting and forces the model to more robustly learn the dependencies between input features. Another modification, the variational autoencoder (VAE) [110], learns to represent each sample with a probability distribution rather than a fixed encoding. This increases model robustness by accounting for greater variance in the latent space. And, because a VAE learns a latent state distribution, its decoder can also generate new samples by randomly sampling encodings from the representation space. Besides, such a latent state distribution forces the underlying manifold or topology to be continuous, making it suitable and important for density-based clustering and topic modeling. A third variation is the stacked autoencoder [233]. Rather than a single autoencoder, this model consists of a chain of successive units. Each autoencoder layer within a stacked autoencoder inputs and reconstructs the encoding layer from the previous autoencoder. Stacked denoising autoencoders are used in a number of NLP applications such as sentiment analysis [272].

2.1.2 Convolutional Neural Networks. Convolutional Neural Networks (CNNs) [64, 120, 238] are a classic neural model based on the human visual cortex. Each layer of a CNN convolves input matrices with smaller filters whose parameters are learned by the network. In effect, it will learn higher and higher level features with each successive layer. In image processing, it may detect edges in the first layers, piece together those low-level shapes in subsequent layers, and so on until it can identify abstract concepts like faces or specific objects. CNNs can also be applied to common NLP tasks, such as text classification [109, 274].

2.1.3 Recurrent Neural Networks. Recurrent Neural Networks (RNNs) [190, 250] process sequential input at discrete time steps. Because they share the same weights at every step, they are able to handle variable-length inputs; this property makes RNNs very helpful in NLP applications, which have sequential text input [83, 163, 249]. In the standard (or "Vanilla") RNN, each time step has a cell that processes both the input at that step and the processed input from all preceding steps. From these inputs, each cell generates an output that gets passed to the next RNN cell, and another (optional) output that can be used to make a decision at that time step.

Vanilla RNNs struggle to learn long-term temporal dependencies because their gradients can grow or decay exponentially over multiple time steps. To solve this issue, the Vanilla RNN cell can be replaced by a Long Short-Term Memory (LSTM) cell [81], a Bidirectional Long Short-Term Memory (BiLSTM) cell [70], or a Gated Recurrent Unit (GRU) cell [35]. These cells have gates that control the flow of information, making the network better at retaining memory over multiple time steps. Such RNN-based models have been widely applied in tasks including text classification [138] and language understanding [261].

2.1.4 Sequence-to-sequence models. Each cell in an RNN can produce an output in addition to a hidden state. This gives them a large amount of flexibility: they can convert one input to one output, one input to many outputs, many inputs to one output, or many inputs to many outputs. Sequence-to-sequence (seq2seq) models are a special kind of many-to-many RNN, in which the whole input sequence is encoded before an output sequence is decoded one token at a time [220]. This presents an encoder-decoder framework resembling that of the autoencoder, but without the objective for the decoder to output the original input. At each decoder time step, the seq2seq model generates a token in the output sequence; that token gets passed as the input token for the next time step. This step is repeated until an end-of-sequence token is decoded. Seq2seq models have been widely used in generative NLP tasks including machine translation [220] and summarization [168].
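As a concrete toy illustration of the autoencoder family described above, the following PyTorch sketch implements a denoising autoencoder: the input is randomly corrupted and the network is trained to reconstruct the clean input. The dimensions, corruption rate, and bag-of-words style input are illustrative assumptions and are not taken from any of the surveyed systems.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Minimal denoising autoencoder: corrupt the input, reconstruct the original."""
    def __init__(self, input_dim=1000, hidden_dim=64, corruption=0.3):
        super().__init__()
        self.corruption = corruption
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        # Randomly zero out a fraction of input features (the "noise").
        mask = (torch.rand_like(x) > self.corruption).float()
        hidden = self.encoder(x * mask)   # compressed representation
        return self.decoder(hidden)       # reconstruction of the clean input

# Toy usage: random vectors standing in for bag-of-words features of notes.
x = torch.rand(8, 1000)
model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = nn.MSELoss()(model(x), x)   # compare against the uncorrupted input
loss.backward()
optimizer.step()
```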

2.2 Word Embeddings

Word embeddings map discrete tokens like words into a real-valued vector space, taking the word's semantics into account. Some of the earliest approaches represented words as one-hot vectors, with a 1 at the index corresponding to the word. This naïve method had two major problems. First, it completely ignored semantics: "Paris" would be just as close to "armchair" in representation space as it would be to "France." Second, it was not scalable, as it represented each word as a vector over the entire vocabulary.

A very influential method using neural networks to create dense semantic word embeddings was word2vec [155]. Word2vec introduced the skip-gram model, which learned word representations by predicting the surrounding context of a "center" word. It also introduced the continuous bag-of-words (CBOW) model, which predicted a "center" word given its context. Both produced meaningful word embeddings that proved excellent at making analogies. Later, Facebook AI proposed FastText [100], which modified word2vec by inputting character n-gram strings rather than full words; this minor adjustment significantly sped up training. GloVe [176] is a similar model that learns embeddings from global statistical information on the number of times words co-occur together. Finally, following a similar approach to word2vec, doc2vec [118] is a model that learns fixed-length dense embeddings of sentences, paragraphs, and documents.
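The skip-gram and CBOW objectives described above are available off the shelf. The following minimal gensim sketch trains skip-gram embeddings on a toy corpus standing in for tokenized clinical sentences; the corpus and hyperparameters are placeholders, not a recommended configuration.

```python
from gensim.models import Word2Vec

# Tiny toy corpus standing in for tokenized, de-identified clinical sentences.
sentences = [
    ["patient", "denies", "chest", "pain", "and", "shortness", "of", "breath"],
    ["chest", "xray", "shows", "no", "acute", "disease"],
    ["patient", "reports", "abdominal", "pain"],
]

# sg=1 selects the skip-gram objective; sg=0 would use CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vector = model.wv["pain"]                  # 50-dimensional embedding for "pain"
neighbors = model.wv.most_similar("pain")  # nearest words in the embedding space
print(neighbors[:3])
```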
2.3 Language Models

Word2vec, GloVe, and FastText embeddings produce fixed embeddings for a given word. Such a framework is limiting, however, because words may have many different meanings depending on their context. Some more recent models use RNNs to create such contextualized embeddings. Most of these are language models: models that predict the probability of an input sequence by factoring it as a product of conditional probabilities. Each conditional probability is the probability of a word given the entire sequence that precedes it. Language models make excellent embedding models because they learn to represent words based on prior context.

An important model that used a language model to create word embeddings was ELMo [180]. ELMo, which stands for Embeddings from Language Models, introduced the idea of using bidirectional LSTMs to consider a word's context in both directions. That way, each embedding factors in every other word in the input, not just the preceding ones.

A significant breakthrough for language models is the Transformer model [229]. The Transformer is an encoder-decoder model that uses a modified attention mechanism called "self-attention" to represent text in a parallel fashion. Its novel network architecture trains faster and yields better results than traditional seq2seq models with attention. Rather than using a sequential RNN to encode the input one token after another, the Transformer processes each input token at the same time. This parallelization dramatically speeds up training. But it does not work in isolation; instead, the self-attention mechanism examines the representations of all other input tokens when encoding an input. The overall architecture is a stack of Transformer encoder layers, the final result of which gets passed to a stack of Transformer decoder layers. As with standard seq2seq models, the Transformer decodes the output one token at a time.

Several models based on the Transformer have transformed the NLP landscape in the years following its release. OpenAI's Generative Pretrained Transformer (GPT) [183] was the first model to create word embeddings with Transformers. GPT, along with its subsequent versions GPT-2 and GPT-3, is a stack of Transformer decoders. All three work as language models, as the decoder outputs one successive token given all previous decoded tokens at each time step. With this framework, they are able to pretrain on vast amounts of unstructured text in an unsupervised manner. Then, once pretrained, the GPT models can be adapted for a variety of tasks such as question answering and summarization.

Bidirectional Encoder Representations from Transformers [53], or BERT, is another major breakthrough model based on the Transformer model. It combines the benefits of bidirectional training found in the ELMo model [180] with the Transformer architecture, producing stellar results on dozens of NLP tasks. BERT, like ELMo and the OpenAI GPTs, is a language model that is pretrained on raw text and can be fine-tuned for many different tasks. However, unlike a standard language model, which predicts a word given the sequence of preceding words, BERT masks input words at random and trains to predict the masked words. The architecture of this "masked language model" is merely a stack of Transformer encoders. At the time of its publication in late 2018, it had already achieved state-of-the-art performance on 11 NLP tasks.

Several important variants of BERT followed that are also commonly used in NLP work. RoBERTa [140] is a model that uses the same architecture as BERT but with adjustments that improve performance. For example, RoBERTa trains on more data for a longer period of time. It also dynamically adjusts the masking scheme and trains on longer sequences. T5 [185] and BART [125] are models that adapt BERT to be better suited for text generation. Both of them add a Transformer decoder to the BERT encoder, decoding the original input one word at a time. They also have a more elaborate masking mechanism that includes blanks and masks of longer spans of text. Some other Transformer/BERT variation models focus on dealing with long documents, such as Transformer-XL [47] and Longformer [15]. In practice, clinical notes that contain thousands of tokens may take advantage of these models.

2.4 Transfer Learning

Transfer learning has more recently shown immense usefulness in NLP [162, 193, 194]. Language models in particular are often used for transfer learning via pretraining. Their importance can be attributed to two causes: one, they train to develop an intuitive understanding of semantics and grammar, and two, they are unsupervised models that train on raw text data, a resource that is in wide supply. Models can pretrain on gigantic corpora like the entirety of English Wikipedia. Once pretrained, network architectures for language models like ELMo and BERT can be easily adapted for classification tasks like spam detection, sequence tagging tasks like named entity recognition, generative tasks like abstractive summarization, and many others [165, 180]. The recent success of many pretrained models over existing benchmarks cements the status of transfer learning as an indispensable tool in contemporary NLP. In particular, EHR data can be scarce at times due to restrictions involving privacy and annotation availability, making transfer learning a frequently employed and important technique within the space of NLP for textual EHR data.

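Before turning to EHR applications, the sketch below shows the transfer-learning recipe of Section 2.4 in its most common form: load a pretrained masked language model and fine-tune it for sequence classification with Hugging Face Transformers. The checkpoint name, labels, and toy notes are illustrative assumptions; clinically pretrained BERT variants would be drop-in replacements for the generic checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # placeholder; a clinical checkpoint could be swapped in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled notes: 1 = chest-pain related, 0 = otherwise (invented examples).
texts = ["Patient presents with chest pain radiating to the left arm.",
         "Follow-up visit for medication refill, no acute complaints."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)   # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()

# Inference: argmax over the logits gives the predicted class per note.
model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds.tolist())
```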
3 Classification and Prediction

Text classification and prediction tasks in EHRs are critical to quickly processing thousands of large texts for clinical decision support, research, and process optimization. In this section, we cover several main subtasks, including general medical text classification, segmentation, and word sense disambiguation, as well as selected specialized subtasks for EHRs such as medical coding, outcome prediction, and de-identification.

3.1 Medical Text Classification

Significant prior work has been done on text classification for medical texts and clinical notes. These works rely on traditional and classical methods like support vector machines (SVMs) and k-nearest neighbors (kNN). Marafino et al. [148] developed an SVM classifier that was able to identify a range of diagnoses and procedures in the intensive care unit (ICU); including n-gram features was found to further improve performance. In Khachidze et al. [104], SVMs and kNNs were comparatively applied to classify small instrumental diagnostics records in Georgian. After de-identification and simple preprocessing like tokenization and lemmatization, they performed feature selection and applied the classifiers. Results show that, among the three classes Ultrasonography, X-ray, and Endoscopy, SVMs outperform kNNs by around 4-5% on F1 score and 2-4% on accuracy.

In recent years, deep models like CNNs have attracted attention and achieved very competitive results on classification tasks. One of the early attempts was by Hughes et al. [88], who applied a CNN to classify clinical text at the sentence level. The model has four convolutional layers after the sentence embedding input, and at the end a fully-connected layer is applied to predict the sentence labels (Figure 2). They compared their method with a variety of traditional machine learning methods and sentence representations, including logistic regression, doc2vec embeddings [118], and bag-of-words features. Results from Hughes et al. and others have shown that deep models including word embeddings and CNN models are competitive with, and can actually outperform, TF-IDF (Term Frequency - Inverse Document Frequency) and topic modeling features [88, 244]. In a similar way, Yao et al. [262] introduced an approach that combines rule-based features and knowledge-guided deep learning techniques for supervised classification of diseases using clinical texts. When labeled data are insufficient, it is also possible to use weak supervision with pre-trained word embeddings [244]. Applying pre-trained models like BERT and BioBERT [121] for classification is another alternative [149].

Figure 2. Medical text classification using CNN. Adapted from the original paper [88].

Besides the previously mentioned generalized medical text classification tasks, other work focuses on specific applications in the medical domain. For example, recent works have focused on identifying the chief complaint, or motivation, for medical visits. The chief complaint concerns the patient's medical history, current symptoms, and reason for visit, and is often encoded within the EHR in the form of free-text descriptions. It is essential for the development of automatic patient triage systems, and the data itself is helpful in various predictive tasks. The model proposed by Valmianski et al. [227] attempts a one-vs-rest classification of chief complaints in the EHR, comparing BERT-based models with a TF-IDF baseline model. Although the TF-IDF model outperformed the BERT-based models here, the authors suspect that this result is in part due to the short length of the median text entry; the robustness of the TF-IDF model remains questionable. Chang et al. [28] have also worked on applying a pre-trained BERT model to chief complaint extraction. They derived contextual embeddings to predict provider-assigned labels, focusing more specifically on emergency department chief complaints. They evaluated LSTM, ELMo, and BERT models and found that LSTM and ELMo achieved nearly the same performance, with average accuracies of 0.82 and 0.822, while BERT achieved 0.844.

Sepsis detection is another important classification task. Sepsis is a clinical condition defined as life-threatening organ dysfunction caused by infection [207]. The patient's outcome is critically dependent on early detection and intervention, but symptoms are often missed because of shared similarities with other conditions. The requisite data for such diagnoses and predictions already exist in patient EHRs, so Futoma et al. [65] developed an RNN classifier to detect sepsis in EHRs. They framed the problem as a multivariate time series classification problem. By using a multitask Gaussian process (MGP) and feeding into an RNN, they were able to make substantial improvements over baselines and clinical benchmarks with significantly higher precision. Specifically, compared with a vanilla RNN, the model improved precision by 0.1; compared with non-deep learning methods, it improved precision by 0.3-0.4.
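Several of the studies above compare neural classifiers against a TF-IDF baseline. A minimal version of such a baseline, assuming de-identified free-text notes and scikit-learn, looks as follows; the example texts and labels are invented for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for de-identified chief-complaint texts and their labels.
notes = [
    "chest pain radiating to left arm, diaphoresis",
    "productive cough and fever for three days",
    "crushing substernal chest pressure on exertion",
    "sore throat and nasal congestion",
]
labels = ["cardiac", "respiratory", "cardiac", "respiratory"]

# The baseline: sparse TF-IDF features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(notes, labels)

print(clf.predict(["fever with cough"]))            # expected: ['respiratory']
print(clf.predict_proba(["chest pain on exertion"]))
```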

3.2 Segmentation

Segmentation is the task of finding boundaries between sections of a text, and it serves an important role in preprocessing documents. The application of automatic segmentation to clinical documents is vital, as most EHRs of any length are either explicitly or implicitly segmented. In EHRs, topic segmentation aims to segment a document into topically similar parts; it is also known as text segmentation or discourse segmentation. Typically, text segmentation is formulated as a sentence classification task, either by classifying each sentence into a correct category [8] or by deciding whether the evaluated sentence is a border sentence of its section [10].

There have been many attempts to use traditional machine learning methods to perform segmentation. One work by Apostolova et al. [8] proposed a model for segmentation of biomedical documents. Using a hand-crafted training dataset of segmented clinical notes, they proposed an SVM to classify each sentence into a section, taking formatting and contextual features into account. Li et al. [130] also tried to take advantage of the format, noting that sections of a clinical note tended to follow a similar sequential pattern, and used a Hidden Markov Model (HMM) on a labeled dataset of clinical notes to infer the section labels. Later, Tepper et al. [223], targeting the same formatting patterns, instead favored identifying likelier EHR section sequences by using beam search and a maximum entropy approach.

Recently, a few works have attempted to utilize deep models for segmenting medical texts. Word embeddings have been investigated to improve the segmentation of medical textbook chapters [10, 103]. Badjatiya et al. [10] applied an attention mechanism in a neural segmentation model (Figure 3). They formulated segmentation as a binary classification task: deciding whether a sentence in the text is the beginning of a section. For each sentence, they used CNNs to create sentence embeddings for both the sentence in question and its context sentences on each side. The middle sentence attends to these sentence embeddings to create context vectors, and these representations are then merged and used for binary classification.

Figure 3. Architecture diagram for the neural attention segmentation model [10].

3.3 Word Sense Disambiguation

Word Sense Disambiguation (WSD) is used to assign the correct meaning to an ambiguous word given its context. In the medical domain, EHRs often contain many ambiguous terms that require specific domain knowledge. For example, the word "ice" may refer to frozen water, methamphetamine (an addictive substance), or caspase-1 (a type of enzyme) [245]. In addition to improving the clarity of the text, WSD can also help in downstream tasks like machine translation, information extraction, and question answering [27, 186, 279].

The WSD task has been approached with supervised learning, semi-supervised learning, and knowledge-driven methods [63, 136, 252]. These approaches show that massive, high-quality annotated training data are essential to achieve desirable WSD system performance. This is especially true for medical WSD, where only experts with substantial background knowledge can provide high-quality annotations.

Some early works applied machine learning models like SVMs, Naïve Bayes, and decision trees to tackle WSD [24, 123, 256]. There have also been attempts to apply neural methods to WSD. The deepBioWSD model [179] is a representative work. It first utilizes Unified Medical Language System (UMLS) sense embeddings; these embeddings are then used to initialize a single bidirectional long short-term memory network (BiLSTM), which is trained to perform sense prediction for any ambiguous term. Among selected baseline models including LSTM, SVM, and BiLSTM, deepBioWSD performed best in biomedical text WSD, achieving 96.82% macro accuracy, while the others were below or around 95%. Other works investigated similar neural network structures, such as multi-layer LSTMs [21] and BiLSTMs with self-attention [268]. Results show that with deeper or attention layers, these models can outperform basic neural network models by around 2-4% on average testing accuracy.

The task of medical term abbreviation disambiguation is a special case of WSD. This task can also assist other downstream NLP tasks like sentence classification, named entity recognition, and relation extraction [94]. In clinical notes, it is quite common for physicians, nurses, and other providers to use medical term abbreviations to represent drug names, disease names, and other words. Depending on the medical specialty and contents of the EHR, these abbreviations can have a wide range of possible senses [98, 252]. For example, there exist at least 5 possible word senses for the term MR, including magnetic resonance, mitral regurgitation, mental retardation, medical record, and the general English word Mister (Mr.) [128]. Medical term abbreviation disambiguation is becoming increasingly important, as meaningful and standardized notes are helpful for both patients and doctors 1.

1 https://www.opennotes.org/
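To make the section-segmentation formulation of Section 3.2 concrete, the sketch below frames "is this sentence the start of a new section?" as binary classification over a (previous sentence, current sentence) pair. It is far simpler than the CNN-plus-attention model of [10], using only averaged word embeddings; all dimensions, token ids, and labels are placeholders.

```python
import torch
import torch.nn as nn

class SectionStartClassifier(nn.Module):
    """Score each sentence for 'is this the first sentence of a new section?'."""
    def __init__(self, vocab_size=5000, embed_dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)  # mean of word embeddings
        self.scorer = nn.Linear(2 * embed_dim, 1)            # previous + current sentence

    def forward(self, prev_ids, prev_offsets, curr_ids, curr_offsets):
        prev_vec = self.embed(prev_ids, prev_offsets)
        curr_vec = self.embed(curr_ids, curr_offsets)
        return self.scorer(torch.cat([prev_vec, curr_vec], dim=-1)).squeeze(-1)

model = SectionStartClassifier()
# Two (previous, current) sentence pairs as flat token-id tensors plus offsets.
prev_ids = torch.tensor([1, 2, 3, 4, 5]); prev_offsets = torch.tensor([0, 3])
curr_ids = torch.tensor([6, 7, 8, 9]);    curr_offsets = torch.tensor([0, 2])
labels = torch.tensor([1.0, 0.0])         # 1 = section boundary

logits = model(prev_ids, prev_offsets, curr_ids, curr_offsets)
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()
```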

Figure 4. An example of abbreviation disambiguation, adapted from the original work [128].

Figure 6. A multimodal model architecture for ICD code prediction. [259]

Some efforts have been made for abbreviation disambiguation on clinical notes. Traditional methods like decision trees were applied for acronym disambiguation in Spanish EHRs [192]. Xu and Stetson [258] were the first to apply clustering techniques to build word sense inventories of abbreviations in clinical text. Later on, efforts were made to utilize word embeddings for abbreviation disambiguation. A work by Wu et al. [253] examined three methods for deriving word embeddings from an unlabeled clinical corpus: surrounding-based embedding, left-right surrounding-based embedding, and max surrounding-based embedding. The word embeddings are treated as additional features in a WSD system based on Support Vector Machines.

Beyond word embeddings, more complex deep architectures have also been investigated. In a recent paper, Joopudi and Dandala [98] proposed a CNN model that encodes representations of clinical notes and predicts the meaning of an abbreviation using classification techniques. Moreover, a neural topic-attention model was applied to learn improved contextualized sentence representations for medical term abbreviation disambiguation [128] (Figures 4 and 5). In this work, a latent Dirichlet allocation (LDA) model was leveraged to learn topic embeddings, and contextualized word embeddings (ELMo) were then applied to construct a topic-aware sentence vector for classification. Adams et al. [2] introduced another deep latent variable model, called Latent Meaning Cells (LMC), for clinical acronym expansion, focusing on utilizing local context and document meta-data for contextualized representation.

Figure 5. The topic-attention model for abbreviation disambiguation [128].

3.4 Medical Coding

The medical coding task attempts to map text from EHRs to International Classification of Diseases (ICD) codes [198]. These codes represent different diagnoses, and they are used in clinical treatment, medical billing, and statistics collection. ICD codes can help physicians quickly determine which diseases are involved and reach clinical decisions in a more timely manner, but they are also tediously specific and detailed; for example, diabetes alone has over two dozen different codes. Human coders struggle to manage the scale and complexity of the process and often make costly mistakes. Many attempts have been made to improve ICD coding using rule-based methods [61] and machine learning algorithms like Bayesian Ridge Regression [135], SVMs [112, 178], and more.

Many earlier deep learning-based methods have focused on applying CNNs, LSTMs, and LSTMs with attention, formulating the task as a multi-label classification problem [12, 203, 259]. A representative model with explainability is Convolutional Attention for Multi-Label classification (CAML) by Mullenbach et al. [164]. It is a CNN-based model with attention: the model first aggregates information from the entire document using a CNN, and an attention mechanism is then applied to select the most relevant segments of the document that trigger each ICD code prediction, providing explanations in the process. Xu et al. [259] proposed a multimodal framework which considers unstructured text, semi-structured text, and tabular data when predicting an ICD code (Figure 6). A CNN is applied to model the unstructured text, a deep learning model containing a character-level CNN and a BiLSTM is applied to the semi-structured text, and a decision tree is applied to the tabular data. By assembling these models, the system is able to make ICD code predictions. Focusing on features extracted from discharge summary notes in MIMIC-III, a widely used, albeit limited, EHR dataset, Huang et al. [85] conducted a study on multi-label ICD-9 code classification. They tested 1) a CNN-based classification model with word2vec document representations for each discharge summary, 2) a standard LSTM-based model, and 3) a GRU-based model on the sequential discharge summary text. They conducted two main classification tasks: predicting the top 10 and the top 50 ICD-9 codes. On the top 10 codes, compared to several baseline methods including logistic regression, random forest, and feed-forward networks, the RNN-based methods showed greater predictive performance. However, on the top 50 codes, logistic regression outperformed the proposed deep method, leaving room for improvement in deep-learning methods for multi-label ICD-9 coding.

Recent improvements in pre-trained language models have greatly helped automatic medical coding. Singh et al. [208] implemented a BERT model to predict ICD-9 codes from unstructured clinical text as a multi-label classification task. They pre-trained the BERT model on the MIMIC-III dataset, allowing the model to learn the medical data context with less computation and preprocessing than other methods. Results show that this model significantly outperforms existing literature, performing better as the classifier handles more diagnosis and procedure codes.

Medical coding suffers from a severe data imbalance between the most and least popular codes, because some codes are used much more frequently than others. Vu et al. [236] attempt to address this by proposing an attention-based model which is able to adapt to the varied interdependence and lengths of text fragments. The label attention model consists of four layers: a pre-trained embedding layer, a bidirectional LSTM, an attention layer for label-specific weight vectors, and label-specific binary classifiers. They then extended the label attention model into a hierarchical joint learning mechanism called JointLAAT to handle the data imbalance for rare codes.

Another challenge in medical coding is capturing higher-level information from clinical encounters. Each encounter can have multiple associated documents, and coding is often done at this level rather than on individual documents. Shing et al. [205] proposed an encounter-level document attention network (ELDAN), which is made of three parts: a document-level encoder that turns sparse document features into dense document features, a document-level attention layer, and an encounter-level encoder. They treated the problem as multiple one-vs-all binary classification problems, and the model outperforms the baseline without the need to train on document-level annotations.
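Most of the neural coders above share the same output-layer design: one sigmoid per ICD code trained with a binary cross-entropy loss, since a document can carry several codes at once. The PyTorch sketch below shows only that multi-label head on top of an arbitrary document encoder; the sizes and targets are invented for illustration.

```python
import torch
import torch.nn as nn

# Multi-label ICD coding: each note can be tagged with several codes at once,
# so we use one sigmoid output per code rather than a single softmax.
num_codes = 50        # e.g. the top-50 ICD-9 codes
doc_dim = 256         # output size of any document encoder (CNN, LSTM, BERT, ...)

classifier = nn.Linear(doc_dim, num_codes)
doc_vectors = torch.randn(4, doc_dim)        # stand-in for encoded discharge summaries
targets = torch.zeros(4, num_codes)
targets[0, [3, 17]] = 1.0                    # note 0 carries two codes
targets[1, 42] = 1.0

logits = classifier(doc_vectors)
loss = nn.BCEWithLogitsLoss()(logits, targets)   # independent binary loss per code
loss.backward()

predicted = (torch.sigmoid(logits) > 0.5).nonzero(as_tuple=False)
print(predicted[:5])   # (note index, code index) pairs above the 0.5 threshold
```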

3.5 Medical Outcome Prediction

Prediction of medical outcomes (e.g. length of stay, progression to heart failure, death) is challenging, but existing available information such as provider notes and imaging reports can help. We highlight representative deep learning approaches that utilize structured EHR data for this task, then introduce how unstructured text can help.

Doctor AI [36] is an early attempt that predicts new diagnoses and medications for a patient's subsequent visit. The researchers used an RNN because its sequential architecture lends itself to modeling the temporal nature of a patient's healthcare trajectory. The input at each time step is a raw representation of a single patient visit, incorporating relevant disease, medication, and procedure codes. The hidden state serves as a representation of the patient's medical history at that point in time. In a similar temporally-based prediction task, Suresh et al. [218] used LSTM- and CNN-based models for forward-facing predictions of ICU interventions, including ventilation, vasopressor use, and fluid bolus use. The data are split into 6-hour chunks in which patient data are recorded, as well as the status of the interventions being taken. After a 6-hour gap period, a 4-hour prediction period is allocated to test the model and predict which interventions were taken during this period. They report high AUC (area under the curve) scores for the LSTM- and CNN-based models compared to a linear regression base model, with an improvement of up to 0.24 AUC.

Electronic health records tend to have irregular intervals between observations, as clinical events are often sudden occurrences. DeepCare [181] is an end-to-end neural network that addresses the episodic nature and irregularity of EHRs, validated on mental health and diabetes datasets. It reads medical records, stores illness history, infers current health status, and predicts future medical outcomes. Each care episode is represented by a vector; the relationship between episodes is modeled with a modified LSTM that accounts for irregularities in time, admission methods, diagnoses, and interventions. DeepCare aggregates the patient's medical history using multi-scale temporal pooling, then predicts the probability of specific medical outcomes. Lyu et al. [144] further explored medical time series, choosing to investigate unsupervised representation learning methods, and a seq2seq model was applied as a forecaster. They found that the forecasting seq2seq model performed best, and in particular that an integrated attention mechanism can improve clinical predictions, with a mean squared error of 0.098 with attention compared to 0.119 without.

Recently, Zhang et al. [275] proposed MetaPred, a transfer-learning framework to assess clinical risk for low-resource clinical disease data, addressing the limited number of labeled data samples. MetaPred is trained on related risk prediction tasks to learn how good predictors are learned, and this meta-learned model can be fine-tuned or applied on the target disease data. With CNNs and RNNs as base predictors, MetaPred outperformed fully supervised models in classifying mild cognitive impairment, Alzheimer's, and Parkinson's, as measured by AUCROC and F1 score; on average, MetaPred improves AUCROC by about 0.3464 and F1 by about 0.4521.

In a task closely related to outcome prediction, Hsu et al. [84] addressed the predictive power of the variety of unstructured notes within EHRs. Because some unstructured notes
readmission), but embedding had better overall performance than any other provide negligible additional value in others (e.g. mortality evaluated embedding method. prediction). However, trained a small group of well-selected While previous works focused on English texts, there have sentences, as chosen by value functions considering length, been many efforts toward de-identification in non-English fractions of medical terms, etc., the model outperformed the medical texts, where datasets are even more limited. Ka- entire set of notes for downstream prediction tasks. jiyama et al. [101] applied rule-based, CRF, and LSTM-based methods to three Japanese EHR datasets. They observed 3.6 De-identification that LSTM outperformed CRF by 40% on F1 score on the De-identification is the task of removing patient sensitive MedNLP-1 dataset (Japanese EHRs), and about 7% on a mix- information and preserving clinical meaning in EHR data. De- ture of dummy EHR and MedNLP-1 dataset, in which the identification of PHI (protected healthcare information) in dummy EHR is built by themselves and contains EHR of 32 EHR clinical notes is a critical step to protect patient privacy hospitalized patients. Trienes et al. [224] compared a series before sharing or publishing datasets for secondary research of de-identification methods, including a rule-based system, purposes. The Health Insurance Portability and Accountabil- a feature-based CRF, and a deep neural network (BiLSTM- ity Act (HIPAA) includes 18 different types of PHI 2 including CRF), for transferability and generalizability from English to names, locations, and phone numbers. Several shared tasks Dutch, and found that the deep neural network performed have been organized to promote de-identification of clini- best. García-Pablos et al. [173] tested a pre-trained multi- cal text in the NLP community [216, 217, 226]. Given the lingual BERT model on several Spanish clinical texts. They large amount of EHR data needed, manual de-identification compared the BERT-based sequence labelling model with is not feasible because of the amount of time it requires. a baseline sensitive data classifier, the spaCy Spanish NER Rule-based de-identification studies mainly depend on dic- model, and CRFs. They showed that the simple BERT-based tionary pattern matching and regular expressions [217], but model without domain-specific fine-tuning was able to out- these methods usually require complex algorithms and can- perform all other methods and was also robust to training not handle unexpected cases such as typos or infrequent data scarcity. These works show that applying pre-trained abbreviations. models is also becoming a trend for non-English tasks, due Today, it is common to utilize automatic de-identification to the capacity of multilingual BERT models. approaches that apply named entity recognition (NER) meth- ods. Some other automatic de-identification tools use hybrid methods which combine rule-based methods and machine 4 Embeddings learning based NER methods [217] to achieve good perfor- This section explores various applications of clinical embed- mance, reaching as high as 0.9360 micro-averaged F1 on the dings in the EHR domain. Such representations have been 2014 i2b2/UTHealth de-identification shared task. However, used to model the semantics of biomedical text, the health these methods still have difficulties when applied to data trajectory of individual patients, and many other tasks. 
We in languages other than English, or when used in different also describe how recent pre-trained language models can clinical domains. promote better representation learning methods in EHR and Early de-identification systems for patient records based biomedical text. on neural networks include LSTMs or LSTM-CRF (condi- tional random field) models [52, 260]. These systems con- sisted mainly of four components: an embedding layer, a 4.1 Medical Concept Embeddings label prediction bidirectional LSTM layer, a CRF layer and a Medical concepts include various types of entities: , label-sequence optimization layer. They marginally outper- , diseases and more. Learning the semantics of these formed CRF-based approaches (i.e., improves 0.01 on the F1 medical concepts can be helpful for other applications. A score), demonstrating that the CRF model is still a competi- number of papers use deep models to learn such domain- tive baseline. specific embeddings. Some studies additionally found that the embedding layer A representative work, cui2vec [13] was trained on a cor- has a large impact on the model performance. Wu et al. pus of 80 million documents and over 100k medical concepts including insurance claims, clinical notes, and journal arti- 2https://www.hhs.gov/hipaa/index.html cles. As a preprocessing step, the researchers mapped each Li et al. medical concept word or phrase to its corresponding “con- co-occurrence between concepts. Med2vec embeddings out- cept unique identifier,” or CUI. The co-occurrence statistics perform previous models at tasks like predicting concepts from these mappings are used to construct concept embed- in future visits and calculating the patient’s current severity dings with GloVe [176] and word2vec [155]. status. Many concepts, including lab tests, diagnoses and drug ad- However, most models represent each patient visit as a flat- ministrations, are temporal in nature. These “heterogeneous tened collection of concepts like diagnoses and treatments. temporal events” may have a high degree of correlation be- Doing so, however, ignores valuable hierarchical informa- tween them. For example, the event of some diagnosis being tion on the multilevel relationship between the concepts made is correlated with the results of certain lab tests. In in an EHR. For example, the diagnosis of a fever can lead addition to understanding the relationships between events, to treatments like acetaminophen and IV fluids, which can the model must also learn temporal representations. Some in turn produce side effects that need treatments of their medical events happen only once, whereas others will occur own. MiME [40] models the inherent hierarchical structure periodically according to some treatment patterns. Ignor- of EHRs, leveraging it to create embeddings at the concept, ing the dynamics of the embedded units may lead to poor visit, and patient level while jointly performing auxiliary semantics when learning representations. Liu et al. [137] prediction tasks. These multilevel embeddings can predict proposed a model for obtaining the joint representation of heart failure and sequential diseases with low test loss (0.25), heterogeneous temporal events. The model is trained on as opposed to more traditional NN activation functions, with the overarching classification task of clinical endpoint pre- linear activation at 0.26 and tanh likewise at 0.26. diction, which predicts whether some medical event like a The neural clinical by Wei et al. disease or symptom will happen in the future. 
Their main [247] created visit representations in a simple yet clever way. contribution is a modified LSTM cell based on the Phased First, they extract diagnostic ICD codes from the MIMIC-III LSTM model proposed by Neil et al. [169]. The Phased LSTM . With these as labels, they train a CNN to predict alters the traditional LSTM by adding a time gate that ac- the patients’ codes from the raw EHR text. The final dense counts for inputs with irregular sampling patterns. Liu’s layer of the network is taken to be a visit representation. model goes a step further by adding an event gate that is This representation is successfully applied to an informa- capable of modeling correlations among thousands of event tion retrieval task of recommending relevant literature for types. Zhu et al. [282] also preserve the temporal properties individual patients, achieving a MAP score of 0.2084 when by compressing each patient visit into a fixed-length vector combined with cosine similarity. Here, visit representations with medical embeddings based on surrounding medical con- from MIMIC help achieve strong performance on a task for text. These event embedding vectors are stacked together to which little training data exists. The deep neural network by produce a dense embedding matrix for each patient. Pairs of Escudié et al. [58] learns low-dimensional representations these matrices are passed through convolutional filters and of patient visits from which ICD codes have been removed mapped to feature maps, which are then pooled into interme- to predict the presence or absence of such codes. A CNN diate vectors to build the patient representation embeddings. applied to the text features and a multi-layer perceptron A symmetrical similarity matrix is constructed using the (MLP) used on the structured data are trained together. The distance between the feature vectors. Their proposed frame- last hidden layers of each sub-network are concatenated to work achieves strong performance on similarity measures, produce an embedding for each stay. This embedding is able with an R1 value of 0.99 and Purity of 0.99 as opposed to to conserve semantic medical representation of the initial baseline approaches such as KMeans clustering, with an R1 data and improves the prediction performance of a random value of 0.66 and Purity of 0.42. forest classifier, increasing the average F1 score among pre- dicted codes from 0.595 for the random forest to 0.754 with 4.2 Visit Embeddings the embeddings, and 0.820 with the artificial neural network. Visit embeddings of patients are also crucial for clinical decision-making, outcome prediction, and other downstream 4.3 Patient Embeddings tasks such as question answering and forecasting events. Patient embeddings are another way to take advantage of Some methods treat patient visits as time-series data. An EHR understanding and secondary usage and enable better early attempt, Med2vec [38], generates vector representa- decision-making and outcome predicting. tions of patient visits and medical concepts like procedures, As an early attempt, Mehrabi et al. [153] proposed a deep diseases, and medications. It represents each patient visit as learning method for temporal pattern discovery over the a vector of medical codes corresponding to the concepts Rochester Epidemiology Project dataset by modeling indi- that occurred in the visit. Like the word2vec skip-gram vidual patient records as a matrix of temporal clinical events. 
model, med2vec reads the current visit, embeds it, and pre- The rows of the matrix represent medical codes, and the dicts the likelihood of previous and future visits. Doing so columns represent years. The embeddings are created from preserves the sequential order of a patient’s visits and the this matrix with an unsupervised network called a deep Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review

Figure 7. The diagram of the Deep Patient framework, adapted from the original paper [156]: an unsupervised deep learning method which is trained on a raw dataset and is able to learn patient representations in the last neural network layer.

Figure 8. ClinicalBERT illustration [86]: notes added during a patient's admission are used to update the patient's risk of being readmitted within a 30-day window.
Later, Dligach and Miller [54] built upon methods from Deep Patient to learn patient representations using only text variables. Their neural network takes a set of CUIs as input and produces a vector representation of the patient; the final network layer is composed of sigmoid units that jointly predict all billing codes associated with the patient. The learned dense patient representations outperform sparse patient representations on average and for most diseases.

Similar to Deep Patient, the patient2vec model [270] represents each patient visit as a sequence of ICD-9 codes, medications, and lab tests. It learns an embedding for each of these codes by using word2vec to predict the codes that are likely to co-occur in a visit, and then represents a patient's entire history in a single embedding with an RNN and an attention mechanism. Patient2vec predicted future hospitalizations with higher statistical power than previous patient embedding models; specifically, it achieves a 0.02 gain in F2-score over BiRNN-based models.

Unlike the systems described so far, Sushil et al. [219] learned unsupervised patient representations directly from clinical text. They attempted this with two neural approaches: a stacked denoising autoencoder and a doc2vec model [119]. These networks produce patient representations that improve on the doc2vec model by 0.03 in F1-score when predicting mortality and primary diagnostic category.

Finally, as an example that takes advantage of pretrained models, TAPER [49] uses text and medical codes to produce a unified representation of a patient's visit data that can be used for downstream tasks. The medical code embedding is learned by a skip-gram model using the Transformer model, while a pretrained BERT [53] model produces the medical text embeddings; the two embeddings are concatenated to form the final patient representation. TAPER demonstrated about a 5% recall improvement over med2vec when code embeddings were evaluated with recall@k, and it reached up to 67% AUC-ROC when applied to readmission, mortality, and length-of-stay prediction, compared to baselines including med2vec and patient2vec.

4.4 BERT-based Embeddings

BERT [53] harnesses the power of Transformers to generate better word embeddings than ever before. However, its embeddings generalize poorly to text from specific domains such as medicine. For this reason, several works subject the pretrained BERT to a subsequent round of pretraining, this time on corpora of biomedical or EHR text. BioBERT [121] adapts BERT to the biomedical domain: it is directly pretrained on biomedical research papers from two large corpora, PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC), and it is fine-tuned on three representative biomedical text mining tasks: named entity recognition, relation extraction and question answering. SciBERT [14] trains BERT on 1.14 million biomedical and computer science articles from the Semantic Scholar corpus.

Figure 8. ClinicalBERT illustration [86]: notes are added during a patient's admission to update the patient's risk of being readmitted within a 30-day window.

And, most relevantly to EHRs, ClinicalBERT [86] trains on clinical notes from the MIMIC-III corpus. EhrBERT [126] is also trained on clinical notes, but it is not generally available because its training dataset is not public. Four datasets were included in its evaluation, among them the Medication, Indication, and Adverse Drug Events (MADE) 1.0 corpus, the National Center for Biotechnology Information (NCBI) disease corpus, and the Chemical-Disease Relations (CDR) corpus. Results show that EhrBERT outperforms baseline systems including BioBERT and BERT by up to 3.5% in F1.
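As a brief illustration of how such domain-adapted masked language models are typically queried, the snippet below runs a fill-mask prediction with the Hugging Face `transformers` pipeline. The checkpoint name is a placeholder assumption: in practice one would substitute a clinically or biomedically pretrained masked LM; none of the specific models above is implied.

```python
from transformers import pipeline

# "bert-base-uncased" is only a stand-in; a ClinicalBERT/BioBERT-style checkpoint
# would normally be plugged in here (an assumption, not part of the cited works).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

note = "Patient was started on metformin for poorly controlled [MASK]."
for prediction in fill_mask(note, top_k=5):
    print(f"{prediction['token_str']:>15}  p={prediction['score']:.3f}")
```

A general-domain model tends to complete such a sentence with everyday words, while a clinically pretrained model is expected to prefer terms such as disease names, which is precisely the motivation for the additional pretraining round described above.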
MT-Clinical BERT [165] is a special model that builds connections between individual tasks. In addition to learning embeddings of clinical text, it performs multitask learning on eight information extraction tasks, including entity extraction and personal health indicator (PHI) identification, and the embeddings are shared as inputs to these prediction tasks. This multitask system is competitive with task-specific information extraction models, as it shares information among disjointly annotated datasets.

BEHRT [131] is a deep neural transduction model that learns about patients' past diseases and the relationships between them. It uses the masked language model pretraining approach from BERT: given the past EHR of a patient, the model is trained to predict the patient's future diagnoses (if any). BEHRT produces a final embedding that preserves the timing of events along with data concerning disease sequences and delivery of care. Evaluation demonstrated BEHRT's superior predictive power, with an improvement of 8.0-13.2% in average precision on tasks including disease trajectory and disease prediction, compared to other approaches such as RETAIN [37].

MS-BERT [44] is a transformer model trained on real clinical data rather than the MIMIC corpus. The model is trained on over 70,000 de-identified Multiple Sclerosis (MS) consult notes and is publicly available.3 It is tested on a classification task that predicts the Expanded Disability Status Scale (EDSS), which is usually recorded only inside unstructured notes, and it surpasses models based on word2vec, CNNs and rule-based methods on this task by up to 0.12 Macro-F1.

3 https://huggingface.co/NLP4H/ms_bert

CheXbert [209] applies BERT to the task of labeling free-text radiology reports. Existing machine learning methods for this task either employ feature engineering or manual annotations from experts; while of high quality, such annotations are sparse and expensive to create. CheXbert overcomes this limitation by learning to label radiology reports using both annotations and existing rule-based systems: it first trains to predict the outputs of a rule-based labeler, then fine-tunes on an augmented set of expert annotations. It set a new state-of-the-art result, improving F1 by 0.007 on a report labeling task on the MIMIC-CXR dataset [97], a large-scale collection of labeled chest radiographs.

5 Information Extraction

Information extraction (IE) is the task of automatically identifying important content in unstructured natural language text. It encompasses several subtasks, including named entity recognition, event extraction, and relation extraction. In this section, we present an overview of various IE tasks and methods as applied to EHRs.

5.1 Named Entity Recognition

Named Entity Recognition (NER) is the task of determining whether tokens or spans in a text correspond to certain "named entities" of interest, such as medications and diseases [69, 124].

Formulated as a sequence tagging task, NER is typically modeled using Conditional Random Fields (CRFs) and/or RNN-based models [69, 87]. For example, Cho et al. [34] proposed extracting biomedical entities with a bidirectional long short-term memory network (BiLSTM) and a CRF, achieving improvements of up to 1.5% in F-score over other methods, including BiLSTMs and a vanilla BERT. Similarly, Giorgi et al. [66] combined gold-standard and silver-standard corpora to train an LSTM-CRF model for biomedical NER. Du et al. [55] used a Transformer encoder and an LSTM decoder to extract symptoms from transcribed clinical conversations; in addition to identifying symptoms within the text, they also tried to predict whether or not the patient is experiencing the symptom.

Recent works focus on transfer learning methods, including the use of pretrained word embeddings for medical NER tasks. Gligic et al. [67] showed that medical NER performance can be improved by first pretraining word embeddings on unannotated EHRs with word2vec [155]; they achieved about 0.95 F1 on the i2b2 Medical Extraction Challenge, improving several points past the previous state of the art. As embedding models go, however, BERT [53], which incorporates contextual information, is more sophisticated than word2vec. Because of this, the creators of BioBERT [121] trained regular BERT on biomedical text and then fine-tuned it for a number of NER tasks; BioBERT's embeddings proved very effective at recognizing entities such as diseases, species, proteins, and adverse drug reactions. Yu et al. [264] evaluated a BioBERT-based NER system and showed that it outperforms other models (i.e. BiLSTM-based and attention-based ones) on three biomedical text mining tasks, achieving 0.62 additional F1 points on biomedical NER and also improving on the other two subtasks. Peng et al. [175] similarly adapted BERT for biomedical text via pretraining on large biomedical datasets like PubMed and MIMIC-III; their version outperformed BioBERT on 10 different NER tasks such as recognizing diseases, chemicals, and disorders.
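For readers unfamiliar with how these tagging models are used in practice, the snippet below runs token-classification inference with the `transformers` pipeline. The model name is a general-domain stand-in and is our assumption; a BioBERT- or ClinicalBERT-based NER checkpoint would be substituted for clinical entities, and none of the systems cited above is being reproduced here.

```python
from transformers import pipeline

# General-domain NER checkpoint used only as a placeholder; swap in a
# biomedical/clinical NER model for medication and disease entities.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "Patient denies chest pain but reports shortness of breath while on aspirin."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

The `aggregation_strategy="simple"` option merges word pieces back into full entity spans, which is the output format most downstream EHR pipelines expect.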
While effective, these BERT models fail to address a central issue in medical NER: transferability between medical specialties. Clinicians in different specialties use vastly different vocabularies, which poses a huge challenge for training an effective specialty-independent medical NER model, and the problem is exacerbated by the scarcity of publicly available data for certain specialties. Wang et al. [246] approached this issue by developing a double transfer learning framework for cross-specialty NER. Their system transfers both feature representations and parameters, which enables resource-poor specialties to utilize knowledge gleaned from specialties with more annotated EHRs.

As a sub-task of NER, clinical concept extraction seeks to identify medical concepts such as treatments and drug names. In earlier years, challenges for concept extraction were primarily won by hybrid approaches that combined rule-based and machine learning methods [106, 142]. As with NER, clinical concept extraction is commonly framed as a sequence tagging problem addressed with LSTMs and/or CRF frameworks [26, 76, 91]. For example, Ji et al. [93] proposed a multi-layer fully-connected LSTM-CRF model for concept extraction that incorporates character-level word representations and pretrained word embeddings. Without any feature engineering or prior knowledge, the model outperforms selected baselines such as CRF methods, with an F1 score of 0.845 on the i2b2 dataset.

Similarly, pretrained models and transfer learning techniques have been investigated for clinical concept extraction [121, 170, 280]. A work by Tao et al. [222] embedded each word in an EHR note, then used the embeddings to predict whether it denotes a medical concept. Each embedding is a concatenation of two separate representations: one derived from ELMo (an unsupervised bidirectional language model), and the other from numerous publicly available medical ontologies and lexicons, including the Wikidata graph [235], the Disease Ontology [107, 199], and public FDA datasets. The latter enables the model's embeddings to draw from a larger domain-specific knowledge base. The model performs prediction with a CRF, which takes neighboring tokens into account when classifying each word. A recent system by Krishna et al. [114] extracted diagnoses and organ abnormalities from doctor-patient conversations. These conversations are often very long and contain large sections of clinically irrelevant information, so the researchers first filtered utterances by how "noteworthy" they are and then used a BERT-based model to recognize diagnoses and abnormalities from the selected utterances. Finally, Datta and Roberts [50] applied a BERT-based classifier pretrained on MIMIC-III data, together with a filtering mechanism, to identify spatial expressions in radiology reports. These expressions are important medical concepts that help describe radiographic images and also have use cases in image classification.

5.2 Entity Linking

Entity linking (or named entity linking) is the process of associating mentions of recognized entities with their corresponding nodes in a knowledge base. In practice, entity linking enables automatic linking of EHRs to medical entities, supporting downstream tasks such as diagnosis, decision making and more [96].

MEDTYPE [228] presented a toolkit for medical entity linking that incorporates an entity disambiguation step to filter out unlikely candidate concepts. This step predicts the semantic type of an identified mention based on its context. The authors also utilized pretrained Transformer-based models as encoders, and they introduced two large-scale datasets, WikiMed and PubMedDS, to bridge the gap of small-scale annotated training data for medical entity linking.

Zhu et al. [281] introduced the Latent Type Entity Linking model (LATTE), a neural network-based model for biomedical entity linking; here, latent type refers to the implicit attributes of an entity. The model consists of an embedding layer that contains semantic representations of the mentions and candidates, and an attention-based mechanism is then used to rank candidate entities for a given mention. This model outperforms previous ranking models, achieving over 0.92 MAP on two entity linking datasets.

Oberhauser et al. [172] introduced TrainX, a system for medical entity linking that combines a named entity recognition system with a subsequent linking architecture. It is the first medical entity linking system that utilizes recent BERT models. They showed that TrainX can link against large-scale knowledge bases with numerous named entities, and that it supports zero-shot cases where the system has never seen the correct linked entity before.
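At its core, entity linking is a candidate-ranking problem. The sketch below ranks knowledge-base concept names against a mention using character n-gram TF-IDF cosine similarity; the tiny "knowledge base" and concept identifiers are invented for illustration, and real systems such as those above replace this surface matcher with learned, context-aware encoders over UMLS-scale vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base: concept id -> preferred name (identifiers are illustrative).
kb = {
    "C0020538": "hypertensive disease",
    "C0011849": "diabetes mellitus",
    "C0004096": "asthma",
    "C0027051": "myocardial infarction",
}

def link(mention, kb, top_k=2):
    """Rank candidate concepts for a mention by character n-gram similarity."""
    names = list(kb.values())
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    matrix = vectorizer.fit_transform(names + [mention])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(zip(kb.keys(), names, scores), key=lambda t: -t[2])
    return ranked[:top_k]

print(link("diabetes", kb))      # surface similarity succeeds
print(link("heart attack", kb))  # synonym fails -> motivates learned semantic rankers
```

The failure of the second query ("heart attack" vs. "myocardial infarction") is exactly the gap that attention-based rankers such as LATTE and BERT-based linkers such as TrainX are designed to close.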
5.3 Relation and Event Extraction

The task of relation extraction identifies pairs of entities that are connected by a relation of a specific type. This is critical in a health context because an NLP system must grasp the relationships between various medical entities in order to fully understand a patient's record. Some papers on relation extraction, especially early work [108, 187, 188], treat the task as a multi-label classification problem and train traditional classifiers, such as support vector machines, where the classes are typed relations. Since then, however, deep relation extraction systems have modeled more complex features of text, leading to improvements in modeling entities and their relations.

Several papers have used CNNs to model relation extraction. Sahu et al. [195] used a standard CNN augmented with word-level features to extract relations from clinical discharge summaries. Another work combined CNNs and RNNs to extract biomedical relations from linear and dependency-graph representations of candidate sentences [277]. This hybrid CNN-RNN model was benchmarked on various protein-protein and drug-drug interaction corpora, outperforming BiLSTM and CNN models at extracting these relations and demonstrating the complementary strengths of CNNs and RNNs for this task.

Several papers used RNNs to sequentially label entities of interest and determine their relations. One such work [166] compared SVMs, RNNs, and a rule induction system, and found that while the SVM model performs best with an F1 score of 89.1%, a BiLSTM model had the best performance among RNN models at extracting relations for an adverse drug event (ADE) detection task. Another work [196] used a BiLSTM model with attention to extract drug-drug interactions.
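A common preprocessing trick shared by many of these systems is to mark the two candidate entities before classifying the sentence. The sketch below shows that framing with invented relation labels (TREATS, ADVERSE_EVENT) and a deliberately simple TF-IDF plus logistic-regression classifier standing in for the neural encoders used in the cited work; it is an illustration of the task setup, not of any specific model above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def mark(sentence, drug, problem):
    """Wrap the candidate entity pair in marker tokens before classification."""
    return (sentence.replace(drug, f"[DRUG] {drug} [/DRUG]")
                    .replace(problem, f"[PROB] {problem} [/PROB]"))

train = [
    (mark("Lisinopril was started for hypertension.", "Lisinopril", "hypertension"), "TREATS"),
    (mark("Rash developed after penicillin.", "penicillin", "Rash"), "ADVERSE_EVENT"),
    (mark("Metformin was prescribed for diabetes.", "Metformin", "diabetes"), "TREATS"),
    (mark("Nausea occurred following chemotherapy.", "chemotherapy", "Nausea"), "ADVERSE_EVENT"),
]
texts, labels = zip(*train)
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

test = mark("Warfarin was given for atrial fibrillation.", "Warfarin", "atrial fibrillation")
print(classifier.predict([test])[0])
```

In the neural systems discussed above, the same marked sentence would instead be encoded by a BiLSTM or a fine-tuned BERT, with the classifier operating on the pooled representation.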
Dandala, Joopudi, and Devarakonda [48] used a BiLSTM-CRF network to label entities and another BiLSTM network with attention to assign relations, and found that jointly modeling both tasks improved performance from 0.62 to 0.65 F-measure. Christopoulou et al. [41] employed a similar BiLSTM-CRF model for intra-sentence relation extraction, but used a Transformer-based model to capture longer dependencies when modeling inter-sentence relations.

In another relation extraction task involving BioBERT, Alimova and Tutubalina [5] used a random forest classifier with numerous features, such as distance-based features, word-level features, embeddings, and knowledge-base features, for relation extraction framed as a classification task, and compared it to fine-tuned BERT, BioBERT, and ClinicalBERT. They improved previous results by 3.5% in F-measure. In a protein-protein interaction (PPI) detection task, another work fine-tuned BioBERT on PubMed abstracts automatically labeled with PPI types [56]; this general approach allowed them to text-mine 3,253 new typed PPIs from PubMed abstracts.

The goal of event extraction is to detect different events of interest and their properties. Usually, an event has a verb indicating the specific event that connects multiple entities, which makes this task more difficult than relation extraction. A few works have studied models suitable for both relation and event extraction. Björne and Salakoski [22] developed a CNN-based model in which the input is encoded in part using dependency path embeddings; the addition of deep parsing helped model both relation and event arguments. Later, inspired by the Transformer model, ShafieiBavani et al. [202] presented a model incorporating multi-head attention and convolutions to jointly model relation and event extraction. Here, multi-headed attention helps model the dependencies present in event and relation structure. Their model achieved state-of-the-art results on 10 different biomedical information extraction corpora, covering both event and relation extraction tasks.

5.4 Medication Information Extraction

EHRs are a trove of vitally important information pertaining to medications. Medications and prescriptions, as some of the few actionable decision outcomes of a clinical encounter, are critical in healthcare quality analysis and in clinical research, yet they are often recorded in EHRs as free text, limiting access for other applications without pre-processing. Early work relied on rule-based approaches. For example, the MedEx system by Xu et al. [257] applied lookup, regular expressions, and rule-based disambiguation components. MedEx was developed with discharge summaries and was roughly transferable to outpatient clinical visit notes for identifying medication information.

More recent work on this task has focused on expanding accessibility and transitioning toward deep learning techniques. The Clinical Language Annotation, Modeling, and Processing (CLAMP) toolkit by Soysal et al. [213] provides a GUI for end-users to customize NLP pipelines for individual applications. Inspired by this system, Amazon invested heavily in its Amazon Comprehend Medical system [20], which performs NER to extract information about a patient's anatomy, medical conditions, medications (including name, strength, and dosage) and more. It encodes the EHR texts with two LSTMs, then extracts concepts with a decoder. Specifically, the proposed conditional softmax decoder in Amazon Comprehend Medical outperforms the best model (a bidirectional LSTM-CRF) by over 1.5% F1 in negation detection on the 2010 i2b2 challenge [217]. The ease of use of these systems enables users with less technical background to apply NLP techniques to improve clinical outcomes.

Like Amazon Comprehend Medical, Mahajan et al. [147] also used NER in their model for medication dosage extraction. Theirs, however, is the first to automatically compute daily dosage from medication instructions in the form of unstructured text. They used a BERT-based NER system to extract medication-related features such as names, frequencies, administration routes, and dosages, which the model then uses to calculate the daily dosage of each medication.
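The final arithmetic step of such a pipeline is simple once the fields have been extracted. The snippet below is an assumed simplification of that step (the survey does not give Mahajan et al.'s exact formula or field names): daily dose is computed as strength per unit times units per administration times administrations per day.

```python
def daily_dosage_mg(strength_mg, units_per_administration, administrations_per_day):
    """Daily dose = strength per unit x units taken each time x times per day."""
    return strength_mg * units_per_administration * administrations_per_day

# "Metformin 500 mg, take 2 tablets twice daily" -> fields an NER system might extract
extracted = {"strength_mg": 500, "units_per_administration": 2, "administrations_per_day": 2}
print(daily_dosage_mg(**extracted), "mg/day")  # 2000 mg/day
```

The difficulty in practice lies almost entirely in the extraction step: free-text sigs ("BID", "q8h", "1-2 tabs PRN") must first be normalized into these numeric fields.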
6 Generation

In this section, we introduce recent breakthroughs of deep learning models on generation tasks: clinical text generation, summarization, and medical language translation.

6.1 EHR Generation

Natural Language Generation (NLG) is one of the central components of an NLP pipeline. In the context of EHRs, NLG is used to create novel clinical text from existing clinical documents [89]. Generation is extremely important in the medical research domain because of the accessibility and confidentiality constraints on EHRs; making artificially created data available to researchers makes it easier to avoid privacy issues.

RNN and LSTM based seq2seq models are commonly used for generative tasks. In an earlier paper, Lee [122] introduced an RNN-based encoder-decoder framework to generate artificial chief complaints in EHRs. The encoder takes into account many patient variables, such as age, disposition, and diagnosis, and decodes them into a chief complaint in text form. The generated complaints are largely epidemiologically valid and preserve the relationships between diagnoses and chief complaints, and the model has applications in clinical decision support, disease surveillance, and other data-hungry tasks. In a more domain-specific setting, Hoogi et al. [82] focused on generating artificial mammography reports via an LSTM architecture trained on real mammography reports; such models take an image like an X-ray as input and generate a description of what it shows. Their work applied a Turing test in which a radiologist was asked to distinguish between real and generated reports; the radiologist classified real reports correctly 86% of the time and fake ones as real 75% of the time, suggesting that the fake reports are of acceptably high quality. They also showed that augmenting real data with generated data significantly improves performance in a downstream benign/malignant breast tumor classification task, by up to 6% in accuracy.
Similarly, Melamud and Shivade [154] used an LSTM language model to generate a synthetic dataset that mimics the style and content of the original data. They introduced a new notes generation task in which synthetic notes are generated based on real de-identified clinical discharge summaries, and to benchmark it they collected the MedText dataset, which contains about 60k clinical notes extracted from MIMIC-III.

Transformers have also been used for medical text generation. Amin-Nejad et al. [7] introduced a Transformer-based model: using the MIMIC-III database, they model text generation with information about the patient and the ICU stay as input and the textual discharge summary as output. They compare the generation capacity of the vanilla Transformer to that of GPT-2 [184]; evaluation shows that the Transformer achieves a BLEU score of 4.76 and a ROUGE-2 score of 0.3306, while GPT-2 only achieves 0.06 and 0.1350. Moreover, they showed that the augmented dataset generated by the Transformer model beats other baselines on downstream tasks including readmission prediction and phenotype classification.

In addition to seq2seq and Transformer models, Generative Adversarial Networks (GANs) [68] are also widely used for synthetic EHR generation. Choi et al. [39] proposed an approach to synthetic EHR generation via a medical Generative Adversarial Network (medGAN). This model generates the high-dimensional multi-label discrete variables found within an EHR, such as medications, diagnoses, and procedures. Via a combination of an autoencoder and a GAN, the model is trained to learn the distribution of these discrete high-dimensional EHR variables: the autoencoder is trained to project real samples into a low-dimensional space and back into the original feature space, acting as a feature extractor, while the discriminator makes judgments on source data and on generated data passed from the generator and decoder. Except for a few outliers, a trained medical professional found records from medGAN and the source data relatively indistinguishable. Baowaly et al. [11] improved upon medGAN, proposing two variants: medWGAN and medBGAN. The medWGAN model uses the Wasserstein GAN with gradient penalty (WGAN-GP) [72] as the generative network, which employs gradient penalties to overcome sample quality challenges, while medBGAN uses the boundary-seeking GAN (BGAN) [80], in which the generator is trained to create samples on the decision boundary of the discriminator. The synthetic EHR data generated by the three models were compared, and evaluation shows that both proposed models outperform medGAN by up to 0.1 in F1 score, with medBGAN performing best.
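The sketch below shows the adversarial training loop at the heart of these record generators, reduced to its bare essentials: a generator maps noise to multi-hot "records" and a discriminator tries to tell them from real ones. It is a simplified assumption-laden illustration; medGAN additionally generates in the code space of a pretrained autoencoder, and medWGAN/medBGAN swap in different GAN objectives.

```python
import torch
import torch.nn as nn

n_codes, z_dim = 200, 32
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, n_codes), nn.Sigmoid())
D = nn.Sequential(nn.Linear(n_codes, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = (torch.rand(64, n_codes) > 0.97).float()   # toy multi-hot "patient records"
for _ in range(50):
    fake = G(torch.randn(64, z_dim))
    # Discriminator step: real records -> 1, generated records -> 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic_records = (G(torch.randn(8, z_dim)) > 0.5).int()  # thresholded samples
```

Generating through an autoencoder's continuous code space, as medGAN does, is precisely what makes the discrete multi-label structure of EHR variables tractable for GAN training.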
Another typical application of generation is captioning medical images in medical reports. Writing medical reports is a tedious task for experienced radiologists and a challenging one for newer radiologists who do not yet possess the requisite skills and experience. To automate the process of writing reports, the generation model is trained to produce a paragraph-length description of the image that both accurately identifies all abnormalities and generates complex sentences, which are more informative and complicated than typical natural image captions. Li et al. [129] proposed the Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) model, which attempts to mimic the work patterns of radiologists. They took advantage of two types of auxiliary signals: internal fusion features, generated from auxiliary region features and global visual features, and external medical linguistic information, extracted from a large-scale medical textbook corpus. The medical graph and natural language decoders are pretrained with the external auxiliary signals to memorize and phrase medical knowledge, and then trained with the internal signals to support graph encoding that integrates prior medical knowledge with visual and linguistic information. The ASGK model outperforms other state-of-the-art methods in report generation and tag classification on the CX-CHR dataset and a new COVID-19 CT report dataset, achieving a gain of up to 4 BLEU points and up to 8% on the human-evaluation Hit Rate.

To generate more complete and consistent radiology reports, Miura et al. [159] proposed two reward functions to improve text quality: the first encourages systems to generate entities that are consistent with the reference, and the second applies NLI to encourage these entities to be described in inferentially consistent ways. Optimizing these two rewards in a reinforcement learning framework leads to a significant improvement over report generation baselines, with up to a 64.2% gain on selected clinical metrics.

6.2 Summarization

Nurses, doctors and researchers deal with massive EHRs on a daily basis. Text summarization, one of the fundamental tasks in NLP, could reduce their workloads by condensing documents into brief, readable summaries [158]. Common summarization domains include news articles [57, 60, 200], scientific papers [1, 263], and dialogues [266], and recent work has also looked at documents in the EHR domain. We group this work into two categories: extractive summarization and abstractive summarization. Extractive summarization selects short, discrete chunks (words, sentences, or passages) directly from the source text in order to express its most salient content as the summary. Abstractive summarization generates a summary that captures the salient concepts or ideas of the source text in clear language, possibly with novel words or sentences that do not exist in the source text.

6.2.1 Extractive Summarization. Extractive summarization of EHR texts has long been an interesting task. Portet et al. [182] proposed one of the first attempts at EHR extractive summarization. Their model was designed for neonatal intensive care data, which consists of both free text and discrete information (e.g. equipment settings and drug administration). Handling these diverse types of data in one model is challenging, and human evaluators found the model summaries unhelpful; for this reason, most subsequent attempts handled only textual data. Recently, Moradi et al. [161] proposed a graph-based model which ranks sentences by the important biomedical concepts they share. To circumvent the shortage of labeled training data, Liu et al. [139] performed extractive summarization based on the intrinsic correlation between EHRs within a disease group to generate pseudo-labels.

There have been limited attempts at applying deep models to extractive summarization. The work by Alsentzer and Kim [6] was the first to apply neural networks to EHR summarization. They provided an upper bound on extractive summarization of EHR discharge notes and proposed an LSTM model to label topics in history of present illness (HPI) notes with an F1 score of 0.876. They also showed that the model can create a dataset for evaluating extractive summarization methods. Liang et al. [132] further proposed a clinical note processing pipeline and evaluated it on a disease-specific extractive summarization task on clinical notes. For the summarization model, they compared a linear SVM, a linear-chain CRF in which each note is modeled as a sequence, and a simple CNN-based model; the CNN model outperforms the other two by up to 0.12 F1 even with limited labeled data.

Unlike the generic summarization described so far, query-based summarization aims to produce a summary that is relevant to a given query. Applying query-based summarization to EHRs is extremely helpful in practice: a physician may want to conduct a quick search for relevant medical information, and a summarized version of the EHR can make that search efficient. McInerney et al. [151] designed a query-focused extractive summarization model that selects the sentences most relevant to a potential diagnosis. Because no large corpus of EHRs with extractive summaries exists, they used a distant supervision framework that extracts ICD diagnosis codes from future visits, and they trained a Transformer-based neural network to select summary sentences from an EHR and use them to predict future diagnoses, formulating the task as sentence classification and reporting precision, recall and F1. Another model that produces extractive summaries based on queries, by Molla et al. [160], directly compares the query sentence with each candidate sentence from the EHR. They applied different encoding methods, including BERT, BioBERT and various deep architectures (non-pre-trained, pre-trained and Siamese networks), but found that applying BioBERT did not improve the results. A strong benefit of this system is that it can perform multiple-document summarization rather than being limited to a single document.
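As a point of reference for these extractive systems, the following minimal sketch scores each sentence of a toy discharge note by its TF-IDF cosine similarity to the document centroid and keeps the top two. The note and the scoring rule are our own illustrative assumptions; the cited models replace this heuristic with concept graphs, supervised classifiers, or query-conditioned Transformers.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

note = [
    "Patient admitted with shortness of breath and bilateral leg swelling.",
    "Past medical history includes hypertension and type 2 diabetes.",
    "Echocardiogram showed reduced ejection fraction consistent with heart failure.",
    "Diet was advanced as tolerated.",
    "Discharged on furosemide and lisinopril with cardiology follow-up.",
]

vectorizer = TfidfVectorizer(stop_words="english")
sentence_vectors = vectorizer.fit_transform(note)
centroid = np.asarray(sentence_vectors.mean(axis=0))
scores = cosine_similarity(centroid, sentence_vectors).ravel()

top = sorted(range(len(note)), key=lambda i: -scores[i])[:2]
summary = [note[i] for i in sorted(top)]  # keep the original sentence order
print(summary)
```

Even this crude centroid heuristic illustrates the central design question in EHR extractive summarization: which signal (concept overlap, future diagnoses, a physician query) should determine sentence salience.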
6.2.2 Abstractive Summarization. Recent research on abstractive summarization has focused on applying seq2seq and similar models to summarize radiology reports. The first work to apply seq2seq to automate the generation of radiology impressions was proposed by Zhang et al. [276]; the model learns to encode the "Background" section of the report to guide decoding of the abstractive summary. Later, [146] replaced the seq2seq model with a pointer-generator model [200], yielding up to 10 points of ROUGE-1 improvement and 9 points of ROUGE-2 improvement over traditional LSA [214] and LexRank [57] models.

Figure 9. Example result of Zhang's RL-based abstractive summarizer [278], which is capable of achieving near-human performance.

Reinforcement learning (RL) has also been used, in the model proposed by Zhang et al. [278]. They optimize an RL objective that balances the model summary's factual accuracy, linguistic likelihood, and overlap with the target summary. A factual correctness score is computed by comparing the model summary against the output of the CheXpert labeler [90], which extracts fact variables from the source radiology report. The results show that, with a pointer-generator model [200] as the baseline, applying reinforcement learning leads to an improvement of roughly 3-4 ROUGE points.
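To illustrate how such a mixed reward can be assembled, the sketch below combines a unigram-overlap stand-in for ROUGE with a toy factual-agreement score over labeled findings. The weighting, the label dictionaries, and the helper functions are all our own assumptions, not Zhang et al.'s formulation; in their system the findings come from the CheXpert labeler and the reward also includes a likelihood term.

```python
def rouge1_f(candidate, reference):
    """Unigram-overlap F1, a lightweight stand-in for the full ROUGE package."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def factual_score(candidate_facts, reference_facts):
    """Fraction of labeled findings (e.g. from a CheXpert-style labeler) that agree."""
    agree = sum(candidate_facts.get(k) == v for k, v in reference_facts.items())
    return agree / max(len(reference_facts), 1)

def reward(candidate, reference, candidate_facts, reference_facts, lam=0.5):
    # Weighted mix of textual overlap and factual correctness used as the RL reward.
    return (lam * rouge1_f(candidate, reference)
            + (1 - lam) * factual_score(candidate_facts, reference_facts))

reference = "no acute cardiopulmonary abnormality"
candidate = "no acute abnormality seen"
print(reward(candidate, reference,
             {"edema": 0, "pneumonia": 0}, {"edema": 0, "pneumonia": 0}))
```

The appeal of this formulation is that factual consistency, which standard likelihood training ignores, enters the objective directly through the reward.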
A separate area of research is concerned with summarization of questions and their answers, and a few benchmark datasets have been introduced for this topic. Abacha and Demner-Fushman [16] introduced a corpus of summarized consumer health questions and used it to train an effective pointer-generator network for abstractive summarization. Similarly, Savery et al. [197] developed a dataset of common consumer health questions, their answers, and summaries of their answers. To benchmark this dataset, they evaluated state-of-the-art summarization models including a BiLSTM model, pointer-generator networks [200] and BART [125].

6.3 Medical Language Translation

The understanding of EHRs is limited for most readers other than professionals, since they include esoteric medical terms and abbreviations and exhibit a unique structure and writing style. Medical language translation is the task of converting medical texts into a style that is more easily understandable by laypeople; for example, a term like "peripheral edema" may be replaced with or tagged as "ankle swelling." Only a limited amount of research has focused on EHR simplification. Weng et al. [248] performed unsupervised text simplification for clinical notes; using unsupervised methods helps circumvent the shortage of texts that have been manually annotated with simplified versions. They use skip-gram embeddings learned from two clinical corpora: MIMIC-III, which includes a large amount of medical jargon, and MedlinePlus,4 which is oriented towards the layperson. A Bilingual Dictionary Induction model is used to align these embeddings of technical and simpler terms and to initialize a denoising autoencoder. This autoencoder takes a physician-written sentence, generates a simpler translation with a language model, and uses back-translation to reconstruct the original sentence. On the supervised end of the spectrum, Luo et al. [143] introduced the MedLane dataset, a human-annotated medical language translation dataset which aligns professional medical sentences with layperson-understandable expressions; it includes 12,801/1,015/1,016 samples for training, validation, and testing, respectively. They also proposed the PMBERT-MT model, which takes the pre-trained PubMedBERT [71] and conducts translation training using MedLane.

4 https://medlineplus.gov/

7 Other Topics

Having covered some of the main NLP tasks related to EHRs, in this section we discuss other topics that receive less attention but are still important to EHRs and other relevant domains. These topics include question answering, phenotyping, medical dialogues, multilinguality, interpretability, and finally, applications in public health.

7.1 Question Answering

Question answering (QA) is the task of interpreting natural language questions and retrieving appropriately paired answers [117]. Open-domain QA systems have had recent success with pre-trained language models [102], but these results have not carried over to biomedical QA because of the domain-specific challenges it faces. The primary reason for the lower performance of biomedical QA in EHRs is that models trained on open-domain corpora have difficulty understanding the hyper-specific technical vocabulary often used in clinical settings.

While the main difficulty is still the limitation of large-scale training data, pre-trained models play an important role. BioBERT [121], a pre-trained biomedical language model trained on PubMed articles, has been successfully adapted for QA tasks. For successful domain adaptation, pre-trained language models need to be fine-tuned on biomedical QA datasets; however, biomedical QA datasets are often very small (e.g., just a few thousand samples), and creating new datasets is cost-prohibitive. So, Lee et al. [225] first fine-tune BioBERT on large-scale general-domain extractive QA datasets and then fine-tune it on the biomedical BioASQ dataset. Using this transfer learning framework, they significantly outperform basic BERT [53] and other state-of-the-art models on the QA task as well as on other biomedical NLP tasks, overcoming challenges in both range of vocabulary and dataset size. Vilares et al. [231] recognized the same problem with limited datasets and focused on multiple-choice QA, which requires knowledge and reasoning in complex domains. They introduced the HEAD-QA (Healthcare Dataset for Complex Reasoning) dataset, created from the Spanish government's annual specialized healthcare exams, and evaluated an information retrieval model on the Spanish HEAD-QA as well as a translated English version (HEAD-QA-EN). The cross-lingual model performed better, with an average accuracy of 34.6 vs. 32.9 across sub-domains in unsupervised settings and 37.2 vs. 35.2 in supervised settings. Being able to train with cross-lingual datasets, or potentially to automatically translate question-answer pairs, could open the door to improvements in multilingual QA, easing dataset limitations by expanding the possible source material.

Figure 10. A few samples from the HEAD-QA dataset [231], a multiple-choice question answering healthcare dataset.

It is also possible to transform the QA task into other tasks, including information extraction, question similarity, information retrieval, and recognizing question entailment. Recently, Selvaraj et al. [201] applied a QA method to extract the medication regimen (dosage and frequency for medications) discussed in a medical conversation. They formulate the Medication Regimen task as a QA task and generate questions using templates such as "What is the ⟨dosage/frequency⟩ for ⟨medication⟩?" They used an abstractive QA model based on pointer-generator networks [200] with a co-attention encoder, achieving ROUGE-1 scores averaging 86.40.

Question similarity and question entailment have been promising paths toward solving the biomedical QA task. Many medical questions asked online resemble already-answered questions, so the goal of question entailment is to map new questions to similar answered ones. In a work by McCreery et al. [150], an in-domain semi-supervised approach was proposed and tested on 3,000 medical question pairs. They pretrained BERT on the HealthTap dataset5 and double fine-tuned it, first on either Quora question similarity (QQP) or medical answer completion, and then on medical question pairs. The model tuned with medical answer completion reached a higher question-similarity accuracy than the QQP model, with an average difference in performance of 3.7%.

5 https://github.com/durakkerem/Medical-Question-Answer-Datasets

Figure 11. The double fine-tune method from a pre-trained BERT through an intermediate task to the medical question-similarity task, for two different intermediate tasks: Quora question-question pairs (left) and medical question-answer pairs (right) [150].

Biomedical QA can also be approached by combining IR models with recognizing question entailment (RQE) methods. RQE is a task similar to natural language inference: it also extracts meaning from sentences, but it tries to create relevant relaxations of contextual and semantic constraints, such that specific questions can be related to more general, already-answered questions. A work by Abacha et al. [17] attempted two approaches to RQE: a neural network and a logistic regression classifier. The neural network performed best on the general-domain NLI datasets, with an average accuracy of 80.66% on textual datasets and 83.62% on question datasets, but logistic regression achieved higher accuracy on the domain-specific datasets, at 98.6%. The domain-specific dataset consisted specifically of consumer-health questions, which are more applicable to general medical QA use.

Other attempts, including the application of paraphrasing, have been proposed to improve QA systems for EHRs [211, 212]. A representative work by Soni et al. [211] collected 10,578 unique questions via crowdsourcing and then used a deep model consisting of a variational autoencoder and an LSTM model [74] to create an automated clinical paraphrasing system. This model seeks to generate paraphrases for EHR questions without using any external resource. The paraphrases were evaluated on several metrics, achieving a BLEU score of 13.25, a METEOR score of 21.47, and a TER score of 91.93.
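The core retrieval step behind question entailment and question similarity can be illustrated with a deliberately naive baseline: match a new question against an FAQ of already-answered ones by string similarity. The FAQ contents below are invented for the example, and the string matcher is only a stand-in for the fine-tuned BERT encoders used in the work above.

```python
import difflib

answered = {
    "What are the side effects of metformin?": "Common side effects include ...",
    "How is high blood pressure treated?": "First-line treatments include ...",
    "What does an elevated troponin mean?": "Elevated troponin suggests ...",
}

def retrieve(new_question, faq, top_k=1):
    """Map a new question to the closest already-answered one by string similarity."""
    scored = [(difflib.SequenceMatcher(None, new_question.lower(), q.lower()).ratio(), q)
              for q in faq]
    return sorted(scored, reverse=True)[:top_k]

print(retrieve("what are metformin side effects", answered))
```

Surface matching of this kind breaks down as soon as the new question is phrased with different vocabulary, which is exactly the gap that double fine-tuned similarity models and RQE classifiers are meant to close.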
a tensor factorization method with limited human supervi- Biomedical QA can also be solved by combining IR models sion, improving on classic dimensionality reduction tech- with recognizing question entailment (RQE) methods. RQE niques. Granite is a robust Poisson Nonnegative tensor fac- is a task similar to natural language inference. RQE also ex- torization model (NNTF) that encourages diverse and sparse tracts meaning from sentences, however it tries to create latent factors. It introduces angular penalty and an L2 reg- relevant relaxations of contextual and semantic constraints, ularization term, reduces overlap between factors, and also introduces simplex projection on factors which results in 5https://github.com/durakkerem/Medical-Question-Answer-Datasets better sparsity control. Empirical work shows Granite yields Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review phenotypes with more distinct elements, with cosine score Personal health coaches are useful but inaccessible for lower- between 0 and 0.4, where comparables were significantly income patients, so Gupta et al. [75] developed an automatic more spread out, and is better than previous tensor factoring health coach that sends text messages to patients. This health methods at capturing rare phenotypes, as most phenotypes coach communicates important medical information, sets captured only small parts of the population (e.g. 110 v.s. 4 concrete goals, and encourages the patient to adhere to them. on the average number of non-zero entries of diagnosis and Another model, from Campillos-Llanos et al. [141], learns to medication modes per phenotype) . simulate a patient, instead of the medical professional. They Chiu et al. [33] introduced bulk learning to the infectious developed a “virtual patient” dialogue system with which a disease domain, which uses a small dataset to simultane- physician can practice clinical interactions, and found that ously train and evaluate a large amount of phenotypes. This the model produced correct replies in 74.3% of interactions method uses diagnostic codes as surrogate labels and trains with doctors. an intermediate model based on feature abstractions. These abstractions capture common clinical concepts among mul- tiple clinical conditions. Each disease can then be labeled 7.4 Multilinguality by multiple clinical concepts. The training stage consists of There has been substantial recent interest in processing non- three parts. First, base classifiers are trained to predict la- English medical texts, which are generally less available than bels of the set of the infectious diseases. Next, predictors are ones written in English [171]. aggregated through a meta-classifier, conducting a feature Roller et al. [191] proposed a cross-lingual sequential abstraction step which describes the extent of effect that a search for candidate concepts for biomedical concept nor- base model has on the prediction on a disease. In the final malization. The main component of the model is a neural stage, a small subset of disease cases are collected to produce machine translation network trained on UMLS for Spanish, an annotation set. In effect, bulk learning serves to separate French, Dutch and German. The proposed model performs disease batches while using less data annotations. ICD9 cod- similarly to commercial translators (such as Google and Bing) ing evaluation shows that this method achieves a mean AUC on these four languages. Perez et al. 
Chiu et al. [33] introduced bulk learning to the infectious disease domain, using a small dataset to simultaneously train and evaluate a large number of phenotypes. This method uses diagnostic codes as surrogate labels and trains an intermediate model based on feature abstractions, which capture common clinical concepts among multiple clinical conditions so that each disease can be labeled by multiple clinical concepts. The training stage consists of three parts. First, base classifiers are trained to predict labels for the set of infectious diseases. Next, the predictors are aggregated through a meta-classifier, conducting a feature abstraction step that describes the extent of a base model's effect on the prediction for a disease. In the final stage, a small subset of disease cases is collected to produce an annotation set. In effect, bulk learning separates disease batches while using fewer data annotations. ICD-9 coding evaluation shows that this method achieves a mean AUC of 0.83, surpassing the base classifiers by 0.05-0.4.

7.3 Medical Dialogues

Recent natural language understanding (NLU) research on doctor-patient dialogues has large potential implications. The two primary applications are automatic scribing and automatic health coaching.

Automatic scribing is valuable because physicians today spend hours on administrative tasks like filling in information for electronic health records. The simple solution is to hire a medical scribe, a person who takes notes about the patient-physician encounter to reduce the physician's administrative burden and to ensure that documentation in the EHR is accurate and up-to-date; however, scribes are an expensive solution [23]. To address this, NLP researchers are trying to automatically generate clinical notes from medical dialogue. One model, AutoScribe [105], automatically parses the dialogue for entities such as medications, symptoms, times, dates, referrals, and diagnoses, and then uses this information to generate a patient note. AutoScribe was evaluated on a set of 800 audio patient-clinician dialogues and transcripts, achieving a 0.71 F1 score on this data. While AutoScribe produced strong results, the researchers noted that it could be improved by extracting more entities and training on more dialogues.

Automatic health coaching tries to generate conversational dialogues on top of transcription and interpretation. For simple questions that do not require complex or nuanced guidance, a health coach can be a cost-effective solution. Personal health coaches are useful but inaccessible for lower-income patients, so Gupta et al. [75] developed an automatic health coach that sends text messages to patients, communicating important medical information, setting concrete goals, and encouraging the patient to adhere to them. Another model, from Campillos-Llanos et al. [141], learns to simulate a patient instead of the medical professional: they developed a "virtual patient" dialogue system with which a physician can practice clinical interactions, and found that the model produced correct replies in 74.3% of interactions with doctors.

7.4 Multilinguality

There has been substantial recent interest in processing non-English medical texts, which are generally less available than texts written in English [171].

Roller et al. [191] proposed a cross-lingual sequential search for candidate concepts for biomedical concept normalization. The main component of the model is a neural machine translation network trained on UMLS for Spanish, French, Dutch and German; the proposed model performs similarly to commercial translators (such as Google and Bing) on these four languages. Perez et al. [177] compared the effectiveness of three approaches to automatic annotation of biomedical texts in Spanish: information retrieval and concept disambiguation, machine translation (annotating in English and translating into Spanish), and a hybrid of the two. The hybrid approach performed best, achieving an average F1 score of 0.632.

Neural Clinical Paraphrase Generation (NCPG) is a model that casts clinical paraphrasing as a monolingual neural machine translation problem [77]. Using a character-level, attention-based bidirectional RNN in an encoder-decoder framework analogous to NMT efforts, NCPG outperforms a baseline word-level RNN encoder-decoder model at the word level, with a BLEU score of 24.0 vs. 18.8, though results are more mixed at the character level, with BLEU scores of 30.1 vs. 31.3. Models were evaluated on a constructed dataset which combined the Paraphrase Database (PPDB) 2.0 [174] and UMLS to build a clinically-oriented parallel paraphrase corpus. The authors additionally show that the character-level NCPG model is superior to word-level methods, with BLEU scores of 31.3 vs. 18.8 against the non-attention baseline and 30.1 vs. 24.0 against the attention model, likely because it tackles the out-of-vocabulary problem directly.

Multilingual BERT models are also being investigated for EHR tasks. Vunikili et al. [237] studied BERT-based embeddings trained on general-domain Spanish text for tumor morphology extraction in Spanish clinical reports; the model achieves an F1 score of 71.3% on the NER task without any feature engineering or rule-based methods. Silvestri et al. [206] investigated the Cross-lingual Language Model (XLM) [42] by fine-tuning on English training data and testing performance on ICD-10 code classification in an Italian dataset of short medical notes. The XLM model shows an improvement over mBERT6 on this task, achieving an accuracy of 0.983 compared to 0.968.

6 The multilingual version of BERT: https://github.com/google-research/bert/blob/master/multilingual.md
7.5 Interpretability

Many machine learning models, such as random forests and bootstrapping ensembles, operate like black boxes: they are not directly interpretable. However, efforts have been made to improve interpretability so that these models are more helpful in practice.

One major challenge comes from the trade-off between interpretability and performance. Caruana et al. [25] applied high-performance generalized additive models with pairwise interactions to pneumonia risk prediction, and were able to reveal clinically relevant patterns while preserving state-of-the-art performance. The model also provides a simple mechanism for correcting discovered patterns and confounding factors, as with asthma, chronic lung disease, and chest pain history in a pneumonia case study. A work by Choi et al. [37] proposed the Reverse Time Attention Model (RETAIN), which improves interpretability by using a two-layer attention model that gives the most recent visits higher attention. Evaluation on heart failure prediction shows that it achieves an AUC of 0.87, comparable to state-of-the-art deep learning methods such as RNNs. Another approach to increasing interpretability was suggested by Che et al. [29], who introduced a knowledge-distillation approach called interpretable mimic learning. The model uses gradient boosting trees to learn interpretable features from existing deep learning models, including an LSTM and a stacked denoising autoencoder. Evaluated on an ICU dataset for acute lung injury, it achieves similar or better performance than the deep models, with AUC averaging 0.760 vs. 0.756.
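The distillation pattern behind interpretable mimic learning can be sketched in a few lines: a "student" gradient boosting model regresses on the soft predictions of a "teacher" network, after which its feature importances can be inspected. Everything below is illustrative, with synthetic data and an MLP standing in for the LSTM/stacked denoising autoencoder teachers used in the cited work.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Teacher: a stand-in deep model trained on the original labels.
teacher = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(X, y)
soft_labels = teacher.predict_proba(X)[:, 1]        # the "knowledge" to distill

# Student: an interpretable gradient boosting model fit to the teacher's soft predictions.
student = GradientBoostingRegressor(random_state=0).fit(X, soft_labels)
ranking = np.argsort(student.feature_importances_)[::-1][:5]
print("most influential features:", ranking.tolist())
```

The appeal of this design is that the student can approach the teacher's accuracy while exposing feature importances and partial dependencies that clinicians can scrutinize.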
7.6 Applications in Public Health

The COVID-19 pandemic, as a public health crisis, is impacting people's lives in multiple ways. It has also caused an information crisis, amplified by the Internet and related technologies: it is estimated that more than 23k published papers were indexed on Web of Science and Scopus between January 1 and June 30, 2020 alone [46]. In the NLP domain, researchers are focusing on processing pandemic-generated data and working on tasks including text classification, information retrieval, named entity recognition and knowledge discovery [32, 127, 242]. While there are many relevant papers, we list here some representative works that apply NLP techniques to COVID-19 data, limiting our focus to resource works.

TREC-COVID [189, 234] is an information retrieval shared task to promote and support research related to the pandemic. Among the participants, MacAvaney et al. [145] introduced a zero-shot SciBERT-based ranking algorithm for COVID-related scientific literature. Bendersky et al. [19] presented a weighted hierarchical rank fusion approach that combines results from lexical and semantic retrieval systems, pretrained and fine-tuned BERT rankers, and relevance feedback runs [269]; they achieved state-of-the-art performance in rounds 4 and 5 of the challenge, with a 9.2% mean average precision gain over the next-best team.

The COVID-19 Open Research Dataset (CORD-19) corpus [239] is a resource collection of scientific papers on COVID-19 and coronavirus research, drawing on PubMed PMC, bioRxiv and medRxiv search results using 'COVID-19 and coronavirus research' as the query, in addition to the WHO corpus of COVID-19 research papers. The initial release contained 28K papers, and the corpus has been updated frequently. This dataset promotes a number of research directions, including text classification [133], information extraction [243], knowledge graphs [3] and more.

Finally, COVID-KG [240] is a knowledge discovery framework focused on extracting multimedia knowledge elements from 25,534 peer-reviewed papers. In COVID-KG, nodes are entities/concepts and edges are relations and events among these entities; the edges are extracted from both images and text, and the knowledge graph contains multiple types of entities and links. After construction, it supports a number of COVID- and drug-related tasks, including question answering and report generation.7

7 http://blender.cs.illinois.edu/covid19/visualization.html

8 Conclusion and Future Direction

In this survey, we reviewed recent studies that show how EHR and health informatics tasks can benefit from deep NLP models. Though some of these papers partially deal with structured data, our focus was on unstructured text data for downstream EHR tasks. More specifically, we summarized recent work on the following EHR-NLP tasks: classification and prediction, representation learning, extraction, and generation, as well as other topics such as question answering, phenotyping, knowledge graphs, multilinguality, medical dialogues and applications in public health. We also list relevant datasets and existing tools to promote EHR-NLP research.

Though deep learning methods in the general NLP domain have achieved remarkable success, applying them to the biomedical field remains challenging due to the limited availability and difficulty of domain-specific textual data and the lack of interpretability of deep learning methods. One future direction in this field is better mining of knowledge and information from unstructured data [59], together with a useful combination of structured and unstructured data for better decision making and potential interpretability. Another direction could be employing transfer learning or unsupervised learning for EHR tasks to compensate for the dearth of annotated textual data. We hope that our survey will inspire readers and promote future developments in NLP for electronic health records.

References

[1] Amjad Abu-Jbara and Dragomir R. Radev. 2011. Coherent Citation-Based Summarization of Scientific Papers. In ACL 2011, Dekang Lin, Yuji Matsumoto, and Rada Mihalcea (Eds.). The Association for Computer Linguistics, Portland, Oregon, 500-509. https://www.aclweb.org/anthology/P11-1051/
[2] Griffin Adams, Mert Ketenci, Shreyas Bhave, Adler Perotte, and Noémie Elhadad. 2020. Zero-Shot Clinical Acronym Expansion via Latent Meaning Cells. https://arxiv.org/pdf/2010.02010
[3] Sabber Ahamed and Manar D. Samad. 2020. Information Mining for COVID-19 Research From a Large Volume of Scientific Literature. arXiv:2004.02085 https://arxiv.org/abs/2004.02085
[4] Ahmad Al-Aiad, Rehab Duwairi, and Manar Fraihat. 2018. Survey: Deep Learning Concepts and Techniques for Electronic Health Record. In 15th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2018, October 28 - Nov. 1, 2018. IEEE Computer Society, Aqaba, Jordan, 1-55. https://doi.org/10.1109/AICCSA.2018.8612827
[5] Ilseyar Alimova and Elena Tutubalina. 2020. Multiple features for clinical relation extraction: A machine learning approach. Journal of Biomedical Informatics 103 (2020), 103382. https://doi.org/10.1016/j.jbi.2020.103382
[6] Emily Alsentzer and Anne Kim. 2018. Extractive summarization of EHR discharge notes.
[7] Ali Amin-Nejad, Julia Ive, and Sumithra Velupillai. 2020. Exploring Transformer Text Generation for Medical Dataset Augmentation. In Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4699-4708. https://www.aclweb.org/anthology/2020.lrec-1.578
[8] Emilia Apostolova, D. Channin, Dina Demner-Fushman, Jacob D. Furst, S. Lytinen, and D. Raicu. 2009. Automatic segmentation of clinical texts. 5905-5908. https://doi.org/10.1109/IEMBS.2009.5334831
[9] Michela Assale, L. G. Dui, Andrea Cina, Andrea Seveso, and F. Cabitza. 2019. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Frontiers in Medicine 6 (2019), 66.
[10] Pinkesh Badjatiya, Litton J. Kurisinkel, Manish Gupta, and Vasudeva Varma. 2018. Attention-based Neural Text Segmentation. CoRR abs/1808.09935 (2018), 180-193. arXiv:1808.09935 http://arxiv.org/abs/1808.09935
[11] Mrinal Kanti Baowaly, Chia-Ching Lin, Chao-Lin Liu, and Kuan-Ta Chen. 2019. Synthesizing electronic health records using improved generative adversarial networks. J Am Med Inform Assoc 26, 3 (2019), 228-241. https://pubmed.ncbi.nlm.nih.gov/30535151/
[12] Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noémie Elhadad. 2017. Multi-Label Classification of Patient Notes: a Case Study on ICD Code Assignment. arXiv:1709.09587
[13] Andrew Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, G. Weber, Nathan Palmer, X. Shi, T. Cai, and I. Kohane. 2020. Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data. Pacific Symposium on Biocomputing 25 (2020), 295-306.
[14] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In EMNLP-IJCNLP 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 3613-3618. https://doi.org/10.18653/v1/D19-1371
[15] Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150 https://arxiv.org/abs/2004.05150
[16] Asma Ben Abacha and Dina Demner-Fushman. 2019. On the Summarization of Consumer Health Questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2228-2234. https://doi.org/10.18653/v1/P19-1215
[17] Asma Ben Abacha and Dina Demner-Fushman. 2019. A question-entailment approach to question answering. BMC Bioinformatics 20, 1 (Oct 2019), 511.
[18] Asma Ben Abacha and Dina Demner-Fushman. 2019. A Question-Entailment Approach to Question Answering. arXiv:1901.08079 [cs.CL] https://arxiv.org/abs/1901.08079
[19] Michael Bendersky, Honglei Zhuang, Ji Ma, Shuguang Han, Keith Hall, and Ryan McDonald. 2020. RRF102: Meeting the TREC-COVID challenge with a 100+ runs ensemble.
[20] Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia, and Selvan Senthivel. 2019. Comprehend Medical: A Named Entity Recognition and Relationship Extraction Web Service. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, New York, USA, 1844-1851. https://doi.org/10.1109/ICMLA.2019.00297
[21] Daniel Biś, Canlin Zhang, Xiuwen Liu, and Zhe He. 2018. Layered Multistep Bidirectional Long Short-Term Memory Networks for Biomedical Word Sense Disambiguation. 313-320.
[22] Jari Björne and Tapio Salakoski. 2018. Biomedical Event Extraction Using Convolutional Neural Networks and Dependency Parsing. In Proceedings of the BioNLP 2018 workshop. Association for Computational Linguistics, Melbourne, Australia, 98-108. https://doi.org/10.18653/v1/W18-2311
[23] Kevin Brady and Afser Shariff. 2013. Virtual medical scribes: making electronic medical records work for you. The Journal of Medical Practice Management: MPM 29, 2 (2013), 133.
[24] Rebecca F. Bruce and Janyce Wiebe. 1994. Word-Sense Disambiguation Using Decomposable Models. 139-146. https://doi.org/10.3115/981732.981752
[25] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. 1721-1730. https://doi.org/10.1145/2783258.2788613
[26] Raghavendra Chalapathy, Ehsan Zare Borzeshi, and Massimo Piccardi. 2016. Bidirectional LSTM-CRF for Clinical Concept Extraction. In Proceedings of the Clinical Natural Language Processing Workshop, ClinicalNLP@COLING 2016, Osaka, Japan, December 11, 2016, Anna Rumshisky, Kirk Roberts, Steven Bethard, and Tristan Naumann (Eds.). The COLING 2016 Organizing Committee, Osaka, Japan, 7-12. https://www.aclweb.org/anthology/W16-4202/
[27] Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics, Prague, Czech Republic, 33-40. https://www.aclweb.org/anthology/P07-1005
[28] David Chang, Woo Suk Hong, and Richard Andrew Taylor. 2020. Generating contextual embeddings for emergency department chief complaints. JAMIA Open 3, 2 (2020), 160-166.
[29] Zhengping Che, Sanjay Purushotham, Robinder G. Khemani, and Yan Liu. 2015. Distilling Knowledge from Deep Networks with Applications to Healthcare Domain. arXiv:1512.03542 http://arxiv.org/abs/1512.03542
[30] Zhengping Che, Sanjay Purushotham, Robinder G. Khemani, and Yan Liu. 2016. Interpretable Deep Models for ICU Outcome Prediction. http://knowledge.amia.org/amia-63300-1.3360278/t004-1.3364525/f004-1.3364526/2500209-1.3364981/2493688-1.3364976
[31] Irene Y Chen, Shalmali Joshi, Marzyeh Ghassemi, and Rajesh Ranganath. 2020. Probabilistic Machine Learning for Healthcare.
[32] Qingyu Chen, Robert Leaman, Alexis Allot, Ling Luo, Chih-Hsuan Wei, Shankai Yan, and Zhiyong Lu. 2020. Artificial Intelligence (AI) in Action: Addressing the COVID-19 Pandemic with Natural Language Processing (NLP).
[33] Po-Hsiang Chiu and George Hripcsak. 2017. EHR-based phenotyping: Bulk learning and evaluation. Journal of Biomedical Informatics 70 (June 2017), 35-51.
[34] Hyejin Cho and Hyunju Lee. 2019. Biomedical named entity recognition using deep neural networks with contextual information. BMC Bioinformatics 20, 1 (2019), 735. https://doi.org/10.1186/s12859-019-3321-4
[35] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. 1724-1734. https://doi.org/10.3115/v1/d14-1179
[36] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. 301-318. http://proceedings.mlr.press/v56/Choi16.html
[37] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. RETAIN: Interpretable Predictive Model in Healthcare using Reverse Time Attention Mechanism.
[44] Alister D Costa, Stefan Denkovski, Michal Malyska, Sae Young Moon, Brandon Rufino, Zhen Yang, Taylor Killian, and Marzyeh Ghassemi. 2020. Multiple Sclerosis Severity Classification From Clinical Text. arXiv:2010.15316
[45] Martin R Cowie, Juuso I Blomster, Lesley H Curtis, Sylvie Duclaux, Ian Ford, Fleur Fritz, Samantha Goldman, Salim Janmohamed, Jörg Kreuzer, Mark Leenay, et al. 2017. Electronic health records to facilitate clinical research. Clinical Research in Cardiology 106, 1 (2017), 1-9.
[46] Jaime A. Teixeira da Silva, Panagiotis Tsigaris, and Mohammadamin Erfanmanesh. 2021. Publishing volumes in major databases related to Covid-19. Scientometrics 126, 1 (2021), 831-842. https://doi.org/10.1007/s11192-020-03675-3
[47] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 2978-2988. https://doi.org/10.18653/v1/p19-1285
[48] Bharath Dandala, Venkata Joopudi, and Murthy Devarakonda. 2019. Adverse Drug Events Detection in Clinical Notes by Jointly Modeling Entities and Relations Using Neural Networks. Drug Safety 42 (01 2019). https://doi.org/10.1007/s40264-018-0764-x
[49] Sajad Darabi, Mohammad Kachuee, Shayan Fazeli, and Majid Sarrafzadeh. 2020. TAPER: Time-Aware Patient EHR Representation.
[50] Surabhi Datta and Kirk Roberts. 2020. A Hybrid Deep Learning Approach for Spatial Trigger Extraction from Radiology Reports.
In arXiv:1608.05745 http://arxiv.org/abs/1608.05745 Proceedings of the Third International Workshop on Spatial Language [38] Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Cather- Understanding. Association for Computational Linguistics, Online, ine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and 50–55. https://doi.org/10.18653/v1/2020.splu-1.6 Jimeng Sun. 2016. Multi-layer Representation Learning for Medi- [51] Spiros C Denaxas and Katherine I Morley. 2015. Big biomedical data cal Concepts. , 1495–1504 pages. https://doi.org/10.1145/2939672. and research: opportunities and challenges. 2939823 European Heart Journal-Quality of Care and Clinical Outcomes 1, 1 [39] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Wal- (2015), 9–16. ter F. Stewart, and Jimeng Sun. 2018. Generating Multi-label [52] Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, and Peter Szolovits. Discrete Patient Records using Generative Adversarial Networks. 2017. De-identification of patient notes with recurrent neural net- arXiv:1703.06490 [cs.LG] works. Journal of the American Medical Informatics Association 24, 3 [40] Edward Choi, Cao Xiao, Walter Stewart, and Jimeng Sun. 2018. MiME: (2017), 596–606. Multilevel Medical Embedding of Electronic Health Records for [53] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Predictive Healthcare. In Advances in Neural Information Processing 2019. BERT: Pre-training of Deep Bidirectional Transformers for Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, Language Understanding. In Proceedings of the 2019 Conference of N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., Mon- the North American Chapter of the Association for Computational treal, Canada, 4547–4557. http://papers.nips.cc/paper/7706-mime- Linguistics: Human Language Technologies, NAACL-HLT 2019, Min- multilevel-medical-embedding-of-electronic-health-records-for- neapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), predictive-healthcare.pdf Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association [41] Fenia Christopoulou, Thy Thy Tran, Sunil Kumar Sahu, Makoto for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. Miwa, and Sophia Ananiadou. 2019. Adverse drug events and https://doi.org/10.18653/v1/n19-1423 medication relation extraction in electronic health records with [54] Dmitriy Dligach and Timothy A. Miller. 2018. Learning Patient Rep- ensemble deep learning methods. Journal of the American Medi- resentations from Text. , 119–123 pages. https://doi.org/10.18653/v1/ cal Informatics Association 27, 1 (08 2019), 39–46. https://doi.org/ s18-2014 10.1093/jamia/ocz101 arXiv:https://academic.oup.com/jamia/article- [55] Nan Du, Kai Chen, Anjuli Kannan, Linh Tran, Yuhui Chen, and Izhak pdf/27/1/39/34152750/ocz101.pdf Shafran. 2019. Extracting Symptoms and their Status from Clinical [42] Alexis CONNEAU and Guillaume Lample. 2019. Cross-lingual Lan- Conversations. In Proceedings of the 57th Annual Meeting of the Asso- guage Model Pretraining. In Advances in Neural Information Process- ciation for Computational Linguistics. Association for Computational ing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, Linguistics, Florence, Italy, 915–925. https://doi.org/10.18653/v1/P19- E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., Vancou- 1087 ver, Canada, 7059–7069. https://proceedings.neurips.cc/paper/2019/ [56] Aparna Elangovan, Melissa J. Davis, and Karin Verspoor. 2020. 
As- file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf signing function to protein-protein interactions: a weakly supervised [43] HIT Consultant. 2015. Why unstructured data holds the key to intel- BioBERT based approach using PubMed abstracts. arXiv:2008.08727 ligent healthcare systems [Internet]. Atlanta (GA): HIT Consultant; https://arxiv.org/abs/2008.08727 2015. cited at 2019 Jan 15. [57] Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lex- ical centrality as salience in text summarization. Journal of artificial Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review

intelligence research 22 (2004), 457–479. 2020. Domain-Specific Language Model Pretraining for Biomedical [58] Jean-Baptiste Escudié, Alaa Saade, Alice Coucke, and Marc Lelarge. Natural Language Processing. arXiv:2007.15779 https://arxiv.org/ 2018. Deep representation for patient visits from electronic health abs/2007.15779 records. [72] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Du- [59] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr moulin, and Aaron C. Courville. 2017. Improved Training of Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Wasserstein GANs. In Advances in Neural Information Process- Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in ing Systems 30: Annual Conference on Neural Information Process- healthcare. Nature Medicine 25, 1 (Jan. 2019), 24–29. https://doi.org/ ing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Is- 10.1038/s41591-018-0316-z abelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wal- [60] Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. lach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett 2019. Multi-News: A Large-Scale Multi-Document Summarization (Eds.). Advances in Neural Information Processing Systems, CA, Dataset and Abstractive Hierarchical Model. In Proceedings of the USA, 5767–5777. https://proceedings.neurips.cc/paper/2017/hash/ 57th Annual Meeting of the Association for Computational Linguistics. 892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html Association for Computational Linguistics, Florence, Italy, 1074–1084. [73] Tracy D Gunter and Nicolas P Terry. 2005. The emergence of na- https://doi.org/10.18653/v1/P19-1102 tional electronic health record architectures in the United States and [61] Richárd Farkas and György Szarvas. 2008. Automatic construction Australia: models, costs, and questions. Journal of medical Internet of rule-based ICD-9-CM coding systems. , S10 pages. research 7, 1 (2005), e3. [62] José Luis Fernández-Alemán, Inmaculada Carrión Señor, Pedro Án- [74] Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. gel Oliver Lozoya, and Ambrosio Toval. 2013. Security and privacy 2018. A Deep Generative Framework for Paraphrase Generation. in electronic health records: A systematic literature review. Journal In Proceedings of the Thirty-Second AAAI Conference on Artificial of biomedical informatics 46, 3 (2013), 541–562. Intelligence, (AAAI-18), the 30th innovative Applications of Artificial [63] Gregory P. Finley, Serguei V. S. Pakhomov, Reed McEwan, and Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Genevieve B. Melton. 2016. Towards Comprehensive Clinical Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, Abbreviation Disambiguation Using Machine-Labeled Training USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger Data. http://knowledge.amia.org/amia-63300-1.3360278/t004-1. (Eds.). AAAI Press, Louisiana, USA, 5149–5156. https://www.aaai. 3364525/f004-1.3364526/2500393-1.3364887/2498448-1.3364882 org/ocs/index.php/AAAI/AAAI18/paper/view/16353 [64] Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A new [75] Itika Gupta, Barbara Di Eugenio, Brian D. Ziebart, Bing Liu, Ben algorithm for tolerant of deformations and shifts Gerber, Lisa K. Sharp, Rafe Davis, and Aiswarya Baiju. 2018. Creating in position. Pattern Recognit. 15, 6 (1982), 455–469. https://doi.org/ and Annotating a Corpus of Health Coaching Dialogue. 
10.1016/0031-3203(82)90024-3 [76] Maryam Habibi, Leon Weber, Mariana L. Neves, David Luis Wiegandt, [65] Joseph Futoma, Sanjay Hariharan, and Katherine A. Heller. 2017. and Ulf Leser. 2017. Deep learning with word embeddings improves Learning to Detect Sepsis with a Multitask Gaussian Process RNN biomedical named entity recognition. Bioinform. 33, 14 (2017), i37–i48. Classifier. In Proceedings of the 34th International Conference on Ma- https://doi.org/10.1093/bioinformatics/btx228 chine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August [77] Sadid A. Hasan, Bo Liu, Joey Liu, Ashequl Qadir, Kathy Lee, Vivek 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Pre- Datla, Aaditya Prakash, and Oladimeji Farri. 2016. Neural Clinical cup and Yee Whye Teh (Eds.). PMLR, Sydney, Australia, 1174–1182. Paraphrase Generation with Attention. In Proceedings of the Clinical http://proceedings.mlr.press/v70/futoma17a.html Natural Language Processing Workshop (ClinicalNLP). The COLING [66] John M. Giorgi and Gary D. Bader. 2018. Transfer learning for biomed- 2016 Organizing Committee, Osaka, Japan, 42–53. https://www. ical named entity recognition with neural networks. Bioinform. 34, aclweb.org/anthology/W16-4207 23 (2018), 4087–4094. https://doi.org/10.1093/bioinformatics/bty449 [78] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao [67] Luka Gligic, Andrey Kormilitzin, Paul Goldberg, and Alejo J. Nevado- Xie. 2020. PathVQA: 30000+ Questions for Medical Visual Question Holgado. 2020. Named entity recognition in electronic health records Answering. using transfer learning bootstrapped Neural Networks. Neural Net- [79] Jette Henderson, Joyce C. Ho, Abel N. Kho, Joshua C. Denny, works 121 (2020), 132–139. https://doi.org/10.1016/j.neunet.2019.08. Bradley A. Malin, Jimeng Sun, and Joydeep Ghosh. 2017. Gran- 032 ite: Diversified, Sparse Tensor Factorization for Electronic Health [68] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Record-Based Phenotyping. Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. [80] R. Devon Hjelm, Athul Paul Jacob, Tong Che, Kyunghyun Cho, and 2014. Generative Adversarial Networks. arXiv:1406.2661 http: Yoshua Bengio. 2017. Boundary-Seeking Generative Adversarial //arxiv.org/abs/1406.2661 Networks. arXiv:1702.08431 http://arxiv.org/abs/1702.08431 [69] Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, [81] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, Memory. Neural Computation 9, 8 (1997), 1735–1780. and Beatrice Alex. 2019. Named Entity Recognition for Electronic [82] A. Hoogi, A. Mishra, F. Gimenez, J. Dong, and D. Rubin. 2020. Natural Health Records: A Comparison of Rule-based and Machine Learning Language Generation Model for Mammography Reports Simulation. Approaches. arXiv:1903.03985 http://arxiv.org/abs/1903.03985 IEEE Journal of Biomedical and Health Informatics 24, 9 (2020), 2711– [70] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2005. 2717. Bidirectional LSTM Networks for Improved Phoneme Classification [83] J J Hopfield. 1982. Neural networks and physical sys- and Recognition. In Artificial Neural Networks: Formal Models and tems with emergent collective computational abilities. Their Applications - ICANN 2005, 15th International Conference, War- Proceedings of the National Academy of Sciences 79, 8 saw, Poland, September 11-15, 2005, Proceedings, Part II (Lecture Notes (1982), 2554–2558. 
https://doi.org/10.1073/pnas.79.8.2554 in Computer Science, Vol. 3697), Wlodzislaw Duch, Janusz Kacprzyk, arXiv:https://www.pnas.org/content/79/8/2554.full.pdf Erkki Oja, and Slawomir Zadrozny (Eds.). Springer, Warsaw, Poland, [84] Chao-Chun Hsu, Shantanu Karnwal, Sendhil Mullainathan, Ziad 799–804. https://doi.org/10.1007/11550907_126 Obermeyer, and Chenhao Tan. 2020. Characterizing the Value of [71] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Information in Medical Notes. arXiv:2010.03574 [cs.CL] Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Li et al.

[85] Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. 2019. An empirical Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, evaluation of deep learning for ICD-9 code assignment using MIMIC- Volume 2: Short Papers, Mirella Lapata, Phil Blunsom, and Alexander III clinical notes. Computer Methods and Programs in Biomedicine 177 Koller (Eds.). Association for Computational Linguistics, Valencia, (Aug 2019), 141–153. https://doi.org/10.1016/j.cmpb.2019.05.024 Spain, 427–431. https://doi.org/10.18653/v1/e17-2068 [86] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. Clinical- [101] Kohei Kajiyama, Hiromasa Horiguchi, Takashi Okumura, Mizuki BERT: Modeling Clinical Notes and Predicting Hospital Readmission. Morita, and Yoshinobu Kano. 2018. De-identifying Free Text of Japan- arXiv:1904.05342 http://arxiv.org/abs/1904.05342 ese Dummy Electronic Health Records. In Proceedings of the Ninth [87] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF International Workshop on Health Text Mining and Information Anal- Models for Sequence Tagging. arXiv:1508.01991 http://arxiv.org/abs/ ysis. Association for Computational Linguistics, Brussels, Belgium, 1508.01991 65–70. https://doi.org/10.18653/v1/W18-5608 [88] Mark Hughes, I Li, Spyros Kotoulas, and Toyotaro Suzumura. 2017. [102] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Medical text classification using convolutional neural networks. Stud Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Health Technol Inform 235 (2017), 246–250. Dense Passage Retrieval for Open-Domain Question Answering. In [89] Dirk Hüske-Kraus. 2003. Text generation in clinical medicine–a Proceedings of the 2020 Conference on Empirical Methods in Natu- review. Methods of information in medicine 42, 01 (2003), 51–60. ral Language Processing, EMNLP 2020, Online, November 16-20, 2020, [90] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea- Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Asso- Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn L. ciation for Computational Linguistics, Online, 6769–6781. https: Ball, Katie S. Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. //doi.org/10.18653/v1/2020.emnlp-main.550 Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. [103] Kaur Karus. 2019. Using Embeddings to Improve Text Segmentation. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. [104] Manana Khachidze, Magda Tsintsadze, and Maia Archuadze. 2016. 2019. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Natural language processing based instrument for classification of Labels and Expert Comparison. In AAAI 2019. AAAI Press, Honolulu, free text medical records. Hawaii, USA, 590–597. https://doi.org/10.1609/aaai.v33i01.3301590 [105] Faiza Khan Khattak, Serena Jeblee, Noah H. Crampton, Muhammad [91] Abhyuday N Jagannatha and Hong Yu. 2016. Bidirectional RNN for Mamdani, and Frank Rudzicz. 2019. AutoScribe: Extracting Clinically Medical Event Detection in Electronic Health Records. In Proceedings Pertinent Information from Patient-Clinician Dialogues. In MEDINFO of the 2016 Conference of the North American Chapter of the Associ- 2019: Health and Wellbeing e-Networks for All - Proceedings of the 17th ation for Computational Linguistics: Human Language Technologies. 
World Congress on Medical and Health Informatics, Lyon, France, 25-30 Association for Computational Linguistics, San Diego, California, August 2019 (Studies in Health Technology and Informatics, Vol. 264), 473–482. https://doi.org/10.18653/v1/N16-1056 Lucila Ohno-Machado and Brigitte Séroussi (Eds.). IOS Press, Lyon, [92] Stefan James, Sunil V Rao, and Christopher B Granger. 2015. Registry- France, 1512–1513. https://doi.org/10.3233/SHTI190510 based randomized clinical trials—a new clinical trial paradigm. Nature [106] Nyein Pyae Pyae Khin and Khin Thidar Lynn. 2017. Medical concept Reviews Cardiology 12, 5 (2015), 312–316. extraction: A comparison of statistical and semantic methods. In 18th [93] Jie Ji, Bairui Chen, and Hongcheng Jiang. 2020. Fully-connected IEEE/ACIS International Conference on Software Engineering, Artificial LSTM-CRF on medical concept extraction. Int. J. Mach. Learn. Cybern. Intelligence, Networking and Parallel/Distributed Computing, SNPD 11, 9 (2020), 1971–1979. https://doi.org/10.1007/s13042-020-01087-6 2017, Kanazawa, Japan, June 26-28, 2017, Teruhisa Hochin, Hiroaki [94] Min Jiang, Yukun Chen, Mei Liu, S Trent Rosenbloom, Subramani Hirata, and Hiroki Nomiya (Eds.). IEEE Computer Society, Kanazawa, Mani, Joshua C Denny, and Hua Xu. 2011. A study of machine- Japan, 35–38. https://doi.org/10.1109/SNPD.2017.8022697 learning-based approaches to extract clinical entities and their asser- [107] Warren A Kibbe, Cesar Arze, Victor Felix, Elvira Mitraka, Evan Bolton, tions from discharge summaries. Journal of the American Medical Gang Fu, Christopher J Mungall, Janos X Binder, James Malone, Informatics Association 18, 5 (2011), 601–606. Drashtti Vasant, et al. 2015. Disease Ontology 2015 update: an ex- [95] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, panded and updated database of human diseases for linking biomed- and Peter Szolovits. 2020. What Disease does this Patient Have? A ical knowledge through disease data. Nucleic acids research 43, D1 Large-scale Open Domain Question Answering Dataset from Medical (2015), D1071–D1078. Exams. [108] Jisung Kim, Yoonsuck Choe, and Klaus Mueller. 2015. Extracting [96] Mengqi Jin, Mohammad Taha Bahadori, Aaron Colak, Parminder Clinical Relations in Electronic Health Records Using Enriched Parse Bhatia, Busra Celikkaya, Ram Bhakta, Selvan Senthivel, Mohammed Trees. Procedia Computer Science 53 (2015), 274–283. https://doi. Khalilia, Daniel Navarro, Borui Zhang, Tiberiu Doman, Arun Ravi, org/10.1016/j.procs.2015.07.304 INNS Conference on Big Data 2015 Matthieu Liger, and Taha A. Kass-Hout. 2018. Improving Hospital Program San Francisco, CA, USA 8-10 August 2015. Mortality Prediction with Medical Named Entities and Multimodal [109] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classi- Learning. arXiv:1811.12276 http://arxiv.org/abs/1811.12276 fication. In Proceedings of the 2014 Conference on Empirical Methods in [97] Alistair E. W. Johnson, T. Pollard, Seth J. Berkowitz, Nathaniel R. Natural Language Processing (EMNLP). Association for Computational Greenbaum, M. Lungren, Chih ying Deng, R. Mark, and S. Horng. Linguistics, Doha, Qatar, 1746–1751. https://doi.org/10.3115/v1/D14- 2019. MIMIC-CXR: A large publicly available database of labeled 1181 chest radiographs. [110] Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Varia- [98] Venkata Joopudi, Bharath Dandala, and Murthy Devarakonda. 2018. tional Bayes. 
http://arxiv.org/abs/1312.6114 A convolutional route to abbreviation disambiguation in clinical text. [111] Sebastian Köhler, Leigh Carmody, Nicole Vasilevsky, Julius O B Jacob- Journal of biomedical informatics 86 (2018), 71–78. sen, Daniel Danis, Jean-Philippe Gourdine, Michael Gargano, Nomi L [99] Aditya Joshi, Sarvnaz Karimi, Ross Sparks, Cécile Paris, and C. Raina Harris, Nicolas Matentzoglu, Julie A McMurry, et al. 2019. Expan- Macintyre. 2019. Survey of Text-Based Epidemic Intelligence: A sion of the Human Phenotype Ontology (HPO) knowledge base and Computational Linguistics Perspective. ACM Comput. Surv. 52, 6, resources. Nucleic acids research 47, D1 (2019), D1018–D1027. Article 119 (Oct. 2019), 19 pages. https://doi.org/10.1145/3361141 [112] Bevan Koopman, Guido Zuccon, Anthony Nguyen, Anton Bergheim, [100] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. and Narelle Grayson. 2015. Automatic ICD-10 classification of cancers 2017. Bag of Tricks for Efficient Text Classification. In Proceedings from free-text death certificates. International Journal of Medical of the 15th Conference of the European Chapter of the Association for Informatics 84, 11 (Nov. 2015), 956–965. https://doi.org/10.1016/j. Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review

ijmedinf.2015.08.004 About When We Talk About COVID-19: Mental Health Analysis on [113] Mark A Kramer. 1991. Nonlinear principal component analysis using Tweets Using Natural Language Processing. Springer 12498 (2020), autoassociative neural networks. AIChE journal 37, 2 (1991), 233–243. 358–370. https://doi.org/10.1007/978-3-030-63799-6_27 [114] Kundan Krishna, Amy Pavel, Benjamin Schloss, Jeffrey P. Bigham, [128] Irene Li, Michihiro Yasunaga, Muhammed Yavuz Nuzumlalı, Cesar and Zachary C. Lipton. 2020. Extracting Structured Data from Caraballo, Shiwani Mahajan, Harlan Krumholz, and Dragomir Radev. Physician-Patient Conversations By Predicting Noteworthy Utter- 2019. A Neural Topic-Attention Model for Medical Term Abbreviation ances. arXiv:2007.07151 https://arxiv.org/abs/2007.07151 Disambiguation. [115] Clete A Kushida, Deborah A Nichols, Rik Jadrnicek, Ric Miller, [129] Mingjie Li, Fuyu Wang, Xiaojun Chang, and Xiaodan Liang. 2020. James K Walsh, and Kara Griffin. 2012. Strategies for de-identification Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical and anonymization of electronic health record data for use in multi- Report Generation. arXiv:2006.03744 https://arxiv.org/abs/2006. center research studies. Medical care 50, Suppl (2012), S82. 03744 [116] Gloria Hyun-Jung Kwak and Pan Hui. 2019. DeepHealth: Deep Learn- [130] Ying Li, Sharon Lipsky Gorman, and Noemie Elhadad. 2010. Section ing for Health Informatics. arXiv:1909.00384 http://arxiv.org/abs/ classification in clinical notes using supervised hidden markov model. 1909.00384 , 744–750 pages. [117] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael [131] Yikuan Li, Shishir Rao, José Roberto Ayala Solares, Abdelaali Has- Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polo- saine, Rema Ramakrishnan, Dexter Canoy, Yajie Zhu, Kazem Rahimi, sukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, and Gholamreza Salimi-Khorshidi. 2020. BeHRt: transformer for Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, electronic Health Records. Scientific Reports 10, 1 (2020), 1–12. Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for [132] Jennifer Liang, Ching-Huei Tsou, and Ananya Poddar. 2019. A Novel Question Answering Research. Trans. Assoc. Comput. Linguistics 7 System for Extractive Clinical Note Summarization using EHR Data. (2019), 452–466. https://transacl.org/ojs/index.php/tacl/article/view/ In Proceedings of the 2nd Clinical Natural Language Processing Work- 1455 shop. Association for Computational Linguistics, Minneapolis, Min- [118] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of nesota, USA, 46–54. https://doi.org/10.18653/v1/W19-1906 Sentences and Documents. , 1188–1196 pages. http://proceedings. [133] Yuxiao Liang and Pengtao Xie. 2020. Identifying Radiological Findings mlr.press/v32/le14.html Related to COVID-19 from Medical Literature. arXiv:2004.01862 [119] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations https://arxiv.org/abs/2004.01862 of Sentences and Documents. In Proceedings of the 31th International [134] Salvador Lima, Naiara Perez, Montse Cuadros, and German Rigau. Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 2020. NUBES: A Corpus of Negation and Uncertainty in Spanish June 2014, Vol. 32. JMLR.org, Beijing, China, 1188–1196. Clinical Texts. [120] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. [135] Lucian Vlad Lita, Shipeng Yu, Stefan Niculescu, and Jinbo Bi. 2008. 
Gradient-based learning applied to document recognition. Proc. IEEE Large Scale Diagnostic Code Classification for Medical Patient 86, 11 (1998), 2278–2324. Records. https://www.aclweb.org/anthology/I08-2125 [121] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu [136] Hongfang Liu, Virginia Teller, and Carol Friedman. 2004. A multi- Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained aspect comparison study of supervised word sense disambiguation. biomedical language representation model for biomedical text min- Journal of the American Medical Informatics Association 11, 4 (2004), ing. Bioinform. 36, 4 (2020), 1234–1240. https://doi.org/10.1093/ 320–331. bioinformatics/btz682 [137] Luchen Liu, Jianhao Shen, Ming Zhang, Zichang Wang, and Jian Tang. [122] Scott Lee. 2018. Natural Language Generation for Electronic Health 2018. Learning the Joint Representation of Heterogeneous Temporal Records. arXiv:1806.01353 http://arxiv.org/abs/1806.01353 Events for Clinical Endpoint Prediction. In Proceedings of the Thirty- [123] Yoong Keok Lee and Hwee Tou Ng. 2002. An Empirical Evaluation Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th of Knowledge Sources and Learning Algorithms for Word Sense innovative Applications of Artificial Intelligence (IAAI-18), and the 8th Disambiguation. In Proceedings of the 2002 Conference on Empirical AAAI Symposium on Educational Advances in Artificial Intelligence Methods in Natural Language Processing (EMNLP 2002). Association (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. for Computational Linguistics, Philadelphia, USA, 41–48. https: McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, Louisiana, US, //doi.org/10.3115/1118693.1118699 109–116. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/ [124] Ulf Leser and Jörg Hakenberg. 2005. What makes a name? view/17085 Named entity recognition in the biomedical literature. Briefings [138] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent Bioinform. 6, 4 (2005), 357–369. https://doi.org/10.1093/bib/6.4.357 Neural Network for Text Classification with Multi-Task Learning. [125] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Ab- In Proceedings of the Twenty-Fifth International Joint Conference on delrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettle- Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15, July2016 moyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training Subbarao Kambhampati (Ed.). IJCAI/AAAI Press, NY, USA, 2873–2879. for Natural Language Generation, Translation, and Comprehen- http://www.ijcai.org/Abstract/16/408 sion. In Proceedings of the 58th Annual Meeting of the Association [139] Xiangan Liu, Keyang Xu, Pengtao Xie, and Eric P. Xing. 2018. Unsu- for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pervised Pseudo-Labeling for Extractive Summarization on Electronic Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault Health Records. arXiv:1811.08040 http://arxiv.org/abs/1811.08040 (Eds.). Association for Computational Linguistics, Online, 7871–7880. [140] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi https://www.aclweb.org/anthology/2020.acl-main.703/ Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoy- [126] Fei Li, Yonghao Jin, Weisong Liu, Bhanu Pratap Singh Rawat, Peng- anov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Ap- shan Cai, and Hong Yu. 2019. Fine-Tuning Bidirectional Encoder proach. 
Representations From Transformers (BERT)–Based Models on Large- [141] Leonardo Campillos Llanos, Catherine Thomas, Éric Bilinski, Pierre Scale Electronic Health Record Notes: An Empirical Study. JMIR Med Zweigenbaum, and Sophie Rosset. 2020. Designing a virtual patient Inform 7, 3 (12 Sep 2019), e14830. https://doi.org/10.2196/14830 dialogue system based on terminology-rich resources: Challenges [127] Irene Li, Yixin Li, Tianxiao Li, Sergio Álvarez-Napagao, Dario Garcia- and evaluation. Nat. Lang. Eng. 26, 2 (2020), 183–220. Gasulla, and Toyotaro Suzumura. 2020. What Are We Depressed Li et al.

[142] William Long. 2005. Extracting Diagnoses from Discharge Sum- 3111–3119. maries. http://knowledge.amia.org/amia-55142-a2005a-1.613296/t- [156] Riccardo Miotto, Li Li, Brian A. Kidd, and Joel T. Dudley. 2016. Deep 001-1.616182/f-001-1.616183/a-094-1.616407/a-095-1.616404 Patient: An Unsupervised Representation to Predict the Future of [143] Junyu Luo, Zifei Zheng, Hanzhong Ye, Muchao Ye, Yaqing Wang, Patients from the Electronic Health Records. Quanzeng You, Cao Xiao, and Fenglong Ma. 2020. A Bench- [157] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T mark Dataset for Understandable Medical Language Translation. Dudley. 2017. Deep learning for healthcare: review, opportunities arXiv:2012.02420 https://arxiv.org/abs/2012.02420 and challenges. Briefings in Bioinformatics 19, 6 (05 2017), 1236–1246. [144] Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, [158] Rashmi Mishra, Jiantao Bian, Marcelo Fiszman, Charlene R Weir, and Gunnar Rätsch. 2018. Improving Clinical Predictions through Siddhartha Jonnalagadda, Javed Mostafa, and Guilherme Del Fiol. Unsupervised Time Series Representation Learning. arXiv:1812.00490 2014. Text summarization in the biomedical domain: a systematic http://arxiv.org/abs/1812.00490 review of recent research. Journal of biomedical informatics 52 (2014), [145] Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE- 457–467. Z: A Zero-Shot Baseline for COVID-19 Literature Search. [159] Yasuhide Miura, Yuhao Zhang, Curtis P. Langlotz, and Dan Jurafsky. [146] Sean MacAvaney, Sajad Sotudeh, Arman Cohan, Nazli Goharian, Ish A. 2020. Improving Factual Completeness and Consistency of Image-to- Talati, and Ross W. Filice. 2019. Ontology-Aware Clinical Abstractive Text Radiology Report Generation. Summarization. In Proceedings of the 42nd International ACM SIGIR [160] Diego Mollá, Christopher Jones, and Vincent Nguyen. 2020. Conference on Research and Development in Information Retrieval, Query Focused Multi-document Summarisation of Biomedical Texts. SIGIR 2019, Paris, France, July 21-25, 2019, Benjamin Piwowarski, arXiv:2008.11986 https://arxiv.org/abs/2008.11986 Max Chevalier, Éric Gaussier, Yoelle Maarek, Jian-Yun Nie, and Falk [161] Milad Moradi. 2019. Small-world networks for summarization of Scholer (Eds.). ACM, Paris, France, 1013–1016. https://doi.org/10. biomedical articles. arXiv:1903.02861 http://arxiv.org/abs/1903.02861 1145/3331184.3331319 [162] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. [147] Diwakar Mahajan, Jennifer J. Liang, and Ching-Huei Tsou. 2020. 2016. How transferable are neural networks in nlp applications? Extracting Daily Dosage from Medication Instructions in EHRs: An [163] Michael C. Mozer. 1989. A Focused Algorithm for Automated Approach and Lessons Learned. arXiv:2005.10899 [cs.CL] Temporal Pattern Recognition. http://www.complex-systems.com/ [148] Ben J Marafino, Jason M Davies, Naomi S Bardach, Mitzi L Dean,and abstracts/v03_i04_a04.html R Adams Dudley. 2014. N-gram support vector machines for scalable [164] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Ja- procedure and diagnosis classification, with applications to clinical cob Eisenstein. 2018. Explainable Prediction of Medical Codes from free text data from the intensive care unit. Journal of the American Clinical Text. In Proceedings of the 2018 Conference of the North Amer- Medical Informatics Association 21, 5 (2014), 871–875. 
ican Chapter of the Association for Computational Linguistics: Hu- [149] Aurelie Mascio, Zeljko Kraljevic, Daniel Bean, Richard J. B. Dob- man Language Technologies, Volume 1 (Long Papers). Association son, Robert Stewart, Rebecca Bendayan, and Angus Roberts. 2020. for Computational Linguistics, New Orleans, Louisiana, 1101–1111. Comparative Analysis of Text Classification Approaches in Elec- https://doi.org/10.18653/v1/N18-1100 tronic Health Records. In Proceedings of the 19th SIGBioMed Workshop [165] Andriy Mulyar and Bridget T. McInnes. 2020. MT-Clinical BERT: on Biomedical Language Processing, BioNLP 2020, July 9, 2020, Dina Scaling Clinical Information Extraction with Multitask Learning. Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, and arXiv:2004.10220 https://arxiv.org/abs/2004.10220 Junichi Tsujii (Eds.). Association for Computational Linguistics, On- [166] Tsendsuren Munkhdalai, Feifan Liu, and Hong Yu. 2018. Clinical line, 86–94. https://www.aclweb.org/anthology/2020.bionlp-1.9/ relation extraction toward drug safety surveillance using electronic [150] Clara McCreery, Namit Katariya, Anitha Kannan, Manish Chablani, health record narratives: classical learning versus deep learning. JMIR and Xavier Amatriain. 2019. Domain-Relevant Embeddings for Medi- public health and surveillance 4, 2 (2018), e29. cal Question Similarity. arXiv:1910.04192 http://arxiv.org/abs/1910. [167] Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. 04192 MIT press, Cambridge, MA. [151] Denis Jered McInerney, Borna Dabiri, Anne-Sophie Touret, Ge- [168] Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar offrey Young, Jan-Willem van de Meent, and Byron C. Wallace. Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization 2020. Query-Focused EHR Summarization to Aid Imaging Diagnosis. using Sequence-to-sequence RNNs and Beyond. In Proceedings of the arXiv:2004.04645 https://arxiv.org/abs/2004.04645 20th SIGNLL Conference on Computational Natural Language Learning, [152] Bridget T McInnes, Ted Pedersen, and John Carlis. 2007. Using UMLS CoNLL 2016, Berlin, Germany, August 11-12, 2016, Yoav Goldberg Concept Unique Identifiers (CUIs) for word sense disambiguation in and Stefan Riezler (Eds.). ACL, Berlin, Germany, 280–290. https: the biomedical domain. , 533 pages. //doi.org/10.18653/v1/k16-1028 [153] Saeed Mehrabi, Sunghwan Sohn, Dingcheng Li, Joshua J. Pankratz, [169] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. 2016. Phased Terry M. Therneau, Jennifer L. St. Sauver, Hongfang Liu, and LSTM: Accelerating Recurrent Network Training for Long or Mathew J. Palakal. 2015. Temporal Pattern and Association Dis- Event-based Sequences. In Advances in Neural Information Pro- covery of Diagnosis Codes Using Deep Learning. , 408–416 pages. cessing Systems 29: Annual Conference on Neural Information https://doi.org/10.1109/ICHI.2015.58 Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, [154] Oren Melamud and Chaitanya Shivade. 2019. Towards Automatic Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Generation of Shareable Synthetic Clinical Notes Using Neural Lan- Guyon, and Roman Garnett (Eds.). Annual Conference on Neu- guage Models. In Proceedings of the 2nd Clinical Natural Language ral Information Processing Systems 2016, Barcelona, Spain, 3882– Processing Workshop. Association for Computational Linguistics, Min- 3890. http://papers.nips.cc/paper/6310-phased-lstm-accelerating- neapolis, Minnesota, USA, 35–45. 
https://doi.org/10.18653/v1/W19- recurrent-network-training-for-long-or-event-based-sequences 1905 [170] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. [155] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey ScispaCy: Fast and Robust Models for Biomedical Natural Language Dean. 2013. Distributed Representations of Words and Phrases and Processing. In Proceedings of the 18th BioNLP Workshop and Shared Their Compositionality. In Proceedings of the 26th International Con- Task, BioNLP@ACL 2019, Florence, Italy, August 1, 2019, Dina Demner- ference on Neural Information Processing Systems - Volume 2 (Lake Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, and Junichi Tahoe, Nevada) (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, Tsujii (Eds.). Association for Computational Linguistics, Florence, Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review

Italy, 319–327. https://doi.org/10.18653/v1/w19-5034 [181] Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. 2017. [171] Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Predicting healthcare trajectories from medical records: A deep learn- Savova, and Pierre Zweigenbaum. 2018. Clinical natural language ing approach. Journal of Biomedical Informatics 69 (May 2017), 218– processing in languages other than english: opportunities and chal- 229. lenges. Journal of biomedical semantics 9, 1 (2018), 12. [182] François Portet, Ehud Reiter, Jim Hunter, and Somayajulu Sripada. [172] Tom Oberhauser, Tim Bischoff, Karl Brendel, Maluna Menke, To- 2007. Automatic Generation of Textual Summaries from Neonatal bias Klatt, Amy Siu, Felix Alexander Gers, and Alexander Löser. Intensive Care Data. In Artificial Intelligence in Medicine, 11th Confer- 2020. TrainX - Named Entity Linking with Active Sampling and ence on Artificial Intelligence in Medicine, AIME 2007, Amsterdam, The Bi-Encoders. In Proceedings of the 28th International Conference on Netherlands, July 7-11, 2007, Proceedings (Lecture Notes in Computer Computational Linguistics, COLING 2020 - System Demonstrations, Science, Vol. 4594), Riccardo Bellazzi, Ameen Abu-Hanna, and Jim Barcelona, Spain (Online), December 8-13, 2020, Michal Ptaszynski and Hunter (Eds.). Springer, Amsterdam, The Netherlands, 227–236. Bartosz Ziólko (Eds.). International Committee on Computational [183] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Linguistics (ICCL), online, 64–69. https://doi.org/10.18653/v1/2020. 2018. Improving Language Understanding by Generative Pre- coling-demos.12 Training. https://cdn.openai.com/research-covers/language- [173] Aitor García Pablos, Naiara Pérez, and Montse Cuadros. 2020. Sensi- unsupervised/language_understanding_paper.pdf tive Data Detection and Classification in Spanish Clinical Text: Ex- [184] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and periments with BERT. In LREC 2020. European Language Resources Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Association, Marseille, France, 4486–4494. https://www.aclweb.org/ Learners. anthology/2020.lrec-1.552/ [185] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan [174] Ellie Pavlick, Juri Ganitkevitch, Tsz Ping Chan, Xuchen Yao, Ben- Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. jamin Van Durme, and Chris Callison-Burch. 2015. Domain-Specific Exploring the Limits of Transfer Learning with a Unified Text-to-Text Paraphrase Extraction. In Proceedings of the 53rd Annual Meeting of Transformer. arXiv:1910.10683 http://arxiv.org/abs/1910.10683 the Association for Computational Linguistics and the 7th International [186] Ganesh Ramakrishnan, Apurva Jadhav, Ashutosh Joshi, Soumen Joint Conference on Natural Language Processing of the Asian Federa- Chakrabarti, and Pushpak Bhattacharyya. 2003. Question Answering tion of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, via Bayesian Inference on Lexical Relations. In Proceedings of the ACL China, Volume 2: Short Papers. The Association for Computer Linguis- 2003 Workshop on Multilingual Summarization and Question Answer- tics, Beijing, China, 57–62. https://doi.org/10.3115/v1/p15-2010 ing. Association for Computational Linguistics, Sapporo, Japan, 1–10. [175] Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. 
Transfer Learn- https://doi.org/10.3115/1119312.1119313 ing in Biomedical Natural Language Processing: An Evaluation of [187] Bryan Rink, Sanda Harabagiu, and Kirk Roberts. 2011. Auto- BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of matic extraction of relations between medical concepts in clin- the 18th BioNLP Workshop and Shared Task, BioNLP@ACL 2019, Flo- ical texts. Journal of the American Medical Informatics As- rence, Italy, August 1, 2019, Dina Demner-Fushman, Kevin Breton- sociation 18, 5 (09 2011), 594–600. https://doi.org/10.1136/ nel Cohen, Sophia Ananiadou, and Junichi Tsujii (Eds.). Associa- amiajnl-2011-000153 arXiv:https://academic.oup.com/jamia/article- tion for Computational Linguistics, Florence, Italy, 58–65. https: pdf/18/5/594/5964462/18-5-594.pdf //doi.org/10.18653/v1/w19-5006 [188] Angus Roberts, Rob Gaizauskas, M. Hepple, and Yikun Guo. 2008. [176] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Mining clinical relationships from patient narratives. BMC Bioinfor- GloVe: Global Vectors for Word Representation. In Proceedings of the matics 9 (11 2008). https://doi.org/10.1186/1471-2105-9-S11-S3 2014 Conference on Empirical Methods in Natural Language Processing [189] Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, (EMNLP). Association for Computational Linguistics, Doha, Qatar, Kyle Lo, Ian Soboroff, Ellen M. Voorhees, Lucy Lu Wang, and 1532–1543. https://doi.org/10.3115/v1/D14-1162 William R. Hersh. 2020. TREC-COVID: rationale and structure of [177] Naiara Pérez, Pablo Accuosto, Àlex Bravo, Montse Cuadros, Eva an information retrieval shared task for COVID-19. J. Am. Medical Martínez-Garcia, Horacio Saggion, and German Rigau. 2020. Cross- Informatics Assoc. 27, 9 (2020), 1431–1436. https://doi.org/10.1093/ lingual semantic annotation of biomedical literature: experiments jamia/ocaa091 in Spanish and English. Bioinform. 36, 6 (2020), 1872–1880. https: [190] Anthony J. Robinson and F. Failside. 1987. Static and Dynamic Error //doi.org/10.1093/bioinformatics/btz853 Propagation Networks with Application to Speech Coding. , 632– [178] Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, 641 pages. http://papers.nips.cc/paper/42-static-and-dynamic-error- Frank Wood, and Noémie Elhadad. 2014. Diagnosis code assignment: propagation-networks-with-application-to-speech-coding models and evaluation metrics. Journal of the American Medical [191] Roland Roller, Madeleine Kittner, Dirk Weissenborn, and Ulf Leser. Informatics Association 21, 2 (2014), 231–237. 2018. Cross-lingual Candidate Search for Biomedical Concept Nor- [179] Ahmad Pesaranghader, Stan Matwin, Marina Sokolova, and Ali Pe- malization. arXiv:1805.01646 http://arxiv.org/abs/1805.01646 saranghader. 2019. deepBioWSD: effective deep neural word sense [192] Ignacio Rubio-López, Roberto Costumero, Héctor Ambit, Con- disambiguation of biomedical text data. Journal of the American suelo Gonzalo-Martín, Ernestina Menasalvas, and Alejandro Ro- Medical Informatics Association 26, 5 (2019), 438–446. dríguez González. 2017. Acronym Disambiguation in Spanish Elec- [180] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, tronic Health Narratives Using Machine Learning Techniques. Studies Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep in health technology and informatics 235 (2017), 251—255. Contextualized Word Representations. In Proceedings of the 2018 Con- [193] Sebastian Ruder. 2019. 
Neural transfer learning for natural language ference of the North American Chapter of the Association for Com- processing. Ph.D. Dissertation. National University of Ireland, Galway, putational Linguistics: Human Language Technologies, NAACL-HLT Ireland. 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long [194] Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, and Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Asso- Thomas Wolf. 2019. Transfer learning in natural language processing. ciation for Computational Linguistics, Louisiana, USA, 2227–2237. In Proceedings of the 2019 Conference of the North American Chapter https://doi.org/10.18653/v1/n18-1202 of the Association for Computational Linguistics: Tutorials. Proceed- ings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Florence, Italy, 15–18. Li et al.

[195] Sunil Sahu, Ashish Anand, Krishnadev Oruganty, and Mahanan- [208] A. K. Bhavani Singh, Mounika Guntu, Ananth Reddy Bhimireddy, deeshwar Gattu. 2016. Relation extraction from clinical texts using Judy W. Gichoya, and Saptarshi Purkayastha. 2020. Multi-label domain invariant convolutional neural network. In Proceedings of natural language processing to identify diagnosis and procedure the 15th Workshop on Biomedical Natural Language Processing. As- codes from MIMIC-III inpatient notes. arXiv:2003.07507 https: sociation for Computational Linguistics, Berlin, Germany, 206–215. //arxiv.org/abs/2003.07507 https://doi.org/10.18653/v1/W16-2928 [209] Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. [196] Sunil Kumar Sahu and Ashish Anand. 2018. Drug-drug interac- Ng, and Matthew P. Lungren. 2020. CheXbert: Combining Automatic tion extraction from biomedical texts using long short-term mem- Labelers and Expert Annotations for Accurate Radiology Report ory network. J. Biomed. Informatics 86 (2018), 15–24. https: Labeling Using BERT. arXiv:2004.09167 https://arxiv.org/abs/2004. //doi.org/10.1016/j.jbi.2018.08.005 09167 [197] Max E. Savery, Asma Ben Abacha, Soumya Gayen, and Dina Demner- [210] Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. Fushman. 2020. Question-Driven Summarization of Answers to Con- BIOSSES: a semantic sentence similarity estimation system for the sumer Health Questions. https://arxiv.org/abs/2005.09067 biomedical domain. Bioinformatics 33, 14 (2017), i49–i58. [198] Elyne Scheurwegs, Kim Luyckx, Léon Luyten, Walter Daele- [211] Sarvesh Soni and Kirk Roberts. 2019. A Paraphrase Generation Sys- mans, and Tim Van den Bulcke. 2015. Data integration of tem for EHR Question Answering. In Proceedings of the 18th BioNLP structured and unstructured sources for assigning clinical codes Workshop and Shared Task. Association for Computational Linguistics, to patient stays. Journal of the American Medical Informat- Florence, Italy, 20–29. https://doi.org/10.18653/v1/W19-5003 ics Association 23, e1 (08 2015), e11–e19. https://doi.org/ [212] Sarvesh Soni and Kirk Roberts. 2020. Paraphrasing to improve the per- 10.1093/jamia/ocv115 arXiv:https://academic.oup.com/jamia/article- formance of Electronic Health Records Question Answering. AMIA pdf/23/e1/e11/17377079/ocv115.pdf Summits on Translational Science Proceedings 2020 (2020), 626. [199] Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne [213] Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakho- Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden mov, Hongfang Liu, and Hua Xu. 2018. CLAMP - a toolkit for ef- Kibbe. 2012. Disease Ontology: a backbone for disease semantic ficiently building customized clinical natural language processing integration. Nucleic acids research 40, D1 (2012), D940–D946. pipelines. Journal of the American Medical Informatics Association 25, [200] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get 3 (2018), 331–336. To The Point: Summarization with Pointer-Generator Networks. In [214] Josef Steinberger, Karel Jezek, et al. 2004. Using latent semantic Proceedings of the 55th Annual Meeting of the Association for Compu- analysis in text summarization and summary evaluation. Proc. ISIM tational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, 4 (2004), 93–100. Volume 1: Long Papers, Regina Barzilay and Min-Yen Kan (Eds.). 
Asso- [215] Mark Stevenson, Yikun Guo, Abdulaziz Alamri, and Robert ciation for Computational Linguistics, Vancouver, Canada, 1073–1083. Gaizauskas. 2009. Disambiguation of Biomedical Abbreviations. In https://doi.org/10.18653/v1/P17-1099 Proceedings of the BioNLP 2009 Workshop. Association for Computa- [201] Sai P. Selvaraj and Sandeep Konam. 2019. Medication Regimen Ex- tional Linguistics, Boulder, Colorado, 71–79. https://www.aclweb. traction From Medical Conversations. arXiv:1912.04961 [cs.CL] org/anthology/W09-1309 [202] Elaheh ShafieiBavani, Antonio Jimeno Yepes, Xu Zhong, and David [216] Amber Stubbs, Michele Filannino, and Özlem Uzuner. 2017. De- Martinez Iraola. 2020. Global Locality in Biomedical Relation and identification of psychiatric intake records: Overview of 2016 CEGS Event Extraction. In Proceedings of the 19th SIGBioMed Workshop N-GRID Shared Tasks Track 1. Journal of biomedical informatics 75 on Biomedical Language Processing. Association for Computational (2017), S4–S18. Linguistics, Online, 195–204. https://doi.org/10.18653/v1/2020.bionlp- [217] Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. 2015. Au- 1.21 tomated systems for the de-identification of longitudinal clinical [203] Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P. narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Xing. 2017. Towards Automated ICD Coding Using Deep Learning. Journal of biomedical informatics 58 (2015), S11–S19. arXiv:arXiv:1711.04075 [218] Harini Suresh, Nathan Hunt, Alistair E. W. Johnson, Leo Anthony [204] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi. 2018. Deep EHR: A Celi, Peter Szolovits, and Marzyeh Ghassemi. 2017. Clinical In- Survey of Recent Advances in Deep Learning Techniques for Elec- tervention Prediction and Understanding with Deep Neural Net- tronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and works. In Proceedings of the Machine Learning for Health Care Con- Health Informatics 22, 5 (2018), 1589–1604. ference, MLHC 2017, Boston, Massachusetts, USA, 18-19 August 2017 [205] Han-Chin Shing, Guoli Wang, and Philip Resnik. 2019. Assigning (Proceedings of Machine Learning Research, Vol. 68), Finale Doshi- Medical Codes at the Encounter Level by Paying Attention to Docu- Velez, Jim Fackler, David C. Kale, Rajesh Ranganath, Byron C. Wal- ments. arXiv:1911.06848 http://arxiv.org/abs/1911.06848 lace, and Jenna Wiens (Eds.). PMLR, Massachusetts, USA, 322–337. [206] Stefano Silvestri, Francesco Gargiulo, Mario Ciampi, and Giuseppe De http://proceedings.mlr.press/v68/suresh17a.html Pietro. 2020. Exploit Multilingual Language Model at Scale for ICD- [219] Madhumita Sushil, Simon Šuster, Kim Luyckx, and Walter Daelemans. 10 Clinical Text Classification. In IEEE Symposium on Computers 2018. Patient representation learning and interpretable evaluation and Communications, ISCC 2020, Rennes,France, July 7-10, 2020. IEEE, using clinical notes. Journal of Biomedical Informatics 84 (Aug. 2018), Rennes,France, 1–7. https://doi.org/10.1109/ISCC50000.2020.9219640 103–113. [207] Mervyn Singer, Clifford S. Deutschman, Christopher Warren Sey- [220] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Se- mour, Manu Shankar-Hari, Djillali Annane, Michael Bauer, Rinaldo quence Learning with Neural Networks. In Advances in Neural Infor- Bellomo, Gordon R. Bernard, Jean-Daniel Chiche, Craig M. Coop- mation Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, ersmith, Richard S. Hotchkiss, Mitchell M. Levy, John C. Marshall, N. D. 
Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., Montreal, Canada, 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
Greg S. Martin, Steven M. Opal, Gordon D. Rubenfeld, Tom van der Poll, Jean-Louis Vincent, and Derek C. Angus. 2016. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 315, 8 (02 2016), 801–810. https://doi.org/10.1001/jama.2016.0287
[221] Buzhou Tang, Dehuan Jiang, Qingcai Chen, Xiaolong Wang, Jun Yan, and Ying Shen. 2019. De-identification of Clinical Text via Bi-LSTM-CRF with Neural Language Models. http://knowledge.amia.org/69862-amia-1.4570936/t004-1.4574923/t004-1.4574924/3203046-1.4574964/3201562-1.4574961
[222] Yifeng Tao, Bruno Godefroy, Guillaume Genthial, and Christopher Potts. 2018. Effective Feature Representation for Clinical Text Concept Extraction. arXiv:1811.00070 [cs.CL]
[223] Michael Tepper, Daniel Capurro, Fei Xia, Lucy Vanderwende, and Meliha Yetisgen-Yildiz. 2012. Statistical Section Segmentation in Free-Text Clinical Records. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey, 2001–2008. http://www.lrec-conf.org/proceedings/lrec2012/pdf/1016_Paper.pdf
[224] Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra. 2020. Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records. In Proceedings of the ACM WSDM 2020 Health Search and Data Mining Workshop, at WSDM 2020 (CEUR Workshop Proceedings, Vol. 2551), Carsten Eickhoff, Yubin Kim, and Ryen W. White (Eds.). CEUR-WS.org, Texas, USA, 3–11. http://ceur-ws.org/Vol-2551/paper-03.pdf
[225] George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 1 (2015), 138.
[226] Özlem Uzuner, Yuan Luo, and Peter Szolovits. 2007. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 14, 5 (2007), 550–563.
[227] Ilya Valmianski, Caleb Goodwin, Ian M. Finn, Naqi Khan, and Daniel S. Zisook. 2019. Evaluating robustness of language models for chief complaint extraction from patient-generated text. arXiv:1911.06915 [cs.CL]
[228] Shikhar Vashishth, Rishabh Joshi, Ritam Dutt, Denis Newman-Griffis, and Carolyn Penstein Rosé. 2020. MedType: Improving Medical Entity Linking with Semantic Type Prediction. arXiv:2005.00460 https://arxiv.org/abs/2005.00460
[229] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). Long Beach, CA, USA, 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[230] Alfredo Vellido. 2019. The importance of interpretability and visualization in machine learning for applications in medicine and health care. 15 pages.
[231] David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A Healthcare Dataset for Complex Reasoning. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 960–966. https://doi.org/10.18653/v1/p19-1092
[232] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008 (ACM International Conference Proceeding Series, Vol. 307), William W. Cohen, Andrew McCallum, and Sam T. Roweis (Eds.). ACM, Helsinki, Finland, 1096–1103. https://doi.org/10.1145/1390156.1390294
[233] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res. 11 (2010), 3371–3408. http://portal.acm.org/citation.cfm?id=1953039
[234] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2020. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection.
[235] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[236] Thanh Vu, Dat Quoc Nguyen, and Anthony Nguyen. 2020. A Label Attention Model for ICD Coding from Clinical Text. https://doi.org/10.24963/ijcai.2020/461
[237] Ramya Vunikili, Supriya H. N, Vasile George Marica, and Oladimeji Farri. 2020. Clinical NER using Spanish BERT Embeddings. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), September 23, 2020 (CEUR Workshop Proceedings, Vol. 2664). CEUR-WS.org, Málaga, Spain, 505–511.
[238] Alexander H. Waibel, Toshiyuki Hanazawa, Geoffrey E. Hinton, Kiyohiro Shikano, and Kevin J. Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Signal Process. 37, 3 (1989), 328–339. https://doi.org/10.1109/29.21701
[239] Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Chris Wilhelm, Boya Xie, Douglas Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020. CORD-19: The Covid-19 Open Research Dataset. arXiv:2004.10706 https://arxiv.org/abs/2004.10706
[240] Qingyun Wang, Manling Li, Xuan Wang, Nikolaus Parulian, Guangxing Han, Jiawei Ma, Jingxuan Tu, Ying Lin, Haoran Zhang, Weili Liu, et al. 2020. Covid-19 literature knowledge graph construction and drug repurposing report generation.
[241] Shirly Wang, Matthew BA McDermott, Geeticka Chauhan, Marzyeh Ghassemi, Michael C Hughes, and Tristan Naumann. 2020. MIMIC-Extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III. 222–235 pages.
[242] Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020. Comprehensive named entity recognition on CORD-19 with distant or weak supervision.
[243] Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020. Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision. arXiv:2003.12218 https://arxiv.org/abs/2003.12218
[244] Yanshan Wang, Sunghwan Sohn, Sijia Liu, Feichen Shen, Liwei Wang, Elizabeth J Atkinson, Shreyasee Amin, and Hongfang Liu. 2019. A clinical text classification paradigm using weak supervision and deep representation. BMC Medical Informatics and Decision Making 19, 1 (2019), 1.
[245] Yue Wang, Kai Zheng, Hua Xu, and Qiaozhu Mei. 2018. Interactive medical word sense disambiguation through informed learning. Journal of the American Medical Informatics Association 25, 7 (2018), 800–808.
[246] Zhenghui Wang, Yanru Qu, Liheng Chen, Jian Shen, Weinan Zhang, Shaodian Zhang, Yimei Gao, Gen Gu, Ken Chen, and Yong Yu. 2018. Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, Louisiana, USA, 1–15. https://doi.org/10.18653/v1/n18-1001

[247] Xing Wei and Carsten Eickhoff. 2018. Embedding Electronic Health Records for Clinical Information Retrieval. arXiv:1811.05402 http://arxiv.org/abs/1811.05402
[248] Wei-Hung Weng, Yu-An Chung, and Peter Szolovits. 2019. Unsupervised Clinical Language Translation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, Anchorage, AK, USA, 3121–3131. https://doi.org/10.1145/3292500.3330710
[249] Paul Werbos. 1990. Backpropagation through time: what it does and how to do it. Proc. IEEE 78 (11 1990), 1550–1560. https://doi.org/10.1109/5.58337
[250] Paul J. Werbos. 1988. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1, 4 (1988), 339–356. https://doi.org/10.1016/0893-6080(88)90007-X
[251] Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. 2019. Deep learning in clinical natural language processing: a methodical review. Journal of the American Medical Informatics Association 27, 3 (12 2019), 457–470. https://doi.org/10.1093/jamia/ocz200 arXiv:https://academic.oup.com/jamia/article-pdf/27/3/457/32500129/ocz200.pdf
[252] Yonghui Wu, Jun Xu, Yaoyun Zhang, and Hua Xu. 2015. Clinical Abbreviation Disambiguation Using Neural Word Embeddings. In Proceedings of BioNLP 15. Association for Computational Linguistics, Beijing, China, 171–176. https://doi.org/10.18653/v1/W15-3822
[253] Yonghui Wu, Jun Xu, Yaoyun Zhang, and Hua Xu. 2015. Clinical Abbreviation Disambiguation Using Neural Word Embeddings. In Proceedings of BioNLP 15. Association for Computational Linguistics, Beijing, China, 171–176. https://doi.org/10.18653/v1/W15-3822
[254] Yonghui Wu, Xi Yang, Jiang Bian, Yi Guo, Hua Xu, and William Hogan. 2018. Combine factual medical knowledge and distributed word representation to improve clinical named entity recognition. 1110 pages.
[255] Cao Xiao, Edward Choi, and Jimeng Sun. 2018. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association 25, 10 (2018), 1419–1428.
[256] Hua Xu, Marianthi Markatou, Rositsa Dimova, Hongfang Liu, and Carol Friedman. 2006. Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues. BMC Bioinformatics 7, 1 (2006), 334.
[257] Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, and Joshua C Denny. 2010. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 17, 1 (2010), 19–24.
[258] Hua Xu, Peter D Stetson, and Carol Friedman. 2009. Methods for building sense inventories of abbreviations in clinical notes. Journal of the American Medical Informatics Association 16, 1 (2009), 103–108.
[259] Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Piyush Mathur, Frank Papay, Ashish K. Khanna, Jacek B. Cywinski, Kamal Maheshwari, Pengtao Xie, and Eric P. Xing. 2019. Multimodal Machine Learning for Automated ICD Coding. In Proceedings of Machine Learning Research, Finale Doshi-Velez, Jim Fackler, Ken Jung, David Kale, Rajesh Ranganath, Byron Wallace, and Jenna Wiens (Eds.), Vol. 106. PMLR, Ann Arbor, Michigan, 197–215. http://proceedings.mlr.press/v106/xu19a.html
[260] Xi Yang, Tianchen Lyu, Qian Li, Chih-Yin Lee, Jiang Bian, William R Hogan, and Yonghui Wu. 2019. A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Medical Informatics and Decision Making 19, 5 (2019), 232.
[261] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, Lyon, France, August 25-29, 2013, Frédéric Bimbot, Christophe Cerisara, Cécile Fougeron, Guillaume Gravier, Lori Lamel, François Pellegrino, and Pascal Perrier (Eds.). ISCA, Lyon, France, 2524–2528. http://www.isca-speech.org/archive/interspeech_2013/i13_2524.html
[262] Liang Yao, Chengsheng Mao, and Yuan Luo. 2018. Clinical Text Classification with Rule-based Features and Knowledge-guided Convolutional Neural Networks. 70–71 pages.
[263] Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander R. Fabbri, Irene Li, Dan Friedman, and Dragomir R. Radev. 2019. ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, Hawaii, USA, 7386–7393. https://doi.org/10.1609/aaai.v33i01.33017386
[264] X. Yu, W. Hu, S. Lu, X. Sun, and Z. Yuan. 2019. BioBERT Based Named Entity Recognition in Electronic Medical Record. In 2019 10th International Conference on Information Technology in Medicine and Education (ITME). IEEE, New York, USA, 49–52. https://doi.org/10.1109/ITME.2019.00022
[265] Chi Yuan, Patrick B Ryan, Casey Ta, Yixuan Guo, Ziran Li, Jill Hardin, Rupa Makadia, Peng Jin, Ning Shang, Tian Kang, et al. 2019. Criteria2Query: a natural language interface to clinical databases for cohort definition. Journal of the American Medical Informatics Association 26, 4 (2019), 294–305.
[266] Klaus Zechner. 2002. Automatic Summarization of Open-Domain Multiparty Dialogues in Diverse Genres. Computational Linguistics 28, 4 (2002), 447–485. https://doi.org/10.1162/089120102762671945
[267] Z. Zeng, Y. Deng, X. Li, T. Naumann, and Y. Luo. 2019. Natural Language Processing for EHR-Based Computational Phenotyping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 16, 1 (2019), 139–153.
[268] Canlin Zhang, Daniel Biś, Xiuwen Liu, and Zhe He. 2019. Biomedical word sense disambiguation with bidirectional long short-term memory and attention-based neural networks. BMC Bioinformatics 20, 16 (2019), 502.
[269] Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, and Jimmy Lin. 2020. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset. In Proceedings of the First Workshop on Scholarly Document Processing, SDP@EMNLP 2020, Online, November 19, 2020. Association for Computational Linguistics, Online, 31–41. https://doi.org/10.18653/v1/2020.sdp-1.5
[270] Jinghe Zhang, Kamran Kowsari, James H. Harrison, Jennifer M. Lobo, and Laura E. Barnes. 2018. Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record. IEEE Access 6 (2018), 65333–65346. https://doi.org/10.1109/ACCESS.2018.2875677
[271] Jingqing Zhang, Xiaoyu Zhang, K. Sun, Xian Yang, Chengliang Dai, and Y. Guo. 2019. Unsupervised Annotation of Phenotypic Abnormalities via Semantic Latent Representations on Electronic Health Records. 598–603 pages.
[272] Peinan Zhang and Mamoru Komachi. 2015. Japanese Sentiment Classification with Stacked Denoising Auto-Encoder using Distributed Word Representation. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 29, Shanghai, China, October 30 - November 1, 2015. ACL, Shanghai, China, 150–159. https://www.aclweb.org/anthology/Y15-1018/
[273] Quan-shi Zhang and Song-Chun Zhu. 2018. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19, 1 (2018), 27–39.
[274] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., Montreal, Canada, 649–657. http://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
[275] Xi Sheryl Zhang, Fengyi Tang, Hiroko H. Dodge, Jiayu Zhou, and Fei Wang. 2019. MetaPred: Meta-Learning for Clinical Risk Prediction with Limited Patient Electronic Health Records. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD '19). Association for Computing Machinery, New York, NY, USA, 2487–2495. https://doi.org/10.1145/3292500.3330779
[276] Yuhao Zhang, Daisy Yi Ding, Tianpei Qian, Christopher D. Manning, and Curtis P. Langlotz. 2018. Learning to Summarize Radiology Findings. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, Louhi@EMNLP 2018, Brussels, Belgium, October 31, 2018, Alberto Lavelli, Anne-Lyse Minard, and Fabio Rinaldi (Eds.). Association for Computational Linguistics, Brussels, Belgium, 204–213. https://doi.org/10.18653/v1/w18-5623
[277] Yijia Zhang, Hongfei Lin, Zhihao Yang, Jian Wang, Shaowu Zhang, Yuanyuan Sun, and Liang Yang. 2018. A hybrid model based on neural networks for biomedical relation extraction. Journal of Biomedical Informatics 81 (2018), 83–92. https://doi.org/10.1016/j.jbi.2018.03.011
[278] Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, and Curtis Langlotz. 2020. Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, Online, 5108–5120. https://www.aclweb.org/anthology/2020.acl-main.458/
[279] Zhi Zhong and Hwee Tou Ng. 2012. Word Sense Disambiguation Improves Information Retrieval. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers. The Association for Computer Linguistics, Jeju Island, Korea, 273–282. https://www.aclweb.org/anthology/P12-1029/
[280] Henghui Zhu, Ioannis Ch. Paschalidis, and Amir Tahmasebi. 2018. Clinical Concept Extraction with Contextual Word Embedding. arXiv:1810.10566 http://arxiv.org/abs/1810.10566
[281] Ming Zhu, Busra Celikkaya, Parminder Bhatia, and Chandan K. Reddy. 2020. LATTE: Latent Type Modeling for Biomedical Entity Linking. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, New York, NY, USA, 9757–9764. https://aaai.org/ojs/index.php/AAAI/article/view/6526
[282] Zihao Zhu, Changchang Yin, Buyue Qian, Yu Cheng, Jishang Wei, and Fei Wang. 2019. Measuring Patient Similarities via a Deep Architecture with Medical Concept Embedding. arXiv–1902 pages.

9 Appendix: Datasets and Tools

9.1 Datasets

9.1.1 General EHR-NLP Datasets.

MIMIC-III (https://mimic.physionet.org/) A free, publicly accessible database of de-identified medical information on patient stays in the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The NOTEEVENTS table contains clinical notes from over 40,000 patients. Other tables have data on mortality, imaging reports, demographics, vital signs, lab tests, drugs, and procedures. Before MIMIC-III, there were two other iterations of MIMIC used by biomedical NLP researchers.

MIMIC-CXR (https://physionet.org/content/mimic-cxr/2.0.0/) Like MIMIC-III, MIMIC-CXR contains de-identified clinical information from the Beth Israel Deaconess Medical Center. It has over 377,000 radiology images of chest X-rays. The creators of the dataset also used the CheXpert [90] tool to classify each image's corresponding free-text note into 14 different labels.

NUBes-PHI [134] A Spanish medical report corpus containing about 7,000 real reports with annotated negation and uncertainty information.

Abbrev dataset (https://nlp.cs.vcu.edu/data.html) [215] A re-creation of an older dataset containing acronyms and their long forms from MEDLINE abstracts. It is automatically re-created by identifying each acronym's long form in a MEDLINE abstract and replacing it with the acronym. There are three subsets containing 100, 200, and 300 instances respectively.

MEDLINE (https://www.nlm.nih.gov/bsd/medline.html) A database of 26 million journal articles on biomedicine and health from 1950 to the present. It is compiled by the United States National Library of Medicine (NLM). MedlinePlus, a related service, describes medical terms in simple language.

PubMed (https://www.ncbi.nlm.nih.gov/guide/howto/obtain-full-text/) A corpus containing more than 30 million citations for biomedical and scientific literature. In addition to MEDLINE, these texts come from sources like online books, papers on other scientific topics, and biomedical articles that have not been processed by MEDLINE.

9.1.2 Task-Specific Datasets.

BioASQ (http://bioasq.org/) An organization that designs challenges for biomedical NLP tasks. While BioASQ challenges focus primarily on question answering (QA) and semantic indexing, some use other tasks including multi-document summarization, information retrieval, and hierarchical text classification.

BIOSSES (http://tabilab.cmpe.boun.edu.tr/BIOSSES/) [210] A benchmark dataset for biomedical sentence similarity estimation.

BLUE (https://github.com/ncbi-nlp/BLUE_Benchmark) Biomedical Language Understanding Evaluation (BLUE) is a collection of ten datasets for five biomedical NLP tasks. These tasks cover sentence similarity, named entity recognition, relation extraction, document classification, and natural language inference. BLUE serves as a useful benchmarking tool, as it centralizes the datasets that medical NLP systems evaluate on.

Clinical Abbreviation Sense Inventory (https://conservancy.umn.edu/handle/11299/137703) A dataset for medical term disambiguation. In the latest version, 440 of the most frequently used abbreviations and acronyms were selected from 352,267 dictated clinical notes.

CLINIQPARA [211] A dataset with paraphrases for clinical questions. It contains 10,578 unique questions across 946 semantically distinct paraphrase clusters, initially collected to improve question answering for EHRs.

i2b2 (http://www.i2b2.org/) Informatics for Integrating Biology and the Bedside, or i2b2, is a non-profit that organizes datasets and competitions for clinical NLP. It has numerous datasets for specific tasks like de-identification, relation extraction, and clinical trial cohort selection. These datasets and challenges are now run by Harvard's National NLP Clinical Challenges (n2c2); however, most papers refer to them by the name i2b2.

MedICaT (https://github.com/allenai/medicat) A collection of more than 217,000 medical images, corresponding captions, and inline references, made for figure retrieval and figure-to-text alignment tasks. Unlike previous medical imaging datasets, subfigures and subcaptions are explicitly aligned, introducing the specific task of subcaption-subfigure alignment.

MedNLI (https://jgc128.github.io/mednli/) Designed for Natural Language Inference (NLI) in the clinical domain. The objective of NLI is to predict whether a hypothesis can be deemed true, false, or undetermined from a given premise. MedNLI contains 14,049 unique sentence pairs, annotated by 4 clinicians over the course of six weeks. To download it, one first needs to obtain access to MIMIC-III.

MedQuAD (https://github.com/abachaa/MedQuAD) [18] Medical Question Answering Dataset, a collection of 47,457 medical question-answer pairs created from 12 NIH websites (e.g., cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). There are 37 question types associated with diseases, drugs, and other medical entities such as tests.

VQA-RAD (https://osf.io/89kps/) A dataset of manually constructed question-answer pairs corresponding to radiology images. It was designed for future Visual Question Answering systems, which will automatically answer salient questions about X-rays. These models will hopefully be very useful clinical decision support tools for radiologists.
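Several of the resources above, MIMIC-III in particular, are distributed as relational tables, so a few lines of pandas are usually enough to pull the unstructured notes into an NLP pipeline. The snippet below is only a minimal, illustrative sketch: it assumes a locally downloaded copy of the MIMIC-III NOTEEVENTS table (access requires PhysioNet credentialing) and the standard column names SUBJECT_ID, HADM_ID, CATEGORY, and TEXT.

# Minimal sketch: pulling clinical notes out of a local MIMIC-III copy.
# Assumes NOTEEVENTS.csv was obtained after PhysioNet credentialing and
# uses that table's standard columns (SUBJECT_ID, HADM_ID, CATEGORY, TEXT).
import pandas as pd

notes = pd.read_csv(
    "NOTEEVENTS.csv",
    usecols=["SUBJECT_ID", "HADM_ID", "CATEGORY", "TEXT"],
    low_memory=False,
)

# Discharge summaries are a common unit of analysis for clinical NLP.
discharge = notes[notes["CATEGORY"] == "Discharge summary"]
print(len(discharge), "discharge summaries")
print(discharge["TEXT"].iloc[0][:300])  # preview the first note

From here the TEXT column can be passed to any of the tokenizers or models discussed in the Tools and libraries section below.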

WikiMed and PubMedDS [228] Two large-scale datasets for entity linking. WikiMed contains over 650,000 mentions normalized to concepts in UMLS. PubMedDS is an annotated corpus with more than 5 million normalized mentions spanning 3.5 million documents.

PathVQA [78] The first dataset for pathology visual question answering. It contains 32,799 manually checked questions from 4,998 pathology images.

MedQA [95] The first multiple-choice OpenQA dataset for solving medical problems. The dataset is collected from professional medical board exams in three languages: English, simplified Chinese, and traditional Chinese, with 12,723, 34,251, and 14,123 questions respectively.

9.2 Tools and libraries

9.2.1 NLP and Machine Learning.

PyTorch (https://pytorch.org/) An open-source deep learning library developed by Facebook, primarily for use in Python. It is a leading platform in both industry and academia.

Scikit-learn (https://scikit-learn.org/) An open-source Python library providing efficient data mining and data analysis tools. These include methods for classification, regression, clustering, etc.

TensorFlow (https://www.tensorflow.org/) Google's open-source framework for efficient computation, used primarily for machine learning. TensorFlow provides stable APIs for Python and C; it has also been adapted for use in a variety of other programming languages.

AllenNLP (https://allennlp.org/) An open-source NLP research library built on PyTorch. AllenNLP has a number of state-of-the-art models readily available, making it very easy for anyone to use deep learning on NLP tasks.

Fairseq (https://github.com/pytorch/fairseq) A Python toolkit for sequence modeling. It enables users to train models for text generation tasks like machine translation and language modeling.

Gensim (https://radimrehurek.com/gensim/index.html) A scalable, robust, efficient, and hassle-free Python library for unsupervised semantic modelling from plain text. It has a wide range of tools for topic modeling, document indexing, and similarity retrieval.

Natural Language Toolkit (NLTK) (https://www.nltk.org/) A leading platform for building Python programs concerned with human language data. It contains helpful functions for tasks such as tokenization, cleaning, and topic modeling.

PyText (https://pytext-pytext.readthedocs-hosted.com/en/latest/) A deep-learning-based NLP modeling framework built on PyTorch, providing pre-trained models for NLP tasks such as sequence tagging, classification, and contextual intent-slot models.

SpaCy (https://spacy.io/) A remarkably fast Python library for modeling and processing text in 34 different languages. It includes pretrained models to predict named entities, part-of-speech tags, and syntactic dependencies, as well as starter models designed for transfer learning. It also has tools for tokenization, text cleaning, and statistical modeling.

Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) A set of tools developed by the Stanford NLP Group for statistical, neural, and rule-based problems in computational linguistics. Its software provides a simple, useful interface for NLP tasks like NER and part-of-speech (POS) tagging.
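To make the division of labor among these general-purpose libraries concrete, the sketch below builds a small bag-of-words baseline for clinical note classification with scikit-learn alone. The toy notes and labels are invented for illustration; a real system would train on a labeled corpus such as the i2b2/n2c2 datasets described earlier.

# Toy illustration: a TF-IDF + logistic regression baseline for tagging
# short clinical snippets, built entirely with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "Patient presents with chest pain and shortness of breath.",
    "Follow-up visit for type 2 diabetes; A1c improved on metformin.",
    "Acute substernal chest pressure radiating to the left arm.",
    "Routine diabetes management, continue current insulin regimen.",
]
labels = ["cardiac", "endocrine", "cardiac", "endocrine"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(notes, labels)

print(model.predict(["new chest pain on exertion"]))  # expected: ['cardiac']

Neural counterparts of this baseline would replace the TF-IDF features with learned representations from PyTorch- or TensorFlow-based models such as those surveyed in the main text.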
9.2.2 EHR-NLP.

Criteria2Query (http://www.ohdsi.org/web/criteria2query/) [265] A system for automatically transforming clinical research eligibility criteria into executable cohort queries based on the Observational Medical Outcomes Partnership (OMOP) Common Data Model. The system is an information extraction pipeline that combines machine learning and rule-based methods.

CuiTools (http://cuitools.sourceforge.net/) A package of PERL programs for word sense disambiguation (WSD) [152]. Its models perform supervised or unsupervised WSD using both general English knowledge and specific medical concepts extracted from UMLS.

MetaMap (https://metamap.nlm.nih.gov/) A tool to identify medical concepts in text and map them to standard terminologies in the UMLS. MetaMap uses a knowledge-intensive approach based on symbolic methods, NLP, and computational-linguistic techniques.

MIMIC-Extract (https://github.com/MLforHealth/MIMIC_Extract) [241] An open-source pipeline to preprocess and present data from MIMIC-III. MIMIC-Extract has useful features for analysis; for example, it transforms discrete temporal data into time series and extracts clinically relevant targets like mortality from the text.

ScispaCy (https://allenai.github.io/scispacy/) Many NLP models perform poorly under domain shift, so ScispaCy adapts SpaCy's models to process scientific, biomedical, or clinical text. It was developed by the Allen Institute for AI (AI2) in 2019 and includes much of the same functionality as SpaCy.

Unified Medical Language System (UMLS) (https://www.nlm.nih.gov/research/umls/) UMLS is a set of files and software that provides unifying relationships across a number of different medical vocabularies and standards. Its aim is to improve effectiveness and interoperability between biomedical information systems like EHRs. It can be used to link medical terms, drug names, or billing codes across different computer systems.
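As a closing illustration of the EHR-specific tooling above, the sketch below runs a scispaCy pipeline over one sentence of clinical text. It assumes the scispacy package and its small biomedical model en_core_sci_sm have been installed separately (both are distributed via the project page listed above); the example sentence is invented.

# Minimal sketch of biomedical entity extraction with scispaCy, which
# packages biomedical models behind the standard spaCy interface.
# Assumes: pip install scispacy, plus the en_core_sci_sm model release.
import spacy

nlp = spacy.load("en_core_sci_sm")

text = ("The patient was started on metformin for type 2 diabetes "
        "and reports intermittent chest pain on exertion.")
doc = nlp(text)

# doc.ents holds the biomedical terms the model detected; each entity can
# later be normalized against UMLS, e.g. with MetaMap or an entity linker.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)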