Structured prediction models for RNN based sequence labeling in clinical text

Abhyuday N Jagannatha (1), Hong Yu (1, 2)
(1) University of Massachusetts, MA, USA
(2) Bedford VAMC and CHOIR, MA, USA
[email protected], [email protected]

Abstract

Sequence labeling is a widely used method for named entity recognition and information extraction from unstructured natural language data. In the clinical domain, one major application of sequence labeling involves extraction of relevant entities such as medication, indication, and side-effects from Electronic Health Record narratives. Sequence labeling in this domain presents its own set of challenges and objectives. In this work we experiment with Conditional Random Field (CRF) based structured learning models with Recurrent Neural Networks. We extend the previously studied CRF-LSTM model with explicit modeling of pairwise potentials. We also propose an approximate version of skip-chain CRF inference with RNN potentials. We use these methods[1] for structured prediction in order to improve the exact phrase detection of clinical entities.

1 Introduction

Patient data collected by hospitals falls into two categories, structured data and unstructured natural language texts. It has been shown that natural text clinical documents such as discharge summaries, progress notes, etc. are rich sources of medically relevant information like adverse drug events, medication prescriptions, diagnosis information etc. Information extracted from these natural text documents can be useful for a multitude of purposes ranging from drug efficacy analysis to adverse effect surveillance.

A widely used method for Information Extraction from natural text documents involves treating the text as a sequence of tokens. This format allows sequence labeling algorithms to label the relevant information that should be extracted. Several sequence labeling algorithms such as Conditional Random Fields (CRFs), Hidden Markov Models (HMMs), and Neural Networks have been used for information extraction from unstructured text. CRFs and HMMs are probabilistic graphical models that have a rich history of Natural Language Processing (NLP) related applications. These methods try to jointly infer the most likely label sequence for a given sentence.

Recently, Recurrent (RNN) or Convolutional Neural Network (CNN) models have increasingly been used for various NLP related tasks. These Neural Networks by themselves, however, do not treat sequence labeling as a structured prediction problem. Different Neural Network models use different methods to synthesize a context vector for each word. This context vector contains information about the current word and its neighboring content. In the case of CNNs, the neighbors comprise words in the same filter size window, while in Bidirectional-RNNs (Bi-RNN) they contain the entire sentence.

Graphical models and Neural Networks have their own strengths and weaknesses. While graphical models predict the entire label sequence jointly, they usually rely on special hand crafted features to provide good results.

[1] Code is available at https://github.com/abhyudaynj/LSTM-CRF-models


Neural Networks (especially Recurrent Neural Networks), on the other hand, have been shown to be extremely good at identifying patterns from noisy text data, but they still predict each word label in isolation and not as a part of a sequence. In simpler terms, RNNs benefit from recognizing patterns in the surrounding input features, while structured learning models like CRFs benefit from the knowledge about neighboring label predictions. Recent work on Named Entity Recognition by Huang et al. (2015) and others has combined the benefits of Neural Networks (NN) with CRFs by modeling the unary potential functions of a CRF as NN models. They model the pairwise potentials as a parameter matrix [A] where the entry A_{i,j} corresponds to the transition probability from label i to label j. Incorporating CRF inference in Neural Network models helps in labeling exact boundaries of various named entities by enforcing pairwise constraints.

This work focuses on labeling clinical events (medication, indication, and adverse drug events) and event related attributes (medication dosage, route, etc.) in unstructured clinical notes from Electronic Health Records. Later on in Section 4, we explicitly define the clinical events and attributes that we evaluate on. In the interest of brevity, for the rest of the paper, we use the broad term "Clinical Entities" to refer to all medically relevant information that we are interested in labeling.

Detecting entities in clinical documents such as Electronic Health Record notes composed by hospital staff presents a somewhat different set of challenges than similar sequence labeling applications in the open domain. This difference is partly due to the critical nature of the medical domain, and partly due to the nature of clinical texts and entities therein. Firstly, in the medical domain, extraction of the exact clinical phrase is extremely important. The names of clinical entities often follow polynomial nomenclature. Disease names such as Uveal melanoma or hairy cell leukemia need to be identified exactly, since partial names (hairy cell or melanoma) might have significantly different meanings. Additionally, important clinical entities can be relatively rare events in Electronic Health Records. For example, mentions of Adverse Drug Events occur once every six hundred words in our corpus. CRF inference with the NN models cited previously does improve exact phrase labeling. However, better ways of modeling the pairwise potential functions of CRFs might lead to improvements in labeling rare entities and detecting exact phrase boundaries.

Another important challenge in this domain is the need to model long term label dependencies. For example, in the sentence "the patient exhibited A secondary to B", the label for A is strongly related to the label prediction of B. A can either be labeled as an adverse drug reaction or a symptom if B is a Medication or Diagnosis respectively. Traditional linear chain CRF approaches that only enforce local pairwise constraints might not be able to model these dependencies. It can be argued that RNNs may implicitly model label dependencies through patterns in input features of neighboring words. While this is true, explicitly modeling the long term label dependencies can be expected to perform better.

In this work, we explore various methods of structured learning using RNN based feature extractors. We use LSTM as our RNN model. Specifically, we model the CRF pairwise potentials using Neural Networks. We also model an approximate version of skip-chain CRF to capture the aforementioned long term label dependencies. We compare the proposed models with two baselines. The first baseline is a standard Bi-LSTM model with softmax output. The second baseline is a CRF model using handcrafted feature vectors. We show that our frameworks improve the performance when compared to the baselines or previously used CRF-LSTM models. To the best of our knowledge, this is the only work focused on usage and analysis of RNN based structured learning techniques on extraction of clinical entities from EHR notes.

2 Related Work

As mentioned in the previous sections, both Neural Networks and Conditional Random Fields have been widely used for sequence labeling tasks in NLP. Especially, CRFs (Lafferty et al., 2001) have a long history of being used for various sequence labeling tasks in general and named entity recognition in particular. Some early notable works include McCallum et al. (2003), Sarawagi et al. (2004) and Sha et al. (2003). Hammerton et al. (2003) and Chiu et al. (2015) used Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for named entity recognition.

Several recent works on both image and text based domains have used structured inference to improve the performance of Neural Network based models. In NLP, Collobert et al. (2011) used Convolutional Neural Networks to model the unary potentials. Specifically for Recurrent Neural Networks, Lample et al. (2016) and Huang et al. (2015) used LSTMs to model the unary potentials of a CRF.

In biomedical named entity recognition, several approaches use a biological corpus annotated with entities such as protein or gene names. Settles (2004) used Conditional Random Fields to extract occurrences of protein, DNA and similar biological entity classes. Li et al. (2015) recently used LSTM for named entity recognition of protein/gene names from the BioCreative corpus. Gurulingappa et al. (2010) evaluated various existing biomedical dictionaries on extraction of adverse effects and diseases from a corpus of Medline abstracts.

This work uses a real world clinical corpus of Electronic Health Records annotated with various clinical entities. Jagannatha et al. (2016) recently showed that RNN based models outperform CRF models on the task of medical event detection in clinical documents. Other works using a real world clinical corpus include Rochefort et al. (2015), who worked on narrative radiology reports. They used an SVM-based classifier with a bag of words feature vector to predict deep vein thrombosis and pulmonary embolism. Miotto et al. (2016) used a denoising autoencoder to build an unsupervised representation of Electronic Health Records which could be used for predictive modeling of a patient's health.

3 Methods

We evaluate the performance of three different Bi-LSTM based structured prediction models described in sections 3.2, 3.3 and 3.4. We compare this performance with two baseline methods, the Bi-LSTM (3.1) and CRF (3.5) models.

3.1 Bi-LSTM (baseline)

This model is a standard bidirectional LSTM neural network with a word embedding input layer and a softmax output layer. The raw natural language input sentence is processed with a regular expression tokenizer into a sequence of tokens x = [x_t]_1^T. The token sequence is fed into the embedding layer, which produces a dense vector representation of words. The word vectors are then fed into a bidirectional RNN layer. This bidirectional RNN along with the embedding layer is the main machinery responsible for learning a good feature representation of the data. The output of the bidirectional RNN produces a feature vector sequence ω(x) = [ω(x)]_1^T with the same length as the input sequence x. In this baseline model, we do not use any structured inference. Therefore this model alone can be used to predict the label sequence, by scaling and normalizing [ω(x)]_1^T. This is done by using a softmax output layer, which scales the output for a label l, where l ∈ {1, 2, ..., L}, as follows:

P(\tilde{y}_t = j \mid x) = \frac{\exp(\omega(x)_t W_j)}{\sum_{l=1}^{L} \exp(\omega(x)_t W_l)}    (1)

The entire model is trained end-to-end using a categorical cross-entropy loss.

3.2 Bi-LSTM CRF

This model is adapted from the Bi-LSTM CRF model described in Huang et al. (2015). It combines the bidirectional RNN layer [ω(x)]_1^T described above with linear chain CRF inference. For a general linear chain CRF, the probability of a label sequence ỹ for a given sentence x can be written as:

P(\tilde{y} \mid x) = \frac{1}{Z} \prod_{t=1}^{N} \exp\{\phi(\tilde{y}_t) + \psi(\tilde{y}_t, \tilde{y}_{t+1})\}    (2)

where φ(y_t) is the unary potential for the label position t and ψ(y_t, y_{t+1}) is the pairwise potential between the positions t and t+1. Similar to Huang et al. (2015), the outputs of the bidirectional RNN layer ω(x) are used to model the unary potentials of a linear chain CRF. In particular, the NN based unary potential φ_nn(y_t) is obtained by passing ω(x)_t through a standard feed-forward tanh layer. The binary potentials or transition scores are modeled as a matrix [A]_{L×L}. Here L equals the number of possible labels including the Outside label. Each element A_{i,j} represents the transition score from label i to label j.
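To make this scoring concrete before the normalized probability is written out in equation 3 below, here is a minimal numpy sketch of how a candidate label sequence is scored from the unary potentials and the transition matrix; the values are random placeholders rather than trained Bi-LSTM outputs, and the helper name is ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 6, 5                       # sentence length, number of labels (incl. Outside)

phi = rng.normal(size=(T, L))     # unary potentials phi_nn(y_t), placeholders for Bi-LSTM outputs
A = rng.normal(size=(L, L))       # transition matrix: A[i, j] scores a transition from label i to j

def sequence_score(y):
    """Unnormalized log-score of a candidate label sequence y of length T."""
    unary = phi[np.arange(T), y].sum()      # sum of unary potentials along the sequence
    pairwise = A[y[:-1], y[1:]].sum()       # sum of transition scores between adjacent labels
    return unary + pairwise

y = rng.integers(0, L, size=T)
print(sequence_score(y))
```

Equation 3 below exponentiates exactly this quantity and normalizes it by the partition function Z.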

The probability for a given sequence ỹ can then be calculated as:

P(\tilde{y} \mid x; \theta) = \frac{1}{Z} \prod_{t=1}^{T} \exp\{\phi_{nn}(\tilde{y}_t) + A_{\tilde{y}_t, \tilde{y}_{t+1}}\}    (3)

The network is trained end-to-end by minimizing the negative log-likelihood of the ground truth label sequence ŷ for a sentence x as follows:

\mathcal{L}(x, \hat{y}; \theta) = -\sum_{t}\sum_{y_t} \delta(y_t = \hat{y}_t) \log P(y_t \mid x; \theta)    (4)

The negative log likelihood of a given label sequence for an input sequence is calculated by sum-product message passing. Sum-product message passing is an efficient method for exact inference in Markov chains or trees.

Labels        Num. of Instances    Avg word length ± std
ADE           1807                 1.68 ± 1.22
Indication    3724                 2.20 ± 1.79
Other SSD     40984                2.12 ± 1.88
Severity      3628                 1.27 ± 0.62
Drugname      17008                1.21 ± 0.60
Duration      926                  2.01 ± 0.74
Dosage        5978                 2.09 ± 0.82
Route         2862                 1.20 ± 0.47
Frequency     5050                 2.44 ± 1.70

Table 1: Annotation statistics for the corpus.

3.3 Bi-LSTM CRF with pairwise modeling

In the previous section, the pairwise potential is calculated through a transition probability matrix [A] irrespective of the current context or word. For reasons mentioned in section 1, this might not be an effective strategy. Some clinical entities are relatively rare. Therefore a transition from an Outside label to a clinical label might not be effectively modeled by a fixed parameter matrix. In this method, the pairwise potentials are modeled through a non-linear Neural Network which is dependent on the current word and context. Specifically, the pairwise potential ψ(y_t, y_{t+1}) in equation 2 is computed by using a one dimensional CNN with 1-D filter size 2 and tanh non-linearity. At every label position t, it takes [ω(x)_t; ω(x)_{t+1}] as input and produces an L × L pairwise potential output ψ_nn(y_t, y_{t+1}). This CNN layer effectively acts as a non-linear feed-forward neuron layer, which is repeatedly applied on consecutive pairs of label positions. It uses the output of the bidirectional LSTM layer at positions t and t+1 to prepare the pairwise potential scores.

The unary potential calculation is kept the same as in Bi-LSTM-CRF. Substituting the neural network based pairwise potential ψ_nn(y_t, y_{t+1}) into equation 2, we can reformulate the probability of the label sequence ỹ given the word sequence x as:

P(\tilde{y} \mid x; \theta) = \frac{1}{Z} \prod_{t=1}^{N} \exp\{\phi_{nn}(\tilde{y}_t) + \psi_{nn}(\tilde{y}_t, \tilde{y}_{t+1})\}    (5)

The neural network is trained end-to-end with the objective of minimizing the negative log likelihood in equation 4. The negative log-likelihood scores are obtained by sum-product message passing.

3.4 Approximate Skip-chain CRF

Skip chain models are modifications to linear chain CRFs that allow long term label dependencies through the use of skip edges. These are basically edges between label positions that are not adjacent to each other. Due to these skip edges, the skip chain CRF model (Sutton and McCallum, 2006) explicitly models dependencies between labels which might be more than one position apart. The joint inference over these dependencies is taken into account while decoding the best label sequence. However, unlike the two models explained in the preceding sections, the skip-chain CRF contains loops between label variables. As a result we cannot use the sum-product message passing method to calculate the negative log-likelihood. The loopy structure of the graph in a skip chain CRF renders exact message passing inference intractable. Approximate solutions for these models include loopy belief propagation (BP), which requires multiple iterations of message passing.

However, an approach like loopy BP is prohibitively expensive in our model with large Neural Net based potential functions. The reason for this is that each gradient descent iteration for a combined RNN-CRF model requires a fresh calculation of the marginals. In one approach to mitigate this, Lin et al. (2015) directly model the messages in the message passing inference of a 2-D grid CRF for image segmentation. This bypasses the need for modeling the potential function, as well as calculating the approximate messages on the graph using loopy BP.
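For reference, the exact sum-product computation that is tractable for the two linear-chain models above (equations 3-5) but not for the skip-chain model can be sketched as follows; this is a minimal numpy illustration with toy potentials, not the authors' implementation, and a fixed transition matrix A would simply replace the position-dependent psi below.

```python
import numpy as np

rng = np.random.default_rng(1)
T, L = 6, 5

phi = rng.normal(size=(T, L))         # unary potentials phi_nn(y_t)
psi = rng.normal(size=(T - 1, L, L))  # pairwise potentials; a fixed matrix A would be repeated here

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_partition(phi, psi):
    """Forward (sum-product) pass over the linear chain: returns log Z."""
    alpha = phi[0]
    for t in range(1, len(phi)):
        # alpha[j] = phi_t(j) + logsumexp_i( alpha[i] + psi_{t-1}(i, j) )
        alpha = phi[t] + logsumexp(alpha[:, None] + psi[t - 1], axis=0)
    return logsumexp(alpha, axis=0)

def neg_log_likelihood(y, phi, psi):
    """Negative log-likelihood of the gold label sequence y, as in equation 4."""
    idx = np.arange(len(y))
    gold_score = phi[idx, y].sum() + psi[idx[:-1], y[:-1], y[1:]].sum()
    return log_partition(phi, psi) - gold_score

y_gold = rng.integers(0, L, size=T)
print(neg_log_likelihood(y_gold, phi, psi))
```

The approximation described next sidesteps this kind of exact pass by estimating the required quantities directly.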

Approximate CRF message passing inference: Lin et al. (2015) directly estimate the factor-to-variable message using a Neural Network that uses input image features. Their underlying reasoning is that the factor-to-variable message from factor F to label variable y_t for any iteration of loopy BP can be approximated as a function of all the input variables and previous messages that are a part of that factor. They only model one iteration of loopy BP, and empirically show that it leads to an appreciable increase in performance. This allows them to model the messages as a function of only the input variables, since the messages for the first iteration of message passing are computed using the potential functions alone.

We follow a similar approach for the calculation of variable marginals in our skip chain model. However, instead of estimating individual factor-to-variable messages, we exploit the sequence structure in our problem and estimate groups of factor-to-variable messages. For any label node y_t, the first group contains factors that involve nodes which occur before y_t in the sentence (from the left). The second group of factor-to-variable messages corresponds to factors involving nodes occurring later in the sentence. We use recurrent computational units like LSTM to estimate the sum of log factor-to-variable messages within a group. Essentially, we use bidirectional recurrent computation to estimate all the incoming factors from left and right separately.

To formulate this, let us assume for now that we are using skip edges to connect the current node t to m preceding and m following nodes. Each edge, skip or otherwise, is denoted by a factor which contains the binary potential of the edge and the unary potential of the connected node. As mentioned earlier, we will divide the factors associated with node t into two sets, F_L(t) and F_R(t). Here F_L(t) contains all factors formed between the variables from the group {y_{t−m}, ..., y_{t−1}} and y_t. So we can formulate the combined message from factors in F_L(t) as

\beta_L(y_t) = \sum_{F \in F_L(t)} \beta_{F \to t}(y_t)    (6)

The combined message from factors in F_R(t), which contains variables from y_{t+1} to y_{t+m}, can be formulated as:

\beta_R(y_t) = \sum_{F \in F_R(t)} \beta_{F \to t}(y_t)    (7)

We also need the unary potential of the label variable y_t to compose its marginal. The unary potentials of each variable from {y_{t−m}, ..., y_{t−1}} and {y_{t+1}, ..., y_{t+m}} should already be included in their respective factors. The log of the unnormalized marginal P̄(y_t | x) for the variable y_t can therefore be calculated by

\log \bar{P}(y_t \mid x) = \beta_R(y_t) + \beta_L(y_t) + \phi(y_t)    (8)

Similar to Lin et al. (2015), in the interest of limited network complexity, we use only one message passing iteration. In our setup, this means that a variable-to-factor message from a neighboring variable y_i to the current variable y_t contains only the unary potential of y_i and the binary potential between y_i and y_t. As a consequence of this, we can see that β_L(y_t) can be written as:

\beta_L(y_t) = \sum_{i=t-m}^{t-1} \log \sum_{y_i} \exp\{\psi(y_t, y_i) + \phi(y_i)\}    (9)

Similarly, we can formulate a function for β_R(y_t):

\beta_R(y_t) = \sum_{i=t+1}^{t+m} \log \sum_{y_i} \exp\{\psi(y_t, y_i) + \phi(y_i)\}    (10)

Modeling the messages using RNN: As mentioned previously in equation 8, we only need to estimate β_L(y_t), β_R(y_t) and φ(y_t) to calculate the marginal of the variable y_t. We can use the φ_nn(y_t) framework introduced in section 3.2 to estimate the unary potential for y_t. We use different directions of a bidirectional LSTM to estimate β_R(y_t) and β_L(y_t). This eliminates the need to explicitly model and learn pairwise potentials for variables that are not immediate neighbors. The input to this layer at position t is [φ_nn(y_t); ψ_nn(y_t, y_{t+1})] (composed of the potential functions described in section 3.3). This can be viewed as an LSTM layer aggregating beliefs about y_t from the unary and binary potentials of [y]_1^{t−1} to approximate the sum of messages from the left side, β_L(y_t).
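To make equations 8-10 concrete before the text continues with the right-to-left direction, here is a minimal numpy sketch of the explicit message sums that the bidirectional LSTM is trained to approximate; the potentials are toy values rather than the learned φ_nn and ψ_nn, and m is the skip-edge radius.

```python
import numpy as np

rng = np.random.default_rng(2)
T, L, m = 8, 5, 2                     # sentence length, label count, skip-edge radius

phi = rng.normal(size=(T, L))         # unary potentials phi(y_i)
psi = rng.normal(size=(T, T, L, L))   # toy pairwise potentials psi(y_t, y_i) for every pair (t, i)

def logsumexp(a, axis):
    mx = a.max(axis=axis, keepdims=True)
    return (mx + np.log(np.exp(a - mx).sum(axis=axis, keepdims=True))).squeeze(axis)

def beta_left(t):
    """Equation 9: summed log messages from the m nodes preceding position t."""
    total = np.zeros(L)
    for i in range(max(0, t - m), t):
        total += logsumexp(psi[t, i] + phi[i][None, :], axis=1)   # marginalize over y_i
    return total

def beta_right(t):
    """Equation 10: summed log messages from the m nodes following position t."""
    total = np.zeros(L)
    for i in range(t + 1, min(T, t + m + 1)):
        total += logsumexp(psi[t, i] + phi[i][None, :], axis=1)
    return total

t = 4
log_unnormalized_marginal = beta_right(t) + beta_left(t) + phi[t]   # equation 8
print(log_unnormalized_marginal)
```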

Similarly, β_R(y_t) can be approximated from the LSTM aggregating information from the opposite direction. Formally, β_L(y_t) is approximated as a function of neural network based unary and binary potentials as follows:

\beta_L(y_t) \approx f\big([\phi_{nn}(y_i); \psi_{nn}(y_i, y_{i+1})]_1^{t-1}\big)    (11)

Using LSTM as the choice for recurrent computation here is advantageous, because LSTMs are able to learn long term dependencies. In our framework, this allows them to learn to prioritize more relevant potential functions from the sequence [φ_nn(y_i); ψ_nn(y_i, y_{i+1})]_1^{t−1}. Another advantage of this method is that we can approximate skip edges between all preceding and following nodes, instead of modeling just the m surrounding ones. This is because LSTM states are maintained throughout the sentence.

The partition function for y_t can be easily obtained by using logsumexp over all label entries of the unnormalized log marginal shown in equation 8 as follows:

Z_t = \sum_{y_t} \exp[\beta_R(y_t) + \beta_L(y_t) + \phi(y_t)]    (12)

Here the partition function Z_t is different for different positions t. Due to our approximations, it is not guaranteed that the partition functions calculated from different marginals of the same sentence are equal. The normalized marginal can now be calculated by normalizing log P̄(y_t | x) in equation 8 using Z_t. The model is optimized using a cross entropy loss between the true marginal and the predicted marginal. The loss for a sentence x with a ground truth label sequence ŷ is provided in equation 13.

\mathcal{L}(x, \hat{y}; \theta) = -\sum_{t}\sum_{y_t} \delta(y_t = \hat{y}_t)\big(\beta_R(y_t; \theta) + \beta_L(y_t; \theta) + \phi(y_t; \theta) - \log Z_t(\theta)\big)    (13)

3.5 CRF (baseline)

We use the linear chain CRF, which is a widely used model in extraction of clinical named entities. As mentioned previously, Conditional Random Fields explicitly model dependencies between output variables conditioned on a given input sequence.

The main inputs to the CRF in this model are not RNN outputs, but word inputs and their corresponding word vector representations. We add additional sentence features consisting of four vectors. Two of them are bag of words representations of the sentence sections before and after the word respectively. The remaining two vectors are dense vector representations of the same sentence sections. The dense vectors are calculated by taking the mean of all individual word vectors in the sentence section. We add these features to explicitly mimic information provided by the bidirectional chains of the LSTM models.

Models                        Strict Evaluation (Exact Match)     Relaxed Evaluation (Word based)
                              Recall    Precision   F-score       Recall    Precision   F-score
CRF                           0.7385    0.8060      0.7708        0.7889    0.8040      0.7964
Bi-LSTM                       0.8101    0.7845      0.7971        0.8402    0.8720      0.8558
Bi-LSTM CRF                   0.7890    0.8066      0.7977        0.8068    0.8839      0.8436
Bi-LSTM CRF-pair              0.8073    0.8266      0.8169        0.8245    0.8527      0.8384
Approximate Skip-Chain CRF    0.8364    0.8062      0.8210        0.8614    0.8651      0.8632

Table 2: Cross validated micro-average of Precision, Recall and F-score for all clinical tags.

4 Dataset

We use an annotated corpus of 1154 English Electronic Health Records from cancer patients. Each note was annotated[2] by two annotators who label clinical entities into several categories. These categories can be broadly divided into two groups, Clinical Events and Attributes. Clinical events include any specific event that causes or might contribute to a change in a patient's medical status. Attributes are phrases that describe certain important properties about the events.

[2] The annotation guidelines can be found at https://github.com/abhyudaynj/LSTM-CRF-models/blob/master/annotation.md

Figure 1: Plots of Recall, Precision and F-score for RNN based methods. The metrics with prefix Strict are using phrase based evaluation. Relaxed metrics use word based evaluation. Bar-plots are in order with Bi-LSTM on top and Approx-skip-chain-CRF at the bottom.

Clinical Event categories in this corpus are Adverse Drug Event (ADE), Drugname, Indication and Other Sign Symptom and Diseases (Other SSD). ADE, Indication and Other SSD are events having a common vocabulary of Signs, Symptoms and Diseases (SSD). They can be differentiated based on the context that they are used in. A certain SSD should be labeled as ADE if it can be manually identified as a side effect of a drug based on the evidence in the clinical note. It is an Indication if it is an affliction that a doctor is actively treating with a medication. Any other SSD that does not fall into the above two categories (e.g. an SSD in the patient's history) is labeled as Other SSD. The Drugname event labels any medication or procedure that a physician prescribes.

The attribute categories contain the following properties: Severity, Route, Frequency, Duration and Dosage. Severity is an attribute of the SSD event types, used to label the severity of a disease or symptom. Route, Frequency, Duration and Dosage are attributes of Drugname. They are used to label the medication method, frequency of dosage, duration of dosage, and the dosage quantity respectively. The annotation statistics of the corpus are provided in Table 1.

5 Experiments

Each document is split into separate sentences and the sentences are tokenized into individual word and special character tokens. The models operate on the tokenized sentences. In order to accelerate the training procedure, all LSTM models use batch-wise training with a batch of 64 sentences. In order to do this, we restricted the sentence length to 50 tokens. All sentences longer than 50 tokens were split into shorter samples, and shorter sentences were pre-padded with masks. The CRF baseline model (3.5) does not use batch training and so the sentences were used unaltered.
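A minimal sketch of the sentence-length handling just described (splitting long sentences into chunks of at most 50 tokens and pre-padding shorter ones); the mask token and helper names are illustrative, not taken from the released code.

```python
def split_and_prepad(tokens, max_len=50, pad="<MASK>"):
    """Split a token list into chunks of at most max_len tokens and pre-pad each chunk."""
    chunks = [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)] or [[]]
    return [[pad] * (max_len - len(c)) + c for c in chunks]

def batches(sentences, batch_size=64):
    """Yield fixed-size batches of fixed-length, pre-padded token sequences."""
    padded = [chunk for s in sentences for chunk in split_and_prepad(s)]
    for i in range(0, len(padded), batch_size):
        yield padded[i:i + batch_size]

sample = ["the patient exhibited rash secondary to penicillin".split()]
for batch in batches(sample):
    print(len(batch), len(batch[0]))   # 1 sentence in the batch, padded to 50 tokens
```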

The first layer for all LSTM models was a 200 dimensional word embedding layer. In order to improve performance, we initialized the embedding layer values in these models with a skip-gram word embedding (Mikolov et al., 2013). The skip-gram embedding was calculated using a combined corpus of PubMed open access articles, English Wikipedia and an unlabeled corpus of around a hundred thousand Electronic Health Records. The EHRs used in the annotated corpus are not in this unlabeled EHR corpus. This embedding is also used to provide word vector representations to the CRF baseline model.

The bidirectional LSTM layer which outputs ω(x) contains LSTM neurons with a hidden size ranging from 200 to 250. This hidden size is kept variable in order to control for the number of trainable parameters between different LSTM based models. This helps ensure that the improved performance in these models is only because of the modified model structure, and not an increase in trainable parameters. The hidden size is varied in such a way that the number of trainable parameters is close to 3.55 million. Therefore, the Approx skip chain CRF has a hidden layer size of 200, while the standard Bi-LSTM model has a hidden layer size of 250. Since the ω(x) layer is bidirectional, this effectively means that the Bi-LSTM model has a 500 dimensional hidden layer, while the Approx skip chain CRF model has a 400 dimensional hidden layer.

We use dropout (Srivastava et al., 2014) with a probability of 0.50 in all LSTM models in order to improve regularization performance. We also use batch norm (Ioffe and Szegedy, 2015) between layers wherever possible in order to accelerate training. All RNN models are trained in an end-to-end fashion using Adagrad (Duchi et al., 2011) with momentum. The CRF model was trained using L-BFGS with L2 regularization. We use Begin Inside Outside (BIO) label modifiers for models that use a CRF objective.

We use ten-fold cross validation for our results. The documents are divided into training and test documents. From each training set fold, 20% of the sentences form the validation set which is used for model evaluation during training and for early stopping.

We report the word based and exact phrase match based micro-averaged recall, precision and F-score. Exact phrase match based evaluation is calculated on a per phrase basis, and considers a phrase as positively labeled only if the phrase exactly matches the true boundary and label of the reference phrase. The word based evaluation metric is calculated on labels of individual words. A word's predicted label is considered correct if it matches the reference label, irrespective of whether the remaining words in its phrase are labeled correctly. Word based evaluation is a more relaxed metric than phrase based evaluation.

6 Results

The micro-averaged Precision, Recall and F-score for all five models are shown in Table 2. We report both strict (exact match) and relaxed (word based) evaluation results. As shown in Table 2, the best performance is obtained by the Skip-Chain CRF (0.8210 for strict and 0.8632 for relaxed evaluation). All LSTM based models outperform the CRF baseline. Bi-LSTM-CRF and Bi-LSTM-CRF-pair models using exact CRF inference improve the precision of strict evaluation by 2 to 5 percentage points. Bi-LSTM CRF-pair achieved the highest precision for exact-match. However, the recall (both strict and relaxed) for exact CRF-LSTM models is less than Bi-LSTM. This reduction in recall is much less in the Bi-LSTM-pair model. In relaxed evaluation, only the Skip Chain model has a better F-score than the baseline LSTM. Overall, Bi-LSTM-CRF-pair and Approx-Skip-Chain models lead to performance improvements. However, the standard Bi-LSTM-CRF model does not provide an appreciable increase over the baseline.

Figure 1 shows the breakdown of performance for each RNN model with respect to individual clinical entity labels. The CRF baseline model performance is not shown in Figure 1, because its performance is consistently lower than the Bi-LSTM-CRF model across all label categories. We use a pairwise t-test on the strict evaluation F-score for each fold in cross validation to calculate the statistical significance of our scores. The improvement in F-score for Bi-LSTM-CRF-pair and Approx-Skip Chain as compared to the Bi-LSTM baseline is statistically significant (p < 0.01). The difference between Bi-LSTM-CRF and the Bi-LSTM baseline does not appear to be statistically significant (p > 0.05). However, the improvements over the CRF baseline for all LSTM models are statistically significant.
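Since the strict/relaxed distinction in Table 2 drives much of the discussion that follows, here is a minimal sketch of how the two views can be computed from BIO-tagged sequences; this is a hypothetical helper for illustration, not the authors' evaluation script, and it omits the micro-averaging across folds.

```python
def phrases(tags):
    """Extract (start, end, label) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):              # sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return set(spans)

def prf(tp, n_pred, n_gold):
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def strict_scores(gold, pred):
    """Strict evaluation: a phrase counts only if boundary and label both match exactly."""
    g, p = phrases(gold), phrases(pred)
    return prf(len(g & p), len(p), len(g))

def relaxed_scores(gold, pred):
    """Relaxed evaluation: every non-Outside token is scored independently."""
    tp = sum(1 for gt, pt in zip(gold, pred) if gt != "O" and pt != "O" and gt[2:] == pt[2:])
    return prf(tp, sum(t != "O" for t in pred), sum(t != "O" for t in gold))

gold = ["O", "B-ADE", "I-ADE", "O", "B-Drugname"]
pred = ["O", "B-ADE", "O", "O", "B-Drugname"]
print(strict_scores(gold, pred))    # exact-match view
print(relaxed_scores(gold, pred))   # word-level view
```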

7 Discussion

Overall, the Approx-Skip-Chain CRF model achieved better F-scores than CRF, Bi-LSTM and Bi-LSTM-CRF in both strict and relaxed evaluations. The results of strict evaluation, as shown in Figure 1, are our main focus of discussion due to their importance in the clinical domain. As expected, the two exact inference-based CRF-LSTM models (Bi-LSTM-CRF and Bi-LSTM-CRF-pair) show the highest precision for all labels. Approx-Skip-Chain CRF's precision is lower (due to approximate inference) but it still mostly outperforms Bi-LSTM. The recall for Skip Chain CRF is almost equal to or better than all other models due to its robustness in modeling dependencies between distant labels. The variations in recall contribute to the major differences in F-scores. These variations can be due to several factors including the rarity of that label in the dataset, the complexity of phrases of a particular label, etc.

We believe the exact CRF-LSTM models described here require more training samples than the baseline Bi-LSTM to achieve a comparable recall for labels that are complex or "difficult to detect". For example, as shown in Table 1, we can divide the labels into frequent (Other SSD, Indication, Severity, Drugname, Dosage, and Frequency) and rare or sparse (Duration, ADE, Route). We can make a broad generalization that exact CRF models (especially Bi-LSTM-CRF) have somewhat lower recall for rare labels. This is true for most labels except for Route, Indication, and Severity. The CRF models have very close recall (0.780, 0.782) to the baseline Bi-LSTM (0.803) for Route even though its number of incidences is lower (2,862 incidences) than Indication (3,724 incidences) and Severity (3,628 incidences), both of which have lower recall even though their incidences are much higher.

The complexity of each label can explain the aforementioned phenomenon. Route, for instance, frequently contains unique phrases such as "by mouth" or "p.o.," and is therefore easier to detect. In contrast, Indication is ambiguous. Its vocabulary is close to two other labels: ADE (1,807 incidences) and the most populous Other SSD (40,984 incidences). As a consequence, it is harder to separate the three labels. Models need to learn cues from the surrounding context, which is more difficult and requires more samples. This is why the recall for Indication is lower for CRF-LSTM models, even though its number of incidences is higher than Route. To further support our explanation, our results show that the exact CRF-LSTM models mislabeled around 40% of Indication words as Other SSD, as opposed to just 20% in the case of the Bi-LSTM baseline. The label Severity is a similar case. It contains non-label-specific phrases such as "not terribly", "very rare" and "small area," which may explain why almost 35% of Severity words are mislabeled as Outside by the Bi-LSTM-CRF as opposed to around 20% by the baseline.

It is worthwhile to note that among exact CRF-LSTM models, the recall for Bi-LSTM-CRF-pair is much better than Bi-LSTM-CRF even for sparse labels. This validates our initial hypothesis that Neural Net based pairwise modeling may lead to better detection of rare labels.

8 Conclusion

We have shown that modeling pairwise potentials and using an approximate version of Skip-chain inference increase the performance of LSTM-CRF models. We also show that these models perform much better than baseline LSTM and CRF models. These results suggest that structured prediction models are good directions for improving the exact phrase extraction of clinical entities.

Acknowledgments

We thank the UMassMed annotation team: Elaine Freund, Wiesong Liu, Steve Belknap, Nadya Frid, Alex Granillo, Heather Keating, and Victoria Wang for creating the gold standard evaluation set used in this work. We also thank the anonymous reviewers for their comments and suggestions.

This work was supported in part by the grant HL125089 from the National Institutes of Health (NIH). We also acknowledge the support from the United States Department of Veterans Affairs (VA) through Award 1I01HX001457. This work was also supported in part by the Center for Intelligent Information Retrieval. The contents of this paper do not represent the views of CIIR, NIH, VA or the United States Government.

References

Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Harsha Gurulingappa, Roman Klinger, Martin Hofmann-Apitius, and Juliane Fluck. 2010. An empirical evaluation of resources for the identification of diseases and adverse effects in biomedical literature. In 2nd Workshop on Building and evaluating resources for biomedical text mining (7th edition of the Language Resources and Evaluation Conference).

James Hammerton. 2003. Named entity recognition with long short-term memory. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 172–175. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Abhyuday Jagannatha and Hong Yu. 2016. Bidirectional rnn for medical event detection in electronic health records. In Proceedings of NAACL-HLT, pages 473–482.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.

Lishuang Li, Liuke Jin, Zhenchao Jiang, Dingxin Song, and Degen Huang. 2015. Biomedical named entity recognition based on extended recurrent neural networks. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages 649–652. IEEE.

Guosheng Lin, Chunhua Shen, Ian Reid, and Anton van den Hengel. 2015. Deeply learning the messages in message passing inference. In Advances in Neural Information Processing Systems, pages 361–369.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 188–191. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. 2016. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 6:26094.

Christian M Rochefort, Aman D Verma, Tewodros Eguale, Todd C Lee, and David L Buckeridge. 2015. A novel method of adverse event detection can accurately identify venous thromboembolisms (vtes) from narrative electronic health record data. Journal of the American Medical Informatics Association, 22(1):155–165.

Sunita Sarawagi and William W Cohen. 2004. Semi-markov conditional random fields for information extraction. In Advances in neural information processing systems, pages 1185–1192.

Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 104–107. Association for Computational Linguistics.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 134–141. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. Introduction to statistical relational learning, pages 93–128.
