Annotating Temporal Information in Clinical Narratives ⇑ Weiyi Sun A, , Anna Rumshisky B, Ozlem Uzuner C
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Journal of Biomedical Informatics 46 (2013) S5–S12 Contents lists available at ScienceDirect Journal of Biomedical Informatics journal homepage: www.elsevier.com/locate/yjbin Methodological Review Annotating temporal information in clinical narratives ⇑ Weiyi Sun a, , Anna Rumshisky b, Ozlem Uzuner c a Department of Informatics, University at Albany, SUNY, 1400 Washington Ave., Draper 114B, Albany, NY 12222, United States b Department of Computer Science, University of Massachusetts, 198 Riverside St., Olsen Hall, Lowell, MA 01854, United States c Department of Information Studies, University at Albany, SUNY, 1400 Washington Ave., Draper 114A, Albany, NY 12222, United States article info abstract Article history: Temporal information in clinical narratives plays an important role in patients’ diagnosis, treatment and Received 21 May 2013 prognosis. In order to represent narrative information accurately, medical natural language processing Accepted 10 July 2013 (MLP) systems need to correctly identify and interpret temporal information. To promote research in this Available online 19 July 2013 area, the Informatics for Integrating Biology and the Bedside (i2b2) project developed a temporally anno- tated corpus of clinical narratives. This corpus contains 310 de-identified discharge summaries, with Keywords: annotations of clinical events, temporal expressions and temporal relations. This paper describes the pro- Natural Language Processing cess followed for the development of this corpus and discusses annotation guideline development, anno- Temporal Reasoning tation methodology, and corpus quality. Medical Informatics Corpus Building Ó 2013 Elsevier Inc. All rights reserved. Annotation 1. Introduction of temporally annotated clinical narratives [2] This corpus contains the clinical history and the hospital course sections of 310 de-iden- Electronic Medical Records (EMRs) contain significant amounts tified discharge summaries from Partners Healthcare and the Beth of unstructured narrative text, which can be turned into structured Israel Deaconess Medical Center, for a total of approximately data with help from automated medical language processing (MLP) 178,000 tokens. The corpus was annotated for three layers of infor- systems. Some sub-areas of MLP, such as de-identification and clin- mation: events, temporal expressions and their normalization, and ical concept (e.g. disorder, medication) extraction are well-studied. temporal relations. Our annotation scheme was adapted from Other areas, such as analysis of the temporal structures embedded TimeML [11]. More specifically, we annotate three types of tempo- in clinical texts, are less so [1]. Besides being a more complicated ral information: (1) EVENTs which represent the semantic events task, we believe that the lack of availability of manually annotated mentioned in the text that affect the patient’s clinical timeline; clinical corpora with temporal information also hindered the pro- (2) TIMEX3s that represent temporal expressions of date, times, gress of MLP in this area [2]. durations, and frequencies; and (3) TLINKs which represent the Temporal information in clinical narratives plays an important temporal relations between EVENTs and TIMEX3s. We refer to this role in medical decision-making and care assessment [3]. Some corpus as the i2b2 temporal relations corpus. examples of clinical applications that utilize temporal information In this paper, we present the process followed for the develop- include: diagnosis, prognosis and treatment decision support [3,4], ment of the i2b2 temporal relations corpus, including the creation time specific clinical information extraction [5–7], and time-re- of a temporal annotation scheme tailored to clinical narratives, the lated question answering [8–10]. These applications rely on tempo- methodology for applying the scheme to the i2b2 temporal rela- ral reasoning systems which extract temporal information from tions corpus, the evaluation of the resulting annotation quality, natural language, and perform temporal inference over the and a description of the resulting annotated corpus. We hope that extracted information. Temporal information in narrative texts in- this paper will: (1) inform the MLP researchers about the temporal cludes the events and the temporal expressions that appear in data preparation process, (2) guide the development of future clin- the text, as well as the temporal relations among them. ical temporal annotation guidelines, (3) caution against pitfalls and In order to develop and evaluate temporal reasoning systems, the issues often raised in the representation of temporal informa- we need clinical corpora annotated with temporal information. tion, and (4) share our solutions to these problems with the Given this need, the 2012 Informatics for Integrating Biology and community. the Bedside (i2b2) project provided the community with a corpus The remainder of this paper is organized as follows: Section 2 summarizes the related work in temporal representation and exist- ⇑ Corresponding author. ing temporally annotated corpora in the general as well as the MLP E-mail address: [email protected] (W. Sun). domains. Section 3 summarizes our annotation guidelines. Sec- 1532-0464/$ - see front matter Ó 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jbi.2013.07.004 S6 W. Sun et al. / Journal of Biomedical Informatics 46 (2013) S5–S12 tion 4 presents our annotation methodology and procedures. The continuing; SLINKs indicate the subordinate relations between annotation quality evaluation and corpus statistics are presented EVENTs (e.g. the conditional or evidential relations between in Sections 5 and 6, respectively. two EVENTs) [22]. The TempEval [17–19] 2007 corpus applied a simplified 2. Related work TimeML annotation, and restricted the TLINK assignment to those (1) between EVENTs and document creation times, that 2.1. Temporal annotation is, the time stamp of the document creation; (2) between EVENTs/TIMEX3s in the same sentence and (3) between main A temporal representation scheme translates time-related infor- EVENTs (syntactically dominating verbs) in adjacent sentences. mation into a computer readable form to support temporal reason- The TempEval 2010 extended the 2007 annotations to multiple ing. Temporal annotation is a type of temporal representation that languages. TempEval 2012 used subsets of the TimeBank and focuses on interpreting time-related natural language information. AQUAINT corpora, as well as an automatically annotated Eng- Defining a temporal representation scheme is non-trivial in that it lish Gigaword corpus [23–25]. requires the specification of many fundamental assumptions about time [12]. This task becomes even more challenging when the tar- In the clinical domain, Galescu and Blaylock [21] applied an geted temporal information is embedded in natural language be- adaptation of TimeML guidelines to 40 discharge summaries [26]. cause time-related concepts are usually vaguely and implicitly Savova et al. [27] also described an adaptation of TimeML to clini- conveyed in free text [13,14]. For instance the verbal event ‘know’ cal narratives. The Clinical E-Science Framework (CLEF) project describes a continuous state, and the event ‘catch’ is instantaneous annotated a corpus of 167 clinical records for temporal relations [2]. [20]; however, they limited their annotations to intra-sentence The most prominent challenges in temporal annotation include: temporal relations and to the temporal relations between events (1) large search space in the assignment of TLINKs. Given the and document creation times. In addition to these full temporal EVENTs and TIMEX3s in one document, the theoretical search relation annotation schema, there are also annotations that focus space for TLINKs is (N-1)N/2 (N: total number of entities). (2) Mul- on some more specific temporal elements in the clinical narratives, tiple ways to represent the same set of TLINKs. TLINKs can be tran- such as conditions, temporal expressions [28,7,29]. These re- sitive (e.g. before or after) or equivalent (e.g. concurrence). For sources served as a good start at addressing the need for a tempo- example, let ‘<’ represent the before relation, ‘>’ for the after rela- rally annotated MLP corpus, and highlighted the need for tion, and ‘=’ for the concurrence relation, and for entities A, B, and comprehensive temporal annotations that can support the extrac- C, we have ‘A < B, B = C’. We can equivalently represent this rela- tion of the complete patient clinical timeline from narrative patient tion with ‘B > A, A < C, B = C’ among many other sets of TLINKs, records. We aimed to fill this gap. which give the annotators flexibility during annotation but whose equivalence is difficult to manually confirm. For this reason, we 3. Design of i2b2 annotation guidelines usually need to compute temporal closure when handling TLINKs. The temporal closure of a set of TLINKs is the set of minimal transi- We built our annotation guidelines on the following principles: tive relations that contains the original TLINKs. In the previous example, the temporal closure of ‘A < B, B = C’ contains the follow- (1) Ease of Use: the annotators should become proficient in the ing relations: ‘A < B, A < C, B > A, C > A, B = C, C = B’. (3) Conflict in task after a few short training sessions, and the human anno- TLINKs. The transitive and equivalent