A Hybrid System for Temporal Information Extraction from Clinical Text Buzhou Tang,1,2 Yonghui Wu,1 Min Jiang,1 Yukun Chen,3 Joshua C Denny,3,4 Hua Xu1,3
Total Page:16
File Type:pdf, Size:1020Kb
Research and applications A hybrid system for temporal information extraction from clinical text Buzhou Tang,1,2 Yonghui Wu,1 Min Jiang,1 Yukun Chen,3 Joshua C Denny,3,4 Hua Xu1,3 1School of Biomedical ABSTRACT than general English texts due to lack of formalism Informatics, The University of Objective To develop a comprehensive temporal in writing quality. Texas Health Science Center at Houston, Houston, Texas, USA information extraction system that can identify events, To accelerate TIE research in the medical 2Department of Computer temporal expressions, and their temporal relations in domain, the 2012 Informatics for Integrating Science, Harbin Institute of clinical text. This project was part of the 2012 i2b2 Biology and Beside (i2b2) clinical NLP challenge Technology Shenzhen Graduate clinical natural language processing (NLP) challenge on focused on extraction of temporal information School, Shenzhen, Guangdong, temporal information extraction. from hospital discharge summaries.4 The challenge China 3Department of Biomedical Materials and methods The 2012 i2b2 NLP consists of three sub-tasks: (1) clinical event extrac- Informatics, Vanderbilt challenge organizers manually annotated 310 clinic tion with relevant attributes (eg, polarity) in clinic University, School of Medicine, notes according to a defined annotation guideline: a text; (2) temporal expression extraction, which Nashville, Tennessee, USA training set of 190 notes and a test set of 120 notes. requires both identification and normalization of 4Department of Medicine, Vanderbilt University, School of All participating systems were developed on the training text strings indicating date, time, and duration; and Medicine, Nashville, set and evaluated on the test set. Our system consists of (3) temporal relation extraction, which determines Tennessee, USA three modules: event extraction, temporal expression if a temporal link (‘TLink’) exists between two extraction, and temporal relation (also called Temporal events, two times, or one event and one time, and Correspondence to Link, or ‘TLink’) extraction. The TLink extraction module what type (before, after,oroverlap) of TLink it is. Dr Hua Xu, School of fi Biomedical Informatics, contains three individual classi ers for TLinks: The TLink extraction task was further divided into The University of Texas Health (1) between events and section times, (2) within a two tracks: ‘end-to-end’ (using system generated Science Center at Houston, sentence, and (3) across different sentences. The events and temporal expressions) and TLink-only 7000 Fannin St, Suite 600, performance of our system was evaluated using scripts (using gold standard events and temporal expres- Houston, TX 77030, USA; [email protected] provided by the i2b2 organizers. Primary measures were sions). In this paper, we describe our TIE system micro-averaged Precision, Recall, and F-measure. developed for the i2b2 challenge. It is a compre- Received 9 January 2013 Results Our system was among the top ranked. It hensive pipeline-based system that addressed all Revised 11 March 2013 achieved F-measures of 0.8659 for temporal expression sub-tasks and was the top-ranked system in the Accepted 18 March 2013 extraction (ranked fourth), 0.6278 for end-to-end TLink i2b2 challenge in both the end-to-end TLink and Published Online First fi 9 April 2013 track (ranked rst), and 0.6932 for TLink-only track TLink-only evaluations. (ranked first) in the challenge. We subsequently investigated different strategies for TLink extraction, and BACKGROUND were able to marginally improve performance with an In the general English domain, many investigators F-measure of 0.6943 for TLink-only track. have studied TIE from natural-language text corpora, such as newswires. TIE work began pri- marily with temporal representation in the 1980s. An important work was the interval-based algebra INTRODUCTION for representing temporal information in natural Temporal information extraction (TIE) is a challen- language, proposed by Allen in 1983.5 Many early ging area of natural language processing (NLP) studies adopted Allen’s representation, which research and is an important component of many promptly became a standard. In the 1990s, the NLP systems, such as question answering, docu- widespread development of large annotated text ment summarization, and machine translation. For corpora for NLP advanced TIE research dramatic- clinical NLP systems that process medical narrative ally. Community-wide information extraction tasks data, accurate recognition and interpretation of the started to include TIE tasks. The message under- timing of medical events is crucial for many standing conferences (MUCs) sponsored by the US medical reasoning tasks. To generate a complete government organized two consecutive temporal- timeline of medical events of a patient, a clinical related tasks: MUC-6 (1995)6 and MUC-7 (1998).7 TIE system must be able to identify events (eg, In MUC-6, extracting absolute time information medical concepts), temporal expressions (eg, dates (ie, extracting exactly-specified times in the text) associated with events), and temporal relations was a part of a general named entity recognition between two events. Although significant efforts (NER) task. In MUC-7, the TIE task was expanded have been devoted to the representation, annota- to include extraction of relative times also. These tion, and extraction of temporal information in the two tasks defined the Timex tags, which interpret general English domain (eg, the TimeML frame- time expressions into a normalized ISO standard 1 89 To cite: Tang B, Wu Y, work ), the performance of the state-of-the-art form through the TIDES Timex2 guidelines. In Jiang M, et al. J Am Med TIE systems is still not ideal (F-measures around 2004, extracting and normalizing temporal expres- Inform Assoc 2013;20: 60–70%).23Moreover, extracting temporal infor- sions according to the Timex2 guidelines for both – 828 835. mation from clinical text can be more challenging English and Chinese texts was part of the Time 828 Tang B, et al. J Am Med Inform Assoc 2013;20:828–835. doi:10.1136/amiajnl-2013-001635 Research and applications Expression Recognition and Normalization Evaluation chal- methods to extract temporal expressions from clinical narra- lenge, sponsored by the Automatic Content Extraction tives.16 17 22 For example, Reeves et al extended the open- program.10 These tasks provided preliminary but valuable con- source temporal awareness and reasoning systems for question tributions to TIE research. interpretation (TARSQI) toolkit, originally developed from news Rapid development of TIE methods started in 2004 with reports, to extract temporal expressions from veterans affairs the work of TimeML,1 a robust specification language for (VA) clinical text. They found that temporal expressions in events and temporal expressions in natural language. The clinic notes were very different from those in the newswire TimeML schema mainly integrates two annotation schemes: domain, and the out-of-the-box implementation of the TARSQI TIDES (Translingual Information Detection, Extraction, and toolkit performed poorly.22 Some existing clinical NLP systems, Summarization) TIMEX2 and STAG (Sheffield Temporal such as ConText25 and MedLEE,26 also have the capability to Annotation Guidelines).11 12 It defined three elements of tem- recognize certain temporal expressions and link them to clinical poral information: events, temporal expressions, and temporal concepts. More comprehensive systems such as developed by relations. Events, including verbs, adjectives, and nominals, cor- Zhou et al,16 19 27 can not only extract temporal expressions responding to events and states are classified into different associated with medical events, but also reason about temporal types, and have various attributes, including tense, aspect, and information in clinical narrative reports. For more details of other features. Temporal expressions are token sequences that studies in clinical TIE, see the review paper by Zhou and denote times with various attributes such as their normalized Hripcsak.27 Nevertheless, very few studies have investigated the values. TimeML also represents temporal relations between use of TimeML in the medical domain. Recent studies by events/times using an Allen-like format. It defines temporal rela- Savova et al17 21 have annotated clinical text using TimeML. tions using three types of links: TLinks (Temporal Links), To advance the TIE research in the medical domain, organi- SLinks (Subordination Links), and ALinks (Aspectual Links). zers of the 2012 i2b2 clinical NLP challenge prepared an anno- TimeML has become an ISO standard for temporal annotation. tated clinical corpus based on TimeML and organized a clinical Several TimeML-based annotated corpora have been created. TIE competition similar to the TimeEval2 competition. The The popular corpora include TimeBank1.2, AQUAIN, 2012 i2b2 challenge consisted of three subtasks: (1) Event TempEval, and TempEval2. Among them, the TempEval corpus, extraction: six types of clinical events were extracted for the based on TimeBank1.2, was created for the temporal relation i2b2 challenge, including medical problems, tests, treatments, task at TempEval1 in 2007.2 For the Tempeval2 task in 2010, a clinical departments, evidentials, and occurrences. Every event multilingual corpus was created.313Detailed information about also has two attributes: polarity and modality. The polarity attri- these corpora can be found at http://www.timeml.org/site/