MINING PATIENT JOURNEYS FROM HEALTHCARE NARRATIVES

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2014

By Azad Dehghan
School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Hypotheses and research questions
  1.2 Aim and objectives
  1.3 Contributions
  1.4 Thesis structure

2 Background
  2.1 Text Mining
    2.1.1 Natural language processing
    2.1.2 Information extraction
    2.1.3 Named entity recognition
    2.1.4 Temporal information extraction
    2.1.5 Temporal entity extraction
    2.1.6 Temporal relation extraction
  2.2 Clinical Background
    2.2.1 Clinical text
    2.2.2 Computer aided standardisation of clinical care
    2.2.3 Health-related quality of life
  2.3 Summary

3 Extraction of Health-related Concepts
  3.1 Event Extraction
    3.1.1 Methods
    3.1.2 Data
    3.1.3 Experiments, results, and discussion
  3.2 Health-related Quality of Life Extraction
    3.2.1 HrQoL Schema
    3.2.2 Data
    3.2.3 Methods
    3.2.4 Experiments, results, and discussion
  3.3 Summary

4 Temporal Information Extraction
  4.1 Temporal Entity Extraction
    4.1.1 Methods
    4.1.2 Data
    4.1.3 Experiments, results, and discussion
  4.2 Temporal Relation Extraction
    4.2.1 Methods
    4.2.2 Data
    4.2.3 Experiments, results, and discussion
  4.3 Summary

5 Extracting Patient Journeys: a Case Study
  5.1 Introduction: Childhood Central Nervous System Tumours
  5.2 Introduction: Case Study
    5.2.1 Data
  5.3 Comparative Analysis of Narratives
    5.3.1 Aggregated analysis of narratives
    5.3.2 Individual case analysis of narratives
    5.3.3 Discussion
  5.4 Extracting Patient Journeys
    5.4.1 Methods
    5.4.2 Evaluation
    5.4.3 Individual patient journeys
    5.4.4 Aggregated patient journeys
    5.4.5 Discussion
  5.5 Visualising Patient Timelines
  5.6 Summary

6 Conclusion
  6.1 Contributions
  6.2 Limitations
  6.3 Future work
  6.4 Summary

A Background
  A.1 A sample clinical note
  A.2 Token level sequence label modelling
  A.3 Transitive closure
  A.4 HUI-2 classification system

B Extraction of Health-related Concepts
  B.1 NER annotation examples
  B.2 HrQoL NER

C Tools and Resources
  C.1 NLP pre-processing
  C.2 Implementation

D HrQoL Annotation Guideline
  D.1 Introduction
    D.1.1 HrQoL schema description
    D.1.2 HrQoL annotation
  D.2 HrQoL schema
  D.3 Ambiguous cases
    D.3.1 Indirect references
  D.4 What (not)? to annotate
    D.4.1 What not to annotate
    D.4.2 What to annotate
  D.5 More examples

E Extracting Patient Journeys: a Case Study
  E.1 Comparative Analysis of Narratives
  E.2 Extracting Patient Journeys

Word Count: 61,885

List of Tables

2.1 An example of lexical normalisation
2.2 Common negation phrases in clinical data
2.3 Evaluation variables matrix
2.4 Common rule-based languages
2.5 Top systems in the 2010 i2b2 event extraction task
2.6 Top systems in the 2012 i2b2 event extraction task
2.7 Common data-driven features used for clinical event extraction
2.8 TIMEX3 representation schema
2.9 TLINK representation schema
2.10 TempEval-2 TERN results
2.11 TempEval-3 TERN results
2.12 2012 i2b2 TERN: methods and resources
2.13 2012 i2b2 TERN results
2.14 Common data-driven features used for clinical TER
2.15 TempEval-3: TLINK identification and classification task
2.16 TempEval-3: TLINK end-to-end task
2.17 TempEval-3: approaches for TLINK identification
2.18 TempEval-3: approaches for TLINK classification
2.19 2012 i2b2: TLINK identification and classification task
2.20 2012 i2b2: TLINK end-to-end task
2.21 Common TLINK classification features
2.22 HrQoL instruments and corresponding classification dimensions

3.1 Definition of EVENT categories
3.2 Feature template: clinical EVENTs
3.3 Label fixer heuristic
3.4 EVENT annotated datasets
3.5 i2b2-TRC: EVENT IAA
3.6 i2b2-CARC: EVENT IAA
3.7 EVENT label distribution
3.8 EVENT extraction validation test results
3.9 EVENT extraction results on the held-out test set
3.10 The clinical NER performance
3.11 EVENT recognition: feature impact analysis
3.12 EVENT recognition: dataset impact
3.13 EVENT recognition: post-processing impact analysis
3.14 Performance of the official 2012 i2b2 submission
3.15 Example of HrQoL concepts
3.16 HrQoL dataset label distribution
3.17 HrQoL dataset IAA
3.18 HrQoL dictionary summary
3.19 HrQoL NER results on the development set
3.20 HrQoL NER results on the held-out test set
3.21 The HrQoL NER performance
3.22 The HrQoL NER impact analysis

4.1 Feature template: clinical TER
4.2 TIMEX3 label distribution
4.3 i2b2-TRC: TIMEX3 IAA
4.4 TER validation results
4.5 TER evaluation on the held-out test set
4.6 TE normalisation results
4.7 TLINK patterns
4.8 TLINK extraction features
4.9 TLINK label distribution
4.10 i2b2-TRC: TLINK IAA
4.11 TLINK development set results
4.12 TLINK results on the held-out test set
4.13 TLINK component based evaluation
4.14 TLINK end-to-end results on the held-out test set

5.1 Adopted case study specific types
5.2 Case study corpus: clinical narratives profile
5.3 Case study corpus: patient narratives profile
5.4 Top occurring concept in clinical and patient narratives
5.5 Proportion of traditional clinical concepts in patient narratives
5.6 Proportion of HrQoL concepts in clinical narratives
5.7 A semantic analysis of patient narratives
5.8 A semantic analysis of clinical narratives
5.9 An example list of concepts and their PathCluster confidence
5.10 Test case A: a tabular view of high-level processes
5.11 Test case B: a tabular view of high-level processes
5.12 Test case C: a tabular view of high-level processes
5.13 Aggregated patient pathway: a tabular view of high-level processes

A.1 Transitive relations

B.1 Contextual reasoner exclusion cues
B.2 Boundary adjustment: adjectives

D.1 Example 1 annotations
D.2 Example 2 annotations
D.3 Example 3 annotations
D.4 Example 4 annotations
D.5 Example 5 annotations
D.6 Example 6 annotations
D.7 Example 7 annotations
D.8 Example 8 annotations
D.9 Example 9 annotations
D.10 Ambiguous cases

E.1 Concept types in clinical narratives
E.2 Concept types in patient narratives
E.3 PathCluster: co-reference lists

List of Figures

3.1 Health-related concept extraction architecture
3.2 EVENT extraction architecture
3.3 Relevant HrQoL instruments and contained themes
3.4 HrQoL method architecture

4.1 TIE architecture
4.2 TERN architecture
4.3 TLINK extraction architecture

5.1 A hypothetical clinical narrative
5.2 Clinical timeline
5.3 Method overview
5.4 Proportion of concepts found in patient narratives
5.5 Proportion of concepts found in clinical narratives
5.6 Aggregated concept analysis between patient and clinical narratives
5.7 Lexical analysis using word clouds
5.8 Patient A: proportion of concepts in the patient narrative
5.9 Patient A: proportion of concepts in the clinical narratives
5.10 Patient A: concept analysis between clinical and patient narratives
5.11 Patient B: proportion of concepts in the patient narratives
5.12 Patient B: proportion of concepts in the clinical narrative
5.13 Patient B: concept analysis between clinical and patient narratives
5.14 Patient C: proportion of concepts in the patient narrative
5.15 Patient C: proportion of concepts in the clinical narratives
5.16 Patient C: concept analysis between clinical and patient narratives
5.17 A hypothetical patient journey
5.18 Pathway extraction architecture
5.19 Test case A: reconstructing the patient journey
5.20 Test case B: reconstructing the clinical pathway
5.21 Test case C: reconstructing the clinical pathway
5.22 An aggregated patient pathway
5.23 Clinical dashboard
5.24 Clinical dashboard: event view

A.1 A sample clinical note
A.2 Health Utilities Index Mark 2

D.1 The combined HrQoL schema

Acronyms

Avg  Average
CFG  Context Free Grammar
CDSS  Clinical Decision Support Systems
CN  Clinical narrative
CPSL  Common Pattern Specification Language
CRF  Conditional Random Field
CTM  Clinical Text Mining
DCT  Document Creation Time
DRT  Document Reference Time
DocTimeRel  Document Creation/Reference Time Relation
EHR  Electronic Health Record
EPR  Electronic Patient Record
F1  F1-measure or F1-score
FN  False Negative
FP  False Positive
HrQoL  Health-related Quality of Life
HS  Health Status
IAA  Inter-Annotator Agreement
IE  Information Extraction
IR  Information Retrieval
JW  Jaro-Winkler measure
LSP  Labelled Sequential Pattern
MLN  Markov Logic Network
ML  Machine learning
NE  Named Entity
NER  Named Entity Recognition
NERC  Named Entity Recognition and Classification
NLP  Natural Language Processing
NP  Noun phrase
P  Precision
PN  Patient narrative
POS  Part of Speech
R  Recall
Regex/Regx  Regular expression
SBD  Section Boundary Detection
SVM  Support Vector Machine
TC  Transitive Closure
TE  Temporal Entity/Expression
TER  Temporal Entity recognition
TERN  Temporal Expression Recognition and Normalisation
TIE  Temporal Information Extraction
TLINK  Temporal link/relation
TM  Text Mining
TN  True Negative
TP  True positive
VP  Verb phrase
WSD  Word Sense Disambiguation
OCR  Optical Character Recognition

Abstract

MINING PATIENT JOURNEYS FROM HEALTHCARE NARRATIVES
Azad Dehghan
A thesis submitted to The University of Manchester for the degree of Doctor of Philosophy, 2014

The aim of this thesis is to investigate the feasibility of using text mining methods to reconstruct patient journeys from unstructured clinical narratives. A novel method to extract and represent patient journeys is proposed and evaluated. To this end, a composition of methods was designed, developed and evaluated, including health-related concept extraction, temporal information extraction, concept clustering and automated workflow generation.

A suite of methods to extract clinical information from healthcare narratives was proposed and evaluated in order to enable chronological ordering of clinical concepts. Specifically, we proposed and evaluated a data-driven method to identify key clinical events (i.e., medical problems, treatments, and tests) using a sequence labelling algorithm, CRF, with a combination of lexical and syntactic features, and a rule-based post-processing method including label correction, boundary adjustment and a false positive filter. The method was evaluated as part of the 2012 i2b2 challenge and achieved state-of-the-art performance, with strict and lenient micro F1-measures of 83.45% and 91.13% respectively. A method to extract temporal expressions using a hybrid knowledge-driven (dictionary and rules) and data-driven (CRF) approach has been proposed and evaluated. The method demonstrated state-of-the-art performance at the 2012 i2b2 challenge: an F1-measure of 90.48% for identification and an accuracy of 70.44% for normalisation. For temporal ordering of events, we proposed and evaluated a knowledge-driven method, with an F1-measure of 62.96% (considering the reduced temporal graph) or 70.22% for extraction of temporal links. The method consisted of initial rule-based identification and classification components, which utilised contextual lexico-syntactic cues for inter-sentence links and string similarity for co-reference links, followed by a temporal closure component to calculate transitive relations of the extracted links.

In a case study of survivors of childhood central nervous system tumours (medulloblastoma), qualitative evaluation showed that we were able to capture specific trends that are part of patient journeys. An overall quantitative evaluation score (average precision and recall) of 94-100% for individual and 97% for aggregated patient journeys was also achieved. This indicates that text mining methods can be used to identify, extract and temporally organise key clinical concepts that make up a patient’s journey. We also presented an analysis of healthcare narratives, specifically exploring the content of clinical and patient narratives by using the methods developed to extract patient journeys. We found that health-related quality of life concepts are more common in patient narratives, while clinical concepts (e.g., medical problems, treatments, tests) are more prevalent in clinical narratives. In addition, while both aggregated sets of narratives contain all investigated concepts, clinical narratives contain, proportionally, more health-related quality of life concepts than patient narratives contain clinical concepts. These results demonstrate that automated concept extraction, in particular of health-related quality of life, as part of standard clinical practice is feasible.

The method presented herein demonstrated that text mining methods can be efficiently used to identify, extract and temporally organise key clinical concepts that make up a patient’s journey in a healthcare system. Automated reconstruction of patient journeys can potentially be of value for clinical practitioners and researchers, aiding large-scale analyses of implemented care pathways and subsequently helping to monitor, compare, develop and adjust clinical guidelines, both in the areas of chronic diseases where there is plenty of data and in rare conditions where potentially there are no established guidelines.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s policy on presentation of Theses.

Acknowledgements

I would like to express my gratitude to my supervisors: Dr Goran Nenadic and Prof John A. Keane, for their support throughout this research.

In particular, I would like to acknowledge my main supervisor Goran. His wisdom and vision have been imperative to the success of this thesis. Hopefully, I will be able to hire his services in the near future :)

I would like to thank and acknowledge the support of my clinical supervisors: Dr Martin McCabe (Teenage Cancer Trust Senior Lecturer and Honorary Consultant Paediatric Oncologist), Dr Edward J Estlin (Consultant Paediatrician, Blackpool Teaching Hospitals NHS Foundation Trust) and Dr Ian Kamaly-Asl (Consultant Neurosurgeon and Honorary Senior Lecturer, The University of Manchester). Their support has been imperative to the successful completion of this research.

I would also like to thank and acknowledge the support of: Prof Tony Long (Professor in Child and Family Health, The University of Salford); Dr Louise Robinson (Macmillan Principal Clinical Psychologist, Central Manchester University Hospitals NHS Foundation Trust, The Royal Manchester Children’s Hospital); Ms Ruth Morgan (Therapy and Dietetic Service Manager (Medical Team), Professional Lead for Occupational Therapy, Central Manchester University Hospitals NHS Foundation Trust, The Royal Manchester Children’s Hospital); and Dr Ram Kumar (Consultant Paediatric Neurologist, Alder Hey Children’s NHS Foundation Trust).

A very special thanks to the organisers of the annual Informatics for Integrating Biology & the Bedside (i2b2) challenges for providing the NLP research datasets, which have been vital for the successful completion of this project.

It has been a life learning experience to be part of such a vibrant research team. I will look back on this chapter in life with only fond memories. Unfortunately, the notion of time is non-recursive; or perhaps it is. It has been a privilege to do research next to these gentlemen and ladies: Geraint, Martin, Farzaneh, Chengkun (aka China), Michele (aka Italy), George, Daniel, Rosyzie, Mona, Jenny, and Nikola. I also want to mention Aleksandar (aka Kocha), with whom I have tackled many projects; these research problems have been a great source of knowledge throughout this research.

I must also mention all of the people with whom I have shared countless discussions of the intricacies of academia, science, philosophy, politics, life and beyond: Salil, James, Mohsen, Paris, Mona D, Farideh, Abbas, Erol, Keletso, and Matthew.

This journey would not have been possible without the unconditional support of my family, not only throughout this thesis, but throughout my life. Without them this would not have been possible! Literally.

A very special thanks to my wife, Adrianna, who has put up with countless cancelled dinners, weekends, and holidays including the current 2014 winter holidays. I’ll make it up!! ;) xoxoxo

To my parents.

Chapter 1

Introduction

Unstructured text is the most common format in which human knowledge is communicated: it is estimated that unstructured textual data make up as much as 80 percent of data today. For example, the primary means of knowledge transfer between scientists is through conference and journal papers. PubMed Central, which is the main archive of biomedical and life science literature, comprises over 23 million citations; about one citation per minute is added to this digital repository.[1] Likewise, healthcare organisations rely on text to record and communicate health-related information. For instance, a single health record system such as the Clinical Record Interactive Search (CRIS) database contains over 20 million narrative records and is growing at an estimated rate of 170,000 documents every month.[2] It is estimated that medical information currently doubles every three years and that by 2020 it will double every 73 days (Densen, 2011).

Given the deluge of unstructured textual data in the information age, automated means of deriving knowledge from unstructured text have been increasingly required. The attempt to automate the process of knowledge rendering from un- or semi-structured text has evolved into the research field of Text Mining (TM). Specifically, TM aims to enable the automatic processing of large collections of textual data in order to derive relevant information or knowledge. Generally, TM applications can aid experts in making sense of large amounts of text by distilling and contextualising knowledge, extracting facts, discovering implicit links, and generating and testing hypotheses (Spasic et al., 2005).

[1] http://www.ncbi.nlm.nih.gov/pmc/
[2] The CRIS system contains data from the South London and Maudsley NHS Foundation Trust (SLaM) electronic clinical records system. http://brc.slam.nhs.uk/about/core-facilities/cris/


TM has been used successfully in a number of applications, such as automated identification for clinical trials (Rao et al., 2006), prediction of disease status (Yang et al., 2009), detection and tracking of infectious disease outbreaks (Collier et al., 2008), large-scale extraction and contextualisation of biomolecular events (Gerner et al., 2012), drug discovery or discovery of novel applications of existing drugs (Frijters et al., 2010), and gene-disease (Jamieson et al., 2014; Özgür et al., 2008) and disease-disease association (Holzinger et al., 2012), among others.

The adoption of Electronic Health Records (EHRs) has shown promising potential in enhancing patient safety, care quality and efficiency. This is partly attributed to the resulting integration of intelligent systems that enable patient-specific recommendations (e.g., drug prescription advice, preventive health task reminders), also known as computerised Clinical Decision Support Systems (CDSS). For example, the systematic review by Kaushal et al. (2003) of computerised physician order entry (CPOE) systems showed that their use can substantially reduce medication error rates caused by human prescription errors. Another review (Chaudhry et al., 2006) reported three major benefits for quality and safety: (i) increased adherence to guideline-based care, (ii) enhanced surveillance and monitoring, and (iii) decreased medication errors. In terms of efficiency, they reported decreased utilisation of care as the main advantage. In addition, they highlighted that the primary benefit of EHR/CDSS was in the preventive health domain.

High-quality, consistent, and evidence-based care of patients is a goal universally strived for by healthcare providers. In the United Kingdom, the National Institute for Health and Care Excellence (NICE)[3] is responsible for developing a series of national clinical guidelines in order to guide the standardisation of care based on the best available evidence. Clinical guidelines are described as “systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances” (Field et al., 1990, p.50). A notable challenge to clinical guidelines is that evidence is not always available, and guidelines may therefore be derived in part from expert opinions rather than ‘evidence’ (Shekelle et al., 1999). Additionally, while rigorous evaluations of clinical guidelines have shown that they can improve quality of care, whether this is achieved in practice is uncertain (Rycroft-Malone et al., 2009). Current evidence about the effectiveness of guidelines is incomplete. Integrated care pathways[4] are designed by multidisciplinary teams at the local level in order to implement national standards, and determine care provision by using the best available evidence where national standards are not available (NHS Modernisation Agency and NICE, 2008). In conjunction with the sometimes conjectural nature of clinical guidelines, this effectively means that different hospitals, or even consultants within the same hospital, may adopt different paths of care for similar cases, as expert views may differ in the absence of concrete evidence.

In the context of evidence-based care, automated means of analysing and monitoring the implementation of clinical guidelines and care pathways will be useful to aid quality of care by ensuring that best practice is being followed. Additionally, in the absence of evidence, or where there is accepted variation in practice, large-scale analysis of clinical practice can help researchers to determine best care based on real-world data derived from clinical practice. Such findings could be useful to update guidelines/pathways to reflect best practice as evidenced by actual clinical practice and outcomes.

In order to enable (semi-)automated monitoring and analyses of guidelines or implemented care pathways in clinical practice, it is necessary to develop a computational method that takes advantage of the vast amount of data available in EHRs to reconstruct specific patient pathways. Patient journey or patient pathway (used interchangeably in this thesis) is the term coined to convey implemented guidelines or care pathways as applied to individual patients.

In this thesis we aim to explore the feasibility of TM methods to reconstruct patient journeys. We will use a case study of adult survivors of childhood central nervous system (CNS) cancer, specifically medulloblastoma, to both motivate and evaluate the proposed methodologies.

[3] https://www.nice.org.uk/
[4] Also known as clinical/care pathways, critical pathways, care paths, case management plans, clinical care pathways or care maps.

1.1 Hypotheses and research questions

We hypothesise that TM techniques can be used to extract patient pathways from unstructured healthcare narratives. In addition, we further speculate that the same computational methods can be used to compare patients’ narrative experiences with factors, such as clinical concepts (including health-related quality of life), found in their hospital records. These general hypotheses motivate the main research questions to be addressed in this thesis:

(1) Can TM techniques be used to reconstruct patient journeys from unstructured clinical narratives?

(2) Do narratives contain enough information to reconstruct patient pathways?

(3) What are the differences between the content of clinical narratives and patient narratives?

1.2 Aim and objectives

The aims of this thesis are: (1) to investigate the feasibility of TM methods to extract patient pathways from longitudinal clinical records of patients with chronic or severe long-term illnesses; and (2) to conduct a comparative investigation of the content of clinical narratives vis-à-vis patient narratives.

Specifically, the main objectives of this research are:

1. Design and develop a framework or a set of methods to extract relevant clinical information from healthcare narratives, including:

(a) clinical concepts such as medical problems, treatments, tests, and health-related quality of life;

(b) temporal information such as temporal expressions and temporal relations;

2. Design and implement a method to automatically extract and represent patient journeys from a set of longitudinal clinical narratives;

3. Design and implement a method to enable the analysis of patients’ narrative experience and compare them to factors found in their hospital records;

4. Evaluate the proposed methods through a case study in order to show the potential of the methods developed.
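To make the intended output of objectives 1 and 2 more concrete, the following minimal Python sketch orders a handful of extracted concepts chronologically. It is an illustration only: the `ClinicalEvent` class, its field names, the category labels, and the sample events are all hypothetical, and plain sorting by an already normalised date is a simplifying assumption rather than the method developed in this thesis.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ClinicalEvent:
    """A hypothetical extracted concept anchored to a normalised date."""
    text: str       # concept mention as found in the narrative
    category: str   # assumed labels, e.g. "PROBLEM", "TREATMENT", "TEST"
    when: date      # normalised temporal expression (assumed to be available)


def build_journey(events: List[ClinicalEvent]) -> List[ClinicalEvent]:
    """Order events chronologically; events on the same date keep input order."""
    return sorted(events, key=lambda e: e.when)


# Invented sample data, for illustration only.
events = [
    ClinicalEvent("radiotherapy", "TREATMENT", date(2001, 5, 14)),
    ClinicalEvent("MRI scan", "TEST", date(2001, 3, 2)),
    ClinicalEvent("tumour resection", "TREATMENT", date(2001, 3, 20)),
]

journey = build_journey(events)
print([e.text for e in journey])
```

In practice, many clinical events are not tied to an explicit date, which is why temporal relations between events (objective 1b) are needed in addition to temporal expressions.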

Note that as part of this research, we only used a small fraction of the data contained in the corresponding EHRs, merely a subset of unstructured clinical narratives (i.e., patient correspondence and internal clinical notes). Hence, additional data, including structured data which contain a wealth of relevant information, were not considered. The motivation for this was to assess the feasibility of TM techniques.

1.3 Contributions

The research presented in this thesis has made the following contributions:

• A method to extract clinical concepts such as medical problems, treatments, and tests.[5]

• A method to extract health-related quality of life concepts. Additionally, a bespoke representation schema was developed in order to classify extracted concept mentions.

• A set of methods to extract temporal information such as temporal expressions[6] and temporal relations.[7] These techniques were used for temporal ordering of extracted clinical concepts.

• A novel method to extract and represent patient journeys, validated using longitudinal clinical records derived from a case study of adult survivors of childhood central nervous system tumours.

• An analysis of clinical concepts presented in patient and clinical narratives, using a case study in order to explore and characterise any gaps/overlaps of concepts.

A number of these methods have been published as open source tools.[8][9]

Intermediate results from this thesis have been published in the following journal article.[10]

• Aleksandar Kovacevic, Azad Dehghan, Michele Filannino, John A Keane, and Goran Nenadic. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. Journal of the American Medical Informatics Association: JAMIA, 20(5):859–66, 2013.

[5] This method has been validated as part of the international 2012 i2b2 Challenge in Natural Language Processing for Clinical Data.
[6] This method has been validated as part of the international 2012 i2b2 Challenge in Natural Language Processing for Clinical Data.
[7] This method was validated using the 2012 i2b2 dataset.
[8] Clinical TERN: http://sourceforge.net/projects/clinical-tern/
[9] Clinical NERC: http://sourceforge.net/projects/clinical-nerc/
[10] The temporal expression normalisation method was solely developed by the third author. The remaining methods were joint contributions by the second and first authors, who also trained the machine-learning models.

1.4 Thesis structure

This thesis is structured as followed.

Chapter 2 provides the relevant background to this thesis. This includes a description of relevant TM components and specific tasks such as named entity recognition; temporal expression recognition and normalisation; and temporal relation identification and classification. In addition, we describe the clinical background, specifically computerised implementation and monitoring of clinical care pathways and self-reported clinical concepts such as health-related quality of life.

Chapter 3 presents the methods developed and validated to extract health-related concepts, including clinical concepts such as medical problems, treatments, and tests (Section 3.1), as well as more subjective concepts such as health-related quality of life (Section 3.2).

Chapter 4 presents the methods developed and validated to extract temporal information. In particular, we describe temporal expression recognition and normalisation (Section 4.1), and temporal ordering or temporal relation identification and classification (Section 4.2).

Chapter 5 describes a number of applications of the methods developed as part of this thesis. A brief background (Section 5.1) is followed by an introduction to the case study and the data (Section 5.2). This chapter also presents a comparative analysis between clinical and patient narratives (Section 5.3), the methods developed and validated to extract and represent individual and aggregated patient pathways (Section 5.4), and a ‘clinical dashboard’ to chronologically visualise clinical events (Section 5.5).

Chapter 6 concludes this thesis by summarising the contributions and the limitations of our work, and lastly discusses the future direction of the research presented herein.

Chapter 2

Background

This chapter presents the background review of relevant concepts and methods. The literature reviewed presents a general discussion on TM (Section 2.1), including, specifically, named entity recognition (Section 2.1.3), temporal entity extraction (Section 2.1.5), and temporal relation extraction (Section 2.1.6), as well as the clinical background related to this project (Section 2.2). Related work is reviewed as part of each relevant subsection.

This chapter is heavily focused on the clinical domain and therefore on unstructured clinical text. This bias was deemed appropriate given the aim of this thesis but also given the nature of the problem at hand. In particular, in this thesis, we aim to characterise clinical practice, which is inherently defined by its temporality. For instance, clinical events are naturally organised, administered, and analysed along a temporal dimension.

2.1 Text Mining

TM is a broad and multidimensional research field that is made up of a cluster of related information theoretical and computational fields such as Information Retrieval (IR), Natural Language Processing (NLP), Information Extraction (IE), and Data Mining (DM). In addition, statistical and data/information visualisation methods are increasingly becoming important techniques utilised by text miners to analyse and represent extracted knowledge.

The overall aim of TM is the automatic processing of unstructured or semi-structured textual data in order to derive knowledge. For example, given a clinical discharge letter, we may develop a TM method which automatically extracts the date of admission/discharge, a medical cause of the admission, prior medical history, or all of the above, and presents it in chronological order.

2.1.1 Natural language processing

Natural Language Processing (NLP) is the semantic modelling of natural language for computational or automated processing. The following subsections describe a number of relevant processing components.

Tokenisation

Tokenisation refers to the problem of segmenting text into individual meaningful elements or lexical items, called tokens (Hahn and Wermter, 2006). A token may loosely be described as a lexical unit such as a term, word, abbreviation, number, symbol, or space. Tokenisation is a non-trivial task. The following example shows a sentence and its tokenised output, with each bracketed item representing a token.

Input: “The patient denies any other symptoms .”

Output: [The] [patient] [denies] [any] [other] [symptoms] [.]

In the English language, specific linguistic characteristics such as (open) compound words, hyphenations, and apostrophes may introduce ambiguity and complexity (Manning et al., 2008). Simple word delimitation by white-space would often not suffice. For example, open compound words such as ‘New York’, ‘North America’, and ‘Text Mining’ should each, linguistically, be regarded as a single lexical unit. Hyphenation provides another dimension of complexity due to its multi-functional quality (Manning et al., 2008). Further, a simple contraction such as “aren’t” may be tokenised in multiple ways (with brackets marking token boundaries), i.e., (i) [are] [n’t], (ii) [aren] [’] [t], and (iii) [aren] [’t].
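To make the difference between white-space delimitation and a slightly more informed tokeniser concrete, the following is a minimal sketch in Python. The function names and the regular expression are illustrative assumptions rather than a method used in this thesis; the pattern keeps hyphenated words and contractions together, which is only one of the several defensible choices discussed above.

```python
import re

# Assumed, illustrative pattern: a run of word characters, optionally joined
# by internal hyphens or apostrophes, OR any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def whitespace_tokenise(text):
    """Naive tokenisation: split on white-space only."""
    return text.split()


def regex_tokenise(text):
    """Split punctuation from words, keeping hyphenated words and
    contractions as single tokens (one possible convention)."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    sentence = "The patient denies any other symptoms."
    print(whitespace_tokenise(sentence))  # the final period stays glued to 'symptoms.'
    print(regex_tokenise(sentence))       # the final period becomes its own token
```

Under this sketch the contraction "aren't" is kept as one token; a tokeniser following a different convention could equally split it in any of the ways listed above.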

Sentence boundary identification

Sentence boundary detection or sentence splitting refers to the problem of segmenting text into sentences. In the English language, sentences are demarcated by punctuation marks (i.e., period, exclamation mark, or question mark). Therefore, the extensive use of acronyms and abbreviations may cause complications, in particular when they include periods and occur at the end of a sentence.

In unstructured or semi-structured clinical data, the challenges of sentence splitting are more pronounced. This is partly because clinical texts tend to be riddled with abbreviations and acronyms. In addition, clinical texts are characteristically written in ungrammatical, short, and telegraphic phrases that do not follow common English syntactic structures (Meystre et al., 2008). A sample clinical note has been included in Appendix A.1.
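The interaction between abbreviations and sentence-final punctuation can be illustrated with a minimal sketch in Python. The abbreviation list and the function name are hypothetical and purely illustrative; this is not the sentence splitter used in this thesis, and it deliberately leaves unresolved the hard case where an abbreviation ends a sentence.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for
# illustration); a clinical system would need a far larger resource.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "b.i.d.", "q.d."}


def split_sentences(text):
    """Split on tokens ending in '.', '!' or '?' unless the token is a
    known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    note = "Dr. Smith reviewed the patient. No fever! Continue aspirin b.i.d. for pain."
    for sentence in split_sentences(note):
        print(sentence)
```

Because a listed abbreviation never closes a sentence in this sketch, a note ending in "... aspirin b.i.d. Review in clinic." would be under-segmented, which is exactly the ambiguity described above.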

Section boundary detection

Section Boundary Detection (SBD) refers to the problem of identifying document sections or sub-sections. This task has been shown to be useful when processing clinical texts, as they are often subdivided into meaningful sections, e.g., diagnosis, medication, medical history, and so forth. SBD could aid the contextual identification of the topic/theme of a specific segment of a text, which in turn may aid the overall performance of a TM task. For example, Yang et al. (2009) successfully applied SBD to a set of discharge summaries for predicting disease status. SBD has likewise been successfully used in the general biomedical domain (e.g., Guo et al., 2011; Mizuta et al., 2006; Teufel and Moens, 2002).

Morphological analysis

Morphological analysis is the linking of varied forms of lexical elements to their canonical base form (Hahn and Wermter, 2006). For example, the plural ‘cancers’ is an inflected form of the base form ‘cancer’. This type of analysis is used in many TM tasks, such as Named Entity Recognition (NER) and term normalisation, in order to address the challenge of term variability.

Morphological analysis is generally divided into two main approaches: (1) lexicon-free and (2) (domain specific) lexicon look-up. A successful and commonly used lexicon-free approach is Porter’s stemming algorithm (Porter, 1980). An example of the lexicon look-up approach is the UMLS SPECIALIST Lexicon 1, which is widely used in the medical domain.

1 http://lexsrv3.nlm.nih.gov/Specialist/Home/index.html

Lexical normalisation

Lexical normalisation aims to link derivative lexical terms to a single/normalised representation. For example, the terms “Hodgkin’s Diseases NOS” and “Hodgkin’s Disease” may both be normalised to ‘hodgkin disease’. Table 2.1 shows a set of common normalisation heuristics employed for lexical normalisation.

Table 2.1: An example of lexical normalisation

Normalisation heuristic           Candidate term
-                                 Hodgkin’s Diseases, NOS
Remove genitives                  Hodgkin’s Diseases, NOS
Remove punctuation                Hodgkin Diseases, NOS
Remove stop words                 Hodgkin Diseases NOS
Word case normalisation           Hodgkin Diseases
Uninflect individual words        hodgkin diseases
Sort words by alphabetic order    hodgkin disease
Normalised term:                  disease hodgkin
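The sequence of heuristics in Table 2.1 can be sketched as a small pipeline in Python. This is an illustrative approximation only: the stop-word list (which assumes that ‘NOS’ is treated as a stop word) and the naive plural-stripping rule are assumptions standing in for proper lexical resources, and the function names are hypothetical.

```python
import re

# Assumed stop-word list for illustration; 'nos' ("not otherwise specified")
# is included so that the example in Table 2.1 can be reproduced.
STOP_WORDS = {"nos", "of", "the", "and", "with"}


def uninflect(word):
    """Very naive uninflection: strip a plural 's' (an assumption, not a
    substitute for a lexicon-based morphological analyser)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalise(term):
    term = re.sub(r"['’]s\b", "", term)        # remove genitives
    term = re.sub(r"[^\w\s]", " ", term)       # remove punctuation
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]  # remove stop words
    words = [w.lower() for w in words]         # word case normalisation
    words = [uninflect(w) for w in words]      # uninflect individual words
    return " ".join(sorted(words))             # sort words alphabetically


if __name__ == "__main__":
    print(normalise("Hodgkin’s Diseases, NOS"))
    print(normalise("Hodgkin's Disease"))
```

Both example terms are mapped to the same normalised string, which is the property that makes such heuristics useful for matching term variants.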

Other approaches to lexical normalisation include soft string matching or string similarity metrics. Edit distance based metrics are well known and define distance by the cost, represented by a real number t, incurred by the edit operations needed to convert a string s1 to s2. Edit operations are typically defined as character-level insert, substitute, and delete, with either a constant penalty value across all operations or weighted costs. A well known edit distance measure is the Levenshtein distance, which assigns a unit cost to every edit operation. For example, the Levenshtein distance between ‘Saturday’ and ‘Sunday’ is 3: we require two deletions (‘a’ and ‘t’) and one substitution (‘n’ for ‘r’).

Other related distance measures include those that consider character sub-sequences (i.e., one or more characters), such as n-gram similarity, which is based on the number of shared n-grams (where n > 1), and the related q-gram similarity, which simply accounts for sub-strings of length q (Ukkonen, 1992). The Jaro metric (Jaro, 1989, 1995) considers a set of string features such as the string lengths, the number and order of intersecting characters, and the number of transpositions between two strings.
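The unit-cost edit distance described above can be computed with the standard dynamic-programming formulation. The following Python sketch is a generic textbook implementation, included only to make the ‘Saturday’/‘Sunday’ example reproducible; the function name is an illustrative choice.

```python
def levenshtein(s1, s2):
    """Unit-cost edit distance (insert, delete, substitute) between two strings."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            substitution_cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,                      # delete c1
                current[j - 1] + 1,                   # insert c2
                previous[j - 1] + substitution_cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("Saturday", "Sunday"))  # 3
```

Only two rows of the distance table are kept at any time, so memory use grows with the length of the second string rather than with the product of both lengths.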

Specifically, the Jaro similarity metric for given strings s1 and s2 is:

\[ \mathrm{Jaro}(s_1, s_2) = \frac{1}{3} \times \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right) \tag{2.1} \]

where:

• m is the number of intersecting characters that appear in the same order,

• t is half the number of transpositions.

Winkler (1999) extends the aforementioned measure. Specifically, the Jaro-Winkler (JW) metric incorporates the length of the longest common prefix (P) between s1 and s2, up to a maximum length of 4, and is defined as follows:

\[ \mathrm{JW}(s_1, s_2) = \mathrm{Jaro}(s_1, s_2) + \frac{P}{10} \times \left( 1 - \mathrm{Jaro}(s_1, s_2) \right) \tag{2.2} \]

Both the Jaro and JW measures are intended for short strings, e.g., person names (Cohen et al., 2003).

Token-based approaches consider sets of terms T, where a term may contain one or more words/tokens w. Common approaches for term similarity include Jaccard similarity and TFIDF. The Jaccard similarity coefficient between word sets is defined as:

\[ \mathrm{Jaccard}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \tag{2.3} \]

The TFIDF metric, also known as cosine similarity, generally refers to the product of term frequency (the number of times a term occurs in a document) and the inverse document frequency (the inverse of the fraction of documents in which the term occurs). As a lexical similarity function, TFIDF can be calculated between terms T1 and T2 as follows (Cohen et al., 2003):

\[ \mathrm{TF}(T_1, T_2) = \sum_{w \in T_1 \cap T_2} V(w, T_1) \times V(w, T_2) \tag{2.4} \]

where \( V(w, T_i) = V'(w, T_i) \div \sqrt{\sum_{w} V'(w, T_i)^2} \), such that \( V'(w, T_i) = \log(\mathrm{TF}_{w,T_i} + 1) \times \log(\mathrm{IDF}_w) \).

A widely used string similarity metric is SoftTFIDF, which considers both character- and token-level similarity, specifically combining (term-level) TFIDF and the Jaro-Winkler metric. SoftTFIDF has been shown to be very effective for various lexical normalisation-like tasks across multiple case studies (Cohen et al., 2003; Tsuruoka et al., 2007; Wellner et al., 2005).
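As an illustration of Equations 2.1 to 2.3, the following Python sketch implements the Jaro, Jaro-Winkler, and Jaccard measures. One detail is an assumption not stated in the text above: characters are only counted as intersecting if they lie within a window of half the longer string length minus one, which is the convention commonly used with the Jaro metric. The function names are illustrative.

```python
def jaro(s1, s2):
    """Jaro similarity (Equation 2.1), using the conventional match window."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)  # assumed convention
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # t is half the number of transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2):
    """Jaro-Winkler similarity (Equation 2.2); prefix length P capped at 4."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + (prefix / 10) * (1 - base)


def jaccard(tokens1, tokens2):
    """Jaccard coefficient (Equation 2.3) between two collections of tokens."""
    set1, set2 = set(tokens1), set(tokens2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 1.0


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(jaccard("hodgkin disease".split(), "hodgkin lymphoma".split()))
```

The shared prefix ‘MAR’ is what lifts the Jaro-Winkler score above the plain Jaro score for the first pair, reflecting the intuition that name variants tend to agree at the start of the string.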

Part-of-speech tagging

Part-of-Speech (POS) tagging refers to the computational task of assigning POS tags to tokens, and is considered a central part of lexical-level processing. While there exist two main methods of POS tagging (i.e., knowledge- and data-driven), most taggers are representative of statistical or data-driven approaches (Hahn and Wermter, 2006). The following example shows the hypothetical output of a POS tagger (where each word is followed by a ‘/’ and its respective POS tag).

Input: The patient denies any other symptoms .

Output: The/DT patient/NN denies/VBZ any/DT other/JJ symptoms/NNS .

POS tagging is essential for lexico-syntactic disambiguation of words (Hahn and Wermter, 2006). For example, the word ‘cold’ can be either a common noun or an adjective, depending on its syntactic context. Furthermore, a robust POS tagger is crucial for NER tasks. For example, data-driven approaches almost unanimously use POS tags as features for predicting Named Entities (NEs)2. Similarly, knowledge-driven methods crucially rely on POS tags for NE candidate generation (e.g., see the UMLS MetaMap NER pipeline described in (Aronson and Lang, 2010)).

Moreover, due to the inherent sensitivity of clinical data, the clinical domain has been plagued by a lack of publicly available and domain specific annotated corpora for training POS taggers. Despite the availability of computational lexicons (e.g., SPECIALIST Lexicon), this lack of corpora has forced a reliance on non-domain specific corpora such as general English and newspaper text (e.g., Penn Treebank; Wall Street Journal, Brown corpus) and/or biomedical text (e.g., GENIA corpus) for training POS taggers. For example, the investigation by Hahn et al. (2004) showed that porting a POS tagger between domains (i.e., from a newspaper-language model to medical text) resulted in a significant drop in performance. On the other hand, Pakhomov et al. (2006) reported that training POS taggers on medical data significantly improved their performance on medical text.

Syntactic processing

The aim of syntactic processing is the identification of structure within sentences, the syntactic analysis of sequences of words, or the recognition, analysis, and formal characterisation of phrases and clauses (Hahn and Wermter, 2006). Chunking has proven useful for NER tasks in both the general and biomedical domains; this is expected given that NEs are often contained within noun or prepositional phrases.

2 Names of things, either physical or conceptual entities, e.g., person, organisation, disease names.

Methods used for syntactic-level processing are called chunkers and parsers. Parsers identify clauses such as word sequences containing a subject and a predicate (Hahn and Wermter, 2006) and organise these into parse or dependency trees. Chunkers partition or label sentences into shallow phrasal units (i.e., noun, preposition, verb, or adjectival phrases); chunking is hence also known as shallow parsing. Research within this area has predominantly focused on two types of chunking (Hahn and Wermter, 2006):

1. Base NP chunking involves identifying heads (including determiners, but excluding post-modifying prepositional phrases or clauses) of non-recursive noun phrases (Ramshaw and Marcus, 1995).

2. Chunking involves grouping lexical units at the phrase level into non-overlapping phrases (e.g., verbal phrases, prepositional phrases, noun phrases, and so forth), such that syntactically related words become members of the same phrase (Hahn and Wermter, 2006; Sang and Buchholz, 2000).

The following examples show a POS tagged input sentence and the output of the two mentioned chunking methods, with brackets marking the chunk boundaries.

Input: The/DT patient/NN denies/VBZ any/DT other/JJ symptoms/NNS .

Base NP: [The/DT patient/NN] denies/VBZ [any/DT other/JJ symptoms/NNS] .

Chunking: [The/DT patient/NN] [denies/VBZ] [any/DT other/JJ symptoms/NNS] .

Moreover, resources commonly utilised to train parsers or chunkers include treebanks and grammars. Treebanks are text corpora with syntactic annotations at the sentence level (such as lexico-syntactic structures: POS tags and chunks). Grammars contain some subset of linguistic syntax, commonly rules or constraints which characterise morpho-syntactic and non-terminal grammar categories. An example of a treebank used in the biomedical domain is GENIA treebank v1.0, which is made up of annotated PubMed abstracts (Kim et al., 2003).

As previously highlighted, the lack of domain specific treebanks can be a challenge to medical/clinical NER. Medical NER systems (e.g., UMLS MetaMap) typically rely on shallow parsing to identify NPs for the subsequent identification of NE candidates. For instance, the importance of chunkers is demonstrated by Abacha and Zweigenbaum (2011b), who reported a notable improvement of UMLS MetaMap for disease name recognition by simply substituting the noun phrase chunker.

Semantic processing

Semantic-level processing refers to the linking of terms or concepts to form logical/knowledge propositions (Hahn and Wermter, 2006). For example, in clinical TM, the mapping of NEs (e.g., an annotated mention of a diagnosis) to a terminological/ontological resource such as the UMLS Metathesaurus is considered semantic-level processing. For instance, by mapping the term ‘posterior fossa medulloblastoma’ to the aforementioned ontology we can conclude that ‘posterior fossa’ is-a Body Part, Organ, or Organ Component and ‘medulloblastoma’ is-a Neoplastic Process.

Word Sense Disambiguation (WSD) is arguably another form of semantic-level processing: e.g., does the mention of ‘cold’ refer to the disease, temperature, or sensation? Nonetheless, we consider WSD as a separate task (to be described shortly). Further, NER tasks (i.e., linking of terms to predefined concept categories) can also be regarded as semantic-level processing even though NER is commonly regarded as lexical-level analysis.

Negation

Negation is an important contextual feature to consider in IE tasks. Chapman et al. (2001) showed that a large portion of clinical observations mentioned in textual patient records are negated. Consequently, they proposed the widely used NegEx algorithm, which considers contextual negation cues to solve the problem. As part of their investigation, they reported a list of the 14 most common negation phrases found in a systematic study of 42,160 clinical reports (see Table 2.2).

Harkema et al. (2009) extended the NegEx algorithm by also considering whether a clinical event is historical, hypothetical, or experienced by someone other than the patient. They named the method ConText. Moreover, Chu et al. (2006) showed that negation was the most important contextual feature (among validity, certainty, temporality, and directionality/negation) for improving accuracy in a classification task of clinical conditions. Thus, addressing negation is essential for any real-world clinical application identifying/extracting clinical observations.

Table 2.2: Common negation phrases in clinical data
This table lists common negation phrases found in an analysis of over 40,000 clinical narratives (Chapman et al., 2001).

Negation phrase    Frequency
no                 62,436
denies             17,845
without            9,538
not                7,591
no evidence        5,488
with no            3,009
negative           2,979
denied             1,576
to rule out        932
no significant     820
w/o evidence       397
no new             368
no abnormal        105
no suspicious      55
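To illustrate how contextual negation cues such as those in Table 2.2 might be used, the following Python sketch flags a concept as negated when one of the listed phrases occurs within a small window of words preceding it. This is a strongly simplified illustration and not the NegEx algorithm itself: the window size, the function name, and the restriction to preceding cues are all assumptions made for the example.

```python
import re

# Negation phrases taken from Table 2.2.
NEGATION_PHRASES = [
    "no", "denies", "without", "not", "no evidence", "with no", "negative",
    "denied", "to rule out", "no significant", "w/o evidence", "no new",
    "no abnormal", "no suspicious",
]


def is_negated(sentence, concept, window=5):
    """Return True if a negation phrase occurs within `window` words
    before the first mention of `concept` (assumed, simplified scope rule)."""
    text = sentence.lower()
    position = text.find(concept.lower())
    if position == -1:
        return False
    # Words (letters and '/', so that 'w/o' survives) preceding the concept.
    preceding = re.findall(r"[a-z/]+", text[:position])[-window:]
    context = " ".join(preceding)
    for phrase in NEGATION_PHRASES:
        # Match the phrase as whole words only.
        pattern = r"(?<![a-z/])" + re.escape(phrase) + r"(?![a-z/])"
        if re.search(pattern, context):
            return True
    return False


if __name__ == "__main__":
    print(is_negated("The patient denies any other symptoms.", "symptoms"))
    print(is_negated("The patient reports chest pain.", "chest pain"))
```

The fixed window is a crude stand-in for negation scope: in "No fever, but the patient has had a persistent dry cough", the cue ‘no’ falls outside the five words preceding ‘cough’, so the cough is not flagged as negated.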

Word sense disambiguation

WSD is the process of determining the correct sense or meaning of a word. For example, the word ‘cold’ can refer to the concept cold temperature (Natural Phenomenon or Process), common cold (Disease or Syndrome), or cold sensation (Physiologic Function). Various methods have been applied to address this problem. However, a common denominator has been to resolve this task by considering contextual (e.g., surrounding words, section, and/or document) and lexical (POS and syntactic structure) features.

Evaluation in Text Mining

Precision and recall metrics

Evaluation metrics in TM/IE are naturally adopted from the field of IR (Grishman and Sundheim, 1996; Rijsbergen, 1979; Van Rijsbergen, 1974). Metrics used in IE tasks include: Precision (P), which is the fraction of extracted instances that belong to the relevant class; Recall (R), which is the fraction of relevant instances that are actually extracted; and F1-measure or F1-score (F1), which is the harmonic mean of P and R.

The described metrics P, R, and F1-measure (or F1-score) are defined as follows, using the variables described in Table 2.3:

Table 2.3: Evaluation variables matrix
Description of evaluation variables.

                                 Actual class
                                 Relevant               Not relevant
Tagged class   Extracted         True positives (tp)    False positives (fp)
               Not extracted     False negatives (fn)   True negatives (tn)

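\[ \mathrm{Precision}\ (P) = \frac{tp}{tp + fp} \tag{2.5} \]

\[ \mathrm{Recall}\ (R) = \frac{tp}{tp + fn} \tag{2.6} \]

\[ F_\beta\text{-measure} = \frac{(\beta^2 + 1) \times P \times R}{(\beta^2 \times P) + R} \tag{2.7} \]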

Here, β in Equation 2.7 reflects the weighting of P vis-a-vis R. For example, if β = 1, then P and R are weighted equally; if β = 0.5, then P weighs twice as much as R; and if β = 2, the reverse is true. β = 1 is the most common schema used in IE, and the schema used throughout this thesis.

Moreover, the given evaluation metrics may be expressed or computed in two different scoring schemas: strict or lenient/relaxed. Specifically, the strict scoring schema only considers exact matches as true/relevant, while the lenient scoring schema counts partial matches (i.e., any overlap with the gold label) as the true/relevant class.
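The metrics in Equations 2.5 to 2.7, together with the strict and lenient matching schemas, can be sketched in Python as follows. The function names and the representation of annotations as (start, end) character offsets with an exclusive end are illustrative assumptions, not the evaluation scripts used in this thesis.

```python
def precision(tp, fp):
    """Equation 2.5; returns 0.0 when nothing was extracted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation 2.6; returns 0.0 when there is nothing to extract."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r, beta=1.0):
    """Equation 2.7; beta < 1 favours precision, beta > 1 favours recall."""
    denominator = beta ** 2 * p + r
    return (beta ** 2 + 1) * p * r / denominator if denominator else 0.0


def spans_match(gold, predicted, lenient=False):
    """Compare two (start, end) spans, end exclusive (assumed representation).

    Strict: offsets must be identical. Lenient: any overlap counts.
    """
    if lenient:
        return predicted[0] < gold[1] and gold[0] < predicted[1]
    return gold == predicted


if __name__ == "__main__":
    p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)
    print(p, r, f_measure(p, r))               # F1
    print(f_measure(p, r, beta=0.5))           # closer to P
    print(f_measure(p, r, beta=2.0))           # closer to R
    print(spans_match((0, 10), (3, 10)), spans_match((0, 10), (3, 10), lenient=True))
```

With 8 true positives, 2 false positives, and 4 false negatives, the F-measure moves towards precision (0.8) for β = 0.5 and towards recall (roughly 0.67) for β = 2, which matches the weighting described above.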

Accuracy

Another commonly used measure in IE is accuracy. In contrast to the precision-recall metrics, accuracy is favoured for binary classification and/or when there is a clear distinction of what constitutes a negative and a positive class. For example, entity attributes which have strictly defined values (e.g., negation) are typically reported using accuracy (Equation 2.8). Accuracy is defined as follows:

\[ \mathrm{Accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \tag{2.8} \]

Micro and macro average

When evaluating more than one related entity type, apart from individually presented evaluation scores, there are two common ways in which the aggregated precision-recall metrics may be given:

• Micro averaging treats multiple entities as a single category. Hence, tp, fp, tn, and fn are calculated accordingly.

• Macro averaging calculates P, R, and F by entity type, and subsequently averages the results across entity types.
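The difference between the two aggregation schemes can be sketched in Python as follows. The counts, the entity type names, and the function names are made up purely for illustration and do not correspond to any results reported in this thesis.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) for a set of counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def micro_average(counts):
    """Pool the counts of all entity types, then compute P, R, F once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return prf(tp, fp, fn)


def macro_average(counts):
    """Compute P, R, F per entity type, then average each metric."""
    scores = [prf(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


if __name__ == "__main__":
    # Hypothetical counts: one frequent, well-recognised type and one rare,
    # poorly recognised type.
    counts = {
        "Problem": {"tp": 90, "fp": 10, "fn": 10},
        "Test": {"tp": 5, "fp": 5, "fn": 15},
    }
    print("micro:", micro_average(counts))
    print("macro:", macro_average(counts))
```

In this made-up example the micro average is dominated by the frequent type, whereas the macro average gives the rare type equal weight and is therefore noticeably lower.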

Inter-annotator agreement

In order to benchmark an IE component against the best known natural language processors, i.e., humans, the availability of a gold-standard dataset is essential. A gold-standard dataset, in the context of IE, refers to a manually/semi-automatically annotated corpus (often with a high Inter-Annotator Agreement (IAA)) on some topic-specific annotation task, e.g., NER, negation, and so forth. IAA is a statistic representing human performance in identifying relevant annotations (e.g., name of person, disease, treatment, and so forth). A common way to measure IAA for binary classification problems (e.g., negation) is by Cohen’s Kappa coefficient (k), where \( P(a) = \frac{\text{agreed annotations}}{\text{number of annotations}} \) is the agreement rate between annotators and \( P(e) = \frac{1}{2} \) is the estimated expected agreement by chance (Cohen, 1960).

\[ k = \frac{P(a) - P(e)}{1 - P(e)} \tag{2.9} \]

Some general properties of Cohen’s Kappa are:

• k is a real number with an upper bound of 1; in general, −1 ≤ k ≤ 1

• k = 1 equates to perfect agreement

• k = 0 equates to agreement no better than chance; negative values indicate agreement worse than chance
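Equation 2.9 can be sketched in Python as follows. By default the sketch estimates the expected chance agreement from each annotator's label distribution, which is the usual formulation of Cohen's Kappa; passing p_e=0.5 instead reproduces the fixed estimate for binary tasks used in the text above. The function name and the optional p_e parameter are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b, p_e=None):
    """Cohen's Kappa for two equally long sequences of labels.

    p_e: optional fixed chance agreement (e.g. 0.5 for a binary task);
    if omitted, it is estimated from the annotators' label frequencies.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: agreed annotations / number of annotations.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    if p_e is None:
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        p_e = sum((freq_a[label] / n) * (freq_b[label] / n)
                  for label in set(freq_a) | set(freq_b))
    if p_e == 1:  # both annotators used a single identical label throughout
        return 1.0
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    annotator_a = ["negated", "negated", "present", "present"]
    annotator_b = ["negated", "present", "present", "present"]
    print(cohens_kappa(annotator_a, annotator_b))
    print(cohens_kappa(annotator_a, annotator_b, p_e=0.5))
```

In this small example the two ways of estimating chance agreement happen to coincide, because the first annotator uses both labels equally often; in general they differ when the label distribution is skewed.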

Moreover, IAA may also be reported using F1 for common IE tasks (Hripcsak and Rothschild, 2005). The motivation partly stems from the fact that F1 is invariant irrespective of which annotator’s labels are used as the gold standard.

Another approach is to calculate the IAA as the average precision and recall: one annotation set is held as the gold standard and the precision of the other annotation set is measured against it, the roles are then reversed, and the two results are averaged (Sun et al., 2013c, p.809).

2.1.2 Information extraction

A central component of TM is derived from the research field of IE. The computational task of IE is characterised by the extraction and representation of data. Specifically, IE can generally be described as the transformation of unstructured/semi-structured data into structured content or templates, before subsequent information synthesis through various methods of integration and analysis (e.g., statistical and visualisation).

The conception of IE research has notably been influenced by a series of Message Understanding Conferences (MUCs) under the auspices of the United States Navy and later the TIPSTER program under the umbrella of the Defense Advanced Research Projects Agency (DARPA). The aim of the initial MUC evaluations was to develop technologies to facilitate automated analysis of military messages (e.g., naval sightings and engagements) containing textual information. The focus gradually shifted to extracting information from reports of terrorist events in Central and South America published in articles by the Foreign Broadcast Information Service (Grishman and Sundheim, 1996). The evolution of these tasks paved the way for Named Entity Recognition (NER) tasks (Chinchor, 1998; Grishman and Sundheim, 1996), including Temporal Information Extraction (TIE). In retrospect, the importance of NEs, and therefore NER, for information synthesis is a natural development, given that (English) language is largely made up of names of things, either physical or conceptual.

While MUC has been an important influence in the early development of the IE field, relevant individual research efforts had been conducted earlier. For example, Hirschman et al. (1976) described a method for extracting medical information (e.g., medical test, finding, condition) by contextual clues or distributional analysis of lexical cues from medical reports. In another study, Hirschman and Story (1981) described the extraction of time information, including relations and entities such as date/time and duration, from textual data. Nevertheless, MUC-organised evaluations set the precedent for IE tasks such as NER, TIE, and relation extraction by formalising the problems at hand, organising coordinated community tasks, and providing resources. Since then, these efforts have been extended with various community-organised challenges/tasks in different domains, including:

Nota bene: concepts mentioned in the following list will be described further ahead in this chapter. Specifically, event extraction (or NER) is described in Section 2.1.3, and temporal expression recognition and normalisation (TERN) and temporal relation identification and classification are described in Section 2.1.5 and Section 2.1.6, respectively.

• Conference and Labs of the Evaluation Forum or CLEF (formerly known as the Cross-Language Evaluation Forum) has organised two related challenges thus far:

1. ShARe/CLEF eHealth 2013 Shared Tasks was jointly organised with the Shared Annotated Resources (ShARe) project and included two tracks on recognition and mapping of disorder mentions to the UMLS Metathesaurus.3

2. ShARe/CLEF eHealth 2014 Shared Tasks included a task on IE from clinical text, specifically, disease/disorder mention recognition and UMLS mapping.4

• Semantic Evaluation or SemEval (formerly known as SensEval and which was initially focused on WSD) has thus far organised:

1. TempEval-1 or SemEval-2007 Task 15 (Verhagen et al., 2007). TempEval-1 was a general domain task (newspaper text) that included three limited temporal relation (TLINK) classification tasks, i.e., determining the type (i.e., Before, After, Overlap, Before-or-overlap, Overlap-or-after, and Vague) of given TLINK candidate pairs. For example, in the sentence ‘He taught on Wednesday’, the event ‘taught’ and the temporal expression (TE) ‘Wednesday’ form a candidate pair and should be classified as an Overlap TLINK.5

2. TempEval-2 or SemEval-2010 Task 13 (Verhagen et al., 2010). TempEval-2 was likewise a general domain challenge that consisted of several subtasks, including event recognition (i.e., eventuality); TERN, which included distinguishing between four temporal types, Time (‘at 2:45 p.m.’), Date (‘January 27’, ‘1928’), Duration (‘two weeks’), and Set (‘every Monday’), and normalising the TEs according to the ISO-8601 standard; and four limited TLINK tasks, i.e., solely classification: given a set of TLINKs, the participants had to determine the type of link (i.e., Before, After, Overlap, Before-or-overlap, Overlap-or-after, and Vague).

3. TempEval-3 or SemEval-2013 Task 1 (UzZaman et al., 2013). TempEval-3 was another general domain challenge that consisted of six subtasks that included event extraction, TERN, and three TLINK tracks. Notably, in contrast to previous TempEval challenges, the TLINK tasks included ‘end-to-end’ evaluation (whereby participants had to first extract relevant events and TEs, and subsequently identify links between candidate pairs and determine their type, i.e., TLINK recognition and classification). In addition, no artificial restrictions were imposed on TLINKs.

4. Clinical TempEval or SemEval-2015 Task 6 is ongoing at the time of writing of this thesis. Clinical TempEval is the first TempEval in the clinical domain. It includes three main tasks: TERN, event extraction (a subset of disorders, i.e., oncology related), and TLINK extraction (i.e., both the identification of pairwise TLINKs and the classification of these relations).

3 A detailed description of the shared task, including guidelines, can be found at: https://sites.google.com/site/shareclefehealth/
4 A detailed description of the shared task, including guidelines, can be found at: http://clefehealth2014.dcu.ie/task-2/
5 The tasks include: (1) determine the relation type between an event and a TE in the same sentence, where, in addition, the event should syntactically dominate the TE or the event and TE should occur in the same noun phrase; (2) determine the relation type between two ‘main events’ in consecutive sentences; (3) determine the relation type between two events where one event syntactically dominates the other.

• Informatics for Integrating Biology and the Bedside (i2b2) has been a major player in advancing TM research in the clinical domain. i2b2 has organised several clinical text mining tasks since 2007.6 A couple of relevant challenges are:

1. The 2010 i2b2/VA 4th Shared Task on Concepts, Assertions and Relations in clinical text (Uzuner et al., 2011). This challenge included a concept extraction task which focused on the extraction of clinical events such as medical Problem (e.g., signs or symptoms, disorders, diseases), Treatment (e.g., medication, surgical procedures), and Test (e.g., diagnostic procedures). In addition, each EVENT included a negation attribute which accepted one of the following values: ‘present’, ‘absent’, or ‘hypothetical’.

2. The 2012 i2b2/ 6th Shared Task on Temporal Information Extraction in clinical text (Sun et al., 2013c). This shared task included three tasks:

(a) NER or clinical event extraction (i.e., categories: Problem, Treatment, Test, Occurrence, Evidential, and Clinical department); in addition, each EVENT included two attributes: polarity, i.e., ‘positive’ or ‘negated’, and modality, i.e., ‘factual’, ‘conditional’, ‘possible’, or ‘proposed’;

(b) TERN, i.e., recognition of temporal expressions (e.g., ‘December 5, 2001’) and normalisation of these expressions by three attributes: (i) value, using the ISO-8601 standard, (ii) type, i.e., Date, Time, Duration, or Frequency (equivalent to Set), and (iii) modifier, i.e., ‘NA’, ‘more’, ‘less’, ‘approx’, ‘start’, ‘end’, or ‘middle’; and

(c) TLINK extraction: identification of temporal links between EVENTs, between TEs, and between EVENTs and TEs; and classification of these into predefined link types: After, Before, and Overlap.

6 The full list of challenges can be found at: https://www.i2b2.org/

Information extraction methods

IE methods can be grouped into three broad sets of methodologies: knowledge-driven (also known as knowledge-based); data-driven, which is largely associated with machine-learning- and statistical-based methods; and hybrid, which is any combination of knowledge- and data-driven methods.

Knowledge-driven IE methods are broadly characterised by the domain knowledge often required to engineer methods. Background knowledge or domain knowledge may be acquired from domain experts in the form of knowledge resources such as vocabularies and terminological or ontological resources (e.g., UMLS Metathesaurus), through collaboration with experts in engineering the particulars of the techniques developed (e.g., providing contextual knowledge), and/or directly by the individual researcher. Knowledge-driven IE methods are commonly grouped into rule-based and dictionary-based approaches. These approaches are further described in the following sections.

Rule-based methods include the use of techniques such as heuristics (often indicating simple rules analogous to if-else statements), regular expressions (regex), or, more predominantly, IE-driven rule-based languages such as the common pattern specification language (CPSL). The notion of CPSL was initially developed during the TIPSTER program and was purposefully developed for formalising extraction patterns in a relatively system-independent manner (Appelt and Onyshkevych, 1998). Notably, the development of CPSL was motivated by the fast development and portability of components or rule sets, relative to the variety of native rule-based specification languages existing at the time.

One of the advantages of CPSL versus regex is that the former enables matching regular expression patterns over annotations or lexical features of words, phrases, or a given text span. The CPSL grammar also enables the creation of annotations or features over text spans (e.g., a word/token can have one or more attributes). Hence, the CPSL grammar can be applied over annotated data processed with some TM workflow (e.g., tokenisation, syntactic processing, NER, and so forth).

A well known CPSL language is the Java Annotation Pattern Engine (JAPE) (Cunningham et al., 2000), which is provided as part of the GATE (A General Architecture for Text Engineering) framework (Cunningham et al., 2013). UIMA Ruta is another CPSL-like language, part of the Apache UIMA (Unstructured Information Management Architecture) framework (see Table 2.4 for common rule formalism languages and their respective frameworks).

Table 2.4: Common rule-based languages

Formalism language    Framework
RegEx                 -
JAPE⁷                 GATE
Ruta⁸                 UIMA
Mixup⁹                MinorThird

Common criticisms of rule-based approaches are that they tend to be time-consuming and domain-specific, and require tedious manual labour (Riloff, 1993), criticisms that do not differ radically from those levelled at other IE methods. The following examples illustrate a Java regex and two alternative JAPE patterns to extract the common date pattern DD-MM-YYYY:

Java regex :

[0-9]{2}-[0-9]{2}-[0-9]{4}

JAPE :

Example 1:

{Token.kind == number, Token.length==2} {Token.string == "-"} {Token.kind == number, Token.length==2} {Token.string == "-"} {Token.kind == number, Token.length==4}

Example 2:

{Token.string ==~ "[0-9]{2}"} {Token.string == "-"} {Token.string ==~ "[0-9]{2}"} {Token.string == "-"} {Token.string ==~ "[0-9]{4}"}
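
As a rough illustration of how the regex above behaves in practice, the following Python sketch applies the same pattern to a sentence. The function name `find_dates` and the added word-boundary anchors are assumptions made for this example. Note that the pattern only checks the shape of the string, so an impossible date such as ‘99-99-9999’ would also match.

```python
import re

# Same pattern as the regex above. The \b anchors are an added assumption so
# that longer digit runs (e.g. '123-45-67890') are not partially matched.
DATE_PATTERN = re.compile(r"\b[0-9]{2}-[0-9]{2}-[0-9]{4}\b")


def find_dates(text):
    """Return every DD-MM-YYYY shaped substring found in text."""
    return DATE_PATTERN.findall(text)


print(find_dates("Admitted 05-12-2001, discharged 11-12-2001."))
```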

Dictionary-based methods are in essence straightforward (partial or exact) string matching algorithms over a list or dictionary. The concept ‘dictionary’ in this context refers to multiple types of lexical resources: for example, gazetteers or lists of words, computational lexicons (often containing additional lexical information, e.g., canonical forms, part-of-speech, synonyms), or rich knowledge resources such as terminological and ontological resources that are hierarchically organised (e.g., part-of, is-a, and so forth; i.e., providing meaningful relations between concepts). An example of a well-established knowledge resource in the medical domain is the UMLS Metathesaurus. Arguably, and perhaps more obviously than other IE methods, dictionary-based resources/methods are domain and task dependent by design. There exist several limitations to dictionary-based methods. Firstly, the development of knowledge resources often requires domain experts. Secondly, resource curation is a labour-intensive task and, even setting aside neologisms, resources are finite and therefore may need regular updates. However, the development, maintenance and release of publicly available resources have helped researchers and practitioners alike to overcome these limitations.¹⁰
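
The exact-matching case described above can be sketched as a greedy longest-match lookup over a small gazetteer. This is a minimal illustration only: the gazetteer entries, the normalisation step and the function names are hypothetical, and a real system would load a curated resource (such as the UMLS Metathesaurus) and use a proper tokeniser.

```python
# Hypothetical toy gazetteer mapping normalised terms to categories.
GAZETTEER = {
    "breast cancer": "Problem",
    "chemotherapy": "Treatment",
    "mri scan": "Test",
}
MAX_TERM_TOKENS = max(len(term.split()) for term in GAZETTEER)


def normalise(token):
    """Lower-case a token and strip surrounding punctuation."""
    return token.lower().strip(".,;:()")


def dictionary_lookup(text):
    """Greedy longest-match lookup of gazetteer terms over whitespace tokens."""
    tokens = [normalise(t) for t in text.split()]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_TERM_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                matches.append((candidate, GAZETTEER[candidate]))
                i += n
                break
        else:
            i += 1  # no gazetteer term starts at this token
    return matches


print(dictionary_lookup("MRI scan showed breast cancer; chemotherapy was started."))
```

Because matching is exact after normalisation, spelling variants or abbreviations that are absent from the gazetteer are missed, which is the limitation the paragraph above points to.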

¹⁰ Once again, the UMLS Metathesaurus is exemplary.

Data-driven methods in IE are largely characterised by the use of annotated data to train mathematical or statistical algorithms (commonly referred to as machine learning (ML)) in order to derive classifiers (or models) for a given classification (or recognition) task. For example, regardless of the method employed, a classification task requires a training dataset $D = \{d_1, \ldots, d_n\}$ with annotated target data (e.g., tokens, entities, or documents) that are labelled with a class $L \in \mathcal{L}$ (e.g., Person, Organisation, Problem, Test). Subsequently, the task is to derive a classification model:

$$f : D \to \mathcal{L}, \qquad f(d) = L \tag{2.10}$$

Similar to knowledge-driven methods, NLP and semantic-level processing is often necessary to enrich a source text with useful features (a process known as feature generation). In fact, data-driven approaches often adopt knowledge-based resources to enrich data with semantic attributes in order to derive useful features (e.g., de Bruijn et al., 2011; Kovacevic et al., 2013).

The process of deriving a subset of features that provide the most information is commonly referred to as feature selection (Alpaydin, 2010). This process involves various techniques in order to experimentally derive robust and generalisable models for given IE tasks. The aim of feature selection strategies is to discover the feature sets that best model a given problem. Common feature selection strategies include forward selection (i.e., incrementally adding features and assessing the performance) and backward selection (i.e., starting with all extracted features and incrementally removing features to find the best model).

While a number of data-driven methods are employed in IE tasks, it is outside the scope of this report to review them. However, one relevant and noteworthy method is the conditional random field (CRF). CRF is a state-of-the-art sequence labelling algorithm which has been shown to be very effective in NER tasks (i.e., identification/classification of names of things, either physical or conceptual, such as Person, e.g., ‘Barack Hussein Obama’; Organisation, e.g., ‘University of Manchester’; Date, e.g., ‘June 1969’; and medical concepts: Problem, e.g., ‘breast cancer’; Treatment, e.g., ‘chemotherapy’; Test, e.g., ‘MRI scan’) across different domains (Finkel et al., 2005; Sun et al., 2013b). CRF is an undirected probabilistic graphical model (Lafferty et al., 2001). It attempts to assign a label sequence $Y = \{y_1, \ldots, y_n\}$ to an observation sequence $X = \{x_1, \ldots, x_n\}$ by maximising the conditional probability $P(Y \mid X)$. A particularly useful characteristic of graphical models is their ability to model interdependent variables and thus take into account contextual features. This has proven useful in NER tasks since the NE labels of neighbouring words are dependent, e.g., while ‘New York’ is a Location, ‘New York Times’ is an Organisation (Sutton and McCallum, 2007).
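
For reference, the conditional probability that a linear-chain CRF maximises is usually written as below. This is the standard textbook form rather than a formula taken from this chapter: the feature functions $f_k$, their weights $\lambda_k$ and the normaliser $Z(X)$ are symbols introduced here for illustration.

```latex
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right),
\qquad
Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \right)
```

Each feature function sees the previous label, the current label and the whole observation sequence, which is how the dependency between neighbouring labels (the ‘New York’ versus ‘New York Times’ case) enters the model.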

Hybrid methods refer to some combination of knowledge- and data-driven methodology. However, the boundary between data-driven, knowledge-driven and hybrid methods can be ambiguous. It appears that heuristics and rules are under-reported in data-driven methods. For example, common rule-based steps such as lexical normalisation (e.g., string manipulation such as stemming or converting strings to lower-case or upper-case) and filtering (e.g., stop-word removal), broadly employed during initial NLP processing, are often not considered part of the overall methodology. Another example is the use of dictionary-based methods for feature generation, which is customarily not considered part of the ‘methodology’, although one may argue otherwise. Nevertheless, a hybrid method is characterised by the combination of knowledge- and data-driven methods for the immediate IE task.

2.1.3 Named entity recognition

The problem of recognising sequences of words that appear in text and denote names of things, either physical or conceptual, and classifying them into predefined categories (e.g., Problem, Treatment) is known as named entity recognition and classification (NERC), commonly denoted as NER. MUC-7 (Chinchor, 1998) defined NEs as expressions that uniquely identify entities (e.g., Person, Organisation, Location), including time (e.g., Date, Time) and quantity (e.g., monetary values, percentages). While the definitions of entity types are domain-dependent, a common characteristic across domains is that NE expressions are typically contained within noun phrases. NER typically refers to two separate but often merged sub-tasks: (i) recognition, and (ii) classification. The first step, recognition, is characterised by the identification of single or multiple adjacent words indicating the presence of an entity. Subsequently (in knowledge-driven approaches) or simultaneously (in data-driven approaches), entities are classified into a specific NE category.

Clinical event extraction

In the medical or clinical domain, NER typically constitutes the recognition of medical/clinical events, which may be categorised according to coarse-grained concepts such as medical Problem (e.g., ‘disease or syndrome’, ‘anatomic abnormality’, ‘sign or symptom’), Treatment (e.g., ‘therapeutic or preventive procedure’, ‘medical device’), and Test (e.g., ‘laboratory procedure’, ‘diagnostic procedures’) (exhaustive definitions are provided in Section 3.1). Hence, the terms NE and event, as well as NER and concept/event extraction/recognition, are used interchangeably throughout this thesis.

Clinical named entity recognition

In this subsection we review recent developments in clinical NER. This review is largely restricted to methods evaluated under similar conditions, in particular using identical datasets for evaluation, because comparisons between such methods are scientifically sounder. Conveniently, recent development in clinical NER has predominantly been driven by the 2010 and 2012 i2b2 shared tasks (Sun et al., 2013c; Uzuner et al., 2011).

Knowledge-driven methods for clinical NER are predominantly concerned with medication and temporal expression recognition (the latter is described in Section 2.1.5). More commonly, rule-based approaches are complemented with lexical resources; hence, what is considered ‘rule-based’ or ‘dictionary-based’ can be unclear. Rule-based strategies for NER adopt a range of features, such as lexical, contextual, and linguistic features, to identify entity mentions in text. The approach is reminiscent of term identification as described by Krauthammer and Nenadic (2004), where lexical, orthographic, and morpho-syntactic features are used to predict term occurrence.

Spasic et al. (2010) describe a rule-based approach to extract medication names. In addition to manually curated dictionaries of medication names collected from the training data and publicly available resources, they exploit the morphology of medication names. Specifically, common medication name affixes (e.g., -cycline, -nazole, -sulfa, -statin) proved to be good indicators of medication name mentions. Similarly, the rule-based approach of Yang (2010) relied on several complementary manually curated lexicons, including medication, dosage, frequency, mode/route, duration, and reason, which were used as part of term-based and token-based rule matching strategies. Subsequently, a set of rules was applied to expand the medication name lexicons to cope with abbreviations, synonyms, and spelling variations. In addition, contextual rules were applied during post-processing to exclude common false positives. The approaches of Spasic et al. (2010) and Yang (2010) were evaluated as part of the 2009 i2b2 medication challenge (Uzuner et al., 2010) and achieved F1-measures of 83.8% and 85.8% respectively. In comparison, the best medication NER method, which used a data-driven approach (a set of cascaded Maximum Entropy (MaxEnt) classifiers), achieved an F1-measure of 89.8% under the same conditions (Halgrim et al., 2011). Nevertheless, Uzuner et al. (2010) reported that knowledge-based approaches dominated the top 10 systems in the 2009 medication challenge.
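
The affix heuristic attributed to Spasic et al. (2010) above can be illustrated with a few lines of Python. This is a sketch, not their implementation: it only uses the four affixes quoted in the text, treats them as case-insensitive word endings (an assumption), and the function name is hypothetical.

```python
# Affixes quoted in the text; matching them as word endings is an assumption.
MEDICATION_SUFFIXES = ("cycline", "nazole", "sulfa", "statin")


def candidate_medications(text):
    """Return lower-cased tokens whose ending matches a known medication affix."""
    candidates = []
    for raw in text.split():
        token = raw.strip(".,;:()").lower()
        if token.endswith(MEDICATION_SUFFIXES):
            candidates.append(token)
    return candidates


print(candidate_medications("Started doxycycline and simvastatin; fluconazole held."))
```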

Xu et al. (2010) achieved an F1-measure of 93.2% in extracting medication names using a bespoke medication extraction system (MedEx). MedEx is based on a knowledge-based approach which combines a semantic (drug) lexicon and regex for initial tagging/annotation with a subsequent set of post-processing disambiguation rules. However, MedEx was evaluated under different conditions (such as a different and smaller corpus than the data used by Halgrim et al. (2011), Spasic et al. (2010) and Yang (2010)); therefore, direct comparison between the methods is not feasible.

Kraus et al. (2007) describe a purely rule-based method which solely takes advantage of contextual features to extract medication names; no lexical resources were adopted. A single rule was crafted to identify a word of at least four letters followed by a number and a unit of measurement (which equates to a common pattern of medication information in clinical text and represents a medication name, route of administration and the frequency of dosage). Their algorithm achieved an F1-measure of 84.7% on a bespoke evaluation set of clinical notes.
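
A single contextual rule of the kind described above could be approximated with one regular expression. The sketch below is an assumption-laden illustration rather than the rule of Kraus et al. (2007): the list of units, the group names and the function name are all hypothetical.

```python
import re

# Hypothetical unit list; the units actually used by Kraus et al. are not given here.
UNITS = r"(?:mg|mcg|g|ml|units?)"

# A word of at least four letters, then a number, then a unit of measurement.
RULE = re.compile(
    r"\b(?P<name>[A-Za-z]{4,})\s+(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>" + UNITS + r")\b",
    re.IGNORECASE,
)


def extract_medications(text):
    """Return (name, amount, unit) triples matched by the single rule."""
    return [
        (m.group("name"), m.group("amount"), m.group("unit"))
        for m in RULE.finditer(text)
    ]


print(extract_medications("... Valium 5 mg PO t.i.d. ..."))
print(extract_medications("... Growth hormone 0.6mg per day ..."))
```

The second call returns only ‘hormone’ as the name, which shows both the appeal of such a rule (no lexicon is needed) and its limits with multi-word medication names.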

Dictionary-based NER approaches use knowledge resources for the identification and classification of candidate terms into predefined categories. The typical approach is characterised by NLP pre-processing that includes POS tagging and shallow parsing to identify candidate terms. Subsequently, candidate terms (typically contained in NPs) are normalised (e.g., see Table 2.1 and lexical variant generation (LVG)) and matched against one or more knowledge resources to identify the category, if any. An advantage of dictionary-based NER is the availability of well-maintained domain-specific (clinical) knowledge resources. Arguably, this makes dictionary-based approaches less expensive to adopt than ML or rule-based approaches. Specifically, adopting available resources tends to be less time-consuming than manually crafting rules or annotating corpora in order to develop data-driven approaches. As clinical NEs possess similar characteristics and challenges to those described in the term recognition literature (e.g., variability, ambiguity, spelling variations), the performance of dictionary lookup (i.e., string matching) can be adversely affected. Therefore, in addition to dictionary lookup strategies, common heuristics are often employed to address this problem. For example, lexical normalisation or LVG (e.g., generation of synonyms, acronyms, spelling variations, and so forth) is typically applied.

A couple of well-known and publicly available dictionary-based NER systems are the clinical Text Analysis and Knowledge Extraction System (cTAKES) (Savova et al., 2010) and UMLS MetaMap (Aronson, 2001). Savova et al. (2010) reported micro F1-measures of 71.5% (strict) and 82.4% (lenient) for identifying and classifying clinical NEs defined by UMLS semantic types (such as ‘congenital abnormality’, ‘acquired abnormality’, ‘injury or poisoning’, ‘pathologic function’, ‘disease or syndrome’, ‘mental or behavioural dysfunction’, ‘cell or molecular dysfunction’, ‘experimental model of disease’, ‘anatomical abnormality’, and ‘neoplastic process’). These results were achieved on a held-out test set of 160 clinical notes. Abacha and Zweigenbaum (2011a) evaluated MetaMap using the 2010 i2b2 test dataset containing clinical concepts (i.e., Problem, Treatment, and Test). They reported a micro F1-measure of 52.28%, with per-category scores of 56.67% (Problem), 56.53% (Treatment) and 37.91% (Test).

Dictionary-based methods evaluated on a narrow and focused set of categories tend to achieve reasonable performance (e.g., Savova et al., 2010; Tanenblatt et al., 2010; Xu et al., 2010). In contrast, broadly defined categories typically have an adverse effect on evaluation scores (e.g., Abacha and Zweigenbaum, 2011a; Leaman et al., 2009). Nevertheless, knowledge-based systems, in contrast to data-driven ones, enable fine-grained semantic classification (i.e., mapping/classification of terms to knowledge resources). In addition, the availability of public and well-maintained knowledge resources often makes dictionary-based approaches less expensive than alternative methods. Furthermore, dictionary-based strategies, or more specifically the application of knowledge resources, tend to be widely used regardless of NER method (e.g., Sun et al., 2013c; Uzuner et al., 2011).

Data-driven methods approach NER as a sequence labelling problem. Hence, the sequence labelling representation, i.e., how the position of a token within an entity sequence is modelled, is important. Two representations have been widely reported: IO and BIO, indicating whether a token is at the beginning (B), inside (I), or outside (O) of an entity (see example in Appendix A.2). Conditional random fields (CRF) is a state-of-the-art sequence labelling algorithm successfully applied to NER tasks within the domain. In fact, CRF is heavily represented in general-domain NER and has likewise proven to be the state-of-the-art method for clinical NER through recent shared tasks (Sun et al., 2013c; Uzuner et al., 2011).
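
To make the two representations concrete, the following Python sketch converts token-level entity spans into BIO and IO tag sequences. The span format (start and end token offsets, end exclusive), the ‘B-’/‘I-’ tag prefixes and the function names are assumptions chosen for this example.

```python
def to_bio(tokens, entities):
    """Convert entity spans to BIO tags.

    entities is a list of (start, end, label) token offsets, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def to_io(tokens, entities):
    """Convert entity spans to IO tags (no separate beginning tag)."""
    return [tag.replace("B-", "I-", 1) for tag in to_bio(tokens, entities)]


tokens = ["Patient", "had", "chest", "pain", "after", "CT", "scan"]
entities = [(2, 4, "Problem"), (5, 7, "Test")]
print(to_bio(tokens, entities))
print(to_io(tokens, entities))
```

The IO variant cannot separate two adjacent entities of the same category, which is one reason the BIO schema is often preferred.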

In the 2010 i2b2 challenge, or specifically the clinical event extraction task, data-driven methods dominated the top 10 systems. In particular, CRF was overwhelmingly represented. However, the best system was described as semi-supervised (de Bruijn et al., 2011) (to be reviewed shortly). The concept extraction task included the recognition and classification of clinical EVENTs into coarse-grained (high-level) categories such as: Problem (e.g., ‘chest pain’, ‘breast cancer’), Treatment (e.g., ‘craniectomy’, ‘diltiazem’ (medication)) and Test (e.g., ‘full blood count’, ‘CT scan’).

Table 2.5: Top systems in the 2010 i2b2 event extraction task
Event extraction (i.e., problem, treatment, test): micro-averaged results on the test data (477 records).

System                       Method                       F1-measure % (strict | lenient)
de Bruijn et al. (2011)      ML (Semi-supervised)         85.20 | 92.40
Jiang et al. (2011)          ML (CRF)                     83.90 | 91.30
Kang et al. (2010)           Hybrid (CRF + Dictionary)    82.10 | 90.40
Gurulingappa et al. (2010)   ML (CRF)                     81.80 | 90.50
Patrick et al. (2011)        ML (CRF)                     81.80 | 89.80
Torii et al. (2011)          ML (CRF)                     81.30 | 89.80
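
The strict and lenient scores in Table 2.5 are not defined in this section. A common reading, assumed in the sketch below, is that a strict match requires identical span boundaries and category, whereas a lenient match only requires overlapping spans of the same category. The official i2b2 scorer may differ in detail, and the span format and function name are hypothetical.

```python
def f1(gold, predicted, lenient=False):
    """Micro-averaged F1 over (start, end, label) spans, end exclusive."""

    def matches(a, b):
        if a[2] != b[2]:
            return False
        if lenient:
            return a[0] < b[1] and b[0] < a[1]  # the two spans overlap
        return a[:2] == b[:2]  # identical boundaries

    # Predictions that match some gold span count towards precision;
    # gold spans matched by some prediction count towards recall.
    tp_pred = sum(any(matches(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(matches(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 2, "Problem"), (5, 6, "Test")]
predicted = [(0, 2, "Problem"), (4, 6, "Test")]
print(f1(gold, predicted), f1(gold, predicted, lenient=True))
```

Under this reading the lenient score can never be lower than the strict score, which matches the pattern of every row in the table.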

Notably, four out of the five top systems used external knowledge resources (all of which included some derivative of the UMLS thesaurus) (Gurulingappa et al., 2010; Jiang et al., 2011; Kang et al., 2010; Patrick et al., 2011). In addition, many top-performing systems adopted existing NER systems for feature generation. For example, Jiang et al. (2011) reported the adoption of MedLEE (Friedman et al., 2004) and KnowledgeMap¹¹; Kang et al. (2010) used a number of off-the-shelf systems: (i) A Biomedical Named Entity Recognizer (ABNER) (Settles, 2005), (ii) Peregrine (Schuemie et al., 2007) with the 2009 UMLS thesaurus, and (iii) the StanfordNER¹²; Torii et al. (2011) retrained BioTagger-GM, originally developed as a gene/protein NER system, as their clinical NER system.

¹¹ http://knowledgemap.mc.vanderbilt.edu/research/
¹² http://nlp.stanford.edu/software/CRF-NER.shtml

In addition to the EVENT categories covered by the 2010 i2b2 concept extraction task, the 2012 i2b2 challenge concept extraction task included event categories such as: Evidential (i.e., events represented by verbs, or nouns/adjectives derived from verbs, e.g., ‘reported’, ‘showed’), Clinical department (e.g., ‘intensive care unit’, ‘hearing clinic’), and Occurrence (i.e., any event that does not fall into the other five categories) (Sun et al., 2013a).

Similar to the preceding clinical NER task (Uzuner et al., 2011), the methods submitted as part of the 2012 i2b2 challenge event extraction task (Sun et al., 2013c) were dominated by data-driven methods, in particular CRF (Table 2.6 lists the top six systems, their methods and respective results). Once again, the adoption of knowledge resources was widely reported (e.g., Kovacevic et al., 2013; Roberts et al., 2013). As in the preceding challenge, the most commonly reported sequence labelling schema was BIO.

Table 2.6: Top systems in the 2012 i2b2 event extraction task
Event extraction (i.e., problem, treatment, test, occurrence, evidential and clinical department): micro-averaged (lenient) results on the test data (120 records).

System                     Method                       F1-measure %
Xu et al. (2013)           ML (CRF)                     91.66
Tang et al. (2013)         ML (CRF)                     90.13
Roberts et al. (2013)      ML (CRF)                     89.33
D’Souza and Ng (2013)      ML (CRF)                     88.35
Lin et al. (2013)          ML (CRF)                     87.94
Kovacevic et al. (2013)    Hybrid (CRF + Dictionary)    87.29

Features adopted to recognise clinical events are more or less consistent across data-driven approaches. We have analysed a number of the top-performing methods listed in Tables 2.5 and 2.6 and summarised the features commonly adopted for clinical NER in Table 2.7.

Table 2.7: Common data-driven features used for clinical event extraction

Feature group    Feature type
Lexical          token, n-gram, bag-of-words, POS
Morphological    word lemma, suffix, prefix
Orthographic     word type (e.g., alphanumeric, word, number);
                 word case (e.g., lower/upper case, all capital);
                 word shape (e.g., ‘CT scan’ → ‘XX xxxx’, ‘Brain’ → ‘Xxxxx’)
Syntactic        chunks (e.g., noun, verb and prepositional phrases)
Semantic         knowledge resources (e.g., UMLS Metathesaurus), NER
Contextual       feature window [-(1-2), +(1-2)]
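
The word shape feature in Table 2.7 can be computed with a simple character mapping. The sketch below reproduces the two examples from the table; mapping digits to ‘d’ and leaving other characters unchanged are assumptions, since the table only shows letters and spaces.

```python
def word_shape(text):
    """Map upper-case letters to 'X' and lower-case letters to 'x'.

    Digits are mapped to 'd' (an assumption); other characters are kept as-is.
    """
    shape = []
    for ch in text:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


print(word_shape("CT scan"))
print(word_shape("Brain"))
```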

Recent NER approaches have also focused on semi-supervised strategies (de Bruijn et al., 2011; Jonnalagadda et al., 2012; Leaman et al., 2009; Suakkaphong et al., 2011). In contrast to supervised classification, semi-supervised classification uses both labelled and unlabelled data. The aim is to derive useful ML models while relying on less annotated data, hence reducing the cost of labour (i.e., the need for manual annotation of corpora) while potentially maximising performance.

de Bruijn et al. (2011) presented a semi-supervised method (using the sequence labelling algorithm semi-Markov, a Hidden Markov Model that can tag multi-token spans of text) for event extraction and achieved significantly better performance than the other systems participating in the 2010 i2b2 NER task. Their performance is considered among the state-of-the-art for extraction of clinical concepts (achieving strict and lenient micro F1-measures of 85.20% and 92.40% respectively; see Table 2.5). Their approach consisted of generating hierarchical word clusters (based on contextual similarity) from unlabelled data, which theoretically allows event mentions that are rare or unobserved in the labelled training data to be predicted (in the held-out test data). The Brown clustering algorithm (Brown et al., 1992) was used to generate the hierarchical word clusters from the unlabelled data. Cluster granularity was found to be optimal at seven hierarchical levels. In addition, the clusters were largely made up of semantic concepts and POS, and were used as word-level back-off features. Notably, however, the contribution of the semi-supervised approach is questionable, as they reported a negative impact of the cluster features, i.e., -0.0014% in F1-measure. Additionally, they reported using a very large feature space with over one million lexical features.

More recently, Jonnalagadda et al. (2012) presented a similar approach, applying distributional semantics as a semi-supervised strategy for clinical event extraction (using the 2010 i2b2 dataset). The aim of this method is to generate word features derived from words that appear in similar contexts (or with similar distributional elements across multiple contexts). Similar to the word clusters described by de Bruijn et al. (2011), distributional semantics could be used to compensate for the limited vocabulary observed in smaller sets of annotated data by profiling the context in which events appear. Further, while Jonnalagadda et al. (2012) were unable to match the state-of-the-art performance, their method showed a notable performance gain over a baseline.

Suakkaphong et al. (2011) trained a sequence labelling algorithm (CRF) with a set of common NER features (e.g., lexical, syntactic and semantic) and used this model as a baseline to statistically compare against various semi-supervised methods. Two semi-supervised approaches were experimented with: bootstrapping and feature sampling. Bootstrapping, or self-training, is an iterative learning process in which a model is trained on a small initial labelled set (Zhu and Goldberg, 2009). The classifier is then used to classify the unlabelled data to obtain self-labelled data. Subsequently, the most confident self-labelled predictions are added to the set of labelled data, and the classifier is retrained. This procedure is repeated until an optimum outcome is achieved. Similarly, feature sampling is an iterative learning process. For each iteration, two or more classifiers are trained using randomly generated overlapping ‘views’ (intersecting subsets of features). Self-labelled data is then obtained by some voting strategy. The experiments in Suakkaphong et al. (2011) showed that bootstrapping performed significantly better than their baseline. Their best F1-measure achieved was 73.94%.
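
The bootstrapping (self-training) loop described above can be sketched in a few lines of Python. To keep the example self-contained it uses a toy nearest-centroid classifier over one-dimensional features instead of a CRF; the classifier, the margin-based confidence score, the threshold and the stopping rule are all assumptions made for illustration.

```python
def train(labelled):
    """Toy classifier: one centroid per class over 1-D feature values."""
    sums = {}
    for x, y in labelled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict(model, x):
    """Return (label, confidence) for x; assumes at least two classes.

    Confidence is the distance margin between the two nearest centroids.
    """
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(x - model[runner_up]) - abs(x - model[best])


def self_train(labelled, unlabelled, threshold=2.0, max_rounds=10):
    """Iteratively add confident self-labelled items and retrain."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    model = train(labelled)
    for _ in range(max_rounds):
        scored = [(x, *predict(model, x)) for x in unlabelled]
        confident = [(x, y) for x, y, conf in scored if conf >= threshold]
        if not confident:
            break  # nothing confident enough is left to add
        labelled.extend(confident)
        added = {x for x, _ in confident}
        unlabelled = [x for x in unlabelled if x not in added]
        model = train(labelled)
    return model, labelled, unlabelled


model, labelled, remaining = self_train([(0.0, "neg"), (10.0, "pos")], [1.0, 9.0, 5.2])
print(model, remaining)
```

In this toy run the two clear-cut points are self-labelled in the first round, while the ambiguous point near the decision boundary stays unlabelled.
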
The dataset used by Suakkaphong et al. (2011) consisted of cancer-related MEDLINE abstracts annotated for disease names; it is not publicly available.

In summary, a review of the current literature shows that supervised ML approaches to clinical NER provide state-of-the-art performance, in particular for coarse-grained identification of common medical event categories (e.g., Problem, Treatment, and Test). Knowledge-driven approaches likewise have a number of advantages, such as simplicity (e.g., Kraus et al., 2007), cost-effectiveness given the availability of well-maintained knowledge resources, and consequently the inherent means to facilitate fine-grained semantic classification.

Semi-supervised approaches to clinical NER are a fairly recent trend with only a handful of investigations. However, semi-supervised methods (e.g., word clustering and distributional semantics) are increasingly being explored to address the lack of annotated corpora and have shown promising results, but are yet to match supervised approaches.

Moreover, recent developments in ML approaches to IE/NER have shown deep belief networks or ‘deep learning’ to perform very well on a variety of NLP tasks such as POS tagging, NER, and sentiment analysis, among others. One of the most attractive aspects of deep learning is unsupervised/automated feature learning. However, a challenge of deep learning methods is that they are extremely resource intensive. Nevertheless, deep learning in conjunction with semi-supervised methods is promising and a probable future direction for TM/NLP methods.

2.1.4 Temporal information extraction

TIE refers to the automated extraction of temporal information from natural language text based on a formalised temporal representation (Sun et al., 2013b). TIE is naturally split into two related tasks:

1. temporal expression recognition and normalisation (TERN), and

2. temporal relation identification and classification (TLINK)

In contrast to other IE tasks (e.g., NER), a community-adopted formalised representation schema (such as the current de facto standard, TimeML (Saurí et al., 2006), or some derivative, e.g., Sun et al. (2013a); UzZaman et al. (2013)) is used to structure extracted temporal data. Hence, before we review the TIE tasks, we describe work on temporal representation in IE.

Temporal representation

Temporal representation schemas in NLP have developed through many years of research. It is widely recognised that some of the most influential work on the representation of concept primitives and temporal relations derives from Allen (1984) and Allen and Ferguson (1994). For example, the widely used TimeML temporal representation schema and its derivatives are significantly influenced by the representation logic of this early work.

The TIMEX schema was one of the earlier attempts to standardise the representation of temporal expressions and evolved from the Message Understanding Conference 7 (Chinchor, 1998). TIMEX provides a very limited representation model, solely for temporal expression recognition, with the capacity to capture two types of temporal expressions: Date (e.g., ‘August 23, 1993’) and Time (e.g., ‘2:23 p.m.’). Ferro et al. (2001) extended the TIMEX schema (coined TIMEX2) by accommodating the normalised value of temporal expressions (using ISO-8601 for standardised representation) and temporal modifiers (e.g., ‘approximately’, ‘before’, ‘after’), and by extending the recognised temporal expression types with Duration (e.g., ‘two weeks’) and Set or Frequency (e.g., ‘every Tuesday’), including related periodicity and granularity attributes. Periodicity provides a normalised representation of a Set expression (e.g., ‘every week’ or ‘weekly’ translates to the periodicity ‘F1W’, with the granularity ‘G1D’, or one day).

Pustejovsky et al. (2003) significantly extended the preceding work by addressing gaps in TIE and reasoning (ACL Workshop on Spatial and Temporal Reasoning (2001) and LREC workshop on Annotation Standards for Temporal Information in Natural Language (2002)).¹³

¹³ Influential papers from these workshops include Schilder and Habel (2001) and Filatova and Hovy (2001), as well as a related thesis on temporal information and its representation by Setzer (2001).

Pustejovsky et al. (2003) specified four basic problems that were addressed by the TimeML schema (Ingria and Pustejovsky, 2004; Saurí et al., 2004, 2006) for event and temporal expression markup:

(i) time stamping of events (temporal anchoring of events);

(ii) temporally ordering events with respect to one another;

(iii) reasoning/representation of contextually underspecified temporal expressions (e.g., ‘two weeks ago’, ‘last week’);

(iv) reasoning/representation of persistence of events or event timeline (i.e., how long does an event last?).

TimeML version 1.2.1 is the most up-to-date version (Saurí et al., 2006). However, the development of temporal representation in IE is ongoing. In this thesis, we have adopted the amended TimeML schema as described by Sun et al. (2013a). More recently, Styler et al. (2014) proposed and published a more comprehensive temporal representation schema (based on TimeML) for clinical narratives, coined the ‘THYME Annotation Guidelines’.¹⁴ Sun et al. (2013a) defined three relevant data structures: EVENT (note that events or clinical NEs have been described in Section 2.1.3), TIMEX3 and TLINK.¹⁵

EVENT tag is used to annotate anything in a medical record that is relevant to a patient’s clinical timeline. In this thesis we have restricted the types of EVENTs to the major clinical concept categories: Problem, Treatment and Test (formally defined in Chapter 3).

TIMEX3 tag is used to mark up explicit temporal expressions of the following types: Date (e.g., ‘June 19th 1983’), Time (e.g., ‘1:30 pm’), Duration (e.g., ‘1 to 2 weeks’, ‘6 months period’) and Frequency (e.g., ‘daily’, ‘once every week’). Each TIMEX3 needs to be normalised according to the ISO-8601 standard (value). This standard requires Date/Time to be normalised to the [YYYY-MM-DD]T[HH:MM] format, and Duration/Frequency TIMEX3s to be normalised to R[#1 times]P[#2][Units] (repeat #1 times during #2 units of time). For example, ‘twice every three weeks’ is normalised as ‘R2P3W’. Like the TimeML TIMEX3s, the TIMEX3s defined here also have a modifier attribute, mod, which takes a subset of the TimeML TIMEX3 modifier values: ‘more’, ‘less’, ‘approx’, ‘start’, ‘end’, ‘middle’, and the default ‘NA’ (see Table 2.8). Instead of TimeML temporal functions, TLINKs between two TIMEX3s are used to handle the anchoring of durations and relative times (see the guidelines for details¹⁶).

¹⁴ Available for download here: http://clear.colorado.edu/compsem/documents/THYME%20Guidelines.pdf
¹⁵ Note, the following descriptions have largely been reproduced from Sun et al. (2013a, p. 5).

Table 2.8: TIMEX3 representation schema
This table shows the adopted temporal expression representation schema.

Tag       Attribute    Values
TIMEX3    type         DATE | TIME | DURATION | FREQUENCY
          mod          NA | more | less | approx | start | end | middle
          val          ISO-8601 adopted representation
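
The schema in Table 2.8 and the value formats described above can be illustrated with a small Python sketch. The class and function names are hypothetical, the validation is deliberately minimal, and the duration value ‘P2W’ follows the general ISO-8601 duration form, which is an assumption here rather than something taken from the annotation guidelines.

```python
from dataclasses import dataclass
from datetime import datetime

TIMEX3_TYPES = {"DATE", "TIME", "DURATION", "FREQUENCY"}
TIMEX3_MODS = {"NA", "more", "less", "approx", "start", "end", "middle"}


@dataclass
class Timex3:
    """One normalised temporal expression (attributes as in Table 2.8)."""
    text: str
    type: str
    val: str
    mod: str = "NA"

    def __post_init__(self):
        if self.type not in TIMEX3_TYPES:
            raise ValueError(f"unknown TIMEX3 type: {self.type}")
        if self.mod not in TIMEX3_MODS:
            raise ValueError(f"unknown TIMEX3 modifier: {self.mod}")


def datetime_value(dt, with_time=False):
    """Format a datetime as YYYY-MM-DD, or YYYY-MM-DDTHH:MM when with_time is set."""
    return dt.strftime("%Y-%m-%dT%H:%M" if with_time else "%Y-%m-%d")


def frequency_value(times, amount, unit):
    """Build R[#1]P[#2][Units], e.g. twice every three weeks gives 'R2P3W'."""
    return f"R{times}P{amount}{unit}"


admitted = Timex3("9/6/93 at 17:45", "TIME",
                  datetime_value(datetime(1993, 9, 6, 17, 45), with_time=True))
headache = Timex3("about two weeks", "DURATION", "P2W", mod="approx")
print(admitted.val, headache.mod, frequency_value(2, 3, "W"))
```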

The TLINK tag is used to encode temporal relations between temporal elements such as EVENTs and TIMEX3s. The TLINK types are a subset of the TimeML TLINK types. Eight types were adopted: Before, After, Begun by, Ended by, During, Simultaneous, Overlap and Before overlap. See Table 2.9 for the TLINK representation schema, and the examples of TLINKs in clinical narratives that follow.

Table 2.9: TLINK representation schema This table shows the adopted temporal link representation schema.

Tag     Attribute   Values
TLINK   type        BEFORE | AFTER | OVERLAP
TLINK   explicit    no | yes

In the following, we give a number of examples using the temporal representation schema described above for TIMEX3 and TLINK:

Example: TIMEX3.type = DATE

"Discharge date: 9/11/93"

Example: TIMEX3.type = TIME

16 2012 i2b2 Clinical Temporal Relations Challenge Annotation Guidelines

"Admitted: 9/6/93 at 17:45."

Examples: TIMEX3.type = DURATION

"... headache for about two weeks ..."

"... for the next 72 hours..."

Examples: TIMEX3.type = FREQUENCY

"... Valium 5 mg PO t.i.d. ..."

"... Growth hormone 0.6mg per day ..."

The following examples are reproduced from (Sun et al., 2013c, p.3):17

BEFORE: The patient was given stress dose steroids prior to his surgery. → [stress dose steroids] BEFORE [his surgery]

AFTER: Before admission, he had another serious concussion. → [admission] AFTER [another serious concussion]

SIMULTANEOUS: The patient’s serum creatinine on discharge, 2012-05-06, was 1.9. → [discharge] SIMULTANEOUS [2012-05-06]

OVERLAP: She denies any fevers or chills. → [fevers] OVERLAP [chills]

BEGUN BY: On postoperative day No 1, he was started on Percocet. → [Percocet] BEGUN BY [postoperative day No 1]

17 Due to the number of TLINK annotations in the 2012 i2b2 corpus: Before, Ended by, and Before overlap were merged as Before; Begun by and After were merged as After; and Simultaneous, Overlap and During were merged as Overlap.

ENDED BY: His nasogastric tube was discontinued on 05-26-98. → [His nasogastric] ENDED BY [05-26-98]

DURING: His preoperative workup was completed and included a normal white count. → [a normal white count] DURING [His preoperative workup]

BEFORE OVERLAP: The patient had an undocumented history of possible atrial fibrillation prior to admission. → [possible atrial fibrillation] BEFORE OVERLAP [admission]
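Footnote 17 describes how the eight TLINK types illustrated above were merged into three for the 2012 i2b2 corpus. A minimal sketch of that mapping follows; the dictionary and function names are illustrative and are not taken from the challenge resources.

```python
# Eight fine-grained TLINK types mapped to the three merged types (footnote 17).
MERGED_TLINK_TYPE = {
    "BEFORE": "BEFORE",
    "ENDED_BY": "BEFORE",
    "BEFORE_OVERLAP": "BEFORE",
    "BEGUN_BY": "AFTER",
    "AFTER": "AFTER",
    "SIMULTANEOUS": "OVERLAP",
    "OVERLAP": "OVERLAP",
    "DURING": "OVERLAP",
}


def merge_tlink_type(label: str) -> str:
    """Map a fine-grained TLINK type (e.g. 'Begun by') to its merged type."""
    key = label.strip().upper().replace(" ", "_")
    return MERGED_TLINK_TYPE[key]


print(merge_tlink_type("Begun by"))        # AFTER
print(merge_tlink_type("BEFORE OVERLAP"))  # BEFORE
```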

2.1.5 Temporal entity extraction

The problem of recognising and normalising expressions denoting temporal entities (TEs) is known as the temporal expression recognition and normalisation (TERN) task. TEs defined by TIMEX3 are grouped into four temporal types: Date (e.g., ‘August 23, 1993’), Time (e.g., ‘2:23 p.m.’), Frequency (e.g., ‘every morning’), and Duration (e.g., ‘two weeks’). In addition, the ‘Date and time format: ISO-8601’ standard is used to normalise all TE types (Ferro et al., 2001; Sun et al., 2013c; UzZaman et al., 2013; Verhagen et al., 2007, 2010).18

TEs may also be characterised as explicit, implicit, or relative. Explicit expressions are clear and unambiguous expressions such as ‘1951’, ‘October 1917’ and ‘May 8, 1945’. Implicit expressions are stated indirectly, e.g., by reference to common bank/national holidays such as ‘Christmas Day’ or ‘Martin Luther King, Jr. Day’. Relative temporal expressions cannot be resolved without a reference date; examples include ‘today’, ‘day of admission’ and ‘day of surgery’. Relative expressions in a news article are typically disambiguated using the Document Creation Time or DCT (i.e., the article creation/publication date). In the clinical domain, the DCT may not always be a reliable reference date. Instead, the date of consultation, date of admission, or date of discharge is commonly regarded as the ‘DCT’. In this thesis, we therefore use the arguably more suitable name Document Reference Time (DRT).

A common means of evaluating the temporal recognition task is by relaxed precision-recall metrics (Section 2.1.1). In addition, the normalisation of expressions is measured by the accuracy of the given attributes (value and type). The overall performance metric used

18 Note that recent work on temporal representation has extended this schema, i.e., the THYME Annotation Guidelines.

to evaluate the TERN task is typically the product of F1-score and value accuracy (e.g., (Sun et al., 2013c; UzZaman et al., 2013)). The motivation for this approach is that the complete and correct span recognition of a TE would be useless without the correct interpretation of the expression.

While the TERN task in the clinical domain has lagged behind due to the lack of available research data (Sun et al., 2013c), related work in the general domain has made notable progress. Therefore, we first review a few recent and influential general domain evaluation tasks in TERN, before turning to relevant domain-specific work in clinical TERN.
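To illustrate why relative expressions need a reference date, the sketch below resolves a few relative expressions against a Document Reference Time (DRT). The small offset lexicon and the function name `resolve_relative` are assumptions made for this example; the systems reviewed in this section use far richer rule sets.

```python
from datetime import date, timedelta

# Hypothetical lexicon: offset in days from the reference date.
RELATIVE_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
}


def resolve_relative(expression: str, drt: date) -> str:
    """Resolve a relative expression to an ISO-8601 date, given the DRT."""
    offset = RELATIVE_OFFSETS[expression.lower()]
    return (drt + timedelta(days=offset)).isoformat()


# Assume the DRT is the admission date of the record.
drt = date(1993, 9, 6)
print(resolve_relative("yesterday", drt))  # 1993-09-05
```

The same expression yields a different normalised value for every record, which is why the choice of reference date (admission, discharge or consultation date) matters in clinical text.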

TempEval-2 TERN task

TempEval-2 comprised several evaluation tasks, most notably a TERN task (Verhagen et al., 2010). Overall, TempEval-2 provided manually annotated data (newswire) for six languages; however, we are only interested in the English TERN task. TempEval-2 adopted a simplified version of TimeML version 1.2.1 (Saurí et al., 2006), and in particular the TIMEX3 tag set. Further, while the evaluation of TE recognition uses customary precision-recall metrics (strict scores), normalisation attributes such as type and value are scored by the precision rate, tp / (tp + fp).

The HeidelTime system (Strötgen and Gertz, 2010) achieved the best evaluation results in TempEval-2 (Table 2.10 shows the top six systems). The system was developed using the UIMA19 (Unstructured Information Management Architecture) framework, and consisted of rule-based extraction and normalisation components. An initial NLP pipeline of sentence splitter, tokeniser and POS-tagger precedes the TERN components. HeidelTime uses two post-processing steps to disambiguate underspecified temporal expressions (e.g., relative expressions) and to remove invalid expressions (e.g., redundant annotations/overlaps). In addition, the HeidelTime system provided two differently optimised outputs (precision and recall); unfortunately, the specifics of these optimisation methods were not given.

Another notable system, coined TRIPS/TRIOS, achieved near state-of-the-art performance on the TempEval-2 dataset (UzZaman and Allen, 2010a). The system is characterised by a hybrid methodology. It combines TRIPS (an off-the-shelf deep semantic parser; see (Allen et al., 2008)) and a CRF for recognition of TEs (i.e., the CRF predictions are only accepted if a normalised value and type can be extracted). The CRF is trained using a token-level BIO sequence representation model with a set of lexical

19 http://uima.apache.org/

and syntactic features (Allen et al., 2008). The normalisation component was purely rule-based.

Saquete Boro (2010) developed an interesting method, despite the poor results reported. Inspired by Negri et al. (2006), Saquete Boro (2010) adapted a Spanish knowledge-based TERN system (TERSEO) and extended it to multiple languages (English, Italian and Catalan). An architecture similar to that of Vossen (2000) was implemented to obtain knowledge databases for the different languages. In order to adapt/translate the rule set, a temporal expression rules interlingua index (TER-ILI) was used to interconnect the various knowledge resources and, consequently, to develop the cross-language TERN component.

The TIPSem system (Llorens et al., 2010) uses a data-driven method for TE recognition, and a hybrid normalisation component. The authors reported the adoption of the same feature set across all CRF models trained, including morphological, syntactic, polarity, tense, aspect and semantic role features. For TE recognition, a CRF was trained using a BIO sequence representation model at the token level. The normalisation component consisted of two main steps: (1) obtain the normalised TE type, and (2) apply the corresponding normalisation rules. The first step determined which rule set to apply in the second step. This was achieved by training a CRF model at the expression level (the feature sets of multi-token expressions were concatenated), using the same feature set as the previous models in addition to abstracted TE patterns (e.g., replacing numbers by ‘NUM’, months by ‘MONTH’, weekdays by ‘WEEKDAY’, and so forth).

The Edinburgh system was developed using a bespoke set of NLP components coined LT-TTT2 (Grover et al., 2010).20 While it is described as rule-based, a detailed system description is unfortunately missing.
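The token-level BIO representation and the abstracted TE patterns mentioned above can be illustrated as follows. This is a generic sketch, not the feature pipeline of any of the reviewed systems; whitespace tokenisation, the `TIMEX3` label suffix and the small month/weekday lexicons are simplifying assumptions.

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"}


def bio_encode(tokens, spans, label="TIMEX3"):
    """Return one BIO tag per token; spans are (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the expression
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the expression
    return tags


def abstract_token(token):
    """Replace numbers, months and weekdays by placeholder symbols."""
    low = token.lower().strip(".,")
    if low in MONTHS:
        return "MONTH"
    if low in WEEKDAYS:
        return "WEEKDAY"
    if re.fullmatch(r"\d+(st|nd|rd|th)?", low):
        return "NUM"
    return token


tokens = "Admitted on June 19th 1983".split()
print(bio_encode(tokens, [(2, 5)]))
# ['O', 'O', 'B-TIMEX3', 'I-TIMEX3', 'I-TIMEX3']
print(" ".join(abstract_token(t) for t in tokens[2:5]))
# MONTH NUM NUM
```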

TempEval-3 TERN task

TempEval-3 is the follow-up to TempEval-2 and the most recent (completed) evaluation task. As far as the TERN task is concerned, the key differences are that a larger gold-standard corpus (twice the size of that of TempEval-2) and a complementary silver-standard corpus (automatically annotated data with minor human correction) are provided as training data. In addition, lenient (as opposed to strict) performance metrics are considered. Similar to Sun et al. (2013c), the primary score for evaluating the overall TERN task is obtained by the product of the (lenient) F1-measure and the attribute value accuracy.
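The difference between strict and lenient span matching can be made concrete with a small sketch. It assumes that spans are (start, end) character offsets with an exclusive end, and that a lenient match is any overlap between a predicted and a gold span; the exact matching rules of the official TempEval-3 scorer are not reproduced here.

```python
def overlaps(a, b):
    """True if two (start, end) spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall_f1(gold, predicted, lenient=False):
    """Span-level scores under strict (exact) or lenient (overlap) matching."""
    match = overlaps if lenient else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 3), (10, 12)]
pred = [(1, 3), (10, 12), (20, 21)]
print(precision_recall_f1(gold, pred))                # strict
print(precision_recall_f1(gold, pred, lenient=True))  # lenient
```

In this toy case the prediction (1, 3) only counts under lenient matching, so the lenient scores are higher than the strict ones.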

20 LT-TTT2 is available at: http://ww.ltg.ed.ac.uk/software/lt-ttt2/

Table 2.10: TempEval-2 TERN results This table shows the official results from the TempEval-2 TERN task.

System                                    P%   R%   F1%   Type   Value
HeidelTime-1 (Strötgen and Gertz, 2010)   90   82   86    0.96   0.85
HeidelTime-2 (Strötgen and Gertz, 2010)   82   91   86    0.92   0.77
TRIPS/TRIOS (UzZaman and Allen, 2010a)    85   85   85    0.94   0.76
TERSEO (Saquete Boro, 2010)               76   66   71    0.98   0.65
TIPSem (Llorens et al., 2010)             92   80   85    0.92   0.65
Edinburgh (Grover et al., 2010)           85   82   84    0.84   0.63

A notable observation from the TempEval-3 TERN results is that rule-based methods again dominated the top performing systems (i.e., HeidelTime (Strötgen et al., 2013), NavyTime (Chambers, 2013) and SUTime (Chang and Manning, 2013)).21 While hybrid methods showed promising results, strictly data-driven approaches generally performed poorly. All submitted systems reported using rule-based normalisation methods. Moreover, UzZaman et al. (2013) noted, based on the overall reported experiments, that the use of the silver dataset, either on its own or together with the gold data, did not improve performance for hybrid/ML-based approaches.

A notable data-driven method, coined ManTIME-(4,6), was developed by Filannino et al. (2013). The difference between these two submitted systems was the training data used: ManTIME-4 used the gold annotated data, while ManTIME-6 used the silver-standard data. Both submitted systems used CRFs trained using a token-level BIO sequence representation schema with 94 features (mainly morphologically inspired; see (Filannino et al., 2013, p.54) for the exhaustive list). A set of rule-based post-processing components was used to boost the recognition performance: (i) a probabilistic correction module, (ii) a BIO fixer, and (iii) a threshold-based label switcher. All post-processing modules aimed to improve the token-level sequence label predictions. The authors reported a statistically significant difference between plain CRFs and CRFs with post-processing.

Another data-driven system, ATT-1, used binary MaxEnt classifiers (see (Jung and Stent, 2013, p.21) for the complete list of features), achieving the best strict recognition performance. Yet, the strict performance is insignificant on its own, given that the

21 HeidelTime (https://code.google.com/p/heideltime/) and SUTime (http://nlp.stanford.edu/software/sutime.shtml) are both freely available open-source software.

primary score was approximately 12 percentage points below the best performing system. An explanation may be the poor overall recall achieved by ATT-1. However, a noteworthy observation of their investigation was the effectiveness of a large context window size used as features: (0, 1, 3, 7 versus 15) tokens preceding and following the current token.

Finally, given the small sample size of the test dataset (i.e., 138 temporal expressions in total), the final results should be considered with caution, as their reliability and generalisability are questionable.

Table 2.11: TempEval-3 TERN results This table lists the official results of the top performing systems from the TempEval-3 TERN task.

System                                    P%      R%      F1%     Primary score
HeidelTime-t (Strötgen et al., 2013)      90.30   93.08   87.68   77.61
HeidelTime-bf (Strötgen et al., 2013)     87.31   90.00   84.78   72.39
HeidelTime-1.2 (Strötgen et al., 2013)    86.99   89.31   84.78   72.12
NavyTime-1,2 (Chambers, 2013)             90.32   89.36   91.30   70.97
ManTIME-4 (Filannino et al., 2013)        89.66   95.12   84.78   68.97
ManTIME-6 (Filannino et al., 2013)        87.55   98.20   78.99   68.27
ManTIME-3 (Filannino et al., 2013)        87.06   94.87   80.43   67.45
SUTime (Chang and Manning, 2013)          90.32   89.36   91.30   67.38
ATT-1 (Jung and Stent, 2013)              99.05   75.36   85.60   65.02

2012 i2b2 TERN task

We review a clinical TERN task that was organised as part of the 2012 i2b2 Temporal Relation Challenge (Sun et al., 2013c). To the best of our knowledge, the set of investigations that resulted from this challenge is the most significant work in the area of clinical TERN to date.

We have observed a few notable trends in relation to previous work. For example, in contrast to previous TERN evaluation tasks, rule-based systems were not the predominant methodology in the clinical TERN task. Only two of the top five teams (the 1st and 4th) developed purely rule-based methods (see Table 2.12). The remaining three used hybrid methodologies. A reasonable explanation for this potential paradigm shift may be the availability of more annotated data compared to previous challenges (TempEval-2 and TempEval-3). In addition, the availability of general domain TERN components (as a direct result of previous challenges) may also have encouraged researchers to investigate alternative approaches.

Another notable observation is that the majority of teams adopted and extended general domain TERN components, which suggests that domain adaptation is feasible. The most popular general domain TERN component adopted and extended was HeidelTime (Strötgen and Gertz, 2010; Strötgen et al., 2013), which ranked 1st in both the TempEval-2 and TempEval-3 TERN tasks (Tables [2.10,2.11]). In addition, all submitted systems, similarly to previous tasks, used rule-based (value) normalisation methods.

Table 2.12: 2012 i2b2 TERN: methods and resources This table lists the methods and resources adopted by the top performing systems in the 2012 i2b2 TERN task. Note, ‘hybrid’ methods in this table refer to a combination of CRF and rules.

System                    Method       Off-the-shelf recognition   Off-the-shelf normalisation
Sohn et al. (2013)        Rule-based   HeidelTime                  HeidelTime
Xu et al. (2013)          Hybrid       -                           -
Kovacevic et al. (2013)   Hybrid       -                           TRIOS
Tang et al. (2013)        Rule-based   HeidelTime                  HeidelTime
Lin et al. (2013)         Hybrid       HeidelTime                  jchronic22

Xu et al. (2013) used a Context-Free Grammar (CFG) for their normalisation component rather than the commonly adopted regex-based approach. Specifically, their normalisation component was divided into two phases. In the first phase, the CFG was used to compute the values of TEs. Initially, TEs are parsed into CFG parse trees by extracting terminal and non-terminal symbols (e.g., cardinals, ordinals, medical abbreviations of frequency expressions). Subsequently, some of these expressions could be normalised directly (i.e., value and type attributes) by reading inherent properties of the parsed symbols from the parse tree. Those expressions that could not be determined directly or ‘locally’ (i.e., relative TEs) were converted into intermediate expressions, which were used in the second phase. In the second phase, several deduction rules are applied in order to determine relative dates (e.g., admission date, operation day, etc.). As shown in Table 2.13, Xu et al. (2013) achieved the best normalisation results for both the type and modifier attributes using this approach. In addition, their value attribute accuracy was comparable to that of the other top performing systems.

We note that among the top three systems (see Table 2.13) is the method (Kovacevic et al., 2013) developed as part of the research presented in this thesis (described in

Chapter 4). Notably, there was no statistically significant difference in terms of F1-score between the top three teams. Therefore, taking statistical significance into account, our method ranked joint first in TE recognition. However, our primary score was notably lower (-3%) compared to the other top systems. This was due to the performance of the overall normalisation task.

Table 2.13: 2012 i2b2 TERN results This table shows the official 2012 i2b2 TERN results. Note, ‘P-score’ refers to Primary score = F1 × Value

System                    F1%     Type%   Value%   Modifier%   P-score%
Sohn et al. (2013)        90.03   86.04   73.13    85.66       65.83
Xu et al. (2013)          91.41   89.29   71.70    89.07       65.54
Kovacevic et al. (2013)   90.08   84.73   70.44    82.75       63.45
Tang et al. (2013)        86.59   85.00   70.00    85.00       60.61
Lin et al. (2013)         88.00   82.10   68.80    82.80       60.54
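As a worked check of the primary score definition (Primary score = F1 × Value), the snippet below recomputes the P-score column of Table 2.13 from the F1 and Value columns. The figures are copied from the table; the function name `primary_score` is introduced here for illustration.

```python
# (F1%, Value%, reported P-score%) as listed in Table 2.13.
RESULTS = {
    "Sohn et al. (2013)": (90.03, 73.13, 65.83),
    "Xu et al. (2013)": (91.41, 71.70, 65.54),
    "Kovacevic et al. (2013)": (90.08, 70.44, 63.45),
    "Tang et al. (2013)": (86.59, 70.00, 60.61),
    "Lin et al. (2013)": (88.00, 68.80, 60.54),
}


def primary_score(f1_pct: float, value_pct: float) -> float:
    """Product of F1 and value accuracy, both given and returned as percentages."""
    return f1_pct * value_pct / 100.0


for system, (f1, value, reported) in RESULTS.items():
    computed = primary_score(f1, value)
    print(f"{system}: computed {computed:.2f}, reported {reported:.2f}")
```

The recomputed products agree with the reported values to within 0.01 percentage points, which is consistent with rounding of the published figures.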

Based on a review of the TERN literature (across domains), we collected a set of ML features commonly used in data-driven approaches (see Table 2.14). We also found that, similar to clinical and biomedical NER tasks, the CRF is by far the most common and best performing discriminative classifier adopted by researchers. In addition, the most common sequence label representation schema reported was BIO.

Table 2.14: Common data-driven features used for clinical TER

Feature group   Feature type
Lexical         token, POS
Morphological   word lemma, suffix, prefix
Orthographic    word type (e.g., alphanumeric, word, number); word case (e.g., lower/upper case, all capital); word shape (e.g., ‘Dec-2001’ → ‘Xxx-yyyy’ or similar)
Syntactic       shallow chunks (e.g., noun, verb and prepositional phrases)
Semantic        temporal information (e.g., week day, month, modifiers)
Contextual      feature window [-(1-2), +(1-2)]
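A few of the feature types in Table 2.14 can be sketched as a token-level feature extractor of the kind typically fed to a CRF. The function names, the three-character prefix/suffix length and the `<PAD>` symbol are assumptions for this example. Digits are mapped to ‘d’ here, so ‘Dec-2001’ becomes ‘Xxx-dddd’, which is one common variant of the word shape shown in the table.

```python
import re


def word_shape(token):
    """Orthographic shape: upper case -> X, lower case -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def word_type(token):
    """Coarse orthographic word type."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    if token.isalnum():
        return "alphanumeric"
    return "other"


def token_features(tokens, i, window=2):
    """Lexical, morphological, orthographic and contextual features for tokens[i]."""
    tok = tokens[i]
    feats = {
        "token": tok.lower(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "type": word_type(tok),
        "all_caps": tok.isupper(),
        "shape": word_shape(tok),
    }
    # Contextual window: neighbouring tokens within +/- `window` positions.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"token[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = ["Seen", "on", "Dec-2001", "for", "review"]
print(token_features(tokens, 2)["shape"])  # Xxx-dddd
```

POS tags, shallow chunks and semantic (temporal lexicon) features would come from external NLP components and are therefore omitted from this sketch.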

TERN methods in the clinical domain have shown results comparable to previous work reported in the general domain. However, clinical TERN seems more challenging than the general domain task. For example, the best reported primary score in the clinical domain is 65.83%, whilst 77.61% was reported in the general domain. Further, we note that the difference between the general and clinical domains in terms of TE profile is small, as suggested by the extensive adoption of general domain tools in the clinical domain. One notable difference is the medication dosage and frequency abbreviations used by clinical practitioners, which was also highlighted by Sun et al. (2013c).

Knowledge-driven and hybrid techniques make up the current state-of-the-art methods in both the general and clinical domains (Sun et al., 2013c; UzZaman et al., 2013), with a slight inclination towards rule-based approaches in terms of performance. Overall, rule-based methods seem to be the preferred approach for both TE recognition and normalisation, at least given the current amount of available data. Yet, the methodologies adopted in recent literature, particularly in the clinical domain, seem to indicate a paradigm shift from knowledge-driven methods to hybrid/data-driven approaches. This may be explained by the growing amount of available training data as well as off-the-shelf resources. In contrast, TE normalisation remains a rule-based problem: reasoning over patterns or strings is most flexibly and efficiently handled by rule-based methodologies.

2.1.6 Temporal relation extraction

The identification and classification of temporal links (TLINKs) between pairs of EVENTs, pairs of TEs, and EVENTs and TEs is an active and open research problem in IE. The term TLINK extraction is often used interchangeably with temporal ordering. Temporal ordering of events is a central and crucial step in a wide range of NLP applications such as information extraction, document summarisation and question answering, among others.

Similar to most NLP tasks in the clinical domain, TLINK extraction has generally lagged behind the general English domain due to the lack of publicly available annotated corpora. Thus, we first review the related work in non-clinical domains.

TempEval-1 and TempEval-2 TLINK tasks

The outcomes of the TempEval-1 and TempEval-2 challenges and the resulting literature represent notable work in TLINK research. However, both TLINK sub-tasks were limited to temporal relation classification. In addition, the candidate pairs were further restricted: TLINK classification was only considered between:

TempEval-1

• EVENTs and TEs within the same sentence
• DCT and EVENTs
• main EVENTs in adjacent sentences

TempEval-2

• EVENTs and TEs within the same sentence
• DCT and EVENTs
• main EVENTs of adjacent sentences
• EVENTs where one dominated the other

A wide range of methods were investigated as a result of the TempEval-1 competition, including rule-based (Hagège and Tannier, 2007), hybrid (e.g., Support Vector Machine (SVM) and finite-state rules (Min et al., 2007); statistical information and rules (Puscasu, 2007)), SVMs (Bethard and Martin, 2007) and HMM-SVMs (Cheng et al., 2007), among others. While the performance of the systems did not differ greatly, two approaches, presented by Hagège and Tannier (2007) and Puscasu (2007), were notable.

Hagège and Tannier (2007) adopted and extended a rule-based system which relies on a deep syntactic analyser to extract grammatical relations and thematic roles in the form of dependency links in order to determine temporal relations. While their approach achieved the best precision for two out of three tasks, their system consistently performed worst (in terms of recall and F1-measure) on all tasks. This seems mostly due to the deliberate choice of a precision-biased approach as well as over-reliance on the syntactic analyser. Specifically, if the parser did not find a relation between two candidates, no alternative classification mechanism was adopted.

A hybrid statistical and rule-based system coined TICTAC (or Syntactico-Semantic Temporal Annotation Cluster) achieved the best overall results across the evaluation tasks (Puscasu, 2007). The author approached intra-sentence temporal relation classification by combining deep syntactico-semantic processing (using the FDG parser (Tapanainen and Järvinen, 1997) for syntactic tree generation), recursive bottom-up propagation to identify the temporal order between directly linked constituents, and a set of temporal reasoning and conflict resolution heuristics. Similarly, inter-sentence temporal relation classification was approached using deep linguistic knowledge, but more notably using statistical data extracted from the training corpus together with a set of heuristics.

Similar to TempEval-1, the TempEval-2 TLINK tasks were restricted to TLINK classification, with the further restrictions previously described. Investigated methods included CRFs (Kolya et al., 2010; Llorens et al., 2010), MaxEnt (Derczynski and Gaizauskas, 2010), and Markov logic networks (MLN) (Ha et al., 2010; UzZaman and Allen, 2010b). Sun et al. (2013c) noted several relevant studies outside the challenge that adopted statistical ML methods, among which SVMs (Chambers and Jurafsky, 2008; Mirroshandel et al., 2011) and MaxEnt (Mani et al., 2006; Verhagen and Pustejovsky, 2008) were the most popular.

A notable observation from TempEval-1 and TempEval-2 was that, regardless of the method used, most approaches adopted deep linguistic and syntactic features. Statistical methods, including statistical ML approaches, were the most popular. In addition, the only purely knowledge-based approach presented performed worst.23

TempEval-3 TLINK task

In contrast to the previous TempEval TLINK tasks, no artificial restrictions were imposed. The TempEval-3 temporal relation tasks included an end-to-end24 task (results given in Table 2.16) as well as a TLINK identification and classification sub-task, which is similar to the former but uses gold EVENTs and TEs (results given in Table 2.15).

An apparent observation, in contrast to previous challenges, was the notably low performance. For example, the best system (Bethard, 2013) achieved an F1-measure of 36.26% and 30.98% for the TLINK extraction (Table 2.15) and end-to-end (Table 2.16) tasks, respectively. These (poor) results seem to be a direct consequence of approaching TLINK extraction without any artificial limitations imposed.

23 Note that a pure rule-based approach was only adopted in TempEval-1.
24 Recall, in the ‘end-to-end’ task, the goal is first to extract events and temporal expressions, and secondly to identify and classify TLINKs.

Table 2.15: TempEval-3: TLINK identification and classification task This table shows evaluation of TLINK identification and classification given gold events and temporal expressions.

System       P%      R%      F1%
ClearTK-2    37.32   35.25   36.26
ClearTK-4    35.17   36.57   35.86
ClearTK-1    37.64   33.04   35.19
UTTime-5     35.94   33.92   34.90
ClearTK-3    33.27   35.03   34.13
NavyTime-1   35.48   27.62   31.06
UTTime-4     37.41   23.43   28.81
JU-CSE       21.04   35.47   26.41

Table 2.16: TempEval-3: TLINK end-to-end task This table shows evaluation of TLINK identification and classification where participants first have to extract events and temporal expressions.

System       P%      R%      F1%
ClearTK-2    34.08   28.40   30.98
ClearTK-1    34.49   26.19   29.77
ClearTK-3    30.94   26.63   28.62
ClearTK-4    29.73   27.29   28.46
NavyTime-1   31.25   24.20   27.28
JU-CSE       19.17   34.36   24.61

Moreover, we have summarised the methods developed by the top teams for TLINK identification (listed in Table 2.17) and classification (listed in Table 2.18), as reported by UzZaman et al. (2013) and the respective system descriptions: ClearTK (Bethard, 2013), UTTime (Laokulrat et al., 2013), NavyTime (Chambers, 2013), and JU-CSE (Kolya et al., 2013). 25

Notably, data-driven methods (in particular SVM and logistic regression classifiers) were the most popular and performed best for TLINK identification as well as classification. In addition, once again, knowledge-driven methods were rare and only adopted for TLINK candidate generation (i.e., identification).

25 Abbreviations used in Tables [2.17,2.18]: ms: morphosyntactic information, e.g., POS, lexical information, morphological information and syntactic parsing related features; ls: lexical semantic information, e.g., WordNet synsets; ss: sentence-level semantic information, e.g., semantic role labels; e-attr: entity attributes, e.g., event class, tense, aspect, polarity, modality; timex type, value; con: context information, e.g., lexical context cues.

Table 2.17: TempEval-3: approaches for TLINK identification This table lists the overall strategy of the top performing methods for TLINK identification in the TempEval-3 TLINK tasks. Note that the abbreviations used in this table are defined in Footnote 25.

Method        System            Classifier   Features
Data-driven   ClearTK-1,2,3,4   SVM,Logit    e-attr,ms
Data-driven   UTTime-4,5        Logit        ms,ls,ss
Hybrid        NavyTime-1        MaxEnt       ms
Rule-based    JU-CSE            NA

Table 2.18: TempEval-3: approaches for TLINK classification This table lists the overall strategy of the top performing methods for TLINK classification in the TempEval-3 TLINK tasks. Note that the abbreviations used in this table are defined in Footnote 25.

Method        System            Classifier   Features
Data-driven   ClearTK-1,2,3,4   SVM,Logit    ms,ls
Data-driven   UTTime-4,5        Logit        ms,ls,ss
Data-driven   NavyTime-1        MaxEnt       ms,ls
Data-driven   JU-CSE            CRF          ms,e-attr,con

2012 i2b2 TLINK task

The 2012 i2b2 temporal relations challenge included two TLINK tasks (Sun et al., 2013c):

1. End-to-end track: in this track, the released input data included raw discharge summaries. The aim was to first identify EVENTs (i.e., Problem, Treatment, Test, Evidential, Occurrence, and Clinical department) and TIMEX3s and their respective attributes, and subsequently to identify and classify TLINKs between the EVENTs and TIMEX3s (i.e., EVENT-EVENT; EVENT-TE; TE-TE).

2. TLINK identification and classification track: the input data include the gold annotated EVENT and TIMEX3 tags with their respective attributes. The aim was to identify and classify TLINKs.

The aforementioned TLINK-tracks extended the scope of previous challenges (i.e., TempEval-1 and TempEval-2). In addition to the clinical focus, TLINK constraints imposed in TempEval-1 and TempEval-2 were disregarded. Specifically, the i2b2

TLINK tracks included unconditional TLINKs between any elements, i.e., between EVENTs, between TIMEX3s, and between EVENTs and TIMEX3s. TLINKs between elements in the same sentence, adjacent sentences, or non-adjacent sentences were all considered valid. In addition, TLINKs between EVENTs and section times (SECTIME) (i.e., admission and discharge date) were also included. By convention, EVENTs that appear in the ‘clinical history’ section were considered as related to the admission date, and EVENTs in the ‘hospital course’ section (in the clinical narratives) were considered as related to the discharge date (Sun et al., 2013c).

The system submitted by Tang et al. (2013) ranked first in both TLINK tracks (see Tables [2.19,2.20]). They approached the problem by dividing the TLINK extraction task into three sub-tasks:

(i) TLINKs between EVENT and SECTIME: one classifier was trained for each section time.

(ii) Intra-sentence TLINKs between EVENTs and TEs (TLINKs between TEs were ignored): two classifiers were trained for this sub-task: (a) EVENT-EVENT, and (b) EVENT-TE TLINKs.

(iii) Inter-sentence TLINKs between EVENTs and TEs: two classifiers were trained for this sub-task, for: (a) main EVENTs (defined as the first and last EVENT in a sentence) in adjacent sentences, and (b) EVENTs that are co-referenced.

Moreover, preceding these classification sub-tasks, a heuristic-based TLINK candidate pair generation component was executed in order to determine the most likely entity pairs (i.e., the TLINK identification phase). The strategy to generate candidate pairs differed for the three sub-tasks: (i) all EVENT and section time pairs were considered as candidates; (ii) any consecutive EVENT-TE pair in the sentence, and any EVENT-TE pair that has a dependency relation (based on the dependency parse tree of the sentence generated with the Stanford parser); (iii) candidate pairs for inter-sentence TLINKs encompassed all ‘main’ EVENTs in adjacent sentences, and co-references across multiple sentences (generated based on the heuristic of any two EVENTs which share the same head noun and UMLS semantic type).

Tang et al. (2013) reported the use of CRF- and SVM-based classifiers without further details; which algorithm was used for which classification task was omitted from the manuscript. However, they reported the adoption of a wide range of features:

(i) TLINKs between EVENTs and SECTIMEs: EVENT position information (document-, section-, and sentence-level), bag-of-words (each event is treated as a word),

Table 2.19: 2012 i2b2: TLINK identification and classification task This table lists the results of the 2012 i2b2 TLINK identification and classification task.

System                    F1%
Tang et al. (2013)        69.32
Cherry et al. (2013)      69.24
Xu et al. (2013)          68.49
Nikfarjam et al. (2013)   62.80
D’Souza and Ng (2013)     61.42

Table 2.20: 2012 i2b2: TLINK end-to-end task This table lists the results of the 2012 i2b2 TLINK end-to-end task.

System                  F1%
Tang et al. (2013)      62.78
Xu et al. (2013)        59.24
Roberts et al. (2013)   52.58
D’Souza and Ng (2013)   51.26
Grouin et al. (2013)    49.32

POS, verb tense, dependency-related information, TE-related information (e.g., presence/absence of TEs within the sentence), and EVENT-related information (i.e., all attributes of the candidate EVENTs).

(ii) Intra-sentence TLINKs between EVENTs/TEs: dependency-related information (e.g., the path of the word and POS in the relation), distance between the two candidate EVENTs, conjunction between the two entities of the TLINK, and the TLINKs between EVENTs and SECTIMEs (determined by the previous step).

(iii) Inter-sentence TLINKs between EVENTs/TEs: (for main EVENTs in two consecutive sentences) presence of TEs, the word in each EVENT, verb tense, and attributes of each EVENT; (for co-referenced EVENTs) token length of each EVENT, the number of overlapping tokens between candidate EVENTs, the semantic type of the EVENTs, whether the two EVENTs contain the same positional words (i.e., ‘left’ and ‘right’), and whether they contain anatomic words (e.g., ‘arm’ and ‘leg’).

Notably, Tang et al. (2013, pp.832-833) reported computing the transitive closure (defined in Appendix A.3) of TLINKs in the training set in order to generate the final dataset used for training their classifiers. This approach was adopted by most top-performing teams using data-driven approaches in order to address the imbalance in the training data (e.g., Cherry et al., 2013; Tang et al., 2013; Xu et al., 2013). Specifically, several studies reported that computing the transitive closure of temporal links in the training dataset benefited data-driven methods by reducing the bias toward the majority class (‘no TLINK’) as well as reducing the noise caused by missing true TLINKs.

Xu et al. (2013) submitted another notable system, which scored less than one percentage point of F1 below Tang et al. (2013) in the TLINK classification track (see Table 2.19). They subdivided the problem at hand into different TLINK categories: EVENT-EVENT, EVENT-TE, TE-TE, EVENT-SECTIME, and TE-SECTIME (where SECTIME refers to both admission and discharge date). In addition, intra- and inter-sentence TLINKs were also differentiated for the relevant categories. Hence, ten different ‘categories’ in total were defined (these categories are listed below for clarity).

Intra-sentence    Inter-sentence    Section time
EVENT-EVENT       EVENT-EVENT       EVENT-AdmissionDate
EVENT-TE          EVENT-TE          TE-AdmissionDate
TE-TE             TE-TE             EVENT-DischargeDate
                                    TE-DischargeDate
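The transitive closure that these systems applied to the training TLINKs can be illustrated with a short sketch. It is a simplified assumption rather than the procedure of any cited system: only ‘before’ links are handled, and the function name `before_closure` and the entity identifiers are hypothetical.

```python
from itertools import product


def before_closure(links):
    """Return the transitive closure of (a, b) pairs meaning 'a BEFORE b'."""
    closure = set(links)
    changed = True
    while changed:
        changed = False
        # Compose links that share a middle entity: a<b and b<c imply a<c.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and a != d and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Three annotated links (hypothetical identifiers) ...
annotated = {
    ("admission", "surgery"),
    ("surgery", "chemotherapy"),
    ("chemotherapy", "discharge"),
}
# ... expand to six links once the implied ones are added.
expanded = before_closure(annotated)
```

The implied links (e.g., admission before discharge) become additional positive training instances, which is how the closure reduces the share of ‘no TLINK’ pairs.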

Xu et al. (2013) opted for a rule-based recognition phase. For example, candidate pair generation for EVENT-EVENT in the same sentence was dependent on the syntactic structure derived from its parse tree and sentence pattern. For EVENT-TE in the same sentence, prepositions prior to the TE and verb cues were used to identify the relation. The identification of EVENT-EVENT and TE-EVENT in different sentences was dependent on contextual and sentence information. Notably, and in contrast to Tang et al. (2013), co-referential cues were not used for EVENT-EVENT in inter-sentence TLINKs. Finally, for EVENT-SECTIME links, an SBD component was used, and for TE-SECTIME relations both the SBD component and the TE type were used to identify relations. A total of 10 SVM classifiers (one for each category as listed above) were trained on an expanded training dataset (generated by the transitive closure). They reported three feature sets:

• Syntax features built on parse trees. This feature set was derived from the output of the Stanford and Enju parsers.26 Features were extracted from dependencies, dependency chains in dependency graphs (Enju), and paths from the parse trees.

• Labelled sequential pattern. LSP mining was applied to normalised sentences to extract patterns with high frequency.

• Coordination-class features. This feature set was particularly useful for ‘overlap’ TLINKs. Multiple EVENTs that were separated by the comma symbol ‘,’ typically indicated ‘overlap’.

Further, as a final TLINK extraction component, they used an MLN27 to infer implicit TLINKs, such as transitive ones. The MLN component primarily used the pair-wise confidences of the SVMs (for two existing relations) as input, with a similar confidence score as output, to infer implicit TLINKs (see Xu et al., 2013, p.855, for a further description).

From their results, EVENT-SECTIME and TE-SECTIME links were the easiest to extract, followed by inter-sentence TLINKs (i.e., EVENT-EVENT and EVENT-TE). In addition, they noted that intra-sentence TLINKs were the most challenging. Specifically, the results reported by Xu et al. (2013) for the different TLINK ‘categories’ are listed below.

Inter-sentence        F1%
EVENT-EVENT           72.69
EVENT-TE              68.01

Intra-sentence        F1%
EVENT-EVENT           45.46
EVENT-TE              53.26

Section time          F1%
EVENT/TE-SECTIME      91.02

Roberts et al. (2013) used a heuristic-based component to generate candidate links for SECTIME-related TLINKs and a hybrid component for other types of links. The hybrid component consisted of an SVM component (an SVM ranker) with a preceding rule-based component which generates an initial set of candidate pairs. Specifically, candidate pairs were generated for each EVENT considering all EVENTs and TEs in consecutive sentences (the current and previous sentences), ignoring EVENTs and TEs in the current sentence that occur after the given EVENT. Subsequently, the SVM ranker ranked candidates by a confidence score. They noted that tuning the confidence score (threshold cut-off) allowed the system to be optimised for recall at the expense of precision or vice versa. In addition, restricting the output to only the top-ranked TLINKs was reported to maximise the F1-measure. Subsequently, a multi-class SVM was used to determine the TLINK type (i.e., Before, After, or Overlap). Further, they reported using the same feature set for recognition and classification (see Roberts et al., 2013, p.972, for a high-level description).

In summary, Table 2.21 summarises the features adopted by the state-of-the-art methods for TLINK classification, with a bias toward the 2012 i2b2 temporal relation challenge.

26 http://www.nactem.ac.uk/enju/
27 Implementation: http://code.google.com/p/thebeast/

Table 2.21: Common TLINK classification features
Note that EV=EVENT, ST=SECTIME.

Inter-sentence Intra-sentence SECTIME Feature type EV-EV EV-TE TE-TE EV-EV EV-TE TE-TE EV-ST Position information  Distance information  Punctuation    Tense  Preposition    Conjunction    Part-of-Speech     Dependency-related    Co-reference     TE-related        EV-related       

Descriptions of the feature types listed in Table 2.21 follow.

• Position information: the position of an EVENT within a section; in particular, Tang et al. (2013) highlighted whether an EVENT (1) appears in the first/last five sentences, (2) is one of the first/last three EVENTs in a section, and (3) is one of the first/last five EVENTs in a document;

• Distance information: (i) token distance between entity pairs, and (ii) number of EVENTs and TEs between entity pairs;

• Punctuation: the semicolon and comma symbols in between candidate pairs;

• Tense: verb tense (Tang et al. (2013) also propagated the verb tense of a given EVENT for SECTIME links to the sentence containing the EVENT);

• Preposition: prepositions between the two entities of a candidate pair, e.g., ‘in’, ‘on’, ‘after’, ‘before’ and so forth;

• Conjunction: conjunctions between the two entities of a candidate pair, e.g., ‘and’, ‘both’ and so forth;

• Part-of-Speech: the POS tag of candidate entity pairs;

• Dependency-related: (i) dependencies, (ii) paths on parse trees, and (iii) dependency chains in dependency graphs;

• Co-reference: EVENT co-reference pair information;

• TE-related: TE type (e.g., date, duration, and so forth);

• EV-related: EVENT attribute (e.g., type, polarity).
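Several of the surface features described above (distance, punctuation, preposition, and conjunction) can be computed directly from the tokens lying between the two entities of a candidate pair. The sketch below is illustrative only and is not taken from any of the reviewed systems; the function `pair_features`, the feature names, and the two small word lists (which contain only the examples given in the descriptions) are assumptions.

```python
PREPOSITIONS = {"in", "on", "after", "before"}  # examples from the text only
CONJUNCTIONS = {"and", "both"}                  # examples from the text only


def pair_features(tokens, first, second):
    """Surface features for a candidate pair.

    `first` and `second` are (start, end) token spans with `end` exclusive.
    """
    if first[0] > second[0]:
        first, second = second, first
    # Tokens strictly between the two entity mentions.
    between = [t.lower() for t in tokens[first[1]:second[0]]]
    return {
        "token_distance": len(between),
        "has_comma": "," in between,
        "has_semicolon": ";" in between,
        "prepositions": sorted(PREPOSITIONS.intersection(between)),
        "conjunctions": sorted(CONJUNCTIONS.intersection(between)),
    }


tokens = "She developed a fever after the surgery , and was given antibiotics".split()
# 'fever' (token 3) paired with 'surgery' (token 6) and with 'antibiotics' (token 11).
near = pair_features(tokens, (3, 4), (6, 7))
far = pair_features(tokens, (3, 4), (11, 12))
```

In a data-driven TLINK classifier, a dictionary of this kind would typically be combined with the POS, dependency, and entity-attribute features before being passed to the learner.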

Despite the slight difference in scope between the TempEval series and the i2b2 2012 TLINK tracks, the outcomes of the i2b2 temporal ordering tracks are consistent with results obtained in the general domain (UzZaman et al., 2013; Verhagen et al., 2007, 2010). Overall, these results show that temporal ordering of events remains a challenging research problem in both the general and clinical domains. The most common approach for TLINK identification is rule-based. For example, for intra-sentence TLINKs, common rule-based cues include contextual lexical cues (e.g., prepositions and verbs), dependency relation cues derived from the dependency parse tree of a given sentence, and co-ordination cues (e.g., consecutive TE-EVENT pairs; EVENTs separated by the comma symbol ‘,’). Co-reference resolution information was often adopted for inter-sentence EVENT-EVENT TLINKs. In contrast, the majority of the TLINK classification approaches reported across the TLINK challenges described (except for Hagège and Tannier (2007)) were data-driven. In response, we aim to explore knowledge-based methods for TLINK extraction (i.e., for both identification and classification).

2.2 Clinical Background

In this section we review the clinical background relevant to this thesis. This includes a review of computer aided protocol-based care and health-related quality of life (HrQoL) in clinical practice.

2.2.1 Clinical text

Clinical text is inherently different from other text relevant for NLP (e.g., biomedical, newswire). For example, at the word level, clinical texts tend to be riddled with abbreviations and acronyms. Liu et al. (2001) estimated that acronyms are overloaded about a third of the time and are often ambiguous even in context. In addition, the presence of misspellings is another notable characteristic (Meystre et al., 2008). At the sentence level, clinical texts are characteristically written in ungrammatical, short and telegraphic phrases that do not follow common English syntactic structures (Meystre et al., 2008). At the paragraph level, clinical texts (e.g., discharge summaries) are often subdivided into meaningful sections, e.g., diagnosis, medication, medical history, and so forth (see a sample text in Appendix A, Figure A.1). However, general formatting principles and guidelines seem to be non-existent (e.g., it is not uncommon to find copy-and-pasted text from laboratory test results). The combination of these characteristics distinguishes clinical text from general-domain text.

2.2.2 Computer aided standardisation of clinical care

Protocol-based care has been described as a mechanism for facilitating high-quality evidence-based care through the standardisation of decision making (Rycroft-Malone et al., 2009). It is an umbrella term often used to refer to clinical guidelines and clinical care pathways. This notion has evolved as a result of executive policies initiated through several bodies, such as the NHS Modernisation Agency and the National Institute for Clinical Excellence (recently renamed the National Institute for Health and Care Excellence), at the turn of the millennium to aid the modernisation of the NHS, partly through the standardisation of care.

The challenges of implementing protocol-based care are aplenty. While evaluation of clinical guidelines has shown that they can improve quality of care, whether this is achieved in practice is unclear (Rycroft-Malone et al., 2009). Notably, in the absence of evidence it is common practice to rely on expert opinions. Hence, at the local level the interpretation of guidelines can often differ, given differences in the interpretation of what constitutes quality. Effectively, this results in variation of care between different hospitals and even between consultants within the same hospital.

With the introduction of EHRs, there has been a growing interest in implementing and monitoring protocol-based care with computerised support. The majority of work in this area has proposed hierarchical models to decompose protocols and guidelines into tasks such as decisions, plans, and actions. Subsequently, users are guided step-by-step through these ‘task networks’ to ensure the optimum implementation of the encoded protocol.
Examples of such models include: Medical Logic Modules (MLMs) (Hripcsak, 1994), Skeletal Plan Instantiation (SPIN) (Uckun, 1994), Modeling Better Treatment Advice (MBTA) (Barnes and Barnett, 1995), EON (Musen et al., 1996), GuideLine Interchange Format (GLIF) (Ohno-Machado et al., 1998; Peleg et al., 2000, 2001), and PROforma (Fox et al., 1998; Sutton and Fox, 2003). However, while many derivative approaches have been proposed for representing and implementing care protocols using computerised support, relatively few have been adopted as part of real-world operational systems (Gooch and Roudsari, 2011).

In this thesis we present an alternative and novel approach to monitoring protocol-based care: extracting patient journeys from EHRs, specifically from unstructured clinical narratives. In contrast to previous work, the aim is not to provide a means to guide the implementation of protocols, but rather to enable analysis and monitoring of implemented protocols a posteriori. By reconstructing implemented care pathways from recorded clinical practice, we enable large-scale analysis of real-world clinical data and practice. Such an approach can enable the comparison of implemented care pathways between different views and, importantly, without the limitation imposed by task networks. This can further enable objective comparison between different consultants and hospitals to identify and characterise any discrepancies, leading to the harmonisation of practice.

2.2.3 Health-related quality of life

Over half a century ago, the World Health Organization (WHO) stated that “health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity” (WHO, 1946). Yet health has traditionally been, and still is, largely viewed and measured by the presence or absence of disease confirmable by medical diagnostic tests. Increasingly, this strict view is being challenged due to growing contradictory evidence. Self-reported or subjective health measures such as Health-related Quality of Life (HrQoL) in particular have gained increasing attention due to their importance as an indicator of intervention outcomes and as a predictor of mortality, morbidity, and service needs (Taylor, 2000). However, despite repeated studies showing the importance of such measures, the systematic monitoring of HrQoL in primary and secondary care in the UK is still not common practice.

In order to describe what HrQoL entails, it is necessary to first define the concept of Quality of Life (QoL). QoL is a broad multidimensional concept that is defined by the WHO as an individual’s subjective perception of positive and negative aspects of life, which includes their physical health, psychological state, level of independence, social relationships, personal beliefs and their relationship to important aspects of their environment. Hence, QoL encapsulates one’s overall subjective sense of wellbeing, both positive and negative. This includes, for example, aspects of happiness and satisfaction with life as a whole.

McHorney (1999) describes HrQoL as encompassing all physical and mental aspects of overall quality of life that can be clearly shown to affect health and, notably, vice versa. Aspects of quality of life that are typically considered non-health-related include, for example, the quality of the environment, public safety, education, standard of living, political freedom, or cultural amenities. However, the distinction between health-related and non-health-related quality of life is complex. For example, Ferrans et al. (2005) claim that environmental factors may have direct effects on health, e.g., air pollution may lead to chronic respiratory disease, or long dark winters may lead to seasonal affective disorder. In addition, all areas of life tend to be affected by health in cases of chronic illness, and thus, in effect, become ‘health-related’ (Guyatt et al., 1993).
Hence, the distinction between QoL and HrQoL can at times be ambiguous.

Perhaps more importantly, HrQoL focuses on the perceived effects of health, illness and treatment on quality of life as a whole (Ferrans et al., 2005; Taylor, 2000). This includes physical and mental health perceptions and functioning, health status, etc., but also a subset of quality of life aspects such as individual and environmental characteristics (e.g., life-style choices, interpersonal or social influences and so forth) that may directly or indirectly affect health and vice versa. HrQoL at the individual level includes subjective perceptions of physical and mental state, health status and risks, functional status, social support and socio-economic status.

Moreover, subjective health perception can diverge between people with similar health status. For example, some people may perceive themselves as healthy despite suffering from chronic health problems, while others may perceive themselves as ill when no objective evidence of disease is present.

In clinical practice, assessment of HrQoL is typically conducted through self-reporting (such as disease-specific questionnaires/interviews/assessments) and/or with carers (e.g., guardians, teachers, health professionals). Self-reported measures are typically recorded on Likert scales (e.g., 0 to 4 or 1 to 5), where a combined score (using some bespoke schema) provides an overall measure of HrQoL for a given dimension. Another means of measurement is through the Model of Human Occupation or MOHO-based assessments (e.g., Short Child Occupational Profile (SCOPE); Child Occupational Self-Assessment) (Bowyer et al., 2007; Keller and Kielhofner, 2005; Keller et al., 2005).
These are predominantly objective forms of measurement, also involving structured data collection, where typically occupational therapists assess a patient through observation, informal interviews, and questionnaires depending on the clinical area. MOHO-based assessments may cover dimensions such as motivation and interest, routines of engagement, motor skills, process skills, and communication and interaction skills. The HrQoL dimensions (or concepts) of interest tend to differ depending on the clinical problem; Table 2.2.3 lists a number of dimensions of interest for patients diagnosed with central nervous system (CNS) cancers (specifically medulloblastoma).

There exist many different standardised measurements which assess HrQoL, including (see Table 2.22 for the domains covered by the respective instruments):

• The Paediatric Quality of Life Inventory (PedsQL) measurement model, which uses a modular approach combined into one measurement system to measure HrQoL in adolescents, young people and children.

• The Strengths and Difficulties Questionnaire (SDQ).

• The Health Utilities Index mark 2 (HUI2) and mark 3 (HUI3) are used to measure subjective health status. Overall scores are considered as a HrQoL measure.
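To make the idea of a combined Likert score concrete, the sketch below shows one possible scoring scheme for a single dimension. It is purely an illustrative assumption and not the published scoring of any instrument listed above: responses on a 0 to 4 scale, where higher values indicate more problems, are reverse-scored, rescaled linearly to 0-100, and averaged. The function name `dimension_score` is hypothetical.

```python
def dimension_score(responses, scale_max=4):
    """Combine Likert responses (0..scale_max, higher = more problems)
    into a single 0-100 score, where 100 is the best outcome."""
    if not responses:
        raise ValueError("at least one response is required")
    if any(r < 0 or r > scale_max for r in responses):
        raise ValueError("response outside the Likert range")
    # Reverse-score each item and rescale it linearly to 0-100.
    rescaled = [100 * (scale_max - r) / scale_max for r in responses]
    # The dimension score is the mean of the rescaled items.
    return sum(rescaled) / len(rescaled)


# Three items answered 0 ('never'), 1 and 2 on an assumed 0-4 scale.
score = dimension_score([0, 1, 2])
```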

The use of structured or semi-structured interviews to investigate HrQoL in specific cohorts, as related to the effects of disease, has been reported in a number of studies (Groenvold, 2010; Spiroch et al., 2000; Tay et al., 2014). In this thesis, we will use semi-structured interviews in order to compare patient narratives with factors, such as clinical concepts (including HrQoL), identified in their hospital records using TM methods.

Table 2.22: HrQoL instruments and corresponding classification dimensions

PedsQL          HUI-2        HUI-3        SDQ
Physical        Sensation    Vision       Emotion
Psychosocial    Mobility     Hearing      Behavioural problems
Emotion         Emotion      Speech       Hyperactivity
Social          Cognition    Ambulation   Peer relations
School          Self-care    Dexterity    Pro-social behaviour
                Pain         Emotion
                             Cognition
                             Pain

2.3 Summary

In this chapter we have introduced TM and NLP (Section 2.1) and reviewed relevant IE methods (Section 2.1.2), such as clinical NER (Section 2.1.3), TERN (Section 2.1.5), and TLINK (Section 2.1.6). In addition, we reviewed relevant clinical background such as computer aided standardisation of care and HrQoL, including its importance and application in clinical practice. As a result of the reviewed literature we have identified a number of topics to investigate, which are presented in the following chapters.

NER is perhaps one of the most investigated topics in IE, catalysed by a series of MUC challenges in the early 1990s. Interestingly, however, clinical NER has only recently emerged as a ‘hot’ research topic. Similar to other TM applications in the clinical domain, clinically related IE tasks have lagged behind due to the inherent privacy issues and sensitivity of clinical data, which has resulted in a lack of, or a restricted number of, available research datasets. Nevertheless, recent research has presented state-of-the-art methods to address clinical NER, of which our methods described herein (Chapter 3) rank among the best in the community.

A number of investigations have been conducted in TIE, predominantly in the general domain. We have reviewed these, in addition to a set of related work that has been presented in the clinical domain. We have shown that the TERN task is a highly domain-adaptable problem, with notable work in the general domain and, more recently, in the clinical domain. Notably, our hybrid method described in this thesis (Chapter 4) is ranked first in the community, evaluated as part of the only concluded clinical TERN challenge to date (Kovacevic et al., 2013).

Likewise, we have reviewed a number of published investigations into temporal ordering, in both the general and the clinical domain. Judging from the results across domains, TLINK is still an open research question. We found that while many systems incorporated rule-based components for the identification of temporal links, knowledge-based approaches have been extremely rare for TLINK classification. This is true across domains. Hence, in this thesis we aim to explore a strictly knowledge-based approach to TLINK recognition and classification.

Relevant clinical background described includes the application of HrQoL in clinical practice. Despite the importance of HrQoL, we found that there is a lack of systematic monitoring of such concepts in clinical practice. In this thesis we aim to investigate an automated method to extract HrQoL concepts from clinical data.

Protocol-based care is a growing topic of interest for researchers and practitioners alike. Our review of computer aided monitoring and execution of protocol-based care showed that a number of methods have been proposed, all of which are characterised by guiding practitioners through a step-by-step process of care. In this thesis we will investigate an alternative and novel approach to monitor and analyse the implementation of protocol-based care through the automated reconstruction of patient journeys.

Chapter 3

Extraction of Health-related Concepts

This chapter describes the methods developed to extract clinical events such as Problem, Treatment, and Test (Section 3.1); and HrQoL concepts (Section 3.2).

A high-level architecture of the concept extraction work-flow is given in Figure 3.1. More detailed descriptions are given in the relevant sections of this chapter.

Figure 3.1: Health-related concept extraction architecture


3.1 Event Extraction

The aim of the clinical NER method is to identify broad clinical event categories such as Problem, Treatment and Test, and subsequently map them to a medical knowledge base such as the UMLS Metathesaurus for fine-grained semantic characterisation of event instances. These event categories will collectively be referred to as EVENTs henceforth. We have adopted the i2b2 definitions of concept or event categories, which are largely based on the UMLS semantic types but not limited by their coverage1 (Table 3.1 shows the semantic definition of relevant event categories). We have also reproduced a number of examples from the official annotation guidelines in Appendix B.

Table 3.1: Definition of EVENT categories
The event categories are defined according to the annotation guidelines.

EVENT       Semantic group      Semantic type
Problem     Disorders           acquired abnormality
                                anatomical abnormality
                                cell or molecular dysfunction
                                congenital abnormality
                                disease or syndrome
                                injury or poisoning
                                mental or behavioural dysfunction
                                neoplastic process
                                pathologic functions
                                sign or symptom
            Living Beings       bacterium
                                virus
Treatment   Chemicals & Drugs   antibiotic
                                biomedical or dental material
                                clinical drug
                                pharmacologic substance
                                steroid
            Devices             drug delivery device
                                medical device
            Procedures          therapeutic or preventive procedure
Test        Procedures          diagnostic procedure
                                laboratory procedure

1 https://www.i2b2.org/NLP/Relations/assets/Concept%20Annotation%20Guideline.pdf

Extensive prior research in clinical NER (see Chapter 2) and the availability of an adequate amount of annotated clinical corpora enabled us to adopt data-driven methods to address this problem.

3.1.1 Methods

A hybrid approach was developed to extract clinical events from healthcare narratives. Specifically, a data-driven approach (using CRF, a state-of-the-art sequence labelling algorithm) was developed to identify EVENTs such as Problem, Treatment and Test, and a knowledge-driven approach (MetaMap) was adopted for fine-grained classification/mapping of extracted EVENT mentions to the UMLS Metathesaurus.

Architecture

The EVENT extraction pipeline is made up of the following methods/tasks (see Figure 3.2):

1. NLP pre-processing
2. Data-driven NER
3. Knowledge-driven concept mapping

Each of the listed components is described in more detail in the remainder of this section.

NLP pre-processing

Feature generation relies on an NLP pre-processing pipeline made up of common processing components (see Appendix C for a detailed description of the components used): (1) tokeniser, (2) sentence splitter, (3) word stemmer, (4) POS tagger, and (5) chunker / shallow parser.

Data-driven NER

The main NER component utilises separate CRFs trained for each EVENT category: Problem, Treatment and Test.

Figure 3.2: EVENT extraction architecture

A combination of forward and backward feature selection was adopted to select the 20 most discriminant features from an initial set of 120 features. The same set of features was used across all categories, as our experiments showed this was the best fit across all categories. Extracted features can be clustered into two sets, lexical and syntactic, with four feature groups across them (see the list below).

Lexical

• fg1: the token string or alphanumeric character sequence

• fg2: the stem of each token

• fg3: the POS-tag for each token

Syntactic

• fg4: the chunk tag for each token

Further, the feature space also includes contextual features of the neighbouring tokens, with a feature window size of 5, i.e., [-2,2] with respect to the current token. The window size (5) is the number of tokens whose features are considered: the current token plus the two tokens to its left and the two tokens to its right. Specifically, for each token t and a given feature group fg, the feature space consists of fg(t), fg(t+1), fg(t+2), fg(t-1), and fg(t-2) (see Table 3.2).

Table 3.2: Feature template: clinical EVENTs
CRF feature template used for all EVENT categories: Problem, Treatment and Test.

fg1:Token       fg2:Stem        fg3:POS         fg4:Chunk
U00:%x[-2,1]    U05:%x[-2,2]    U10:%x[-2,3]    U15:%x[-2,4]
U01:%x[-1,1]    U06:%x[-1,2]    U11:%x[-1,3]    U16:%x[-1,4]
U02:%x[0,1]     U07:%x[0,2]     U12:%x[0,3]     U17:%x[0,4]
U03:%x[1,1]     U08:%x[1,2]     U13:%x[1,3]     U18:%x[1,4]
U04:%x[2,1]     U09:%x[2,2]     U14:%x[2,3]     U19:%x[2,4]
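The effect of the template in Table 3.2 can be sketched in a few lines of Python: for every token, each of the four feature groups is read at offsets -2 to +2, giving 20 features. This is an illustrative re-implementation and not the code used in this work; the function name, the feature keys, the padding symbol and the example feature values are assumptions.

```python
FEATURE_GROUPS = ("token", "stem", "pos", "chunk")  # fg1..fg4
WINDOW = (-2, -1, 0, 1, 2)                          # window size 5


def window_features(sentence, index, pad="_PAD_"):
    """Build the 20 contextual features for the token at `index`.

    `sentence` is a list of dicts keyed by FEATURE_GROUPS; `pad` (an assumed
    placeholder) stands in for positions outside the sentence.
    """
    features = {}
    for group in FEATURE_GROUPS:
        for offset in WINDOW:
            position = index + offset
            inside = 0 <= position < len(sentence)
            value = sentence[position][group] if inside else pad
            features[f"{group}[{offset:+d}]"] = value
    return features


# Illustrative (hand-written) pre-processing output for three tokens.
sentence = [
    {"token": "severe", "stem": "sever", "pos": "JJ", "chunk": "B-NP"},
    {"token": "stomach", "stem": "stomach", "pos": "NN", "chunk": "I-NP"},
    {"token": "ache", "stem": "ache", "pos": "NN", "chunk": "I-NP"},
]
features = window_features(sentence, 0)
```

Each key corresponds to one template row, e.g., `token[+0]` plays the role of U02 and `chunk[+2]` that of U19.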

All CRFs were trained using a mix of BIO and W-BIO sequence label models (see Section 3.1.3 for justification) with the following (default) CRF parameters: C = 1.00, ETA = 0.0001 and the L2-regularisation algorithm.
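As an illustration of the BIO label model mentioned above, the sketch below converts annotated entity spans into one label per token (B for the first token of an entity, I for the following tokens of the same entity, O for tokens outside any entity). It covers only the standard BIO scheme; the W-BIO variant is not shown. Because a separate CRF is trained per EVENT category, no category suffix is attached to the labels here, which is an assumption of this sketch, as is the function name `spans_to_bio`.

```python
def spans_to_bio(num_tokens, spans):
    """Encode entity spans as BIO labels.

    `spans` holds (start, end) token offsets with `end` exclusive.
    """
    labels = ["O"] * num_tokens
    for start, end in spans:
        labels[start] = "B"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I"          # remaining tokens of the entity
    return labels


tokens = ["Patient", "reports", "severe", "stomach", "ache", "."]
# One annotated Problem covering 'severe stomach ache' (tokens 2-4).
labels = spans_to_bio(len(tokens), [(2, 5)])
```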

Post-processing

The post-processing component contains three sub-components:

1. Label fixer
This component corrects sequence label predictions from the NER component. These corrections are simple heuristics based on errors commonly observed on the training dataset. Table 3.3 lists the full set of heuristics utilised.

Table 3.3: Label fixer heuristic

     Raw predictions                Corrected predictions
a    ... O O O I I I I ...          ... O O B I I I I ...
b    ... O O O B O O O ...          ... O O O B I O O ...
c    ... O O O B O I I ...          ... O O O B I I I ...
d    ... O O O B I I B I I ...      ... O O O B I I I I I ...

Notably, the heuristic given in Table 3.3 (d) was later removed after we observed its output on the case study dataset. This was motivated by the fact that the heuristic would label NEs given in a list format (with no enumeration and separated only by a newline character) as a single prediction instead of separate predictions.

2. Boundary adjustment
This component attempts to expand the concept boundary by including adjacent tokens to the right and left of a prediction whose POS/chunk tags correspond to nouns and noun phrases and their constituents, including adjectives and determiners (e.g., ‘a’; ‘this’; ‘her’). This sub-component is useful when the NER component captures only part of an event. For example, if the NER component annotates only the word ‘severe’, ‘stomach’, or ‘ache’ from the actual term ‘severe stomach ache’, this component would recover the complete term boundary.

3. False positive filter
This component removes common false-positive predictions observed during the validation of the NER component. Examples include single-character predictions (e.g., ‘a’), pronouns (e.g., ‘he’; ‘she’), and determiners (e.g., ‘the’).
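The boundary adjustment and false positive filter can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: it relies on POS tags only (not chunk tags), the Penn Treebank tag set and the small word list are assumed, and the function names are hypothetical.

```python
# Assumed Penn Treebank tags for nouns, adjectives, determiners and possessives.
EXPANDABLE_POS = {"NN", "NNS", "NNP", "NNPS", "JJ", "DT", "PRP$"}
# Assumed list built from the examples given in the text.
STOP_PREDICTIONS = {"he", "she", "the"}


def expand_boundary(pos_tags, start, end):
    """Grow the predicted span [start, end) over adjacent noun-phrase tokens."""
    while start > 0 and pos_tags[start - 1] in EXPANDABLE_POS:
        start -= 1
    while end < len(pos_tags) and pos_tags[end] in EXPANDABLE_POS:
        end += 1
    return start, end


def is_false_positive(tokens, start, end):
    """Flag single-character predictions and lone pronouns or determiners."""
    text = " ".join(tokens[start:end]).lower()
    return len(text) == 1 or text in STOP_PREDICTIONS


tokens = ["She", "has", "severe", "stomach", "ache", "today"]
pos_tags = ["PRP", "VBZ", "JJ", "NN", "NN", "RB"]
# The NER component tagged only 'stomach' (token 3); expansion recovers
# the span covering 'severe stomach ache'.
span = expand_boundary(pos_tags, 3, 4)
```

Running the filter before expansion, or restricting which determiners may be absorbed, are design choices left open by this sketch.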

Negation

To identify negated clinical EVENTs we used the ConText negation tool as described by Harkema et al. (2009).

Concept mapping

The UMLS MetaMap tool was adopted for mapping events to the UMLS knowledge resource (the Metathesaurus). This enabled us to derive fine-grained knowledge from the identification of high-level concept categories. For example, we could determine whether a particular instance of an identified Problem is a ‘sign or symptom’, ‘disease or syndrome’, ‘anatomical abnormality’ and so forth. The mapping of concepts was restricted to a small set of (relevant) terminological resources of the Metathesaurus:

• SNOMED-CT: an international healthcare terminology used for classifying clinical concepts. It is the most comprehensive, multilingual clinical healthcare terminology in the world.2

• ICD-10CM: the International Classification of Disease, 10th Revision, Clinical Modification; an international standard developed and maintained by the WHO and used to classify and code diseases and other health problems (e.g., HrQoL).3

• ICF and ICF-CY: the International Classification of Functioning, Disability and Health, and the ICF Children and Youth Version; a standard used to classify functioning and disability associated with health problems (e.g., HrQoL).

• RxNorm: a medication vocabulary which provides normalised names for clinical drugs and links its names to multiple other drug vocabularies (e.g., National Drug File - Reference Terminology, which is used to code drug properties, physiological effects and therapeutic category).4

3.1.2 Data

The NER components engineered as part of this thesis were developed and validated using a set of publicly available research datasets. Notably, the NLP research datasets used were obtained from the clinical TM challenges organised by the i2b25 (see Chapter 2). Specifically, these datasets are derived from the following shared tasks:

(i) The 2010 i2b2 4th Shared Task; referred to as i2b2-CARC hereafter (Uzuner et al., 2011), and

(ii) The 2012 i2b2 6th Shared Task; referred to as i2b2-TRC hereafter (Sun et al., 2013c).

² http://ihtsdo.org/snomed-ct/
³ www.who.int/whosis/icd10/
⁴ http://www.nlm.nih.gov/research/umls/rxnorm/
⁵ The research datasets provided by i2b2 are not entirely public, but require data use agreements to be signed; see https://www.i2b2.org/NLP/DataSets/.

Table 3.4 provides further details such as relevant annotations/labels and the size (number of documents across training and test datasets).

Table 3.4: EVENT annotated datasets
This table shows the NLP datasets used in this thesis for event extraction. The i2b2-TRC includes annotated EVENTs such as Problem, Treatment, Test, Occurrence, Evidential and Clinical department (the latter three were not considered). The i2b2-CARC includes annotated EVENTs such as Problem, Treatment and Test.

Dataset      Annotation   Training   Test
i2b2-TRC     EVENT        190        120
i2b2-CARC    EVENT        170        256

These datasets were produced using multiple annotators, including domain experts. Specifically, the i2b2-TRC was produced using eight expert annotators, four of whom had a medical background; the i2b2-CARC was produced using twelve annotators, including six with a medical background⁶.

Inter-annotator agreement

Table 3.5 shows the IAA for i2b2-TRC (Sun et al., 2013c, p.808)⁷ and Table 3.6 shows the IAA for the i2b2-CARC dataset⁸. The IAA scores confirm that recognition of event boundaries for both i2b2-TRC and i2b2-CARC is a fairly straightforward task for manual processing, with the identification of Problem, Treatment and Test event boundaries being a simpler task (see Table 3.6). Likewise, classification of EVENT type and concept negation proved to be a relatively straightforward manual annotation task for appropriately trained experts.

⁶ Annotation task information regarding the i2b2-CARC corpus was obtained by email from the responsible researcher, Brett South, (currently) Senior Scientist at the University of Utah, Department of Biomedical Informatics.
⁷ These statistics are computed for Problem, Treatment and Test.
⁸ These statistics are computed across six different EVENTs: Problem, Treatment, Test, Occurrence, Evidential and Clinical department. Only the first three EVENT categories are considered in this thesis.

Table 3.5: i2b2-TRC: EVENT IAA

EVENT            Avg. P&R   κ
Span (strict)    0.83       -
Span (lenient)   0.87       -
Type             0.93       0.90
Negation         0.97       0.21

Table 3.6: i2b2-CARC: EVENT IAA

EVENT            Avg. P&R
Span (strict)    0.85
Span (lenient)   0.91

Clinical EVENT corpora

The dataset utilised to engineer this method was composed of the i2b2-TRC and i2b2-CARC corpora. A total of 736 discharge summaries were used across the training (616 documents with a total of 584,978 tokens) and test (120 documents with a total of 58,265 tokens) datasets. The discharge summaries were typically organised into sub-sections (similar to the sample clinical narrative in Appendix A.1) with common headings such as ‘HISTORY OF PRESENT ILLNESS’, ‘FAMILY HISTORY’, ‘PAST MEDICAL HISTORY’, ‘MEDICATION’, ‘PHYSICAL EXAMINATION’, and ‘DISCHARGE STATUS’, among others. Table 3.7 shows the label distribution by event category across the combined datasets used in this thesis.

Table 3.7: EVENT label distribution
In this thesis, the i2b2-TRC (training) and i2b2-CARC (training and test) data were combined as the training dataset, while the i2b2-TRC test dataset was used as the held-out test data for the clinical NER method described herein.

EVENT        Training   Test
Problem      24,330     4,309
Treatment    17,773     3,285
Test         16,062     2,173
Total        58,165     9,767

3.1.3 Experiments, results, and discussion

We explored a number of sequence label schemas: IO, BIO and W-BIO (see Appendix A.2), in addition to a set of post-processing components. For the development/validation experiments we used the training data (Table 3.7), which we split into a validation training set (500 documents) and a validation test set (116 documents).

Table 3.8: EVENT extraction validation test results
The validation test set results are obtained by training the models on a set of 500 documents and testing on a validation test set of 116 documents (shown here). The best results by EVENT category are highlighted. Of all the models tested, the IO model performed worst across all concept types, with strict scores being notably lower than those of the other models (approximately 5% across all concept categories). Further, the differences between BIO and W-BIO were minimal: the BIO models performed slightly better for the Problem and Treatment categories while W-BIO performed better on identifying the Test category.

EVENT         Model   Precision %      Recall %         F1-measure %
                      strict|lenient   strict|lenient   strict|lenient
Problem       IO      67.46|84.33      70.22|87.78      68.81|86.02
              BIO     73.20|85.95      74.63|87.62      73.91|86.78
              W-BIO   72.32|85.83      73.54|87.28      72.92|86.55
Treatment     IO      73.63|89.36      70.65|85.74      72.11|87.51
              BIO     79.45|90.37      74.70|84.97      77.00|87.59
              W-BIO   79.41|90.91      73.45|84.09      76.31|87.37
Test          IO      75.00|89.20      72.13|85.79      73.54|87.47
              BIO     80.14|89.82      76.37|85.59      78.21|87.65
              W-BIO   80.72|90.34      76.50|85.62      78.56|87.92
Micro score   IO      71.31|87.13      70.88|86.60      71.09|86.86
              BIO     76.92|88.30      75.13|86.24      76.02|87.26
              W-BIO   76.67|88.54      74.33|85.84      75.48|87.17

Our experiments showed that the IO models performed consistently worse than BIO and W-BIO in terms of strict evaluation (Table 3.8). The IO schema is unable to discriminate concept boundaries as well as the alternative models. We hypothesise that this is because a token/word (adjectives, determiners, and quantifiers in particular) can be part of multiple relevant as well as irrelevant NEs (i.e., across EVENT categories and non-EVENT). Therefore, when NE sequences are modelled without considering the beginning of a sequence⁹, the CRF is unable to effectively distinguish sequence boundaries. A manual analysis of the errors further supports this hypothesis. Specifically, the majority of the additional false positives produced by the IO models (compared to BIO/W-BIO) were predictions that excluded adjectives, determiners and quantifiers. This may be attributed to the conditional probability P(Y|X) being adversely affected by the label ambiguity that arises from not modelling the beginning of a sequence.

⁹ The beginnings of relevant NEs often include words/tokens with multiple memberships such as adjectives, determiners, and quantifiers.

Moreover, considering strict evaluation metrics (Table 3.8), there is a minimal difference between the BIO and W-BIO models, while a notable difference can be observed between the IO and BIO/W-BIO models (approximately 5% micro F1-measure). This suggests that the BIO and W-BIO models are better suited than the IO sequence label schema for strict boundary identification of the clinical concepts investigated. Further, when considering lenient evaluation scores, there is a minimal difference among all models; however, the BIO and W-BIO models perform consistently higher in terms of precision and F1-measure, while the IO models show consistently higher recall.

The final evaluation (test) results are presented in Table 3.9. These are consistent with our findings during validation (Table 3.8). As may be seen from both the validation and evaluation results, there is no major difference between the BIO and W-BIO models; hence, we conclude that distinguishing multi-token sequences from single-token sequences does not seem to be helpful. In light of evaluation results that are comparable to the IAA (Tables 3.5 and 3.6), we have omitted a detailed error analysis.
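The difference between the three label schemas can be made concrete with a small sketch. The schemas themselves are defined in Appendix A.2; here we assume that W marks a single-token entity, which is consistent with the remark above about multi-token and single-token sequences. The `encode` function and the toy spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # half-open token span [start, end) plus EVENT type


def encode(n_tokens: int, spans: List[Span], schema: str) -> List[str]:
    """Convert entity spans into per-token labels under the IO, BIO or W-BIO schema."""
    labels = ["O"] * n_tokens
    for start, end, event_type in spans:
        for i in range(start, end):
            if schema == "IO":
                prefix = "I"  # no explicit marker for the beginning of an entity
            elif schema == "W-BIO" and end - start == 1:
                prefix = "W"  # assumed: dedicated label for single-token entities
            else:
                prefix = "B" if i == start else "I"
            labels[i] = f"{prefix}-{event_type}"
    return labels


# Two adjacent Problem mentions, e.g. list-like text: 'stomach ache' 'nausea'
spans = [(0, 2, "Problem"), (2, 3, "Problem")]
for schema in ("IO", "BIO", "W-BIO"):
    print(schema, encode(3, spans, schema))
```

Under IO the two adjacent mentions collapse into one uninterrupted run of I-Problem labels, so the boundary between them cannot be recovered from the labels alone, whereas BIO and W-BIO keep them apart.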

Table 3.9: EVENT extraction results on the held-out test set
The results on the held-out test set showed a similar trend to the validation results; the IO models have been excluded due to their notably poor performance on the validation set. Further, similar to the validation results, BIO performed better on the Problem and Treatment categories while the W-BIO model performed best on the Test category.

EVENT         Model   Precision %      Recall %         F1-measure %
                      strict|lenient   strict|lenient   strict|lenient
Problem       BIO     81.52|90.68      82.62|92.90      82.07|91.29
              W-BIO   81.91|90.84      82.80|91.83      82.35|91.33
Treatment     BIO     87.24|94.43      80.12|86.73      83.53|90.42
              W-BIO   88.00|94.72      80.12|86.24      83.88|90.28
Test          BIO     85.48|93.02      82.88|90.20      84.16|91.59
              W-BIO   86.45|93.49      83.71|90.52      85.06|91.98
Micro score   BIO     84.22|92.39      81.84|89.78      83.01|91.07
              W-BIO   84.85|92.66      82.10|89.66      83.45|91.13

The final models selected for the clinical NER pipeline were BIO for Problem and Treatment, and W-BIO for Test. The final evaluation scores, including negation, are given in Table 3.10.

Table 3.10: The clinical NER performance

         F1-micro %       Negation
         strict|lenient
EVENT    83.21|91.17      0.93

Impact analysis

In order to justify the use of the various features, datasets, and post-processing components, a series of impact analyses was conducted. The results are shown in Table 3.11 (impact of the different CRF features used), Table 3.12 (impact of the datasets on the overall performance) and Table 3.13 (impact of the post-processing components).

Table 3.11 shows the feature impact analysis by the micro score of EVENTs; lexical features were used as the baseline. Notably, word stems had the most impact (+3% strict and +2% lenient F1); POS and chunk features showed minimal impact on their own, with the latter having a negative impact of -0.01% on lenient F1. Further, while POS and chunk features adversely affect precision, both show a positive effect on recall.

Table 3.11: EVENT recognition: feature impact analysis
This table shows the feature impact of several CRF feature groups.

EVENTs
Feature vector       P-micro %        R-micro %        F1-micro %
                     strict|lenient   strict|lenient   strict|lenient
Baseline (Lexical)   82.56|92.34      76.79|85.88      79.57|88.99
Baseline + Stem      84.56|92.96      81.37|89.44      82.93|91.17
Baseline + POS       82.51|92.08      77.66|86.67      80.01|89.29
Baseline + Chunk     82.39|92.07      77.03|86.09      79.62|88.98
All features         84.43|92.50      82.02|89.85      83.21|91.17

Notably, using the i2b2-CARC corpus improved the strict and lenient micro F1-scores by +17% and +12% respectively (see Table 3.12).

Table 3.13 shows the impact of the post-processing sub-components. For example, while the label-fixer has an adverse effect on precision (-5% strict and -4% lenient), it has a positive impact on recall (+3% strict and +5% lenient). In addition, the label-fixer has an impact of less than -1% (strict) and more than +1% (lenient) on the F1-score.

Table 3.12: EVENT recognition: dataset impact
This table shows the impact of the different datasets used to train the CRF models.

EVENTs
Dataset               P-micro %        R-micro %        F1-micro %
                      strict|lenient   strict|lenient   strict|lenient
i2b2-TRC              69.03|82.97      63.04|75.77      65.90|79.20
i2b2-TRC+i2b2-CARC    84.43|92.50      82.02|89.85      83.21|91.17

Boundary adjustment showed a positive effect on all strict metrics and, as expected, no effect on lenient scores. The FP filter showed a slight positive impact on precision and, interestingly, a slight negative impact on recall.

Table 3.13: EVENT recognition: post-processing impact analysis
This table lists the performance impact of the various post-processing components.

EVENTs
Component                  P-micro %        R-micro %        F1-micro %
                           strict|lenient   strict|lenient   strict|lenient
No post-processing         88.09|96.06      77.64|84.66      82.54|90.00
Only label-fixer           82.85|92.23      80.81|89.97      81.82|91.09
Only boundary-adjustment   89.34|96.06      78.73|84.66      83.70|90.00
Only FP filter             89.14|96.45      77.63|84.00      82.99|89.79
All post-processing        84.43|92.50      82.02|89.85      83.21|91.17
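The impact figures quoted for the post-processing components are differences, in percentage points, against the ‘No post-processing’ row of Table 3.13. As a worked illustration, the following sketch recomputes them from the tabulated scores; the dictionary layout and the helper names `f1` and `impact` are our own and are not part of the original pipeline.

```python
from typing import Dict, Tuple

Pair = Tuple[float, float]  # (strict, lenient) micro score in %

# Scores copied from Table 3.13.
TABLE_3_13: Dict[str, Dict[str, Pair]] = {
    "No post-processing": {"P": (88.09, 96.06), "R": (77.64, 84.66), "F1": (82.54, 90.00)},
    "Only label-fixer": {"P": (82.85, 92.23), "R": (80.81, 89.97), "F1": (81.82, 91.09)},
    "Only boundary-adjustment": {"P": (89.34, 96.06), "R": (78.73, 84.66), "F1": (83.70, 90.00)},
    "Only FP filter": {"P": (89.14, 96.45), "R": (77.63, 84.00), "F1": (82.99, 89.79)},
    "All post-processing": {"P": (84.43, 92.50), "R": (82.02, 89.85), "F1": (83.21, 91.17)},
}


def f1(precision: float, recall: float) -> float:
    """F1-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def impact(component: str, baseline: str = "No post-processing") -> Dict[str, Pair]:
    """Difference (in percentage points) between a configuration and the baseline."""
    deltas = {}
    for metric, (strict, lenient) in TABLE_3_13[component].items():
        base_strict, base_lenient = TABLE_3_13[baseline][metric]
        deltas[metric] = (round(strict - base_strict, 2), round(lenient - base_lenient, 2))
    return deltas


for name in TABLE_3_13:
    print(name, impact(name))
```

For the label-fixer this gives -5.24|-3.83 for precision and +3.17|+5.31 for recall, matching the rounded figures in the text. The `f1` helper can be used to confirm that the tabulated F1 scores follow from the precision and recall columns.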

Remarks

The NER component described herein is an extension of the official submission to the 2012 i2b2 concept extraction task (Kovacevic et al., 2013), with a couple of noteworthy differences. For instance, (1) the original feature set was reduced from 280 to 20 discriminative features, (2) the BIO and W-BIO sequence label schemas were used rather than IO, and (3) the label-fixer was included as a post-processing component.

The performance presented herein is comparable to, and slightly exceeds, that of our original results (Kovacevic et al., 2013) (see Table 3.14). Specifically, we observe an increase of +0.87% and +1.13% in F1-measure for Test and Treatment respectively, with a drop of -0.05% in F1-measure for the identification of Problem.

Table 3.14: Performance of the official 2012 i2b2 submission
This table shows the performance on the held-out test set (i2b2-TRC) of the NER component published in (Kovacevic et al., 2013) and submitted as part of the 2012 i2b2 concept extraction task.

EVENT        Precision %   Recall %   F1-measure %
Problem      94.24         87.82      91.38
Treatment    95.68         83.65      89.29
Test         95.05         87.48      91.11

3.2 Health-related Quality of Life Extraction

While there are existing resources for common clinical concepts (i.e., Problem, Treatment and Test), there are no annotated corpora for HrQoL concepts. In fact, to the best of our knowledge there has been no work to date on the identification and classification of HrQoL concepts from unstructured text. Therefore, we first describe the ‘annotation task’ used to create an annotated corpus, and subsequently the methods developed and validated using this corpus to extract mentions of HrQoL concepts. While HrQoL is typically measured on a scale of severity, we make no attempt to extract this attribute.¹⁰

3.2.1 HrQoL Schema

The annotation task began with an initial conceptualisation of the problem at hand (i.e., ‘what constitutes a HrQoL concept?’), in particular considering survivors of CNS cancer such as medulloblastoma. Through investigation of the general literature (e.g., (Ferrans et al., 2005; Wilson and Cleary, 1995)) as well as relevant clinical practice in local hospitals (The Christie NHS Foundation Trust and The Royal Manchester Children’s Hospital), several disease-specific HrQoL instruments were collected in order to model relevant HrQoL concepts. In particular, the following instruments were manually analysed for HrQoL themes: the Paediatric Quality of Life Inventory (generic core, brain tumour and family impact modules) (PedsQL) (Palmer et al., 2007; Varni et al., 2004), the Strengths and Difficulties Questionnaire (SDQ), the Health Utilities Index (HUI2 and HUI3) (Furlong et al., 2001), the European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 Questionnaire (Aaronson et al., 1993), the Paediatric Index of Emotional Distress (PI-ED) (O’Connor et al., 2010) and the Hospital Anxiety and Depression Scale (HADS) (Zigmond and Snaith, 1983). This initial task was straightforward as the aforementioned instruments are organised into themes or dimensions (Section 2.2.3): Figure 3.3 shows the HrQoL themes identified across the instruments.

¹⁰ However, we have proposed a schema to record severity (see Appendix D), and we propose future work to address this limitation.

Figure 3.3: Relevant HrQoL instruments and contained themes

Subsequently, a systematic process was adopted to devise a combined schema to be used as a basis for the automated processing and classification of HrQoL concepts. Further, several iterations of consultation with our clinical colleagues¹¹ were completed in order to finalise the combined schema. Table 3.15 describes the resulting bespoke HrQoL schema containing nine HrQoL categories. The full annotation schema and the combined HrQoL categories are described in detail in the ‘HrQoL Annotation Guideline’ (see Appendix D). The combined model is illustrated in Figure D.1.

¹¹ The combined HrQoL schema was primarily developed in collaboration with Dr Edward J Estlin, and qualitatively validated through consultations with clinical colleagues and collaborators: Dr Louise Robinson (Macmillan Principal Clinical Psychologist), Professor Tony Long (Child and Family Health), Dr Ian Kamaly-Asl (Consultant Neurosurgeon), Ms Ruth Morgan (Professional Lead for Occupational Therapy), and Dr Ram Kumar (Consultant Paediatric Neurologist).

Table 3.15: Example of HrQoL concepts

HrQoL category                    Examples of concepts
Physical functioning and speech   motor skills; balance; coordination; mobility; speech
Sensory and pain                  sensation; vision; hearing; pain
Other well-being                  other general well-being concepts, e.g., fertility
Emotional functioning             sleep; motivation; confidence; volition; energy
Cognitive functioning             learning; memory; attention
Social functioning                interpersonal relationships
School functioning                functioning and status
Home and family                   life style; employment; housing/living arrangements
Activity                          participation in social/physical activities

3.2.2 Data

HrQoL corpora

A subset of the case study dataset was annotated for HrQoL concepts based on the bespoke annotation schema. This combined corpus includes clinical narratives, i.e., clinical annotations (internal medical notes) and clinical correspondence (e.g., letters between doctors, and between doctors and patients), and patient narratives, i.e., transcribed patient interviews (a further description of the data types is given in Section 5.2). The annotated dataset was subdivided such that the development set contained a total of 148 narratives (7 patient narratives and 141 clinical narratives with a combined total of 72,730 tokens) and the test set contained 98 narratives (6 patient narratives and 92 clinical narratives and letters with a combined total of 59,443 tokens). Table 3.16 shows the label distribution across the development and test datasets used to develop and validate the HrQoL NER. These HrQoL categories will collectively be referred to as HrQoL henceforth.

Inter-annotator agreement

Table 3.17 shows the IAA for the annotation task, computed on a subset of 21 narratives. This dataset was annotated by two annotators: the author and an annotator with a medical background. The IAA shows a reasonable score for manual efforts to identify HrQoL (0.79 F1-measure). However, the strict identification of concept boundaries proved to be a more challenging task (0.63 F1-measure). Further, negation detection and source identification (i.e., objective: third person or subjective: first person) proved to be straightforward manual tasks.

Table 3.16: HrQoL dataset label distribution
This table gives the number of annotations in the development/training and held-out test datasets used to validate the method developed to extract HrQoL concepts.

HrQoL                             Development   Test
Emotional functioning             348           238
Physical functioning and speech   223           187
School functioning                198           191
Sensory and pain                  175           104
Other well-being                  162           117
Cognitive functioning             102           69
Social functioning                61            71
Home and family                   90            38
Activity                          37            59
Total                             1,396         1,074

Table 3.17: HrQoL dataset IAA

HrQoL            Micro F1   κ
Span (strict)    0.63       -
Span (lenient)   0.79       -
Negation         -          0.36
Source           -          0.96

3.2.3 Methods

The HrQoL NER component is developed using a combination of knowledge-driven methods. A set of topic-specific dictionaries is initially used to spot candidate HrQoL mentions. Subsequently, contextual rules are applied in order to determine inclusion/exclusion and to expand valid concept boundaries.

Architecture

Figure 3.4 shows the method workflow of the HrQoL extraction component.

Figure 3.4: HrQoL method architecture

NLP pre-processing

The pre-processing pipeline is made up of off-the-shelf NLP components, and includes: (i) a tokeniser and (ii) a sentence splitter.

HrQoL extractor

The HrQoL NER is developed using a knowledge-driven two-stage approach. In the first stage, a set of focused dictionaries is used to spot candidate terms. In the second stage, the context reasoner applies a set of contextual rules to determine whether a given candidate term is valid. Subsequently, the concept boundaries are adjusted/expanded.

HrQoL dictionary

The HrQoL dictionary is made up of multiple focused dictionaries, one for each HrQoL category/type. The dictionary entries are typically head nouns or verbs. The dictionaries were created manually by analysing various HrQoL instruments and the development dataset.

Table 3.18: HrQoL dictionary summary
This table gives the size (number of terms) of the HrQoL dictionaries used as part of the HrQoL concept extractor.

HrQoL category/dictionary         Size
Emotional functioning             191
Physical functioning and speech   154
Sensory and pain                  80
Other well-being                  76
Cognitive functioning             69
Home and family                   64
School functioning                58
Activity                          47
Social functioning                37

These dictionaries correspond to the previously discussed bespoke HrQoL schema (see also Appendix D, Figure D.1).

Context reasoner

The context reasoner determines whether a given candidate term from the dictionary tagger is valid or should be excluded. This component uses lexical contextual cues in order to disambiguate or determine exclusion. For instance, the word ‘hearing’ often indicates a sensory concept; however, a straightforward dictionary approach will return many false positives. Consider the following examples of noun phrases:

(a) .. hearing problem ..
(b) .. hearing test ..
(c) .. hearing clinic ..

These examples will in part be tagged by the HrQoL dictionary (specifically the Sensory and pain dictionary; annotated parts are indicated by the underlined text segments). However, the contextual cues in examples (b) and (c), i.e., ‘test’ and ‘clinic’ respectively, are treated as exclusion cues; these candidates are therefore discarded by the context reasoner and not tagged as HrQoL concepts. The complete set of lexical cues used for specific HrQoL concept types is given in Appendix B.
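The two-stage behaviour on the ‘hearing’ examples can be sketched as follows. This is a toy illustration rather than the component itself: the dictionary and cue sets are small stand-ins for the lists in Appendix B, and the one-token right-context window, like the function names, is an assumption made for the example.

```python
from typing import List

# Toy stand-ins for the Sensory and pain dictionary and its exclusion cues.
SENSORY_AND_PAIN = {"hearing", "vision", "pain"}
EXCLUSION_CUES = {"test", "clinic"}


def spot_candidates(tokens: List[str]) -> List[int]:
    """Stage 1: dictionary look-up returning the indices of candidate terms."""
    return [i for i, token in enumerate(tokens) if token.lower() in SENSORY_AND_PAIN]


def is_valid(tokens: List[str], index: int, window: int = 1) -> bool:
    """Stage 2: reject a candidate if an exclusion cue follows within `window` tokens."""
    right_context = tokens[index + 1 : index + 1 + window]
    return not any(token.lower() in EXCLUSION_CUES for token in right_context)


def extract(tokens: List[str]) -> List[str]:
    """Run both stages and return the surviving candidate terms."""
    return [tokens[i] for i in spot_candidates(tokens) if is_valid(tokens, i)]


for phrase in ("a hearing problem", "a hearing test", "the hearing clinic"):
    print(phrase, "->", extract(phrase.split()))
```

Only the first phrase yields a HrQoL candidate; in the other two the cue word to the right of ‘hearing’ triggers exclusion.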

Boundary adjustment

This component attempts to expand the boundary of recognised HrQoL terms by analysing the left and right context of a given annotation. The boundary adjustment component relies solely on lexical features, specifically common adjectives (e.g., ‘mild’; ‘severe’; ‘poor’) that are used to express the severity of a concept, and anatomical/spatial terms (e.g., ‘lower limb’; ‘arm’; ‘bilateral’). The full list of adjectives and anatomical/spatial terms used by the boundary adjustment method is given in Appendix B.
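A minimal sketch of this lexical expansion is given below. The word lists contain only the examples quoted above (the full lists are in Appendix B), multi-word terms such as ‘lower limb’ are split into single tokens for simplicity, and the function name `expand` is hypothetical.

```python
from typing import List, Tuple

# Example entries only; multi-word terms are stored token by token (an assumption).
SEVERITY_ADJECTIVES = {"mild", "severe", "poor"}
ANATOMICAL_SPATIAL = {"lower", "limb", "arm", "bilateral"}
EXPANDERS = SEVERITY_ADJECTIVES | ANATOMICAL_SPATIAL


def expand(tokens: List[str], start: int, end: int) -> Tuple[int, int]:
    """Grow the half-open span [start, end) over neighbouring expander tokens."""
    while start > 0 and tokens[start - 1].lower() in EXPANDERS:
        start -= 1
    while end < len(tokens) and tokens[end].lower() in EXPANDERS:
        end += 1
    return start, end


tokens = ["He", "has", "severe", "lower", "limb", "pain"]
start, end = expand(tokens, 5, 6)  # the dictionary spotted 'pain' at [5, 6)
print(" ".join(tokens[start:end]))
```

Here the spotted term ‘pain’ grows leftwards to ‘severe lower limb pain’ and stops at ‘has’, which is in neither list.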

Negation

To identify negated HrQoL concepts we used the ConText negation tool as described in (Harkema et al., 2009).

Concept mapping

The UMLS MetaMap tool was adopted for mapping of events to the UMLS Metathesaurus.

3.2.4 Experiments, results, and discussion

Using the case study dataset, which includes both clinical and patient narratives (see the description in Chapter 5), we validated the described methods to extract HrQoL. The engineered method was used for both types of data without any amendments. The method was first validated on the development set (Table 3.19), and subsequently evaluated on an unseen set (Table 3.20).

The results from the development set (Table 3.19) indicate that, while concepts can be extracted with a fairly reasonable score, 71.85% in (lenient) micro F1-measure, a better method for boundary adjustment may be needed. Specifically, there was a notable gap between the lenient and strict micro F1-measure (71.85% vs. 48.70%). Furthermore, this ‘gap’ was consistent across all HrQoL types.

Evaluation of the HrQoL NER component on the held-out test data (Table 3.20) showed a similar trend to the development set. A notable gap between strict and lenient F1-measure was apparent at both the category (Table 3.20) and the micro level (Table 3.21: 47.86% vs. 69.75%). In addition, only a small drop between the development and test results could be observed for both strict (precision: -0.95%, recall: -0.75% and F1-measure: -0.84%) and lenient (precision: -2.27%, recall: -1.94% and F1-measure: -2.10%) measures. This minimal drop in results between the development and test sets indicates a generalisable method.

Table 3.19: HrQoL NER results on the development set

HrQoL category                  Precision %      Recall %         F1-measure %
                                strict|lenient   strict|lenient   strict|lenient
Sensory and pain                50.00|82.39      50.29|82.86      50.14|82.62
Physical functioning & speech   57.07|84.85      50.67|75.34      53.68|79.81
Social functioning              53.97|77.78      55.74|80.33      54.84|79.03
Other well-being                65.36|78.43      61.73|74.07      63.49|76.19
Cognitive functioning           47.47|73.74      46.08|71.57      46.77|72.64
Activity                        65.62|78.12      56.76|67.57      60.87|72.46
Emotional functioning           46.44|70.59      43.10|65.52      44.71|67.96
Home and family                 50.62|66.67      45.56|60.00      47.95|63.16
School functioning              33.49|55.96      36.87|61.62      35.10|58.65
Micro score                     49.66|73.27      47.78|70.49      48.70|71.85

Negation extraction (see Table 3.21) was easily captured by existing methods (i.e., ConText). In addition, a simple method for determining source, assigning the default value of ‘subjective’ for both patient and clinical narratives, gave an accuracy of 87%. This was particularly interesting as patient narratives were in fact mainly narrated by a third person rather than the patient.

Table 3.20: HrQoL NER results on the held-out test set

HrQoL category                    Precision %      Recall %         F1-measure %
                                  strict|lenient   strict|lenient   strict|lenient
Sensory and pain                  47.79|83.19      51.92|90.38      49.77|86.64
Physical functioning and speech   57.52|80.39      47.06|65.78      51.76|72.35
Social functioning                51.22|68.29      59.15|78.87      54.90|73.20
Other well-being                  57.76|68.97      57.26|68.38      57.51|68.67
Cognitive functioning             48.61|72.22      51.47|76.47      50.00|74.29
Emotional functioning             50.00|76.80      40.93|62.87      45.01|69.14
Activity                          50.91|74.55      47.46|69.49      49.12|71.93
School functioning                40.38|60.58      43.98|65.97      42.11|63.16
Home and family                   32.61|52.17      39.47|63.16      35.71|57.14
Micro score                       48.71|71.00      47.03|68.55      47.86|69.75

Table 3.21: The HrQoL NER performance

         F1-micro %    Source   Negation
HrQoL    47.86|69.75   0.87     0.90

Both the development and test results show that School functioning and Home and family are challenging concept categories to extract. These were relatively infrequent (see Table 3.16) and context dependent. Further, our error analysis revealed that a common false positive for School functioning was caused by the word ‘school’ (or its derivatives, e.g., ‘primary school’, ‘secondary school’ and so forth); in particular, its mention in patient narratives was the most challenging to extract correctly. Specifically, 55 out of a total of 82 false positives were generated from only four patient narratives that were part of the held-out test set.

Home and family was the most challenging category to extract. A detailed error analysis showed that this concept type was particularly infrequent in the test dataset (38 instances) and highly context dependent. Similar to School functioning, the majority of false positives were generated from the patient narratives. Common errors (15 out of the 22 false positives generated) occurred when mentions such as ‘work’, ‘job’, and ‘volunteering’ were discussed in general terms or related to family members rather than the patient in question.

Furthermore, given that the majority of errors across both poorly performing categories occurred in the patient narratives, the methods developed may benefit from specialised or separate approaches for the respective corpus/data types. Specifically, a possible approach would be to devise two separate NER components for the patient and the clinical narratives (or at least for School functioning and Home and family). This hypothesis was further supported when we re-evaluated the held-out test set with the patient narrative documents removed: we obtained a strict F1-measure of 51.40% and a lenient F1-measure of 81.37% (approximately +3.5% and +11.6% on the strict and lenient micro F1-scores respectively).

The concept types that were relatively frequent (except for Social functioning), well defined and less context dependent were easier to extract. In particular, Sensory and pain, Physical functioning and speech, Social functioning, Cognitive functioning, and Activity can be extracted with fairly good scores (i.e., 72-87% lenient F1-measure).

The results obtained from our experiments to extract HrQoL show both encouraging results and notable shortcomings (i.e., exact boundary identification). Additionally, the gap between strict and lenient scores is notably larger than for EVENT recognition. We argue that these results are affected by multiple factors, such as the annotation quality, the methods devised, and the nature and complexity of the problem at hand.

A notable challenge of the method is exact boundary identification: identifying strict concept boundaries is more challenging than lenient identification. However, this challenge also reflects manual efforts to identify concepts. As shown earlier, the IAA for strict and lenient scores (micro F1-measure) is approximately 63% and 79% respectively (Table 3.17). By comparison, the automated method to extract HrQoL concepts achieves approximately 48% and 70% for strict and lenient scores respectively. Hence, both manual (-16%) and automated (-22%) efforts are affected by a notable drop from lenient to strict scores. On the other hand, high quality NER datasets, e.g., i2b2-CARC and i2b2-TRC (Table 3.6 and Table 3.5), achieve IAA in the range of 87-91% for lenient and 83-85% for strict (average P&R).
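To make the strict/lenient distinction concrete, the sketch below scores a set of predicted spans against gold spans under both criteria. It assumes that a strict match requires identical boundaries and a lenient match requires any overlap; concept types are ignored, and the official i2b2 evaluation scripts may differ in detail. The names `overlaps` and `prf` are ours.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open offsets [start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def prf(gold: List[Span], pred: List[Span], lenient: bool = False) -> Tuple[float, float, float]:
    """Precision, recall and F1 under strict (exact) or lenient (overlap) matching."""
    match = overlaps if lenient else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(2, 5), (8, 9)]
pred = [(3, 4), (8, 9), (11, 12)]  # one partial match, one exact match, one spurious

print("strict :", prf(gold, pred))
print("lenient:", prf(gold, pred, lenient=True))
```

In this toy case the partial match (3, 4) counts only under lenient matching, which is the kind of boundary error behind the strict/lenient gap discussed above.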

Impact analysis

Table 3.22 shows the impact analysis of the context reasoner and boundary adjustment components. We consider both strict|lenient scores. The context reasoner shows a positive impact on precision (5.26|6.59%) with an adverse effect on recall (-0.28|-1.86%) and an overall positive impact on F1-measure (2.55|2.48%). The boundary adjustment component, on the other hand, shows a positive impact on almost all metrics: precision (4.62|4.34%), recall (1.79|-0.07%) and F1-measure (3.28|2.26%). Further, combining these two components gave the best overall result, with a drop (-1.38%) in lenient recall.

Table 3.22: The HrQoL NER impact analysis
The table shows the impact of the context reasoner and boundary adjustment components.

Concept
Component                        P-micro %     R-micro %     F1-micro %
Baseline (Dictionary)            41.63|64.06   45.45|69.93   43.46|66.86
Baseline + context reasoner      46.89|70.65   45.17|68.07   46.01|69.34
Baseline + boundary adjustment   46.25|68.40   47.24|69.86   46.74|69.12
All                              48.71|71.00   47.03|68.55   47.86|69.75

HrQoL concept expressions

Other challenges derive from the fact that HrQoL concepts differ from common clinical NER targets in several respects. (i) In contrast to common clinical NE terms, which are characteristically contained within an NP structure, HrQoL concepts are characterised by a variety of syntactic structures, such as NPs, VPs and longer descriptive phrases that may contain a mix of NPs and VPs (a number of examples are given below). (ii) Common clinical NEs are events that have typically either happened or not (or are hypothetical). HrQoL concepts may be characterised in a similar manner, but are additionally described on a scale of severity. Hence, we argue that these characteristics amplify the challenges faced by common NE tasks (Dehghan et al., 2013), such as term variability, ambiguity, and complexity. For example, consider the potential HrQoL concept ‘walking’, which may be described in a variety of syntactic forms; the correct concept boundary has been highlighted in bold in the following examples.

(a) {He}/NP {is}/VP {walking}/VP {fine}/NP.

(b) {He}/NP {is}/VP {wobbly when walking}/VP.

(c) {He}/NP {veers sideways when walking}/VP.

(d) {He}/NP {has}/VP {no problems}/NP {walking}/VP.

(e) {He}/NP {is}/VP {unable to walk}/VP.

(f) {His ability}/NP {to walk}/VP {is progressively getting}/VP {worst}/NP.

(g) {He}/NP {has}/VP {a walking disability}/NP.

Concept expressions in patient narratives can be even more challenging to capture. One explanation is the use of lay terms and expressions to describe HrQoL concepts. These types of formulation may also introduce ambiguity in terms of concept classification. A number of examples taken from patient narratives are given below:

• “Some days I just feel like I’ve hit the wall”

• “I have good days and bad days”

• “couldn’t carry something and walk”

• “can’t cope with sometimes getting dressed”

• “he’s swimming downstream a bit now for the first time”

In addition to the ambiguity in concept classification, boundary identification likewise becomes a challenging task. In our view, in all of the examples above, the whole phrase/sentence has to be captured in order to convey the semantic meaning.

3.3 Summary

In this chapter we presented the methods to extract health-related concepts such as Problem, Treatment, Test, and HrQoL.

The approaches described have shown state-of-the-art performance, using a sequence labelling algorithm (i.e., CRF), to extract clinical events from unstructured clinical narratives. We achieved an F1-micro of 91.13% (or a lenient score of 83.45%) across EVENTs (Problem, Treatment and Test).

Additionally, the extraction of HrQoL concepts from healthcare narratives was presented. First, we described the development of a computational classification schema for HrQoL concepts. Subsequently, we presented, developed and evaluated a method to extract HrQoL concepts. The proposed knowledge-based approach achieved a micro F1-measure of 69.75% (and a strict score of 47.86%) across nine concept categories. These results are low compared with clinical event extraction. Several factors may explain this. For instance, the annotation of HrQoL concepts proved to be a challenging task, with a relatively low IAA (compared to, e.g., the EVENT annotations provided by i2b2) and consequently poor-quality annotations. Other factors may include the method developed. However, given that this is, to the best of our knowledge, the first attempt to extract HrQoL from unstructured text, these results are reasonable and promising.

Moreover, we found that existing methods for negation identification were suitable for HrQoL extraction. In addition, determining the source appears to be straightforward. Notably, we found that patient narratives were mainly narrated by a third person rather than the patient.

The extraction of NEs described herein (EVENTs and HrQoL) now enables the identification of key concepts in healthcare narratives, which will be used to identify key events in patient journeys.

Chapter 4

Temporal Information Extraction

This chapter describes the methods developed to extract temporal information from clinical narratives. These include IE tasks such as:

(i) TERN, which aims to identify TEs (specifically, Date, Time, Duration and Frequency) and normalise them according to ISO-8601;

(ii) TLINK, which aims to identify temporal relations between EVENTs and TEs, and classify identified links into predefined categories such as: After, Before, and Overlap.

These methods are paramount IE components necessary to anchor or order events and concepts in a chronological order. An overview of the TIE methods developed is given in Figure 4.1.

Figure 4.1: TIE architecture


4.1 Temporal Entity Extraction

The TERN task involves the recognition and normalisation of TEs. TEs, as defined by the TIMEX3 schema, are grouped into four temporal types: Date (e.g., ‘August 23, 1993’), Time (e.g., ‘2:23 p.m.’), Frequency (e.g., ‘every morning’), and Duration (e.g., ‘two weeks’). In addition, the ISO-8601 date and time standard is used to normalise TEs into a standardised format. A detailed description of TERN was given in Section 2.1.4.

4.1.1 Methods

We propose a hybrid TER component and a rule-based temporal normalisation component (ClinicalNorMA)¹. The motivation for adopting a hybrid approach for TER was to compare different approaches, and potentially combine the methods for the best possible performance.

Architecture

The TERN component is made up of the following methods (see Figure 4.2 for an overview).

NLP pre-processing

A pre-processing pipeline is made up of the following NLP components: (1) tokeniser, (2) sentence splitter, and (3) semantic temporal resources. Specifically, several bespoke temporal knowledge resources were manually compiled and applied at this stage of processing, to be subsequently utilised as features for the rule- and ML-based TER components. These semantic resources cover a broad set of temporal expression sub-categories:

• clinical frequency (e.g., qd (once a day), bid (twice a day))

• duration (e.g., ‘over night’, ‘weekend’, ‘months’)

• festival (e.g., ‘Yom Kippur’, ‘Nowruz’, ‘Christmas’)

• season (e.g., ‘summer’, ‘spring’, ‘autumn’, etc.)

• weekday (e.g., ‘Monday’, ‘Tuesday’, ‘Wednesday’, etc.)

• month (e.g., ‘January’, ‘February’, ‘March’, etc.)

• literal time (e.g., ‘morning’, ‘afternoon’, ‘evening’)

• temporal modifier (e.g., ‘on’, ‘after’, ‘before’)

• ordinal number (e.g., ‘first’, ‘second’, ‘third’, etc.)

• literal number (e.g., ‘one’, ‘two’, ‘three’, etc.)

¹ https://github.com/filannim/clinical-norma
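As an illustration only, the dictionary lookup that turns these resources into token-level semantic categories could be sketched as below. The resource contents are limited to single-word examples from the list above, and the names RESOURCES, semantic_category and tag_tokens are hypothetical; the actual resources and matching strategy (e.g., for multi-word entries such as ‘Yom Kippur’) are not reproduced here.

```python
# Hypothetical miniature gazetteers; the entries are examples from the list above.
RESOURCES = {
    "clinical_frequency": {"qd", "bid"},
    "duration": {"weekend", "months"},
    "festival": {"nowruz", "christmas"},
    "season": {"summer", "spring", "autumn"},
    "weekday": {"monday", "tuesday", "wednesday"},
    "month": {"january", "february", "march"},
    "literal_time": {"morning", "afternoon", "evening"},
    "temporal_modifier": {"on", "after", "before"},
    "ordinal_number": {"first", "second", "third"},
    "literal_number": {"one", "two", "three"},
}


def semantic_category(token):
    """Return the first category whose gazetteer contains the token, else 'O'."""
    key = token.lower()
    for category, entries in RESOURCES.items():
        if key in entries:
            return category
    return "O"


def tag_tokens(tokens):
    """Pair every token with its semantic temporal category."""
    return [(token, semantic_category(token)) for token in tokens]


print(tag_tokens(["Seen", "on", "Monday", "morning"]))
```

The resulting category per token is the kind of value that could serve as the dictionary feature in both the rule- and ML-based components.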

Figure 4.2: TERN architecture

Temporal expression recognition

The TER component consists of combined rule- and ML-based methods.

The rule-based component consists of a total of 65 rules containing patterns derived from an initial collocation extraction (i.e., bi- and tri-grams) and pattern analysis of TEs in the training data. For example, the TE patterns ‘MM/DD/YYYY’, ‘MM/DD/YY’, ‘YYYY/DD/MM’ and ‘MM/DD’ accounted for roughly 35% of temporal expressions in the training data (i2b2-TRC).

The rule set developed combines three types of feature: (a) semantic: temporal categories derived from the set of specific temporal knowledge resources during the pre-processing (see previous sub-section), (b) lexical: such as common recurring expressions (e.g., ‘postoperative day one’, ‘hospital day five’, ‘today’), and (c) pattern features, e.g., ‘MM/DD/YYYY’, ‘MM/DD/YY’.²
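To make the pattern and lexical features more concrete, a minimal sketch of how a handful of such rules might be expressed with regular expressions is given below. These five rules are illustrative assumptions and are not the 65 rules of the actual component; RULES and find_temporal_expressions are hypothetical names.

```python
import re

# Illustrative rules only, ordered so that longer date patterns take priority.
RULES = [
    ("MM/DD/YYYY", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("YYYY/DD/MM", re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b")),
    ("MM/DD/YY", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b")),
    ("MM/DD", re.compile(r"\b\d{1,2}/\d{1,2}\b")),
    ("lexical", re.compile(r"\b(?:postoperative|hospital) day \w+\b|\btoday\b",
                           re.IGNORECASE)),
]


def find_temporal_expressions(text):
    """Return (start, end, mention, rule) tuples, skipping overlapping matches."""
    found = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            # Keep a match only if it does not overlap an already accepted span.
            if all(match.end() <= s or match.start() >= e for s, e, _, _ in found):
                found.append((match.start(), match.end(), match.group(), name))
    return sorted(found)


for hit in find_temporal_expressions(
        "Admitted 08/23/1993; on hospital day five, seen 09/01."):
    print(hit)
```

Note that a purely pattern-based rule such as ‘MM/DD’ would also fire on non-temporal expressions with the same shape, which is one reason a later filtering step is useful.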

The ML-based component was developed using a set of features selected based on an initial literature review, and further refined using a combination of manual forward and backward feature selection. A total of 19 of the most discriminative features were selected from an initial set of 120 features. These features can be organised into three sets:

Lexical

• fg1: the token string or alphanumeric character sequence

• fg2: semantic temporal categories derived from the ‘NLP pre-processing’

Orthographic

• fg3: token kind given by the literal representation: word, number, symbol, or punctuation

• fg4: token case given by the literal representation: lower-case, upper-case, upper-initial, mixed-caps, all-caps

Combined

• fg5: concatenation of the features fg1, fg2 and fg4

In addition to these features, the feature space includes contextual features. Specifically, we found that a feature window of size 5, i.e., [-2,2], was optimal for fg1, fg3 and fg4, and a window of size 3, i.e., [0,2], for fg2 (Table 4.1 gives the complete feature space used).

² Note: examples of rules were given in Chapter 2.

Table 4.1: Feature template: clinical TER. CRF feature template used for the TER.

fg1: Token        fg2: Dictionary
U00:%x[-2,1]
U01:%x[-1,1]
U02:%x[0,1]       U05:%x[0,2]
U03:%x[1,1]       U06:%x[1,2]
U04:%x[2,1]       U07:%x[2,2]

fg3: TokenKind    fg4: TokenCase
U08:%x[-2,5]      U13:%x[-2,6]
U09:%x[-1,5]      U14:%x[-1,6]
U10:%x[0,5]       U15:%x[0,6]
U11:%x[1,5]       U16:%x[1,6]
U12:%x[2,5]       U17:%x[2,6]

fg5: Combined
U18:%x[0,1]/%x[0,2]/%x[0,4]
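For readers unfamiliar with this template notation, each macro %x[row,col] refers to the value in column col of the token located row positions away from the current token. The sketch below expands a template line for one token; it is a simplified illustration, not the CRF toolkit's own code. The column layout of the example rows (unused columns shown as ‘-’) and the ‘_PAD_’ placeholder for positions outside the sentence are assumptions.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, i):
    """Expand one template line for the token at row i of the feature matrix."""
    def lookup(match):
        row = i + int(match.group(1))      # relative row offset
        col = int(match.group(2))          # absolute column index
        return rows[row][col] if 0 <= row < len(rows) else "_PAD_"
    return MACRO.sub(lookup, template)


# Hypothetical rows: column 1 = token, 2 = dictionary category,
# 5 = token kind, 6 = token case; the other columns are unused here.
rows = [
    ["-", "on", "temporal_modifier", "-", "-", "word", "lower-case"],
    ["-", "Monday", "weekday", "-", "-", "word", "upper-initial"],
]

print(expand("U01:%x[-1,1]", rows, 1))
print(expand("U15:%x[0,6]", rows, 1))
print(expand("U00:%x[-2,1]", rows, 1))
```

Each expanded string (e.g., the previous token combined with its template identifier) becomes one feature for the current token.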

The ML-based module uses a state-of-the-art sequence labelling algorithm (CRF) trained with the IO token representation schema and the following (default) CRF parameters: C = 1.00, ETA = 0.0001, and L2 regularisation.
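The difference between the IO representation used here and the BIO alternative can be illustrated with a short sketch. The function name encode, the label name TIMEX3 and the tag strings are assumptions made for illustration.

```python
def encode(tokens, spans, scheme="IO", label="TIMEX3"):
    """Encode mention spans (start, end token indices, end exclusive) as tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        for i in range(start, end):
            # BIO marks the first token of a mention; IO does not.
            if scheme == "BIO" and i == start:
                tags[i] = "B-" + label
            else:
                tags[i] = "I-" + label
    return tags


tokens = ["seen", "August", "23", ",", "1993", "and", "today"]
spans = [(1, 5), (6, 7)]
print(encode(tokens, spans, "IO"))
print(encode(tokens, spans, "BIO"))
```

Under IO, two directly adjacent mentions would be indistinguishable from a single longer mention, whereas BIO keeps them apart at the cost of an additional label.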

Results integration

The outputs of the ML- and rule-based methods are combined at the mention level by taking the union of the respective overlapping and non-overlapping tags.
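A minimal sketch of such a mention-level union is given below. How overlapping mentions are resolved is not specified above, so merging them into the widest covering span is an assumption of this sketch; integrate is a hypothetical name and spans are character offsets with an exclusive end.

```python
def integrate(ml_spans, rule_spans):
    """Union of two span lists; overlapping spans are merged into one."""
    merged = []
    for start, end in sorted(set(ml_spans) | set(rule_spans)):
        if merged and start < merged[-1][1]:
            # Overlap with the previous span: extend it (assumed behaviour).
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(integrate([(0, 2), (10, 12)], [(1, 3), (20, 22)]))
```

Because a union can only add mentions, this kind of combination favours recall over precision.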

Post-processing

This rule-based post-processing component was developed in order to correct obvious and systematic errors from the ML-based TER. It removes common false-positive predictions identified during the development of the TER component. Common examples include single-character predictions and unrelated but similar numerical expressions, e.g., pulmonary artery pressure measures (e.g., ‘42/21’) and other (partial) numerical expressions such as telephone, fax and ward numbers.
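The kind of filter described above might look as follows. The specific heuristics (a month/day plausibility check for two-part numeric expressions and a simple telephone pattern) are assumptions chosen to cover the examples in the text, not the actual post-processing rules.

```python
import re

TWO_PART = re.compile(r"^(\d{1,3})/(\d{1,3})$")
PHONE = re.compile(r"^\(?\d{3}\)?[- ]\d{3}-\d{4}$")


def is_false_positive(mention):
    """Heuristically flag predicted mentions that are unlikely to be TEs."""
    text = mention.strip()
    if len(text) <= 1:                      # single-character predictions
        return True
    if PHONE.match(text):                   # telephone-like numbers
        return True
    match = TWO_PART.match(text)
    if match:                               # e.g. pressure readings such as 42/21
        first, second = int(match.group(1)), int(match.group(2))
        return not (1 <= first <= 12 and 1 <= second <= 31)
    return False


predictions = ["42/21", "4", "08/23", "555-123-4567", "two weeks"]
print([p for p in predictions if not is_false_positive(p)])
```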

TE normalisation

The clinical TE normalisation component adopted is ClinicalNorMA³, which is based on the general-domain normalisation component TRIOS (UzZaman and Allen, 2010a). This component is rule-based and adheres to the TIMEX3 schema, specifically the extended schema described in Sun et al. (2013a). ClinicalNorMA takes as input two TEs: (i) an utterance date or the DRT, necessary to resolve relative TEs, and (ii) the target temporal expression to be normalised. The output of this component is a set of three elements: (i) type, (ii) value, and (iii) modifier.

A small number of extensions were added to the normalisation component to address bespoke requirements and identified shortcomings:

1. A preprocessing component ensures that, if possible, the document reference date appears as the first TE in a document. This ensures that ClinicalNorMA receives the correct document reference date.

2. Handle date expressions that are separated by punctuation, e.g., ‘DD.MM.YY’, ‘DD.MM.YYYY’, or that include extra white space, e.g., ‘MM / DD/ YYYY’.

3. Handle common UK-formatted expressions, e.g., ‘DD/MM/YYYY’, as opposed to the US standard of ‘MM/DD/YYYY’.
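Extensions 2 and 3 can be illustrated with a small sketch that maps numeric date expressions to ISO-8601 values. This is not the ClinicalNorMA code: the function name, the day_first switch and the pivot used to expand two-digit years are all assumptions.

```python
import re
from datetime import date

SEPARATORS = re.compile(r"\s*[./-]\s*")    # punctuation with optional white space


def normalise_numeric_date(text, day_first=False):
    """Return an ISO-8601 date string, or None if the input is not a valid date."""
    parts = SEPARATORS.split(text.strip())
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return None
    a, b, year = (int(part) for part in parts)
    if year < 100:
        # Assumed pivot for two-digit years: 50-99 -> 19xx, 00-49 -> 20xx.
        year += 1900 if year >= 50 else 2000
    day, month = (a, b) if day_first else (b, a)
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


print(normalise_numeric_date("08 / 23/ 1993"))              # US order
print(normalise_numeric_date("23.08.93", day_first=True))   # UK order
```

A value that is invalid under one ordering (e.g., ‘23/08/1993’ read as MM/DD/YYYY) returns None, which could be used as a cue to retry with the other ordering.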

4.1.2 Data

The i2b2-TRC dataset was used for the development and evaluation of the TERN component. A total of 310 discharge summaries were used across the development (190 documents with a total of 105,708 tokens) and test (120 documents with a total of 58,265 tokens) datasets. Table 4.2 and Table 4.3 show the label distribution across the dataset by TE type and the IAA, respectively (Sun et al., 2013c, p.808). Notably, while the IAA shows fairly good agreement for the recognition of TE spans (with strict boundary identification proving more challenging), normalisation of TEs (i.e., value) appears even more challenging for manual efforts.

³ This component was developed in the research group by M. Filannino as part of the TERN method evaluated in the 2012 i2b2 challenge.

Table 4.2: TIMEX3 label distribution

Type        Training   Test
Date        1,641      1,222
Duration    407        341
Frequency   249        197
Time        69         60
Total       2,366      1,820

Table 4.3: i2b2-TRC: TIMEX3 IAA

TIMEX3           Avg. P&R   κ
Span (strict)    0.73       -
Span (lenient)   0.89       -
Type             0.90       0.37
Value            0.75       -
Modifier         0.83       0.21

4.1.3 Experiments, results, and discussion

We explored a number of methods in order to adopt the best approach for TER (validation results are given in Table 4.4). For the ML-based method, we experimented with various sequence label schemas (i.e., IO, BIO and W-BIO). Notably, we discovered that all sequence label models explored performed relatively similarly in terms of lenient scores, but the W-BIO and BIO models performed notably better in terms of strict scores (e.g., 3-4% F1-measure). However, the strict rule-based method outperformed all ML models in terms of both lenient and strict scores (over 90% lenient F1-score).

Table 4.4: TER validation results. The ML-based component was validated on the i2b2-TRC training data, which was split 60/40% for training and validation respectively. *The rule-based results shown were obtained using the whole training data.

Method        Precision %      Recall %         F1-measure %
              strict|lenient   strict|lenient   strict|lenient
IO            66.03|86.94      67.17|88.44      66.60|87.69
BIO           71.26|87.85      70.95|87.47      71.10|87.66
W-BIO         71.80|87.85      71.49|87.47      71.65|87.66
Rule-based*   78.66|89.64      80.15|91.34      79.40|90.48
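For reference, the strict|lenient distinction used in these tables can be sketched as follows, under the usual convention that a strict match requires identical span boundaries while a lenient match only requires the spans to overlap. The function below is a simplified illustration and not the official evaluation script; spans are (start, end) offsets with an exclusive end.

```python
def prf(gold, predicted, lenient=False):
    """Precision, recall and F1 over mention spans with strict or lenient matching."""
    def match(a, b):
        if lenient:
            return a[0] < b[1] and b[0] < a[1]   # any overlap
        return a == b                            # exact boundaries

    hits_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hits_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = hits_pred / len(predicted) if predicted else 0.0
    recall = hits_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 3), (10, 12)]
predicted = [(1, 3), (10, 12), (20, 21)]
print(prf(gold, predicted))                 # strict
print(prf(gold, predicted, lenient=True))   # lenient
```

In this toy example the prediction (1, 3) only counts under lenient matching, which is why lenient scores are never lower than strict ones.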

Using the official i2b2-TRC test set, we further evaluated the various ML models (using the complete training set to derive the models) and the rule-based method. In addition, we explored a number of combinations of the various ML models and the rule-based method (results are given in Table 4.5).

The evaluation on the held-out test set (Table 4.5) shows a similar trend to the validation results (Table 4.4) in terms of strict scores, i.e., W-BIO and BIO perform notably better than IO (approximately +3%). This indicates that the methods generalise well. However, the IO model shows a more notable improvement (than in the validation results) in terms of lenient F1-measure over the W-BIO (+2%) and BIO (+0.62%) models. The rule-based method performs better than all ML models, except in lenient recall, where the IO and BIO models score higher.

Table 4.5: TER evaluation on the held-out test set. This table shows the evaluation results of various ML-, rule- and hybrid-based methods on the official i2b2-TRC test set.

Method             Precision %      Recall %         F1-measure %
                   strict|lenient   strict|lenient   strict|lenient
IO                 64.42|87.10      66.65|90.11      65.51|88.58
BIO                67.45|86.63      69.56|89.34      68.49|87.96
W-BIO              68.22|86.47      68.41|86.70      68.31|86.58
Rule-based         77.29|89.64      76.65|88.90      76.97|89.27
IO+Rule-based      72.03|86.62      78.41|94.29      75.09|90.29
BIO+Rule-based     71.15|86.05      77.64|93.90      74.25|89.81
W-BIO+Rule-based   71.66|85.73      78.08|93.41      74.73|89.40

We also explored a number of combinations of the various ML models and the rule-based method (union of the output of each respective method) in order to discover any possible improvements. In particular, since the normalisation of TEs is more important than recognition results, we are specifically interested in improved recall.

The combination of the IO model and the rule-based method showed the best overall performance. A notable improvement in lenient recall, of +4.18% and +5.39% compared to the best ML model (IO) and the rule-based method respectively, was achieved by the ‘IO+rule-based’ method. Similarly, the strict recall was improved by +8.85% and +1.76% over the best ML model (BIO) and the rule-based method respectively. In addition, the best lenient F1-measure of 90.29% was also achieved with the ‘IO+rule-based’ method; this slightly exceeds the state-of-the-art system (Sohn et al., 2013), which reported an F1-score of 90.03%. As expected, the rule-based method achieved the best precision of all methods.

The normalisation scores reproduced using the i2b2-TRC test dataset are given in Table 4.6. As is apparent from the primary score, TERN is a challenging task and an open research problem. While automated recognition of TEs has shown results comparable to, and even exceeding, human-level benchmarks (e.g., Sun et al., 2013c; UzZaman et al., 2013), normalisation remains a challenge for both human and automated methods. For instance, the current state-of-the-art clinical TERN methods achieve a mere 66% (primary score), which is just lower than the human benchmark of 66.75% (Sun et al., 2013c). Similarly, the state-of-the-art system (Sohn et al., 2013) achieved 73% accuracy for the normalised value attribute, lower than the human benchmark of 75%. Regardless, these scores, whether automated or human, are notably lower than the common IE scores of over 90% that are typically considered good. One of the notable challenges of TERN is the normalisation of relative expressions (e.g., ‘two weeks ago’, ‘post-operative day’).

Table 4.6: TE normalisation results. This table gives the normalisation scores. The primary score is the product of the TER lenient F1-measure and the normalisation value accuracy, and is considered the main TERN metric.

Type     Value    Modifier   Primary score
0.8473   0.7044   0.8275     0.63
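As a worked check of the primary score defined in the caption of Table 4.6 (TER lenient F1-measure multiplied by value accuracy), the arithmetic can be sketched as follows. The inputs are the figures reported in Tables 4.3, 4.5 and 4.6; the function name is hypothetical, and the official evaluation script may round differently.

```python
def primary_score(lenient_f1, value_accuracy):
    """Primary TERN score: product of TER lenient F1 and value accuracy."""
    return lenient_f1 * value_accuracy


# System: 'IO+rule-based' lenient F1 (Table 4.5) and value accuracy (Table 4.6).
system = primary_score(0.9029, 0.7044)
# Human benchmark: lenient span and value agreement (Table 4.3).
human = primary_score(0.89, 0.75)

print(f"system: {system:.4f}, human: {human:.4f}")
```

The human figure reproduces the 66.75% benchmark quoted above, while the system figure (about 0.636) is close to the 0.63 reported in Table 4.6.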

Error analysis

An error analysis of the TER component identified interesting challenges. For example, around 20% of false positives were due to typical date ‘patterns’ being used to represent other medical information (e.g., ‘25/52/70’ is an arterial blood gas test result). A substantial share of false positives (20%) were ambiguous temporal expressions (e.g., ‘that time’, ‘x 3’, ‘daily’, ‘per day’) that are not always annotated as TEs in the gold standard: for example, only 48% of mentions of ‘this time’, ‘the time’, and ‘that time’ were annotated as TEs; similarly, ‘daily’ has only a 68% precision hit rate. On the other hand, false negatives included specific TE mentions such as ‘time of delivery’ and ‘day of transfer’, or highly ambiguous mentions such as ‘now’, which were excluded from the TER rule set.

The majority of the normalisation errors were due to the limited coverage of the rules (e.g., ‘the course of the night’), the presence of typos, and ambiguities (e.g., ‘this time’). Another source of mistakes was a wrong reference time attached to a TE. In addition to occasional errors in the gold standard annotations (e.g., ‘2017-09-15’ normalised as ‘2019-09-15’), some errors were recorded because a different normalisation code was used compared to the gold standard, although the values were equivalent in the temporal sense (e.g., value: PT24H (24 hours) vs. value: P1D (one day)). Furthermore, some errors were due to a non-standardised approach to normalising expressions such as ‘postoperative day XX’: in some cases, the day of the referent event (e.g., the day of operation) would be day 0, and in others day 1. This led to a potential one-day difference between the annotations and the system’s predictions.
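The equivalence issue noted above (PT24H vs. P1D) could be detected by comparing durations on a common scale. The sketch below is a simplified assumption-laden illustration: it handles only week, day, hour, minute and second designators, treats one day as exactly 24 hours, and deliberately rejects months and years because their length varies.

```python
import re

DURATION = re.compile(
    r"^P(?:(\d+)W)?(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def duration_seconds(value):
    """Convert a simple ISO-8601 duration to seconds, or None if unsupported."""
    match = DURATION.match(value)
    if not match or value in ("P", "PT"):
        return None
    weeks, days, hours, minutes, seconds = (
        int(group) if group else 0 for group in match.groups())
    return ((weeks * 7 + days) * 24 + hours) * 3600 + minutes * 60 + seconds


def equivalent(a, b):
    """True if both durations are supported and denote the same length of time."""
    sa, sb = duration_seconds(a), duration_seconds(b)
    return sa is not None and sa == sb


print(equivalent("PT24H", "P1D"))
print(equivalent("P2W", "P14D"))
```

An evaluation that compared normalised values in this way, rather than as literal strings, would not penalise such temporally equivalent outputs.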

Remarks

The TERN method described in this thesis is a slight extension of our official submission to the 2012 i2b2 TERN task, where we ranked joint 1st in terms of F1-measure (Kovacevic et al., 2013). However, there are noteworthy differences between that submission and the extended work presented herein. For instance, in the methods described herein we reduced the original ML feature set from 280 to 19 discriminative features, with comparable results. Specifically, the published results achieved 90.08% (F1-measure), 91.54% (recall) and 88.68% (precision); this is -0.21% (F1-measure), -2.75% (recall) and +2.06% (precision) compared to the results achieved in this thesis. Hence, the main difference in terms of performance is the trade-off between precision and recall, with a slight increase in F1-measure. Notably, this was achieved with a significantly reduced feature set.

4.2 Temporal Relation Extraction

The aim of temporal relation extraction is the chronological ordering of events. The identification of temporal links between entity pairs such as EVENTs (e.g., Problem, Treatment, Test), TEs, and EVENTs and TEs, as well as the subsequent classification of these links into predefined categories (e.g., After, Before, Overlap), is known as TLINK extraction. A comprehensive introduction and review of TLINK extraction was given in Section 2.1.6.

4.2.1 Methods

The TLINK method developed and described herein is rule-based. The approach is motivated by a gap in the current literature on purely knowledge-driven methods for clinical TLINK extraction (see Section 2.1.4).

The developed method has two main components. The first component takes as input extracted clinical concepts (Problem, Treatment, Test and HrQoL) and TEs (Date, Time, Duration and Frequency), generates TLINK candidate pairs (the identification step), and subsequently classifies the identified links into three categories: After, Before, or Overlap. A final component derives the transitive closure (refer to Appendix A.3) of the extracted relations in order to generate implied relations that have been missed by the preceding TLINK method. Figure 4.3 shows an abstract representation of the methodology. The remainder of this section describes our methods in detail.

Figure 4.3: TLINK extraction architecture

TLINK identification and classification

A notable difference between previous work and our approach is that we (i) use a purely rule-based method for TLINK extraction, and (ii) combine the TLINK candidate generation (identification) and classification into a single component. The rule-based TLINK component is partitioned into two sub-components:

(1) intra-sentence: TLINKs within a sentence span;

(2) inter-sentence: TLINKs across sentences.

Intra-sentence TLINKs

In order to analyse intra-sentence TLINKs, we first performed an initial semi-automatic analysis of the development dataset. For each sentence containing a TLINK, the TLINK pairs were abstracted to their respective EVENT or TIMEX3 types. Additionally, any context to the right and left of the TLINKs was removed to make patterns easier to spot. Subsequently, the abstracted TLINK pairs were manually analysed for common patterns by TLINK category. For example, in the following sentences (a, b) the underlined EVENTs and TEs are part of six different TLINKs (three TLINKs per sentence):

(a) ‘The patient reported vomiting, nausea and headaches.’

(b) ‘The patient received steroids for his swelling in 2006.’

In the following list, each pair of EVENTs or TEs that is part of a TLINK is abstracted to its respective labels, and any context to the left and right of the pair is removed (illustrated by strikethrough).

(a1) ‘The patient reported PROBLEM, PROBLEM and headaches.’

(a2) ‘The patient reported vomiting, PROBLEM and PROBLEM.’

(a3) ‘The patient reported PROBLEM, nausea and PROBLEM.’

(b1) ‘The patient received TREATMENT for PROBLEM in 2006.’

(b2) ‘The patient received steroids for PROBLEM in DATE.’

(b3) ‘The patient received TREATMENT for his swelling in DATE.’

This approach enabled us to profile various TLINK categories and formalise extraction rules based on common abstraction patterns. For example, Table 4.7 lists a number of common patterns found and their typically associated TLINK category.

Table 4.7: TLINK patterns. This table shows common patterns semi-automatically extracted from the development/training dataset. The patterns listed in this table make up the largest and most obvious TLINK patterns observed.

TLINK abstraction pattern      Typical TLINK
PROBLEM and PROBLEM        →   [PROBLEM] Overlap [PROBLEM]
PROBLEM, PROBLEM           →   [PROBLEM] Overlap [PROBLEM]
TREATMENT on DATE          →   [TREATMENT] Overlap [DATE]
TREATMENT in DATE          →   [TREATMENT] Overlap [DATE]
TREATMENT for PROBLEM      →   [TREATMENT] Before [PROBLEM]
TREATMENT of PROBLEM       →   [TREATMENT] Before [PROBLEM]
TEST showed PROBLEM        →   [TEST] Before [PROBLEM]
PROBLEM after TREATMENT    →   [PROBLEM] After [TREATMENT]
TREATMENT post TEST        →   [TREATMENT] After [TEST]
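The patterns in Table 4.7 lend themselves to a simple lookup in which identification and classification happen in one step. The sketch below is illustrative only: it considers just adjacent entity pairs and uses the first token between them as the cue, both of which are simplifying assumptions, and the names RULES and extract_tlinks are hypothetical.

```python
# (left type, cue, right type) -> TLINK category, taken from Table 4.7.
RULES = {
    ("PROBLEM", "and", "PROBLEM"): "Overlap",
    ("PROBLEM", ",", "PROBLEM"): "Overlap",
    ("TREATMENT", "on", "DATE"): "Overlap",
    ("TREATMENT", "in", "DATE"): "Overlap",
    ("TREATMENT", "for", "PROBLEM"): "Before",
    ("TREATMENT", "of", "PROBLEM"): "Before",
    ("TEST", "showed", "PROBLEM"): "Before",
    ("PROBLEM", "after", "TREATMENT"): "After",
    ("TREATMENT", "post", "TEST"): "After",
}


def extract_tlinks(sequence):
    """sequence mixes plain tokens (str) and entities given as (text, TYPE)."""
    entities = [(i, item) for i, item in enumerate(sequence)
                if isinstance(item, tuple)]
    links = []
    for (i, (left, ltype)), (j, (right, rtype)) in zip(entities, entities[1:]):
        between = [token.lower() for token in sequence[i + 1:j]]
        if not between:
            continue
        category = RULES.get((ltype, between[0], rtype))
        if category:                     # identified and classified at once
            links.append((left, category, right))
    return links


sentence_a = [("vomiting", "PROBLEM"), ",", ("nausea", "PROBLEM"),
              "and", ("headaches", "PROBLEM")]
print(extract_tlinks(sentence_a))
```

Links between non-adjacent entities, such as the first and last PROBLEM in sentence (a), are not produced by this sketch.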

Profiling of TLINKs revealed the occurrence of different types of relations at the sentence level, which we group into three types: co-ordinate, prepositional, and other TLINKs. Further, these three types of TLINKs directly correspond to the types of extraction rules, which take advantage of the features that characterise them. Specifically:

• co-ordinate TLINKs are links characterised by EVENTs separated by co-ordinating conjunctions such as ‘and’ or ‘or’, or by a comma (i.e., ‘,’). For example, in sentence (a) above, all events are linked by co-ordinate TLINKs. In the development dataset we observed that co-ordinate TLINKs are predominantly categorised as Overlap.

• prepositional TLINKs are characterised by EVENTs/TEs that are linked by a preposition. For example, in sentence (b), the preposition ‘for’ between the two EVENTs indicates the presence of a TLINK (in this particular example the TLINK is [TREATMENT] after [PROBLEM]).

• other TLINKs are links that do not fit either of the previously characterised types. A notable number of other TLINKs are characterised by linking verbs between EVENTs. For example, in the sentence ‘The TEST revealed PROBLEM’, TEST is linked, by the verb ‘revealed’, to PROBLEM (in this particular example the TLINK is: [TEST] Before [PROBLEM]).

Table 4.8 lists and describes the type of features used to extract intra-sentence TLINKs.

Inter-sentence TLINKs

TLINKs that span across sentences fall into two characterised types: SECTIME and co-referential TLINKs.

• SECTIME TLINKs represent the largest proportion of inter-sentence TLINKs (e.g., in the full i2b2-TRC corpus, 45.87% of all TLINKs are SECTIME links (Sun et al., 2013a)). These are links anchored to the relevant document section. Specifically, in the i2b2-TRC dataset, all events within ‘History of Present Illness’ or related sections are linked to the admission date, and events within the ‘Hospital course’ section are linked to the discharge date. SECTIME links are predominantly categorised as Before.

Notably, clinical narratives do not always contain multiple temporally distinct sections. Consequently, events are anchored to the document creation time (or DRT), a category of relations also known as DocTimeRel (document creation time relation). This is the case in the case study presented in Chapter 5.

• Co-referential TLINKs are EVENT co-references. This type of TLINK is characterised by multiple EVENT mentions that refer to the same EVENT.

The approach for these two types of inter-sentence TLINKs differed. In the i2b2-TRC datasets, for development and testing, SECTIME TLINKs were addressed in a three-step approach:

(1) extract admission and discharge dates;

(2) apply Section Boundary Detection (SBD), i.e., identify ‘history of present illness’ and ‘hospital course’ sections accordingly;

(3) anchor each EVENT in a given document section to the appropriate section date and set each TLINK category to Before.
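Step (3) can be sketched as below, assuming that steps (1) and (2) have already supplied the two dates and a section label for every EVENT. The data structures, names and example values are hypothetical; section boundary detection itself is not shown.

```python
# Which section date an EVENT is anchored to, following the description above.
SECTION_ANCHORS = {
    "history of present illness": "admission",
    "hospital course": "discharge",
}


def sectime_links(events, dates):
    """events: (text, section) pairs; dates: {'admission': ..., 'discharge': ...}."""
    links = []
    for text, section in events:
        anchor = SECTION_ANCHORS.get(section.lower())
        if anchor in dates:
            # Every SECTIME link is assigned the category Before.
            links.append((text, "Before", dates[anchor]))
    return links


events = [("chest pain", "History of Present Illness"),
          ("CT scan", "Hospital Course"),
          ("aspirin", "Medications")]
dates = {"admission": "2012-03-01", "discharge": "2012-03-09"}
print(sectime_links(events, dates))
```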

Co-referential TLINKs are approached by considering a novel feature: lexical-level similarity (i.e., comparing literal strings with no additional features considered) between EVENTs in a given clinical note. A combined token- and character-level string similarity metric, the SoftTFIDF algorithm (Cohen et al., 2003), was adopted to determine the similarity between two candidate events. Specifically, the SoftTFIDF component takes as input two strings and outputs a similarity score: a real number in [0,1], where 1 is a perfect match and 0 is no match. The optimum threshold of 0.8 was determined through systematic experimentation with the i2b2-TRC development set. The co-referential TLINK method developed is outlined below:

(1) using SoftTFIDF, n² − 1 comparisons are done between events in a given document (including across document sections, if any);

(2) if the SoftTFIDF similarity score between any pair of EVENTs is greater than or equal to the threshold (0.8), create a TLINK between the EVENTs with the link category Overlap.
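The two steps above can be sketched in a few lines. Note that this sketch does not implement SoftTFIDF: as a stand-in it uses the character-level ratio from Python's standard difflib module, so the scores (and the suitability of the 0.8 threshold) will differ from those of the actual method. It also compares each unordered pair of EVENTs once.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in string similarity in [0, 1]; NOT the SoftTFIDF metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def coreference_links(events, threshold=0.8):
    """Create an Overlap TLINK for every EVENT pair at or above the threshold."""
    return [(a, "Overlap", b)
            for a, b in combinations(events, 2)
            if similarity(a, b) >= threshold]


print(coreference_links(["chest pain", "Chest pains", "nausea"]))
```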

TLINKs features

Table 4.8 lists the types of features used across both intra- and inter-sentence TLINK methods. The features are used as part of formalised rules and heuristics to identify and classify TLINKs and include:

Table 4.8: TLINK extraction features. The features listed herein were used for both TLINK identification and classification; a description of each feature type follows this table. Nota bene: EV=EVENT and ST=SECTIME.

Feature type            Inter-sentence (EV-EV, EV-TE, EV-ST) and Intra-sentence (EV-EV, EV-TE)
String similarity       ✓
Position information    ✓ ✓ ✓
Distance information    ✓ ✓
Preposition             ✓ ✓
Conjunction             ✓ ✓
TE-related              ✓ ✓
NE-related              ✓ ✓ ✓
Transitive closure      ✓ ✓ ✓ ✓ ✓

Description of feature types listed in Table 4.8 follows.

• String similarity: specifically, string similarity scores between pairs of EVENTs, derived from SoftTFIDF, were used to extract co-referential TLINKs;

• Position information: the position of an EVENT within a given section (SEC- TIME TLINKs);

• Distance information: (i) token distance between entity pairs, and (ii) number of EVENTs and TEs between entity pairs;

• Preposition: between two candidate pairs e.g., ‘in’, ‘on’, ‘after’, ‘before’ and so forth;

• Conjunction: lexical cues between two candidate pairs e.g., ‘and’, ‘both’ and so forth;

• TE-related: TE type i.e., date, time, duration, and frequency;

• EVENT-related: EVENT information such as type (i.e., Problem, Treatment, Test, including HrQoL concept categories) and negation information;

• Transitive closure: derived from temporal closure of TLINKs.

Temporal links closure

In order to capture implied TLINKs not extracted by our methods, the transitive closure is calculated using the output of our bespoke TLINK method. This component has been adopted from the SputLink component, which is part of the Temporal Awareness and Reasoning Systems for Question Interpretation (TARSQI) Toolkit⁴ (Verhagen and Pustejovsky, 2008). The final TLINK component calculates the full set of transitive relations, or temporal closure, of the given links. Examples of transitive closure were discussed in Chapter 2 and a practical description is given in Appendix A.3.

For the case study data, we adopted the SputLink component and integrated it into our pipeline. Further, only a subset of SputLink links were used in the final pipeline. For example, inverse links were all discarded, i.e., given the relation A Before B, the generated inverse link B After A was discarded.
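To illustrate the idea of temporal closure, the sketch below derives the implied links for the Before relation only (if A is Before B and B is Before C, then A is Before C). It is a deliberately minimal illustration and not the SputLink algorithm, which reasons over a richer set of relations; before_closure is a hypothetical name.

```python
def before_closure(links):
    """links: set of (a, b) pairs meaning 'a Before b'; returns the closure."""
    closure = set(links)
    changed = True
    while changed:                       # repeat until no new link is implied
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


explicit = {("admission", "surgery"), ("surgery", "discharge")}
print(sorted(before_closure(explicit) - explicit))
```

Only forward (Before) links are generated here, so no inverse links need to be discarded afterwards.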

4.2.2 Data

The temporal relation component was developed and validated using the i2b2-TRC dataset. Note that only TLINKs that include EVENTs such as Problem, Treatment, Test and TIMEX3 have been considered. Table 4.9 and Table 4.10 show the label distribution and the IAAs, respectively (Sun et al., 2013c, p.808). Notably, and in comparison with the EVENT and TE recognition tasks, it is apparent that TLINK identification is a challenging task for humans (i.e., 0.39 in average precision-recall)⁵. However, manual TLINK classification (i.e., type) shows reasonable performance. Note that TLINK and its relevant attributes were described in detail in Section 2.1.4.

⁴ www.timeml.org/site/tarsqi/index.html

Table 4.9: TLINK label distribution

Type      Training   Test
Before    11,981     10,488
Overlap   7,276      5,694
After     1,415      1,275
Total     20,672     17,457

Table 4.10: i2b2-TRC: TLINK IAA

TLINK            Avg. P&R   κ
Span (strict)    0.39       -
Span (lenient)   -          -
Type             0.79       0.3

4.2.3 Experiments, results, and discussion

The methods described herein have been validated using multiple evaluation methods/metrics. The main evaluation metric used in the 2012 i2b2 temporal relation challenge (Sun et al., 2013a) was a TempEval-3-type metric computed on the ‘reduced graph’, in which redundant relations (i.e., relations that can be inferred from other relations) are removed. The TempEval-3 evaluation method used is described below:

• Precision: the total number of reduced system output TLINKs that can be verified in the gold standard closure divided by the total number of reduced system output TLINKs.

• Recall: the total number of reduced gold standard TLINKs that can be verified in the system closure divided by the total number of reduced gold standard TLINKs.
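Given the reduced graphs and closures as sets of links, the two definitions above translate directly into set operations. The sketch below assumes the reduction and closure have already been computed elsewhere; the function name and the toy links are hypothetical.

```python
def tempeval3_scores(sys_reduced, sys_closure, gold_reduced, gold_closure):
    """Precision, recall and F1 following the definitions above."""
    precision = (len(sys_reduced & gold_closure) / len(sys_reduced)
                 if sys_reduced else 0.0)
    recall = (len(gold_reduced & sys_closure) / len(gold_reduced)
              if gold_reduced else 0.0)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the gold standard states A<B and B<C, which implies A<C.
gold_reduced = {("A", "Before", "B"), ("B", "Before", "C")}
gold_closure = gold_reduced | {("A", "Before", "C")}
sys_reduced = {("A", "Before", "B"), ("A", "Before", "C")}
sys_closure = set(sys_reduced)

print(tempeval3_scores(sys_reduced, sys_closure, gold_reduced, gold_closure))
```

In the toy example the system link A Before C counts as correct because it can be verified in the gold closure, even though it is absent from the reduced gold graph.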

We initially developed and evaluated our method using gold standard EVENTs and TEs. The results of these experiments are shown in Table 4.11 and Table 4.12.⁶ In addition, an end-to-end evaluation, where EVENTs, TEs and TLINKs are all predicted using the bespoke methods described in Chapter 3 and this chapter, is shown in Table 4.14.

⁵ This low agreement for TLINKs may raise questions regarding the reliability and usefulness of the provided annotations.

⁶ Nota bene: the official 2012 i2b2 evaluation script was used to calculate the temporal closure, specifically using the evaluation setting ‘Original against Closure’ or ‘–oc’.

Table 4.11: TLINK development set results
This table shows the performance of the TLINK pipeline on the development/training dataset. We used two evaluation metrics: common precision-recall and the TempEval-3.

Evaluation setting                 Precision %   Recall %   F1-measure %
Precision-recall (no closure)      81.40         55.06      65.69
Precision-recall (with closure)    81.16         60.97      69.85
TempEval-3                         80.43         48.05      62.59

As expected, precision-biased results were achieved, as that was the aim during design and development. This leaves room for future work to extend the current method in order to improve recall and to further optimise the overall score.

A direct comparison cannot be made between our results and work on the full i2b2-TRC dataset (Sun et al., 2013a) because our experiments were based on a reduced set of TLINKs. The full i2b2-TRC dataset included pairwise TLINKs of six different EVENT types, three more than used in our experiments. We did not include Occurrence, Evidential and Clinical department as they were not relevant/useful for characterising patient journeys.

Nonetheless, we note the performance of the best systems evaluated on the full i2b2-TRC dataset as a point of reference. The best systems to date, using gold annotations (for clinical EVENTs and TEs), achieved an F1-measure of 69% (Cherry et al., 2013; Tang et al., 2013). As previously mentioned in Chapter 2, both Tang et al. (2013) and Cherry et al. (2013) use complex hybrid methods with rule-based components for candidate generation (i.e., TLINK identification). For classification of TLINKs, Tang et al. (2013) use a combination of CRF and SVM, whilst Cherry et al. (2013) use a combination of MaxEnt and SVM. In contrast, our method uses a knowledge-based approach to recognise and simultaneously classify TLINKs.

Our approach achieved an overall score of 62.96% F1-measure on the held-out test set (Table 4.12). Further, considering common IE evaluation metrics, where system predictions are evaluated against the manually annotated gold dataset without any further manipulation of labels, our approach achieved 65.34% F1-measure without the temporal closure component, and 70.22% F1-measure using our complete TLINK pipeline (including the temporal closure component).

Table 4.12: TLINK results on the held-out test set
This table shows the results of the TLINK pipeline on the held-out test set. The results are presented using common precision-recall and the TempEval-3 evaluation metric.

Evaluation setting                 Precision %   Recall %   F1-measure %
Precision-recall (no closure)      81.51         54.52      65.34
Precision-recall (with closure)    81.48         61.36      70.22
TempEval-3                         80.23         49.10      62.96

A comparison of results between the development (Table 4.11) and held-out test data (Table 4.12) indicates good generalisability of the methods developed. For instance, consider the minimal variation in F1-measures between the development and test sets. Except for a small drop in F1-measure when not including temporal closure, the results on the test dataset were slightly better than those on the development set.

Table 4.13 shows the component-based evaluation of SECTIME, intra-sentence and inter-sentence TLINKs. For each component, the held-out test set has been reduced to only the relevant type of TLINKs (i.e., when evaluating SECTIME, only SECTIME links are retained). These evaluation results are obtained using the test dataset with gold annotations.

Similar to the findings of the TLINK challenge described in Sun et al. (2013a), we found that SECTIME TLINKs were easiest to extract (see Table 4.13). Secondly, as expected, intra-sentence TLINKs were easier to extract than inter-sentence TLINKs (when excluding SECTIME TLINKs). Lastly, as concluded by previous work (Sun et al., 2013a), and equally applicable to our rule-based approach, a better method to generate candidate pairs would be beneficial to optimise recall.

Table 4.13: TLINK component based evaluation
This table shows the individual TLINK component based evaluation of the three TLINK sub-components: SECTIME, intra-sentence and inter-sentence TLINK methods. For each TLINK component evaluated the data has been reduced to only the relevant type of links.

TLINK            Precision %   Recall %   F1-measure %
SECTIME          93.93         92.04      92.97
Inter-sentence   55.72         20.40      29.87
Intra-sentence   39.47         22.50      28.66

The component-based analysis also reinforces the conclusion that an extension of our method for recognition of candidate pairs ought to be explored. Currently, only neighbouring candidate EVENTs and co-referential inter-sentence TLINKs are addressed. Extensions may include intra-sentence TLINKs whose arguments are separated by multiple tokens (e.g., first and last EVENTs in a sentence) and non co-referential inter-sentence TLINKs.

Moreover, Table 4.13 also shows the source of errors. Despite the aim of generating high-precision rules, it was challenging to replicate the manual effort. However, the highly inconsistent annotations (i.e., IAA: 0.39) indicate that the TLINK annotations themselves were a notable source of the errors generated.

End-to-end evaluation

Table 4.14 shows the end-to-end evaluation: all entities are derived from bespoke methods such as clinical NER (described in Chapter 3), and the TERN method described in this chapter.

Table 4.14: TLINK end-to-end results on the held-out test set
This table shows the results of the end-to-end system output: all annotations are derived from the bespoke clinical NER, TERN and TLINK methods described in this thesis thus far.

Evaluation method                  Precision %   Recall %   F1-measure %
Precision-recall (no closure)      78.27         48.21      59.67
Precision-recall (with closure)    78.13         54.21      64.01
TempEval-3                         76.87         43.05      55.19

As a point of reference, Tang et al. (2013) achieved 62.78% (F1-measure) on the full i2b2-TRC dataset using the TempEval-3 evaluation method. Our method achieved 55.19% using the same metric. In addition, evaluating our method as per typical IE evaluation (i.e., against the gold set without any manipulation of the temporal graph), we achieved an F1-measure of 64.01% with the temporal closure component and 59.67% without it.

4.3 Summary

We developed and evaluated a state-of-the-art method for TERN. Further, we found that while TER can be achieved with a reasonable and comparable performance to human effort, normalisation of TE remains an open research question. Notably, relative TEs (e.g., ‘two weeks ago’), whose correct values must be derived from context, have proven challenging to extract.

Moreover, we developed and evaluated an entirely rule-based approach for TLINK extraction, where we combined the identification and classification of links into a single simultaneous step. Our method achieved comparable results to other published methods. We also proposed and used a novel feature for co-referential TLINKs using SoftTFIDF.

The TLINK results, both those achieved herein and those published in the community, show that TLINK extraction is a challenging task and an open research question. A particularly challenging aspect of TLINK extraction, both for manual and automated methods, is TLINK candidate generation or temporal link identification. However, one may ask which is more important: the performance measured by customary IE metrics, or the usefulness of temporal ordering in a given application. For example, in a realistic scenario, all EVENTs and TEs in a given document are temporally related. However, are all relations necessary in a real-world application, e.g., chronologically ordering events or extracting clinical pathways? Probably not. This is certainly the reasoning behind the widely (community) accepted evaluation metric ‘TempEval-3’, which reduces the ‘temporal graph’.

Moreover, the post-hoc analysis of Sun et al. (2013c) and our review of TLINK extraction (across the general and clinical domains) revealed several overlapping and notable insights into the problem at hand:

• recognition and classification of EVENT-SECTIME TLINKs were easier than other types of TLINKs (i.e., EVENT-EVENT, EVENT-TE, and TE-TE);

• among non-section time TLINKs, EVENT-TE and EVENT-EVENT relations were easier to extract, while TE-TE and TIMEX3-EVENT were comparatively more challenging. Further analysis of this observation suggests that this is related to the fact that TE-TE and TIMEX3-EVENT TLINKs involved anchoring of relative dates and durations, which are two of the most challenging TERN problems;

• inter-sentence TLINKs are easier to extract than intra-sentence TLINKs;

• classification of TLINKs is comparatively easier than identification of candidate pairs.

The methods described herein (TERN and TLINK) have now enabled us to chronologically order EVENTs and HrQoL concepts, which will be used to extract patient journeys.

Chapter 5

Extracting Patient Journeys: a Case Study

In previous chapters we have described the methodology developed to extract major healthcare concepts (i.e., medical problems, treatments, tests, and HrQoL) (Chapter 3) and to subsequently chronologically order these onto a timeline (Chapter 4). In this chapter we present a case study that aims to extract individual and aggregated patient journeys or pathways. A patient pathway is a set of abstract care processes followed in clinical practice in terms of medical treatments and investigations described either at the individual patient level or at the aggregated level. The former represents a patient’s journey, whereas an aggregated pathway may indicate common practice in a given cohort.

In the remaining sections of this chapter we will first introduce specific background to the case study such as Central Nervous System (CNS) tumours in children, relevance of HrQoL, including treatment and survivorship issues (Section 5.1; see footnote 1). Section 5.2 provides an overview of the case study, including the dataset profile and its preparation for TM. Section 5.3 describes the analysis and profiling of clinical and patient narratives. Section 5.4 describes and evaluates the remaining component developed to extract and visualise individual and aggregated patient journeys. Section 5.5 describes the application and evaluation of visualising patient timelines. Finally, Section 5.6 summarises some of the notable discussion points and findings in this chapter.

1Contributed by Edward J Estlin.


5.1 Introduction: Childhood Central Nervous System Tumours

CNS tumours comprise more than 20% of new cases of cancer in children in Europe (Steliarova-Foucher et al., 1994), and approximately 350 new cases of this tumour type are diagnosed each year in the UK. Survival following diagnosis of a CNS tumour in childhood is now approximately 70% (Peris-Bonet et al., 2006). Children with this diagnosis present many challenges of management for the multidisciplinary team, including relatively long pre-diagnosis symptom intervals that involve neurological signs and symptoms (Wilne et al., 2007), and treatment-related complications such as posterior fossa syndrome (Robertson et al., 2006) and fatigue and anorexia (Ward et al., 2009). All of these factors are likely to contribute to the well-described problems with school attendance and reintegration, peer relationships and behaviour that are found from diagnosis onwards (Eiser and Vance, 2002).

The survivorship challenges for the survivors of CNS tumours diagnosed in childhood are also well characterised into the longer term, and are more marked for children with medulloblastoma, which comprises approximately 10% of all childhood CNS tumours and where whole brain radiotherapy is mandated as part of the treatment needed for cure. For example, the HUI2 and HUI3 systems consist of a multi-attribute health status classification scheme which provides utility scores for domains of health and for global health states (Barr et al., 1999). Deficits in health status (HS) for cognition and pain (Boman et al., 2009; Glaser et al., 1999), mobility, sensation and self-care are reported for survivors of a CNS tumour diagnosed in childhood and these are more pronounced for survivors of medulloblastoma (Boman et al., 2009). Similarly, measurement of the HrQoL factors for psychosocial, physical, emotional, social and school functioning shows lower scores for children in the first year following diagnosis with a CNS tumour when compared to the general population (Penn et al., 2008), with children receiving whole brain radiotherapy being at particular risk for an adverse HrQoL outcome in the longer term (Bhat et al., 2005).

In addition to those challenges that relate to HS, HrQoL and psychological functioning, the diagnosis of a CNS tumour in childhood, and in particular for children with medulloblastoma and those diagnosed at a young age, relates adversely to psychological distress (Zeltzer et al., 2009), use of special educational services and subsequent lower level of attainment (Lancashire et al., 2010; Mitby et al., 2003) and poorer prospects for employment (Pang et al., 2008). Many of these factors relate, at least in part, to the known neuropsychological impairments that have been particularly well characterised for children with medulloblastoma, and where deficits in memory, attention and processing speed contribute to lower IQ scores and educational attainment (Mulhern et al., 2004). Perhaps as a result of the above HS and HrQoL factors, or even as a contributory factor to these outcomes, young adult survivors of childhood CNS tumours are known to have muscle strength and fitness parameters that are similar in value to persons aged 60 years or more (Ness et al., 2010), and both measures of physical performance and a lack of engagement in activities of daily living (Ness et al., 2009) relate adversely to educational attainment, prospects for employment and living independently (Gurney et al., 2009).

5.2 Introduction: Case Study

In this thesis, we develop a variety of methods to recognise and normalise general as well as specific clinical concepts and expressions necessary to characterise patient journeys. For example, given an unstructured clinical note (Figure 5.1) the aim is to chronologically order the relevant clinical events (illustrated in Figure 5.2).

Given a narrative, the aim is to automatically extract relevant events, in this case, e.g., diagnosis: ‘medulloblastoma’, and treatments: ‘complete resection’ and ‘craniospinal irradiation’, as well as their related temporal expressions (each relevant expression following a related concept has been underlined). In the following Figure 5.2 these concepts have been visually and chronologically represented.

Figure 5.1: A hypothetical clinical narrative

Following the extraction of relevant concepts from the unstructured clinical narrative (Figure 5.1), we can chronologically order (or anchor) concepts onto a timeline.

Figure 5.2: Clinical timeline
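As a rough illustration of the anchoring step, the sketch below orders already-extracted concepts by their normalised dates. It is a minimal sketch under stated assumptions: the concept names are those mentioned for the hypothetical narrative, whereas the dates, the tuple layout and the function name build_timeline are invented for illustration and do not come from the case study data.

```python
from datetime import date

# Hypothetical, already-normalised events: (concept, EVENT type, anchored date).
# The dates are invented purely for illustration.
events = [
    ("craniospinal irradiation", "Treatment", date(2005, 4, 18)),
    ("medulloblastoma", "Problem", date(2005, 3, 2)),
    ("complete resection", "Treatment", date(2005, 3, 9)),
]


def build_timeline(events):
    """Return the events sorted by their anchored date, earliest first."""
    return sorted(events, key=lambda event: event[2])


timeline = build_timeline(events)
for concept, event_type, when in timeline:
    print(when.isoformat(), event_type, concept)
```

In practice many events are only ordered relative to one another (via TLINKs) rather than anchored to an absolute date, so a simple sort of this kind covers only the anchored subset.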

An overview of the methodology developed for mining clinical pathways is shown in Figure 5.3.

Our methods are subdivided into four different subsections: (A) extraction of relevant clinical concepts, (B) extraction of temporal information, including entities and relations, (C) mining and representation of healthcare pathways, and (D) analysis.

Figure 5.3: Method overview

(A) A corpus of longitudinal clinical documents and patient narratives is processed to recognise relevant clinical concepts (such as problems, treatments, tests and HrQoL) at the mention level. Subsequently, all extracted concepts are semantically normalised or mapped to the UMLS Metathesaurus (specifically, SNOMED-CT, ICD-10, ICF-CY, and RxNorm) using MetaMap.

(B) TIE methods are applied to recognise and normalise temporal expressions, and subsequently organise relevant concepts in a chronological order at the patient level.

(C) An initial computational analysis to explore health-related concepts appearing in clinical and patient narratives is conducted in order to profile the data.

(D) The final step consists of a set of methods which combine concept extraction, TERN and TLINK, and include a bespoke clustering technique to order and reconstruct individual and combined patient journeys, and subsequently apply visualisation techniques to visually represent these pathways.

A significant part of the methods developed as part of this thesis (i.e., HrQoL component, narrative comparison, and patient pathway extraction) is evaluated using a retrospective case study with a cohort of young survivors of childhood brain cancer (specifically medulloblastoma). The study participants’ secondary care providers were The Christie NHS Foundation Trust and Central Manchester University Hospitals NHS Foundation Trust (the Royal Manchester Children’s Hospital) (Manchester, England). We obtained n = 26 full patient case notes (the actual data used is described in the following subsection) and n = 21 patients participated in the patient interviews. Seven of the participants fell into the young transition group: those who had experienced the major challenges of moving from primary school to secondary school (11-16 years); seven were in the emerging adulthood transition group (18-24 years) who had experienced the change from secondary school to further or higher education, employment, or unemployment. The remaining thirteen patients were all 26 years or older.

Case study specific EVENT types

EVENTs (i.e., Problem, Treatment and Test) are further classified to represent a set of specific sub-types relevant for the case study. Through consultation with our clinical colleagues, we manually crafted a set of dictionaries to be used for further classification. Specifically, six sub-types were identified.

The case study specific EVENT types shown in Table 5.1 are presented in a semantic hierarchical manner (i.e., is-a) relative to their corresponding high-level categories. For example, Endocrinology diagnosis is-a medical Problem that is related to the endocrine system (e.g., ‘growth hormone deficiency’, ‘testosterone deficiency’); Oncology diagnosis includes problems that are related to oncology (e.g., ‘malignant tumours’); Endocrinology treatment (e.g., ‘growth hormone therapy’, ‘testosterone therapy’); Oncology treatment (e.g., ‘chemotherapy’, ‘radiotherapy’); Endocrinology investigation (e.g., ‘pituitary function test’, ‘thyroid function test’); Oncology investigation (e.g., ‘CT scan’, ‘MRI scan’).

Table 5.1: Adopted case study specific types

EVENT        Type
Problem      Endocrinology Diagnosis
             Oncology Diagnosis
             Other
Treatment    Endocrinology Treatment
             Oncology Treatment
             Other
Test         Endocrinology Investigation
             Oncology Investigation
             Other
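A dictionary lookup of this kind can be sketched as follows. This is a minimal sketch and not the dictionaries used in the case study: the entries are only the example terms quoted above, and the substring matching rule, the name SUBTYPE_DICTIONARIES and the function classify_event are assumptions made for illustration.

```python
# Hypothetical mini-dictionaries keyed by EVENT type, then by sub-type.
# Only the example terms quoted in the text are included.
SUBTYPE_DICTIONARIES = {
    "Problem": {
        "Endocrinology diagnosis": ["growth hormone deficiency", "testosterone deficiency"],
        "Oncology diagnosis": ["malignant tumour"],
    },
    "Treatment": {
        "Endocrinology treatment": ["growth hormone therapy", "testosterone therapy"],
        "Oncology treatment": ["chemotherapy", "radiotherapy"],
    },
    "Test": {
        "Endocrinology investigation": ["pituitary function test", "thyroid function test"],
        "Oncology investigation": ["ct scan", "mri scan"],
    },
}


def classify_event(mention, event_type):
    """Return the case study sub-type of a mention, or 'Other' if no term matches."""
    text = mention.lower()
    for subtype, terms in SUBTYPE_DICTIONARIES.get(event_type, {}).items():
        # Assumed matching rule: case-insensitive substring match.
        if any(term in text for term in terms):
            return subtype
    return "Other"


print(classify_event("Growth hormone deficiency", "Problem"))
print(classify_event("MRI scan of the head", "Test"))
print(classify_event("headache", "Problem"))
```

Falling back to ‘Other’ mirrors the third row of each EVENT group in Table 5.1, so every extracted EVENT receives exactly one sub-type.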

5.2.1 Data

The case study dataset contains (1) longitudinal clinical narratives and (2) patient narratives. These data were obtained from The Christie NHS Foundation Trust (see footnote 2).

(1) Longitudinal clinical narratives are recorded by specialist physicians (i.e., oncologists, endocrinologists, occupational therapists, and clinical physiologists) over the course of treatment and post-treatment follow-up. Two types of narratives were used:

• Clinical annotations are internal chronological hospital records, recorded by doctors following each patient consultation. Each entry includes a time stamp (date of entry) and a short narrative/description of the consultation. Clinical annotations are recorded into a single continuous document with no text style formatting, with the first entry added at the time of the first appointment.

• Clinical letters are used to communicate treatment/patient related information between the hospital and the patient or between different relevant authorities (e.g., general practitioner and consultant oncologists, educational institutions and consultants, and so forth). These documents are in letter format with no semi-structured clinical content.

The corpus statistics are given in Table 5.2.

2 Ethics approval was obtained prior to the commencement of this project/thesis.

Table 5.2: Case study corpus: clinical narratives profile
This table shows the descriptive statistics for the clinical narrative corpus. For each data type, the number of documents (count) and the average number of tokens/words per document are shown. Nota bene: Avg = Average.

Data type    Count   Avg Tokens
Annotation   2,469   104.8
Letter       2,187   251.2
Total        4,656   173.6

(2) Patient narratives

• Interviews were semi-structured and focused on concepts important to patients. The great majority of these interviews were conducted with the patient and their carer(s). The corpus statistics are given in Table 5.3.

Table 5.3: Case study corpus: patient narratives profile

Data type   Count   Avg Tokens
Interview   21      4111

Preparation and de-identification

The process of preparation and de-identification of narratives differed between clinical and patient narratives. The common denominators were the final document format (i.e., flat text file, with ‘.txt’ extension) and the types of protected health information (PHI) removed or replaced in all healthcare narratives.

• The specific types of PHI removed from clinical narratives include:

  Names: such as personal names of doctors, patients and their relatives.
  ID numbers: such as patient identification or reference numbers.
  Dates: specifically, dates of birth of patients.
  Location information: such as patients’ home and school addresses.
  Contact information: such as patients’ telephone numbers.
  Institution information: such as school and hospital names.

The respective data preparation and de-identification processes are described below:

• Clinical narrative
Clinical annotations were retrieved in electronic format from the Christie’s EHR system. However, the clinical letters were not all available in electronic format; therefore, all letters were obtained from each participant’s paper case notes (paper-based health records). Paper-based notes were scanned manually. Subsequently, optical character recognition (OCR) software (OmniPage Ultimate 18) was used to convert the resulting images into the editable text required for NLP processing. Notably, an estimated 900 letters or roughly 16% of the complete dataset had to be discarded due to various data quality reasons.

The de-identification of these records was completed by the author in a semi-automatic manner. For each patient narrative, initial PHI was collected and added to a computational dictionary to subsequently remove/replace the information. Patient names were replaced with ‘XPX’, doctor names with ‘XDX’, and all other identified PHI was removed.

• Patient narrative (see footnote 3)
Data collection was undertaken by a single researcher (Tony Long, Professor in Child and Family Health, The University of Salford) either in the clinic or at the family home. Initially, HrQoL instruments were completed as required, either by the researcher reading the questions (e.g., HUI), by the patient (e.g., PedsQL brain tumour module) or by the parent (e.g., parent-version, PedsQL core module). The nature of the deficits experienced by survivors of medulloblastoma dictated that help was needed by most participants due to physical disabilities (e.g., hearing loss or poor eyesight) or cognitive deficit (e.g., failing memory or difficulty in concentrating). This had been anticipated. There was no missing data. As the questionnaires were being completed, areas of deficit became clear and were noted for further discussion. In addition, most participants and their carers felt the need to explain some responses, and these contributions were also noted to be revisited during the interview.

3 The description of the data collection process for interviews was kindly provided by Professor Tony Long.

The interviews were conducted immediately after completing the questionnaires. A reminder was offered again of the purpose of the study and of the interview: to ascertain the problems that had been encountered since the diagnosis and what, if anything, had been done to address these. The interviews were digitally recorded. They typically lasted about 45 minutes, never less than 30 minutes, and in some cases continued for more than 60 minutes. As is often the case, during the conclusion of the interview after switching off the recorder, additional relevant comments were sometimes made. These were recorded as field notes. If able to be transposed verbatim they were added to the transcripts. The recordings were transcribed by a professional, confidential transcription service and then corrected and de-identified by the researcher by memory or revisiting the recording.

Notably, by the end of this thesis we developed a set of novel methods for automated de-identification, and evaluated them as part of the annual (2014) i2b2 challenge. Our hybrid method was particularly characterised by the use of ‘two-pass recognition’ and ranked 2nd among the 22 submissions (Dehghan et al., 2015).
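The semi-automatic, dictionary-based replacement described above for the clinical narratives could look roughly like the sketch below. It is not the hybrid ‘two-pass recognition’ method of Dehghan et al. (2015), and it is not the script used for the case study: the function name deidentify, its parameters, the whole-word matching rule and the example names are all assumptions made for illustration. Only the replacement tokens ‘XPX’ and ‘XDX’ come from the text.

```python
import re


def deidentify(text, patient_names, doctor_names, other_phi):
    """Replace patient names with XPX, doctor names with XDX, and drop other PHI."""

    def replace_all(source, terms, replacement):
        # Longest terms first, so that a full name is handled before any part of it.
        for term in sorted(terms, key=len, reverse=True):
            pattern = r"\b" + re.escape(term) + r"\b"
            source = re.sub(pattern, replacement, source, flags=re.IGNORECASE)
        return source

    text = replace_all(text, patient_names, "XPX")
    text = replace_all(text, doctor_names, "XDX")
    text = replace_all(text, other_phi, "")
    # Collapse runs of spaces left behind by removed terms.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


# Hypothetical note and PHI dictionary entries, invented for illustration.
note = "Jane Doe was seen by Dr Smith at Anytown School."
print(deidentify(note, ["Jane Doe"], ["Smith"], ["Anytown School"]))
```

A purely dictionary-driven pass of this kind only removes PHI that has already been collected, which is why a manual review step remains necessary in a semi-automatic workflow.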

5.3 Comparative Analysis of Narratives

As a first step in applying the methods described herein, in particular health-related concept extraction, we carried out a computer-aided profiling of narratives (see footnote 4). This allows us to address the research question: what are the differences between the content of clinical narratives and patient narratives? Using previously developed methods (Chapter 3), we adopt an automated approach to profile, and subsequently examine, clinical and patient narratives.

We normalised the data for analysis and visualisation purposes. Specifically, we use the proportion-based normalised concept frequency (pncf). This statistic is used to analyse the occurrence of concepts in narratives. The pncf normalises the number of times a given concept c occurs in a given document set D (i.e., patient or clinical narratives); that is, it returns the relative proportion of a given concept over all possible concepts, and is defined as follows:

\[
\mathrm{pncf}(c, D) = \frac{\mathrm{Frequency}(c, D)}{\sum_{i} \mathrm{Frequency}(c_i, D)} \tag{5.1}
\]

where Frequency() returns the count of the given concept category/type c_i in document set D.

4 The pie and bubble charts are generated using the JavaScript library at http://www.highcharts.com/. The word clouds are generated using http://www.wordle.net/.
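The pncf statistic can be computed with a few lines of Python. The following is a minimal sketch, assuming that a document set is represented simply as a list with one concept-type label per extracted mention; the function name pncf and the example counts are hypothetical and are not taken from the case study data.

```python
from collections import Counter


def pncf(concept_mentions):
    """Map each concept type to its share of all concept mentions in a document set."""
    counts = Counter(concept_mentions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: count / total for concept, count in counts.items()}


# Hypothetical document set: one label per extracted mention.
mentions = (
    ["School functioning"] * 5
    + ["Emotional functioning"] * 3
    + ["Oncology treatment"] * 2
)
for concept, share in pncf(mentions).items():
    print(f"{concept}: {share:.1%}")
```

Because the denominator is the total over all concept types considered, the pncf values of a document set always sum to one, which is what makes the patient and clinical narrative profiles directly comparable despite their very different sizes.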

5.3.1 Aggregated analysis of narratives

First we present an aggregated analysis using a set of twenty clinical narratives (CNs) and corresponding patient narratives (PNs). The data have been normalised using the pncf statistic. The pie charts shown in Figure 5.4 and Figure 5.5 show a notable difference in concepts found in PNs versus CNs, respectively. For example, the most common concepts found in PNs are School functioning (24.5%), Emotional functioning (20.2%), Physical functioning (11.4%), and Cognitive functioning, while Endocrinology investigation (22.7%), Oncology treatment (15.8%), Endocrinology treatment (8.2%), and Physical functioning (7.3%) are proportionally the most common concept types appearing in CNs.

Figure 5.4: Proportion of concepts found in patient narratives

Figure 5.5: Proportion of concepts found in clinical narratives

An alternative data view is shown using a bubble chart. Figure 5.6 shows a clear difference in concepts between PNs and CNs. For example, traditional clinical concepts such as diagnosis, investigation and treatments are more prevalent in CNs. In contrast, HrQoL concepts are more common in PNs. While CNs contain all HrQoL concepts investigated, they occur proportionally less than in the PNs.

Figure 5.6: Aggregated concept analysis between patient and clinical narratives. This figure illustrates the proportion of concepts in clinical narratives (CNs) versus patient narratives (PNs).

Figure 5.7 illustrates a lexical analysis of frequent terms found in corresponding PNs and CNs. A clear contrast in terminology is apparent. PNs contain typical HrQoL mentions while CNs contain traditional clinical concept mentions.

(a) PNs: word cloud of the most common terms

(b) CNs: word cloud of the most common terms

The word clouds use term frequency to set font size and show the most frequent terms in patient (a) and clinical (b) narratives.

Figure 5.7: Lexical analysis using word clouds

The analyses presented thus far have shown a clear difference between traditional clinical concepts and HrQoL in PNs and CNs. This difference can be further examined in Table 5.4, which lists the top ranking concepts across narratives.

Table 5.4: Top occurring concepts in clinical and patient narratives
Top occurring concept types listed by document type: patient narratives versus clinical narratives.

        Concept type
Rank    Patient narrative                  Clinical narrative
1       School functioning                 Endocrinology investigation
2       Emotional functioning              Oncology treatment
3       Physical functioning and speech    Endocrinology treatment
4       Cognitive functioning              Oncology diagnosis
5       Social functioning                 Physical functioning
6       Other well-being                   Other well-being
7       Sensory and pain                   Endocrinology diagnosis
8       Home and family                    Emotional functioning
9       Activity                           Sensory and pain
10      Oncology treatment                 School functioning

Despite this apparent difference between clinical and patient narratives, an interesting observation is that both aggregated data sets contain all concepts investigated. However, clinical narratives contained, proportionally, more HrQoL concepts than patient narratives contained traditional clinical concepts. This observation is more apparent in Table 5.5 and Table 5.6.

Table 5.5: Proportion of traditional clinical concepts in patient narratives
This table shows the proportion of common clinical concepts in patient narratives.

Concept type                   Proportion %
Oncology diagnosis             0.96
Oncology investigation         1.03
Oncology treatment             2.67
Endocrinology diagnosis        0.14
Endocrinology investigation    0.14
Endocrinology treatment        0.21

Table 5.6: Proportion of HrQoL concepts in clinical narratives
This table shows the proportion of subjective concepts in clinical narratives.

Concept type                       Proportion %
Physical functioning and speech    6.73
Emotional functioning              6.07
Social functioning                 0.43
Cognitive functioning              2.60
Sensory and pain                   6.06
Other well-being                   6.62
School functioning                 4.95
Activity                           1.13
Home and family                    0.65

As the last analysis of the aggregated data, Table 5.7 and Table 5.8 list the most common UMLS semantic types occurring in PNs and CNs, respectively (see footnote 5). Note that the semantic types were not restricted to the case study specific concept types and considered the classification of all concepts investigated (i.e., Problem, Treatment, Test, and HrQoL). Moreover, similar to the preceding analysis, we observe a notable difference of semantic types between narrative types. For example, while there is considerable overlap (e.g., ‘Finding’, ‘Sign or Symptom’, ‘Therapeutic or Preventive Procedure’, etc.), there is a notable difference in the ranking of semantic types. For example, Table 5.7 shows that PNs contain a considerable proportion of concepts mapped to ‘Manufactured Object’, which largely corresponds to the HrQoL concept type School functioning (e.g., over 300 mentions of ‘school’), and ‘Mental Process’, which largely corresponds to Emotional and Cognitive functioning (e.g., containing mentions such as ‘mood’, ‘emotions’, ‘memory’, ‘concentration’ and so forth). On the other hand, Table 5.8 shows that CNs contain a considerable number of concepts mapped to ‘Therapeutic or Preventive Procedure’, which corresponds to Treatment concepts (approximately 3,800 mentions), and ‘Diagnostic Procedure’, which corresponds to Test concepts (approximately 2,900 mentions).

5Note that the pncf has been calculated for the top twenty semantic types whilst considering the set to be finite; i.e., c_i; i = 20.

Table 5.7: A semantic analysis of patient narratives

UMLS semantic type                     Frequency   pncf
Manufactured Object                    362         20.58
Finding                                258         14.67
Mental Process                         152         8.64
Qualitative Concept                    137         7.79
Population Group                       114         6.48
Sign or Symptom                        106         6.03
Therapeutic or Preventive Procedure    75          4.26
Medical Device                         68          3.87
Daily or Recreational Activity         63          3.58
Mental or Behavioral Dysfunction       54          3.07
Spatial Concept                        53          3.01
Occupational Activity                  48          2.73
Diagnostic Procedure                   46          2.62
Functional Concept                     44          2.50
Disease or Syndrome                    33          1.88
Physiologic Function                   32          1.82
Organism Function                      32          1.82
Neoplastic Process                     31          1.76
Body Part, Organ, or Organ Component   27          1.53
Professional or Occupational Group     24          1.36

Table 5.8: A semantic analysis of clinical narratives

UMLS semantic type                     Frequency   pncf
Therapeutic or Preventive Procedure    3,964       12.12
Diagnostic Procedure                   3,849       11.77
Finding                                3,739       11.43
Amino Acid, Peptide, or Protein        2,594       7.93
Disease or Syndrome                    2,319       7.09
Sign or Symptom                        2,181       6.67
Qualitative Concept                    1,856       5.68
Neoplastic Process                     1,801       5.51
Laboratory Procedure                   1,495       4.57
Functional Concept                     1,446       4.42
Manufactured Object                    1,344       4.11
Organic Chemical                       1,038       3.17
Body Part, Organ, or Organ Component   1,006       3.08
Mental Process                         735         2.25
Health Care Activity                   678         2.07
Quantitative Concept                   584         1.79
Organism Function                      543         1.66
Medical Device                         534         1.63
Hormone                                500         1.53
Spatial Concept                        498         1.52

5.3.2 Individual case analysis of narratives

In addition to the aggregated data analysed, we examine three patients in order to address any potential averaging bias. The three patients selected for analysis have undergone treatment over variable duration: patient A, patient B and patient C were diagnosed roughly 10, 20 and 30 years ago, respectively. Starting in alphabetic order, we examine the visualised data.

Patient A

The pie charts in Figure 5.8 and Figure 5.9 illustrate the proportion of concepts in patient and clinical narratives, respectively. As with the aggregated data, we can observe a difference in the prevalent concept types appearing in PNs versus CNs. For example, School functioning (20%), Emotional functioning (14%), Other well-being and Cognitive functioning (12%) are the most discussed in the PN. In contrast, Endocrinology investigation (27.4%), Oncology treatment (12.3%) and Other well-being (9.5%) are the most frequent concept types in CNs. Interestingly, Other well-being concepts appear in both PNs and CNs as the third most common concept type. The bubble chart (shown in Figure 5.10) reinforces the clear difference between concepts found in CNs versus PNs. For example, PNs contain proportionally more HrQoL concepts than CNs. In addition, the majority of traditional clinical concepts are entirely missing from PNs.

Figure 5.8: Patient A: proportion of concepts in the patient narrative

Figure 5.9: Patient A: proportion of concepts in the clinical narratives

Figure 5.10: Patient A: concept analysis between clinical and patient narratives. The bubble chart illustrates the proportion of concepts in clinical narratives (CNs) and patient narratives (PNs) of patient A.

Patient B

The contrast in concepts between narratives is once again notable. For example, Figure 5.11 shows that Physical functioning (40.6%), School functioning (15.6%), and Emotional functioning (14.1%) are the most discussed concepts in PNs. On the other hand, Figure 5.12 shows Endocrinology investigation (18.6%), Oncology treatment (11.1%), Endocrinology diagnosis (11%) and Endocrinology treatment (10.7%) as the most frequently appearing concept types in CNs. We can further examine the difference in content between PNs and CNs by examining Figure 5.13. As with the data examined previously, we can observe an obvious difference between CNs and PNs. For example, common clinical concepts (to the right of the vertical separation line) appear less often in PNs, with some concept types not appearing at all. In addition, interestingly, some HrQoL concepts, in particular Sensory and pain, appear quite frequently in CNs but not in PNs.

Figure 5.11: Patient B: proportion of concepts in the patient narratives

Figure 5.12: Patient B: proportion of concepts in the clinical narrative

Figure 5.13: Patient B: concept analysis between clinical and patient narratives. This bubble chart illustrates the proportion of concepts in clinical narratives (CNs) and patient narratives (PNs).

Patient C

The contrast in concepts between narratives is once again obvious, even more so than in the data examined previously. For example, Figure 5.14 shows that School functioning (30.3%), Emotional functioning (13.6%), and Physical functioning (12.1%) are the most discussed concepts in the PN. In contrast, Figure 5.15 shows Endocrinology investigation (31.5%), Endocrinology treatment (12.5%), and Oncology treatment (10.1%) as the most frequently appearing concept types in CNs. The bubble chart in Figure 5.16 provides an alternative view of these findings. Once again, a clear contrast can be seen: while CNs include all concepts (yet proportionately less than PNs), PNs largely do not include traditional clinical concepts (to the right of the vertical separation line). Indeed, traditional clinical concepts, except Oncology treatment, are entirely absent from PNs.

Figure 5.14: Patient C: proportion of concepts in the patient narrative

Figure 5.15: Patient C: proportion of concepts in the clinical narratives

Figure 5.16: Patient C: concept analysis between clinical and patient narratives. This bubble chart illustrates the proportion of concepts in clinical narratives (CNs) and patient narratives (PNs).

5.3.3 Discussion

HrQoL concepts are more prevalent in PNs, while traditional clinical concepts (e.g., medical problems, treatments, tests) are more common in CNs. In addition, while both aggregated sets of narratives contain all investigated concepts, CNs contain proportionally more HrQoL concepts than PNs contain traditional clinical concepts. Our analysis of patients A, B and C was generally consistent with the aggregated data analysis: a difference in the concepts that appear in clinical versus patient narratives is apparent. Hence, this addresses one of the main research questions in this thesis (Section 1.1): (3) What are the differences between the content of clinical narratives and patient narratives?

HrQoL has been shown to be a good indicator of intervention outcomes and a predictor of mortality, morbidity, and service needs (Taylor, 2000). In addition, given that HrQoL is not collected as part of standard clinical practice in the UK (e.g., in primary/secondary care), there may be benefits from an automated surveillance of HrQoL. Both the analysis of the case study and the results obtained from the IE methods are promising. At the least, this analysis has shown that further investigation into automated surveillance of HrQoL in clinical practice may be viable.

5.4 Extracting Patient Journeys

In this section we present the final component of the method developed to extract individual and aggregated patient journeys. Figure 5.17 illustrates a hypothetical individual or patient-level clinical pathway. We note that for this case study, patient journeys are reconstructed strictly from unstructured clinical data.

Figure 5.17: A hypothetical patient journey. An example of a clinical pathway consisting of four bins, T0 (Bin 1), [0,6) months (Bin 2), [6,12) months (Bin 3) and [12,18) months (Bin 4), each containing the most characteristic concepts within the given 6 month interval. This pathway represents an abstract care process or ‘patient journey’: the patient was initially diagnosed with ‘medulloblastoma’ and subsequently had ‘surgery’ and ‘radiotherapy’ as the initial set of treatments; subsequently, this patient received chemotherapy; after 12 months, initial oncology treatment was stopped and the patient underwent a period of surveillance (MRI scans) to assess the impact of treatment.

The initial node of each pathway represents the extracted diagnosis; subsequently, the pathway is reconstructed in bins of a given time interval. The exact definition of an interval in m months, given the start of the interval (s_m) and the end of the interval (e_m), is [s_m, e_m), i.e., s_m ≤ interval < e_m.
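The half-open interval definition can be illustrated with a minimal Python sketch. The function names and the default bin width of six months are assumptions made for illustration only; they are not taken from the implementation described in this thesis.

```python
from typing import Tuple


def bin_index(month: float, width: int = 6) -> int:
    """Return the 0-based index of the bin [s_m, e_m) that contains `month`.

    `month` is the time elapsed since time = 0, expressed in months.
    """
    if month < 0:
        raise ValueError("month must be non-negative")
    return int(month // width)


def bin_interval(index: int, width: int = 6) -> Tuple[int, int]:
    """Return (s_m, e_m) for a given bin index."""
    start = index * width
    return start, start + width


def in_interval(month: float, start: float, end: float) -> bool:
    """Half-open membership test: s_m <= month < e_m."""
    return start <= month < end


# An event at exactly 6 months belongs to [6, 12), not to [0, 6).
print(bin_interval(bin_index(6.0)))
```

Because the upper bound is exclusive, every time point falls into exactly one bin, so no event is counted twice at a bin boundary.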

5.4.1 Methods

Following the methods described in the previous chapters (i.e., clinical NER, TERN and TLINK), an additional set of methods was required to organise concepts into ‘pathways’. Figure 5.18 presents the overall workflow to extract patient journeys.

Figure 5.18: Pathway extraction architecture. The pathway pipeline takes as input extracted TLINKs (or chronologically ordered events/NEs), and consists of three main methods: (i) ‘pre-processing’, (ii) ‘PathCluster’, and (iii) ‘PathVisualisation’.

Pre-processing

The pre-processing method for pathway extraction is made up of three sub-components:

1. TLINK filter. This component filters the extracted TLINKs: those that do not contain a TE participant, i.e., those whose EVENTs cannot be directly anchored onto a timeline (TLINKs between EVENTs), are discarded. The TLINKs considered are therefore EVENT-TE links and, hence, predominantly DocTimeRel. In addition, only case study specific EVENT types are considered, with Other ignored (see Table 5.1).

2. Inter-document co-reference resolution. This method addresses the problem of clinical EVENT co-reference resolution at the patient level (i.e., at the inter-document level, as opposed to the common document-level or intra-document co-reference resolution task). The motivation for this component is to identify events that are referenced multiple times across a patient timeline but occur only once. Three clinical EVENT types were considered: (i) Oncology diagnosis, (ii) Oncology treatment, and (iii) Endocrinology diagnosis. These events predominantly occurred on or near the document reference date (roughly the date of consultation). A notable example that differed slightly from this pattern is oncology investigations (e.g., surveillance or diagnostic scans), which typically occurred within 4 to 8 weeks of a mention in a clinical narrative.

A manual analysis of the remaining NE types, such as Oncology investigation and, in particular, Endocrinology investigation and Endocrinology treatment, revealed that almost no inter-document co-reference resolution was necessary. These events were typically not referenced across documents. For example, we observed a number of common events that occurred on the DRT, e.g., routine clinical measurements such as weight, height, blood pressure, and similar. The DocTimeRels of these events were therefore all considered as Overlap.

Inter-document co-reference resolution proceeds as follows:

(i) An initial set of commonly identified lexical co-reference terms (or ‘seed terms’) is manually curated for six relevant and recurring event groups (i.e., ‘medulloblastoma’, ‘surgery’, ‘radiotherapy’, ‘chemotherapy’, ‘growth hormone deficiency’, and ‘growth hormone treatment’). The complete list of terms for each corresponding event group/dictionary is given in Appendix E, Table E.3.

(ii) The first occurring date of a given event group is identified (Table E.3 lists the event groups and their corresponding sets of terms). This is a trivial computational task given that TLINKs have already been extracted in a preceding step. For example, we determine when ‘surgery’ was first referenced by temporally sorting all TLINKs that include ‘surgery’. SoftTFIDF, with a threshold of 0.7, is applied to each event group to account for lexical variability.

(iii) The first occurring date identified in the previous step is propagated across co-reference terms, with the exception of negated EVENTs and those that contain a TLINK of type ‘overlap’.

3. Timeline normaliser. The aim of this component is to temporally normalise events on a given patient timeline. The diagnosis or the first occurring patient record, whichever comes first, is considered the beginning of the timeline: time = 0.

This component normalises each temporally anchored event on a patient timeline to a set of positive real numbers representing: (a) week, (b) month, (c) quarter year, (d) half year, and (e) year.
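The three pre-processing steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this thesis: the `TLink` record, the two seed-term groups, and the average unit lengths used by `normalise` are hypothetical, and the character-level ratio from Python's `difflib` stands in for SoftTFIDF (only the 0.7 threshold is taken from the text).

```python
from dataclasses import dataclass, replace
from datetime import date
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TLink:
    mention: str              # EVENT mention text
    event_type: str           # case study type, e.g. "Oncology treatment"
    anchor: Optional[date]    # date of the TE participant; None for EVENT-EVENT links
    relation: str = "before"  # TLINK type, e.g. "before", "after", "overlap"
    negated: bool = False


# Hypothetical seed terms for two event groups (the real lists are in Table E.3).
SEED_TERMS: Dict[str, List[str]] = {
    "surgery": ["surgery", "resection"],
    "radiotherapy": ["radiotherapy", "radio therapy"],
}


def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    # Stand-in for SoftTFIDF: difflib's character-level similarity ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def event_group(mention: str) -> Optional[str]:
    # Return the first event group with a seed term similar to the mention.
    for group, terms in SEED_TERMS.items():
        if any(similar(mention, term) for term in terms):
            return group
    return None


def filter_tlinks(links: List[TLink]) -> List[TLink]:
    # Step 1: keep links anchored to a temporal expression; ignore type Other.
    return [l for l in links if l.anchor is not None and l.event_type != "Other"]


def resolve_coreferences(links: List[TLink]) -> List[TLink]:
    # Step 2 (expects filtered links): find the first date of each event group,
    # then propagate it to all non-negated, non-overlap mentions of that group.
    first_seen: Dict[str, date] = {}
    for link in sorted(links, key=lambda l: l.anchor):
        group = event_group(link.mention)
        if group is not None:
            first_seen.setdefault(group, link.anchor)
    resolved = []
    for link in links:
        group = event_group(link.mention)
        if group is not None and not link.negated and link.relation != "overlap":
            link = replace(link, anchor=first_seen[group])
        resolved.append(link)
    return resolved


def normalise(start: date, when: date) -> Dict[str, float]:
    # Step 3: time elapsed since time = 0, using assumed average unit lengths.
    days = (when - start).days
    return {"week": days / 7, "month": days / 30.4375, "quarter": days / 91.3125,
            "half_year": days / 182.625, "year": days / 365.25}
```

In this sketch a later, non-negated mention of ‘surgery’ is moved back to the date of its first mention, whereas a mention with an ‘overlap’ TLINK keeps its own date.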

PathCluster

The PathCluster organises extracted concepts into similar groups based on features such as time interval (defined using the output of the timeline normaliser), cluster confidence (described below), and concept (i.e., category, type, and lexically normalised mention). This component extracts characterised ‘processes’, i.e., commonly occurring events at a given time interval. The PathCluster extracts patient journeys in a structured text format. The PathCluster takes three related parameters: (a) a set of cluster confidences, each a real number in the interval [0,1], for (i) EVENTs (i.e., Problem, Treatment, Test), (ii) types (see Table 5.1), and (iii) lexically normalised event mentions, which determine the inclusion threshold (this metric is described below); (b) the number of bins; and (c) the time interval of each bin, specified in months.

Confidence

The cluster confidence is frequency and time based: the more frequently a particular concept (EVENT, type, or lexically normalised mention) occurs within a specified time interval, the higher the confidence. In other words, the confidence has a strong positive correlation with the frequency of a particular concept at a given time interval. Formally, the confidence for a given concept c at the time interval t is calculated by Equation 5.2:

\[
\mathrm{Confidence}(c, t) = \frac{\mathrm{Frequency}(c)}{\max_{i} \mathrm{Frequency}(c_i, t)} \tag{5.2}
\]

For example, consider that we retrieve the following concept types and their frequencies for some time interval (listed in Table 5.9).

Table 5.9: An example list of concepts and their PathCluster confidence
This hypothetical example illustrates the cluster confidence calculated using Equation 5.2.

EVENT type                    Frequency   Confidence
Oncology investigation        155         1.00
Endocrinology investigation   75          0.48
Oncology treatment            55          0.35
Endocrinology treatment       25          0.16
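The confidence calculation can be reproduced with a short Python sketch using the frequencies from Table 5.9. The function names are illustrative, and the use of an inclusive comparison (>=) for the inclusion threshold is an assumption; the text does not state whether the threshold is inclusive.

```python
from typing import Dict, List


def cluster_confidence(frequencies: Dict[str, int]) -> Dict[str, float]:
    """Confidence of each concept within one time interval (Equation 5.2):
    its frequency divided by the highest frequency in that interval."""
    if not frequencies:
        return {}
    peak = max(frequencies.values())
    if peak == 0:
        return {concept: 0.0 for concept in frequencies}
    return {concept: freq / peak for concept, freq in frequencies.items()}


def select(frequencies: Dict[str, int], threshold: float) -> List[str]:
    """Concepts whose confidence reaches the inclusion threshold
    (inclusive comparison assumed)."""
    confidences = cluster_confidence(frequencies)
    return [c for c, conf in confidences.items() if conf >= threshold]


# Frequencies for one time interval, taken from Table 5.9.
table_5_9 = {
    "Oncology investigation": 155,
    "Endocrinology investigation": 75,
    "Oncology treatment": 55,
    "Endocrinology treatment": 25,
}

for concept, conf in cluster_confidence(table_5_9).items():
    print(f"{concept}: {conf:.2f}")
```

With a threshold of 0.3, for instance, `select(table_5_9, 0.3)` would keep the first three concept types and drop Endocrinology treatment.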

PathVisualisation

The aim of this component is to graphically visualise extracted patient journeys. This component uses the graph description language Dot6 to automatically generate workflows from the results obtained from the PathCluster.7,8
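As an illustration of how binned concepts can be serialised into the Dot language, consider the following minimal Python sketch. It is not the Java wrapper used in this thesis; the `to_dot` function, the node naming scheme (`bin0`, `bin1`, ...) and the layout attributes are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# One bin: (label, {concept type: [lexically normalised mentions]}).
Bin = Tuple[str, Dict[str, List[str]]]


def escape(text: str) -> str:
    # Escape backslashes and double quotes for use inside a quoted Dot string.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_dot(bins: List[Bin]) -> str:
    """Render chronologically ordered bins as a left-to-right chain of boxes."""
    lines = ["digraph patient_journey {", "  rankdir=LR;", "  node [shape=box];"]
    for i, (label, concepts) in enumerate(bins):
        rows = [label] + [ctype + ": " + ", ".join(mentions)
                          for ctype, mentions in concepts.items()]
        # "\n" inside a Dot label is the escape sequence for a line break.
        text = "\\n".join(escape(row) for row in rows)
        lines.append('  bin%d [label="%s"];' % (i, text))
        if i > 0:
            lines.append("  bin%d -> bin%d;" % (i - 1, i))
    lines.append("}")
    return "\n".join(lines)


journey = [
    ("T0", {"oncology diagnosis": ["medulloblastoma"]}),
    ("[0,6) months", {"oncology treatment": ["surgery", "radiotherapy"]}),
    ("[6,12) months", {"oncology treatment": ["chemotherapy"]}),
]
print(to_dot(journey))
```

If the output is saved to a file such as `journey.dot`, it can be compiled with the Graphviz command `dot -Tpdf journey.dot -o journey.pdf`.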

5.4.2 Evaluation

A qualitative approach is adopted to validate the methods developed to organise and visualise patient journeys. The integral IE methods developed to extract (Chapter 3) and order (Chapter 4) clinical concepts have, however, been validated using customary quantitative approaches. We are particularly interested in validating whether the methods described thus far can feasibly extract implemented care pathways. Specifically: are we able to capture the general clinical care trends, such as treatments and investigations that are adopted in clinical practice, including relevant case study specific problems (i.e., Oncology diagnosis and Endocrinology diagnosis)?

The validation is two-fold: at (1) the individual level and (2) the aggregated or cohort level. Both ‘folds’ follow a similar approach: a subset (10 PNs) of the case study dataset was used as a development set (to develop and optimise our methods), and an unseen subset (3 PNs) was used for validation (and is analysed herein).

6http://www.graphviz.org/content/dot-language.
7The PathVisualisation component is a specialised Java wrapper for the Dot language. The output of this component is a graph description file (.dot) and its corresponding compiled portable document format (PDF).
8The text preceding the parentheses contained within the automatically generated workflows or their nodes is manually added for readability.

The validation of methods developed to extract pathways involved an initial manual reconstruction of individual patient journeys. This was necessary in order to develop a ‘gold-standard’ to validate our approach. The manual reconstruction involved a high-level summary of common clinical concepts at given time intervals, and a visual reconstruction of the ‘pathway’. More specifically, each pathway for each given patient was constructed and consisted of:

• important concepts such as tests (i.e., Oncology investigation and Endocrinology investigation), treatments (i.e., Oncology treatment and Endocrinology treatment), and problems (i.e., Oncology diagnosis and Endocrinology diagnosis) characterising the patients’ pathways;

• chronologically ordered concept types grouped into bins or by time interval.

We decided to evaluate only the first 42 months of each patient pathway (test cases), as this initial period typically contains the most transitions between various investigations, diagnoses and treatments. Hence, this should enable the evaluation of our approach. In addition, generated pathways are reconstructed at six month intervals, which we hypothesise to be reasonable considering that treatment and investigation for the case study are typically adopted over longer periods (e.g., initial oncology treatment stretches between 6 and 18 months; Oncology investigations are initially done at about 3 month intervals during initial treatment and can later stretch to annual or greater intervals).

5.4.3 Individual patient journeys

To validate individual clinical pathways, we selected three patients or test cases (A, B and C) at random.9 The patient journey for each test case is given in six month intervals/bins, using confidences of 0.01 and 0 for EVENT type and normalised EVENT mention, respectively. These low confidence scores were necessary because the lexical co-reference resolution method resulted in the over-representation of Oncology treatment during the initial period of any given pathway. This bias arises from the consistent reference to Oncology treatment in nearly every clinical note in all patient narratives. Hence, the greater the longitudinal data, the greater the bias.

9These do not correspond to patients in Section 5.3.

Test case A

Table 5.10: Test case A: a tabular view of high-level processes This table describes the manual and automatically generated patient journey for test case A. Most common EVENT types are listed by the given bin to illustrate the high-level processes. In addition, Oncology diagnosis and Endocrinology diagnosis have also been included. The visually illustrated pathway is given in Figure 5.19.

Interval   Manual                        Automatic
           Oncology diagnosis            Oncology diagnosis
[0,6)      Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[6,12)     Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[12,18)    Oncology investigation        Oncology investigation
[18,24)    Oncology investigation        Oncology investigation
[24,30)    Endocrinology investigation   Endocrinology investigation
           Oncology investigation        Oncology investigation
[30,36)    Endocrinology investigation   Endocrinology investigation
           Endocrinology diagnosis       Endocrinology diagnosis
           Endocrinology treatment       Endocrinology treatment
[36,42)    Endocrinology investigation   Endocrinology investigation
           Endocrinology treatment       Endocrinology treatment
           Oncology investigation        Oncology investigation

Comparing the manually and automatically generated patient journeys for test case A, using Table 5.10 and Figure 5.19, we can clearly observe that the general trend of processes is captured by the automated method. Considering the high-level care processes, as listed in Table 5.10, we can use customary IE (precision-recall) metrics to calculate evaluation scores using the manually generated pathway as a ‘gold standard’. Consequently, we obtain a precision (= 15/(15+0)) and recall (= 15/(15+0)) of 100%, which are notably good results.

However, when analysing the pathway in detail (Figure 5.19), we observe that, while the general trend in terms of the care processes followed is captured, the automated approach has some shortcomings. For example, there are some discrepancies in the extracted EVENT mentions. Specifically, comparing the second bins ([6,12)) of Figure 5.19 (a) and (b), the Oncology treatment processes do not match. While both the manually and automatically generated pathways indicate ‘chemotherapy’ treatment, the semantics (i.e., the specific chemotherapy drug) given by the automated method are incorrect. This is an error inherited from the concept extraction, TIE and/or co-reference resolution methods. These types of errors can be expected given the findings from the evaluation of the preceding methods (see Chapters 3 and 4).

The automatically generated pathway can also be re-evaluated quantitatively by considering a process to be a correct match if and only if the content of the process strictly matches the gold-standard (we call this evaluation method strict matching hereafter). Since we observe a total of 3 mismatches, this results in a precision (= 12/(12+3)) and recall (= 12/(12+3)) of 80% when analysing Figure 5.19.
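The two scores above can be recomputed with a small Python sketch. Representing a pathway as a set of (interval, process) pairs is an assumption made for illustration, and the placeholder identifiers below are hypothetical; only the counts (15 processes, 3 strict mismatches) are taken from the text. Under strict matching, a mismatched process counts both as a false positive (the automatic content is wrong) and as a false negative (the gold content is missed).

```python
from typing import Set, Tuple

Process = Tuple[str, str]  # (interval or identifier, process content)


def precision_recall(gold: Set[Process], predicted: Set[Process]) -> Tuple[float, float]:
    """Precision and recall of predicted processes against the gold standard."""
    tp = len(gold & predicted)   # processes present in both pathways
    fp = len(predicted - gold)   # predicted but absent from the gold standard
    fn = len(gold - predicted)   # in the gold standard but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# High-level comparison: all 15 processes agree.
high_level = {("process %d" % i, "type matches") for i in range(15)}
print(precision_recall(high_level, set(high_level)))

# Strict matching: 12 processes agree and 3 differ in content.
shared = {("process %d" % i, "same content") for i in range(12)}
gold = shared | {("process %d" % i, "gold content") for i in range(12, 15)}
auto = shared | {("process %d" % i, "automatic content") for i in range(12, 15)}
print(precision_recall(gold, auto))
```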
Figure 5.19: Test case A: reconstructing the patient journey. (a) Manually reconstructed pathway; (b) automatically reconstructed pathway. The apparent trend of high-level process types, including diagnoses, is identical between (a) and (b).

Test case B

Similarly, when comparing the high-level care processes between the manually and automatically generated pathways for test case B (using Table 5.11 and Figure 5.20), we observe a perfect match. A quantitative analysis of the extracted patient journey reveals precision and recall scores of 100%. On the other hand, when using the strict matching criteria, we obtain one mismatch (see Figure 5.20: interval [0,6), Oncology treatment) and therefore a precision and recall of 94%. Arguably, however, even under strict matching we should obtain 100%: the ‘packer regimen’ for test case B actually refers to a mix of chemotherapy drugs (i.e., vincristine, cisplatin and ccnu), and this is captured by the automated pathway (see Figure 5.20). Hence, there is no mismatch.

Table 5.11: Test case B: a tabular view of high-level processes
This table describes the manually and automatically generated clinical pathway for test case B. The most common concept types are listed by bin to illustrate the high-level processes. In addition, Oncology diagnosis and Endocrinology diagnosis have been included. The visually illustrated pathway is given in Figure 5.20.

Interval   Manual                        Automatic
           Oncology diagnosis            Oncology diagnosis
[0,6)      Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[6,12)     Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[12,18)    Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[18,24)    Oncology investigation        Oncology investigation
[24,30)    Oncology investigation        Oncology investigation
[30,36)    Endocrinology investigation   Endocrinology investigation
           Endocrinology diagnosis       Endocrinology diagnosis
           Endocrinology treatment       Endocrinology treatment
           Oncology investigation        Oncology investigation
[36,42)    Endocrinology investigation   Endocrinology investigation
           Endocrinology treatment       Endocrinology treatment
           Oncology investigation        Oncology investigation

Figure 5.20: Test case B: reconstructing the clinical pathway. (a) Manually reconstructed pathway; (b) automatically reconstructed pathway. The apparent trend of high-level processes, including diagnoses, is identical between (a) and (b).

Test case C

From a qualitative perspective, we can observe that the majority of processes have been captured accurately. However, an error in the automatically generated pathway occurs at the interval [30,36): Endocrinology treatment appears there but does not appear in the gold-standard (Table 5.12). A closer analysis of the clinical narrative showed that a specific Endocrinology treatment (i.e., growth hormone treatment) was mentioned several times in [30,36), but, as the gold standard (see Table 5.12 and Figure 5.21) accurately shows, the actual endocrinology treatment did not commence until [36,42). This analysis confirms that the errors are due to the preceding IE components and, specifically, the adopted negation component.

Further, we also evaluate the automatically generated pathway quantitatively. Given the one mismatch generated by the Endocrinology treatment in bin [30,36), we obtain 94% precision/recall. However, when considering the strict matching criteria, we obtain 5 mismatches and therefore a precision of 72% and a recall of 76% (or 74% as the average of precision and recall). This is a considerably lower strict score than for the previous test cases. However, some of these mismatches are in fact derived from errors recorded in the clinical narratives. For example, considering the bins [6,12) and [12,18) of the automatically reconstructed pathway, ‘packer regimen’ had in fact been recorded multiple times in the narratives, whilst closer examination of the actual treatment revealed that only ‘vincristine’ had been used for this particular patient.

Table 5.12: Test case C: a tabular view of high-level processes. The visually illustrated pathway is given in Figure 5.21.

Interval   Manual                        Automatic
           Oncology diagnosis            Oncology diagnosis
[0,6)      Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[6,12)     Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[12,18)    Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[18,24)    Endocrinology investigation   Endocrinology investigation
           Oncology investigation        Oncology investigation
[24,30)    Endocrinology investigation   Endocrinology investigation
           Oncology investigation        Oncology investigation
[30,36)    Endocrinology investigation   Endocrinology investigation
           Endocrinology diagnosis       Endocrinology diagnosis
           Oncology investigation        Oncology investigation
           -                             Endocrinology treatment
[36,42)    Endocrinology investigation   Endocrinology investigation
           Endocrinology treatment       Endocrinology treatment
           Oncology investigation        Oncology investigation

Figure 5.21: Test case C: reconstructing the clinical pathway
Panels: (a) manually reconstructed pathway; (b) automatically reconstructed pathway. Timeline bins: TimeLine_0 (oncology diagnosis: medulloblastoma), followed by the intervals [0-6) to [36-42) months. The apparent trend of high-level processes, including diagnoses, is largely identical between panels (a) and (b).

5.4.4 Aggregated patient journeys

For aggregated patient journeys, EVENTs (i.e., category or type) that do not occur at least once in each patient pathway at a given time interval are excluded from the combined pathway at that time interval. For example, if two patient pathways are combined and, say, at the time interval of [6,12) months one of the patients does not have any Endocrinology treatment, then that concept is excluded from their combined pathway at the time interval [6,12) months. A further requirement for inclusion is the confidence threshold. Note that for aggregated pathways, the confidence is first calculated at the patient level, then summed across patients and divided by the number of aggregated patients.

We have generated an aggregated pathway of the described test cases A, B and C. This pathway is reconstructed based on overlapping processes at any given time interval. The following combined pathway was generated with confidence thresholds of 0.2 and 0.01 for concept type and normalised lexical events respectively. Table 5.13 shows the expected and automatically generated processes by time interval. The visually illustrated pathway is given in Figure 5.22.
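The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `aggregate_pathways`, the dictionary layout, the use of `>=` for the threshold comparison, and the example confidence values are all hypothetical. Only the rule itself (a concept must occur in every patient pathway at the interval, and the averaged patient-level confidence must pass the threshold) and the threshold of 0.2 for concept types come from the text.

```python
def aggregate_pathways(patient_pathways, threshold):
    """Combine per-patient pathways into one aggregated pathway.

    patient_pathways: one dict per patient, mapping an interval (start, end)
    to a dict {concept_type: patient_level_confidence}. This layout is an
    assumption made for the sketch.
    """
    if not patient_pathways:
        return {}
    n_patients = len(patient_pathways)
    intervals = set().union(*(p.keys() for p in patient_pathways))
    combined = {}
    for interval in sorted(intervals):
        per_patient = [p.get(interval, {}) for p in patient_pathways]
        # Keep only concepts that occur in every patient's pathway at this interval.
        shared = set(per_patient[0]).intersection(*per_patient[1:])
        kept = {}
        for concept in sorted(shared):
            # Patient-level confidences are summed and divided by the number of patients.
            mean_conf = sum(c[concept] for c in per_patient) / n_patients
            if mean_conf >= threshold:  # assumed to be an inclusive threshold
                kept[concept] = mean_conf
        if kept:
            combined[interval] = kept
    return combined


# Hypothetical confidences for two patients in the interval [6,12) months.
patient_1 = {(6, 12): {"Oncology treatment": 0.5,
                       "Oncology investigation": 0.2,
                       "Endocrinology treatment": 0.3}}
patient_2 = {(6, 12): {"Oncology treatment": 0.3,
                       "Oncology investigation": 0.1}}

combined = aggregate_pathways([patient_1, patient_2], threshold=0.2)
print(sorted(combined[(6, 12)]))  # ['Oncology treatment']
```

In this example, Endocrinology treatment is dropped because only one patient has it, and Oncology investigation is dropped because its averaged confidence (0.15) falls below the threshold.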

Table 5.13: Aggregated patient pathway: a tabular view of high-level processes. This table describes the expected and automatically generated clinical pathway for the aggregated pathway that includes test cases A, B and C. The most common concept types are listed by bin to illustrate the high-level processes. In addition, Oncology diagnosis and Endocrinology diagnosis have also been included.

Interval   Expected                      Automatic
           Oncology diagnosis            Oncology diagnosis
[0,6)      Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[6,12)     Oncology treatment            Oncology treatment
           Oncology investigation        Oncology investigation
[12,18)    Oncology investigation        Oncology investigation
[18,24)    Oncology investigation        Oncology investigation
[24,30)    Oncology investigation        Oncology investigation
[30,36)    Endocrinology investigation   Endocrinology investigation
           Endocrinology diagnosis       Endocrinology diagnosis
           -                             Endocrinology treatment
[36,42)    Endocrinology investigation   Endocrinology investigation
           Endocrinology treatment       Endocrinology treatment
           Oncology investigation        Oncology investigation

Moreover, given the aforementioned discrepancy in the automatically generated pathway for test case C (i.e., the inclusion of Endocrinology treatment at the interval [30,36)), the automatically generated aggregated pathway has also erroneously included this process, since it is part of the pathways of test cases A and B as well. Hence, taking into account this single mismatch, the calculated extraction score amounts to 93% precision and 100% recall (or 97% in average P&R).

Figure 5.22: An aggregated patient pathway
This figure shows the aggregate patient pathway of test cases A, B and C. The aggregated pathway only shows overlapping processes, but the union of normalised lexical events from all test cases. Timeline bins: TimeLine_0 (oncology diagnosis: medulloblastoma), followed by the intervals [0-6) to [36-42) months.
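The precision and recall quoted for the aggregated pathway can be reproduced from Table 5.13 with a short sketch. It assumes that each table row is scored as an (interval, concept type) pair and that a match is simple set membership; the actual scoring procedure may differ in detail. The function name `precision_recall` and the label `TimeLine_0` for the diagnosis row are illustrative choices.

```python
def precision_recall(expected, automatic):
    """Precision and recall over sets of (interval, concept type) pairs."""
    true_positives = len(expected & automatic)
    return true_positives / len(automatic), true_positives / len(expected)


# Expected processes, transcribed from Table 5.13.
expected = {
    ("TimeLine_0", "Oncology diagnosis"),
    ("[0,6)", "Oncology treatment"), ("[0,6)", "Oncology investigation"),
    ("[6,12)", "Oncology treatment"), ("[6,12)", "Oncology investigation"),
    ("[12,18)", "Oncology investigation"),
    ("[18,24)", "Oncology investigation"),
    ("[24,30)", "Oncology investigation"),
    ("[30,36)", "Endocrinology investigation"),
    ("[30,36)", "Endocrinology diagnosis"),
    ("[36,42)", "Endocrinology investigation"),
    ("[36,42)", "Endocrinology treatment"),
    ("[36,42)", "Oncology investigation"),
}

# The automatic pathway contains one additional, erroneous process.
automatic = expected | {("[30,36)", "Endocrinology treatment")}

precision, recall = precision_recall(expected, automatic)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=93% recall=100%
```

With 13 expected processes and 14 automatically generated ones, the single false positive lowers precision to 13/14 while recall stays at 13/13.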

5.4.5 Discussion

An apparent finding from our evaluation of automatically reconstructed (individual and aggregated) patient pathways is that we are able to capture major processes followed in clinical practice. Hence, this addresses two of our main research questions (Section 1.1): (1) Can TM techniques be used to reconstruct patient journeys from unstructured clinical narratives?; (2) Do narratives contain enough information to reconstruct patient pathways?

We have demonstrated the potential of TM in that we can reconstruct patient journeys by capturing major trends or processes followed in clinical practice. For example, we can observe that the three test cases analysed were all diagnosed with medulloblastoma; during an initial period ranging from 12 to 18 months they underwent oncology treatment which included initial surgery and radiotherapy, and continued chemotherapy. In addition, around the time of diagnosis and during the whole oncology treatment, all cases underwent continuous oncology investigations (i.e., CT and MRI scans). These investigations were conducted at least bi-annually until the 42nd month analysed. In addition, all test cases analysed underwent various in-depth endocrinology investigations and, interestingly, all were diagnosed with ‘growth hormone deficiency’ around 30 to 36 months after diagnosis.

The pathways described herein were also analysed with Dr Martin McCabe (Senior Lecturer in Paediatric Oncology and Honorary Consultant Paediatric Oncologist), who confirmed the described findings as ‘expected’ characterisations of patient/treatment journeys. Specifically, the clinical processes followed in terms of oncology treatment and investigation, as well as the commencement of endocrinology investigation and treatment, were reportedly a common pattern within the explored cohort. Hence, these trends had been satisfactorily captured by our automated method. In addition, the discrepancy discussed in this chapter regarding test case A’s Oncology treatment (i.e., specific chemotherapy drugs) was also noted during discussions with the clinician. He stated that it would be useful to extract the implementation of specific treatment protocols that describe, e.g., chemotherapy cycles, drug, dosage, and so forth. This highlights the need for future work to develop methods to extract fine-grained information in patient journeys. He also suggested that including HrQoL concepts as part of patient journeys would be an interesting way of analysing such concepts.

The errors observed during our analysis of extracted patient journeys were somewhat expected given the challenges highlighted with preceding IE methods such as the concept extraction (Chapter 3), TERN, and TLINK extraction (Chapter 4) methods.

In addition, another probable source of errors is the data quality and content. Specifically, a significant portion of clinical narratives had to be discarded due to quality issues (see Section 5.2). In addition, detailed information regarding implemented treatment was sometimes incomplete in the data type used (i.e., clinical narratives); this seems to be because the majority of study participants had been treated at two or more hospitals. Moreover, another source of errors is our simplistic approach to inter-document co-reference resolution, which is a challenging problem and an open research question.

Specific examples of errors propagated from preceding IE methods include the semantics of test case A’s Oncology treatment and test case C’s commencement of Endocrinology treatment, which are related to the simplistic approach developed for lexical co-reference resolution and the adopted out-of-the-box negation method (the ConText algorithm), respectively. The performance of the concept extraction and TIE methods are other notable factors.

While our work demonstrates the feasibility of reconstructing patient journeys from unstructured clinical narratives, future work will need to address the specific issues identified. For example, future work could consider (i) the use of both structured and unstructured data sources so that each complements the shortcomings (e.g., missing/incomplete information) of the other; (ii) a focused case study such as extracting specific treatment protocols, e.g., the packer regimen; and (iii) in addition to major trends/processes, methods to incorporate decision points and alternative flows in patient journeys.

5.5 Visualising Patient Timelines

A possible application of the methods developed in this thesis to extract clinical concepts is the visualisation of patient (or clinical) timelines as part of an interactive EHR/CDSS. This would include chronological visualisation of important clinical events using unstructured data. This type of application would work as a visual and user-friendly interface on top of the structured and unstructured clinical records available in an EHR system. In fact, projects enabling patient timeline visualisation from structured data sources have already commenced in many hospitals, including The Christie Hospital NHS Foundation Trust.10

10 This information was derived from personal communication with Mr. Matthew Barker-Hewitt, Head of Information, The Christie NHS Foundation Trust.

For instance, a common practice prior to a doctor-patient interaction (i.e., before an appointment or consultation) in primary and secondary care in the UK involves the review of the patient’s longitudinal records by the relevant physician. These records may span decades in primary care, or for patients with severe/chronic diseases in secondary care. Such initial reviews of hospital records are often a necessity in order for physicians to obtain both an overview and specific knowledge of the patient’s relevant medical history, including past and present medical problems, treatments, and tests. This type of information is paramount for decision making with regard to continued care. Hence, adopting combined information retrieval (i.e., search) and visualisation techniques as an interface to clinical records can aid both quicker and more accurate review of patient timelines. To support such a task, a ‘clinical dashboard’ using the results obtained from the IE components described in Chapters 3 and 4 was described in (Tiranardvanich, 2014). The clinical dashboard enabled users to chronologically visualise clinically important events, organised by concept categories such as problem, treatment and test (see a screenshot of the ‘clinical dashboard’ in Figure 5.23).

Figure 5.23: Clinical dashboard

As shown, the ‘clinical dashboard’ organises patient timelines by concepts: problem, treatment, test and health-related quality of life. In addition to the current ‘summary’ view, more specialised, case-study-specific types such as ‘oncology investigation’, ‘oncology diagnosis’ and so forth are available as ‘tab’ views.

Each patient timeline is made up of coloured squares, each representing an event and its membership of a given concept category, in this case medical test (Figure 5.23). These squares are clickable and contain further detailed information (see Figure 5.24). In addition, the timeline interface enables the user to zoom in and out. This function provides either a detailed, fine-grained view (e.g., to find the exact date (DD-MM-YYYY) on which a particular event occurred) or a coarse-grained overview of the timeline.

Figure 5.24: Clinical dashboard: event view

This figure shows an event view (i.e., once an event on the timeline has been clicked) and provides detailed information such as the UMLS classification (semantic type and group), temporal information such as the date of reference, and document information (the specific document from which the information was extracted).

A detailed qualitative evaluation of the clinical dashboard confirmed what had been anticipated from the IE results, in particular for the TIE tasks, which had proven to be challenging at the time of writing of this thesis. Whilst the evaluation of the clinical dashboard received an extremely positive review regarding its potential use and application in clinical practice, the accuracy of the presented data was of concern. This is an important question irrespective of the quality of the data extracted. Given the nature of the application in clinical settings, inaccuracy of information can have severe repercussions.

Moreover, as previously noted, TIE, and in particular TE normalisation (i.e., normalising expressions to a standardised format such as ISO-8601) and temporal link extraction, remain challenging IE tasks. These challenges, coupled with the clinical dashboard evaluation, indicate that widely scoped applications are probably not yet mature enough for real-world clinical use. However, our findings suggest that more narrowly scoped tasks such as temporal ordering (and visualisation) of, e.g., specific (fine-grained) concept types such as oncology treatment, investigation or diagnosis are more practical. In addition, real-world ‘clinical dashboard’ applications need to combine structured and unstructured data as complementary data sources, which will help make automated extraction of information from unstructured clinical data a feasible technology in clinical settings.
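To make the notion of TE normalisation concrete, the following sketch converts a fully specified date in the DD-MM-YYYY form into an ISO-8601 calendar date using Python's standard library. It covers only this simplest case and is not the normalisation component developed in this thesis; relative or underspecified expressions, which account for much of the difficulty noted above, are outside its scope. The function name `normalise_date` and the example inputs are hypothetical.

```python
from datetime import datetime


def normalise_date(text, fmt="%d-%m-%Y"):
    """Return an ISO-8601 date (YYYY-MM-DD), or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt).date().isoformat()
    except ValueError:
        # Not a fully specified date in the assumed format.
        return None


print(normalise_date("17-03-2009"))   # 2009-03-17
print(normalise_date("next clinic"))  # None
```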

5.6 Summary

The methods developed and evaluated as part of this thesis have been shown to have many potential applications as part of clinical practice and/or potentially novel CDSS. For example, we have shown that automated concept extraction can be used to profile narratives. As such, we presented a novel comparative analysis of patient and clinical narratives. Consequently, we found a clear contrast between the content of CNs and PNs (which addressed one of our main research questions, Section 1.1). Specifically, we found that HrQoL concepts appear proportionally more in PNs than in CNs. In addition, traditional clinical concepts (e.g., medical problems, treatments, and tests) appear overwhelmingly more in CNs than in PNs. Moreover, while HrQoL concepts were, as expected, discussed less frequently in CNs than in PNs, they did appear extensively in CNs and are therefore feasible to systematically extract and monitor from hospital records. However, we do acknowledge the potential sample bias. Currently, HrQoL measures are typically collected through structured means; however, they are not widely monitored as part of clinical practice despite being repeatedly shown to be a good indicator of intervention outcomes and a predictor of mortality, morbidity, and service needs.

We have also shown that the automatic reconstruction of patient journeys from unstructured data using TM techniques is feasible (which addresses two of our main research questions, Section 1.1). We showed that we were able to characterise different stages of patient pathways, both for individual and for aggregated journeys, in the first three and a half years after diagnosis with satisfactory granularity based on various evaluation methods. However, the use of complementary data sources (including structured medical records) could be useful to further improve the results. The applications of automatic reconstruction of patient pathways are plentiful. For instance, it can be used to generate visual summaries/overviews of patient journeys as part of standard clinical practice (e.g., before/during a consultation). Further, extraction of cohort-level pathways can be used to analyse common clinical practice in order to compare and contrast it against established reference models or protocols regarding best practice. Large scale extraction of patient journeys across multiple institutions could be used to compare and contrast different views of care to ultimately harmonise practice.

Moreover, we have also proposed, designed and evaluated a ‘clinical dashboard’ for visualising individual patient timelines.11 Such an application could be used as part of standard clinical practice to improve efficiency and aid decision making through better presentation of relevant clinical data.

11 In fact, the application of extracting and chronologically ordering clinical events from unstructured text has been presented and discussed with the Head of Information at the Christie’s Hospital. Future meetings are planned to evaluate a pilot to integrate such an application as part of their ongoing project to visualise structured data.

Chapter 6

Conclusion

With the adoption of EHR, the use of computational methods to aid experts with large scale analyses of clinical data is becoming increasingly feasible and appealing. As part of these activities, the challenges of monitoring, developing and implementing clinical protocols remain an immature and active research problem. The adoption of automated methods to support experts with analyses of patient journeys has only been considered recently, with the new achievements in data processing and health informatics in general, and clinical text/data analytics in particular.

In this thesis we have proposed a novel method to identify key events and extract individual and aggregate patient journeys from healthcare narratives. This type of application can potentially be of value for clinical practitioners and researchers, to aid large scale analyses of implemented care pathways, and subsequently help monitor, compare, develop and adjust clinical guidelines both in the areas of chronic diseases (where there is plenty of data) and rare conditions (where potentially there are no established guidelines).

6.1 Contributions

In this thesis we addressed the problem of extracting individual- and cohort-level patient journeys from unstructured healthcare narratives using TM methods. In order to resolve this problem it was necessary to extract clinical concepts and temporal information to enrich healthcare narratives with the information needed to characterise embedded patient pathways.

As an initial set of analyses we profiled the clinical and patient narratives. Consequently, we were able to answer one of our main research questions: ‘what are the differences between the content of clinical narratives and patient narratives?’ (Section 1.1).1

Using the case study on medulloblastoma, we have shown that automated extraction of clinical concepts and relevant temporal information can facilitate the extraction and representation of individual and aggregate patient journeys. This addressed the main research questions: whether ‘TM techniques can be used to reconstruct patient journeys from unstructured clinical narratives?’ and whether ‘narratives contain enough information to reconstruct patient journeys’ (Section 1.1). The specific contributions of this research are listed below.

1. Identification of relevant clinical concepts in healthcare narratives.

(a) We have proposed a data-driven method to identify key clinical concepts (i.e., medical problems, treatments and tests) using a state-of-the-art sequence labelling algorithm, CRF, with a combination of lexical and syntactic features, and a rule-based post-processing method including label correction, boundary adjustment and a false positive filter. The method demonstrated state-of-the-art performance at the 2012 i2b2 challenge: it achieved strict and lenient micro F1-measures of 83.45% and 91.13% respectively.2

(b) We have developed a method to extract health-related quality of life concepts using a knowledge-driven method with overall strict and lenient micro F1-measures of 48.70% and 71.85% respectively. To support this process, a new health-related quality of life schema for the automated classification of extracted concepts from unstructured text was designed. This was achieved by an initial combination of study-related subjective measures and subsequent validation through an iterative process of consultation with clinical experts and health researchers. The method used dictionary matching with lexical contextual cues for boundary adjustment.

2. Identification of relevant temporal information in healthcare narratives.

(a) A method to extract temporal expressions using a hybrid knowledge-driven (dictionary and rules) and data-driven (CRF) approach has been proposed. The approach combines the respective components at the mention level with a post-processing filter to remove common errors. The method demonstrated state-of-the-art performance at the 2012 i2b2 challenge: an F1-measure of 90.48% and an accuracy of 70.44% for identification and normalisation respectively.

(b) A method to identify and classify temporal relations using a knowledge-based method was proposed, with an F1-measure of 62.96% (considering the reduced temporal graph) or 70.22% for extraction of temporal links. The method developed consisted of initial rule-based identification and classification components, which utilised contextual lexico-syntactic cues for inter-sentence links and string similarity for co-reference links, followed by a temporal closure component to calculate transitive relations of the extracted relations.

3. A novel method to extract and represent individual and aggregated patient journeys from unstructured clinical narratives using TM methods was proposed. In a case study, qualitative evaluation has shown that we were able to capture the major trends/processes that are part of patient journeys. An overall quantitative evaluation score (average P&R) of 94-100% for individual and 97% for aggregate pathways was achieved. The method developed includes clinical concept and temporal information extraction, concept clustering, and automated workflow generation.

4. We have applied the proposed methodology to explore in detail a case study on medulloblastoma through analyses of patient journeys and health-related concepts as mentioned in clinical narratives and patient narratives. We found that there is both a gap and an overlap in the concepts contained in these sets of narratives. For example, we found that HrQoL concepts are more common in patient narratives, while clinical concepts (e.g., medical problems, treatments, tests) are more prevalent in clinical narratives. In addition, while both aggregated sets of narratives contain all investigated concepts, clinical narratives contain, proportionally, more HrQoL concepts than clinical concepts found in patient narratives.

1 At this stage, we also showed that patient narratives were neither suitable nor necessary for extracting patient journeys.
2 Note that these are the results from the extended method described herein.
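The temporal closure step named in contribution 2(b) can be illustrated with a minimal sketch. It assumes a single relation type, BEFORE, and derives every link implied by transitivity; the closure component developed in this thesis handles the full set of extracted relations and is not reproduced here. The function name `temporal_closure` and the example events and their ordering are hypothetical.

```python
def temporal_closure(before_pairs):
    """Return the transitive closure of (a, b) pairs meaning 'a BEFORE b'."""
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a BEFORE b and b BEFORE d together imply a BEFORE d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical extracted links between three treatment events.
links = {("surgery", "radiotherapy"), ("radiotherapy", "chemotherapy")}
inferred = temporal_closure(links) - links
print(inferred)  # {('surgery', 'chemotherapy')}
```

The repeated pairwise scan is chosen for readability only; for large temporal graphs a dedicated graph algorithm would be more efficient.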

6.2 Limitations

There are a number of limitations to the research as a whole, as well as to the method proposed for extracting clinical pathways:

• Data source. The work presented in this thesis uses solely unstructured data (i.e., clinical narratives) to reconstruct patient pathways. A complete set of (electronic) patient records, including both unstructured and structured clinical records, could be used to capture all relevant information and to complement and validate missing information across the various data sources.

• Data quality. Firstly, there is a notable data quality issue with the medium or format. Clinical correspondence was originally in paper format and had to be scanned and subsequently converted to text by OCR software. An estimated 16% of the complete case-study dataset had to be discarded due to quality issues that arose from this process. Secondly, a set of quality issues was identified in the content of the data. For example, we found omissions or missing information (e.g., incomplete treatment information), redundancy (e.g., between clinical annotations and correspondence), conflicts and errors within narrative data. One of the reasons may be that some of this information is present in the structured parts of patient records.

• Scope. The current method does not account for decision points and therefore alternative flows (e.g., a particular treatment had to be stopped due to adverse effects, and therefore an alternative treatment plan was adopted). Alternative flows are not necessarily part of patient journeys, and since we aimed to extract major trends/processes we have not attempted to identify decision points or alternative flows.

6.3 Future work

The limitations and challenges of this work are good indicators of future directions and related open research questions. Throughout this thesis we have identified a number of interesting research problems to be addressed.

• TLINK identification is a complex task for both manual and automated methods. What constitutes a link is not always clear. Thus, manual efforts for TLINK identification have reported notably poor IAAs. Temporal relations may span a phrase, sentence, or paragraph, consequently increasing the complexity of identification. Currently, the difficulty of identifying TLINKs is correlated with the distance between TLINK pairs (with the exception of SECTIME, which makes up the easiest type of TLINK). Future approaches should further explore temporal inference as well as knowledge-intensive approaches.

• Inter-document co-reference resolution in clinical narratives remains an open research question and an important component of temporal ordering of events that are co-referenced across multiple documents (such as in patient records). For example, in order to determine the persistence of events or event timelines (i.e., how long does an event last?) it is necessary to accurately extract inter-document co-references. To date, as far as we are aware, only one notable work has addressed this problem (Raghavan et al., 2014).

• The approach to extracting patient journeys presented in this thesis does not consider decision points. Hence, future work should expand the current approach by enabling the extraction of decision points and alternative flows. To that end, it may be more feasible for a future case-study to consider narrower topics such as specific treatment protocols.
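
The persistence question raised in the second item above (how long does an event last?) can be illustrated with a minimal sketch. It assumes that inter-document co-reference resolution has already grouped event mentions into chains, and that each mention carries the date of the document it came from. The function name event_spans, the chain labels, and the dates are hypothetical and are not taken from the system or data described in this thesis.

```python
from collections import defaultdict
from datetime import date


def event_spans(mentions):
    """Return {chain_id: (first_date, last_date, days)} for co-referenced events.

    `mentions` is an iterable of (chain_id, document_date) pairs, where
    chain_id is a hypothetical label shared by all mentions that an
    inter-document co-reference resolver linked to the same event.
    """
    dates_by_chain = defaultdict(list)
    for chain_id, doc_date in mentions:
        dates_by_chain[chain_id].append(doc_date)

    spans = {}
    for chain_id, dates in dates_by_chain.items():
        first, last = min(dates), max(dates)
        # The span between first and last mention is only a lower bound
        # on how long the event actually persisted.
        spans[chain_id] = (first, last, (last - first).days)
    return spans


# Illustrative input only: two chains across three dated documents.
mentions = [
    ("headache", date(2010, 1, 5)),
    ("headache", date(2010, 3, 1)),
    ("chemotherapy", date(2010, 2, 10)),
]
spans = event_spans(mentions)
```

The span computed here depends entirely on the quality of the co-reference chains: a missed link shortens an event, and a spurious link lengthens it, which is why accurate inter-document co-reference extraction matters for event timelines.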

6.4 Summary

The findings of this thesis have contributed towards the proof-of-concept that automated methods can be used to extract individual patient journeys and identify trends that characterise more generic (aggregated) pathways. We have demonstrated that TM methods can be used efficiently to identify, extract and temporally organise key clinical concepts that make up a patient’s journey in a healthcare system. The proposed method can be useful for (semi-)automated large-scale analysis of clinical narratives to help experts monitor, compare, develop, and adjust clinical guidelines. Automated reconstruction of patient journeys could also be used as part of standard clinical practice to provide a visual overview of patients’ journeys as an alternative to manual review of textual data, which can be time consuming and partial (i.e., does not provide an overall summary). In addition, such methods can be used for knowledge transfer between clinical information systems, or for quick information transfer between systems in emergency cases.

Bibliography

N K Aaronson, S Ahmedzai, B Bergman, M Bullinger, A Cull, N J Duez, A Filiberti, H Flechtner, S B Fleishman, J C J M d Haes, S Kaasa, M Klee, D Osoba, D Razavi, P B Rofe, S Schraub, K Sneeuw, M Sullivan, and F Takeda. The European Organization for Research and Treatment of Cancer QLQ-C30: A Quality-of-Life Instrument for Use in International Clinical Trials in Oncology. Journal of the National Cancer Institute, 85:365–376, 1993. ISSN 0027-8874. doi: 10.1093/jnci/85.5.365. URL http://jnci.oxfordjournals.org/cgi/content/abstract/85/5/365.

Asma B Abacha and Pierre Zweigenbaum. Medical entity recognition: a comparison of semantic and statistical methods. In 2011 Workshop on Biomedical Natural Language Processing, pages 56–64, 2011a. ISBN 978-1-932432-91-6. URL http://dl.acm.org/citation.cfm?id=2002902.2002911.

Asma B Abacha and Pierre Zweigenbaum. Automatic extraction of semantic relations between medical entities: a rule based approach. Journal of Biomedical Semantics, 2(Suppl 5):S4+, 2011b. ISSN 2041-1480. doi: 10.1186/2041-1480-2-s5-s4. URL http://dx.doi.org/10.1186/2041-1480-2-s5-s4.

James F Allen. Towards a General Theory of Action and Time. Artificial Intelligence, 23:123–154, 1984. ISSN 00043702. doi: 10.1016/0004-3702(84)90008-0.

James F Allen and George Ferguson. Actions and events in interval temporal logic. Journal of Logic and Computation, 4:531–579, 1994. ISSN 0955792X. doi: 10.1093/logcom/4.5.531.

James F Allen, Mary Swift, and Will de Beaumont. Deep Semantic Analysis of Text. In Proceedings of the 2008 Conference on Semantics in Text Processing, STEP ’08, pages 343–354, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1626481.1626508.

Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, Cambridge, 2nd edition, 2010. ISBN 026201243X, 9780262012430.

Douglas E Appelt and Boyan Onyshkevych. The Common Pattern Specification Language. In Proceedings of a Workshop on Held at Baltimore, Maryland: October 13-15, 1998, TIPSTER ’98, pages 23–30, Stroudsburg, PA, USA, 1998. Association for Computational Linguistics. doi: 10.3115/1119089.1119095. URL http://dx.doi.org/10.3115/1119089.1119095 http://portal.acm.org/citation.cfm?doid=1119089.1119095.

Alan R Aronson. MetaMap Evaluation. Technical report, US National Library of Medicine (NLM), May 2001. URL http://skr.nlm.nih.gov/papers/references/mm.evaluation.pdf.

Alan R Aronson and François-Michel Lang. An Overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236, May 2010. ISSN 1527-974X. doi: 10.1136/jamia.2009.002733. URL http://dx.doi.org/10.1136/jamia.2009.002733.

M Barnes and Octo G Barnett. An architecture for a distributed guideline server. Proceedings of the Annual Symposium on Computer Application in Medical Care, pages 233–7, January 1995. ISSN 0195-4210. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2579090&tool=pmcentrez&rendertype=abstract.

Ronald D Barr, T Simpson, A Whitton, B Rush, W Furlong, and D H Feeny. Health-related quality of life in survivors of tumours of the central nervous system in childhood–a preference-based approach to measurement in a cross-sectional study. European journal of cancer (Oxford, England : 1990), 35(2):248–55, February 1999. ISSN 0959-8049. URL http://www.ncbi.nlm.nih.gov/pubmed/10448267.

Steven Bethard. ClearTK-TimeML: A minimalist approach to TempEval 2013. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 10–14, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2002.

Steven Bethard and James H Martin. CU-TMP: Temporal Relation Classification Using Syntactic and Semantic Features. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 129–132, Prague, Czech Republic, 2007. Association for Computational Linguistics.

Sundeep R Bhat, Tress L Goodwin, Tasha M Burwinkle, Meagan F Lansdale, Gary V Dahl, Stephen L Huhn, Iris C Gibbs, Sarah S Donaldson, Ruth K Rosenblum, James W Varni, and Paul G Fisher. Profile of daily life in children with brain tumors: an assessment of health-related quality of life. Journal of clinical oncology : official journal of the American Society of Clinical Oncology, 23(24):5493–500, August 2005. ISSN 0732-183X. doi: 10.1200/JCO.2005.10.190. URL http://www.ncbi.nlm.nih.gov/pubmed/16110009.

K K Boman, E Hovén, M Anclair, B Lannering, and G Gustafsson. Health and persistent functional late effects in adult survivors of childhood CNS tumours: a population-based cohort study. European journal of cancer (Oxford, England : 1990), 45(14):2552–61, September 2009. ISSN 1879-0852. doi: 10.1016/j.ejca.2009.06.008. URL http://www.ncbi.nlm.nih.gov/pubmed/19616428.

Patricia L Bowyer, Jessica Kramer, Gary Kielhofner, Vanessa Maziero-Barbosa, and Gay Girolami. Measurement properties of the Short Child Occupational Profile (SCOPE). Physical & occupational therapy in pediatrics, 27(4):67–85, January 2007. ISSN 0194-2638. URL http://www.ncbi.nlm.nih.gov/pubmed/18032150.

Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18:467–479, 1992. ISSN 08912017. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9919.

Nate Chambers. NavyTime: Event and Time Ordering from Raw Text. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 73–77, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2012.

Nathanael Chambers and Dan Jurafsky. Jointly Combining Implicit Constraints Improves Temporal Ordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 698–706, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1613715.1613803.

Angel Chang and Christopher D Manning. SUTime: Evaluation in TempEval-3. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 78–82, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2013.

Wendy W Chapman, Will Bridewell, Paul Hanbury, Gregory F Cooper, and Bruce G Buchanan. A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics, 34(5):301–310, 2001. doi: http://dx.doi.org/10.1006/jbin.2001.1029. URL http://www.sciencedirect.com/science/article/pii/S1532046401910299.

Basit Chaudhry, Jerome Wang, Shinyi Wu, Margaret Maglione, Walter Mojica, Elizabeth Roth, Sally C. Morton, and Paul G. Shekelle. Systematic review: Impact of health information technology on quality, efficiency, and costs of medical care, 2006. ISSN 00034819.

Yuchang Cheng, Masayuki Asahara, and Yuji Matsumoto. NAIST.Japan: Temporal Relation Identification Using Dependency Parsed Tree. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 245–8, Prague, Czech Republic, 2007. Association for Computational Linguistics.

Colin Cherry, Xiaodan Zhu, Joel Martin, and Berry de Bruijn. À la Recherche du Temps Perdu: extracting temporal relations from medical text in the 2012 i2b2 NLP challenge. Journal of the American Medical Informatics Association, 20(5):843–848, September 2013. doi: 10.1136/amiajnl-2013-001624. URL http://dx.doi.org/10.1136/amiajnl-2013-001624.

Nancy A Chinchor. Proceedings of the Seventh Message Understanding Conference (MUC-7) Named Entity Task Definition. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, April 1998. URL http://acl.ldc.upenn.edu/muc7/ne_task.html.

David Chu, John N Dowling, and Wendy W Chapman. Evaluating the effectiveness of four contextual features in classifying annotated clinical conditions in emergency department reports. In Proceedings of the AMIA Symposium 2006, pages 141–145, 2006.

Jacob Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46, April 1960. doi: 10.1177/001316446002000104. URL http://dx.doi.org/10.1177/001316446002000104.

William W Cohen, Pradeep Ravikumar, and Stephen E Fienberg. A comparison of string distance metrics for name-matching tasks, 2003. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.3605.

Nigel Collier, Son Doan, Ai Kawazoe, Reiko Matsuda Goodwin, Mike Conway, Yoshio Tateno, Quoc Hung Ngo, Dinh Dien, Asanee Kawtrakul, Koichi Takeuchi, Mika Shigematsu, and Kiyosu Taniguchi. BioCaster: Detecting public health rumors with a Web-based text mining system. Bioinformatics, 24:2940–2941, 2008. ISSN 13674803. doi: 10.1093/bioinformatics/btn534.

H Cunningham, D Maynard, and Valentin Tablan. JAPE: a Java Annotation Patterns Engine (Second Edition). Technical Report CS-00-10, Department of Computer Science, University of Sheffield, November 2000. URL http://www.dcs.shef.ac.uk/~diana/Papers/jape.ps.

Hamish Cunningham, Valentin Tablan, Angus Roberts, and Kalina Bontcheva. Getting More Out of Biomedical Documents with GATE’s Full Lifecycle Open Source Text Analytics. PLoS Comput Biol, 9(2), 2013. doi: 10.1371/journal.pcbi.1002854. URL http://dx.doi.org/10.1371/journal.pcbi.1002854.

Berry de Bruijn, Colin Cherry, Svetlana Kiritchenko, Joel Martin, and Xiaodan Zhu. Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. Journal of the American Medical Informatics Association, 18(5):557–562, September 2011. ISSN 1527-974X. doi: 10.1136/amiajnl-2011-000150. URL http://dx.doi.org/10.1136/amiajnl-2011-000150.

Azad Dehghan, John A Keane, and Goran Nenadic. Challenges in Clinical Named Entity Recognition for Decision Support. In Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on, pages 947–951, October 2013. doi: 10.1109/SMC.2013.166.

Azad Dehghan, Aleksandar Kovačević, George Karystianis, John A Keane, and Goran Nenadic. Combining Knowledge- and Data-driven Methods for De-identification of Clinical Narratives. Journal of Biomedical Informatics (Accepted), 2015.

Peter Densen. Challenges and Opportunities Facing Medical Education. Transactions of the American Clinical and Climatological Association, 122:48–58, 2011. ISSN 0065-7778. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3116346/.

Leon Derczynski and Robert Gaizauskas. USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 337–340, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1075.

Jennifer D’Souza and Vincent Ng. Classifying temporal relations in clinical data: A hybrid, knowledge-rich approach. Journal of Biomedical Informatics, 46:S29–S39, 2013. ISSN 15320464. doi: 10.1016/j.jbi.2013.08.003.

Christine Eiser and Yvonne H Vance. Implications of cancer for school attendance and behavior. Medical and pediatric oncology, 38(5):317–9, May 2002. ISSN 0098-1532. doi: 10.1002/mpo.1341. URL http://www.ncbi.nlm.nih.gov/pubmed/11979455.

Carol Estwing Ferrans, Julie Johnson Zerwic, Jo Ellen Wilbur, and Janet L Larson. Conceptual model of health-related quality of life. Journal of nursing scholarship : an official publication of Sigma Theta Tau International Honor Society of Nursing / Sigma Theta Tau, 37:336–342, 2005. ISSN 1527-6546. doi: http://dx.doi.org/10.1111/j.1547-5069.2005.00058.x.

Lisa Ferro, Inderjeet Mani, Beth Sundheim, and George Wilson. TIDES Temporal Annotation Guidelines-Version 1.0.2. Technical report, The MITRE Corporation, 2001. URL http://www.timeml.org/site/terqas/readings/MTRAnnotationGuide_v1_02.pdf.

Marilyn J Field, Kathleen N Lohr, and Editors. Clinical practice guidelines: directions for a new program. Technical report, National Academy Press, Institute of Medicine, Washington, DC, USA, 1990.

Michele Filannino, Gavin Brown, and Goran Nenadic. ManTIME: Temporal expression identification and normalization in the TempEval-3 challenge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 53–57, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2009.

Elena Filatova and Eduard Hovy. Assigning Time-stamps to Event-clauses. In Proceedings of the Workshop on Temporal and Spatial Information Processing - Volume 13, TASIP 2001, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics. doi: 10.3115/1118238.1118250. URL http://dx.doi.org/10.3115/1118238.1118250.

R Jenny Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by gibbs sampling. . . . of the 43rd Annual Meeting on . . . , pages 363–370, 2005. doi: 10.3115/1219840.1219885. URL http://dl.acm.org/citation.cfm?id=1219885.

J Fox, N Johns, and A Rahmanzadeh. Disseminating medical knowledge: the PROforma approach. Artificial intelligence in medicine, 14(1-2):157–81, 1998. ISSN 0933-3657. URL http://www.ncbi.nlm.nih.gov/pubmed/9779888.

Carol Friedman, Lyudmila Shagina, Yves Lussier, and George Hripcsak. Automated Encoding of Clinical Documents Based on Natural Language Processing. Journal of the American Medical Informatics Association, 11(5):392–402, September 2004. ISSN 1067-5027. doi: 10.1197/jamia.m1552. URL http://dx.doi.org/10.1197/jamia.m1552.

Raoul Frijters, Marianne van Vugt, Ruben Smeets, René van Schaik, Jacob de Vlieg, and Wynand Alkema. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS computational biology, 6(9), January 2010. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1000943. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2944780&tool=pmcentrez&rendertype=abstract.

John W Furlong, David H Feeny, George W Torrance, and Ronald D Barr. The Health Utilities Index (HUI) system for assessing health-related quality of life in clinical studies. Annals of Medicine, 33:375–384, 2001. ISSN 0785-3890. doi: 10.3109/07853890109002092.

Martin Gerner, Farzaneh Sarafraz, Casey M Bergman, and Goran Nenadic. BioContext: An integrated text mining system for large-scale extraction and contextualization of biomolecular events. Bioinformatics, 28:2154–2161, 2012. ISSN 13674803. doi: 10.1093/bioinformatics/bts332.

A W Glaser, W Furlong, D A Walker, K Fielding, K Davies, D H Feeny, and Ronald D Barr. Applicability of the Health Utilities Index to a population of childhood survivors of central nervous system tumours in the U.K. European journal of cancer (Oxford, England : 1990), 35(2):256–61, February 1999. ISSN 0959-8049. URL http://www.ncbi.nlm.nih.gov/pubmed/10448268.

Phil Gooch and Abdul Roudsari. Computerization of workflows, guidelines, and care pathways: a review of implementation challenges for process-oriented health information systems. Journal of the American Medical Informatics Association : JAMIA, 18(6):738–48, 2011. ISSN 1527-974X. doi: 10.1136/amiajnl-2010-000033. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3197986&tool=pmcentrez&rendertype=abstract.

Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: A Brief History. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING ’96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics. URL http://dx.doi.org/10.3115/992628.992709.

Mogens Groenvold. Health-related quality of life in early breast cancer. Danish medical bulletin, 57(9):B4184, September 2010. ISSN 1603-9629. URL http://www.ncbi.nlm.nih.gov/pubmed/20816024.

Cyril Grouin, Natalia Grabar, Thierry Hamon, Sophie Rosset, Xavier Tannier, and Pierre Zweigenbaum. Eventual situations for timeline extraction from clinical reports. Journal of the American Medical Informatics Association, 20(5):820–827, September 2013. doi: 10.1136/amiajnl-2013-001627. URL http://dx.doi.org/10.1136/amiajnl-2013-001627.

Claire Grover, Richard Tobin, Beatrice Alex, and Kate Byrne. Edinburgh-LTG: TempEval-2 System Description. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 333–336, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1074.

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg, and Ulla Stenius. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics, 12(1):69, 2011. doi: 10.1186/1471-2105-12-69. URL http://www.biomedcentral.com/1471-2105/12/69.

James G Gurney, Kevin R Krull, Nina Kadan-Lottick, H Stacy Nicholson, Paul C Nathan, Brad Zebrack, Jean M Tersak, and Kirsten K Ness. Social outcomes in the Childhood Cancer Survivor Study cohort. Journal of clinical oncology : official journal of the American Society of Clinical Oncology, 27(14):2390–5, May 2009. ISSN 1527-7755. doi: 10.1200/JCO.2008.21.1458. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2677924&tool=pmcentrez&rendertype=abstract.

H Gurulingappa, M Hofmann-Apitius, and J Fluck. Concept identification and assertion classification in patient health records. In Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data, Boston, MA, USA, 2010.

Gordon H Guyatt, David H Feeny, and Donald L Patrick. Measuring Health-Related Quality of Life. Annals of Internal Medicine, 118(8):622–629, 1993. doi: 10.7326/0003-4819-118-8-199304150-00009. URL http://dx.doi.org/10.7326/0003-4819-118-8-199304150-00009.

Eun Ha, Alok Baikadi, Carlyle Licata, and James Lester. NCSU: Modeling Temporal Relations with Markov Logic and Lexical Ontology. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 341–344, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1076.

Caroline Hagège and Xavier Tannier. XRCE-T : XIP temporal module for TempEval campaign. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 492–495, Prague, Czech Republic, 2007. Association for Computational Linguistics.

Udo Hahn and Joachim Wermter. Levels of Natural Language Processing for Text Mining, chapter 2, pages 13–41. Artech House, 2006.

Udo Hahn, Joachim Wermter, and Computerlinguistik F schiller-universität Jena. High-Performance Tagging on Medical Texts. In Proceedings of the 20th international conference on Computational Linguistics, pages 973–979, 2004.

Scott Russell Halgrim, Fei Xia, Imre Solti, Eithon Cadag, and Ozlem Uzuner. A cascade of classifiers for extracting medication information from discharge summaries. Journal of biomedical semantics, 2 Suppl 3(Suppl 3):S2, January 2011. ISSN 2041-1480. doi: 10.1186/2041-1480-2-S3-S2. URL http://www.jbiomedsem.com/content/2/S3/S2.

Henk Harkema, John N Dowling, Tyler Thornblade, and Wendy W Chapman. ConText: An algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of Biomedical Informatics, 42(5):839–851, 2009. doi: http://dx.doi.org/10.1016/j.jbi.2009.05.002. URL http://www.sciencedirect.com/science/article/pii/S1532046409000744.

Lynette Hirschman and Guy Story. Representing Implicit and Explicit Time Relations in Narrative. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’81, pages 289–295, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1623156.1623212.

Lynette Hirschman, Ralph Grishman, and Naomi Sager. From Text to Structured Information: Automatic Processing of Medical Reports. In Proceedings of the June 7-10, 1976, National Computer Conference and Exposition, AFIPS ’76, pages 267–275, New York, NY, USA, 1976. ACM. doi: 10.1145/1499799.1499842. URL http://doi.acm.org/10.1145/1499799.1499842.

Andreas Holzinger, Klaus-Martin Simonic, and Pinar Yildirim. Disease-Disease Relationships for Rheumatic Diseases: Web-Based Biomedical Textmining an Knowledge Discovery to Assist Medical Decision Making. In 2012 IEEE 36th Annual Computer Software and Applications Conference, pages 573–580. IEEE, July 2012. ISBN 978-1-4673-1990-4. doi: 10.1109/COMPSAC.2012.77. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6340213.

George Hripcsak. Writing Arden Syntax medical logic modules. Computers in Biology and Medicine, 24(5):331–363, September 1994. ISSN 00104825. doi: 10.1016/0010-4825(94)90002-7. URL http://linkinghub.elsevier.com/retrieve/pii/0010482594900027.

George Hripcsak and Adam S. Rothschild. Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12:296–298, 2005. ISSN 10675027. doi: 10.1197/jamia.M1733.

Bob Ingria and James Pustejovsky. TimeML: A Formal Specification Language for Events and Temporal Expressions (TimeML Specification 1.2), 2004. URL http://timeml.org/site/publications/timeMLdocs/timeml_1.2.html.

Daniel G Jamieson, Andrew Moss, Michael Kennedy, Sherrie Jones, Goran Nenadic, David L Robertson, and Ben Sidders. The pain interactome: Connecting pain-specific protein interactions. Pain, 155(11):2243–52, November 2014. ISSN 1872-6623. doi: 10.1016/j.pain.2014.06.020. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4247380&tool=pmcentrez&rendertype=abstract.

Matthew A Jaro. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989. doi: 10.2307/2289924. URL http://dx.doi.org/10.2307/2289924.

Matthew A Jaro. Probabilistic Linkage of Large Public Health Data Files. Statistics in Medicine, 14:491–498, 1995.

Min Jiang, Yukun Chen, Mei Liu, S Trent Rosenbloom, Subramani Mani, Joshua C Denny, and Hua Xu. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. Journal of the American Medical Informatics Association : JAMIA, 18(5):601–606, 2011. ISSN 1067-5027. doi: 10.1136/amiajnl-2011-000163.

Siddhartha Jonnalagadda, Trevor Cohen, Stephen Wu, and Graciela Gonzalez. Enhancing clinical concept extraction with distributional semantics. Journal of Biomedical Informatics, 45:129–140, 2012. ISSN 15320464. doi: 10.1016/j.jbi.2011.10.007.

Hyuckchul Jung and Amanda Stent. ATT1: Temporal Annotation Using Big Windows and Rich Syntactic and Semantic Features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 20–24, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2004.

Ning Kang, Rogier J Barendse, Zubair Afzal, Bharat Singh, Martijn J Schuemie, Erik M van Mulligen, and Jan A Kors. Erasmus MC approaches to the i2b2 Challenge. Proceedings of the 2010 i2b2/VA workshop on challenges in natural language processing for clinical data, 2010.

Rainu Kaushal, Kaveh G Shojania, and David W Bates. Effects of computerized physician order entry and clinical decision support systems on medication safety: a systematic review. Archives of internal medicine, 163:1409–1416, 2003. ISSN 0003-9926. doi: 10.1001/archinte.163.12.1409.

Jessica Keller and Gary Kielhofner. Psychometric characteristics of the Child Occupational Self-Assessment (COSA), Part Two: Refining the psychometric properties. Scandinavian Journal of Occupational Therapy, 12(4):147–158, January 2005. ISSN 1103-8128. doi: 10.1080/11038120510031761. URL http://informahealthcare.com/doi/abs/10.1080/11038120510031761.

Jessica Keller, Ana Kafkes, and Gary Kielhofner. Psychometric characteristics of the Child Occupational Self Assessment (COSA), Part One: An initial examination of psychometric properties. Scandinavian Journal of Occupational Therapy, 12(3):118–127, January 2005. ISSN 1103-8128. doi: 10.1080/11038120510031752. URL http://informahealthcare.com/doi/abs/10.1080/11038120510031752.

J.D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. GENIA corpus–a semantically annotated corpus for bio-textmining. Bioinformatics, 19(Suppl 1):i180–i182, July 2003. ISSN 1367-4803. doi: 10.1093/bioinformatics/btg1023. URL http://bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btg1023.

Anup Kumar Kolya, Asif Ekbal, and Sivaji Bandyopadhyay. JU CSE TEMP: A First Step towards Evaluating Events, Time Expressions and Temporal Relations. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 345–350, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1077.

Anup Kumar Kolya, Amitava Kundu, Rajdeep Gupta, Asif Ekbal, and Sivaji Bandyopadhyay. JU CSE: A CRF Based Approach to Annotation of Temporal Expression, Event and Temporal Relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 64–72, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2011.

Aleksandar Kovacevic, Azad Dehghan, Michele Filannino, John A Keane, and Goran Nenadic. Combining rules and machine learning for extraction of temporal expressions and events from clinical narratives. Journal of the American Medical Informatics Association : JAMIA, 20(5):859–66, 2013. URL http://www.ncbi.nlm.nih.gov/pubmed/23605114.

Scott Kraus, Catherine Blake, and Suzanne L West. Information Extraction from Medical Notes, 2007.

Michael Krauthammer and Goran Nenadic. Term identification in the biomedical literature, volume 37. Elsevier, December 2004. URL http://linkinghub.elsevier.com/retrieve/pii/S1532046404000826?showall=true.

John D Lafferty, Andrew McCallum, and Fernando C N Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm?id=645530.655813.

E R Lancashire, C Frobisher, R C Reulen, D L Winter, A Glaser, and M M Hawkins. Educational attainment among adult survivors of childhood cancer in Great Britain: a population-based cohort study. Journal of the National Cancer Institute, 102(4):254–70, February 2010. ISSN 1460-2105. doi: 10.1093/jnci/djp498. URL http://www.ncbi.nlm.nih.gov/pubmed/20107164.

Natsuda Laokulrat, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. UTTime: Temporal Relation Classification using Deep Syntactic Features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 88–92, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2015.

Robert Leaman, Christopher Miller, and Graciela Gonzalez. Enabling Recognition of Diseases in Biomedical Text with Machine Learning : Corpus and Benchmark. In Proceedings of the 3rd International Symposium on Languages in Biology and Medicine (LBM), pages 82–89, 2009.

Yu Kai Lin, Hsinchun Chen, and Randall A. Brown. MedTime: A temporal information extraction system for clinical narratives, 2013. ISSN 15320464.

Hongfang Liu, Yves A Lussier, and Carol Friedman. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method. Journal of biomedical informatics, 34(4):249–261, 2001. ISSN 1532-0464. doi: 10.1006/jbin.2001.1023.

Hector Llorens, Estela Saquete Boro, and Borja Navarro. TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 284–291, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1063.

Inderjeet Mani, Marc Verhagen, Ben Wellner, Chong Min Lee, and James Pustejovsky. Machine Learning of Temporal Relations. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 753–760, Sydney, Australia, July 2006. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P06/P06-1095.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

C A McHorney. Health status assessment methods for adults: past accomplishments and future challenges. Annual review of public health, 20:309–35, January 1999. ISSN 0163-7525. doi: 10.1146/annurev.publhealth.20.1.309. URL http://www.ncbi.nlm.nih.gov/pubmed/10352861.

S M Meystre, Guergana K Savova, K C Kipper-Schuler, and J F Hurdle. Extracting information from textual documents in the electronic health record: a review of recent research. Yearbook of medical informatics, pages 128–144, 2008. ISSN 0943-4747. URL http://view.ncbi.nlm.nih.gov/pubmed/18660887.

Congmin Min, Munirathnam Srikanth, and Abraham Fowler. LCC-TE: A Hybrid Approach to Temporal Relation Identification in News Text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 219–222, Prague, Czech Republic, 2007. Association for Computational Linguistics.

Seyed Abolghasem Mirroshandel, Gholamreza Ghassem-Sani, and Mahdy Khayyamian. Using Syntactic-based Kernels for Classifying Temporal Relations. J. Comput. Sci. Technol., 26(1):68–80, January 2011. ISSN 1000-9000. doi: 10.1007/s11390-011-1112-0. URL http://dx.doi.org/10.1007/s11390-011-1112-0.

Pauline A Mitby, Leslie L Robison, John A Whitton, Michael A Zevon, Iris C Gibbs, Jean M Tersak, Anna T Meadows, Marilyn Stovall, Lonnie K Zeltzer, and Ann C Mertens. Utilization of special education services and educational attainment among long-term survivors of childhood cancer: a report from the Childhood Cancer Survivor Study. Cancer, 97(4):1115–26, February 2003. ISSN 0008-543X. doi: 10.1002/cncr.11117. URL http://www.ncbi.nlm.nih.gov/pubmed/12569614.

Y Mizuta, A Korhonen, T Mullen, and N Collier. Zone Analysis in Biology Articles as a Basis for Information Extraction. International Journal of Medical Informatics on Natural Language Processing in Biomedicine and Its Applications, 75(6):468–487, 2006.

Raymond K Mulhern, Thomas E Merchant, Amar Gajjar, Wilburn E Reddick, and Larry E Kun. Late neurocognitive sequelae in survivors of brain tumours in childhood. The Lancet. Oncology, 5(7):399–408, July 2004. ISSN 1470-2045. doi: 10.1016/S1470-2045(04)01507-4. URL http://www.ncbi.nlm.nih.gov/pubmed/15231246.

Mark A. Musen, Samson W. Tu, Amar K. Das, and Yuval Shahar. EON: A Component-Based Approach to Automation of Protocol-Directed Therapy. Journal of the American Medical Informatics Association, 3:367–388, 1996. ISSN 1067-5027. doi: 10.1136/jamia.1996.97084511.

Matteo Negri, Estela Saquete Boro, Patricio Martínez-Barco, and Rafael Muñoz. Evaluating Knowledge-based Approaches to the Multilingual Extension of a Temporal Expression Normalizer. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 30–37, Sydney, Australia, July 2006. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W06/W06-0905.

Kirsten K Ness, Wendy M Leisenring, Sujuan Huang, Melissa M Hudson, James G Gurney, Kimberly Whelan, Wendy L Hobbie, Gregory T Armstrong, Leslie L Robison, and Kevin C Oeffinger. Predictors of inactive lifestyle among adult survivors of childhood cancer: a report from the Childhood Cancer Survivor Study. Cancer, 115(9):1984–94, May 2009. ISSN 0008-543X. doi: 10.1002/cncr.24209. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2692052&tool=pmcentrez&rendertype=abstract.

Kirsten K Ness, E Brannon Morris, Vikki G Nolan, Carrie R Howell, Laura S Gilchrist, Marilyn Stovall, Cheryl L Cox, James L Klosky, Amar Gajjar, and Joseph P Neglia. Physical performance limitations among adult survivors of childhood brain tumors. Cancer, 116(12):3034–44, June 2010. ISSN 0008-543X. doi: 10.1002/cncr.25051. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3554250&tool=pmcentrez&rendertype=abstract.

DH NHS Modernisation Agency and NICE. Protocol Based Care. . . Underpinning Improvement. Technical report, Department of Health, 2008.

Azadeh Nikfarjam, Ehsan Emadzadeh, and Graciela Gonzalez. Towards generating a patient’s timeline: Extracting temporal relationships from clinical notes. Journal of biomedical informatics, 46:S40–S47, 2013.

Suzy O’Connor, Emma House, Eamonn Ferguson, Terri Carney, and Rory O’Connor. Paediatric Inventory of Emotional Distress. GL Assessment, London, 2010.

Lucila Ohno-Machado, John H Gennari, Shawn N Murphy, Nilesh L Jain, Samson W Tu, Diane E. Oliver, Edward Pattison-Gordon, Robert A Greenes, Edward H Shortliffe, and Octo G Barnett. The guideline interchange format: a model for representing guidelines. Journal of the American Medical Informatics Association : JAMIA, 5(4):357–72, 1998. ISSN 1067-5027. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=61313&tool=pmcentrez&rendertype=abstract.

Arzucan Özgür, Thuy Vu, Güneş Erkan, and Dragomir R Radev. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics (Oxford, England), 24(13):i277–85, July 2008. ISSN 1367-4811. doi: 10.1093/bioinformatics/btn182. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2718658&tool=pmcentrez&rendertype=abstract.

Sergeui V Pakhomov, Anni Coden, and Christopher G Chute. Developing a corpus of clinical notes manually annotated for part-of-speech, volume 75. Elsevier Science Ireland Ltd., June 2006. URL http://linkinghub.elsevier.com/retrieve/pii/S1386505605001723?showall=true.

Stephanie N Palmer, Kathleen A Meeske, Ernest R Katz, Tasha M Burwinkle, and James W Varni. The PedsQL brain tumor module: Initial reliability and validity. Pediatric Blood and Cancer, 49(3):287–293, 2007. ISSN 15455009. doi: 10.1002/pbc.21026.

Jenny W Y Pang, Debra L Friedman, John A Whitton, Marilyn Stovall, Ann C Mertens, Leslie L Robison, and Noel S Weiss. Employment status among adult survivors in the Childhood Cancer Survivor Study. Pediatric blood & cancer, 50(1):104–10, January 2008. ISSN 1545-5017. doi: 10.1002/pbc.21226. URL http://www.ncbi.nlm.nih.gov/pubmed/17554791.

Jon D Patrick, Dung H M Nguyen, Yefeng Wang, and Min Li. A knowledge discovery and reuse pipeline for information extraction in clinical notes. Journal of the American Medical Informatics Association : JAMIA, 18:574–579, 2011. ISSN 1067-5027. doi: 10.1136/amiajnl-2011-000302.

M Peleg, A A Boxwala, O Ogunyemi, Q Zeng, S Tu, R Lacson, E Bernstam, N Ash, P Mork, L Ohno-Machado, E H Shortliffe, and R A Greenes. GLIF3: the evolution of a guideline representation format. Proceedings / AMIA ... Annual Symposium. AMIA Symposium, pages 645–9, January 2000. ISSN 1531-605X. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2243832&tool=pmcentrez&rendertype=abstract.

M Peleg, O Ogunyemi, S Tu, A A Boxwala, Q Zeng, R A Greenes, and E H Shortliffe. Using features of Arden Syntax with object-oriented medical data models for guideline modeling. Proceedings of the AMIA Symposium, pages 523–7, January 2001. ISSN 1531-605X. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2243476&tool=pmcentrez&rendertype=abstract.

Anthony Penn, Stephen P Lowis, Linda P Hunt, Robert I Shortman, Michael C G Stevens, Renee L McCarter, Andrew L Curran, and Peta M Sharples. Health related quality of life in the first year after diagnosis in children with brain tumours compared with matched healthy controls; a prospective longitudinal study. European journal of cancer (Oxford, England : 1990), 44(9):1243–52, June 2008. ISSN 0959-8049. doi: 10.1016/j.ejca.2007.09.015. URL http://www.ncbi.nlm.nih.gov/pubmed/17997300.

Rafael Peris-Bonet, Carmen Martínez-García, Brigitte Lacour, Svetlana Petrovich, Begoña Giner-Ripoll, Aurora Navajas, and Eva Steliarova-Foucher. Childhood central nervous system tumours–incidence and survival in Europe (1978-1997): report from Automated Childhood Cancer Information System project. European journal of cancer (Oxford, England : 1990), 42(13):2064–80, September 2006. ISSN 0959-8049. doi: 10.1016/j.ejca.2006.05.009. URL http://www.ncbi.nlm.nih.gov/pubmed/16919771.

M F Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

Georgiana Puscasu. WVALI: Temporal Relation Identification by Syntactico-Semantic Analysis. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 484–487, Prague, Czech Republic, 2007. Association for Computational Linguistics.

James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. TimeML: Robust specification of event and temporal expressions in text. In Fifth International Workshop on Computational Semantics (IWCS-5), pages 1–11, 2003. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.161.8972&rep=rep1&type=pdf.

Preethi Raghavan, Eric Fosler-Lussier, Noemie Elhadad, and Albert M. Lai. Cross-narrative temporal ordering of medical events. ACL, pages 998–1008, 2014.

Lance Ramshaw and Mitch Marcus. Text Chunking Using Transformation-Based Learning. In David Yarovsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 82–94, Somerset, New Jersey, 1995. Association for Computational Linguistics. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.2725.

R Bharat Rao, Sriram Krishnan, and Radu Stefan Niculescu. Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1):3–10, June 2006. ISSN 19310145. doi: 10.1145/1147234.1147236. URL http://doi.acm.org/10.1145/1147234.1147236 http://portal.acm.org/citation.cfm?doid=1147234.1147236.

C J Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979. ISBN 0408709294.

Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the National Conference on Artificial Intelligence, pages 811–811, 1993. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.708&rep=rep1&type=pdf.

Kirk Roberts, Bryan Rink, and Sanda M Harabagiu. A flexible framework for recognizing events, temporal expressions, and temporal relations in clinical text. Journal of the American Medical Informatics Association, 20(5):867–875, September 2013. doi: 10.1136/amiajnl-2013-001619. URL http://dx.doi.org/10.1136/amiajnl-2013-001619.

Patricia L Robertson, Karin M Muraszko, Emiko J Holmes, Richard Sposto, Roger J Packer, Amar Gajjar, Mark S Dias, and Jeffrey C Allen. Incidence and severity of postoperative cerebellar mutism syndrome in children with medulloblastoma: a prospective study by the Children’s Oncology Group. Journal of neurosurgery, 105(6 Suppl):444–51, December 2006. ISSN 0022-3085. doi: 10.3171/ped.2006.105.6.444. URL http://www.ncbi.nlm.nih.gov/pubmed/17184075.

Jo Rycroft-Malone, Marina Fontenla, Kate Seers, and Debra Bick. Protocol-based care: the standardisation of decision-making? Journal of clinical nursing, 18(10):1490–500, May 2009. ISSN 1365-2702. doi: 10.1111/j.1365-2702.2008.02605.x. URL http://www.ncbi.nlm.nih.gov/pubmed/19413539.

E F T K Sang and S Buchholz. Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of the Conference on Computational Natural Language Learning: CoNLL-2000, pages 151–153, Lisbon, Portugal, 2000.

Estela Saquete Boro. ID 392:TERSEO + T2T3 Transducer. A systems for Recognizing and Normalizing TIMEX3. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 317–320, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1070.

Roser Saurí, Jessica Littman, Robert Gaizauskas, Andrea Setzer, and James Pustejovsky. TimeML Annotation Guidelines Version 1.1, 2004. URL http://timeml.org/site/publications/timeMLdocs/annguide_1.1.pdf.

Roser Saurí, Jessica Littman, Robert Gaizauskas, Andrea Setzer, and James Pustejovsky. TimeML Annotation Guidelines, Version 1.2.1, 2006. URL http://timeml.org/site/publications/timeMLdocs/annguide_1.2.1.pdf.

Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association : JAMIA, 17(5):507–513, September 2010. ISSN 1527-974X. doi: 10.1136/jamia.2009.001560. URL http://dx.doi.org/10.1136/jamia.2009.001560.

Frank Schilder and C Habel. From Temporal Expressions to Temporal Information: Semantic Tagging of News Messages. In Proceedings of ACL 2001 workshop on temporal and spatial information processing, pages 65–72, Toulouse, France, 2001. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.1756.

M J Schuemie, R Jelier, and J A Kors. Peregrine: Lightweight gene name normalization by dictionary lookup. In Proceedings of the BioCreAtIvE II Workshop 2007, pages 131–133, Madrid, Spain, 2007.

Burr Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191–3192, 2005.

Andrea Setzer. Temporal Information in Newswire Articles: An Annotation Scheme and Corpus Study. PhD thesis, The University of Sheffield, 2001.

P. G Shekelle, S. H Woolf, M. Eccles, and J. Grimshaw. Clinical guidelines: Developing guidelines. BMJ, 318(7183):593–596, February 1999. ISSN 0959-8138. doi: 10.1136/bmj.318.7183.593. URL http://www.bmj.com/cgi/doi/10.1136/bmj.318.7183.593.

Sunghwan Sohn, Kavishwar B Wagholikar, Dingcheng Li, Siddhartha R Jonnalagadda, Cui Tao, Ravikumar Komandur Elayavilli, and Hongfang Liu. Comprehensive temporal information detection from clinical text: medical events, time, and TLINK identification. Journal of the American Medical Informatics Association, 20(5):836–842, April 2013. doi: 10.1136/amiajnl-2013-001622. URL http://dx.doi.org/10.1136/amiajnl-2013-001622.

Irena Spasic, Sophia Ananiadou, John McNaught, and Anand Kumar. Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, 6:239–251, 2005. ISSN 14675463. doi: 10.1093/bib/6.3.239.

Irena Spasic, Farzaneh Sarafraz, John A Keane, and Goran Nenadic. Medication information extraction with linguistic pattern matching and semantic rules. Journal of the American Medical Informatics Association : JAMIA, 17(5):532–5, January 2010. ISSN 1527-974X. doi: 10.1136/jamia.2010.003657. URL http://jamia.bmj.com/cgi/content/long/17/5/532.

C R Spiroch, D Walsh, P Mazanec, and K A Nelson. Ask the patient: a semi-structured interview study of quality of life in advanced cancer. The American journal of hospice & palliative care, 17(4):235–40, 2000. ISSN 1049-9091. URL http://www.ncbi.nlm.nih.gov/pubmed/11883798.

Eva Steliarova-Foucher, Charles Stiller, Peter Kaatsch, Franco Berrino, Jan-Willem Coebergh, Brigitte Lacour, and Max Parkin. Geographical patterns and time trends of cancer incidence and survival among children and adolescents in Europe since the 1970s (the ACCIS project): an epidemiological study. Lancet, 364(9451):2097–105, 2004. ISSN 1474-547X. doi: 10.1016/S0140-6736(04)17550-8. URL http://www.ncbi.nlm.nih.gov/pubmed/15589307.

Jannik Strötgen and Michael Gertz. HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 321–324, Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1071.

Jannik Strötgen, Julian Zell, and Michael Gertz. HeidelTime: Tuning English and Developing Spanish Resources for TempEval-3. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 15–19, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S13-2003.

William F Styler, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, and Others. Temporal Annotation in the Clinical Domain. Transactions of the Association for Computational Linguistics, 2:143–154, 2014. URL http://www.transacl.org/wp-content/uploads/2014/04/47.pdf.

Nichalin Suakkaphong, Zhu Zhang, and Hsinchun Chen. Disease named entity recognition using semisupervised learning and conditional random fields. Journal of the American Society for Information Science and Technology, 62:727–737, 2011. ISSN 15322882. doi: 10.1002/asi.21488.

Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association : JAMIA, 20(5):806–13, 2013a. ISSN 1527-974X. doi: 10.1136/amiajnl-2013-001628. URL http://www.ncbi.nlm.nih.gov/pubmed/23564629.

Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. Temporal reasoning over clinical text: the state of the art. Journal of the American Medical Informatics Association, 20(5):814–819, September 2013b. ISSN 1527-974X. doi: 10.1136/amiajnl-2013-001760. URL http://dx.doi.org/10.1136/amiajnl-2013-001760.

Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association, 20(5):806–813, September 2013c. doi: 10.1136/amiajnl-2013-001628. URL http://dx.doi.org/10.1136/amiajnl-2013-001628.

Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning, chapter 4, pages 93–126. MIT Press, Cambridge, 2007. ISBN 9780262072885. URL http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf.

David R Sutton and John Fox. The syntax and semantics of the PROforma guideline modeling language. Journal of the American Medical Informatics Association : JAMIA, 10(5):433–43, 2003. ISSN 1067-5027. doi: 10.1197/jamia.M1264. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=212780&tool=pmcentrez&rendertype=abstract.

Michael Tanenblatt, Anni Coden, and Igor Sominsky. The ConceptMapper Approach to Named Entity Recognition. In Proceedings of the Seventh conference on International Language Resources and Evaluation LREC10, pages 546–551, 2010. ISBN 2951740867.

Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen, Joshua C Denny, and Hua Xu. A hybrid system for temporal information extraction from clinical text. Journal of the American Medical Informatics Association : JAMIA, 20(5):828–35, September 2013. ISSN 1527-974X. doi: 10.1136/amiajnl-2013-001635. URL http://dx.doi.org/10.1136/amiajnl-2013-001635 http://www.ncbi.nlm.nih.gov/pubmed/23571849.

Pasi Tapanainen and Timo Järvinen. A Non-Projective Dependency Parser. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64–71, 1997. doi: 10.1.1.52.9681. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9681.

Kay Chai Peter Tay, Chuen Chai Dennis Seow, Chunxiang Xiao, Hui Min Julian Lee, Helen Fk Chiu, and Sally Wai-Chi Chan. Structured interviews examining the burden, coping, self-efficacy and quality of life among family caregivers of persons with dementia in Singapore. Dementia (London, England), February 2014. ISSN 1741-2684. doi: 10.1177/1471301214522047. URL http://www.ncbi.nlm.nih.gov/pubmed/24535819.

V R Taylor. Measuring healthy days: population assessment of health-related quality of life, November 2000. URL http://stacks.cdc.gov/view/cdc/6406.

S Teufel and M Moens. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28:409–445, 2002. doi: 10.1162/089120102762671936. URL http://dx.doi.org/10.1162/089120102762671936.

Sirithana Tiranardvanich. Clinical Dashboard: Representing and Visualising Patient Timelines. MSc dissertation, The University of Manchester, 2014.

Manabu Torii, Kavishwar Wagholikar, and Hongfang Liu. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association : JAMIA, 18:580–587, 2011. ISSN 1527-974X. doi: 10.1136/amiajnl-2011-000155.

George W Torrance, David H Feeny, William J Furlong, Ronald D Barr, Yueming Zhang, and Quinan Wang. Multiattribute utility function for a comprehensive health status classification system. Health Utilities Index Mark 2. Medical care, 34:702–722, 1996. ISSN 0025-7079.

Yoshimasa Tsuruoka, John McNaught, Junichi Tsujii, and Sophia Ananiadou. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 23(20):2768–2774, 2007. doi: 10.1093/bioinformatics/btm393. URL http://bioinformatics.oxfordjournals.org/content/23/20/2768.abstract.

S Uckun. Instantiating and monitoring treatment protocols. Proceedings of the Annual Symposium on Computer Application in Medical Care, pages 689–93, January 1994. ISSN 0195-4210. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2247972&tool=pmcentrez&rendertype=abstract.

Esko Ukkonen. Approximate String-matching with Q-grams and Maximal Matches. Theor. Comput. Sci., 92(1):191–211, January 1992. doi: 10.1016/0304-3975(92)90143-4. URL http://dx.doi.org/10.1016/0304-3975(92)90143-4.

Özlem Uzuner, Imre Solti, and Eithon Cadag. Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17(5):514–518, September 2010. ISSN 1527-974X. doi: 10.1136/jamia.2010.003947. URL http://dx.doi.org/10.1136/jamia.2010.003947.

Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, September 2011. doi: 10.1136/amiajnl-2011-000203. URL http://dx.doi.org/10.1136/amiajnl-2011-000203.

Naushad UzZaman and James Allen. TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 276–283, Uppsala, Sweden, July 2010a. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S10-1062.

Naushad UzZaman and James F Allen. TRIPS and TRIOS system for TempEval-2: Extracting temporal information from text. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval ’10, pages 276–283, Stroudsburg, PA, USA, 2010b. Association for Computational Linguistics. URL http://portal.acm.org/citation.cfm?id=1859726.

Naushad UzZaman, Hector Llorens, Leon Derczynski, Marc Verhagen, James Allen, and James Pustejovsky. SemEval-2013 Task 1: TEMPEVAL-3: Evaluating Time Expressions, Events, and Temporal Relations, 2013.

C J Van Rijsbergen. Foundation of Evaluation. Journal of Documentation, 30(4): 365–373, 1974.

James W Varni, Sandra A Sherman, Tasha M Burwinkle, Paige E Dickinson, and Pamela Dixon. The PedsQL Family Impact Module: preliminary reliability and validity. Health and quality of life outcomes, 2:55, 2004. ISSN 1477-7525. doi: 10.1186/1477-7525-2-55.

Marc Verhagen and James Pustejovsky. Temporal processing with the TARSQI toolkit. In 22nd Intern. Conference on Computational Linguistics: Demonstration Papers, pages 189–192, 2008. ISBN 978-2-9517408-7-7. URL http://dl.acm.org/citation.cfm?id=1599288.1599300.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 75–80, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S/S07/S07-1014.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval ’10, pages 57–62, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://portal.acm.org/citation.cfm?id=1859674.

P Vossen. EuroWordNet: Building a Multilingual Database with WordNets in 8 European Languages, 2000.

Evelyn Ward, Monica Hopkins, Lesley Arbuckle, Nicola Williams, Lynette Forsythe, Sylwia Bujkiewicz, Barry Pizer, Edward Estlin, and Susan Picton. Nutritional problems in children treated for medulloblastoma: implications for enteral nutrition support. Pediatric blood & cancer, 53(4):570–5, October 2009. ISSN 1545-5017. doi: 10.1002/pbc.22092. URL http://www.ncbi.nlm.nih.gov/pubmed/19530236.

Ben Wellner, José Castaño, and James Pustejovsky. Adaptive String Similarity Metrics for Biomedical Reference Resolution. In Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pages 9–16, 2005.

WHO. Constitution of the World Health Organization. American Journal of Public Health Nations Health, 36(11):1315–1323, November 1946.

Sophie Wilne, Jacqueline Collier, Colin Kennedy, Karin Koller, Richard Grundy, and David Walker. Presentation of childhood CNS tumours: a systematic review and meta-analysis. The Lancet. Oncology, 8(8):685–95, August 2007. ISSN 1470-2045. doi: 10.1016/S1470-2045(07)70207-3. URL http://www.ncbi.nlm.nih.gov/pubmed/17644483.

Ira B Wilson and Paul D Cleary. Linking clinical variables with health-related quality of life. A conceptual model of patient outcomes. JAMA : the journal of the American Medical Association, 273:59–65, 1995. ISSN 0098-7484. doi: 10.1001/jama.1995.03520250075037.

William E Winkler. The state of record linkage and current research problems, 1999. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.4336.

Hua Xu, Shane P Stenner, Son Doan, Kevin B Johnson, Lemuel R Waitman, and Joshua C Denny. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association, 17(1):19–24, January 2010. ISSN 1527-974X. doi: 10.1197/jamia.m3378. URL http://dx.doi.org/10.1197/jamia.m3378 http://jamia.bmj.com/cgi/content/long/17/1/19.

Yan Xu, Yining Wang, Tianren Liu, Junichi Tsujii, and Eric I Chang. An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5):849–858, September 2013. doi: 10.1136/amiajnl-2012-001607. URL http://dx.doi.org/10.1136/amiajnl-2012-001607.

Hui Yang. Automatic extraction of medication information from medical discharge summaries. Journal of the American Medical Informatics Association : JAMIA, 17(5):545–8, January 2010. ISSN 1527-974X. doi: 10.1136/jamia.2010.003863. URL http://jamia.bmj.com/cgi/content/long/17/5/545.

Hui Yang, Irena Spasic, John A Keane, and Goran Nenadic. A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries. Journal of the American Medical Informatics Association, 16(4):596–600, July 2009. ISSN 1067-5027. doi: 10.1197/jamia.m3096. URL http://dx.doi.org/10.1197/jamia.m3096.

Lonnie K Zeltzer, Christopher Recklitis, David Buchbinder, Bradley Zebrack, Jacqueline Casillas, Jennie C I Tsao, Qian Lu, and Kevin Krull. Psychological status in childhood cancer survivors: a report from the Childhood Cancer Survivor Study. Journal of clinical oncology : official journal of the American Society of Clinical Oncology, 27(14):2396–404, May 2009. ISSN 1527-7755. doi: 10.1200/JCO.2008.21.1433. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2677925&tool=pmcentrez&rendertype=abstract.

Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3:1–130, 2009. ISSN 1939-4608. doi: 10.2200/S00196ED1V01Y200906AIM006.

A S Zigmond and R P Snaith. The hospital anxiety and depression scale. Acta psychiatrica Scandinavica, 67(6):361–370, 1983. ISSN 0001-690X. doi: 10.1111/j.1600-0447.1983.tb09716.x.

Appendix A

Background

A.1 A sample clinical note

This sample clinical narrative has been reproduced from MTSamples (http://www.mtsamples.com).

Figure A.1: A sample clinical note


A.2 Token level sequence label modelling

In the sentence ‘The patient denies any other symptoms .’, the token sequence ‘any other symptoms’ is considered a NE (i.e., Problem). We may model this sequence using an IO (two labels) or BIO (three labels) schema, which indicates whether a given token is at the beginning (B), inside (I), or outside (O) of the NE. The W-BIO model distinguishes multi-token sequences (modelled using BIO) from sequences made up of a single token (labelled W). Hence, the aforementioned NE sequence can be modelled as follows (with the remaining tokens in the above sentence being assigned ‘O’ labels; because the NE spans three tokens, its W-BIO labels coincide with its BIO labels):

• IO: any/I other/I symptoms/I
• BIO: any/B other/I symptoms/I
• W-BIO: any/B other/I symptoms/I
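The three labelling schemas can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in this thesis; the function name label_entity and its (start, end) span arguments are hypothetical, and whitespace tokenisation is assumed.

```python
from typing import List


def label_entity(tokens: List[str], start: int, end: int,
                 scheme: str = "BIO") -> List[str]:
    """Label tokens[start:end] as one NE; every other token gets 'O'.

    Hypothetical helper for illustration: `start` is inclusive and
    `end` is exclusive, as in Python slicing.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("entity span must be non-empty and inside the sentence")
    if scheme not in ("IO", "BIO", "W-BIO"):
        raise ValueError("unknown scheme: " + scheme)

    labels = ["O"] * len(tokens)
    single_token = (end - start == 1)
    for i in range(start, end):
        if scheme == "IO":
            labels[i] = "I"                      # no explicit beginning label
        elif scheme == "W-BIO" and single_token:
            labels[i] = "W"                      # single-token entity
        else:
            labels[i] = "B" if i == start else "I"
    return labels


tokens = "The patient denies any other symptoms .".split()
for scheme in ("IO", "BIO", "W-BIO"):
    labels = label_entity(tokens, 3, 6, scheme)
    print(scheme, " ".join(f"{t}/{l}" for t, l in zip(tokens, labels)))
```

For the three-token NE above, BIO and W-BIO produce the same labels; the two schemas differ only when an entity consists of a single token, which W-BIO labels W.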

A.3 Transitive closure

The transitive closure of a given set of relations makes explicit all implicit relations that follow from transitivity. A relation R on a set X is transitive if:

∀a, b, c ∈ X : (aRb ∧ bRc) ⇒ aRc    (A.1)

For example, in Table A.1 the letters A, B, and C represent different EVENTs, and the arrows ‘→’, ‘←’, and ‘↔’ represent the temporal relations ‘before’, ‘after’, and ‘overlap’ respectively. Hence, the TLINKs EVENT A before EVENT B, EVENT B after EVENT C, and EVENT A overlap EVENT C are represented as A → B, B ← C, and A ↔ C respectively.

Table A.1: Transitive relations. Inspired by Tang et al. (2013); this table shows a number of examples of binary transitive relations.

If A → B and B → C, then A → C
If A ← B and B ← C, then A ← C
If A ↔ B and B ↔ C, then A ↔ C
If A → B and B ↔ C, then A → C
If A → B and A ↔ C, then C → B
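The rules in Table A.1 can be applied repeatedly until no new TLINK is produced. The Python sketch below is illustrative only and is not the implementation used in this thesis: the function name transitive_closure and the (source, relation, target) triple format are assumptions. It normalises ‘after’ to ‘before’ with swapped arguments, so that the second row of the table is covered by the first, and stores ‘overlap’ symmetrically.

```python
from itertools import product


def transitive_closure(tlinks):
    """Apply the rules of Table A.1 until a fixed point is reached.

    `tlinks` is an iterable of (source, relation, target) triples with
    relation in {'before', 'after', 'overlap'} (assumed format).
    Returns (before, overlap) as sets of ordered pairs.
    """
    before, overlap = set(), set()
    for a, rel, b in tlinks:
        if rel == "before":
            before.add((a, b))
        elif rel == "after":            # A after B is stored as B before A
            before.add((b, a))
        elif rel == "overlap":          # overlap is stored in both directions
            overlap.update({(a, b), (b, a)})
        else:
            raise ValueError("unknown relation: " + rel)

    changed = True
    while changed:
        new_before, new_overlap = set(before), set(overlap)
        for (a, b), (c, d) in product(before, before):
            if b == c:                  # A -> B and B -> C, then A -> C
                new_before.add((a, d))
        for (a, b), (c, d) in product(overlap, overlap):
            if b == c and a != d:       # A <-> B and B <-> C, then A <-> C
                new_overlap.add((a, d))
        for (a, b), (c, d) in product(before, overlap):
            if b == c:                  # A -> B and B <-> C, then A -> C
                new_before.add((a, d))
            if a == c:                  # A -> B and A <-> C, then C -> B
                new_before.add((d, b))
        changed = (new_before != before) or (new_overlap != overlap)
        before, overlap = new_before, new_overlap
    return before, overlap


before, overlap = transitive_closure([
    ("A", "before", "B"),
    ("B", "overlap", "C"),
    ("D", "after", "C"),
])
print(sorted(before))
print(sorted(overlap))
```

In this example the closure adds A → C, B → D, and A → D to the two explicit ‘before’ links. The sketch does not check the input for contradictions (e.g., A → B together with B → A); a practical system would need to detect these separately.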

A.4 HUI-2 classification system

HUI2 Multi-Attribute Health Status Classification System

Attribute / Level / Description*

Sensation
  1  Able to see, hear, and speak normally for age.
  2  Requires equipment to see or hear or speak.
  3  Sees, hears, or speaks with limitations even with equipment.
  4  Blind, deaf, or mute.

Mobility
  1  Able to walk, bend, lift, jump, and run normally for age.
  2  Walks, bends, lifts, jumps, or runs with some limitations but does not require help.
  3  Requires mechanical equipment (such as canes, crutches, braces, or wheelchair) to walk or get around independently.
  4  Requires the help of another person to walk or get around and requires mechanical equipment as well.
  5  Unable to control or use arms and legs.

Emotion
  1  Generally happy and free from worry.
  2  Occasionally fretful, angry, irritable, anxious, depressed, or suffering “night terrors”.
  3  Often fretful, angry, irritable, anxious, depressed, or suffering “night terrors”.
  4  Almost always fretful, angry, irritable, anxious, depressed.
  5  Extremely fretful, angry, irritable, anxious, or depressed usually requiring hospitalisation or psychiatric institutional care.

Cognition
  1  Learns and remembers school work normally for age.
  2  Learns and remembers school work more slowly than classmates as judged by parents and/or teachers.
  3  Learns and remembers very slowly and usually requires special educational assistance.
  4  Unable to learn and remember.

Self-Care
  1  Eats, bathes, dresses, and uses the toilet normally for age.
  2  Eats, bathes, dresses, or uses the toilet independently with difficulty.
  3  Requires mechanical equipment to eat, bathe, dress, or use the toilet independently.
  4  Requires the help of another person to eat, bathe, dress, or use the toilet.

Pain
  1  Free of pain and discomfort.
  2  Occasional pain. Discomfort relieved by non-prescription drugs or self-control activity without disruption of normal activities.
  3  Frequent pain. Discomfort relieved by oral medicines with occasional disruption of normal activities.
  4  Frequent pain; frequent disruption of normal activities. Discomfort requires prescription narcotics for relief.
  5  Severe pain. Pain not relieved by drugs and constantly disrupts normal activities.

Fertility
  1  Able to have children with a fertile spouse.
  2  Difficulty in having children with a fertile spouse.
  3  Unable to have children with a fertile spouse.

Legend: * - Level descriptions are worded here exactly as presented to respondents of the HUI2 preference survey.

Note: Fertility attribute is optional and not part of the standard HUI23-15Q nor HUI23-40Q questionnaires.

The HUI-2 is a multi-attribute health status classification system. This classification system is used to map questionnaire responses to specific levels within the classification schema using complementary decision tables and coding algorithms (not shown) that are part of the HUI2 manual. This schema and additional tables have been reproduced from http://www.healthutilities.com/ and were originally derived from Torrance et al. (1996).

Figure A.2: Health Utilities Index Mark 2

Appendix B

Extraction of Health-related Concepts

B.1 NER annotation examples

The following examples are reproduced from the official 2010 i2b2 concept annotation guidelines¹ (Uzuner et al., 2011).

Problem

Examples of Medical Problems to annotate (are highlighted in bold):

(a) Phrases that name a disease, syndrome, sign, or symptom

• the wound was noted to be clean with mild serious drainage
• an echocardiogram revealed a pericardial effusion and possible tamponade clinically

(b) Mental or behavioural status observations

• his mental status changes remained stable
• she did well except for some episodes of confusion

(c) Virus and bacterium

• blood cultures were positive for S. Veridans
• procured sample to rule out MRSA

¹ https://www.i2b2.org/NLP/Relations/assets/Concept%20Annotation%20Guideline.pdf


(d) Injury

• patient arrived with a broken arm
• examined the deep gash in her head

(e) Abnormalities

• the defects were found
• chest x-ray revealed an abnormality

(f) Test results explicitly stated to be abnormal

• low blood pressure
• moderately decreased ejection fraction

Treatment

Examples of Treatments to annotate (are highlighted in bold):

(a) Medication names, brand names and generics as well as collective names for groups of medications.

• Lasix 20mg b.i.d. by mouth
• she was started on IV Ciprofloxacin

(b) Biological substances

• the patient remained on IV hydration therapy
• he did not require a transfusion

(c) Drug and treatment delivery devices

• the patient uses her respirator at night
• the patient remained hemodynamically stable on the ventilator

(d) Procedures and devices or hardware involved in those procedures

• the patient had a bronchoalveolar lavage performed
• significant for radiation therapy after his surgery

(e) General terms referring to the patient’s treatments

• his asthma medication
• her treatment

Test

Examples of Tests to annotate (are highlighted in bold):

(a) Procedures performed on the patient

• a lung biopsy was performed, revealing chorio carcinoma pathologically
• chest x-ray revealed clear lungs

(b) Panels and tests run on patient body fluids

• his urinalysis showed 10-20 granular casts
• blood cultures were positive for S. Veridans

(c) Physiologic measures and vital signs

• blood pressure 120/80
• pulse 40

(d) Examinations and evaluations of the patient

• cardiac exam revealed an irregular rate
• rectal exam was heme negative

B.2 HrQoL NER

HrQoL concept-specific contextual cues used by the context reasoner are listed in Table B.1. In addition, the full lists of adjectives and anatomical/spatial terms used by the boundary adjustment method are given in Table B.2.

Table B.1: Contextual reasoner exclusion cues
This table gives the exclusion cues used by the concept-specific context reasoners. Specifically, if a given HrQoL candidate term has a given cue in its context, the candidate term will not be annotated as an HrQoL concept.

HrQoL concept   Cues
Cognition       i; am; to
Sensory         test; clinic; place
School          bus; dental; hospital; review
Activity        pool; association; facility; teacher; department
Home            mother; father; brother; sister; uncle; aunt; well; centre; fair
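The exclusion rule described for Table B.1 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation (which was built on GATE and JAPE): the function names, the regular-expression tokeniser, and the two-token context window are all assumptions, since the appendix does not state how wide the context is.

```python
import re

# Exclusion cues per HrQoL concept, copied from Table B.1.
EXCLUSION_CUES = {
    "Cognition": {"i", "am", "to"},
    "Sensory": {"test", "clinic", "place"},
    "School": {"bus", "dental", "hospital", "review"},
    "Activity": {"pool", "association", "facility", "teacher", "department"},
    "Home": {"mother", "father", "brother", "sister", "uncle", "aunt",
             "well", "centre", "fair"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (assumed tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def keep_candidate(concept, tokens, start, end, window=2):
    """Return False if an exclusion cue for `concept` occurs within
    `window` tokens to the left or right of the candidate tokens[start:end].

    The window size is a hypothetical parameter chosen for illustration.
    """
    cues = EXCLUSION_CUES.get(concept, set())
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return not any(tok in cues for tok in left + right)


tokens = tokenize("He takes the school bus every morning")
# "school" is token 3; "bus" follows it, so the candidate is rejected.
print(keep_candidate("School", tokens, 3, 4))
```

A candidate such as "school" in "doing well at school" is kept under this sketch, because "well" is a cue for the Home concept only.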

Table B.2: Boundary adjustment: adjectives
This table gives two different lists of (i) adjectives, and (ii) anatomical/spatial terms used by the boundary adjustment method to extend concept boundaries.

Adjective     Anatomical/Spatial
normal        lower limb
basically     arms
extremely     bilateral
fairly        left arm
fine          left leg
good          left sided
great         legs
highly        neck
less          right arm
mild          right leg
minimal       right sided
minor         truncal
persistent    upper limb
poor
pretty
quite
reasonably
residual
severe
stable
very
worsening

Appendix C

Tools and Resources

C.1 NLP pre-processing

The actual components used for NLP pre-processing for the various IE components developed (i.e., TERN, NERC and TLINK) are:

1. GATE ANNIE Tokeniser
2. GATE ANNIE Sentence splitter
3. Porter’s stemmer (Porter, 1980)¹
4. OpenNLP POS tagger (with cTAKES model)²
5. OpenNLP Chunker (with cTAKES model)

C.2 Implementation

• The GATE framework was used as the underlying building block for our methods (i.e., to handle/store/represent documents and annotations), and JAPE grammar was used as the rule formalism.

• The implementation of the sequence labelling algorithm used was CRF++ 0.58³. The (CRF++) Java SWIG interface was used to port the algorithm.

¹ tartarus.org/martin/PorterStemmer/
² http://ctakes.apache.org/
³ http://code.google.com/p/crfpp/

Appendix D

HrQoL Annotation Guideline

D.1 Introduction

HrQoL is a subjective concept. What one person considers to affect their quality of life may be a triviality for another. This causes some ambiguity in objectively annotating relevant HrQoL concepts. Nevertheless, we are interested in concepts that are measured in various HrQoL instruments. As such, we have combined multiple disease-specific (i.e., brain tumour) HrQoL instruments (e.g., PedsQL (Palmer et al., 2007; Varni et al., 2004), HADS (Zigmond and Snaith, 1983), QLQ-30 (Aaronson et al., 1993), PI-ED (O’Connor et al., 2010), HUI2 and HUI3 (Furlong et al., 2001)) as well as the revised (Wilson and Cleary, 1995) HrQoL conceptual model (Ferrans et al., 2005) in order to create a more holistic HrQoL classification model. We note that this combined model has been developed for computational purposes only, and that we do not aim to validate it as an HrQoL instrument, as this has been done extensively for the instruments mentioned.

A general rule of thumb for deciding whether something is an HrQoL concept is that concepts that are unknown to the individual, or that require medical diagnostic tests to determine, should not be considered HrQoL concepts. This view is generally agreed in the literature (Ferrans et al., 2005; Taylor, 2000). (For further conceptualisation of what constitutes an HrQoL expression to be annotated, read the remainder of this document.)

For example, hypertension, high blood pressure, neuropathy, or low blood count would require some medical diagnostic test to establish the problem, and would therefore (typically) be unknown to the person. On the other hand, their potential signs or symptoms, such as tiredness, chronic pain, balance problems, or bladder problems, could be physically experienced or observed by the patient him/herself or a third party, and are thus valid HrQoL concepts (given that they are covered by the conceptual model in Figure D.1 and its description in Section D.2).

Figure D.1: The combined HrQoL schema

D.1.1 HrQoL schema description

HRQoL concepts typically describe a patient’s functional ability/disability in relation to a HRQoL dimension/theme.

• HRQoL concepts are terms or descriptive expressions of, e.g., patients’ ability/disability with specific concepts described in this section and/or Figure D.1.

D.1.2 HrQoL annotation

Each HRQoL annotation has four attributes:¹

• type [sensory-pain | physical-functioning | emotional-functioning | cognitive-functioning | other | home-family | school-functioning | activity | social-functioning]

¹ The severity attribute was excluded in this thesis.

• negation [negated|]

• source [subjective | objective]

• severity [none | mild | moderate | severe | na]
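As an illustration, the four attributes listed above could be held in a small validated record. The sketch below is an assumption about representation only: the class name, the field names, and the use of a boolean for negation are hypothetical, and the thesis itself stores annotations in GATE.

```python
from dataclasses import dataclass

# Allowed values, taken from the attribute list in the guideline.
TYPES = {
    "sensory-pain", "physical-functioning", "emotional-functioning",
    "cognitive-functioning", "other", "home-family",
    "school-functioning", "activity", "social-functioning",
}
SOURCES = {"subjective", "objective"}
SEVERITIES = {"none", "mild", "moderate", "severe", "na"}


@dataclass(frozen=True)
class HrQoLAnnotation:
    """One annotated HrQoL expression (hypothetical container)."""
    text: str
    type: str
    source: str
    severity: str = "na"
    # The guideline records 'negated' only when a term is negated,
    # so a plain boolean is used here.
    negated: bool = False

    def __post_init__(self):
        if self.type not in TYPES:
            raise ValueError(f"unknown concept type: {self.type}")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


ann = HrQoLAnnotation("double vision", "sensory-pain", "subjective",
                      severity="na", negated=True)
print(ann)
```

Validating in `__post_init__` means a misspelt type or severity value fails at construction time rather than surfacing later during evaluation.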

Attribute: type
An HrQoL expression may be categorised into one and only one of the 9 concept types.

Attribute: negation
The negation attribute represents whether the concept is currently negated or not. Hence, medical history and past events that are no longer relevant should all be negated. The attribute and its value ‘negated’ are only included if a term is negated. For example, in the sentence ‘The patient reported fatigue, but denied double vision’, fatigue should not have a negation attribute, while double vision is negated.

Attribute: source
HrQoL measures are typically self-reported. However, depending on the dataset to be annotated, we may have to annotate text written/dictated by a third party. Therefore, we are interested in the source describing the concept. For example, in the sentence ‘The patient reported fatigue, but denied double vision’, fatigue and double vision both have subjective as their source value. However, in the sentence ‘The patient has problems with fatigue and double vision’, the concepts are not explicitly stated to have been expressed by the patient; thus, both would have an objective source.

Attribute: severity
HrQoL measures are typically assessed by degree of severity represented on a Likert scale (i.e., 1 to 5; 1 for complete ability and 5 for complete disability). While it would be impractical to attempt to do the same, we still want to convey some information about severity.

General guidelines to adhere to are:

• If there is no adjective present the attribute value ought to be ‘moderate’. However, exceptions do exist: some concepts may inherently convey severity without the presence of adjectives. For example, ‘loss of vision’ or ‘blind’ already conveys the extreme spectrum of severity.

• Severity may not be applicable to some concepts: e.g., relationship status (i.e., single, married, etc.).

• In addition, if the concept is negated (i.e., negation = negated) this means that severity is not applicable, i.e., the value of the attribute ought to be ‘NA’.
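The three severity guidelines above can be read as a small decision procedure. The sketch below is one possible reading: the function name, the `applicable` flag, and the adjective-to-severity mapping are assumptions introduced for illustration (the guideline names no such mapping), while the two inherently severe terms are the examples given in the text.

```python
# Terms the guideline names as inherently conveying severity.
INHERENTLY_SEVERE = {"loss of vision", "blind"}

# Hypothetical mapping from adjectives to severity values; the guideline
# does not list one, so this is an illustrative assumption.
ADJECTIVE_SEVERITY = {"mild": "mild", "severe": "severe"}


def assign_severity(term, negated=False, applicable=True):
    """Assign a severity value following the general guidelines.

    `applicable` is False for concepts where severity makes no sense
    (e.g., relationship status).
    """
    if negated or not applicable:
        return "na"
    lowered = term.lower()
    if lowered in INHERENTLY_SEVERE:
        return "severe"
    for word in lowered.split():
        if word in ADJECTIVE_SEVERITY:
            return ADJECTIVE_SEVERITY[word]
    # No adjective present: default to 'moderate'.
    return "moderate"


for term in ("back ache", "mild headache", "loss of vision"):
    print(term, "->", assign_severity(term))
```

With these assumptions the function gives the values used for the corresponding terms in Table D.2 (moderate, mild, and severe respectively).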

Explicit mentions of HrQoL
HrQoL concepts should be explicitly stated (i.e., we cannot infer that, due to a motor speech disorder, the patient’s school-functioning is affected). Exceptions do exist: we advise the reader to study Section D.3 for ambiguous cases (and indirect references).

HrQoL Concept Name vs HrQoL Concept Description
Concepts to annotate describe ability/disability of HrQoL dimensions (such as Physical functioning) and their related sub-concepts/terms. This includes terms/concepts that are considered descriptions of problems/disorders as well as the names of the problems/disorders. For example, consider the following HrQoL terms:

• speaking difficulties (description of problem) vs dysarthria (potential name of disorder: a motor speech disorder);

• balance and coordination problems (description) vs ataxia (potential name of disorder: a neurological movement disorder);

• occasional involuntary arm movements (description) vs Huntington’s disease or myoclonus (potential names of disorders);

Syntactical features of HrQoL concepts
HrQoL concepts are typically made up of noun phrases (NP) or (NP preposition NP), including longer descriptive phrases, and (rarely) of verb phrases (VP) or verbs (e.g., smokes, drinks, unsteady when walking, etc.). HrQoL concepts in patient narratives tend to be longer descriptive phrases.

D.2 HrQoL schema

• Physical functioning and speech: i.e., terms describing (ability/disability with) motor skills (e.g., movement, balance, coordination), difficulties with self-care, speech, strength, and agility. This includes, but is not limited to:
  - e.g., balance, walking, arm/hand coordination, involuntary movements, etc.
  - e.g., neurological movement disorders, e.g., dystonia, myoclonus, essential tremor, etc.
  - e.g., stuttering, difficulty speaking, dysarthria, etc.

• Sensory and pain: i.e., terms describing (ability/disability with) vision, hearing, sensation, taste, smell, and pain. This includes, but is not limited to:
  - e.g., back pain, muscle cramps, chest pain, chronic pain, headache, stomachache, backache, etc.
  - e.g., loss of vision, blindness, reduced vision, poor vision, double vision, cross-eyed vision, etc.
  - e.g., deafness, partial deafness, reduced hearing, etc.
  - e.g., sensation problems (e.g., numbness and tingling)

• Other well-being: terms include concepts describing signs or symptoms / adverse effects of disease and treatment, problems with fertility, etc. This includes, but is not limited to:
  - e.g., appetite, nausea, vomiting, rash, hair-loss, toxicity, fatigue, seizures/fits, convulsion, weakness/numbness of body parts, difficulty swallowing, shortness of breath, chills, fevers, dizziness, tender, nontender, vertigo, burns, etc.
  - e.g., fertility, e.g., ability/disability to reproduce

EXAMPLES from clinical narratives:

Example 1: He denies any fevers, chills, or night sweats. No lymphadenopathy. No nausea or vomiting. No change in bowel or bladder habits.

Table D.1: Example 1 annotations

HrQoL term                concept   negation   source       severity
fevers                    other     negated    subjective   na
chills                    other     negated    subjective   na
night sweats              other     negated    subjective   na
nausea                    other     negated    objective    na
vomiting                  other     negated    objective    na
bowel or bladder habits   other     negated    subjective   na

Example 2: HISTORY OF PRESENT ILLNESS: This is a 19-year-old known male with sickle cell anaemia. He comes to the emergency room on his own with 3-day history of loss of vision. He is on no medicines. He does live with a room mate. Appetite is decreased. No diarrhoea, vomiting. Voiding well. Denies any abdominal pain and dysarthria. He complains of a constant mild headache, but his main concern is back ache that extends from above the lower T-spine to the lumbosacral spine.

Table D.2: Example 2 annotations

HrQoL term       concept        negation   source       severity
loss of vision   sensory-pain              subjective   severe
appetite         other                     objective    mild
diarrhoea        other          negated    objective    na
vomiting         other          negated    objective    na
voiding          other          negated    objective    na
abdominal pain   sensory-pain   negated    subjective   na
dysarthria       sensory-pain   negated    subjective   na
mild headache    sensory-pain              subjective   mild
back ache        sensory-pain              subjective   moderate

Example 3: REVIEW OF SYSTEMS: No headaches or vision problems. Ongoing heart problems, without complaints. No weakness, numbness or tingling sensation. He reports continued discomfort with his chronic neck pain. No history of endocrine problems. He has nocturia and urinary frequency.

Table D.3: Example 3 annotations

HrQoL term           concept                negation   source       severity
headaches            sensory-pain           negated    objective    na
vision problems      sensory-pain           negated    objective    na
weakness             physical-functioning   negated    objective    na
numbness             sensory-pain           negated    objective    na
tingling sensation   sensory-pain           negated    objective    na
chronic neck pain    sensory-pain                      subjective   severe
nocturia             other                             objective    moderate
urinary frequency    other                             objective    moderate

• Cognitive functioning: terms describing (ability/disability with) cognitive functions such as learning, memory, attention. This includes, but is not limited to:
  - e.g., memory problems, learning difficulties, volition, concentration or attention issues, diseases associated with memory/attention/learning, e.g., agnosia, amnesia, dementia

• Emotional functioning and mental state: This includes, but is not limited to:
  - e.g., irritable, distress, angry, happy, fear, guilt, concern, anxious, insomnia, depressed/depression, personality or behavioural changes, mental disorders
  - e.g., self-esteem and appearance related concepts

Example 4: PAST MEDICAL HISTORY: He has had a lumbar fusion. I believe he’s had heart disease. Mental state changes are either due to the tumour or other psychiatric problems.

Table D.4: Example 4 annotations

HrQoL term             concept                 negation   source      severity
mental state           emotional-functioning              objective   moderate
psychiatric problems   emotional-functioning              objective   moderate

Example 5: PAST MEDICAL HISTORY: The patient is a 25-year-old female with a history of brain tumor. She has been cancer free for 5 years. The patient has a history of mental and emotional distress during and after treatment. She was diagnosed with bipolar personality disorder in 1993. She denies any psychosis or hallucinations in the past 5 years.

Table D.5: Example 5 annotations

HrQoL term                     concept                 negation   source       severity
mental                         emotional-functioning              objective    moderate
emotional distress             emotional-functioning              objective    moderate
bipolar personality disorder   emotional-functioning              objective    moderate
psychosis                      emotional-functioning   negated    subjective   na
hallucinations                 emotional-functioning   negated    subjective   na

Example 6: HISTORY OF PRESENT ILLNESS: The patient is a 25-year-old female with a history of brain tumour. Post-treatment, the patient has reported problems with attention, learning difficulties, and memory problems.

Table D.6: Example 6 annotations

HrQoL term              concept                 negation   source       severity
attention               cognitive-functioning              subjective   moderate
learning difficulties   cognitive-functioning              subjective   moderate
memory problems         cognitive-functioning              subjective   moderate

• Home and family: terms describing concepts related to functioning of home and family life. This includes, but is not limited to:
  - e.g., life-style (e.g., alcohol and substance abuse)
  - e.g., family (e.g., living arrangement)
  - e.g., finance and employment (status, difficulties, etc.)

• School functioning: terms describing school functioning. This includes, but is not limited to:
  - e.g., progress and status (special needs, class-room support, attendance and progress due to disease/treatment, etc.)

• Social functioning:
  - e.g., relationship (e.g., friend, partner, etc.)

• Activity:
  - e.g., participation (e.g., physical or social events)

EXAMPLES from clinical narratives:

Example 7: SOCIAL HISTORY: The patient is married but has been separated from his wife for many years, they remain close, and they have two adult sons. He is retired from the Air Force, currently works for Lockheed Martin. He was born and raised in New York. He does have a smoking history, about a 20 pack-year history and he reports quitting on July 27. He does drink alcohol socially. No use of illicit drugs.

Table D.7: Example 7 annotations

HrQoL term      concept              negation   source      severity
married         social-functioning              objective   na
separated       social-functioning              objective   na
works           home-family                     objective   na
smoking         home-family          negated    objective   na
drink alcohol   home-family                     objective   na
illicit drugs   home-family          negated    objective   na

Example 8: SOCIAL HISTORY: He is married and has 1 son. He has a brother who is healthy. There is no history of tobacco use or alcohol use.

Table D.8: Example 8 annotations

HrQoL term    concept              negation   source      severity
married       social-functioning              objective   na
tobacco use   home-family          negated    objective   na
alcohol use   home-family          negated    objective   na

Example 9: The patient has had 12 months interrupted university studies due to ongoing disease and treatment side-effects, but plans to return to studies this new academic school year.

Table D.9: Example 9 annotations

HrQoL term                       concept              negation   source      severity
interrupted university studies   school-functioning              objective   na
studies                          school-functioning   negated    objective   na

D.3 Ambiguous cases

Concepts not to annotate: as mentioned previously, if a concept is unknown to the person or requires a medical diagnostic test, it is not an HrQoL concept. Examples include:

Table D.10: Ambiguous cases

Terms                                Reason for no annotation
high blood pressure / hypertension   requires diagnostic test to determine problem.
heart problems                       requires diagnostic test to determine problem.
endocrine problems                   requires diagnostic test to determine problem.
neuropathy                           requires diagnostic test to determine problem.
well developed                       not explicit.

D.3.1 Indirect references

Other ambiguous cases include indirect HrQoL references. For example, any equipment/aid that is directly related to support specific disabilities covered by the concept map shall be classified accordingly. Examples include:

• ‘The patient requires hearing aid’ implies sensory-pain (hearing disability).

• ‘The patient needs walking aid’ implies physical-functioning (gross-motor skills difficulties).

• ‘wheel chair’ ↦ physical-functioning

• ‘piedro boots’ ↦ physical-functioning

• ‘school support’ ↦ school-functioning
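The indirect references listed above amount to a lookup from equipment or aid phrases to concept types. A minimal sketch, assuming simple case-insensitive substring matching (the guideline does not say how such phrases are detected, and the names below are hypothetical):

```python
# Phrase-to-concept mapping, taken from the examples in this section.
INDIRECT_REFERENCES = {
    "hearing aid": "sensory-pain",
    "walking aid": "physical-functioning",
    "wheel chair": "physical-functioning",
    "piedro boots": "physical-functioning",
    "school support": "school-functioning",
}


def indirect_concepts(sentence):
    """Return (phrase, concept type) pairs for every listed aid found
    in the sentence, using case-insensitive substring matching."""
    text = sentence.lower()
    return [(phrase, ctype)
            for phrase, ctype in INDIRECT_REFERENCES.items()
            if phrase in text]


print(indirect_concepts("The patient requires hearing aid"))
```

Substring matching is the simplest choice that reproduces the listed examples; a real annotator would also need to handle spelling variants such as "wheelchair".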

D.4 What (not)? to annotate

D.4.1 What not to annotate

• negation should not be included in the annotation span

• annotation spans should never overlap

• determiners

• conjunctives or copulas should generally not be part of annotation spans (e.g., is, with, of, and, etc.). Exceptions do exist, for example when they are an inherent part of the phrase/concept (e.g., loss of vision) or absolutely necessary to describe the concept (e.g., loss of right leg, walks with assistance).
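The rule that annotation spans should never overlap is easy to check mechanically. The sketch below assumes spans are given as (start, end) character offsets with an exclusive end; that representation and the function name are assumptions made for illustration.

```python
def find_overlaps(spans):
    """Return all pairs of overlapping spans.

    `spans` is a list of (start, end) character offsets, end exclusive.
    Spans that merely touch (one ends where the next starts) do not overlap.
    """
    ordered = sorted(spans)
    overlaps = []
    for i, (s1, e1) in enumerate(ordered):
        for s2, e2 in ordered[i + 1:]:
            if s2 >= e1:
                # Sorted by start, so no later span can overlap (s1, e1).
                break
            overlaps.append(((s1, e1), (s2, e2)))
    return overlaps


print(find_overlaps([(0, 5), (3, 9), (9, 12)]))
```

An empty result means the annotation set satisfies the non-overlap rule.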

D.4.2 What to annotate

Concepts to annotate include syntactic patterns such as: (i) noun phrases (NPs); (ii) NP preposition NP; and (iii) heads of verb phrases:

1. Noun phrases are made up of: nouns, pronouns, determiners, adjectives

• we exclude determiners and pronouns from annotation spans
• we also exclude adjectives that denote any temporal sense, e.g., ‘recent’, ‘current’, and so forth

2. NP preposition NP (preposition, commonly: ‘with’ and ‘of’)

• these may include determiners/pronouns, if part of the second NP (see Example E)

3. Head verbs

• e.g., smoking, smokes, walking, etc.
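The span rules above can be approximated by trimming unwanted words from the start of a candidate phrase. This is a rough sketch under stated assumptions: the word lists are short illustrative samples rather than the lists used in the thesis (only ‘recent’ and ‘current’ are named in the guideline), and only leading tokens are removed so that determiners inside a second NP are kept.

```python
# Illustrative word lists; only 'recent' and 'current' come from the guideline.
DETERMINERS = {"a", "an", "the", "any", "some"}
PRONOUNS = {"his", "her", "their", "my", "your", "its", "our"}
TEMPORAL_ADJECTIVES = {"recent", "current"}


def trim_span(tokens):
    """Drop leading determiners, pronouns, and temporal adjectives from a
    candidate noun phrase given as a list of tokens."""
    skip = DETERMINERS | PRONOUNS | TEMPORAL_ADJECTIVES
    i = 0
    while i < len(tokens) and tokens[i].lower() in skip:
        i += 1
    return tokens[i:]


print(trim_span("any recent chest pain".split()))
print(trim_span("pain of the lower back".split()))
```

Because trimming stops at the first content word, "pain of the lower back" is returned unchanged, which matches the rule that a determiner may remain when it belongs to the second NP.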

D.5 More examples

This section contains additional problematic examples. Nota bene, correct annotation span is underlined:

Example A
energy and appetite is decreased
emotional and mental problems
problems with attention, learning, and memory problems

Example B
he denies any recent chest pain
he denies a chest pain
he reports severe chest pain
he suffered multiple fits over the course of last week
he suffered couple convulsions
his nausea has progressively worsened since last visit
the patient’s nausea has progressively worsened since last visit

Example C
no appetite
appetite is decreased
increased appetite
decreased appetite

Example D
problems with appetite
problems with school
problems with attention
problems with coordination
problems with balance
attention problems
coordination problems
balance problems
memory problems
good coordination
good balance
coordination is good
balance is good

Example E
loss of appetite
loss of vision
pain of the lower back
lower back pain
walks with assistance
walks with no assistance
walks without assistance

Appendix E

Extracting Patient Journeys: a Case Study

E.1 Comparative Analysis of Narratives

Table E.1: Concept types in clinical narratives

Concept type                  Frequency   pncf
Physical functioning          1,756       6.73
Emotional functioning         1,583       6.07
Social functioning            112         0.43
Cognitive functioning         678         2.60
Sensory and pain              1,580       6.06
Other well-being              1,727       6.62
Activity                      296         1.13
Home and family               169         0.65
School functioning            1,290       4.95
Oncology diagnosis            1,903       7.30
Oncology investigation        1,102       4.22
Oncology treatment            4,132       15.84
Endocrinology diagnosis       1,699       6.51
Endocrinology investigation   5,919       22.69
Endocrinology treatment       2,140       8.20
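The pncf column in Table E.1 appears to be each concept type's share of the total concept frequency, expressed as a percentage. This reading is inferred from the figures in the table rather than from a definition given in this appendix, so the short sketch below should be taken as a consistency check under that assumption; the function and variable names are hypothetical.

```python
# Frequencies copied from Table E.1 (clinical narratives).
CLINICAL_FREQUENCIES = {
    "Physical functioning": 1756,
    "Emotional functioning": 1583,
    "Social functioning": 112,
    "Cognitive functioning": 678,
    "Sensory and pain": 1580,
    "Other well-being": 1727,
    "Activity": 296,
    "Home and family": 169,
    "School functioning": 1290,
    "Oncology diagnosis": 1903,
    "Oncology investigation": 1102,
    "Oncology treatment": 4132,
    "Endocrinology diagnosis": 1699,
    "Endocrinology investigation": 5919,
    "Endocrinology treatment": 2140,
}


def percentage_share(frequencies):
    """Return each type's frequency as a percentage of the total,
    rounded to two decimals (assumed meaning of pncf)."""
    total = sum(frequencies.values())
    return {name: round(100 * count / total, 2)
            for name, count in frequencies.items()}


shares = percentage_share(CLINICAL_FREQUENCIES)
for name, value in shares.items():
    print(f"{name:30s}{value:6.2f}")
```

Normalising by the total in this way makes the clinical and patient narrative tables comparable even though the two corpora differ greatly in size.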


Table E.2: Concept types in patient narratives

Concept type                  Frequency   pncf
Physical functioning          166         11.39
Emotional functioning         294         20.16
Social functioning            107         7.34
Cognitive functioning         162         11.10
Sensory and pain              81          5.56
Other well-being              85          5.83
Activity                      65          4.46
Home and family               66          4.53
School functioning            357         24.49
Oncology diagnosis            14          0.96
Oncology investigation        15          1.03
Oncology treatment            39          2.67
Endocrinology diagnosis       2           0.14
Endocrinology investigation   2           0.14
Endocrinology treatment       3           0.21

E.2 Extracting Patient Journeys

Table E.3: PathCluster: co-reference lists
This table lists the five different lists used in conjunction with SoftTFIDF (threshold: 0.7) as part of the PathCluster co-reference method.

[Table E.3 body: five co-reference lists headed Growth hormone deficiency, Medulloblastoma, Radiotherapy, Surgery, and Chemotherapy. The individual list entries are not recoverable from the extracted layout.]
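To make the SoftTFIDF measure mentioned above more concrete, here is a minimal sketch of the general technique. It is not the PathCluster implementation, and it makes several assumptions that should be read as such: SoftTFIDF is normally paired with Jaro-Winkler token similarity, whereas this sketch swaps in `difflib.SequenceMatcher` from the Python standard library; the IDF term is smoothed as log(1 + N/df) so that small collections do not yield zero weights; and the 0.7 threshold is applied at token level, although the source does not say whether it is a token-level or score-level cutoff.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def token_sim(a, b):
    """Character-level similarity in [0, 1]; a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF weight dict per tokenised document.

    Documents are assumed to be non-empty lists of tokens.
    """
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: math.log(c + 1) * math.log(1 + n / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def soft_tfidf(vec_s, vec_t, theta=0.7):
    """Sum weight products over tokens of S whose closest token in T
    has similarity above theta, scaled by that similarity."""
    score = 0.0
    for w, weight_s in vec_s.items():
        best_tok, best_sim = max(((v, token_sim(w, v)) for v in vec_t),
                                 key=lambda pair: pair[1])
        if best_sim > theta:
            score += weight_s * vec_t[best_tok] * best_sim
    return score


docs = [
    ["growth", "hormone", "deficiency"],
    ["growth", "hormone", "deficient"],
    ["chemotherapy"],
]
vecs = tfidf_vectors(docs)
print(soft_tfidf(vecs[0], vecs[1]), soft_tfidf(vecs[0], vecs[2]))
```

In this toy collection the two growth hormone variants score far higher against each other than against "chemotherapy", which is the behaviour a co-reference step relies on when grouping surface variants of the same concept.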