Mining Patient Journeys from Healthcare Narratives
MINING PATIENT JOURNEYS FROM HEALTHCARE NARRATIVES

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2014

By Azad Dehghan
School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements
1 Introduction
  1.1 Hypotheses and research questions
  1.2 Aim and objectives
  1.3 Contributions
  1.4 Thesis structure
2 Background
  2.1 Text Mining
    2.1.1 Natural language processing
    2.1.2 Information extraction
    2.1.3 Named entity recognition
    2.1.4 Temporal information extraction
    2.1.5 Temporal entity extraction
    2.1.6 Temporal relation extraction
  2.2 Clinical Background
    2.2.1 Clinical text
    2.2.2 Computer aided standardisation of clinical care
    2.2.3 Health-related quality of life
  2.3 Summary
3 Extraction of Health-related Concepts
  3.1 Event Extraction
    3.1.1 Methods
    3.1.2 Data
    3.1.3 Experiments, results, and discussion
  3.2 Health-related Quality of Life Extraction
    3.2.1 HrQoL Schema
    3.2.2 Data
    3.2.3 Methods
    3.2.4 Experiments, results, and discussion
  3.3 Summary
4 Temporal Information Extraction
  4.1 Temporal Entity Extraction
    4.1.1 Methods
    4.1.2 Data
    4.1.3 Experiments, results, and discussion
  4.2 Temporal Relation Extraction
    4.2.1 Methods
    4.2.2 Data
    4.2.3 Experiments, results, and discussion
  4.3 Summary
5 Extracting Patient Journeys: a Case Study
  5.1 Introduction: Childhood Central Nervous System Tumours
  5.2 Introduction: Case Study
    5.2.1 Data
  5.3 Comparative Analysis of Narratives
    5.3.1 Aggregated analysis of narratives
    5.3.2 Individual case analysis of narratives
    5.3.3 Discussion
  5.4 Extracting Patient Journeys
    5.4.1 Methods
    5.4.2 Evaluation
    5.4.3 Individual patient journeys
    5.4.4 Aggregated patient journeys
    5.4.5 Discussion
  5.5 Visualising Patient Timelines
  5.6 Summary
6 Conclusion
  6.1 Contributions
  6.2 Limitations
  6.3 Future work
  6.4 Summary
A Background
  A.1 A sample clinical note
  A.2 Token level sequence label modelling
  A.3 Transitive closure
  A.4 HUI-2 classification system
B Extraction of Health-related Concepts
  B.1 NER annotation examples
  B.2 HrQoL NER
C Tools and Resources
  C.1 NLP pre-processing
  C.2 Implementation
D HrQoL Annotation Guideline
  D.1 Introduction
    D.1.1 HrQoL schema description
    D.1.2 HrQoL annotation
  D.2 HrQoL schema
  D.3 Ambiguous cases
    D.3.1 Indirect references
  D.4 What (not)? to annotate
    D.4.1 What not to annotate
    D.4.2 What to annotate
  D.5 More examples
E Extracting Patient Journeys: a Case Study
  E.1 Comparative Analysis of Narratives
  E.2 Extracting Patient Journeys

Word Count: 61,885

List of Tables

2.1 An example of lexical normalisation
2.2 Common negation phrases in clinical data
2.3 Evaluation variables matrix
2.4 Common rule-based languages
2.5 Top systems in the 2010 i2b2 event extraction task
2.6 Top systems in the 2012 i2b2 event extraction task
2.7 Common data-driven features used for clinical event extraction
2.8 TIMEX3 representation schema
2.9 TLINK representation schema
2.10 TempEval-2 TERN results
2.11 TempEval-3 TERN results
2.12 2012 i2b2 TERN: methods and resources
2.13 2012 i2b2 TERN results
2.14 Common data-driven features used for clinical TER
2.15 TempEval-3: TLINK identification and classification task
2.16 TempEval-3: TLINK end-to-end task
2.17 TempEval-3: approaches for TLINK identification
2.18 TempEval-3: approaches for TLINK classification
2.19 2012 i2b2: TLINK identification and classification task
2.20 2012 i2b2: TLINK end-to-end task
2.21 Common TLINK classification features
2.22 HrQoL instruments and corresponding classification dimensions
3.1 Definition of EVENT categories
3.2 Feature template: clinical EVENTs
3.3 Label fixer heuristic
3.4 EVENT annotated datasets
3.5 i2b2-TRC: EVENT IAA
3.6 i2b2-CARC: EVENT IAA
3.7 EVENT label distribution
3.8 EVENT extraction validation test results
3.9 EVENT extraction results on the held-out test set
3.10 The clinical NER performance
3.11 EVENT recognition: feature impact analysis
3.12 EVENT recognition: dataset impact
3.13 EVENT recognition: post-processing impact analysis
3.14 Performance of the official 2012 i2b2 submission
3.15 Example of HrQoL concepts
3.16 HrQoL dataset label distribution
3.17 HrQoL dataset IAA
3.18 HrQoL dictionary summary
3.19 HrQoL NER results on the development set
3.20 HrQoL NER results on the held-out test set
3.21 The HrQoL NER performance
3.22 The HrQoL NER impact analysis
4.1 Feature template: clinical TER
4.2 TIMEX3 label distribution
4.3 i2b2-TRC: TIMEX3 IAA
4.4 TER validation results
4.5 TER evaluation on the held-out test set
4.6 TE normalisation results
4.7 TLINK patterns
4.8 TLINK extraction features
4.9 TLINK label distribution
4.10 i2b2-TRC: TLINK IAA
4.11 TLINK development set results
4.12 TLINK results on the held-out test set
4.13 TLINK component based evaluation
4.14 TLINK end-to-end results on the held-out test set
5.1 Adopted case study specific types
5.2 Case study corpus: clinical narratives profile
5.3 Case study corpus: patient narratives profile
5.4 Top occurring concept in clinical and patient narratives
5.5 Proportion of traditional clinical concepts in patient narratives
5.6 Proportion of HrQoL concepts in clinical narratives
5.7 A semantic analysis of patient narratives
5.8 A semantic analysis of clinical narratives
5.9 An example list of concepts and their PathCluster confidence
5.10 Test case A: a tabular view of high-level processes
5.11 Test case B: a tabular view of high-level processes
5.12 Test case C: a tabular view of high-level processes
5.13 Aggregated patient pathway: a tabular view of high-level processes
A.1 Transitive relations
B.1 Contextual reasoner exclusion cues
B.2 Boundary adjustment: adjectives
D.1 Example 1 annotations
D.2 Example 2 annotations
D.3 Example 3 annotations
D.4 Example 4 annotations
D.5 Example 5 annotations
D.6 Example 6 annotations