Linking Electronic Health Records with the Biomedical Literature
LINKING ELECTRONIC HEALTH RECORDS WITH THE BIOMEDICAL LITERATURE

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN THE FACULTY OF SCIENCE AND ENGINEERING

2016

By Xiao Fu
School of Computer Science

Contents

List of Tables
List of Figures
Abstract
Declaration
Copyright
Acknowledgements
Abbreviations
1 Introduction
  1.1 Motivation
  1.2 Research hypotheses and objectives
  1.3 Overview of contributions
2 Phenotype extraction
  2.1 Literature review
    2.1.1 Pre-processing
    2.1.2 Post-processing
    2.1.3 Phenotype extraction in the clinical domain
    2.1.4 Phenotype extraction in the biomedical domain
    2.1.5 Evidence-based medicine
  2.2 Dataset
  2.3 Methodology
    2.3.1 Baseline
    2.3.2 Truecasing
    2.3.3 Abbreviation disambiguation
    2.3.4 Distributional similarity
    2.3.5 Hybrid methods
  2.4 Results and discussion
    2.4.1 Results
    2.4.2 Discussion
  2.5 Summary
3 Corpus development
  3.1 Literature review
  3.2 Document collection
    3.2.1 Clinical records
    3.2.2 Biomedical literature
  3.3 Annotation scheme and procedure
    3.3.1 Annotation scheme
    3.3.2 Text mining-assisted annotation
  3.4 Results and discussion
    3.4.1 Evaluation in the clinical domain
    3.4.2 Evaluation in the biomedical domain
  3.5 Summary
4 Phenotype normalisation
  4.1 Literature review
    4.1.1 Concept normalisation
    4.1.2 Domain adaptation
  4.2 Methodology
    4.2.1 Compositional synonym generation
    4.2.2 Normalisation of syntactic structures
    4.2.3 Normalisation of acronyms and abbreviations
    4.2.4 Normalisation of plural and singular
    4.2.5 Individual and hybrid methods
  4.3 Results and discussion
    4.3.1 Variants generation
    4.3.2 Evaluation
  4.4 Discussion
  4.5 Summary
5 Conclusion
  5.1 Evaluation of research objectives
  5.2 Future work
References
Appendix A List of COPD phenotypes used to retrieve articles from the PubMed OpenAccess subset
Appendix B Finer-grained COPD phenotypes
  B.1 Problem
  B.2 Treatment
  B.3 Test

Word Count: 38,210

List of Tables

Table 2.1: Examples from a pair of report and annotation files.
Table 2.2: […] Trauma series demonstrated no evidence of a pelvic fracture.
Table 2.3: Proximity relationships used to calculate DS.
Table 2.4: Definition of TP, FP, and FN.
Table 2.5: Exact and relaxed matching results for concept extraction.
Table 2.6: Exact and relaxed matching results for each concept type ((a) Problem; (b) Treatment; (c) Test)
Table 2.7: Number of unrecognised abbreviations.
Table 3.1: Number of MIMIC clinical records of each disease and its relationship with COPD
Table 3.2: Distribution of 1000 MIMIC II clinical records
Table 3.3: Description of collected clinical records of COPD patients.
Table 3.4: PMC journals relevant to COPD.
Table 3.5: 30 top-ranked biomedical articles.
Table 3.6: Annotation scheme for capturing COPD phenotypes (*: Semantic types adapted from the PhenoCHF scheme)
Table 3.7: Examples