Extracting Detailed Tobacco Exposure from the Electronic Health Record
Total Page:16
File Type:pdf, Size:1020Kb
Extracting Detailed Tobacco Exposure From The Electronic Health Record By Travis John Osterman Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Biomedical Informatics August 11, 2017 Nashville, Tennessee Approved Josh Denny, M.D., M.S. Mia Levy, M.D., Ph.D. Pierre Massion, M. D. i To Laura, Owen and Gavin. Thank you for your patience, encouragement, and support. ii ACKNOWLEDGEMENTS I would like to thank the National Library of Medicine and the Conquer Cancer Foundation for supporting my biomedical informatics training and research (LM007450, R01- LM010685). My appreciation and thanks also go to the Department of Biomedical Informatics faculty and students for their support. Specifically, I thank Cindy Gadd and Rischelle Jenkins for creating and maintaining a rigorous and enlightening curriculum and to my master’s committee Josh Denny, Mia Levy, and Pierre Massion, for supporting my education and broader career development. In addition, I would like to thank Wei-Qi Wei, Dara Mize, Julie Wu, and Lisa Bastarche who directly contributed to this work. Finally, I would like to thank my parents, wife, and children for supporting me through this program. iii TABLE OF CONTENTS Page DEDICATION ................................................................................................................................ ii ACKNOWLEDGEMENTS ........................................................................................................... iii LIST OF TABLES ......................................................................................................................... vi LIST OF FIGURES ...................................................................................................................... vii Chapter I. Introduction .................................................................................................................................1 Natural language processing .......................................................................................................2 Tobacco extraction ......................................................................................................................3 Phenome Wide Association Studies (PheWAS) .........................................................................6 Genotype by Environment Interaction Studies (G×E) ................................................................8 II. Smoking History And Pack-year Extraction System (SHAPES) ..............................................10 Introduction ...............................................................................................................................10 Demonstration Project: Introduction .........................................................................................10 Demonstration Project: Methods ...............................................................................................11 Demonstration Project: Results .................................................................................................13 Demonstration Project Discussion ............................................................................................14 Smoking History And Pack-year Extraction System (SHAPES): Introduction ........................14 SHAPES: Methods ....................................................................................................................15 SHAPES: Methods: Augmented Review and Annotation System (ARAS) .............................16 SHAPES Methods: Training .....................................................................................................18 SHAPES: Methods: Assumptions .............................................................................................20 SHAPES: Methods: Pipeline .....................................................................................................22 SHAPES: Methods: Validation .................................................................................................27 SHAPES: Results ......................................................................................................................27 SHAPES: Reference Implementations ......................................................................................39 SHAPES: Discussion ................................................................................................................42 SHAPES: Conclusion ................................................................................................................48 III. Smoking PheWAS ..................................................................................................................49 iv Introduction ...............................................................................................................................49 Methods .....................................................................................................................................49 Results .......................................................................................................................................51 Discussion .................................................................................................................................61 Conclusion .................................................................................................................................62 IV. A Smoking Genome by Environment (GxE) Interaction Study .............................................64 Introduction ...............................................................................................................................64 Methods .....................................................................................................................................64 Results .......................................................................................................................................66 Discussion .................................................................................................................................71 Conclusion .................................................................................................................................73 V. Summary.................................................................................................................................75 Conclusions ...............................................................................................................................76 Appendix A. Extraction Rules (rules.py) ........................................................................................................77 B. SHAPES Dependencies .............................................................................................................85 C. Significant results with ever-never smoking classification system ...........................................86 D. Significant results with pack-year tobacco exposure via SHAPES ..........................................91 E. Single Nucleotide Polymorphism (SNP) – Phenotype Replications .......................................108 F. Significant SNP-Phenotype Interactions .................................................................................117 REFERENCES ............................................................................................................................120 v LIST OF TABLES Table Page 1. Seven variables extracted by SHAPES ......................................................................................15 2. Frequency of note-level smoking data in training and validation sets .......................................28 3. Frequency of patient-level smoking data in training and validation sets...................................28 4. Training set performance, note level .........................................................................................29 5. Training set performance, patient level .....................................................................................30 6. Inter-rater reliability between two independent reviewers ........................................................32 7. Review vs adjudicator agreement ..............................................................................................33 8. Reviewer vs adjudicator agreement, ignoring NAs ...................................................................34 9. SHAPES validation set performance (note-level data) ..............................................................35 10. SHAPES validation set performance (patient-level data) ........................................................36 11. SHAPES lung cancer screening eligibility performance .........................................................38 12. SHAPES abdominal aortic aneurysm screening performance.................................................38 13. Smoking PheWAS overview of ever-never smoking classification vs SHAPES ....................51 14. Top 10 smoking-phenotype associations using ever/never smoking status. ............................52 15. Top 10 smoking-phenotype associations using SHAPES pack-years .....................................54 16. Top 10 interactions between SNP and smoking exposure .......................................................67 17. Significant interactions between lung cancer and SNPs ..........................................................68 vi LIST OF FIGURES Figure Page 1. Major challenges of G×E research ...............................................................................................9