Extracting Detailed Tobacco Exposure from the Electronic Health Record

By Travis John Osterman

Thesis
Submitted to the Faculty of the Graduate School of Vanderbilt University
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
in Biomedical Informatics
August 11, 2017
Nashville, Tennessee

Approved:
Josh Denny, M.D., M.S.
Mia Levy, M.D., Ph.D.
Pierre Massion, M.D.

To Laura, Owen and Gavin. Thank you for your patience, encouragement, and support.

ACKNOWLEDGEMENTS

I would like to thank the National Library of Medicine and the Conquer Cancer Foundation for supporting my biomedical informatics training and research (LM007450, R01-LM010685). My appreciation and thanks also go to the Department of Biomedical Informatics faculty and students for their support. Specifically, I thank Cindy Gadd and Rischelle Jenkins for creating and maintaining a rigorous and enlightening curriculum, and my master's committee, Josh Denny, Mia Levy, and Pierre Massion, for supporting my education and broader career development. In addition, I would like to thank Wei-Qi Wei, Dara Mize, Julie Wu, and Lisa Bastarache, who directly contributed to this work. Finally, I would like to thank my parents, wife, and children for supporting me through this program.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

Chapter

I. Introduction
    Natural language processing
    Tobacco extraction
    Phenome Wide Association Studies (PheWAS)
    Genotype by Environment Interaction Studies (G×E)

II. Smoking History And Pack-year Extraction System (SHAPES)
    Introduction
    Demonstration Project: Introduction
    Demonstration Project: Methods
    Demonstration Project: Results
    Demonstration Project: Discussion
    Smoking History And Pack-year Extraction System (SHAPES): Introduction
    SHAPES: Methods
    SHAPES: Methods: Augmented Review and Annotation System (ARAS)
    SHAPES: Methods: Training
    SHAPES: Methods: Assumptions
    SHAPES: Methods: Pipeline
    SHAPES: Methods: Validation
    SHAPES: Results
    SHAPES: Reference Implementations
    SHAPES: Discussion
    SHAPES: Conclusion

III. Smoking PheWAS
    Introduction
    Methods
    Results
    Discussion
    Conclusion

IV. A Smoking Genome by Environment (G×E) Interaction Study
    Introduction
    Methods
    Results
    Discussion
    Conclusion

V. Summary
    Conclusions

Appendix

A. Extraction Rules (rules.py)
B. SHAPES Dependencies
C. Significant results with ever-never smoking classification system
D. Significant results with pack-year tobacco exposure via SHAPES
E. Single Nucleotide Polymorphism (SNP) – Phenotype Replications
F. Significant SNP-Phenotype Interactions

REFERENCES

LIST OF TABLES

Table

1. Seven variables extracted by SHAPES
2. Frequency of note-level smoking data in training and validation sets
3. Frequency of patient-level smoking data in training and validation sets
4. Training set performance, note level
5. Training set performance, patient level
6. Inter-rater reliability between two independent reviewers
7. Review vs adjudicator agreement
8. Reviewer vs adjudicator agreement, ignoring NAs
9. SHAPES validation set performance (note-level data)
10. SHAPES validation set performance (patient-level data)
11. SHAPES lung cancer screening eligibility performance
12. SHAPES abdominal aortic aneurysm screening performance
13. Smoking PheWAS overview of ever-never smoking classification vs SHAPES
14. Top 10 smoking-phenotype associations using ever/never smoking status
15. Top 10 smoking-phenotype associations using SHAPES pack-years
16. Top 10 interactions between SNP and smoking exposure
17. Significant interactions between lung cancer and SNPs

LIST OF FIGURES

Figure

1. Major challenges of G×E research
