Leveraging Electronic Health Records and Patient Similarity
Total Page:16
File Type:pdf, Size:1020Kb
Toward Precision Medicine in Intensive Care: Leveraging Electronic Health Records and Patient Similarity by Anis Sharafoddini A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Public Health and Health Systems Waterloo, Ontario, Canada, 2019 ©Anis Sharafoddini 2019 Examining Committee Membership The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote. External Examiner Frank Rudzicz Professor, Department of Computer Science, University of Toronto Supervisors Joel A. Dubin Professor, Department of Statistics and Actuarial Science; and School of Public Health and Health Systems, University of Waterloo Joon Lee Professor, Department of Community Health Sciences; and Department of Cardiac Sciences, Cumming School of Medicine, University of Calgary Internal Members Plinio Morita Professor, School of Public Health and Health Systems, University of Waterloo David Maslove Professor, Department of Critical Care Medicine, Queen's University Internal-external Member Yeying Zhu Department of Statistics and Actuarial Science, University of Waterloo ii AUTHOR'S DECLARATION This thesis consists of material all of which I authored or co-authored: see Statement of Contributions included in the thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. iii Statement of Contributions This thesis consists in part of two manuscripts that have been published. Exceptions to sole authorship: Chapter 2: A. Sharafoddini, J. A. Dubin, D. M. Maslove, and J. Lee, "A New Insight Into Missing Data in Intensive Care Unit Patient Profiles: Observational Study," JMIR Med Inform, vol. 7, no. 1, p. e11605, Jan 8 2019. Chapter 3: A. Sharafoddini, J. A. Dubin, and J. Lee, "Finding Similar Patient Subpopulations in the ICU Using Laboratory Test Ordering Patterns," Proceedings of the 7th International Conference on Bioinformatics and Biomedical Science, 2018. As lead author of these two chapters, I was responsible for defining the research questions, conceptualizing the study design, carrying out data extraction and analysis, and drafting and submitting the manuscripts. My co-authors provided guidance during each step of the research and gave feedback on draft manuscripts. iv Abstract The growing adoption of Electronic Health Record (EHR) systems has resulted in an unprecedented amount of data. This availability of data has also opened up the opportunity to utilize EHRs for providing more customized care for each patient by considering individual variability, which is the goal of precision medicine. In this context, patient similarity (PS) analytics have been introduced to facilitate data analysis through investigating the similarities in patients’ data, and, ultimately, to help improve the healthcare system. This dissertation is presented in six chapters and focuses on employing PS analytics in data-rich intensive care units. Chapter 1 provides a review of the literature and summarizes studies describing approaches for predicting patients’ future health status based on EHR and PS. Chapter 2 demonstrates the informativeness of missing data in patient profiles and introduces missing data indicators to use this information in mortality prediction. The results demonstrate that including indicators with observed measurements in a set of well-known prediction models (logistic regression, decision tree, and random forest) can improve the predictive accuracy. Chapter 3 builds upon the previous results and utilizes these missing indicators to reveal patient subpopulations based on their similarity in laboratory test ordering being used for them. In this chapter, the Density-based Spatial Clustering of Applications with Noise method, was employed to group the patients into clusters using the indicators generated in the previous study. Results confirmed that missing indicators capture the laboratory-test- v ordering patterns that are informative and can be used to identify similar patient subpopulations. Chapter 4 investigates the performance of a multifaceted PS metric constructed by utilizing appropriate similarity metrics for specific clinical variables (e.g. vital signs, ICD-9, etc.). The proposed PS metric was evaluated in a 30-day post-discharge mortality prediction problem. Results demonstrate that PS-based prediction models with the new PS metric outperformed population-based prediction models. Moreover, the multifaceted PS metric significantly outperformed cosine and Euclidean PS metric in k-nearest neighbors setting. Chapter 5 takes the previous results into consideration and looks for potential subpopulations among septic patients. Sepsis is one of the most common causes of death in Canada. The focus of this chapter is on longitudinal EHR data which are a collection of observations of measurements made chronologically for each patient. This chapter employs Functional Principal Component Analysis to derive the dominant modes of variation in septic patients’ EHR's. Results confirm that including temporal data in the analysis can help in identifying subgroups of septic patients. Finally, Chapter 6 provides a discussion of results from previous chapters. The results indicate the informativeness of missing data and how PS can help in improving the performance of predictive modeling. Moreover, results show that utilizing the temporal information in PS calculation improves patient stratification. Finally, the discussion identifies limitations and directions for future research. vi Acknowledgements First and foremost, I would like to thank God Almighty for giving me the strength, ability and opportunity to undertake this research. Without his blessings, this achievement would not have been possible. I am grateful to my parents, Maryam Salehi Sirjani and Mohammad Sharafoddini, for their love, prayers, caring and incredible sacrifices. I am extremely thankful to my husband, Ehsan Asadi, who has always patiently supported me and helped me overcome the difficulties through this journey. Thank you, Ehsan, for always being there for me. My appreciation extends to my sister Zohreh, my brother Ali, and all my family members in particular Marjan, Morteza, Ghazal, Amir Mohammad, Setayesh, Farkhondeh, Abdolazim, Nahid, Ali, and Ahmad for their support. There are not enough words in the whole world for me to describe how much I am grateful for having the loveliest family. I also would like to thank my supervisors, Dr. Joel Dubin and Dr. Joon Lee, for guiding me to think and explore, for helping me work on my weaknesses, and for providing me with many opportunities to develop my skills and gain self-confidence. I would like to extend my thanks to Dr. Frank Rudzicz, Dr. Yeying Zhu, Dr. David Maslove and Dr. Plinio Morita for serving on my committee, for their helpful comments and valuable suggestions on improving my thesis. I am also indebted to Mary McPherson, from Writing and Communication Center, for kindly supporting me through my PhD studies. So truly, thank you for everything. I also vii would like to thank Mohaddeseh and Mahya for their friendship and help in the past few years. My PhD journey finally came to an end, and I wish I could say thank you to all the great people I feel lucky and happy for having them in my life. viii Dedication This thesis is dedicated to my mother who took the lead to heaven before the completion of this work. ix Table of Contents Examining Committee Membership ......................................................................................... ii AUTHOR'S DECLARATION ................................................................................................ iii Statement of Contributions ...................................................................................................... iv Abstract ..................................................................................................................................... v Acknowledgements ................................................................................................................. vii Dedication ................................................................................................................................ ix List of Figures ......................................................................................................................... xv List of Tables ....................................................................................................................... xviii Introduction .............................................................................................................. 1 1.1 Background ..................................................................................................................... 2 1.1.1 Electronic Health Records and their Secondary Uses .............................................. 2 1.1.2 EHR Data Challenges ............................................................................................... 3 1.1.3 Precision and Population-based Analytics ............................................................... 5 1.1.4 Intensive Care Units and ICU Databases ................................................................. 9 1.2 Patient Similarity Literature Review ............................................................................. 11 1.2.1 Predictor Extraction and Missing Data Treatment