RICE UNIVERSITY Subset Selection and Feature Identification in the Electrocardiogram by Emily Hendryx
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Doctor of Philosophy
APPROVED, THESIS COMMITTEE:
Béatrice Rivière, Chair
Noah Harding Chair, Professor of Computational and Applied Mathematics, Rice University
Craig Rusin, Director
Assistant Professor of Pediatric Cardiology, Baylor College of Medicine
Danny Sorensen
Noah Harding Professor Emeritus, Research Professor of Computational and Applied Mathematics, Rice University
Houston, Texas
April, 2018

ABSTRACT
Subset Selection and Feature Identification in the Electrocardiogram
by
Emily Hendryx
Each feature in the electrocardiogram (ECG) corresponds to a different part of the cardiac cycle. Tracking changes in these features over long periods of time can offer insight regarding changes in a patient's clinical status. However, the automated identification of features in some patient populations, such as the pediatric congenital heart disease population, remains a nontrivial task that has yet to be mastered. Working toward a solution to this problem, this thesis outlines an overall framework for the identification of individual features in the ECGs of different populations. With a goal of applying part of this framework retrospectively to large sets of patient data, we focus primarily on the selection of relevant subsets of ECG beats for subsequent interpretation by clinical experts. We demonstrate the viability of the discrete empirical interpolation method (DEIM) in identifying representative subsets of beat morphologies relevant for future classification models. The success of DEIM applied to data sets from a variety of contexts is compared to results from related approaches in numerical linear algebra, as well as some more common clustering algorithms. We also present a novel extension of DEIM, called E-DEIM, in which additional representative data points can be identified as important without being limited by the rank of the corresponding data matrix. This new algorithm is evaluated on two different data sets to demonstrate its use in multiple settings, even beyond medicine. With DEIM and its related methods identifying beat-class representatives, we then propose an approach to automatically extend physician expertise on the selected beat morphologies to new and unlabeled beats. Using a fuzzy classification scheme with dynamic time warping, we are able to provide preliminary results suggesting further pursuit of this framework in application to patient data.

Acknowledgments
There are many people who have supported the completion of this work–both directly and indirectly–and to express their deserved thanks in writing would make this document twice as long. However, I would be remiss not to express my gratitude to a certain few. I am very grateful to my advisors Craig Rusin and Béatrice Rivière for their guidance throughout my time at Rice. Dr. Rusin first introduced me to this problem five years ago, and since then, he has generously shared his expertise and resources in facilitating my work on such an interesting topic. Dr. Rivière has been kind to offer her wisdom and direction in both completing this work and navigating the larger world of academia. Both of my advisors have been incredibly patient and supportive of my interests, and the invaluable time and advice they have both given me throughout the years is much appreciated. I must also thank my additional committee members, Danny Sorensen and Chris Jermaine. Dr. Sorensen has been a key resource in conducting this research, and I would like to thank him for sharing his knowledge and expertise in exploring the use of DEIM for ECG subset selection–especially on days when he could have been fishing. I also thank Dr. Jermaine for offering his time and expertise through our various conversations related to this work. Sebastian Acosta has also been helpful in exploring ideas, and I appreciate the numerous discussions that we have had over the years. To set this project in motion and maintain its clinical relevance, a number of physicians at Texas Children's Hospital have offered their precious time and clinical expertise, including Daniel Penny, Kenneth Brady, Blaine Easley, Jeff Kim, and Eric Vu, in addition to several others.
I am very grateful to these clinicians for not only sharing their knowledge, but for expressing and pursuing an interest in actually using mathematical tools to improve clinical decision support. Financially, this work was primarily supported by a training fellowship from the Keck Center of the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics (NLM Grant No. T15LM007093). I am very grateful for this fellowship, as it provided an opportunity to interact with researchers in a variety of different fields and learn more about topics relevant to the project presented here. I would also like to thank the students, faculty, and staff in the CAAM department. A number of fellow CAAM graduate students have shared their knowledge with me these past few years, and the support they have offered has been a key part of my graduate career. This support has been mirrored by several other friends both in and outside of the Rice community, and to them I also extend my thanks. I am extremely grateful to my family for the love and encouragement they have shown me since day one–they have been a constant support system, ready to listen and offer any needed advice or assistance. So to my parents, Forrest and Rebecca Hendryx, as well as to Jennifer and Doug Parsons (and Jackson), I say thank you. Finally, I must offer thanks to God for the innumerable ways in which he has shown Himself faithful over the years. To Him be all praise, honor, and glory forever.

Contents
Abstract ii
Acknowledgments iv
List of Illustrations x
List of Tables xiii

1 Introduction 1
1.1 Setting the Scene: The Clinical Problem ...... 1
1.2 Feature Identification Framework ...... 2
1.3 Thesis Overview ...... 4

2 Background and Literature Review 7
2.1 Motivation ...... 7
2.1.1 Congenital Heart Disease ...... 7
2.1.2 Electrocardiogram ...... 9
2.2 ECG Feature Identification ...... 11
2.3 CUR Factorization ...... 13
2.4 Dynamic Time Warping ...... 16
2.5 Clustering ...... 17

3 DEIM-CUR ECG Subset Selection 21
3.1 Introduction ...... 21
3.2 Background and Related Work ...... 25
3.2.1 Electrocardiogram Analysis ...... 25
3.2.2 CUR Factorization ...... 27
3.3 Methods ...... 29
3.3.1 Synthetic Data Construction ...... 30
3.3.2 Real Patient Data ...... 37
3.3.3 CUR Factorization ...... 42
3.4 Results and Discussion ...... 49
3.4.1 CUR with Synthetic Data ...... 50
3.4.2 CUR with Patient Data ...... 53
3.5 Conclusion ...... 73

4 A Comparison of Subset Selection Methods 75
4.0.1 Data Sets ...... 76
4.1 Subset Selection Schemes ...... 80
4.1.1 DEIM ...... 80
4.1.2 Q-DEIM ...... 81
4.1.3 Leverage Scores ...... 82
4.1.4 Maximal Volume ...... 83
4.1.5 Clustering ...... 85
4.2 Comparison Results ...... 89
4.2.1 MIT-BIH: DS2 ...... 90
4.2.2 Grammatical Facial Expressions: Subject "b" ...... 93
4.2.3 MNIST ...... 94
4.3 Conclusion ...... 97

5 An Extension of DEIM 98
5.1 Introduction ...... 98
5.2 Background ...... 100
5.2.1 Standard DEIM ...... 101
5.3 Proposed Method: Extended DEIM ...... 102
5.3.1 Construction of p for Extended DEIM ...... 103
5.3.2 Theoretical Implications ...... 113
5.3.3 Description of Data Experiments ...... 122
5.4 Results and Discussion ...... 126
5.4.1 MIT-BIH Arrhythmia Database Results ...... 126
5.4.2 Letter Recognition Data Set Results ...... 129
5.5 Conclusion ...... 132

6 Beat Classification and Feature Identification 134
6.1 Beat Delineation ...... 134
6.2 Hierarchical DEIM-CUR ...... 135
6.2.1 Application to TCH Data Subset ...... 138
6.2.2 Further Reduction via Clustering ...... 143
6.3 Labeling Beats ...... 144
6.4 Classifying Unlabeled Beats ...... 145
6.5 Feature Identification from Annotated Beats ...... 145
6.5.1 Dynamic Time Warping ...... 145
6.5.2 Feature Location Probability ...... 150

7 Additional Directions 154
7.1 Bitmapping and Linear Dependence ...... 154
7.1.1 Converting a beat into a binary bitmap ...... 155
7.1.2 Converting a beat into a recoverable bitmap ...... 157
7.1.3 Linear Dependence ...... 159
7.2 Measuring Subset Selection Quality ...... 162
7.2.1 Texas Children's Hospital Database and Normalization Schemes ...... 163
7.2.2 Subset Basis Representation ...... 164
7.2.3 Subset Basis Representation with ℓ1 Regularization ...... 173
7.2.4 Preliminary Results with ℓ1 Penalty ...... 179
7.2.5 Performance with Random Bases ...... 184
7.2.6 Data Representation with Only One Basis Beat ...... 190

8 Conclusion 195
8.1 Concluding Remarks ...... 195
8.2 Future Work ...... 198

Appendix A 200
A.1 Synthetic Data Parameters ...... 200
A.2 Full MIT-BIH Database Class Detection Results ...... 200

Appendix B 202
B.1 CUR Basis Performance Results Without Lasso ...... 202
B.2 CUR Basis Performance Results With Lasso ...... 205
B.3 Synthetic Results for Individual Beat-Basis Comparison ...... 210

Appendix C 216
C.1 Guide for Annotation GUI ...... 216

Bibliography 228

Illustrations
1.1 Feature identification approach overview ...... 3

2.1 Labeled synthetic ECG ...... 9

3.1 Labeled features within synthetic ECG waveforms ...... 23
3.2 Synthetic classes ...... 32
3.3 Synthetic beats with random noise ...... 34
3.4 Synthetic classes ...... 36
3.5 Percent CUR detection with physiological noise ...... 72

4.1 MIT-BIH Arrhythmia DS2 class distribution ...... 77
4.2 Grammatical Facial Expressions Test Set class distribution ...... 79
4.3 MNIST Test Set class distribution ...... 80
4.4 DS2 method comparison class results ...... 92
4.5 Summary of method comparison results ...... 96

5.1 DS1 DEIM variations detection results ...... 127
5.2 Letter recognition training set missed classes ...... 130

6.1 TCH lead I CUR beat selection ...... 141
6.2 TCH lead II CUR beat selection ...... 141
6.3 TCH lead III CUR beat selection ...... 142
6.4 TCH lead IV CUR beat selection ...... 142
6.5 MIT-BIH CUR-selected beats with Max-Min clustering ...... 144
6.6 Synthetic unmapped beat comparison ...... 148
6.7 DTW difference matrix and signal warping ...... 148
6.8 Synthetic data feature identification via DTW ...... 149
6.9 Estimated feature location ...... 153

7.1 Signal with bitmapping ...... 156
7.2 Signals with recoverable bitmapping ...... 159
7.3 Example beats to bitmap for linear dependence check ...... 160
7.4 Binary bitmap linear dependence check ...... 161
7.5 Recoverable bitmap linear dependence check ...... 161
7.6 Synthetic control DEIM-CUR basis comparison without ℓ1 ...... 168
7.7 DEIM-CUR basis without ℓ1 and with beat norm ...... 169
7.8 Patient data CUR basis comparison without ℓ1, tighter constraints ...... 171
7.9 Patient data CUR basis comparison without ℓ1, looser constraints ...... 172
7.10 Selected ℓ1 penalty parameter ...... 180
7.11 Beat count for hierarchical CUR: full matrix results ...... 182
7.12 Percentage representable to 15% accuracy, full matrix ...... 183
7.13 Matrix 1, percent representable to 15% accuracy with CUR and random bases ...... 186
7.14 Larger matrix, percent representable to 15% accuracy with CUR and random bases ...... 189
7.15 Single beat-basis comparison for Matrix 1 with beat norm ...... 191
7.16 Single beat-basis comparison for larger matrix with beat norm ...... 191
7.17 CUR-selected and randomly selected beats ...... 193

B.1 HR synthetic CUR basis without ℓ1 ...... 202
B.2 Shift synthetic CUR basis without ℓ1 ...... 203
B.3 Amplitude synthetic CUR basis without ℓ1 ...... 203
B.4 Width synthetic CUR basis without ℓ1 ...... 204
B.5 Beat count for hierarchical CUR: subset beat norm ...... 205
B.6 Beat count for hierarchical CUR: subset row norm ...... 206
B.7 Beat count for hierarchical CUR: subset beat norm with bitmapping ...... 206
B.8 Beat count for hierarchical CUR: subset row norm with bitmapping ...... 207
B.9 Percent representable to 15% accuracy, subset beat norm ...... 207
B.10 Percent representable to 15% accuracy, subset row norm ...... 208
B.11 Percent representable to 15% accuracy, subset beat norm bitmap ...... 208
B.12 Percent representable to 15% accuracy, subset row norm bitmap ...... 209
B.13 Mean control beat-basis comparison ...... 210
B.14 Max control beat-basis comparison ...... 211
B.15 Heart rate variability beat-basis comparison ...... 212
B.16 Shift variability beat-basis comparison ...... 213
B.17 Amplitude variability beat-basis comparison ...... 214
B.18 Width variability beat-basis comparison ...... 215

C.1 Open GUI ...... 218
C.2 Selecting "Other" to Define Label ...... 219
C.3 Define a Different Annotation Name ...... 220
C.4 Create New Annotation ...... 221
C.5 Edit Annotation ...... 222
C.6 Edit Label Delineation ...... 223
C.7 Create a Comment ...... 224
C.8 Saving a File With No Annotations ...... 225
C.9 Checking to Save Changes ...... 226
C.10 Completing the Labeling Session ...... 227

Tables
3.1 Synthetic class relations ...... 37
3.2 Synthetic Results ...... 52
3.3 MIT-BIH CUR annotation representation ...... 54
3.4 MIT-BIH CUR annotation representation with count threshold ...... 54
3.5 MGH-MF CUR annotation representation ...... 57
3.6 Full MGH-MF CUR annotation representation ...... 57
3.7 Incart CUR annotation representation ...... 58
3.8 AAMI MIT-BIH CUR annotation representation ...... 61
3.9 DS1 AAMI CUR classification results ...... 64
3.10 DS2 AAMI CUR classification results ...... 65
3.11 Incart AAMI CUR classification results ...... 66
3.12 DS2 AAMI CUR classification results ...... 66
3.13 Incart AAMI CUR classification results ...... 67
3.14 AAMI DS2 peak detection annotation representation ...... 68

4.1 Method comparison parameter selection ...... 90
4.2 DS2 method comparison results summary ...... 91
4.3 Subject "b" method comparison results summary ...... 93
4.4 MNIST Test Set method comparison results summary ...... 95

5.1 DS2 DEIM variation results summary ...... 128
5.2 DS2 DEIM variation annotation results ...... 129
5.3 DS2 DEIM variation annotation results, with threshold ...... 129
5.4 Letter Recognition Test Set results ...... 131

6.1 Dimension reduction in TCH trial with hierarchical CUR ...... 140

A.1 Synthetic ECG construction parameters ...... 200
A.2 MIT-BIH CUR representation by patient ...... 201
Chapter 1
Introduction
1.1 Setting the Scene: The Clinical Problem
Throughout a patient's stay in the hospital, physicians, nurses, technicians, and machines are frequently collecting and recording data. The full analysis of all of this data in real time is impossible for humans, and there are still many clinical events that are difficult to predict. While physicians may have developed intuition as to some warning signs of patient decline, the quantification of such signs and identification of others may better serve the clinical community as a whole, allowing for the automation of some aspects of patient data analyses and potentially improving patient outcomes.
Of the many types of data presented to clinicians, this thesis pays particular attention to the analysis of the electrocardiogram (ECG). Though more broadly applicable, this work was originally motivated by the need for the development of improved methods for the processing of electrocardiogram data collected from pediatric patients with congenital heart disease. Since each part of an ECG beat corresponds to a particular part of the cardiac cycle, paying close attention to how these individual beat features change over time may provide insight to physicians regarding changes in a patient's clinical status. An impractical (and infeasible) task for humans, the
automated tracking of these features over long periods of time is of interest here.
While automated ECG analysis is certainly not new, most existing models are
dependent on particular thresholds and are not designed for the pediatric population.
Because pediatric patients can exhibit different ECG waveforms than those typically
seen in adults [77], there is a need for pediatric-focused ECG analysis–specifically for
those patients with congenital heart disease.
Therefore, the overarching goal of this work is to develop an automated system to
analyze pediatric (and other) ECGs on a beat-by-beat basis for eventual use in the
development of predictive models for clinical decision support.
1.2 Feature Identification Framework
In an effort to incorporate expert knowledge in this automated system, the approach discussed in the chapters that follow is designed to retrospectively use data collected from bedside monitors at Texas Children's Hospital (TCH) in Houston, Texas, in conjunction with input from clinicians. With data collection beginning prior to the start of this work, there are now hundreds of millions, if not billions, of individual
ECG beats coming from multiple leads and thousands of patients that are available for analysis. Hence, in order to get clinical input from experts regarding the types of morphologies present in the stored data, some form of data reduction is necessary.
To maintain clinical interpretability, the data reduction takes the form of subset selection in this thesis. With subset selection as a prominent topic of interest in the
Figure 1.1 : Overview of feature identification approach presented in this thesis.
chapters that follow, an overview of the proposed approach to the subsequent beat classification and feature identification is also presented. This overall framework for designing a physician-informed automated system for ECG feature identification is shown in Figure 1.1.
Starting with some form of ECG data–here taken to be synthetic data, pre-existing physician-labeled data sets available online, or data collected at TCH–we then carry out a form of dimension reduction via subset selection. While several different forms of subset identification are discussed throughout this work, the primary method of interest herein is the discrete empirical interpolation method (DEIM) and variants thereof. Once a representative subset of beats has been identified, the process relies
on input from clinicians to form an expertly labeled library of ECG morphologies.
With this library, we can classify new, unlabeled beats accordingly based on whole-
beat morphology. The expertly-defined labels associated with the relevant library
classes can then be used to identify individual feature labels within each unlabeled
beat. Finally, the feature information from these beats can be incorporated into
predictive models regarding a patient's clinical outcome, for instance, the likelihood
that a patient will experience cardiac arrest within a given time window.
1.3 Thesis Overview
With this overarching framework in mind, the rest of this thesis is organized as follows. Chapter 2 contains detailed background information with a literature review of some of the key topics presented throughout this work. Chapter 3 demonstrates the utility of the DEIM-induced CUR matrix factorization in selecting a representative subset of ECG beats; the content of this chapter is adapted from [38]. The viability of this approach for future use in classification is supported through the comparison of results generated via DEIM-CUR coupled with 1-nearest neighbor classification to results generated via other state-of-the-art approaches found in the literature.
To further compare DEIM to other subset-selection schemes, Chapter 4 presents a broad overview of different subset selection methods related to CUR row and column selection, as well as some standard clustering techniques. Included in this chapter is a discussion of the performance of these methods in detecting classes in different data sets, even outside of the ECG context.
Chapter 5 presents a novel extension of DEIM (E-DEIM) that allows for the selection of more representative data points when there are more beat classes expected to be found than there are stored samples per beat. This method also allows for the selection of additional indices in the setting in which it is too computationally expensive to compute a higher-rank singular value decomposition. After a discussion of the extended algorithm and some of the resulting theoretical implications, the method is demonstrated to perform well in detecting additional class representatives on two different data sets.
Having established several viable subset selection schemes, Chapter 6 shifts to the proposed approach for individual beat classification and subsequent feature identification. Though a final representative beat subset with physician-labeled beats has not yet been obtained from the TCH data set, we present results on synthetic data that point to the potential of the overall framework in going from an unlabeled data set to a feature identification algorithm.
Early ideas regarding the evaluation of DEIM-CUR selected subsets are presented in Chapter 7 to demonstrate the wide variety of approaches initially explored for this problem. Chapter 8 proceeds with some concluding remarks and a description of future work related to the ideas presented herein.
With applicability to data in other domains, the results presented in this work focus on the utility of DEIM and related algorithms in identifying broadly representative subsets of electrocardiogram beats. These results in turn support the viability of the proposed framework for individual feature identification. Though motivated by the feature identification problem in the pediatric congenital heart disease population, the methods presented here readily generalize to a wide variety of patient populations, providing an infrastructure for feature identification in relation to a number of different clinical questions.
Chapter 2
Background and Literature Review
While additional background information is provided in the chapters to come, we take a moment to discuss the clinical motivation, the electrocardiogram, and the feature identification problem. We also provide some of the context for the subset selections of interest through a brief review of the CUR matrix factorization, clustering algorithms, and dynamic time warping (a tool used in both clustering and classification).
2.1 Motivation
The motivation of this project stems from a collaboration with physicians at Texas
Children's Hospital in Houston, Texas. Their medical expertise, experiences, and observations have exposed a need for improved monitoring of patients with congenital heart disease. Rather than broadly mining patient data, their expertise has shaped the project to focus on particular aspects of patient monitoring.
2.1.1 Congenital Heart Disease
Congenital heart disease (CHD) simply refers to heart disease present at birth.
According to a meta-analysis of CHD papers by van der Linde et al., the prevalence of reported congenital heart disease between 1995 and 2010 was roughly 9 out of every 1000 live births worldwide [102]. Examples of the more common forms of congenital heart disease include ventricular septal defect (VSD), atrial septal defect
(ASD), patent ductus arteriosus (PDA), and pulmonary stenosis (PS) [102].
Although not as common, among the more fatal congenital heart diseases is hypoplastic left heart syndrome (HLHS), occurring in only 2-3% of CHD cases [5], [62].
In this disease, the heart is unable to properly pump oxygenated blood to the rest of the body, and without treatment, the disease is fatal. The main interventions used to improve HLHS patient survival rates are heart transplantation and a three-stage palliation that re-routes the circulatory system to function using only the right ventricle of the heart to pump blood. Unfortunately, between the first two surgeries (or palliations) in the latter treatment method, there is still a mortality rate of up to
15% [5], [62]. While this project now extends to the larger congenital heart disease population and beyond, it began in conversation with physicians at Texas Children’s
Hospital with a goal of lowering this inter-stage statistic.
The approach of this thesis is to leverage the clinical data available on the bedside monitors in order to more closely monitor patients with CHD in an automated fashion.
Although there are many different signals to utilize in patient monitoring, this project focuses on the information that is presented through the electrocardiogram, with the intention of eventually using this data in conjunction with other physiological signals to develop predictive models of patient decline.
Figure 2.1 : The labeling of features in a synthetically constructed ECG beat. This figure is also used in [37].
2.1.2 Electrocardiogram
The electrocardiogram, or ECG, has been a means of observing the electrical activity of the heart since the beginning of the 1900s and the work of Einthoven [24].
Physicians are well aware that each part of the cardiac cycle is represented by features in the ECG. The main features of the ECG are labeled as the P, Q, R, S, and T waves, as demonstrated on the synthetically generated beat in Figure 2.1.
The ECG waveform is found by looking at the potential difference between two electrodes placed on the skin. Typically, several waveforms are recorded simultaneously, each one referred to as a "lead" with naming conventions dependent upon where the electrodes are placed. The use of multiple leads allows physicians to see different perspectives on how potential varies through the heart. My previous work focused on the analysis of the magnitude of the heart dipole moment, or vectorcardiogram, computed using multiple leads [37], but because information is lost by focusing on only the magnitude, the work presented here considers the individual lead waveforms.
Pediatric ECGs behave differently than those of adults [77], and congenital heart disease can lead to even more differences [87], [27]. Because of this, known ECG indicators of patient decline in the adult population may not hold true in the pediatric population [27], and more work needs to be done to more carefully differentiate between adult and pediatric predictors.
In an effort to address this issue, Rusin et al. construct a predictive model to develop a risk index for patients with HLHS between the stage I and II palliations at
Texas Children's Hospital [87]. Although other physiologic data, such as respiratory rate and oxygen saturation (SpO2), are found to hold predictive value, the ST segment was also found to be predictive for this population. The authors look at the
ST segment (or ST-segment vector – similar in idea to the vectorcardiogram) in thirty
second intervals in search of lower frequency changes to include in their predictive
model. However, because the short-term ST segments displayed on the monitors are
sometimes non-physical [87], the beat-by-beat ST segment monitoring has room for
improvement.
The need for additional predictive modeling in pediatrics drives the remainder of
this work. This thesis takes on the task of identifying individual ECG features to
serve as input to future models. The fact that pediatric and adult electrocardiograms
can differ so much indicates that more careful analysis of ECGs may in turn provide improved predictive variables for model construction, as well as improvement to existing models.
2.2 ECG Feature Identification
The problem of ECG feature identification has been approached in a number of ways throughout the years. Though there is a plethora of papers regarding R-peak detection and QRS delineation alone (see, for example, the review by Köhler, Hennig, and Orglmeister [48]), the task of differentiating one beat from the next remains a problem without a definitive solution; see Chapter 6 for a discussion of the role of
R-peak detection in our framework. In this thesis, we look to develop an algorithm that is able to identify features that are possibly even more difficult to detect, such as the P and T waves–or even additional features of interest to particular physicians that are not as commonly studied.
In a seminal work by Pan and Tompkins from 1985, an algorithm for QRS detection is presented using threshold-based rules in evaluating waveform slopes, amplitudes, and widths [80]. A later work by Laguna, Jané, and Caminal proposes a method for detecting additional feature locations also using decision rules, applying rules and thresholds to the filtered and differentiated ECG waveform [52]. While these threshold and decision-rule based methods certainly have their merits–in fact we use an implementation of the Pan-Tompkins algorithm for R-peak detection in Chapter 6–these types of algorithms do not necessarily generalize well across populations, especially in trying to identify ECGs with physiological perturbations.
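To make the flavor of these decision-rule detectors concrete, here is a heavily simplified sketch in the spirit of the Pan-Tompkins pipeline (derivative, squaring, moving-window integration, thresholding). The function name, the fixed fractional threshold, and the window lengths are our illustrative choices; the published algorithm uses band-pass filtering and adaptive thresholds with a search-back rule.

```python
import numpy as np

def detect_r_peaks(ecg, fs, threshold_frac=0.6):
    """Toy Pan-Tompkins-style R-peak detector (illustrative only)."""
    deriv = np.diff(ecg)                 # emphasize steep QRS slopes
    squared = deriv ** 2                 # rectify and amplify large slopes
    win = max(1, int(0.15 * fs))         # ~150 ms integration window
    integrated = np.convolve(squared, np.ones(win) / win, mode="same")
    thresh = threshold_frac * integrated.max()
    refractory = int(0.2 * fs)           # ignore re-triggers within 200 ms
    peaks, last = [], -refractory
    for i in range(1, len(integrated) - 1):
        if (integrated[i] > thresh
                and integrated[i] >= integrated[i - 1]
                and integrated[i] >= integrated[i + 1]
                and i - last >= refractory):
            peaks.append(i)
            last = i
    return peaks

# Example on a synthetic train of Gaussian "beats" one second apart.
fs = 250
t = np.arange(0, 4, 1 / fs)
ecg = sum(np.exp(-((t - c) ** 2) / (2 * 0.02 ** 2)) for c in (0.5, 1.5, 2.5, 3.5))
peaks = detect_r_peaks(ecg, fs)
```

On a clean signal like this, one detection per beat is expected; real recordings need the filtering and adaptive thresholding that the original paper describes, which is precisely why such rules tend not to transfer across populations.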
In a 1989 paper by Gritzali, Frangakis, and Papakonstantinou, the authors also propose a threshold-based approach for detecting the QRS complex as well as the
P and T waves, but in doing so, they apply a transformation to the data and essentially break the signal into feature layers, removing information about the more prominent features in the process of identifying the T wave and then the P wave [34].
This notion of approaching the ECG as a combination of different components defined over the same time interval is also used in identifying ECG features through the empirical mode decomposition [4] and is also similar in nature to the more popularly used tool in feature identification: the wavelet transform.
Li, Zheng, and Tai present a QRS, P wave, and T wave detection algorithm based on the wavelet transform with quadratic and higher order spline wavelets [56]. Treating the ECG as a multiscale signal, the wavelet transform performs a time-frequency analysis to break the signal down according to different frequency bandwidths, or scales. With several other works using related wavelet-based feature identification, for example [28], [63], [76], and [8], we note that these methods still typically rely on some form of thresholding, requiring previous knowledge about the population at hand. (Once again, as we have incorporated ideas from [8] in our R-peak detection described in Chapter 6, we recognize the value of wavelet-based methods, but we have also seen the need to adjust these methods for the pediatric population.)
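To illustrate the scale-by-scale idea on which these detectors rest, one level of the much simpler Haar wavelet transform splits a signal into low-frequency approximation and high-frequency detail coefficients; this is only a stand-in for the spline wavelets used in the cited work, and the helper name is ours.

```python
import numpy as np

def haar_level(x):
    """One level of the Haar wavelet transform.

    Returns (approximation, detail): pairwise averages carry the smooth,
    low-frequency content; pairwise differences carry the high-frequency
    content at this scale. Repeating on the approximation walks down
    through successively coarser frequency bands.
    """
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                       # pad odd-length input
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

# A narrow spike (QRS-like) puts its energy into the fine-scale details,
# while slow waves (P- or T-like) survive into the coarse approximations.
sig = np.sin(np.linspace(0, 2 * np.pi, 64))
a1, d1 = haar_level(sig)                 # 32 coefficients each
a2, d2 = haar_level(a1)                  # next coarser scale
```

Wavelet-based detectors then threshold the detail coefficients at scales where each feature's energy is expected to concentrate, which is where the population-dependent tuning enters.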
Allowing for fewer restrictions with regard to the expected timing and frequencies of different ECG features, works by Vullings, Verhaegen, and Verbruggen [106] as well as
Zifan, Saberi, Moradi, and Towhidkhah [111] use variants of dynamic time warping
(DTW–described further below) to segment ECGs by mapping individual feature templates to unlabeled beats. Shorten and Burke also use dynamic time warping to determine the time duration of the different features and intervals of ECGs in healthy adults between 13 and 65 years old [92]. The authors apply DTW to these ECGs in comparison to labeled beats from the PhysioNet database to identify the beginning and end of each feature. Though demonstrated on healthy adult data, this use of DTW to identify the particular timing of ECG features is in line with our use of DTW in identifying ECG features in pediatric patients with congenital heart disease. To use a related approach, we seek a reasonably-sized collection of beats to serve as a library for labeling the wide variety of beats found in the data at TCH.
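The label-transfer step these works rely on can be sketched with the textbook O(nm) DTW recursion: align an unlabeled beat to a labeled template, then read a feature's location off the warping path. The helper names below are ours, and the thesis's actual fuzzy classification scheme (Chapter 6) is more involved.

```python
import numpy as np

def dtw_path(x, y):
    """Classic dynamic time warping between 1-D sequences.

    Returns the accumulated alignment cost and the optimal warping path
    as a list of (i, j) pairs aligning x[i] with y[j].
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Trace the optimal alignment back from the corner.
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return D[n, m], path[::-1]

def transfer_label(path, template_idx):
    """Map a labeled sample index on the template onto the new beat."""
    matches = [j for i, j in path if i == template_idx]
    return matches[len(matches) // 2]   # middle match if one-to-many
```

Because the warping path may map one template sample to several beat samples, `transfer_label` arbitrarily takes the middle match; other conventions (first, last, averaged) are equally defensible.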
2.3 CUR Factorization
One of the many challenges in analyzing the data available from Texas Children’s
Hospital (TCH) is its sheer volume. There are several ways to reduce the size of the problem, such as principal components analysis (PCA), the non-negative matrix factorization (NMF), independent components analysis (ICA), multidimensional scaling
(MDS), and many more (see textbooks by Hastie, Tibshirani, and Friedman [36] and by Izenman [42], among other resources such as [25]). Each of these methods represents the data in terms of derived features. For example, PCA uses the singular value decomposition (SVD) to identify orthogonal directions along which the data exhibits the most variance, while NMF represents the data as a linear combination of (non-negative) archetypes. In contrast, ICA tries to identify individual sources within the data (such as in the famous "cocktail party problem" in which individual voices are sought after from several recordings made in different locations throughout a crowded room), and MDS represents the data with reduced dimensionality using only the distances between the data points [42], [36]. The approach of choice in this work, however, is subset selection via methods used in deriving a CUR matrix factorization, a factorization that retains entire columns and rows of the original matrix and preserves matrix properties such as nonnegativity and sparsity. This is a useful factorization quality in this research because the data is reduced to a subset of the original heart beats as opposed to being represented as a linear combination of singular vectors or other basis vectors that lose their clinical interpretability and are more difficult to understand in the context of medical practice and physician expert knowledge.
More specifically, in the CUR matrix factorization, the matrix A ∈ R^{m×n} is represented by the matrix product CUR, where C ∈ R^{m×k} and R ∈ R^{k×n} contain k columns and rows of A, respectively, and U ∈ R^{k×k} is computed such that the low-rank approximation to A holds. Of note is that this factorization is not unique; there are a number of ways to arrive at this form of an approximation to A with different means of choosing the representative columns and rows. In addition, some sources construct the CUR decomposition such that C ∈ R^{m×k1} and R ∈ R^{k2×n}, where it is possible for k1 ≠ k2 and U is, in turn, not necessarily square [22], [108]. We will present a factorization of this form with our novel extension in Chapter 5.
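To make the roles of C, U, and R concrete, the following sketch (hypothetical code, not an implementation from this thesis) assembles a CUR approximation from already-chosen index sets, computing U via pseudoinverses so that CUR reproduces A exactly when the selected columns and rows span its column and row spaces:

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Assemble a CUR approximation of A from chosen column/row indices.

    For fixed C and R, the choice U = C^+ A R^+ minimizes ||A - C U R||_F;
    the index-selection step itself (the hard part) is assumed done elsewhere.
    """
    C = A[:, col_idx]  # selected columns of A
    R = A[row_idx, :]  # selected rows of A
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# Example: a rank-2 matrix is recovered exactly from 2 columns and 2 rows
A = np.outer([1., 2., 3.], [1., 0., 1.]) + np.outer([0., 1., 1.], [2., 1., 0.])
C, U, R = cur_from_indices(A, [0, 1], [0, 1])
assert np.allclose(C @ U @ R, A)
```

Note that, unlike an SVD-based approximation, C and R here are literal slices of A, which is what preserves interpretability (and properties like nonnegativity) in the reduced representation.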
Throughout the literature, the CUR factorization is also referred to as the matrix pseudoskeleton [32] and sometimes the matrix skeleton [12]. Some of the earlier works developing CUR algorithms appear in the late 1990s. In a paper published in 1997, Goreinov, Tyrtyshnikov, and Zamarashkin select C and R by searching for submatrices of maximal volume in the singular value decomposition (SVD) [32].
Stewart uses a quasi-Gram-Schmidt method applied to A twice to get a low-rank factorization that preserves sparsity and, though not yet coined as such, arrives at a
CUR factorization in a 1999 paper [97].
Since then, several factorization schemes have been proposed. For example, Drineas,
Mahoney, and Muthukrishnan [22] develop a means of subspace sampling using leverage, or "importance," scores. These normalized scores are calculated from the leading singular vectors of A and form a probability distribution for selecting the "highly influential" rows and columns for constructing the factorization [22], [61]. Similar ideas are discussed by Wang and Zhang, who present a CUR factorization via adaptive sampling [108]; this type of sampling (proposed by Deshpande, Rademacher, Vempala, and Wang [19]) essentially uses leverage scores based on residuals computed after each sampling step in the process of forming the decomposition.
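As an illustrative sketch of the leverage-score idea (the scaling and sampling details in [22] differ), the score of a column comes from the squared entries of the leading right singular vectors, normalized so that the scores form a probability distribution:

```python
import numpy as np

def leverage_scores(A, k):
    """Rank-k column leverage scores (in the spirit of Drineas et al.).

    The score of column j is the squared norm of column j of the leading
    k right singular vectors, scaled by 1/k so the scores sum to one.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return np.sum(Vt[:k, :] ** 2, axis=0) / k

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
p = leverage_scores(A, k=3)
assert np.isclose(p.sum(), 1.0)
# Columns can then be drawn according to these scores:
cols = rng.choice(A.shape[1], size=3, replace=False, p=p / p.sum())
```

Because selection is randomized, such schemes typically oversample beyond k to obtain their approximation guarantees, which is the contrast with the deterministic DEIM selection discussed below.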
Leverage-score-type schemes, however, typically involve some oversampling of rows and columns beyond the expected/given rank k. While Mahoney and Drineas use this oversampling to obtain particular approximation guarantees [61], works using the Discrete Empirical Interpolation Method (DEIM) [94] or interpolative decompositions [105] have since been proposed with deterministic column and row selection schemes that require no oversampling and produce comparable, if not better, results than leverage score schemes.
In some of the approaches presented in the literature, the CUR factorization is computed using some form of column (row) selection on the SVD or a column-pivoted QR factorization (or both in some cases, such as in the DEIM and incremental QR method used herein [94]). In an effort to make these algorithms faster by reducing the data to only a portion of its original size, several authors have also found ways to include randomized methods presented by Halko, Martinsson, and Tropp [35]. For instance, Voronin and Martinsson propose the use of randomized interpolative decompositions, pre-multiplying A by a Gaussian random matrix prior to performing the pivoted QR factorizations [105], and Wang and Zhang use the randomized SVD of A as an initial step in their presented algorithm [108].
2.4 Dynamic Time Warping
Though physicians expect to find particular features in ECGs, it is possible that these features may not occur with the exact same timing within each beat and from patient to patient. To address this, we use dynamic time warping (DTW), a technique that was originally developed for identifying speech patterns in the 1970s [104], [88], [41], [89]. Discussed in more detail in Chapter 6, DTW finds a mapping between signal indices that minimizes the distance between them and allows for similar features/words to be mapped to each other despite shifts in timing. The extension of DTW to other applications, such as finding patterns in population dynamics and the stock market, does not seem to have been suggested until a paper by Berndt and Clifford was published in 1994 [7]. Since then, dynamic time warping and variants thereof have been used in applications such as facial recognition [6], signature [74] and fingerprint [49] validation, and even ECG pattern recognition, described further in the next section below.
As mentioned above, in identifying individual ECG features, we use DTW largely for the mapping of a known label to a predicted label location in a new time series, or beat. This means of comparing two sequences, however, also appears in our discussion of clustering algorithms for subset selection and further dimension reduction.
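The index mapping described above can be sketched with the classic dynamic-programming recurrence. This toy implementation (absolute-difference cost, no smoothing, no path-length scaling) is illustrative only and not the variant used later in this thesis:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between two 1-D sequences.

    D[i, j] holds the cost of the best monotone warping path aligning
    x[:i] with y[:j]; each step extends the path by a match, an
    insertion, or a deletion.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A time-shifted copy of a bump aligns perfectly under DTW,
# while a flat signal does not
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same bump, shifted left
c = [0, 0, 0, 0, 0, 0, 0]
assert dtw_distance(a, b) == 0
assert dtw_distance(a, c) == 4
```

The zero distance between the shifted copies is exactly the behavior that makes DTW attractive for beats whose features drift in time.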
2.5 Clustering
Put simply, clustering divides the data into different groups, or "clusters." A number of different clustering algorithms exist, though Ratanamahatana et al. lump the algorithms used in analyzing time series into one of two methods: partitional and hierarchical [86]. Because hierarchical clustering methods build clusters in a pairwise fashion, they can be more computationally complex [86] and are considered only for the purposes of comparison in Chapter 4, where we briefly discuss agglomerative clustering.
The most common form of partitional clustering is perhaps the k-means clustering algorithm, which iteratively defines cluster centers by averaging over cluster members.
One drawback of k-means in this context, however, is that the averaging of ECG beats
is likely to confound the morphological information that is sought [75], [82]. While
Petitjean et al. develop a way of averaging time series via DTW to preserve some
information for classification, because the morphologies in our final modeling must
be physiological, the averaging of beats is avoided in this work.
An alternative that can be used to cluster with DTW as the distance measure is k-medoids. In this algorithm, rather than averaging beats to determine cluster centers, actual beats from the data set are selected as centroids [71]. Having seen the success of k-medoids with DTW in the 2006 paper by Niennattrakul and Ratanamahatana [75], this approach was used in previous work with somewhat inconclusive results due to a need for different/better measures of clustering success [37], an issue addressed here through the use of labeled data sets.
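To make the k-medoids idea concrete, the following sketch operates on a precomputed pairwise distance matrix (which could be filled with DTW distances); the update rule and toy data are illustrative, not the implementation of [71] or [37]:

```python
import numpy as np

def k_medoids(D, k, n_iter=20, seed=0):
    """Basic k-medoids on a precomputed distance matrix D (e.g. pairwise DTW).

    Medoids are actual data points: each cluster's medoid is the member
    minimizing the summed distance to the rest of its cluster, so no
    averaging of beats ever occurs.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Two well-separated 1-D groups: one medoid lands in each group
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, k=2)
assert sorted(bool(x[m] < 5) for m in medoids) == [False, True]
```

Because the centroids are actual data points, every cluster representative remains a physiological beat, which is the property motivating k-medoids over k-means here.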
With the possibility of being used as an additional step after preliminary dimension reduction is performed by a linear-algebraic technique, we also briefly demonstrate the use of Max-Min clustering proposed by Gonzalez [30]. This method maximizes the minimum distance between cluster centroids and is selected here following the lead of Syed, Guttag, and Stultz, who implemented this clustering scheme in ECG analysis [99]. In Chapters 3 and 4, we compare existing and newly-generated results from Max-Min clustering with those generated by DEIM and other methods.
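A minimal sketch of Gonzalez-style Max-Min selection over a precomputed distance matrix follows; the greedy rule below is the core idea, though details such as initialization and stopping criteria differ across implementations like that of [99]:

```python
import numpy as np

def max_min_centroids(D, k, first=0):
    """Greedy Max-Min (Gonzalez) selection on a precomputed distance matrix.

    Starting from one point, repeatedly pick the point whose minimum
    distance to the centroids chosen so far is largest, maximizing the
    minimum inter-centroid distance.
    """
    centroids = [first]
    for _ in range(k - 1):
        d_to_nearest = D[:, centroids].min(axis=1)
        centroids.append(int(np.argmax(d_to_nearest)))
    return centroids

# Three natural "groups" on a line: the selected centroids are spread out
x = np.array([0.0, 0.2, 5.0, 5.2, 10.0])
D = np.abs(x[:, None] - x[None, :])
assert set(max_min_centroids(D, k=3)) == {0, 2, 4}
```

The greedy spread-out behavior is what lets Max-Min pick up rare morphologies that averaging-based methods tend to absorb into large clusters.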
This use of clustering and DTW in analyzing ECGs is not new. A more detailed literature review on the role of DTW and/or clustering in ECG analysis can be found in my previous work [37] with some references discussed in Section 3.2.
In their 2007 paper, Syed, Guttag, and Stultz use DTW as the distance measure to cluster heart beats in the MIT-BIH Arrhythmia Database, as well as the MGH-MF waveform database, using Max-Min clustering. In their approach, they choose not to scale the DTW distance by the length of the warping path, considering the warping path indicative of mapping quality. In addition, they use a version of DTW that smooths in each step instead of smoothing a priori. The methods presented in [99] are closely in line with those used in part of this thesis; where the authors focus on identifying full beat morphologies for symbolic representation of ECG recordings over longer time intervals, we not only prioritize another subset selection scheme but also go further to identify individual features within each beat. In a later paper published in 2011, Syed and Guttag apply their symbolic representation to develop a measure of risk by identifying patient outliers in longer ECG strings [98]. While this is the direction we would like to take our methods in the future, the fact that congenital heart disease patients may have "normal" ECG rhythms that would be indicative of trouble in patients with healthy hearts suggests that perhaps other avenues of identifying patient decline should also be considered.
Although the ideas in this thesis surrounding ECG analysis via DTW and clustering are not necessarily new in themselves, the combination of these techniques (or variants thereof) with CUR in the application of feature identification in congenital heart disease appears to be a novel contribution of this work.
Chapter 3
DEIM-CUR ECG Subset Selection
The contents of this chapter are largely adapted from the following paper:
• E. P. Hendryx, B. M. Rivière, D. C. Sorensen, and C. G. Rusin. Finding representative electrocardiogram beat morphologies with CUR. Journal of Biomedical Informatics, 77:97-110, 2018. https://doi.org/10.1016/j.jbi.2017.12.003.
Those added statements not contained in the paper are set apart with square brackets. We also note that all internal references to “this paper” have been changed to “this chapter” for inclusion in this thesis, and references to other portions of the document and external sources have been adapted to fit in the context of the larger thesis.
3.1 Introduction
The identification of patterns in temporal biomedical data has been a popular topic for a number of years. Pattern identification plays a role in analyzing a wide variety of temporal data: from detecting patterns in hepatitis lab results regarding the effectiveness of treatment [39], to recognizing temporal patterns in electromyogram signals for controlling the movement of prostheses and orthoses [33], to identifying subsequences of event codes over time in electronic medical record data that may hold clinical relevance [81]. In this work we focus on the simultaneous recognition of both common and unusual subsequences in quasiperiodic physiological waveforms. While the methods presented here can be extended to other signals, we are particularly interested in the electrocardiogram, or ECG.
[As mentioned in the introduction,] the ECG has provided clinical information about the heart since the beginning of the twentieth century [24]. Because the potential difference in each of the different ECG leads provides insight into cardiac function,
physicians can use this data to monitor patient status over time and form diagnoses.
However, performing a detailed analysis of all ECG signals continuously over longer
periods of time for each patient is a nontrivial task. Therefore, there is a role for
automated ECG analyses in supporting physicians in clinical decision-making. The
goal of this work is to demonstrate an e↵ective means of identifying a subset of heart
beats that summarizes the di↵erent types of beats seen throughout a longer ECG
recording.
In the clinical setting, the ECG is often described by looking at trends in smaller
features contained within individual beats. Common features within each beat are
the P, Q, R, S, and T waves; in the case that the left and right ventricles do not
depolarize at the same time – as is observed in the presence of a bundle-branch block
– an R′ peak may also be present. Though other features may be present in the data,
Figure 3.1 : Labeled features within synthetic ECG waveforms. Left: Some standard ECG features; this figure is taken from [37]. Right: Some standard ECG features with the addition of the R′ peak.
these six waves, depicted in Figure 3.1, are among those most commonly seen in the
ECG and are the primary features of interest in this work. For a more extensive
overview regarding the ECG, see, for example, Dale Dubin’s Rapid Interpretation of
EKG’s [24].
Despite the fact that automated ECG analysis has been an area of interest for some time, a 2013 literature review by Velic, Padavic, and Car suggests that there is still much to be done in strengthening algorithms for clinical use [103]. This claim is supported, for example, by the need for monitoring systems that are designed for specific populations, such as pediatric patients with parallel circulation [87].
Our contribution in this work is to describe a framework for automatically identifying a representative subset of beats that appear in the ECG over extended periods of time. We provide a foundation for future algorithms by demonstrating the effectiveness of the CUR matrix factorization in identifying not only common, but also rare ECG beat morphologies within unlabeled data sets. These morphologies can then be used to train new classifiers for populations exhibiting a wide variety of beat shapes. This ability to identify a representative subset of beats is particularly relevant as it is now possible to store large amounts of physiological waveform data for retrospective studies, such as that carried out by Rusin et al. (2016) in the development of predictive models for specific patient classes [87]. For retrospective studies including millions, if not billions, of beats, this ability to reduce the data set to a representative subset can greatly decrease the computational cost of additional analyses and, if necessary, make the expert-labeling of different classes feasible. In addition to aiding in retrospective data analysis, the approach presented here can quickly provide physicians with a representative summary of beat morphologies seen over a given time frame to inform their clinical decision-making at the bedside.
As noted above, extensions of this work go beyond the ECG to other physiological waveforms that are periodic in nature to include data summarization as well as the development of template libraries to be used in waveform classification and predictive modeling incorporating other time series. In fact, though discussed only briefly herein
(see the end of Section 3.4), the methods presented here are also applicable to other data types beyond time series. This chapter is organized as follows. Section 3.2 provides further context regarding ECG data analysis and the CUR factorization. A more detailed description of the methods used in this work is provided in Section 3.3, followed by a discussion of our results in Section 3.4. In this latter section, we test the ability of our method to detect the presence of classes under a few different beat labeling systems, use selected beat morphologies to classify the remaining unlabeled beats, and evaluate the sensitivity of the algorithm's subset selection to automated beat delineation and noise. We conclude with a brief summary of our work and future directions of interest in Section 3.5.
3.2 Background and Related Work
3.2.1 Electrocardiogram Analysis
With the capability of collecting and/or storing ECG data digitally, the automated
analysis of this data has been a topic of interest over the last several decades. In particular, there has been a growing interest in the classification of ECG beats for diagnostic and predictive purposes. Authors have classified ECG data using approaches involving different data representations such as wavelets [47], distance measures such as dynamic time warping [40], [101], [91], [83], and classifiers such as neural networks [96], [44], [10], among others.
To form a classifier in a supervised manner, one typically needs to know what
classes to expect within the data. In the absence of a large annotated training set,
this requires the identification of representative ECG signals within and across a
variety of populations. Ceylan, Özbay, and Karlik (2009) use fuzzy clustering to identify class representatives for training a neural network classifier [10]. Similarly, Yeh, Wang, and Chiou (2010) use fuzzy clustering to identify representative ECG feature vectors for classifying unlabeled beats [110], and Annam, Mittapalli, and Bapi (2011) use dynamic time warping with k-medoid clustering to identify the classes of QRS complexes based on distances to cluster centers [3].
Cuesta-Frau, Pérez-Cortés, and Andreu-García (2003) also use clustering to identify representative morphologies in ECG data; the authors use preclustering with dynamic time warping to reduce the data set, and then evaluate two different types of clustering (k-medians, or k-medoids, and Max-Min) with different types of data representations and temporal alignments. The authors point out the need to ensure that less common beats are detected separately and not clustered with more common beats in order for these less common beats to be of better diagnostic use [14].
Stemming from this work, Syed, Guttag, and Stultz (2007) also use dynamic time warping with Max-Min clustering to identify ECG beat classes and define a symbolic representation of longer time series [99]. While it is expected that clusters should contain beats from the same known ECG class, Syed et al. also allow for further divisions within a class and define their own class labels; where other works often confine the number of classes to those already known in the literature, this work seeks to refine class definitions to identify more subtle changes within the ECG [99].
Our work presented here has a similar goal in that we seek to implement an algorithm that will identify representative beat morphologies within large data sets with little to no prior knowledge about what classes should be expected in the data set.
However, as opposed to clustering algorithms, or even motif and anomaly detection algorithms in the time series literature [86], we utilize the underlying structure of the data through the CUR matrix factorization to identify a broad representation of the beats present. Where other matrix factorization schemes (such as principal component analysis, or PCA) can also provide dimension reduction within large data sets, the CUR factorization provides a reduced data set consisting of original ECG beats. In this way, the reduced set maintains its clinical interpretability, in contrast to the derived features from other dimension-reducing matrix factorizations.
3.2.2 CUR Factorization
[With a brief review of CUR provided in Section 2.3, we revisit this topic and provide additional information regarding this factorization and its application.]
The CUR factorization provides a means of identifying key rows and columns of the data matrix A, approximating A as the product CUR, where C is a matrix consisting of a subset of k columns from A and the matrix R consists of a subset of k rows from A. Also sometimes called the matrix pseudoskeleton, the CUR factorization can be formed in a variety of ways. For instance, Goreinov, Tyrtyshnikov, and Zamarashkin (1997) propose defining the factorization by finding submatrices with maximal volume in the matrices of left and right singular vectors from the singular value decomposition [32]. Though not yet called the "CUR" factorization, the decomposition can also be derived using a quasi-Gram-Schmidt algorithm, as was demonstrated in a 1999 work by Stewart [97]. A more commonly used method is to again consider the singular value decomposition, computing "leverage scores" to determine which rows/columns are used in forming C and R [22], [61]. For our implementation, we use the recently proposed discrete empirical interpolation method (DEIM) induced CUR with incremental QR because of its demonstrated improved performance over the use of leverage scores and its extendability to larger data sets [95]. As opposed to determining the set of reduced indices via leverage scores, which are computed using information from all singular vectors at once, DEIM-CUR determines the row and column indices of interest by considering each singular vector in turn, taking advantage of the fact that each individual singular vector holds different information about the space in which the data lives.
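The per-singular-vector selection just described can be sketched as follows. This is a simplified rendering of DEIM point selection (cf. [94], [95]) applied to one basis matrix, not the incremental-QR implementation used in this thesis:

```python
import numpy as np

def deim_indices(V):
    """DEIM point selection on the columns of V (e.g. leading singular vectors).

    Each column is processed in turn: it is interpolated at the indices
    chosen so far, and the index of the largest-magnitude residual entry
    becomes the next selected index.
    """
    p = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, V.shape[1]):
        # interpolate column j at the already-selected indices
        c = np.linalg.solve(V[p][:, :j], V[p, j])
        r = V[:, j] - V[:, :j] @ c  # residual is zero at indices in p
        p.append(int(np.argmax(np.abs(r))))
    return p

# Selecting from an orthonormal basis of 3 columns yields 3 distinct indices
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 3)))
idx = deim_indices(Q)
assert len(set(idx)) == 3
```

Because the residual vanishes at previously selected indices, each pass necessarily picks a new index, yielding exactly k selections for k singular vectors with no oversampling.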
CUR acts simultaneously as a common motif detector and an anomaly detector, requiring previous knowledge about the data or further analysis in order to tell the two cases apart. While the detection of both standard and rare beat morphologies is an advantage of CUR, one limitation of this approach is that noisy perturbations of previously detected beats may be identified as independent events or anomalies. This is a consequence of using a framework that identifies the beats that are most different from those already considered. Making CUR more robust to noise is something we would like to consider in the future.
The CUR factorization has been applied to a number of topics, including traffic networks [70], music transcription [2], and toxicogenomics [112]. In addition, CUR has been applied to EEG (electroencephalogram) data for the purposes of compression. For instance, Dauwels, Srinivasan, Ramasubba, and Cichocki (2011) compare the approximation error of the CUR factorization via leverage scores to other EEG data compression schemes, concluding that CUR may not be optimal if the goal is solely to accurately approximate the original data set under a certain measure of error [16] (which is not the goal here). Lee and Choi (2008) demonstrate the use of CUR as a precursor to applying the nonnegative matrix factorization to EEG data [55]. To our knowledge, however, CUR – much less DEIM-CUR – has not been applied to the ECG or similar quasiperiodic physiological signals for simultaneous motif and anomaly identification.
3.3 Methods
To test our approach in identifying representative morphologies, we use several different data sets: a synthetic data set with different levels and types of added variability, the well-studied MIT-BIH Arrhythmia Database, the Massachusetts General Hospital-Marquette Foundation (MGH-MF) Waveform Database, and the St. Petersburg Institute of Cardiological Technics (Incart) 12-lead Arrhythmia Database. Our treatment of these data sets for subset selection, our choice in CUR implementation, and an overview of the additional tests discussed in Section 3.4 are described throughout the remainder of this section.
3.3.1 Synthetic Data Construction
Though there are synthetic waveform generators like ECGSYN [69] that produce more accurate synthetic ECG waveforms, for proof of concept and for more control over specific waveform characteristics, we take a simplified approach to synthetic signal generation. Where ECGSYN uses a dynamical model for signal formation that takes advantage of the Gaussian appearance of the individual beat features [69], here each synthetic beat is constructed by simply summing the six different Gaussian curves in Equations (3.1) through (3.6), representing the P, Q, R, S, T, and R′ waves in the ECG. Since the primary purpose of the synthetic data set is to test the sensitivity of our approach to very specific types of variability, the simple model described here allows for direct manipulation of each feature according to the variations of interest in a manner suitable to our purposes.
R(t) = R_amp exp( -((t - R_peak)/R_width)^2 ),  (3.1)

Q(t) = -Q_amp exp( -((t - (min{R_peak, R_peak + R'_shift} - Q_shift))/Q_width)^2 ),  (3.2)

S(t) = -S_amp exp( -((t - (max{R_peak, R_peak + R'_shift} + S_shift))/S_width)^2 ),  (3.3)

P(t) = P_amp exp( -((t - (min{R_peak, R_peak + R'_shift} - P_shift))/P_width)^2 ),  (3.4)

T(t) = T_amp exp( -((t - (max{R_peak, R_peak + R'_shift} + T_shift))/T_width)^2 ),  (3.5)

and

R'(t) = R'_amp exp( -((t - (R_peak + R'_shift))/R'_width)^2 ).  (3.6)
The P, Q, S, T, and R′ waves are all defined relative to the location of the R peak, as indicated by parameters with a subscript of "shift." Note that the presence of the R′ peak can also affect the location of the other features in time. Parameters with a subscript of "amp" are indicative of the corresponding wave amplitude, and a subscript of "width" corresponds to a feature's width, defining the standard deviation of the affiliated normal curve. The inclusion of these parameters not only allows for the construction of different beat morphology classes, but each class can also be constructed to have a certain level of within-class variability with respect to individual feature location, feature magnitude, and feature width, as well as heart rate variability given by R-peak placement. The control cases for the twelve classes constructed for this work are shown in Figure 3.2. The parameters used to construct these control morphologies via Equations (3.1) through (3.6) are presented in Table A.1 in the Appendix.* The control heart rate for each class is 120 beats per minute to simulate an ECG that might be found in pediatrics.
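To illustrate the construction, the sketch below sums Gaussian bumps for a single beat. The amplitudes, shifts, and widths here are made-up values for illustration, not the Table A.1 parameters, and the R′ wave is omitted:

```python
import numpy as np

def gaussian_wave(t, amp, center, width):
    """One ECG feature modeled as a Gaussian bump (cf. Eqs. (3.1)-(3.6))."""
    return amp * np.exp(-((t - center) / width) ** 2)

# Illustrative parameters: a beat as a sum of P, Q, R, S, and T waves,
# each placed relative to the R peak via a "shift"-style offset.
t = np.linspace(0, 1, 125)
r_peak = 0.5
beat = (gaussian_wave(t, 1.0, r_peak, 0.02)            # R wave
        - gaussian_wave(t, 0.15, r_peak - 0.05, 0.02)  # Q wave
        - gaussian_wave(t, 0.25, r_peak + 0.05, 0.02)  # S wave
        + gaussian_wave(t, 0.2, r_peak - 0.2, 0.04)    # P wave
        + gaussian_wave(t, 0.35, r_peak + 0.25, 0.06)) # T wave
assert int(np.argmax(beat)) == int(np.argmin(np.abs(t - r_peak)))  # R dominates
```

Varying any single amplitude, shift, or width parameter then yields the within-class variability experiments described next.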
Feature variability is included by allowing each of the above parameters to vary within a certain percentage of the given "control" parameter used in class definition. For instance, to add 10% variability from the T wave control width, T_width_c, a uniformly distributed random number r on [-0.1, 0.1] is generated, and the perturbed width parameter is given as (1 - r)T_width_c. Heart rate variability is added such that

*The synthetic data sets, waveform generation codes, and other codes relating to the results presented throughout this chapter can be found at https://github.com/ehendryx/deim-cur-ecg.
Figure 3.2 : Control morphologies from each of the twelve synthetic beat classes constructed.
the heart rate is increased from the control rate within a desired range.
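The perturbation rule for a single parameter can be sketched as follows (an illustrative helper, not the thesis code):

```python
import random

def perturb(control_value, variability, rng=random):
    """Perturb a control parameter by a uniform relative amount.

    For variability 0.1 (i.e. 10%), r is drawn uniformly from [-0.1, 0.1]
    and the perturbed value is (1 - r) * control, as described in the text.
    """
    r = rng.uniform(-variability, variability)
    return (1 - r) * control_value

random.seed(0)
values = [perturb(2.0, 0.1) for _ in range(1000)]
assert all(1.8 <= v <= 2.2 for v in values)  # stays within +/-10% of control
```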
For each of the twelve constructed classes, heart rate, feature magnitude, and feature width variability are added separately for levels of 1%, 2%, 5%, 10%, 20%, 30% and 50%. These percentages are halved for adding variability in feature placement.
Separate synthetic sets are constructed for each of the seven levels of variability for a given variability type. A total of 6000 beats are included in each synthetic test set, with 500 beats coming from each of the twelve morphology classes. [Note that by constructing the beats as a summation of Gaussians, adding variability in one feature may have some effect on another; for instance, varying feature placement and/or width may also look like varying the feature magnitude (and vice versa), since a change in one feature may affect the way the neighboring features look in the beat summation. While this idea is not very different from the superposition of electrical impulses in the physiological setting, we recognize that adding variability in a single feature may not be an isolated event.]
To construct the data matrices for analysis, each synthetic time series is divided into individual beats through R-peak detection. A basic peak detector based on finding local maxima of the signal magnitude was used for automated beat delineation in this chapter. Similar to the control beats shown in Figure 3.2, the beats in the matrix are defined from R peak to R peak, since the R peak is a more clearly identifiable feature within the ECG (as opposed to trying to delineate beats by defining a cut-off point between T and P waves). If the R peak is less prevalent in some ECGs, then the larger downward peaks can be used to delineate beats; see, for example, classes 6, 7, 10, and 11 in Figure 3.2. Of note is that in the case of 50% amplitude variability, the coded R-peak finder for beat separation is unable to accurately detect all 6000 beats due to the large amplitude fluctuations, and this case is removed from analysis. In the remaining sets, the individual beats are interpolated to have a length of 125 samples and then concatenated to form a matrix in which each column contains an individual beat.
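A simplified sketch of this beat-matrix construction follows, with a naive local-maximum peak detector and linear interpolation to 125 samples; the threshold and helper names are hypothetical, not the detector used in the experiments:

```python
import numpy as np

def beats_to_matrix(signal, beat_len=125, min_height=0.5):
    """Split a signal at local-maximum R peaks and build the beat matrix.

    Each R-to-R segment is linearly interpolated to a common length and
    stored as one column. (min_height is an illustrative threshold.)
    """
    s = np.asarray(signal, dtype=float)
    # interior local maxima above the threshold
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] >= s[2:]) & (s[1:-1] > min_height)
    peaks = np.where(is_peak)[0] + 1
    cols = []
    for a, b in zip(peaks[:-1], peaks[1:]):
        seg = s[a:b + 1]
        grid = np.linspace(0, len(seg) - 1, beat_len)
        cols.append(np.interp(grid, np.arange(len(seg)), seg))
    return np.column_stack(cols)

# Three identical triangular "beats" yield two identical 125-sample columns
one = np.concatenate([np.linspace(0, 1, 10), np.linspace(1, 0, 10)[1:]])
M = beats_to_matrix(np.tile(one[:-1], 3))
assert M.shape == (125, 2)
```

The resulting matrix, one beat per column, is exactly the object to which the subset-selection methods of this chapter are applied.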
In addition to variability in the typical ECG waves/peaks, we also test the performance of DEIM-CUR with incremental QR in the presence of random noise. While there are several sources of noise in ECG data, for initial experiments incorporating noise in the synthetic data, we add normally distributed random noise to the control
Figure 3.3 : Examples of the Class 1 synthetic beat morphology with the addition of different levels of random noise.
data set. (More realistic types of noise are studied on the other data sets as described
more below.) After forming the data matrix A to contain 500 identical control beats from each of the 12 classes, we add a randomly generated matrix, E, to A, where the entries of E are normally distributed and ||E||_2 is a fraction of ||A||_2. As in the variability experiments constructed above, we generate noise matrices such that E has a norm that is 1%, 2%, 5%, 10%, 20%, 30%, or 50% of the spectral norm of A.
Examples of the Class 1 morphology under the varying levels of corruption are shown in Figure 3.3.
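The noise construction can be sketched as follows, scaling a Gaussian random matrix so that its spectral norm is the desired fraction of ||A||_2 (the helper is illustrative):

```python
import numpy as np

def add_relative_noise(A, fraction, seed=0):
    """Add Gaussian noise E scaled so that ||E||_2 = fraction * ||A||_2."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal(A.shape)
    # rescale E to the target spectral norm (ord=2 gives the spectral norm)
    E *= fraction * np.linalg.norm(A, 2) / np.linalg.norm(E, 2)
    return A + E, E

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 30))
_, E = add_relative_noise(A, 0.05)
assert np.isclose(np.linalg.norm(E, 2), 0.05 * np.linalg.norm(A, 2))
```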
[To gain some sense of where we might expect an algorithm to have difficulty in distinguishing data sets, we measure a sense of distance between the different classes in the presence of added variability. For the distance between classes i and j, we consider the maximal allowed variation from the control within each class with respect to dynamic time warping, or DTW (described in more detail in Chapter 6). In doing so, we think of these maximal distances as the class "diameters," d_i and d_j, and also track the DTW distance between the class i and j controls, c_i and c_j, respectively.
More specifically, in this chapter we define the "diameter" as

d_i = d_DTW(c_{i+max}, c_{i-max}),

where c_{i+max} and c_{i-max} are the maximal positive and negative changes allowed from the control, respectively, and d_DTW(c_{i+max}, c_{i-max}) is the dynamic time warping distance between these two extremes. Then the control-based distance between classes i and j is

r_{i,j} = max(d_i, d_j) / d_DTW(c_i, c_j).

For a particular type and percentage of variability, we can define

R = max_{i ≠ j} r_{i,j}  (3.7)
as a means of understanding just how close classes might be. We expect clearly defined classes to have an R-value less than 1, with R > 1 indicating that it is possible for a class to overlap with, or even envelop, another class.]
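Given precomputed class diameters and a distance on the controls, Eq. (3.7) reduces to a maximum over class pairs. The sketch below uses a plain absolute difference in place of DTW purely for illustration:

```python
import itertools

def class_overlap_R(diameters, controls, dist):
    """Compute R = max over i != j of max(d_i, d_j) / dist(c_i, c_j), Eq. (3.7).

    `diameters` holds the per-class maximal within-class distances d_i,
    `controls` the class controls c_i, and `dist` any distance function
    (DTW in the text; absolute difference here for illustration).
    """
    best = 0.0
    for i, j in itertools.combinations(range(len(controls)), 2):
        r_ij = max(diameters[i], diameters[j]) / dist(controls[i], controls[j])
        best = max(best, r_ij)
    return best

# Two classes whose "diameters" exceed the distance between controls: R > 1
R = class_overlap_R([1.5, 0.4], [0.0, 1.0], dist=lambda a, b: abs(a - b))
assert R == 1.5
```

An R above 1, as in this toy example, flags a pair of classes whose allowed perturbations can reach into each other's territory.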
[Table 3.1 shows the R values and closest synthetic classes for the different variability types and levels. As might be expected, in most cases, as the variability grows, R also grows, indicating that classes become less distinctive. Of note is that R tends to be 0 in the amplitude cases. This is possibly due to normalization; since all of the features are scaled simultaneously in adding variability, this scaling may be lost in normalization.]

[The closest-classes results are not extremely surprising given that Class 12 has the same structure as Class 4, except that the T and P waves are closer together in Class 12; Classes 11 and 10 also differ in T- and P-wave placement, Classes 3 and 1 differ only in the presence of the P wave, and Classes 4 and 2 differ only in the presence of the S wave.]
Figure 3.4 : The twelve different beat morphologies used in experimentation. Each beat shown is the class control to be perturbed in the variability experiments.
Change Heart Rate Placement Magnitude Width Noise R Closest R Closest R Closest R Closest R Closest Classes Classes Classes Classes Classes 1% 0.30 (12, 4) 0.18 (12, 4) 0 (4,2) 0.12 (11,10) 0.13 (11,10) 2% 0.60 (12, 4) 0.28 (12, 4) 0 (3,1) 0.25 (11,10) 0.24 (11,10) 5% 1.02 (12, 4) 0.71 (12, 4) 0 (12, 4) 0.61 (11,10) 0.54 (11,10) 10 % 1.38 (11,10) 1.08 (12, 4) 0 (3,1) 1.17 (12, 4) 0.90 (11,10) 20 % 1.68 (12, 4) 1.40 (11,10) 0 (12, 4) 2.15 (12, 4) 1.84 (11,10) 30 % 2.13 (11,10) 2.17 (12, 4) 0 (4,2) 2.90 (12, 4) 2.44 (11,10) 50 % 3.55 (11,10) 3.02 (12, 4) – – 3.51 (12, 4) 3.32 (11,10)
Table 3.1 : [Synthetic data class comparison for different types and levels of variability. (These results are not included in [38].)]
3.3.2 Real Patient Data
To test our methods on real data, we first use data downloaded from the MIT-
BIH Arrhythmia Database [72] available on PhysioNet [29]. This data contains 48
files from 47 adult patients. These files each contain approximately 30 min of data
recorded at 360 Hz for two ECG leads. For the purpose of this work, however, we
analyze only the data from one of the leads. When provided, we use the MLII
lead; otherwise, for the two files without MLII information, we use the V5 lead.
Of note is that this use of di↵erent leads does not inhibit the use of CUR in our
application; our proposed approach should identify representative ECG morphologies
regardless of the lead from which individual beats come. The records are labeled
with the following whole-beat annotations: normal beat (N ), left bundle branch block
beat (L), right bundle branch block beat (R), atrial premature beat (A), aberrated atrial premature beat (a), nodal (junctional) premature beat (J ), supraventricular
premature or ectopic beat (S), premature ventricular contraction (V), ventricular
flutter wave (!)†, fusion of ventricular and normal (F), atrial escape beat (e), nodal
(junctional) escape beat (j), ventricular escape beat (E), paced beat (/), fusion of paced and normal (f), and unclassifiable beat (Q) [29].
With the MIT-BIH Arrhythmia Database used to identify an appropriate tolerance to be used in CUR, we test the performance of the algorithm with the selected parameter on the MGH-MF and Incart Databases, which are also both available on
PhysioNet [29]. The MGH-MF Database [109] contains 250 records, though we disregard files mgh061, mgh127, mgh230, and mgh235 due to lack of annotations or unavailability of .mat files for ready analysis in MATLAB [64]. For comparison with results generated by Syed et al. [99], we focus primarily on the first 40 files. The data in this set also has a sampling frequency of 360 Hz but has a wider range of record lengths, with the typical record being about an hour long [29]. With some of the same PhysioNet labels that are present in the MIT-BIH Arrhythmia data set, the
MGH-MF data set also contains labels for (atrial or nodal) supraventricular escape beats (n), R-on-T premature ventricular contractions (r), and beats that remain unclassified (?) [29]. Due to the larger size of this data, we used the WFDB Toolbox for MATLAB and Octave [93] available on PhysioNet [29] to download the waveforms.
The Incart Database consists of 75 files each containing 30 min of 12-lead Holter-monitor data sampled at 257 Hz. In addition to some of the aforementioned PhysioNet labels, this data set also has beats with one additional annotation: (unspecified) bundle branch block beat (B) [29]. For both the MGH-MF and Incart data sets, Lead II
data was preferred for analysis via CUR when available.

†This is referred to as a “non-beat annotation” on PhysioNet.
Prior to constructing the corresponding data matrices for the different data sets,
each lead is filtered using a zero-phase first-order high-pass Butterworth filter with
a normalized cutoff frequency of 5 × 10^{-3} π radians per second to reduce baseline wandering. To eliminate edge effects, 5% of the full signal is trimmed off of each end
of the record prior to dividing the signal into individual beats from R peak to R peak
using the annotation data provided in the corresponding database. A median of 2010,
5974.5, and 2088 beats per patient remain after filtering the MIT-BIH, MGH-MF,
and Incart data sets, respectively. Each beat is interpolated to contain 150 samples
when constructing the data matrix.
As pointed out by Keogh and Kasetty [45] and Rakthanmanon et al. [84], the
time series data should be normalized prior to further analysis. With the synthetic
and real patient data, each beat is Z-normalized to have zero mean and a standard
deviation of one before the CUR factorization is formed.
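The per-beat preparation just described (resampling each R-R interval to 150 samples, then Z-normalizing) can be sketched as follows. The function name and the use of linear interpolation are our own illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def preprocess_beat(beat, n_samples=150):
    """Resample one R-R interval to a fixed length, then Z-normalize it
    (zero mean, unit standard deviation) before it becomes a matrix column."""
    beat = np.asarray(beat, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(beat))
    t_new = np.linspace(0.0, 1.0, n_samples)
    resampled = np.interp(t_new, t_old, beat)  # linear interpolation (an assumption)
    return (resampled - resampled.mean()) / resampled.std()
```

Each processed beat then forms one column of the data matrix A.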
Despite the fact that we only study one lead from each of the records in these
databases, we note that extension of our algorithm to multiple leads is possible. How
the method is extended will depend on the application and purpose of the identified
beat subset; in some cases it may be desirable to identify more beats from each lead,
suggesting the need to form separate matrices for each lead. However, it is also
possible to apply CUR to matrices consisting of data from multiple leads, tracking which matrix column corresponds to which lead. Another option is to construct A such that each column contains beats simultaneously recorded from multiple leads stacked on one another. For example, to identify representative R-R intervals based
on both leads I and II, we can construct A_I ∈ R^{m_I × n} and A_II ∈ R^{m_II × n} to contain beats from leads I and II, respectively, and then define A ∈ R^{(m_I + m_II) × n} such that

A = [ A_I ; A_II ].

(Notice that each lead can have a different number of samples per beat if so desired.)
With this construction, the column-selection algorithm will then have to consider both leads simultaneously to form C in the CUR factorization.
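A minimal sketch of this stacked construction, with hypothetical random stand-ins for the two lead matrices (note the leads need not share the same samples-per-beat count):

```python
import numpy as np

rng = np.random.default_rng(0)
n_beats = 200
A_I = rng.standard_normal((150, n_beats))   # lead I: 150 samples per beat
A_II = rng.standard_normal((120, n_beats))  # lead II: 120 samples per beat

# Column j of A holds beat j from both leads, stacked vertically
A = np.vstack([A_I, A_II])
```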
Classification and Sensitivity Tests
As discussed in more detail in Section 3.4, we also evaluate the performance of
CUR in detecting morphologies under the AAMI and AAMI2 labeling schemes. This allows us to then compare our subset-selection method paired with the well-known one-nearest-neighbor classification algorithm to existing results in the literature.
After testing the performance of our approach on relatively clean data sets, we also test our method on a subset of the MIT-BIH Arrhythmia data set with added physiological noise. Where the synthetic data is evaluated under the addition of random noise, for a more realistic scenario, we add the noise signals available in the
MIT-BIH Noise Stress Test Database (NSTDB) [73] available on PhysioNet [29]. This data set contains three separate records with baseline wandering (bw), muscle artifact
(ma), and electrode motion artifact (em); each record contains two noisy signals, which we add to the two leads in the MIT-BIH Arrhythmia database in the order presented.
As described by Clifford, Behar, Li, and Rezek [13], we add the noise, N, to a clean MIT-BIH signal S_C such that

S_N = S_C + α × N,

where S_N is the resultant noisy signal and α is given by

α = sqrt( exp( −ln(10) s / 10 ) · P_{S_C} / P_N )

for signal-to-noise ratio (SNR) s and respective signal powers given by P_x. For our experiments, we add noise with SNR equal to −6, 0, 6, 12, 18, and 24 decibels. Because the original noise waveforms in the NSTDB are typically shorter than those in the MIT-BIH Arrhythmia Database, N is formed by padding the end of the original noisy signal with a partial mirror image of itself such that the noisy and clean signals are the same length.
In addition to the prior assumption of having relatively clean data, for almost all of the analyses, we assume that beat delineation in matrix construction is largely accurate for the sake of demonstrating the effectiveness of CUR. However, in practice,
R-R intervals may have slightly different delineations depending on the peak-detection algorithm used and the quality of the data. Hence, in addition to including different types of realistic noise, we also compare CUR results generated using the beat delineations provided in PhysioNet with results generated using the simple automated peak detector originally implemented for use on the synthetic data. Under automatic delineation, each beat is labeled according to the nearest known annotation given in
PhysioNet [29]. The automatic beat separation is carried out on both the “clean” and noisy versions of the MIT-BIH Arrhythmia data.
3.3.3 CUR Factorization
To identify representative beat morphologies within the data, we use the CUR factorization of the data matrices. In the CUR factorization, the data matrix A ∈ R^{m×n} is approximated as

A ≈ CUR,

where C = A(:, q) ∈ R^{m×k} and R = A(p, :) ∈ R^{k×n} for column index vector q and row index vector p selected via a variety of methods. The matrix U ∈ R^{k×k} is constructed such that the approximation of A via this matrix product holds; note that the CUR factorization is non-unique, and U can be constructed to meet a number of approximation requirements. For example, Sorensen and Embree demonstrate that setting U = A(p, q)^{-1} yields a CUR factorization that exactly matches the p rows and q columns of A [95]. Like these authors (and others; see [97] and [61]), we ultimately choose to construct U such that U = C†AR†, where C† = (C^T C)^{-1} C^T and R† = R^T (RR^T)^{-1} are the left and right inverses of C and R, respectively. This particular construction of U is chosen because, as demonstrated by Stewart [97], it minimizes ‖A − CUR‖_2 for a given p and q.

As discussed in Section 3.2, there are also a number of ways to select the indices held in p and q in forming the CUR decomposition. To construct p and q, we use the DEIM and incremental QR implementation of the CUR factorization. A brief description of this approach is presented below. Because the DEIM approach used here relies on the singular value decomposition (SVD) of A and can be used independently to determine the CUR factorization, we first present the DEIM algorithm and then describe how incremental QR can be used to approximate the SVD for DEIM in larger data sets. For a more detailed description of DEIM-CUR with incremental QR, the interested reader is referred to the 2016 work of Sorensen and Embree [95].
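For illustration, given index vectors p and q, the factors and this choice of U can be formed as below; this is a sketch using NumPy's pseudoinverse (equivalent to the left and right inverses above when C and R have full rank), and the function name is ours.

```python
import numpy as np

def cur_from_indices(A, p, q):
    """Form C = A(:, q), R = A(p, :), and U = pinv(C) @ A @ pinv(R),
    the choice of U that minimizes ||A - CUR||_2 for given p and q."""
    C = A[:, q]
    R = A[p, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R
```

When A has rank k and the selected k rows and columns span its row and column spaces, the product CUR reproduces A exactly.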
DEIM
Originally described by Chaturantabut and Sorensen in 2010 [11], DEIM provides a deterministic means of identifying the most important rows and columns of A ∈ R^{m×n}. First, the singular value decomposition of A is formed such that A = VSW^T, where V ∈ R^{m×k} and W ∈ R^{n×k} are unitary matrices and S is a diagonal matrix containing the k nonzero singular values of A. DEIM is then applied to the left and right singular vectors of A contained in V and W, respectively, to construct the index vectors p and q.

In forming p, let v_j denote the j-th column of V and V_j denote the matrix containing the first j columns of V. Similarly, let p_j contain the first j elements of p and P_j = I(:, p_j), where I is the identity matrix in R^{m×m}. Then, with p_1 defined such that |v_1(p_1)| = max(|v_1|), we take the j-th interpolatory projector as

𝒫_j = V_j (P_j^T V_j)^{-1} P_j^T.

For

r = v_j − 𝒫_{j−1} v_j,

the j-th element of p is defined to be the positive integer, p_j, such that

|r(p_j)| = max(|r|).

Notice that the interpolatory nature of 𝒫_j comes from the fact that for any vector x ∈ R^m,

(𝒫_j x)(p_j) = P_j^T 𝒫_j x = x(p_j).

Adapted from [95], Algorithm 1 uses MATLAB notation to demonstrate how one can implement DEIM. This same approach is used to construct q using the columns of W.
Algorithm 1 DEIM Point Selection (Adapted from [95])
Input: V, a matrix in R^{m×n} with m ≥ n
Output: p, a vector in R^n containing distinct integral values from {1, ..., m}
 1: v = V(:, 1)
 2: [~, p_1] = max(|v|)
 3: p = p_1
 4: for j = 2 : n do
 5:   v = V(:, j)
 6:   c = V(p, 1:j−1)^{-1} v(p)
 7:   r = v − V(:, 1:j−1) c
 8:   [~, p_j] = max(|r|)
 9:   p = [p; p_j]
10: end for
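A NumPy translation of this point-selection step might look like the following sketch (the function name `deim` is ours; the thesis implementation is in MATLAB). A linear solve replaces the explicit inverse on line 6.

```python
import numpy as np

def deim(V):
    """DEIM point selection on the columns of V (m x n, m >= n).
    Returns n distinct row indices, following the greedy residual rule."""
    m, n = V.shape
    p = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, n):
        v = V[:, j]
        # Solve V(p, 1:j) c = v(p) rather than forming an explicit inverse
        c = np.linalg.solve(V[np.ix_(p, list(range(j)))], v[p])
        r = v - V[:, :j] @ c
        p.append(int(np.argmax(np.abs(r))))
    return np.array(p)
```

Because the residual r vanishes at previously selected indices, the returned indices are distinct.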
Incremental QR
Because there are cases in which the entirety of the matrix A cannot fit into
memory at the same time, Sorensen and Embree have introduced the incremental
QR algorithm [95]. While memory is not an issue for the data studied here, we apply
this approach in the next section to test its general effectiveness.

The incremental formation of the QR factorization can be implemented prior to
computing the SVD of A. Typically, the QR factorization of A is formed via a
method such as Gram-Schmidt orthogonalization so that A = QT, where Q ∈ R^{m×n} is a unitary matrix and T ∈ R^{n×n} is an upper triangular matrix. (Here T is used instead of R to avoid confusion with the use of R in the CUR factorization.) In the incremental case, however, the factorization is approximated by eliminating any rows of T and corresponding columns of Q that do not contribute significantly to the cumulatively computed Frobenius norm of T after each orthogonalization step. In
this way, if a column of A does not contribute much to the factorization formed from
the columns considered up to that point, that column and its corresponding row of T
and column of Q do not need to be stored in memory; this results in the formation of
a rank-k approximation to A with an m × k orthogonal matrix Q and a k × n matrix T for k ≤ min(m, n).

Depending on the size of k relative to m and n, the SVD of T = V̂SW^T may be
more easily computed to approximate the SVD of A such that

A ≈ QT = QV̂SW^T = VSW^T,

where V = QV̂. With that, DEIM is then used to select the rows and columns of
A for the CUR factorization. (Note that when k is close in size to min(m, n), as is sometimes the case in this chapter, forming the SVD of T to approximate A may not actually be much cheaper than computing the full SVD of A itself.)
A general outline of our implementation of the incremental QR algorithm is shown in Algorithm 2 for brevity. Note, however, that in line 6 of the algorithm below, we follow the suggestion of Sorensen and Embree and include the reorthogonalization step proposed by Daniel, Gragg, Kaufman, and Stewart [15] as presented in [95]. For more details on the implementation of incremental QR, see [95]. [We also note that here, because the number of morphologies to expect is unknown and some experiments have only a few classes (particularly in the synthetic data), k is chosen to be 1 in the
first line of Algorithm 2; this essentially ignores the initial computation of the QR factorization without deflation proposed in [95]. While not computationally optimal in practice when the rank of A is expected to be larger, it still demonstrates the proof
of concept for our application.]
Algorithm 2 Incremental QR (Adapted from [95])
 1: k = 1
 2: Q = A(:, 1)/‖A(:, 1)‖; T = ‖A(:, 1)‖; rownorms(1) = T(1, 1)^2
 3: j = k + 1
 4: while j ≤ n do
 5:   a = A(:, j); r = Q^T a; f = a − Qr
 6:   c = Q^T f; f = f − Qc; r = r + c (reorthogonalization step)
 7:   ρ = ‖f‖; q = f/ρ
 8:   Q = [Q, q]; T = [T r; 0 ρ]
 9:   rownorms(i) = rownorms(i) + r(i)^2 for i = 1, ..., k
10:   rownorms(k + 1) = ρ^2
11:   FnormT = sum(rownorms)
12:   [~, imin] = min(rownorms(1 : k + 1))
13:   if rownorms(imin) <= (tol)^2 (FnormT − rownorms(imin)) then
14:     Eliminate least-contributing row of T and corresponding column of Q (corresponding to imin).
15:     Update rownorms appropriately.
16:   else
17:     k = k + 1
18:   end if
19:   j = j + 1
20: end while
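For concreteness, a NumPy sketch of the deflated incremental QR step follows. This is our own simplified translation: it stores A densely, so it does not realize the memory savings that motivate the method, and the deflation test mirrors line 13 above.

```python
import numpy as np

def incremental_qr(A, tol=1e-5):
    """Incremental QR with deflation. Returns Q (m x k) and T (k x n)
    with A approximately equal to Q @ T."""
    m, n = A.shape
    nrm = np.linalg.norm(A[:, 0])
    Q = (A[:, 0] / nrm).reshape(m, 1)
    T = np.array([[nrm]])
    rownorms = np.array([nrm ** 2])  # running squared row norms of T
    k = 1
    for j in range(1, n):
        a = A[:, j]
        r = Q.T @ a
        f = a - Q @ r
        c = Q.T @ f; f = f - Q @ c; r = r + c  # reorthogonalization step
        rho = np.linalg.norm(f)
        q = f / rho if rho > 0 else f
        Q = np.hstack([Q, q.reshape(m, 1)])
        T = np.block([[T, r.reshape(k, 1)],
                      [np.zeros((1, T.shape[1])), np.array([[rho]])]])
        rownorms = np.append(rownorms + r ** 2, rho ** 2)
        imin = int(np.argmin(rownorms))
        fnorm = rownorms.sum()
        if rownorms[imin] <= tol ** 2 * (fnorm - rownorms[imin]):
            # Deflate: drop the least-contributing row of T and column of Q
            Q = np.delete(Q, imin, axis=1)
            T = np.delete(T, imin, axis=0)
            rownorms = np.delete(rownorms, imin)
        else:
            k += 1
    return Q, T
```

On an exactly low-rank matrix, the deflation step discards the negligible directions and the returned factorization recovers the rank.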
Once the vectors p and q have been determined through DEIM, R is simply defined to be those rows of A given by the indices in p, and C is defined to be the columns of A given by the indices in q. In our application, then, R contains key sampling points across all interpolated beats and C contains a subset of beats, with each column of C containing a single beat. While the rows of R may prove useful in the future, here we focus on the beat subset selection given through the columns of C.
The performance of this approach in the ECG setting is described in the next section, with some results compared to those presented by others in both unsupervised and supervised (or semisupervised) settings.
One positive aspect of the presented method is that the only parameter to be selected in the formation of the CUR factorization is the tolerance used in determining which rows of T and columns of Q are kept within the incremental QR algorithm
(line 13 of Algorithm 2). This threshold then a↵ects the rank of the approximate QR
factorization of A, which in turn determines the rank of the approximate SVD of A.
As implemented here, this rank-k SVD then determines the number of indices selected
by DEIM, that is, the number of rows and columns selected from A. Hence, with the
incremental QR threshold the only parameter in this approach, we limit the amount
of user input and prior knowledge needed for selecting representative waveforms.
In terms of computational cost for row and column index selection, the cost is
dominated by the formation of the approximate rank-k SVD of A [95]. In approximating the SVD of A, incremental QR has complexity O(mnk), and forming the SVD of the k × n matrix T has complexity O(nk^2). As noted by Sorensen and Embree, once the approximate SVD is computed, the selection of indices for both p and q is then O(mk + nk) [95].
The computational cost for the Max-Min clustering algorithm with dynamic time warping used by Syed et al. is reported to be O(mnc) + O(m^2 ck) with a worst-case-scenario cost of O(m^2 nk), where c is the number of clusters found in a pre-clustering step [99]. We see, then, that the O(mnk) cost of index selection in the algorithm presented here is perhaps comparable with the cost of the approach presented in [99] for cases in which m< is the less expensive approach.

3.4 Results and Discussion

To test our approach, the DEIM-CUR factorization with incremental QR is applied to both the synthetic and the real patient data. Since the selection of the representative beats is of primary interest in this chapter, only those CUR results contained in the matrix C are discussed here. The factorization is considered to be effective in the unsupervised sense if at least one representative beat from each class is contained within the columns of C.

While we are able to compare CUR beat selection to another existing work [99], we have had difficulty finding other results in the literature based solely on the performance of similar unsupervised, class-detection algorithms applied to the patient data studied here. To further demonstrate the value of data summarization via CUR, then, we also compare the results of a 1-nearest-neighbor (1NN) classifier based on the labels of the CUR-selected beats with other state-of-the-art classification methods. The MIT-BIH Arrhythmia and Incart data sets are used in generating these results.

3.4.1 CUR with Synthetic Data

In each of the synthetic experiments, A is a 125 by 6000 data matrix containing 12 classes of beats with a given type and level of variability. The results for each of the four variability types (heart rate, feature placement in time, feature magnitude, and feature width) as well as added random noise are shown in Table 3.2 for the seven variability levels tested.
Recall that placement variability is actually added at half of the percent change shown in the first column. The stopping tolerance within the QR factorization is taken to be 1 × 10^{-5}. For each experiment, the percent dimension reduction and the total number of missed classes are presented.

These results in Table 3.2 demonstrate that, for this particular stopping tolerance, we are able to achieve greater than 97% dimension reduction with CUR for all of the tested variation types and levels. There are a few cases with higher feature variability levels in which classes are missed, but in most cases, each of the twelve classes is represented in the reduced data set. [Class 3 appears to be the class missed the most, absent in all missed-class cases in feature shift and feature width. Class 9 is the other class missed, with 50% variability in shift. While it might be expected to lose accuracy as variability increases, as of now, it is unclear as to why these particular classes are missed. Overall, however,] we see that we are able to maintain synthetic class detection while greatly reducing the size of the data set under consideration in the cases of heart rate and feature variability.

The added noise results are somewhat different in that the amount of dimension reduction is constant for all tested noise levels and there are fewer missed classes in the presence of more noise. [(Classes 1, 4, and 3 are the classes missed in these experiments.)] The underlying reason for these results is not exactly clear. However, the constant level of dimension reduction is likely due to the fact that the addition of normally-distributed random noise results in a full-rank (or nearly full-rank) matrix with singular values that exhibit little decay.
Also, the detection of more classes in the cases with higher noise levels may be related to the fact that, as seen in Figure 3.3, with enough noise, the original beat morphologies are lost; since the classes are evenly distributed, it is perhaps not surprising that CUR would select one beat from each class under such corruption of the data. While the noise present in real ECG data is not expected to be random, we gain a sense of the behavior of the CUR algorithm in the presence of such noise in varying amounts. There are, however, a variety of other noise sources in real patient data, and the effects of the different types of noise on CUR class identification are studied further on the real patient data below.

The results for the magnitude case, on the other hand, seem particularly good, with no classes missed and greater than 99% dimension reduction for all levels of added variability. On top of the effects of normalization, this is likely due to the fact that, by constructing the amplitude-varying beats as a sum of scaled Gaussians, all discretized beats in a class will lie in the same vector subspace spanned by the set of discretized Gaussians (which have fixed width and placement in time). Because each class is constructed to lie in a low-dimensional space, it is no surprise that we see greater dimension reduction for this variability type.
Change   Heart Rate          Placement           Magnitude           Width               Noise
         Dim. Red.  Missed   Dim. Red.  Missed   Dim. Red.  Missed   Dim. Red.  Missed   Dim. Red.  Missed
1%       99.25      0        99.42      0        99.58      0        99.40      0        97.93      1
2%       99.05      0        99.35      0        99.55      0        99.33      0        97.93      2
5%       98.73      0        99.25      0        99.53      0        99.27      0        97.93      3
10%      98.47      0        99.15      0        99.50      0        99.20      0        97.93      3
20%      98.13      0        99.07      1        99.40      0        99.07      0        97.93      0
30%      98.02      0        99.07      0        99.22      0        99.02      0        97.93      0
50%      97.93      0        99.03      2        –          –        98.77      1        97.93      0

Table 3.2 : Results from the CUR factorization applied to the synthetic data sets with different types and levels of feature variability. (Recall that placement variability is added at one half of the percentages in the first column.) Dimension reduction is given as a percentage of trial size. This table shows that most classes are detected even with > 97% dimension reduction.

As noted in Section 3.3, we have traded realism in our synthetic data for more direct control over parameters in order to understand the sensitivity of our chosen dimension reduction technique to very specific perturbations in the data. However, to better understand the performance of DEIM-CUR with incremental QR in practice, it is critical that the method be applied to more realistic data, such as data simulated using ECGSYN, which still affords some control over generated features [69], or data recorded from real patients. Hence, we turn now to the application of our method to real patient data.

3.4.2 CUR with Patient Data

Unsupervised Subset Selection

To evaluate the performance of our proposed approach on patient data, DEIM-CUR with incremental QR is first applied to the data matrices constructed from each patient file in the MIT-BIH Arrhythmia Database. We again consider our implementation to be successful if the resulting subset selection from CUR contains representatives from each annotation class.
Table 3.3 provides a summary of these results for each of the 16 annotations delineated in Section 3.3. For the eight different tested incremental QR stopping tolerances, each column of the table presents the percentage of patients in which an annotation was detected among all patients with that particular beat type. Table 3.4 compares our results for annotations N, A, V, /, f, F, j, L, R, and a with stopping tolerance 5 × 10^{-5} to those presented in the 2007 work by Syed, Guttag, and Stultz [99]. (Note that / is referred to as P in the paper of interest.) Syed et al. also report that, though annotations E and e are reported in only one patient each, their algorithm was indeed able to detect these classes; this is also reflected in Table 3.4. As is done in [99], the CUR results presented in Table 3.4 consider only those patients with a given annotation present in three or more beats. We see that our results for annotations N, A, V, /, f, F, L, R, a, E, and e are all the same as those presented by Syed et al., with the exception being that our percent detection for annotation j (junctional escape beats) is 66.67%, where the percent detection by Syed et al. is 100% [99]. In looking closer at our results for annotation j (detailed more below), we see that only three patient files contained greater than two beats with said annotation, and CUR failed to detect a j beat in one of those files.

tol        N    A       V       Q       /    f    F       j   L    a       J   R       !    E    S  e
0.05       100  34.615  80.556  33.333  100  100  13.333  0   100  28.571  20  100     100  100  0  0
0.01       100  88.462  97.222  66.667  100  100  46.667  40  100  71.429  60  83.333  100  100  0  0
5 × 10^-3  100  69.231  100     83.333  100  100  60      40  100  85.714  60  100     100  100  0  0
1 × 10^-3  100  84.615  100     83.333  100  100  73.333  40  100  71.429  60  100     100  100  0  100
5 × 10^-4  100  84.615  100     83.333  100  100  53.333  40  100  71.429  60  100     100  100  0  100
1 × 10^-4  100  88.462  100     100     100  100  60      20  100  71.429  60  100     100  100  0  100
5 × 10^-5  100  84.615  100     100     100  100  60      40  100  71.429  60  100     100  100  0  100
1 × 10^-5  100  84.615  100     100     100  100  60      40  100  71.429  60  100     100  100  0  100

Table 3.3 : The original DEIM-CUR representation for each annotation present in the MIT-BIH Arrhythmia Database and for the eight different CUR stopping tolerances tested. The values in each column represent the percentage of patients correctly identified as having a given annotation through CUR beat selection.

Method                 N    A      V    /    f    F   j      L    a    R    E    e
Syed, et al. (2007)    100  84.21  100  100  100  75  100    100  100  100  100  100
Inc. QR, DEIM-CUR      100  84.21  100  100  100  75  66.67  100  100  100  100  100

Table 3.4 : Comparison of the annotation representation results from Syed et al. [99] and DEIM-CUR with incremental QR. The CUR results presented here are for stopping tolerance 5 × 10^{-5} and consider only patients with three or more beats present for a particular annotation.

Again considering even those annotations with fewer than three beats present in a file, Table A.2 in Appendix A.2 provides a closer look at some of the results from Table 3.3. This more detailed table shows the results of analyzing each individual patient file via CUR with a stopping tolerance of 5 × 10^{-5} within incremental QR. Each row corresponds to a different file number and each column corresponds to
In each table entry, the number of CUR-selected beats for a given annotation is presented with respect to the total number of beats with that annotation in the filtered data from that file. An entry containing “ – ” indicates that there were no beats with that particular annotation for that patient file. Notice that some annotations (including E and e)areonlypresent in one file. The bottom row of the table contains the CUR annotation representation with respect to the total number of files containing that annotation. (These fractions in the bottom row are used to generate the entries in Table 3.3 for the given stopping tolerance.) In looking at the entries in Table A.2, we see a similar level of dimension reduction as that seen in the synthetic data. Because our implementation of the DEIM-CUR factorization selects the same number of rows and columns from the data matrix, A, the dimensionality of the reduced set is limited by the minimum dimension of A; hence, it is not surprising that the original beat set is reduced to no more than 150 beats in each experiment. With this reduction, however, we see that the di↵erent beat types are still typically retained in the selected subset. In particular, there are several cases in which annotations are detected even though there are only a few beats in the file with that annotation. For example, though there are few beats with annotation Q, CUR still selects at least one Q beat per patient file. This indicates that the algorithm is indeed able to identify both common and rare morphologies in selecting representatives from within the data set. 56 Based on the tolerance experiments on the MIT-BIH Arrhythmia results, we then 5 apply CUR with an incremental QR tolerance of 5 10 to the other two real-patient ⇥ data sets. For the MGH-MF set, we first present the results on the first 40 files of the data set in Table 3.5. 
If we look at all 40 of these files and include results for annotations in files with only one or two beats present, we get the results presented on the first row of Table 3.5. For the interest of comparison with previously presented results by Syed et al. [99], we also exclude cases with fewer than three representative beats and exclude files mgh002 and mgh026. Note that Syed et al. state their use of three beats as the representation cut-off threshold because it is 1% of the length of the records in the MIT-BIH database; it is unclear what the cut-off threshold was in the presentation of the MGH-MF database results, especially given the larger variability in file size for this database. Also note that Syed et al. do not report results for supraventricular premature or ectopic beat (S) detection on the MGH-MF database, so this table entry is left blank. With a representation cut-off of three or more beats, we see that we are again able to obtain similar results as those reported in a previous work using a clustering approach, the difference being that CUR misses premature ventricular contractions in two files: mgh025 and mgh028.

For completeness, we also test our approach on the full MGH-MF database (excluding files mgh061, mgh127, mgh230, and mgh235). The results on this full data set are shown in Table 3.6.

Experiment                   N    S       V       F    J    /    e
MGH-MF (first 40 files)      100  91.304  87.097  20   100  100  100
MGH-MF (≥ 3 beats, subset)   100  100     90      100  100  100  100
MGH-MF (Syed et al.)         100  –       100     100  100  100  100

Table 3.5 : The DEIM-CUR representation for each annotation present in the MGH-MF Database for an incremental QR stopping tolerance of 5 × 10^{-5}. The values in each column represent the percentage of patients correctly identified as having a given annotation through CUR beat selection. The first row considers all patient annotations. The second row shows results only for those cases in which ≥ 3 beats with a given annotation are present, excluding files mgh002 and mgh026 for comparison with the results presented by Syed et al. [99] (shown in the last row).

Even with the addition of over 200 more files, we see results that are similar to those generated on only a fraction of the data set for the annotations present in the first 40 files.

Experiment                N    S       V       F       J    /    e    Q       n  f  a    r       ?
MGH-MF (all files)        100  83.553  86.857  43.548  100  100  100  66.667  0  0  100  14.286  50
MGH-MF (≥ 3 beats, all)   100  91.379  91.912  54.545  100  100  100  62.5    –  –  100  20      –

Table 3.6 : The DEIM-CUR representation for each annotation present in the full MGH-MF Database (excluding files mgh061, mgh127, mgh230, and mgh235) for incremental QR stopping tolerance 5 × 10^{-5}. Each column shows the percentage of patients correctly identified as having a given annotation through CUR beat selection. The first row considers all patient files and annotations. The second row shows results only for those cases in which ≥ 3 beats with a given annotation are present.

Though we have not found in the literature any comparable results with this type of unsupervised class detection (relying on no a priori knowledge) for the Incart database, we present the Incart CUR performance results in Table 3.7. We again show the patient annotation representation for all files as well as for those cases in which only three or more beats are present with a given annotation. Note that annotations n and j are only present in two files, and in both cases, these annotations were detected in only one of the two files; similarly, S is only present for ≥ 3 beats in three records and was detected in only one such record, resulting in the poorer class detection. In general, the class-detection results for the Incart data set are comparable to the class-detection results generated on the MIT-BIH database.
Experiment           N     V     A       F       Q    n    R     B    j    S
Incart (all files)   100   100   71.875  50      20   50   100   0    50   25
Incart (≥ 3 beats)   100   100   89.474  64.286  –    50   100   –    50   33.333

Table 3.7 : The DEIM-CUR representation for each annotation present in the Incart Database and for an incremental QR tolerance of 5 × 10⁻⁵. The values in each column represent the percentage of patients correctly identified as having a given annotation through CUR beat selection. The first row considers all patient annotations, and the second row shows results only for those cases in which ≥ 3 beats with a given annotation are present.

While it is unclear why some annotations are missed in CUR dimension reduction, one possibility is that timing information used in annotating is lost when converting the data to matrix form. Relative feature timing should be maintained within individual beats, but if overall beat length is a major factor in defining a beat class, then it seems natural that CUR should miss this class.

AAMI and AAMI2 Class Representation

In testing on both synthetic and real data, we have also tested CUR on data sets that differ in the distribution of classes within the data set. Where the synthetic data is designed such that each class is equally represented, the patient databases clearly do not have this structure among classes. Though we see that CUR is able to detect some of the more poorly represented classes in the patient data (a desirable quality for some tasks), such granularity may not always be necessary. In some works, such as those by de Chazal, O'Dwyer, and Reilly [18] and Llamedo and Martínez [57], the MIT-BIH class labels presented on PhysioNet [29] are summarized by a smaller number of classes as recommended by the Association for the Advancement of Medical Instrumentation (AAMI) [1].
With this labeling, beats with annotations N, L, R, e, and j are all considered members of the "normal" class (Nˆ); beats with annotations A, a, J, and S are considered members of the "supraventricular ectopic beat" class (Sˆ); beats with annotations V and E are placed in the "ventricular ectopic beat" class (Vˆ); the fusion of ventricular and normal beats (F) still retains its own "fusion beat" class (Fˆ); and beats with annotations /, f, and Q are all included in the "unknown beat" class (Qˆ) [18]. Ignoring MIT-BIH Arrhythmia Database patient files with paced beats in accordance with the AAMI recommendations [1], Llamedo and Martínez discard the Qˆ class due to poor representation. Similarly, because the Fˆ class has a somewhat low representation relative to the other classes and consists only of beats that are a fusion of ventricular and normal beats, the authors combine Fˆ and Vˆ into one class, which we will call Vˆ′. Llamedo and Martínez refer to this even further summarization of the MIT-BIH Arrhythmia annotations as the AAMI2 labeling [57]. Though neither the AAMI nor AAMI2 labeling system results in a truly even distribution of classes, the presence of "rare" classes is greatly decreased for the MIT-BIH Arrhythmia data set. The class detection results for the AAMI and AAMI2 labeling of this database are shown in Table 3.8 for an incremental QR tolerance of 5 × 10⁻⁵. Note that for these labeling systems, "≥ 3 beats" corresponds to at least three beats present with a given annotation under the original PhysioNet labeling system; these cases were removed from consideration prior to translating the PhysioNet [29] labels to AAMI/AAMI2 classes. As in [18] and [57], these results are generated without the inclusion of ventricular flutter waves in file 207m and without records containing paced beats (though the inclusion of these records has no effect on the percentages shown).
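The label groupings just described amount to a small lookup table. The following Python sketch (a hypothetical helper written for illustration, not code from this thesis) maps the PhysioNet annotation symbols to the AAMI classes and, for AAMI2, merges the fusion class into the ventricular class Vˆ′:

```python
# Mapping from PhysioNet beat annotations to AAMI classes, following the
# groupings described in the text [18]; names here are illustrative.
AAMI_MAP = {
    # "normal" class (N-hat)
    'N': 'N', 'L': 'N', 'R': 'N', 'e': 'N', 'j': 'N',
    # "supraventricular ectopic beat" class (S-hat)
    'A': 'S', 'a': 'S', 'J': 'S', 'S': 'S',
    # "ventricular ectopic beat" class (V-hat)
    'V': 'V', 'E': 'V',
    # "fusion beat" class (F-hat)
    'F': 'F',
    # "unknown beat" class (Q-hat)
    '/': 'Q', 'f': 'Q', 'Q': 'Q',
}

def aami(symbol):
    """AAMI class for a PhysioNet annotation symbol."""
    return AAMI_MAP[symbol]

def aami2(symbol):
    """AAMI2 labeling: the fusion class joins the ventricular class (V')."""
    label = AAMI_MAP[symbol]
    return 'V' if label == 'F' else label
```

Annotations outside this table (such as paced beats) would raise a `KeyError` here, which matches the exclusions described above.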
We also leave out the representation results for the unknown beats Qˆ in the AAMI2 labeling results as in [57], though the AAMI results show that unclassified beats (Q) are still detected despite their small class representation. It should be noted that in all of the reported AAMI2 results, beats belonging to the Qˆ class remain in the data set and can be detected as representative beats by CUR. Because all records with fusion of ventricular and normal beats also have CUR-detected premature ventricular contractions (as can be noted in Table A.2), the AAMI2 class of Vˆ′ sees 100% detection despite the fact that the fusion beats are not detected in some cases. From these results we see that CUR performs well in detecting the more broadly defined and generally well-represented classes given by the AAMI and AAMI2 labeling systems.

Experiment         Nˆ    Sˆ      Vˆ    Fˆ   Qˆ        Experiment          Nˆ    Sˆ      Vˆ′
AAMI (all files)   100   83.871  100   60   100       AAMI2 (all files)   100   83.871  100
AAMI (≥ 3 beats)   100   86.957  100   75   100       AAMI2 (≥ 3 beats)   100   86.957  100

Table 3.8 : The DEIM-CUR representation for the AAMI and AAMI2 class labelings of the MIT-BIH Arrhythmia data for an incremental QR tolerance of 5 × 10⁻⁵. The values in each column represent the percentage of patients correctly identified as having a given annotation through CUR beat selection. The first row considers all patient annotations, and the second row shows results only for those cases in which ≥ 3 beats with a given annotation are present. These results ignore records with paced beats.

Overall, our implementation of DEIM-CUR with incremental QR seems to successfully summarize the types of morphologies present in the ECG time series data. The fact that we are able to achieve similar results to those presented by Syed et al. [99] using a matrix factorization technique for dimension reduction as opposed to clustering on the entire data set is encouraging.
Not only do we see that applying CUR reduces the problem size, but in doing so, rare beat events are preserved alongside more commonly seen patterns. In this way, CUR provides a broad representation of the ECG data, a desirable result for understanding the types of morphologies that occur within the data over longer periods of time. The value of this broad representation is demonstrated through the use of the CUR-selected subset in classifying unlabeled data, as discussed further below.

Using CUR-Selected Beats in Classification

For comparison with state-of-the-art methods in the literature, we perform 1NN classification with the Euclidean distance using the annotations from the CUR-selected beats to label the remaining beats in each file of the MIT-BIH Arrhythmia and Incart data sets. We compare this semi-supervised classification approach with the results produced by methods presented in four other works. de Chazal, O'Dwyer, and Reilly (2004) present a method that uses linear discriminants to classify beats based on features derived from the data [18], and Llamedo and Martínez (2011) present a compensated linear discriminant classifier (with unequal weights), also based on features derived from the ECG [57]. In a separate work, Llamedo and Martínez (2012) also present a method that pairs a linear discriminant classifier with an expectation-maximization clustering approach that can be implemented in one of three modalities (automatic, slightly assisted, or assisted), depending on the amount of expert input in assigning labels to clusters [58].
In the most recent work considered here, Oster, Behar, Sayadi, Nemati, Johnson, and Clifford (2015) use a switching Kalman filter approach to classify a beat as either "normal," "ventricular," or possibly a member of the "X-factor" class (consisting of beats that look different from the normal and ventricular beats, potentially due to noise); when the X-factor is included as a class option, beats identified as belonging to the X-factor class are removed from further analysis [79]. As is done in these papers, we classify unlabeled beats according to the AAMI and AAMI2 classes. We also remove from analysis ventricular flutter waves (beats associated with the '!' annotation), and we again exclude records with paced beats. For comparison purposes, we first select the incremental QR threshold on a subset of the MIT-BIH Arrhythmia Database (referred to as DS1) and evaluate the performance of the method with the chosen threshold on a separate subset of the MIT-BIH Arrhythmia (DS2) data and the Incart data. The partition used for splitting the MIT-BIH data into subsets DS1 and DS2 is given by de Chazal et al. [18]. Like Oster et al., we focus our attention on the classification of beats in the AAMI and AAMI2 ventricular class and perform parameter selection based on the results for this class in DS1 [79]. In selecting the incremental QR threshold, we look at the sensitivity (Se) and the positive predictive value (+P) as described in [57], and use the harmonic mean of these two values, the F1 score [79], to select the appropriate threshold. As shown in Table 3.9, the highest F1 values are achieved for a tolerance of 5 × 10⁻³ for both the Vˆ′ and Vˆ class labelings. We then use this threshold to classify the beats in the DS2 and Incart data sets. The result comparison for the DS2 and Incart sets as reported by Oster et al. [79] is shown in Tables 3.10 and 3.11, respectively, with the addition of the results from the CUR-1NN classification method presented in this work.
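The threshold-selection criterion just described is compact enough to state in code. The sketch below (an illustrative helper, not the thesis's code) computes the F1 score as the harmonic mean of Se and +P, taking both as percentages to match the tables in this section:

```python
def f1_score(se, ppv):
    """Harmonic mean of sensitivity (Se) and positive predictive
    value (+P), both given as percentages, as used for selecting the
    incremental QR threshold."""
    return 2.0 * se * ppv / (se + ppv)
```

For example, the Se and +P values reported for the 5 × 10⁻³ tolerance row of Table 3.9 (94.26 and 99.12) give an F1 of about 96.63, consistent with the tabulated value.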
The corresponding confusion matrices for the AAMI classification are shown in Tables 3.12 and 3.13; the AAMI2 confusion matrices can be derived from these tables given the direct relationship of the two labeling systems. From these results, we see that +P tends to be higher than Se for our simple classification method. Despite this, we see that CUR-1NN classification has the highest F1 score among those presented for Vˆ′ classification in the DS2 data set, and comes in second only to the classifier with the X-factor by Oster et al. for DS2 Vˆ classification.

DS1           Vˆ′                       Vˆ
Tol.          Se     +P     F1          Se     +P     F1
0.05          79.91  91.51  85.32       79.70  82.25  80.96
0.01          89.51  97.17  93.18       90.36  89.17  89.76
5 × 10⁻³      94.26  99.12  96.63       94.34  91.42  92.85
1 × 10⁻³      95.59  96.24  95.92       96.17  89.58  92.76
5 × 10⁻⁴      95.12  96.00  95.56       95.76  89.25  92.39
1 × 10⁻⁴      94.81  96.31  95.55       95.70  89.56  92.53
5 × 10⁻⁵      94.81  96.31  95.55       95.70  89.56  92.53
1 × 10⁻⁵      94.81  96.31  95.55       95.70  89.56  92.53

Table 3.9 : DS1 CUR classification results for Vˆ′ and Vˆ. The bold values indicate the maximum F1 scores for each labeling scheme. In both cases, the maximum is achieved for an incremental QR tolerance of 5 × 10⁻³.

While the CUR-1NN F1 scores are less impressive compared to some of the other reported results on the Incart data due to the method's lower Se values, we see that the Vˆ′ results using our method are still comparable to some of the results from the previously presented state-of-the-art methods, even with using 1NN and the Euclidean distance. Of note is that in our preprocessing steps, several beats are lost prior to constructing the data matrix; future analysis will be needed to determine the impact of the preprocessing on the CUR-1NN results shown here. Also, where methods presented in the comparison papers can make use of multiple leads, our method currently makes use of only one lead at a time.
While this can be beneficial depending on the amount of ECG data available for a given analysis, extensions of this approach to handle multiple leads may also be of interest in the future. Additionally, we have selected the incremental QR threshold solely based on the Vˆ′ and Vˆ classes; other means of parameter selection should be considered in the future, taking into account performance on multiple classes at a time.

DS2                                       Vˆ′                        Vˆ
Method                                    Se     +P     F1           Se     +P     F1
de Chazal et al. [18]                     86.5   47.2   57.1         85.1   81.9   83.5
Llamedo and Martínez [57]                 95.3   28.6   44.0         94.6   88.1   91.2
Llamedo and Martínez (automatic) [58]     82.1   77.9   79.9         90.1   86.0   88.0
Llamedo and Martínez (assisted) [58]      91.4   96.9   94.1         93.8   96.7   95.2
Oster et al. (no X-factor) [79]           87.6   96.4   91.8         92.7   96.2   94.5
Oster et al. (with X-factor) [79]         90.5   99.96  95.2         97.3   99.96  98.6
This work (CUR-1NN)                       94.4   97.9   96.1         92.5   98.3   95.3

Table 3.10 : DS2 CUR classification results for Vˆ′ and Vˆ with incremental QR tolerance 5 × 10⁻³. This table is adapted from the results presented in Table IV of [79].

Given the approach presented in this work, however, we are generally able to achieve comparable performance to existing classification algorithms using an unsupervised learning method (with only one parameter) followed by the 1NN classifier with Euclidean distance. Our results could possibly be improved with the use of another similarity measure and/or classifier, but we have chosen a simple classification method to highlight the utility of CUR in the unsupervised step. From the results presented in this section, we see that CUR is indeed a viable unsupervised means of representative subset selection for use in annotating the larger data set when analyzing clean data with accurate peak delineation. In the following subsection, we test the ability of CUR to detect class representatives under noisier conditions.
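The CUR-1NN step itself is simple enough to sketch in a few lines. The following NumPy function (illustrative only; array shapes are assumed to be beats-by-samples, and the function name is ours) labels each unlabeled beat with the annotation of its nearest CUR-selected beat under the Euclidean distance:

```python
import numpy as np

def one_nn_classify(labeled_beats, labels, unlabeled_beats):
    """Assign each row of unlabeled_beats the label of its nearest row
    in labeled_beats under the Euclidean distance (1NN)."""
    # Pairwise distances: unlabeled (m x d) against labeled (k x d).
    diffs = unlabeled_beats[:, None, :] - labeled_beats[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = dists.argmin(axis=1)  # index of the closest labeled beat
    return [labels[i] for i in nearest]
```

In practice the labeled rows would be the CUR-selected beats with their expert annotations, and the unlabeled rows the remaining beats in the same file; a more refined similarity measure could be substituted for the Euclidean distance without changing the structure of this step.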
Incart                                    Vˆ′                        Vˆ
Method                                    Se     +P     F1           Se     +P     F1
Llamedo and Martínez [57]                 82     88     84.9         N/A    N/A    N/A
Llamedo and Martínez (automatic) [58]     88.0   96.0   91.8         N/A    N/A    N/A
Llamedo and Martínez (assisted) [58]      98.0   98.0   98.0         N/A    N/A    N/A
Oster et al. (no X-factor) [79]           94.7   99.3   97.0         95.4   99.3   97.3
Oster et al. (with X-factor) [79]         98.6   99.9   99.2         99.1   99.96  99.4
This work (CUR-1NN)                       93.3   98.8   96.0         93.3   98.5   95.8

Table 3.11 : Incart CUR classification results for Vˆ′ and Vˆ with incremental QR tolerance 5 × 10⁻³. This table is adapted from the results presented in Table VI of [79].

DS2        Nˆ pred   Sˆ pred   Vˆ pred   Fˆ pred   Qˆ pred   Total
Nˆ true    39267     350       25        35        1         39678
Sˆ true    161       1521      6         0         0         1688
Vˆ true    111       6         2698      102       0         2917
Fˆ true    66        1         15        292       0         374
Qˆ true    6         0         0         0         1         7
Total      39611     1878      2744      429       2         44664

Table 3.12 : DS2 AAMI CUR classification confusion matrix with incremental QR tolerance 5 × 10⁻³.

Incart     Nˆ pred   Sˆ pred   Vˆ pred   Fˆ pred   Qˆ pred   Total
Nˆ true    138035    389       205       5         0         138634
Sˆ true    630       1131      1         0         0         1762
Vˆ true    1129      7         16927     83        0         18146
Fˆ true    85        0         55        72        0         212
Qˆ true    3         0         2         0         1         6
Total      139882    1527      17190     160       1         158760

Table 3.13 : Incart AAMI CUR classification confusion matrix with incremental QR tolerance 5 × 10⁻³.

Sensitivity Analysis

Having demonstrated the utility of CUR subset selection on relatively clean data with known beat delineations, we now turn to testing the sensitivity of our approach to different beat definitions and added noise. We perform these tests on the DS2 subset of the MIT-BIH Arrhythmia Database under the AAMI labeling scheme. Results are first generated using both the PhysioNet and automatically generated beat delineations on the "clean" DS2 data. Both beat detectors are then applied to data with one of the three types of noise (em, ma, or bw) added at varying levels.
Table 3.14 shows results from DS2 class detection with an incremental QR threshold of 5 × 10⁻³ for the two different beat delineation approaches, both including and excluding cases in which the original PhysioNet labels were present in fewer than 3 beats. The median beat count per file is 1978 for the original known delineation and 1916 for the automatic delineation. With the exception of perhaps annotation Fˆ, the results are within 10% of one another for the two methods. It is worth noting, however, that the basic automated peak detector is applied to each file without careful tuning, and labels are assigned based solely on proximity to the known annotations, regardless of whether or not a given "beat" is actually a true R-R interval. Hence, while this experiment appears to generally support the robustness of CUR under beat delineation, it may be of value to perform a more thorough study of the effects of beat separation on CUR performance in the future.

Experiment             Nˆ    Sˆ     Vˆ     Fˆ      Qˆ       Experiment             Nˆ    Sˆ   Vˆ    Fˆ      Qˆ
Original (all files)   100   75     100    57.143  50       Original (≥ 3 beats)   100   75   100   66.667  100
Automated (all files)  100   81.25  93.75  14.286  50       Automated (≥ 3 beats)  100   75   100   33.333  100

Table 3.14 : The DEIM-CUR representation of the AAMI classes from the DS2 subset of the MIT-BIH Arrhythmia data for an incremental QR tolerance of 5 × 10⁻³. The first row of the table contains results for the original beat delineation available on PhysioNet [29], and the second row contains results for beat delineation determined by the basic peak detector used on the synthetic data.

The class detection results given the above-mentioned noise types, signal-to-noise ratios, and beat delineation methods are presented in Figure 3.5. Figure 3.5a shows the results for the original beat labeling, and Figure 3.5b shows the results for the basic automatic peak detector in the presence of noise.
Note that whereas the original beat labeling is fixed regardless of the amount of noise, a variable number of beats can be detected per file depending on the type and level of added noise for the automatic beat separator. From these figures, we see that Qˆ detection is typically more sensitive to noise and the delineation method of choice. Considering that this class is not as well represented as the other four, this is perhaps not very surprising. As seen in the noiseless beat detection results in Table 3.14, Fˆ detection under noisy conditions is typically poorer with the basic peak detector. Also poorly represented relative to the other classes, Fˆ detection is consistently worse as the SNR increases. This behavior is also seen in the results for the Qˆ and Sˆ classes and is not well understood. However, as was a possibility in the presence of larger random noise in the synthetic data, as noise increases (SNR decreases), it may be the case that the beat morphologies lose their distinguishing characteristics and the beat selection is determined more by noise than the original defining features; that is, the addition of greater amounts of noise affects the likelihood that a member of a particular class is selected, as it is no longer the base morphology dictating beat selection but instead the added noise. With Nˆ detected in all cases, Vˆ is better detected in settings with less noise. Detection of Sˆ varies some but does not demonstrate a consistent trend as the SNR varies across the different noise types for either beat detector. While CUR class detection seems to vary more under em noise with basic peak detection, the larger classes are somewhat robust to the presence of noise.
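One common convention for adding a noise record to a signal at a prescribed SNR is to scale the noise so that the ratio of signal power to noise power matches the target in decibels. The sketch below illustrates that convention in NumPy; it is not the thesis's exact noise-addition code, and the function name is ours:

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale a noise record so that the resulting signal-to-noise ratio,
    10*log10(P_signal / P_noise), equals snr_db, then add it to the signal.
    A sketch of one common convention, not the thesis's implementation."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise
```

Under this convention, lower (or negative) SNR values correspond to proportionally stronger added noise, matching the trend discussed above where decreasing SNR lets the noise rather than the base morphology dictate beat selection.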
As noted above, we do see that the detection of more poorly represented classes can be influenced by noise and peak detection; with this observation, we must keep in mind that larger fluctuations in the detection of classes such as Fˆ and Qˆ may be apparent due to the smaller number of files containing beats with these annotations. In general, however, CUR appears to be reasonably robust to noise and beat delineation. Of course, clean data with appropriate beat detection is always preferable, but CUR is able to identify different beat classes under noise in a number of cases. With some rare beat detection still possible in these cases, better performance is expected among classes that are well-represented in the data. Though the results presented in this chapter are generally encouraging regarding the utility of DEIM-CUR with incremental QR in selecting representative ECG morphologies, there are a couple of points that should be noted regarding this approach. While we are able to detect a broad representation of classes in the data, some classes may be represented multiple times within the selected subset. If granularity in morphology detection is desired, this may be helpful; otherwise, a post-processing step to further refine the number of selected representatives may be desired. In addition, because each beat is interpolated to have the same length, it is important to recognize that the CUR-detected beats have lost their temporal context and are representative only with respect to morphology. This limitation should be considered in using the selected subsets for further analysis; while the morphologies may indeed be representative, they may lose some interpretability without information regarding their original timing. Hence, in selecting representative beat morphologies, it may be of interest to also maintain a record of the original timing of each beat.
It is also important to note that, where we have focused on the application of DEIM-CUR to electrocardiogram data, this approach can be used for subset selection in other data types. In general, the method presented here can be used in any setting in which dimension reduction is desired without loss of the underlying structure of the data, from physiological time series to electronic health records to gene expression data. For an example beyond the temporal setting, as mentioned in Section 2, the CUR factorization based on leverage scores has already been applied to select subsets of genes from microarray data in toxicogenomics [112]. Sorensen and Embree also apply both DEIM-CUR and leverage-score-based CUR to genetic data for the identification of features (or probes) that can be used for binary classification of patient data, demonstrating that the superiority of one method over the other depends on the problem and the measure of success [95]. While we have only highlighted in this work the beats selected via DEIM for the C matrix in the CUR decomposition, there are other data sets in which both C and R may hold valuable information; data sets in which not only representative observations are desired, but also representative features within these observations. Reaching much further than the quasiperiodic time-series analysis presented here, the underlying DEIM-CUR framework is applicable to a number of different problems across the entire biomedical informatics spectrum.

[Figure 3.5a appeared here: AAMI class detection with the original beat labels, plotting percent detection against SNR for em, ma, and bw noise, with curves for each class on the full files and on files with ≥ 3 beats.]
[Figure 3.5b appeared here: AAMI class detection with basic peak detection, plotting percent detection against SNR for em, ma, and bw noise, with curves for each class on the full files and on files with ≥ 3 beats.]

Figure 3.5 : Percent detection of AAMI classes for original beat delineation and the basic automated peak finder with the three different kinds of added noise. Results are shown for the full files and for the removal of PhysioNet annotations with fewer than three beats represented.

3.5 Conclusion

In this chapter, we have demonstrated the utility of DEIM-CUR with incremental QR in identifying representative beat morphologies in ECG waveform data. Testing on both synthetic and real patient data, we see that our approach is able to select both common and rare morphologies occurring within the data, offering an unsupervised means of reducing the data set to a robust representation of the whole. Through comparison with classification methods presented in the existing literature, we have also demonstrated that the CUR-selected subset can be effectively used in classifying beats with AAMI and AAMI2 labels. We have also demonstrated that CUR is reasonably robust in the presence of realistic noise and automated beat detection, particularly for classes that are well represented in the data. Future work for this approach includes investigating ways of making the CUR factorization less sensitive to noise within the data without losing rare beat detection, as well as the extension of this approach to larger data sets. In addition, the impact of other factors in the treatment of the ECG prior to analysis should be more closely considered.
For example, the effects of a wider variety of filtering and beat delineation techniques should be analyzed in more detail in the future. Further study regarding the role of CUR-selected beats in classification algorithms is also a topic of interest for future work. With extendability to larger data sets and other waveform types (not to mention a variety of other data types), this method can serve as a springboard for identifying important subsequences in physiological data. Whether the selected subsequences are themselves used, or the subsequences are further analyzed for predictive model construction, the application of CUR to physiological time series can lead to the development of improved clinical decision support tools.

Chapter 4

A Comparison of Subset Selection Methods

The work presented in this chapter was conducted jointly with Nabil Chaabane. Thus far in this thesis, we have evaluated only subset selection via DEIM-CUR. As noted in both Chapters 2 and 3, there are a number of ways to select the indices in vectors p and q for constructing C = A(:, q) and R = A(p, :) for the CUR factorization. Beyond the CUR factorization, there are also even more subset selection approaches to identify class representatives from a larger data set. In this chapter, we compare several CUR index-selection schemes with other off-the-shelf clustering methods. In particular, we look at subset selection via the DEIM, Q-DEIM, leverage score, and volume-maximizing index-selection schemes in comparison to identifying representative points from basic k-medoids clustering, agglomerative clustering, Gaussian mixture models (GMMs), and Max-min clustering. In the sections below, we provide a brief overview of these different methods, with a focus primarily on the CUR subset-selection methods, followed by the results of applying each of these schemes to three different data sets.
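In the MATLAB-style notation above, C and R are simply column and row extractions of A given the index vectors q and p. A minimal NumPy sketch (0-based indexing; the helper name is ours) makes this concrete:

```python
import numpy as np

def cur_factors(A, p, q):
    """Form the C and R factors of a CUR decomposition from row indices p
    and column indices q, however those indices were selected:
    C = A(:, q) and R = A(p, :) in the text's MATLAB-style notation."""
    return A[:, q], A[p, :]
```

All of the index-selection schemes compared in this chapter plug into this same step; they differ only in how p and q are chosen.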
4.0.1 Data Sets

The methods considered in this chapter are applied to three different data sets: the MIT-BIH Arrhythmia Database [72], the Grammatical Facial Expressions Data Set [17], and the MNIST Database of Handwritten Digits [53]. Each of these data sets, described briefly below, comes from a different context than the others and is split into training and test sets for parameter selection.

MIT-BIH

Also used to evaluate DEIM-CUR with incremental QR in Chapter 3, the MIT-BIH Arrhythmia Database [72] contains 48 ECG recordings from 47 adult patients with whole-beat annotations and is freely available online through PhysioNet [29]. After preprocessing each file as described in Chapter 3, there is a median of 2010 beats per file. Each beat is then interpolated to have the same length in order to construct a single data matrix for each file. For training and testing algorithms on this data set, we use the file split presented by de Chazal, O'Dwyer, and Reilly [18]; the training set "DS1" contains files 101, 106, 108, 109, 112, 114, 115, 116, 118, 119, 122, 124, 201, 203, 205, 207, 208, 209, 215, 220, 223, 230, and the test set "DS2" contains files 100, 103, 105, 111, 113, 117, 121, 123, 200, 202, 210, 212, 213, 214, 219, 221, 222, 228, 231, 232, 233, 234. Note that four files (102, 104, 107, and 217) are not included because they have paced beats [18]. In evaluating the subset selection methods of interest, we will investigate each algorithm's ability to detect each of the different types of physician-provided annotations provided with the data set.

[Figure 4.1 appeared here as a bar chart.]

Figure 4.1 : Distribution of the classes present in the DS2 subset of the MIT-BIH Arrhythmia Database.
The annotations of interest for our study here are as follows: normal beat (N), atrial premature beat (A), premature ventricular contraction (V), unclassifiable beat (Q), fusion of ventricular and normal (F), nodal (junctional) escape beat (j), left bundle branch block beat (L), aberrated atrial premature beat (a), nodal (junctional) premature beat (J), and right bundle branch block beat (R). The distribution of classes in the test set (DS2) is shown in Figure 4.1.

Grammatical Facial Expressions

The Grammatical Facial Expressions Data Set [17] contains three-dimensional coordinates recorded from 12 different regions of the face during conversations in Libras (Brazilian Sign Language); with 100 total coordinates recorded per time sample, the first two coordinate dimensions are recorded in pixels from an image, and the third dimension is recorded in millimeters from an infra-red camera. Labels are assigned to each time sample based on the type of sentence at that point in the conversation. Obtained from the UCI Machine Learning Repository [20], the following grammatical expression labels are provided with the data (with our numerical labels in parentheses): (0) rest (no task), (1) affirmative, (2) conditional, (3) doubt question, (4) emphasis, (5) negative, (6) relative, (7) topics, (8) when/where/why/what/how questions, and (9) yes-no questions. This data set contains facial coordinates and labels for two different subjects, "a" and "b"; we perform parameter selection on the data from subject "a" and use the data from subject "b" as our test set. The class distributions for subject "b" are shown in Figure 4.2, with the most prevalent class being the resting class (0).

MNIST

The third data set on which we evaluate the methods of interest is the MNIST (or Modified NIST) Database of Handwritten Digits [53], also freely available online [54].
[Figure 4.2 appeared here as a bar chart.]

Figure 4.2 : Distribution of the classes present in the testing subset of the Grammatical Facial Expressions Data Set.

As the name perhaps suggests, this data set contains images of handwritten digits ranging from 0 to 9, with each digit size-normalized and centered to fit in a 28 × 28-pixel image. In analyzing this data, each image is vectorized to be a 784 × 1 vector, concatenating the vectors to form a matrix. We use the data split provided by LeCun, Cortes, and Burges [54] with a training set of 60,000 images and a test set of 10,000 images; more evenly distributed than the other two data sets, the class representation of the digits present in the test set is shown in Figure 4.3.

[Figure 4.3 appeared here as a bar chart.]

Figure 4.3 : Distribution of the classes present in the testing subset of the MNIST data set.

4.1 Subset Selection Schemes

4.1.1 DEIM

Described in more detail in Chapter 3, DEIM is applied to a unitary matrix, namely the left or right matrices of singular vectors, to identify rows or columns of interest in a matrix A. More specifically, for a rank-k singular value decomposition A = VSW^T, the index vector p (or q) is constructed by processing the columns of V (or W) one at a time, applying an interpolatory projector

ℙ_j = V_j (P_j^T V_j)^{-1} P_j^T,

where P_j = I(:, p_j) and the subscript j denotes the first j columns of the matrix. Whereas the experiments conducted in Chapter 3 use an incremental QR algorithm to first approximate a rank-k SVD of the data A, the experiments conducted in this section use the SVD directly.
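The column-by-column selection just described can be sketched directly in NumPy. The function below is an illustrative implementation of standard DEIM index selection, not the thesis's code; V is assumed to hold the leading singular vectors as columns:

```python
import numpy as np

def deim(V):
    """DEIM index selection applied to a matrix V of (left or right)
    singular vectors, processing one column at a time."""
    m, k = V.shape
    # The first index is where the first singular vector is largest.
    p = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, k):
        # Interpolate the next column at the indices chosen so far...
        c = np.linalg.solve(V[p, :j], V[p, j])
        # ...and pick the index where the interpolation residual is largest.
        r = V[:, j] - V[:, :j] @ c
        p.append(int(np.argmax(np.abs(r))))
    return np.array(p)
```

Applying this to the leading left singular vectors of a data matrix yields the row indices p; applying it to the right singular vectors yields the column indices q.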
In determining an appropriate rank of the SVD for subset selection, we select a truncation tolerance, θ_DEIM. Then k is the largest index such that σ_k/σ_1 > θ_DEIM, where σ_i is the i-th largest singular value of A. We use the training data to select θ_DEIM from the set {5 × 10⁻¹, 1 × 10⁻¹, 5 × 10⁻², 1 × 10⁻², 5 × 10⁻³, 1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴, 5 × 10⁻⁵, 1 × 10⁻⁵, 5 × 10⁻⁶, 1 × 10⁻⁶}.

4.1.2 Q-DEIM

Introduced by Drmač and Gugercin in 2016 [23], Q-DEIM selects row/column indices of interest through a rank-revealing QR factorization of the transposed matrices containing the left/right singular vectors. With built-in MATLAB functions, this can quickly be implemented by first computing the rank-k SVD A ≈ VSW^T and then selecting indices in p (or q) by identifying the first k column pivots in the pivoted QR factorization of V^T (or W^T). The QR pivot selection step as implemented in MATLAB is shown below in Algorithm 3; this algorithm is adapted from the code provided in [23].

Algorithm 3 Q-DEIM (Adapted from [23])
1: % Input: V, a unitary matrix in R^{m×k}
2: % Output: p, a vector in R^k containing distinct integral values from {1, ..., m}
3: [~, ~, pˆ] = qr(V', 'vector')
4: p = pˆ(1:k)

The only parameter in this approach is the rank of the SVD approximation, k. This parameter is determined indirectly to be the largest index with σ_k/σ_1 greater than a user-defined parameter θ_Q-DEIM. In our experiments, we perform parameter selection for θ_Q-DEIM in {5 × 10⁻¹, 1 × 10⁻¹, 5 × 10⁻², 1 × 10⁻², 5 × 10⁻³, 1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴, 5 × 10⁻⁵, 1 × 10⁻⁵, 5 × 10⁻⁶, 1 × 10⁻⁶, 5 × 10⁻⁷, 1 × 10⁻⁷, 5 × 10⁻⁸, 1 × 10⁻⁸}.

4.1.3 Leverage Scores

Perhaps the most commonly used method for selecting indices to form a CUR decomposition is via leverage scores.
As with DEIM and Q-DEIM, a rank-k SVD of A is computed such that A ≈ VSW^T. Then the indices in p (or q) are chosen by selecting the rows of V (or W) with the largest leverage scores, where the i-th leverage score as defined in [22] and [61] is given by

    π_i = (1/k) Σ_{j=1}^{k} V(i, j)².

That is, the selection indices are determined by the scaled squared ℓ₂-norms of rows in the matrices of left and right singular vectors. As with the two previously presented index selection schemes, the rank of the SVD approximation is selected such that k is the largest index with σ_k/σ_1 > θ_lev, where θ_lev is tuned over the set {5×10⁻¹, 1×10⁻¹, 5×10⁻², 1×10⁻², 5×10⁻³, 1×10⁻³, 5×10⁻⁴, 1×10⁻⁴, 5×10⁻⁵, 1×10⁻⁵, 5×10⁻⁶, 1×10⁻⁶, 5×10⁻⁷, 1×10⁻⁷, 5×10⁻⁸, 1×10⁻⁸}.

4.1.4 Maximal Volume

While DEIM and Q-DEIM are also related to the maximization of submatrix volumes [23], the approach first proposed by Goreinov, Tyrtyshnikov, and Zamarashkin in 1997 explicitly searches for a maximal-volume submatrix [32]. Rather than computing the SVD, the goal is to find p and q such that A(p, q) has maximal volume; that is,

    [p, q] = argmax_{p̂, q̂} |det(A(p̂, q̂))|.

Goreinov, Oseledets, Savostyanov, Tyrtyshnikov, and Zamarashkin approach this problem through searches for dominant submatrices [31], and Oseledets and Tyrtyshnikov couple this approach with cross-approximation techniques [78]. We also perform index selection using the latter approach, using an implementation presented by Kramer and Gorodetsky in [51]; the cross-approximation and maximal-volume-approximation algorithms (adapted from [51]) are shown in Algorithms 4 and 5, respectively. The actual implementation adapts publicly available code written by Kramer and Gorodetsky available online [50].

Algorithm 4 crossApprox (Adapted from [51])
Input: Matrix A ∈ R^{m×n}; initial guess of the matrix rank, r; stopping tolerance, δ; tolerance of the MaxVol algorithm, ε
Output: Index vectors p and q.
1: Set k = 1
2: Randomly select initial column indices q
3: Set converged = false
4: while not converged do
5:   Q, R = qr(A(:, q))
6:   p = MaxVol(Q, ε)
7:   Q, R = qr(A(p, :)^T)
8:   q = MaxVol(Q, ε)
9:   Q̂ = Q(q, :)
10:  A_k = A(:, q)(Q Q̂^{-1})^T
11:  if k > 1 then
12:    RE_k = ‖A_k − A_{k−1}‖_F / ‖A_{k−1}‖_F
13:    if RE_k ≤ δ then
14:      converged = true
15:    end if
16:  end if
17:  k = k + 1
18: end while

Algorithm 5 maxvol (Adapted from [51])
Input: Matrix Q ∈ R^{v×r}; stopping tolerance ε.
Output: Index vector p where Q(p, :) is dominant.
1: [L, U, P] = lu(Q) (P are the pivots of the LU decomposition)
2: Q̂ = Q(P, :)
3: G = Q̂ Q̂(1:r, :)†
4: [î, ĵ] = argmax_{i,j} |G(i, j)|
5: g = G(î, ĵ)
6: while |g| > 1 + ε do
7:   P(ĵ) = î
8:   P(î) = ĵ
9:   Q̂ = Q(P, :)
10:  G = Q̂ Q̂(1:r, :)†
11:  [î, ĵ] = argmax_{i,j} |G(i, j)|
12:  g = G(î, ĵ)
13: end while
14: p = P(1:r)

For our implementation, rather than assuming the data in an m × n matrix A is low rank, we actually choose the rank estimate to be n − 1. The cross-approximation stopping tolerance, δ, and the tolerance for the "maxvol" algorithm, ε, are both tuned over {5×10⁻¹, 1×10⁻¹, 5×10⁻², 1×10⁻², 5×10⁻³, 1×10⁻³, 5×10⁻⁴, 1×10⁻⁴, 5×10⁻⁵, 1×10⁻⁵}.

4.1.5 Clustering

There are a number of ways to partition data into groups of related observations. For the purposes of comparing the CUR-index-selection techniques with existing off-the-shelf methods, however, we look at basic implementations of four different clustering algorithms. With the exception of Max-Min clustering, we use functions included in MATLAB R2016a to perform clustering.

k-Medoids Clustering

As opposed to the popularly known k-means algorithm that averages beats to form cluster centers in each iteration, k-medoids clustering fixes k beats in the original data set itself to serve as centroids, grouping each beat in the cluster with the nearest centroid.
Each of the k centroids is then redefined to be the most central data point in its corresponding cluster given some measure of distance (here, the squared Euclidean distance). The data points are then re-clustered, and the process repeats until the cluster centers no longer change. Though the results for k-medoids clustering on pediatric ECGs were inconclusive in [37], the use of annotated databases here provides a direct means of evaluating k-medoids performance through successful class identification. The number of clusters, k, for each data set was tuned from 10 to 150 clusters in intervals of 5 clusters; that is, the method was trained for k = 10, 15, 20, ..., 145, 150.

Agglomerative Clustering

Another popular approach is agglomerative clustering, a type of hierarchical clustering. This method begins with every point designated as its own cluster and then combines clusters, two at a time, given a particular linkage distance, until a certain threshold has been met or all of the data has been merged into a single cluster. In our implementation, we use MATLAB's default of the single linkage with the Euclidean distance. Then for clusters A and B, the single linkage distance between A and B is given by min_{x∈A, y∈B} d(x, y) [36], where in this case d(x, y) = ‖x − y‖₂. Defining our cutoff threshold parameter to be simply the number of clusters formed, N, we tuned this parameter from 10 to 150 clusters in intervals of 5 clusters (i.e., N = 10:5:150) for the MIT DS1 data and from 10 to 300 (10:5:300) for the facial expression subject "a" and MNIST train data sets.

Gaussian Mixture Models

In brief, Gaussian mixture models (GMM) fit a given number, M, of Gaussian distributions (or components) to the data. In clustering, each data point is assigned to a particular component based on the point's probability estimates from the resulting component density functions; more details on these models can be found in [36].
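The component-assignment step just described can be illustrated with a small NumPy sketch. This is not MATLAB's cluster function; the function name and diagonal-covariance parameterization are our own, chosen to match the simplified covariance structure used in our experiments:

```python
import numpy as np

def gmm_assign(X, means, variances, weights):
    """Assign each row of X to the Gaussian component with the highest
    (log) posterior probability, assuming diagonal covariances.

    means, variances: M-by-d arrays; weights: length-M mixing weights.
    """
    n, d = X.shape
    M = means.shape[0]
    logp = np.zeros((n, M))
    for m in range(M):
        # log of weight_m * N(x; mean_m, diag(variances_m))
        z = (X - means[m]) ** 2 / variances[m]
        logp[:, m] = (np.log(weights[m])
                      - 0.5 * np.sum(np.log(2 * np.pi * variances[m]))
                      - 0.5 * z.sum(axis=1))
    return np.argmax(logp, axis=1)
```

Fitting the means, variances, and weights themselves (the EM step performed by fitgmdist) is omitted here for brevity.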
To fit such a model to the data sets described below, we use MATLAB's fitgmdist and cluster functions, allowing the number of clusters to vary from 10 to 150 in intervals of 5 clusters (10:5:150). For simplification, we use shared covariance and diagonal covariance matrices in our implementation; given that this is perhaps an over-simplification, additional experiments with a more in-depth exploration of GMM performance are necessary in the future.

Max-Min Clustering

The max-min clustering algorithm maximizes the minimum distance between cluster centroids, adding each new centroid one-by-one from a random initial start. An adapted form of the algorithm initially suggested by Gonzalez in 1985 [30] is presented in Algorithm 6. Rather than pre-determining the number of clusters to define, however, we follow the idea in [99] of splitting the data into new clusters until the distance between cluster centroids, h, is less than some threshold, δ.

With dynamic time warping as the preferred measure of distance, Syed, Guttag, and Stultz determine that δ = 50 is suitable for their purposes; in their implementation, they use a modified version of dynamic time warping (DTW) that does not normalize by path-length and essentially smooths in the process of computing the distance, and they do not appear to normalize beats prior to analysis [99]. We also use the non-path-normalizing DTW distance and a moving average filter with window-size five to smooth each normalized beat prior to clustering. However, we also use a dynamic time warping window of 10%, and we select the value of δ from among the set {5, 3, 1, 7×10⁻¹, 5×10⁻¹, 3×10⁻¹, 1×10⁻¹, 7×10⁻², 5×10⁻², 3×10⁻²} using the training data for each data type. As in other DTW implementations used throughout this thesis, we adapt some portions of our code from Quan Wang's 2013 DTW code on the MATLAB Central File Exchange [107].

Algorithm 6 Max-Min Clustering (adapted from [30])
Input: A, an m × n matrix, with each column containing a beat; δ, a positive scalar controlling the final distance between centroids
Output: k, the number of clusters; c, a vector of length k containing the centroid indices; clusters, a vector of length n containing the corresponding cluster number for each beat
1: Define D to be the distance matrix, where D(i, j) = dist(A(:, i), A(:, j)) is the distance between A(:, i) and A(:, j)
2: Randomly select c₁ from the columns of A to be the first centroid
3: clusters = ones(1, n)
4: k = 1
5: h = Inf
6: while h > δ do
7:   h = max_ℓ dist(c_ℓ, A(:, clusters == ℓ))
8:   Define c_{k+1} to be a beat at which h is obtained
9:   for i = 1:k do
10:    for j such that clusters(j) == i do
11:      if dist(c_{k+1}, A(:, j)) < dist(c_i, A(:, j)) then
12:        clusters(j) = k + 1
13:      end if
14:    end for
15:  end for
16:  k = k + 1
17: end while

4.2 Comparison Results

Prior to comparing each of the aforementioned methods, parameter selection is performed on each of the three different training sets. Each parameter is chosen to be the one corresponding to the most dimension reduction among those achieving the maximum class detection rate (even if the detection results were not monotone with parameter choice). The parameter selection results are shown in Table 4.1. We note that the GMM approach was not actually applied to the MNIST data set because its implementation would require an additional regularization parameter in MATLAB; this is a topic of future interest.

Parameter               MIT-BIH   Facial Expression   MNIST
k (k-medoids)           140       90                  15
N (agglomerative)       110       150                 85
ε (MaxVol)              5e-2      0.5                 0.5
δ (MaxVol)              0.1       0.5                 0.5
θ_Q-DEIM (Q-DEIM)       1e-3      5e-6                0.1
M (GMM)                 145       40                  –
θ_DEIM (DEIM)           1e-3      1e-4                1e-2
θ_lev (lev. scores)     1e-3      1e-6                5e-3
δ (Max-Min)             0.03      0.3                 5

Table 4.1 : The selected parameters on the training sets of the MIT-BIH (DS1), the facial expressions (subject "a"), and the MNIST handwritten digits data sets.
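Algorithm 6 can be sketched compactly in Python. The sketch below is illustrative only: it uses the plain Euclidean distance rather than the smoothed, non-path-normalizing DTW distance used in our experiments, and the function name and fixed random seed are our own:

```python
import numpy as np

def max_min_cluster(A, delta):
    """Sketch of Max-Min clustering (Algorithm 6), Euclidean distance.

    Columns of A are observations. New centroids are added one at a time
    at the point farthest from its assigned centroid, until that farthest
    distance h drops below the threshold delta.
    """
    n = A.shape[1]
    rng = np.random.default_rng(0)
    centroids = [int(rng.integers(n))]          # random initial centroid
    clusters = np.zeros(n, dtype=int)
    while True:
        # distance from every point to its currently assigned centroid
        d = np.array([np.linalg.norm(A[:, j] - A[:, centroids[clusters[j]]])
                      for j in range(n)])
        h = d.max()
        if h <= delta:
            break
        centroids.append(int(np.argmax(d)))     # farthest point becomes new centroid
        # reassign points that lie closer to the new centroid
        for j in range(n):
            if np.linalg.norm(A[:, j] - A[:, centroids[-1]]) < d[j]:
                clusters[j] = len(centroids) - 1
    return np.array(centroids), clusters
```

With well-separated groups and a threshold between the within-group and between-group spreads, the loop stops after one centroid per group.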
With the parameters selected, we then apply each method to the different test sets: DS2 of the MIT-BIH data set, subject "b" of the Grammatical Facial Expressions data set, and the MNIST test set.

4.2.1 MIT-BIH: DS2

The results corresponding to the DS2 data set are shown in Figure 4.4 and Table 4.2. Figure 4.4 shows the class-by-class performance of each of the different methods in detecting each label when it is present in a file. These results are summarized in the top row of Table 4.2, which shows the average percentage of labels detected in each of the 22 patient files. Overall, we see that for the chosen parameters, the agglomerative clustering produces the best class detection, followed by the Max-Min clustering approach. DEIM, leverage scores, and k-medoids clustering all perform similarly, followed by the MaxVol and Q-DEIM methods, with GMM showing the lowest amount of class detection. As shown in Figure 4.4, we do see that there were indeed several annotations that were difficult for all, or almost all, of the tested implementations to detect. Of note is that the nodal (junctional) premature beats (annotation J) are uniformly detected at 50% due to the fact that file 222, one of only two files with this annotation, has only one such beat, and it was missed by all of the methods. A similar scenario occurs for j, where the one beat with this annotation in file 232 was missed by every algorithm except Max-Min clustering. We also see that no unsupervised learning approach was able to detect for all files the presence of the A (premature atrial beat) class, at least when the beats are preprocessed and stored as they are for this comparison.

             MaxVol   Q-DEIM   DEIM    Lev. Score   k-medoids   Agglom.   GMM     Max-Min
%Detection   89.24    90.38    91.52   91.52        91.52       95.83     86.06   94.32
%Reduction   92.66    92.66    92.71   92.71        93.10       94.58     92.86   97.44

Table 4.2 : The average percent detection among all classes in the DS2 test set and the total percent dimension reduction for each of the methods tested.
For added understanding of the performance of the different methods, the last row of Table 4.2 shows the percent dimension reduction across all files resulting from applying each method with its respective selected parameters. While the amount of dimension reduction is directly determined by the parameters for k-medoids, agglomerative, and GMM clustering, we also see that the CUR-index-selection methods all demonstrate the same amount of dimension reduction in connection with the matrix rank. While the use of Max-Min clustering in this setting produces the largest amount of dimension reduction, it is interesting to note that the reduction in each file varied greatly; where one file was reduced to only 25 beats, another file had over 1,000 beats in the reduced set. In all cases, however, we see greater than 90% dimension reduction in obtaining the results presented in Figure 4.4.

Figure 4.4 : The percent detection of each label across all 22 files in the DS2 test set for each of the methods tested.

4.2.2 Grammatical Facial Expressions: Subject "b"

As was done for the MIT-BIH DS2 data set, we also look at the class detection and dimension reduction results for subject "b" of the Grammatical Facial Expressions data set. As there is no need to average across multiple files for this data set, we simply look at the number of grammatical expression labels detected in applying each of the methods. These results are summarized in Table 4.3, where MaxVol, Q-DEIM, DEIM, and k-medoids all missed class 1, the leverage score scheme missed both classes 1 and 3, agglomerative clustering missed classes 1, 3, 6, and 7, GMM clustering missed classes 1 and 7, and Max-Min clustering detected all classes.

             MaxVol   Q-DEIM   DEIM    Lev. Score   k-medoids   Agglom.   GMM     Max-Min
%Detection   90       90       90      80           90          60        80      100
%Reduction   97.91    99.04    99.20   98.94        99.37       98.95     99.72   86.86

Table 4.3 : The percent detection among all class labels and the total percent dimension reduction in the subject "b" test set for each of the methods tested.

The dimension reduction corresponding to these class detection results is also presented in Table 4.3. Here we see that, though Max-Min demonstrated the greatest class recovery results, it retained over 10% more of the data than any of the other methods to achieve those results. On the other hand, we see that for the parameters selected, GMM clustering showed the greatest amount of dimension reduction, which may correspond to its only 80% class detection. k-medoids shows the next largest dimension reduction, followed by DEIM and Q-DEIM, all three of which had 90% class detection. We note that MaxVol was able to achieve similar detection results while having retained additional data points, and leverage score subset selection had poorer detection results despite having a smaller dimension reduction than both DEIM and Q-DEIM. With a similar amount of dimension reduction as seen in the leverage score case, agglomerative clustering produces the poorest class detection results on this data set.

4.2.3 MNIST

The results summary for the test portion of the MNIST data set is shown in Table 4.4. With classes more evenly represented in this data set, we see that MaxVol, Q-DEIM, DEIM, and agglomerative clustering are all able to achieve 100% class detection, followed by leverage scores with 90% (missing digit 1) and k-medoids with 80% (missing digits 4 and 8). As noted above, we do not include results for GMM due to the need for an additional regularization parameter.
We also do not include results for Max-Min on this test because the result generation for each parameter was very slow (for example, greater than a day) for the parameter range selected; improved parameter selection for this clustering method is also a topic of interest in moving forward.

             MaxVol   Q-DEIM   DEIM    Lev. Score   k-medoids   Agglom.
%Detection   100      100      100     90           80          100
%Reduction   92.17    99.71    96.3    95.19        99.85       99.15

Table 4.4 : The percent detection among all class labels and the total percent dimension reduction in the MNIST test set for each of the methods tested.

In regard to dimension reduction, we see that the selected k = 15 in k-medoids produces the smallest reduced set, but suffers in performance. Unsurprisingly, the differences in dimension reduction for Q-DEIM, DEIM, and leverage scores correspond to the different SVD rank tolerances chosen in the parameter tuning. We again see that of the four CUR-index-selection methods, MaxVol has the least amount of dimension reduction for its given parameters, but it is able to achieve 100% class detection; leverage scores do not perform as well despite keeping more data points in the reduced set. Where agglomerative clustering performed quite poorly on the Grammatical Facial Expressions data set with dimension reduction comparable to better performing methods, on the MNIST test set we see that agglomerative clustering is able to achieve maximum class detection with greater than 99% dimension reduction, the third-highest reduction percentage.

Considering all three data sets, there is not a consistent top-performing algorithm; this is perhaps of little surprise given that the existence of any machine learning algorithm that is "best" for all data sets is doubtful.
However, as demonstrated in the results summary scatter plot for all three experiments in Figure 4.5, DEIM and Q-DEIM do consistently perform well on the tested data sets while achieving reasonable levels of dimension reduction in comparison to the other methods. Max-Min clustering also demonstrates strong class detection performance on the DS2 and subject "b" data sets, but the algorithm is slow (particularly with dynamic time warping) and demonstrates fluctuations in dimension reduction performance.

Figure 4.5 : The percent detection with respect to the amount of dimension reduction for the experiments conducted on the MIT-BIH Arrhythmia, Grammatical Facial Expression, and MNIST test sets for each of the methods tested.

In addition to these results, we note that the computational costs of the CUR subset-selection schemes are generally lower than the costs of the clustering algorithms presented here (with the exception of perhaps our over-simplified implementation of the Gaussian mixture models), further supporting the use of such subset selection schemes in practice.

4.3 Conclusion

In this chapter we have presented a brief comparison of four different CUR index-selection schemes (DEIM, Q-DEIM, leverage scores, and volume-maximization, or MaxVol) with off-the-shelf clustering algorithms for unsupervised subset selection. In the future, we propose that additional attention be given to parameter selection, as we have only focused on certain implementations of the algorithms studied here with parameter selection occurring over certain ranges of values.
In applying the selected methods with their tuned parameters to three different data sets, we find that the index-selection methods generally perform comparably to the clustering approaches; of those tested, DEIM and Q-DEIM are consistently among the better-performing algorithms, obtaining dimension reduction that is comparable to the other algorithms for the parameters selected. Although Max-Min clustering outperformed these methods in the class-detection experiments performed, we note that this algorithm was much slower for the parameters selected to achieve such performance and had widely varying amounts of dimension reduction. Though the leverage score and MaxVol approaches also performed suitably well, these methods did not perform quite as well as Q-DEIM and DEIM when taking both class identification and dimension reduction into account. On the whole, however, we do indeed find that CUR index-selection schemes applied for the purposes of class identification perform similarly to, if not better than, some of the standard clustering algorithms used in practice.

Chapter 5

An Extension of DEIM

This chapter is in preparation to be submitted for publication. The work in this chapter is motivated by the scenario in which inherent properties of the data matrix or its chosen low-rank SVD approximation limit the number of DEIM indices that can be identified for subset selection.

5.1 Introduction

Dimension reduction techniques often play an important role in the analysis of large data sets. There are a number of different dimension reduction techniques that can be applied to a given problem, but often the reduced data sets consist of derived features that are no longer interpretable in their original context. For instance, in using a method such as principal components analysis (PCA) on medical data, it would likely be difficult for physicians to apply their expert training to interpret the clinical meaning of each individual principal component at face value.
Hence, in some settings, it is necessary to identify a dimension reduction technique that preserves the original structure of the data so as not to lose its interpretability, selecting a meaningful subset of observations or features prior to conducting further analysis. As demonstrated in a recent paper [38], one such way to reduce the data dimension is to use the discrete empirical interpolation method (DEIM) index selection algorithm. Presented as part of a CUR matrix factorization for identifying important rows and columns of a given data matrix, DEIM index selection is the underlying means for selecting said rows and columns in [38]. This index-selection method makes use of the underlying linear-algebraic structure of the data to determine some of the most influential rows and columns defining the space in which the data lives.

While previous works (discussed further below) demonstrate the utility of DEIM, the algorithm does of course have its limitations. In this work, we address the particular scenario in which DEIM is unable to identify all of the classes in the data merely due to the rank of the corresponding data matrix of interest; the number of DEIM-selected rows or columns inherently cannot exceed the matrix rank. In practice, however, there are some cases in which the number of classes in the data is expected to exceed the rank of the matrix. For instance, if the number of samples or derived features is small for a data set representing a greater number of classes, DEIM can at best only detect as many classes as the number of samples/features. A specific example of such a case is presented in the Letter Recognition Data Set [26], [20], described further below. In this work, we present extensions of DEIM designed to accommodate subset selection in such data sets.
We also note that since DEIM relies on an approximation to the singular value decomposition (SVD) of a matrix, the proposed approach may also find use in the setting in which a rank-k SVD approximation is prohibitively expensive to compute even for moderately small k. In this setting, our extension of DEIM allows for the selection of additional indices without the need to compute the full rank-k approximation, but instead a lower-rank approximation. Hence, this notion of extending DEIM can be applied in a number of contexts. Here we primarily focus on subset selection, but we also briefly discuss theoretical implications of using such an extension in computing the CUR matrix factorization.

After a brief review of the standard DEIM implementation presented in [11] and [95], we describe our proposed DEIM extension. After further discussion regarding the reasoning behind the algorithm design, we then present some theoretical results that follow from extending DEIM. Finally, we provide results from testing extended DEIM on two different data sets, demonstrating the effectiveness of extended DEIM methods in detecting additional classes.

Notation: In the sections that follow, for a matrix A ∈ R^{m×n}, we use a_j = A(:, j) to indicate the j-th column of A and A_t = A(:, t) to represent the columns of A corresponding to the indices held in the vector t ∈ N^ℓ for 1 ≤ ℓ ≤ n.

5.2 Background

Originally introduced in the model reduction setting by Chaturantabut and Sorensen in 2010 [11], DEIM is extended to the formation of the CUR matrix factorization in a recent work by Sorensen and Embree [95]. Hendryx, Rivière, Sorensen, and Rusin apply this DEIM CUR matrix factorization to the medical domain for the identification of representative electrocardiogram (ECG) beat morphologies [38].
In this work, the DEIM-selected beats are shown to be representative of the larger data set, providing a means of summarizing the data for additional analyses. In particular, the selected beats can be used in the classification of the remaining, unselected beats in the data set for the development of clinical decision support tools.

5.2.1 Standard DEIM

In the typical implementation of DEIM, the index selection is limited by the rank of the matrix. For instance, if V ∈ R^{m×k} is full rank with k ≤ m, in selecting k rows from V we construct the index vector p ∈ N^k such that it contains non-repeating values in {1, 2, ..., m}. Letting P = I(:, p), where I is the m × m identity matrix, it can be shown that P^T V is invertible (see Lemma 3.2 in Sorensen's and Embree's 2016 paper [95]). Then the interpolatory projector 𝒫 defined through the DEIM index selection process is

    𝒫 = V(P^T V)^{-1} P^T.

Construction of p in Standard DEIM

To select the indices held in p, each column of V is considered in turn, with p(1) taken to be the index of the largest-magnitude entry of the first column, argmax_i |V(i, 1)|. The j-th subsequent index in p is determined by subtracting from v_j the interpolatory projection of v_j onto the range of V(:, 1:j−1), defining p(j) to be the index corresponding to the largest element of this difference (in absolute value). This is outlined in more precise detail in Algorithm 7.
Algorithm 7 DEIM Point Selection (Adapted from [95])
Input: V, a matrix in R^{m×k} with m > k
Output: p, a vector in R^k containing integral values from {1,...,m}
1: v = v₁
2: [~, p₁] = max(|v|)
3: p = p₁
4: for j = 2:k do
5:   v = v_j
6:   c = V(p, 1:j−1)^{-1} v(p)
7:   r = v − V(:, 1:j−1)c
8:   [~, p_j] = max(|r|)
9:   p = [p; p_j]
10: end for

5.3 Proposed Method: Extended DEIM

Suppose, however, that we want to select k̂ separate rows from a full rank matrix V ∈ R^{m×k} to form p ∈ N^{k̂} via DEIM, where k ≤ k̂ ≤ m. Then letting P = I(:, p) in R^{m×k̂}, P^T V is not invertible, and we use a pseudoinverse of P^T V to define 𝒫 as

    𝒫 = V(P^T V)† P^T,                                        (5.1)

where here we use the left Moore-Penrose pseudoinverse (P^T V)† = [(P^T V)^T (P^T V)]^{-1} (P^T V)^T, assuming P^T V has full column rank.*

*Given that the extended DEIM approach presented here first makes use of the standard-DEIM-selected indices, this assumption on the rank of P^T V holds for the work herein.

Note that

    𝒫² = V(P^T V)† P^T V(P^T V)† P^T = V(P^T V)† P^T = 𝒫,     (5.2)

indicating that 𝒫 is indeed a projection. Also notice that by increasing the length of p and choosing the left inverse of P^T V to form the projector, we lose the interpolatory nature of 𝒫 for general z ∈ R^m while maintaining the interpolatory nature of 𝒫 for vectors in the range of V, ℛ(V). Suppose, for example, y ∈ ℛ(V). Then there exists x ∈ R^k such that y = Vx and

    (𝒫y)(p) = P^T V(P^T V)† P^T Vx = P^T Vx = y(p).            (5.3)

5.3.1 Construction of p for Extended DEIM

A question for the case in which k̂ > k is: How should p(k+1:k̂) be selected? For example, one could simply select a random subset of an additional k̂ − k indices beyond those selected by standard DEIM, or the additional indices could be selected by looking at the next largest residuals held in r in the final iteration of the for loop in Algorithm 7.
Another option is to select p(k+1:k̂) such that A(p(k+1:k̂), :) has full row rank (requiring that k̂ ≤ 2k); then A(p, :) will contain two blocks of linearly independent rows. This second approach, making further use of some of the linear-algebraic properties of the data, is the one studied further below.

To select a second set of k̂ − k linearly independent rows/columns of A, our approach performs standard DEIM to select the first k indices and then applies a modified version of DEIM, referred to as "restarted DEIM" below, to a submatrix of V to find the additional k̂ − k indices. This submatrix, V̂, is initially taken to contain all of the rows of V not in V(p, :); that is, V̂ = V(p^c, :), where p^c ∈ N^{m−k} contains those indices from 1 to m not contained in p.

In selecting the additional indices using V̂, we consider two residuals, r₁ and r₂, in restarted DEIM. The formation of these residuals is discussed below, with variations on the use of r₁ and r₂ studied in the results section.

Forming r₁

Similar to the formation of r in standard DEIM, r₁ contains the residual from a complementary projection computed in constructing p. However, since removing rows of V to form V̂ means that V̂ is no longer guaranteed to have full column rank, at the i-th iteration of restarted DEIM, the residual produced by (I − 𝒫_{j−1})v̂_i is checked, where 𝒫_{j−1} is the projector to be described further below. If the magnitude of the largest residual entry is too small, then v̂_i lies too close to the range of the previously selected columns of V̂ and should be removed from consideration.

Forming r₂

If a given row of V̂ is too similar to the rows of V(p, :), then that row should not be selected for inclusion in p. The role of the r₂ residual, then, is to help maintain a form of "memory" of the previously selected rows from the first (standard) DEIM implementation. However, this requires that we identify a notion of "too similar" for this context. We consider a few different ideas here.
Note that unlike r₁, in each of the approaches presented, the residual in r₂ is computed only once, prior to the iterative portion of restarted DEIM.

One way that we compare V̂ and V(p, :) is to compute the ℓ₁ distance between each row of V̂ and the rows of V(p, :), defining r₂ such that

    r₂(i) = min_{j∈p} ‖V(j, :) − V̂(i, :)‖₁.

Depending on the data set, we may want to replace the ℓ₁ distance with another measure of distance, for example dynamic time warping. This residual vector is then normalized by its maximum entry so that it contains values between 0 and 1.

We can also compare V̂ and V(p, :) by looking at the angles between their rows. This idea stems from the notion of the coherence (or mutual coherence) of a matrix A ∈ R^{m×n}, defined in the information-theoretic and signal recovery literature (for example, [21] and [9]) as

    μ(A) = max_{i≠j} |a_i^T a_j| / (‖a_i‖₂ ‖a_j‖₂).

Notice that μ(A) is bounded between 0 and 1 and simply corresponds to the cosine of the smallest angle between the columns of A; smaller values of μ(A) indicate that the columns of A are relatively spread out in R^m, where a larger coherence indicates that there are at least two columns in A that lie close to each other. In the signal recovery context, there is often a need for the data observations to be "spread out", or incoherent, in order to recover the "true" data. Here, we want to encourage selection of additional rows from V̂ that are "spread out" from those already selected in V(p, :). Using this notion, we normalize the rows of V̂ and V(p, :) via the ℓ₂-norm to get 𝒱̂ and 𝒱(p, :), respectively, and define r₂ as

    r₂(i) = 1 − max_j (|𝒱̂(i, :) 𝒱(p_j, :)^T|),

where the maximum is taken in a row-wise fashion. The subtraction of the second term from 1 forces larger entries in r₂ to correspond to those rows of V̂ that are most different from the previously DEIM-selected rows. The need for this will become more apparent as we discuss combining the information held in r₁ and r₂ below.
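The two r₂ constructions just described can be sketched directly in NumPy. The function names are our own, and Vhat and Vp stand for V̂ and V(p, :), respectively:

```python
import numpy as np

def r2_l1(Vhat, Vp):
    """r2 via the l1 distance: r2(i) = min over rows j of Vp of
    ||Vp[j] - Vhat[i]||_1, normalized by its maximum entry."""
    r2 = np.array([np.min(np.abs(Vp - row).sum(axis=1)) for row in Vhat])
    return r2 / r2.max()

def r2_coherence(Vhat, Vp):
    """r2 via row angles: 1 minus the largest |cosine| between each row
    of Vhat and the rows of Vp, after l2-normalizing all rows."""
    N = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    return 1.0 - np.max(np.abs(N(Vhat) @ N(Vp).T), axis=1)
```

In both cases, an entry of r₂ near zero flags a row of V̂ that nearly duplicates a previously selected row, so the element-wise weighting discussed next suppresses it.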
Combining r₁ and r₂

To take into account the indices selected in the standard DEIM step as well as the proximity of v̂_i to the previously considered columns in restarted DEIM, the different information carried by r₁ and r₂ is combined into a single vector by taking the element-wise product of the two vectors:

    r̂ = r₁ .* r₂.                                              (5.4)

If the maximum entry of r̂ in absolute value is below a user-defined tolerance, then it is likely that no new information is contributed by v̂_i for index selection. Otherwise, this residual is used to select the next index to be included in p̂ (the vector of indices formed from V̂), and v̂_i is included in forming the projector for the next iteration.

In initializing restarted DEIM, r̂ is computed as r̂ = v̂₁ .* r₂, and the first entry in p̂ is determined by the maximum entry (in absolute value) of r̂ if it exceeds the given tolerance; if ‖r̂‖_∞ is less than the tolerance, then r̂ is recomputed using the next column of V̂, repeating the process until the tolerance is surpassed. If the tolerance is never exceeded, then the algorithm terminates, having only performed standard DEIM. Otherwise, the algorithm proceeds to find the rest of the DEIM indices starting with the next column of V̂.

With this approach, the projector at iteration i + 1 is given by

    𝒫_i = V̂_{t_i} (P̂_i^T V̂_{t_i})^{-1} P̂_i^T,

where P̂_i = I(:, p̂_i) and t_i contains the indices of those columns of V̂ that have met the residual tolerance criterion in the previous iterations. Notice that by carefully constructing t_i, we have ensured that (P̂_i^T V̂_{t_i})^{-1} exists at each iteration. This claim is restated below in Lemma 1; the presented proof of this lemma uses induction and closely follows the structure of the proof of a similar claim in Lemma 3.2 of [95].

Lemma 1. Suppose V̂ = [v̂₁, ..., v̂_n] in R^{m×n} has rank k̂. Let P̂_j = [e_{p₁}, e_{p₂}, ..., e_{p_j}] and let V̂_{t_j} = V̂(:, t_j) for 1 ≤ j ≤ k̂, where t_j ∈ N^j contains non-repeating values in {1, ..., n}.
If the entries in t_j are selected via the proposed DEIM extension, then P̂_j^T V̂_{t_j} is nonsingular for 1 ≤ j ≤ k.

Proof. First, note that P̂_1^T V̂_{t_1} = e_{p_1}^T V̂_{t_1} is nonzero by construction in the initialization of the restarted portion of the proposed DEIM extension.

Now, suppose that P̂_{j−1}^T V̂_{t_{j−1}} is nonsingular, and let

\[
r_1 = \hat{v}_\ell - \hat{V}_{t_{j-1}} (\hat{P}_{j-1}^T \hat{V}_{t_{j-1}})^{-1} \hat{P}_{j-1}^T \hat{v}_\ell
\]

for ℓ = max(t_{j−1}) + 1. Notice that if ‖r_1‖_∞ = 0, then ‖r̂‖_∞ = 0 for r̂ given by Equation 5.4. In this case, p̂_{j−1} and t_{j−1} are not updated.

Suppose, then, that ‖r̂‖_∞ > 0. Then

\[
0 < \|r_1\|_\infty = |r_1(p_j)| = |e_{p_j}^T r_1|
= \left| e_{p_j}^T \hat{v}_\ell - e_{p_j}^T \hat{V}_{t_{j-1}} (\hat{P}_{j-1}^T \hat{V}_{t_{j-1}})^{-1} \hat{P}_{j-1}^T \hat{v}_\ell \right|,
\]

and t_j and p̂_j are updated so that t_j = [t_{j−1}; ℓ] and p̂_j = [p̂_{j−1}; p̂_j]. Factoring, we see that

\[
\hat{P}_j^T \hat{V}_{t_j} =
\begin{bmatrix}
\hat{P}_{j-1}^T \hat{V}_{t_{j-1}} & \hat{P}_{j-1}^T \hat{v}_\ell \\
e_{p_j}^T \hat{V}_{t_{j-1}} & e_{p_j}^T \hat{v}_\ell
\end{bmatrix}
=
\begin{bmatrix}
I_{j-1} & 0 \\
e_{p_j}^T \hat{V}_{t_{j-1}} (\hat{P}_{j-1}^T \hat{V}_{t_{j-1}})^{-1} & 1
\end{bmatrix}
\begin{bmatrix}
\hat{P}_{j-1}^T \hat{V}_{t_{j-1}} & \hat{P}_{j-1}^T \hat{v}_\ell \\
0 & \nu_j
\end{bmatrix},
\]

where ν_j = e_{p_j}^T v̂_ℓ − e_{p_j}^T V̂_{t_{j−1}} (P̂_{j−1}^T V̂_{t_{j−1}})^{−1} P̂_{j−1}^T v̂_ℓ ≠ 0 by the inequality displayed above.

Notice that P̂_j^T V̂_{t_j} is the product of two nonsingular matrices. Hence, P̂_j^T V̂_{t_j} is itself nonsingular. □

The discussed extended DEIM approach is more formally presented in Algorithm 8, with r2 defined as the ℓ1-norm of the difference between rows of V(p, :) and V̂ for the sake of example. Note that the intended purpose in the presentation of Algorithm 8 is to convey the ideas behind the described extended DEIM approach; as written, Algorithm 8 is not very practical for implementation. Also note that, as presented in Algorithm 8, the extent to which the length of p can extend beyond k elements is dependent upon the rank of V̂. This extension can be adapted, however, to repeat the restarted DEIM portion of the algorithm and potentially select even more indices, still taking into account the rank of each subsequent submatrix.
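The tolerance test on the combined residual r̂ = r1 .* r2 and the projector 𝒫_i used in the restarted pass can be sketched in a few lines of NumPy. This is a minimal sketch under my own naming (`accept_index`, `oblique_projector`), not the thesis code; indexing is 0-based.

```python
import numpy as np

def accept_index(r1, r2, tol):
    """Combined residual rhat = r1 .* r2 (Equation 5.4): return (rho, idx),
    with idx accepted only if the largest entry of |rhat| exceeds tol."""
    rhat = r1 * r2
    idx = int(np.argmax(np.abs(rhat)))
    rho = float(np.abs(rhat[idx]))
    return rho, (idx if rho > tol else None)

def oblique_projector(Vhat, t, phat):
    """P_i = Vhat_{t}(Phat^T Vhat_{t})^{-1} Phat^T, with Phat = I(:, phat) and
    t the accepted column indices; nonsingularity of Phat^T Vhat_{t} is what
    Lemma 1 guarantees for indices chosen by the extension."""
    Vt = Vhat[:, t]
    P = np.zeros((Vhat.shape[0], len(phat)))
    P[phat, np.arange(len(phat))] = 1.0
    return Vt @ np.linalg.solve(P.T @ Vt, P.T)
```

Since 𝒫_i is an (oblique) projector, 𝒫_i² = 𝒫_i, which gives a quick numerical sanity check on any implementation.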
For the purpose of demonstration, however, we only consider the case where p can have length up to 2k, as presented in Algorithm 8.

Algorithm 8 Extended DEIM
Input: V, a full-rank matrix in R^{m×k} with m > k;
       τ, a positive real number for determining if an index should be added to p
Output: p, a vector in R^{k̂} containing integer values from {1, ..., m}

% Perform standard DEIM
 1: v = v_1
 2: [~, p_1] = max(|v|)
 3: p = p_1
 4: for j = 2 : k do
 5:   v = v_j
 6:   c = V(p, 1:j−1)^{−1} v(p)
 7:   r = v − V(:, 1:j−1) c
 8:   [~, p_j] = max(|r|)
 9:   p = [p; p_j]
10: end for
% Form the residual vector, r2, based on V(p, :) and restart DEIM
11: b = pᶜ  (the complement of p in {1, ..., m})
12: V̂ = V(b, :)
13: for η = 1 : (m−k) do
14:   r_2(η) = min_{ℓ ∈ {1,...,k}} Σ_{i=1:k} |V(p_ℓ, i) − V̂(η, i)|
15: end for
16: r_2 = r_2 / max(r_2)
17: ρ = 0
18: i = 1
19: while ρ ≤ τ do
20:   v̂ = v̂_i
21:   [ρ, p̂_1] = max(|v̂ .* r_2|)
22:   i = i + 1
23: end while
24: p̂ = p̂_1
25: t = i − 1
26: for j = i : k do
27:   v̂ = v̂_j
28:   ĉ = V̂(p̂, t)^{−1} v̂(p̂)
29:   r_1 = v̂ − V̂(:, t) ĉ
30:   r̂ = r_1 .* r_2
31:   [ρ, p̂_j] = max(|r̂|)
      % Update p̂ and t only if v̂_j and row p̂_j both contribute enough new information
32:   if ρ > τ then
33:     p̂ = [p̂; p̂_j]
34:     t = [t; j]
35:   end if
36: end for
37: p = [p; b(p̂)]

Further Discussion of Method

Before evaluating the performance of extended DEIM on data, we more thoroughly discuss the presented extension approach. While there are a number of ways to select the additional indices included in extending p, the approach shown in Algorithm 8 selects additional rows/columns that span all of R^{k̂−k} (as opposed to a subspace of R^{k̂−k}). Despite the fact that we lose the orthonormality, and perhaps even the linear independence, of the columns of the original V in forming V̂ ∈ R^{(m−k)×k}, ensuring that the r_1 residual computed in line 29 of Algorithm 8 is large enough guarantees that the columns of V̂(:, t) are at least linearly independent.
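For readers who prefer running code, Algorithm 8 translates fairly directly into NumPy. The sketch below is illustrative rather than the thesis implementation: it uses 0-based indexing, the ℓ1 form of r2, and function names (`deim`, `e_deim`) of my own choosing.

```python
import numpy as np

def deim(V):
    """Standard DEIM point selection on the columns of V (m x k)."""
    m, k = V.shape
    p = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, k):
        c = np.linalg.solve(V[p, :j], V[p, j])    # lines 5-6 of Algorithm 8
        r = V[:, j] - V[:, :j] @ c                # line 7
        p.append(int(np.argmax(np.abs(r))))       # lines 8-9
    return np.array(p)

def e_deim(V, tol):
    """Sketch of Algorithm 8: standard DEIM, then a restarted pass over the
    unselected rows using the combined residual rhat = r1 .* r2."""
    m, k = V.shape
    p = deim(V)
    b = np.setdiff1d(np.arange(m), p)             # line 11: complement of p
    Vhat = V[b, :]                                # line 12
    # lines 13-16: l1 "memory" residual, normalized to [0, 1]
    d = np.abs(Vhat[:, None, :] - V[p][None, :, :]).sum(axis=2)
    r2 = d.min(axis=1)
    r2 = r2 / r2.max()
    rho, i = 0.0, 0                               # lines 17-18
    while rho <= tol and i < k:                   # lines 19-23
        phat1 = int(np.argmax(np.abs(Vhat[:, i] * r2)))
        rho = float(np.abs(Vhat[phat1, i] * r2[phat1]))
        i += 1
    if rho <= tol:                                # tolerance never exceeded
        return p
    phat, t = [phat1], [i - 1]                    # lines 24-25
    for j in range(i, k):                         # lines 26-36
        c = np.linalg.solve(Vhat[np.ix_(phat, t)], Vhat[phat, j])
        r1 = Vhat[:, j] - Vhat[:, t] @ c
        rhat = r1 * r2                            # line 30
        pj = int(np.argmax(np.abs(rhat)))
        if np.abs(rhat[pj]) > tol:                # lines 32-35
            phat.append(pj)
            t.append(j)
    return np.concatenate([p, b[np.array(phat)]])   # line 37
```

With an orthonormal V ∈ R^{m×k}, the returned index vector has length between k and 2k, and its first k entries coincide with standard DEIM.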
While having a more broadly spanning subset of rows and columns is desirable, we would also like to make sure that the newly selected subset is different enough from the first subset selected with standard DEIM. The incorporation of the "memory" residual held in r2 adds extra computational cost, but serves as a bridge between the standard and restarted DEIM steps. The computation of each entry in r2 (line 14 of Algorithm 8) is O(k²): computing the ℓ1-norm of the difference between the p_ℓ-th row of V and row η of V̂ is an O(k) operation, and performing this for all k rows of V(p, :) results in the O(k²) computational cost. Since the minimization is linear in k, we can focus on higher-order terms to approximate that performing this operation m − k times (lines 13-15 in Algorithm 8) makes the computation of r2 with the ℓ1-norm an O(mk² − k³) operation.

For the coherence form of r2, once the rows of V̂ and V(p, :) are normalized (with complexity O(mk) and O(k²), respectively), the matrix-matrix multiplication to form r2 is O(mk²). Then, including normalization, computing r2 with the notion of coherence is O(mk² + mk + k²).

Despite the added cost of both approaches to r2, the inclusion of this residual serves an important role in reducing the redundancy in row/column selection. In addition, the ability to use different distance measures in computing r2 allows for flexibility in analyzing different data types; where we have discussed the use of the ℓ1 distance, other measures of distance such as dynamic time warping (DTW) may be more appropriate for some data sets (though calculating these distances may increase computational complexity). In theory, requiring the product of r1 and the normalized r2 to be above a certain tolerance forces the newly selected rows to be simultaneously linearly independent among themselves and different from the previously selected set.
Hence, for the presented algorithm, we would expect to find that the additional rows/columns selected via extended DEIM contain a subset of data points even more broadly representative than those selected via standard DEIM alone.

With this understanding of our approach, we turn now to some of the more theoretical implications of extending DEIM.

5.3.2 Theoretical Implications

In extending DEIM to select k̂ as opposed to k indices, we also seek an extension of the related theoretical results. Where the results presented in [95] make use of the invertibility of P^T V in standard DEIM, as mentioned above, for our extended DEIM construction of P ∈ R^{m×k̂} we now only have that P^T V has full column rank. In the subsections that follow, we present the theoretical results for bounding the extended DEIM projection error and, for completeness, the implications of extending DEIM for use in the context of computing a CUR matrix factorization.

Extended DEIM Projection Error Bound

The proof of the extended DEIM projection error for 𝒫 defined as in Equation 5.1 closely follows that presented in [95].

Lemma 2. Let A be any matrix in R^{m×n}, and let V ∈ R^{m×k} be such that V^T V = I. For P = I(:, p) ∈ R^{m×k̂}, assume P^T V has full column rank. Let 𝒫 = V(P^T V)† P^T, where (P^T V)† = [(P^T V)^T (P^T V)]^{−1} (P^T V)^T. Then, with ‖·‖ denoting the induced matrix 2-norm,

\[
\|A - \mathcal{P}A\| \le \|(P^T V)^\dagger\| \, \|(I - VV^T)A\|. \qquad (5.5)
\]

Furthermore, suppose V contains the first k left singular vectors of A; that is, V satisfies the rank-k approximation A ≈ VSW^T, where V^T V = W^T W = I. Then

\[
\|A - \mathcal{P}A\| \le \|(P^T V)^\dagger\| \, \sigma_{k+1}. \qquad (5.6)
\]

Proof. Notice that

\[
\mathcal{P}V = V[(P^T V)^T (P^T V)]^{-1} (P^T V)^T P^T V = V,
\]

which implies (I − 𝒫)V = 0. Then

\[
\|A - \mathcal{P}A\| = \|(I - \mathcal{P})A\|
= \|(I - \mathcal{P})(I - VV^T)A\|
\le \|I - \mathcal{P}\| \, \|(I - VV^T)A\|.
\]

Using the fact that ‖I − 𝒫‖ = ‖𝒫‖ = ‖(P^T V)†‖ for 𝒫 ≠ I or 0, the first result, given by Equation 5.5, holds. (As suggested in [95], for proof that ‖I − 𝒫‖ = ‖𝒫‖, see [100].)
If V contains the first k left singular vectors of A, then

\[
\|A - \mathcal{P}A\| \le \|(P^T V)^\dagger\| \, \|(I - VV^T)A\|
= \|(P^T V)^\dagger\| \, \|A - VSW^T\|
= \|(P^T V)^\dagger\| \, \sigma_{k+1}.
\]

Hence, the second result, given by Equation 5.6, holds. □

Having proved Lemma 2 for the row-selecting projection 𝒫, we note that a similar result holds for the column-selecting projection 𝒬 = Q(W^T Q)† W^T, where (W^T Q)† = (W^T Q)^T [(W^T Q)(W^T Q)^T]^{−1}. Specifically, for W ∈ R^{n×k} containing the first k right singular vectors of A, and for Q = I(:, q) ∈ R^{n×k̂} such that W^T Q has full rank, we have the following projection error bound:

\[
\|A - A\mathcal{Q}\| \le \|(W^T Q)^\dagger\| \, \sigma_{k+1}.
\]

While the measure of accuracy may be different in the theoretical and class-detection settings, we include these more theoretical results for completeness, as extensions of DEIM may find use in other settings. With that in mind, we take a moment now to discuss the role of an extended DEIM algorithm in forming a CUR factorization and its corresponding approximation accuracy.

Application to the CUR Matrix Factorization

As mentioned above, one of the uses of DEIM demonstrated in the existing literature is in constructing a CUR matrix factorization of the matrix A. A CUR matrix factorization is such that A ≈ CUR with C = A(:, q) and R = A(p, :) for index vectors p, q ∈ R^k, and U defined such that the approximation holds; for example, we can set U = C†AR† for C† and R† equal to the left and right pseudoinverses of C and R, respectively. More details regarding the formation of such factorizations can be found in a number of papers such as [97], [32], [22], and [95], with our approach aligning closely with the DEIM-induced factorization described by Sorensen and Embree [95].
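Both the Lemma 2 bound with a rectangular selection matrix and the CUR construction with U = C†AR† can be sanity-checked numerically. The sketch below uses randomly chosen indices as stand-ins for (E-)DEIM output; the dimensions and index sets are hypothetical, chosen only to exercise the formulas.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k, khat = 40, 25, 5, 8
A = rng.standard_normal((m, n))
Usv, s, Wt = np.linalg.svd(A, full_matrices=False)
V = Usv[:, :k]                                # first k left singular vectors

# Lemma 2 with khat > k selected rows: ||A - PA|| <= ||(P^T V)^+|| sigma_{k+1}
p = rng.choice(m, size=khat, replace=False)   # hypothetical E-DEIM row indices
P = np.eye(m)[:, p]
PtV = P.T @ V                                 # khat x k, full column rank (generically)
proj = V @ np.linalg.pinv(PtV) @ P.T          # the projector V (P^T V)^+ P^T
lhs = np.linalg.norm(A - proj @ A, 2)
rhs = np.linalg.norm(np.linalg.pinv(PtV), 2) * s[k]   # s[k] is sigma_{k+1}
assert lhs <= rhs + 1e-10

# CUR with U = C^+ B R^+ on a rank-k matrix B: the factorization is exact
# when the chosen columns/rows span the column/row space of B.
B = Usv[:, :k] @ np.diag(s[:k]) @ Wt[:k, :]   # best rank-k part of A
q = rng.choice(n, size=khat, replace=False)   # khat column indices
r_idx = rng.choice(m, size=k, replace=False)  # k row indices
C, R = B[:, q], B[r_idx, :]
U = np.linalg.pinv(C) @ B @ np.linalg.pinv(R)
assert np.linalg.norm(B - C @ U @ R) < 1e-6 * np.linalg.norm(B)
```

Note that the pseudoinverse handles the case where C has more columns than its rank, which is exactly the situation that arises when k̂ > k columns are selected from a lower-rank matrix.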
Where standard DEIM has been used to form q and p in the past, here we use E-DEIM to identify q ∈ R^{k̂} and p ∈ R^{k̂}. Once again, in this context, using E-DEIM allows us to select more indices than allowed by the rank of the SVD; or, in the case where computing the SVD is prohibitively expensive, additional indices can be selected using only a low-rank SVD approximation, without computing the SVD approximation for larger k. Let

\[
Q = I(:, [\tilde{q}; \hat{q}]) = [\tilde{Q} \;\; \hat{Q}]
\quad \text{and} \quad
P = I(:, [\tilde{p}; \hat{p}]) = [\tilde{P} \;\; \hat{P}],
\]

where q̃ and p̃ are those indices selected via standard DEIM and q̂ and p̂ are the additional indices selected through the presented extension of DEIM.

Using the notation presented in Lemma 1, we use the fact that P̂_i^T V̂_{t_i} is invertible for each 1 ≤ i ≤ k̂ − k to conclude that P̂^T V̂ has full row rank. Then we can also conclude that our selection of an additional k̂ − k linearly independent rows from V yields a submatrix A(p(k+1 : k̂), :) that has full row rank. In a similar manner, we can select an additional set of linearly independent columns A(:, q(k+1 : k̂)).

For simplicity, and for its relevance to the observation subset selection, for now we consider only the case with q ∈ R^{k̂} and keep p ∈ R^k. Using extended DEIM to select columns and standard DEIM to select rows of the rank-r matrix A ∈ R^{m×n}, we can still obtain a CUR decomposition of the form A ≈ CUR, where we now have C ∈ R^{m×k̂}, U ∈ R^{k̂×k}, and R ∈ R^{k×n}. Suppose C has rank γ for k ≤ γ ≤ r. Without loss of generality, suppose that the first γ columns of C are linearly independent, with the first k columns selected via standard DEIM and the next (γ − k) columns selected via an extension, and suppose the remaining (k̂ − γ) columns lie in the span of the first γ columns. Then we can write

\[
A \approx CUR =
\begin{bmatrix} C_1 & C_2 \end{bmatrix}
\begin{bmatrix} U_1 \\ U_2 \end{bmatrix}
R, \qquad (5.7)
\]

where C_1 ∈ R^{m×γ} and C_2 ∈ R^{m×(k̂−γ)} both have full column rank but Range(C_2) ⊂ Range(C_1), and U_1 ∈ R^{γ×k} and U_2 ∈ R^{(k̂−γ)×k}.
If the E-DEIM-selected columns of C_2 are unknown, we can set

\[
U_1 = \begin{bmatrix} C_1(:, 1\!:\!k)^\dagger A R^\dagger \\ 0 \end{bmatrix}
\]

and U_2 = 0. This is the same structure of U that can be used for γ = k to obtain an approximation error bound for ‖A − CUR‖ = ‖A − C_1 U_1 R‖. Since C_1 and R simply contain the standard-DEIM-selected indices, the CUR error bound follows directly from Theorem 4.1 in [95]. Since C_1 = AQ(:, 1:k), this proof only requires use of the first k columns of Q, denoted as Q̃ above, with (W^T Q̃) invertible.

Suppose, however, that γ > k and the E-DEIM-selected columns are known, so that we may define U_1 = C_1† A R†, which is no longer a square matrix, and U_2 = 0. To prove an error bound for the corresponding CUR factorization, we will make use of the E-DEIM projection error that follows from Lemma 2 for the projection 𝒬_1 = Q_1 (W^T Q_1)† W^T, where Q_1 = I(:, q_1) is such that AQ_1 = C_1. Before proving this result, for completeness we first prove a slight generalization of Sorensen's and Embree's Lemma 4.2 [95], closely following their proof technique.

Lemma 3. Let A ≈ VSW^T be the rank-k SVD of A with k