Advanced Astroinformatics for Variable Star Classification by Kyle Burton Johnston
Advanced Astroinformatics for Variable Star Classification

by

Kyle Burton Johnston

Master of Space Sciences
Department of Physics and Space Sciences
College of Science, Florida Institute of Technology
2006

Bachelor of Astrophysics
Department of Physics and Space Sciences
College of Science, Florida Institute of Technology
2004

A dissertation submitted to Florida Institute of Technology in partial fulfillment of the requirements for the degree of Doctorate of Philosophy in Space Sciences

Melbourne, Florida
April, 2019

© Copyright 2019 Kyle Burton Johnston
All Rights Reserved

The author grants permission to make single copies.

We the undersigned committee hereby approve the attached dissertation.

Title: Advanced Astroinformatics for Variable Star Classification
Author: Kyle Burton Johnston

Saida Caballero-Nieves, Ph.D.
Assistant Professor, Aerospace, Physics and Space Sciences
Committee Chair

Adrian M. Peter, Ph.D.
Associate Professor, Computer Engineering and Sciences
Outside Committee Member

Véronique Petit, Ph.D.
Assistant Professor, Physics and Astronomy

Eric Perlman, Ph.D.
Professor, Aerospace, Physics and Space Sciences

Daniel Batcheldor, Ph.D.
Professor and Department Head, Aerospace, Physics and Space Sciences

ABSTRACT

Title: Advanced Astroinformatics for Variable Star Classification
Author: Kyle Burton Johnston
Major Advisor: Saida Caballero-Nieves, Ph.D.

This project outlines the complete development of a variable star classification algorithm methodology. With the advent of Big Data in astronomy, professional astronomers are left with the problem of how to manage large amounts of data, and how this deluge of information can be studied in order to improve our understanding of the universe.
While our focus is on the development of machine learning methodologies for the identification of variable star type based on light curve data and associated information, one of the goals of this work is the acknowledgment that the development of a true machine learning methodology must include not only a study of what goes into the service (features, optimization methods) but also a study of how we understand what comes out of the service (performance analysis). The complete development of a beginning-to-end system design strategy is presented as the following individual developments: simulation, training, feature extraction, detection, classification, and performance analysis. We propose that a complete machine learning strategy for use in the upcoming era of big data from the next generation of large telescopes, such as LSST, must consider this type of design integration.

Table of Contents

Abstract
List of Figures
List of Tables
Abbreviations
Acknowledgments
Dedication

1 Introduction
  1.1 Utility of Variable Star Automated Identification
  1.2 Proposal for an Advanced Astroinformatics Design
  1.3 Outline of the Effort

2 Variable Stars
  2.1 Eclipsing Binaries and Contact Binaries
    2.1.1 β Persei (EA, Detached Binaries)
    2.1.2 β Lyrae (EB, Semi-Detached Binary)
    2.1.3 W Ursae Majoris (EW, Contact Binary)
  2.2 Pulsating Variables
    2.2.1 Cepheid (I & II)
    2.2.2 RR Lyr (a, b, & c)
    2.2.3 Delta Scuti & SX Phe

3 Tools and Methods
  3.1 DSP with Irregularly Sampled Data
  3.2 Phasing, Folding, and Smoothing
  3.3 Feature Extraction
  3.4 Machine Learning and Astroinformatics
    3.4.1 Example Dataset for Demonstration of Classifiers
    3.4.2 Machine Learning Nomenclature
    3.4.3 k-NN Algorithm
    3.4.4 Parzen Window Classifier
    3.4.5 Decision Tree Algorithms
      3.4.5.1 Classification and Regression Tree
      3.4.5.2 Random Forest Classifiers
    3.4.6 Artificial Neural Network
      3.4.6.1 Multilayer Perceptron
    3.4.7 k-Means Clustering
    3.4.8 Principal Component Analysis (PCA)
    3.4.9 k-Means Representation
    3.4.10 Metric Learning
      3.4.10.1 Large-Margin Nearest Neighbor Metric Learning (LMNN)
    3.4.11 Metric Learning Improvements
    3.4.12 Multi-view Learning
      3.4.12.1 Large-Margin Multi-view Metric Learning

4 System Design and Performance of an Automated Classifier
  4.1 Related Work
  4.2 Data
    4.2.1 Sample Representation
    4.2.2 Feature Space
      4.2.2.1 Exploratory Data Analysis
    4.2.3 Effectiveness of Feature Space
  4.3 Supervised Classification Development
    4.3.1 Training and Testing
    4.3.2 Performance Analysis – Supervised Classification
      4.3.2.1 Broad Classes – Random Forest
      4.3.2.2 Broad Classes – Kernel SVM
      4.3.2.3 Broad Classes – k-NN
      4.3.2.4 Broad Classes – MLP
      4.3.2.5 Unique Classes
    4.3.3 Performance Analysis – Anomaly Detection
  4.4 Application of Supervised Classifier to LINEAR Dataset
    4.4.1 Analysis and Results
  4.5 Conclusions
    4.5.1 Future Research

5 Novel Feature Space Implementation
  5.1 Related Work
  5.2 Slotted Symbolic Markov Modeling
    5.2.1 Algorithm Design
    5.2.2 Slotting (Irregular Sampling)
    5.2.3 State Space Representation
    5.2.4 Transition Probability Matrix (Markov Matrix)
    5.2.5 Feature Space Reduction (ECVA)
  5.3 Implementation of Methodology
    5.3.1 Datasets
    5.3.2 Pattern Classification Algorithm
      5.3.2.1 k-NN
      5.3.2.2 PWC
      5.3.2.3 Random Forest Classifier
    5.3.3 Comparison to Standard Set (UCR)
      5.3.3.1 Analysis
      5.3.3.2 Discussion
    5.3.4 Application to New Set (LINEAR)
      5.3.4.1 Non-Variable Artificial Data
      5.3.4.2 Time Domain and Color Feature Space
      5.3.4.3 Analysis
      5.3.4.4 Anomaly Detection
      5.3.4.5 Discussion
  5.4 Conclusions
    5.4.1 Future Research

6 Detector for O'Connell Type EBs
  6.1 Eclipsing Binaries with O'Connell Effect
    6.1.1 Signatures and Theories
    6.1.2 Characterization of OEEB
  6.2 Variable Star Data
    6.2.1 Kepler Villanova Eclipsing Catalog
      6.2.1.1 Light Curve/Feature Space
      6.2.1.2 Training/Testing Data
    6.2.2 Untrained Data
  6.3 PMML Classification Algorithm
    6.3.1 Prior Research
    6.3.2 Distribution Field
    6.3.3 Metric Learning
    6.3.4 k-NN Classifier
  6.4 Results of the Classification
    6.4.1 Kepler Trained Data
    6.4.2 Kepler Untrained Data
      6.4.2.1 Design Considerations
      6.4.2.2 Results
      6.4.2.3 Subgroup Analysis
    6.4.3 LINEAR Catalog
  6.5 Discussion on the Methodology
    6.5.1 Comparative Studies
    6.5.2 Strength of the Tools
    6.5.3 Perspectives
  6.6 Conclusion

7 Multi-View Classification of Variable Stars Using Metric Learning
  7.1 Procedure
  7.2 Theory and Design
    7.2.1 Signal Conditioning
    7.2.2 Feature Extraction
      7.2.2.1 Slotted Symbolic Markov Models (SSMM)
      7.2.2.2 Distribution Field (DF)
    7.2.3 Classification and Metric Learning
  7.3 Challenges Addressed
    7.3.1 Hinge Loss Functionality
    7.3.2 Step-Size Optimization
    7.3.3 Vectorization and ECVA
      7.3.3.1 Multi-View Learning
    7.3.4 Large Margin Multi-Metric Learning with Matrix Variates
  7.4 Implementation
    7.4.1 Optimization of Features
    7.4.2 Feature Optimization
    7.4.3 Large Margin Multi-Metric Learning
      7.4.3.1 Testing and Results (UCR & LINEAR)
    7.4.4 Large Margin Multi-Metric Learning – MV
      7.4.4.1 Testing and Results (UCR & LINEAR)
    7.4.5 Comparison
  7.5 Conclusions

8 Conclusions
  8.1 Variable Star Analysis
  8.2 Future Development
    8.2.1 Standard Data Sets
    8.2.2 Simulation
  8.3 Results

A Chapter 4: Broad Class Performance Results
B Chapter 5: Optimization Analysis Figures
  B.1 Chapter 5: Performance Analysis Tables