Comparing Multi-Class, Binary and Hierarchical Machine Learning Classification Schemes for Variable Stars
MNRAS 000, 1–16 (2019)    Preprint 19 July 2019    Compiled using MNRAS LaTeX style file v3.0

Zafiirah Hosenie,1* Robert Lyon,1 Benjamin Stappers,1 Arrykrishna Mootoovaloo2
1 Jodrell Bank Centre for Astrophysics, School of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, UK.
2 Imperial Centre for Inference and Cosmology (ICIC), Imperial College, Blackett Laboratory, Prince Consort Road, London SW7 2AZ, UK.

Accepted 2019 July 15. Received 2019 July 15; in original form 2019 March 06

ABSTRACT
Upcoming synoptic surveys are set to generate an unprecedented amount of data. This requires an automatic framework that can quickly and efficiently provide classification labels for several new object classification challenges. Using data describing 11 types of variable stars from the Catalina Real-Time Transient Surveys (CRTS), we illustrate how to capture the most important information from computed features and describe detailed methods of how to robustly use Information Theory for feature selection and evaluation. We apply three Machine Learning (ML) algorithms and demonstrate how to optimize these classifiers via cross-validation techniques. For the CRTS dataset, we find that the Random Forest (RF) classifier performs best in terms of balanced accuracy and geometric means. We demonstrate substantially improved classification results by converting the multi-class problem into a binary classification task, achieving a balanced-accuracy rate of ∼99 per cent for the classification of δ-Scuti and Anomalous Cepheids (ACEP). Additionally, we describe how classification performance can be improved via converting a 'flat multi-class' problem into a hierarchical taxonomy.
We develop a new hierarchical structure and propose a new set of classification features, enabling the accurate identification of subtypes of Cepheids, RR Lyrae and eclipsing binary stars in CRTS data.

Key words: stars: variables - general – methods: data analysis – Astronomical instrumentation, methods, and techniques.

* E-mail: zafi[email protected]
arXiv:1907.08189v1 [astro-ph.IM] 18 Jul 2019
© 2019 The Authors

1 INTRODUCTION

Astronomy has experienced an increase in the volume, quality and complexity of datasets produced during numerical simulations and surveys. One factor that contributes to the data avalanche is the new generation of synoptic sky surveys, for example, the Catalina Real-Time Transient Surveys (CRTS) (Drake et al. 2017). In addition, the Large Synoptic Survey Telescope (LSST; Ivezic et al. 2008), which is now on the horizon, will produce ∼15 Terabytes of raw data per night (Juric et al. 2015). However, despite this data deluge, source variability is often still visually inspected to detect new promising candidates/variable stars. Visual inspection does have utility for detection and classification. Human experts can extract new useful information despite unevenly sampled data sets and also have the ability to distinguish noisy data from data exhibiting interesting behaviour/characteristics. They can also incorporate complex contextual information into their decision making. However, the efficacy of the manual approach decreases as the volume of data grows exponentially, as will be the case for the next generation of surveys. Visual inspection becomes inconsistent; consequently, mistakes are made, and rare or interesting objects can be missed.

To address this problem, Machine Learning (ML) has been applied to variable-star classification in multiple time-series datasets (see Belokurov et al. 2003; Willemsen & Eyer 2007). In ML, variable stars are represented by features: independent measures that contain information useful for differentiating variable stars into their respective classes. Therefore, several developments have been made towards determining the best methods and features for describing variable stars, including the Lomb-Scargle periodogram (Lomb 1976; Scargle 1982), Bayesian Evidence Estimation (Gregory & Loredo 1992) as well as hybrid methods (Saha & Vivas 2017). In addition, Eyer & Blake (2005) analysed the small sharp features of light curves and included them as input features to a Naïve Bayes classifier (Zhang 2004). Djorgovski et al. (2016) developed an automatic framework to detect and classify transient events and variable stars. They used a subset of the CRTS data to perform classification between two types of variable stars (W UMa and RR Lyrae) and obtained completeness rates of ∼96–97 per cent. Kim & Bailer-Jones (2016) developed the UPSILON package to classify periodic variable stars using 16 extracted features from light curves, which achieves good results. Mahabal et al. (2017) developed a classifier based on the Convolutional Neural Network (CNN) model using labelled datasets of periodic variables from the CRTS (Drake et al. 2009; Djorgovski et al. 2011; Mahabal et al. 2012; Djorgovski et al. 2016). They transformed a light curve (time series) into a two-dimensional mapping representation (dm − dt), which is based on the changes in magnitude (dm) over the time difference (dt). Using multi-class classification, their algorithm achieved an accuracy of ∼83 per cent. Narayan et al. (2018) developed an ML approach to classify variable versus transient stars. They also performed a multi-class classification of combined variable stars and transients, and a "purity-driven" sub-categorisation of the transient class using multi-band optical photometry. Revsbech et al. (2018) used a data augmentation technique to mitigate the effects of bias in their data by generating additional training data using Gaussian Processes (GPs). They used a diffusion map method that calculates a pair-wise distance matrix and outputs diffusion map coefficients of the light curves. These coefficients act as feature inputs to a Random Forest (RF) classifier used to help identify Type Ia supernovae.

Figure 1. Examples of folded light curves from the CRTS for the various types of variable stars considered in our analyses. Each panel plots Magnitude against Phase (0.0 to 2.0). Panels: RRab: J000031.7-412854; RRc: J000044.8-430758; RRd: J000956.0-242445; Blazhko: J001108.0-593330; Contact & Semi-Detached EB: J000025.8-393651; Detached EB: J000111.5-223745; Rotational: J000733.1-294903; LPV: J001234.2-225517; δ-Scuti: J012028.9-292610; ACEP: J003041.3-441620; Cep-II: J004302.8-533428.

We found that it is fundamentally important to develop accurate and robust automated classification methods for this problem using machine learning and other statistical approaches. This paper describes a new automatic classification pipeline for the classification of variable stars via application to archival data. To our knowledge, this is the first time the southern CRTS (Drake et al. 2017) data set has been used to build/evaluate an automatic classification system.

Similar work has been completed in recent years (Kim & Bailer-Jones 2016; Mahabal et al. 2017; Narayan et al. 2018), though the features used for learning are rarely evaluated in a statistically rigorous way. We found that using a large set of features does not imply higher classification metrics. We therefore perform an in-depth analysis of ML features to understand their information content, and determine which give rise to the best classification performance. We utilize various visualization techniques and the tools of Information Theory to achieve this.

Based on our analyses we find that accurate variable star classification is possible with just seven features, far fewer than in other works. In addition, we show that this classification problem cannot be solved with a 'flat' multi-class classification approach, as the data is inherently imbalanced. To partially alleviate the 'imbalanced learning problem' (Last et al. 2017), we developed an approach inspired by earlier work in this area (Richards et al. 2011). This involved converting a standard multi-class problem into a hierarchical classification problem, by aggregating sub-classes into super-classes. This results in improved performance on rare class examples typically misclassified by multi-class methods. We adopt a similar methodology to Richards et al. (2011); however, we i) propose a different hierarchical classification structure, ii) use a different feature analysis/selection methodology resulting in different feature choices, iii) apply hyper-parameter optimisation to build optimal classification models, and finally iv) apply the resulting approach to CRTS data.

The outline of this paper is as follows. In §2, we provide a brief description of the dataset used and in §3 we present the feature generation techniques we employ here; while in §4 we explain how we build the classification pipeline. In §5 we apply state-of-the-art feature visualisation techniques to visualise how separable our features are before performing a multi-class classification. In §6, we provide an in-depth feature evaluation to determine the usefulness of our extracted features before performing a binary classification. In §7, we present a hierarchical taxonomy for classification and discuss

Figure 2. Class distributions for the CSDR2 datasets. The bar chart shows the Number of Samples for each class: 1. RRab, 2. RRc, 3. RRd, 4. Blazhko, 5. Ecl, 6. EA, 7. Rotational, 8. LPV, 9. δ-Scuti, 10. ACEP, 12. Cep-II. We downsample Type 5: semi-detached binary stars to 4,509 samples to prevent larger classes from dominating the training sets. The ∼14,294 excluded Type 5: Ecl samples are then included in the test set for prediction.