The Pennsylvania State University
The Graduate School

CLASSIFICATION OF TRANSIENTS BY DISTANCE MEASURES

A Dissertation in Statistics
by
Sae Na Park

© 2015 Sae Na Park

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2015

The dissertation of Sae Na Park was reviewed and approved∗ by the following:

G. Jogesh Babu
Professor of Statistics
Dissertation Advisor
Chair of Committee

John Fricks
Associate Professor of Statistics

Matthew Reimherr
Assistant Professor of Statistics

Eric B. Ford
Professor of Astronomy

Aleksandra Slavkovic
Associate Professor of Statistics
Chair of Graduate Program

∗Signatures are on file in the Graduate School.

Abstract

Because the volume of data from astronomical surveys is growing rapidly, statistical methods that can automatically classify newly detected celestial objects accurately and efficiently have become essential. In this dissertation, we introduce two methodologies for classifying variable stars and transients using light curves, which are graphs of magnitude (a logarithmic measure of the brightness of a star) as a function of time. Our analysis focuses on characterizing light curves by magnitude changes over time increments and on developing a classifier from this information.

First, we present a classifier based on the difference between two distributions of magnitudes, estimated by statistical distance measures such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the Hellinger distance. We also propose a method that groups magnitudes and times by binning and uses the frequencies in each bin as the variables for classification. Along with these two methods, we present a way to incorporate into our classifiers other measures that have been used for classification of light curves. Finally, the proposed methods are demonstrated with real data and compared with past classification methods for variable stars and transients.
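As a rough illustration of the distance-based idea described in the abstract, the following Python sketch bins two sets of magnitude changes and compares the resulting distributions with the three distance measures named there. It is not the dissertation's R code (Appendix B). The function names, bin edges, pseudocount value, and sample numbers are all assumptions made for this example, and the Hellinger distance uses the common 1/sqrt(2) normalization, which may differ from the convention adopted in Chapter 2.

```python
import math


def histogram(values, edges, pseudocount=1.0):
    """Bin values and return smoothed bin probabilities.

    Values outside [edges[0], edges[-1]] are ignored. The pseudocount
    (an assumed value here) keeps every bin probability positive.
    """
    n_bins = len(edges) - 1
    counts = [pseudocount] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q); requires q > 0 where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hellinger(p, q):
    """Hellinger distance with 1/sqrt(2) normalization, so it lies in [0, 1]."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2)


# Made-up magnitude changes (dmag) for two hypothetical light curves.
dmag_a = [-0.1, 0.05, 0.0, 0.12, -0.08, 0.02]   # small fluctuations
dmag_b = [-1.5, 0.9, 1.8, -0.4, 1.1, -1.9]      # large fluctuations
edges = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # assumed bin edges

p_a = histogram(dmag_a, edges)
p_b = histogram(dmag_b, edges)

print("KL(a || b):", kl(p_a, p_b))
print("JS(a, b):  ", js(p_a, p_b))
print("Hellinger: ", hellinger(p_a, p_b))
```

A classifier in this spirit would compare the distribution of a new object with reference distributions for each class and assign the class with the smallest distance. The pseudocount matters for the Kullback-Leibler divergence in particular, because an empty bin in the second argument would otherwise make it undefined.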
Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction

Chapter 2 Literature Review, Background, and Statistical Methods
  2.1 Introduction to CRTS Data
  2.2 Classification of Variable Stars
    2.2.1 Periodic Feature Generation
    2.2.2 Non-periodic Features Generation
    2.2.3 Modeling Light Curves: Gaussian Process Regression
  2.3 Bayesian Decision Theory
  2.4 Kernel Density Estimation
  2.5 Classification Methods
    2.5.1 Linear Discriminant Analysis
    2.5.2 Classification Tree
    2.5.3 Random Forest
    2.5.4 Support Vector Machines
    2.5.5 Neural Networks
  2.6 Distance Measures
    2.6.1 Kullback-Leibler Divergence
      2.6.1.1 Definition and Properties
      2.6.1.2 Bayesian Estimate of Kullback-Leibler Divergence
    2.6.2 Jensen-Shannon Divergence
    2.6.3 Hellinger Distance

Chapter 3 Methods
  3.1 Introduction
  3.2 Classification by Distance Measures
    3.2.1 Kullback-Leibler Divergence with Kernel Density Estimation
    3.2.2 Kullback-Leibler Divergence with Binning
    3.2.3 Bayesian Estimate of Kullback-Leibler Divergence
    3.2.4 Incorporating Other Measures
  3.3 Classification by Binning
    3.3.1 Incorporating Other Measures
  3.4 Summary and Conclusions

Chapter 4 Simulation
  4.1 Data Generation
  4.2 Simulation Results
    4.2.1 Light Curves without Gaps
    4.2.2 Light Curves with Gaps

Chapter 5 Analysis
  5.1 Introduction
  5.2 Data Description
  5.3 Exploratory Analysis
  5.4 Classification by Distance Measures
    5.4.1 Settings
      5.4.1.1 Kullback-Leibler Divergence
      5.4.1.2 Bayesian Estimate of Kullback-Leibler Divergence
      5.4.1.3 Jensen-Shannon Divergence and Hellinger Distance
    5.4.2 Results
      5.4.2.1 Binary Classification
      5.4.2.2 Multi-class Classification
    5.4.3 Incorporating Other Measures
  5.5 Classification by Binning
    5.5.1 Incorporating Other Measures
  5.6 Application on Newly Collected Data Sets
  5.7 Summary

Chapter 6 Summary and Future Work
  6.1 Summary
  6.2 Future Work

Appendix A Details on Application to CRTS Data
  A.1 Choice of Pseudocounts
  A.2 Choice of Measures
    A.2.1 Measures for Section 5.4.3
    A.2.2 Measures for Section 5.6
  A.3 Contingency Tables
    A.3.1 Tables of Section 5.4 and 5.5
    A.3.2 Tables of Section 5.6

Appendix B R code
  B.1 Classification by Distance Measures
  B.2 Classification by Binning

Bibliography

List of Figures

1.1 Example of a light curve
2.1 Celestial coordinates: Right ascension and declination
2.2 Four images of CSS080118:112149-131310, taken minutes apart by CSS on March 30, 2008. This object was classified as a transient.
2.3 The light curve for Transient CSS080118:112149-131310. Measurements are presented with blue points with error bars. x-axis and y-axis represent time and magnitude respectively.
2.4 GPR curves for an AGN and a Supernova. Black dots are the observations, and the blue solid line and the red dashed line are the fitted curves with the median magnitude and the detection limit 20.5 as a mean function respectively.
3.1 Light curves of a supernova (left) and a non-transient (right)
4.1 Example of RR Lyrae light curve
4.2 Simulated RR Lyrae light curves with b = 0.2, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.3 Simulated RR Lyrae light curves with b = 0.4, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.4 Simulated RR Lyrae light curves with b = 0.6, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.5 Examples of Type I, Type II-P, and Type II-L supernovae light curves (Doggett and Branch, 1985)
4.6 Simulated Type I, Type II-P, and Type II-L (from top to bottom) supernovae light curves with a = (1, 3, 5) (from left to right)
5.1 Top: A boxplot of dmag for different types of transients (entire range). Bottom: A boxplot of dmag between -2 and 2.
5.2 Light curves of an AGN, a Flare, a Supernova, and a non-transient
5.3 (dt, dmag) density plots by kernel density estimation of an AGN, a Flare, a Supernova, and a non-transient. These are the same objects whose light curves are presented in Figure 5.2. The number of grid points used for evaluating densities is 50 on each side.
5.4 Left: The histogram of all dt values. Right: The histogram of dt values less than 100.
5.5 The histogram of dmag
5.6 Classification rates of KLD3 for different choices of α
5.7 Stepwise selection for choosing measures to include for the distance method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the measure specified on the x-axis is excluded from the set.
5.8 Stepwise selection for choosing measures to include for the binning method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the specified measure is excluded from the set.
6.1 Light curve of CV (CSS071216:110407045134) and its (dt, dmag) kernel density plot

List of Tables

2.1 Non-periodic features from Richards et al. (2011)
2.2 The features generated by modeling light curves in Faraway et al. (2014)
3.1 Contingency table for dt and dmag bins
4.1 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes
4.2 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods
4.3 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes
4.4 Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different types
4.5 Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes
4.6 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes when gaps are present in light curves
4.7 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods when gaps are present in light curves
4.8 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes when gaps are present in light curves
5.1 Number of light curves for each transient type and non-transient objects (Faraway et al., 2014)
5.2 Transients vs. non-variables classification by KLD2 with α = 0.6
5.3 Examples of a contingency table for SNe versus non-SNe classification
5.4 Completeness: Percentages correctly classified to each type of transients
5.5 Contamination: False alarm rates
5.6 Comparisons among Richards measure, Faraway measure, and our distance measure for completeness
5.7 Comparisons among Richards measure, Faraway measure, and our distance measure for contamination
5.8 Completeness for all-type classification