The Pennsylvania State University
The Graduate School

CLASSIFICATION OF TRANSIENTS BY DISTANCE MEASURES

A Dissertation in Statistics by Sae Na Park

© 2015 Sae Na Park

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2015

The dissertation of Sae Na Park was reviewed and approved∗ by the following:

G. Jogesh Babu Professor of Statistics Dissertation Advisor Chair of Committee

John Fricks Associate Professor of Statistics

Matthew Reimherr Assistant Professor of Statistics

Eric B. Ford Professor of Astronomy

Aleksandra Slavkovic Associate Professor of Statistics Chair of Graduate Program

∗Signatures are on file in the Graduate School.

Abstract

Due to the rapidly increasing size of data in astronomical surveys, statistical methods which can automatically classify newly detected celestial objects in an accurate and efficient way have become essential. In this dissertation, we introduce two methodologies to classify variable stars and transients by using light curves, which are graphs of magnitude (the logarithm of the brightness of a star) as a function of time. Our analysis focuses on characterizing light curves by magnitude changes over time increments and developing a classifier with this information. First, we present a classifier based on the difference between two distributions of magnitudes, estimated by measures such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the Hellinger distance. We also propose a method that groups magnitudes and times by binning and uses the frequencies in each bin as the variables for classification. Along with these two methods, we present a way to incorporate other measures that have been used for classification of light curves into our classifiers. Finally, the proposed methods are demonstrated with real data and compared with past classification methods for variable stars and transients.

Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1 Introduction

Chapter 2 Literature Review, Background, and Statistical Methods
    2.1 Introduction to CRTS Data
    2.2 Classification of Variable Stars
        2.2.1 Periodic Feature Generation
        2.2.2 Non-periodic Features Generation
        2.2.3 Modeling Light Curves: Gaussian Process Regression
    2.3 Bayesian Decision Theory
    2.4 Kernel Density Estimation
    2.5 Classification Methods
        2.5.1 Linear Discriminant Analysis
        2.5.2 Classification Tree
        2.5.3 Random Forest
        2.5.4 Support Vector Machines
        2.5.5 Neural Networks
    2.6 Distance Measures
        2.6.1 Kullback-Leibler Divergence
            2.6.1.1 Definition and Properties

            2.6.1.2 Bayesian Estimate of Kullback-Leibler Divergence
        2.6.2 Jensen-Shannon Divergence
        2.6.3 Hellinger Distance

Chapter 3 Methods
    3.1 Introduction
    3.2 Classification by Distance Measures
        3.2.1 Kullback-Leibler Divergence with Kernel Density Estimation
        3.2.2 Kullback-Leibler Divergence with Binning
        3.2.3 Bayesian Estimate of Kullback-Leibler Divergence
        3.2.4 Incorporating Other Measures
    3.3 Classification by Binning
        3.3.1 Incorporating Other Measures
    3.4 Summary and Conclusions

Chapter 4 Simulation
    4.1 Data Generation
    4.2 Simulation Results
        4.2.1 Light Curves without Gaps
        4.2.2 Light Curves with Gaps

Chapter 5 Analysis
    5.1 Introduction
    5.2 Data Description
    5.3 Exploratory Analysis
    5.4 Classification by Distance Measures
        5.4.1 Settings
            5.4.1.1 Kullback-Leibler Divergence
            5.4.1.2 Bayesian Estimate of Kullback-Leibler Divergence
            5.4.1.3 Jensen-Shannon Divergence and Hellinger Distance
        5.4.2 Results
            5.4.2.1 Binary Classification
            5.4.2.2 Multi-class Classification
        5.4.3 Incorporating Other Measures
    5.5 Classification by Binning
        5.5.1 Incorporating Other Measures
    5.6 Application on Newly Collected Data Sets

    5.7 Summary

Chapter 6 Summary and Future Work
    6.1 Summary
    6.2 Future Work

Appendix A Details on Application to CRTS Data
    A.1 Choice of Pseudocounts
    A.2 Choice of Measures
        A.2.1 Measures for Section 5.4.3
        A.2.2 Measures for Section 5.6
    A.3 Contingency Tables
        A.3.1 Tables of Section 5.4 and 5.5
        A.3.2 Tables of Section 5.6

Appendix B R code
    B.1 Classification by Distance Measures
    B.2 Classification by Binning

Bibliography

List of Figures

1.1 Example of a light curve

2.1 Celestial coordinates: Right ascension and declination
2.2 Four images of CSS080118:112149-131310, taken minutes apart by CSS on March 30, 2008. This object was classified as a transient.
2.3 The light curve for Transient CSS080118:112149-131310. Measurements are presented with blue points with error bars. The x-axis and y-axis represent time and magnitude respectively.
2.4 GPR curves for an AGN and a Supernova. Black dots are the observations, and the blue solid line and the red dashed line are the fitted curves with the median magnitude and the detection limit 20.5 as a mean function respectively.

3.1 Light curves of a supernova (left) and a non-transient (right)

4.1 Example of RR Lyrae light curve
4.2 Simulated RR Lyrae light curves with b = 0.2, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.3 Simulated RR Lyrae light curves with b = 0.4, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.4 Simulated RR Lyrae light curves with b = 0.6, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)
4.5 Examples of Type I, Type II-P, and Type II-L supernovae light curves (Doggett and Branch, 1985)
4.6 Simulated Type I, Type II-P, and Type II-L (from top to bottom) supernovae light curves with a = (1, 3, 5) (from left to right)

5.1 Top: A boxplot of dmag for different types of transients (entire range). Bottom: A boxplot of dmag between -2 and 2.
5.2 Light curves of an AGN, a Flare, a Supernova, and a non-transient

5.3 (dt, dmag) density plots by kernel density estimation of an AGN, a Flare, a Supernova, and a non-transient. These are the same objects whose light curves are presented in Figure 5.2. The number of grid points used for evaluating densities is 50 on each side.
5.4 Left: The histogram of the entire dt. Right: The histogram of dts which are less than 100.
5.5 The histogram of dmag
5.6 Classification rates of KLD3 for different choices of α
5.7 Stepwise selection for choosing measures to include for the distance method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the specified measure in the x-axis is excluded from the set.
5.8 Stepwise selection for choosing measures to include for the binning method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the specified measure is excluded from the set.

6.1 Light curve of CV (CSS071216:110407045134) and its (dt, dmag) kernel density plot

List of Tables

2.1 Non-periodic features from Richards et al. (2011)
2.2 The features generated by modeling light curves in Faraway et al. (2014)

3.1 Contingency table for dt and dmag bins

4.1 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes
4.2 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods
4.3 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes
4.4 Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different types
4.5 Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes
4.6 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes when gaps are present in light curves
4.7 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods when gaps are present in light curves
4.8 RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes when gaps are present in light curves

5.1 Number of light curves for each transient type and non-transient objects (Faraway et al., 2014)
5.2 Transients vs. non-variables classification by KLD2 with α = 0.6
5.3 Examples of a contingency table for SNe versus non-SNe classification

5.4 Completeness: Percentages correctly classified to each type of transients
5.5 Contamination: False alarm rates
5.6 Comparisons among Richards measure, Faraway measure, and our distance measure for completeness
5.7 Comparisons among Richards measure, Faraway measure, and our distance measure for contamination
5.8 Completeness for all-type classification
5.9 Completeness for transients vs. non-transients classification
5.10 Completeness for only-transient classification
5.11 Percentage correctly classified using the Richards measures only
5.12 Percentage correctly classified using the Faraway measures in addition to the Richards measures
5.13 Percentage correctly classified using our method
5.14 Completeness for all-type classification
5.15 Completeness for transients vs. non-transients classification
5.16 Completeness for only-transient classification
5.17 Percentage correctly classified using our method with measures
5.18 Percentage correctly classified using the binning method
5.19 Percentage correctly classified using the binning method with measures
5.20 Numbers of light curves for each transient type (Faraway et al., 2014)
5.21 Completeness: Percentages correctly classified to each type of transients
5.22 Contamination: False alarm rates
5.23 Comparisons among Richards measure, Faraway measure, and our distance measure for completeness
5.24 Comparisons among Richards measure, Faraway measure, and our distance measure for contamination
5.25 Percentage correctly classified using the Richards measures and Faraway measures
5.26 Percentage correctly classified using our distance measures
5.27 Percentage correctly classified using the binning method
5.28 Summary of classification results for Section 5.4 and 5.5
5.29 Summary of classification results for Section 5.6

A.1 A pseudocount α for Section 5.4.2: Classification with distance measures
A.2 A pseudocount α for Section 5.4.3: Classification with distance measures by incorporating Richards and Faraway measures

A.3 A pseudocount α for Section 5.6: Classification with distance measures tested on newly collected data set
A.4 Choice of measures obtained by stepwise procedure for all-type classification
A.5 Choice of measures obtained by stepwise procedure for transients vs. non-transients classification
A.6 Choice of measures obtained by stepwise procedure for only-transient classification
A.7 Choice of measures obtained by stepwise procedure for distance methods
A.8 Choice of measures obtained by stepwise procedure for grouping method
A.9 All types: Richards+Faraway Measures (Random Forest)
A.10 All types: Our Method (KLD2)
A.11 All types: Our Method (Binning)
A.12 All types: Our Method (KLD3 with measures)
A.13 All types: Our Method (Binning with measures)
A.14 Transients vs. non-variables: Richards+Faraway Measures (SVM)
A.15 Transients vs. non-variables: Our Method (KLD2)
A.16 Transients vs. non-variables: Our Method (Binning)
A.17 Transients vs. non-variables: Our Method (KLD2 with measures)
A.18 Transients vs. non-variables: Our Method (Binning with measures)
A.19 Transient only: Richards+Faraway Measures (SVM)
A.20 Transient only: Our Method (KLD1)
A.21 Transient only: Our Method (Binning)
A.22 Transient only: Our Method (KLD2 with measures)
A.23 Transient only: Our Method (Binning with measures)
A.24 Richards+Faraway Measures (Random Forest)
A.25 Our Method (JS2)
A.26 Our Method (Binning)
A.27 Our Method (JS2 with measures)
A.28 Our Method (Binning with measures)

Acknowledgments

First of all, I would like to express my sincere gratitude to my advisor, Dr. G. Jogesh Babu, for his advice, encouragement, and patience during the past three years. I am also very grateful to Dr. Ashish Mahabal of Caltech, who provided the CRTS data and helped me understand the data structure and astronomy-related issues. I would like to thank the committee members, Dr. John Fricks, Dr. Matthew Reimherr, and Dr. Eric B. Ford, for their valuable suggestions and comments on my dissertation. I must acknowledge my parents and my sister for their love, support, and prayers throughout my life. Without them, I could not have finished this dissertation. Finally, I thank God who has guided me and never gave up on me.

Chapter 1

Introduction

Classification of celestial objects has usually been done by astronomers' subjective evaluation, but due to the rapidly increasing size of data in astronomical surveys, statistical methods which can automatically classify newly detected sources have become essential. In this dissertation, we introduce statistical classifiers which can identify variable stars and transients in an efficient and accurate way.

There is a massive number of celestial objects in the sky, and the first step that needs to be taken before exploring them is to classify them correctly. Accuracy of a classifier is crucial because misclassifying objects means missing opportunities to study interesting objects. Among the many types of stars, we focus on building a classifier for variable stars and transients. A variable star is a star whose brightness as seen from Earth fluctuates, and if this fluctuation is extreme, the star is called a transient. Variable stars and transients play an important role in understanding astronomical objects and the universe. For example, Cepheid variables can be used to determine the size of the universe, and research on supernovae has suggested that the expansion of the universe is accelerating. By exploring variable stars and transients, we will have a better understanding of the universe.

Brightness changes in variable stars can be easily seen by plotting a light curve. A light curve is a graph of the variation in brightness of a celestial object as a function of time. Figure 1.1 shows an example of a light curve.

In this dissertation, we will propose two statistical methods which can be applied to light curves from any astronomical survey to identify different types of transients in an accurate and efficient way.

Figure 1.1. Example of a light curve

There have been many studies on classification of variable stars and transients using light curves: Debosscher et al. (2007) and Richards et al. (2011) provided statistics extracted from light curves which can be used for classification of variable stars, and Faraway et al. (2014) suggested modeling light curves and calculating statistics from the fitted curves instead of using the original light curves. Both approaches focus on designing statistics (measures) which characterize transients. However, these measures can be survey-dependent, which means that we cannot be sure of the accuracy of a classifier if we train it on one survey and predict the classes of objects from a new survey. Therefore, in this dissertation, we focus on developing statistical tools for classification of transients which have less survey dependence.

Our analysis is based on a simple but important fact about variable stars and transients: they are stars which change their brightness over time. Therefore, differences of magnitudes (the logarithmic measure of brightness) over time will give us useful information about variable stars. For example, if we have $n$ observations in a given light curve, $(t_1, mag_1), \ldots, (t_n, mag_n)$, where $t_i$ is time and $mag_i$ is magnitude, we can take $dt_{ij} = t_i - t_j$ and $dmag_{ij} = mag_i - mag_j$ where $1 \le j < i \le n$. We will use these dt and dmag values as our variables for classification.

In Chapter 2 of this dissertation, the background of the astronomical data, a literature review on automated discovery and classification of variable stars and transients, and the statistical methods used in our analysis are introduced. In Chapter 3, we present a classifier built by evaluating (dt, dmag) densities and using statistical distance measures, which estimate the distance between two probability distributions. Our classification method using distance measures is based on the assumption that densities evaluated with dt and dmag will look different for different types of transients. Among the many statistical distance measures, here we present classification by the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the Hellinger distance. Classification by grouping (dt, dmag) will also be presented in addition to the distance methods. Chapter 4 provides simulation studies for classification of transients by some of the distance measures: we generate light curves for two types of transients, apply our methods to them, and examine when each distance measure works well and when it does not. In Chapter 5, we apply our methods to two real astronomical data sets and also incorporate the classification measures of Richards et al. (2011) and Faraway et al. (2014) into ours; the results and comparisons with other methods are presented. Finally, Chapter 6 provides a summary of our work and discusses possible future research to improve classification of variable stars and transients.
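As a small illustration of the increments defined earlier in this chapter, the following R sketch (not code from the dissertation's Appendix B; the toy light curve values are invented) computes all $(dt_{ij}, dmag_{ij})$ pairs for a single light curve.

# Compute all pairwise increments dt_ij = t_i - t_j and dmag_ij = mag_i - mag_j, j < i
dt_dmag_pairs <- function(t, mag) {
  n   <- length(t)
  idx <- which(upper.tri(matrix(0, n, n)), arr.ind = TRUE)  # entries with row j < column i
  j <- idx[, "row"]; i <- idx[, "col"]
  data.frame(dt = t[i] - t[j], dmag = mag[i] - mag[j])
}

# Toy light curve (times in days, magnitudes): gives n(n-1)/2 = 6 increment pairs
lc    <- data.frame(t = c(0, 3, 10, 45), mag = c(19.2, 18.7, 18.9, 19.4))
pairs <- dt_dmag_pairs(lc$t, lc$mag)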

Chapter 2

Literature Review, Background, and Statistical Methods

This chapter provides the review of literature that has guided our analysis. First, the studies which are closely related to our analysis, discovery and classification of variable stars and transients, are presented along with a description of astronomical data and statistical backgrounds for them. Then, some of the statistical methods used in our analysis, such as distance measures, general classification methods, kernel density estimation, and Bayesian decision theory are presented.

2.1 Introduction to CRTS Data

Catalina Real-time Transient Survey (CRTS) is a synoptic astronomical exploration which focuses on discovering and analyzing transients (Djorgovski et al., 2011). The survey covers thirty-three thousand square degrees of the sky by using data streams from three telescopes, CSS, SSS, and MLS, and it provides the position, brightness, and observed time of each object. The position of an object in the sky is given by its right ascension (RA) and declination (Dec). Declination is the angular distance perpendicular to the celestial equator, and right ascension is the angular distance along the celestial equator. See Figure 2.1 for an illustration. Declination increases as it goes north from the celestial equator, and right ascension increases as it goes east from the vernal equinox, where the Sun crosses the celestial equator as it moves north.

(Figure 2.1 is from http://en.wikipedia.org/wiki/Right_ascension.)

Figure 2.1. Celestial coordinates: Right ascension and declination

Declination is expressed in degrees:arcminutes:arcseconds or decimal degrees, and right ascension can be expressed in hours:minutes:seconds or decimal degrees. Note that 24 hours are 360 degrees, and therefore 1 hour is 15 degrees, and 1 degree is 60 arcminutes, which are 3600 arcseconds. The telescope takes four images of the same location, separated in time by ∼10 min. The next set can be taken the next night, a week later, or even a month later (Faraway et al., 2014). An example of a set of four CSS images taken at the same location is given in Figure 2.2. We have these kinds of data sets up to a time span of ∼6 years, and they will allow us to detect variable stars and transients by looking at changes in brightness over time. The brightness change in stars can be easily seen in a light curve, which is a graph of the light intensity of a celestial object as a function of time. It can be either periodic or non-periodic. The light curve corresponding to the object in Figure 2.2 (Transient CSS080118:112149-131310) is given in Figure 2.3. Magnitude, which is the logarithmic measure of the brightness, is presented in reverse order on the y-axis of the plot because a smaller magnitude indicates greater brightness.

(The images in Figure 2.2 are from http://nesssi.cacr.caltech.edu/catalina/Allns.html.)

For the CRTS data, time is in the form of the Modified Julian Date (MJD). The Julian date (JD) is obtained by counting the days from Julian day 0, which began at noon on January 1, 4713 B.C. The MJD is calculated as JD − 2400000.5. The first observation of the object in Figure 2.3 was detected on MJD 53249, which is September 1, 2004, and this first date was subtracted from all the dates so that the time range of the data can be seen easily.

Figure 2.2. Four images of CSS080118:112149-131310, taken minutes apart by CSS on March 30, 2008. This object was classified as a transient.

Figure 2.3. The light curve for Transient CSS080118:112149-131310. Measurements are presented with blue points with error bars. x-axis and y-axis represent time and magnitude respectively.

2.2 Classification of Variable Stars

There are many different ways of classifying variable stars and transients; the one we focus on in this dissertation uses light curves. There are many ongoing studies on classification using light curves, and this section reviews some of them, describing how to extract valuable features from light curves and use them for classification.

2.2.1 Periodic Feature Generation

Debosscher et al. (2007) proposed a supervised classification method for more than 20 classes of variable stars by using statistical features extracted from light curves. They provided the periodic features which were obtained by applying the Lomb-Scargle periodogram (Lomb, 1976; Barning, 1963; Scargle, 1982). In spectral analysis, a time series can be expressed as a sum of cosine or sine functions with different amplitudes and frequencies. A periodogram is used to find the important periods or frequencies in such time series data. For an arbitrary data set $\{X(t_j),\ j = 1, 2, \ldots, N_0\}$, where $X$ is pure noise which is independently and normally distributed with zero mean, the periodogram is conventionally defined as follows:

$$P_X(\omega) = \frac{1}{N_0}\left[\left(\sum_j X_j \cos \omega t_j\right)^2 + \left(\sum_j X_j \sin \omega t_j\right)^2\right].$$

A larger $P_X(\omega)$ indicates that the frequency $\omega$ is more important. Scargle (1982) suggested the following modified periodogram for unevenly spaced time series data to obtain a better estimate of the power at a frequency.

 2 2  hP i hP i 1 j Xj cos ω(tj − τ) j Xj sin ω(tj − τ) P (ω) =  +  X  P 2 P 2  2 j cos ω(tj − τ) j sin ω(tj − τ) where τ is a time delay which is defined as: P j sin 2ωtj tan 2ωτ = P . j cos 2ωtj 8

Debosscher et al. (2007) apply this Lomb-Scargle periodogram to extract the periodic features from light curves. First, a linear trend is checked by fitting $a + bT$ to each light curve, where $a$ is the intercept, $b$ is the slope, and $T$ is the time; this fitted value is subtracted from the data to prevent the linear trend from affecting the estimation of frequencies. Then the Lomb-Scargle periodogram is calculated with the starting frequency $f_0 = 1/T_{tot}$, a frequency step $\Delta f = 0.1/T_{tot}$, and the highest frequency $f_N = 0.5\,\langle 1/\Delta T\rangle$, where $T_{tot}$ is the total time span of the observations and $\langle\cdot\rangle$ denotes an average. After the highest peak $f_1$ is found from the Lomb-Scargle periodogram, a harmonic sum of sines and cosines is fitted to the light curve together with a constant term:

$$y(t) = \sum_{j=1}^{4}\left(a_j \sin 2\pi f_1 j t + b_j \cos 2\pi f_1 j t\right) + b_0.$$

This curve is then subtracted from the data and the Lomb-Scargle periodogram is recalculated to find a frequency. This procedure is repeated three times to get three frequencies and finally light curves are fitted with the following equation:

$$y(t) = \sum_{i=1}^{3}\sum_{j=1}^{4}\left(a_{ij} \sin 2\pi f_i j t + b_{ij} \cos 2\pi f_i j t\right) + b_0. \qquad (2.1)$$

Based on the estimated coefficients of Equation (2.1), a set of amplitudes $A_{ij}$ and phases $PH_{ij}$ which can be used for classification of light curves are defined as follows:

$$A_{ij} = \sqrt{a_{ij}^2 + b_{ij}^2}$$

$$PH_{ij} = \tan^{-1}(b_{ij}, a_{ij})$$

These estimates have different values for different light curves and also do not depend on time T .
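A short R sketch of the machinery above (not the dissertation's own code): the Scargle (1982) periodogram is evaluated on a trial frequency grid and the highest peak gives $f_1$. The frequency grid, the variable names t and x (detrended magnitudes), and the value fN are assumptions for illustration.

# Scargle (1982) periodogram for an unevenly sampled, detrended series x at times t
scargle_periodogram <- function(t, x, freq) {
  sapply(freq, function(f) {
    w   <- 2 * pi * f
    tau <- atan2(sum(sin(2 * w * t)), sum(cos(2 * w * t))) / (2 * w)  # time delay
    ct  <- cos(w * (t - tau)); st <- sin(w * (t - tau))
    0.5 * (sum(x * ct)^2 / sum(ct^2) + sum(x * st)^2 / sum(st^2))
  })
}

# Frequency grid in the spirit of Debosscher et al. (2007): f0 = 1/Ttot, step 0.1/Ttot
# Ttot <- diff(range(t)); freq <- seq(1 / Ttot, fN, by = 0.1 / Ttot)
# f1   <- freq[which.max(scargle_periodogram(t, x, freq))]           # highest peak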

2.2.2 Non-periodic Features Generation

The features introduced by Debosscher et al. (2007) can characterize the periodicity of light curves well, but light curves are not always solely characterized by periodicity. In order to model the parts of light curves which cannot be explained by periodic features, Richards et al. (2011) suggested using non-periodic features in addition to periodic ones for classification of variable stars and transients. They defined 20 non-periodic features. Some of the non-periodic features extracted from light curves and their descriptions are given in Table 2.1. After the features were calculated from light curves, classification of 25 classes of variable stars was performed with the same data used in Debosscher et al. (2007) to compare the results. Classification was done with classification and regression trees (CART), random forests, and boosted trees. Among these, random forests achieved the lowest misclassification rate of 22.8% with 10-fold cross validation. Classification results were improved when they used both periodic and non-periodic features compared to periodic features only.

2.2.3 Modeling Light Curves: Gaussian Process Regression

Although the features presented in Richards et al. (2011) work well for classification of variable stars, none of the features involves fitting a curve to the data. Faraway et al. (2014) introduced how to model light curves and generate additional features from the fitted curves, which can lead to improved classification of variable stars and transients. Most light curves have gaps (periods when observations are not detected) in the data. This can be because we cannot see the objects due to the orbit of Earth, bad weather, or the magnitude of the objects being less than the detection limit, which is around 20.5. If one can predict how stars behave during the missing times, it will give us a better understanding of their characteristics and therefore better classification results. Even though we do not have information on how stars behave during missing times, it can be interpolated by fitting a curve to the light curves. In Faraway et al. (2014), Gaussian process regression (GPR) was used to get a continuous curve throughout the entire period and fill the gaps between the data to improve classification. Although there are several interpolation methods for estimating a continuous curve $f_i(t)$ for each light curve $i$, Gaussian process regression was chosen for several reasons: (i) It can incorporate information about censored data.

(These are from Table 5 in Richards et al. (2011).)

amplitude: Half the difference between the maximum and the minimum magnitude
beyond1std: Percentage of points beyond one st. dev. from the weighted mean
fpr20: Ratio of flux percentiles (60th - 40th) over (95th - 5th)
fpr35: Ratio of flux percentiles (67.5th - 32.5th) over (95th - 5th)
fpr50: Ratio of flux percentiles (75th - 25th) over (95th - 5th)
fpr80: Ratio of flux percentiles (90th - 10th) over (95th - 5th)
max_slope: Maximum absolute flux slope between two consecutive observations
mad: Median discrepancy of the fluxes from the median flux
medbuf: Percentage of fluxes within 20% of the amplitude from the median
pairslope: Percentage of all pairs of consecutive flux measurements that have positive slope
peramp: Largest percentage difference between either the max or min magnitude and the median
pdfp: Differences between the 2nd & 98th flux percentiles, converted to magnitude
skew: Skew of the fluxes
kurtosis: Kurtosis of the fluxes, reliable down to a small number of epochs
std: Standard deviation of the fluxes
rcorbor: Fraction of observations greater than 1.5 more than the median

Table 2.1. Non-periodic features from Richards et al. (2011)
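As an illustration of how such features are computed, here is a minimal R sketch of a few of them (assumed simplifications: an unweighted mean is used for beyond1std, and magnitudes stand in for fluxes); it is not the implementation used by Richards et al. (2011).

amplitude  <- function(mag) (max(mag) - min(mag)) / 2            # half the max-min range
beyond1std <- function(mag) 100 * mean(abs(mag - mean(mag)) > sd(mag))
fpr20      <- function(mag) diff(quantile(mag, c(0.40, 0.60))) /
                            diff(quantile(mag, c(0.05, 0.95)))    # flux percentile ratio
skew       <- function(mag) mean((mag - mean(mag))^3) / sd(mag)^3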

Magnitudes for some of the observations in light curves can fall below the detection limit, but standard methods cannot fit sensible curves for these missing periods. (ii) The CRTS survey provides the measurement errors for observations, but it is hard to find other methods which can incorporate this information. For each light curve, a true underlying curve $f(x)$ is estimated with a Gaussian process over grid points of $x$. By definition, a Gaussian process is a collection of random variables, any finite number of which have a joint

Gaussian distribution (Rasmussen and Williams, 2006). It is specified by its mean function $\psi(x)$ and covariance function $\kappa(x, x')$, i.e., $\psi(x) = E[f(x)]$ and $\kappa(x, x') = \mathrm{Cov}[f(x), f(x')]$. GPR requires priors for several components: the signal variance $\sigma_f^2$, the noise variance $\sigma_n^2$, the length-scale $l$, and the mean function $\psi(x)$. $\sigma_f^2$, $\sigma_n^2$, and $l$ are used to define the squared-exponential covariance function $\kappa(x, x')$. GPR with the mean function $\psi(x)$ and the covariance function $\kappa(x, x')$ can be written as $f(x) \sim GP(\psi(x), \kappa(x, x'))$, where $\kappa(x, x')$ is defined as follows:

$$\kappa(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2l^2}(x - x')^2\right) + \sigma_n^2\,\delta(x - x').$$

$\delta(x)$ is the function which is 1 when $x = 0$ and 0 otherwise. The priors for $\sigma_f^2$, $\sigma_n^2$, and $l$ were chosen based on the light curve data they used: the signal variance $\sigma_f^2$ was set to the median observed variance of the non-transients, since most observations that will be detected in the future will be non-transients. For the noise variance $\sigma_n^2$, the mean value of the measurement error was used, and the length-scale $l$, which controls the smoothness of the curve, was set to 140 days. Choosing a mean function $\psi(x)$ can be subjective, but the median magnitude of the object was used in this case, since one can expect the missing observations to have about the same magnitude. However, for some of the transient classes, using the median magnitude as a mean function would not make much sense. See Figure 2.4 for an example. In this Supernova light curve, the object was not detected for a long time, and if the magnitudes during the missing times were similar to the others, as assumed, the object would have already been seen. Therefore, in this case the detection limit 20.5 would be more appropriate than the median value. To combine these two cases, they used the detection limit as the mean function when there was less than one year of observations, and the median magnitude otherwise. The fitted GPR curves for two different classes of transients, an AGN and a Supernova, are given in Figure 2.4. Dots are the original data, and the blue solid line and the red dashed line are fitted curves using the median magnitude and the detection limit 20.5 as a mean function respectively. We can see that the solid line fits the data well for the AGN, but the dashed line reflects the variability of the object better for the Supernova case.

Figure 2.4. GPR curves for an AGN and a Supernova. Black dots are the observations, and the blue solid line and the red dashed line are the fitted curves with the median magnitude and the detection limit 20.5 as a mean function respectively.

Now, we can obtain the estimated curve (the posterior mean) $\hat{f}_i$ for a given light curve $i$ on evenly spaced grid points. See Rasmussen and Williams (2006) for details. From $\hat{f}_i$, 11 new features were created by Faraway et al. (2014). Some of these features are given in Table 2.2. Classification methods such as linear discriminant analysis (LDA), classification trees, support vector machines (SVM), neural networks, and random forests were applied to these features in addition to the ones created by Richards et al. (2011). The results in Faraway et al. (2014) show that the misclassification error rates decreased when the new measures were added. In a later chapter, we will compare the classification results of our methods with the measures by Richards et al. (2011) and Faraway et al. (2014) by applying them to the same data set.
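The following R sketch (an illustration under stated assumptions, not the Faraway et al. (2014) implementation) computes the GPR posterior mean on a grid using the squared-exponential covariance above; psi0, sigma_f2, sigma_n2, and l stand for the prior choices described in the text and are placeholders here.

gpr_posterior_mean <- function(t_obs, y_obs, t_grid, psi0, sigma_f2, sigma_n2, l) {
  se_cov <- function(a, b) sigma_f2 * exp(-0.5 * outer(a, b, "-")^2 / l^2)
  K  <- se_cov(t_obs, t_obs) + sigma_n2 * diag(length(t_obs))   # K(X, X) plus noise
  Ks <- se_cov(t_grid, t_obs)                                   # K(X*, X)
  drop(psi0 + Ks %*% solve(K, y_obs - psi0))                    # posterior mean on the grid
}

# Example call with the prior choices described above (numeric values are placeholders):
# fhat <- gpr_posterior_mean(lc$t, lc$mag, seq(min(lc$t), max(lc$t), length.out = 200),
#                            psi0 = median(lc$mag), sigma_f2 = 0.04,
#                            sigma_n2 = 0.01, l = 140)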

2.3 Bayesian Decision Theory

In this section, we review Bayesian decision theory, which is one of the statistical approaches that can be used for classification when a priori information is given.

Following the definitions in Hastie et al. (2001), let $\{w_1, \ldots, w_c\}$ be the set of $c$ states of nature. Then the probability of the true state of nature being $w_j$ is $P(w_j)$, and we call it a prior probability.

totvar: total variation, $\sum_j |\hat f_i(u_{i,j+1}) - \hat f_i(u_{ij})|/m$
quadvar: quadratic variation, $\sum_j (\hat f_i(u_{i,j+1}) - \hat f_i(u_{ij}))^2/m$
famp: amplitude of the fitted function, $\max_t \hat f_i - \min_t \hat f_i$
fslope: maximum derivative in the fitted curve, $\max_t |\hat f_i'|$
lsd: the log of the standard deviation, $\breve\sigma$, computed using the residuals from these group mean fits
gtvar: the group total variation, $\sum_j |\breve f(t_{i,j+1}) - \breve f(t_{ij})|/n_i$
gscore: $\sum_j \phi\big((\breve f_{ij} - \bar f_i)/\breve\sigma\big)/n_i$, where $\phi$ is the standard normal density and $\bar f$ is the mean of the fitted group means
shov: mean of absolute differences of successive observed values, $\sum_j |y_{i,j+1} - y_{ij}|/n_i$
maxdiff: the maximum difference of successive observed values, $\max_j |y_{i,j+1} - y_{ij}|$
dscore: the density score, $\sum_j \phi\big((y_{ij} - \tilde f_i)/s_{ij}\big)/n_i$, where $\tilde f_i$ is the median observed magnitude for curve $i$ and $s_{ij}$ is the observed measurement error at $t_{ij}$

Table 2.2. The features generated by modeling light curves in Faraway et al. (2014)

If we have a $d$-dimensional feature vector $x$, the conditional probability density function for $x$ given the true state of nature $w_j$ is $p(x|w_j)$, and it is called a likelihood. What we are interested in is the probability that the true state of nature is $w_j$ given that $x$ is observed. This can be calculated from Bayes' formula:

$$P(w_j|x) = \frac{p(x|w_j)P(w_j)}{p(x)} \qquad (2.2)$$

where $p(x) = \sum_{j=1}^{c} p(x|w_j)P(w_j)$. Equation (2.2) is called a posterior probability.

Now let $\{\alpha_1, \ldots, \alpha_c\}$ be the set of possible actions, i.e., the decisions we can make when the true state of nature is $w_j$. Then we define the loss function $\lambda(\alpha_i|w_j)$, the loss for taking action $\alpha_i$ when the true state of nature is $w_j$. For the action $\alpha_i$ and the true state of nature $w_j$, we have the correct decision for $i = j$ and the incorrect one for $i \neq j$. In this case, the zero-one loss function can be used as follows:

$$\lambda(\alpha_i|w_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases} \qquad \text{where } i, j = 1, \ldots, c.$$

Conditioned on the feature vector x and for the true state of nature wj, the expected loss by taking action αi is:

$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|w_j)P(w_j|x) = \sum_{j \neq i} P(w_j|x) = 1 - P(w_i|x). \qquad (2.3)$$

This is called the conditional risk. When $x$ is given, the optimal action is the one which minimizes the conditional risk $R(\alpha_i|x)$. Therefore, by Equation (2.3), this is equivalent to selecting the $i$ which maximizes the posterior probability $P(w_i|x)$. The decision rule can be expressed as follows:

Decide $w_i$ if $P(w_i|x) > P(w_j|x)$ for all $j \neq i$.
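A toy R sketch of this rule (not from the dissertation; the numeric priors and likelihoods are invented): the posterior in Equation (2.2) is computed and the class with the largest posterior, equivalently the smallest conditional risk under zero-one loss, is chosen.

bayes_decide <- function(likelihoods, priors) {  # likelihoods: p(x | w_j) for a fixed x
  post <- likelihoods * priors
  post <- post / sum(post)                       # P(w_j | x), Equation (2.2)
  which.max(post)                                # decide the w_i with the largest posterior
}

bayes_decide(likelihoods = c(0.05, 0.20, 0.01), priors = c(0.7, 0.2, 0.1))  # returns 2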

2.4 Kernel Density Estimation

Kernel density estimation is a nonparametric way to estimate the probability density function. Information on nonparametric estimation can be found in many books, including Fan and Gijbels (1996), Wand and Jones (1995), Hastie and Tibshirani (1990), and Hart (1997). One example of a widely used nonparametric density estimate is the histogram. In order to construct the histogram, we first determine the number of bins and calculate the heights by

$$\hat g(x, h) = \frac{\#\{\text{observations in the bin containing } x\}}{nh}$$

where $x$ is the observation, $n$ is the number of observations, and $h$ is the width of the bins. Then, we place a block on each bin with height $\hat g(x, h)$. However, this estimate is not smooth and depends strongly on the location and the width of the bins. If we use a kernel density estimate, we can get a smooth curve compared to the histogram.

Suppose we have an independent and identically distributed sample $\{x_1, \ldots, x_n\}$ from some distribution with an unknown density $g(x)$. Then a kernel density estimate is defined as

$$\hat g_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $K(\cdot)$ is a kernel function satisfying $\int K(x)\,dx = 1$ and $h$ is a smoothing parameter called a bandwidth. $\hat g_h(x)$ is evaluated over grid points $x$. Some of the commonly used kernel functions are as follows:

Gaussian kernel: $K(t) = \frac{1}{\sqrt{2\pi}} \exp(-t^2/2)$
Uniform kernel: $K(t) = \frac{1}{2}\, I(|t| \le 1)$
Epanechnikov kernel: $K(t) = \frac{3}{4}(1 - t^2)\, I(|t| \le 1)$

For kernel density estimation, the estimate $\hat g_h(x)$ is not very sensitive to the choice of $K(\cdot)$ (Marron and Nolan, 1988), but the choice of $h$ is critical. Depending on the choice of $h$, we can have very different estimates. There is a significant literature on this issue (Jones et al., 1996), but here we present the rule of thumb for bandwidth selection for each kernel (Silverman, 1986):

Gaussian kernel: $\hat h_I = 1.06 \, s_n \, n^{-0.2}$
Uniform kernel: $\hat h_I = 2.30 \, s_n \, n^{-0.2}$
Epanechnikov kernel: $\hat h_I = 2.92 \, s_n \, n^{-0.2}$

where $s_n$ is the sample standard deviation. In order to avoid choosing a single bandwidth which may oversmooth estimates, the family smoothing approach can be considered. This explores smoothing patterns for different bandwidths by using the following:

$$\{\hat f_h(x),\ h = 1.4^{j}\,\hat h_I,\ j = -3, -2, -1, 0, 1, 2\}$$
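For concreteness, a from-scratch R sketch of the Gaussian kernel density estimate with the rule-of-thumb bandwidth above (base R's density() offers the same functionality; the simulated sample is only an example).

kde_gauss <- function(x, grid, h = 1.06 * sd(x) * length(x)^(-0.2)) {
  # ghat_h(g) = (1/(n*h)) * sum_i K((g - x_i)/h) with a Gaussian kernel K
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}

x    <- rnorm(200)                        # example sample
grid <- seq(-4, 4, length.out = 101)
dens <- kde_gauss(x, grid)                # density estimate evaluated on the grid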

2.5 Classification Methods

In this section, we review five classification methods which are commonly used in statistics: linear discriminant analysis, classification trees, random forests, support vector machines, and neural networks. We provide the definitions and briefly explain how these classification methods work.

2.5.1 Linear Discriminant Analysis

Following the notation given in Hastie et al. (2001), let X be input vectors, G be categorical outputs and K be the number of classes. Suppose πk is a prior probability of class k and hk(x) is the class-conditional density of X in class G = k. Then by Bayes theorem, the probability that the class will be k given that observation X = x is:

$$Pr(G = k|X = x) = \frac{h_k(x)\pi_k}{\sum_{l=1}^{K} h_l(x)\pi_l}.$$

Then the optimal classification rule is

$$\hat G(x) = \operatorname*{argmax}_{k} Pr(G = k|X = x) = \operatorname*{argmax}_{k}\ \delta_k(x)$$

where $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k)$, assuming that the classes have multivariate Gaussian densities and the same covariance matrix $\Sigma_k = \Sigma$ for all $k$. $\delta_k(x)$ is called the linear discriminant function, and the decision boundary between two classes $k$ and $l$ can be expressed as $\{x : \delta_k(x) = \delta_l(x)\}$. Therefore, the linear discriminant analysis decision rule classifies the object to class $l$ if

$$x^T\hat\Sigma^{-1}(\hat\mu_l - \hat\mu_k) > \frac{1}{2}\hat\mu_l^T\hat\Sigma^{-1}\hat\mu_l - \frac{1}{2}\hat\mu_k^T\hat\Sigma^{-1}\hat\mu_k + \log(N_k/N) - \log(N_l/N)$$

where $\hat\pi_k = N_k/N$, $N_k$ is the number of class-$k$ observations, $\hat\mu_k = \sum_{g_i = k} x_i/N_k$, and $\hat\Sigma = \sum_{k=1}^{K}\sum_{g_i = k}(x_i - \hat\mu_k)(x_i - \hat\mu_k)^T/(N - K)$.

2.5.2 Classification Tree

The classification tree is a simple yet powerful classification method. Let us assume that we have data points (xi, yi) for i = 1, 2, ..., N where xi are inputs and yi are categorical variables taking values of 1, 2, ..., K. The procedures for constructing a classification tree include the selection of the splits, when to terminate a node or continue splitting, and assigning each terminal node to a class (Hastie et al., 2001). Starting with all data in one group, a tree divides the feature space into a series of recursive binary splits. The choice of the variable and the splitting point is made by minimizing the impurity at each node. Possible impurity functions are (James et al., 2013):

Misclassification rate: $1 - \max_j p_j$
Gini index: $\sum_{j=1}^{K} p_j(1 - p_j) = 1 - \sum_{j=1}^{K} p_j^2$
Entropy: $\sum_{j=1}^{K} p_j \log \frac{1}{p_j}$

where $p_j$ is the proportion of objects in the $j$th class.

Now, let m be a terminal node representing a region Rm with Nm observations. Then the predicted class label k at a node m is:

$$k(m) = \operatorname*{argmax}_{k} \frac{1}{N_m}\sum_{x_i \in R_m} I(y_i = k),$$

i.e., we assign the observation to class $k$, the one which has the maximum proportion in node $m$ (Hastie et al., 2001).
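A small R sketch of the node-level quantities above (illustration only, not the dissertation's code): the three impurity measures for a vector of class proportions, and the majority-class label assigned to a terminal node.

impurity <- function(p, type = c("misclass", "gini", "entropy")) {
  type <- match.arg(type)
  switch(type,
         misclass = 1 - max(p),
         gini     = 1 - sum(p^2),
         entropy  = -sum(p[p > 0] * log(p[p > 0])))
}

node_label <- function(y) names(which.max(table(y)))   # k(m): majority class in node m

impurity(c(0.7, 0.2, 0.1), "gini")   # 1 - (0.49 + 0.04 + 0.01) = 0.46
node_label(c("SN", "SN", "AGN"))     # "SN"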

2.5.3 Random Forest

The classification tree is a powerful classification tool, but small changes in the data can yield very different estimates in tree-based methods. This is because small changes in the top node can dramatically affect the structure of the tree below. One solution to this is the random forest approach. Random forests apply the concept of bagging, which fits trees to random samples of the data (bootstrap samples) repeatedly and takes the majority vote of the trees. James et al. (2013) provides a nice description of bagging and random forests in addition to tree-based methods. Random forests try to improve on bagging by reducing the correlation between trees: for each tree a bootstrap sample of size N is taken with replacement, and at each split only a random subset of the predictors is considered. Trees are constructed for each bootstrap sample, and the error rate is evaluated with the observations which are left out of the bootstrap sample (the "out-of-bag" error rate).

2.5.4 Support Vector Machines

The Support Vector Machine (SVM) is one of the most widely used classification methods, especially in bioinformatics, because it is very accurate and able to deal with high-dimensional data (Carugo and Eisenhaber, 2010). The SVM is a technique which constructs the optimal separating hyperplane between the classes. Following the definitions from Hastie et al. (2009), let us assume that we have data points $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$ for a given training set, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. A hyperplane is defined as $\{x : x^T\beta + \beta_0 = 0\}$ where $\|\beta\| = 1$. The optimal separating hyperplane is the one which creates the biggest margin, the largest distance to the nearest training points for classes 1 and -1. For linearly separable classes, the optimization can be done by minimizing $\|\beta\|$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1$ for $i = 1, \ldots, N$. For the nonseparable case, where the classes overlap, let us define slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$ for the points which are on the wrong side of the margin. Then, the optimal hyperplane can be found by minimizing $\|\beta\|$ subject to $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$, where $\xi_i \ge 0$ for $i = 1, \ldots, N$ and $\sum_{i=1}^{N}\xi_i \le$ constant.

2.5.5 Neural Networks

Consider K-class classification with input X and output Yk, where k = 1, ..., K and

Yk is either 0 or 1 for the kth class. In neural networks, we have hidden layers Zm, the linear combinations of inputs, in addition to inputs and outputs. The output

$Y_k$ is a function of nonlinear combinations of $Z_m$ (Hastie et al., 2001). The weights are iteratively updated to improve classification by minimizing the cross-entropy with a gradient descent procedure; this procedure is called backpropagation. A detailed description of the backpropagation algorithm can be found in Duda et al. (2001). Neural networks work well even when the number of variables is large and the data is complicated or imprecise (Dumitru and Maria, 2013). However, they are not recommended when the goal is model interpretation. In neural networks, each input enters the model multiple times in a nonlinear fashion, which makes it difficult to interpret the model (Hastie et al., 2001).

2.6 Distance Measures

In this section, we review some of the statistical distance measures which define a distance between two probability distributions. These distance measures will be used prominently in this dissertation, since our classifier is based on measuring the similarity/dissimilarity between distributions of magnitudes for two different types of transients. Note that some of the distances featured here are not symmetric and are therefore called 'divergences' instead of 'distances'.

2.6.1 Kullback-Leibler Divergence

Kullback-Leibler divergence (or KL divergence) is one of the most widely used distance measures for calculating the difference between two probability distribu- tions. Here we review the definition and properties of the KL divergence and also present a Bayesian estimate of the KL divergence.

2.6.1.1 Definition and Properties

Kullback-Leibler divergence is a non-symmetric measure which quantifies a difference between two probability distributions $P$ and $Q$. If $P$ and $Q$ are discrete variables with the probabilities $\{p(x_i);\ i = 1, 2, \ldots\}$ and $\{q(x_i);\ i = 1, 2, \ldots\}$, then the KL divergence of $Q$ from $P$ is defined as:

$$KL(p, q) = \sum_{i=1}^{\infty} p(x_i) \log \frac{p(x_i)}{q(x_i)}. \qquad (2.4)$$

If $P$ and $Q$ are continuous random variables, the KL divergence can be expressed as:

$$KL(p, q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\,dx. \qquad (2.5)$$

Note that $KL(p, q)$ cannot be defined if $q = 0$ but $p \neq 0$. The KL divergence has the following properties:

(i) KL(p, q) ≥ 0

(ii) KL(p, q) = 0 if and only if p(x) = q(x).

Typically $P$ represents a true distribution of data and $Q$ is an arbitrarily specified model, and $KL(p, q)$ can be interpreted as the information lost when $Q$ is used to approximate $P$ (Burnham and Anderson, 2002). That is, a smaller value of the KL divergence indicates that $p(x)$ is similar to $q(x)$. Another important feature of the KL divergence is that it is a non-symmetric measure. Therefore, the KL divergence of $Q$ from $P$, $KL(p, q)$, is not the same as that of $P$ from $Q$, $KL(q, p)$. The symmetrized version of the KL divergence can be defined as follows: $KL(p, q) + KL(q, p)$.
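In R, the discrete KL divergence (2.4) and its symmetrized version can be sketched as follows (assuming p and q are probability vectors on a common support with no zero entries; the numeric vectors are only examples).

kl_div <- function(p, q) sum(p * log(p / q))          # KL(p, q), Equation (2.4)
kl_sym <- function(p, q) kl_div(p, q) + kl_div(q, p)  # symmetrized version

kl_sym(c(0.5, 0.3, 0.2), c(0.4, 0.4, 0.2))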

2.6.1.2 Bayesian Estimate of Kullback-Leibler Divergence

If we have low-probability or zero-probability events in our data, it is hard to get the accurate Kullback-Leibler divergence. For example, the KL divergence cannot be defined if q(xi) = 0 in Equation (2.4) for the discrete case or q(x) = 0 in Equation (2.5) for the continuous case. Recall the definition given in Section 2.6.1.1. The KL divergence of Q from P measures how much information we lose when Q is used to approximate P.

If we have $q(x_i) = 0$ for any $i$, that means we are confident that the event $x_i$ does not happen. Therefore, having a non-zero probability $p(x_i)$ will cause the KL divergence to be infinite, because we have a value in one probability density function (pdf) which is not possible in the other. The solution to this problem is to add a small probability to the pdfs for which we want to calculate the KL divergence, so that we ensure events can happen and the divergence is defined. If we have discrete distributions, instead of adding a small probability to a probability mass function, a small number can be added where we have zero events. This small number is called a pseudocount. By definition, a pseudocount is a small amount added to the number of observations to avoid having probability estimates of zero, and this technique of adding a small amount is called additive smoothing. Additive smoothing can improve an original estimate by combining other information. In this case, we improve the KL divergence between two distributions by adding an arbitrary small number, a pseudocount, because we would otherwise get an infinite divergence, which does not give us any information about the pdfs.

Let us assume that cell counts $(c_1, \ldots, c_k)$ are from a multinomial distribution with $n$ trials and probabilities $(p_1, \ldots, p_k)$, where $n = \sum_{i=1}^{k} c_i$. Then a smoothed version of the parameter estimate is

$$\hat p_i = \frac{c_i + \alpha_i}{\sum_{i=1}^{k}(c_i + \alpha_i)}, \qquad i = 1, \cdots, k \qquad (2.6)$$

where $\alpha_i > 0$ is a pseudocount or a smoothing parameter. $\alpha_i = 0$ means we do not use smoothing. From a Bayesian point of view, Equation (2.6) corresponds to the posterior mean with a multinomial likelihood and a parameter $\alpha = (\alpha_1, \cdots, \alpha_k)$ as a Dirichlet prior. Note that the Dirichlet distribution is a conjugate prior of a categorical distribution or a multinomial distribution: if the data has either a categorical or a multinomial distribution as a likelihood and a Dirichlet as a prior on the distribution's parameter, the posterior distribution is also a Dirichlet.

Recall the cell counts (c1, ..., ck) which are from a multinomial distribution with parameters (p1, ..., pk). The probability mass function of this multinomial distribution is as follows:

$$P(C|p) = \frac{n!}{c_1! \cdots c_k!}\prod_{i=1}^{k} p_i^{c_i}.$$

Also, the density for a Dirichlet prior is defined as:

$$P(p|\alpha) = \frac{1}{B(\alpha)}\prod_{i=1}^{k} p_i^{\alpha_i - 1}$$

where $B(\alpha) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}$ and $\alpha = (\alpha_1, \ldots, \alpha_k)$. Note that the expected value for a Dirichlet distribution is:

$$E(p_i) = \frac{\alpha_i}{\sum_{i=1}^{k}\alpha_i}.$$

Now, we can calculate the posterior distribution for a Dirichlet prior and a multinomial likelihood:

$$P(p|C, \alpha) \propto P(C|p)P(p|\alpha) = \frac{n!}{c_1! \cdots c_k!}\prod_{i=1}^{k} p_i^{c_i}\;\frac{1}{B(\alpha)}\prod_{i=1}^{k} p_i^{\alpha_i - 1}$$

$$\propto \prod_{i=1}^{k} p_i^{c_i}\prod_{i=1}^{k} p_i^{\alpha_i - 1} = \prod_{i=1}^{k} p_i^{c_i + \alpha_i - 1}.$$

This is again a Dirichlet distribution with a parameter {ci + αi}. The posterior mean of the estimate is

$$E(p_i|c_1, \cdots, c_k, \alpha_1, \cdots, \alpha_k) = \frac{c_i + \alpha_i}{\sum_{i=1}^{k}(c_i + \alpha_i)} \qquad (2.7)$$

which is the same as $\hat p_i$ in Equation (2.6). For this reason, KL divergence with smoothing is also called the Bayesian estimate of KL divergence. Now, we are ready to define the KL divergence when two count vectors are given. Let us assume that we have two cell count vectors $c = \{c_1, \cdots, c_k\}$ and $d = \{d_1, \cdots, d_k\}$. Then, the KL divergence with smoothing, or the Bayesian estimate of KL divergence, can be defined by replacing $p_i$ in Equation (2.4) with $\hat p_i$ in Equation (2.6):

$$KLD(\hat p, \hat q) = \sum_{i=1}^{k} \hat p(c_i, \alpha_i) \log \frac{\hat p(c_i, \alpha_i)}{\hat q(d_i, \alpha_i)}. \qquad (2.8)$$

Finally, we provide some of the popular choices for a pseudocount $\alpha$:

$\alpha = 0$: maximum likelihood estimator
$\alpha = 0.5$: Jeffreys' prior; Krichevsky-Trofimov (1991) entropy estimator
$\alpha = 1$: Laplace's prior
$\alpha = 1/k$: Schurmann-Grassberger (1996) entropy estimator
$\alpha = \sqrt{n}/k$: minimax prior
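A short R sketch of the Bayesian estimate (2.8) (not the dissertation's Appendix B code; the example counts are invented): the raw counts are converted to posterior-mean probabilities (2.6) with a pseudocount, so the divergence stays finite even when some cells are empty.

smooth_probs <- function(counts, alpha) (counts + alpha) / sum(counts + alpha)  # Equation (2.6)

kl_bayes <- function(c_counts, d_counts, alpha = 0.5) {   # alpha = 0.5: Jeffreys' prior
  p_hat <- smooth_probs(c_counts, alpha)
  q_hat <- smooth_probs(d_counts, alpha)
  sum(p_hat * log(p_hat / q_hat))                         # Equation (2.8)
}

kl_bayes(c(10, 0, 3), c(4, 6, 2))   # defined even though one cell count is zero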

2.6.2 Jensen-Shannon Divergence

Jensen-Shannon divergence (or JS divergence) is also widely used for measuring the similarity between two probability distributions. It is closely related to the Kullback-Leibler divergence, but has more useful features. Recall that $KL(p, q)$ cannot be defined if $q = 0$ but $p \neq 0$. However, the JS divergence can be calculated even if there are zeros in the probability density functions. Also, the JS divergence is a symmetric measure and has an upper bound, whereas the KL divergence is not symmetric and can have infinite values. The Jensen-Shannon divergence is defined as

$$JS(p, q) = \frac{1}{2}KL(p, m) + \frac{1}{2}KL(q, m) \qquad (2.9)$$

where $KL(\cdot)$ is the KL divergence and $m = \frac{1}{2}(p + q)$. One of the major features of the JS divergence is that different weights can be assigned to the distributions according to their importance (Lin, 1991). Also, the JS divergence can be defined by using the entropy (or the Shannon entropy), which is a measure of the uncertainty in a random variable. For a discrete random variable $P$ with probability mass function $p_i = p(x_i)$, the entropy is defined as:

$$H(p) = -\sum_i p(x_i) \log p(x_i).$$

Let us assume that we have two probability distributions P and Q with different weights π1 and π2. Then the JS divergence between P and Q can be expressed as:

$$JS_\pi(p, q) = H(\pi_1 p + \pi_2 q) - \left(\pi_1 H(p) + \pi_2 H(q)\right)$$

where π1 + π2 = 1. This can be generalized to more than two distributions. Let

$\{p_1, p_2, \ldots, p_k\}$ be the probability distributions for the discrete random variable $X$ with the possible outcomes $\{x_1, x_2, \ldots, x_k\}$. Then the JS divergence can be defined as follows:

$$JS_\pi(p_1, p_2, \ldots, p_k) = H\!\left(\sum_{i=1}^{k}\pi_i p_i\right) - \sum_{i=1}^{k}\pi_i H(p_i)$$

where $\pi = \{\pi_1, \ldots, \pi_k\}$ are weights and $H(\cdot)$ is the Shannon entropy for the distribution $p_i$. The JS divergence has the following properties:

(i) Different weights can be assigned to probability distributions according to their importance.

(ii)0 ≤ JS(p, q) ≤ 1 if the base 2 logarithm is used.

(iii) 0 ≤ JS(p, q) ≤ ln 2 if the base e logarithm is used.
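A minimal R sketch of the JS divergence in its entropy form (an illustration only; the weights pi1 and pi2 sum to one, and with the natural logarithm the value is bounded by ln 2).

entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))     # Shannon entropy H(p)

js_div <- function(p, q, pi1 = 0.5, pi2 = 0.5) {
  entropy(pi1 * p + pi2 * q) - (pi1 * entropy(p) + pi2 * entropy(q))
}

js_div(c(0.5, 0.5, 0.0), c(0.0, 0.5, 0.5))   # finite despite zero-probability cells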

2.6.3 Hellinger Distance

The Hellinger distance is another measure which can be used to define the similarity between two probability distributions. Let P and Q be two probability measures that are absolutely continuous with respect to some σ-finite measure λ. Then the Hellinger distance between P and Q is defined as

$$HD(P, Q) = \frac{1}{\sqrt{2}}\sqrt{\int \left(\sqrt{\frac{dP}{d\lambda}} - \sqrt{\frac{dQ}{d\lambda}}\right)^2 d\lambda}. \qquad (2.10)$$

If we have two discrete probability distributions P = (p1 . . . pk) and Q =

(q1 . . . qk), the Hellinger distance is defined as

$$HD(P, Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{k}\left(\sqrt{p_i} - \sqrt{q_i}\right)^2}, \qquad (2.11)$$

and this can be expressed as

$$HD(P, Q) = \frac{1}{\sqrt{2}}\left\|\sqrt{P} - \sqrt{Q}\right\|$$

where $\|\cdot\|$ is the Euclidean norm. The Hellinger distance has the following properties:

(i)0 ≤ HD(P,Q) ≤ 1

(ii) HD(P,Q) = 1 if and only if the measures P and Q are mutually singular.

(iii) HD(P, Q) = 0 if and only if P = Q.
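A one-line R sketch of the discrete Hellinger distance (2.11), for probability vectors p and q (example values are invented).

hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)

hellinger(c(0.5, 0.3, 0.2), c(0.4, 0.4, 0.2))
hellinger(c(1, 0), c(0, 1))   # mutually singular distributions attain the maximum value 1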

Chapter 3

Methods

3.1 Introduction

In this chapter, we propose two methods for supervised classification of transients. The methodologies involve kernel density estimation, distance measures, and general classification tools. The goal of our analysis is to build a model which can classify variable stars and transients in an efficient and accurate way. The methods we present here are based on light curves, which show the magnitude of a star as a function of time. Throughout the analysis, we use the magnitude changes over time as our variables for classification instead of the original magnitudes. Since transients are objects which change their brightness, we expect that the differences between magnitudes obtained from light curves of transients will give us useful information about them. In Section 3.2, we present a classification method using various distance measures, as well as how to incorporate past light curve features, such as the ones introduced by Richards et al. (2011) or Faraway et al. (2014), into our methods. Section 3.3 provides a classification method based on grouping the values obtained by taking differences between magnitudes and times from light curves.

3.2 Classification by Distance Measures

In this section, we present classification of variable stars and transients by using distance measures. We use magnitude changes over time increments as our variables for classification because the telescope's observations come primarily in the form of single magnitude changes over time increments, e.g., an observed (∆t, ∆mag) pair (Djorgovski et al., 2011). Therefore, it will be more efficient to build a model with (∆t, ∆mag), since we can easily update the classifier as new (∆t, ∆mag) sets come in. Another advantage is that the model will be invariant to the absolute magnitude and to time shifts of the observations, which means that classifiers can be built even when the distance to the event or the starting point of the observations is unknown.

Let us assume that we have n observations in a given light curve, (t1, mag1), ...,

$(t_n, mag_n)$, then we define time and magnitude increments as $dt_{ij} = t_i - t_j$ and $dmag_{ij} = mag_i - mag_j$ where $1 \le j < i \le n$. Our classification method by distance measures is based on the assumption that if we evaluate densities for $dt_{ij}$ and $dmag_{ij}$, they will look different for different types of transients. For example, see Figure 3.1 for a comparison between the light curves of a supernova and a non-transient. We can see that the supernova has a wide range of magnitudes in a very short time period, whereas the non-transient has a small range of magnitude over a wide range of time. In other words, if we build (dt, dmag) probability density estimates, the supernova will have high frequencies for small dt but almost none for large dt, whereas the non-transient will have high frequencies for small dmag but almost none for large dmag. We expect that each transient type will have a different distribution, and this will allow us to distinguish one transient from the other. We will measure the similarity/dissimilarity between two (dt, dmag) probability distributions and classify them into the same group of transients if their distributions are similar. There are several measures which can be used to estimate the distance between two probability distributions, but here we only present the classification procedure for the Kullback-Leibler divergence. Other distance measures can be applied similarly.

Figure 3.1. Light curves of a supernova (left) and a non-transient (right)

3.2.1 Kullback-Leibler Divergence with Kernel Density Estimation

For supervised classification, the data first needs to be separated into a training set and a test set. For data points (t_1, mag_1), ..., (t_n, mag_n) in a given light curve from the test set, we take dt_ij = t_i − t_j and dmag_ij = mag_i − mag_j, where 1 ≤ j < i ≤ n. Let the vectors of dt_ij and dmag_ij be dt and dmag. Then the joint density for (dt, dmag) can be evaluated by kernel density estimation, where probabilities are obtained at given grid points with an appropriate choice of bandwidth (Silverman, 1986). Let the resulting probability density estimate for a given light curve in the test set be f. Similarly, we can estimate densities for every light curve in the training set. Let these density estimates be g_iK for i_K = 1, ..., n_K, where K is the class in the training set and n_K is the number of light curves in class K. Recall that the KL divergence is a non-symmetric measure, which means that KL(p, q) can be different from KL(q, p). Therefore, we explore three options for the KL divergence: (i) the divergence of a distribution function in the test set from one in the training set, (ii) the divergence of a distribution function in the training set from one in the test set, and (iii) the symmetrized distance between the two. If we use option (iii), our measure is defined as

M(f, g_iK) ≡ KL(f, g_iK) + KL(g_iK, f)    (3.1)

where KL(·, ·) denotes the KL divergence between two densities and i_K = 1, ..., n_K. Equation (3.1) implies that we calculate the distance between the probability density estimate of a given light curve in the test set and the estimates of all the light curves in the training set. If we use (i) or (ii), the measure becomes M(f, g_iK) ≡ KL(f, g_iK) or M(f, g_iK) ≡ KL(g_iK, f), respectively.

Once we have M(f, g_iK) for every light curve, the minimum KL divergence within each class K is taken. Let these minima be M_K(f) for class K. Then we classify a light curve from the test set to the class with the minimum divergence. We can write this classification algorithm as follows:

M_K(f) = min M(f, g_iK)    (3.2)

Classify f to class J if M_J(f) ≤ M_K(f) for all K    (3.3)
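A minimal sketch of this procedure, assuming the increments have been computed as above and that all densities are evaluated on a common grid: kde2d from the MASS package (its default bandwidth is a rule-of-thumb choice) estimates the (dt, dmag) density, a small constant is added to avoid zero cells (as is done later in Section 5.4.1.1), and each test curve is assigned to the class attaining the minimum symmetrized divergence. The function names and the grid limits lims are illustrative choices, not part of the dissertation.

    library(MASS)   # kde2d: two-dimensional kernel density estimation

    # Density estimate of (dt, dmag) on a fixed grid, normalized to a discrete pdf.
    density_grid <- function(inc, n = 100, lims) {
      kd <- kde2d(inc$dt, inc$dmag, n = n, lims = lims)  # rule-of-thumb bandwidth by default
      p  <- kd$z + 1e-8                                  # avoid exact zeros before taking logs
      p / sum(p)
    }

    kl  <- function(p, q) sum(p * log(p / q))            # plug-in KL divergence
    skl <- function(p, q) kl(p, q) + kl(q, p)            # symmetrized measure, Eq. (3.1)

    # Classify one test density f against a list of training densities g with class labels.
    classify_kl <- function(f, g, labels) {
      d <- sapply(g, skl, p = f)                         # M(f, g_iK) for every training curve
      names(which.min(tapply(d, labels, min)))           # Eqs. (3.2)-(3.3): minimum per class
    }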

3.2.2 Kullback-Leibler Divergence with Binning

Instead of assuming continuous distributions and evaluating densities with kernel density estimation for (dt, dmag), we can use binning to estimate densities. Compared to kernel density estimation, this is less likely to lose information on the densities of (dt, dmag), because we use the exact counts to calculate the probability estimates, without smoothing.

Once again, consider the data points dt_ij = t_i − t_j and dmag_ij = mag_i − mag_j for a given light curve in the test set, where 1 ≤ j < i ≤ n and n is the number of observations in the light curve. Then for arbitrary numbers u_1 < u_2 < ... < u_(a−1) < u_a and v_1 < v_2 < ... < v_(b−1) < v_b, we define c_IJ as the number of (dt_ij, dmag_ij) pairs with u_I ≤ dt_ij < u_(I+1) and v_J ≤ dmag_ij < v_(J+1), for 1 ≤ I ≤ a − 1 and 1 ≤ J ≤ b − 1. That is,

c_IJ = Σ_{1 ≤ j < i ≤ n} I_[u_I, u_(I+1))(dt_ij) · I_[v_J, v_(J+1))(dmag_ij)

Table 3.1 explains how we obtain c_IJ with a contingency table. The first column and the first row represent the dt and dmag bins respectively, and c_IJ is the number of observations which belong to each bin, where 1 ≤ I ≤ a − 1 and 1 ≤ J ≤ b − 1.

              [v1, v2)    [v2, v3)    ···    [v(b-1), vb]
[u1, u2)      c11         c12         ···    c1,b-1
[u2, u3)      c21         c22         ···    c2,b-1
  ...         ...         ...         ···    ...
[u(a-1), ua]  c(a-1),1    c(a-1),2    ···    c(a-1),b-1
Table 3.1. Contingency table for dt and dmag bins

Let c be the count vector {c_11, c_12, ..., c_(a−1),(b−1)} obtained as above for a light curve with given dt and dmag bins {u_1, u_2, ..., u_a} and {v_1, v_2, ..., v_b}. Since n is the total number of observations in a given light curve, we have n = c_11 + c_12 + ... + c_(a−1),(b−1). Then we can obtain the probability mass function for (dt, dmag) as follows:

{ c_11/n, c_12/n, ..., c_(a−1),(b−1)/n }    (3.4)

Let p be the probability mass function obtained by Equation (3.4) for a given light curve in the test set, and q_iK the one for a light curve of class K in the training set, where i_K = 1, ..., n_K. If we use the symmetrized KL divergence, our measure is defined as follows:

M(p, q_iK) ≡ KL(p, q_iK) + KL(q_iK, p)

We can use the non-symmetric KL divergence, KL(p, q_iK) or KL(q_iK, p), as our measure as well; the classification procedure is the same.

After we compute M(p, q_iK) for every light curve in the training set, we take the minimum within each class K and classify a given light curve from the test set to the class with the minimum divergence. That is,

M_K(p) = min M(p, q_iK)    (3.5)

Classify p to class J if M_J(p) ≤ M_K(p) for all K.    (3.6)
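A sketch of the binned version, under the assumption that the increments inc and the symmetrized divergence skl from the earlier sketches are available: cut() and table() build the count vector of Table 3.1 and Equation (3.4), and classification again picks the class attaining the minimum divergence. The bin edges shown here are illustrative only (the edges actually used for the CRTS data appear in Section 5.4.1.2).

    # Count vector c_IJ and probability mass function from (dt, dmag), given bin edges u, v.
    bin_pmf <- function(inc, u, v) {
      tab <- table(cut(inc$dt,   breaks = u, right = FALSE, include.lowest = TRUE),
                   cut(inc$dmag, breaks = v, right = FALSE, include.lowest = TRUE))
      tab / sum(tab)                       # Equation (3.4): cell counts divided by the total
    }

    # Illustrative bin edges only:
    u <- c(0, 1, 100, 500, 1000, 1500, 2800)
    v <- c(-12, seq(-2, 2, by = 0.1), 12)

    p <- bin_pmf(inc, u, v)                # pmf for a test light curve
    # With training pmfs in a list and their class labels, classification mirrors
    # Eqs. (3.5)-(3.6); zero cells need smoothing before applying skl (Section 3.2.3).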

3.2.3 Bayesian Estimate of Kullback-Leibler Divergence

In this section, we show how to use a Bayesian estimate of the KL divergence for classification, which may lead to significant improvements over the standard Kullback-Leibler estimate.

Recall the count vector c = {c_11, c_12, ..., c_(a−1),(b−1)} from Section 3.2.2. We treat the cell counts as arising from a multinomial distribution and apply a Bayesian estimate of the KL divergence with a Dirichlet prior and a multinomial likelihood. Let c be the count vector for a light curve in the test set given the dt and dmag bins {u_1, u_2, ..., u_a} and {v_1, v_2, ..., v_b}. Then with a pseudocount α, we can get p̂(c_IJ, α) by using Equation (2.7), where 1 ≤ I ≤ a − 1 and 1 ≤ J ≤ b − 1. Similarly, for the training set, q̂_iK(c_IJ, α) for class K can be obtained with the same pseudocount and the same dt and dmag bins, where i_K = 1, ..., n_K and n_K is the number of light curves in class K. Assuming that we use the symmetrized version of the KL divergence, our metric for classification using the Bayesian estimate of the KL divergence can be defined as follows:

M(p̂, q̂_iK) ≡ KLD(p̂, q̂_iK) + KLD(q̂_iK, p̂)

where KLD(·, ·) is the KL divergence with smoothing, i.e., the Bayesian estimate of the KL divergence as defined in Equation (2.8). After we get M(p̂, q̂_iK) for every light curve in the training set, we take the minimum for each class K and classify a given light curve from the test set to the class with the minimum divergence. That is,

M_K(p̂) = min M(p̂, q̂_iK)    (3.7)

Classify p̂ to class J if M_J(p̂) ≤ M_K(p̂) for all K.    (3.8)
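Under a multinomial likelihood with a symmetric Dirichlet(α) prior, the posterior-mean cell probabilities take the familiar form (count + α) / (total + α × number of cells). Assuming Equations (2.7)-(2.8) use this standard smoothing, a minimal sketch is:

    # Dirichlet-smoothed cell probabilities and the resulting symmetrized KL divergence.
    # 'counts' is a cell-count vector c; 'alpha' is the pseudocount of the Dirichlet prior.
    p_dirichlet <- function(counts, alpha) {
      (counts + alpha) / (sum(counts) + alpha * length(counts))
    }

    kld_bayes <- function(c_test, c_train, alpha = 0.5) {
      p <- p_dirichlet(c_test,  alpha)
      q <- p_dirichlet(c_train, alpha)
      sum(p * log(p / q)) + sum(q * log(q / p))   # symmetrized version, as in the text
    }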

3.2.4 Incorporating Other Measures

As we have seen in the previous sections, classification by distance measures only uses density estimates of (dt, dmag). We expect better classification performance if we incorporate measures which reflect the astronomical features of variable stars and transients. Examples of measures which can be used for classification of transients are given in Section 2.2.

Since our metric is obtained by calculating distances between two distributions, we cannot incorporate other measures, such as the ones given in Tables 2.1 and 2.2, in their original forms. First, we need to calculate a distance between the measures of a given light curve in the test set and those in the training set. Let r be the vector of measures for a given light curve in the test set and s_iK the corresponding vector in the training set. Let dist(·, ·) be a distance measure between two vectors, such as the Euclidean distance. Then our new metric is defined as

M*(f, g_iK) = M(f, g_iK) + dist(r, s_iK)

where M(f, g_iK) is obtained by one of the distance measures described in Section 3.2.1, 3.2.2, or 3.2.3. Using the new metric M*(f, g_iK), the classification algorithm can be expressed as follows:

M*_K(f) = min M*(f, g_iK)

Classify f to class J if M*_J(f) ≤ M*_K(f) for all K
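A small sketch of the augmented metric, assuming r and s_iK are vectors of Richards/Faraway-type measures and that M(f, g_iK) is one of the divergences sketched earlier (here the symmetrized KL, skl); the two terms are simply summed, as in the text, with no extra scaling.

    # Distance metric augmented with light-curve measures (Section 3.2.4).
    dist_measures <- function(r, s) sqrt(sum((r - s)^2))           # Euclidean distance

    m_star <- function(f, g, r, s) skl(f, g) + dist_measures(r, s)
    # Classification then proceeds exactly as before, using m_star in place of M(f, g).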

3.3 Classification by Binning

In this section, we introduce an approach which is different from, yet closely related to, the previously described distance methods. Instead of evaluating densities with dt and dmag, we group them into dt and dmag bins and use the frequencies in each bin as the variables for classification. Recall the count vector {c_11, c_12, ..., c_(a−1),(b−1)} from Section 3.2.2, which can be obtained by

c_IJ = Σ_{1 ≤ j < i ≤ n} I_[u_I, u_(I+1))(dt_ij) · I_[v_J, v_(J+1))(dmag_ij)

We then apply general classification methods directly to the variables {c_11, c_12, ..., c_(a−1),(b−1)}. The idea behind this method is similar to the distance methods in that it uses the (dt, dmag) frequencies for classification. It is much simpler than the distance methods, but can still be a very powerful classification tool.
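As a sketch of this approach (see also the summary in Section 3.4), suppose the count vectors of all light curves have been stacked into matrices train_counts and test_counts, with class labels train_class and test_class; these object names are hypothetical. A random forest is used here purely as an example of a general classifier.

    library(randomForest)

    # Use the flattened count vectors as features for a general classifier.
    fit  <- randomForest(x = train_counts, y = as.factor(train_class), ntree = 500)
    pred <- predict(fit, newdata = test_counts)
    table(pred, test_class)     # confusion table on the test set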

3.3.1 Incorporating Other Measures

Again, we can incorporate existing light curve measures into our binning method.

Let us assume that we have a count vector c = {c_11, c_12, ..., c_(a−1),(b−1)} as defined in Section 3.2.2 and a vector of measures r = {r_1, r_2, ..., r_k}. Then we simply combine c and r to get a new vector c* = {c_11, c_12, ..., c_(a−1),(b−1), r_1, r_2, ..., r_k}, which has length (a−1)(b−1) + k. We use c* as our variable vector for classification and apply general classification methods. We expect incorporating measures to be more effective for the binning method than for the distance method: while we need to take distances between measures to incorporate them into the distance method, we can use the original measures in the binning method, which leads to less information loss.

3.4 Summary and Conclusions

In this chapter, we have introduced two methods for classification of variable stars and transients. The first classification method is based on distance measures, such as the Kullback-Leibler divergence, which can quantify how similar two (dt, dmag) distributions of transients are. We expect these measures to work well for identifying different types of transients by capturing different attributes of the (dt, dmag) distributions.

Although the KL divergence is one of the most widely used distance measures for two probability distributions, we cannot use it if we have zeros in our pdfs. In this case, we can try a Bayesian estimate of the KL divergence, which adds a small number to the count vectors or pdfs so that the estimate is not affected by zero probabilities.

We only introduced the classification algorithm for the KL divergence, but other distance measures can be applied in similar fashion. In addition to the Jensen-Shannon divergence and the Hellinger distance, which were presented in Chapter 2, distance measures such as the total variation distance or the Chi-squared divergence can be used. Classification performance of the distance measures can differ depending on the survey or the transient type.

The second method uses the number of counts of (dt, dmag) in each dt and dmag bin as the variables and applies general classification methods such as LDA, classification trees, SVM, neural networks, and random forests.

In Chapter 4, we will apply some of the distance measures to simulated light curves and see how they perform. In Chapter 5, the classification performance of the distance and binning methods will be demonstrated with real astronomical data, and the results after incorporating other light curve measures into our methods will be presented as well.

Chapter 4

Simulation

In this chapter, we investigate the classification performance of some of the distance measures with simulated RR Lyrae stars, supernovae, and non-variables. The goal of this simulation study is to provide an idea of when each distance measure succeeds or fails, so that it can be useful when the methods are applied to astronomical surveys. For the simulation, we consider three types of classification: (i) RR Lyraes versus non-variables, (ii) supernovae versus non-variables, and (iii) RR Lyraes versus supernovae. In Section 4.1, we describe how we generate light curves of RR Lyraes, supernovae, and non-variables. Section 4.2 provides classification results for (i), (ii), and (iii). For each case, we first perform classification assuming there are no gaps between observations in the light curves, and then repeat it with gaps to compare the results.

4.1 Data Generation

First, we describe how to generate light curves of RR Lyraes, supernovae, and non-variables. RR Lyraes are periodic variable stars with periods of 1.5 to 24 hours. An example of a CRTS RR Lyrae light curve is given in Figure 4.1 (image credit: http://nesssi.cacr.caltech.edu/DataRelease/). As we can see, these are sinusoidal curves, so we create them as a harmonic sum of sine curves. The purpose of this simulation study is to verify the classification performance of our methods for different shapes of light curves. For this reason, we generate light curves of RR Lyraes with three different parameters: amplitude, period, and shape.

Figure 4.1. Example of RR Lyrae light curve

Amplitude is the range of magnitudes, period is the time it takes for the observations to repeat, and the shape parameter controls how sinusoidal the light curve is. We will try different values for amplitude, period, and shape and see when our methods work well and when they do not. RR Lyrae light curves are generated with the following formula:

y = (a/c) [ sin(2πt/p) + b sin(4πt/p) + b² sin(6πt/p) ] + ε

where t is time, a is amplitude, p is period, b is shape, c is a tuning parameter, and ε is a measurement error. To decide the ranges of some of the parameters, we refer to the actual observations from the CRTS data. The minimum magnitude range of RR Lyraes in CRTS data, excluding outliers, is 0.5 and the maximum is 2. Therefore, it is reasonable to set the minimum and maximum amplitude close to 0.5 and 2. In the simulation, we set the amplitude a to values from 0.5 to 3 in steps of 0.5. RR Lyraes typically have a period of less than 1 day, so the period p is set to values from 0.1 to 1 in steps of 0.1. The shape parameter b is fixed at 0.2, 0.4, or 0.6; as we will see in the simulated light curves, smaller b produces a more sinusoidal curve. Finally, RR Lyrae light curves are created over the range t = (0, 1000). Once light curves are created with these parameters, the measurement error should be added to them.

Figure 4.2. Simulated RR Lyrae light curves with b = 0.2, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)

For simplicity, we generate the error from a normal distribution with mean 0 and variance σ². CRTS provides the measurement error for each observation, and the median error for RR Lyraes is only about 0.05. However, we apply larger errors, such as 0.1, 0.3, and 0.5, to show that our classification methods perform well even when a large noise is present. Examples of RR Lyrae light curves generated with different parameter values and σ = 0.3 are given in Figure 4.2, Figure 4.3, and Figure 4.4.
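A minimal sketch of the RR Lyrae generator as reconstructed above; the observation times, the normalizing constant cc (taken here as 1 + b + b², so that a roughly sets the amplitude), and the function name are illustrative choices not specified in the text.

    # Simulate one RR Lyrae light curve from the harmonic-sum model.
    simulate_rrlyrae <- function(n = 300, a = 1, p = 0.5, b = 0.2, sigma = 0.3,
                                 cc = 1 + b + b^2) {
      t <- sort(runif(n, 0, 1000))
      y <- (a / cc) * (sin(2 * pi * t / p) + b * sin(4 * pi * t / p) +
                       b^2 * sin(6 * pi * t / p)) + rnorm(n, 0, sigma)
      data.frame(t = t, mag = y)
    }

    lc <- simulate_rrlyrae(a = 2, p = 0.2, b = 0.2, sigma = 0.3)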

Now, we present how to generate supernova light curves. Note that supernovae have many sub-types; among those, we consider three for the simulation: Type I, Type II-P, and Type II-L. Doggett and Branch (1985) provide brief explanations and composite light curves for these types; see Figure 4.5 for examples.

Figure 4.3. Simulated RR Lyrae light curves with b = 0.4, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)

All Type I supernovae show similar rates of decline in brightness for 20 days after the peak, and the rate of dimming slows and becomes constant after about 50 days. Type II supernovae show a rapid rise in brightness and a steady decrease after the peak. Type II-P and Type II-L supernovae can be distinguished by a plateau: Type II-P supernovae have a temporary but clear plateau between about 30 and 80 days after the peak brightness, whereas Type II-L supernovae lack this feature and their luminosity decreases constantly (Carroll and Ostlie, 2006). We generate these three types of light curves based on the explanation and the plot given in Doggett and Branch (1985), with varying amplitudes, i.e., different values of the magnitude range.

Figure 4.4. Simulated RR Lyrae light curves with b = 0.6, a = (1, 2, 3) (from top to bottom), and p = (0.2, 0.5, 1) (from left to right)

Light curves of Type I, Type II-P, and Type II-L supernovae are created with the following equations:

y_1 = a [ −(t_1 − 20)² I(t_1)/c_1 + d_1 ] + [ e^(−t_2/20) I(t_2)/c_2 + d_2 ] + [ −t_3 I(t_3)/c_3 + d_3 ] + ε

y_2 = a [ −(t_1 − 30)² I(t_1)/c_1 + d_1 ] + [ (t_2 − 85)³ I(t_2)/c_2 + d_2 ] + [ −t_3 I(t_3)/c_3 + d_3 ] + ε

y_3 = a [ −(t_1 − 20)² I(t_1)/c_1 + d_1 ] + [ e^(−t_2/50) I(t_2)/c_2 + d_2 ] + [ −t_3 I(t_3)/c_3 + d_3 ] + ε

Figure 4.5. Examples of Type I, Type II-P, and Type II-L supernovae light curves (Doggett and Branch, 1985)

where a is amplitude, (t_1, t_2, t_3) are times, (c_1, c_2, c_3, d_1, d_2, d_3) are tuning parameters, and ε is the measurement error. t_1, t_2, and t_3 are set differently for each type to make the curves resemble the light curves in Figure 4.5, but the entire time range is the same for the three types, namely (0, 300). The minimum magnitude range of supernovae in CRTS data is 0.2 and the maximum is 4.5. Therefore, we apply amplitudes from 0.5 to 5 in steps of 0.5 for the simulation. The median measurement error for supernovae in CRTS data is about 0.2, and again we use larger values for the noise. Simulated supernova light curves with different amplitudes for the three types are given in Figure 4.6.

Finally, we generate light curves of non-variables, which will be compared to RR Lyraes and supernovae. Non-variables do not have specific patterns in their brightness changes, and these changes are not significant. In CRTS data, the median magnitude range of non-variables is 1 and the maximum is 3.5. Based on this information, we create non-variable light curves from a normal distribution with an arbitrary mean and standard deviation 0.5. This allows us to have simulated light curves with a magnitude range of around 3, which is larger than the median range in CRTS data and rather close to the maximum range.

Figure 4.6. Simulated Type I, Type II-P, and Type II-L (from top to bottom) supernovae light curves with a = (1, 3, 5) (from left to right)

Although having smaller magnitude ranges for non-variables, which is the case in actual data, would make them more distinguishable from variable stars and lead to better classification performance, we want to assume the worst scenario so that we do not have inflated classification rates.

We perform classification with three different distance measures: the Kullback-Leibler divergence with smoothing, the Jensen-Shannon divergence with binning, and the Hellinger distance. For transients (RR Lyraes and supernovae) versus non-transients classification, we create 1,000 transients and 9,000 non-transients, since in a real survey the data are unbalanced with many more non-transients than transients. For supernovae versus RR Lyraes classification, 5,000 light curves are created for each type. We compute classification rates for each measure and see how they perform for different parameter values.

4.2 Simulation Results

Now, we present how each distance measure performs for different parameter values. As mentioned earlier, it is common to have gaps between observations in light curves due to periods when observations cannot be made. Therefore, we consider two settings for the simulation studies: (i) complete data without gaps in the light curves, and (ii) gaps present between observations.

4.2.1 Light Curves without Gaps

In this section, we explore classification performance with simulated light curves assuming that there are no gaps between observations. First, we perform RR Lyraes versus non-variables classification. Classification rates for different values of amplitude (a), period (p), and shape (b) are given in Tables 4.1, 4.2, and 4.3. For example, Table 4.1 shows the classification performance for amplitudes of 0.5, 1, 1.5, 2, 2.5, and 3, while the other parameters, period and shape, are randomly assigned for 1,000 light curves. Classification results are presented as completeness with contamination in parentheses. Completeness is the probability of not missing any interesting objects and contamination is the false alarm rate (Mahabal et al., 2010). Therefore, we want high completeness and low contamination.

For RR Lyrae classification, we apply 0.1 and 0.3 as σ, which are large enough considering that the median measurement error of RR Lyraes in CRTS data is 0.05. When σ = 0.1, we achieve more than 95% completeness and less than 1% contamination in all cases. Therefore, we only present the results for σ = 0.3 here. Tables 4.1, 4.2, and 4.3 show completeness and contamination for different parameter values of amplitude, period, and shape.

amp    0.5          1            1.5          2            2.5         3
KLD    99.7 (0.1)   72.8 (2.1)   78.1 (0.7)   100 (0.0)    100 (0.0)   100 (0.0)
JSD    99.7 (0.1)   71.1 (2.1)   80.6 (0.7)   100 (0.0)    100 (0.0)   100 (0.0)
HD     100 (0.0)    73.1 (2.2)   78.4 (0.6)   99.7 (0.1)   100 (0.0)   100 (0.0)
Table 4.1. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes

period  0.1          0.3          0.5          0.7          0.9          1
KLD     93.1 (0.2)   90.4 (0.4)   88.5 (0.6)   89.1 (0.4)   86.1 (0.6)   90.9 (0.6)
JSD     94.3 (0.3)   87.0 (0.7)   84.9 (0.7)   92.9 (0.4)   88.3 (0.6)   91.1 (0.5)
HD      92.5 (0.1)   86.9 (0.5)   88.1 (0.5)   88.2 (0.3)   86.6 (0.4)   89.1 (0.8)
Table 4.2. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods

shape  0.2          0.4          0.6
KLD    86.2 (0.4)   85.1 (0.4)   84.5 (1.0)
JSD    85.7 (0.4)   84.9 (0.3)   83.3 (0.7)
HD     90.3 (0.8)   87.4 (1.1)   81.7 (0.7)
Table 4.3. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes

There is not much difference in performance among the distance measures for different amplitudes. We can see that amplitudes of 2 or larger achieve 100% completeness and 0% contamination. However, amplitudes between 1 and 1.5 do not work as well, although they still achieve false alarm rates of less than 3%. It may seem inconsistent to have better performance at an amplitude of 0.5 than at 1 or 1.5, but this is likely due to the very small amplitude combined with the large noise: considering that we assumed a rather large magnitude range for non-variables, RR Lyrae light curves become easier to classify when their amplitude (magnitude range) is very small. If we chose a smaller σ or a smaller magnitude range for non-variables, as a real astronomical survey has, classification rates would increase with amplitude.

For different periods, the smallest or largest values work better than those in the middle, achieving more than 90% completeness. Contamination is below 1% in all cases. Also, Table 4.3 shows that we get more satisfying results when the light curves are more sinusoidal. The results are again similar for the three distance measures, but HD works slightly better for less sinusoidal curves, whereas KLD and JSD perform better when the RR Lyrae light curves are peaky. Now we present the results for supernovae versus non-transients classification.

We start with σ = 0.3, which is close to the median measurement error of supernovae in CRTS data; in this case almost 100% completeness and 0% contamination are obtained for all cases. Therefore, we only present the results for σ = 0.5. Tables 4.4 and 4.5 show completeness and contamination for different types of supernovae and various amplitudes.

       Type I       Type II-P    Type II-L
KLD    90.3 (0.6)   98.4 (0.1)   100.0 (0.1)
JSD    90.7 (0.9)   97.6 (0.2)   99.7 (0.0)
HD     89.9 (0.6)   97.3 (0.4)   99.4 (0.0)
Table 4.4. Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different types

amp    0.5          1            1.5          2           2.5         3
KLD    48.7 (5.7)   86.2 (1.5)   98.8 (0.2)   100 (0.0)   100 (0.0)   100 (0.0)
JSD    44.4 (6.0)   90.7 (1.2)   99.1 (0.2)   100 (0.0)   100 (0.0)   100 (0.0)
HD     43.9 (5.0)   86.9 (1.3)   97.8 (0.1)   100 (0.0)   100 (0.0)   100 (0.0)
Table 4.5. Supernovae vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes

Table 4.4 shows that we achieve high completeness for all types; among these, Type II-L is the easiest to classify and Type I is the hardest. False alarm rates are all less than 1%. Table 4.5 shows less than 50% completeness at an amplitude of 0.5, but this is likely because the amplitude is dominated by the noise when both are 0.5, so it is not of much concern. With amplitudes of 1.5 or larger, we achieve almost 100% completeness and 0% contamination.

Finally, we present RR Lyraes versus supernovae classification. Recall that there are three parameters for RR Lyraes - amplitude, period, and shape - and two parameters for supernovae - amplitude and type. We perform classification for all possible combinations of parameters. For example, we can investigate the performance of our methods for different combinations of amplitudes for RR Lyraes and supernovae; in this case, the period and shape parameters for RR Lyraes and the types for supernovae are randomly assigned. In addition, σ for RR Lyraes and supernovae has to be decided. When 0.3 is used for RR Lyraes and 0.5 for supernovae, we achieve 100% completeness and 0% contamination for almost all parameter combinations. We get similar results with σ = 0.5 for both RR Lyraes and supernovae. The exception is when the amplitudes are 0.5, but even in this case completeness is ∼95% and contamination is ∼5%.

In conclusion, our methods perform very well in general, even when a large noise is present and the amplitude of transients is small. For some cases completeness is not very satisfying, but we still achieve small false alarm rates of less than 3%. This is an important feature for classifiers of transients because, considering the massive number of stars detected every night, we do not want to waste time on falsely detected objects.

4.2.2 Light Curves with Gaps

In Section 4.2.1, we performed classification assuming that we have complete light curves. However, real-world data usually have gaps between observations because objects cannot be detected during certain time periods. Therefore, in this section, we perform classification in a more realistic situation by creating gaps between observations in the light curves. The length of the periods of missing observations is again determined based on the CRTS data. In CRTS data, periods without any observation appear roughly every 200 days. Therefore, we generate light curves by the same procedure presented in Section 4.1, but make them have periods of missing observations every 200 days. For example, RR Lyrae light curves will have observations only between t = (0, 200), (400, 600), and (800, 1000), instead of over the entire period t = (0, 1000). On the other hand, CRTS data show that light curves of supernovae have durations that are too short to contain gaps: after the peak brightness, they fade over several weeks or months, e.g. less than 300 days, and become undetectable. Therefore, we usually have complete observations for supernova light curves. For this reason, we only introduce gaps into the RR Lyrae and non-variable light curves. The classification results for RR Lyraes with σ = 0.3 versus non-variables when there are missing observations in the light curves are given in Tables 4.6, 4.7, and 4.8.

amp    0.5          1            1.5          2            2.5           3
KLD    99.1 (0.1)   63.7 (3.0)   69.2 (2.0)   99.1 (0.1)   100.0 (0.0)   100.0 (0.0)
JSD    99.4 (0.0)   69.1 (3.5)   74.2 (2.1)   99.1 (0.2)   100.0 (0.0)   100.0 (0.0)
HD     99.4 (0.0)   66.2 (2.5)   70.6 (2.2)   99.1 (0.1)   100.0 (0.0)   100.0 (0.0)
Table 4.6. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different amplitudes when gaps are present in light curves.

period  0.1          0.3          0.5          0.7          0.9          1
KLD     83.9 (1.8)   79.3 (2.2)   89.4 (0.5)   91.1 (0.6)   80.9 (1.5)   97.3 (0.1)
JSD     80.6 (1.4)   84.7 (2.5)   93.7 (0.6)   89.6 (0.9)   82.9 (2.1)   98.3 (0.0)
HD      82.4 (2.2)   82.8 (2.5)   90.4 (0.3)   91.2 (0.6)   78.4 (2.2)   99.4 (0.0)
Table 4.7. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different periods when gaps are present in light curves.

shape  0.2          0.4          0.6
KLD    84.1 (1.0)   84.1 (1.0)   84.2 (0.9)
JSD    84.9 (1.2)   83.7 (1.2)   84.0 (1.0)
HD     87.3 (1.0)   80.8 (1.2)   82.7 (1.5)
Table 4.8. RR Lyraes vs. non-variables classification: Completeness and contamination (in parentheses) for different shapes when gaps are present in light curves.

To compare these results with those for the original light curves without gaps, see Tables 4.1, 4.2, and 4.3. Table 4.6 shows that completeness is 2-10% lower and false alarm rates are slightly higher for amplitudes of 1 and 1.5. However, we achieve stable rates, with almost 100% completeness and 0% contamination, for amplitudes of 2 or larger even when gaps are present in the light curves. Table 4.7 shows lower completeness for some periods, but we still achieve more than 80% completeness and less than 3% contamination in almost all cases. For different shapes, having gaps in the light curves rarely affects classification performance.

Finally, we perform RR Lyraes versus supernovae classification with σ = 0.5 for both, and the results are similar to those without gaps in the light curves: we achieve more than 95% completeness and less than 5% contamination for almost all parameter combinations. Therefore, we can say that our methods are not very sensitive to missing values in light curves. This implies that they work well not only in the ideal situation but also for actual astronomical data.

Chapter 5

Analysis

5.1 Introduction

In Chapter 3, we presented several methods which can be applied to classify variable stars and transients. As our goal is to find an efficient and accurate classifier, the most important question is how it performs with real astronomical data. In this chapter, we apply the classification methods presented in Chapter 3 to the CRTS data. As briefly explained in Section 2.1, the Catalina Real-time Transient Survey (CRTS) explores data from 3 telescopes, CSS, SSS, and MLS. Four images are taken of the same location in the sky, about 10 minutes apart, and objects which have changed their brightness considerably above a certain threshold are marked as transients (Drake et al., 2009; Mahabal et al., 2010). Detected transients from the CRTS are available publicly along with their images and light curves. CRTS has detected ∼10,000 transients to date, including ∼2400 Supernovae (SNe), ∼1200 Cataclysmic Variables (CV), ∼2900 Active Galactic Nuclei (AGN), and hundreds of Blazars and Flares. Up-to-date information on newly detected transients by CRTS can be found at http://crts.caltech.edu/.

Here we apply our classification methods to two different data sets whose transients are already classified and labeled. With a labeled data set, we can perform supervised classification and calculate classification rates to see how our classifiers work. First, CRTS data with 7 different types of transients - AGN, Blazar, CV, Downes, Flare, RR Lyrae, and SNe - in addition to non-transients are used. Secondly, a relatively new data set which has only 5 different types of transients - AGN, Blazar, CV, Flare, and SNe - is used for analysis.

Section 5.2 describes how the first data set is obtained, and Section 5.3 gives an exploratory analysis of it. Sections 5.4 and 5.5 provide the classification results of the distance measures and the binning method applied to the first data set: in Section 5.4, we present the settings needed for performing classification by distance measures and explain how to incorporate other measures into our methods. After that, the results obtained from the distance measures are presented and compared with other existing measures. In Section 5.5, we demonstrate the performance of our binning method, as well as other measures incorporated into the binning method. In Section 5.6, our classification methods are tested with a new data set with 5 different types of transients. Similarly to the previous data set, classification by distance measures and the binning method are demonstrated and the results are presented, along with classification rates after incorporating other measures. Finally, Section 5.7 gives the summary of the results for the two data sets. Since we have more than one distance measure for classification, only the one which gives the best performance rate is presented there, along with the result of the binning method, and they are compared with other light curve measures.

5.2 Data Description

To demonstrate our classification methods, we use the specific sample of CRTS data used by Faraway et al. (2014). The samples from CRTS data provide the position of the object on the sky with right ascension (RA) and declination (Dec), the observed time of the object as a Julian Date, the brightness as a magnitude, and an error estimate of the magnitude for each light curve. In the sample data set, we have 3720 light curves, which consist of 1749 transients detected by CRTS and 1971 non-transients. Transients include Active Galactic Nuclei (AGNs), Blazars, Cataclysmic Variables (CV), Flare stars, and Supernovae (SNe).

Here we briefly describe the characteristics of each transient type: an AGN is a center of a galaxy which is more luminous than the entire rest of the galaxy. AGNs are extragalactic, i.e., outside our galaxy, and have stochastic variability. They can be categorized into several types such as Seyfert galaxies, radio galaxies, quasars, LINERs, and blazars. Blazars are a subset of AGNs; they emit a jet - a long, collimated, supersonic flow of plasma emanating from a compact celestial object (Clarke et al., 2008) - pointing at the Earth. They are variable at all time scales, from hours to many years. Note that the AGNs in our data set are the ones which are not blazars. CVs are binary stars (two stars orbiting around a common center) in which a relatively normal star is transferring mass to its compact companion (Smith, 2007). This causes a strong, stochastic variation in brightness. They are galactic, i.e., within our galaxy, and their outbursts in brightness are more peaky than those of blazars. Flares are also galactic stars and have a single sharp peak in brightness for a short time period. Since a second peak in brightness may appear only after a rather long period, we rarely see two peaks in the light curve of a single flare star. Lastly, Supernovae are extragalactic stars with a dramatic outburst in brightness due to a stellar explosion, which fades over several weeks or months. They have many subtypes: Type Ia, Type Ib, Type Ic, Type II-P, and Type II-L.

In addition to the five types of transients, two types of bright variables were included: Cataclysmic Variables from the Downes set (Downes et al., 2005) and RR Lyrae. RR Lyraes are the only periodic variable stars in our data set and they have a periodicity of ∼1 day. Since their absolute magnitudes are known, they are good measures for computing distances within the Galaxy (Jones and Lambourne, 2004). Finally, a set of 15 random pointings and objects within 3' of those pointings were included as non-transients, because if there were any transients in those regions, they would have been seen earlier.

The smallest number of observations in a light curve is 5 and the largest is 641. Light curves with fewer than 5 observations were omitted from the data set, since too few observations do not provide meaningful information. The earliest date for light curves in the sample was Modified Julian Day (MJD) 53464 and the latest was 56228, which convert to April 5, 2005 and October 28, 2012. Therefore, the data span 2764 days.

We use this data set so that we can compare our methods with other light curve measures.

         Transients                              Bright variables        Functionally transient
         AGN    Blazar   CV    Flare   SNe       CV Downes   RR-Lyrae    non-transient
Count    140    124      461   66      536       376         292         1971
Table 5.1. Number of light curves for each transient type and non-transient objects (Faraway et al., 2014)

In Faraway et al. (2014), 11 light curve measures are introduced and the classification is performed using these in combination with the 16 measures of Richards et al. (2011). The lists of these measures and details on them can be found in Table 2.1 and Table 2.2. For convenience, we will call these sets "Richards measures" and "Faraway measures". As explained in Sections 2.2.2 and 2.2.3, the Richards measures are statistics derived from the original light curves and the Faraway measures are derived from fitted Gaussian process regression (GPR) lines of the light curves. These measures are easy to derive and apply for classification: once the measures are computed, one simply applies them to classification methods such as SVM, random forests, neural networks, etc. However, these features might not be robust, which means that even though they work well with CRTS data, they might not perform well with other astronomical surveys. Also, GPR lines can be very different depending on hyperparameters such as the prior mean and the length-scale, as well as the covariance function, and choosing these is rather subjective. Since our methods depend solely on the dt and dmag distributions, i.e., on how transients change their brightness, we expect less variation across different transient surveys. We will show how our methods work for different types of transients and compare them with the other measures. The data set and relevant R code were downloaded from http://people.bath.ac.uk/jjf23/modlc/.

5.3 Exploratory Analysis

As described in Chapter 3, we use (dt, dmag) pairs for our analysis instead of the original times and magnitudes from the light curves, based on the assumption that the (dt, dmag) distributions for different transients have different shapes. Before applying any classification methods to the real data, we plot dmag and the joint probability density estimates of (dt, dmag) for different types of transients to check this assumption. Figure 5.1 shows boxplots of dmag for different types of objects; the top panel covers the entire range of dmag and the bottom panel magnifies the range between -2 and 2. We can see that there are differences in variation for different types of objects. In Figure 5.2, we present the light curves of an AGN, a Flare, a Supernova, and a non-transient, and Figure 5.3 provides the corresponding (dt, dmag) density plots obtained by kernel density estimation. The AGN light curve and density plot show stochastic variability at all time scales. The flare has a single peak which lasts for a short time period around MJD 2500, and this is reflected in the (dt, dmag) density plot around dmag of -4. The supernova light curve has a bright peak which fades over several weeks, and its density plot has frequencies only between -3 and 3 in dmag for short dt, with zero probability elsewhere. The non-transient density plot has frequencies only at dmag of zero and shows no dramatic changes over almost all time periods. As the features of each type of transient are reflected in the (dt, dmag) densities, we expect our classifiers based on dt and dmag to perform well by detecting the different attributes of different transients.

Figure 5.1. Top: A boxplot of dmag for different types of transients (entire range). Bottom: A boxplot of dmag between -2 and 2.

Figure 5.2. Light curves of an AGN (CSS071216:110407-045134), a Flare (CSS111103:230309+400608), a Supernova (CSS071111:095436+045612), and a non-transient

5.4 Classification by Distance Measures

For classification, we first split the data into a training set and a test set. We use 2/3 of the data as the training set and 1/3 as the test set. The training set is used to build the classification model, and the model is applied to the test set to evaluate how well it performs via the classification rate. Although there are several distance measures which can be used to estimate the distance between two probability distributions, here we only consider the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the Hellinger distance for the analysis. We present only these three because they are widely used measures and the classification results for other measures, when applied to our data, are similar or worse.

Figure 5.3. (dt, dmag) density plots by kernel density estimation for an AGN (CSS071216:110407-045134), a Flare (CSS111103:230309+400608), a Supernova (CSS071111:095436+045612), and a non-transient. These are the same objects whose light curves are presented in Figure 5.2. The number of grid points used for evaluating the densities is 50 on each side.

In Section 5.4.1, we describe the parameter settings of each distance measure used to perform classification, and in Section 5.4.2, the results are given. How to incorporate other measures into our methods, and the classification rates after incorporating them, are given in Section 5.4.3.

5.4.1 Settings

In this section, we explain how to choose the necessary tuning parameters for each distance measure. For example, the number of grid points and the bandwidth need to be decided for the Kullback-Leibler divergence, and the dt and dmag bins and a pseudocount α for a Bayesian estimate of the KL divergence. Parameter settings for the KL divergence and its Bayesian estimate are given in Sections 5.4.1.1 and 5.4.1.2, respectively. In Section 5.4.1.3, the settings for the Jensen-Shannon divergence and the Hellinger distance are explained.

5.4.1.1 Kullback-Leibler Divergence

To evaluate densities for (dt, dmag), we take every possible combination of (dt_ij, dmag_ij) for 0 ≤ j < i ≤ 2764 and apply kernel density estimation to each light curve. The number of grid points used for evaluating the kernel densities is 100 on each side, and the rule of thumb is applied for choosing the bandwidth. We might get more accurate results with more grid points, but that is computationally expensive, and we can see in Figure 5.3 that the features of transients are reflected well in the densities with only 50 grid points, so 100 grid points should be enough. The KL divergence can be calculated with the R package entropy. Note that the KL divergence KL(p, q) cannot be calculated if p ≠ 0 but q = 0. To make every density non-zero, we add a small arbitrary number, 1e-08, to all probability densities.

5.4.1.2 Bayesian Estimate of Kullback-Leibler Divergence

For a Bayesian estimate of the KL divergence (or the KL divergence with smoothing), we first want to choose the optimal dt and dmag bins, u = {u_1 < u_2 < ... < u_(a−1) < u_a} and v = {v_1 < v_2 < ... < v_(b−1) < v_b}, into which dt and dmag will be grouped. In order to determine these values, we draw the histograms of dt and dmag. The left panel in Figure 5.4 presents dt over its entire range and the right panel only shows dt values which are less than 100. As we can see, a large number of dt values are less than 1. The histogram of dmag is given in Figure 5.5; dmag values are clustered around 0 and there are not many observations outside of ±2. Therefore, we decide to group dt between 0 and 1 together and use larger intervals beyond 1. For dmag, we group the values between the minimum of dmag and -2 together, the values between +2 and the maximum of dmag together, and use intervals of 0.1 for the values between -2 and +2, since the resolution of CRTS images is not greater than 0.1.

Figure 5.4. Left: The histogram of the entire dt. Right: The histogram of dts which are less than 100.

Figure 5.5. The histogram of dmag

Considering the dt and dmag distributions, several different dt and dmag bins have been tried for classification, and we choose the following as they perform better than the others. However, we cannot be sure these dt and dmag bins are the best choices, and we might get better results by selecting different bins. How to choose dt and dmag bins with a systematic algorithm could be studied further for improvement.

dt bins:   {0, 1, 100, 500, 1000, 1500, 2764}
dmag bins: {−11.8, −2.0, −1.9, −1.8, ..., 1.8, 1.9, 2.0, 11.5}

Note that 0 and 2764 in the dt bins and -11.8 and 11.5 in the dmag bins are the minimum and maximum values of dt and dmag, respectively. These bins are used not only for the Bayesian estimate of the KL divergence, but also for the JS divergence, the Hellinger distance, and the grouping method in later sections. Since KLD(p, q) is not the same as KLD(q, p), we explore classification by KLD(p, q), KLD(q, p), and KLD(p, q) + KLD(q, p), where p is the distribution of a light curve in a training set and q of one in a test set. We denote these as KLD1, KLD2, and KLD3. In other words, KLD1 is the KL divergence of a test light curve distribution from a training distribution, KLD2 is the divergence of a training light curve distribution from a test distribution, and KLD3 is a symmetric distance between the two distributions. For the Bayesian estimate of the KL divergence, the pseudocount α, which is the parameter of the Dirichlet prior, needs to be decided. Some popular choices of α are given in Section 2.6.1.2. However, instead of choosing a particular prior parameter, we try several values of α and choose one based on the results, since our goal is to find the model which yields the best performance with the data we have. In other words, we choose the α which maximizes the classification rate after applying multiple values to the CRTS data. For example, the classification rates with KLD3 for different α are given in Figure 5.6, and we can see that α = 0.5 gives the best result. In the case of KLD3 we therefore choose 0.5 as our α, but different values can be chosen for different data sets or different distance measures.

5.4.1.3 Jensen-Shannon Divergence and Hellinger Distance

The Jensen-Shannon divergence is closely related to the Kullback-Leibler divergence, but the difference is that it always has a finite value, so we do not need to add a small number to the probability densities. Similarly to the KL divergence, we evaluate the (dt, dmag) densities by kernel density estimation with 100 grid points on each side and choose the bandwidth by the rule of thumb.

Figure 5.6. Classification rates of KLD3 for different choices of α

Once we obtain the (dt, dmag) probability density functions, we can compute JS(p, q), where p is the pdf for a light curve in a training set and q for one in a test set. We mentioned earlier that a Bayesian estimate of the KL divergence can lead to a significant improvement over the standard KL divergence. Therefore, we also check the results of the JS divergence obtained by plugging a Bayesian estimate of the KL divergence into Equation (2.9), to see if there is an improvement over using the standard KL divergence. We denote the JS divergence obtained with the standard KL divergence and kernel density estimation as JS1, and the one with a Bayesian estimate of the KL divergence as JS2. Similarly to the Bayesian estimate of the KL divergence, a pseudocount α needs to be determined for JS2, and again we use the empirical Bayesian approach and choose the one which gives the best classification result. The values of α chosen by this approach for KLD1, KLD2, KLD3, and JS2 are given in Appendix A.1. Lastly, for classification with the Hellinger distance, we use the count vector obtained by binning, i.e., a discrete distribution, instead of kernel density estimation. This is computationally efficient and likely to give more accurate results since it does not involve smoothing.
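For completeness, the other two distances used in this chapter can be computed from the same binned pmfs; a minimal sketch, where the JS divergence uses the mixture m = (p + q)/2 and the Hellinger distance is written with the 1/√2 normalization (the exact convention is the one defined in Chapter 2, which may differ by this constant):

    # Jensen-Shannon divergence and Hellinger distance for discrete pmfs p and q.
    js_div <- function(p, q) {
      m <- (p + q) / 2
      g <- function(a) sum(ifelse(a > 0, a * log(a / m), 0))   # treat 0 * log(0) as 0
      0.5 * g(p) + 0.5 * g(q)
    }

    hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)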

5.4.2 Results

Once we have observations to classify, the first thing we need to do is identify whether the object is a transient or a non-transient. If it turns out to be a transient, it can be further classified into a sub-type of transients. Correctly classifying objects as transients or non-transients is the most important part, because a large number of newly detected astronomical objects arrive every day, and even a small classification error rate will cause many transients to be classified as non-transients. A smaller misclassification rate implies a lower chance of missing interesting events.

To perform classification of transients versus non-transients, we first combine all the transients, including the bright variables (AGN, Blazar, CV, Flare, SNe, Downes, and RR-Lyrae), so that we have only two classes. Our distance measures - the KL divergence, the JS divergence, and the Hellinger distance - are applied for classification, and the best performing result is given below. Table 5.2 shows the contingency table for classification of transients versus non-transients with KLD2. A comparison of classification rates for all the distance measures is given later. Note that all classification results are presented as percentages.

For KLD2, the misclassification rate is only about 10% and 87% of transients are correctly classified.

                Actual
        trans       nv
trans   87.2        7.9
nv      12.8        92.1
Table 5.2. Transients vs. non-variables classification by KLD2 with α = 0.6

5.4.2.1 Binary Classification

Next, we present the results for binary classification of AGN, Blazar, CV, Flare, and SNe using our distance measures. For example, to perform supernova classification, we group all other transients (AGN, Blazar, CV, and Flare) together, label them as non-SNe, and perform the classification of SNe versus non-SNe.

As we can see in Table 5.1, the data used for this analysis are highly imbalanced: there are relatively many observations for SNe and CV, but only about 100 to 150 for AGN, blazars, and flares, which account for less than 10% of the entire data set. This means that for AGN, blazars, and flares, we could achieve a 10% classification error rate simply by classifying all objects as non-AGN, non-Blazar, or non-Flare. Therefore, even though we achieve high classification rates, that does not mean the classifier discriminates transients well. For this reason, we investigate completeness and contamination for each type of classification instead of the average classification rate.

See Table 5.3 for an example. In SNe versus non-SNe classification, completeness is the probability of SNe actually being classified as SNe, and contamination is the probability of non-SNe being classified as SNe. Therefore, if Table 5.3 shows the counts for classification of SNe versus non-SNe, the completeness is (i)/((i)+(iii)), the contamination is (ii)/((ii)+(iv)), and the classification rate is ((i)+(iv))/((i)+(ii)+(iii)+(iv)).

                Actual
        SN      non-SN
SN      (i)     (ii)
non-SN  (iii)   (iv)
Table 5.3. Example of a contingency table for SNe versus non-SNe classification
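Restating these definitions with the cells of Table 5.3, a small sketch (the counts shown in the example call are invented purely for illustration):

    # Completeness, contamination, and classification rate from the cells of Table 5.3.
    rates <- function(i, ii, iii, iv) {
      c(completeness  = i / (i + iii),
        contamination = ii / (ii + iv),
        class_rate    = (i + iv) / (i + ii + iii + iv))
    }

    rates(i = 160, ii = 15, iii = 20, iv = 1000)   # toy counts only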

Tables 5.4 and 5.5 show the completeness and contamination achieved by each distance measure for the different types of transients. The completeness results demonstrate that our distance measures detect transients well except for blazars and flares. Also, the false alarm rates (contamination) are low enough except for CV and SNe.

        KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN     80.6   84.7   84.1   85.0   80.5   84.5   86.4
Blazar  24.0   38.6   37.5   39.5   25.4   32.9   39.8
CV      54.4   61.1   67.9   64.2   51.3   63.4   60.0
Flare   41.4   36.5   47.2   40.8   40.2   40.8   39.2
SN      80.3   83.7   81.6   82.9   83.0   84.0   80.0
Table 5.4. Completeness: Percentages correctly classified to each type of transients

In general, classification with binning (KLD1, KLD2, KLD3, JS2, and HD) yields better results than classification with kernel density estimation (KL and JS1). For AGN and Blazar, HD and KLD3 work best (note that Blazars are a subset of AGNs).

        KL    KLD1   KLD2   KLD3   JS1   JS2    HD
AGN     3.3   2.3    2.4    2.8    3.1   2.4    3.4
Blazar  3.3   3.3    2.7    3.5    2.4   2.4    3.4
CV      9.8   10.0   11.5   10.6   9.1   10.1   10.1
Flare   5.0   3.0    3.1    3.1    4.9   3.2    2.9
SN      8.7   9.7    9.0    9.4    9.8   10.5   5.7
Table 5.5. Contamination: False alarm rates

For CV and Flare, KLD2 is recommended (CV has a sharp peak compared to AGN or Blazar, so over a short time interval it can look like a Flare, which has a single sharp peak). For Supernovae, JS2 achieves the best performance. In terms of false alarm rates, there is not much difference between kernel density estimation and binning. The measures which achieve the lowest false alarm rates for each type of transient are KLD1, JS1, JS1, HD, and HD, respectively. Especially for Supernovae, HD has a 3-5% lower rate than the others.

Comparisons of completeness and contamination with other light curve measures are given in Tables 5.6 and 5.7. Faraway et al. (2014) compare the classification results of their measures combined with Richards' against the Richards measures alone, using linear discriminant analysis (LDA), recursive partitioning (Tree), support vector machines (SVM), neural networks (NN), and random forests (RF). Following Faraway's paper, we compare our methods with (i) the Richards measures alone and (ii) the Faraway measures in addition to the Richards set. The highest completeness achieved by the Richards measures after trying LDA, Tree, SVM, NN, and RF is given in the first row of Table 5.6, and similarly the results of the Faraway measures in addition to Richards' are given in the second row. Among our distance measures - KL, KLD1, KLD2, KLD3, JS1, JS2, and HD - the one which gives the best result is presented. The results for contamination are given in Table 5.7.

We can see that our methods achieve higher completeness than any other method except for flares, which means they are less likely to miss interesting events. On the other hand, the other methods have lower contamination than our distance methods, although our methods achieve low enough false alarm rates except for CV and SN. We might be able to compensate for the higher contamination rates by incorporating the Richards and Faraway measures into our distance measures, but we do not present that here, since we only want to show that our methods have high completeness and low contamination for binary classification. For multi-class classification, we present the results of incorporating other measures in Section 5.4.3.

                    AGN      Blazar   CV       Flare    SN
Richards            46.0 a   23.4 a   48.7 a   42.0 b   63.2 c
Richards+Faraway    70.3 d   27.7 a   55.5 d   51.1 a   78.3 c
Our methods         86.4 e   39.8 e   67.9 f   47.2 f   84.0 g

a Tree; b LDA; c RF; d SVM; e HD; f KLD2; g JS2
Table 5.6. Comparison of the Richards measures, the Faraway measures, and our distance measures in terms of completeness

                    AGN   Blazar   CV     Flare   SN
Richards            2.1   2.3      9.4    2.3     8.1
Richards+Faraway    0.8   2.4      5.2    0.5     5.9
Our methods         3.4   3.4      11.5   3.1     10.5
Table 5.7. Comparison of the Richards measures, the Faraway measures, and our distance measures in terms of contamination

5.4.2.2 Multi-class Classification

The next step is multi-class classification. Here we perform three types of classification: "All"-type classification assigns objects to one of the 8 classes in the data set - AGN, Blazar, CV, Downes, Flare, RR Lyr, SN, and NV (non-transient). "Transient or not" classification categorizes data into transients or non-transients (NV), as described earlier in this section. "Transient only" classification uses only the 7 transient classes, excluding non-transients. Percentages correctly classified to each class are shown in Tables 5.8, 5.9, and 5.10 for all-type, transients versus non-transients, and transient-only classification, respectively.

          KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN       71.1   79.4   79.1   80.3   79.7   79.9   83.9
Blazar    18.8   35.4   35.4   35.4   21.1   29.6   26.3
CV        46.1   56.0   62.1   57.7   44.4   58.6   57.2
Downes    27.6   36.5   31.8   32.3   26.0   31.7   43.6
Flare     25.9   20.9   30.2   24.0   21.0   23.8   28.6
NV        88.7   92.1   92.6   92.7   89.9   93.1   83.3
RR Lyrae  67.2   80.8   78.5   80.6   66.5   77.5   78.5
SN        77.1   79.6   79.3   79.8   79.6   80.9   77.6
Table 5.8. Completeness for all-type classification

           KL     KLD1   KLD2   KLD3   JS1    JS2    HD
Transient  79.5   86.8   87.2   86.4   78.8   85.4   90.9
NV         89.0   91.6   92.1   92.5   89.2   93.0   82.9
Table 5.9. Completeness for transients vs. non-transients classification

          KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN       79.1   84.7   81.9   84.0   77.9   84.9   85.9
Blazar    23.6   37.2   36.1   38.2   27.1   30.0   38.7
CV        53.6   59.4   67.8   61.8   50.3   63.4   57.8
Downes    56.7   54.3   48.0   50.1   58.0   49.7   51.2
Flare     37.2   36.7   44.4   39.7   40.8   38.8   38.9
RR Lyrae  80.4   87.7   83.9   86.0   80.5   86.0   88.4
SN        80.6   82.7   81.8   82.6   82.5   83.4   81.1
Table 5.10. Completeness for only-transient classification

In general, classification by binning works better than kernel density estimation. From Tables 5.8 and 5.10, we can see that HD works best for AGN. For Blazar, CV, Flare, and RR Lyrae, the KL divergence with smoothing works better than the other methods. For Supernovae, all the measures achieve ∼80% classification rates, with JS2 the highest among them. The results for transients vs. non-transients in Table 5.9 show that HD outperforms the other methods and misses only 10% of transients.

Now, we compare the overall classification rates of our methods with those of the other light curve measures. The results of these three types of classification by the Richards and Faraway measures are given in Table 5.11 and Table 5.12, and the overall percentages correctly classified using our distance measures are given in Table 5.13. Our methods do not outperform the Faraway measures in addition to the Richards set in general, but clearly work better than the Richards measures alone. Contingency tables for the best performing methods, for ours and for the Faraway and Richards measures combined, are given in Appendix A.3.1. We can see that our method works as well as or better than the Faraway measures in addition to the Richards set for AGN, CV, Flare, NV, and SN in all-type classification, and for AGN, Downes, RR Lyrae, and SN in only-transient classification. Since Downes is just another form of CV, Downes classified as CV, or CV classified as Downes, is not of much concern (Faraway et al., 2014).

                    LDA    TREE   SVM    NN     RF
All                 56.7   58.6   66.1   63.3   67.3
Transient or not    74.7   79.5   81.0   75.2   82.5
Transient only      54.5   58.9   56.5   60.1   62.9

Table 5.11. Percentage correctly classified using the Richards measures only

                    LDA    TREE   SVM    NN     RF
All                 76.0   71.9   80.2   79.6   80.5
Transient or not    90.4   88.4   92.0   91.6   91.8
Transient only      70.1   65.1   74.3   72.3   74.2

Table 5.12. Percentage correctly classified using the Faraway measures in addition to the Richards measures

                    KL     KLD1   KLD2   KLD3   JS1    JS2    HD
All                 69.7   75.7   76.1   75.9   69.7   75.8   72.0
Transient or not    83.3   89.0   89.5   89.2   83.3   88.9   87.2
Transient only      68.1   68.5   68.5   68.1   65.9   68.1   67.5

Table 5.13. Percentage correctly classified using our method

5.4.3 Incorporating Other Measures

In this section, we present how other light curve measures can be incorporated into our methods. There are many measures which can be used for classification of light curves, but here we only consider the Richards and Faraway measures, which are available to us right now. Once other features become available, we can try incorporating them into our methods as well. There are 16 Richards measures and 11 Faraway measures we can use. However, instead of using all 27 measures, we first want to test whether dropping any of them can improve the performance when they are combined with our methods. Stepwise selection is implemented to find the best set of measures: starting with all 27 measures, we remove variables one by one from the model as long as this improves the classification rate. We repeat this procedure until there is no further improvement and obtain the final set of measures. The order in which the measures are removed is determined by the mean decrease in the Gini index (Faraway et al., 2014). The Gini index is the measure of node impurity in random forests, and the mean decrease in it determines the variable importance. Therefore, starting from the measure with the least importance, we keep removing measures until the best rate is achieved.

Figure 5.7 shows the stepwise selection for all-type, transients versus non-transients, and transient-only classification by KLD3. Starting from all measures, the plot shows the classification rates when the specified measures on the x-axis are excluded from the set. In this case, it shows a 1%-2% improvement after stepwise selection, which can be more or less for different distance measures. Although there is not much difference in the results between keeping the full set of measures and dropping some of them, we choose the set which gives the best result and keep the procedure here, since it can also be useful for future applications with other measures.

The final sets of Richards and Faraway measures selected for each type of classification with KLD3 are given below, followed by a small sketch of the selection loop. The descriptions of the Richards and Faraway measures can be found in Tables 2.1 and 2.2, and the subsets of measures used for the other distance measures are given in Appendix A.2.1. Note that "all measures" indicates every measure included in Tables 2.1 and 2.2.

All: rcorbor, beyond1std, fpr80, fpr35, medbuf, kurtosis, peramp, maxdiff, dscore, gtvar, quadvar, gscore, fslope, famp

Transient or not: fpr35, mad, fpr50, maxdiff, kurtosis, peramp, maxslope, gtvar, shov, totvar, lsd, gscore, fslope, quadvar, famp

Transient only: all measures except rcorbor, fpr20, pdfp
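The selection loop itself is simple to implement. The following is a minimal R sketch of the procedure under simplifying assumptions: the candidate measures are stored in a data frame feat, the class labels in a factor type, and, for brevity, the out-of-bag accuracy of a random forest is used as the selection criterion in place of the classification rate of the distance method used in this dissertation; all names are illustrative.

library(randomForest)

## Backward selection driven by the mean decrease in the Gini index:
## repeatedly drop the least important measure while the rate improves.
stepwise.gini <- function(feat, type) {
  keep <- colnames(feat)
  best <- mean(predict(randomForest(x = feat, y = type)) == type)  ## OOB accuracy
  repeat {
    rf   <- randomForest(x = feat[, keep, drop = FALSE], y = type)
    imp  <- importance(rf)[, "MeanDecreaseGini"]
    cand <- setdiff(keep, names(which.min(imp)))   ## drop the least important measure
    rate <- mean(predict(randomForest(x = feat[, cand, drop = FALSE],
                                      y = type)) == type)
    if (rate <= best) break                        ## stop when there is no improvement
    keep <- cand
    best <- rate
  }
  keep
}

## Example call (illustrative): selected <- stepwise.gini(feat, type)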

Now, we present the percentages correctly classified to each class for all-type, transients versus non-transients, and only-transient classification in Tables 5.14, 5.15, and 5.16.

            KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN         80.5   86.6   80.8   87.8   73.8   83.8   84.2
Blazar      29.8   43.7   40.6   45.8   49.5   46.2   43.6
CV          50.4   58.8   63.2   70.1   61.8   63.2   62.9
Downes      31.9   41.5   38.1   34.4   41.8   40.9   39.9
Flare       40.9   41.5   38.8   47.8   46.8   41.8   47.4
NV          90.9   79.2   80.5   92.7   91.4   79.6   77.5
RR Lyrae    87.2   93.7   93.0   95.0   94.4   94.3   95.8
SN          81.0   83.1   85.0   81.8   81.5   84.3   82.7

Table 5.14. Completeness for all-type classification

            KL     KLD1   KLD2   KLD3   JS1    JS2    HD
Transient   83.2   92.0   92.5   91.5   90.7   90.6   91.9
NV          90.2   90.9   92.9   91.8   91.1   90.6   90.5

Table 5.15. Completeness for transients vs. non-transients classification

            KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN         85.6   91.7   80.0   86.3   83.1   90.9   88.1
Blazar      38.0   44.7   45.0   41.0   48.3   48.5   44.3
CV          55.5   62.6   63.3   63.2   64.9   65.0   64.0
Downes      57.5   57.6   54.3   55.5   55.6   54.6   55.9
Flare       54.5   58.1   68.4   52.4   54.8   56.1   57.1
RR Lyrae    91.4   95.3   95.2   94.3   94.7   95.7   96.4
SN          84.4   86.7   85.6   86.8   82.8   86.2   84.0

Table 5.16. Completeness for only-transient classification

For all-type classification, KLD3 works best for AGN, CV, and NV. On the other hand, for Blazar and SN, JS1 and KLD2 yield the best performance. From the tables, we can see that HD is good at identifying objects in general, except for NV. Also, the classification rates of JS2 in only-transient classification are generally satisfactory.

Now, we present the results after incorporating the Richards and Faraway measures into our distance methods for three cases: all-type, transients versus non-transients, and transient-only classification. The results are given in Table 5.17.

                    KL     KLD1   KLD2   KLD3   JS1    JS2    HD
All                 78.9   79.8   79.6   79.9   78.0   79.6   78.2
Transient or not    88.5   92.6   92.7   92.6   90.2   92.6   91.6
Transient only      70.5   72.9   73.1   73.1   69.9   73.0   71.9

Table 5.17. Percentage correctly classified using our method with measures

We can see that the classification results are improved compared to the ones without measures. For all-type classification, our distance measures achieve rates about 15% higher than the Faraway and Richards measures together for AGN and Flare, while we still achieve similar performance for the other classes. See Appendix A.3.1 for contingency tables. Also, the percentage correctly classified to transients in transients versus non-transients classification is slightly higher for our method. Although only the Richards and Faraway measures are considered in this dissertation, other features can be incorporated into our methods in the same way, and we might get better results when other astronomical features are combined with ours.

Figure 5.7. Stepwise selection for choosing measures to include for the distance method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the specified measure in the x-axis is excluded from the set.

5.5 Classification by Binning

Now we apply the binning method presented in Section 3.3. We use the same dt and dmag bins as were used in the Bayesian estimate of the KL divergence:

dtbin = {0, 1, 100, 500, 1000, 1500, 2764}
dmagbin = {−11.8, −2.0, −1.9, −1.8, ..., 1.8, 1.9, 2.0, 11.5}

Once the dt and dmag bins are determined, the number of observations in each bin is counted, and we get the count vector {c11, c12, . . . , ca−1,b−1}. For example, c11 is the number of observations which belong to {(dtij, dmagij) : 0 ≤ dtij < 1, −11.8 ≤ dmagij < −2.0}.
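For illustration, a count vector of this form can be computed for a single light curve in a few lines of R with cut() and table(). The variable names below are illustrative, the light curve is a simulated toy example, and cut() uses right-closed intervals, which differs slightly from the half-open bins defined above and from the explicit loops used in the code of Appendix B.

## Toy light curve and its pairwise time and magnitude differences
set.seed(1)
mjd <- sort(runif(30, 0, 2764)); mag <- rnorm(30, 17, 0.3)
dt   <- unlist(lapply(1:(length(mjd)-1), function(n) diff(mjd, lag = n)))
dmag <- unlist(lapply(1:(length(mag)-1), function(n) diff(mag, lag = n)))

## Bin the (dt, dmag) pairs
dtbin   <- c(0, 1, 100, 500, 1000, 1500, 2764)
dmagbin <- c(-11.8, seq(-2, 2, by = 0.1), 11.5)
dt.cut   <- cut(dt,   breaks = dtbin,   include.lowest = TRUE)
dmag.cut <- cut(dmag, breaks = dmagbin, include.lowest = TRUE)

counts <- table(dt.cut, dmag.cut)    ## 6 x 42 table of bin frequencies
cvec   <- as.vector(t(counts))       ## flattened count vector {c11, c12, ...}
length(cvec)                         ## (7 - 1) * (43 - 1) = 252 variables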

The count vector {c11, c12, . . . , ca−1,b−1} is used as the set of variables for LDA, Tree, SVM, NN, and RF. Here the lengths of the dt and dmag bin vectors are 7 and 43 respectively, so the number of variables is (7 − 1) ∗ (43 − 1) = 252. We present the results of our binning method in Table 5.18. It does not outperform the Faraway measures in general, but clearly works better than the Richards measures (see Tables 5.11 and 5.12 for comparison). Also, some of the classification methods, such as classification trees and random forests, yield low misclassification error rates, although LDA, SVM, and neural networks do not work well with our binning method. We can see that random forest yields better or similar classification rates compared to the Faraway measures in addition to the Richards set. The contingency tables given in Appendix A.3.1 show that the binning method with random forests misses only 8% of transients in transients versus non-transients classification, and works well for AGN, CV, RR Lyrae, and SN in transient-only and all-type classification. However, it needs improvement on Blazar, Downes, and Flare.

                    LDA    TREE   SVM    NN     RF
All                 61.3   69.8   54.7   60.5   80.1
Transient or not    71.0   84.8   79.4   74.2   91.8
Transient only      58.0   64.9   45.4   53.0   75.4

Table 5.18. Percentage correctly classified using the binning method

5.5.1 Incorporating Other Measures

As with the distance methods, we can incorporate the Richards and Faraway measures into our binning method. The procedure for the choice of measures based on stepwise selection is given in Figure 5.8. The resulting lists of measures for each classification type are as follows:

All: all measures except rm.medbuf, rm.kurtosis, and rm.peramp

Transient or not: all measures except medbuf, pairslope, mad, peramp, quadvar, and famp

Transient only: all measures except rcorbor and fpr80

Classification results after combining the above measures with our binning method are given below.

                    LDA    TREE   SVM    NN     RF
All                 74.7   73.4   71.5   65.0   83.4
Transient or not    89.2   90.0   91.1   83.2   93.9
Transient only      67.2   66.4   64.8   55.8   77.9

Table 5.19. Percentage correctly classified using the binning method with measures

We can see that the classification rates are improved in general, and random forest in particular performs very well, yielding better classification rates than the Faraway and Richards measures combined. If we compare the contingency tables for all-type classification, our method performs better for all types of transients except Blazar. Also, the percentage correctly classified to transients in transients versus non-transients classification is slightly higher. However, we need improvement in transient-only classification, since there is no class for which it clearly outperforms the other methods.

Figure 5.8. Stepwise selection for choosing measures to include for the binning method: The first value on the plot is the classification rate when every measure is used. After that, it shows the rate when the specified measure is excluded from the set.

5.6 Application on Newly Collected Data Sets

In this section, we demonstrate our methods by applying them to a relatively new data set of CSS transients from 2013. This data set is also from Faraway et al. (2014)^1 and consists of AGN, Blazar, CV, Flare, and SNe. We use the original transients data set as our training set to build a classification model and apply the model to the new data set to test it. The numbers of light curves for each transient type in the training set and the test set are given in Table 5.20. Light curves with fewer than 5 observations are excluded from the set.

        Training Set                          Test Set
AGN   Blazar   CV    Flare   SNe      AGN   Blazar   CV   Flare   SNe
139   124      458   64      536      153   32       92   46      251

Table 5.20. Numbers of light curves for each transient type (Faraway et al., 2014)

The maximum of dt for the new data set is 3501, and the maximum of the absolute value of dmag is around 8.7. Since we have a different dt and dmag range from the previous data set, we need to set up our dt and dmag bins again. After trying several binnings, we choose the following:

dtbin = {0, 1, 10, 100, 200, 400, ..., 1800, 2000, 2300, 2600, 2900, 3200, 3501}
dmagbin = {−8.7, −2.0, −1.9, −1.8, ..., 1.8, 1.9, 2.0, 8.7}

Although we could choose a smaller number of dt bins as we did in Section 5.4.1, we decide to use the above since it gives better results when the distance measures are applied to the new data set. Again, how to obtain reasonable dt and dmag bins for different data sets should be studied further in the future. First we perform binary classification for each type of transients. Completeness and contamination are given in Table 5.21 and Table 5.22. We can see that the distance measures work well with AGN and SN but need improvement on Blazar, CV, and Flare.

^1 Data set and relevant R codes are downloaded from http://people.bath.ac.uk/jjf23/modlc/

          KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN       82.4   88.2   88.9   90.8   80.4   81.7   90.8
Blazar    21.9   28.1   31.2   34.4   18.8   21.9   28.1
CV        65.1   47.7   64.0   55.8   60.5   54.7   64.0
Flare     57.8   57.8   48.9   57.8   53.3   48.9   53.3
SN        69.0   82.2   81.0   81.4   75.6   86.8   70.7

Table 5.21. Completeness: Percentages correctly classified to each type of transients

          KL     KLD1   KLD2   KLD3   JS1    JS2    HD
AGN       7.7    9.1    6.2    8.6    6.9    4.9    10.4
Blazar    6.3    4.4    6.3    6.5    6.3    2.9    5.7
CV        6.6    3.2    8.9    5.1    5.1    2.8    5.3
Flare     6.2    5.1    3.7    4.7    6.0    3.9    4.3
SN        15.5   23.4   20.6   21.8   17.1   25.0   13.0

Table 5.22. Contamination: False alarm rates

Comparisons of completeness and contamination with the Richards and Faraway measures are given in Tables 5.23 and 5.24. We can see that our method has higher completeness than the Richards measures for all classes except CV, but only for AGN and SN when compared to the Faraway measures in addition to Richards'. Also, the false alarm rate for SN is too high, so we might want to balance that with completeness. We expect to get better results if we incorporate astronomical features into our method.

                     AGN     Blazar   CV      Flare   SN
Richards             60.8a   18.8a    68.6b   44.4c   76.0b
Richards+Faraway     83.7c   40.6c    66.3d   71.1c   82.6b
Our methods          90.8e   34.4e    65.1f   57.8e   86.8g

a Tree; b RF; c NN; d SVM; e KLD3; f KL; g JS2

Table 5.23. Comparisons among Richards measure, Faraway measure, and our distance measure for completeness

Next, we perform multi-class classification. Table 5.25 shows the classification results for the 5 transient classes using the Richards measures and the Faraway measures. Since we only have 5 transient classes in the new data set, we cannot perform transients versus non-transients classification.

                     AGN    Blazar   CV     Flare   SN
Richards             4.2    4.4      5.5    1.2     18.4
Richards+Faraway     4.2    1.3      3.0    2.3     15.5
Our methods          8.6    6.5      6.6    4.7     25.0

Table 5.24. Comparisons among Richards measure, Faraway measure, and our distance measure for contamination

The results for our distance methods are also given in Table 5.26. "Ours+measures" indicates our distance methods incorporated with the Richards and Faraway measures. Similarly to Section 5.4.3, we use stepwise selection to choose the measures to include. The selected measures and the values of the pseudocount α for KLD1, KLD2, KLD3, and JS2 are given in Appendix A.1 and A.2.2.

                     LDA    TREE   SVM    NN     RF
Richards             60.8   64.0   72.6   71.1   72.0
Faraway+Richards     74.6   67.4   78.1   77.2   79.9

Table 5.25. Percentage correctly classified using the Richards measures and Faraway measures

                  KL     KLD1   KLD2   KLD3   JS1    JS2    HD
Our methods       68.5   71.1   72.6   71.5   69.5   73.3   71.3
Ours+measures     74.0   79.9   80.3   79.6   72.4   80.8   79.2

Table 5.26. Percentage correctly classified using our distance measures

Although our methods alone do not outperform the Faraway and Richards measures combined, after we incorporate the measures the classification rate of our best performing method (JS2) is slightly higher than that of Faraway's best performing method (RF). Not only JS2 but most of our distance measures give better results than the Faraway measures in addition to the Richards set. The contingency tables of the best performances (RF and JS2) are given for comparison in Appendix A.3.2, and we can see that JS2 with measures performs better than RF at identifying AGN and Blazar. Now we apply our binning method to the data. The dt and dmag bins used for the binning method are as follows:

dtbin = {0, 1, 100, 500, 1000, 1500, 3501}
dmagbin = {−8.7, −2.0, −1.9, −1.8, ..., 1.8, 1.9, 2.0, 8.7}

The results for our binning method and the binning method with measures are given in Table 5.27. The binning method alone achieves a classification rate above 60% for all classification tools except SVM. Also, we can see that random forest performs better than the Richards measures (see Table 5.25 for comparison). Now, if we look at the results after incorporating other measures into the grouping method, there are clear improvements for Tree and RF over the Faraway and Richards measures together. In particular, RF performs better than any other method.

                          LDA    TREE   SVM    NN     RF
Grouping                  62.9   61.6   38.7   60.0   76.5
Grouping with measures    70.3   75.1   67.4   66.7   82.1

Table 5.27. Percentage correctly classified using the binning method

In Appendix A.3.2, the contingency table for the binning method with measures by RF is given. The table shows that our method has higher classification rates for AGN, CV, and SN than Richards and Faraway measures combined but needs improvement on Flare.

5.7 Summary

Finally, we provide comparisons of the results of all the methods presented in Sections 5.4, 5.5, and 5.6. Table 5.28 summarizes the results of Sections 5.4 and 5.5, in which we applied our methods to the CRTS data with seven different types of transients. Table 5.29 is for Section 5.6, in which we performed classification with five different types of transients. In the tables, "Dist" stands for the distance methods and "Dist+meas" for the distance methods with the Richards and Faraway measures. Also, "Bin" and "Bin+meas" show the results for the binning method and the binning method with measures. We applied 7 different distance measures - KL,

KLD1, KLD2, KLD3, JS1, JS2, and HD - and 5 different classification tools for the binning method - LDA, Tree, SVM, NN, and RF. Out of these methods, only the one which gives the best result is presented in the tables. Similarly, for the Richards and Faraway measures, the best performances are shown.

                    Dist   Dist+meas   Bin    Bin+meas   Richards   Richards+Faraway
All                 76.1   79.9        80.1   83.3       67.3       80.5
Transient or not    89.5   92.7        91.7   93.7       82.5       92.0
Transient only      68.5   73.1        75.4   77.7       64.4       74.3

Table 5.28. Summary of classification results for Sections 5.4 and 5.5

        Dist   Dist+meas   Bin    Bin+meas   Richards   Richards+Faraway
        73.3   80.8        76.5   82.1       72.6       79.9

Table 5.29. Summary of classification results for Section 5.6

We have satisfying results with the binning method, but we might want to improve the classification performance of our distance measures, since they do not outperform the other methods. However, even though the overall accuracies are similar to or not better than the Faraway measures in addition to the Richards set, we did see that the completeness for certain types of transients is better for our distance methods. In particular, the distance methods seem to work very well at identifying AGN. We will get the optimal classification performance when we combine more than one classifier, because different classifiers work best with certain kinds of inputs. For example, Djorgovski et al. (2011) illustrates a classification fusion module which triggers different classifiers depending on the characteristics of available events. Since we showed that our methods work well in general and achieve better performance than the previous methods for certain types of transients, we expect that they can contribute more to the classification of variable stars and transients when they are combined with many other classifiers.

Chapter 6

Summary and Future Work

6.1 Summary

In this dissertation, we have proposed two methods for supervised classification of variable stars and transients: (i) use statistical distances to measure how similar the (dt, dmag) distributions of two different types of transients are, and (ii) group (dt, dmag) by binning and use the bin counts as our variables for classification. Along with the two methods, we explained how to incorporate other measures into them. For our analysis, statistical methods such as kernel density estimation and various classification tools were used in addition to the distance measures, and brief explanations of them were given in Chapter 2. Also, as the measures introduced by Richards et al. (2011) and Faraway et al. (2014) are incorporated into our methods, we reviewed how they obtained their measures and performed classification with them. The two suggested methods are presented in detail in Chapter 3 and demonstrated with simulated light curves as well as two real data sets in Chapters 4 and 5. We can use either kernel density estimation or binning to estimate the (dt, dmag) density, but the results in Chapter 5 suggest that we get better performance with binning. Out of many statistical distance measures, only three - the Kullback-Leibler divergence, the Jensen-Shannon divergence, and the Hellinger distance - were used, but we can apply other distance measures as long as they are appropriate for measuring distances between two probability distributions. Also, we have seen that we can get better results by incorporating other measures. Instead of using all the measures given to us, dropping some of them based on stepwise selection is recommended, which can lead to improvement of classification rates.

6.2 Future Work

In this section, we propose work that can be done in the future to improve our classification methods. We have used dt and dmag throughout our analysis and have seen that many dmag values are around 0. This makes our densities for (dt, dmag) have high frequencies near dmag = 0 and almost none for larger dmag. However, considering the characteristics of transients, our interest lies in the extreme values of dmag, not in the values near 0.

Figure 6.1. Light curve of CV (CSS071216:110407045134) and its (dt, dmag) kernel density plot

An example light curve of a CV (CSS071216:110407045134) is given in Figure 6.1 along with its (dt, dmag) distribution obtained by kernel density estimation. We can see that there exists an outburst in magnitude at an MJD of 1000, which would distinguish this light curve as a transient, and this large dmag is reflected in the kernel density plot as small bumps in the bottom left area. We expect to have better performance by giving weights to the probability density estimates or the bin counts which reflect outbursts. For instance, since the mean of dmag is likely to be around 0, we can give less weight to (dt, dmag) frequencies if dmag is close to the mean and more weight as they are farther from the mean.
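A minimal R sketch of one such weighting scheme, under the assumption that the mean of dmag is close to 0, is given below; the weight function, its scale, and the toy counts matrix are illustrative choices rather than a tested part of this work.

## Hypothetical up-weighting of extreme dmag bins
dmag.mid <- c(-6.9, seq(-1.95, 1.95, by = 0.1), 6.75)  ## centers of the 42 dmag bins
w <- 1 + abs(dmag.mid)                                  ## larger weight far from dmag = 0

counts <- matrix(rpois(6 * 42, 2), nrow = 6)            ## toy dt-by-dmag bin counts
weighted.counts <- sweep(counts, 2, w, "*")             ## weight each dmag column
weighted.den    <- weighted.counts / sum(weighted.counts)

The weighted counts (or the corresponding weighted density estimate) could then replace the raw count vector in the binning classifier or the distance computations.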

Appendix A

Details on Application to CRTS Data

A.1 Choice of Pseudocounts

                    KLD1   KLD2   KLD3   JS2
All                 1      0.9    0.9    0.1
Transient or not    0.8    0.6    0.9    0.1
Transient only      0.5    0.9    0.3    0.1

Table A.1. A pseudocount α for Section 5.4.2: Classification with distance measures

                    KLD1   KLD2   KLD3   JS2
All                 0.3    0.8    0.4    0.2
Transient or not    0.5    0.4    0.7    0.2
Transient only      0.5    0.4    0.7    0.1

Table A.2. A pseudocount α for Section 5.4.3: Classification with distance measures by incorporating Richards and Faraway measures

                    KLD1   KLD2   KLD3   JS2
without measures    0.9    1      0.6    0.9
with measures       0.2    0.3    0.7    0.1

Table A.3. A pseudocount α for Section 5.6: Classification with distance measures tested on newly collected data set

A.2 Choice of Measures

A.2.1 Measures for Section 5.4.3

All

KLD1   all measures except fpr20, fpr35, medbuf, kurtosis, skew, outl, std
KLD2   all measures except rcorbor, pairslope, fpr80, outl, maxdiff
KLD3   rcorbor, beyond1std, fpr80, fpr35, medbuf, kurtosis, peramp, maxdiff, dscore, gtvar, quadvar, gscore, fslope, famp
JS2    all measures except rcorbor, pairslope, fpr80, maxslope, skew, amplitude, lsd, std, totvar
HD     all measures except pairslope, beyond1std

Table A.4. Choice of measures obtained by stepwise procedure for all-type classification

Transient or not

KLD1   beyond1std, fpr20, fpr35, mad, maxdiff, outl, gtvar, shov, totvar, lsd, dscore, fslope, quadvar, famp
KLD2   all measures except medbuf, pairslope, mad, skew, pdfp, peramp, maxslope, amplitude, std
KLD3   fpr35, mad, fpr50, maxdiff, kurtosis, peramp, maxslope, gtvar, shov, totvar, lsd, gscore, fslope, quadvar, famp
JS2    all measures except rcorbor, medbuf, fpr35, fpr80, mad, skew, fslope, famp
HD     all measures except pairslope, beyond1std, fpr35, skew, fpr50, pdfp, quadvar, famp

Table A.5. Choice of measures obtained by stepwise procedure for transients vs. non-transients classification

Transient only

KLD1   all measures except rcorbor, maxslope, skew, outl, amplitude
KLD2   all measures except rcorbor, pairslope, beyond1std, fpr20, kurtosis, amplitude
KLD3   all measures except rcorbor, fpr20, pdfp
JS2    all measures except rcorbor, pairslope, fpr80, fpr50, skew, outl
HD     all measures except rcorbor, pairslope, outl

Table A.6. Choice of measures obtained by stepwise procedure for only-transient classification

A.2.2 Measures for Section 5.6

Distance measures

KLD1 all measures except rcorbor, mad, fpr80, dscore, skew, famp, fslope

KLD2 all measures except pairslope, fpr20, fpr35, beyond1std, medbuf, peramp, std, gscore, skew, amplitude

KLD3 mad, fpr50, pdfp, medbuf, maxslope, shov, std, gscore, amplitude, outl, fslope, gtvar, maxdiff, totvar

JS2    all measures except rcorbor, pairslope, fpr20, beyond1std, fpr50, lsd, medbuf, shov, skew, famp
HD     all measures except rcorbor, fpr20, fpr35, fpr50, lsd, skew

Table A.7. Choice of measures obtained by stepwise procedure for distance methods

Binning method   all measures except rcorbor, pdfp

Table A.8. Choice of measures obtained by stepwise procedure for grouping method

A.3 Contingency Tables

A.3.1 Tables of Section 5.4 and 5.5

                                  Actual
          agn    blazar  cv     downes  flare  nv     rrlyr  sn
agn       73.4   6.2     0.4    2.6     0.0    0.2    0.0    1.0
blazar    0.8    46.7    1.7    5.2     0.0    0.0    0.0    0.2
cv        2.5    11.9    62.8   21.6    4.5    0.6    0.7    6.2
downes    3.0    11.6    11.3   38.9    2.0    0.8    4.9    0.6
flare     0.0    0.0     0.7    0.8     30.3   0.3    0.0    0.2
nv        16.4   3.7     5.4    22.6    56.6   97.1   0.4    10.6
rrlyr     0.0    2.8     0.1    4.9     0.0    0.1    93.9   0.0
sn        3.8    17.1    17.6   3.5     6.6    1.1    0.2    81.3

Table A.9. All types: Richards+Faraway Measures (Random Forest)

                                  Actual
          agn    blazar  cv     downes  flare  nv     rrlyr  sn
agn       79.1   11.0    0.5    2.1     2.4    0.5    0.0    0.8
blazar    5.2    35.4    2.4    2.2     0.0    0.3    0.0    1.9
cv        0.9    10.5    62.1   23.5    12.1   1.2    0.4    7.4
downes    3.4    15.6    12.0   31.8    7.6    2.3    8.05   2.1
flare     0.6    1.5     2.1    4.5     30.2   1.0    1.4    1.3
nv        8.3    11.7    10.5   28.3    36.3   92.6   11.1   7.2
rrlyr     0.0    0.0     0.0    3.1     0.0    0.0    78.5   0.0
sn        2.6    14.4    10.4   4.5     11.4   2.1    0.5    79.3

Table A.10. All types: Our Method (KLD2)

                                  Actual
          agn    blazar  cv     downes  flare  nv     rrlyr  sn
agn       72.8   9.5     0.3    0.7     0.9    0.2    0.0    1.2
blazar    1.7    32.3    0.8    1.9     0.0    0.0    0.1    1.1
cv        0.1    19.6    65.2   21.1    10.0   0.5    0.3    5.3
downes    1.0    16.8    10.4   39.7    22.4   1.5    1.7    1.4
flare     0.0    0.2     0.0    0.4     18.9   0.1    0.0    0.3
nv        20.1   10.6    14.7   30.3    43.3   96.9   7.8    8.2
rrlyr     0.0    1.8     0.0    4.6     0.0    0.1    90.0   0.0
sn        4.3    9.2     8.6    1.4     4.5    0.8    0.1    82.4

Table A.11. All types: Our Method (Binning)

                                  Actual
          agn    blazar  cv     downes  flare  nv     rrlyr  sn
agn       87.8   12.5    2.0    2.3     0.0    0.3    0.0    0.6
blazar    2.4    45.8    6.1    5.5     0.0    0.9    0.0    3.0
cv        0.0    12.5    70.1   23.4    0.0    0.7    0.0    4.2
downes    2.4    18.8    6.1    34.4    0.0    3.2    4.0    2.4
flare     0.0    2.1     0.7    4.7     47.8   1.2    0.0    1.8
nv        7.3    0.0     6.1    21.1    39.1   92.7   1.0    6.1
rrlyr     0.0    0.0     0.0    7.0     0.0    0.0    95.0   0.0
sn        0.0    8.3     8.8    1.6     13.0   1.0    0.0    81.8

Table A.12. All types: Our Method (KLD3 with measures)

                                  Actual
          agn    blazar  cv     downes  flare  nv     rrlyr  sn
agn       76.4   7.1     0.2    0.3     0.5    0.1    0.0    0.8
blazar    0.2    41.1    1.3    2.8     0.0    0.0    0.0    0.5
cv        0.1    15.8    71.0   21.3    6.0    0.5    0.4    4.9
downes    2.4    19.1    10.8   45.3    7.5    1.1    3.0    0.9
flare     0.0    0.0     0.4    0.4     34.0   0.2    0.0    0.5
nv        18.1   3.9     6.3    22.9    47.7   97.7   1.0    7.5
rrlyr     0.0    1.9     0.0    5.6     0.0    0.0    95.2   0.0
sn        2.8    11.2    9.9    1.3     4.4    0.4    0.4    84.8

Table A.13. All types: Our Method (Binning with measures)

               Actual
        trans   nv
trans   91.6    5.3
nv      8.4     94.7

Table A.14. Transients vs. non-variables: Richards+Faraway Measures (SVM)

               Actual
        trans   nv
trans   87.2    7.9
nv      12.8    92.1

Table A.15. Transients vs. non-variables: Our Method (KLD2)

               Actual
        trans   nv
trans   91.9    8.3
nv      8.1     91.7

Table A.16. Transients vs. non-variables: Our Method (Binning)

               Actual
        trans   nv
trans   92.5    7.1
nv      7.5     92.9

Table A.17. Transients vs. non-variables: Our Method (KLD2 with measures)

               Actual
        trans   nv
trans   94.0    6.3
nv      6.0     93.7

Table A.18. Transients vs. non-variables: Our Method (Binning with measures)

                              Actual
          agn    blazar  cv     downes  flare  rrlyr  sn
agn       81.3   4.2     0.6    4.9     5.8    0.0    2.7
blazar    2.3    34.1    2.0    5.4     0.0    0.1    0.4
cv        1.4    13.5    68.3   22.5    5.9    0.8    7.6
downes    4.8    15.7    9.8    52.7    17.6   4.9    1.7
flare     0.6    0.0     0.4    2.2     54.4   0.0    0.4
rrlyr     0.0    13.1    0.0    8.1     0.0    94.2   0.1
sn        9.7    19.5    18.9   4.1     16.4   0.0    87.1

Table A.19. Transient only: Richards+Faraway Measures (SVM)

                              Actual
          agn    blazar  cv     downes  flare  rrlyr  sn
agn       84.7   16.3    0.8    4.0     2.9    0.0    0.9
blazar    3.5    37.2    4.2    3.4     1.5    0.1    2.5
cv        1.3    7.7     59.4   20.8    13.2   0.7    7.7
downes    5.8    18.9    15.9   54.3    36.0   9.8    4.3
flare     0.5    2.9     3.5    6.0     36.7   0.6    2.0
rrlyr     0.0    0.2     0.0    6.7     0.2    87.7   0.0
sn        4.2    16.9    16.1   4.9     9.4    1.0    82.7

Table A.20. Transient only: Our Method (KLD1)

                              Actual
          agn    blazar  cv     downes  flare  rrlyr  sn
agn       80.2   9.8     0.5    1.2     1.6    0.0    1.5
blazar    1.7    35.8    1.0    1.9     0.1    0.1    1.2
cv        2.7    21.4    76.8   24.3    18.6   0.6    7.3
downes    8.8    20.3    11.9   64.9    43.3   6.4    3.7
flare     0.0    0.6     0.1    0.4     25.6   0.0    0.5
rrlyr     0.0    1.1     0.0    5.1     0.0    92.6   0.0
sn        6.7    11.0    9.7    2.2     10.8   0.3    85.8

Table A.21. Transient only: Our Method (Binning)

                              Actual
          agn    blazar  cv     downes  flare  rrlyr  sn
agn       86.5   2.0     0.6    3.2     0.0    0.0    0.0
blazar    3.8    48.0    8.3    6.5     4.2    0.0    0.6
cv        0.0    10.0    65.4   25.8    8.3    0.0    7.9
downes    3.8    20.0    15.4   50.0    16.7   3.2    1.8
flare     0.0    0.0     0.6    3.2     62.5   0.0    1.8
rrlyr     0.0    0.0     0.0    5.6     0.0    96.8   0.0
sn        5.8    20.0    9.6    5.6     8.3    0.0    87.8

Table A.22. Transient only: Our Method (KLD2 with measures)

                              Actual
          agn    blazar  cv     downes  flare  rrlyr  sn
agn       83.7   6.8     0.5    1.0     0.8    0.0    1.2
blazar    0.4    42.6    1.2    2.5     0.0    0.0    0.4
cv        2.2    13.9    76.7   22.4    11.0   0.8    6.7
downes    7.2    21.4    11.5   65.8    22.6   4.4    2.9
flare     0.2    0.0     0.3    0.9     49.6   0.0    0.4
rrlyr     0.0    1.8     0.0    5.2     0.0    94.5   0.0
sn        6.4    13.6    9.8    2.2     16.1   0.3    88.4

Table A.23. Transient only: Our Method (Binning with measures)

A.3.2 Tables of Section 5.6

                       Actual
          agn    blazar  cv     flare  sn
agn       79.7   25.0    0.0    4.4    1.7
blazar    3.3    34.4    2.3    2.2    1.2
cv        0.7    6.2     67.4   0.0    4.1
flare     0.7    0.0     4.7    66.7   1.2
sn        15.7   34.4    25.6   26.7   91.7

Table A.24. Richards+Faraway Measures (Random Forest)

                       Actual
          agn    blazar  cv     flare  sn
agn       81.0   43.8    2.3    0.0    1.7
blazar    7.2    21.9    1.2    0.0    1.7
cv        0.7    6.2     53.5   2.2    3.3
flare     0.0    0.0     7.0    48.9   6.6
sn        11.1   28.1    36.0   48.9   86.8

Table A.25. Our Method (JS2)

                       Actual
          agn    blazar  cv     flare  sn
agn       81.7   43.8    0.0    2.2    2.1
blazar    0.7    28.1    2.3    2.2    1.7
cv        1.3    9.4     61.6   15.6   7.9
flare     0.0    0.0     3.5    48.9   0.8
sn        16.3   18.8    32.6   31.1   87.6

Table A.26. Our Method (Binning)

                       Actual
          agn    blazar  cv     flare  sn
agn       88.2   34.4    0.0    2.2    2.5
blazar    3.3    40.6    4.7    2.2    1.7
cv        0.7    3.1     65.1   6.7    4.1
flare     0.0    3.1     4.7    62.2   1.2
sn        7.8    18.8    25.6   26.7   90.5

Table A.27. Our Method (JS2 with measures)

                       Actual
          agn    blazar  cv     flare  sn
agn       83.7   34.4    0.0    2.2    0.4
blazar    0.0    34.4    3.5    2.2    1.7
cv        1.3    9.4     69.8   8.9    1.2
flare     0.0    0.0     5.8    57.8   1.2
sn        15.0   21.9    20.9   28.9   95.5

Table A.28. Our Method (Binning with measures)

Appendix B

R code

B.1 Classification by Distance Measures

## Run "General Settings", "one of the distance measures"
## and then the final part of the code

################### General Settings ############################
library(entropy)

load("lcdb.rda")
source("funcs.R")

rmid <- which(lcdbi$nobs <= 4)
id <- 1:length(lcdb)
id <- id[-rmid]
id <- id[id <= 1995]  ## add this line for only-transient

train.ind <- sample(id, round(2*length(id)/3))
train.ind <- sort(train.ind)
test.ind <- id[-match(train.ind, id)]

tmp <- lapply(1:8, function(k){

  datalabel <- which(lcdbi$type==levels(lcdbi$type)[k])
  firstind <- datalabel[1]
  lastind <- tail(datalabel, n=1)
  train.ind[which(train.ind >= firstind & train.ind <= lastind)]})

tr.train.ind <- train.ind[train.ind <= 1995]  ## add these lines
nv.train.ind <- train.ind[train.ind >= 1996]  ## for tr vs. nv

daterange <- 2764; firstdate <- 53464
grid <- seq(0, daterange, by=1)
sigv <- 0.27; wvec <- 2e-04

featmat <- rep(0, 41)
for(i in 1:length(lcdb)) {  ## use length(id) instead of length(lcdb) for only-transient
  lc <- lcdb[[i]]
  if (length(lc$mjd) <= 4) next
  featmat.new <- lcstats(lc, GPR=TRUE, kerntype="exponential", wvec,
                         firstdate, daterange, sigv, detection.limit=20.5)
  featmat <- rbind(featmat, featmat.new)
}
featmat <- featmat[-1,]
featmat <- data.frame(featmat)
colnames(featmat) <- c("meds", "iqr", "shov", "maxdiff", "dscore",
  "wander", "moveloc", "nobs", "totvar", "quadvar", "famp", "fslope",
  "trend", "outl", "skewres", "std", "shapwilk", "lsd", "gscore",
  "mdev", "gtvar", "rm.amplitude", "rm.mean", "rm.stddev",
  "rm.beyond1std", "rm.fpr20", "rm.fpr35", "rm.fpr50", "rm.fpr65",
  "rm.fpr80", "rm.lintrend", "rm.maxslope", "rm.mad", "rm.medbuf",
  "rm.pairslope", "rm.peramp", "rm.pdfp", "rm.skew", "rm.kurtosis",
  "rm.std", "rm.rcorbor")

featmat <- featmat[,c("rm.amplitude", "rm.beyond1std", "rm.fpr20",
  "rm.fpr35", "rm.fpr50", "rm.fpr80", "rm.maxslope", "rm.mad",
  "rm.medbuf", "rm.pairslope", "rm.peramp", "rm.pdfp", "rm.skew",
  "rm.kurtosis", "rm.std", "rm.rcorbor", "totvar", "quadvar", "famp",
  "fslope", "lsd", "gtvar", "gscore", "shov", "maxdiff", "dscore")]

############ Run one of the following distance measures ##########

### KL ###
library(MASS)  ## needed for kde2d() and bandwidth.nrd(); not loaded in the original listing

denvec <- vector("list", length(lcdb))
for(i in 1:length(lcdb)) {
  lc <- lcdb[[i]]
  if (length(lc$mjd) <= 4) next
  lc$mjd <- lc$mjd - firstdate

  dT.new <- unlist(lapply(1:(length(lc$mjd)-1), function(n)
    {diff(lc$mjd, lag=n)}))
  dMag.new <- unlist(lapply(1:(length(lc$mag)-1), function(n)
    {diff(lc$mag, lag=n)}))
  if (bandwidth.nrd(dMag.new)==0) next
  denvec[[i]] <- kde2d(dT.new, dMag.new, n=100, lims=c(0,2764,-12,12))$z
}

result <- rep(0, length(test.ind))
for (j in test.ind) {
  lc <- lcdb[[j]]

  if (length(lc$mjd) <= 4) next
  testdata <- denvec[[j]] + 0.00000001

kl.feat <- rep(NA, length(train.ind))

  for (i in train.ind) {
    lc <- lcdb[[i]]
    if (length(lc$mjd) <= 4) next
    traindata <- denvec[[i]] + 0.00000001

    kl <- KL.plugin(traindata, testdata) + KL.plugin(testdata, traindata)
    feat.dist <- dist(rbind(featmat[j,], featmat[i,]), method="canberra")
    kl.feat[which(train.ind==i)] <- kl/(1+kl) + feat.dist/(1+feat.dist)
  }

### JS1 ###

## substitute 'kl' in the "KL" code by the following

mym <- 0.5*(testdata + traindata)
LR1 <- ifelse(testdata > 0, log(testdata/mym), 0)
LR2 <- ifelse(traindata > 0, log(traindata/mym), 0)
js1 <- 0.5*sum(traindata*LR2) + 0.5*sum(testdata*LR1)

### KLD1 ###
dTseq <- c(0, 1, 100, 500, 1000, 1500, 2764)
dMagseq <- c(-11.8, seq(-2, 2, by=0.1), 11.5)

mat <- rep(0, (length(dTseq)-1)*(length(dMagseq)-1))
for(i in 1:length(lcdb)) {  ## use length(id) instead of length(lcdb) for only-transient
  lc <- lcdb[[i]]
  if (length(lc$mjd) <= 4) next
  lc$mjd <- lc$mjd - firstdate

mat.new <- rep(0, (length(dTseq)-1)*(length(dMagseq)-1))

  dT <- unlist(lapply(1:(length(lc$mjd)-1), function(n)
    {diff(lc$mjd, lag=n)}))
  dMag <- unlist(lapply(1:(length(lc$mag)-1), function(n)
    {diff(lc$mag, lag=n)}))

  dTint1 <- which(dT >= dTseq[1] & dT <= dTseq[2])
  dMag1 <- dMag[dTint1]

  for(j in 1:(length(dMagseq)-1)){
    dMagint <- which(dMag1 >= dMagseq[j] & dMag1 <= dMagseq[j+1])
    mat.new[j] <- length(dMagint)
  }

  for(n in 2:(length(dTseq)-1)){
    dTint <- which(dT > dTseq[n] & dT <= dTseq[n+1])
    dMag2 <- dMag[dTint]

    for(m in 1:(length(dMagseq)-1)){
      dMagint <- which(dMag2 > dMagseq[m] & dMag2 <= dMagseq[m+1])
      mat.new[(n-1)*(length(dMagseq)-1)+m] <- length(dMagint)

    }
  }
  mat <- rbind(mat, mat.new)
}
mat <- mat[-1,]

trainmat <- mat[match(train.ind,id), ]
testmat <- mat[match(test.ind,id),]
feat.trainmat <- featmat[match(train.ind,id), ]
feat.testmat <- featmat[match(test.ind,id),]

result <- rep(0, length(test.ind))
for (j in 1:length(test.ind)) {
  testdata <- testmat[j,]
  testden <- testdata/sum(testdata)
  kl.feat <- rep(NA, length(train.ind))

  for (i in 1:nrow(trainmat)) {
    traindata <- trainmat[i,]
    trainden <- traindata/sum(traindata)
    kld1 <- KL.Dirichlet(traindata, testdata, 0.1, 0.1)
    feat.dist <- dist(rbind(feat.testmat[j,], feat.trainmat[i,]),
                      method="canberra")
    kl.feat[i] <- kld1/(1+kld1) + feat.dist/(1+feat.dist)
  }

## change 0.1 to different alpha values
## use the following for KLD2 or KLD3
## kld2 <- KL.Dirichlet(testdata, traindata, 0.1, 0.1)
## kld3 <- KL.Dirichlet(traindata, testdata, 0.1, 0.1) +
##         KL.Dirichlet(testdata, traindata, 0.1, 0.1)

### JS2 ###
## substitute 'kld1' in the "KLD1" code by the following

mym <- 0.5*(traindata + testdata)
## KL.Dirichlet() requires the pseudocounts; set alpha (0.1 here)
## to the values listed in Appendix A.1
js2 <- 0.5*KL.Dirichlet(traindata, mym, 0.1, 0.1) +
       0.5*KL.Dirichlet(testdata, mym, 0.1, 0.1)

### HD ###
## substitute 'kld1' in the "KLD1" code by the following

hd <- sqrt(sum((sqrt(trainden)-sqrt(testden))^2))*(1/sqrt(2))

##################################################################
## the following is for all-type classification
## change the code according to classification type
## for example, for tr vs. nv classification, use
## kl1 <- kl.feat[tr.train.ind]; kl2 <- kl.feat[nv.train.ind]

kl1 <- kl.feat[ tmp[[1]] ]
kl2 <- kl.feat[ tmp[[2]] ]
kl3 <- kl.feat[ tmp[[3]] ]
kl4 <- kl.feat[ tmp[[4]] ]
kl5 <- kl.feat[ tmp[[5]] ]
kl6 <- kl.feat[ tmp[[6]] ]
kl7 <- kl.feat[ tmp[[7]] ]
kl8 <- kl.feat[ tmp[[8]] ]

kl1 <- kl1[!is.na(kl1)]
kl2 <- kl2[!is.na(kl2)]
kl3 <- kl3[!is.na(kl3)]

kl4 <- kl4[!is.na(kl4)]
kl5 <- kl5[!is.na(kl5)]
kl6 <- kl6[!is.na(kl6)]
kl7 <- kl7[!is.na(kl7)]
kl8 <- kl8[!is.na(kl8)]

  kl.sum <- c(min(kl1), min(kl2), min(kl3), min(kl4),
              min(kl5), min(kl6), min(kl7), min(kl8))
  result[j] <- which(kl.sum==min(kl.sum))
}

type <- lcdbi$type[test.ind]
type <- factor(type)
tab <- table(result, type)
tab

B.2 Classification by Binning

## Run the 'General Settings' from B.1
## and the code for getting 'mat' in the KLD1 code
library(randomForest)

newmat <- cbind(mat, featmat)
type1 <- lcdbi$type[train.ind]
type2 <- lcdbi$type[test.ind]
type1 <- factor(type1)
type2 <- factor(type2)
trainmat <- newmat[train.ind, ]
testmat <- newmat[test.ind,]
trainmat <- cbind(trainmat, type1)
model <- randomForest(type1~., trainmat)
pred <- predict(model, testmat, type="response")
tab <- table(pred, type2)
tab

Bibliography

[1] Agresti, A. and Hitchcock, D.B. (2005). Bayesian inference for categorical data analysis. Statistical Methods & Applications, 14(3), 297-330.

[2] Barning, F. J. M. (1963). The numerical analysis of the light-curve of 12 Lacertae. Bulletin of the Astronomical Institutes of the Netherlands, 17(1), 22-28.

[3] Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multi-Model Inference. Springer-Verlag, New York.

[4] Carroll, B. W. and Ostlie, D. A. (2006). An Introduction to Modern Astrophysics. Addison-Wesley.

[5] Carugo, O. and Eisenhaber, F. (2010). Data Mining Techniques for the Life Sciences. Humana Press.

[6] Clarke, D. A., MacDonald, N. R., Ramsey, J. P. and Richardson, M. (2008). Astrophysical Jets. Physics in Canada, 64(2), 47-53.

[7] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience.

[8] Debosscher, J., Sarro, L. M., Aerts, C., Cuypers, J., Vandenbussche, B., Garrido, R., and Solano, E. (2007). Automated supervised classification of variable stars. I. Methodology. Astronomy & Astrophysics, 475(3), 1159-1183.

[9] Djorgovski, S. G., Donalek, C., Mahabal, A. A., Moghaddam, B., Turmon, M., Graham, M. J., Drake, A. J., Sharma, N. and Chen, Y. (2011). Towards an Automated Classification of Transient Events in Synoptic Sky Surveys. CIDU, NASA Ames Research Center, 174-188.

[10] Djorgovski, S. G., Drake, A. J., Mahabal, A. A., Graham, M. J., Donalek, C., Williams, R., Beshore, E. C., Larson, S. M., Prieto, J., Catelan, M., Christensen, E., and McNaught, R. H. (2011). The Catalina Real-Time Transient Survey (CRTS). arXiv:1102.5004

[11] Doggett, J. B. and Branch, D. (1985). A Comparative Study of Supernova Light Curves. The Astronomical Journal, 90(2), 2303-2311.

[12] Downes, R. A., Webbink, R. F., Shara, M. M., Ritter, H., Kolb, U. and Duerbeck, H.W. (2006). A Catalog and Atlas of Cataclysmic Variables: The Final Edition. Journal of Astronomical Data, 11, 2-10.

[13] Drake, A. J., Djorgovski, S. G., Mahabal, A., Beshore, E., Larson, S., Graham, M. J., Williams, R., Christensen, E., Catelan, M., Boattini, A., Gibbs, A., Hill, R. and Kowalski, R. (2009). First Results from the Catalina Real-Time Transient Survey. Astrophysical Journal, 696(1), 870-884.

[14] Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. John Wiley & Sons, Inc.

[15] Dumitru, C. and Maria, V. (2013). Advantages and Disadvantages of Using Neural Networks for Predictions. Ovidius University Annals, Economic Sciences Series, 13(1), 444-449.

[16] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman & Hall, London.

[17] Faraway, J., Mahabal, A., Sun, J., Wang, X., Wang, Y., and Zhang, L. (2014). Modeling Light Curves for Improved Classification. arXiv:1401.3211v2

[18] Hart, J. D. (1997). Nonparametric Smoothing and Lack-of-fit Tests. Springer, New York.

[19] Hastie, T. J. and Tibshirani, R. (1990). Generalized Additive Models. Chapman & Hall, London.

[20] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

[21] Hausser, J. and Strimmer, K. (2013). entropy: Estimation of Entropy, Mutual Information and Related Quantities. R package version 1.2.0. http://CRAN.R-project.org/package=entropy

[22] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, with applications in R. Springer-Verlag, New York.

[23] Jones, M. C., Marron, J. S. and Sheather, S. J. (1996). A Brief Survey of Bandwidth Selection for Density Estimation. Journal of the American Statistical Association, 91(433), 401-407.

[24] Jones, M. H. and Lambourne R. J. (2004). An Introduction to Galaxies and Cosmology. Cambridge University Press.

[25] Konishi, S. and Kitagawa, G. (2008). Information Criteria and Statistical Modeling. Springer-Verlag, New York.

[26] Krichevsky, R. E. and Trofimov, V. K. (1981). The performance of universal encoding. IEEE Transactions on Information Theory, 27(2), 199-207.

[27] Lin, J. (1991). Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

[28] Lomb, N. R. (1976). Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science, 39(2), 447-462.

[29] Mahabal, A. A., Djorgovski, S. G., Donalek, C., Drake, A. J., Graham, M. J., Williams, R. D., Moghaddam, B. and Turmon, M. (2010). Classification of Optical Transients: Experiences from PQ and CRTS Surveys. EAS Publications Series, 45, 173-178.

[30] Marron, J. S. and Nolan, D. (1988). Canonical kernels for density estimation. Statistics & Probability Letters, 7(3), 195-199.

[31] Nikulin, M. S. Hellinger Distance. Encyclopedia of Mathematics. http://www.encyclopediaofmath.org/index.php/Hellinger_distance

[32] Papoulis, A. (1965). Probability, Random Variables, and Stochastic Processes. McGraw-Hill.

[33] Rasmussen, C. and Williams, C. (2006). Gaussian Processes for Machine Learning. The MIT Press.

[34] Richards, J. W., Starr, D. L., Butler, N. R., Bloom, J. S., Brewer, J. M., Crellin-Quick, A., Higgins, J., Kennedy, R. and Rischard, M. (2011). On machine-learned classification of variable stars with sparse and noisy time- series data. Astrophysical Journal, 733(1), 10-29.

[35] Scargle, J. D. (1982). Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data. Astrophysical Journal, 263, 835-853.

[36] Schurmann, T. and Grassberger, P. (1996). Entropy estimation of symbol sequences. Chaos, 6(3), 414-427.

[37] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.

[38] Smith, R. C. (2007). Cataclysmic Variables. arXiv:astro-ph/0701654

[39] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, London.

Vita

Sae Na Park

Research Interests: Astrostatistics, Statistics for Big Data, Data Mining

Education

• Pennsylvania State University, University Park, PA Ph.D in Statistics, Aug. 2015 Advisor: Prof. G. Jogesh Babu

• Ewha Women’s University, Seoul, Korea M.S. in Statistics, Feb. 2010 Advisor: Prof. Oesook Lee

• Ewha Women’s University, Seoul, Korea B.S. in Statistics, Bachelor of Economics, Minor in Mathematics, Feb. 2008

Work Experience

• Summer School Assistant, Center for Astrostatistics, Pennsylvania State University, June 2013, June 2014

• Graduate Statistical Consultant, Statistical Consulting Center, Penn- sylvania State University, Spring 2012, Spring 2013

• Intern, Korea Institute of Science and Technology, Seoul, July - Aug. 2007

Research Experience

• Visitor, Caltech, Pasadena, June 2013, April 2015

• SAMSI Graduate Fellow, SAMSI, Research Triangle Park, Fall 2012

• RA for Undergraduate Research Program, KOFAC, Ewha Women’s University, Apr. 2009 - Feb. 2010