Data-Driven Astronomical Inference

Josh Bloom, UC Berkeley

@profjsb

Computational Astrophysics 2014–2020: Approaching Exascale
LBNL, 21 March 2014

Inference Space

(Diagram: a 2-D inference space. One axis runs from Data Driven to Theory Driven, the other from non-parametric to parametric; Bayesian and Frequentist approaches sit within this space.)

Hardware: laptops → NERSC
Software: Python/SciPy, R, ...
Carbonware: astro grad students, postdocs

Bayesian Distance Ladder
Pulsational Variables: Period–Luminosity Relation

$$m_{ij} = \mu_i + M_{0j} + \alpha_j \log_{10}(P_i/P_0) + E(B-V)_i \left[\, R_V\, a(1/\lambda_j) + b(1/\lambda_j) \,\right] + \epsilon_{ij}$$

where $i$ indexes over individual stars, $j$ indexes over wavebands, and $a$ and $b$ are fixed constants at each waveband.

Data: 134 RR Lyrae (WISE, Hipparcos, UVIORJHK)

Fit: 307-dimensional model parameter inference
- deterministic model, MCMC
- ~6 days for a single run (one core)
- parallelism for convergence tests

(Klein+12; Klein, JSB+14)
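A minimal sketch of such a hierarchical fit, assuming only the simplified model above without the dust term (all data, starting points, and names are illustrative stand-ins, not the Klein et al. code), in Python with emcee:

```python
import numpy as np
import emcee

rng = np.random.default_rng(42)
n_star, n_band, P0 = 134, 10, 0.5
periods = rng.uniform(0.3, 0.8, n_star)

# Toy data drawn from the simplified model itself (illustrative only).
mu_true = rng.uniform(9.0, 11.0, n_star)        # distance moduli
M0_true = rng.uniform(-0.5, 0.5, n_band)        # zero points per band
alpha_true = rng.uniform(-2.0, -1.0, n_band)    # P-L slopes per band
m_err = 0.05
m_obs = (mu_true[:, None] + M0_true[None, :]
         + alpha_true[None, :] * np.log10(periods / P0)[:, None]
         + rng.normal(0, m_err, (n_star, n_band)))

def log_prob(theta):
    """chi^2 likelihood for m_ij = mu_i + M0_j + alpha_j log10(P_i/P0) + eps_ij.
    (The real 307-d model adds dust terms and priors, e.g. on parallaxes,
    which break the mu <-> M0 degeneracy left open in this toy version.)"""
    mu = theta[:n_star]
    M0 = theta[n_star:n_star + n_band]
    alpha = theta[n_star + n_band:]
    model = (mu[:, None] + M0[None, :]
             + alpha[None, :] * np.log10(periods / P0)[:, None])
    return -0.5 * np.sum(((m_obs - model) / m_err) ** 2)

ndim = n_star + 2 * n_band                      # 154 here; 307 in the full model
nwalkers = 2 * ndim + 2
p0 = (np.concatenate([mu_true, M0_true, alpha_true])
      + 1e-3 * rng.standard_normal((nwalkers, ndim)))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 500)
print(sampler.get_chain(flat=True).shape)       # (nwalkers * 500, ndim)
```

Even this toy version makes the computational point of the slide: a few hundred coupled parameters and a dense per-star, per-band likelihood is why a single production run took days, while independent chains for convergence tests are trivially parallel.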

Bayesian Distance Ladder

• Approaching 1% distance uncertainty
• Precision 3D dust measurements

Figure 6 (Klein et al.): Multi-band period–luminosity relations. RRab stars are in blue, RRc stars in red. Blazhko-affected stars are denoted with diamonds; stars not known to exhibit the Blazhko effect are denoted with squares. Solid black lines are the best-fitting period–luminosity relations in each waveband, and dashed lines indicate the 1σ prediction uncertainty for application of the best-fitting period–luminosity relation to a new star with known period.

Bayesian Astrometry

Fitting for: Parallax, Proper Motion, Binary Parameters, Microlensing...
(Figure from Torres 2007; axes: East and North offsets in milliarcsec)

Hipparcos: 10^6 stars → Gaia: ~10^9 stars

Bayesian Astrometry: “Distortion map”

Step 1: Regress a 7-d parametric affine transformation (scale, rotation, shear, etc.)

Step 2: Learn a non-parametric distortion map with Gaussian processes

(Figure: distortion map about the central declination; axes span −0.5 to 0.5.)

http://berianjames.github.com/pyBAST/ (James, JSB+14)
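A minimal sketch of the two-step idea (this is not the pyBAST API; the kernel choice and all names are illustrative), using scikit-learn's Gaussian process regressor on toy matched positions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy matched positions: (x, y) in one epoch, (u, v) in the reference frame.
rng = np.random.default_rng(0)
xy = rng.uniform(-0.5, 0.5, (200, 2))
true_affine = np.array([[1.001, 2e-4], [-2e-4, 0.999]])
uv = xy @ true_affine.T + np.array([0.01, -0.02])
uv += 0.003 * np.sin(4 * xy)          # smooth "distortion" field
uv += rng.normal(0, 1e-4, uv.shape)   # measurement noise

# Step 1: linear least squares for the affine part (6 parameters here;
# a scale/rotation/shear parameterization gives the 7-d form on the slide).
A = np.hstack([xy, np.ones((len(xy), 1))])
affine, *_ = np.linalg.lstsq(A, uv, rcond=None)
residuals = uv - A @ affine

# Step 2: Gaussian-process regression on the residual field, one GP per axis.
kernel = RBF(length_scale=0.2) + WhiteKernel(noise_level=1e-8)
gps = [GaussianProcessRegressor(kernel=kernel).fit(xy, residuals[:, k])
       for k in range(2)]

# Predict the distortion (with uncertainty) anywhere on the image.
grid = np.array([[0.1, -0.3]])
for k, gp in enumerate(gps):
    mean, std = gp.predict(grid, return_std=True)
    print(f"axis {k}: distortion {mean[0]:+.5f} +/- {std[0]:.5f}")
```

The GP step is what makes the map non-parametric: it interpolates the residual field smoothly between observed stars and, unlike a fixed polynomial distortion model, returns a variance everywhere on the image. pyBAST additionally propagates full positional covariances, which this sketch omits.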

Bayesian Astrometry

Some Clear Benefits
• covariate uncertainties in celestial coordinates
• mapping observed points can incorporate variance throughout the image, extending even to highly non-trivial distortion effects
• astrometry can be treated as Bayesian updating, allowing incorporation of prior knowledge about proper motion & parallax

Example: high proper motion WD in SDSS Stripe 82, 204±5 mas/yr (points colored by time)

Non-parallel Cholesky + MCMC: ~1 hour for 71 observations

http://berianjames.github.com/pyBAST/

Machine Learned Classification

25-class problem. Data: 50k sources from ASAS, 810 with known labels (timeseries, colors)

P(RRL) = 0.94

Richards+12
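The probabilistic outputs quoted here (e.g., P(RRL) = 0.94 for a given source) are random-forest vote fractions over light-curve features. A hedged scikit-learn sketch (data dimensions are from the talk, but the feature count and all arrays are stand-ins, not the Richards et al. pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_labeled, n_features, n_classes = 810, 60, 25   # 60 is a placeholder dimension
X_train = rng.normal(size=(n_labeled, n_features))  # light-curve + color features
y_train = rng.integers(0, n_classes, n_labeled)     # known variable-star classes

clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

# Apply to the full unlabeled survey: each row gets a 25-vector of
# class probabilities (tree vote fractions), not a single hard label.
X_survey = rng.normal(size=(50_000, n_features))
proba = clf.predict_proba(X_survey)
print(proba.shape, proba[0].sum())   # (50000, 25); each row sums to 1
```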

Friday, March 21, 14 9 The AstrophysicalMachine Journal Supplement Series,203:32(27pp),2012December Learned ClassificationRichards et al. True Class

a b1 b2 b3 b4 c d e f g h i j j1 l o p q r1 s1 s2 s3 t u v w x y

a. Mira 0.923 0.015 0.042 0.057 0.176

b1. Semireg PV 0.066 0.824 0.276 0.111 0.125 0.086 0.25 0.059 0.375

0.6 0.034 0.019 25-class variable0.059 star b2. SARG A 74 b3. SARG B 0.015 0.267 0.586 0.25 0.05 1 b4. LSP Data:0.011 0.074 0.067 0.069 0.87 50k from ASAS,0.2 0.171 8100.059 with known labels c. RV Tauri 0.029 0.75 0.011 0.059 0.125 0.015 dimensional d. Classical Cepheid 0.955 0.538 0.25 0.25 0.1

0.042 0.308 0.25 (timeseries, colors) e. Pop. II Cepheid feature set for f. Multi. Mode Cepheid 0.011 0.077 0.5 0.012 0.043 0.1 0.059 1

g. RR Lyrae, FM 0.011 0.077 0.965 0.5 1 h. RR Lyrae, FO 0.012 0.897 0.036 0.015 learning i. RR Lyrae, DM 0.5

j. Delta Scuti 0.034 0.786 0.278 0.015

j1. SX Phe

l. Beta Cephei 0.107 0.722 0.015 o. Pulsating Be 0.067 0.4 0.059 featurization is p. RSG 0.044 0.686 0.125

Predicted Class 0.036 0.2 0.913 0.015 PRRL = 0.94 q. Chem. Peculiar the bottleneck r1. RCB 0.706 s1. Class. T Tauri (but s2. Weak−line T Tauri 0.034 0.011 0.043 0.7 0.118 0.061 s3. RS CVn 0.05 0.059 0.037 embarrassingly t. Herbig AE/BE 0.2 0.375 u. S Doradus parallel) v. Ellipsoidal

w. Beta Persei 0.353 0.889 0.182

x. Beta Lyrae 0.059 0.118 0.074 0.667 0.044

y. W Ursae Maj. 0.042 0.012 0.069 0.036 0.25 0.059 0.091 0.882

91 68 15 29 54 24 89 13 4 86 29 2 28 1 18 5 35 23 17 4 20 17 8 1 1 27 33 68 Figure 5. Cross-validated confusion matrix for all 810 ASAS training sources. Columns are normalized to sum to unity, with the total number of true objects of each class listed along the bottom axis. The overall correspondence rate for these sources is 80.25%, with at least 70% correspondence for half of the classes. Classes with Richards+12 low correspondence are those with fewer than 10 training sources or classes which are easily confused. Red giant classes tend to be confused with other red giant classes and eclipsing classes with other eclipsing classes. There is substantial power in the top-right quadrant, where rotational and eruptive classes are misclassified as red giants;Friday, these March errors 21, are 14 likely due to small training set size for those classes and difficulty to classify those non-periodic sources. 9 (A color version of this figure is available in the online journal.) with any monotonically increasing function (which is typically probabilities, pi1, pi2,...,piC ,areproperprobabilitiesinthat restricted to a set of non-parametric isotonic functions, such as they are each{ between 0 and 1 and} sum to unity for each object. step-wise constants). A drawback to both of these methods is The optimal value of r is found by minimizing the Brier score ! ! ! that they assume a two-class problem; a straightforward way (Brier 1950)betweenthecalibrated(cross-validated)andtrue around this is to treat the multi-class problem as C one-versus- probabilities.14 We find that using a fixed value for r is too all classification problems, where C is the number of classes. restrictive and, for objects with small maximal RF probability, However, we find that Platt Scaling is too restrictive of a trans- it enforces too wide of a margin between the first- and second- formation to reasonably calibrate our data and determine that largest probabilities. Instead, we implement a procedure similar we do not have enough training data in each class to use Isotonic to that of Bostrom (2008)andparameterizer with a sigmoid Regression with any degree of confidence. function based on the classifier margin, ∆i pi,max pi,2nd, for Ultimately, we find that a calibration method similar to the each source, = − one introduced by Bostrom (2008)isthemosteffectiveforour 1 1 data. This method uses the probability transformation r(∆i) , (5) = 1+eA∆i +B − 1+eB

pij + r(1 pij )ifpij max pi1,pi2,...,piC where the second term ensures that there is zero calibration per- pij − = { } formed at ∆i 0. This parameterization allows the amount of = pij (1 r)otherwise, = " − (4) 14 N C 2 The Brier score is defined as B(p) 1/N (I(yi j) pij ) , = i 1 j 1 = − where! pi1,pi2,...,piC is the vector of class probabilities where N is the total number of objects, C is the number= = of classes, and { } # # for object i and r [0, 1] is a scalar. Note that the adjusted I(yi j) is 1 if and only if the true class of the source i is j. ∈ = ! ! 11 Machine-learned varstar catalog: http://bigmacc.info
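A hedged sketch of Eqs. (4)–(5) in Python (illustrative only; here A and B are fixed by hand rather than fit by minimizing the Brier score):

```python
import numpy as np

def calibrate(proba, A=-5.0, B=0.0):
    """Bostrom (2008)-style recalibration, following Eqs. (4)-(5):
    boost the maximal class by r(margin), shrink the others by (1 - r)."""
    proba = np.asarray(proba, dtype=float)
    top = np.argmax(proba, axis=1)
    srt = np.sort(proba, axis=1)
    margin = srt[:, -1] - srt[:, -2]                 # Delta_i = p_max - p_2nd
    r = 1.0 / (1.0 + np.exp(A * margin + B)) - 1.0 / (1.0 + np.exp(B))  # Eq. (5)
    out = proba * (1.0 - r[:, None])                 # non-maximal classes
    rows = np.arange(len(proba))
    out[rows, top] = proba[rows, top] + r * (1.0 - proba[rows, top])    # Eq. (4)
    return out

# The adjusted probabilities stay proper: each row still sums to unity.
p = np.array([[0.6, 0.3, 0.1], [0.4, 0.35, 0.25]])
print(calibrate(p), calibrate(p).sum(axis=1))
```

Note that with A negative, r(0) = 0 (no calibration for ties) and r grows with the margin, so confident predictions are sharpened more than ambiguous ones, as the text describes.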

Doing Science with Probabilistic Catalogs

Demographics (with little followup): favoring high purity at the cost of lower efficiency, e.g., using RRL to find new Galactic structure

Novelty Discovery (with lots of followup): favoring high efficiency at the cost of lower purity, e.g., discovering new instances of rare classes

Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables: Rare Gems Mined from ASAS

A. A. Miller, J. W. Richards, J. S. Bloom, S. B. Cenko, J. M. Silverman, D. L. Starr, and K. G. Stassun
(arXiv:1204.4181 [astro-ph.SR])

ABSTRACT

We present the results of a machine-learning (ML) based search for new R Coronae Borealis (RCB) stars and DY Persei-like stars (DYPers) in the Galaxy using cataloged light curves obtained by the All-Sky Automated Survey (ASAS). RCB stars, a rare class of hydrogen-deficient carbon-rich supergiants, are of great interest owing to the insights they can provide on the late stages of stellar evolution. DYPers are possibly the low-temperature, low-luminosity analogs to the RCB phenomenon, though additional examples are needed to fully establish this connection. While RCB stars and DYPers are traditionally identified by epochs of extreme dimming that occur without regularity, the ML search framework more fully captures the richness and diversity of their photometric behavior. We demonstrate that our ML method recovers ASAS candidates that would have been missed by traditional search methods employing hard cuts on amplitude and periodicity. Our search yields 13 candidates that we consider likely RCB stars/DYPers: new and archival spectroscopic observations confirm that four of these candidates are RCB stars and four are DYPers. Our discovery of four new DYPers increases the number of known Galactic DYPers from two to six; noteworthy is that one of the new DYPers has a measured parallax and is m ≈ 7 mag, making it the brightest known DYPer to date. Future observations of these new DYPers should prove instrumental in establishing the RCB connection. We consider these results, derived from a machine-learned probabilistic […]

Turning Imagers into Spectrographs
Time variability + colors → fundamental stellar parameters

Data: 5000 variables in SDSS Stripe 82 with spectra
~80-dimensional regression with Random Forest

Miller, JSB+14
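A hedged sketch of that regression (not the Miller et al. pipeline; the specific target parameters and all arrays are illustrative stand-ins): a random forest mapping ~80 variability and color features to spectroscopic stellar parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_train, n_features = 5000, 80
X = rng.normal(size=(n_train, n_features))   # variability + color features
y = np.column_stack([                        # spectroscopic labels (stand-ins)
    rng.uniform(4000, 8000, n_train),        # e.g. Teff [K]
    rng.uniform(0.0, 5.0, n_train),          # e.g. log g
    rng.uniform(-2.5, 0.5, n_train),         # e.g. [Fe/H]
])

reg = RandomForestRegressor(n_estimators=300, random_state=0)
reg.fit(X, y)                                # multi-output regression

# Predict "spectroscopic" parameters for new variables that lack spectra.
X_new = rng.normal(size=(10, n_features))
print(reg.predict(X_new).shape)              # (10, 3)
```

The point of the slide is the workflow, not the model: train on the subset of variables that have spectra, then the imaging survey alone yields stellar parameters for everything else.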

Big Data Challenge: Time & Resources

Large Synoptic Survey Telescope (LSST), 2018
- Light curves for 800M sources every 3 days
- 10^6 supernovae/yr, 10^5 eclipsing binaries
- 3.2-gigapixel camera, 20 TB/night

LOFAR & SKA: 150 Gb/s (27 Tflops) → 20 Pb/s (~100 Pflops)

Gaia space astrometry mission, 2013
- 1 billion stars observed ∼70 times over 5 years
- Will observe 20K supernovae

Many other astronomical surveys are already producing data: SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS, Kepler, LINEAR, DES, etc.

How do we do discovery, follow-up, and inference when the data rates (& requisite timescales) preclude human involvement?

Towards a Fully Automated Scientific Stack for Transients

(Stack diagram, top to bottom: papers → inference → typing → followup → classification → discovery → finding → reduction → observing → scheduling strategy. A brace marks the current state-of-the-art stack; layers are flagged as either automated (e.g. iPTF) or not (yet) automated.)

Goal: build a framework to discover variable/transient sources without people
• fast (compared to people)
• parallelizeable
• transparent
• deterministic
• versionable

A 1000-to-1 needle-in-the-haystack problem (PTF subtractions).

From Brink et al., on real–bogus features: for every candidate we have the subtraction image (reduced to a 21-by-21 pixel postage stamp, about 10 times the median seeing full width at half maximum, centered on the candidate) and metadata about the reference and subtraction images. The new features supplement those of Bloom et al. (2011), hereafter the RB1 features, with image-processing features extracted from the subtraction images and summary statistics from the PTF reduction pipeline; they are designed to mimic the way humans learn to distinguish real and bogus candidates by visual inspection.

Each postage stamp is first normalized so that its pixel values lie between −1 and 1. Because the pixel values of real candidates span a wide range depending on the astrophysical source and observing conditions, this ensures the features are not overly sensitive to the peak brightness of the residual or the background flux level, and instead capture the sizes and shapes of the subtraction residual:

$$I_N(x,y) = \frac{I(x,y) - \mathrm{med}[I(x,y)]}{\max\{\,\mathrm{abs}\left[\,I(x,y) - \mathrm{med}[I(x,y)]\,\right]\,\}} \qquad (1)$$

This transformation proved superior to alternatives such as the Frobenius norm ($\sqrt{\mathrm{trace}(I^{T} I)}$) and truncation schemes in which extreme pixel values are removed.

Real candidates are typically azimuthally symmetric and well represented by a spherical 2-dimensional Gaussian,

$$G(x,y) = A \cdot \exp\left( -\frac{1}{2} \left[ \frac{(c_x - x)^2 + (c_y - y)^2}{\sigma^2} \right] \right), \qquad (2)$$

which is fit to the normalized subtraction image by minimizing the sum of squared differences with respect to the central position $(c_x, c_y)$, amplitude $A$, and scale $\sigma$, using an L-BFGS-B optimization algorithm (Lu, Nocedal & Zhu 1995). (The amplitude $A$ may be negative as well as positive, since subtraction images of real candidates are negative when the brightness of the source is decreasing.) The best-fit scale and amplitude give the scale and amp features; gauss is the sum of squared differences between the optimal model and the image; corr is the Pearson correlation coefficient between the best-fit model and the subtraction image. The sym feature measures symmetry: the thumbnail is divided into four equal quadrants, the flux over the pixels in each quadrant is summed (in units of standard deviations above the background), and the sum of squares of the differences between quadrants is averaged. sym is nearly zero for highly symmetric difference images and large for images that are not symmetric.

Figure 1 (Brink et al.): Examples of bogus (top) and real (bottom) thumbnails. The shapes of the bogus sources can be quite varied, which poses a challenge in developing features that accurately represent all of them. In contrast, real detections are more uniform in the shapes and sizes of the subtraction residual, so the focus is on a compact set of features that accurately captures the relevant characteristics of real detections.
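A hedged sketch of these image features in Python (illustrative, not the Brink et al. code): the Eq. (1) normalization, the Eq. (2) Gaussian fit via L-BFGS-B, and the quadrant-based sym statistic.

```python
import numpy as np
from scipy.optimize import minimize

def normalize(I):
    """Eq. (1): median-subtract, then scale by the max absolute value."""
    I = I - np.median(I)
    return I / np.max(np.abs(I))

def gaussian_features(I_N):
    """Fit G(x,y) of Eq. (2) to a normalized thumbnail; return amp, scale, gauss."""
    n = I_N.shape[0]
    y, x = np.mgrid[0:n, 0:n]

    def sse(theta):
        cx, cy, A, sigma = theta
        G = A * np.exp(-((cx - x) ** 2 + (cy - y) ** 2) / (2 * sigma ** 2))
        return np.sum((G - I_N) ** 2)

    # A may be negative (fading sources), hence the symmetric bounds.
    res = minimize(sse, x0=[n / 2, n / 2, 1.0, 2.0], method="L-BFGS-B",
                   bounds=[(0, n), (0, n), (-2, 2), (0.5, n)])
    cx, cy, A, sigma = res.x
    return {"amp": A, "scale": sigma, "gauss": res.fun}

def sym(I_N):
    """Mean squared difference of quadrant flux sums; ~0 for symmetric residuals."""
    h = I_N.shape[0] // 2
    q = [I_N[:h, :h].sum(), I_N[:h, h:].sum(), I_N[h:, :h].sum(), I_N[h:, h:].sum()]
    return np.mean([(a - b) ** 2 for i, a in enumerate(q) for b in q[i + 1:]])

# Toy 21x21 "real" candidate: a centered Gaussian blob plus noise.
n = 21
y, x = np.mgrid[0:n, 0:n]
I = np.exp(-((x - 10) ** 2 + (y - 10) ** 2) / 8.0) + 0.05 * np.random.randn(n, n)
I_N = normalize(I)
print(gaussian_features(I_N), "sym =", sym(I_N))
```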


“Discovery” is Imperfect

Real and bogus objects in our training set of 78k detections; 42-dimensional image and context features on each candidate (Brink+2012).

Figure 2 (Brink et al.): Histogram of a selection of features divided into real (purple) and bogus (cyan) populations. First, the two newly introduced features gauss and amp, the goodness-of-fit and amplitude of the Gaussian fit; then mag_ref, the magnitude of the source in the reference image; flux_ratio, the ratio of the fluxes in the new and reference images; and lastly ccid, the ID of the camera CCD where the source was detected. That this last feature is useful at all is surprising, but candidates clearly have a higher probability of being real or bogus on some of the CCDs.

Random forest (Breiman 2001) aggregates a collection of hundreds to thousands of classification trees; for a given new candidate, it outputs the fraction of trees that vote real, and classifies the candidate as real if that fraction exceeds some threshold τ, bogus otherwise. (SVMs with a radial basis kernel are a common alternative for non-linear classification problems.) While an ideal classifier would have no missed detections (no real identified as bogus) and zero false positives (no bogus identified as real), a realistic classifier offers a trade-off between the two error types. A receiver operating characteristic (ROC) curve displays the missed detection rate (MDR) versus the false positive rate (FPR): the larger the threshold τ by which we deem a candidate real, the lower the FPR but the higher the MDR, and vice versa. Varying τ maps out the ROC curve for a particular classifier, and classifiers can be compared by their cross-validated ROC curves: the lower the curve, the better the classifier. A commonly used figure of merit is the Area Under the Curve (AUC; Friedman et al. 2001), but that criterion is agnostic to the actual FPR or MDR requirements of the problem at hand; ROC curves of different classifiers often cross, so performance in one regime does not necessarily carry over to other regimes. For real–bogus classification, we instead define our figure of merit as the MDR at 1% FPR, which we aim to minimize. The choice of 1% stems from a practical reason: we don't want to be swamped by bogus candidates misclassified as real. (The standard form of the ROC plots the false positive rate versus the true positive rate, TPR = 1 − MDR.)

Figure 3 shows example ROC curves comparing performance on pre-split training and testing sets including all features; with minimal tuning, Random Forests perform better, for any position on the ROC curve, than the alternatives.

Figure 3 (Brink et al.): Comparison of a few well-known classification algorithms applied to the full dataset. ROC curves enable a trade-off between false positives and missed detections, but the best classifier pushes closer to the origin. Linear models (logistic regression or linear SVMs) perform poorly as expected, while non-linear models (SVMs with radial basis function kernels or Random Forests) are much better suited to this problem. Random Forests perform well with minimal tuning and efficient training, so they are used in the remainder of the paper. A line marks the 1% FPR to which our figure of merit is fixed.

With any machine learning method there is a plethora of modeling decisions to make when attempting to optimize predictive accuracy on future data: which learning algorithm to use, what subset of features to employ, and what values of model-specific tuning parameters to choose. Without rigorous optimization of the model, the performance of the machine learner can be hurt significantly; in the context of real–bogus classification, this could mean failure to discover objects of tremendous scientific impact. This section describes several choices that must be made in the real–bogus discovery engine and outlines how we choose the optimal classification […]
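A hedged sketch of that figure of merit (illustrative, not the paper's code): cross-validated random-forest votes, then the MDR at 1% FPR read off the ROC curve with scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve
from sklearn.datasets import make_classification

# Stand-in for the 78k-detection, 42-feature real/bogus training set.
X, y = make_classification(n_samples=5000, n_features=42, weights=[0.9, 0.1],
                           random_state=0)  # y = 1 means "real"

clf = RandomForestClassifier(n_estimators=300, random_state=0)
votes = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# ROC: sweeping the voting threshold tau traces out (FPR, MDR);
# the figure of merit is the MDR at the first point with FPR >= 1%.
fpr, tpr, tau = roc_curve(y, votes)
mdr = 1.0 - tpr
i = np.searchsorted(fpr, 0.01)
print(f"MDR at 1% FPR: {mdr[i]:.3f} (threshold tau = {tau[i]:.2f})")
```

Minimizing this quantity, rather than AUC, bakes the operational constraint (don't flood humans and follow-up resources with bogus candidates) directly into model selection.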

ML Algorithmic Trade-Off

(Chart: interpretability, low → high, versus accuracy, low → high, "* on real-world data sets". Methods arrayed along the trade-off: Lasso, Linear/Logistic Regression, and Decision Trees at high interpretability; Naive Bayes, Splines, Nearest Neighbors, Bagging, Boosting, Random Forest®, Gaussian/Dirichlet Processes, and SVMs in between; Neural Nets and Deep Learning at high accuracy but low interpretability.)

Random Forest is a trademark of Salford Systems, Inc.

Real-time Classifications...

Seismology, Neuroscience

Friday, March 21, 14 19 immediately, then eventually) and we will provide a mechanism for doing so. Indeed, part of the service that we are proposing to provide is a storage and search repository for public time-series data suitable for classification7. In this sense, the site may become known as the portal for time-domain datasets (see also [29]), akin to the (mostly) non-temporal datasets stored at NSF-sponsoredcloud. UC We Irvine will Machine design Learning our classification engine to give probabilistic classifications for new time series, al- 8 Repository . lowing users to choose personalized classification thresholds to attain the specified level of false positive 2.3. Feature Extraction. While the meaning encoded in time-series dataor varies false across negative scientific rate domains, required for their scientific problem. Longer term, we intend to add more sophisti- the manner in which that meaning is manifest shares great commonality. Typically, similar sets of features or “attributes” are used to capture latent information in time-series datacated across classification disparate scientific tools fields, such as instance-based dynamic time warping classification [30], semi-supervised reducing dimensionality (usually) and homogenizing datasets into a fixed-dimensionalmethods that space exploit [40, 5 both, 62]. labeled and unlabeled data for classification [63], and multivariate time-series This is the major reason that our proposed classification platform will beclassification “restricted” to time-domain: [28]. our collaboration will provide essential feature-based analytics tools without being experts in every domain that makes use of the service. A large number of time-series features will be made available through2.5. the classificationWebsite. platform.Figure 5 Theseshows some example screenshots of the website. For the web-server, we will use features range from very simple summary statistics of the data to parameterswidely that popular are estimated high-performance, via sophis- high-concurrency 9 system; it has a proven capability to do load ticated model fitting. The most elementary set of features are summaries of the marginal distribution of the time series, which ignore the time axis altogether. Though simple, featuresbalancing, such as the caching, peak-to-peak access ampli- and bandwidth control. In a nutshell, takes an asynchronous, event-driven tude, skew, and standard deviation of the time series measurements areapproach, often powerful farming discriminators out from web requests to worker engines that hold identical views of the website. astronomy to biology to musicology (e.g., [50, 35, 33]) and are trivial toworks compute. well In addition, on the we elastic plan to cloud compute architecture of Amazon (EC2). Thus, the web-server itself will be compute basic variability metrics, such as summary statistics from the empirical auto-correlation function and simple point-to-point features meant to capture the time-dependentpart structure of the of each native time series. parallelism that will help make the platform scale to a large number of concurrent users, Next, spectral features will be provided to capture the periodic informationdata loads, encoded and in each active time series. computations. 
For evenly-spaced time-seriesClassification data, Fourier and wavelet-basedPlatform decompositions for Novel can beScientific performed, pulling Insight 7/11/12 12:44 AM out the principal frequencies, amplitudes and phases, among other features extracted from the estimated

on Time-Series Data 7/11/12 12:43 AM Logged!in!as!Sally. Log!out. 7 Note that this service would fall under the Data collection and management (DCM) and E-science collaboration environments (ESCE) Get my API key... rubric; however, DCM and ESCE are not the main perspectives of the proposed effort. 2 Feature!Selection 8 NSF/BIGDATA project Not!logged!in. Log!in. Home > My Projects > zooproject1 > feature selection Machine!Learning"for"Time!Domain"Science My Dataset: 143.4 Gb, 19211 examples, 45 classes Got time-variable data that you want to get insight into? Time-domain data across scientific disciplines benefit from a common machine-learning (ML) workflow. From sandbox-sized Marginal Spectral Stochastic My Features to Big Data in real-time, now you can get cooking with cutting-edge ML in 4 easy steps.

Summary Statistics Quantiles 1 Assemble and upload a directory of time-series data and labels Skew 5%/95 Read more about this step... Kurtosis Varian Median #68/9 Choose preloaded feature sets to compute on the time-series data. 2 Mean You can even upload your own feature generators using our API. What are "features" in a machine-learning context?? MAD What's this? Variability

Extrema Stets Explore the results of different ML models built from your data. 3 Min/Max Stets Choose which one you want working in production mode. What's this? Read more about this step... #X-sigma outliers Stets

4 Upload new time-series data and get back probabalistic classification results. Choose ALL available features Why shouldn't I always do this? Progress...

Let me get started! Compute Features! ETA: 2.3 min... FIGURE 4. Schematic of the proposed platform and website interface. Major codebases, all created to work on running on 950 Amazon instances commodity cloud computing systems (ie., Microsoft Azure or Amazon EC2), are shown in yellow.About... Databases Public!datasets This will use 1.2 of your 250 credits and raw files will be housed on cloud datastores (e.g., Microsoft Azure and Amazon S3). ParallelismTime"domain of!Machine the !Learning!tutorials Contact!Us platform by project will ensured by the webserver ( ; see below). Speed of feature computation will be FAQ accomplished by on-demand spin-ups of feature extraction workers. Friday, March 21, 14 20
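As a sketch of the simplest of these features (marginal summary statistics, a point-to-point variability metric, and a Lomb-Scargle period for unevenly sampled series; the function and feature names are illustrative, not the platform's API):

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.signal import lombscargle

def extract_features(t, m):
    """Marginal, variability, and spectral features for one irregular time series."""
    feats = {
        "amplitude": np.ptp(m),             # peak-to-peak amplitude
        "std": np.std(m),
        "skew": skew(m),
        "kurtosis": kurtosis(m),
        "median_abs_dev": np.median(np.abs(m - np.median(m))),
        "mean_p2p": np.mean(np.abs(np.diff(m))),   # point-to-point variability
    }
    # Spectral: the Lomb-Scargle periodogram handles uneven sampling.
    freqs = np.linspace(0.01, 10.0, 5000) * 2 * np.pi   # angular frequencies
    power = lombscargle(t, m - m.mean(), freqs)
    feats["best_period"] = 2 * np.pi / freqs[np.argmax(power)]
    return feats

# Toy irregularly sampled sinusoidal light curve with period 0.75 days.
rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0, 100, 300))
m = 12.0 + 0.3 * np.sin(2 * np.pi * t / 0.75) + rng.normal(0, 0.05, 300)
print(extract_features(t, m))
```

Each light curve, whatever its length or cadence, is thereby mapped to one fixed-dimensional feature vector, which is what lets a single classification engine serve many scientific domains.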

FIGURE 5. Wireframe screenshots of two pages from the proposed classification website. (left) Homepage highlighting the four basic steps for learning and application to new data. (right) Feature selection and computation page for an example project ("zooproject1") from user "Sally." Here the user can select certain features to extract and will be able to upload and use their own feature extraction modules. The parallelism of the platform is demonstrated by showing the number of Amazon instances being used to compute features.

In year 1, we will build a "minimal viable product" (MVP) using IPython/GoogleAppEngine. Our collaboration has extensive experience building user-facing websites with the (free) AppEngine.10 We will also explore the use of the Go and Ruby programming languages for the production version of the site. In addition to building out an intuitive user experience (as shown in Fig. 5), we will also create a set of APIs (§2.1) so that the entire learning and application process can be controlled through computer (rather than […]

10 For example, we built using AppEngine a data exploration and visualization site for our probabilistic catalog of variable stars, constructed using ML on more than 50k stars.

Parting Thoughts

• Astronomy's data deluge demands an abstraction of the traditional roles in the scientific process. Despite automation, crucial (insightful) roles remain for people
• Machine learning is an emerging & useful tool
• Deterministic prediction with verifiable uncertainties is crucial to maximize scientific impact under real-world resource constraints
• Major challenge: training & access to great learning frameworks
• In the time-domain, machine-learning prediction is the gateway to understanding
