
Kirk D. Borne1, K. Stassun2, R. J. Brunner3, S. G. Djorgovski4, M. Graham4, J. Hakkila5, A. Mahabal4, M. Paegert2, M. Pesenson4, A. Prša6, A. Ptak7, J. Scargle7, and the LSST ISSC Team
1George Mason University, 2Vanderbilt University, 3University of Illinois, 4Caltech, 5College of Charleston, 6Villanova University, 7NASA

The LSST Informatics and Statistics Science Collaboration (ISSC) focuses on research and scientific discovery challenges posed by the very large and complex data collection that LSST will generate. Application areas include data mining, machine learning, astrostatistics, visualization, scientific data semantics, time series analysis, and advanced signal processing. Research problems to be addressed with these methodologies include transient event characterization and classification, rare class discovery, correlation mining, outlier/anomaly/surprise detection, improved estimators (e.g., for photometric redshift or early-onset supernova classification), exploration of high-dimensional (multivariate) data catalogs, and more. The LSST ISSC team members present five sample science results from these data-oriented approaches to large-data astronomical research: the EB (Eclipsing Binary) Factory, transient characterization (for rapid follow-up and classification), classification of complex data sets, semantic e-Science, and dimensionality reduction through data compression (for visual analytics).

Background: Advancing the field = Community-building:
• Astrostatistics conferences since 1991 (Statistical Challenges in Modern Astronomy, SCMA; Babu & Feigelson).
• Astroinformatics developments since 2001 (= the establishment of Virtual Observatory projects worldwide).
• Our team consists of astronomers, statisticians, and data mining experts – we are developing a strong collaboration among these 3 research communities for the benefit of LSST and similar big data projects.
• Promoting Astroinformatics + Astrostatistics to astronomers (with several workshops sponsored since 2009).
• Education, education, education! (including undergraduate and graduate education, …)
• 2009: Astroinformatics position paper submitted to ASTRO2010 Decadal Survey Committee.
• Several additional ASTRO2010 Decadal Survey position papers and white papers were submitted related to astronomical sky surveys, astrostatistics, and astroinformatics.
• 2009: LSST Informatics and Statistics Research Collaboration Team proposed, approved, and formed.
• 2010: Astroinformatics and Astrostatistics special sessions at AAS-Washington DC.
• 2010: Astroinformatics 2010 International Conference: http://www.astroinformatics2010.org/
• Websites: http://astrostatistics.psu.edu/ and http://www.practicalastroinformatics.org/

What is Astroinformatics? – It is the set of data-oriented research methodologies for data-intensive astronomy:
• Large-scale scientific data modeling, organization, and management
• Scientific metadata, tagging, and annotations for astronomical information search and retrieval
• Data mining, machine learning, and knowledge discovery from data
• Astrostatistics
• Information and data visualization, including data structures and data compression
• Semantic e-Science
• Data-intensive astronomical research and analysis
• Reference: Borne, K., "Astroinformatics: Data-Oriented Astronomy Research and Education," Earth Science Informatics, 3, 5-17 (2010).

Current Topics of Interest for the LSST Informatics and Statistics Science Collaboration team:
• LSST event characterization vs. classification
• Characterization of sparse, irregularly sampled time series and the LSST observing cadence
• Challenge problems, such as the Photo-z challenge and the Supernova Photometric Classification challenge
• Testing on the LSST simulations: images/catalogs PLUS observing cadence – can we recover known classes of variability?
• Generating and/or accumulating training samples of numerous classes (especially variables and transients)
• Proposing a mini-survey during the science verification year (Science Commissioning): e.g., obtain high-density and evenly-spaced observations of extragalactic and Galactic test fields, to generate training sets for variability classification and assessment thereof
• Science Data Quality Assessment (SDQA): R&D efforts to support LSST Data Management QA activities

WHAT ARE THE PRIMARY SCIENTIFIC GOALS OF THE LSST ISSC TEAM?
• To address the LSST "Data to Knowledge" challenge
• To help discover the unknown unknowns in LSST's petascale database and archive
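One of the topics above is recovering periods from sparse, irregularly sampled time series. As a minimal, self-contained sketch (not the team's actual period finder, and with a synthetic random cadence rather than the real LSST cadence), the following applies phase-dispersion minimization to a toy eclipsing-binary-like light curve:

```python
import numpy as np

rng = np.random.default_rng(42)
true_period = 2.37                        # days (arbitrary toy value)

# Irregular sampling: 200 epochs scattered over 120 days.
t = np.sort(rng.uniform(0.0, 120.0, 200))

# Toy eclipsing-binary-like signal: flat baseline with two dips per
# cycle (depths 0.6 and 0.3 mag) plus photometric noise.
phase = (t / true_period) % 1.0
mag = 15.0 + 0.02 * rng.standard_normal(t.size)
mag += 0.6 * np.exp(-0.5 * ((phase - 0.30) / 0.03) ** 2)
mag += 0.3 * np.exp(-0.5 * ((phase - 0.80) / 0.03) ** 2)

def pdm_theta(t, y, freq, nbins=20):
    """Phase-dispersion statistic: within-bin scatter of the folded
    light curve divided by the total scatter (small = good fold)."""
    ph = (t * freq) % 1.0
    b = np.minimum((ph * nbins).astype(int), nbins - 1)
    n = np.maximum(np.bincount(b, minlength=nbins), 1)
    mean = np.bincount(b, weights=y, minlength=nbins) / n
    within = np.sum((y - mean[b]) ** 2)
    return within / np.sum((y - y.mean()) ** 2)

# Scan trial frequencies (trial periods between ~0.7 and 4 days).
freqs = np.linspace(0.25, 1.5, 15000)
thetas = np.array([pdm_theta(t, mag, f) for f in freqs])
best_period = 1.0 / freqs[np.argmin(thetas)]
```

Phase-dispersion minimization folds the data at each trial period and prefers the fold with the least within-bin scatter, which makes it well suited to non-sinusoidal signals such as eclipses; it needs no evenly spaced sampling, which is the point for survey cadences.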

Semantic e-Science: Semantics adds to the handling of complex data by capturing domain knowledge and expressing it in a machine-processable fashion. At its simplest level, this allows disparate data sets to be commonly understood by both human and computer through mappings between the terms, concepts, and relationships used to describe and model the data. At a more sophisticated level, knowledge is incorporated directly into data mining algorithms, making them smarter and more powerful. The ontological self-organizing map (OSOM; Graham et al. 2011, in preparation) is an example of such a semantic data mining technique: it is a modified version of the traditional (dimension-reducing) SOM, with the distance metric dependent upon the conceptual similarity of the terms used to annotate data objects, i.e., the degree to which they share information. The plot shows the trained toroidal network for a 21-dimensional data set comprising 100 annotated objects randomly drawn from the SIMBAD astronomical database. The axes are the transformed coordinates (reduced from 21 dimensions to 2) that mimic the "separation" between concepts. The clusters denote groups of objects which are conceptually similar: in this case, they are 'galaxies' (4,6), 'stars' (7,3), 'infrared sources' (6,1), 'radio sources' (4,1), and 'variable stars' (1,1). Similarity here is a measure dependent on both the hierarchical structure of annotation terms (as expressed in an ontology) and the probabilistic frequency of terms in the SIMBAD corpus.

Data Mining and Knowledge Discovery from Light Curve Data – The Eclipsing Binary Factory: The EB Factory is envisioned as a pipeline for processing large quantities of light curve data through a series of artificial intelligence-based filters, classifiers, and solution finders in order to produce a set of "validated EBs" whose physical properties are well constrained and are thus suitable for detailed follow-up. The Solution Refiner is fully developed (PHOEBE; Prša & Zwitter 2005), and the Solution Estimator has been piloted (Prša et al. 2008, 2010) with OGLE and Kepler light curves. The Period Finder and Classifier components are under development, thus completing the basic assembly line (top row of figure). The EB Factory will be used immediately to harvest EBs from LSST light curves. We also seek to further enhance the EB Factory by adding the capability to incorporate ancillary catalog data on the fly (bottom row of figure), such as radial velocities from the RAVE and SDSS MARVELS surveys, standardized photometry, and others.

The figure to the right illustrates the expected yield of eclipsing binary stars with LSST. 10,000 eclipsing binary light curves are synthesized, sampled according to the LSST universal cadence, and passed through the period finder. The phased data are then passed to a neural network-based estimator of principal eclipsing binary parameters. The solid line depicts the fraction of sources for which the estimated period is within 10% of the actual value. The fractions of sources for which the principal physical parameters are recovered to 10% accuracy, with exact periods (dotted line) and recovered periods (dashed line), represent the ideal and realistic expected yields, respectively. Overall, we estimate that LSST will observe ~16 million eclipsing binaries down to r~22, for which the S/N should be sufficient for analysis. Our yield calculations suggest that ~1.6 million of these eclipsing binaries are likely to be successfully recovered and their physical parameters well estimated (Pepper et al. 2010; see also LSST Science Book, section 6.10).

References:
• Pepper, J., Stassun, K., & Prša, A. 2010, Bulletin of the American Astronomical Society, 42, 218
• Prša, A., & Zwitter, T. 2005, IAU Symposium, Vol. 240, 217–229
• Prša, A., et al. 2008, ApJ, 687, 542–565
• Prša, A., et al. 2010, arXiv:1006.2815

Data Compression through Dimensionality Reduction for Visual Analytics: Exploration of high-dimensional (multivariate) data sets. The figure shows an example of interpolation of eigenfunctions by Lagrangian splines on graphs with nontrivial topology. It illustrates a random graph with 2000 nodes (the edges are not shown); the average degree of the nodes is 35; only 2% of all nodes were used for interpolation of the Laplacian eigenfunctions.

Real-time Characterization of Transients for Follow-up Observation Support: The variety of astronomical phenomena and attendant transient events ensures that classification is a complex process. The path to classification is through characterization – cataloguing the sequence of ups & downs in a light curve – aided by statistical methods and astronomy knowledge, which in turn are based on heuristics from a select set of data embedded in huge multi-dimensional datasets. Light curves with only a few points lead to early characterization via: (1) simple heuristics and Gaussian processes (Fig. A), (2) delta-mag/delta-time surfaces (Fig. B), and (3) statistics of delta-mag over characteristic periods (one-day boxplots, Fig. C).
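As a toy illustration of the delta-mag/delta-time surfaces mentioned above, the sketch below bins all pairwise magnitude differences and time lags of a synthetic light curve into a 2-D histogram; the light curve, bin edges, and normalization are our own illustrative choices, not the team's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy light curve: a slow sinusoidal variable, irregularly sampled.
t = np.sort(rng.uniform(0.0, 50.0, 60))                      # days
mag = 18.0 + 0.3 * np.sin(2 * np.pi * t / 12.0) \
           + 0.02 * rng.standard_normal(60)

# All ordered pairs (i < j) of time lags and magnitude differences.
i, j = np.triu_indices(t.size, k=1)
dt = t[j] - t[i]
dm = mag[j] - mag[i]

# Bin the (dt, dm) pairs into a coarse 2-D "dmdt surface": log-spaced
# lags, linear magnitude differences.
dt_edges = np.logspace(-1, 2, 11)                            # 0.1 to 100 days
dm_edges = np.linspace(-1.0, 1.0, 11)
surface, _, _ = np.histogram2d(dt, dm, bins=[dt_edges, dm_edges])

# Normalize so the feature does not depend on the number of epochs.
surface /= surface.sum()
```

The appeal of such a surface as a classification feature is that it has a fixed size regardless of how many epochs a light curve has or how unevenly they are spaced, so light curves with very different cadences become directly comparable.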

Classification of Complex Data Sets: A breadth-first tree based on a small number of characteristics is a way towards a complex classification tree, as illustrated here. A series of Bayesian techniques, including Gaussian process methodologies, leads to classification from complex datasets. Priors based on basic observables like colors, together with other auxiliary information, lead to broad early classification. Research is focused on making these algorithms scalable to large datasets.
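A much simpler stand-in for the prior-driven broad classification described above is a Gaussian naive Bayes classifier over synthetic "colors". The class names, means, scatters, and priors below are invented for illustration (the team's actual methods use Gaussian processes), but the role of the prior is the same: auxiliary knowledge shifts the posterior before the data speak.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical broad classes in a 2-D color space (e.g., g-r, r-i).
def sample(mean, n):
    return np.asarray(mean) + 0.15 * rng.standard_normal((n, 2))

train_X = np.vstack([sample([0.8, 0.4], 300),    # class 0: "star-like"
                     sample([0.2, 0.0], 300)])   # class 1: "quasar-like"
train_y = np.array([0] * 300 + [1] * 300)
priors = np.array([0.9, 0.1])   # broad prior: most objects are class 0

def fit(X, y, k=2):
    """Per-class, per-feature Gaussian parameters (naive Bayes)."""
    mu = np.array([X[y == c].mean(axis=0) for c in range(k)])
    var = np.array([X[y == c].var(axis=0) for c in range(k)])
    return mu, var

def log_posterior(x, mu, var, priors):
    """Unnormalized log posterior: independent-Gaussian likelihood + log prior."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return ll + np.log(priors)

mu, var = fit(train_X, train_y)
# Classify a new object from its colors alone.
pred = int(np.argmax(log_posterior(np.array([0.25, 0.05]), mu, var, priors)))
```

Because the posterior factors into likelihood plus log prior, each new observable (another color, a variability statistic) can be folded in incrementally, which is what makes this style of broad early classification attractive for alert streams.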