
Open Phys. 2016; 14:579–587

Research Article Open Access

Szymon Łukasik*, André Moitinho, Piotr A. Kowalski, António Falcão, Rita A. Ribeiro, and Piotr Kulczycki

Survey of Object-Based Data Reduction Techniques in Observational Astronomy

DOI 10.1515/phys-2016-0064 Received Sep 03, 2016; accepted Oct 26, 2016

Abstract: Dealing with astronomical observations represents one of the most challenging areas of big data analytics. Besides the huge variety of data types and the dynamics related to continuous data flow from multiple sources, the handling of enormous volumes of data is essential. This paper provides an overview of methods aimed at reducing both the number of features/attributes and the number of data instances. It concentrates on approaches not related to instruments and observation tools, working instead on processed object-based data. The main goal of this article is to describe existing datasets on which algorithms are frequently tested, to characterize and classify available data reduction algorithms, and to identify promising solutions capable of addressing present and future challenges in astronomy.

Keywords: astronomy; big data; dimensionality reduction; feature extraction; data condensation

PACS: 93.85.Bc

*Corresponding Author: Szymon Łukasik: Faculty of Physics and Applied Computer Science, AGH University of Science and Technology; Systems Research Institute, Polish Academy of Sciences; E-mail: [email protected]
André Moitinho: CENTRA, Universidade de Lisboa, FCUL, Portugal; E-mail: [email protected]
Piotr A. Kowalski: Faculty of Physics and Applied Computer Science, AGH University of Science and Technology; Systems Research Institute, Polish Academy of Sciences; E-mail: [email protected]
António Falcão: Center of Technology and Systems, UNINOVA, Portugal; E-mail: [email protected]
Rita A. Ribeiro: Center of Technology and Systems, UNINOVA, Portugal; E-mail: [email protected]
Piotr Kulczycki: Faculty of Physics and Applied Computer Science, AGH University of Science and Technology; Systems Research Institute, Polish Academy of Sciences; E-mail: [email protected]

1 Introduction

Astronomy stands at the forefront of big data analytics. In recent decades it has acquired tools which have enabled unprecedented growth in generated data and, consequently, in data which needs to be processed. This led to the creation of two specific fields of scientific research: astrostatistics, which applies statistics to the study and analysis of astronomical data, and astroinformatics, which uses information and communication technologies to solve the big data problems faced in astronomy [58].

Since the times of individual observations with basic optical instruments, astronomy has transformed into a domain employing more than 1900 observatories (the International Astronomical Union list currently holds 1984 records [23]). The sizes of catalogs of astronomical objects have reached petabytes, and they may contain billions of instances described by hundreds of parameters [14]. As such, the obstacles of astronomical data analysis exemplify perfectly the three main challenges of Big Data, namely volume, velocity and variety (also known as the 3Vs). Volume corresponds to both a large number of instances and of characteristics (features), velocity is related to the dynamics of the data flow, and finally, variety stands for the broad range of data types and data sources [17].

This paper summarizes research efforts in the first of these aforementioned domains. Its goal is to present techniques aimed at alleviating problems of data dimensionality and numerosity from a data mining perspective, as well as to suggest suitable algorithms for upcoming challenges. Data is seen here as a set of astronomical objects and their properties (or their spectra). This means it has already been processed from the raw signals/images typically present at the instrument's level. Similarly, the term "reduction" corresponds here purely to the transformation of object-based data, not to the transition of raw signals/images to science-ready data products. The latter can be composed of several steps, and in this context data reduction could refer to several things: that raw images were processed, that photometric measurements were performed using counts stored in the pixels, that physical properties were extracted from spectra, etc.

© 2016 S. Łukasik et al., published by De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

Table 1: Selected sky surveys

Survey     Institution                         Number of objects   Type      Time frame
Hipparcos  European Space Agency               0.12M               Optical   1989-1993
Tycho-2    European Space Agency               2.5M                Optical   1989-1993
DPOSS      Caltech                             550M                Optical   1950-1990
2MASS      Univ. of Massachusetts, Caltech     300M                Near-IR   1997-2001
Gaia       European Space Agency               1000M               Optical   2013-
SDSS       Astrophysical Research Consortium   470M                Optical   2000-
LSST       LSST Corporation                    4000M               Optical   2019-

In the first part of the paper, following this introduction, the scale of the data analysis problems of contemporary observational astronomy is emphasized. The second section reports on available datasets and knowledge discovery procedures. In the third section an overview of feature extraction/dimensionality reduction techniques is provided, along with examples of their application to astronomical data. The fourth section is devoted to data numerosity reduction and its specific utilization for the visualization of astronomical data. Both sampling and more sophisticated approaches are addressed. Finally, we suggest some existing algorithmic solutions for astronomical data reduction problems, identify future challenges in this domain, and provide concluding remarks.

2 Data Volume Problem in Observational Astronomy

The development of novel instruments used for producing astronomical data increases the data volume generated each year at a rate which is twice that of Moore's law [46]. That is why the essence of contemporary observational astronomy could be accurately described with the metaphor of drinking water from a fire hose [49]. It reflects the fact that data processing algorithms have to deal with enormous amounts of data – also on a real-time basis [58]. Consequently, data reduction occurs at a low level, at the signal/image processing phase, to bring down the size of transferred data. It typically involves removing noise, signatures of the atmosphere and/or instrument, and other data-contaminating factors. For examples of this type of reduction one could refer to [15, 16, 44, 50].

Sky surveys represent the fundamental core of astronomy. Historically, making sky observations, plotting and monitoring with the naked eye allowed significant developments in astronomical science. Today both wide-field surveys (large data sets obtained over areas of the sky that may be at least of the order of 1% of the entire Galaxy, e.g. see Gaia in Table 1) and deep surveys (aimed at getting important informative content from only small areas of the galaxy but with significant depth) represent keys to groundbreaking discoveries about the Universe.

Selected recent surveys frequently approached with the use of data science tools are listed in Table 1. For a more exhaustive list of astronomical surveys one can refer to [9]. It can be noticed that the number of objects listed – even for older projects – is huge. The dimensionality of the datasets depends on appropriate data preprocessing (e.g. frequency binning) but may reach thousands of attributes.

The extraction of knowledge from such enormous data sets is a highly complex task. Difficulties which may occur are mainly related to limits in the processing performance of computer systems – for large-sized samples – and to problems exclusively connected with the analysis of multidimensional data. The latter arise mostly from a number of phenomena occurring in data sets of this type, known in the literature as "the curse of multidimensionality". Above all, this includes the exponential growth in the sample size necessary to achieve appropriate effectiveness of data analysis methods with increasing dimension, as well as the vanishing difference between near and far points (norm concentration) under standard distance metrics [30].

Survey data can be explored with a variety of data science techniques. First of all, outlier detection, which is aimed at identifying elements which are atypical for the whole dataset. In astronomy this technique is generally useful for discovering unusual, rare or unknown types of astronomical objects or phenomena, but also for data preprocessing [59]. Another procedure is cluster analysis, which is the division of available data elements into subgroups (clusters) such that the elements belonging to each cluster are similar to each other while there exists a significant dissimilarity between different cluster elements [33].
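The norm-concentration effect mentioned above is easy to observe directly. The following is a minimal sketch on synthetic uniform data (the sample sizes and dimensions are our illustrative choices, not values from any survey): as the dimension n grows, the relative contrast between the nearest and the farthest neighbour of a query point collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for n in (2, 10, 100, 1000):
    X = rng.random((500, n))            # 500 points in the unit hypercube
    q = rng.random(n)                   # a query point
    d = np.linalg.norm(X - q, axis=1)   # Euclidean distances to all points
    # relative contrast: how much farther is the farthest point
    # than the nearest one?
    contrast[n] = (d.max() - d.min()) / d.min()
    print(f"n = {n:4d}   (d_max - d_min) / d_min = {contrast[n]:.2f}")
```

With growing n the printed contrast shrinks towards zero – exactly the vanishing difference between near and far points that makes distance-based methods struggle on high-dimensional catalogs.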

Table 2: Selected methods of dimensionality reduction used for astronomical data

Method                                Linear   Parameters   References
Principal Component Analysis          Yes      –            [27]
Kernel Principal Component Analysis   No       1            [45]
Isomap                                No       1            [48]
Locally Linear Embedding              No       1            [43]
Diffusion Maps                        No       2            [31]
Locality Preserving Projection        Yes      1            [20]
Laplacian Eigenmaps                   No       2            [3]

Identifying galaxies or groups of objects/galaxies are clustering tasks frequently performed in astronomical data analysis [13, 26]. Clustering techniques can also be used for data reduction, as will be indicated in Section 5. Both detection of outliers and clustering represent methods of unsupervised learning, which are supposed to find hidden structures and relations among unlabeled data instances. Conversely, object classification represents a typical supervised learning technique. Its goal is to assign each element to one of a set of fixed classes, given a known set of labeled representative patterns. In astronomy it is predominantly used for identifying object types [8, 47].

Algorithms aimed at solving all the aforementioned problems are prone to negative effects from large data size, which may make their execution ineffective or even impossible. Besides applying new knowledge discovery techniques, a variety of procedures for feature extraction and data numerosity reduction can be used. They can be oriented not only towards a specific data mining task but also towards data visualization, which is very important for performing visual analytics on astronomical observations. These methods will be covered in more detail in the following sections.

3 Techniques of Feature Extraction

Let us assume that the object-based dataset is represented by a matrix of dimension m × n:

X = [x_1 | x_2 | ... | x_m]^T,    (1)

with m rows representing data instances (objects) and n columns representing features or attributes of all objects. The aim of reducing data dimensionality is to transform the data matrix in order to obtain a new representation with dimension m × N, where N is considerably smaller than n. The reduction can be achieved either by choosing the N most significant coordinates/features (i.e. through so-called feature selection) or by means of constructing a reduced data set based on the initial features (feature extraction) [24, 57]. The latter can be treated as more general, since data selection is a particularly simple case of extraction. It is important to note that any reduction procedure can be coupled with an underlying supervised technique, where the performance of the latter is used to evaluate the quality of the data mapping. It is common that the dimensionality of astronomical data is reduced together with the execution of classification algorithms.

Table 2 lists the feature extraction methods commonly used for astronomical data. Besides the algorithms' names and bibliographical references, Table 2 also provides the type of mapping, i.e. linear/nonlinear, which states whether the resulting dataset is obtained through a linear transformation of the initial one. In addition, the number of required parameters – which is very important from a practical point of view – is also included. All these methods, along with their applications in astronomy, will be briefly presented below. Afterwards, we will also concisely present feature selection techniques.

The list of feature extraction algorithms should start with Principal Component Analysis (PCA), as it is the most commonly used dimensionality reduction method. PCA relies on an orthogonal linear transformation which maps the dataset into a new reduced feature space characterized by the greatest variance of the projected data along the new coordinate system axes. Practically speaking, the transformation is represented by the principal eigenvectors (the so-called principal components) of the standardized data sample covariance matrix. PCA does not need significant computational effort and requires only one input parameter – the dimensionality of the reduced feature space N, which is shared by the majority of dimensionality reduction procedures. The suggested value for N, known as the intrinsic dimensionality, can be estimated through the analysis of eigenvalues – a standard approach for establishing the reduced number of features. PCA is widely used for astronomical data. As an illustration one can name the study on classification of galaxies from SDSS (Sloan Digital Sky Survey), where PCA was not only used for feature extraction but also for obtaining 2D plots [36]. Besides dimensionality reduction, PCA has also been used, for instance, to study the importance of features present in the Hipparcos catalog [21].

Kernel PCA constitutes an important modification of PCA by using the so-called "kernel trick" [45]. Instead of the principal eigenvectors of the covariance matrix, Kernel PCA employs the eigenvectors of the kernel matrix. It is obtained by transforming the dataset using a selected positive semi-definite kernel function K. The choice of this function can be considered as an input parameter (typically a normal kernel can be used). Consequently, Kernel PCA benefits from the property of constructing nonlinear mappings. It found successful applications in astronomy for supernova photometric classification with the nearest neighbor classifier [25]. Its superiority over PCA for specific datasets was also demonstrated therein.

Isomap is a dimensionality reduction technique based on preserving pairwise geodesic (i.e. measured over the manifold) distances between data points. It estimates these distances with the shortest path between two points in the neighbourhood graph. Every data point in this graph is connected with its k neighbours, with k being an Isomap parameter. The resulting pairwise geodesic distance matrix is then transformed using classical multidimensional scaling [48]. Isomap was used, for instance, in classification of stellar spectral subclasses in SDSS data [5] and for discovering white dwarf + main sequence binaries in the same survey [53]. In both cases the Support Vector Machine method was employed as the classification engine, with the superiority of this solution over the one using PCA being demonstrated once more. Similar studies devoted to outlier detection have also been carried out.

Local Linear Embedding (LLE), similarly to Isomap, starts with constructing a neighbourhood graph. However, LLE preserves only the local geometry of the manifold surrounding each data element, by representing it through a linear combination – the so-called reconstruction weights – of its k nearest neighbours (k has to be supplied as a parameter). Technically, the low-dimensional embedding is obtained using the eigenvectors (the ones corresponding to the smallest non-zero eigenvalues) of the inner product of the reconstruction weight matrix W subtracted from the identity matrix I [43]. LLE was employed for classification of objects from SDSS using their spectra in [52]. An original 1000-dimensional sample was reduced to a three-dimensional subspace. As the algorithm is computationally expensive, the paper also proposes a suitable data sampling scheme.

Laplacian Eigenmaps (LE) is another technique aimed at preserving local properties of the manifold. It uses additional weights corresponding to the proximity index in the set of k-nearest neighbours. It essentially means that the highest contribution to the cost function comes from the nearest neighbor. Establishing the low-dimensional embedding is again formulated as an eigenvalue problem through spectral graph theory [3]. Weights of the edges in the neighbourhood graph are computed using the Gaussian kernel function, therefore a supplementary parameter, i.e. the deviation σ of this function, has to be provided. A linear variant of this technique – Locality Preserving Projections (LPP) – can also be named [19, 20]. While LPP has already been used with success for stellar spectral classification based on SDSS data [61], the application of Laplacian Eigenmaps for astronomical purposes was only briefly demonstrated in the paper describing the new machine learning library named "megaman" [35].

Finally, Diffusion Maps (DM) rely on a Markov random walk on the data represented by a graph [31]. The method is based on obtaining the so-called diffusion distance, which is related to the proximity of the data elements. The proximity is calculated during random walks performed for a limited number of time steps. The goal of dimensionality reduction is to preserve pairwise diffusion distances. The concept is derived from the field of dynamic systems. The method has been used, e.g., for predicting redshifts of galaxies in SDSS data by means of robust regression [40], as well as for the estimation of star formation history and supernova light curve classification [32].

It was already indicated that one alternative to data transformation is to select the most representative set of features – which is known as feature selection. It can be performed with filter methods like Relief [28] or Focus [2]. Their aim is to rank the available attributes according to their informative content (or predictive power) and then select the top ones. Another approach is to use a wrapper approach. It involves an iterative choice of feature subsets based on their predictive power, with forward and backward elimination being the most popular procedures of this class [56]. The first starts with an empty feature set and iteratively adds useful attributes; the latter begins with the full set and in each iteration reduces it according to an optimization criterion. For a more detailed description of feature selection algorithms and a demonstration of their applications to astronomical data (for a customized database of stars, galaxies and galactic nuclei, as well as the Catalina Real-Time Transient Survey and the Kepler Mission datasets) one could refer to [11, 60].
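The wrapper approach with forward selection described above can be sketched in a few lines. The example below is an illustrative setup of our own (synthetic data, scikit-learn classifier and cross-validation as the optimization criterion), not the pipeline of any of the studies cited here: the subset is grown greedily by adding, at each step, the feature that most improves cross-validated accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# synthetic catalogue: 500 objects, 8 features, only the first two informative
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)

def forward_select(X, y, n_keep):
    """Greedy wrapper: repeatedly add the feature that best improves CV accuracy."""
    chosen = []
    for _ in range(n_keep):
        scores = {
            j: cross_val_score(LogisticRegression(), X[:, chosen + [j]], y, cv=5).mean()
            for j in range(X.shape[1]) if j not in chosen
        }
        chosen.append(max(scores, key=scores.get))
    return chosen

print(forward_select(X, y, 2))  # the two informative features are recovered
```

Backward elimination works analogously, starting from the full feature set and dropping, at each step, the feature whose removal hurts the criterion least.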

4 Methods of Instances Reduction

As previously mentioned, the data set size can be reduced to speed up data analysis calculations or to make them feasible at all [7]. For astronomical datasets such reduction is frequently used only to enable informative visualizations.

In the classical approach, data reduction is realized mostly with sampling methods [38]. Uniform sampling with or without replacement is the most widely used approach – also in astronomy. An example of its use can be found in [12]. In this study sampling was used for generating the portion of data from which approximate principal components were to be obtained. The subsequent analysis concerned outlier detection for 2MASS and SDSS survey data. In [41] automated star/galaxy clustering for digital sky data is under consideration. Randomly selected data subsets are employed for generating starting points for clustering procedures. A sample from the Digitized Palomar Sky Survey (DPOSS) is used for experimental verification. More specialized random sampling strategies related to stratified sampling – preserving the distribution of objects among classes – were also identified in the literature of the subject. For example, study [1], concerning the classification of six million unresolved photometric detections from the SDSS survey, obtained training data by supplementing a random sample with under-represented examples. With this approach a well-known issue with random sampling, namely poor representation of sparsely represented examples, is alleviated.

Some specific data reduction procedures designed to be used in conjunction with individual data mining tools can also be found in the astronomical domain, as demonstrated in [54]. It uses kernel density estimation employing only a reduced, small percentage of the data sample to form probabilistic models, for instance for modeling star distribution. For that purpose the whole data set is segmented into hyper-balls with fixed radii, where each cell is associated with a kernel and a mixture weight, and subsequently the kernels are updated to fit the local distribution [54].

A variety of other methods were developed purely for visualization and visual analytics. They often do not perform strict reduction, understood as the elimination of data elements. They simply create a new data context consisting of selected data points which may then be effectively visualized. Such selection can be done manually [6], using cubes or other geometric structures [39], or based on distance from the viewpoint. A more detailed review of methods dealing with large astronomical datasets solely for the purpose of visualization can be found in [18].

5 Future Challenges and Suggested Algorithms

Table 1 provided a brief list of sky surveys. It included two which can be perceived as upcoming challenges: Gaia and LSST. The amount of information being generated by these projects is overwhelming. LSST will generate the equivalent of one SDSS each night, for 10 years [58]. Storing data of this size and performing effective processing will not be a minor problem. It will require careful data selection and transformation aimed at enabling even simple data mining tasks.

It was already pointed out that essentially two features of data reduction algorithms are most important – also in the context of forthcoming sky surveys and the data they will generate. The first is scalability, that is, the ability to use the same procedure even for huge datasets. It is essential to tackle datasets of the ever-increasing size which we may expect in the future. The second is a low number of required parameters, or their semi-automatic adjustment. Taking into account the significant computational costs associated with data mining on astronomical data instances, spending too much time on preliminary experiments related to data reduction should be avoided.

To reduce the number of instances we propose here to use the data condensation technique introduced by Mitra et al. [37]. It iteratively finds the point with the closest k-th nearest neighbor (the distance to which is denoted by r_k) and adds it to the reduced dataset. Simultaneously, the points lying within a disc of radius 2 · r_k around it are eliminated. As the procedure requires many k-NN search and range search operations, using kd-trees was investigated to speed up these searches [4]. We will demonstrate here the application of this approach to a compact version of the Hipparcos dataset with 9 features and 60876 objects. For the reduction we use only the spatial coordinates of the objects.

First we examined the scalability of the proposed solution. Figure 1 demonstrates that its complexity was identified to be quadratic. It means that, on the desktop PC used in the experiment, running the algorithm for a dataset of similar structure to Hipparcos, with m = 1000000 and k = 5, would take approximately 61 hours to process, which seems acceptable.

To measure the accuracy of data condensation, ISE (Integrated Square Error) values were also under investigation. In general:

ISE(\hat{f}) = \int (\hat{f}(x) - f(x))^2 dx    (2)
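The condensation procedure described above can be sketched compactly. The code below is our own illustrative implementation on synthetic 3D points (with SciPy's kd-tree standing in for the structure discussed in [4]); it is not the exact code used in the experiments reported here:

```python
import numpy as np
from scipy.spatial import cKDTree

def condense(X, k):
    """Data condensation in the spirit of Mitra et al. [37] (sketch).

    Repeatedly keeps the remaining point with the smallest distance
    r_k to its k-th nearest neighbour (i.e. the locally densest point)
    and discards every point inside the ball of radius 2 * r_k around it.
    """
    remaining = np.arange(len(X))
    kept = []
    while len(remaining) > k:
        tree = cKDTree(X[remaining])
        # distance of each remaining point to its k-th neighbour
        # (index 0 of the query result is the point itself, hence k + 1)
        r_k = tree.query(X[remaining], k=k + 1)[0][:, -1]
        best = int(np.argmin(r_k))
        kept.append(remaining[best])
        # positions (within `remaining`) of points inside the 2*r_k ball;
        # this always includes `best` itself, so the loop terminates
        inside = tree.query_ball_point(X[remaining[best]], 2.0 * r_k[best])
        remaining = np.delete(remaining, inside)
    kept.extend(remaining.tolist())  # too few points left for a k-NN query
    return X[np.asarray(kept, dtype=int)]

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))     # stand-in for 3D object coordinates
for k in (5, 20):
    print(k, len(condense(X, k)))  # larger k gives a stronger reduction
```

As in the experiment, k directly controls the intensity of the reduction; rebuilding the kd-tree in every iteration keeps the sketch simple, but it is also what makes the overall behaviour roughly quadratic.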

Figure 1: Scalability of the Mitra et al. algorithm (Hipparcos dataset)

Figure 2: Integrated Square Error values obtained for probabilistic density estimates of the reduced Hipparcos data set

Let us consider f(x) as the original probabilistic density function. Ideally it should be of analytic form, describing the whole population. Here it will be represented by an estimator obtained for the whole Hipparcos 3D sample, while \hat{f}(x) will correspond to the same estimator constructed for the reduced dataset. Numerically, the problem of calculating the Integrated Square Error is then given by:

ISE(\hat{f}) = \sum_{i=1}^{m} (\hat{f}(x_i) - f(x_i))^2    (3)

with x_i being a sample element obtained from the original dataset (at the same time m = 60876). It basically means that we calculate an error at each sample element. We then examine ISE in this form for three cases of data size reduction, using the same condensation intensity: (1) random sampling (uniformly distributed), (2) the data condensation algorithm investigated here, and (3) K-means clustering (with cluster centers serving as new, reduced sample elements). Density estimates were calculated by means of a Kernel Density Estimator:

\hat{g}(x) = \frac{1}{m h^n} \sum_{i=1}^{m} w_i K\left(\frac{x - x_i}{h}\right)    (4)

For the approaches involving representing a group of points as one point, i.e. (2) and (3), we use weights w_i equal to the number of points in a cluster. For the experiments a Gaussian kernel was used, and the smoothing parameter h was established using the commonly applied Silverman's "rule of thumb" [29]. As random sampling and K-means contain a randomized component, we used 30 replicates and report the ISE mean and standard deviation. Figure 2 exhibits the obtained results. It may be noticed that K-means underperforms significantly. When considering random sampling and data condensation, in all cases it was the latter technique which offered better condensation quality. What is more, the relative difference in the ISE values of the two methods grows – from 7% in the case of k = 5 to 26% in the case of k = 20. For k = 5, the results of random sampling were worse than data condensation in 22 replications of the experiment. For k = 20 this number grew to 27. To conclude, the proposed approach offers reasonable time performance along with a cardinality reduction which preserves the important informative content of the dataset. What is also intuitive is that the parameter k allows one to control the intensity of the reduction.

As an alternative to condensation techniques, other clustering methods may also be employed (e.g. with the elements closest to the cluster centers being preserved). The main requirements in this case are the ability to form aspherical clusters and decent computational efficiency. As an example of a suitable algorithm, the one demonstrated in [42] can be named.

For dimensionality reduction we suggest experimenting with the recent unsupervised algorithm t-SNE. It represents an improved variant of Stochastic Neighbourhood Embedding (SNE), introduced by Hinton and Roweis [22]. In general, SNE techniques start with calculating similarity matrices in both the original data space and the low-dimensional embedding space, in such a way that the similarities form a probability distribution over pairs of objects [51]. The probabilities in the t-SNE variant considered here are given by a Student-t kernel computed from the input data and from the embedding. The mapping itself is obtained by minimizing the Kullback-Leibler divergence between the two probability distributions. It was already demonstrated that t-SNE offers very high quality mappings. For astronomical purposes the main concern could be feasibility in terms of computational time. That is why we evaluated the algorithm's complexity in its Barnes-Hut variant [34] using the Hipparcos dataset. The results displayed in Figure 3 prove that it truly offers O(m log m) computational complexity, as indicated in theoretical studies. It seems promising in terms of possible applications in astronomy.
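A minimal sketch of running Barnes-Hut t-SNE is shown below, using scikit-learn's implementation on synthetic data in place of the Hipparcos sample (the dataset, the parameter values and the library choice are our assumptions for illustration only):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
# stand-in for a small object catalogue: two populations in 9 features
X = np.vstack([rng.normal(0.0, 1.0, (300, 9)),
               rng.normal(4.0, 1.0, (300, 9))])

# the Barnes-Hut variant approximates the gradient with a space-partitioning
# tree, bringing the cost per iteration down to O(m log m)
tsne = TSNE(n_components=2, method="barnes_hut", perplexity=30.0,
            init="pca", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # one 2D point per input object
```

Note that the Barnes-Hut variant is restricted to very low target dimensionalities (2 or 3), which matches its typical use for visualization rather than general feature extraction.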

also about finding a way to successfully navigate through ever-present continuous streams of diverse data.

astronomy. A more exhaustive list of alternative algorithms for dimensionality reduction can be found in [10].

Figure 3: Scalability of the Barnes-Hut t-SNE algorithm (Hipparcos dataset)

Finally, it is worth noting that moving one step further than the use of specific well-performing algorithms with technical improvements (GPU and distributed computing, effective data representation, etc.) is also possible. By means of an alternative computing paradigm, new possibilities for high-performance data mining might appear. First experiments in quantum computing for knowledge discovery show that it is a promising direction which might be used to tackle the problems of future astronomical data analysis [55].

6 Conclusion

The paper studied methods of data reduction in astronomy when processed object-based data is under consideration. Besides presenting available techniques and their applications, we tried to demonstrate which solutions seem more promising – also for future datasets obtained from prospective sky surveys like Gaia or LSST. The problem of discovering knowledge from astronomical datasets is not trivial – besides issues of data size, difficulties related to data distribution and real-time character have to be addressed. However, the benefits and the amount of useful information coming from astronomical data analysis may have a tremendous impact on space science. This is demonstrated by the fact that the Sloan Digital Sky Survey, which has been a precursor of the field of Astroinformatics, has already given foundation to thousands of scientific publications [14]. To conclude, it should also be noted that the impact of contemporary data-oriented astronomy is not limited to discovering the truth about the Universe.

Acknowledgement: This research was supported in part by PL-Grid Infrastructure. The contribution was co-funded by the European Union from resources of the European Social Fund, Project PO KL "Information technologies: Research and their interdisciplinary applications", Agreement UDA-POKL.04.01.01-00-051/10-00. This work was partially funded by the Portuguese agency "Fundação para a Ciência e a Tecnologia" (FCT) in the framework of project UID/EEA/00066/2013 and also by the European Space Agency (ESA) under contract 4000112822/14/NL/JD of project GAVIDAV.

References

[1] Abraham S. et al., A photometric catalogue of quasars and other point sources in the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 2012, 419, 80-94.
[2] Almuallim H. and Dietterich T. G., Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI'91, AAAI Press, 1991, 547-552.
[3] Belkin M. and Niyogi P., Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Computation, 2003, 15, 1373-1396.
[4] Bentley J. L., Multidimensional binary search trees used for associative searching. Commun. ACM, 1975, 18(9), 509-517.
[5] Bu Y., Chen F., and Pan J., Stellar spectral subclasses classification based on Isomap and SVM. New Astronomy, 2014, 28, 35-43.
[6] Burgess R., Falcão A., Fernandes T., Ribeiro R. A., Gomes M., Krone-Martins A., and de Almeida A. M., Selection of large-scale 3D point cloud data using gesture recognition. In Luis M. Camarinha-Matos, Thais A. Baldissera, Giovanni Di Orio, and Francisco Marques, editors, Technological Innovation for Cloud-Based Engineering Systems: 6th IFIP WG 5.5/SOCOLNET Doctoral Conference on Computing, Electrical and Industrial Systems, DoCEIS 2015, Costa de Caparica, Portugal, April 13-15, 2015, Proceedings, Springer International Publishing, 2015, 188-195.
[7] Czarnowski I. and Jedrzejowicz P., Application of agent-based simulated annealing and tabu search procedures to solving the data reduction problem. International Journal of Applied Mathematics and Computer Science, 2011, 21(1), 57-68.
[8] Dan G., Yan-Xia Z., and Yong-Heng Z., Random forest algorithm for classification of multiwavelength data. Research in Astronomy and Astrophysics, 2009, 9(2), 220.
[9] Djorgovski S. G., Mahabal A., Drake A., Graham M., and Donalek C., Sky Surveys. In T. D. Oswalt and H. E. Bond, editors, Planets, Stars and Stellar Systems. Volume 2: Astronomical Techniques, Software and Data, Springer, 2013, 223.
[10] Domanska D. and Łukasik S., Handling high-dimensional data in air pollution forecasting tasks. Ecological Informatics, 2016, 34, 70-91.

[11] Donalek C. et al., Feature selection strategies for classifying high dimensional astronomical data sets. In Big Data, 2013 IEEE International Conference on, 2013, 35-41.
[12] Dutta H., Giannella C., Borne K., and Kargupta H., Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System, SIAM, 2005, 47, 473-478.
[13] Edwards K. and Gaber M. M., Astronomy and Big Data: A Data Clustering Approach to Identifying Uncertain Galaxy Morphology. Springer Science & Business Media, 2014.
[14] Feigelson E. D. and Babu G. J., Big data in astronomy. Significance, 2012, 9, 22-25.
[15] Ferguson H. C. et al., Astronomical Data Reduction and Analysis for the Next Decade. In astro2010: The Astronomy and Astrophysics Decadal Survey, 2010. Position paper no. 15.
[16] Freudling W. et al., Automated data reduction workflows for astronomy. The ESO Reflex environment. Astronomy and Astrophysics, 2013, 559, A96.
[17] Grandinetti L., Joubert G. R., and Kunze M., Big Data and High Performance Computing. IOS Press, 2015.
[18] Hassan A. and Fluke C. J., Scientific visualization in astronomy: Towards the petascale astronomy era. PASA - Publications of the Astronomical Society of Australia, 2011, 28, 150-170.
[19] He X., Cai D., Yan S., and Zhang H. J., Neighborhood preserving embedding. In Proceedings of the 10th IEEE International Conference on Computer Vision, IEEE, 2005, 1208-1213.
[20] He X. and Niyogi P., Locality preserving projections. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, 2003, 153-160.
[21] Hernández-Pajares M. and Floris J., Classification of the Hipparcos input catalogue using the Kohonen network. Monthly Notices of the Royal Astronomical Society, 1994, 268(2), 444-450.
[22] Hinton G. E. and Roweis S. T., Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems. The MIT Press, Cambridge, 2002, 15, 833-840.
[23] IAU list of observatory codes, http://www.minorplanetcenter.net/iau/lists/ObsCodesF.html, accessed Aug 15, 2016.
[24] Inza I., Larranaga P., Etxeberria R., and Sierra B., Feature subset selection by Bayesian network-based optimization. Artificial Intelligence, 2000, 123(1-2), 157-184.
[25] Ishida E. E. O. and de Souza R. S., Kernel PCA for Type Ia supernovae photometric classification. Monthly Notices of the Royal Astronomical Society, 2013, 430, 509-532.
[26] Jang W. and Hendry M., Cluster analysis of massive datasets in astronomy. Statistics and Computing, 2007, 17(3), 253-262.
[27] Jolliffe I. T., Principal Component Analysis. Springer, New York, 2002.
[28] Kira K. and Rendell L. A., The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, AAAI Press, 1992, 129-134.
[29] Kulczycki P., Kernel estimators in industrial applications. In Bhanu Prasad, editor, Soft Computing Applications in Industry, Springer, Berlin-Heidelberg, 2008, 69-91.
[30] Kulczycki P. and Łukasik S., An algorithm for reducing dimension and size of sample for data exploration procedures. International Journal of Applied Mathematics and Computer Science, 2014, 24, 133-149.
[31] Lafon S. and Lee A. B., Diffusion maps and coarse-graining: a unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(9), 1393-1403.
[32] Lee A. B. and Freeman P. E., Exploiting non-linear structure in astronomical data for improved statistical inference. In Eric D. Feigelson and G. Jogesh Babu, editors, Statistical Challenges in Modern Astronomy V, Springer, New York, 2012, 255-267.
[33] Łukasik S. and Kulczycki P., An algorithm for sample and data dimensionality reduction using Fast Simulated Annealing. In Jie Tang, Irwin King, Ling Chen, and Jianyong Wang, editors, Advanced Data Mining and Applications: 7th International Conference, ADMA 2011, Beijing, China, December 17-19, 2011, Proceedings, Part I, Springer, Berlin-Heidelberg, 2011, 152-161.
[34] van der Maaten L., Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 2014, 15, 3221-3245.
[35] McQueen J., Meila M., VanderPlas J., and Zhang Z., megaman: Manifold Learning with Millions of points. ArXiv e-prints, March 2016.
[36] Misra A. and Bus S. J., Artificial Neural Network Classification of Asteroids in the Sloan Digital Sky Survey. In AAS/Division for Planetary Sciences Meeting Abstracts #40, volume 40 of Bulletin of the American Astronomical Society, 2008, 508.
[37] Mitra P., Murthy C. A., and Pal S. K., Density-based multiscale data condensation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24, 734-747.
[38] Pal S. K. and Mitra P., Pattern Recognition Algorithms for Data Mining. CRC Press, 2004.
[39] Perkins S. et al., Scalable desktop visualisation of very large radio astronomy data cubes. New Astronomy, 2014, 30, 1-7.
[40] Richards J. W., Freeman P. E., Lee A. B., and Schafer C. M., Exploiting low-dimensional structure in astronomical spectra. The Astrophysical Journal, 2009, 691(1), 32.
[41] Rocke D. M. and Dai J., Sampling and subsampling for cluster analysis in data mining: With applications to sky survey data. Data Mining and Knowledge Discovery, 2003, 7(2), 215-232.
[42] Rodriguez A. and Laio A., Clustering by fast search and find of density peaks. Science, 2014, 344(6191), 1492-1496.
[43] Roweis S. and Saul L., Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290, 2323-2326.
[44] Schirmer M., THELI: Convenient Reduction of Optical, Near-infrared, and Mid-infrared Imaging Data. The Astrophysical Journal Supplement Series, 2013, 209, 21.
[45] Schölkopf B., Smola A., and Müller K.-R., Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 1998, 10, 1299-1319.
[46] Szalay A. and Gray J., The world-wide telescope. Science, 2001, 293(5537), 2037-2040.
[47] Tang C.-H. et al., Efficient Astronomical Data Classification on Large-Scale Distributed Systems. Springer, Berlin-Heidelberg, 2010, 430-440.
[48] Tenenbaum J., de Silva V., and Langford J., A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290, 2319-2323.
[49] Thakar A. R., The Sloan Digital Sky Survey: Drinking from the fire hose. Computing in Science and Engineering, 2008, 10(1), 9-12.
[50] Valdes F. G., The Reduction of CCD Mosaic Data. In R. Gupta, H. P. Singh, and C. A. L. Bailer-Jones, editors, Automated Data Analysis in Astronomy, 2002, 309.

[51] van der Maaten L. and Hinton G. E., Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 2008, 9, 2579-2605.
[52] Vanderplas J. and Connolly A., Reducing the dimensionality of data: Locally Linear Embedding of Sloan Galaxy Spectra. The Astronomical Journal, 2009, 138(5), 1365.
[53] Wang W., Guo G., Jiang B., and Shi Y., Automatic classification for WDMS with Isomap and SVM. In Information and Automation, 2015 IEEE International Conference on, 2015, 1409-1413.
[54] Wang X., Tino P., Fardal M. A., Raychaudhury S., and Babul A., Fast Parzen window density estimator. In 2009 International Joint Conference on Neural Networks, 2009, 3267-3274.
[55] Wittek P., Quantum Machine Learning: What Quantum Computing Means for Data Mining. Academic Press, 2014.
[56] Xu L. and Zhang W.-J., Comparison of different methods for variable selection. Analytica Chimica Acta, 2001, 446(1-2), 475-481.
[57] Xu R. and Wunsch D. C., Clustering. Wiley, New Jersey, 2009.
[58] Zhang Y. and Zhao Y., Astronomy in the Big Data Era. Data Science Journal, 2015, 14, 1-9.
[59] Zhang Y.-X., Luo A.-L., and Zhao Y.-H., Outlier detection in astronomical data. In P. J. Quinn and A. Bridger, editors, Optimizing Scientific Return for Astronomy through Information Technologies, 2004, 521-529.
[60] Zheng H. and Zhang Y., Feature selection for high-dimensional data in astronomy. Advances in Space Research, 2008, 41(12), 1960-1964.
[61] Zhong-Bao L., Stellar spectral classification with Locality Preserving Projections and Support Vector Machine. Journal of Astrophysics and Astronomy, 2016, 37(2), 1-7.