Extension of M Dwarf Spectra Based on Adversarial Autoencoder

universe Communication Extension of M Dwarf Spectra Based on Adversarial AutoEncoder Jiyu Wei , Xingzhu Wang, Bo Li , Yuze Chen and Bin Jiang * School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai 264209, China; [email protected] (J.W.); [email protected] (X.W.); [email protected] (B.L.); [email protected] (Y.C.) * Correspondence: [email protected] Abstract: M dwarfs are main sequence stars and they exist in all stages of galaxy evolution. As the living fossils of cosmic evolution, the study of M dwarfs is of great significance to the understanding of stars and the stellar populations of the Milky Way. Previously, M dwarf research was limited due to insufficient spectroscopic spectra. Recently, the data volume of M dwarfs was greatly increased with the launch of large sky survey telescopes such as Sloan Digital Sky Survey and Large Sky Area Multi-Object Fiber Spectroscopy Telescope. However, the spectra of M dwarfs mainly concentrate in the subtypes of M0–M4, and the number of M5–M9 is still relatively limited. With the continuous development of machine learning, the generative model was improved and provides methods to solve the shortage of specified training samples. In this paper, the Adversarial AutoEncoder is proposed and implemented to solve this problem. Adversarial AutoEncoder is a probabilistic AutoEncoder that uses the Generative Adversarial Nets to generate data by matching the posterior of the hidden code vector of the original data extracted by the AutoEncoder with a prior distribution. Matching the posterior to the prior ensures each part of prior space generated results in meaningful data. To verify the quality of the generated spectra data, we performed qualitative and quantitative verification. The experimental results indicate the generation spectra data enhance the measured Citation: Wei, J.; Wang, X.; Li, B.; spectra data and have scientific applicability. Chen, Y.; Jiang, B. Extension of M Dwarf Spectra Based on Adversarial Keywords: M-type dwarfs; adversarial AutoEncoder; spectral data generation; sky survey AutoEncoder. Universe 2021, 7, 326. https://doi.org/10.3390/ universe7090326 1. Introduction Academic Editor: Lorenzo Iorio The traditional method to address spectral classification of stars is to combine their photometric and spectroscopic data together. The most commonly used Harvard stellar Received: 25 June 2021 spectral classification system was proposed by the Harvard University Observatory in the Accepted: 30 August 2021 late 19th century [1]. In accordance with the order of the surface temperature of the star, Published: 31 August 2021 the system divides the stellar spectra into O, B, A, F, G, K, M, and other types [2]. M dwarfs are the most common stars in the Galaxy [3] and are characterized by low Publisher’s Note: MDPI stays neutral brightness, small diameter and mass, and a surface temperature around or lower than with regard to jurisdictional claims in 3500 K. With the nuclear fusion speed inside the M dwarfs being slow, M dwarfs tend published maps and institutional affil- iations. to have a long life span, and they exist in all stages of the evolution of the Galaxy [4]. A huge number of spectra are obtained with the emergence of sky survey telescopes, such as Sloan Digital Sky Survey (SDSS) [5,6] and Large Sky Area Multi-Object Fiber Spectroscopy Telescope (LAMOST) [7,8]. However, the distribution of the subtypes is unbalanced. In the SDSS-DR15, as shown Copyright: © 2021 by the authors. in Figure1, the spectra of M0–M4 are relatively greater in number, whereas that of M5–M9 Licensee MDPI, Basel, Switzerland. is limited. The generation of specific subtype spectra of M dwarfs is helpful to solve the This article is an open access article problem of unbalanced distribution and provide more reliable samples for research. For distributed under the terms and example, in the SDSS dataset, the number of M5–M9 is very low. When the data are conditions of the Creative Commons Attribution (CC BY) license (https:// limited, it is difficult for astronomers to analyze them using machine learning or deep creativecommons.org/licenses/by/ learning methods, such as classification, clustering, dimensionality reduction, etc. If we 4.0/). can effectively expand the data, we can improve the M Dwarf dataset to better understand Universe 2021, 7, 326. https://doi.org/10.3390/universe7090326 https://www.mdpi.com/journal/universe Universe 2021, 7, 326 2 of 12 these stars. In this study, we select M-class stars with unbalanced distribution of subtypes M0–M6 (signal-to-noise ratio > 5) to verify the effectiveness of our method. Figure 1. Distribution of M0–M6 dwarfs with signal-to-noise ratio > 5 of Sloan Digital Sky Survey (SDSS) 15. With the development of machine learning and deep learning, generation models were remarkably improved. An increasing number of methods are proposed to solve the lack of data, and all kinds of data, especially the two-dimensional data from the real world, were expanded effectively. AutoEncoder is a kind of neural network. After encoding and decoding, it can obtain output similar to the input. The Variational AutoEncoder (VAE), a model proposed by Kingma and Welling [9], combines variational Bayesians with neural network and achieves good results with data generation. Generative Adversarial Nets (GAN) is a model proposed by Goodfellow et al. [10] to solve the lack of data, especially for the 2D data from the real world [10–14]. GAN consists of a discriminator and a generator. The discriminator is designed to determine whether the input data are real or fake data generated by the generator, and the task of the generator is to generate fake data that can confuse the discriminator as much as possible. Through such a dynamic game process, similar data are generated. The Adversarial AutoEncoder (AAE) is a model proposed by Makhzani et al. [15]. The AAE replaces the generator of the traditional generation model, GAN, with an AutoEncoder that can better learn the feature of discrete data. At the same time, the discriminator is used to correct the distribution after encoding. By doing this, the problem of traditional GAN with the generation of discrete data is solved effectively. However, due to the restriction of traditional GAN structure, the AAE also has problems, such as unstable training and model collapse, and training a good AAE with a small amount of data is difficult. For the quality of the generated spectrum, it is necessary to qualitatively test its similarity with the original spectrum. Principal Component Analysis (PCA) [16] is a widely used data dimensionality reduction algorithm that can extract features of high-dimensional data. T-Distributed Stochastic Neighbor Embedding (t-SNE) is a visualization method for high-dimensional data proposed by by Arjovsky et al. [17]. These two methods can visually demonstrate the similarity between the observational spectra and the generated spectra. Simultaneously, we use the generated spectrum to enhance the data of the classifier to further quantitatively verify the value of the generated spectrum. Fully connected neural network is a commonly used feature extraction method; through multilayer full connection, the feature of the spectrum can be effectively extracted. Training the classifier through two methods can visually show the performance improvement of the classifier Universe 2021, 7, 326 3 of 12 after data expansion using the generated spectrum. The contribution of this paper could be summarized as three-fold: 1. We used AAE to generate spectral data, and the model performed well with various kinds of spectral data, providing new ideas for the generation of spectral data. 2. From a qualitative and quantitative perspective, we proved the high quality of the generated spectra and the effectiveness and robustness of the AAE. 3. Our work provides a new direction for the combination of astronomy and machine learning. 2. Method In this work, we propose to use an Adversarial AutoEncoder (AAE) to generate spectral data. The model is composed of a generator and a discriminator; the generator is an AutoEncoder composed of an encoder and a decoder, and the discriminator is implemented by a GAN discriminant network. AAE does not directly train the network to generate spectral data. Instead, the output of the encoder in the AutoEncoder is constrained to conform to a preselected prior distribution by the game process between the discriminator and the generator. The network parameters of autoencoder are continuously optimized to make the output of the decoder as consistent as possible with the input of the AutoEncoder. Finally, a decoder is obtained as the generator of spectral data, which can stably decode the vector that conforms to a prior distribution into high-quality spectral data. In this study, we use two fully connected neural networks to form the encoder and decoder of the AutoEncoder, and the GAN discriminant network is constituted of a two- layer, fully connected neural network. The model is shown in Figure2. The training process is divided into two stages: the reconstruction stage, which aims to obtain a decoder that can stably reconstruct the encoding vector into high-quality spectral data, and the regularization stage, which aims to constrain the encoding vector generated by the encoder to an artificially selected prior distribution through GAN’s confrontation process. Figure 2. Structure of the Adversarial AutoEncoder (AAE). AAE is composed of an autoencoder and a discriminator. Green box part is autoencoder composed of an encoder and a decoder, and yellow box part is a GAN composed of a discriminator and an autoencoder. Firstly, spectral data are encoded and then decoded to generate reconstructed data similar to the input data. Secondly, encoding vector of input spectral data and a randomly selected vector which conforms to the normal distribution are used as false data and real data as input of the discriminator, respectively.

Load more