METHODS published: 21 October 2020 doi: 10.3389/fspas.2020.593516
Using Dimensionality Reduction and Clustering Techniques to Classify Space Plasma Regimes
Mayur R. Bakrania 1*, I. Jonathan Rae 1,2, Andrew P. Walsh 3, Daniel Verscharen 1,4 and Andy W. Smith 1
1Department of Space and Climate Physics, Mullard Space Science Laboratory, University College London, Dorking, United Kingdom, 2Department of Mathematics, Physics and Electrical Engineering, University of Northumbria, Newcastle, United Kingdom, 3European Space Astronomy Centre, ESA, Madrid, Spain, 4Space Science Center, University of New Hampshire, Durham, NH, United States
Collisionless space plasma environments are typically characterized by distinct particle populations. Although moments of their velocity distribution functions help in distinguishing different plasma regimes, the distribution functions themselves provide more comprehensive information about the plasma state, especially at times when the distribution function Edited by: includes non-thermal effects. Unlike moments, however, distribution functions are not Vladislav Izmodenov, fi Space Research Institute (RAS), easily characterized by a small number of parameters, making their classi cation more Russia difficult to achieve. In order to perform this classification, we propose to distinguish between Reviewed by: the different plasma regions by applying dimensionality reduction and clustering methods to Anton Artemyev, electron distributions in pitch angle and energy space. We utilize four separate algorithms to Space Research Institute (RAS), Russia achieve our plasma classifications: autoencoders, principal component analysis, mean shift, Olga Khabarova, and agglomerative clustering. We test our classification algorithms by applying our scheme Institute of Terrestrial Magnetism Ionosphere and Radio Wave to data from the Cluster-Plasma Electron and Current Experiment instrument measured in Propagation (RAS), Russia the Earth’s magnetotail. Traditionally, it is thought that the Earth’s magnetotail is split into *Correspondence: three different regions (the plasma sheet, the plasma sheet boundary layer, and the lobes), Mayur R. Bakrania that are primarily defined by their plasma characteristics. Starting with the ECLAT database [email protected] with associated classifications based on the plasma parameters, we identify eight distinct Specialty section: groups of distributions, that are dependent upon significantly more complex plasma and field This article was submitted dynamics. By comparing the average distributions as well as the plasma and magnetic field to Space Physics, a section of the journal parameters for each region, we relate several of the groups to different plasma sheet Frontiers in Astronomy and Space populations, and the rest we attribute to the plasma sheet boundary layer and the lobes. We Sciences find clear distinctions between each of our classified regions and the ECLAT results. The Received: 10 August 2020 automated classification of different regions in space plasma environments provides a useful Accepted: 22 September 2020 Published: 21 October 2020 tool to identify the physical processes governing particle populations in near-Earth space. Citation: These tools are model independent, providing reproducible results without requiring the Bakrania MR, Rae IJ, Walsh AP, placement of arbitrary thresholds, limits or expert judgment. Similar methods could be used Verscharen D and Smith AW (2020) Using Dimensionality Reduction and onboard spacecraft to reduce the dimensionality of distributions in order to optimize data Clustering Techniques to Classify collection and downlink resources in future missions. Space Plasma Regimes. Front. Astron. Space Sci. 7:593516. Keywords: space plasma environments, particle populations, distribution functions, dimensionality reduction, doi: 10.3389/fspas.2020.593516 clustering techniques
Frontiers in Astronomy and Space Sciences | www.frontiersin.org1 October 2020 | Volume 7 | Article 593516 Bakrania et al. Machine Learning: Classifying Plasma Regimes
1 INTRODUCTION
Particle populations in collisionless space plasma environments, such as the Earth’s magnetotail, are traditionally characterized by the moments of their distribution functions. 2D distribution functions in pitch angle and energy, however, provide the full picture of the state of each plasma environment, especially when non-thermal particle populations are present that are less easily characterized by a Maxwellian fit. These non-thermal plasma populations are ubiquitous across the solar system. They make crucial contributions to the bulk properties of a plasma, such as the temperature and collisionality (Hapgood et al., 2011). Magnetic reconnection, for example, heats non-thermal seed populations in both the diffusion and outflow regions, making them an important component of the overall energization process (Øieroset et al., 2002). High-quality measurements and analysis of collisionless plasmas are consequently of key importance when attempting to understand these non-thermal populations. Distribution functions, unlike moments, are not easily classified by a small number of parameters. We therefore propose to apply dimensionality reduction and clustering methods to particle distributions in pitch angle and energy space as a new method to distinguish between the different plasma regions. 2D distributions functions in pitch angle and energy are derived from full 3D distributions in velocity space based on the magnetic field direction and the assumption of gyrotropy of electrons. With these novel methods, we robustly classify variations in particle populations to a high temporal and spatial resolution, allowing us to better identify the physical processes governing particle populations in near- Earth space. Our method also has the advantage of being FIGURE 1 | The architecture of an autoencoder, adapted from Sakurada independent of the model applied, as these methods do not and Yairi (2014). Each circle represents a neuron corresponding to a data- require prior assumptions of the distributions of each point. Layer L1 represents the input data, layer L2 the encoded data in latent population. space, and layer L3 the reconstructed data. The circles labeled “+1” are known as “bias units,” which are parameters that are adjusted during training to improve the performance of the neural network. 1.1 Machine Learning Models In this section, we give a detailed account of the internal operations of each of the unsupervised machine learning algorithms used in our method. In unsupervised learning, We use the autoencoder to compress the data by a factor of 10 algorithms discover the internal representations of the input from a high-dimensional representation. We subsequently apply data without requiring training on example output data. the PCA algorithm to further compress the data to a three- Dimensionality reduction is a specific type of unsupervised dimensional representation. The PCA algorithm has the learning in which data in high-dimensional space is advantage of being a lot cheaper computationally than an transformed to a meaningful representation in lower autoencoder, however the algorithm only captures variations dimensional space. This transformation allows complex that emerge from linear relationships in the data, while datasets, such as 2D pitch angle and energy distributions, to autoencoders also account for non-linear relationships in the be characterized by analysis techniques (e.g., clustering dimensionality reduction process (Bishop, 1998). For this algorithms) with much more computational efficiency. Our reason, we only utilize the PCA algorithm after the data have machine learning method utilizes four separate algorithms: been compressed via an autoencoder. After compressing the data, autoencoders (Hinton and Salakhutdinov, 2006), principal we use the mean shift algorithm to inform us of how many component analysis (PCA, Abdi and Williams, 2010), mean populations are present in the data using this three-dimensional shift (Fukunaga and Hostetler, 1975), and agglomerative representation. While the mean shift algorithm provides us with clustering (AC) (Lukasovã, 1979). We obtain the autoencoder this estimate of the requisite number of clusters, the algorithm is algorithm from the Keras library (Chollet et al., 2015), and the ineffective in constraining the shapes of the clusters to determine PCA, mean shift, and AC algorithms from the scikit-learn library which population each data-point belongs to. Therefore, we use an (Pedregosa et al., 2011). AC algorithm to assign each data-point to one of the populations.
Frontiers in Astronomy and Space Sciences | www.frontiersin.org2 October 2020 | Volume 7 | Article 593516 Bakrania et al. Machine Learning: Classifying Plasma Regimes
1.1.1 Autoencoders (de Boer et al., 2005) is typically used when the input data, x, Autoencoders are a particular class of unsupervised neural are normalized to values between 0 and 1. The loss value, c, networks. They are trained to learn compressed increases as the reconstruction data, x , diverge from the input representations of data by using a bottleneck layer which maps data. The loss function is defined as: the input data to a lower dimensional space, and then c −[x ln(x ) + (1 − x)ln(1 − x )]. (5) subsequently reconstructing the original input. By minimizing the “reconstruction error,” or “loss,” the autoencoder is able to An overview of various loss functions is provided by retain the most important information in a representative Janocha and Czarnecki (2017). Optimisers are used to compression and reconstruction of the data. As a result, ensure the autoencoder converges quickly to a minimum fi autoencoders have applications in dimensionality reduction loss value by nding the optimum value of the weight, wi, (e.g., Hinton and Salakhutdinov, 2006), anomaly detection of each neuron. This is achieved by running multiple (e.g., Kube et al., 2019) and noise filtering (e.g., Chandra and iterations with different weight values, known as gradient Sharma, 2014). descent (Ruder, 2016).Theweightsareadjustedineach During training, an autoencoder runs two functions iteration, t, according to: simultaneously. The first, called an “encoder,” maps the input zc data, x, to the coded representation in latent space, z. The second − − α , wt wt 1 zω (6) function, called a “decoder,” maps the compressed data, z,toa reconstruction of the input data, x . The encoder, E(x), and where zc/zω is the gradient, which is a partial derivative of the loss decoder, D(z), are defined by the following deterministic value with respect to the weight. The learning rate, α, updates all posteriors: the weights simultaneously with respect to the gradient descent. This learning rate is randomly initialized between 0 and 1 by the E(x) p(z|x; θE), (1) Algorithm. A low learning rate results in a slower convergence to ( ) ( | ; θ ), D z p x z D the global minimum loss value. However a too high value for the learning rate impedes the gradient descent (Eq. 6) from where θ and θ are the trainable parameters of the encoder and E D converging on the optimum weights. The Adadelta optimiser decoder respectively. Figure 1 illustrates the standard (Zeiler, 2012) is commonly used due to its rapid convergence to architecture of an autoencoder. the minimum loss value and its ability to adapt the learning rate In feed-forward neural networks, such as autoencoders, each depending on each parameter. The optimiser updates each neuron computes the following sum: parameter, θ, according to: + , y wixi b (2) [Δθ] Δθ −RMS t−1 , i t gt (7) RMS g t where xi represents the input from the previous layer, wi enotes Δθ the weights associated with the connections between neurons in where t is the parameter update at the t-th iteration, gt is the different layers, and b denotes the bias term associated with each gradient of the parameters at the t-th iteration, and RMS is the layer (represented by the circles labeled “+1” in Figure 1). The root mean square. An overview of the various optimisers is number of neurons in each layer defines the dimension of the data provided by Khandelwal (2019). representation in that layer. The output of each neuron, f(y), is called the activation function. Rectified Linear Unit, (Hahnioser 1.1.2 Principal Component Analysis et al., 2000) is the most commonly used activation function due to PCA is a statistical procedure that, as well as autoencoders, also its low computational cost (Agarap, 2018). The function is reduces the dimensionality of input data. The algorithm achieves described as: this by transforming the input data from a large number of correlated variables to a smaller number of uncorrelated , . f y max 0 y (3) variables, known as principal components. These principal The sigmoid activation function (Chandra and Singh, 2004)is components account for most of the variation in the original also commonly used. It is defined by: input data, making them a useful tool in feature extraction. Before the procedure, the original data, X0, are represented by 1 a(n × Q) matrix, where n is the number of observations and Q is f y , (4) 1 + e−y the number of variables (also called dimensions). In the first step, the algorithm scales and centers the data: where y is defined in Eq. 2. Analysis of the use of various activation functions in the remit of plasma physics are given by −1 X X0 − X0 D , (8) Kube et al. (2019). In order to improve the representation of the compressed data where X0 contains the means of each of the variables, and D is a in layer L2 and minimize the discrepancy between the input and diagonal matrix that contains the scaling coefficient of each reconstruction layer, the autoencoder adjusts the weights and variable. Typically, Dii σi where σi is the standard deviation biases by minimising a loss function through an optimiser of variable with index i (Peerenboom et al., 2015). The algorithm (described below). The binary cross-entropy loss function then uses X to calculate the covariance matrix:
Frontiers in Astronomy and Space Sciences | www.frontiersin.org3 October 2020 | Volume 7 | Article 593516 Bakrania et al. Machine Learning: Classifying Plasma Regimes