
Data Augmentation by Autoencoders for Unsupervised Anomaly Detection

Kasra Babaei, Zhiyuan Chen, Tomas Maul


Abstract—This paper proposes an autoencoder (AE) that is used to improve the performance of one-class classifiers (OCCs) for the purpose of detecting anomalies. Traditional one-class classifiers perform poorly under certain conditions such as high dimensionality and sparsity, and the size of the training set also plays an important role in their performance. Autoencoders have been widely used for obtaining useful latent variables from high-dimensional datasets. In the proposed approach, the AE is capable of deriving meaningful features from high-dimensional datasets while performing data augmentation at the same time. The augmented data is then used for training the OCC algorithms. The experimental results show that the proposed approach enhances the performance of OCC algorithms and also outperforms other well-known approaches.

Keywords— anomaly detection, autoencoder, data augmentation, one-class classifier

I. INTRODUCTION

Deep neural networks have demonstrated their effectiveness and managed to improve the state-of-the-art in diverse application areas. In particular, AutoEncoders (AEs), previously also known as auto-associative neural networks, have proven to be a very powerful tool for dimensionality reduction thanks to their ability to discover non-linear correlations between features [1]. This capability has made them very suitable for the task of anomaly detection. AutoEncoders come in various architectures. When the number of features in the middle layer, known as the bottleneck, is smaller than the number of inputs, as depicted in Fig. 1, the high-dimensional space can be transformed into a low-dimensional space [2]. AutoEncoders differ from other dimensionality reduction methods such as Principal Component Analysis (PCA) in that AEs generally perform non-linear dimensionality reduction; moreover, according to the literature, AEs generally outperform PCA [2]. Besides, statistical approaches such as PCA or Zero Component Analysis (ZCA) require more memory to calculate the covariance matrix as the dimensionality increases [3].

AutoEncoders have been employed for the task of anomaly detection in various ways. As the network tries to reconstruct each point, it generates a value, known as the reconstruction error (RE), that shows the degree of resemblance between the original input and its reconstruction at the output. In some studies, such as [4], the authors used the reconstruction error to create a threshold-based classifier for separating normal points from anomalies. It is also possible to use an AE to reduce the dimensionality and pass the latent variables obtained from the bottleneck to a One-Class Classification (OCC) algorithm such as the One-Class SVM (OCSVM). The authors in [5] applied a number of OCC algorithms, such as the Local Outlier Factor (LOF) and Kernel Density Estimation (KDE), to the latent variables obtained from a regularised AE to capture anomalies.

There are various challenges in the area of anomaly detection, including the availability of anomalous examples and imbalanced class distributions [6], which debilitate performance. Many real-world datasets suffer from the class imbalance problem. In a skewed dataset, for instance a binary dataset, each data point belongs to either the majority class or the minority class; the majority class contains a greater number of data points than the minority class [7]. The ratio between the majority class and the minority class is known as the imbalance ratio and varies from one domain to another [8].

The imbalanced dataset problem weakens performance and can cause misclassification [6]. In particular, supervised approaches suffer more, because it is hard to obtain examples of anomalies for training the model, or because the current definition of an anomaly changes over time. Therefore, unsupervised models or OCC algorithms appear to be more suitable. In an OCC algorithm, the model is trained on a training set that only includes data instances from one class (known as the target class), and the model is expected to separate data instances of the target class from non-target class instances in the test set [9]. In order to employ an OCC, it is necessary to have access to a training set that includes merely normal examples. It is worth noting that in some scenarios the training set includes a small number of data points from other classes as well. This training set is used to set the threshold at which non-target and target points will be divided. In an anomaly detection problem, the goal is to separate anomalies from normal points.

There are several methods to overcome the issue of imbalanced class distributions, which can be categorised into data-level, algorithmic-level, and cost-sensitive methods [10]. In a data-level method, the goal is to bring balance to the dataset prior to performing classification by oversampling, undersampling or a hybrid approach [11]. Two widely used oversampling methods are the Synthetic Minority Oversampling Technique (SMOTE) and adaptive synthetic sampling (ADASYN) [12]. The authors in [7] proposed a new approach in which a Generative Adversarial Network (GAN) is used to enhance classification performance on imbalanced datasets; they used a variant of GAN that can also detect outliers in the majority class, which prevents creating a biased classification boundary. In a similar approach, the authors in [10] proposed a conditional Generative Adversarial Network (cGAN) for oversampling the minority class in a binary classification problem; in their approach, the network is conditioned on external information, i.e., class labels, to approximate the actual class distribution.

The performance of OCC methods depends on various factors, including the size of the training set. These methods can perform better when the training set includes more data points, as this makes it feasible to compute a less ambiguous class boundary for dividing data points of the target class from the rest [9]. Data augmentation refers to the process in which the training instances are oversampled to improve the model's performance [13]; this approach is widely used. In the area of anomaly detection, the authors in [14] claim to be the first to use data augmentation in an unsupervised approach for detecting anomalies. They employed a variant of a GAN to obtain latent variables, and selectively oversampled instances close to the head and tail of the latent variables with an approach similar to SMOTE. They argued that their approach addresses the scarcity of infrequent normal instances, which can reduce the performance of density-based anomaly detection algorithms.

This paper is motivated by the question of whether AEs are capable of augmenting data points with meaningful features for the purpose of unsupervised anomaly detection. To explain further, the latent variables at the bottleneck of the AE are collected over a certain number of epochs and then used to train various OCC algorithms for finding anomalies. An extensive amount of experimentation was carried out to demonstrate that this approach can lead to better anomaly detection performance from OCC algorithms.

Fig. 1: The structure of a deep undercomplete autoencoder (encoder, bottleneck, decoder).
II. AUTOENCODERS

In this section, a summary of the AutoEncoder (AE) neural network is presented. The AE is an unsupervised multi-layer neural network with two fundamental components, namely the encoder and the decoder. While the encoder tries to transform high-dimensional data into a lower dimension, the decoder attempts to approximately reconstruct the input from the low-dimensional feature space. The difference between the input and the reconstructed data point is known as the reconstruction error, and by training the network the AE tries to minimise this error, i.e., to maximise the resemblance between the reconstructed data and the original input.

Fig. 2: The structure of a basic autoencoder.

As shown in Fig. 2, the most generic AE, with only one latent layer, attempts to transform the input x into a latent vector z using an encoder represented by the function u. Then, the decoder, represented by the function v, tries to map z to a reconstruction y. Given a training set D = {x1, x2, x3, ..., xn}, where n refers to the number of instances in D and xi is the i-th instance with m features, the encoder can be defined as:

    z = u(x) = s(Wx + b)    (1)

while the decoder can be defined as:

    y = v(z) = s′(W′z + b′)    (2)

where s and s′ represent the activation functions, W and W′ denote the weight matrices, and b and b′ represent the bias vectors. Choosing the right activation function depends on various factors; however, a non-linear function such as the Sigmoid or ReLU can often capture more useful representations (see the sketch at the end of this section).

The AE has several variants. In an under-complete AE, where the number of nodes in the middle layer is smaller than the number of nodes at the input layer, the aim is to reduce the input dimension by learning some desired latent features. A detailed review of different variants of AEs can be found in [15].
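Although the paper gives no implementation, Eqs. (1) and (2) translate directly into a few lines of PyTorch. The sketch below is illustrative rather than the authors' code: the class name and layer sizes are ours, and tanh is assumed for both s and s′, in line with the [−1, 1] input normalisation described in Section V.

    import torch
    import torch.nn as nn

    class BasicAutoencoder(nn.Module):
        """Single-bottleneck AE: z = s(Wx + b), y = s'(W'z + b')."""
        def __init__(self, n_features: int, n_latent: int):
            super().__init__()
            # Encoder u, Eq. (1), with s = tanh
            self.encoder = nn.Sequential(nn.Linear(n_features, n_latent), nn.Tanh())
            # Decoder v, Eq. (2), with s' = tanh
            self.decoder = nn.Sequential(nn.Linear(n_latent, n_features), nn.Tanh())

        def forward(self, x):
            z = self.encoder(x)  # latent vector z = u(x)
            y = self.decoder(z)  # reconstruction y = v(z)
            return y, z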
III. SMOTE AND ADASYN

This section briefly explains the two oversampling approaches that the proposed method is compared against in this paper.

In the Synthetic Minority Oversampling Technique (SMOTE), the skewed dataset is rebalanced simply by replicating the minority class. This is done by generating synthetic data points on the lines between data points within a defined neighbourhood; therefore, this approach concentrates on the feature space rather than on the data space [16].

In the Adaptive Synthetic (ADASYN) sampling approach, which is an extension of SMOTE [7], the difficulty of learning a data point is taken into account, and data points with a higher level of difficulty in the minority class are replicated more in order to rebalance the data set [12]. This approach employs a weighted distribution for each data point in the minority class which conforms to the learning difficulty of the respective example.
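Both baselines are implemented in the imbalanced-learn library; the following sketch shows how they are typically invoked (the toy data and parameter values are illustrative, not the paper's experimental settings).

    import numpy as np
    from imblearn.over_sampling import SMOTE, ADASYN

    # Illustrative toy data: 100 majority (label 0) vs 10 minority (label 1) points.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(1, 1, (10, 5))])
    y = np.array([0] * 100 + [1] * 10)

    # SMOTE: interpolate between a minority point and one of its k nearest
    # minority neighbours in feature space.
    X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

    # ADASYN: like SMOTE, but generates more synthetic points for minority
    # examples that are harder to learn (i.e., with more majority neighbours).
    X_ad, y_ad = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)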
IV. PROPOSED METHOD

In this section, the proposed method is elaborated upon. The purpose of this approach is to use an AE to augment the training set for the purpose of detecting anomalies. This augmented training set is later used in an OCC method that tries to detect anomalies in the test set. The performance of the OCC depends on how well its threshold is defined. The assumption here is that by augmenting the training set, it is possible to compute a better class boundary and ultimately improve the performance of the OCC.

Given a training set, the AE tries to minimise the reconstruction error by updating its weights through error backpropagation. Traditionally, once the network is fully trained, latent variables are extracted from the bottleneck of the AE and used to train the OCC algorithm. In the proposed approach, however, the latent variables of the last ν portion of epochs are extracted and concatenated to form an augmented training set. The value of ν is a parameter that needs to be tuned. The assumption is that the network is almost well trained during the last few epochs, and that the corresponding latent variables therefore possess a good representation of the original input. The algorithm of the proposed approach is summarised in Algorithm 1, in which n′ > n, as the model is augmenting the training set, while m′ < m, as the dimensionality is being reduced. Another advantage of the proposed model is that, unlike other approaches, it is not necessary to perform extra computation for augmenting the training set.

Algorithm 1: Over-sampling by AE
  input : D: a dataset with n × m dimensions
  output: D′: a dataset with n′ × m′ dimensions
   1  n_epochs ← number of epochs;
   2  ν ← portion of epochs to use;
   3  for each epoch i in n_epochs do
   4      Reconstruct the input;
   5      Compute the reconstruction error;
   6      Backpropagate the loss;
   7      Take a gradient step;
   8      if i ≥ ((1 − ν) × n_epochs) then
   9          z ← latent variables in the bottleneck;
  10          D′ ← append z;
  11      end
  12  end
  13  return D′;
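Algorithm 1 can be expressed compactly in PyTorch. The following sketch is one possible reading of it, not the authors' implementation: it reuses the BasicAutoencoder above, trains full-batch for simplicity (the experiments use mini-batches), and collects the bottleneck outputs of the last ν portion of epochs into the augmented set D′.

    import torch

    def augment_by_ae(model, X_train, n_epochs=100, nu=0.25, lr=1e-3):
        """Train the AE and collect latents from the last nu*n_epochs epochs."""
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = torch.nn.SmoothL1Loss()   # the paper's loss (L2 near 0, L1 otherwise)
        collected = []
        for epoch in range(n_epochs):
            recon, z = model(X_train)       # reconstruct the input
            loss = loss_fn(recon, X_train)  # reconstruction error
            opt.zero_grad()
            loss.backward()                 # backpropagate the loss
            opt.step()                      # take a gradient step
            if epoch >= (1 - nu) * n_epochs:    # last nu portion of epochs
                collected.append(z.detach())    # latent variables in the bottleneck
        return torch.cat(collected, dim=0)      # D' with n' > n rows, m' < m columns

    # Usage (X_train: an (n, m) float tensor scaled to [-1, 1]):
    # ae = BasicAutoencoder(n_features=X_train.shape[1], n_latent=5)
    # D_aug = augment_by_ae(ae, X_train)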
V. EVALUATION AND DISCUSSION

This section presents the evaluation of the proposed approach for augmenting the training set in order to improve the performance of anomaly detection with OCC algorithms. To demonstrate the effectiveness of the proposed approach, three OCC algorithms, namely the Local Outlier Factor (LOF), Kernel Density Estimation (KDE), and Isolation Forest (ISF), were used in the experiments.

A. Datasets

The experiments were carried out on 8 publicly available data sets which are widely used in this domain. The details of the data sets are shown in Table I. As depicted in the table, the first five data sets possess a single type of anomaly, while the rest (NSL-KDD, UNSW-NB15 and CTU13) have different types of anomalies. The single-type anomaly data sets, i.e., PenDigits, Shuttle, Spambase, and InternetAds, are from the UCI machine learning repository [17].

TABLE I: Details of the 8 data sets used in the experiments

Data set name      No. of features   Training set   Validation set   Testing set
PenDigits                 16               624             312             727
Shuttle                    9             27286            8701           14500
Spambase                  57              2230             459             460
InternetAds             1558              1582             235             236
Arrhythmia               259               142              83              84
NSL-KDD                  122              5356            1571           12132
UNSW                     196              4488            1478           43062
Virut (CTU13-13)          40             19163           14387           14389
Rbot (CTU13-10)           38              9508           24439           24439
Neris (CTU13-09)          41             17980           42990           42990
Murlo (CTU13-08)          40             43693           15789           15789

The CTU13 dataset is a well-known botnet data set [18]. In this experiment, four of its 13 botnet scenarios were selected, in which 60% of the normal traffic was used for training, 20% for validation, and the last 20% of the normal and botnet traffic was used for testing. This data set includes three categorical features that are encoded by one-hot encoding, which ultimately increased the number of features to what is listed in Table I.

The NSL-KDD data set is a modified version of the KDD'99 data set which has resolved some of the intrinsic problems with the original version [5]. One of the major issues with the original dataset is a large number of redundant records, which have been removed in NSL-KDD [19]. The anomalies in this data set belong to one of the following four attacks: 1) Denial of Service (DoS); 2) Remote to Local (R2L); 3) User to Root (U2R); and 4) Probe. Similar to CTU13, this data set also includes three categorical features that are encoded by one-hot encoding, which increased the number of dimensions from 41 to 122.

The UNSW-NB15 data set is a network data set with nine types of attack: 1) fuzzers; 2) analysis; 3) backdoor; 4) DoS; 5) exploit; 6) generic; 7) reconnaissance; 8) shellcode; and 9) worm [20]. The categorical features are encoded by the same method, which increased the number of features from 47 to 196.

It is worth mentioning that, as suggested by Van et al. [5], the training sets of UNSW-NB15 and NSL-KDD are considerably larger than the other data sets; therefore, only 10% of their training sets were used (as tabulated in Table I).

All the data sets were normalised to the range [−1, 1] and, correspondingly, the hyperbolic tangent activation function was used in the network. The parameter ν was experimentally tuned to 0.25.

B. AE Parameters

The loss function used in this experiment was SmoothL1Loss, which is essentially a combination of L2 and L1 terms, i.e., it uses the L2 term if the absolute element-wise error is less than 1 and the L1 term otherwise. To minimise the loss function, Stochastic Gradient Descent (SGD) was used, except for six data sets, i.e., Arrhythmia, Murlo, Rbot, InternetAds, Shuttle, and Spambase, for which the Root Mean Square Propagation (RMSProp) method was selected. This decision was made because, by changing the optimiser, it became possible to avoid overfitting in every approach. The number of layers was set to 5 for all the data sets, as suggested in [21], with the number of nodes in the bottleneck set to m = [1 + √n], where n is the number of features of the input [22]. For data sets with more than 2000 instances, the mini-batch size was set to 64, and to 16 otherwise. To avoid overfitting, a simple early stopping heuristic was implemented to stop the training process when the network was no longer learning after a certain number of iterations or when the learning improvement was insignificant.

C. OCC Parameters

In Isolation Forest, the number of estimators was set to 20. In LOF, the number of neighbours was set to 20, while the contamination rate, i.e., the ratio between anomalies and normal examples, was set to 0.1. This contamination rate and the training set were used to compute a class boundary during the training phase.
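Under the stated settings, the three OCCs can be configured with scikit-learn as in the sketch below. KDE has no built-in contamination parameter, so the density threshold shown here, the 10th percentile of training scores, is our assumption mirroring the 0.1 contamination rate, not a detail from the paper.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import KernelDensity, LocalOutlierFactor

    def fit_occ_models(D_aug):
        """Fit the three OCCs on the augmented training set with the paper's settings."""
        isf = IsolationForest(n_estimators=20).fit(D_aug)
        # novelty=True lets LOF score unseen test points; contamination sets the boundary.
        lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1, novelty=True).fit(D_aug)
        kde = KernelDensity().fit(D_aug)
        # Threshold KDE log-density at the 10th percentile of training scores
        # (our assumption, mirroring the 0.1 contamination rate).
        kde_thresh = np.percentile(kde.score_samples(D_aug), 10)
        return isf, lof, kde, kde_thresh

    # Usage: isf.predict(X_test) == -1, lof.predict(X_test) == -1, and
    # kde.score_samples(X_test) < kde_thresh all flag anomalies.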

D. Analysis and Discussion

The performance of the proposed approach is compared to the state-of-the-art. In particular, the two widely used oversampling methods, namely SMOTE and ADASYN, were used to increase the size of the training set. Also, the training set was over-sampled by adding random Gaussian noise to the latent variables. In short, the performance of the proposed model is compared with the following four approaches: 1) latent variables are extracted from the AE, without generating augmented data, and OCC algorithms are applied to detect anomalies; 2) latent variables are augmented by SMOTE and then OCC algorithms are used; 3) similar to the previous approach, except that ADASYN is used for augmenting the latent variables instead of SMOTE; and 4) the latent variables are augmented by adding random Gaussian noise. In the proposed approach, latent AE variables are extracted from several late epochs in the training process, thus creating an augmented data set.

The fourth approach, i.e., augmenting data by adding artificial noise, was carried out to test whether the proposed approach was acting as a simple regulariser similar to input noise injection, or whether it represented a more distinctive contribution. The optimistic hypothesis was that by simply adding artificial noise it is not possible to achieve the same level of meaningful augmented data that can be obtained by the proposed approach. In other words, the assumption was that this simple transformation (i.e., adding noise) does not provide as much meaningful information as using latent representations from different epochs.

One of the most common evaluation metrics used in this domain is ROC AUC; however, the authors in [23] claim that when the data set is highly imbalanced, ROC AUC is not a suitable metric, and a good alternative is the area under the precision-recall curve (PR AUC), especially when dealing with high-dimensional data in which the positive class, i.e., anomalies, is more important than the negative class, i.e., normal points. Nonetheless, no single evaluation metric dominates the others; therefore, both ROC AUC and PR AUC were used for evaluation (a minimal computation sketch is given after Table II).

In Table II, the performance of all the compared models on the UNSW-NB15 data set is listed in terms of PR AUC and ROC AUC. In the UNSW-NB15 data set, there are 9 different types of anomalies, and the experiment was carried out on each one individually. On each data set, the experiment was repeated 10 times. To eliminate the effect of extreme values [24], the largest and smallest values were removed and the average performance was tabulated.

TABLE II: PR AUC and ROC AUC for the UNSW data set

Data augmentation  OCC     Fuzzers      Analysis     Backdoor     DoS          Exploits     Generic      Reconn.      Shellcode    Worms
method             method  PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC
None               ISF     0.825 0.478  0.995 0.782  0.996 0.813  0.949 0.689  0.758 0.502  0.939 0.878  0.914 0.531  0.990 0.476  0.999 0.603
                   KDE     0.821 0.456  0.993 0.800  0.995 0.787  0.956 0.682  0.744 0.412  0.810 0.671  0.921 0.552  0.992 0.519  0.998 0.421
                   LOF     0.837 0.437  0.996 0.847  0.985 0.475  0.843 0.402  0.693 0.515  0.921 0.552  0.877 0.349  0.985 0.332  0.998 0.328
SMOTE              ISF     0.811 0.390  0.992 0.718  0.995 0.844  0.942 0.664  0.838 0.624  0.929 0.834  0.909 0.467  0.984 0.280  0.999 0.748
                   KDE     0.794 0.393  0.989 0.685  0.993 0.726  0.945 0.718  0.785 0.527  0.927 0.817  0.928 0.502  0.987 0.349  0.997 0.363
                   LOF     0.792 0.285  0.962 0.240  0.971 0.311  0.853 0.436  0.710 0.466  0.718 0.622  0.841 0.275  0.982 0.295  0.996 0.104
ADASYN             ISF     0.848 0.489  0.992 0.810  0.995 0.784  0.964 0.748  0.718 0.427  0.921 0.822  0.894 0.525  0.985 0.470  0.999 0.475
                   KDE     0.848 0.489  0.992 0.810  0.995 0.784  0.964 0.748  0.718 0.427  0.921 0.822  0.894 0.525  0.985 0.470  0.999 0.475
                   LOF     0.837 0.440  0.969 0.376  0.987 0.585  0.834 0.352  0.734 0.385  0.662 0.523  0.891 0.346  0.986 0.341  0.998 0.527
Noise              ISF     0.828 0.480  0.994 0.792  0.994 0.737  0.965 0.776  0.864 0.693  0.929 0.856  0.938 0.555  0.994 0.620  0.998 0.366
                   KDE     0.896 0.506  0.992 0.817  0.996 0.833  0.917 0.627  0.798 0.612  0.933 0.874  0.944 0.562  0.989 0.473  0.999 0.632
                   LOF     0.850 0.483  0.990 0.622  0.987 0.539  0.864 0.383  0.731 0.462  0.842 0.784  0.949 0.721  0.991 0.488  0.999 0.484
Proposed method    ISF     0.892 0.566  0.996 0.851  0.998 0.924  0.970 0.826  0.873 0.681  0.970 0.943  0.948 0.591  0.995 0.680  0.999 0.713
                   KDE     0.932 0.668  0.995 0.814  0.998 0.905  0.974 0.821  0.868 0.649  0.975 0.959  0.954 0.641  0.988 0.467  0.999 0.605
                   LOF     0.875 0.540  0.992 0.801  0.995 0.842  0.972 0.830  0.792 0.642  0.861 0.787  0.954 0.718  0.990 0.496  0.995 0.133
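For reference, both metrics reported in Tables II-IV can be computed from OCC anomaly scores with scikit-learn, where average precision is the usual summary of the precision-recall curve; a minimal sketch:

    from sklearn.metrics import average_precision_score, roc_auc_score

    def evaluate(y_true, anomaly_scores):
        """PR AUC (average precision) and ROC AUC; y_true marks anomalies as 1,
        and higher scores mean more anomalous."""
        return (average_precision_score(y_true, anomaly_scores),
                roc_auc_score(y_true, anomaly_scores))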

TABLE III: PR AUC and ROC AUC for the NSL-KDD and CTU13 data sets

Data augmentation  OCC     Probe        DoS          R2L          U2R          Virut        Rbot         Neris        Murlo
method             method  PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC
No oversampling    ISF     0.982 0.927  0.830 0.808  0.839 0.691  0.999 0.828  0.384 0.100  0.783 0.024  0.749 0.100  0.064 0.094
                   KDE     0.975 0.893  0.828 0.801  0.824 0.667  0.999 0.821  0.364 0.092  0.704 0.046  0.746 0.270  0.041 0.068
                   LOF     0.683 0.188  0.539 0.535  0.657 0.179  0.981 0.140  0.840 0.809  0.718 0.082  0.882 0.426  0.108 0.671
SMOTE              ISF     0.976 0.900  0.728 0.765  0.950 0.829  0.997 0.715  0.374 0.067  0.776 0.030  0.738 0.155  0.696 0.797
                   KDE     0.967 0.880  0.795 0.863  0.938 0.785  0.997 0.698  0.370 0.116  0.698 0.018  0.756 0.260  0.061 0.297
                   LOF     0.627 0.082  0.382 0.132  0.626 0.187  0.980 0.078  0.585 0.651  0.718 0.065  0.834 0.312  0.233 0.865
ADASYN             ISF     0.831 0.574  0.872 0.924  0.906 0.736  0.998 0.762  0.374 0.067  0.729 0.017  0.729 0.117  0.218 0.225
                   KDE     0.953 0.862  0.856 0.910  0.938 0.775  0.997 0.808  0.364 0.089  0.696 0.004  0.751 0.286  0.117 0.682
                   LOF     0.625 0.097  0.369 0.070  0.636 0.222  0.977 0.047  0.455 0.312  0.696 0.004  0.750 0.222  0.044 0.166
Noise              ISF     0.940 0.806  0.897 0.858  0.926 0.785  0.991 0.468  0.504 0.261  0.859 0.039  0.824 0.344  0.213 0.776
                   KDE     0.965 0.885  0.857 0.833  0.948 0.816  0.997 0.737  0.433 0.287  0.701 0.029  0.903 0.509  0.047 0.143
                   LOF     0.670 0.165  0.434 0.256  0.629 0.178  0.981 0.084  0.898 0.873  0.838 0.612  0.890 0.475  0.547 0.878
Proposed method    ISF     0.986 0.941  0.930 0.917  0.977 0.918  0.999 0.881  0.619 0.535  0.946 0.668  0.834 0.455  0.504 0.904
                   KDE     0.991 0.972  0.954 0.931  0.969 0.887  0.998 0.862  0.491 0.391  0.834 0.617  0.928 0.600  0.763 0.797
                   LOF     0.745 0.494  0.570 0.590  0.798 0.593  0.987 0.423  0.906 0.873  0.993 0.961  0.928 0.682  0.526 0.922

As listed in Table II, the proposed data augmentation method dominated in terms of both PR AUC and ROC AUC on almost every type of anomaly. Looking at the performance of the models on Exploits and Reconnaissance in terms of ROC AUC, the proposed model came second after the approach in which artificial noise was used for data augmentation, although only by a small margin, i.e., 0.044 and 0.003 respectively. It is worth mentioning that all approaches performed competitively well on the Worms data set, with the proposed approach showing the second-best result in terms of PR AUC.

As depicted in Table III, the proposed approach outperformed the other approaches on the NSL-KDD and CTU13 data sets. Also, it is noticeable that almost all the approaches performed equally well on the NSL-KDD (U2R) data set, especially in terms of PR AUC. It can be deduced that the latent variables extracted from the AE already make a good training set for OCC algorithms there, and it is fairly easy to compute the class boundary; therefore, one can argue that data augmentation in similar scenarios is needless.
TABLE IV: PR AUC and ROC AUC for data sets with a single-type anomaly

Data augmentation  OCC     PenDigits    Shuttle      Spambase     InternetAds  Arrhythmia
method             method  PR    ROC    PR    ROC    PR    ROC    PR    ROC    PR    ROC
No oversampling    ISF     0.694 0.712  0.714 0.440  0.296 0.327  0.160 0.525  0.440 0.500
                   KDE     0.878 0.826  0.821 0.712  0.317 0.398  0.182 0.527  0.416 0.446
                   LOF     0.646 0.777  0.759 0.427  0.293 0.311  0.221 0.604  0.416 0.446
SMOTE              ISF     0.557 0.371  0.784 0.511  0.341 0.461  0.134 0.411  0.438 0.443
                   KDE     0.602 0.587  0.628 0.130  0.318 0.351  0.171 0.525  0.440 0.500
                   LOF     0.319 0.103  0.711 0.364  0.314 0.332  0.165 0.489  0.419 0.461
ADASYN             ISF     0.725 0.715  0.727 0.400  0.313 0.376  0.168 0.525  0.373 0.365
                   KDE     0.454 0.452  0.690 0.356  0.348 0.400  0.148 0.443  0.377 0.392
                   LOF     0.388 0.215  0.763 0.597  0.331 0.386  0.127 0.374  0.376 0.382
Noise              ISF     0.916 0.867  0.923 0.792  0.491 0.622  0.221 0.660  0.468 0.543
                   KDE     0.893 0.535  0.877 0.808  0.400 0.547  0.168 0.567  0.460 0.604
                   LOF     0.491 0.415  0.846 0.718  0.464 0.629  0.141 0.451  0.534 0.554
Proposed method    ISF     0.947 0.918  0.926 0.781  0.683 0.821  0.428 0.738  0.573 0.591
                   KDE     0.960 0.932  0.872 0.723  0.714 0.832  0.255 0.694  0.583 0.604
                   LOF     0.930 0.908  0.890 0.744  0.614 0.761  0.368 0.719  0.604 0.616

Augmenting the training set using the AE completely dominated the other approaches when applied to the data sets with a single-type anomaly, i.e., PenDigits, Spambase, InternetAds and Arrhythmia, except on Shuttle, where adding artificial noise combined with KDE showed a higher ROC AUC value. While LOF outperformed Isolation Forest and KDE when working with Arrhythmia, KDE showed superior results when applied to PenDigits and Spambase. Also, it is worth mentioning that even though the proposed method dominated on InternetAds, its performance there trails that on the other data sets by a considerable margin. The assumption is that this is due to the high sparsity of this data set and the fact that it has the highest number of features, i.e., 1558, compared to the other data sets; thus, it would require a different AE to obtain useful latent variables without losing important information. However, in order to make sure the performance stays comparable to the other data sets, it was important to maintain impartiality in every model; therefore, the same approach for designing the AE architecture that was used for the other data sets was followed on InternetAds.

Fig. 3: Boxplots of computed LOF scores for the PenDigits data set after data augmentation with different approaches.

In Fig. 3, a simple statistical analysis is carried out using boxplots to see the effect of the different data augmentation models on the LOF scores obtained from the PenDigits data set. Boxplots provide a good graphical representation, and as depicted in Fig. 3, the other three approaches cause LOF to compute a wider range of LOF scores with some outliers, while the proposed approach computed LOF scores that lie within a small range. By looking at Fig. 3, and also at the performance of data augmentation with the proposed model using LOF for detecting anomalies in Table IV, it can be reasoned that the proposed model can augment the training set in a way that leads to finding a better class boundary for LOF. To verify whether the LOF scores are significantly different or not, a Wilcoxon signed-rank test [25] was performed between the proposed method and the adding-noise approach. By feeding the LOF scores to the Wilcoxon test, it was observed that the results are significantly different at p < 0.05. It is worth mentioning that carrying out repeated Wilcoxon tests on multiple models is not recommended, because it increases the chance of rejecting a certain proportion of the null hypotheses merely based on random chance [26]; therefore, the test was not carried out again with the other approaches.
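The significance test maps directly onto SciPy's wilcoxon; the sketch below uses toy paired score arrays in place of the actual LOF outputs.

    import numpy as np
    from scipy.stats import wilcoxon

    # Paired LOF scores for the same points under the two augmentation schemes
    # (toy values for illustration only).
    lof_proposed = np.array([1.02, 1.05, 0.99, 1.10, 1.01, 1.04, 0.98, 1.03])
    lof_noise    = np.array([1.35, 0.90, 1.60, 1.22, 0.85, 1.48, 1.12, 1.55])

    stat, p_value = wilcoxon(lof_proposed, lof_noise)
    print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")  # significant if p < 0.05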
VI. CONCLUSION

This paper studied the usefulness of AutoEncoders (AEs) for augmenting the training set in the feature space to improve the performance of One-Class Classification (OCC) algorithms in anomaly detection problems. When the training set is not large enough, OCCs perform poorly, as finding a good class boundary becomes difficult. Once the AE is almost well trained, it is possible to start deriving latent variables in each epoch and feed this augmented training set to the OCC algorithm.

By carrying out the proposed method on several well-known data sets in this domain, it was demonstrated that the proposed approach is capable of outperforming other data augmentation methods such as SMOTE and ADASYN. In terms of OCC methods, three state-of-the-art algorithms were employed to detect anomalies. According to the results, the proposed approach is capable of producing a more robust performance compared to the other approaches.

For future work, it would be useful to investigate an adaptive approach for estimating the parameter ν, which indicates the portion of epochs to use for augmentation. It would also be interesting to investigate the effectiveness of employing an ensemble of OCC algorithms with the proposed approach.

REFERENCES

[1] Z. Chen, C. K. Yeo, B. S. Lee, and C. T. Lau, “Autoencoder-based network anomaly detection,” in 2018 Wireless Telecommunications Symposium (WTS), pp. 1–5, Apr. 2018.
[2] Y. Wang, H. Yao, and S. Zhao, “Auto-encoder based dimensionality reduction,” Neurocomputing, vol. 184, pp. 232–242, 2016.
[3] M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, “Autoencoder-based feature learning for cyber security applications,” in 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3854–3861, IEEE, 2017.
[4] R. C. Aygun and A. G. Yavuz, “Network anomaly detection with stochastically improved autoencoder based models,” in 2017 IEEE 4th International Conference on Cyber Security and Cloud Computing (CSCloud), pp. 193–198, IEEE, 2017.
[5] V. L. Cao, M. Nicolau, and J. McDermott, “Learning Neural Representations for Network Anomaly Detection,” IEEE Transactions on Cybernetics, vol. 49, no. 8, pp. 3074–3087, 2019.
[6] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Comput. Surv., vol. 41, pp. 15:1–15:58, Jul. 2009.
[7] J.-H. Oh, J. Y. Hong, and J.-G. Baek, “Oversampling method using outlier detectable generative adversarial network,” Expert Systems with Applications, vol. 133, pp. 1–8, 2019.
[8] S. Barua, M. M. Islam, X. Yao, and K. Murase, “MWMOTE - Majority weighted minority oversampling technique for imbalanced data set learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014.
[9] S. S. Khan and M. G. Madden, “A Survey of Recent Trends in One Class Classification,” in Artificial Intelligence and Cognitive Science: 20th Irish Conference, AICS 2009, Dublin, Ireland, August 19-21, 2009, Revised Selected Papers (L. Coyle and J. Freyne, eds.), pp. 188–197, Berlin, Heidelberg: Springer Berlin Heidelberg, 2010.
[10] G. Douzas and F. Bacao, “Effective data generation for imbalanced learning using conditional generative adversarial networks,” Expert Systems with Applications, vol. 91, pp. 464–471, 2018.
[11] P. Domingos, “MetaCost: A General Method for Making Classifiers Cost-sensitive,” in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’99, (New York, NY, USA), pp. 155–164, ACM, 1999.
[12] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” in 2008 IEEE International Joint Conference on Neural Networks, IJCNN 2008 (IEEE World Congress on Computational Intelligence), pp. 1322–1328, 2008.
[13] M. A. A. Ghaffar, A. McKinstry, T. Maul, and T. T. Vu, “Data Augmentation Approaches for Satellite Image Super-Resolution,” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 47–54, 2019.
[14] S. K. Lim, Y. Loo, N. Tran, N. Cheung, G. Roig, and Y. Elovici, “DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN,” in 2018 IEEE International Conference on Data Mining (ICDM), pp. 1122–1127, Nov. 2018.
[15] D. Charte, F. Charte, S. García, M. J. del Jesus, and F. Herrera, “A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines,” Information Fusion, vol. 44, pp. 78–96, 2018.
[16] A. Fernandez, S. Garcia, F. Herrera, and N. V. Chawla, “SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary,” Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, Apr. 2018.
[17] D. Dheeru and E. Karra Taniskidou, “UCI Machine Learning Repository,” 2017.
[18] S. García, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of botnet detection methods,” Computers & Security, vol. 45, pp. 100–123, 2014.
[19] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” in 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1–6, IEEE, 2009.
[20] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” in 2015 Military Communications and Information Systems Conference (MilCIS), pp. 1–6, Nov. 2015.
[21] S. M. Erfani, S. Rajasegarar, S. Karunasekera, and C. Leckie, “High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning,” Pattern Recognition, vol. 58, pp. 121–134, 2016.
[22] V. L. Cao, M. Nicolau, and J. McDermott, “A Hybrid Autoencoder and Density Estimation Model for Anomaly Detection,” in Parallel Problem Solving from Nature – PPSN XIV (J. Handl, E. Hart, P. R. Lewis, M. López-Ibáñez, G. Ochoa, and B. Paechter, eds.), (Cham), pp. 717–726, Springer International Publishing, 2016.
[23] F. Provost and T. Fawcett, “Analysis and Visualization of Classifier Performance: Comparison Under Imprecise Class and Cost Distributions,” in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, KDD’97, pp. 43–48, AAAI Press, 1997.
[24] S. Su, L. Xiao, Z. Zhang, F. Gu, L. Ruan, S. Li, Z. He, Z. Huo, B. Yan, H. Wang, and S. Liu, “N2DLOF: A New Local Density-Based Outlier Detection Approach for Scattered Data,” in 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 458–465, IEEE, Dec. 2017.
[25] D. S. Kerby, “The simple difference formula: An approach to teaching nonparametric correlation,” Comprehensive Psychology, vol. 3, 2014.
[26] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.