Supervised based on Deep Autoregressive Density Estimators

Tomoharu Iwata Yuki Yamanaka NTT Communication Science Laboratories NTT Secure Platform Laboratories

Abstract (VAE) (Kingma and Welling 2013), flow- based generative models (Dinh, Krueger, and Bengio 2014; We propose a supervised anomaly detection method based Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhari- on neural density estimators, where the negative log likeli- wal 2018), and autoregressive models (Uria, Murray, and hood is used for the anomaly score. Density estimators have been widely used for unsupervised anomaly detection. By Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; the recent advance of , the density estimation Uria et al. 2016). The VAE has been used for anomaly de- performance has been greatly improved. However, the neural tection (An and Cho 2015; Suh et al. 2016; Xu et al. 2018). density estimators cannot exploit anomaly label information, In some situations, the label information, which indicates which would be valuable for improving the anomaly detec- whether each instance is anomalous or normal, is avail- tion performance. The proposed method effectively utilizes able (Gornitz¨ et al. 2013). The label information is valuable the anomaly label information by training the neural density for improving the anomaly detection performance. How- estimator so that the likelihood of normal instances is max- ever, the existing neural network based density estimation imized and the likelihood of anomalous instances is lower methods cannot exploit the label information. To use the than that of the normal instances. We employ an autoregres- sive model for the neural density estimator, which enables anomaly label information, supervised classifiers, such as us to calculate the likelihood exactly. With the experiments nearest neighbor methods (Singh and Silakari 2009), sup- using 16 datasets, we demonstrate that the proposed method port vector machines (Mukkamala, Sung, and Ribeiro 2005), improves the anomaly detection performance with a few la- and feed-forward neural networks (Rapaka, Novokhodko, beled anomalous instances, and achieves better performance and Wunsch 2003), have been used. However, these stan- than existing unsupervised and supervised anomaly detection dard supervised classifiers do not perform well when labeled methods. anomalous instances are very few, which is often the case, since anomalous instances rarely occur by definition. 1 Introduction In this paper, we propose a neural network density esti- mator based anomaly detection method that can exploit the Anomaly detection is an important task in artificial intel- label information. The proposed method performs well even ligence, which is a task to find anomalous instances in a when only a few labeled anomalous instances are given since dataset. The anomaly detection has been used in a wide va- it is based on a density estimator, which works without any riety of applications (Chandola, Banerjee, and Kumar 2009; labeled anomalous instances. We employ the negative log Patcha and Park 2007; Hodge and Austin 2004), such as probability of an instance as its anomaly score. For the den- network intrusion detection for cyber-security (Dokas et al. sity function to calculate the probability, we use neural au- 2002; Yamanishi et al. 2004), fraud detection for credit toregressive models (Uria et al. 2016; Germain et al. 2015). cards (Aleskerov, Freisleben, and Rao 1997), defect detec- The autoregressive models can compute the probability den- tion of industrial machines (Fujimaki, Yairi, and Machida arXiv:1904.06034v1 [stat.ML] 12 Apr 2019 sity exactly for a test instance. On the other hand, the VAE 2005; Ide´ and Kashima 2004) and disease outbreak detec- computes the lower bound of the probability density ap- tion (Wong et al. 2003). proximately. Moreover, the autoregressive models have been Anomalies, which are also called , are instances achieved the high density estimation performance compared that rarely occur. Therefore, it is natural to consider that with other neural density estimators, such as VAE and flow- instances at a low probability density region are anoma- based generative models (Dinh, Sohl-Dickstein, and Bengio lous, and many density estimation based anomaly detec- 2016). tion methods have been proposed (Barnett and Lewis 1974; The density function is trained so that the probability den- Parra, Deco, and Miesbach 1996; Yeung and Ding 2003). sity of normal instances becomes high, which is the same By the recent advances of deep learning, the density es- with the standard maximum likelihood estimation. In addi- timation performance has been greatly improved by neu- tion, we would like to make the density function to satisfy ral network based density estimators, such as variational that the probability density of anomalous instances is lower than that of normal instances. To achieve this, we introduce a regularization term, which is calculated by using the log Density model likelihood ratio between normal and anomalous instances. For the density function p(x|θ), we use the deep masked au- Since our objective function is differentiable, the density toencoder density estimator (MADE) (Germain et al. 2015), function can be estimated efficiently by using stochastic which is a neural autoregressive model. The probability dis- gradient-based optimization methods. tribution can always be decomposed into the product of the Figure 1 illustrates anomaly scores with an unsupervised nested conditional distributions using the probability prod- density estimation based anomaly detection method (a), uct rule as follows, with a supervised binary classifier based anomaly detection D Y method (b), and with the proposed method (c). The unsu- p(x|θ) = p(x |x ; θ), (2) pervised method considers only normal instances, and the d 0. The remainder of the paper is organized as follows. In When the feature vector is a binary, we use the following Section 2, we define our task, and propose our method for Bernoulli distribution, supervised anomaly detection based on the neural autore- xd 1−xd gressive estimators. In Section 3, we briefly review related p(xd|x − log p(xn0 |θ), for n ∈ A, n ∈ A¯. where θ is parameters of the density function. (5) (a) Unsupervised anomaly detection (b) Supervised anomaly detection (c) Proposed method

Figure 1: Examples of anomaly scores with an unsupervised density estimation based anomaly detection method (a), with a supervised binary classifier based anomaly detection method (b), and with the proposed method (c). The white triangle M represents an observed normal instance, the white circle ◦ represents an observed anomalous instance, and the black circle • represents a test anomalous instance, which is not observed in training. The horizontal axis is the one-dimensional attribute√ space, and the vertical axis is the anomaly score, where the anomaly score increases in the downward direction. The marks and × indicate that the method can detect the test anomalous instance successfully or not, respectively.

In addition, the following log likelihood of the normal in- stances should be high, 1 X L0(θ) = log p(x |θ), (6) |A|¯ n n∈A¯ since the anomaly score, which is defined by the negative log likelihood, of the normal instances should be low. Here, | · | represents the number of elements in the set. We would like to maximize Eq. (6) while satisfying the constraints in Eq. (5) as much as possible. To make the ob- jective function differentiable with respect to the parameters   Figure 2: Regularization term f log p(xn0 |θ) of the ob- without constraints, we relax the constraints in Eq. (5) to a p(xn|θ) soft regularization term as follows, jective function (7) with respect to the log likelihood ratio   p(xn0 |θ) λ X X p(xn0 |θ) log . L(θ) = L0(θ) + f log , p(xn|θ) |A||A|¯ p(xn|θ) n∈A n0∈A¯ (7) The regularization term can be seen as a stochastic version where f(·) is the sigmoid function, of the area under the receiver operating characteristic (ROC) curve (AUC) (Yan et al. 2003), since the AUC is computed 1 f(s) = , (8) by 1 + exp(−s) 1 X X AUC = I(− log p(xn|θ) > − log p(xn0 |θ)) and λ ≥ 0 is the hyperparameter. Figure 2 shows the regu- |A||A|¯ n∈A n0∈A¯ larization with respect to the log likelihood ratio. When the   1 X X p(xn0 |θ) anomaly score of an anomalous instance is much higher than = I log > 0 (9) that of a normal instance, − log p(x |θ)  − log p(x 0 |θ), |A||A|¯ p(xn|θ) n n n∈A 0 ¯ the sigmoid function takes the maximum value one. When n ∈A the anomaly score of an anomalous instance is much where I(A) is the indicator function, i.e. I(A) = 1 if A is I(A) = 0 lower than that of a normal instance, − log p(xn|θ)  true, otherwise, and the sigmoid function is an f(s) ≈ I(s > 0) − log p(xn0 |θ), the sigmoid function takes the minimum approximation of the indicator function . value zero. Therefore, the maximization of this regulariza- tion term moves parameters θ so as to satisfy the con- 3 Related work straints in Eq. (5). We maximize the objective function A number of unsupervised methods for anomaly detection, (7) with a gradient based optimization method, such as which is sometimes called detection (Hodge and ADAM (Kingma and Ba 2014). Austin 2004) or novelty detection (Markou and Singh 2003), When there are no labeled anomalous instances or λ = 0, have been proposed, such as the local outlier factor (Breunig the regularization term becomes zero, and the first term on et al. 2000), one-class support vector machines (Scholkopf¨ the likelihood remains with the objective function, which is et al. 2001), and the isolation forest (Liu, Ting, and Zhou the same objective function with the standard density es- 2008). With density estimation based anomaly detection timation. Therefore, the proposed method can be seen as methods, Gaussian distributions (Shewhart 1931), Gaus- a generalization of unsupervised density estimation based sian mixtures (Eskin 2000) and kernel density estima- anomaly detection methods. tors (Laxhammar, Falkman, and Sviestins 2009) have been used. The density estimation methods have been regarded as unsuitable for anomaly detection in high-dimensional Table 1: Statistics of the datasets used in our experiments. data due to the difficulty of estimating multivariate prob- N is the number of instances, D is the number of attributes, ability distributions (Friedland, Gentzel, and Jensen 2014; |A| is the number of anomalous instances, and |A|/N is the Hido et al. 2011). Although some supervised anomaly de- anomaly rate. tection methods have been proposed (Nadeem et al. 2016; data ND |A| |A|/N Gao, Cheng, and Tan 2006; Das et al. 2016; Das et al. 2017; Annthyroid 7016 21 350 0.050 Munawar, Vinayavekhin, and De Magistris 2017; Pimentel Arrhythmia 305 259 61 0.200 et al. 2018; Akcay, Atapour-Abarghouei, and Breckon 2018; Cardiotocography 2068 21 413 0.200 Yamanaka et al. 2019), they are not based on deep autore- HeartDisease 187 13 37 0.198 gressive density estimators, which can achieve high density InternetAds 1775 1555 177 0.100 estimation performance. Ionosphere 351 32 126 0.359 Recent research of neural networks has made substan- KDDCup99 60839 79 246 0.004 tial progress on density estimation for high-dimensional PageBlocks 5171 10 258 0.050 data. The neural network based density estimators, includ- Parkinson 60 22 12 0.200 ing VAE (Kingma and Welling 2013), flow-based genera- PenDigits 9868 16 20 0.002 tive models (Dinh, Krueger, and Bengio 2014; Dinh, Sohl- Pima 625 8 125 0.200 Dickstein, and Bengio 2016; Kingma and Dhariwal 2018), Shuttle 1013 9 13 0.013 and autoregressive models (Uria, Murray, and Larochelle SpamBase 3485 57 697 0.200 2013; Raiko et al. 2014; Germain et al. 2015; Uria et al. Stamps 325 9 16 0.049 2016), can flexibly learn dependencies across different at- Waveform 3443 21 100 0.029 tributes, and have achieved the high density estimation per- Wilt 4671 5 93 0.020 formance. The autoregressive models have been success- fully used for density estimations, as well as modeling images (Oord, Kalchbrenner, and Kavukcuoglu 2016) and a few training anomalous instances. With our experiments speech (Van Den Oord et al. 2016). The autoregressive described in Section 4, we demonstrate that both of the like- models can compute the probability for each instance ex- lihood and AUC maximizations are effective to achieve good actly, which is desirable since we use the probability as performance in various datasets. the anomaly score. A shortcoming of the autoregressive models is that it requires computational time to generate 4 Experiments samples. However, generating samples is not necessary for Data anomaly detection. The VAE can compute the approxima- We evaluated our proposed supervised anomaly detec- tion of the lower bound of the log probability, but it can- tion method based on deep autoregressive density estima- not compute the probability exactly. By using importance tors with 16 datasets used for unsupervised outlier detec- sampling (Burda, Grosse, and Salakhutdinov 2015), one can tion (Campos et al. 2016) 1. The number of instances N, the calculate a lower bound that approaches the true log prob- number of attributes D, the number of anomalous instances ability as the number of samples increases, although it re- |A| and the anomaly rate |A|/N of the datasets are shown in quires the infinite number of samples for the true proba- Table 1. Each attribute was linearly normalized to the range bility. The flow-based generative models can calculate the [0, 1], and duplicate instances were removed. We used 80% probability exactly, and can generate samples efficiently and of the normal instances and three anomalous instances for effectively (Kingma and Dhariwal 2018). However, it re- training, 10% of the normal instances and three anomalous quires specialized functions for transformations, which are instances for validation, and remaining instances for testing. invertible and their determinant of the Jacobian matrix can For the evaluation measurement, we used the AUC. For each be easily calculated, and the density estimation performance dataset, we randomly generated ten sets of training, valida- is lower than the autoregressive models. Although we use tion and test data, and calculated the average AUC over the the autoregressive models in our framework, VAE and flow- ten sets. based generative models are straightforwardly applicable to our framework. The VAE has been used for unsupervised Comparing methods anomaly detection (An and Cho 2015; Suh et al. 2016; We compared the proposed method with the following nine Xu et al. 2018), but not for supervised anomaly detection. methods: LOF, OCSVM, IF, VAE, MADE, KNN, SVM, RF In the proposed method with the high value of the hy- and NN. LOF, OCSVM, IF, VAE and MADE are unsuper- perparameter λ, the second term in the objective function vised anomaly detection methods, where the attribute xn is (7), which is an approximation of the AUC, is dominant. used for calculating the anomaly score, but the label infor- Therefore, the proposed method is related to AUC max- mation yn is not used. KNN, SVM, RF, NN as well as the imization (Cortes and Mohri 2004; Brefeld and Scheffer proposed method are supervised anomaly detection meth- 2005), which has been used for training on class imbalanced ods, where both the attribute x and the label information data. The proposed method employs the likelihood maxi- n mization with normal instances as well as the AUC maxi- 1The datasets were obtained from http://www.dbs.ifi. mization, which enables us to improve the performance with lmu.de/research/outlier-evaluation/DAMI/ yn are used. The hyperparameters were selected based on • NN is the feed-forward neural network classifier. We used the AUC score on the validation data for both of the unsuper- three layers with rectified linear unit (ReLU) activation, vised and supervised methods. We used the implementation where the number of hidden units was selected from of scikit-learn (Pedregosa et al. 2011) with LOF, OCSVM, {5, 10, 50, 100}. IF, KNN, SVM, RF and NN. Settings of the proposed method • LOF is the local outlier factor method (Breunig et al. We used Gaussian mixtures with K = 3 components for the 2000). The LOF unsupervisedly detects anomalies based output layer. The number of hidden layers was one, the num- on the degree of isolation from the surrounding neigh- ber of hidden units was 500, the number of masks was ten, borhood. The number of neighbors was tuned from and the number of different orderings was ten. The hyperpa- {1, 3, 5, 15, 35} using the validation data. rameter λ was selected from {0, 10−1, 1, 10, 102, 103, 104} • OCSVM is the one-class support vector ma- using the validation data. The validation data were also used chine (Scholkopf¨ et al. 2001), which is an extension for early stopping, where the maximum number of training of the support vector machine (SVM) to the case of epochs was 100. We optimized the neural network parame- unlabeled data. The OCSVM finds the maximal margin ters using ADAM with 10−3. hyperplane which separates the given normal data from the origin by embedding them into a high dimensional Results space via a kernel function. We used the RBF ker- Table 2 shows the AUC results. The proposed method nel, and the kernel hyperparameter was tuned from achieved the highest average AUC among the ten methods. {10−3, 10−2, 10−1, 1}. The AUC by the MADE was high compared with the other • IF is the isolation forest method (Liu, Ting, and Zhou unsupervised methods, which indicates that the neural au- 2008), which is a tree-based unsupervised anomaly de- toregressive density estimators would be useful for unsu- tection method. The IF isolates anomalies by randomly pervised anomaly detection. With some datasets, the AUC selecting an attribute and randomly selecting a split value by the proposed method was statistically higher than that between the maximum and minimum values of the se- by the MADE, e.g. Annthyroid, PenDigits and Wilt. There lected attribute. The number of base estimators was cho- were no datasets where the AUC by the MADE was sta- sen from {1, 5, 10, 20, 30}. tistically higher than that by the proposed method. This re- sult indicates that the regularization term in the proposed • VAE is the variational (Kingma and Welling method works well for improving anomaly detection perfor- 2013), which is a density estimation method based on mance with a few labeled anomalous instances. The SVM neural networks. With the VAE, the observation is as- achieved the high AUC among supervised binary classifier sumed to follow a Gaussian distribution, where the mean based methods. However, the performance of the SVM with and variance are modeled by a neural network that takes some datasets was very low, e.g. Arrhythmia and Pima. On latent variables as the input. The latent variable is also the other hand, the proposed method achieved relatively high modeled by another neural network that takes the at- AUC on all of the datasets. This would be because the pro- tribute vector as the input. We used three-layered feed- posed method incorporates the characteristics of unsuper- forward neural networks with 100 hidden units, and the vised methods by the likelihood term in the objective func- 20-dimensional latent space. We optimized the neural net- tion as well as the characteristics of supervised methods by work parameters using ADAM. The number of epochs the regularization term. Table 3 shows the AUC results with was selected by using the validation data. (a) one and (b) five training anomalous instances by the su- pervised methods. The proposed method also achieved the • MADE is the deep masked autoencoder density estima- highest average AUC with these settings. The average AUC tor (Germain et al. 2015), which is used with the proposed by the proposed method increased as the number of training method for the density function. The proposed method anomalous instances increased, where the AUCs were 0.807, with λ = 0 corresponds to the MADE. We used the 0.821, 0.839, 0.859 when zero, one, three, five anomalous same parameter setting with the proposed method for the instances were used for training. The average computational MADE, which is described in the next subsection. time for training the proposed method was 2.78, 0.53, 0.24, • KNN is the k-nearest neighbor method, which classifies 0.03, 9.50, 0.05, 62.39, 1.36, 0.02, 4.12, 0.05, 0.09, 2.06, instances based on the votes of the neighbors. The number 0.03, 0.99, 0.82 seconds on Annthyroid, Arrhythmia, Car- of neighbors was selected from {1, 3, 5, 15}. diotocography, HeartDisease, InternetAds, Ionosphere, KD- DCup99, PageBlocks, Parkinson, PenDigits, Pima, Shuttle, • SVM is the support vector machine (Scholkopf,¨ Smola, SpamBase, Stamps, Waveform, Wilt dataset, respectively, and others 2002), which is a kernel-based binary classi- using a computer with a Xeon Gold 6130 CPU 2.10GHz. fication method. We used the RBF kernel, and the kernel Figure 3 shows the AUCs on the test data by the proposed hyperparameter was tuned from {10−3, 10−2, 10−1, 1}. method with different hyperparameters λ trained on datasets • RF is the method (Breiman 2001), which with three anomalous instances. The best hyperparameters is a meta estimator that fits a number of decision were different across datasets. For example, the high λ was tree classifiers. The number of trees was chosen from better with the PenDigits and Shuttle datasets, the low λ {5, 10, 20, 30}. was better with the Annthyroid and Stamps datasets, and Table 2: AUCs on 16 datasets with three training anomalous instances by unsupervised anomaly detection methods (LOF, OCSVM, IF, VAE, MADE) and supervised anomaly detection methods (KNN, SVM, RF, NN, Proposed). Values in bold typeface are not statistically different (at the 5% level) from the best performing method according to a paired t-test. The bottom row shows the average AUC over the datasets. LOF OCSVM IF VAE MADE KNN SVM RF NN Proposed Annthyroid 0.627 0.667 0.700 0.766 0.716 0.510 0.741 0.875 0.596 0.776 Arrhythmia 0.711 0.718 0.649 0.668 0.694 0.504 0.568 0.621 0.638 0.677 Cardiotocography 0.569 0.834 0.836 0.726 0.828 0.582 0.846 0.707 0.554 0.873 HeartDisease 0.581 0.688 0.759 0.729 0.803 0.664 0.722 0.701 0.540 0.825 InternetAds 0.677 0.836 0.562 0.860 0.780 0.533 0.804 0.601 0.711 0.834 Ionosphere 0.860 0.951 0.864 0.844 0.821 0.564 0.933 0.883 0.792 0.845 KDDCup99 0.582 0.993 0.992 0.968 0.995 0.707 0.812 0.939 0.979 0.988 PageBlocks 0.776 0.930 0.923 0.924 0.847 0.560 0.730 0.675 0.428 0.815 Parkinson 0.840 0.847 0.748 0.747 0.797 0.640 0.817 0.735 0.733 0.807 PenDigits 0.898 0.989 0.955 0.915 0.901 0.839 0.999 0.946 0.617 0.993 Pima 0.586 0.688 0.737 0.677 0.725 0.530 0.649 0.569 0.296 0.744 Shuttle 0.962 0.918 0.949 0.952 0.927 0.879 0.997 0.999 0.400 0.969 SpamBase 0.521 0.662 0.781 0.775 0.735 0.536 0.750 0.766 0.621 0.786 Stamps 0.814 0.890 0.922 0.908 0.902 0.761 0.895 0.868 0.832 0.904 Waveform 0.729 0.737 0.709 0.751 0.743 0.532 0.801 0.591 0.709 0.800 Wilt 0.731 0.352 0.596 0.455 0.707 0.503 0.801 0.623 0.649 0.785 average 0.717 0.794 0.793 0.791 0.807 0.615 0.804 0.756 0.631 0.839

Table 3: AUCs on datasets with (a) one and (b) five training anomalous instances by supervised anomaly detection methods. The AUCs by unsupervised methods are the same with Table 2 since they do not use the anomaly information for training. Values in bold typeface are not statistically different (at the 5% level) from the best performing method according to a paired t-test including AUCs by the unsupervised methods. The AUCs on Parkinson and Shuttle datasets with five training anomalous instances could not be calculated since they do not contain enough anomalous instances. (a) one training anomalous instance (b) five training anomalous instances KNN SVM RF NN Proposed KNN SVM RF NN Proposed Annthyroid 0.506 0.664 0.691 0.586 0.726 0.517 0.807 0.857 0.610 0.761 Arrhythmia 0.503 0.522 0.545 0.550 0.680 0.513 0.650 0.642 0.657 0.701 Cardiotocography 0.529 0.744 0.565 0.500 0.846 0.589 0.885 0.790 0.594 0.901 HeartDisease 0.542 0.765 0.605 0.292 0.820 0.740 0.809 0.779 0.695 0.824 InternetAds 0.513 0.700 0.546 0.615 0.851 0.568 0.849 0.666 0.786 0.866 Ionosphere 0.516 0.919 0.745 0.758 0.820 0.713 0.943 0.945 0.817 0.887 KDDCup99 0.575 0.833 0.791 0.979 0.949 0.754 0.890 0.962 0.980 0.979 PageBlocks 0.514 0.687 0.530 0.421 0.848 0.630 0.726 0.789 0.450 0.891 Parkinson 0.642 0.847 0.762 0.445 0.850 PenDigits 0.766 0.996 0.648 0.483 0.987 0.890 0.999 0.969 0.733 0.995 Pima 0.506 0.586 0.517 0.365 0.691 0.546 0.663 0.653 0.304 0.771 Shuttle 0.661 0.989 0.861 0.532 0.985 SpamBase 0.510 0.758 0.697 0.462 0.703 0.562 0.814 0.869 0.780 0.852 Stamps 0.639 0.826 0.726 0.795 0.849 0.873 0.946 0.924 0.805 0.920 Waveform 0.506 0.752 0.510 0.695 0.748 0.560 0.894 0.678 0.750 0.866 Wilt 0.503 0.725 0.543 0.614 0.776 0.512 0.852 0.730 0.654 0.812 average 0.558 0.770 0.643 0.568 0.821 0.640 0.838 0.804 0.687 0.859 the intermediate λ was better with the Cardiotocography and mization. Ionosphere datasets. This result indicates that the AUC max- imization without the likelihood maximization, which cor- 5 Conclusion responds to the proposed method with the high λ, is not We have proposed a supervised anomaly detection method effective in some datasets. The proposed method achieved based on neural autoregressive models. With the proposed the high performance with various datasets by automatically method, the neural autoregressive model is trained so that adapting the λ using the validation data to control the bal- the likelihood of normal instances is maximized and the like- ance of the likelihood maximization and the AUC maxi- lihood of anomalous instances is lower than that of normal 0.78 0.725 0.90 0.875 0.85 0.84 0.700 0.76 0.88 0.850 0.84 0.82 0.675 0.86 0.825 0.83 0.74 0.80 0.800 0.650 0.84 0.82 0.78 0.775 0.72 0.625 0.82 0.81 0.76 0.750 0.600 0.80 0.725 0.70 0.80 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 Annthyroid Arrhythmia Cardiotocography HeartDisease InternetAds Ionosphere

0.90 0.77 1.000 0.84 0.98 0.76 0.975 0.98 0.85 0.82 0.96 0.75 0.950 0.80 0.74 0.925 0.96 0.80 0.94 0.73 0.900 0.75 0.78 0.92 0.94 0.72 0.875 0.90 0.70 0.76 0.71 0.850 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 KDDCup99 PageBlocks Parkinson PenDigits Pima Shuttle

0.92 0.775 0.825 0.85 0.750 0.800 0.90 0.80 0.725 0.775 0.88 0.700 0.750 0.75 0.86 0.675 0.725 0.84 0.70 0.650 0.700 0.625 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 10 1 100 101 102 103 104 SpamBase Stamps Waveform Wilt

Figure 3: AUCs by the proposed method with different hyperparameters λ. The x-axis is λ and the y-axis is the AUC. The errorbar shows the standard error. The horizontal line is the AUC with λ = 0.

instances. The proposed method can detect anomalies in the [Brefeld and Scheffer 2005] Brefeld, U., and Scheffer, T. region where there are no normal instances as well as in the 2005. AUC maximizing support vector learning. In Pro- region where anomalous instances are located closely. We ceedings of the ICML Workshop on ROC Analysis in Ma- have experimentally confirmed the effectiveness of the pro- chine Learning. posed method using 16 datasets. Although our results have [Breiman 2001] Breiman, L. 2001. Random forests. Ma- been encouraging to date, our approach can be further im- chine learning 45(1):5–32. proved in a number of ways. First, we would like to extend our framework for semi-supervised setting (Blanchard, Lee, [Breunig et al. 2000] Breunig, M. M.; Kriegel, H.-P.; Ng, and Scott 2010), where unlabeled instances as well as la- R. T.; and Sander, J. 2000. LOF: identifying density-based beled anomalous and normal instances are given. Second, local outliers. ACM SIGMOD Record 29(2):93–104. we plan to incorporate other neural density estimators in- [Burda, Grosse, and Salakhutdinov 2015] Burda, Y.; Grosse, cluding VAE to our framework. R.; and Salakhutdinov, R. 2015. Importance weighted au- toencoders. arXiv preprint arXiv:1509.00519. References [Campos et al. 2016] Campos, G. O.; Zimek, A.; Sander, J.; [Akcay, Atapour-Abarghouei, and Breckon 2018] Akcay, S.; Campello, R. J.; Micenkova,´ B.; Schubert, E.; Assent, I.; and Atapour-Abarghouei, A.; and Breckon, T. P. 2018. Houle, M. E. 2016. On the evaluation of unsupervised out- Ganomaly: Semi-supervised anomaly detection via adver- lier detection: measures, datasets, and an empirical study. sarial training. arXiv preprint arXiv:1805.06725. and Knowledge Discovery 30(4):891–927. [Aleskerov, Freisleben, and Rao 1997] Aleskerov, E.; [Chandola, Banerjee, and Kumar 2009] Chandola, V.; Freisleben, B.; and Rao, B. 1997. Cardwatch: A neural Banerjee, A.; and Kumar, V. 2009. Anomaly detection: A network based database mining system for credit card fraud survey. ACM Computing Surveys 41(3):15. detection. In IEEE/IAFE Computational Intelligence for [Cortes and Mohri 2004] Cortes, C., and Mohri, M. 2004. Financial Engineering, 220–226. AUC optimization vs. error rate minimization. In Advances [An and Cho 2015] An, J., and Cho, S. 2015. Variational in Neural Information Processing Systems, 313–320. autoencoder based anomaly detection using reconstruction [Das et al. 2016] Das, S.; Wong, W.-K.; Dietterich, T.; Fern, probability. Special Lecture on IE 2:1–18. A.; and Emmott, A. 2016. Incorporating expert feedback [Barnett and Lewis 1974] Barnett, V., and Lewis, T. 1974. into active anomaly discovery. In 16th International Con- Outliers in statistical data. Wiley. ference on Data Mining, 853–858. IEEE. [Blanchard, Lee, and Scott 2010] Blanchard, G.; Lee, G.; [Das et al. 2017] Das, S.; Wong, W.-K.; Fern, A.; Dietterich, and Scott, C. 2010. Semi-supervised novelty detec- T. G.; and Siddiqui, M. A. 2017. Incorporating feed- tion. Journal of Research 11(Nov):2973– back into tree-based anomaly detectionw. arXiv preprint 3009. arXiv:1708.09441. [Dinh, Krueger, and Bengio 2014] Dinh, L.; Krueger, D.; [Laxhammar, Falkman, and Sviestins 2009] Laxhammar, R.; and Bengio, Y. 2014. NICE: Non-linear independent com- Falkman, G.; and Sviestins, E. 2009. Anomaly detection in ponents estimation. arXiv preprint arXiv:1410.8516. sea traffic - a comparison of the Gaussian mixture model and [Dinh, Sohl-Dickstein, and Bengio 2016] Dinh, L.; Sohl- the kernel density estimator. In International Conference on Dickstein, J.; and Bengio, S. 2016. Density estimation Information Fusion, 756–763. using real NVP. arXiv preprint arXiv:1605.08803. [Liu, Ting, and Zhou 2008] Liu, F. T.; Ting, K. M.; and [Dokas et al. 2002] Dokas, P.; Ertoz, L.; Kumar, V.; Lazare- Zhou, Z.-H. 2008. Isolation forest. In Proceeding of the vic, A.; Srivastava, J.; and Tan, P.-N. 2002. Data mining 8th IEEE International Conference on Data Mining, 413– for network intrusion detection. In NSF Workshop on Next 422. IEEE. Generation Data Mining, 21–30. [Markou and Singh 2003] Markou, M., and Singh, S. 2003. [Eskin 2000] Eskin, E. 2000. Anomaly detection over noisy Novelty detection: a review. Signal processing 83(12):2481– data using learned probability distributions. In International 2497. Conference on Machine Learning. [Mukkamala, Sung, and Ribeiro 2005] Mukkamala, S.; [Friedland, Gentzel, and Jensen 2014] Friedland, L.; Sung, A.; and Ribeiro, B. 2005. Model selection for kernel Gentzel, A.; and Jensen, D. 2014. Classifier-adjusted based intrusion detection systems. In Adaptive and Natural density estimation for anomaly detection and one-class Computing Algorithms, 458–461. Springer. classification. In SIAM International Conference on Data [Munawar, Vinayavekhin, and De Magistris 2017] Mining, 578–586. Munawar, A.; Vinayavekhin, P.; and De Magistris, G. [Fujimaki, Yairi, and Machida 2005] Fujimaki, R.; Yairi, T.; 2017. Limiting the reconstruction capability of gener- and Machida, K. 2005. An approach to spacecraft anomaly ative neural network using negative learning. In 27th detection problem using kernel feature space. In Interna- International Workshop on Machine Learning for Signal tional Conference on Knowledge Discovery in Data Mining, Processing. IEEE. 401–410. [Nadeem et al. 2016] Nadeem, M.; Marshall, O.; Singh, S.; [Gao, Cheng, and Tan 2006] Gao, J.; Cheng, H.; and Tan, P.- Fang, X.; and Yuan, X. 2016. Semi-supervised deep neural N. 2006. A novel framework for incorporating labeled ex- network for network intrusion detection. In KSU Conference amples into anomaly detection. In Proceedings of the 2006 on Cybersecurity Education, Research and Practice. SIAM International Conference on Data Mining, 594–598. [Oord, Kalchbrenner, and Kavukcuoglu 2016] Oord, A. SIAM. v. d.; Kalchbrenner, N.; and Kavukcuoglu, K. 2016. Pixel re- [Germain et al. 2015] Germain, M.; Gregor, K.; Murray, I.; current neural networks. arXiv preprint arXiv:1601.06759. and Larochelle, H. 2015. MADE: Masked autoencoder for [Parra, Deco, and Miesbach 1996] Parra, L.; Deco, G.; and distribution estimation. In International Conference on Ma- Miesbach, S. 1996. Statistical independence and novelty chine Learning, 881–889. detection with information preserving nonlinear maps. Neu- [Gornitz¨ et al. 2013] Gornitz,¨ N.; Kloft, M.; Rieck, K.; and ral Computation 8(2):260–269. Brefeld, U. 2013. Toward supervised anomaly detection. [Patcha and Park 2007] Patcha, A., and Park, J.-M. 2007. An Journal of Artificial Intelligence Research 46:235–262. overview of anomaly detection techniques: Existing solu- [Hido et al. 2011] Hido, S.; Tsuboi, Y.; Kashima, H.; tions and latest technological trends. Computer Networks Sugiyama, M.; and Kanamori, T. 2011. Statistical outlier 51(12):3448–3470. Knowledge detection using direct density ratio estimation. [Pedregosa et al. 2011] Pedregosa, F.; Varoquaux, G.; Gram- and Information Systems 26(2):309–336. fort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; [Hodge and Austin 2004] Hodge, V., and Austin, J. 2004. A Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit- survey of outlier detection methodologies. Artificial Ntelli- learn: Machine learning in python. Journal of Machine gence Review 22(2):85–126. Learning Research 12:2825–2830. [Ide´ and Kashima 2004] Ide,´ T., and Kashima, H. 2004. [Pimentel et al. 2018] Pimentel, T.; Monteiro, M.; Viana, J.; Eigenspace-based anomaly detection in computer systems. Veloso, A.; and Ziviani, N. 2018. A generalized active learn- In International Conference on Knowledge Discovery and ing approach for unsupervised anomaly detection. arXiv Data Mining, 440–449. preprint arXiv:1805.09411. [Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. [Raiko et al. 2014] Raiko, T.; Li, Y.; Cho, K.; and Bengio, Y. ADAM: A method for stochastic optimization. arXiv 2014. Iterative neural autoregressive distribution estimator preprint arXiv:1412.6980. NADE-k. In Advances in Neural Information Processing [Kingma and Dhariwal 2018] Kingma, D. P., and Dhariwal, Systems, 325–333. P. 2018. Glow: Generative flow with invertible 1x1 convo- [Rapaka, Novokhodko, and Wunsch 2003] Rapaka, A.; No- lutions. arXiv preprint arXiv:1807.03039. vokhodko, A.; and Wunsch, D. 2003. Intrusion detection [Kingma and Welling 2013] Kingma, D. P., and Welling, M. using radial basis function network on sequences of system 2013. Auto-encoding variational Bayes. arXiv preprint calls. In International Joint Conference on Neural Networks, arXiv:1312.6114. volume 3, 1820–1825. [Scholkopf¨ et al. 2001] Scholkopf,¨ B.; Platt, J. C.; Shawe- Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Esti- mating the support of a high-dimensional distribution. Neu- ral Computation 13(7):1443–1471. [Scholkopf,¨ Smola, and others 2002] Scholkopf,¨ B.; Smola, A. J.; et al. 2002. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press. [Shewhart 1931] Shewhart, W. A. 1931. Economic control of quality of manufactured product. ASQ Quality Press. [Singh and Silakari 2009] Singh, S., and Silakari, S. 2009. An ensemble approach for of cyber attack dataset. arXiv preprint arXiv:0912.1014. [Suh et al. 2016] Suh, S.; Chae, D. H.; Kang, H.-G.; and Choi, S. 2016. Echo-state conditional variational autoen- coder for anomaly detection. In International Joint Confer- ence on Neural Networks, 1015–1022. [Uria et al. 2016] Uria, B.; Cotˆ e,´ M.-A.; Gregor, K.; Murray, I.; and Larochelle, H. 2016. Neural autoregressive distri- bution estimation. Journal of Machine Learning Research 17(1):7184–7220. [Uria, Murray, and Larochelle 2013] Uria, B.; Murray, I.; and Larochelle, H. 2013. RNADE: The real-valued neu- ral autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2175–2183. [Van Den Oord et al. 2016] Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbren- ner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. In SSW, 125. [Wong et al. 2003] Wong, W.-K.; Moore, A. W.; Cooper, G. F.; and Wagner, M. M. 2003. anomaly pattern detection for disease outbreaks. In International Conference on Machine Learning, 808–815. [Xu et al. 2018] Xu, H.; Chen, W.; Zhao, N.; Li, Z.; Bu, J.; Li, Z.; Liu, Y.; Zhao, Y.; Pei, D.; Feng, Y.; et al. 2018. Un- supervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In World Wide Web Conference, 187–196. [Yamanaka et al. 2019] Yamanaka, Y.; Iwata, T.; Takahashi, H.; Yamada, M.; and Kanai, S. 2019. Autoencoding binary classifiers for supervised anomaly detection. arXiv preprint arXiv:1903.10709. [Yamanishi et al. 2004] Yamanishi, K.; Takeuchi, J.-I.; Williams, G.; and Milne, P. 2004. On-line unsupervised outlier detection using finite mixtures with discount- ing learning algorithms. Data Mining and Knowledge Discovery 8(3):275–300. [Yan et al. 2003] Yan, L.; Dodier, R. H.; Mozer, M.; and Wolniewicz, R. H. 2003. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statis- tic. In International Conference on Machine Learning, 848– 855. [Yeung and Ding 2003] Yeung, D.-Y., and Ding, Y. 2003. Host-based intrusion detection using dynamic and static be- havioral models. Pattern recognition 36(1):229–243.