Arxiv:1904.06034V1 [Stat.ML] 12 Apr 2019 Sity Exactly for a Test Instance
Total Page:16
File Type:pdf, Size:1020Kb
Supervised Anomaly Detection based on Deep Autoregressive Density Estimators Tomoharu Iwata Yuki Yamanaka NTT Communication Science Laboratories NTT Secure Platform Laboratories Abstract autoencoders (VAE) (Kingma and Welling 2013), flow- based generative models (Dinh, Krueger, and Bengio 2014; We propose a supervised anomaly detection method based Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhari- on neural density estimators, where the negative log likeli- wal 2018), and autoregressive models (Uria, Murray, and hood is used for the anomaly score. Density estimators have been widely used for unsupervised anomaly detection. By Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; the recent advance of deep learning, the density estimation Uria et al. 2016). The VAE has been used for anomaly de- performance has been greatly improved. However, the neural tection (An and Cho 2015; Suh et al. 2016; Xu et al. 2018). density estimators cannot exploit anomaly label information, In some situations, the label information, which indicates which would be valuable for improving the anomaly detec- whether each instance is anomalous or normal, is avail- tion performance. The proposed method effectively utilizes able (Gornitz¨ et al. 2013). The label information is valuable the anomaly label information by training the neural density for improving the anomaly detection performance. How- estimator so that the likelihood of normal instances is max- ever, the existing neural network based density estimation imized and the likelihood of anomalous instances is lower methods cannot exploit the label information. To use the than that of the normal instances. We employ an autoregres- sive model for the neural density estimator, which enables anomaly label information, supervised classifiers, such as us to calculate the likelihood exactly. With the experiments nearest neighbor methods (Singh and Silakari 2009), sup- using 16 datasets, we demonstrate that the proposed method port vector machines (Mukkamala, Sung, and Ribeiro 2005), improves the anomaly detection performance with a few la- and feed-forward neural networks (Rapaka, Novokhodko, beled anomalous instances, and achieves better performance and Wunsch 2003), have been used. However, these stan- than existing unsupervised and supervised anomaly detection dard supervised classifiers do not perform well when labeled methods. anomalous instances are very few, which is often the case, since anomalous instances rarely occur by definition. 1 Introduction In this paper, we propose a neural network density esti- mator based anomaly detection method that can exploit the Anomaly detection is an important task in artificial intel- label information. The proposed method performs well even ligence, which is a task to find anomalous instances in a when only a few labeled anomalous instances are given since dataset. The anomaly detection has been used in a wide va- it is based on a density estimator, which works without any riety of applications (Chandola, Banerjee, and Kumar 2009; labeled anomalous instances. We employ the negative log Patcha and Park 2007; Hodge and Austin 2004), such as probability of an instance as its anomaly score. For the den- network intrusion detection for cyber-security (Dokas et al. sity function to calculate the probability, we use neural au- 2002; Yamanishi et al. 2004), fraud detection for credit toregressive models (Uria et al. 2016; Germain et al. 2015). cards (Aleskerov, Freisleben, and Rao 1997), defect detec- The autoregressive models can compute the probability den- tion of industrial machines (Fujimaki, Yairi, and Machida arXiv:1904.06034v1 [stat.ML] 12 Apr 2019 sity exactly for a test instance. On the other hand, the VAE 2005; Ide´ and Kashima 2004) and disease outbreak detec- computes the lower bound of the probability density ap- tion (Wong et al. 2003). proximately. Moreover, the autoregressive models have been Anomalies, which are also called outliers, are instances achieved the high density estimation performance compared that rarely occur. Therefore, it is natural to consider that with other neural density estimators, such as VAE and flow- instances at a low probability density region are anoma- based generative models (Dinh, Sohl-Dickstein, and Bengio lous, and many density estimation based anomaly detec- 2016). tion methods have been proposed (Barnett and Lewis 1974; The density function is trained so that the probability den- Parra, Deco, and Miesbach 1996; Yeung and Ding 2003). sity of normal instances becomes high, which is the same By the recent advances of deep learning, the density es- with the standard maximum likelihood estimation. In addi- timation performance has been greatly improved by neu- tion, we would like to make the density function to satisfy ral network based density estimators, such as variational that the probability density of anomalous instances is lower than that of normal instances. To achieve this, we introduce a regularization term, which is calculated by using the log Density model likelihood ratio between normal and anomalous instances. For the density function p(xjθ), we use the deep masked au- Since our objective function is differentiable, the density toencoder density estimator (MADE) (Germain et al. 2015), function can be estimated efficiently by using stochastic which is a neural autoregressive model. The probability dis- gradient-based optimization methods. tribution can always be decomposed into the product of the Figure 1 illustrates anomaly scores with an unsupervised nested conditional distributions using the probability prod- density estimation based anomaly detection method (a), uct rule as follows, with a supervised binary classifier based anomaly detection D Y method (b), and with the proposed method (c). The unsu- p(xjθ) = p(x jx ; θ); (2) pervised method considers only normal instances, and the d <d anomaly score is low where normal instances are located. d=1 Since it cannot exploit the information on anomalous in- where x<d = [x1; ··· ; xd−1] is the attribute vector before stances, the anomaly score cannot be increased even where the dth attribute. anomalous instances are located closely. With this exam- We model the conditional distribution with the following ple, it succeeds to detect test anomalous instances at the Gaussian mixture, far left and far right, but fails to detect the test anomalous K X instance at the center, where normal instances are closely p(xdjx<d; θ) = πdk(x<d; θ) located. The supervised method considers both normal and k=1 anomalous instances, where a decision boundary is placed 2 between the normal and anomalous instances. It can detect × N xdjµdk(x<d; θ); σdk(x<d; θ) ; the test anomalous instances at the center since an observed anomalous instance exists in the same region. However, it (3) cannot detect the test anomalous instances at both ends since where K is the number of mixture components, N (·|µ, σ2) they are at the normal instance side of the decision bound- is the Gaussian distribution with mean µ and variance σ2, ary. With the proposed method, the anomaly score is high 2 and πdk(x<d; θ), µdk(x<d; θ), σdk(x<d; θ) are the neural at the region where normal instances are not located as well networks that define the mixture weight, mean and vari- as the region where anomalous instances are located. There- ance of the kth mixture component for the dth attribute, fore, it can detect all of the test anomalous instances in this PK respectively, πdk(x<d; θ) ≥ 0, k=1 πdk(x<d; θ) = 1, example. 2 σdk(x<d; θ) > 0. The remainder of the paper is organized as follows. In When the feature vector is a binary, we use the following Section 2, we define our task, and propose our method for Bernoulli distribution, supervised anomaly detection based on the neural autore- xd 1−xd gressive estimators. In Section 3, we briefly review related p(xdjx<d; θ) = φd(x<d; θ) (1 − φd(x<d; θ)) ; (4) work. In Section 4, we demonstrate the effectiveness of the where φd(x<d; θ) is the neural network that outputs the proposed method using various datasets. Finally, we present probability of xd being one. Similarly, Poisson and Gamma concluding remarks and a discussion of future work in Sec- distributions with parameters modeled by neural networks tion 5. can be used in the cases of non-negative integers and non- negative continuous values, respectively. 2 Proposed method With the deep MADE, the conditional densities of differ- ent attributes are defined by deep autoencoders with masks Task so that the conditional density function for the dth attribute N xd depends only on the attributes before d, x<d, but does Suppose that we have a dataset X = f(xn; yn)gn=1, where not depend on the other attributes, x≥d = [xd; ··· ; xD]. The xn = (xn1; ··· ; xnD) is the D-dimensional attribute vector of the nth instance, and y is its anomaly label, i.e. y = 1 if MADE is more efficient than other autoregressive models. n n Note that in our framework we can use other density es- it is anomalous and yn = 0 if it is not anomalous, or normal. Our task is to estimate the anomaly score of unseen instances timators, such as VAE and flow-based generative models, as x∗, where the anomaly score of anomalous instances is high, well as autoencoders, where the reconstruction error is used and that of normal instances is low. for the anomaly score. Objective function Anomaly score Let D = f1; ··· ;Ng be a set of indexes of all the given in- The anomalous instances rarely occur, and the normal in- stances, A = fn 2 Djyn = 1g be a set of indexes of anoma- stances frequently occur. Then, the proposed method uses lous instances, and A¯ = fn 2 Djyn = 0g be a set of indexes the following negative log probability as the anomaly score of normal instances. The anomaly score of anomalous in- of instance x, stances should be higher than those of normal instances as follows, anomaly-score(x) = − log p(xjθ); (1) 0 − log p(xnjθ) > − log p(xn0 jθ); for n 2 A, n 2 A¯: where θ is parameters of the density function.