Supervised Anomaly Detection based on Deep Autoregressive Density Estimators
Tomoharu Iwata Yuki Yamanaka NTT Communication Science Laboratories NTT Secure Platform Laboratories
Abstract autoencoders (VAE) (Kingma and Welling 2013), flow- based generative models (Dinh, Krueger, and Bengio 2014; We propose a supervised anomaly detection method based Dinh, Sohl-Dickstein, and Bengio 2016; Kingma and Dhari- on neural density estimators, where the negative log likeli- wal 2018), and autoregressive models (Uria, Murray, and hood is used for the anomaly score. Density estimators have been widely used for unsupervised anomaly detection. By Larochelle 2013; Raiko et al. 2014; Germain et al. 2015; the recent advance of deep learning, the density estimation Uria et al. 2016). The VAE has been used for anomaly de- performance has been greatly improved. However, the neural tection (An and Cho 2015; Suh et al. 2016; Xu et al. 2018). density estimators cannot exploit anomaly label information, In some situations, the label information, which indicates which would be valuable for improving the anomaly detec- whether each instance is anomalous or normal, is avail- tion performance. The proposed method effectively utilizes able (Gornitz¨ et al. 2013). The label information is valuable the anomaly label information by training the neural density for improving the anomaly detection performance. How- estimator so that the likelihood of normal instances is max- ever, the existing neural network based density estimation imized and the likelihood of anomalous instances is lower methods cannot exploit the label information. To use the than that of the normal instances. We employ an autoregres- sive model for the neural density estimator, which enables anomaly label information, supervised classifiers, such as us to calculate the likelihood exactly. With the experiments nearest neighbor methods (Singh and Silakari 2009), sup- using 16 datasets, we demonstrate that the proposed method port vector machines (Mukkamala, Sung, and Ribeiro 2005), improves the anomaly detection performance with a few la- and feed-forward neural networks (Rapaka, Novokhodko, beled anomalous instances, and achieves better performance and Wunsch 2003), have been used. However, these stan- than existing unsupervised and supervised anomaly detection dard supervised classifiers do not perform well when labeled methods. anomalous instances are very few, which is often the case, since anomalous instances rarely occur by definition. 1 Introduction In this paper, we propose a neural network density esti- mator based anomaly detection method that can exploit the Anomaly detection is an important task in artificial intel- label information. The proposed method performs well even ligence, which is a task to find anomalous instances in a when only a few labeled anomalous instances are given since dataset. The anomaly detection has been used in a wide va- it is based on a density estimator, which works without any riety of applications (Chandola, Banerjee, and Kumar 2009; labeled anomalous instances. We employ the negative log Patcha and Park 2007; Hodge and Austin 2004), such as probability of an instance as its anomaly score. For the den- network intrusion detection for cyber-security (Dokas et al. sity function to calculate the probability, we use neural au- 2002; Yamanishi et al. 2004), fraud detection for credit toregressive models (Uria et al. 2016; Germain et al. 2015). cards (Aleskerov, Freisleben, and Rao 1997), defect detec- The autoregressive models can compute the probability den- tion of industrial machines (Fujimaki, Yairi, and Machida arXiv:1904.06034v1 [stat.ML] 12 Apr 2019 sity exactly for a test instance. On the other hand, the VAE 2005; Ide´ and Kashima 2004) and disease outbreak detec- computes the lower bound of the probability density ap- tion (Wong et al. 2003). proximately. Moreover, the autoregressive models have been Anomalies, which are also called outliers, are instances achieved the high density estimation performance compared that rarely occur. Therefore, it is natural to consider that with other neural density estimators, such as VAE and flow- instances at a low probability density region are anoma- based generative models (Dinh, Sohl-Dickstein, and Bengio lous, and many density estimation based anomaly detec- 2016). tion methods have been proposed (Barnett and Lewis 1974; The density function is trained so that the probability den- Parra, Deco, and Miesbach 1996; Yeung and Ding 2003). sity of normal instances becomes high, which is the same By the recent advances of deep learning, the density es- with the standard maximum likelihood estimation. In addi- timation performance has been greatly improved by neu- tion, we would like to make the density function to satisfy ral network based density estimators, such as variational that the probability density of anomalous instances is lower than that of normal instances. To achieve this, we introduce a regularization term, which is calculated by using the log Density model likelihood ratio between normal and anomalous instances. For the density function p(x|θ), we use the deep masked au- Since our objective function is differentiable, the density toencoder density estimator (MADE) (Germain et al. 2015), function can be estimated efficiently by using stochastic which is a neural autoregressive model. The probability dis- gradient-based optimization methods. tribution can always be decomposed into the product of the Figure 1 illustrates anomaly scores with an unsupervised nested conditional distributions using the probability prod- density estimation based anomaly detection method (a), uct rule as follows, with a supervised binary classifier based anomaly detection D Y method (b), and with the proposed method (c). The unsu- p(x|θ) = p(x |x ; θ), (2) pervised method considers only normal instances, and the d