GAN Ensemble for Anomaly Detection
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Xu Han*, Xiaohui Chen*, Li-Ping Liu
Department of Computer Science, Tufts University
{Xu.Han, Xiaohui.Chen, Liping.Liu}@tufts.edu

*The first two authors contributed equally.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

When formulated as an unsupervised learning problem, anomaly detection often requires a model to learn the distribution of normal data. Previous works modify Generative Adversarial Networks (GANs) by using encoder-decoders as generators and then apply them to anomaly detection tasks. Related studies also indicate that GAN ensembles are often more stable than single GANs in image generation tasks. In this work, we propose to construct GAN ensembles for anomaly detection. In this new method, a group of generators interacts with a group of discriminators, so every generator gets feedback from every discriminator, and vice versa. Compared to a single GAN, an ensemble of GANs can better model the distribution of normal data and thus better detect anomalies. We also provide a theoretical analysis of GANs and GAN ensembles in the context of anomaly detection. The empirical study constructs ensembles based on four different types of detection models, and the results show that the ensemble outperforms the single model for all four model types.

Introduction

Anomaly detection is an important problem in machine learning and has a wide range of applications such as fraud detection (Abdallah, Maarof, and Zainal 2016; Kou et al. 2004), intrusion detection (Garcia-Teodoro et al. 2009), and event detection (Atefeh and Khreich 2015). In most anomaly detection problems, only samples of normal data are given, and the task is to detect anomalies that deviate from the normal data.

A large class of anomaly detection algorithms (Hodge and Austin 2004; Gupta et al. 2013; Chalapathy and Chawla 2019) directly or indirectly models the data distribution and then reports samples that are atypical under the distribution as anomalies. Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), which are flexible models for learning data distributions, provide a new set of tools for anomaly detection. A GAN consists of a generator network and a discriminator network, which learn a data distribution through a two-player game. Several studies apply GANs to anomaly detection, such as AnoGAN (Schlegl et al. 2017), f-AnoGAN (Schlegl et al. 2019), EGBAD (Zenati et al. 2018), GANomaly (Akcay, Atapour-Abarghouei, and Breckon 2018), and Skip-GANomaly (Akcay, Atapour-Abarghouei, and Breckon 2019). All these models use an encoder-decoder as the generator, since the generator of a vanilla GAN cannot check a sample: a synthetic sample from the decoder is either a reconstruction of a real sample or a new sample. The discriminator needs to distinguish synthetic samples from normal samples, so it acquires the ability to differentiate anomalies from normal data. For the discriminator, the training is the same as binary classification but avoids manual labeling of training data (Liu and Fern 2012). The anomaly score is usually computed by checking the reconstruction of a sample and the internal representation of the sample in the discriminator. These models perform well in a series of detection tasks.
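As a concrete illustration of this scoring scheme, the sketch below combines a reconstruction error with a distance between discriminator features. It is a hypothetical example, not code from any of the cited models; `enc`, `dec`, and `disc_features` stand for the encoder, the decoder, and the discriminator's internal feature map, and the weight `lam` is an assumed hyperparameter.

```python
import torch

def anomaly_score(x, enc, dec, disc_features, lam=0.9):
    # Reconstruct the sample through the encoder-decoder generator.
    x_rec = dec(enc(x))
    # Reconstruction error per sample (L1 norm over flattened features).
    rec_err = torch.norm((x - x_rec).flatten(1), p=1, dim=1)
    # Distance between discriminator hidden features of x and its reconstruction.
    feat_err = torch.norm((disc_features(x) - disc_features(x_rec)).flatten(1),
                          p=2, dim=1)
    # Higher score means more anomalous.
    return lam * rec_err + (1.0 - lam) * feat_err
```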
There still lacks a thorough understanding of the role of adversarial training in anomaly detection. Theoretical analysis (Goodfellow et al. 2014; Arora et al. 2017) indicates that the discriminator should do no better than random guessing at the end of an ideal training procedure. However, the internal representation of a discriminator is very effective at differentiating normal samples from anomalies in practice. There is thus a gap between theory and practice: how does the discriminator characterize the distribution of normal samples?

Training a GAN is challenging because it needs to optimize multiple deep neural networks in a min-max problem. The optimization often has stability issues, and if the neural networks are not carefully regularized, there are also problems of mode collapse. Recent works show that multiple generators and/or discriminators help to overcome these problems. Several studies (Durugkar, Gemp, and Mahadevan 2016; Neyshabur, Bhojanapalli, and Chakrabarti 2017) use multiple discriminators to provide stable gradients to the generator, making the training process smoother. Multiple generators, which essentially define a mixture distribution, can capture multiple data modes (Hoang et al. 2018). Arora et al. (2017) analyze the equilibrium of GAN models and show that a mixture of both generators and discriminators guarantees an approximate equilibrium in a GAN's min-max game. These ensemble methods have improved the performance in various generation tasks.

In this work, we propose to use GAN ensembles for anomaly detection. A GAN ensemble consists of multiple encoder-decoders and discriminators, which are randomly paired and trained via adversarial training. In this procedure, an encoder-decoder gets feedback from multiple discriminators, while a discriminator sees "training samples" from multiple generators. The anomaly score is the average of the anomaly scores computed from all encoder-decoder-discriminator pairs. This ensemble method works for most existing GAN-based anomaly detection models.

We further analyze GAN models in the context of anomaly detection. The theoretical analysis refines the functional form of the optimal discriminator in a WGAN (Arjovsky, Chintala, and Bottou 2017). The analysis further explains why the discriminator helps to identify anomalies and how a GAN ensemble improves the performance.

In the empirical study, we test the proposed ensemble method on both synthetic and real datasets. The results indicate that ensembles significantly improve the performance over single detection models. The empirical analysis of vector representations verifies our theoretical analysis.
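The averaging rule described above is easy to state in code. The sketch below is a simplified illustration under assumed interfaces, not the authors' implementation: `pair_score(x, g, d)` is taken to be the base model's anomaly score for a batch `x` under one encoder-decoder `g` paired with one discriminator `d`.

```python
import torch

def ensemble_score(x, generators, discriminators, pair_score):
    # Score the batch x under every encoder-decoder/discriminator pair.
    scores = [pair_score(x, g, d)
              for g in generators
              for d in discriminators]
    # The ensemble anomaly score is the average over all pairs.
    return torch.stack(scores).mean(dim=0)
```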
Background

The anomaly detection problem is formally defined as follows. Let X be a training set of N normal data samples, X = {x_n ∈ R^d : n = 1, ..., N}, drawn from some unknown distribution 𝒟. There is also a test sample x′ ∈ R^d, which may or may not be from the distribution 𝒟. The problem of anomaly detection is to train a model on X such that the model can classify x′ as normal if x′ is from the distribution 𝒟, or abnormal if x′ is from a different distribution. The model often computes an anomaly score y′ ∈ R for x′ and decides the label of x′ by thresholding y′.

Anomaly detection essentially depends on the characterization of the distribution of normal samples. Adversarial training, which is designed to learn data distributions, suits the task well. Models based on adversarial training usually have an encoder-decoder as the generator and a classifier as the discriminator. Previous models such as f-AnoGAN, EGBAD, GANomaly, and Skip-GANomaly all share this architecture. It is important to have an encoder-decoder as the generator because a detection model often needs the low-dimensional encoding of a new sample. Below is a review of these models.

The generator consists of an encoder G_e(·; φ) : R^d → R^{d′}, which is parameterized by φ, and a decoder G_d(·; ψ) : R^{d′} → R^d, which is parameterized by ψ. The encoder maps a sample x to an encoding vector z, while the decoder computes a reconstruction x̃ of the sample from z. The discriminator in a vanilla GAN or a WGAN has the form u = D(x; γ). In an EGBAD model, which is based on a BiGAN, the discriminator takes both a sample and its encoding as input, so it has the form u = D((x, G_e(x)); γ).

Since a model consists of an encoder-decoder and a discriminator, the training often considers losses inherited from both models. The adversarial loss comes from GAN training. The losses are defined as follows when the discriminator works in one of the three GAN types (vanilla GAN, WGAN, and BiGAN):

L_{a-g}(x) = log D(x) + log(1 − D(G_d(G_e(x))))    (2)

L_{a-wg}(x) = D(x) − D(G_d(z̃))    (3)

L_{a-bg}(x) = log D(x, G_e(x)) + log(1 − D(G_d(z̃), z̃))    (4)

In L_{a-g}(x), the sample from the generator is the reconstruction of the original sample. The WGAN objective is used by f-AnoGAN, which does not train the encoder G_e(·; φ) with this objective, so L_{a-wg}(x) has no relation to φ. An encoding z̃ in (3) or (4) is sampled from a prior distribution p(z), e.g. a multivariate Gaussian distribution, so L_{a-wg}(x) and L_{a-bg}(x) are considered stochastic functions of x.

The second type of loss is the reconstruction loss, which is inherited from encoder-decoders. This loss is the difference between a reconstruction and the original sample, measured by the ℓ-norm with either ℓ = 1 or ℓ = 2:

L_r(x) = ‖x − G_d(G_e(x))‖_ℓ    (5)

Previous research also indicates that the hidden vector h of a sample at the last hidden layer of the discriminator D(·; γ) is often useful for differentiating normal and abnormal samples. Denote by h = f_D(x; γ) the hidden vector of x in D(x; γ); then the discriminative loss based on h is

L_d(x) = ‖f_D(x) − f_D(G_d(G_e(x)))‖_ℓ    (6)

The GANomaly model also considers the difference between the encoding vectors of a normal sample x and its reconstruction x̃. In particular, it uses a separate encoder G_e(·; φ̃) to encode the recovery x̃. The encoding loss is then

L_e(x) = ‖G_e(x; φ) − G_e(G_d(G_e(x; φ)); φ̃)‖_ℓ    (7)

Equation (7) explicitly indicates that the parameters φ and φ̃ of the two encoders are different.
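To make losses (5)-(7) concrete, here is a minimal PyTorch-style sketch under assumed module interfaces: `G_e`, `G_d`, and `f_D` match the notation above, and `G_e2` plays the role of the second encoder G_e(·; φ̃) in (7). This is an illustration, not the authors' code.

```python
import torch

def recon_loss(x, G_e, G_d, ell=1):
    # Reconstruction loss (5): ell-norm between x and its reconstruction.
    x_rec = G_d(G_e(x))
    return torch.norm((x - x_rec).flatten(1), p=ell, dim=1).mean()

def disc_loss(x, G_e, G_d, f_D, ell=2):
    # Discriminative loss (6): distance between discriminator hidden
    # vectors of a sample and of its reconstruction.
    x_rec = G_d(G_e(x))
    return torch.norm((f_D(x) - f_D(x_rec)).flatten(1), p=ell, dim=1).mean()

def enc_loss(x, G_e, G_d, G_e2, ell=2):
    # Encoding loss (7): a second encoder G_e2 (parameters phi-tilde)
    # re-encodes the reconstruction produced from the first encoding.
    z = G_e(x)
    z_rec = G_e2(G_d(z))
    return torch.norm((z - z_rec).flatten(1), p=ell, dim=1).mean()
```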