Dependency-Based Anomaly Detection: Framework, Methods and Benchmark
Sha Lu: [email protected]
Lin Liu: [email protected]
Jiuyong Li: [email protected]
Thuc Duy Le: [email protected]
Jixue Liu: [email protected]
University of South Australia, Adelaide, SA 5095, Australia

Editor: TBD

arXiv:2011.06716v1 [cs.LG] 13 Nov 2020

Abstract

Anomaly detection is an important research problem because anomalies often contain critical insights for understanding the unusual behavior in data. One type of anomaly detection approach is dependency-based: it identifies anomalies by examining violations of the normal dependency among variables. These methods can discover subtle and meaningful anomalies with better interpretation. Existing dependency-based methods adopt different implementations and show different strengths and weaknesses. However, the theoretical fundamentals and the general process behind them have not been well studied. This paper proposes a general framework, DepAD, to provide a unified process for dependency-based anomaly detection. DepAD decomposes unsupervised anomaly detection tasks into feature selection and prediction problems. Utilizing off-the-shelf techniques, the DepAD framework can have various instantiations to suit different application domains. Comprehensive experiments have been conducted over one hundred instantiated DepAD methods with 32 real-world datasets to evaluate the performance of representative techniques in DepAD. To show the effectiveness of DepAD, we compare two DepAD methods with nine state-of-the-art anomaly detection methods; the results show that DepAD methods outperform the comparison methods in most cases. Through the DepAD framework, this paper gives guidance and inspiration for future research on dependency-based anomaly detection and provides a benchmark for its evaluation.

Keywords: Anomaly Detection, Dependency-based Anomaly Detection, Causal Feature Selection, Bayesian Networks, Markov Blanket

1. Introduction

Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior (Chandola et al., 2009). They often contain insights about the unusual behaviors or abnormal characteristics of the data generation process, which may imply flaws in or misuse of a system. For example, an anomaly in network traffic data may suggest a cyber-security threat (Buczak and Guven, 2015), and an unusual pattern in credit card transaction data may imply fraud (Kou et al., 2004). Anomaly detection has been intensively researched and widely applied in various domains, such as cyber-intrusion detection, fraud detection, medical diagnosis and law enforcement (Aggarwal, 2016; Chandola et al., 2009).

The mainstream anomaly detection methods are based on proximity, including distance-based methods (Knorr and Ng, 1997, 1998; Ramaswamy et al., 2000; Angiulli and Pizzuti, 2005) and density-based methods (Breunig et al., 2000; Tang et al., 2002; Zhang et al., 2009; Kriegel et al., 2009a; Yan et al., 2017). Proximity-based methods work under the assumption that normal objects lie in a dense neighborhood, while anomalies stay far away from other objects or lie in a sparse neighborhood (Aggarwal, 2016; Chandola et al., 2009).

Another line of research in anomaly detection exploits the dependency among variables, assuming that normal objects follow the dependency while anomalies do not. Dependency-based methods (Lu et al., 2020; Paulheim and Meusel, 2015; Noto et al., 2012; Babbar and Chawla, 2012; Huang et al., 2003) first discover the variable dependency possessed by the majority of objects; the anomalousness of each object is then evaluated by how well it follows the dependency. Objects that significantly deviate from the normal dependency are reported as anomalies. This paper focuses on dependency-based methods.
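The contrast between the two views can be illustrated with a minimal sketch on synthetic data (not from the paper): the distance to the k-th nearest neighbor stands in for proximity-based scoring, and the absolute residual of a least-squares fit stands in for dependency-based scoring. The data, variable names and score choices here are illustrative assumptions, not the paper's actual methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise.
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 200)

# Object o: far from all other objects, yet it follows the dependency.
x = np.append(x, 20.0)
y = np.append(y, 2.0 * 20.0 + 1.0)
X = np.column_stack([x, y])
o = len(X) - 1

def proximity_scores(X, k=5):
    """Proximity view: distance to the k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to itself

def dependency_scores(x, y):
    """Dependency view: absolute residual of a least-squares fit of y on x."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(y - A @ coef)

print("o tops the proximity ranking:", np.argmax(proximity_scores(X)) == o)
print("o tops the dependency ranking:", np.argmax(dependency_scores(x, y)) == o)
```

On this data the isolated object o receives the highest proximity score but an unremarkable dependency score, which is the disagreement Example 1 below demonstrates on real data.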
The dependency-based approach is fundamentally different from the proximity-based approach because it considers the relationships among variables, whereas the proximity-based approach relies on the relationships among objects. Exploiting variable dependency for anomaly detection gives rise to a few advantages, as illustrated by the following examples.

Example 1 (Better extrapolation capability.) Figure 1 shows a dataset of 453 objects, each with two variables: a person's age and weight. It is adapted from a real-world dataset, Arrhythmia, in the UCI data repository (Dua and Graff, 2017), by taking only the age and weight attributes of the 452 people in the dataset (shown as black dots in Figure 1) and adding an unusual object o with age 100 and weight 65 kg (shown by the red triangle on the right). In the figure, the blue curve is the regression line showing the relationship between age and weight, and the shade around it represents the 95% confidence interval.

When the two types of anomaly detection methods are applied to this dataset, a proximity-based method will report o as an anomaly because it stays far away from other objects. In contrast, o will not be reported by a dependency-based method because, although it stays far away from other objects, it follows the dependency relationship between the two variables, i.e., the blue curve. One may then wonder: should o be considered an anomaly or not? A common way to answer this question is to check against the purpose of the analysis. If the detection aims to identify people with obesity, then o is not a true anomaly. In this case, the correct conclusion can only be drawn by checking the dependency between the two variables, weight and age.

Example 2 (Ability to find intrinsic patterns and better interpretability.)
The dependency deviation identified by a dependency-based method can reveal intrinsic patterns that cannot be found by proximity-based methods, and these patterns can provide meaningful interpretations of the detected anomalies. In this example, we use the Zoo dataset from the UCI machine learning repository (Dua and Graff, 2017), which contains information about 101 animals belonging to 7 classes. For each animal, 16 variables describe its features, such as whether it has hair and whether it produces milk. Note that the class labels are only used to evaluate anomaly detection results.

Figure 1: An example showing that the proximity-based and dependency-based anomaly detection methods produce opposite detection results

For visualization, we use t-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) to map the dataset (without class labels) into two-dimensional space. As shown in Figure 2, different classes are marked with different letters in different colors. Three proximity-based methods, wkNN (Angiulli and Pizzuti, 2005), LOF (Breunig et al., 2000) and FastABOD (Kriegel et al., 2008), and a dependency-based method, LoPAD (Lu et al., 2020), are applied to the dataset. The top 10 anomalies detected by each method are shaded with gray circles, and the numbers attached to the circles are their ranks (the smaller the number, the higher the anomalousness). The names and ranks of the anomalous animals are shown in the grey box in each sub-figure.

Overall, the four methods yield very different results. LOF mainly detects anomalies at the edge of the dense cluster, i.e., the mammal cluster. WkNN and FastABOD mostly identify anomalies in sparse areas, i.e., the clusters other than mammal. The anomalies detected by LoPAD are well distributed across both dense and sparse areas. As to interpretability, LOF and wkNN only output anomaly scores with the detected anomalies, which does not help explain why the anomalies are detected.
FastABOD and LoPAD provide additional explanations. FastABOD first examines the difference between a detected anomaly and its most similar object in the dataset, then reports the most deviated variables and their deviations to explain the anomaly. An example given in the original paper of FastABOD explains a detected anomaly, scorpion, whose most similar animal found by FastABOD is termite. Comparing scorpion to termite, FastABOD reports the reasons for scorpion being an anomaly as: 1) scorpion has eight legs instead of six; 2) it is venomous; 3) it has a tail.

In contrast, the explanation by LoPAD is based on the deviation from normal dependency. Scorpion is reported as an anomaly by LoPAD because it significantly deviates from two dependencies: 1) if an animal has a tail, it is likely to have a backbone, while scorpion has a tail but no backbone; 2) if an animal does not produce milk, it likely lays eggs, but scorpion neither produces milk nor lays eggs. Comparing the two explanations, one can see that LoPAD provides a more reasonable and meaningful interpretation.

Figure 2: Top-10 anomalies detected by (a) LOF, (b) wkNN, (c) FastABOD and (d) LoPAD on the Zoo dataset

The two examples show that dependency-based methods can detect meaningful anomalies that proximity-based methods fail to uncover, with better explanations for the detected anomalies to help decision-making. However, the dependency-based approach has not received enough attention in the anomaly detection community. Through a literature review, we found only a small number of methods that fully exploit dependency to detect anomalies (Lu et al., 2020; Paulheim and Meusel, 2015; Noto et al., 2012; Babbar and Chawla, 2012). Existing dependency-based methods adopt different implementations and show different strengths and weaknesses, but the fundamental ideas and the general process behind these methods have not been well studied.
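The dependency-violation explanation in Example 2 can be mimicked with a small sketch: encode the two dependencies the text attributes to scorpion as rules and report which ones an object violates. The rule encoding and dictionary representation are our illustrative assumptions; LoPAD learns its dependencies from data rather than from hand-written rules.

```python
# Hand-written stand-ins for dependencies that LoPAD would learn from data.
# Each rule maps a name to a predicate that holds for dependency-conforming animals.
rules = {
    "has tail -> has backbone": lambda a: (not a["tail"]) or a["backbone"],
    "no milk -> lays eggs": lambda a: a["milk"] or a["eggs"],
}

def violated_rules(animal):
    """Return the names of the dependencies the animal deviates from."""
    return [name for name, holds in rules.items() if not holds(animal)]

# Scorpion, as described in the text: tail but no backbone, no milk, no eggs.
scorpion = {"tail": True, "backbone": False, "milk": False, "eggs": False}
print(violated_rules(scorpion))
```

Scorpion violates both rules, matching the two deviations reported as its explanation in the text.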
There is a need to explore further in this direction to take advantage of the possible dependency among variables for anomaly detection. In this paper, we propose a Dependency-based Anomaly Detection framework (DepAD) to provide a unified process for dependency-based anomaly detection. The goal of DepAD is twofold: 1) as a general framework, to guide the development and evaluation of new dependency-based methods; 2) as a reference model or abstraction of existing dependency-based methods, to aid understanding of and communication about these methods.