Dependency-based Anomaly Detection: Framework, Methods and Benchmark

Sha Lu, Lin Liu, Jiuyong Li, Thuc Duy Le, Jixue Liu
University of South Australia, Adelaide, SA 5095, Australia

Editor: TBD

Abstract

Anomaly detection is an important research problem because anomalies often contain critical insights for understanding the unusual behavior in data. One type of anomaly detection approach is dependency-based, which identifies anomalies by examining violations of the normal dependency among variables. These methods can discover subtle and meaningful anomalies with better interpretability. Existing dependency-based methods adopt different implementations and show different strengths and weaknesses. However, the theoretical fundamentals and the general process behind them have not been well studied. This paper proposes a general framework, DepAD, to provide a unified process for dependency-based anomaly detection. DepAD decomposes unsupervised anomaly detection tasks into feature selection and prediction problems. Utilizing off-the-shelf techniques, the DepAD framework can have various instantiations to suit different application domains. Comprehensive experiments have been conducted over one hundred instantiated DepAD methods with 32 real-world datasets to evaluate the performance of representative techniques in DepAD. To show the effectiveness of DepAD, we compare two DepAD methods with nine state-of-the-art anomaly detection methods, and the results show that DepAD methods outperform the comparison methods in most cases. Through the DepAD framework, this paper gives guidance and inspiration for future research on dependency-based anomaly detection and provides a benchmark for its evaluation.

Keywords: Anomaly Detection, Dependency-based Anomaly Detection, Causal Feature Selection, Bayesian Networks, Markov Blanket

1. Introduction

Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior (Chandola et al., 2009). They often contain insights about the unusual behaviors or abnormal characteristics of the data generation process, which may imply flaws or misuse of a system. For example, an anomaly in network traffic data may suggest a cyber security threat (Buczak and Guven, 2015), and an unusual pattern in credit card transaction data may imply fraud (Kou et al., 2004). Anomaly detection has been intensively researched and widely applied to various domains, such as cyber-intrusion detection, fraud detection, medical diagnosis and law enforcement (Aggarwal, 2016; Chandola et al., 2009).

The mainstream anomaly detection methods are based on proximity, including distance-based methods (Knorr and Ng, 1997, 1998; Ramaswamy et al., 2000; Angiulli and Pizzuti, 2005) and density-based methods (Breunig et al., 2000; Tang et al., 2002; Zhang et al., 2009; Kriegel et al., 2009a; Yan et al., 2017). Proximity-based methods work under the assumption that normal objects are in a dense neighborhood, while anomalies stay far away from other objects or in a sparse neighborhood (Aggarwal, 2016; Chandola et al., 2009).

Another line of research in anomaly detection is to exploit the dependency among variables, assuming normal objects follow the dependency while anomalies do not. Dependency-based methods (Lu et al., 2020; Paulheim and Meusel, 2015; Noto et al., 2012; Babbar and Chawla, 2012; Huang et al., 2003) first discover the variable dependency possessed by the majority of objects; then the anomalousness of objects is evaluated through how well they follow the dependency. The objects that significantly deviate from the normal dependency are reported as anomalies. This paper focuses on dependency-based methods.

The dependency-based approach is fundamentally different from the proximity-based approach because it considers the relationships among variables, while the proximity-based approach relies on the relationships among objects. Exploiting variable dependency for anomaly detection gives rise to a few advantages, as illustrated with the following examples.

Example 1 (Better extrapolation capability.) Figure 1 shows a dataset of 453 objects, each with two variables, a person's age and weight. It is adapted from a real-world dataset, Arrhythmia, in the UCI data repository (Dua and Graff, 2017), by taking only the age and weight attributes of the 452 people in the dataset (shown as black dots in Figure 1) and adding an unusual object o with age 100 and weight 65 kg (shown by the red triangle on the right). In the figure, the blue curve is the regression line showing the relationship between age and weight, and the shade around the blue line represents the 95% confidence interval. When the two types of anomaly detection methods are applied to this dataset, a proximity-based method will report o as an anomaly because it stays far away from other objects. In contrast, o will not be reported by a dependency-based method because, although o stays far away from other objects, it follows the dependency relationship between the two variables, i.e., the blue curve. One must then wonder: should we consider o to be an anomaly or not? A common way to answer this question is to check against the purpose of the analysis. If the detection is to identify people with obesity, then o is not a true anomaly. In this case, the correct conclusion can only be drawn by checking the dependency between the two variables, weight and age.

Example 2 (Ability to find intrinsic patterns and better interpretability.) The dependency deviation identified by a dependency-based method can reveal intrinsic patterns that cannot be found by proximity-based methods, and these patterns can provide meaningful interpretations of the detected anomalies. In this example, we use the Zoo dataset from the UCI repository (Dua and Graff, 2017), which contains information about 101 animals belonging to 7 classes. For each animal, 16 variables are used to describe its features, such as whether it has hair and whether it produces milk.

Figure 1: An example showing that proximity-based and dependency-based anomaly detection methods produce opposite detection results

It is noted that the class labels are only used to evaluate anomaly detection results. For visualization, we use T-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) to map the dataset (without class labels) into two-dimensional space. As shown in Figure 2, different classes are marked with different letters in different colors. Three proximity-based methods, wkNN (Angiulli and Pizzuti, 2005), LOF (Breunig et al., 2000) and FastABOD (Kriegel et al., 2008), and a dependency-based method, LoPAD (Lu et al., 2020), are applied to the dataset. The top 10 anomalies detected by each method are shaded with gray circles, and the numbers attached to the circles are their ranks (the smaller the number, the higher the anomalousness). The names and ranks of the anomalous animals are shown in the grey box in each sub-figure.

Overall, the four methods yield very different results. LOF mainly detects anomalies at the edge of the dense cluster, i.e., the mammal cluster. wkNN and FastABOD mostly identify anomalies in sparse areas, i.e., the clusters other than mammal. The anomalies detected by LoPAD are well distributed over both dense and sparse areas.

As to interpretability, LOF and wkNN only output anomaly scores with the detected anomalies, which does not help with explaining the reasons for the detected anomalies. FastABOD and LoPAD provide additional explanations. FastABOD first examines the difference between a detected anomaly and its most similar object in the dataset, then reports the top deviated variables and their deviations to explain the anomaly. An example given in the original paper of FastABOD explains a detected anomaly, scorpion, for which the most similar animal found by FastABOD is termite. Comparing scorpion to termite, FastABOD reports the reasons for scorpion being an anomaly as: 1) scorpion has eight instead of six legs; 2) it is venomous; 3) it has a tail.

In contrast, the explanation by LoPAD is based on the deviation from normal dependency. Scorpion is reported as an anomaly by LoPAD because it significantly deviates from two dependencies: 1) if an animal has a tail, it is likely to have a backbone, while a scorpion has a tail but no backbone; 2) if an animal does not produce milk, it is likely to lay eggs, but a scorpion neither produces milk nor lays eggs. Comparing the two explanations, one can see that LoPAD provides more reasonable and meaningful interpretations.

Figure 2: Top-10 anomalies detected by LOF, wkNN, FastABOD and LoPAD on the Zoo dataset. Panels: (a) LOF; (b) wkNN; (c) FastABOD; (d) LoPAD.

The two examples have shown that dependency-based methods can detect meaningful anomalies that proximity-based methods fail to uncover, and can provide better explanations of detected anomalies to help decision-making. However, the dependency-based approach has not received enough attention in the anomaly detection community. Through a literature review, we have only found a small number of methods that fully exploit dependency to detect anomalies (Lu et al., 2020; Paulheim and Meusel, 2015; Noto et al., 2012; Babbar and Chawla, 2012). Existing dependency-based methods adopt different implementations and show different strengths and weaknesses, but the fundamental ideas and the general process behind these methods have not been well studied. There is a need to explore further in this direction to take advantage of the possible dependency among variables for anomaly detection.

In this paper, we propose a Dependency-based Anomaly Detection framework (DepAD) to provide a unified process for dependency-based anomaly detection. The goal of DepAD is twofold: 1) as a general framework, to guide the development and evaluation of new dependency-based methods; 2) as a reference model or abstraction of existing dependency-based methods, to help the understanding of and communication about these methods. The design of DepAD aims to solve two issues in existing dependency-based methods: (1) high time complexity; (2) low adaptability to various applications with different types of data and requirements.

Specifically, the DepAD framework contains two phases: (1) dependency model construction; (2) anomaly score generation. In Phase (1), firstly, for each variable in a given dataset, a set of other variables that are strongly related to the variable is selected. Then, for each variable, a dependency model containing the dependency relationships between the variable and its relevant variables is constructed. In Phase (2), anomaly scores are generated based on dependency deviations. For an object, the expected value of each variable is estimated using the dependency model built in Phase (1) for the variable; then the dependency deviation is computed as the difference between the expected value and the observed value of the variable. Finally, the dependency deviations of all variables are normalized and combined to produce the anomaly score of the object.

Through this process, the DepAD framework decomposes the unsupervised anomaly detection problem into local feature selection and prediction problems to facilitate efficient dependency-based anomaly detection on high-dimensional data. Moreover, DepAD enables the use of off-the-shelf feature selection, prediction and anomaly score ensemble techniques to assemble hundreds of anomaly detection methods for various applications. In this paper, we analyze which off-the-shelf techniques to utilize, and how, in the context of anomaly detection. We also propose techniques specialized for the DepAD framework. To empirically study the impact of different techniques, we first instantiate dependency-based methods with different candidate techniques, then conduct experiments to compare their performance. To show the effectiveness of the DepAD framework, we compare DepAD methods with state-of-the-art anomaly detection methods. The results show that the DepAD methods outperform the comparison methods in most cases. Lastly, we use a case study to demonstrate the interpretability of the DepAD framework. In summary, the main contributions of this work are as follows:

• We propose a dependency-based anomaly detection framework, DepAD, to provide a unified dependency-based anomaly detection process. DepAD is effective, reconfigurable with off-the-shelf techniques for different applications, scalable to high-dimensional data, and able to produce interpretable results.

• We investigate and analyze the use of off-the-shelf techniques in the DepAD framework to enable the configuration (instantiation) of hundreds of dependency-based anomaly detection methods from the DepAD framework.

• We comprehensively and empirically study the performance of representative off-the-shelf techniques for the DepAD framework and compare DepAD methods with state-of-the-art anomaly detection methods. The results have shown the effectiveness and efficiency of the DepAD framework.

• To manifest the interpretability of the DepAD framework, we illustrate how to explain identified anomalies through a case study.

The rest of the paper is organized as follows. In Section 2, we introduce the background of anomaly detection and survey existing anomaly detection methods. In Section 3, we introduce the DepAD framework; for each step of DepAD, we identify and analyze suitable off-the-shelf techniques and propose new specialized techniques. Section 4 summarizes representative techniques for each step of DepAD as an algorithm. In Section 5, we empirically study the performance of instantiations of DepAD and present the comparison of the proposed methods with state-of-the-art anomaly detection methods. Section 6 explains how to interpret anomalies identified by DepAD methods and illustrates this with a case study. Section 7 concludes the paper.

2. Related Work

In this section, we first introduce the background of anomaly detection, including how the two terms, outlier and anomaly, are used in the literature, and the general process of anomaly detection. Then we discuss existing dependency-based methods, followed by a brief review of the subspace approach, focusing on its similarities to and differences from DepAD.

2.1 Background

Although the two terms, anomaly and outlier, have been used interchangeably in the literature, they sometimes mean different things, likely due to historical reasons and the perspectives of different research communities. The study of outliers started more than a century ago in the statistics community, where the term outlier has been used to refer to noise, extreme values, or contaminants (Hawkins, 1980; Barnett and Lewis, 1994; Rousseeuw and Leroy, 2005). The first type of outliers (noise) is considered to be caused by instrumental or human errors, so the related studies aim to discover and then remove the noise to obtain a better-fitted model of the data (Rousseeuw and Leroy, 2005). The two classic outlier detection textbooks in statistics (Hawkins, 1980; Barnett and Lewis, 1994) consider outliers to be extreme values or contaminants. Contaminants are objects generated by a contaminated distribution, which is different from the distribution generating the normal objects, while extreme values are generated from the same distribution as the normal objects but stay at the tail of the distribution. The authors of these books argued that only contaminants should be discarded, whereas extreme values should be kept in data modeling.

In 1997, Knorr and Ng introduced the notion of outliers based on the proximity of objects (Knorr and Ng, 1997) and later proposed a distance-based method (Knorr and Ng, 1998). Since then, outlier detection has attracted much attention from the community. Compared with the research in the statistics community, the research in the data mining area considers that outliers are generated from an unknown but meaningful mechanism, so they could provide critical insights into the data. While some research in the data mining area, e.g., (Knorr and Ng, 1997, 1998; Ramaswamy et al., 2000; Angiulli and Pizzuti, 2005; Breunig et al., 2000), still uses the term outlier for objects generated by a different mechanism but not necessarily outlying, other research, e.g., (Tan et al., 2018; Chandola et al., 2009; Emmott et al., 2015; Lu et al., 2020; Noto et al., 2012), has adopted the term anomaly to avoid confusion. In this paper, we focus on the detection of anomalies as defined above and on the data mining methods for anomaly detection.

A variety of anomaly detection methods have been proposed in the data mining area, and all try to utilize the characteristics of anomalies that are in some way inconsistent with other objects. The general process of anomaly detection starts with an assumption about the aspect in which anomalies are abnormal, then evaluates the anomalousness of objects in this aspect. The mainstream anomaly detection approach is proximity-based (Knorr and Ng, 1997, 1998; Ramaswamy et al., 2000; Angiulli and Pizzuti, 2005; Breunig et al., 2000), which assumes that anomalies are objects staying far away from other objects or in a sparse neighborhood. The anomalousness of objects is virtually evaluated through the distances among objects. Proximity-based methods can roughly be grouped into kNN distance-based methods, relative density-based methods and clustering-based methods. The kNN distance-based methods utilize the distances of the k-nearest neighbors of an object to evaluate its anomalousness, such as the distance to the kth nearest neighbor (Ramaswamy et al., 2000) or the average distance of the k nearest neighbors (Angiulli and Pizzuti, 2005). The relative density-based methods utilize the density of the neighborhood of an object to evaluate its anomalousness. LOF (Breunig et al., 2000) is the first method proposed in this category and is followed by variants using different ways of determining neighbors (Tang et al., 2002), different ways of measuring local density (Zhang et al., 2009; Kriegel et al., 2009a), or improved efficiency (Yan et al., 2017). Clustering-based methods (Tan et al., 2018; Chandola et al., 2009) assume that normal objects are similar and can form clusters, and consider anomalies to be objects in any of three cases: (1) not in any cluster; (2) in small clusters; (3) staying at the edge of clusters.

Another approach to anomaly detection, which is also the focus of this paper, is dependency-based (Lu et al., 2020; Paulheim and Meusel, 2015; Noto et al., 2012; Babbar and Chawla, 2012). These methods assume that anomalies are objects that do not follow the normal dependency among variables, and the anomalousness of objects is evaluated through the deviation from the normal dependency. This is fundamentally different from the proximity-based approach since the focus is the dependency among variables instead of the relationship among objects.

2.2 Dependency-based Approach

Compared to proximity-based methods, not many dependency-based methods are currently available. In this section, we review the existing dependency-based methods, with reference to the steps of the DepAD framework.

CFA (Huang et al., 2003) uses the correlation among variables to detect anomalies in mobile ad hoc networking. For each variable, it uses all other variables as predictors to estimate the variable's expected value, skipping the relevant variable selection step in the DepAD framework. When constructing dependency models, it first discretizes continuous values. Then classification models (C4.5, RIPPER and naive Bayes) are trained to predict expected values. Two combination functions, average match count and average probability, are proposed for anomaly score generation, and average probability shows better performance than average match count in their experiments. As CFA can only handle discrete data, its performance with continuous data may be affected by the discretization process. Moreover, using all other variables as relevant variables makes it unsuitable for high-dimensional data.

FraC (Noto et al., 2010, 2012) is proposed to deal with continuous data and improve detection effectiveness. FraC also uses all other variables as the predictors of a variable. It adopts an ensemble regression model with three base predictors: RBF kernel SVMs, linear kernel SVMs and regression trees. When generating anomaly scores, it first converts the dependency deviations of variables, i.e., the differences between expected values and observed values, into surprisals, i.e., the log-loss, then computes anomaly scores as the summation of surprisals over all variables. The major problem of FraC is its low efficiency, because it adopts ensemble models and uses all other variables as relevant variables, both of which have high computational cost.

Another method, ALSO (Paulheim and Meusel, 2015), also uses all other variables as predictors and regression models for estimating expected values, but compared with FraC, ALSO employs a single regression model and a different combination process. ALSO adopts linear, isotonic and M5' regression models in its experiments, and M5' performs better. ALSO uses as the anomaly score the summation over all variables of the weighted loss, i.e., the squared dependency deviation weighted by the strength of the regression model.

A common issue with ALSO, FraC and CFA is that they use all other variables as relevant variables, which causes two problems: high computational complexity and low accuracy on high-dimensional data. In contrast, two other methods, COMBN (Babbar and Chawla, 2012) and LoPAD (Lu et al., 2020), use a subset of variables as relevant variables.

COMBN (Babbar and Chawla, 2012) uses a Bayesian Network (BN) to represent the dependency among variables. In the structure of a BN (represented by a directed acyclic graph), nodes represent variables and arcs represent the dependency among nodes. If there is an arc from variable X1 to variable X2, X1 is known as a parent of X2, and X2 is a child of X1. For a node X, COMBN uses the set of all its parent nodes, denoted as Pa(X), as relevant variables, and assumes that each variable follows a Gaussian distribution whose mean, i.e., the expected value of the variable, has a linear relationship with its parents' values. The summation of the dependency deviations, i.e., the differences between the expected and observed values, over all variables is computed as the anomaly score. COMBN only uses the parents of a variable as its relevant variables, so if an anomaly occurs in the non-parents of X, the dependency relationship between X and its parents Pa(X) cannot capture the anomaly. Moreover, COMBN needs a whole BN, which is often not available in practice. Learning a BN from data is time-consuming and even impossible for high-dimensional data (Chickering et al., 1994).

LoPAD (Lu et al., 2020) utilizes the local structure of a BN, the Markov Blanket (MB), to deal with the problems COMBN has encountered. For a variable X in a BN, its MB, denoted as MB(X), contains all parents, children and spouses of X (the other parents of X's children). Given MB(X), X is conditionally independent of all other variables in the BN. In LoPAD, MB(X) is selected as X's relevant variables, which ensures that the complete dependency around X is captured and hence a more accurate estimation of the expected values of X can be achieved. Since LoPAD only needs to learn the MB of each variable, instead of a whole BN, it is able to cope with high-dimensional data. LoPAD builds regression trees with bagging (Breiman, 1996) to predict the expected values of variables and uses pruned summation (Aggarwal and Sathe, 2015) to sum up the significant dependency deviations of an object over all variables as its anomaly score. Experiments have shown great performance improvement over the other dependency-based methods and state-of-the-art anomaly detection methods compared.

2.3 Connection of DepAD with Subspace Anomaly Detection

To tackle the problem of anomaly detection in high-dimensional data, subspace anomaly detection methods (Zimek et al., 2012; Yu and Chen, 2018; Kriegel et al., 2009b, 2012; Lazarevic and Kumar, 2005) have been proposed to detect anomalies based on proximity in subsets of variables, i.e., subspaces. When determining subspaces, many methods (Yu and Chen, 2018; Kriegel et al., 2009b, 2012) utilize the correlation of variables, and some (Liu et al.; Lazarevic and Kumar, 2005) randomly select variables to form subspaces. When evaluating anomalousness in each subspace, they usually utilize existing proximity-based algorithms, such as LOF (Yu and Chen, 2018; Lazarevic and Kumar, 2005) and kNN (Kriegel et al., 2012, 2009b). Comparing the subspace approach with DepAD, although both work with subsets of variables, subspace methods are proximity-based, while DepAD is dependency-based.

3. The DepAD Framework

In this section, we present the DepAD framework. For each step of DepAD, we explain the goal and key considerations and analyze which off-the-shelf techniques can be utilized in the step and how. We also propose techniques that are specialized for the DepAD framework. The notation used in the rest of the paper is shown in Table 1.

Table 1: Notation

Notation | Description
$X$ | a variable
$x$ | a value of variable $X$
$\mathbf{X}$ | a set of variables, i.e., $\mathbf{X} = \{X_1, X_2, \cdots, X_m\}$
$D$ | a data matrix of $n$ objects and $m$ variables
$x_{i*}$ | the $i$-th row vector (data point or object) of $D$
$x_{*j}$ | the $j$-th column vector of $D$, i.e., all values of variable $X_j$ in $D$
$x_{ij}$ | the $j$-th element of $x_{i*}$
$\hat{x}_{i*}$ | the predicted value of $x_{i*}$
$\hat{x}_{ij}$ | the predicted value of $x_{ij}$
$R(X_j)$ | the set of relevant variables of $X_j$
$r(x_{ij})$ | the values of $R(X_j)$ in the $i$-th object of $D$

3.1 Overview of the DepAD Framework

In dependency-based anomaly detection, an object's anomalousness is evaluated based on its deviation from the underlying dependency relationships between variables. A general way to measure dependency deviation is to examine the difference between the observed value of an object and its expected value estimated based on the variable dependency. A straightforward way to do this is to compute the difference with respect to each variable and aggregate the variable-level differences to obtain the object-level difference. This idea leads us to use a supervised approach, with existing predictive models, to deal with the problem of detecting anomalies due to dependency deviation. That is, for each variable X, we build a prediction model with X as the target and the other variables related to X as the predictors. Then, given an object, we can use the model to predict or estimate the expected value of X based on the observed values of the predictor variables.
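To make the idea concrete, the sketch below (a minimal Python illustration on a synthetic dataset, not the authors' implementation) treats one variable as the prediction target and uses the residual as its dependency deviation; the full framework repeats this for every variable.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
D = rng.normal(size=(200, 4))                        # synthetic data, 4 variables
D[:, 1] = 2 * D[:, 0] + 0.1 * rng.normal(size=200)   # X1 depends on X0

j = 1                                                # target variable X_j
predictors = [k for k in range(D.shape[1]) if k != j]
g = DecisionTreeRegressor(min_samples_leaf=5).fit(D[:, predictors], D[:, j])
expected = g.predict(D[:, predictors])               # expected values of X_j
deviation = np.abs(D[:, j] - expected)               # per-variable dependency deviation
print(deviation.round(3)[:5])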

Figure 3: The DepAD framework

Based on the above discussion, we propose the DepAD framework, as shown in Figure 3. DepAD contains two phases: (1) dependency model construction; (2) anomaly score generation. Phase 1 contains two steps: (a) relevant variable selection; (b) dependency model training. In step (a), for each variable X, we find the set of variables that are strongly related to X, called the relevant variables of X in this paper and denoted as R(X). Then, in step (b), a dependency model G is constructed, which contains m prediction models, one for each variable. Each prediction model $g_j$ with respect to variable $X_j$ is learned using $R(X_j)$ as predictors to predict the values of $X_j$. In Phase 2, given an object $x_{i*}$, for a variable $X_j$ ($j \in \{1, \ldots, m\}$), its expected value on $X_j$, i.e., $\hat{x}_{ij}$, is estimated using the prediction model $g_j$ learned in Phase 1. The dependency deviation $\delta_{ij}$ is computed as the difference between the observed value $x_{ij}$ and the expected value $\hat{x}_{ij}$; then $\delta_{ij}$ is normalized so that the same deviation on different variables represents a similar degree of anomalousness. This process is applied to each variable to obtain the deviation vector of $x_{i*}$, i.e., $\delta_{i*}$. Lastly, $\delta_{i*}$ is input into a combination function $c$ to generate the anomaly score $s_i$ of $x_{i*}$. The top-scored objects, or the objects whose scores are higher than a user-defined threshold, are reported as anomalies. In the following, we explain the details of the two phases.

3.2 Phase 1: Dependency Model Construction

3.2.1 Relevant Variable Selection

The goal of the relevant variable selection step is to identify a set of predictor variables for each variable in a given dataset. Given a variable X, relevant variable selection can be considered as supervised feature selection for building a prediction model for X. This step brings three advantages to DepAD. First, relevant variable selection can exclude redundant and irrelevant variables from the prediction model and possibly reduce the risk of overfitting to improve prediction accuracy. Secondly, it accelerates the learning of the prediction models and increases model scalability for dealing with high-dimensional data. Finally, and importantly, relevant variables enable easy and meaningful interpretation of detected anomalies, especially in high-dimensional settings.

Given a dataset D with a set of variables $\mathbf{X}$, for each $X \in \mathbf{X}$, we aim to find a set of other variables $R(X) \subseteq \mathbf{X} \setminus \{X\}$ such that the estimated values of X based on R(X) best reflect the normal dependency between X and all other variables in $\mathbf{X}$. In other words, we would like R(X) to completely capture the dependency of X with all other variables in D. We can see that the goal of relevant variable selection is the same as that of supervised feature selection (Yu et al., 2018; Li et al., 2017). Therefore, off-the-shelf feature selection methods can be used for this step. When choosing a feature selection method, consideration needs to be given to: 1) the prediction model used in the dependency model training step; 2) the interpretability of the features selected; 3) the efficiency of the feature selection method. Some prediction models, e.g., Lasso regression, implicitly conduct feature selection. When instantiating an algorithm from the DepAD framework with such a model, a decision needs to be made on whether or not to skip the relevant variable selection step, as performing both may leave the prediction model too few features to choose from, leading to low prediction accuracy (Li et al., 2017). Regarding point 2) above, a relatively small, non-redundant and strongly related relevant variable set is more helpful for interpreting detected anomalies. Last, different feature selection methods have different computational costs, and some are not suitable for high-dimensional data, so computational efficiency needs to be considered, especially in time-critical scenarios.

Feature selection methods fall into three categories: wrapper, filter and embedded methods (Li et al., 2017). Wrapper methods use a prediction model to evaluate the quality of different subsets of features. With embedded methods, the feature selection process is an integrated part of the prediction model learning algorithm. Filter feature selection is prediction model agnostic and utilizes the characteristics of data to evaluate feature relevance. A detailed review of feature selection methods can be found in (Li et al., 2017). Theoretically, methods from all three categories can be utilized for relevant variable selection in DepAD. However, filter feature selection is preferred due to its high efficiency and independence of prediction models. Being model-agnostic brings high flexibility to DepAD, especially with the availability of a large number of off-the-shelf feature selection methods.

Among existing filter feature selection methods, causal feature selection methods (Yu et al., 2020, 2018; Guyon et al., 2007) are good choices because they find the causal factors of a target and thus have better interpretability. These methods select the parents and children (PC) or the Markov Blanket (MB) of a target variable in a Bayesian network (BN) as predictors for the target. The MB is an optimal choice for relevant variables because, for a variable X, given its MB, denoted as MB(X), X is conditionally independent of all other variables in the BN (Pearl, 2000). This means MB(X) contains the complete dependency information of X.
Moreover, according to the studies in (Guyon et al., 2007; Aliferis et al., 2010), the MB is theoretically the optimal feature subset for a prediction task, which means using MB(X) as relevant variables can give a more accurate estimation of the expected values of X. However, finding MBs can have high computational cost, especially when the size of the MB of a variable is big.

To tackle this problem, PCs can be used. A study (Yu et al., 2020) has shown that selecting PCs in prediction tasks produces similar performance to MBs in practice, while PCs are much faster to find than MBs. Parent variables as features can produce more stable predictions when training and testing data distributions are different, but for i.i.d. data, parent variables may not bring benefit to prediction accuracy (Yu et al., 2020). There have been many efficient methods for learning the PC or MB of a variable from data without learning a complete BN (Aliferis et al., 2003; Peña et al., 2005; Margaritis and Thrun, 2000; Borboudakis and Tsamardinos, 2019; Yaramakala and Margaritis, 2005; Yu et al., 2020). However, due to the uncertainty in causal structure learning from data (Li et al., 2020; Spirtes, 2010), it is still an open question how to identify parents from the learned PC or MB, or how to learn a unique BN structure from data, not to mention that learning a complete BN structure is often intractable for high-dimensional data (Chickering et al., 1994).

A special case of relevant variable selection is to use all other variables as the relevant variables of a variable, as done by the existing methods (Huang et al., 2003; Noto et al., 2010, 2012; Paulheim and Meusel, 2015). However, this may decrease the effectiveness of anomaly detection and the efficiency of building prediction models, and it makes it difficult to interpret detected anomalies. So when instantiating a dependency-based method from the DepAD framework, we do not recommend using all other variables as relevant variables.
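As a hedged illustration of this step, the sketch below uses a simple information-based filter, ranking candidate predictors by estimated mutual information with the target in the spirit of the MI technique in Table 2, as a stand-in for MB/PC discovery (which requires a dedicated causal feature selection library); the value of k and the MI estimator are illustrative assumptions.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def relevant_variables(D: np.ndarray, j: int, k: int = 5) -> list:
    """Indices of the k variables most related to variable j (MI filter)."""
    candidates = [c for c in range(D.shape[1]) if c != j]
    mi = mutual_info_regression(D[:, candidates], D[:, j], random_state=0)
    ranked = sorted(zip(mi.tolist(), candidates), reverse=True)
    return [c for _, c in ranked[:k]]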

3.2.2 Dependency Model Training

In this step, a set of prediction models is trained, one for each variable (as the target) using its relevant variables as predictors. Specifically, the aim of this step is to train a set of prediction models $G = \{g_1, \cdots, g_m\}$ using the input dataset D, where $g_j: R(X_j) \rightarrow X_j$, and thus for an object x, the expected value of variable $X_j$ in this object is $\hat{x}_j = g_j(r(X_j))$. Off-the-shelf prediction models can be used for this step. In this paper, we focus on regression models. If a regression model represents the relationship between a target variable and its predictors well, it will produce small residual errors for objects that follow the normal dependency in the data and large residual errors for objects that deviate from it. For a variable X and its relevant variables R(X), the regression model is expected to represent the dependency between X and R(X) well, so that it yields small dependency deviations, i.e., residual errors, for normal objects and large deviations for anomalies. However, when a training set contains anomalies, the estimations for normal objects may be pushed further from their observed values, while for anomalies, the estimations may be pulled closer to their observed values. Thus, the contrast between the dependency deviations of normal and abnormal objects could decrease, making anomalies hard to detect. Therefore, when choosing a regression model for DepAD, we need to consider: 1) how close the estimations for normal objects are to their observed values, to reduce false positives; 2) how to avoid the impact of anomalies in the training data when training the dependency model, to increase true positives.

12 models also can work well. We recommend to use bagging (Breiman, 1996) with regression trees to improve prediction accuracy. Bagging can also be used to mitigate the impact of anomalies. With bagging, several re- gression trees are trained with a re-sampled set, and then the predicted values are computed as the mean of estimations from all the trees. Bagging can reduce the impact of anomalies because: 1) re-sampling possibly reduces the number of anomalies in the re-sampled sets because anomalies are assumed to be rare in the original training set; 2) a regression tree is prone to overfitting clustered anomalies, while re-sampling may separate clustered anoma- lies; 3) averaging the predictions from multiple trees could reduce the impact of overfitting from individual trees. In addition to the basic regression tree, some robust versions of tree-based regression models can also be utilized for DepAD. In order to reduce the impact of anomalies, these variants of regression tree have been mainly improved in the two aspects: (1) use more robust loss functions to construct trees (Li and Martin, 2017; Galimberti et al., 2007; Brence and Brown, 2006; Roy and Larocque, 2012); (2) use more robust aggregation functions to calculate the final prediction from the predictions of multiple trees (Meinshausen, 2006; Roy and Larocque, 2012). For example, in (Roy and Larocque, 2012), a robust version of random forest is proposed to use the median instead of the mean to aggregate the predictions of trees. This idea can be utilized to use the median to aggregate the predictions of bagging CART trees, which is named as mCART in the paper. With linear regression models, objects with significant impacts on determining the coef- ficients are named leverage points (Hastie et al., 2009). Regularization is used to reduce the impact of leverage points by shrinking the size of coefficients. If the dependency is linear, anomalies are likely leverage points, which means that reducing the impact of leverage points has the same effect as reducing the impact of anomalies. If relevant variable selection step is implemented for an instantiation of DepAD, we do not recommend regularization with L1 norm, i.e., Lasso regression, because it tends to reduce some coefficients to zero, which may affect the accuracy of prediction due to too few predictors with the implementation of the relevant variable selection step.
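Under these considerations, a sketch of the recommended dependency model is given below: bagged regression trees aggregated by the mean (standard bagged CART) or by the median (the mCART variant described above). The hyperparameter values are illustrative, not the exact settings used in the experiments.

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=25, seed=0):
    """Bagged CART trees; bag.predict gives the mean-aggregated estimate."""
    base = DecisionTreeRegressor(min_samples_split=20, min_samples_leaf=7)
    return BaggingRegressor(base, n_estimators=n_trees,
                            random_state=seed).fit(X, y)

def predict_mcart(bag, X):
    """mCART: aggregate per-tree predictions with the median instead of the
    mean, which is less sensitive to trees that have overfit anomalies."""
    per_tree = np.stack([tree.predict(X) for tree in bag.estimators_])
    return np.median(per_tree, axis=0)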

3.3 Phase 2: Anomaly Score Generation

As shown in Figure 3, in Phase 2, an anomaly score is generated for an object based on its dependency deviations. Specifically, for an object $x_{i*}$, its expected value $\hat{x}_{i*}$ is first estimated by the dependency model obtained in Phase 1. For a variable $X_j$, the dependency deviation is computed as $\delta_{ij} = |x_{ij} - \hat{x}_{ij}|$. Then $\delta_{ij}$ is normalized, denoted as $\delta'_{ij}$. Finally, the anomaly score $s_i$ of $x_{i*}$ is generated using a combination function $c$ over the normalized dependency deviations on all variables, i.e., $s_i = c(\delta'_{i1}, \cdots, \delta'_{im})$.

Normalization is a critical step in DepAD, which aims to make normalized deviations with similar values on different variables represent similar degrees of anomalousness. It is important to understand that the degree of anomalousness expressed by a deviation is relative to the deviations of other objects, not to the original values. This means that even if the input data is in the same range and scale on all variables, the normalization procedure is still necessary, because normalization is based on the distribution of the deviations of objects rather than the distribution of the original values of objects.

For example, given two objects $x_{1*}$ and $x_{2*}$ and two variables $X_1$ and $X_2$ in the range of [0, 1], suppose the deviation of $x_{1*}$ on $X_1$, $\delta_{11}$, is 0.5, and the deviation of $x_{2*}$ on $X_2$, $\delta_{22}$, is 0.9. Although $\delta_{22}$ is larger than $\delta_{11}$, we cannot conclude that $x_{2*}$ on $X_2$ is more anomalous than $x_{1*}$ on $X_1$. The deviations of 0.5 and 0.9 need to be compared with the deviations of other objects to determine their anomalous extent. If the deviations of other objects on $X_1$ are between 0 and 0.1, and those on $X_2$ are between 0.5 and 0.9, then $\delta_{11} = 0.5$ indicates a much higher degree of anomalousness than $\delta_{22} = 0.9$. Therefore, normalization is required to bring the deviations on different variables to a comparable scale.

When choosing normalization functions, we recommend functions that are robust to anomalies. One function proposed in (Tan et al., 2018) is preferred, which is based on the traditional Z-score, replacing the mean and standard deviation with the median and average absolute deviation (AAD). We name this normalization function robust Z-score henceforth. Specifically, given a deviation vector $\delta_{*j}$, for a deviation $\delta_{ij} \in \delta_{*j}$, its normalized value is computed as $\delta'_{ij} = \frac{\delta_{ij} - \tilde{\mu}_{\delta_{*j}}}{\tilde{\sigma}_{\delta_{*j}}}$, where $\tilde{\mu}_{\delta_{*j}}$ and $\tilde{\sigma}_{\delta_{*j}}$ are the median and AAD of $\delta_{*j}$ respectively. Here, $\tilde{\sigma}_{\delta_{*j}}$ is defined as $\tilde{\sigma}_{\delta_{*j}} = \frac{1}{n}\sum_{i=1}^{n} |\delta_{ij} - \tilde{\mu}_{\delta_{*j}}|$.

Existing normalization methods mainly come from ensemble anomaly detection, where they aim to transform the outputs of multiple base detectors to a comparable scale. They mainly adopt two types of transformations: 1) Z-score normalization (Aggarwal and Sathe, 2017; Nguyen et al., 2010); 2) conversion to probabilities (Gao and Tan, 2006; Kriegel et al., 2011). When using these existing methods for DepAD, two problems may affect detection effectiveness. First, neither transformation considers the impact of anomalies. Second, converting outputs into probabilities would significantly reduce the gap between normal and abnormal objects. In the following discussion, the word deviation refers to normalized deviation unless otherwise specified.

When choosing a combination function, one thing to note is what we call the dilution effect: the scores of anomalies with a few high deviations may be lower than the scores of some normal objects with a large number of small deviations, making these anomalies undetectable. For some anomalies, the anomalousness may occur on only a few variables. Although the deviations on these variables are still greater than those on other variables, the anomalousness may not be reflected in the anomaly scores because the larger deviations are 'diluted' by the many small (normal) ones. The dilution effect is often more serious in high-dimensional data.

Existing combination methods mainly come from subspace anomaly detection (Zimek et al., 2012) and ensemble anomaly detection (Aggarwal and Sathe, 2017). Commonly used methods include summation (Lazarevic and Kumar, 2005), maximum (Yu and Chen, 2018), pruned sum (PS) (Aggarwal and Sathe, 2015), AOM (Aggarwal and Sathe, 2015) and weighted sum (Paulheim and Meusel, 2015; Nguyen et al., 2010). Details about these methods can be found in (Aggarwal and Sathe, 2015, 2017). For DepAD, PS is a good choice because it mitigates the dilution effect by only summing up deviations greater than a threshold.

Specifically, given a threshold $\eta$, for an object $x_{i*}$, its anomaly score computed by PS is $s_i = \sum_{j=1}^{m} I_{ij}\delta'_{ij}$, where $I_{ij} = 0$ when $\delta'_{ij}$ is less than $\eta$, and $I_{ij} = 1$ when $\delta'_{ij}$ is larger than $\eta$. By excluding small deviations from the combination, PS puts more emphasis on larger (possibly anomalous) deviations to improve detection effectiveness.
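Putting normalization and combination together, the following sketch is a direct NumPy transcription of the two formulas above, i.e., the RZPS scheme recommended in the summary below.

import numpy as np

def rzps(delta: np.ndarray, eta: float = 0.0) -> np.ndarray:
    """delta: (n, m) matrix of raw dependency deviations; returns n scores."""
    med = np.median(delta, axis=0)                  # median of each column
    aad = np.mean(np.abs(delta - med), axis=0)      # average absolute deviation
    dn = (delta - med) / np.maximum(aad, 1e-12)     # robust Z-score per variable
    return np.where(dn > eta, dn, 0.0).sum(axis=1)  # pruned sum with threshold eta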

In summary, for anomaly score generation, we suggest using robust Z-score normalization together with PS combination (RZPS).

4. Instantiations and Algorithms

In the previous section, we discussed the important considerations when choosing off-the-shelf techniques for the phases and steps of DepAD, together with our proposed improvements to existing techniques. In this section, we summarize the representative techniques that can be used with DepAD, which can result in hundreds of dependency-based anomaly detection methods instantiated from the DepAD framework. We present the outline of these instantiations in Section 4.2 as Algorithm 1.

4.1 Summary of Techniques for Instantiating the DepAD Framework

Table 2 shows a list of representative techniques for instantiating DepAD. In this paper, an instantiated method or algorithm is named by the acronyms of the specific techniques used for different steps/phases (shown in Table 2) separated by ‘-’.

4.2 Outline of DepAD Algorithms

All instantiated DepAD methods follow the same procedure, as shown in Algorithm 1. A DepAD algorithm takes as its input a dataset D and outputs the anomaly scores for the objects in D. Then, top-scored objects, or objects whose scores are larger than a user-defined threshold, are reported as anomalies. Lines 1 to 4 are the dependency model construction phase, in which, for each variable $X_j$, its relevant variables $R(X_j)$ are obtained at line 2, then a prediction model $g_j$ is trained at line 3 using $X_j$ as the target variable and $R(X_j)$ as predictors. Lines 5 to 13 comprise the second phase, anomaly score generation. The expected value of a variable with respect to an object is estimated at line 7, followed by the dependency deviation computation at line 8. After obtaining the deviations for all variables, the anomaly score of the object is computed at line 12. Lastly, the anomaly scores are output at line 14.

The time complexity of a DepAD algorithm depends on the techniques selected for the two steps of Phase 1, relevant variable selection and dependency model training. Taking IAMB-CART-RZPS as an example, for a dataset with n objects and m variables, the time complexity of discovering the MBs of the m variables using IAMB is $O(m^2\lambda)$ (Aragam and Zhou, 2015), where $\lambda$ is the average size of the MBs. The complexity of building a CART tree is $O(m\lambda n \log n)$ (Breiman et al., 1984). Therefore, the overall complexity is $O(m^2\lambda) + O(m\lambda n \log n)$.

5. Evaluation

In evaluating the DepAD framework, we aim to answer the following questions: (1) How do the different techniques used in individual steps affect the performance of DepAD methods? (2) How do the different combinations of techniques in different steps affect the performance of DepAD methods? (3) Compared with state-of-the-art anomaly detection methods, how is the performance of the instantiated DepAD methods? (4) How is the efficiency of DepAD methods?

Table 2: Recommended Techniques for the DepAD Framework

Step | Types of Techniques | Techniques
Relevant Variable Selection | Causal FS: MB | FBED (Borboudakis and Tsamardinos, 2019); HITON-MB (Aliferis et al., 2003); IAMB (Aragam and Zhou, 2015); Fast-IAMB (Yaramakala and Margaritis, 2005); GSMB (Margaritis and Thrun, 2000)
Relevant Variable Selection | Causal FS: PC | HITON-PC (Aliferis et al., 2003); GET-PC (Peña et al., 2005)
Relevant Variable Selection | Information-based | MI (Qian and Shu, 2015)
Relevant Variable Selection | Consistency-based | IEPC (Arauzo-Azofra et al., 2008); IEC (Dash and Liu, 2003)
Relevant Variable Selection | Dependency-based | DC (Dodge, 2008)
Relevant Variable Selection | Distance-based | Relief (Kira and Rendell, 1992)
Dependency Model Training | Tree-based regression* | mCART (using median as the aggregation function); CART (Breiman et al., 1984); M5p (Wang and Witten, 1996)
Dependency Model Training | Linear regression | General Linear (Linear) (Seber, 2009); Ridge (Hoerl and Kennard, 1970); Elastic Net (Zou and Hastie, 2005); Lasso (Tibshirani, 1996)
Dependency Model Training | Spline regression | MARS (Friedman, 1991)
Anomaly Score Generation | Normalization/combination | RZPS (robust Z-score with PS); PS (Aggarwal and Sathe, 2015); HeDES (Nguyen et al., 2010); Maximum (Max) (Yu and Chen, 2018); Summation (Sum) (Lazarevic and Kumar, 2005)

* Bagging (Breiman, 1996) is recommended to be used with tree-based regression models.

Algorithm 1: The Algorithm of the DepAD Framework

Input: D: a dataset with n objects and a set of variables X = {X1, ..., Xm}
Output: s: a vector of anomaly scores for the n objects in D

/* PHASE 1: Dependency Model Construction */
1:  for each Xj ∈ X, j ∈ {1, ..., m} do
2:      get the relevant variables of Xj, R(Xj)
3:      train the prediction model gj: Xj = gj(R(Xj))
4:  end for

/* PHASE 2: Anomaly Score Generation */
5:  for each Xj ∈ X, j ∈ {1, ..., m} do
6:      for each xi* ∈ D, i ∈ {1, ..., n} do
7:          predict the expected value of xij, i.e., x̂ij = gj(r(xij))
8:          δij = |xij − x̂ij|
9:      end for
10: end for
11: for each δi*, i ∈ {1, ..., n} do
12:     compute the anomaly score of xi*, si = c(δi*)
13: end for
14: output s = {s1, ..., sn}
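Below is one runnable instantiation of Algorithm 1 in Python, assembling the illustrative pieces sketched in Section 3 (an MI filter for relevant variable selection, bagged regression trees as dependency models, and RZPS for score generation). It mirrors the line structure of the pseudocode but is a hedged sketch of one possible instantiation, not the authors' reference implementation.

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.tree import DecisionTreeRegressor

def depad_scores(D: np.ndarray, k: int = 5, eta: float = 0.0) -> np.ndarray:
    """One illustrative DepAD instantiation; D is an (n, m) data matrix."""
    n, m = D.shape
    delta = np.empty((n, m))
    for j in range(m):
        # Lines 1-4: relevant variable selection (MI filter) + model training.
        cand = [c for c in range(m) if c != j]
        mi = mutual_info_regression(D[:, cand], D[:, j], random_state=0)
        rel = [c for _, c in sorted(zip(mi.tolist(), cand), reverse=True)[:k]]
        g = BaggingRegressor(DecisionTreeRegressor(min_samples_leaf=7),
                             n_estimators=25, random_state=0)
        g.fit(D[:, rel], D[:, j])
        # Lines 5-10: expected values and dependency deviations.
        delta[:, j] = np.abs(D[:, j] - g.predict(D[:, rel]))
    # Lines 11-14: robust Z-score normalization + pruned sum (RZPS).
    med = np.median(delta, axis=0)
    aad = np.maximum(np.mean(np.abs(delta - med), axis=0), 1e-12)
    dn = (delta - med) / aad
    return np.where(dn > eta, dn, 0.0).sum(axis=1)

Objects with the highest returned scores would be reported as anomalies.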

5.1 Datasets and Evaluation Metrics

As shown in Table 3, we choose 32 real-world datasets that cover diverse domains, e.g., spam detection, molecular bioactivity detection, and image object recognition, for the evaluation. The AID362, Backdoor, MNIST and CalTech16 datasets are obtained from the Kaggle dataset repository (Kaggle). The Pima, WBC, Stamps, Ionosphere and Bank datasets are obtained from an anomaly detection dataset repository (Campos et al., 2016). The others are retrieved from the UCI data repository (Dua and Graff, 2017). These datasets are often used in the anomaly detection literature.

We follow the common process to obtain ground-truth labels. If a dataset only contains two classes, the majority class is chosen to be the normal class and the minority class the anomalous class. If a dataset contains multiple classes of unbalanced sizes, the normal class(es) may be chosen from one or multiple majority classes, and the anomaly class(es) may be selected from one or multiple minority classes, to avoid too few objects in the normal and/or anomalous classes. If a dataset contains multiple classes with similar sizes, we randomly choose two classes, one as the normal class and one as the anomalous class. It is noted that the experiments are conducted in an unsupervised setting, where labels are only used to evaluate the results.

In Table 3, the #variables column refers to the number of variables of a dataset, and the class labels column contains the labels of all the classes. The labels of the normal classes are shown in the normal class column, and the total number of objects in these normal classes is presented in the #objects in normal class column. The labels of the anomalous classes are shown in the anomaly class column, and the total number of objects in these anomalous classes is presented in the #objects in anomaly class column.

If the ratio of the number of objects in the anomaly class to the total number of objects in a dataset is greater than 1%, we randomly sample (without replacement) 1% of the total number of objects from the anomalous class as anomalies. The sampling is repeated 20 times, an experiment is conducted with each of the 20 sampled datasets, and the average result is reported. If the ratio of anomalies is less than 1%, an experiment is conducted once using the dataset with the objects in the normal and anomalous classes, which is the case for the Wine, PageBlocks, Aid362 and Arrhythmia datasets. Categorical features are converted into numeric ones by 1-of-ℓ encoding (Campos et al., 2016).
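As an illustration of this preparation protocol, the following hedged sketch (pandas-based; the column name "class" and the helper name are hypothetical) keeps the normal classes, subsamples the anomalous classes to roughly 1% of the data, and applies 1-of-ℓ (one-hot) encoding.

import pandas as pd

def make_benchmark(df: pd.DataFrame, normal_labels, anomaly_labels,
                   ratio: float = 0.01, seed: int = 0) -> pd.DataFrame:
    normal = df[df["class"].isin(normal_labels)]
    anomalies = df[df["class"].isin(anomaly_labels)]
    n_target = max(1, int(ratio * len(df)))      # about 1% of all objects
    if len(anomalies) / len(df) > ratio:         # subsample without replacement
        anomalies = anomalies.sample(n=n_target, random_state=seed)
    data = pd.concat([normal, anomalies], ignore_index=True)
    cat_cols = [c for c in data.columns
                if data[c].dtype == "object" and c != "class"]
    return pd.get_dummies(data, columns=cat_cols)  # 1-of-l encoding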

Table 3: Summary of the 32 Real-world Datasets Used in Experiments

Name | #Variables | Class labels | Normal class | #Objects in normal class | Anomaly class | #Objects in anomaly class
Wilt | 6 | n, w | n | 4578 | w | 261
Pima | 9 | no, yes | no | 500 | yes | 268
WBC | 10 | no, yes | no | 444 | yes | 10
Stamps | 10 | no, yes | no | 309 | yes | 6
Glass | 10 | 2,1,7,3,5,6 | 2,1,7,3,5 | 205 | 6 | 9
Gamma | 11 | h, g | g | 12332 | h | 6688
PageBlocks | 11 | 1,2,5,4,3 | 1 | 4913 | 3 | 28
Wine | 12 | 6,5,7,8,4,3,9 | 6,5,7,8,4 | 4873 | 3,9 | 25
HeartDisease | 14 | 0,1,2,3,4 | 0,1,2,3 | 290 | 4 | 13
Leaf | 16 | 1-36 | 1-35 | 330 | 36 | 10
Letter | 17 | A-Z | A | 789 | Z | 734
PenDigits | 17 | 0-9 | 1-9 | 9849 | 0 | 1143
Waveform | 22 | 2,1,0 | 2 | 1696 | 0 | 1657
Cardiotocography | 23 | 1,2,3 | 1 | 1655 | 3 | 176
Parkinson | 23 | 1,0 | 1 | 147 | 0 | 48
BreastCancer | 31 | B, M | B | 357 | M | 212
WBPC | 31 | N, R | N | 151 | R | 47
Ionosphere | 33 | no, yes | no | 225 | yes | 126
Biodegradation | 42 | NRB, RB | NRB | 699 | RB | 356
Bank | 52 | no, yes | no | 4000 | yes | 521
Spambase | 58 | 0,1 | 0 | 2788 | 1 | 1813
Libras | 91 | 1-15 | 2-15 | 336 | 1 | 24
Aid362 | 145 | inactive, active | inactive | 4219 | active | 60
Backdoor | 190 | normal, backdoor | normal | 56000 | backdoor | 1746
CalTech16 | 257 | 1-101 | 1 | 798 | 53 | 31
Arrhythmia | 275 | 1-16 | 1,2,10 | 339 | 14 | 4
Census | 386 | low, high | low | 44708 | high | 2683
Secom | 591 | -1,1 | -1 | 1463 | 1 | 104
MNIST | 785 | 0-9 | 7 | 1028 | 0 | 980
CalTech28 | 785 | 1-101 | 1 | 798 | 34 | 65
Fashion | 785 | 0-9 | 1 | 1000 | 0 | 1000
Ads | 1559 | nonad, ad | nonad | 2820 | ad | 459

* When the number of classes of a dataset is small (less than 8), class labels are arranged in descending order of class size.

In the anomaly detection literature, ROC AUC (Liu and Özsu, 2009) and Average Precision (AP) (Campos et al., 2016) are the two commonly used evaluation metrics. ROC AUC refers to the area under a ROC curve. In anomaly detection, a ROC curve is obtained by plotting the true positive rate against the false positive rate among the top-l scored objects over all possible choices of l, i.e., l ∈ {1, ..., n}. Here, positive refers to anomalies and negative refers to normal objects. The value of ROC AUC is between 0 and 1, where a value of 1 means that all anomalies have higher scores than any normal objects, and a random guess leads to a value close to 0.5. ROC AUC is often used to evaluate the overall performance of an anomaly detection method. AP is defined as the average value of $P@l$, where $P@l$ is the precision among the top-$l$ scored objects. Specifically, given an anomaly score vector $s = \{s_1, \cdots, s_n\}$, it can be divided into two subsets, i.e., $s = s^a \cup s^n$, where $s^a$ contains the scores of anomalies and $s^n$ contains the scores of normal objects. For any $s \in s$, its rank in the descending order of scores is denoted as $rank(s)$. Given a value $l$, $P@l$ is defined as $P@l = \frac{|\{s \in s^a \mid rank(s) \leq l\}|}{l}$, and AP is defined as $AP = \frac{\sum_{s \in s^a} P@rank(s)}{|s^a|}$. In this paper, we present the experimental results in terms of both ROC AUC and AP.
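The two metrics can be computed as follows; the AP function below is a direct transcription of the definition above (assuming labels are 1 for anomalies and 0 for normal objects), and ROC AUC can be taken directly from scikit-learn.

import numpy as np
from sklearn.metrics import roc_auc_score

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """AP as defined above; labels: 1 = anomaly, 0 = normal."""
    order = np.argsort(-scores)                # objects in descending score order
    hits = labels[order].astype(bool)          # True where an anomaly sits
    ranks = np.flatnonzero(hits) + 1           # rank(s) of each anomaly (1-based)
    p_at_rank = np.cumsum(hits)[hits] / ranks  # P@rank(s) for each anomaly
    return float(p_at_rank.mean())

# roc_auc_score(labels, scores) gives the ROC AUC for the same inputs.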

5.2 Evaluation of Techniques in Individual Steps of DepAD

In this section, we answer the first question, i.e., how the different techniques used in individual steps affect the performance of DepAD methods.

5.2.1 Experiment Setting

We select five representative techniques from different categories for each step: FBED, HITON-PC, MI, IEPC and DC for the relevant variable selection step of Phase 1; CART, mCART, Linear, Ridge and Lasso for the dependency model training step of Phase 1; and RZPS, PS, Sum, Max and GS for Phase 2, anomaly score generation. It is noted that although we do not recommend Lasso when relevant variable selection is performed, it is still included in the evaluation to support our arguments. Each time we compare the different techniques for a step, we fix the techniques used in the other two steps. This gives us, for the evaluation of the techniques in a step, 25 results, since the other two steps each have 5 different techniques used in the experiments.

For the relevant variable selection step, for FBED and HITON-PC, we use the implementations from (Yu et al., 2020, 2018), and the significance level of the tests is set to 0.01 for both of them. MI, IEPC and DC are implemented using the R package FSinR (Aragón-Royón et al., 2020). The slope thresholds for selecting features in the three techniques are all set to 0.8, as recommended by the package.

For the dependency model training step, the regression algorithm CART is implemented using the R package ipred (Peters and Hothorn, 2019), and the Linear, Ridge and Lasso regression algorithms are implemented using the R package glmnet (Friedman et al., 2010). Bagging is used with mCART and CART, where the number of trees is set to 25, the minimum number of objects in a node for it to be split is 20, the minimum number of objects in a bucket is 7, and the complexity parameter is 0.003. For Lasso and Ridge, 10-fold cross-validation is used to determine λ, the regularization parameter.

For the anomaly score generation phase, the threshold used in RZPS and PS is set to 0.
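To make the experimental design explicit, the grid of selected techniques can be written down as a small configuration sketch (the dict and helper below are bookkeeping assumptions, mirroring the naming convention of Section 4.1):

from itertools import product

EXPERIMENT_GRID = {
    "relevant_variable_selection": ["FBED", "HITON-PC", "MI", "IEPC", "DC"],
    "dependency_model": ["CART", "mCART", "Linear", "Ridge", "Lasso"],
    "score_generation": ["RZPS", "PS", "Sum", "Max", "GS"],
}

methods = ["-".join(combo) for combo in product(*EXPERIMENT_GRID.values())]
assert len(methods) == 125   # 5 x 5 x 5 instantiated DepAD methods
# Fixing the techniques of two steps while varying the third yields the
# 5 x 5 = 25 results per technique analyzed below.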

5.2.2 Performance of Relevant Variable Selection Techniques

The results for the relevant variable selection techniques are shown in Figure 4. In the figure, the results in terms of ROC AUC and AP are shown in two sub-figures. When presenting the results of a technique, a violin plot and a dot plot are overlaid.

Figure 4: Performance of relevant variable selection techniques, with the p-values of the Wilcoxon rank-sum tests shown on top of each pair of techniques. Panels: (a) ROC AUC; (b) AP.

For a violin plot, the shape of the violin represents the kernel density estimated from the results of a technique. The red dot is the mean value, and the length of the red line on either side of the mean shows the standard deviation of the 25 results. For a dot plot, each gray dot corresponds to the result of a DepAD method using the technique. The Wilcoxon rank-sum tests are applied pairwise to the techniques, with the alternative hypothesis that the technique on the left is better than the other one. The p-values are presented on top of each pair of techniques. In the following discussion, if a p-value is less than 0.05, we consider the test result significant.

From Figure 4a, HITON-PC has the best average results, followed by FBED, IEPC, MI and DC. From the p-values shown in the figure, each technique is significantly better than the techniques on its right, except HITON-PC compared with FBED and MI compared with DC. DC shows a much larger variance than the other techniques, with a standard deviation of 0.044, while the standard deviations of the other techniques range from 0.011 to 0.017.

In terms of AP, as shown in Figure 4b, the five techniques show a similar relative performance as indicated by ROC AUC. Except for HITON-PC in comparison with FBED, each technique is significantly better than the techniques on its right. The variances of the results in terms of AP are generally larger than those in terms of ROC AUC. The standard deviation of the results of MI is 0.027, and the standard deviations of the results of the other four methods range from 0.032 to 0.037.

Table 4 shows the average reduction rates, i.e., the ratio of the number of relevant variables selected to the total number of variables in a dataset, achieved by each of the five techniques. We found that different techniques produce quite different reductions on the same dataset. For example, on the Libras dataset, the reduction rate of IEPC is 97.8%, while the reduction rates of the other techniques are below 6.8%. On average, HITON-PC reduces the most, to 15.4%, while IEPC reduces the least, to 59.5%. We suspect that the low reduction of IEPC is the reason for its unstable performance in terms of ROC AUC.

FBED, DC and MI achieve similar reduction rates of around 23%. FBED and HITON-PC generally show a similar trend, but HITON-PC always achieves a higher reduction than FBED because a PC is a subset of an MB.

Table 4: Average Reduction Rate of the Five FS Techniques on the 32 Datasets

Name              #Variables  FBED   HITON-PC  IEPC   DC     MI
Wilt              6           46.7%  43.7%     66.7%  40%    46.7%
Pima              9           30.3%  24.8%     76.7%  11.1%  54.8%
WBC               10          28.2%  28.2%     63.8%  64%    62.8%
Stamps            10          29.4%  25%       71%    71%    27%
Glass             10          46.2%  31.4%     62%    17%    45.5%
Gamma             11          52.3%  37.7%     81.8%  57.3%  39.3%
PageBlocks        11          61.8%  36.4%     73.6%  10.9%  28.2%
Wine              12          57.5%  38.3%     83.3%  9.2%   35.8%
HeartDisease      14          16.1%  15.8%     35.2%  7.2%   66.1%
Leaf              16          27.9%  17.8%     75.6%  75.6%  14.4%
Letter            17          44.6%  22.6%     82.9%  10.3%  7.4%
PenDigits         17          74.2%  30.4%     72.4%  7.9%   22.4%
Waveform          22          22.7%  22.5%     90.9%  25.9%  64.7%
Cardiotocography  23          33%    17.4%     52.6%  13.8%  35.1%
Parkinson         23          16.5%  10.3%     87%    60.3%  12.4%
BreastCancer      31          23.5%  11.9%     93.5%  50.4%  5.8%
WBPC              31          16.5%  9.4%      93.5%  77.9%  5%
Ionosphere        33          19.1%  9.5%      88.2%  19.5%  4.7%
Biodegradation    42          11.1%  19.6%     16.6%  36%    19.7%
Bank              52          17.9%  11.7%     27.1%  5.5%   31.6%
Spambase          58          10.1%  7.2%      32.3%  4.7%   41.9%
Libras            91          6.8%   4.4%      97.8%  1.1%   2.4%
Aid362            145         17.2%  5.7%      49.3%  17.7%  0.8%
Backdoor          190         6.1%   1.8%      63.8%  61.7%  58%
CalTech16         257         4.8%   2%        21.5%  1.9%   0.4%
Arrhythmia        275         3.2%   1.5%      42.9%  1.8%   1.1%
Census            386         5.9%   2.7%      55.3%  5.8%   4.3%
Secom             591         1.9%   0.7%      63.8%  4.6%   0.9%
MNIST             785         1.9%   0.7%      23.4%  0.6%   0.1%
calTech28         785         1.8%   0.6%      8%     0.2%   0.1%
Fashion           785         2.8%   0.6%      48.5%  4.6%   0.3%
Ads               1559        0.5%   0.3%      10.1%  0.3%   0.5%
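The reduction rates in Table 4 follow directly from the definition above. A minimal sketch, under the assumption that the rate is averaged over the per-variable prediction tasks; select_relevant is a hypothetical stand-in for any of the five techniques:

    # 'select_relevant(data, target)' is assumed to return the names of the
    # relevant variables selected for 'target' by some FS technique.
    reduction_rate <- function(data, select_relevant) {
      p <- ncol(data)
      per_target <- sapply(names(data), function(tgt)
        length(select_relevant(data, tgt)) / p)   # fraction of variables kept
      mean(per_target)                            # averaged over all targets
    }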

In summary, among the five techniques, the two causal feature selection techniques, HITON-PC and FBED, show better performance than the other techniques in terms of both ROC AUC and AP.
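The pairwise significance tests used throughout this section can be reproduced with base R. A minimal sketch with placeholder result vectors (auc_a and auc_b stand for the 25 ROC AUCs of two techniques):

    # One-sided Wilcoxon rank-sum test: is technique A better than technique B?
    auc_a <- runif(25, 0.70, 0.90)   # placeholder results for illustration
    auc_b <- runif(25, 0.60, 0.80)
    test <- wilcox.test(auc_a, auc_b, alternative = "greater")
    test$p.value < 0.05              # TRUE indicates a significant result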


Figure 5: Performance of different dependency models: (a) ROC AUC, (b) AP. The p-values of the Wilcoxon rank-sum tests are shown above each pair of algorithms.

5.2.3 Performances of Different Dependency Models

The results of the different regression algorithms are shown in Figure 5. With respect to ROC AUC, as shown in Figure 5a, the two tree-based algorithms, CART and mCART, produce better results than all the linear regression algorithms. From the p-values in the figure, each regression algorithm is significantly better than the algorithms on its right, except Linear compared with Ridge and with Lasso. It is noted that Ridge and Lasso show unstable results, both with a standard deviation of 0.036, while the standard deviations of the other regression algorithms range from 0.014 to 0.015.

In terms of AP, as shown in Figure 5b, CART achieves the best average results, followed by mCART, Ridge, Linear and Lasso. The variances of the five algorithms are similar, with standard deviations ranging from 0.031 to 0.034. Except for mCART compared with Ridge, Ridge compared with Linear, and Linear compared with Lasso, each regression algorithm is significantly better than the algorithms on its right.

In summary, the two tree-based models, CART and mCART, show better performance with respect to both ROC AUC and AP. In contrast, Lasso yields the worst results in terms of the two metrics.

5.2.4 Performances of Anomaly Score Generation Techniques

In terms of ROC AUC, the results of the anomaly score generation techniques are shown in Figure 6a. On average, RZPS achieves the best performance, followed by PS, Sum, GS and Max. From the p-values in the figure, each technique is significantly better than the techniques on its right. The results of all five techniques have similar standard deviations, between 0.023 and 0.031. When we examine the violin plots of the five techniques, their shapes are more consistent than the shapes in the two steps of Phase 1, which may imply that the choices of relevant variable selection technique and dependency model contribute more to the final performance of a DepAD method than the choice of anomaly score generation technique. It is noted that the results of each of the five techniques contain two particularly poor values. We discuss this in Section 5.3.


Figure 6: Performance of anomaly score generation techniques: (a) ROC AUC, (b) AP. The p-values of the Wilcoxon rank-sum tests are shown above each pair of techniques.

With respect to AP, as shown in Figure 6b, PS and Sum achieve the best results, followed by RZPS, GS and Max. Each technique is significantly better than the techniques on its right except for PS compared with Sum. The results of Max are significantly worse than those of the other techniques. The standard deviations of the results of the five techniques are similar, around 0.02, except for Max, which has a lower standard deviation of 0.009.

In summary, RZPS has the best performance in terms of ROC AUC, while PS and Sum produce the best results in terms of AP. GS and Max show the worst performance as indicated by both ROC AUC and AP.
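To illustrate the differences among these techniques, the following sketch shows simple aggregations of an object's per-variable normalized deviations. It assumes that Sum and Max are the plain sum and maximum, and that PS is a pruned sum that discards deviations at or below the threshold (set to 0 in Section 5.2.1); the exact RZPS and GS formulations follow their definitions given earlier in the paper and are omitted here.

    # 'dev' is a hypothetical vector of one object's normalized deviations,
    # i.e., |observed - expected| per variable after normalization.
    score_sum <- function(dev) sum(dev)          # Sum
    score_max <- function(dev) max(dev)          # Max
    score_ps  <- function(dev, threshold = 0)    # assumed pruned sum (PS)
      sum(dev[dev > threshold])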

5.3 Evaluation of Different Combinations of Techniques

In this section, we investigate the results of the 125 DepAD methods to answer the second question, i.e., how the different combinations of techniques in different steps affect the performance of DepAD methods. In the analysis, we point out the combinations that can cause poor results and suggest the combinations of techniques that can achieve better performance.

The results of the 125 DepAD methods are shown in Figure 7, which consists of two sub-figures, corresponding to ROC AUC and AP. In each sub-figure, the results are presented in different colors and shapes, where different colors correspond to different relevant variable selection techniques, and different shapes represent different dependency models. The results are grouped according to the techniques used in the anomaly score generation phase.

Figure 7: Results of the 125 DepAD methods, presented with the techniques used in each step/phase: (a) ROC AUC, (b) AP.

In terms of ROC AUC, as shown in Figure 7a, the methods with the two combinations of DC with Lasso and of DC with Ridge have the worst performance. Methods using MI as the relevant variable selection technique show inferior results in general, no matter what techniques are used in the other two steps. When using IEPC to select relevant variables, combining it with the Linear model yields much poorer results than combining IEPC with the other dependency models. Ten methods achieve the best result of 0.85, mainly from the combinations of: (1) the relevant variable selection techniques FBED, HITON-PC and IEPC; (2) the dependency models CART and mCART; and (3) the anomaly score generation techniques RZPS, PS and Sum. Methods using any combination of these techniques show good performance in general. Furthermore, from the combinations of techniques used by the ten methods, we have the following findings. First, when HITON-PC is the relevant variable selection technique, the best results can only be achieved by combining it with RZPS, not with PS or Sum. Second, when the anomaly score generation technique is PS or Sum, the best results can only be achieved by using CART as the dependency model. Lastly, if mCART is used as the dependency model, the best results only appear when it is combined with RZPS. In summary, methods using the combinations of FBED and CART with RZPS, PS or Sum always achieve the best results in the experiments. Therefore, we recommend FBED-CART-RZPS, FBED-CART-PS and FBED-CART-Sum in terms of ROC AUC.

With respect to AP, as shown in Figure 7b, the methods using Max yield the worst results. The results with MI as the relevant variable selection technique are generally inferior, no matter what techniques are used in the other two steps. To achieve better performance, we suggest choosing the techniques for the three steps/phase from: (1) the relevant variable selection techniques IEPC, FBED and HITON-PC; (2) the dependency models CART and Ridge; and (3) the anomaly score generation techniques PS, Sum and RZPS. Among the combinations of these techniques, we have the following findings. First, the combination of IEPC with Ridge yields much lower performance than the other combinations. Second, when Ridge is used as the dependency model, the best results can only be achieved by combining it with PS or Sum. Third, if HITON-PC is used to select relevant variables, the best results only appear when it is combined with Ridge. Lastly, when FBED is the relevant variable selection technique, the best results can only be achieved by combining it with PS or Sum. Therefore, with respect to AP, we recommend IEPC-CART-PS, FBED-CART-PS, FBED-CART-Sum and IEPC-CART-Sum.

In summary, FBED-CART-PS and FBED-CART-Sum are good choices in terms of both ROC AUC and AP.

5.4 Comparison with Benchmark Methods

To show the effectiveness of the DepAD framework, we choose two DepAD methods, FBED-CART-PS and FBED-CART-Sum, to compare with nine state-of-the-art anomaly detection methods. As shown in Table 5, the nine methods consist of seven proximity-based methods and two dependency-based methods, and their parameter settings for the evaluation are presented in the table.

In the experiments, if a method is unable to produce a result on a dataset within four hours, we stop it. This occurred in the following cases: 1) FastABOD and SOD on the datasets Backdoor and Census; 2) ALSO on the datasets Backdoor, CalTech16, Census, Secom, MNIST, CalTech28, Fashion and Ads; 3) COMBN on the datasets Backdoor, CalTech16, Arrhythmia, Census, Secom, MNIST, CalTech28, Fashion and Ads.

Table 5: Summary of Benchmark Methods and Parameter Settings

category          benchmark method                                parameter setting (a)
proximity-based   LOF (Breunig et al., 2000)                      the number of nearest neighbors is set to 10
proximity-based   wkNN (Angiulli and Pizzuti, 2005)               the number of nearest neighbors is set to 10
proximity-based   FastABOD (Kriegel et al., 2008)                 the number of nearest neighbors is set to 10
proximity-based   iForest (Liu et al., 2008)                      the number of trees is set to 100, without sub-sampling
proximity-based   MBOM (Yu and Chen, 2018)                        the number of nearest neighbors is set to 10
proximity-based   SOD (Kriegel et al., 2009b)                     the number of shared nearest neighbors is set to 10
proximity-based   One-class SVM (OCSVM) (Schölkopf et al., 2001)  (1) the kernel function is RBF; (2) the kernel coefficient γ is chosen from {0.01, 0.03, 0.05, 0.07, 0.09} (b)
dependency-based  ALSO (Paulheim and Meusel, 2015)                M5' regression is used as the prediction model
dependency-based  COMBN (Babbar and Chawla, 2012)                 HITON-PC is used to learn a BN

(a) We adopt the commonly used or recommended parameters from the original papers.
(b) As OCSVM is sensitive to the kernel coefficient γ, experiments are conducted with γ set to 0.01, 0.03, 0.05, 0.07 and 0.09, and the best results are reported.
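As an example of the settings in Table 5, footnote b can be realized as follows. This is a hedged sketch using the e1071 interface, with X as a placeholder data matrix, not our exact benchmark script:

    library(e1071)
    gammas <- c(0.01, 0.03, 0.05, 0.07, 0.09)
    scores_per_gamma <- lapply(gammas, function(g) {
      model <- svm(X, type = "one-classification",   # X: placeholder matrix
                   kernel = "radial", gamma = g)
      # negate decision values so that larger scores mean more anomalous
      -attr(predict(model, X, decision.values = TRUE), "decision.values")
    })
    # The best of the five resulting ROC AUCs / APs is the reported result.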

For LOF, iForest, FastABOD, OCSVM and SOD, we use the implementations in the dbscan (Hahsler et al., 2019), IsolationForest (Liu, 2009), abodOutlier (Jimenez, 2015), e1071 (Meyer et al., 2019) and HighDimOut (Fan, 2015) R packages, respectively. MBOM, ALSO and COMBN are implemented by ourselves based on the bnlearn (Scutari, 2010) R package. All the methods are implemented in R, and the experiments are conducted on a high-performance computing cluster with 32 CPU cores and 128 GB of memory.

The comparison results are shown in Figures 8 (ROC AUC) and 9 (AP), where each sub-figure corresponds to the comparison between the two DepAD methods and one benchmark method. The results are shown in different shapes to indicate the two DepAD methods. In each sub-figure, the X and Y coordinates of a result (a circle or a plus sign) are, respectively, the ROC AUC or AP of the benchmark method and of a DepAD method on the same dataset. Therefore, in a sub-figure, the more results sitting above the diagonal line (in the top-left half of the plane), the better the DepAD method performs relative to the benchmark method.

As shown in Figure 8, in terms of ROC AUC, the two DepAD methods outperform all the benchmark methods except wkNN, with which the results are similar. In terms of AP, as shown in Figure 9, the DepAD methods have performance similar to wkNN, slightly better performance than iForest, ALSO and COMBN, and better performance than the other benchmark methods.

Table 6: p-values of the Wilcoxon Rank-Sum Tests of the DepAD Methods against the Benchmark Methods

Metrics   Method         LOF     wkNN   FastABOD  iForest  MBOM  SOD     OCSVM   ALSO    COMBN
ROC AUC   FBED-CART-PS   0*      0.892  0*        0.062    0*    0.002*  0*      0*      0.007*
ROC AUC   FBED-CART-Sum  0*      0.848  0*        0.072    0*    0.001*  0*      0*      0.002*
AP        FBED-CART-PS   0.004*  0.57   0*        0.077    0*    0*      0.001*  0.025*  0.112
AP        FBED-CART-Sum  0.003*  0.54   0*        0.067    0*    0.001*  0*      0.02*   0.086

* The p-value is less than 0.05.

Figure 8: Comparison of the two DepAD methods, FBED-CART-PS and FBED-CART-Sum, with the benchmark methods in terms of ROC AUC. Circles or plus signs on the diagonal line indicate that the two methods have the same result; circles or plus signs above the line indicate that the DepAD method yields better results than the benchmark method.

Figure 9: Comparison of the two DepAD methods, FBED-CART-PS and FBED-CART-Sum, with the benchmark methods in terms of AP. Circles or plus signs on the diagonal line indicate that the two methods have the same result; circles or plus signs above the line indicate that the DepAD method yields better results than the benchmark method.

Wilcoxon rank-sum tests are conducted on the results of each of the two DepAD methods paired with each of the nine benchmark methods. The alternative hypothesis is that the DepAD method is better than the benchmark method, and the p-values are shown in Table 6, where * indicates a p-value less than 0.05.

From Table 6, in terms of ROC AUC, the two DepAD methods are significantly better than all benchmark methods except for wkNN and iForest. With respect to AP, the two DepAD methods yield significantly better results than all benchmark methods except for wkNN, iForest and COMBN. In summary, the two DepAD methods outperform most of the benchmark methods, including both proximity-based methods and existing dependency-based methods.

5.5 Efficiency of DepAD Methods

The computational costs of DepAD methods mainly come from the relevant variable selection and dependency model training steps, and different techniques have very different time complexities. We summarize in Table 7 the average running times (in seconds) of the five relevant variable selection techniques and the five dependency models. It is noted that for FBED and HITON-PC, we use the implementations in CausalFS (Yu et al., 2020), which are written in C, while the other relevant variable selection techniques are implemented in R. Therefore, the running times of the relevant variable selection step are not directly comparable, but we still present them for reference.

As shown in Table 7, FBED and HITON-PC show much lower running times on all datasets. IEPC has the highest running time, especially on large datasets such as Ads, Census and Backdoor. The efficiency of DC and MI is decent, with DC taking slightly longer than MI. For the five prediction models, the computational costs are moderate overall, with mCART and Linear consuming slightly more time on average.

The overall running times of the two DepAD methods and the nine benchmark methods are presented in Table 8. In general, the two DepAD methods show good efficiency. Among the nine benchmark methods, FastABOD, ALSO, SOD and COMBN could not finish the experiments within four hours on some datasets. Additionally, ALSO, MBOM and SOD have significantly higher computational costs than the other methods.

To summarize this section, we have analyzed the impact of different techniques on the performance of DepAD methods. As discussed in Section 3, the two causal feature selection techniques, FBED and HITON-PC, generally lead to better performance, and the two tree-based dependency models, CART and mCART, support better results than the linear models. For anomaly score generation, using RZPS, PS or Sum leads to significantly better results than using GS or Max. We have also investigated the impact of different combinations of techniques across the steps/phase on the performance of DepAD methods, and we conclude that FBED-CART-PS and FBED-CART-Sum are good choices in terms of both ROC AUC and AP. Compared with nine state-of-the-art anomaly detection methods, including proximity-based and dependency-based methods, the DepAD methods outperform them in most cases with significantly lower computational costs.

Table 7: Average Running Time (Seconds) of Different Relevant Variable Selection Techniques and Different Dependency Models

                  Relevant Variable Selection                Dependency Model Training
Dataset           FBED(a)  HITON-PC(a)  IEPC   DC    MI      mCART  CART  Linear  Lasso  Ridge
Wilt              0*       0*           14     25    0.08    1.8    1.5   0.15    0.21   0.14
Pima              0*       0*           2.3    0.19  0.06    1.2    0.76  0.14    0.14   0.14
WBC               0*       0*           1.8    0.15  0.08    1      0.76  0.16    0.15   0.15
Stamps            0*       0*           3.4    0.4   0.07    1.1    0.77  0.15    0.14   0.14
Glass             0*       0*           1.6    0.15  0.07    1.1    0.7   0.14    0.14   0.14
Gamma             0*       0*           162    472   0.51    9.4    10    0.19    0.17   0.17
PageBlocks        0*       0*           33     4.4   0.24    4.1    3     0.16    0.16   0.16
Wine              0*       0*           37     2.1   0.29    5.2    4.1   0.16    0.16   0.16
HeartDisease      0*       0*           2.9    0.25  0.17    1.4    1     0.16    0.15   0.15
Leaf              0*       0*           10     1.1   0.21    1.7    1.3   0.16    0.15   0.15
Letter            0*       0*           11     0.75  0.26    2.6    1.6   0.16    0.15   0.15
PenDigits         0*       0*           133    8.2   1.1     2.4    1.5   0.25    0.19   0.2
Waveform          0*       0.02         62     6.6   0.88    5.3    6.1   0.19    0.18   0.18
Cardiotocography  0*       0*           45     2.8   0.65    5      3.6   0.18    0.17   0.17
Parkinson         0*       0*           11     1     0.44    2.3    1.7   0.19    0.17   0.18
BreastCancer      0*       0*           47     5.5   0.84    5      3.2   0.22    0.21   0.19
WBPC              0*       0*           23     2.2   0.79    3.3    2.7   0.21    0.19   0.18
Ionosphere        0*       0*           34     3.5   0.91    3.8    2.7   0.2     0.18   0.17
Biodegradation    0*       0*           45     3.2   1.4     4.8    3.2   0.21    0.19   0.18
Bank              0*       0.16         582    30    6       3.3    2     0.38    0.26   0.29
Spambase          0*       0.06         555    31    6.3     2.8    2.1   0.38    0.29   0.28
Libras            0*       0*           316    30    7.4     2      1.3   0.49    0.27   0.31
Aid362            0.58     4.3          5647   417   48      9.7    5.8   2.3     2      1.4
Backdoor          1.9      1.9          4610   758   38      72     87    113     89     78
CalTech16         0.51     0.77         87     8.2   2.7     5.4    2.4   1.3     1      0.82
Arrhythmia        0.28     0.28         56     5.1   2.1     4.7    2.5   1.2     2.7    0.77
Census            6.9      502          10638  709   117     332    150   272     94     187
Secom             1.8      4.1          1007   289   15      20     22    20      37     13
MNIST             6.3      4.7          530    50    16      17     8.1   10      25     5.5
calTech28         12       10           664    60    21      26     6.6   5.4     5.3    2.9
Fashion           29       11           3759   272   82      32     47    33      12     14
Ads               27       296          20090  1064  369     77     52    189     8.4    19
Average           2.7      26           1538   133   23      21     14    20      8.8    10

(a) These two techniques are implemented using CausalFS (Yu et al., 2020), while the others are implemented in R.
* Running times less than 0.01 are displayed as 0 in the table.

Table 8: Running Time (in Seconds) of the DepAD Methods and the Benchmark Methods

                  DepAD         Benchmark Methods
Dataset           1*    2*      LOF   wkNN  FastABOD  iForest  MBOM  SOD   OCSVM  ALSO  COMBN
Wilt              1.4   1.4     0.38  0.16  6.3       0.49     2.1   132   0.03   18    0.02
Pima              0.63  0.64    0.04  0.02  0.31      0.37     0.42  4.7   0      4.6   0.01
WBC               0.58  0.58    0.04  0.01  0.27      0.36     0.55  4.1   0      4.6   0.02
Stamps            0.62  0.6     0.03  0.01  0.19      0.36     0.36  2.7   0      4.4   0.01
Glass             0.61  0.62    0.02  0.01  0.12      0.36     0.25  1.7   0      3.2   0.01
Gamma             9.7   9.7     1.3   0.68  36        1.4      12    852   0.3    178   0.3
PageBlocks        3.4   3.3     0.41  0.18  7.2       0.58     4.5   157   0.06   53    0.17
Wine              4.7   4.7     0.54  0.31  6.3       0.59     5.9   156   0.05   66    0.22
HeartDisease      0.73  0.71    0.02  0.01  0.17      0.52     0.36  2.4   0      6     0.02
Leaf              0.95  0.95    0.03  0.01  0.22      0.37     0.51  3.6   0      9.8   0.03
Letter            1.7   1.7     0.07  0.04  0.53      0.38     1.2   10    0.01   19    0.15
PenDigits         1.5   1.5     1     0.5   25        1.4      17    578   0.25   263   2.1
Waveform          3.6   3.3     0.22  0.14  1.9       0.47     3     34    0.03   94    0.24
Cardiotocography  3.5   3.5     0.18  0.09  1.6       0.63     3.5   32    0.02   53    0.43
Parkinson         1.3   1.2     0.01  0.01  0.1       0.36     0.44  1.8   0      11    0.05
BreastCancer      2.4   2.4     0.03  0.02  0.28      0.37     1.3   5.5   0.01   33    0.31
WBPC              1.8   1.6     0.01  0.01  0.12      0.54     0.6   2.2   0      15    0.14
Ionosphere        2.2   2.1     0.02  0.01  0.18      0.37     0.97  3.5   0      27    0.33
Biodegradation    2.6   2.6     0.04  0.02  0.28      0.38     1.9   6.6   0.01   49    0.57
Bank              1.2   1.3     0.75  0.51  7.1       1.1      19    161   0.19   546   7.7
Spambase          1.2   1.2     0.39  0.25  4.7       0.75     14    100   0.17   615   3
Libras            1.2   1.2     0.03  0.02  0.37      0.42     5.7   11    0.02   212   1.7
Aid362            5.3   5.5     1.8   1.6   9.6       2.1      138   292   1.6    4154  561
Backdoor          42    43      13    11    -(a)      177      1077  -(a)  47     -(a)  -(a)
CalTech16         2.7   2.8     0.16  0.14  1.5       0.8      36    68    0.28   -(a)  -(a)
Arrhythmia        2.2   2.3     0.05  0.04  0.63      0.53     16    27    0.08   1299  6.9
Census            106   104     125   120   -(a)      360      2064  -(a)  1304   -(a)  -(a)
Secom             8.1   8.2     0.83  0.75  6.1       2.3      91    226   3.8    -(a)  -(a)
MNIST             12    12      0.37  0.37  3.9       1.7      188   175   1.6    -(a)  -(a)
calTech28         17    18      0.31  0.27  3.5       1.6      1570  181   1.9    -(a)  -(a)
Fashion           38    38      0.44  0.43  5.3       2.2      732   234   1.8    -(a)  -(a)
Ads               41    37      3.3   3.6   32        32       852   792   14     -(a)  -(a)
Average           10.1  9.9     4.7   4.4   5.4       18       214   142   43     322   24

* 1: FBED-CART-PS; 2: FBED-CART-Sum.
(a) The running time is larger than 4 hours.

6. Interpretability of the DepAD Framework

One advantage of the DepAD framework is its ability to provide a meaningful explanation of why an object is identified as an anomaly, which is critical for understanding the reported anomalies and the data itself. In this section, we first describe how to utilize the outputs of a DepAD method to interpret an identified anomaly, and then illustrate this with an example.

When interpreting an anomaly, we first identify the variables with large deviations by examining the difference between each variable's observed value and its expected value. The larger the deviation, the more the variable contributes to the anomaly. Additionally, by comparing the normal dependency pattern between a variable and its relevant variables with the observed dependency pattern, we can gain an understanding of how the anomaly differs from the normal behaviors. Here, the normal dependency is represented by the expected value of a variable given the values of its relevant variables, while the observed value of a variable together with the values of its relevant variables constitutes the observed pattern.

In the following, we use the Zoo dataset (which was used in Example 2 in Section 1) to illustrate how to interpret the anomalies found. The dataset contains information about 101 animals, each described by 16 features. The DepAD method FBED-CART-PS is applied to the dataset, and the top-10 identified anomalies are presented in Figure 2d. In the illustration, we show how to interpret the top-3 anomalies.
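A minimal sketch of this procedure, assuming the per-variable deviations are normalized by their means and standard deviations over all objects; x_obs, x_hat, dev_mean and dev_sd are hypothetical vectors:

    # Rank an object's variables by normalized deviation from expected values.
    interpret_anomaly <- function(x_obs, x_hat, dev_mean, dev_sd, top = 3) {
      delta <- (abs(x_obs - x_hat) - dev_mean) / dev_sd  # normalized deviation
      head(sort(delta, decreasing = TRUE), top)          # top contributors
    }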

Table 9: Interpretation of Top-3 Anomalies Detected by FBED-CART-PS in the Zoo Dataset

For each anomaly, the table lists the top-3 variables with the largest deviations and the corresponding dependency violations.

anomaly    variable  δj(a)  xj(b)  x̂j(c)  normal dependency pattern (%)(d)  observed dependency pattern (%)(d)
scorpion   backbone  9.2    0      1       tail=1 → backbone=1 (73%)         tail=1 → backbone=0 (0.9%)
           eggs      5.3    0      0.9     milk=0 → eggs=1 (57%)             milk=0 → eggs=0 (2%)
           milk      5.2    0      0.7     eggs=0 → milk=1 (57%)             eggs=0 → milk=0 (2%)
platypus   toothed   6.9    0      0.9     milk=1 → toothed=1 (40%)          milk=1 → toothed=0 (1%)
           eggs      6.1    1      0       milk=1 → eggs=0 (40%)             milk=1 → eggs=1 (1%)
           milk      5.5    1      0.2     eggs=1 → milk=0 (57%)             eggs=1 → milk=1 (1%)
sea snake  eggs      5.3    0      0.9     milk=0 → eggs=1 (57%)             milk=0 → eggs=0 (2%)
           milk      5.1    0      0.7     eggs=0 → milk=1 (57%)             eggs=0 → milk=0 (2%)
           fins      4.2    0      0.8     legs!=0 → fins=0 (76%)            legs=0 → fins=0 (7%)

(a) Normalized deviation. (b) Observed value. (c) Expected value. (d) The percentage after each pattern indicates the proportion of animals in the Zoo dataset having the pattern.

As shown in Table 9, the top-3 anomalies in the Zoo dataset are the scorpion, the platypus and the sea snake. For each anomaly, we find the top-3 variables that contribute most to its anomalousness, i.e., the three variables with the highest deviation values. Then, for each of the three variables of an anomaly, we pick its most relevant variable to examine the details of the dependency violation.

From Table 9, the most anomalous animal in the dataset is the scorpion, and the three variables backbone, eggs and milk contribute most to its anomalousness. For the variable backbone, 73% of the animals in the dataset follow the normal dependency that if an animal has a tail, it has a backbone. The scorpion has a tail but no backbone, and it is the only animal (accounting for 0.9%) in the dataset with this dependency pattern violation (i.e., its observed pattern is not consistent with the expected pattern). In relation to the other two most contributing variables, eggs and milk, the normal dependency, held by 57% of the animals in the dataset, says that if an animal does not produce milk, it lays eggs. The scorpion neither lays eggs nor produces milk, and only 2% of the animals have this pattern.

The platypus mainly violates the following normal dependencies: 1) if an animal produces milk, it likely has teeth; 2) if an animal produces milk, it is unlikely to lay eggs; and 3) if an animal lays eggs, it is unlikely to produce milk. Each of these three normal patterns is held by over 40% of the animals in the dataset, and the platypus is the only animal that violates them.

The first normal dependency that the sea snake violates is the relationship between eggs and milk; it is one of the two animals in the dataset that neither lay eggs nor produce milk. Another normal dependency, held by 76% of the animals in the dataset, is that if an animal has legs, it is unlikely to have fins, but the sea snake has neither legs nor fins.

7. Conclusion

In this paper, we propose a general framework, DepAD, for dependency-based anomaly detection. DepAD provides a unified process for utilizing off-the-shelf feature selection techniques and supervised prediction models to assemble effective, scalable and flexible anomaly detection algorithms suitable for various data types and applications. We have investigated the suitability of various techniques for the DepAD framework and have conducted comprehensive experiments to empirically evaluate the performance of these techniques, both individually within a step of DepAD and in combination with the techniques of other steps. To show the effectiveness of DepAD, we have compared two high-performing instantiations of DepAD with nine state-of-the-art anomaly detection methods, and the results have shown that the DepAD methods outperform them in most cases. The high interpretability of the DepAD framework has also been demonstrated with a case study.

In summary, the DepAD framework has demonstrated the capability of utilizing dependency for effective, efficient, interpretable and flexible anomaly detection. We hope the work in this manuscript will inspire and facilitate more research in dependency-based anomaly detection.

8. Acknowledgments

We acknowledge the Australian Government Training Program Scholarship, and the Data to Decisions CRC (D2DCRC), Cooperative Research Centres Programme, for funding this research. The work has also been partially supported by ARC Discovery Project DP170101306.

References

Charu C. Aggarwal. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2016. ISBN 3319475770.

Charu C Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM Sigkdd Explorations Newsletter, 17(1):24–47, 2015.

Charu C Aggarwal and Saket Sathe. Outlier ensembles: An introduction. Springer, 2017.

Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. Hiton: a novel markov blanket algorithm for optimal variable selection. In AMIA Annual Symposium Proceedings, volume 2003, page 21. American Medical Informatics Association, 2003.

Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. Local causal and markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11(Jan):171–234, 2010.

Fabrizio Angiulli and Clara Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2):203–215, 2005.

Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse gaussian bayesian networks. Journal of Machine Learning Research, 16(1):2273–2328, 2015.

Francisco Aragón-Royón, Alfonso Jiménez-Vílchez, Antonio Arauzo-Azofra, and José Manuel Benitez. FSinR: an exhaustive package for feature selection. arXiv e-prints, art. arXiv:2002.10330, feb 2020. URL https://arxiv.org/abs/2002.10330.

Antonio Arauzo-Azofra, Jose Manuel Benitez, and Juan Luis Castro. Consistency measures for feature selection. Journal of Intelligent Information Systems, 30(3):273–292, 2008.

Sakshi Babbar and Sanjay Chawla. Mining causal outliers using gaussian bayesian networks. In 2012 Proc. of ICTAI, volume 1, pages 97–104. IEEE, 2012.

V Barnett and T Lewis. Outliers in Statistical Data. New York, 1994.

Giorgos Borboudakis and Ioannis Tsamardinos. Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20(1):276–314, 2019.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and Regression Trees. CRC Press, 1984.

MAJ John R Brence and Donald E Brown. Improving the robust random forest regression algorithm. Systems and Information Engineering Technical Papers, Department of Systems and Information Engineering, University of Virginia, 2006.

Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM, 2000.

Anna L Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2015.

Guilherme O Campos, Arthur Zimek, Jörg Sander, Ricardo JGB Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4):891–927, 2016.

34 Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.

David M Chickering, Dan Geiger, David Heckerman, et al. Learning bayesian networks is np-hard. Technical report, Citeseer, 1994.

Manoranjan Dash and Huan Liu. Consistency-based search in feature selection. Artificial intelligence, 151(1-2):155–176, 2003.

Yadolah Dodge. Coefficient of determination. The Concise Encyclopedia of Statistics, pages 88–91, 2008.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Andrew Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. A meta-analysis of the anomaly detection problem. arXiv preprint arXiv:1503.01158, 2015.

Cheng Fan. HighDimOut: Outlier Detection Algorithms for High-Dimensional Data, 2015. URL https://CRAN.R-project.org/package=HighDimOut. R package version 1.0.0.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.

Jerome H Friedman. Multivariate adaptive regression splines. The annals of statistics, pages 1–67, 1991.

Giuliano Galimberti, Marilena Pillati, and Gabriele Soffritti. Robust regression trees based on m-estimators. Statistica, 67(2):173–190, 2007.

Jing Gao and Pang-Ning Tan. Converting output scores from outlier detection algorithms into probability estimates. In Sixth International Conference on Data Mining (ICDM’06), pages 212–221. IEEE, 2006.

Isabelle Guyon, Constantin Aliferis, et al. Causal feature selection. In Computational methods of feature selection, pages 79–102. Chapman and Hall/CRC, 2007.

Michael Hahsler, Matthew Piekenbrock, and Derek Doran. dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91(1):1–30, 2019. doi: 10.18637/jss.v091.i01.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.

Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

Yi-an Huang, Wei Fan, Wenke Lee, and Philip S Yu. Cross-feature analysis for detecting ad-hoc routing anomalies. In 23rd International Conference on Distributed Computing Systems, 2003. Proceedings., pages 478–487. IEEE, 2003.

Jose Jimenez. abodOutlier: Angle-Based Outlier Detection, 2015. URL https://CRAN.R-project.org/package=abodOutlier. R package version 0.1.

Kaggle. Kaggle repository. URL https://www.kaggle.com/datasets.

Kenji Kira and Larry A Rendell. A practical approach to feature selection. In Machine Learning Proceedings 1992, pages 249–256. Elsevier, 1992.

Edwin M Knorr and Raymond T Ng. A unified notion of outliers: Properties and computation. In KDD, volume 97, pages 219–222, 1997.

Edwin M Knorr and Raymond T Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, pages 392–403. Citeseer, 1998.

Yufeng Kou, Chang-Tien Lu, Sirirat Sirwongwattana, and Yo-Ping Huang. Survey of fraud detection techniques. In IEEE International Conference on Networking, Sensing and Control, 2004, volume 2, pages 749–754. IEEE, 2004.

Hans-Peter Kriegel, Matthias Schubert, and Arthur Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 444–452, 2008.

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Loop: local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1649–1652, 2009a.

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 831–838. Springer, 2009b.

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of SIAM, pages 13–24, 2011.

Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Outlier detection in arbitrarily oriented subspaces. In 2012 IEEE 12th International Conference on Data Mining, pages 379–388. IEEE, 2012.

Aleksandar Lazarevic and Vipin Kumar. Feature bagging for outlier detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 157–166, 2005.

Alexander Hanbo Li and Andrew Martin. Forest-type regression with general losses and robust forest. In International Conference on Machine Learning, pages 2091–2100, 2017.

Jiuyong Li, Lin Liu, Thuc Duy Le, and Jixue Liu. Accurate data-driven prediction does not mean high reproducibility. Nature Machine Intelligence, 2(1):13–15, 2020.

36 Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):1–45, 2017.

Fei Tony Liu. IsolationForest: Isolation Forest, 2009. URL https://R-Forge.R-project.org/projects/iforest/. R package version 0.0-26/r4.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In ICDM'08, pages 413–422, 2008.

Ling Liu and M Tamer Özsu. Encyclopedia of Database Systems, volume 6. Springer, New York, NY, USA, 2009.

Sha Lu, Lin Liu, Jiuyong Li, Thuc Duy Le, and Jixue Liu. Lopad: A local prediction approach to anomaly detection. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 660–673. Springer, 2020.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems, pages 505–511, 2000.

Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.

David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2019. URL https://CRAN.R-project.org/package=e1071. R package version 1.7-3.

Hoang Vu Nguyen, Hock Hee Ang, and Vivekanand Gopalkrishnan. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In International Conference on Database Systems for Advanced Applications, pages 368–383. Springer, 2010.

Keith Noto, Carly Brodley, and Donna Slonim. Anomaly detection using an ensemble of feature models. In 2010 IEEE International Conference on Data Mining, pages 953–958. IEEE, 2010.

Keith Noto, Carla Brodley, and Donna Slonim. Frac: a feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1):109–133, 2012.

Heiko Paulheim and Robert Meusel. A decomposition of the outlier detection problem into a set of supervised learning problems. Machine Learning, 100(2-3):509–531, 2015.

Judea Pearl. Causality: models, reasoning and inference, volume 29. Springer, 2000.

Jose M Peña, Johan Björkegren, and Jesper Tegnér. Scalable, efficient and correct learning of markov boundaries under the faithfulness assumption. In European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pages 136–147. Springer, 2005.

Andrea Peters and Torsten Hothorn. ipred: Improved Predictors, 2019. URL https://CRAN.R-project.org/package=ipred. R package version 0.9-9.

Wenbin Qian and Wenhao Shu. Mutual information criterion for feature selection from incomplete data. Neurocomputing, 168:210–220, 2015.

Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. In ACM Sigmod Record, volume 29, pages 427–438. ACM, 2000.

Peter J Rousseeuw and Annick M Leroy. Robust Regression and Outlier Detection, volume 589. John Wiley & Sons, 2005.

Marie-Hélène Roy and Denis Larocque. Robustness of random forests for regression. Journal of Nonparametric Statistics, 24(4):993–1006, 2012.

Bernhard Schölkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

Marco Scutari. Learning bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1–22, 2010. doi: 10.18637/jss.v035.i03.

George AF Seber. Multivariate observations, volume 252. John Wiley & Sons, 2009.

Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11 (5), 2010.

Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, and Vipin Kumar. Introduction to Data Mining (2nd Edition). Pearson, 2nd edition, 2018. ISBN 0133128903.

Jian Tang, Zhixiang Chen, Ada Wai-Chee Fu, and David W Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 535–548. Springer, 2002.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Yong Wang and Ian H Witten. Induction of model trees for predicting continuous classes. 1996.

Yizhou Yan, Lei Cao, and Elke A Rundensteiner. Scalable top-n local outlier detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1235–1244, 2017.

Sandeep Yaramakala and Dimitris Margaritis. Speculative markov blanket discovery for optimal feature selection. In ICDM'05, pages 4–pp, 2005.

Kui Yu and Huanhuan Chen. Markov boundary-based outlier mining. IEEE transactions on neural networks and learning systems, 30(4):1259–1264, 2018.

Kui Yu, Lin Liu, and Jiuyong Li. A unified view of causal and non-causal feature selection. arXiv preprint arXiv:1802.05844, 2018.

Kui Yu, Xianjie Guo, Lin Liu, Jiuyong Li, Hao Wang, Zhaolong Ling, and Xindong Wu. Causality-based feature selection: Methods and evaluations. ACM Comput. Surv., 53(5), September 2020. ISSN 0360-0300. doi: 10.1145/3409382. URL https://doi.org/10.1145/3409382.

Ke Zhang, Marcus Hutter, and Huidong Jin. A new local distance-based outlier detection approach for scattered real-world data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 813–822. Springer, 2009.

Arthur Zimek, Erich Schubert, and Hans-Peter Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Journal, 5(5):363–387, 2012.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
