Graph Fairing Convolutional Networks for Anomaly Detection

Mahsa Mesgaran and A. Ben Hamza
Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada

Abstract

Graph convolution is a fundamental building block for many deep neural networks on graph-structured data. In this paper, we introduce a simple, yet very effective graph convolutional network with skip connections for semi-supervised anomaly detection. The proposed multi-layer network architecture is theoretically motivated by the concept of implicit fairing in geometry processing, and comprises a graph convolution module for aggregating information from immediate node neighbors and a skip connection module for combining layer-wise neighborhood representations. In addition to capturing information from distant graph nodes through skip connections between the network's layers, our approach exploits both the graph structure and node features for learning discriminative node representations. The effectiveness of our model is demonstrated through extensive experiments on five benchmark datasets, achieving better or comparable anomaly detection results against strong baseline methods.

Keywords: Anomaly detection; implicit graph filtering; graph neural networks; skip connection; deep learning.

1 Introduction

Anomaly detection is of paramount importance when deploying artificial intelligence systems, which often encounter unexpected or abnormal items/events that deviate significantly from the majority of data. Anomaly detection techniques are widely used in a variety of real-world applications, including, but not limited to, intrusion detection, fraud detection, healthcare, Internet of Things, Industry 4.0, and social networks [1, 2]. Most of these techniques involve a training set in which no anomalous instances are encountered, and the challenge is to identify suspicious items or events, even in the absence of abnormal instances. In practical settings, the output of an anomaly detection model is usually an alert that triggers every time there is an anomaly or a pattern in the data that is atypical.

Detecting anomalies is a challenging task, primarily because the anomalous instances are not known a priori and also because the vast majority of observations in the training set are normal instances. Therefore, the mainstream approach in anomaly detection has been to separate the normal instances from the anomalous ones by using unsupervised learning models. A classic example is one-class support vector machines (OC-SVM) [3], a one-class classification model trained on data that has only one class (i.e. the normal class) by learning a discriminative hyperplane boundary around the normal instances. Another commonly used anomaly detection approach is support vector data description (SVDD) [4], which basically finds the smallest possible hypersphere that contains all instances, allowing some instances to be excluded as anomalies. However, these approaches rely on hand-crafted features, are unable to appropriately handle high-dimensional data, and often suffer from computational scalability issues.

Deep learning has recently emerged as a very powerful way to hierarchically learn abstract patterns from data, and has been successfully applied to anomaly detection, showing promising results in comparison with shallow methods [5]. Ruff et al. [6] extend the shallow one-class classification SVDD approach to the deep learning setting by proposing a deep learning based SVDD framework trained with an anomaly detection objective. Deep SVDD is an unsupervised learning model that learns to extract the common factors of variation of the data distribution by training a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data. Also, Ruff et al. [7] introduce a deep semi-supervised anomaly detection (Deep SAD) approach, which is a generalization of the unsupervised Deep SVDD technique to the semi-supervised setting. Deep SAD differs from Deep SVDD in that its objective function also includes a loss term for labeled normal and anomalous instances. The more diverse the labeled anomalous instances in the training set, the better the anomaly detection performance of Deep SAD.

Owing to the recent developments in deep semi-supervised learning on graph-structured data, there has been a surge of interest in the adoption of graph neural networks for learning latent representations of graphs [8, 9]. Defferrard et al. [8] introduce the Chebyshev network, an efficient spectral-domain graph convolutional neural network that uses recursive Chebyshev polynomial spectral filters to avoid explicit computation of the Laplacian spectrum. These filters are localized in space, and the learned weights can be shared across different locations in a graph. An efficient variant of graph neural networks is graph convolutional networks (GCNs) [9], a widely adopted semi-supervised graph-based deep learning framework that uses an efficient layer-wise propagation rule based on a first-order approximation of spectral graph convolutions. The feature vector of each graph node in GCN is updated by essentially applying a weighted sum of the features of its immediate neighboring nodes. While significant strides have been made in addressing anomaly detection on graph-structured data [10], it still remains a daunting task on graphs due to various challenges, including graph sparsity, data nonlinearity, and complex modality interactions [11]. Ding et al. [11] design a GCN-based autoencoder for anomaly detection on attributed networks by taking into account both topological structure and nodal attributes. The encoder of this unsupervised approach encodes the attribute information using the output GCN embedding, while the decoder reconstructs both the structure and attribute information using non-linearly transformed embeddings of the output GCN layer. The basic idea behind anomaly detection methods based on reconstruction errors is that the normal instances can be reconstructed with small errors, while anomalous instances are often reconstructed with large errors. More recently, Kumagai et al. [12] have proposed two GCN-based models for semi-supervised anomaly detection. The first model uses only labeled normal instances, whereas the second one employs labeled normal and anomalous instances. Both models are trained to minimize the volume of a hypersphere that encloses the GCN-learned node embeddings of normal instances, while embedding the anomalous ones outside the hypersphere.

Inspired by the implicit fairing concept in geometry processing for triangular mesh smoothing [13], we introduce a graph fairing convolutional network architecture, which we call GFCN, for deep semi-supervised anomaly detection. The proposed framework achieves better anomaly detection performance, as GFCN uses a multi-layer architecture, together with skip connections and non-linear activation functions, to extract high-order information of graphs as discriminative features. Moreover, GFCN inherits all benefits of GCNs, including accuracy, efficiency and ease of training.

In addition to capturing information from distant graph nodes through skip connections between layers, the proposed GFCN model is flexible and exploits both the graph structure and node features for learning discriminative node representations in an effort to detect anomalies in a semi-supervised setting. Not only does GFCN outperform strong anomaly detection baselines, but it is also surprisingly simple, yet very effective at identifying anomalies. The main contributions of this work can be summarized as follows:

• We propose a novel multi-layer graph convolutional network with a skip connection for semi-supervised anomaly detection by effectively exploiting both the graph structure and attribute information.

• We introduce a learnable skip-connection module, which helps nodes propagate through the network's layers and hence substantially improves the quality of the learned node representations.

• We analyze the complexity of the proposed model and train it on a regularized, weighted cross-entropy loss function by leveraging unlabeled instances to improve performance.

• We demonstrate through extensive experiments that our model can capture the anomalous behavior of graph nodes, leading to state-of-the-art performance across several benchmark datasets.

The rest of this paper is organized as follows. In Section 2, we review important relevant work. In Section 3, we outline the background for spectral graph theory and present the problem formulation. In Section 4, we introduce a graph convolutional network architecture with skip connection for deep semi-supervised anomaly detection. In Section 5, we present experimental results to demonstrate the competitive performance of our approach on five standard benchmarks. Finally, we conclude in Section 6 and point out future work directions.

2 Related Work

The basic goal of anomaly detection is to identify abnormal instances, which do not conform to the expected pattern of other instances in a dataset. To achieve this goal, various anomaly detection techniques have been proposed, which can distinguish between normal and anomalous instances [1]. Most mainstream approaches are one-class classification models [3, 4] or graph-based anomaly methods [10].

Deep Learning for Anomaly Detection. While shallow methods such as one-class classification models require explicit hand-crafted features, much of the recent work in anomaly detection leverages deep learning [2], which has shown remarkable capabilities in learning discriminative feature representations by extracting high-level features from data using multilayered neural networks. Ruff et al. [6] develop a deep SVDD anomaly detection framework, which is basically an extension of the shallow one-class classification SVDD approach. The basic idea behind SVDD is to find the smallest hypersphere that contains all instances, except for some anomalies. Deep SVDD is an unsupervised learning model that learns to extract the common factors of variation of the data distribution by training a neural network while minimizing the volume of a hypersphere that encloses the network representations of the data. The centroid of the hypersphere is usually set to the mean of the feature representations learned by performing a single initial forward pass. In order to improve model performance, Ruff et al. [7] propose Deep SAD, a generalization of the unsupervised Deep SVDD to the semi-supervised setting. The key difference between these two deep anomaly detection models is that the objective function of Deep SAD also includes a loss term for labeled normal and anomalous instances. The idea behind this loss term is to minimize (resp. maximize) the squared Euclidean distance between the labeled normal (resp. anomalous) instances and the hypersphere centroid. However, both Deep SVDD and Deep SAD suffer from the hypersphere collapse problem due to the learning of a trivial solution. In other words, the network's learned features tend to converge to the centroid of the hypersphere if no constraints are imposed on the architectures of the models. Another line of work uses deep generative models to address the anomaly detection problem [14, 15]. These generative networks are able to localize anomalies, particularly in images, by simultaneously training a generator and a discriminator, enabling the detection of anomalies on unseen data based on unsupervised training of the model on anomaly-free data [16]. However, the use of deep generative models in anomaly detection has been shown to be quite problematic and unintuitive, particularly on image data [17].

Graph Convolutional Networks for Anomaly Detection. GCNs have recently become the de facto model for learning representations on graphs, achieving state-of-the-art performance in various application domains, including anomaly detection [11, 12]. Ding et al. [11] present an unsupervised graph anomaly detection framework using a GCN-based autoencoder. This approach leverages both the topological structure and nodal attributes, with an encoder that maps the attribute information into a low-dimensional feature space and a decoder that reconstructs the structure as well as the attribute information using the learned latent representations. The basic idea behind this GCN-based autoencoder is that the normal instances can be reconstructed with small errors, while anomalous instances are often reconstructed with large errors. However, methods based on reconstruction errors are prone to outliers and often require noise-free data for training. On the other hand, some of the main challenges associated with graph anomaly detection are the lack of labeled graph nodes (i.e. no information is available about which instances are actually anomalous and which ones are normal) and data imbalance, as abnormalities occur rarely and hence a tiny fraction of instances is expected to be anomalous. To circumvent these issues, Kumagai et al. [12] propose two semi-supervised anomaly detection models using GCNs for learning latent representations. The first model uses only labeled normal instances, while the second one employs both labeled normal and anomalous instances. However, both models are trained to minimize the volume of a hypersphere that encloses the GCN-learned node embeddings of normal instances, and hence they also suffer from the hypersphere collapse problem. By contrast, our semi-supervised GFCN model does not suffer from the above mentioned issues. In addition to leveraging the graph structure and node attributes, GFCN learns from both labeled and unlabeled data in order to improve model performance.

3 Preliminaries and Problem Statement

We introduce our notation and present a brief background on spectral graph theory, followed by our problem formulation of semi-supervised anomaly detection on graphs.

Basic Notions. Consider a graph G = (V, E), where V = {1, ..., N} is the set of N nodes and E ⊆ V × V is the set of edges. We denote by A = (A_ij) an N × N adjacency matrix (binary or real-valued) whose (i, j)-th entry A_ij is equal to the weight of the edge between neighboring nodes i and j, and 0 otherwise. We also denote by X = (x_1, ..., x_N)^T an N × F feature matrix of node attributes, where x_i is an F-dimensional row vector for node i. This real-valued feature vector is often referred to as a graph signal, which assigns a value to each node in the graph.

Spectral Graph Theory. The normalized Laplacian matrix is defined as

    L = I − D^{−1/2} A D^{−1/2},    (1)

where D = diag(A1) is the diagonal degree matrix, and 1 is an N-dimensional vector of all ones. Since the normalized Laplacian matrix is symmetric positive semi-definite, it admits an eigendecomposition given by L = U Λ U^T, where U = (u_1, ..., u_N) is an orthonormal matrix whose columns constitute an orthonormal basis of eigenvectors and Λ = diag(λ_1, ..., λ_N) is a diagonal matrix comprised of the corresponding eigenvalues such that 0 = λ_1 ≤ · · · ≤ λ_N ≤ 2 in increasing order [18]. If G is a bipartite graph, then the spectral radius (i.e. largest absolute value of all eigenvalues) of the normalized Laplacian matrix is equal to 2.

Problem Statement. Anomaly detection aims at identifying anomalous instances, which do not conform to the expected pattern of other instances in a dataset. It differs from binary classification in that it not only distinguishes between normal and anomalous observations, but also the distribution of anomalies is not usually known a priori.

Let D_l = {(x_i, y_i)}_{i=1}^{N_l} be a set of labeled data points x_i ∈ R^F and their associated known labels y_i ∈ {0, 1}, with 0 and 1 representing "normal" and "anomalous" observations, respectively, and let D_u = {x_i}_{i=N_l+1}^{N_l+N_u} be a set of unlabeled data points, where N_l + N_u = N. Hence, each node i can be labeled with a 2-dimensional one-hot encoding vector y_i = (y_i, 1 − y_i).

The goal of semi-supervised anomaly detection on graphs is to estimate the anomaly scores of the unlabeled graph nodes. Nodes with high anomaly scores are considered anomalous, while nodes with lower scores are deemed normal.

4 Proposed Method

In this section, we present the main building blocks of the proposed graph fairing convolutional network with skip connection and analyze its complexity. Then, we train our model using a regularized, weighted cross-entropy loss function by leveraging unlabeled instances to improve performance.

4.1 Spectral Graph Filtering

The idea of spectral filtering on graphs was first introduced in [19] in the context of triangular mesh smoothing. The goal of spectral graph filtering is to use polynomial or rational polynomial filters defined as functions of the graph Laplacian (or equivalently its eigenvalues) in an effort to attenuate high-frequency noise corrupting the graph signal. These functions are usually referred to as frequency responses or transfer functions. While polynomial filters have finite impulse responses, their rational counterparts have infinite impulse responses.

Since the normalized Laplacian matrix is diagonalizable, applying a spectral graph filter with transfer function h on the graph signal X ∈ R^{N×F} yields

    H = h(L)X = U h(Λ) U^T X = U diag(h(λ_i)) U^T X,    (2)

where H is the filtered graph signal. However, this filtering process necessitates the computation of the eigenvalues and eigenvectors of the Laplacian matrix, which is prohibitively expensive for large graphs. To circumvent this issue, spectral graph filters are usually approximated using Chebyshev polynomials [8, 20, 21] or rational polynomials [22–24].
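
To make Eqs. (1)-(2) concrete, the following is a minimal NumPy sketch (ours, not part of the paper) that builds the normalized Laplacian of a toy graph and filters a graph signal through its eigendecomposition; the graph, the signal and the chosen transfer function are illustrative assumptions, and the explicit eigendecomposition is precisely the step that becomes prohibitive for large graphs.

```python
# Minimal NumPy sketch of Eqs. (1)-(2): normalized Laplacian of a toy graph and
# spectral filtering of a graph signal X. Graph, signal and filter are assumptions.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)          # toy undirected adjacency matrix
X = np.random.randn(4, 2)                          # toy graph signal (N=4 nodes, F=2)

deg = A.sum(axis=1)                                # D = diag(A1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt        # Eq. (1): L = I - D^{-1/2} A D^{-1/2}

lam, U = np.linalg.eigh(L)                         # L = U diag(lambda) U^T (small graphs only)

# Eq. (2) with the implicit fairing response h_s(lambda) = 1/(1 + s*lambda) of Section 4.2.
s = 2.0
H = U @ np.diag(1.0 / (1.0 + s * lam)) @ U.T @ X   # filtered graph signal
print(H.shape)                                     # (4, 2): one filtered vector per node
```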

4.2 Implicit Graph Fairing

Graph fairing refers to the process of designing and computing smooth graph signals on a graph in order to filter out undesirable high-frequency noise while retaining the graph geometric features as much as possible. The implicit fairing method, which uses implicit integration of a diffusion process for graph filtering, has been shown to allow for both efficiency and stability [13]. The implicit fairing filter is an infinite impulse response filter whose transfer function is given by h_s(λ) = 1/(1 + sλ), where s is a positive scaling parameter. Substituting h with h_s in Eq. (2), we obtain

    H = (I + sL)^{−1} X,    (3)

where I + sL is a symmetric positive definite matrix (all its eigenvalues are positive), and hence admits an inverse. Therefore, performing graph filtering with implicit fairing is equivalent to solving the following sparse linear system:

    (I + sL)H = X,    (4)

which we refer to as the implicit fairing equation.

The implicit fairing filter enjoys several nice properties, including unconditional stability, as h_s(λ) is always in [0, 1], and also preservation of the average value (i.e. DC value or centroid) of the graph signal, as h_s(0) = 1 for all s. As shown in Figure 1, the higher the value of the scaling parameter, the closer the implicit fairing filter becomes to the ideal low-pass filter.

Figure 1: Transfer function of the implicit fairing filter for various values of the scaling parameter.

Relation to Variational Filtering: It is worth pointing out that the implicit fairing equation can also be obtained by minimizing the following objective function

    J(H) = (1/2)‖H − X‖_F² + (s/2) tr(H^T L H),    (5)

where ‖·‖_F and tr(·) denote the Frobenius norm and trace operator, respectively, and s controls the smoothness of the filtered graph signal. Taking the derivative of J with respect to H and setting it to zero, we have

    ∂J/∂H = H − X + sLH = 0,    (6)

which yields the implicit fairing equation.

4.3 Spectral Analysis

The matrix I + sL is symmetric positive definite with minimal eigenvalue equal to 1 and maximal eigenvalue bounded from above by 1 + 2s. Hence, the condition number κ of I + sL satisfies

    κ ≤ 1 + 2s,    (7)

where κ, which is defined as the ratio of the maximum to minimum stretching, is also equal to the maximal eigenvalue of I + sL. Intuitively, the condition number measures how much a change (i.e. small perturbation) in the right-hand side of the implicit fairing equation can affect the solution. In fact, it can be readily shown that the resulting relative change in the solution of the implicit fairing equation is bounded from above by the condition number multiplied by the relative change in the right-hand side.

4.4 Iterative Solution

One of the simplest iterative techniques for solving a matrix equation is Jacobi's method, which uses matrix splitting. We can split the matrix I + sL into the sum of a diagonal matrix and an off-diagonal matrix as follows:

    I + sL = diag(I + sL) + off(I + sL),    (8)

where

    diag(I + sL) = (1 + s)I   and   off(I + sL) = −s D^{−1/2} A D^{−1/2}.

Hence, the implicit fairing equation can then be solved iteratively using the Jacobi method as follows:

    H^{(t+1)} = −(diag(I + sL))^{−1} off(I + sL) H^{(t)} + (diag(I + sL))^{−1} X
              = D^{−1/2} A D^{−1/2} H^{(t)} Θ_s + X Θ̃_s,    (9)

where Θ_s = diag(s/(1+s)) and Θ̃_s = diag(1/(1+s)) are F × F diagonal matrices, each of which has equal diagonal entries, and H^{(t)} is the t-th iterate of H. Since the spectral radius of the normalized adjacency matrix D^{−1/2} A D^{−1/2} is equal to 1, it follows that the spectral radius of the Jacobi iteration matrix

    C = (s/(1+s)) D^{−1/2} A D^{−1/2}

is equal to s/(1+s), which is always smaller than 1. Hence, the convergence of the iterative method given by Eq. (9) holds.
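
As a sanity check on the derivation above, here is a minimal NumPy sketch (ours, not taken from the paper's released code) of the Jacobi iteration in Eq. (9); the toy graph and signal are assumptions, and the iterate is compared against the closed-form solution of Eq. (3).

```python
# Minimal NumPy sketch of the Jacobi iteration in Eq. (9) for (I + sL)H = X.
# The toy graph and signal below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 2))
s = 2.0

d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]    # normalized adjacency D^{-1/2} A D^{-1/2}

H = X.copy()                                         # H^{(0)} = X
for _ in range(200):
    # Eq. (9): H <- S H * (s/(1+s)) + X * (1/(1+s))
    H = S @ H * (s / (1.0 + s)) + X * (1.0 / (1.0 + s))

# Compare with the closed-form solution of Eq. (3): H = (I + sL)^{-1} X.
L = np.eye(4) - S
H_exact = np.linalg.solve(np.eye(4) + s * L, X)
print(np.allclose(H, H_exact, atol=1e-6))            # True: the iteration matrix has
                                                     # spectral radius s/(1+s) < 1
```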

4.5 Graph Fairing Convolutional Network

Inspired by the Jacobi iterative solution of the implicit fairing equation, we propose a multi-layer graph fairing convolutional network (GFCN) with the following layer-wise propagation rule:

    H^{(ℓ+1)} = σ(D^{−1/2} A D^{−1/2} H^{(ℓ)} Θ^{(ℓ)} + X Θ̃^{(ℓ)}),    (10)

where Θ^{(ℓ)} and Θ̃^{(ℓ)} are learnable weight matrices, σ(·) is an element-wise activation function, and H^{(ℓ)} ∈ R^{N×F_ℓ} is the input feature matrix of the ℓ-th layer with F_ℓ feature maps, for ℓ = 0, ..., L − 1. The input of the first layer is the initial feature matrix H^{(0)} = X.

Note that in addition to performing graph convolution, which essentially averages the features of the immediate (i.e. first-order or 1-hop) neighbors of nodes, the layer-wise propagation rule of GFCN also applies a skip connection that reuses the initial node features, as illustrated in Figure 2. In other words, GFCN combines both the aggregated node neighborhood representation and the initial node representation, hence memorizing information across distant nodes. While most of the existing graph convolutions with skip connections [23, 25] are based on heuristics, our graph fairing convolution is theoretically motivated by implicit fairing and derived directly from the Jacobi iterative method.

Figure 2: Schematic layout of the proposed GFCN architecture. Each block comprises a graph convolution and a skip connection, followed by an activation function, where S denotes the normalized adjacency matrix.

In contrast to GCN-based methods, which use a first-order approximation of spectral graph convolutions and a renormalization trick in their layer-wise propagation rules to avoid numerical instability, our GFCN model does not require any renormalization, as the spectral radius of the normalized adjacency matrix is equal to 1, while still maintaining the key property of convolution as a neighborhood aggregation operator. Hence, repeated application of the GFCN layer-wise propagation rule provides a computationally efficient convolutional process, leading to numerical stability and avoidance of exploding/vanishing gradients. It should also be pointed out that, similar to GCN, the proposed GFCN model does not require explicit computation of the Laplacian eigenbasis.

4.6 Model Prediction

The layer-wise propagation rule of GFCN is basically a node embedding transformation that projects both the input H^{(ℓ)} ∈ R^{N×F_ℓ}, via a trainable weight matrix Θ^{(ℓ)} ∈ R^{F_ℓ×F_{ℓ+1}}, and the initial feature matrix X, via the skip-connection weight matrix Θ̃^{(ℓ)} ∈ R^{F×F_{ℓ+1}}, into representations with F_{ℓ+1} feature maps. Then, a point-wise activation function σ(·) such as ReLU(·) = max(0, ·) is applied to obtain the output node embedding matrix. The aggregation scheme of GFCN is depicted in Figure 3. Note that GFCN uses skip connections between the initial feature matrix and each hidden layer. Skip connections not only allow the model to carry over information from the initial node attributes, but also help facilitate training of multi-layer networks.

Figure 3: Illustration of the GFCN aggregation scheme with skip connections.

The embedding H^{(L)} of the last layer of GFCN contains the final output node embeddings, which can be used as input for downstream tasks such as graph classification, clustering, visualization, link prediction, recommendation, node classification, and anomaly detection. Since anomaly detection can be posed as a binary classification problem, we apply the point-wise softmax function (i.e. sigmoid in the binary case) to obtain an N × 2 matrix of predicted labels for graph nodes

    Ŷ = (ŷ_1, ..., ŷ_N)^T = softmax(H^{(L)}),    (11)

where ŷ_i = (ŷ_i, 1 − ŷ_i) is a two-dimensional row vector of predicted probabilities, with ŷ_i the probability that the network associates the i-th node with one of the two classes (i.e. 1 for anomalous and 0 for normal).

4.7 Model Complexity

For simplicity, we assume the feature dimensions are the same for all layers, i.e. F_ℓ = F for all ℓ, with F ≪ N. The time complexity of an L-layer GFCN is O(L‖A‖_0 F + L N F²), where ‖A‖_0 denotes the number of non-zero entries of the sparse adjacency matrix. Note that multiplying the normalized adjacency matrix with an embedding costs O(‖A‖_0 F) in time, while multiplying an embedding with a weight matrix costs O(N F²). Also, multiplying the initial feature matrix by the skip-connection weight matrix costs O(N F²).

For memory complexity, an L-layer GFCN requires O(L N F + L F²) in memory, where O(L N F) is for storing all embeddings and O(L F²) is for storing all layer-wise weight matrices.

Therefore, our proposed GFCN model has the same time and memory complexity as GCN, albeit GFCN takes into account distant graph nodes for improved learned node representations.
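
The propagation rule of Eq. (10) and the prediction step of Eq. (11) can be sketched in a few lines of PyTorch. The module and variable names below are our own illustration, written under the assumption that the normalized adjacency matrix is precomputed as a sparse tensor; this is not the authors' released implementation.

```python
# Minimal PyTorch sketch of the GFCN propagation rule (Eq. (10)) and prediction (Eq. (11)).
# Names and the two-layer configuration are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GFCNLayer(nn.Module):
    """One graph fairing convolution: S H Theta + X Theta_tilde, activation applied outside."""
    def __init__(self, in_dim, skip_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)         # Theta^(l)
        self.theta_skip = nn.Linear(skip_dim, out_dim, bias=False)  # Theta~^(l) (skip connection)

    def forward(self, S, H, X):
        # S: sparse normalized adjacency D^{-1/2} A D^{-1/2}; X: initial feature matrix.
        return torch.sparse.mm(S, self.theta(H)) + self.theta_skip(X)


class GFCN(nn.Module):
    """Two-layer GFCN, as in the experiments; the hidden size is a tunable assumption."""
    def __init__(self, in_dim, hidden_dim=64, num_classes=2):
        super().__init__()
        self.layer1 = GFCNLayer(in_dim, in_dim, hidden_dim)
        self.layer2 = GFCNLayer(hidden_dim, in_dim, num_classes)

    def forward(self, S, X):
        H = F.relu(self.layer1(S, X, X))   # H^(1), with H^(0) = X
        H = self.layer2(S, H, X)           # H^(L)
        return F.softmax(H, dim=1)         # Eq. (11): N x 2 matrix of predicted probabilities
```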

4.8 Model Training

The parameters (i.e. weight matrices for different layers) of the proposed GFCN model for semi-supervised anomaly detection are learned by minimizing the following regularized loss function

    L = (1/N_l) Σ_{i=1}^{N_l} C_α(y_i, ŷ_i) + (β/2) Σ_{ℓ=0}^{L−1} ( ‖Θ^{(ℓ)}‖_F² + ‖Θ̃^{(ℓ)}‖_F² ),    (12)

where C_α(y_i, ŷ_i) is the weighted cross-entropy given by

    C_α(y_i, ŷ_i) = −α y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i),    (13)

which measures the dissimilarity between the one-hot encoding vector y_i of the i-th node and the corresponding vector ŷ_i of predicted probabilities. This dissimilarity decreases as the value of the predicted probability approaches the ground-truth label. The weight parameter α adjusts the importance of the positive labels by assigning more weight to the anomalous class, while the parameter β controls the importance of the regularization term, which is added to prevent overfitting. It is important to mention that we only use the normal class labeled instances to train our model.

5 Experiments

In this section, we conduct extensive experiments to assess the performance of the proposed anomaly detection framework in comparison with state-of-the-art methods on several benchmark datasets. The source code to reproduce the experimental results is made publicly available on GitHub (https://github.com/MahsaMesgaran/GFCN).

5.1 Datasets

We demonstrate and analyze the performance of the proposed model on three citation networks: Cora, Citeseer, and Pubmed [26], and two co-purchase graphs: Amazon Photo and Amazon Computers [27]. The summary descriptions of these benchmark datasets are as follows:

• Cora is a citation network dataset consisting of 2708 nodes representing scientific publications and 5429 edges representing citation links between publications. All publications are classified into 7 classes (research topics). Each node is described by a binary feature vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 1433 unique words.

• Citeseer is a citation network dataset composed of 3312 nodes representing scientific publications and 4723 edges representing citation links between publications. All publications are classified into 6 classes (research topics). Each node is described by a binary feature vector indicating the absence/presence of the corresponding word from the dictionary, which consists of 3703 unique words.

• Pubmed is a citation network dataset containing 19717 scientific publications pertaining to diabetes and 44338 edges representing citation links between publications. All publications are classified into 3 classes. Each node is described by a TF/IDF weighted word vector from the dictionary, which consists of 500 unique words.

• Amazon Computers and Amazon Photo datasets are co-purchase graphs [27], where nodes represent goods and edges indicate that two goods are frequently bought together. The node features are bag-of-words encoded product reviews, while the class labels are given by the product category.

Since there is no ground truth of anomalies in these datasets, we employ the commonly-used protocol [12] in anomaly detection by treating the smallest class for each dataset as the anomaly class and the remaining classes as the normal class. Dataset statistics are summarized in Table 1, where anomaly rate refers to the percentage of abnormalities in each dataset. For all datasets, we only use the normal class labeled instances to train the model.

Table 1: Summary statistics of datasets.

Dataset     Nodes   Edges    Features  Classes  Anomaly Rate (%)
Cora        2708    5278     1433      7        0.06
Citeseer    3327    4732     3703      6        0.07
Pubmed      19717   44338    500       3        0.21
Photo       7487    119043   745       8        0.04
Computers   13381   245778   767       10       0.02

5.2 Baseline Methods

We evaluate the performance of the proposed method against various baselines, including one-class support vector machines (OC-SVMs) [3], imbalanced vertex diminished (ImVerde) [28], one-class deep support vector data description (Deep SVDD) [6], deep anomaly detection on attributed networks (Dominant) [11], one-class deep semi-supervised anomaly detection (Deep SAD) [7], graph convolutional networks (GCNs) [9], and GCN-based anomaly detection (GCN-N and GCN-AN) [12]. For baselines, we mainly consider methods that are closely related to GFCN and/or the ones that are state-of-the-art anomaly detection frameworks. A brief description of these strong baselines can be summarized as follows:

• OC-SVM is an unsupervised one-class anomaly detection technique, which learns a discriminative hyperplane boundary around the normal instances using support vector machines by maximizing the distance from this hyperplane to the origin of the high-dimensional feature space.

• ImVerde is a semi-supervised graph representation learning technique for imbalanced graph data based on a variant of random walks, which adjusts the transition probability each time a graph node is visited by the random particle.

• Deep SVDD is an unsupervised anomaly detection method, inspired by kernel-based one-class classification and minimum volume estimation, which learns a spherical, instead of a hyperplane, boundary in the feature space around the data using support vector data description. It trains a deep neural network while minimizing the volume of a hypersphere that encloses the network embeddings of the data. Normal instances fall inside the hypersphere, while anomalies fall outside.

• Dominant is a deep autoencoder based on GCNs for unsupervised anomaly detection on attributed graphs. It employs an objective function defined as a convex combination of the reconstruction errors of both graph structure and node attributes. These learned reconstruction errors are then used to assess the abnormality of graph nodes.

• Deep SAD is a semi-supervised anomaly detection technique, which generalizes the unsupervised Deep SVDD approach to the semi-supervised setting by incorporating a new term for labeled training data into the objective function. The weights of the Deep SAD network are initialized using an autoencoder pre-training mechanism.

• GCN is a deep graph neural network for semi-supervised learning of graph representations, encoding both local graph structure and attributes of nodes. It is an efficient extension of convolutional neural networks to graph-structured data, and uses a graph convolution that aggregates and transforms the feature vectors from the local neighborhood of a graph node.

• GCN-N and GCN-AN are GCN-based, semi-supervised anomaly detection frameworks, which rely on minimizing the volume of a hypersphere that encloses the node embeddings to detect anomalies. Node embeddings placed inside and outside this hypersphere are deemed normal and anomalous, respectively. GCN-N uses only normal label information, while GCN-AN uses both anomalous and normal label information.

5.3 Evaluation Metric

In order to evaluate the performance of our proposed framework against the baseline methods, we use AUC, the area under the receiver operating characteristic (ROC) curve, as a metric. AUC summarizes the information contained in the ROC curve, which plots the true positive rate versus the false positive rate at various thresholds. Larger AUC values indicate better performance at distinguishing between anomalous and normal instances. An uninformative anomaly detector has an AUC equal to 50% or less. An AUC of 50% corresponds to a random detector (i.e. for every correct prediction, the next prediction will be incorrect), whereas an AUC value smaller than 50% indicates that a detector performs worse than a random detector.

5.4 Implementation Details

For fair comparison, we implement the proposed method and baselines in PyTorch using the PyTorch Geometric library. Following common practices for evaluating the performance of GCN-based models [9, 12], we train our 2-layer GFCN model for 100 epochs using the Adam optimizer with a learning rate of 0.05. We tune the latent representation dimension by hand, and set it to 64. The hyperparameters α and β are chosen via grid search with cross-validation over the sets {2, 3, ..., 10} and {10^−4, 10^−3, ..., 1}, respectively. We tune hyper-parameters using the validation set, and terminate training if the validation loss does not decrease after 10 consecutive epochs. For each dataset, we consider the settings where 2.5%, 5% and 10% of instances are labeled, and we compute the average and standard deviation of test AUCs over ten runs.
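
Combining the loss of Eqs. (12)-(13) in Section 4.8 with the settings above, a minimal training-and-evaluation sketch might look as follows. The data preparation, the train/test masks, the convention for which output column is treated as the anomalous class, and the hyperparameter values are all illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of the GFCN training objective (Eqs. (12)-(13)) and AUC evaluation
# (Section 5.3). Model, normalized adjacency S, features X, labels y and masks are
# assumed to be prepared elsewhere (e.g. with PyTorch Geometric).
import torch
from sklearn.metrics import roc_auc_score


def gfcn_loss(y_hat, y, weights, alpha=4.0, beta=0.01, eps=1e-8):
    # y_hat: predicted probability of the anomalous class for labeled nodes; y in {0, 1}.
    ce = -alpha * y * torch.log(y_hat + eps) - (1 - y) * torch.log(1 - y_hat + eps)  # Eq. (13)
    reg = sum((w ** 2).sum() for w in weights)                                       # Frobenius terms
    return ce.mean() + 0.5 * beta * reg                                              # Eq. (12)


def train_and_evaluate(model, S, X, y, train_mask, test_mask, epochs=100, lr=0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        optimizer.zero_grad()
        prob_anomalous = model(S, X)[:, 0]        # assumption: column 0 is the anomalous class
        loss = gfcn_loss(prob_anomalous[train_mask], y[train_mask].float(),
                         list(model.parameters()))
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        scores = model(S, X)[:, 0]                # anomaly scores for all nodes
    return roc_auc_score(y[test_mask].numpy(), scores[test_mask].numpy())
```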

5.5 Anomaly Detection Performance

Tables 2-4 present the anomaly detection results on the five datasets. The best results are highlighted in bold. For each dataset, we report the AUC averaged over 10 runs as well as the standard deviation, at various ratios of labeled instances. As can be seen, our GFCN model consistently achieves the best performance on all datasets, except in the case of the Cora dataset when 10% of instances are labeled. In that case, GCN-AN yields a marginal improvement of 0.7% over GFCN, despite the fact that GCN-AN is trained on both normal and anomalous instances, whereas our model is trained only on normal instances. In addition, ImVerde, Deep SAD, GCN and GCN-AN all perform reasonably well on all datasets at various levels of label rates, but we find that GFCN outperforms these baselines on almost all datasets, while being considerably simpler. An AUC value of 50% or less indicates that the baseline is an uninformative anomaly detector.

On the Amazon Computers dataset, Table 2 shows that the proposed GFCN approach performs on par with GCN-AN, but outperforms all baselines on the other four datasets. In particular, GFCN yields 16.2% and 15.7% performance gains over Deep SAD on the Cora and Photo datasets, respectively. These gains are consistent with the results shown in Tables 3-4. We argue that the better performance of GFCN over GCN-N and Deep SAD is largely attributed to the fact that our model does not suffer from the hypersphere collapse problem. Interestingly, the performance gains are particularly higher at the lower label rate of 2.5%, confirming the usefulness of semi-supervised learning in that it improves model performance by leveraging unlabeled data. Another interesting observation is that, in general, both GCN-AN and GCN-N yield relatively high AUC standard deviations compared to our GFCN model, indicating that our model has less variability than these two strong baselines.

Table 2: Test AUC (%) averaged over 10 runs when 2.5% of instances are labeled. We also report the standard deviation. Boldface numbers indicate the best anomaly detection performance.

Method       Cora       Citeseer   Pubmed     Photo      Computers
OC-SVM       50.0±0.1   50.6±0.4   68.9±0.9   51.9±0.6   47.3±0.7
ImVerde      85.9±6.1   60.3±6.5   94.3±0.5   89.1±1.4   98.5±0.7
Deep SVDD    69.6±6.5   55.3±1.6   73.7±6.3   52.3±1.4   46.6±1.5
Dominant     52.3±0.9   53.9±0.6   50.8±0.4   38.1±0.4   46.8±1.2
Deep SAD     72.7±6.0   53.8±2.9   91.3±2.4   81.9±5.7   92.2±2.5
GCN          84.9±6.9   60.9±6.0   96.2±0.1   90.1±2.5   98.1±0.3
GCN-AN       88.8±5.4   65.6±4.7   95.6±0.3   95.4±1.8   98.8±0.3
GCN-N        62.6±9.9   56.0±4.4   76.5±4.2   55.1±11    56.9±5.1
GFCN         88.9±2.3   68.3±1.1   96.3±0.1   97.6±0.5   98.8±0.4

Table 3: Test AUC (%) averaged over 10 runs when 5% of instances are labeled. We also report the standard deviation. Boldface numbers indicate the best anomaly detection performance.

Method       Cora       Citeseer   Pubmed     Photo      Computers
OC-SVM       50.2±0.1   50.7±0.5   71.0±1.1   51.9±0.6   47.2±0.8
ImVerde      91.1±3.2   64.5±4.5   94.9±0.5   92.2±1.2   98.6±0.6
Deep SVDD    58.3±2.8   56.0±0.8   79.9±3.4   52.3±0.8   47.0±1.4
Dominant     52.5±0.9   53.9±0.7   53.0±0.5   38.1±0.5   47.2±1.2
Deep SAD     72.4±5.7   61.1±4.2   91.7±1.7   89.8±3.1   92.8±2.6
GCN          89.2±7.6   63.9±4.7   96.6±0.1   92.3±1.2   98.3±0.3
GCN-AN       91.8±5.4   68.3±3.8   96.2±0.2   97.0±0.7   99.1±0.3
GCN-N        67.1±5.8   57.4±3.1   76.1±4.8   56.2±9.6   58.2±5.8
GFCN         92.2±2.3   71.3±1.3   96.7±0.1   98.8±0.4   99.3±0.4

Table 4: Test AUC (%) averaged over 10 runs when 10% of instances are labeled. We also report the standard deviation. Boldface numbers indicate the best anomaly detection performance.

Method       Cora       Citeseer   Pubmed     Photo      Computers
OC-SVM       51.8±1.7   51.1±0.9   73.1±0.8   51.7±0.9   47.4±0.8
ImVerde      94.5±2.1   68.5±6.9   95.5±0.4   92.8±1.0   99.1±0.4
Deep SVDD    59.8±4.8   56.7±1.7   93.2±1.0   53.1±0.6   47.5±1.0
Dominant     52.6±0.9   53.9±0.8   53.1±0.4   38.1±0.7   46.6±1.9
Deep SAD     72.9±3.3   62.4±4.4   96.7±0.1   89.5±1.9   93.5±1.9
GCN          94.5±4.3   68.6±3.3   96.7±0.1   93.3±0.8   98.3±0.3
GCN-AN       95.4±2.7   72.9±4.3   96.6±0.1   97.9±0.3   99.1±0.3
GCN-N        72.3±7.0   60.1±2.1   73.4±5.7   53.6±3.9   58.5±4.6
GFCN         94.7±1.0   76.5±1.0   97.3±0.1   99.4±0.2   99.4±0.4

5.6 Parameter Sensitivity Analysis

The weight hyperparameter α of the weighted cross-entropy and the regularization hyperparameter β play an important role in the anomaly detection performance of the proposed GFCN framework. We conduct a sensitivity analysis to investigate how the performance of GFCN changes as we vary these two hyperparameters. In Figure 4, we analyze the effect of the hyperparameter α by plotting the AUC results of GFCN vs. α using various label rates for all datasets, where α varies from 2 to 10. We can see that, with a few exceptions, our model generally benefits from relatively larger values of the weight hyperparameter. For almost all datasets, our model achieves satisfactory performance with α = 4.

Figure 4: Effect of weight hyperparameter α on anomaly detection performance (AUC). Panels (a)-(e) plot AUC vs. α on Cora, Citeseer, Pubmed, Photo and Computers for label rates of 2.5%, 5% and 10%.

In Figure 5, we plot the average AUCs, along with the standard error bars, of our GFCN model vs. β using various label rates for all datasets, and by varying the value of β from 10^−4 to 1. Notice that the best performance is generally achieved when β = 0.01, except in the cases of the Citeseer and Pubmed datasets, on which the best performance is obtained when β = 0.1. In general, when the regularization hyperparameter increases, the performance improves rapidly at the very beginning, but then deteriorates after reaching the best setting due to overfitting. An interesting observation is that GFCN generally shows a steady increase in performance with the regularization parameter, except on the Citeseer dataset when the label rate is 5%, whereas the performance on the other datasets degrades after reaching a certain threshold.

5.7 Visualization

The feature embeddings learned by GFCN can be visualized using Uniform Manifold Approximation and Projection (UMAP) [29], a dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a two- or three-dimensional space. Figure 6 displays the UMAP embeddings of GFCN (top) and GCN (bottom) using the Cora dataset. As can be seen, the GFCN embeddings are more separable than the GCN ones. With GCN features, the normal and anomalous instances are not discriminated very well, while with GFCN features these data instances are discriminated much better, indicating that GFCN learns better node representations for anomaly detection tasks.
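
A minimal sketch of this visualization step, assuming the learned GFCN node embeddings and the binary node labels are available as NumPy arrays, could use the umap-learn package as follows; the plotting details are illustrative.

```python
# Minimal sketch of the UMAP visualization of node embeddings (Section 5.7).
# H: node embedding matrix, labels: 0 for normal and 1 for anomalous (assumed inputs).
import umap
import matplotlib.pyplot as plt


def plot_embeddings(H, labels):
    emb2d = umap.UMAP(n_components=2, random_state=0).fit_transform(H)  # project to 2-D [29]
    normal, anomalous = labels == 0, labels == 1
    plt.scatter(emb2d[normal, 0], emb2d[normal, 1], s=5, label="Normal")
    plt.scatter(emb2d[anomalous, 0], emb2d[anomalous, 1], s=5, label="Anomalous")
    plt.legend()
    plt.show()
```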

Figure 5: Effect of regularization hyperparameter β on anomaly detection performance (AUC). Panels (a)-(e) plot AUC vs. β on Cora, Citeseer, Pubmed, Photo and Computers for label rates of 2.5%, 5% and 10%.

Figure 6: UMAP embeddings of GFCN (top) and GCN (bottom) using the Cora dataset; normal and anomalous nodes are shown in different colors.

6 Conclusion

In this paper, we introduced a graph convolutional network with skip connection for semi-supervised anomaly detection on graph-structured data by learning effective node representations in an end-to-end fashion. The proposed framework is theoretically motivated by implicit fairing and derived directly from the Jacobi iterative method. Through extensive experiments, we demonstrated the competitive or superior performance of our model in comparison with the current state of the art on five benchmark datasets. For future work, we plan to apply our approach to other downstream tasks on graph-structured data. We also intend to incorporate higher-order neighborhood information into the graph structure of the model, where nodes not only receive latent representations from their 1-hop neighbors, but also from multi-hop neighbors.

References

[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys, vol. 41, no. 3, pp. 1–58, 2009.

[2] G. Pang, C. Shen, L. Cao, and A. van den Hengel, "Deep learning for anomaly detection: A review," arXiv preprint arXiv:2007.02500, 2020.

[3] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.

[4] D. M. Tax and R. P. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.

[5] R. Wang, K. Nie, T. Wang, Y. Yang, and B. Long, "Deep learning for anomaly detection," in Proc. International Conference on Web Search and Data Mining, pp. 894–896, 2020.

[6] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, "Deep one-class classification," in Proc. International Conference on Machine Learning, pp. 4393–4402, 2018.

[7] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft, "Deep semi-supervised anomaly detection," in International Conference on Learning Representations, 2019.

[8] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

[9] T. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, pp. 1–14, 2017.

[10] L. Akoglu, H. Tong, and D. Koutra, "Graph based anomaly detection and description: a survey," Data Mining and Knowledge Discovery, vol. 29, no. 3, pp. 626–688, 2015.

[11] K. Ding, J. Li, R. Bhanushali, and H. Liu, "Deep anomaly detection on attributed networks," in Proc. SIAM International Conference on Data Mining, pp. 594–602, 2019.

[12] A. Kumagai, T. Iwata, and Y. Fujiwara, "Semi-supervised anomaly detection on attributed graphs," arXiv preprint arXiv:2002.12011, 2020.

[13] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr, "Implicit fairing of irregular meshes using diffusion and curvature flow," in Proc. SIGGRAPH, pp. 317–324, 1999.

[14] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," in International Conference on Learning Representations, 2017.

[15] T. Schlegl, P. Seeböck, S. Waldstein, G. Langs, and U. Schmidt-Erfurth, "f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks," Medical Image Analysis, vol. 54, pp. 30–44, 2019.

[16] F. D. Mattia, P. Galeone, M. D. Simoni, and E. Ghelfi, "A survey on GANs for anomaly detection," arXiv preprint arXiv:1906.11632, 2019.

[17] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan, "Do deep generative models know what they don't know?," in International Conference on Learning Representations, 2019.

[18] A. Ortega, P. Frossard, J. Kovacevic, J. Moura, and P. Vandergheynst, "Graph signal processing: Overview, challenges, and applications," Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.

[19] G. Taubin, "A signal processing approach to fair surface design," in Proc. SIGGRAPH, pp. 351–358, 1995.

[20] G. Taubin, T. Zhang, and G. Golub, "Optimal surface smoothing as filter design," in Proc. European Conference on Computer Vision, 1996.

[21] D. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.

[22] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, "CayleyNets: Graph convolutional neural networks with complex rational spectral filters," IEEE Transactions on Signal Processing, vol. 67, no. 1, pp. 97–109, 2018.

[23] F. M. Bianchi, D. Grattarola, C. Alippi, and L. Livi, "Graph neural networks with convolutional ARMA filters," arXiv preprint arXiv:1901.01343, 2019.

[24] A. Wijesinghe and Q. Wang, "DFNets: Spectral CNNs for graphs with feedback-looped filters," in Advances in Neural Information Processing Systems, 2019.

[25] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka, "Representation learning on graphs with jumping knowledge networks," in Proc. International Conference on Machine Learning, 2018.

[26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–106, 2008.

[27] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, "Pitfalls of graph neural network evaluation," arXiv preprint arXiv:1811.05868, 2018.

[28] J. Wu, J. He, and Y. Liu, "ImVerde: Vertex-diminished random walk for learning network representation from imbalanced data," in Proc. IEEE International Conference on Big Data, pp. 871–880, 2018.

[29] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," The Journal of Open Source Software, 2018.
