Received March 2, 2021, accepted March 13, 2021, date of publication March 23, 2021, date of current version March 31, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3068172

Intelligent Anomaly Detection for Large Network Traffic With Optimized Deep Clustering (ODC) Algorithm

ANNIE GILDA ROSELIN 1,2, PRIYADARSI NANDA 1, SURYA NEPAL 2, AND XIANGJIAN HE 1 (Senior Member, IEEE)
1 Department of Electrical and Data Engineering, University of Technology Sydney (UTS), Ultimo, NSW 2007, Australia
2 Commonwealth Scientific and Industrial Research Organisation (CSIRO/Data61), Marsfield, NSW 2122, Australia
Corresponding author: Annie Gilda Roselin ([email protected]; [email protected])
This work was supported by Data61 through the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Marsfield, Australia.

ABSTRACT The availability of enormous amounts of unlabeled data drives anomaly detection research towards unsupervised algorithms, and deep clustering algorithms for anomaly detection have gained significant research attention. We propose intelligent anomaly detection for extensive network traffic analysis with an Optimized Deep Clustering (ODC) algorithm. Firstly, ODC optimizes the deep AutoEncoder by tuning its hyperparameters, thereby achieving a reduced reconstruction error rate from the deep AutoEncoder. Secondly, ODC feeds the optimized deep AutoEncoder's latent view to the BIRCH clustering algorithm to detect known and unknown malicious network traffic without human intervention. Unlike other deep clustering algorithms, ODC does not require the number of clusters needed to analyze the network traffic dataset to be specified in advance. We evaluate the ODC algorithm on the CoAP off-path dataset obtained from our testbed and on the MNIST dataset to compare our algorithm's accuracy with state-of-the-art clustering algorithms. The evaluation results show that the ODC deep clustering method outperforms the existing deep clustering methods for anomaly detection.

INDEX TERMS Latent space view, anomaly detection, regularization, BIRCH clustering.

I. INTRODUCTION
The increase in network traffic is directly proportional to the increase in malicious activities on the Internet. IoT plays a vital role in producing a massive number of network traffic datasets and creates significant challenges for detecting anomalies.

Anomaly detection in network traffic with machine learning is a rapidly growing research area [1]–[7]. Deep clustering techniques for anomaly detection use variations of the AutoEncoder's latent representation with a k-means clustering algorithm. For example, Deep Embedding Clustering (DEC) [8], Improved Deep Embedding Clustering (IDEC) [9], and Deep Density-based Clustering (DDC) [10] use a dense deep AutoEncoder; Deep Convolutional Embedded Clustering (DCEC) [11] and Deep Density-based Clustering with Data Augmentation (DDC-DA) [10] use a convolutional AutoEncoder with k-means clustering; and the Gaussian mixture variational AutoEncoder (GMVAE) [12] pairs a variational AutoEncoder with k-means clustering. Most of these deep clustering techniques use the k-means algorithm for the data clustering part, which in turn demands the number of clusters manually. In a real-time situation, predicting the number of clusters at the initial time (when training the model) for a new dataset might not help discover new and unknown anomalies. To overcome this major limitation of the existing works, we use BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) in our ODC deep clustering technique. BIRCH has the advantage of intelligent cluster assignment and anomaly detection without human intervention. Also, a deep AutoEncoder reduces the dimensionality of the dataset irrespective of whether it contains linear or non-linear data. The BIRCH clustering method has not received much attention among researchers working on deep clustering methods. However, BIRCH has the capability of doing intelligent clustering on a vast dataset [13].

The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Our contributions are summarized as follows:
• Optimization of the deep AutoEncoder by tuning the hyper-parameters to achieve a reduced reconstruction error rate.
• We derived a novel unsupervised anomaly detection algorithm, ODC, by incorporating the BIRCH clustering algorithm with the latent representation of the enhanced deep AutoEncoder.
• Unlike other deep clustering algorithms, ODC does not require specifying the number of clusters needed to analyze the network traffic.
• ODC intelligently handles anomalies, including known and unknown attacks, for a huge dataset.
• We analyzed how the Branching factor value and the Threshold value of BIRCH influence the clustering accuracy and the normalized mutual information score.

We observed that our ODC clustering algorithm outperforms the existing deep clustering methods for anomaly detection. Moreover, ODC suits vast network traffic datasets where multiple scans of the dataset are not advisable, since ODC embeds the BIRCH clustering algorithm and incorporates its advantages. We achieved high clustering accuracy and a high normalized mutual information score for the anomaly detection process due to the combination of a deep AutoEncoder and the BIRCH clustering algorithm. Also, ODC puts a stop to the need for domain experts to manually label large datasets and to explicitly specify the number of clusters needed for the dataset.

Our proposed method differs from the state of the art [14]–[18] in that we associate BIRCH clustering with our enhanced deep AutoEncoder. To preserve each data point's local structure, StructAE [19] learns representations for each data point by minimizing the reconstruction error with respect to itself. ODC, however, achieves a low reconstruction error rate by tuning hyperparameters such as the activation function and the regularization function. Hence, we show that ODC preserves the data points' structure, leading to an intelligent clustering method to detect anomalies.

The rest of the paper is organized as follows. Section II provides the background information needed to understand the ODC clustering algorithm. The working principles of a deep AutoEncoder and the BIRCH clustering algorithm are explained in Section II-A and Section II-B, respectively. Section III describes the state of the art of deep clustering algorithms. The proposed deep clustering method is explained in Section IV. Section V describes the evaluation process of the proposed deep clustering method. Finally, we discuss the possible extension of our research in Section VI.

II. BACKGROUND
Anomaly detection [20] is the technique of identifying rare events or observations that raise suspicion by being statistically different from the rest of the observations. Modern organizations are beginning to understand the significance of interconnected operations for getting the full picture of their business. Additionally, they have to react instantly to fast-moving changes in data, particularly in the event of cybersecurity threats. Unfortunately, there is no effective way to manage and analyze continually evolving datasets manually. With dynamic systems having various components in ceaseless motion, where the "normal" behavior is continually redefined, a new proactive approach to identifying anomalous behavior is required [20].

Anomaly detection varies across many real-world applications and academic research areas, based on the dataset used to train the machine learning model. With the emergence of sensor networks, processing data as it arrives has become a necessity [21]. Techniques have been proposed that can operate in an online fashion [22]; such techniques assign an anomaly score to a test instance as it arrives, but also incrementally update the model. The authors in [23] showcased the importance of anomaly detection in dynamic settings through a real-world application example, i.e., forest fire risk prediction. They also recommend redesigning the current models to be able to detect outlying patterns accurately and efficiently. More specifically, when there are many features, a set of anomalies emerges in only a subset of dimensions at a particular period. This set of anomalies may appear normal with regard to a different subset of dimensions and periods.

The authors in [24] discussed the unavailability of financial data for fraud detection research and a methodology for synthetic data generation. They suggest that a universal technique in the domain of fraud detection is yet to be found, due to the evolving change in the context of normality and to data unavailability. According to [25], much of the research on connected-vehicle anomaly detection is performed on simulated data (37 out of the 65 surveyed papers); in-vehicle network data and vehicular ad hoc network (VANET) data are seldom considered together to safeguard connected vehicles (except for 1 out of the 65 surveyed papers); and connected-vehicle safety research does not get the same amount of attention as cybersecurity research. It is observed that the anomaly detection domain has various promising research directions; many anomaly detection methods require a large test dataset for detecting anomalies [26]. The literature survey we conducted in anomaly detection motivates us to use machine learning models to determine the abnormal behavior of a legitimate user in a private network.

Anomaly detection can be performed using the ideas of machine learning, in the following manners:
Supervised Anomaly Detection: This strategy requires a labeled dataset with normal and abnormal examples for building a predictive model. The most well-known supervised methods include supervised neural networks, support vector machines, k-nearest neighbors, Bayesian networks, and decision trees [27]. Supervised models are believed to give a superior detection rate compared to unsupervised techniques because of their capacity to encode interdependencies between variables, to incorporate both prior knowledge and data, and to return a confidence score with the model output [2].


Unsupervised Anomaly Detection: This strategy does not require labeled training data. It assumes that the vast majority of the network connections are normal traffic, that only a modest percentage is unusual, and that malicious traffic is statistically different from normal traffic [28]. In light of these two assumptions, groups of frequent instances are considered normal, and rare data groups are classified as anomalies. The most popular unsupervised algorithms include K-means, AutoEncoders, GMMs (Gaussian Mixture Models), and PCA (Principal Component Analysis) based analysis [29].

Deep learning is the subspace of machine learning that accomplishes great performance, as it learns the detailed features of datasets with the help of neural networks [30]. The existing deep clustering techniques for anomaly detection merge a deep learning algorithm and a clustering algorithm, usually the k-means clustering algorithm. Based on the observations from these background studies and the research gap learned from the related work in Section III, we propose our ODC in Section IV for intelligent anomaly detection.

A. DEEP AUTOENCODER
An AutoEncoder with more than one hidden layer is called a deep AutoEncoder. Deep AutoEncoders learn more complex features of the dataset since they have more layers than a simple AutoEncoder. The deep AutoEncoder intends to reconstruct the input with minimum reconstruction error. The encoding part, the decoding part, and the latent representation (compressed input) are the three essential parts of the deep AutoEncoder. The application of a deep AutoEncoder is unavoidable in network traffic analysis, since it compresses a sizeable high-dimensional dataset into a low-dimensional one.

For a given training dataset X = {x_1, x_2, ..., x_m} with m samples [31], where x_i is a d-dimensional feature vector, the encoder maps the input vector x_i to a hidden representation vector h_i through a deterministic mapping f_θ, as given in (1):

h_i = f_θ(x_i) = σ(W x_i + b)   (1)

where W is a d′ × d matrix, d′ is the number of hidden units, b is a bias vector, and θ = {W, b} is the mapping parameter set. σ is a proper activation function. The decoder maps the resulting hidden representation h_i back to a reconstructed d-dimensional vector y_i in the input space:

y_i = g_θ̂(h_i) = σ(Ŵ h_i + b̂)   (2)

where Ŵ is a d × d′ matrix, b̂ is a bias vector, and θ̂ = {Ŵ, b̂} [31]. The goal of training the AutoEncoder is to minimize the difference between input and output. Therefore, a loss function is calculated by the following equation:

L(x, y) = (1/m) Σ_{i=1}^{m} ‖x_i − y_i‖²   (3)

where m is the total number of training samples. The main objective is to find the optimal parameters (h_i and θ) which can effectively minimize the difference between input and reconstructed output over the whole training set:

θ = {W, b} = arg min_θ L(x, y)   (4)
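To make the mappings in (1)–(4) concrete, the following is a minimal NumPy sketch of one encoder/decoder pair and the mean-squared reconstruction loss; the dimensions, the logistic activation, and the random data are illustrative assumptions rather than the configuration used in this paper.

import numpy as np

rng = np.random.default_rng(0)
d, d_prime, m = 20, 8, 100                        # input dim, hidden units, samples (assumed)

W, b = 0.1 * rng.normal(size=(d_prime, d)), np.zeros(d_prime)    # encoder parameters (theta)
W_hat, b_hat = 0.1 * rng.normal(size=(d, d_prime)), np.zeros(d)  # decoder parameters (theta-hat)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # a generic activation for the sketch

X = rng.normal(size=(m, d))
H = sigma(X @ W.T + b)                            # eq. (1): h_i = sigma(W x_i + b)
Y = sigma(H @ W_hat.T + b_hat)                    # eq. (2): reconstruction y_i
L = np.mean(np.sum((X - Y) ** 2, axis=1))         # eq. (3): (1/m) sum ||x_i - y_i||^2
print(round(L, 4))                                # training, eq. (4), would minimize L over theta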
B. BIRCH CLUSTERING ALGORITHM
BIRCH, which stands for Balanced Iterative Reducing and Clustering using Hierarchies, was created in 1996 by Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH is best suited for large or streaming datasets due to its ability to find good clustering solutions with a single scan of the data. Optionally, the algorithm can further scan through the data to improve the clustering quality. BIRCH outperforms existing clustering methods such as the K-means and DBSCAN clustering algorithms [13] in handling large datasets.

According to [13], BIRCH is a multipath search tree, similar in structure to a B+ tree. There are three kinds of nodes in a cluster-feature (CF) tree: Leaf, NonLeaf, and MinCluster. Three parameters are engaged in the model. The first parameter is B (Branching factor), the greatest number of child nodes that a non-leaf node can hold. The second parameter is L, the maximum number of entries that a leaf node can hold. Furthermore, the third parameter is T (Threshold), the maximum radius of a cluster. A CF entry is a triple of statistics summarizing the data points in a single cluster:

CF = (N, LS, SS)   (5)

• Count (N): the number of data values in the cluster.
• Linear Sum (LS): the vector sum of the individual coordinates of the data points; a measure of the location of the cluster:

LS = Σ_{i=1}^{N} x_i   (6)

• Squared Sum (SS): the sum of the squared coordinates of the data points; a measure of the spread of the cluster:

SS = Σ_{i=1}^{N} (x_i)²   (7)

BIRCH has two phases:
• Phase 1: Building the CF tree. Load the network traffic data into memory by building a cluster-feature (CF) tree. This phase will compress the initial CF tree only when this option is chosen at training time.
• Phase 2: Global clustering. Optionally refine the clusters obtained from Phase 1 by applying an existing clustering algorithm on the leaves of the CF tree.

In view of the additivity property of CFs [13], the CF value of a parent node is the sum of the CF values of its child nodes:

CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)   (8)
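As a concrete illustration of (5)–(8), the short NumPy sketch below builds CF triples for two toy point sets and checks the additivity property; the data are made up for illustration.

import numpy as np

def cf(points):
    # CF = (N, LS, SS) of eq. (5): count, linear sum (eq. 6), squared sum (eq. 7)
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

cf1 = cf([[1.0, 2.0], [2.0, 2.0]])
cf2 = cf([[8.0, 9.0]])

# eq. (8): the CF of a merged cluster is the component-wise sum of its parts
merged = (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])
direct = cf([[1.0, 2.0], [2.0, 2.0], [8.0, 9.0]])
assert merged[0] == direct[0]
assert np.allclose(merged[1], direct[1]) and np.isclose(merged[2], direct[2])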


III. RELATED WORK
In DEC [8], an initial dense AutoEncoder is pre-trained by minimizing the reconstruction error. Then, as a clustering optimization stage, the method iterates between computing an auxiliary target distribution from the AutoEncoder representation and minimizing the Kullback-Leibler divergence to it. In IDEC [9], it is argued that the clustering loss of DEC corrupts the feature space; therefore, IDEC jointly optimizes the clustering loss and the reconstruction loss of the AutoEncoder.

The deep clustering in [32] shows that an l2 normalization on the latent representation of the AutoEncoder makes the latent space more separable and compact in the Euclidean space. This significantly improves the clustering precision when k-means clustering is applied to the latent representation. The DDC [10] clustering technique reduces the dimension of the dataset with the help of a deep convolutional AutoEncoder and the t-SNE algorithm. Consequently, DDC applies density-based clustering to the result of the t-SNE algorithm (2-dimensional embedded data) without mentioning the number of clusters in advance. The deep clustering algorithms [33], [34] and DDC use t-SNE for further dimensionality reduction of the input data. The issue with t-SNE is that it preserves neither the distances nor the density of the data. Also, the compressed data cannot be guaranteed to recreate the original input, since there are no hyper-parameters to reduce the reconstruction error between the input data and the recreated data.

Recent works on convolutional AutoEncoder clustering, such as [35]–[39], are most applicable for clustering image datasets, not for analysing network traffic datasets. DCEC [11] embraces a convolutional AutoEncoder and shows that it improves the clustering accuracy of DEC and IDEC. Dealing with anomalies in credit card transactions [15] is done with an AutoEncoder and the k-means clustering algorithm on a European bank transaction dataset. However, this work and the other works specified in this Section III have the problem of predicting the number of clusters after pre-training the AutoEncoder.

Our proposed algorithm ODC optimizes the pre-training process of the deep AutoEncoder to reduce the reconstruction error. Furthermore, it uses BIRCH clustering to overcome the limitations of the existing deep clustering algorithms.

IV. PROPOSED DEEP CLUSTERING METHOD
ODC groups the network traffic data based on the Euclidean distance between the nodes, so that we get more and more dynamic clusters as the network traffic passes on to the ODC model.

A. ENHANCED DEEP AUTOENCODER
The enhanced AutoEncoder model is constructed using a proper combination of activation functions, regularizers, and optimization functions to reduce the reconstruction error value. Our enhanced AutoEncoder treats every input as self-reliant values, thereby reducing the over-fitting of the training data. The ODC training phase requires pre-training and fine-tuning the model parameters to enhance the efficiency of the model.

We used the ELU (Exponential Linear Unit) [40] activation function for all layers and the Adamax optimization function for the enhanced AutoEncoder model:

f(x) = x,              if x ≥ 0
f(x) = α(exp(x) − 1),  if x < 0   (9)

ELUs have negative values that push the mean of the activations closer to zero. Mean activations close to zero allow faster learning, as the gradient approaches the natural gradient. ELUs saturate at a negative value for strongly negative inputs. Besides, code interference between different concepts is less likely, since saturated negative values avoid producing distributed codes. α is a hyper-parameter of the ELU. Positively activated ELUs interact by activating the next layer of units. Thus, the ELU activation function is well suited for deep network models where the vanishing gradient interferes with the learning of the model.

The dropout regularizer randomly drops out nodes, thereby increasing the uniqueness of each node in the network. The co-adaptation of the features in the nodes is reduced by adopting a dropout regularizer in the network.

The number of hidden layers (h_i = {h_0, h_1, h_2, h_3, h_4}) in our enhanced AutoEncoder is five. Here, the latent space can be represented as

h_2 = f_θ(x_2) = σ(W x_2 + b)   (10)

According to [41], the dropout function is defined as

r_j^(l) ∼ Bernoulli(p)
ỹ^(l) = r^(l) ∗ y^(l)   (11)

In equation (11), ∗ signifies an element-wise product. For any layer l, r^(l) is a vector of independent Bernoulli random variables, each of which has probability p of being 1. This vector is sampled and multiplied element-wise with the outputs of that layer, y^(l), to create the thinned outputs ỹ^(l). These reduced features are then used as input to the next layer. This procedure is applied at each layer. If we apply dropout to the hidden layer with a probability value of p, the equations would be modified as follows (at training time):

x̃ ∼ Dropout(x)
h = f(W x̃ + b)
h̃ ∼ Dropout(h)
y = g(Ŵ h̃ + b̂)   (12)
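Putting the pieces of this subsection together, a minimal Keras sketch of such an enhanced AutoEncoder could look as follows. The layer widths follow the encoder/decoder dimensions reported later in Section V (d - 1626 - 756 - 50 and its mirror); the input dimension d, the dropout rate p = 0.2, and the linear output activation are illustrative assumptions, not values reported in this paper.

from tensorflow import keras
from tensorflow.keras import layers

d = 1800  # input feature dimension (assumed)

inputs = keras.Input(shape=(d,))
x = layers.Dense(1626, activation="elu")(inputs)               # h0, ELU of eq. (9)
x = layers.Dropout(0.2)(x)                                     # dropout of eqs. (11)-(12)
x = layers.Dense(756, activation="elu")(x)                     # h1
x = layers.Dropout(0.2)(x)
latent = layers.Dense(50, activation="elu", name="latent")(x)  # h2: the latent space, eq. (10)
x = layers.Dense(756, activation="elu")(latent)                # h3
x = layers.Dense(1626, activation="elu")(x)                    # h4
outputs = layers.Dense(d, activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adamax", loss="mse")            # Adamax + reconstruction loss

encoder = keras.Model(inputs, latent)  # after training, this produces the view fed to BIRCH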


A loss function is calculated by the following equation:

L(x, y) = (1/m) Σ_{i=1}^{m} ‖x_i − y_i‖² + r   (13)

where m is the total number of training samples and r is the regularization term. The column named "Train RE" in Table 1 refers to the reconstruction error rate at training time, and "Test RE" refers to the reconstruction error rate at testing time of our enhanced deep AutoEncoder. The values in Table 1 show how our optimized deep AutoEncoder outperforms the alternatives in reducing the reconstruction error rate both at training and at testing time.

TABLE 1. Optimization of deep AutoEncoder.

B. OPTIMIZED DEEP CLUSTERING WITH BIRCH
The compressed representation of the data points (h_2 = f_θ(x_2) = σ(W x_2 + b)), obtained from the enhanced deep AutoEncoder as explained in Section IV-A, is fed into the BIRCH clustering algorithm. Each new data point is added to the CF tree after calculating the radius of the cluster. The radius (R) of a cluster is calculated as

R = √( Σ_{i=1}^{N} (x_i − C)² / N ) = √( (N·C² + SS − 2·C·LS) / N )

which simplifies to

R = √( SS/N − (LS/N)² )

The calculated R value decides where to push the new data point. If R < T, the new data point is pushed into the same leaf node. If R > T, the new data point forms a new leaf node. Thereby, the CF tree is built for all the data points in our training and testing data. If we divide the sum of the data points by the number of data points, we get the centroid of the cluster. The centroid (C) of the cluster is calculated as

C = Σ_{i=1}^{N} x_i / N = LS / N

Thereby, we can also calculate the distance between two clusters CF_i and CF_j.
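The statistics above follow directly from a CF triple. Below is a minimal sketch, assuming Euclidean space and NumPy arrays, of the centroid, the radius, the R-versus-T insertion test, and the twice-the-radius anomaly rule used in Section IV-C below; the toy CF values are made up.

import numpy as np

def centroid_and_radius(N, LS, SS):
    # C = LS/N and R = sqrt(SS/N - ||LS/N||^2), as derived above
    C = LS / N
    R = np.sqrt(max(SS / N - C @ C, 0.0))
    return C, R

def absorbs(x, N, LS, SS, T):
    # R-versus-T test: absorb x only if the enlarged cluster's radius stays below T
    N2, LS2, SS2 = N + 1, LS + x, SS + x @ x       # CF additivity, eq. (8)
    _, R2 = centroid_and_radius(N2, LS2, SS2)
    return R2 < T                                   # True: same leaf; False: new leaf

def is_anomaly(x, C, R):
    # Section IV-C rule: farther than twice the radius of the closest cluster
    return np.linalg.norm(x - C) > 2 * R

N, LS, SS = 4, np.array([4.0, 6.0]), 14.0           # toy CF of four 2-d points
C, R = centroid_and_radius(N, LS, SS)
print(C, round(R, 3), absorbs(np.array([1.2, 1.4]), N, LS, SS, T=1.5))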
C. ODC OUTLIER HANDLING
We can set aside a fixed amount of disk/memory space for handling anomalies. Anomalies are leaf nodes of low density that are judged to be irrelevant with respect to the overall clustering pattern. When we rebuild the CF tree by reinserting the old leaf nodes, the size of the new CF tree is reduced in two ways [13]. First, we increase the threshold value T, thereby allowing each leaf node to absorb more points. Second, we treat some leaf nodes as potential anomalies and write them out to disk. An old leaf node is viewed as a potential anomaly if it has far fewer data points than average. An increase in the T value, or a change in the distribution due to the new data, could well imply that a potential anomaly no longer qualifies as an anomalous data point. A data point whose Euclidean distance to the closest seed is larger than twice the radius of that cluster is treated as an anomaly [13]. As a result, the potential anomalies are examined to check whether they can be re-absorbed into the tree without making the tree grow in size.

In Algorithm 1, steps 1 to 4 explain how the compressed form of the input dataset is produced with the help of the optimized deep AutoEncoder. Furthermore, steps 5 to 22 describe the handling process of the BIRCH [13] clustering algorithm. As a result, ODC handles outliers in the network traffic data better than the existing deep clustering combinations. The evaluation of the resultant clusters of ODC is discussed in Section V.

V. EXPERIMENTAL EVALUATION
The enhanced deep AutoEncoder is implemented in Python using Keras [42]. Experiments on our datasets are conducted on a regular laptop with an Intel Core i7 processor. To evaluate our algorithm ODC, we use the CoAP off-path dataset [5] to find the anomalies in IoT network traffic, and the standard, publicly available MNIST [43] image dataset to compare the accuracy of ODC's results with other existing works. We use the testbed from [5] to get more instances of IoT traffic with a CoAP off-path attack and feed the proposed algorithm with 10,000 unlabeled instances of IoT-CoAP traffic. We are ready to provide the CoAP off-path dataset to anyone who wants to redo the experiment for their research. To the best of our knowledge, our work is the first to combine a deep AutoEncoder with the BIRCH clustering algorithm for anomaly detection in IoT network traffic datasets. The MNIST dataset has 70,000 digits of 28 × 28 pixels. We use the code publicly released by the respective DEC and IDEC authors to apply the corresponding algorithms to our dataset.

The encoder of our ODC contains two hidden layers and an input layer for both datasets, MNIST and CoAP off-path, as in Figure 1. The decoder part contains two hidden layers and an output layer for both datasets. The dimension of the encoder is set as input data dimension (d) - 1626 - 756 - 50. The decoder dimension is set as the reverse of the encoder, i.e., 50 - 756 - 1626 - output dimension (d). The graphs in Figure 2 and Figure 3 show that the fall in the reconstruction error rate depends on the choice of activation and optimization functions. Though the ReLU + Adamax combination and the ELU + dropout combination with the baseline deep AutoEncoder have a similar reconstruction error rate, the latter combination (ELU, dropout) produces a consistently low reconstruction error rate across different iterations and different datasets.
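The pipeline of Sections IV-A and IV-B can be sketched end to end with scikit-learn's Birch class as a stand-in BIRCH implementation (the paper does not state which BIRCH implementation it uses). The blob data below stands in for the 50-dimensional latent view produced by the trained encoder, with a cluster spread chosen to be compatible with the threshold; B = 15 and T = 1.5 follow the values chosen later in this section.

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# stand-in for the latent view h2 of 10,000 traffic instances
Z, _ = make_blobs(n_samples=10_000, n_features=50, centers=5,
                  cluster_std=0.1, random_state=0)

odc = Birch(branching_factor=15, threshold=1.5, n_clusters=None)
labels = odc.fit_predict(Z)                 # n_clusters=None: nothing preset, as ODC requires
print(len(np.unique(labels)))               # clusters discovered without supervision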


FIGURE 1. Enhanced deep clustering (enhanced deep AutoEncoder + BIRCH clustering).

FIGURE 2. Comparing the training reconstruction error (Train RE) of deep AutoEncoder with various combinations of activation and optimization functions.

Algorithm 1 Optimized Deep Clustering With BIRCH for Outlier Handling
Input: Input data: X; Epochs: E; Batch size: I; Branching factor: B; Threshold: T
Output: Labels
Data: Training/testing set x
Parameters: Optimized deep AutoEncoder weights W, cluster radius R, and cluster centers
1  Let t = 0
2  while iter < E do
3    if iter mod I == 0 then
4      Compute latent points h_i = f_θ(x_i) = σ(W x_i + b) by applying (9), (11), and (13)
5    Start CF tree t1 as in Section IV-B with initial T
6    Continue scanning the data and insert into t1
7    if out of memory then
8      Increase T
9      Rebuild CF tree t2 of new T from CF tree t1
10     if a leaf data point of t1 is an outlier and disk space is available then
11       Write that data point out as an outlier
12     else
13       Use the data point to rebuild t2
14     if t1 <= t2 then
15       if the disk has space then
16         Go to step 5 and repeat the process for the rest of the data points
17       else
18         Re-absorb potential outliers into t1
19         Go to step 5 and repeat the process for the rest of the data points
20     else
21       Re-absorb potential outliers into t1
22       Go to step 5 and repeat the process for the rest of the data points
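A deliberately simplified, runnable rendering of the rebuild loop in steps 5 to 22 of Algorithm 1 is sketched below: when the tree grows past a memory budget, T is raised, the tree is rebuilt, and sparse leaves are set aside as potential outliers. A flat list of leaf CFs stands in for the height-balanced tree of a real BIRCH implementation, and the budget and sparsity cutoff are assumptions.

import numpy as np

def build_leaves(points, T):
    # greedy CF-leaf construction: absorb a point only if the radius stays below T
    leaves = []                                    # each leaf holds its CF triple (N, LS, SS)
    for p in points:
        placed = False
        for leaf in leaves:
            N, LS, SS = leaf["N"] + 1, leaf["LS"] + p, leaf["SS"] + p @ p
            C = LS / N
            if np.sqrt(max(SS / N - C @ C, 0.0)) < T:   # step 6: insert into this leaf
                leaf.update(N=N, LS=LS, SS=SS)
                placed = True
                break
        if not placed:
            leaves.append({"N": 1, "LS": p.copy(), "SS": p @ p})
    return leaves

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))
T, memory_budget = 0.5, 50                         # budget = max leaves held in memory (assumed)

leaves = build_leaves(points, T)
while len(leaves) > memory_budget:                 # steps 7-9: out of memory -> raise T, rebuild
    T *= 1.5
    leaves = build_leaves(points, T)

mean_count = np.mean([leaf["N"] for leaf in leaves])
potential_outliers = [leaf for leaf in leaves if leaf["N"] < 0.1 * mean_count]  # steps 10-11
print(len(leaves), len(potential_outliers), round(T, 3))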

FIGURE 3. Comparing the testing reconstruction error (Test RE) of deep AutoEncoder with various combinations of activation and optimization functions.

At the time of training the model, the decoder is used to reduce the reconstruction error rate. Once the model is optimized with a low reconstruction error, we merge the BIRCH clustering technique with the encoder's latent representation.

The clustering accuracy (ACC) depends on the branching factor (B) and the threshold value (T). When training the clustering algorithm, we choose the values of B and T through several iterations. We start by setting the value of B to 15 and T to 1.5 to get good clustering accuracy and NMI. The Branching factor value and the Threshold value influence the ACC and NMI of the CoAP off-path dataset. It is noted that when the threshold value and the branching factor value decrease, we get good ACC and NMI values, as shown in the graphs of Figure 8 and Figure 9. Hence, the B and T values directly influence the ACC and NMI values of a dataset.

Table 2 shows that our proposed algorithm ODC has higher clustering accuracy than the state-of-the-art deep clustering methods. The method mentioned in Table 2 as AE (AutoEncoder) with the k-means algorithm performs the k-means clustering algorithm on the latent representation of the trained AutoEncoder. We use the same AutoEncoder parameters as ours (ODC) to evaluate the AE + K-means deep clustering method.
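The B/T selection described above can be rendered as a simple grid search scored with NMI; the grids and the synthetic stand-in data are assumptions for illustration, with scikit-learn's Birch again standing in for the BIRCH implementation.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

Z, y_true = make_blobs(n_samples=2_000, n_features=50, centers=5,
                       cluster_std=0.1, random_state=0)

best = None
for B in (15, 25, 50):
    for T in (0.5, 1.0, 1.5):
        labels = Birch(branching_factor=B, threshold=T, n_clusters=None).fit_predict(Z)
        score = normalized_mutual_info_score(y_true, labels)
        if best is None or score > best[0]:
            best = (score, B, T)
print(best)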


TABLE 2. Comparing the accuracy of various deep clustering techniques.

FIGURE 4. Accuracy of CoAP dataset.

FIGURE 5. NMI of CoAP dataset.

FIGURE 6. Accuracy of MNIST dataset.

FIGURE 7. NMI of MNIST dataset.

FIGURE 8. ACC and NMI of CoAP off-path dataset based on branching factor of BIRCH.

FIGURE 9. ACC and NMI of CoAP off-path dataset based on threshold value of BIRCH.

We utilize two standard unsupervised evaluation metrics for evaluation and comparison with the benchmark methods: clustering Accuracy (ACC) and Normalized Mutual Information (NMI). ACC is defined as

ACC = max_m ( Σ_{i=1}^{n} 1{l_i == m(c_i)} ) / n   (14)

and NMI is defined as

NMI = I(l; c) / max{H(l), H(c)}   (15)

where 1{·} is the indicator function, l_i is the ground-truth label, c_i is the cluster assignment of the i-th sample predicted by the algorithm, and m ranges over all possible one-to-one mappings between predicted clusters and labels; l = {l_i}_{i=1}^{n} and c = {c_i}_{i=1}^{n}, and n is the number of samples. I(l; c) denotes the mutual information between l and c, and H(·) denotes their entropy. Both ACC and NMI are in [0, 1], and higher scores imply more accurate clustering results.
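Both metrics are straightforward to compute. In the sketch below, the optimal one-to-one mapping m of equation (14) is found with the Hungarian algorithm (SciPy's linear_sum_assignment), a standard way of evaluating clustering accuracy, and NMI comes from scikit-learn; the toy labels are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # eq. (14): best one-to-one mapping between predicted clusters and labels
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                           # co-occurrence matrix
    rows, cols = linear_sum_assignment(-count)     # maximize correctly matched samples
    return count[rows, cols].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                        # the same partition under permuted names
print(clustering_accuracy(y_true, y_pred))         # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # eq. (15): 1.0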


The graphs in Figure 4, Figure 5, Figure 6, and Figure 7 show how well our proposed ODC algorithm outperforms the state-of-the-art deep clustering algorithms. For the MNIST dataset, ODC achieves 0.975 ACC and 0.955 NMI. For the CoAP off-path dataset, ODC accomplishes 0.983 ACC and 0.957 NMI.

VI. CONCLUSION
We proposed an intelligent anomaly detection algorithm, ODC, for extensive network traffic analysis. IoT environments produce a massive amount of data, and we need a mechanism/model to detect anomalies within these vast datasets. ODC optimizes the deep AutoEncoder to train the encoder. The latent version of the network traffic instances is fed into the BIRCH clustering algorithm for anomaly detection without human intervention. We demonstrated that ODC intelligently detects anomalies in vast datasets. We analyzed the influence of the B and T values on the ACC and NMI values of an input dataset. The performance of the ODC deep clustering algorithm is evaluated through our implementation, and the results presented in Table 2 clearly show that our proposed scheme exhibits better performance in comparison with existing schemes.

Future directions of our work would be experimenting with ODC metrics other than the Euclidean distance for anomaly detection. Our ODC anomaly detection method can be upgraded further by automating the whole anomaly detection model. This involves generating an alert message and sending it to the system or network administrator without delay. Also, the source causing the suspected network traffic can be identified and terminated or suspended from regular network communication in a fraction of a second.

REFERENCES
[1] A. Faigon, K. Narayanaswamy, J. Tambuluri, R. Ithal, S. Malmskog, and A. Kulkarni, "Machine learning based anomaly detection," U.S. Patent 10 270 788, Apr. 23, 2019.
[2] R. Sommer and V. Paxson, "Outside the closed world: On using machine learning for network intrusion detection," in Proc. IEEE Symp. Secur. Privacy, May 2010, pp. 305–316.
[3] S. Zhao, M. Chandrashekar, Y. Lee, and D. Medhi, "Real-time network anomaly detection system using machine learning," in Proc. 11th Int. Conf. Design Reliable Commun. Netw. (DRCN), Mar. 2015, pp. 267–270.
[4] T. Dunning and E. Friedman, Practical Machine Learning: A New Look at Anomaly Detection. Sebastopol, CA, USA: O'Reilly Media, 2014.
[5] A. G. Roselin, P. Nanda, S. Nepal, X. He, and J. Wright, "Exploiting the remote server access support of CoAP protocol," IEEE Internet Things J., vol. 6, no. 6, pp. 9338–9349, Dec. 2019.
[6] S. Bulusu, B. Kailkhura, B. Li, P. K. Varshney, and D. Song, "Anomalous example detection in deep learning: A survey," IEEE Access, vol. 8, pp. 132330–132347, 2020.
[7] R. Wang, K. Nie, T. Wang, Y. Yang, and B. Long, "Deep learning for anomaly detection," in Proc. 13th Int. Conf. Web Search Data Mining, 2020, pp. 894–896.
[8] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proc. Int. Conf. Mach. Learn., 2016, pp. 478–487.
[9] X. Guo, L. Gao, X. Liu, and J. Yin, "Improved deep embedded clustering with local structure preservation," in Proc. IJCAI, 2017, pp. 1753–1759.
[10] Y. Ren, N. Wang, M. Li, and Z. Xu, "Deep density-based image clustering," 2018, arXiv:1812.04287. [Online]. Available: http://arxiv.org/abs/1812.04287
[11] X. Guo, X. Liu, E. Zhu, and J. Yin, "Deep clustering with convolutional autoencoders," in Proc. Int. Conf. Neural Inf. Process. Cham, Switzerland: Springer, 2017, pp. 373–382.
[12] N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, "Deep unsupervised clustering with Gaussian mixture variational autoencoders," 2016, arXiv:1611.02648. [Online]. Available: http://arxiv.org/abs/1611.02648
[13] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," ACM SIGMOD Rec., vol. 25, no. 2, pp. 103–114, Jun. 1996.
[14] W. Wang, D. Yang, F. Chen, Y. Pang, S. Huang, and Y. Ge, "Clustering with orthogonal AutoEncoder," IEEE Access, vol. 7, pp. 62421–62432, 2019.
[15] M. Zamini and G. Montazer, "Credit card fraud detection using autoencoder based clustering," in Proc. 9th Int. Symp. Telecommun. (IST), Dec. 2018, pp. 486–491.
[16] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, "Deep autoencoding Gaussian mixture model for unsupervised anomaly detection," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–20.
[17] N. Mrabah, N. M. Khan, R. Ksantini, and Z. Lachiri, "Deep clustering with a dynamic autoencoder: From reconstruction towards centroids construction," 2019, arXiv:1901.07752. [Online]. Available: http://arxiv.org/abs/1901.07752
[18] X. Peng, H. Zhu, J. Feng, C. Shen, H. Zhang, and J. T. Zhou, "Deep clustering with sample-assignment invariance prior," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 11, pp. 4857–4868, Nov. 2020.
[19] X. Peng, J. Feng, S. Xiao, W.-Y. Yau, J. T. Zhou, and S. Yang, "Structured AutoEncoders for subspace clustering," IEEE Trans. Image Process., vol. 27, no. 10, pp. 5076–5086, Oct. 2018.
[20] P. Harrington, Machine Learning in Action. Shelter Island, NY, USA: Manning Publications, 2012.
[21] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, pp. 1–58, Jul. 2009.
[22] D. Pokrajac, A. Lazarevic, and L. J. Latecki, "Incremental local outlier detection for data streams," in Proc. IEEE Symp. Comput. Intell. Data Mining, Mar. 2007, pp. 504–515.
[23] M. Salehi and L. Rashidi, "A survey on anomaly detection in evolving data: [With application to forest fire risk prediction]," ACM SIGKDD Explor. Newslett., vol. 20, no. 1, pp. 13–23, May 2018.
[24] M. Ahmed, A. N. Mahmood, and M. R. Islam, "A survey of anomaly detection techniques in financial domain," Future Gener. Comput. Syst., vol. 55, pp. 278–288, Feb. 2016.
[25] G. K. Rajbahadur, A. J. Malton, A. Walenstein, and A. E. Hassan, "A survey of anomaly detection for connected vehicle cybersecurity and safety," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 421–426.
[26] S. Jose, D. Malathi, B. Reddy, and D. Jayaseeli, "A survey on anomaly based host intrusion detection system," J. Phys., Conf. Ser., vol. 1000, no. 1, Apr. 2018, Art. no. 012049.
[27] T. Shon, Y. Kim, C. Lee, and J. Moon, "A machine learning framework for network anomaly detection using SVM and GA," in Proc. 6th Annu. IEEE Syst., Man Cybern. (SMC) Inf. Assurance Workshop, Jun. 2005, pp. 176–183.
[28] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, "A geometric framework for unsupervised anomaly detection," in Applications of Data Mining in Computer Security. Boston, MA, USA: Springer, 2002, pp. 77–101.
[29] J. Zhang and M. Zulkernine, "Anomaly based network intrusion detection with unsupervised outlier detection," in Proc. IEEE Int. Conf. Commun., vol. 5, Jun. 2006, pp. 2388–2393.
[30] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," 2019, arXiv:1901.03407. [Online]. Available: http://arxiv.org/abs/1901.03407
[31] F. Farahnakian and J. Heikkonen, "A deep auto-encoder based approach for intrusion detection system," in Proc. 20th Int. Conf. Adv. Commun. Technol. (ICACT), Feb. 2018, pp. 178–183.
[32] C. Aytekin, X. Ni, F. Cricri, and E. Aksu, "Clustering and unsupervised anomaly detection with l2 normalized deep auto-encoder representations," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2018, pp. 1–6.


[33] Y. Wang, Z. Shi, X. Guo, X. Liu, E. Zhu, and J. Yin, "Deep embedding for determining the number of clusters," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 1–2.
[34] J. Ding, A. Condon, and S. P. Shah, "Interpretable dimensionality reduction of single cell transcriptome data with deep generative models," Nature Commun., vol. 9, no. 1, pp. 1–13, Dec. 2018.
[35] M. Kriegerowski, G. M. Petersen, H. Vasyura-Bathke, and M. Ohrnberger, "A deep convolutional neural network for localization of clustered earthquakes based on multistation full waveforms," Seismol. Res. Lett., vol. 90, no. 2A, pp. 510–516, Mar. 2019.
[36] F. Li, H. Qiao, and B. Zhang, "Discriminatively boosted image clustering with fully convolutional auto-encoders," Pattern Recognit., vol. 83, pp. 161–173, Nov. 2018.
[37] S. M. Mousavi, W. Zhu, W. Ellsworth, and G. Beroza, "Unsupervised clustering of seismic signals using deep convolutional autoencoders," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 11, pp. 1693–1697, Nov. 2019.
[38] Y. Ren, K. Hu, X. Dai, L. Pan, S. C. H. Hoi, and Z. Xu, "Semi-supervised deep embedded clustering," Neurocomputing, vol. 325, pp. 121–130, Jan. 2019.
[39] Y. Gu, X. Lu, L. Yang, B. Zhang, D. Yu, Y. Zhao, L. Gao, L. Wu, and T. Zhou, "Automatic lung nodule detection using a 3D deep convolutional neural network combined with a multi-scale prediction strategy in chest CTs," Comput. Biol. Med., vol. 103, pp. 220–231, Dec. 2018.
[40] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," 2015, arXiv:1511.07289. [Online]. Available: http://arxiv.org/abs/1511.07289
[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[42] A. Gulli and S. Pal, Deep Learning With Keras. Birmingham, U.K.: Packt, 2017.
[43] The MNIST Database of Handwritten Digits. Accessed: Feb. 4, 2020. [Online]. Available: http://yann.lecun.com/exdb/mnist/

PRIYADARSI NANDA received the bachelor's degree in computer engineering, the master's degree in computer engineering, and the Ph.D. degree in computing science. He is currently a Senior Lecturer with the University of Technology Sydney (UTS), with extensive experience in research and development of cyber security, IoT security, and wireless sensor network security. His most significant work has been in the areas of intrusion detection and prevention systems using image processing techniques, Sybil attack detection in IoT-based applications, and intelligent firewall design. He has successfully supervised ten research students in the past and is currently supervising eight research students in cyber security research. He has published more than 80 high-quality refereed research articles, including in the IEEE TRANSACTIONS ON COMPUTERS, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, and Future Generation Computer Systems (FGCS), as well as many ERA Tier A/A* conference papers. In 2017, his work in cyber security research earned him and his team the prestigious Oman Research Council's National Award for best research.

SURYA NEPAL received the B.E. degree from the National Institute of Technology, Surat, India, the M.E. degree from the Asian Institute of Technology, Bangkok, Thailand, and the Ph.D. degree from RMIT University, Australia. He is currently a Principal Research Scientist with the Information Engineering Laboratory, CSIRO Computational Informatics. He has more than 15 years of experience in computer science research, latterly with a specific focus on security, privacy, and trust in distributed systems. He has more than 100 publications to his credit, has edited or coauthored several books, and is the co-inventor of two patents. Much of his work appears in top international forums, such as VLDB, ICDE, ICWS, SCC, CoopIS, ICSOC, the International Journal of Web Services Research, the IEEE TRANSACTIONS ON SERVICES COMPUTING, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, ACM Computing Surveys, and ACM Transactions on Internet Technology.

ANNIE GILDA ROSELIN received the B.Tech. (Hons.) degree in information technology from the Noorul Islam College of Engineering, Anna University, Chennai, India, and the M.E. (Hons.) degree in computer science and engineering from the Mepco Schlenk Engineering College, Sivakasi, Anna University, Chennai, India. She is currently pursuing the Ph.D. degree with the University of Technology Sydney. She worked as a Lecturer with the Department of Computer Science and Engineering, Velammall Engineering College, Chennai, from 2008 to 2011. Her research interests include end-to-end IoT security: authentication, vulnerability exploration, and anomaly detection with deep learning. She is the successful recipient of the Industrial Scholarship funded by CSIRO/Data61, Australia.

XIANGJIAN HE (Senior Member, IEEE) is currently a Professor of computer science with the School of Electrical and Data Engineering, University of Technology Sydney (UTS), a Core Member of the Global Big Data Technologies Centre, and an Associate Member of the AAI (Advanced Analytics Institute). As a Chief Investigator, he has received various research grants, including four national research grants awarded by the Australian Research Council (ARC). He is also the Director of the Computer Vision and Pattern Recognition Laboratory, Global Big Data Technologies Centre (GBDTC), UTS. He is a leading researcher in several research areas, including big-learning based human behaviour recognition on a single image; image processing based on hexagonal structure; authorship identification of a document and a document's components, such as sentences and sections; network intrusion detection using computer vision techniques; car license plate recognition of high-speed moving vehicles with changeable and complex backgrounds; and video tracking with motion blur. He has been a member of the IEEE Signal Processing Society Student Committee. He has been awarded the Internationally Registered Technology Specialist title by the International Technology Institute (ITI).
