<<

Kitsune: An Ensemble of for Online Network Intrusion Detection

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici and Asaf Shabtai Ben-Gurion University of the Negev {yisroel, tomerdoi}@post.bgu.ac.il, {elovici, shabtaia}@bgu.ac.il

Abstract—Neural networks have become an increasingly popu- Over the last decade many machine learning techniques lar solution for network intrusion detection systems (NIDS). Their have been proposed to improve detection performance [2], [3], capability of learning complex patterns and behaviors make them [4]. One popular approach is to use an artificial neural network a suitable solution for differentiating between normal traffic and (ANN) to perform the network traffic inspection. The benefit network attacks. However, a drawback of neural networks is of using an ANN is that ANNs are good at learning complex the amount of resources needed to train them. Many network non-linear concepts in the input data. This gives ANNs a gateways and routers devices, which could potentially host an NIDS, simply do not have the memory or processing power to great advantage in detection performance with respect to other train and sometimes even execute such models. More importantly, machine learning algorithms [5], [2]. the existing neural network solutions are trained in a supervised The prevalent approach to using an ANN as an NIDS is manner. Meaning that an expert must label the network traffic and update the model manually from time to time. to train it to classify network traffic as being either normal or some class of attack [6], [7], [8]. The following shows the In this paper, we present Kitsune: a plug and play NIDS typical approach to using an ANN-based classifier in a point which can learn to detect attacks on the local network, without deployment strategy: supervision, and in an efficient online manner. Kitsune’s core algorithm (KitNET) uses an ensemble of neural networks called 1) Have an expert collect a dataset containing both normal autoencoders to collectively differentiate between normal and traffic and network attacks. abnormal traffic patterns. KitNET is supported by a feature 2) Train the ANN to classify the difference between normal extraction framework which efficiently tracks the patterns of and attack traffic, using a strong CPU or GPU. every network channel. Our evaluations show that Kitsune can detect various attacks with a performance comparable to offline 3) Transfer a copy of the trained model to the net- anomaly detectors, even on a Raspberry PI. This demonstrates work/organization’s NIDS. that Kitsune can be a practical and economic NIDS. 4) Have the NIDS execute the trained model on the observed network traffic. Keywords—Anomaly detection, network intrusion detection, on- line algorithms, autoencoders, ensemble learning. In general, a distributed deployment strategy is only prac- tical if the number of NIDSs can economically scale according I.INTRODUCTION to the size of the network. One approach to achieve this goal is to embed the NIDSs directly into inexpensive routers (i.e., The number of attacks on computer networks has been with simple hardware). We argue that it is impractical to use increasing over the years [1]. A common security system used ANN-based classifiers with this approach for several reasons: to secure networks is a network intrusion detection system (NIDS). An NIDS is a device or software which monitors all Offline Processing. In order to train a supervised model, all traffic passing a strategic point for malicious activities. When labeled instances must be available locally. This is infeasible such an activity is detected, an alert is generated, and sent to on a simple network gateway since a single hour of traffic may the administrator. Conventionally an NIDS is deployed at a contain millions of packets. Some works propose offloading single point, for example, at the Internet gateway. This point the data to a remote server for model training [9] [3]. However, deployment strategy can detect malicious traffic entering and this solution may incur significant network overhead, and does leaving the network, but not malicious traffic traversing the not scale. network itself. To resolve this issue, a distributed deployment Supervised Learning. The labeling process takes time and is strategy can be used, where a number of NIDSs are be expensive. More importantly, what is considered to be normal connected to a set of strategic routers and gateways within depends on the local traffic observed by the NIDS. Further- the network. more, in attacks change overtime and while new ones are constantly being discovered [10], so continuous maintainable of a malicious attack traffic repository may be impractical. Finally, classification is a closed-world approach to identifying concepts. In other words, a classifier is trained to identify the Network and Distributed Systems Security (NDSS) Symposium 2018 classes provided in the training set. However, it is unreasonable 18-21 February 2018, San Diego, CA, USA to assume that all possible classes of malicious traffic can be ISBN 1-891562-49-5 collected and placed in the training data. http://dx.doi.org/10.14722/ndss.2018.23204 www.ndss-symposium.org High Complexity. The computational complexity of an ANN ’s framework . ) as a generic online Kitsune can increase the packet ORK 1 W is available for download at: KitNET Kitsune ELATED KitNET II.R ), which is lightweight and plug-and-play. Kitsune A feature extractiontaining framework and for extractingnetwork implicit dynamically traffic. contextual main- footprint features The from since framework theover has damped statistics windows. a areAn small updated online memory technique incrementally ensemble of for autoencoders automatically (i.e.,inputs) mapping constructing features in to an the ANN unsupervisedthe manner. incremental The hierarchal method clustering(transpose involves of of the feature-space thecluster unbounded sizes. dataset),Experimental and results bounding on of surveillance an network, operational IoT IP network,attacks. and camera We a video also wideand variety demonstrate of ability the algorithm’sbenchmarks to efficiency, on run a Raspberry on PI. a simple router, by performing A novel -based NIDSvices for ( simple network de- To the best ofthe our knowledge, use we of areonline autoencoders the anomaly first detection with to inpresent or propose computer the without networks. core We ensemblesunsupervised also algorithm for ( anomaly detectionthe algorithm, source code and for provide download. The domain of using machine learning (specifically Several previous works have proposed online anomaly The rest of the paper is organized as follows: Section The reason we use autoencoders is because (1) they can The source code for • • • • 1 https://github.com/ymirsky/KitNET-py and it’s entireexperimental machine learning results pipeline. inrun-time Section terms performance. V Finally, of presents conclusion. in detection section performance VII we and present our anomaly detection) forresearched implementing in NIDSs was thethese extensively past solutions [13], usuallyresources [14], do [15], of not [16], themodel, have [17]. machine and any However, running thereforeexecute assumption training on are on or simple either the perform executing gateways, the too the or training expensive process. require to a train labeled and datasetdetection to mechanisms usingFor different example, lightweight the algorithms. PAYL IDS which models simple histograms II discusses relateddetection. work Section III inand provide the how a they domain work. background Section of on IV online autoencoders presents anomaly trained in an unsupervised manner,anomaly and (2) detection they in can bereason the used we for event propose ofis using a because an poor they ensemble reconstruction. area of The more small single efficient autoencoders, and autoencoder canexperiments, over be we the less found noisier sameprocessing that than feature rate space. byperformance From a our which factor rivalsdetectors. of other five, and an provide offlineIn a (batch) summary, the detection anomaly contributions of this paper as follows:

score 2

RMSE OutputLayer

… has an ensemble RMSE RMSE

RMSE , are not used in the : a novel ANN-based Kitsune ’s anomaly detection algo- benign RMSE anomaly detection algorithm or RMSE Kitsune RMSE Labels, which indicate explicitly

RMSE Kitsune Kitsune’s malicious After the training or executing the model The packet processing rate must exceed the has one main parameter, which is the maximum Ensemble LayerEnsemble . , no more than one instance is stored in memory at ) is illustrated in Fig. 1. First, the features of an KitNET Map KitNET Kitsune In light of the challenges listed above, we suggest that In this paper, we present The architecture of KitNET Fig. 1: An illustration of rithm grows exponentially with numberthat of an neurons ANN [11]. which Thisis is means restricted deployed on in afeatures terms simple which of network gateway, it itsgateways can architecture which and use. handle high number This velocity of is traffic. input especially problematic on the development of anwhich ANN-based is network to intrusion be detector, manner, deployed should and adhere trained to on routers the in followingOnline a restrictions: Processing. distributed whether a packet is with an instance,practice, the a instance smallgiven is time, number as immediately of done discarded. instances in stream In canUnsupervised clustering be [12]. Learning. stored at any training process. Other metaas information acquiring can the be information used does so notLow long delay Complexity. the process. expected maximum packetmust ensure arrival that rate.processed there In by is the other no model. queue words, of we packets awaiting toNIDS be which is online,in unsupervised, Japanese and folklore, efficient. isnumber A a of Kitsune, tails, mythical can fox-like mimicincreases creature different with that forms, experience. and has Similarly, whose a strength of small neuralto networks mimic (autoencoders), (reconstruct)performance which incrementally network improves are overtime. traffic trained patterns, and whose ( a time. instance are mappedNext, to each the autoencoder visiblefeatures, attempts neurons to and of reconstruct computesroot the the the instance’s ensemble. mean reconstruction squaredforwarded error to errors an in output (RMSE). autoencoder,voting terms which Finally, mechanism acts of for the as the aing RMSEs ensemble. non-linear We are note that while train- number of inputs forThis parameter any is given autoencoder useda in to modest increase the trade the ensemble. off algorithm’s in speed detection with performance. of packet content [18] or the kNN algorithm [19]. These III.BACKGROUND:AUTOENCODERS methods are either very simple and therefore produce very poor results, or require accumulating data for the training or Autoencoders are the foundation building blocks of Kit- detection. sune. In this section we provide a brief introduction to au- toencoders; what they are, and how they work. To describe the training and execution of an auto encoder we will refer to A popular algorithm for network intrusion detection is the example in Fig. 2. the ANN. This is because of its ability to learn complex concepts, as well as the concepts from the domain of network 푙 푙 푙 communication [17]. In [20], the authors evaluated the ANN, 1 2 3 among other classification algorithms, in the task of network 3 푥 푥ෞ 1 1 = ℎ푊,푏 푥റ intrusion detection, and proposed a solution based on an ℝ ensemble of classifiers using connection-based features. In [8], 푥 푥ෞ2

∈ 2 Learned Parameters: the authors presented a modification to the back propagation (1) (2)

റ 푊 = 푊 , 푊

푥 푇 푥 푥ෞ 2 algorithm to increase the speed of an ANN’s training process. 3 3 where, 푊 = 푊(1) In [7], the authors used multiple ANN-based classifiers, where each one was trained to detect a specific type of attack. In [9], +1 +1 푏 = 푏(1), 푏(2) the authors proposed a hierarchal method where each packet first passes through an anomaly detection model, then if an Fig. 2: An example autoencoder with one compression layer, anomaly is raised, the packet is evaluated by a set of ANN which reconstructs instances with three features. classifiers where each classifier is trained to detect a specific attack type. A. Artificial Neural Networks

All of the aforementioned papers which use ANNs, are ANNs are made up of layers of neurons, where each either supervised, or are not suitable for a simple network layer is connected sequentially via synapses. The synapses gateway. In addition, some of the works assume that the have associated weights which collectively define the concepts learned by the model. Concretely, let l(i) denote the i-th layer training data can be stored and accumulated which is not the (i) (i) case for simple network gateways. Our solution enables a plug- in the ANN, and let kl k denote the number of neurons in l . Finally, let the total number of layers in the ANN be denoted as and-play deployment which can operate at much faster speeds (i) (i+1) than the aforementioned models. L. The weights which connect l to l are denoted as the kl(i)k-by-kl(i+1)k matrix W (i) and kl(i+1)k dimensional bias vector ~b(i). Finally, we denote the collection of all parameters With regards to the use of autoencoders: In [21], the θ as the tuple θ ≡ (W, b), where W and b are the weights of authors used an ensemble of deep neural networks to address each layer respectively. Fig 2 illustrates how the weights form object tracking in the online setting. Their proposed method the synapses of each layer in an ANN. uses a stacked denoising autoencoder (SDAE). Each layer of the SDAE serves as a different feature space for the raw There are two kinds of layers in an ANN: visible layers image data. The scheme transforms each layer of the SDAE and hidden layers. The visible layer receives the input instance to a deep neural network which is used as discriminative ~x with an additional bias variable (a constant value of 1). ~x binary classifier. Although the authors apply autoencoders in is a vector of numerical features which describes the instance, an online setting, they did not perform anomaly detection, nor and is typically normalized to fall out approximately on the address the challenge of real-time processing (which is great range of [−1, +1] (e.g., using 0-1 normalization or zscore challenge with deep neural networks). Furthermore, training normalization)[25]. The difference between the visible layer a deep neural network is complex and cannot be practically and the hidden layers is that the visible layer is considered to performed on a simple network device. In [22] and [23], the be precomputed, and ready to be passed to the second layer. authors propose the use of autoencoders to extract features from datasets in order to improve the detection of cyber B. Executing an ANN threats. However, the autoencoders themselves were not used for anomaly detection. Ultimately, the authors use classifiers To execute an ANN, l(2) is activated with the output of l(1) to detect the cyber threats. Therefore, their solution requires an (i.e., ~x) weighted with W (1), then the output l(2) weighted with expert to label instances, whereas our solution is unsupervised, W (2) is used to activate l(3), and so on until the final layer has and plug-and-play. been activated. This process is known as forward-propagation. Let ~a(i) be the kl(i)k vector of outputs from the neurons in l(i). a(i+1) ~a(i) l(i+1) In [24], the authors proposed the generic use of an au- To obtain , we pass through by computing toencoder for detecting anomalies. In [19], the authors use   a(i+1) = f W (i) · ~a(i) + b~(i) (1) autoencoders to detect anomalies in power grids. These works differ from ours because (1) they are not online, (2) the archi- where f is what’s known as the neuron’s activation function. tecture used by the authors is not lightweight and scalable as A common activation function, and what we use in Kitsune, an ensemble, and (3) has not been applied to network intrusion is the sigmoid function, defined as detection. We note that part of this paper’s contribution is an appropriate feature extraction framework, which enables the 1 f (~x) = (2) use of autoencoders in the online network setting. 1 + e~x

3 Algorithm 1: The back-propagation algorithm for Algorithm 2: The back-propagation algorithm for performing batch-training of an ANN. performing stochastic-training of an ANN.

procedure: trainGD(θ, X, Y, max iter) procedure: trainSGD(θ, X, Y, max iter) 1 1 1 1 1 1 θ ← U(− kl(1)k , kl(1)k ). . random initialization θ ← U(− kl(1)k , kl(1)k ). . random initialization 2 cur iter ← 0 2 cur iter ← 0 3 while cur iter ≤ max iter do 3 while cur iter ≤ max iter do 0 4 A, Y ← hθ(X) . forward propagation 4 for xi in kXk do 0 0 5 deltas ← bθ(Y,Y ) . backward propagation 5 A, y ← hθ(xi) . forward propagation 0 6 θ ← GD`(A, deltas) . weight update 6 deltas ← bθ(~y, y~ ) . backward propagation 7 cur iter++ 7 θ ← GD`(A, deltas) . weight update 8 end 8 end 9 return θ 9 cur iter++ 10 end 11 return θ Finally, we define output of the last layer to be denoted as y~0 = a(L). Let the function h be the full layer-wise forward- propagation from the input ~x until the final output y~0, denoted D. Autoencoders

h (~x) = y~0 (3) An autoencoder is an artificial neural network which is θ trained to reconstruct it’s inputs (i.e., X = Y ). Concretely, during training, an autoencoder tries to learn the function C. Training an ANN hθ(~x) ≈ ~x (4) The output of an ANN depends on the training set, and the training algorithm. A training dataset is composed of instances It can be seen that an autoencoder is essentially trying to ~x ∈ X and the respective expected outputs ~y ∈ Y (e.g., the learn the identity function of the original data distribution. labels in classification). During training, the weights of the Therefore, constraints are placed on the network, forcing it ANN are tuned so that hθ(~x) = ~y. A common algorithm for to learn more meaningful concepts and relationships between ~ training an ANN (i.e., finding the optimum W and b given the features in ~x. The most common constraint is to limit the X,Y ) is known as the back-propagation algorithm [6]. number of neurons in the inner layers of the network. The The back-propagation algorithm involves a step where the narrow passage causes the network to learn compact encodings error between the predicted y0 and the expected y is propagated and decodings of the input instances. from the output to each neuron. During this process, the errors As an example, Fig. 2 illustrates an autoencoder which between a neuron’s actual and expected activations are stored. receives an instance ~x ∈ R3 at layer l(1), (2) encodes We denote this backward-propagation step as the function (compresses) ~x at layer l(2), and (3) decodes the compressed 0 bθ(Y,Y ). Given some execution of hθ, let A be every neuron’s representation of ~x at layer L(3). If an autoencoder is sym- activation and let Delta be the activation errors of all neurons. metric in layer sizes, the same (mirrored) weights can be used for coding and decoding [28]. This trick reduces the number Given the current A and Delta, and a set learning rate ` ∈ of calculations needed during training. For example, in Fig 2, (0, 1], we can incrementally optimize W and b by performing W (2) = [W (1)]T . the Gradient Descent (GD) algorithm [26]. All together, the back-propagation algorithm is as follows: E. Anomaly Detection with Autoencoders The above process is referred to as batch-training. This is because for each iteration, GD updates the weights accord- Autoencoders have been used for many different machine ing to the collective errors of all instances in X. Another learning tasks. For example, generating new content [29], and approach is stochastic-training, where Stochastic Gradient De- filtering out from images [30]. In this paper, we are scent (SGD) is used instead of GD. In SGD, the weights are interested in using autoencoders for anomaly detection. updated according to the errors of each instance individually. In general, an autoencoder trained on X gains the ca- With SGD, the back propagation algorithm becomes: pability to reconstruct unseen instances from the same data The difference between GD and SGD is that GD converges distribution as X. If an instance does not belong to the on an optimum better than SGD, but SGD initially converges concepts learned from X, then we expect the reconstruction faster [27]. For a more elaborate explanation of the back- to have a high error. The reconstruction error of the instance propagation algorithm, we refer the reader to [26]. ~x for a given autoencoder, can be computed by taking the root mean squared error (RMSE) between ~x and the reconstructed In Kitsune, we use SGD with a max iter of 1. In other output y~0. The RMSE between two vectors is defined as words, while we are in the training phase, if a new instance s Pn 2 arrives we perform a single iteration of the inner loop in (xi − yi) Algorithm 2, discard the instance, and then wait for the next. RMSE (~x,~y) = i=1 (5) n This way we learn only once from each observed instance, and remain an online algorithm. where n is the dimensionality of the input vectors.

4 Let φ be the anomaly threshold, with an initial value of Kitsune’s framework is composed of the following com- −1, and let β ∈ [1, ∞) be some given sensitivity parameter. ponents: One can apply an autoencoder to the task of anomaly detection by performing the following steps: • Packet Capturer: The external library responsible for ac- quiring the raw packet. Example libraries: NFQueue[31], afpacket[32], and tshark[33] (Wireshark’s API). 1) Training Phase: Train an autoencoder on clean (normal) • Packet Parser: The external library responsible for pars- data. For each instance xi in the training set X: ing raw packets to obtain the meta information required 3 a) Execute: s = RMSE (~x,hθ(~x)) by the Feature Extractor. Example libraries: Packet++ , b) Update: if(s ≥ φ) then φ ← s and tshark. • c) Train: Update θ by learning from xi Feature Extractor (FE): The component responsible for extracting n features from the arriving packets to create 2) Execution Phase: n When an unseen instance ~x arrives: creating the instance ~x ∈ R . The features of ~x describe the a packet, and the network channel from which it came. a) Execute: s = RMSE (~x,hθ(~x)) • Feature Mapper (FM): The component responsible for b) Verdict: if(s ≥ φβ) then Alert creating a set of smaller instances (denoted as v) from ~x, The process in which Kitsune performs anomaly detection and passing v to the in the Anomaly Detector (AD). This over an ensemble of autoencoders will be detailed later in component is also responsible for learning the mapping, section IV-D from ~x to v. • Anomaly Detector (AD): The component responsible for detecting abnormal packets, given a packet’s representa- F. Complexity tion v. In order to activate the layer l(i+1), one must perform the matrix multiplication W (i) ·~a(i) as described in (1). Therefore, Since the Packet Capturer and Packet Extractor are not the the complexity of activating layer l(i+1) is O l(i) · l(i+1).2 contributions of this paper, we will focus on the FE, FM, and Therefore, the total complexity of executing an ANN is de- AD components. We note that FM and AD components are pendent on the number of layers, and the number of neurons task generic (i.e., solely depend on the input features), and in each layer. The complexity of training an ANN on a therefore can be reapplied as a generic online anomaly detec- single instance using SDG (Algorithm 2) is roughly double tion algorithm. Moreover, we refer to the generic algorithm in the complexity of execution. This is because of the backward- the AD component as KitNET. propagation step. KitNET has one main input parameter, m: the maximum We note that autoencoders can be deep (have many hidden number of inputs for each autoencoder in KitNET’s ensemble. layers). In general, deeper and wider networks can learn This parameter affects the complexity of the ensemble in concepts which are more complex. However, as shown above, KitNET. Since m involves a trade-off between detection and deep networks can be computationally expensive to train runtime performance, the user of Kitsune must decide what and execute. This is why in KitNET we ensure that each is more important (detection rate vs packet processing rate). autoencoder is limited to three layers with at most seven visible This trade-off is further discussed later in section V. neurons. The FM and AD have two modes of operation: train-mode and exec-mode. For both components, train-mode transitions IV. THE KITSUNE NIDS into exec-mode after some user defined time limit. A compo- nent in train-mode updates its internal variables with the given In this section we present the Kitsune NIDS: the packet inputs, but does not generate outputs. Conversely, a component preprocessing framework, the feature extractor, and the core in exec-mode does not update its variables, but does produce anomaly detection algorithm. We also discuss the complexity outputs. of the anomaly detection algorithm and provide a bound on its runtime performance. In order to better understand how Kitsune works, we will now describe the process which occurs when a packet acquired, A. Overview as depicted in Fig. 3: Kitsune is a plug-and-play NIDS, based on neural net- 1) The Packet Capturer acquires a new packet and passes works, and designed for the efficient detection of abnormal the raw binary to the Packer Parser. patterns in network traffic. It operates by (1) monitoring the 2) The Packet Parser receives the raw binary, parses the statistical patterns of recent network traffic, and (2) detecting packet, and sends the meta information of the packet to anomalous patterns via an ensemble of autoencoders. Each the FE. For example, the packet’s arrival time, size, and autoencoder in the ensemble is responsible for detecting network addresses. anomalies relating to a specific aspect of the network’s be- 3) The FE receives this information, and uses it to retrieve havior. Since Kitsune is designed to run on simple network over 100 statistics which are used to implicitly describe routers, and in real-time, Kitsune has been designed with small the current state of the channel from which the packet memory footprint and a low computational complexity. came. These statistics form the instance ~x ∈ Rn, which is passed to the FM. 2The modern approach is to accelerate these operations using a GPU. However, in this paper, we assume that no GPU is available. This is the 3The Packet++ project can be found on GitHub: case for a simple network router. https://github.com/seladb/PcapPlusPlus

5 , } | := 2 (6) (1) (1) last Log  T O O ,... 2 score N LS i,λ . , where , and the ) ~x , x ) IS RMSE 1 2 i − x . The tuple x i is the time { N SS S + | t . For example, = R = OutputLayer S ,SS Anomaly Detector(AD) Anomaly i N, LS, SS 2 S … ∈ x , in contrast to σ

RMSE be the decay function ) i , can be updated incre- + Let

RMSE n x d := ( (

RMSE N LS S O λt , LS is the current weight, IS = − w S RMSE +1 µ

RMSE N ( ) = 2

RMSE t RMSE ( ← λ , where d ) IS last is is the decay factor, and ,T Ensemble LayerEnsemble IS are the number, linear sum, and squared sum ij . 0 2 S σ SS into i p x = λ > Feature Mapper (FM) Mapper Feature ’s Architecture. , and S σ In order to extract the current behavior of a data stream, we For this reason, we designed a framework for high speed We will now briefly describe how damped incremental The solution to this is to use damped incremental statistics. 1) Damped Incremental Statistics: can be a sequence of observed packet sizes. The mean, LS , w, LS, SS, SR and statistics at any given time are ( of a damped incremental statistic is defined as S variance, and standardmentally deviation by of maintaining theN tuple of instances seen soinserting far. Concretely, the update procedure for must forget older instances.a The naive sliding approach is windowmemory to and of maintain runtime values. complexity of However, this approach has a elapsed since the last observation from stream be an unbounded data stream where where windows. However, it clearin how terms this of can memory, and become doesn’t impractical scale very well. feature extraction of temporalber statistics, of over data aa dynamic streams small num- (network memorymaintained channels). footprint over The since a framework it dampedmeans has that uses window. Using the incremental extracted acent statistics features damped behavior are of window temporal the (capturestatistic packet’s channel), the can and re- that be anzero deleted incremental (saving when additional itscomplexity memory). dampening The because weight framework the becomes hasmaintained collection a of in incremental auseful statistics 2D hash statistics is which table.rx capture the and The relationship tx between framework traffic the of also a maintains connection. statistics work. Afterwards,features we extracted will by enumerate the FE the to statistical produce the instance for an incrementalapproach statistic. does Furthermore, the notthe sliding consider window. window the Forarrived amount example, over of the the last time last hour 100 spanned or instances by in could the have lastIn few a seconds. dampedexponentially window decreased over model, time. the Let defined as weight of older values are Kitsune 6 Feature Extractor (FE) Extractor Feature , the map is Damped Statistics Incremental is then used to , which is then ~x , then an alert is φβ from Fig. 3: An illustration of train-mode across all layers. If the v Packet Parser Packet v are discarded. to train the ensemble layer. into sets with a maximum to learn a feature map. The Kitsune v v ~x ~x , and forward-propagation ~x ... and stored for later use. ... ~x v φ Packet Capturer Packet : ..if and uses : ...and uses : ...and the learned mapping is used to create : ...and executes each. Nothing is passed to the AD until the m receives receives AD FM size of map is complete. Atpassed the to end the of the AD, ensemble and architecture the (eachan set AD autoencoder forms uses in the the theExec-mode inputs map ensemble). to to build a collection of smallpassed instances to thelayer respective autoencoders of in the AD. the ensemble Train-mode The RMSE of the map groups features of RMSE of the output layer exceeds Train-mode train the output layer.layer The is largest set RMSE as Exec-mode of the output logged with packet details. • • • • External Libs External Feature extraction is the process of obtaining or engineer- These are just some example of attacks where temporal- We discuss how the FE, FM, and AD components work in 5) The 4) The 6) The original packet, ing a vector ofIn values which network describe anomaly atures real detection, world which it observation. capture istraversing the important the context to network. extract andSYN For packet. fea- purpose example, The of packet considera may each a connection be packet a single withsimilar benign TCP a attempt packets server, to sent establish in orattack an it (DoS). attempt may to Assent be cause another a from one denial example, an ofof of consider IP millions service the a of surveillance packets videoconsistently camera. are stream Although significant legitimate, the risetraffic there contents is in being may . sniffed suddenly in This appear a may man-in-the-middle a attack. indicatestatistical the features could helpwith detect extracting anomalies. these The kindsthat challenge of (1) features from packetsinterleaved, network from traffic (2) is differentmoment, there channels (3) can (conversations) thenaive be are approach packet many is arrival tochannel, channels maintain rate a and at can window to of any be packets continuously given very from each compute high. statistics The over those B. Feature Extractor (FE) greater detail.

Channel 흀 Extracted Stats from Type (windows) Statistics all windows Description: All traffic… Example MI 15 …originating from the same MAC and IP address (00:1C:B3:09:85:15, 192.168.0.2) →* 5, 3, 1, H 푤 , 휇 , 휎 15 …originating from an IP address 192.168.0.2→* 0.1, 0.01 푖 푖 푖 HHjit 15 …from one IP to another (delta in arrival times) 192.168.0.2→192.168.0.5 HH 5, 3, 1, 푤푖, 휇푖, 휎푖, ‖푆푖, 푆푗‖, 35 …shared between two IP addresses 192.168.0.2↔192.168.0.5 …shared between two network sockets HpHp 0.1, 0.01 푅푆푖,푆푗, 퐶표푣푆푖,푆푗, 푃푆푖,푆푗 35 192.168.0.2:23523↔192.168.0.5:80

The packet’s… Statistics Aggregated by # Features Description of the Statistics

…size 휇푖, 휎푖 SrcMAC-IP, SrcIP, Channel, Socket 8 Bandwidth of the outbound traffic Bandwidth of the …size ‖푆 , 푆 ‖, 푅 , 퐶표푣 , 푃 Channel, Socket 8 푖 푗 푆푖,푆푗 푆푖,푆푗 푆푖,푆푗 outbound and inbound traffic together …count 푤푖 SrcMAC-IP, SrcIP, Channel, Socket 4 Packet rate of the outbound traffic We note that not every packet applies to every channel type TABLE I: Summary…jitter of the incremental푤푖, statistics휇푖, 휎푖 which can be Channel 3 Inter-packet delays of the outbound traffic (e.g., there is no socket if the packet does not contain a TCP computed from Si and Sj. or UDP datagram). In these cases, these features are zeroed. Type Statistic Notation Calculation Thus, the final feature vector ~x, which the FE passes to the FM, is always a member of Rn, where n = 115. Weight 푤 푤 Mean 휇 퐿푆⁄푤 1D 푆푖 C. Feature Mapper (FM) 2 Std. 휎푆푖 √|푆푆⁄푤 − (퐿푆⁄푤) | The purpose of the FM is to map ~x’s n features (dimen- 휇2 + 휇2 Magnitude ‖푆푖, 푆푗‖ √ 푆푖 푆푗 sions) into k smaller sub-instances, one sub-instance for each autoencoder in the Ensemble Layer of the AD. Let v denote 2 푅 2 2 2 Radius 푆푖,푆푗 √(휎 ) + (휎 ) the ordered set of k sub-instances, where 푆푖 푆푗 2D Approx. 푆푅푖푗 v = { ~v1, ~v2, ··· , ~vk} (7) 퐶표푣 Covariance 푆푖,푆푗 푤 + 푤 푖 푗 We note that the sub-instances of v can be viewed as subspaces 퐶표푣 Correlation 푆푖,푆푗 푃 of ~x’s domain X. Coefficient 푆푖,푆푗 휎 휎 푆푖 푆푗 In order to ensure that the ensemble in the AD operates effectively and with a low complexity, we require that the selected mapping f(~x) = v: is the timestamp of the last update of ISi,λ, and SRij is the sum of residual products between streams i and j (used for 1) Guarantee that each ~vi has no more than m features, computing 2D statistics). To update ISλ with xcur at time where m is a user defined parameter of the system. tcur, Algorithm 3 is performed. The parameter m affects the collective complexity of the ensemble (see section IV-E). Table I provides a list of the statistics which can be 2) Map each of the n features in ~x exactly once to the computed from the incremental statistic ISi,λ. We refer to the features in v. This is to ensure that the ensemble is not statistics whose computation involves one and two incremental too wide. statistics as 1D and 2D statistics respectively. 3) Contain subspaces of X which capture the normal behav- ior well enough to detect anomalous events occurring in Algorithm 3: The algorithm for inserting a new value the respective subspaces. into a damped incremental statistic. 4) Be discovered in a process which is online, so that no more than one at time is stored in memory. procedure: update(ISi,λ,xcur,tcur,rj) 1 γ ← dλ(tcur − tlast) . Compute decay factor To respect the above requirements, we find the mapping 2 ISi,λ ← (γw, γLS, γSS, γSR, Tcur) . Process decay 2 f by incrementally clustering the features (dimensions) of X 3 ISi,λ ← (w+1, LS+xcur,SS+xi ,SRij +rirj,Tcur) into k groups which are no larger than m. We accomplish . Insert value this by performing agglomerative hierarchal clustering on 4 return ISi,λ incrementally updated summary data. More precisely, the feature mapping algorithm of the FE 2) Features Extracted for Kitsune: Whenever a packet performs the following steps: arrives, we extract a behavioral snapshot of the hosts and 1) While in train-mode, incrementally update summary protocols which communicated the given packet. The snapshot statistics with features of instance ~x. consists of 115 traffic statistics capturing a small temporal 2) When train-mode ends, perform hierarchal clustering on window into: (1) the packet’s sender in general, and (2) the the statistics to form f. traffic between the packet’s sender and receiver. 3) While in execute-mode, perform f( ~xt) = v, and pass v Specifically, the statistics summarize all of the traffic... to the AD. • ...originating from this packet’s source MAC and IP In order to ensure that the grouped features capture normal address (denoted SrcMAC-IP). behavior, in the clustering process, we use correlation as the • ...originating from this packet’s source IP distance measure between two dimensions. In general, the (denoted SrcIP). correlation distance dcor between two vectors u and v is • ...sent between this packet’s source and destination IPs defined as (denoted Channel). (u − u¯) · (v − v¯) • ...sent between this packet’s source and destination dcor(u, v) = 1 − (8) TCP/UDP Socket (denoted Socket). k(u − u¯)k2k(v − v¯)k2 where u¯ is the mean of the elements in vector u, and u · v is A total of 23 features (capturing the above) can be extracted the dot product. from a single time window λ (see Table II). The FE extracts the same set of features from a total of five time windows: We will now explain how the correlation distances between 100ms, 500ms, 1.5sec, 10sec, and 1min into the past (λ = features can be summarized incrementally for the purpose of 5, 3, 1, 0.1, 0.01), thus totaling 115 features. clustering. Let nt be the number of instances seen so far. Let

7

Channel 흀 Extracted Stats from Type (windows) Statistics all windows Description: All traffic… Example MI 15 …originating from the same MAC and IP address (00:1C:B3:09:85:15, 192.168.0.2) →* 5, 3, 1, H 푤 , 휇 , 휎 15 …originating from an IP address 192.168.0.2→* 0.1, 0.01 푖 푖 푖 HHjit 15 …from one IP to another (delta in arrival times) 192.168.0.2→192.168.0.5 HH 5, 3, 1, 푤푖, 휇푖, 휎푖, ‖푆푖, 푆푗‖, 35 …shared between two IP addresses 192.168.0.2↔192.168.0.5 …shared between two network sockets HpHp 0.1, 0.01 푅푆푖,푆푗, 퐶표푣푆푖,푆푗, 푃푆푖,푆푗 35 192.168.0.2:23523↔192.168.0.5:80 TABLE II: The statistics (features) extracted from each time window λ when a packet arrives. The packet’s… Statistics Aggregated by # Features Description of the Statistics

…size 휇푖, 휎푖 SrcMAC-IP, SrcIP, Channel, Socket 8 Bandwidth of the outbound traffic Bandwidth of the …size ‖푆 , 푆 ‖, 푅 , 퐶표푣 , 푃 Channel, Socket 8 푖 푗 푆푖,푆푗 푆푖,푆푗 푆푖,푆푗 outbound and inbound traffic together …count 푤푖 SrcMAC-IP, SrcIP, Channel, Socket 4 Packet rate of the outbound traffic …jitter 푤푖, 휇푖, 휎푖 Channel 3 Inter-packet delays of the outbound traffic

1.0 However, our distance matrix is small (n being in the order of Type Statistic Notation Calculation a few hundred) and therefore practical to compute on-site. Weight 푤 푤 0.8 Finally, with the dendrogram of D, we can easily find k 1D Mean 휇푆푖 퐿푆⁄푤 2 clusters (groups of features) where no cluster is larger than m. Std 휎푆푖 √|푆푆⁄푤 − (퐿푆⁄푤) | 0.6 2 2 The procedure is to break the dendrogram’s largest link (i.e., Magnitude ‖푆푖, 푆푗‖ √휇 + 휇 푆푖 푆푗 the top most node) and then check if all found clusters have 2 푅 2 2 2 Radius 푆푖,푆푗 √(휎 ) + (휎 ) 푆푖 푆푗 a size less then m. If there is at least one cluster with a size 2D 0.4 Approx. 푆푅푖푗 greater than m, then we repeat the process on the exceeding Correlation Distance 퐶표푣푆푖,푆푗 Covariance 푤푖 + 푤푗 clusters. At the end of the procedure, we will have k groups 퐶표푣푆 ,푆 0.2 푃 푖 푗 of features with a strong inter-correlation, where no single Correl. Coef. 푆푖,푆푗 휎 휎 푆푖 푆푗 group is larger than m. These groupings are saved, and used to perform the mapping f. 0.0 9 6 0 3 8 2 5 7 1 4 72 66 69 75 78 73 67 70 76 79 85 86 99 92 93 49 35 42 56 63 50 36 43 97 83 90 57 64 58 77 12 27 54 61 47 33 40 94 80 87 34 41 48 55 62 24 51 74 44 71 21 30 65 15 37 68 18 23 14 29 11 26 17 20 46 53 60 32 39 22 13 28 10 25 16 19 45 52 59 31 38 95 81 88 96 82 89 98 84 91 100 113 114 106 107 104 111 108 101 102 109 103 110 105 112 Feature (dimension) ID The algorithm described in this section is suitable for Fig. 4: An example dendrogram of the 115 features clustered online and on-site processing because (1) it never stores more together, from one million network packets. than one instance in memory, (2) it uses very little memory (On2 during train-mode) and, (3) is very fast since the update procedures require updating small n-by-n distance matrix. ~c be an n dimensional vector containing the linear sum of (i) Pnt (i) each feature’s values, such that element c = t=0 xt for D. Anomaly Detector (AD) feature i at the time index t. Similarly, Let ~cr denote a vector containing the summed residuals of each feature, such that As depicted in Fig. 3, the AD component contains a special  (i)  (i) Pnt (i) c neural network we refer to as a KitNET (Kitsune NETwork). cr = xt − . Similarity, Let ~crs denote a vector t=0 nt KitNET is an unsupervised ANN designed for the task of containing the summed squared residuals of each feature, such online anomaly detection. KitNET is composed of two layers  (i) 2 (i) Pnt (i) c that crs = x − . Let C be the n-by-n partial of autoencoders: the Ensemble Layer and the Output Layer. t=0 t nt correlation matrix, where Ensemble Layer: An ordered set of k three-layer autoen- nt  (i)   (j)  coders, mapped to the respective instances in v. This layer is X (i) c (j) c [C ] = x − x − (9) responsible for measuring the independent abnormality of each i,j t n t n t=0 t t subspace (instance) in v. During train-mode, the autoencoders is the sum of products between the residuals of features i and j. learn the normal behavior of their respective subspaces. During Let D be the correlation distance matrix between each features both train-mode and execute-mode, each autoencoder reports of X. its RMSE reconstruction error to the Output Layer. Output Layer: A three-layer autoencoder which learns the Using C and ~crs, the correlation distance matrix D can be computed at any time by normal (i.e., train-mode) RMSEs of the Ensemble Layer. This layer is responsible for producing a final anomaly score, con- Ci,j sidering (1) the relationship between subspace abnormalities, D = [Di,j] = 1 − q q (10) (i) (j) and (2) naturally occurring noise in the network traffic. crs crs We will now detail how KitNET operates the Ensemble and Now that we know how to obtain the distance matrix Output Layers. D incrementally, we can perform agglomerative hierarchal 1) Initialization: When the AD receives the first set of clustering on D to find f. Briefly, the algorithm starts with mapped instances v from the FM, the AD initializes KitNET’s n clusters, one cluster for each point represented by D. It then architecture using v as a blueprint. Concretely, let θ denote an searches for the two closest points and joins their associated entire autoencoder, and let L(1) and L(2) denote the Ensemble clusters. This search and join procedure repeats until there and Output Layers respectively. L(1) is defined as the ordered is one large cluster containing all n points. The tree which set represents the discovered links is called a dendrogram (pictured L(1) = {θ , θ , . . . , θ } (11) in Fig. 4). For further information on the clustering algorithm, 1 2 k (1) we refer the reader to [34]. Typically, hierarchal clustering such that autoencoder θi ∈ L has three layers of neurons: cannot be performed on large datasets due to its complexity. dim(~vi) neurons in the input and output, and dβ · dim(~vi)e

8 Algorithm 4: The back-propagation training algo- Algorithm 5: The execution algorithm for KitNET. KitNET rithm for . procedure: execute(L(1),L(2), v) procedure: train(L(1),L(2), v) // Execute Ensemble Layer // Train Ensemble Layer (2) 1 ~z ← zeros(k) . init input for L (2) 1 ~z ← zeros(k) . init input for L (1) 2 for (θi in L ) do (1) 2 for (θ in L ) do 0 i 3 ~vi = norm0−1(~vi) 0 0 3 ~vi = norm0−1(~vi) 4 Ai, ~yi ← hθ (~v ) . forward propagation 0 i i 4 0 Ai, ~yi ← hθi (~vi) . forward propagation 5 ~z[i] ← RMSE(~v , ~y ) . set error signal 0 i i 5 deltasi ← bθi (~vi , ~yi) . backward propagation 6 end 6 θi ← GD`(Ai, deltasi) . weight update // Execute Output Layer 0 0 7 ~z[i] ← RMSE(~vi , ~yi) . set error signal 7 ~z = norm0−1(~z) 8 end 0 8 A0, ~y0 ← hθ0 (~z1 ) . forward propagation 0 // Train Output Layer 9 return ← RMSE(~z , ~y0) 0 0 9 ~z = norm0−1(~z) A0, ~y0 ← hθ0 (~z ) . forward propagation 0 10 deltas ← b (~z , ~y ) . backward propagation 0 θ0 0 exactly once. The algorithm for training KitNET on a single 11 θ0 ← GD`(A0, deltas0) . weight update (1) (2) instance is presented in Algorithm 4. 12 return L ,L We note that in order to perform the 0-1 normalization on line 3, each autoencoder must maintain a record of the largest and smallest value seen for each input feature. Furthermore, neurons in the inner layer, where β ∈ (0, 1] (in our experiments these maximum and minimum records are only updated during we take β = 3 ). Fig. 5 illustrates the described mapping 4 train-mode. ~ (1) between ~vi ∈ v and θi ∈ L . Similar to the discussion in section III-E, KitNET must be (2) L is defined as the single autoencoder θ0, which has k trained on normal data (without the presence of attacks). This input and output neurons, and dk · βe inner neurons. is a common assumption [35], [36] and is practical in many Layers L(1) and L(2) are not connected via the weighted types of computer networks, such as IP camera surveillance synapses found the common ANN. Rather, the inputs to systems. Furthermore, there are methods for filtering the L(2) are the 0-1 normalized RMSE error signals from each training data in order to reduce the impact of the possible respective autoencoder in L(1). Signaling the aggregated errors preexisting attacks in the network [37], [38]. (RMSEs) of each autoencoder in L(1), as opposed to signaling 3) Execute-mode: In execute-mode, KitNET does not up- from each individual neuron of L(1), reduces the complexity of date any of its internal parameters. Instead, KitNET performs the network. Finally, the weights of autoencoder θi in KitNET forward propagation through the entire network, and returns is initialized with random values from the uniform distribution L(2)’s RMSE reconstruction error. The execution procedure of U( −1 , 1 ). dim( ~vi) dim( ~vi) KitNET is presented in Algorithm 5. 2) Train-mode: Training KitNET is slightly different than L(2)’s reconstruction error measures the instance’s abnor- training a common ANN network, as described in section III. mality with respect to the relationships between the sub- This is because KitNET signals RMSE reconstruction errors spaces in v. For example, consider two autoencoders from (1) between the two main Layers of the network. Furthermore, the Ensemble Layer θi, θj ∈ L . If the of RMSE θi and KitNET is trained using SGD using each observed instance v θj correlate during train-mode, then a lack of correlation in execute-mode may be considered a more significant anomaly than say the simple sum of their independent RMSEs (similarly Feature Mapper (FM) Anomaly Detector (AD) in vice versa). Since the Output Layer L(2) learns these Ensemble Layer 휃1 relationships (and other complex relationships) during train- 휃2 mode, L(2)’s reconstruction error of L(1)’s RMSEs will reflect RMSE 휃3

RMSE these anomalies.

휃4 RMSE

RMSE 4) Anomaly Scoring: The output of KitNET is the RMSE

휃푛푎−2 anomaly score s ∈ [0, ∞)], as described in section III-E. The

휃푛푎−1 larger the score s, the greater the anomaly. To use s, one RMSE 휃푛푎 must determine an anomaly score cutoff threshold φ. The naive

RMSE approach is to set φ to the largest score seen during train-mode, RMSE where we assume that all instances represent normal traffic. Another approach is to select φ probabilistically. Concretely, one may (1) fit the outputted RMSE scores to log-normal or non-standard distribution, and then (2) raise an alert if s Fig. 5: An illustration of the mapping process between the has a very low probability of occurring. A user of KitNET FM and the Ensemble Layer of KitNET: a sub-instance ~vi is should decide the best method of selecting φ according to mapped from ~x, which is then sent to the autoencoder θi. his/her application of the algorithm. In section V, we evaluate Kitsune’s detection capabilities based on its raw RMSE scores.

9 E. Complexity once (see Algorithm 4). This can be contrasted to ANN-based classifiers which typically make multiple passes (epochs) over KitNET As a baseline, we shall compare the complexity of the training set. to a single three-layer autoencoder over the same feature space ~x ∈ n, with the compression layer ratio β ∈ (0, 1]. R V. EVALUATION The complexity of executing the single autoencoder is as In this section, we provide an evaluation of Kitsune in follows: In section III-F, we found that the complexity of terms of its detection and runtime performance. We open by activating layer l(i+1) in an ANN is O l(i) · l(i+1). Therefore, describing the datasets, followed by the experiment setup, and the complexity of executing the single autoencoder is finally, close by presenting our results. O(n · βn + βn · n) = O(n2) (12) A. Datasets The complexity of executing KitNET is as follows: We The goal of Kitsune is to provide a light weight IDS which remind the reader that k denotes the number of subspaces can handle many packets per second on a simple router. Given (autoencoders) selected by the FM, and that m is an input this goal, we evaluated Kitsune’s capabilities in detecting parameter of the system which defines the maximum number attacks in a real IP camera video surveillance network. The of inputs for any given autoencoder in L(1). The complexity of (1) (2) 2 2 network (pictured in Fig. 7) consists of two deployments executing L and L are O(km ) and O(k ) respectively. of four HD surveillance cameras each. The cameras in the Since the variable m ∈ 1, 2,..., 10 is a constant parameter of deployments are powered via PoE, and are connected to the the system, the total complexity of KitNET is DVR via a site-to-site VPN tunnel. The DVR at the remote site O(km2 + k2) = O(k2) (13) provides users with global accessibility to the video streams via a client-to-site VPN connection. The cameras used in the The result of (13) tells us that the complexity of the network, and their configurations are described in Table IV. Ensemble Layer scales linearly with n, but the complexity Fig. 6 pictures two of the eight cameras used in our setup. of the Output Layer depends on how many autoencoders There are a number of attacks which can performed on (subspaces) are designated by the FM. Concretely, the best case the video surveillance network. However, the most critical n scenario is where k = m , in which case performance increased attacks affect the availability and integrity of the video uplinks. by a factor of m. The worst case scenario is where the FM For example, a SYN flood on a target camera, or a man in designates nearly every single feature its own autoencoder in the middle attack involving video injection into a live video L(1). In this case, k = n, and KitNET operates as a single stream. Therefore, in our evaluation, we focused on these types wide autoencoder meaning that there is no performance gain. of attacks. Table III summarizes the attack datasets used in This will also occur if the user sets m = 1 or m = n. our experiments, and Fig. 7 illustrates the location (vector) of However, it is very rare for the latter case to occur on the attacker for each respective attack. The Violation column a natural dataset. This is because it would mean that the in Table III indicates the attacker’s security violation on the dendrogram from the FM’s clustering process in section IV-C network’s confidentiality (C), integrity (I), and availability (A). packet capture point is completely imbalanced. For example, All datasets where recorded from the as indicated in Fig. 7. (1) (2) (2) (3) (3) (4) dcor(x , x ) < dcor(x , x ) < dcor(x , x ) < ... To setup the active wiretap, we used a Raspberry PI 3B as a (14) i n physical network bridge. The PI was given a USB-to-Ethernet where x is the i-th dimension of R . Therefore, it can be adapter to provide the second Ethernet port, and then placed expected that in the presence of many features, KitNET will physically in the middle of the cable. have a runtime which is faster than a single autoencoder or stacked autoencoder. Finally, the complexity of training We note that for some of the attacks, the malicious packets KitNET is also O(k2) since we learn from each instance only did not explicitly traverse the router on which Kitsune is connected to. In these cases, the FE components implicitly captures these attacks as a result of statistical changes in the network’s behavior. For example, the man in the middle attacks affect the timing of the packets, but not necessarily the contents of the packets themselves. In order to evaluate Kitsune on a nosier network, we used an additional network. The additional network was a Wi-Fi network populated with 9 IoT devices, and three PCs. The IoTs were a thermostat, baby monitor, webcam, two different doorbells, and four different cheap security cameras. On this particular network, we infected a one of the security cameras with a real sample of the Mirai botnet malware.

B. Experiment Setup Fig. 6: Two of the cameras used in the IP camera video surveil- lance network. Left: SNC-EM602RC. Right: SNC-EB602R. Offline algorithms typically perform better than online algorithms. This is because offline algorithms have access to

10 TABLE III: The datasets used to evaluate Kitsune.

Attack Train Execute Attack Name Tool Description: The attacker… Violation Vector # Packets Type [min.] [min.] …scans the network for hosts, and their operating systems, to OS Scan Nmap C 1 1,697,851 33.3 18.9 reveal possible vulnerabilities. Recon. …searches for vulnerabilities in the camera’s web servers by Fuzzing SFuzz C 3 2,244,139 33.3 52.2 sending random commands to their cgis. Video Injection Video Jack …injects a recorded video clip into a live video stream. C, I 1 2,472,401 14.2 19.2 Man in the ARP MitM Ettercap …intercepts all LAN traffic via an ARP poisoning attack. C 1 2,504,267 8.05 20.1 Middle Raspberry …intercepts all LAN traffic via active wiretap (network bridge) Active Wiretap C 2 4,554,925 20.8 74.8 PI 3B covertly installed on an exposed cable. …overloads the DVR by causing cameras to spam the server SSDP Flood Saddam A 1 4,077,266 14.4 26.4 with UPnP advertisements. Denial of …disables a camera’s video stream SYN DoS Hping3 A 1 2,771,276 18.7 34.1 Service by overloading its web server. SSL …disables a camera’s video stream by sending many SSL THC A 1 6,084,492 10.7 54.9 ModelReneg otiationResolution Codec FPSrenegotiation Avg PPS packets Avg to BWthe camera. Protocol Botnet …infects IoT with the Mirai malware by exploiting default SNC-EM602RCMirai 1280x720Telnet H.264/MPEG4 15 195 1.8Mbit/s RTP C, I X 764,137 52.0 66.9 MalwareSNC -EB600 1280x720 H.264/MPEG4credentials , and15 then scans290 for new1.8Mbit/s vulnerable Https/TCP victims network . SNC-EM600 1280x720 H.264/MPEG4 15 350 1.4Mbit/s RTP SNC-EB602R 1280x720 H.264/MPEG4 15 320 1.8Mbit/s Http/TCP Remote Site VPN Attacker TABLE IV: The specifications and statistics of the cameras Switch DVR Router used in the experiments. Server 3 SNC- SNC- SNC- SNC-

EM602RC EM600 EB600 EB602R VPN Kitsune Router Resolution 1280x720 Client Codec H.264/MPEG4 Frames/Sec 15 Avg. Distribution A 195 350 290 320

Packets/Sec Network Surveillance 1 Avg. PoE Switch 1.8 Mbit/s 1.4 Mbit/s 1.8 Mbit/s 1.8 Mbit/s Bandwidth Distribution B Protocol RTP RTP Https Http/TCP 2 (TLSv1) Attacker

IoT Distribution Remote Site the entire dataset during training and can perform multiple IoT Server passes over the data. However, online algorithms are useful when resources, such as the computational power and memory, WiFi Router

are limited. In our evaluations, we compare Kitsune to both X IoT Network IoT online and offline algorithms. The online algorithms provide Kitsune Attacker a baseline for Kitsune as an online anomaly detector, and the offline algorithms provide a perspective on the Kitsune’s per- Fig. 7: The network topologies used in the experiments: the formance as a sort of upperbound. As an additional baseline, surveillance network (top) and the IoT network (bottom). we evaluate Suricata [39] –a signature-based NIDS. Suricata is an open source NIDS which is similar to the Snort NIDS, but is parallelized over multiple threads. Signature-based NIDS have a much lower false positive rate than anomaly-based NIDS. on the first million packets, and then executed on the remain- However, they are incapable of detecting unknown threats or der. The duration of the first million packets depends on the abnormal/abstract behaviors. We configured Suricata to use the packet rate of the network at the time of capture. For example, 13,465 rules from the Emerging Threats repository [40]. with the OS Scan dataset, each algorithm was trained on the first one million packets, and then executed on the remaining For the offline algorithms, we used Isolation Forests (IF) 697,851 packets. Table III lists the relative train and execute [41] and Gaussian Mixture Models (GMM) [42]. IF is an periods for each dataset. Each algorithm received the exact ensemble based method of outlier detection, and GMM is same features from Table II. a statistical method based on the expectation maximization algorithm.For the online algorithms, we used an incremental GMM from [43], and pcStream2 [44]. pcStream2 is a stream Kitsune has one main parameter, m ∈ {1, 2, . . . , n}, which clustering algorithm which detects outliers by measuring the is the maximum number of inputs for any one autoencoder Mahalanobis distance of new instances to known clusters. of KitNET’s ensemble. For our detection performance evalua- tions, we set m = 1 and m = 10. For all other algorithms, we For each experiment (dataset) every algorithm was trained used the default settings.

11 C. Evaluation Metrics The output of an anomaly detector (s) is a value on the range of [0, ∞), where larger values indicate greater anomalies (e.g., the RMSE of an autoencoder). This output is typically normalized such that scores which have a value less than 1 are normal, and greater than 1 are anomalies. The score s is normalized by dividing it by a cutt-off threshold φ. Choosing φ has a great effect on an algorithm’s performance. The detection performance of an algorithm, on a particular dataset, can be measured in terms of its true positives (TP ), true negatives (TN), false positives (FP ), and false negatives (FN). In our evaluations we measure an algorithm’s true TP positive rate (TPR = TP +FN ) and the false negative rate FN (FNR = FN+TP ) when φ is selected so that the false positive FP rate (FPR = FP +TN ) is very low (i.e., 0.001). We also count Fig. 8: KitNET’s RMSE anomaly scores before and during the number of true positives where the FPR is zero. These a fuzzing attack. The red lines indicate when the attacker measures capture the amount of malicious packets which were connects to the network (left) and initiates the attack (right). detected correctly, and accidentally missed with few and no false alarms. In network intrusion detection, is it important that there be a minimal number of false alarms since it is time As evident from Fig. 9, there is a trade-off between the consuming and expensive an analyst to investigate each alarm. detection performance and m (runtime performance). Users who prefer better detection performance over speed should use Fig. 8 plots KitNET’s anomaly scores before and during a an m which is close to 1 or n. Whereas users who prefer speed fuzzing attack. The lower blue line is the lowest threshold we should use a moderate sized m. The Kitsune gives the user the might select during the training phase which produces no FPs. ability to adjust this parameter according to the requirements of The upper blue line is the lowest threshold possible during the the user’s system. The affect of m on the runtime performance attack which produces no FPs (kind of like a global optimum). is presented in section V-E. Another way of looking at these two thresholds is like a As a baseline comparison to the performance of online best-case and worst case scenarios for threshold selection. algorithms we compared Kitsune to the incremental GMM Therefore, by measuring the performance of each at these two and pcStream2. Overall, it is clear that Kitsune out performs thresholds, we can get a better idea of an algorithm’s potential both algorithms in terms of AUC and EER. as an anomaly detector. The top row of figure 9 presents the maximum number To measure the general performance (i.e. with every pos- of true positives each algorithm was able to obtain, when the sible φ), we used the area under the receiver operating char- threshold was set so that here were no false positives (e.g., the acteristic curve (AUC), and the equal error rate (EER). In our blue dashed bar in Fig. 8). In other words, these figures show context, the AUC is the probability that a classifier will rank a how well each anomaly detector is able to raise the anomaly randomly chosen anomalous instance higher than a randomly score of malicious packets above the noise floor. The figures chosen normal instance. In other words, an algorithm with an shows that Kitsune detects attacks across the datasets better AUC of 1 is a perfect anomaly detector on the given dataset, than the other algorithms, and more so than the GMM in whereas an algorithm with an AUC of 0.5 is randomly guessing most cases. We note that Kitsune with m = 10 sometimes labels. The EER is a measure which captures an algorithm’s performs better than m = 1. This is because the Output Layer trade-off between its FNR and FPR. It is computed as the autoencoder lowers the noise floor of the ensemble. This affect value of FNR and FPR when they are minimal and equal can amplify the scores of some outliers. to one another. E. Runtime Performance D. Detection Performance One of Kitsune’s greatest advantages is its runtime per- Figure 9 presents TPR and FNR of each algorithm over formance. As discussed in section IV-E, KitNET’s ensemble each dataset when the threshold is selected so that the FPR is of small autoencoders is more efficient than using a single 0 and 0.001. The figure also presents the AUC and EER as autoencoder. This is because the ensemble reduces the overall well. We remind the reader that the GMM and Isolation Forest number of operations required to process each instance. are a batch (offline) algorithms, which have full access to the To demonstrate this effect, we performed benchmarks on entire dataset and perform many iterations over the dataset. a Raspberry PI 3B and an Ubuntu Linux VM running on a Therefore, these algorithms serve as an optimum goal for us Windows 10 PC (full details are available in Table V). The to achieve. In Fig. 9 we see that Kitsune performed very well experiments were coded in C++, involved n = 198 statistical relative to these algorithms. In particular, Kitsune performed packet features, and were executed on a single core (physical even better than the GMM in detecting the active wiretap. core on the PI and logical core on the Ubuntu VM). Moreover, our algorithm achieved a better EER than the GMM on the AR, Fuzzing, Mirai, SSL R., SYN and active wiretap Fig. 10 plots the affect which KitNET’s ensemble size has datasets. on the packet processing rate. With a single autoencoder in

12 xeiet eerno igecr,teei oeta to potential is there core, single the a since on that run note We were solution. experiments support NIDS distributed can reliable that resources, means and This limited NIDS. with an as router, Kitsune network simple a is traffic network in jitter may where This times. applications undesirable. processing in the beneficial in variance be the 11 with reduce times also Fig. processing can respectively. packet 37,300 PI’s the and at k 5400 look closer approx. a to provides autoencoders (top-left), 5 0.001 35 of to factor with equal However, (bottom-right). is respectively. EER FPR in the second the and when per (bottom-left), TPR AUC ets the the datasets: (top-right), the 0.001 L of to each equal on is algorithms FPR all the when for FNR results the experimental The 9: Fig. (1)

1 = AUC TPR TPR (sqrt−scale) 0.5 0.6 0.7 0.8 0.9 1.0 0.5 0.6 0.7 0.8 0.9 1.0 0.5 0.6 0.7 0.8 0.9 1.0 L 0.25 0.50 0.75 1.00 0.25 0.50 0.75 1.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 h eut fteRsbryP’ ecmr hwthat show benchmark PI’s Raspberry the of results The h IadP a adeapo.100ad750pack- 7,500 and 1,000 approx. handle can PC and PI the , (1) Area Under the Curve (AUC)Area UndertheCurve True Positive Rate(TPR)atFPR=0.001 True Positive Rate(TPR)atFPR=0 and

h efrac fbt niomnsipoeb a by improve environments both of performance the , Suricata 0.50003 0.50001 0.50001 0.50002 0.5 0.50001 0 0 0 Iso. Forest 0.58667 0.90702 0.59836 0.44725 0.1568 0.55628 0.0027 0 0 SYN DoS OS Scan k SYN DoS OS Scan 0.83353 0.94929 0.76709 SYN DoS GMM OS Scan ARP 0.23602 0.094 0.29056 0 0 0 ARP ARP 35 = Inc. GMM 0.67969 0.94692 0.59841 0.24524 0.09742 0.43881 0 0 0 pcStream 0.63469 0.74191 0.71699 0.6836 0.28033 0.65069 0 0 2e−06

hsfiuesosta sn nensemble an using that shows figure This . Kitsune (m=10) 0.95024 0.94805 0.58423 0.07735 0.09413 0.45239 0.000142 0 0 Kitsune (m=1) 0.92289 0.94806 0.79499 0.13557 0.09412 0.23152 0.000142 0 0

Suricata 0.5006 0.50001 0.5 0.4997 0.50001 0.5 0 0 0 0.51661 0.94839 0.95797 Iso. Forest 0.49067 0.09357 0.07753 0 0 0 Video Inj. SSDP F. Video Inj. Fuzzing Video Inj. SSDP F. Fuzzing SSDP F.

0.87324 0.99995 0.99988 Fuzzing GMM 0.20398 0.00056 0.00129 0 0 0.00012

Inc. GMM 0.68661 0.84836 0.93163 - Higher isbetter 0.34993 0.23274 0.92263 0 0 0 pcStream 0.57698 0.99626 0.99081 0.39979 0.97355 0.0316 0 0 0 Kitsune (m=10) 0.85378 0.99984 0.99995 0.228 0.00104 0.0012 0 0 0 Kitsune Kitsune (m=1) 0.80458 0.99997 0.99997 0.26756 0.00067 0.00091 0 0 0

Suricata 0.5 0.73159 0.50046 0.5 0.34539 0.50023 0 0 0 Iso. Forest 0.57378 0.87216 0.52547 0.54265 0.15408 0.51388 0 0 0 sa inexpensive an is Wiretap Wiretap SSL R. Wiretap SSL R. GMM 0.87878 0.96509 0.99928 0.21027 0.08449 0.01095 1e−06 5.4e−05 SSL R. 0 Mirai Mirai Mirai Inc. GMM 0.55266 0.88418 0.58598 0.52538 0.88022 0.44774 0.002835 1.7e−05 9.4e−05 pcStream 0.73931 0.8737 0.87399 0.74978 0.82166 0.81684 0 2.2e−05 0 Kitsune (m=10) 0.88221 0.97353 0.99919 0.17273 0.04603 0.00317 4.2e−05 2.2e−05 0.322297 Kitsune (m=1) 0.90876 0.9392 0.99857 0.1693 0.10761 0.0032 2.3e−05 2.2e−05 0.321032

EER FNR FNR 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 13 Equal ErrorRate(EER) False Negative Rate(FNR)atFPR=0.001 False Negative Rate(FNR)atFPR=0 Suricata 0.50002 0.5 0.50001 1 0.99998 1 1 1 1 nraetepce rcsigrtsfrhr oaheethis, achieve To parallelizing further. on rates plan processing we packet the increase ewr.A uuewr,i ol eitrsigt n a find to interesting be would it work, future a this As of network. aware installing be when should user risk a Regardless, themselves. present Kitsune first When evade to [45]. attempt learning may train-mode adversary machine advanced adversarial installed, an perform all, to off First consider. Iso. Forest 0.44725 0.1568 0.55628 0.9649 1 0.99853 0.9973 1 1 SYN DoS SYN DoS SYN DoS OS Scan OS Scan OS Scan I A VI. GMM 0.23602 0.094 0.29056 0.75391 0.98825 0.91667 1 1 1 hnusing When ARP ARP ARP Inc. GMM 0.24524 0.09742 0.43881 1 0.99035 0.99948 1 1 1 pcStream 0.6836 0.28033 0.65069 0.8842 1 1 1 1 0.999998 Kitsune ildtc e tak,adnwtrasa they as threats new and attacks, new detect will

Kitsune Kitsune (m=10) 0.07735 0.09413 0.45239 0.75391 0.99015 0.99953 0.999858 1 1 DVERSARIAL hrfr,apexsigavraymyb able be may adversary preexisting a Therefore, . Kitsune (m=1) 0.13557 0.09412 0.23152 0.75391 0.99014 0.99996 0.999858 1 1

Suricata 0.4997 0.50001 0.5 0.99873 1 1 1 1 1 Kitsune

sdtcin oee,during However, detection. ’s Iso. Forest 0.49067 0.09357 0.07753 0.96467 1 0.99976 1 1 1 -Lower isbetter sue htaltafi sbng hl in while benign is traffic all that assumes Video Inj. Video Inj. Video Inj. SSDP F. SSDP F. SSDP F. GMM 0.20398 0.00056 0.00129 Fuzzing 0.98968 0 0.00192 Fuzzing 1 1 0.99988 Fuzzing

Kitsune Inc. GMM 0.34993 0.23274 0.92263 0.99888 1 1 1 1 1 hr r eea set n should one aspects several are there , A pcStream 0.39979 0.97355 0.0316 0.99841 1 0.0513 1 1 1 KitNET

TTACKS Kitsune (m=10) 0.228 0.00104 0.0012 0.98941 0.00105 0.00541 1 1 1 Kitsune (m=1) 0.26756 0.00067 0.00091 0.98941 0.00062 0 1 1 1 naptnilycompromised potentially a on Suricata 0.5 0.34539 0.50023 0.99999 1 1 1 1 1 vrmlil cores. multiple over

C & Iso. Forest 0.54265 0.15408 0.51388 1 0.98585 0.99587 1 1 1 Wiretap Wiretap Wiretap GMM 0.21027 0.08449 SSL R. 0.01095 0.99877 0.98501 SSL R. 0.02994 0.999999 0.999946 SSL R. 1 Mirai Mirai Mirai

OUNTERMEASURES Inc. GMM 0.52538 0.88022 0.44774 0.99973 0.9956 0.99892 0.997165 0.999983 0.999906 pcStream 0.74978 0.82166 0.81684 0.99992 0.99461 0.99954 1 0.999978 1 Kitsune (m=10) 0.17273 0.04603 0.00317 0.99853 0.97533 0.00329 0.999958 0.999978 0.677703 Kitsune (m=1) 0.1693 0.10761 0.0032 0.99954 0.97491 0.00327 0.999977 0.999978 0.678968 execute-mode Anom-based (batch) Signature-based Anom-based (online) Kitsune (m=1) Kitsune (m=10) pcStream GMM Inc. GMM Iso. Forest Suricata

, Packet Processing Rate on Single Logical Core

40000

30000 PC: Exec−mode PC: Train−mode 20000 PI: Exec−mode PI: Train−mode

Rate [Packets/Sec] 10000

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Size of Ensemble Layer (k)

Fig. 10: The affect KitNET’s ensemble size (k) has on the average packet processing rate, while running on a single core of a Raspberry PI and an Ubuntu VM (PC), using n = 198 features.

TABLE V: The environments used to perform the benchmarks. can contain approximately 1,000 network links (assuming five damped windows per link). A good solution to maintaining Environment 1 Environment 2 a small memory footprint is to periodically search and delete Raspberry PI 3B Ubuntu VM incremental statistics with wi ≈ 0. In practice, the majority Type Broadcom BCM2837 Intel i7-4790 of incremental statistics remain in this state since we use CPU Clock 1.2GHz 3.60GHz relatively large λs (quick decay). Cores 4 4 (8 logical) RAM 1 GB 4 GB VII.CONCLUSION Kitsune is a neural network based NIDS which has been mechanism which can safely filter out potentially contaminated designed to be efficient and plug-and-play. It accomplishes instances during the training process. For example, we can first this task by efficiency tracking the behavior of all network execute and see if there is a high anomaly score. If there is, channels, and by employing ensemble of auto encoders (Kit- then we will not train from the instance (since we only want to NET) for anomaly detection. In this paper, we discussed the learn from benign instances). By doing so, we can potentially framework’s online machine learning process in detail, and remain in train-mode indefinitely. evaluated it in terms of detection and runtime performance. KitNET, an online algorithm performed nearly as well as other If there is a significant concern that the target network has batch / offline algorithms, and in some cases better. Moreover, been contaminated, then one may prefer to use a signature the algorithm is efficient enough to run on a single core of a based NIDS, such as Snort. The trade off is that a signature Raspberry PI, and has an even greater potential on stronger based NIDS cannot automatically detect new or abstract threats CPUs. (as demonstrated in the evaluation results). A good compro- mise would be to install an efficient NIDS (such as Snort 3.0 In summation, there is a great benefit in being able to de- or Suricata) alongside Kitsune. ploy an intelligent NIDS on simple network devices, especially when the entire deployment process is plug-and-play. We hope Another threat to Kitsune is a DoS attack launched against that Kitsune is beneficial for professionals and researchers the FE. In this scenario, an attacker sends many packets with alike, and that the KitNET algorithm sparks an interest in random IP addresses. by doing so, the FE will creates many further developing the domain of online neural network based incremental statistics which eventually consume the device’s anomaly detection. memory. Although this attack causes a large anomaly, the system may become instable. Therefore, it is highly recom- ACKNOWLEDGMENT mended that the user limit the number of incremental statistics The authors would like to thank Masayuki Nakae, NEC which can be stored in memory. One may note that with Corporation of America, for his feedback and assistance in a C++ implementation of Kitsune, roughly 1MB of RAM building the surveillance camera deployment. The authors would also like to thank Yael Mathov, Michael Bohadana, and Yishai Wiesner for their help in creating the Mirai dataset. k = 1 k = 35

6 REFERENCES 0.10 Mode [1] Marshall A Kuypers, Thomas Maillart, and Elisabeth Pate-Cornell.´ An Execute 4 empirical analysis of cyber security incidents at a large organization. Train Department of Management Science and Engineering, Stanford Univer- density 0.05 2 sity, School of Information, UC Berkeley, 30, 2016. [2] Dimitrios Damopoulos, Sofia A Menesidou, Georgios Kambourakis,

0.00 0 Maria Papadaki, Nathan Clarke, and Stefanos Gritzalis. Evaluation of 950 1000 1050 1100 175 200 225 250 275 anomaly-based ids for mobile devices using machine learning classi- Packet Processing Time [usec] Packet Processing Time [usec] fiers. Security and Communication Networks, 5(1):3–14, 2012. Fig. 11: Density plots of the packet processing times in the [3] Yinhui Li, Jingbo Xia, Silan Zhang, Jiakai Yan, Xiaochuan Ai, and Kuobin Dai. An efficient intrusion detection system based on support PI, with k = 1 (top), and k = 35 (bottom). vector machines and gradually feature removal method. Expert Systems with Applications, 39(1):424–430, 2012.

14 [4] Supranamaya Ranjan. Machine learning based botnet detection us- [25] T Jayalakshmi and A Santhakumaran. Statistical normalization and ing real-time extracted traffic features, March 25 2014. US Patent back propagationfor classification. International Journal of Computer 8,682,812. Theory and Engineering, 3(1):89, 2011. [5] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly [26] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009. machine learning. Mit Press, 2012. [6] Howard B Demuth, Mark H Beale, Orlando De Jess, and Martin T [27] Leon´ Bottou. Stochastic gradient descent tricks. In Neural networks: Hagan. Neural network design. Martin Hagan, 2014. Tricks of the trade, pages 421–436. Springer, 2012. [7] Nidhi Srivastav and Rama Krishna Challa. Novel intrusion detection [28] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learn- system integrating layered framework with neural network. In Advance ing machine: a new learning scheme of feedforward neural networks. Computing Conference (IACC), 2013 IEEE 3rd International, pages In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint 682–689. IEEE, 2013. Conference on, volume 2, pages 985–990. IEEE, 2004. [8] Reyadh Shaker Naoum, Namh Abdula Abid, and Zainab Namh Al- [29] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. Sultani. An enhanced resilient backpropagation artificial neural network An uncertain future: Forecasting from static images using variational for intrusion detection system. International Journal of Computer autoencoders. In European Conference on Computer Vision, pages 835– Science and Network Security (IJCSNS), 12(3):11, 2012. 851. Springer, 2016. [9] Chunlin Zhang, Ju Jiang, and Mohamed Kamel. Intrusion detec- [30] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and tion using hierarchical neural networks. Pattern Recognition Letters, Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning 26(6):779–791, 2005. useful representations in a deep network with a local denoising criterion. [10] Rebecca Petersen. Data mining for network intrusion detection: A Journal of Machine Learning Research, 11(Dec):3371–3408, 2010. comparison of data mining algorithms and an analysis of relevant [31] Yusuke Sugiyama and Kunio Goto. Design and implementation of a features for detecting cyber-attacks, 2015. network emulator using virtual network stack. In 7th International [11] Rupesh K Srivastava, Klaus Greff, and Jurgen¨ Schmidhuber. Training Symposium on Operations Research and Its Applications (ISORA08), very deep networks. In Advances in neural information processing pages 351–358, 2008. systems, pages 2377–2385, 2015. [32] Eric Leblond and Giuseppe Longo. Suricata idps and its interaction [12] Jonathan A Silva, Elaine R Faria, Rodrigo C Barros, Eduardo R with linux kernel. Hruschka, Andre´ CPLF de Carvalho, and Joao˜ Gama. Data stream [33] Borja Merino. Instant Traffic Analysis with Tshark How-to. Packt clustering: A survey. ACM Computing Surveys (CSUR), 46(1):13, 2013. Publishing Ltd, 2013. [13] Garcia-Teodoro et. al. Anomaly-based network intrusion detection: [34] Fionn Murtagh and Pedro Contreras. Methods of hierarchical clustering. Techniques, systems and challenges. computers & security, 28(1):18– arXiv preprint arXiv:1105.0121, 2011. 28, 2009. [35] Wenke Lee, Salvatore J Stolfo, et al. Data mining approaches for [14] Harjinder Kaur, Gurpreet Singh, and Jaspreet Minhas. A review of intrusion detection. In USENIX Security Symposium, pages 79–93. San machine learning based anomaly detection techniques. arXiv preprint Antonio, TX, 1998. arXiv:1307.7286, 2013. [36] Mingrui Wu and Jieping Ye. A small sphere and large margin [15] Taeshik Shon and Jongsub Moon. A hybrid machine learning approach approach for novelty detection using training data with outliers. IEEE to network anomaly detection. Information Sciences, 177(18):3799– transactions on pattern analysis and machine intelligence, 31(11):2088– 3821, 2007. 2092, 2009. [16] Taeshik Shon, Yongdae Kim, Cheolwon Lee, and Jongsub Moon. A [37] Niladri Sett, Subhrendu Chattopadhyay, Sanasam Ranbir Singh, and machine learning framework for network anomaly detection using svm Sukumar Nandi. A time aware method for predicting dull nodes and and ga. In Information Assurance Workshop, 2005. IAW’05. Proceedings links in evolving networks for data cleaning. In Web Intelligence (WI), from the Sixth Annual IEEE SMC, pages 176–183. IEEE, 2005. 2016 IEEE/WIC/ACM International Conference on, pages 304–310. [17] Anna L Buczak and Erhan Guven. A survey of data mining and IEEE, 2016. machine learning methods for cyber security intrusion detection. IEEE [38] Deepthy K Denatious and Anita John. Survey on data mining techniques Communications Surveys & Tutorials, 18(2):1153–1176, 2016. to enhance intrusion detection. In Computer Communication and [18] Ke Wang and Salvatore J Stolfo. Anomalous payload-based network Informatics (ICCCI), 2012 International Conference on, pages 1–5. intrusion detection. In RAID, volume 4, pages 203–222. Springer, 2004. IEEE, 2012. [19] Miao Xie, Jiankun Hu, Song Han, and Hsiao-Hwa Chen. Scalable [39] Suricata — open source ids / ips / nsm engine. https://suricata-ids.org/, hypergrid k-nn-based online anomaly detection in wireless sensor 11 2017. (Accessed on 11/14/2017). networks. IEEE Transactions on Parallel and Distributed Systems, [40] Index of /open/suricata/rules. https://rules.emergingthreats.net/open/ 24(8):1661–1670, 2013. suricata/rules/, 11 2017. (Accessed on 11/14/2017). [20] Srinivas Mukkamala, Andrew H Sung, and Ajith Abraham. Intrusion [41] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In detection using an ensemble of intelligent paradigms. Journal of Data Mining, 2008. ICDM’08. Eighth IEEE International Conference network and computer applications, 28(2):167–182, 2005. on, pages 413–422. IEEE, 2008. [21] Xiangzeng Zhou, Lei Xie, Peng Zhang, and Yanning Zhang. An [42] Douglas Reynolds. Gaussian mixture models. Encyclopedia of biomet- ensemble of deep neural networks for object tracking. In Image rics, pages 827–832, 2015. Processing (ICIP), 2014 IEEE International Conference on, pages 843– [43] Sylvain Calinon and Aude Billard. Incremental learning of gestures 847. IEEE, 2014. by imitation in a humanoid robot. In Proceedings of the ACM/IEEE [22] Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey, and Uday international conference on Human-robot interaction, pages 255–262. Tupakula. Autoencoder-based feature learning for cyber security ACM, 2007. applications. In Neural Networks (IJCNN), 2017 International Joint [44] Yisroel Mirsky, Tal Halpern, Rishabh Upadhyay, Sivan Toledo, and Conference on, pages 3854–3861. IEEE, 2017. Yuval Elovici. Enhanced situation space mining for data streams. In [23] Ahmad Javaid, Quamar Niyaz, Weiqing Sun, and Mansoor Alam. A Proceedings of the Symposium on Applied Computing, pages 842–849. deep learning approach for network intrusion detection system. In ACM, 2017. Proceedings of the 9th EAI International Conference on Bio-inspired [45] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubin- Information and Communications Technologies (formerly BIONETICS), stein, and JD Tygar. Adversarial machine learning. In Proceedings of pages 21–26. ICST (Institute for Computer Sciences, Social-Informatics the 4th ACM workshop on Security and artificial intelligence, pages and Telecommunications Engineering), 2016. 43–58. ACM, 2011. [24] Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoen- coders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.

15