Autoencoder-based Feature Learning for Cyber Security Applications

Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey and Uday Tupakula

Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia. Email: mahmood.yousefi[email protected], [email protected] [email protected], [email protected]

Abstract—This paper presents a novel feature learning model for cyber security tasks. We propose to use Auto-encoders (AEs), as a generative model, to learn a latent representation of different feature sets. We show how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features. Specifically, the AE accepts a feature vector, obtained from cyber security phenomena, and extracts a code vector that captures the semantic similarity between the feature vectors. This similarity is embedded in an abstract latent representation. Because the AE is trained in an unsupervised fashion, the main part of this success comes from the appropriate original feature set that is used in this paper. It can also provide more discriminative features in contrast to other feature engineering approaches. Furthermore, the scheme can reduce the dimensionality of the features, thereby significantly minimising the memory requirements. We selected two different cyber security tasks: network-based anomaly intrusion detection and malware classification. We have analysed the proposed scheme with various classifiers using publicly available datasets for network anomaly intrusion detection and malware classification. Several appropriate evaluation metrics show improvement compared to prior results.

I. INTRODUCTION

In the cyber space, security requires a wide range of technologies and processes to protect the range of devices, from computers, to smart phones, to networks, to the Internet of Things, as well as users and, importantly, data from intrusion, unauthorized access and destruction. To meet these requirements, cyber security defensive technologies include two conventional systems, namely network defence systems and host defence systems. Each of these systems is composed of different technologies and various layers such as intrusion detection, firewall and antivirus [1].

Intrusion Detection Systems (IDSs), in particular network-based IDSs, are special-purpose algorithms and tools to detect anomaly attacks on a networked system, and help determine and identify unauthorized usage, duplication, alteration as well as any destruction of data. Depending on the detection techniques, IDSs can be categorized into different approaches such as signature-based detection, anomaly-based detection and behaviour-based detection. The focus of our study is on network-based anomaly intrusion detection systems.

Machine learning approaches are being widely used for anomaly intrusion detection [2, 3]. The schemes are able to detect patterns of known and unknown attacks in supervised, unsupervised or semi-supervised training schemes. Normally, using labeled data in supervised schemes could result in better performance compared to unsupervised models. The utilization of labeled data, alongside the large number of cyber criminals attempting to gain access to the data, has motivated researchers to use semi-supervised algorithms that have the advantages of both supervised and unsupervised schemes. The application of the unsupervised scheme proposed in this paper is relevant for both supervised classifiers and semi-supervised algorithms.

Malicious software is intentionally developed to target computer systems or networks for different purposes such as stealing information, distribution of spam messages or even destruction of messages. Malware classically refers to malicious software such as computer viruses, Internet worms, trojans, adware and ransomware. Identification and removal of malware are a significant part of both network and host defence systems. Detection, clustering and classification of malware are major threads in cyber security and form important applications of malware analysis.

Malware analysis using machine learning has been receiving much attention in recent years, both in academia and in industry [1, 3, 4]. The main reason behind this is the capability to automatically identify malicious software compared to more tedious manual techniques. Another major reason behind the application of machine learning techniques for malware analysis is the emergence of zero-day malware, whose fingerprints or signatures are unknown to the software developers.

In this study, our aim is to consider both malware classification as well as malware detection. To do both detection and classification, we introduce a technique that achieves a richer feature space using deep auto-encoders (AEs). The AEs, as automatic feature learning models, can provide more discriminative features in contrast to other feature engineering approaches. In the literature, a wide range of feature sets have been used to identify anomaly intrusions and malware [5, 6, 7]. Examples of feature sets in the network-based anomaly intrusion detection application domain include network flow, source IP and port, destination IP and protocols. For malware analysis, the number of bytes, the entropy of the binary file, system calls and operation codes of assembly files have been commonly used. The AEs can learn the concept space from the original feature sets to achieve both these tasks.

Another advantage of our proposed scheme is the dimensionality reduction. In terms of tractability of a model, some classifiers require the observation of uncorrelated features. The two most commonly used statistical techniques to provide such features are Principal Component Analysis (PCA) and Zero Component Analysis (ZCA) [8, 9]. In practice, the greater the dimension of the feature space, the greater the memory required to compute the covariance matrix needed in either the PCA or the ZCA. In addition to a more discriminative feature space, the AEs can reduce the dimensionality of the features, thereby helping to reduce the memory needed to compute the covariance matrix.

More specifically, we have used the AEs to map the original feature space to a latent representation, with two unsupervised training stages. The motivation is that the AE as a generative model is capable of learning a reasonable notion of semantic similarity and the relation among input features [10, 11]. To evaluate the proposed scheme, we have carried out security analysis of the proposed scheme using two publicly available datasets.

In summary, the major contributions of this paper are the following:
• We introduce an unsupervised feature learning approach for the two different cyber security problems using AEs. Although AEs have been previously applied to cyber security, the proposed model has a unique training phase and topology compared to the previous works.
• We show how a single model with the same training model and topology can be quite effective for both malware classification and network-based anomaly intrusion detection. This is helpful when it comes to designing only one embedded security analysis tool for different systems.
• Our scheme uses almost the minimum number of features compared to other state-of-the-art algorithms. This makes the model more effective for real time protection.
• In addition to the limited number of original features, the proposed scheme generates a small set of latent features. The resulting rich and small latent representation makes it practical to implement in small devices such as the Internet of Things.

The rest of this paper is presented as follows. Section II briefly describes some previous relevant works on malware and intrusion detection using machine learning models and deep learning approaches. In Section III, we describe the AE model and its pre-training and training stages. Section IV presents the performance of the proposed scheme for both malware detection and classification. Subsection IV-A describes the experimental setup, and Subsection IV-B presents the results of the model. Section IV-B1 presents and discusses malware classification, and Section IV-B2 presents the results for intrusion detection. Section V concludes.

II. LITERATURE REVIEWS

We categorize the previous work on the application of machine learning to cyber security tasks into two types: deep learning models and other non-deep models. Our literature review is mainly related to deep learning models that are related to malware analysis and intrusion detection.

Deep learning models have been recently used to detect and classify malware in Microsoft Windows and Android [12, 13, 14, 15]. The models use a wide range of structures such as Convolutional Neural Networks (CNNs), Auto-encoders (AEs), Recurrent Neural Networks (RNNs) and Deep Belief Networks (DBNs) [13, 14, 15, 16, 17, 18, 19].

The paper [15] showed that a stacked de-noising AE can be a good model to distinguish malicious from non-malicious software (i.e. detect malware). The model is designed to handle malicious scripts (e.g. JavaScript code). Stacked de-noising AEs have also been used in portable executable file classification [16]. This model uses Application Program Interface (API) calls as the dynamic feature set and provides a signature for each malware, consisting of 30 codes. The paper [14] developed an AE-based model that uses Restricted Boltzmann Machines (RBMs). This model could successfully detect malware using a wide range of dynamic and static feature sets and convert them to 16 feature vector sets and 4 graph feature sets. The paper [17] proposed RNN-based AEs that can automatically learn the representation of a malware from its raw API calls. They manage to handle the difficulties of training a recurrent neural network.

In the literature, another area of interest has been outlier detection, where deep learning models have recently shown some promising results. The proposed models vary in the structure of the model used, the application purpose and the motivation behind the chosen strategy [20, 21, 22, 23]. The paper [24] introduced the application of a Variational AE for intrusion detection and showed that the Variational AE can perform well both for network-based intrusion detection and in detecting outliers. The paper [25] proposed a hybrid AE and Density Estimation Model for anomaly detection. The model is based on estimating the density in the compressed hidden-layer representation of the applied AE. Another paper [26] uses hybrid stacked de-noising AEs.

Our model uses hex-based features of portable executable files, without disassembling. The model does not require any feature selection phase as a pre-processing phase, which helps to improve the performance. The proposed scheme also provides a discriminative concept space to distinguish between a normal flow of network data and an anomalous flow. The model provides a fixed-size vector of 10 codes for both malware detection and classification tasks.

III. FEATURE LEARNING

The main purpose of unsupervised feature (or representation) learning is to provide a setting to map the original feature set into a different (or latent) representation that is suitable for a specific machine learning task. In representation learning, a model automatically learns both a specific task and the features themselves.

To achieve the goals of our study (i.e. learning a representation that is appropriate for detection and classification tasks), we have used an AE. Our motivation is that an AE offers a rich representation from which it should be possible to re-generate the original representation. This assumption is not task-specific and has motivated us to apply the AE to different cyber security tasks.

A. Auto-encoder

An AE (Figure 1) is a feed-forward network that learns to reconstruct the input $x$ [27]. The AE is trained to encode the input $x$, using a set of recognition weights, into a feature space $C(x)$. Then, the features (codes) $C(x)$ are converted into an approximate reconstruction $\hat{x}$ using a set of generative weights. The generative weights are obtained first by unrolling the weights of the encoder and then from a fine-tuning phase. Through the encoder (i.e. mapping to a hidden representation) the dimensionality of $x$ is reduced to the given number of codes. The codes are then mapped to $\hat{x}$ through the decoder. Because no labeled data is required in the training process, the algorithm is unsupervised. Indeed, a deep AE transforms the original input to a better representation over hierarchical features or representations, with each level of the hierarchy corresponding to a different level of abstraction.

AEs can have different topological structures. Figure 1 demonstrates our AE topology. The coding layer (a.k.a. bottleneck layer or discriminative layer) provides the latent representation. In general, an AE with a linear activation function in all units projects onto a subspace of the principal components of the input and provides no richer representation than PCA. However, an AE with non-linear activation functions is expected to learn more useful feature detectors [28].

Fig. 1. The topology of our AE. The decoder parameters are obtained from the unrolled parameters of the encoder, which are then fine-tuned. $x$ and $\hat{x}$ denote the input and the reconstructed input respectively. $h_i$ are the hidden layers and $w_i$ are the parameters. $\varepsilon$ denotes the modification of the parameters over the fine-tuning phase. The concept space in which the features/codes $C(x)$ are extracted is, in this scheme, the output of the bottleneck layer.

Mainly because each layer of our AE provides a different level of abstraction, we have used a fairly deep network. Additionally, the supplementary material¹ of [28] showed that a deep AE can have better performance than a shallow AE with the same number of parameters. Because our AE consists of many hidden layers, the back-propagated gradients of the error are very small for the lower layers.

To tackle the vanishing gradient problem, the parameters of a deep AE require a good initialization point. This initialization could significantly improve the performance of an AE in a variety of applications compared with random initialization [29, 30].

Our model has two training stages: pre-training and fine-tuning. Pre-training is an approach to find an appropriate starting point for the fine-tuning phase. The parameters obtained in the pre-training phase will be the initial weights of the fine-tuning phase.

Pre-training Stage

The two most common techniques to pre-train an AE and obtain the initialization weights are the stacked Restricted Boltzmann Machine (RBM) and the stacked de-noising AE [28, 30].

There are two reasons behind our choice of the RBM. First, because the training phase is based on a back-propagation algorithm, we believe that having two different search methods in the parameter space of the AE can be more efficient. Also, assuming that the observations are produced from a stochastic generative model, the RBM as a generative model can be an appropriate prototype to discover the feature detectors [27, 31].

The RBM (Figure 2) is an undirected graphical model that can be considered as a two-layer neural network with one layer of observable units and one layer of hidden units (i.e. feature detectors). The weighted connections are restricted to those between the hidden units and the visible units, symmetrically, and there are no connections between the units in the same layer.

Depending on the function of the network, the visible and hidden units could follow different distributions such as Bernoulli, Gaussian, multinomial or binomial. We have used Bernoulli-Bernoulli units, Gaussian-Bernoulli units and Bernoulli-Gaussian units (only for the bottleneck layer RBM). The energy function is bilinear for binary states of both visible and hidden units (i.e. Bernoulli-Bernoulli units) (see [32]):

$$E(x, h; \theta) = -\sum_{i \in \text{visible}} b_i x_i - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} x_i h_j w_{ij} \tag{1}$$

¹ http://www.cs.toronto.edu/~hinton/absps/science_som.pdf

where $w_{ij}$ is the weight between visible unit $x_i$ and hidden unit $h_j$, and $b_i$ and $a_j$ are their biases. $\theta = \{W, a, b\}$ denotes the set of all network parameters. The joint distribution $p(x, h; \theta)$ is defined through the energy function:

$$p(x, h; \theta) = \frac{\exp(-E(x, h; \theta))}{Z} \tag{2}$$

where $Z = \sum_{x,h} \exp(-E(x, h; \theta))$ is the partition function, which acts as a normalization constant. The marginal probability assigned to a visible vector is:

$$p(x; \theta) = \frac{\sum_{h} \exp(-E(x, h; \theta))}{Z} \tag{3}$$

Fig. 2. The schematic representation of the RBM.

Because this network is symmetric, the conditional probabilities for the Bernoulli-Bernoulli RBM are:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\Big(\sum_i w_{ij} x_i + a_j\Big) \tag{4}$$

$$p(x_i = 1 \mid h; \theta) = \frac{\exp\left(\sum_j w_{ij} h_j + b_i\right)}{1 + \exp\left(\sum_j w_{ij} h_j + b_i\right)} = f\Big(\sum_j w_{ij} h_j + b_i\Big) \tag{5}$$

where $f$ is the logistic sigmoid function. Assuming the visible units have a Gaussian distribution, the energy function is represented as follows:

$$E(x, h; \theta) = \sum_{i \in \text{visible}} \frac{(x_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} \frac{x_i}{\sigma_i} h_j w_{ij} \tag{6}$$

where $\sigma_i$ is the standard deviation of the $i$th visible unit. For Gaussian-Bernoulli units, the conditional probabilities become:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\Big(\sum_i w_{ij} x_i + a_j\Big) \tag{7}$$

$$p(x_i \mid h; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(x_i - b_i - \sum_j w_{ij} h_j\right)^2}{2}\right) = \mathcal{N}\Big(\sum_j w_{ij} h_j + b_i,\; 1\Big) \tag{8}$$

To estimate the parameters of the network, maximum likelihood estimation (equivalent to minimizing the negative log-likelihood) can be used. Taking the derivative of the log-probability of the inputs (the negative of the quantity being minimized) with respect to the weights gives:

$$\frac{\partial \log p(x)}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \log \sum_{h} p(x, h) = \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \tag{9}$$

where the angle brackets denote the expectation under the distribution given by the subscript that follows. This leads to a simple learning algorithm in which the parameter update rule is given by:

$$\Delta w_{ij} = \epsilon \left( \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \right) \tag{10}$$

where $\epsilon$ is a learning rate, $\langle x_i, h_j \rangle_{\text{data}}$ is the so-called positive phase contribution and $\langle x_i, h_j \rangle_{\text{recon}}$ is the so-called negative phase contribution. The positive phase decreases the energy of the observation and the negative phase increases the model energy. However, the computation of the expectation defined by the model is not easily tractable. [33] showed that maximizing the likelihood or log-likelihood of the data is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution and the equilibrium distribution over the visible variables. To compute this expectation, the k-step contrastive divergence (CD-k) approximation provides surprisingly good results [33]. We used CD-1, running one step of the Gibbs sampler, which is effective enough:

$$x = x^0 \xrightarrow{\;p(h \mid x^0)\;} h^0 \xrightarrow{\;p(x \mid h^0)\;} x^1 \xrightarrow{\;p(h \mid x^1)\;} h^1 \tag{11}$$

The RBM blocks can be stacked to form the topology of the desired AE. More precisely, in the pre-training phase, the AE is trained in a greedy layer-wise fashion using individual RBMs, where the output of one trained RBM is used as input to the next upper RBM block. The individual RBM blocks are then stacked on top of each other, and from the stacked RBM blocks the topology of the AE can be generated.

Global adjusting parameters

Figure 1 shows how the weights obtained from the pre-training are tied (i.e. unrolled) and used to initialise the deep AE. Globally adjusting parameters is indeed fine-tuning the parameters in an iterative way. We use the back-propagation algorithm to do this iterative tuning of parameters.

The whole network is trained in this phase. We use the cross-entropy error as the loss function for this unsupervised fine-tuning phase in order to have optimal reconstruction, given the encoding $C(x)$:

$$-\log p(x \mid C(x)) = -\frac{1}{N} \left[ \sum_{i=1}^{N} x_i \log \hat{x}_i + \sum_{i=1}^{N} (1 - x_i) \log(1 - \hat{x}_i) \right] \tag{12}$$

where $N$ is the total number of items of training data, $x$ is the input of the AE and $\hat{x} = f_\theta(C(x))$ is the reconstructed value.

IV.
PERFORMANCE EVALUATION

We have selected malware classification and network-based anomaly intrusion detection to evaluate the capability of the AE for the security application domain. To evaluate the performance of the learned representation, the metrics are the accuracy, the multi-class logarithmic loss (a.k.a. logistic loss, cross-entropy loss or simply Log Loss) and the confusion matrix. The metrics measure the performance of the applied classifiers: Gaussian Naive Bayes (GaussianNB), K-nearest-neighbor (K-NN), Support Vector Machines (SVMs) and Gradient Boosting (Xgboost). The classification accuracy shows the correct predictions of labels. The Log Loss function measures the uncertainty of the probabilities of a classifier compared to the true labels. That is, the function is the cross entropy between the distribution of the true labels and the predictions. For multi-class classification the Log Loss is defined as:

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{13}$$

where $N$ is the total number of samples, $M$ is the number of labels, $\log$ is the natural logarithm, $y_{ij}$ is the true label (i.e. 1 if item $i$ belongs to class $j$ and 0 otherwise) and $p_{ij}$ is the estimated probability that item $i$ belongs to class $j$.

The presented confusion matrix is a table that visualizes which classes the classifier confuses with one another.

A. Experiment Setup

We used the Microsoft Malware Classification Challenge (BIG 2015) dataset hosted at Kaggle² to analyse the AE-based representation. The dataset includes hexadecimal and assembly representations of 10868 labeled malware binary files from 9 different malware families: Ramnit³ (Ram), Lollipop⁴ (Lol), Kelihos ver3⁵ (Kel), Vundo⁶ (Vun), Simda⁷ (Sim), Tracur⁸ (Tra), Kelihos ver1⁹ (Kel), Obfuscator.ACY¹⁰ (Obf), Gatak¹¹ (Gat). This dataset is imbalanced. That is, the number of malware samples for each class (family) is not equal.

² https://www.kaggle.com/c/malware-classification/data
³ https://www.symantec.com/security_response/writeup.jsp?docid=2010-011922-2056-99
⁴ https://www.microsoft.com/security/portal/threat/encyclopedia/Entry.aspx?Name=Adware:Win32/Lollipop
⁵ https://en.wikipedia.org/wiki/Kelihos_botnet
⁶ https://www.symantec.com/security_response/writeup.jsp?docid=2004-112111-3912-99
⁷ https://www.microsoft.com/security/portal/threat/Encyclopedia/entry.aspx?Name=Trojan%3AWin32%2FSimda
⁸ https://www.symantec.com/security_response/writeup.jsp?docid=2011-071504-5259-99

We only use the hex dump files. The average size of the hex dump files is 4.8 MByte, while the biggest file is 56 MByte and the smallest is 110 KByte. A sample line of a hex file is:

00401370 8B 4C 24 04 8B D1 8D 81 D6 8D 82 F7 81 F2 60 4F

In the above sample, 00401370 indicates the offset and is not used in this paper.

The most successful feature set for this dataset is n-gram modelling [34]. Inspired by human language modelling, an n-gram is a contiguous sequence of n items (here, bytes) from a given sequence of the binary file. We use the unigram (a.k.a. 1-gram) probabilistic model.

With the unigram feature representation, each item in a byte sequence can take one of 256 different values; thus, the dictionary size is 256 (00 to FF hex). Instead of using the raw frequency of an item in a hex file (i.e. the number of times $f_t$ that item $t$ occurs in the hex file), we used the logarithmically scaled frequency as follows:

Unigram $= 1 + \log(f_t)$, or zero where $f_t$ is zero.

In short, each hex file turns into a vector of size 256 that is fed into the AE.

To keep the same setting as our baselines, we used 5-fold cross validation. Although the dataset is imbalanced, we randomly chose an equal proportion of each class for each fold.

For the anomaly intrusion detection task, we used the NSL-KDD¹² dataset, which consists of selected records of the KDD-CUP99 dataset. NSL-KDD was collected by analysing incoming network traffic and it has been widely used to develop network-based intrusion detection systems (NIDs).

NSL-KDD includes 125,973 train and 22,544 test records labeled either normal or anomaly (i.e. a network attack). The anomaly class has four categories; however, to distinguish normal from anomalous traffic and obtain a detection system, we merge all four categories into one class label. Briefly, the two-class classifiers are trained to distinguish a normal class from an anomaly class.

Each sample in NSL-KDD is represented with 41 features. The features are categorized into four feature sets: basic features (from the TCP/IP connection without inspecting the payload), content features (accessing the payload of the TCP packet), time-based traffic features (statistics in a 2 second time window) and host-based traffic features (within a historical window). Each of the 41 features is presented either as a continuous value or as a symbolic sign. We used the continuous value features intact while we replaced symbolic features with one-hot encoding. In the one-hot encoding each symbol is replaced with the state of the symbol.

To have an efficient topology for both experiments, we set our deep AE with a 150-90-50-10-50-90-150 structure. Here, 150 is the size of the first and last hidden layers and 10 is the size of the AE's bottleneck layer. The ten code units provide a ten-dimensional concept space to represent the feature space for both network-based intrusion attacks and malware families. The 10 units in the bottleneck of the AE are a constraint by which the AE is compelled to learn useful features. The activation function of all units is sigmoid except for the bottleneck layer, which is linear, in both pre-training and fine-tuning phases.
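The feature construction and topology described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the natural logarithm for the scaled frequency, uses small random weights as a stand-in for the RBM pre-trained parameters, sets the decoder biases to zero, and the function names (`unigram_features`, `init_encoder`, `unroll_decoder`, `encode`, `decode`) are hypothetical.

```python
import math
import numpy as np

SAMPLE_LINE = "00401370 8B 4C 24 04 8B D1 8D 81 D6 8D 82 F7 81 F2 60 4F"

def unigram_features(hex_lines):
    """256-dim vector of 1 + log(f_t) per byte value, zero where f_t is zero."""
    counts = np.zeros(256)
    for line in hex_lines:
        for token in line.split()[1:]:          # first token is the offset, unused
            if len(token) != 2:
                continue
            try:
                counts[int(token, 16)] += 1
            except ValueError:                  # skip tokens that are not hex bytes
                continue
    feats = np.zeros(256)
    seen = counts > 0
    feats[seen] = 1.0 + np.log(counts[seen])    # natural log is an assumption
    return feats

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_encoder(layer_sizes, rng):
    """Random placeholder weights; the paper initialises these from stacked RBMs."""
    return [(rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def unroll_decoder(encoder):
    """Decoder layers are the transposed encoder weights in reverse order."""
    return [(W.T.copy(), np.zeros(W.shape[0])) for W, _ in reversed(encoder)]

def encode(x, encoder):
    h = x
    last = len(encoder) - 1
    for k, (W, b) in enumerate(encoder):
        z = h @ W + b
        h = z if k == last else sigmoid(z)      # bottleneck layer is linear
    return h

def decode(code, decoder):
    h = code
    for W, b in decoder:
        h = sigmoid(h @ W + b)
    return h

rng = np.random.default_rng(0)
features = unigram_features([SAMPLE_LINE])
encoder = init_encoder([256, 150, 90, 50, 10], rng)   # 256 inputs, 150-90-50-10 encoder
decoder = unroll_decoder(encoder)                     # mirrors to 50-90-150-256
codes = encode(features[None, :], encoder)            # shape (1, 10)
reconstruction = decode(codes, decoder)               # shape (1, 256)
```

With untrained weights the codes carry no meaning; the sketch only shows the data flow from a hex dump line to the ten-dimensional code vector and back.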
⁹ https://en.wikipedia.org/wiki/Kelihos_botnet
¹⁰ https://www.microsoft.com/en-us/security/portal/threat/encyclopedia/Entry.aspx?Name=VirTool:Win32/Obfuscator.ACY
¹¹ https://www.symantec.com/security_response/writeup.jsp?docid=2012-012813-0854-99
¹² http://nsl.cs.unb.ca/NSL-KDD/

For the layer-wise pre-training phase, that is, training four RBMs with 150, 90, 50 and 10 units, we run one step of the Gibbs sampler with 200 epochs [35]. In the fine-tuning phase, the whole network is trained using mini-batch Conjugate Gradient with line search for 1500 epochs [36]. The loss function in the training phase is the cross-entropy error. We have used the same hyper-parameters, obtained on the basis of the practical guides presented in [31, 37], in both experiments.

Extracted codes from the AE are fed into the classifiers. Although there is no general agreement in the literature on the input of classifiers, it is beneficial to apply a linear transformation such as a whitening transformation. Statistical whitening decorrelates the features so that their covariance matrix is the identity. To do so, we used the Principal Component Analysis (PCA) whitening algorithm [8].

We used the sklearn library (for Python 2.7.12) to implement the classifiers. Assuming the features have a Gaussian distribution, we used Gaussian Naive Bayes. The hyper-parameters of K-NN were set using a grid search to achieve maximum accuracy. This grid search ranges from 1 to 20 for n_neighbors and uses either 'uniform' or 'distance' for weights. We also used a grid search to tune the C and gamma parameters of the SVM (with a radial basis function kernel) for maximum accuracy. This grid search ranges from $10^{-4}$ to $10^{4}$ for both the C and gamma parameters of the SVM. Xgboost has the same hyper-parameters as the first-place winner of the Kaggle competition [34].

We used the same topology for the AE and the same experimental conditions throughout our study to make the results more comparable.

B. Results and Discussion

Having the pre-processed data, we conducted experiments on two different cyber security threats: malware classification and network-based anomaly intrusion detection.

1) Malware classification: Table I shows how well the AE can provide a discriminative representation. As expected, among the classifiers, Gaussian Naive Bayes gains the most from the latent representation. Indeed, the AE can enrich the representation and insert the relation between the original features into the concept space, so assuming independent features will not drastically reduce the accuracy of Naive Bayes. Additionally, because a Bernoulli-Gaussian RBM is applied for the initialization of the bottleneck layer, the Gaussian Naive Bayes classifier can perform well. Both K-NN and SVM can also generate more accurate predictions. For K-NN and SVM, the error rate improves by 33% and 16% respectively. However, Xgboost handles the classification task better with the original unigram than with the latent representation generated by the AE.

| Classifiers | Accuracy with Unigram representation | Accuracy with AE representation |
|---|---|---|
| Naive Bayes | 66.2% (±4.45e-05) | 80.4% (±5.73e-04) |
| K-NN | 94.0% (±3.90e-05) | 96.0% (±3.26e-05) |
| SVM | 95.6% (±1.22e-05) | 96.3% (±1.13e-04) |
| Xgboost | 98.2% (±6.65e-06) | 95.7% (±2.53e-05) |

TABLE I. THE ACCURACY OF DIFFERENT CLASSIFIERS FOR THE ORIGINAL REPRESENTATION (UNIGRAM) AND LATENT REPRESENTATION GENERATED USING THE AE.

Table I also shows the possible variance of the accuracy for all the classifiers. The variance in all situations is small and, more importantly, without any overlap with each other.

In addition to the accuracy, the Log Loss metric has been used to analyse the classification performance for Xgboost. The result is presented in Table III. In contrast with the accuracy, the Log Loss is reduced and is lower than that of the other models in which the unigram has been used as the feature.

One of the key evaluations is shown in Table II. The Ram and Tra families are confused with the Gat family more than with other families. This issue is not due to the different number of samples for each class. Interestingly, Gat is a trojan that opens a backdoor on the compromised computer, while Ram is a worm which also functions as a backdoor and Tra is a trojan. This shows that the AE provides a clustered representation in which similar families are more likely to be predicted in place of one another than dissimilar families. Figure 3 illustrates the whitened features in two dimensions, visualized by t-SNE [39].

Fig. 3. The output of the coding layer that is applied to the classifiers.

Another important and mutual confusion of classes is between the Vun, Sim and Tra classes. All three classes are trojans. This confusion again provides grounds for understanding that the three classes of malware are indeed from one broader malware family, and possibly share the same pattern.

The important thing to notice is that, despite the imbalanced classes, Naive Bayes can provide a fairly good prediction across all the malware families, even for the Sim family with only 42 samples.

2) Network-based intrusion detection: The accuracy of the classifiers for the intrusion detection task also improves using the AE-based generated features compared to the original features (Figure 4). Similar to the
Similar to the Family Name Ram Lol Kel ver3 Vun Sim Tra Kel ver1 Obf Gat Ram (n = 1541) 45.13% 06.82% 00.19% 01.04% 00.39% 05.33% 00.32% 10.39% 30.39% Lol (n = 2478) 04.38% 91.12% 00.00% 00.24% 00.04% 00.16% 01.22% 01.87% 00.97% Kel ver3 (n = 2942) 00.00% 10.82% 89.05% 00.00% 00.00% 00.07% 00.00% 00.00% 00.07% Vun (n = 0475) 00.00% 00.21% 00.00% 78.11% 06.11% 06.95% 01.47% 06.95% 00.21% Sim (n = 0042) 00.00% 00.00% 00.00% 00.00% 87.50% 05.00% 00.00% 07.50% 00.00% Tra (n = 0751) 04.00% 01.60% 01.47% 06.27% 05.73% 62.27% 01.47% 06.67% 10.53% Kel ver1 (n = 0398) 00.25% 01.52% 00.51% 00.00% 00.00% 03.80% 93.92% 00.00% 00.00% Obf (n = 1228) 03.10% 03.43% 00.65% 01.31% 00.90% 03.10% 01.39% 84.65% 01.47% Gat (n = 1013) 00.89% 00.10% 01.68% 00.00% 00.20% 03.47% 00.20% 07.03% 86.44%

TABLE II THE CONFUSION MATRIX OF WITH AE-BASED FEATURES AS THE INPUT OF THE CLASSIFIER. n ISTHENUMBEROFSAMPLES IN THE DATASET
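Each row of Table II appears to sum to 100%, which suggests the counts are normalised by the number of samples of the true family. A minimal sketch of that normalisation, assuming integer class labels and using toy data rather than the paper's predictions (the function name `confusion_percentages` is hypothetical):

```python
import numpy as np

def confusion_percentages(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix: entry (t, p) is the percentage of
    samples with true class t that were predicted as class p."""
    counts = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1.0
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes absent from y_true.
    safe_totals = np.where(row_totals == 0, 1.0, row_totals)
    return 100.0 * counts / safe_totals

# Toy labels for three classes (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 2, 0, 1, 1, 2, 0, 2, 2]
table = confusion_percentages(y_true, y_pred, 3)
```

Row normalisation makes families of very different sizes directly comparable, which matters for an imbalanced dataset.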

| Models | Log Loss |
|---|---|
| Xgboost (AE) | 0.0748 |
| Deep Learning H2O models [38] | 0.1810 |
| 1G [5] | 0.0764 |

TABLE III. THE LOG LOSS OF THE XGBOOST CLASSIFIER AND 2 BASELINE MODELS.
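The multi-class Log Loss of Eq. (13) can be computed as in the sketch below. It is an illustration under assumptions: labels are given as integer class indices, and predicted probabilities are clipped by a small `eps` to avoid log(0), a common practice that the paper does not mention. The function name `multiclass_log_loss` is hypothetical.

```python
import math
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative natural-log probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y_true = np.asarray(y_true)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so the inner sum over classes
    # reduces to the predicted probability of the true class.
    picked = probs[np.arange(n), y_true]
    return float(-np.mean(np.log(picked)))

# Toy predictions for three samples and three classes (illustrative only).
y_true = [0, 1, 2]
probs = [[0.8, 0.1, 0.1],
         [0.2, 0.7, 0.1],
         [0.1, 0.3, 0.6]]
loss = multiclass_log_loss(y_true, probs)
```

A confident wrong prediction is penalised heavily, which is why Log Loss can move differently from accuracy for the same classifier.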

classification task, the performance of the Gaussian Naive Bayes classifier improves significantly. Although K-NN cannot predict labels with the latent features as accurately as with the original features, SVM and Xgboost perform much better.

It is to be expected that not all classifiers gain performance from a representation derived from a particular dataset. We think it might be because of the distribution of the original data. However, the AE can provide a much more discriminative representation with which most classifiers perform better than with the original features as the input of the classifier.

Fig. 4. The accuracy of different classifiers for the original feature set and latent features generated using AE.

Table IV shows the comparison between our model and the other models. To have a fair comparison, single classifier models have been selected. AE-based features with a classifier that is not highly complicated (i.e. Gaussian Naive Bayes) can outperform other single classifier models and even a complex algorithm proposed in [40]. Although the second algorithm proposed by [40] performs better than our scheme, that baseline model is computationally more expensive, being built up from different modules of classifiers. Also, a classifier is required to be re-trained over the algorithm. In contrast, the proposed scheme of this paper uses only one classifier with one training stage.

| Models | KDDTest+ |
|---|---|
| Gaussian Naive Bayes Tree [7] | 82.02% |
| Fuzzy classifier [41] | 82.74% |
| Our Approach | 83.34% |
| Decision Tree [42] | 80.14% |
| Proposed algorithm-(Experiment-1) [40] | 82.41% |
| Proposed algorithm-(Experiment-2) [40] | 84.12% |

TABLE IV. THE ACCURACY OF OUR PROPOSED MODEL AND THE OTHER MODELS.

V. CONCLUSION

In this paper we proposed an unsupervised feature learning approach for malware classification and network-based anomaly detection using an auto-encoder (AE). Compared to previous work, the proposed scheme uses a unique training phase and topology. We showed how a single model with the same training model and topology can be quite effective for both malware classification and network-based anomaly intrusion detection. This is helpful when it comes to designing a shared embedded security analysis tool for different systems. For malware classification, our scheme uses raw byte features of portable executable files, without disassembling, and does not require any preprocessing to select features. For network intrusion detection, the proposed scheme also provides a discriminative concept space to distinguish between a normal flow of network data and an anomalous flow. The model produces a fixed-size vector of 10 codes for both classification and detection of attacks. Hence our scheme uses the minimum number of features compared to other state-of-the-art algorithms. This makes the model more computationally efficient for real time protection. In addition to the limited number of original features, the proposed scheme generates a small set of latent features. The resulting rich and small latent representation makes it practical for implementation in small devices such as the Internet of Things.

REFERENCES

[1] A. Patel, Q. Qassim, and C. Wills, “A survey of intrusion detection and prevention systems,” Information Management & Computer Security, vol. 18, no. 4, pp. 277–290, 2010.
[2] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-based network intrusion detection: Techniques, systems and challenges,” Computers & Security, vol. 28, no. 1, pp. 18–28, 2009.
[3] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Intrusion detection techniques in cloud environment: A survey,” Journal of Network and Computer Applications, 2016.
[4] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[5] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature extraction, selection and fusion for effective malware family classification,” in Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. ACM, 2016, pp. 183–194.
[6] K. Wang and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2004, pp. 203–222.
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis

in International Conference on Future Data and Security Engineering. Springer, 2016, pp. 141–152.
[22] T. Nolle, A. Seeliger, and M. Mühlhäuser, “Unsupervised anomaly detection in noisy business process event logs using denoising autoencoders,” in International Conference on Discovery Science. Springer, 2016, pp. 442–456.
[23] S. Potluri and C. Diedrich, “Accelerated deep neural networks for enhanced intrusion detection system,” in Emerging Technologies and Factory Automation (ETFA), 2016 IEEE 21st International Conference on. IEEE, 2016, pp. 1–8.
[24] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” 2015.
[25] M. Nicolau, J. McDermott et al., “A hybrid autoencoder and density estimation model for anomaly detection,” in International Conference on Parallel Problem Solving from Nature. Springer, 2016, pp. 717–726.
[26] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen, “A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks,” Sensors, vol. 16, no. 10, p. 1701, 2016.
[27] G. E. Hinton and R. S. Zemel, “Minimizing description length in an unsupervised neural network,” Preprint, 1997.
[28] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006a.
[29] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006b.
of the kdd cup 99 data set (2009),” in Proceedings of the 2009 IEEE [30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, Symposium on Computational Intelligence in Security and Defense “Stacked denoising autoencoders: Learning useful representations in a Applications (CISDA 2009), 2009. deep network with a local denoising criterion,” Journal of Machine [8] A. Kessy, A. Lewin, and K. Strimmer, “Optimal whitening and decor- Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010. relation,” arXiv preprint arXiv:1512.00809, 2015. [31] G. Hinton, “A practical guide to training restricted boltzmann machines,” [9] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from Momentum, vol. 9, no. 1, p. 926, 2010. tiny images,” 2009. [32] J. J. Hopfield, “Neural networks and physical systems with emergent [10] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International collective computational abilities,” Proceedings of the national academy Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009. of sciences, vol. 79, no. 8, pp. 2554–2558, 1982. [11] R. Salakhutdinov, “Learning deep generative models,” Annual Review [33] G. E. Hinton, “Training products of experts by minimizing contrastive of Statistics and Its Application, vol. 2, pp. 361–385, 2015. divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002. [12] A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, and S. Sami- [34] X. Wang, J. Liu, and X. Chen, “First place team: Say no to overfitting,” nathan, “subgraph2vec: Learning distributed representations of rooted 2015. sub-graphs from large graphs,” arXiv preprint arXiv:1606.08928, 2016. [35] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence [13] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, learning,” in Proceedings of the tenth international workshop on artifi- “Malware detection with deep neural network using process behavior,” cial intelligence and statistics. 
Citeseer, 2005, pp. 33–40. in Computer Software and Applications Conference (COMPSAC), 2016 [36] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 577–582. “On optimization methods for deep learning,” in Proceedings of the [14] L. Xu, D. Zhang, N. Jayasena, and J. Cavazos, “Hadm: Hybrid analysis 28th International Conference on Machine Learning (ICML-11), 2011, for detection of malware.” pp. 265–272. [15] Y. Wang, W.-d. Cai, and P.-c. Wei, “A deep learning approach for detect- [37] Y. Bengio, “Practical recommendations for gradient-based training of ing malicious javascript code,” Security and Communication Networks, deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2016. 2012, pp. 437–478. [16] O. E. David and N. S. Netanyahu, “Deepsign: Deep learning for [38] Malware classification: Distributed with spark. http:// automatic malware signature generation and classification,” in 2015 msan-vs-malware.com/. Accessed: 2016-11-11. International Joint Conference on Neural Networks (IJCNN). IEEE, [39] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms.” 2015, pp. 1–8. Journal of machine learning research, vol. 15, no. 1, pp. 3221–3245, [17] X. Wang and S. M. Yiu, “A multi-task learning model for malware 2014. classification with useful file access pattern from api call sequence,” [40] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. arXiv preprint arXiv:1610.05945, 2016. He, “Fuzziness based semi- approach for intrusion [18] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, detection system,” Information Sciences, 2016. “Malware classification with recurrent networks,” in 2015 IEEE In- [41] P. Kromer,¨ J. Platos,ˇ V. Sna´sel,ˇ and A. Abraham, “Fuzzy classification ternational Conference on Acoustics, Speech and Signal Processing by evolutionary algorithms,” in Systems, Man, and Cybernetics (SMC), (ICASSP). IEEE, 2015, pp. 1916–1920. 
2011 IEEE International Conference on. IEEE, 2011, pp. 313–318. [19] W. Huang and J. W. Stokes, “Mtnet: a multi-task neural network for dy- [42] M. Mohammadi, B. Raahemi, A. Akbari, and B. Nassersharif, “Class namic malware classification,” in Detection of Intrusions and Malware, dependent feature transformation for intrusion detection systems,” in and Vulnerability Assessment: 13th International Conference, DIMVA 2011 19th Iranian Conference on Electrical Engineering. IEEE, 2011, 2016, San Sebastian,´ Spain, July 7-8, 2016, Proceedings. Springer, pp. 1–6. 2016, pp. 399–418. [20] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning ap- proach for network intrusion detection system,” in In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016, pp. 21–26. [21] L. Bontemps, J. McDermott, N.-A. Le-Khac et al., “Collective anomaly detection based on long short-term memory recurrent neural networks,”