Autoencoder-Based Feature Learning for Cyber Security Applications

Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey and Uday Tupakula
Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia.

Abstract: This paper presents a novel feature learning model for cyber security tasks. We propose to use Auto-encoders (AEs), as a generative model, to learn a latent representation of different feature sets. We show how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features. Specifically, the AE accepts a feature vector, obtained from cyber security phenomena, and extracts a code vector that captures the semantic similarity between the feature vectors. This similarity is embedded in an abstract latent representation. Because the AE is trained in an unsupervised fashion, the main part of this success comes from the appropriate original feature set used in this paper. The AE can also provide more discriminative features than other feature engineering approaches. Furthermore, the scheme can reduce the dimensionality of the features, thereby significantly reducing the memory requirements. We selected two different cyber security tasks: network-based anomaly intrusion detection and malware classification. We have analysed the proposed scheme with various classifiers using publicly available datasets for network anomaly intrusion detection and malware classification. Several appropriate evaluation metrics show improvement compared to prior results.

I. INTRODUCTION

In cyber space, security requires a wide range of technologies and processes to protect devices (from computers and smart phones to networks and the Internet of Things), users and, importantly, data from intrusion, unauthorized access and destruction. To meet these requirements, cyber security defensive technologies include two conventional systems, namely network defence systems and host defence systems. Each of these systems is composed of different technologies and various layers such as intrusion detection, firewall and antivirus [1].

Intrusion Detection Systems (IDSs), in particular network-based IDSs, are special-purpose algorithms and tools that detect anomaly attacks on a networked system and help determine and identify unauthorized usage, duplication, alteration and destruction of data. Depending on the detection technique, IDSs can be categorized into different approaches such as signature-based detection, anomaly-based detection and behaviour-based detection. The focus of our study is on network-based anomaly intrusion detection systems.

Machine learning approaches are widely used for anomaly intrusion detection [2, 3]. The schemes are able to detect patterns of known and unknown attacks in supervised, unsupervised or semi-supervised training schemes. Normally, using labeled data in supervised schemes can result in better performance than unsupervised learning models. The use of labeled data, together with the large number of cyber criminals attempting to gain access to data, has motivated researchers to use semi-supervised algorithms that combine the advantages of both supervised and unsupervised schemes. The unsupervised scheme proposed in this paper is relevant for both supervised classifiers and semi-supervised algorithms.

Malicious software is intentionally developed to target computer systems or networks for purposes such as stealing information, distributing spam messages or even destroying messages. Malware classically refers to malicious software such as computer viruses, Internet worms, trojans, adware and ransomware. Identification and removal of malware are a significant part of both network and host defence systems. Detection, clustering and classification of malware are major threads in cyber security and form important applications of malware analysis.

Malware analysis using machine learning has been receiving much attention in recent years, both in academia and in industry [1, 3, 4]. The main reason is the capability to identify malicious software automatically, in contrast to more tedious manual techniques. Another major reason for applying machine learning techniques to malware analysis is the emergence of zero-day malware, whose fingerprints or signatures are unknown to the software developers.

In this study, our aim is to consider both malware classification and malware detection. To do both detection and classification, we introduce a technique that achieves a richer feature space using deep auto-encoders (AEs). As automatic feature learning models, the AEs can provide more discriminative features than other feature engineering approaches. In the literature, a wide range of feature sets have been used to identify anomaly intrusions and malware [5, 6, 7]. Examples of feature sets in the network-based anomaly intrusion detection domain include network flow, source IP and port, destination IP and protocols. For malware analysis, the number of bytes, the entropy of the binary file, system calls and operation codes of assembly files have been commonly used. The AEs can learn the concept space from the original feature sets to achieve both these tasks.

Another advantage of our proposed scheme is dimensionality reduction. In terms of the tractability of a model, some classifiers require uncorrelated features. The two most commonly used statistical techniques to provide such features are Principal Component Analysis (PCA) and Zero-phase Component Analysis (ZCA) [8, 9]. In practice, the greater the dimension of the feature space, the greater the memory required to compute the covariance matrix needed in either PCA or ZCA. In addition to providing a more discriminative feature space, the AEs can reduce the dimensionality of the features, thereby helping to reduce the memory needed to compute the covariance matrix.

More specifically, we have used the AEs to map the original feature space to a latent representation, with two unsupervised training stages. The motivation is that the AE, as a generative model, is capable of learning a reasonable notion of semantic similarity and the relation among input features [10, 11]. To evaluate the proposed scheme, we have carried out a security analysis using two publicly available datasets.

In summary, the major contributions of this paper are the following:
• We introduce an unsupervised feature learning approach for two different cyber security problems using AEs. Although AEs have been previously applied to cyber security, the proposed model has a unique training phase and topology compared to previous works.
• We show how a single model with the same training model and topology can be quite effective for both malware classification and network-based anomaly intrusion detection. This is helpful when designing a single embedded security analysis tool for different systems.
• Our scheme uses almost the minimum number of features compared to other state-of-the-art algorithms. This makes the model more effective for real-time protection.
• In addition to the limited number of original features, the proposed scheme generates a small [...]

[...] learning models and other non-deep models. Our literature review mainly covers deep learning models related to malware analysis and intrusion detection.

Deep learning models have recently been used to detect and classify malware in Microsoft Windows and Android [12, 13, 14, 15]. The models use a wide range of structures such as Convolutional Neural Networks (CNNs), Auto-encoders (AEs), Recurrent Neural Networks (RNNs) and Deep Belief Networks (DBNs) [13, 14, 15, 16, 17, 18, 19].

The paper [15] showed that a stacked de-noising AE can be a good model to distinguish malicious from non-malicious software (i.e. detect malware). The model is designed to handle malicious scripts (e.g. JavaScript code). Stacked de-noising AEs have also been used for the classification of portable executable files [16]. This model uses Application Program Interface (API) calls as the dynamic feature set and provides a signature for each malware, consisting of 30 codes. The paper [14] developed an AE-based model that uses Restricted Boltzmann Machines (RBMs). This model could successfully detect malware using a wide range of dynamic and static feature sets and convert them to 16 feature vector sets and 4 graph feature sets. The paper [17] proposed RNN-based AEs that can automatically learn the representation of a malware from its raw API calls. They manage to handle the difficulties of training a recurrent neural network.

In the literature, another area of interest has been outlier detection, where deep learning models have recently shown some promising results. The proposed models vary in the structure of the model used, the application purpose and the motivation behind the chosen strategy [20, 21, 22, 23]. The paper [24] introduced the application of a Variational AE for intrusion detection and showed that the Variational AE can perform well both for network-based intrusion detection and for detecting outliers. The paper [25] proposed a hybrid AE and Density Estimation Model for anomaly detection. The model is based on estimating the density in the compressed hidden-layer representation of the applied AE. Another paper [...]
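
The introduction argues that the memory needed for the covariance matrix in PCA or ZCA grows with the feature dimension, and that a lower-dimensional AE code therefore reduces this cost. The following is a minimal sketch of that arithmetic, not the authors' implementation. The dimensions (1000 original features, a 10-dimensional code), the synthetic data and the helper `covariance_bytes` are assumptions chosen for illustration only, and the "code" matrix is a simple column slice standing in for the output of a trained AE.

```python
import numpy as np

def covariance_bytes(n_features: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_features x n_features covariance matrix."""
    return n_features * n_features * itemsize

# Hypothetical sizes, for illustration only (not taken from the paper).
original_dim = 1000   # length of the original feature vector
code_dim = 10         # length of the AE code vector

rng = np.random.default_rng(0)
X = rng.normal(size=(200, original_dim))   # 200 synthetic samples
Z = X[:, :code_dim]   # placeholder for AE codes; a real AE would compute these

# rowvar=False treats each column as a feature, so the result is d x d.
cov_X = np.cov(X, rowvar=False)   # shape (original_dim, original_dim)
cov_Z = np.cov(Z, rowvar=False)   # shape (code_dim, code_dim)

print("original features:", cov_X.shape, cov_X.nbytes, "bytes")
print("code vectors:     ", cov_Z.shape, cov_Z.nbytes, "bytes")
```

Because a dense covariance matrix has d × d entries, reducing the dimension by a factor of k reduces the matrix storage by roughly k². This is the effect the dimensionality-reduction argument relies on, whichever model produces the lower-dimensional codes.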
