Can Distribute This PDF File for Research and Educational (Nonprofit) Purposes Only
Total Page:16
File Type:pdf, Size:1020Kb
VOL. E103-B NO. 1 JANUARY 2020 The usage of this PDF file must comply with the IEICE Provisions on Copyright. The author(s) can distribute this PDF file for research and educational (nonprofit) purposes only. Distribution by anyone other than the author(s) is prohibited. IEICE TRANS. COMMUN., VOL.E103–B, NO.1 JANUARY 2020 32 PAPER Special Section on Internet Architecture, Applications and Operation Technologies for a Cyber-Physical System IoT Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-Feature Information Chun-Jung WUya), Shin-Ying HUANGyy, Nonmembers, Katsunari YOSHIOKAy;yyy, and Tsutomu MATSUMOTOy;yyy, Members SUMMARY A drastic increase in cyberattacks targeting Internet of small change in a malware program could alter its signature Things (IoT) devices using telnet protocols has been observed. IoT mal- sufficiently to thwart antivirus detection. Therefore, an ur- ware continues to evolve, and the diversity of OS and environments in- gent necessity is to analyze IoT malware and related logs to creases the difficulty of executing malware samples in an observation set- ting. To address this problem, we sought to develop an alternative means recognize the behavior of unfamiliar threats and thus assist of investigation by using the telnet logs of IoT honeypots and analyzing organizations in mounting a timely and appropriate defense. malware without executing it. In this paper, we present a malware clas- Gartner estimated that 6.4 billion IoT devices were in sification method based on malware binaries, command sequences, and use in 2016, and this number is projected to grow to 20.8 meta-features. We employ both unsupervised or supervised learning algo- rithms and text-mining algorithms for handling unstructured data. Cluster- billion by 2020 [3]. Embedding information and commu- ing analysis is applied for finding malware family members and revealing nication technology into devices thus represents an ongoing their inherent features for better explanation. First, the malware binaries are trend. The nature of IoT presents challenges in establish- grouped using similarity analysis. Then, we extract key patterns of interac- ing a comprehensive security mechanism. These challenges tion behavior using an N-gram model. We also train a multiclass classifier arising from IoT devices are as follows: to identify IoT malware categories based on common infection behavior. For misclassified subclasses, second-stage sub-training is performed using 1. Most IoT devices are always online. a file meta-feature. Our results demonstrate 96.70% accuracy, with high 2. Most utilize simple, low-level hardware. precision and recall. The clustering results reveal variant attack vectors and one denial of service (DoS) attack that used pure Linux commands. 3. IoT devices have a variety of CPU architectures and key words: IoT malware, botnet, denial of service, text mining, N-gram, OS. classification, clustering 4. Antivirus and monitoring services are lacking. 5. Diverse developers result in a lack of unified standards. 1. Introduction 6. IoT malware attack patterns are continually evolving, with an extremely large damage scope. Internet of Things (IoT) is a network of physical devices, Few of them are protected by antivirus software. To such as vehicles, furniture, and buildings, that are embedded analyze the threat of IoT malware, Pa et al. [4] proposed with electronics, sensors, and networking abilities. Connec- IoTPOT, a honeypot that observes cyberattacks against IoT tivity enables these objects to collect and exchange data for devices, focusing on telnet-based attacks; it emulated IoT further application. However, in October 2016, the IoT mal- devices that accepting telnet protocol connections. When ware called Mirai executed the largest-ever distributed de- attackers access IoTPOT, it records the entire netflow and nial of service (DDoS) attack against Dyn DNS in 2016, en- maintains logs for further analysis, such as downloading listing approximately 100,000 Mirai IoT botnet nodes for samples. Since 2015, 6,016,030 download attempts from a reported attack rate of up to 1.2 Tbps [1]. According 1,085,664 different hosts have been successfully observed to the report from Kaspersky in Sept. 2018, Mirai is still and over 40,000 malware samples downloaded. Moreover, the most popular IoT malware family for cybercriminals 124,517,838 telnet session logs have been collected, record- (20.9%). Moreover, the most popular attack and infection ing all the shell command input sent by the attackers. IoT- vector against devices remains the telnet service (75.4%) POT thus represents a useful method of collecting samples, [2]. Although signature-based detection methods are sen- analyzing threat behavior, and understanding IoT cyberat- sitive to the structures of existing malware samples, even a tacks. However, the enormous data size resulted in huge Manuscript received February 15, 2019. time and resource cost when analyzing their patterns and re- Manuscript revised June 3, 2019. lationship. It is an urgent necessity to create an appropriate Manuscript publicized August 5, 2019. view which analyzes the incoming data in depth and utilize yThe authors are with the Graduate School of Environment and Information Sciences, Yokohama National University, Yokohama- our resource efficiently. shi, 240-8501 Japan. The purpose of this study is to apply machine learning yyThe author is with Cybersecurity Technology Institute, Insti- techniques to create a simplified and accurate view of IoT tute for Information Industry, 105 Taipei, Taiwan. cyberattacks. The method determines categories of malware yyyThe authors are with Institute of Advanced Sciences, Yoko- by analyzing its meta-features and command sequences. Its hama National University, Yokohama-shi, 240-8501 Japan. contributions may be summarized as follows: a) E-mail: [email protected] DOI: 10.1587/transcom.2019CPP0009 1. We proved that similar IoT malware binaries conduct Copyright c 2020 The Institute of Electronics, Information and Communication Engineers WU et al.: IOT MALWARE ANALYSIS AND NEW PATTERN DISCOVERY THROUGH SEQUENCE ANALYSIS USING META-FEATURE INFORMATION 33 similar infection commands. Moreover, through simi- based on mining dynamic program executions. They ana- larity analysis of command sequences, we can identify lyzed system calls from MS Windows and Linux and traced the malware category of unknown threats. system call sequences using a natural language processing 2. By clustering telnet logs, we discovered a new DoS cy- algorithm. They also applied the machine-learning algo- berattack executed using pure Linux commands, with- rithms Naive Bayes and Support Vector Machines (SVM), out IoT malware binaries. with SVM achieving a 99.5% worm detection rate and a 3. Using malware samples from the IoT honeypot, our 2.22% false positive rate [11]. proposed method could identify malware categories For Android malware, Ham et al. [12] extracted fea- with 96.70% accuracy. tures about the network, phone, message, CPU, battery, and memory for each process in Android devices. They apply a linear SVM to detect Android malware and compare the 2. Related Works malware detection performance of SVM with that of other machine learning classifiers. They show that the SVM out- In this section, we review the literature on the malware clas- performs other machine learning classifiers with 0.995 Ac- sification problem. Yen et al. (2013) conducted an epidemi- curacy and 0.957 Precision. ological study of malware encountered in a large, multina- Azmoodeh, Dehghantanha, and Choo [13] presented a tional enterprise. They collected security and network in- deep learning based method to detect Internet of Battlefield frastructure logs to determine the key behavioral features of Things (IoBT) malware via the device’s Operational Code web-based malware. Moreover, they used a logistical re- (OpCode) sequence. They transmuted OpCodes into a vec- gression model to identify and rank the malware risk [5]. tor space and apply a deep Eigenspace learning approach Masud, Khan, and Thuraisingham presented a method of de- to classify malicious and benign application. Their method tecting malicious executables that combined three types of could achieve 99.68% accuracy and 98.37% recall. features: binary N-grams, assembly instruction sequences, Su et al. [14] proposed a novel lightweight method of and dynamic-link library function calls [6]. detecting DDoS malware in IoT environments. First, one- In 2015, Microsoft and Kaggle held the Malware Clas- channel grayscale images converted from binaries were ex- sification Challenge (BIG 2015), in which Microsoft pro- tracted, and then a lightweight convolutional neural network vided 20,000 Windows malware binary and assembler code was used to classify IoT malware families. The experimen- files, with nine categories of malware. Contestants had to tal results indicated that this system could achieve 94.0% classify the malware categories as well as possible. The accuracy in goodware and DDoS malware classification and winning team extracted different features from the ASM file 81.8% accuracy in classification of goodware and the two opcode and gathered pixel data from malware disk images, main malware families. then applied an N-gram algorithm to predict the malware Our study examined Linux malware. Its major differ- category, thereby achieving 99.7% accuracy. Ahmadi et al. ence from other research lay in the dataset. We primarily subsequently used similar features to improve the classifica-