Dllminer: Structural Mining for Malware Detection Masoud Narouei1, Mansour Ahmadi2 *, Giorgio Giacinto2, Hassan Takabi1 Andashkansami3
Total Page:16
File Type:pdf, Size:1020Kb
SECURITY AND COMMUNICATION NETWORKS Security Comm. Networks 2015; 8:3311–3322 Published online 22 April 2015 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/sec.1255 RESEARCH ARTICLE DLLMiner: structural mining for malware detection Masoud Narouei1, Mansour Ahmadi2 *, Giorgio Giacinto2, Hassan Takabi1 andAshkanSami3 1 Department of Computer Science and Engineering, University of North Texas, Denton, TX, U.S.A. 2 Department of Electrical and Electronic Engineering, University of Cagliari, Italy 3 CSE and IT Department, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran ABSTRACT Existing anti-malware products usually use signature-based techniques as their main detection engine. Although these methods are very fast, they are unable to provide effective protection against newly discovered malware or mutated variant of old malware. Heuristic approaches are the next generation of detection techniques to mitigate the problem. These approaches aim to improve the detection rate by extracting more behavioral characteristics of malware. Although these approaches cover the disadvantages of signature-based techniques, they usually have a high false positive, and evasion is still possible from these approaches. In this paper, we propose an effective and efficient heuristic technique based on static analysis that not only detect malware with a very high accuracy, but also is robust against common evasion techniques such as junk injection and packing. Our proposed system is able to extract behavioral features from a unique structure in portable executable, which is called dynamic-link library dependency tree, without actually executing the application. Copyright © 2015 John Wiley & Sons, Ltd. KEYWORDS malware analysis; dependency tree; closed frequent tree; evasion *Correspondence Mansour Ahmadi, University of Cagliari, Italy. E-mail: [email protected] 1. INTRODUCTION malware, such as byte sequences or strings, for detec- tion. The key advantage of these approaches is their high Malware is the short form for malicious software that con- detection rate and low false alarms. Another benefit of sists of different hostile code combinations. Malware is these approaches is their speed in extracting features mainly designed to gain unauthorized access, steal user’s because they do not need to run each executable file, which critical information, and sometimes cause damage to com- is a time consuming task. New “anti-anti-malware” tech- puters. Malware is categorized into viruses, worms, Trojan niques such as code packing, control-flow and entry point horses, spywares, rootkits, scareware, adware, riskware, obfuscation change the specific signature of discovered and so on. According to [1], “As much malware was pro- malware, making static analyzers toothless in detecting duced in 2007 as in the previous 20 years altogether.” In previously detected malware [6]. Although there are lots 2011, Kaspersky [2] stated that about 413 billion infections of learning-based detection techniques such as [7,8] or, have been detected by host-based anti-viruses. Symantec distance-based signature matching [9], evasion from them [3] encountered more than 286 million unique variants of is still possible for new malware variants. Therefore, a malware and also Web-attack toolkits, and a phenomenon malware detection system is actually a trade-off between called Polymorphism drive the number of malware variants detection rate and robustness that should be considered in common circulation. during development. Improving the intelligence of anti-malware software is Because of the weaknesses of static-based detection critical. A lot of effort has been put in the area of malware techniques, dynamic analysis techniques have made lots detection, which can be broadly categorized into static [4] of progress in recent years. In comparison to static-based, and behavioral-based [5] malware detection techniques. dynamic analysis techniques consider the behavior of mal- Static detection techniques such as anomaly, heuris- ware during run time. They mostly monitor the behavior tic or signature-based, mostly operate on disassembled of malware using API calls [10,11]. Dynamic-based tech- instructions or Application Programming Interface (API) niques act almost robustly against polymorphism malware calls. These techniques use a specific signature of each because obfuscation techniques only change the static Copyright © 2015 John Wiley & Sons, Ltd. 3311 Structural mining for malware detection M. Narouei et al. signature of malware and do not affect the behavior of the different structures like opcode [18] from the output, or malware. Despite of its advantages, dynamic-based tech- directly focus on the byte code [19,20]. Sung et al. [21] niques have a large preprocessing overhead during run time proposed SAVE that considers a sequence of API calls as and monitoring, which lowers system performance. In fact, signature of malware. Kumar and Spafford [22] proposed dynamic analysis is usually performed when anti-malware a method that detects malware based on regular expres- companies want to look for a new sample in the wild sion matching. Christodorescu and Jha [9] proposed SAFE and then they use its static features in the anti-malware that takes patterns of malicious behavior and then creates products. Therefore, implementation of these techniques an automata from it. Some other approaches [7,8] consid- in the commercial host-based anti-malware products is not ered IAT of PE and its content. Another work [23] also applicable. Another drawback of such approaches is that considered some metadata, such as number of bitmaps, monitoring behavior of dynamic libraries, such as OCX size of import, and export address table, besides IAT con- and dynamic-link library (DLL), is difficult to perform. tent. There is only one research [24] that considered DLL According to [12], about 60% of malwares collected at dependency of PE but they did not extract semantic behav- KingSoft anti-malware lab are DLL files, which cannot be iors from all DLLs. They considered single APIs and DLL run and analyzed dynamically. dependency paths of the tree as features that can be cir- A portable executable (PE) depends on some DLLs for cumvented by junk injection techniques. In contrast, our execution, and each DLL has a relationship with other method is more resilient against evasion techniques and DLLs for completing the task. This hierarchical dependen- also the extracted behaviors by our method is more seman- cies between PE and DLLs is known as DLL dependency tic because we consider the relationships between all DLLs tree. In this paper, we will propose a hybrid system that not only the paths of the DLL dependency tree. The last by statically extracting the DLL dependency tree from the four approaches are the most similar methods to ours PE, has the advantages of both static and dynamic tech- but opposite to our approach, they did not consider any niques. Although the PE is not executed, the extracted tree semantic relation between all DLLs. can reveal coarse-grain behaviors of a PE because it is Besides static analysis, researchers have put a lot created based on the interactions of the PE with the operat- of effort in proposing behavior-based malware detection ing system, which leads to relationships among DLLs. The methods that monitor programs during the run-time. Dif- novelty of our method is the extraction of the coarse-grain ferent approaches and techniques can be applied to perform behaviors from PE’s import address table (IAT), which such dynamic analysis. In this technique, analyzers usually was not considered in previous methods. It means that the consider system calls or instructions sequences as discrim- analysis system is also very fast because it only consid- inating features between malware and benign programs. ers the header of PE, and also, the extracted behaviors Some approaches [11,25,26], monitored the program’s are semantic and influential to discriminate malicious and behavior by analyzing API calls. In another work [27], the benign programs. In addition of the efficiency of static authors extracted sequences of instructions from both mali- methods and the effectiveness of dynamic approaches, our cious and benign programs and use some combination of method cannot simply be circumvented. The evaluations frequent assembly instructions to build the dataset. In some show the effectiveness against several kinds of packed techniques [28,29], a behavioral graph is extracted from malware as well as robustness against junk injection tech- each malware based on API call parameters and under- niques. The rest of the paper is organized as follows: stood behavior of malware. Instead of program-centric related work are renewed in section 2; definitions are detection approaches, some approaches try to generalize presented in section 3, and section 4 presents the pro- the detection method to the whole system. Lanzi et al. posed method. Results of the experiments are discussed in [30] proposed an access activity model that captures the section 5, and conclusions and future work will wrap up generalized interactions of benign applications with oper- the paper. ating system resources such as files and registry, and the model can detect the malware well with very low false pos- itive. Nappa et al. [31] also proposed a technique to detect 2. RELATED WORK exploited servers, managed by the same organization. They collected information about how exploited servers Analysis techniques