Malware Analysis Through High-Level Behavior
Xiyue Deng and Jelena Mirkovic
{xiyueden,mirkovic}@isi.edu
Information Sciences Institute

Abstract

Malware is becoming more and more stealthy to evade detection and analysis. Stealth techniques often involve code transformation, ranging from equivalent code substitution and junk code injection, to continuously transforming code using a polymorphic or a metamorphic engine. Evasion techniques have a great impact on signature-based malware detection, making it very costly and often unsuccessful.

We propose to study a malware's network behavior during its execution. While malware may transform its code to evade analysis, we contend that its behavior must mostly remain the same to achieve the malware's ultimate purpose, such as sending spam, scanning for vulnerable hosts, etc. While live malware analysis is hard, we leverage our Fantasm platform on the DeterLab testbed to perform it safely and effectively. Based on observed network traffic we propose a behavior classification approach, which can help us interpret the malware's actions and its ultimate purpose at a high level. We then apply our approach to 999 diverse samples from the Georgia Tech Apiary project to understand current trends in malware behaviors.

1 Introduction

The contemporary state of Internet security is grave, thanks to the proliferation of malware. Malware is an increasing threat, not only because of its increased capability and sophistication, but also because of its stealth, aimed at avoiding detection and analysis.

Because anti-virus suites usually apply signature-based malware detection, malware designers have invested enormous effort to change their binaries to avoid detection. From simple instruction transformation, such techniques have evolved through junk code injection and code obfuscation to polymorphic and metamorphic engines that can transform malware code into millions of variants [17]. Such obfuscation techniques have greatly undermined traditional signature-based malware detection methods. We simply cannot keep pace with proliferating malware variants. Notable attempts have been made by researchers to improve signature analysis by running AV suites on cloud [22], detecting encryption decoders [15], detecting runtime signatures for polymorphic malware [24], etc. However, as new obfuscation techniques emerge, signature analysis alone cannot save us.

Besides static analyses, researchers have used dynamic analysis to overcome code obfuscation. These include tracking disk changes, analyzing dynamic call graphs, as well as monitoring malware execution using debuggers and virtual machines. However, modern malware is often equipped with anti-debugging and anti-virtualization capabilities [18,19], making its analysis in a controlled environment difficult. Yet, a controlled environment is needed to observe a malware's actions on the compromised host, and record sufficient details for analysis.

To complement contemporary static and dynamic code analysis, we propose to study malware behavior by observing and interpreting its network activity. Based on our prior research [6], we believe that network access is essential for malware to achieve its purpose. We also believe that it would be difficult for malware to significantly alter its network behavior and still achieve its purpose. Studying network behavior thus may offer an opportunity both to understand what malware is trying to do in an analysis environment, and to detect malware on compromised hosts. In this paper we pursue the first direction, by investigating how we can infer the malware's ultimate purpose by analyzing its network activity.

Live malware analysis, with unrestricted network access, is risky, because malware may inflict damage to other Internet hosts during analysis, and we would become unwitting accomplices. We leverage our Fantasm system for safe and productive live malware analysis [6] to minimize this risk.

Just having network traffic records of a malware run is not enough to understand a malware's purpose, because such data is very rich and very unstructured. To overcome this challenge, we first extract a set of features from network traffic records, and then devise patterns that help us group these features into higher-level behaviors, such as file download, e-mail sending, scanning, etc. Next we use a set of these high-level behaviors to label malware by its purpose. We note that malware could have more than one purpose. For example, it could both scan for vulnerable hosts and send unsolicited emails. Thus a malware sample could have more than one label.

We perform our analysis on 999 malware samples, randomly selected from the Georgia Tech Apiary project [1]. We observe that the most common activities are scanning, propagating, and downloading. Interestingly, of all samples that show more than one behavior, scanning and propagating show up most frequently together, which suggests these activities usually work in combination and complement each other.

Our paper is structured as follows. We discuss the background of this work and contemporary research in Section 2. In Section 3 we explain our methodology. We define our classification criteria in Section 4. We describe the experiment setup and evaluation in Section 5. We discuss the use of the results and possible future work in Section 6 and conclude in Section 7.

2 Background and Related Work

In this section we discuss contemporary malware analysis methods and the rationale behind our research methods, and also survey related work.

2.1 Static Analysis

Various static analysis methods like CTPL [11], Generic Virus Scanner [5], etc. have been used to analyze binary code of malware without running it, and identify portions that could be used for signature generation. These portions of code exhibit malicious activity, and are unlikely to be present in benign binaries. The identified portion is then synthesized to become the signature of this kind of malware. Once the signature is obtained, researchers can build anti-virus tools to find the same kind of malware in the future by scanning all suspicious binaries.

Signature-based malware detection has been the most widely used method and has been quite successful. However, malware designers have also been working on countermeasures over the years to undermine such techniques. From junk code generation, malware encryption and oligomorphism, to polymorphic and metamorphic malware [17], such techniques have evolved significantly against signature-based detection techniques. On the other hand, researchers have also been improving signature detection methods, such as detecting decryption routines, runtime signature detection, etc. Overall, this is a losing race for the researchers, as it will always be cheaper to generate new malware variants than to analyze them. We can become faster, but we can never win. Our proposed direction shows promise because malware cannot change its network behavior with as much freedom as changing its code. We thus believe that we will be able to analyze and detect those malware samples that static analysis misses.

2.2 Dynamic Analysis on Host

Instead of looking for a binary signature to detect malware presence, dynamic analysis builds a signature of a malware's interaction with its host. This signature may include memory access and file access patterns, as well as system call patterns. Those patterns that are likely to be present in malware but not in benign binaries can be used to develop a behavioral signature for malware detection [7].

Dynamic analysis complements static analysis, and is able to overcome malware code obfuscation. Willems et al. [23] proposed CWSandbox, which combines static and dynamic techniques for analyzing malware on a contained host. The Cuckoo sandbox of Guarnieri et al. [8] follows similar concepts as CWSandbox and additionally provides a network packet sink. Both works utilize virtual machines to isolate the study environment from the host. However, stealthy malware has another set of techniques to evade dynamic analysis: it detects debuggers and virtual machines, which are often used to speed up and facilitate dynamic analysis, and modifies its behavior. This leads to incorrect or unusable signatures of stealthy malware. Chen et al. [4] found that 39.9% and 2.7% of 6,222 malware samples exhibit anti-debugging and anti-virtualization behaviors respectively. In 2012, Branco et al. [3] analyzed 4 million samples and observed that 81.4% of them exhibited anti-virtualization behavior and 43.21% exhibited anti-debugging behavior.

Our work helps analyze malware that would otherwise be missed by static and dynamic analysis, because it relies on observations of malware's network behavior. Malware can be run on bare metal machines, thus evading detection of the analysis environment.

2.3 Dynamic Analysis of Network Behavior

There are a few efforts on analyzing the semantics of malware network behavior. Sandnet [16] provided a detailed, statistical analysis of malware network traffic. The authors surveyed the popularity of protocols employed by malware. However, they did not go further to understand the semantics of the network communication, while we do.

Morales et al. [13] studied seven network activities, selected using heuristics. These activities could be used to identify malicious activities. Morales et al. also report on the prevalence of those behaviors in contemporary malware. Their chosen activities are not as useful for understanding a malware sample's purpose or type, and are very limited. We complement this work by extending the set of activities being observed.

Nari et al. [14] proposed an automated malware classification system also focusing on malware network behavior, which generates a protocol flow dependency graph based on the IP address being contacted by malware.

to forward and which to drop. This decision is made by taking into account each outgoing flow separately, mak-

[Figure 1: decision diagram. Node labels: Flow, Essential?, Risky?, Can we fake?. Outcomes: Drop, Impersonator, Internet.]
Figure 1: Flow handling: how we decide if an outgoing flow will be let out, impersonated or dropped.
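The per-flow decision that Figure 1 describes can be sketched in a few lines of Python. This is only an illustrative sketch, not the Fantasm implementation: the branch order (essential, then risky, then fakeable) is our reading of the node labels that survive from the figure, and the names `Flow`, `Action`, and `handle_flow`, as well as the three boolean fields, are hypothetical. How the real system decides whether a flow is essential, risky, or fakeable is not covered here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three outcomes named in the Figure 1 caption."""
    DROP = "drop"
    IMPERSONATE = "impersonate"   # answered by a local impersonator
    FORWARD = "forward"           # let out to the Internet


@dataclass(frozen=True)
class Flow:
    """One outgoing flow, with hypothetical pre-computed answers
    to the three questions that appear in Figure 1."""
    description: str
    essential: bool   # "Essential?"
    risky: bool       # "Risky?"
    can_fake: bool    # "Can we fake?"


def handle_flow(flow: Flow) -> Action:
    """Decide the fate of a single outgoing flow.

    Assumed branch order (inferred from the figure labels):
    non-essential flows are dropped; essential, non-risky flows
    are forwarded; essential, risky flows are impersonated when
    possible and dropped otherwise.
    """
    if not flow.essential:
        return Action.DROP
    if not flow.risky:
        return Action.FORWARD
    if flow.can_fake:
        return Action.IMPERSONATE
    return Action.DROP


if __name__ == "__main__":
    examples = [
        Flow("non-essential flow", essential=False, risky=False, can_fake=False),
        Flow("essential, harmless flow", essential=True, risky=False, can_fake=False),
        Flow("essential, risky, fakeable flow", essential=True, risky=True, can_fake=True),
        Flow("essential, risky, not fakeable", essential=True, risky=True, can_fake=False),
    ]
    for f in examples:
        print(f"{f.description}: {handle_flow(f).value}")
```

Each flow is evaluated on its own, which matches the statement that the decision takes each outgoing flow into account separately. Under the assumed branch order, a flow that is both risky and impossible to impersonate is dropped even when it is essential.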