Effective and Efficient Malware Detection at the End Host
Total Page:16
File Type:pdf, Size:1020Kb
Effective and Efficient Malware Detection at the End Host Clemens Kolbitsch∗, Paolo Milani Comparetti∗, Christopher Kruegel‡, Engin Kirda§, Xiaoyong Zhou†, and XiaoFeng Wang† ∗ Secure Systems Lab, TU Vienna ‡ UC Santa Barbara {ck,pmilani}@seclab.tuwien.ac.at [email protected] § Institute Eurecom, Sophia Antipolis † Indiana University at Bloomington [email protected] {zhou,xw7}@indiana.edu Abstract sible for such information flows. For detection, we exe- Malware is one of the most serious security threats on cute these slices to match our models against the runtime the Internet today. In fact, most Internet problems such behavior of an unknownprogram. Our experimentsshow as spam e-mails and denial of service attacks have mal- that our approach can effectively detect running mali- ware as their underlying cause. That is, computers that cious code on an end user’s host with a small overhead. are compromised with malware are often networked to- gether to form botnets, and many attacks are launched 1 Introduction using these malicious, attacker-controlled networks. With the increasing significance of malware in Inter- Malicious code, or malware, is one of the most press- net attacks, much research has concentrated on develop- ing security problems on the Internet. Today, millions ing techniques to collect, study, and mitigate malicious of compromised web sites launch drive-by download ex- code. Without doubt, it is important to collect and study ploits against vulnerable hosts [35]. As part of the ex- malware found on the Internet. However, it is even more ploit, the victim machine is typically used to download important to develop mitigation and detection techniques and execute malware programs. These programs are of- based on the insights gained from the analysis work. ten bots that join forces and turn into a botnet. Bot- Unfortunately, current host-based detection approaches nets [14] are then used by miscreants to launch denial (i.e., anti-virus software) suffer from ineffective detec- of service attacks, send spam mails, or host scam pages. tion models. These models concentrate on the features Given the malware threat and its prevalence, it is not of a specific malware instance, and are often easily evad- surprising that a significant amount of previous research able by obfuscation or polymorphism. Also, detectors has focused on developing techniques to collect, study, that check for the presence of a sequence of system calls and mitigate malicious code. For example, there have exhibited by a malware instance are often evadable by been studies that measure the size of botnets [37], the system call reordering. In order to address the shortcom- prevalence of malicious web sites [35], and the infes- ings of ineffective models, several dynamic detection ap- tation of executables with spyware [31]. Also, a num- proaches have been proposed that aim to identify the be- ber of server-side [4, 43] and client-side honeypots [51] havior exhibited by a malware family. Although promis- were introduced that allow analysts and researchers to ing, these approaches are unfortunately too slow to be gather malware samples in the wild. In addition, there used as real-time detectors on the end host, and they of- exist tools that can execute unknown samples and mon- ten require cumbersome virtual machine technology. itor their behavior [6, 28, 54, 55]. Some tools [6, 54] In this paper, we propose a novel malware detection provide reports that summarize the activities of unknown approach that is both effective and efficient, and thus, can programs at the level of Windows API or system calls. be used to replace or complement traditional anti-virus Such reports can be evaluated to find clusters of samples software at the end host. Our approach first analyzes a that behave similarly [5, 7] or to classify the type of ob- malware program in a controlled environment to build a served, malicious activity [39]. Other tools [55] incorpo- model that characterizes its behavior. Such models de- rate data flow into the analysis, which results in a more scribe the information flows between the system calls es- comprehensive view of a program’s activity in the form sential to the malware’s mission, and therefore, cannot of taint graphs. be easily evaded by simple obfuscation or polymorphic While it is important to collect and study malware, techniques. Then, we extract the program slices respon- this is only a means to an end. In fact, it is crucial that the insight obtained through malware analysis is trans- sample to use obfuscation and polymorphictechniques to lated into detection and mitigation capabilities that al- alter its appearance. The problem with static techniques low one to eliminate malicious code running on infected is that static binary analysis is difficult [30]. This diffi- machines. Considerable research effort was dedicated to culty is further exacerbated by runtime packing and self- the extraction of network-based detection models. Such modifying code. Moreover, the analysis is costly, and models are often manually-crafted signatures loaded into thus, not suitable for replacing AV scanners that need to intrusion detection systems [33] or bot detectors [20]. quickly scan large numbers of files. Dynamic analysis Other models are generated automatically by finding is an alternative approach to model malware behavior. In common tokens in network streams produced by mal- particular, several systems [22, 55] rely on the tracking of ware programs (typically, worms) [32, 41]. Finally, mal- dynamic data flows (tainting) to characterize malicious ware activity can be detected by spotting anomalous traf- activity in a generic fashion. While detection results are fic. For example, several systems try to identify bots by promising, these systems incur a significant performance looking for similar connection patterns [19, 38]. While overhead. Also, a special infrastructure (virtual machine network-based detectors are useful in practice, they suf- with shadow memory) is required to keep track of the fer from a number of limitations. First, a malware pro- taint information. As a result, static and dynamic anal- gram has many options to render network-based detec- ysis approaches are often employed in automated mal- tion very difficult. The reason is that such detectors can- ware analysis environments (for example, at anti-virus not observe the activity of a malicious program directly companies or by security researchers), but they are too but have to rely on artifacts (the traffic) that this program inefficient to be deployed as detectors on end hosts. produces. For example, encryption can be used to thwart In this paper, we propose a malware detection ap- content-based techniques, and blending attacks [17] can proach that is both effective and efficient, and thus, can change the properties of network traffic to make it ap- be used to replace or complement traditional AV soft- pear legitimate. Second, network-based detectors cannot ware at the end host. For this, we first generate effective identify malicious code that does not send or receive any models that cannot be easily evaded by simple obfusca- traffic. tion or polymorphic techniques. More precisely, we exe- Host-based malware detectors have the advantage that cute a malware program in a controlled environment and they can observe the complete set of actions that a mal- observe its interactions with the operating system. Based ware program performs. It is even possible to identify on these observations, we generate fine-grained models malicious code before it is executed at all. Unfortunately, that capture the characteristic, malicious behavior of this current host-based detection approaches have significant program. This analysiscan be expensive,as it needsto be shortcomings. An important problem is that many tech- run only once for a group of similar (or related) malware niques rely on ineffective models. Ineffective models are executables. The key of the proposed approach is that models that do not capture intrinsic properties of a mali- our models can be efficiently matched against the run- cious program and its actions but merely pick up artifacts time behavior of an unknown program. This allows us of a specific malware instance. As a result, they can be to detect malicious code that exhibits behavior that has easily evaded. For example, traditional anti-virus (AV) been previously associated with the activity of a certain programs mostly rely on file hashes and byte (or instruc- malware strain. tion) signatures [46]. Unfortunately, obfuscation tech- The main contributions of this paper are as follows: niques and code polymorphism make it straightforward to modify these features without changing the actual se- • We automatically generate fine-grained (effective) mantics (the behavior) of the program [10]. Another ex- models that capture detailed information about the ample are models that capture the sequence of system behavior exhibited by instances of a malware fam- calls that a specific malware program executes. When ily. These models are built by observing a malware these system calls are independent, it is easy to change sample in a controlled environment. their order or add irrelevant calls, thus invalidating the • We have developed a scanner that can efficiently captured sequence. match the activity of an unknown program against In an effort to overcome the limitations of ineffective our behavior models. To achieve this, we track de- models, researchers have sought ways to capture the ma- pendencies between system calls without requiring licious activity that is characteristic of a malware pro- expensive taint analysis or special support at the end gram (or a family). On one hand, this has led to de- host. tectors [9, 12, 25] that use sophisticated static analysis to identify code that is semantically equivalent to a mal- • We present experimental evidence that demon- ware template. Since these techniquesfocus on the actual strates that our approach is feasible and usable in semantics of a program, it is not enough for a malware practice.