Extraction of Statistically Significant Malware Behaviors
Sirinda Palahan (Penn State University, [email protected])
Domagoj Babić (Google, Inc., [email protected])
Swarat Chaudhuri (Rice University, [email protected])
Daniel Kifer (Penn State University, [email protected])

∗This work was done while Domagoj was with UC Berkeley.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACSAC'13 Dec. 9-13, 2013, New Orleans, Louisiana USA. Copyright 2013 ACM 978-1-4503-2015-3/13/12 ...$15.00.

ABSTRACT

Traditionally, analysis of malicious software is only a semi-automated process, often requiring a skilled human analyst. As new malware appears at an increasingly alarming rate (now over 100 thousand new variants each day), there is a need for automated techniques for identifying suspicious behavior in programs. In this paper, we propose a method for extracting statistically significant malicious behaviors from a system call dependency graph (obtained by running a binary executable in a sandbox). Our approach is based on a new method for measuring the statistical significance of subgraphs. Given a training set of graphs from two classes (e.g., goodware and malware system call dependency graphs), our method can assign p-values to subgraphs of new graph instances even if those subgraphs have not appeared before in the training data (thus possibly capturing new behaviors or disguised versions of existing behaviors).

1. INTRODUCTION

Signature-based detection has been a major technique in commercial anti-virus software. However, that approach is ineffective against code obfuscation techniques. To address this problem, most of the current work, e.g., [1, 7, 12, 10], has focused on behavior-based detection techniques, because the semantics of malware are unlikely to change even after a series of syntactic code transformations.

To develop effective behavior-based detection techniques, it is important to understand how malware behaves. Previous studies (e.g., [14, 20]) typically used experts to construct malware specifications that describe malicious behaviors. This requires deep expertise and is costly in terms of time and effort. To address the problem, Christodorescu et al. [7] and Fredrikson et al. [12] proposed methods to automatically generate specifications of malicious activity from samples of malicious and benign executables. These methods only recognize behaviors that appeared in the training data, and they do not provide scores that indicate the statistical confidence that a (possibly new) behavior is malicious.

In this paper, we propose a new method for identifying malicious behavior and assigning it a p-value (a measure of statistical confidence that the behavior is indeed malicious). It requires a training set consisting of malware and goodware executables. Using dynamic program analysis tools, we represent each executable as a graph. We then train a linear classifier to discriminate between malware and goodware; the parameters of this classifier are crucial for our statistical test. To evaluate a new executable, we can use the linear classifier to categorize it as goodware or malware. To then identify its suspicious behaviors, we again represent it as a graph and use a subgraph mining algorithm to extract a candidate subgraph. We assign confidence scores to this subgraph with statistical procedures that use the classifier weights obtained from the training phase; the statistical procedures work for any subgraph, even if it did not appear in the training data. The framework is simple and modular: one can plug in different program analysis tools, linear classifiers, and subgraph mining algorithms to take advantage of progress in those areas. Our statistical tests work with all of these options and are easy to implement.

It is important to note that the evaluation of malware specifications is a challenging task. Manual construction of malware specifications is labor-intensive and error-prone; the resulting relatively small quantity of specifications will often have high false positive rates [14]. Other work [7, 12] compared extracted malware behavior in the form of graphs to textual descriptions provided by anti-virus companies (apparently also manually). In this paper we take a more automated approach designed to reduce the risk of experimenter bias. We use carefully designed experiments to validate both the quality of the p-values and the quality of the suspicious executable behaviors that were identified.

In summary, the contributions of this paper are:

1. A framework for identifying suspicious behaviors in programs and assigning them statistical significance scores. The framework is modular, easy to implement, and does not require the malware test set to exhibit the same behaviors as the malware training set.

2. A careful empirical evaluation of the extracted behaviors. This includes an evaluation of p-values and some initial experimental evidence for the identification of malicious behaviors not seen in the training data.

We discuss related work in Section 2. We present our framework in Section 3. The framework includes a training phase (Section 3.1) and a deployment phase (Section 3.2). We then empirically evaluate our methods in Section 4.

2. RELATED WORK

2.1 Malware Specification

Christodorescu et al. [7] use contrast subgraph mining to construct specifications by comparing syscall dependency graphs of malware and goodware samples to obtain activities appearing only in malware. They assess specification quality by manual comparison to specifications produced by an expert. This technique does not produce statistical significance scores for the specifications. Comparetti et al. [8] propose a novel method to find dormant behaviors statically in binaries, based on manually provided behavior specifications. Our work complements theirs, as our approach can be used to identify statistically significant behavior specifications. Fredrikson et al. [12] use LEAP [25] to extract behaviors that are synthesized by concept analysis and simulated annealing to generate specifications. The authors evaluate their specifications by manually comparing them with behavior reports from malware experts. Even though LEAP, a significant-subgraph extraction algorithm, is used, no statistical significance value can be calculated for new executables.

2.2 Significant Subgraph Extraction

There are relatively few algorithms that extract significant subgraphs. LEAP [25] finds significant subgraphs that are frequent in the positive dataset and rare in the negative dataset, and it maximizes a user-defined significance function. For malware detection, it has been used to mine system call dependency graphs [12]. However, when such a system is deployed for analyzing new executables, searching for exact subgraphs could be a brittle approach: LEAP will not identify malicious behavior if it does not correspond to a graph it has seen before (a slight change in the graph may be enough to avoid detection). Our proposed method also mines system call dependency graphs but is more flexible and so can identify significant behaviors whose corresponding subgraphs have not appeared in the training data. Ranu and Singh [18] propose GraphSig, a frequent subgraph mining algorithm, to find significant subgraphs in large databases. GraphSig prunes the search space by testing the significance of a subgraph; the definition of significance is based on a subgraph's frequency.

Other related work searches for paths between two given nodes with the minimum sum of edge weights. They adopt a randomization approach for statistical significance testing: they compare scores of paths they find to scores of paths found in random networks whose edges have been shuffled. Our approach uses randomization for statistical testing, but we carefully avoid comparisons to random graphs (since they may not be plausible representations of system call dependency graphs).

3. THE STATISTICAL FRAMEWORK

We next describe the major components of our framework for identifying statistically significant malicious behaviors. An overview of the framework is shown in Figure 1. There are two phases: the training phase (where statistical information about malware and goodware is collected) and the deployment phase (for analyzing a new executable).

The training phase (Section 3.1) requires samples of goodware and malware executables. These executables are converted into system call dependency graphs (SDG) using dynamic analysis (Section 3.1.1). We build a linear classifier to distinguish malware from goodware, and we use its parameters to obtain a function that assigns weights to edges (Section 3.1.2); these weights are used by our statistical tests.

The deployment phase (Section 3.2) is used to analyze a new executable for suspicious behavior. The linear classifier from the training phase can be used to classify it as malware or goodware. To extract suspicious behavior, we first build the SDG and assign edge weights based on the parameters of the linear classifier. We then use a subgraph mining algorithm to identify candidate subgraphs (Section 3.2.1); any subgraph mining algorithm for weighted or unweighted graphs can be used here as a black box. Then our statistical tests (Section 3.2.2) assign significance scores to the behaviors associated with those subgraphs. These tests use the edge weights to assign p-values and automatically correct for the multiple testing problem (explained in Section 3.2.2), which is an important concern in subgraph mining.

3.1 The Training Phase

We now describe the two components of the training phase: building system call dependency graphs and building a linear malware classifier whose parameters will be used to assign weights to edges in those graphs.
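To make the graph representation concrete, the following minimal sketch builds a toy system call dependency graph. The system call names and the dependency edges are invented for illustration; in the framework itself, SDGs come from running the executable under dynamic analysis in a sandbox.

```python
# Hypothetical sketch of an SDG as an edge set. Nodes are system calls
# observed during execution; an edge (a, b) records that a value produced
# by call a (e.g., a handle or buffer) flowed into call b.

class SDG:
    def __init__(self):
        self.edges = set()  # set of (src_syscall, dst_syscall) pairs

    def add_dependency(self, src, dst):
        self.edges.add((src, dst))

# A toy trace: a file is opened, read, and its contents sent over the network.
g = SDG()
g.add_dependency("NtOpenFile", "NtReadFile")  # handle flows into the read
g.add_dependency("NtReadFile", "send")        # read buffer flows into send

print(sorted(g.edges))
# → [('NtOpenFile', 'NtReadFile'), ('NtReadFile', 'send')]
```

A real SDG would also carry node labels such as call arguments, but an edge set is enough to illustrate the subgraph scoring that follows.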
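The second training component can be sketched as follows. Assuming each SDG is featurized by binary indicators of which edges it contains, a linear classifier separating malware (+1) from goodware (-1) yields one coefficient per edge, which can then serve as that edge's weight. The perceptron update below is only a stand-in for whatever linear classifier is plugged into the framework, and the two tiny training graphs are fabricated.

```python
# Minimal sketch: learn per-edge weights from labeled edge-set "graphs".
# Assumption: binary edge-indicator features and a plain perceptron; the
# framework treats the choice of linear classifier as pluggable.

def train_edge_weights(graphs, labels, epochs=20):
    edges = sorted({e for g in graphs for e in g})
    w = {e: 0.0 for e in edges}  # one weight per edge observed in training
    b = 0.0
    for _ in range(epochs):
        for g, y in zip(graphs, labels):
            score = b + sum(w[e] for e in g)
            if y * score <= 0:          # misclassified: perceptron update
                for e in g:
                    w[e] += y
                b += y
    return w, b

malware = [{("NtOpenFile", "NtReadFile"), ("NtReadFile", "send")}]   # label +1
goodware = [{("NtOpenFile", "NtReadFile")}]                          # label -1
w, b = train_edge_weights(malware + goodware, [1, -1])

# An edge seen only in malware ends up with a larger weight than one
# shared with goodware, so weight sums can flag suspicious subgraphs.
print(w[("NtReadFile", "send")] > w[("NtOpenFile", "NtReadFile")])
# → True
```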
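Finally, the flavor of the randomization test can be illustrated generically: score a candidate subgraph by the sum of its learned edge weights, then compare against scores of randomly drawn edge subsets of the same size from the same SDG, consistent with the stated goal of avoiding shuffled random graphs. This is only a generic permutation-style sketch with invented weights, not the paper's actual procedure of Section 3.2.2, and it omits the multiple-testing correction.

```python
import random

# Hedged sketch of a randomization p-value for a candidate subgraph:
# the p-value is the fraction of equally sized random edge subsets of the
# same graph whose total weight matches or exceeds the candidate's.

def subgraph_p_value(candidate_edges, graph_edges, weights, trials=999, seed=0):
    rng = random.Random(seed)
    observed = sum(weights.get(e, 0.0) for e in candidate_edges)
    k = len(candidate_edges)
    pool = list(graph_edges)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(pool, k)  # random subset of k edges
        if sum(weights.get(e, 0.0) for e in sample) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (trials + 1)

# Invented edge weights for illustration only.
weights = {("NtReadFile", "send"): 3.0,
           ("NtOpenFile", "NtReadFile"): -1.0,
           ("NtClose", "NtOpenFile"): 0.0}
graph = set(weights)
p = subgraph_p_value({("NtReadFile", "send")}, graph, weights)
print(0 < p <= 1)
# → True
```

With one out of three edges scoring as high as the candidate, the estimated p-value here is roughly one third; a heavily weighted multi-edge subgraph in a large SDG would receive a much smaller one.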