On Callgraphs and Generative Mechanisms
J Comput Virol
DOI 10.1007/s11416-007-0057-x

ORIGINAL PAPER

Daniel Bilar

Received: 15 February 2007 / Accepted: 14 June 2007
© Springer-Verlag France 2007

D. Bilar (B), Department of Computer Science, Wellesley College, Wellesley, MA 02481, USA. e-mail: [email protected]

Abstract This paper examines the structural features of callgraphs. The sample consisted of 120 malicious and 280 non-malicious executables. Pareto models were fitted to indegree, outdegree and basic block count distributions, and a statistically significant difference was shown for the derived power law exponent. A two-step optimization process involving human designers and code compilers is proposed to account for these structural features of executables.

1 Introduction

All commercial antivirus (AV) products rely on signature matching, the bulk of which is strict byte-sequence pattern matching. For modern, evolving polymorphic and metamorphic malware, this approach is unsatisfactory. Clementi [9] recently checked fifteen state-of-the-art, updated AV scanners against ten highly polymorphic malware samples and found false negative rates from 0 to 90%, with an average of 48%. This development was already predicted in 2001 [51].

Polymorphic malware contains decryption routines that decrypt encrypted constant parts of the malware body. The malware can mutate its decryptors in subsequent generations, thereby complicating signature-based detection approaches. The decrypted body, however, remains constant. Metamorphic malware generally does not use encryption, but is able to mutate its body in subsequent generations using various techniques, such as junk insertion, semantic NOPs, code transposition, equivalent instruction substitution and register reassignments [8,49]. For a recent formalization of these code mutation techniques, the technical reader is referred to [17]. The net result of these techniques is a shrinking usable "constant base" for strict signature-based detection approaches.

Since signature-based approaches are quite fast (but show little tolerance for metamorphic and polymorphic code) and heuristics such as emulation are more resilient (but quite slow and may hinge on environmental triggers), a detection approach that combines the best of both worlds would be desirable. This is the philosophy behind a structural fingerprint. Structural fingerprints are statistical in nature, and as such are positioned as 'fuzzier' metrics between static signatures and dynamic heuristics. The structural fingerprint investigated in this paper for differentiation purposes is based on some properties of the executable's callgraph.

The rest of this paper is structured as follows. Section 2 describes the setup, data, procedures and results. Section 3 gives a short overview of related work on graph-based classification. Section 4 sketches the proposed generative mechanism.

2 Generating the callgraph

The primary tools used are described in more detail in the Acknowledgements.

2.1 Samples

For non-malicious software, henceforth called 'goodware', sampling followed a two-step process: we inventoried all PEs (the primary 32-bit Windows file format) on a Microsoft XP Home SP2 laptop, drew 300 samples uniformly at random, and discarded overly large and small files, yielding 280 samples. For malicious software (malware), seven classes of interest were fixed: backdoors, hacking tools, DoS, trojans, exploits, viruses and worms. The worm class was further divided into Peer-to-Peer (P2P), Internet Relay Chat/Instant Messenger (IRC/IM), Email and Network worm subclasses. For a non-specialist introduction to malicious software, see [48]; for a canonical reference, see [50].

Each class (subclass) contained at least 15 samples. Since AV vendors were hesitant for liability reasons to provide samples, we gathered them from herm1t's collection [24] and identified compiler and (potential) packer metadata using PEiD. Practically all malware samples were identified as having been compiled by MS C++ 5.0/6.0, MS Visual Basic 5.0/6.0 or LCC, and about a dozen samples were packed with various versions of UPX (an executable compression program). Malware was run through best-of-breed, updated open- and closed-source AV products, yielding false negative rates of 32% (open-source) and 2% (closed-source), respectively. Overall file sizes for both mal- and goodware ranged from $\Theta(10\,\text{kB})$ to $\Theta(1\,\text{MB})$ (see footnote 1). A preliminary file size distribution investigation yielded a log-normal distribution; for a putative explanation of the underlying generative process, see [38] and [31].

All 400 samples were loaded into the de-facto industry standard disassembler (IDA Pro [22]), inter- and intra-procedurally parsed, and augmented with symbolic meta-information gleaned programmatically from the binary via FLIRT signatures (when applicable). We exported the identified structures via IDAPython into a MySQL database. These structures were subsequently parsed by a disassembly visualization tool (BinNavi [13]) to generate and investigate the callgraph.

Footnote 1: A function $f(n)$ is $\Theta(g(n))$ if there are positive constants $c_1$, $c_2$ and $n_0$ such that $0 \le c_1 g(n) \le f(n) \le c_2 g(n)$, $\forall n \ge n_0$. See [26] for a readable discussion of asymptotic notation etymology.

2.2 Callgraph

Following [14], we treat an executable as a graph of graphs. This follows the intuition that in any procedural language, the source code is structured into functions (each of which can be viewed as a flowchart, i.e. a directed graph which we call a flowgraph). These functions call each other, thus creating a larger graph where each node is a function and the edges are calls-to relations between the functions. We call this larger graph the callgraph. We recover this structure by disassembling the executable into individual instructions. We distinguish between short and far branch instructions: short branches do not save a return address while far branches do. Intuitively, short branches are normally used to pass control around within one function of the program, while far branches are used to call other functions.

A sequence of instructions that is continuous (i.e. has no branches jumping into its middle and ends at a branch instruction) is called a basic block. We consider the graph formed by having each basic block as a node, and each short branch as an edge. The connected components in this directed graph correspond to the flowgraphs of the functions in the source code. For each connected component in the previous graph, we create a node in the callgraph. For each far branch in the connected component, we add an edge to the node corresponding to the connected component this branch is targeting. Figures 7 and 8 in the Appendix illustrate a function's flow- and callgraph, respectively.

Formally, denote a callgraph $CG$ as $CG = G(V, E)$, where $G(\cdot)$ stands for 'Graph'. Let $V = F$, where $F \in \{\text{normal}, \text{import}, \text{library}, \text{thunk}\}$. This just says that each function in $CG$ is either a 'library' function (from an external library statically linked in), an 'import' function (dynamically imported from a dynamic library), a 'thunk' function (mostly one-line wrapper functions used for calling convention or type conversion) or a 'normal' function (which can be viewed as the executable's own function). The following metrics were collected programmatically from $CG$:

– $|V|$ is the number of nodes in $CG$, i.e. the function count of the callgraph.
– For any $f \in V$, let $f = G(V_f, E_f)$ where $b \in V_f$ is a block of code, i.e. each node in the callgraph is itself a graph, a flowgraph, and each node on the flowgraph is a basic block.
– Define $IC : B \to \mathbb{N}$ where $B$ is defined to be the set of blocks of code, and $IC(b)$ is the number of instructions in $b$. We denote this function in shorthand as $|b|_{IC}$, the number of instructions in basic block $b$.
– We extend this notation $|\cdot|_{IC}$ to elements of $V$ by defining $|f|_{IC} = \sum_{b \in V_f} |b|_{IC}$. This gives us the total number of instructions in a node of the callgraph, i.e. in a function.
– Let $d^{+}_{G}(f)$, $d^{-}_{G}(f)$ and $d^{bb}_{G}(f)$ denote the indegree, outdegree and basic block count of a function, respectively.

2.3 Correlations

We calculated the correlation between in- and outdegree of functions. Prior analysis of static class collaboration networks [40,44] suggests an anti-correlation, characterizing some functions as sources or sinks. We found no significant correlation between in- and outdegree of functions in the disassembled executables (Fig. 1). Correlation intuitively is unlikely to occur except in the '0 outdegree' case (the BinNavi toolset does not generate the flowgraph for imported functions, i.e. an imported function automatically has outdegree 0, but will be called from many other functions).

Fig. 1 Correlation coefficient $r_{in,out}$. (a) Malware: $p$ versus $r_{in,out}$. (b) Goodware: $p$ versus $r_{in,out}$. [Plots not reproduced; $p$ is shown on a logarithmic scale from $10^{-7}$ to $10^{0}$, $r_{in,out}$ from $-0.4$ to $0.4$.]

Table 1 Correlation, IQR for instruction count. [Table body not recovered; column headers: Class, Metric, $\Theta(10)$, $\Theta(100)$, $\Theta(1000)$.]

$|V|$, which may serve as a rough proxy for the executable's size.
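The construction of Sect. 2.2 and the degree correlation of Sect. 2.3 can be sketched in a few lines of Python. This is a minimal illustration, not the IDA Pro/BinNavi pipeline used in the paper. The function names (`build_callgraph`, `function_metrics`, `pearson`) and the toy input are hypothetical, and several choices are assumptions not stated in the text: connected components are taken as weakly connected, repeated calls between the same two functions collapse into one edge, and the correlation is computed as Pearson's coefficient.

```python
from collections import defaultdict
from math import sqrt


def build_callgraph(blocks, short_branches, far_branches):
    """Group basic blocks into functions and derive call edges.

    blocks:         dict mapping block id -> instruction count |b|_IC
    short_branches: (src_block, dst_block) pairs, intra-function control flow
    far_branches:   (src_block, dst_block) pairs, i.e. calls
    """
    parent = {b: b for b in blocks}

    def find(b):
        # Union-find lookup with path halving.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    # Blocks joined by short branches belong to the same flowgraph
    # (assumption: edge direction is ignored when forming components).
    for src, dst in short_branches:
        parent[find(src)] = find(dst)

    members = defaultdict(list)
    for b in blocks:
        members[find(b)].append(b)

    # Label each function by its smallest block id (arbitrary but stable).
    functions = {min(bs): sorted(bs) for bs in members.values()}
    func_of = {b: f for f, bs in functions.items() for b in bs}

    # Assumption: one callgraph edge per distinct (caller, callee) pair.
    calls = {(func_of[s], func_of[d]) for s, d in far_branches}
    return functions, calls


def function_metrics(blocks, functions, calls):
    """Return indegree, outdegree, basic block count and |f|_IC per function."""
    stats = {
        f: {
            "indegree": 0,
            "outdegree": 0,
            "bb_count": len(bs),
            "instr_count": sum(blocks[b] for b in bs),
        }
        for f, bs in functions.items()
    }
    for caller, callee in calls:
        stats[caller]["outdegree"] += 1
        stats[callee]["indegree"] += 1
    return stats


def pearson(xs, ys):
    """Pearson correlation; returns None when either variance is zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


# Toy executable: three functions made of six basic blocks (made-up data).
blocks = {"a1": 4, "a2": 2, "b1": 7, "c1": 3, "c2": 1, "c3": 5}
short = [("a1", "a2"), ("c1", "c2"), ("c1", "c3")]
far = [("a2", "b1"), ("a2", "c1"), ("c3", "b1")]

functions, calls = build_callgraph(blocks, short, far)
stats = function_metrics(blocks, functions, calls)
r = pearson([s["indegree"] for s in stats.values()],
            [s["outdegree"] for s in stats.values()])
print(len(functions), stats, r)
```

Here `len(functions)` plays the role of $|V|$, `instr_count` that of $|f|_{IC}$, and `indegree`, `outdegree` and `bb_count` those of $d^{+}_{G}(f)$, $d^{-}_{G}(f)$ and $d^{bb}_{G}(f)$. A union-find structure is used only because it keeps the component step short; any connected-components routine would serve.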