J Comput Virol DOI 10.1007/s11416-007-0057-x

ORIGINAL PAPER

On callgraphs and generative mechanisms

Daniel Bilar

Received: 15 February 2007 / Accepted: 14 June 2007 © Springer-Verlag France 2007

1 Abstract This paper examines the structural features of code mutation techniques, the technical reader is referred 29

2 callgraphs. The sample consisted of 120 malicious and 280 to [17]. The net result of these techniques is a shrinking 30

3 non-malicious . Pareto models were fitted to inde- usable “constant base” for strict signature-based detection 31

4 gree, outdegree and basic block count distribution, and a sta- approaches. 32

5 tistically significant difference shown for the derived power Since signature-based approaches are quite fast (but show 33

6 law exponent. A two-step optimization process involving little tolerance for metamorphic and polymorphic code) and 34

7 human designers and code compilers is proposed to account heuristics such as emulation are more resilient (but quite 35

8 for these structural features of executables. slow and may hinge on environmental triggers), a detection 36

approach that combines the best of both worlds would be 37

desirable. This is the philosophy behind a structural finger- 38

9 1 Introduction print. Structural fingerprints are statistical in nature, and as 39

such are positioned as ‘fuzzier’ metrics between static signa- 40

10 All commercial antivirus (AV) products rely on signature tures and dynamic heuristics. The structural fingerprint inves- 41

11 matching; the bulk of which constitutes strict byte sequence tigated in this paper for differentiation purposes is based on 42

12 pattern matching. For modern, evolving polymorphic and some properties of the ’s callgraph. 43

13 metamorphic , this approach is unsatifactory. The rest of this paper is structured as follows. Section2 44

14 Clementi [9] recently checked fifteen state-of-the-art, describes the setup, data, procedures and results. Section 3 45

15 updated AVscanner against ten highly polymorphic malware gives a short overview of related work on graph-based 46

16 samples and found false negative rates from 0 to 90%, with classification. Section4 sketches the proposed generative 47

17 an average of 48%. This development was already predicted mechanism. 48

18 in 2001 [51].

19 Polymorphic malware contain decryption routines which

20 decrypt encrypted constant parts of the malware body. The 2 Generating the callgraph 49

21 malware can mutate its decryptors in subsequent generations,

22 thereby complicating signature-based detection approaches. Primary tools used are described in more details in the 50

23 The decrypted body, however, remains constant. Metamor- Acknowledgements. 51

24 phic malware generally do not use , but are able

25 to mutate their body in subsequent generation using vari- 2.1 Samples 52

26 ous techniques, such as junk insertion, semantic NOPs, code

27 transposition, equivalent instruction substitution and regis- For non-malicious , henceforth called ‘goodware’, 53

28 ter reassignments [8,49]. For a recent formalization of these sampling followed a two-step process: We inventoried all PEs 54

(the primary 32-bit Windows file format) on a XP 55 D. Bilar (B) Home SP2 laptop, extracted uniform randomly 300 samples, 56 Department of Computer Science, Wellesley College,uncorrected proof Wellesley, MA 02481, USA discarded overly large and small files, yielding 280 samples. 57 e-mail: [email protected] For malicious software (malware), seven classes of interest 58 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar

59 were fixed: backdoor, hacking tools, DoS, trojans, exploits, A sequence of instructions that is continuous (e.g. has 105

60 virus, and worms. The worm class was further divided into no branches jumping into its middle and ends at a branch 106

61 Peer-to-Peer (P2P), Relay Chat/Instant Messenger instruction) is called a basic block. We consider the graph 107

62 (IRC/IM), Email and Network worm subclasses. For an non- formed by having each basic block as a node, and each short 108

63 specialist introduction to malicious software, see [48]; for a branch an edge. The connected components in this directed 109

64 canonical reference, see [50]. graph correspond to the flowgraphs of the functions in the 110

65 Each class (subclass) contained at least 15 samples. Since . For each connected component in the previ- 111

66 AV vendors were hesitant for liability reasons to provide ous graph, we create a node in the callgraph. For each far 112

67 samples, we gathered them from herm1t’s collection [24] branch in the connected component, we add an edge to the 113

68 and identified compiler and (potential) packer metadata using node corresponding to the connected component this branch 114

69 PEiD. Practically all malware samples were identified as hav- is targeting. Figures 7 and 8 in the Appendix illustrate a func- 115

70 ing been compiled by MS C++ 5.0/6.0, MS Visual Basic tion’s flow- and callgraph, respectively. 116 = ( , ) 71 5.0/6.0 or LCC, and about a dozen samples were packed Formally, denote a callgraph CG as CG G V E , 117 72 with various versions of UPX (an where G(·) stands for ‘Graph’. Let V = F, where F ∈ 118

73 program). Malware was run through best-of-breed, updated normal, import, library, thunk. This just says that each 119

74 open- and closed-source AV products yielding a false neg- function in CG is either a ‘library’ function (from an external 120

75 ative rate of 32% (open-source) and 2% (closed-source), libraries statically linked in), an ‘import’ function (dynam- 121

76 respectively. Overall file sizes for both mal- and goodware ically imported from a dynamic library), a ‘thunk’ function 122 1 77 ranged from (10kb) to (1MB). A preliminary file size (mostly one-line wrapper functions used for calling conven- 123

78 distribution investigation yielded a log-normal distribution; tion or type conversion) or a ‘normal’ function (can be viewed 124

79 for a putative explanation of the underlying generative pro- as the executables own function). Following metrics were 125

80 cess, see [38] and [31]. programmatically collected from CG 126

81 All 400 samples were loaded into the de-facto industry

82 (IDA Pro [22]), inter- and intra- – |V | is number of nodes in CG, i.e. the function count of 127 83 procedurally parsed and augmented with symbolic meta- the callgraph 128 84 information gleaned programmatically from the binary via – For any f ∈ V ,let f = G(V f , E f ) where b ∈ V f is 129 85 FLIRT signatures (when applicable). We exported the a block of code, i.e. each node in the callgraph is itself a 130 86 identified structures exported via IDAPython into a MySQL graph, a flowgraph, and each node on the flowgraph is a 131 87 database. These structures were subsequently parsed by a dis- basic block 132 88 assembly visualization tool (BinNavi [13]) to generate and – Define IC : B → N where B is defined to be set of 133 89 investigate the callgraph. blocks of code, and IC(b) is the number of instructions in 134 b. We denote this function shorthand as |b|IC, the number 135

90 2.2 Callgraph of instructions in basic block b. 136 |·| – We extend this notation IC to elements of V be defining 137 | f | = |b| . This gives us the total number of 138 91 Following [14], we treat an executable as a graph of graphs. IC b∈V f IC 92 This follows the intuition that in any procedural language, instructions in a node of the callgraph, i.e. in a function. 139 +( ) −( ) bb( ) 93 the source code is structured into functions (which can be –LetdG f , dG f and dG f denote the indegree, out- 140 94 viewed as a flowchart, e.g. a directed graph which we call degree and basic block count of a function, respectively. 141

95 flowgraph). These functions call each other, thus creating a

96 larger graph where each node is a function and the edges 2.3 Correlations 142 97 are calls-to relations between the functions. We call this

98 larger graph the callgraph. We recover this structure by di- We calculated the correlation between in and outdegree of 143 99 assembling the executable into individual instructions. We functions. Prior analysis of static class collaboration net- 144 100 distinguish between short and far branch instructions: Short works [40,44] suggest an anti-correlation, characterizing 145 101 branches do not save a return address while far branches do. some functions as source or sinks. We found no signifi- 146 102 Intuitively, short branches are normally used to pass con- cant correlation between in and outdegree of functions in 147 103 trol around within one function of the program, while far the disassembled executables (Fig.1). Correlation intuitively 148 104 branches are used to call other functions. is unlikely to occur except in the ‘0 outdegree’ case (the 149

BinNavi toolset does not generate the flowgraph for imported 150 1 ( ) ( ( )) functions, i.e. an imported function automatically has 151 A function f n is g n if there are positiveuncorrected constants c1, c2,and proof 152 n0 such that 0 ≤ c1g(n) ≤ f (n) ≤ c2g(n), ∀n ≥ n0 .See[26]fora outdegree 0, and but will be called from many other readable discussion of asymptotic notation etymology. functions). 153 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms B&W print Fig. 1 Correlation coefficient 0.4 0.4 rin,out Malware: p versus rin,out Goodware: p versus rin,out 0.3 0.3

0.2 0.2

0.1 0.1

0 0

−0.1 −0.1

−0.2 −0.2

−0.3 −0.3

−0.4 −0.4 −7 −6 −5 −4 −3 −2 −1 0 −7 −6 −5 −4 −3 −2 −1 0 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 (a) (b)

Table 1 Correlation, IQR for instruction count |V |, which may serve as a rough proxy for the executable’s 179 ( , ) = Class Metric (10) (100) (1000) size. The dark red point at X Y (0.87, 0.007) is 180 endnote.exe, since it is the only goodware binary with 181 − − 4 Goodware r 0.05 0.017 0.0366 functions count of (10 ). 182

IQR 12 44 36 Most thunks are wrappers around imports, hence in small 183

Malware r 0.08 0.0025 0.0317 executables, a larger proportion of the functions will be 184

IQR 8 45 28 thunks. The same holds for libraries: The larger the execut- 185

able, the smaller the percentage of libraries. This is heavily 186

influenced by the choice of dynamic versus static linking. The 187

thunk/library plot, listed for completeness reasons, does not 188 154 Additionally, we size-blocked both sample groups into give much information, confirming the intuition that they are 189 155 three function count blocks, with block criteria chosen as independent of each other, mostly due to compiler behavior. 190 156 (10), (100) and (1000) function counts to investigate a

157 correlation between instruction count in functions and com-

158 plexity of the executable (with function count as a proxy).

159 Again, we found no correlation at significance level ≤0.001. 2.5 α fitting with Hill estimator 191

160 Coefficient values and the IQR for instruction counts are

161 given in Table 1. IQR, short for Inter-Quartile Range, is a Taking my cue from [43] who surveyed empirical studies of 192

162 spread measure denoting the difference between the 75th technological, social, and biological networks, we hypoth- 193 + − 163 and the 25th percentiles of the sample values. esize that the discrete distributions of d ( f ), d ( f ) and 194 bb  164 The first result corroborate previous findings; the second d ( f ) follows a truncated power law of the form Pd ( f )(m)∼ 195 α − m d( f ) 165 result at the phenomenological level agrees with the ‘refac- m e kc , where kc indicates the end of the power law 196 166 toring’ model in [40], which posits that excessively long regime. Shorthand, we call αd( f ) for the respective metrics 197 167 functions that tend to be decomposed into smaller functions. αindeg, αoutdeg and αbb. 198 168 Remarkably, the spread is quite low, on the order of a few Figure 3a, b show pars pro toto the fitting procedures for 199

169 dozen instructions. We will discuss models more in Sect.4. our 400 samples. The plot is an empirical complimentary 200

cumulative density function plot (ECCDF). A cumulative 201

distribution function (CDF) F(x) = P[X ≤ x] of a random 202

170 2.4 Function types variable X denotes the probability that the observed value 203

of X is at most x. ‘Complimentary’ simply represents the 204

171 Each point in the scatterplots in Fig.2 represents three CDF as 1 − F(x), whereas the prefix ‘empirical’ signifies 205

172 metrics for one individual executable: Function count, and that experimental samples generated this (step) function. 206

173 the proportions of normal function, static library+dynamic The x-axis show indegree, the y-axis show the ECCDF 207

174 import functions, and thunks. Proportions for an individual P[X > x] that a function in endote.exe has indegree of 208

175 executable add up to 1. The four subgraphs are parsed thusly, x.IfP[X > x] can be shown to fit a Pareto distribution, we 209  ( ) 176 using Fig. 2 as an example. The x-axis denotesuncorrected the proportion can extract the power proof law exponent for PMF Pd ( f ) m from 210 177 of ‘normal’ function, and the y-axis the proportion of “thunk” the CDF fit (see [1] and more extensively [41] for the rela- 211

178 functions in the binaries. The color of each point indicates tionship between Pareto, power laws and Zipf distributions). 212 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar B&W print

Fig. 2 Scatterplot of function 4 4 type proportions GW:Norm 1 10 1 10 versus Lib+ Imp, MW:Norm versus Lib+ Imp, GW:Norm 0.8 0.8 versus Thunk, MW:Norm versus 3 3 10 Thunk, GW:Thunk versus 10 Lib+ Imp, MW:Thunk versus 0.6 0.6 Lib+Imp 0.4 0.4 2 10 2 10 0.2 0.2

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(a) GW:Norm vs Lib+Imp (b) MW:Norm vs Lib+Imp

4 4 1 10 1 10

0.8 0.8

3 3 10 10 0.6 0.6

0.4 0.4 2 10 2 10 0.2 0.2

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(c) GW:Norm vs Thunk (d) MW:Norm vs Thunk

4 4 1 10 1 10

0.9 0.9

0.8 0.8

0.7 0.7 3 3 10 10 0.6 0.6

0.5 0.5

0.4 0.4 2 10 0.3 2 0.3 10 0.2 0.2

0.1 0.1

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

(e) GW:Thunk vs Lib+Imp (f) MW:Thunk vs Lib+Imp

213 The probability mass function PMF p[X = x] denotes the The fitted distribution is shown in magenta, together with the 220 214 probability that a discrete random variable X takes on value x. parameters α = 1.97 and kc = 1415.83. 221 215 Parsing Fig.3a: Blue points denotes the data points (func- Although tempting, simply ‘eyeballing’ Pareto CDFs for 222

216 tions) and two descriptive statistics (median and the maxi- the requisite linearity on a log–log scale [21] is not enough: 223

217 mum value) for the indegree distribution for endote.exe. Following [38] on philosophy and [46] on methodology, we 224 218 We see that for endnote.exe, 80% ofuncorrected functions have a calculate the Hill estimatorproofαˆ whose asymptotical normal- 225 219 indegree= 1, 2% indegree > 10 and roughly 1% indegree > 20. ity is then used to compute a 95% CI. This is shown in the 226

123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms B&W print

Fig. 3 GW sample: Fitting 0 0 10 10 αindeg and kc, MW sample: 3 3 Fitting αbb and kc, Pareto fitting 2 2 ECCDFs, shown with Hill −1 −1 estimator inset 10 1 10 1 0 0 0 2 4 0 2 4 10 10 10 10 10 10

−2 −2 10 10

−3 −3 10 10

−4 −4 10 10 0 1 2 3 0 1 2 3 10 10 10 10 10 10 10 10

(a) (b)

227 inset and serves as a Pareto model self-consistency check Table 2 α Distribution fitting and testing α 228 that estimates the parameter as a function of the number Class Basic block Indegree Outdegree 229 of observations. As the number of observations i increase, a ( . , . ) ( . , . ) ( . , . ) 230 model that is consistent along the data should show roughly GW N 1 634 0 3 N 2 02 0 3 N 1 69 0 307 ( . , . ) ( . , . ) ( . , . ) 231 CIi ⊇ CIi+1. For an insightful exposé and more recent pro- MW N 1 7 0 3 N 2 08 0 45 N 1 68 0 35 − . 232 cedures to estimate Pareto tails, see [56,16]. t 2.57 1.04 0 47

233 To tentatively corroborate the consistency of our posited

234 Pareto model, 30 (goodware) and 21 (malware) indegree, out-

235 degree and basic block ECCDF plots were uniformly sam- nificance level 0.05 (tcritical = 1.97) only µ(αbb,malware) ≥ 261 236 pled into three function count blocks, with block criteria µ(αbb,goodware). 262 237 chosen as (10), (100) and (1000) function counts, For the basic blocks, kc ≈ LogN(59.1, 52) (goodware) 263 238 yielding a sampling coverage of 10 %(goodware) and 17% and ≈ LogN(54.2, 44) (malware) and µ(kc(bb, malware)) 264 239 (malware). Visual inspection indicates that for malware, the = µ(kc(bb, goodware)) was rejected via Wilcoxon Rank 265 240 model seemed more consistent for outdegree than indegree Sum with z = 13.4. 266 241 at all function sizes. For basic block count, the consistency The steeper slope of malware’s αbb imply that functions 267 242 tends to be better for smaller executables. We see these ten- in malware tend to have a lower basic block count. This 268 243 dency for goodware, as well, with the observation that out- can be accounted for by the fact that malware tends to be 269 244 ( ) ( ) degree was most consistent in size block 100 ;for 10 simpler than most applications and operates without much 270 245 ( ) and 1000 . For both malware and goodware, indegree interaction, hence fewer branches, hence fewer basic blocks. 271 246 seemed the least consistent, quite a few samples did exhibit Malware tends to have limited functionality, and operate 272 247 αˆ a so-called ‘Hill Horror Plot’ [46], where s and the corre- independently of input from user and the operating environ- 273 248 sponding CIs were very jittery. ment. Also, malware is usually not compiled with aggres- 274 249 α α α The fitted power-law exponents indeg, outdeg, bb, sive compiler optimization settings. Such a regime leads to 275 250 together with individual functions’ callgraph size are shown more inlining and thus increases the basic block count of 276 251 α ≈ in Fig. 4. For both classes, the range extends for indeg the individual functions. It may be possible, too, that mal- 277 252 α ≈ α ≈ [1.5–3], outdeg [1.1–2.5] and bb [1.1–2.1], with a ware authors tend to break functions into simpler compo- 278 253 slightly greater spread for malware. nents than ‘regular’ programmers. The smaller cutoff point 279

for malware seems to corroborate this, as well, in that the 280

power law relationship holds over a shorter range. However, 281

282 254 2.6 Testing for difference this explanation should be regarded as speculative pending further investigation. 283

255 We now check whether there are any statistically signifi- 256 cant differences between (α, kc) fit for goodware and mal- 257 ware, respectively. Following procedures in [57], We find 3 Related work 284 α α α 258 indeg, outdeg and bb distributed approximatelyuncorrected normal. The proof 259 exponential cutoff parameters kc are lognormally distributed. A simple but effective graph-based signature set to character- 285 260 Applying a standard two-tailed t test (Table2), we find at sig- ize statically disassembled binaries was proposed by Flake 286 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar B&W print Fig. 4 GW:α versus indeg 4 4 3 10 3 10 αoutdeg,MW:αindeg versus αoutdeg,GW:αindeg versus αbb, MW:αindeg versus αbb, αoutdeg 2.5 2.5 versus αbbdeg,MW:αoutdeg α versus bbdeg, Scatterplots of 2 3 2 3 10 10 α’s 1.5 1.5

1 1 2 2 10 10 0.5 0.5

0 0 0123 0123

(a) (b)

4 4 3 10 3 10

2.5 2.5

2 3 2 3 10 10

1.5 1.5

1 1 2 2 10 10 0.5 0.5

0 0 0123 0123

(c) (d)

4 4 3 10 3 10

2.5 2.5

2 3 2 3 10 10

1.5 1.5

1 1 2 2 10 10 0.5 0.5

0 0 0123 0123

(e) (f)

287 [18]. For the purposes of similarity analysis, he assigned streams, augments them with a colouring scheme, identifies 294

288 to each function a 3-tuple consisting of basic blocks count, k-connected subgraphs that are subsequently used as struc- 295

289 count of branches, and count of calls. These sets were used tural fingerprints. 296

290 to compare malware variants and localize changes; an in- Power-law relationships were reported in [52,40,53,7]. 297

291 depth discussion of the involved procedures can be found Valverde et al. [52] measured undirected graph properties of 298 292 in [14]. For the purposes of worm detection,uncorrected Kruegel [27] static class relationships proof for Development Framework 299 293 extracts control flow graphs from executable code in network 1.2 and a racing computer game, ProRally 2002. They found 300

123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms B&W print

Link Speed Router Speed (Gbps) (Gbps) 5.0 – 10.0 50 – 100 1.0 – 5.0 10 – 50 0.5 – 1.0 5 – 10 0.1 – 0.5 1 – 5 0.05 – 0.1 0.5 – 1.0 0.01 – 0.05 0.1 – 0.5 0.005 – 0.01 0.05 – 0.1 (a) (c) 0.001 – 0.005 0.01 – 0.05

102

101 Node Rank

100 100 101 102 103 (b) (d) (e) Node Degree

Fig. 5 Degree sequence (e) following power law is identical for all graphs (a–d)

301 the αJDK ≈ 2.5 − 2.65 for the two largest (N1 = 1, 376, Bilar [4] investigated and statistically compared opcode 335 302 N2 = 1, 364) connected components and αgame ≈ 2.85 ± frequency distributions of malicious and non-malicious exec- 336 303 1.1 for the game (N = 1, 989). In the context of study- utables. Weber et al. [54] start from the assumption that com- 337

304 ing time series evolution of C/C++ compile-time “#include” piled binaries exhibit homogeneities with respect to several 338 305 dependency graphs, αin ≈ 0.97 − 1.22 and an exponential structural features such as instruction frequencies, instruction 339 306 outdegree distribution are reported. This asymmetry is not patterns, memory access, jumpcall distances, entropy metrics 340

307 explained. and byte-type probabilities and that tampering by malware 341

308 Focusing on the properties of directed graphs, Potanin et would disturb these statistical homogeneities. 342

309 al. [44] examined the binary heap during execution and took

310 a snapshot of 60 graphs from 35 programs written in Java,

311 Self, C++ and Lisp. They concluded that the distributions of

312 incoming and outgoing object references followed a power 4 Optimization processes 343 313 law with αin ≈ 2.5 and αout ≈ 3. Myers [40] embarked 314 on an extensive and careful analysis of six large collabora- In 2003, Myers [40], within the context of code evolvabil- 344

315 tion networks (three C++ static class diagrams and three C ity, investigated how certain software engineering practices 345

316 callgraphs) and collected data on in/outdegree distribution, might alter graph topologies. He proposed a ‘refactoring’ 346

317 degree correlation, clustering and complexity evolution of model which was phenomenologically able to reproduced 347

318 individual classes through time. He found roughly similar key features of source code callgraphs, among them the in and 348 319 results for the callgraphs, αin ≈ αin ≈ 2.5, and noted that it outdegree distributions. He noted that refactoring techniques 349 320 was more likely to find a function with many incoming links could be rephrased as optimizations. Earlier and more 350

321 than outgoing ones. explicitly, Valverde et al. [52] speculated that multidimen- 351

322 More recently, Chatzigeorgiou et al. [7] applied algebraic sional optimization processes might be the causative mecha- 352

323 methods to identify, among other structures, heavily loaded nism for graph topological features they unearthed. It has also 353

324 ‘manager’ classes with high in- and outdegree in three static been suggested in other venues that optimization processes 354

325 OO class graphs. In the spirit of classification through motifs are the norm, even the driving force, for various physical, 355

326 in [35] and graphlets in [45], Chatzigeorgiou proposes a simi- biological, ecological and engineered systems [15,47]. We 356

327 larity-measure algorithm to detect Design Patterns [20], best- share this particular outlook. 357

328 practices high level design structure whose presence manifest We hypothesize that the call-graph features described in 358

329 themselves in the form of tell-tale subgraphs. the preceding sections may be the phenomenological signa- 359

330 Analysis of non-graph-based structural features of exec- ture of two distinct, domain-specific HOT (Highly Optimized 360

331 utables were undertaken by [4,30,54]. Li et al. [30]used Tolerance) optimization processes; one involving human 361 332 statistical 1-g analysis of binary byte valuesuncorrected to generate a designers and the other,proof code compilers. HOT mechanisms 362 333 fingerprint (a ‘fileprint’) for file type identification and clas- are processes that induce highly structured, complex sys- 363

334 sification purposes. At the semantically richer opcode level, tems (like a binary executable) through heuristics that seek 364 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar

365 to optimally allocate resources to limit event losses in an involved discussion of software engineering practices and 413

366 probabilistic environment [5]. their relation to complex networks, the reader is referred 414

to [40]. Designers take implicitly (rarely explicitly, though 415

they should) the probability of the event space into consid- 416 367 4.1 Background eration, indirectly through the choice of programming lan- 417

guage (typed, OO, procedural, functional, etc.) and directly 418 368 For a historical sketch of models and processes that induce through the design choice of data structures and control flow. 419 369 graphs, the reader is referred to [42]; for a shorter, more Human programmers generally design for average (or even 420 370 up-to-date synopsis on power laws and distinctive gener- optimal) operating environments; the resulting programs 421 371 ative mechanisms, including HOT, see [41]. Variations of deal very badly with exceptional conditions effected by 422 372 the Yule process , pithily summarized as a ‘rich-get-richer’ random inputs [33,34] and resource scarcity [55]. 423 373 scheme, are the most popular. Physicist Barabasi rediscov- For years, the most common attack technique has been 424 374 ered and recoined the process as ‘preferential attachment’ exploiting input validation vulnerabilities, accounting for 425 375 [3], although the process discovery antedates him by at least over half of the published software vulnerabilities and over 426 376 40years (its origins lay in explaining biological taxa). In eighty percent of faults leading to successful penetration 427 377 some quarters of the physics community, power laws have attacks. Miller et al. testing Unix, Windows and OS X 428 378 also been taken as a signature of emergent complexity posited utilities [33,34] by subjecting them in the simplest case to 429 379 by critical phenomena such as phase transitions and chaotic random keyboard input, and reports crash failure rates of 430 380 bifurcation points [6]. 25–40, 24, and 7%, respectively. More recently, Whittaker et 431 381 The models derived from such a framework are mathe- al. [55] describe a dozen practical attack techniques targeting 432 382 matically compelling and very elegant in their generality; resources against which the executable were constrained (pri- 433 383 with little more than a couple of parameter adjustments, they marily by the human designer); among them memory, mem- 434 384 are able at some phenomenological level to generate graphs ory, disk space and network availability conditions. Effects 435 385 whose aggregate statistics (sometimes provably, sometimes of so-called “Reduction of Quality” attacks against optimiz- 436 386 asymptotically) exhibit power-law distributions. Although ing control systems have also been studied by [36,37]. We 437 387 these models offer a relatively simple, mathematically trac- shall give a toy example illustrating the ‘attack-as-system 438 388 table approximation of some features of the system under perturbation-by-rare-events’ view in Sect.4.4. 439 389 study, we think that HOT models with their emphasis on

390 evolved and engineered complexity through feedback, trade-

391 offs between objective functions and resource constraints sit- 4.3 Compiler as HOT mechanism 440 392 uated in a probabilistic environment is a more natural and

393 appropriate framework to represent the majority of real-life The second domain-specific mechanism that induces a cost- 441 394 systems. We illustrate the pitfalls of a narrow focus on power optimized, resource-constrained structure on the executable 442 395 law metrics without proper consideration of real-life domain is the compiler. The compiler functions as a HOT process. 443 396 specification, demands and constraints with Fig. 5 from [12]: Cost function here include memory footprint, execution 444 397 Note that the degree sequence in Fig. 5eisthesameforall cycles, and power consumption minimization, whereas the 445 398 Fig. 5a–d, yet the topological structure for Fig. 5a–d is vastly constraints typically involves register and cache line allo- 446 399 different. Along these lines, domain experts have argued cation, opcode sequence selection, number/stages of pipe- 447 400 against ‘emergent’ complexity models in the cases of Internet lines, ALU and FPU utilization. The interactions between 448 401 router [29] and in river stream [25] structures. at least 40+ optimization mechanisms (in itself a network 449

graph [39, pp. 326+]) are so complex that meta-optimiza- 450

402 4.2 Human design and coding as HOT mechanism tions [23] have been developed to heuristically choose a sub- 451

set from the bewildering possibilities. Although the callgraph 452

403 The first domain-specific mechanism that induces a cost- is largely invariant under most optimization regimes, more 453

404 optimized, resource-constrained structure on the executable aggressive mechanisms can have a marked effect on call- 454

405 is the human element. Humans using various best-practice graph structure. Figure6 shows a binary’s CFG induced by 455

406 software development techniques [20,28] have to juggle at the Intel C++ Compiler 9.1 under a standard optimization 456

407 various stage of the design and coding stages: Evolvability regime. The yellowed sections are loop structures. Figure6 457

408 versus specificity of the system, functionality versus code shows the binary CFG of the same source code, but com- 458

409 size, source readability versus development time, debugging piled under a more aggressive inlining regime. We see that 459 410 time versus time-to-market, just to nameuncorrected a few conflicting the compiler unrolled proof the loops into an assortment of switch 460 411 objective function and resource constraints. Humans design statements, vastly increasing the number of basic blocks, and 461

412 and implement programs against a set of constraints. For an hence changing the executable’s structural features. 462 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms B&W print Fig. 6 Compiler: CFG without loop unrolling, Compiler: CFG with loop unrolling, basic block differences in CFG under compiler optimization regimes

(a) (b)

463 4.4 Example: PLR optimization problem as a HOT process which is our objective function to be minimized (Eq. 1). 480 Resources ri are hedged against losses li (Eq. 4), subjected to 481 464 The HOT mechanism inducing the structure of the callgraph resource bounds R (Eq. 2). We will demonstrate the applica- 482

465 executable can be formulated as a probability, loss, resource bility of this PLR model with the following short C program, 483

466 (PLR) optimization problem, which in its simplest form can adapted from [19]: 484

467 be viewed as a generalization of Shannon source coding for #include 485 468 [32]. The reader is referred to [11]for #include 486 469 details; I just give a sketch of the general formulation and #include 487

470 a motivating example: 488 int provePequalsNP() 489 490 471 min J (1) { /∗ Next paper . . ∗/ 491 } 492 472 subject to  int bof() 493 { 494 473 ri ≤ R (2) char buffer[8]; /∗ an 8 byte character buffer ∗/ 495 strcpy(buffer,gets()); /∗get input from the user∗/ 496 474 where ∗  / may not return if buffer overflowed 497 return 42; 498 475 J = p l (3) i i } 499 476 li = f (ri ) (4) 500 int main(int argc, char ∗∗argv) 501 477 ≤ ≤ 1 i N (5) { 502 ∗ ∗ uncorrectedbof(); / call bof() functionproof/ 503 478 We have a set of N events (Eq. 5) with occurring iid with /∗ execution may never reach 504 479 probability pi incurring loss li (Eq. 3), the sum-product of next function because of overflow∗/ 505 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar

506 provePequalsNP(); Pareto model. The power-laws evidenced in the binary call- 556 507 ∗ ∗ return 1000000; / exit with Clay prize / graph structure may be the result of optimization processes 557 508 } which take objective function tradeoffs and resource con- 558 509 } straints into account. In the case of the callgraph, the primary 559

510 We assume here that the uncertain, probabilistic environ- optimizer is the human designer, although under aggressive 560

511 ment is just the user. She is asked for input in gets(),this optimization regimes, the compiler will alter the callgraph, 561

512 represents the event. In the C code, the human designer spec- as well. 562 513 ified an 8 byte buffer (char buffer[8]) and the com-

514 piler would dutifully allocate the minimum buffer needed Acknowledgments This paper is an extended and revised version of 563 515 for 8 bytes (this is the resource r). Hence, the constrained a paper to appear in the European Journal on Artificial Intelligence’s 564 516 buffer resources r is the variable . The loss associated with special issue on “Network Analysis in Natural Sciences and Engi- 565 517 the user input event is really a step function; as long as the neering”. The goodware samples were indexed, collected and meta- 566

518 user satisfies the assumption of the designer, the ‘loss’ is data identified using Index Your Files—Revolution! 3.1, Advanced Data 567 Catalog 1.51 and PEiD 0.94, all freely available from www.softpe- 568 519 constant, and can be seen (simplified) as just the ‘normal’ dia.com. The executable’s callgraph generation and structural identifi- 569 520 loss incurred in proper continuation of control flow. Put dif- cation was done with IDA Pro 5 and a pre-release of BinNavi 1.2, both 570 521 ferently, as long as user input is ≤8bytes, the resource r commercially available at www.datarescue.com and www.sabre-secu- 571

522 is minimally sufficient to ensure normal control flow con- rity.com. Programming was done with Python 2.4, freely available at 572 www.python.org. Graphs generated with and some analytical tools pro- 573 523 tinuation. If, however, the user decides to input ‘Superfra- vided by Matlab 7.3, commercially available at www.matlab.com. We 574 524 gilisticexpialidocious’ (which was implicitly assumed to be would like to thank Thomas Dullien (SABRE GmbH) without whose 575 525 an unlikely/impossible event by the human designer in the superb tools and contributions this paper could not have been written. 576

526 code declaration), the loss l takes a huge jump: a catastrophic He served as the initial inspiration for this line of research. Furthermore, 577 we would like to thank Walter Willinger (AT&T Research), Ero Car- 578 527 loss ensues since strcpy(buffer,gets()) overflows rera (SABRE), Jason Geffner (Microsoft Research), Scott Anderson, 579 528 buffer . The improbable event breaches the resource and Frankly Turbak, Mark Sheldon, Randy Shull, Frédéric Moisy (Univer- 580 529 now, control flow may be rerouted, the process crashed, shell- sité Paris-Sud 11), Mukhtar Ullah (Universität Rostock), David Alder- 581

530 code executed via a stack overflow (or in our example, fame son (Naval Post Graduate School), as well as the anonymous reviewers 582 for their helpful comments, suggestions, explanations, and arguments. 583 531 remains elusive). This is a classic buffer overflow attack and

532 the essence of hacking in general—violating assumptions by

533 ‘breaking through’ the associated resource allocated explic- 6 Appendix 584 534 itly (input validation) and implicitly (race condition attacks,

535 for instance) by the programmer, compiler or at runtime by We illustrate the concept of a flowgraph, callgraph and basic 585 536 the OS. 2 block by means of a fragment dissassembly of Backdoor. 586 537 What could have prevented this catastrophic loss? A type- Win32.Livup.c. We focus on the function sub_402400, 587 538 safe language such as Java and C# rather than C, more resour- consisting of six basic blocks. The flowgraph is given in 588 539 ces in terms of buffer space and more code in terms of bounds Fig. 7. The assembly code for one basic block starting at 589 540 checking from the human designer’s side theoretically would 0x402486 and ending with a jz at 0x4024B9 is given 590 541 have worked. In practice, for a variety of reasons, program- below. Figure8 shows the callgraph of sub_402400. 591 542 mers write unsafe, buggy code. Recently, compiler guard

543 techniques [10] have been developed to make these types

544 of system perturbation attacks against allocated resources

545 harder to execute or more easily to detect; again attacks

546 against these compiler guard techniques have been developed

547 [2].

548 5 Conclusion

549 We started by analyzing the callgraph structure of 120 mali-

550 cious and 280 non-malicious executables, extracting descrip-

551 tive graph metrics to assess whether statistically relevant

552 differences could be found. Malware tends to have a lower 553 basic block count, implying a simpler structureuncorrected (less inter- proof 554 action, fewer branches, limited functionality). The metrics 2 See www.viruslist.com/en/viruses/encyclopedia?virusid=44936 for 555 under investigation were fitted relatively successfully to a more information. 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms B&W print Fig. 7 Flowgraph for function sub_402400 B&W print Fig. 8 Callgraph for function sub_402400

592 References 5. Carlson, J.M., Doyle, J.: Highly optimized tolerance: A 603 mechanism for power laws in designed systems. Phys. Rev. 604

593 1. Adamic, L., Huberman, B.: Zipf’s law and the internet. Glotto- E 60(2), 1412 (1999) 605 594 metrics 3, 143–150 (2002) 6. Carlson, J.M., Doyle, J.: Complexity and robustness. Proc. Natl. 606 595 2. Alexander, S.: Defeating compiler-level buffer overflow protec- Acad. Sci. 99(Suppl 1), 2538–2545 (2002) 607 596 tion.. J-LOGIN 30(3), 59–71 (2005) 7. Chatzigeorgiou, A., Tsantalis, N., Stephanides, G.: Application 608 597 3. Barabasi, A.L.: Mean field theory for scale-free random networks. of graph theory to OO software engineering. In WISER ’06: 609 598 Phys. A Stat. Mech. Appl. 272, 173–187 cond-mat/9907068 Proceedings of the 2006 International Workshop on Workshop 610 611 599 (1999) on Interdisciplinary Software Engineering Research, pp. 29–36, 612 600 4. Bilar, D.: Fingerprinting malicious code through statistical opcode New York, NY, USA. ACM Press, New York (2006) rd 613 601 uncorrected8. Christodorescu, M.,proof Jha, S.: Static analysis of executables to analysis. In: ICGeS ’07: Proceedings of the 3 International Con- 614 602 ference on Global E-Security, London (UK) (2007) detect malicious patterns. In Security ’03: Proceedings of the

123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 D. Bilar

th 615 12 USENIX Security Symposium, pp. 169–186. USENIX the Sixth Annual IEEE Information Assurance Workshop on Sys- 681 616 Association, USENIX Association (2003) tems, Man and Cybernetics, pp. 64– 71. West Point, New York 682 617 9. Clementi, A.: Anti-virus comparative no. 11. Technical report, (2005) 683 618 Kompetenzzentrum IT, Insbruck (Austria). http://www.av-com- 31. Limpert, E., Stahel, W.A., Abbt, M.: Log-normal distributions 684 619 paratives.org/seiten/ergebnisse/report11.pdf (2006) across the sciences: keys and clues. BioScience 51(5), 341– 685 620 10. Cowan, C., Pu, C., Maier, D., Walpole, J., Bakke, P., Beattie, 352 (2001) 686 621 S., Grier, A., Wagle, P., Zhang, Q., Hinton, H.: StackGuard: 32. Manning, M., Carlson, J.M., Doyle, J.: Highly optimized toler- 687 622 Automatic adaptive detection and prevention of buffer-overflow ance and power laws in dense and sparse resource regimes. Phys. 688 623 attacks. In: Proceedings of 7th USENIX Security Conference, Rev. E (Stat. Nonlinear Soft Matter Phys.) 72(1), 016108–016125 689 624 pp. 63–78, San Antonio, Texas (1998) (2005), physics/0504136 690 625 11. Doyle, J., Carlson, J.M.: Power laws, highly optimized tolerance, 33. Miller, B.P., Cooksey, G., Moore, F.: An empirical study of the 691 626 and generalized source coding. Phys. Rev. Lett. 84(24), 5656– robustness of macos applications using random testing. In: RT 692 627 5659 (2000) ’06: Proceedings of the 1st International workshop on Random 693 628 12. Doyle, J.C., Alderson, D.L., Li, L., Low, S., Roughan, M., Testing, pp. 46–54, New York, NY, USA. ACM Press, New York 694 629 Shalunov, S., Tanaka, R., Willinger, W.: The “robust yet frag- (2006) 695 630 ile” nature of the Internet. Proc. Natl. Acad. Sci. 102(41), 14497– 34. Miller, B.P., Fredriksen, L., So, B.: An empirical study of the 696 631 14502 (2005) reliability of unix utilities. Commun. ACM 33(12), 32–44 (1990) 697 632 13. Dullien, T.: Binnavi v1.2. http://www.sabre-security.com/prod- 35. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., 698 633 ucts/binnavi.html (2006) Alon, U.: Network Motifs: Simple Building Blocks of Complex 699 634 14. Dullien, T., Rolles, R.: Graph-based comparison of executable Networks. Science 298(5594), 824–827 (2002) 700 635 objects. In SSTIC ’05: Symposium sur la Sécurité des Technol- 36. Mina Guirguis, A.B., Matta, I.: Reduction of quality (roq) attacks 701 636 ogies de l’Information et des Communications. Rennes, France on dynamic load balancers: Vulnerability assessment and design 702 th 637 (2005) tradeoffs. In: Infocom ’07: Proceedings of the 26 IEEE Inter- 703 638 15. Ekeland, I.: The Best of All Possible Worlds: Mathematics and national Conference on Computer Communication, Anchorage 704 639 Destiny. University of Chicago Press, (2006) (AK) (2007, to appear) 705 640 16. Fan, Z.: Estimation problems for distributions with heavy tails. 37. Mina Guirguis, I.M., Bestavros, A., Zhang, Y.: Adversarial 706 641 PhD thesis, Georg-August-Universität zu Göttingen (2001) exploits of end-systems adaptation dynamics. J. Parallel Distrib. 707 642 17. Filiol, É.: Metamorphism, formal grammars and undecidable code Comput. ( 2007, to appear) 708 643 mutation. Int. J. Comput. Sci. 2(2), 70–75 (2007) 38. Mitzenmacher, M.: Dynamic models for file sizes and double 709 644 18. Flake, H.: Compare, Port, Navigate. Black Hat Europe 2005 Brief- pareto distributions. Internet Math. 1(3), 305–334 (2004) 710 645 ings and Training (2005) 39. Muchnick, S.S.: pp. 326–327 (1998) 711 646 19. Foster, J.C., Osipov, V., Bhalla, N., Heinen, N.: Buffer Overflow 40. Myers, C.: Software systems as complex networks: Structure, 712 647 Attacks. Syngress, (2005) function, and evolvability of software collaboration graphs. Phys. 713 648 20. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: Rev. E (Stat. Nonlinear Soft Matter Phys.) 68(4), 046116 (2003) 714 649 Abstraction and reuse of object-oriented design. Lect. Notes Com- 41. Newman, M.: Power laws, Pareto distributions and Zipf’s 715 650 put. Sci. 707, 406–431 (1993) law. Contemp. Phys. 46(5), 323–351 (2005) 716 651 21. Goldstein, M.L., Morris, S.A., Yen, G.G.: Problems with fitting to 42. Newman, M., Barabasi, A.-L., Watts, D.J.: The Structure 717 652 the power-law distribution. Eur. J. Phys. B 41(2), 255–258 cond- and Dynamics of Networks: (Princeton Studies in Complex- 718 653 mat/0402322 (2004) ity). Princeton University Press, Princeton (2006) 719 654 22. Guilfanov, I.: Ida pro v5.0.0.879. http://www.datarescue.com/ 43. Newman, M.E.J.: The structure and function of complex net- 720 655 idabase/ (2006) works. SIAM Rev. 45, 167 (2003) 721 656 23. Haneda, M., Knijnenburg, P.M.W., Wijshoff, H.A.G.: Optimiz- 44. Potanin, A., Noble, J., Frean, M., Biddle, R.: Scale-free geometry 722 657 ing general purpose compiler optimization. In: CF ’05: Proceed- in oo programs. Commun. ACM 48(5), 99–103 (2005) 723 nd 658 ings of the 2 Conference on Computing Frontiers, pp. 180–188, 45. Pržulj, N.: Biological network comparison using graphlet degree 724 659 New York, NY, USA. ACM Press, New York (2005) distribution. In: Proceedings of the 2006 European Conference on 725 660 24. herm1t. VX Heaven. http://vx.netlux.org// (2007) Computational Biology, ECCB ’06, Oxford, UK. Oxford Univer- 726 661 25. Kirchner, J.W.: Statistical inevitability of horton’s laws and sity Press, New York (2006) 727 662 the apparent randomness of stream channel networks. Geol- 46. Resnick, S.: Heavy tail modeling and teletraffic data. Ann. 728 663 ogy 21, 591–594 (1993) Stat. 25(5), 1805–1869 (1997) 729 664 26. Knuth, D.E.: Big omicron and big omega and big theta. SIGACT 47. Schneider, E.D., Sagan, D.: Into the Cool : Energy Flow, Thermo- 730 665 News 8(2), 18–24 (1976) dynamics, and Life. University Of Chicago Press, Chicago (2005) 731 666 27. Krügel, C., Kirda, E., Mutz, D., Robertson, W., Vigna, G.: Poly- 48. Skoudis, E., Zeltser, L.: Malware: Fighting Malicious Code. Pren- 732 667 morphic worm detection using structural information of execu- tice Hall PTR, Upper Saddle River (2003) 733 668 tables. In: Valdes, A., Zamboni, D. (eds.) Recent Advances in 49. Szor, P.: The Art of Research and Defense, 734 669 Intrusion Detection, vol. 3858 of Lecture Notes in Computer pp. 252–293. Prentice Hall PTR, Upper Saddle River (2005) 735 670 Science, pp. 207–226. Springer, Heidelberg (2005) 50. Szor, P.: The Art of Computer Virus Research and Defense. Addi- 736 671 28. Lakos, J.: Large-scale C++ software design.. Addison Wesley son-Wesley Professional, Upper Saddle River (NJ) (2005) 737 672 Longman Publishing Co., Inc, Redwood City (1996) 51. Szor, P., Ferrie, P.: Hunting for metamorphic. In: VB ’01: Pro- 738 th 673 29. Li, L., Alderson, D., Willinger, W., Doyle, J.: A first-principles ceedings of the 11 Virus Bulletin Conference (2001) 739 674 approach to understanding the internet’s router-level topology. In: 52. Valverde, S., Ferrer Cancho, R., Solé, R.V.: Scale-free networks 740 675 SIGCOMM ’04: Proceedings of the 2004 Conference on Appli- from optimal design. Europhys. Lett. 60, 512–517 (2002) cond- 741 676 cations, Technologies, Architectures, and Protocols for Computer mat/0204344 742 677 Communications, pp. 3–14, New York, NY, USA. ACM Press, 53. Valverde, S., Sole, R.V.: Logarithmic growth dynamics in soft- 743 678 New York (2004) uncorrectedware networks. Europhys. proof Lett. 72, 5–12 (2005) physics/0511064 744 679 30. Li, W.-J., Wang, K., Stolfo, S., Herzog, B.: Fileprints: Identify- 54. Weber, M., Schmid, M., Schatz, M., Geyer, D.: A toolkit for 745 680 ing file types by n-gram analysis. In SMC ’05: Proceedings from detecting and analyzing malicious software. In: ACSAC ’02: 746 123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13 On callgraphs and generative mechanisms

th 747 Proceedings of the 18 Annual Computer Security Applications ’04: Proceedings of the 36th Conference on Winter Simulation, 754 748 Conference, Washington (DC) (2002) pp. 130–141. Winter Simulation Conference (2004) 755 749 55. Whittaker, J., Thompson, H.: How to break Software secu- 57. Wu, G.T., Twomey, S.L., Thiers, R.E.: Statistical evaluation of 756 750 rity. Addison Wesley (Pearson Education), Reading (2003) method-comparison data. Clin. Chem. 21(3), 315–320 (1975) 757 751 56. Willinger, W., Alderson, D., Doyle, J.C., Li, L.: More normal 752 than normal: scaling distributions and complex systems. In: WSC 753

uncorrected proof

123

Journal: 11416 MS: 57 CMS: GIVE CMS TYPESET DISK LE CP Disp.:2007/7/4 Pages: 13