Analysis of Adversarial Code: Problem, Challenges, Results

Arun Lakhotia Center for Advanced Computer Studies University of Louisiana at Lafayette www.cacs.louisiana.edu/labs/SRL

Black Hat Federal 2006 Sheraton Crystal City, Washington DC January 23-26, 2006 Adversarial Code Analysis

• ACA at UL Lafayette – Ongoing research for over 3 years – Evolved from analyzing and writing virus detectors – Impacted by failures in using traditional analysis – Aim: fundamental advances in hardening analysis • focus: key (real) problems in analysis • develop and adapt theoretical approaches • build and test prototypes

2 Our ACA Approach

• Short term – Harden individual steps – Use solid theory • Long term – Holistic infrastructure improvement

3 Malware Detection Process

ANALYSIS IN LAB

Sample Extract Signature + Filter Analyze Removal Instructions Verify

ON DESKTOP Signature Removal Instructions

Clean File/Message Match Disinfect Scanner System System

4 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

5 Program Analysis: The Old Frontier • Half a century of program analysis – Compilers, optimizers, checkers, refactoring tools – Analyze, visualize, transform • Problem Space – Program optimization – Profiling, testing, debugging – Understanding, Comprehension – Reengineering • Purpose – Help programmers help themselves

6 Typical Analysis Pipeline

PROGRAM DATABASE

extract extract control verify disassemble procedures & data flow property

certify / reject

7 Program Analysis: The New Frontier • Analysis of malicious programs – Viruses, worms, Trojans, spyware, adware • Problems – What does the malware do? – What attack tools and methods are employed? – How did it arrive on this computer? – Which other computers did it go to? – Who wrote the malware? • Purpose – Help security analysts defend computing resources

8 Malware Analysis Pipeline

VIRUS DATABASE

extract extract control verify disassemble procedures & data flow property

certify / reject

9 Implications of Undecidability Analysis problems are undecidable Precise solution Precise solutions cannot be computed ‘Safe’ solution

Solutions are approximated

Play ‘safe’: over approximate or under approximate

Catch: ‘Safe’ solutions leave hideouts for malware

Hideout for malware 10 Problem: Analyses Not Hardened

DATABASE

extract extract control verify disassemble proceduresD I S A& dataB flowL E D property!

certify / reject

11 Malware Analysis Pipeline

DATABASE

extract extract control verify disassemble procedures & data flow property

certify / reject

12 extract extract control verify disassemble procedures & data flow property

decode machine instructions (byte seq) ORIG BYTES ASSEMBLY

401063: 5d pop %ebp bad disassembly jump over junk 401064: c3 (no jump target) ret 401065: 55 push %ebp 401066: 89 e5 mov %esp,%ebp malicious func 401068: 83 ec 08 sub $0x8,%esp 40106b: eb 05 jmp 0x401072 40106d: e8c7 ee ff ff ff e8 callmov 0x401060 $0xe8ffffff,%esi 401072:401073: e8e9 e9ff ff ff ff ff c7 ff calljmp 0x401060 0xc8401077 401077:401078: c745 45 fc 00 00 00 00 movlinc % $0x0,0xfffffffc(%ebp)ebp 40107e:401079: fc81 7d fc e7 03 00 00 cmplcld $0x3e7,0xfffffffc(%ebp)

13 extract extract control verify disassemble procedures & data flow property

trace call structure (control flow) L0: call F L0: push L1 L1:  push F 401063: 5d pop % ebp L1: ret 401064: c3 ret instr. substitution 401065 <_malicious>: 401065: 55 push %ebp 401066: 89 e5 mov %esp,%ebp 401068: 83 ec 08 sub $0x8,%esp no call found 40106b: ebff 35 05 78 10 40 00 pushljmp 401072 401078 <_malicious+0xd> <_malicious+0x13> 401071:40106d: e8ff 35 ee 60 ff ff10 ff 40 00 pushlcall 401060 401060 <_ <_sendLotsOfEmailsendLotsOfEmail>> 401077:401072: e8c3 e9 ff ff ff retcall 401060 <_sendLotsOfEmail> 401078:401077: c7 45 fc 00 00 00 00 movl $0x0,0xfffffffc(%ebp)

14 extract extract control verify disassemble procedures & data flow property

verify security or match pattern/signature

40106b:push x ff 35 78 10push 40 00 x pushl 401078 <_malicious+0x13> 401071:push y ff 35 60 10push 40 00 z pushl 401060 <_sendLotsOfEmail> 401077:ret c3 pop ret 401078: c7 45 fc 00push 00 00 y 00 movl $0x0,0xfffffffc(%ebp) ret

• Transformations destroy signature/pattern match – eg metamorphic viruses: self-transforming – Instruction substitution, nop insertion, etc.

15 Attacks on Signature Analysis

• Polymorphic malware – Code is encrypted – Carries a decryptor – Decryptor transformed before propagation • Metamorphic malware – Whole code transformed before propagation – So far threat mostly ‘in-the-zoo’ so far – Off-the-shelf metamorphic engines available, improving • Packed malware – Rapidly release variants packed by different packers – Overwhelm the security analysts

16 Current AV Infrastructure

• Human intensive – Analysts specialize on specific attacks • In leading companies, person(s) dedicated to deal with packers – Knowledge resident in specialists • High workload – Spyware – may have few HUNDRED programs – About 5-8 samples per analyst per day

17 Current AV Infrastructure

• Depend on tools not designed for the trade – Disassemblers, debuggers, program monitors – No methodical way to organize knowledge • Rely on Google • The Bright Side – Significant advances in dynamic analysis – Metamorphic viruses detected by emulation • Has its own set of issues

18 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

19 Manual Analysis Process

Configure Environment Static Analysis

Dynamic Analysis Process Activity Network Activity

20 Configure Environment

• Requirements – Prevent contamination of COMMON TOOLS production systems VMWare – Quickly undo damage MS Virtual PC – Allow interaction among multiple systems • Goals – Prep files and operating systems for infection – Initialize analysis tools

21 Static Analysis

• Goals – Quickly identify key program COMMON TOOLS features Strings BinText • does it send mail? IDA Pro • … open an IRC channel? • … kill processes? – Quickly identify possible malicious intent

22 Dynamic Analysis

• Goals – Identify process activity COMMON TOOLS • are processes created/killed? Process Explorer – Identify hard disk actvity FileMon RegMon – Identify network activity RegShot – Identify registry changes ProcDump • Used when deeper IDA Pro understanding is required OllyDbg

23 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

24 A Shift in Research Focus

Question How to make program analysis battle-ready?

25 Revisit Assumptions

• Assumption #1: – Programmers and analysis tools have common goal

26 Revisit Assumptions

• Reality – Programmer (writer of code under scrutiny) and tools are adversaries

27 Revisit Assumptions

• Assumption #2: – Malware authors have the benefit of surprise

28 Revisit Assumptions

• Reality – High degree of reuse and plagiarism • Jaschan (2004), author of , copied from Lovesan • Others immediately picked up on his ideas. 12,000 Total viruses and worms* 10,000 Total families* 8,000

6,000

4,000

2,000

0 Jan-June 2003 July-Dec 2003 Jan-June 2004 July-Dec 2004 Jan-June 2005 *Source: Symantec Internet Threat Report, January – June 2005 29 Revisit Assumptions

• Reality (cont.) – No big bang; malware also evolves • Beagle versions from A, B, .., AA, to ED • Each version introduced small change – Inventions discussed in Blackhat forums • Format string attack; attack on Oracle – Vulnerability and exploits often first found by security analysts • Implication – Utilize knowledge outside of code under scrutiny

30 Revisit Assumptions

• Assumption #3 – Undecidability a hindrance

31 Revisit Assumptions

• Reality – Analysis tools can live with ‘statistical’ equivalence • Need statistical ‘safety’, not theoretical safety – Undecidability is a two-edged sword • Self-transforming code must analyze itself • Must deal with undecidability too – Metamorphic virus W32.Evol does not use any disassembly attack • Not easy to exploit – Trend has moved to packed malware – Complete obfuscation is impossible [Barak et al. 2001] • Implication – Develop targeted deobfuscators

32 Vision for an Analysis Infrastructure

• Day in the life of an analyst – Arrive at work – Analyze a sample • Sample pre-analyzed, relation with other malware annotated • Review, verify annotations • Move on to next sample, if satisfied with findings • Analyze un-annotated parts – Use an integrated environment with dynamic/static tools – Apply various deobfuscators to discover hidden meaning • Add annotations of finding into the knowledge base – Move on to the next sample

33 Core Capabilities Needed

• Hardened static analysis – All phases should be: • semantic driven • interleaved • utilize knowledge base – New questions, new algorithms • Can a variable have a certain value on some path? • What if traditional procedural units do not exist? • Probabilistic analysis

34 Core Capabilities Needed

• Integrated security analysis environments – Integrate dynamic and static analysis – Knowledge base – Comparison of code fragments • Catch evolutionary relation between families, and within family – Deobfuscators, targeted • Undo call obfuscations, key to determining behavior • Undo transformations

35 Portfolio of Results

Create Deobfuscate malware Calls phylogeny

Reverse self transformations

36 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

37 Overall Identification Problem

W32/.u@mm W32.Beagle.A@mmW32/Bugbear.17916intd

W32/Bagle.j@mm W32..F@mm W32..A W32.Beagle.J@mm [email protected] W32.Beagle.U@mm ??

W32.NetSky.D W32/Bagle.ao@mm W32/NetSky.B W32.NetSky.B W32/NetSky.A W32/Bagle.a@mm W32/Klez.i@MM

W32/Klez.f@MM W32.Beagle.AO@mm W32.Klez.I@mm W32/Klez.e@MM 38 How to Name and Classify?

Symantec McAfee

W32.NetSky.A W32/NetSky.A W32.NetSky.B W32/NetSky.B W32.NetSky.D W32/Bugbear.17916intd W32.Beagle.A@mm W32/Bagle.a@mm W32.Beagle.J@mm W32/Bagle.j@mm ?? W32.Beagle.AO@mm W32/Bagle.aq@mm W32.Beagle.U@mm W32/Bagle.u@mm ?? [email protected] W32/Klez.e@MM W32.Klez.F@mm W32/Klez.f@MM W32.Klez.I@mm W32/Klez.i@MM

39 Generating Phylogeny Model phylogeny: evolutionary relationships between organisms

NetSky.A NetSky.B • Use cluster analysis NetSky.D

Beagle.A • Key need Beagle.D – A similarity measure Beagle.U Beagle.AO • N-grams based measure not-effective ?? – Cannot account for permutations Klez.E • Developed N-perm similarity measure Klez.F Klez.I – Influenced by bio-informatics

40 Example: Permuted Netsky worm l2D2: push ecx l144: push ecx push 4 push 4 pop ecx pop ecx push ecx push ecx l2D7: rol edx, 8 l149: mov dl, al mov dl, al and dl, 3Fh and dl, 3Fh rol edx, 8 shr eax, 6 shr ebx, 6 loop l2D7 loop l149 pop ecx pop ecx call s319 call s52F xchg eax, edx xchg ebx, edx stosd stosd xchg eax, edx xchg ebx, edx inc [ebp+v4] inc [ebp+v4] cmp [ebp+v4], 12h cmp [ebp+v4], 12h jnz short l305 jnz short l18 41 Permutation Example P P P P Virus 1 P P O P R M A S L O C X S X I C J O O P P Virus R2 P P O P M A R S L O C X S MX I C J M A A R S S L L O O C C X X S S X X I I C C J J

42 Permutation Example

Virus 1 P P O P R M A S L O C X S X I C J

Virus 2 P P O P M A R S L O C X S X I C J

Virus 3 P P O P M A R S L O C X S X I C J

43 Compare: 4-grams

1 P O P R M A S L

2 P O P M A R S L

3 M A R S L P O P

POP OPR PMARPRM MARSRMA MAS POP OPM PMAR MARS ARSL RSLP SLPO LPOP R M A S L M A 1 1 1 1 1 1 0 0 0 0 0 0 0 0

2 0 0 0 0 0 1 1 1 1 1 0 0 0

3 0 0 0 0 0 0 0 0 1 1 1 1 1

44 4-perms

1 P O P R M A S L

2 P O P M A R S L

3 M A R S L P O P

POP OPR PRM RMA MAS POP OPM ARSL RSLP SLPO LPOP R M A S L M A 1 1 1 1 1 1 0 0 0 0 0 0

2 0 0 1 1 0 1 1 1 0 0 0 3 0 0 0 1 0 0 0 1 1 1 1

45 Evaluation

• Question – Are the models useful for classifying new malware? • Process – 170 known malware • From VXHeavens archive – Three unknown worms (A, B, C) • Captured by AV scanner on mail gateway – Place unknown samples using n-grams and n- perms

46 Results

• N-perm classification better in: – Clustering distinct malware classes – Classifying unknown clusters with close relatives – Identifying naming conflicts

47 10-perm Phylogeny

MyDooms

Klez/Elkerns

Beagles

48 Summary of VILO

• Impact – It is feasible to utilize historic knowledge – Can even be used on the desktop • Detect new malware – Improve forensics

49 Summary of VILO

• Open issues – Scaling for O(104-105) data set – Visualization for exploring large space of relations – Online/incremental classification

50 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

51 Call Obfuscations

NORMAL CALL OBFUSCATED CALL

L0a: push L1 L0b: push L5 L0: call L5 L0c: ret L1: … L1: … L2: … L2: … L3: … L3: … L4: … L4: … L5: L5: L6: … L6: …

52 Problem

• Determine unconventional control transfers statically – Implicit calls • Determine “bogus” returns statically – Return address modification

53 Approach

• Abstract Interpretation – Operations are interpreted to operate over an abstract domain (rather than on real data) – Real-world properties are translated into abstract properties of interest

54 Example

• Interested in sign of integers

x = {+, -}; y = {+, -}; z = {+, -} Entry All variables initialized to {+, -} x = {-}; y = {+, -}; z = {+, -} x = -5 x = neg x = {-}; y = {+}; z = {+, -} y = x * x y = neg × neg = pos x = {-}; y = {+}; z = {+} z = y - x z = pos – neg = pos

55 Example Domain Concrete Values Abstract Values (Properties of Interest)

Source: David Schmidt, http://www.cis.ksu.edu/santos/schmidt/Escuela03/ 56 Our Domain

• Concrete domain – Runtime stack • tracks actual program data • Abstract domain – Abstract Stack Graph (ASG) • tracks all stack-manipulation (push, pop, call, etc.)

57 Abstract Stack • Holds addresses of instructions pushing data onto stack – not the data – not the instruction SAMPLE PROGRAM ACTUAL STACK ABSTRACT STACK

L1: push eax eax L1 L2: push ebx ebxedx L2L4 L3: pop esi L4: push edx

58 Abstract Stack Graph Abstract Stack L0 L0 Address Instruction L1 L1 L0: push ebp L3 L1 top of L1: push eax L1 stack L2: beqz L5 L3 L3: push ebx L3 L4 L4: jmp L1 L5

L5: pop edx L2 L1 L1

L5 L0 L0

Abstract Stack Graph 59 Uses of ASG

• Detect obfuscations – call obfuscations (e.g., push-push-ret) – obfuscation of parameters to a call – obfuscated return – manipulation of return address • Match call / return instructions – return instruction need not follow entry point

60 Prototype

61 Prototype

62 Summary of DOC

• Impact – Detected all call obfuscations in W32.Evol – Initial step towards semantic disassembler

63 Summary of DOC

• Open issues – Indirect stack operations • through memory and other registers – Attacks on abstract interpretation • Explode size of state • Hide in over approximation

64 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

65 Metamorphic malware

Virus Virus Virus

Form - A T Form T Form- B- B Form - C

Anti-Virus Signatures

QUESTION: If T is non-deterministic, is detection by signature possible?

66 Example push ecx mov ecx, [ebp + 10] mov ecx, ebp push ecx push eax mov [ebp - 3], eax add eax, 2342 push ecx mov ecx, ebp mov eax, 33 mov ecx,ebp push eax add ecx, eax add ecx,33 mov eax, 33 pop eax push ecx push esi add ecx, eax mov eax, esi mov ecx,ebp mov esi,ecx pop eax push eax add ecx,33 sub esi,34 mov esi, ecx mov [ecx-36],eax mov [esi-2],eax push esi push edx pop ecx pop esi mov esi, ecx xor edx, 778f pop ecx push edx mov edx, 34 mov edx, 34 sub esi, edx sub esi, edx pop edx pop edx mov [esi - 2], mov [esi-2], eax eax pop esi pop esi pop ecx pop ecx (a) (b) (c) (d) (e) 67 Goal

• Reduce variants to unique “normal” form • Detect all variants using a single signature

68 Approach

• Extract transformations used by malware • Model mutation engine as term rewriting system (T) • Construct a Normalizer (N) for T – Use length-reducing, – Lexicographic ordering to re-orient, – And yield finite length-reducing rewriting system. • Apply reverse transforms to unmorph the malware – Staged Priority without completion (WC) – Staged Priority with manual completion – Simply use the automatically completed set • Knuth-Bendix completion procedure

69 Approach

Virus Virus Virus

Form Form - A T - B T Form - C

N N N

Virus Even if T is non- ZERO Form deterministic, detection by signature is possible!!! Anti-Virus ZERO Signature

One virus Multiple forms ZERO form ZERO signature 70 Reverse transformations

Forward push eax mov eax, reg2 mov [reg1], reg2 mov [reg1], eax pop eax

Reverse push eax mov eax, reg2 mov [reg1], reg2 mov [reg1], eax pop eax

– Revert only ‘increasing’ transformations

71 Evaluation • Case study – Unmorph W32.Evol • Process – Created 72 variants over six generations • Chose 26 variants for reversal – Extracted rules used by W32.Evol • 55 rules (with non-ground terms) – Reverted rules and added completion rules

72 Evaluation (cont) • Results – With manual completion of rules • All 26 variants reverted to a single, unique normal form – Without completion (WC) • Normal forms of all 26 variants showed more than 98% similarity • Can be exploited to extract a single signature to match all

73 Results: Without Completion

Generation Eve 2 3 4 5 6 Avg. size of 2182 3257 4524 5788 6974 8455 original Max. size of 2167 2167 2184 2189 2195 2204 normal form Avg. size of normal 2167 2167 2177 2183 2191 2204 form Lines not in 0 0 10 16 24 37 common % in common 100.0 100.0 99.54 99.27 98.90 98.32

Execution time (ms) 2469 3034 4264 6327 7966 11219

Rule Counts 16 533 980 1472 1902 2481

74 Example - Reversed

push ecx mov ecx, [ebp + 10] push ecx mov ecx, ebp mov ecx, ebp push eax push eax add eax, 2342 push ecx mov eax, 33 mov eax, 33 mov ecx,ebp add ecx, eax push ecx add ecx, eax add ecx,33 pop eax pop eax mov ecx,ebp push esi mov [ebp - 3], eax add ecx,33 mov eax, esi mov esi,ecx push esi push eax mov [ecx-36],eax sub esi,34 mov esi, ecx pop ecx mov esi, ecx mov [esi-2],eax push edx push edx pop esi xor edx, 778f pop ecx mov edx, 34 mov edx, 34 sub esi, edx sub esi, edx pop edx pop edx mov [esi - 2], eax mov [esi-2], eax pop esi pop esi pop ecx pop ecx

(a) (b) (c) (d) (e) 75 Summary of UMPH

• Impact – Better than we expected – Can raise the bar very high for malware authors

76 Summary – Open Issues

• How to extract/gather transformation rules? – Studying samples ‘in-the-zoo’ – Creating own equivalent transformations • How to deal with semantic non-preserving transformations? – Malware may introduce dead/irrelevant code – Reversing the rules may be problematic • W32.Evol had such a rule • We gave least priority to its reverse rule • How to complete the rules? – Knuth-Bendix procedure is not guaranteed to terminate – Use rule set specific knowledge • We added only TWO rules for completion

77 Talk Contents

Motivation Analysis Tools Overall ACA Approach Results VILO: malware phylogeny generation DOC: detecting obfuscated calls UMPH: reversing metamorphic transforms Future and Conclusions

78 ACA in the Future • Beginnings of movement in academic research – Seeing a few papers on relevant topics • disassembly, de-obfuscation, phylogeny • largely ignored by academic community – Some appreciation of ACA vision • feeding back to prior stages • history-directed analysis

79 ACA in the Future

• Our focus: work with industry – refine vision, keep focus on important issues • believe we can drive important research this way • VILO, DOC, and UMPH building blocks

80 Credits Software Research Lab Alumni Center for Advanced Computer Studies University of Louisiana at Lafayette Nitin Jyoti, Avertlabs Aditya Kapoor, McAfee Arun Lakhotia Erik Uday Kumar, Authentium Director Moinuddin Mohammed, Microsoft Prashant Pathak, Symantec Andrew Walenstein Prabhat Singh, Symantec Research Scientist

Michael Venable Funded by: Software Engineer and Alumni Louisiana Governor’s IT Initiative Ph.D. Students Mohamed Chouchane Md Enamul Karim M.S. Student Rachit Mathur

References

• A. Lakhotia, E. U. Kumar, and M. Venable, “A Method for Detecting Obfuscated Calls in Malicious Binaries,” IEEE Transactions on Software Engineering, 2005 (in print). • M. Karim, A. Walenstein, A. Lakhotia, and L. Parida, "Malware Phylogeny Generation using Permutations of Code," European Research Journal of Computer Virology, 2005 (in print). • M. Venable, M. R. Chouchane, Md. E. Karim, and A. Lakhotia, "Analyzing Memory Accesses in Obfuscated x86 Executables," K. Julisch and C. Kruegel (Eds.): Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA) 2005, LNCS 3548, Springer-Verlag Berlin Heidelberg, pp. 1 – 18, 2005 • A. Lakhotia, M. E. Karim, A. Walenstein, and L. Parida, "Malware Phylogeny using Maximal Pi- Patterns", EICAR 2005 Conference: Best Papers Proceedings, Malta, pp. 156-174, April-May, 2005. • A. Lakhotia and M. Mohammed, “Imposing Order on Program Statements and its implication to AV Scanners,” in Proceedings of 11th IEEE Working Conference on Reverse Engineering, Delft, The Netherlands, November 2004, pp. 161-171. • U. K. Eric, A. Kapoor, and A. Lakhotia, "DOC- Answering the Hidden 'Calls' of Virus," Virus Bulletin, April 2005. • A. Lakhotia, A. Kapoor, and E. U. Kumar, “Are Metamorphic Viruses Really Invincible?” Virus Bulletin, December 2004 and January 2005. • M. Venable, P. Pathak, and A. Lakhotia, “Getting into Beagle’s Backdoor,” Virus Bulletin, July 2004, pp. 9-13. • A. Lakhotia and P. K. Singh, Challenges in getting 'Formal' with viruses, Virus Bulletin, September 2003.

82