Statistical Structures: Fingerprinting for Classification and Analysis

Daniel Bilar Wellesley College (Wellesley, MA) Colby College (Waterville, ME)

bilar alum dot dartmouth dot org

Why Structural Fingerprinting?

Goal: Identifying and classifying malware

Problem: For any single fingerprint, the balance between over-fitting (type II error) and under-fitting (type I error) is hard to achieve

Approach: View binaries simultaneously from different structural perspectives and perform statistical analysis on these 'structural fingerprints'

Different Perspectives

Idea: Multiple perspectives may increase the likelihood of correct identification and classification

Structural Perspective   | Description                                          | Statistical Fingerprint       | Static/Dynamic?
Assembly instruction     | Count different instructions                         | Opcode frequency distribution | Primarily static
Win 32 API call          | Observe API calls made                               | API call vector               | Primarily dynamic
System Dependence Graph  | Explore graph-modeled control and data dependencies  | Graph structural properties   | Primarily static

Fingerprint: Opcode frequency distribution

Synopsis: Statically disassemble the binary, tabulate the opcode frequencies and construct a statistical fingerprint with a subset of said opcodes.

Goal: Compare opcode fingerprints across non-malicious software and malware classes for quick identification and classification purposes.

Main result: 'Rare' opcodes explain more data variation than common ones

Goodware: Opcode Distribution

Procedure:
1. Inventoried PEs (EXE, DLL, etc.) on an XP box with Advanced Disk Catalog
2. Chose random EXE samples with MS Excel and Index Your Files
3. Ran IDA with a modified InstructionCounter plugin on the sample PEs
4. Augmented IDA output files with PEiD results (compiler) and a general 'functionality class' (e.g. file utility, IDE, network utility, etc.)
5. Wrote a Java parser for the raw data files and fed the JAMA'ed matrix into Excel for analysis

Sample output (------.exe):
size: 122880
totalopcodes: 10680
compiler: MS Visual C++ 6.0
class: utility (process)
0001. 002145 20.08% mov
0002. 001859 17.41% push
0003. 000760  7.12% call
0004. 000759  7.11% pop
0005. 000641  6.00% cmp

Malware: Opcode Distribution

Procedure:
1. Booted VMPlayer with an XP image
2. Inventoried PEs from the C. Ries malware collection with Advanced Disk Catalog
3. Fixed 7 classes (e.g. virus, rootkit, etc.), chose random PE samples with MS Excel and Index Your Files
4. Ran IDA with the modified InstructionCounter plugin on the sample PEs
5. Augmented IDA output files with PEiD results (compiler, packer) and 'class'
6. Wrote a Java parser for the raw data files and fed the JAMA'ed matrix into Excel for analysis

Collection files shown: Giri.5209, Gobi.a, AFXRK2K4.root.exe, ------.b, vanquish.dll

Sample output:
size: 12288
totalopcodes: 615
compiler: unknown
class: virus
0001. 000112 18.21% mov
0002. 000094 15.28% push
0003. 000052  8.46% call
0004. 000051  8.29% cmp
0005. 000040  6.50% add
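The deck shows only the output; below is a minimal sketch of the step-5 parsing, assuming the InstructionCounter output format shown above (class and method names are mine, not the authors'):

```java
import java.io.*;
import java.util.*;

/* Hypothetical parser for InstructionCounter-style output: skips header lines
 * (size, totalopcodes, compiler, class) and reads rows of the form
 * "0001. 002145 20.08% mov" into an opcode -> relative frequency map. */
public class OpcodeFingerprint {
    public static Map<String, Double> parse(File f) throws IOException {
        Map<String, Long> counts = new LinkedHashMap<>();
        long total = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] tok = line.trim().split("\\s+");
                if (tok.length != 4 || !tok[0].matches("\\d+\\.")) continue; // header line
                long n = Long.parseLong(tok[1]);   // absolute count for this opcode
                counts.merge(tok[3], n, Long::sum);
                total += n;
            }
        }
        Map<String, Double> freq = new LinkedHashMap<>();
        final long t = total;
        counts.forEach((op, n) -> freq.put(op, n / (double) t));
        return freq;
    }
}
```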

Aggregate (Goodware): Opcode Breakdown

20 EXEs (size-blocked random samples from 538 inventoried EXEs)
~1,520,000 opcodes read
192 out of 398 possible opcodes found
72 opcodes in pie chart account for >99.8%
14 opcodes labelled account for ~90%
Top 5 opcodes account for ~64%

[Pie chart, labelled slices: mov 25%, push 19%, call 9%, pop 6%, cmp 5%, jz 4%, lea 4%, test 3%, jmp 3%, add 3%, jnz 3%, retn 2%, xor 2%, and 1%]

Aggregate (Malware): Opcode Breakdown

67 PEs (class-blocked random samples from 250 inventoried PEs)
~665,000 opcodes read
141 out of 398 possible opcodes found (two undocumented)
60 opcodes in pie chart account for >99.8%
14 opcodes labelled account for >92%
Top 5 opcodes account for ~65%

[Pie chart of the 14 labelled opcodes]

Class-blocked (Malware): Opcode Breakdown Comparison

[Bar chart: top-opcode breakdown per malware class (worms, virus, rootkit (kernel), rootkit (user), trojan, bots, tools) vs the aggregate, over mov, push, call, pop, cmp, jz, jmp, lea, add, test, retn, jnz, xor, sub]

Aggregate breakdown: mov 30%, push 16%, call 10%, pop 6%, cmp 4%, jz 4%, jmp 4%, lea 3%, add 3%, test 3%, retn 3%, jnz 2%, xor 2%, sub 1%

Top 14 Opcodes: Frequency

Opcode | Goodware | Kernel RK | User RK | Tools | Bot   | Trojan | Virus | Worms
mov    | 25.3%    | 37.0%     | 29.0%   | 25.4% | 34.6% | 30.5%  | 16.1% | 22.2%
push   | 19.5%    | 15.6%     | 16.6%   | 19.0% | 14.1% | 15.4%  | 22.7% | 20.7%
call   |  8.7%    |  5.5%     |  8.9%   |  8.2% | 11.0% | 10.0%  |  9.1% |  8.7%
pop    |  6.3%    |  2.7%     |  5.1%   |  5.9% |  6.8% |  7.3%  |  7.0% |  6.2%
cmp    |  5.1%    |  6.4%     |  4.9%   |  5.3% |  3.6% |  3.6%  |  5.9% |  5.0%
jz     |  4.3%    |  3.3%     |  3.9%   |  4.3% |  3.3% |  3.5%  |  4.4% |  4.0%
lea    |  3.9%    |  1.8%     |  3.3%   |  3.1% |  2.6% |  2.7%  |  5.5% |  4.2%
test   |  3.2%    |  1.8%     |  3.2%   |  3.7% |  2.6% |  3.4%  |  3.1% |  3.0%
jmp    |  3.0%    |  4.1%     |  3.8%   |  3.4% |  3.0% |  3.4%  |  2.7% |  4.5%
add    |  3.0%    |  5.8%     |  3.7%   |  3.4% |  2.5% |  3.0%  |  3.5% |  3.0%
jnz    |  2.6%    |  3.7%     |  3.1%   |  3.4% |  2.2% |  2.6%  |  3.2% |  3.2%
retn   |  2.2%    |  1.7%     |  2.3%   |  2.9% |  3.0% |  3.2%  |  2.0% |  2.3%
xor    |  1.9%    |  1.1%     |  2.3%   |  2.1% |  3.2% |  2.7%  |  2.1% |  2.3%
and    |  1.3%    |  1.5%     |  1.0%   |  1.3% |  0.5% |  0.6%  |  1.5% |  1.6%

Comparison: Opcode Frequencies

Perform distribution tests for the top 14 opcodes on 7 classes of malware:
- Rootkit (kernel + user)
- Virus and Worms
- Trojan and Tools
- Bots

Investigate: Which, if any, opcode frequencies are significantly different for malware?

Top 14 Opcode Testing (z-scores)

Opcode | Kernel RK | User RK | Tools | Bot   | Trojan | Virus | Worms
mov    |  36.8     |  20.6   |   2.0 |  70.1 |  28.7  | -27.9 | -20.1
push   | -15.5     | -21.0   |   4.6 | -59.9 | -31.2  |  12.1 |   6.9
call   | -17.0     |   1.2   |   5.2 |  26.0 |  10.6  |   2.6 |  -0.3
pop    | -22.0     | -13.5   |   4.9 |   5.1 |   9.8  |   4.8 |  -1.1
cmp    |   7.4     |  -3.5   |  -0.6 | -30.8 | -21.2  |   4.7 |  -1.8
jz     |  -7.4     |  -6.1   |   0.9 | -20.9 | -11.0  |   1.4 |  -4.4
lea    | -16.2     |  -8.4   |  10.9 | -29.2 | -18.3  |  11.5 |   4.2
test   | -12.2     |   0.0   |  -6.6 | -14.6 |   1.8  |  -0.2 |  -3.4
jmp    |   8.5     |  11.7   |  -5.0 |  -2.2 |   5.0  |  -2.3 |  20.4
add    |  22.9     |  10.8   |  -6.4 | -13.5 |  -0.1  |   4.3 |   0.5
jnz    |   8.7     |   7.4   | -11.7 | -12.2 |  -0.9  |   5.3 |   8.0
retn   |  -5.5     |   2.5   | -12.3 |  18.4 |  17.8  |  -1.4 |   2.6
xor    |  -8.9     |   6.7   |  -2.6 |  29.5 |  15.3  |   2.7 |   7.7
and    |   1.9     |  -7.3   |  -0.7 | -33.6 | -17.0  |   2.4 |   5.9

(Cells binned as Higher / High / Similar / Low / Lower frequency vs goodware.)

Tests suggest opcode frequencies are roughly 1/3 the same, 1/3 lower, and 1/3 higher vs goodware; a sketch of the underlying residual test follows below.
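The slides do not name the test behind these z-scores; given the Haberman (1973) citation under References, a plausible reconstruction is the adjusted residual of an opcode × class contingency table. A minimal sketch under that assumption (all names mine):

```java
/* Adjusted residuals (Haberman 1973) for an R x C contingency table of
 * opcode counts (rows = opcodes, columns = software classes). */
public class Residuals {
    public static double[][] adjusted(long[][] n) {
        int R = n.length, C = n[0].length;
        double N = 0;
        double[] row = new double[R], col = new double[C];
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++) { row[i] += n[i][j]; col[j] += n[i][j]; N += n[i][j]; }
        double[][] z = new double[R][C];
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++) {
                double e = row[i] * col[j] / N;                       // expected count
                double v = e * (1 - row[i] / N) * (1 - col[j] / N);   // residual variance
                z[i][j] = (n[i][j] - e) / Math.sqrt(v);               // ~ N(0,1) under independence
            }
        return z;
    }
}
```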

Top 14 Opcodes: Results Interpretation

Cramér's V (in %): Kernel RK 10.3, User RK 6.1, Tools 4.0, Bot 15.0, Trojan 9.5, Virus 5.6, Worms 5.2

The most frequent 14 opcodes are a weak predictor: they explain just 5-15% of the variation!

[Heat map: top 14 opcodes (mov ... and) x malware classes, cells binned Higher / High / Similar / Low / Lower vs goodware]

- Tools: (almost) no deviation in the top 5 opcodes → more 'benign' (i.e. similar to goodware)?
- Virus + Worms: few deviations, more jumps → smaller size, simpler malicious function, more control flow?
- Kernel-mode rootkit: most deviations → hand-coded assembly, 'evasive' opcodes?
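Cramér's V itself is standard; a minimal sketch of its computation on the same kind of contingency table (names mine; the slides quote V in percent as the share of variation explained):

```java
/* Cramér's V for an R x C contingency table:
 * V = sqrt( chi2 / (N * (min(R,C) - 1)) ). */
public class CramersV {
    public static double of(long[][] n) {
        int R = n.length, C = n[0].length;
        double N = 0;
        double[] row = new double[R], col = new double[C];
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++) { row[i] += n[i][j]; col[j] += n[i][j]; N += n[i][j]; }
        double chi2 = 0;
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++) {
                double e = row[i] * col[j] / N;    // expected count under independence
                chi2 += (n[i][j] - e) * (n[i][j] - e) / e;
            }
        return Math.sqrt(chi2 / (N * (Math.min(R, C) - 1)));
    }
}
```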

Rare 14 Opcodes (parts per million)

Opcode | Goodware | Kernel RK | User RK | Tools | Bot | Trojan | Virus | Worms
bt     |   30     |    0      |   34    |   47  |  70 |  83    |    0  |  118
fdivp  |   37     |    0      |    0    |   35  |  52 |  52    |    0  |   59
fild   |  357     |    0      |   45    |    0  | 133 | 115    |    0  |  438
fstcw  |   11     |    0      |    0    |    0  |  22 |  21    |    0  |   12
imul   | 1182     | 1629      | 1849    |  708  | 726 | 406    |  755  | 1126
int    |   25     | 4028      |  981    |  921  |   0 |   0    |  108  |    0
nop    |  216     |  136      |  101    |   71  |   7 |  42    |  647  |   83
pushf  |  116     |    0      |   11    |   59  |   0 |   0    |   54  |   12
rdtsc  |   12     |    0      |    0    |    0  |  11 |   0    |  108  |    0
sbb    | 1078     |  588      | 1330    | 1523  | 431 | 458    | 1133  |  782
setb   |    6     |    0      |   68    |   12  |  22 |  52    |    0  |   24
setle  |   20     |    0      |    0    |    0  |   0 |  21    |    0  |    0
shld   |   22     |    0      |   45    |   35  |   4 |   0    |   54  |   24
std    |   20     |  272      |   56    |   35  |  48 |  31    |    0  |   95

Rare 14 Opcode Testing (z-scores)

Opcode | Kernel RK | User RK | Tools | Bot  | Trojan | Virus | Worms
bt     | -1.2      | -0.4    |  0.7  |  6.6 |  5.9   | -0.7  |  4.8
fdivp  | -1.3      | -2.2    | -0.3  |  3.8 |  2.8   | -0.8  |  1.3
fild   | -4.3      | -6.5    | -6.1  | -1.5 | -0.8   | -2.6  |  2.1
fstcw  | -0.7      | -1.2    | -1.0  |  3.3 |  2.2   | -0.4  |  0.2
imul   | -3.3      |  1.3    | -5.9  |  4.4 | -1.4   | -1.7  |  0.9
int    | 45.0      | 26.2    | 28.7  | -1.8 | -1.0   |  2.4  | -1.4
nop    | -2.3      | -3.6    | -3.2  | -5.0 | -1.6   |  4.5  | -2.3
pushf  | -2.4      | -3.7    | -1.8  | -3.9 | -2.2   | -0.7  | -2.6
rdtsc  | -0.7      | -1.2    | -1.1  |  1.1 | -0.7   |  3.8  | -0.9
sbb    | -6.5      | -2.0    |  3.4  | -2.2 |  0.3   |  0.8  | -2.0
setb   | -0.5      |  4.7    |  0.6  |  4.6 |  7.9   | -0.3  |  2.1
setle  | -1.0      | -1.6    | -1.4  | -1.6 |  1.3   | -0.6  | -1.2
shld   | -1.0      |  0.6    |  0.6  | -1.1 | -0.9   |  1.0  |  0.2
std    |  4.8      |  1.4    |  0.8  |  0.3 |  2.4   | -0.6  |  4.8

(Cells binned as Higher / High / Similar / Low / Lower frequency vs goodware.)

Tests suggest opcode frequencies are roughly 7/10 the same, 1/10 lower, and 1/5 higher vs goodware.

Rare 14 Opcodes: Interpretation

Cramér's V (in %): Kernel RK 63, User RK 36, Tools 42, Bot 17, Trojan 16, Virus 10, Worms 12

The infrequent 14 opcodes are a much better predictor: they explain 12-63% of the variation.

[Heat map: rare 14 opcodes (bt ... std) x malware classes, cells binned Higher / High / Similar / Low / Lower vs goodware]

- NOP: viruses make use → NOP sleds, padding?
- INT: rootkits (and tools) make heavy use of software interrupts → tell-tale sign of a rootkit?

Summary: Opcode Distribution

Compare opcode fingerprints against various software classes for quick identification and classification.

Malware opcode frequency distributions seem to deviate significantly from non-malicious software.

'Rare' opcodes explain more frequency variation than common ones.

Opcodes: Further directions

Acquire more samples and software class differentiation

Investigate sophisticated tests for stronger control of false discovery rate and type I error

Study n-way association with more factors (compiler, type of opcodes, size)

Go beyond isolated opcodes to semantic 'nuggets' (size-wise between isolated opcodes and basic blocks)

Investigate equivalent opcode substitution effects

Related Work

M. Weber (2002): PEAT – Toolkit for Detecting and Analyzing Malicious Software

R. Chinchani (2005): A Fast Static Analysis Approach to Detect Exploit Code Inside Network Flows

S. Stolfo (2005): Fileprint Analysis for Malware Detection

Fingerprint: Win 32 API calls

Joint work with Chris Ries

Synopsis: Observe and record Win32 API calls made by malicious code during execution, then compare them to calls made by other malicious code to find similarities

Goal: Classify malware quickly into a family (a set of variants makes up a family)

Main result: A simple model yields >80% correct classification; call vectors seem robust towards different packers

Win 32 API call: System overview

[System diagram: Malicious Code → Data Collection → Log File → Vector Builder → Database → Comparison → Family]

Data Collection: Run malicious code, recording the Win32 API calls it makes

Vector Builder: Build a count vector from the collected API call data and store it in the database

Comparison: Compare the vector to all other vectors in the database to see if it is related to any of them

Win 32 API Call: Data Collection

[Test-bed diagram: WinXP guest running the malware under VMWare on a Win 2000 host, with Logger and Relayer; a Linux machine provides a fake network (fake DNS server, honeyd)]

Malware runs for a short period of time on the VMWare machine and can interact with the fake network. API calls are recorded by the logger and passed on to the Relayer; the Relayer forwards logs to file and console.

Win 32 API Call: Call Recording

The malicious process is started in a suspended state. A DLL is injected into the process's address space. When the DLL's DllMain() function is executed, it hooks the Win32 API functions.

Hook records the call’s time and arguments, calls the target, records the return value, and then returns the target’s return value to the calling function.
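The deck describes this mechanism but shows no code. As a loose Java analogue of the wrap-and-log pattern (not the actual Win32 DLL-injection detour), a dynamic proxy can play the hook's role; every name below is hypothetical:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

/* Hypothetical analogue of the DLL hook: wrap a target (interface-typed)
 * object so every call is logged (timestamp, arguments, return value)
 * before the real result is handed back to the caller. */
public final class CallLogger {
    @SuppressWarnings("unchecked")
    public static <T> T hook(Class<T> iface, T target) {
        InvocationHandler h = (proxy, method, args) -> {
            long t = System.nanoTime();                  // record call time
            Object ret = method.invoke(target, args);    // call the target
            System.out.printf("%d %s(%s) -> %s%n", t, method.getName(),
                    args == null ? "" : java.util.Arrays.toString(args), ret);
            return ret;                                  // hand result back
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[]{iface}, h);
    }
}
```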

[Diagram — function call before hooking: Calling Function → Target Function; after hooking: Calling Function → Hook Function → Trampoline → Target Function]

Win 32 API call: Call Vector

Function Name   | FindClose | FindFirstFileA | CloseHandle | EndPath | …
Number of Calls | 62        | 12             | 156         | 0       | …

Each column of the vector represents a hooked function and the number of times it was called. 1200+ different functions were recorded during execution. For each malware specimen, the vector values are recorded to the database.

Win 32 API call: Comparison

Compute the cosine similarity measure csm between the vector and each vector in the database:

$\mathrm{csm}(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert}$

If csm(vector, most similar vector in the database) > threshold, the vector is classified as a member of family_most-similar-vector; otherwise it is classified as a member of family_no-variants-yet. A sketch of this comparison step follows below.
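A minimal sketch of that comparison under the vector layout above (all names mine):

```java
import java.util.Map;

/* Cosine similarity between two API-call count vectors, keyed by function name. */
public class VectorComparison {
    public static double csm(Map<String, Integer> v1, Map<String, Integer> v2) {
        double dot = 0, n1 = 0, n2 = 0;
        for (Map.Entry<String, Integer> e : v1.entrySet()) {
            Integer other = v2.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;  // v1 . v2
            n1 += e.getValue() * (double) e.getValue();
        }
        for (int c : v2.values()) n2 += c * (double) c;
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));                 // / (|v1| |v2|)
    }

    /* Nearest-family rule: above the threshold, adopt the closest vector's
     * family; otherwise start a new family (no variants yet). */
    public static String classify(Map<String, Integer> v,
                                  Map<String, Map<String, Integer>> db,
                                  Map<String, String> families, double threshold) {
        String best = null;
        double bestCsm = -1;
        for (Map.Entry<String, Map<String, Integer>> e : db.entrySet()) {
            double s = csm(v, e.getValue());
            if (s > bestCsm) { bestCsm = s; best = e.getKey(); }
        }
        return (best != null && bestCsm > threshold)
                ? families.get(best) : "no-variants-yet";
    }
}
```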

Win 32 API call: Results

Collected 77 malware samples. Classification by 17 major AV scanners produced 21 families (some aliases). With a csm threshold of 0.8, ~80% were classified correctly; the comparison also resolved some scanner discrepancies and misclassifications.

Per-family results (family: members / correct) include: Apost 1/0, Banker 4/4, Nibu 1/1, Tarno 2/2, Beagle 15/14, Frethem 3/2, Gibe 1/1, Inor 2/0, Mitgleider 2/2, SDBot 8/5, Moega 3/3, Randex 2/1, Spybot 1/0, Pestlogger 1/1, …

Threshold | % correct | # correct | false fam. | both | miss. fam.
0.70      | 80%       | 62        | 5          | 8    | 2
0.75      | 80%       | 62        | 5          | 6    | 4
0.80      | 82%       | 63        | 3          | 6    | 5
0.85      | 82%       | 63        | 2          | 4    | 8
0.90      | 79%       | 61        | 1          | 4    | 10
0.95      | 79%       | 61        | 2          | 3    | 11
0.99      | 62%       | 48        | 0          | 2    | 27

Win 32 API call: Packers

A wide variety of different packers is used within the same families; the dynamic Win 32 API call fingerprint seems robust towards packers. 8 Netsky variants in the sample, 7 identified.

Variant   | Packer        | Identified
Netsky.AB | PECompact     | yes
Netsky.B  | UPX           | no
Netsky.C  | PEtite        | yes
Netsky.D  | PEtite        | yes
Netsky.K  | tElock        | yes
Netsky.P  | FSG           | yes
Netsky.S  | PE-Patch, UPX | yes
Netsky.Y  | PE Pack       | yes

Summary: Win 32 API calls

Allows researchers and analysts to quickly identify variants reasonably well, without manual analysis

Simple model yields > 80% correct classification

Resolved discrepancies between some AV scanners

Dynamic API call vectors seem robust towards different packers

[System overview diagram repeated: Malicious Code → Data Collection → Log File → Vector Builder → Database → Comparison → Family]

API call: Further directions

Acquire more malware samples for better variant classification

Explore resiliency to obfuscation techniques (substitutions of Win 32 API calls, call spamming)

Investigate patterns of ‘call bundles’ instead of just isolated calls for richer identification

Replace VSM (vector space model) with a finite state automaton that captures a rich set of call relations

Related Work

R. Sekar et al (2001): A Fast Automaton-Based Method for Detecting Anomalous Program Behaviour

J. Rabek et al (2003): DOME – Detection of Injected, Dynamically Generated, and Obfuscated Malicious Code

K. Rozinov (2005): Efficient Static Analysis of Executables for Detecting Malicious Behaviour

Fingerprint: PDG measures

Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes

Goal: Compare the 'graph structure' fingerprint of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes

Main result: Work in progress

Program Dependence Graph

A PDG models intra-procedural dependences:

Data Dependence: Program statements compute data that are used by other statements.

Control Dependence: Arises from the ordered flow of control in a program.

[Picture from J. Stafford (Colorado, Boulder)]

Program Dependence Graph

[Example PDG; material from CodeSurfer]

System Dependence Graph

The SDG of a program is the aggregation of its PDGs, augmented with the calls between functions

[Picture from J. Stafford (Colorado, Boulder)]

An SDG models control, data, and call dependences in a program

[Example SDG; material from CodeSurfer]

Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)

Tally distributions on graph measures:

- Edge weights (weight is jump distance)
- Node weights (weight is number of statements in the basic block)
- Centrality ("How important is the node?")
- Clustering coefficient ("probability of connected neighbours")
- Motifs ("recurring patterns")

→ Statistical structural fingerprint. Example: H. Flake (BH 05), graph-structural measures.

Primer: Graphs

Graphs (networks) are made up of vertices (nodes) and edges. An edge connects two vertices. Nodes and edges can have different weights (b, c). Edges can have directions (d).

[Picture from Mark Newman (2003)]

Modeling with Graphs

Category | Examples (nodes, edges) | Description
Social | Friendship (people, friendship bond); Business (companies, business dealings); Movies (actors, collaboration); Phone calls (number, call) | Set of people/groups with some interaction between them
Information | Citation (paper, cited); Thesaurus (words, synonym); WWW (HTML pages, URL links) | Information linked together
Technology (Transmission) | Power grid (power station, lines); Internet (routers, physical links) | Resource or commodity distribution/transmission
Biological | Genetic regulatory network (proteins, dependence); Cardiovascular (organs, veins); Food (predator, prey); Neural (neurons, axons) | Biological 'entities' interacting [also transmission]
Physical Science | Chemistry (conformation of polymers, transitions) | Physical 'entities' interacting

Measures: Centrality

Centrality tries to measure the 'importance' of a vertex.
- Degree centrality: "How many nodes are connected to me?"
- Closeness centrality: "How close am I to all other nodes?"
- Betweenness centrality: "How important am I for any two nodes?"
The Freeman metric computes centralization for the entire graph.

Measure: Degree centrality

Degree centrality: “How many nodes are connected to me?”

$M_d(n_i) = \sum_{j=1,\, j \neq i}^{N} \mathrm{edge}(n_i, n_j)$; normalized: $M'_d(n_i) = \frac{M_d(n_i)}{N-1}$

Closeness centrality: "How close am I to all other nodes?"

$M_C(n_i) = \left[ \sum_{j=1}^{N} d(n_i, n_j) \right]^{-1}$; normalized: $M'_C(n_i) = M_C(n_i)\,(N-1)$

Betweenness centrality: “How important am I for any two nodes?”

$C_B(n_i) = \sum_{j \neq k \neq i} g_{jk}(n_i)/g_{jk}$; normalized: $C'_B(n_i) = \frac{C_B(n_i)}{(N-1)(N-2)/2}$

(where $g_{jk}$ is the number of shortest paths between $n_j$ and $n_k$, and $g_{jk}(n_i)$ the number of those passing through $n_i$; a computational sketch follows below)
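A small Java sketch of the first two measures on an unweighted adjacency-list graph (BFS distances; assumes a connected graph; all names mine):

```java
import java.util.*;

/* Degree and closeness centrality for an unweighted graph given as an
 * adjacency list; nodes are 0..N-1. */
public class Centrality {
    public static double degree(List<List<Integer>> adj, int i) {
        return adj.get(i).size() / (double) (adj.size() - 1);   // normalized M'_d
    }

    public static double closeness(List<List<Integer>> adj, int i) {
        int n = adj.size();
        int[] dist = new int[n];
        Arrays.fill(dist, -1);
        dist[i] = 0;
        Deque<Integer> q = new ArrayDeque<>();
        q.add(i);
        long sum = 0;
        while (!q.isEmpty()) {                                  // BFS from n_i
            int u = q.poll();
            for (int v : adj.get(u))
                if (dist[v] < 0) { dist[v] = dist[u] + 1; sum += dist[v]; q.add(v); }
        }
        return (n - 1) / (double) sum;   // M'_C = (N-1) / sum of d(n_i, n_j)
    }
}
```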

Local structure: Clusters and Motifs

Graphs can be decomposed into constitutive subgraphs:
- Subgraph: a subset of the nodes of the original graph and of the edges connecting them (does not have to contain all the edges of a node)
- Cluster: a connected subgraph
- Motif: a recurring subgraph appearing in networks at higher frequencies than expected by random chance

Measure: Clustering Coefficient

[Definition slide from Albert (Penn State)]
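The definition slide is an image; as a companion, a sketch of the usual local clustering coefficient — the "probability of connected neighbours" from the earlier list (undirected graph; names mine):

```java
import java.util.*;

/* Local clustering coefficient: fraction of pairs of neighbours of node i
 * that are themselves connected (undirected graph, adjacency sets). */
public class Clustering {
    public static double coefficient(List<Set<Integer>> adj, int i) {
        List<Integer> nb = new ArrayList<>(adj.get(i));
        int k = nb.size();
        if (k < 2) return 0.0;                 // undefined for k < 2; 0 by convention here
        int links = 0;
        for (int a = 0; a < k; a++)
            for (int b = a + 1; b < k; b++)
                if (adj.get(nb.get(a)).contains(nb.get(b))) links++;
        return 2.0 * links / (k * (k - 1));    // C_i = 2 E_i / (k_i (k_i - 1))
    }
}
```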

Measure: Network motifs

A motif in a network is a subgraph 'pattern'

It recurs in networks at higher frequencies than expected by random chance. A motif may reflect the underlying generative processes, design principles and constraints, and driving dynamics of the network.

Motif example:

Feed-forward loop: X regulates Y and Z; Y regulates Z. Found in:
- Biochemistry (transcriptional regulation)
- Neurobiology (neuron connectivity)
- Ecology (food webs)
- Engineering (electronic circuits)
... and maybe Computer Science (PDGs)?
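To make the feed-forward loop concrete, here is a brute-force count of this one motif in a directed graph (a sketch; real motif detection, as in the Milo et al (2002) paper cited under References, compares such counts against randomized networks):

```java
import java.util.*;

/* Count feed-forward loops (X->Y, X->Z, Y->Z) in a directed graph given as
 * adjacency sets; nodes are 0..N-1. */
public class Motifs {
    public static long feedForwardLoops(List<Set<Integer>> out) {
        long count = 0;
        for (int x = 0; x < out.size(); x++)
            for (int y : out.get(x))
                if (y != x)
                    for (int z : out.get(y))
                        if (z != x && z != y && out.get(x).contains(z))
                            count++;   // X regulates Y and Z, Y regulates Z
        return count;
    }
}
```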

Graph measures as a fingerprint

Binaries represented as a System Dependence Graph (SDG)

Tally distributions on graph measures:

- Edge weights (weight is jump distance)
- Node weights (weight is number of statements in the basic block)
- Centrality ("How important is the node?")
- Clustering coefficient ("probability of connected neighbours")
- Motifs ("recurring patterns")

→ Statistical structural fingerprint

Summary: SDG measures

Synopsis: Represent binaries as a System Dependence Graph, extract graph features to construct 'graph-structural' fingerprints for particular software classes

Goal: Compare the 'graph structure' fingerprint of unknown binaries across non-malicious software and malware classes for identification, classification and prediction purposes

Main result: Work in progress

Related Work

H. Flake (2005): Compare, Port, Navigate

M. Christodorescu (2003): Static Analysis of Executables to Detect Malicious Patterns

A. Kiss (2005): Using Dynamic Information in the Interprocedural Static Slicing of Binary Executables

References

Statistical testing:
- S. Haberman (1973): The Analysis of Residuals in Cross-Classified Tables, pp. 205-220
- B.S. Everitt (1992): The Analysis of Contingency Tables (2nd ed.)

Network graph measures and network motifs:
- L. Amaral et al (2000): Classes of Small-World Networks
- R. Milo, U. Alon et al (2002): Network Motifs: Simple Building Blocks of Complex Networks
- M. Newman (2003): The Structure and Function of Complex Networks
- D. Bilar (2006): Science of Networks. http://cs.colby.edu/courses/cs298

System Dependence Graphs:
- GrammaTech Inc.: Static Program Dependence Analysis via Dependence Graphs. http://www.codesurfer.com/papers/
- Á. Kiss et al (2003): Interprocedural Static Slicing of Binary Executables