Statistical Structures: Fingerprinting Malware for Classification and Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Statistical Structures: Fingerprinting Malware for Classification and Analysis Daniel Bilar Wellesley College (Wellesley, MA) Colby College (Waterville, ME) bilar <at> alum dot dartmouth dot org Why Structural Fingerprinting? Goal: Identifying and classifying malware Problem: For any single fingerprint, balance between over-fitting (type II error) and under- fitting (type I error) hard to achieve Approach: View binaries simultaneously from different structural perspectives and perform statistical analysis on these ‘structural fingerprints’ Different Perspectives Idea: Multiple perspectives may increase likelihood of correct identification and classification Structural Description Statistical static / Perspective Fingerprint dynamic? Assembly Count different Opcode Primarily instruction instructions frequency static distribution Win 32 API Observe API calls API call vector Primarily call made dynamic System Explore graph- Graph structural Primarily Dependence modeled control and properties static Graph data dependencies Fingerprint: Opcode frequency distribution Synopsis: Statically disassemble the binary, tabulate the opcode frequencies and construct a statistical fingerprint with a subset of said opcodes. Goal: Compare opcode fingerprint across non- malicious software and malware classes for quick identification and classification purposes. Main result: ‘Rare’ opcodes explain more data variation then common ones Goodware: Opcode Distribution 1, 2 ---------.exe Procedure: -------.exe 1. Inventoried PEs (EXE, DLL, ---------.exe etc) on XP box with Advanced Disk Catalog 2. Chose random EXE samples size: 122880 with MS Excel and Index totalopcodes: 10680 3, 4 your Files compiler: MS Visual C++ 6.0 3. Ran IDA with modified class: utility (process) InstructionCounter plugin on sample PEs 0001. 002145 20.08% mov 4. Augmented IDA output files 0002. 001859 17.41% push with PEID results (compiler) 0003. 000760 7.12% call and general ‘functionality 0004. 000759 7.11% pop class’ (e.g. file utility, IDE, network utility, etc) 0005. 000641 6.00% cmp 5. Wrote Java parser for raw data files and fed JAMA’ed 5 matrix into Excel for analysis Malware: Opcode Distribution Procedure: Giri.5209 Gobi.a 1. Booted VMPlayer with XP AFXRK2K4.root.exe image ---------.b 2,3 vanquish.dll 2. Inventoried PEs from C. Ries malware collection with Advanced Disk Catalog size: 12288 3. Fixed 7 classes (e.g. virus,, totalopcodes: 615 rootkit, etc), chose random compiler: unknown 4, 5 PEs samples with MS Excel class: virus and Index your Files 4. Ran IDA with modified InstructionCounter plugin on 0001. 000112 18.21% mov sample PEs 0002. 000094 15.28% push 5. Augmented IDA output files 0003. 000052 8.46% call with PEID results (compiler, 0004. 000051 8.29% cmp packer) and ‘class’ 0005. 000040 6.50% add 6. Wrote Java parser for raw data 1 files and fed JAMA’ed matrix into Excel for analysis 6 Aggregate (Goodware): Opcode Breakdown retn xor and 20 EXEs 2% 2% 1% 20 EXEs jnz (size-blocked (size-blocked random random samples samples from 3% mov 538 frominventoried 538 inventoried EXEs) EXEs) 25% add ~1,520,000~1,520,000 opcodes opcodes read read 3% 192 out of 398 possible opcodes jmp 192 out of 420 possible found 3% opcodes found test 72 opcodes72 opcodes in pie in chartpie chart account for 3% >99.8%account for >99.8% lea push 14 opcodes14 opcodes labelled labelled account account for ~90% 4% jz 19% for ~90% 4% cmp Top 5 opcodes account for ~64 % 5% pop call 6% Top 5 opcodes account for 9% ~64 % Aggregate (Malware): Opcode Breakdown 67 67PEs PEs sub jnz xor (class-blocked(class-blocked random random samples samples 3% 1% retn 3% fromfrom 250 250 inventoried inventoried PEs) PEs) 3% ~665,000~665,000 opcodes opcodes read read test mov 3% 30% 141141 out out of of420 398 possible possible add opcodes found (two undocu- 3% opcodes found (two undocu- mented)mented) lea 3% jmp 6060 opcodes opcodes in piein pie chart chart 3% accountaccount for for >99.8% >99.8% jz 4% 14 14opcodes opcodes labelled labelled account account forfor >92% >92% cmp push pop 4% 16% Top 5 opcodes account for 6% call Top 5 opcodes account for ~65% 10% ~65% Class-blocked (Malware): Opcode Breakdown Comparison xor jnz sub Aggregate retn worms test virus rootkit (kernel) add mov lea rootkit (user) trojan bots s jmp tools jz Aggregate breakdown mov 30 % lea 3% cmp push 16 % add 3% call 10 % test 3% pop 6 % retn 3% pop cmp 4 % jnz 2% push jz 4 % xor 2% call jmp 4 % sub 1% Top 14 Opcodes: Frequency Opcode Goodware Kernel User Tools Bot Trojan Virus Worms RK RK mov 25.3% 37.0% 29.0% 25.4% 34.6% 30.5% 16.1% 22.2% push 19.5% 15.6% 16.6% 19.0% 14.1% 15.4% 22.7% 20.7% call 8.7% 5.5% 8.9% 8.2% 11.0% 10.0% 9.1% 8.7% pop 6.3% 2.7% 5.1% 5.9% 6.8% 7.3% 7.0% 6.2% cmp 5.1% 6.4% 4.9% 5.3% 3.6% 3.6% 5.9% 5.0% jz 4.3% 3.3% 3.9% 4.3% 3.3% 3.5% 4.4% 4.0% lea 3.9% 1.8% 3.3% 3.1% 2.6% 2.7% 5.5% 4.2% test 3.2% 1.8% 3.2% 3.7% 2.6% 3.4% 3.1% 3.0% jmp 3.0% 4.1% 3.8% 3.4% 3.0% 3.4% 2.7% 4.5% add 3.0% 5.8% 3.7% 3.4% 2.5% 3.0% 3.5% 3.0% jnz 2.6% 3.7% 3.1% 3.4% 2.2% 2.6% 3.2% 3.2% retn 2.2% 1.7% 2.3% 2.9% 3.0% 3.2% 2.0% 2.3% xor 1.9% 1.1% 2.3% 2.1% 3.2% 2.7% 2.1% 2.3% and 1.3% 1.5% 1.0% 1.3% 0.5% 0.6% 1.5% 1.6% Comparison Opcode Frequencies Opcode Goodware Kernel User Tools Bot Trojan Virus Worms RK PerformRK distribution tests for top mov 25.3% 37.0%14 opcodes29.0% 25.4% on 734.6% classes30.5% of 16.1% 22.2% push 19.5% 15.6%malware16.6% :19.0% 14.1% 15.4% 22.7% 20.7% call 8.7% 5.5% 8.9% 8.2% 11.0% 10.0% 9.1% 8.7% pop 6.3% 2.7%Rootkit5.1% (kernel5.9% 6.8% + user)7.3% 7.0% 6.2% cmp 5.1% 6.4% 4.9% 5.3% 3.6% 3.6% 5.9% 5.0% jz 4.3% 3.3%Virus3.9% and4.3% Worms3.3% 3.5% 4.4% 4.0% lea 3.9% 1.8%Trojan3.3% and3.1% Tools2.6% 2.7% 5.5% 4.2% test 3.2% 1.8% 3.2% 3.7% 2.6% 3.4% 3.1% 3.0% jmp 3.0% 4.1%Bots3.8% 3.4% 3.0% 3.4% 2.7% 4.5% add 3.0% 5.8% 3.7% 3.4% 2.5% 3.0% 3.5% 3.0% jnz 2.6% 3.7%Investigate:3.1% 3.4% Wh2.2%ich,2.6% if any,3.2% 3.2% retn 2.2% 1.7%opcode2.3% frequency2.9% 3.0% is3.2% significantly2.0% 2.3% xor 1.9% 1.1%different2.3% 2.1%for malware3.2% 2.7%? 2.1% 2.3% and 1.3% 1.5% 1.0% 1.3% 0.5% 0.6% 1.5% 1.6% Top 14 Opcode Testing (z-scores) O p c Opcode Kernel User Tools Bot Trojan Virus Worms Higher o RK RK d High e mov 36.8 20.6 2.0 70.1 28.7 -27.9 -20.1 F Similar r push -15.5 -21.0 4.6 -59.9 -31.2 12.1 6.9 e q Low u call -17.0 1.2 5.2 26.0 10.6 2.6 -0.3 e Lower n pop -22.0 -13.5 4.9 5.1 9.8 4.8 -1.1 c y cmp 7.4 -3.5 -0.6 -30.8 -21.2 4.7 -1.8 jz -7.4 -6.1 0.9 -20.9 -11.0 1.4 -4.4 Tests lea -16.2 -8.4 10.9 -29.2 -18.3 11.5 4.2 suggests test -12.2 0.0 -6.6 -14.6 1.8 -0.2 -3.4 opcode frequency jmp 8.5 11.7 -5.0 -2.2 5.0 -2.3 20.4 roughly add 22.9 10.8 -6.4 -13.5 -0.1 4.3 0.5 jnz 8.7 7.4 -11.7 -12.2 -0.9 5.3 8.0 1/3 same retn -5.5 2.5 -12.3 18.4 17.8 -1.4 2.6 1/3 lower 1/3 higher xor -8.9 6.7 -2.6 29.5 15.3 2.7 7.7 and 1.9 -7.3 -0.7 -33.6 -17.0 2.4 5.9 vs goodware Top 14 Opcodes Results Interpretation Cramer’s V 10.3 6.1 4.0 15.0 9.5 5.6 5.2 Most frequent 14 (in %) opcodes weak Op Krn Usr Tools Bot Trojan Virus Worm mov predictor push call Explains just 5-15% of pop variation! cmp jz lea Higher O test p c High jmp F add r Similar e jnz q retn Low xor Lower and Tools: (almost) no Virus + Worms: Kernel-mode Rootkit: deviation in top 5 few # of deviations; most # of deviations opcodes more more jumps smaller handcoded assembly; ‘benign’ (i.e. similar size, simpler malicious ‘evasive’ opcodes ? to goodware) ? function, more control flow ? Rare 14 Opcodes (parts per million) Opcode Goodware Kernel User Tools Bot Trojan Virus Worms RK RK bt 30 0 34 47 70 83 0 118 fdivp 37 0 0 35 52 52 0 59 fild 357 0 45 0 133 115 0 438 fstcw 11 0 0 0 22 21 0 12 imul 1182 1629 1849 708 726 406 755 1126 int 25 4028 981 921 0 0 108 0 nop 216 136 101 71 7 42 647 83 pushf 116 0 11 59 0 0 54 12 rdtsc 12 0 0 0 11 0 108 0 sbb 1078 588 1330 1523 431 458 1133 782 setb 6 0 68 12 22 52 0 24 setle 20 0 0 0 0 21 0 0 shld 22 0 45 35 4 0 54 24 std 20 272 56 35 48 31 0 95 Rare 14 Opcode Testing (z-scores) O p c Opcode Kernel User Tools Bot Trojan Virus Worms Higher o RK RK d High e bt -1.2 -0.4 0.7 6.6 5.9 -0.7 4.8 F Similar r fdivp -1.3 -2.2 -0.3 3.8 2.8 -0.8 1.3 e q Low u fild -4.3 -6.5 -6.1 -1.5 -0.8 -2.6 2.1 e Lower n fstcw -0.7 -1.2 -1.0 3.3 2.2 -0.4 0.2 c y imul -3.3 1.3 -5.9 4.4 -1.4 -1.7 0.9 int 45.0 26.2 28.7 -1.8 -1.0 2.4 -1.4 Tests nop -2.3 -3.6 -3.2 -5.0 -1.6 4.5 -2.3 suggests pushf -2.4 -3.7 -1.8 -3.9 -2.2 -0.7 -2.6 opcode frequency rdtsc -0.7 -1.2 -1.1 1.1 -0.7 3.8 -0.9 roughly sbb -6.5 -2.0 3.4 -2.2 0.3 0.8 -2.0 setb -0.5 4.7 0.6 4.6 7.9 -0.3 2.1 1/10 lower setle -1.0 -1.6 -1.4 -1.6 1.3 -0.6 -1.2 1/5 higher 7/10 same shld -1.0 0.6 0.6 -1.1 -0.9 1.0 0.2 std 4.8 1.4 0.8 0.3 2.4 -0.6 4.8 vs goodware Rare 14 Opcodes: Interpretation Cramer’s V 63 36 42 17 16 10 12 Infrequent 14 opcodes (in %) much better Op Krn Usr Tools Bot Trojan Virus Worm bt predictor! fdivp Explains 12-63% of fild fstcw variation imul int nop Higher O p pushf c High F rdtsc r sbb Similar e q setb Low setle Lower shld std NOP: INT: Rooktkits (and tools) make Virus makes use heavy use of software interrupts NOP sled, padding ? tell-tale sign of RK ? Summary: Opcode Distribution Compare opcode Giri.5209 fingerprints against Gobi.a AFXRK2K4.root.exe ---------.b various software classes vanquish.dll for quick identification size: 12288 and classification totalopcodes: 615 compiler: unknown Malware opcode class: virus frequency distribution seems to deviate 0001.