Feature Extraction and Static Analysis for Large-Scale Detection of Malware Types and Families
Lars Strande Grini

Master's Thesis
Master of Science in Information Security, 30 ECTS
Department of Computer Science and Media Technology
Gjøvik University College (Høgskolen i Gjøvik), 2015
Box 191, N-2802 Gjøvik, Norway

15/12/2015

Abstract

There exist different methods of identifying malware, and a widespread method is the one found in almost every antivirus solution on the market today: the signature-based approach. This approach uses a one-way cryptographic function to generate a unique hash of each file. Afterwards, each hash is checked against a database of hashes of known malware. This method produces close to no false positives, but it can only detect previously known malware and will therefore in many cases produce a number of false negatives. Malware authors exploit this weakness by changing a small part of the malicious code, which changes the entire hash of the file. The malicious code then remains undetectable until the sample is discovered, analyzed and added to the vendors' database(s). In light of this relatively easy evasion for malware authors, it is clear that we need other ways to identify malware. The other two main approaches are static analysis and behavior-based (dynamic) analysis. Such analysis, and previous research on it, has primarily focused on detecting whether a file is malicious or benign (binary classification). There has been comprehensive work in these fields over the last few years.
In the work we are proposing, we will leverage results from static analysis, using machine learning methods, to distinguish malicious Windows executables. We classify them not just as benign or malicious, as in much previous research, but by malware family affiliation. To do this we will use a database consisting of about 330,000 malicious executables. A challenge in this work will be the naming of the samples and families, as different antivirus vendors label samples with different names and follow no standard naming scheme. This is exemplified by the VirusTotal online scanner, which checks a hash against 57 malware databases. For the static analysis we will use the VirusTotal scanner as well as PEframe, an open source tool for analyzing portable executables. The work performed in the thesis presents a novel approach to extracting and constructing features that can be used to estimate which type and family a malicious file is an instance of, which can be useful for analysis and antivirus scanners. This contribution is novel because multinomial classification is applied to distinguish between different types and families.

Acknowledgements

I would like to express my greatest appreciation to my supervisors, Katrin Franke and Andrii Shalaginov, for the extraordinary guidance and feedback throughout this project. Further, I would like to thank my classmates Espen, Lars, David and Martin for discussions, feedback and company during the entire process. Lastly, I would like to thank my current classmate Jan William and my former classmate Simen for valuable discussions, feedback and proofreading of my work.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Topics covered by project
  1.2 Keywords
  1.3 Problem description
  1.4 Justification, motivation and benefits
  1.5 Research questions
  1.6 Planned contributions
  1.7 Thesis outline
2 Malware: Taxonomy, Analysis & Detection
  2.1 Methods for malware analysis
    2.1.1 Detection and Analysis
    2.1.2 Dynamic Analysis
  2.2 Malware Taxonomy
    2.2.1 Virus
    2.2.2 Worm
    2.2.3 Trojan
    2.2.4 Backdoor
    2.2.5 Rootkit
    2.2.6 Bot
  2.3 Malware detection in antivirus scanners
    2.3.1 Signature based
    2.3.2 Anomaly based
    2.3.3 Heuristic based
  2.4 Obfuscation Techniques
    2.4.1 Encryption
    2.4.2 Polymorphism
    2.4.3 Metamorphism
    2.4.4 Specific obfuscation techniques
    2.4.5 Dead-Code Insertion
    2.4.6 Register Reassignment
    2.4.7 Instruction Substitution
    2.4.8 Code Transposition
  2.5 Windows Portable Executables
  2.6 Naming of malware
3 Machine Learning & Pattern Recognition
  3.1 Preprocessing
  3.2 Feature Selection
  3.3 Learning
  3.4 Challenges
    3.4.1 "No free lunch"
    3.4.2 "Ugly Duckling"
    3.4.3 Overfitting and underfitting
    3.4.4 Validation of results
4 Related work
  4.1 Binary classification
  4.2 Multi-class Classification
5 Large-scale Malware Analysis
  5.1 Choice of methods
  5.2 Data acquisition
  5.3 Feature construction
  5.4 Subset generation
  5.5 Machine Learning Methods Used
    5.5.1 Feature selection
    5.5.2 Classification
6 Experiments, results and discussion
  6.1 Experimental Environments
  6.2 Data acquisition
  6.3 Feature Construction
  6.4 Feature selection
    6.4.1 10 most frequent families
    6.4.2 100 most frequent families
    6.4.3 500 most frequent families
    6.4.4 10 most frequent types
    6.4.5 35 most frequent types (Full feature set)
  6.5 Classification
7 Conclusion & future work
  7.1 Theoretical Implications
  7.2 Practical Considerations
  7.3 Further Research
Bibliography
A RawData database
  A.1 Database explanation
  A.2 Processed database
B Sample output from tools
  B.1 PEframe
  B.2 VirusTotal
  B.3 Serialized VirusTotal
C Python Code Example for Feature Construction
  C.1 rawdata_to_data.py
D Python Code Example for generating .arff file from DB table
  D.1 table_to_arff.py
E Data contents
  E.1 Types of malware in data set
  E.2 Malware families in data set
  E.3 Architecture Distribution in PE headers

List of Figures

1 Illustration of boot sector virus [1]
2 Kernel mode rootkit [2]
3 Encrypted malware [3]
4 Example of dead-code insertion [4]
5 Example of instruction substitution [4]
6 Example of code transposition [4]
7 The portable executable format [5]
8 Implementation of the CARO naming scheme [6]
9 The machine learning process [7]
10 Linearly separable data [8]
11 K-Means clustering [9]
12 Modes for multi-class classification [10]
13 Overfitting and underfitting [11]
14 5-fold cross validation [12]
15 Methodology for large-scale static malware analysis and classification
16 10 most frequent families
17 100 most frequent families