Software Similarity and Classification
Total Page:16
File Type:pdf, Size:1020Kb
Software Similarity and Classification By Silvio Cesare, BIT, M.Info Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Deakin University June, 2013 Acknowledgments I would like to thank the people who made writing this thesis possible. A special thanks to my partner, Kylie, who has supported me greatly. Thanks to Dani, Joe, and Eva who always add a degree of the unexpected to life. Thanks to my mother, Maxine, who enabled me early on to pursue academia. Thanks to my sister, Paloma, who knows too well the life of academic pursuits. Finally, thanks to my supervisor, Prof. Yang Xiang, who has supported me for many years during my Masters degree and PhD. Publications Books 1. Silvio Cesare, Yang Xiang, “Software Similarity and Classification”, Springer, 2012. Refereed Journal Papers 1. Silvio Cesare, Yang Xiang, Wanlei Zhou, "Control Flow-based Malware Variant Detection", IEEE Transactions on Dependable and Secure Computing, IEEE, 2013, (in press). (ERA A) 2. Silvio Cesare, Yang Xiang, Wanlei Zhou, “Malwise - An Effective and Efficient Classification System for Packed and Polymorphic Malware”, IEEE Transactions on Computers, IEEE, vol. 62, no. 6, pp. 1193-1206, 2013. (ERA A*) 3. Yini Wang, Sheng Wen, Silvio Cesare, Wanlie Zhou, Yang Xiang, "Eliminating Errors in Worm Propagation Models", Communication Letters, IEEE, vol. 15, no. 9, pp. 1022-1024, 2011. (ERA A) 4. Yini Wang, Sheng Wen, Silvio Cesare, Wanlie Zhou, Yang Xiang, "The Microcosmic Model of Worm Propagation", The Computer Journal, vol. 54, no. 10, pp. 1700-1720, 2011. (ERA A*) 5. 5. Yongrui Cui, Mingchu Li, Yang Xiang, Yizhi Ren, Silvio Cesare, "A Quality-of- Service based Fine-grained Reputation System in the Grid Economy", Concurrency and Computation: Practice and Experience, 2011. (ERA A) Refereed Conference Papers 1. Silvio Cesare, Yang Xiang, Jun Zhang, "Clonewise - Detecting Package-level Clones Using Machine Learning", 9th International Conference on Security and Privacy in Communication Networks (SecureComm 2013), 2013. (ERA A) 2. Silvio Cesare, Yang Xiang, "Simseer and Bugwise - Web Services for Binary-level Software Similarity and Defect Detection", 10th Australasian Symposium on Parallel and Distributed Computing (AusPDC 2012), 2012. 3. Silvio Cesare, Yang Xiang, “Wire – A Formal Intermediate Language for Binary Analysis”, IEEE Trustcom, IEEE, 2012. (ERA A) 4. Silvio Cesare, Yang Xiang, "Malware Variant Detection Using Similarity Search over Sets of Control Flow Graphs", IEEE Trustcom, IEEE, 2011. (ERA A) Industry Conferences 1. Black Hat, 2013 - Bugalyze.com - Detecting Bugs Using Decompilation and Data Flow Analysis 2. AusCERT, 2013 - Simseer.com - Malware Detection in a Cloud 3. Ruxcon, 2012 - FooCodeChu - Web Services for Software Analysis, Malware Detection, and Vulnerability Research 4. Ruxcon Breakpoint, 2012 - Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis 5. Black Hat, 2012 - Clonewise – Automated Package Clone Detection 6. AusCERT, 2012 - Effective Flowgraph-based Malware Variant Detection 7. Ruxcon, Professional Delegates Event, 2011 - Faster, More Effective Flowgraph- based Malware Classification 8. Ruxcon, 2011 - Automated Detection of Software Bugs and Vulnerabilities in Linux Media Articles and Interviews 1. "AusCERT 2013: Cloud-based scanner identifies new malware by its ancestry" CSO 2. "Scanner Identifies Malware Strains, Could Be Future of AV" Slashdot 3. "“Tool detects software plagiarism, theft and malware outbreaks” SC Magazine 4. “Research offers software salvation from AV friendly-fire” SC Magazine 5. “Tool kills hidden Linux bugs, vulnerabilities” Slashdot 6. "'Clonewise' Security Service Helps Identify Vulnerable Code" Dark Reading 7. Risky Business #177 -- Silvio Cesare discusses his AV PhD 8. Risky Business #203 -- LulzSec: They're baaaaaaaack 2 Software Similarity and Classification This thesis identifies the key topics in software similarity and classification. It examines the task of detecting software variants, clones, derivatives, and classes of software. From this theory, we propose a novel system to detect package-level clones of software using pattern classification techniques enabling us to discover software vulnerabilities in Linux. We also propose a formal language to aid binary analysis and using this framework, propose a novel system to detect malware variants through unique malware signatures, database indexing, and searching algorithms. These systems have been evaluated on real data sets including over 10,000 Linux packages making up the Debian Linux distribution where 34 previously unknown clones and over 30 previously unknown vulnerabilities were identified. Our malware system was evaluated on over 15,000 real malware and is demonstrated to be more effective and efficient than previous systems maintaining a near real-time scan performance. TABLE OF CONTENTS Contents Chapter 1: Introduction 1 1.1 Background 1 1.2 Motivation 2 1.3 Problem Formulization 3 1.4 Problem Overview 4 1.5 Aims and Scope 5 1.6 Contributions 6 1.7 Thesis Organization 7 Chapter 2: Related Work 9 2.1 Taxonomy of Program Features 9 2.1.1 Syntactic Features 10 2.1.2 Semantic Features 15 2.1.3 Taxonomy of Features in Program Binaries 16 2.1.4 Case Studies 17 2.2 Program Transformations and Obfuscations 18 2.2.1 Compiler Optimisation and Recompilation 18 2.2.2 Program Obfuscation 20 2.2.3 Plagiarism, Software Theft, and Derivative Works 21 2.2.4 Malware Packing, Polymorphism, and Metamorphism 22 2.2.5 Features under Program Transformations 28 2.3 Formal Methods of Program Analysis 28 2.3.1 Static Feature Extraction 28 2.3.2 Formal Syntax and Lexical Analysis 29 2.3.3 Parsing 29 2.3.4 Intermediate Representations 30 2.3.5 Formal Semantics of Programming Languages 32 2.3.6 Theorem Proving 33 2.3.7 Model Checking 34 2.3.8 Data Flow Analysis 34 2.3.9 Abstract Interpretation 36 2.3.10 Intermediate Code Optimisation 37 2.3.11 Research Opportunities 37 2.4 Static Analysis of Binaries 37 2.4.1 Disassembly 38 2.4.2 Intermediate Code Generation 40 2.4.3 Procedure Identification 41 2.4.4 Procedure Disassembly 42 2.4.5 Control Flow Analysis, Deobfuscation and Reconstruction 42 2.4.6 Pointer Analysis 43 2.4.7 Decompilation of Binaries 43 1 TABLE OF CONTENTS 2.4.8 Obfuscation and Limits to Static Analysis 46 2.4.9 Research Opportunities 46 2.5 Dynamic Analysis 47 2.5.1 Relationship to Static Analysis 47 2.5.2 Environments 48 2.5.3 Debugging 48 2.5.4 Hooking 48 2.5.5 Dynamic Binary Instrumentation 49 2.5.6 Virtualization 49 2.5.7 Application Level Emulation 49 2.5.8 Whole System Emulation 51 2.6 Feature Extraction 52 2.6.1 Processing Program Features 52 2.6.2 Strings 53 2.6.3 Vectors 53 2.6.4 Sets 53 2.6.5 Sets of Vectors 53 2.6.6 Trees 53 2.6.7 Graphs 53 2.6.8 Embeddings 54 2.6.9 Kernels 54 2.6.10 Research Opportunities 54 2.7 Software Birthmark Similarity 55 2.7.1 Distance Metrics 55 2.7.2 String Similarity 56 2.7.2.3 Longest Common Subsequence (LCS) 57 2.7.3 Vector Similarity 57 2.7.4 Set Similarity 58 2.7.5 Set of Vectors Similarity 59 2.7.6 Tree Similarity 59 2.7.7 Graph Similarity 60 2.8 Software Similarity Searching and Classification 61 2.8.1 Instance-based Learning and Nearest Neighbour 61 2.8.2 Statistical Machine Learning 63 2.8.3 Research Opportunities 64 2.9 Applications 65 2.9.1 Malware Classification 65 2.9.2 Software Theft Detection (Static Approaches) 67 2.9.3 Software Theft Detection (Dynamic Approaches) 69 2.9.4 Plagiarism Detection 69 2.9.5 Code Clone Detection 70 2.9.6 Critical Analysis 71 2.10 Future Trends 72 Concluding Remarks 73 Chapter 3: Clonewise – Detecting Package-level Clones Using Machine Learning 74 2 TABLE OF CONTENTS 3.1 Introduction 74 3.1.1 Motivation for Package-level Clone Detection 75 3.1.2 Motivation for Automated Approaches 76 3.1.2 Generability 77 3.1.3 Innovation 78 3.1.4 Structure of the Chapter 79 3.2 Problem Definition and Our Approach 79 3.2.1 Problem Definition 79 3.2.2 Our Approach 79 3.3 Initial Attempts 80 3.3.1 Containment for Embedded Package Clone Detection 80 3.3.2 Intersection for Shared Package Clone Detection 81 3.3.3 Motivations for Other Approaches 81 3.4 Package Clone Detection 81 3.4.1 Shared Package Clone Detection 82 3.4.2 Shared Package Clone Classification 86 3.4.3 Embedded Package Clone Detection 86 3.4.4 Classification Using Asymmetric Bagging 87 3.5 Inferring Security Problems 88 3.5.1 Use-case of Clone Detection to Detect Vulnerabilities 88 3.5.2 Standardization Efforts 89 3.5.3 Debian Linux Security Tracking 89 3.5.4 Automated Vulnerability Inference 89 3.6 System Implementation 91 3.6.1 Software 91 3.6.2 Scaling The Analysis 93 3.7 Evaluation 95 3.7.1 Filenames as Features 95 3.7.2 Establishing the Ground Truth for Training and Evaluation 95 3.7.3 Accuracy of Shared Package Clone Detection 97 3.7.4 Accuracy of Embedded Package Clone Detection 98 3.7.5 Practical Package Clone Detection 99 3.7.6 Vulnerability Detection 99 3.7.7 Automated Vulnerability Detection 99 3.8 Discussion 102 3.8.1 Practical Consequences of Our Research 103 3.8.2 Referencing CVEs in an advisory. 104 Concluding Remarks 104 Chapter 4: Wire - A Formal Intermediate Language for Binary Analysis 105 4.1 Introduction 105 4.1.1 Motivation 105 4.1.2 Innovation 107 4.1.3 Structure of the Chapter 108 4.2 Translating Native Code 108 3 TABLE OF CONTENTS 4.2.1 Disassembly 108 4.2.2 Abstract Machines 109 4.2.3 Intermediate Code Generation 109 4.2.4 Register Mapping between Native Architectures and Wire 110 4.2.5 Label Generation 110 4.2.6 Condition Code Generation 110 4.2.7