Detection and Analysis of Web-Based Malware and Vulnerability
NANYANG TECHNOLOGICAL UNIVERSITY

Wang Junjie
School of Computer Science and Engineering
Nanyang Technological University, Singapore

A thesis submitted to the Nanyang Technological University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

2018

THESIS ABSTRACT

Since the dawn of the Internet, all of us have been swept up by the Niagara of information that fills our daily lives. In this process, browsers play an extremely important role. Modern browsers have evolved from simple text displayers into complex software that supports rich user interfaces and a variety of file formats and protocols. This growth enlarges the attack surface and makes browsers one of the main targets of cyber attacks. Within Internet security, JavaScript malware is one of the major threats: it exploits vulnerabilities in browsers to launch attacks remotely. To protect end users from these threats, this thesis makes two main contributions, identifying JavaScript malware and detecting vulnerabilities in browsers, which together aim at a complete solution for Internet security.

To identify JavaScript malware, we first propose classifying it with a machine learning approach combined with dynamic confirmation. Static and dynamic approaches both have merits and drawbacks. Dynamic approaches are effective but not scalable. Static approaches are efficient but normally suffer from a high false negative rate. To identify JavaScript malware both effectively and efficiently, we propose a two-phase approach. The first phase uses a lightweight classifier to separate JavaScript malware from benign web pages. The second phase then subdivides the detected malware by attack behavior.
We implement our approach as an online tool and conduct a large-scale experiment to show its effectiveness.

For an insightful analysis of how JavaScript malware evolves, it is desirable to further classify samples according to the exploited attack vector and the corresponding attack behaviors. Given the emergence of numerous new JavaScript malware samples and their variants, such automated classification can significantly speed up the overall response to JavaScript malware and even shorten the time needed to discover zero-day attacks. We propose using a Deterministic Finite Automaton (DFA) to summarize patterns of malware. Our approach automatically learns a DFA from the dynamic execution traces of JavaScript malware. The experimental results demonstrate that our approach is more scalable and effective in JavaScript malware detection and classification than commercial anti-virus tools.

Through the previous two works, we realized that the root cause of the prevalence of JavaScript malware is the existence of vulnerabilities in browsers. Therefore, finding vulnerabilities in browsers and improving mitigation is of significant importance. We propose a novel data-driven seed generation approach to test the core components of browsers, especially XML engines and XSLT engines. We first learn a Probabilistic Context-Sensitive Grammar (PCSG) from a large number of samples of one specific grammar. The PCSG helps us generate samples whose syntax and semantics are correct with high probability. The experimental results demonstrate that both the bug-finding capability and the code coverage of fuzzing are improved.

We further improve coverage-based greybox fuzzing by proposing a new grammar-aware approach for programs that process structured inputs. In detail, our approach requires the grammar of the test inputs, which is often publicly available.
Based on the grammar, we propose a grammar-aware trimming strategy that trims test inputs at the tree level. In addition, we introduce two grammar-aware mutation strategies: enhanced dictionary-based mutation and tree-based mutation. Tree-based mutation works by replacing sub-trees of the Abstract Syntax Tree (AST) of parsed test inputs. With grammar awareness, we can effectively mutate test inputs while keeping the input structure valid, which quickly extends the fuzzing exploration in both breadth and depth. We conduct experiments to evaluate its effectiveness on one XML engine, libplist, and two JavaScript engines, WebKit and Jerryscript. The results demonstrate that our approach outperforms other fuzzing tools in both code coverage and bug-finding capability.

Contents

1 Introduction
  1.1 Motivations and Goals
  1.2 Main Works and Contributions
  1.3 Thesis Outline
  1.4 Publication List
2 Background and Preliminaries
  2.1 JavaScript Malware Types
  2.2 Obfuscation in JavaScript Malware
  2.3 Preliminaries about Fuzzing
3 JavaScript Malware Detection Using Machine Learning
  3.1 Introduction
  3.2 System Overview of JSDC
    3.2.1 Data Preparation and Preprocessing
    3.2.2 Feature Extraction
    3.2.3 Normalizing Features Vector
    3.2.4 Feature Selection
    3.2.5 Classifiers Training
    3.2.6 Dynamic Confirmation
  3.3 Details of Feature Extraction
    3.3.1 Textual Analysis
    3.3.2 Inner-Script Program Analysis
    3.3.3 Inter-Script Program Analysis
  3.4 Dynamic Confirmation
  3.5 Implementation
  3.6 Evaluation
    3.6.1 Experiment Setup
    3.6.2 Evaluation of Malware Detection
      3.6.2.1 Controlled Experiments
      3.6.2.2 Uncontrolled Experiments
      3.6.2.3 Performance
    3.6.3 Attack Type Classification
      3.6.3.1 Prediction Results
    3.6.4 Combining Dynamic Confirmation and Machine Learning
  3.7 Related Work
  3.8 Conclusion
4 JavaScript Malware Behavior Modelling
  4.1 Introduction
  4.2 JavaScript Malware Behavior Modelling
  4.3 Approach
    4.3.1 Overview
    4.3.2 Trace Preprocessing
  4.4 JS* Learning Framework
    4.4.1 Membership Query
      4.4.1.1 Browser Defense Rule
      4.4.1.2 Data Dependency Analysis
      4.4.1.3 Trace Replay
      4.4.1.4 Membership Query Algorithm
    4.4.2 Candidate Query
    4.4.3 The Learned DFA and Refinement
  4.5 Implementation and Evaluation
    4.5.1 Capturing System Calls
    4.5.2 Data Preparation and Setup
    4.5.3 JS* Learning Evaluation
  4.6 Related Work
  4.7 Conclusion
5 Vulnerability Detection via Data-Driven Seed Generation
  5.1 Introduction
  5.2 Approach Overview
    5.2.1 Target Programs
    5.2.2 Overview of Skyfire
  5.3 PCSG Learning
    5.3.1 PCSG
    5.3.2 Learning PCSG
  5.4 Seed Generation
  5.5 Implementation and Evaluation
    5.5.1 Evaluation Setup
    5.5.2 Vulnerabilities and Bugs Discovered
    5.5.3 Code Coverage
    5.5.4 The Effectiveness of Context and Heuristics
    5.5.5 Performance Overhead
  5.6 Related Work
    5.6.1 Mutation-Based Fuzzing
    5.6.2 Generation-Based Fuzzing
  5.7 Conclusion
6 Vulnerability Detection via Grammar-Aware Greybox Fuzzing
  6.1 Introduction
  6.2 Our Approach
    6.2.1 Grammar-Aware Trimming Strategy
    6.2.2 Grammar-Aware Mutation Strategies
      6.2.2.1 Enhanced Dictionary-Based Mutation
      6.2.2.2 Tree-Based Mutation
    6.2.3 Selective Instrumentation Strategy
  6.3 Evaluation
    6.3.1 Evaluation Setup
    6.3.2 Discovered Bugs and Vulnerabilities (RQ1)
    6.3.3 Code Coverage (RQ2)
    6.3.4 The Effectiveness of Grammar-Aware Trimming (RQ3)
    6.3.5 The Effectiveness of Grammar-Aware Mutation (RQ4)
    6.3.6 The Effectiveness of Selective Instrumentation (RQ5)
    6.3.7 Performance Overhead (RQ6)
    6.3.8 Case Study
  6.4 Related Work
    6.4.1 Guided Mutation
    6.4.2 Grammar-Based