Automatic Simplification of Obfuscated Javascript Code: a Semantics-Based Approach

Automatic Simplification of Obfuscated JavaScript Code: A Semantics-Based Approach Gen Lu Saumya Debray Department of Computer Science The University of Arizona Tucson, AZ 85721, USA Email: {genlu, debray}@cs.arizona.edu Abstract—JavaScript is a scripting language that is code to avoid detection [13]. The growing prevalence commonly used to create sophisticated interactive client- of malicious JavaScript code is exemplified by the side web applications. With the popularity growth in Gumblar worm, which uses dynamically generated and those “Web 2.0” sites, browsing the Internet without heavily obfuscated JavaScript to avoid detection and JavaScript support is impractical. However, JavaScript identification, and which at one point was considered code can also be used to exploit vulnerabilities in the web browser and its extensions, and in recent years it to be the fastest-growing threat on the Internet [17]. has become a major mechanism for web-based malware Identifying malicious JavaScript code is not easy, delivery. In order to avoid detection, attackers often take however. The mere presence of obfuscated JavaScript advantage of the dynamic nature of JavaScript to create does not, in itself, signal the presence of malicious highly obfuscated code. This makes the identification content, since benign web pages also to use code of malicious JavaScript code challenging: current ap- obfuscation to protect intellectual property [7], [16]. proaches to analyzing such code typically either make assumptions that may not hold in practice, or else Moreover, attackers often use server-side scripting to require manual intervention that can be tedious and deliver randomly obfuscated code where each instance time-consuming. This paper describes a semantics-based is syntactically different from the next (server-side approach for automatic deobfuscation of JavaScript code polymorphism). For these reasons, static signature- using dynamic analysis and slicing. Experiments using a based heuristics (e.g., “‘eval(’ and ‘unescape(’ within prototype implementation indicate that our approach is 15 bytes of each other” [17]) have limited success able to penetrate multiple layers of complex obfuscations when dealing with obfuscated JavaScript. Traditional and extract the core logic of the computation, which anti-virus tools that process web pages typically rely makes it easier to understand the behavior of the code. on such syntactic heuristics and so tend to produce a high misidentification rate [7], [13], [16]. I. INTRODUCTION A better solution would be to use semantics-based Recent years have seen a dramatic rise in the role of techniques that focus on code behavior. Unfortunately, the World-Wide Web in many aspects of our day-to- current behavioral analysis techniques for obfuscated day lives. Unfortuately, there has been a concomitant JavaScript typically require considerable manual inter- increase in the use of the Web as a malware-delivery vention, e.g., to modify the code in specific ways or platform: even a single visit by an unwary user to to monitor its execution within a debugger [19], [22], an infected website may be sufficient to cause an [32]. There has been some recent work on automated arbitrary malicious payload to be downloaded and behavioral analyses of obfuscated JavaScript that in executed without user’s permission or knowledge. This many cases has a “deobfuscator” component [4], [6], process, known as drive-by-downloading, is carried out [7], [9], [16], as well as standalone JavaScript deobfus- by exploiting vulnerabilities in web browsers and/or cation tools [1], [10], [23], [30]. These deobfuscators their extensions [25], [26] and can be delivered us- all rely on some simple and intuitive assumptions about ing a variety of web-based mechanisms [24]. Drive- the obfuscation and the structure of the obfuscated by-downloads often rely on JavaScript, a scripting code. Although these assumptions seem plausible, it is language widely used for generating dynamic web not difficult to construct obfuscations that violate them content. Attackers often take advantage of the dy- and thereby defeat the corresponding deobfuscators. namic nature of JavaScript to create highly obfuscated This is illustrated in Section IV. This paper proposes a different approach to analyze II. BACKGROUND obfuscated JavaScript code. We collect bytecode execution traces from the target program and use dynamic This section provides an overview of the JavaScript slicing and semantics-preserving code transformations language and the host environment of web browser and to automatically simplify the trace, then reconstruct describes some widely used real-world code obfusca- deobfuscated JavaScript code from simplified trace. tion techniques. The code so obtained is observationally equivalent to the original program for the execution path considered, A. JavaScript Basics but has obfuscations simplified away, thereby exposing The term “JavaScript”, commonly used to refer to the core logic of the computation performed by the a scripting language used for client-side programming original code. The resulting code can then be examined of dynamic websites, actually encompasses several dis- manually or fed to other analysis tools for further pro- tinct components. These include the core programming cessing. Our approach differs from existing approaches language, an implementation of the ECMA-262 stan- in that it makes no assumptions about the structure of dard; and the host enviroment, namely, the Document the obfuscation and uses semantics-based techniques Object Model (DOM) provided by web browser. to reveal the behavior of the code. The core JavaScript language provides a set of In previous work [18], we have described a data types (e.g. Boolean, String, Object), a collection semantics-based approach to deobfuscating “core” of built-in objects and functions (e.g. RegExp, Math, JavaScript, i.e. code that runs on a standalone Date), and a prototype-based inheritance mechanism, JavaScript engine. This does not account for interac- among other things. Like most scripting languages, tions between the executing JavaScript code and the JavaScript is highly dynamic in nature. It is dynam- host browser, which provides the Document Object ically typed, which means that a variable can take Model (DOM) for manipulating HTML documents and on values of different types at different points in a the ability to interact with external websites. This pa- program. Properties of (associative-array based) ob- per, by contrast, focuses on the more challenging task jects can be added/deleted on the fly, and code can be of deobfuscating JavaScript code in the full context generated from strings at runtime using built-in eval of the browser’s execution environment. This is much function. more complex than the “core” JavaScript considered in [18], and admits many more options for obfuscation, The Document Object Model (DOM) is an API that but is able to handle the full range of behaviors of real abstracts HTML documents as a structural represen- JavaScript malware. We evalute our approach using tation of objects and provides a mechanism for ma- a prototype implemention and test it againest both nipulating this abstraction, thereby enabling JavaScript synthetic programs and real malware code. The results code to modify and interact with the content of web show that our approach can penetrate multiple layers pages dynamically. For example, write method of the of complex obfuscations, some of which cannot be document object can be used to dynamically write handled by existing techniques, and extract the core HTML expressions or JavaScript code to a document. logic of the underlying computation. In contrast to the built-in objects defined in core JavaScript, objects defined in the DOM specification The rest of this paper is organized as follows. are called “host objects”, and are provided as a part of Section II provides the background information of the host environment by the web browser. JavaScript malwares and obfuscation techniques. Sec- tion III describes the concept of semantic-based deobfuscation and the implementation of our approach. B. JavaScript Runtime Section IV presents our experimental evaluation. Sec- At the implementation level, JavaScript typically tions V and VI discusses related and future work, and uses an expression-stack-based byte-code interpreter; Section VII concludes. Appendix A presents the com- modern implementations of these interpreters usually plete code contexts for malware sample P4 discussed come with JIT compilers. For example, Mozilla’s in Section IV. popular FireFox web browser uses an open source JavaScript interpreter, SpiderMonkey, written in C/C++ [21]. This is a single dispatching function that steps through the bytecode one instruction at a time. <div id='x'>HelloWorld</div> As discussed in previous sections, client-side 1 2 <script> JavaScript programs have the ability of generating code 3 a = document; b = "Id"; at runtime, using various mechanisms provided by both 4 var c = a[f()]('x')['i'+'nn'+'erH'+'TML']; interpreter and browser (e.g. eval(), document.write() 5 function f(){ and iframe, etc). Further, dynamic code generation can 6 var p='ent'+'By'+b; be multi-layered, e.g., a string that is eval-ed may 7 var q='get'+'Elem'; itself contain calls to eval, and such embedded calls 8 return q+p; to eval can be stacked several layers deep. We refer 9 } to such dynamic code generating as code unfolding, 10 </script> and for each piece of code generated by runtime

Automatic Simplification of Obfuscated Javascript Code: a Semantics-Based Approach

Bypassing Web Application Firewalls

C++11 Metaprogramming Applied to Software Obfuscation

Software Protection Through Obfuscation

6.858 Final Project

Achieving Obfuscation Through Self-Modifying Code: a Theoretical Model

Optimizing Away Javascript Obfuscation

Large-Scale and Language-Oblivious Code Authorship Identification

Automated Malware Analysis Report for Jetbrains-Toolbox

An Obfuscation Resilient Approach for Source Code Plagiarism Detection in Virtual Learning Environments

A Taxonomy of Software Obfuscation Techniques for Layered Security Hui Xu1*, Yangfan Zhou2, Jiang Ming3 and Michael Lyu4

4 Protecting Software Through Obfuscation: Can It Keep Pace with Progress in Code Analysis?

HOP: Hardware Makes Obfuscation Practical∗