<<

Automatic Simplification of Obfuscated JavaScript Code: A Semantics-Based Approach

Gen Lu Saumya Debray Department of Computer Science The University of Arizona Tucson, AZ 85721, USA Email: {genlu, debray}@cs.arizona.edu

Abstract—JavaScript is a that is code to avoid detection [13]. The growing prevalence commonly used to create sophisticated interactive client- of malicious JavaScript code is exemplified by the side web applications. With the popularity growth in Gumblar worm, which uses dynamically generated and those “Web 2.0” sites, browsing the Internet without heavily obfuscated JavaScript to avoid detection and JavaScript support is impractical. However, JavaScript identification, and which at one point was considered code can also be used to exploit vulnerabilities in the web browser and its extensions, and in recent years it to be the fastest-growing threat on the Internet [17]. has become a major mechanism for web-based Identifying malicious JavaScript code is not easy, delivery. In order to avoid detection, attackers often take however. The mere presence of obfuscated JavaScript advantage of the dynamic nature of JavaScript to create does not, in itself, signal the presence of malicious highly obfuscated code. This makes the identification content, since benign web pages also to use code of malicious JavaScript code challenging: current ap- obfuscation to protect intellectual property [7], [16]. proaches to analyzing such code typically either make assumptions that may not hold in practice, or else Moreover, attackers often use server-side scripting to require manual intervention that can be tedious and deliver randomly obfuscated code where each instance time-consuming. This paper describes a semantics-based is syntactically different from the next (server-side approach for automatic deobfuscation of JavaScript code polymorphism). For these reasons, static signature- using dynamic analysis and slicing. Experiments using a based heuristics (e.g., “‘eval(’ and ‘unescape(’ within prototype implementation indicate that our approach is 15 bytes of each other” [17]) have limited success able to penetrate multiple layers of complex obfuscations when dealing with obfuscated JavaScript. Traditional and extract the core logic of the computation, which anti-virus tools that process web pages typically rely makes it easier to understand the behavior of the code. on such syntactic heuristics and so tend to produce a high misidentification rate [7], [13], [16]. I. INTRODUCTION A better solution would be to use semantics-based Recent years have seen a dramatic rise in the role of techniques that focus on code behavior. Unfortunately, the World-Wide Web in many aspects of our day-to- current behavioral analysis techniques for obfuscated day lives. Unfortuately, there has been a concomitant JavaScript typically require considerable manual inter- increase in the use of the Web as a malware-delivery vention, e.g., to modify the code in specific ways or platform: even a single visit by an unwary user to to monitor its execution within a debugger [19], [22], an infected website may be sufficient to cause an [32]. There has been some recent work on automated arbitrary malicious payload to be downloaded and behavioral analyses of obfuscated JavaScript that in executed without user’s permission or knowledge. This many cases has a “deobfuscator” component [4], [6], process, known as drive-by-downloading, is carried out [7], [9], [16], as well as standalone JavaScript deobfus- by exploiting vulnerabilities in web browsers and/or cation tools [1], [10], [23], [30]. These deobfuscators their extensions [25], [26] and can be delivered us- all rely on some simple and intuitive assumptions about ing a variety of web-based mechanisms [24]. Drive- the obfuscation and the structure of the obfuscated by-downloads often rely on JavaScript, a scripting code. Although these assumptions seem plausible, it is language widely used for generating dynamic web not difficult to construct obfuscations that violate them content. Attackers often take advantage of the dy- and thereby defeat the corresponding deobfuscators. namic nature of JavaScript to create highly obfuscated This is illustrated in Section IV. This paper proposes a different approach to analyze II. BACKGROUND obfuscated JavaScript code. We collect bytecode exe- cution traces from the target program and use dynamic This section provides an overview of the JavaScript slicing and semantics-preserving code transformations language and the host environment of web browser and to automatically simplify the trace, then reconstruct describes some widely used real-world code obfusca- deobfuscated JavaScript code from simplified trace. tion techniques. The code so obtained is observationally equivalent to the original program for the execution path considered, A. JavaScript Basics but has obfuscations simplified away, thereby exposing The term “JavaScript”, commonly used to refer to the core logic of the computation performed by the a scripting language used for client-side programming original code. The resulting code can then be examined of dynamic websites, actually encompasses several dis- manually or fed to other analysis tools for further pro- tinct components. These include the core programming cessing. Our approach differs from existing approaches language, an implementation of the ECMA-262 stan- in that it makes no assumptions about the structure of dard; and the host enviroment, namely, the Document the obfuscation and uses semantics-based techniques Object Model (DOM) provided by web browser. to reveal the behavior of the code. The core JavaScript language provides a set of In previous work [18], we have described a data types (e.g. Boolean, String, Object), a collection semantics-based approach to deobfuscating “core” of built-in objects and functions (e.g. RegExp, Math, JavaScript, i.e. code that runs on a standalone Date), and a prototype-based inheritance mechanism, JavaScript engine. This does not account for interac- among other things. Like most scripting languages, tions between the executing JavaScript code and the JavaScript is highly dynamic in nature. It is dynam- host browser, which provides the Document Object ically typed, which means that a variable can take Model (DOM) for manipulating HTML documents and on values of different types at different points in a the ability to interact with external websites. This pa- program. Properties of (associative-array based) ob- per, by contrast, focuses on the more challenging task jects can be added/deleted on the fly, and code can be of deobfuscating JavaScript code in the full context generated from strings at runtime using built-in eval of the browser’s execution environment. This is much function. more complex than the “core” JavaScript considered in [18], and admits many more options for obfuscation, The Document Object Model (DOM) is an API that but is able to handle the full range of behaviors of real abstracts HTML documents as a structural represen- JavaScript malware. We evalute our approach using tation of objects and provides a mechanism for ma- a prototype implemention and test it againest both nipulating this abstraction, thereby enabling JavaScript synthetic programs and real malware code. The results code to modify and interact with the content of web show that our approach can penetrate multiple layers pages dynamically. For example, write method of the of complex obfuscations, some of which cannot be document object can be used to dynamically write handled by existing techniques, and extract the core HTML expressions or JavaScript code to a document. logic of the underlying computation. In contrast to the built-in objects defined in core JavaScript, objects defined in the DOM specification The rest of this paper is organized as follows. are called “host objects”, and are provided as a part of Section II provides the background information of the host environment by the web browser. JavaScript and obfuscation techniques. Sec- tion III describes the concept of semantic-based de- obfuscation and the implementation of our approach. B. JavaScript Runtime Section IV presents our experimental evaluation. Sec- At the implementation level, JavaScript typically tions V and VI discusses related and future work, and uses an expression-stack-based byte-code interpreter; Section VII concludes. Appendix A presents the com- modern implementations of these interpreters usually plete code contexts for malware sample P4 discussed come with JIT compilers. For example, Mozilla’s in Section IV. popular FireFox web browser uses an open source JavaScript interpreter, SpiderMonkey, written in /C++ [21]. This is a single dispatching function that steps through the bytecode one instruction at a time.

HelloWorld
As discussed in previous sections, client-side 1 2 and for each piece of code generated by runtime unfolding, we call it a code context. Native and non- Figure 1. Example of obfuscated JavaScript code native functions are also terms being used through out this paper. In JavaScript programs, functions defined using document.getElementById() at runtime. And, of in JavaScript (using keyword function) are called non- course, document.write() is a more powerful weapon native functions; and functions provided by the inter- than eval(), which can be used in combination of those preter or browser (e.g. built-in functions and methods obfuscation techqniues mentioned above, to generate of host objects) are called native functions. Unlike script, document elements storing data and pointers to non-native functions, native functions do not generate external documents, all at runtime. a bytecode trace when executed. We will see this Figure 1 presents an example of JavaScript code difference of native/non-native functions is important obfuscated using some of the techniques discussed to our deobfuscation technique. above. Line 3 of this code snippet uses bracket notation to reference object property, using the strings ‘getEle- C. JavaScript Obfuscation Techniques mentById’ (obtained as the concatenation of the strings ‘get’, ‘Elem’, ‘ent’, ‘By’, and ‘Id’) and ‘innerHTML’ The dynamic nature of Javascript code makes pos- (obtained as the concatenation of the strings ‘i’, ‘nn’, sible a variety of obfuscation techniques. Particularly ‘erH’, and ‘TML’) as array indices instead of using challenging is the combination of the ability to execute the more straightforward dot-notation to obtain the a string using code unfolding, as described above, corresponding property values. Furthermore, it uses the and the fact that the string being “executed” may DOM method document.getElementById() to retrieve be obfuscated in a wide variety of ways. Howard data, namely, the string ‘HelloWorld.’ For simplicity of discusses several such techniques in more detail [13]. exposition, this code uses a very straightforward obfus- For example, the characters in the string can be en- cation of the array index strings, namely, concatenation coded in various ways, e.g., using %-encoding (a as of a few smaller strings; the code could, however, just %61, b as %62, ...), Unicode (a as \u0061, b as as easily have used arbitrarily more complex obfusca- \u0062, ...), Base-64, etc. The string can be kept in tions to construct these strings. The script is equivalent encrypted, compressed, or permuted form. It can be to var c = document.getElementById(’x’).innerHTML. constructed at runtime by concatenating other strings The value finally assigned to variable c is a string together. Besides, in addition to the traditional “dot ’HelloWorld’ which is retrived from the HTML
notation” (obj.property) for object access, one can use element with ID ’x’. a “bracket notation” (obj[“property”]) instead. In the latter case, moreover, the use of a string as an array Those obfuscation techniques can be combined in ar- index makes applicable all of the string obfuscation bitrary ways with multi-layered code unfolding, which techniques mentioned earlier. makes it difficult to determine the intent of a JavaScript program from a static examination of the program The host environment of web browser also provides text. To make it more challenging, the payload can be various options for obfuscation. One approach is to scattered in multiple code contexts at different levels, split code into several parts, either in the same file with each piece using various obfuscation techniques or even into multiple files stored among web servers. and hidden in the garbage code whose only purpose is This technique is frequently seen with web-based to confuse deobfuscators. We will show in Section IV malware. Another approach takes advantage of DOM this trick can defeat existing JavaScript deobfuscators, interaction. For example, data can be stored in the which assume the unobfuscated, complete payload is HTML file, outside the dynamically created hidden iFrame. Figure 10 shows the output of our deobfuscator. The recovered code is very close to hand deobfuscated result and captures Figure 8. Source code for malware sample P4 the essence of the malicious behavior of P4. The call to document.write() in context 3 is part of the simplified code since it is used as a method for output as discussed in Section III-D.

V. RELATED WORK

Figure 9. Execution flow of malware sample P4 Most current approaches to dealing with obfuscated JavaScript typically require a significant amount of Figure 7 presents the output of our deobfuscator, in manual intervention, e.g., to modify the JavaScript which most of the obfuscation and garbage code are code in specific ways or to monitor its execution removed, recovered code is very close to the original within a debugger [19], [22], [32]. There are also P3, and expresses the same functionality. The extra approaches, such as Caffeine Monkey [10], intended code (function f0 and f4) is introduced because of the to assist with analyzing obfuscated JavaScript code, by control dependency, which is possible to be simplified instrumenting JavaScript engine and logging the actual away by further analysis, e.g. in this case, since none string passed to eval. Similar tools include several of the arguments is relevant, the invocation of the browser extensions, such as the JavaScript Deobfus- functions can be simply replaced by their body. We cator extension for Firefox [23]. The disadvantage of leave this for future work. such approaches is that they show all the code that is Finally, we evaluated our prototype system using a executed and do not separate out the code that pertains JavaScript malware sample, P4, which we collected to the actual logic of the program from the code whose from the Internet as described earlier. The complete only purpose is to deal with obfuscation. code contexts of P4 is presented in Appendix A; Figure Recently a few authors have begun looking at 8 shows the initial obfuscated JavaScript, while Figure automatic analysis of obfuscated and/or malicious 9 shows its high-level dynamic structure. Context 1 JavaScript code. Cova at al. [4] and Curtsinger et al. resides in the web page opened by web browser; it [7] describe the use of machine learning techniques is a small piece of obfuscated JavaScript code (see based on a variety of dynamic execution features Figure 8) that invokes document.write() method to to classify Javascript code as malicious or benign. dynamically insert a hidden iFrame, and cause the Such techniques typically do not focus on automatic load of an external web page. This newly loaded web deobfuscation, relying instead on the heuristics based page contains more obfuscated code, which consists on behavioral characteristics. Since obfuscation can also be found in benign code and really is simply an indicative of a desire to protect the code against casual inspection, classifiers that rely on obfuscation-oriented features are not reliable indicators of malicious intent. Our automatic deobfuscation approach can potentially increase the accuracy of such techniques by exposing the actual logic of the code. Saxena et al. discuss dynamic symbolic execution of JavaScript code using constraint-solving over strings [28]. Hallaraker and Vigna describe an approach to detecting malicious JavaScript code by monitoring the execution of the Figure 10. Deobfuscator output for malware sample P4 program and comparing the execution to a set of high- level policies [12]. All of these works are very different what is going on is that our slicing algorithm starts from the approach discussed in this paper. with a notion of a set of “interesting” data values and There is a rich body of literature dealing with dy- identifies all of the other code that affects these values; namically generated (“unpacked”) code in the context the problem is that our current notion of “interesting of native-code malware executables [5], [15], [20], values,” limited to arguments to native function calls, [27]. Much of this work focuses on detecting unpack- is too narrow to capture such attack code. We intend to ing and identifying the unpacked code. By contrast, the explore ways to address this issue by broadening the work described here is not concerned with the identi- notion of interesting data values appropriately, e.g., by fication and extraction of dynamically-generated code also including strings that appear to contain machine per se, but focuses instead on identifying instructions code [9]. that are relevant to the externally-observable behavior Finally, there are a few minor aspects of JavaScript of the program. and its DOM interactions that we have not yet had time to implement fully within our decompiler. For VI. DISCUSSION AND FUTURE WORK example, our JavaScript decompiler currently does not handle exception-handling via try-catch statements. Our approach requires instrumenting the JavaScript This is straightforward implementation work and does interpreter within a web browser to write out a trace not present significant conceptual challenges. of the byte code being executed. Because this requires inserting code into the interpreter, our current prototype VII. CONCLUSIONS is implemented in the context of the open-source Firefox web browser. However, none of our ideas are The common use of JavaScript code for web-based Firefox-specific, and in principle they could be adapted malware delivery makes it important to be able analyze in a straightforward way to any browser whose source the behavior of JavaScript programs and, possibly, code is available. classify them as benign or malicious. For malicious JavaScript code, it is useful to have automated tools A potential concern with dynamic approaches, such that can help identify the functionality of the code. as ours, is that of code coverage: in theory, static However, such JavaScript code is usually highly obfus- analyses can examine all of the code in a program cated, and use dynamic language constructs that make while dynamic analyses can only examine the code program analysis difficult. that lies on a particular execution path. In practice, however, current static analyses for JavaScript are not We present a semantics-based approach for auto- able to actually penetrate constructs such as eval() matic deobfuscation of JavaScript code. We use dy- and analyze the code generated at runtime; rather, namic analysis and program slicing techniques to sim- they rely on syntactic heuristics such as the presence plify away the obfuscation and expose the underlying of redirection, calls to eval(), and code/data entropy, logic of the JavaScript code in web pages. Moreover, to classify whether or not the code is potentially this approach does not make any assumptions about malicious. Examination of the simplified JavaScript the structure of the obfuscation. Experiments using a code obtained from a tool such as ours can make it prototype implementation indicate that our technique easier to identify inputs that would cause the code to is effective even against highly obfuscated JavaScript execute alternative execution paths, which could then programs. be explored in a suitable browser environment. We leave an exploration of this issue as future research. ACKNOWLEDGEMENT Another area of future research is the extension our This work was supported in part by the National approach to allow identification of attack code that Science Foundation via grant nos. CNS-1016058 and does not have any other semantic significance. For CNS-1115829, and the Air Force Office of Scientific example, code whose only purpose is to position heap Research via grant no. FA9550-11-1-0191. buffers for a heap spray attack [8], [29], or strings crafted to contain that is written into REFERENCES such buffers, is currently not detected by our slicing algorithm unless there is also an explicit connection [1] jsunpack: A generic JavaScript unpacker. with an argument to a native function call. Intuitively, http://jsunpack.jeek.org/. [2] Online obfuscator. [14] E. Joelsson. Decompilation for visualization of code http://www.daftlogic.com/projects-online-javascript- optimizations. Master’s thesis, Royal Institute of Tech- obfuscator.htm. nology, 2003.

[3] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: [15] M. G. Kang, P. Poosankam, and H. Yin. Renovo: A principles, techniques, and tools. Pearson/Addison hidden code extractor for packed executables. In Proc. Wesley, 1986. Fifth ACM Workshop on Recurring Malcode (WORM 2007), November 2007. [4] D. Canali, M. Cova, G. Vigna, and C. Kruegel. Prophiler: A fast filter for the large-scale detection [16] S. Kaplan et al. “NOFUS: Automatically Detecting”+ of malicious web pages. In Proc. 20th International String. fromCharCode (32)+ “ObFuSCateD ”.toLow- Conference on World Wide Web, pages 197–206, 2011. erCase()+ “JavaScript Code”. Technical report, Mi- crosoft Research, 2011. [5] K. Coogan, S. Debray, T. Kaochar, and G. Townsend. Automatic static unpacking of malware binaries. In [17] A. Kirk. Gumblar and more on Javascript Proc. 16th. IEEE Working Conference on Reverse obfuscation. Sourcefire Vulnerability Research Engineering, pages 167–176, October 2009. Team. http://vrt-blog.snort.org/2009/05/gumblar-and- more-on-javascript.html. May 22, 2009. [6] M. Cova, C. Kruegel, and G. Vigna. Detection and analysis of drive-by-download attacks and malicious [18] Gen Lu, Kevin Coogan, and Saumya Debray. Au- JavaScript code. In Proc. 19th International Confer- tomatic simplification of obfuscated JavaScript code ence on World Wide Web, pages 281–290, 2010. (extended abstract). In Proc. ICISTM-12 Work- shop on Program Protection and Reverse Engineer- [7] C. Curtsinger, B. Livshits, B. Zorn, and C. Seifert. ing (PPREW), March 2012. Full version available Zozzle: Fast and precise in-browser JavaScript malware at: http://www.cs.arizona.edu/˜debray/Publications/js- detection. In USENIX Security Symposium, 2011. deobf-full.pdf. [19] P. Markowski. ISC’s four methods of [8] M. Daniel, J. Honoroff, and C. Miller. Engineer- decoding JavaScript + 1, March 2010. ing heap overflow exploits with javascript. In Proc. http://blog.vodun.org/2010/03/iscs-four-methods- Second USENIX Workshop on Offensive Technologies of-decoding.html. (WOOT), 2008. [20] L. Martignoni, M. Christodorescu, and S. Jha. Omni- [9] M. Egele, P. Wurzinger, C. Kruegel, and E. Kirda. De- Unpack: Fast, Generic, and Safe Unpacking of Mal- fending browsers against drive-by downloads: Mitigat- ware. In Proc. ACSAC ’07, December 2007. ing heap-spraying code injection attacks. In Proc. 6th International Conference on Detection of Intrusions [21] Mozilla. SpiderMonkey. and Malware, and Vulnerability Assessment, pages 88– https://developer.mozilla.org/en/SpiderMonkey. 106, 2009. [22] J. Nazario. Reverse engineering ma- [10] B. Feinstein and D Peck. Caffeine Monkey: Auto- licious JavaScript. CanSecWest 2007, mated collection, detection and analysis of malicious http://cansecwest.com/csw07/csw07-nazario.pdf. javaScript. Black Hat USA, 2007. [23] Wladimir Palant. FireFox add-on: JavaScript [11] Google. Minify. http://code.google.com/p/minify/. deobfuscator 1.5.7. https://addons.mozilla.org/en- US/firefox/addon/javascript-deobfuscator/. [12] O. Hallaraker and G. Vigna. Detecting Malicious JavaScript Code in Mozilla. In Proceedings of the [24] N. Provos, P. Mavrommatis, M.A. Rajab, and F. Mon- IEEE International Conference on Engineering of rose. All your iFRAMEs point to us. In Proc. 17th Complex Computer Systems (ICECCS), pages 85–94, USENIX Security Symposium, pages 1–15, 2008. Shanghai, China, June 2005. [25] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, [13] F. Howard. Malware with your mocha: and N. Modadugu. The ghost in the browser analysis Obfuscation and anti emulation tricks in of web-based malware. In Proceedings of the first malicious JavaScript, September 2010. conference on First Workshop on Hot Topics in Un- http://www.sophos.com/security/technical- derstanding Botnets, pages 4–4. USENIX Association, papers/malware with your mocha.pdf. 2007. [26] N. Provos, M.A. Rajab, and P. Mavrommatis. Cyber- crime 2.0: when the cloud turns dark. Communications of the ACM, 52(4):42–47, 2009.

[27] P. Royal, M. Halpin, D. Dagon, R. Edmonds, and W. Lee. Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In Proc. ACSAC ’06, pages 289–300, 2006.

[28] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCa- mant, and D. Song. A symbolic execution framework for JavaScript. In Proceedings of the 31st IEEE Symposium on Security and Privacy, pages 513–528, Oakland, CA, USA, 2010.

[29] A. Sotirov. Heap feng shui in JavaScript. In Black Hat USA 2007, July 2007. https://www.blackhat.com/presentations/ bh-usa-07/Sotirov/Whitepaper/ bh-usa-07-sotirov-WP.pdf.

[30] Boban Spasic. Malzilla. http://malzilla.sourceforge.net/. Figure 11. The first code context of P4 [31] T. Wang and A. Roychoudhury. Dynamic slicing on java bytecode traces. ACM TOPLAS, 30(2):10, 2008.

[32] D. Wesemann. Advanced obfus- cated JavaScript analysis, April 2008. http://isc.sans.org/diary.html?storyid=4246.

[33] Yahoo! YUI compressor. http://developer.yahoo.com/yui/compressor/.

[34] Bojan Zdrnja. Advanced JavaScript obfuscation (or why signature scanning is a failure), April 2009. http://isc.sans.edu/diary.html?storyid=6142.

APPENDIX COMPLETE CODE CONTEXTS OF P4 This section presents the complete code contexts of malware sample P4 discussed in SectionIV. Figure 11 presents the first context, it creates a hidden iFrame dynamically which points to a web page (i.e. code context 2) from a malicious site. The second context, as shown in Figure 12, is unfolded by context 1. It assembles the exploit code by decrypting infor- mation retrieved from a pair of hidden DOM elements, then executes it using eval. Figure 13 presents the real exploit code of P4, which tries to detect whether you have the Adobe reader plugin in your browser and loads the payload — a PDF file created to exploit a vulnerability in the Adobe PDF reader JavaScript Figure 13. The third code context of P4 engine. Figure 12. The second code context of P4