Clojure Bytecode Decompiler on JVM
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Clojure bytecode decompiler on JVM

BACHELOR THESIS

Michael Šimáček

Brno, spring 2015

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Michael Šimáček

Advisor: Mgr. Marek Grác, Ph.D.

Acknowledgement

I wish to express my sincere thanks to Pavel Tišnovský for technical guidance and for introducing me to Clojure. I would also like to thank Marek Grác for comments and consultations.

Abstract

The aim of this thesis is to develop a prototype of a decompiler for the Clojure programming language that can output possible source code corresponding to given JVM bytecode generated by the Clojure compiler. The text provides an overview of the Java Virtual Machine and the Clojure language, describes the process of compiling Clojure programs that target the Java platform, and provides an overview of the architecture of the Clojure decompiler.

Keywords

Clojure, Java, JVM, decompiler, bytecode

Contents

1 Introduction
2 Clojure features
  2.1 Functional paradigm
  2.2 Dynamic
  2.3 Code as data
  2.4 Concurrency
  2.5 Multiple dispatch
  2.6 Ports to other platforms
3 Java virtual machine
  3.1 JVM class files
    3.1.1 Class metadata
    3.1.2 Constant pool
    3.1.3 Fields
    3.1.4 Methods
    3.1.5 Code attribute
  3.2 Class loading
  3.3 Code execution environment
    3.3.1 Data types
  3.4 Bytecode instructions
    3.4.1 Load and store instructions
    3.4.2 Stack manipulation instructions
    3.4.3 Return instructions
    3.4.4 Constant push instructions
    3.4.5 Primitive conversion instructions
    3.4.6 Array manipulation instructions
    3.4.7 Arithmetic instructions
    3.4.8 Field access
    3.4.9 Method invocation instructions
    3.4.10 Branching instructions
    3.4.11 Other instructions
4 Clojure compilation to JVM bytecode
  4.1 Modes of Clojure compilation
    4.1.1 On-the-fly compilation
    4.1.2 Ahead-of-time compilation
  4.2 Representation of scalar types
    4.2.1 Numbers
    4.2.2 Booleans
    4.2.3 Boxing and unboxing
    4.2.4 Strings and regular expressions
  4.3 Java interoperation
    4.3.1 Calling static methods
    4.3.2 Creating instances of Java classes
    4.3.3 Getting fields of objects
    4.3.4 Calling methods on object
  4.4 Clojure functions
    4.4.1 IFn
    4.4.2 AFn, AFunction and RestFn
    4.4.3 Closures
    4.4.4 Accessing vars and constants
    4.4.5 Type hints on function arguments
    4.4.6 Inline functions
    4.4.7 Example of compiled form
    4.4.8 Tail recursion
  4.5 Local bindings and function arguments
  4.6 Control structures
    4.6.1 Conditional expression
    4.6.2 loop expression
  4.7 Top-level expressions
5 Clojure decompiler
  5.1 Used technologies
    5.1.1 Leiningen
    5.1.2 Apache BCEL
    5.1.3 Other libraries
  5.2 Architecture
  5.3 Implementation
    5.3.1 Entry point
    5.3.2 Decompilation context
    5.3.3 Decompiling function classes
    5.3.4 Decompiling initialization classes
    5.3.5 Processing instructions
    5.3.6 Expression tree transformations
    5.3.7 Outputting Clojure code
6 Conclusion

1 Introduction

The Java platform is one of the most popular platforms for developing software used by enterprises, and the Java language is the primary language that targets it. Many new languages have come into existence since Java was created, and some of them offer newer features or are better suited to particular problems, but larger enterprises are hesitant to migrate because they prefer a stable, time-tested platform and are bound by existing libraries, frameworks and legacy code.

The Java platform provides a runtime environment that hosts more programming languages than just Java. Multiple new languages that target the platform have been created, aiming to provide faster development, more concise syntax and alternative programming paradigms. Such languages are commonly called Java.next, and they could provide a solution for enterprises that would like to try new development paradigms but need to maintain compatibility with an existing codebase.

Clojure is one of the Java.next languages; it was created by Rich Hickey in 2007. It aims to be a general-purpose language that provides tight integration with the underlying platform [1]. Clojure adheres to the functional paradigm, which is based on treating functions as values and not modifying objects after they are created [2]. It is a dynamic language, which means that programs can change their structure at runtime [3]. The most visible difference from mainstream programming languages is the syntax, which is derived from the Lisp language. Lisp represents code using the same structures it uses to store data, which makes it easy to express self-modifying programs [4].

This thesis aims to provide an overview of the Clojure implementation on the Java Virtual Machine and to create a Clojure decompiler that can reconstruct possible source code corresponding to compiled bytecode.
Chapter 2 provides an overview of the Clojure programming language, its notable features and its differences from Java. Chapter 3 describes the Java Virtual Machine and its architecture. It presents the JVM instruction set divided into groups of instructions that serve a similar purpose. Chapter 4 focuses on the process of Clojure compilation, providing examples of how individual basic language forms are represented in JVM bytecode. The bytecode examples are presented inline with comments describing what each instruction does from a higher-level point of view. Chapter 5 describes the architecture and implementation of the Clojure decompiler developed as part of this thesis.

2 Clojure features

2.1 Functional paradigm

Clojure is a functional language that supports first-class functions, lexical closures and anonymous functions. Functions are called first-class if the language treats them as data, i.e. they can be passed as arguments to other functions or returned as values [5, p. 126]. Lexical closures are functions that can access locals from the context in which they were defined [5, p. 135]. Anonymous functions are functions that are not bound to a name and are usually created directly at the place where they are needed. Clojure encourages programming without mutable state, preferring recursion to looping based on side effects [6], but it also acknowledges that maintaining state is needed in certain situations and provides primitives to deal with it safely in a concurrent environment.

2.2 Dynamic

Dynamic programming languages enable programs to change their logical structure while they are running, and they generally feature dynamic typing. Dynamic typing does not require programmers to explicitly declare the types of their variables and data structures; types are checked at runtime [3]. Clojure is dynamically typed, but it provides the ability to add type information to function arguments and local bindings.
This, along with type inference, can eliminate some of the overhead imposed by runtime method lookup. Clojure does not provide compile-time checking for type mismatches, but optional type checking can be added via a third-party library [7], such as core.typed (http://typedclojure.org/).

2.3 Code as data

Clojure's syntax is derived from Lisp and builds on the idea that code is represented using the same structures as data, which enables a program to easily modify or generate code [4]. One feature that makes use of this is macros. Macros are functions that run at compile time and can perform arbitrary transformations on the code they receive as their arguments. This enables the programmer to add new syntactic elements to the language without modifying the compiler or manipulating the compiled code. Such elements can then be composed to create a Domain Specific Language (DSL), a computer programming language of limited expressiveness focused on a particular domain [8]. Unlike other DSL-friendly languages, such as Groovy or Ruby, which create DSLs by generating code at runtime, Clojure generates code at compile time with no additional runtime overhead [9].

2.4 Concurrency

Achieving high performance without race conditions and data corruption is considered to be a difficult task. The problems mostly arise from using shared data structures that are altered in place and therefore require explicit synchronization, which may lower performance and cause deadlocks when not used carefully. Clojure addresses this problem by using persistent data structures: structures that are not modified in place. Instead, each modification creates a new version of the structure that shares the common data with the original. Then only the reference to the data is changed, ensuring that all threads see a consistent view of the data structure at any time.
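The behaviour described above can be sketched in a few lines of Clojure. This is only an illustrative sketch: the names original, extended, base, longer, shares-tail?, shared and snapshot are hypothetical, and the use of an atom as the reference that gets changed is an assumption, since the text does not name a specific reference type.

```clojure
;; conj returns a new vector; the original value is never modified.
(def original [1 2 3])
(def extended (conj original 4))

;; Structural sharing: the new list reuses the old one as its tail
;; instead of copying it.
(def base (list 2 3))
(def longer (cons 1 base))
(def shares-tail? (identical? (rest longer) base))

;; Assumption: an atom stands in for "the reference to the data".
(def shared (atom original))
(def snapshot @shared)    ; a reader dereferences and keeps its own view
(swap! shared conj 4)     ; a writer atomically installs a new version

(println original extended)   ; prints [1 2 3] [1 2 3 4]
(println shares-tail?)        ; prints true
(println snapshot @shared)    ; prints [1 2 3] [1 2 3 4]
```

The reader's snapshot stays valid after the writer's update because the update never touches the old value; it only points the reference at a new one.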
An important feature available to Clojure programmers is built-in support for software transactional memory (STM). STM provides the ability to perform operations on multiple references to objects in memory without explicit locking, while maintaining the atomicity, consistency and isolation properties known from relational database systems [5, p. 234].

2.5 Multiple dispatch

Methods in most object-oriented languages, such as Java, make it possible to have multiple implementations of the same method in different classes.