MASARYK UNIVERSITY FACULTY OF INFORMATICS

Clojure bytecode decompiler on JVM

BACHELOR THESIS

Michael Šimáček

Brno, spring 2015

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Michael Šimáček

Advisor: Mgr. Marek Grác, Ph.D.


Acknowledgement

I wish to express my sincere thanks to Pavel Tišnovský for technical guidance and for introducing me to Clojure. I would also like to thank Marek Grác for comments and consultations.


Abstract

The aim of this thesis is to develop a prototype of a decompiler for the Clojure programming language that can output possible source code corresponding to given JVM bytecode generated by the Clojure compiler. The text provides an overview of the Java Virtual Machine and the Clojure language, describes the process of compiling Clojure programs for the Java platform and provides an overview of the architecture of the Clojure decompiler.


Keywords

Clojure, Java, JVM, decompiler, bytecode


Contents

1 Introduction
2 Clojure features
  2.1 Functional paradigm
  2.2 Dynamic
  2.3 Code as data
  2.4 Concurrency
  2.5 Multiple dispatch
  2.6 Ports to other platforms
3 Java virtual machine
  3.1 JVM class files
    3.1.1 Class metadata
    3.1.2 Constant pool
    3.1.3 Fields
    3.1.4 Methods
    3.1.5 Code attribute
  3.2 Class loading
  3.3 Code execution environment
    3.3.1 Data types
  3.4 Bytecode instructions
    3.4.1 Load and store instructions
    3.4.2 Stack manipulation instructions
    3.4.3 Return instructions
    3.4.4 Constant push instructions
    3.4.5 Primitive conversion instructions
    3.4.6 Array manipulation instructions
    3.4.7 Arithmetic instructions
    3.4.8 Field access
    3.4.9 Method invocation instructions
    3.4.10 Branching instructions
    3.4.11 Other instructions
4 Clojure compilation to JVM bytecode
  4.1 Modes of Clojure compilation
    4.1.1 On-the-fly compilation
    4.1.2 Ahead-of-time compilation
  4.2 Representation of scalar types
    4.2.1 Numbers
    4.2.2 Booleans
    4.2.3 Boxing and unboxing
    4.2.4 Strings and regular expressions
  4.3 Java interoperation
    4.3.1 Calling static methods
    4.3.2 Creating instances of Java classes
    4.3.3 Getting fields of objects
    4.3.4 Calling methods on object
  4.4 Clojure functions
    4.4.1 IFn
    4.4.2 AFn, AFunction and RestFn
    4.4.3 Closures
    4.4.4 Accessing vars and constants
    4.4.5 Type hints on function arguments
    4.4.6 Inline functions
    4.4.7 Example of compiled form
    4.4.8 Tail
  4.5 Local bindings and function arguments
  4.6 Control structures
    4.6.1 Conditional expression
    4.6.2 loop expression
  4.7 Top-level expressions
5 Clojure decompiler
  5.1 Used technologies
    5.1.1 Leiningen
    5.1.2 Apache BCEL
    5.1.3 Other libraries
  5.2 Architecture
  5.3 Implementation
    5.3.1 Entry point
    5.3.2 Decompilation context
    5.3.3 Decompiling function classes
    5.3.4 Decompiling initialization classes
    5.3.5 Processing instructions
    5.3.6 Expression tree transformations
    5.3.7 Outputting Clojure code
6 Conclusion

1 Introduction

The Java platform is one of the most popular platforms for developing software used by enterprises, and the Java language is the primary language that targets it. Many new languages have come into existence since Java was created, and some of them offer newer features or are better suited to a particular problem, but larger enterprises are hesitant to migrate because they prefer a stable, time-tested platform and are bound by existing libraries, frameworks and legacy code.

The Java platform provides a runtime environment that can host programming languages other than Java, and multiple new languages that target the platform have been created, aiming to provide faster development, more concise syntax and alternative programming paradigms. Such languages are commonly called Java.next, and they could provide a solution for enterprises that would like to try new development paradigms but need to maintain compatibility with an existing codebase.

Clojure is one of the Java.next languages; it was created by Rich Hickey in 2007. It aims to be a general-purpose language that provides tight integration with the underlying platform [1]. Clojure adheres to the functional paradigm, which is based on treating functions as values and not modifying objects after they are created [2]. It is a dynamic language, which means that programs can change their structure at runtime [3]. The most visible difference from mainstream programming languages is the syntax derived from the Lisp language. Lisp represents code using the same structures it uses to store data, which makes it easy to express self-modifying programs [4].

This thesis aims to provide an overview of the Clojure implementation on the Java Virtual Machine and to create a Clojure decompiler that can reconstruct possible source code corresponding to compiled bytecode.

Chapter 2 provides an overview of the Clojure programming language, its notable features and its differences from Java. Chapter 3 describes the Java Virtual Machine and its architecture. It presents the JVM instruction set divided into groups of instructions that serve a similar purpose. Chapter 4 focuses on the process of Clojure compilation, providing examples of how individual basic language forms are represented in JVM bytecode. The bytecode examples are presented inline with comments describing what each instruction does from a higher-level point of view. Chapter 5 describes the architecture and implementation of the Clojure decompiler developed as part of this thesis.

2 Clojure features

2.1 Functional paradigm

Clojure is a functional language that supports first-class functions, lexical closures and anonymous functions. Functions are called first-class if they are treated as data by the language, i.e. they can be passed as arguments to other functions or returned as values [5, p. 126]. Lexical closures are functions that can access locals from the context where they were defined [5, p. 135]. Anonymous functions are functions that are not bound to a name and are usually created directly at the place where they are needed. Clojure encourages programming without mutable state, preferring recursion to side-effect-based looping [6], but it also acknowledges that maintaining state is needed in certain situations and provides primitives to deal with it safely in a concurrent environment.

2.2 Dynamic

Dynamic programming languages enable programs to change their logical structure while they are running, and they generally feature dynamic typing. Dynamic typing doesn’t require the programmer to explicitly declare the types of variables and data structures; type checking is done at runtime [3]. Clojure is dynamically typed, but provides the ability to add type information to function arguments and local bindings. This, along with type inference, can eliminate some of the overhead imposed by runtime method lookup. Clojure doesn’t provide compile-time checking for type mismatches, but optional type checking can be added via a third-party library [7], such as core.typed¹.

2.3 Code as data

Clojure’s syntax is derived from Lisp, building on the idea that code is represented using the same structures as data, thus enabling the program to easily modify or generate code [4]. One of the features that utilize this are macros. Macros are functions that are run at compile time and can perform arbitrary transformations on the code they get as their arguments. This enables the programmer to add new syntactical elements to the language without modifying the compiler or manipulating the compiled code. Such elements can then be composed to create a Domain Specific Language (DSL), a computer programming language of limited expressiveness focused on a particular domain [8]. Unlike other DSL-friendly languages, such as Groovy or Ruby, which create DSLs by generating code at runtime, Clojure-based code generation happens at compile time with no additional runtime overhead [9].

1. http://typedclojure.org/

2.4 Concurrency

Achieving high performance without race conditions and data corruption is considered to be a difficult task. The problems mostly arise from using shared data structures that are altered in place, thus requiring explicit synchronization that may lower performance and cause deadlocks when not used carefully. Clojure addresses this problem by using persistent data structures: structures that are not modified in place and instead create a new copy containing the modifications, sharing the common data with the original. Then only the reference to the data is changed, ensuring that all threads see a consistent view of the data structure at any time.

An important feature available to Clojure programmers is built-in support for software transactional memory (STM). STM provides the ability to perform operations on multiple references to objects in memory without explicit locking, while maintaining the atomicity, consistency and isolation properties known from relational database systems [5, p. 234].
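The idea of structural sharing can be illustrated with a minimal sketch in Java. The class PList below is a hypothetical immutable linked list written for this illustration; it is not one of Clojure's own data structures, which are considerably more elaborate. Adding an element creates a new list whose tail is the unchanged original, and an AtomicReference stands in for the single mutable reference that is swapped to the new version.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical immutable singly linked list illustrating structural sharing. */
final class PList<T> {
    final T head;        // first element (null in the empty list)
    final PList<T> tail; // remaining elements, shared between versions
    final int size;

    private static final PList<?> EMPTY = new PList<>(null, null, 0);

    private PList(T head, PList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    @SuppressWarnings("unchecked")
    static <T> PList<T> empty() {
        return (PList<T>) EMPTY;
    }

    /** Returns a new list with value in front; this list is left untouched. */
    PList<T> cons(T value) {
        return new PList<>(value, this, size + 1);
    }

    public static void main(String[] args) {
        PList<String> base = PList.<String>empty().cons("b").cons("a");

        // Only the reference changes; readers holding the old list still see it intact.
        AtomicReference<PList<String>> ref = new AtomicReference<>(base);
        PList<String> updated = ref.updateAndGet(list -> list.cons("x"));

        System.out.println(base.size);            // size of the original version
        System.out.println(updated.size);         // size of the new version
        System.out.println(updated.tail == base); // the new version shares the old one
    }
}
```

Because no node is ever modified after construction, both versions can be read from any thread without locking.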

2.5 Multiple dispatch

Methods in most object-oriented languages, such as Java, make it possible to have multiple implementations of the same method in different classes. Which method will actually be called is then determined at runtime based on the type of the object the method is called on (internally, the object is passed as the first argument). This feature is called single dispatch, because the method lookup uses a single argument. Multiple dispatch can use the runtime type of more than one argument. It can also be implemented in single dispatch languages, but might require specialized hand-coding [10]. Clojure provides built-in support for multiple dispatch via multimethods, which allow the programmer to specify a custom dispatch function that can select the method based not only on the types, but also on the values of the arguments.
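The kind of hand-coding needed in a single dispatch language can be sketched in Java as follows. The class MultiMethodSketch and its methods are hypothetical names chosen for this illustration and do not reflect how Clojure implements multimethods internally. A dispatch function computes a key from both arguments, and the key selects the implementation from a table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical two-argument multimethod emulation; not Clojure's implementation. */
final class MultiMethodSketch {
    // Dispatch function: derives a lookup key from both arguments.
    private final BiFunction<Object, Object, String> dispatchFn;
    private final Map<String, BiFunction<Object, Object, String>> methods = new HashMap<>();

    MultiMethodSketch(BiFunction<Object, Object, String> dispatchFn) {
        this.dispatchFn = dispatchFn;
    }

    void addMethod(String key, BiFunction<Object, Object, String> impl) {
        methods.put(key, impl);
    }

    String invoke(Object a, Object b) {
        String key = dispatchFn.apply(a, b);
        BiFunction<Object, Object, String> impl = methods.get(key);
        if (impl == null) {
            throw new IllegalArgumentException("No method for dispatch value " + key);
        }
        return impl.apply(a, b);
    }

    /** Builds a sample multimethod that dispatches on the runtime classes of both arguments. */
    static MultiMethodSketch describePair() {
        MultiMethodSketch m = new MultiMethodSketch(
            (a, b) -> a.getClass().getSimpleName() + "," + b.getClass().getSimpleName());
        m.addMethod("Integer,String", (a, b) -> "number then text");
        m.addMethod("String,Integer", (a, b) -> "text then number");
        return m;
    }

    public static void main(String[] args) {
        MultiMethodSketch m = describePair();
        System.out.println(m.invoke(1, "one"));
        System.out.println(m.invoke("one", 1));
    }
}
```

Since the dispatch function is ordinary code, it could just as well inspect argument values instead of classes, which corresponds to the value-based dispatch mentioned above.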

2.6 Ports to other platforms

Clojure’s main target is the JVM, but there are several ports to other platforms. The ports share the same syntax, semantics and standard library, but the Java classes used by the main JVM implementation are replaced with their equivalents on the underlying platform, so differences in behavior may arise.

ClojureCLR² runs on top of the Microsoft Common Language Runtime (CLR). It aims to stay close to the JVM implementation while using libraries from the Microsoft .NET Framework and providing interoperation with CLR-based languages, such as C# or Visual Basic.

ClojureScript³ is a port that compiles to JavaScript, which can then be run within a web browser. For web services it is an advantage to be able to use the same language on both the server side and the client side.

2. https://github.com/clojure/clojure-clr
3. https://github.com/clojure/clojurescript

3 Java virtual machine

The Java Virtual Machine is an abstract computing machine that simulates a real computer to provide a portable, hardware- and OS-independent environment for hosting programming languages. The JVM has an official specification provided by Oracle (formerly by Sun), and multiple different implementations exist. The specification describes the class file format used to contain executable code, the instruction set, and virtual machine startup and loading, but it doesn’t specify how to implement them in order not to constrain the implementations. The term Java Virtual Machine is used to refer either to the specification or to a particular implementation [11]. The JVM is designed primarily to support the Java programming language, but it can host any programming language that can be expressed using the JVM’s instruction set and wrapped into class files.

3.1 JVM class files

The class file is a format describing how executable code targeting the JVM is represented in binary form. Its structure is oriented towards object-oriented languages. This requires a language that targets the JVM to map its concepts onto the object-oriented concepts that exist within the JVM. Programs and libraries for the JVM consist of one or more class files that are usually stored as actual files on disk. Class files may also be represented only in memory, which is a common case for code that is dynamically generated at runtime.

In Java, a single source file generates one or more class files (more than one if the top-level class contains additional nested classes). In Clojure, one source file represents a whole module and thus typically generates tens to hundreds of class files.

The class file format is defined as a structure containing multiple items that are either numeric values, structures or arrays. Arrays have variable length and are preceded by an item with a numerical value specifying the length. In the following listings, items that have no effect on the semantics and are used only for parsing the class file, such as array length items, are omitted.


This section is based on the chapter “The class File Format” of The Java Virtual Machine Specification [11].

3.1.1 Class metadata

A class file contains multiple attributes that describe the class or the class file itself.

Identification number: The file is identified as a class file by the number 0xCAFEBABE (hexadecimal value).
Major and minor version: Version of the class file format, used to prevent JVMs from loading class files they don’t support.
Access flags: A number describing properties of the class or interface. Individual flags can be combined using the bitwise or operation to get the resulting value of the access_flags field. Possible flags are:
    ACC_PUBLIC: Class declared with the public access modifier. It can be accessed outside of its declaring package.
    ACC_FINAL: Inheriting from this class is not allowed.
    ACC_SUPER: Whether to treat superclass’ method calls specially.
    ACC_INTERFACE: Is an interface, not a class.
    ACC_ABSTRACT: Cannot be directly instantiated.
    ACC_SYNTHETIC: The class was not present directly in the source code, it was generated.
    ACC_ANNOTATION: Can be used as an annotation type.
    ACC_ENUM: Can be used as an enumeration type.
This class: Reference to the constant pool entry naming the current class.
Superclass: Zero or a reference to the constant pool entry naming the parent class this class inherits from.
Attributes: Array of additional optional metadata, such as source file name or runtime annotations.
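As an illustration, the first items of a class file can be read with a few lines of Java. The following is a minimal sketch, not part of the decompiler: the class ClassHeaderSketch, its methods and the hand-written SAMPLE_HEADER bytes are assumptions made for this example. The flag bit values are those listed in the class file format chapter of the specification.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reads the start of a class file and decodes class access flags. */
final class ClassHeaderSketch {
    static final int MAGIC = 0xCAFEBABE;

    // First eight bytes of a class file with minor version 0 and major version 52 (Java 8).
    static final byte[] SAMPLE_HEADER = {
        (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34
    };

    private static final int[] FLAG_BITS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000
    };
    private static final String[] FLAG_NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"
    };

    /** Checks the identification number and returns the version as "major.minor". */
    static String readVersion(byte[] classFile) {
        ByteBuffer buf = ByteBuffer.wrap(classFile); // big-endian by default, like class files
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("Not a class file");
        }
        int minor = buf.getShort() & 0xFFFF; // minor version is stored before major version
        int major = buf.getShort() & 0xFFFF;
        return major + "." + minor;
    }

    /** Splits an access_flags value into the names of the individual flags. */
    static List<String> decodeClassFlags(int accessFlags) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < FLAG_BITS.length; i++) {
            if ((accessFlags & FLAG_BITS[i]) != 0) {
                names.add(FLAG_NAMES[i]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(readVersion(SAMPLE_HEADER));
        System.out.println(decodeClassFlags(0x0021)); // flags of a typical public class
    }
}
```

Reading the access flags of a real class file would additionally require skipping the constant pool, which precedes them in the file.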

3.1.2 Constant pool

The constant pool is a table of constants that are referenced from within methods and other parts of the class file. Its purpose is to reduce duplication of constants that are used multiple times and to decouple the runtime data layout of the class from operand addressing. Constant pool entries are structures with different types and a variable number of elements. They can be divided into basic categories:

Constants of basic types: Integer, Long, Float, Double, String and Utf-8. Utf-8 is a raw container for string data encoded in modified Utf-8 encoding. A String constant pool entry contains only an index to a Utf-8 entry and doesn’t contain any data itself. At runtime it will become a java.lang.String object initialized from the referenced Utf-8 entry. Numeric constants are internally stored in big-endian order.
NameAndType: Points to two Utf-8 entries, one representing a method or field name and one containing method or field type information.
References to classes or class members: The Class entry type contains an index pointing to a Utf-8 entry containing the name of a class. The Fieldref, Methodref and InterfaceMethodref types consist of two indices: one pointing to a Class entry that names the target class and one pointing to a NameAndType entry that names the target method or field. Bytecode instructions that invoke methods or access fields don’t contain the target name in their immediate operand. Instead they contain an index of a constant pool entry of one of the types in this category.
References to support dynamic invocation: The MethodHandle, MethodType and InvokeDynamic entry types are used to support invocations of methods using types that are not known at compile time via the invokedynamic instruction. See 3.4.9.
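The chain of indirections from a Methodref entry down to its Utf-8 strings can be sketched with a simplified in-memory model. The classes and the sample pool below are hypothetical and only mirror the entry types described above; a real parser would also have to handle tags, the remaining entry types and the binary encoding. Index 0 is left unused because constant pool indices start at 1.

```java
/** Hypothetical simplified model of constant pool entries and Methodref resolution. */
final class ConstantPoolSketch {
    /** One entry: a tag plus either raw text (Utf8) or up to two indices to other entries. */
    static final class Entry {
        final String tag;
        final String text;
        final int first;
        final int second;

        private Entry(String tag, String text, int first, int second) {
            this.tag = tag;
            this.text = text;
            this.first = first;
            this.second = second;
        }

        static Entry utf8(String text) { return new Entry("Utf8", text, 0, 0); }
        static Entry classRef(int nameIndex) { return new Entry("Class", null, nameIndex, 0); }
        static Entry nameAndType(int nameIndex, int descriptorIndex) {
            return new Entry("NameAndType", null, nameIndex, descriptorIndex);
        }
        static Entry methodRef(int classIndex, int nameAndTypeIndex) {
            return new Entry("Methodref", null, classIndex, nameAndTypeIndex);
        }
    }

    /** Follows Methodref -> Class -> Utf8 and Methodref -> NameAndType -> Utf8, Utf8. */
    static String resolveMethodRef(Entry[] pool, int index) {
        Entry ref = pool[index];
        Entry cls = pool[ref.first];
        Entry nameAndType = pool[ref.second];
        return pool[cls.first].text + "."
            + pool[nameAndType.first].text + ":"
            + pool[nameAndType.second].text;
    }

    /** A hand-written pool describing a reference to java.lang.Math.abs(int). */
    static Entry[] samplePool() {
        return new Entry[] {
            null,                          // index 0 is not used
            Entry.methodRef(2, 4),         // #1
            Entry.classRef(3),             // #2
            Entry.utf8("java/lang/Math"),  // #3
            Entry.nameAndType(5, 6),       // #4
            Entry.utf8("abs"),             // #5
            Entry.utf8("(I)I"),            // #6
        };
    }

    public static void main(String[] args) {
        System.out.println(resolveMethodRef(samplePool(), 1));
    }
}
```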

3.1.3 Fields

Fields are variables that are part of the class or of the class’ instances. In a class file their declarations are stored as an array. The class file only contains fields that were declared directly in the given class; inherited fields are not included.


3.1.4 Methods

Methods are the basic executable units on the JVM. In the class file, they are stored as an array of method_info structures. A method_info structure consists of the following items:

Access flags: Number describing the method’s properties.
Name index: Index of the constant pool entry containing the method’s name.
Descriptor index: Index of the constant pool entry containing the method’s type information.
Attributes: Array of attribute_info structures representing additional data and metadata.

The methods array contains all methods declared by the given class, including auxiliary methods generated by the compiler, such as the class initializer. Inherited methods are not included.

3.1.5 Code attribute

For methods that are neither abstract (don’t have an implementation) nor native (have an implementation in machine code), the actual bytecode of the method body is stored as the Code attribute. It contains the following items:
• Maximum depth of the operand stack during execution of the code
• Maximum number of local variables
• Actual bytecode instructions of the method in an array
• Exception table, where entries contain the type of the caught exception, the index of the corresponding exception handler and the range of code indices covered by the handler
• Array of additional metadata of the code, such as a table of local variables or a table of line numbers

3.2 Class loading

A class loader is an object that is responsible for obtaining the class file and loading it into the running JVM. Applications can supply their own class loader to load classes from different locations. The JVM contains a so-called bootstrap class loader that is responsible for loading core classes and other class loaders. The class loading process consists of the following steps [11, ch. 5]:
• Obtaining the binary representation of the class (class file)
• Constructing the implementation-specific runtime representation of the class
• Verification of the binary representation
• Initialization of class fields
• Resolution of symbolic references to classes, fields and methods to their concrete values
• Executing the class initializer, a method that can execute arbitrary code provided by the programmer to initialize the class (internally named <clinit>)
• Linking native methods (methods implemented in the machine’s native code outside of the JVM)

3.3 Code execution environment

The JVM specification defines an abstract inner architecture while leaving details up to individual implementations. The architecture is defined in terms of subsystems and memory areas. The JVM is a stack-based virtual machine. This means that its instructions operate on items on a stack, as opposed to register-based virtual machines, where instructions operate primarily on registers.

Each thread has its own stack, composed of stack frames. A stack frame contains information about the current state of a method invocation. It is created and pushed onto the stack when a method is invoked, and after the method returns, the stack frame is popped and discarded. Native methods (implemented outside the JVM) are allocated separate implementation-dependent stacks.

A stack frame contains an array of local variables and an operand stack. Method arguments are copied to the local variable array before the method body starts executing. Most instructions operate on the operand stack and don’t access method arguments and local variables directly. Instead, local variables first need to be pushed onto the operand stack to be used.

There are other memory areas besides stacks: the method area and the heap. The method area contains per-class structures, such as method bytecode or the runtime representation of the constant pool. The heap is the data area where objects are allocated, and it is subject to automatic reclaiming of space occupied by unreachable objects (garbage collection). How the garbage collection is performed is implementation-dependent [11, ch. 2.5].
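The interplay between the local variable array and the operand stack can be sketched with a toy interpreter. The opcodes below are made up for this illustration and do not correspond to real JVM opcode values; the class StackMachineSketch is likewise hypothetical. The sample program computes (a + b) * 2 for two arguments.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stack-based interpreter; opcodes are invented for this sketch, not real JVM opcodes. */
final class StackMachineSketch {
    static final int LOAD = 0;   // push locals[operand] onto the operand stack
    static final int PUSH = 1;   // push the immediate operand
    static final int ADD = 2;    // pop two values, push their sum
    static final int MUL = 3;    // pop two values, push their product
    static final int RETURN = 4; // pop the result and leave the frame

    // LOAD 0, LOAD 1, ADD, PUSH 2, MUL, RETURN
    static final int[] SAMPLE = {LOAD, 0, LOAD, 1, ADD, PUSH, 2, MUL, RETURN};

    /** Executes code in a single frame; args play the role of method arguments. */
    static int run(int[] code, int... args) {
        int[] locals = args.clone();             // arguments are copied into local variables
        Deque<Integer> stack = new ArrayDeque<>(); // operand stack of this frame
        int pc = 0;
        while (true) {
            int op = code[pc++];
            switch (op) {
                case LOAD:
                    stack.push(locals[code[pc++]]);
                    break;
                case PUSH:
                    stack.push(code[pc++]);
                    break;
                case ADD:
                    stack.push(stack.pop() + stack.pop());
                    break;
                case MUL:
                    stack.push(stack.pop() * stack.pop());
                    break;
                case RETURN:
                    return stack.pop();
                default:
                    throw new IllegalStateException("Unknown opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run(SAMPLE, 3, 4));
    }
}
```

Note that ADD and MUL never name their inputs; they always take them from the operand stack, which is why values from local variables have to be loaded first.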

3.3.1 Data types

The JVM distinguishes between two basic kinds of types: references and primitive types. References are internally numbers that point to another memory location containing an object or an array. Primitive types hold the value directly. The JVM supports the following primitive types: integral types, floating-point types, the boolean type and the returnAddress type (used internally, usually not exposed in high-level languages) [11, ch. 2.2].

3.4 Bytecode instructions

The JVM instruction set contains a wide variety of instructions that can be divided into basic groups. Each instruction has an associated operation code, can take a fixed number of immediate operands and can consume and create a variable number of stack items. This section is based on the chapter “The Java Virtual Machine Instruction Set” of The Java Virtual Machine Specification [11, ch. 6].

Type              Prefix
object reference  a
int               i
long              l
float             f
double            d

Table 3.2: Table of type prefixes of instructions


Type   Prefix
char   c
byte   b
short  s

Table 3.4: Extension of table 3.2

3.4.1 Load and store instructions

Load instructions push values of method arguments or local variables onto the stack. There are multiple instructions for different types. For each type there is a generic variant with the index as an immediate argument and 4 variants that have an implicit index encoded in the opcode. Indexing starts from 0, and method arguments are placed before local variables. The instruction names have the format Tload for generic or Tload_N for implicit instructions, where T is one of the type prefixes in table 3.2 and N is a number from 0 to 3.

Store instructions pop an item from the stack and store it into the local variable at the given index. Their naming and behavior with respect to indices are analogous to load instructions; they are named Tstore or Tstore_N.

Examples of instructions in this group: iload_0, astore, fstore_3

3.4.2 Stack manipulation instructions

pop: Pops a stack item, discarding the value
pop2: Pops two stack items, discarding the values
swap: Pops two stack items and pushes them back in reversed order
dup: Duplicates the item on the top of the stack
dup2: Duplicates two items on the top of the stack in their original order
dup_x1: Duplicates the item on the top of the stack and moves it below the original top
dup_x2: Duplicates the item on the top of the stack and moves it two items below the original top
dup2_x1: Like dup_x1, but duplicates two values
dup2_x2: Like dup_x2, but duplicates two values


3.4.3 Return instructions

The return instruction discards the current stack frame and returns control flow to the caller of the method. It is used in methods that don’t return a value. For methods that do, Treturn is used instead, where T is one of the type prefixes in table 3.2. It first pushes the top of the current operand stack onto the operand stack of the caller and then proceeds the same way as return.
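How load, store and return instructions appear in practice can be seen on small Java methods. The comments in the following sketch list the instructions a Java compiler typically emits for each statement; the exact output may differ between compilers and versions, and the class name LoadStoreSketch is an assumption of this example. The actual bytecode of a compiled class can be inspected with the javap tool using the -c option.

```java
/** Small methods annotated with the bytecode a Java compiler typically emits for them. */
final class LoadStoreSketch {
    private final int value;

    LoadStoreSketch(int value) {
        this.value = value;
    }

    // Static method: arguments a and b occupy local variable slots 0 and 1.
    //   iload_0, iload_1, iadd, ireturn
    static int add(int a, int b) {
        return a + b;
    }

    // The local variable sum is placed after the arguments, in slot 2.
    static int addAndDouble(int a, int b) {
        int sum = a + b; // iload_0, iload_1, iadd, istore_2
        return sum * 2;  // iload_2, iconst_2, imul, ireturn
    }

    // Instance method: slot 0 holds the reference to the object itself.
    //   aload_0, getfield value, ireturn
    int getValue() {
        return value;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));
        System.out.println(addAndDouble(2, 3));
        System.out.println(new LoadStoreSketch(7).getValue());
    }
}
```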

3.4.4 Constant push instructions

aconst_null: Pushes a null reference onto the stack
bipush: Pushes its immediate operand (single byte) onto the stack as an int
sipush: Pushes its immediate operand (two bytes) onto the stack as an int
dconst_N: Pushes N as a double constant onto the stack, where N can be 0 or 1
fconst_N: Pushes N as a float constant onto the stack, where N can be 0, 1 or 2
iconst_N: Pushes N as an int constant onto the stack, where N can be from 0 to 5 inclusive
iconst_m1: Pushes -1 as an int constant onto the stack
lconst_N: Pushes N as a long constant onto the stack, where N can be 0 or 1
ldc: Pushes an item from the constant pool. The index is given as an immediate operand (single byte). Works with the following entry types: int, float, String, Class, MethodType and MethodHandle
ldc_w: Like ldc, but with a two-byte immediate operand
ldc2_w: Like ldc_w, but works with long and double types instead

3.4.5 Primitive conversion instructions

This is a group of instructions that convert from one primitive type to another, truncating or extending the value when needed. They pop the value from the stack and push it back after the conversion has been performed. They are named X2Y, where X and Y are type prefixes from table 3.2, excluding a. An example of such an instruction is d2f, which converts a double to the float type. Additionally, there are three instructions i2b, i2c and i2s that convert int values to byte, char and short respectively. Since those types are internally represented as int, these instructions perform only the truncation and push the values back as int (sign-extended for byte and short and zero-extended for char).
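The effect of truncation followed by sign or zero extension can be observed directly with Java casts, which a Java compiler typically translates to the corresponding conversion instructions. The class and method names in this sketch are chosen for illustration only.

```java
/** Java casts that correspond to JVM primitive conversion instructions. */
final class ConversionSketch {
    // i2b: keeps the low 8 bits and sign-extends them back to int.
    static int toByte(int x) {
        return (byte) x;
    }

    // i2c: keeps the low 16 bits and zero-extends them back to int.
    static int toChar(int x) {
        return (char) x;
    }

    // i2s: keeps the low 16 bits and sign-extends them back to int.
    static int toShort(int x) {
        return (short) x;
    }

    // d2f: rounds a double to the nearest representable float.
    static float toFloat(double x) {
        return (float) x;
    }

    public static void main(String[] args) {
        System.out.println(toByte(200));   // 200 does not fit into a signed byte
        System.out.println(toChar(-1));    // all 16 low bits set, no sign
        System.out.println(toShort(70000));
        System.out.println((double) toFloat(0.1) == 0.1); // precision is lost
    }
}
```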

3.4.6 Array manipulation instructions

anewarray: Creates a new array of references, taking the constant pool index of the class of the reference as an immediate argument. Consumes an integer value from the stack that is used as the array size and pushes back the reference to the array.
newarray: Like anewarray, but creates an array of a primitive type. The immediate operand is a numeric code identifying the primitive type.
multianewarray: Creates a multidimensional array with the number of dimensions as an immediate argument and the dimensions themselves on the stack.
arraylength: Consumes an array reference from the stack and pushes back the length of the array.
Xaload: Loads a value of the type given by X (type prefix from tables 3.2 and 3.4) from an array. Consumes the array reference and the index from the stack and pushes back the retrieved value.
Xastore: Stores a value of the type given by X (type prefix from tables 3.2 and 3.4) into an array. Consumes the array reference, the index and the value to be stored from the stack.

3.4.7 Arithmetic instructions

All instructions in this subsection (except iinc) perform their operation by popping stack item(s) and pushing back the result.

Basic arithmetic: Xadd, Xsub, Xmul, Xdiv, Xrem and Xneg perform addition, subtraction, multiplication, division, remainder and negation respectively. (X is a type prefix from table 3.2.)
Bitwise operations: Xand, Xor, Xxor, Xshr, Xshl and Xushr perform bitwise and, or, xor, right shift, left shift and unsigned right shift, respectively. X is either i for int or l for long.
Increment: iinc increments the local variable at the index given by the first immediate operand by the constant contained in the second immediate operand.
Comparison operations: dcmpg and dcmpl compare two double values from the stack, pushing back the value 1 if the first value is greater than the second, 0 if they are equal and -1 otherwise. The two instructions differ in their treatment of the NaN (Not a Number) value. fcmpg and fcmpl work the same as dcmpg and dcmpl respectively, but compare float values instead. lcmp compares values of the long type.
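The difference between dcmpg and dcmpl, and between the two right shifts, can be made concrete with a short emulation in Java. The methods below are written for this illustration and only mimic the results of the instructions: when either operand is NaN, dcmpg yields 1 and dcmpl yields -1, which lets a compiler pick the variant that makes any comparison involving NaN evaluate to false.

```java
/** Emulation of the results pushed by dcmpg, dcmpl, ishr and iushr. */
final class CompareSketch {
    /** Result of dcmpg: NaN in either operand yields 1. */
    static int dcmpg(double a, double b) {
        if (Double.isNaN(a) || Double.isNaN(b)) {
            return 1;
        }
        return a > b ? 1 : (a == b ? 0 : -1);
    }

    /** Result of dcmpl: NaN in either operand yields -1. */
    static int dcmpl(double a, double b) {
        if (Double.isNaN(a) || Double.isNaN(b)) {
            return -1;
        }
        return a > b ? 1 : (a == b ? 0 : -1);
    }

    /** ishr: arithmetic right shift, the sign bit is replicated. */
    static int ishr(int value, int count) {
        return value >> count;
    }

    /** iushr: unsigned right shift, zeros are shifted in from the left. */
    static int iushr(int value, int count) {
        return value >>> count;
    }

    public static void main(String[] args) {
        System.out.println(dcmpg(1.0, Double.NaN));
        System.out.println(dcmpl(1.0, Double.NaN));
        System.out.println(ishr(-8, 1));
        System.out.println(iushr(-8, 1));
    }
}
```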

3.4.8 Field access

The name of the field being accessed is stored in a constant pool entry referenced by the instruction’s immediate operand.

getfield: Retrieves a field value from the object referenced by the top stack item.
putfield: Stores a value from the top of the stack to a field of the object whose reference lies directly below the value on the stack.
getstatic: Retrieves a class field value from the class referenced by its immediate operand.
putstatic: Stores a value to a class field of the class referenced by its immediate operand.

3.4.9 Method invocation instructions

Method invocation instructions have an immediate argument that points to a constant pool entry identifying the method to be called. They consume a variable number (possibly zero) of items from the stack that are used as method arguments.

invokestatic: Calls the given static method of a class.
invokevirtual: Calls a method of an object using the object’s runtime type to determine which method implementation to call (virtual dispatch). The object reference is the first argument (consumed last from the stack). The method needs to be declared in the class of the object or its superclasses. For calling methods through an interface, invokeinterface is used instead.
invokespecial: Calls a method of an object without using virtual dispatch. It is used to call constructors and superclass methods.
invokeinterface: Like invokevirtual, but calls a method of an object that was declared in an interface. It has an additional immediate argument, the count of method arguments.
invokedynamic: New instruction available since JDK version 7 [12], aimed at improving support for hosting dynamic languages on the JVM. The new mechanism is already used by some of the JVM-hosted languages, such as Groovy [13]. The instruction’s immediate operand refers to a method that will perform the dispatch to a particular implementation, which is then cached per call site. Technical details are out of the scope of this thesis, as invokedynamic is not presently used by the Clojure compiler¹.
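The first four invocation instructions can be observed in ordinary Java code. The comments in the following sketch name the instruction a Java compiler typically emits for each call; the method demo and the class InvokeSketch are assumptions of this example, and the exact bytecode may vary between compiler versions.

```java
import java.util.ArrayList;
import java.util.List;

/** Java calls annotated with the invocation instruction typically emitted for them. */
final class InvokeSketch {
    static int demo() {
        // invokestatic java/lang/Math.abs:(I)I
        int a = Math.abs(-5);

        // new, dup, invokespecial java/lang/StringBuilder.<init>:()V
        StringBuilder sb = new StringBuilder();

        // invokevirtual java/lang/StringBuilder.append:(I)Ljava/lang/StringBuilder;
        sb.append(a);

        // The variable has the interface type List, so calls on it go through invokeinterface.
        List<String> list = new ArrayList<>();

        // invokevirtual StringBuilder.toString, then invokeinterface java/util/List.add
        list.add(sb.toString());

        // invokeinterface java/util/List.size:()I
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Which of invokevirtual and invokeinterface is used depends on the static type of the receiver known to the compiler, not on the runtime class of the object.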

3.4.10 Branching instructions

goto: Performs an unconditional jump using the offset in its immediate operand (2 bytes)
goto_w: Like goto, but with a 4-byte offset
if_acmpeq: Performs a jump using an offset when the two top stack items are references pointing to the same object
if_acmpne: Negated version of if_acmpeq
ifnull: Performs a jump using an offset when the top stack item is null
ifnonnull: Performs a jump using an offset when the top stack item is not null
ifOP: Performs a jump using an offset when the top stack item is successfully compared against zero using OP, which is one of eq (=), ne (≠), ge (≥), gt (>), le (≤), lt (<)

1. Reasons why it is not implemented were expressed by Rich Hickey in a Google Groups discussion available at: https://groups.google.com/d/msg/clojure-dev/PuV7XTps9vo/FvFv0sQsQcgJ

There are two branching instructions that can perform jumps to multiple locations based on a table: tableswitch and lookupswitch.

The tableswitch instruction takes three immediate operands (low value, high value and default jump offset) followed by a table of integers that are jump offsets for the value given by their index in the table (the table length is high − low + 1). It consumes an integer from the stack that is then compared to the low and high values. If it is within the range specified by the two values, it is used as an index into the offset table and the value at the given index is used to perform the jump. Otherwise, the default jump offset is used.

The lookupswitch instruction takes two immediate operands: the number of branches (excluding the default) and the offset of the default branch. It is followed by a table of match-offset pairs for each branch. When executed, it consumes an integer value from the top of the stack and compares it against the match values in the table. If a match is found, the offset in the given match-offset pair is used to perform the jump. Otherwise, the default offset is used.

When executed, lookupswitch needs to compare the input value against different match values. In contrast, tableswitch performs the dispatch in constant time.
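The two lookup strategies can be sketched as plain Java methods that return the selected jump offset. This is an emulation written for illustration, not JVM code: the class SwitchSketch and its method signatures are assumptions, and the binary search in lookupSwitch is only one possible strategy, made feasible by the requirement that the match values are stored in sorted order.

```java
import java.util.Arrays;

/** Emulation of how tableswitch and lookupswitch select a jump offset. */
final class SwitchSketch {
    /** tableswitch: a range check followed by direct indexing, constant time. */
    static int tableSwitch(int key, int low, int high, int[] offsets, int defaultOffset) {
        if (key < low || key > high) {
            return defaultOffset;
        }
        return offsets[key - low]; // offsets.length is high - low + 1
    }

    /** lookupswitch: searches the sorted match values for the key. */
    static int lookupSwitch(int key, int[] matches, int[] offsets, int defaultOffset) {
        int i = Arrays.binarySearch(matches, key); // requires matches to be sorted
        return i >= 0 ? offsets[i] : defaultOffset;
    }

    public static void main(String[] args) {
        int[] offsets = {10, 20, 30};
        // Dense keys 1..3 suit tableswitch.
        System.out.println(tableSwitch(2, 1, 3, offsets, 99));
        // Sparse keys would waste table space, so lookupswitch fits better.
        System.out.println(lookupSwitch(1000, new int[] {1, 1000, 50000}, offsets, 99));
    }
}
```

A table for the sparse keys 1, 1000 and 50000 would need 50000 entries, which illustrates why a compiler may prefer lookupswitch when the case values are far apart.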

3.4.11 Other instructions

athrow: Throws an exception using the object reference from the top of the stack
checkcast: Compares the type of the top stack item against the type referenced by its immediate operand. Throws ClassCastException if the types don’t match. Doesn’t pop the stack item otherwise
instanceof: Pushes the value 1 if the object referenced by the top stack item is an instance of the type referenced by the instruction’s immediate operand. Pushes 0 otherwise


4 Clojure compilation to JVM bytecode

4.1 Modes of Clojure compilation

The source code of Clojure programs is represented as a set of files in a directory structure corresponding to namespaces (a means of separating code, equivalent to Java packages). In order to be run or distributed, they have to be compiled into the bytecode of the hosting platform, in this case JVM bytecode. Clojure supports two different modes of compilation.

4.1.1 On-the-fly compilation

On-the-fly compilation means that Clojure source files are loaded and compiled into JVM bytecode in memory and are directly run without saving the bytecode to disk. This has the advantage of faster development and easier use, but causes performance overhead on startup and requires the source code to be distributed, which is not always desirable [14].

On-the-fly compilation is implemented by reading the input file and evaluating the expressions found therein. Function definitions are evaluated by compiling them into classes and then immediately loading the compiled classes into the running JVM using a class loader. Top-level expressions that define global variables are evaluated directly by performing the variable definition. Other top-level expressions are wrapped into functions, which are then evaluated the same way as regular function definitions and executed immediately [15, line 6787].

4.1.2 Ahead-of-time compilation

This mode of compilation generates bytecode and saves it as Java class files. Such programs can then be launched directly as regular Java programs, be distributed or be used as a library from other languages running on the JVM.

For each source file, the compiler emits a class named after the source file (without extension) with __init appended, which contains initialization code for constants and a method load. Unlike in on-the-fly compilation, all top-level expressions that appear outside of function definitions have to be compiled; their bytecode is placed into the load method. The load method is called by the class initializer when the init class is loaded into the JVM. Separate class files are then generated for function definitions. Additionally, when using ahead-of-time compilation, the programmer may explicitly request generation of custom classes or interfaces using the gen-class or gen-interface facilities, which can then be used by other JVM languages [14].

4.2 Representation of scalar types

4.2.1 Numbers

Clojure’s basic number types are implemented using Java object numeric types. Integer literals in source are by default compiled as java.lang.Long, floating point literals as java.lang.Double. Clojure additionally provides literals for arbitrarily long integers via the type clojure.lang.BigInt and arbitrary precision decimal numbers using java.math.BigDecimal. Other integer types, such as Short, may appear as results of Java method calls that return them. The compiler automatically emits conversion calls between different numeric types as needed [5].

4.2.2 Booleans

Boolean literals true and false are translated to the TRUE and FALSE constants declared in java.lang.Boolean. A conditional expression, such as if, accepts arbitrary values as conditions, not only booleans as in Java. The only values that are considered false are the FALSE constant and the null reference; everything else is treated as true regardless of its type.

4.2.3 Boxing and unboxing

On the JVM, primitive types, such as int, aren’t object types and cannot be treated as such since they aren’t subclasses of Object. When primitive values need to appear in a context that requires an object, usually as a parameter of a function call, they have to be wrapped into instances of corresponding classes, such as Integer. This process of wrapping and unwrapping primitive values is known as boxing and unboxing respectively. The Clojure compiler inserts boxing and unboxing calls automatically. When integer or floating point numeric literals are used and unboxing would be necessary, the compiler directly emits a constant of primitive type.
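Boxing and unboxing can be shown with a short Java sketch. The class and method names are hypothetical; the sketch only illustrates the kind of conversion calls involved and does not claim to reproduce the exact calls emitted by the Clojure compiler.

```java
class BoxingSketch {
    // Boxing: wraps a primitive long into a java.lang.Long object.
    static Object box(long value) {
        return Long.valueOf(value);
    }

    // Unboxing: extracts the primitive value from a numeric wrapper object.
    static long unbox(Object boxed) {
        return ((Number) boxed).longValue();
    }
}
```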

4.2.4 Strings and regular expressions

Clojure uses java.lang.String as its string implementation. Clojure additionally provides literals for regular expressions that are compiled into state machines at compile time. Such literals produce an instance of java.util.regex.Pattern.

4.3 Java interoperation

Clojure provides a variety of tools to interact with the hosting platform. The basic functionality of creating classes and accessing them and their instances is provided by the . (the dot character), new and set! special forms. On top of them, Clojure provides additional forms as reader macros – syntactic shortcuts that are transformed into the special forms at source read-time.

All of the interoperation forms can be compiled in two ways, depending on whether the compiler has enough information about types to determine which exact method or field to use. If the type information is present, it can directly use JVM method invocation or field manipulation instructions, such as invokevirtual. Otherwise it needs to perform the method/field lookup at runtime using reflection. Reflection is the process of inspecting or modifying an object’s structure at runtime, for example calling a method on an object whose type was not known at compile-time.

The type information needed to make a direct call:
• In case of instance method calls or field manipulation, it needs to know the type of the object being acted upon.
• Java methods can be overloaded – there can be more methods with the same name, but different argument types (or a different number of arguments). If there are multiple overloads of the given method, the compiler needs to know the types of the arguments.

The type information can be either inferred (from used literals or interoperation calls) or explicitly provided by the programmer in the form of type hints. Type hints are type names attached to names of function arguments or local bindings (the syntax is ^Type name).

4.3.1 Calling static methods

Example of a call where the type can be resolved at compile-time:

(Math/sqrt 2.0)

The example calls static method sqrt of java.lang.Math class. The sqrt method is not overloaded, therefore there is no ambiguity in which method to call. The compiled bytecode:

ldc2_w #10        Puts constant 2.0 of primitive type double onto stack
invokestatic #30  Calls static method sqrt of java.lang.Math

When the types cannot be inferred, the decision which method to call has to be made at runtime using reflection. Example of such a call:

; let x be a function argument without a type hint
(Math/abs x)

abs has different implementations defined for int, long, float and double. In this case, x can be any object, therefore it cannot be decided at compile-time which implementation to choose. The compiled bytecode:

ldc #14           Puts String "java.lang.Math" onto stack
invokestatic #20  Calls method forName of java.lang.Class to obtain the class object of class Math to perform introspection
ldc #22           Puts String "abs" onto stack
iconst_1          Puts integer constant 1 onto stack that will be used as the size of the array created by the following instruction
anewarray #24     Creates new array of type Object
dup               Duplicates array reference on the stack as aastore will consume the reference
iconst_0          Puts 0 onto stack as the index where the argument will be stored in the array
aload_1           Puts x function argument onto stack
aconst_null
astore_1          Clears x reference
aastore           Stores value of x into the array
invokestatic #30  Invokes method invokeStaticMethod of clojure.lang.Reflector with arguments: class object of java.lang.Math, "abs" and array containing value of x
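The runtime lookup performed by the reflective path can be approximated in plain Java. The following is a simplified sketch under stated assumptions, not the implementation of clojure.lang.Reflector: the class name, the invokeStatic helper and its matching rule (exact match between a boxed argument and a primitive parameter type, no numeric coercion) are hypothetical and cover only the four primitive types that abs is defined for.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class ReflectiveCallSketch {
    // Maps a primitive parameter type to the wrapper class of a boxed argument.
    static Class<?> wrapperOf(Class<?> type) {
        if (type == int.class) return Integer.class;
        if (type == long.class) return Long.class;
        if (type == float.class) return Float.class;
        if (type == double.class) return Double.class;
        return type;
    }

    // Finds a static method by name whose parameter types fit the runtime
    // types of the arguments, then invokes it.
    static Object invokeStatic(String className, String methodName, Object[] args) {
        try {
            Class<?> cls = Class.forName(className);
            for (Method m : cls.getMethods()) {
                Class<?>[] params = m.getParameterTypes();
                if (!m.getName().equals(methodName)
                        || !Modifier.isStatic(m.getModifiers())
                        || params.length != args.length) {
                    continue;
                }
                boolean matches = true;
                for (int i = 0; i < params.length; i++) {
                    if (!wrapperOf(params[i]).isInstance(args[i])) {
                        matches = false;
                        break;
                    }
                }
                if (matches) {
                    return m.invoke(null, args);
                }
            }
            throw new IllegalArgumentException("No matching method: " + methodName);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The overload is chosen only after the runtime type of the argument is inspected, which is the work that a type hint allows the compiler to do ahead of time.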

4.3.2 Creating instances of Java classes

Example where there is no ambiguity in which constructor to call:

(StringBuilder. "foo")

The example creates new instance of java.lang.StringBuilder with "foo" as argument to the constructor. The dot at the end of name denotes a constructor call. The compiled bytecode:

new #14            Creates new uninitialized instance of StringBuilder
dup                Duplicates the stack item as it will be consumed by the constructor call
ldc #16            Puts String "foo" onto stack
checkcast #18      Ensures that "foo" is of type String
invokespecial #21  Calls StringBuilder’s constructor with String "foo" as an argument

When the constructor cannot be determined at compile-time, the generated bytecode follows a similar scheme as when calling a static method using reflection (described in 4.3.1) – getting the class object, accumulating arguments into an array and calling the static method invokeConstructor of clojure.lang.Reflector.

4.3.3 Getting fields of objects

Example:

(.x point) ; gets x field of object named point

The compiled bytecode if the type of point can be determined at compile-time (in this case, let point be function argument of type java.awt.Point):

aload_1        Puts point reference onto stack
aconst_null
astore_1       Clears point reference
checkcast #14  Ensures that point is of type java.awt.Point
getfield #18   Gets x field of point

When the type of point cannot be determined at compile-time, the compiler emits a static call to invokeNoArgInstanceMember of class Reflector with the point object reference and the string "x" as arguments. There’s one additional boolean argument that specifies whether it should call a method with no arguments or get a field, which is set to false. The JVM allows a class to have a method and a field with the same name. Clojure prefers the no-arguments method, should such an ambiguous case arise¹.

4.3.4 Calling methods on object

Example:

(.println System/out "Hello world")

1. Resolving to a field can be forced using (.-x point), but this special form is not documented


The compiled bytecode:

getstatic #18      Gets static field out of java.lang.System
checkcast #20      Ensures that the out stack item is of type java.io.PrintStream
ldc #22            Puts String "Hello world" onto stack
checkcast #24      Ensures that it is a String
invokevirtual #28  Calls method println on out
aconst_null        Puts null reference onto stack, because every call in Clojure needs to return a value, but println doesn’t return one

When the method to be called cannot be determined at compile-time, the compiler emits a static method call to invokeInstanceMethod of Reflector with the object, method name and argument array as arguments. For calls to methods with no arguments, a static method call to invokeNoArgInstanceMember is emitted instead.

4.4 Clojure functions

Clojure functions are implemented as Java classes that are instantiated in the place where they were defined. Function calls are then performed by calling methods on the function object. The mutable state of a Clojure program (functions are also considered to be state) can only be accessed through a reference. Var is the basic reference type that is used to refer to global state in Clojure programs. Calling a function by its name thus requires an additional step of obtaining the function object referenced by the var.

Clojure as a dialect of Lisp employs list syntax for calling functions. If a list (denoted by parentheses) is present in the code, its first element is called as a function with the rest of the list as arguments, unless the first symbol is one of the forms handled specially by the compiler.

4.4.1 IFn

Calling a function is done by calling the function object’s invoke method. The invoke method is declared in the clojure.lang.IFn interface, which the object needs to implement. Clojure supports having multiple arities of a given function, hence the invoke method is overloaded for each arity from 0 to 20. Additionally, there is one variadic (taking any number of arguments greater than its required arity, which is 20 in this case) overload of invoke that is used to implement variadic functions².

IFn also declares the method applyTo that is used when applying a function to arguments wrapped in a collection³. For such a call, the number of arguments is not known at compile-time, therefore this case needs to be treated specially.

IFn extends the Java interfaces Runnable and Callable, thus enabling Clojure functions and other callable objects to be passed to Java methods that expect those, without the need for additional wrapping.

4.4.2 AFn, AFunction and RestFn

clojure.lang.AFn is an abstract class implementing IFn that provides a basic skeleton for a callable object. It delegates the run and call methods from Runnable and Callable to an invoke call and provides an applyTo method implementation that delegates the function call to the invoke method whose number of arguments corresponds to the length of the argument sequence. All the invoke methods have a dummy implementation that throws ArityException to convey that the given arity is not implemented.

clojure.lang.AFunction is a subclass of AFn that is actually used as a base class for classes that are generated for non-variadic Clojure functions. For variadic functions, clojure.lang.RestFn is used instead. For each function that is defined in the code, the compiler generates a new subclass of AFunction or RestFn. This also includes local functions and anonymous functions, for which unique internal names are generated. The newly created class overrides the invoke methods for arities that are specified in the code. Those methods then contain the actual bytecode instructions for the function implementation.
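The skeleton pattern described above can be sketched in plain Java. The following is a heavily reduced, hypothetical analogue (the classes MiniFn and Exclaim are not part of Clojure): it supports only arities 0 to 2, uses IllegalArgumentException in place of ArityException and takes a java.util.List in place of a Clojure sequence.

```java
import java.util.List;

// Reduced analogue of an abstract function base class: every arity has a
// default implementation that reports the arity as unsupported.
abstract class MiniFn implements Runnable {
    public Object invoke() { throw arity(0); }
    public Object invoke(Object a) { throw arity(1); }
    public Object invoke(Object a, Object b) { throw arity(2); }

    // Analogue of applyTo: selects the overload by the number of arguments.
    public Object applyTo(List<?> args) {
        switch (args.size()) {
            case 0: return invoke();
            case 1: return invoke(args.get(0));
            case 2: return invoke(args.get(0), args.get(1));
            default: throw arity(args.size());
        }
    }

    // Runnable is satisfied by delegating to the zero-argument invoke.
    @Override
    public void run() { invoke(); }

    private RuntimeException arity(int n) {
        return new IllegalArgumentException("Wrong number of args (" + n + ")");
    }
}

// A class that a compiler could generate for a one-argument function
// appending "!" to its argument; only the defined arity is overridden.
class Exclaim extends MiniFn {
    @Override
    public Object invoke(Object x) { return x + "!"; }
}
```

Calling an arity that the subclass does not override falls through to the base class and fails at runtime, which mirrors how a Clojure function reports a wrong number of arguments.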

2. Non-variadic functions with arity greater than 20 are not supported
3. (apply function collection-of-arguments)


4.4.3 Closures

Captured references to locals outside of the function definition are stored as fields of the function object, which are populated by the class constructor. At runtime, when a function definition is encountered, a new instance of its corresponding class is created and all references to non-local objects that the function accesses are passed as arguments to the constructor (which references those are was already determined at compile time). The function object can then be passed around, retaining the references in instance fields.
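A minimal Java sketch of this scheme follows. The class AdderFn is a hypothetical stand-in for a class that could be generated for (fn [x] (+ x n)), where n is a local of the enclosing scope; real generated classes extend AFunction and have compiler-generated names.

```java
// Hypothetical function class for (fn [x] (+ x n)) with n captured.
class AdderFn {
    final long n; // captured local, stored as an instance field

    // The constructor receives the values of all captured locals.
    AdderFn(long n) {
        this.n = n;
    }

    Object invoke(Object x) {
        return ((Number) x).longValue() + n;
    }
}

class ClosureSketch {
    // Corresponds to evaluating the function definition at runtime:
    // a new function object is instantiated with the current value of n.
    static AdderFn makeAdder(long n) {
        return new AdderFn(n);
    }
}
```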

4.4.4 Accessing vars and constants

Vars are implemented as objects, therefore they need to be obtained from the runtime environment in order to access the object they reference. All vars that are used within the function body (usually referring to other functions) are looked up using the var static method of clojure.lang.RT in the function’s class initializer and stored as static fields. The referenced value is then obtained in the function body by calling either the getRawRoot or get (depending on the var type) virtual method on the var object using the invokevirtual instruction.

Other object constants that are used in the function body, such as boxed numbers or regular expressions, are also stored as constant fields. They are initialized in the class initializer by construction methods specific to the given type, such as the valueOf static method of java.lang.Long. Strings don’t need to be stored as fields as they are stored directly in the constant pool and retrieved with the ldc instruction.

4.4.5 Type hints on function arguments

Function arguments can be annotated with type hints that provide the compiler with information about the expected type to avoid reflection for interoperation calls (as described in section 4.3). Additionally, when arguments are marked to be of primitive type long or double, the compiler can generate an invokePrim method that has the given primitive argument types. The regular invoke method then only delegates the work to the invokePrim method of the given arity. When such a function is called with actual values of the given primitive types, the compiler can emit a call to invokePrim instead of invoke. This helps to avoid the overhead of boxing and unboxing primitive values.

4.4.6 Inline functions

Clojure functions can be marked as inline. The compiler will then try to insert the function body directly into the place where the function is called. Such optimization is used by functions that are mere wrappers around Java interoperation calls, such as arithmetic functions in clojure.core. It is desirable because regular functions always have return type Object in order to conform to the IFn interface and therefore cannot return primitive types, whereas calls to Java methods can.

4.4.7 Example of compiled form

Example function:

(defn inc-coll [coll]
  (map inc coll))

The inc-coll function takes a collection and returns a collection with all elements incremented. The example produces one class file for the function, named user$inc_coll.class (user is the default namespace name). The compiled class contains two static fields for the two vars being used – map (field name const__0) and inc (field name const__1). The bytecode for the class initializer (<clinit> method):

ldc #12           Puts String "clojure.core" onto stack
ldc #14           Puts String "map" onto stack
invokestatic #20  Calls static method var of clojure.lang.RT with "clojure.core" and "map" as arguments, obtaining the var object for the map function
checkcast #22     Ensures that the value is of type Var
putstatic #24     Stores the var object of map into field const__0
ldc #12           Puts String "clojure.core" onto stack
ldc #26           Puts String "inc" onto stack
invokestatic #20  Calls method var of RT to obtain the var object for the inc function
checkcast #22     Ensures that the value is of type Var
putstatic #28     Stores the var object of inc into field const__1
return

The bytecode for the class constructor (method <init>):

aload_0            Puts reference to current object onto stack
invokespecial #31  Calls superclass (AFunction) constructor
return

The bytecode for the invoke method:

getstatic #24           Gets var object for map from field const__0
invokevirtual #37       Obtains the function object for map from the var by calling its getRawRoot virtual method
checkcast #39           Ensures that it implements IFn
getstatic #28           Gets var object for inc from field const__1
invokevirtual #37       Obtains the function object for inc from the var by calling its getRawRoot virtual method
aload_1                 Puts coll function argument onto stack
aconst_null
astore_1                Clears coll reference in stack frame
invokeinterface #42, 3  Calls invoke method of the map function object with the inc function object and coll as arguments
areturn                 Returns the stack item as the function’s return value

4.4.8 Tail recursion

If a function call is made as the last operation before function return, it is called a tail call. Tail calls can be optimized by reusing the current stack frame for the new call and performing a jump to the function address, instead of performing a full new call, which would require stack frame allocation. Recursion in tail call position is called tail recursion. Because functional languages rely on using recursion instead of loops, it is necessary to perform tail call optimization, otherwise many common algorithms would suffer from exhaustion of stack frames [16].

The common JVM implementations don’t provide support for tail call optimizations, not even for tail recursion in particular [16]. The Clojure compiler solves this by adding the recur special form. A call made with recur is semantically the same as recursively calling the function using its name, but it guarantees tail recursion optimization and fails to compile when the call is not in tail position. It is implemented by overwriting the current function arguments with arguments for the new call and using the goto instruction to jump back to the top of the invoke method.
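The effect of recur can be approximated in Java source by a loop that reassigns the method parameters. The sketch below is a hand-written illustration for a hypothetical function (defn sum-to [n acc] (if (zero? n) acc (recur (dec n) (+ acc n)))); it is not decompiler or compiler output.

```java
class RecurSketch {
    static long sumTo(long n, long acc) {
        while (true) {              // start of the method body, target of the jump
            if (n == 0) {
                return acc;         // branch that does not end with recur
            }
            long newN = n - 1;      // evaluate all new argument values first
            long newAcc = acc + n;
            n = newN;               // then overwrite the current arguments
            acc = newAcc;
        }                           // jump back, corresponds to goto
    }
}
```

Because no new stack frame is allocated per iteration, the depth of the recursion is limited only by the range of the counter, not by the size of the call stack.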

4.5 Local bindings and function arguments

Clojure is a functional language, therefore it doesn’t have local variables as known in imperative languages, but local bindings of names to values that cannot change after being assigned. Example of a local binding:

(let [x "Hello"] ; assigns value "Hello" to x
  (println x))

Internally, local bindings are implemented as local variables, using load/store family of instructions. They can be used to store value of any type and if the initialization expression returns a value of primitive type, it is not immediately boxed, possibly allowing more efficient arithmetic operations within the let body. If more bindings are specified, their initializations are executed in sequence and can refer to bindings defined before them. Even though the bindings are immutable, the compiler can reuse the same variable index for non-overlapping let blocks [15, line 6064]. Function arguments are also immutable and implemented the same way as local bindings, except their values are automatically assigned by the JVM when the function’s invoke method is called.


Indexes of function arguments precede those of local variables and the index 0 is used for the implicit reference to the function object (this in Java), which is inaccessible from within Clojure, but is used internally. Bytecode of the example:

ldc #30                 Pushes String "Hello" onto stack
astore_1                Stores the previous value as local variable with index 1
getstatic #23           Pushes the var object for println onto the stack
invokevirtual #33       Calls method getRawRoot on the var object to obtain the function object for println
checkcast #35           Ensures the value is of type IFn
aload_1                 Pushes the value of local variable at index 1 (x) onto stack
aconst_null             Pushes null reference onto stack
astore_1                Clears local variable with index 1 (x) by storing null reference into it
invokeinterface #38, 2  Calls invoke on the println function object

In the previous example, the compiler emitted aconst_null followed by astore_1 to store a null reference in a variable that is not used again. While this may seem unnecessary, it clears the reference that would otherwise prevent the object from being garbage collected.

4.6 Control structures

4.6.1 Conditional expression

Conditional execution of code is done using the if special form. It is an expression, therefore it always returns the value of the branch that was chosen by the condition. General syntax:

(if condition then-branch else-branch)

then-branch is evaluated when the condition expression returns any value different from false and the null reference. Otherwise else-branch is evaluated. How the expression is compiled depends on the condition. The compiler contains a map of static methods (called intrinsic predicates) that, when applied to primitive arguments, can be expressed directly by JVM comparison instructions [15, line 1713]. The map is defined in class clojure.lang.Intrinsics and contains predicates for numeric comparisons. An example of such a method is gt in class clojure.lang.Numbers, normally expressed in code by the inline function >. Example code:

(if (> 2 1)
  "World is broken"
  "OK")

Compiled bytecode:

0: ldc2_w #27  Puts long value 2 onto stack
3: lconst_1    Puts long value 1 onto stack
4: lcmp        Compares two longs from the stack
5: ifle 14     If the result of the comparison was 0 or -1 (the first value is not greater than the second), jumps to 14
8: ldc #45     Then branch – Puts String "World is broken" onto stack
10: goto 16    Jumps to the next instruction after the whole if expression to skip else branch
13: pop        Unreachable instruction (artifact of code generation)
14: ldc #47    Else branch – Puts String "OK" onto stack

If the condition is not computed using one of the intrinsic predicates, but can be determined at compile time to have a return value of primitive type boolean, it is not necessary to box the value and compare it to the null reference [15, line 2700]. Example code:

(if (Double/isNaN 1.0)
  "World is broken"
  "OK")

Compiled bytecode:


0: dconst_1          Puts double value 1.0 onto stack
1: invokestatic #26  Invokes static method isNaN of java.lang.Double
4: ifeq 13           If the top stack item is equal to zero, jumps to index 13
7: ldc #28           Then branch – Puts String "World is broken" onto stack
9: goto 15           Jumps to the next instruction after the whole if expression to skip else branch
12: pop              Unreachable instruction (artifact of code generation)
13: ldc #30          Else branch – Puts String "OK" onto stack

In the general case, where the condition doesn’t return a primitive boolean value, the compiler needs to emit code that will check that the result of condition evaluation is equal to neither the null reference nor false. Thus, two branch instructions are needed. Example code:

(if x ; let x be a function argument without a type hint
  "yes"
  "no")

Compiled bytecode:

0: aload_1         Pushes the x argument onto stack
1: aconst_null
2: astore_1        Clears x reference
3: dup             Duplicates the x value for the two comparisons
4: ifnull 18       Jumps to instruction 18 (pop) if x was null reference
7: getstatic #18   Pushes FALSE constant from java.lang.Boolean’s static field onto stack
10: if_acmpeq 19   Compares x with false. Jumps to instruction 19 if they were equal.
13: ldc #20        Then branch. Pushes String "yes" onto stack.
15: goto 21        Jumps to the next instruction after the whole if expression to skip else branch.
18: pop            Executed only when x is null reference. Discards the copy of x value that was duplicated for the comparison against false
19: ldc #22        Else branch. Pushes String "no" onto stack.
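The two checks performed in the general case can be written down as a short Java sketch. The class and method names are hypothetical; the comparison against Boolean.FALSE is deliberately a reference comparison, mirroring the if_acmpeq instruction.

```java
class TruthinessSketch {
    // Mirrors the two branch instructions of the general case.
    static boolean isTruthy(Object x) {
        if (x == null) {
            return false;           // corresponds to ifnull
        }
        if (x == Boolean.FALSE) {
            return false;           // corresponds to if_acmpeq against FALSE
        }
        return true;                // any other value selects the then branch
    }

    // Equivalent of (if x "yes" "no").
    static String choose(Object x) {
        return isTruthy(x) ? "yes" : "no";
    }
}
```

Values such as 0 or an empty string select the then branch, since only null and the FALSE constant are treated as false.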

4.6.2 loop expression

The loop expression creates local bindings the same way as the let expression (described in 4.5) and additionally establishes a so-called recur target. That means that recur expressions inside its body will perform a jump back to the beginning of the loop expression with local bindings rebound to the new values passed to recur. Semantically, it is equivalent to a tail-recursive local function, where the bindings are the function’s arguments and the initial values are passed to the function call. Example:

(loop [x 0]
  (println x)
  (recur (inc x)))

The example is an infinite loop that prints numbers from 0 onwards. In each iteration, the value passed to recur replaces the old value of x and the body is executed again. If the loop was supposed to end at some point there should have been a conditional expression whose branch doesn’t end with recur. Compiled bytecode:

0: lconst_0                 Puts 0 onto stack
1: lstore_1                 Initializes x to 0
2: getstatic #34            Gets the var object for println
5: invokevirtual #46        Obtains function object for println
8: checkcast #48            Ensures that the item implements IFn
11: lload_1                 Puts current value of x onto stack
12: invokestatic #54        Boxes the value into Long using num method of class clojure.lang.Numbers
15: invokeinterface #57, 2  Calls println’s invoke method with boxed x as argument
20: pop                     Discards return value of println
21: lload_1                 Puts current value of x onto stack
22: invokestatic #60        Calls inc static method of class clojure.lang.Numbers on the value
25: lstore_1                Stores the result back into x
26: goto 2                  Jumps back to index 2

Note that the value of x was kept as a primitive until it was passed to println, which requires an object. When the loop expression isn’t in the position of the function return value, it is first wrapped in a local anonymous function.

4.7 Top-level expressions

Clojure source files contain expressions that are not part of a body of any function. The most common are def and defn, which bind values to global vars, but arbitrary expressions can appear at top- level. The top-level scope is compiled as separate class that initializes the whole module, named NS__init, where NS is the name of the namespace (default is user). It has a method named load and one or more __initN methods for initializing used constants, where N is an integral number. The top-level expressions are compiled similarly as if they were in a function body and the bytecode is output into method load, whereas the used constants are initialized in one of the init methods instead of the class initializer [15, line 7376].


5 Clojure decompiler

5.1 Used technologies

The decompiler is implemented in Clojure, leveraging its code-as-data concept to represent and output the decompiled code.

5.1.1 Leiningen

Leiningen¹ is a build system for Clojure projects inspired by Apache Maven. It features automatic dependency management compatible with Apache Maven’s repositories, using Maven Central² and Clojars³ as default repositories. The project is described by the project.clj file using Clojure as the data format. It aims to be declarative and extensible.

5.1.2 Apache BCEL

Apache Bytecode Engineering Library⁴ is a library written in Java that can load JVM class files and analyze them. It provides a high-level interface for accessing parts of class files, in particular the bytecode. Individual instructions are represented as objects that provide methods to access instruction operands, referenced values in the constant pool, jump targets, etc. The classes for individual instruction types implement interfaces that form a hierarchy. Thus, the instructions are grouped into multiple categories depending on what they do and what they access, such as StoreInstruction or StackConsumer. It also provides the ability to manipulate and generate classes and their code.

There are other libraries available that can read and manipulate JVM class files, most notably Objectweb ASM⁵. It provides mostly the same set of features as BCEL, but aims to be more lightweight and to offer better performance. For this project, BCEL was chosen due to the grouping of instruction objects by their implemented interfaces, which is not present in ASM. It’s a feature that simplifies instruction processing and makes the implementation more robust with respect to future additions to the JVM instruction set.

1. http://leiningen.org/
2. http://central.sonatype.org/pages/apache-maven.html
3. https://clojars.org/
4. https://commons.apache.org/proper/commons-bcel/
5. http://asm.ow2.org/

5.1.3 Other libraries

tools.cli           Clojure library for parsing command-line arguments in a format similar to the one used by common Unix programs⁶.
tools.logging       Clojure library providing an abstraction layer over several Java logging frameworks⁷.
clj-logging-config  Library providing a simple way to configure logging provided by tools.logging⁸.
clojure.pprint      Library distributed as part of Clojure itself that can print Clojure data structures in a more readable way (by inserting whitespace). It also has support for printing Clojure code in a way that follows common conventions.

5.2 Architecture

The decompiler gets multiple class files as an input and processes each of them separately. It looks for classes that have a counterpart in Clojure and decompiles their methods that correspond to actual function bodies or top-level expressions. The decompilation process simulates parts of the Java virtual machine to be able to connect individual values and operations into an intermediate expression tree. When processing instructions, it pops the instruction operands from the stack and pushes the result back (if any) as the particular instruction would do on a real JVM, but using the decompiler’s intermediate form nodes as stack items instead of actual values. That way, it constructs an expression tree where individual nodes correspond to Clojure forms, such as function calls or the if conditional expression. An expression whose value was not used (it was probably called for side-effects) is attached to the expression node that follows it.

6. https://github.com/clojure/tools.cli
7. https://github.com/clojure/tools.logging
8. https://github.com/malcolmsparks/clj-logging-config

Figure 5.1: Example of processing instructions

Input bytecode:

dload_1
invokestatic #46
dconst_1
invokestatic #52

#46 points to method sqrt of java.lang.Math
#52 points to method add of clojure.lang.Numbers

The resulting expression tree:

Static invocation node:
  class: clojure.lang.Numbers
  method: add
  arguments:
  • Static invocation node:
      class: java.lang.Math
      method: sqrt
      arguments:
      • Local variable node:
          index: 1
  • Constant node:
      value: 1
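The processing of the four instructions from Figure 5.1 can be sketched as a symbolic stack simulation. The decompiler itself is written in Clojure on top of BCEL; the Java code below is only an illustration with hypothetical node classes, and the instruction sequence is hard-coded instead of being read from a class file.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class StackSimulationSketch {
    // Hypothetical intermediate form nodes, printed in a Clojure-like notation.
    interface Node {}

    static final class Local implements Node {
        final int index;
        Local(int index) { this.index = index; }
        @Override public String toString() { return "local" + index; }
    }

    static final class Const implements Node {
        final Object value;
        Const(Object value) { this.value = value; }
        @Override public String toString() { return String.valueOf(value); }
    }

    static final class StaticCall implements Node {
        final String cls;
        final String method;
        final List<Node> args;
        StaticCall(String cls, String method, List<Node> args) {
            this.cls = cls;
            this.method = method;
            this.args = args;
        }
        @Override public String toString() {
            StringBuilder sb = new StringBuilder("(").append(cls).append('/').append(method);
            for (Node arg : args) {
                sb.append(' ').append(arg);
            }
            return sb.append(')').toString();
        }
    }

    // Pops argc nodes (the last argument is on top) and pushes a call node.
    static void invokeStatic(Deque<Node> stack, String cls, String method, int argc) {
        Node[] args = new Node[argc];
        for (int i = argc - 1; i >= 0; i--) {
            args[i] = stack.pop();
        }
        stack.push(new StaticCall(cls, method, Arrays.asList(args)));
    }

    static Node simulate() {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(new Local(1));                                // dload_1
        invokeStatic(stack, "java.lang.Math", "sqrt", 1);        // invokestatic #46
        stack.push(new Const(1.0));                              // dconst_1
        invokeStatic(stack, "clojure.lang.Numbers", "add", 2);   // invokestatic #52
        return stack.pop();                                      // root of the tree
    }
}
```

The node left on the stack after the last instruction is the root of the expression tree, with the sqrt invocation and the constant as the arguments of add.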

The expression tree is first built in a low-level form that usually contains many Java interoperation calls, which are used to implement higher-level Clojure operations. Such calls weren't expressed as interoperation calls in the original source, and the decompiler contains several transformations that handle them. The expression trees that were produced for individual class files but correspond to declarations in a single source file are then linked together by replacing placeholder calls to function class constructors with the actual function definitions. Finally, the expression tree is output as Clojure source code.

5.3 Implementation

5.3.1 Entry point

The decompiler’s input is a list of paths that contain class files. It parses the files using Apache BCEL and determines how to decompile a class by its category. There are two categories of classes recognized by the decompiler – initialization classes and function classes.


5.3.2 Decompilation context

The decompilation context is an object representing the state of the simulated virtual machine at a given time. It consists of:

• Operand stack
• Remaining instructions to be processed
• Parsed form of the class file
• Class fields and their values after they were initialized
• Local variables
• Resulting expression tree after the instructions have been processed
• Auxiliary information that needs to be carried through decompilation

When decompiling methods, the input is a decompilation context containing the list of bytecode instructions and the output is an altered context with the final state.
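One way to picture such a context is as an immutable Clojure map that each processing step replaces with an updated copy. The sketch below is an assumption made for illustration: the key names, the helper functions step and run, and the sample handler push-constant are hypothetical and are not taken from the decompiler's source.

```clojure
;; Hypothetical shape of a decompilation context; all key names are assumptions.
(def empty-context
  {:stack        []    ; operand stack holding expression nodes
   :instructions []    ; remaining instructions to be processed
   :class        nil   ; parsed form of the class file
   :fields       {}    ; class fields and their initialized values
   :locals       []    ; local variable placeholders
   :result       nil   ; resulting expression tree
   :aux          {}})  ; auxiliary information carried through decompilation

(defn step
  "Removes the next instruction from the context and passes it to handler,
  which returns the altered context."
  [context handler]
  (let [[insn & more] (:instructions context)]
    (-> context
        (assoc :instructions (vec more))
        (handler insn))))

(defn run
  "Applies handler until no instructions remain; returns the final context."
  [context handler]
  (if (empty? (:instructions context))
    context
    (recur (step context handler) handler)))

;; Sample handler for an assumed [:ldc value] instruction encoding.
(defn push-constant
  [context [_ value]]
  (update context :stack conj {:node :constant :value value}))
```

For example, (run (assoc empty-context :instructions [[:ldc 1] [:ldc "a"]]) push-constant) returns a context with an empty instruction list and two constant nodes on the stack. Because the context is an ordinary persistent map, earlier states remain available, which is convenient when branches have to be processed separately.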

5.3.3 Decompiling function classes

Before the invoke methods are decompiled, the class initializer (method <clinit>) is decompiled first to obtain the initial values of the class fields that contain vars and constants. Vars stored in fields are later used to obtain the names of functions (and global variables), as those names are not present at the place where the functions are actually called. Then all invoke methods are decompiled, each within a separate decompilation context whose fields are prepopulated from the previous step. Additionally, if the function is variadic, the decompiler needs to obtain its minimal required arity. The output of this step is an object that contains the function metadata (name, namespace, original class name) and the expression trees for all arities (with lists of their arguments).

5.3.4 Decompiling initialization classes

Initialization classes (described in 4.7) contain the top-level statements that were present in the clj file and that initialize the whole module. There must be at least one class of this type on the input. When decompiling an initialization class, its init methods are processed first, as they initialize the class fields. The fields are extracted from the decompilation context and are used to prepopulate the fields of a new context that serves as the input for processing the load method. Unlike in methods that return a value, in the load method the Clojure compiler doesn't emit pop instructions after compiling expressions whose values aren't used. Such expressions therefore remain on the stack at the end of the processing, and the stack contents are used as the output of this step.

5.3.5 Processing instructions

Instructions are processed one by one by handlers specific to the given instruction type. Each handler returns an altered decompilation context representing the state after the instruction has been executed. When one of the return instructions is encountered, the handler marks the current stack top as the resulting expression, which becomes the outermost expression (the function body).

If an expression node is popped from the stack without becoming part of another expression node, the return value of that expression wasn't used and the expression was probably executed for its side effects. The same is true for interoperation method calls that don't return a value (have return type void). The expression is retained in the decompilation context and, when the next instruction is processed, it is attached to the newly created expression node as its preceding expression. That way, consecutive expressions with unused return values form a linked list that can later be expanded into the body of a Clojure form.

When a store instruction is encountered, the stored value is put into a vector of local variables in the decompilation context. The vector consists of placeholder objects for local variables; each holds the variable's name, its initial value (the first store to the given index) and its subsequent value (rebinding to a new value for the next iteration of a loop or a tail-recursive call). Function arguments are represented the same way, but have no initial value. Argument names and local variable names are extracted from the local variable table attribute of the method in the class file, which contains rows with variable names and indices denoting the range where the variable was used. The indices are compared against the index of the current instruction and, if they match, the corresponding name is used. If the name is not found, a sequentially numbered name is generated.

The first store instruction that accesses a given local variable starts a new let or loop expression (described in sections 4.5 and 4.6.2), and the remaining code forms the body of the expression. When a goto instruction that points to the index right after the current store instruction is encountered, it signifies that the current binding expression is a loop (otherwise, it is a let). In that case, a recur expression is created. Store instructions that access variables whose initial value was already set don't produce an expression on their own, but store the new value into the local variable vector. The values of such secondary stores are then used as arguments to the recur expression. If the body of a newly created let or loop expression starts with another expression of the same type, the two expressions are merged into one (this process continues recursively).

When processing branch instructions, the decompiler needs to determine the condition and then process the then and else branches separately. A conditional expression (if) with an intrinsic predicate (described in 4.6.1) starts with a numeric comparison instruction followed by a branch instruction (such as iflt). The equivalent Clojure function for the two instructions is looked up in a map created from data found in the compiler source code. For other types of if expression, the comparison expression is taken from the stack. For the general condition (comparison against both the null reference and false), the processing handler also needs to consume the retrieval of the false constant and the following if_acmpeq branch instruction. The then branch starts after the condition and is delimited by a goto instruction with a positive jump offset. The else branch starts after the then branch and ends at the target index of the goto instruction that ended the then branch. The resulting expression tree node contains the expression trees for the condition and both branches.
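The merging of nested binding expressions described above can be illustrated on plain Clojure forms. The decompiler itself works on its intermediate nodes, so the function below is only a sketch under that simplification; the name merge-nested is hypothetical, and the sketch assumes that the outer body consists of exactly one expression.

```clojure
;; Illustrative sketch on quoted Clojure forms, not on decompiler nodes.
(defn merge-nested
  "Merges a let (or loop) whose body is a single expression of the same
  type into one form with combined bindings. Repeats until nothing more
  can be merged; any other form is returned unchanged."
  [form]
  (if (and (seq? form)
           (contains? #{'let 'loop} (first form)))
    (let [[head bindings & body] form
          inner (first body)]
      (if (and (= 1 (count body))
               (seq? inner)
               (= head (first inner)))
        (let [[_ inner-bindings & inner-body] inner]
          ;; Append the inner bindings and continue with the merged form.
          (recur (list* head
                        (into (vec bindings) inner-bindings)
                        inner-body)))
        form))
    form))
```

For example, (merge-nested '(let [a 1] (let [b 2] (+ a b)))) returns a single let with the bindings [a 1 b 2]. A let that directly contains a loop is left as it is, because the two expressions are not of the same type.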

5.3.6 Expression tree transformations

The expression tree undergoes several transformations that convert low-level compiler-generated structures into high-level function calls or other forms. Most of the transformations are implemented as parts of the instruction processing handlers and thus don't require an additional pass through the expression tree. One such transformation converts reflective calls back to regular calls (example bytecode for both types of calls is in 4.3.1). When an interoperation static method call to the helper class clojure.lang.Reflector is encountered, the decompiler tries to match it against known patterns generated by the compiler and, if the match succeeds, replaces the call with the desired form. An example of output code without the transformation applied:

(clojure.lang.Reflector/invokeConstructor
  (Class/forName "java.lang.Exception")
  (into-array [x]))

With the transformation applied:

(Exception. x)
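The pattern match for reflective constructor calls can be sketched on plain Clojure forms as follows. This is a simplified illustration, not the decompiler's implementation: the function name simplify-constructor is hypothetical, only the exact shape shown in the example above is recognized, and shortening class names from java.lang is an assumption based on that example.

```clojure
;; Illustrative sketch; matches only the form shown in the example above.
(defn simplify-constructor
  "Rewrites (clojure.lang.Reflector/invokeConstructor
              (Class/forName \"pkg.Name\") (into-array [args...]))
  into (pkg.Name. args...). Other forms are returned unchanged."
  [form]
  (if (and (seq? form)
           (= 'clojure.lang.Reflector/invokeConstructor (first form)))
    (let [[_ class-form array-form] form
          [for-name class-name] (when (seq? class-form) class-form)
          [into-arr args]       (when (seq? array-form) array-form)]
      (if (and (= 'Class/forName for-name)
               (string? class-name)
               (= 'into-array into-arr)
               (vector? args))
        ;; Classes directly in java.lang are imported by default in Clojure,
        ;; so their simple name is enough.
        (let [[_ simple] (re-matches #"java\.lang\.([^.]+)" class-name)
              ctor       (symbol (str (or simple class-name) "."))]
          (list* ctor args))
        form))
    form))
```

Applied to the unsimplified form above, the function returns (Exception. x). A class outside java.lang keeps its fully qualified name, so "java.util.ArrayList" becomes (java.util.ArrayList. n).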

Similarly, the decompiler tries to match Java interoperation calls that construct Clojure objects for which literals exist, and replaces them with the corresponding literals. An example of output code without this transformation applied:

(clojure.lang.RT/mapUniqueKeys (into-array [(clojure.lang.RT/keyword nil "leading-space") (java.util.regex.Pattern/compile "^\\s")]))

With the transformation applied:

{:leading-space #"^\s"}

The code creates a hash map with one key-value pair consisting of a keyword and a regular expression. Another transformation undoes the effect of inlining basic Clojure functions (mostly arithmetic and comparison functions) by looking up the called static method among the known inline functions. An example of output code without this transformation applied:

(clojure.lang.Util/identical x nil)


With the transformation applied:

(nil? x)

The code compares x to the null reference.⁹

The compiler also automatically inserts casts and boxing/unboxing calls around interoperation calls that work with primitive types. There is a transformation that tries to eliminate those by comparing the return type of the cast function with the type that is expected at the given place. If they match, the compiler can generate the cast automatically and the cast can safely be removed.
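The cast elimination can be illustrated with a small sketch. The node maps, the keyword type tags and the function name strip-redundant-cast are assumptions made for this example, and the table lists only a few of the cast helpers of clojure.lang.RT; the decompiler's real data structures may differ.

```clojure
;; Hypothetical table mapping cast helper methods to the type they return.
(def cast-return-types
  {["clojure.lang.RT" "intCast"]     :int
   ["clojure.lang.RT" "longCast"]    :long
   ["clojure.lang.RT" "doubleCast"]  :double
   ["clojure.lang.RT" "booleanCast"] :boolean})

(defn strip-redundant-cast
  "Returns the argument of a cast call when the type produced by the cast
  equals the type expected at this place; otherwise returns node unchanged."
  [node expected-type]
  (let [cast-type (cast-return-types [(:class node) (:method node)])]
    (if (and (= :static-invocation (:node node))
             cast-type
             (= cast-type expected-type)
             (= 1 (count (:arguments node))))
      (first (:arguments node))
      node)))
```

A doubleCast call around a local variable is removed when a double is expected, because the compiler would insert the same cast again. When a long is expected instead, the node is kept, since dropping the cast could change the meaning of the code.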

5.3.7 Outputting Clojure code

After all classes have been decompiled, the intermediate forms can be output as Clojure code. Function expression trees produced by the decompilation of individual function classes are inserted into the expression trees at the places where the functions were defined. The expression tree of each initialization class is output into a separate clj file. The intermediate form nodes are then converted to the data structures that Clojure itself uses to represent code and are formatted using the clojure.pprint core library.
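The formatting step can be sketched with the code dispatch that clojure.pprint provides. The function name format-code and the sample form are made up for this illustration; how the decompiler actually calls the library is not shown in the text, so this is only one plausible way to do it.

```clojure
(require '[clojure.pprint :as pp])

(defn format-code
  "Returns form as a string, laid out with clojure.pprint's code dispatch,
  which follows common Clojure indentation conventions."
  [form]
  (with-out-str
    (pp/with-pprint-dispatch pp/code-dispatch
      (pp/pprint form))))

;; Sample form standing in for a decompiled function definition.
(print (format-code
        '(defn add-sqrt [x]
           (if (pos? x)
             (+ (Math/sqrt x) 1)
             (throw (Exception. "negative input"))))))
```

Because the input is ordinary Clojure data, reading the printed text back with read-string gives a form equal to the original, so the pretty-printing only affects the layout.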

9. nil? is an ordinary function; the question mark is part of the identifier and has no syntactic meaning.

6 Conclusion

The thesis described the process of Clojure compilation to JVM bytecode. It summarized how functional concepts in Clojure are mapped onto object-oriented concepts in the JVM and provided detailed examples of how individual basic Clojure forms are expressed in the bytecode.

The primary aim of this thesis was to design and develop a prototype of a decompiler for Clojure targeting the Java platform. The decompiler can output reasonably readable source code for programs that implement common algorithms. The emphasis was on generating code that is as close as possible to the original source.

One of the problems that negatively affect the readability of decompiled source code is Clojure's reliance on macros. The macro-expanded code is semantically equivalent to the original code, but tends to be less clear, and thus it is desirable to undo the effect of macro expansion. Since macros are expanded before the code is actually compiled and they are regular Clojure functions, undoing the macro expansion would require finding an inverse of an arbitrary function, which is a problem that is known not to have a general solution. It is possible, however, to have a hand-coded inverse function for a particular macro. The decompiler contains a few such functions for basic macros, such as defn or ns, but it could be extended with many more and could possibly provide extension points for plugins that would cover macros from third-party libraries and frameworks.

The decompiler prototype doesn't support the full set of language features that Clojure provides. It could be extended to support advanced features, such as records, protocols or multimethods. It could also be integrated into existing graphical tools that can display the original bytecode alongside the decompiled source code.


Bibliography

[1] R. Hickey, Clojure, [online]. Available: http://clojure.org/ [2] C. Emerick, B. Carper, and C. Grand, Clojure programming., O’Reilly Media, Inc., 2012 [3] L. Paulson, Developers Shift to Dynamic Programming Languages, IEEE Computer, vol.40, no.2, pp.12,15, 2007 [4] P. Graham, On LISP: Advanced Techniques for . Prentice-Hall, Inc. Upper Saddle River, NJ, 1993. [5] M. Fogus and C. Houser, The Joy of Clojure: Thinking the Clojure Way, 1st ed, Manning Publications Co. Greenwich, CT, 2011 [6] R. Hickey, Clojure – , [online]. Available: http://clojure.org/functional_programming

[7] A. Bonnaire-Sergeant, R. Davies and S. Tobin-Hochstadt, Practical Optional Types for Clojure, draft, Feb 2015. Available: http:// homes.soic.indiana.edu/samth/typed-clojure-draft.pdf

[8] M. Fowler, Domain Specific Languages, Addison-Wesley Profes- sional, Sep 2010 [9] D. Ghosh, DSLs in Action, Manning Publications Co. Stanford, CT, 2011 [10] R. Muschevici et al., Multiple Dispatch in Practice, SIGPLAN Not., ACM, New York, NY, Sep 2008 [11] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley. The Java Virtual Machine Specification, Java SE 8 Edition, 1st ed. Addison-Wesley Professional, Mar 2014. [12] Java Community Process, JSR #292 Supporting Dynamically Typed Languages on the Java Platform, [online], Available: https://jcp. org/en/jsr/detail?id=292

[13] Groovy project, Invoke dynamic support, [online], Available: http: //www.groovy-lang.org/indy.html

49 BIBLIOGRAPHY

[14] R. Hickey, Clojure – Compilation, [online]. Available: http:// clojure.org/compilation

[15] R.Hickey, Clojure Compiler, [software, source code], Avail- able: https://github.com/clojure/clojure/blob/clojure-1. 7.0-beta2/src/jvm/clojure/lang/Compiler.java

[16] M. Schinz and M. Odersky, Tail call elimination on the Java Vir- tual Machine, ACM SIGPLAN BABEL’01 Workshop on Multi- Language Infrastructure and Interoperability, pp.155,168, Else- vier, Amsterdam, NL, 2001

50