Challenges and Solutions in the Design of a Java Virtual Machine for Resource Constrained Microcontrollers

Faisal Aslam

Dissertation submitted to the University of Freiburg for the Doctorate Degree in Computer Science

Computer Networks and Telematics
Department of Computer Science
Faculty of Applied Sciences
University of Freiburg, Germany

March 2011
Dissertation Defense Date: 17.03.2011

Thesis committee:

Chair: Prof. Dr. Georg Lausen
Examiner: PD Dr. Cyrill Stachniss
Referees: Prof. Dr. Christian Schindelhauer, Prof. Dr. Peter Thiemann

To Abbu, Ammi and Ayesha

Zusammenfassung

This thesis describes the TakaTuka Java Virtual Machine (JVM), which was developed for microcontrollers with small capacities of RAM, storage and processing power. TakaTuka optimizes the storage requirements of both the compiled Java code and the JVM interpreter, both of which must be stored on an embedded device. On average, the size of the compiled Java code of the TakaTuka JVM can be reduced by more than 92% compared to the standard Java binary of a program. Furthermore, the interpreter size for a given program can be reduced to no more than 25 KB. In this way, a sensor node with little flash storage executes a highly optimized Java binary and Java interpreter. The ability to execute complex Java programs is constrained by the RAM available on such devices. Memory management for a Java program consists of two parts: First, the memory consumed by each object is managed by the garbage collection (GC), which runs intermittently to reclaim memory. Second, the memory occupied by the frame of a function is automatically reclaimed as soon as the called function completes. TakaTuka provides an Offline Garbage Collection to reduce the memory consumed by objects over the course of a program execution. The Offline Garbage Collection permits freeing an object that is still reachable but is guaranteed never to be used again in the program. It is shown that the Offline GC can increase the RAM available to a program by up to 66% compared to a typical online garbage collector. Another important feature of the TakaTuka JVM is the Variable Slot Size (VSS) scheme, which reduces the RAM requirements of function frames. The storage and RAM optimizations provided by TakaTuka also increase the execution speed of a program and thereby extend the lifetime of a sensor node. The storage, performance and RAM-consumption characteristics of the TakaTuka JVM make it an attractive platform for developing applications for wireless sensor networks comprising microcontrollers and embedded devices.


Abstract

This thesis describes the TakaTuka Java virtual machine (JVM), which has been developed for microcontrollers with small RAM, storage and processing power. TakaTuka optimizes the storage requirements of the Java binary as well as the JVM interpreter, both of which are to be stored on an embedded device. On average, the Java binary produced by the TakaTuka JVM is more than 92% smaller than the traditional Java binary of a program. Furthermore, the interpreter size for a given program may be reduced to as little as 25 KB. Thus a mote with small flash storage executes a highly optimized Java binary and Java interpreter. The ability of tiny motes to run large, feature-rich Java programs is constrained by the amount of RAM installed on such devices. The memory management for a Java program may be divided into two parts. First, the memory consumed by each object is managed by the Garbage Collection (GC), which is executed intermittently to reclaim that memory. Second, the memory consumed by the frame of a function is automatically reclaimed on completion of the function's invocation. TakaTuka has an Offline GC to reduce the aggregated memory requirements of objects during the execution of a program. The Offline GC allows freeing an object that is still reachable but is guaranteed not to be used again in the program. The Offline GC is shown to increase the amount of RAM available to a program by up to 66% as compared with a typical online garbage collector. Another important feature of the TakaTuka JVM is the Variable Slot Size (VSS) scheme, which reduces the RAM requirements of function frames. TakaTuka's storage and RAM optimizations also increase the execution speed of a program, thus increasing the lifetime of the mote. The storage, performance and RAM-consumption features of the TakaTuka JVM make it an attractive platform for developing applications for wireless sensor networks employing microcontrollers and embedded devices.


Acknowledgments

I am very grateful to my supervisor Professor Christian Schindelhauer for his encouragement, help and inspiring guidance throughout the course of this work. He always believed in me and supported me in creating a team for the development of the TakaTuka JVM. I could rely on him for all my problems, including the personal ones. The formal and informal exchange of ideas with him and the influence of his personality helped me in becoming a better person. I am also obliged to my co-supervisor Professor Peter Thiemann for his continued guidance and counseling during this work. His qualified suggestions and vision of programming techniques proved extremely useful in corroborating my ideas and avoiding possible dead ends. I am greatly indebted to Professor Zartash Afzal Uzmi for stimulating my research interests and giving publishable shape to most of my work. I wish to thank two of my young colleagues, Luminous (Lu) Fennel and Gidon Ernst, for their assistance in realizing the TakaTuka project. Gidon wrote the early version of the TakaTuka interpreter to help establish its feasibility. Lu transformed the TakaTuka JVM into a fully functional and usable form. I want to express my gratitude to Oliver Maye, of IHP-microelectronics, for certain improvements in the TakaTuka JVM and for emancipating it from bugs. I would also like to acknowledge Elmar Haussmann for implementing the Variable Slot Size (VSS) functionality in the TakaTuka JVM, Jet Tang for porting TakaTuka to the MSP430 family of processors, Omer Salem Al-Khanbushi for writing the Apache-Ant script, Zhongjie Cai for improving TakaTuka's multi-threading support and Shou-Yu Chao for providing Java debugger (JDB) functionality on motes and porting the CLDC library to the TakaTuka JVM. I am grateful to all my friends and colleagues, especially Saqib Raza, Fahad Rafique Dogar, Arne Vater, Stefan Rührup, Wasim Malik, Asif M. Siddiqui and Professor Sohail Aslam, for the continued support and encouragement I have received from them. I am hugely obligated to my parents and also to my wife and children, who were always there for me and patiently waited to see me successful. My thanks are also due to my uncle Dr. I. A. Azhar, who reviewed the manuscript to improve the language of the thesis. The financial support of the Higher Education Commission of Pakistan for carrying out my Ph.D. program is gratefully acknowledged.


Contents

Acknowledgments

List of Tables

List of Figures

Abbreviations

1 Introduction
1.1 Existing JVMs for Microcontrollers
1.2 Wireless Sensor Network
1.3 Programming Paradigm
1.4 Fundamentals of a Java Virtual Machine
1.5 Challenges and Contributions

2 Offline Garbage Collection
2.1 Motivating Example
2.2 Offline GC Data-Flow Analyzer
2.3 Offline Garbage Identifier
2.4 Runtime Implementation
2.5 Related Work
2.6 Results and Discussion

3 Variable Slot Size
3.1 VSS Instruction Design
3.2 Changing the JVM for Supporting VSS
3.3 Results and Discussion


4 Bytecode Compaction
4.1 Available Opcodes
4.2 Single Instruction Compaction
4.3 Multiple Instruction Compaction (MIC)
4.4 Interpreter Support for Bytecode Compaction
4.5 Results and Discussion

5 Class File Optimizations
5.1 TakaTuka Constant Pool Optimization
5.2 Dead Code Removal
5.3 TUK File Format
5.4 Results and Discussion

6 Interpreter Design
6.1 Runtime Architecture
6.2 Customized JVM
6.3 Hardware and Operating System Abstraction Layer
6.4 TakaTuka Linkage With TinyOS
6.5 Tiny Java Debugger
6.6 Results and Discussion

7 Summary and Conclusions

References

List of Tables

2.1 Benchmarks used for the evaluation of the Offline GC.
2.2 The number of times online GC is called by the benchmark programs, with and without Offline GC.

3.1 Set of additional bytecode instructions in the TakaTuka VSS Extension.

4.2 Detailed information about the programs used for evaluating Java binary optimizations.
4.3 Total number of available opcodes before bytecode compaction.
4.4 Maximum and average lengths of customized instructions created by MIC with FC.
4.5 Maximum and average lengths of customized instructions created by MIC with LC.
4.6 Number of opcodes used for creating customized instructions by the ORE scheme with LC.
4.7 Number of opcodes used for creating customized instructions by the MIC scheme with LC.
4.8 Number of opcodes used for creating customized instructions by the ORD scheme.
4.9 Number of opcodes used for creating customized instructions by the ORE scheme with FC.
4.10 Number of opcodes used for creating customized instructions by the MIC scheme with FC.
4.11 Number of opcodes used for creating customized instructions by the three compaction schemes with LC.
4.12 Number of opcodes used for creating customized instructions by the three compaction schemes with FC.


5.1 Class files removed by the dead-code removal algorithm and the functions removed from the rest of the class files.

6.1 List of motes tested using TJD and the supported features.
6.2 Size of the customized JVM in KB when compiled for the IRIS mote.

List of Figures

1.1 High level view of the class file contents.
1.2 The offline optimizations and checks for the TakaTuka JVM.
1.3 TakaTuka hardware and operating system abstraction.

2.1 Deallocation of reachable objects by the Offline GC.
2.2 Instruction-Graph of an example.
2.3 DAG of an Instruction-Graph.
2.4 Function-Call-Graph of a thread.
2.5 IRIS Mote.
2.6 The amount of RAM available at different points of a program's execution.
2.7 The amount of RAM available and the number of CPU cycles consumed in deallocating objects for the Quicksort program.
2.8 The number of CPU cycles consumed in deallocating objects, during the execution of a program, with and without the use of the Offline GC.
2.9 RAM availability and CPU cycle consumption for the CTP benchmark.
2.10 Efficacy and limitations of the Offline GC.

3.1 Source code of function addTwo that adds a short and an integer.
3.2 Java binary of function addTwo.
3.3 Java binary of function addTwo with the Darjeeling Extension.
3.4 Java binary of function addTwo with the TakaTuka VSS Extension.
3.5 Percentage reduction in the aggregated size of local variables.
3.6 Percentage reduction in the aggregated size of the operand stack.
3.7 Percentage increase in the size of the Java binary.
3.8 Percentage increase in the size of the JVM.

4.1 Pattern replacement tree.
4.2 TakaTuka instruction labels before Bytecode Compaction.


4.3 TakaTuka instruction labels after Bytecode Compaction.
4.4 The average percentage of bytecode and CP sizes as compared to the total size of the Java binary of the programs given in Table 4.2.
4.5 Percentage reduction in Java binary using the ORE and MIC schemes with LC.
4.6 Percentage reduction in Java binary using the ORE and MIC schemes with FC.
4.7 Aggregated percentage reduction in Java binary using three compaction schemes with LC and FC.
4.8 Percentage reduction in Java binary using the ORD scheme.
4.9 Aggregated compilation time taken by three compaction schemes with LC and FC.
4.10 Percentage increase in the JVM size with FC as compared to the JVM size with LC.
4.11 Percentage increase in the bytecode execution speed using LC and FC as compared to no bytecode compaction.

5.1 Percentage reduction in the number of Global Constant Pool (GCP) and Tuk-GCP entries.
5.2 Percentage reduction in the size of the GCP and Tuk-GCP.
5.3 Percentage reduction of the Tuk file, compressed Jar and Squawk's Suite file as compared to the aggregated size of the program's class files.

6.1 The split debugging architecture used by TJD and Squawk.
6.2 Percentage increase in the customized JVM size as compared to the customized JVM for the null program.

Abbreviations

AOT Ahead-of-Time

BV-DFA Bytecode Verification Data-Flow Analyzer

CLDC Connected Limited Device Configuration

CP Constant Pool

CTP Collection Tree Protocol

DAU Disconnect All Uses

FC Full Compaction

FTT Finding the Tail

GC Garbage Collector

GCP Global Constant Pool

HOAL Hardware and Operating System Abstraction Layer

JDB Java Debugger

JIT Just-In-Time

JTC JVM Test Cases

JVM Java Virtual Machine

LC Limited Compaction

LED Light Emitting Diode

MIC Multiple Instruction Compaction


OGC-DFA Offline GC Data-Flow Analyzer

OGI Offline Garbage Identifier

OOP Object Oriented Programming

ORD Operand Reduction

ORE Operand Removal

PC Program Counter

PNR Point of No Return

RAM Random Access Memory

RRA Representative Reference Array

SVA Split VM Architecture

TJD Tiny Java Debugger

VM Virtual Machine

VSS Variable Slot Size

VSS-DFA VSS Data-Flow Analyzer

WSN Wireless Sensor Network

Chapter 1

Introduction


The era of modern digital computation started in the mid 20th century [1, 2, 3]. The early computers were shared by multiple users, as a computer was too expensive to be owned by a single user. These computers were large, usually occupying an entire room or building. Nevertheless, the resources available on these computers, such as RAM, storage capacity and processing power, were limited. The phenomenal advancement in technology in the 1980s resulted in computers designed for single users, called personal computers [3]. A personal computer is small enough to fit in a briefcase but is more resourceful and less expensive than earlier generations of computers. At the dawn of the 21st century, concepts like “Smart Dust” and “Ubiquitous Computing” were introduced and the post-personal-computer era began [4, 5]. Smart Dust refers to a very large set of cubic-millimeter computers, with sensing and communication capabilities, embedded in the environment to form a spatially distributed Wireless Sensor Network (WSN). A WSN obtains information from everyday objects and activities without requiring humans to control its operations. This is referred to as ubiquitous computing because the wireless sensor motes¹ are embedded in the human environment instead of requiring humans to enter theirs [5]. When several motes form a network, many useful and powerful applications may be realized despite the limited resources and small size of the constituents.

The primary design objective of a wireless mote is to minimize its physical size and manufacturing cost. To achieve this objective, usually only limited resources are made available on such a mote. Currently, a typical wireless sensor mote is provided with less than 10 KB of Random Access Memory (RAM), 128 KB of flash storage and an 8-bit or 16-bit microcontroller [6, 7, 8]. A common way of programming an application for wireless sensor motes is to use a low-level or specially designed programming language, such as Assembly, C or NesC [9]. These languages usually require a steep learning curve and produce code that is difficult to debug and maintain. In contrast, it is attractive to program these motes in Java, a widely used high-level programming language with a large developer community: Java is highly portable and provides many high-level concepts including object oriented design, type safety and runtime garbage collection. However, the Sun Java Virtual Machine (JVM) requires more than 100 KB of RAM; thus a mote with less than 10 KB of RAM cannot use it. This thesis highlights the challenges faced and the solutions offered in programming such tiny motes using the high-level object oriented Java language. The next section discusses contemporary work on Java virtual machines designed for microcontrollers or embedded devices. The subsequent sections introduce various fundamental concepts that will be helpful in comprehending the following chapters.

¹ A small computer (node) with sensing and communication capabilities is referred to as a mote.

1.1 Existing JVMs for Microcontrollers

The standard JVM of Sun Microsystems cannot be installed on an embedded device or a mote because it has storage and RAM requirements of many hundreds of KB. Therefore, Sun Microsystems has recently developed Squawk, a JVM for embedded systems [10], which overcomes some of the shortcomings of traditional JVMs by employing a Split VM Architecture (SVA). In SVA, the resource-hungry tasks of the JVM, including class file loading and bytecode verification, are performed on the desktop [10, 11]. This reduces the memory and CPU usage required for executing a program on the mote, because no runtime loading and verification is needed. Compared to standard JVMs, Squawk has less stringent resource requirements; it is, however, still not feasible to run Squawk on a typical mote equipped with an 8-bit microcontroller, a few hundred KB of flash and around 10 KB of RAM. For such typical motes, Sentilla Corp. has developed a JVM, but it is not open source and currently does not support any devices other than Sentilla motes [12]. Darjeeling is an open source JVM designed for motes [7]. It supports a good part of the JVM specification but, for efficiency, sacrifices some features like floating point support, 8-byte data types and synchronized static-method calls [7]. The Java Card VM [13], developed by Sun for smart cards, has very small resource requirements; however, it supports neither multi-threading nor garbage collection and is thus not suitable for a variety of applications. There are a few other JVMs available for embedded devices, such as NanoVM [14] and VM* [15], but these are either limited in functionality by not fully supporting the JVM specification or are closed source with a limited scope of operation only on specific devices.

This work presents the contributions made in the development of the TakaTuka JVM. The TakaTuka JVM aims to remain small, open source and suitable for tiny motes and a wide variety of embedded devices, while providing all features of a fully CLDC-compliant JVM². This thesis describes the indigenously developed TakaTuka JVM, which supports all but two of the Java bytecode instructions and most of the CLDC library. It also supports threading, debugging, synchronized method calls, 8-byte data types and 4-byte floating point arithmetic on motes.

² TakaTuka might never actually obtain formal CLDC compliance due to the high price tag associated with the license of the CLDC Technology Compatibility Kit (CLDC TCK).

1.2 Wireless Sensor Network

A network comprises a number of computers that are connected with each other in such a manner that messages can be propagated, resources can be shared and applications can be distributed among the various constituent computers (called “nodes”).


The topology of a network is carefully designed to maximize the network utility while minimizing the cost. Nodes of a network are of two kinds: application nodes and special powerful network nodes. The application nodes run applications, while the special powerful nodes (e.g. routers) form the backbone of the network to support routing. An ad-hoc wireless network differs from a typical network in two ways. First, it does not have the powerful backbone nodes; thus in an ad-hoc network each node, besides running the application, takes part in the networking by forwarding data and routing packets. Second, the network topology of an ad-hoc wireless network is usually 'ad-hoc' in the sense that it lacks a fixed and carefully configured layout. This is because the nodes of an ad-hoc wireless network are usually mobile or deployed in hostile environments (e.g. in a disaster-hit area) where maintaining a fixed topology may not be feasible. An ad-hoc network is created with built-in fault tolerance: the objective is that the network remains connected even when a few nodes cease to function or go out of radio range of the other nodes. The routing protocols for ad-hoc networks are specially designed to cope with changing network topology. These protocols are called on-demand routing protocols; the Ad-hoc On-demand Distance Vector (AODV) and Dynamic MANET On-demand (DYMO) routing protocols [16, 17] are examples.

Wireless Sensor Networks (WSNs) are similar to ad-hoc wireless networks. However, a mote of a WSN is inexpensive, has a smaller size and is resource constrained. Furthermore, a wireless sensor network has a higher node density as compared with an ad-hoc network, because the goal of a WSN is to capture the environmental characteristics of an area using sensors, besides providing networking in that area. A node of an ad-hoc network is usually a typical laptop or mobile phone, which uses a regular operating system such as Linux or Windows. These operating systems have large RAM requirements and so cannot be used on a wireless sensor mote with less than 10 KB of RAM. Therefore, a mote of a WSN must use a customized resource-constrained operating system instead of a regular operating system. The TinyOS and Contiki operating systems have been designed for WSNs and use less than 1 KB of RAM [18, 19, 20]. Some of the applications of WSNs are:

• oversight of a hostile army's movements;
• house automation;
• detection and prevention of forest fires;
• intelligent agriculture, to increase profits by better utilizing resources (water and electricity) and by reducing manual labor;
• automation and monitoring of activities in factories;
• monitoring the quality of industrial products;
• disaster management, for example locating trapped people after earthquakes;
• monitoring of shipping; and
• monitoring the movement of patients and elderly people in hospitals.

1.3 Programming Paradigm

Early computer programs were written in machine language. A machine language program consists of binary codes. The binary codes, comprising zeros and ones, are not easy to read, write or modify; hence any mistake in the program is quite cumbersome to locate and correct. Furthermore, to program even a simple task, many hundreds of lines of binary code have to be written. The assembly language brought improvement over machine language by introducing mnemonics for each binary code, representing either a low-level microcontroller instruction or a register address. An assembly program is relatively easier to write than the corresponding machine language program. However, assembly language is also a low-level language, closer to the machine than to the human. In order to make programs modular and easier to write and debug, several high-level programming paradigms were introduced. Programs written using these paradigms are closer to human language than machine code (and are thus called high-level programs). The most widely used programming paradigms [21, 22, 23, 24] are discussed in the following subsections.

Procedural and functional programming

A procedural program consists of a set of functions, each function having a list of high-level instructions to be executed, for example, an instruction to multiply two integers or to invoke another function. A function operates on data that is either local to that function or global to the program. Hence, in procedural programs data can have two scope levels: function scope and program scope.

A functional program is similar to a procedural program. However, in this case a function is usually a mathematical function whose outcome depends solely upon the parameters input to that function. It implies that at any point of the program's execution, if a function is called with the same parameters, it will always perform the same operation and return the same value, irrespective of the program's current state.

Object oriented programming

For an object oriented program the basic construct is an object. An object encapsulates the data and the functions to be carried out on that data. Hence, in object oriented programs data can have three scope levels: function scope, program scope and object scope. Another widely used object oriented concept is inheritance, meaning that an object may inherit data and functions from another object, which increases code reusability and reduces code complexity. Some other commonly used object oriented programming concepts include polymorphism, dynamic binding, function overriding and function overloading. More details of these concepts and of object oriented programming may be found in Refs. [23, 24]. Object oriented programming is employed in numerous well designed programs, as the underlying concepts are quite consistent with our natural traits of thought.

Aspect oriented programming

The current state of the art and most widely used programming paradigm is Object Oriented Programming (OOP). Aspect Oriented Programming (AOP) is an improvement over OOP, but it is still under development and has limited support available [22]. AOP separates the main tasks of a program from supporting tasks, known as cross-cutting concerns, to improve the overall design of the program. OOP can also create such a separation, except when a task is distributed between different parts of the program (i.e. cross-cutting). AspectJ is a modified version of Java with AOP support.

1.4 Fundamentals of a Java Virtual Machine

Java is an object oriented programming language, with millions of programmers worldwide. Compared with other programming languages, Java is easier to program, debug and learn. Besides being object oriented, Java has introduced many new concepts that facilitate good design and code reusability, for example exception handling and automatic memory management. Furthermore, Java programs are portable across different software and hardware platforms. This section describes terminologies and concepts pertaining to Java so that the descriptions contained in the following chapters are easier to follow.

Portability – Java Virtual Machine

One of the attractive features of Java is its portability across different hardware and software platforms, which it achieves by employing a Java Virtual Machine (JVM). High-level programming languages prior to Java needed a compiler. A compiler is a program that takes a high-level program as input and converts it into machine code so that the given program may be executed by a microcontroller. Note that a microcontroller can execute only low-level machine code. This approach has the following drawbacks:

• Machine codes of different platforms are not the same. Therefore a program compiled for one platform is not executable on another platform without recompiling. It implies that the binary code of such a program is not portable across different platforms.

• In case the for two platforms are not compatible with each other, the same program compiled for the two platforms may behave differently. 1.4. Fundamentals of a Java 7

To overcome these limitations, Java programs are not converted into machine code. Instead, these programs are converted into an intermediate code called Java binary. The Java binary is platform-independent, so the same Java binary will be generated for any platform. However, this binary code cannot be executed by a microcontroller directly, as it is not machine code. In order to execute the Java binary on a microcontroller, a Java Virtual Machine (JVM) is used. The JVM is a program that executes the Java binary instructions, called Java bytecode. Each JVM must follow the Java specification to ensure that bytecode execution on each platform produces exactly the same results [11]. Thus, a JVM ensures that the Java binary is portable across several platforms and that bytecode execution on different platforms produces exactly the same results. However, Java bytecode execution is usually much slower than machine code. Java thus provides a trade-off between well designed, reusable and portable code on the one hand, and slow execution speed on the other. Although Java bytecode may never execute as fast as low-level machine code, there have been significant efforts to improve its execution speed [25, 26].

Interpreter

The JVM may use an interpreter or a Just-in-Time compiler to execute the Java binary on a platform. An interpreter executes a single Java bytecode instruction at a time. The program counter (PC) points to the current instruction to be executed, and after the execution of each instruction the PC is updated. The main disadvantage of using an interpreter is its slow sequential bytecode execution. Specifically, instruction dispatch and fetching the operands of an instruction are known to be the primary time-consuming tasks in JVM interpreters [27]. However, compared to a Just-in-Time compiler, interpreters are usually easier to develop (or port to a new platform) and consume less RAM.
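As a minimal sketch (not TakaTuka's actual implementation), the following Java fragment shows a switch-based dispatch loop for a toy stack machine; the opcode values and the tiny instruction set are invented for illustration only:

    /** A toy switch-based interpreter loop; the opcodes are hypothetical. */
    public class ToyInterpreter {
        static final byte HALT = 0x00; // stop and return the top of stack
        static final byte PUSH = 0x01; // push the next code byte onto the stack
        static final byte IADD = 0x02; // pop two values, push their sum

        public static int run(byte[] code) {
            int[] stack = new int[16];
            int sp = 0; // operand stack pointer
            int pc = 0; // program counter
            while (true) {
                byte opcode = code[pc++];          // instruction dispatch
                switch (opcode) {                  // comparison chain per instruction
                    case PUSH: stack[sp++] = code[pc++]; break;
                    case IADD: stack[sp - 2] += stack[sp - 1]; sp--; break;
                    case HALT: return stack[sp - 1];
                    default: throw new IllegalStateException("bad opcode " + opcode);
                }
            }
        }

        public static void main(String[] args) {
            // computes 2 + 3 and prints 5
            System.out.println(run(new byte[] { PUSH, 2, PUSH, 3, IADD, HALT }));
        }
    }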

Just-in-Time Compiler

Interpreters for a JVM are slower in execution than machine code. The goal of using a Just-in-Time (JIT) compiler is to increase the execution speed of a program while not compromising its portability. A typical compiler generates machine code for a given program before the program executes. In contrast, a JIT compiler generates machine code dynamically during program execution. The JIT compiler does not convert the complete Java binary of a program into machine code; instead it uses runtime analyses to select portions of the Java binary to convert into machine code on the fly. The selection is based on the execution time of a portion or on whether that portion is executed several times. Hence a JIT compiler behaves like an interpreter for some portions of code and as a compiler for others. The main disadvantage of a JIT compiler is that it consumes additional RAM for runtime machine code generation. Furthermore, it is usually more difficult to develop a JIT compiler (or port it to a new platform) than an interpreter.

Virtual Machine Architecture

A Java virtual machine may have either a stack architecture or a register architecture. Sun's official JVM has a stack architecture, in which the operands of a bytecode instruction as well as any result produced by its execution are stored on a stack called the operand stack. For example, the IADD instruction has no explicit operands in the stack architecture. Instead, IADD pops two integers from the operand stack, adds them and pushes the result of the addition back onto the stack. In contrast, in a register-based architecture the operands of an instruction are stored in registers. For example, IADD R1 R2 R3 has three operands in the register architecture: it adds the integers stored in registers R1 and R2, and the result is stored in register R3. Both of these architectures have advantages and disadvantages. A register-based instruction must carry explicit register operands, whereas in a stack architecture the operands are located on the operand stack. Therefore, individual instructions in the register architecture are longer than in the stack architecture, and the overall size of a function's bytecode is also larger [27]. However, the number of instructions that need to be executed is usually smaller in the register architecture than in the stack architecture. This leads to faster bytecode execution in the register architecture due to the smaller number of instruction dispatches. In contrast, the memory consumption of the register architecture is higher than that of the stack architecture.
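To make the contrast concrete, here is a small Java method together with the stack-architecture bytecode that javac actually emits for it (shown in comments), and a hypothetical register-architecture encoding of the same computation:

    public class Adder {
        // javac compiles this method body to the stack-architecture bytecode:
        //   iload_1   // push parameter a onto the operand stack
        //   iload_2   // push parameter b
        //   iadd      // pop both, push a + b
        //   ireturn   // pop and return the result
        //
        // A register architecture might instead encode the whole body as a
        // single, longer instruction, e.g.:  IADD R1 R2 R3
        int add(int a, int b) {
            return a + b;
        }
    }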

Automatic Memory Management

In traditional programming languages, it is the responsibility of the programmer to explicitly allocate a memory region in the program for the needed data (or object). The programmer must also deallocate this memory when it is no longer needed. If a memory region is deallocated prematurely, the program may behave undesirably or crash unexpectedly. Conversely, if a memory region is not deallocated promptly after its last use, the program might run out of memory. Such manual memory management errors are tedious to locate and fix. One of the attractive attributes of Java is its automatic memory management using a program called the Garbage Collector (GC). The garbage collector is invoked during the execution of the Java program and, based on the program's current state, it deallocates objects. The basic principle of a garbage collection algorithm is to find the data objects that are no longer reachable during program execution and reclaim the memory used by such objects. To this end, a garbage collection algorithm traverses the object graph starting at the root-set, the set of objects always reachable from the program. All objects that are not visited during the object graph traversal are declared unreachable and hence can be freed. The GC relieves the programmer of the burden of manual memory management and makes Java code more robust.
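The following sketch illustrates reachability: after the last use of the array stored in the static field, the object is still reachable through the root-set, so a standard online GC can never reclaim it. This is exactly the situation that the Offline GC of Chapter 2 targets; the class and field names are invented for illustration:

    public class ReachabilityDemo {
        static int[] cache;               // a static field is part of the root-set

        public static void main(String[] args) {
            int[] scratch = new int[64];  // reachable only through a local variable
            scratch[0] = 42;
            // after this point 'scratch' is unreachable, so a GC may reclaim it

            cache = new int[256];         // stored in a static field
            cache[0] = scratch[0];
            // 'cache' is never read again, yet it remains reachable from the
            // root-set, so an online GC cannot free it
        }
    }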

Java Binary

Figure 1.1: High level view of the class file contents.

The source code of a Java program is compiled into a set of class files. A class file contains Java binary that has three main parts: (1) Constant Pool (CP), (2) Fields, and (3) Functions, as shown in Fig. 1.1. Each class file has a collection of distinct constant values of variable size stored in a Constant Pool (CP). These constant values either define the characteristics of the class or are used by the class itself. A two-byte index is used to access a given CP value, which is usually larger than two bytes. A constant value may be referred to many times from the bytecode using that index. Thus, the use of a two-byte index for referring to large constant values results in a smaller class file size. Each function of a class file, except abstract or native functions, must contain a list of Java bytecode instructions. These instructions are executed by a JVM interpreter or by a JIT compiler. Most of the CP entries are indexed from these instructions. Furthermore, fields are also accessed and updated using these instructions. Fields defined in a class file can be of two kinds: static and non-static. A non-static field has a separate copy in each object of the class it is defined in. In contrast, for a static field there is exactly one copy per class, irrespective of how many objects of the class exist. In addition, a function may have local variables. A local variable is a data field whose scope is limited to the function.
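A small example (the names are chosen only for illustration) showing where these kinds of data live in a class file:

    public class Counter {
        static int instances;        // static field: one copy per class
        int count;                   // non-static field: one copy per object

        // the string literal below is stored as a constant pool entry and is
        // referenced from the bytecode through a two-byte CP index
        static final String NAME = "counter";

        Counter() { instances++; }   // updates the per-class copy

        void increment() {
            int delta = 1;           // local variable: scope limited to this function
            count += delta;          // updates this object's copy
        }
    }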

Java Runtime Data Structure – Frames

A frame is a data structure used to store the operand stack and local variables of a method. Each time a method is invoked, a new frame for that method is created; that frame is destroyed when the invocation completes. The maximum amount of RAM required for the frame of a method is calculated during the generation of the Java binary of that method. The allocation and deallocation of the memory of a frame during program execution are automatic, driven by method invocation and invocation completion. Therefore, garbage collection does not play any role in managing memory for frames.
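For instance, these per-frame requirements are visible in the stack= and locals= figures that javap -v prints for each compiled method; for the hypothetical method below they should come out as shown in the comment:

    public class FrameDemo {
        // javap -v reports for this method: stack=4, locals=3
        //   locals: x occupies slot 0; the 64-bit long y occupies slots 1-2
        //   stack : the two long intermediates of the multiply need 4 slots
        static long square(int x) {
            long y = (long) x * x;
            return y;
        }
    }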

Dynamic Linking and Name Resolution

A Java project consists of a set of class files. An object of one class may use an object of another class by calling its functions or by accessing its fields. However, each class file is created independently of any other class file. Linking between class files takes place dynamically at runtime. Such dynamic linking is possible because the functions and classes of a project use unique names, and each class file, during its creation, uses these unique names as references in its bytecode. During runtime these names are resolved into actual classes and functions. The advantage of dynamic linking is that the creation of each class file is independent of the rest of the project. It implies that updating a class file does not require recreating the binary of the whole Java program, as long as names are not changed. The disadvantage of dynamic linking is that the Java binary of each class file must store the names of all the functions, fields and classes it uses, drastically increasing the size of the Java binary of the program.
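These symbolic names are plainly visible in the constant pool that javap -v prints. For a class containing the call below, an entry roughly like the one in the comment appears, and it is precisely such name strings that TakaTuka's static linking (Chapter 5) removes:

    public class LinkDemo {
        void greet() {
            // Compiled to: invokevirtual #2, where the constant pool stores
            // the call target by name, approximately:
            //   #2 = Methodref  java/io/PrintStream.println:(Ljava/lang/String;)V
            System.out.println("hello");
        }
    }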

1.5 Challenges and Contributions

We have developed a new JVM for tiny embedded devices and wireless sensor motes, called the TakaTuka JVM. The TakaTuka JVM currently runs on 8-bit and 16-bit microcontrollers and requires less than 1 KB of RAM. The TakaTuka JVM has been developed without using the code of any existing JVM, so that each part of it could be optimized for tiny devices. The JVM is open source and is available from the website takatuka.sourceforge.net. This section explains the challenges faced and the contributions made in the development of the TakaTuka JVM for motes and tiny embedded devices. Fig. 1.2 highlights the offline optimizations carried out by the TakaTuka JVM.

Offline Garbage Collection

Garbage collection is a program that is responsible for automatic memory management, as explained in Section 1.4. The garbage collection is invoked during program execution and reclaims objects not reachable from the root-set. A drawback of this approach is that any object which is part of the root-set, or which is reachable from the root-set, cannot be deallocated even if that object is never used again by the program. Therefore, garbage collection wastes RAM by deallocating only a subset of the objects that ought to be deallocated at a given point of the program execution. Furthermore, the garbage collection needs to analyze the program several times during its execution. These analyses require extensive CPU cycles. Thus, garbage collection wastes CPU cycles and reduces the battery lifetime of wireless motes. This work presents Offline Garbage Collection (GC) to alleviate both of these problems. Our approach defies the current practice in which an object may be deallocated only if it is unreachable. The Offline GC allows freeing an object that is still reachable but is guaranteed not to be used again in the program. In addition, it may deallocate an object inside the function, loop or block where it was last used, even if that object is assigned to a static or non-static field. This makes a larger amount of memory available to a program. The Offline GC identifies, during program compilation, the point at which an object can safely be deallocated at runtime. Such offline identification is based on an inter-procedural data flow analysis that is both field-sensitive and flow-sensitive. We have designed three algorithms for making these offline deallocation decisions. To enforce these offline decisions at runtime, the Java bytecode is updated, also at compile time, using two customized instructions. Our implementation of the Offline GC shows a significant reduction in the amount of RAM and the number of CPU cycles needed to run a variety of benchmark programs. The Offline GC is shown to increase the amount of RAM available to a program by up to 66% compared to a typical online garbage collector. Furthermore, the number of CPU cycles consumed in freeing memory is reduced by up to 73.13% when the Offline GC is used. The Offline GC is discussed in detail in Chapter 2 and is under review for publication at the ACM SenSys conference [28].

Figure 1.2: The offline optimizations and checks for the TakaTuka JVM. All optimizations are performed on a desktop computer and, afterwards, a compact Java binary (TUK) is transferred to the host device such as a mote.

Variable Slot Size

The frame is the data structure used to store the local variables and operand stack of a function (see Section 1.4). The maximum number of slots required by the frame of a function is calculated during the creation of class files from the Java source of the program. In the Sun JVM, each slot of a frame is exactly 32 bits long. This means that data types smaller than 32 bits (e.g. the 16-bit short) are also stored in a 32-bit slot and that any data type larger than 32 bits is stored in multiple slots. We may call this the Fixed Slot Size (FSS) scheme, as each slot is always 32 bits long. The Sun JVM is optimized for 32-bit microcontrollers. A 32-bit microcontroller can access a memory address that is a multiple of 32 bits quite efficiently, whereas accessing an address that is not such a multiple requires extra CPU cycles and slows down the execution of the JVM. If every value is stored at a memory offset that is a multiple of 32 bits, the JVM runs faster on a 32-bit microcontroller, but RAM is wasted. A wireless sensor mote usually has either a 16-bit or an 8-bit microcontroller and less than 10 KB of RAM. Unlike a 32-bit microcontroller, an 8-bit microcontroller can efficiently access a memory offset that is a multiple of 8 bits. In order to save RAM on tiny motes, we have designed a Variable Slot Size (VSS) scheme in which a slot of a frame may be 8, 16 or 32 bits long. The TakaTuka JVM gives the programmer the option of selecting the slot size and compiling the Java source accordingly. The results of our studies for both 8-bit and 16-bit slot sizes show that a 16-bit slot size saves RAM as well as CPU cycles on a tiny mote. The VSS scheme is discussed in detail in Chapter 3.
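As a back-of-the-envelope illustration (the method is hypothetical, and exact layouts depend on the compiler and JVM), consider the local variables of the following method under FSS and under VSS with 16-bit slots:

    public class SlotDemo {
        // Local variables: s (16-bit short), b (8-bit byte), n (32-bit int).
        //
        // FSS (32-bit slots):    s, b, n -> 3 slots x 4 bytes = 12 bytes
        // VSS (16-bit slots):    s -> 1 slot, b -> 1 slot, n -> 2 slots,
        //                        i.e. 4 slots x 2 bytes = 8 bytes
        static int mix(short s, byte b) {
            int n = s + b;
            return n;
        }
    }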

Java Binary Optimizations

The Offline GC and the VSS are both designed to increase the RAM available to a Java program. Owing to these optimizations, a program with larger RAM requirements may be executed on a tiny mote that has only a few KB of RAM. However, the Java binary of that program must also fit on the flash of the mote, which is used to store programs. Unfortunately, the collective size of a project's class files is usually very large, which makes it impractical to fit a feature-rich Java program on the flash of a mote. TakaTuka has a novel Java binary format called Tuk. The Tuk file of a program contains all the class files of the project, yet its size is many times smaller than the aggregate size of those class files. The Tuk file is produced by performing three kinds of optimizations on the class files of a program:

1. Bytecode Optimizations

2. Constant Pool Optimizations

3. Format Optimizations

The set of instructions executed by a JVM is called Java bytecode. TakaTuka has two different kinds of bytecode optimizations, namely single instruction optimization and multiple instruction optimization. These optimizations reduce the size of the bytecode; in addition, the optimized bytecode executes faster. The bytecode optimizations are described in Chapter 4.

The Constant Pool (CP) of a single class file has no redundant information; however, the CPs of different class files of a program may contain the same information. The Tuk file has a Global CP (GCP) that holds the constants of all the class files without any redundancy. Furthermore, instead of using names for dynamic linking at runtime, TakaTuka uses static linking, carried out during program compilation. Static linking does not rely on the names of class files, functions and fields stored in the CP. Therefore, the GCP omits names that are not otherwise used in the program, which makes it smaller still. On average, the Tuk GCP for a given set of programs is 95% smaller than the corresponding regular CPs of those programs. Chapter 5 presents a detailed discussion of the GCP.

The format of the data stored in the Tuk file also differs from the class file format. The Tuk format is compact, so that the information available in the class files is represented in a smaller space. Furthermore, the Tuk file contains only the information that is required during program execution. For example, a class file contains information that is used for runtime bytecode verification. The verification process checks the integrity of the class file (e.g. whether a branch target is valid). In TakaTuka, bytecode verification is performed on the desktop computer before transferring the Tuk file onto the mote.

Therefore, the information required only for runtime bytecode verification, and not for program execution, is removed from the Tuk file, which further reduces its size. The Tuk file format is described in Chapter 5. The TakaTuka binary optimizations have been published in Refs. [6] and [8].

Interpreter Design

The interpreter is stored on the flash of a mote alongside the Java binary of a project. Hence the interpreter should be small enough to fit on the mote's flash. The TakaTuka interpreter has been developed as a set of modules, which enables the programmer to remove certain modules to reduce the size of the interpreter. For example, a programmer who knows that floating point operations will never be required by the program may remove the floating point modules from the interpreter. Moreover, the interpreter may also be shrunk based on the bytecode instructions used by a program. Java has 204 bytecode instructions, and even large applications usually do not use all of them. By default, the TakaTuka interpreter automatically removes support for instructions that are not used by a given program. The advantage of this approach is that in most cases the interpreter has a very small size; the disadvantage is that the interpreter requires recompilation for every new program.

Besides being small enough to fit on a mote's flash, an interpreter for motes should also be optimized for bytecode execution speed, because motes are usually battery powered and faster execution results in a longer battery lifetime. Instruction dispatch and fetching the operands of an instruction are known to be the primary time-consuming tasks in JVM interpreters [27]. Traditionally, instruction dispatch in an interpreter is implemented using a switch. If a switch is used, several comparisons have to be made to interpret each bytecode instruction, which results in slow bytecode execution. In order to reduce instruction dispatch time, the TakaTuka interpreter employs the labels-as-values approach, providing direct threading instead of the traditional switch. The labels-as-values approach requires only a small constant amount of work to dispatch a bytecode instruction and thus exhibits better execution time than a switch. More details of the TakaTuka interpreter design are discussed in Chapter 6. Portions of the TakaTuka interpreter design have been published in Refs. [6] and [8].
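Labels-as-values is a GCC C extension, so TakaTuka's actual dispatch cannot be written in Java. As a rough Java analogue of the idea, the switch of the earlier ToyInterpreter sketch can be replaced by a table indexed directly by the opcode; the handler table and opcode numbering below are invented:

    import java.util.function.IntBinaryOperator;

    /** Toy table-driven dispatch: the opcode indexes the handler directly,
     *  avoiding the comparison chain of a switch (a Java approximation of
     *  direct threading; the real interpreter uses C labels-as-values). */
    public class TableDispatch {
        static final IntBinaryOperator[] HANDLERS = {
            (a, b) -> a + b,   // opcode 0: IADD
            (a, b) -> a - b,   // opcode 1: ISUB
            (a, b) -> a * b,   // opcode 2: IMUL
        };

        public static int execute(int opcode, int a, int b) {
            return HANDLERS[opcode].applyAsInt(a, b); // one indexed load, no comparisons
        }

        public static void main(String[] args) {
            System.out.println(execute(2, 6, 7)); // IMUL -> 42
        }
    }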

Hardware and Operating System Abstraction Layer

The TakaTuka JVM provides the programmer with the ability to write a program for motes in Java instead of a low-level language. Such a program may need to use hardware drivers, including drivers for the radio, Light Emitting Diodes (LEDs) and sensors. TakaTuka aims at supporting a large variety of hardware platforms, but each hardware platform may have a different set of drivers. Many existing operating systems developed for motes support these drivers. However, each operating system has a different set of functions and a different model for supporting hardware drivers. For example, TinyOS has an event-driven model whereas the Contiki operating system has a threading-based model; furthermore, the two operating systems have different sets of driver functions. The TakaTuka design supports all the existing operating systems without depending on any one of them. TakaTuka provides programmers with a set of hardware-independent Java interfaces for accessing drivers. These interfaces hide the operating system model and the low-level driver functions from the Java programmer. A programmer writing code using TakaTuka does not have to know which operating system is supported by a given mote. The Java code written for one mote may run on any other mote without any changes, even when the mote has different hardware or a different operating system. This is accomplished by providing the "hardware and operating system abstraction layer" shown in Fig. 1.3. The abstraction layer is discussed in Chapter 6.

Figure 1.3: TakaTuka hardware and operating system abstraction (HOAL). The HOAL includes the HAL, Java interfaces and an object factory.
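A hedged sketch of what such a hardware-independent interface plus object factory might look like (all names here are invented for illustration, not TakaTuka's actual API):

    // Hypothetical hardware-independent driver interface
    interface Led {
        void on();
        void off();
    }

    // Hypothetical object factory that hides the platform-specific driver;
    // on a TinyOS mote it would return a TinyOS-backed implementation,
    // on a Contiki mote a Contiki-backed one.
    final class DriverFactory {
        static Led createLed() {
            return new Led() {        // stand-in implementation for illustration
                public void on()  { /* native driver call would go here */ }
                public void off() { /* native driver call would go here */ }
            };
        }
    }

    class Blink {
        public static void main(String[] args) {
            Led led = DriverFactory.createLed(); // application code never sees the OS
            led.on();
        }
    }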

Chapter 2

Offline Garbage Collection


A typical wireless sensor mote has a limited amount of hardware resources: the RAM is around 10 KB or less, the flash is usually limited to 128 KB and a 16-bit or an 8-bit processor is used [29]. Despite their limited resources, several motes may form a network to realize many useful and powerful applications. Such applications include environment monitoring, military surveillance, forest fire monitoring, industrial control, intelligent agriculture and robotic exploration, to name a few. The motes were traditionally programmed using low-level languages, but recent advances in Java Virtual Machine (JVM) design enable programming these motes in the Java language [7, 6, 12]. The capability to program the motes in a high-level language not only expands the developer base but also allows rapid application development.

Garbage collection is one of the features that make Java particularly attractive for application development on memory-constrained motes. The basic principle of a garbage collection algorithm is to find the data objects that are no longer reachable during program execution and reclaim the memory used by such objects. To this end, a garbage collection algorithm traverses the object graph starting at the root-set, the set of objects always reachable from the program. All objects that are not visited during the object graph traversal are declared unreachable and hence can be freed. All existing garbage collection algorithms work on this basic principle, although the exact algorithmic details vary across implementations [30]. This approach is online, as the objects that can be deallocated are identified during the execution of the program. Online garbage collection has two main drawbacks, specifically for resource-constrained devices such as wireless sensor motes:

• An online garbage collector cannot deallocate an object that is reachable from the root-set, even though the object may never be used again in the program. This reduces the amount of RAM that could otherwise be available to the program.

• The online garbage collector is frequently called during program execution, each time traversing the object graph, even if only a small number of objects can be freed. This consumes excessive CPU cycles, resulting in a reduced battery lifetime.

In contrast with online garbage collection, offline schemes can make deallocation decisions during the compilation of the program [31, 32, 33, 34, 35]. However, these schemes do not typically aim to deallocate an object that is still reachable from the root-set or is part of the root-set. Such schemes reduce the CPU cycle consumption of a program without making additional RAM available to it. This thesis presents a new offline scheme, called the Offline GC. The distinguishing feature of the Offline GC is that it allows freeing objects that are still reachable at that point of execution. Thus, the Offline GC offers two major benefits when building applications for resource-constrained devices: (i) a larger free memory pool becomes available to a program during runtime, allowing the deployment of bigger programs with extensive functionality; (ii) the search for unreachable objects by traversing the object graph, as in traditional online garbage collectors, can be performed less frequently, saving CPU cycles and battery life. Note that the Offline GC does not completely eliminate the need for a traditional garbage collector; however, the latter is invoked much less frequently.

The Offline GC scheme works in three phases, the first two of which occur during program compilation while the third occurs during program execution. (1) The data-flow analysis phase, during which the Java bytecode is analyzed and the necessary program information is collected; for this purpose, we employ a component called the Offline GC Data-Flow Analyzer (OGC-DFA). (2) In the second phase, deallocation decisions are made based on the data collected in the first phase, and the bytecode is updated accordingly. We refer to it as the bytecode updating phase, and the component of the Offline GC scheme which implements this phase is referred to as the Offline Garbage Identifier (OGI). (3) The third phase, the runtime deallocation phase, is the one in which the objects that are no longer needed (and identified as such after the first two phases) are actually deallocated. For this purpose, the information embedded in the Java bytecode during the bytecode updating phase is utilized. Altogether, the following contributions are made:

1. Design and implementation of an inter-procedural, field-sensitive, flow-sensitive, and context-sensitive data flow analyzer.

2. Design and development of three algorithms to make offline deallocation decisions by identifying the bytecode points where objects that may still be reachable can be safely deallocated.

3. Introduction of two new bytecode instructions to enforce offline deallocation decisions, and development of JVM support for executing these bytecode instructions.

4. A thorough evaluation of the Offline GC scheme for a variety of benchmarks on real world platforms (sensor motes).

2.1 Motivating Example

We first present an example, in Fig. 2.1, that motivates the use of the Offline GC. Although the Offline GC operates on bytecode, this example uses source code for better readability. Furthermore, the example shows a single function for the sake of simplicity, even though the Offline GC scheme is inter-procedural. The function foo of the example uses three objects. (1) An object inObj that is received as a parameter and is used in many other functions of the program; the Offline GC will free inObj inside foo if this is the last function where it is used. (2) An object localObj that is created inside a while loop; this object is deallocated immediately after its last use inside the same loop, given that each loop iteration overwrites the local variable. (3) An object globalObj that is stored in a global field and may be used by many other functions; this object is also deallocated after its last use, if that happens to be within the foo function. All three objects are deallocated by the Offline GC while they are still reachable and could not be deallocated by a typical online GC. Furthermore, these objects are deallocated inside the same function, loop or code block they are used in. Our scheme may track individual objects; that is, multiple objects allocated at the same site can be deallocated at different points of execution. For example, an object contained within localObj always has the same allocation site but could be freed at two different points of execution, as indicated in the if-else block within the while loop.

public void foo(Input inObj) {
    bar(inObj);
    if (...) {
        inObj.call();
        // OFFLINE_GC_FREE for inObj
    }
    while (...) {
        LocalObj localObj = new LocalObj();
        localObj.foobar(globalObj);
        if (...) {
            localObj.i++;
            // OFFLINE_GC_FREE for localObj
        } else {
            localObj.j++;
            // OFFLINE_GC_FREE for localObj
        }
    }
    globalObj.k = 5;
    // OFFLINE_GC_FREE for globalObj
}

Figure 2.1: Deallocation of reachable objects by the Offline GC.

2.2 Offline GC Data-Flow Analyzer

The Offline GC data-flow analyzer (OGC-DFA) is built upon an existing data-flow analyzer (BV-DFA) used for Java bytecode verification (§ 4.9.2 of [11]). The OGC-DFA and BV-DFA work in similar ways: they both are flow-sensitive and path-insensitive, and operate on object types rather than real objects. However, they differ in some important aspects. First, our OGC-DFA is interprocedural, context-sensitive, as well as field-sensitive. Second, unlike the BV-DFA, the OGC-DFA creates a new object type only when encountering an instruction that would actually create an object at runtime. Third, the OGC-DFA differentiates between the reference types created by two different instructions, even when both of these reference types are the same. The OGC-DFA maintains this differentiation by assigning a unique tag to the reference type based on the instruction which created that reference type. Owing to these differences, the OGC-DFA is able to conduct a more precise analysis than the BV-DFA. We first provide a review of the Java bytecode verification data-flow analyzer (BV-DFA) on which the OGC-DFA is based. This is followed by a description of the enhancements made by the Offline GC data-flow analyzer (OGC-DFA) in the subsequent section.

BV-DFA: Implementation Review

The existing Java bytecode verification mechanism analyzes and verifies the bytecode for each function in a given program. In this process, each function is analyzed exactly once in an arbitrary order with no relation to the order of execution [11]. Each function has an indexed list of bytecode instructions. For a function being analyzed, the BV-DFA maintains the following data structures:

• A change-bit for each bytecode instruction. If this bit is set, it indicates that the instruction needs to be examined. These bits change state during the analysis procedure, and the function analysis is terminated when all these bits become unset.

• An entry-frame for each bytecode instruction. This frame consists of two structures, each of which contains the “types” of data values rather than the actual values. The first is a list of (types of) local variables (referred to as LocVar) and the second is a stack containing types of instruction operands (referred to as OpStk).

• A temp-frame which is meant to temporarily hold the data from an entry-frame. There is only one temp-frame needed during the verification of the entire function.

When a function is picked by the BV-DFA for verification analysis, the first bytecode instruction has the following status: the change-bit is set, the OpStk is empty, and LocVar contains only the types of the input parameters of the function. For all the remaining bytecode instructions, the change-bit is unset and the OpStk and LocVar are empty. The working of the BV-DFA is as follows:

1. Select any instruction whose change-bit is set, and unset the change-bit. If no instruction with a set change-bit is found, the data-flow analysis for the current function terminates.

2. Copy the entry-frame of the selected instruction into the temp-frame. Execute the selected instruction and update the temp-frame based on its result. Note that an instruction execution is based on data types rather than actual values. For example, execution of ILOAD_1 pushes the data type integer from LocVar (at index 1) onto the OpStk, rather than pushing an actual integer value.

3. Determine the set of all possible candidate instructions that may follow the current instruction in the control flow graph of the function [11]. Clearly, there is only one instruction from the candidate set which follows the current instruction during program execution with actual data. However, it is not always possible to ascertain which of the instructions from the candidate set will follow the current instruction just by looking at the types during the data-flow analysis. Hence, in the process of data-flow analysis, all possible instructions that may follow next are included in the candidate set.

4. Using the information in the temp-frame, update the OpStk and LocVar of the entry-frame for each instruction in the candidate set of next instructions.

• If a candidate instruction has never been analyzed before, then copy the temp-frame into the entry-frame of that instruction and set its change-bit.

• If a candidate instruction has been analyzed before, then “merge” the temp-frame into the entry-frame of that instruction. In this case, the change-bit is set only if the “merge” process changes any element in the OpStk or LocVar of the entry-frame of the candidate instruction.

5. Continue at step 1.
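Steps 1–5 amount to a standard worklist fixed-point iteration. The following is a minimal Java sketch under assumed scaffolding: Frame (with copy and a boolean-returning mergeInto), Instruction, interpret, and successorsOf are hypothetical stand-ins for the structures described above, not part of any real verifier API.

// Hypothetical worklist loop of the BV-DFA. A null entry-frame marks an
// instruction that has never been analyzed; interpret() executes an
// instruction on types rather than values.
static void analyzeFunction(Instruction[] code, Frame[] entryFrames, boolean[] changeBit) {
    changeBit[0] = true;                  // only the first instruction starts marked
    boolean processedAny = true;
    while (processedAny) {                // terminate when all change-bits are unset
        processedAny = false;
        for (int i = 0; i < code.length; i++) {
            if (!changeBit[i]) continue;
            changeBit[i] = false;         // step 1: pick a marked instruction
            Frame temp = entryFrames[i].copy();      // step 2: copy entry-frame
            interpret(code[i], temp);                //         and execute on types
            for (int next : successorsOf(code, i)) { // step 3: candidate successors
                if (entryFrames[next] == null) {     // step 4a: never analyzed before
                    entryFrames[next] = temp.copy();
                    changeBit[next] = true;
                } else if (temp.mergeInto(entryFrames[next])) { // step 4b: merge changed
                    changeBit[next] = true;
                }
            }
            processedAny = true;          // step 5: keep iterating
        }
    }
}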

To explain the merge process, we first note that for a valid Java binary, the control flow graph ensures that the primitive data types1 at corresponding locations in temp-frame and entry-frame are identical. This holds true for primitive type elements in OpStk as well as in LocVar. However, the reference types at the corresponding locations (elements of OpStk or LocVar) of temp-frame and entry-frame may differ. If these two reference types are identical, the merge process takes no action. Otherwise, it works as follows:

1. For OpStk: The least common superclass of the two reference types is determined and the element of the OpStk in the entry-frame is updated with its type.

1 The primitive data types include byte, short, char, boolean, int, long, float and double. Note that the BV-DFA treats byte, short, char and boolean as the int data type and does not differentiate between them.

2. For LocVar: The corresponding element of the LocVar in entry-frame is set as empty (or unused).
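To make these merge rules concrete, the following is a minimal Java sketch; the Type class, its EMPTY constant, and leastCommonSuperclass are hypothetical stand-ins for the verifier's internal type representation, not a real API.

// Hypothetical sketch of the BV-DFA merge for a single frame element.
// Primitive types at corresponding locations are guaranteed identical,
// so only reference types need merging.
static Type mergeOpStkElement(Type entry, Type temp) {
    if (entry.equals(temp)) return entry;      // identical: the merge takes no action
    return leastCommonSuperclass(entry, temp); // e.g. String and Integer merge to Object
}

static Type mergeLocVarElement(Type entry, Type temp) {
    // a mismatch makes the local variable slot unusable rather than widened
    return entry.equals(temp) ? entry : Type.EMPTY;
}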

Offline GC DFA

The Offline GC Data-Flow Analyzer (OGC-DFA) builds upon the BV-DFA, also working with data types rather than actual values, albeit with some important differences. To highlight these differences, we first note that the goal of the BV-DFA is bytecode verification, whereas the OGC-DFA must collect information that can be used to free actual objects during program execution. The BV-DFA treats types liberally, which is acceptable for the purpose of bytecode verification but not for reclaiming allocated memory. To aid the understanding of the OGC-DFA, we list below the mechanisms in which it differs from the BV-DFA.

Precision in Generation of Reference Types

When the BV-DFA encounters a reference type for inclusion in the OpStk or LocVar of the temp-frame, this reference type may refer to any superclass of the actual data object that would be encountered at runtime. In contrast, the OGC-DFA must know, during data-flow analysis, the exact reference type that would be encountered at runtime. To obtain the precise type information in the OGC-DFA, we note that new objects can be created at runtime using only the following five bytecode instructions [11]: NEW, NEWARRAY, ANEWARRAY, LDC and MULTIANEWARRAY. Both the BV-DFA and the OGC-DFA create new reference types when any of these bytecode instructions are analyzed. In addition to this, the BV-DFA may create a new reference type while (i) analyzing bytecode instructions that either invoke a function or retrieve data from a field, or (ii) merging the temp-frame into the entry-frame. The OGC-DFA does not create new reference types in these additional cases.

Reference Type Tagging

Rather than tracking individual objects, the OGC-DFA tracks a group of objects created by an instruction at a specific location within the bytecode. This grouping is more precise than the type-based grouping of objects in the BV-DFA. To identify a group of objects, the OGC-DFA assigns a tag to the reference type created by an instruction at index i in function f. The tracking is then based on the assigned tag, which ensures that each object in a group has been created by an instruction at a specific location within the bytecode. That is, each reference type created by an instruction at index i in the instruction list of function f will always be assigned the same tag, no matter how many times that instruction is analyzed. Furthermore, this tag will be different from

the tag assigned to a reference type created by any other instruction, even if that instruction creates the same reference type. The OGC-DFA, therefore, works with tags, treating the reference type information as implicit. Thus, we will use the term “tag” in place of “tagged reference type” in the discussion that follows. A minimal sketch of such a tag is given below.
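In other words, a tag is essentially the pair (function, instruction index) of the creating instruction. A minimal sketch, with hypothetical names:

// Hypothetical tag: identifies the group of all objects created at runtime
// by the bytecode instruction at index instrIndex of the named function,
// regardless of how often that instruction is analyzed or executed.
final class Tag {
    final String function;  // e.g. "foo"
    final int instrIndex;   // index in the function's instruction list

    Tag(String function, int instrIndex) {
        this.function = function;
        this.instrIndex = instrIndex;
    }

    @Override public boolean equals(Object o) {
        return o instanceof Tag
                && ((Tag) o).function.equals(function)
                && ((Tag) o).instrIndex == instrIndex;
    }

    @Override public int hashCode() {
        return 31 * function.hashCode() + instrIndex;
    }
}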

Frame Merging

The OGC-DFA needs to maintain precise type information so that the OGI can subsequently make correct deallocation decisions. Since the merge process in the BV-DFA loses precise type information, as described in Section 2.2, the OGC-DFA enhances the merge process as follows: each element in the OpStk is allowed to maintain a set of tags, and an element of the OpStk of the temp-frame is merged into the corresponding element of the entry-frame by taking the union of their respective sets of tags. This enhancement has two implications. First, elements of LocVar are also permitted to maintain a set of tags. Second, the execution of some instructions during the data-flow analysis phase is modified, as they need to operate on a set of tags instead of a single reference type per element. A sketch of this set-union merge follows.
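The following minimal sketch illustrates the set-union merge, reusing the hypothetical Tag class sketched earlier:

import java.util.Set;

// Hypothetical OGC-DFA merge: each OpStk or LocVar element carries a set of
// tags, and merging the temp-frame element into the entry-frame element is a
// set union. The return value signals whether the entry element grew, i.e.
// whether the candidate instruction's change-bit must be set.
static boolean mergeTagSets(Set<Tag> entryElement, Set<Tag> tempElement) {
    return entryElement.addAll(tempElement); // true if anything new was added
}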

Storing Reference Types in Fields

In Java there are two kinds of fields: static and non-static. The OGC-DFA uses field data structures, one per class for static fields and one per tag for non-static fields. This is different from the BV-DFA, which does not use any such structure. The consequence is that the OGC-DFA can collect precise type information from fields while the BV-DFA is unable to do so. Collecting this precise information is necessary for the OGC-DFA to identify the correct type of an object a bytecode instruction may load from a field. Maintaining the field information has implications for the data-flow analysis: each bytecode instruction that receives data from a field must have its change-bit set whenever that field is updated. This is necessary because the order of execution cannot be determined at compile time. For the same reason, a tag stored in a field data structure is never overwritten. Thus, a field data structure contains a list of tags, each indicating a possible reference type that the field may hold during program execution. The OGC-DFA handles arrays in a similar way; however, the index of the array elements is ignored.

Interprocedural Analysis

Besides collecting precise type information through the use of tags, the OGC-DFA also tracks which set of bytecode instructions will use a specific tag during program execution. That is, it tracks the tags passed to a function, returned from a function, and used by various instructions within that function. Therefore, the OGC-DFA performs the analysis in an interprocedural, context-sensitive manner; it always starts from the main function of the application. When the OGC-DFA encounters an invoke instruction within a function being analyzed, it halts the analysis of the current function and continues with the newly invoked function until its analysis is completed.2 The invoked function may have multiple bytecode instruction branches, each returning its own set of tags to the parent function. All such tags are merged and provided to the parent function, whose analysis continues from where it was left off. Scalability: A consequence of the interprocedural nature of the analysis carried out by the OGC-DFA is that a function may come up for analysis multiple times if it is invoked from multiple locations in the bytecode. To speed up the analysis for large programs, the OGC-DFA “caches” the analysis results (the set of return tags) for a function invoked with given parameters. Thus, in case of a cache hit, re-analysis of the invoked function is avoided. A minimal sketch of such a cache follows.
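The caching can be pictured as memoizing a Function-Call on its input tag sets; the names below are hypothetical illustrations, not the thesis implementation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical memoization of interprocedural results: a Function-Call is
// keyed by the callee and the tag sets of its parameters, and maps to the
// set of tags the callee may return.
final class AnalysisCache {
    private final Map<String, Set<Tag>> cache = new HashMap<>();

    // null means a cache miss: the callee must be (re-)analyzed
    Set<Tag> lookup(String function, List<Set<Tag>> paramTags) {
        return cache.get(key(function, paramTags));
    }

    void store(String function, List<Set<Tag>> paramTags, Set<Tag> returnTags) {
        cache.put(key(function, paramTags), returnTags);
    }

    private static String key(String function, List<Set<Tag>> paramTags) {
        return function + ":" + paramTags; // simplistic key, adequate for a sketch
    }
}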

Threads

The OGC-DFA analyzes all the threads independently, one after the other, except in the case when there is an update in the data structure for a field that is shared among multiple threads. In such a case, the OGC-DFA identifies all the bytecode instructions that may retrieve data from that field, in all functions of all threads analyzed previously. Then, it sets the change-bit for those instructions and partially re-analyzes all the previously analyzed affected threads. This ensures that each instruction considers a superset of tags whose corresponding reference types may be used by the instruction at runtime, without making any assumption about thread context-switching.

2.3 Offline Garbage Identifier

The Offline Garbage Identifier (OGI) works during the compilation of a program, in the second phase of our Offline GC scheme. This component is responsible for (i) making deallocation decisions using three graph algorithms, and (ii) updating the bytecode so that these deallocation decisions can be carried out at runtime. The bytecode is updated by inserting the following two customized instructions:

• OFFLINE_GC_NEW: This instruction assigns the given tag to a newly created object at runtime. Thus, the OGI inserts it immediately after each bytecode instruction which creates a new object.

2 A function is invoked after following the proper function lookup procedures as described in Chapter 6 of the JVM specification [11].

• OFFLINE_GC_FREE: This instruction deallocates all the objects which were previously assigned the given tag during program execution. Thus, the OGI inserts it immediately after the instructions where it is safe to deallocate all the objects grouped by the given tag. Such insertion points are determined based on the offline deallocation decisions.

We have designed three graph algorithms for making offline deallocation decisions. These algorithms determine the insertion points within the bytecode for the OFFLINE_GC_FREE instructions. Each of these algorithms is used independently and is suited for different code scenarios. Our implementation uses all three algorithms to ensure that a variety of programs can benefit from the offline deallocation decisions. We use the following terminology to describe the working of these algorithms.

Function-Call: This is simply the invocation of a given function with a given set of input tags.

Instruction-Graph of a function: This is a directed graph in which each node represents a bytecode instruction in that function. These nodes are connected through links which collectively chalk out every possible execution path through that function only, ignoring the invocation of any other function. Each node in the Instruction-Graph of a function contains the set of all tags that may be used by the corresponding instruction, for a given Function-Call.

Instruction-Graph of a thread: The Instruction-Graph of a thread chalks out every possible execution path through the entire program thread. Thus, it is quite similar to the Instruction-Graph of a function except that the control flow (i) starts at the entry function of the thread, and (ii) may now follow the function invocations.

Function-Call-Graph of a thread: This is a directed graph in which each node represents a Function-Call. Therefore, this graph provides all possible flows in the thread at the granularity of a function.

DAG of an Instruction-Graph: The DAG of an Instruction-Graph is obtained by substituting each strongly connected component in the Instruction-Graph with a single node. Such a composite node of the DAG contains all the tags of all the nodes of the underlying strongly connected component.

Algorithm 1: PNR (Point of No Return)

This algorithm is designed to find the points of no return (PNR) of a tag. A PNR of a tag is a bytecode instruction at which it is guaranteed that the given tag is used neither at this instruction nor at any subsequent instruction during the execution of the program. All the objects assigned this tag at runtime can safely be deallocated at this point of execution. Thus, the OFFLINE_GC_FREE instruction for a tag can be inserted immediately before a PNR of that tag. The PNR algorithm operates on the DAG of the Instruction-Graph of a program thread, as shown in the associated pseudocode.

Algorithm 1 Point of No Return (PNR)
1: Consider the DAG G = (V, E) of the Instruction-Graph
2: Compute a topological ordering v1, ..., v|V| of the nodes of G, starting with the sinks
3: for i ← 1 to |V| do
4:   BMvi ← set of all tags used in vi, joined with the sets BMw of all successors w of vi (stored as a bitmap)
5: for i ← 1 to |V| do
6:   for each predecessor p of vi in G do
7:     for each tag t of p do
8:       if t ∉ BMvi then
9:         Mark vi as the PNR of t
10:        Add an OFFLINE_GC_FREE instruction for tag t at its PNR
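A compact Java rendering of this pseudocode might look as follows; the Node type with its usedTags, successors, predecessors, and insertFreeBefore members, and the sinks-first (reverse topological) ordering, are hypothetical scaffolding around the algorithm, not the thesis implementation.

import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of Algorithm 1 on the DAG of an Instruction-Graph.
static void markPNRs(List<Node> sinksFirstOrder, int tagCount) {
    // Phase 1: BM(v) = tags used at v or at any node reachable from v.
    for (Node v : sinksFirstOrder) {
        v.bm = new BitSet(tagCount);
        for (int t : v.usedTags) v.bm.set(t);
        for (Node w : v.successors) v.bm.or(w.bm); // sinks-first order: w.bm is ready
    }
    // Phase 2: v is a PNR of tag t if a predecessor uses t but t is
    // guaranteed unused from v onward (t not in BM(v)).
    for (Node v : sinksFirstOrder) {
        for (Node p : v.predecessors) {
            for (int t : p.usedTags) {
                if (!v.bm.get(t)) {
                    v.insertFreeBefore(t); // OFFLINE_GC_FREE for t, just before v
                                           // (a real implementation inserts once per tag)
                }
            }
        }
    }
}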

Moving the PNR: Consider the situation when multiple bytecode instructions can invoke a function. If it is determined that a tag can safely be freed within the function on any such invocation, the algorithm places the PNR of that tag after the last invocation of that function. As a result, the OFFLINE_GC_FREE instruction is never inserted within the bytecode of a function that is invoked from multiple instructions. This avoids an inadvertent deallocation of an object during an earlier invocation of that function than the one during which the object can be safely deallocated. It is important to note that the algorithm can place a PNR within the bytecode of a function that is invoked multiple times from the same bytecode instruction, such as in a loop.

Strengths and Limitations

The strength of the PNR algorithm is that it can find PNRs for tags that represent either field objects or local variables. However, the usefulness of this algorithm is diminished for programs that do not have exit points. This is because the PNR algorithm works on the DAG of the Instruction-Graph of a program thread and has no visibility inside the composite nodes of the DAG.

Scalability

In addition to using the PNR algorithm for applications developed for tiny wireless sensor motes, we also aim to use this algorithm for large programs developed for powerful computers. However, the Instruction-Graph of a program thread could be exponentially

large, making the PNR algorithm impracticable for some large programs. This is because multiple invocations of a function from different bytecode instructions will create multiple subgraphs, each representing the same function, in the Instruction-Graph of the thread. To maintain scalability, we use the following solution: if a function is invoked from more than n separate bytecode instructions, (i) we disallow the control flow through that function, and (ii) we import all the tags from that function on each invocation. The following theorem bounds the complexity of the PNR algorithm.

[Figure 2.2 appears here: the Instruction-Graph of an example thread, with nodes such as “main, NEW, [1]”, “bar, IFNONNULL, [1,2]”, “bar, INVOKESTATIC, [3, 4]”, and “bar, GOTO”; the edges of the graph are not recoverable from the text.]

Figure 2.2: Instruction-Graph of an example thread: Each node represents a bytecode instruction and contains all possible tags that may be used during the execution of that instruction. The nodes are labeled with the name of the containing function, the instruction mnemonic, and the tags contained. The graph depicts two functions, main and bar. Two rectangular nodes indicate the invocation of a third function, call2Times, through which the control flow is avoided as an optimization (Section 2.3).

Theorem 2.1. The worst case time complexity of the PNR algorithm is O(|T| · (|V| + |E|)) ≤ O(|V| · (|V| + |E|)), where V and E are the nodes and edges in the Instruction-DAG and T is the set of tags.

Proof. First note that the number of tags is bounded by |V| because a node can create at most one tag. The topological ordering of a DAG can be computed in time O(|V| + |E|).

For the computation of the bitmaps BMv, each edge of the graph causes at most |T| operations for updating the bitmaps from the successors, plus |T| operations per node for initializing the bitmaps. This yields a runtime of O(|T| · (|V| + |E|)). Analogously, the computation of all PNRs needs the same amount of operations: each edge is processed only once and causes |T| operations for checking the bitmaps. So, the resulting worst case time is O(|T| · (|V| + |E|)).

Example

An example Instruction-Graph of a program thread is shown in Fig. 2.2. The rectangular nodes indicate the invocation of a function call2Times with maximum call limit n = 1 (see Section 2.3). The corresponding DAG on which the PNR algorithm operates is shown in Fig. 2.3. In this example, the PNR of tag 1 and tag 2 is the INVOKESTATIC instruction shown in the rectangular node. From this node onward in the control flow, these two tags are guaranteed to remain unused. Similarly, the RETURN instruction of the main function is the PNR for tag 3 and tag 4.

Algorithm 2: FTT (Finding the Tail)

The FTT algorithm operates on the Function-Call-Graph of a thread. It makes deallocation decisions for objects that would be stored, at runtime, only in local variables (and never in a field). By definition, a tag for such objects is generated by a bytecode instruction in exactly one function. For each appearance of that generating function, we can find subgraphs, disconnected from each other, in the Function-Call-Graph such that:

1. Each node representing the function that generates a given tag is included in exactly one subgraph.

2. Each node in the subgraph uses the given tag.

3. Any node that is not part of a subgraph and is directly connected to it must not use the given tag.

The FTT algorithm is designed to find the tail of each subgraph. The tail of a subgraph is a node which represents the function that is the last to use a given tag. This function: (i) contains the instruction which either generates this tag or invokes another function which returns this tag, and (ii) neither returns this tag nor receives this tag as an input parameter.

Once the FTT algorithm determines the tail of a subgraph, it applies the PNR algorithm to the DAG of the Instruction-Graph of the function represented by the tail. The PNR algorithm may successfully identify the insertion point of the OFFLINE_GC_FREE instruction; otherwise, the OFFLINE_GC_FREE instruction is inserted immediately after the instruction that invoked the function represented by the tail. The pseudocode of the complete algorithm is:

Algorithm 2 Finding the Tail (FTT)
1: Compute FG as the Function-Call-Graph of a thread
2: for each tag t which is never stored in a field do
3:   Compute the induced subgraph FG(t) of FG by the nodes which use t
4:   Determine all recursive vertices in FG(t), i.e., nodes v in FG(t) for which a directed path starting and ending at v exists in FG(t)
5:   for each vertex v ∈ FG(t) do
6:     if v represents a tail method m for t then
7:       if v is not recursive then
8:         Create the DAG G′ of the Instruction-Graph of v
9:         Apply the PNR algorithm on G′ for tag t
10:      else
11:        for each predecessor w ∈ V(FG) \ V(FG(t)) of v do
12:          Add an OFFLINE_GC_FREE instruction for tag t in w, immediately after the instruction that invoked m

Strengths and Limitations

The FTT algorithm, unlike the PNR algorithm, may have visibility into the strongly connected components of the Instruction-Graph of a thread. It also works for programs that never exit, and it can even free references inside the loop within which they are used. The main limitation of this algorithm is that it cannot free any reference that is ever stored in a field.

Scalability

The time complexity of FTT is O(|T| · (|V′| + |E′| + |V″| + |E″|)), where T is the set of tags, V′ and E′ are the nodes and edges of the Function-Call-Graph, and V″ and E″ are the maximum number of nodes and edges of the Instruction-Graph of a method. Since the Function-Call-Graph and the Instruction-Graph of a method are very small compared to the Instruction-Graph of a thread, we have O(|T| · (|V′| + |E′| + |V″| + |E″|)) < O(|T| · (|V| + |E|)), where |V| and |E| are the number of nodes and edges of the Instruction-Graph of the thread.

[Figure 2.3 appears here: the DAG of the Instruction-Graph of Fig. 2.2, with nodes such as “main: NEW, Tags:[1]”, a composite node “bar: LOAD_REFERENCE, ..., bar: GOTO, Tags:[1, 2]”, “bar: INVOKESTATIC, Tags:[3, 4]”, and the RETURN nodes of bar and main; the edges are not recoverable from the text.]

Figure 2.3: DAG of an Instruction-Graph: Showing the DAG of the Instruction-Graph of Fig. 2.2. The strongly connected component from Fig. 2.2 has been replaced with a single composite node.

[Figure 2.4 appears here: the Function-Call-Graph of a thread, with nodes main (Tags:[1,2,3]), TimeConstruction (Tags:[1,5]), Populate (Tags:[3,4] and Tags:[4,5]), NumIters (Tags:[5]), and TreeSize (Tags:[5], twice); the edges are not recoverable from the text.]

Figure 2.4: Function-Call-Graph of a thread: Each node contains all possible tags used in the corresponding Function-Call.

Example

An example Function-Call-Graph is shown in Fig. 2.4. For it, the FTT algorithm frees the tags as follows: tags 1, 2, and 3 are deallocated inside their tail function main, and tag 5 is deallocated inside its tail function TimeConstruction. Tag 4 is deallocated immediately after the instructions invoking Populate, in main and in TimeConstruction.

Algorithm 3: DAU (Disconnect All Uses)

The DAU algorithm can free a reference inside the loop where it is used, even if such a reference is assigned to a field. The basic idea of this algorithm is as follows: if an instruction r that creates a tag is deleted from the Instruction-Graph of the program, and all the nodes using that tag become unreachable from the root of that graph, then each object assigned that tag is guaranteed to be overwritten each time the instruction r is executed, provided the tag is stored in exactly one variable (i.e., either in a single non-array field or in a local variable).

Algorithm 3 Disconnect All Uses (DAU)
1: Consider the Instruction-Graph G = (V, E) of a thread
2: for each tag t that is never stored in multiple variables do
3:   Look up v ∈ V, the instruction (node) that creates t
4:   Compute Gv = (V \ {v}, E′)
5:   Compute the set Rv of all nodes in Gv reachable from the root node
6:   if no node in Rv uses t then
7:     Create the DAG D of the subgraph of G induced by the nodes V \ Rv (containing only unreachable nodes)
8:     Apply the PNR algorithm on the DAG D for tag t

Strengths and Limitations

This algorithm can free a reference inside the loop in which it is used, even if such a reference is assigned to a field. The main limitation is that it cannot free a reference assigned to more than one variable.

Scalability

This section discusses the scalability of the DAU algorithm. We start by proving its time complexity.

Theorem 2.2. The time complexity of DAU algorithm is O(|T| · (|V| + |E|)) ≤ O(|V| · (|V| + |E|)), where V and E are the vertices and edges of the instruction-graph and T is the set of all tags.

Proof. Recall that the number of tags is bounded by the number of nodes |V| of the Instruction-Graph. A reverse lookup of the creating node can be done in constant time after creating the lookup table once in time O(|V|).

Computing the subgraph Gv and the DAG D can be done in time O(|V| + |E|), since this can be done by a depth-first search in G. Finally, the application of PNR needs only time O(|V| + |E|), since we consider only one tag, using Theorem 2.1.

Dealing with threads: Each of the three algorithms (PNR, FTT, and DAU) ascertains when a tag can be freed. However, since the thread switching order is unknown at compile time, tags common to multiple threads are left alone by these algorithms.

2.4 Runtime Implementation

The first two phases of the Offline GC scheme, the data-flow analysis phase and the bytecode updating phase, are part of program compilation. The third and final phase of the Offline GC scheme is carried out at runtime. This phase can be implemented on any available JVM interpreter. We have selected the TakaTuka JVM [6, 8] because it is open source and it works on tiny platforms, where the Offline GC is most useful in saving scarce resources. In the following subsections we highlight the design goals of the Offline GC runtime implementation and the support needed from a JVM interpreter to achieve these goals. While the Offline GC aims at saving runtime memory, its implementation incurs a small memory overhead at runtime, as characterized in the discussion below.

Freeing a Group of Objects

The Offline GC assigns a unique tag to all the objects created by a single instruction, as described in Section 2.2. A JVM interpreter must execute the OFFLINE_GC_FREE instruction at runtime to free all the objects that are assigned the tag specified in the instruction operand. A goal of implementing the OFFLINE_GC_FREE instruction is that the interpreter should be able to free all the objects with a given tag efficiently. To this end, all the objects with a given tag should be grouped together, and the group must be deallocated without traversing the complete object graph.

During program execution, all objects with a given tag are linked together in a doubly-linked list. For each tag, the reference to the first object in the associated doubly-linked list is stored in an array called the Representative Reference Array (RRA). The length of this array is therefore equal to the number of tags identified during the OGC-DFA. The element at index i of the RRA stores the reference to the first element of the doubly-linked list associated with tag i. Using the RRA, all the objects grouped by a given tag can be reached with a time complexity of O(n), where n is the number of objects assigned that tag at runtime. To deallocate the objects grouped by a tag, the doubly-linked list associated with that tag is traversed, instead of a complete traversal of the object graph. Memory Overhead: The size of the RRA, the static array which stores one reference per tag, is 2γ bytes (assuming that a reference occupies 2 bytes), where γ is the total number of tags. Furthermore, to create the doubly-linked list, an additional 4 bytes per object are required. A sketch of this grouping structure follows.
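The following is a minimal Java sketch of the grouping structure; in reality this lives inside the interpreter's heap rather than in Java objects, and all names below are hypothetical.

// Hypothetical sketch of the RRA grouping: one doubly-linked list of
// objects per tag, so OFFLINE_GC_FREE releases exactly one tag's objects
// in O(n) without traversing the rest of the heap.
final class TaggedObject {
    TaggedObject prev, next; // per-tag list links: the 4 bytes/object overhead
}

final class RRA {
    private final TaggedObject[] head; // head[tag]: first object carrying that tag

    RRA(int tagCount) { head = new TaggedObject[tagCount]; }

    void onOfflineGcNew(int tag, TaggedObject obj) { // run by OFFLINE_GC_NEW
        obj.next = head[tag];
        if (head[tag] != null) head[tag].prev = obj;
        head[tag] = obj;
    }

    void onOfflineGcFree(int tag) {                  // run by OFFLINE_GC_FREE
        TaggedObject o = head[tag];
        while (o != null) {
            TaggedObject next = o.next; // read before the object is released
            release(o);
            o = next;
        }
        head[tag] = null;
    }

    private void release(TaggedObject o) { /* return the storage to the heap */ }
}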

Conflict with Online GC

A typical online GC frees an object only if it is unreachable during the traversal of the object graph. Thus, an online GC may need to traverse the object graph each time it is invoked. In contrast, the Offline GC can free an object that may be reachable but will never be used in the future. When a reachable object is deallocated by the Offline GC, the reference to this object becomes invalid, and a subsequent invocation of the online GC is likely to report memory errors when it reaches the invalid reference during traversal of the object graph. Our implementation avoids such errors as follows: when an object is deallocated by the OFFLINE_GC_FREE instruction, the value of the corresponding reference (i.e., the memory location) is stored in a dynamic table. When the online GC traverses the object graph, the value of each encountered reference is first looked up in the dynamic table. If the table contains a reference value, indicating that the memory location pointed to by that reference value has already been freed by the Offline GC, the online GC does not dereference that value. The online GC also sets each variable storing such a reference to null. Thus, at the end of each invocation of the online GC, all entries in the dynamic table containing the reference values can be deleted. Memory overhead: The size of the dynamic table is 2δ bytes (assuming that a reference occupies 2 bytes), where δ is the number of objects that have been deallocated by the Offline GC since the last invocation of the online GC. The table entries are removed at the completion of each online GC cycle. A sketch of this table follows.
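A minimal Java sketch of the dynamic table, with hypothetical names; the real interpreter stores raw reference values rather than boxed integers:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the dynamic table shielding the online GC from
// references already deallocated by the Offline GC.
final class FreedReferenceTable {
    private final Set<Integer> freed = new HashSet<>(); // reference values freed offline

    void recordOfflineFree(int ref) { freed.add(ref); } // on OFFLINE_GC_FREE

    // Queried by the online GC for every reference met during traversal;
    // a hit means: do not dereference, null out the referring variable instead.
    boolean isFreed(int ref) { return freed.contains(ref); }

    void clearAfterOnlineGc() { freed.clear(); } // emptied after each GC cycle
}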

2.5 Related Work

The notion of compile-time garbage collection goes back to Barth [36], who proposed a scheme for maintaining reference counts at compile time. This topic has been picked up and improved upon by later researchers, e.g., [37, 38]. However, this strand of work differs from our proposal, which is targeted at mark-and-sweep GC. Free-me [32] is a static analysis that identifies objects when they become dead and inserts code to free them. Unlike our scheme, it is based on a flow-insensitive and field-insensitive pointer analysis, and is thus suitable mostly for local variables. Furthermore, Free-me puts the main emphasis on reclaiming short-lived objects, whereas our analysis also handles long-lived objects. Shaham et al. [39] provide a yet more precise analysis for placing free instructions. Their approach can even deallocate an object while it is still reachable; however, their analysis is intra-procedural and rather expensive in practice. Agesen et al. [40] have also applied type analysis and liveness analysis, but their application is to shrink the root set for a garbage collector. They do not insert explicit free instructions. The scheme proposed by Cherem et al. [41] is closest to our work, as it updates the program bytecode using a flow-sensitive and inter-procedural pointer analysis. The main limitation of this scheme is that it allows deallocating only those objects that are not shared by multiple fields or local variables at function boundaries. Thus it cannot deallocate objects that are saved in multiple variables and used by multiple methods. The Offline GC scheme overcomes this limitation by keeping static and non-static field data structures during the analysis phase. A further area of related work is based on escape analysis [33, 34, 35], which determines whether objects can escape from their defining scope. The main application of escape analysis is to determine which objects can safely be allocated on the runtime stack. This allocation is static, and the corresponding deallocation is automatic when the function exits. Besides speeding up the program by avoiding the cost of memory management for non-escaping objects, there is also no need to synchronize on such objects, as they stay within a thread. While escape analysis speeds up program execution, it may lead to memory leaks, because a stack-allocated object cannot be freed before the function returns. In contrast, a heap-allocated object can be freed as soon as it becomes inaccessible. Thus, escape analysis trades space for execution speed. In contrast, the main goal of the Offline GC is to reduce the memory consumption of a program without compromising its execution speed. The Offline GC is also related to region-based memory management [42]. Each allocation site gives rise to its own region of memory (identified by a tag, following Jones and Muchnick [43]), which is freed when our analysis determines that all references to this

Figure 2.5: IRIS Mote: An IRIS mote used in our experimental evaluation. The mote has 128 KB of flash, 8 KB of RAM, and it uses an 8-bit ATmega128L processor. It is powered by two AA batteries and is equipped with a 2.4 GHz radio for communication with other motes.

region are dead. As with region analysis, our analysis can also free objects that are still reachable and thus would not yet be reclaimed by garbage collection. Much of the work on regions is based on static analysis [44, 45], but there is also work supporting explicit manual region management [46]. While the original work supports region deallocation only in a stack-like manner, subsequent work supports more fine-grained control over the placement of deallocations.

2.6 Results and Discussion

The use of the Offline GC offers two main advantages over using only a traditional, online garbage collector: (1) additional RAM is made available to a user Java program, and (2) the number of CPU cycles consumed in deallocating objects at runtime is reduced, resulting in an increased battery lifetime of the host device. These advantages come with two drawbacks: (1) an increased flash storage requirement, and (2) a longer compilation time of a program. We first describe the programs used to measure the characteristics of the Offline GC.

Benchmark Programs

We have implemented the Offline GC scheme on the TakaTuka JVM [6, 8]. The TakaTuka JVM supports wireless sensor motes, which typically have at most 10 KB of RAM and 128 KB of flash storage. Many existing GC benchmarks, such as SPECjbb2005, SPECjvm2008 and EEMBC [49, 50], as well as arbitrary Java programs, cannot be run on such a resource-constrained device. For our evaluation we needed benchmarks that were developed for resource-constrained sensor motes or are flexible enough to reduce their default memory requirements. We found two such GC benchmarks, namely GCBench and

[Figure 2.6 appears here: three plots of RAM available (bytes) versus point of execution, with and without the Offline GC, for the panels (a) JTC, (b) GCBench, and (c) GCOld.]

Figure 2.6: RAM-available: The amount of RAM-available in bytes at different points of execution for the first four benchmarks given in Table 2.1. In each of these graphs, successive points of execution are separated by at least m instructions, where m is fixed for each program.

[Figure 2.7 appears here: panel (a) plots RAM available (bytes) versus point of execution for Quicksort, with and without the Offline GC; panel (b) shows the CPU cycles (in millions) consumed by OFFLINE_GC_NEW, OFFLINE_GC_FREE, and the online GC, with and without the Offline GC.]

Figure 2.7: The RAM-available in bytes at different points of execution, and the number of CPU cycles consumed in deallocating objects, for the Quicksort program with and without the use of the Offline GC.

JTC (evaluation platform: Mica2, Avrora) — JVM Test Cases, taken from the TakaTuka Sourceforge repository. Includes hundreds of test cases, developed to check the Java binary instruction set implementation of a JVM.

GCBench (evaluation platform: Mica2, Avrora) — Designed to mimic the memory usage behavior of real applications. The default RAM requirement is 32 MB. To fit the benchmark onto a tiny mote with 4 KB RAM, we used the following input parameters: kStretchTreeDepth=7, kLongLivedTreeDepth=25, kArraySize=100, kMinTreeDepth=4, and kMaxTreeDepth=10. A description of these parameters may be obtained from the source code [47]. The benchmark creates a mix of long- and short-lived objects. The short-lived objects are binary trees. To constrain the memory, we changed the binary tree structure to a linked list.

GCOld (evaluation platform: Mica2, Avrora) — A GC benchmark developed at Sun Microsystems for garbage collection research [47]. It creates binary trees for short- and long-lived objects. We have changed the following parameters so that it can run on a tiny mote using a few KB of RAM: MEG=2, INSIGNIFICANT=5, BYTES_PER_WORD=1, BYTES_PER_NODE=1, treeHeight=4, size=14, workUnits=20, promoteRate=2, ptrMutRate=5, and steps=20. A description of these parameters may be obtained from the implementation of GCOld [47].

QuickSort (evaluation platform: Mica2, Avrora) — The in-place iterative implementation of the Quicksort algorithm. Each iteration creates very few dynamic objects. This program is selected to show that the Offline GC is to be used with caution; it may not be beneficial if a program has only a few dynamic objects.

CTP (evaluation platform: IRIS, see Fig. 2.5) — Taken from the TakaTuka Sourceforge repository. CTP is an address-free routing protocol for networks that use wireless sensor motes [48]. The protocol is completely implemented in Java and uses four threads.

Table 2.1: Benchmarks used for the evaluation of the Offline GC.

GCOld [51, 47], whose RAM requirements depend upon their input parameters. In addition to these two benchmarks, we have also used three existing applications from the TakaTuka JVM Sourceforge repository. These applications have been developed for the resource constraints of sensor motes. Table 2.1 provides a description of these five benchmarks, which are used for the evaluation of the Offline GC scheme.

Evaluation Results

This section provides an evaluation of the Offline GC scheme on the first four benchmarks in Table 2.1. We evaluated these benchmark programs on Crossbow Mica2 motes [29, 52] as well as on Avrora, a simulator for the Mica2 motes [53]. The memory usage results for the actual Mica2 motes and the Avrora simulator were found to be consistent with each other. For measuring the CPU cycles consumed in deallocating objects, we used only the Avrora simulator, because it is hard, if not impossible, to measure CPU cycles on an actual sensor mote. Avrora simulates microcontroller instructions and can accurately calculate the CPU cycles consumed by a set of procedures during the execution of a program. In order to preserve the correlation of memory usage with CPU cycles, both of these results are presented using the Avrora simulator.

RAM Availability

The results for RAM-available at a point of execution, with and without the use of the Offline GC, are shown in Fig. 2.6. The term RAM-available at a given point of execution refers to the maximum amount of memory that becomes available to a program if a precise compacting online GC is executed at that point. Thus, at every execution point, the actual free memory will always be less than or equal to the RAM-available at that point. Furthermore, if the RAM-available at a point is 0 bytes, then the program is deemed to be out of memory. It can no longer continue its execution, because even running a GC at that point will not free any additional memory. We measure the RAM-available of a program after the execution of every m instructions. Fig. 2.6 shows the RAM-available at different points of execution of the JTC, GCBench, and GCOld benchmarks. The JVM Test Cases (JTC) program consists of many test cases, each executing one after the other. Therefore, many dynamic and independent objects are created during the execution of JTC. It is because of the presence of these numerous dynamic and independent objects that JTC has, on average, 53.79 % additional RAM-available during program execution if the Offline GC is used. The GCBench and GCOld programs have, on average, 66.06 % and 31.65 % additional RAM-available respectively when the Offline GC is used. Both of these programs have long-lived objects that are created at the start of their execution but are not used afterwards. As these objects are in scope (reachable) during almost the entire program execution, they cannot be freed by the online GC. In contrast, the

Offline GC deallocates these objects soon after their last use, even when they are still reachable. Furthermore, a few of the short-lived objects are also freed by the Offline GC a few bytecode instructions ahead of the execution point at which an online GC would free them. However, this increase in RAM-availability, which lasts only a few bytecode instructions3, is too small to be captured in the RAM-availability graphs of these benchmarks. Finally, Fig. 2.7 shows that in-place Quicksort performs better when the Offline GC is not used. This is because this benchmark creates very few (only three) dynamic objects during each iteration for sorting a subset of the input array. None of these objects can be freed by the Offline GC earlier than the execution point at which a regular online GC would free them. In this case, the RAM-available is 5.97 % smaller if the Offline GC is used. This reduction is due to the implementation overhead of the Offline GC, as explained in Section 2.4. This overhead is present in all the programs and is visible at the beginning of each graph in Fig. 2.6. However, for a program that creates many dynamic objects, this overhead is usually negligible compared to the additional memory made available by the Offline GC. In conclusion, the Offline GC is effective in bringing down the RAM utilization of programs that create several dynamic objects, each larger than a few bytes. Thus, for a given program, the Offline GC has to be used with caution.

Benchmark        with Offline GC   without
JVM Test Cases          1              4
GCBench                16             38
GCOld                   1             18
QuickSort              13             13

Table 2.2: The number of times the online GC is called by the benchmark programs, with and without the Offline GC.

CPU Cycles

The TakaTuka JVM uses a mark-and-sweep online GC with compaction [30, 54]. As a mote has only a few KB of RAM, the implementation must be memory efficient and run with very little memory consumption. Hence, the online GC in TakaTuka is implemented non-recursively, with multiple iterations over the object graph. During a single invocation, the overall time consumed by the online GC is O(k + b²), where b is the number of objects on the heap and k is the current size of all stack-frames. In order to save the CPU cycles consumed in deallocating objects, the online GC is run only when it is necessary, that is, when the RAM is full. By deallocating objects at regular intervals, the Offline GC reduces the number of times the online GC needs to be invoked (see Table 2.2 for comparative results of the first four benchmarks). This leads to a

3 A single line of code may translate into many bytecode instructions.

reduction in the number of CPU cycles consumed when the Offline GC is used, as shown in Fig. 2.8. This reduction in the number of CPU cycles due to freeing objects is 45.58 %, 29.24 %, and 73.13 % for JTC, GCBench, and GCOld respectively. On the other hand, the Quicksort program takes 20.60 % extra CPU cycles when the Offline GC is used (see Fig. 2.7). This again is because, in the case of in-place Quicksort, the Offline GC does not deallocate objects earlier than a typical online GC would. This results in increased CPU cycle consumption due to the overhead associated with the execution of the OFFLINE_GC_NEW and OFFLINE_GC_FREE instructions.

Collection Tree Protocol

This section presents performance results of the Offline GC for the Collection Tree Protocol (CTP) [48], the last benchmark in Table 2.1. The CTP is a tree-based routing protocol [48]. A few nodes in a network advertise themselves as roots of the network. Based on such advertisements, each non-root node determines which of its neighbors is either a root or nearer to one of the roots. In the steady-state operation of CTP, a set of distributed routing trees is created by the non-root nodes, each tree ending at a root node. The CTP is an address-free protocol in which the data is always transmitted from a non-root node towards the nearest root node. The Java implementation of the CTP, available in the TakaTuka Sourceforge repository, has a memory requirement of more than 4 KB. Since Avrora supports Mica2 simulation with a maximum RAM size of 4 KB, we could not use Avrora as the evaluation platform for benchmarking the Offline GC with CTP. Instead, we used IRIS motes (see Fig. 2.5) equipped with 8 KB of RAM [29]. The Java CTP implementation uses four threads, some of which wait to receive data or control packets. This implies that the bytecode instructions of the program are not executed in a deterministic order, and thus memory usage results similar to the ones in Fig. 2.6 for the other benchmarks cannot be reproduced. Therefore, our evaluation of the Offline GC with the CTP proceeds as follows: we run the benchmark program repeatedly, incrementing the size of all objects by one byte in each successive run. The goal is to find the maximum possible increase in the size of all the objects of the program before we reach a point where the program exceeds the RAM of the mote. Fig. 2.9 indicates that when the Offline GC is used, larger objects can fit in the RAM as compared to the case when only the online GC is used. The same figure further shows that the use of the Offline GC results in a reduction in the number of times the online GC needs to be invoked, thus saving CPU cycles and hence the battery life of a wireless sensor mote.

Efficacy and Limitations

The effectiveness of using the Offline GC is quite prominent for programs that create many large objects. The use of the Offline GC does not come for free: the flash storage requirement is increased, and the program takes longer to compile. The increase in the flash storage requirement has two reasons. First, the size of the Java binary gets inflated due to the insertion of the two new bytecode instructions (OFFLINE_GC_NEW and OFFLINE_GC_FREE). Second, the size of the Java interpreter is also increased, as it provides the implementation of these two new instructions. Fig. 2.10 shows that, for the benchmarks we used, the aggregated size of the Java binary and the interpreter increases on average by 3.90 %. This implies that the Offline GC does not increase the flash storage requirements of a mote significantly, and it is feasible to use it on a tiny mote device with limited storage. Fig. 2.10 also shows the time taken to compile a program with the use of the Offline GC. Before transferring the code onto a mote, the compilation was carried out on a typical notebook with an Intel Pentium running at 2.0 GHz and 2 GB of RAM. The maximum time taken by a program to compile with the Offline GC was 41.65 seconds. This included the time taken by the first two phases of the Offline GC as well as the time for changing the bytecode. The current implementation is not yet fully optimized for speed, and a future optimized version is expected to compile a given program much faster. The slowest part of the compilation for the Offline GC is the DAU algorithm, which takes 48.78 % of the total compilation time on average.

[Figure 2.8 appears here: bar charts of the CPU cycles consumed by OFFLINE_GC_NEW, OFFLINE_GC_FREE, and the online GC, with and without the Offline GC, for the panels (a) JTC (in thousands), (b) GCBench (in millions), and (c) GCOld (in millions).]

Figure 2.8: CPU Cycles: The number of CPU cycles consumed in deallocating objects, during the execution of a program, with and without the use of the Offline GC.

[Figure 2.9 appears here: panel (a) shows the increase in objects’ size (bytes) for root and non-root motes; panel (b) shows the number of online GC calls for root and non-root motes; both with and without the Offline GC.]

Figure 2.9: CTP benchmark results: Figure (a) shows the maximum possible increase in the size of objects just before reaching the stage where the CTP program can no longer fit in the RAM of the mote. Figure (b) gives the number of times the online GC is called, while receiving 25 data packets with inflated objects’ size, using the CTP program. Both graphs provide a comparative performance, with and without the Offline GC.

[Figure 2.10 appears here: panel (a) shows the flash used (bytes) for JTC, GCBench, GCOld, Quicksort, CTP Root, and CTP Non-Root, with and without the Offline GC; panel (b) shows the compilation time (seconds) for the same programs, broken down into OGC-DFA, PNR, FTT, and DAU.]

Figure 2.10: Fig. (a) shows the aggregated size of the Java binary and the interpreter (flash size). Fig. (b) shows the time taken by various components of the Offline GC during the first two phases of program compilation, before transferring the code onto a mote.

Chapter 3

Variable Slot Size


The memory management for a Java program may be divided into two parts. First, the memory consumed by each object is managed by the Garbage Collection (GC), which is executed periodically to reclaim that memory. Second, the memory consumed by the frame of a function is automatically reclaimed on the completion of the function’s invocation. The frame is a data structure in which the local variables and the operand stack of a function are stored. When a function is invoked, a frame for that function is created. The frame is destroyed, and the memory reserved for it reclaimed, when the function invocation ends. A frame has many fixed-size slots to store data. For example, if each slot is 32 bits long, then a local variable needs one slot to store a 32-bit integer and two slots for a 64-bit long. The maximum number of slots required by the frame of a function is calculated during the creation of the class files from the Java source of a program. This number is stored in the Java binary of the function and is used during program execution to allocate memory for the frame. Standard Sun Java has a 32-bit slot size, as Java was originally designed for 32-bit microcontrollers. There are two implications of having a 32-bit size for each slot:

• Each data item in a frame is always stored at a memory offset that is a multiple of 32 bits, leading to fast data access on a 32-bit microcontroller. Note that a 32-bit microcontroller needs extra CPU cycles to access a memory address that is not 32-bit aligned.

• Each data item smaller than 32 bits (e.g. a 16-bit short) is also stored in a 32-bit slot, leading to a waste of precious RAM.

For a 32-bit microcontroller, the 32-bit slot size presents a trade-off between fast execution of bytecode and a smaller amount of RAM available to a program. In contrast, a mote usually has either an 8-bit or a 16-bit microcontroller, which can efficiently access a memory address that is not a multiple of 32. Therefore, for a mote, a slot size of less than 32 bits could save RAM without undermining the speed of bytecode execution. The TakaTuka JVM offers the programmer the option to select from 32-bit, 16-bit and 8-bit slot sizes at the creation of a class file. The scheme supporting this option is called the Variable Slot Size (VSS) scheme. The design of the VSS scheme has two parts:

1. Extending the Java bytecode instruction set by introducing new instructions to sup- port VSS scheme.

2. Updating the bytecode of each function based on the selection of slot size.

The following section discusses the design of the VSS scheme in detail.

3.1 VSS Instruction Design

The standard Java bytecode instructions support only 32-bit or 64-bit operations. In the Sun JVM, data of a type smaller than 32 bits is sign-extended to a 32-bit integer before being stored in a local variable or on the operand stack of a function. Subsequently, any operation carried out on that data is the same as for a 32-bit integer. Therefore, supporting the VSS scheme in the TakaTuka JVM requires extending the Java bytecode instruction set to support data types smaller than 32 bits. There are 204 standard Java bytecode instructions. Each instruction has a single-byte opcode1. A one-byte opcode allows 2⁸ = 256 bytecode instructions. Thus, there are 256 − 204 = 52 opcodes available for extending the Java bytecode instruction set. However, the VSS scheme cannot use all of the 52 available opcodes. This is because the Offline garbage collector (Chapter 2) and the TakaTuka bytecode optimizations (Chapter 4) also need to extend the Java bytecode instruction set using the same available opcode space. Furthermore, the size of the JVM increases in proportion to the number of instructions it supports. Therefore, to fit the TakaTuka JVM on a mote’s flash, a conservative approach to the extension of the Java bytecode instruction set is required. We have considered two options for extending the Java bytecode to support the VSS scheme. These options can be explained using an example. Example: Fig. 3.1 shows the Java source of a simple function addTwo. The function adds two input variables: a 16-bit short and an 8-bit byte. The Java binary of function addTwo generated by the standard Java compilation is shown in Fig. 3.2. The frame of the function addTwo requires five 32-bit slots: three slots for local variables (i.e. 3 × 32 = 96 bits of RAM) and two slots for the operand stack (i.e. 2 × 32 = 64 bits of RAM).

Option 1: Darjeeling Extension

This extension introduces a new instruction for each smaller than 32-bit data type corresponding to each existing instruction of the 32-bit integer data type. For example, corresponding to the standard Java instruction IADD, a new instruction SADD is introduced for adding two 16-bit shorts. Fig. 3.3 shows the Java binary with the extended instruction set for function addTwo. The number of slots required by the function, using the VSS scheme, depends upon the slot size selected by the programmer during creation of the Java binary. If an 8-bit slot size is selected, then the local variables need seven slots (i.e. 7 × 8 = 56 bits of RAM) and the operand stack requires four 8-bit slots. In contrast, if a 16-bit slot size is selected, then the local variables need four slots (i.e. 4 × 16 = 64 bits of RAM) and the operand stack requires two 16-bit slots.

¹ Opcode (operation code) identifies the operation performed by an instruction.

public static int addTwo(short s, byte b) {
    int i = s + b;
    return i;
}

Figure 3.1: Source code of function addTwo that adds a short and a byte

ILOAD_1
ILOAD_2
IADD
ISTORE_3
ILOAD_3
IRETURN

Figure 3.2: Java binary of function addTwo

Thus the RAM consumption is reduced significantly with smaller slot sizes. However, the size of the Java binary increases, as shown in Fig. 3.3. Java has dozens of standard instructions for the 32-bit integer data type, e.g. IADD, IMUL, IAND, etc. Therefore, the drawback of this extension of the Java bytecode is that it requires introducing dozens of additional instructions, which would lead to a significant increase in the JVM size.
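To make the slot arithmetic concrete, the following sketch recomputes the frame requirements of addTwo for the three slot sizes. It is an illustration only, not TakaTuka code: the JType enumeration and the slotsFor helper are hypothetical names, and the only assumption is that a datum of b bits occupies ceil(b / slotSize) slots.

    // Illustrative sketch (not TakaTuka source): slots needed per datum for a chosen slot size.
    enum JType {
        BYTE(8), BOOLEAN(8), SHORT(16), CHAR(16), INT(32), FLOAT(32), LONG(64), DOUBLE(64);
        final int bits;
        JType(int bits) { this.bits = bits; }
    }

    public class FrameSlots {
        // Each datum occupies ceil(bits / slotSize) slots.
        static int slotsFor(JType t, int slotSize) {
            return (t.bits + slotSize - 1) / slotSize;
        }

        public static void main(String[] args) {
            JType[] locals = { JType.SHORT, JType.BYTE, JType.INT }; // s, b and i of addTwo
            for (int slotSize : new int[] { 32, 16, 8 }) {
                int slots = 0;
                for (JType t : locals) slots += slotsFor(t, slotSize);
                System.out.printf("slot size %2d bits: %d local-variable slots = %d bits of RAM%n",
                        slotSize, slots, slots * slotSize);
            }
        }
    }

Running the sketch reproduces the numbers quoted above for the local variables of addTwo: 3 slots (96 bits) with 32-bit slots, 4 slots (64 bits) with 16-bit slots, and 7 slots (56 bits) with 8-bit slots.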

The Darjeeling JVM [7] uses a bytecode extension for supporting 16-bit data types that is similar to the extension presented here. Unlike TakaTuka, the Darjeeling JVM does not support variable slot sizes. This implies that a Darjeeling programmer does not have the option to choose the slot size during program compilation, as the slot size for Darjeeling programs is always equal to 16 bits.

LOAD_SHORT_CHAR 1
LOAD_BYTE_BOOLEAN 2
BYTE2SHORT
SADD
SHORT2INT
ISTORE_3
ILOAD_3
IRETURN

Figure 3.3: Java binary of function addTwo with Darjeeling Extension

LOAD_SHORT_CHAR 1
LOAD_BYTE_BOOLEAN 2
CAST_STACK_LOCATION 0 BYTE
CAST_STACK_LOCATION 1 SHORT
IADD
ISTORE_3
ILOAD_3
IRETURN

Figure 3.4: Java binary of function addTwo with TakaTuka VSS Extension

Option 2: TakaTuka Extension

The approach used in the TakaTuka JVM extends the Java bytecode instruction set by introducing only the six new instructions given in Table 3.1. Thus the VSS extension of the TakaTuka JVM leads to a smaller increase in JVM size as compared to the Darjeeling extension. The Java binary of function addTwo with the TakaTuka VSS extension is shown in Fig. 3.4. Both the TakaTuka extension and the Darjeeling extension of the Java bytecode have additional instructions for loading data onto the operand stack as well as storing data in local variables; for example, both have the LOAD_SHORT_CHAR instruction to load 16-bit data onto the operand stack. However, unlike Darjeeling, the TakaTuka JVM does not duplicate all 32-bit integer operations (note that the code in Fig. 3.4 has no SADD). Instead, TakaTuka has two casting instructions to cast smaller than 32-bit data to a 32-bit integer. Using these instructions, data of a type smaller than 32 bits is cast to a 32-bit integer, and subsequently the existing integer operations can be applied to it. This significantly reduces the number of additional instructions required to support the VSS scheme.

Instruction                                       Description
LOAD_SHORT_CHAR index                             Loads data of type short or char onto the operand stack of a function.
STORE_SHORT_CHAR index                            Stores data of type short or char in a local variable of a function.
LOAD_BYTE_BOOLEAN index                           Loads data of type byte or boolean onto the operand stack of a function.
STORE_BYTE_BOOLEAN index                          Stores data of type byte or boolean in a local variable of a function.
CAST_STACK_LOCATION index type                    Casts a given stack location to integer.
METHOD_STACK_LOCATION index from-type to-type     Casts a given stack location with a given type to another type.

Table 3.1: Set of additional bytecode instructions in the TakaTuka VSS Extension.

3.2 Changing Bytecode for Supporting VSS

Given the source code of a program, the TakaTuka JVM creates the Java binary in two steps: (1) it uses standard Java compilation for the creation of the class files from the program source code; (2) subsequently, it transforms the class files into the TakaTuka binary, called Tuk. The Tuk file is optimized for efficient RAM and flash usage. The Tuk file creation and optimizations are performed on a desktop computer before the application is transferred onto a mote for execution. Therefore, the optimization process does not consume the resources of a mote. One of these Tuk file optimizations is the VSS scheme for smaller RAM consumption. The VSS scheme updates the bytecode during the creation of the Tuk file based on the slot size selected by the programmer. This section describes how the VSS updating of the bytecode is carried out.

Reducing Input Parameters' Size

The local variables of a function consist of the function's input parameters and the variables defined within the body of the function (called Body-variables). For example, the function addTwo in Fig. 3.1 has three local variables: two input parameters s, b and one Body-variable i. The constant pool of a class file contains type information for the input parameters of each function. However, the constant pool has no information about Body-variables. We use the input parameter information available in the constant pool to reduce the parameters' size. To this end, the Java bytecode is updated by replacing 32-bit load and store instructions with the corresponding smaller than 32-bit instructions, based on the input parameters' type information available in the constant pool. For example, the first instruction ILOAD_1 of function addTwo in Fig. 3.2 is replaced by LOAD_SHORT_CHAR 1 in Fig. 3.4 based on the input parameters' type information.
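The parameter-based rewrite can be pictured with a small sketch. It is hypothetical and greatly simplified (TakaTuka works on real class files, not strings); the opcode names follow Table 3.1 and the descriptor characters are the standard JVM type codes.

    // Hypothetical sketch: choose the load instruction for a parameter from its descriptor type.
    public class ParamLoadRewriter {
        static String rewriteLoad(int localIndex, char descriptorType) {
            switch (descriptorType) {
                case 'S': case 'C': return "LOAD_SHORT_CHAR " + localIndex;   // short or char
                case 'B': case 'Z': return "LOAD_BYTE_BOOLEAN " + localIndex; // byte or boolean
                default:            return "ILOAD_" + localIndex;             // int-sized types
            }
        }

        public static void main(String[] args) {
            // addTwo(short s, byte b) has descriptor (SB)I; s is local 1 and b is local 2
            System.out.println(rewriteLoad(1, 'S')); // LOAD_SHORT_CHAR 1
            System.out.println(rewriteLoad(2, 'B')); // LOAD_BYTE_BOOLEAN 2
        }
    }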

Reducing Operand Stack Size

The Java binary of a function contains the maximum number of slots required for creating the frame of that function. The number of slots required for storing local variables in the frame is recalculated after reducing the input parameters' size, as explained in the previous section. In order to determine the number of slots required by the operand stack of the function, the bytecode of the function is analyzed. The analysis is performed by the VSS Data-Flow Analyzer (VSS-DFA), which is an extension of the Bytecode Verification Data-Flow Analyzer (BV-DFA) explained in Section 2.2. The VSS-DFA is different from the BV-DFA in two ways:

• If the verification of the bytecode fails, the bytecode is modified by adding the casting instructions listed in Table 3.1.

[Figure: bar chart of the percentage reduction in the aggregated size of local variables, with 8-bit and 16-bit slot sizes, for CTP, DYMO, JVM Test Cases and Quicksort.]

Figure 3.5: Percentage reduction in the aggregated size of local variables

• The rules for merging the temp-frame into the entry-frame are different.

The Java bytecode is verified using the BV-DFA, and this verified bytecode is used by the VSS scheme. The VSS scheme changes the already verified bytecode by adding load and store instructions for smaller than 32-bit data types. Subsequently, the VSS-DFA reanalyzes the changed bytecode. Verification errors in the changed bytecode can arise only from the newly added smaller than 32-bit load and store instructions. These errors are corrected by inserting the casting instructions into the function's bytecode.

Frame Merging

In the BV-DFA (Section 2.2) the primitive data types in corresponding slots of the temp-frame and the entry-frame are always identical. However, this rule does not hold after load and store instructions are introduced for smaller than 32-bit data types. Therefore, it is possible that during VSS data flow analysis the temp-frame and the entry-frame have primitive data types of different sizes in corresponding slots. For example, a temp-frame slot may contain a 32-bit integer while the corresponding slot of the entry-frame contains a 16-bit short. If the data type of a slot in the temp-frame is larger than that of the corresponding slot of the entry-frame, then the entry-frame slot is updated with the temp-frame data type.
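The widening rule can be stated compactly in code. A minimal sketch under the assumption that only the three slot-type sizes below occur; the names are illustrative and not taken from TakaTuka:

    // Illustrative sketch of the VSS-DFA merge rule: if the temp-frame slot holds a
    // larger primitive type than the entry-frame slot, widen the entry-frame slot.
    enum SlotType {
        BYTE_BOOLEAN(8), SHORT_CHAR(16), INT(32);
        final int bits;
        SlotType(int bits) { this.bits = bits; }
    }

    public class FrameMerge {
        static SlotType merge(SlotType entry, SlotType temp) {
            return temp.bits > entry.bits ? temp : entry;
        }

        public static void main(String[] args) {
            // temp-frame slot holds a 32-bit int, entry-frame slot a 16-bit short:
            System.out.println(merge(SlotType.SHORT_CHAR, SlotType.INT)); // INT
        }
    }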

Reducing Body-variables' Size

The constant pool of a class has no information about the data types of a function's Body-variables. Therefore, it is not always possible to know the size of these variables. The TakaTuka VSS scheme uses a guessing algorithm to estimate the types of a function's Body-variables. This algorithm works as follows:

[Figure: bar chart of the percentage reduction in the aggregated size of the operand stacks, with 8-bit and 16-bit slot sizes, for CTP, DYMO, JVM Test Cases and Quicksort.]

Figure 3.6: Percentage reduction in the aggregated size of operand stack

1. During VSS data flow analysis, the maximum size of data ever stored in a Body-variable is found.

2. After the bytecode verification is finished, the type of the Body-variable is guessed based on the maximum size of data stored in it. For example, if a slot reserved for the Body-variable has never stored data of a size greater than 16 bits during the analysis, then the Body-variable should have either the char or the short type. The runtime semantics of char and short are the same, so it is safe to choose either of these types for the given Body-variable.

3. The bytecode is then changed on the basis of this guess.

The above algorithm does not guarantee that the size of each Body-variable will be minimized; however, it is still useful in saving precious RAM.
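The guessing step can be sketched as follows; the class and method names are hypothetical, and the sketch only captures the core idea of recording the maximum datum size per Body-variable during analysis and then choosing the narrowest type that fits.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the Body-variable guessing algorithm.
    public class BodyVarGuesser {
        private final Map<Integer, Integer> maxBitsStored = new HashMap<>();

        // Called for every store into a Body-variable observed during VSS data flow analysis.
        void recordStore(int localIndex, int dataBits) {
            maxBitsStored.merge(localIndex, dataBits, Math::max);
        }

        // After verification: guess the narrowest type that fits every observed store.
        String guessType(int localIndex) {
            int bits = maxBitsStored.getOrDefault(localIndex, 32);
            if (bits <= 8)  return "byte/boolean";
            if (bits <= 16) return "short/char"; // per the text, their runtime semantics coincide
            return "int";
        }

        public static void main(String[] args) {
            BodyVarGuesser g = new BodyVarGuesser();
            g.recordStore(2, 8);   // a Body-variable that only ever held
            g.recordStore(2, 16);  // data of at most 16 bits ...
            System.out.println(g.guessType(2)); // ... is guessed as short/char
        }
    }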

3.3 Results and Discussion

The VSS scheme reduces the RAM requirements of a function's frame. The frame of a function includes the local variables and the operand stack of the function. The RAM required to create the frame of a function at runtime is calculated during program compilation and is stored in the Java binary of the program. Fig. 3.5 shows an average reduction of more than 27 % in the RAM required by the local variables of all the functions of each given program. Similarly, Fig. 3.6 presents an average reduction of more than 35 % in the RAM required by the operand stacks of all the functions of a program. Thus the VSS scheme significantly reduces the RAM requirements of a program. It may, however, be mentioned that the VSS scheme causes a small increase in the size of the program's Java binary and of the Java interpreter.

[Figure: bar chart of the percentage increase in Java binary size, with 8-bit and 16-bit slot sizes, for CTP, DYMO, JVM Test Cases and Quicksort.]

Figure 3.7: Percentage increase in the size of the Java binary

This is because the Java binary of a program is updated with customized instructions to support the VSS scheme, and the interpreter is also extended to support these customized instructions. Fig. 3.7 and Fig. 3.8 show that on average the Java binary of a program increases by less than 6 % and the interpreter size increases by less than 2.5 %. Such a small percentage increase does not prohibitively affect the feasibility of using the VSS scheme on a mote with a few KB of flash memory. Furthermore, an 8-bit slot size does not save significant additional RAM as compared to a 16-bit slot size, because the input programs rarely use 8-bit data types (e.g. boolean).

[Figure: bar chart of the percentage increase in JVM size, with 8-bit and 16-bit slot sizes, for CTP, DYMO, JVM Test Cases and Quicksort.]

Figure 3.8: Percentage increase in the size of the JVM

Chapter 4

Bytecode Compaction


While there are many approaches to bytecode size reduction (see [55] for a survey), not all of them are suitable for small embedded devices. The two primary methods for bytecode size reduction are compression and compaction [55]. A typical compression approach is to find a more efficient opcode representation. For example, Huffman codes may be used to find shorter representations for frequently used instructions. Compression techniques require decompression at runtime, consuming computation resources on the device [56]. Thus, on small embedded devices, where memory and computing capabilities are limited, it is desirable that at runtime a partial rather than a full decompression is performed. Tiny embedded devices usually run on battery power and often have small RAM. As decompression consumes RAM and wastes battery power due to extra computation, performing decompression at runtime is not desirable for embedded devices. Therefore, compression techniques for bytecode reduction are generally not suitable for tiny embedded devices. In contrast to compression, compaction involves identifying and factoring out recurring instruction sequences and creating new customized bytecode instructions to replace those sequences. These customized instructions do not require decompression and are interpretable because they share the characteristics of normal bytecode instructions. The process of compaction produces smaller code that is executable or interpretable with no or little overhead. A typical compaction scheme creates a set of all occurring sequences of bytecode instructions (patterns) after pruning the ones that are not factorisable.¹ Then, the patterns with the highest savings in bytecode size are selected greedily until all unused instructions are considered. Finally, all occurrences of a pattern are substituted by a macro. The compaction scheme presented in [57] produces compact bytecode that is about 15 % smaller and runs 2 to 30 % slower than the original bytecode. Rayside et al. [58] also use a similar method involving pattern substitution, but only for sequences of opcodes instead of sequences of instructions, i.e., in their scheme the operands remain variable. Their combinations are restricted to 2 and 3 instructions, and the resulting compact bytecode is about 20 % smaller (see Table 5 of [58]). TakaTuka uses bytecode compaction instead of compression, so that the resulting compact bytecode can be interpreted without any overhead. In the design of TakaTuka, we employ three different bytecode compaction techniques that work in conjunction with each other. Our approach produces much better results than those reported in [57, 58]: for a given set of programs, our compaction approach achieves a bytecode reduction of about 60 % on average. Furthermore, unlike previous compaction approaches, our compact code runs faster without using extra RAM and, therefore, results in an extended battery lifetime.

¹ An example of an instruction sequence that is not factorisable is one that contains branching instructions.

TakaTuka employs three compaction techniques, each of which replaces a single bytecode instruction or a sequence of bytecode instructions with a new customized bytecode instruction such that the total size of the bytecode is reduced. A customized bytecode instruction, like any other bytecode instruction, is composed of an opcode and an optional set of operands. In the following, we first explain the process of choosing an opcode for a customized instruction. Then, we provide the details of the compaction processes and the relevant algorithms used in TakaTuka.

4.1 Available Opcodes

Each customized instruction, which replaces one or more bytecode instructions during the compaction process, uses an opcode selected from a set of available opcodes. The set contains opcodes that are not used by any other bytecode instruction, and the cardinality of this set limits the extent of compaction. No further compaction is possible if all possible opcodes are used by existing bytecode instructions. The Java specification has reserved one byte (8 bits) to represent 256 possible Java opcodes but uses only 204 of those for corresponding bytecode instructions [11]. Thus, there are 52 unused opcodes that are available for defining customized instructions with a one-byte opcode. That is, for every Java program on any platform with any virtual machine architecture, 52 opcodes are always available. Furthermore, the Java specifications include many bytecode instructions with similar functionalities but different data-type information. The type information of a subset of these instructions is used only in Java bytecode verification and is not required by the JVM interpreter. Since the Split VM architecture (SVA) does not require run-time verification, as indicated in Section 2.5, such an architecture frees up additional opcodes after bytecode verification for a compaction algorithm. For example, ILOAD and FLOAD both push data of the same size onto the stack and have the same effect at runtime. Hence, each occurrence of FLOAD can be replaced by ILOAD after bytecode verification, making the opcode used by FLOAD available for bytecode compaction. Using this process of representing two different instructions with a single opcode in an SVA, 29 additional opcodes become available for bytecode compaction. Thus, with an SVA, at least 52 + 29 = 81 opcodes are available for defining customized instructions to assist in bytecode compaction. Finally, many Java programs may not use all of the 204 standard Java instructions, depending upon the functionality of the program. Hence, a custom-made JVM interpreter such as the one offered by TakaTuka can make use of additional opcodes, not used by the Java program, for defining customized instructions during the bytecode compaction process. If, however, a general purpose TakaTuka interpreter is needed that can run any CLDC-compliant program, the number of available opcodes for defining customized instructions remains 81.

4.2 Single Instruction Compaction

In this compaction technique, the size of a single bytecode instruction is reduced by replacing it with a smaller customized instruction. That is, in single instruction compaction, each customized instruction replaces only a single bytecode instruction. Single instruction compaction in TakaTuka is either a reduction in the memory footprint needed to represent an operand (called Operand reduction) or a complete removal of the operand from the bytecode instruction (called Operand removal).

Operand Reduction

In standard Java, all CP-instructions use a 2-byte CP index as their operand²; similarly, each offset-instruction uses either a 2-byte or a 4-byte operand as its branch offset [11]. In total, there are 35 JVM instructions that use a CP index or a branch offset as their operand. In TakaTuka, we introduce a new customized instruction with a reduced operand size of one byte whenever the operand value is smaller than 256. In order to maximize the savings resulting from this technique, we sort the information in our set of global CPs³ such that the most frequently referenced CP entries are stored at numerically smaller CP indices. This leads to a large number of CP-instructions in the bytecode using only a one-byte CP index instead of two.

² The only exception is LDC, which uses only a one-byte CP index as its operand.
³ The global constant pool is explained in detail in Chapter 5.
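The effect of the sorting can be sketched as follows. This is an illustration with hypothetical names, not the Tuk linker itself: entries are ordered by reference count, so the most frequently used entries receive indices below 256 and the CP-instructions referring to them can carry a one-byte operand.

    import java.util.Comparator;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch of operand reduction: frequent CP entries get small indices.
    public class OperandReduction {
        static Map<String, Integer> assignIndices(Map<String, Integer> refCounts) {
            Map<String, Integer> index = new LinkedHashMap<>();
            refCounts.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                    .forEach(e -> index.put(e.getKey(), index.size()));
            return index;
        }

        // A CP index below 256 fits in a one-byte operand, otherwise two bytes are needed.
        static int operandBytes(int cpIndex) { return cpIndex < 256 ? 1 : 2; }

        public static void main(String[] args) {
            Map<String, Integer> refs = new LinkedHashMap<>();
            refs.put("java/lang/Object.<init>", 412); // referenced very often
            refs.put("someRareConstant", 1);          // referenced once
            assignIndices(refs).forEach((entry, i) ->
                System.out.println(entry + " -> index " + i + " (" + operandBytes(i) + "-byte operand)"));
        }
    }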

Operand Removal

For bytecode compaction in TakaTuka, we also combine an opcode and its operand to form a customized instruction with implicit operand(s). For example, the instruction ILOAD 0x0010 could be converted to ILOAD_0x0010, saving the two bytes originally used by the operand. Note, however, that we do not apply operand removal on offset-instructions, as their offsets usually change after any kind of bytecode compaction. Furthermore, the number of candidate instructions for operand removal is potentially large, much larger than the number of available opcodes, which would leave little room for any other compaction technique to be used.

4.3 Multiple Instruction Compaction (MIC)

In multiple instruction compaction (MIC), a recurring sequence of instructions, called a pattern, is replaced by a single customized instruction. For example, the pattern {GOTO 0x0030, LDC 0x01, POP} in the bytecode could be replaced by the single customized instruction GOTO__LDC__POP 0x003001, providing a reduction of two bytes per occurrence. The MIC technique involves finding a set of valid patterns and then replacing a subset of those patterns with customized instructions. First, we define the criteria that must be met for a pattern to be replaceable; then we discuss the valid pattern search and replacement algorithms used in TakaTuka.

Valid Pattern Criteria

A sequence of bytecode instructions, or pattern, is said to be valid if it can be replaced by a single customized instruction. Keeping in mind the main aim of TakaTuka (bytecode reduction without increasing RAM or CPU usage), a pattern is valid if it fulfills the following two criteria:

1. A branch-target instruction can only be the first instruction of a pattern, and

2. Any Java bytecode instruction designed to invoke a method (e.g. INVOKEVIRTUAL) can only be the last instruction of a pattern.

Note that both of these restrictions are imposed to avoid decoding a customized instruction during runtime in order to find a return address or branch offset inside it. Finding such targets inside a customized instruction during runtime is not possible without additional computation or without maintaining extra information for each customized instruction.
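The two criteria translate into a simple validity check. The sketch below is hypothetical and simplified (it assumes one program counter slot per instruction, which real bytecode does not guarantee), but it captures the rules:

    import java.util.List;

    // Illustrative check of the validity criteria: a branch target may only start a
    // pattern, and a method invocation may only end one.
    public class ValidPattern {
        static boolean isValid(List<String> pattern, List<Integer> branchTargets, int startPc) {
            for (int i = 0; i < pattern.size(); i++) {
                // criterion 1: no branch target after the first instruction
                if (i > 0 && branchTargets.contains(startPc + i)) return false;
                // criterion 2: no invoke before the last instruction
                if (i < pattern.size() - 1 && pattern.get(i).startsWith("INVOKE")) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isValid(List.of("ILOAD_1", "INVOKEVIRTUAL", "POP"), List.of(), 0)); // false
            System.out.println(isValid(List.of("GOTO", "LDC", "POP"), List.of(), 0));              // true
        }
    }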

Pattern Generation

The pattern generation algorithm finds and selects a number of patterns of instruction sequences from the original bytecode of the Java program, up to the maximum number of available opcodes. These patterns are stored in a hash map which is used as input to the pattern replacement algorithm. The pattern replacement algorithm then constructs customized instructions and replaces the input patterns with those customized instructions in the bytecode. We use the following terminology to explain the pattern generation and replacement algorithms.

m : Total number of available opcodes that may be used for creating customized instructions.

k : Maximum number of single instructions in any pattern.

l_i : Number of single instructions in a pattern i that can potentially be replaced by a customized instruction. We also refer to this parameter as the length of the pattern i.

ε_i : Reduction in bytecode achieved when one occurrence of a pattern i in the bytecode is replaced by a customized instruction. The MIC technique compacts only the opcodes, without affecting the storage needed for the operands, as the example given in Section 4.3 illustrates. Thus, ε_i equals l_i − 1 when a pattern i is replaced by a MIC customized instruction.

ζ_i : Frequency of a pattern i, that is to be replaced by a customized instruction, in the entire bytecode of the Java program.

η_i : Total reduction in bytecode achieved when a pattern i is replaced by a customized instruction in the entire bytecode of the Java program. This reduction is clearly η_i = ε_i · ζ_i.

ξ(y) : Σ_{i∈y} η_i, where y is a set of patterns.

In TakaTuka, pattern generation for multiple instruction compaction uses a Multi-pass greedy algorithm, which is based on a simple Single-pass greedy algorithm.

Single-pass greedy algorithm

The Single-pass greedy algorithm creates a list of patterns of length ≤ k by traversing the bytecode exactly once. When a valid pattern i of any length is encountered for the first time, it is added to the hash map with ζ_i = 1. Then, ζ_i is incremented whenever the same pattern i is found again while traversing the remaining bytecode. Consequently, after a single traversal of the Java bytecode, the hash map contains all possible patterns of length ≤ k with their corresponding frequencies. The algorithm returns a subset σ of patterns from within the hash map such that |σ| ≤ m and ξ(σ) is maximized. This algorithm has one major flaw: it may return patterns that are subsets of other patterns, undermining the extent of bytecode reduction. To elaborate on this flaw, consider the instruction sequence {I1, I2, I3, I4, I5, I1, I2, I3, I6, I7} with m = 3 and k = 3. The single-pass greedy algorithm returns {I1, I2, I3}, {I1, I2}, and {I2, I3}. Note that the two patterns of length 2 are subsets of the pattern of length 3. Thus, the subsequent replacement of patterns will only replace the two occurrences of {I1, I2, I3}, while leaving the sequences {I4, I5} and {I6, I7} intact. If the algorithm had returned {I1, I2, I3}, {I4, I5}, and {I6, I7} as the valid patterns, a greater compaction would have been achieved.
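The pattern collection step of the single-pass algorithm can be sketched as follows. This is not TakaTuka's implementation: validity checks and the savings-maximizing selection of σ are omitted, and instructions are modeled as plain strings.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch: collect every instruction sequence of length 2..k with its
    // frequency in a single traversal of the bytecode.
    public class SinglePassGreedy {
        static Map<List<String>, Integer> collect(List<String> code, int k) {
            Map<List<String>, Integer> freq = new HashMap<>();
            for (int start = 0; start < code.size(); start++) {
                for (int len = 2; len <= k && start + len <= code.size(); len++) {
                    List<String> pattern = new ArrayList<>(code.subList(start, start + len));
                    freq.merge(pattern, 1, Integer::sum);
                }
            }
            return freq;
        }

        public static void main(String[] args) {
            List<String> code = List.of("I1", "I2", "I3", "I4", "I5", "I1", "I2", "I3", "I6", "I7");
            collect(code, 3).forEach((p, f) -> { if (f > 1) System.out.println(p + " x" + f); });
            // prints, among others: [I1, I2] x2, [I2, I3] x2, [I1, I2, I3] x2
        }
    }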

Multi-pass greedy algorithm

The multi-pass greedy algorithm addresses the limitation described above by traversing the bytecode multiple times, making temporary changes in each iteration, and using those changes in subsequent iterations. Thus, this algorithm works on a copy of the bytecode, leaving the task of changing the original bytecode to the pattern replacement algorithm. While making changes in the bytecode during different iterations, the intention is to select better patterns and not to change the bytecode permanently. In the first iteration, the single-pass greedy algorithm is used on a copy of the bytecode and the resulting patterns are stored in a set y. A pattern i is then selected, such that η_i ≥ η_j ∀ j ∈ y, and replaced by a customized instruction in the copy of the bytecode. Subsequent iterations are similar, except that the single-pass greedy algorithm is called on the modified copy of the bytecode from the previous iteration. This continues until either m patterns of length ≤ k are selected or no additional patterns can be found. The customized instructions introduced in a given iteration may become part of a new pattern in subsequent iterations, as long as the constraint of at most k original single instructions per customized instruction is not violated. Let us again consider the example of the instruction sequence {I1, I2, I3, I4, I5, I1, I2, I3, I6, I7} with m = 3 and k = 3. After the first iteration, the multi-pass greedy algorithm selects only the pattern {I1, I2, I3} and constructs a customized instruction, say C1, to represent this pattern. The next iteration runs on the sequence {C1, I4, I5, C1, I6, I7}. In this iteration, C1 may not become part of any pattern as its length is already k. Furthermore, any sub-pattern within C1 is not visible to the algorithm anymore and may not be part of an additional pattern. As a result, subsequent iterations select {I4, I5} and {I6, I7} as the patterns to be replaced by customized instructions.
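A single multi-pass iteration, i.e., replacing all occurrences of the currently best pattern by a fresh customized symbol in the working copy, can be sketched as below; the names are hypothetical and the pattern selection is hard-wired for brevity.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of one multi-pass iteration on a copy of the bytecode.
    public class MultiPassStep {
        static List<String> replaceAll(List<String> code, List<String> pattern, String custom) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i < code.size(); ) {
                if (i + pattern.size() <= code.size()
                        && code.subList(i, i + pattern.size()).equals(pattern)) {
                    out.add(custom);        // occurrence found: emit the customized instruction
                    i += pattern.size();
                } else {
                    out.add(code.get(i++)); // otherwise copy the instruction unchanged
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> code = List.of("I1", "I2", "I3", "I4", "I5", "I1", "I2", "I3", "I6", "I7");
            System.out.println(replaceAll(code, List.of("I1", "I2", "I3"), "C1"));
            // [C1, I4, I5, C1, I6, I7]; the next pass then finds {I4, I5} and {I6, I7}
        }
    }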

Pattern Replacement

The pattern replacement algorithm takes the bytecode and a set of patterns and replaces those patterns, as they appear in the bytecode, by new customized instructions. The primary goal in this replacement process is to maximize the bytecode reduction, leading to the maximum savings in storage. While our pattern generation algorithm is greedy and may not generate an optimal set of patterns, our pattern replacement algorithm is not only optimal but also runs in polynomial time. First, we describe our algorithm; then we show its polynomial complexity in Theorem 4.1, finally proving its optimality in Theorem 4.3.

Algorithm. The pattern replacement algorithm keeps track of many temporary solutions in order to produce the replacement with the maximum savings. The algorithm is applied to each class method within the bytecode one by one and produces the maximum reduction possible for that method with the given set of patterns. The inputs to the pattern replacement algorithm are: 1) the number k indicating the maximum number of single instructions in any pattern, 2) a set of patterns σ generated by a pattern generation algorithm, and 3) the bytecode of a method. The pattern replacement algorithm creates a tree with different replacement possibilities.

One branch of this tree contains the bytecode sequence corresponding to the maximum reduction in bytecode, and the algorithm uses this branch to update the bytecode of the method. To demonstrate the replacement algorithm, assume that the instruction at index i in the bytecode of a method µ is represented by τ_i. That is, {τ_1, τ_2, ..., τ_λ} represents the method bytecode, where λ is the total number of instructions in µ. Each level in the tree corresponds to an index in the bytecode; hence the tree has depth λ. Within the tree, each node located at level j corresponds either to τ_j or to a customized instruction that ends at τ_j. Each node x in the tree has exactly one incoming edge whose weight w(x) is given by:

    w(x) = 1                         if x is the root node,
    w(x) = 0                         if x corresponds to a customized instruction,      (4.1)
    w(x) = min(k − 1, w(x_p) + 1)    if x corresponds to τ_j itself,

where x_p is the immediate parent of node x. Each node x in the tree has exactly w(x) + 1 child nodes, each corresponding to an instruction with a unique length ranging from 1 to w(x) + 1. If node x exists at level j in the tree, then one of its child nodes corresponds to the instruction τ_{j+1}, which has length 1; the other child nodes correspond to customized instructions c that represent a pattern obtained by traversing node x and its parents further above in the tree, to the extent allowed by l_c. From this, we can directly infer the total bytecode reduction of a branch up to a specific node at level j, and this value is maintained within the node. The tree is built level by level, where the addition of each level is done in two phases: the creation phase and the pruning phase. The creation phase of a level j is carried out simply by finding the children of all the nodes at level j − 1. In the pruning phase of level j, we first prune all those nodes from level j which represent an invalid customized instruction, i.e., one whose corresponding pattern is not a member of σ, the set of valid patterns. Subsequently, additional nodes are pruned such that no two nodes have the same weight on their incoming edge. If multiple nodes have the same weight w on their incoming edge, the one corresponding to the leaf of the branch with the highest total bytecode reduction is kept and the remaining nodes are pruned. In this additional pruning, a random selection is made if there is a tie. Since there can be at most k − 1 customized instructions of length between 2 and k ending at τ_j, each level of the tree has at most k nodes after the pruning phase. Thus, the resulting tree is a linear-sized structure with depth λ and a constant span k.

Example. Figure 4.1 shows an example of a pattern replacement tree using the above algorithm. The inputs to the algorithm are (i) k = 3, (ii) the method bytecode I1, I2, I3, I4, I1, I3, I4,

Figure 4.1: Pattern replacement tree

and (iii) σ = {{I1, I2}, {I2, I3}, {I4, I1}, {I1, I3}, {I1, I3, I4}}, which is obtained by applying the multi-pass greedy pattern generation algorithm on the complete bytecode of all functions of a program. The edges of the tree have labels generated using Equation (4.1). The branch with a total reduction of 3 in the figure is selected as the output branch.

Theorem 4.1. The time complexity of the replacement algorithm is O(k² · λ).

Proof. The complexity of the algorithm depends on the size of the tree and the number of operations performed on each node. From the description of the algorithm, we note that the depth of the tree is exactly λ and each level of the tree has at most k nodes. Hence, the tree has at most k · λ nodes. In the worst case, however, a level j may have up to 1 + 2 + 3 + ... + k nodes before the pruning phase. This is because prior to pruning, the number of children of each node is always one greater than the weight on its incoming edge. Since level j − 1 can have at most k nodes with edge weights from 0 to k − 1 after the pruning phase, the number of nodes at level j (i.e., the children of level j − 1) before pruning is simply k(k + 1)/2. In order to prune the nodes at level j, in the worst case, we need to make k(k + 1)/2 comparisons. Thus, the overhead for lookups and comparisons at each level is bounded by O(k²). Since there are exactly λ levels in the tree, the worst case complexity of the algorithm is O(k² · λ). Furthermore, in TakaTuka, k is assigned a constant value of 100; hence the pattern replacement algorithm in TakaTuka exhibits O(λ) runtime complexity.

The optimality of the pattern replacement algorithm is proved in the following.

Lemma 4.2. All nodes with the same weight w on their incoming edge at level j have the same sub-tree originating from them.

Proof. We prove our claim by contradiction: assume that the nodes with the same weight w on their incoming edges at level j do not have the same sub-tree originating from them. This is possible if and only if their immediate children are different. We now consider two situations. (i) When w is zero, each node has only one child node, corresponding to the instruction τ_{j+1}. Thus, all sub-trees are the same, contradicting the original assumption and proving the lemma. (ii) When w is non-zero, a child node is either a customized instruction composed of at most w parent nodes or the simple instruction τ_{j+1}. Based on Equation (4.1), all of those w parents of each node will always be simple instructions rather than customized instructions. By construction, a simple instruction τ_i can only occur at level i in the tree; hence all of those w parent nodes must be equal. In summary, the immediate children of each node with the same w at level j have the same w parents and, therefore, the same child nodes. Hence the sub-tree emanating from each node with the same weight w at level j is always the same.

Theorem 4.3. Given a set of patterns, the pattern replacement algorithm finds the replacement with the maximum overall reduction in the size of the bytecode.

L_INT2BYTE:
    pc_inc = 1;
    int2byte();

L_BASTORE:
    pc_inc = 1;
    bastore();

L_IINC:
    pc_inc = 2;
    iinc((int32_t)bytecode_getarg(1));

L_GOTO:
    pc_inc = 3;
    fun_goto((int32_t)bytecode_getarg(2));
    goto L_END;

// more cases ...

Figure 4.2: TakaTuka instruction labels before Bytecode Compaction.

Proof. First, we note that the complete tree without pruning contains all combinations of solutions, including the optimal one, identified by the leaf node with the highest total reduction. Next, we argue that the branch corresponding to the optimal solution is not affected by pruning. By Lemma 4.2, all nodes with the same weight on their incoming edge at level j have the same sub-tree emanating from them. Hence, keeping only the branch that ends at a leaf node with the maximum overall saving, from a group of tree branches whose leaf nodes have the same reduction, implies that the optimal solution is still part of the tree. This completes the proof.

4.4 Interpreter Support for Bytecode Compaction

The TakaTuka interpreter is implemented using the labels-as-values approach [59] to provide direct threading. Each bytecode instruction translates into a source code snippet following a corresponding label. To this end, TakaTuka treats the customized instructions resulting from single instruction compaction and multiple instruction compaction like any other bytecode instruction that could not be compacted.

L_INT2BYTE___BASTORE___IINC_0401___GOTO_SHORT:
    pc_inc = 2;

    // INT2BYTE
    int2byte();

    // BASTORE
    bastore();

    // IINC with implicit operand
    iinc((uint16_t)(0x0401));

    // GOTO with a shorter offset of length 1
    fun_goto((int32_t)bytecode_getarg(1));
    goto L_END;

// more cases ...

Figure 4.3: TakaTuka instruction labels after Bytecode Compaction.

However, because some instructions are generated only during the compaction process, a 'static' set of labels is not applicable. TakaTuka addresses this problem by dynamically generating the labels supporting the required set of bytecode instructions. Fig. 4.3 indicates the structure of one label after compaction, whereas Fig. 4.2 shows the same instructions before compaction. The aggregated program counter increment in Fig. 4.2 is 7, which decreases to 2 after compaction in Fig. 4.3; this highlights the reduction in the size of the program. It may be noted that the customized instruction in Fig. 4.3 incorporates all three compaction methods described in this chapter.

When a user wishes to generate a general purpose JVM interpreter that can run any CLDC-compliant program, a maximum of 81 opcodes are available for the compaction process. These available opcodes can be used for fixed customized instructions replacing frequently occurring patterns resulting from good programming styles, such as those used in the Java libraries. However, our focus here is the design of a customized JVM rather than the optimization of a general purpose Java virtual machine.

Figure 4.4: The average sizes of the bytecode and the CP as a percentage of the total Java binary size of the programs given in Table 4.2.

Programs             Number of     Aggregated Bytecode   Aggregated CP   Total
                     class files   Size in Bytes         Size in Bytes   Size in Bytes
javax/microedition   10            565                   5678            8505
java/io              18            3347                  11175           28566
java/lang            46            16209                 35885           88896
java/util            12            2041                  8699            18938
Sun Spot Demos       18            15276                 38679           79351
CTP                  112           25850                 89837           199870
DYMO                 112           25830                 93530           199762
JVM Test Cases       59            13175                 41580           100439
Quicksort            48            10041                 26983           72835

Table 4.2: Detailed information about the programs used for evaluating Java binary optimizations.

Program              Available   Made Available   Total Available
javax/microedition   218         0                218
java/io              166         4                170
java/lang            90          17               107
java/util            160         2                162
Sun Spot Demos       149         7                156
CTP                  157         13               170
DYMO                 139         13               152
JVM Test Cases       93          24               117
Quicksort            154         13               167

Table 4.3: Total number of available opcodes before bytecode compaction.

Program              Average Length   Maximum Length
javax/microedition   3.5              8
java/io              5.4              18
java/lang            16.37            98
java/util            5.74             23
Sun Spot Demos       13.55            100
CTP                  5.94             39
DYMO                 11.92            88
JVM Test Cases       9.3              91
Quicksort            6.76             88

Table 4.4: Maximum and average lengths of customized instructions created by MIC with FC.

4.5 Results and Discussion

In this section we present detailed results of the various bytecode compaction schemes used by the TakaTuka JVM for reducing the flash storage requirements of the Java binary. First, we introduce the programs used as input for the performance evaluation. Second, the compaction results of each individual scheme and the aggregated bytecode compaction achieved using these schemes are discussed. Finally, inferences from these results and a summary of the increased JVM performance due to bytecode optimization are given in the last part of this chapter.

Program              Average Length   Maximum Length
javax/microedition   2                2
java/io              2.25             3
java/lang            2.54             4
java/util            2                2
Sun Spot Demos       2.8              4
CTP                  2.28             3
DYMO                 2.64             4
JVM Test Cases       2.53             4
Quicksort            2.5              4

Table 4.5: Maximum and average lengths of customized instructions created by MIC with LC.

Input Programs

The programs used to produce the results of the bytecode compaction schemes employed by the TakaTuka JVM are shown in Table 4.2. These programs may be divided into two categories:

• the applications that may be executed on the TakaTuka JVM (i.e. CTP, DYMO, JVM Test Cases and Quicksort), and

• the programs that are library packages or an application (i.e. Sun Spot Demos) that cannot be executed on the TakaTuka JVM.

We have used the library packages and an application that cannot be executed on the TakaTuka JVM to compare the Java binary size of the Tuk file with the corresponding Suite file of the Squawk JVM. It may be noted that the TakaTuka JVM does not transfer an entire library to a mote. Instead, it strips the library of all the functions and fields that are not used by the corresponding application. Thus the comparison of TakaTuka with Squawk involves only the input program and excludes any class files that are used by that program but are not part of it. It is useful to note the total size of an input program before optimization. Table 4.2 shows the total number of class files in each of our input programs and the size of the Java binary of these programs. For each class file in a Java program, the Java binary can be divided into three parts: the bytecode, the constant pool (CP), and other miscellaneous structural information and flags. Fig. 4.4 shows the average percentage sizes of the bytecode and the CP of the given input programs. On average, for the given programs, the aggregate bytecode size is 14.06 % and the CP size is 43.47 % of the Java binary. Table 4.3 shows the total number of opcodes that are available, or can be made available⁴, before any bytecode compaction. These are the opcodes that may be used by the bytecode compaction schemes. We have presented three bytecode compaction techniques in this chapter: 1) Operand Reduction (ORD), 2) Operand Removal (ORE), and 3) Multiple Instruction Compaction (MIC). Each compaction algorithm creates new customized instructions and assigns opcodes to these instructions from the set of available opcodes. The TakaTuka JVM interpreter is implemented using the labels-as-values approach. The introduction of customized instructions causes a change in the size of the interpreter. This may limit the net compaction of the Java binary by an amount γ_x, which is the total change (increase or decrease) in the size of the interpreter for supporting a new customized instruction that replaces a given pattern x of original instructions. The net reduction in the bytecode while compacting a pattern x is therefore given by:

R(x) = η_x − γ_x    (4.2)

where η is defined in Section 4.3.⁵

⁴ Additional opcodes are made available by ignoring the type information of similar bytecode instructions after bytecode verification.
⁵ Here, η is used in the context of all three compaction techniques and not just in the context of MIC.

By means of configuration, we allow two compaction strategies in the TakaTuka JVM, namely Limited Compaction (LC), in which the change in the size of the interpreter is taken into account, and Full Compaction (FC), in which any change in the size of the interpreter is ignored by setting γ_x = 0 in Equation 4.2. Compaction is always performed in the case of FC, while in the case of LC, the compaction is only performed when the reduction function of Equation 4.2 returns a positive value.
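The decision rule amounts to a one-line test. A minimal sketch with hypothetical names; eta and gamma stand for η_x and γ_x of Equation 4.2:

    // Illustrative LC/FC decision: LC replaces a pattern x only if the net
    // reduction R(x) = eta - gamma is positive; FC sets gamma to 0 and always replaces.
    public class CompactionPolicy {
        static boolean shouldCompact(int eta, int gamma, boolean fullCompaction) {
            int netReduction = eta - (fullCompaction ? 0 : gamma);
            return netReduction > 0;
        }

        public static void main(String[] args) {
            // a pattern saving 10 bytes of bytecode but costing 14 bytes of interpreter code:
            System.out.println(shouldCompact(10, 14, false)); // LC: false, skip it
            System.out.println(shouldCompact(10, 14, true));  // FC: true, compact anyway
        }
    }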

[Figure: two bar charts of the percentage reduction in Java binary per input program with LC, (a) for the ORE scheme and (b) for the MIC scheme.]

Figure 4.5: Percentage reduction in Java binary using the ORE and MIC schemes with LC.

[Figure: two bar charts of the percentage reduction in Java binary per input program with FC, (a) for the ORE scheme and (b) for the MIC scheme.]

Figure 4.6: Percentage reduction in Java binary using the ORE and MIC schemes with FC.

[Figure: two bar charts of the aggregated percentage reduction in Java binary per input program using the three compaction schemes, (a) with LC and (b) with FC.]

Figure 4.7: Aggregated percentage reduction in Java binary using the three compaction schemes with LC and FC.

[Figure: bar chart of the percentage reduction in Java binary per input program using the ORD scheme.]

Figure 4.8: Percentage reduction in Java binary using the ORD scheme.

For MIC, the number of instructions allowed to be part of a customized instruction has been limited to 100. This limit is imposed so that the TakaTuka JVM does not take more than a reasonable amount of time for bytecode compaction. Tables 4.4 and 4.5 show that a program rarely approaches this limiting value.

Program              Opcodes Used   Customized Instr.
javax/microedition   1              1
java/io              7              7
java/lang            54             54
java/util            5              5
Sun Spot Demos       48             49
CTP                  5              5
DYMO                 28             28
JVM Test Cases       9              9
Quicksort            5              5

Table 4.6: Number of opcodes used for creating customized instructions by the ORE scheme with LC.

Individual Schemes Performance

In this section it is assumed that all the opcodes listed in Table 4.3 are available to a given compaction technique and that the technique is applied on the bytecode independently of the other techniques. For each technique, except Operand Reduction, results are presented with limited compaction as well as with full compaction.

Program              Opcodes Used   Customized Instr.
javax/microedition   0              2
java/io              4              4
java/lang            33             33
java/util            1              1
Sun Spot Demos       59             65
CTP                  6              7
DYMO                 27             28
JVM Test Cases       14             15
Quicksort            5              6

Table 4.7: Number of opcodes used for creating customized instructions by the MIC scheme with LC.

Program              Opcodes Used   Customized Instr.
javax/microedition   1              23
java/io              1              23
java/lang            8              30
java/util            2              29
Sun Spot Demos       5              27
CTP                  3              25
DYMO                 6              27
JVM Test Cases       3              29
Quicksort            4              27

Table 4.8: Number of opcodes used for creating customized instructions by the ORD scheme.

Program              Opcodes Used   Customized Instr.
javax/microedition   60             73
java/io              107            110
java/lang            47             47
java/util            96             102
Sun Spot Demos       95             96
CTP                  110            110
DYMO                 91             92
JVM Test Cases       57             57
Quicksort            103            107

Table 4.9: Number of opcodes used for creating customized instructions by the ORE scheme with FC.

In the case of Operand Reduction, full compaction is always carried out, as very few opcodes are needed by the Operand Reduction compaction scheme. When LC is used, average compactions of 12.98 % and 9.03 % are achieved for the ORE and MIC schemes, respectively, as shown in Fig. 4.5. The number of opcodes used by these individual reduction schemes is presented in Table 4.6 and Table 4.7. The average numbers of opcodes used are 18 and 16.55 for the ORE and MIC schemes, respectively. In contrast, when the ORD scheme is employed with FC, 17.11 % compaction is achieved on average using on average only 3.6 opcodes (see Fig. 4.8 and Table 4.8). This result shows that the average amount of compaction achieved using a single opcode is significantly higher in the ORD scheme, i.e. 4.75 %, as compared to 0.72 % for the ORE scheme and 0.54 % for the MIC scheme. Furthermore, the number of customized instructions created is always greater than or equal to the number of opcodes used, as it is possible that a set of customized instructions actually makes a few more opcodes available. When FC is used, the average compaction achieved using the ORE and MIC schemes is 29.84 % and 34.19 %, respectively, as shown in Fig. 4.6. This is more than twice the compaction achieved using these schemes with LC. However, the ORE and MIC schemes use more than 4 times and 6 times as many opcodes, respectively, with FC as with LC, as shown in Tables 4.6, 4.7, 4.9 and 4.10.

Program              Opcodes Used   Customized Instr.
javax/microedition   12             40
java/io              103            198
java/lang            107            117
java/util            73             116
Sun Spot Demos       156            205
CTP                  154            235
DYMO                 152            187
JVM Test Cases       114            222
Quicksort            87             169

Table 4.10: Number of opcodes used for creating customized instructions by the MIC scheme with FC.

Collective Performance

After analyzing the bytecode reduction achieved using each compaction technique individually, we now consider maximizing the aggregated reduction when all of them are used successively. When a compaction technique is employed, the input bytecode instructions might already have been changed by a preceding compaction technique. Furthermore, a compaction technique may not have all the available opcodes of Table 4.3 at its disposal, as some opcodes might already have been used by another technique.

Program              Opcodes Used
javax/microedition   1
java/io              13
java/lang            96
java/util            12
Sun Spot Demos       110
CTP                  16
DYMO                 81
JVM Test Cases       28
Quicksort            16

Table 4.11: Number of opcodes used for creating customized instructions by the three compaction schemes with LC.

Program              Opcodes Used
javax/microedition   42
java/io              169
java/lang            107
java/util            113
Sun Spot Demos       156
CTP                  169
DYMO                 152
JVM Test Cases       116
Quicksort            138

Table 4.12: Number of opcodes used for creating customized instructions by the three compaction schemes with FC.

There could be different permutations of the three compaction techniques, and a permutation could be repeated a number of times. Although bytecode compaction is performed on the desktop computer before the program is actually transferred to the mote, we still aim at providing bytecode compaction without significantly increasing the time required for such compaction. Hence, we apply each technique on the bytecode exactly once and in the order mentioned in Fig. 1.2. This order is chosen because of its simplicity of implementation. A detailed study of other permutations is left as future work. The bytecode compaction achieved using both FC and LC is measured again. The bytecode compaction with LC using all three schemes is 34.81 % on average, and with FC it is 60.89 % on average, as shown in Fig. 4.7. The corresponding numbers of opcodes used by LC and FC are detailed in Tables 4.11 and 4.12.

[Figure: two bar charts of the aggregated compilation time (ms) per input program, (a) with LC and (b) with FC.]

Figure 4.9: Aggregated compilation time taken by the three compaction schemes with LC and FC.

[Figure: bar chart of the percentage increase in JVM size with FC as compared to LC, for CTP, DYMO, JTC, Quicksort and Binary Search.]

Figure 4.10: Percentage increase in the JVM size with FC as compared to JVM size with LC.

[Figure: bar chart of the percentage increase in program speed with LC and FC, for CTP, DYMO, JVM Test Cases, Quicksort and Binary Search.]

Figure 4.11: Percentage increase in the bytecode execution speed using LC and FC as compared to no bytecode compaction.

Implications of Compaction

Bytecode compaction increases the compilation time of a program because the compaction algorithms are executed during program compilation. It is shown in Fig. 4.9(a) that 4.8 seconds are required on average to compile a program using LC. In contrast, when FC is used, compiling a program takes 5.2 minutes on average, as shown in Fig. 4.9(b). The bytecode is usually less than 20 % of the Java binary of a project (see Fig. 4.4). Therefore, bytecode compaction alone cannot reduce the size of the Java binary significantly. However, bytecode compaction can result in faster bytecode execution, which reduces the number of CPU cycles required by the program, thus increasing the lifetime of the wireless sensor network. For the given set of programs, the program execution speed increases by 4.5 % on average with LC and by 32.06 % on average with FC, as shown in Fig. 4.11. The advantage of FC is the increased JVM interpreter speed, because customized instructions may replace a number of original Java instructions, thus reducing the total number of instructions dispatched and operands fetched. The disadvantage is that the size of the interpreter supporting the customized instructions increases by more than the bytecode compaction achieved using FC. Fig. 4.10 shows that on average FC results in a 16.86 % increase in the interpreter size as compared to when LC is used. It is suggested to use LC if a program is to be executed on a mote with small flash (e.g. TelosB has only 48 KB of flash), whereas FC should be used if the mote has a reasonably larger flash (e.g. the IRIS mote with 128 KB of flash). TakaTuka allows the programmer to select the type of compaction to be used during program compilation.

Chapter 5

Class File Optimizations


5.1 TakaTuka Constant Pool Optimization

Each class file has a collection of distinct constant values of variable size called the Constant Pool (CP). These constant values either define the characteristics of the class or are used by the class itself. A two-byte index is used to access a given CP value, which is usually larger than two bytes and is referenced a number of times from the class file. The aggregated CP size of a project is usually much larger than its total bytecode size. Hence, reducing the CP size is critical to the overall size reduction of a Java program. Our constant pool design is based on some of the ideas given in [58] and [60], with improvements using the characteristics of a Split-VM architecture, which is not considered in the above references. In the following paragraphs we present the optimizations used in TakaTuka for reducing the CP size.

Global Constant Pool

The Java specifications outline eleven different CP types [11], and each value in any CP belongs to exactly one of these types, which include numeric constants, string literals, and references to Java entities. In traditional designs, the CP values of a single class appear in an arbitrary order within the CP, where a leading one-byte tag is used for type identification. This design, however, has the following shortcomings:

1. One byte is consumed to specify the type with each CP entry.

2. Since a CP is unordered, an index has to be built in RAM in order to access its entries in constant time.

3. Although CP values are distinct within one class, there can be many redundant occurrences in the scope of a given project.

These shortcomings lead to excessive flash and RAM requirements, both of which are scarce resources on sensor motes. To address this in TakaTuka, we use the preloading characteristic of the SVA and create one global pool per type during the linking phase. As compared to traditional CPs, our set of Global Constant Pools (GCPs) has no redundant information per project. We keep a common header for these GCPs specifying the start address of each pool and the corresponding type. As all entries of a GCP have the same type, no tag per constant pool value is required. Keeping a separate CP per type enables a constant-time lookup of any CP entry in flash and does not require loading the complete CP into RAM. This is because each CP type, except UTF-8, always has fixed-size values, and UTF-8 values are never used directly from the bytecode [11]. We explain this with an example in which a JVM interpreter wants to decode the instruction INVOKEVIRTUAL 15. The INVOKEVIRTUAL instruction always uses the constant pool of type 10, in which each element is 4 bytes long [11]. Hence the JVM interpreter performs a binary search (in the list of at most eleven elements) to find the starting address of the GCP with type 10. Subsequently, the interpreter reads the 15th value of the CP with type 10 by looking at location 14 · 4 = 56 from that start address. This constant-time lookup is possible because each Java CP-instruction always accesses the same type of CP (e.g., INVOKEVIRTUAL always accesses the CP of type 10). However, there are three exceptions to this, namely the instructions LDC, LDC_W and LDC2_W. We have introduced five additional bytecode instructions so that in TakaTuka each CP-instruction, including the ones mentioned above, implicitly contains the CP type information. In traditional Java, the CP of a class uses a two-byte index. Since TakaTuka uses global CPs, it would appear that two bytes may not be sufficient to index all the CP entries. Our experiments, however, show that even the entire Java Standard Edition can be indexed globally with just two bytes. This is due to the fact that most CP values are removed during CP globalization. Furthermore, the remaining values are distributed among eleven CPs, where every CP has its own separate index; this implies that each type of GCP starts with index zero. On motes, it is even more unlikely that the aggregated size of an application and the Java CLDC libraries exceeds the two-byte index limitation. However, the design of TakaTuka still provides an option to use more than two bytes for CP indexing, should the need arise.
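The constant-time lookup from the INVOKEVIRTUAL example can be sketched as follows. The class, its layout and the addresses are hypothetical; TakaTuka reads the pool headers and values from flash, whereas the sketch keeps them in a sorted map to stand in for the binary search over at most eleven pool headers.

    import java.util.TreeMap;

    // Illustrative GCP lookup: find the per-type pool header, then jump to index * entrySize.
    public class GcpLookup {
        // CP type -> { startAddress, entrySizeInBytes } of the per-type global pool
        private final TreeMap<Integer, int[]> pools = new TreeMap<>();

        void registerPool(int type, int startAddress, int entrySize) {
            pools.put(type, new int[] { startAddress, entrySize });
        }

        int addressOf(int type, int zeroBasedIndex) {
            int[] pool = pools.get(type); // stands in for the binary search over the headers
            return pool[0] + zeroBasedIndex * pool[1];
        }

        public static void main(String[] args) {
            GcpLookup gcp = new GcpLookup();
            gcp.registerPool(10, 1000, 4); // pool of type 10 with 4-byte entries at address 1000
            // INVOKEVIRTUAL 15 reads the 15th entry: start + 14 * 4 = start + 56
            System.out.println(gcp.addressOf(10, 14)); // 1056
        }
    }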

Reference Resolution

Traditional JVMs apply dynamic loading, also called on-demand loading. Whenever a class method or a field needs to be accessed, the corresponding class file has to be loaded into RAM after performing verification, preparation and resolution [11]. To resolve references during runtime, fully qualified names are required to identify components (i.e., methods, fields and classes). In TakaTuka, we have used the preloading characteristic of the SVA to resolve names during linking. Hence a preloaded, preverified and resolved Java program is transferred to a mote. This allows us to remove all the UTF-8 strings which are traditionally required for name resolution but are not otherwise used by the application. Furthermore, we also remove all the other constant pool entries (e.g., all entries of type 12) typically used for resolving names during runtime [11].

Variable Sized Literals

A constant pool may contain four types of numeric literals: Double (type 6), Long (type 5), Float (type 4), and Integer¹ (type 3) [11]. Traditionally, each of these CP types has a fixed size like the other CP types. This implies that any entry within the Integer pool (type 3) will always be represented in four bytes, even if it holds only one byte of data, leading to wasted flash memory. In TakaTuka, we have divided each of these literal constant pools into sub constant pools of sizes 1, 2, 4, 6 and 8 bytes, keeping the start addresses of these sub-pools in the GCP common header. Hence, when an Integer literal has only one byte of data, it is placed in the sub-pool reserved for one-byte data within the type-3 GCP. Although this sub-pool architecture appears to require extra computation for finding the start address of a sub-pool using binary search in the GCP common header, the search merely extends from 11 to 16 items, which has the same worst-case complexity for binary search.
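As an illustration of the sub-pool idea for Integer literals, a hedged C sketch follows; the function name and the restriction to the 1-, 2- and 4-byte sub-pools are our own simplification:

    #include <stdint.h>

    /* Width of the smallest Integer sub-pool that can hold the literal. A
     * value such as 100 lands in the 1-byte sub-pool instead of consuming
     * the full 4 bytes a traditional Integer CP entry would occupy. */
    static int int_subpool_width(int32_t value) {
        if (value >= INT8_MIN  && value <= INT8_MAX)  return 1;
        if (value >= INT16_MIN && value <= INT16_MAX) return 2;
        return 4;
    }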

¹ Boolean, Short, and Byte are also represented as Integers within the constant pool.

Programs          Class Files Removed    Methods Removed
CTP               55.35 %                51.93 %
DYMO              37.50 %                47.55 %
JVM Test Cases    27.27 %                53.56 %
Quicksort         22.91 %                62.92 %

Table 5.1: Class files removed by the dead-code removal algorithm and the functions removed from the rest of the class files.


5.2 Dead Code Removal

For dead code removal, TakaTuka employs the commonly known procedure of traversing a program from its main method and removing all the methods, fields and classes that are not reachable from it, taking polymorphism and dynamic binding into account. Our dead code removal is similar to any traditional Java dead code removal with one exception: when removing dead code, TakaTuka does not differentiate between a library and a user program. In this way, TakaTuka removes those classes, methods, and fields that are not accessed from the main method, even if they are part of the included libraries.
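The traversal can be sketched as a standard worklist pass over the call graph. The following C fragment is illustrative only (the data structures and limits are invented); in particular, a virtual call site must conservatively contribute every possible overriding target, which is how polymorphism and dynamic binding are taken into account:

    #include <stdbool.h>

    #define MAX_METHODS 4096

    /* calls[m][k]: possible targets of method m; ncalls[m]: their count.
     * For a virtual call site, all overriding methods are listed. */
    static int  calls[MAX_METHODS][16];
    static int  ncalls[MAX_METHODS];
    static bool reachable[MAX_METHODS];

    static void mark_reachable(int main_method) {
        int worklist[MAX_METHODS], top = 0;
        reachable[main_method] = true;
        worklist[top++] = main_method;
        while (top > 0) {
            int m = worklist[--top];
            for (int k = 0; k < ncalls[m]; k++) {
                int target = calls[m][k];
                if (!reachable[target]) {   /* each method is enqueued once */
                    reachable[target] = true;
                    worklist[top++] = target;
                }
            }
        }
    }
    /* Methods, fields and classes left unmarked are dead and are removed,
     * whether they belong to the user program or to a library. */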

5.3 TUK File Format

The class file has three important parts: the bytecode, the CP and the structure information. We observe that the structure information usually makes up half the size of the class files of a program. We use a special format called TUK for storing the set of class files used by a user program and the corresponding Java library components. The TUK format has two main characteristics. First, it only contains information required during program execution, in a reduced format, forgoing any information required during the linking or debugging phases. Second, the TUK format contains preloaded information stored in an easy-to-access manner, obviating the need to load the TUK file into RAM. The special characteristic of the TUK file is that we can access different portions of it, using pre-calculated addresses and indexes, in constant time, relieving the computation resources of the host device. It may be noted that the addresses and indexes that make up part of the preloaded information are created by processing on a desktop computer, before transferring a small preloaded and preverified TUK file to the host device. For example, in a TUK file a CP entry for a class or a method reference contains the address of the location inside the TUK file where the actual data (i.e., the class file or the method) is to be found. All the addresses in the TUK file are relative to its starting address and are, therefore, platform independent. We use either 2-byte or 4-byte addresses depending upon the total size of the TUK file.
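Because every stored address is relative to the start of the TUK file, resolving one is a single addition, as the following hedged C sketch shows (names are illustrative):

    #include <stdint.h>

    /* The TUK image as placed in flash; tuk_resolve turns a stored relative
     * address into a flash pointer in constant time, without copying any
     * part of the file into RAM. The same image therefore works unchanged
     * wherever in flash it is placed, on any platform. */
    static const uint8_t tuk_image[] = { 0 /* ... preloaded TUK bytes ... */ };

    static const uint8_t *tuk_resolve(uint32_t relative_address) {
        return tuk_image + relative_address;
    }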


Figure 5.1: Percentage reduction in the number of Global Constant Pool (GCP) and Tuk-GCP entries as compared to the aggregated number of entries in all the constant pools of the program's class files. Panel (a) shows GCP entries; panel (b) shows Tuk-GCP entries.


Figure 5.2: Percentage reduction in the size of the GCP and the Tuk file constant pool as compared to the aggregated size of all the constant pools of the program's class files. Panel (a) shows GCP size; panel (b) shows Tuk-GCP size.


Figure 5.3: Percentage reduction of the Tuk file, compressed Jar and Squawk's Suite file as compared to the aggregated size of the program's class files.

5.4 Results and Discussion

The dead code removal of the TakaTuka JVM removes, during compilation, all class files and methods that are not used by the program. Table 5.1 shows that the TakaTuka dead code removal on average removes 36 % of the class files and, subsequently, 57.71 % of the methods from the remaining class files of the program. As shown in Fig. 4.4, the aggregated size of a program's CPs is on average more than 40 % of its total Java binary size. Therefore, CP optimization can play a significant role in reducing the overall size of the program's Java binary. The TakaTuka constant pool optimizations may be divided into two steps:

1. Each CP of a class file has unique values; however, the set of CPs of a program may contain redundant information. In TakaTuka, we remove all the redundant CP entries and create a set of Global Constant Pools (GCPs) with program-wide unique information.

2. A CP of a class contains the names of the methods, fields and classes used by the bytecode of the class's functions. The TakaTuka CP optimizations change the GCP by resolving these names. Subsequently, all entries storing these names are removed. Furthermore, information stored in the GCPs for loading, debugging and verification of the class files is also removed, because TakaTuka has static loading and the bytecode is verified during compilation instead of during program execution. These final versions of the CPs are called Tuk-GCPs.

Fig. 5.1 shows that the GCPs have on average 46.74 % fewer entries and the Tuk-GCPs 82.52 % fewer entries than the original CPs of the program. The corresponding percentage reductions in aggregated size are 51.06 % for the GCPs and 95.04 % for the Tuk-GCPs, compared to the original CPs of the project. Note that the above optimizations do not increase the runtime CPU or RAM utilization, because the Tuk-GCPs in TakaTuka are kept in the mote's flash and have constant-time access, as explained above.

The overall reduction in the size of the Java binary is shown in Fig. 5.3. The TakaTuka Java binary, Tuk, is compared with the Sun Squawk Java binary (called Suite) and with a compressed Jar file. The TakaTuka JVM does not transfer the complete Java library onto a mote. However, we have used libraries for the comparison with Squawk Suite files because TakaTuka and Squawk cannot be compared using applications: TakaTuka and Squawk have different interfaces for supporting drivers and functions to access a mote's hardware. The Tuk file is on average 19 % smaller than the corresponding Suite file and 94 % smaller than the corresponding compressed Jar file. Furthermore, as compared to the original Java binary of a project, the Tuk file provides a 92.92 % reduction in the case of applications (i.e., CTP, DYMO, JTC and Quicksort) and an average reduction of 85.8 % in the case of library files.

Chapter 6

Interpreter Design

6.1 Runtime Architecture

The runtime architecture of TakaTuka is based on a JVM interpreter rather than a Just-in-Time (JIT) or Ahead-of-Time (AOT) compiler. This is because a JIT architecture requires a relatively large amount of memory, a scarce resource in tiny embedded devices, for producing and storing native code dynamically at runtime [59]. Furthermore, many embedded devices follow the Harvard architecture, in which flash is used as program memory while RAM is used as data memory. In such devices, JIT compilation would require generating native code in flash memory, which usually has a long write time. Another drawback of using JIT is that a significant portion of a JIT compiler needs to be rewritten for different hardware architectures with different native code [59]. Thus, we chose to forgo the use of JIT in the design of TakaTuka. We also found AOT compilers to diminish our portability goals, because AOT compilation generates executables that are not portable across multiple platforms. As JIT and AOT are not suitable for a variety of tiny embedded devices, our design of TakaTuka is based on a JVM interpreter.

6.2 Customized JVM

The flash memory in some tiny embedded devices may be too small to contain a complete JVM. To address this limitation, the default behavior of TakaTuka is to reduce the size of the interpreter depending upon the set of bytecode instructions used by a given Java program. To this end, TakaTuka removes all unused components from the JVM, stripping the JVM bytecode support down to the bytecode set used by the given program. For a given Java program, the initial step of compilation is to generate a header file listing the set of bytecode instructions used by that program. This step is completed on the desktop computer. Subsequently, the JVM is recompiled, also on the desktop computer, to shrink itself based on the information contained in the header file (see the sketch below). This default behavior leads to a very small JVM¹, albeit one that is capable of running only a subset of Java programs. If a more generic JVM, capable of supporting additional bytecode instructions, is needed, TakaTuka allows this through a configuration file. For example, it is possible to have a JVM version that supports the complete set of bytecode instructions except the ones which involve floating point operations. Similarly, a user can also completely turn off the JVM customization, resulting in the generation of a general purpose JVM that supports any CLDC-compliant Java program.

¹ The correct technical term would be a “JVM interpreter” instead of “JVM”, but using this terminology is also a common practice and we will use it as long as the context is clear.
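The following hedged C sketch illustrates the customization mechanism; the macro names and the generated-header convention are invented for illustration, while the opcode values are those of the JVM specification:

    /* --- header generated on the desktop for one particular program --- */
    #define TT_USES_ILOAD 1          /* the program uses ILOAD */
    /* TT_USES_FDIV is not defined: the program performs no float division */

    /* --- interpreter dispatch (excerpt), recompiled per program --- */
    void dispatch(unsigned char opcode) {
        switch (opcode) {
    #ifdef TT_USES_ILOAD
        case 0x15: /* ILOAD handler is compiled in */ break;
    #endif
    #ifdef TT_USES_FDIV
        case 0x6E: /* FDIV handler is compiled out for this program */ break;
    #endif
        default: break;              /* unreachable for the customized program */
        }
    }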

6.3 Hardware and Operating System Abstraction Layer

The TakaTuka JVM has a hardware and operating system abstraction layer (HOAL) that enables writing Java code for a mote independently of the underlying operating system and hardware. The HOAL hides the underlying operating system and hardware from the programmer. Hence code written for one mote can be executed, without any change, on another mote with different hardware or a different operating system. A high-level view of the HOAL is shown in Fig. 1.3. The HOAL provides Java interfaces to access the underlying functions provided by an operating system. These interfaces can only be initialized using an object factory [61]. The factory implementation changes depending upon the operating system used by a mote, but the Java program written for a mote remains unchanged even if used on another mote that has a different operating system. This is so because the Java program is exposed only to the interfaces provided by the factory instead of the actual implementations. Thus the factory makes the program independent of the operating system. A similar concept is employed to make the TakaTuka JVM hardware-independent, by providing low-level hardware-independent interfaces. The TakaTuka JVM uses those interfaces to access hardware-specific implementations instead of using those implementations directly.
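At the C level, the same idea can be pictured as a common header implemented once per platform; this is a conceptual sketch with invented names, not the actual HOAL source:

    #include <stdint.h>

    /* hoal_radio.h: the hardware-independent interface the JVM core calls */
    void hoal_radio_send(const uint8_t *payload, uint8_t length);

    /* hoal_radio_tinyos.c: the implementation linked in when building for a
     * TinyOS mote; other ports provide their own definition of the same
     * function, so the JVM core never changes. */
    void hoal_radio_send(const uint8_t *payload, uint8_t length) {
        /* forward the buffer to the TinyOS radio driver */
        (void)payload; (void)length;
    }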

6.4 TakaTuka Linkage With TinyOS

TakaTuka can run on any platform that provides an implementation of the HOAL's Java interfaces. This section describes the TakaTuka linkage with TinyOS. To work with TinyOS, the TakaTuka interpreter is allowed to execute n bytecode instructions in a scheduling cycle before returning control to the TinyOS scheduler. This is accomplished by posting a TinyOS Task [18] that calls the TakaTuka interpreter, which is written in C. We keep n small so that TinyOS remains responsive between TakaTuka scheduling cycles (see the sketch below). TinyOS is an event-driven operating system. In contrast, Java functions are generally blocking, and it is up to the Java user to create threads when multitasking is required. Hence, when a TakaTuka user calls a method that is written as an event in TinyOS, that method blocks until the corresponding event is generated in TinyOS. When TinyOS receives an event, the Java thread waiting for that event is notified. A TakaTuka user can access any of the TinyOS Commands [18] using TakaTuka's native method support. For a split-phase TinyOS Command [18], the current thread is put into a waiting state after calling the Command and is notified when the latter completes. TakaTuka also supports thread synchronization, which is used for sharing resources (e.g., the radio) between multiple user threads.
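The handshake with the TinyOS scheduler can be pictured as follows. This is a conceptual C sketch with invented names; in nesC the interpreter task would be scheduled with the post keyword:

    #define TT_BYTECODES_PER_CYCLE 100   /* "n": kept deliberately small */

    static int tt_interpret(int max_bytecodes) {
        /* ...fetch and dispatch up to max_bytecodes instructions... */
        (void)max_bytecodes;
        return 0;                        /* nonzero while work remains */
    }

    static void tt_post_interpreter_task(void);  /* wraps TinyOS "post" */

    /* Body of the TinyOS task: run one burst, then yield and re-post. */
    static void tt_interpreter_task(void) {
        if (tt_interpret(TT_BYTECODES_PER_CYCLE)) {
            tt_post_interpreter_task();
        }
    }

    static void tt_post_interpreter_task(void) {
        /* in nesC: post tt_interpreter_task(); stubbed here */
    }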

Power Management

The current version of the TakaTuka JVM uses the same power management strategies as implemented in TinyOS, but in future versions we intend to provide more comprehensive power management. In the current version, when all Java threads are sleeping or waiting for events, TinyOS can go into its low power state and TakaTuka follows suit. When an event arrives, any thread waiting for that event is notified and TakaTuka resumes. Furthermore, TakaTuka uses the TinyOS radio drivers, thereby supporting the low-power listening implemented by TinyOS.

6.5 Tiny Java Debugger

Debugging wireless sensor network (WSN) applications is known to be quite difficult, as the execution flow of such applications is usually triggered by unpredictable external events. Existing debuggers for WSNs can be classified into two categories:

• Node and network debugger: monitors the state of the network and motes. Such debuggers report high-level failures, e.g., node crashes, congestion and bad paths to a node.

• Native language debugger: debugs the native code of a program.

Using a native debugger for examining Java source code execution has the following drawbacks: (1) Running such debuggers directly on a mote is usually not possible; therefore, either an In-Circuit Emulator (ICE) or a simulator is required. (2) The programmer has to interpret raw memory contents. (3) The bytecode has to be mapped manually to the corresponding source code. (4) The programmer has to filter much more information (such as JVM internal data structures) than is necessary to debug Java source code.

Another way to debug code is to use a primitive debugging method such as print-and-retry, in which the developer prints the target variable to the console screen via a serial channel. The developer usually has to recompile the entire code a number of times and inject the printing statement at different positions before the issue can be pinned down.

TakaTuka has a Tiny Java Debugger (TJD), which can monitor and examine Java source code execution on a mote without significantly burdening the mote's resources [62]. The TJD has been tested on most of the motes supported by the TakaTuka JVM. Furthermore, it can support any mote supported by TakaTuka in the future with few or no changes. Table 6.1 lists the motes on which TJD has been tested and the debugging features supported by the TakaTuka JVM.

Motes Tested With TJD    Supported Features
Mica2, Micaz,            Breakpoints, step-by-step execution,
TelosB, IRIS             object dumps, field dumps, array dumps,
                         garbage collection control, threads list,
                         stack traces

Table 6.1: List of motes tested using TJD and the supported features.

System Design of TJD

In order to effectively debug Java source code, Sun designed the Java Platform Debugger Architecture (JPDA) for Java SE applications [63]. Implementing JPDA requires a large RAM footprint and flash storage space, so adopting this architecture for motes is not feasible given the resource constraints of a mote. Furthermore, the JPDA uses debug symbols contained in the Java binary files; however, the TakaTuka JVM strips the Java binary of these symbols in order to reduce storage requirements. Sun has developed the Squawk JVM for embedded devices [10], which works on Sun SPOT embedded devices with 512 KB of RAM. Although the RAM requirements of the Squawk JVM are quite small as compared with those of a regular JVM, the required memory is still much higher than a typical mote can offer. The Squawk JVM has a split debugging architecture that forms the basis of the TJD developed in this work. The split debugging architecture used by both the Squawk debugger and TJD is shown in Fig. 6.1. In this architecture, the debug symbols are stored on a desktop computer that runs the debugger, rather than in the ROM image transferred onto a mote. The mote responds only to simple commands, for example “Send me the current method ID and offset”; the response is received at the debugging proxy running on the desktop computer. The proxy processes the received response, i.e., the method ID and offset, and converts it into a user-friendly representation based on the debug symbols stored on the desktop computer. The TJD differs from Squawk's split architecture in several ways:

• Unlike Squawk, the TJD agent on a mote is written in native C, thus requiring less RAM.

• The wire protocol of TJD is significantly simplified to work with less than 100 bytes of RAM.

• Unlike the Squawk debugger, which supports only ARM-based platforms, TJD has been tested on multiple platforms.

Figure 6.1: The split debugging architecture used by TJD and Squawk.
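To illustrate why the protocol fits in under 100 bytes of RAM, consider a hypothetical reply message in the style of TJD; the struct and constant below are invented for illustration and do not reproduce the actual wire format:

    #include <stdint.h>

    /* The mote only ever reports small, fixed-size facts; all symbol lookup
     * happens in the proxy on the desktop, next to the debug symbols. */
    enum { CMD_REPORT_LOCATION = 1 };    /* value invented */

    typedef struct {
        uint8_t  command;    /* e.g., CMD_REPORT_LOCATION */
        uint16_t method_id;  /* identifies the method within the Tuk file */
        uint16_t pc_offset;  /* bytecode offset inside that method */
    } tjd_location_reply;    /* about 5 bytes of payload on the wire */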

6.6 Results and Discussion

Using a debugger on any platform unavoidably drags down performance and reduces the memory space available to the user. The overhead is caused mainly by extra code injected into each instruction fetch cycle. The performance issue has been given serious consideration during the design process of our debugger. We have attempted to write our code using as few static variables as possible; dynamic memory is not used at all by the debugger. The communication protocol has also been kept extremely simple to avoid any potential overhead. The programs tested using the Tiny Java Debugger (TJD) show an average performance drop of about 10 %, which is hardly noticeable. Furthermore, the Tiny Java Debugger designed for the TakaTuka JVM uses only static RAM for debugging a program. The amount of static RAM used by it is only 88 bytes, and the amount of flash memory required by the debugger is around 1 KB. This shows that it is feasible to use TJD on tiny motes with a few KB of RAM and small flash memory.

Based on the user program to be installed on a mote, the size of the TakaTuka JVM can be reduced by stripping functionality not used by that program. Fig. 6.2 shows the percentage increase in the size of the customized JVM as compared to the customized JVM compiled for the smallest program possible, i.e., the null program. For the null program the JVM size decreases to only 30 KB when compiled for the IRIS mote, which has 128 KB of flash memory. The size of the TakaTuka JVM increases with the complexity of a program. It may be noted that the complexity of a program is not directly proportional to its size. For example, the Tuk file of DYMO is smaller than the Tuk file of CTP, but the JVM customized for DYMO is bigger than the JVM customized for CTP. Thus it is possible to write a big program, with careful selection of Java instructions, such that the customized JVM for the program remains small. Table 6.2 shows the actual size of the customized TakaTuka JVM for a set of programs when compiled for the IRIS mote. It may also be noted that the JVM size depends upon the mote's hardware and compiler. For example, the JVM size for a given program is usually smaller when compiled for a TelosB mote, which has an MSP430 microcontroller, than for an IRIS mote with an AVR microcontroller.


Figure 6.2: Percentage increase in the customized JVM size as compared to the customized JVM for the null program.

Program           Customized JVM Size (Bytes)
CTP               62290
DYMO              68707
JVM Test Cases    64462
Quicksort         44559
Binary Search     36358
Null Program      31676

Table 6.2: Size of the customized JVM in bytes when compiled for the IRIS mote.

Chapter 7

Summary and Conclusions

A new JVM called TakaTuka has been presented, which can run on motes with as little as 4 KB of RAM and 48 KB of flash. It provides Java interfaces for commonly used hardware components of sensor motes; therefore, any platform which provides an implementation of those interfaces can support TakaTuka. The TakaTuka JVM has been tested on more than five different mote platforms. In addition, a number of Java WSN applications and protocols were developed and tested on motes with the aforementioned resource constraints using the TakaTuka JVM. These achievements were possible due to the following contributing factors.

• Introduction of the Offline GC, which can deallocate objects even when they are still reachable. This is accomplished with the help of an advanced bytecode analysis scheme called OGC-DFA. Three algorithms have been designed for making offline deallocation decisions during the program's compilation.

• Support for smaller slot sizes to reduce RAM requirements of a method frame.

• A condensed format for class files, called Tuk, that is reusable by third-party JVMs. This format includes static linking of classes and fast data access.

• Three different techniques for constant pool optimization.

• Comprehensive bytecode reduction strategies, including single and multiple bytecode compaction.

• The ability of the TakaTuka JVM to be customized for a specific application. The size of the corresponding JVM interpreter is reduced based on the bytecode instructions used by the program. It can also be compiled as a general purpose JVM supporting any Java program.

• The Offline GC, bytecode compaction and constant pool optimization also lead to faster program execution besides reducing the program footprint in RAM and flash of the mote.

• The design of a debugger, called TJD, to debug the Java code on a mote and also to debug several motes of a wireless sensor network.

• Development of a completely open-source JVM from scratch, without employing any existing JVM code. This JVM supports almost all bytecode instructions and an almost complete CLDC library.

Offline Garbage Collection

The Offline GC scheme moves the bulk of Java garbage collection offline. This brings two benefits: it allows running feature-rich, large Java programs, and it saves CPU cycles when running a given program, thus extending the lifetime of a wireless sensor mote that usually operates on battery power. Three algorithms to facilitate offline deallocation decisions are presented. These algorithms identify where in the program execution an object, which would be left alone by a traditional garbage collector, may be deallocated. Working independently, these algorithms suit different types of programs; thus, using these algorithms, a variety of programs can benefit from the Offline GC scheme. The Offline GC is evaluated using many programs, including two well-known GC benchmarks and a wireless sensor network routing protocol. The results indicate that the Offline GC can reduce the CPU cycles consumed in freeing memory by up to 73.13 %, while making up to 66.06 % extra RAM available compared to a typical mark-and-sweep online GC. The Offline GC is most useful for applications which allocate many dynamic objects, each larger than 4 bytes. In contrast, applications which reuse the same set of small-footprint objects do not benefit much. Although the Offline GC is developed for tiny wireless sensor motes, it can scale to large programs developed for powerful computers. Each of the three algorithms developed for making the offline deallocation decisions has polynomial time complexity and polynomial RAM requirements. The slowest of these algorithms is DAU; it consumes on average 48.78 % of the total time taken by the Offline GC components during compilation. However, it is a parallel algorithm that can easily be made to run on multiple processors if a very large program designed for powerful computers needs compilation.

Variable Slot Size

The Offline GC deallocates objects quite early as compared to a traditional online GC, thus increasing the RAM available to a program. However, the Offline GC does not reduce the RAM requirements of method frames. RAM for a new frame is allocated for each method invocation and is deallocated automatically upon completion of that invocation. The frame stores the local variables and operand stack corresponding to the method invocation. A frame is divided into fixed-size slots, and the number of slots required to create a frame is determined during compilation of the program. The standard Sun JVM is designed to work on 32-bit microcontrollers and therefore supports 32-bit slots. A 32-bit slot wastes RAM when used to store data smaller than 32 bits. The TakaTuka compiler allows the programmer to choose the slot size (i.e., 32, 16 or 8 bits) during compilation of a program. Based on this selection, the Java bytecode is changed using VSS-DFA. For the 16-bit slot size, the aggregated size of local variables for all the methods of a program is reduced by more than 27 %, as shown in Fig. 3.5. An average reduction of 35 % is achieved in the aggregated size of operand stacks, as shown in Fig. 3.6. Thus the variable slot size scheme significantly lowers the RAM requirements of a frame for a method invocation.

Bytecode Compaction

The bytecode instructions that are executed by a JVM interpreter constitute 14 % of a Java program's binary for typical programs (Fig. 4.4). Bytecode optimization usually decreases execution speed, because a JVM interpreter needs to decompress or process the optimized code during runtime [58, 57]. The TakaTuka optimization is aimed at reducing the Java binary without increasing CPU or memory utilization. Hence, the TakaTuka JVM regenerates parts of the interpreter to support new optimized instructions for bytecode compaction. The advantage of this approach is increased JVM interpreter speed, as optimized instructions replace multiple original Java instructions; the total number of instructions dispatched and operands fetched is thus reduced. This may be accompanied by a disadvantage: the size of the interpreter supporting the optimization could increase significantly, outweighing the reduction achieved in bytecode size. We have, therefore, suggested two bytecode optimization strategies: (1) Full Compaction (FC) and (2) Limited Compaction (LC). In LC, a bytecode optimization is performed only if it reduces the aggregated size of the Java binary and interpreter. When FC is used, bytecode compaction increases the performance of a program by up to 45 % (Fig. 4.11), with an average reduction of 60 % in the size of the Java bytecode. In contrast, when LC is used, the speed increases by only 6.7 % and an average reduction of 34.81 % is achieved in the size of the Java binary (Fig. 4.7). The extra speed of FC comes with a corresponding average increase of 16.86 % in JVM size (Fig. 4.10) and a significantly longer compilation time (Fig. 4.9). TakaTuka can be configured by an application developer to maximize bytecode size reduction or to increase speed based on the target mote platform.
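As a toy illustration of multiple-bytecode compaction, the C sketch below fuses the sequence ILOAD_1, ILOAD_2, IADD into one invented opcode, so the interpreter dispatches once instead of three times; the opcode values and names are ours, not TakaTuka's actual compaction output:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ILOAD1_ILOAD2_IADD = 0xE0, OP_HALT = 0xFF };

    static int32_t run(const uint8_t *code, const int32_t *locals) {
        int32_t stack[8]; int sp = 0;
        for (;;) {
            switch (*code++) {
            case OP_ILOAD1_ILOAD2_IADD:      /* 3 bytecodes -> 1 dispatch */
                stack[sp++] = locals[1] + locals[2];
                break;
            case OP_HALT:
                return stack[sp - 1];
            default:
                return 0;                    /* unknown opcode */
            }
        }
    }

    int main(void) {
        const uint8_t code[] = { OP_ILOAD1_ILOAD2_IADD, OP_HALT };
        const int32_t locals[3] = { 0, 20, 22 };
        printf("%d\n", run(code, locals));   /* prints 42 */
        return 0;
    }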

CP Optimizations

The aggregated size of the constant pools (CPs) of a program is on average 40 % of the Java binary of that program (Fig. 4.4). Therefore, reducing CP size is important for reducing the overall size of the Java binary of the program. The constant pool of a class has no redundant information, but the constant pools of different class files of a program may have such information. The TakaTuka Java binary, called Tuk, has a set of global constant pools (GCPs) for a program, such that each value appears exactly once in the GCP. On average, the aggregated size of the GCPs is reduced by more than 50 % compared to the aggregated size of the CPs of the program. The GCP size is further reduced by removing the naming information required for dynamic linking, as TakaTuka supports static linking. Furthermore, information required for bytecode verification, but not for program execution, is also removed from the GCP. All these optimizations lead to a set of Tuk-GCPs. On average, the reduction in the aggregated size of the Tuk-GCPs is more than 95 % as compared to the original CPs of the program.

Tuk File Format

The bytecode and CPs comprise less than 55 % of the Java binary. The rest of the Java binary consists of formatting information: for example, the access flags of a function, or how many slots a function frame should allocate at runtime. The TakaTuka JVM supports a special Java binary format called Tuk. The overall reduction in the size of the Java binary is shown in Fig. 5.3, where the TakaTuka Java binary, Tuk, is compared with the Sun Squawk Java binary (called Suite) and with a compressed Jar file. The Tuk file is on average 19 % smaller than the corresponding Suite file and 94 % smaller than the corresponding compressed Jar file. On average, the Tuk file is 92.92 % smaller than the original Java binary of an application, with an average reduction of 85.8 % for the library files.

JVM Optimizations

Based on the user program to be installed on a mote, the size of the TakaTuka JVM can be reduced by stripping functionality not used by that program. The advantage of this approach is that a very small JVM is transferred onto the mote. However, the JVM requires recompilation for each application, and a JVM compiled for one application may not support another application with different functionality. Fig. 6.2 shows the percentage increase in the size of the customized JVM as compared to the customized JVM compiled for the smallest possible program, i.e., the null program. For the null program the JVM size decreases to only 30 KB when compiled for an IRIS mote, which has 128 KB of storage space. The size of the TakaTuka JVM increases with the complexity of a program. It may be noted that the complexity of a program is not linearly related to its size. Hence it may be possible to write a large program, with careful selection of Java instructions, such that the JVM customized for the program remains small. In addition, the JVM size depends upon the mote's hardware and the compiler. For example, the JVM size for a given program is usually smaller when compiled for a TelosB mote, which has an MSP430 microcontroller, than for an IRIS mote with an AVR microcontroller.

Debugging

Programs designed for wireless sensor networks are prone to errors because of the motes' constrained resources and the complexity of a network program distributed across many motes of the wireless network. In general, presently available debuggers for Java or traditional low-level languages (e.g., C, Assembly, nesC) cannot operate on tiny motes with a few KB of RAM. One of the advantages of Java is its powerful debugger, JDB [63]. However, JDB cannot be used by a program running on a mote for two reasons: (1) JDB is too large to be accommodated in a mote's flash and requires extensive RAM, and (2) JDB uses debug symbols (e.g., line numbers, names, etc.) that are removed from the Tuk file for size reduction. To solve these issues, TakaTuka provides the Tiny Java Debugger (TJD), which has a split architecture [62]. For the Java program running on a mote, the debug symbols are stored on the desktop computer instead of the mote. The debugger runs on the desktop computer, using its resources, and communicates with the mote via a simple protocol. This split architecture enables debugging a mote in real time without burdening its resources. The TJD requires only 88 bytes of static RAM on a mote and less than 1 KB of flash storage.

Scope of Future Work

This work has provided extensive optimizations aimed mainly at reducing the flash storage and RAM requirements of a program. The increase in program execution speed comes as a consequence of these optimizations; however, this was not the primary objective of our work. It may be a worthwhile effort to explore the possibility of using hybrid arrangements of register and stack JVM architectures to increase the speed of Java programs without significantly increasing the size of the Java binary. Furthermore, Java bytecode generation may be improved, and the bytecode instruction set may be extended by introducing new instructions, with the primary objective of improving program speed. It would also be desirable to further improve the structure of the TakaTuka JVM code to enhance the performance of bytecode execution.

References

[1] Konrad Zuse. The computer—my life. Springer-Verlag, Inc., New York, NY, USA, 1993.

[2] Alan Mathison Turing. On Computable Numbers, with an application to the Entscheidungsproblem. Proc. London Math. Soc., 2(42):230–265, 1936.

[3] Paul Ceruzzi. A History of Modern Computing. MIT Press, Cambridge, MA, USA, 2nd edition, 2003.

[4] Brett Warneke, Matt Last, Brian Liebowitz, and Kristofer S.J. Pister. Smart dust: Communicating with a cubic-millimeter computer. Computer, 34:44–51, 2001.

[5] Judy York and Parag Pendharkar. Human-computer interaction issues for mobile computing in a variable work context. International Journal of Human-Computer Studies, 60(5-6):771 – 797, 2004. HCI Issues in Mobile Computing.

[6] Faisal Aslam, Luminous Fennell, Christian Schindelhauer, Peter Thiemann, Gidon Ernst, Elmar Haussmann, Stefan Rührup, and Zartash Afzal Uzmi. Optimized Java Binary and Virtual Machine for Tiny Motes. In Distributed Computing in Sensor Systems (DCOSS), volume 6131, chapter 2, pages 15–30. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[7] Niels Brouwers, Koen Langendoen, and Peter Corke. Darjeeling, a feature-rich VM for the resource poor. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, SenSys '09, pages 169–182, New York, NY, USA, 2009. ACM.

[8] Faisal Aslam, Christian Schindelhauer, Gidon Ernst, Damian Spyra, Jan Meyer, and Mohannad Zalloom. Introducing TakaTuka: a Java virtual machine for motes. In Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems, SenSys '08, pages 399–400, New York, NY, USA, 2008. ACM.

[9] David Gay, Philip Levis, Robert von Behren, Matt Welsh, Eric Brewer, and David Culler. The nesC language: A holistic approach to networked embedded systems. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI '03, pages 1–11, New York, NY, USA, 2003. ACM.

[10] Doug Simon, Cristina Cifuentes, Dave Cleal, John Daniels, and Derek White. Java on the bare metal of wireless sensor devices: the squawk java virtual machine. In Proceedings of the 2nd international conference on Virtual execution environments, VEE ’06, pages 78–88, New York, NY, USA, 2006. ACM.

[11] Tim Lindholm and Frank Yellin. Java Virtual Machine Specification. Addison- Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 1999.

[12] Janice J. Heiss. Sentilla’s Pervasive Computing – The Universe Is the Computer. 2008 JavaOne Conference.

[13] Sun Microsystems. Java Card Platform Specification, v 2.2.2, 2006.

[14] AREXX engineering. The NanoVM - Java for the AVR. http://www.harbaum.org [Last Accessed: 15.02.2011].

[15] Joel Koshy and Raju Pandey. VMStar: synthesizing scalable runtime environments for sensor networks. In Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems, SenSys '05, pages 243–254, New York, NY, USA, 2005. ACM.

[16] Charles Perkins and Elizabeth Royer. Ad-hoc on-demand distance vector routing. IEEE Workshop on Mobile Computing Systems and Applications (HotMobile), 0:90, 1999.

[17] Ian Chakeres and Charles Perkins. Dynamic MANET On-demand (DYMO) Routing. IETF, 2010.

[18] Philip Levis and David Gay. TinyOS Programming. Cambridge University Press, 1st edition, April 2009.

[19] Adam Dunkels, Bjorn Gronvall, and Thiemo Voigt. Contiki - a lightweight and flexible operating system for tiny networked sensors. IEEE Conference on Local Computer Networks (IEEE LCN), 0:455–462, 2004.

[20] Jason Hill, Robert Szewczyk, Alec Woo, Seth Hollar, David Culler, and Kristofer Pister. System architecture directions for networked sensors. SIGPLAN Not., 35:93–104, November 2000.

[21] Michael Kifer, Georg Lausen, and James Wu. Logical foundations of object-oriented and frame-based languages. Journal of the ACM (JACM), 42:741–843, July 1995.

[22] Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In ECOOP'97 — Object-Oriented Programming, volume 1241 of Lecture Notes in Computer Science, chapter 10, pages 220–242. Springer-Verlag, Berlin/Heidelberg, 1997.

[23] Peter Coad and Jill Nicola. Object-oriented programming. Yourdon Press, Upper Saddle River, NJ, USA, 1993.

[24] Matt Weisfeld. The Object-Oriented Thought Process. Sams, March 2000.

[25] Jack Shirazi. Java Performance Tuning. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2nd edition, 2002.

[26] Timothy Cramer, Richard Friedman, Terrence Miller, David Seberger, Robert Wilson, and Mario Wolczko. Compiling Java just in time. IEEE Micro, 17:36–43, May 1997.

[27] Yunhe Shi, David Gregg, Andrew Beatty, and M. Anton Ertl. Virtual machine showdown: stack versus registers. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, VEE '05, pages 153–163, New York, NY, USA, 2005. ACM.

[28] Faisal Aslam, Luminous Fennell, Christian Schindelhauer, Peter Thiemann, and Zartash Afzal Uzmi. Offline GC: Trashing reachable objects on tiny devices. Submitted to ACM SenSys, 2011.

[29] ETH Zurich. The Sensor Network Museum. btnode.ethz.ch/Projects/SensorNetworkMuseum [Last Accessed: 15.02.2011].

[30] Richard Jones and Rafael D. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley & Sons, September 1996.

[31] Sigmund Cherem and Radu Rugina. Uniqueness inference for compile-time object deallocation. In Proceedings of the 6th International Symposium on Memory Management, ISMM '07, pages 117–128, New York, NY, USA, 2007. ACM.

[32] Samuel Guyer, Kathryn McKinley, and Daniel Frampton. Free-Me: a static analysis for automatic individual object reclamation. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pages 364–375, New York, NY, USA, 2006. ACM.

[33] Elvira Albert, Samir Genaim, and Miguel Gomez-Zamalloa. Heap space analysis for Java bytecode. In Proceedings of the 6th International Symposium on Memory Management, ISMM '07, pages 105–116, New York, NY, USA, 2007. ACM.

[34] Jong-Deok Choi, Manish Gupta, Mauricio Serrano, Vugranam C. Sreedhar, and Sam Midkiff. Escape analysis for Java. In Proceedings of the 14th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '99, pages 1–19, New York, NY, USA, 1999. ACM.

[35] Bruno Blanchet. Escape analysis for Java: Theory and practice. ACM Trans. Program. Lang. Syst., 25:713–775, November 2003.

[36] Jeffrey Barth. Shifting garbage collection overhead to compile time. Communica- tions of the ACM (CACM), 20:513–518, July 1977.

[37] Pramod Joisha. Overlooking roots: a framework for making nondeferred reference-counting garbage collection fast. In International Symposium on Memory Management (ISMM), pages 141–158, New York, NY, USA, 2007. ACM.

[38] Wolfram Schulte. Deriving residual reference count garbage collectors. In Symposium on Programming Language Implementation and Logic Programming (PLILP), pages 102–116. Springer-Verlag, 1994.

[39] Ran Shaham, Eran Yahav, Elliot Kolodner, and Mooly Sagiv. Establishing local temporal heap safety properties with applications to compile-time memory management. In Radhia Cousot, editor, Static Analysis, volume 2694 of Lecture Notes in Computer Science, pages 1075–1075. Springer Berlin / Heidelberg, 2003.

[40] Ole Agesen, David Detlefs, and J. Eliot Moss. Garbage collection and local variable type-precision and liveness in java virtual machines. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, PLDI ’98, pages 269–279, New York, NY, USA, 1998. ACM.

[41] Sigmund Cherem and Radu Rugina. Compile-time deallocation of individual objects. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 138–149, New York, NY, USA, 2006. ACM.

[42] Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling Wang, and James Cheney. Region-based memory management in Cyclone. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, PLDI '02, pages 282–293, New York, NY, USA, 2002. ACM.

[43] Neil Jones and Steven Muchnick. Flow analysis and optimization of lisp-like structures. In Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL '79, pages 244–256, New York, NY, USA, 1979. ACM.

[44] Alexandru Stefan, Florin Craciun, and Wei-Ngan Chin. A flow-sensitive region inference for CLI. In Proceedings of the 6th Asian Symposium on Programming Languages and Systems, APLAS '08, pages 19–35, Berlin, Heidelberg, 2008. Springer-Verlag.

[45] Sigmund Cherem and Radu Rugina. Region analysis and transformation for Java programs. In Proceedings of the 4th International Symposium on Memory Management, ISMM '04, pages 85–96, New York, NY, USA, 2004. ACM.

[46] David Gay and Alex Aiken. Memory management with explicit regions. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pages 313–323, New York, NY, USA, 1998. ACM.

[47] Source code. GCBench and GCOld benchmarks. http://www.cs.cmu.edu/~spoons/gc/benchmarks.tar.gz [Last Accessed: 15.02.2011].

[48] Omprakash Gnawali, Rodrigo Fonseca, Kyle Jamieson, David Moss, and Philip Levis. Collection tree protocol. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, SenSys '09, pages 1–14, New York, NY, USA, 2009. ACM.

[49] David Bacon, Perry Cheng, and David Grove. Garbage collection for embedded systems. In Proceedings of the 4th ACM international conference on Embedded software, EMSOFT ’04, pages 125–136, New York, NY, USA, 2004. ACM.

[50] Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko. SPECjvm2008 performance characterization. In Proceedings of the 2009 SPEC Benchmark Workshop on Computer Performance Evaluation and Benchmarking, pages 17–35, Berlin, Heidelberg, 2009. Springer-Verlag.

[51] Daniel Spoonhower, Guy Blelloch, and Robert Harper. Using page residency to balance tradeoffs in tracing garbage collection. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, VEE '05, pages 57–67, New York, NY, USA, 2005. ACM.

[52] Crossbow Technology. Wireless Sensor Networks. http://www.xbow.com [Last Accessed: 15.02.2011].

[53] Ben Lawrence Titzer, Daniel Lee, and Jens Palsberg. Avrora: scalable sensor network simulation with precise timing. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks, IPSN '05, Piscataway, NJ, USA, 2005. IEEE Press.

[54] Luminous Fennell. Garbage Collection for TakaTuka, Bachelor’s Thesis, University of Freiburg, Germany.

[55] Árpád Beszédes, Rudolf Ferenc, Tibor Gyimóthy, André Dolenc, and Konsta Karsisto. Survey of code-size reduction methods. ACM Computing Surveys, 35:223–267, September 2003.

[56] Dimitris Saougkos, George Manis, Konstantinos Blekas, and Apostolos V. Zarras. Revisiting Java bytecode compression for embedded and mobile computing environments. IEEE Transactions on Software Engineering, 33:478–495, July 2007.

[57] Lars Raeder Clausen, Ulrik Pagh Schultz, Charles Consel, and Gilles Muller. Java bytecode compression for low-end embedded systems. ACM Transactions on Programming Languages and Systems (TOPLAS), 22:471–489, May 2000.

[58] Derek Rayside, Evan Mamas, and Erik Hons. Compact java binaries for embedded systems. In Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research, CASCON ’99, pages 9–. IBM Press, 1999.

[59] Anton Ertl and David Gregg. The behavior of efficient virtual machine interpreters on modern architectures. In Rizos Sakellariou, John Gurd, Len Freeman, and John Keane, editors, Euro-Par 2001 Parallel Processing, volume 2150 of Lecture Notes in Computer Science, pages 403–413. Springer Berlin / Heidelberg, 2001.

[60] Sheng-De Wang and Yuhder Lin. Jato: A compact binary file format for java class. Parallel and Distributed Systems, International Conference on, 0:0467, 2001.

[61] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional, 1995.

[62] Shou-Yu Chao. A scalable Wireless Sensor Network Debugger. Master’s thesis, University of Freiburg, Germany, 2010.

[63] Oracle Corporation. Java Debug Wire Protocol Specification, 2010.