Determining When Objects Die to Improve

Garbage Collection

A dissertation submitted by

Nathan P. Ricci

In partial fulfillment of the requirements

for the degree of Doctor of Philosophy in

Computer Science

TUFTS UNIVERSITY

May 2016

ADVISER: Samuel Z. Guyer

0.1 Abstract

Although garbage collection frees programmers from the burden of manually freeing objects that will no longer be used, and eliminates many common programming errors, it can come with a significant performance penalty. This penalty becomes particularly severe when memory is constrained.

This thesis introduces two pieces of work to address this problem. The first is

Elephant Tracks, a tool that uses novel techniques to measure the memory properties of Java programs. Elephant Tracks is novel in that it places object death events more accurately than any existing tool, and is able to do so without modification to the underlying VM.

The second is the Deferred Garbage Collector. Built on observations from Elephant Tracks traces, the Deferred Garbage Collector reduces redundant work performed by the garbage collector. In this thesis, we show that the techniques used by the Deferred Collector can reduce garbage collector tracing workload in some programs.

0.2 Acknowledgements

I would like to thank the Tufts University Department of Computer Science for building the environment that makes this work possible. I am especially grateful for the patient mentoring of my advisor, Sam Guyer. I also wish to express gratitude to my committee members: Norman Ramsey, Alva Couch, Mark Hempstead, and Tony

Printezis.

Additionally, I would like to thank my wife Gizem, and my parents Susan and

David, without whose support I never would have made it here.

I would also be remiss if I did not thank the Tufts Department of Athletics, for providing barbells, without which I would have long ago gone mad.

Lastly, I must also thank Tufts University and the National Science Foundation for their financial support.

Contents

0.1 Abstract ...... iii

0.2 Acknowledgements ...... iv

1 Introduction 1

2 Garbage Collection Background 3

2.1 On Object Lifetime ...... 3

2.2 Garbage Collector Overhead ...... 4

2.2.1 Mark Sweep ...... 6

2.2.2 Generational Mark Sweep ...... 7

2.3 Weak References ...... 8

2.4 The Java Virtual Machine Tool Interface ...... 9

3 Elephant Tracks 11

3.1 Elephant Tracks Introduction ...... 11

3.2 Trace time ...... 13

3.3 Background and related work ...... 15

3.3.1 Garbage collection tracing ...... 16

3.3.2 Why a new trace generator? ...... 19

3.3.3 Related work ...... 21

3.4 Elephant Tracks Design ...... 23

3.4.1 Kinds of trace records ...... 23

3.4.2 Execution model ...... 25

3.5 Implementation ...... 29

3.5.1 Timestamping strategy ...... 30

3.5.2 The instrumenter ...... 31

3.5.3 The agent ...... 36

3.5.4 Properties of our implementation approach ...... 39

3.6 Results ...... 40

3.6.1 Performance ...... 40

3.6.2 Trace analysis ...... 42

3.7 Conclusions ...... 43

4 Deferred Collector 49

4.1 Finding Key Objects ...... 50

4.2 Deferred Collector Design ...... 54

4.2.1 Defer All Objects Reachable from the Key ...... 54

4.2.2 Large Object Space ...... 56

4.3 Mitigating Bad Hints ...... 57

4.3.1 Application Programming Interface ...... 58

4.4 JikesRVM ...... 58

4.4.1 Sanity Checker ...... 60

4.5 Experimental Results for The Deferred Collector ...... 61

4.5.1 Methodology ...... 61

4.5.2 Doubly Linked List ...... 62

4.5.3 sunflow ...... 63

4.6 Related Work ...... 66

5 Conclusion 68

Bibliography 68

List of Figures

2.1 The events in the life of an object. First it is allocated, then it is used. Eventually, it is used for the last time. Sometime after this, it may become unreachable. Finally its memory will be reclaimed. ...... 4

2.3 A weak reference object has a field called a "referent" that refers to some object (A). When the referent is reachable only through weak reference objects, it may be collected. If this happens, the referent field is nulled (B). ...... 9

3.1 Pseudocode for the Merlin algorithm. ...... 18

4.1 The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time. The lower green dashed line shows the number of live objects that fall into clusters, as described in Section 4.1. ...... 51

4.1 The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time. The lower green dashed line shows the number of live objects that fall into clusters, as described in Section 4.1. ...... 52

4.1 The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time. The lower green dashed line shows the number of live objects that fall into clusters, as described in Section 4.1. ...... 53

4.2 mark/cons ratio for the doubly linked list program, run with a 16 MB heap, and differing values of immediate collection period. ...... 63

4.3 mark/cons ratio for sunflow, run with a 27 MB heap, and differing values of immediate collection period. ...... 65

4.4 mark/cons ratio for sunflow, run with a 32 MB heap, and differing values of immediate collection period. ...... 66

Chapter 1

Introduction

Since the task of allocating and freeing memory in computer programs is tedious and error prone, automated memory management (known as garbage collection) has been a boon to computer programmers. However, automated memory management comes with costs: extra memory overhead, extra time, or both.

The hypothesis of this thesis is that a greater understanding of the structure of a program's heap can improve the performance of garbage collectors. Intuitively, this seems obvious: the data structures used in a program govern in large part how the memory of a program is used, and understanding those data structures should give us more information about the lifetime of objects.

However, in order to examine this intuitive idea, we will need to develop new techniques to observe how programs use memory. In particular, we want to know precisely when objects become unreachable and may be collected. Thus, this thesis is in two parts. In the first half, we present Elephant Tracks, a tool which uses novel techniques to trace the execution of Java programs. The traces produced by Elephant Tracks contain a record of all object allocation and death events, and enough information to place these events in a specific calling context (a novel feature of Elephant Tracks). These traces can be used to analyze runtime properties of programs,

including the prototyping of new garbage collection algorithms. Since they place death events so precisely, it is easy to test schemes that require less precision by ignoring some of the information in the trace. A central claim of this thesis is that this can be done without relying on any modification to the underlying VM.

In the second half of this thesis, we present the deferred collector, built on data gathered with Elephant Tracks. The deferred collector exploits the observation that much of the heap does not change between two successive runs of a tracing-style garbage collector, and as a result much work is repeated between such successive runs. The deferred collector reduces this repeated work based on programmer hints. We claim that this can reduce garbage collector tracing workload in some programs.

Chapter 2

Garbage Collection Background

In order to understand the performance characteristics of garbage collectors, it is

necessary to understand how they work. Thus, we will describe their operation in

this chapter. As there is copious work on garbage collection, this chapter will only

focus on details relevant to this thesis.

2.1 On Object Lifetime

First, we must make a brief aside to explain the events of interest in an object's lifetime. We will borrow terminology first used by Röjemo and Runciman (1996). A program allocates an object, and sometime later it is first used; the period between these events is called the lag. Eventually, the program uses an object for the last time, and at this point we can say the object is dead. In principle, it would be safe for the garbage collector to reclaim an object's memory at the very instant of its death.

However, in general it is not possible to determine whether a use is the last use.

So instead, the first point at which the collector could collect an object is later, when it becomes unreachable. The period between last use and the time an object becomes unreachable is called use drag.


Figure 2.1: The events in the life of an object. First it is allocated, then it is used. Eventually, it is used for the last time. Sometime after this, it may become unreachable. Finally its memory will be reclaimed.

Finally, although the garbage collector could collect an object as soon as the object becomes unreachable, in practice the collector runs only intermittently; the collector will not collect the object until its next run. This engenders another period of drag, the reachability drag.
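To make these periods concrete, the following hypothetical Java fragment (ours, not drawn from any benchmark) annotates each event in one object's lifetime:

```java
// Hypothetical example: the lifetime events of a single object, annotated.
public class Lifetime {
    static String lastUse() {
        StringBuilder sb = new StringBuilder(); // allocation: the lifetime begins
        // ... lag: the period between allocation and first use ...
        sb.append("hello");                     // first use
        String result = sb.toString();          // last use: sb is now dead
        // ... use drag: sb is dead but still reachable via this local ...
        sb = null;                              // sb becomes unreachable
        // ... reachability drag: memory is reclaimed only at the next GC run ...
        return result;
    }
}
```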

2.2 Garbage Collector Overhead

How much overhead does garbage collection entail? While there have been many significant advances in garbage collection performance, overhead can still be very high when memory is constrained. To understand why, first consider how most typical garbage collection algorithms work: As the program requests new objects,

memory for them is allocated on demand, until some threshold of allocated memory (H, for "heap size") is reached. Then the program is paused, and the garbage collector begins executing to reclaim memory. It begins by tracing all objects referred to by the program roots (the roots are just pointers into the heap from other memory locations, such as the stack). The garbage collector marks these initial objects, then finds all objects reachable from them and marks those, eventually marking all objects that are transitively reachable from the roots.

(Figure: (a) heap, memory usage below threshold; (b) heap full; (c) heap after collection. Key: blue objects are allocated and reachable; grey objects are allocated but dead; white space is unallocated.)

If the threshold H is larger than the total amount of memory the program allocates during its execution, H is never reached, the program never has to perform a garbage collection, and the cost of GC is zero. If instead H is a large fraction of the total allocation, but less than it, the collector runs only infrequently. Imagine the collector has not run in some time, and then performs a new collection. Since most objects eventually die (that is, they are not immortal), and the cost of collection is proportional to the number of live objects, the collection is inexpensive. Furthermore, the collection will recover a large amount of memory, which means that it will be a long time before the

amount of allocated memory reaches H again. So, if H is large enough, collections are infrequent and inexpensive.

In contrast, if H is small, then collections will occur frequently: Each collection will recover only a small amount of memory, leaving only a small amount to be allocated before H is reached again and a new collection must occur. In this case, we are in the unhappy world of frequent and expensive collections.
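This trade-off can be captured in a back-of-the-envelope model (our simplification, not from the thesis: a constant live size L, a total allocation A, and a mark cost proportional to the number of live objects):

```java
// Back-of-the-envelope model of tracing cost as a function of heap size H.
// Simplifying assumptions: live size L is constant, and each collection is
// charged a mark cost proportional to L.
public class GcCostModel {
    // One collection occurs each time (H - L) bytes of garbage accumulate.
    static long collections(long totalAlloc, long heapSize, long liveSize) {
        return totalAlloc / (heapSize - liveSize);
    }

    // Each collection traces roughly the live set, so total mark work is
    // proportional to L * A / (H - L).
    static long markWork(long totalAlloc, long heapSize, long liveSize) {
        return collections(totalAlloc, heapSize, liveSize) * liveSize;
    }

    public static void main(String[] args) {
        long A = 1_000_000_000L, L = 10_000_000L;   // 1 GB allocated, 10 MB live
        System.out.println(markWork(A, 8 * L, L));  // large heap: little mark work
        System.out.println(markWork(A, 2 * L, L));  // small heap: ~7x more work
    }
}
```

In this model, a heap eight times the live size incurs a small total mark cost for a gigabyte of allocation, while shrinking the heap to twice the live size raises it roughly sevenfold, mirroring the "frequent and expensive" regime described above.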

A direct comparison of garbage collection to manual memory management is not straightforward.

One could do as Hertz and Berger (2004) did and determine the exact unreachable time of every object in a Java program, replay the program exactly, inserting manual frees at the exact moment of death, and compare that to the standard garbage collection scheme. However, the presence or absence of manual memory management changes how programs are written, and this experiment does not necessarily reflect the way that manual memory management is actually done. In practice, auxiliary data structures may be used to track memory, defensive copies may be made, or other techniques employed. If the data structures used to track memory are elaborate enough, they could be seen as a partial implementation of a garbage collector.

However imperfect the analysis, Hertz and Berger (2004) concluded that when H is approximately four times the typical live size of the program, the costs of garbage collection are similar to the costs of manual memory management. When H is greater than this, garbage collection is more efficient than manual memory management. However, if we move H in the opposite direction, the cost of garbage collection increases dramatically. With H only twice the live size, the execution of a program is slowed dramatically (Hertz and Berger, 2004).

2.2.1 Mark Sweep

One of the most basic algorithms for garbage collection is Mark Sweep. It proceeds as follows:

1. Once the threshold is reached, the mutator (the running program) is paused.

2. The collector enumerates all roots (references on the stack or in global variables that refer to objects in the heap).

3. The mark phase begins. All objects reachable via the roots are marked, as are all objects transitively reachable from those objects.

4. The sweep phase begins. The collector linearly scans through memory, checking whether each object is marked. All objects not marked are reclaimed ("swept").

5. The mutator resumes.

The marking can be done by modifying a bit in a side table, or the mark bit can be stored in memory adjacent to the object in a special header area. Since the sweeper is read-only with respect to live objects, and the mutator will never interact with unreachable objects, it is possible for the mutator to resume before the sweeper is complete; the mutator and sweeper then run concurrently.
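The five steps above can be sketched over an explicit object graph (a toy model in Java; a real collector operates on raw memory rather than on Java collections):

```java
import java.util.*;

// Minimal mark-sweep sketch over an explicit object graph (illustrative only).
public class MarkSweep {
    static class Obj { final List<Obj> fields = new ArrayList<>(); boolean marked; }

    // Mark phase: depth-first traversal from the roots (step 3).
    static void mark(Collection<Obj> roots) {
        Deque<Obj> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (o.marked) continue;        // already visited
            o.marked = true;
            stack.addAll(o.fields);        // follow outgoing pointers
        }
    }

    // Sweep phase: scan the whole heap, reclaim unmarked objects (step 4),
    // and clear mark bits for the next cycle.
    static List<Obj> sweep(List<Obj> heap) {
        List<Obj> survivors = new ArrayList<>();
        for (Obj o : heap) {
            if (o.marked) { o.marked = false; survivors.add(o); }
            // unmarked objects are simply dropped ("swept")
        }
        return survivors;
    }
}
```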

2.2.2 Generational Mark Sweep

One important advance in garbage collector technology exploits an empirical observation about object lifetimes known as the Generational Hypothesis: most objects die young.

In order to take advantage of this, new objects are allocated into a small space called the nursery. When the nursery is filled, survivors are copied to a larger space, called the mature space. This larger space is typically managed with the previously described mark sweep algorithm.

In order that the collector can collect the nursery without having to examine the mature space at all, it needs to keep track of any pointers located in the mature space that refer to objects in the nursery; such pointers are treated as roots during a nursery collection. In a JVM that relies on a just-in-time (JIT) compiler, this tracking is accomplished by the JIT compiler injecting code into the mutator surrounding any write of a pointer; this code is called the write barrier (a write barrier could also be implemented in an interpreter or ahead-of-time compiler).

The write barrier code checks whether the written pointer creates a reference from the mature space into the nursery; if so, the address of that pointer is stored in a list called the remembered set.

Some generational collectors use an alternative called card marking instead of a simple list. In this scheme, one bit is associated with each "card" of memory (the size of a card is variable, but typically cards fit evenly onto pages, several per page). These bits are stored together in a single bit map called the card table. In the card marking scheme, instead of appending to a list, the write barrier sets the bit associated with the card in which a pointer is being mutated. Then, when a nursery garbage collection occurs, the garbage collector scans those cards that have been marked in the card table. After a collection, the card table is zeroed. Card marking has the advantage of bounded space, since the card table is of fixed size (and the list used in a remembered set is not). However, it has the disadvantage that the collector must scan an entire card even if only one pointer in it has been mutated.
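A card-marking barrier and its collection-time scan might be sketched as follows (sizes and names are illustrative, not those of any particular JVM):

```java
// Sketch of a card-marking write barrier. In a real JVM this logic is
// injected by the JIT compiler around every pointer store; here it is an
// ordinary method for illustration.
public class CardTable {
    static final int CARD_SIZE = 512;   // bytes per card (varies by VM)
    final boolean[] table;              // one flag per card (boolean[] for clarity)

    CardTable(int heapBytes) { table = new boolean[heapBytes / CARD_SIZE]; }

    // Write barrier: on a pointer store at 'address', dirty the enclosing card.
    void writeBarrier(int address) { table[address / CARD_SIZE] = true; }

    // At nursery-collection time: scan only dirty cards, then zero the table.
    int scanAndClear() {
        int dirty = 0;
        for (int i = 0; i < table.length; i++) {
            if (table[i]) { dirty++; table[i] = false; }
        }
        return dirty;
    }
}
```

Note the bounded-space property described above: two stores into the same card dirty only one entry, and the table itself never grows.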

2.3 Weak References

As with many other garbage collected languages, Java provides weak references. In

Java, these are implemented using a special container type, WeakReference. When a WeakReference contains an object, the object can be accessed via the WeakReference's get method. Suppose an object is reachable only through WeakReference objects; such an object is said to be weakly reachable. When an object is determined to be weakly reachable, it may be collected. This is illustrated in fig. 2.3. Weak references are

typically used to implement software caches, since it is not desired for the cache itself to keep objects resident in memory.

Figure 2.3: A weak reference object has a field called a "referent" that refers to some object (A). When the referent is reachable only through weak reference objects, it may be collected. If this happens, the referent field is nulled (B).
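A minimal sketch of such a weak-reference cache, using the standard java.lang.ref.WeakReference API (the WeakCache class itself is hypothetical):

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// A minimal weak-reference cache sketch: the cache holds WeakReferences, so
// it never keeps its values resident in memory by itself.
public class WeakCache {
    private final Map<String, WeakReference<byte[]>> cache = new HashMap<>();

    void put(String key, byte[] value) {
        cache.put(key, new WeakReference<>(value));
    }

    // Returns the referent, or null if it was never cached or the
    // collector has already cleared it.
    byte[] get(String key) {
        WeakReference<byte[]> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }
}
```

While some strong reference to a value exists elsewhere, get returns it; once only WeakReference objects refer to it, a subsequent collection may clear the referent, after which get returns null.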

2.4 The Java Virtual Machine Tool Interface

Java Virtual Machines provide a standard interface for debugging tools called the

Java Virtual Machine Tool Interface (Sun Microsystems, 2004). This interface allows the JVM's user to specify a shared library, called a JVMTI agent, to be loaded along with the VM. As classes are loaded by the JVM, their bytecode can be intercepted by the agent. The agent can then modify it before it is executed by the JVM.

JVMTI is a relatively sparse interface, and it is intended that agents rely primarily on bytecode instrumentation to do most of their work. It does, however, provide some additional features beyond intercepting loaded classes. A JVMTI agent can traverse all the live objects in the heap, examine local variables on the stack, and intercept Java Native Interface calls.

In order to carry out the bytecode instrumentation used in the work presented in this thesis, we use version 3.3 of the ASM bytecode rewriting framework. ASM is widely used and actively maintained, and is part of the infrastructure of at least one major commercial JVM (Oracle's HotSpot).

Chapter 3

Elephant Tracks

3.1 Elephant Tracks Introduction

Garbage collection tracing tools have been instrumental in the development of new garbage collection algorithms. A GC tracing tool produces an accurate trace of all the dynamic program events that are relevant to memory management, including allocations, pointer updates, and object deaths. We can quickly test a new GC algorithm by building a simulator that reads the GC trace, instead of developing a full

GC implementation in a real virtual machine, which is a considerable undertaking.

One of the most widely used GC tracing tools for Java, GCTrace, is available as a component of the JikesRVM Java virtual machine (Alpern et al., 2005b). That tool, like ours, is based on the Merlin algorithm (Hertz et al., 2002, 2006), but suffers from several limitations. First, the implementation is integrated directly into the garbage collector. Due to the ongoing evolution of the JikesRVM Memory Management Toolkit, it no longer functions with recent versions of JikesRVM, and older versions will not run modern Java software. Second, GCTrace measures time only in terms of bytes allocated, a fine metric for GC simulation, but not useful for program analysis since it cannot readily be tied back to points in the program. Third,

11 allocation time is not very precise for events other than allocation: many pointer

updates and object deaths can occur at various points between two allocations. Finally, the existing tool does not support a number of features found in real programs,

including weak references and multithreading.

In this chapter we present Elephant Tracks, a new GC tracing tool that is precise, informative, and can run on top of any standard JVM. Our goal is not simply to address the limitations of prior work, but to provide new capabilities that allow our tool to support a wider variety of program analysis and runtime systems research. The implementation of Elephant Tracks uses a combination of bytecode instrumentation and JVMTI (JVM Tool Interface) callbacks to construct a graph representing the connectivity of the heap, on which it runs the Merlin algorithm to compute idealized object death times. Its attributes include:

Precise: Elephant Tracks measures time in terms of method calls (i.e., the clock ticks at every method entry and exit), which are much more frequent than allocations.[1]

Complete: The implementation properly handles all relevant events, including difficult cases, such as weak references, the Java Native Interface, sun.misc.Unsafe, and JVM start-up objects. Previous tools did not handle all of these features, which could result in inaccurate traces if a program makes use of them.

Informative: Traces include much more than just GC-related events. We emit a record for every method call and return, allowing us to tie memory behavior back to the program structure. In fact, we can reconstruct the complete dynamic calling context for any time step. We also record information about threads and exceptions, and, optionally, counts of heap reads and writes and the number of bytecodes executed. This ability is unique to Elephant Tracks.

[1] This includes constructor calls, thus tightly bounding most allocations as well.

Well-defined: We carefully define which aspects of the JVM execution model the traces capture, a definition that embodies a number of subtle design issues affecting the meaning of the traces. These include trace time, the definition of object lifetime, and the ordering of events in multi-threaded programs. In this thesis we explore these issues in detail and justify our choices.

Portable: Elephant Tracks is implemented as a JVMTI agent that runs on any compliant Java virtual machine. Running without modification to the underlying Java Virtual Machine represents a research contribution of Elephant Tracks.

Fast: Elephant Tracks is as fast as existing tools, while providing much more information. Elephant Tracks includes performance tuning and optimizations to reduce overhead, which is critical for larger, long-running programs. Programs instrumented with Elephant Tracks run hundreds of times slower than uninstrumented programs, but this overhead is comparable to that of existing tools.

In the following sections we explore the design space of GC tracing tools, and explain the choices made for Elephant Tracks. We discuss the technical challenges of building such a tool using the JVMTI interface, which does not provide direct access to the JVM's representation of Java objects, or to the garbage collector. We also discuss the handling of weak references. This proved difficult because the JVM is able to side-step some of our instrumentation techniques in this case. Finally, we present some results, including overhead measurements, as well as new insights about the benchmarks gleaned from our precise traces.

3.2 Trace time

For Merlin-based tracing we need a notion of trace time, so that object death records, which are generated only at GC time, can be inserted in their proper place in the

trace. The choice of trace time has a profound effect on the implementation of the tool and on the resulting traces.

Real time is a poor choice, since it is dependent on many factors, including the virtual machine, the operating system, and the hardware. In addition, tracing tends to slow programs down significantly, so the real times are likely to be significantly different from uninstrumented runs. Real time is also, in some sense, too precise: we do not want the trace to reflect the time it takes to actually perform a timestamp or record a trace record.

The solution is to express time in terms of some program-level event: each time the event is encountered, we tick the trace clock. In this way, time depends only on a property of the program, not on the VM or underlying machine. This model breaks time into discrete steps, each of which represents a small region of program execution.

The choice of which event(s) to use for the clock affects the granularity of time, which ultimately determines the precision of the trace, since trace records labeled with the same time cannot be ordered or localized within the region covered by that time step. The trade-off is that more fine-grained notions of time are more difficult to implement correctly, since we need to place the instrumentation more precisely to make sure that every event is labeled with the correct time. They may also incur more overhead. In the following subsections, we will contrast two approaches: allocation time, and method time.

Allocation Time

GCTrace measures time in terms of the number of bytes allocated since the program started (called allocation time). At each allocation, time advances by the number of bytes allocated. Allocation time is good for basic GC research, since the traces are precise enough to drive simulations of experimental GC algorithms. Allocation

time is fairly coarse, however, and a single time step can cover a large region of the code spanning multiple method calls. This notion of time was adequate for GCTrace, since its authors were primarily interested in simulating garbage collection, and none of the collectors they were interested in simulating relied on particular methods.

Method Time

Elephant Tracks measures time in terms of the number of method entries and exits executed (which we call method time). To get a sense of the difference in precision, consider that across the DaCapo benchmarks there are, on average, 70 method entry/exit events between any two allocations (we present more measurements in

Section 3.6). Method time is almost a strict superset of allocation time, since every allocation of a scalar object also calls at least one constructor. The exception is array allocations (in Java, arrays do not have constructors), but in our experience these are not frequent enough to change the results significantly. Also, if a constructor receives a new object as an argument (not the receiver), there can be two allocations without an intervening constructor call. Again, this is not common. Elephant Tracks uses method time because it allows us to place events in a particular calling context.
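At the source level, the effect of the instrumentation can be pictured as follows (a sketch: MethodClock, onEntry, and onExit are hypothetical names, and the real tool injects equivalent bytecode rather than rewriting source):

```java
// Source-level sketch of method time: every method entry and exit ticks the
// clock and would emit a trace record. All names here are hypothetical.
public class MethodClock {
    private static long time = 0;               // the method-time clock

    static void onEntry(String method) { time++; /* emit entry record */ }
    static void onExit(String method)  { time++; /* emit exit record */ }
    static long now()                  { return time; }

    // An instrumented method behaves as if its body were wrapped like this:
    static int instrumentedAbs(int x) {
        onEntry("abs");
        try {
            return Math.abs(x);                 // original method body
        } finally {
            onExit("abs");                      // ticks even on exceptional exit
        }
    }
}
```

Each call thus advances the clock twice, once on entry and once on exit, which is what makes method time so much finer-grained than allocation time.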

3.3 Background and related work

In this section we describe the general garbage collection tracing problem and existing solutions, and motivate the need for a new trace generator.

3.3.1 Garbage collection tracing

A GC trace is a record of the sequence of memory management events that occur during a program's execution. The events of interest may vary depending on the intended use of the trace, but typically include object allocation, object reclamation, and mutations in the heap. Many of these events, such as object allocation, are straightforward to capture, since they are explicitly invoked in the code.[2]

We can instrument those operations directly to emit a trace record with the relevant information.

The central challenge in GC tracing is determining object death times. An obvious solution is to emit an object death record when the garbage collector actually reclaims each object. This approach is easy to implement using JVMTI, but is unappealing for at least two reasons. First, the particular timing of these events is collector-specific: we would measure a property of the GC algorithm used during trace generation, rather than a fundamental property of the program. Second, the resulting information is very imprecise. Most garbage collectors run infrequently, reclaiming large numbers of objects long after they are no longer needed by the program. As a consequence, object deaths appear far removed from the program events that actually cause them. This makes the traces poorly suited to evaluating new GC algorithms, as the Merlin work showed (Hertz et al., 2006).

Idealized death times

Our goal is to generate traces with idealized death records. That is, each object death appears in the trace at the earliest time at which the object could be reclaimed. An idealized trace captures the behavior of a program independent of any particular GC algorithm, with object death events appearing close to the program actions

[2] This is more subtle than you might think. In Java, the virtual machine may allocate application-visible objects as side effects of other actions, such as class loading, and native libraries can also do so. Similar remarks apply to pointer updates.

that cause them. The exact nature of this problem depends on how we define "the earliest time". For example, we could compute death times based on object use (see fig. 2.1): an object is dead immediately after its last use. While interesting as a lower bound, this level of precision is potentially expensive to compute and cannot be exploited by any practical memory manager. Therefore, we adopt the definition used in garbage collection and in prior work on tracing: an object is dead when it is no longer reachable from the roots (local and global variables) directly or indirectly through any sequence of pointers. This choice still leaves many subtle issues, however, including the granularity of trace time and the liveness of root variables, which are discussed in more detail in Section 3.4.

A naive algorithm for computing idealized death times is to run the garbage collector much more frequently. For example, we could produce a very precise trace by invoking the collector at every program point where an object could become dead. Not surprisingly, this approach is impractical.

The Merlin algorithm

The Merlin algorithm, introduced by Hertz et al. (2002) and reproduced here in Figure 3.1, solves this problem by using timestamps to infer the idealized death times of objects when they are reclaimed at regularly scheduled garbage collections. During normal execution the algorithm timestamps objects whenever they lose an incoming pointer. At any point in time, an object's timestamp represents the last time it was directly observed to be reachable. When the collector reclaims an object, however, its timestamp is not necessarily its death time. In many cases an object becomes unreachable indirectly, when an object that points to it becomes unreachable. In this case we need to determine which event occurred later: the direct loss of an incoming pointer (the timestamp), or the indirect loss of reachability (the death times of the referring objects). So, the idealized death time of an object

void ComputeObjectDeathTimes()
    Time lastTime = ∞
    sort unreachable objects from the earliest timestamp to the latest;
    push each unreachable object onto a stack from the earliest timestamp to the latest;
    while (!stack.empty())
        Object obj = stack.pop();            // pop obj with next earlier timestamp
        Time objTime = obj.timeStamp;
        if (objTime <= lastTime)             // don't reprocess relabeled objects
            lastTime = objTime;
            for each (field in obj)
                if (isPointer(field) && obj.field != null)
                    Object target = getMemoryWord(obj.field);
                    Time targetTime = target.timeStamp;
                    if (isUnreachable(target) && targetTime < lastTime)
                        target.timeStamp = lastTime;
                        stack.push(target);

Figure 3.1: Pseudocode for the Merlin algorithm.

(Td(o)) is computed from its timestamp (Ts(o)) and the death times of any objects that point to it:

Td(o) = Max( Ts(o), { Td(p), ∀p : p references o } )

This insight leads to a practical approach for GC tracing that is also central to the system we present in this chapter:

• During normal execution:

– Record ordinary events in the trace as they occur (e.g., object allocations

and pointer updates).

– Timestamp objects whenever they might become directly unreachable

(i.e., when they lose an incoming pointer).

• At GC time:

– Compute idealized death times using the formula above (implemented

roughly as a depth-first search on the graph of dead objects, pushing

computed death times across the pointers).

– Generate a death event record for each reclaimed object and insert it in

the proper place in the trace.

– Flush records to disk, and continue ...

An important implication of the Merlin algorithm is that it requires a notion of trace time for use in the timestamps. In fact, all events in the trace must be associated with timestamps, because we cannot learn about a death at the time it occurs—we discover the true death times of objects only at GC time, which is typically much later. The model of trace time (in particular, its granularity) has a profound impact on the implementation of the trace generator and the precision of the traces it generates.

3.3.2 Why a new trace generator?

The first realization of Merlin took the form of a customized garbage collector called GCTrace, implemented in JikesRVM. The main advantage of this approach is that the implementation can be integrated directly into the virtual machine code.

The compiler can be modified to add instrumentation to object allocations and pointer updates, and the garbage collector can be modified to perform the extra death time computation. GCTrace is a valuable tool. Its primary goal was to provide a trace that allowed simulated garbage collections with great accuracy. We elaborate on some of its properties, and discuss where and why we want to improve on them, below.

Precise: GCTrace uses allocation time for its traces: the trace time clock “ticks”

at each object allocation. This is a good choice for its goals, as allocation is

what causes the garbage collector to eventually do work. However, it does

mean that object deaths and other events that occur between allocations

cannot be ordered or precisely localized at any finer granularity.

A related problem is that allocation time does not correspond to anything

static in the program itself, so figuring out where events occur relative to the

code is very difficult (e.g., “In which method did the death of object 739229

occur?”). If a collector wants to take advantage of this information, for exam-

ple by doing something different in different phases of program execution (as

demarcated by methods), it would be difficult to prototype using GCTrace.

Completeness: GCTrace works with many Java programs, but since it was

created Java programs have grown more sophisticated. Multithreading, use

of the Java Native Interface, and weak references are common; these were

not handled by GCTrace.

Integrated into VM: GCTrace is implemented as a garbage collector within the

VM. This allows it to exploit internal structures of the VM for performance.

However, it also leaves it exposed to changes in those structures. MMTk (the

memory management toolkit used in JikesRVM) has undergone a number of

radical refactorings, often leaving the GCTrace implementation out of date.

A major goal of our system therefore is to create a useful memory tracing tool

without modifying the JVM internally.

The calling context is potentially useful for a variety of optimizations. For

example, Xu (2013) presents an optimization to re-use the memory of objects

if they die before the next time their allocation site is invoked. Although

Elephant Tracks is not fast enough to use on-line, it would allow researchers

to prototype such ideas by analyzing traces before investing time in complex

modifications to a JVM.

3.3.3 Related work

The work most closely related to ours is the original GCTrace implementation of the

Merlin algorithm (Hertz et al., 2006), which is discussed in detail throughout this chapter. Foucar reimplemented GCTrace using a shadow heap (see section 3.5.3) implemented in C++, like Elephant Tracks, but otherwise preserving the execution model and dependence on JikesRVM (Foucar, 2006).

Two prior papers explore the relationship between liveness and reachability for garbage collection. Agesen et al. (Agesen et al., 1998) examine the effects of apply- ing different levels of liveness analysis to the GC root set (variables on the stack).

They found that on average the differences were small, but on occasion static liveness analysis would improve collection efficiency noticeably. This result suggests that our dynamic liveness model is reasonable for most purposes, but could be improved (see later discussion). Hirzel et al. (Hirzel et al., 2002) additionally consider the difference between reachability and last-use liveness of objects (see the discussion of object lifetime in section 2.1). They also find that schemes based on liveness of variables (i.e., employing a compiler analysis to determine whether variables on the stack are no longer live, and therefore are not part of the GC's root set) have little impact on when a reachability-based garbage collector could collect objects. However, they do find that an object's last-use time and the time it becomes unreachable are often significantly different. Elephant Tracks currently cannot compute equivalent information.

GC traces have been used to drive empirical studies of heap behavior, especially those examining the distribution and predictability of the lifetimes of objects (Inoue et al., 2006; Jones and Ryder, 2008). At a coarse level, measuring time in bytes allocated and measuring time with method events do not produce dramatically different lifetime distributions. For analyses that are sensitive to program structure, however, this may be inadequate: many methods may be executed between any two allocations, and so the trace does not record accurately what event occurred in what

method. In addition, allocation time is not stable across runs of a program under different inputs.

Jones and Ryder (Jones and Ryder, 2008) show that the calling context of object allocation correlates well with lifetime, i.e., objects allocated in the same calling context tend to live for the same amount of time. They could not determine, however, whether the calling context of object death correlates with lifetime, which might be a more useful fact for further improving garbage collection.

Inoue et al. (Inoue et al., 2006) look at what information is needed to precisely predict the lifetime of an object at the time of its allocation. They define a fully precise predictor as one that predicts lifetime to within a single quantum of time. By using bytes-of-allocation as their unit of time, they significantly reduce the coverage and accuracy of their predictors. The lifetime of an object in bytes-of-allocation time is much less stable than the calling context of its death, since the latter is directly related to its cause of death, while the former includes many irrelevant events (i.e., unrelated allocations).

Compile-time GC (Guyer et al., 2006) and connectivity-based garbage collection (Hirzel et al., 2003) are two examples of techniques where knowing the program location at which an object dies is crucial. Such techniques are often evaluated using trace-driven simulation before embarking on a full implementation. Using

Elephant Tracks traces would yield a more accurate assessment of their potential.

Lambert et al. present a system for performing platform-independent JVM timing (Lambert and Power, 2008). Although similar in spirit to our JVM-independent execution model, the focus of this work is on developing a model of code execution, rather than heap memory behavior.

Uhlig and Mudge (Uhlig and Mudge, 1997) present a survey of memory tracing techniques. While their focus is on tracing memory accesses for architecture and system research, they enumerate a set of features they consider desirable for tracing

systems in general: completeness (all relevant events are recorded), detail (events are associated with program-level information), low distortion (tracing does not change the program's behavior), and portability. Elephant Tracks achieves many of these goals, although it significantly distorts actual running time, which is why we use a separate notion of time.

3.4 Elephant Tracks Design

Our goals in designing a new trace generator are to address the limitations of prior systems and to add new functionality to support new kinds of program analysis and memory management research. The central features of this design are (1) the kinds of program events recorded in a trace, and (2) the accuracy of this information with respect to some model of program execution. In this section we present the design of Elephant Tracks, and we discuss our choices in the context of the general GC tracing design space. In Section 3.5 we describe how this design is implemented.

3.4.1 Kinds of trace records

A minimal GC trace consists of just a sequence of object allocations and object deaths, labeled with the trace time and thread ID of each event. Without more information, though, such a trace has limited utility. In practice we add trace records for other kinds of relevant events to provide context for program analysis and to enable more kinds of trace-based simulations. For garbage collection research, for example, it is useful to add trace records for pointer updates in the heap, allowing a simulator to maintain an accurate heap model. Elephant Tracks can be configured to produce different kinds of trace records. We currently support the following kinds of records:

• Object allocations and object deaths (with idealized death times computed

using the Merlin algorithm).

• Pointer updates in the heap: These records include the source and target objects, as well as the object field or array index being updated. We also include

updates of static fields.

• Method entry and exit: These records allow trace times to be mapped to

specific methods, and even more precisely, to specific calling contexts.

• Exceptions: We augment method entry and exit to indicate when an exception

is thrown, the sequence of method calls (if any) that are terminated early

because of the exception, and the entry to a handler for the exception. The

main purpose of these events is to provide accurate information about method

execution context.

Separately from the trace, Elephant Tracks also emits information about each class loaded, each field declared in the class, each method declared in the class, and each allocation site in each method. This information is referred to by the trace, e.g., the trace will mention a unique allocation site number, which can be found in the side description file.
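To make this concrete, a few records of such a trace might be rendered textually as follows. This layout is purely illustrative, invented for exposition; it is not Elephant Tracks' actual record syntax, and the ids shown are hypothetical.

```
M 102 0 T1      # method entry: method id 102, receiver 0 (static), thread T1
A 17 54 3 T1    # allocation: object 17 at allocation site 54, class 3, thread T1
U 17 23 2 T1    # pointer update: field 2 of object 17 now refers to object 23
E 102 T1        # method exit: method id 102
D 23 T1         # death record for object 23, inserted at its idealized death time
```

Method ids, class ids, and site numbers would be resolved through the side description (names) file mentioned above.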

We currently do not generate trace records for object timestamps or for general memory accesses (including stack reads and writes). This information would enable an even wider range of applications, such as cache simulations. These events are extremely frequent, however, and would result in overwhelmingly large traces. In addition, instrumenting every single variable access would be technically challenging—bytecode rewriting might not be the best approach for this level of detail.

3.4.2 Execution model

Ideally, we would like to generate perfect traces, in which every event is recorded with a perfectly accurate and precise time. But this goal raises a critical question: accurate with respect to what? That is, which aspects of the execution model do we want the trace to represent? Elephant Tracks, like other trace generators, relies on a host virtual machine to execute the target program. It runs alongside the

VM, recording relevant events. The problem is that the timing of some events is highly VM dependent—directly recording these events as they occur produces a trace reflective of arbitrary VM implementation choices. Instead, we want to generate traces that abstract away some details of the VM's execution model, and record events in their own well-defined, less VM-specific terms. In particular, we would like a trace that places object deaths at the first time any collector could collect the objects

(at the point where the object becomes unreachable, see figure 2.1). The possible models range widely, with some elements closer to the VM (essentially profiling the VM), and other elements more abstract, capturing an idealized execution of the program.

The main aspects of the execution model we wish to capture are (1) the definition of object lifetime (in particular, when objects are considered dead), and (2) the definition of trace time (i.e., when does the trace time clock “tick” and with what frequency). The overall goal of Elephant Tracks is to define these components in such a way that events can be localized precisely within the structure of the code.

The model is idealized for object lifetimes, but resorts to VM timing in cases where an idealized model is not possible, such as the interleaving of concurrent threads and the clearing of weak references.

One potential approach is to use non-deferred reference counting, which reclaims objects as soon as their reference counts become zero. Like reference counting collection, however, this approach cannot directly detect the death of cycles of

objects, and would require frequent tracing collections to achieve high precision.

Therefore, we do not use this approach.

Defining object lifetime

Object lifetimes are delineated by allocation and death events. Most object allocations are explicit in the program, so simply recording them as they occur produces

a VM-independent trace. We have found, however, that there are several other

sources of allocations, including VM internal allocations (e.g., String constants in

class files and Class objects themselves), objects allocated by the VM before it can

even turn instrumentation on, and objects allocated by JNI calls. We capture all of

these, but cannot associate them with a usual allocation site in the application, and

for those allocated very early in the run, we cannot provide relative time or context

of allocation.

For object deaths, however, an explicit goal of GC tracing is to compute idealized death times. Both Elephant Tracks and GCTrace adopt the standard GC

definition: an object is dead when it is no longer reachable from the roots (local and

global variables). Even within this seemingly narrow definition, however, there is

a range of possible models. To see why, consider the program events that can cause

an object to become unreachable:

• The program overwrites a pointer in the heap (putfield, etc.)

• The program overwrites a static (global) reference (putstatic)

• A local reference variable goes out of scope

• The program changes the value of a local reference variable

• A weak reference is cleared by the garbage collector

While the first two (heap and global writes) are straightforward to instrument, local variables and weak references are more difficult to pin down. Furthermore, there

are roots inside the VM that we cannot observe and that the VM does not necessarily inform us about when they change. Fortunately these are mostly “immortal”

references, such as to class objects, or relate to constants constructed from class

files (these may come and go).

Local variables

Tracing local variables presents many design choices and challenges. The key

question is: at what point is a local pointer variable dead, and therefore no longer

keeping the target object alive? At one end of the spectrum we could consider local

variables live throughout the method invocation with which they are associated. In

practice, however, most virtual machines apply some form of static liveness analysis

to compute more precise lifetimes. The virtual machine uses this information to

construct GC maps, which tell the garbage collector which variables to consider as

GC roots at a given point in the method.

GCTrace uses the GC maps in JikesRVM to determine which variables are live.

The advantage of this approach is that it is straightforward to implement. The downside is that the timing of the object death records depends on the specific liveness

analysis algorithm and choice of GC points made in JikesRVM.

Elephant Tracks currently uses a form of dynamic liveness to determine the lifetimes of local variables. This choice reflects implementation decisions (described

in more detail below). A variable is considered dead after its last dynamic use. We

define a use as one of the following: (1) a direct dereference (access to an object or

array), (2) a type test, such as instanceof, (3) obtaining an array’s length, (4) use as a receiver of a dynamic method dispatch, or (5) a reference test, such as ifnull.

Dynamic liveness, however, is more precise than static liveness analysis, primarily because it is not conservative about liveness on different execution paths.

The resulting traces show some object death times earlier than any real garbage

collector could achieve. For example, a reference that is held while a series of

methods are invoked, but never used or passed to any method, is considered dead in all of those methods. We do consider a variable live if it is passed to a method call as a parameter, or returned, even if it is never used within those methods, or the return value is not used.
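The distinction can be pictured with a hypothetical method (the names below are invented for illustration). Under ET's dynamic-liveness model, the array referenced by `big` is dead immediately after its last dynamic use (the length access), even though the variable remains in scope until the method returns; a scope-based or conservative static model would keep it live longer.

```java
// Hypothetical illustration of dynamic liveness. Under ET's model the array
// referenced by 'big' is dead after its last dynamic use (the length access),
// even though the variable stays in scope until the method returns.
public class LivenessDemo {
    static int lastUse() {
        byte[] big = new byte[1 << 20];
        int len = big.length;  // last dynamic use of 'big': the array can die here
        // A long computation here would not keep the array alive under ET's
        // model, though a scope-based model would hold it until the return.
        return len;
    }

    public static void main(String[] args) {
        System.out.println(lastUse());  // prints 1048576
    }
}
```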

Weak references

Weak references, instances of the special class WeakReference, are used in Java to manage the collection of objects. An object reachable only via weak references can be collected at any time.

Listing 3.1: Weak reference usage

    WeakReference weakReference = new WeakReference(foo);

For example, the code in listing 3.1 creates a new weak reference, weakReference, to the object referred to by the variable foo. Internal to the WeakReference class is a field called referent. The referent object may also have normal (strong) references to it, and while it remains strongly reachable it will not be collected. However, if the program eliminates these strong references, and the object becomes reachable only via weak references, it may be collected. In this case, the referent field will be nulled out. Java also supports Soft and Phantom references, which operate in a similar way, but have different semantics for when the referent field may be null.
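This behavior can be observed directly. The sketch below is a minimal demonstration; note that whether the second get() returns null depends on whether the collector actually ran and cleared the reference, since System.gc() is only a hint.

```java
import java.lang.ref.WeakReference;

// Sketch of the weak-reference behavior described above. Whether the referent
// is actually cleared depends on the collector running, which System.gc()
// merely requests.
public class WeakDemo {
    public static void main(String[] args) {
        Object foo = new Object();
        WeakReference<Object> weakRef = new WeakReference<>(foo);
        System.out.println(weakRef.get() == foo);  // true: foo is strongly reachable
        foo = null;                                // drop the only strong reference
        System.gc();                               // a hint, not a guarantee
        // If a collection ran, the referent has been cleared:
        System.out.println(weakRef.get());         // likely null, but not guaranteed
    }
}
```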

Weak references and their cousins present an interesting challenge. In principle, the garbage collector can choose to clear weak references at any time (or not at all) once an object is no longer strongly reachable. In practice, they will only be cleared when the collector is actually run. Further, soft references are cleared “at the discretion” of the collector, in response to memory pressure. Phantom references are similarly affected by the timing of collector runs by the host VM. For a trace, though, this leaves no obvious idealized model of when to clear a weak reference.

Both Elephant Tracks and GCTrace opt to record these events when the VM chooses to perform them. Given that programs can perceive and respond to the collector’s

decision, there is no good alternative to this approach.

Concurrency

Most modern software uses concurrency in some form, which raises the question of

how to order trace events that occur in different threads. We adopt a straightforward

approach in which time is global, but trace records include both the time (see footnote 3) and the

ID of the thread in which the event occurred. In the current implementation, how-

ever, timestamps on objects do not include the thread ID, so object deaths cannot

necessarily be assigned to particular threads.

One problem with this approach is that the resulting traces encode the scheduling decisions of the VM and operating system. Furthermore, trace instrumentation

perturbs program execution significantly, resulting in schedules that could be quite

different from those of the uninstrumented program. While interesting, this problem is

difficult to address without controlling the scheduler directly—for example, by replaying a schedule from a real run. One potential solution is to represent time using

vector clocks, which would encode only the necessary timing dependences between

threads. However, this would still reflect the particular order of thread interactions in the traced run.

We hope to investigate alternative designs in the future.

3.5 Implementation

Elephant Tracks is implemented as a Java agent that uses the Java Virtual Machine Tool Interface (JVMTI). The primary components of a system using ET are: the JVM itself, including its JVMTI and JNI support; the application; the Elephant

Tracks agent; the ElephantTracks Java class file, which connects bytecode instrumentation to the agent via Java Native Interface (JNI) calls; and the instrumenter,

Footnote 3: We do not actually output the time value, but it can be derived by knowing which events “tick” the clock.

which rewrites the bytecode of classes as they are loaded.

3.5.1 Timestamping strategy

For Merlin to produce precise death times, the timestamp on an object must always be the time at which the object last lost an incoming reference. This invariant is easy to maintain for heap and static references, since we can directly instrument these operations, timestamping the old target before allowing the store to proceed.

For stack references, however, there is no explicit operation denoting the end of a variable’s scope. There are essentially two strategies for solving this problem:

(1) timestamp all live variables at every time step, or (2) timestamp each variable exactly when its lifetime ends. (Recall that we define a variable as being live only up to its last actual use.)

GCTrace uses strategy (1), which has the advantage that it is straightforward to implement: at each tick of the clock, walk the stack and timestamp each object referred to by a live variable. This strategy, however, creates a trade-off between performance and precision. Walking the stack is an extremely expensive operation, so it cannot be performed frequently, limiting the granularity of the clock. The problem is particularly acute when using allocation time, since a single time step can span multiple methods, requiring a full walk of the call stack at every tick. We believe that stack walking also inhibits code optimizations (or forces de-optimization), further slowing execution. Furthermore, as mentioned in Section 3.4, it relies on the

VM’s GC Maps to define variable liveness.

Elephant Tracks uses strategy (2). This approach requires more instrumentation to timestamp a variable's referent whenever the variable is used. It has several advantages, though. The most important is that it works correctly for any granularity of time. In addition, it gives the trace generator explicit control over the model of variable liveness. Finally, it is amenable to an instrumentation-time optimization

(described below) that eliminates redundant timestamping operations.
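A minimal sketch of strategy (2): every use of a reference is routed through a call that timestamps the referent, instead of walking the stack at each tick. The Agent class and its methods below are hypothetical stand-ins for the ET agent, not its real API.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of strategy (2): timestamp a referent at each use. 'Agent' and its
// methods are hypothetical stand-ins for the ET agent, not its real API.
public class Agent {
    static long clock = 0;
    static final Map<Object, Long> stamps = new IdentityHashMap<>();

    static <T> T use(T obj) {        // instrumentation wraps each variable use
        if (obj != null) stamps.put(obj, clock);
        return obj;
    }

    static void tick() { clock++; }  // e.g., at each event that advances trace time

    public static void main(String[] args) {
        int[] arr = new int[8];
        tick();
        int len = use(arr).length;   // 'arr' is timestamped at this use
        System.out.println(len + " " + stamps.get(arr));
    }
}
```

Because the timestamp is recorded exactly at the use, this works at any clock granularity and needs no stack walk.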

3.5.2 The instrumenter

The instrumenter is ordinary Java code and is written using the ASM bytecode rewriting tool (Bruneton et al., 2002). The current version of ET is written to use

ASM 3.3.1. ASM should work on any Java 1.6 class file, and will produce Java 1.6 class files whose meaning is well defined on any standards-compliant JVM. Therefore, we are not as worried about introducing reliance on ASM as we are about relying on JVM internals, which could be changed at any time (as long as they still implement the standard). In order to avoid recursive self-instrumentation between instrumenter code and the application, we run the instrumenter in a separate operating system process, connected with the agent via pipes in both directions. The agent uses the JVMTI ClassFileLoadHook callback, which causes the JVM to present to the agent each class that the JVM wants to load, and to give the agent the opportunity to substitute other bytecode for what the JVM presents. The ET agent sends the bytecode to the instrumenter, which sends back an instrumented class file.

The instrumenter assigns a unique number to each class, each field, each method, and each allocation site (for both scalars and arrays) in each method, writing them to what we call the names file. The instrumenter also sends the class and field information to the agent. (At present the agent has no need to maintain tables for the other information, so it is not sent.)

Ordinary instrumentation

We defer to Section 3.5.2 some special cases, and describe now the usual instrumentation added by the ET instrumenter. We organize the description by feature.

Method entry and exit: On entry, and just before a return, we insert a call noting

the id of the method and the receiver (for instance methods). In a constructor

we cannot actually pass the receiver (it is not yet initialized), so we pass null

and the agent uses a JNI GetLocalObject call to retrieve the receiver from

the stack frame.

Exception throw: At an athrow bytecode we insert a call that passes the exception

object, the method id, and the receiver (for instance methods). The same

special handling of the receiver in constructors happens here, too.

Exception exit: To detect exceptional exit of a method, we wrap each method’s

original bytecode with a catch-anything exception handler, which makes a

call indicating the same information as for a throw, and then re-throws the

exception.

Exception handle: At the start of each exception handler we insert a call that notes

the same information as for a throw.

Scalar object allocations: The basic idea is to insert, after the new bytecode, a call

that indicates the new object, its class, and the allocation site number. However, we cannot pass the new object directly since it is uninitialized. Further,

it is on the JVM stack, not in a local variable, so the JNI GetLocalObject

function will not work. Our solution is to add one extra local variable to any

method that allocates a scalar. The instrumented code dups the new object

reference and astores it to the extra local. In the call to the agent, the instrumented code indicates which local variable the agent should examine to

obtain the object reference. Strictly speaking, we do not need to pass the

class, since the agent can figure it out; we may remove that in the future.

Array allocations: New arrays start life fully initialized, so we simply pass them in

a call to the agent, along with the allocation site number. For multianewarray

we call an out-of-line instrumentation routine that informs the agent of each

of the new arrays that are created. This could also be

done in the agent, if desired.

Pointer updates: For putfield of a reference and for aastore we insert, before

the bytecode, a call that notes the object being changed, the object reference

being stored, and the field (or index, for an array) being updated. Java allows

putfield on uninitialized objects (mostly so that an instance of an inner class

can have its pointer to its “containing” outer class instance installed, before

invoking the inner class constructor). In that case we use the same technique

as for scalar allocations to indicate to the agent the object being updated.

Uses of objects: As mentioned in Section 3.4, ET timestamps objects when they

are used. We mentioned there the cases in which that happens. We simply

insert a call, passing the object to be timestamped. On method entry we timestamp all pointer arguments, including the receiver. (In constructors we make

a slightly different call since we cannot pass the receiver; the agent fetches it

out of the frame.) For efficiency on method entry, we have timestamp calls

that take 2, 3, 4, or 5 objects to stamp.

Counts: As an extension controlled by a command line flag, the instrumenter will

also track the number of heap reads and writes, the number of heap reads

and writes of reference values, and the number of bytecodes executed, and

insert calls reporting these just before each action that advances the timestamp

clock, and just before control flow branch and merge points.

The instrumenter also includes a simple kind of optimization to reduce the number of timestamp calls. It tracks which variables (locals and stack) have been timestamped since the last tick or the last bytecode frame. (Frames occur at control

flow merge points, and detail the types of the local and stack variables at that point.)

It avoids timestamping an object twice in the same tick. This optimization requires

tracking object references as bytecodes move them around, but is straightforward.

The optimization is effective, and necessary in order to avoid having some methods

increase in size so much, because of added instrumentation, that they overflow the

maximum allowed method size. We present the results in table 3.1 on page 34.

The average enlargement factor (as measured by increase in byte size) of a class due to injected instrumentation was 1.47 without optimization, and 1.38 with optimization. batik was unable to run to completion without this optimization, as the number of bytecodes exceeded the number allowed in a Java method. The optimization is particularly effective on methods that initialize large arrays.

Benchmark    no opt    opt
avrora       1.46      1.34
batik        -         1.31
eclipse      1.38      1.27
fop          1.47      1.37
h2           1.49      1.48
jython       1.58      1.52
luindex      1.48      1.38
lusearch     1.48      1.48
pmd          1.45      1.53
sunflow      1.48      1.38
tomcat       1.43      1.40
xalan        1.46      1.37
geomean      1.47      1.38

Table 3.1: The average class enlargement for each of the DaCapo benchmarks (as measured by increase in size in bytes), with and without the optimizations described in section 3.5.2 on page 31. batik does not run to completion in the no-opt configuration, since it violates the maximum Java method size.

Instrumentation special cases

We now detail various special cases (beyond access to uninitialized new objects,

mentioned in the previous section). These are cases where VMs do something that

bypasses the bytecode in some way that is difficult to observe. Elephant Tracks

is as VM independent as possible, but these parts might have to be reimplemented

on different Java VMs (although those we examined, Oracle's HotSpot and IBM's

J9 Java 1.6 JVMs, both behave in a similar way). If we did not handle these cases,

and a program happens to use these features, the trace would be inaccurate in some

detail: either missing a pointer update or an allocation, or having incorrect death times

for some objects.

Native methods: In order to indicate when a native method is called and returns, we change its name, prepending $$ET$$, and insert a non-native method that

calls the native method. We instrument the non-native method essentially as usual.
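The renaming scheme can be pictured in Java source form. This is only a sketch: the real transformation is done on bytecode, and a plain static method stands in for the native one here (the method names are illustrative).

```java
// Sketch of the native-method wrapping described above, shown as Java source.
// The real transformation happens at the bytecode level; a plain static method
// stands in for the renamed native method here.
public class Wrapping {
    // Originally: native int compute(int x);  renamed by the instrumenter:
    static int $$ET$$compute(int x) { return x * 2; }  // stand-in for the native body

    // Inserted non-native wrapper with the original name, instrumented as usual:
    static int compute(int x) {
        // agent.methodEntry(...) would be injected here
        try {
            return $$ET$$compute(x);
        } finally {
            // agent.methodExit(...) would be injected here
        }
    }

    public static void main(String[] args) {
        System.out.println(compute(21));  // prints 42
    }
}
```

Callers continue to invoke compute with its original name, so the extra level of indirection is invisible to them (except for the stack-depth issues discussed next).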

A number of native methods require special treatment, however:

getStackClass: This method of java.lang.Class, and several similar methods, include an argument specifying the number of stack frames to go up to

look for some information. To wrap these native methods, the ET non-native

wrapper adds one to the number of frames before calling the native. This

properly adjusts for the extra level of call that the wrapper adds.

getClassContext: This method of the IBM J9 ClassLoader probes a specific

number of frames up the stack, so wrapping it disturbs the result. With regret, we do not wrap it. (We contend that native methods subject to this

problem should be redesigned, like getStackClass described above, so that

they can be wrapped.) A number of other methods exhibit essentially the

same problem.

Several native methods of class Object: Specifically, getClass, notify, notifyAll,

and wait do not operate correctly if wrapped, so we omit them.

initReferenceImpl: This method of class Reference initializes the referent

field of a weak reference object. We instrument it specially so that the agent

can observe the update to the field, which otherwise would be hidden to ET.

Several methods of sun.misc.Unsafe: For allocateInstance we note the allocation; for a successful compareAndSwapObject we note the pointer update, as we do for putObject, putObjectVolatile, and putObjectOrdered.

All of these updating operations work in terms of the offset of a field or array

element into the object, a fact not readily available to the agent. Therefore we

instrument objectFieldOffset, staticFieldBase, staticFieldOffset,

arrayBaseOffset, and arrayBaseScale to inform the agent of the base or

offset information they return, so that the agent can map the offsets and bases

back to fields and array elements.

System.arraycopy: We instrument this specially so that the agent can note all

the resulting updates to arrays of objects. The agent does the actual work and

notes the effects, taking care to deal correctly with situations that will throw

an exception, etc.
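To make the arraycopy handling concrete, here is a hedged sketch of an agent-side replacement for System.arraycopy on object arrays. The method and the recordPointerUpdate hook are our own illustrative names, not ET's actual API; the real agent emits trace records rather than counting.

```java
import java.util.Arrays;

class ETArrayCopy {
    // Illustrative stand-in for the agent's trace emission.
    static int recordedUpdates = 0;
    static void recordPointerUpdate(Object[] dst, int index, Object newValue) {
        recordedUpdates++; // the real agent would emit a pointer-update record
    }

    // Performs the copy itself and notes every resulting pointer update.
    static void arraycopy(Object[] src, int srcPos,
                          Object[] dst, int dstPos, int len) {
        // Check bounds first, so nothing is recorded for a copy that throws.
        if (srcPos < 0 || dstPos < 0 || len < 0
                || srcPos + len > src.length || dstPos + len > dst.length) {
            throw new IndexOutOfBoundsException();
        }
        // Copy out of a snapshot so overlapping copies behave like the
        // original System.arraycopy (as if through a temporary array).
        Object[] snapshot = Arrays.copyOfRange(src, srcPos, srcPos + len);
        for (int i = 0; i < len; i++) {
            dst[dstPos + i] = snapshot[i];
            recordPointerUpdate(dst, dstPos + i, snapshot[i]);
        }
    }
}
```

Validating before copying, and snapshotting the source range, are the "taking care" steps the text alludes to: no update records are emitted for a copy that throws, and overlapping self-copies keep arraycopy's semantics.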

Class Object: We instrument Object.<init> to report the newly initialized object. Sometimes this is the first time we see an object, e.g., for some objects created via JNI calls. We carefully avoid instrumenting Object.finalize, since having any bytecode in that method will cause every object to be scheduled for finalization (which breaks JVMs). Any finalize method in another class is instrumented, so finalizations are visible in the trace.

Timestamping new objects: Trying to obtain a reference to and timestamp a new object in Object.<init> or Thread.<init> fails, but the object will be reported soon anyway, so skipping the timestamp operation is not harmful.

3.5.3 The agent

The agent performs these functions to support ET’s goals:

• Sends classes to the instrumenter and returns instrumented classes to the

JVM.

• Notes several actions of the JVM and responds appropriately. These include:

changes in the JVMTI phase of execution (VMStart, VMInit, and VMDeath);

GarbageCollectionFinish, which triggers a scan (described further be-

low) to see if any weak references have been cleared; and VMObjectAlloc,

to detect objects allocated directly by the VM.

• Intercepts various JNI calls so that it can emit suitable trace records, specif-

ically, AllocObject, ThrowNew, and the various NewObject and NewString

calls, to note the new object; and SetObjectField, SetStaticObjectField,

and SetObjectArrayElement to note reference updates.

• Handles the various instrumentation calls from the ElephantTracks class and (generally) creates a trace record, inserting it into a buffer. The size of this buffer is a configurable runtime parameter, specified by the number of records it will hold. We find sizes in the hundreds of thousands work well in practice.

• Maintains a model of the heap graph called the Shadow Heap. Each node represents an object and each directed edge a pointer. The model also includes static variables, but does not (cannot) include various VM internal roots; as previously described, we do not model stack roots directly, but employ timestamping to determine liveness from thread stacks. The Shadow Heap is represented in the agent's memory, not the Java heap, so it cannot cause memory pressure that would trigger additional garbage collections.

• To help maintain the heap graph model and to identify objects in trace records,

the agent uses the JVMTI object tagging facility to associate a unique serial

number with each object, as early as possible after the object is created.

• Maintains a table of object liveness timestamps, and the timestamp “tick”

clock.

• Maintains a data structure describing weak objects and their referents. Whenever the JVM runs its garbage collector, after collection completes the agent notifies a separate agent thread to check each weak object to see if its referent has been cleared. This thread will timestamp the now-unreachable referent with the current time, giving a good-faith estimate as to when it died.
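To make the Shadow Heap concrete, here is a minimal sketch of such a model, with nodes keyed by JVMTI-style tags. The class and method names are our own; the actual agent's structure is written in C++ and records considerably more (static variables, array elements, per-object metadata).

```java
import java.util.*;

// Toy shadow heap: one node per object tag, one map entry per pointer field.
class ShadowHeap {
    static class Node {
        final long tag;
        final Map<String, Long> fields = new HashMap<>(); // field -> target tag
        Node(long tag) { this.tag = tag; }
    }

    private final Map<Long, Node> nodes = new HashMap<>();

    Node node(long tag) { return nodes.computeIfAbsent(tag, Node::new); }

    // Record that object `tag`'s field now points at object `targetTag`.
    void pointerUpdate(long tag, String field, long targetTag) {
        node(tag).fields.put(field, targetTag);
        node(targetTag); // ensure the target exists in the model
    }

    // Simple traversal of the model, as used when scanning for dead objects.
    Set<Long> reachableFrom(long rootTag) {
        Set<Long> seen = new HashSet<>();
        Deque<Long> work = new ArrayDeque<>(List.of(rootTag));
        while (!work.isEmpty()) {
            long t = work.pop();
            if (seen.add(t)) work.addAll(node(t).fields.values());
        }
        return seen;
    }
}
```

Keeping this model outside the Java heap, as the agent does, is what prevents the bookkeeping itself from perturbing the program's garbage collection behavior.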

Trace outputting proceeds in cycles. This is because determining which objects have died, propagating timestamps, and inserting death records at the right place in the trace, is a periodic activity, done in batches. When the agent is notified that the

JVM is entering the JVMTI Live phase, the agent iterates over the initial heap and creates an object allocation record for each object and a pointer update record for each non-null instance and static field. When the agent is notified that the JVM is entering the JVMTI Dead phase (JVM shutdown), it closes out the current buffer of trace records.

In between, during the Live phase, whenever the trace buffer fills with records, the agent:

1. Forces a garbage collection and then iterates over the remaining heap. This

allows the agent to detect which objects have been reclaimed since the trace

buffer was last emptied.

2. Applies the Merlin algorithm to compute object death times (really “last time

alive” times).

3. Checks weak objects to see if their referent has been cleared. The VM does

not inform the agent directly about this, but since we note referent field ini-

tialization, we know about weak objects and their referent targets. The tables

the agent maintains for these are carefully designed not to keep the objects

live (we use JNI weak references).

4. Adds death records to the trace buffer, properly timestamped.

5. Sorts the records using a stable sort, and outputs them.

We observed that the last step, sorting and outputting, consumes about half the time of creating a trace, so we developed a parallel version. We report performance results in Section 3.6.
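The propagation at the heart of step 2 can be sketched as follows. This is a deliberately simplified rendering of the Merlin idea: it assumes we already have, for each dead object, its own last-known-alive timestamp and its outgoing pointers, with array indices standing in for object tags; Merlin proper derives this information incrementally from the trace.

```java
import java.util.PriorityQueue;

class MerlinSketch {
    // stamp[i]: object i's own last-known-alive timestamp.
    // outEdges[i]: indices of the objects that i points to.
    // Returns each object's death time: the maximum timestamp over every
    // dead object from which it is reachable (including itself), since an
    // object is alive as long as anything that can reach it is alive.
    static long[] computeDeathTimes(long[] stamp, int[][] outEdges) {
        long[] death = stamp.clone();
        // Work from the latest timestamps downward, pushing times along edges.
        PriorityQueue<Integer> queue =
            new PriorityQueue<>((a, b) -> Long.compare(death[b], death[a]));
        for (int i = 0; i < death.length; i++) queue.add(i);
        while (!queue.isEmpty()) {
            int o = queue.poll();
            for (int target : outEdges[o]) {
                if (death[target] < death[o]) {
                    death[target] = death[o]; // target lived at least as long
                    queue.add(target);        // re-queue with improved time
                }
            }
        }
        return death;
    }
}
```

Stale priority-queue entries are harmless here: when one is popped, its current (already improved) death time is used, so propagation remains correct.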

3.5.4 Properties of our implementation approach

Our implementation strategy has many advantages and few drawbacks. Its advantages include:

• It works with commercial JVMs (in principle with any JVM supporting JVMTI; we have tested it with the Oracle HotSpot and IBM J9 1.6 JVMs) and can run any application. Of course, timing-dependent applications may misbehave, as with any tool that slows execution; this prevents several DaCapo benchmarks from completing successfully.

• The run-time is implemented in C++, with all of its data structures outside of the JVM. This makes it easier to ensure that ET data structures and actions are not inappropriately entwined with the application and JVM.

• The instrumenter is in a separate process, ensuring that it does not recursively instrument itself. As discussed previously, our reliance on ASM is not problematic because ASM is widely used, actively maintained, and part of the infrastructure of at least one major commercial JVM (Oracle's HotSpot).

• We capture even some very tricky cases, including weak references, field up-

dates via sun.misc.Unsafe, reflective object creation, updates, and method

calls, VM internal allocations, relevant JNI calls made by the VM or other

native libraries, and System.arraycopy.

Drawbacks of ET as it stands are mostly ones that similar tools are likely to share:

• A few methods cannot be instrumented, since doing so breaks the JVM.

• Relative timing and thread interactions are affected, which may change application behavior.

• Weak reference clearing is dependent on the vagaries of the JVM.

• Precision in determining object deaths, and the general wealth of information

in the traces, come at a cost: the execution dilation factor is on the order of

hundreds (see Section 3.6 for performance results).

• The resulting system is not as simple as we would like. There are places with

somewhat tricky synchronization and more data structures and mappings than

we would like, but it is not easy to deal with features such as weak references

and sun.misc.Unsafe.

• We rely heavily on correctness and completeness of JVMTI and JNI support.

One implication is that, at present, JikesRVM cannot support ET. Also, we

have discovered previously unreported JVM bugs, such as failure of one JVM

to present for rewriting every class it loads, which implies that a handful of

classes go uninstrumented. (That bug is being fixed, but it appears we later

found a similar case whose fix will take longer.)

3.6 Results

3.6.1 Performance

In this section we present results from running Elephant Tracks on the DaCapo benchmarks in order to give a sense of its performance and the properties of the resulting traces. (Unfortunately, it fails to run tradebeans and tradesoap, perhaps because of internal timeouts.) In table 3.2 on page 45 we present the run-time overhead of our tool under several configurations:

In the No Callback configuration, all of our bytecode instrumentation was injected, but callbacks into the JVMTI agent were disabled (resulting in an empty trace). Additionally, the No Callback configuration enables only the absolute minimum number of JVMTI features necessary to instrument classes. This represents a practical lower bound on the overhead of instrumenting class files and executing the instrumented bytecode, without the overhead of calling into the JVMTI agent, processing the events, or producing a trace record.

The Serial ET configuration periodically pauses to generate death records, put them in order, and output them to the trace file. In contrast, the Parallel ET configuration spawns a separate thread to do this work. This generally results in better performance and fewer pauses in the traced application, but may be of no benefit if the machine lacks sufficient resources to execute this thread in parallel with the application.

With a geometric mean of about 250, the overall dilation factor of Elephant

Tracks is within a factor of two of the published dilation factors of GCTrace (Hertz et al., 2002, 2006), while providing much more information.

The dilation factors of the different benchmarks are not uniform. This diversity cannot be explained only by differences in the amount of instrumentation, since there is no simple linear relationship between the No Callback configuration and the other configurations. Nor could we explain it with a simple linear model relating the number of calls into the JVMTI agent and/or the average heap size of the benchmark being traced (at least, no such model that we were able to discover). Therefore, we theorize that it arises from complex interactions between our instrumentation, Java optimizations, and/or the implementation details of the JVMTI interface.

3.6.2 Trace analysis

Table 3.3 on page 46 shows the composition of the traces by event type (percentage of the trace accounted for by each type). Method entry and exit events outnumber the others significantly, which is why method time is so much more precise than allocation time. In fact, on average there are 70 method entry/exit events between any two allocations. In other words, a single tick of the allocation clock can span dozens of methods, making it difficult to localize object death events within the code. A single tick of the method time clock occasionally contains an allocation, and depending on where the starting and ending method events are found, we might not be able to tell if a death event occurred before or after the allocation. In a few very rare cases, a single unit of method time might contain multiple allocations, but they would have to be arrays, since regular objects always have a constructor call.

In order to demonstrate the value of these more precise traces, we present a few simple trace analysis examples. First, a simple escape analysis is easy to perform with ET traces. We process a trace, and upon encountering a record of object allocation, note the context in which it occurred. Then, if the death event for that same object is encountered before the associated method return, we know the object has not escaped. Conversely, if we do not find the death record before this point, the object has escaped. Note that this does not necessarily mean that there is any static analysis that could have determined in advance that the object would or would not have escaped. The results of this escape analysis are reported in table 3.5 on page 47, where we see that in most benchmarks a majority of objects escape their allocating context.
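The escape check just described can be sketched as a single pass over a simplified event stream. The record encodings here ("E" for method entry, "X" for exit, "A"/"D" for allocation and death) are our own simplification, not ET's actual record format, and we track only stack depth rather than full contexts.

```java
import java.util.*;

class EscapeSketch {
    // Returns ids of objects that outlive the method activation that
    // allocated them, i.e., whose death record does not appear before the
    // allocating frame's exit.
    static Set<Integer> escaped(List<String> trace) {
        Map<Integer, Integer> allocDepth = new HashMap<>(); // id -> alloc depth
        Set<Integer> escapees = new HashSet<>();
        int depth = 0;
        for (String event : trace) {
            String[] parts = event.split(" ");
            switch (parts[0]) {
                case "E": depth++; break;
                case "X": {
                    depth--;
                    final int cur = depth;
                    // Objects allocated in the frame that just returned (or
                    // deeper) and still alive have escaped.
                    for (Map.Entry<Integer, Integer> e : allocDepth.entrySet())
                        if (e.getValue() > cur) escapees.add(e.getKey());
                    allocDepth.values().removeIf(d -> d > cur);
                    break;
                }
                case "A": allocDepth.put(Integer.parseInt(parts[1]), depth); break;
                case "D": allocDepth.remove(Integer.parseInt(parts[1])); break;
            }
        }
        return escapees;
    }
}
```

This is exactly the kind of analysis that the precise placement of death records makes possible: with deaths known only to the nearest garbage collection, the "before the method return" test would be meaningless.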

Second, previous work has shown that the allocation site plus some calling context is a good basis for predicting object lifetime, measured in bytes of allocation (Jones and Ryder, 2008). Since Elephant Tracks' traces can also provide the calling context of an object's death, it is possible to consider whether the allocation context is also a predictor of death context.

As a preliminary investigation, we performed the following analysis. Each object's allocation is recorded with a triple, consisting of the allocation site, the allocating method, and the caller of the allocating method (this gives us a partial calling context). Next, the analysis finds the top ten allocation contexts (based on number of objects allocated). For each object allocated in these contexts, it determines the object's partial death context (there is no site for a death event, so we consider only the method in which it died, and the calling method). Finally, the analysis finds, for each allocation context, the most common death context for objects with that allocation context. In table 3.6 on page 48 we report the average percentage of objects allocated in the top ten contexts that die in the plurality context.

This initial work suggests that the death context of an object may be a stable and predictable feature. However, additional refinement will be required to further illuminate the relationship between an object's allocation context and its death context, as well as to determine if this relationship can be exploited for any optimization.

3.7 Conclusions

In this chapter, we have presented Elephant Tracks, a tool for efficiently generating program traces including accurate object death records. Unlike previous tools, Elephant Tracks traces allow recorded events to be placed in the context of the methods of the program being traced. It also works independently of any particular choice of the JVM to which it attaches. These two features will allow the prototyping of new GC algorithms and new kinds of program analysis, and let the tool keep up with changes in the JVM and class libraries. Elephant Tracks offers performance comparable to similar previous tools, but a wealth more information, and covers many more of the tricky and corner cases of Java and Java virtual machines.

In the following chapter, we will see how the information produced by Elephant

Tracks can be used to inform garbage collection strategies.

Benchmark          No Callback   Serial ET   Parallel ET
                   Dilation      Dilation    Dilation
avrora-default     1.6           436.5       291.1
avrora-large       0.9           553.9       436.1
avrora-small       1.7           354.0       227.3
batik-default      3.7           152.5       102.6
batik-large        3.0           124.6       84.1
batik-small        2.9           58.3        41.9
eclipse-default    18.0          310.5       2110.5
eclipse-large      19.3          498.6       1603.6
eclipse-small      50.5          47.6        4297.5
fop-default        2.6           181.2       130.2
fop-small          2.8           42.0        30.7
h2-default         4.4           3137.3      3245.8
h2-large           4.3           2652.7      3583.1
h2-small           3.1           1272.9      1038.7
h2-tiny            3.9           947.5       754.7
jython-default     2.0           342.2       235.0
jython-large       2.5           949.4       774.9
jython-small       1.6           93.1        71.5
luindex-default    1.7           88.4        71.6
luindex-small      1.6           5.8         4.4
lusearch-default   2.7           385.7       304.3
lusearch-large     2.8           451.8       327.9
lusearch-small     2.9           112.5       85.7
pmd-default        2.0           276.1       134.6
pmd-large          2.4           549.1       230.2
pmd-small          1.8           7.4         5.6
sunflow-default    5.9           1830.1      1457.8
sunflow-large      6.4           2073.2      1583.3
sunflow-small      6.9           598.1       481.2
tomcat-default     1.8           100.3       72.3
tomcat-large       1.7           240.0       175.4
tomcat-small       1.8           48.4        38.4
xalan-default      3.0           482.0       372.5
xalan-large        3.7           922.6       715.4
xalan-small        2.8           114.8       95.5
geomean            3.2           257.7       245.8

Table 3.2: Run-time overhead for Elephant Tracks on the DaCapo benchmark suite

Benchmark          Method   Alloc + Death   Catch + Throw   Pointer Update
avrora-default     97.97    0.30            0.00            1.74
avrora-large       95.17    0.24            0.00            4.59
avrora-small       97.85    0.30            0.00            1.85
batik-default      92.01    1.70            0.00            6.29
batik-large        92.34    1.77            0.00            5.89
batik-small        92.55    2.55            0.00            4.89
eclipse-default    88.65    4.29            0.00            7.06
eclipse-large      90.24    3.23            0.00            6.52
eclipse-small      93.67    3.27            0.01            3.05
fop-default        90.30    4.63            0.00            5.07
fop-small          89.83    4.21            0.00            5.95
h2-default         94.23    2.86            0.00            2.90
h2-large           94.92    2.02            0.00            3.06
h2-small           94.05    3.04            0.00            2.91
h2-tiny            93.98    3.11            0.00            2.91
jython-default     91.64    3.08            0.02            5.26
jython-large       91.74    2.79            0.01            5.46
jython-small       97.36    1.11            0.00            1.53
luindex-default    96.37    0.28            0.00            3.36
luindex-large      90.49    5.61            0.06            3.84
luindex-small      94.91    1.36            0.00            3.72
lusearch-default   91.70    2.73            0.05            5.52
lusearch-large     91.70    2.72            0.05            5.52
lusearch-small     91.77    2.74            0.05            5.43
pmd-default        87.31    3.86            0.10            8.73
pmd-large          87.04    4.27            0.09            8.60
pmd-small          92.54    3.48            0.01            3.97
sunflow-default    94.84    3.00            0.00            2.16
sunflow-large      94.83    3.00            0.00            2.17
sunflow-small      94.90    2.96            0.00            2.14
tomcat-default     89.92    5.92            0.02            4.14
tomcat-large       90.47    6.33            0.01            3.19
tomcat-small       89.00    5.28            0.03            5.70
xalan-default      94.10    1.54            0.00            4.37
xalan-large        94.11    1.52            0.00            4.37
xalan-small        93.99    1.65            0.00            4.36
mean               92.74    2.85            0.01            4.40

Table 3.3: Percentage of each record type in traces of the DaCapo benchmark suite

Benchmark   Shadow Heap (MB)   Program Heap (MB)   Shadow Heap/Program Heap
avrora      1.78               2.94                0.60
batik       9.03               13.06               0.69
eclipse     22.68              28.49               0.80
fop         5.90               16.56               0.36
jython      14.27              29.43               0.48
luindex     7.20               5.10                1.41
sunflow     3.83               9.94                0.37
tomcat      16.11              12.50               1.28
xalan       5.38               6.92                0.78
mean                                               0.73

Table 3.4: Shadow heap memory usage compared with the size of the heap.

Benchmark         % Escaping   Benchmark          % Escaping
avrora-default    83.53        luindex-default    54.14
avrora-large      79.39        luindex-small      46.25
avrora-small      87.41        lusearch-default   39.98
batik-default     63.79        lusearch-large     40.00
batik-large       62.97        lusearch-small     40.02
batik-small       62.22        pmd-default        53.68
eclipse-default   32.32        pmd-large          52.66
eclipse-large     41.97        pmd-small          51.78
eclipse-small     26.85        sunflow-default    68.63
fop-default       55.25        sunflow-large      68.49
fop-small         65.06        sunflow-small      68.38
h2-default        58.03        tomcat-default     25.44
h2-large          58.24        tomcat-large       21.87
h2-small          58.00        tomcat-small       32.44
h2-tiny           57.82        xalan-default      54.99
jython-default    42.13        xalan-large        55.16
jython-large      42.95        xalan-small        53.59
jython-small      68.02

Table 3.5: Percentage of objects escaping their allocating context in the DaCapo benchmark suite

Benchmark         Mean %   Benchmark          Mean %
avrora-default    41.22    luindex-small      76.57
avrora-large      44.38    lusearch-default   83.39
avrora-small      30.06    lusearch-large     79.63
batik-default     63.05    lusearch-small     83.37
batik-large       59.64    pmd-default        47.82
batik-small       75.08    pmd-large          34.98
fop-default       81.24    pmd-small          57.70
fop-small         64.54    sunflow-default    86.12
h2-default        86.37    sunflow-large      80.12
h2-large          88.09    sunflow-small      89.54
h2-small          86.78    tomcat-default     75.80
jython-default    68.79    tomcat-large       72.15
jython-large      74.03    tomcat-small       79.32
jython-small      74.15    xalan-default      71.48
luindex-default   78.45    xalan-large        71.55

Table 3.6: Mean percentage of objects that are born in the same context and die in the same context (over top 10 allocation contexts).

Chapter 4

Deferred Collector

The goal of the deferred collector is to avoid repeating work across the marking phases of two successive garbage collections. Recall from chapter 2 that each time a full heap garbage collection is performed (excluding nursery collections from consideration), the collector visits all live objects. However, if a section of the heap is unchanged, this represents repeated work. If, during its traversal, the collector could know it was entering a subgraph that had not changed since its last traversal, it could avoid this repeated work.

The deferred collector takes advantage of this by taking a hint from a programmer about a key object. The key object governs the lifetime of other objects, and the fact that it is alive indicates that other objects reachable from it are alive. In this chapter, we will describe how we can use the traces generated by Elephant Tracks to find key objects, the implementation of a collector that exploits this knowledge, and the results of using this collector on a program.

4.1 Finding Key Objects

We would like to reduce the workload of the garbage collector, and we now have a lot of data in hand with which to attempt it. But what should we look for in the data? One idea, proposed by Hayes (1991), is to find key objects. Key objects are those whose liveness governs the liveness of other objects in the heap; if the key object is alive, we can assume the governed objects are alive.

How can we discover key objects with the data available in an Elephant Tracks trace? In principle, programs can tie the lifetimes of objects together in arbitrary ways, and analyzing the program source to determine this would be difficult or impossible. However, since Elephant Tracks is a dynamic analysis tool, there is a runtime signal we can use: If an object A is reachable, then any object it refers to must also be reachable. Imagine an object A such that a large number of objects are reachable only via A (that is, A dominates them). As long as A is reachable, and the

relationship holds, we know that all the objects dominated by A must be alive.

Therefore, what we want to find are large groups of objects dominated by a relatively small set of objects. The dominating objects can potentially serve as our key objects, and we can avoid tracing the dominated objects. Since a generational collector already avoids tracing short-lived objects, we only want to consider objects that live a long time. We therefore only consider objects that live for longer than fifty percent of the program's lifetime; such objects will be visited by many garbage collections. We also are not interested in objects that dominate only a few other objects, so we will only consider groups of at least 10,000 objects (a threshold chosen because it is at least a few percent of the live size of all programs studied) dominated by some smaller set of objects.
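The dominance condition can be checked with two reachability passes, as in this sketch: objects dominated by a candidate c are exactly those reachable from the roots normally, but unreachable when c is removed from the graph. The adjacency-map graph encoding and class names are ours, for illustration; the real analysis runs over the shadow-heap snapshots reconstructed from the trace.

```java
import java.util.*;

class DominatedGroups {
    // Reachability over an adjacency map, optionally pretending one node
    // (`skip`) does not exist. Node ids are assumed non-negative; -1 means
    // "skip nothing".
    static Set<Integer> reach(Map<Integer, List<Integer>> graph,
                              Set<Integer> roots, int skip) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            int n = work.pop();
            if (n == skip || !seen.add(n)) continue;
            work.addAll(graph.getOrDefault(n, List.of()));
        }
        return seen;
    }

    // Objects reachable from the roots only through candidate c.
    static Set<Integer> dominatedBy(Map<Integer, List<Integer>> graph,
                                    Set<Integer> roots, int c) {
        Set<Integer> all = graph.isEmpty() ? new HashSet<>() : reach(graph, roots, -1);
        all.removeAll(reach(graph, roots, c));
        all.remove(c); // c itself is the dominator, not a dominated object
        return all;
    }
}
```

A per-candidate pass like this is quadratic in the worst case; a dominator-tree algorithm would be asymptotically better, but the two-pass formulation makes the definition of "dominated" transparent.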

We identified these objects by processing the traces from Elephant Tracks. The resulting data for the DaCapo benchmarks (Blackburn et al., 2006) are shown in figures 4.1a through 4.1m.

[Figure 4.1, panels (a) through (m), one per benchmark: avrora, batik, eclipse, fop, h2, jython, luindex, lusearch, pmd, sunflow, tomcat, tradesoap, xalan. Caption: The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time; the lower green dashed line shows the number of live objects that fall into clusters as described in section 4.1.]

In these figures, the blue line on top shows the number of objects live at a given time, and the green dashed line shows the number that fall into the previously described clusters. Although we have previously discussed the utility of measuring time in terms of methods instead of bytes allocated, for coarse-grained measurements such as this there is little difference. Measuring with method time would simply add more ticks to the X axis; therefore we use bytes allocated, to be more consistent with existing work measuring the live sizes of DaCapo benchmarks. Also note that because some benchmarks have live sizes that are orders of magnitude higher than others, and some allocate orders of magnitude more than others, the axes of these figures are not all on the same scale. We see that in


some benchmarks, such as pmd, few such objects exist, but in others, such as h2, many objects meet our criteria. However, even in pmd such objects account for roughly 100,000 of a heap of 700,000 objects, a significant potential for savings. The difficulty will be in producing a garbage collector that avoids tracing these objects while still collecting all (or at least most) of the dead objects. If too many dead objects go uncollected, the result will be additional garbage collections, eliminating any potential savings in time.

4.2 Deferred Collector Design

Many objects in the heap of a program do not change at all between two successive garbage collections. However, in most garbage collectors these objects will be repeatedly traced, despite the fact that the result of this tracing has already been computed and yielded no garbage. To avoid this repeated work, we have created the Deferred Collector, which relies on hints from a programmer to specify key objects. Intuitively, a key object is the root of some data structure that the programmer expects to change only infrequently. Objects within that data structure should not be traced on each garbage collection, but rather with some lower frequency.

In order to make this happen, we add two extra bits to the header of each object. The first bit, the key bit, is set on any of the key objects provided by the user. When an object has its mark bit set, we say it is marked. Analogously, when an object has the key bit set, we will say it is keyed. The second bit, called the deferred bit, is set on those objects we wish to skip when tracing the heap. We will refer to objects with the deferred bit set as deferred. The deferred bit acts something like a durable mark bit: when tracing the heap during the mark phase, the collector will not follow edges pointing to deferred objects.

The key objects are specified by the programmer, using an API consisting of a single method traceInfrequently. The mutator program gives a hint for a key object simply by passing the object as parameter to the traceInfrequently method. We discuss some alternative interfaces in section 4.3.1.

4.2.1 Defer All Objects Reachable from the Key

We modified the basic garbage collection phases as follows.

1. A memory threshold is reached, and the mutator is paused.

2. The collector sets the key bit on all key objects (given by programmer hint).

3. The Deferral phase begins. All objects reachable from the key objects are deferred.

4. The collector enumerates all roots (references on the stack, or in global variables, that refer to objects in the heap).

5. The Mark Phase Begins. All objects reachable via the roots on a path that

does not go through a keyed object or deferred object are traversed and marked.

6. The Sweep Phase begins. The collector linearly scans through memory, checking to see if objects are marked. All objects not marked, deferred, or keyed are reclaimed ("swept").

7. The mutator resumes.
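Steps 5 and 6 can be sketched as follows. The object representation is a toy stand-in for JikesRVM's header bits, and the sketch assumes (per step 3) that every object reachable from a key has already had its deferred bit set before marking begins.

```java
import java.util.*;

class DeferredMarkSweep {
    static class Obj {
        boolean marked, deferred, keyed;
        List<Obj> refs = new ArrayList<>();
    }

    // Step 5: mark everything reachable from the roots along paths that do
    // not pass through a keyed or deferred object.
    static void mark(Collection<Obj> roots) {
        Deque<Obj> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Obj o = work.pop();
            if (o.marked || o.deferred || o.keyed) continue; // do not enter
            o.marked = true;
            work.addAll(o.refs);
        }
    }

    // Step 6: reclaim everything that is neither marked, deferred, nor keyed.
    static List<Obj> sweep(List<Obj> heap) {
        List<Obj> live = new ArrayList<>();
        for (Obj o : heap) {
            if (o.marked || o.deferred || o.keyed) {
                o.marked = false; // clear the mark for the next collection
                live.add(o);
            } // else: reclaimed
        }
        return live;
    }
}
```

Note how the deferred bit, unlike the mark bit, is never cleared by sweep; that persistence is what saves the re-tracing on subsequent collections.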

This algorithm maintains the invariant that after the collection is complete, all reachable objects are either marked or deferred. Since the sweeper will never reclaim a marked or deferred object, this guarantees correctness. It does not, however, guarantee that all deferred objects are reachable, so it is possible that some unreachable objects are not collected.

Although we will need to visit all objects reachable from a key object on the first collection dealing with that key, since the deferred bit persists between collections, we will not need to visit them on subsequent collections.

Maintaining these invariants requires changes to the write barrier in the mutator. Suppose that the mutator installs a pointer from a deferred object A to a non-deferred object B. At the next garbage collection, we cannot guarantee that A is reachable, so we have to assume that it is. Therefore, we will have to treat any such object B as a key object in the next collection. This is crucial to correctness, since otherwise B may never be traced, and may be collected while still alive.
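The barrier condition just described might look as follows; the object model and names here are illustrative only, since the real barrier is woven into JikesRVM's existing generational write barrier.

```java
class DeferredWriteBarrier {
    static class Obj {
        boolean deferred, keyed;
        Obj field; // a single reference field, for simplicity
    }

    // Called on every pointer store, as a generational barrier already is.
    // If a deferred source gains a pointer to a non-deferred target, the
    // target must be treated as a key object at the next collection: the
    // source might be reachable, yet tracing will not pass through it.
    static void writeField(Obj src, Obj target) {
        if (src.deferred && target != null && !target.deferred) {
            target.keyed = true; // ensure target's subgraph is retained
        }
        src.field = target;
    }
}
```

Because the check piggybacks on a barrier that must run anyway, the extra mutator cost is one flag test per pointer store.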

4.2.2 Large Object Space

Large objects in JikesRVM are handled with the Baker Treadmill algorithm (Baker, 1992). As this requires special consideration in our deferred collector scheme, we will briefly describe it here, and then explain the modifications employed in the deferred collector.

Within the large object space, there are two data structures used for the Baker Treadmill algorithm: two doubly linked lists of object addresses, referred to as the to-space and from-space treadmills. When a garbage collection occurs, tracing from the roots begins as normal. When an object in the large object space is encountered, its linked list node is moved from the from-space treadmill to the to-space treadmill (note that the objects themselves are not moved in memory; the linked list node referring to them is instead moved from one treadmill to the other).

Then, after the marking phase, all reachable objects in the large object space are on the to-space treadmill, and any objects remaining on the from-space treadmill must be unreachable. The collector then iterates over the from-space treadmill, freeing those objects whose nodes have not been moved. Then, the collector switches the roles of the from-space and to-space for the next collection.

Thus far we have described the existing large object space implementation. However, a problem with this scheme in the context of the deferred collector is that the deferred collector is not guaranteed to visit each deferred object on every collection. If one of the deferred objects is in the large object space, it will therefore not be moved from the from-space treadmill to the to-space. In order to fix this, we slightly modify the sweeping phase. In our modified scheme, when the collector iterates through the from-space, it first checks whether the object is deferred, and if so puts it on the to-space treadmill.

4.3 Mitigating Bad Hints

If the programmer gives us a bad hint, while it is impossible to violate memory safety, it is possible to degrade performance. To see why, suppose the programmer selects a key object A that is short-lived, and that this object points to many other short-lived objects. In this case, if we naively trace all the objects reachable from A and defer them, A and the objects reachable from A will be kept resident in memory long past the end of their lives. This would result in less memory headroom available to the system, and more frequent garbage collections.

We have implemented several strategies to mitigate the damage that might be caused by a bad hint. The first is very simple: every k collections (of the mature space), the JVM does an immediate (that is, a normal, non-deferred) collection. We call the value k the immediate collection period. This guarantees that all dead objects will eventually be collected (if the program eventually triggers enough collections). The VM used has only two generations: the nursery and the mature space. Deferral is not used at all for nursery objects.

A second strategy exploits the fact that the deferred collector is built on top of a generational garbage collector (see chapter 2 for an overview). Recall that in a generational collector, new objects are allocated in a small, frequently collected space. If a key object supplied by the programmer is in the nursery, the collector does not immediately act on the hint. Rather, it waits until the key object survives a nursery collection. This means that a particularly bad hint may die and be collected in a nursery collection before it ever has the opportunity to cause harm.

Another strategy is to monitor mutations. Suppose that there is a pointer from a deferred object A to another object B. Then the mutator modifies this pointer so that it no longer refers to B. It is possible that in doing this, the mutator has caused B to become unreachable. However, the collector will not collect B, since its deferred bit will still be set. Fortunately, we can monitor such mutations in the mutator.

Since the deferred collector is implemented on top of a generational collector, there is already a write barrier in place for every pointer write. If the deferred collector observes more than a few (we use 10 as a default) such pointer updates, the next collection is an immediate collection (an immediate collection does not respect the deferred bit).
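Combining the two triggers, the decision of whether the next mature-space collection is immediate might be sketched as below. All class and method names are ours; only the period k and the default threshold of 10 come from the text.

```java
class ImmediateCollectionPolicy {
    private final int period;                       // "k" from the text
    private static final int UPDATE_THRESHOLD = 10; // default from the text
    private int collections = 0;
    private int riskyUpdates = 0; // pointer updates out of deferred objects

    ImmediateCollectionPolicy(int period) { this.period = period; }

    // Called from the write barrier when a deferred object's pointer changes.
    void noteDeferredPointerUpdate() { riskyUpdates++; }

    // Called at each mature-space collection; true means do an immediate
    // (non-deferred) collection, which does not respect the deferred bit.
    boolean nextCollectionIsImmediate() {
        collections++;
        boolean immediate = (collections % period == 0)
                || riskyUpdates > UPDATE_THRESHOLD;
        if (immediate) riskyUpdates = 0; // immediate collection resets state
        return immediate;
    }
}
```

Either trigger alone would suffice for eventual collection of dead deferred objects; together they bound both the time and the mutation count for which a bad hint can pin garbage.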

4.3.1 Application Programming Interface

The interface determines how the programmer can give hints to the garbage collector about which objects are key objects. One option would be to provide annotations that let the programmer statically annotate certain elements of the program. For example, all objects returned by an annotated method could be treated as key objects, or all objects of a particular type, or those produced by an annotated constructor.

However, whether an object is a good key object may depend on a run-time condition. For example, a certain data structure may only be a good candidate based on user input. Therefore, instead of a static annotation, we want to make it possible for hints to be executed conditionally, in the same way as any statement in the program. Thus, the interface we expose is a single method named traceInfrequently. Objects passed to traceInfrequently are hints to the garbage collector.
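A hypothetical use of the interface might look as follows. The traceInfrequently name comes from the text, but it exists only in our modified JikesRVM, so the call appears as a comment here; the surrounding class and the run-time condition are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows a hint guarded by a run-time condition,
// something a static annotation could not express.
class CacheBuilder {
    static List<String> buildCache(boolean longRunningSession) {
        List<String> cache = new ArrayList<>();
        cache.add("example-entry");   // stand-in for real population code
        if (longRunningSession) {
            // Only hint when we know, at run time, that this structure
            // will actually be long-lived.
            // VM.traceInfrequently(cache);   // deferred-collector hint
        }
        return cache;
    }
}
```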

4.4 JikesRVM

In this section, we discuss some of the technical challenges of implementing and debugging a garbage collector, specifically on JikesRVM. The Jikes Research Virtual Machine (JikesRVM) (Alpern et al., 2005a) is a self-hosting Java Virtual Machine, written in Java. We chose JikesRVM because we had group expertise at Tufts, and because its memory management system is modular by design. Although this modularity offers considerable advantages, the fact that JikesRVM is self-hosting makes it extremely challenging to debug and evaluate, for several reasons.

Firstly, any object allocated by JikesRVM resides in the same heap as the program being run. JikesRVM itself is large (just booting it allocates approximately 100,000 objects, depending on the exact configuration used), which can easily dominate memory usage when examining a smaller program.

Secondly, this makes it very difficult to allocate memory for JikesRVM's own use. For example, suppose a programmer wishes to maintain a list of machine words in the garbage collector. Since JikesRVM is written in Java, the most obvious choice would be to allocate an object of type "int[]" (an integer array). However, this object would be allocated in the very heap the VM is currently trying to collect.

This is not permitted, since it would violate the collector's invariants (new objects would appear while it tries to traverse all objects). Instead, one must use some of the low-level primitive operations JikesRVM exposes: grab a page of raw memory and write words into it. While arrays are fairly easy to handle in this manner, more complicated data structures can be quite tedious to implement. To implement a singly linked list, one would need a function that carves enough memory from a page to hold each node, and another that, given a memory address, computes the offset at which the "next" field resides. Since everything is of the primitive type Address (similar to a C program where everything has type void*), the effectiveness of static type checking is dramatically reduced. Nothing from the standard library is available, since it all allocates Java-level objects.
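To make the tedium concrete, the following sketch simulates the Address-based style in plain Java, with a long[] standing in for a raw page so the example can run outside JikesRVM. In JikesRVM itself one would use Address arithmetic on real memory rather than array indices, but the manual offset bookkeeping, and the loss of static typing for the "fields", are the same.

```java
// A singly linked list carved by hand out of a "page" of raw words.
// Each node occupies two words: [value, offsetOfNext]; -1 marks the end.
class RawList {
    static final int NODE_WORDS = 2;     // words per node
    final long[] page = new long[1024];  // stand-in for a raw memory page
    int bump = 0;                        // next free word in the page
    int head = -1;                       // offset of the first node

    /** Carve a node out of the page and link it at the head. */
    void push(long value) {
        int node = bump;
        bump += NODE_WORDS;
        page[node] = value;       // the "value" field is at offset 0
        page[node + 1] = head;    // the "next" field is at offset 1
        head = node;
    }

    /** Walk the list by following hand-computed offsets. */
    long sum() {
        long total = 0;
        for (int n = head; n != -1; n = (int) page[n + 1])
            total += page[n];
        return total;
    }
}
```

Note that nothing stops a caller from reading page[n + 1] as a value or page[n] as a pointer; this is precisely the loss of type checking described above.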

Another consequence of JikesRVM being written in Java is that traditional debuggers (such as gdb) are not very helpful. While they can be attached to the JikesRVM process, they are not aware of any of the semantics of the Java language, or of how those semantics relate to the actual layout of memory. So, for example, the programmer has no way of examining variables on the stack, since the location of the Java stack is not exposed to gdb.

At the same time, debuggers for the Java programming language are also unhelpful. This is because JikesRVM is actually written in a variant of Java that has been extended with low-level primitive types and operations, such as Addresses (essentially pointers). A debugger for Java has no idea how to make sense of these.

Since it is very difficult to use a debugger or to create auxiliary structures for debugging, this leaves two primary methods: copious amounts of logging, and copious amounts of assertion checking. Both techniques have similar drawbacks. Logging a lot of information dramatically slows down program execution, and the programmer will often discover, after a long run of the program, that he needs to log just one more piece of information to make sense of the output; at which point the entire run of the program must be repeated.

4.4.1 Sanity Checker

JikesRVM includes a sanity checker for garbage collectors. It is essentially a mark-sweep collector implemented as simply as possible, with no performance optimizations (it is, for example, single threaded). The sanity checker does not actually reclaim any memory; rather, it runs before the garbage collector and records which objects it found to be reachable. Then, after the garbage collector runs, its results are compared against the sanity checker's. Since the sanity checker is simple, it is assumed to be correct.
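The core idea can be sketched as follows. This is not JikesRVM's actual sanity checker: the heap is modeled here as an adjacency list over integer object IDs, and only the reachability computation and the final comparison are shown.

```java
import java.util.*;

// Model of the sanity-checking idea: a trivially simple marker computes the
// reachable set before the real collector runs, and the real collector's
// survivors are checked against it afterwards.
class SanityChecker {
    /** Single-threaded, unoptimized transitive closure from the roots. */
    static Set<Integer> reachable(Map<Integer, List<Integer>> heap,
                                  Collection<Integer> roots) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            int obj = work.pop();
            if (seen.add(obj))  // mark; enqueue children on first visit
                work.addAll(heap.getOrDefault(obj, List.of()));
        }
        return seen;
    }

    /** The real collector's survivors must match the sanity set exactly. */
    static boolean agrees(Set<Integer> sanity, Set<Integer> survivors) {
        return sanity.equals(survivors);
    }
}
```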

As the sanity checker already traverses the heap to check invariants, it is a good place to add assertions checking other invariants. For example, in the deferred collector we want to check the invariant that all objects reachable from a deferred object are themselves deferred. This is testable (although slowly) using simple modifications to the sanity checker.

There are two major downsides to using the sanity checker in this way. For one, it is very slow; it is a single-threaded collector, and must traverse the entire heap even on a nursery collection. It essentially adds an extra collection on top of each collection. The exact slowdown varies considerably depending on the parameters given to the collector, as well as the program being run. Using a 100 MB heap with a 4 MB nursery, on the generational mark-sweep collector, we observed a geometric mean slowdown of 14x on the DaCapo benchmarks.

Second, it can only check assertions at the time it is run. It gives little information about specifically when the assertion was violated, or what code might have caused the violation. This usually means one has to combine it with trace-based debugging techniques: acquiring some object of interest from the sanity checker, and working back through the trace to understand the history of that object. Combining the sanity checker, potentially expensive assertions, and tracing results in an execution that is very slow. The exact slowdown depends on what information one wants out of the trace, and how much engineering effort one is willing to put into speeding up the process. We typically observed slowdowns of several hundred times for the most intensive debugging traces.

4.5 Experimental Results for The Deferred Collector

First we present results on a synthetic benchmark, and then a real benchmark from the DaCapo suite.

4.5.1 Methodology

We observed the mark/cons ratio (the ratio between the number of marking operations performed and the number of allocations) as we increased the period of immediate collections. For the purposes of this experiment, we count setting the deferred bit as a "mark", since marking an object and deferring an object require the same amount of work (setting one bit in the object header). An immediate collection period of 1 means that every collection is immediate (effectively the same as an unmodified generational garbage collector). An immediate collection period greater than the number of garbage collections means that all collections are deferred.
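As a concrete illustration of the metric (this is not code from our experimental harness), the ratio reported in the figures below is computed as:

```java
// Deferred-bit sets count as marks, since both cost one header-bit write.
class MarkCons {
    static double ratio(long marks, long deferredBitSets, long allocations) {
        return (double) (marks + deferredBitSets) / allocations;
    }
}
```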

JikesRVM can alter the size of the nursery at run time. In order to avoid any interference from the nursery resizing algorithm, we fixed the nursery size at 4 MB.

Since the amount of allocation and marking is not deterministic in the DaCapo benchmarks, we used five runs of each configuration, and took the best of the five runs for each value of the immediate collection period.

4.5.2 Doubly Linked List

Our doubly linked list program is a simple example that illustrates how the deferred collector can be used. It allocates a long doubly linked list (100,000 nodes), which lives for the entire lifetime of the program. The program then churns through memory by allocating equally sized linked lists and disconnecting them from the roots, ensuring they will be collected when the garbage collector runs (that is, they have shorter lifetimes). The key object used is simply the head of the long-lived list.

This is a good case for the deferred collector, since it is able to save the effort of repeatedly tracing the long-lived list.
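The structure of the benchmark is sketched below. The list length comes from the text; the churn-round count is illustrative, and the traceInfrequently hint appears as a comment since it exists only in our modified VM.

```java
import java.util.LinkedList;

// Sketch of the synthetic benchmark: one long-lived doubly linked list plus
// repeated short-lived lists that immediately become garbage.
class ChurnBenchmark {
    static final int LIST_LENGTH = 100_000;  // from the text
    static final int CHURN_ROUNDS = 50;      // illustrative value

    static LinkedList<Integer> buildList(int length) {
        LinkedList<Integer> list = new LinkedList<>(); // doubly linked
        for (int i = 0; i < length; i++)
            list.add(i);
        return list;
    }

    public static void main(String[] args) {
        // The long-lived list survives the whole run; its head is the key
        // object handed to the deferred collector.
        LinkedList<Integer> longLived = buildList(LIST_LENGTH);
        // VM.traceInfrequently(longLived);   // deferred-collector hint

        for (int round = 0; round < CHURN_ROUNDS; round++) {
            LinkedList<Integer> shortLived = buildList(LIST_LENGTH);
            // shortLived becomes unreachable at the end of each iteration,
            // so it is garbage by the next collection.
        }
        System.out.println(longLived.size());
    }
}
```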

We see the resulting mark/cons ratio in Figure 4.2. When all collections are immediate (i.e., the immediate collection period is one), the mark/cons ratio is 0.053. It drops to 0.026 with an immediate collection period of 3. At the maximum meaningful immediate collection period of seven, the mark/cons ratio is 0.019. This indicates that the deferred collector eliminates about half of the marking work, which is the expected result.

Figure 4.2: mark/cons ratio for the doubly linked list program, run with a 16 MB heap, and differing values of immediate collection period.

4.5.3 sunflow

Sunflow is a ray tracer that is part of the DaCapo suite of benchmarks. In order to find the key objects, we ran sunflow through Elephant Tracks, analyzed the resulting traces, and found the allocation contexts of potential candidates. This is somewhat complicated by the fact that Elephant Tracks does not run on JikesRVM, the VM on which the deferred collector is implemented; some VM-specific features actually show up in the Java heap (such as choices in the class libraries). This means that finding the key objects cannot be entirely automated, and requires some manual inspection of code to determine the appropriate objects and points in the code.

After analyzing the traces, we found four static source code locations at which to give hints to the deferred collector:

1. In the constructor of TriangleMesh, we send the TriangleMesh itself to the deferred collector. The TriangleMesh is used to represent the scene being traced.

2. In the method render of BucketThread, we invoke traceInfrequently on each of several thread objects. These threads do the rendering.

3. In the constructor of RenderObjectMap. RenderObjectMap is a hash map that maps objects in the scene to a data structure containing information about their rendering.

4. In the constructor of Sunflow. This class maintains some data structures for managing the benchmark harness itself.

We found that sunflow requires a minimum heap size of 21 MB to run to completion on JikesRVM, and performed experiments with two heap size configurations: 1.25 times this minimum size, and 1.5 times the minimum size (27 MB and 32 MB, respectively). At the 1.25x heap size, the virtual machine performs only eight full-heap garbage collections, meaning there is not much garbage collector work to be saved beyond this. In the 1.5x configuration, the virtual machine runs the garbage collector only three times.

Figure 4.3 shows the resulting mark/cons ratio for the 1.25x (27 MB heap) configuration with a varying value for the immediate collection period. With an immediate collection period of 1 (meaning all collections are immediate), the mark/cons ratio is 0.031. With an immediate collection period of 3 (meaning every third collection is immediate), the mark/cons ratio decreases to 0.029 (a 6% decrease). Since only eight full-heap garbage collections are performed, when the immediate collection period is greater than eight, all collections are deferred; this results in a mark/cons ratio of 0.025 (a 20% reduction in mark/cons).

Figure 4.3: mark/cons ratio for sunflow, run with a 27 MB heap, and differing values of immediate collection period.

Figure 4.4 shows that as the immediate collection period increases, the mark/cons ratio decreases. As there are fewer garbage collections, there is less opportunity to save work. There is a sharp decline from 0.012 to 0.011 when the deferred collector is used with an immediate collection every three collections (a 6% decline in mark/cons). Since only three full-heap garbage collections are performed, when the immediate collection period is four or greater, all collections are deferred.

Figure 4.4: mark/cons ratio for sunflow, run with a 32 MB heap, and differing values of immediate collection period.

4.6 Related Work

There are two pieces of closely related work. The first is Cohen and Petrank's "Data Structure Aware Garbage Collector" (Cohen and Petrank, 2015). This scheme relies on the programmer annotating the methods of specific data structures to clarify which methods add nodes to a data structure, and which methods remove nodes. The nodes can then be marked in a single pass by copying from one side table to the mark side table, without traversal. This can give good performance, but is not profitable if removal is too frequent. So, if some instances of a data structure require frequent removal and others do not, two copies of that data structure's source code may be required. The Data Structure Aware garbage collector was able to reduce running time by 31% in the hsqldb benchmark, from the earlier DaCapo 2006 suite.

The second is the Clustered Collection (Cutler and Morris, 2015). The Clustered Collector operates in a way similar to the Deferred Collector. In this scheme, the garbage collector identifies clusters of objects using a heuristic, and tries to avoid tracing within those clusters as long as a certain head object is still reachable. If any mutations occur within a cluster, it is traced normally. The Clustered Collector was implemented on a Scheme system that runs a completely different set of programs than the JVM the Deferred Collector is implemented on, making a direct empirical comparison difficult. The only benchmark evaluated was the "Hacker News" application, a web content management system. The Clustered Collector was able to decrease garbage collector pause times by one-third, but reduced garbage collector throughput by 10% on this application.

Somewhat less closely related, Hirzel et al.'s Connectivity-Based Garbage Collection (Hirzel et al., 2003) organizes objects into separate spaces based on their connectivity, and tries to collect these spaces independently (similar to the way in which the nursery is collected in a generational collector). It employs heuristics based on the number of roots pointing into a given region, and attempts to collect the most profitable regions first based on this heuristic.

Detlefs et al.'s "Garbage-First Collector" (Detlefs et al., 2004) similarly breaks the heap down into regions. However, it has a concurrent mark phase (that is, the mark phase runs concurrently with the mutator). Although marking is concurrent, the Garbage-First collector does pause the mutator to collect a region, by copying marked objects elsewhere. The goal of its region selection is to meet a user-defined pause time target. They reported attaining their specified soft real-time goals over 98% of the time in many applications.

Chapter 5

Conclusion

In this thesis, we have presented a tool, Elephant Tracks, that lets us gather a great deal of information about the JVM, in particular the way it uses memory. We have shown how this tool could be used to build a garbage collector, the deferred garbage collector.

Elephant Tracks runs on any JVM that supports the JVMTI interface, is able to trace real programs, and provides much more information than any similar tool with a cost not much greater.

The deferred collector shows one possible way garbage collection could be changed based on object lifetime information. While there are many possibilities that could be explored, it represents one way to give the programmer some control of the collector without sacrificing correctness.

Bibliography

Agesen, O., Detlefs, D., and Moss, J. E. B. (1998). Garbage collection and local variable type-precision and liveness in Java virtual machines. In PLDI, pages 269–279.

Alpern, B., Augart, S., Blackburn, S. M., Butrico, M., Cocchi, A., Cheng, P., Dolby, J., Fink, S., Grove, D., Hind, M., et al. (2005a). The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal, 44(2):399–417.

Alpern, B., Augart, S., Blackburn, S. M., Butrico, M. A., Cocchi, A., Cheng, P., Dolby, J., Fink, S. J., Grove, D., Hind, M., McKinley, K. S., Mergen, M. F., Moss, J. E. B., Ngo, T. A., Sarkar, V., and Trapp, M. (2005b). The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal, 44(2):399–418.

Baker, H. G. (1992). The treadmill: Real-time garbage collection without motion sickness. SIGPLAN Not., 27(3):66–70.

Blackburn, S., Garner, R., McKinley, K. S., Diwan, A., Guyer, S. Z., Hosking, A., Moss, J. E. B., Stefanović, D., et al. (2006). The DaCapo benchmarks: Java benchmarking development and analysis. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM SIGPLAN Notices 41(10), pages 169–190, Portland, OR. ACM Press.

Bruneton, E., Lenglet, R., and Coupaye, T. (2002). ASM: A code manipulation tool to implement adaptable systems. In Adaptable and Extensible Component Systems, Grenoble, France. 12 pages.

Cohen, N. and Petrank, E. (2015). Data structure aware garbage collector.

Cutler, C. and Morris, R. (2015). Reducing pause times with clustered collection. In Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management, ISMM 2015, pages 131–142, New York, NY, USA. ACM.

Detlefs, D., Flood, C., Heller, S., and Printezis, T. (2004). Garbage-first garbage collection. In Proceedings of the 4th International Symposium on Memory Management, ISMM '04, pages 37–48, New York, NY, USA. ACM.

Foucar, J. (2006). A Platform for Research into Object-Level Trace Generation. PhD thesis, The University of New Mexico.

Guyer, S. Z., McKinley, K. S., and Frampton, D. (2006). Free-Me: A static analysis for automatic individual object reclamation. PLDI, ACM SIGPLAN Notices, 41(6):364–375.

Hayes, B. (1991). Using key object opportunism to collect old objects. In Conference Proceedings on Object-oriented Programming Systems, Languages, and Applications, OOPSLA '91, pages 33–46, New York, NY, USA. ACM.

Hertz, M. and Berger, E. D. (2004). Automatic vs. explicit memory management: Settling the performance debate.

Hertz, M., Blackburn, S. M., Moss, J. E. B., McKinley, K. S., and Stefanović, D. (2002). Error-free garbage collection traces: How to cheat and not get caught. SIGMETRICS Perform. Eval. Rev., 30:140–151.

Hertz, M., Blackburn, S. M., Moss, J. E. B., McKinley, K. S., and Stefanović, D. (2006). Generating object lifetime traces with Merlin. ACM Transactions on Programming Languages and Systems, 28(3):476–516.

Hirzel, M., Diwan, A., and Henkel, J. (2002). On the usefulness of type and liveness accuracy for garbage collection and leak detection. ACM Transactions on Programming Languages and Systems (TOPLAS), 24(6):593–624.

Hirzel, M., Diwan, A., and Hertz, M. (2003). Connectivity-based garbage collection. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 359–373.

Inoue, H., Stefanović, D., and Forrest, S. (2006). On the prediction of Java object lifetimes. IEEE Transactions on Computers, 55(7):880–892.

Jones, R. E. and Ryder, C. (2008). A study of Java object demographics. In Proceedings of the 7th International Symposium on Memory Management, pages 121–130. ACM.

Lambert, J. M. and Power, J. F. (2008). Platform independent timing of Java virtual machine bytecode instructions. Electronic Notes in Theoretical Computer Science, 220(3):97–113.

Röjemo, N. and Runciman, C. (1996). Lag, drag, void and use—heap profiling and space-efficient compilation revisited. In Proc. Intl. Conf. on Functional Programming. SIGPLAN Not., 31(6):34–41.

Sun Microsystems (2004). JVM Tool Interface. http://java.sun.com/javase/6/docs/platform/jvmti/jvmti.html.

Uhlig, R. A. and Mudge, T. N. (1997). Trace-driven memory simulation: A survey. ACM Computing Surveys (CSUR), 29(2):128–170.

Xu, G. (2013). Resurrector: A tunable object lifetime profiling technique for optimizing real-world programs. In OOPSLA'13, volume 48, pages 111–130. ACM.